My client sends millions of files, and my program needs to say something like:

"Hey there, you've already done your job for some of these files and they have never changed. Do not repeat the job for them; only repeat it for the changed files."

My code looks like the block below:
# Each dict maps a file's key (e.g. its path) to a hash of its content.
newList = dict(getListofFilesFromMyClient())
oldList = dict(getListofFilesFromHistory())

for keyValue, hashValue in newList.items():
    if keyValue not in oldList:
        # This file is a new friend:
        # calculate and record its hash, then do the heavy job.
        ...
    elif hashValue == oldList[keyValue]:
        # An old friend that has never changed:
        # do not repeat the heavy job.
        ...
    else:
        # An old friend that has changed:
        # repeat the heavy job, re-calculate the hash, and record it.
        ...
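For reference, here is a minimal sketch of how each file's hash could be computed with the standard library; the choice of blake2b and the 1 MiB chunk size are just my placeholders for whatever turns out to be fastest:

import hashlib

def hash_file(path, algorithm="blake2b", chunk_size=1 << 20):
    # Stream the file in chunks so a few-megabyte file never has
    # to be held in memory all at once.
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()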
An identical hash value from two different files is not my concern,
because the collision probability between two files should be far below 0.1% (for a 128-bit hash it is on the order of 2^-128), right?
My only concern is the throughput of calculating the hash of a file a few megabytes in size.
Which algorithm is the most suitable in this situation?
Any advice would be appreciated.
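In case it helps, here is a rough benchmark sketch comparing candidates from Python's standard hashlib on an in-memory buffer; the algorithm list and the 8 MiB buffer size are my assumptions, and a third-party non-cryptographic hash such as xxHash would plug into the same loop:

import hashlib
import os
import time

data = os.urandom(8 * 1024 * 1024)  # an 8 MiB buffer standing in for one file

for name in ("md5", "sha1", "sha256", "blake2b"):
    start = time.perf_counter()
    hashlib.new(name, data).hexdigest()
    elapsed = time.perf_counter() - start
    print(f"{name:8s} {len(data) / elapsed / 1e6:8.1f} MB/s")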
question from:
https://stackoverflow.com/questions/65517666/what-is-the-fastest-hash-algorithm-for-only-two-files