I have a very large database of JPEG images, about 2 million. I would like to do a fuzzy search for duplicates among those images. Two images count as duplicates when many of their pixels (around half) have identical values and the rest are off by about +/- 3 in their R/G/B values. The images look identical to the naked eye; it's the kind of difference you'd get from re-compressing a JPEG.
I already have a foolproof way to detect whether two images are identical: I sum the per-pixel brightness deltas over the whole image and compare the total to a threshold. This method has proven 100% accurate, but comparing 1 photo against 2 million is incredibly slow (hours per photo).
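For reference, the exact check is roughly like this (a minimal sketch; Pillow/NumPy, the grayscale conversion, and the threshold value are placeholders, not my real code):

```python
import numpy as np
from PIL import Image

# Rough sketch of the exact check: sum of per-pixel brightness
# differences compared against a threshold.
def images_identical(path_a, path_b, threshold=50_000):
    a = np.asarray(Image.open(path_a).convert("L"), dtype=np.int32)
    b = np.asarray(Image.open(path_b).convert("L"), dtype=np.int32)
    if a.shape != b.shape:
        return False  # different dimensions: not duplicates
    return int(np.abs(a - b).sum()) <= threshold
```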
I would like to fingerprint the images so that I could just compare the fingerprints in a hash table. Even if I could only reliably whittle the candidates down to about 100 images, comparing 1 against 100 would put me in great shape. What would be a good algorithm for this?
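To make the goal concrete, the naive kind of fingerprint I have in mind looks something like the sketch below (the 16x16 thumbnail, 16-level quantization, and `all_image_paths` are arbitrary placeholders), though I suspect plain quantization splits near-duplicates that land on opposite sides of a bucket boundary:

```python
from collections import defaultdict

import numpy as np
from PIL import Image

# Hypothetical fingerprint: shrink to a tiny grayscale thumbnail and quantize
# each pixel coarsely so that +/-3 noise usually maps to the same value.
def fingerprint(path, size=(16, 16), levels=16):
    img = Image.open(path).convert("L").resize(size, Image.BILINEAR)
    quantized = (np.asarray(img) // (256 // levels)).astype(np.uint8)
    return quantized.tobytes()  # hashable key for a dict

# Bucket all images by fingerprint, then run the exact check only
# within each bucket instead of against all 2 million images.
buckets = defaultdict(list)
for path in all_image_paths:  # all_image_paths assumed to exist
    buckets[fingerprint(path)].append(path)
```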