I’ve been considering a project idea for Seneca’s partnership with Creative Commons. For that idea to work I would need a tool to create perceptual hashes from images that:
- Give true positive results when comparing images that were resized, and/or their colours changed.
- Give very few (near zero percent) false positive results. Too many false positives (even 0.1%) would kill this idea.
I’ve spent the weekend working on the second part, because it’s the biggest unknown. I’ve compared pHash 0.9.6 and Blockhash 40bfc84f03. The comparison isn’t entirely fair because Blockhash supposedly has a better algorithm in the works but that is not yet published.
Test Data (sample images)
The categories I chose semi-randomly, but slightly biased towards what I thought were likely candidates to show perceptual hashing limitations: 3D_computer_graphics, Computer_generated_images, Computer_generated_particles, Digital_forensic_facial_reconstructions, Fractals, and HDR_images.
From that I ended up with 900 very different and very similar JPG files.
The open source, published pHash hasn’t been updated since 2013. The project isn’t abandoned but they’re not spending a lot of time on it. There is one github repo (sdepold’s) that’s attempted to collect random patches, but that’s explicitly not maintained and I didn’t even manage to build it.
One definite bug I ran into is described here. The fix reduced the number of hashes of zero I got for PNG images but there were still some left. So I decided to sidestep this bug by using the convert command to convert all the downloaded images to JPG before hashing them.
pHash test implementation
I had to write a simple C++ program based on their example that basically consists of this line:
Because the example was too complicated and useless to me out-of-the-box.
Then I wrote another C++ program just to compute the hamming distance, again basically one line:
ph_hamming_distance(ulong64 hash1, ulong64 hash2);
Which worked. Later I replaced it with a perl script to be able to handle 256bit hashes from Blockhash.
Those two combined with several shell scripts gave me an incredibly slow but good-enough-for-my-investigation process for running the tests on my 900 sample images.
Comparing the hash of every image to that of every other image and looking for hash hamming distances <= 10 I ended up with 63 matches. 31 of those looked like real false positives to me, the other 32 were actually the same image with no (or irrelevant) variations. That means out of 900 files there were 31 of which each one had at least one match in the set.
That’s a 3.4% false positive rate, which is completely unacceptable for what I need. So I reduced the maximum distance from 10 to 4. That reduced the number of false positives to 6, 4 of which are debatable. Tenuously I could say of the following there are only two images that are different (false positive rate of 0.2), but more testing would be needed to confirm that:
And yes, I was that picky when deciding whether there is a false positive hit.
Now to make sure that pHash works well for finding true positives I took this image:
I know it’s not a great example but it was the first in the list and I was just running a quick test.
The same algorithm matched all of them with a distance <=6, and all but the 64 pixel version with a distance <= 4. For me false positives are significantly worse than false negatives, I can live with a size limitation if that’s all it is.
In short: it appears that pHash may be able to do what I want. But I need to do more testing to confirm that.
Blockhash test implementation
I didn’t need to write any code for the hashing part, the included build/blockhash did exactly what I need.
I still had to write some shell scripts to make it and the hamming distance calculator work on my image set.
I left the default hash size at 16^2 (256 bits), that’s already 4 times the pHash hash size.
I ran essentially the same test on the same test data with Blockhash as pHash. With distance <=10 I got 53 matches, of which 29 looked like real false positives.
I was about to test with distance <= 4 when I thought I should do the true positive test first, to make sure it’s similar to pHash. It wasn’t.
In face for my 7 resized versions of one image I got distances as large as 27, and the majority over 10. That means with Blockhash taking the false positive rate down destroys the true positive rate, which makes it useless to me. I did not pursue further testing, but perhaps I’ll try again if the Blockhash algorithm is updated.
It appears to be possible to get an extremely low false positive rate with pHash while keeping a very good true positive rate.
Perceptual image hashes are a complex business. If you try to come up with an algorithm in your head – it will suck. Which makes me worried about whether it’s a solved problem or I would need to solve it myself.
If I want to pursue this idea further I will need to run more tests on more images. Testing like this takes time. My hacky test system takes literally hours to run through 900 images, and it can be optimized a lot but the optimization process will take days/weeks. I will post followup results if they will exist.