In this series:

I’ll now describe the components making the registry possible. All the source code for this is open source and available from GitHub.

CC_Registry_subsystems_overview

Data Source

When this goes into production – the images need to be hashed at the source. The bandwidth necessary to transfer all newly created images to a central server for registration would cost too much, there are literally millions of new images uploaded a day.

To accomplish that – one or more partnerships need to be set up. Either directly with new content hosts (Wikimedia Commons, Flickr, etc) or with an intermediary such as the Internet Archive.

Hashes

We’ve run quite a bit of testing on different perceptual hashing mechanisms:

The result was obvious – the DCT algorithm from pHash won. What it does is create a 64bit hash for an image of any original size and complexity. I was worried about too many false positives but the results were very good, in the range of under 1%.

It’s not the fastest algorithm but it only takes a couple of seconds to compute a hash and because of its small size – it can be stored in RAM for fast searches.

Database

The data is persisted in an off-the-shelf MySQL database. I was worried that querying from a hundred million records would tax the system too much but a simple select by primary key takes no time at all so we didn’t investigate other options.

The persistent database stores all the metadata, but not the thumbnails. Those take too much space, so each of those is stored as a regular file on an XFS filesystem on a rotating drive. The path to the thumbnail is stored in the database.

Hash Matcher

This part needs speed. For each query we need to run an operation against every record in the database. Using any existing technology for that would have been too slow so we developed our own.

The hash matcher loads pairs of (DBprimaryKey, hash) into RAM as a singly-linked list. That’s 24 bytes per record. I made sure that a server with 32GB of ram would have enough memory for 100 million records and it does, that’s way more than enough.

The matching operation is performed using a single-instruction POPCNT operation via the GCC __builtin_popcount(). One of my biggest worries was the time it will take per query, but it’s very fast, under one second against a hundred million records on a quad-core Xeon E5-1603 (inexpensive workstation CPU).

Internal API

The hash matcher is not externally accessible, it’s used on localhost via a unix domain socket. See daemon/readme.md and daemon/interface.md for usage info.

External API

The service is accessible by the world via a web API. This is implemented in PHP and is documented here.

Client Software

We built a demo page (that’s also written in PHP but is separate from the API) to show that this system works. It has over a million images (that’s what we had the bandwidth to download) from Wikimedia and Flickr.