Perceptual hashing is a technique to compute the fingerprint of an image and save it as a 'hash'. The binary hash is just a string of 1s and 0s and is much smaller than the original image. Adding a watermark, cropping, changing the colors or resizing a picture will generally only change its hash by a few bits.
Here are examples of well-known perceptual hashing functions, with different resolutions (8B, 32B and 128B):
To check how similar two images hashes are, we need to compute the Hamming distance (or Hamming weight), which is the number of bits that are different in each hash.
For a 64-bit hash, an hamming distance of 0 indicates a very strong match, while values higher than 10 usually suggests that the images are significantly different.
On a processor with the POPCNT instruction (SSE4), one can calculate the Hamming distance in four instructions per 64-bit chunk. The domain-specific PostgreSQL module used in this project does not have any boundary checking or loops, and is essentially as fast as a typical sequential scan.
hash1 = 10110110
hash2 = 10010101
mov rax, [hash1]
xor rax, [hash2] ; rax = hash1 XOR hash2 = 00100011
popcnt rax, rax ; rax = popcount(00100011) = 3
A number of Web crawler scripts send items to a message queue, then a 'hasher' daemon receives the items, fetches the picture, computes the hashes for a dozen different hash configs. Then the metadata is saved alongside the binary hashes in the PostgreSQL database. The simple web app lets the user upload a picture and (asynchronously) query the whole database.
Below is an overview of the components involved in a single query.
Examples of projects that can be used to scrape images in real time.