Hamming Weight Trees

How do you compare two images for similarity? One way is by hashing them using something like JImageHash. Libraries such as this reduce an image to a much smaller binary hash. The idea is that when hashing two images which look similar (but aren’t identical), the two hashes will be very similar. It’s possible then to get a ‘similarity score’ by finding how many bits in the two binary hashes are different.

For example, for simplicity let’s assume we’re generating four-bit hashes. In reality they tend to be 32 or 64 bits, but 4 bits is easier to demonstrate. Let’s say we hash two images and they both result in the hash 0110. All four bits of both hashes are identical, so we could say the difference is zero. This difference between two binary values is called the Hamming Distance.

If the Hamming Distance is 0, then the images are considered maximally similar, at least within the confines of our hashing resolution.

Another example - let’s say the two hashes are 0110 and 0111. In this case, one bit is different so we’d say the Hamming Distance is 1.

When comparing two images by their hashes, we could define an upper bound for the Hamming Distance in order to decide they are ‘similar enough’. We’d decide on this upper bound by looking at examples from our problem domain, and deciding on the best trade-off between a search which is too broad (potentially allowing false positives) or too narrow (excluding images which qualify as ‘similar’ in our domain).

This approach works well when we have two images and we want to decide if they are ‘similar’. We hash both images, calculate the Hamming Distance, then decide if the HD is within our upper bound. If it is, then the images are ‘similar’.

However, a more interesting use-case is nearest-neighbour search. Let’s say we have a database of images, potentially millions, and we want to find images which fall within a Hamming Distance of 1 compared to a candidate image. This is a harder problem to solve as for example, there are many different ways two binary values can differ by 1 bit.

For example, if our candidate image hashes to 0110, and we want to find all images with a HD of 1, then this would allow a large number of possibilities: 0111, 0100, and so on. How do we build an ‘index’ to rapidly find all matching image hashes within a particular Hamming Distance?

A paper was published in 2016 which gives an approach relying on Hamming Weight Tree data structure. The idea is based on two premises: (i) this problem is much easier to solve by representing the domain of hashes by their Hamming Weight, and (ii) searching for hashes within a given Hamming Distance is equivalent to searching for hashes whose Hamming Weight is within the same distance.

This solution dramatically reduces the search space, as we are able to compare hashes which fall within a range of Hamming Weights within the Hamming Distance we are interested in.

The Hamming Weight of a binary value is the number of bits set to 1. For example 0110 has a HW of 2.

So by following the approach given in the paper, we would build a tree of Hamming Weights, and put each of our hashes in the search domain into the appropriate node representing its weight. Then when searching within Hamming Distance 1, we would search across nodes with Hamming Weight is within 1 from the HW of our candidate hash.

Let’s assume we build a HW tree with only one level. For a four bit hash, this would have five nodes, as there are five possible weights - from 0 with no bits set, up to 4 with all bits set. So our example hash of 0110 has a weight of 2, and it would belong in the node with HW 2.

Unfortunately our node for HW 2 also includes hashes which are dissimilar to our candidate hash. For example, 1001 also has a HW of 2 but all the bits are different. So even though the HW is the same for 0110 and 1001, the Hamming Distance is the maximal value of 4.

To further reduce the search space, we can create another layer of nodes by breaking the four bits into two two-bit partitions. For HW of two, we would create child nodes for each combination of the two halves …