Bloom Filter
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. You basically compare 2 hashed codes, because of the nature of hash function you can tell an element is either definitely not in the set or may be in the set.Invertible Bloom Filter (IBF): Let Magic Happen
For the purpose of synchronisation, we want to know which elements were added and which ones were removed. Replacing a set of hashed values by a more complete data structure we can achieve a clever filter. Instead of storing hashed values, we can use Bucket structure adding information like number of items, key of the item, and hashed value of the item. Using the magic XOR operator with its reversibility nature. We can xor a sum of element together.xor 2 elements together, xor again and you get our initial element!
Hashing into buckets
In a bucket, the number of items is increased by one, each time we add an item in the bucket. The key is xored with the other keys. A hash function is used to hash the key into an hashed value, itself xored to the hashed sum.
With this same technique, you can spread hashed items into different buckets increasing odds to compare them. The bucket distribution should be uniform and reproducible on both sides (local client and remote server in our exemple).
Using recursive approach, incrementing bucket numbers each time, we build buckets set locally and remotely. With 2 sets, we compute the difference subtracting number of items. In the difference, if the count of elements is 1, we know one element was added (similarly if the count is -1, one element was removed). From the key element, we can hash it, xor it to the hash sum bucket field and get the hashed sum without it (because of the xor reversibility).
Compute difference
The whole purpose of the algorithm is to get a difference with number of item -/+1. You can iterate until you get to that point increasing by 2^level.
How to treat difference
Let's say, I have an item "green circle" that was hashed and spread into 3 different buckets (3 being the default number of hashed functions used for distributing items into buckets). With this difference set, I'm looking for item number equal to 1. I know the green circle was added, so I can removed it from the buckets index 1, 4 and 6 and add it to the set of added items. In index 2, the number of items is 2, so we carry on to index 3. I can remove blue circles from index 3, 4 and 5. etc... At the end of iteration, if I get an empty difference set, I've retrieved of differences. If the set is not empty, I don't have enough information so I need to increase numbers of buckets.
Implementation
Using "What’s the Difference? Efficient Set Reconciliation without Prior Context" paper, 3musket33rs libraires implements the IBF algorithm in Java, JavaScript and iOS.The different players: Resolver is responsible of computing difference, increasing number of bucket (level) as needed. Set of buckets are called Summary. Summary comparaison produces Difference which hold removed and added set. BucketSelector produces a set of hash functions to uniformly distribute items with buckets. Bucket is the data structure to hold number of items, xored key and xored hashed value.
Want to see it in action? See my next blog post for an example involving JavaScript backend and iOS client.
Tweet