A bloom filter is a data-structure that can be used to check if a set contains an element. It uses way less memory than a conventional set data-structure by sacrificing accuracy.
Say we are building a log-structured merge-tree, we can use a bloom filter to find out if the LSM-tree contains a particular key in O(1) time in most cases, the downside is that sometimes the bloom filter would say that the LSM-tree contains a key, but it actually does not and we would go searching for the value that's mapped to the key and never actually find it.
A bloom filter is just a bit-set that uses n
deterministic hash functions to add elements to it.
empty bit-set
To add the key bob
to the set, we run the key through each of the n
hash functions and map the hash function output to one of the positions in the bit-set and for each position, we flip the bit to 1
.
bit-set after bob was added to the bloom filter
To find out if the set contains the key bob
, we run the key through each of the n
hash functions again -- since the hash functions must be deterministic they will always map to the same position in the bit-set -- and check if the bit is set to 1
for each of the bit-set positions we reached after running the key through the hash functions. If every hash function maps to a bit set to 1
, it means the key is in the set.
bob is in the set because every hash function mapped it to a bit set to 1
alice is not in the set because not every hash function mapped to a bit set to 1
Since collisions can happen some keys will be mapped to bits that were set to 1
when other keys were added to the set. In this case, the bloom filter will say that it contains the key even though it does not.
the bit-set after bob was added to it
since john maps to the same bits as bob and the bits were set to 1 after bob was added to the set, we got a false positive
Network Applications of Bloom Filters: A Survey - https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf