See the wikipedia article for more information
A Bloom Filter is a probabilistic data structure.
A humorous way to think of it is as a really bad set. It's a set that doesn't actually store its elements (you have to store them elsewhere). Also, it returns false positives on membership checks. Also, you can't remove elements from it.
That being said, there are upsides to bloom filters. They are comparatively small data structures, and adding and checking members is incredibly fast.
Bloom filter usage example: CDNs
CDNs use bloom filters to determine what web pages to cache. Caching involves expensive write-to-disk processing and storage. CDNs want to avoid wasting time and space on rarely fetched web pages, which are often very large text objects. You can use a bloom filter here as a gate-keeper.
Every time a page is visited, you add it to the bloom filter. Checking the bloom filter for membership then becomes a check for "probably previously visited page". Then, you only cache (i.e., run the expensive storage process) pages that have been visited more than once.
function cache(page, bloomFilter, storage) {
if (bloomFilter.isMember(page)) {
storage.set(page)
} else {
bloomFilter.addMember(page)
}
}
The false positives are worth the trade-off here. Actually storing the pages in some data structure is both the time and space cost trying to be avoided to begin with! Even just storing a hash of the pages would practically be much larger than the bit array of the bloom filter.
Also, adding members and checking members are constant time (O(k) where k is the number of hash functions), and so incredibly fast. Constant-time lookup is theoretically true of hash tables as well, but practically this is not achievable, and bloom filters are faster (1 2).
Explanation of the algorithm
A bloom filter's memory representation is merely a bit array.
On creation, each bit is set to 0
.
A collection of hash functions are run per element added. The result of each hash function is a location in the bit array, which then gets flipped to 1
(or stays at 1
if previously flipped). For example, a bloom filter with five hash functions would flip up to five bits per added element (up to five because some bits might have been flipped previously).
Membership checks are performed by running those same hash functions on the element in question, and checking that all the bits are set to 1
. If they are, the element is 'probably a member', otherwise it is 'definitely not a member'.
False positives on membership checks result from the fact that it is not clear which elements have flipped the bits in question. It is possible that an element has not been added, but previously added elements have incidentally flipped all the bits it would've flipped had it been added. A membership check on this element would then return a false positive.
This also explains why you can't remove elements from a bloom filter. You can't "just" flip the relevant bits back to 0
, since other elements could've flipped any of them to 1
as well.