Bloom Filter

See the wikipedia article for more information

A Bloom Filter is a probabilistic data structure.

A humorous way to think of it is as a really bad set. It's a set that doesn't actually store its elements (you have to store them elsewhere). Also, it returns false positives on membership checks. Also, you can't remove elements from it, not only because it doesn't store elements, but also because it's representation of membership cannot be undone.

That being said, there are upsides to bloom filters. They can be much smaller than the data sets they represent, and adding and checking membership is incredibly fast.

Bloom filter usage examples

CDNs

CDNs use bloom filters to determine what web pages to cache. Caching involves expensive write-to-disk processing and storage. CDNs want to avoid wasting time and space on rarely fetched web pages, which are often very large text objects. You can use a bloom filter here as a gate-keeper.

Every time a page is visited, you add it to the bloom filter. Checking the bloom filter for membership then becomes a check for "probably previously visited page". Then, you only cache (i.e., run the expensive storage process) pages that have been visited more than once.

function cache(page, bloomFilter, storage) {
  if (bloomFilter.isMember(page)) {
    storage.set(page);
  } else {
    bloomFilter.addMember(page);
  }
}

The false positives are worth the trade-off here. Actually storing the pages in some data structure is both the time and space cost trying to be avoided to begin with! Even just storing a hash of the pages would practically be much larger than the bit array of the bloom filter.

Also, adding members and checking members are constant time (O(k) where k is the number of hash functions), and so incredibly fast. Constant-time lookup is theoretically true of hash tables as well, but practically this is not achievable, and bloom filters are faster (1 2).

Synchronizing hash graphs in Automerge

Automerge is a CRDT tool that uses Bloom filters for syncing between clients. Each client has a DAG of nodes, and potentially a common ancestor. The goal is to efficiently determine that common ancestor, and then to transmit to each other the remaining, unshared nodes. Client A transmits a serialized bloom filter to client B. Client B then uses the filter to determine which nodes are missing from client A and then sends them. Client A can then be informed which client B needs, and send them. In Martin Kleppmann's blog post on bloom filter hash graph syncing he mentions truncated hashes as an alternative approach that would hopefully be similarly efficient, but finds that bloom filters perform much better as the problem scales.

Explanation of the algorithm

A basic bloom filter's memory representation is merely a bit array. (There are other types of bloom filters, but they are principally as follows.)

On creation, each bit is set to 0.

A collection of hash functions are run per element added. The result of each hash function is a location in the bit array, which then gets flipped to 1 (or stays at 1 if previously flipped). For example, a bloom filter with five hash functions would flip up to five bits per added element (up to five because some bits might have been flipped previously).

Membership checks are performed by running those same hash functions on the element in question, and checking that all the bits are set to 1. If they are, the element is 'probably a member', otherwise it is 'definitely not a member'.

False positives on membership checks result from the fact that it is not clear which elements have flipped the bits in question. It is possible that an element has not been added, but previously added elements have incidentally flipped all the bits it would've flipped had it been added. A membership check on this element would then return a false positive.

This also explains why you can't remove elements from a bloom filter. You can't "just" flip the relevant bits back to 0, since other elements could've flipped any of them to 1 as well.