30 / 34

HBase - Bloom Filter

Not able to play video? Try with vimeo

Transcript: In this session, we will try to understand bloom filter. Imagine, there was a watchman, who has to allow a guest if the name of the guest exists in a register. So, the watchman maintains a register with the names of people ordered alphabetically. Everything was going fine until the register became really huge and the number of guests visiting per second also increased tremendously. This resulted in too much time for query and thus huge wait time. So, he came up with an idea to keep only the first two letters of each name in a separate notebook. And have another watchman do the screen based on first two letters of the name. Since this new index containing only the first two letters is really small, therefore screening became really fast. This new index is known as bloom and the second watchman is called a Bloom Filter. If the bloom filter or the new watchmen didn't find the first two letters of a person's name in the index, it is guaranteed that the person is not allowed. But if the first two letters are found in the index, we need to do a thorough check because multiple people could have the same first two letters. Therefore, we person is sent to the first watchmen for actual checking. So, many persons are dropped by the new watchmen and thus the load on the first watchman has reduced tremendously.

Let's take another example. In the diagram we have a storage lookup - the real lookup function which can deterministically tell if a key exists in storage or not. We have a bloom filter placed before the actual lookup. Say a user requests for key1, the first request would go to bloom filter. The bloom filter would compute the hashcode of key1 and then check in the bloom if it exists or not. In this case, key1's hashcode doesn't. So, it is certain that the key1 does not exist even in the actual storage. So, there is no need of further queries.

In the case of key2, the bloom filter found that the hashcode of key2 exists. But it does not mean that the key2 certainly exists because the hashcode of other keys could be the same as key2. So, the actual query is done on the storage. And storage after querying found key2. So, our result is successful.

In case of key3, the bloom filter found it but the actual storage didn't. So, the answer is "NO".

You can notice that in case of key3 the lookup at storage is unnecessary while storage lookup for key2 is necessary. The advantages of the first case are higher compared to the disadvantages of dual lookups in the second and third cases.

Now let's try to understand the hashing function. In our example, we converted big names into small ones using the first two letters. This is called hashing. And since computers are pretty efficient with searching numbers, generally hashing converts a huge string into a fixed size number.

For example, we could convert a string into number as follows:

  • Sum up the ASCII value of all characters and calculate the remainder after dividing by 100.
  • This value will always be less than 100 no matter how big text is.
  • So, the bloom will at most contain 100 entries no matter how big each name is and how many names are there.

So, we generally choose a hashing function which gives us the resulting list or bloom that can fit into the memory.

Another question that we need to answer is: Does it make sense to have more than one bloom filter or chaining bloom filters?

The answer is no. If a bloom filter is slow, instead of chaining another bloom filter, we make the existing bloom filter faster by either making the bloom smaller or by increasing the system's memory.

A bloom is generally kept in the memory in the form of a bit vector. Here after applying the hashcode we got numbers 1, 5, 6, 10, 11 & 15. So, we created a bit vector or array of size 15 with all default values initialized to false. For each value that exists in the list, we set the bit to true at that location.

Now let's try to understand how bloom filters play a role in HBase. Bloom filters can be enabled on HBase tables per column family if you have humongous data to search. There are three kinds of bloom filters you can set:

  • None - which means no bloom filter.
  • ROW - the bloom is prepared based on the row key. The hash of the row key will be added to the bloom on each insert.
  • ROWCOL - the hash of the row key + column family + column family qualifier will be added to the bloom on each key insert.

To know more about the Bloom Filter, you read here:

BloomFilter Hbase API The general behavior of a filter, Space/Time Trade-Offs in Hash Coding with Allowable Errors, For the ability to add elements to a Bloom filter

Loading comments...