Adv Spark Programming - Custom Partitioner

Custom Partitioner:

To create a custom partitioner, you define a class that extends Partitioner. In the first line of the code, we import this Partitioner class.

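Our partitioner's constructor takes an integer named numPartitions. The getPartition logic never reads this value, but it still matters: it supplies the numPartitions member that Partitioner requires, and Spark uses it to decide how many partitions the resulting RDD has.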

This class should override three methods:

First, getPartition, which takes a key as its argument and returns an integer from 0 to numPartitions - 1. This method is the one that dictates which partition a key goes to.

The second method, equals, is used for comparing two partitioners, and the third, hashCode, computes a hash code for the partitioner. We don't call either of them directly in this example; Spark itself calls equals to check whether two RDDs are partitioned the same way.
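The class described above can be sketched roughly as follows. This is a minimal sketch and not necessarily the exact code from the slides: the class name TwoPartsPartitioner, the require check, and the handling of non-string or empty keys are assumptions made for illustration.

```scala
import org.apache.spark.Partitioner

// Hypothetical class name. The constructor argument implements the
// numPartitions member that Partitioner requires.
class TwoPartsPartitioner(override val numPartitions: Int) extends Partitioner {
  // Assumption: fail early if partition index 1 could not exist.
  require(numPartitions >= 2, "this partitioner needs at least two partitions")

  // Keys whose first letter comes after 'j' go to partition 1,
  // everything else (including non-string or empty keys) goes to partition 0.
  def getPartition(key: Any): Int = key match {
    case s: String if s.nonEmpty && s.head.toUpper > 'J' => 1
    case _ => 0
  }

  // Two instances are interchangeable when they report the same partition count.
  override def equals(other: Any): Boolean = other match {
    case p: TwoPartsPartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  // Kept consistent with equals: equal partitioners share a hash code.
  override def hashCode: Int = numPartitions
}
```

Note that getPartition compares the upper-cased first character, so "jude" and "Jude" land in the same partition.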

Now, let us use this partitioner.

First, we create an RDD x of key-value pairs by calling parallelize on an Array. We pass 3 as the second argument to parallelize, which creates the RDD x with three partitions.

Let us convert the entire RDD into an array of arrays, using glom(), such that each partition becomes one array element. We can then use collect() to fetch and print the whole array of arrays. This helps us see very clearly which records sit in which partition.
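These two steps might look like the sketch below, written for pasting into spark-shell. The sample names and counts are made up for illustration, and the local master and app name are assumptions that only matter when no SparkContext is running yet.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell a SparkContext already exists and getOrCreate returns it;
// the local[2] master and the app name are assumptions for a standalone run.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("glom-demo"))

// Hypothetical sample data: (name, count) pairs spread over 3 partitions.
val x = sc.parallelize(
  Array(("sandeep", 1), ("giri", 1), ("abhishek", 1), ("sravani", 1), ("jude", 1)), 3)

println(x.getNumPartitions)  // 3

// glom() turns every partition into one array;
// collect() brings the resulting array of arrays back to the driver.
val parts = x.glom().collect()
parts.zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: " + part.mkString(", "))
}
```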

Next, we call the partitionBy method on our RDD with a new instance of the partitioner class we just created, and assign the result to a new RDD y.

Let's use glom again to check the structure of our new RDD y.

You can see that this new RDD has two partitions. The first partition contains all the keys starting with the letters a to j, and the second partition contains the keys whose first character comes after j.
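Putting the steps together, a run could look like the following sketch for spark-shell. It repeats a hypothetical TwoPartsPartitioner so that it runs on its own; the class name, the sample data, and the local master setting are all assumptions, not the exact code from the slides.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: first letter a to j -> partition 0, after j -> partition 1.
class TwoPartsPartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions >= 2, "this partitioner needs at least two partitions")
  def getPartition(key: Any): Int = key match {
    case s: String if s.nonEmpty && s.head.toUpper > 'J' => 1
    case _ => 0
  }
  override def equals(other: Any): Boolean = other match {
    case p: TwoPartsPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}

// Reuses the spark-shell context if there is one; local[2] is an assumption otherwise.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("custom-partitioner-demo"))

// Hypothetical sample data in 3 partitions.
val x = sc.parallelize(
  Array(("sandeep", 1), ("giri", 1), ("abhishek", 1), ("sravani", 1), ("jude", 1)), 3)

// Shuffle the records into the 2 partitions chosen by the custom partitioner.
val y = x.partitionBy(new TwoPartsPartitioner(2))
println(y.getNumPartitions)  // 2

// Keys are sorted within each partition only to make the printout stable.
val keysByPartition = y.glom().collect().map(_.map(_._1).sorted.toList)
keysByPartition.zipWithIndex.foreach { case (keys, i) =>
  println(s"partition $i: " + keys.mkString(", "))
}
// partition 0: abhishek, giri, jude
// partition 1: sandeep, sravani
```

Passing a larger number to the constructor would still work, but the extra partitions would stay empty because getPartition only ever returns 0 or 1.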

You can see that we were able to successfully use our newly created partitioner.