
Advanced Spark Programming with Python


Adv Spark Programming - Custom Partitioner (Python)

Custom Partitioner

To create your own custom partitioner, you need to define a function that takes a key and returns an integer. Spark uses this number, modulo the number of partitions, to decide which partition each record goes to.

def TwoPartsPartitioner(el):
    # Keys whose first letter comes after 'J' go to partition 1;
    # all other keys go to partition 0.
    if el[0].upper() > 'J':
        return 1
    else:
        return 0
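
As a quick sanity check (an illustration, not part of the original lesson), we can call the function directly on a few sample keys, since partitionBy passes each record's key to it:

TwoPartsPartitioner("sandeep")   # 'S' > 'J', so returns 1
TwoPartsPartitioner("giri")      # 'G' > 'J' is False, so returns 0
TwoPartsPartitioner("jude")      # 'J' > 'J' is False, so returns 0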

Now, let us use this partitioner.

First, we create an RDD x of key-value pairs by calling parallelize on a list of tuples. We pass 3 as the second argument to parallelize, which creates the RDD x with three partitions.

x = sc.parallelize([("sandeep",1),("giri",1),("abhishek",1),("sravani",1),("jude",1)], 3)

Let us convert the entire RDD into a list of lists such that each partition becomes one inner list. The glom() transformation does exactly this, and collect() brings the result back to the driver so we can print it. This makes the partition boundaries easy to see.

x.glom().collect()
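
The exact grouping depends on how parallelize slices the list, but with five elements spread over three partitions you would typically see something like:

[[('sandeep', 1)], [('giri', 1), ('abhishek', 1)], [('sravani', 1), ('jude', 1)]]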

Next, we run the partitionBy method on our RDD, passing 2 as the number of partitions along with the partitioner function we just created. Spark sends each record to partition partitionFunc(key) % numPartitions, so the 0 and 1 our function returns map directly onto the two partitions. We assign the result to a new RDD y.

y = x.partitionBy(2, partitionFunc=TwoPartsPartitioner)

Let us use glom again to check the structure of our new RDD y.

y.glom().collect()

It should display something like this:

[[('giri', 1), ('abhishek', 1), ('jude', 1)], [('sandeep', 1), ('sravani', 1)]]

You can see that this new RDD has two partitions. The first partition contains all the keys whose first letter (compared case-insensitively, thanks to upper()) falls in the range A through J, and the second partition contains the keys whose first letter comes after J. Our newly created partitioner works as intended.
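
As a small addition (not part of the original lesson), we can also confirm the partition count programmatically:

y.getNumPartitions()   # returns 2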

Custom Partitioner - Scala Video

Slides - Adv Spark Programming (1)

