Apache Spark with Python - Reduce, Commutative & Associative

Now let's look at reduce. It aggregate elements of dataset using a function. reduce can be used as follows:

RDD.reduce(function)

Please note that the function must be commutative and associative otherwise the results could be unpredictable and wrong. Also, the return type of function has to be same as argument.

Let's see how reduce works.

INSTRUCTIONS

First, let's create an array of 101 numbers and save it in an RDD named seq
```
<<your code goes here>> = sc.parallelize(range(1, 101))
```
Now let's define a function named sum that takes 2 arguments, and returns the sum of those 2 numbers
```
def <<your code goes here>>(x, y):
    return x+y
```
Now let's call this function using reduce to sum up the elements of the seq RDD, and then save it in total
```
total = seq.<<your code goes here>>(sum)
```
Let's print total and see the results
```
print(total)
```

This can be put in a simpler format using the following code

print(sc.parallelize(range(1, 101)).reduce(lambda x,y:x+y))

Let's check if we can use reduce to calculate average of a set of numbers since average is commutative and not associative. First, let define the set of numbers and store them in an RDD named seq
```
seq = sc.parallelize([3.0, 7, 13, 16, 19])
```
Now let's define the function avg which takes 2 numbers as arguments and returns their average
```
def <<your code goes here>>(x, y):
    return ((x+y)/2)
```
Now let's use reduce to calculate the average of the elements of the seq RDD, and then save it in total
```
total = seq.<<your code goes here>>(avg)
```
Now let's print total and see the result
```
print(total)
```
This is incorrect since the average for these sequence of numbers is 11.6

See Answer

Note - Having trouble with the assessment engine? Follow the steps listed here

Previous Index Next

Apache Spark Basics with Python

Apache Spark with Python - Reduce, Commutative & Associative

XP

Loading comments...