Now let's look at reduce. It aggregates the elements of a dataset using a function. reduce can be used as follows:
RDD.reduce(function)
Please note that the function must be commutative and associative; otherwise the results could be unpredictable and wrong. Also, the return type of the function has to be the same as the type of its arguments.
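To see why these two properties matter, here is a minimal sketch in plain Python, with no Spark required. It assumes a simplified model of what reduce does on a partitioned dataset: each partition is reduced on its own, and the partial results are then combined. The helper simulate_rdd_reduce is hypothetical and is not part of the Spark API; real Spark may also combine the partial results in a different order.

```python
from functools import reduce

def simulate_rdd_reduce(data, num_partitions, func):
    """Hypothetical stand-in for RDD.reduce: reduce each partition, then combine."""
    size = -(-len(data) // num_partitions)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

numbers = list(range(1, 9))  # 1 through 8

add = lambda x, y: x + y       # commutative and associative
subtract = lambda x, y: x - y  # neither commutative nor associative

# Run the same reduction with 1, 2 and 4 partitions.
add_results = {n: simulate_rdd_reduce(numbers, n, add) for n in (1, 2, 4)}
sub_results = {n: simulate_rdd_reduce(numbers, n, subtract) for n in (1, 2, 4)}

print(add_results)  # {1: 36, 2: 36, 4: 36}
print(sub_results)  # {1: -34, 2: 8, 4: 2}
```

Addition gives the same answer however the data is split, while subtraction gives a different answer for every partition count.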
Let's see how reduce works.
First, let's create a sequence of 100 numbers (1 through 100) and save it in an RDD named seq
<<your code goes here>> = sc.parallelize(range(1, 101))
Now let's define a function named sum that takes 2 arguments and returns the sum of those 2 numbers
def <<your code goes here>>(x, y):
    return x + y
Now let's call this function using reduce to sum up the elements of the seq RDD, and then save the result in total
total = seq.<<your code goes here>>(sum)
Let's print total and see the result
print(total)
This can be written more concisely using a lambda function:
print(sc.parallelize(range(1, 101)).reduce(lambda x,y:x+y))
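As a quick sanity check on the expected total, the sketch below reproduces the same reduction locally. It uses Python's functools.reduce as a stand-in for the RDD method (an assumption made so the example runs without a Spark context) and compares the result with the closed-form sum n(n+1)/2.

```python
from functools import reduce
from operator import add

n = 100

# Same pairwise addition as the lambda above, applied to a local range.
total_by_reduce = reduce(add, range(1, n + 1))

# Closed-form sum of the first n positive integers.
total_by_formula = n * (n + 1) // 2

print(total_by_reduce, total_by_formula)  # 5050 5050
```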
Let's check whether we can use reduce to calculate the average of a set of numbers, given that pairwise averaging is commutative but not associative. First, let's define the set of numbers and store them in an RDD named seq
seq = sc.parallelize([3.0, 7, 13, 16, 19])
Now let's define the function avg which takes 2 numbers as arguments and returns their average
def <<your code goes here>>(x, y):
    return (x + y) / 2
Now let's use reduce to calculate the average of the elements of the seq RDD, and then save the result in total
total = seq.<<your code goes here>>(avg)
Now let's print total and see the result
print(total)
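This is incorrect, since the average of this sequence of numbers is 11.6. Because avg is not associative, reduce ends up averaging intermediate averages instead of computing the mean of all the elements.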
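One common way around this is to reduce (sum, count) pairs, since adding pairs element by element is both commutative and associative, and to divide only once at the end. The sketch below shows the idea locally with functools.reduce as a stand-in for the RDD method. The helper add_pairs is a hypothetical name introduced for illustration. The pairwise figure shown is for a strict left-to-right fold; on a real RDD the value depends on how the data is partitioned.

```python
from functools import reduce

values = [3.0, 7, 13, 16, 19]

def avg(x, y):
    return (x + y) / 2

# Left-to-right pairwise averaging: 5.0, 9.0, 12.5, 15.75
pairwise = reduce(avg, values)

def add_pairs(a, b):
    """Hypothetical helper: add two (sum, count) pairs element by element."""
    return (a[0] + b[0], a[1] + b[1])

# Turn every value into a (value, 1) pair, then reduce the pairs.
pairs = [(v, 1) for v in values]
total_sum, count = reduce(add_pairs, pairs)
mean = total_sum / count

print(pairwise)  # 15.75
print(mean)      # 11.6
```

With an RDD, the same idea would be written as seq.map(lambda x: (x, 1)).reduce(add_pairs), followed by the division.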