Apache Spark Basics with Python

Apache Spark with Python - Reduce, Commutative & Associative

Now let's look at reduce. It aggregates the elements of a dataset using a function. reduce can be used as follows:

RDD.reduce(function)

Please note that the function must be commutative and associative, otherwise the results could be unpredictable and wrong. Also, the return type of the function has to be the same as the type of its arguments.
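
To see why this matters, here is a minimal sketch (assuming a live SparkContext bound to sc, as in the steps below) contrasting a safe function with an unsafe one:

    # Addition is commutative and associative, so reduce is safe:
    rdd = sc.parallelize([1, 2, 3, 4], 2)
    print(rdd.reduce(lambda x, y: x + y))   # always 10

    # Subtraction is neither commutative nor associative, so the result
    # depends on how Spark groups the elements across partitions:
    print(rdd.reduce(lambda x, y: x - y))   # varies with partitioning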

Let's see how reduce works.

INSTRUCTIONS
  • First, let's create a sequence of 100 numbers (1 through 100) and save it in an RDD named seq

    <<your code goes here>> = sc.parallelize(range(1, 101))
    
  • Now let's define a function named sum that takes 2 arguments and returns their sum (note that this name shadows Python's built-in sum within this session)

    def <<your code goes here>>(x, y):
        return x+y
    
  • Now let's pass this function to reduce to sum up the elements of the seq RDD, and save the result in total

    total = seq.<<your code goes here>>(sum)
    
  • Let's print total and see the results

    print(total)
    
  • This can be written more compactly using a lambda

    print(sc.parallelize(range(1, 101)).reduce(lambda x,y:x+y))
    
  • Let's check whether we can use reduce to calculate the average of a set of numbers, since average is commutative but not associative. First, let's define the set of numbers and store them in an RDD named seq

    seq = sc.parallelize([3.0, 7, 13, 16, 19])
    
  • Now let's define the function avg which takes 2 numbers as arguments and returns their average

    def <<your code goes here>>(x, y):
        return ((x+y)/2)
    
  • Now let's use reduce to calculate the average of the elements of the seq RDD, and then save it in total

    total = seq.<<your code goes here>>(avg)
    
  • Now let's print total and see the result

    print(total)
    

    This is incorrect: the true average of this sequence of numbers is 11.6. Because avg is not associative, the result of reduce depends on how the operations are grouped. Folding left to right gives ((((3 + 7)/2 + 13)/2 + 16)/2 + 19)/2 = 15.75, and the value can even change with the RDD's partitioning.
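
The usual fix is to reduce over values whose combining function is both commutative and associative, and divide only once at the end. Here is a minimal sketch, assuming the same sc and seq as above:

    # Carry (running sum, running count) pairs; adding such pairs
    # element-wise is commutative and associative, so reduce is safe.
    pairs = seq.map(lambda x: (x, 1))
    total_sum, count = pairs.reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    print(total_sum / count)   # 11.6

PySpark RDDs also provide a built-in mean() action that computes this correctly: print(seq.mean()) prints 11.6.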
