Apache Spark with Python - Problem Solving - Compute Average

In the last slide we saw that since averaging is not associative, we could not use reduce directly to calculate the average of a set of numbers. So how do we calculate the average using reduce in that case? Let's see.
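To see why a pairwise average cannot simply be handed to reduce, here is a minimal sketch in plain Python. It assumes functools.reduce as a local stand-in for the RDD reduce, and pairwise_avg is a hypothetical helper introduced only for this illustration. It shows that the result depends on how the elements are grouped, which is exactly what varies when data is split across partitions.

```python
from functools import reduce

values = [1.0, 2, 3, 4, 5, 6, 7]

# Hypothetical helper: the average of just two numbers.
def pairwise_avg(x, y):
    return (x + y) / 2

# Two different groupings of the same three numbers give different results,
# and neither equals the true average of 1, 2 and 3 (which is 2.0).
grouped_a = pairwise_avg(pairwise_avg(1.0, 2), 3)
grouped_b = pairwise_avg(1.0, pairwise_avg(2, 3))

# Folding the whole list pairwise also misses the true average.
folded = reduce(pairwise_avg, values)
true_avg = sum(values) / len(values)

print(grouped_a, grouped_b, folded, true_avg)
```

Because the grouping is decided by how the data happens to be partitioned, a function like this would give results that change with the number of partitions.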

INSTRUCTIONS
  • First, let's define a set of elements for which we will be calculating the average, and store them in an RDD named rdd

    <<your code goes here>> = sc.parallelize([1.0, 2, 3, 4, 5, 6, 7], 3)
    
  • Now let's calculate the average by using reduce to compute the sum of the elements and count to get the number of elements. We then divide the sum by the count and store the result in a variable called avg

    avg = rdd.<<your code goes here>>(lambda x, y: x + y) / rdd.count()
    

    The average given here is 4.0, which is correct. However, this approach is inefficient because it computes the RDD twice: once during reduce and once during count. So, we will move to the next approach

  • With the next approach, we will first translate each value into a composite value, so that each element of the RDD represents a value along with how many elements have been summed up to reach it. To do this, we transform each element into a tuple of the value and 1, where 1 is the number of elements added so far to reach that value. We will use map for this as shown below

    rdd_count = rdd.<<your code goes here>>(lambda x: (x, 1))
    
  • Next, we will define a function add_tuples that takes two such tuples and returns a new tuple containing the sum of their values and the sum of their counts

    def <<your code goes here>>(x, y):
        return (x[0] + y[0], x[1] + y[1])
    
  • Now, we will use reduce with this function to get the sum of the values and the number of elements. We will store these in the variables sum and count

    (sum, count) = rdd_count.<<your code goes here>>(add_tuples)
    
  • Finally, we will calculate the average using these values

    avg = sum / <<your code goes here>>
    

    This approach computes the sum and the count in a single pass over the RDD instead of two.
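The tuple-based steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: functools.reduce stands in for the RDD reduce, a list comprehension stands in for map, and the three-way split in partitions is a hypothetical layout, not the one Spark would necessarily choose. The name total is used instead of sum so the Python built-in is not shadowed.

```python
from functools import reduce

values = [1.0, 2, 3, 4, 5, 6, 7]

# Same function as in the exercise: add values and counts component-wise.
def add_tuples(x, y):
    return (x[0] + y[0], x[1] + y[1])

# Stand-in for rdd.map(lambda x: (x, 1)).
pairs = [(v, 1) for v in values]

# Hypothetical split into three partitions, reduced independently first.
partitions = [pairs[0:2], pairs[2:4], pairs[4:7]]
partials = [reduce(add_tuples, p) for p in partitions]

# Combining the partial results gives the overall (sum, count).
total, count = reduce(add_tuples, partials)
avg = total / count

# Reducing everything in one go gives the same tuple, so grouping does not matter.
single_pass = reduce(add_tuples, pairs)

print(partials, total, count, avg)
```

Since add_tuples gives the same answer however the tuples are grouped, it is safe to use with reduce, unlike a direct pairwise average.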

