Apache Spark Basics


Apache Spark - Problem Solving - Compute Average





INSTRUCTIONS
  • How to compute the average?

    Approach 1:

    val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)  // the 1.0 makes this an RDD[Double]
    // reduce and count are both actions, so this launches two separate Spark jobs.
    val avg = rdd.reduce(_ + _) / rdd.count()
    

    Approach 2:

    val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)
    // Pair each element with a 1 so the sum and the count travel together.
    val pairs = rdd.map((_, 1))
    // A single reduce combines both: add the values and add the counts.
    val (sum, count) = pairs.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    val avg = sum / count


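    A third option, for comparison: because rdd is an RDD[Double], Spark's
    built-in mean() (provided through DoubleRDDFunctions) computes the same
    result in a single pass. A minimal sketch, using the same data as above:

    val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)
    // mean() aggregates the sum and the count together and divides for us.
    val avg = rdd.mean()
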

4 Comments

Can you also explain the notation rdd.map((_, 1))?


In the notation rdd.map((_, 1)), the underscore (_) is a placeholder for each element in the RDD, and 1 is a constant value. Here’s a detailed explanation:

1. The map function in Spark is used to transform each element of the RDD by applying a function to it.

2. The function (elem => (elem, 1)) can be written more concisely as (_, 1), where _ stands for each element of the RDD. This is Scala's placeholder syntax for anonymous functions, available when each parameter appears exactly once in the function body.

3. For each element in the RDD, the map function produces a tuple (elem, 1), where elem is the original element, and 1 is a constant value.

For example, if the RDD contains elements [1.0, 2, 3, 4, 5, 6, 7], the map transformation will convert it to [(1.0, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)].

In this way, rdd.map((_, 1)) creates an RDD of tuples where each original element is paired with the number 1.
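
To see the equivalence in the spark-shell (a minimal sketch; the variable names are only illustrative):

    val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)
    val explicit = rdd.map(elem => (elem, 1))   // full lambda form
    val shorthand = rdd.map((_, 1))             // placeholder form, same function
    shorthand.collect()
    // Array((1.0,1), (2.0,1), (3.0,1), (4.0,1), (5.0,1), (6.0,1), (7.0,1))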


Hello,

Can you please explain what x._1 and y._1 mean?


In the context of the code in the Instructions, x._1 and y._1 refer to the first elements of the tuples x and y, respectively; x._2 and y._2 are their second elements. Scala tuples are accessed positionally with ._1, ._2, and so on. In the reduce, the first slot accumulates the running sum and the second the running count.
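
A quick illustration with plain tuples (a minimal sketch; the values are made up):

    val x = (1.0, 1)   // (partial sum, partial count) from one partition
    val y = (2.0, 1)   // the same from another partition
    x._1 + y._1        // 3.0 -- the combined sum
    x._2 + y._2        // 2   -- the combined count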
