Apache Spark Basics


Apache Spark - Problem Solving - Compute Average





INSTRUCTIONS
  • How to compute the average?

    Approach 1:

    val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)  // the 1.0 makes this an RDD[Double]
    // reduce and count are both actions, so this launches two separate Spark jobs.
    val avg = rdd.reduce(_ + _) / rdd.count()
    

    Approach 2:

    val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)
    // Pair each element with a 1 so the sum and the count travel together.
    val pairs = rdd.map((_, 1))
    // A single reduce combines both: add the values and add the counts.
    val (sum, count) = pairs.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    val avg = sum / count


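    A third option, for comparison: because rdd is an RDD[Double], Spark's
    built-in mean() (provided through DoubleRDDFunctions) computes the same
    result in a single pass. A minimal sketch, using the same data as above:

    val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)
    // mean() aggregates the sum and the count together and divides for us.
    val avg = rdd.mean()
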

4 Comments

Can you also explain the notation rdd.map((_, 1))?


In the notation rdd.map((_, 1)), the underscore (_) is a placeholder for each element in the RDD, and 1 is a constant value. Here’s a detailed explanation:

1. The map function in Spark is used to transform each element of the RDD by applying a function to it.

2. The function (elem => (elem, 1)) can be written more concisely as (_, 1), where _ stands for each element of the RDD. This is Scala's placeholder syntax for anonymous functions, available when each parameter appears exactly once in the function body.

3. For each element in the RDD, the map function produces a tuple (elem, 1), where elem is the original element, and 1 is a constant value.

For example, if the RDD contains elements [1.0, 2, 3, 4, 5, 6, 7], the map transformation will convert it to [(1.0, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)].

In this way, rdd.map((_, 1)) creates an RDD of tuples where each original element is paired with the number 1.
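
To see the equivalence in the spark-shell (a minimal sketch; the variable names are only illustrative):

    val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)
    val explicit = rdd.map(elem => (elem, 1))   // full lambda form
    val shorthand = rdd.map((_, 1))             // placeholder form, same function
    shorthand.collect()
    // Array((1.0,1), (2.0,1), (3.0,1), (4.0,1), (5.0,1), (6.0,1), (7.0,1))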


Hello,

Can you please explain what x._1 and y._1 mean?


In the context of the code in the Instructions, x._1 and y._1 refer to the first elements of the tuples x and y, respectively; x._2 and y._2 are their second elements. Scala tuples are accessed positionally with ._1, ._2, and so on. In the reduce, the first slot accumulates the running sum and the second the running count.
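
A quick illustration with plain tuples (a minimal sketch; the values are made up):

    val x = (1.0, 1)   // (partial sum, partial count) from one partition
    val y = (2.0, 1)   // the same from another partition
    x._1 + y._1        // 3.0 -- the combined sum
    x._2 + y._2        // 2   -- the combined count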
