How to compute an average?
Approach 1:
val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)
val avg = rdd.reduce(_ + _) / rdd.count()   // sum of all elements divided by their count
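With the sample data, rdd.reduce(_ + _) returns 28.0 and rdd.count() returns 7, so avg is 4.0. Note that reduce and count are both actions, so this approach triggers two separate jobs over the same RDD.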
Approach 2:
val rdd = sc.parallelize(Array(1.0, 2, 3, 4, 5, 6, 7), 3)
val rdd_count = rdd.map((_, 1))             // pair each element with a count of 1
val (sum, count) = rdd_count.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
val avg = sum / count                       // 28.0 / 7 = 4.0
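Here, map pairs each element with 1 and reduce then adds the sums and the counts element-wise, producing (28.0, 7), so avg is again 4.0, this time in a single pass over the data. For reference, the same single-pass average can also be written with RDD.aggregate, which folds a (sum, count) accumulator directly instead of building a tuple for every element; a minimal sketch, assuming the same rdd as above:

val (sum, count) = rdd.aggregate((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),     // fold one element into a partition's (sum, count)
  (a, b)   => (a._1 + b._1, a._2 + b._2)    // merge the (sum, count) pairs of two partitions
)
val avg = sum / count                       // 4.0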
4 Comments
Can you also explain the notation rdd.map((_, 1))?
In the notation rdd.map((_, 1)), the underscore (_) is a placeholder for each element in the RDD, and 1 is a constant value. Here is a detailed explanation:

1. The map function in Spark is used to transform each element of the RDD by applying a function to it.
2. The function (elem => (elem, 1)) can be written more concisely as (_, 1), where _ stands for each element of the RDD. This is a Scala shorthand for lambda functions when the parameter is used only once.
3. For each element in the RDD, the map function produces a tuple (elem, 1), where elem is the original element and 1 is a constant value.

For example, if the RDD contains the elements [1.0, 2, 3, 4, 5, 6, 7], the map transformation will convert it to [(1.0, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)].

In this way, rdd.map((_, 1)) creates an RDD of tuples where each original element is paired with the number 1.
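To see the shorthand and the expanded lambda side by side, here is a minimal sketch; plain Scala collections use the same map syntax, so it can be tried in any Scala REPL:

val xs = List(1.0, 2, 3)
xs.map(elem => (elem, 1))   // List((1.0,1), (2.0,1), (3.0,1)) -- explicit lambda
xs.map((_, 1))              // same result, using the underscore placeholder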
Hello,
Can you please explain what x._1 and y._1 mean?
In the context of the code in the Instructions, x._1 and y._1 refer to the first elements of the tuples x and y, respectively.
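A quick plain-Scala sketch of the tuple accessors; the values here mirror the (sum, count) pair from Approach 2:

val pair = (28.0, 7)   // a Tuple2: first element is a running sum, second a running count
pair._1                // 28.0 -- the first element
pair._2                // 7    -- the second element

In the reduce above, x and y are two such (sum, count) tuples, so (x._1 + y._1, x._2 + y._2) adds the sums together and the counts together.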