Login using Social Account
     Continue with GoogleLogin using your credentials
In this slide, we will look at more operations:
Transformations
flatMap
- This is similar to map
, but each input item can be mapped to 0 or more output items. It returns a new RDD by first applying a function to all elements of this RDD, and then flattening the results. A flatMap
syntax looks like this:
variable.flatMap(function)
How is flatMap
different from map
?
map
the resulting RDD and input RDD having same number of elementsmap
can only convert one to one while flatMap could convert one to manyflatMap
can emulate map
as well as filter
, it can produce many as well as no value with an empty array as output. If it give out single value, it behaves like map
, if it gives out an empty array it behaves like a filter
.
union
- This return a new dataset that contains the union of the elements in the source dataset and the argument. The syntax for union
looks like the following:
RDD1.union(RDD2)
It should be noted that union does not remove duplicates.
Actions
collect
- It returns all the elements of the RDD as an array at the driver program. This is usually useful after a filter
or other operation that returns a sufficiently small subset of the data. One should note that using take
is more practical than collect
since... The syntax for collect
looks like the following:
RDD.collect()
count
- This returns the number of elements in an RDD. Syntactically, count
looks like the following:
RDD.count()
Let's do a hands-on with these new operations.
First, let's create an RDD as below
linesRDD = sc.parallelize( ["this is a dog", "named jerry"])
Now, let's create a function toWords
that splits a line into individual words
def <<your code goes here>>(line):
return line.split(" ")
Let us convert linesRDD
into words using this function
wordsRDD = linesRDD.flatMap(<<your code goes here>>)
Now let's view wordsRDD
using collect
wordsRDD.<<your code goes here>>()
Note that even though there were 2 different lines in the original RDD, we now have a single RDD with both the lines split into words
Now let's compare the results of flatMap
with map
. First, we will use map
to split the line into a new RDD named wordsRDD1
<<your code goes here>> = linesRDD.map(toWords)
Now let's view the resulting RDD using collect
wordsRDD1.<<your code goes here>>()
Note the difference in output compared to flatMap
Now let's use flatMap
as a map
. First, let's define an array of 10000 numbers as arr
<<your code goes here>> = range(1, 10000)
Next, let's convert that array into an RDD named nums
<<your code goes here>> = sc.parallelize(arr)
Now, let's define a function multiplyByTwo
that takes an element, multiplies it by 2
and returns the result
def <<your code goes here>>(x):
return [x*2]
Let's test this function by passing the number 5
to it
multiplyByTwo(<<your code goes here>>)
If you did it right, you would get an output of 10
Now let's use flatMap
using this function on the RDD nums
and save the results in new RDD named dbls
dbls = nums.flatMap(<<your code goes here>>)
Finally, let's observe the first 5
output using take
dbls.take(5)
Great! Now let's use flatMap
as a filter
. Let's define a function isEven
that takes an element, and returns the same if its even
def <<your code goes here>>(x):
if x%2 == 0:
return [x]
else:
return []
Now let's use flatMap
with this function on the nums
RDD we created earlier and store the results in a new RDD named evens
<<your code goes here>> = nums.flatMap(isEven)
Finally let's observe the resulting RDD with take
evens.take(3)
Note the similarity in output with that of map
Now let's see how union
works. First, define 2 RDDs as shown below
a = sc.parallelize(['1','2','3'])
b = sc.parallelize(['A','B','C'])
Use union
on the RDD a
with b
as an argument, and store the resulting RDD in c
<<your code goes here>> = a.union(b)
Use collect
to observe the resulting RDD c
c.<<your code goes here>>()
Note that we have already seen collect
in action in the above steps, now it's time to see how count
works. First, let's create a new RDD a
a = sc.parallelize([1,2,3,4,5,6,7],3)
Now let's count it's elements using count
a.<<your code goes here>>()
Taking you to the next exercise in seconds...
Want to create exercises like this yourself? Click here.
No hints are availble for this assesment
Note - Having trouble with the assessment engine? Follow the steps listed here
Loading comments...