In this slide, we will look at more operations:
Transformations
flatMap - This is similar to map, but each input item can be mapped to 0 or more output items. It returns a new RDD by first applying a function to all elements of this RDD, and then flattening the results. The syntax for flatMap looks like this:
variable.flatMap(function)
How is flatMap different from map?
With map, the resulting RDD and the input RDD have the same number of elements.
map can only convert one element to exactly one element, while flatMap can convert one element to many.
flatMap can emulate map as well as filter, because its function can return many values or no value at all (an empty array). If the function returns a single value for each element, flatMap behaves like map; if it returns an empty array for some elements, flatMap behaves like filter.
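The difference between map and flatMap can be sketched in plain Python, without Spark. This is only a local model: ordinary lists stand in for RDDs, and the helpers local_map and local_flat_map are hypothetical names introduced for illustration, not part of the Spark API.

```python
from itertools import chain


def local_map(func, items):
    # Hypothetical stand-in for RDD.map: exactly one output per input.
    return [func(item) for item in items]


def local_flat_map(func, items):
    # Hypothetical stand-in for RDD.flatMap: func returns an iterable
    # for each input, and the per-input results are flattened into one list.
    return list(chain.from_iterable(func(item) for item in items))


lines = ["a b", "c"]
mapped = local_map(str.split, lines)
flattened = local_flat_map(str.split, lines)

print(mapped)     # [['a', 'b'], ['c']]  (one list per input line)
print(flattened)  # ['a', 'b', 'c']      (a single flat list of words)
```

The mapped result keeps one entry per input line, while the flattened result has one entry per word.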
union - This returns a new dataset that contains the union of the elements in the source dataset and the argument. The syntax for union looks like the following:
RDD1.union(RDD2)
It should be noted that union does not remove duplicates.
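To illustrate what "does not remove duplicates" means, here is a minimal plain-Python sketch in which lists stand in for RDDs. The helper local_union is a hypothetical name used only for this illustration. In PySpark itself, calling distinct() on the resulting RDD is one way to drop the duplicates.

```python
def local_union(left, right):
    # Hypothetical stand-in for RDD.union: simply concatenates both inputs.
    return list(left) + list(right)


first = [1, 2, 3]
second = [3, 4]

combined = local_union(first, second)
print(combined)  # [1, 2, 3, 3, 4]  (the shared value 3 appears twice)

# Removing duplicates is a separate step (order-preserving here).
unique = list(dict.fromkeys(combined))
print(unique)    # [1, 2, 3, 4]
```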
Actions
collect - It returns all the elements of the RDD as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. One should note that using take is more practical than collect since... The syntax for collect looks like the following:
RDD.collect()
count - This returns the number of elements in an RDD. Syntactically, count looks like the following:
RDD.count()
Let's do a hands-on with these new operations.
First, let's create an RDD as below
linesRDD = sc.parallelize( ["this is a dog", "named jerry"])
Now, let's create a function toWords that splits a line into individual words
def <<your code goes here>>(line):
    return line.split(" ")
Let us convert linesRDD into words using this function
wordsRDD = linesRDD.flatMap(<<your code goes here>>)
Now let's view wordsRDD using collect
wordsRDD.<<your code goes here>>()
Note that even though there were 2 different lines in the original RDD, we now have a single RDD with both the lines split into words
Now let's compare the results of flatMap with map. First, we will use map to split the line into a new RDD named wordsRDD1
<<your code goes here>> = linesRDD.map(toWords)
Now let's view the resulting RDD using collect
wordsRDD1.<<your code goes here>>()
Note the difference in output compared to flatMap
Now let's use flatMap as a map. First, let's define the numbers from 1 to 9999 as arr (note that range excludes its upper bound, 10000)
<<your code goes here>> = range(1, 10000)
Next, let's convert that array into an RDD named nums
<<your code goes here>> = sc.parallelize(arr)
Now, let's define a function multiplyByTwo that takes an element, multiplies it by 2 and returns the result
def <<your code goes here>>(x):
    return [x*2]
Let's test this function by passing the number 5 to it
multiplyByTwo(<<your code goes here>>)
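If you did it right, you will get [10] as the output: a list containing the single value 10, because the function wraps its result in a list so that flatMap can flatten it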
Now let's use flatMap using this function on the RDD nums and save the results in new RDD named dbls
dbls = nums.flatMap(<<your code goes here>>)
Finally, let's observe the first 5 outputs using take
dbls.take(5)
Great! Now let's use flatMap as a filter. Let's define a function isEven that takes an element and returns it inside a list if it is even, or an empty list otherwise
def <<your code goes here>>(x):
    if x%2 == 0:
        return [x]
    else:
        return []
Now let's use flatMap with this function on the nums RDD we created earlier and store the results in a new RDD named evens
<<your code goes here>> = nums.flatMap(isEven)
Finally let's observe the resulting RDD with take
evens.take(3)
Note that the output is the same as what filter would produce, just as the earlier dbls output matched what map would produce
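The two flatMap exercises above can be checked against ordinary map-style and filter-style results in plain Python. This is a sketch under assumptions: lists stand in for RDDs, a small range replaces the larger one, and local_flat_map, multiply_by_two, and keep_if_even are hypothetical names chosen for this illustration.

```python
from itertools import chain


def local_flat_map(func, items):
    # Hypothetical stand-in for RDD.flatMap: flatten the per-item lists.
    return list(chain.from_iterable(func(item) for item in items))


def multiply_by_two(x):
    # Always returns exactly one value, so flatMap behaves like map.
    return [x * 2]


def keep_if_even(x):
    # Returns one value or none, so flatMap behaves like filter.
    return [x] if x % 2 == 0 else []


nums = list(range(1, 11))  # 1..10, a small sample for illustration

doubled = local_flat_map(multiply_by_two, nums)
evens = local_flat_map(keep_if_even, nums)

print(doubled[:5])  # [2, 4, 6, 8, 10]
print(evens[:3])    # [2, 4, 6]
```

Because each function returns a list, the flattening step is what turns "one list per element" into a plain sequence of values.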
Now let's see how union works. First, define 2 RDDs as shown below
a = sc.parallelize(['1','2','3'])
b = sc.parallelize(['A','B','C'])
Use union on the RDD a with b as an argument, and store the resulting RDD in c
<<your code goes here>> = a.union(b)
Use collect to observe the resulting RDD c
c.<<your code goes here>>()
We have already seen collect in action in the steps above. Now it's time to see how count works. First, let's create a new RDD a
a = sc.parallelize([1,2,3,4,5,6,7],3)
Now let's count its elements using count
a.<<your code goes here>>()
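The second argument to sc.parallelize is the number of partitions (slices) the data is split into, and it does not change what count returns. The plain-Python sketch below models this idea only; split_into_partitions is a hypothetical helper, and its slicing scheme is an assumption that may not match how Spark actually distributes the elements.

```python
def split_into_partitions(items, num_partitions):
    # Hypothetical helper: cut the data into contiguous chunks.
    # Spark's real slicing may place elements differently.
    items = list(items)
    size, remainder = divmod(len(items), num_partitions)
    partitions = []
    start = 0
    for index in range(num_partitions):
        end = start + size + (1 if index < remainder else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


partitions = split_into_partitions([1, 2, 3, 4, 5, 6, 7], 3)

# A count over the whole dataset is the sum of the per-partition counts.
total = sum(len(partition) for partition in partitions)

print(partitions)  # [[1, 2, 3], [4, 5], [6, 7]]
print(total)       # 7
```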