Apache Spark Basics with Python

Apache Spark with Python - More Operations - Transformations & Actions

In this slide, we will look at more operations:

Transformations

flatMap - This is similar to map, but each input item can be mapped to 0 or more output items. It returns a new RDD by first applying a function to all elements of the RDD, and then flattening the results. The flatMap syntax looks like this:

variable.flatMap(function)
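
For example, here is a quick sketch (assuming an active SparkContext available as sc, as in the hands-on below) in which each number is mapped to two output values, so the result has more elements than the input:

    # each input element produces two output elements, which flatMap flattens
    sc.parallelize([1, 2, 3]).flatMap(lambda x: [x, x * 10]).collect()
    # output: [1, 10, 2, 20, 3, 30]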

How is flatMap different from map?

  • With map, the resulting RDD has the same number of elements as the input RDD
  • map converts each input element to exactly one output element, while flatMap can convert one element to zero, one, or many

flatMap can emulate both map and filter, because the function it applies can return many values, a single value, or no value at all (an empty list). If the function returns exactly one value per element, flatMap behaves like map; if it returns an empty list for some elements, it behaves like filter.
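
As a rough sketch of this idea (assuming sc is available; the hands-on below builds similar named functions):

    sample = sc.parallelize([1, 2, 3, 4])
    # emulating map: every element returns exactly one value, wrapped in a list
    sample.flatMap(lambda x: [x * 2]).collect()                     # [2, 4, 6, 8]
    # emulating filter: odd elements return an empty list and are dropped
    sample.flatMap(lambda x: [x] if x % 2 == 0 else []).collect()   # [2, 4]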

union - This returns a new dataset that contains the union of the elements in the source dataset and the argument. The syntax for union looks like the following:

RDD1.union(RDD2)

It should be noted that union does not remove duplicates.
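
For example (assuming sc is available), note that the value 3 appears twice in the result:

    x = sc.parallelize([1, 2, 3])
    y = sc.parallelize([3, 4, 5])
    # union simply concatenates the two RDDs; duplicates are kept
    x.union(y).collect()
    # output: [1, 2, 3, 3, 4, 5]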

Actions

collect - It returns all the elements of the RDD as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. One should note that using take is often more practical than collect, since take brings back only the first n elements while collect pulls the entire RDD into the driver's memory, which can be a problem for large datasets. The syntax for collect looks like the following:

RDD.collect()
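
For instance, the following sketch (assuming sc is available) shows why take is often the safer choice on a large RDD:

    big = sc.parallelize(range(1, 1000001))
    # big.collect() would pull all one million elements into the driver
    big.take(3)      # fetches only the requested number of elements
    # output: [1, 2, 3]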

count - This returns the number of elements in an RDD. Syntactically, count looks like the following:

RDD.count()
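
A one-line sketch (assuming sc is available):

    sc.parallelize(["a", "b", "c", "a"]).count()
    # output: 4, since count does not deduplicate; it counts every element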

Let's do a hands-on with these new operations.

INSTRUCTIONS
  • First, let's create an RDD as below

    linesRDD = sc.parallelize(["this is a dog", "named jerry"])
    
  • Now, let's create a function toWords that splits a line into individual words

    def <<your code goes here>>(line):
        return line.split(" ")
    
  • Let us convert linesRDD into words using this function

    wordsRDD = linesRDD.flatMap(<<your code goes here>>)
    
  • Now let's view wordsRDD using collect

    wordsRDD.<<your code goes here>>()
    

    Note that even though there were 2 different lines in the original RDD, flatMap has produced a single flat collection with both lines split into individual words

  • Now let's compare the results of flatMap with map. First, we will use map to split the lines into a new RDD named wordsRDD1

    <<your code goes here>> = linesRDD.map(toWords)
    
  • Now let's view the resulting RDD using collect

    wordsRDD1.<<your code goes here>>()
    

    Note the difference in output compared to flatMap

  • Now let's use flatMap as a map. First, let's define a range of numbers from 1 to 9999 as arr

    <<your code goes here>> = range(1, 10000)
    
  • Next, let's convert that array into an RDD named nums

    <<your code goes here>> = sc.parallelize(arr)
    
  • Now, let's define a function multiplyByTwo that takes an element, multiplies it by 2, and returns the result wrapped in a list (so that flatMap can flatten it)

    def <<your code goes here>>(x):
        return [x*2]
    
  • Let's test this function by passing the number 5 to it

    multiplyByTwo(<<your code goes here>>)
    

    If you did it right, you would get an output of [10]

  • Now let's use flatMap with this function on the RDD nums and save the results in a new RDD named dbls

    dbls = nums.flatMap(<<your code goes here>>)
    
  • Finally, let's observe the first 5 elements of the output using take

    dbls.take(5)
    
  • Great! Now let's use flatMap as a filter. Let's define a function isEven that takes an element and returns it (in a list) if it is even, and an empty list otherwise

    def <<your code goes here>>(x):
        if x%2 == 0:
            return [x]
        else:
            return []
    
  • Now let's use flatMap with this function on the nums RDD we created earlier and store the results in a new RDD named evens

    <<your code goes here>> = nums.flatMap(isEven)
    
  • Finally, let's observe the resulting RDD with take

    evens.take(3)
    

    Note that the output is what a filter would produce: flatMap here is behaving like a filter, keeping only the even numbers

  • Now let's see how union works. First, define 2 RDDs as shown below

    a = sc.parallelize(['1','2','3'])
    b = sc.parallelize(['A','B','C'])
    
  • Use union on the RDD a with b as an argument, and store the resulting RDD in c

    <<your code goes here>> = a.union(b)
    
  • Use collect to observe the resulting RDD c

    c.<<your code goes here>>()
    
  • We have already seen collect in action in the steps above; now it's time to see how count works. First, let's create a new RDD a with 3 partitions

    a = sc.parallelize([1,2,3,4,5,6,7],3)
    
  • Now let's count its elements using count

    a.<<your code goes here>>()
    