
# Apache Spark with Python - Lazy Evaluation & Lineage Graph

## What is Lazy Evaluation?

In the previous examples, did you notice that `map` returned another RDD almost instantly, while `take` took some time and actually did the work? The reason is that a transformation such as `map` is not executed immediately; instead, Spark makes a note of it in the form of an RDD graph, also called the RDD lineage. Only when an action such as `take` or `collect` is called are the transformations actually evaluated. In other words, evaluation is not instantaneous but lazy.
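
Python's built-in `map` happens to be lazy as well, which makes a handy stand-alone analogy (plain Python, not Spark): creating the map object records the work, but nothing runs until the result is forced.

```python
# Track every actual invocation of the mapped function.
calls = []

def slow_double(x):
    calls.append(x)  # side effect proves when evaluation really happens
    return x * 2

mapped = map(slow_double, [1, 2, 3])  # returns instantly, nothing computed yet
print(calls)                          # [] -- no work done so far

result = list(mapped)                 # forcing the result triggers evaluation
print(result)                         # [2, 4, 6]
print(calls)                          # [1, 2, 3] -- work happened only now
```

Spark's `map` behaves the same way conceptually, except the pending work is recorded in the RDD lineage rather than in a Python iterator.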

## Example of Instantaneous Evaluation

When we type `x = 2 + 2`, the value of `x` is immediately set to `4`. Python does not wait for `print` or any other method to be called before completing the evaluation.
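
A side effect can prove that ordinary Python assignment is eager; the right-hand side runs at the moment of assignment, not when the value is later used:

```python
log = []

def add(a, b):
    log.append("evaluated")  # records the moment the work is done
    return a + b

x = add(2, 2)   # evaluated right here, eagerly
print(log)      # ['evaluated'] -- work done before we ever use x
print(x)        # 4
```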

## Example of Lazy Evaluation

Imagine a waiter in a restaurant taking your order one item at a time: going inside to convey the item to the chef, bringing it back to you, and then repeating the same process for the next item. This is not a favorable scenario in real life. Instead, it is far better if the waiter notes down all the items, goes to the kitchen once to convey the complete order to the chef, and finally brings back all the items together. This second method is an example of lazy evaluation: it is better suited for optimization because redundant steps can be removed.

Similarly, Spark clubs multiple transformations together and optimizes them. How does Spark achieve this? Through the lineage graph. To support lazy evaluation, an RDD keeps track of which other RDD it needs to be computed from, and which transformation it needs to apply. This information is stored in the form of a graph called the lineage graph. Every time we define a transformation, Spark keeps extending this graph.
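
To make the idea concrete, here is a minimal, hypothetical sketch in plain Python (the class name `TinyRDD` is invented for illustration; this is not Spark's actual implementation). Each node stores only its parent and the transformation to apply, and nothing is computed until an action is called:

```python
class TinyRDD:
    """Toy stand-in for an RDD: stores *how* to compute its data,
    not the data itself."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data = data      # only set for the source RDD
        self.parent = parent  # which RDD this one is derived from
        self.fn = fn          # which transformation to apply

    def map(self, fn):
        # Transformation: just extend the lineage graph; do no work.
        return TinyRDD(parent=self, fn=fn)

    def collect(self):
        # Action: walk the lineage back to the source, then evaluate.
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

    def lineage(self):
        # Debug view, loosely analogous to Spark's toDebugString().
        node, chain = self, []
        while node is not None:
            chain.append("TinyRDD(source)" if node.parent is None
                         else "TinyRDD(map)")
            node = node.parent
        return " <- ".join(chain)

source = TinyRDD(data=[1, 2, 3])
derived = source.map(lambda x: x + 10).map(lambda x: x * 2)
print(derived.lineage())   # TinyRDD(map) <- TinyRDD(map) <- TinyRDD(source)
print(derived.collect())   # [22, 24, 26]
```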

Along with Spark, other frameworks such as TensorFlow and Pig also use lazy evaluation.

We will see an example of Spark Lazy Evaluation and Lineage Graph below.

## Instructions
• First, let's create a large list of numbers and save it in a variable called `num`

```python
<<your code goes here>> = [i for i in range(1, 50000)]
```
• Now we will create an RDD from this list with 5 partitions and save it in `rdd`

```python
rdd = sc.parallelize(num, <<your code goes here>>)
```
• Let's print this RDD object

```python
print(<<your code goes here>>)
```
• Next, let's add a large number to each element of this RDD using `map` and save it in `rdd_new`

```python
rdd_new = rdd.<<your code goes here>>(lambda x: x + 60000)
```
• Now let's print this new RDD

```python
print(<<your code goes here>>)
```
• Also, let's view the RDD lineage using the `toDebugString` method

```python
print(rdd_new.toDebugString())
```

We can see that `PythonRDD[1]` is connected with `ParallelCollectionRDD[0]`.

• Now let's add another large number to `rdd_new` and save it in `rdd_final`

```python
<<your code goes here>> = rdd_new.map(lambda x: x + 20000)
```
• Let's print `rdd_final`

```python
print(rdd_final)
```
• Finally, let's get the RDD Lineage for `rdd_final`

```python
print(rdd_final.toDebugString())
```

Notice how Spark automatically skipped the redundant step: instead of adding `60000` and then `20000` in two separate passes as we defined it, it will add `80000` in a single step. Spark automatically determines the best path to perform any action, and performs the transformations only when required.
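
The equivalence that makes this pipelining safe can be checked in plain Python (a sketch of the idea, not Spark's actual optimizer): applying the two lambdas one after the other over a single pass gives the same result as one combined step.

```python
add_60000 = lambda x: x + 60000   # the map() that produced rdd_new
add_20000 = lambda x: x + 20000   # the map() that produced rdd_final
combined = lambda x: x + 80000    # what a single pipelined step computes

sample = range(1, 50000)
chained = [add_20000(add_60000(x)) for x in sample]  # two steps, one pass
single = [combined(x) for x in sample]               # one combined step
print(chained == single)  # True -- a single pass over the data suffices
```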

No hints are available for this assessment.

Note - Having trouble with the assessment engine? Follow the steps listed here