
# Apache Spark with Python - Lazy Evaluation & Lineage Graph

## What is Lazy Evaluation?

In the previous examples, did you notice that `map` returned another RDD almost instantly, while `take` took some time and actually did the work? The reason is that a transformation such as `map` is not executed immediately; instead, Spark makes a note of it in the form of an RDD graph, also called the RDD lineage. Only when an action such as `take` or `collect` is called are the transformations actually evaluated. In other words, evaluation is not instantaneous but lazy.
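
Python's built-in `map` happens to be lazy as well, which makes a handy stand-alone analogy (plain Python, not Spark): creating the map object records the work, but nothing runs until the result is forced.

```python
# Track every actual invocation of the mapped function.
calls = []

def slow_double(x):
    calls.append(x)  # side effect proves when evaluation really happens
    return x * 2

mapped = map(slow_double, [1, 2, 3])  # returns instantly, nothing computed yet
print(calls)                          # [] -- no work done so far

result = list(mapped)                 # forcing the result triggers evaluation
print(result)                         # [2, 4, 6]
print(calls)                          # [1, 2, 3] -- work happened only now
```

Spark's `map` behaves the same way conceptually, except the pending work is recorded in the RDD lineage rather than in a Python iterator.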

## Example of Instantaneous Evaluation

When we type `x = 2 + 2`, the value of `x` is immediately set to `4`. Python does not wait for `print` or any other method to be called before completing the evaluation.
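
A side effect can prove that ordinary Python assignment is eager; the right-hand side runs at the moment of assignment, not when the value is later used:

```python
log = []

def add(a, b):
    log.append("evaluated")  # records the moment the work is done
    return a + b

x = add(2, 2)   # evaluated right here, eagerly
print(log)      # ['evaluated'] -- work done before we ever use x
print(x)        # 4
```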

## Example of Lazy Evaluation

Imagine a waiter in a restaurant taking your order one item at a time: going inside to convey the item to the chef, bringing it back to you, and then repeating the same process for the next item. This is not a favorable scenario in real life. Instead, it is far better if the waiter notes down all the items, goes to the kitchen once to convey the complete order to the chef, and finally brings back all the items together. This second method is an example of lazy evaluation: it is better suited for optimization because redundant steps can be removed.

Similarly, Spark clubs multiple transformations together and optimizes them. How does Spark achieve this? Through the lineage graph. To support lazy evaluation, an RDD keeps track of which other RDD it needs to be computed from, and which transformation it needs to apply. This information is stored in the form of a graph called the lineage graph. Every time we define a transformation, Spark keeps extending this graph.
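
To make the idea concrete, here is a minimal, hypothetical sketch in plain Python (the class name `TinyRDD` is invented for illustration; this is not Spark's actual implementation). Each node stores only its parent and the transformation to apply, and nothing is computed until an action is called:

```python
class TinyRDD:
    """Toy stand-in for an RDD: stores *how* to compute its data,
    not the data itself."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data = data      # only set for the source RDD
        self.parent = parent  # which RDD this one is derived from
        self.fn = fn          # which transformation to apply

    def map(self, fn):
        # Transformation: just extend the lineage graph; do no work.
        return TinyRDD(parent=self, fn=fn)

    def collect(self):
        # Action: walk the lineage back to the source, then evaluate.
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

    def lineage(self):
        # Debug view, loosely analogous to Spark's toDebugString().
        node, chain = self, []
        while node is not None:
            chain.append("TinyRDD(source)" if node.parent is None
                         else "TinyRDD(map)")
            node = node.parent
        return " <- ".join(chain)

source = TinyRDD(data=[1, 2, 3])
derived = source.map(lambda x: x + 10).map(lambda x: x * 2)
print(derived.lineage())   # TinyRDD(map) <- TinyRDD(map) <- TinyRDD(source)
print(derived.collect())   # [22, 24, 26]
```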

Along with Spark, other frameworks such as TensorFlow and Pig also use lazy evaluation.

We will see an example of Spark Lazy Evaluation and Lineage Graph below.

## Instructions
• First, let's create a large list of numbers and save it in a variable called `num`

```python
<<your code goes here>> = [i for i in range(1, 50000)]
```
• Now we will create an RDD from this list with 5 partitions and save it in `rdd`

```python
rdd = sc.parallelize(num, <<your code goes here>>)
```
• Let's print this RDD object

```python
print(<<your code goes here>>)
```
• Next, let's add a large number to each element of this RDD using `map` and save it in `rdd_new`

```python
rdd_new = rdd.<<your code goes here>>(lambda x: x + 60000)
```
• Now let's print this new RDD

```python
print(<<your code goes here>>)
```
• Also, let's view the RDD lineage using the `toDebugString` method

```python
print(rdd_new.toDebugString())
```

We can see that `PythonRDD[1]` is connected with `ParallelCollectionRDD[0]`.

• Now let's add another large number to `rdd_new` and save it in `rdd_final`

```python
<<your code goes here>> = rdd_new.map(lambda x: x + 20000)
```
• Let's print `rdd_final`

```python
print(rdd_final)
```
• Finally, let's get the RDD Lineage for `rdd_final`

```python
print(rdd_final.toDebugString())
```

Notice how Spark automatically skipped the redundant step: instead of adding `60000` and then `20000` in two separate passes as we defined it, it will add `80000` in a single step. Spark automatically determines the best path to perform any action, and performs the transformations only when required.
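
The equivalence that makes this pipelining safe can be checked in plain Python (a sketch of the idea, not Spark's actual optimizer): applying the two lambdas one after the other over a single pass gives the same result as one combined step.

```python
add_60000 = lambda x: x + 60000   # the map() that produced rdd_new
add_20000 = lambda x: x + 20000   # the map() that produced rdd_final
combined = lambda x: x + 80000    # what a single pipelined step computes

sample = range(1, 50000)
chained = [add_20000(add_60000(x)) for x in sample]  # two steps, one pass
single = [combined(x) for x in sample]               # one combined step
print(chained == single)  # True -- a single pass over the data suffices
```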

No hints are available for this assessment.

Note - Having trouble with the assessment engine? Follow the steps listed here