What is Lazy Evaluation?
In the previous examples, did you notice that map returned another RDD quickly, while take took some time and actually did something? The reason is that a transformation such as map does not get executed instantaneously; instead, Spark makes a note of it in the form of an RDD graph, or RDD lineage. Only when an action such as take or collect is called are the transformations evaluated. This evaluation is not instantaneous but lazy.
Example of Instantaneous Evaluation
When we type x = 2 + 2, the value of x is immediately set to 4. It does not wait for print or any other method to be called for the evaluation to complete.
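In plain Python this is easy to see:

```python
# Ordinary Python arithmetic is evaluated eagerly: the right-hand
# side is computed the moment the statement runs.
x = 2 + 2
# x already holds 4 here; no print or other call was needed.
print(x)  # 4
```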
Example of Lazy Evaluation
Imagine a waiter in a restaurant taking your orders one item at a time, going inside to convey the order to the chef, bringing you back the item, then repeating the same process for the next item. This is not a favorable scenario in real life. Instead, it would be better if the waiter noted down all the items, went to the kitchen to convey the complete order to the chef, and then finally brought back all the items together. This last method is an example of Lazy Evaluation. It is better suited for optimization, since redundant steps can be removed.
Similarly, Spark clubs and optimizes multiple transformations together. How does Spark achieve this? Through Lineage Graph. To support lazy evaluation, an RDD keeps information from which other RDD it needs to be computed, and which transformation it needs to apply. This information is stored in the form of a graph called the Lineage Graph. Every time we define a transformation, it keeps building this graph.
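As a minimal sketch of this idea in plain Python (not Spark's actual mechanism), generator expressions can mimic how a lineage of transformations is recorded but not executed until a result is demanded:

```python
nums = range(1, 10)

# Each "transformation" wraps the previous one, like an RDD lineage;
# nothing is computed while these two lines run.
step1 = (x + 60000 for x in nums)
step2 = (x + 20000 for x in step1)

# The "action": materializing the result finally walks the whole
# chain, applying both additions in one pass per element.
result = list(step2)
print(result[:3])  # [80001, 80002, 80003]
```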
Along with Spark, other frameworks such as TensorFlow and Pig also use lazy evaluation.
We will see an example of Spark Lazy Evaluation and Lineage Graph below.
First, let's create a large list of numbers and save it in a variable called num
<<your code goes here>> = [i for i in range(1,50000)]
Now we will create an RDD with this list with 5 partitions and save it in rdd
rdd = sc.parallelize(num,<<your code goes here>>)
Let's print this RDD object
print(<<your code goes here>>)
Next, let's add a large number to each element of this RDD using map and save it in rdd_new
rdd_new = rdd.<<your code goes here>>(lambda x : x+60000)
Now let's print this new RDD
print(<<your code goes here>>)
Also, let's view the RDD Lineage using the toDebugString method
print(rdd_new.toDebugString())
We can see that PythonRDD[1] is connected with ParallelCollectionRDD[0]
Now let's add another large number to rdd_new and save it in rdd_final
<<your code goes here>> = rdd_new.map(lambda x : x+20000)
Let's print rdd_final
print(rdd_final)
Finally, let's get the RDD Lineage for rdd_final
print(rdd_final.toDebugString())
Notice how Spark automatically combined the two map transformations and will add 80000 in a single step, instead of the two separate steps we defined. Spark automatically determines the best path to perform any action, and performs the transformations only when they are required.