Let us understand how we can improve the performance of our Spark applications using persistence or caching.
Recollect that RDDs are lazily evaluated.
Whenever an action is executed on an RDD, all of the RDDs on which it depends upon are also evaluated. Which means if an action on an RDD in this graph is being called twice, the same computation could happen twice.
Also, We may wish to use the same RDD multiple times.
So, to avoid re-computation of an RDD, we can cache or persist it using the function provided on RDD.
If a node that has data persisted on it fails, Spark will recompute the lost partitions of the data when needed.
We can also replicate our data on multiple nodes if we want to be able to handle node failure without slowdown.
Let’s take a look the following code in order to understand the caching.
Create an RDD of number from one to hundred thousand with fifty partitions.
Now, say we are creating an another RDD partitions by summing up the values of each of the partition of nums RDD. We have first created a function mysum which sums up values of an iterator and returns the sum as a single value iterator. This function mysum is being called using mapPartitions on nums.
Now, say we have created another RDD partition1 from partitions RDD by increasing each value by one using a map.
Everytime, we are going to call an action on paritions1 RDD partitions RDD is also going to be called. So, if partitions1 is going to be frequently used, it will be inefficient.
In such circumstances, we can cache partitions1 RDD.
We can persist an RDD using a method persist. This method needs an instance of StorageLevel as argument. The storage level specifies how should the RDD be persisted - in memory or on disk for example. If we do not provide any argument, it saves in memory the memory only.
Now, let’s try to persist in memory and disk both but before re-persisting we need to unpersist. For that we can use unpersist method.
Now, let us import a class called StorageLevel. Then call method persist() with StorageLevel MEMORY AND DISK. If the RDD doesn’t fit into RAM it would spill over to the disk.
Let us check using getStorageLevel. And you can see that it is same as StorageLevel.MEMORY_AND_DISK