There are various ways to persist an RDD: you can persist it in memory or on disk. Persisting in memory may sound strange, but it essentially means caching the RDD so that we do not have to recompute the entire RDD graph on every action.
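To make the caching idea concrete, here is a minimal sketch in plain Python. It does not use Spark at all: `LazyDataset` is a hypothetical toy stand-in for an RDD, and its `compute_count` attribute exists only to show how many times the computation ran.

```python
class LazyDataset:
    """Hypothetical toy stand-in for an RDD (not a Spark class)."""

    def __init__(self, compute):
        self._compute = compute      # function that rebuilds the data from scratch
        self._persist = False
        self._cached = None
        self.compute_count = 0       # how many times the data was recomputed

    def persist(self):
        # Mark the dataset so the first action keeps its result around.
        self._persist = True
        return self

    def collect(self):
        # An "action": returns the data, recomputing unless a cached copy exists.
        if self._persist and self._cached is not None:
            return self._cached
        self.compute_count += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result


def squares():
    return [x * x for x in range(5)]


plain = LazyDataset(squares)
plain.collect()
plain.collect()
print(plain.compute_count)   # 2: recomputed on every action

cached = LazyDataset(squares).persist()
cached.collect()
cached.collect()
print(cached.compute_count)  # 1: computed once, then served from the cache
```

The second dataset runs its computation only once because the first action stored the result, which is the effect persisting an RDD in memory is meant to achieve.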
To make it highly customizable, persist accepts an instance of the StorageLevel class as its argument. The StorageLevel class is a bundle of settings, which we are going to talk about.
You can create your own instance of StorageLevel using the constructor as shown.
The first argument is a boolean variable that specifies whether to use disk or not.
The second argument specifies whether we want to use memory. In the example we want to use both disk and memory; therefore, both are true.
The third argument is whether to use off-heap memory. Objects stored on the JVM heap are fast to retrieve but are subject to collection by the Java garbage collector, whereas off-heap memory sits outside the heap and is not managed by the garbage collector.
The fourth argument specifies whether to persist the data deserialized. If it is set to false, Spark stores the RDD as serialized Java objects, one byte array per partition. This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
The last argument specifies how many copies we want. For critical RDDs we can keep multiple replicas of the RDD partitions in order to avoid recomputation when machines fail. The default number of replicas is 1, and the value must be less than 40.
Finally, when calling persist, we can pass the StorageLevel object as the argument.
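The five arguments walked through above can be sketched as follows. This is a plain-Python model, not Spark itself: `StorageLevelSketch` and its `describe` method are hypothetical names used only for illustration, and the values chosen for the last three arguments are assumptions, since the narration only states that disk and memory are both true.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLevelSketch:
    """Hypothetical model of the five settings, in the order described above."""
    use_disk: bool          # 1st argument: keep partitions on disk?
    use_memory: bool        # 2nd argument: keep partitions in memory?
    use_off_heap: bool      # 3rd argument: use off-heap memory?
    deserialized: bool      # 4th argument: store as deserialized objects?
    replication: int = 1    # 5th argument: number of copies, default 1

    def __post_init__(self):
        if self.replication < 1:
            raise ValueError("replication must be at least 1")

    def describe(self) -> str:
        places = []
        if self.use_memory:
            places.append("memory")
        if self.use_off_heap:
            places.append("off-heap")
        if self.use_disk:
            places.append("disk")
        form = "deserialized" if self.deserialized else "serialized"
        where = " + ".join(places) or "nowhere"
        return f"{where}, {form}, {self.replication} replica(s)"


# Disk and memory both enabled, as in the narrated example.
# The remaining three values are assumptions made for this sketch.
level = StorageLevelSketch(True, True, False, False, 1)
print(level.describe())  # memory + disk, serialized, 1 replica(s)

# With PySpark installed, the corresponding call would look roughly like:
#   from pyspark import StorageLevel
#   rdd.persist(StorageLevel(True, True, False, False, 1))
```

Using positional arguments keeps the sketch close to how the constructor is described, at the cost of readability; in real code a named, predefined level is usually easier to read.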
There are predefined shortcuts for storage levels covering various use cases, which you can use as your requirements dictate without creating StorageLevel objects yourself.
These shortcuts, such as the one shown in the slide, are nothing but static StorageLevel objects. Let's understand the various predefined storage levels and when to use each.
MEMORY_ONLY persists the RDD in memory so that it is not recomputed on every action in an application.
MEMORY_ONLY stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that don't fit are stored on disk and read from there when they're needed.
MEMORY_ONLY_SER stores the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER is similar to MEMORY_ONLY_SER, but spills partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY stores the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2 and the other levels ending in _2 are the same as the levels above, but replicate each partition on two cluster nodes.
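As a rough summary, the predefined levels can be read as fixed combinations of the five constructor settings. The table below is a plain-Python sketch: the level names are the ones used by Spark's Scala/Java API, the flag values reflect my understanding of that API, and PySpark's own constants may differ in the deserialized flag. `PREDEFINED_LEVELS` and `levels_using` are hypothetical names introduced for this illustration.

```python
# Each tuple: (use_disk, use_memory, use_off_heap, deserialized, replication)
PREDEFINED_LEVELS = {
    "MEMORY_ONLY":         (False, True,  False, True,  1),
    "MEMORY_AND_DISK":     (True,  True,  False, True,  1),
    "MEMORY_ONLY_SER":     (False, True,  False, False, 1),
    "MEMORY_AND_DISK_SER": (True,  True,  False, False, 1),
    "DISK_ONLY":           (True,  False, False, False, 1),
    "MEMORY_ONLY_2":       (False, True,  False, True,  2),
    "MEMORY_AND_DISK_2":   (True,  True,  False, True,  2),
}


def levels_using(disk, memory):
    """Return the names of levels with exactly this disk/memory combination."""
    return [
        name
        for name, (use_disk, use_memory, _, _, _) in PREDEFINED_LEVELS.items()
        if use_disk == disk and use_memory == memory
    ]


# Levels that keep data in memory and spill the rest to disk.
print(levels_using(disk=True, memory=True))

# Levels that never touch disk, so missing partitions are recomputed.
print(levels_using(disk=False, memory=True))
```

Reading the table row by row shows that the _SER variants differ from their counterparts only in the deserialized flag, and the _2 variants only in the replication count.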
Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
If your RDDs fit comfortably with the default storage level which is MEMORY_ONLY, leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. This option applies to Java and Scala.
Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
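The selection process above can be sketched as a small decision function. This is a simplification under stated assumptions: `recommend_storage_level` and its four boolean parameters are hypothetical, and choosing the serialized variant when spilling to disk is my own assumption rather than something the guidance prescribes.

```python
def recommend_storage_level(fits_deserialized, fits_serialized,
                            recompute_is_expensive, needs_fast_recovery):
    """Hypothetical helper that follows the selection steps in order."""
    if fits_deserialized:
        # Step 1: the default level is the most CPU-efficient choice.
        level = "MEMORY_ONLY"
    elif fits_serialized or not recompute_is_expensive:
        # Step 2: trade CPU for space. If recomputation is cheap, partitions
        # that still do not fit are simply recomputed instead of spilled.
        level = "MEMORY_ONLY_SER"
    else:
        # Step 3: spill to disk only when recomputing is expensive.
        # Picking the serialized variant here is an assumption of this sketch.
        level = "MEMORY_AND_DISK_SER"

    if needs_fast_recovery:
        # Step 4: replicated levels avoid waiting to recompute a lost partition.
        level += "_2"
    return level


print(recommend_storage_level(True, True, False, False))    # MEMORY_ONLY
print(recommend_storage_level(False, True, False, False))   # MEMORY_ONLY_SER
print(recommend_storage_level(False, False, True, False))   # MEMORY_AND_DISK_SER
print(recommend_storage_level(False, False, True, True))    # MEMORY_AND_DISK_SER_2
```

In practice the inputs are judgment calls based on measurements of memory use and recomputation time, so a function like this is better read as a checklist than as something to run in production.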