The way data is laid out across the partitions of an RDD can greatly reduce the amount of data transferred between nodes.
Data partitioning may not help much if you scan the dataset only once, because the partitioning remains in effect only while the application is running.
When a dataset is reused multiple times in key-based operations such as joins, data partitioning is very helpful for optimization.
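The trade-off between a single scan and repeated reuse can be sketched with a rough cost model. This is plain Python, not Spark code. The function name `records_moved`, the dataset sizes, and the model itself are assumptions made for illustration: without pre-partitioning, every join is assumed to move both datasets, while with pre-partitioning the large dataset is moved once (and assumed to stay cached afterwards) and only the small dataset moves on each join.

```python
def records_moved(big_size, small_size, num_joins, prepartitioned):
    """Estimate records sent over the network under a simplified model.

    Assumption (illustrative only): an unpartitioned join moves every
    record of both datasets, while a pre-partitioned large dataset is
    moved once up front and stays in place for later joins.
    """
    if prepartitioned:
        # One-time cost to partition the large dataset, then only the
        # small dataset is moved for each join.
        return big_size + num_joins * small_size
    # Both datasets are moved again on every join.
    return num_joins * (big_size + small_size)


# Hypothetical sizes: a large dataset joined with a small one.
big, small = 1_000_000, 1_000

single_use_plain = records_moved(big, small, 1, prepartitioned=False)
single_use_partitioned = records_moved(big, small, 1, prepartitioned=True)

reused_plain = records_moved(big, small, 10, prepartitioned=False)
reused_partitioned = records_moved(big, small, 10, prepartitioned=True)

print(single_use_plain, single_use_partitioned)
print(reused_plain, reused_partitioned)
```

Under this model, a single join costs the same either way, while ten joins move roughly ten times fewer records when the large dataset is partitioned once and reused.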
Data partitioning techniques are mostly valid for key-value RDDs.
Essentially, partitioning causes the system to group elements based on a particular key.
Spark does not give the user explicit control over which key would go to which worker node.
Instead, it lets the program ensure that a given set of keys will appear together on the same node. The grouping may be based on a hash of the key, for example, or on sorted ranges of keys.
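The two grouping strategies can be sketched in plain Python. This is a minimal simulation of the idea, not Spark's own implementation. The helper names (`hash_partition`, `range_partition`, `partition_pairs`), the sample pairs, and the range bounds are all hypothetical. Integer keys are used so that Python's `hash` is deterministic.

```python
from bisect import bisect_left


def hash_partition(key, num_partitions):
    """Pick a partition from the key's hash, modulo the partition count."""
    return hash(key) % num_partitions


def range_partition(key, upper_bounds):
    """Pick a partition from sorted upper bounds.

    Keys <= upper_bounds[0] go to partition 0, keys <= upper_bounds[1]
    go to partition 1, and anything larger goes to the last partition.
    """
    return bisect_left(upper_bounds, key)


def partition_pairs(pairs, partition_of, num_partitions):
    """Distribute (key, value) pairs into lists, one list per partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key)].append((key, value))
    return partitions


# Hypothetical key-value data with repeated keys.
pairs = [(1, "a"), (6, "b"), (4, "c"), (1, "d"), (11, "e"), (6, "f")]

by_hash = partition_pairs(pairs, lambda k: hash_partition(k, 3), 3)
by_range = partition_pairs(pairs, lambda k: range_partition(k, [3, 7]), 3)

for index, (h, r) in enumerate(zip(by_hash, by_range)):
    print(f"partition {index}: hash={h} range={r}")
```

In both cases, all records with the same key land in the same partition. The program decides which keys are grouped together, but not which worker node ends up holding each partition.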