Adv Spark Programming - Data Partitioning

Data Partitioning

Controlling how data is laid out across the partitions of an RDD can greatly reduce the amount of data transferred between nodes.

Data partitioning may not help much if you scan the dataset only once, because the partitioning remains in effect only while the application is running.

When a dataset is reused multiple times in key-based operations such as joins, data partitioning is very helpful for optimization.
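To make the benefit of reuse concrete, here is a minimal sketch in plain Python (not the Spark API) that counts how many records would have to move between partitions when a large dataset is joined repeatedly with small batches. The names `partition_of`, `records_to_move`, and `repartition`, the dataset sizes, and the cost model are all assumptions made for illustration. In Spark itself, the usual pattern is to call `partitionBy` on the large pair RDD and persist the result before the repeated joins.

```python
NUM_PARTITIONS = 4
NUM_JOINS = 5


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Map an integer key to a partition (a stand-in for a hash partitioner)."""
    return key % num_partitions


def records_to_move(placed_records):
    """Count records that are not yet in the partition their key maps to."""
    return sum(
        1
        for current, (key, _value) in placed_records
        if current != partition_of(key)
    )


def repartition(placed_records):
    """Return the same records, each placed in the partition its key maps to."""
    return [(partition_of(key), (key, value)) for _, (key, value) in placed_records]


# Hypothetical large dataset: (current_partition, (key, value)),
# placed in arrival order rather than by key.
large = [(i // 250, (i % 100, "profile")) for i in range(1000)]

# Hypothetical small batch that arrives before each join, all in partition 0.
small = [(0, (key, "event")) for key in range(20)]

move_large = records_to_move(large)
move_small = records_to_move(small)

# Without partitioning up front, every join has to move the large dataset again.
cost_without = NUM_JOINS * (move_large + move_small)

# Partition the large dataset once, then reuse that layout for every join.
large_partitioned = repartition(large)
cost_with = move_large + NUM_JOINS * (records_to_move(large_partitioned) + move_small)

print("records moved without pre-partitioning:", cost_without)
print("records moved with pre-partitioning:   ", cost_with)
```

With a single join the two costs are the same in this model, which matches the point that partitioning pays off only when the dataset is reused.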

Data partitioning techniques are mostly valid for key-value RDDs.

Essentially, partitioning causes the system to group elements based on their key.

Spark does not give the user explicit control over which key would go to which worker node.

Instead, it lets the program control which sets of keys appear together on the same node. The grouping may be based, for example, on a hash of the key or on sorted key ranges.
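The two grouping strategies can be sketched in plain Python as follows. This is an illustration of the idea only, not Spark's implementation: the helper names `hash_partition`, `range_partition`, and `partition_pairs`, the sample data, and the range boundaries are assumptions. Spark provides `HashPartitioner` and `RangePartitioner` classes for this purpose.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Pick a partition from a deterministic hash of the key."""
    # crc32 is used because Python's built-in hash() of a str varies between runs.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Pick a partition by locating the key among sorted upper bounds."""
    # Keys <= boundaries[0] go to partition 0, keys <= boundaries[1] to 1, and so on.
    return bisect.bisect_left(boundaries, key)


def partition_pairs(pairs, partition_for_key, num_partitions):
    """Distribute (key, value) pairs into one list per partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_for_key(key)].append((key, value))
    return partitions


pairs = [("apple", 1), ("kiwi", 2), ("apple", 3),
         ("zebra", 4), ("mango", 5), ("kiwi", 6)]

# Hash partitioning: equal keys always share a partition, in no particular order.
by_hash = partition_pairs(pairs, lambda k: hash_partition(k, 3), 3)

# Range partitioning: partition 0 holds keys up to "h", partition 1 up to "p",
# and partition 2 holds everything after that.
by_range = partition_pairs(pairs, lambda k: range_partition(k, ["h", "p"]), 3)

for index, part in enumerate(by_range):
    print("range partition", index, part)
```

In both cases the program decides only which keys end up together. Which worker node holds a given partition is left to the system.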
