Controlling how data is laid out across the partitions of an RDD can greatly reduce data transfer.
Partitioning is of little help if the dataset is scanned only once: the partitioning lasts only for the duration of the application, so the cost of setting it up is not recovered by a single pass.
When a dataset is reused multiple times in key-based operations such as joins, partitioning is a very effective optimization.
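The trade-off above can be made concrete with a toy cost model. This is a minimal sketch in plain Python, not Spark code: the function name join_costs and the dataset sizes are hypothetical, and the model simply assumes that every record of a dataset without a known partitioning crosses the network on each join, while a dataset that was partitioned once up front and kept in memory does not move again.

```python
# Toy cost model (an assumption of this sketch, not a Spark API): count the
# records that cross the network when one large keyed dataset is joined
# with a series of small batches.

def join_costs(big, batches):
    """Return (records moved without pre-partitioning, with pre-partitioning)."""
    # No known partitioning: both sides are shuffled on every join.
    unpartitioned = sum(len(big) + len(batch) for batch in batches)
    # Pre-partitioned: the big dataset is shuffled once, then stays in place;
    # only each small batch is shuffled to match its layout.
    partitioned = len(big) + sum(len(batch) for batch in batches)
    return unpartitioned, partitioned

# Hypothetical data: 1000 user records, small batches of 10 events each.
user_data = [(user_id, f"profile-{user_id}") for user_id in range(1000)]
event_batches = [
    [(user_id, "click") for user_id in range(start, 1000, 100)]
    for start in range(5)
]

single_join = join_costs(user_data, event_batches[:1])
five_joins = join_costs(user_data, event_batches)
print(single_join)  # (1010, 1010)
print(five_joins)   # (5050, 1050)
```

With a single join the two costs are identical, which matches the point that partitioning does not pay off for a one-time scan. With five joins, the pre-partitioned layout moves roughly a fifth of the records. In Spark itself, the corresponding step would be to call partitionBy on the pair RDD and persist the result before reusing it.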
Data partitioning techniques apply mostly to key-value RDDs.
Essentially, partitioning causes the system to group elements based on a particular key.
Spark does not give the user explicit control over which key goes to which worker node.
Instead, it lets the program ensure that a given set of keys will appear together on the same node. The grouping may be based, for example, on a hash of the key or on sorted ranges of keys.
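The two grouping strategies mentioned above can be illustrated with a small plain-Python sketch. Spark provides its own HashPartitioner and RangePartitioner classes; the code below is not their implementation, only a simplified model of the idea. The helper names (hash_partition, range_partition, layout), the integer keys, and the range boundaries are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

def hash_partition(key, num_partitions):
    """Hash-style placement: equal keys always map to the same partition.
    Integer keys are assumed so that hash() is stable between runs."""
    return hash(key) % num_partitions

def range_partition(key, upper_bounds):
    """Range-style placement: keys up to and including upper_bounds[i] go to
    partition i; keys above the last bound go to one extra final partition."""
    return bisect_left(upper_bounds, key)

def layout(pairs, partitioner):
    """Group key-value pairs by the partition index the partitioner assigns."""
    parts = defaultdict(list)
    for key, value in pairs:
        parts[partitioner(key)].append((key, value))
    return dict(parts)

# Hypothetical key-value records.
pairs = [(7, "a"), (150, "b"), (7, "c"), (203, "d"), (150, "e"), (42, "f")]

by_hash = layout(pairs, lambda k: hash_partition(k, 3))
by_range = layout(pairs, lambda k: range_partition(k, [100, 200]))

print(by_hash)
print(by_range)
```

In both layouts, all records that share a key land in the same partition, which is the guarantee the program relies on. The difference is in which keys become neighbours: hashing spreads keys without regard to order, while the range scheme keeps nearby keys (here 7 and 42) together.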