The layout of data across an RDD's partitions can reduce data transfer across the network immensely.
Data partitioning is not very helpful if you scan the dataset only once, because the partitioning remains in effect only for the duration of the application.
However, when a dataset is reused multiple times in key-oriented operations such as joins, partitioning the data is very helpful for optimization, as in the sketch below.
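Below is a minimal sketch of this idea in Scala. It assumes a `SparkContext` named `sc` (as in the Spark shell); the RDD names and HDFS paths (`userData`, `events`, `hdfs:///data/...`) are made up purely for illustration and are not part of this lesson's exercises.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// userData is a large (userId, profile) pair RDD that is joined repeatedly.
// Partitioning it once by key and persisting it lets later joins reuse the
// same layout instead of reshuffling userData for every join.
// (The file paths below are hypothetical.)
val userData = sc.textFile("hdfs:///data/users.txt")
  .map { line =>
    val fields = line.split(",")
    (fields(0), fields(1))              // (userId, profile)
  }
  .partitionBy(new HashPartitioner(4))  // co-locate equal keys in the same partition
  .persist(StorageLevel.MEMORY_ONLY)    // keep the partitioned layout across jobs

// Each batch of events is joined against the already-partitioned userData,
// so only the (typically smaller) events RDD needs to be shuffled.
val events = sc.textFile("hdfs:///data/events.txt")
  .map { line =>
    val fields = line.split(",")
    (fields(0), fields(1))              // (userId, event)
  }

val joined = userData.join(events)
```

Without the `partitionBy` and `persist`, every `join` would hash and shuffle both RDDs from scratch; with them, the expensive dataset is shuffled only once.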
Data partitioning techniques apply mostly to key-value (pair) RDDs.
Essentially, partitioning causes the system to group elements based on their key.
Spark does not give the user explicit control over which key goes to which worker node.
Instead, it lets the program control, or at least ensure, which set of keys will appear together on some node. That grouping may be based on a hash of the key, for example, or on a sorted range of keys, as illustrated below.
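The following short Scala sketch contrasts the two built-in partitioners. Again, `sc` is an assumed `SparkContext`, and the sample data and partition counts are arbitrary examples, not values taken from this lesson.

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

// A small key-value RDD; the contents are just for illustration.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))

// Hash partitioning: a key is assigned to a partition by hashing it,
// so all pairs with the same key land in the same partition.
val hashed = pairs.partitionBy(new HashPartitioner(4))
println(hashed.partitioner)       // Some(org.apache.spark.HashPartitioner@...)

// Range partitioning: keys are split into sorted, roughly equal-sized ranges
// (this is what sortByKey uses under the hood).
val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))
println(ranged.partitioner)
println(ranged.getNumPartitions)  // may be fewer than 4 for so few distinct keys
```

Either way, equal keys end up in the same partition; the difference is only in how keys are mapped to partitions (hash buckets versus sorted ranges).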