Adv Spark Programming

Adv Spark Programming - Data Partitioning

Data Partitioning

Controlling how data is laid out across the partitions of an RDD can greatly reduce the amount of data transferred between nodes.

Data partitioning is unlikely to help if you scan the dataset only once, because the partitioning remains in effect only while the application is running and there is no later operation to benefit from it.

When a dataset is reused multiple times in key-based operations such as joins, data partitioning is a very helpful optimization.
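
The benefit for repeated joins can be illustrated with a small plain-Python model. This is only a sketch, not Spark code: the partition count, the `partition_of`, `partition_by_key` and `local_join` helpers, and the sample data are all hypothetical. The large dataset is placed into partitions once; each later batch is routed by the same rule, so every join only compares records that already sit in the same partition.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_of(key):
    """Deterministic stand-in for a hash partitioner (not Spark's own hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_PARTITIONS


def partition_by_key(pairs):
    """Place every (key, value) pair in the partition that owns its key."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in pairs:
        partitions[partition_of(key)].append((key, value))
    return partitions


def local_join(left_parts, right_parts):
    """Join partition i of the left side only with partition i of the right side."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        lookup = {}
        for key, value in left:
            lookup.setdefault(key, []).append(value)
        for key, value in right:
            for left_value in lookup.get(key, []):
                joined.append((key, (left_value, value)))
    return joined


# Large dataset: partitioned once, then reused for every join.
users = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
user_parts = partition_by_key(users)

# Small batches: only these records have to be routed each time.
batches = [[(1, "click"), (3, "view")], [(2, "click"), (1, "buy")]]
results = [sorted(local_join(user_parts, partition_by_key(b))) for b in batches]
print(results)
```

In Spark itself, the corresponding step is roughly to call `partitionBy` on the pair RDD and persist the result before joining against it repeatedly.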

Data partitioning techniques are mostly valid for key-value RDDs.

Essentially, it causes the system to group elements based on a particular key.

Spark does not give the user explicit control over which key would go to which worker node.

Instead, it lets the program ensure that a given set of keys will appear together on the same node. The grouping may be based on a hash of the key, for example, or on sorted ranges of keys.
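
The two grouping strategies can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's implementation: `hash_partition`, `range_partition`, the CRC32 hash, and the hand-picked range bounds are all hypothetical. Spark ships its own `HashPartitioner` and `RangePartitioner` classes, whose internals are not reproduced here.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Hash-based placement: equal keys always map to the same partition."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, upper_bounds):
    """Range-based placement.

    upper_bounds is a sorted list; partition i holds the keys that are
    <= upper_bounds[i], and one extra final partition holds the rest.
    """
    return bisect.bisect_left(upper_bounds, key)


keys = ["apple", "banana", "apple", "cherry", "mango", "zebra"]

# Hash partitioning into 3 partitions.
hash_layout = [hash_partition(k, 3) for k in keys]

# Range partitioning with hand-picked bounds (assumed for this example).
bounds = ["b", "m"]
range_layout = [range_partition(k, bounds) for k in keys]

for k, h, r in zip(keys, hash_layout, range_layout):
    print(f"{k:7s} hash partition={h} range partition={r}")
```

In both cases the program decides which keys end up together, but not which worker node hosts a given partition. Range-based placement also keeps neighbouring keys in the same or adjacent partitions, which is what makes it suitable for sorted data.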

Slides - Adv Spark Programming(1)



3 Comments

Why did "numsRdd" get created with 16 partitions on my PC rather than 4?


Hello. This video got cut off at the 1:40 mark (vs. what is shown in the slides). Could an update be made? Thanks.


Thank you for letting us know. The remaining videos were after the quiz questions. I have moved those just after this video. I hope that helps.

The slides are available on the last page; you can click on the index and jump to the last page.
