Apache Spark Basics with Python


Apache Spark with Python - What is an RDD?


3 Comments

Why is the RDD getting created under the user directory and not on any worker node?


When you create an RDD in Spark, it is not physically created on any worker nodes. Instead, the RDD metadata, such as its lineage and transformations, are stored in the driver node's memory. The data associated with the RDD is divided into partitions, and these partitions are distributed across the worker nodes for processing.

As for why RDDs are typically created under the user directory, this is because Spark runs as a user process, and the user directory is a convenient location to store application-specific data. When you create an RDD from a file, the file is typically located in the user directory or a subdirectory of it. Spark then reads the file into memory and partitions it for distribution across the worker nodes.
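To make the partitioning idea above concrete, here is a minimal sketch in plain Python (not Spark itself) of how a dataset can be sliced into roughly equal partitions before being shipped to worker nodes. The `partition` helper is our own illustrative function, not part of Spark's API; it mirrors the conceptual behavior of `sc.parallelize(data, numSlices)`.

```python
# Illustrative sketch only: how a driver might slice a local dataset
# into partitions for distribution across workers. The function name
# `partition` is ours, not Spark's.

def partition(data, num_partitions):
    """Split `data` into `num_partitions` roughly equal contiguous slices."""
    n = len(data)
    slices = []
    for i in range(num_partitions):
        start = (i * n) // num_partitions
        end = ((i + 1) * n) // num_partitions
        slices.append(data[start:end])
    return slices

parts = partition(list(range(10)), 3)
print(parts)  # -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

In real Spark, the driver keeps only the lineage and partition metadata; the slices themselves are materialized on the executors when an action runs.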


I think the file already got created; it shows up in the listing:

/user/ukuppuswamy4330/my_result

So the file got created the first time I executed it.

Thanks,

Umashankar
