Apache Spark Basics with Python


Apache Spark with Python - What is an RDD?


3 Comments

Why is the RDD getting created under the user directory and not on any worker node?


When you create an RDD in Spark, it is not physically created on any worker nodes. Instead, the RDD metadata, such as its lineage and transformations, are stored in the driver node's memory. The data associated with the RDD is divided into partitions, and these partitions are distributed across the worker nodes for processing.

As for why RDDs are typically created under the user directory, this is because Spark runs as a user process, and the user directory is a convenient location to store application-specific data. When you create an RDD from a file, the file is typically located in the user directory or a subdirectory of it. Spark then reads the file into memory and partitions it for distribution across the worker nodes.
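To make the partitioning idea above concrete, here is a minimal sketch in plain Python (not Spark itself) of how a dataset can be sliced into roughly equal partitions before being shipped to worker nodes. The `partition` helper is our own illustrative function, not part of Spark's API; it mirrors the conceptual behavior of `sc.parallelize(data, numSlices)`.

```python
# Illustrative sketch only: how a driver might slice a local dataset
# into partitions for distribution across workers. The function name
# `partition` is ours, not Spark's.

def partition(data, num_partitions):
    """Split `data` into `num_partitions` roughly equal contiguous slices."""
    n = len(data)
    slices = []
    for i in range(num_partitions):
        start = (i * n) // num_partitions
        end = ((i + 1) * n) // num_partitions
        slices.append(data[start:end])
    return slices

parts = partition(list(range(10)), 3)
print(parts)  # -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

In real Spark, the driver keeps only the lineage and partition metadata; the slices themselves are materialized on the executors when an action runs.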


I think the file already got created; it shows up in the listing:

/user/ukuppuswamy4330/my_result

So the file got created the first time I executed it.

Thanks,

Umashankar
