Here we will learn how to create an RDD (Resilient Distributed Dataset). There are two ways to create an RDD:
The first method is to load a file directly from a storage location such as Hadoop (HDFS) or any other supported file system, using the textFile method as shown below:
variable = sc.textFile(location of the file)
The second method is to distribute an existing collection of objects from the local machine's memory, using parallelize:
variable = sc.parallelize(object)
In both cases, the variable is an RDD. To check the contents of an RDD, we can use the take method, which returns its first few elements, as shown below:
variable.take(number of objects for viewing)
Let's try to create an RDD by loading a text file from the location /data/mr/wordcount/input/big.txt using the first method described above
lines = sc.textFile(<<your code goes here>>)
Now let's view the first 10 lines of the RDD we just created
<<your code goes here>>.take(10)
Now let's create an array of 10000 numbers
arr = range(1, <<your code goes here>>)
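When choosing the second argument, it may help to recall that Python's range excludes its stop value. A tiny illustration with a much smaller range (the name small is just for this example):

```python
# range(start, stop) yields start, start + 1, ..., stop - 1.
small = list(range(1, 6))

print(small)       # [1, 2, 3, 4, 5]
print(len(small))  # 5
```

So a range that starts at 1 contains stop - 1 numbers.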
Next, let's create an RDD from this array using the second method described above
nums = sc.<<your code goes here>>(arr)
Now let's view the first 10 numbers of this RDD
nums.take(<<your code goes here>>)
Please note that you cannot initialize Spark twice in the same session. If you need to initialize it again, you will first have to restart the kernel from the menu.