Here we will learn how to create an RDD (Resilient Distributed Dataset). There are two ways to create an RDD:
The first method is to load a file directly from an external storage system, such as HDFS or any other Hadoop-supported file system, using the textFile method as shown below:
variable = sc.textFile(location of the file)
The second method is to distribute an existing collection of objects from the local machine's memory using the parallelize method:
variable = sc.parallelize(object)
Here, the variables are RDDs. To check the contents of these RDDs, we can use a method called take as shown below:
variable.take(number of objects to view)
Let's create an RDD by loading a text file from the location /data/mr/wordcount/input/big.txt using the first method described above:
lines = sc.textFile(<<your code goes here>>)
Now let's view the first 10 lines of the RDD we just created:
<<your code goes here>>.take(10)
Now let's create an array of 10000 numbers
arr = range(1, <<your code goes here>>)
Next, let's create an RDD from this array using the second method described above
nums = sc.<<your code goes here>>(arr)
Now let's view the first 10 numbers of this RDD:
nums.take(<<your code goes here>>)
Please note that you cannot initialize Spark twice. If you need to reinitialize it, restart the kernel from the menu.