Apache Spark Basics with Python

Apache Spark with Python - Creating an RDD

Here we will learn how to create an RDD (Resilient Distributed Dataset). There are two ways to create an RDD:

  • By directly loading a file from a remote storage system such as HDFS (the Hadoop Distributed File System), or any other supported file system, using the textFile method as shown below

    variable = sc.textFile("path/to/file")
    
  • The second method is to distribute an existing collection (such as a list or range) from the local machine's memory using the parallelize method

    variable = sc.parallelize(collection)
    
  • Here, the variables are RDDs. To check the contents of these RDDs, we can use a method called take, as shown below (a minimal combined sketch of both methods follows this list):

    variable.take(n)   # n is the number of objects to view
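
For a quick end-to-end picture, here is a runnable sketch combining both methods. This is only an illustration, not part of the exercise: the file path here is hypothetical, and sc is assumed to be an existing SparkContext, as in a PySpark shell.

    # Method 1: load a text file; each element of the RDD is one line.
    lines = sc.textFile("hdfs:///path/to/some/file.txt")  # hypothetical path

    # Method 2: distribute a local Python collection across the cluster.
    nums = sc.parallelize(range(1, 101))

    # take(n) returns the first n elements of an RDD to the driver.
    print(lines.take(5))
    print(nums.take(5))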
    
INSTRUCTIONS
  • Let's try to create an RDD by loading a text file from the location /data/mr/wordcount/input/big.txt using the first method described above

    lines = sc.textFile(<<your code goes here>>)
    
  • Now let's view the first 10 lines of the RDD we just created

    <<your code goes here>>.take(10)
    
  • Now let's create an array of 10000 numbers

    arr = range(1, <<your code goes here>>)
    
  • Next, let's create an RDD from this array using the second method described above

    nums = sc.<<your code goes here>>(arr)
    
  • Now let's view the first 10 numbers of this RDD (a completed sketch of all the steps follows below)

    nums.take(<<your code goes here>>)
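
For reference, one possible completed version of the steps above could look like the following sketch, assuming sc is an initialized SparkContext and the file exists at the given location:

    # Step 1: load the text file into an RDD of lines.
    lines = sc.textFile("/data/mr/wordcount/input/big.txt")

    # Step 2: view the first 10 lines.
    lines.take(10)

    # Step 3: build a sequence of 10000 numbers; range(1, 10001) yields 1 through 10000.
    arr = range(1, 10001)

    # Step 4: distribute the sequence as an RDD.
    nums = sc.parallelize(arr)

    # Step 5: view the first 10 numbers.
    nums.take(10)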
    

Please note that you cannot initialize Spark twice in the same session. If you need to reinitialize it, restart the kernel from the menu.
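
If you see an error because a SparkContext already exists, one common workaround (an assumption on our part, not something this exercise requires) is SparkContext.getOrCreate, which returns the running context instead of raising an error:

    from pyspark import SparkContext

    # Returns the existing SparkContext if one is already running,
    # otherwise creates a new one.
    sc = SparkContext.getOrCreate()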
