Apache Spark Basics with Python


Apache Spark with Python - Actions - take & saveTextFile

Now that we know what a transformation is, let's get to know an action.

An action triggers the full execution of the transformations that lead up to it. It involves both the Spark driver and the worker nodes. You may be surprised to learn that we have already used two actions in previous steps: take and saveAsTextFile.
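The lazy-transformation / eager-action split can be mimicked with plain Python generators (a rough analogue, not Spark itself): the map call below only builds a recipe, and no work happens until we materialize the results.

```python
def multiply_by_two(x):
    print("computing", x)   # visible side effect to show when work happens
    return x * 2

nums = range(1, 6)
dbls = map(multiply_by_two, nums)  # "transformation": nothing printed yet

result = list(dbls)                # "action": now every element is computed
print(result)                      # [2, 4, 6, 8, 10]
```

In the same way, an RDD map does nothing on the cluster until an action such as take or saveAsTextFile forces the computation.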

take receives an integer value (say, n) as a parameter and returns an array of the first n elements of the RDD.

saveAsTextFile saves an RDD as text files in HDFS or another file system. Note that each partition is written as a separate file, with filenames of the form part-00000, part-00001, part-00002, and so on.
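Spark itself performs the write across the cluster; the sketch below is a rough pure-Python analogue (not PySpark) just to illustrate the semantics: take(n) is like taking the first n elements, and saveAsTextFile writes one part-0000x file per partition. The directory name mydir_demo is made up for this example.

```python
import os

def take(data, n):
    # take(n) returns the first n elements as a list
    return list(data)[:n]

def save_as_text_file(partitions, directory):
    # each partition becomes its own file: part-00000, part-00001, ...
    os.makedirs(directory, exist_ok=True)
    for i, part in enumerate(partitions):
        path = os.path.join(directory, "part-%05d" % i)
        with open(path, "w") as f:
            for record in part:
                f.write(str(record) + "\n")

print(take(range(1, 11), 5))            # [1, 2, 3, 4, 5]
save_as_text_file([[1, 2], [3, 4]], "mydir_demo")
```

After running this, mydir_demo contains part-00000 (records 1 and 2) and part-00001 (records 3 and 4), mirroring the one-file-per-partition layout described above.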

Let's see them in action once again.

Note: If a directory with the same name already exists in the target location, the saveAsTextFile command will fail with an error. In that case, check with the hadoop fs -ls command whether the directory already exists; if it does, delete it and try again.

Note: You can check how many cores PySpark is running on with the sc.defaultParallelism command. If you run sc.parallelize without specifying the number of partitions, it will use all the cores by default and allocate space on each of them. This might throw an error, in which case you can specify the number of partitions explicitly, e.g. nums = sc.parallelize(arr, 2). See the following discussion for more details:
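Conceptually, sc.parallelize(arr, numSlices) divides the data into roughly equal positional chunks, one per partition. The helper below is a simplified pure-Python sketch of that slicing (Spark's actual implementation is more involved, but the partition boundaries are similar):

```python
def split_into_partitions(data, num_partitions):
    # divide data into num_partitions roughly equal, contiguous chunks
    data = list(data)
    n = len(data)
    return [data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
            for i in range(num_partitions)]

parts = split_into_partitions(range(1, 10001), 2)
print(len(parts))                     # 2
print(len(parts[0]), len(parts[1]))   # 5000 5000
print(parts[0][:3])                   # [1, 2, 3]
```

With two partitions, saveAsTextFile would then produce two files, part-00000 holding 1..5000 and part-00001 holding 5001..10000.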

https://discuss.cloudxlab.com/t/error-quota-exceeded-while-running-spark-job-but-havent-used-much-of-disk-space-solved/3472/4

INSTRUCTIONS
  • First, let's define an array of 10000 numbers from 1 to 10000 and store it in a variable named arr

    <<your code goes here>> = range(1, 10001)
    
  • Next, convert that array into an RDD named nums

    <<your code goes here>> = sc.parallelize(arr)
    
  • Now let's define a function multiplyByTwo that takes an element, multiplies it by 2 and returns the result

    def <<your code goes here>>(x):
        return x*2
    
  • Now, let's use map on the RDD nums using this function and store the result in a new RDD named dbls

    <<your code goes here>> = nums.map(multiplyByTwo)
    
  • Let's use take to view the first 5 elements of dbls

    dbls.take(5)
    
  • Now, let's save the dbls RDD as a text file using saveAsTextFile

    dbls.<<your code goes here>>("mydirectory")
    
  • To check this mydirectory output in Hadoop, first log in to the web console from another tab, or click on the Console tab on the right side of this split screen

  • To list the files inside mydirectory, use the below command

    hadoop fs -ls  mydirectory
    
  • To check the content of the first part file, use the below command

    hadoop fs -cat mydirectory/part-00000 | more
    
  • To check the content of the second part file, use the below command

    hadoop fs -cat mydirectory/part-00001 | more
    