Now that we know what a transformation is, let's get to know an action.
An action triggers the actual execution of the transformations defined so far. It involves both the Spark driver and the worker nodes. You may be surprised to learn that we have already used two actions in some of the previous steps: take and saveAsTextFile.
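The difference between defining a transformation and running an action can be illustrated without Spark. The sketch below is only a plain-Python analogy, not Spark code: the built-in map stands in for a lazy transformation, and pulling the first five results stands in for an action such as take. The names multiply_by_two and calls are made up for this illustration.

```python
from itertools import islice

calls = []  # records every element the function is actually applied to


def multiply_by_two(x):
    calls.append(x)
    return x * 2


# "Transformation": map() only builds a lazy iterator; nothing runs yet.
doubled = map(multiply_by_two, range(1, 10001))
calls_before_action = len(calls)

# "Action": asking for concrete results forces the computation,
# and only for as many elements as are needed.
first_five = list(islice(doubled, 5))
calls_after_action = len(calls)

print(calls_before_action)  # 0
print(first_five)           # [2, 4, 6, 8, 10]
print(calls_after_action)   # 5
```

Nothing is computed until results are requested, which is the same idea behind Spark deferring transformations until an action runs.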
take receives an integer value (let's say n) as a parameter and returns an array of the first n elements of the RDD.
saveAsTextFile saves an RDD as text in HDFS or another file system. The path you pass becomes a directory, and each partition is written to a separate file inside it. The filenames are of the form part-00000, part-00001, part-00002 and so on.
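The naming pattern is a zero-padded, five-digit partition index. As a small sketch, the helper below (part_file_names is a hypothetical name, not part of Spark) only reproduces the pattern described above so you know which files to expect for a given number of partitions.

```python
def part_file_names(num_partitions):
    """Return the part-file names expected for the given number of partitions."""
    # Partition indexes start at 0 and are padded to five digits.
    return ["part-%05d" % i for i in range(num_partitions)]


names = part_file_names(3)
print(names)  # ['part-00000', 'part-00001', 'part-00002']
```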
Let's see them in action once again.
Note: If a file or directory with the same name already exists in the same location, the saveAsTextFile command will fail with an error. In that case, check with the hadoop fs -ls command whether it already exists. If it does, delete it (for example with hadoop fs -rm -r mydirectory) and try again.
Note: You can check how many cores PySpark is running on using sc.defaultParallelism. If you run sc.parallelize without specifying the number of partitions, it uses that default and spreads the data across all the cores. If this throws an error, pass the number of partitions explicitly, for example nums = sc.parallelize(arr, 2).
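To see the effect of passing a partition count, a helper along the following lines could be used. This is a minimal sketch: partition_summary is a hypothetical name, and it assumes sc is an existing SparkContext, as it is in the PySpark shell. It relies only on sc.defaultParallelism, sc.parallelize and the RDD method getNumPartitions.

```python
def partition_summary(sc, data, num_partitions=None):
    """Parallelize `data` and return (default parallelism, actual partition count)."""
    if num_partitions is None:
        # No count given: Spark falls back to its default parallelism.
        rdd = sc.parallelize(data)
    else:
        # Explicit count, as in sc.parallelize(arr, 2).
        rdd = sc.parallelize(data, num_partitions)
    return sc.defaultParallelism, rdd.getNumPartitions()
```

In the PySpark shell, calling partition_summary(sc, range(1, 10001), 2) lets you compare the default with the two partitions you asked for.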
First, let's define an array of 10000 numbers from 1 to 10000 and store it in a variable named arr. Note that range excludes its end value, so the upper bound must be 10001.
<<your code goes here>> = range(1, 10001)
Next, convert that array into an RDD named nums
<<your code goes here>> = sc.parallelize(arr)
Now let's define a function multiplyByTwo that takes an element, multiplies it by 2 and returns the result
def <<your code goes here>>(x):
    return x*2
Now, let's use map on the RDD nums with this function and store the result in a new RDD named dbls
<<your code goes here>> = nums.map(multiplyByTwo)
Let's use take to view the first 5 elements of dbls
dbls.take(5)
Now, let's save the dbls RDD as a text file using saveAsTextFile
dbls.<<your code goes here>>("mydirectory")
To check the mydirectory output in Hadoop, first log in to the web console from another tab or click on the Console tab on the right side of this split screen.
To list the part files in the directory, use the command below
hadoop fs -ls mydirectory
To check the content of the first part of the file, use the command below
hadoop fs -cat mydirectory/part-00000 | more
To check the content of the second part of the file, use the command below
hadoop fs -cat mydirectory/part-00001 | more
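Putting the Python steps above together, the whole exercise can be sketched as one function. This is an illustrative outline rather than the official solution: run_exercise and its output_dir parameter are hypothetical names, and sc is assumed to be an existing SparkContext such as the one the PySpark shell provides.

```python
def multiplyByTwo(x):
    """Double a single element."""
    return x * 2


def run_exercise(sc, output_dir):
    """Build dbls from the numbers 1..10000, save it, and return its first 5 elements."""
    arr = range(1, 10001)            # range excludes the end value
    nums = sc.parallelize(arr)       # create an RDD from the local collection
    dbls = nums.map(multiplyByTwo)   # transformation: lazy, nothing runs yet
    first_five = dbls.take(5)        # action: brings 5 elements to the driver
    dbls.saveAsTextFile(output_dir)  # action: writes one part file per partition
    return first_five
```

In the PySpark shell, run_exercise(sc, "mydirectory") should return [2, 4, 6, 8, 10], provided mydirectory does not already exist.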