Note: If you're getting a disk space quota error, please check this discussion.
Please run the following code in the spark-shell. When saving, saveAsSequenceFile converts the Scala key and value types (String and Double here) to the corresponding Hadoop Writables (Text and DoubleWritable).
// Create a pair RDD with two partitions and save it as a Hadoop sequence file
val rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)), 2)
rdd.saveAsSequenceFile("pysequencefile1")
Let us check whether the file has been created. The following should be run in a separate console; it is a Linux command, not Scala code.
hadoop fs -ls pysequencefile1
This should show you a listing like the one below. There are two files containing data because we created the RDD with two partitions - see the last argument of the parallelize() method.
[sandeepgiri9034@cxln4 ~]$ hadoop fs -ls pysequencefile1
Found 3 items
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 0 2021-05-21 15:11 pysequencefile1/_SUCCESS
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 109 2021-05-21 15:11 pysequencefile1/part-00000
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 130 2021-05-21 15:11 pysequencefile1/part-00001
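To confirm the relationship between partitions and part files, you can check the partition count of the RDD in the spark-shell. A minimal sketch, using the rdd created above:

// The number of data files equals the number of partitions
rdd.getNumPartitions  // returns 2, matching part-00000 and part-00001

If you want to peek at the actual key-value pairs without Spark, hadoop fs -text (which can decode sequence files) should work on a part file:

hadoop fs -text pysequencefile1/part-00000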
Now, let us read the sequence file saved in the first step above.
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text

// Load the sequence file, specifying the Writable classes of the key and value
val myrdd = sc.sequenceFile(
  "pysequencefile1",
  classOf[Text], classOf[DoubleWritable])

// Convert the Writables back to native Scala types before collecting,
// since Hadoop reuses Writable objects while reading
val result = myrdd.map{case (x, y) => (x.toString, y.get())}
result.collect()
It should print something like the following. You can see that we were able to load the data saved in the sequence file. Please note that we need to know the data types of the key and value stored in the sequence file in order to read it.
Array((key1,1.0), (key2,2.0), (key3,3.0))
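For example, if the values had been integers instead of doubles, the file would contain IntWritable values and we would have to pass classOf[IntWritable] when reading. A minimal sketch (the path pysequencefile2 is just an illustrative name):

import org.apache.hadoop.io.{IntWritable, Text}

// Save integer values; Spark writes them as Text/IntWritable pairs
val intRdd = sc.parallelize(Array(("key1", 1), ("key2", 2)), 2)
intRdd.saveAsSequenceFile("pysequencefile2")

// Reading back requires the matching Writable classes
val loaded = sc.sequenceFile("pysequencefile2", classOf[Text], classOf[IntWritable])
val ints = loaded.map{case (k, v) => (k.toString, v.get())}
ints.collect()  // Array((key1,1), (key2,2))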