Loading and Saving Data - Handling Sequence and Object Files

Note: If you're getting disk space quota error, please check this discussion.

INSTRUCTIONS

Create sequence file

Run the following code in spark-shell:

val rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)), 2)
rdd.saveAsSequenceFile("pysequencefile1")

Check the file

Let us check whether the file was created. Run the following in a separate console. It is a Linux command, not Scala code.

hadoop fs -ls pysequencefile1

This should show a listing like the following. Besides the empty _SUCCESS marker, there are two part files containing data, because we created the RDD with two partitions (see the last argument of the parallelize() call).

[sandeepgiri9034@cxln4 ~]$ hadoop fs -ls pysequencefile1 
Found 3 items
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034          0 2021-05-21 15:11 pysequencefile1/_SUCCESS
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        109 2021-05-21 15:11 pysequencefile1/part-00000
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        130 2021-05-21 15:11 pysequencefile1/part-00001
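
The link between partitions and part files can also be checked from inside spark-shell. The following is a minimal sketch, assuming it is pasted into the same session in which rdd was created and that sc is the SparkContext that spark-shell provides. The names fs and partFiles are only illustrative.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Number of partitions of the RDD created above (2 was passed to parallelize)
val numPartitions = rdd.getNumPartitions

// List the output directory through the Hadoop FileSystem API,
// keeping only the data files and skipping the _SUCCESS marker
val fs = FileSystem.get(sc.hadoopConfiguration)
val partFiles = fs.listStatus(new Path("pysequencefile1"))
  .map(_.getPath.getName)
  .filter(_.startsWith("part-"))
  .sorted

println(s"partitions = $numPartitions, part files = ${partFiles.length}")
partFiles.foreach(println)
```

If the save step completed, the two counts should agree, since each partition is written to its own part file.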

Read a sequence file

Now, let us read the sequence file saved in the first step above.

import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text

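// Keys were saved as Text and values as DoubleWritable
val myrdd = sc.sequenceFile(
  "pysequencefile1",
  classOf[Text], classOf[DoubleWritable])

// Convert the Hadoop Writable objects to plain Scala types
val result = myrdd.map { case (x, y) => (x.toString, y.get()) }
result.collect()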

It should print something like the following, which shows that the data saved in the sequence file was loaded back. Note that you need to know the key and value types stored in the sequence file (here Text and DoubleWritable) in order to read it.

Array((key1,1.0), (key2,2.0), (key3,3.0))
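
When the keys and values correspond to plain Scala types, the type-parameterized form of sequenceFile may be a shorter alternative, because Spark then picks the matching Writable classes and converts the records itself. This is a minimal sketch, assuming the same spark-shell session and the pysequencefile1 directory written above. The name typed is only illustrative.

```scala
// Ask for (String, Double) pairs directly; Spark maps String to Text
// and Double to DoubleWritable and converts each record while reading
val typed = sc.sequenceFile[String, Double]("pysequencefile1")

// Sort by key so the printed order does not depend on how the part files are read
val loaded = typed.collect().sortBy(_._1)
loaded.foreach(println)
```

The explicit classOf form shown earlier remains useful when the stored Writable types have no built-in Scala counterpart.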