Loading and Saving Data


Loading and Saving Data - Handling Sequence and Object Files





Note: If you're getting a disk space quota error, please check this discussion.

INSTRUCTIONS

Create sequence file

Please run the following code on spark-shell

// A pair RDD of (String, Double); the last argument (2) is the number of partitions
var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)), 2)
rdd.saveAsSequenceFile("pysequencefile1")

Check The File

Let us check whether the file has been created. The following should be run in a separate console; it is a Linux command, not Scala code.

hadoop fs -ls pysequencefile1

This should show you a listing like the following. There are two files containing data because we created the RDD with two partitions - check the last argument of the parallelize() method.

[sandeepgiri9034@cxln4 ~]$ hadoop fs -ls pysequencefile1 
 Found 3 items
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034          0 2021-05-21 15:11 pysequencefile1/_SUCCESS
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        109 2021-05-21 15:11 pysequencefile1/part-00000
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        130 2021-05-21 15:11 pysequencefile1/part-00001
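
If you want to confirm the partition count from the same spark-shell session, a quick optional check (not part of the original steps) is:

rdd.getNumPartitions   // returns 2, matching the two part-0000x data files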

Read a sequence file

Now, let us read the sequence file saved in the first step above.

import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text

val myrdd = sc.sequenceFile(
  "pysequencefile1",
  classOf[Text], classOf[DoubleWritable])

// Convert the Hadoop Writables back to plain Scala types
val result = myrdd.map{case (x, y) => (x.toString, y.get())}
result.collect()

It should print something like the following. You can see that we were able to load the data saved in the sequence file. Please note that we need to know the datatypes of the key and the value stored in the sequence file.

Array((key1,1.0), (key2,2.0), (key3,3.0))
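
The lesson title also mentions object files. As a minimal sketch (not covered by the steps above; pyobjectfile1 is just an assumed directory name), the analogous calls are saveAsObjectFile and objectFile. They use Java serialization, so no Writable classes are needed when reading back; you only state the element type of the RDD:

// Save the same pairs as an object file (Java-serialized)
rdd.saveAsObjectFile("pyobjectfile1")

// Read it back by stating the element type
val objRdd = sc.objectFile[(String, Double)]("pyobjectfile1")
objRdd.collect()   // Array((key1,1.0), (key2,2.0), (key3,3.0))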



6 Comments

 

After saving the files, it is showing a _temporary directory.

 


Have you run any other operation on the RDD before saving?


Below is the code that I am running:

 

spark-shell
var linesRdd = sc.textFile("/data/mr/wordcount/input/big.txt")
var words = linesRdd.flatMap(x => x.split(" "))
var wordsKv = words.map(x => (x, 1))
//def myfunc(x:Int, y:Int): Int = x + y
var output = wordsKv.reduceByKey(_ + _)
output.take(10)
exit()

[aimlankit2262@cxln4 ~]$ hadoop fs -ls my_result
Found 1 items
drwxr-xr-x   - aimlankit2262 aimlankit2262          0 2022-10-03 03:39 my_result/_temporary


Can you try the commands exactly as mentioned in the slide?


This is also showing _temporary. Both issues are the same.

 

After saving the file to HDFS, it is showing as _temporary. Can you test this with your account and list the files?


Hi Ankit,

So, I have debugged the issue. It is because of the disk quota problem. In this command,

var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)), 2)

the last argument sets the number of partitions to 2, and that is what triggers the disk space quota issue on your account: each partition is written out as a separate HDFS file (and each file is then replicated), so you don't have enough quota left to save the output with 2 partitions. If you limit it to 1 partition, then you will be able to see the output as:

-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034          0 2021-05-21 15:11 pysequencefile1/_SUCCESS
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        109 2021-05-21 15:11 pysequencefile1/part-00000

You can create the RDD with a single partition as:

var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)), 1)
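
As a sketch of the re-run: remove the earlier output directory first with hadoop fs -rm -r pysequencefile1 (saveAsSequenceFile fails if the target directory already exists), then save again from spark-shell:

rdd.saveAsSequenceFile("pysequencefile1")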

 
