Note: If you're getting a disk space quota error, please check this discussion.
Please run the following code in the spark-shell:
var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)), 2)
rdd.saveAsSequenceFile("pysequencefile1")
Let us check whether the file was created. The following should be run in a separate console; it is a Linux command, not Scala code.
hadoop fs -ls pysequencefile1
This should show you a listing like the following. There are two files containing data because we created an RDD with two partitions; check the last argument of the parallelize() method. A quick way to confirm the partition count is sketched after the listing.
[sandeepgiri9034@cxln4 ~]$ hadoop fs -ls pysequencefile1
Found 3 items
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 0 2021-05-21 15:11 pysequencefile1/_SUCCESS
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 109 2021-05-21 15:11 pysequencefile1/part-00000
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 130 2021-05-21 15:11 pysequencefile1/part-00001
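If you want to confirm the partition count from the same spark-shell session (a quick sanity check, not part of the original exercise):
rdd.getNumPartitions // returns 2, matching the two part-* files above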
Now, let us read the sequence file saved in the first step above.
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text
val myrdd = sc.sequenceFile(
"pysequencefile1",
classOf[Text], classOf[DoubleWritable])
val result = myrdd.map{case (x, y) => (x.toString, y.get())}
result.collect()
It should print something like the following. You can see that we were able to load the data saved in the sequence file. Please note that we need to know the datatypes of the key and value stored in the sequence file. A typed shortcut is sketched after the output.
Array((key1,1.0), (key2,2.0), (key3,3.0))
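As a side note (not part of the original exercise), Spark can also infer the Writable conversions for common key/value types, so the same file can be read back without naming the Writable classes explicitly:
val typed = sc.sequenceFile[String, Double]("pysequencefile1")
typed.collect() // Array((key1,1.0), (key2,2.0), (key3,3.0))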
6 Comments
After saving the files, it is showing a _temporary directory.
Have you run any other operation on the RDD before saving?
Below is the code that I am running:
spark-shell
var linesRdd = sc.textFile("/data/mr/wordcount/input/big.txt")
var words = linesRdd.flatMap(x => x.split(" "))
var wordsKv = words.map(x => (x, 1))
//def myfunc(x:Int, y:Int): Int = x + y
var output = wordsKv.reduceByKey(_ + _)
output.take(10)
exit()
[aimlankit2262@cxln4 ~]$ hadoop fs -ls my_result
Found 1 items
drwxr-xr-x - aimlankit2262 aimlankit2262 0 2022-10-03 03:39 my_result/_temporary
Can you try the commands as mentioned in the slide?
This is also showing _temporary; both issues are the same.
After saving the file to HDFS, it is showing as _temporary. Can you test this with your account and list the files?
Hi Ankit,
So, I have debugged the issue. It is caused by the disk quota problem. In this command, we have set the replication factor to 2, which causes the disk space quota issue on your account, as you don't have enough disk space left to save this file with 2 replications. If you limit it to 1, you will be able to see the output.
You can create the RDD and save it with a replication factor of 1 as follows:
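The exact command was not preserved on this page, so here is a minimal sketch, assuming the replication factor in question is the HDFS dfs.replication setting, which can be lowered through the SparkContext's Hadoop configuration before saving. The output path pysequencefile2 is hypothetical, chosen to avoid colliding with the file saved earlier:
// Sketch, not the original command: assumes the quota is consumed by HDFS
// block replication, so dfs.replication is lowered to 1 before writing.
sc.hadoopConfiguration.set("dfs.replication", "1")
var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)), 2)
rdd.saveAsSequenceFile("pysequencefile2")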