Is this method of creating an RDD correct: val myrdd = sc.parallelize(scala.io.Source.fromFile("./myfile").getLines.toList) ?
13 Comments
May I know why I am getting this error, if option 2 is true?

scala> val myrdd = sc.textFile("/data/mr/wordcount/input/big.txt").getLines.toList
<console>:24: error: value getLines is not a member of org.apache.spark.rdd.RDD[String]
       val myrdd = sc.textFile("/data/mr/wordcount/input/big.txt").getLines.toList
                                                                   ^
getLines is not a member of the RDD class in Spark; sc.textFile returns the file as an RDD of Strings.

The getLines method is present in Scala's BufferedSource class, which is used for reading lines from a file or other input stream. However, it is not applicable in the context of Spark, as the textFile method returns an RDD, not a BufferedSource.

scala.io.Source.fromFile("./myfile").getLines.toList - what is this part of the code doing?
This will load the file into RAM, and if the file is big it will cause a memory overflow.
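The point above can be seen in plain Scala, without Spark. This is a minimal sketch; the file name and contents are hypothetical stand-ins for "./myfile":

```scala
import java.nio.file.Files
import scala.io.Source

object LocalReadDemo {
  def main(args: Array[String]): Unit = {
    // Create a small stand-in for "./myfile" (hypothetical sample data)
    val path = Files.createTempFile("myfile", ".txt")
    Files.write(path, "line1\nline2\nline3".getBytes("UTF-8"))

    // getLines belongs to scala.io.BufferedSource, not to Spark's RDD
    val source = Source.fromFile(path.toFile)
    val lines: List[String] =
      try source.getLines().toList // the WHOLE file is now in driver memory
      finally source.close()

    assert(lines == List("line1", "line2", "line3"))
    // Only at this point could sc.parallelize(lines) distribute the data;
    // for a large file, building this List can exhaust the JVM heap.
    println(lines.size)
  }
}
```

So the List is materialized on the driver first, and sc.parallelize only distributes it afterwards, which is why this approach does not scale to big files.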
val myrdd = sc.parallelize(scala.io.Source.fromFile("./myfile").getLines.toList)
Can you please explain whats happening here?
scala.io.Source.fromFile("./myfile") will load the file into RAM, and if the file is big it will cause a memory overflow.
@satyajit das - When I try to run this code it gives:
scala> val myrdd = sc.parallelize(scala.io.Source.fromFile("/data/mr/wordcount/input/big.txt").getLines.toList)
java.io.FileNotFoundException: /data/mr/wordcount/input/big.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
The same file works when I try to create an RDD with sc.textFile. Please advise.
Hi, Ajinkya.
I see that you have already run the command and found that the RDD cannot be created this way: scala.io.Source.fromFile reads from the local filesystem, whereas that path exists only in HDFS, which is why you get FileNotFoundException. It also loads the whole file into memory first, which can cause an out-of-memory error for a big file. Use sc.textFile instead:

val rdd1 = sc.textFile("data.txt")
All the best!
-- Satyajit Das
Hi Satyajit,
I created rdd1 for data.txt, but it gives an error when I try to read it.
scala> val rdd1 = sc.textFile("data.txt")
rdd1: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:24
scala> rdd1.take(10)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://cxln1.c.thelab-240901.internal:8020/user/punitnb7985/data.txt at
Does the RDD get created even if the file does not exist, or have I misinterpreted the error?
Please check if you have "data.txt" in your HDFS home directory.
Yes, sc.textFile is lazy: creating the RDD does not touch the file at all. The entire process starts only when an action is called.
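That laziness can be mimicked in plain Scala with an Iterator (no Spark required; the counter is only for illustration, not part of any Spark API):

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = 0

    // Like sc.textFile, building the iterator does no work yet:
    val lines = Iterator.tabulate(100) { i => evaluated += 1; s"line$i" }
    assert(evaluated == 0) // nothing has been "read" yet, despite no error

    // The "action", analogous to rdd1.take(10), triggers evaluation:
    val firstTen = lines.take(10).toList
    assert(evaluated == 10) // only now is work actually done

    println(firstTen.head)
  }
}
```

This is why the InvalidInputException above only appears at rdd1.take(10) rather than at sc.textFile: the missing file is not checked until an action forces evaluation.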
Upvote ShareI am not understanding this.
scala.io.Source.fromFile("./myfile").getLines.toList
The above code is not a spark code. Hence first this line will be stored in the hard disk right.
Yes.