Apache Spark Basics


Is this method of creating an RDD correct: val myrdd = sc.parallelize(scala.io.Source.fromFile("./myfile").getLines.toList) ?



13 Comments

May I know why I am getting this error, if option 2 is true?

scala> val myrdd = sc.textFile("/data/mr/wordcount/input/big.txt").getLines.toList
<console>:24: error: value getLines is not a member of org.apache.spark.rdd.RDD[String]
       val myrdd = sc.textFile("/data/mr/wordcount/input/big.txt").getLines.toList
                                                                   ^


The getLines method is not a member of the RDD class in Spark. sc.textFile returns the file's contents as an RDD of Strings, with one element per line.

The getLines method is present in Scala's BufferedSource class, which is used for reading lines from a file or other input stream. However, it is not applicable in the context of Spark, as the textFile method returns an RDD, not a BufferedSource.
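To make the distinction concrete, here is a minimal runnable sketch (plain Scala only, no Spark; the temp file and its contents are made up for illustration) showing where getLines actually lives:

```scala
import java.nio.file.Files
import scala.io.Source

object GetLinesDemo {
  // getLines() is defined on scala.io.BufferedSource, the local-file reader.
  // Spark's RDD[String] has no such method: sc.textFile already yields
  // one element per line, so there is nothing further to split.
  def readLines(): List[String] = {
    val path = Files.createTempFile("myfile", ".txt") // throwaway file for the demo
    Files.write(path, "first\nsecond\nthird".getBytes)
    Source.fromFile(path.toFile).getLines().toList
  }

  def main(args: Array[String]): Unit =
    println(readLines().size) // prints 3
}
```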


scala.io.Source.fromFile("./myfile").getLines.toList - what is this part of the code doing?


This will load the entire file into the driver's RAM, and if the file is big it will cause an out-of-memory error.
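A runnable illustration of why the .toList part is the risky step (plain Scala; the file name and line count are arbitrary for the demo):

```scala
import java.nio.file.Files
import scala.io.Source

object ToListDemo {
  // getLines() alone returns an Iterator: lines stream one at a time.
  // Appending .toList drains the iterator and holds every line in memory
  // at once -- which is exactly what makes this pattern dangerous for big files.
  def materialize(n: Int): List[String] = {
    val path = Files.createTempFile("big", ".txt")
    Files.write(path, (1 to n).map(i => s"line$i").mkString("\n").getBytes)
    Source.fromFile(path.toFile).getLines().toList
  }

  def main(args: Array[String]): Unit =
    println(materialize(1000).size) // prints 1000: all lines live in memory at once
}
```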


val myrdd = sc.parallelize(scala.io.Source.fromFile("./myfile").getLines.toList) 

Can you please explain what's happening here?


scala.io.Source.fromFile("./myfile").getLines.toList loads the entire file into the driver's RAM, and if the file is big it will cause an out-of-memory error.

sc.textFile("./myfile") should be used to create the RDD instead; sc.parallelize expects an in-memory collection (e.g. a List), not a file path.


@satyajit das - When I try to run this code, it gives:

scala> val myrdd = sc.parallelize(scala.io.Source.fromFile("/data/mr/wordcount/input/big.txt").getLines.toList)
java.io.FileNotFoundException: /data/mr/wordcount/input/big.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)

The same path works when I create an RDD with sc.textFile. Please advise.


Hi, Ajinkya.

I see that you have already run the command. The RDD cannot be formed that way: scala.io.Source.fromFile reads from the local filesystem (hence the FileNotFoundException, since that path exists in HDFS, not on the local disk), and it first loads the whole file into driver memory, so a big file may cause an out-of-memory error. Use sc.textFile instead:
val rdd1 = sc.textFile("data.txt")

All the best!

-- Satyajit Das


Hi Satyajit,

I created rdd1 for data.txt, but it gives an error when I try to read it.

scala> val rdd1 = sc.textFile("data.txt")

rdd1: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:24

scala> rdd1.take(10)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://cxln1.c.thelab-240901.internal:8020/user/punitnb7985/data.txt at

Does the RDD get created even if the file does not exist, or did I misinterpret the error?


Please check if you have "data.txt" in your HDFS home directory.


Yes, sc.textFile is lazy. The RDD is only a description of the computation; the file is actually read when an action (such as take) is called, which is when a missing path is reported.
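The same deferred-failure behaviour can be sketched in plain Scala, without a Spark cluster (the path is deliberately nonexistent; this is an analogue of laziness, not Spark itself):

```scala
object LazyDemo {
  // Analogue of sc.textFile's laziness: defining the pipeline touches no data.
  // The missing file is only discovered when the result is forced,
  // just as Spark reports a missing path only when an action like take(10) runs.
  def pipeline(path: String): () => List[String] =
    () => scala.io.Source.fromFile(path).getLines().take(10).toList

  def failsOnlyWhenForced(): Boolean = {
    val p = pipeline("/no/such/file") // no error yet: nothing has been read
    try { p(); false }
    catch { case _: java.io.FileNotFoundException => true }
  }

  def main(args: Array[String]): Unit =
    println(failsOnlyWhenForced()) // prints true: failure surfaces at the "action"
}
```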


I am not understanding this.

scala.io.Source.fromFile("./myfile").getLines.toList

The above code is not Spark code, so this line reads the file from the local hard disk (not HDFS) into the driver's memory first, right?


Yes.
