Apache Spark Basics


Is this method of creating an RDD correct: val myrdd = sc.parallelize(scala.io.Source.fromFile("./myfile").getLines.toList) ?



13 Comments

May I know why I am getting this error, if option 2 is true?

scala> val myrdd = sc.textFile("/data/mr/wordcount/input/big.txt").getLines.toList
<console>:24: error: value getLines is not a member of org.apache.spark.rdd.RDD[String]
       val myrdd = sc.textFile("/data/mr/wordcount/input/big.txt").getLines.toList
                                                                   ^


The getLines method is not a member of the RDD class in Spark. sc.textFile returns the file's contents as an RDD of Strings, with one element per line.

The getLines method is present in Scala's BufferedSource class, which is used for reading lines from a file or other input stream. However, it is not applicable in the context of Spark, as the textFile method returns an RDD, not a BufferedSource.
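To make the distinction concrete, here is a minimal runnable sketch (plain Scala only, no Spark; the temp file and its contents are made up for illustration) showing where getLines actually lives:

```scala
import java.nio.file.Files
import scala.io.Source

object GetLinesDemo {
  // getLines() is defined on scala.io.BufferedSource, the local-file reader.
  // Spark's RDD[String] has no such method: sc.textFile already yields
  // one element per line, so there is nothing further to split.
  def readLines(): List[String] = {
    val path = Files.createTempFile("myfile", ".txt") // throwaway file for the demo
    Files.write(path, "first\nsecond\nthird".getBytes)
    Source.fromFile(path.toFile).getLines().toList
  }

  def main(args: Array[String]): Unit =
    println(readLines().size) // prints 3
}
```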


scala.io.Source.fromFile("./myfile").getLines.toList - what is this part of the code doing?


This will load the entire file into the driver's RAM, and if the file is big it will cause an out-of-memory error.
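A runnable illustration of why the .toList part is the risky step (plain Scala; the file name and line count are arbitrary for the demo):

```scala
import java.nio.file.Files
import scala.io.Source

object ToListDemo {
  // getLines() alone returns an Iterator: lines stream one at a time.
  // Appending .toList drains the iterator and holds every line in memory
  // at once -- which is exactly what makes this pattern dangerous for big files.
  def materialize(n: Int): List[String] = {
    val path = Files.createTempFile("big", ".txt")
    Files.write(path, (1 to n).map(i => s"line$i").mkString("\n").getBytes)
    Source.fromFile(path.toFile).getLines().toList
  }

  def main(args: Array[String]): Unit =
    println(materialize(1000).size) // prints 1000: all lines live in memory at once
}
```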


val myrdd = sc.parallelize(scala.io.Source.fromFile("./myfile").getLines.toList) 

Can you please explain what's happening here?


scala.io.Source.fromFile("./myfile").getLines.toList loads the entire file into the driver's RAM, and if the file is big it will cause an out-of-memory error.

sc.textFile("./myfile") should be used to create the RDD instead; sc.parallelize expects an in-memory collection (e.g. a List), not a file path.


@satyajit das - When I try to run this code, it gives:

scala> val myrdd = sc.parallelize(scala.io.Source.fromFile("/data/mr/wordcount/input/big.txt").getLines.toList)
java.io.FileNotFoundException: /data/mr/wordcount/input/big.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)

The same path works when I create an RDD with sc.textFile. Please advise.


Hi, Ajinkya.

I see that you have already run the command. The RDD cannot be formed that way: scala.io.Source.fromFile reads from the local filesystem (hence the FileNotFoundException, since that path exists in HDFS, not on the local disk), and it first loads the whole file into driver memory, so a big file may cause an out-of-memory error. Use sc.textFile instead:
val rdd1 = sc.textFile("data.txt")

All the best!

-- Satyajit Das


Hi Satyajit,

I created rdd1 for data.txt, but it gives an error when I try to read it.

scala> val rdd1 = sc.textFile("data.txt")

rdd1: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:24

scala> rdd1.take(10)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://cxln1.c.thelab-240901.internal:8020/user/punitnb7985/data.txt at

Does the RDD get created even if the file does not exist, or did I misinterpret the error?


Please check if you have "data.txt" in your HDFS home directory.


Yes, sc.textFile is lazy. The RDD is only a description of the computation; the file is actually read when an action (such as take) is called, which is when a missing path is reported.
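The same deferred-failure behaviour can be sketched in plain Scala, without a Spark cluster (the path is deliberately nonexistent; this is an analogue of laziness, not Spark itself):

```scala
object LazyDemo {
  // Analogue of sc.textFile's laziness: defining the pipeline touches no data.
  // The missing file is only discovered when the result is forced,
  // just as Spark reports a missing path only when an action like take(10) runs.
  def pipeline(path: String): () => List[String] =
    () => scala.io.Source.fromFile(path).getLines().take(10).toList

  def failsOnlyWhenForced(): Boolean = {
    val p = pipeline("/no/such/file") // no error yet: nothing has been read
    try { p(); false }
    catch { case _: java.io.FileNotFoundException => true }
  }

  def main(args: Array[String]): Unit =
    println(failsOnlyWhenForced()) // prints true: failure surfaces at the "action"
}
```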


I am not understanding this.

scala.io.Source.fromFile("./myfile").getLines.toList

The above code is not Spark code, so this line reads the file from the local hard disk (not HDFS) into the driver's memory first, right?


Yes.
