In the previous projects, we either created an RDD by parallelizing an in-memory collection or loaded data from files in HDFS. In this hands-on project we will learn to load and save different kinds of data from various sources into an RDD.
Spark supports a wide variety of data sources. It can access data through the InputFormat and OutputFormat interfaces provided by Hadoop, which are available for many common file formats and storage systems such as S3, HDFS, Cassandra, and HBase.
Data can be located in various stores and in various formats, and Spark can handle a wide range of both. Examples of file formats are plain text, JSON, sequence files (binary key-value pair files), and protocol buffers; the files may also be compressed.
Examples of stores are file systems such as a network file system, the Hadoop distributed file system, and Amazon S3. The store could also be a relational database, or a NoSQL store such as Cassandra, HBase, or Elasticsearch.
With Spark you can access data from any of these storage systems, in any of these formats, with different kinds of compression. Apart from this, Spark provides a very powerful DataFrames API, which is part of Spark SQL, for querying structured data sources such as JSON and Hive. We will cover it in a separate project.
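For instance, with the core RDD API the storage system is selected simply by the scheme of the path passed to sc.textFile. A minimal spark-shell sketch (the paths and the bucket name below are hypothetical):

// The URI scheme of the path decides which storage system Spark reads from.
val localRdd = sc.textFile("file:///home/cloudxlab/sample.txt")  // local file system (hypothetical path)
val hdfsRdd  = sc.textFile("hdfs:///data/spark/sample.txt")      // HDFS (hypothetical path)
val s3Rdd    = sc.textFile("s3a://my-bucket/sample.txt")         // Amazon S3 (hypothetical bucket)

// Each call returns an RDD[String] with one element per line, regardless of the store.
hdfsRdd.take(5).foreach(println)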
A file or a large data set could be in any format. If we know the format up front, we can read it using a format-specific reader; if we do not, we can detect it with a UNIX utility such as file. Let us try to understand the various formats quickly.
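To make this concrete, here is a small spark-shell sketch of reading and writing a couple of these formats. The paths are hypothetical, the compressed-text read assumes such a file exists, and the output paths must not already exist:

import org.apache.hadoop.io.compress.GzipCodec

// Plain text: gzip-compressed text files are decompressed transparently on read.
val lines = sc.textFile("/data/spark/sample.txt.gz")

// Sequence files: binary key-value pairs.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveAsSequenceFile("/tmp/seq-demo")
val back = sc.sequenceFile[String, Int]("/tmp/seq-demo")

// Text output can be compressed by passing a codec class.
lines.saveAsTextFile("/tmp/text-gz-demo", classOf[GzipCodec])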
14 Comments
I am trying to read a CSV file "user/avishekbrutus8015/Avishek/data_nyc.csv" into a Jupyter notebook and getting a "File not found" error. Please help me locate the file on Jupyter.
Hi,
Please make sure that the file exists in the specified location before reading it.
Thanks.
I do see the file here:
"user/avishekbrutus8015/Avishek/data_nyc.csv"
Could you attach a screenshot of the command and the output of the following command:
ambari:
I don't see it from the console, so how do I upload data from Ambari?
Hi Kumar,
The path should start with "/user", not "user". Can you please verify that the file exists? Please run the below command on your web console.
Adding to Abhinav's answer: after checking that the file exists in HDFS, if you want to read it via a Jupyter notebook, you might want to copy the file from HDFS to local and then use that local path to read it. Else, you may use PySpark and read the file directly using sc.textFile.
Thanks.
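For example, once the file is confirmed to exist at the absolute HDFS path, a quick read in the Scala spark-shell would look like this (a sketch, assuming the file really is at the path mentioned in this thread):

// Read the CSV from HDFS as plain lines; note the path starts with /user, not user.
val nycRdd = sc.textFile("/user/avishekbrutus8015/Avishek/data_nyc.csv")
nycRdd.take(5).foreach(println)   // print the first few lines to verify the read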
There is no Hue in CloudxLab, please enable Hue.
// Parse a CSV file with opencsv, one partition at a time.
import au.com.bytecode.opencsv.CSVParser

val linesRdd = sc.textFile("/data/spark/temps.csv")

// Takes the lines of one partition and yields the parsed fields of each line.
def parseCSV(itr: Iterator[String]): Iterator[Array[String]] = {
  val parser = new CSVParser(',')
  for (line <- itr)
    yield parser.parseLine(line)
}

// Check with a simple in-memory example: yields Array("1","2","3") and then Array("a","b","c").
val x = parseCSV(Array("1,2,3", "a,b,c").iterator)

// Apply the parser to each partition of the RDD.
linesRdd.mapPartitions(parseCSV)
Sir, please explain the above code?
Hi Vaibhav,
1) You are importing the opencsv CSVParser library.
2) You are creating an RDD, linesRdd, from the file "/data/spark/temps.csv".
3) You are defining your custom function parseCSV(), which accepts an iterator of strings (the lines of one partition) and yields the parsed fields of each line as an Array[String].
4) You are calling the function on a small in-memory example to check that it works.
5) mapPartitions applies parseCSV to each partition of linesRdd, so every line is parsed into an array of fields.
All the best!
-- Satyajit Das
Here, what is the use of yield and Iterator? (Can't we use any other data type instead of an Iterator?)
Hi,
An Iterator is a way to access the elements of a collection one by one. The two basic operations on an iterator it are next and hasNext: it.next() returns the next element of the iterator, and it.hasNext checks whether there are more elements to return.
yield returns a value for each iteration of the for-comprehension and collects the results; when the input is an Iterator, the result is again an Iterator, which is exactly the shape that mapPartitions expects (Iterator[T] => Iterator[U]).
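For instance, a quick illustration you could try in the spark-shell:

val it = List(1, 2, 3).iterator
it.hasNext        // true: there are more elements
it.next()         // 1: returns the next element and advances the iterator

// A for-comprehension with yield over an iterator produces a new iterator,
// which is the function shape mapPartitions expects.
val doubled: Iterator[Int] = for (n <- List(1, 2, 3).iterator) yield n * 2
doubled.toList    // List(2, 4, 6)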
All the best!
-- Satyajit Das
Hi,
When I try to load a 1.6 GB CSV file using sc.textFile, the number of partitions I get is 50 (the data is on my local machine), whereas if I load the same file as a DataFrame, the number of partitions is only 13.
One thing I can assume is that when loading as an RDD the partition size seems to be 32 MB, whereas when loading as a DataFrame the partition size seems to be 128 MB.
Can you provide more clarification on this?
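A quick way to inspect the counts you are describing is to print the partition numbers directly in the spark-shell (a sketch; the path below is hypothetical, and it assumes Spark 2.x where the spark session is available in the shell):

// RDD read: the number of partitions follows the Hadoop input split size.
val rdd = sc.textFile("file:///home/cloudxlab/data_1_6gb.csv")  // hypothetical local path
println(rdd.getNumPartitions)

// DataFrame read: files are split according to spark.sql.files.maxPartitionBytes (128 MB by default).
val df = spark.read.option("header", "true").csv("file:///home/cloudxlab/data_1_6gb.csv")
println(df.rdd.getNumPartitions)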