Loading and Saving Data


Loading and Saving Data - Reading from Common Data Sources


INSTRUCTIONS

In the previous projects, we either converted an in-memory collection into an RDD using parallelize or loaded data from files on HDFS. In this hands-on project, we will learn to load and save different kinds of data from various sources into an RDD.
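For reference, here is a minimal sketch of both approaches in the Spark shell (Scala); the collection and the file path below are hypothetical.

// Create an RDD from an in-memory collection (hypothetical data)
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Create an RDD from a file on HDFS (hypothetical path)
val lines = sc.textFile("/data/spark/sample.txt")
lines.take(5).foreach(println)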

Spark supports a wide variety of data sources. It can access data through the InputFormat and OutputFormat interfaces provided by Hadoop, which are available for many common file formats and storage systems such as S3, HDFS, Cassandra, and HBase.
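As a rough illustration, a Hadoop InputFormat can be used directly from the Spark shell; the path below is hypothetical, and TextInputFormat is just one of many available formats.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Read a file through Hadoop's new-API TextInputFormat:
// keys are byte offsets, values are the lines themselves
val kv = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/data/spark/sample.txt")
val lines = kv.map { case (_, text) => text.toString }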

Data can be located on various storage systems and in various formats, and Spark can handle many of them. Examples of file formats are plain text, JSON, sequence files (binary key-value pair files), and protocol buffers. The files may also be compressed.
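A minimal sketch of a couple of these formats, assuming hypothetical paths; sc.textFile reads compressed files such as .gz transparently.

// Compressed text is read transparently (hypothetical path)
val events = sc.textFile("/data/logs/events.json.gz")

// A key-value RDD can be written to and read back from a sequence file (hypothetical path)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveAsSequenceFile("/tmp/pairs-seq")
val back = sc.sequenceFile[String, Int]("/tmp/pairs-seq")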

Examples of storage systems are file systems such as the network file system (NFS), the Hadoop Distributed File System (HDFS), and Amazon S3. The storage could also be a relational database or a NoSQL store such as Cassandra, HBase, or Elasticsearch.
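The storage system is usually selected through the URI scheme of the path. The paths below are hypothetical; S3 access additionally needs credentials and the appropriate connector, and stores such as Cassandra, HBase, and Elasticsearch are accessed through their own connector libraries.

val fromLocal = sc.textFile("file:///home/user/data.txt")
val fromHdfs  = sc.textFile("hdfs:///user/someuser/data.txt")
val fromS3    = sc.textFile("s3a://my-bucket/data.txt")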

With Spark, you can access data from any of these storage systems in any of these formats, with different kinds of compression. Apart from this, Spark provides a very powerful DataFrames API, part of Spark SQL, for querying structured data sources such as JSON and Hive. We will cover it in a separate project.
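As a quick preview (the separate project covers it in detail), a JSON source can be loaded through the DataFrames API. This sketch assumes a Spark 2.x SparkSession named spark and a hypothetical path.

val people = spark.read.json("/data/spark/people.json")
people.printSchema()
people.show(5)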

A file or a large data set could be in any format. If we know the format up front, we can read it using a reader specific to that format; if we do not, we can identify it with a UNIX utility such as file. Let us try to understand the various formats quickly.



14 Comments

I am trying to read a CSV file "user/avishekbrutus8015/Avishek/data_nyc.csv" into a Jupyter notebook and getting a "File not found" error. Please help me locate the file on Jupyter.


Hi,

Please make sure that the file exists in the specified location before reading it.

Thanks.


I do see the file here:

"user/avishekbrutus8015/Avishek/data_nyc.csv" 


Could you attach a screenshot of the following command and its output:

ls user/avishekbrutus8015/Avishek/data_nyc.csv

 


Ambari:


I don't see it from the console, so how do I upload data from Ambari?


Hi Kumar,

The path should start with "/user" and not "user". Can you please verify that this file exists? Please run the below command on your web console:

hadoop fs -ls /user/avishekbrutus8015/Avishek/data_nyc.csv

 


Adding to Abhinav's answer: after checking that the file exists in HDFS, if you want to read it via a Jupyter notebook, you might want to copy the file from HDFS to the local file system and then use that local path to read the file. Alternatively, you may use PySpark and read the file directly from HDFS using sc.textFile.

Thanks.
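For example, in the Spark shell (Scala) the file from this thread could be read directly from HDFS, assuming it exists at that path:

val nyc = sc.textFile("/user/avishekbrutus8015/Avishek/data_nyc.csv")
nyc.take(5).foreach(println)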


There is no Hue in CloudxLab. Please enable Hue.

 


import au.com.bytecode.opencsv.CSVParser

val linesRdd = sc.textFile("/data/spark/temps.csv")

// Parse each line of a partition into an array of fields,
// creating only one CSVParser per partition
def parseCSV(itr: Iterator[String]): Iterator[Array[String]] = {
  val parser = new CSVParser(',')
  for (line <- itr)
    yield parser.parseLine(line)
}

// Check with a simple in-memory example
val x = parseCSV(Array("1,2,3", "a,b,c").iterator)

// Apply the parser once per partition of the RDD
linesRdd.mapPartitions(parseCSV)

Sir, please explain the above code?


Hi, Vaibhav.

1) You are importing the CSVParser library.
2) You are creating an RDD, linesRdd, from the file "/data/spark/temps.csv".
3) You are defining a custom function, parseCSV(), which accepts an Iterator of strings and yields each line parsed into an array of fields.
4) You are calling the function on a small in-memory example to check that it works.
5) You are applying parseCSV to each partition of linesRdd with mapPartitions, so the parser is created once per partition rather than once per line.

All the best!

-- Satyajit Das


Here, what is the use of yield and Iterator? (Can't we use any other data type instead of Iterator?)


Hi,
An Iterator is a way to access the elements of a collection one by one. The two basic operations on an iterator it are next and hasNext: it.next() returns the next element of the iterator, and it.hasNext checks whether there are more elements to return.
In a for-comprehension, yield collects the values it produces; since the input here is an Iterator, the result is again an Iterator. We use an Iterator because mapPartitions expects a function from Iterator to Iterator, which it applies to each partition.
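A tiny illustration, with made-up values:

val it = Iterator("1,2,3", "a,b,c")
// yield builds a new Iterator of the produced values
val parsed = for (line <- it) yield line.split(",")
parsed.foreach(arr => println(arr.mkString(" | ")))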

All the best!

-- Satyajit Das


Hi,
When I try to load a 1.6 GB CSV file using sc.textFile, I get 50 partitions (the data is on my local machine), whereas if I load the same file as a DataFrame, I get only 13 partitions.

One thing I can assume is that when loading as an RDD the partition size seems to be 32 MB, whereas when loading as a DataFrame the partition size seems to be 128 MB.

Can you provide more clarification on this?
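For anyone who wants to reproduce this, a minimal sketch of checking both partition counts (the path is hypothetical, and the exact counts depend on configuration):

val rdd = sc.textFile("file:///home/user/big.csv")
println(rdd.getNumPartitions)      // follows the Hadoop input split size (32 MB is the local file system default)
val df = spark.read.option("header", "true").csv("file:///home/user/big.csv")
println(df.rdd.getNumPartitions)   // governed by spark.sql.files.maxPartitionBytes (128 MB by default)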
