In the previous projects, we either created an RDD by parallelizing an in-memory collection or loaded data from files in HDFS. In this hands-on project we will learn to load and save different kinds of data from various sources into an RDD.
Spark supports a wide variety of data sources. It can access data through the InputFormat and OutputFormat interfaces provided by Hadoop, which are available for many common file formats and storage systems such as S3, HDFS, Cassandra, and HBase.
Data can be located in various stores and in various formats, and Spark can handle a wide range of both. Examples of file formats are plain text, JSON, sequence files (binary key-value pair files), and protocol buffers; the files may also be compressed.
Examples of stores are file systems such as a network file system, the Hadoop distributed file system, and Amazon S3. The store could also be a relational database, or a NoSQL store such as Cassandra, HBase, or Elasticsearch.
With Spark you can access data from any of these storage systems, in any of these formats, with different kinds of compression. Apart from this, Spark provides a very powerful DataFrames API, which is part of Spark SQL, for querying structured data sources such as JSON and Hive. We will cover it in a separate project.
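For instance, with the core RDD API the storage system is selected simply by the scheme of the path passed to sc.textFile. A minimal spark-shell sketch (the paths and the bucket name below are hypothetical):

// The URI scheme of the path decides which storage system Spark reads from.
val localRdd = sc.textFile("file:///home/cloudxlab/sample.txt")  // local file system (hypothetical path)
val hdfsRdd  = sc.textFile("hdfs:///data/spark/sample.txt")      // HDFS (hypothetical path)
val s3Rdd    = sc.textFile("s3a://my-bucket/sample.txt")         // Amazon S3 (hypothetical bucket)

// Each call returns an RDD[String] with one element per line, regardless of the store.
hdfsRdd.take(5).foreach(println)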
A file or a large data set could be in any format. If we know the format up front, we can read it using a format-specific reader; if we do not, we can detect it with a UNIX utility such as file. Let us try to understand the various formats quickly.
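To make this concrete, here is a small spark-shell sketch of reading and writing a couple of these formats. The paths are hypothetical, the compressed-text read assumes such a file exists, and the output paths must not already exist:

import org.apache.hadoop.io.compress.GzipCodec

// Plain text: gzip-compressed text files are decompressed transparently on read.
val lines = sc.textFile("/data/spark/sample.txt.gz")

// Sequence files: binary key-value pairs.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveAsSequenceFile("/tmp/seq-demo")
val back = sc.sequenceFile[String, Int]("/tmp/seq-demo")

// Text output can be compressed by passing a codec class.
lines.saveAsTextFile("/tmp/text-gz-demo", classOf[GzipCodec])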
14 Comments
I am trying to read a CSV file "user/avishekbrutus8015/Avishek/data_nyc.csv" into a Jupyter notebook and getting a "File not found" error. Please help me locate the file on Jupyter.
Hi,
Please make sure that the file exists in the specified location before reading it.
Thanks.
I do see the file here:
"user/avishekbrutus8015/Avishek/data_nyc.csv"
Could you attach a screenshot of the command and the output of the following command:
ambari:
I don't see it from the console, so how do I upload data from Ambari?
Hi Kumar,
The path should start with "/user", not "user". Can you please verify that the file exists? Please run the below command on your web console.
Adding to Abhinav's answer: after checking that the file exists in HDFS, if you want to read it via a Jupyter notebook, you might want to copy the file from HDFS to local and then use that local path to read it. Else, you may use PySpark and read the file directly using sc.textFile.
Thanks.
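For example, once the file is confirmed to exist at the absolute HDFS path, a quick read in the Scala spark-shell would look like this (a sketch, assuming the file really is at the path mentioned in this thread):

// Read the CSV from HDFS as plain lines; note the path starts with /user, not user.
val nycRdd = sc.textFile("/user/avishekbrutus8015/Avishek/data_nyc.csv")
nycRdd.take(5).foreach(println)   // print the first few lines to verify the read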
There is no Hue in CloudxLab, please enable Hue.
// Parse a CSV file with opencsv, one partition at a time.
import au.com.bytecode.opencsv.CSVParser

val linesRdd = sc.textFile("/data/spark/temps.csv")

// Takes the lines of one partition and yields the parsed fields of each line.
def parseCSV(itr: Iterator[String]): Iterator[Array[String]] = {
  val parser = new CSVParser(',')
  for (line <- itr)
    yield parser.parseLine(line)
}

// Check with a simple in-memory example: yields Array("1","2","3") and then Array("a","b","c").
val x = parseCSV(Array("1,2,3", "a,b,c").iterator)

// Apply the parser to each partition of the RDD.
linesRdd.mapPartitions(parseCSV)
Sir, please explain the above code?
Hi Vaibhav,
1) You are importing the opencsv CSVParser library.
2) You are creating an RDD, linesRdd, from the file "/data/spark/temps.csv".
3) You are defining your custom function parseCSV(), which accepts an iterator of strings (the lines of one partition) and yields the parsed fields of each line as an Array[String].
4) You are calling the function on a small in-memory example to check that it works.
5) mapPartitions applies parseCSV to each partition of linesRdd, so every line is parsed into an array of fields.
All the best!
-- Satyajit Das
Here, what is the use of yield and Iterator? (Can't we use any other data type instead of an Iterator?)
Hi,
An Iterator is a way to access the elements of a collection one by one. The two basic operations on an iterator it are next and hasNext: it.next() returns the next element of the iterator, and it.hasNext checks whether there are more elements to return.
yield returns a value for each iteration of the for-comprehension and collects the results; when the input is an Iterator, the result is again an Iterator, which is exactly the shape that mapPartitions expects (Iterator[T] => Iterator[U]).
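For instance, a quick illustration you could try in the spark-shell:

val it = List(1, 2, 3).iterator
it.hasNext        // true: there are more elements
it.next()         // 1: returns the next element and advances the iterator

// A for-comprehension with yield over an iterator produces a new iterator,
// which is the function shape mapPartitions expects.
val doubled: Iterator[Int] = for (n <- List(1, 2, 3).iterator) yield n * 2
doubled.toList    // List(2, 4, 6)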
All the best!
-- Satyajit Das
Hi,
When I try to load a 1.6 GB CSV file using sc.textFile, the number of partitions I get is 50 (the data is on my local machine), whereas if I load the same file as a DataFrame, the number of partitions is only 13.
One thing I can assume is that when loading as an RDD the partition size seems to be 32 MB, whereas when loading as a DataFrame the partition size seems to be 128 MB.
Can you provide more clarification on this?
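A quick way to inspect the counts you are describing is to print the partition numbers directly in the spark-shell (a sketch; the path below is hypothetical, and it assumes Spark 2.x where the spark session is available in the shell):

// RDD read: the number of partitions follows the Hadoop input split size.
val rdd = sc.textFile("file:///home/cloudxlab/data_1_6gb.csv")  // hypothetical local path
println(rdd.getNumPartitions)

// DataFrame read: files are split according to spark.sql.files.maxPartitionBytes (128 MB by default).
val df = spark.read.option("header", "true").csv("file:///home/cloudxlab/data_1_6gb.csv")
println(df.rdd.getNumPartitions)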