DataFrames, Spark SQL, R

2 / 18

Spark SQL - Dataframe Introduction

The core of Spark SQL is dataframe API. Let's try to understand dataframes by comparing it to RDD.

RDD or resilient distributed dataset is nothing but a list of distributed records where a record could be anything. The format of a record in an RDD could be anything. A record could have tuple, objects, strings, numbers or binary array.

There is no compulsion of any structure in an RDD. Hence, an RDD is generally called unstructured data. We do not know the field and their data types.

To process the data in an RDD we need to mostly write code by the way of transformations or actions. This is not as efficient of churning data as databases or R.

So, spark team while working with SparkSQL came up with dataframes similar to the dataframes of R.

The data frames are basically structured RDD very much like a database table. Each record or row has the same number of fields with the same data type.

While an RDD can accommodate any kind of data, a data frame can provide very simple processing using language such as SQL or R. We can easily filter, group or sort the records based on various fields' values.

A dataframe is a collection of data organized into names columns. Meaning, the data frame has columns and each column carries a name.

This data frame is distributed because under the hood the dataframe is an RDD. So, dataframe too has partitions each partition has subsets of rows and partitions could be located on multiple machines. Please note the whole record or row will not be distributed.

The concept of the dataframe is very similar to dataframe in R or dataframe in pandas library of python.

The dataframes can be constructed using the structured data files such as CSV or JSON, from the tables in HIVE, the tables in relational databases such MySQL, Oracle, Postgres, Microsoft SQL.

The dataframes can be converted from existing RDDs by parsing the records and imposing schema on the RDDs. If we can not directly load something into dataframe, we first create an RDD and then load RDD into dataframe.

We can access the dataframe using Scala, Python, R and Java. The API query is available in Scala, Python, R and Java. And this API includes the ways of churning or processing data using SQL queries or using R kind of conditions.