DataFrames, Spark SQL, R


Spark SQL - Dataframe Introduction

The core of Spark SQL is the DataFrame API. Let's try to understand dataframes by comparing them to RDDs.

An RDD, or resilient distributed dataset, is simply a distributed collection of records, and a record can be anything: a tuple, an object, a string, a number or a binary array.

An RDD does not impose any structure on its records. Hence, an RDD is generally treated as unstructured data: we do not know the fields or their data types.

To process the data in an RDD we mostly have to write code in the form of transformations and actions. This is not as convenient a way of churning data as databases or R provide.
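For instance, even a simple count requires hand-written transformations and an action. A minimal PySpark sketch, assuming a SparkContext `sc` (as in the pyspark shell) and a hypothetical log file path:

```python
# Count error lines in a log file using plain RDD operations.
lines = sc.textFile("hdfs:///data/app.log")            # hypothetical path
errors = lines.filter(lambda line: "ERROR" in line)    # transformation: keep error lines
count = errors.count()                                  # action: triggers the computation
print(count)
```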

So, while building Spark SQL, the Spark team came up with dataframes, similar to the dataframes of R.

A dataframe is basically a structured RDD, very much like a database table. Each record or row has the same number of fields with the same data types.

While an RDD can accommodate any kind of data, a dataframe allows very simple processing using a language such as SQL or R-style expressions. We can easily filter, group or sort the records based on the values of various fields.
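A minimal sketch of such operations, assuming a SparkSession `spark` and a hypothetical employees.csv with columns name, dept and salary:

```python
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

high_paid = df.filter(df["salary"] > 50000)       # filter on a column value
by_dept   = df.groupBy("dept").count()            # group records by a field
sorted_df = df.orderBy(df["salary"].desc())       # sort by a field
by_dept.show()
```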

A dataframe is a collection of data organized into named columns; that is, the dataframe has columns and each column carries a name.

The dataframe is distributed because, under the hood, it is an RDD. So a dataframe too has partitions: each partition holds a subset of rows, and the partitions can be located on multiple machines. Please note that an individual record or row will never be split across partitions.
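Since a dataframe is backed by an RDD, we can inspect how its rows are spread across partitions. A minimal sketch with a hypothetical file name:

```python
df = spark.read.json("people.json")
print(df.rdd.getNumPartitions())   # number of partitions the rows are spread across

# Repartitioning redistributes whole rows; a single row is never split across partitions.
df = df.repartition(8)
```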

The concept of the dataframe is very similar to the dataframe in R or the dataframe in the pandas library of Python.

Dataframes can be constructed from structured data files such as CSV or JSON, from tables in Hive, or from tables in relational databases such as MySQL, Oracle, Postgres or Microsoft SQL Server.
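A minimal sketch of constructing dataframes from different sources; the file paths, table names and connection details below are hypothetical:

```python
csv_df  = spark.read.csv("sales.csv", header=True, inferSchema=True)
json_df = spark.read.json("events.json")

hive_df = spark.sql("SELECT * FROM default.orders")        # from a Hive table

jdbc_df = (spark.read.format("jdbc")                        # from a relational database
           .option("url", "jdbc:mysql://dbhost:3306/shop")
           .option("dbtable", "customers")
           .option("user", "reader")
           .option("password", "secret")
           .load())
```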

Dataframes can also be created from existing RDDs by parsing the records and imposing a schema on them. If we cannot load something directly into a dataframe, we first create an RDD and then convert that RDD into a dataframe.
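A minimal sketch of this RDD-to-dataframe conversion, assuming a hypothetical people.txt whose lines look like "Alice,34":

```python
from pyspark.sql import Row

rdd = sc.textFile("people.txt")
parsed = rdd.map(lambda line: line.split(","))
rows = parsed.map(lambda p: Row(name=p[0], age=int(p[1])))   # impose structure on each record

people_df = spark.createDataFrame(rows)    # schema is inferred from the Row objects
people_df.printSchema()
```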

We can work with dataframes from Scala, Python, R and Java; the query API is available in all four languages. This API lets us churn or process the data using SQL queries or using R-style conditions.
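A minimal sketch of the two query styles on the same dataframe; the view name and columns are hypothetical, continuing the people_df example above:

```python
people_df.createOrReplaceTempView("people")

# SQL style
adults_sql = spark.sql("SELECT name FROM people WHERE age >= 18")

# DataFrame (R/pandas-like) style
adults_api = people_df.filter(people_df["age"] >= 18).select("name")

adults_sql.show()
adults_api.show()
```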

Spark - Dataframes & Spark SQL (Part1)



11 Comments

It is mentioned that the dataframe is distributed across multiple partitions. At 2:00 in the video, the dataframe is depicted such that the header/column names are stored in partition 1 along with some rows, and the other rows are in partition 2.

How will the headers/column names be stored? Will each partition have the header data, or will it be in centralised storage that is linked to each partition?


The header information is available on all nodes.


What is a real-world use case for a Spark dataframe? Is the workflow an RDD followed by a Spark dataframe? In real projects, do we always use Spark dataframes or Spark SQL to process huge data along with RDDs? Please explain,

since the following has been explained above:

To process the data in an RDD we mostly have to write code in the form of transformations and actions. This is not as convenient a way of churning data as databases or R provide.


If your data is unstructured, you will have to use an RDD. If your data is already structured, you can use a dataframe.


But as mentioned in the topic:

To process the data in an RDD we mostly have to write code in the form of transformations and actions. This is not as convenient a way of churning data as databases or R provide.

Is it that if the data is unstructured, we need to convert it to structured data and then use a dataframe?

So what is the procedure to follow here?


>> Is it that if the data is unstructured, we need to convert it to structured data and then use a dataframe?

Yes, that's right.


Also, as you finish the remaining videos, you will get a fair idea of how to convert an RDD into a dataframe, and the picture will become clearer for you.


partitions can be located on multiple machines.

Here, do the multiple machines refer to virtual space?


Hi Olivia,

Spark is a distributed system & is generally installed on a cluster of machines. These machines may be real physical machines or VMs hosted by a cloud computing provider, depending upon where Spark was installed.

Thanks


What is meant by "an individual record or row will never be split across partitions"?


During processing, most systems first define what a row or a record means. For example, when we use sc.textFile, one record is one line of text from the file. If we use sc.wholeTextFiles, one record is a pair (filename, content_of_file).

So, if we are using sc.wholeTextFiles and a single file does not fit in memory, the Spark application would throw an error during execution.
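A minimal sketch illustrating the difference, assuming a SparkContext `sc` and a hypothetical directory of log files:

```python
lines = sc.textFile("/data/logs/")         # one record = one line of text
files = sc.wholeTextFiles("/data/logs/")   # one record = (filename, entire file content)

print(lines.first())        # a single line of text
print(files.first()[0])     # the name of the first file
```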
