The core of Spark SQL is the DataFrame API. Let's try to understand dataframes by comparing them to RDDs.
An RDD, or resilient distributed dataset, is simply a distributed list of records, and a record could be anything: a tuple, an object, a string, a number, or a binary array.
An RDD does not enforce any structure, which is why it is generally referred to as unstructured data: we do not know the fields or their data types.
To process the data in an RDD, we mostly need to write code in the form of transformations or actions. This is not as efficient at churning data as databases or R.
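For instance, even a simple aggregation on an RDD means parsing every record by hand and remembering which position holds which field. Here is a minimal PySpark sketch, assuming a hypothetical people.csv file with name, age, and city fields:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# Hypothetical file: each line looks like "Alice,34,Mumbai"
lines = sc.textFile("people.csv")

# We parse every record ourselves and track field positions manually
records = lines.map(lambda line: line.split(","))
adults = records.filter(lambda f: int(f[1]) >= 18)            # field 1 = age
by_city = adults.map(lambda f: (f[2], 1)) \
                .reduceByKey(lambda a, b: a + b)               # field 2 = city

print(by_city.collect())
```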
So, while building Spark SQL, the Spark team came up with dataframes, similar to the dataframes of R.
A dataframe is basically a structured RDD, very much like a database table: each record or row has the same number of fields with the same data types.
While an RDD can accommodate any kind of data, a dataframe provides very simple processing using SQL or R-like expressions. We can easily filter, group, or sort the records based on the values of various fields.
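As a rough sketch of how much shorter the same aggregation becomes on a dataframe (the column names and file are still the hypothetical ones from the sketch above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-sketch").getOrCreate()

# Columns (name, age, city) are inferred from the header and the data
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# Filter, group, and sort by column name instead of positional parsing
result = (df.filter(df.age >= 18)
            .groupBy("city")
            .count()
            .orderBy("count", ascending=False))

result.show()
```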
A dataframe is a collection of data organized into named columns; in other words, the dataframe has columns, and each column carries a name.
This dataframe is distributed because, under the hood, the dataframe is an RDD. So a dataframe too has partitions: each partition holds a subset of the rows, and the partitions could be located on multiple machines. Please note that a whole record or row will not be split across partitions.
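Because the dataframe is backed by an RDD, we can peek at that partition layout; a tiny illustrative check, continuing with the `df` from the previous sketch:

```python
# The underlying RDD exposes the dataframe's partition layout
print(df.rdd.getNumPartitions())

# Repartitioning spreads the rows over more partitions (and machines),
# but each individual row stays whole inside exactly one partition
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())
```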
The concept of the dataframe is very similar to the dataframe in R or the dataframe in the pandas library of Python.
Dataframes can be constructed from structured data files such as CSV or JSON, from tables in Hive, or from tables in relational databases such as MySQL, Oracle, Postgres, and Microsoft SQL Server.
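A hedged sketch of these entry points; the file names, table name, and JDBC connection details below are placeholders, and the Hive and JDBC reads only work if the cluster is configured for them:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("load-sketch")
         .enableHiveSupport()
         .getOrCreate())

# From structured files
csv_df  = spark.read.csv("data.csv", header=True, inferSchema=True)
json_df = spark.read.json("data.json")

# From a Hive table
hive_df = spark.sql("SELECT * FROM some_hive_table")

# From a relational database over JDBC (the driver jar must be available)
jdbc_df = spark.read.jdbc(
    url="jdbc:mysql://dbhost:3306/mydb",
    table="customers",
    properties={"user": "dbuser", "password": "dbpass"})
```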
Dataframes can also be created from existing RDDs by parsing the records and imposing a schema on them. If we cannot load something directly into a dataframe, we first create an RDD and then convert that RDD into a dataframe.
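A minimal sketch of imposing a schema on an RDD; the field names and file are again hypothetical:

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
sc = spark.sparkContext

# Raw, unstructured lines such as "Alice,34,Mumbai"
lines = sc.textFile("people.csv")

# Parse each record and impose a schema by mapping it to a Row
rows = (lines.map(lambda line: line.split(","))
             .map(lambda f: Row(name=f[0], age=int(f[1]), city=f[2])))

people_df = spark.createDataFrame(rows)
people_df.printSchema()
```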
We can access dataframes using Scala, Python, R, and Java; the dataframe API is available in all four languages. This API includes ways of churning or processing data using SQL queries or using R-style column expressions.
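For example, continuing with the hypothetical `people_df` from the previous sketch, the same data can be queried either with SQL or with the dataframe methods:

```python
# Register the dataframe as a temporary view and query it with SQL
people_df.createOrReplaceTempView("people")
spark.sql("SELECT city, COUNT(*) AS cnt FROM people GROUP BY city").show()

# The equivalent using dataframe (R-like) expressions
people_df.groupBy("city").count().show()
```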
11 Comments
It is mentioned that the dataframe is distributed across multiple partitions. At 2:00 in the video, the dataframe is depicted such that the header/column names are stored in partition 1 along with some rows, and the other rows are in partition 2.
How will the headers/column names be stored? Will each partition have the header data, or will it be kept in a centralised store that is linked to each partition?
The header information is available on all nodes.
What is a real-time scenario or use case for a Spark dataframe? Is it like RDD followed by Spark dataframe? In a real project, do we always use Spark dataframes or Spark SQL along with RDDs to process huge data? Please explain.
I am asking because the following was explained in the lesson above: "To process the data in an RDD, we mostly need to write code in the form of transformations or actions. This is not as efficient at churning data as databases or R."
If your data is unstructured, you will have to use an RDD. If your data is already structured, you can use a dataframe.
But as mentioned in the topic: "To process the data in an RDD, we mostly need to write code in the form of transformations or actions. This is not as efficient at churning data as databases or R."
Is it like, if data is unstructured, we need to convert it to structured data and use a dataframe?
So what is the procedure to follow here?
>> Is it like, if data is unstructured, we need to convert it to structured data and use a dataframe?
Yes, that's right.
Also, as you finish the remaining videos, you will get a fair idea about how to convert an RDD into a dataframe, and the picture will become clearer for you.
"Partitions could be located on multiple machines."
Here, does "multiple machines" refer to virtual space?
Hi Olivia,
Spark is a distributed system and is generally installed on a cluster of machines. These machines may be real physical machines or VMs hosted by a cloud-computing provider, depending on where Spark was installed.
Thanks
What is meant by "Please note the whole record or row will not be distributed."?
During processing, most systems first define what a row or record means. For example, with sc.textFile, one record is a line of text from the file. If we use sc.wholeTextFiles, one record is a pair (filename, content_of_file).
So, if we are using sc.wholeTextFiles and one file does not fit in memory, the Spark application would throw an error during execution.
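A small sketch contrasting the two record definitions mentioned here; the paths are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="record-granularity")

# One record per line: a large file is split across partitions line by line
lines = sc.textFile("/data/logs/big.log")
print(lines.first())          # a single line of text

# One record per file: each (filename, content) pair must fit in memory
files = sc.wholeTextFiles("/data/logs/")
print(files.first()[0])       # the filename of the first file
```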