DataFrames, Spark SQL, R


Spark SQL - RDD and DataFrame Interoperability

Earlier we saw that the spark.read.json function creates a DataFrame directly from a JSON file. That was easy because a DataFrame needs to know its column names and their data types, and Spark can infer both from the field names and values in the JSON records.

What if we want to create a DataFrame from unstructured data? Unstructured data carries no column names or data types.

We would first create an RDD, as we learned earlier, and then convert that RDD to a DataFrame. But how?

Spark SQL supports two different methods for converting existing RDDs into DataFrames.

The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating DataFrames is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime.