DataFrames, Spark SQL, R

Spark SQL - Create DataFrame from JSON

Let us create a DataFrame from a JSON file. The file people.json is located in HDFS at /data/spark/. To inspect it, you can either use the File Browser in Hue or run hadoop fs -cat /data/spark/people.json from the web console or an SSH session.

You can see that this file contains three JSON objects, one per line, separated by newlines. Please note that the file as a whole is not a valid JSON document, while each individual line is a valid JSON object.
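The difference between "each line is valid JSON" and "the whole file is valid JSON" can be checked with a small sketch that uses only the Python standard library. The sample text below is hypothetical: it only assumes three records with a name field, the first of which has no age, and is not copied from the actual file.

```python
import json

# Hypothetical sample content: three records, one per line (an assumption,
# not the verified contents of people.json).
SAMPLE = (
    '{"name": "Michael"}\n'
    '{"name": "Andy", "age": 30}\n'
    '{"name": "Justin", "age": 19}\n'
)


def parse_json_lines(text):
    """Parse newline-delimited JSON: one independent object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_single_json_document(text):
    """Return True only if the entire text parses as one JSON value."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


records = parse_json_lines(SAMPLE)
print(len(records))                     # prints 3
print(is_single_json_document(SAMPLE))  # prints False
```

Parsing line by line succeeds, while parsing the text as a single document fails because a second object follows the first.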

To load standard formats as a DataFrame, the Spark session provides a read object that has various methods. Here we call the json method on the read object of spark. This is very similar to the way people usually load data in R.
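As a minimal sketch of the call described above, written for PySpark: the helper name load_people is hypothetical, and spark is assumed to be an already created SparkSession such as the one the pyspark shell provides on startup.

```python
def load_people(spark, path="/data/spark/people.json"):
    """Read a line-delimited JSON file into a DataFrame.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    `load_people` is a hypothetical helper name used only for illustration.
    """
    # spark.read is a DataFrameReader. Its json() method expects one JSON
    # object per line by default and infers the schema from the records.
    return spark.read.json(path)
```

In a pyspark shell this could be used as df = load_people(spark), followed by df.show() to print the rows.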

Here the location is resolved against HDFS by default. The df variable refers to the constructed DataFrame. Please note that, like RDDs, DataFrames are lazily evaluated. On df we can call various methods such as join, map, flatMap, and reduce, as well as other transformations. We can also query df using a SQL-like interface or an R-like mechanism.
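The two query styles can be sketched in PySpark as follows. The function names, the view name people, and the name column are assumptions made for illustration; spark is assumed to be a SparkSession and df a DataFrame that has name and age columns.

```python
def names_older_than_sql(spark, df, min_age):
    """Query through the SQL-like interface by registering a temporary view."""
    # "people" is a hypothetical view name chosen for this sketch.
    df.createOrReplaceTempView("people")
    return spark.sql(f"SELECT name FROM people WHERE age > {int(min_age)}")


def names_older_than_methods(df, min_age):
    """Express the same query by chaining DataFrame methods."""
    # filter() accepts a SQL expression string; select() picks the columns.
    return df.filter(f"age > {int(min_age)}").select("name")
```

Both functions only describe the computation. Because DataFrames are lazily evaluated, nothing is read or filtered until an action such as show() or collect() is called on the returned DataFrame.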

So, to see the data, we need to call an action such as df.show(); the result is then displayed on the screen.

You can see that Spark has inferred the structure of the data from the JSON. It has figured out the column names from the JSON objects and has also fit the values into the respective columns.

The first JSON object did not have an age attribute. Still, Spark was able to figure out the name for that column because age exists in the other JSON objects. The missing value is shown as null.
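The idea behind this can be illustrated with a simplified pure-Python sketch. It is not Spark's actual inference code: it only takes the union of keys across all records and fills missing values with None. The sample records and the infer_columns name are hypothetical.

```python
import json


def infer_columns(lines):
    """Collect the union of keys across all records and align each row to it."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Sorting keeps the column order deterministic for this sketch.
    columns = sorted({key for record in records for key in record})
    # A record that lacks a key gets None, the Python counterpart of null.
    rows = [tuple(record.get(col) for col in columns) for record in records]
    return columns, rows


# Hypothetical records: the first one has no "age" key.
columns, rows = infer_columns([
    '{"name": "Michael"}',
    '{"name": "Andy", "age": 30}',
    '{"name": "Justin", "age": 19}',
])
print(columns)  # prints ['age', 'name']
print(rows[0])  # prints (None, 'Michael')
```

The age column appears even though the first record lacks it, because the column set is built from every record rather than from the first one alone.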

So, you can see that this is a pretty sophisticated way of processing big data available in JSON format.