Let us try to create a DataFrame from a JSON file. The file people.json is located in HDFS at /data/spark/. You can inspect it either with the File Browser in Hue or with hadoop fs -cat from the web console or over ssh.
You can see that this file contains three JSON objects, one per line, separated by newlines. Note that while each individual line is a valid JSON object, the file as a whole does not represent a valid JSON document.
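For reference, this is the standard Spark sample file; its contents typically look like the following (a sketch assuming the usual people.json example; the exact file on your cluster may differ):

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}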
To load standard formats as DataFrames, the Spark session provides a read object which has various methods. Here we call the json method on spark.read. This is very similar to the way people usually load data in R.
Here the location is by default resolved against HDFS. The df variable refers to the constructed DataFrame. Please note that, like RDDs, DataFrames are also lazily evaluated. On df we can call various methods such as join, map, flatMap, reduce and other transformations, and we can query it using a SQL-like interface or an R-like mechanism.
So, to see the data, we call df.show() and the result is displayed on the screen.
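Putting it together, a Spark shell session would look roughly like this (a sketch; the inferred schema and the output depend on the actual contents of people.json):

scala> val df = spark.read.json("/data/spark/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+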
You can see that it has inferred the structure of the data from the JSON. It has figured out the column names from the JSON objects and also fit the values into the respective columns.
The first JSON object did not have an age attribute; still, Spark was able to figure out that column because age exists in the other JSON objects.
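You can also print the inferred schema to confirm this (a sketch; the inferred types depend on your data):

scala> df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)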
So, you can see that this is a pretty sophisticated way of processing big data available in JSON format.
22 Comments
Why can't I run the hadoop command in Jupyter?
Hi,
This Jupyter notebook is configured to run Scala code snippets, so please use the console to run hadoop/bash commands.
Thanks.
In the JSON format observed via hadoop fs, one can see name first followed by age, while in Spark SQL we see the age column first. Is there any sequencing followed in Spark, such as ordering column names by ASCII value?
Yes, a Spark DataFrame does sort the columns (the keys of the JSON objects) by ASCII value. I created another JSON file and added some columns to confirm it.
scala> df.show()
+------+----+----+------+-------+
|   123| ABC| age|gender|   name|
+------+----+----+------+-------+
|   num|null|null|  male|Michael|
|number|null|  30|female|   Andy|
|  nums| 145|  19|  male| Justin|
+------+----+----+------+-------+
I got the below warning:
WARN DataSource: Error while looking for metadata directory.
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
What exactly does this warning represent?
"Spark was able to figure out the column name for that column because age is existing in other JSON objects."
Please explain this line: the first row has age as a null value, so what does "figure out the column name for that column" signify here?
Hi, Amit
Since the "age" column exists in the JSON file, Spark was able to figure out what is a column and what is data.
All the best!
Hi,
I created:
val spark2 = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
This is the warning:
20/07/24 05:01:34 WARN sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6419726f
Then:
var df = spark2.read.load("/data/spark/people.json")
ERROR executor.Executor: Exception in task 0.0 in stage 2.0 (TID 2)
org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:290)
I can't read the file, so I cannot continue studying. Please reply soon.
The error messages:
ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3)
org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=hdfs://cxln1.c.thelab-240901.internal:8020/data/spark/people.json; isDirectory=false; length=73; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
Caused by: java.lang.RuntimeException: hdfs://cxln1.c.thelab-240901.internal:8020/data/spark/people.json is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 57, 125, 10]
Hi, Sapna.
You are on the right track; just use read.json instead of the generic read.load (which defaults to the Parquet format, hence the "is not a Parquet file" error):
// Read the JSON file into a DataFrame.
var df = spark.read.json("/data/spark/people.json")
// Display the contents of the DataFrame on stdout.
df.show()
All the best!
scala> var df = spark.read.json("/data/spark/people.json")
20/07/19 06:38:06 WARN DataSource: Error while looking for metadata directory.
I am facing the above warning. What's wrong with the statement?
You can safely ignore the warning.
You can use df.show() to check if it is getting loaded correctly.
Upvote ShareI was trying
Upvote Sharevar df = spark.resd.json("/data/spark/people.json")
could I get immediate resolution on this
Upvote Sharegetting error: "WARN Datasource: Error while looking for metadata directory"
Upvote ShareHi,
I set up PySpark in a Jupyter notebook and executed the following successfully:
df = spark.read.json('data/spark/people.json')
Do I not need to create the spark object using SparkSession, like below, beforehand?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
Hi Sandeep
The Spark shell is reporting an error as shown below.
Please help
Regards
Sayeef
Hi, Sayeef.
Don't export any variable on the Web-console.
Why did we not use sc here?
I encounter an error when importing this, please help:
scala> import spark.implicits._
error: value implicits is not a member of object org.apache.spark.sql.SparkSession
import spark.implicits._
It seems to work fine.