DataFrames, Spark SQL, R


Spark SQL - Create DataFrame from JSON

Let us try to create a DataFrame from a JSON file. The file people.json is located in HDFS at /data/spark/. To inspect it, you can either use the File Browser within Hue or run hadoop fs -cat from the web console or an SSH session.

You can see that this file contains three JSON objects, one per line, separated by newlines. Note that the complete file is not a valid JSON document; it is line-delimited JSON, where each individual line is a valid JSON object.
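The lesson does not reproduce the file contents here, but judging from the df.show() output discussed in the comments below, it matches the standard Spark example data, roughly:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}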

To load standard formats as a DataFrame, the Spark session provides a read object that exposes various methods. Here we call the json method on spark.read. This is very similar to the way people usually load data in R.
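For example, in the spark-shell (where the spark session object is already available):

// Load the line-delimited JSON file from HDFS into a DataFrame
val df = spark.read.json("/data/spark/people.json")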

Here, the path is by default resolved against HDFS. The df variable refers to the constructed DataFrame. Please note that, like RDDs, DataFrames are lazily evaluated. On df we can call various methods such as join, map, flatMap, reduce, and other transformations, and we can query it either through a SQL-like interface or through an R-like mechanism, as shown in the sketch below.
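Here is a minimal sketch of the SQL-like interface, assuming the df loaded above; the view name "people" is only illustrative:

// Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")

// Run a SQL query against the view and display the result
spark.sql("SELECT name, age FROM people WHERE age > 20").show()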

So, to see the data, we need to call df.show(); the result is displayed on the screen.
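Assuming the standard example data shown earlier, the output looks roughly like this:

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+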

You can see that it has inferred the structure of the data from the JSON. It has figured out the column names from the JSON keys and fitted the values into the respective columns.

The first JSON object did not have an age attribute, yet Spark was still able to work out the column name for that column, because age exists in the other JSON objects; the missing value simply appears as null.
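You can also confirm the inferred schema with printSchema(); given the DataFrame type reported in the comments below ([age: bigint, name: string]), the output should look like:

scala> df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)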

So, you can see that this is a pretty sophisticated way of processing big data that is available in JSON format.

Spark - Dataframes & Spark SQL (Part1)



22 Comments

Why can't I run the hadoop command in Jupyter?


Hi,

This Jupyter notebook is configured to run Scala code snippets, so please use the console to run hadoop/bash commands.

Thanks.


In the JSON file observed via hadoop fs, one can see name first followed by age, while in Spark SQL the age column appears first. Is there any sequencing followed in Spark, such as ordering column names by ASCII value?


Yes, the Spark DataFrame does sort the columns (the keys of the JSON objects) by ASCII value. I created another JSON file with some extra columns to confirm it.

scala> df.show()
+------+----+----+------+-------+
|   123| ABC| age|gender|   name|
+------+----+----+------+-------+
|   num|null|null|  male|Michael|
|number|null|  30|female|   Andy|
|  nums| 145|  19|  male| Justin|
+------+----+----+------+-------+


I got the below warning:

WARN DataSource: Error while looking for metadata directory.df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

What exactly does this warning represent?


"Spark was able to figure out the column name for that column because age is existing in other JSON objects."

Please explain this line. The first row has age as a null value, so what does "figure out the column name for that column" signify?

Hi, Amit

As the "age" column is existing in the JSON file so spark was able to figure it out what is column and what is data. 

All the best!


Hi,

I created

val spark2 = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

This is the warning:

20/07/24 05:01:34 WARN sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.spark2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6419726f

var df = spark2.read.load("/data/spark/people.json")

ERROR executor.Executor: Exception in task 0.0 in stage 2.0 (TID 2)org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:290)

I can't read the file, so I cannot continue studying.

Please reply soon.


 

The error messages:

ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3)org.apache.spark.SparkException: Exception thrown in awaitResult:

Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=hdfs://cxln1.c.thelab-240901.internal:8020/data/spark/people.json; isDirectory=false; length=73; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

 

Caused by: java.lang.RuntimeException: hdfs://cxln1.c.thelab-240901.internal:8020/data/spark/people.json is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 57, 125, 10]

 


Hi, Sapna. 

You are almost right; note that read.load defaults to the Parquet format, which is why it complains that the file "is not a Parquet file". Use read.json to read a JSON file:

// Read the JSON file with read.json
var df = spark.read.json("/data/spark/people.json")

// Display the contents of the DataFrame on stdout
df.show()


All the best!



scala> var df = spark.read.json("/data/spark/people.json")

20/07/19 06:38:06 WARN DataSource: Error while looking for metadata directory.

I am facing the above warning. What's wrong with the statement?


You can safely ignore the warning.

You can use df.show() to check if it is getting loaded correctly.


I was trying
var df = spark.resd.json("/data/spark/people.json")


Could I get an immediate resolution on this?


getting error: "WARN Datasource: Error while looking for metadata directory"


Hi,

I set up PySpark in a Jupyter notebook and executed the following successfully:
df = spark.read.json('data/spark/people.json')

Do I not need to create the spark object using SparkSession beforehand, like below?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()


Hi Sandeep

Spark shell is reporting error as shown below

Please help

Regards
Sayeef


Hi, Sayeef.
Don't export any variables in the web console.

Just type spark-shell.


Why did we not use sc over here?


I encountered an error when importing this, please help:

scala> import spark.implicits._
error: value implicits is not a member of object org.apache.spark.sql.SparkSession
import spark.implicits._


It seems to work fine.
