Let us try to create a DataFrame from a JSON file. The file people.json is located in HDFS at /data/spark/. You can inspect it either with the File Browser in Hue or with hadoop fs -cat from the web console or over ssh.
You can see that this file contains three JSON objects, one per line, separated by newlines. Note that while each individual line is a valid JSON object, the file as a whole does not represent a valid JSON document.
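For reference, this is the standard Spark sample file; its contents typically look like the following (a sketch assuming the usual people.json example; the exact file on your cluster may differ):

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}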
To load standard formats as DataFrames, the Spark session provides a read object which has various methods. Here we call the json method on spark.read. This is very similar to the way people usually load data in R.
Here the location is by default resolved against HDFS. The df variable refers to the constructed DataFrame. Please note that, like RDDs, DataFrames are also lazily evaluated. On df we can call various methods such as join, map, flatMap, reduce and other transformations, and we can query it using a SQL-like interface or an R-like mechanism.
So, to see the data, we call df.show() and the result is displayed on the screen.
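Putting it together, a Spark shell session would look roughly like this (a sketch; the inferred schema and the output depend on the actual contents of people.json):

scala> val df = spark.read.json("/data/spark/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+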
You can see that it has inferred the structure of the data from the JSON. It has figured out the column names from the JSON objects and also fit the values into the respective columns.
The first JSON object did not have an age attribute; still, Spark was able to figure out that column because age exists in the other JSON objects.
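You can also print the inferred schema to confirm this (a sketch; the inferred types depend on your data):

scala> df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)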
So, you can see that this is a pretty sophisticated way of processing big data available in JSON format.
22 Comments
Why can't I run the hadoop command in Jupyter?
Hi,
This Jupyter notebook is configured to run Scala code snippets, so please use the console to run hadoop/bash commands.
Thanks.
In the JSON format observed via hadoop fs, one can see name first followed by age, while in Spark SQL we see the age column first. Is there any sequencing followed in Spark, such as ordering column names by ASCII value?
Yes, a Spark DataFrame does sort the columns (the keys of the JSON objects) by ASCII value. I created another JSON file and added some columns to confirm it.
scala> df.show()
+------+----+----+------+-------+
|   123| ABC| age|gender|   name|
+------+----+----+------+-------+
|   num|null|null|  male|Michael|
|number|null|  30|female|   Andy|
|  nums| 145|  19|  male| Justin|
+------+----+----+------+-------+
I got the below warning:
WARN DataSource: Error while looking for metadata directory.
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
What exactly does this warning represent?
"Spark was able to figure out the column name for that column because age is existing in other JSON objects."
Please explain this line: the first row has age as a null value, so what does "figure out the column name for that column" signify here?
Hi, Amit
Since the "age" column exists in the JSON file, Spark was able to figure out what is a column and what is data.
All the best!
Hi,
I created:
val spark2 = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
This is the warning:
20/07/24 05:01:34 WARN sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6419726f
Then:
var df = spark2.read.load("/data/spark/people.json")
ERROR executor.Executor: Exception in task 0.0 in stage 2.0 (TID 2)
org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:290)
I can't read the file, so I cannot continue studying. Please reply soon.
The error messages:
ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3)
org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=hdfs://cxln1.c.thelab-240901.internal:8020/data/spark/people.json; isDirectory=false; length=73; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
Caused by: java.lang.RuntimeException: hdfs://cxln1.c.thelab-240901.internal:8020/data/spark/people.json is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 57, 125, 10]
Hi, Sapna.
You are on the right track; just use read.json instead of the generic read.load (which defaults to the Parquet format, hence the "is not a Parquet file" error):
// Read the JSON file into a DataFrame.
var df = spark.read.json("/data/spark/people.json")
// Display the contents of the DataFrame on stdout.
df.show()
All the best!
scala> var df = spark.read.json("/data/spark/people.json")
20/07/19 06:38:06 WARN DataSource: Error while looking for metadata directory.
I am facing the above warning. What's wrong with the statement?
You can safely ignore the warning.
You can use df.show() to check if it is getting loaded correctly.
Upvote ShareI was trying
Upvote Sharevar df = spark.resd.json("/data/spark/people.json")
could I get immediate resolution on this
Upvote Sharegetting error: "WARN Datasource: Error while looking for metadata directory"
Upvote ShareHi,
I set up PySpark in a Jupyter notebook and executed the following successfully:
df = spark.read.json('data/spark/people.json')
Do I not need to create the spark object using SparkSession, like below, beforehand?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
Hi Sandeep
The Spark shell is reporting an error as shown below.
Please help
Regards
Sayeef
Hi, Sayeef.
Don't export any variable on the Web-console.
Why did we not use sc here?
I encounter an error when importing this, please help:
scala> import spark.implicits._
error: value implicits is not a member of object org.apache.spark.sql.SparkSession
import spark.implicits._
It seems to work fine.