DataFrames, Spark SQL, R

14 / 18

Previous Index Next

Spark SQL - Using Hive tables

To make spark read and write hive tables, you need to configure Spark. You need to create a conf folder in your home directory.

And place your hadoop configuration files hive-site.xml, core-site.xml and hdfs-site.xml file in conf/ directory.

These configurations are not required on CloudxLab as it is already configured to run.

Start spark shell using the command shown on screen. You first need to import all classes from spark.implicits

If you have a table in Hive with the name a_student, you can simply load that table using spark.sql method using a query like select * from a_student.

Here it has created a dataframe with name df. Now, we can see the data inside using show() method.

Did you notice that the SQL query method is same as that of querying temporary view used earlier? It is very much possible to join a temporary view with a hive table using SQL queries. So, you can register any dataframe as a view and perform joins between data from view and SQL.

If you wish to read data from Hive using spark-submit, you would need to create the spark object using sparksession.builder method and specifying all kinds of configuration settings on this object.

Using spark-sql you can read data from various databases such as MySQL, Oracle or Microsoft SQL using the JDBC connectivity. JDBC stands for Java database connectivity. Using JDBC we would read almost every database from Java code. So, the same JDBC driver, you could read the data from your traditional database to create a dataframe which can be combined with other data.

Spark - Dataframes & Spark SQL (Part2)

Previous Index Next

Please login to comment

22 Comments

Pragya Shukla

4 years ago

Getting following error:

Table or view not found :

screenshot

This comment has been removed.

Vagdevi K

4 years ago

Hi,

As the error suggests, please make sure that the table (or view) named a_student exists, before selecting any records from it.

Thanks.

Pragya Shukla

4 years ago

how do i launch spark ? i am getting this error:

screenshot

Vagdevi K

4 years ago

Hi,

You may use the adjacent Jupyter notebook for this exercise.

Nevertheless, please use

spark-shell

to launch spark-shell in web console.

Thanks.

Bharat Sharma

5 years ago

can you pls explain, what are the properties that we need to change in hive-site.xml for hive spark integeration

Bharat Sharma

5 years ago

pls explain internal data flow between hive and spark

Satyajit Das

5 years ago

Hi, Bharat.

As mentioned in the lecture you have to create the Spark Session builder object for and give the required configurations settings connecting to various databases.

import org.apache.spark.sql.SparkSession

It is like a entry point to all DB.

Kindly refer for more details about fields and configurations you need to provide :

https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-SparkSession.html

All the best!

This comment has been removed.

dilip kumar

5 years ago

Hi

can you please check why am i getting error. I have table named "emp_details" and can access this table.

https://cloudxlab.com/assessment/displayslide/615/spark-sql-using-hive-tables?course_id=73&playlist_id=338

Thanks

Sandeep Giri

4 years ago

What error are you geting?

Pavan Kumar

5 years ago

in spark-sql we are using "val" while creating data frame where as while we are using with hive table s why are we using "var" while creating a dataframe

Dinesh Ughade

5 years ago

Hello Team,
var df = spark.sql("select * from a_student")
gives below error
org.apache.spark.sql.AnalysisException: Table or view not found: a_student; line 1 pos 14.
i did get above command. why does it give error?
please explain with solution.

Satyajit Das

5 years ago

Hi, Dinesh.

The error is self explanatory, can you check the Table inside your database?
All the best!

Harry

6 years ago

I am trying to follow the example but when I type import spark.implicits._ in the spark shell, I get 25: error: not found: value spark

Sandeep Giri

6 years ago

Strangely, it worked in spark-shell but it didn't work.

This works:
val spark2 = spark import spark2.implicits._

Sandeep Giri

6 years ago

See my comment below in response to Thiago.

Deepak Kumar Singh

6 years ago

Hi Harry,
I could run this command 'import spark.implicits._' successfully from spark-shell, without any error. Request you to try again.
Thanks

Thiago M. Pinto

7 years ago

Hello,

But then, how Spark knows that its a hive table?
Only from configuration files?

Sandeep Giri

7 years ago

Yes, the hive is discovered using the configuration files hive-site.xml, core-site.xml and hdfs-site.xml in conf/ directory.

If hive support is enabled, the tables are created and searched in Hive.

Punit Bhilota

4 years ago

How spark will resolve the conflict in following scenario? Spark will retrieve data from which source?

A temporary view is created and hive table also has identical name.

Sandeep Giri

4 years ago

Spark will give preference to the register view or registered temp table over the hive views or tables.

If there is a table X already existing in Hive and you register another table with the same name in spark. When you access X in spark it will give the data of the registered dataframe not the hive table. Please note that it is temporary and is only valid in the cusrrent session.

Outside this session, if you try to access X, it will give the Hive data only not spark's.