DataFrames, Spark SQL, R


Spark SQL - Using Hive tables

To have Spark read and write Hive tables, you need to configure it. Create a conf folder in your home directory, and place your Hadoop configuration files (hive-site.xml, core-site.xml, and hdfs-site.xml) in that conf/ directory.
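The setup described above can be sketched as the following commands. The source paths under /etc are assumptions; use wherever your cluster keeps these files.

```shell
# Create the conf directory Spark will read its Hadoop/Hive settings from.
mkdir -p ~/conf

# Copy the Hadoop/Hive configuration files into it. The source paths
# below are assumptions; adjust them to your installation.
# cp /etc/hive/conf/hive-site.xml   ~/conf/
# cp /etc/hadoop/conf/core-site.xml ~/conf/
# cp /etc/hadoop/conf/hdfs-site.xml ~/conf/
```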

These configurations are not required on CloudxLab, as it is already set up to run Spark with Hive.

Start the Spark shell using the command shown on screen. You first need to import everything from spark.implicits.
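In the Scala Spark shell this looks like the following; the `spark` session object is predefined when the shell starts.

```scala
// Launched with:  spark-shell
// Inside the shell, `spark` (a SparkSession) is already available.
import spark.implicits._   // enables $"col" syntax and Seq-to-DataFrame conversions
```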

If you have a table in Hive named a_student, you can simply load it with the spark.sql method, using a query like select * from a_student.

Here it has created a DataFrame named df. Now we can see the data inside using the show() method.
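The two steps above look like this in the shell, assuming the Hive table a_student already exists:

```scala
// Run a Hive query and get the result back as a DataFrame.
val df = spark.sql("SELECT * FROM a_student")

// Print the first rows in tabular form.
df.show()
```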

Did you notice that this SQL query method is the same one used earlier for querying a temporary view? It is very much possible to join a temporary view with a Hive table using SQL queries. So you can register any DataFrame as a view and join the view's data with a Hive table.
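A minimal sketch of such a join, assuming the Hive table a_student and hypothetical column names (id, student_id):

```scala
// Build a small DataFrame and register it as a temporary view.
val marks = Seq((1, 85), (2, 72)).toDF("student_id", "score")
marks.createOrReplaceTempView("marks")

// Join the Hive table with the temporary view in plain SQL.
val joined = spark.sql(
  """SELECT s.*, m.score
    |FROM a_student s
    |JOIN marks m ON s.id = m.student_id""".stripMargin)
joined.show()
```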

If you wish to read data from Hive using spark-submit, you need to create the Spark session object yourself using the SparkSession.builder method, specifying the required configuration settings on this object.
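A sketch of such an application: in a spark-submit job the session is not pre-created, so you build it yourself, and enableHiveSupport() is what wires in the Hive metastore. The warehouse path shown is an assumption.

```scala
import org.apache.spark.sql.SparkSession

object HiveReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveReader")
      .config("spark.sql.warehouse.dir", "/apps/hive/warehouse") // assumed path
      .enableHiveSupport()   // required for Hive table access
      .getOrCreate()

    spark.sql("SELECT * FROM a_student").show()
    spark.stop()
  }
}
```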

Using Spark SQL you can read data from various databases, such as MySQL, Oracle, or Microsoft SQL Server, through JDBC connectivity. JDBC stands for Java Database Connectivity; using JDBC, Java code can read from almost any database. So, through the same JDBC drivers, Spark can read data from your traditional database into a DataFrame, which can then be combined with other data.
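A JDBC read can be sketched as follows. The host, database, table name `students`, and credentials are all hypothetical, and the matching JDBC driver jar must be on the classpath.

```scala
// Read a table from a relational database into a DataFrame over JDBC.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/school") // assumed MySQL server
  .option("dbtable", "students")                    // hypothetical table
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()

// The resulting DataFrame can be joined with Hive data like any other.
jdbcDF.show()
```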