DataFrames, Spark SQL, R

15 / 18

Previous Index Next

Spark SQL - Working With JDBC

To connect to any database, you need the database specific driver. This driver is also known as the connector is the one that bridges the gap between a JDBC and the database so that every database can be accessed with the same code.

At Cloudxlab, we have already downloaded the MySQL connector and kept in /data/spark HDFS folder. You would first need to copy it to local using hadoop fs -copyToLocal as shown in the code.

Now, we can launch the spark shell with this connector by using --jars argument and also specify the driver class path. This would essentially register the connector or driver and in the future queries to Databases this driver will be used. We are using --jars to make spark run the code in the cluster.

On spark shell, we can use spark.read.format function to create an instance of format and on this format object we can specify the various options. You can see that we have specified url, dbtable, user and password for the database. Here we are trying to connect to MySQL database sqoopex. These settings are also available in "My Lab" section.

Once we are done specifying all options we can call load which would create the dataframe for the database table. We can take a look at the dataframe using .show method.

So, once you have a dataframe you can register it like a view and join the SQL table with hive table or data from XML or JSON as we learned in past.

So, what we have learned now is that we can create dataframe from a variety of sources such as JSON, Parquet, Hive, RDBMS, RDDs. We can also save data to various locations and formats.

We can process and combine data from different sources and formats using various transformations, SQL and R like dataframe operations.

There have been many ETL tools that accomplish this but spark can accomplish this on Huge data.

Spark - Dataframes & Spark SQL (Part2)

Previous Index Next

Please login to comment

31 Comments

Sreenivasan Sistla

4 years ago

Can't we give driver class path and jars also in options while creating dataframe instead of giving those things while connecting to spark-shell.

In spark-shell interactive mode his example works fine but if we want to run spark-submit command and want to use JDBC connector for accessing MYSQL then do we need to mention this class path and jars in any config settings?

This comment has been removed.

Abhinav Singh

4 years ago

Hi Sreenivasan,

We can pass the JARs and set classpath in spark-submit command too in the same way we are passing in spark-shell. Else there should be a way to specify dependencies or configs in SBT file. I am not sure though, I will have to check it once.

Amlan Patnaik

4 years ago

Getting the below error while trying to create the jdbcDF. The my-sql connector is copied to local as shown in the screenshot:

Error : java.sql.SQLException: No suitable driver

Abhinav Singh

4 years ago

Hi,

Please check the JDBC connection string.

Amlan Patnaik

4 years ago

Thank you! I was able to create the dataframe with the following steps:

Copy the connector to local path:

hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar

Lauch the spark-shell with the connector:

/usr/bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar

Create the Dataframe with the below command/connection string

scala> val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex").option("dbtable", "widgets").option("user", "sqoopuser").option("password", "NHkkP876rp").load()

scala> jdbcDF.show(5)

This comment has been removed.

This comment has been removed.

Bharat Sharma

5 years ago

scala> val mysqlurl= "jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex"
mysqlurl: String = jdbc:mysql://localhost/sqoopexscala> val df=spark.read.format("jdbc").option("driver","com.mysql.jdbc.Driver").option("url","mysqlurl").option("dbtable","student_bharat").option("user","sqoopuser").option("password","NHkkP876rp").load()
i have used above command to connect with mysql but its showing nullPointerException could you pls provide me the solution

Bharat Sharma

5 years ago

bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jarWarning: Local jar /usr/spark2.3/mysql-connector-java-5.1.36-bin.jar does not exist, skipping.Setting default log level to "WARN".To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).2020-09-09 21:55:09 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.2020-09-09 21:55:09 WARN Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.2020-09-09 21:55:09 WARN Utils:66 - Service 'SparkUI' could not bind on port 4042. Attempting port 4043.2020-09-09 21:55:09 ERROR SparkContext:91 - Failed to add file:/usr/spark2.3/mysql-connector-java-5.1.36-bin.jar to Spark environmentjava.io.FileNotFoundException: Jar /usr/spark2.3/mysql-connector-java-5.1.36-bin.jar not found

Bharat Sharma

5 years ago

could you pls reply the reason of these error

Satyajit Das

5 years ago

The error may occur when the jar file may not be present at the required location!

KIndly navigate from the web-console to the locations of the Mysql-connector jar files!

Satyajit Das

5 years ago

Hi.

Kindly check the Mysql path!

MYSQL PATH :-
===========

/data/jars/mysql-connector-java-5.1.36-bin.jar

Bharat Sharma

5 years ago

[bharatbhooshan891801@cxln4 ~]$ hadoop fs -ls /data/jars
Found 2 items
drwxr-xr-x - hdfs hdfs 0 2019-07-22 17:55 /data/jars/mongo_jars
-rw-r--r-- 3 hdfs hdfs 972009 2019-07-22 17:55 /data/jars/mysql-connector-java-5.1.36-bin.jar

jar is already present at required location

Bharat Sharma

5 years ago

Now what next

Satyajit Das

5 years ago

Now you can read the spark read using this information!

Bharat Sharma

5 years ago

again showing same error???

Bharat Sharma

5 years ago

???????

Bharat Sharma

5 years ago

showing same error

Rajtilak Bhattacharjee

4 years ago

Hi,

Could you please tell me are you still facing this issue? If yes, then could you please share a screenshot?

Thanks.

Bhagyesh Thakur

5 years ago

Hello Sir.

Can we connect to Hbase database in SparkSQL?

Sapna Vacheri

5 years ago

i have the same error as Samrat aand anubhav gupta. spark shell started fine with the jar file which is connector

[sapnavacheri1945@cxln4 ~]$ /usr/bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/07/28 13:23:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/07/28 13:23:15 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Spark context Web UI available at http://10.142.1.4:4042
Spark context available as 'sc' (master = local[*], app id = local-1595942595786).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.1.2.6.2.0-205
/_/

on spark prompt

scala> val jdbcDF = spark.read.format("jdbc").option("url","jdbc:mysql:cxln2.c.thelab-240901.internal").option("dbtable","widgets").option("user","sqoopuser").option("password","NHkkP876rp")
jdbcDF: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@7712df76

by when try to load the data

scala> jdbcDF.load()
java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
... 48 elided

there is an issue with the driver. please fix it

Siddharth

5 years ago

Hi,

From shell I'm specifying the JDBC jar with --jars option. How do I put it i my scala code to be run with spark-submit?

CloudxLab

5 years ago

We can pass the jars with spark-submit command with --jars option

Kishan Pandey

5 years ago

scala> /usr/spark2.0.1/bin/spark-shell --driver-class-path
<console>:1: error: ';' expected but double literal found.
/usr/spark2.0.1/bin/spark-shell --driver-class-path
^
scala> mysql-connector-java-5.1.36-bin.jar --jars
<console>:1: error: ';' expected but double literal found.
mysql-connector-java-5.1.36-bin.jar --jars
^
scala> mysql-connector-java-5.1.36-bin.jar
<console>:1: error: ';' expected but double literal found.
mysql-connector-java-5.1.36-bin.jar

Getting this error while providing path to jdbc connector. please guide me, Thanks

CloudxLab

5 years ago

Can you please follow the course video?

Samrat

5 years ago

Hi , I am not able to connect to mysql

Satyajit Das

5 years ago

Hi, Samrat.

Kindly check Mysql path and Ip address as below :-

MYSQL PATH :-
===========
/data/jars/mysql-connector-java-5.1.36-bin.jar

The updated ip address is :-
=====================
mysql -h ip-10.142.1.2 -u sqoopuser -p

Mysql : NHkkP876rp

This is the actual Mysql starting statements :-
==================================

mysql -u sqoopuser -p -h cxln2.c.thelab-240901.internal
Password : NHkkP876rp

All the best!

Anubhav Gupta

5 years ago

how to start spark-shell using this driver??

i am getting the error..

pleasr provide me the command,,

the above commands are not working..

thank

Thiago M. Pinto

7 years ago

Only who will run the query is Driver?
Can I distribute the query, to go faster?

Tkx,

Sandeep Giri

7 years ago

The query will be translated to RDD operations (Transformation and Actions) of spark by the spark driver. The operations on RDD are distributed.
Therefore, the query will be run in distributed fashion.