DataFrames, Spark SQL, R


Spark SQL - Working With JDBC

To connect to any database, you need a database-specific driver. This driver, also known as the connector, bridges the gap between the JDBC API and the database, so that every database can be accessed with the same code.

At CloudxLab, we have already downloaded the MySQL connector and kept it in the /data/spark HDFS folder. You first need to copy it to the local filesystem using hadoop fs -copyToLocal, as shown in the code.
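For example, the copy step looks like this (the jar file name is taken from the working example in the comments below; the exact version on your system may differ):

    hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar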

Now we can launch the Spark shell with this connector by using the --jars argument, and also specify the driver class path. This essentially registers the connector (driver), and future queries to the database will use it. We use --jars so that Spark also ships the jar when running the code on the cluster.
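The launch command, taken from the working example in the comments below, looks roughly like this (adjust the jar name and path to your environment):

    /usr/bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar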

On the Spark shell, we can use the spark.read.format function to create a reader for the "jdbc" format, and on this reader we can specify the various options. You can see that we have specified url, dbtable, user and password for the database. Here we are trying to connect to the MySQL database sqoopex. These settings are also available in the "My Lab" section.

Once we are done specifying all the options, we call load, which creates the DataFrame for the database table. We can take a look at the DataFrame using the .show method.
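Putting the options together, a sketch of the read on the Spark shell looks like the following; the host, table and credentials are the CloudxLab values quoted in the comments below, so substitute your own from the "My Lab" section:

    val jdbcDF = spark.read.format("jdbc").
      option("url", "jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex").
      option("dbtable", "widgets").
      option("user", "sqoopuser").
      option("password", "NHkkP876rp").
      load()

    jdbcDF.show(5)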

So, once you have a DataFrame, you can register it as a view and join this SQL table with a Hive table, or with data from XML or JSON, as we learned earlier.
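As a rough sketch, assuming a Hive table named orders with a widget_id column, and an id column on the widgets table (all three names are hypothetical), the join could look like:

    // Register the JDBC DataFrame as a temporary view usable from SQL.
    jdbcDF.createOrReplaceTempView("widgets_mysql")

    // Join the MySQL data with a Hive table using plain SQL.
    val joined = spark.sql(
      "SELECT o.*, w.* FROM orders o JOIN widgets_mysql w ON o.widget_id = w.id")
    joined.show(5)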

So, what we have learned now is that we can create DataFrames from a variety of sources such as JSON, Parquet, Hive, RDBMS and RDDs. We can also save data to various locations and formats.

We can process and combine data from different sources and formats using various transformations, SQL and R-like DataFrame operations.

Many ETL tools accomplish this, but Spark can accomplish it on huge data.

Spark - Dataframes & Spark SQL (Part2)



31 Comments

Can't we give the driver class path and jars as options while creating the DataFrame, instead of giving them while launching spark-shell?

In spark-shell interactive mode this example works fine, but if we want to run the spark-submit command and use the JDBC connector for accessing MySQL, do we need to mention this class path and jars in any config settings?


Hi Sreenivasan,

We can pass the JARs and set the classpath with the spark-submit command too, in the same way we pass them to spark-shell. Otherwise, there should be a way to specify the dependency in the SBT build file. I am not sure though; I will have to check.
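If the job is built with SBT, one possible sketch (not verified against this course's setup) is to declare the connector as a library dependency, though you may still need --jars or a fat jar so the driver actually reaches the cluster:

    libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.36"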


Getting the below error while trying to create the jdbcDF. The MySQL connector is copied to local as shown in the screenshot:

Error : java.sql.SQLException: No suitable driver


Hi,

Please check the JDBC connection string.


Thank you! I was able to create the dataframe with the following steps:

Copy the connector to local path:

  • hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar

Launch the spark-shell with the connector:

  • /usr/bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar

Create the Dataframe with the below command/connection string

  • scala> val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex").option("dbtable", "widgets").option("user", "sqoopuser").option("password", "NHkkP876rp").load()
  • scala> jdbcDF.show(5)

scala> val mysqlurl = "jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex"
mysqlurl: String = jdbc:mysql://localhost/sqoopex

scala> val df = spark.read.format("jdbc").option("driver","com.mysql.jdbc.Driver").option("url","mysqlurl").option("dbtable","student_bharat").option("user","sqoopuser").option("password","NHkkP876rp").load()

I have used the above command to connect to MySQL, but it is throwing a NullPointerException. Could you please provide me the solution?


bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar
Warning: Local jar /usr/spark2.3/mysql-connector-java-5.1.36-bin.jar does not exist, skipping.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2020-09-09 21:55:09 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2020-09-09 21:55:09 WARN Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
2020-09-09 21:55:09 WARN Utils:66 - Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
2020-09-09 21:55:09 ERROR SparkContext:91 - Failed to add file:/usr/spark2.3/mysql-connector-java-5.1.36-bin.jar to Spark environment
java.io.FileNotFoundException: Jar /usr/spark2.3/mysql-connector-java-5.1.36-bin.jar not found


Could you please explain the reason for this error?


The error may occur when the jar file is not present at the required location.

Kindly navigate from the web console to the location of the MySQL connector jar file.

 


Hi.

Kindly check the MySQL path!

MYSQL PATH :-
=========== 

/data/jars/mysql-connector-java-5.1.36-bin.jar

 


[bharatbhooshan891801@cxln4 ~]$ hadoop fs -ls /data/jars
Found 2 items
drwxr-xr-x   - hdfs hdfs          0 2019-07-22 17:55 /data/jars/mongo_jars
-rw-r--r--   3 hdfs hdfs     972009 2019-07-22 17:55 /data/jars/mysql-connector-java-5.1.36-bin.jar

The jar is already present at the required location.


Now, what next?

 


Now you can do the Spark read using this information.


It is again showing the same error.


???????


showing same error

 


Hi,

Could you please tell me are you still facing this issue? If yes, then could you please share a screenshot?

Thanks.


Hello Sir.

Can we connect to an HBase database in Spark SQL?


I have the same error as Samrat and Anubhav Gupta. The Spark shell started fine with the connector jar file.

[sapnavacheri1945@cxln4 ~]$ /usr/bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/07/28 13:23:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/07/28 13:23:15 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Spark context Web UI available at http://10.142.1.4:4042
Spark context available as 'sc' (master = local[*], app id = local-1595942595786).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.2.0-205
      /_/

On the Spark prompt:

scala> val jdbcDF = spark.read.format("jdbc").option("url","jdbc:mysql:cxln2.c.thelab-240901.internal").option("dbtable","widgets").option("user","sqoopuser").option("password","NHkkP876rp")
jdbcDF: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@7712df76

But when I try to load the data:

scala> jdbcDF.load()
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

 

There is an issue with the driver. Please fix it.


Hi,

From the shell I'm specifying the JDBC jar with the --jars option. How do I include it for my Scala code to be run with spark-submit?


We can pass the jars to the spark-submit command with the --jars option as well.
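For example, a spark-submit invocation could look like the sketch below; the application class and jar names are hypothetical, while --driver-class-path and --jars work the same way as with spark-shell:

    spark-submit --class com.example.MyApp \
      --driver-class-path mysql-connector-java-5.1.36-bin.jar \
      --jars mysql-connector-java-5.1.36-bin.jar \
      myapp.jar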


scala> /usr/spark2.0.1/bin/spark-shell --driver-class-path
<console>:1: error: ';' expected but double literal found.
/usr/spark2.0.1/bin/spark-shell --driver-class-path
^
scala> mysql-connector-java-5.1.36-bin.jar --jars
<console>:1: error: ';' expected but double literal found.
mysql-connector-java-5.1.36-bin.jar --jars
^
scala> mysql-connector-java-5.1.36-bin.jar
<console>:1: error: ';' expected but double literal found.
mysql-connector-java-5.1.36-bin.jar

Getting this error while providing the path to the JDBC connector. Please guide me, thanks.


Can you please follow the course video?


Hi, I am not able to connect to MySQL.


Hi, Samrat.

Kindly check the MySQL path and IP address as below:

MYSQL PATH :-
===========
/data/jars/mysql-connector-java-5.1.36-bin.jar

The updated ip address is :-
=====================
mysql -h ip-10.142.1.2 -u sqoopuser -p

MySQL password: NHkkP876rp

These are the actual MySQL login commands:
==================================

mysql -u sqoopuser -p -h cxln2.c.thelab-240901.internal
Password : NHkkP876rp

All the best!


How do I start spark-shell using this driver?

I am getting the error.

Please provide me the command; the above commands are not working.

Thanks


Is the driver the only one that will run the query?
Can I distribute the query to make it go faster?

Thanks,


The query will be translated into RDD operations (transformations and actions) by the Spark driver. The operations on RDDs are distributed; therefore, the query will run in a distributed fashion.
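As an illustration, the JDBC source can also split the read itself across parallel tasks using its partitioning options; in the sketch below the partition column id is hypothetical and must be a numeric column of the table:

    // Range-partition the JDBC read into 4 parallel tasks on the (hypothetical) id column.
    val partitionedDF = spark.read.format("jdbc").
      option("url", "jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex").
      option("dbtable", "widgets").
      option("user", "sqoopuser").
      option("password", "NHkkP876rp").
      option("partitionColumn", "id").
      option("lowerBound", "1").
      option("upperBound", "1000").
      option("numPartitions", "4").
      load()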
