DataFrames, Spark SQL, R


Spark SQL - Working With JDBC

To connect to any database, you need the database-specific driver. This driver, also known as the connector, bridges the gap between JDBC and the database so that every database can be accessed with the same code.

At Cloudxlab, we have already downloaded the MySQL connector and kept it in the /data/spark HDFS folder. You would first need to copy it to the local filesystem using hadoop fs -copyToLocal as shown in the code.
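A sketch of that copy step (the exact JAR filename is an assumption; list the /data/spark folder to find the real one):

```shell
# Copy the MySQL JDBC connector JAR from HDFS to the local filesystem
# (the JAR name below is a placeholder -- check /data/spark for the exact file)
hadoop fs -copyToLocal /data/spark/mysql-connector-java.jar .
```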

Now, we can launch the Spark shell with this connector using the --jars argument, and also specify the driver class path. This registers the connector (the driver), so that future queries to databases use it. We use --jars so that Spark ships the JAR to the executors and can run the code on the cluster.
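A sketch of the launch command, assuming the connector JAR sits in the current directory under the placeholder name used above:

```shell
# Launch the Spark shell with the connector JAR shipped to the executors
# (--jars) and added to the driver's classpath (--driver-class-path).
# The JAR filename is a placeholder.
spark-shell --jars mysql-connector-java.jar \
  --driver-class-path mysql-connector-java.jar
```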

On the Spark shell, we can use the spark.read.format function to create an instance of a reader for that format, and on this reader object we can specify various options. You can see that we have specified url, dbtable, user and password for the database. Here we are trying to connect to the MySQL database sqoopex. These settings are also available in the "My Lab" section.

Once we are done specifying all the options, we can call load, which creates the DataFrame for the database table. We can take a look at the DataFrame using the .show method.
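Putting these steps together, a minimal sketch in the Spark shell might look like the following. The host name, table name and credentials below are placeholders, not the actual settings from "My Lab":

```scala
// Build a JDBC reader and set the connection options
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://hostname/sqoopex")  // placeholder host
  .option("driver", "com.mysql.jdbc.Driver")       // MySQL driver class
  .option("dbtable", "widgets")                    // placeholder table name
  .option("user", "username")                      // placeholder
  .option("password", "password")                  // placeholder
  .load()                                          // creates the DataFrame

// Inspect the first rows of the table
df.show()
```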

So, once you have a DataFrame, you can register it as a view and join the SQL table with a Hive table, or with data from XML or JSON, as we learned earlier.
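For example, a sketch of such a join, assuming a DataFrame df loaded over JDBC as above (the view name, Hive table name and column names are hypothetical):

```scala
// Register the JDBC-backed DataFrame as a temporary view
df.createOrReplaceTempView("widgets_jdbc")

// Join it with an existing Hive table using Spark SQL
// (hive_widget_categories and its columns are hypothetical)
val joined = spark.sql("""
  SELECT w.*, h.category
  FROM widgets_jdbc w
  JOIN hive_widget_categories h ON w.id = h.id
""")
joined.show()
```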

So, what we have learned is that we can create DataFrames from a variety of sources such as JSON, Parquet, Hive, RDBMS and RDDs. We can also save data to various locations and formats.

We can process and combine data from different sources and formats using various transformations, SQL, and R-like DataFrame operations.

There are many ETL tools that accomplish this, but Spark can accomplish it on huge data.