DataFrames, Spark SQL, R


Spark SQL - Handling Various Data Sources

Let's take a look at how to load Parquet-formatted data. The first way is to load it automatically using spark.read.load.

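Here we simply let the spark.read.load method work out the format and the columns. With no format specified, it uses Spark's default data source, which is Parquet unless spark.sql.sources.default has been changed, and the column names and types come from the schema stored in the Parquet file itself. It creates the dataframe df automatically from the data located in HDFS at /data/spark/users.parquet.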

In the second line of code, we use R-like syntax to select the columns. Then we use the write.save method on the dataframe to save the result to HDFS. Again, no format is specified, so save writes in the default format, Parquet. Note that the format comes from this default, not from the extension of the file.
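As a rough illustration of the automatic load and save, here is a minimal PySpark sketch. It assumes an existing SparkSession is passed in as spark. The function name, the column names name and favorite_color, and the output path are hypothetical, chosen only for the example.

```python
def copy_selected_columns(spark,
                          src="/data/spark/users.parquet",
                          dst="namesAndFavColors.parquet"):
    """Load with the default data source, project two columns, save the result."""
    # No format is given, so Spark uses its default data source
    # (parquet unless spark.sql.sources.default has been changed).
    df = spark.read.load(src)

    # Hypothetical column names; replace them with columns present in your data.
    selected = df.select("name", "favorite_color")

    # Again no format is given, so the output is written in the default format.
    selected.write.save(dst)
    return selected
```

Calling copy_selected_columns(spark) from a PySpark shell, where spark is already defined, would write the two selected columns to the output path.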

The second method is to specify the format explicitly. Here we specify JSON, for example, and then load the data from the file. Use this method when your data is not in the default format or your files don't have proper extensions. In such cases, we first create a reader for the format using spark.read.format and then call its load function to load the data.

After doing some projections using the select method on the dataframe, we save the dataframe created from JSON into Parquet format. Just as when loading, we first specify the format using df.write.format and then call the save method on the result. The first argument of save is the path of the output in HDFS.
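The explicit-format steps can be sketched in PySpark as follows. This is only an outline under assumptions: spark is an existing SparkSession, the source and destination paths are supplied by the caller, and the column names name and age are hypothetical.

```python
def json_to_parquet(spark, src, dst):
    """Read JSON with an explicit format, project columns, write Parquet."""
    # Name the input format explicitly instead of relying on the default.
    df = spark.read.format("json").load(src)

    # Hypothetical projection; use columns that exist in your JSON records.
    projected = df.select("name", "age")

    # Name the output format explicitly as well, then give the output path.
    projected.write.format("parquet").save(dst)
    return projected
```

The same pattern should work for other built-in formats such as csv or orc by changing the string passed to format.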

You can also run a SQL query directly on the file to create the dataframe. In the example, we call spark.sql with the query select star from the format name, followed by a dot, followed by the location of the file enclosed in backticks.

This is the easiest way of loading structured data into a dataframe.
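A query run directly on a file might look like the sketch below. It assumes spark is an existing SparkSession. The helper name query_file and its parameters are hypothetical, and the default path is the users.parquet file mentioned earlier.

```python
def query_file(spark, path="/data/spark/users.parquet", fmt="parquet"):
    """Build a dataframe by running SQL directly against a file."""
    # The table reference is the format name, a dot, and the file
    # location wrapped in backticks.
    query = f"SELECT * FROM {fmt}.`{path}`"
    return spark.sql(query)
```

With the defaults, the query string passed to spark.sql is: SELECT * FROM parquet.`/data/spark/users.parquet`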

Most of the structured data on Hadoop is kept in Hive, so it is very important to be able to import data from Apache Hive.

With Spark SQL, you can easily read data stored in Apache Hive and write Hive tables. A word of caution: since Hive has a large number of dependencies, they are not included in the default Spark distribution. If the Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
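Reading and writing Hive tables could be sketched in PySpark as shown here. This assumes the Hive dependencies are available on the classpath as described above. The function names, the application name, and the table names passed by the caller are all hypothetical.

```python
def hive_enabled_session(app_name="hive-example"):
    """Create (or reuse) a SparkSession with Hive support switched on."""
    # Imported inside the function so this file can be loaded even where
    # PySpark is not installed; the import runs only when the function is called.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName(app_name)
            .enableHiveSupport()
            .getOrCreate())


def copy_hive_table(spark, source_table, target_table):
    """Read one Hive table into a dataframe and save it as another table."""
    # spark.table returns the named table as a dataframe.
    df = spark.table(source_table)
    # saveAsTable writes the dataframe as a table; with the default save
    # mode it raises an error if the target table already exists.
    df.write.saveAsTable(target_table)
    return df
```

A typical call would be copy_hive_table(hive_enabled_session(), "src_table", "dst_table"), where both table names stand in for tables in your own Hive metastore.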