DataFrames, Spark SQL, R

Spark SQL - Loading XML

To load XML, we generally use the spark-xml package. The package is published to a public Maven repository. It can be downloaded automatically by declaring it as a dependency in build.sbt when building an application for spark-submit, or it can be loaded in spark-shell by way of the --packages argument.

Let's launch spark-shell with --packages com.databricks:spark-xml_2.10:0.4.1
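The value passed on the command line is a Maven-style coordinate of the form group:artifact:version. As a small illustration in plain Python (this is not part of Spark, and the helper name parse_coordinate is made up for this sketch), here is how that coordinate breaks down. The _2.10 suffix on the artifact name is the Scala version the package was built for.

```python
def parse_coordinate(coordinate):
    """Split a Maven-style coordinate 'group:artifact:version' into its parts."""
    group, artifact, version = coordinate.split(":")
    # By convention, Scala libraries append the Scala binary version to the
    # artifact name, so 'spark-xml_2.10' is spark-xml built for Scala 2.10.
    name, _, scala_version = artifact.rpartition("_")
    return {
        "group": group,
        "artifact": name,
        "scala_version": scala_version,
        "version": version,
    }


parts = parse_coordinate("com.databricks:spark-xml_2.10:0.4.1")
print(parts)
```

The Scala version in the artifact name should match the Scala version of the Spark build being used.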

It might take a while to launch the first time because it has to download the package from the repository.

Now we can use spark.read.format with xml as the argument, then specify how the file should be parsed using the .option method (for example, the rowTag option names the XML element to treat as a row), and then load the data from HDFS.

We can also use the fully qualified format name, com.databricks.spark.xml, instead of simply xml.

Finally, we can take a look at the data in the dataframe using the show() method. You can see that in this dataframe every row is a book, and the columns of a book are id, author, description, etc.

So spark-xml by default expects the top level to hold the records, and each record to have the attributes which become columns.
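To make the idea of records turning into rows more concrete, here is a minimal sketch in plain Python using only the standard library. It is not how spark-xml is implemented; it only mimics the mapping. The sample XML, the function name records_to_rows and the row_tag parameter are assumptions made up for this illustration. Note that spark-xml by default prefixes attribute-derived columns with an underscore, while this sketch keeps the plain names for readability.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped like a small book catalog.
SAMPLE_XML = """
<catalog>
  <book id="bk101">
    <author>Author One</author>
    <description>First sample book.</description>
  </book>
  <book id="bk102">
    <author>Author Two</author>
    <description>Second sample book.</description>
  </book>
</catalog>
"""


def records_to_rows(xml_text, row_tag):
    """Turn each <row_tag> element directly under the root into a dict."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for record in root.findall(row_tag):
        row = dict(record.attrib)  # XML attributes become columns
        for child in record:       # child elements become columns too
            row[child.tag] = (child.text or "").strip()
        rows.append(row)
    return rows


rows = records_to_rows(SAMPLE_XML, "book")
for row in rows:
    print(row)
```

Each printed dictionary corresponds to one row of the dataframe, and its keys correspond to the column names.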

Let's try to understand what is meant by a remote procedure call. Imagine that there is a phonebook service which stores your phonebook or contact list. The user accesses this phonebook in order to look up someone's phone number, update a number or download the entire phonebook. The user can either use a browser or a mobile app, which internally calls the service. The user can also create a bot or an automated script to query the server. So, the service could be accessed by a bot, a browser or a mobile app. This kind of access to the server is called a Remote Procedure Call (RPC).

In the example diagram, the getPhoneBook method is being called, and it returns a complex object containing an array of phone numbers. Here the returned value is in JSON format. There are many formats designed for such communication, such as Protocol Buffers and Avro.
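The getPhoneBook call can be sketched in a few lines of plain Python. This is only a toy illustration, not a real RPC framework: the phonebook entries, the request layout with "method" and "params" fields, and the helper names handle_request and call_remote are all assumptions made up for the sketch, and the network hop is replaced by an ordinary function call.

```python
import json

# Hypothetical in-memory data standing in for the phonebook service's storage.
PHONEBOOK = [
    {"name": "Alice", "phone": "555-0100"},
    {"name": "Bob", "phone": "555-0101"},
]


def get_phone_book():
    """Server-side procedure: return the entire phonebook."""
    return {"entries": PHONEBOOK}


# Procedures the server exposes, keyed by the name the client sends.
PROCEDURES = {"getPhoneBook": get_phone_book}


def handle_request(request_text):
    """Server side: decode the request, run the procedure, encode the result."""
    request = json.loads(request_text)
    procedure = PROCEDURES[request["method"]]
    result = procedure(*request.get("params", []))
    return json.dumps({"result": result})


def call_remote(method, *params):
    """Client side: encode the call as JSON text and decode the JSON reply."""
    request_text = json.dumps({"method": method, "params": list(params)})
    # In a real system this text would travel over the network to the server.
    response_text = handle_request(request_text)
    return json.loads(response_text)["result"]


phonebook = call_remote("getPhoneBook")
print([entry["phone"] for entry in phonebook["entries"]])
```

The browser, the mobile app and the bot would all play the role of call_remote here. Only JSON text crosses the boundary between client and server, and that text is the part which formats such as Protocol Buffers or Avro would replace.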