To load XML, we generally use the spark-xml package. The spark-xml package is published to the Maven repository. It can be downloaded automatically by declaring it as a dependency inside build.sbt when building a jar for spark-submit, or it can be loaded into spark-shell by way of the --packages argument.
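For the build.sbt route, a minimal sketch of the dependency line might look like the following (the version 0.12.0 is taken from the spark-shell command below; adjust the Scala suffix and version to match your cluster):
// build.sbt (sketch): declare spark-xml so sbt resolves it from the Maven repository
libraryDependencies += "com.databricks" % "spark-xml_2.11" % "0.12.0"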
Let's launch the spark-shell with the --packages option as follows:
spark-shell --packages com.databricks:spark-xml_2.11:0.12.0
Or if you are using a Jupyter notebook with Apache Toree, you can use %AddDeps as follows:
%AddDeps com.databricks spark-xml_2.11 0.12.0
It might take a while to launch the first time because it is going to download the package from the Maven repository. To learn more about finding and installing libraries, please follow this.
Now, we can use spark.read.format with xml as the argument, specify options such as the row tag using the .option method, and then load the data from HDFS.
We can also use the fully qualified name of the format, com.databricks.spark.xml, instead of simply xml.
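For example, the following call is equivalent to the short-name version shown below:
spark.read.format("com.databricks.spark.xml").option("rowTag", "book").load("/data/spark/books.xml")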
Finally, we can take a look at the data in the dataframe using the show() method. You can see that in this dataframe every row is a book, and the columns are the book's fields such as id, author, description, etc.
spark.read.format("xml").option("rowTag","book").load("/data/spark/books.xml").show()
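As a quick follow-up sketch, we can keep the dataframe in a variable and inspect it further. The exact columns depend on the file; _id, author and title are assumed here based on the description above and the comments below:
val booksDF = spark.read.format("xml").option("rowTag", "book").load("/data/spark/books.xml")
booksDF.printSchema()                              // one column per child tag or attribute of <book>
booksDF.select("_id", "author", "title").show(5)   // XML attributes such as id get an underscore prefix: _id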
So, by default spark-xml expects the top-level element to contain the records, and each record's attributes and child tags become the columns.
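To illustrate, the sample file roughly follows this shape (a sketch; the exact fields in /data/spark/books.xml may differ):
<catalog>
  <book id="bk101">
    <author>...</author>
    <title>...</title>
    <price>...</price>
    <description>...</description>
  </book>
  <book id="bk102">
    ...
  </book>
</catalog>
With rowTag set to book, every <book> element becomes a row, the id attribute becomes the _id column, and each child tag (author, title, price, description) becomes a column.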
Let's try to understand what is meant by a remote procedure call. Imagine that there is a phonebook service which stores your phonebook or contact list. The user accesses this phonebook in order to look up someone's phone number, update a number, or download the entire phonebook. The user can either use a browser or a mobile app which internally calls the service. The user can also create a bot or an automated script to query the server. So, the service could be accessed by a bot, a browser, or a mobile app. This access to the server is called a Remote Procedure Call.
In the example diagram, the getPhoneBook method is being called and it returns a complex object containing an array of phone numbers. Here the returned value is in JSON format. There are many kinds of formats designed for such communication, such as Protocol Buffers and Avro.
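As an illustration, the JSON returned by such a getPhoneBook call might look roughly like this (the field names here are made up for the example):
{
  "owner": "alice",
  "phoneNumbers": [
    { "name": "Bob",   "number": "+1-555-0100" },
    { "name": "Carol", "number": "+1-555-0101" }
  ]
}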
Comments
At first I got an error message:
[abchandravansi8369@cxln4 ~]$ spark-shell --package com.databricks:spark-xml_2.10:0.4.1
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
bad option: '--package'
It worked once I found the issue: the command
spark-shell --package com.databricks:spark-xml_2.10:0.4.1
should have been
spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
Hi,
In pyspark, I am able to load XML data. But when I try to save the DataFrame in XML format, it ends with an error.
Here are the steps that I performed and a screen print of the error message.
Can you help me understand the root cause of this error and its resolution?
Step 1. Invoking pyspark:
/usr/spark2.4.3/bin/pyspark --packages com.databricks:spark-xml_2.10:0.4.1
Step 2. Loading Books XML data:
df = spark.read.format("xml") \
.option("rootTag","catalog") \
.option("rowTag", "book") \
.load("/data/spark/books.xml")
Step 3. Selected a few columns and tried to save in XML format, which ended with a runtime error:
df.select("_id","author","title","price") \
.write.format("com.databricks.spark.xml") \
.option("rootTag","data") \
.option("rowTag","book") \
.save("books_filtered", mode="overwrite")
How to open spark-shell with both the packages for XML and Avro?
Hi,
Multiple packages can be specified while starting the spark-shell using the --packages option, with the package names separated by commas.
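For example, something along these lines should work (the Avro coordinates and versions here are illustrative; pick the ones matching your Spark and Scala versions):
spark-shell --packages com.databricks:spark-xml_2.11:0.12.0,org.apache.spark:spark-avro_2.11:2.4.3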
Thanks
Can we submit the .scala file to spark-submit instead of a jar? Could you please give us an example?
How to load an XML file if it is encrypted? Can you please provide the code with an explanation?
Can someone help me with the above question?
Can you please explain the fields in the option("rowTag","book") method?
XML is a hierarchical structure of data. This option, called "rowTag", decides what will become a row. Whatever XML tag you mention as "rowTag", those nodes become the rows when translating it into a dataframe - a tabular structure.
But we haven't declared/defined rowTag or book here. Then how is the XML structure defined here?
Upvote ShareSee the pic.
It seems it's only taking the root tag, i.e. book, as a rowTag. Any other value from the XML field tags gives an error.
1. scala> val df = spark.read.format("xml").option("rowTag","price").load("/data/spark/books.xml")
20/11/02 19:47:42 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) java.util.NoSuchElementException
2. scala> val df = spark.read.format("xml").option("rowTag","title").load("/data/spark/books.xml")
20/11/02 19:57:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8) java.util.NoSuchElementException
3. Using any arbitrary value, i.e. one not present in the XML tags, creates an empty dataframe.
scala> val df = spark.read.format("xml").option("rowTag","CloudxLab").load("/data/spark/books.xml")
df: org.apache.spark.sql.DataFrame = []
scala> df.show()
++
||
++
++
From the above, we cannot provide just any column name in rowTag to create a dataframe; it has to be the root tag. Is that correct? Any clue why the 3rd statement is not producing any error?
I think that's how they have implemented the XML parser.
Please take a look at their documentation: https://github.com/databricks/spark-xml
Hi Sandeep,
While explaining Spark SQL and data frames, for the XML input format you said that, based on XML tags, data will be transported between blocks.
But I have one doubt: let's say a block is 128 MB and it is already full; how will the end tags be transported to this block?
Or will the transportation be done while executing the jobs? Can you please explain in detail.
The transportation will happen while executing the task/job.
While launching spark-shell with spark-xml, I am getting the below error:
/usr/bin/spark-shell --packages com.databricks.spark-xml_2.10:4.1
SPARK_MAJOR_VERSION is set to 2, using Spark2
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.databricks.spark-xml_2.10:4.1
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoo
Can you please help me to resolve this?
Hi,
The message is self-explanatory. You should provide the coordinates in the form groupId:artifactId:version.
Something like the below command should work:
/usr/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.5.0