To load XML, we generally use the spark-xml package. The spark-xml package is published to the Maven repository. It can be downloaded automatically by declaring it as a dependency inside build.sbt when building a jar for spark-submit, or it can be loaded into spark-shell by way of the --packages argument.
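For the build.sbt route, a minimal sketch of the dependency line might look like the following (the version 0.12.0 is taken from the spark-shell command below; adjust the Scala suffix and version to match your cluster):
// build.sbt (sketch): declare spark-xml so sbt resolves it from the Maven repository
libraryDependencies += "com.databricks" % "spark-xml_2.11" % "0.12.0"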
Let's launch the spark-shell with the --packages option as follows:
spark-shell --packages com.databricks:spark-xml_2.11:0.12.0
Or if you are using a Jupyter notebook with Apache Toree, you can use %AddDeps as follows:
%AddDeps com.databricks spark-xml_2.11 0.12.0
It might take a while to launch the first time because it is going to download the package from the Maven repository. To learn more about finding and installing libraries, please follow this.
Now, we can use spark.read.format with xml as the argument, specify options such as the row tag using the .option method, and then load the data from HDFS.
We can also use the fully qualified name of the format, com.databricks.spark.xml, instead of simply xml.
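For example, the following call is equivalent to the short-name version shown below:
spark.read.format("com.databricks.spark.xml").option("rowTag", "book").load("/data/spark/books.xml")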
Finally, we can take a look at the data in the dataframe using the show() method. You can see that in this dataframe every row is a book, and the columns are the book's fields such as id, author, description, etc.
spark.read.format("xml").option("rowTag","book").load("/data/spark/books.xml").show()
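As a quick follow-up sketch, we can keep the dataframe in a variable and inspect it further. The exact columns depend on the file; _id, author and title are assumed here based on the description above and the comments below:
val booksDF = spark.read.format("xml").option("rowTag", "book").load("/data/spark/books.xml")
booksDF.printSchema()                              // one column per child tag or attribute of <book>
booksDF.select("_id", "author", "title").show(5)   // XML attributes such as id get an underscore prefix: _id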
So, by default spark-xml expects the top-level element to contain the records, and each record's attributes and child tags become the columns.
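To illustrate, the sample file roughly follows this shape (a sketch; the exact fields in /data/spark/books.xml may differ):
<catalog>
  <book id="bk101">
    <author>...</author>
    <title>...</title>
    <price>...</price>
    <description>...</description>
  </book>
  <book id="bk102">
    ...
  </book>
</catalog>
With rowTag set to book, every <book> element becomes a row, the id attribute becomes the _id column, and each child tag (author, title, price, description) becomes a column.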
Let's try to understand what is meant by a remote procedure call. Imagine that there is a phonebook service which stores your phonebook or contact list. The user accesses this phonebook in order to look up someone's phone number, update a number, or download the entire phonebook. The user can either use a browser or a mobile app which internally calls the service. The user can also create a bot or an automated script to query the server. So, the service could be accessed by a bot, a browser, or a mobile app. This access to the server is called a Remote Procedure Call.
In the example diagram, the getPhoneBook method is being called and it returns a complex object containing an array of phone numbers. Here the returned value is in JSON format. There are many kinds of formats designed for such communication, such as Protocol Buffers and Avro.
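As an illustration, the JSON returned by such a getPhoneBook call might look roughly like this (the field names here are made up for the example):
{
  "owner": "alice",
  "phoneNumbers": [
    { "name": "Bob",   "number": "+1-555-0100" },
    { "name": "Carol", "number": "+1-555-0101" }
  ]
}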
Comments
At first I got an error message:
[abchandravansi8369@cxln4 ~]$ spark-shell --package com.databricks:spark-xml_2.10:0.4.1
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
bad option: '--package'
It worked once I found the issue: the command
spark-shell --package com.databricks:spark-xml_2.10:0.4.1
should have been
spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
Hi,
In pyspark, I am able to load XML data. But when I try to save the DataFrame in XML format, it ends with an error.
Here are the steps that I performed and a screen print of the error message.
Can you help me understand the root cause of this error and its resolution?
Step 1. Invoking pyspark:
/usr/spark2.4.3/bin/pyspark --packages com.databricks:spark-xml_2.10:0.4.1
Step 2. Loading Books XML data:
df = spark.read.format("xml") \
.option("rootTag","catalog") \
.option("rowTag", "book") \
.load("/data/spark/books.xml")
Step 3. Selected a few columns and tried to save in XML format, which ended with a runtime error:
df.select("_id","author","title","price") \
.write.format("com.databricks.spark.xml") \
.option("rootTag","data") \
.option("rowTag","book") \
.save("books_filtered", mode="overwrite")
How to open spark-shell with both the packages for XML and Avro?
Hi,
Multiple packages can be specified while starting the spark-shell using the --packages option, with the package names separated by commas.
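For example, something along these lines should work (the Avro coordinates and versions here are illustrative; pick the ones matching your Spark and Scala versions):
spark-shell --packages com.databricks:spark-xml_2.11:0.12.0,org.apache.spark:spark-avro_2.11:2.4.3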
Thanks
Can we submit the .scala file to spark-submit instead of a jar? Could you please give us an example?
How to load an XML file if it is encrypted? Can you please provide the code with an explanation?
Can someone help me with the above question?
Can you please explain the fields in the option("rowTag","book") method?
XML is a hierarchical structure of data. This option, called "rowTag", decides what will become a row. Whatever XML tag you mention as "rowTag", those nodes become the rows when translating it into a dataframe - a tabular structure.
But we haven't declared/defined rowTag or book here. Then how is the XML structure defined here?
Upvote ShareSee the pic.
It seems it's only taking the root tag, i.e. book, as a rowTag. Any other value from the XML field tags gives an error.
1. scala> val df = spark.read.format("xml").option("rowTag","price").load("/data/spark/books.xml")
20/11/02 19:47:42 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) java.util.NoSuchElementException
2. scala> val df = spark.read.format("xml").option("rowTag","title").load("/data/spark/books.xml")
20/11/02 19:57:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8) java.util.NoSuchElementException
3. Using any arbitrary value, i.e. one not present in the XML tags, creates an empty dataframe.
scala> val df = spark.read.format("xml").option("rowTag","CloudxLab").load("/data/spark/books.xml")
df: org.apache.spark.sql.DataFrame = []
scala> df.show()
++
||
++
++
From the above, we cannot provide just any column name in rowTag to create a dataframe; it has to be the root tag. Is that correct? Any clue why the 3rd statement is not producing any error?
I think that's how they have implemented the XML parser.
Please take a look at their documentation: https://github.com/databricks/spark-xml
Hi Sandeep,
While explaining Spark SQL and data frames, for the XML input format you said that, based on XML tags, data will be transported between blocks.
But I have one doubt: let's say a block is 128 MB and it is already full; how will the end tags be transported to this block?
Or will the transportation be done while executing the jobs? Can you please explain in detail.
The transportation will happen while executing the task/job.
While launching spark-shell with spark-xml, I am getting the below error:
/usr/bin/spark-shell --packages com.databricks.spark-xml_2.10:4.1
SPARK_MAJOR_VERSION is set to 2, using Spark2
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.databricks.spark-xml_2.10:4.1
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoo
Can you please help me to resolve this?
Hi,
The message is self-explanatory. You should provide the coordinates in the form groupId:artifactId:version.
Something like the below command should work:
/usr/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.5.0