DataFrames, Spark SQL, R


Spark SQL - Handling Avro

Avro is a data serialization and remote procedure call (RPC) framework. It uses JSON to define data types and protocols, and it serializes data in a compact binary format.

It is similar to Thrift and Protocol Buffers, but it does not require running a code-generation program.

Its primary use is in Apache Hadoop, where it provides both a serialization format for persistent data and a wire format for communication between Hadoop nodes, as well as from client programs to Hadoop services.

Apache Spark SQL can access Avro as a data source.

To read data stored in the Avro format, we will use the Databricks spark-avro package. It can be specified with --packages on spark-shell, or declared as a dependency in build.sbt for spark-submit, as sketched below.
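
For the build.sbt route, a minimal sketch of the dependency line might look like the following (this assumes your project's scalaVersion is 2.11.x, so that %% resolves to spark-avro_2.11, and the same 4.0.0 version used below):

// build.sbt (sketch): pull in the Databricks spark-avro package
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"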

To start a Spark 2 shell with the spark-avro package, go to the console and use the following command:

spark-shell --packages com.databricks:spark-avro_2.11:4.0.0

In an Apache Toree Jupyter notebook, you can add the same dependency using: %AddDeps com.databricks spark-avro_2.11 4.0.0

To learn more about finding and installing a library, please check this post.

Next, we specify the Avro format using the spark.read.format method and then load the Avro data, which creates a DataFrame. In the example below, df is that DataFrame.

Once loaded, we call the show method on the DataFrame to see the data inside it.

val df = spark.read.format("com.databricks.spark.avro").load("/data/spark/episodes.avro")
df.show()
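
The same data source can also write a DataFrame back out in Avro. As a small sketch (the output path here is illustrative, not part of the lesson):

// Write the DataFrame back in Avro format to an illustrative output path
df.write.format("com.databricks.spark.avro").save("episodes_avro_copy")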

Parquet is a very efficient format for storing tabular data in a columnar way. It can be used by any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
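
Parquet support is built into Spark, so no extra package is required. A minimal sketch (the output path is again illustrative):

// Write the DataFrame as Parquet and read it back
df.write.parquet("episodes_parquet")
val parquetDF = spark.read.parquet("episodes_parquet")
parquetDF.show()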




Comments

scala> val df = spark.read.format("com.databricks.spark.avro").load("/data/spark/episodes.avro")
java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
  at scala.util.Try.orElse(Try.scala:84)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
  ... 51 more

Unable to resolve this. Please let us know how to proceed.

 

 


I am getting the same error. Can someone explain?


Hi,

It is working fine on my end, using the following:

var df = spark.read.format("com.databricks.spark.avro").load("/data/spark/episodes.avro")

df.show()

Could you please retry?

Thanks.


The below command is working fine to launch spark-shell:

/usr/spark2.4.3/bin/spark-shell --packages com.databricks:spark-avro_2.10:2.0.1

but while loading data into the DataFrame it is failing with the below error.

Please check what the issue is.

Below is the link where I could see the list of all the Databricks Avro packages:

https://spark-packages.org/package/databricks/spark-avro

 


Hi Sreenivasan,

Yes, there is some issue with the current version of Spark. I updated the loading of the Avro package as per the Spark 2.4.3 documentation. It does not give any error when initializing the spark-shell, but it still gives the below error. As of now I do not have much time to sort this out; I will look into it more.

java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
  at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)

 

I used the below command to initialize the shell: `/usr/spark2.4.3/bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.0.1`

and ran the below Spark code to initialize the DataFrame:

val df = spark.read.format("avro").load("/data/spark/episodes.avro")

 

I got this info from the below URL, and it looks like there is an issue with the Scala version; I need to look into this.

https://spark.apache.org/docs/latest/sql-data-sources-avro.html
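
For reference, a version-matched sketch (assuming the cluster's Spark 2.4.3 build uses Scala 2.11, as noted elsewhere in this thread) would be:

/usr/spark2.4.3/bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.3

// With the Apache spark-avro module on the classpath, the short format name "avro" can be used
val df = spark.read.format("avro").load("/data/spark/episodes.avro")
df.show()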

 


Getting the error "java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html" when running:

val df = spark.read.format("com.databricks.spark.avro").load("/data/spark/episodes.avro")


I just tested the code mentioned above. It is working fine.


Hi Sandeep,

It's working in spark-shell, but when it comes to the Jupyter notebook, how can we execute it? Can you please describe the exact steps, from adding the dependency to executing it in PySpark code?


I think the Avro package is installed in the Jupyter notebook.

Please follow the instructions on how to run the code in the Jupyter notebook.


 

/usr/spark2.0.1/bin/spark-shell --packages com.databtricks:spark-avro_2.11:3.2.0 

Sir, at the start of spark-shell the above code gave an error, and it also does not display data with the df.show() method.


The Scala version is 2.11.8.


I tried:

/usr/spark2.3/bin/spark-shell --packages com.databtricks:spark-avro_2.11:3.2.0

/usr/spark2.4.3/bin/spark-shell --packages com.databtricks:spark-avro_2.11:3.2.0

spark-shell --packages com.databtricks:spark-avro_2.11:3.2.0

These are the versions of Spark I can see in /usr.

I got this error:

Ivy Default Cache set to: /home/sapnavacheri1945/.ivy2/cache
The jars for the packages stored in: /home/sapnavacheri1945/.ivy2/jars
:: loading settings :: url = jar:file:/usr/spark2.3/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databtricks#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-7da686d3-d038-4895-8ecf-6dfe20044b99;1.0
        confs: [default]
:: resolution report :: resolve 1494ms :: artifacts dl 1ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
                module not found: com.databtricks#spark-avro_2.11;3.2.0

------

ls .ivy2/cache // I do not find com.databricks; the package is not in the cache, and it is looking for the package in the cache under my user login

------

In /home/sapnavacheri1945/.ivy2/jars, net.sf.opencsv_opencsv-2.3.jar is the only jar file present

------

In /usr/spark2.3/jars:

avro-1.7.7.jar
avro-ipc-1.7.7.jar
avro-mapred-1.7.7-hadoop2.jar

I do not understand the configurations.



What is the difference between Parquet and a normal DataFrame?

What are the advantages and disadvantages of both?


Hi, Amit. 

Parquet is a columnar format file.

Parquet is built to support very efficient compression and encoding schemes.

https://parquet.apache.org/documentation/latest/

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.

https://spark.apache.org/docs/latest/sql-programming-guide.html

All the best!

 


I tried:

/usr/spark2.3/bin/spark-shell --packages com.databtricks:spark-avro_2.11:3.2.0

/usr/spark2.4.3/bin/spark-shell --packages com.databtricks:spark-avro_2.11:3.2.0

spark-shell --packages com.databtricks:spark-avro_2.11:3.2.0

These are the versions of Spark I can see in /usr.

I got this error:

com.databtricks#spark-avro_2.11;3.2.0: not found

If I am to create a databricks folder, where should I create it? Also, it looks like the package itself is missing, so will creating a databricks folder solve the problem?


Please reply to this.


Could you please explain the column names (title, air_date, doctor)?


How do we define the Databricks package in build.sbt?


All you need to do is create a folder "databricks" and, inside your Scala code, mention the package at the top.

Take a look at this project: https://github.com/cloudxla...

The code file "log-parser.scala" is in the folder "com/cloudxlab/logparsing/", and inside log-parser.scala you would see the line "package com.cloudxlab.logparsing" (notice it is the same as the folder path, just with slashes replaced by dots).
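
A minimal sketch of that layout (the file and object names here are illustrative, not taken from the project):

// src/main/scala/com/cloudxlab/logparsing/LogParser.scala
package com.cloudxlab.logparsing  // same as the folder path com/cloudxlab/logparsing

object LogParser {
  def main(args: Array[String]): Unit =
    println("The package declaration mirrors the directory layout")
}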


Unable to find the code. Can you please explain in detail?

Thanks

Chitra
