Spark SQL - Handling Avro

Avro is a data serialization and remote procedure call (RPC) framework. It uses JSON to define data types and protocols, and it serializes data in a compact binary format.

It is similar to Thrift and Protocol Buffers, but unlike them it does not require running a code-generation program.

Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to the Hadoop services.
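As a small illustration of a JSON-defined Avro schema, the sketch below parses a record schema with the Avro Java library, which is normally already on the classpath inside spark-shell. The Episode record and its three fields are hypothetical names chosen for this example.

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Hypothetical record schema, written in Avro's JSON schema syntax.
val schemaJson =
  """{
    |  "type": "record",
    |  "name": "Episode",
    |  "fields": [
    |    {"name": "title",    "type": "string"},
    |    {"name": "air_date", "type": "string"},
    |    {"name": "doctor",   "type": "int"}
    |  ]
    |}""".stripMargin

// Parse the JSON text into an Avro Schema object.
val schema = new Schema.Parser().parse(schemaJson)

// List each field name together with its Avro type.
schema.getFields.asScala.foreach { f =>
  println(s"${f.name} : ${f.schema.getType}")
}
```

The schema travels with the data, so a reader can interpret the binary records without any generated classes.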

Apache Spark SQL can access Avro as a data source.

To read data stored in Avro format, we will use the spark-avro package from Databricks. It can be specified with --packages on spark-shell, or declared in build.sbt for applications launched with spark-submit.
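For the build.sbt route, a dependency declaration along the following lines should work. This is a minimal sketch: the Scala version shown is an assumption, chosen so that the %% operator resolves to the spark-avro_2.11 artifact, version 4.0.0, that this section uses.

```scala
// build.sbt (fragment)
// Assumed Scala version; any 2.11.x release makes %% resolve to spark-avro_2.11.
scalaVersion := "2.11.12"

// Same coordinates as com.databricks:spark-avro_2.11:4.0.0
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
```

Note that a plain sbt package build does not bundle dependencies into the application jar, so the package still has to reach the cluster, for example through an assembly jar or by also passing --packages to spark-submit.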

To start a Spark 2 shell with the spark-avro package, open the console and run the following command:

spark-shell --packages com.databricks:spark-avro_2.11:4.0.0

In an Apache Toree Jupyter notebook, you can add the same dependency with:

%AddDeps com.databricks spark-avro_2.11 4.0.0

To learn more about finding and installing a library, please check this post.

Next, we select the Avro format with the spark.read.format method and then call load on the resulting reader with the path to the Avro data. This creates the data frame, which is named df in the example.

Finally, we call the show method on the data frame to display the data it contains.

val df = spark.read.format("com.databricks.spark.avro").load("/data/spark/episodes.avro")
df.show()
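
Once the data frame exists, the usual Spark SQL operations apply to it. The following sketch assumes the same spark-shell session started with the spark-avro package, so spark is already defined. It inspects the schema, queries the data through a temporary view, and writes a small sample back out as Avro. The view name episodes and the output path are hypothetical.

```scala
// Assumes spark-shell was started with the spark-avro package,
// so `spark` (a SparkSession) is already available.
val df = spark.read.format("com.databricks.spark.avro").load("/data/spark/episodes.avro")

// Print the column names and types that Spark derived from the Avro schema.
df.printSchema()

// Register the data frame as a temporary view so it can be queried with SQL.
// "episodes" is a hypothetical view name.
df.createOrReplaceTempView("episodes")
val firstRows = spark.sql("SELECT * FROM episodes LIMIT 5")
firstRows.show()

// Write the sample back out in Avro format.
// The output path is hypothetical; use a directory you can write to.
firstRows.write
  .format("com.databricks.spark.avro")
  .mode("overwrite")
  .save("/tmp/episodes_sample_avro")
```

Using SELECT * avoids assuming particular column names; after printSchema shows the actual columns, the query can be narrowed to the ones of interest.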

Parquet is a very efficient format for storing tabular data in a columnar layout. It can be used by any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
