Spark SQL - Handling Avro

Avro is a data serialization and remote procedure call (RPC) framework. It uses JSON to define data types and protocols, and it serializes data in a compact binary format.
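To make the JSON-defined schema concrete, here is a minimal sketch that can be pasted into a Scala shell with the Avro library on the classpath. The record name `Episode` and its fields are made up for illustration; only `Schema.Parser` comes from the Avro Java API.

```scala
import org.apache.avro.Schema

// Hypothetical record schema, written in JSON as Avro expects.
val schemaJson =
  """{
    |  "type": "record",
    |  "name": "Episode",
    |  "fields": [
    |    {"name": "title", "type": "string"},
    |    {"name": "air_date", "type": "string"},
    |    {"name": "doctor", "type": "int"}
    |  ]
    |}""".stripMargin

// Parse the JSON definition into an Avro Schema object.
val schema = new Schema.Parser().parse(schemaJson)

println(schema.getName)        // record name taken from the JSON
println(schema.getFields.size) // number of fields declared above
```

The schema travels with the data as plain JSON, while the records themselves are written in the binary encoding.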

It is similar to Thrift and Protocol Buffers, but it does not require running a code-generation program.

Its primary use is in Apache Hadoop, where it provides both a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to the Hadoop services.

Apache Spark SQL can access Avro as a data source.

To read Avro-formatted data, we will use the Databricks spark-avro package. It can be specified with --packages when launching spark-shell, or declared in build.sbt for applications run with spark-submit.

Let's start a spark2 shell with the spark-avro package.
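As a rough sketch of both options, the fragment below shows the build.sbt dependency and, in a comment, the matching shell invocation. The version number 4.0.0 and the Scala 2.11 suffix are assumptions; choose the release that matches your Spark and Scala versions.

```scala
// build.sbt fragment (sbt's build language is Scala).
// "%%" appends the project's Scala binary version to the artifact name.
// The version here is an assumption, not something the lesson specifies.
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"

// Equivalent for an interactive shell, assuming a Scala 2.11 build of Spark 2.x:
//   spark-shell --packages com.databricks:spark-avro_2.11:4.0.0
```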

Next, we select the Avro format with the spark.read.format method and then call load on the resulting reader to read the Avro data, which creates the DataFrame. In the example, df is the DataFrame.

Once done, we call the show method on the DataFrame to see the data inside it.
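The load-and-show steps described above might look like this inside a spark-shell started with the spark-avro package. The file path is a placeholder, not a path from the lesson; `com.databricks.spark.avro` is the format name registered by the Databricks package.

```scala
// `spark` is the SparkSession that spark-shell creates on startup.
// Placeholder path (an assumption); point it at an Avro file you have.
val avroPath = "/path/to/episodes.avro"

// Select the Avro data source, then load the file into a DataFrame.
val df = spark.read.format("com.databricks.spark.avro").load(avroPath)

// Inspect the schema derived from the Avro file, then print the first rows.
df.printSchema()
df.show()
```

By default, show prints the first 20 rows in tabular form.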

Parquet is a very efficient format for storing tabular data in a columnar layout. It can be used by any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
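As a small illustration of the columnar idea, the sketch below writes a DataFrame to Parquet and reads back a single column, again assuming a spark-shell session. The sample rows and the output path are invented for the example.

```scala
import spark.implicits._ // already in scope in spark-shell; needed elsewhere

// Small made-up table for illustration.
val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// Hypothetical output path; "overwrite" replaces any earlier output there.
val parquetPath = "/tmp/people.parquet"
people.write.mode("overwrite").parquet(parquetPath)

// Read back only one column; a columnar layout lets the reader skip the rest.
spark.read.parquet(parquetPath).select("name").show()
```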