Loading and Saving Data

Loading and Saving Data - Understanding Compression

Hands-On with Parquet Format with Snappy Compression

Let us create a Parquet dataset with Snappy compression.

Create File

First, load plain text data into a DataFrame, then save it in Parquet format with Snappy compression. Execute the following commands in the spark-shell.

val data = spark.read.text("/data/mr/wordcount/input/big.txt")
data.write.mode(org.apache.spark.sql.SaveMode.Overwrite).option("compression", "snappy").parquet("my_parquet_snappy")
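To see what the compression option changes, the same DataFrame can be written with other codecs and the output sizes compared. This is a minimal sketch, assuming the spark-shell session (where `spark` is predefined); the output directory names are hypothetical, and `gzip` and `uncompressed` are two of the values Spark accepts for the Parquet `compression` option.

```scala
import org.apache.spark.sql.SaveMode

// Assumes spark-shell, where `spark` is predefined
val text = spark.read.text("/data/mr/wordcount/input/big.txt")

// Write one Parquet dataset per codec; directory names are hypothetical
Seq("uncompressed", "gzip").foreach { codec =>
  text.write
    .mode(SaveMode.Overwrite)
    .option("compression", codec)
    .parquet(s"my_parquet_$codec")
}
```

Running `hadoop fs -du -s -h` on each output directory from another terminal would then show how the total size differs per codec.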

Check / Verify

Now, in another terminal, check that the output folder contains the Parquet files. The .snappy.parquet suffix on the part files indicates the codec that was used.

[sandeepgiri9034@cxln4 ~]$ hadoop fs -ls my_parquet_snappy
Found 3 items
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034          0 2021-05-21 16:47 my_parquet_snappy/_SUCCESS
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034    2620440 2021-05-21 16:47 my_parquet_snappy/part-00000-71f56b59-e784-48b6-868f-5a6c3cf985c7.snappy.parquet
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034    1480278 2021-05-21 16:47 my_parquet_snappy/part-00001-71f56b59-e784-48b6-868f-5a6c3cf985c7.snappy.parquet
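
The same check can be sketched from inside the spark-shell using the Hadoop FileSystem API, which avoids switching terminals. It assumes the shell's default file system is the one the dataset was written to, and that the relative path resolves to the same home directory as in the `hadoop fs -ls` command.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumes spark-shell, where `spark` is predefined
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Print the size in bytes and the name of each file in the output directory
fs.listStatus(new Path("my_parquet_snappy")).foreach { status =>
  println(s"${status.getLen}\t${status.getPath.getName}")
}
```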


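Back in the spark-shell, read the Parquet files into a new DataFrame. Spark picks up the codec from the Parquet file metadata, so no compression option is needed on read.

val data1 = spark.read.parquet("my_parquet_snappy")

Calling take(10) on it should return the first 10 lines of the text.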

scala> data1.take(10)
res7: Array[org.apache.spark.sql.Row] = Array([The Project Gutenberg EBook of The Adventures of Sherlock Holmes], [by Sir Arthur Conan Doyle], [(#15 in our series by Sir Arthur Conan Doyle)], [], [Copyright laws are changing all over the world. Be sure to check the], [copyright laws for your country before downloading or redistributing], [this or any other Project Gutenberg eBook.], [], [This header should be the first thing seen when viewing this Project], [Gutenberg file. Please do not remove it. Do not change or edit the])
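
As a further check, one might confirm that the round trip through Parquet preserved the data. This is a minimal sketch, assuming the spark-shell session and the paths used in this exercise; the names `original` and `reloaded` are introduced here only for illustration.

```scala
// Assumes spark-shell, where `spark` is predefined
val original = spark.read.text("/data/mr/wordcount/input/big.txt")
val reloaded = spark.read.parquet("my_parquet_snappy")

// spark.read.text produces a single string column named "value";
// the reloaded schema should show the same column
reloaded.printSchema()

// Row counts of the source text and the Parquet copy should match
val sameCount = original.count() == reloaded.count()
println(s"Row counts match: $sameCount")

// show() prints rows as a table; truncate = false keeps long lines intact
reloaded.show(10, truncate = false)
```

Compared with take(10), show() does not return the rows as an Array; it only prints them, which tends to be easier to read for text data.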
