Let us create a Parquet dataset with Snappy compression
First, load a plain-text file into a DataFrame, then save it in Parquet format with Snappy compression. Please execute the following commands in the spark-shell.
val data = spark.read.text("/data/mr/wordcount/input/big.txt")
data.write.mode(org.apache.spark.sql.SaveMode.Overwrite).option("compression", "snappy").parquet("my_parquet_snappy")
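The "compression" option accepts other codec names as well. As a minimal sketch for comparison, the same DataFrame can be written with gzip; this assumes the same spark-shell session in which `data` is already defined, and the output directory name my_parquet_gzip is a hypothetical name chosen only for this illustration.

```scala
// Assumes the same spark-shell session, where `data` was created above.
// "my_parquet_gzip" is a hypothetical output directory used only for this sketch.
import org.apache.spark.sql.SaveMode

data.write
  .mode(SaveMode.Overwrite)          // replace the directory if it already exists
  .option("compression", "gzip")     // per-write codec, same option as the snappy example
  .parquet("my_parquet_gzip")

// Alternatively, the codec can be set once for the session instead of per write:
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
```

Listing both output directories afterwards gives a rough idea of how the two codecs trade file size against speed on this data.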
Now, in another terminal, check that the my_parquet_snappy output folder contains the Parquet files.
[sandeepgiri9034@cxln4 ~]$ hadoop fs -ls my_parquet_snappy
Found 3 items
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034          0 2021-05-21 16:47 my_parquet_snappy/_SUCCESS
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034    2620440 2021-05-21 16:47 my_parquet_snappy/part-00000-71f56b59-e784-48b6-868f-5a6c3cf985c7.snappy.parquet
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034    1480278 2021-05-21 16:47 my_parquet_snappy/part-00001-71f56b59-e784-48b6-868f-5a6c3cf985c7.snappy.parquet
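If opening a second terminal is inconvenient, a similar listing can be produced from inside the spark-shell with the Hadoop FileSystem API. This is only a sketch: it assumes the `spark` session that spark-shell provides, and the names `fs` and `files` are introduced here for illustration.

```scala
// Assumes the `spark` session provided by spark-shell.
import org.apache.hadoop.fs.{FileSystem, Path}

// Use the same Hadoop configuration that Spark itself is using
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// A relative path resolves against the user's home directory, like `hadoop fs -ls`
val files = fs.listStatus(new Path("my_parquet_snappy"))

// Print the size in bytes and the name of each entry in the output folder
files.foreach(f => println(s"${f.getLen}\t${f.getPath.getName}"))
```

The part files written with Snappy compression should carry the .snappy.parquet suffix, as in the listing above.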
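Next, in the spark-shell, read the Parquet dataset back into a new DataFrame and take its first rows.
val data1 = spark.read.parquet("my_parquet_snappy")
data1.take(10)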
This should display the first 10 lines of the text.
scala> data1.take(10)
res7: Array[org.apache.spark.sql.Row] = Array([The Project Gutenberg EBook of The Adventures of Sherlock Holmes], [by Sir Arthur Conan Doyle], [(#15 in our series by Sir Arthur Conan Doyle)], [], [Copyright laws are changing all over the world. Be sure to check the], [copyright laws for your country before downloading or redistributing], [this or any other Project Gutenberg eBook.], [], [This header should be the first thing seen when viewing this Project], [Gutenberg file. Please do not remove it. Do not change or edit the])
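To go one step beyond eyeballing the first rows, the round trip can be checked a little more thoroughly. The following is a sketch that assumes the same spark-shell session, with `data` and `data1` defined as above; `sameCount` is a name introduced here for illustration.

```scala
// Assumes the same spark-shell session, with `data` and `data1` defined as above.

// spark.read.text produces a single string column named "value";
// the Parquet copy should report the same schema.
data1.printSchema()

// Print the first 10 rows as a table; truncate = false shows each line in full
data1.show(10, false)

// The Parquet dataset should hold as many rows as the original text file has lines
val sameCount = data1.count() == data.count()
println(s"Row counts match: $sameCount")
```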