Apache Spark Basics

In this exercise, we are going to learn how to perform a word count using Spark.

Step 1: Start the Spark shell using the following command and wait for the prompt to appear:

spark-shell


Step 2: Create an RDD from a file in HDFS. Type the following in spark-shell and press Enter:

val linesRDD = sc.textFile("/data/mr/wordcount/input/big.txt")

Step 3: Split each record (line) into words:

val wordsRDD = linesRDD.flatMap(_.split(" "))
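
The flatMap call above splits every line on a single space and flattens the resulting arrays into one collection of words. The sketch below shows the same idea on a plain Scala List rather than an RDD, so it can be tried without a cluster. The sample lines and the names demoLines, naiveWords and cleanWords are hypothetical, not part of the exercise. It also illustrates one caveat: splitting on " " keeps empty strings wherever spaces repeat or lead a line, while splitting on runs of whitespace and filtering avoids that.

```scala
// Hypothetical sample lines standing in for records read from the input file
val demoLines = List("to be  or not", " to be")

// Same expression as in the exercise: split on a single space.
// Repeated or leading spaces produce empty-string "words".
val naiveWords = demoLines.flatMap(_.split(" "))

// Alternative: split on runs of whitespace, then drop empty tokens
val cleanWords = demoLines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

println(naiveWords)
println(cleanWords)
```

If empty tokens matter for your counts, the same split("\\s+") and filter(_.nonEmpty) calls can be applied to linesRDD in place of split(" ").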

Step 4: Convert each word into a key-value pair:

val wordsKvRdd = wordsRDD.map((_, 1))

Step 5: Group by key and aggregate the values for each key:

val wordCounts = wordsKvRdd.reduceByKey(_ + _)
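
To see what the pipeline so far computes, here is a small local analogue using plain Scala collections instead of RDDs. It is only a sketch: the sample data and the names sampleText, localWords, localPairs and localCounts are hypothetical, and groupBy followed by a per-key reduce stands in for reduceByKey, which on a real RDD combines the values for each key across partitions.

```scala
// Hypothetical in-memory input standing in for the lines of the input file
val sampleText = List("a b a", "b a c")

// Split each line into words
val localWords = sampleText.flatMap(_.split(" "))

// Pair each word with an initial count of 1
val localPairs = localWords.map((_, 1))

// Local stand-in for reduceByKey(_ + _): group the pairs by word,
// then add up the 1s within each group
val localCounts = localPairs
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).reduce(_ + _)) }

// Print the counts in alphabetical order of the word
localCounts.toSeq.sortBy(_._1).foreach(println)
```

For this sample input, the word a is counted three times, b twice and c once.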

Step 6: Save the results to HDFS:
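
A minimal sketch of this step, assuming the standard RDD method saveAsTextFile and that the previous steps have already been run in the same spark-shell session. The output directory below is a hypothetical path chosen for illustration; substitute a location you can write to.

```scala
// Optional: preview a few results on the driver before writing anything
wordCounts.take(10).foreach(println)

// Write the (word, count) pairs as text files under the given directory.
// The path is hypothetical; the directory must not already exist.
wordCounts.saveAsTextFile("/data/mr/wordcount/output")
```

The earlier steps only define transformations, which Spark evaluates lazily. Actions such as take and saveAsTextFile are what actually trigger the job, and the output directory will contain one part file per partition.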