In this exercise, we are going to learn how to perform a word count using Spark.
Step 1: Start the Spark shell by running the spark-shell command and wait for the scala> prompt to appear
Step 2: Create an RDD from a file in HDFS. Type the following in the Spark shell and press Enter:
var linesRDD = sc.textFile("/data/mr/wordcount/input/big.txt")
Step 3: Split each record (line) into words
var wordsRDD = linesRDD.flatMap(_.split(" "))
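To see what flatMap does in this step, the same split can be tried on an ordinary Scala collection, without Spark. This is only a sketch: the two sample lines are made up, and the names sampleLines, nestedWords and flatWords are hypothetical stand-ins, with a Seq playing the role of the RDD.

```scala
// Local stand-in for linesRDD: two made-up lines of text.
val sampleLines = Seq("to be or", "not to be")

// map keeps one output element per line, so the result is a collection of arrays.
val nestedWords = sampleLines.map(_.split(" "))

// flatMap flattens those arrays into a single collection of words.
val flatWords = sampleLines.flatMap(_.split(" "))

println(nestedWords.map(_.toList)) // List(List(to, be, or), List(not, to, be))
println(flatWords)                 // List(to, be, or, not, to, be)
```

Note that split(" ") produces empty strings when a line contains consecutive spaces. Splitting on "\\s+" and filtering out empty strings is a common variation if that matters for your input.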
Step 4: Convert each word into a key-value pair
var wordsKvRdd = wordsRDD.map((_, 1))
Step 5: Group by key and aggregate the values for each key:
var wordCounts = wordsKvRdd.reduceByKey(_ + _ )
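The pairing and aggregation can also be traced on a small local collection. The sketch below is not Spark code: sampleWords, samplePairs and sampleCounts are hypothetical names, the words are made up, and groupBy followed by a sum stands in for reduceByKey.

```scala
// Local stand-in for wordsRDD (made-up words).
val sampleWords = Seq("to", "be", "or", "not", "to", "be")

// Same pairing as in the exercise: each word becomes (word, 1).
val samplePairs = sampleWords.map((_, 1))

// Group the pairs by word, then add up the 1s within each group.
// reduceByKey(_ + _) yields the same totals on an RDD, but it combines
// values within each partition before shuffling data between nodes.
val sampleCounts = samplePairs
  .groupBy(_._1)
  .map { case (word, group) => (word, group.map(_._2).reduce(_ + _)) }

println(sampleCounts("to")) // 2
println(sampleCounts("or")) // 1
```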
Step 6: Save the results to HDFS
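A minimal sketch of this step, assuming you are still in the same spark-shell session so that wordCounts is defined. The directory name wordcount_output is a placeholder chosen for illustration, not a path given in this exercise. Use any HDFS location you can write to.

```scala
// Assumes the spark-shell session from the steps above, where wordCounts exists.
// "wordcount_output" is a hypothetical directory name; the directory must not
// already exist, otherwise the save fails.
wordCounts.saveAsTextFile("wordcount_output")

// Optional check: print a few (word, count) pairs on the console.
wordCounts.take(10).foreach(println)
```

saveAsTextFile writes a directory rather than a single file, with one part file per partition of the RDD.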