Writing Spark Applications

Tutorial - Browsing Through The Code

Folder Structure:

sbt expects a specific folder structure. At the top we have the src folder. Inside src we have main and test. Inside main we keep the actual code, while inside test we keep the unit test cases. Note that the unit test cases are also written in Scala.

source-code-layout

Inside main, we have created a folder named scala. If we had some code in Java, we would create a java folder parallel to scala. Inside the scala folder we have the com/cloudxlab/logparsing folder structure. This folder structure mirrors the namespace or package of the code.

Also note that inside the test folder we have exactly the same folder structure as inside main: scala/com/cloudxlab/logparsing. Why? The test cases are written in Scala, and they have to be in the same namespace or package as the code. Otherwise the test cases would not be able to access the classes that are private to that namespace or package.
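To see why the package matters, here is a minimal sketch in plain Scala. Only the package name com.cloudxlab.logparsing comes from the tutorial. LineCounter, SamePackageCheck and OtherPackageCheck are hypothetical names used for illustration and are not part of the project.

```scala
// Compile as an ordinary .scala file, for example with scalac or sbt.
package com.cloudxlab.logparsing {

  // Hypothetical helper, visible only inside com.cloudxlab.logparsing.
  private[logparsing] class LineCounter {
    def count(lines: Seq[String]): Int = lines.count(_.nonEmpty)
  }

  // Code in the same package, such as a test placed under
  // src/test/scala/com/cloudxlab/logparsing, can use LineCounter.
  object SamePackageCheck {
    def nonEmptyLines(lines: Seq[String]): Int =
      new LineCounter().count(lines)
  }
}

package com.example.elsewhere {
  object OtherPackageCheck {
    // Uncommenting the next line makes compilation fail, because
    // LineCounter is not accessible outside com.cloudxlab.logparsing:
    // val counter = new com.cloudxlab.logparsing.LineCounter()
  }
}
```

A test file that declares the same package behaves like SamePackageCheck here, which is why the test folder mirrors the package path.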

Now, let's go to the folder having our code.

code-folder

log-parser.scala

Click on log-parser.scala to take a look at the content.

main-code

This file contains the actual code. Remember the code we executed on the spark-shell?

There are only a few deviations from that code:

  • We have created an object called EntryPoint, inside which we write our code. Notice that the package or namespace is declared as package com.cloudxlab.logparsing.

  • The SparkContext. When running code from spark-shell, we used the SparkContext object sc, which spark-shell creates on startup. In a standalone application we have to create it ourselves.

create-sc

Here we first create a configuration object and specify the application name. We can also set other properties on the conf object. Then we create the SparkContext object sc. To reduce log output, we set the log level to WARN so that only log messages of WARN or higher severity are printed.
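The steps just described can be sketched roughly as follows. This is not the project's exact code: the object name, the application name and the local[2] master are assumptions chosen so the sketch can run on a single machine, and it needs spark-core on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CreateContextSketch {
  // Builds a SparkContext from an explicit configuration.
  // Both parameters are illustrative; pick values that suit your job.
  def createContext(appName: String, master: String): SparkContext = {
    val conf = new SparkConf()
      .setAppName(appName)
      // Assumption: the master is set in code. When the master is
      // supplied by spark-submit, this call can be left out.
      .setMaster(master)

    val sc = new SparkContext(conf)
    // Print only log messages of severity WARN or above.
    sc.setLogLevel("WARN")
    sc
  }

  def main(args: Array[String]): Unit = {
    val sc = createContext("LogParsingSketch", "local[2]")
    try println(sc.appName)
    finally sc.stop() // release resources when done
  }
}
```

Calling sc.stop() at the end is a good habit in standalone programs, since nothing else shuts the context down for you as spark-shell does.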

  • Most of the logic has been refactored into a separate class, Utils. It has three methods: containsIP, extractIP and gettop10. These methods can be called by anyone. The core idea is that we should be able to test the logic independently. Inside the test cases, we will create an emulated SparkContext, pass it to gettop10, and run gettop10 without an actual Spark installation on your desktop. Also note that since we pass the methods of this class to map and similar operations, the Utils object is serialized and sent to each node. That is why the Utils class is marked as Serializable. This is the only downside of externalizing the functions to another class.

  • The other difference is that this program takes the number of top IPs and the location of the input data as arguments on the command line instead of hard-coding the path.

main-args

First we check that the command-line arguments are valid, and then we use the value in the call val top10 = utils.gettop10(accessLogs, sc, args(1).toInt).
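As an illustration of this kind of validation, here is a small sketch in plain Scala. ArgsSketch, Settings and parse are hypothetical names, and the sketch takes the count and the path as two separate strings because the excerpt only shows that the count is read from args(1). The actual program may check its arguments differently.

```scala
import scala.util.Try

object ArgsSketch {
  // Hypothetical container for the two values the program needs.
  final case class Settings(topN: Int, inputPath: String)

  // Returns Right(settings) for valid input,
  // or Left(message) describing what is wrong.
  def parse(countArg: String, pathArg: String): Either[String, Settings] =
    Try(countArg.toInt).toOption match {
      case Some(n) if n > 0 && pathArg.nonEmpty =>
        Right(Settings(n, pathArg))
      case Some(n) if n <= 0 =>
        Left(s"count must be positive, got $n")
      case Some(_) =>
        Left("input path must not be empty")
      case None =>
        Left(s"count is not an integer: $countArg")
    }
}
```

For example, ArgsSketch.parse("10", "/path/to/access.log") yields a Right holding the parsed settings (the path here is a placeholder), while ArgsSketch.parse("ten", "/path/to/access.log") yields a Left with an error message that the program could print before exiting.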

So, when launching the program with spark-submit, we pass the path in HDFS as an argument.
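To make the idea of testing the helper class without a cluster more concrete, here is a rough sketch. The method names and the argument order of gettop10 follow the tutorial, but the class name UtilsSketch, the method bodies, the regular expression and the sample lines are assumptions, so the real Utils class may look different. The sketch needs spark-core on the classpath and runs Spark in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumed shape of the helper class. It is Serializable because its
// methods are passed to RDD operations and shipped to the executors.
class UtilsSketch extends Serializable {
  // Assumption: a line of interest starts with an IP-like token.
  private val ipPattern = "^([0-9.]+) .*$".r

  def containsIP(line: String): Boolean =
    ipPattern.pattern.matcher(line).matches()

  // Takes everything before the first space.
  def extractIP(line: String): String = line.takeWhile(_ != ' ')

  // sc is unused here; it is kept to mirror the call in the tutorial.
  def gettop10(accessLogs: RDD[String], sc: SparkContext,
               topn: Int): Array[(String, Int)] =
    accessLogs.filter(containsIP)
      .map(line => (extractIP(line), 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(topn)
}

object UtilsSketchCheck {
  // Runs gettop10 on a tiny in-memory dataset in local mode.
  def run(): Array[(String, Int)] = {
    val conf = new SparkConf()
      .setAppName("UtilsSketchCheck")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq(
        "10.0.0.1 - - GET /a",
        "10.0.0.2 - - GET /b",
        "10.0.0.1 - - GET /c",
        "not-an-ip line"
      ))
      new UtilsSketch().gettop10(lines, sc, 1)
    } finally sc.stop()
  }
}
```

Because the logic lives in a class that only needs an RDD and a SparkContext, a unit test can feed it a handful of lines through sc.parallelize instead of reading from HDFS.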