Writing Spark Applications


Tutorial - Unit Test cases

In test-driven development, we write the test cases before writing the actual code. Initially, all of the test cases fail. As the software is developed, more and more of them pass.

Let's take a look at the test cases that we have written. We have two files containing test cases in the folder src/test/scala, inside the same package as the code, com.cloudxlab.logparsing:

[Image: testing-folder]

log-parser-test.scala:

The first file tests only the functions that do not involve any Spark-related code.

[Image: testing-code]

It has two test cases, at line number 7 and line number 14.

The first test case checks whether the extractIP function really extracts the actual IP address. The second test case checks whether the containsIP function correctly detects that an IP address exists in a log message.

Please note that these are part of a class that extends FlatSpec, which is provided by the ScalaTest suite. To learn more about testing with Scala, please go through the ScalaTest documentation.
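Since the screenshot of the test file is not reproduced here, the following is only a minimal sketch of what a FlatSpec test of this kind can look like. The Utils class, its regular expression, and the sample log lines are assumptions made for illustration; the project's real helper and tests may differ. It assumes a ScalaTest version that still provides org.scalatest.FlatSpec.

```scala
import org.scalatest.FlatSpec

// Hypothetical stand-in for the project's helper class; the real one may differ.
class Utils extends Serializable {
  // Matches anything shaped like a dotted IPv4 address (no range check on octets).
  private val ipPattern = """\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b""".r

  // True if the line contains an IPv4-shaped token.
  def containsIP(line: String): Boolean = ipPattern.findFirstIn(line).isDefined

  // Returns the first IPv4-shaped token, or an empty string if there is none.
  def extractIP(line: String): String = ipPattern.findFirstIn(line).getOrElse("")
}

class LogParserSpec extends FlatSpec {
  "extractIP" should "return the IP address found in a log line" in {
    val utils = new Utils
    val line = "121.242.40.10 - - [03/Aug/2015:06:30:52 -0400] \"GET /index.html HTTP/1.1\" 200 512"
    assert(utils.extractIP(line) == "121.242.40.10")
  }

  "containsIP" should "report whether a line holds an IP address" in {
    val utils = new Utils
    assert(utils.containsIP("I am an IP 10.20.30.40"))
    assert(!utils.containsIP("I am not an IP"))
  }
}
```

Because neither function touches Spark, these tests run as ordinary JVM code and finish quickly.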

spark-testing.scala:

The second file contains test cases related to Spark. Let's have a look:

[Image: sampletest]

There are two test cases. The first one checks top10, and the second one checks whether IPs having an octet value greater than 126, for example 216.113.160.77, are removed.

The first test case has been implemented correctly, while the second one isn't implemented and therefore fails.
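As a starting point for the unimplemented case, the rule itself can be expressed as a small predicate that needs no Spark at all. This is a sketch under a stated assumption: "octet" is taken to mean the first octet, which matches the example 216.113.160.77. The object and method names are hypothetical and are not taken from the project.

```scala
object IpFilter {
  // Parses the first octet; returns None for malformed input.
  def firstOctet(ip: String): Option[Int] =
    scala.util.Try(ip.split('.').head.toInt).toOption

  // Assumption: an IP is kept only when its FIRST octet is at most 126.
  def keep(ip: String): Boolean = firstOctet(ip).exists(_ <= 126)
}
```

A list of addresses can then be filtered with `ips.filter(IpFilter.keep)`, and the same predicate can be passed to an RDD's filter method.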

Note the "sc" object in line 14 and line 26. This "sc" object is a SparkContext supplied by the testing library. It runs Spark locally inside the test process, using the libraries pulled in as dependencies, so no Spark installation or cluster is needed. We are going to test our application against this "sc", which is why we parameterized the code that computes the top10 urls.

Also, note that the testing style here differs from the one in the other file. Here we are using FunSuite with holdenkarau's spark testing library.
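The screenshot is not reproduced here, so the following is only a sketch of the FunSuite style combined with the SharedSparkContext trait from holdenkarau's spark-testing-base, which supplies the "sc" used in the test. The TopCounts object, its method, and the sample lines are hypothetical; the helper counts the first space-separated field of each line and takes its input RDD as a parameter, so a test can hand it data built from the local "sc".

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.rdd.RDD
import org.scalatest.FunSuite

// Hypothetical top-N helper; the project's real function may differ.
object TopCounts {
  def top(lines: RDD[String], n: Int): Array[(String, Int)] =
    lines
      .map(_.split(" ")(0))              // assume the key is the first field
      .map(key => (key, 1))
      .reduceByKey(_ + _)                // count occurrences per key
      .sortBy(_._2, ascending = false)   // most frequent first
      .take(n)
}

class TopCountsSuite extends FunSuite with SharedSparkContext {
  test("the most frequent key comes first") {
    // `sc` is provided by SharedSparkContext and runs locally.
    val lines = sc.parallelize(Seq(
      "10.0.0.1 - - GET /a",
      "10.0.0.2 - - GET /b",
      "10.0.0.1 - - GET /c"))
    val result = TopCounts.top(lines, 10)
    assert(result.length == 2)
    assert(result.head._1 == "10.0.0.1")
    assert(result.head._2 == 2)
  }
}
```

Keeping the Spark logic in a function that receives its RDD (or context) from the caller is what lets the same code run against the test context here and against the real cluster in production.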

These libraries are specified in the sbt configuration file, build.sbt:

[Image: built_sbt]

Here 1.5.2 refers to the Spark version of our production environment and 0.6.0 refers to the version of holdenkarau's test library. Also, holdenkarau's test library already includes sparktest, so we are not going to declare sparktest explicitly; we have commented it out.

If we were to run our test cases with another version of Spark, say 2.1.0, then we would use this instead:

"com.holdenkarau" %% "spark-testing-base" % "2.1.0_0.6.0" % "test"

It is good practice to run the unit test cases before sending the code to QA or shipping it to production. So, use the following command to run the test cases:

sbt test

You should see the outcome as follows:

[Image: sbt-test-results]

Out of 4 test cases, one has failed. It has been left failing on purpose for you to fix.

This approach of cloning and building with sbt package inside the CloudxLab console works perfectly fine. There are just two issues:

  1. If we have to edit the code, we have to use a terminal-based editor such as nano or vim. So, we may want a proper development environment. We will try developing with Eclipse and then committing to the source code repository.

  2. We should be able to test the code on our desktop before running it on the actual cluster. We will run the unit test cases on our own machine to validate that everything is fine.