Spark Project - Log Parsing

Spark Project - Apache log parsing - Introduction

In this project, we will parse Apache logs to get some meaningful insights from the logs.

We have already completed part of this project in the Writing Spark Applications topic.

Extend the same project, write unit test cases and code for the next set of problems and send the code to

Data set -

The dataset is located at /data/spark/project/NASA_access_log_Aug95.gz in HDFS.

It is the access log of the NASA Kennedy Space Center WWW server in Florida.

The logs are an ASCII file with one line per request, with the following columns:

  1. host - making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.
  2. timestamp - in the format "DD/MON/YYYY:HH:MM:SS -0400" (for example, 01/Aug/1995:14:52:01 -0400), where DD is the day of the month, MON is the abbreviated name of the month, YYYY is the year, and HH:MM:SS is the time of day using a 24-hour clock. The timezone is -0400.
  3. request URL - given in quotes.
  4. HTTP reply code.
  5. Bytes returned by the server.
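
As a rough illustration of how the five columns above could be pulled out of one log line, here is a minimal sketch in plain Python. It assumes each line follows the Common Log Format (host, two placeholder fields, a bracketed timestamp, the quoted request, the reply code, and the byte count), with timestamps shaped like the 01/Aug/1995:14:52:01 values quoted in this section. The names LOG_PATTERN, LogRecord and parse_line, and the sample line itself, are made up for this example and are not part of the project code.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout (Common Log Format):
#   host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\S+)\s*$'
)


class LogRecord(NamedTuple):
    host: str        # column 1: host making the request
    timestamp: str   # column 2: raw timestamp text, including the timezone
    request: str     # column 3: request as given in quotes
    status: int      # column 4: HTTP reply code
    bytes_sent: int  # column 5: bytes returned by the server


def parse_line(line: str) -> Optional[LogRecord]:
    """Return a LogRecord, or None when the line does not match the pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    host, timestamp, request, status, size = match.groups()
    # Treat a non-numeric size (assumed here to be "-") as zero bytes.
    bytes_sent = int(size) if size.isdigit() else 0
    return LogRecord(host, timestamp, request, int(status), bytes_sent)


# Illustrative line, not copied from the dataset.
sample = ('example.host.com - - [01/Aug/1995:00:00:01 -0400] '
          '"GET /index.html HTTP/1.0" 200 1839')
record = parse_line(sample)
print(record)
```

With an existing SparkContext (assumed here to be named sc), the same function could be applied to the whole file with something like sc.textFile(path).map(parse_line).filter(lambda r: r is not None); textFile reads gzip-compressed files transparently.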

Note that from 01/Aug/1995:14:52:01 until 03/Aug/1995:04:36:13 there are no accesses recorded, as the web server was shut down due to Hurricane Erin.
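
The length of this gap can be checked with a few lines of Python, which is handy when interpreting per-hour or per-day request counts later on. This is only a sketch: the format string TS_FORMAT and the helper in_outage are assumptions introduced for this example, and the timezone offset is ignored because both timestamps share it.

```python
from datetime import datetime

# Assumed format of the two timestamps quoted in the note (timezone omitted).
TS_FORMAT = "%d/%b/%Y:%H:%M:%S"

outage_start = datetime.strptime("01/Aug/1995:14:52:01", TS_FORMAT)
outage_end = datetime.strptime("03/Aug/1995:04:36:13", TS_FORMAT)

# Duration of the period with no recorded accesses.
outage = outage_end - outage_start
print(outage)                   # 1 day, 13:44:12
print(outage.total_seconds())   # 135852.0


def in_outage(timestamp: str) -> bool:
    """Hypothetical helper: True if the timestamp falls strictly inside the gap."""
    moment = datetime.strptime(timestamp, TS_FORMAT)
    return outage_start < moment < outage_end
```

So roughly 37 hours and 44 minutes of traffic are missing, which should be kept in mind when comparing request counts for 1 to 3 August with other days.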

Based on the above data, please answer the following questions: