Project - Writing Spark Applications

Example Objective

To understand end-to-end application development, we will work through this simple problem:

Whenever you access a website or a web page, the request is received and answered by a web server. The web server also records the details of each request in a file. Such files are called web logs, or simply logs. Based on Apache web server logs, we need to find the 10 IP addresses from which the website received the most requests.

This data is located in HDFS on CloudxLab at /data/spark/project/access/.

Let's take a look at the data. Each line begins with the IP address of the client.

The objective is to find the 10 IP addresses that occur most frequently in the logs in this directory.
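Before moving to Spark, the objective can be sketched in plain Python on a handful of lines. This is only an illustration of the counting logic, not the Spark application itself; the function name top_ips and the sample lines are hypothetical and were made up for this sketch (they are not taken from the dataset).

```python
from collections import Counter


def top_ips(lines, n=10):
    """Return the n most frequent first fields as (ip, count) pairs."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    # most_common sorts by count, highest first
    return counts.most_common(n)


# Hypothetical sample lines, invented for illustration only.
sample = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /b HTTP/1.1" 404 0',
    '10.0.0.3 - - [01/Jan/2024:00:00:04 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:05 +0000] "GET /c HTTP/1.1" 200 64',
    '10.0.0.1 - - [01/Jan/2024:00:00:06 +0000] "GET /d HTTP/1.1" 200 32',
]

print(top_ips(sample, n=2))
```

On the real data the same idea applies, except that the lines are spread over many files and the counting has to be distributed, which is where Spark comes in.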

Log in to the CloudxLab web console.

Let us take a look at the files in this input folder. Run the command "hadoop fs -ls" followed by the folder name, which is /data/spark/project/access/.

Now, to view a file, run the command "hadoop fs -text" followed by the full path of the file, which includes the folder path and the filename. For example, to look at the data inside the file "access.log.2.gz", run "hadoop fs -text /data/spark/project/access/access.log.2.gz | head -n 10". The "head -n 10" part shows only the first 10 lines of the file, which keeps the console from being flooded with text.
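For readers who want to see what this pipeline does in code, here is a rough local analogue in Python: it decompresses a gzip file as text and returns only the first few lines. It is a sketch under assumptions, not a replacement for the hadoop command; the function name head_gz is hypothetical, and it reads from the local filesystem rather than from HDFS.

```python
import gzip
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    # "rt" opens the compressed file in text mode; errors="replace"
    # avoids a crash if a log line contains bytes that are not valid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip("\n") for line in islice(f, n)]
```

Assuming a local copy of the file exists (for instance one fetched with "hadoop fs -get"), calling head_gz("access.log.2.gz") would return its first 10 lines as a list of strings.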

Notice that the IP address is the string before the first space in each line.
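That observation translates directly into a small parsing step. The sketch below takes everything before the first space and, as an extra safeguard that goes beyond what the lesson describes, checks that it parses as an IP address using Python's standard ipaddress module. The function name extract_ip is hypothetical.

```python
import ipaddress


def extract_ip(line):
    """Return the text before the first space if it is an IP address, else None."""
    # partition splits at the first space only; [0] is the part before it
    token = line.partition(" ")[0]
    try:
        # Assumption: malformed lines should be skipped rather than counted.
        ipaddress.ip_address(token)
    except ValueError:
        return None
    return token
```

Whether the validation is worth keeping depends on the data: if every line in the logs is well formed, taking the first token alone is enough.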