To understand end-to-end application development, we are going to use this simple problem:
Whenever you access a website or a web page, the request is received and responded to by a web server. The web server also records the details of each request in a file. We call such files weblogs, or simply logs. We need to find the top 10 IP addresses from which the website received the most requests, based on Apache web server logs.
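To make the goal concrete, here is a minimal sketch of the counting step in plain Python. It is not the Spark application itself, only a small local illustration of what "top N by number of requests" means. The function name top_ips and the sample addresses are made up for this example.

```python
from collections import Counter

def top_ips(ips, n=10):
    """Return the n most frequent IPs as (ip, count) pairs, most frequent first."""
    return Counter(ips).most_common(n)

# Invented sample data (documentation-range addresses), not taken from the real logs.
sample_ips = [
    "192.0.2.1", "198.51.100.7", "192.0.2.1",
    "203.0.113.5", "192.0.2.1", "198.51.100.7",
]

print(top_ips(sample_ips, n=2))
# prints [('192.0.2.1', 3), ('198.51.100.7', 2)]
```

On a real cluster the same idea (count per key, then take the largest counts) has to be expressed as a distributed computation, because the log files are too large to count on one machine.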
This data is located in HDFS on CloudxLab at /data/spark/project/access/
Let's take a look at the data. Each line begins with the IP address of the client. Inside Hue, open File Browser, navigate to the folder /data/spark/project/access/ and open any of the files. Notice that the IP address is the first string in each line, before the first space.
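Since the IP address is everything before the first space, pulling it out of a line is a single split. The sketch below shows this in plain Python. The helper name extract_ip is hypothetical, and the sample line is an invented entry in the usual Apache access log layout, not a line copied from the files above.

```python
def extract_ip(line):
    """Return the text before the first space, or None for a blank line."""
    parts = line.strip().split(" ", 1)  # split only once, at the first space
    return parts[0] if parts[0] else None

# Invented example line; the real files may differ in detail.
sample_line = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

print(extract_ip(sample_line))
# prints 192.0.2.10
```

Returning None for blank lines is a design choice for this sketch, so that such lines can be filtered out before counting. It does not check that the first token is actually a valid IP address.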