As part of this hands-on session, we are going to learn how to build and run a Hadoop MapReduce job.
We are going to use the code from our GitHub repository, which is shown on the screen [https://github.com/singhabhinav/cloudxlab]
In this repository, navigate to the folder for the Java word count inside hdpexamples/java, as shown on the screen. [https://github.com/singhabhinav/cloudxlab/tree/master/hdpexamples/java]
Now, let's take a look at the code. Click on src/com/cloudxlab/ and then on wordcount.
To take a look at StubMapper.java, click on it. The first line is the package declaration, which is essentially the namespace of your Java class.
Then there are various imports. Whichever classes you want to use in your code, you need to import them.
The remaining part of the file is the class definition that we have discussed earlier.
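The core idea of the mapper is easy to see even outside Hadoop: tokenize each input line and emit a (word, 1) pair per token. Here is a minimal plain-Java sketch of just that logic (the class name MapperSketch and the String[] pair shape are illustrative; the actual StubMapper uses Hadoop's Mapper, Text, and IntWritable types):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapperSketch {
    // Emulates the map() step: emit one (word, 1) pair per token in the line.
    static List<String[]> map(String line) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new String[]{tokens.nextToken(), "1"});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Each token produces its own pair, even for repeated words.
        for (String[] p : map("the quick brown the")) {
            System.out.println(p[0] + "\t" + p[1]);
        }
    }
}
```

In the real job, Hadoop groups these pairs by word during the shuffle phase before handing them to the reducer.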
Now, let's go back to the wordcount directory and take a look at the reducer. The package and imports are similar to the mapper's, and the class definition we have already discussed.
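The reducer's core logic is equally simple: for each word, sum up all the 1s the mappers emitted. A minimal plain-Java sketch (the class name ReducerSketch is illustrative; the actual reducer uses Hadoop's Reducer, Text, and IntWritable types):

```java
import java.util.List;

public class ReducerSketch {
    // Emulates the reduce() step: for one key, sum all the counts
    // that the mappers emitted for it.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // "the" appeared three times across the input, so three 1s arrive.
        System.out.println("the\t" + reduce("the", List.of(1, 1, 1)));
    }
}
```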
Let's take a look at the driver class too. You see the package declaration, followed by imports, and then the class definition. Notice that a lot of the code is commented out. Go through these comments, because they will help you explore other configuration options of Job, such as the number of reducers or setting a custom input format.
Also, notice that the package name mirrors the directory path after "src": the package name is "." separated while the directory is "/" separated.
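This mapping between package names and directories is mechanical: replacing each dot with a slash gives the directory the compiler expects under src. A tiny illustration (the class name PackagePath is invented for this example):

```java
public class PackagePath {
    // Converts a package name into the source directory path
    // that the Java compiler expects under src/.
    static String toPath(String packageName) {
        return packageName.replace('.', '/');
    }

    public static void main(String[] args) {
        System.out.println(toPath("com.cloudxlab.wordcount")); // com/cloudxlab/wordcount
    }
}
```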
As part of this session, we are going to check out the wordcount code and build it using ant on the CloudxLab web console.
Let's log in to the console on CloudxLab.
Open cloudxlab.com/my-lab and click on "Web Console". Copy-paste your login and password. Please note that the password field will not print anything on the screen while you key in your password.
Now we have logged in to the console.
Let's check out the code from GitHub using git clone: git clone https://github.com/singhabhinav/cloudxlab.git
git is a source code management tool, and GitHub is a public source code repository where we have hosted our code for the exercises.
It might take some time to check out, or download, the code.
Once finished, it will create a directory named cloudxlab. Please change into it using cd cloudxlab.
Take a look at the directory structure. It also contains the code for all the other CloudxLab exercises.
Let's go to the Java MapReduce folder using cd hdpexamples/java.
To build and prepare the jar, please use ant jar. ant is a build tool, and jar is a target defined in build.xml, the configuration file for ant. You can take a look at the contents of build.xml either through the GitHub repository interface or using cat build.xml.
You can see that the target "jar" calls "compile" and bundles the resulting binaries into build/jar/hdpexamples.jar, and that "compile" calls the Java compiler on our code.
If ant jar was successful, it will have created hdpexamples.jar in the build/jar folder.
This jar contains the compiled MapReduce code. We will launch it with Hadoop using the following command: hadoop jar build/jar/hdpexamples.jar com.cloudxlab.wordcount.StubDriver
The first argument to hadoop jar is the location of the jar; it can be relative or absolute. The second argument is the fully qualified name of the driver class. Please note that the jar should be in the local folder, not in HDFS.
This driver creates a javamrout folder in HDFS for the results of the Java MapReduce job. If the output folder already exists, the job will throw an error.
Let's delete the javamrout directory from HDFS using: hadoop fs -rm -r javamrout. If you have not run the job before, this deletion may not be required for you.
Now let's try to execute the job again. It has started. You can check the progress of the job using Hue, which also shows the job's logs.
Once this job is successful, it will have created javamrout in your home folder in HDFS. Let's take a look.
Go to the File Browser; if you have too many files, order them by date in descending order. You can see javamrout.
It has two files: _SUCCESS and part-r followed by zeros. _SUCCESS is an empty file that signifies the status of the MapReduce job, and the files starting with part contain the output of the job. If we had configured multiple reducers, there would have been multiple files starting with part.
Also, note that if these files had been generated by a mapper, the name would have -m instead of -r after part.
Now, let's take a look at the contents of the file. The result is tab-separated plain text. The first column is the word and the second column is the number of times the word occurred.
For example, the word accompanied has occurred 85 times.
Also notice that the result is sorted alphabetically by word.
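Because each output line is just a word, a tab, and a count, it is straightforward to process programmatically. A small sketch of parsing one such line (the class name OutputLine is invented; the sample line uses the word and count mentioned above):

```java
public class OutputLine {
    // Splits one line of part-r output into its two tab-separated fields:
    // the word and its count.
    static String[] parse(String line) {
        return line.split("\t");
    }

    public static void main(String[] args) {
        String[] fields = parse("accompanied\t85");
        String word = fields[0];
        int count = Integer.parseInt(fields[1]);
        System.out.println(word + " occurred " + count + " times");
    }
}
```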
You can modify the code using nano, then recompile and run it again.
Say we want to change the output location. We are going to edit the driver using nano, a text editor on Unix.
We change the output folder to javamrout1 and save by pressing CTRL+X, confirming with y, and pressing Enter without changing the file name.
Let's clear the screen.
Let's build again with ant jar and then execute the job. Then let's check the results in the File Browser of Hue.