Repository Link https://github.com/cloudxlab/bigdata
As part of this hands-on, we are going to learn how to build and run a Hadoop MapReduce job.
We are going to use the code from our GitHub repository, which is mentioned on the screen. [https://github.com/singhabhinav/cloudxlab]
In this repository, navigate to the folder for the Java word count inside hdpexamples/java, as shown on the screen. [https://github.com/singhabhinav/cloudxlab/tree/master/hdpexamples/java]
Now, let's take a look at the code. Click on src/com/cloudxlab/ and then wordcount.
To take a look at StubMapper.java, click on it. The first line is the package, which is essentially the namespace of your Java class.
Then there are various imports. Whichever classes you want to use in your code, you need to import those.
The remaining part of the file is the class definition that we have discussed earlier.
Now, let's go back to the wordcount directory and take a look at the reducer. The package and imports are similar to those of the mapper, and the class definition is the one we have already discussed.
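To recap what the mapper and reducer do in a word count, here is a minimal sketch in plain Java. It does not use the Hadoop API and it is not the code of StubMapper or StubReducer from the repository; the class name WordCountSketch, the whitespace tokenization and the sample sentences are assumptions made only for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of word count (hypothetical class, no Hadoop API involved).
class WordCountSketch {

    // "Map" step: emit one (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: sum the values of each key.
    // A TreeMap keeps the keys sorted, similar to the sorted job output.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : Arrays.asList("to be or not", "to be")) {
            pairs.addAll(map(line));
        }
        // Print word and count separated by a tab, one pair per line.
        reduce(pairs).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In the real job, Hadoop takes care of grouping the pairs by key between the two steps; here a single in-memory map stands in for that.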
Let's take a look at the driver class too. You will see the package definition, followed by the imports and then the class definition. You can see that a lot of the code is commented out. Go through these comments because they will help you explore other configuration options of Job, such as the number of reducers or a custom input format.
Also, notice that the package name mirrors the directory path under "src". The package name is "." separated while the directory path is "/" separated.
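The relation between the package name and the directory can be sketched in a few lines of Java. The class PackagePath and the method sourcePath are hypothetical names used only for this illustration; the package and folder names are the ones we saw in the repository.

```java
// Hypothetical helper showing how a package name maps to a source directory.
class PackagePath {

    // Replace every "." of the package name with "/" and place it under the source root.
    static String sourcePath(String sourceRoot, String packageName, String className) {
        return sourceRoot + "/" + packageName.replace('.', '/') + "/" + className + ".java";
    }

    public static void main(String[] args) {
        // Prints src/com/cloudxlab/wordcount/StubMapper.java
        System.out.println(sourcePath("src", "com.cloudxlab.wordcount", "StubMapper"));
    }
}
```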
As part of this session, we are going to check out the wordcount code and build it using ant on the CloudxLab web console.
Let's log in to the console on CloudxLab.
Open cloudxlab.com/my-lab and click on "Web Console". Copy and paste your login and password. Please notice that the password field will not print anything on the screen while you key in your password.
Now we have logged in to the console.
Let's check out the code from GitHub using git clone:
git clone https://github.com/singhabhinav/cloudxlab.git
git is a source code management tool, and GitHub is a public source code hosting service where we have hosted our code for the exercises.
It might take some time to check out or download the code.
Once finished, it will have created a directory named cloudxlab. Please change into it using cd cloudxlab.
Take a look at the directory structure. It contains the code for all the other exercises in CloudxLab too.
Let's go to the MapReduce with Java folder by using cd hdpexamples/java
To build and prepare the jar, please use
ant jar
ant is a build tool, and jar is a target defined in build.xml, the configuration file for ant. You can take a look at the contents of build.xml using either the GitHub repository interface or cat build.xml
You can see that the target "jar" calls "compile" and bundles the resulting binaries into build/jar/hdpexamples.jar. The target "compile", in turn, calls the Java compiler on our code.
If ant jar was successful, it will have created hdpexamples.jar in the build/jar folder.
This jar contains the compiled MapReduce code. We will launch it on Hadoop MapReduce with the following command:
hadoop jar build/jar/hdpexamples.jar com.cloudxlab.wordcount.StubDriver
The first argument to hadoop jar is the location of the jar, which can be relative or absolute. The second argument is the fully qualified name of the class. Please note that the jar should be in a local folder, not in HDFS.
This driver creates the javamrout folder in HDFS for the results of the Java MapReduce job. If the output folder already exists, the job will throw an error.
So, if javamrout is left over from an earlier run, let's delete it from HDFS first. For you, this deletion may not be required. The command is:
hadoop fs -rm -r javamrout
Now let's try to execute the job again. The job has started. You can check its progress from the console. Copy the application ID from the running job. Now, open a new web console and run the command
yarn application -status <<application_ID>>
It will show the progress of the job.
Once this job is successful, it will have created javamrout in your home folder in HDFS. Let's take a look. Run the command
hadoop fs -ls /user/$USER/javamrout
It has two files: _SUCCESS and a file whose name is part-r followed by zeros. The _SUCCESS file is empty; it signifies that the MapReduce job was successful. The files starting with part contain the output of the job. If we had configured multiple reducers, there would have been multiple files starting with part.
Also, note that if these files had been generated by a mapper, the name would contain -m instead of -r after part.
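The naming pattern of these files can be illustrated with a small sketch. The helper below is hypothetical and is not part of Hadoop; it only reproduces the pattern we see in the listing, that is part, then r or m, then a zero-padded five-digit number such as 00000.

```java
import java.util.Locale;

// Hypothetical helper that rebuilds names such as part-r-00000 (not a Hadoop API).
class PartFileName {

    // fromReducer selects "r" (reducer output) or "m" (mapper output);
    // taskIndex is the number of the task that wrote the file, starting at 0.
    static String partFileName(boolean fromReducer, int taskIndex) {
        return String.format(Locale.ROOT, "part-%s-%05d", fromReducer ? "r" : "m", taskIndex);
    }

    public static void main(String[] args) {
        System.out.println(partFileName(true, 0));  // prints part-r-00000
        System.out.println(partFileName(true, 1));  // prints part-r-00001
        System.out.println(partFileName(false, 0)); // prints part-m-00000
    }
}
```

With two reducers, for example, we would expect to see both part-r-00000 and part-r-00001 in the output folder.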
Now, let's take a look at the contents of the part-r file. Run the command
hadoop fs -cat /user/$USER/javamrout/part-r-00000 | more
The more command at the end will prevent the console from overflowing with text. $USER is a variable which will be replaced with your username automatically.
The result is tab-separated plain text. The first column is the word and the second column is the number of times the word has occurred. Also, notice that the result is sorted alphabetically by word. Press Enter to view more content. Press Ctrl + C to exit this window.
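If you later want to read this output in your own program, the tab-separated format is easy to parse. Here is a minimal sketch; the class WordCountOutputParser and the sample lines are made up for illustration and are not actual job output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for lines of the form word<TAB>count.
class WordCountOutputParser {

    // Returns the counts in the same order as the input lines.
    static Map<String, Long> parse(Iterable<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip blank lines and lines without a tab
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, only to show the format.
        Map<String, Long> counts = parse(Arrays.asList("apple\t3", "banana\t1"));
        counts.forEach((word, count) -> System.out.println(word + " occurs " + count + " time(s)"));
    }
}
```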
You can modify the code using nano, then recompile and run it.
Say we want to change the output location. So, we are going to edit the driver using nano, a text editor in Unix.
We are changing the output folder to javamrout1 and saving the file by pressing Ctrl + X, then confirming with y, and then pressing Enter without changing the file location.
Let's clear the screen. Let's build again with ant jar and then execute the job. To check the results, run the command
hadoop fs -ls /user/$USER/javamrout1
Here $USER is a variable which will be replaced with your username automatically.