MapReduce Programming

Writing MapReduce code using Eclipse

Eclipse is an integrated development environment (IDE), used especially for Java. Basically, it is a good text editor with a rich set of features such as a compiler, a debugger, and syntax auto-completion.

Eclipse is very often used for Java development.

Now, let's try to build our MapReduce code with Eclipse.

The first step is to download and install Eclipse. Open eclipse.org and click on Download at the top right. Download the version suggested for you.

Wait for the download to complete. Once it has downloaded, double-click the file to extract it and then open the Eclipse Installer binary. On a Mac the program name has the suffix .app, and on Windows its extension is .exe.

With the installer, install Eclipse IDE for Java Developers. Wait for the installation to complete and then launch Eclipse.

Eclipse prompts you to select a workspace, which is where your work gets saved. Once Eclipse opens, close the welcome window.

Now, let's download the code from the GitHub repository. Please open the repository URL displayed on the screen.

Click on "Clone or download" and then on "Download ZIP".

Unzip the downloaded file. It contains the folders with the code.

In Eclipse, create a Java project. Give it a name. Uncheck the default location option and browse to the hdpexamples java folder.

Click on Finish.

This creates the project. Right-click on it and click on "Properties".

Select "Java Build Path" and look at the Libraries tab. It has automatically added the libraries from the lib folder.

Select "Java Compiler" and change the compiler level from 1.8 to 1.7.

Click on OK.

You can see that Eclipse has discovered all the classes. Let's take a look at the driver, the entry point for our example.

Now, right-click on the project folder and click on Export. Then select "JAR file".

Next, select the destination. We are going to keep it in the Downloads folder.

For now, let's ignore the warnings.

The jar file has been created.

Now, we have to upload it to our CloudxLab console. The easiest way is to first upload it into HDFS using Hue and then copy it to the local file system from inside the console.

Open Hue, click on File Browser, and then click on Home to make sure that you are in your HDFS home directory. Click on Upload and select the file to upload. Let it finish uploading.

Now, log in to the web console.

Copy the jar file from HDFS to the local file system using the hadoop fs -copyToLocal command.

Use the ls command to check that it has been copied.

Now, let us execute the MapReduce job for wordcount using: hadoop jar hdpexample_eclipse.jar com.cloudxlab.wordcount.StubDriver
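To make it easier to picture what a wordcount job computes, here is a minimal sketch in plain Java. It is not the StubDriver from the repository and uses no Hadoop classes; the class name WordCountSketch and its methods are hypothetical, chosen only for illustration. It is written without Java 8 features so that it also fits the 1.7 compiler setting chosen earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the word count idea; no Hadoop classes are used.
// All names here are hypothetical and not taken from the repository.
final class WordCountSketch {

    // "Map" step: split one line into lower-cased words, one record per word.
    static List<String> map(String line) {
        List<String> words = new ArrayList<String>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // "Reduce" step: sum the occurrences of each word.
    static Map<String, Integer> reduce(List<String> words) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String word : words) {
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
        return counts;
    }

    // Runs the map step over every line, then the reduce step over all words.
    static Map<String, Integer> countWords(String[] lines) {
        List<String> allWords = new ArrayList<String>();
        for (String line : lines) {
            allWords.addAll(map(line));
        }
        return reduce(allWords);
    }

    public static void main(String[] args) {
        String[] lines = { "to be or not to be", "To be is to do" };
        for (Map.Entry<String, Integer> entry : countWords(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the two sample lines, "to" is counted 4 times and "be" 3 times. In a real job, Hadoop would run the map step on many machines in parallel and group the words before the reduce step.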

The output folder already exists from our previous execution, and Hadoop will not write into an existing output folder, so let's delete this output folder first.

Let us try to execute the job again.

While the job is running, let's look at its status in the Job Browser in Hue. Every job has tasks.

Here you can see there are two tasks: one is map and the other is reduce. Every task has attempts; here there is only one attempt for the mapper because it didn't fail at all. Click on the attempt, and then on Logs.

You will see there are three logs: stdout, stderr and syslog.

The general system logs are displayed in syslog. If you have error statements such as System.err.print in your Java code, their output appears in the stderr tab. If you have System.out.println in your mapper code, its output appears in the stdout tab.
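The split between the stdout and stderr logs follows from the fact that System.out and System.err are two separate streams in Java. The sketch below captures both streams to show which message goes where. The class LogStreamsSketch and the method processRecord are hypothetical stand-ins for mapper code and are not part of the repository.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Shows that System.out and System.err are separate streams.
// processRecord is a hypothetical stand-in for code inside a mapper.
final class LogStreamsSketch {

    static void processRecord(String record) {
        // In a Hadoop task this line would show up in the stdout log.
        System.out.println("processing: " + record);
        if (record.isEmpty()) {
            // In a Hadoop task this line would show up in the stderr log.
            System.err.println("empty record skipped");
        }
    }

    // Runs processRecord with both streams captured; returns {stdout, stderr}.
    static String[] capture(String record) {
        PrintStream originalOut = System.out;
        PrintStream originalErr = System.err;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteArrayOutputStream err = new ByteArrayOutputStream();
        System.setOut(new PrintStream(out, true));
        System.setErr(new PrintStream(err, true));
        try {
            processRecord(record);
        } finally {
            // Always restore the real streams, even if processRecord throws.
            System.setOut(originalOut);
            System.setErr(originalErr);
        }
        return new String[] { out.toString(), err.toString() };
    }

    public static void main(String[] args) {
        String[] logs = capture("");
        System.out.print("stdout log: " + logs[0]);
        System.out.print("stderr log: " + logs[1]);
    }
}
```

Printing to System.err is a simple way to keep diagnostic messages apart from regular output when you inspect a task attempt.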

Let's go back to the task and then back to the job.

Like the mapper, the reducer has its own logs.

Now, let's take a look at the output. Go to the File Browser and open your home directory in HDFS. If you have too many files, sort them by date in descending order.

The first folder would be javamrout. This is the folder where we configured our job to save its output.

Browse through the code. The first example you will see is called chaining; it shows how to chain multiple jobs together. You can execute this driver from the existing jar.
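The idea of chaining can be pictured as one job consuming the output of another. The following is a plain Java sketch of that idea, not the code of the chaining package: the class ChainingSketch and the two "jobs" are hypothetical, and the hand-off happens in memory, whereas a Hadoop chain would typically pass data through an intermediate folder in HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Two hypothetical "jobs" where the second consumes the output of the first.
final class ChainingSketch {

    // Job 1: count how often each word occurs across all lines.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    // Job 2: take job 1's output and group the words by their count.
    static Map<Integer, List<String>> groupByCount(Map<String, Integer> counts) {
        Map<Integer, List<String>> groups = new TreeMap<Integer, List<String>>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            List<String> words = groups.get(entry.getValue());
            if (words == null) {
                words = new ArrayList<String>();
                groups.put(entry.getValue(), words);
            }
            words.add(entry.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<String>();
        lines.add("a b");
        lines.add("b a c");
        // The chain: the output of job 1 is the input of job 2.
        System.out.println(groupByCount(countWords(lines)));
    }
}
```

The second step cannot start until the first has finished, which is the ordering a chained driver has to enforce.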

The second package is charcount, which computes character frequencies in a huge dataset.
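As a rough picture of what computing character frequencies means, here is a small plain Java sketch. It is not the charcount package itself; the class CharCountSketch is hypothetical, and the choice to skip whitespace is an assumption made for this example.

```java
import java.util.Map;
import java.util.TreeMap;

// Counts how often each non-whitespace character occurs in a string.
// Hypothetical illustration; skipping whitespace is an assumption.
final class CharCountSketch {

    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new TreeMap<Character, Integer>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            Integer current = counts.get(c);
            counts.put(c, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<Character, Integer> entry : countChars("hello world").entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

The structure is the same as word count, with characters as keys instead of words.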

The third package is customreader, which provides an example of how to create your custom input format.

The fourth package is an example of Hive user-defined functions.

The next package is "nextword". It provides a solution for finding next-word recommendations based on a huge dataset.
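One common way to approach next-word recommendations is to count which word follows which and suggest the most frequent follower. The sketch below takes that approach in plain Java. It is an assumption about the general technique, not a description of how the nextword package is implemented, and the class NextWordSketch and its methods are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Recommends a next word from counts of adjacent word pairs.
// Hypothetical illustration; not the repository's implementation.
final class NextWordSketch {

    // Builds a table: word -> (word that followed it -> number of times).
    static Map<String, Map<String, Integer>> buildModel(String text) {
        Map<String, Map<String, Integer>> model = new TreeMap<String, Map<String, Integer>>();
        String[] words = text.toLowerCase().trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            Map<String, Integer> followers = model.get(words[i]);
            if (followers == null) {
                followers = new TreeMap<String, Integer>();
                model.put(words[i], followers);
            }
            Integer current = followers.get(words[i + 1]);
            followers.put(words[i + 1], current == null ? 1 : current + 1);
        }
        return model;
    }

    // Returns the most frequent follower of the word, or null if it was never
    // followed by anything. Ties go to the alphabetically first follower.
    static String recommend(Map<String, Map<String, Integer>> model, String word) {
        Map<String, Integer> followers = model.get(word.toLowerCase());
        if (followers == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : followers.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> model = buildModel("the cat sat on the mat the cat ran");
        System.out.println(recommend(model, "the"));
    }
}
```

In the sample text, "the" is followed by "cat" twice and by "mat" once, so "cat" is recommended after "the".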

Simplewordcount is just another wordcount. It can be ignored.

Now you can make your own modifications to the code, export the jar again, upload it to Hue, copy it to the local file system, and then run the MapReduce job using hadoop jar followed by the jar file and the class name.