MapReduce Basics

6 / 9

MapReduce - Understanding The Paradigm

Not able to play video? Try with youtube

This is where MapReduce comes into play. So, as we go forward, we will see that MapReduce is nothing but a highly customizable version of our External Sort where data doesn't need to be transferred to the machines because it is already distributed on data-nodes of HDFS.

Again, Hadoop MapReduce is a framework to help us solve Big Data problems. This is specifically great for the tasks which are sorting or disk read-intensive.

Ideally, you would write two functions or pieces of logic called mapper and reducer. Mapper converts every input record into key-value pairs. Reducer aggregates values for each key as given out by Maps() phase.

Map() is executed as many times as records are there. And reduce() is executed for each unique key in the output map(). You could have a MapReduce job without reduce phase.

Let us take a look at an example where we have many profile pics of the users and we would like to resize these images to 100 by 100 pixels size. To solve such a problem, first, we will copy the pics to HDFS - Hadoop distributed file system and then, we write a function which takes an image as input and resizes it. This function is what we are going to use as a mapper.

Since there is no reducer, in this case, the result of the mapper is going to be saved in the HDFS. This mapper is going to be executed for all the images on the machines.

So, in our diagram, we have 9 images distributed on three machines, our mapper function will be executed 9 times on these three machines.

Now, let's take try to understand the scenario with both mapper and reducer.

The HDFS block in first converted into something called InputSplit which is a logical representation of the raw HDFS block. An InputSplit is a collection of records. It is created by InputFormat. By default, the input format is TextInputFormat which creates InputSplit out of the plain text file such that each line is a record.

On the machines having the InputSplits, your mappers or map functions are executed. The logic of map() is written by you as per your requirements. This map function converts input records into key-value pairs. These key-values are sorted on each node locally and then transported to a machine which is designated as a reducer. The reducer machine merges the result from all the nodes and groups the data.

On each group, your reducer function gets executed. The logic of reducer function gets a key and a list of values. The reducer function generally aggregates all of the values and then final value is saved to HDFS.

Since the result from all map function is grouped by framework based on the key, you always chose the key on which you want your data to be grouped. So, you always try to breakdown your problem into groupby kind of logic.

Loading comments...