[Correction: "re" should also be twice like "sa"] Let's look at an example of the word count problem. Say you have a text file with two lines, the first line being "sa re" and the second line "sa ga". This plain text will be converted into an InputSplit with two records. The first record is the line "sa re" and the second record is the line "sa ga".
Here we have written a very simple mapper() function which gives out each word as the key and the number 1 as the value. It thus converts the input line "sa re" into "sa 1" and "re 1", and the input line "sa ga" into "sa 1" and "ga 1". Your mapper function has been executed twice, once per record.
The results of the mapper are sorted by the Hadoop MapReduce framework based on the key and then grouped. The dashed line in the diagram represents the work done by the Hadoop framework. So, we have three unique keys, ga, re, and sa, each with its values grouped.
For each of these groups, the reduce function is executed. The reduce function gets the key and the list of values as arguments. Here, the reduce function simply sums up the values. So, the outcome of the reduce function is ga 1, re 1, and sa 2.
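The word count flow described above can be sketched in plain Python. This is only a minimal in-memory simulation, not Hadoop code: the names mapper, reducer, and run_mapreduce are illustrative assumptions, and the sort-and-group step stands in for the work the framework does between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in line.split()]


def reducer(key, values):
    """Sum the grouped values for one key."""
    return (key, sum(values))


def run_mapreduce(records, mapper, reducer):
    """Hypothetical helper that imitates the framework on a single machine."""
    # Map phase: the mapper is called once per record.
    mapped = [pair for record in records for pair in mapper(record)]
    # Framework's work: sort by key, then group the values of equal keys.
    mapped.sort(key=itemgetter(0))
    grouped = [
        (key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]
    # Reduce phase: the reducer is called once per group.
    return [reducer(key, values) for key, values in grouped]


records = ["sa re", "sa ga"]  # the two records from the example
result = run_mapreduce(records, mapper, reducer)
print(result)  # [('ga', 1), ('re', 1), ('sa', 2)]
```

Note that only mapper and reducer hold problem-specific logic; the sorting and grouping in between is generic, which is why the framework can take care of it.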
Let's take a more practical example where we have to find the maximum temperature of each city based on a temperature log of various cities on various dates. This temperature log is a comma-separated values (CSV) file containing temperature, city, and date.
To find the maximum temperature per city, we will have to group the data by city and then find the maximum temperature in each group. So, our mapper would give out the city as the key and the temperature as the value. It will not give out the date, since the date is not needed for the result. These key-value pairs are then ordered and grouped by the Hadoop MapReduce framework. In our example, we got four groups, one for each of the cities: BLR, Chicago, NYC, and Seattle.
Now, in the reduce function we would give out the maximum of the values for each key. This reduce function will be called once for each of the groups, and hence we would get the maximum temperature for each city.
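The same idea can be sketched in Python for the temperature problem. The log lines below are made-up sample data (the actual values in the log are not given here), laid out as temperature, city, date as described above; the function names are again illustrative, and the sort-and-group step only imitates what the framework would do.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample log, one "temperature,city,date" record per line.
LOG_LINES = [
    "20,NYC,2014-01-01",
    "25,BLR,2014-01-01",
    "15,Seattle,2014-01-01",
    "10,Chicago,2014-01-01",
    "22,NYC,2014-01-02",
    "28,BLR,2014-01-02",
    "12,Seattle,2014-01-02",
    "8,Chicago,2014-01-02",
]


def mapper(line):
    """Emit (city, temperature); the date is parsed but deliberately dropped."""
    temperature, city, _date = line.split(",")
    return [(city, int(temperature))]


def reducer(city, temperatures):
    """Give out the maximum temperature seen for one city."""
    return (city, max(temperatures))


# Map phase: one mapper call per log line.
mapped = [pair for line in LOG_LINES for pair in mapper(line)]

# Stand-in for the framework: order by key, then group values per key.
mapped.sort(key=itemgetter(0))
max_by_city = [
    reducer(city, [temp for _, temp in group])
    for city, group in groupby(mapped, key=itemgetter(0))
]

print(max_by_city)
# [('BLR', 28), ('Chicago', 10), ('NYC', 22), ('Seattle', 15)]
```

Compared with word count, only the key and value chosen by the mapper and the operation applied by the reducer have changed.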
If you had to solve the above problem of finding the maximum temperature using SQL, you would simply group the data by city and compute the maximum for each group. The query would look like: "select city, max(temp) from table group by city".
The map part corresponds to selecting the column used in the group-by (here, city), and the reduce part is analogous to the aggregation in SQL (here, max).
Similarly, for word count, the map part corresponds to selecting the group-by column (the word), and the reduce part is equivalent to the count aggregation in SQL.
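To make the SQL analogy concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table names temperature_log and words, the column names, and the temperature rows are all assumptions made for illustration; only the two word count lines come from the example above. An ORDER BY is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical temperature log rows: (temp, city, date).
conn.execute("CREATE TABLE temperature_log (temp INTEGER, city TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO temperature_log VALUES (?, ?, ?)",
    [
        (20, "NYC", "2014-01-01"),
        (25, "BLR", "2014-01-01"),
        (15, "Seattle", "2014-01-01"),
        (10, "Chicago", "2014-01-01"),
        (22, "NYC", "2014-01-02"),
        (28, "BLR", "2014-01-02"),
    ],
)

# One row per word occurrence from the lines "sa re" and "sa ga".
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [(word,) for line in ["sa re", "sa ga"] for word in line.split()],
)

# GROUP BY plays the role of the map key; MAX and COUNT play the role of reduce.
max_temps = conn.execute(
    "SELECT city, MAX(temp) FROM temperature_log GROUP BY city ORDER BY city"
).fetchall()
word_counts = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
).fetchall()
conn.close()

print(max_temps)    # [('BLR', 28), ('Chicago', 10), ('NYC', 22), ('Seattle', 15)]
print(word_counts)  # [('ga', 1), ('re', 1), ('sa', 2)]
```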