5 / 9

# MapReduce - Thinking in MR - External Sort

Not able to play video? Try with youtube

We discussed two more approaches - one was using SQL, other was using Unix commands. What problems do you see in these last two approaches?

The moment the data starts going beyond RAM the time taken starts increasing. CPU, Disk Speed, Disk Space starts becoming the bottlenecks.

So, what do we do in order to alleviate these bottlenecks caused by the big data?

We may try to use external sort. In external sort, we utilize many computers to do the word count and then we merge this result together. Let us say we have three machines: Launcher, Machine 1, and Machine 2. Launcher breaks the file into two parts and sends first part to machine 1 and second part to machine 2.

Both machines (1 and 2) count the unique words in their part of the text and send the result back to the launcher. The launcher merges the results. Please note that the results that launcher gets from both of these machines is ordered so that the merging gets easy because merging the ordered data is really fast - the time it takes to merge ordered data is linearly dependent on total number of elements to be merged.

Let us see how merging of sorted data happens: Say Machine 1 returned a:1 - where a is the key and 1 is the count, b:3 f:1 z:10 And machine 2 returned a:10, d:3, f:23, y:9

To merge both of the results we will simply compare first words from both. If both words are equal, then we sum up the counts, otherwise, we pick the smaller.

So, in one machine output it is a:1 and in another it is a:10, the word in both is same therefore we remove a from both and add 10 and 1, the result of merge is a:11.

Now, we remove a from both sides. Next, we compare the heads of both of which are b and d. So, we pick b:3 and put it in output. Next, we compare f and d and remove d, as d is smaller than f lexically or English dictionary wise. Afterwards, we merge f from both sides. Then we pick y and in the last we pick z.

So, you can see that we got the result very quickly. This approach of merging is used to merge data which is sorted already. In case there are more than two lists to merge, we generally use a min-heap tree to pick the smallest of the heads from all the lists. Alternatively, we can also merge two lists at a time and do the merging again with the result.

This is how our launcher also works. In our diagram, the launcher has merged re:2, sa:1 with ga:2, re:1 to produce ga:2, re:3, sa:1.

With external sorting, we have been able to divide the work amongst many machines. What do think is the problem with Approach 4 which is External Sorting?

Here are some of the problems:

1. A lot of time is consumed in transport of data from launcher to machines and back.

2. For each requirement we would need to build a special-purpose network-oriented program to perform all of these operations. In our case, we were trying to count unique words. If we need to do some other operation such as counting verbs or nouns in the text, we will have code the entire mechanics again.

3. The external sort requires a lot of Engineering. We have to handle the cases where network fails or either of the machines fail.