MapReduce Basics

MapReduce - Thinking in MR - Unix Pipeline

The third approach is to use Unix commands in a pipeline, that is, in a chain. Let us first try to understand what a pipeline means.

As we discussed earlier, when we run a program it may take input from you. In other words, you may provide input to a program by typing. A program or command may also print some output on the screen.

In Unix, you can provide the output of one program as input to another. This is known as piping. A pipe is denoted by the vertical bar symbol (|). command1 | command2 means that the output of command1 becomes the input to command2.

Let us take an example.

The Unix echo command prints whatever arguments are passed to it on the standard output. For example, echo "Hi" prints "Hi" to the screen.

The wc command prints the number of lines, words, and characters in whatever you type on standard input. Let me show you. Start the wc command, type some text, say "hi", a newline, "how are you" and another newline, and then press Ctrl+D to end the input.

It prints the number of lines, words, and characters, which are 2, 4, and 15 respectively. The character count includes the two newlines.
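The same session can be reproduced without typing. This is only a sketch: printf stands in for the keyboard input, and the variable name summary is made up for illustration.

```shell
# printf writes the two typed lines, each ending in a newline, into wc.
counts=$(printf 'hi\nhow are you\n' | wc)

# Unquoted on purpose: word splitting drops the padding wc puts between columns.
set -- $counts
summary="lines=$1 words=$2 chars=$3"
echo "$summary"   # lines=2 words=4 chars=15
```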

If we want to count the number of words or characters in the output of the echo command, we can use a command like: echo "Hello, World" | wc
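As a small sketch of that idea, wc can also be asked for a single count: -w for words and -c for characters. The variable names here are illustrative only.

```shell
# Pipe echo's output into wc and keep one number at a time.
# $(( ... )) strips the padding that some wc implementations add.
words=$(( $(echo "Hello, World" | wc -w) ))
chars=$(( $(echo "Hello, World" | wc -c) ))
echo "words=$words chars=$chars"   # words=2 chars=13
```

The character count is 13 rather than 12 because echo adds a trailing newline.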

Let us try to understand the pipeline of commands for word counting, cat myfile | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c, in parts.

The first command, cat myfile, prints the contents of the file "myfile".

The second command in the chain is sed. sed stands for stream editor. It is used to replace text in the input with something else. It is very similar to the search-and-replace feature of text editors. You can use extended regular expressions with sed by passing it the -E option. sed -E 's/[\t ]+/\n/g' replaces each run of spaces and tabs with a newline. Essentially, it converts the text into one word per line. So, when you chain cat and sed, it prints one word per line from the file.
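To see the first two stages in action, here is a rough sketch on a throwaway file. The file contents are made up, and it assumes GNU sed, which understands \t and \n in these positions; other sed implementations may behave differently.

```shell
# Create a small temporary file to play the role of "myfile".
myfile=$(mktemp)
printf 'the quick  brown\tfox\nthe end\n' > "$myfile"

# Each run of spaces or tabs becomes a single newline: one word per line.
words=$(cat "$myfile" | sed -E 's/[\t ]+/\n/g')
echo "$words"

rm -f "$myfile"
```

The double space and the tab in the sample text each produce only one line break, because the + in the pattern matches the whole run of whitespace at once.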

This one-word-per-line text can be sent on to a command called sort, which orders the lines of its input. The sort command takes various options. The -S option limits the size of the memory buffer it uses. In our case, we are using the -S 1g option to sort the data using only 1 gigabyte of memory.
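A minimal sketch of the sort stage follows. It assumes GNU sort, where -S sets the buffer size. LC_ALL=C is an addition not in the original pipeline, used here only so that the ordering is the same on every machine.

```shell
# One word per line, in arbitrary order.
unsorted=$(printf 'fox\nthe\nend\nthe\n')

# Sort the lines while capping sort's memory buffer at about 1 GB.
sorted=$(printf '%s\n' "$unsorted" | LC_ALL=C sort -S 1g)
echo "$sorted"
```

After sorting, the two "the" lines sit next to each other, which is what the next command in the chain relies on.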

The last command is uniq. The uniq command collapses repeated adjacent lines in its input into a single line, so it expects the data to be sorted already. If the input to uniq is not sorted, duplicates that are not next to each other are not merged, and the result is not correct. The uniq command has a -c option which prefixes each line with the number of times it occurred. So uniq -c prints the count of each unique word in the sorted input.
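The effect of skipping the sort can be checked with a short experiment. This is only an illustration with made-up input; it counts how many rows uniq -c produces in each case.

```shell
# Unsorted: the two "the" lines are not adjacent, so uniq keeps both.
unsorted_rows=$(( $(printf 'the\nfox\nthe\n' | uniq -c | wc -l) ))

# Sorted first: the duplicates become adjacent and are merged into one row.
sorted_rows=$(( $(printf 'the\nfox\nthe\n' | LC_ALL=C sort | uniq -c | wc -l) ))

echo "without sort: $unsorted_rows rows, with sort: $sorted_rows rows"
# without sort: 3 rows, with sort: 2 rows
```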

So, the entire pipeline, consisting of cat, sed, and sort followed by uniq, prints the count of each unique word in the text file.
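Putting the four stages together might look like the sketch below. The wordcount function name and the sample text are hypothetical, GNU sed and GNU sort are assumed, and LC_ALL=C is added only to make the ordering repeatable.

```shell
# wordcount FILE: hypothetical wrapper around the pipeline described above.
wordcount() {
  cat "$1" | sed -E 's/[\t ]+/\n/g' | LC_ALL=C sort -S 1g | uniq -c
}

# Try it on a small temporary file.
myfile=$(mktemp)
printf 'to be or\nnot to be\n' > "$myfile"
result=$(wordcount "$myfile")
echo "$result"
rm -f "$myfile"
```

Each output row is a count followed by a word. In this sample, "to" and "be" appear twice, while "or" and "not" appear once. Note that this simple version treats punctuation and letter case as part of the word.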
