The third approach is to chain Unix commands together in a pipeline. Let us first understand what a pipeline is.
As we discussed earlier, a program may take input from you when it runs. In other words, you may provide input to a program by typing. A program or command may also print some output on the screen.
In Unix, you can provide the output of one program as the input to another. This is known as piping. A pipe is denoted by the vertical bar symbol (|). command1 | command2 means that the output of command1 becomes the input to command2.
Let us take an example.
The echo command prints whatever arguments are passed to it on the standard output.
For example, echo "Hi" prints "Hi" to the screen.
The wc command prints the number of lines, words, and characters in whatever it reads from standard input. To see this, start the wc command, type some text, say "hi", a newline, and "how are you", press Enter, and then press Ctrl+D to end the input.
It prints the number of lines, words, and characters, which are 2, 4, and 15 respectively.
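The interactive session above can also be reproduced without typing, by piping the same two lines into wc. This is a minimal sketch, assuming a Bourne-style shell such as bash; the variable name counts is only illustrative and does not come from the lesson.

```shell
# Send the lines "hi" and "how are you" to wc on its standard input.
# printf writes exactly these bytes, including both newline characters.
counts=$(printf 'hi\nhow are you\n' | wc)

# wc pads its columns differently on different systems; leaving $counts
# unquoted lets the shell collapse the padding to single spaces.
echo $counts   # prints: 2 4 15  (lines, words, characters)
```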
If we want to count the number of words or characters in the output of the echo command, we can pipe it into wc: echo "Hello, World" | wc
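To get just one of the counts, wc can be restricted with an option. The sketch below uses the standard -w (words) and -c (bytes) options, which the lesson itself does not mention; the variable names are illustrative.

```shell
# echo appends a newline, so wc sees "Hello, World" plus one newline.
words=$(echo "Hello, World" | wc -w)   # -w: count words only
bytes=$(echo "Hello, World" | wc -c)   # -c: count bytes only

# Unquoted expansion strips any padding wc adds around the numbers.
echo $words $bytes   # prints: 2 13
```

The byte count is 13 rather than 12 because the newline added by echo is counted as well.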
Let us now go through the word-counting pipeline part by part. The pipeline is: cat myfile | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c
The first command, cat myfile, prints the contents of the file "myfile".
The second command in the chain is sed. sed stands for stream editor. It is used to replace text in the input with something else, much like the search-and-replace feature of text editors. The patterns sed searches for are regular expressions, and the -E option enables the extended regular expression syntax.
sed -E 's/[\t ]+/\n/g' replaces every run of spaces and tabs with a single newline. Essentially, it converts the text into one word per line. So, when you chain cat and sed, the contents of the file are printed one word per line.
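A quick way to see what this step does is to feed sed a single line instead of a whole file. This is a minimal sketch, assuming GNU sed (other sed implementations may not understand the \t and \n escapes used in this expression); the sample sentence and the variable name are made up for illustration.

```shell
# Assumption: GNU sed, which interprets \t and \n as tab and newline.
# The input has a double space and a tab between some of the words.
one_per_line=$(printf 'to be  or\tnot to be\n' | sed -E 's/[\t ]+/\n/g')

# Each run of spaces or tabs has become exactly one newline,
# so every word now sits on its own line.
echo "$one_per_line"
```

Because the pattern ends with +, the double space produces a single line break rather than an empty line.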
This one-word-per-line text is then sent to a command called sort, which orders the lines of its input. The sort command takes various options. The -S option limits the size of the memory buffer it uses. In our case, -S 1g makes sort use at most about 1 gigabyte of memory while sorting.
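The effect of the sorting step can be checked on a handful of words. This is a minimal sketch, assuming GNU sort (the -S option and its size suffixes are not available in every sort implementation); the input words and the variable name are illustrative.

```shell
# Assumption: GNU sort, where -S sets the main-memory buffer size.
# Six words, one per line, in their original order.
sorted=$(printf 'to\nbe\nor\nnot\nto\nbe\n' | sort -S 1g)

# After sorting, identical words end up on adjacent lines:
# be, be, not, or, to, to
echo "$sorted"
```

Bringing identical words next to each other is what makes the counting step that follows possible.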
The last command is uniq. The uniq command collapses repeated adjacent lines into one, so on sorted input it leaves only the unique lines. It expects the data to be ordered already. If the input to uniq is not sorted, duplicates that are not next to each other are not merged and the result is incorrect. The uniq command has a -c option, which prefixes each line with the number of times it occurred. So uniq -c prints the count of each unique word in the sorted input.
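The need for sorted input is easy to demonstrate by running uniq -c on the same words with and without sort. This is a minimal sketch with made-up input; the variable names are illustrative.

```shell
# Unsorted input: the repeated words are never adjacent, so uniq -c
# reports four separate lines, each with a count of 1.
unsorted_counts=$(printf 'to\nbe\nto\nbe\n' | uniq -c)

# Sorted input: duplicates are adjacent, so each word appears once
# with a count of 2.
sorted_counts=$(printf 'to\nbe\nto\nbe\n' | sort | uniq -c)

echo "$unsorted_counts"
echo "---"
echo "$sorted_counts"
```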
So, the entire pipeline consisting of cat, sed, sort, and uniq prints the count of each unique word in the text file.
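Putting the four steps together, the whole pipeline can be tried on a small throwaway file. This is a minimal sketch, assuming GNU sed and GNU sort as in the commands above; the temporary file stands in for "myfile", and its contents and the variable names are made up for illustration.

```shell
# Create a small temporary input file to play the role of "myfile".
myfile=$(mktemp)
printf 'to be or not\nto be\n' > "$myfile"

# cat prints the file, sed puts one word on each line,
# sort brings identical words together, uniq -c counts each group.
# Assumption: GNU sed and GNU sort (for \t, \n and -S).
word_counts=$(cat "$myfile" | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c)

# One line per distinct word, each prefixed with its count.
echo "$word_counts"

# Remove the temporary file.
rm -f "$myfile"
```

For this input, "be" and "to" are each reported twice, and "not" and "or" once.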