Linux Basics for Big Data

Word Count Exercise

Step 1:

Check the data using the cat command. Since the file is big, pipe the output to "more" to view it one page at a time:

    cat /cxldata/big.txt | more

Step 2:

Replace each space with a newline so that every line of the output contains a single word:

    cat /cxldata/big.txt | sed 's/ /\n/g' | more

For example, replacing each space with a newline in "I am ok" produces three lines: "I", "am" and "ok".

The trailing "g" (written "/g") is a sed flag that makes it replace all occurrences of a space on each line instead of only the first one.

Also, note that this command connects three programs with two pipes: the output of cat goes to sed, and the output of sed goes to more, which displays it one page at a time.
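To see the effect of the "g" flag on a small input, the following sketch runs sed on the example sentence instead of the big file. It assumes GNU sed (the default on Linux), which treats "\n" in the replacement as a newline; the variable names are only for illustration.

```shell
# Assumption: GNU sed, which interprets \n in the replacement as a newline.

# With the g flag, every space on the line is replaced.
all_spaces=$(echo "I am ok" | sed 's/ /\n/g')

# Without the g flag, only the first space on the line is replaced.
first_only=$(echo "I am ok" | sed 's/ /\n/')

echo "$all_spaces"
echo "---"
echo "$first_only"
```

The first result has one word per line, while the second still contains "am ok" on a single line.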

Step 3:

We can sort the words using the sort command in the following way:

    cat /cxldata/big.txt | sed 's/ /\n/g' | sort | more

Note that we use the "more" command here only to avoid screen-blindness (too much text scrolling past).
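As a small illustration of what sorting does to the word list, the sketch below uses a made-up sentence in place of /cxldata/big.txt. The sample text and variable names are assumptions for the example, and GNU sed is assumed as before.

```shell
# Hypothetical sample sentence standing in for the contents of /cxldata/big.txt.
sample="the cat the dog"

# Split into one word per line, then sort the lines alphabetically.
sorted=$(echo "$sample" | sed 's/ /\n/g' | sort)

echo "$sorted"
```

After sorting, the two copies of "the" end up on neighbouring lines, so identical words are grouped together.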

Step 4:

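We can now count the words using the uniq command. Its "-c" option prefixes each distinct line with the number of times it occurs. Because uniq only merges adjacent duplicate lines, the input must be sorted first: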

    cat /cxldata/big.txt | sed 's/ /\n/g' | sort | uniq -c | more
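The sort step matters because uniq compares each line only with the line just before it. The following sketch contrasts the two cases on a made-up sentence; the sample text and variable names are assumptions for illustration, and GNU sed is assumed.

```shell
# Hypothetical sample sentence; the real exercise reads /cxldata/big.txt.
sample="the cat the dog the"

# Without sort, the copies of "the" are not adjacent, so uniq counts each one separately.
unsorted=$(echo "$sample" | sed 's/ /\n/g' | uniq -c)

# With sort, duplicates become adjacent and uniq -c prints one line per distinct word.
counted=$(echo "$sample" | sed 's/ /\n/g' | sort | uniq -c)

echo "$unsorted"
echo "---"
echo "$counted"
```

The unsorted run prints five lines, each with a count of 1, while the sorted run prints three lines and reports "the" with a count of 3.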

Please save the result of the command to a file named "word_count_results" in your home directory:

    cat /cxldata/big.txt | sed 's/ /\n/g' | sort | uniq -c > ~/word_count_results
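Once the counts are saved, one way to inspect them is to sort the file numerically so the most frequent words come first. The sketch below is a self-contained illustration: it writes a made-up sample file into a temporary directory rather than touching /cxldata/big.txt or your home directory, and all paths and variable names in it are assumptions.

```shell
# Hypothetical sample input in a temporary directory, standing in for /cxldata/big.txt.
workdir=$(mktemp -d)
printf 'the cat saw the dog\nthe dog ran\n' > "$workdir/sample.txt"

# Same pipeline as in the exercise, with the output redirected to a file.
cat "$workdir/sample.txt" | sed 's/ /\n/g' | sort | uniq -c > "$workdir/word_count_results"

# -n compares the leading counts as numbers, -r puts the largest first.
top=$(sort -nr "$workdir/word_count_results" | head -n 1)

echo "$top"
```

Here the top line reports "the" with a count of 3. On the real data, running the same sort and head on the saved word_count_results file would list the most common words in big.txt.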