Linux Basics for Big Data

Improved Word Count Using Unix Commands

We can further improve the word frequency count by using more filters.

Improvement 1:

Translate the text to lowercase, so that "The" and "the" are counted as the same word:

    tr 'A-Z' 'a-z'
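
As a quick check, the filter can be tried on a single line before running it on a large file. This is only a sketch; the sample sentence and the variable name `lowered` are made up for illustration.

```shell
# Lowercase a made-up sample line so that "The" and "the" become the same word.
# tr reads standard input and maps each character in 'A-Z' to its match in 'a-z'.
lowered=$(printf 'The Quick BROWN fox\n' | tr 'A-Z' 'a-z')
echo "$lowered"
```

Characters outside the range A-Z, such as spaces and punctuation, pass through unchanged.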

Improvement 2:

Remove non-alphanumeric characters using sed with a regular expression:

    sed 's/[^0-9a-z]//g'
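
A small illustration of what this substitution does to one token. The sample token and the variable name `cleaned` are assumptions for the example, and the input is assumed to be lowercased already.

```shell
# Delete every character that is not a digit or a lowercase letter.
# [^0-9a-z] matches any single character outside those ranges; g repeats it along the line.
cleaned=$(printf "don't-stop!\n" | sed 's/[^0-9a-z]//g')
echo "$cleaned"
```

A space is also a non-alphanumeric character, so applying this filter to a whole line of text would join its words together. It is safest to use it once each word is on its own line.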

Improvement 3:

Replace each run of whitespace (one or more tabs or spaces) with a newline, so that every word ends up on its own line:

    sed -E 's/[ \t]+/\n/g'

Please note that "+" (one or more of the preceding item) is extended regular expression syntax, so we need to specify "-E".
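
The split can be tried on a line that mixes double spaces and tabs. This sketch assumes GNU sed, which understands `\t` in the pattern and `\n` in the replacement; the sample text and the variable name `words` are made up.

```shell
# Turn each run of spaces and tabs into a single newline, one word per line.
# Assumes GNU sed; other sed implementations may not treat \t and \n this way.
words=$(printf 'one  two\t\tthree\n' | sed -E 's/[ \t]+/\n/g')
echo "$words"
```

Because the pattern matches a whole run of whitespace at once, two spaces in a row produce one newline rather than an empty line between words.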

Improvement 4:

Display the most frequent words at the top by sorting the counts in reverse numeric order:

    sort -nr
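
To see the effect of the two options, here is a sketch on three made-up "count word" lines of the kind that `uniq -c` produces. The counts, words, and the variable name `ranked` are invented for illustration.

```shell
# Sort "count word" lines so that the largest count comes first.
# -n compares the leading numbers by value, -r reverses the order.
ranked=$(printf '3 apple\n12 the\n7 of\n' | sort -nr)
echo "$ranked"
```

Without `-n`, sort compares the lines as text, character by character, so "12" would be ordered before "3".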

Improvement 5:

If the input file is big, the sort command might use too much memory. You can limit the size of the memory buffer that sort uses, for example to 50 MB:

    sort -S 50M
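
A minimal sketch showing that the option changes only how much memory sort may use, not the result. It assumes GNU sort, where `-S` is the short form of `--buffer-size`; the three input lines and the variable name `limited` are made up.

```shell
# Cap sort's main-memory buffer at 50 MB (GNU sort option -S, same as --buffer-size).
# The tiny made-up input only demonstrates that the output order is unaffected.
limited=$(printf 'b\na\nc\n' | sort -S 50M)
echo "$limited"
```

When the input does not fit in the buffer, GNU sort writes sorted chunks to temporary files and merges them, which trades some speed for bounded memory use.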

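After applying all of these improvements, save the results to a file named word_count_results_nice. Note that the whitespace split (Improvement 3) runs before the character filter (Improvement 2), because that filter would otherwise delete the spaces that separate the words: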

    cat /cxldata/big.txt | tr 'A-Z' 'a-z' | sed -E 's/[ \t]+/\n/g' | sed 's/[^0-9a-z]//g' | sort | uniq -c | sort -nr -S 50M > word_count_results_nice
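
Before running the pipeline on the full file, it may help to try the same chain of filters on a tiny input whose counts are easy to verify by hand. The sketch below assumes GNU sed and GNU sort; the two sample sentences, the temporary file, and the variable names `sample` and `counts` are made up for illustration and are not part of the lab.

```shell
# Run the same chain of filters on a small made-up sample instead of /cxldata/big.txt.
sample=$(mktemp)
printf 'The cat and the dog.\nThe  dog saw\tthe CAT!\n' > "$sample"

# lowercase -> one word per line -> strip punctuation -> group -> count -> rank
counts=$(tr 'A-Z' 'a-z' < "$sample" \
  | sed -E 's/[ \t]+/\n/g' \
  | sed 's/[^0-9a-z]//g' \
  | sort | uniq -c | sort -nr -S 50M)

echo "$counts"
rm -f "$sample"
```

In this sample "the" occurs four times, so it appears on the first line of the output. The first `sort` is needed because `uniq -c` only counts identical lines that are adjacent to each other.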
