We can further improve the word frequency count by using more filters.
Convert the text to lower case using tr:
tr 'A-Z' 'a-z'
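As a quick check, the tr filter can be tried on a single line first. This is a minimal sketch; the sample sentence and the variable name lowered are made up for illustration and are not part of the exercise data.

```shell
# Hypothetical sample input, not taken from /cxldata/big.txt.
# tr maps every character in the first set to the character at the
# same position in the second set, so A-Z becomes a-z.
lowered=$(printf 'The Quick BROWN fox\n' | tr 'A-Z' 'a-z')
echo "$lowered"
```

Lower-casing first means that "The", "the" and "THE" are later counted as the same word.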
Remove non-alphanumeric characters using sed with a regular expression:
sed 's/[^0-9a-z]//g'
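The effect of stripping everything except digits and lower-case letters can be seen on a few sample tokens. This is only a sketch: the tokens and the variable name cleaned are hypothetical, and the expression is the one used in the final pipeline of this exercise.

```shell
# Hypothetical tokens, one per line, as they would look after lower-casing.
# [^0-9a-z] matches any character that is NOT a digit or a lower-case
# letter; replacing it with nothing deletes it.
cleaned=$(printf "don't\n--\nhello,\n" | sed 's/[^0-9a-z]//g')
echo "$cleaned"
```

Two side effects are worth knowing: an apostrophe inside a word is deleted as well (so "don't" becomes "dont"), and a token made only of punctuation becomes an empty line.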
Replace each run of whitespace (one or more spaces or tabs) with a newline, so that every word ends up on its own line:
sed -E 's/[ \t]+/\n/g'
Please note that the "+" quantifier belongs to extended regular expression syntax, so we need to pass "-E" to sed.
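The whitespace-splitting step can be tried on one line to see how it turns a sentence into one word per line. This sketch assumes GNU sed, which interprets \t and \n inside the expression; other sed implementations may treat them differently. The sample line and the variable name words are hypothetical.

```shell
# Hypothetical sample line containing double spaces and tabs.
# [ \t]+ matches a run of one or more spaces or tabs (needs -E for "+"),
# and each run is replaced by a single newline.
words=$(printf 'one  two\t\tthree four\n' | sed -E 's/[ \t]+/\n/g')
echo "$words"
```

Because a whole run of whitespace is replaced at once, consecutive spaces or tabs between words do not produce empty lines.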
Display the most frequent words at the top by sorting the counts in reverse numeric order:
sort -nr
If the input file is big, the sort command might use too much memory. You can limit the size of the in-memory buffer that sort uses, say to 50 MB:
sort -S 50M
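The buffer-size option does not change the result of the sort, only how much main memory is used before sort spills to temporary files. A minimal sketch, assuming GNU sort; the sample words and the variable name sorted are hypothetical.

```shell
# Hypothetical sample input.
# -S 50M caps the in-memory sort buffer at roughly 50 MB (GNU sort).
# For inputs larger than the buffer, sort writes temporary files;
# their directory can be chosen with -T if the default is too small.
sorted=$(printf 'pear\napple\nfig\n' | sort -S 50M)
echo "$sorted"
```

On an input this small the option makes no visible difference; it only matters once the data no longer fits in the buffer.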
After all of these improvements, please save the results to a file named word_count_results_nice:
cat /cxldata/big.txt | tr 'A-Z' 'a-z' | sed -E 's/[ \t]+/\n/g' | sed 's/[^0-9a-z]//g' | sort | uniq -c | sort -nr -S 50M > word_count_results_nice
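To see what the whole pipeline produces, it can be run on a few lines of text before running it on the big file. The sketch below is a variation, not the exercise's exact command: the sample text and the variable names sample and counts are hypothetical, the grep -v '^$' step is an added assumption that drops empty tokens left behind by punctuation-only words, and the memory limit is placed on the first sort because that is the one that sees every word. GNU sed and GNU sort are assumed.

```shell
# Hypothetical sample text standing in for /cxldata/big.txt.
sample='The cat saw the dog.
The dog - well, the DOG - ran.'

# Steps: lower-case, one word per line, strip punctuation,
# drop empty lines (added step), sort, count duplicates,
# then order by count with the largest first.
counts=$(printf '%s\n' "$sample" \
  | tr 'A-Z' 'a-z' \
  | sed -E 's/[ \t]+/\n/g' \
  | sed 's/[^0-9a-z]//g' \
  | grep -v '^$' \
  | sort -S 50M \
  | uniq -c \
  | sort -nr)
echo "$counts"
```

In this sample "the" is the most frequent word, so it appears on the first line of the output with its count in front of it.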