Sunday, October 14, 2012

Bash - Count words in text and make dictionary

In this tutorial I'll explain how to count the words in a text and how to build a list of the words it uses.
First we need a text file. I used War and Peace by Leo Tolstoy from Project Gutenberg:
wget http://gutenberg.org/files/2600/2600.txt

After downloading the file, we convert uppercase letters to lowercase:
tr A-Z a-z
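then convert every run of characters that are not lowercase letters into a single newline (-c complements the set a-z, -s squeezes repeated newlines into one), which leaves one word per line: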
tr -cs a-z '\n'
then sort the result, so that identical words end up on adjacent lines:
sort
and finally collapse adjacent duplicate lines into unique words and count them:
uniq -c
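To see what the four steps do together, here is a small sketch that runs them on a one-line sample instead of the whole book. The sample sentence and the variable names sample and counts are made up for illustration only.

```shell
# Hypothetical sample text, used only for illustration.
sample='The cat and the dog. The END!'

# lowercase -> one word per line -> sort -> count adjacent duplicates
counts=$(printf '%s\n' "$sample" | tr A-Z a-z | tr -cs a-z '\n' | sort | uniq -c)

# Show the result: one line per distinct word, prefixed by its count.
printf '%s\n' "$counts"
```

This should print five lines, one per distinct word, with "the" counted three times because The, the and The all become the same lowercase word.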
To make this usable, we chain these commands into a pipeline. To do that, we'll create a file called wordcount:
nano wordcount
and type this:

cat "$@" | tr A-Z a-z | tr -cs a-z '\n' | sort | uniq -c
Save the file and exit the nano editor.
Make the wordcount file executable:
chmod +x wordcount
and count the words with:
./wordcount <textfile>
in this case:
./wordcount 2600.txt
and we'll get every word used in this great book together with the number of times it appears.
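The output is sorted alphabetically, so a natural follow-up is to list the most frequent words first. This is a minimal sketch, assuming the same pipeline as in wordcount. The function name top_words and the default of 10 are my own choices for illustration, not part of the script above.

```shell
# top_words N [FILE...]: print the N most frequent words (hypothetical helper).
# With no FILE arguments it reads standard input, just like the cat "$@" idiom.
top_words() {
    n=${1:-10}                 # how many words to show, 10 if not given
    [ "$#" -gt 0 ] && shift    # drop N so that "$@" holds only the file names
    cat "$@" | tr A-Z a-z | tr -cs a-z '\n' | sort | uniq -c |
        sort -rn |             # order by count, highest first
        head -n "$n"
}
```

After defining the function in the current shell, it could be used as: top_words 20 2600.txt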

If we only want the dictionary of words used in this book, without the counts, we create another file:
nano makedict
and type this:
cat "$@" | tr A-Z a-z | tr -cs a-z '\n' | sort | uniq
Note that the only difference between the wordcount and makedict scripts is the -c switch of the uniq command.
Of course, we must make this file executable too:
chmod +x makedict
and run it:
./makedict 2600.txt
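Since the dictionary has one word per line, the size of the vocabulary is simply the number of lines. This is a small sketch, assuming the same pipeline as in makedict. The function name vocab_size is hypothetical, and sort -u is used here as a shorthand for sort | uniq.

```shell
# vocab_size [FILE...]: print how many distinct words the input contains
# (hypothetical helper; reads standard input when no files are given).
vocab_size() {
    cat "$@" | tr A-Z a-z | tr -cs a-z '\n' |
        sort -u |   # sort and drop duplicate lines in one step
        wc -l       # one word per line, so the line count is the word count
}
```

After defining the function in the current shell, it could be used as: vocab_size 2600.txt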








