Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
605 views
in Technique[技术] by (71.8m points)

bash - How to create a frequency list of every word in a file?

I have a file like this:

This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

I would like to generate a two-column list. The first column shows what words appear, the second column shows how often they appear, for example:

this@1
is@1
a@1
file@1
with@1
many@1
words3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1 
  • To make this work simpler, prior to processing the list, I will remove all punctuation, and change all text to lowercase letters.
  • Unless there is a simple solution around it, words and word can count as two separate words.

So far, I have this:

sed -i "s/ /
/g" ./file1.txt # put all words on a new line
while read line
do
     count="$(grep -c $line file1.txt)"
     echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, this is only showing "0" after each word.

How can I generate a list of every word that appears in a file, along with frequency information?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '
' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately) and suppress an entry for a zero length word. For ASCII text you can do all these with this modified command:

sed -e  's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '
' | grep -v '^$'| sort | uniq -c | sort -rn

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...