performance - How to get the biggest number in a file?

I want to get the maximum number in a file, where numbers are integers that can occur in any place of the file.

I thought about doing the following:

grep -o '[-0-9]*' myfile | sort -rn | head -1

This uses grep to extract all the integers from the file, one per line. sort -rn then sorts them numerically in descending order, and head prints the first (largest) one.
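As a side note, the pattern [-0-9]* also matches hyphen-only runs like --- and swallows tokens such as 1-2-3 whole. A tighter pattern (my variant, not part of the original question) allows only a single leading minus:

grep -Eo '-?[0-9]+' myfile | sort -n | tail -1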

But then I thought that the reverse sort might add some overhead, so I also tried:

grep -o '[-0-9]*' myfile | sort -n | tail -1

To see which is fastest, I created a huge file from some sample data, like this:

$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla 
and 3624 is another number
but this is not enough for -23 234245
$ for i in {1..50000}; do cat a >> myfile ; done

So that the file contains 150K lines.
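Incidentally, that loop forks cat 50,000 times; the same file can be built in one shot with the awk trick used later in this post for the bigger test file:

awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfile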

Now I compare the performance on my GNU bash version 4.2, and the sys time is noticeably smaller for sort -rn | head -1:

$ time grep -o '[-0-9]*' myfile | sort -n | tail -1
42342234

real    0m1.823s
user    0m1.865s
sys 0m0.045s

$ cp myfile myfile2    #to prevent using cached info
$ time grep -o '[-0-9]*' myfile2 | sort -rn | head -1
42342234

real    0m1.864s
user    0m1.926s
sys 0m0.027s

So I have two questions here:

  • Which is better, sort -n | tail -1 or sort -rn | head -1?
  • Is there a faster way to get the maximum integer in a given file?


Testing the solutions

So I ran all the proposed commands and compared how long each takes to find the value. To make things more reliable, I created a bigger file, 10 times the size of the one mentioned in the question:

$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla 
and 3624 is another number
but this is not enough for -23 234245
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=500000;i++) print s}' > myfile
$ wc myfile 
1500000 13000000 62000000 myfile

The benchmark, from which I see that hek2mgl's solution is the fastest:

$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' myfile
42342234

real    0m3.979s
user    0m3.970s
sys 0m0.007s
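
For reference, the one-liner above uses RS='[[:space:]]+' (a GNU awk extension) to make every whitespace-separated token its own record, and 0+$0 to force a numeric comparison. An annotated expansion (my commentary, not from the thread):

awk '
  NR == 1 || max < 0+$0 { max = 0+$0 }   # 0+$0 coerces the token to a number; words become 0
  END { print max }
' RS='[[:space:]]+' myfile

One consequence of that coercion: a file containing only negative numbers mixed with words would report 0, since every word counts as 0.
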
$ time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' myfile 
42342234

real    0m2.203s
user    0m2.196s
sys 0m0.006s
$ time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
42342234

real    0m0.926s
user    0m0.848s
sys 0m0.077s
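
A note on this winner (my annotation, not from the thread): FPAT and the regex form of RS are GNU awk extensions, so this one-liner requires gawk. RS='$' splits records on literal dollar signs, which for a file containing none means the whole file is read as a single record; FPAT='-{0,1}[0-9]+' then makes every integer, with an optional leading minus, its own field:

awk '
  { for (i = 1; i <= NF; i++) m = (m < $i) ? $i : m }   # keep the running maximum
  END { print m }
' RS='$' FPAT='-{0,1}[0-9]+' myfile
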
$ time tr ' ' '\n' < myfile | sort -rn | head -1
42342234

real    0m11.089s
user    0m11.049s
sys 0m0.086s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F; END {print $m}' myfile
42342234

real    0m6.166s
user    0m6.146s
sys 0m0.011s


1 Answer


I'm surprised by awk's speed here. perl is usually pretty speedy, but:

$ for ((i=0; i<1000000; i++)); do echo $RANDOM; done > rand
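(Bash's $RANDOM yields 15-bit values, 0 through 32767, which is why every run below tops out at 32767.)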

$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' rand
32767

real    0m0.890s
user    0m0.887s
sys 0m0.003s

$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F; END {print $m}' rand
32767

real    0m1.110s
user    0m1.107s
sys 0m0.002s

I think I've found a winner: with perl, slurp the file as a single string, find all the (possibly negative) integers, and take the max:

$ time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' rand
32767

real    0m0.565s
user    0m0.539s
sys 0m0.025s

Takes a little more "sys" time, but less real time.
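
For the record, here is what each switch in that one-liner does (my annotation):

perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' myfile
# -MList::Util=max   import max() from List::Util
# -0777              slurp mode: read the whole file as one record
# -n                 wrap the code in a while(<>) loop (a single iteration here)
# -E                 run the code with modern features such as say enabled
# /-?\d+/g           in list context, returns every integer, optional minus sign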

It works with a file containing only negative numbers, too:

$ cat file
hello -42 world
$ perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
-42
