I want to get the maximum number in a file, where the numbers are integers that can occur anywhere in the file.
I thought about doing the following:
grep -o '[-0-9]*' myfile | sort -rn | head -1
This uses grep to get all the integers from the file, outputting one per line. Then, sort sorts them and head prints the very first one.
But then I thought that sort -r may cause some overhead, so I went for:
grep -o '[-0-9]*' myfile | sort -n | tail -1
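To make the two pipelines easier to compare side by side, here is a minimal sketch that wraps each one in a function. The names extract_ints, max_sort_tail and max_sort_head are hypothetical, and the stricter pattern -?[0-9]+ is an assumption rather than the pattern used above: it avoids matching a lone "-", which sort -n would otherwise treat as 0.

```shell
# Hypothetical helpers; the stricter pattern -?[0-9]+ is an assumption,
# not the pattern from the original commands.
extract_ints() {
    # -o prints every match on its own line; -E enables extended regexes;
    # -e keeps grep from reading the leading "-" of the pattern as an option.
    grep -oE -e '-?[0-9]+' "$1"
}

# Ascending numeric sort, then take the last line.
max_sort_tail() { extract_ints "$1" | sort -n  | tail -n 1; }

# Descending numeric sort, then take the first line.
max_sort_head() { extract_ints "$1" | sort -rn | head -n 1; }
```

Both functions take a file name as their only argument, for example max_sort_tail myfile, and should print the same value.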
To see which is fastest, I created a huge file with some random data, like this:
$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
$ for i in {1..50000}; do cat a >> myfile ; done
The resulting file contains 150K lines.
Now I compare the performance in my GNU bash version 4.2, and sys is way smaller for sort -rn:
$ time grep -o '[-0-9]*' myfile | sort -n | tail -1
42342234
real 0m1.823s
user 0m1.865s
sys 0m0.045s
$ cp myfile myfile2 #to prevent using cached info
$ time grep -o '[-0-9]*' myfile2 | sort -rn | head -1
42342234
real 0m1.864s
user 0m1.926s
sys 0m0.027s
So I have two questions here:
- What is best, sort -n | tail -1 or sort -rn | head -1?
- Is there a faster way to get the maximum integer in a given file?
Testing the solutions
So I ran all the commands and compared the time it takes them to find the value. To make things more reliable, I created a bigger file, 10 times bigger than the one I mentioned in the question:
$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=500000;i++) print s}' > myfile
$ wc myfile
1500000 13000000 62000000 myfile
Benchmark, from which I see hek2mgl's solution is the fastest:
$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' myfile
42342234
real 0m3.979s
user 0m3.970s
sys 0m0.007s
$ time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' myfile
42342234
real 0m2.203s
user 0m2.196s
sys 0m0.006s
$ time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
42342234
real 0m0.926s
user 0m0.848s
sys 0m0.077s
$ time tr ' ' '\n' < myfile | sort -rn | head -1
42342234
real 0m11.089s
user 0m11.049s
sys 0m0.086s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F} END {print $max' myfile
real 0m6.166s
user 0m6.146s
sys 0m0.011s
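The FPAT variable used in one of the commands above is specific to gawk. As a rough sketch of the same single-pass idea for awk implementations without FPAT, the loop below walks each line with match() and keeps a running maximum, so no sort is involved. The function name max_int_awk is hypothetical, and this variant was not part of the timings above, so no claim is made about its speed.

```shell
# Hypothetical helper: single-pass maximum using only POSIX awk features.
max_int_awk() {
    awk '
    {
        line = $0
        # Pull the next optionally signed integer out of the rest of the line.
        while (match(line, /-?[0-9]+/)) {
            n = substr(line, RSTART, RLENGTH) + 0   # "+ 0" forces a number
            if (!seen || n > max) { max = n; seen = 1 }
            # Continue scanning just after the integer that was found.
            line = substr(line, RSTART + RLENGTH)
        }
    }
    # Print nothing when the file contains no integers at all.
    END { if (seen) print max }
    ' "$1"
}
```

It is called with a file name, for example max_int_awk myfile. The seen flag is there so that a file containing only negative numbers still reports the right maximum instead of an implicit 0.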