Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
177 views
in Technique[技术] by (71.8m points)

linux - Fastest way to find lines of a file from another larger file in Bash

I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billions. file1.txt has a single field f1 per line while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line if we don't want to spend extra time splitting the values of file2.txt).

file1.txt (about 14K lines, not sorted):

foo1
foo2
...
bar1
bar2
...

file2.txt (about 2 billion lines, not sorted):

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Output expected:

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Here is what I have tried and it seems to be taking several hours to run:

fgrep -F -f file1.txt file2.txt > file.matched

I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

A Perl solution.   [See Note below.]

Use a hash for the first file. As you read the big file line-by-line, extract the field by regex (captures the first pattern between ||) or split (gets the second word) and print if it exists. They likely differ in speed a bit (time them). The defined check isn't needed in the regex while for split use // (defined-or) that short-circuits.

use warnings;
use strict;

# If 'prog smallfile bigfile' is the preferred use
die "Usage: $0 smallfile bigfile
"  if @ARGV != 2;
my ($smallfile, $bigfile) = @ARGV;

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";    
my %word = map { chomp; $_ => 1 } <$fh>;

open    $fh, '<', $bigfile or die "Can't open $bigfile: $!";       
while (<$fh>) 
{
    exists $word{ (/|([^|]+)/)[0] } && print;  

    # Or
    #exists $word{ (split /|/)[1] // '' } && print;
}
close $fh;

Avoiding the if branch and using short-circuit is faster, but only very little. On billions of lines these tweaks add up but again not to too much. It may (or may not) be a tad bit faster to read the small file line by line, instead of in list context as above, but this should not be noticeable.

Update   Writing to STDOUT saves two operations and I repeatedly time it to be a little faster than writing to a file. Such usage is also consistent with most UNIX tools so I changed to write to STDOUT. Next, the exists test is not needed and dropping it spares an operation. However, I consistently get a touch better runtimes with it, while it also conveys the purpose better. Altogether I am leaving it in. Thanks to ikegami for comments.

Note   The commented out version is about 50% faster than the other, by my benchmark below. These are both given because they are different, one finding the first match and the other the second field. I am keeping it this way as a more generic choice, since the question is ambiguous on that.


Some comparisons (benchmark)   [Updated for writing to STDOUT, see "Update" above]

There is an extensive analysis in the answer by H?konH?gland, timing one run of most solutions. Here is another take, benchmarking the two solutions above, the OP's own answer, and the posted fgrep one, expected to be fast and used in the question and in many answers.

I build test data in the following way. A handful of lines of the length roughly as shown are made with random words, for both files, so to match in the second field. Then I pad this "seed" for data samples with lines that won't match, so to mimic ratios between sizes and matches quoted by OP: for 14K lines in small file there are 1.3M lines in the big file, yielding 126K matches. Then these samples are written repeatedly to build full data files as OP's, shuffle-ed each time using List::Util.

All runs compared below produce 106_120 matches for the above file sizes (diff-ed for a check), so the matching frequency is close enough. They are benchmarked by calling complete programs using my $res = timethese(60 ...). The result of cmpthese($res) on v5.16 are

        Rate regex  cfor split fgrep
regex 1.05/s    --  -23%  -35%  -44%
cfor  1.36/s   30%    --  -16%  -28%
split 1.62/s   54%   19%    --  -14%
fgrep 1.89/s   80%   39%   17%    --

The fact that the optimized C-program fgrep comes on top isn't surprising. The lag of "regex" behind "split" may be due to the overhead of starting the engine for little matches, many times. This may vary over Perl versions, given the evolving regex engine optimizations. I include the answer of @codeforester ("cfor") since it was claimed to be fastest, and its 20% lag behind the very similar "split" is likely due to scattered small inefficiencies (see a comment below this answer).

This isn't shatteringly different, while there are sure variations across hardware and software and over data details. I ran this on different Perls and machines, and the notable difference is that in some cases fgrep was indeed an order of magnitude faster.

The OP's experience of very slow fgrep is surprising. Given their quoted run times, order of magnitude slower than the above, I'd guess that there's an old system to "blame."

Even though this is completely I/O based there are concurrency benefits from putting it on multiple cores and I'd expect a good speedup, up to a factor of a few.


Alas, the comment got deleted (?). In short: unneeded use of a scalar (costs), of an if branch, of defined, of printf instead of print (slow!). These matter for efficiency on 2 billion lines.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...