I noticed something weird while testing a simple Perl script that's supposed to filter out filenames beginning with certain prefixes.
Basically, I'm constructing a regex like this:
my $regex = join "|", map quotemeta, @prefixes;
$regex = qr/^($regex)/; # anchor the regex and precompile it
Here, in the scenario I originally tested, @prefixes
consists of 32-character hexadecimal strings. What I found is that everything ran nice and smoothly up to 6,552 prefixes — but as soon as I tried 6,553, the execution time of the script jumped by a factor of over 25 (from 1.8 seconds to 51 seconds)!
I played around with it, and constructed the following benchmark. I originally used 32-character hex strings, like in my original program, but found that if I cut the length of the strings down to just 8 characters, the threshold value moved higher — to 16,383, in fact — while the slowdown factor got dramatically higher yet: the regexp with 16,383 alternatives is almost 650 times slower than the one with only 16,382!
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
my $count = shift || 10;
our @items = map unpack("H8", pack "V", $_), 0..99999;
our $nA = 16382; our $reA = join "|", @items[1 .. $nA];
our $nB = 16383; our $reB = join "|", @items[1 .. $nB];
$_ = qr/^($_)/ for $reA, $reB; # anchor and compile regex
my $results = timethese( $count, {
$nA => q{ our (@items, $nA, $reA); $nA == grep /$reA/, @items or die; },
$nB => q{ our (@items, $nB, $reB); $nB == grep /$reB/, @items or die; },
} );
cmpthese( $results );
Here are the results of running this benchmark on my laptop, using Perl (v5.18.2):
Benchmark: timing 10 iterations of 16382, 16383...
16382: 2 wallclock secs ( 1.79 usr + 0.00 sys = 1.79 CPU) @ 5.59/s (n=10)
16383: 1163 wallclock secs (1157.28 usr + 2.70 sys = 1159.98 CPU) @ 0.01/s (n=10)
s/iter 16383 16382
16383 116 -- -100%
16382 0.179 64703% --
Note the 64,703% speed difference.
My original hunch, based on the numerical coincidence that 6553 ≈ 216 / 10, was that this might've been some kind of an arbitrary limit within the Perl regex engine, or maybe that there there might be some kind of an array of 10-byte structs that was limited to 64 kB, or something. But the fact that the threshold number of alternatives depends on their length makes things more confusing.
(On the other hand, it's clearly not just about the length of the regex, either; the original regex with 6,553 32-byte alternatives was 2 + 6,553×33 = 216,251 bytes long, while the one with 16,383 8-byte alternatives is only 2 + 16,383×9 = 147,450 bytes long.)
What is causing this weird jump in regex matching time, and why does it happen at that specific point?
See Question&Answers more detail:
os