Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
253 views
in Technique[技术] by (71.8m points)

java - Is regex too slow? Real life examples where simple non-regex alternative is better

I've seen people here made comments like "regex is too slow!", or "why would you do something so simple using regex!" (and then present a 10+ lines alternative instead), etc.

I haven't really used regex in industrial setting, so I'm curious if there are applications where regex is demonstratably just too slow, AND where a simple non-regex alternative exists that performs significantly (maybe even asymptotically!) better.

Obviously many highly-specialized string manipulations with sophisticated string algorithms will outperform regex easily, but I'm talking about cases where a simple solution exists and significantly outperforms regex.

What counts as simple is subjective, of course, but I think a reasonable standard is that if it uses only String, StringBuilder, etc, then it's probably simple.


Note: I would very much appreciate answers that demonstrate the following:

  1. a beginner-level regex solution to a non-toy real-life problem that performs horribly
  2. the simple non-regex solution
  3. the expert-level regex rewrite that performs comparably
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.

The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.

In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).

1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match
42,12,13,12,32,13,14,43,56,31,78,10 // match
42,12,13,12,32,14,13,43,56,31,78,10 // don't match

We use a regex containing exactly 11 commas:

".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"

This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)

A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

So we replace the greedy quantifier with the reluctant one:

".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"

This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.

A performant regex replaces the dot by the character class "[^,]":

"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"

(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...