Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
131 views
in Technique[技术] by (71.8m points)

java - Why does this regex take a long time to execute?

I found out, that for example this line has a very very long execution time:

System.out.println(
        ".. .. .. .. .. .. .. .. ..  .. .. .. .. .. .. .. .. .. .. .. .... .. .."
        .matches("(?i)(?:.* )?\W?([a-z0-9-_\.]+((?: *)\.(?: *))+(?:DE))(?:[0-9]{1,5})?")
);

If I reduce the amount of dots at the start of the String the execution time gets lower (seems like it's exponential). Here is the suspended thread's stack trace:

[Repeating text]...
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.match(Matcher, int, CharSequence) line: 4785
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.match(Matcher, int, CharSequence) line: 4785
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.matchInit(Matcher, int, CharSequence) line: 4801
Pattern$Prolog.match(Matcher, int, CharSequence) line: 4741
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Ques.match(Matcher, int, CharSequence) line: 4182
Pattern$BranchConn.match(Matcher, int, CharSequence) line: 4568
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Branch.match(Matcher, int, CharSequence) line: 4604
Matcher.match(int, int) line: 1270
Matcher.matches() line: 604
Pattern.matches(String, CharSequence) line: 1135
String.matches(String) line: 2121
Main.main(String[]) line: 11

Why does this happen?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

When pattern x is made optional - using ? or * quantifiers (or {0,}) - engine has two paths to approach according to the nature of quantifier being used:

  • Consumes then backtracks for other patterns (case of greediness i.e. .*, .?)
  • First doesn't consume and looks immediately for other patterns (case of laziness .*?)

Someone probably is not aware about regular expressions or doesn't care about performance and throws .* wherever he needs a match somewhere in string and engines are so fast in taking steps back and forth that nothing seems weird or slow unless a pattern can not be found.

Time complexity starts at O(n) and continues with O(n^2b) where b is level of nesting quantifiers. So on failure number of steps an engine takes is HUGE.

To avoid such situations someone needs to consider some guiding principles:

  • Specifying boundaries. If pattern should stop somewhere before digits do not do .*. Instead do D*.

  • Use conditions. You can check if pattern / letter x exists before running a whole match using a lookahead ^(?=[^x]*x). This leads to an early failure.

  • Use possessive quantifiers or atomic groups (if available). These two avoid backtracks. Sometimes you do not need backtracks.

  • Do not do (.*)+ or similar patterns. Instead reconsider your requirements or at least use atomic groups (?>.*)+.

Your own Regular Expression isn't an exception. It suffers from much greediness and optional matches and needs a time to be restudied.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...