Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
250 views
in Technique[技术] by (71.8m points)

java - Regex search with pattern containing (?:.|s)*? takes increasingly long time

My regex is taking increasingly long to match (about 30 seconds the 5th time) but needs to be applied for around 500 rounds of matches. I suspect catastrophic backtracking. Please help! How can I optimize this regex:

String regex = "<tr bgcolor="ffffff">\s*?<td width="20%"><b>((?:.|\s)+?): *?</b></td>\s*?<td width="80%">((?:.|\s)*?)(?=(?:</td>\s*?</tr>\s*?<tr bgcolor="ffffff">)|(?:</td>\s*?</tr>\s*?</table>\s*?<b>Tags</b>))";

EDIT: since it was not clear(my bad): i am trying to take a html formatted document and reformat by extracting the two search groups and adding formating afterwards.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The alternation (?:.|\s)+? is very inefficient, as it involves too much backtracking.

Basically, all variations of this pattern are extremely inefficient: (?:.|s)*?, (?:.| )*?, (?:.| )*? and there greedy counterparts, too ((?:.|s)*, (?:.| )*, (?:.| )*). (.|s)*? is probably the worst of them all.

Why?

The two alternatives, . and s may match the same text at the same location, the both match regular spaces at least. See this demo taking 3555 steps to complete and .*? demo (with s modifier) taking 1335 steps to complete.

Patterns like (?:.| )*? / (?:.| )* in Java often cause a Stack Overflow issue, and the main problem here is related to the use of alternation (that already alone causes backtracking) that matches char by char, and then the group is modified with a quantifier of unknown length. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended to use (only in ElasticSearch Lucene regex engine the (.| ) is the only way to match any char).

Solution

If you want to match any characters including whitespace with regex, do it with

[\s\S]*?

Or enable singleline mode with (?s) (or Pattern.DOTALL Matcher option) and just use . (e.g. (?s)start(.*?)end).

NOTE: To manipulate HTML, use a dedicated parser, like jsoup. Here is an SO post discussing Java HTML parsers.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...