The alternation (?:.|\s)+?
is very inefficient, as it involves too much backtracking.
Basically, all variations of this pattern are extremely inefficient: (?:.|s)*?
, (?:.|
)*?
, (?:.|
)*?
and there greedy counterparts, too ((?:.|s)*
, (?:.|
)*
, (?:.|
)*
). (.|s)*?
is probably the worst of them all.
Why?
The two alternatives, .
and s
may match the same text at the same location, the both match regular spaces at least. See this demo taking 3555 steps to complete and .*?
demo (with s
modifier) taking 1335 steps to complete.
Patterns like (?:.|
)*?
/ (?:.|
)*
in Java often cause a Stack Overflow issue, and the main problem here is related to the use of alternation (that already alone causes backtracking) that matches char by char, and then the group is modified with a quantifier of unknown length. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended to use (only in ElasticSearch Lucene regex engine the (.|
)
is the only way to match any char).
Solution
If you want to match any characters including whitespace with regex, do it with
[\s\S]*?
Or enable singleline mode with (?s)
(or Pattern.DOTALL
Matcher
option) and just use .
(e.g. (?s)start(.*?)end
).
NOTE: To manipulate HTML, use a dedicated parser, like jsoup. Here is an SO post discussing Java HTML parsers.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…