Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - PHP 7 preg_replace PREG_JIT_STACKLIMIT_ERROR with simple string

I know other people have asked about this error, but I can't see how this regex or the subject string could be any simpler.

To me this is a bug, but before submitting it to PHP I thought I'd make sure and get help to see if this can be simpler.

Here's a small test script showing 2 strings; one with 1024 x's and one with 1023:

// 1024 x's
$str = '_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'; 

// Outputs nothing (bug?)
echo preg_replace('/(?<=[^\w]|^)_([^_\n ](.|\n(?!\n))*?)_(?=[^\w]|$)/', '[i]${1}[/i]', $str);

echo "\n\n";

// 1023 x's
$str = '_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'; 

// Outputs the unchanged string as expected
echo preg_replace('/(?<=[^\w]|^)_([^_\n ](.|\n(?!\n))*?)_(?=[^\w]|$)/', '[i]${1}[/i]', $str);

As you can see, only with a slightly longer string (greater than 1024 characters) do we get an error. The strings that will be processed by this are going to be any length – they will be forum posts, news articles, etc.
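A minimal sketch of how the failure could be confirmed rather than inferred from the empty output: preg_replace() returns null on error, and preg_last_error() reports the error code. The helper name pregErrorName() is hypothetical, and the subject string is rebuilt with str_repeat() on the assumption that it is one underscore followed by 1024 x's. Whether the error actually triggers depends on the PHP build and its PCRE JIT settings.

```php
<?php
// Hypothetical helper (not part of the original post): maps a preg error
// code to the name of the matching PREG_* constant.
function pregErrorName(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? 'unknown error ' . $code;
}

// Assumption: same shape as the failing string, "_" followed by 1024 x's.
$str = '_' . str_repeat('x', 1024);
$pattern = '/(?<=[^\w]|^)_([^_\n ](.|\n(?!\n))*?)_(?=[^\w]|$)/';

$result = preg_replace($pattern, '[i]${1}[/i]', $str);

if ($result === null) {
    // null means the regex engine gave up; ask it why.
    echo 'preg_replace failed: ', pregErrorName(preg_last_error()), "\n";
} else {
    echo $result, "\n";
}
```

Checking for null this way separates "no match, string unchanged" from "the engine aborted", which look very different in the two test cases above.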

Regex Explanation

Just trying to do some markdown parsing to convert a string like _I am italic_, to a legacy version of markup we're using from our old site in certain situations. The reasons/uses aren't important. What's important is that I believe this should work just fine, and in fact it does, like, everywhere else except PHP 7.

It should match these underscores only if they surround an independent word or sentence. It should not match the first underscore if it is preceded by any "word" character, and it should not match the last underscore if it is followed by any "word" character.
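To make the intended behaviour concrete, here is a small sketch that runs the pattern over a few short strings. The helper name italicize() and the sample strings are made up for illustration, and the pattern is written with the \w and \n escapes that the description above implies.

```php
<?php
// The pattern from the question, with its escapes written out.
$pattern = '/(?<=[^\w]|^)_([^_\n ](.|\n(?!\n))*?)_(?=[^\w]|$)/';

// Hypothetical helper, used only for this illustration.
function italicize(string $text, string $pattern): ?string
{
    return preg_replace($pattern, '[i]${1}[/i]', $text);
}

$samples = [
    '_I am italic_',    // whole string wrapped in underscores
    'say _hi_ now',     // wrapped word inside a sentence
    'snake_case_name',  // underscores inside a word: left alone
    '_ not italic_',    // space right after the opening underscore: left alone
];

foreach ($samples as $sample) {
    echo italicize($sample, $pattern), "\n";
}
```

The first two samples are converted to [i]...[/i]; the last two come back unchanged because an underscore touches a word character or is followed by a space.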

Environment: CentOS 7, PHP 7.1.6


1 Answer

IMPORTANT NOTE:
The (.|\n)*? or (.|\r?\n)*? patterns should be avoided as they cause too much redundant backtracking. To match any char, you can usually use . with the DOTALL flag or, in JavaScript, the [^] or [\s\S] constructs. See "How do I match any character across multiple lines in a regular expression?" for more details.
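As a rough illustration of that note in PHP, the sketch below matches the same text twice: once with the alternation form and once with the s (DOTALL) modifier. The sample text and variable names are invented for the example.

```php
<?php
// Sample input with a line break inside the tag (made up for this example).
$text = "<b>first\nsecond</b>";

// Alternation-based "any char", the form the note advises against.
preg_match('~<b>((?:.|\n)*?)</b>~', $text, $viaAlternation);

// Same capture with the s (DOTALL) modifier, so . also matches newlines.
preg_match('~<b>(.*?)</b>~s', $text, $viaDotall);

// Both captures hold the same text; only the matching cost differs.
var_dump($viaAlternation[1] === $viaDotall[1]);
```

On a two-line input the difference is invisible, but the alternation form has to try a new branch for every character, which is where the extra backtracking comes from on long inputs.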

Current Issue

The (.|\n(?!\n))*? pattern is very inefficient and causes a lot of redundant backtracking when it is not at the end of the pattern (where a lazy quantifier makes no sense at all). The further to the left it sits in the pattern, the worse the performance.

Since all it does is lazily match any char other than a newline, or a newline that is not followed by another newline, you may rewrite it as .*?(?:\R(?!\R).*?)*:

'~\b_([^_\n ].*?(?:\R(?!\R).*?)*)_\b~'

Note:

  • (?<=[^\w]|^) = \b because there is a _ (a word char) after the lookbehind
  • (?=[^\w]|$) = \b because there is a _ before the lookahead
  • .*?(?:\R(?!\R).*?)* - matches:
    • .*? - any 0+ chars other than line break chars, as few as possible, then
    • (?:\R(?!\R).*?)* - zero or more sequences of:
      • \R(?!\R) - a line break sequence not followed by another line break sequence (\R = \r\n, \n or \r)
      • .*? - any 0+ chars other than line break chars, as few as possible
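A short sketch of the rewritten pattern in use, assuming it reads ~\b_([^_\n ].*?(?:\R(?!\R).*?)*)_\b~ with \b standing in for the two lookarounds as described in the note. The sample strings are invented, and the long string assumes the question's failing input is one underscore followed by 1024 x's.

```php
<?php
// Assumed form of the rewritten pattern: \b replaces both lookarounds.
$pattern = '~\b_([^_\n ].*?(?:\R(?!\R).*?)*)_\b~';
$replacement = '[i]${1}[/i]';

// A single line break inside the underscores is allowed.
$singleBreak = preg_replace($pattern, $replacement, "_line one\nline two_");

// A blank line (two consecutive line breaks) stops the match.
$blankLine = preg_replace($pattern, $replacement, "_para one\n\npara two_");

// Assumption: same shape as the failing input from the question.
$long = '_' . str_repeat('x', 1024);
$longResult = preg_replace($pattern, $replacement, $long);

echo $singleBreak, "\n";
echo $blankLine, "\n";

// null would mean the engine still aborted; otherwise the string is unchanged.
echo $longResult === null ? 'error code ' . preg_last_error() : 'long string processed', "\n";
```

The unrolled form lets each .*? run along a line without entering an alternation for every character, which is why it should need far less backtracking than (.|\n(?!\n))*?.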
