python - Why does this take so long to match? Is it a bug?

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I need to match certain URLs in a web application, e.g. /123,456,789, and wrote this regex to match the pattern:

r'(\d+(,)?)+/$'

When I test the pattern against a string that should not match, the call does not seem to return, even after several minutes:

re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')

I would expect it to return no matches.

This expression, however, executes almost immediately (note the trailing slash):

re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')

Is this a bug?

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:13:33+0000

This is catastrophic backtracking: the amount of work grows exponentially with the length of the non-matching string. It comes from your nested repetitions and the optional comma. Because \d+ sits inside a group that is itself repeated, a run of digits can be divided among the group's iterations in a huge number of ways, and when the trailing / is missing the engine tries every one of them before giving up (some regex engines can determine that this won't match without attempting all of the extraneous repetition). The fix is to optimize the expression.
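To make the growth visible without hanging the interpreter, here is a minimal sketch that times the original pattern against short all-digit strings with no trailing slash. The helper name `time_failed_match` and the chosen lengths are illustrative assumptions, not part of the question, and the exact timings depend on your machine and Python version.

```python
import re
import time

# The pattern from the question, with the backslashes in place.
PATTERN = re.compile(r'(\d+(,)?)+/$')

def time_failed_match(n_digits):
    """Time a search that must fail: n digits and no trailing slash."""
    subject = '1' * n_digits
    start = time.perf_counter()
    result = PATTERN.search(subject)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Lengths are kept small on purpose; every extra digit adds noticeably more work.
for n in (12, 14, 16, 18):
    result, elapsed = time_failed_match(n)
    print(n, result, f'{elapsed:.4f}s')
```

Each run returns None, but the elapsed time should climb steeply as the digit count goes up, which is why the 39-character string in the question appears never to finish.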

The easiest way to accomplish this is to just look for one or more digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect, since it would allow something like ,123,,4,5/.
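A quick sketch of how the character-class version behaves on the strings from the question. The helper `loose_match` is a hypothetical name used only for this illustration.

```python
import re

# Character-class version: one or more digits or commas, a slash, end of string.
LOOSE = re.compile(r'[\d,]+/$')

def loose_match(subject):
    """Return the matched text, or None when the pattern does not match."""
    m = LOOSE.search(subject)
    return m.group(0) if m else None

# No trailing slash: fails right away instead of backtracking for minutes.
print(loose_match('12345121,223456,123123,3234,4523,523523'))
# Trailing slash: the whole string matches.
print(loose_match('12345121,223456,123123,3234,4523,523523/'))
# Malformed input is accepted too, which is the weakness noted above.
print(loose_match(',123,,4,5/'))
```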

For this you can use a slightly optimized version of your initial attempt: (?:\d,?)+/$. First, I made the repeating group non-capturing ((?:...)), which isn't necessary but provides a "cleaner match". Next, and this is the only crucial step, I stopped repeating the \d inside the group, since the group itself is already repeating. Finally, I removed the unnecessary group around the optional comma, since ? only affects the preceding character. In short, this looks for one digit, maybe a comma, repeats that, and then requires a trailing /.
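The same check for the rewritten pattern, as a small sketch (the helper name `tighter_match` is an assumption for illustration). Since each digit can now be consumed in only one way, the failing case should return immediately.

```python
import re

# One digit, an optional comma, repeated; then a slash at the end of the string.
TIGHTER = re.compile(r'(?:\d,?)+/$')

def tighter_match(subject):
    """Return the matched text, or None when the pattern does not match."""
    m = TIGHTER.search(subject)
    return m.group(0) if m else None

# No trailing slash: no match, and no long wait.
print(tighter_match('12345121,223456,123123,3234,4523,523523'))
# Trailing slash: the whole string matches.
print(tighter_match('12345121,223456,123123,3234,4523,523523/'))
# A comma right before the slash is still accepted.
print(tighter_match('1,2,3,/'))
```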

This can still match an odd string such as 1,2,3,/, so for the heck of it I improved the regex further with a negative lookbehind: (?:\d,?)+(?<!,)/$. This asserts that there is no comma directly before the trailing /.
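A short sketch of the lookbehind version on a few inputs. The name `STRICT` and the sample strings other than those quoted above are illustrative assumptions.

```python
import re

# Same as before, plus a negative lookbehind that rejects a comma before the slash.
STRICT = re.compile(r'(?:\d,?)+(?<!,)/$')

for subject in ('1,2,3/', '1,2,3,/', '12345121,223456,123123,3234,4523,523523'):
    # Prints True for '1,2,3/' and False for the other two.
    print(repr(subject), bool(STRICT.search(subject)))

# The pattern is not anchored at the start, so search() can still succeed on
# the tail of a malformed string; fullmatch() requires the whole string to fit.
print(STRICT.search(',123,,4,5/').group(0))
print(STRICT.fullmatch(',123,,4,5/'))
```

If the whole path segment has to be well formed, `fullmatch` (or a leading anchor) is one way to get that; whether it is needed depends on how the URL is routed, which the question does not specify.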
