Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Unexpected match of regex

I expect the regex pattern ab{,2}c to match only an a followed by 0, 1 or 2 b's, followed by a c.

It works that way in lots of languages, for instance Python. However, in R:

grepl("ab{,2}c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
# [1]  TRUE  TRUE  TRUE  TRUE FALSE

I'm surprised by the 4th TRUE. In ?regex, I can read:

{n,m} The preceding item is matched at least n times, but not more than m times.

So I agree that {,2} should be written {0,2} to be a valid pattern (unlike in Python, where the docs state explicitly that omitting n specifies a lower bound of zero).
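
For comparison, here is a minimal sketch of the same call with the lower bound written out as {0,2}, the form that ?regex documents. The variable names x and explicit are only illustrative.

```r
x <- c("ac", "abc", "abbc", "abbbc", "abbbbc")

# Explicit lower bound of 0: zero, one or two b's between a and c
explicit <- grepl("ab{0,2}c", x)
print(explicit)
# [1]  TRUE  TRUE  TRUE FALSE FALSE
```

With the bound spelled out, abbbc is no longer matched.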

But then using {,2} should throw an error instead of returning misleading matches! Am I missing something or should this be reported as a bug?

1 Answer

The behavior with {,2} is not expected; it is a bug. If you look at the tre_parse_bound function in the TRE source code, you will see that the min variable is set to -1 before the engine tries to initialize the minimum bound. It seems that when the minimum value is missing from the quantifier, the number of "repeats" becomes the maximum value + 1 (as if the repeat count were max - min = max - (-1) = max + 1).

So a{,} matches one occurrence of a, and so do a{, } and a{ , }. In the R demo below, only abc is matched by ab{,}c and its variants:

grepl("ab{,}c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
grepl("ab{, }c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
grepl("ab{ ,   }c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
## => [1] FALSE  TRUE FALSE FALSE FALSE
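
Until the engine handles a missing minimum correctly, one defensive option is to make the lower bound explicit before the pattern reaches grepl. The following is only a sketch: normalize_bounds is a hypothetical helper, not part of base R. It assumes an omitted minimum is meant as zero, and it rewrites only the literal text "{," (no whitespace inside the braces, no handling of escaped braces).

```r
# Hypothetical helper (not part of base R): make an omitted lower bound explicit.
normalize_bounds <- function(pattern) {
  # fixed = TRUE treats "{," as a literal string, not as a regex
  gsub("{,", "{0,", pattern, fixed = TRUE)
}

x <- c("ac", "abc", "abbc", "abbbc", "abbbbc")

safe_pattern <- normalize_bounds("ab{,}c")  # becomes "ab{0,}c"
result <- grepl(safe_pattern, x)
print(result)
# [1] TRUE TRUE TRUE TRUE TRUE
```

Rewritten as {0,}, the quantifier means "zero or more b's", so all five strings match.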
