Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
441 views
in Technique[技术] by (71.8m points)

language agnostic - Why do regex engines allow / automatically attempt matching at the end of the input string?

Note:
* Python is used to illustrate behaviors, but this question is language-agnostic.
* For the purpose of this discussion, assume single-line input only, because the presence of newlines (multi-line input) introduces variations in behavior of $ and . that are incidental to the questions at hand.

Most regex engines:

  • accept a regex that explicitly tries to match an expression after the end of the input string[1].

    $ python -c "import re; print(re.findall('$.*', 'a'))"
    [''] # !! Matched the hypothetical empty string after the end of 'a'
    
  • when finding / replacing globally, i.e., when looking for all non-overlapping matches of a given regex, and having reached the end of the string, unexpectedly try to match again[2], as explained in this answer to a related question:

    $ python -c "import re; print(re.findall('.*$', 'a'))"
    ['a', ''] # !! Matched both the full input AND the hypothetical empty string
    

Perhaps needless to say, such match attempts succeed only if the regex in question matches the empty string (and the regex by default / is configured to report zero-length matches).

These behaviors are at least at first glance counter-intuitive, and I wonder if someone can provide a design rationale for them, not least because:

  • it's not obvious what the benefit of this behavior is.
  • conversely, in the context of finding / replacing globally with patterns such as .* and .*$, the behavior is downright surprising.[3]
    • To ask the question more pointedly: Why does functionality designed to find multiple, non-overlapping matches of a regex - i.e., global matching - decide to even attempt another match if it knows that the entire input has been consumed already, irrespective of what the regex is (although you'll never see the symptom with a regex that doesn't at least also match the empty string)
    • The following languages/engines exhibit the surprising behavior: .NET, Python (both 2.x and 3.x)[2], Perl (both 5.x and 6.x), Ruby, Node.js (JavaScript)

Note that regex engines vary in behavior with respect to where to continue matching after a zero-length (empty-string) match.

Either choice (start at the same character position vs. start at the next) is defensible - see the chapter on zero-length matches at www.regular-expressions.info.

By contrast, the .*$ case discussed here is different in that, with any non-empty input, the first match for .*$ is not a zero-length match, so the behavior difference does not apply - instead, the character position should advance unconditionally after the first match, which of course is impossible if you're already at the end.
Again, my surprise is at the fact that another match is attempted nonetheless, even though there's by definition nothing left.


[1] I'm using $ as the end-of-input marker here, even though in some engines, such as .NET's, it can mark the end the end of the input optionally followed by a trailing newline. However, the behavior equally applies when you use the unconditional end-of-input marker, z.

[2] Python 2.x and 3.x up to 3.6.x seemingly special-cased replacement behavior in this context: python -c "import re; print(re.sub('.*$', '[g<0>]', 'a'))" used to yield just [a] - that is, only one match was found and replaced.
Since Python 3.7, the behavior is now like in most other regex engines, where two replacements are performed, yielding [a][].

[3] You can avoid the problem by either (a) choosing a replacement method that is designed to find at most one match or (b) use ^.* to prevent multiple matches from being found via start-of-input anchoring.
(a) may not be an option, depending on how a given language surfaces functionality; for instance, PowerShell's -replace operator invariably replaces all occurrences; consider the following attempt to enclose all array elements in "...":
'a', 'b' -replace '.*', '"$&"'. Due to matching twice, this yields elements "a""" and "b""";
option (b), 'a', 'b' -replace '^.*', '"$&"', fixes the problem.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I am giving this answer just to demonstrate why a regex would want to allow any code appearing after the final $ anchor in the pattern. Suppose we needed to create a regex to match a string with the following rules:

  • starts with three numbers
  • followed by one or more letters, numbers, hyphen, or underscore
  • ends with only letters and numbers

We could write the following pattern:

^d{3}[A-Za-z0-9-_]*[A-Za-z0-9]$

But this is a bit bulky, because we have to use two similar character classes adjacent to each other. Instead, we could write the pattern as:

^d{3}[A-Za-z0-9-_]+$(?<!_|-)

or

^d{3}[A-Za-z0-9-_]+(?<!_|-)$

Here, we eliminated one of the character classes, and instead used a negative lookbehind after the $ anchor to assert that the final character was not underscore or hyphen.

Other than a lookbehind, it makes no sense to me why a regex engine would allow something to appear after the $ anchor. My point here is that a regex engine may allow a lookbehind to appear after the $, and there are cases for which it logically makes sense to do so.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...