Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


javascript - Why does a simple .*? non-greedy regex greedily include additional characters before a match?

I have a very simple regex similar to this:

HOHO.*?_HO_

With this test string...

fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev

  • I expect it to match just HOHO___HO_ (shortest match, non-greedy)
  • Instead it matches HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (longest match, looks greedy).

Why? How can I make it match the shortest match?

Adding and removing the ? gives the same result.

Edit - better test string that shows why [^HOHO] doesn't work: fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye
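A minimal sketch of why the character-class attempt fails on the edited test string, assuming the attempted pattern was HOHO[^HOHO]*?_HO_ (the variable names are only illustrative). A class such as [^HOHO] excludes the single characters H and O, not the four-character string HOHO:

```javascript
// [^HOHO] is a character class: it excludes the single characters H and O,
// not the string "HOHO". It is equivalent to [^HO].
const editedTestString = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye";

// Assumed form of the attempted pattern (not quoted in the question).
const classAttempt = /HOHO[^HOHO]*?_HO_/;
const classResult = editedTestString.match(classAttempt);

// The text between the last HOHO and _HO_ contains H and O characters,
// so the class cannot cross it and nothing matches at all.
console.log(classResult); // null
```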


All I can think of is that maybe it is matching multiple times - but there's only one match for _HO_, so I don't understand why it isn't taking the shortest match that ends at the _HO_, discarding the rest.

I've browsed all the questions I can find with titles like "Non-greedy regex acts greedy", but they all seem to have some other problem.


1 Answer


I figured out a solution with some help from the question "Regex lazy vs greedy confusion".

In backtracking regex engines like the one used by JavaScript (NFA engines, I believe), non-greedy only gives you the match that is shortest going left to right: from the first left-hand match that can lead to a full match, to the nearest right-hand match after it.

Where there are many left-hand matches for one right-hand match, it will always start from the first one it reaches (which actually gives the longest match).

Essentially, it goes through the string one character at a time asking "Are there matches starting from this character? If so, match the shortest and finish. If not, move to the next character and repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".
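The difference between those two questions can be made visible with a small sketch. The helper matchesAtEveryStart is a hypothetical name introduced here for illustration; it uses the sticky flag (y) to force an attempt at each start position, which is roughly the "shortest of all of them" behaviour I had expected:

```javascript
const testString = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";

// Normal search: the engine stops at the first start position that can match.
const first = /HOHO.*?_HO_/.exec(testString);
console.log(first.index, first[0]);
// 6 "HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_"

// Hypothetical helper: try the pattern at every start position.
// The sticky flag anchors each attempt exactly at lastIndex.
function matchesAtEveryStart(source, text) {
  const sticky = new RegExp(source, "y");
  const found = [];
  for (let i = 0; i < text.length; i++) {
    sticky.lastIndex = i;
    const m = sticky.exec(text);
    if (m !== null) found.push({ index: i, match: m[0] });
  }
  return found;
}

// Every HOHO is a valid left-hand start; the engine simply takes the first.
const candidates = matchesAtEveryStart("HOHO.*?_HO_", testString);

// Picking the shortest candidate by hand gives the match I wanted.
const shortest = candidates.reduce((a, b) =>
  b.match.length < a.match.length ? b : a
);
console.log(shortest.index, shortest.match); // 33 "HOHO___HO_"
```

This brute-force scan is only meant to show the engine's behaviour, not as a practical replacement for a better pattern.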


You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "any character that does not start the left-side match". Negating a whole string like this requires a negative lookahead inside a non-capturing group, but it's as simple as dropping the string into the lookahead of (?:(?!STRING).), where STRING stands for the left-side text. For example, (?:(?!HOHO).)

So the equivalent of HOHO.*?_HO_ which is non-greedy on the left and right would be:

HOHO(?:(?!HOHO).)*?_HO_
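A quick check of this pattern against both test strings from the question (variable names are just for illustration):

```javascript
const tempered = /HOHO(?:(?!HOHO).)*?_HO_/;

const originalString = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";
const editedString = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye";

// Earlier HOHO starts fail because another HOHO appears before any _HO_,
// so the match begins at the last HOHO in both strings.
const originalMatch = originalString.match(tempered)[0];
const editedMatch = editedString.match(tempered)[0];

console.log(originalMatch); // "HOHO___HO_"
console.log(editedMatch);   // "HOHO_H_O_H_O_HO_"
```

Note that . does not match line breaks in JavaScript unless the s (dotAll) flag is set, so the pattern as written only works within a single line.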

So, at each starting position, the regex engine is essentially asking:

  • HOHO - Does this match the left side?
  • (?:(?!HOHO).)* - If so, can I reach the right-hand side without passing another copy of the left side?
  • _HO_ - If so, grab everything up to the right-hand match
  • ? modifier on * or + - If there are multiple right-hand matches, choose the nearest one
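The same construction can be wrapped in a small helper for arbitrary left and right strings. This is only a sketch: escapeRegExp and shortestBetween are hypothetical names, not built-in functions, and the delimiters are assumed to be literal text rather than patterns:

```javascript
// Escape regex metacharacters so the delimiters are treated literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build LEFT(?:(?!LEFT).)*?RIGHT for literal left/right delimiters.
function shortestBetween(left, right) {
  const l = escapeRegExp(left);
  const r = escapeRegExp(right);
  return new RegExp(`${l}(?:(?!${l}).)*?${r}`);
}

const built = shortestBetween("HOHO", "_HO_");
console.log(built.source); // HOHO(?:(?!HOHO).)*?_HO_

const sample = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";
console.log(built.exec(sample)[0]); // "HOHO___HO_"
```

Escaping matters as soon as a delimiter contains characters such as . or (, which would otherwise be interpreted as regex syntax.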
