Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

regex - perl6 grammar, not sure about some syntax in an example

I am still learning perl6, and I am reading the grammar example on this page: http://examples.perl6.org/categories/parsers/SimpleStrings.html. I have read the documentation on regexes multiple times, but there is still some syntax that I don't understand.

token string { <quote> {} <quotebody($<quote>)> $<quote> }

Question 1: what is this "{}" in the token doing? The capture marker is <( )>, and nesting structures use the tilde, as in '(' ~ ')'; but what is {}?

token quotebody($quote) { ( <escaped($quote)> | <!before $quote> . )* }

Question 2a: escaped($quote) inside <> would be a regex function, right? And it takes $quote as an argument and returns another regex?

Question 2b: If I want to indicate "char that is not before quote", should I use ". <!before $quote>" instead of "<!before $quote> ."?

token escaped($quote) { '\\' ( $quote | '\\' ) } # I think this is a function;

1 Answer


TL;DR @briandfoy has provided an easy-to-digest answer. But here be dragons that he didn't mention. And pretty butterflies too. This answer goes deep.

Question 1: what is this {} in the token doing?

It's a code block.[1][2][3][4]

It's an empty one and has been inserted purely to force the $<quote> in quotebody($<quote>) to evaluate to the value captured by the <quote> at the start of the regex.
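
As a general illustration of what a non-empty code block does (separate from the publication issue discussed next), here is a minimal sketch. The @seen array and the strings pushed onto it are made up for the example; the point is only that ordinary Perl 6 code in braces runs when matching reaches it:

```raku
# Hypothetical illustration: @seen records when each code block runs.
my @seen;

# Each { ... } is ordinary Perl 6 code, executed when the match gets that far.
my $matched = ('abc' ~~ / a { @seen.push('after a') } b { @seen.push('after b') } c /).so;

say $matched;   # True
say @seen;      # [after a after b]
```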

The reason why $<quote> does not contain the right value without insertion of a code block is a Rakudo Perl 6 compiler limitation or bug related to "publication of match variables".

"Publication" of match variables by Rakudo

Moritz Lenz states in a Rakudo bug report that "the regex engine doesn't publish match variables unless it is deemed necessary".

By "regex engine" he means the regex/grammar engine in NQP, part of the Rakudo Perl 6 compiler.[3]

By "match variables", he means the variables that store captures of match results:

  • the current match variable $/;

  • the numbered sub-match variables $0, $1, etc.;

  • named sub-match variables of the form $<foo>.

By "publish" he means that the regex/grammar engine does what it takes so that any mentions of any variables in a regex (a token is also a regex) evaluate to the values they're supposed to have when they're supposed to have them. Within a given regex, match variables are supposed to contain a Match object corresponding to what has been captured for them at any given stage in processing of that regex, or Nil if nothing has been captured.
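
To make the three kinds of match variable concrete, here is a small sketch. The input string and the capture name name are assumptions chosen for illustration; nothing here depends on the grammar from the question:

```raku
# Hypothetical input string and capture name, chosen only for illustration.
my $input = 'key=value';

if $input ~~ / $<name>=[ \w+ ] '=' ( \w+ ) / {
    say ~$/;        # key=value  (the current match variable)
    say ~$0;        # value      (first numbered sub-match)
    say ~$<name>;   # key        (named sub-match)
}
```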

By "deemed necessary" he means that the regex/grammar engine makes a conservative call about whether it's worth doing the publication work after each step in the matching process. By "conservative" I mean that the engine often avoids doing publication, because it slows things down and is usually unnecessary. Unfortunately it's sometimes too optimistic in assuming that publication isn't necessary. Hence the need for programmers to sometimes intervene by explicitly inserting a code block to force publication of match variables (and other techniques for other variables[5]).

It's possible that the regex/grammar engine will improve in this regard over time, reducing the scenarios in which manual intervention is necessary. If you wish to help progress this, please create test cases that matter to you for existing related bugs.[5]

"Publication" of $<quote>'s value

The named capture $<quote> is the case in point here.

As far as I can tell, all sub-match variables correctly refer to their captured value when written directly into the regex without a surrounding construct. This works:

my regex quote { <['"]> }
say so '"aa"' ~~ / <quote> aa $<quote> /; # True

I think[6] $<quote> gets the right value because it is parsed as a regex slang construct.[4]

In contrast, if the {} were removed from

token string { <quote> {} <quotebody($<quote>)> $<quote> }

then the $<quote> in quotebody($<quote>) would not contain the value captured by the opening <quote>.

I think this is because the $<quote> in this case is parsed as a main slang construct.
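
For reference, here is a sketch of a complete grammar built around the tokens quoted in the question. The grammar name SimpleString and the TOP token are my assumptions, and quote reuses the character class from the snippet above; the page the question links to may differ in detail:

```raku
# Sketch only. SimpleString and TOP are assumed names; string, quotebody
# and escaped follow the tokens quoted in the question.
grammar SimpleString {
    token TOP    { ^ <string> $ }
    token quote  { <['"]> }
    # The empty {} forces $<quote> to be published before it is passed on.
    token string { <quote> {} <quotebody($<quote>)> $<quote> }
    token quotebody($quote) { ( <escaped($quote)> | <!before $quote> . )* }
    token escaped($quote)   { '\\' ( $quote | '\\' ) }
}

for '"hello"', q{'it\'s'}, q{"mismatched'} -> $text {
    say "$text parses: ", SimpleString.parse($text).so;
}
```

The first two inputs are intended to parse. The third opens with one kind of quote and closes with the other, so it is intended to fail.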

Question 2a: escaped($quote) inside <> would be a regex function, right? And it takes $quote as an argument

That's a good first approximation.

More specifically, regex atoms of the form <foo(...)> are calls of the method foo.

All regexes -- whether declared with token, regex, rule, /.../ or any other form -- are methods. But methods declared with method are not regexes:

say Method ~~ Regex; # False
say WHAT token { . }; # (Regex)
say Regex ~~ Method; # True
say / . / ~~ Method; # True

When the <escaped($quote)> regex atom is encountered, the regex/grammar engine doesn't know or care if escaped is a regex or not, nor about the details of method dispatch inside a regex or grammar. It just invokes method dispatch, with the invocant set to the Match object that's being constructed by the enclosing regex.

The call yields control to whatever ends up running the method. It typically turns out that the regex/grammar engine is just recursively calling back into itself because typically it's a matter of one regex calling another. But it isn't necessarily so.

and returns another regex

No, a regex atom of the form <escaped($quote)> does not return another regex.

Instead it calls a method that will/should return a Match object.

If the method called was a regex, P6 will make sure the regex generates and populates the Match object automatically.

If the method called was not a regex but instead just an ordinary method, then the method's code should have manually created and returned a Match object. Moritz shows an example in his answer to the SO question Can I change the Perl 6 slang inside a method?.

The Match object is returned to the "regex/grammar engine" that drives regex matching / grammar parsing.[3]

The engine then decides what to do next according to the result:

  • If the match was successful, the engine updates the overall match object corresponding to the calling regex. The updating may include saving the returned Match object as a sub-match capture of the calling regex. This is how a match/parse tree gets built.

  • If the match was unsuccessful, the engine may backtrack, undoing previous updates; thus the parse tree may dynamically grow and shrink as matching progresses.
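
A minimal sketch of the resulting parse tree, using a hypothetical KeyValue grammar (the grammar and its token names are illustrative, not from the question). Each successful <key> or <value> call returns a Match object that is saved as a named sub-match of TOP:

```raku
# Hypothetical grammar; KeyValue, key and value are illustrative names.
grammar KeyValue {
    token TOP   { <key> '=' <value> }
    token key   { \w+ }
    token value { \w+ }
}

my $tree = KeyValue.parse('colour=red');

say ~$tree<key>;     # colour
say ~$tree<value>;   # red
```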

Question 2b: If I want to indicate "char that is not before quote", should I use . <!before $quote> instead of <!before $quote> . ??

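Yes, taken literally: . <!before $quote> matches a character that is not followed by the quote, whereas <!before $quote> . matches a character that is not itself the start of the quote.

But the former is not what's needed for the quotebody regex, if that's what you're talking about. quotebody has to stop just before the closing quote, which is what <!before $quote> . does.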
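
The difference between the two orderings can be sketched with plain regexes outside any grammar. The variable names and the sample text are assumptions made up for this example:

```raku
my $quote = '"';
my $text  = 'ab"cd';

# Lookahead first: consume characters until the next one would be the quote.
my $lookahead-first = ~($text ~~ / [ <!before $quote> . ]* /);

# Character first: consume a character only if what follows it is not the quote.
my $dot-first = ~($text ~~ / [ . <!before $quote> ]* /);

say $lookahead-first;   # ab
say $dot-first;         # a
```

The second form stops one character early, because the b is followed by the quote and so fails the lookahead.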

While on the latter topic, in @briandfoy's answer he suggests using a "Match ... anything that's not a quote" construct rather than doing a negative look ahead (<!before $quote>). His point is that matching "not a quote" is much easier to understand than "are we not before a quote? then match any character".

However, it is by no means straightforward to do this when the quote is a variable whose value is set to the capture of the opening quote. This complexity is due to bugs in Rakudo. I've worked out what I think is the simplest way around them, but I think it's likely best to just stick with <!before $quote> . unless/until these long-standing Rakudo bugs are fixed.[5]

token escaped($quote) { '\\' ( $quote | '\\' ) } # I think this is a function;

It's a token, which is a Regex, which is a Method, which is a Routine:

say token { . } ~~ Regex;   # True
say Regex       ~~ Method;  # True
say Method      ~~ Routine; # True

The code inside the body (the { ... } bit) of a regex is written in the P6 regex "slang", whereas the code inside the body of a method routine is written in the main P6 "slang".[4] In this instance the regex body is the lone . in token { . }, which is a regex atom that matches a single character.

Using ~

The regex tilde (~) operator is specifically designed for the sort of parsing in the example this question is about. It reads better inasmuch as it's instantly recognizable and keeps the opening and closing quotes together. Much more importantly it can provide a human intelligible error message in the event of failure because it can say what closing delimiter(s) it's looking for.
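
As a minimal sketch of the operator on its own (the parenthesized number is just an illustrative input), A ~ B C matches A, then C, then B, with B recorded as the goal:

```raku
# '(' ~ ')' \d+ reads as: '(' then \d+ then ')', with ')' declared as the goal.
my $ok = ('(123)' ~~ / ^ '(' ~ ')' \d+ $ /).so;

say $ok;   # True
```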

But there's a key wrinkle you must consider if you insert a code block in a regex (with or without code in it) right next to the regex ~ operator (on either side of it). You will need to group the code block unless you specifically want the tilde to treat the code block as its own atom. For example:

token foo { <quote> ~ $<quote> {} <quotebody($<quote>)> }

will match a pair of <quote>s with nothing between them. (And then try to match <quotebody...>.)

In contrast, here's a way to duplicate the matching behavior of the string token in the String::Simple::Grammar grammar:

token string { <quote> ~ $<quote> [ {} <quotebody($<quote>)> ] }

Footnotes

[1] In 2002 Larry Wall wrote "It needs to be just as easy for a regex to call Perl code as it is for Perl code to call a regex." Computer scientists note that you can't have procedural code in the middle of a traditional regular expression. But Perls long ago led the shift to non-traditional regexes and P6 has arrived at the logical conclusion -- a simple {...} is all it takes to insert arbitrary procedural code in the middle of a regex. The language design and regex/grammar engine implementation[3] ensure that traditional-style, purely declarative regions within a regex are recognized, so that formal regular expression theory and optimizations can be applied to them, but nevertheless arbitrary regular procedural code can also be inserted. Simple uses include matching logic and debugging

