Noted by Brian Candler on ruby-talk:
\w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
\d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.
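The \d case can be checked the same way. This is a minimal sketch, assuming Ruby 1.9 or later and a UTF-8 string; the sample string is made up for illustration and uses U+0663 (ARABIC-INDIC DIGIT THREE) as a non-ASCII digit:

```ruby
# encoding: utf-8
# Illustrative sample: one ASCII digit followed by U+0663,
# a decimal digit outside the ASCII range.
sample = "7\u0663"

ascii_digits   = sample.scan(/\d/)          # shorthand class
unicode_digits = sample.scan(/[[:digit:]]/) # POSIX bracket expression

p ascii_digits    # only the ASCII "7" is matched
p unicode_digits  # both "7" and U+0663 are matched
```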
The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:
\w word character
Not Unicode: alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
Given Ruby's actual behavior and the "Not Unicode" text above, the documentation appears to describe two modes (a Unicode mode and a Not Unicode mode), with Ruby operating in the Not Unicode mode.
This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.
p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)
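Whatever the answer on modes, the Unicode property classes appear to give the behavior documented as "Unicode" directly. This is a minimal sketch, assuming Ruby 1.9 or later and UTF-8 strings; the sample text is made up for illustration:

```ruby
# encoding: utf-8
# Illustrative sample: a, b, U+00E7 (c with cedilla), underscore,
# and U+0663 (ARABIC-INDIC DIGIT THREE).
text = "ab\u00E7_\u0663"

ascii_word    = text.scan(/\w/)         # ASCII letters, digits, underscore
property_word = text.scan(/\p{Word}/)   # Letter, Mark, Number, Connector_Punctuation
posix_word    = text.scan(/[[:word:]]/) # POSIX-style spelling of the same class

p ascii_word     # three matches: "a", "b", "_"
p property_word  # all five characters
p posix_word     # all five characters
```

Here \p{Word} and [[:word:]] cover the categories listed in the quoted "Unicode" line, while plain \w stays ASCII-only.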
Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:
[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."
Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc