Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Python Unicode regular expression matching fails with some Unicode characters: bug or mistake?

I am attempting to use the re module in Python 2.7.3 with Unicode-encoded Devanagari text. I have added from __future__ import unicode_literals to the top of my code, so all string literals should be unicode objects.

However, I am running into some odd problems with Python's regex matching. For instance, consider this name: "किशोरी". This is a (mis-spelled) name, in Hindi, entered by one of my users. Any Hindi reader would recognise this as a word.

The following returns a match, as it should:

re.search("^[\w\s][\w\s]*","किशोरी",re.UNICODE)

But this does not:

re.search("^[\w\s][\w\s]*$","किशोरी",re.UNICODE)

Some spelunking revealed that only one character in this string, character U+0915 (क), is recognised as falling within the \w character class. This is incorrect, as the Unicode Character Database file on "derived core properties" lists other characters in this string (I have not checked all of them) as alphabetic, as indeed they are.
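The spelunking described above can be reproduced with the standard library alone. The following is a minimal sketch written for Python 3; the sample word is an assumption (six Devanagari code points chosen for illustration, spelled out with escapes), and describe is a hypothetical helper name. It lists each code point's general category next to whether re's \w matches it, so the outcome on a given interpreter can be inspected rather than assumed.

```python
import re
import unicodedata

# Assumed sample word for this sketch:
# KA, VOWEL SIGN I, SHA, VOWEL SIGN O, RA, VOWEL SIGN II
word = "\u0915\u093f\u0936\u094b\u0930\u0940"


def describe(text):
    """Return (code point, category, matched by re's \\w) for each code point."""
    rows = []
    for ch in text:
        matched = re.match(r"\w", ch, re.UNICODE) is not None
        rows.append(("U+%04X" % ord(ch), unicodedata.category(ch), matched))
    return rows


report = describe(word)
for row in report:
    print(row)
```

The letters are reported as category Lo and the vowel signs as Mc (spacing combining marks). If the third column is False for the Mc rows, that is the behaviour this question is about.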

Is this just a bug in Python's implementation? I could get around this by manually defining all the Devanagari alphanumeric characters as a character range, but that would be painful. Or am I doing something wrong?
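For reference, the manual workaround mentioned above might look roughly like this. It is only a sketch under a stated assumption: the whole Devanagari block (U+0900 to U+097F) is used as a stand-in for "Devanagari word characters", which is coarser than a hand-picked list because the block also contains dandas and other signs. NAME_RE and looks_like_name are hypothetical names, and the code is written for Python 3.

```python
import re

# Assumption: the Devanagari block U+0900-U+097F approximates the set of
# Devanagari word characters. It also includes punctuation such as the danda.
NAME_RE = re.compile("^[\\w\\s\u0900-\u097f]+$", re.UNICODE)


def looks_like_name(text):
    """Return True if text consists only of \\w, whitespace, or Devanagari-block characters."""
    return NAME_RE.search(text) is not None


# Assumed sample word, written with escapes so the file encoding does not matter
sample = "\u0915\u093f\u0936\u094b\u0930\u0940"
print(looks_like_name(sample))
print(looks_like_name("abc!"))
```

This keeps the re module but has to be repeated for every script whose combining marks matter, which is the "painful" part.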



1 Answer


It is a bug in the re module and it is fixed in the regex module:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata
import re
import regex  # $ pip install regex

word = "किशोरी"


def test(re_):
    # The whole word should consist of word characters only.
    assert re_.search(r"^\w+$", word, flags=re_.UNICODE)

# General category of each code point in the word
print([unicodedata.category(cp) for cp in word])
# Extended grapheme clusters (user-perceived characters), space-separated
print(" ".join(regex.findall(r"\X", word)))
assert all(regex.match(r"\w$", c) for c in ["a", "\u093f", "\u0915"])

test(regex)
test(re)  # fails: raises AssertionError

The output shows that there are 6 code points in "किशोरी", but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:

Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.

(Here and below, emphasis is mine.)
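The difference between code points and user-perceived characters can be illustrated without the regex module. The sketch below is a deliberate simplification, not the UAX #29 algorithm: it just attaches every combining mark (general category M*) to the preceding base code point, and it ignores Hangul jamo, ZWJ sequences, regional indicators and the other cases that \X handles. The helper name rough_clusters and the sample word are assumptions made for this example.

```python
import unicodedata


def rough_clusters(text):
    """Group each base code point with the combining marks that follow it.

    A simplification of extended grapheme clusters, for illustration only.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # a mark joins the cluster of its base
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


# Assumed sample word: three consonants, each followed by a vowel sign
word = "\u0915\u093f\u0936\u094b\u0930\u0940"
clusters = rough_clusters(word)
print(len(word), len(clusters))
```

For this input the six code points collapse into three clusters, each a consonant plus its vowel sign.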

A word boundary is defined in the docs as a transition from \w to \W (or the reverse):

Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, ...

Therefore either all code points that form a single character are \w or they are all \W. In this case "किशोरी" matches ^\w{6}$.


From the docs for \w in Python 2:

If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

in Python 3:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.

From regex docs:

Definition of 'word' character (issue #1693050):

The definition of a 'word' character has been expanded for Unicode. It now conforms to the Unicode specification at http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and \B.

According to unicode.org, U+093F (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic, so regex is also correct to consider it \w even if we follow definitions that are not based on word boundaries.
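If installing regex is not an option, a property-based check can be approximated with unicodedata. The sketch below is an assumption-laden stand-in, not the definition either module uses: the standard library does not expose the Alphabetic or Join_Control properties, so it falls back on general categories (letters L*, marks M*, decimal digits Nd, connector punctuation Pc). The names is_word_char and is_word are hypothetical, and the code targets Python 3.

```python
import unicodedata


def is_word_char(ch):
    """Approximate a Unicode-aware \\w using general categories only."""
    cat = unicodedata.category(ch)
    return cat[0] in "LM" or cat in ("Nd", "Pc")


def is_word(text):
    """Rough equivalent of matching ^\\w+$ under the approximation above."""
    return bool(text) and all(is_word_char(ch) for ch in text)


# Assumed sample word: Devanagari letters (Lo) alternating with vowel signs (Mc)
word = "\u0915\u093f\u0936\u094b\u0930\u0940"
print(is_word(word))
print(is_word("a b"))
```

Because marks count as word characters here, a consonant and its vowel sign are always classified together, which is consistent with the grapheme-cluster argument above.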

