regex - Regular expression to confirm whether a string is a valid Python identifier?

Question

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have the following definition for an Identifier:

Identifier --> letter{ letter| digit}

Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.

I've tried this:

if re.match('\w+(\w\d)?', i):
  return True
else:
  return False

but when I run my program every time it meets an integer it thinks that it's a valid identifier.

For example

c = 0 ;

it prints c as a valid identifier which is fine, but it also prints 0 as a valid identifier.

What am I doing wrong here?

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:34:01+0000

This question was asked 10 years ago, when Python 2 was still dominant. As many comments over the last decade have shown, my answer needed a serious update, starting with a big heads-up:

No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.

The reasons are:

As @JoeCondron pointed out, Python reserved keywords such as True, if, return, are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.
Python 3 allows non-ASCII letters and numbers in an identifier, but the Unicode categories of letters and numbers that the lexical parser accepts in a valid identifier are not the same categories matched by \d, \w, \W in the re module, as demonstrated in @martineau's counter-example and explained in great detail by @Hatshepsut's amazing research.
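
To make the second point concrete, here is a small sketch comparing a regex built from \d, \w, \W with str.isidentifier() on a few characters. The helper name regex_says_valid and the sample characters are my own picks for illustration, and the exact results depend on the Unicode tables shipped with your Python 3 build, so treat it as a demonstration rather than a specification:

```python
import re

# Regex-based check: first char is a word char that is not a digit,
# followed by word chars only, anchored at the very end of the string.
identifier_re = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

def regex_says_valid(s):
    """Return True if the regex accepts s (hypothetical helper)."""
    return identifier_re.match(s) is not None

samples = [
    "\u2118",     # SCRIPT CAPITAL P: a math symbol, yet allowed to start an identifier
    "\u00b2",     # SUPERSCRIPT TWO: counts as \w but not as \d
    "caf\u00e9",  # ordinary non-ASCII letter, both checks should agree
    "1a",         # starts with a digit, both checks should agree
]

for s in samples:
    print("%r\tregex=%s\tisidentifier=%s"
          % (s, regex_says_valid(s), s.isidentifier()))
```

On the CPython 3 versions I know of, the first two samples are where the regex and the language disagree, in opposite directions.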

While we could try to solve the first issue using keyword.iskeyword(), as @Alexander Huszagh suggested, and work around the other by limiting ourselves to ASCII-only identifiers, why bother using a regex at all?

As Hatshepsut said:

str.isidentifier() works

Just use it, problem solved.
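
One caveat: str.isidentifier() also returns True for reserved words such as if, so you probably still want the keyword check mentioned above. A minimal sketch, assuming Python 3 (the function name is_valid_identifier is just something I made up for this example):

```python
import keyword

def is_valid_identifier(name):
    """True if name could be used as a variable name in Python 3.

    str.isidentifier() covers the lexical rules (including Unicode);
    keyword.iskeyword() filters out reserved words like "if" or "True".
    """
    return name.isidentifier() and not keyword.iskeyword(name)

# The tokens from the question's example line "c = 0 ;", plus a few extras.
for token in ["c", "0", "_a1", "aa bb", "if", "aa\n"]:
    print("%r\t= %s" % (token, is_valid_identifier(token)))
```

With this, c passes and 0 is rejected, which is what the question was after. Note that this follows Python's own rules, so a leading underscore is accepted even though the question's grammar (letter{letter|digit}) does not mention it.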

As requested by the question, my original 2012 answer presents a regular expression based on Python 2's official definition of an identifier:

identifier ::=  (letter|"_") (letter | digit | "_")*

Which can be expressed by the regular expression:

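^[^\d\W]\w*\Z

Example:

import re
# First character: a word character that is not a digit (a letter or "_").
# Remaining characters: any word characters.
# \Z anchors at the very end of the string, so a trailing newline is rejected.
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

tests = [ "a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
    result = identifier.match(test)
    print("%r\t= %s" % (test, (result is not None)))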

Result:

'a'      = True
'a1'     = True
'_a1'    = True
'1a'     = False
'aa$%@%' = False
'aa bb'  = False
'aa_bb'  = True
'aa\n'   = False
