This question was asked 10 years ago, when Python 2 was still dominant. As many comments over the last decade have shown, my answer needed a serious update, starting with a big heads-up:
No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.
The reasons are:
As @JoeCondron pointed out, Python reserved keywords such as True, if, and return are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.
Python 3 allows non-ASCII letters and numbers in an identifier, but the Unicode categories of letters and numbers that the lexical parser accepts in a valid identifier are not the same as the categories matched by \d, \w, and \W in the re module, as demonstrated in @martineau's counter-example and explained in great detail by @Hatshepsut's amazing research.
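The second point can be checked directly. Below is a minimal sketch that compares the regex from my original answer (restated here so the snippet runs on its own) against str.isidentifier(). The two test strings are my own illustrative picks, not necessarily the ones used in the comments cited above, and regex_accepts is a hypothetical helper name.

```python
import re

# The regex from the original answer, restated so this snippet is self-contained.
pattern = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

def regex_accepts(name: str) -> bool:
    """Hypothetical helper: True if the regex treats `name` as an identifier."""
    return pattern.match(name) is not None

# Illustrative picks (assumptions, not taken from the cited comments):
script_p = "\u2118"    # SCRIPT CAPITAL P, a math symbol that Python accepts as an identifier start
a_squared = "a\u00b2"  # "a" followed by SUPERSCRIPT TWO

for name in (script_p, a_squared):
    print(ascii(name), "regex:", regex_accepts(name), "isidentifier:", name.isidentifier())
```

On a CPython 3 interpreter I would expect the two checks to disagree for both strings, once in each direction: the regex rejects a name Python accepts, and accepts a name Python rejects.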
While we could try to solve the first issue using keyword.iskeyword(), as @Alexander Huszagh suggested, and work around the other by limiting ourselves to ASCII-only identifiers, why bother using a regex at all?
As Hatshepsut said:
str.isidentifier() works
Just use it, problem solved.
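A minimal sketch of what "just use it" can look like, combining str.isidentifier() with the keyword.iskeyword() filter mentioned earlier. The function name is_valid_identifier is a hypothetical choice for illustration, not part of any library.

```python
import keyword

def is_valid_identifier(name: str) -> bool:
    """Hypothetical helper: a syntactically valid identifier that is not a reserved keyword."""
    # str.isidentifier() applies Python's own lexical rules, including non-ASCII letters;
    # keyword.iskeyword() filters out reserved words such as "if" and "True".
    return name.isidentifier() and not keyword.iskeyword(name)

for candidate in ["a1", "_a1", "1a", "if", "True", "aa\n"]:
    print("%r = %s" % (candidate, is_valid_identifier(candidate)))
```

Note that keyword.iskeyword() only covers reserved keywords. Whether soft keywords should also be excluded depends on what the names will be used for, so this sketch leaves them alone.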
As requested by the question, my original 2012 answer presents a regular expression based on Python 2's official definition of an identifier:
identifier ::= (letter|"_") (letter | digit | "_")*
Which can be expressed by the regular expression:
^[^\d\W]\w*\Z
Example:
import re

# \Z anchors at the very end of the string, so a trailing newline is rejected
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

tests = ["a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n"]
for test in tests:
    result = identifier.match(test)
    print("%r = %s" % (test, (result is not None)))
Result:
'a' = True
'a1' = True
'_a1' = True
'1a' = False
'aa$%@%' = False
'aa bb' = False
'aa_bb' = True
'aa\n' = False
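If a regex really is required, the two workarounds mentioned at the top (filtering keywords and limiting to ASCII-only identifiers) can be combined. The following is a sketch under that assumption; is_ascii_identifier is a hypothetical name, and the function is deliberately narrower than Python 3's real rules, since it rejects valid identifiers that contain non-ASCII letters.

```python
import keyword
import re

# re.ASCII restricts \w, \W and \d to ASCII, so the pattern is equivalent to
# "an ASCII letter or underscore, followed by ASCII letters, digits or underscores".
ascii_identifier = re.compile(r"^[^\d\W]\w*\Z", re.ASCII)

def is_ascii_identifier(name: str) -> bool:
    """Hypothetical helper: ASCII-only identifier that is not a reserved keyword."""
    return ascii_identifier.match(name) is not None and not keyword.iskeyword(name)

for candidate in ["aa_bb", "1a", "return", "\u00f1", "aa\n"]:
    print("%r = %s" % (candidate, is_ascii_identifier(candidate)))
```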