The charset doesn't matter; what matters is which characters are allowed. Check the CSS specification. Here's the relevant passage:
In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".
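To see why the two escaped spellings in that example ("B\&W\?" and "B\26 W\3F") name the same identifier, here is a minimal Java sketch that decodes CSS escapes back to plain characters. The class name CssUnescape and the method unescape are made up for illustration and are not part of any library; the sketch assumes the input is already a well-formed identifier and does not guard against hex values above U+10FFFF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssUnescape {
    // A backslash followed by either 1-6 hex digits plus one optional
    // whitespace terminator, or any single character except a newline or form feed.
    private static final Pattern ESCAPE = Pattern.compile(
        "\\\\(?:([0-9a-fA-F]{1,6})(?:\\r\\n|[ \\t\\r\\n\\f])?|([^\\r\\n\\f]))");

    // Hypothetical helper: replaces each escape with the character it stands for.
    static String unescape(String ident) {
        Matcher m = ESCAPE.matcher(ident);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(ident, last, m.start());
            if (m.group(1) != null) {
                // Numeric escape: the hex digits are the code point.
                out.appendCodePoint(Integer.parseInt(m.group(1), 16));
            } else {
                // Character escape: the backslash is simply dropped.
                out.append(m.group(2));
            }
            last = m.end();
        }
        out.append(ident, last, ident.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("B\\&W\\?"));    // prints B&W?
        System.out.println(unescape("B\\26 W\\3F")); // prints B&W?
    }
}
```

Note that the space after \26 is consumed as the terminator of the numeric escape, which is why it does not show up in the decoded result.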
Update: As to the regex question, you can find the grammar here:
ident -?{nmstart}{nmchar}*
It consists of these parts:
nmstart [_a-z]|{nonascii}|{escape}
nmchar [_a-z0-9-]|{nonascii}|{escape}
nonascii [\240-\377]
escape {unicode}|\\[^\r\n\f0-9a-f]
unicode \\{h}{1,6}(\r\n|[ \t\r\n\f])?
h [0-9a-f]
This can be translated to a Java regex as follows (I only added parentheses to parts containing the OR and escaped the backslashes):
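// Build the "ident" regex bottom-up from the grammar macros.
String h = "[0-9a-f]";
// A literal backslash in the regex is written as four backslashes in Java source;
// regex escapes such as \r, \n, \t and \f need two.
String unicode = "\\\\{h}{1,6}(\\r\\n|[ \\t\\r\\n\\f])?".replace("{h}", h);
String escape = "({unicode}|\\\\[^\\r\\n\\f0-9a-f])".replace("{unicode}", unicode);
// Java octal string escapes: the class holds the literal characters U+00A0 to U+00FF.
String nonascii = "[\240-\377]";
String nmchar = "([_a-z0-9-]|{nonascii}|{escape})".replace("{nonascii}", nonascii).replace("{escape}", escape);
String nmstart = "([_a-z]|{nonascii}|{escape})".replace("{nonascii}", nonascii).replace("{escape}", escape);
String ident = "-?{nmstart}{nmchar}*".replace("{nmstart}", nmstart).replace("{nmchar}", nmchar);
System.out.println(ident); // The full regex.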
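For completeness, here is a rough sketch of how that regex could be used to validate candidate identifiers. The class name CssIdentCheck and the helpers buildIdentPattern and isIdent are invented for this example. It assumes every backslash meant for the regex engine is doubled in the Java source, and it compiles with Pattern.CASE_INSENSITIVE because the grammar only lists lowercase letters. The nonascii class is taken from the grammar as-is, so it only covers U+00A0 to U+00FF.

```java
import java.util.regex.Pattern;

public class CssIdentCheck {
    // Hypothetical helper: assembles the "ident" regex from the grammar macros.
    static Pattern buildIdentPattern() {
        String h = "[0-9a-f]";
        String unicode = "\\\\{h}{1,6}(\\r\\n|[ \\t\\r\\n\\f])?".replace("{h}", h);
        String escape = "({unicode}|\\\\[^\\r\\n\\f0-9a-f])".replace("{unicode}", unicode);
        // Java octal string escapes: literal characters U+00A0 to U+00FF.
        String nonascii = "[\240-\377]";
        String nmchar = "([_a-z0-9-]|{nonascii}|{escape})"
            .replace("{nonascii}", nonascii).replace("{escape}", escape);
        String nmstart = "([_a-z]|{nonascii}|{escape})"
            .replace("{nonascii}", nonascii).replace("{escape}", escape);
        String ident = "-?{nmstart}{nmchar}*"
            .replace("{nmstart}", nmstart).replace("{nmchar}", nmchar);
        return Pattern.compile(ident, Pattern.CASE_INSENSITIVE);
    }

    private static final Pattern IDENT = buildIdentPattern();

    // True only if the whole string is a single identifier.
    static boolean isIdent(String s) {
        return IDENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdent("foo"));          // true
        System.out.println(isIdent("-foo"));         // true
        System.out.println(isIdent("1foo"));         // false: starts with a digit
        System.out.println(isIdent("--foo"));        // false: starts with two hyphens
        System.out.println(isIdent("-1foo"));        // false: hyphen followed by a digit
        System.out.println(isIdent("B&W?"));         // false: unescaped & and ?
        System.out.println(isIdent("B\\26 W\\3F"));  // true: escaped form
    }
}
```

Using matches() rather than find() matters here: find() would happily report a valid identifier somewhere inside an otherwise invalid string.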
Update 2: Oh, you're more of a PHP'er. Well, I think you can figure out how and where to use str_replace?