Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

regex - Regular expression for checking if capital letters are found consecutively in a string?

I want to know the regexp for the following case:

The string should contain only alphabetic letters. It must start with a capital letter followed by a lowercase letter. After that, it can contain lowercase or capital letters.

^[A-Z][a-z][A-Za-z]*$

But the string must also not contain any consecutive capital letters. How do I add that logic to the regexp?

That is, HttpHandler is correct, but HTTPHandler is wrong.


1 Answer

Whenever one writes [A-Z] or [a-z], one explicitly commits to processing nothing but 7-bit ASCII data from the 1960s. If that’s really OK, then fine. But if it’s not OK, then Unicode character properties exist to help you with handling modern character data.

There are three cases in Unicode, not two. Furthermore, you also have noncased letters. Letters in general are specified by the \pL property, and each of these also belongs to exactly one of five subcategories:

  1. uppercase letters, specified with \p{Lu}; e.g. AÇǱÞΣSSὩΙST
  2. titlecase letters, specified with \p{Lt}; e.g. ǈǲSsᾨSt (strictly, Ss and St are each an uppercase letter followed by a lowercase one, but they are what you get if you ask for the titlecase of ß and ﬅ, respectively)
  3. lowercase letters, specified with \p{Ll}; e.g. aαçǳςσþßᾡﬅ
  4. modifier letters, specified with \p{Lm}; e.g. ʰʲᴴᴭʺˈˠᵠꜞ
  5. other letters, specified with \p{Lo}; e.g. ƻאあᚦ京

You can take the complement of any of these, but do be careful, because something like \P{Lu} does not mean a letter that isn’t uppercase! It means any character that isn’t an uppercase letter.
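To make the \P{Lu} pitfall concrete, here is a minimal sketch in Java, chosen only because java.util.regex understands \p{...} directly. The class name LetterCategories and the helper is() are hypothetical, and the intersection syntax [\p{L}&&[^\p{Lu}]] is specific to Java; other regex flavors spell "a letter that is not uppercase" differently.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
final class LetterCategories {
    // An uppercase letter (general category Lu).
    static final Pattern UPPER = Pattern.compile("\\p{Lu}");
    // Any character that is NOT an uppercase letter, digits and punctuation included.
    static final Pattern NOT_UPPER = Pattern.compile("\\P{Lu}");
    // A letter that is not uppercase: \p{L} intersected with the complement of \p{Lu}.
    static final Pattern NON_UPPER_LETTER = Pattern.compile("[\\p{L}&&[^\\p{Lu}]]");

    // True when the whole string matches the pattern.
    static boolean is(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(is(UPPER, "A"));            // true
        System.out.println(is(NOT_UPPER, "7"));        // true: a digit is "not an uppercase letter"
        System.out.println(is(NON_UPPER_LETTER, "7")); // false: not a letter at all
        System.out.println(is(NON_UPPER_LETTER, "a")); // true
    }
}
```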

For a letter that’s either of uppercase or titlecase, use [\p{Lu}\p{Lt}]. So you could use for your pattern:

 ^([\p{Lu}\p{Lt}]\p{Ll}+)+$
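As a rough check of that pattern against the two strings from the question, here is a small Java sketch. The class and method names are made up for illustration; only the regular expression comes from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the answer's first pattern.
final class NoConsecutiveCapitals {
    // One or more "words": an uppercase or titlecase letter, then one or more lowercase letters.
    static final Pattern WORDS = Pattern.compile("^([\\p{Lu}\\p{Lt}]\\p{Ll}+)+$");

    static boolean isValid(String s) {
        return WORDS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("HttpHandler"));  // true
        System.out.println(isValid("HTTPHandler"));  // false: consecutive capitals
        System.out.println(isValid("\u00C9lan"));    // true: É is an uppercase letter outside ASCII
        System.out.println(isValid("HttpH"));        // false: see the note below
    }
}
```

Because every capital must be followed by at least one lowercase letter, this pattern also rejects a string that merely ends in a single capital, such as HttpH, which is slightly stricter than "no two capitals in a row".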

If you don’t mean to limit the letters following the first to the “casing” letters alone, then you might prefer:

 ^([\p{Lu}\p{Lt}][\p{Ll}\p{Lm}\p{Lo}]+)+$
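The difference between the stricter and the more permissive pattern only shows up on noncased letters. The following Java sketch compares the two on a string that ends in the CJK character 京 (a \p{Lo} letter); the class name, the field names and the sample string are assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the two patterns from the answer.
final class CasedVersusNoncased {
    // Only lowercase letters may follow each capital.
    static final Pattern STRICT = Pattern.compile("^([\\p{Lu}\\p{Lt}]\\p{Ll}+)+$");
    // Lowercase, modifier and other (noncased) letters may follow each capital.
    static final Pattern RELAXED =
            Pattern.compile("^([\\p{Lu}\\p{Lt}][\\p{Ll}\\p{Lm}\\p{Lo}]+)+$");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String sample = "Tokyo\u4EAC"; // "Tokyo" followed by the Lo letter 京
        System.out.println(matches(STRICT, sample));         // false: 京 is not lowercase
        System.out.println(matches(RELAXED, sample));        // true
        System.out.println(matches(RELAXED, "HTTPHandler")); // false: still no consecutive capitals
    }
}
```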

If you’re trying to match so-called “CamelCase” identifiers, then the actual rules depend on the programming language, but usually include the underscore character and the decimal numbers (\p{Nd}), and may also include a literal dollar sign and other language-dependent characters. If so, you may wish to add some of these to one or the other of the two character classes provided above.

For example, you may wish to add underscore to both but digits only to the second, leaving you with:

 ^([_\p{Lu}\p{Lt}][_\p{Nd}\p{Ll}\p{Lm}\p{Lo}]+)+$
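Here is how the identifier-style pattern behaves on a few sample names, again as a Java sketch with a hypothetical class name. The sample identifiers are invented for illustration, and real languages may have different identifier rules.

```java
import java.util.regex.Pattern;

// Hypothetical checker built on the identifier-style pattern above.
final class CamelCaseIdentifier {
    // Underscore allowed in both classes; decimal digits only after the leading character.
    static final Pattern IDENT =
            Pattern.compile("^([_\\p{Lu}\\p{Lt}][_\\p{Nd}\\p{Ll}\\p{Lm}\\p{Lo}]+)+$");

    static boolean isValid(String s) {
        return IDENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Utf8Decoder")); // true: a digit may follow the capital
        System.out.println(isValid("_private"));    // true: underscore may lead
        System.out.println(isValid("UTF8Decoder")); // false: consecutive capitals
        System.out.println(isValid("2Fast"));       // false: a digit may not lead
    }
}
```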

If, though, you are dealing with certain “words” from various RFCs and ISO standards, these are often specified as containing ASCII only. If so, you can get by with the literal [A-Z] idea. It’s just not kind to impose that restriction if it doesn’t actually exist.
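For the ASCII-only case, two possible spellings are sketched below in Java. Neither appears in the answer as written: the first simply substitutes [A-Z] and [a-z] into its first pattern, and the second keeps the asker's original expression and adds a negative lookahead, which is a swapped-in technique. The class and field names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical ASCII-only variants; neither pattern is quoted from the answer.
final class AsciiCamelCase {
    // ASCII analogue of the answer's first pattern: each capital is followed by lowercase letters.
    static final Pattern WORDS = Pattern.compile("^([A-Z][a-z]+)+$");
    // The asker's pattern plus a lookahead that forbids two capitals in a row anywhere.
    static final Pattern LOOKAHEAD = Pattern.compile("^(?!.*[A-Z]{2})[A-Z][a-z][A-Za-z]*$");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(WORDS, "HttpHandler"));     // true
        System.out.println(matches(WORDS, "HTTPHandler"));     // false
        System.out.println(matches(LOOKAHEAD, "HttpHandler")); // true
        System.out.println(matches(LOOKAHEAD, "HTTPHandler")); // false
        System.out.println(matches(WORDS, "HttpH"));           // false: trailing capital rejected
        System.out.println(matches(LOOKAHEAD, "HttpH"));       // true: a single trailing capital is allowed
    }
}
```

The two differ only on strings that end in a single capital, so which one fits depends on whether such strings should count as valid.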

