Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - How to validate a unicode email?

Background:

In October 2009, the Internet Corporation for Assigned Names and Numbers (ICANN) approved the creation of country code top-level domains (ccTLDs) in the Internet that use the IDNA standard for native language scripts.

In the past, we only had to validate a-zA-Z. Now I want to validate a Unicode email address, such as the Chinese address 我@在.中国, or addresses in other languages. How can I validate them with a RegExp?

1 Answer

Here is a validation regex that I wrote for maximum Unicode support and reasonably good overall adherence to RFC standards.

JS:

/^(?!\.)((?!.*\.{2})[a-zA-Z0-9\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u0250-\u02AF\u0300-\u036F\u0370-\u03FF\u0400-\u04FF\u0500-\u052F\u0530-\u058F\u0590-\u05FF\u0600-\u06FF\u0700-\u074F\u0750-\u077F\u0780-\u07BF\u07C0-\u07FF\u0900-\u097F\u0980-\u09FF\u0A00-\u0A7F\u0A80-\u0AFF\u0B00-\u0B7F\u0B80-\u0BFF\u0C00-\u0C7F\u0C80-\u0CFF\u0D00-\u0D7F\u0D80-\u0DFF\u0E00-\u0E7F\u0E80-\u0EFF\u0F00-\u0FFF\u1000-\u109F\u10A0-\u10FF\u1100-\u11FF\u1200-\u137F\u1380-\u139F\u13A0-\u13FF\u1400-\u167F\u1680-\u169F\u16A0-\u16FF\u1700-\u171F\u1720-\u173F\u1740-\u175F\u1760-\u177F\u1780-\u17FF\u1800-\u18AF\u1900-\u194F\u1950-\u197F\u1980-\u19DF\u19E0-\u19FF\u1A00-\u1A1F\u1B00-\u1B7F\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u1E00-\u1EFF\u1F00-\u1FFF\u20D0-\u20FF\u2100-\u214F\u2C00-\u2C5F\u2C60-\u2C7F\u2C80-\u2CFF\u2D00-\u2D2F\u2D30-\u2D7F\u2D80-\u2DDF\u2F00-\u2FDF\u2FF0-\u2FFF\u3040-\u309F\u30A0-\u30FF\u3100-\u312F\u3130-\u318F\u3190-\u319F\u31C0-\u31EF\u31F0-\u31FF\u3200-\u32FF\u3300-\u33FF\u3400-\u4DBF\u4DC0-\u4DFF\u4E00-\u9FFF\uA000-\uA48F\uA490-\uA4CF\uA700-\uA71F\uA800-\uA82F\uA840-\uA87F\uAC00-\uD7AF\uF900-\uFAFF.!#$%&'*+-\/=?^_`{|}~\-\d]+)@(?!\.)([a-zA-Z0-9\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u0250-\u02AF\u0300-\u036F\u0370-\u03FF\u0400-\u04FF\u0500-\u052F\u0530-\u058F\u0590-\u05FF\u0600-\u06FF\u0700-\u074F\u0750-\u077F\u0780-\u07BF\u07C0-\u07FF\u0900-\u097F\u0980-\u09FF\u0A00-\u0A7F\u0A80-\u0AFF\u0B00-\u0B7F\u0B80-\u0BFF\u0C00-\u0C7F\u0C80-\u0CFF\u0D00-\u0D7F\u0D80-\u0DFF\u0E00-\u0E7F\u0E80-\u0EFF\u0F00-\u0FFF\u1000-\u109F\u10A0-\u10FF\u1100-\u11FF\u1200-\u137F\u1380-\u139F\u13A0-\u13FF\u1400-\u167F\u1680-\u169F\u16A0-\u16FF\u1700-\u171F\u1720-\u173F\u1740-\u175F\u1760-\u177F\u1780-\u17FF\u1800-\u18AF\u1900-\u194F\u1950-\u197F\u1980-\u19DF\u19E0-\u19FF\u1A00-\u1A1F\u1B00-\u1B7F\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u1E00-\u1EFF\u1F00-\u1FFF\u20D0-\u20FF\u2100-\u214F\u2C00-\u2C5F\u2C60-\u2C7F\u2C80-\u2CFF\u2D00-\u2D2F\u2D30-\u2D7F\u2D80-\u2DDF\u2F00-\u2FDF\u2FF0-\u2FFF\u3040-\u309F\u30A0-\u30FF\u3100-\u312F\u3130-\u318F\u3190-\u319F\u31C0-\u31EF\u31F0-\u31FF\u3200-\u32FF\u3300-\u33FF\u3400-\u4DBF\u4DC0-\u4DFF\u4E00-\u9FFF\uA000-\uA48F\uA490-\uA4CF\uA700-\uA71F\uA800-\uA82F\uA840-\uA87F\uAC00-\uD7AF\uF900-\uFAFF\-\.\d]+)((\.([a-zA-Z\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u0250-\u02AF\u0300-\u036F\u0370-\u03FF\u0400-\u04FF\u0500-\u052F\u0530-\u058F\u0590-\u05FF\u0600-\u06FF\u0700-\u074F\u0750-\u077F\u0780-\u07BF\u07C0-\u07FF\u0900-\u097F\u0980-\u09FF\u0A00-\u0A7F\u0A80-\u0AFF\u0B00-\u0B7F\u0B80-\u0BFF\u0C00-\u0C7F\u0C80-\u0CFF\u0D00-\u0D7F\u0D80-\u0DFF\u0E00-\u0E7F\u0E80-\u0EFF\u0F00-\u0FFF\u1000-\u109F\u10A0-\u10FF\u1100-\u11FF\u1200-\u137F\u1380-\u139F\u13A0-\u13FF\u1400-\u167F\u1680-\u169F\u16A0-\u16FF\u1700-\u171F\u1720-\u173F\u1740-\u175F\u1760-\u177F\u1780-\u17FF\u1800-\u18AF\u1900-\u194F\u1950-\u197F\u1980-\u19DF\u19E0-\u19FF\u1A00-\u1A1F\u1B00-\u1B7F\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u1E00-\u1EFF\u1F00-\u1FFF\u20D0-\u20FF\u2100-\u214F\u2C00-\u2C5F\u2C60-\u2C7F\u2C80-\u2CFF\u2D00-\u2D2F\u2D30-\u2D7F\u2D80-\u2DDF\u2F00-\u2FDF\u2FF0-\u2FFF\u3040-\u309F\u30A0-\u30FF\u3100-\u312F\u3130-\u318F\u3190-\u319F\u31C0-\u31EF\u31F0-\u31FF\u3200-\u32FF\u3300-\u33FF\u3400-\u4DBF\u4DC0-\u4DFF\u4E00-\u9FFF\uA000-\uA48F\uA490-\uA4CF\uA700-\uA71F\uA800-\uA82F\uA840-\uA87F\uAC00-\uD7AF\uF900-\uFAFF]){2,63})+)$/i

PHP:

/^(?!\.)((?!.*\.{2})[a-zA-Z0-9\x{0080}-\x{00FF}\x{0100}-\x{017F}\x{0180}-\x{024F}\x{0250}-\x{02AF}\x{0300}-\x{036F}\x{0370}-\x{03FF}\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{0530}-\x{058F}\x{0590}-\x{05FF}\x{0600}-\x{06FF}\x{0700}-\x{074F}\x{0750}-\x{077F}\x{0780}-\x{07BF}\x{07C0}-\x{07FF}\x{0900}-\x{097F}\x{0980}-\x{09FF}\x{0A00}-\x{0A7F}\x{0A80}-\x{0AFF}\x{0B00}-\x{0B7F}\x{0B80}-\x{0BFF}\x{0C00}-\x{0C7F}\x{0C80}-\x{0CFF}\x{0D00}-\x{0D7F}\x{0D80}-\x{0DFF}\x{0E00}-\x{0E7F}\x{0E80}-\x{0EFF}\x{0F00}-\x{0FFF}\x{1000}-\x{109F}\x{10A0}-\x{10FF}\x{1100}-\x{11FF}\x{1200}-\x{137F}\x{1380}-\x{139F}\x{13A0}-\x{13FF}\x{1400}-\x{167F}\x{1680}-\x{169F}\x{16A0}-\x{16FF}\x{1700}-\x{171F}\x{1720}-\x{173F}\x{1740}-\x{175F}\x{1760}-\x{177F}\x{1780}-\x{17FF}\x{1800}-\x{18AF}\x{1900}-\x{194F}\x{1950}-\x{197F}\x{1980}-\x{19DF}\x{19E0}-\x{19FF}\x{1A00}-\x{1A1F}\x{1B00}-\x{1B7F}\x{1D00}-\x{1D7F}\x{1D80}-\x{1DBF}\x{1DC0}-\x{1DFF}\x{1E00}-\x{1EFF}\x{1F00}-\x{1FFF}\x{20D0}-\x{20FF}\x{2100}-\x{214F}\x{2C00}-\x{2C5F}\x{2C60}-\x{2C7F}\x{2C80}-\x{2CFF}\x{2D00}-\x{2D2F}\x{2D30}-\x{2D7F}\x{2D80}-\x{2DDF}\x{2F00}-\x{2FDF}\x{2FF0}-\x{2FFF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}\x{3100}-\x{312F}\x{3130}-\x{318F}\x{3190}-\x{319F}\x{31C0}-\x{31EF}\x{31F0}-\x{31FF}\x{3200}-\x{32FF}\x{3300}-\x{33FF}\x{3400}-\x{4DBF}\x{4DC0}-\x{4DFF}\x{4E00}-\x{9FFF}\x{A000}-\x{A48F}\x{A490}-\x{A4CF}\x{A700}-\x{A71F}\x{A800}-\x{A82F}\x{A840}-\x{A87F}\x{AC00}-\x{D7AF}\x{F900}-\x{FAFF}.!#$%&'*+-\/=?^_`{|}~\-\d]+)@(?!\.)([a-zA-Z0-9\x{0080}-\x{00FF}\x{0100}-\x{017F}\x{0180}-\x{024F}\x{0250}-\x{02AF}\x{0300}-\x{036F}\x{0370}-\x{03FF}\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{0530}-\x{058F}\x{0590}-\x{05FF}\x{0600}-\x{06FF}\x{0700}-\x{074F}\x{0750}-\x{077F}\x{0780}-\x{07BF}\x{07C0}-\x{07FF}\x{0900}-\x{097F}\x{0980}-\x{09FF}\x{0A00}-\x{0A7F}\x{0A80}-\x{0AFF}\x{0B00}-\x{0B7F}\x{0B80}-\x{0BFF}\x{0C00}-\x{0C7F}\x{0C80}-\x{0CFF}\x{0D00}-\x{0D7F}\x{0D80}-\x{0DFF}\x{0E00}-\x{0E7F}\x{0E80}-\x{0EFF}\x{0F00}-\x{0FFF}\x{1000}-\x{109F}\x{10A0}-\x{10FF}\x{1100}-\x{11FF}\x{1200}-\x{137F}\x{1380}-\x{139F}\x{13A0}-\x{13FF}\x{1400}-\x{167F}\x{1680}-\x{169F}\x{16A0}-\x{16FF}\x{1700}-\x{171F}\x{1720}-\x{173F}\x{1740}-\x{175F}\x{1760}-\x{177F}\x{1780}-\x{17FF}\x{1800}-\x{18AF}\x{1900}-\x{194F}\x{1950}-\x{197F}\x{1980}-\x{19DF}\x{19E0}-\x{19FF}\x{1A00}-\x{1A1F}\x{1B00}-\x{1B7F}\x{1D00}-\x{1D7F}\x{1D80}-\x{1DBF}\x{1DC0}-\x{1DFF}\x{1E00}-\x{1EFF}\x{1F00}-\x{1FFF}\x{20D0}-\x{20FF}\x{2100}-\x{214F}\x{2C00}-\x{2C5F}\x{2C60}-\x{2C7F}\x{2C80}-\x{2CFF}\x{2D00}-\x{2D2F}\x{2D30}-\x{2D7F}\x{2D80}-\x{2DDF}\x{2F00}-\x{2FDF}\x{2FF0}-\x{2FFF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}\x{3100}-\x{312F}\x{3130}-\x{318F}\x{3190}-\x{319F}\x{31C0}-\x{31EF}\x{31F0}-\x{31FF}\x{3200}-\x{32FF}\x{3300}-\x{33FF}\x{3400}-\x{4DBF}\x{4DC0}-\x{4DFF}\x{4E00}-\x{9FFF}\x{A000}-\x{A48F}\x{A490}-\x{A4CF}\x{A700}-\x{A71F}\x{A800}-\x{A82F}\x{A840}-\x{A87F}\x{AC00}-\x{D7AF}\x{F900}-\x{FAFF}\-\.\d]+)((\.([a-zA-Z\x{0080}-\x{00FF}\x{0100}-\x{017F}\x{0180}-\x{024F}\x{0250}-\x{02AF}\x{0300}-\x{036F}\x{0370}-\x{03FF}\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{0530}-\x{058F}\x{0590}-\x{05FF}\x{0600}-\x{06FF}\x{0700}-\x{074F}\x{0750}-\x{077F}\x{0780}-\x{07BF}\x{07C0}-\x{07FF}\x{0900}-\x{097F}\x{0980}-\x{09FF}\x{0A00}-\x{0A7F}\x{0A80}-\x{0AFF}\x{0B00}-\x{0B7F}\x{0B80}-\x{0BFF}\x{0C00}-\x{0C7F}\x{0C80}-\x{0CFF}\x{0D00}-\x{0D7F}\x{0D80}-\x{0DFF}\x{0E00}-\x{0E7F}\x{0E80}-\x{0EFF}\x{0F00}-\x{0FFF}\x{1000}-\x{109F}\x{10A0}-\x{10FF}\x{1100}-\x{11FF}\x{1200}-\x{137F}\x{1380}-\x{139F}\x{13A0}-\x{13FF}\x{1400}-\x{167F}\x{1680}-\x{169F}\x{16A0}-\x{16FF}\x{1700}-\x{171F}\x{1720}-\x{173F}\x{1740}-\x{175F}\x{1760}-\x{177F}\x{1780}-\x{17FF}\x{1800}-\x{18AF}\x{1900}-\x{194F}\x{1950}-\x{197F}\x{1980}-\x{19DF}\x{19E0}-\x{19FF}\x{1A00}-\x{1A1F}\x{1B00}-\x{1B7F}\x{1D00}-\x{1D7F}\x{1D80}-\x{1DBF}\x{1DC0}-\x{1DFF}\x{1E00}-\x{1EFF}\x{1F00}-\x{1FFF}\x{20D0}-\x{20FF}\x{2100}-\x{214F}\x{2C00}-\x{2C5F}\x{2C60}-\x{2C7F}\x{2C80}-\x{2CFF}\x{2D00}-\x{2D2F}\x{2D30}-\x{2D7F}\x{2D80}-\x{2DDF}\x{2F00}-\x{2FDF}\x{2FF0}-\x{2FFF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}\x{3100}-\x{312F}\x{3130}-\x{318F}\x{3190}-\x{319F}\x{31C0}-\x{31EF}\x{31F0}-\x{31FF}\x{3200}-\x{32FF}\x{3300}-\x{33FF}\x{3400}-\x{4DBF}\x{4DC0}-\x{4DFF}\x{4E00}-\x{9FFF}\x{A000}-\x{A48F}\x{A490}-\x{A4CF}\x{A700}-\x{A71F}\x{A800}-\x{A82F}\x{A840}-\x{A87F}\x{AC00}-\x{D7AF}\x{F900}-\x{FAFF}]){2,63})+)$/u

Besides RFC rules, the Unicode part above consists of a series of ranges of character subsets. This is done so that only real letters and numbers enter the validation, while non-Latin punctuation and miscellaneous Unicode characters are rejected.
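
Because the same range list is repeated three times, the pattern can be hard to read as one literal. As a rough sketch only, the same idea can be expressed by building the character class from a list of blocks. The `BLOCKS` array here is a hypothetical four-block subset picked for illustration, not the full list from the regex above; the names `esc`, `UNICODE`, and `emailRe` are assumptions of this example, and the domain grouping and the set of ASCII symbols are simplified.

```javascript
// Hypothetical subset of the Unicode blocks used in the answer's regex.
const BLOCKS = [
  [0x0080, 0x00ff], // Latin-1 Supplement
  [0x0400, 0x04ff], // Cyrillic
  [0x3040, 0x309f], // Hiragana
  [0x4e00, 0x9fff], // CJK Unified Ideographs
];

// Turn a code point into a \uXXXX escape usable inside a character class.
const esc = (cp) => "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");

// "\u0080-\u00FF\u0400-\u04FF..." built once and reused in every part.
const UNICODE = BLOCKS.map(([lo, hi]) => `${esc(lo)}-${esc(hi)}`).join("");

// Simplified parts: letters, digits, the listed blocks, and some ASCII symbols.
const localPart = `[a-zA-Z0-9${UNICODE}.!#$%&*+=?^_\`{|}~-]+`;
const domainLabel = `[a-zA-Z0-9${UNICODE}-]+`;
const tld = `[a-zA-Z${UNICODE}]{2,63}`;

// No leading dot, no consecutive dots, then local@label(.label)*.tld
const emailRe = new RegExp(
  `^(?!\\.)(?!.*\\.{2})${localPart}@(?!\\.)${domainLabel}(?:\\.${domainLabel})*\\.${tld}$`
);

console.log(emailRe.test("我@在.中国"));   // letters from a listed block
console.log(emailRe.test("我@在。中国"));  // ideographic full stop is outside the listed blocks
```

Adding or removing a script then means editing one array entry instead of three copies of the range list.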

The main things that don't validate correctly at this time are IP addresses in place of domain names, comments inside the "local" part, apostrophes, and forward slashes. I've never seen anyone use the latter two, so I didn't bother bloating the regex to support them.
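
One of these gaps, an IP address in place of a domain name, could be covered by a separate check that runs before the main regex. This is a minimal sketch, assuming the bracketed IPv4 form such as `user@[192.168.2.1]`; the names `IPV4_LITERAL` and `isIPv4DomainLiteral` are hypothetical, and IPv6 literals are deliberately left out.

```javascript
// Matches a bracketed IPv4 literal such as [192.168.2.1].
// Assumption: only IPv4 is handled; IPv6 literals are out of scope here.
const IPV4_LITERAL = /^\[(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\]$/;

function isIPv4DomainLiteral(domain) {
  const m = IPV4_LITERAL.exec(domain);
  // Each captured octet must also be in the 0-255 range.
  return m !== null && m.slice(1).every((octet) => Number(octet) <= 255);
}

console.log(isIPv4DomainLiteral("[192.168.2.1]"));
console.log(isIPv4DomainLiteral("example.com"));
```

A caller could split the address at the last `@`, try this check on the domain part first, and fall back to the main regex when it returns false.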

You can find a live demo here: http://jsfiddle.net/aossikine/qCLVH/3/

Here is a breakdown of characters allowed by RFC standards and whether they are supported by this regex:

  • a-zA-Z0-9
  • !#$%&'*+-/=?^_`{|}~
  • (),:;<>@[\] (must be between quotation marks)
    • this one is not yet implemented
  • . (period cannot be the first or last character, shouldn't appear consecutively)
  • \ (must be preceded by a backslash)
    • this one is not yet implemented
  • " (must be preceded by a backslash)
    • this one is not yet implemented
  • strip out anything surrounded by parentheses from the start or end of the local part
    • this one is not yet implemented
  • domain name can only contain letters, numbers, and dashes (and dashes may be consecutive)
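
The comment-stripping rule in the list is marked as not yet implemented. One possible way to approach it is to preprocess the address before it reaches the regex. The sketch below is an assumption-laden illustration, not part of the original answer: `stripLocalComments` is a hypothetical helper, it removes at most one non-nested parenthesized comment from each end of the local part, and it does not handle quoted strings.

```javascript
// Removes a parenthesized comment from the start and/or end of the local part.
// Assumptions: comments are not nested, and the last "@" separates local part and domain.
function stripLocalComments(address) {
  const at = address.lastIndexOf("@");
  if (at < 0) return address; // no "@": leave the input untouched
  const local = address
    .slice(0, at)
    .replace(/^\([^()]*\)|\([^()]*\)$/g, "");
  return local + address.slice(at);
}

console.log(stripLocalComments("john.smith(comment)@example.com"));
console.log(stripLocalComments("(comment)john.smith@example.com"));
```

Parentheses in the middle of the local part are left alone, so such an address would still be rejected by the regex afterwards.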

For details on RFC standards, see http://en.wikipedia.org/wiki/E-mail_address#Syntax. Most of the logic detailed there is supported. The JSFiddle link above includes some additional documentation and additional links to handy sites.
