Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - PHP: How to match a range of unicode paired surrogates emoticons/emoji?

anubhava's answer about matching ranges of Unicode characters led me to the regex for cleaning up a specific range of single-code-point characters. With it, I can now match all the miscellaneous symbols in this list (which includes emoticons) with this simple expression:

preg_replace('/[\x{2600}-\x{26FF}]/u', '', $str);

However, I also want to match those in this list of paired/double surrogates emoji, but as nhahtdh explained in a comment:

There is a range from d800 to dfff to specify surrogates in UTF-16 to allow for more characters to be specified. A single surrogate is not a valid character in UTF-16 (a pair is necessary to specify a valid character).
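The arithmetic behind such a pair can be sketched as follows. The helper name toSurrogatePair is hypothetical, introduced here only for illustration; it applies the standard UTF-16 rule for code points above U+FFFF and is not something PHP or PCRE provides.

```php
<?php
// Hypothetical helper (not from the original post): split a supplementary
// code point (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
function toSurrogatePair(int $codePoint): array
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not a supplementary code point');
    }
    $offset = $codePoint - 0x10000;        // 20-bit value
    $high   = 0xD800 + ($offset >> 10);    // top 10 bits
    $low    = 0xDC00 + ($offset & 0x3FF);  // bottom 10 bits
    return [$high, $low];
}

list($high, $low) = toSurrogatePair(0x1F600);
printf("U+1F600 => %04X %04X\n", $high, $low); // prints "U+1F600 => D83D DE00"
```

Both halves fall inside the d800 to dfff range mentioned in the comment, which is why neither is a valid character on its own.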

So, for example, when I try this:

preg_replace('/\x{D83D}\x{DE00}/u', '', $str);

to replace only the first of the paired-surrogate emoji on that list, i.e. 😀 (U+1F600),

PHP throws this:

preg_replace(): Compilation failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)

I have tried several different combinations, including the supposed UTF-8 byte sequence for 😀 ('/[\x{00F0}\x{009F}\x{0098}\x{0080}]/u'), but I was still unable to match it. I also looked into other PCRE pattern modifiers, but it seems u is the only one that enables UTF-8 handling.
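A small check of why the byte-wise attempt cannot match, sketched under the assumption of a PCRE build with UTF-8 support (which the u modifier requires anyway). With u, each \x{...} names a whole code point, so \x{00F0} means U+00F0 rather than the byte 0xF0. The variable names are illustrative only.

```php
<?php
$emoji = "\xF0\x9F\x98\x80";   // UTF-8 bytes of U+1F600, written out explicitly

echo bin2hex($emoji), "\n";    // f09f9880

// Byte values read as code points: U+00F0, U+009F, U+0098, U+0080.
// None of them occurs in $emoji, so nothing is replaced.
$byteWise = preg_replace('/[\x{00F0}\x{009F}\x{0098}\x{0080}]/u', '', $emoji);
var_dump($byteWise === $emoji);   // bool(true)

// The full code point inside \x{...}: the emoji is matched and removed.
$byCodePoint = preg_replace('/\x{1F600}/u', '', "a{$emoji}b");
var_dump($byCodePoint);           // string(2) "ab"
```

In UTF-8 mode PCRE accepts code points up to 0x10FFFF in \x{...}, so the code point itself can be written in a single-quoted pattern; only the surrogate halves are rejected.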

Am I missing any "escape" alternative here?


1 Answer


revo's comment above was very helpful in finding a solution:

If your PHP isn't shipped with a PCRE build for UTF-16 then you can't perform such a match. From PHP 7.0 on, you're able to use Unicode code points following this syntax \u{XXXX} e.g. preg_replace("~\u{1F600}~", '', $str); (Mind the double quotes)

Since I am using PHP 7, echo "\u{1F602}"; outputs 😂, as described on this PHP RFC page on the Unicode escape. In essence, the proposal was:

A new escape sequence is added for double-quoted strings and heredocs.

  • \u{ codepoint-digits } where codepoint-digits is composed of hexadecimal digits.
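A minimal illustration of that escape, assuming PHP 7.0 or later; the variable names are made up for the example. It shows that the expansion happens only in double-quoted strings (and heredocs), which is why the quotes matter in the pattern.

```php
<?php
$double = "\u{1F602}";   // escape is expanded: one code point, four UTF-8 bytes
$single = '\u{1F602}';   // no expansion: the nine literal characters \u{1F602}

echo strlen($double), "\n";    // 4
echo bin2hex($double), "\n";   // f09f9882
echo strlen($single), "\n";    // 9
```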

This means that the pattern passed to preg_replace (normally single-quoted, to avoid the variable expansion of double-quoted strings) now needs double quotes and some preg_quote magic. This is the solution I came up with:

preg_replace(
  // single point unicode list
  "/[\x{2600}-\x{26FF}".
  // http://www.fileformat.info/info/unicode/block/miscellaneous_symbols/list.htm
  // concatenates with paired surrogates
  preg_quote("\u{1F600}", '/')."-".preg_quote("\u{1F64F}", '/').
  // https://www.fileformat.info/info/unicode/block/emoticons/list.htm
  "]/u",
  '',
  $str
);
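As a side note, preg_quote appears to change nothing in this case: it escapes only ASCII regex metacharacters and the given delimiter, none of which occur in the UTF-8 bytes of these emoji. A short check, assuming PHP 7.0 or later and using a made-up sample string:

```php
<?php
// preg_quote() leaves the emoji untouched, so plain concatenation is equivalent.
var_dump(preg_quote("\u{1F600}", '/') === "\u{1F600}");   // bool(true)

// \x{...} is not a PHP string escape, so it reaches PCRE unchanged;
// \u{...} is expanded by PHP before PCRE sees the pattern.
$pattern = "/[\x{2600}-\x{26FF}\u{1F600}-\u{1F64F}]/u";
$sample  = "sun \u{2600} grin \u{1F600} pray \u{1F64F} done";  // made-up sample text

echo preg_replace($pattern, '', $sample), "\n";   // prints "sun  grin  pray  done"
```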

Here's the proof of the above in 3v4l.

EDIT: a simpler solution

According to another comment by revo, placing the Unicode characters directly in the regex character class works with single-quoted strings and with earlier PHP versions (e.g. 4.3.4):

preg_replace('/[☀-⛿😀-🙏]/u','YOINK',$str);

To use PHP 7's new escape syntax, though, you still need double quotes:

preg_replace("/[\u{2600}-\u{26FF}\u{1F600}-\u{1F64F}]/u",'YOINK',$str);

Here's revo's proof in 3v4l.

