Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Replacing Emoji Unicode Range from Arabic Tweets using Java

I am trying to replace emoji in Arabic tweets using Java.

I used this code:

String line = "???? ????? ??? ???????? ????? ??? ??? ?? ??? ???? ????";
Pattern unicodeOutliers = Pattern.compile("([\\u1F601-\\u1F64F])", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(line);
line = unicodeOutlierMatcher.replaceAll(" $1 ");

But it is not replacing them. Even when I match only the single character "\\u1F602", it is not replaced. Maybe that is because there are 5 hex digits after the \u? I am not sure; it is just a guess.

Note that:

1- The emoticon at the end of the tweet (😂) is U+1F602, "face with tears of joy".

2- This question is not a duplicate of this question.

Any ideas?


1 Answer

From the Javadoc for the Pattern class:

A Unicode character can also be represented in a regular-expression by using its Hex notation(hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.

This means that the regular expression you're looking for is ([\x{1F601}-\x{1F64F}]). Of course, when you write this as a Java String literal, you must escape the backslashes.

Pattern unicodeOutliers = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");
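
For illustration, here is a minimal sketch of how that pattern could be plugged into the replacement from the question. The class name EmojiSpacing, the method padEmoji, and the sample string are made up for this example; the emoji is built from its code point so the source file can stay plain ASCII.

```java
import java.util.regex.Pattern;

final class EmojiSpacing {
    // Code points U+1F601..U+1F64F, written with the \x{...} construct (Java 7+).
    // The backslashes are doubled because this is a Java String literal.
    private static final Pattern EMOJI = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

    // Hypothetical helper: surrounds each matched emoji with spaces,
    // mirroring the question's replaceAll(" $1 ").
    static String padEmoji(String text) {
        return EMOJI.matcher(text).replaceAll(" $1 ");
    }

    public static void main(String[] args) {
        // U+1F602 ("face with tears of joy") as a String; it occupies two chars.
        String joy = new String(Character.toChars(0x1F602));
        String line = "tweet" + joy + "text"; // made-up sample input
        System.out.println(padEmoji(line));
    }
}
```

The flags from the question are left out here because they should not be needed just to match a code point range.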

Note that the construct \x{...} is available only from Java 7 onward.
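
If you are stuck on an older Java version, one possible workaround (not part of the Javadoc quote above, just a sketch) is to skip the regex and walk the string by code point, since String.codePointAt and StringBuilder.appendCodePoint have been available since Java 5. The class and method names below are hypothetical.

```java
final class EmojiCodePoints {
    // Hypothetical helper: pads every code point in U+1F601..U+1F64F with spaces,
    // without relying on the \x{...} regex construct.
    static String padEmoji(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); ) {
            // Reads a full code point, combining a surrogate pair if present.
            int cp = text.codePointAt(i);
            if (cp >= 0x1F601 && cp <= 0x1F64F) {
                out.append(' ').appendCodePoint(cp).append(' ');
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by 1 char for BMP characters, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String joy = new String(Character.toChars(0x1F602));
        System.out.println(padEmoji("tweet" + joy + "text")); // made-up sample input
    }
}
```

This also shows why the 4-digit \u escape cannot name the emoji directly: U+1F602 lies outside the Basic Multilingual Plane, so in a Java String it is stored as two chars.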

