
c# - Conversion of String with emoticon unicode format to String with single character emoticon

I am trying to convert a String object containing an emoticon's Unicode escape form into a String containing that same emoticon as its only character, e.g. converting "\u1F34E" to 🍎.

I attempted the following under the supposition the string's escape sequence would be properly processed:

String str = "u1F34E";
Console.WriteLine("'{0}' to '{1}'", str, str.ToCharArray()[0]);

Output:

'\u1F34E' to ''

Outputting the string directly to a text file yields the same result, so it is not just the debugger I am using. I am unsure what to do. Any help would be greatly appreciated.

EDIT:

I realize my original question was not clear; my intent was to produce a properly formatted UTF-16 string from a UTF-32 code point held in a string, because an API I was sending this value to required that formatting. I have successfully resolved the problem with the following:

String str = "1F34E"; //removed u with prior parsing
int unicode_utf32 = int.Parse(stdemote.Unicode, System.Globalization.NumberStyles.HexNumber);
String unicode_utf16_str = Char.ConvertFromUtf32(unicode_utf32);
Console.WriteLine("'{0}' to '{1}'", str, unicode_utf16_str);
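
For completeness, here is a minimal end-to-end sketch along the same lines; the FromCodePointText helper and the exact input format are assumptions for illustration, not something the API defines:

// Strip any leading \u or \U, parse the remaining hex digits, and let
// Char.ConvertFromUtf32 build the UTF-16 string (a surrogate pair here).
static string FromCodePointText(string token)
{
    string hex = token.TrimStart('\\', 'u', 'U');                                // "\u1F34E" -> "1F34E"
    int codePoint = int.Parse(hex, System.Globalization.NumberStyles.HexNumber); // 0x1F34E
    return Char.ConvertFromUtf32(codePoint);
}

Console.WriteLine(FromCodePointText(@"\u1F34E")); // 🍎
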
Question from: https://stackoverflow.com/questions/65854613/conversion-of-string-with-emoticon-unicode-format-to-string-with-single-characte


1 Answer


This is not what it seems:

string str = "u1F34E";

.NET uses UTF-16 to encode its strings, so each char is a 16-bit code unit rather than a full Unicode code point. The \u escape sequence accordingly takes four hex digits and covers U+0000 to U+FFFF, while the extended \U escape takes eight hex digits and covers U+00000000 to U+0010FFFF.
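
A quick sketch (variable names are mine) showing the difference in practice:

string wrong = "\u1F34E";    // \u consumes exactly four hex digits: U+1F34 followed by a literal 'E'
string right = "\U0001F34E"; // \U consumes eight hex digits: the single code point U+1F34E

Console.WriteLine(wrong.Length);                             // 2, but the chars are U+1F34 and 'E', not the apple
Console.WriteLine(right.Length);                             // 2 as well: one code point stored as a surrogate pair
Console.WriteLine(char.IsSurrogatePair(right[0], right[1])); // True
Console.WriteLine($"{(int)right[0]:X4} {(int)right[1]:X4}"); // D83C DF4E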

The emoji 🍎 has the high code point U+1F34E, so it has to be encoded as a surrogate pair, two UTF-16 code units "\uD83C\uDF4E", or written with the combined escape "\U0001F34E".¹

Example

string str = "uD83CuDF4E";
// or
string str = "U0001F34E"

If your goal is to separate actual text elements, as opposed to individual char values, you could make use of StringInfo.GetTextElementEnumerator:

// Requires: using System.Collections.Generic; using System.Globalization;
public static IEnumerable<string> ToElements(string source)
{
   // Yields one string per text element, so a surrogate pair such as 🍎
   // comes back as a single element rather than two separate chars.
   var enumerator = StringInfo.GetTextElementEnumerator(source);
   while (enumerator.MoveNext())
      yield return enumerator.GetTextElement();
}
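
For example, assuming the helper above is in scope, the apple comes back as a single element:

foreach (var element in ToElements("a\U0001F34Eb"))
    Console.WriteLine(element); // prints "a", then 🍎, then "b"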

Note: my use of terminology might not be the most common or accurate; if you think it can be tightened up, feel free to edit.


¹ Thanks to Mark Tolonen for pointing out that the Unicode escape sequence supports both 16-bit and 32-bit variants, \uXXXX and \UXXXXXXXX; more information can be found in Jon Skeet's blog post Strings in C# and .NET.

