visual c++ - Confusion on Unicode and Multibyte Articles

Question

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.

After reading the whole article, my understanding is this: if someone tells you that his text is in Unicode, you have no idea how much memory each of his characters takes up. He has to tell you, "My Unicode text is encoded in UTF-8"; only then will you know how much memory each character takes up.

Unicode = not necessarily 2 bytes for each character

However, Code Project's article and Microsoft's Help confused me:

Microsoft:

Unicode is a 16-bit character encoding, providing enough encodings for all languages. All ASCII characters are included in Unicode as "widened" characters.

Code Project:

The Unicode character set is a "wide character" (2 bytes per character) set that contains every character available in every language, including all technical symbols and special publishing characters. Multibyte character set (MBCS) uses either 1 or 2 bytes per character

Unicode = 2 bytes for each character?

Are 65,536 possible characters enough to represent all the languages in the world?

Why does the concept seem to differ between the web developer community and the desktop developer community?

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:18:11+0000

Once upon a time, Unicode had only as many characters as fit in 16 bits, and UTF-8 either did not exist or was not the de facto encoding to use.

These factors led to UTF-16 (or rather, what is now called UCS-2) being considered synonymous with “Unicode”, because it was, after all, the encoding that supported all of Unicode.

Practically, you will see “Unicode” being used where “UTF-16” or “UCS-2” is meant. This is a historical confusion and should be ignored and not propagated. Unicode is a set of characters; UTF-8, UTF-16, and UCS-2 are different encodings.
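To make the character set versus encoding distinction concrete, here is a minimal C++ sketch that reports how many bytes a single Unicode code point occupies in UTF-8 and in UTF-16. The function names `utf8_size` and `utf16_size` are made up for this illustration and are not part of any Microsoft or standard library API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed to store one code point in UTF-8.
std::size_t utf8_size(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);                   // Unicode ends at U+10FFFF
    assert(!(cp >= 0xD800 && cp <= 0xDFFF));  // surrogates are not characters
    if (cp < 0x80)    return 1;  // ASCII range
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;  // rest of the BMP
    return 4;                    // above the BMP
}

// Hypothetical helper: bytes needed to store one code point in UTF-16.
std::size_t utf16_size(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    assert(!(cp >= 0xD800 && cp <= 0xDFFF));
    return cp < 0x10000 ? 2 : 4;  // one 16-bit unit, or a surrogate pair
}
```

For example, U+0041 ("A") takes 1 byte in UTF-8 and 2 in UTF-16, U+20AC ("€") takes 3 bytes in UTF-8 and 2 in UTF-16, and U+1F600 takes 4 bytes in both. The same character has a different size depending on the encoding, which is why "Unicode" alone says nothing about bytes per character.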

(The difference between UTF-16 and UCS-2 is that UCS-2 is a true 16-bits-per-“character” encoding, and therefore encodes only the “BMP” (Basic Multilingual Plane) portion of Unicode, whereas UTF-16 uses “surrogate pairs” (for a total of 32 bits) to encode above-BMP characters.)
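The surrogate-pair mechanism can be sketched in a few lines of C++. This is an illustration of the arithmetic only, assuming the input is already known to lie above the BMP; `to_surrogates` is a hypothetical name, not a Windows or standard library function.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point in U+10000..U+10FFFF into the
// (high, low) pair of 16-bit units that UTF-16 uses to encode it.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // leaves a 20-bit value
    // The top 10 bits go into the high surrogate (0xD800..0xDBFF).
    const auto high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    // The bottom 10 bits go into the low surrogate (0xDC00..0xDFFF).
    const auto low = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return {high, low};
}
```

With this sketch, U+1F600 becomes the two units 0xD83D and 0xDE00, 32 bits in total. UCS-2 has no such mechanism, so it has no way to represent that character at all.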
