`char` is for 8-bit code units, `char16_t` is for 16-bit code units, and `char32_t` is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
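To make the code-unit distinction concrete, here is a minimal sketch that counts how many code units the same character occupies in each encoding. The helper `code_units` is a hypothetical name introduced for this illustration, not a standard function.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null. Works for char, char8_t (C++20),
// char16_t, and char32_t literals alike.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00DF (LATIN SMALL LETTER SHARP S) is inside the Basic Multilingual Plane.
static_assert(code_units(u8"\u00DF") == 2, "two 8-bit code units in UTF-8");
static_assert(code_units(u"\u00DF") == 1, "one 16-bit code unit in UTF-16");
static_assert(code_units(U"\u00DF") == 1, "one 32-bit code unit in UTF-32");

// U+1F600 is outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4, "four 8-bit code units in UTF-8");
static_assert(code_units(u"\U0001F600") == 2, "a surrogate pair in UTF-16");
static_assert(code_units(U"\U0001F600") == 1, "one 32-bit code unit in UTF-32");
```

Only UTF-32 gives one code unit per code point in every case; in UTF-8 and UTF-16 a single character can span several code units.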
The guarantee made for `wchar_t` was that any character supported in a locale could be converted from `char` to `wchar_t`, and whatever representation was used for `char`, be it multiple bytes, shift codes, what have you, the `wchar_t` would be a single, distinct value. The purpose of this was that you could then manipulate `wchar_t` strings with the same simple algorithms used with ASCII.
For example, converting ASCII to uppercase looks like this:
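    #include <locale>

    auto loc = std::locale("");  // the user's preferred locale
    char s[] = "hello";
    for (char &c : s) {
        c = std::toupper(c, loc);  // converts one char (one code unit) at a time
    }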
But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:
    auto loc = std::locale("");  // the user's preferred locale
    wchar_t s[] = L"hello";
    for (wchar_t &c : s) {
        c = std::toupper(c, loc);  // converts one wchar_t at a time
    }
So every `wchar_t` is a 'character', and if it has an uppercase version then it can be converted directly. Unfortunately this doesn't work all the time. For example, some languages have oddities such as the German letter ß, whose uppercase version is the two characters SS rather than a single character.
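The limitation can be seen directly in code. The following is a minimal sketch, assuming the classic "C" locale so that it does not depend on which locales are installed; `upper_wide` and `demo_upper_wide` are hypothetical names used only for illustration. Because the loop maps each `wchar_t` to exactly one `wchar_t`, the result can never be longer than the input, so a word containing the German sharp s (U+00DF) cannot become "STRASSE".

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: uppercase a wide string one wchar_t at a time,
// exactly like the loop above.
std::wstring upper_wide(std::wstring s, const std::locale& loc) {
    for (wchar_t& c : s) {
        c = std::toupper(c, loc);  // one wchar_t in, one wchar_t out
    }
    return s;
}

// Usage sketch.
void demo_upper_wide() {
    const std::locale& loc = std::locale::classic();

    // Plain ASCII works as expected.
    assert(upper_wide(L"hello", loc) == L"HELLO");

    // "stra\u00DFe" has 6 characters; the correct uppercase form has 7.
    const std::wstring word = L"stra\u00DFe";
    const std::wstring upper = upper_wide(word, loc);
    assert(upper.size() == word.size());  // length is always preserved
    assert(upper != L"STRASSE");          // so the expansion is impossible
}
```

Handling cases like this needs string-level case mapping rather than a per-character function, whatever character type is used.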
So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of `wchar_t` intended. As such, `wchar_t` and wide characters in general provide little value.
The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
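As an illustration of converting at an API boundary, here is a minimal sketch of a UTF-8 to UTF-16 conversion, the kind of step needed before calling an API that expects 16-bit code units. The function name `utf8_to_utf16` is hypothetical, and the validation is deliberately incomplete (overlong forms and encoded surrogates are not rejected), so treat it as a sketch rather than a production converter; in practice a platform or library routine would usually do this job.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode the code points as UTF-16.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else if (cp <= 0x10FFFF) {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}

// Usage sketch: keep UTF-8 internally, convert only where the API needs it.
void demo_utf8_to_utf16() {
    assert(utf8_to_utf16("hello") == u"hello");
    assert(utf8_to_utf16("\xE2\x82\xAC") == u"\u20AC");       // U+20AC
    assert(utf8_to_utf16("\xF0\x9F\x98\x80").size() == 2);    // surrogate pair
}
```

Keeping the conversion in one small function at the boundary means the rest of the code only ever deals with UTF-8.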