Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c++ - How can I use std::imbue to set the locale for std::wcout?

I am trying to use the std::locale mechanism in C++11 to count words in different languages. Specifically, I have a std::wstringstream which contains the title of a famous Russian novel ("Crime and Punishment" in English). What I want to do is use the appropriate locale (ru_RU.utf8 on my Linux machine) to read the stringstream, count the words and print the results. I should probably also note that my system is set to use the en_US.utf8 locale.

The desired result is this:

0: "Преступление"
1: "и"
2: "наказание"

I counted 3 words.
and the last word was "наказание"

That all works when I set the global locale, but not when I attempt to imbue the wcout stream. When I try that, I get this result instead:

0: "????????????"
1: "?"
2: "?????????"

I counted 3 words.
and the last word was "?????????"

Also, when I attempt to use a solution suggested in the comments (which can be activated by changing #define USE_CODECVT 0 to #define USE_CODECVT 1), I get the error mentioned in this other question.

Those interested in experimenting with the code, or with compiler settings or both may wish to use this live code.

My questions

  1. Why does that not work? Is it because wcout is already open?
  2. Is there a way to use imbue rather than setting the global locale to do what I want?

If it makes a difference, I'm using g++ 4.8.3. The full code is shown below.

getwords.cpp

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <locale>

#define USE_CODECVT 0
#define USE_IMBUE   1

#if USE_CODECVT
#include <codecvt>
#endif 
using namespace std;

int main()
{
#if USE_CODECVT
    locale ru("ru_RU.utf8", 
        new codecvt_utf8<wchar_t, 0x10ffff, consume_header>{});
#else
    locale ru("ru_RU.utf8");
#endif
#if USE_IMBUE
    wcout.imbue(ru);
#else
    locale::global(ru);
#endif
    wstringstream in{L"Преступление и наказание"};
    in.imbue(ru);
    wstring word;
    unsigned wordcount = 0;
    while (in >> word) {
        wcout << wordcount << ": \"" << word << "\"\n";
        ++wordcount;
    }
    wcout << "\nI counted " << wordcount << " words.\n"
        << "and the last word was \"" << word << "\"\n";
}

1 Answer

First, I ran some more tests using your code and I can confirm that L"Преступление и наказание" is a correct UTF-16 string. I checked the codes of the individual characters, and they are correct: 0x41f, 0x440, 0x435, 0x441, 0x442, 0x443, 0x43f, 0x43b, 0x435, 0x43d, 0x438, 0x435, 0x20, 0x438, 0x20, 0x43d, 0x430, 0x43a, 0x430, 0x437, 0x430, 0x43d, 0x438, 0x435
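A check like this can be reproduced with a small helper. The following is only a sketch: hex_codes is a hypothetical name that does not appear in the question, and the Cyrillic letters are written as \u escapes so the result does not depend on the source file encoding.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of the original program): formats every
// wide character of `text` as a hexadecimal code, separated by spaces.
std::string hex_codes(const std::wstring& text) {
    std::ostringstream out;
    out << std::hex;
    bool first = true;
    for (wchar_t ch : text) {
        if (!first) {
            out << ' ';
        }
        first = false;
        // Cast to an unsigned integer so the numeric value is printed
        // instead of a character.
        out << "0x" << static_cast<unsigned long>(ch);
    }
    return out.str();
}
```

Calling hex_codes on the wide string from the question and comparing the result with the list above should show whether the compiler decoded the literal as intended.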

I could not find any reference about it, but it looks like simply calling imbue is not enough. imbue is a member function inherited from basic_ios, which is a base class of the stream types of both cout and wcout. It does act on numeric conversions, but in all my tests it had no effect on the charset used for output.

By default, the locale used in a C++ (or C) program is the C locale, which knows nothing about Unicode. All printable ASCII characters (below 128) are output as is, and the others are replaced with a ?. That is exactly what your program does.
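One way to picture that substitution is the sketch below. It is an assumption about the mechanism, not the actual code path inside the stream: narrow_with_placeholder is a hypothetical helper that converts each wide character with std::wcrtomb under the current C locale and writes a ? whenever the conversion fails.

```cpp
#include <climits>
#include <clocale>
#include <cwchar>
#include <string>

// Hypothetical helper: converts `text` to the multibyte encoding of the
// current C locale, writing '?' for every character that the locale
// cannot represent.
std::string narrow_with_placeholder(const std::wstring& text) {
    std::string out;
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    for (wchar_t ch : text) {
        std::size_t n = std::wcrtomb(buf, ch, &state);
        if (n == static_cast<std::size_t>(-1)) {
            out += '?';
            state = std::mbstate_t{};  // reset the state after a failed conversion
        } else {
            out.append(buf, n);
        }
    }
    return out;
}
```

With the plain C locale active, a Cyrillic letter cannot be converted and becomes a ?, while ASCII characters pass through unchanged.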

To make it work correctly, you have to select a locale that knows about Unicode characters with setlocale. Once this is done, you can change the numeric conversion by calling imbue, and since you selected a Unicode charset, all will be fine.

So provided your current locale uses a UTF-8 charset, you only have to add

setlocale(LC_ALL, "");

as the first line of your program, and the output will be as expected:

0: "Преступление"
1: "и"
2: "наказание"

I counted 3 words.
and the last word was "наказание"

If your current locale does not use UTF-8, choose one that is installed on your system and that supports it. I used setlocale(LC_ALL, "fr_FR.UTF-8"); and even setlocale(LC_ALL, "en_US.UTF-8"); and both worked.
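Since the set of installed locales differs from system to system, trying several names in turn may be more robust than hard-coding one. Here is a minimal sketch, assuming only that std::setlocale returns a null pointer for a name it cannot honour; select_first_locale is a hypothetical helper, not a library function.

```cpp
#include <clocale>
#include <string>
#include <vector>

// Hypothetical helper: tries each candidate name with std::setlocale and
// returns the first one the system accepts. Returns an empty string when
// none of the candidates could be selected.
std::string select_first_locale(const std::vector<std::string>& candidates) {
    for (const std::string& name : candidates) {
        if (std::setlocale(LC_ALL, name.c_str()) != nullptr) {
            return name;  // this locale is now the active C locale
        }
    }
    return std::string();
}
```

It could be called with, for example, "ru_RU.UTF-8", "en_US.UTF-8" and "fr_FR.UTF-8" as candidates, falling back to an error message when the result is empty.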

Edit :

In fact, the best way to correctly output Unicode to the screen is to use setlocale(LC_ALL, "");. It automatically adapts to the current charset. I tested with a stripped-down variant using Latin-1 characters (my system natively speaks French, not Russian):

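#include <clocale>   // setlocale
#include <iostream>
#include <locale>

using namespace std;

int main() {
    // Adopt the locale (and therefore the charset) of the environment.
    setlocale(LC_ALL, "");
    // U+00E8 (è) and U+00E9 (é), followed by the terminating null.
    wchar_t ws[] = { 0xe8, 0xe9, 0 };

    wcout << ws << endl;
}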

I tried it under Linux with the UTF-8 and ISO-8859-1 (Latin-1) charsets (export LANG=fr_FR.UTF-8 and export LANG=fr_FR.ISO-8859-1 respectively) and correctly got èé in the proper charset. I also tried it under Windows XP with code pages 850 (OEM) and 1252 (ANSI) (chcp 850 and chcp 1252 respectively, with the Lucida Console font), and got èé on the console too.

Edit 2 :

Of course, you can also set a global C++ locale with locale::global(locale("")); for the default locale or locale::global(locale("ru_RU.UTF-8")); for a Russian locale, but that does more than simply calling setlocale. According to the documentation of the GNU implementation of the C++ Standard Library about locale: "there is only one relation (of the C++ locale mechanism) to the C locale mechanism: the global C locale is modified if a named C++ locale object is set as the global locale", that is: std::locale::global(std::locale("")); affects the C functions as if the following call was made: std::setlocale(LC_ALL, "");. On the other hand, there is no vice versa: calling setlocale has no effect whatsoever on the C++ locale mechanism, in particular on the working of locale("").
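That one-way relation can be observed by querying both mechanisms after each kind of call. The two helpers below are hypothetical and only a sketch; std::setlocale(LC_ALL, nullptr) queries the current C locale without changing it, and std::locale() returns a copy of the current C++ global locale.

```cpp
#include <clocale>
#include <locale>
#include <string>

// Hypothetical helper: installs the named C++ locale as the global one and
// returns the name the C library then reports for LC_ALL.
// std::locale throws std::runtime_error if the name is not available.
std::string c_locale_after_global(const std::string& name) {
    std::locale::global(std::locale(name.c_str()));
    const char* current = std::setlocale(LC_ALL, nullptr);
    return current ? current : "";
}

// Hypothetical helper: changes only the C locale, then returns the name of
// the C++ global locale, which that call is not expected to touch.
std::string cpp_global_after_setlocale(const std::string& name) {
    std::setlocale(LC_ALL, name.c_str());
    return std::locale().name();
}
```

If the quoted documentation holds, the first helper returns the name of the locale that was passed in, while the second keeps returning the name of whatever C++ global locale was installed before, regardless of the argument.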

So it really looks like there is an underlying C library mechanism that must first be enabled with setlocale to allow the imbue conversion to work correctly.

