This should work fine, but you need to use std::wregex and std::wsmatch. You will need to convert both the source string and the regular expression to wide-character Unicode (UTF-32 on Linux, UTF-16(ish) on Windows) to make it work.
This works for me where the source text is UTF-8:
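#include <iostream>
#include <regex>
#include <string>

inline std::wstring from_utf8(const std::string& utf8)
{
    // code to convert from utf8 to utf32/utf16
}

inline std::string to_utf8(const std::wstring& ws)
{
    // code to convert from utf32/utf16 to utf8
}

int main()
{
    std::string test = "john.doe@神谕.com"; // utf8

    // Match any run of characters from U+0080 up to U+D7FF, the last code
    // point below the surrogate range. A surrogate such as U+DB7F is not a
    // valid universal character name, so it cannot be used as the upper bound.
    std::string expr = "[\u0080-\uD7FF]+"; // utf8

    // Convert both the text and the pattern to wide strings.
    std::wstring wtest = from_utf8(test);
    std::wstring wexpr = from_utf8(expr);

    std::wregex we(wexpr);
    std::wsmatch wm;
    if(std::regex_search(wtest, wm, we))
    {
        // Convert the match back to UTF-8 for output.
        std::cout << to_utf8(wm.str(0)) << '\n';
    }
}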
Output:
神谕
Note: If you need a UTF conversion library, I used THIS ONE in the example above.
Edit: Or, you could use the functions given in this answer:
Any good solutions for C++ string code point and code unit?
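If you would rather not pull in a library at all, the two helpers can be written by hand. What follows is only a minimal sketch, not the code from the library mentioned above. It assumes wchar_t holds UTF-32 when it is at least 4 bytes wide and UTF-16 otherwise. It rejects malformed lead and continuation bytes and values above U+10FFFF, but it does not reject overlong encodings or encoded surrogates, so treat it as a starting point rather than a validating converter.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch: decode UTF-8 into wchar_t units (UTF-32 or UTF-16, see above).
inline std::wstring from_utf8(const std::string& utf8)
{
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how many bytes the sequence has.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        i += len;

        if (sizeof(wchar_t) >= 4 || cp < 0x10000)
        {
            out.push_back(static_cast<wchar_t>(cp));
        }
        else
        {
            // 16-bit wchar_t: split into a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Sketch: encode wchar_t units back into UTF-8.
inline std::string to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(ws[i]);

        // 16-bit wchar_t: join a surrogate pair into one code point.
        if (sizeof(wchar_t) < 4 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size())
        {
            const std::uint32_t lo = static_cast<std::uint32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

These have the same signatures as the two placeholder functions in the example, so they should be usable in their place. A round trip such as to_utf8(from_utf8(s)) should give back s for any well-formed UTF-8 input.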