Input file (added a line in my native locale):
100008251304976 T?iat?icet ?lutych ?i?inek 2019-10-04 16:52:15
100008251304976 你又知喎 2019-10-04 16:52:15
100027970365477 甘你買多幾包花生,小心熱氣 2019-10-04 16:23:43
R code snippet (converting individual rows of the x data frame could be done in a loop, I know…):
sessionInfo()
library(stringi)
library(magrittr)
x <- read.table('d:\\bat\\R\\comment.txt', encoding = 'UTF-8', quote = "\"", fill = TRUE, sep = '\t')
print(x)
x['V2'][1,] %>%
stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
x['V2'][2,] %>%
stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
x['V2'][3,] %>%
stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
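As for the loop mentioned above: the stringi functions are vectorized, so the whole column can probably be converted in one call. The following is only a sketch under assumptions; `decode_codepoints` and `x_demo` are hypothetical names that are not part of the snippet above, and the pattern is deliberately narrowed to exactly four hex digits, so it would leave longer forms such as `<U+0001F600>` untouched.

```r
library(stringi)

# Hypothetical helper (not part of the original snippet): turn every
# "<U+XXXX>" escape in a character vector into the character it names.
decode_codepoints <- function(s) {
  # "<U+4F60>" becomes the six characters \u4F60 ...
  s <- stri_replace_all_regex(s, "<U\\+([0-9A-Fa-f]{4})>", "\\\\u$1")
  # ... which stri_unescape_unicode() then decodes to the real character.
  stri_unescape_unicode(s)
}

# Small stand-in for the V2 column, copied from the print(x) output.
x_demo <- data.frame(
  V2 = c("<U+4F60><U+53C8><U+77E5><U+558E>", "<U+5C0F><U+5FC3>"),
  stringsAsFactors = FALSE
)

# One vectorized call instead of one pipeline per row.
x_demo$V2 <- decode_codepoints(x_demo$V2)
print(x_demo$V2)
```

With the real data frame this would be something like `x$V2 <- decode_codepoints(as.character(x$V2))`; the `as.character()` is a precaution in case `read.table` returned a factor. Note that `stri_unescape_unicode()` also interprets any other backslash escapes already present in the text, which may or may not be wanted.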
Result (after pasting the code snippet into an open RStudio console):
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=Czech_Czechia.1250 LC_CTYPE=Czech_Czechia.1250 LC_MONETARY=Czech_Czechia.1250
[4] LC_NUMERIC=C LC_TIME=Czech_Czechia.1250
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magrittr_1.5 stringi_1.1.5
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1
> library(stringi)
> library(magrittr)
>
> x <- read.table('d:\\bat\\R\\comment.txt', encoding = 'UTF-8', quote = "\"", fill = TRUE, sep = '\t')
>
> print(x)
V1 V2
1 1.000083e+14 T?iat?icet ?lutych ?i?inek
2 1.000083e+14 <U+4F60><U+53C8><U+77E5><U+558E>
3 1.000280e+14 <U+7518><U+4F60><U+8CB7><U+591A><U+5E7E><U+5305><U+82B1><U+751F>,<U+5C0F><U+5FC3><U+71B1><U+6C23>
V3
1 2019-10-04 16:52:15
2 2019-10-04 16:52:15
3 2019-10-04 16:23:43
>
> x['V2'][1,] %>%
+ stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
+ stri_unescape_unicode() %>%
+ stri_enc_toutf8()
[1] "T?iat?icet ?lutych ?i?inek"
> x['V2'][2,] %>%
+ stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
+ stri_unescape_unicode() %>%
+ stri_enc_toutf8()
[1] "你又知喎"
> x['V2'][3,] %>%
+ stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
+ stri_unescape_unicode() %>%
+ stri_enc_toutf8()
[1] "甘你買多幾包花生,小心熱氣"
>
I used the accepted answer on converting UTF-8 code point strings like <U+4F60> to UTF-8.