I am trying to create a corpus from Russian, UTF-8 encoded text. The problem is that the Corpus function from the tm
package does not encode the strings correctly.
Here is a reproducible example of my problem:
Load in the Russian text:
> data <- c("Renault Logan, 2005","Складское помещение, 345 м2",
"Су-шеф","3-к квартира, 64 м2, 3/5 эт.","Samsung galaxy S4 mini GT-I9190 (чёрный)")
Create a VectorSource:
> vs <- VectorSource(data)
> vs # outputs correctly
Then, create the corpus:
> corp <- Corpus(vs)
> inspect(corp) # output is not encoded properly
The output that I get is:
> inspect(corp)
<<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
Renault Logan, 2005
[[2]]
<<PlainTextDocument (metadata: 7)>>
?ê?à??ê?? ??ì?ù?íè?, 345 ì<U+00B2>
[[3]]
<<PlainTextDocument (metadata: 7)>>
?ó-???
[[4]]
<<PlainTextDocument (metadata: 7)>>
3-ê êaàeòèeà, 64 ì<U+00B2>, 3/5 yò.
[[5]]
<<PlainTextDocument (metadata: 7)>>
Samsung galaxy S4 mini GT-I9190 (÷?eí?é)
Why is the output incorrect? There doesn't seem to be any option to set the encoding in the Corpus function. Is there a way to set it after the fact? I have tried this:
> corp <- tm_map(corp, enc2utf8)
Error in FUN(X[[1L]], ...) : argumemt is not a character vector
But it fails with the error shown above.
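For reference, here is a minimal sketch of how the encoding could be checked and re-marked. It is not a confirmed fix. It assumes the garbled output is related to how the strings are marked and to the session locale, which is only a guess. It also assumes tm 0.6 or later, where content_transformer() is available; the name data_utf8 is introduced here for illustration only.

```r
# Sketch only: checks how R marks the strings and re-marks them as UTF-8.
data <- c("Renault Logan, 2005", "Складское помещение, 345 м2", "Су-шеф")

# Pure-ASCII strings are reported as "unknown"; non-ASCII ones may be "UTF-8".
print(Encoding(data))

# The session locale influences how non-ASCII text is printed.
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info()[["UTF-8"]])

# enc2utf8() works on a character vector, so it can run before VectorSource().
data_utf8 <- enc2utf8(data)

# tm_map() passes each document object to the function, not a character
# vector, which would explain the error above. content_transformer()
# (assumed: tm >= 0.6) wraps a character-level function so that it is
# applied to the content of each document instead.
if (requireNamespace("tm", quietly = TRUE)) {
  corp <- tm::VCorpus(tm::VectorSource(data_utf8))
  corp <- tm::tm_map(corp, tm::content_transformer(enc2utf8))
  print(Encoding(as.character(corp[[2]])))
}
```

Whether this changes what inspect() prints probably depends on the platform and locale, so the printed values above are worth including when reporting the problem.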