Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
339 views
in Technique[技术] by (71.8m points)

r - RStudio not picking the encoding I'm telling it to use when reading a file

I'm trying to read the following UTF-8 encoded file in R, but whenever I read it, the unicode characters are not encoded correctly:

enter image description here

The script I'm using to process the file is as follows:

defaultEncoding <- "UTF8"
detalheVotacaoMunicipioZonaTypes <- c("character", "character", "factor", "factor", "factor", "factor", "factor",
                                                     "factor", "factor", "factor", "factor", "factor", "numeric", 
                                                     "numeric", "numeric", "numeric", "numeric", "numeric",
                                                     "numeric", "numeric", "numeric", "numeric", "numeric", 
                                                     "numeric", "character", "character")

readDetalheVotacaoMunicipioZona <- function( fileName ) {
  fileConnection = file(fileName,encoding=defaultEncoding)
  contents <- readChar(fileConnection, file.info(fileName)$size)  
  close(fileConnection)
  contents <- gsub('"', "", contents)

  columnNames <- c("data_geracao", "hora_geracao", "ano_eleicao", "num_turno", "descricao_eleicao", "sigla_uf", "sigla_ue",
                   "codigo_municipio", "nome_municipio", "numero_zona", "codigo_cargo", "descricao_cargo", "qtd_aptos", 
                   "qtd_secoes", "qtd_secoes_agregadas", "qtd_aptos_tot", "qtd_secoes_tot", "qtd_comparecimento",
                   "qtd_abstencoes", "qtd_votos_nominais", "qtd_votos_brancos", "qtd_votos_nulos", "qtd_votos_legenda", 
                   "qtd_votos_anulados", "data_ult_totalizacao", "hora_ult_totalizacao")

  read.csv(text=contents, 
           colClasses=detalheVotacaoMunicipioZonaTypes,
           sep=";", 
           col.names=columnNames, 
           fileEncoding=defaultEncoding,
           header=FALSE)
}

I read the file sending in the UTF-8 encoding, remove all quotes (even numbers are quoted, so I need to clean them up) and then feed the contents to read.csv. It reads and processes the file correctly but it seems like it's not using the encoding information I'm giving it.

What should I do to make it use UTF-8 to read this file?

I'm using RStudio on OSX if it makes any difference.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

This problem is caused by the wrong locale being set, whether inside RStudio or command-line R:

  1. If the problem only happens in RStudio not command-line R, go to RStudio->Preferences:General, tell us what 'Default text encoding:'is set to, click 'Change' and try Windows-1252, UTF-8 or ISO8859-1('latin1') (or else 'Ask' if you always want to be prompted). Screenshot attached at bottom. Let us know which one worked!

  2. If the problem also happens in command-line R, do the following:

Do locale -m on your Mac and tell us whether it supports CP1252 or else ISO8859-1 ('latin1')? Dump the list of supported locales if you need to. (You might as well tell us your version of MacOS while you're at it.)

For both of those locales, try to change to that locale:

# first try Windows CP1252, although that's almost surely not supported on Mac:
Sys.setlocale("LC_ALL", "pt_PT.1252") # Make sure not to omit the `"LC_ALL",` first argument, it will fail.
Sys.setlocale("LC_ALL", "pt_PT.CP1252") # the name might need to be 'CP1252'

# next try IS08859-1(/'latin1'), this works for me:
Sys.setlocale("LC_ALL", "pt_PT.ISO8859-1")

# Try "pt_PT.UTF-8" too...

# in your program, make sure the Sys.setlocale worked, sprinkle this assertion in your code before attempting to read.csv:
stopifnot(Sys.getlocale('LC_CTYPE') == "pt_PT.ISO8859-1")

That should work. Strictly the Sys.setlocale() command should go in your ~/.Rprofile for startup, not inside your R session or source-code. However Sys.setlocale() can fail, so just be aware of that. Also, assert Sys.getlocale() inside your setup code early and often, as I do. (really, read.csv should figure out if the encoding it uses is compatible with the locale, and warn or error if not).

Let us know which fix worked! I'm trying to document this more generally so we can figure out the correct enhance.

  1. Screenshot of RStudio Preferences Change default text encoding menu: enter image description here

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...