I'd like to read only the first character from each line of a text file, ignoring the rest.
Here's an example file:
x <- c(
  "Afklgjsdf;bosfu09[45y94hn9igf",
  "Basfgsdbsfgn",
  "Cajvw58723895yubjsdw409t809t80",
  "Djakfl09w50968509",
  "E3434t"
)
writeLines(x, "test.txt")
I can solve the problem by reading everything with readLines and using substring to get the first character:
lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"
This seems inefficient though. Is there a way to persuade R to read only the first character of each line, rather than reading whole lines and discarding the rest?
I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low-level file manipulation (maybe with seek).
Since performance is only relevant for larger files, here's a bigger test file for benchmarking:
set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)
x2 <- vapply(
  nch,
  function(nch)
  {
    paste0(
      sample(letters, nch, replace = TRUE),
      collapse = ""
    )
  },
  character(1)
)
writeLines(x2, "bigtest.txt")
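For reference, here is a rough sketch of how the readLines/substring baseline could be timed on that file using only base R. The helper name time_baseline and the choice of five repetitions are my own assumptions for illustration, not part of any package.

```r
# Hypothetical helper: median elapsed time (in seconds) of the
# readLines + substring baseline over a few repetitions.
time_baseline <- function(path, reps = 5L) {
  timings <- vapply(
    seq_len(reps),
    function(i) {
      # Read every line, then keep only the first character of each
      system.time(substring(readLines(path), 1, 1))[["elapsed"]]
    },
    numeric(1)
  )
  median(timings)
}

# Only run the timing if the benchmark file has been created
if (file.exists("bigtest.txt")) {
  print(time_baseline("bigtest.txt"))
}
```

The timings depend on the machine and on disk caching, so they are only useful for comparing approaches against each other on the same system.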
Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to come from using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or from treating the file as binary (Martin Morgan's readBin solution).
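To illustrate the idea of treating the file as binary, here is a minimal base R sketch. It is my own reconstruction of the general technique, not necessarily the code from the readBin answer mentioned above. It assumes a single-byte encoding (such as ASCII) and lines terminated by "\n", and the function name first_chars_bin is made up for this example.

```r
# Hypothetical helper: first character of each line, found by scanning
# the raw bytes for newlines instead of building full line strings.
# Assumes a single-byte encoding and "\n" (or "\r\n") line endings.
first_chars_bin <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  if (length(bytes) == 0L) {
    return(character(0))
  }
  newline <- as.raw(10L)
  # A line starts at byte 1 and at the byte after every newline
  starts <- c(1L, which(bytes == newline) + 1L)
  # Drop the position past a trailing newline at the end of the file
  starts <- starts[starts <= length(bytes)]
  first <- bytes[starts]
  chars <- rawToChar(first, multiple = TRUE)
  # An empty line "starts" with its own newline; report it as ""
  chars[first == newline] <- ""
  chars
}

# Only run on the small example file if it has been created
if (file.exists("test.txt")) {
  print(first_chars_bin("test.txt"))
}
```

This still reads the whole file into memory, but it skips creating one string per line, which is where much of the readLines overhead comes from. For multibyte encodings such as UTF-8, a single byte is not always a whole character, so this sketch would need adapting.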