Just for reference, here is the solution I came up with. It makes use of the great package gsubfn:
library(gsubfn)
I use a vector htmlchars for the named HTML entities, which I scraped from Wikipedia. For brevity, I do not copy the vector into this answer, but source it from pastebin:
source("http://pastebin.com/raw.php?i=XtzN1NMs") # creates variable htmlchars
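In case the pastebin link is not reachable, here is a rough idea of the shape the function below relies on. This is only a sketch under an assumption: that htmlchars is a named character vector keyed by entity name without the leading & and trailing ;. The name htmlchars_demo and its handful of entries are made up for illustration and are not the scraped vector.

```r
# Hypothetical stand-in for the sourced vector (assumption: names are entity
# names without "&" and ";", values are the decoded characters).
htmlchars_demo <- c(amp = "&", lt = "<", gt = ">", quot = "\"", apos = "'")

htmlchars_demo["amp"]    # lookup by entity name
htmlchars_demo["bogus"]  # a name that is not in the vector yields NA
```

Note that a name missing from the vector gives NA rather than an error, so an incomplete vector shows up as NA in the decoded text.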
Now the decoding function I was looking for is simply:
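strdehtml <- function(s) {
  # Numeric references such as &#39;: turn the code point into a UTF-8 character
  ret <- gsubfn("&#([0-9]+);", function(x) intToUtf8(as.integer(x)), s)
  # Named references such as &amp;: look the captured name up in htmlchars
  ret <- gsubfn("&([^;]+);", function(x) htmlchars[x], ret)
  return(ret)
}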
I am not sure this covers every possible HTML entity (hexadecimal references such as &#x27; are not handled, for example), but it is enough for my purposes.
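If more coverage is needed, one possible direction is a single pass that handles decimal, hexadecimal and named references together and leaves unknown names untouched. The following is a sketch in base R, not the function from this answer: decode_entities and demo_entities are hypothetical names, and the tiny entity table stands in for the full scraped vector.

```r
# Sketch: decode decimal, hex and named references in one pass.
# `entities` is assumed to be a named character vector keyed by entity name.
decode_entities <- function(s, entities) {
  pattern <- "&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);"
  decode_one <- function(hit) {
    body <- substr(hit, 2L, nchar(hit) - 1L)  # strip "&" and ";"
    if (grepl("^#[xX]", body)) {
      intToUtf8(strtoi(substring(body, 3L), base = 16L))  # hex code point
    } else if (startsWith(body, "#")) {
      intToUtf8(as.integer(substring(body, 2L)))          # decimal code point
    } else if (body %in% names(entities)) {
      entities[[body]]                                    # known named entity
    } else {
      hit                                                 # unknown name: keep as is
    }
  }
  m <- gregexpr(pattern, s, perl = TRUE)
  regmatches(s, m) <- lapply(regmatches(s, m), function(hits) {
    vapply(hits, decode_one, character(1), USE.NAMES = FALSE)
  })
  s
}

demo_entities <- c(amp = "&", lt = "<", gt = ">", quot = "\"")
decode_entities("a &lt; b &amp;&amp; c &#x27;d&#39; &bogus;", demo_entities)
# [1] "a < b && c 'd' &bogus;"
```

Doing everything in one pass also avoids decoding twice: with two separate passes, an input like &#38;amp; would first become &amp; and then &.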
For instance, it can be used thus:
test <- "My this &amp; last year&#39;s resolutions"
strdehtml(test)
[1] "My this & last year's resolutions"