Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Unescape HTML &#nn; sequences

My text contains some HTML-escaped characters; for instance, instead of ' there is &#39;. I would like to unescape these sequences. Since I do not know in advance which characters are escaped, I do not want to use a simple mapping such as c("&#39;"="'", ...).

I understand that the number after &# is the decimal Unicode code point. So &#39; is \u27, since 27 is the hexadecimal representation of 39. So I thought of a solution that involves

sprintf("\u%x", s)

where s is the number extracted from between &# and ;. However, this results in an error: "\u used without hex numbers."

What would be a better approach to convert HTML escaped sequences back to characters?

1 Answer

Just for reference, here is the solution I came up with. It makes use of the great package gsubfn:

library(gsubfn)

I use a vector htmlchars of named HTML entities that I scraped from Wikipedia. For brevity, I do not reproduce the vector in this answer but source it from Pastebin:

source("http://pastebin.com/raw.php?i=XtzN1NMs") # creates variable htmlchars
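In case the Pastebin link is unavailable, here is a minimal stand-in showing the shape the vector presumably has. This is an assumption inferred from the lookup htmlchars[x] used in the function: a named character vector whose names are the entity names without & and ;, and whose values are the decoded characters. The handful of entries is only illustrative, not the full list.

```r
# Hypothetical stand-in for the sourced vector (assumed structure, not the
# original data): names are entity names, values are the decoded characters.
htmlchars <- c(
    amp  = "&",
    lt   = "<",
    gt   = ">",
    quot = "\"",
    apos = "'",
    nbsp = "\u00a0"   # non-breaking space
)

# Lookup by entity name, as the decoding function does
htmlchars["amp"]
```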

Now the decoding function I was looking for is simply:

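strdehtml <- function(s) {
    # Numeric references such as &#39;: intToUtf8() also handles code points above 255
    ret <- gsubfn("&#([0-9]+);", function(x) intToUtf8(as.integer(x)), s)
    # Named references such as &amp;: look the captured name up in htmlchars
    ret <- gsubfn("&([A-Za-z][A-Za-z0-9]*);", function(x) htmlchars[x], ret)
    return(ret)
}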

I am not sure whether this covers every possible HTML entity, but it works for my data. For instance, it can be used like this:

test <- "My this &amp; last year&#39;s resolutions"
strdehtml(test)
[1] "My this & last year's resolutions"
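As for why the sprintf() idea from the question fails: R resolves \u escapes when it parses the string literal, before sprintf() ever runs, so the escape cannot be assembled from a format string. Converting the number with intToUtf8() avoids the problem. Below is a rough base-R sketch along those lines that needs neither gsubfn nor the sourced vector. The function name decode_entities and its small default entities table are illustrative assumptions, and it handles a single string only; it also accepts hexadecimal references such as &#x27;.

```r
# Minimal base-R sketch; decode_entities and the default entities table are
# illustrative names, not part of any package.
decode_entities <- function(s, entities = c(amp = "&", lt = "<", gt = ">",
                                            quot = "\"", apos = "'")) {
    stopifnot(is.character(s), length(s) == 1L)
    # Match decimal (&#39;), hexadecimal (&#x27;) and named (&amp;) references
    pattern <- "&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);"
    m <- gregexpr(pattern, s, perl = TRUE)
    refs <- regmatches(s, m)[[1]]
    if (length(refs) == 0L) return(s)

    decode_one <- function(ref) {
        body <- substr(ref, 2L, nchar(ref) - 1L)  # strip "&" and ";"
        if (grepl("^#[xX]", body)) {
            intToUtf8(strtoi(substring(body, 3L), base = 16L))
        } else if (startsWith(body, "#")) {
            intToUtf8(as.integer(substring(body, 2L)))
        } else if (body %in% names(entities)) {
            entities[[body]]
        } else {
            ref  # unknown name: leave the text unchanged
        }
    }

    # Replace every match in one pass, so decoded text is never decoded twice
    regmatches(s, m) <- list(vapply(refs, decode_one, character(1),
                                    USE.NAMES = FALSE))
    s
}

decode_entities("My this &amp; last year&#39;s resolutions")
# [1] "My this & last year's resolutions"
```

Doing the replacement in a single pass means that input such as &#38;amp; decodes to the literal text &amp; rather than being unescaped a second time.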
