First, here's a sample data.frame:
dd <- data.frame(
  id = 10:13,
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)
Now, in order to read special attributes from a data.frame, we will use the readTabular
function to make our own custom data.frame reader. This is all we need to do:
library(tm)
myReader <- readTabular(mapping=list(content="text", id="id"))
We just specify the columns to use for the content and the id in the data.frame. Now we read it in with DataframeSource,
but use our custom reader:
tm <- VCorpus(DataframeSource(dd), readerControl=list(reader=myReader))
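As a quick sanity check (just an illustration, nothing below depends on it), the ids from the data.frame should now be attached to the documents as metadata:
meta(tm[[1]], "id")
# should be the first id from dd (10)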
Now, if we want to keep only a certain set of words, we can create our own content_transformer
function. One way to do this is:
keepOnlyWords <- content_transformer(function(x, words) {
  # Replace every stretch of text that is NOT a whole-word match
  # from `words` with a single space, keeping only the listed words
  regmatches(x,
             gregexpr(paste0("\\b(", paste(words, collapse = "|"), ")\\b"), x),
             invert = TRUE) <- " "
  x
})
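To see what the regmatches()/gregexpr() trick is doing on its own, here's a minimal sketch on a plain, already lower-cased string (the result in the comment is approximate; the leftover spaces are why stripWhitespace is useful afterwards):
s <- "no wonder, then, that ever gathering volume from the mere transit"
regmatches(s, gregexpr("\\b(wonder|then|that|the)\\b", s), invert = TRUE) <- " "
s
# roughly: " wonder then that the "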
As you can see, this replaces everything that's not in the word list with a space. Note that you probably want to run stripWhitespace after this. Thus our transformations would look like:
keep <- c("wonder", "then", "that", "the")
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
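If you want to peek at a cleaned document before building the matrix, calling as.character() on a corpus element is one way to do it (the exact leading/trailing spacing may differ slightly):
as.character(tm[[1]])
# roughly: "wonder then that the"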
And then we can turn that into a document-term matrix:
dtm <- DocumentTermMatrix(tm)
inspect(dtm)
# <<DocumentTermMatrix (documents: 4, terms: 4)>>
# Non-/sparse entries: 7/9
# Sparsity           : 56%
# Maximal term length: 6
# Weighting          : term frequency (tf)
#
#     Terms
# Docs that the then wonder
#   10    1   1    1      1
#   11    2   0    0      0
#   12    0   1    0      0
#   13    0   3    0      0
And you can see that it has only our list of words and the proper document IDs from the data.frame.
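If you want to pull those out programmatically, tm's Docs() and Terms() accessors should give the document ids and the kept vocabulary straight from the matrix:
Docs(dtm)
# "10" "11" "12" "13"
Terms(dtm)
# "that" "the" "then" "wonder"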