Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

text mining - Is there an R function to clean via a custom dictionary

I would like to use a custom dictionary (upwards of 400,000 words) when cleaning my data in R. I already have the dictionary loaded as a large character list, and I want the content of my data (a VCorpus) to consist only of words that are in my dictionary.
For example:

#[1] "never give up uouo cbbuk jeez"  

would become

#[1] "never give up"  

as the words "never", "give", and "up" are all in the custom dictionary. I have previously tried the following:

#Reading the custom dictionary as a function
    english.words  <- function(x) x %in% custom.dictionary
#Filtering based on words in the dictionary
    DF2 <- DF1[(english.words(DF1$Text)),]

but my result is a character list with one word. Any advice?

question from:https://stackoverflow.com/questions/65880680/is-there-an-r-function-to-clean-via-a-custom-dictionary


1 Answer

0 votes
by (71.8m points)

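Your attempt compares each whole string in DF1$Text against the dictionary, so only rows whose entire text is a single dictionary word survive. Instead, you can split the sentences into words, keep only the words that are in your dictionary, and paste them back together into one sentence.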

DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
                    paste0(Filter(english.words, x), collapse = ' '))

Here I have created a new column called Text1 that contains only English words. If you want to replace the original column instead, save the output in DF1$Text.
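For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same split, filter, and paste idea. The three-word dictionary, the two-row DF1, and the helper name keep_dictionary_words are made up for illustration; only english.words, custom.dictionary, DF1 and Text come from the question.

```r
# Toy stand-ins for the asker's objects (assumed, not the real data)
custom.dictionary <- c("never", "give", "up")
DF1 <- data.frame(
  Text = c("never give up uouo cbbuk jeez", "give it up"),
  stringsAsFactors = FALSE
)

# Predicate from the question: TRUE for each word found in the dictionary
english.words <- function(x) x %in% custom.dictionary

# Hypothetical helper: split each string on whitespace, keep only the
# dictionary words, and glue the survivors back into one string
keep_dictionary_words <- function(text) {
  sapply(
    strsplit(text, "\\s+"),
    function(words) paste(words[english.words(words)], collapse = " ")
  )
}

DF1$Text1 <- keep_dictionary_words(DF1$Text)
print(DF1$Text1)
```

This version indexes with words[english.words(words)] rather than calling Filter(), so the dictionary lookup runs once per sentence instead of once per word; the result should be the same. A string with no dictionary words comes back as an empty string. For a tm VCorpus, the same helper could presumably be wrapped in content_transformer() and passed to tm_map(), though that is not tested here.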

