Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Find matches of a vector of strings in another vector of strings

I'm trying to create a subset of a data frame of news articles that mention at least one element of a set of keywords or phrases.

# Sample data frame of articles
articles <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"
  )
)
articles$text <- as.character(articles$text)

# Sample vector of keywords or phrases
keywords <- as.character(c("elit", "tempor incididunt", "reprehenderit"))

#   id                                                                         text
# 1  1     Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# 2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
# 3  3      quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
# 4  4    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse

Given the vector of keywords, the subset should contain rows 1, 2, and 4, since those rows contain one or more of the elements of the vector.

Neither %in% nor grepl() works here. %in% compares whole elements for exact equality, so articles$text %in% keywords results in four FALSEs, and grepl() doesn't seem to handle a vector of patterns (grepl(keywords, articles$text) warns that only the first element of the pattern will be used). Neither function alone seems to work across both dimensions: it would be easy to search for one keyword in all the rows, but not all three at the same time.

What's the best way to find and select all rows of the data frame that contain at least one of the elements of the keyword vector?



1 Answer


You can paste your keywords together, separated by the pipe character (|), which acts as an "or" in a regular expression, like this:

> articles[grepl(paste(keywords, collapse="|"), articles$text),]
  id                                                                         text
1  1     Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
4  4    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
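If the keywords might contain regex metacharacters (for example "." or "("), a pasted pattern could match more than intended or fail to compile. A possible alternative, sketched here under the assumption that literal substring matching is what is wanted, is to test each keyword separately with fixed = TRUE and combine the results with a logical "or". The names hits_per_keyword, keep, and matched are made up for this illustration.

```r
# Same sample data as in the question
articles <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"
  ),
  stringsAsFactors = FALSE
)
keywords <- c("elit", "tempor incididunt", "reprehenderit")

# One logical vector per keyword; fixed = TRUE treats the keyword as a
# literal string rather than a regular expression
hits_per_keyword <- lapply(keywords, function(k) {
  grepl(k, articles$text, fixed = TRUE)
})

# Element-wise "or" across keywords: TRUE where at least one keyword matched
keep <- Reduce(`|`, hits_per_keyword)

# Subset of articles mentioning at least one keyword
matched <- articles[keep, ]
```

Both this and the pasted pattern match substrings, so "elit" also matches inside "velit" in row 4. If whole-word matches are needed, a regex with word boundaries (\\b) is probably a better fit than fixed matching.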
