Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Python regex matching multiple words from a list

I have a list of terms and a string, and I would like to get back the terms from the list that are found in the string.

Ex:

import re

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'

pattern = re.compile(r"(?=(" + r"|".join(map(re.escape, lof_terms)) + r"))")
found_terms = re.findall(pattern, str_content)

This returns only ['popular', 'car']; it fails to catch 'car manufacturer'. However, it does catch it if I change the source list of terms to lof_terms = ['car manufacturer', 'popular'].

Somehow the overlap between 'car' and 'car manufacturer' seems to be the source of this issue.

Any ideas on how to get around this?

Many thanks

Question from: https://stackoverflow.com/questions/65884809/how-to-match-repeating-words-in-python-regex


1 Answer


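Regex alternation is ordered: at a given position the engine tries the alternatives from left to right and takes the first one that matches, so 'car' wins before 'car manufacturer' is ever tried. The current code can be fixed by first sorting lof_terms by length in descending order, so that longer terms are tried before their own prefixes: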

rx = r"(?=({}))".format("|".join(map(re.escape, sorted(lof_terms, key=len, reverse=True))))
pattern = re.compile(rx)

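Note that this pattern uses no word boundaries, so a term such as 'car' would also match inside a longer word such as 'scary'. If you need whole-word matches, put \b once on either end of the group, as in (?=\b(...)\b); there is no need to repeat it around each alternative. This works as long as every term starts and ends with a word character.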
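To make the word-boundary point concrete, here is a small sketch comparing the pattern with and without \b around the group. The sample sentence and the names alternation, bounded and unbounded are made up for illustration and are not part of the original question.

```python
import re

lof_terms = ['car', 'car manufacturer', 'popular']

# Longest terms first, so 'car manufacturer' is tried before 'car'.
alternation = "|".join(map(re.escape, sorted(lof_terms, key=len, reverse=True)))

# A single \b on each side of the group applies to every alternative.
bounded = re.compile(r"(?=\b({})\b)".format(alternation))
# Same pattern without word boundaries, for comparison.
unbounded = re.compile(r"(?=({}))".format(alternation))

# Hypothetical test sentence: 'scary' contains 'car' as a substring.
text = 'A scary but popular car manufacturer.'

bounded_hits = bounded.findall(text)
unbounded_hits = unbounded.findall(text)

print(bounded_hits)    # => ['popular', 'car manufacturer']
print(unbounded_hits)  # => ['car', 'popular', 'car manufacturer']
```

Without the boundaries, the 'car' inside 'scary' is reported as a hit; with them, only whole-word occurrences are.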

Full Python demo:

import re

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'

rx = r"(?=({}))".format("|".join(map(re.escape, sorted(lof_terms, key=len, reverse=True))))
pattern = re.compile(rx)
found_terms = re.findall(pattern, str_content)
print(found_terms)
# => ['popular', 'car manufacturer']
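One thing to keep in mind: a single alternation reports at most one term per starting position, so 'car' is not listed once 'car manufacturer' matches at the same spot. If the goal is to get back every term from the list that occurs in the string, one possible approach (a different technique from the sorted alternation above) is to test each term on its own. The helper name find_terms is hypothetical, and the sketch assumes every term starts and ends with a word character so that \b behaves as expected.

```python
import re

def find_terms(terms, text):
    """Return every term from `terms` that occurs in `text` as a whole word or phrase.

    Hypothetical helper: each term is searched separately, so overlapping
    terms such as 'car' and 'car manufacturer' are both reported.
    """
    found = []
    for term in terms:
        # re.escape keeps any special characters in the term literal.
        if re.search(r"\b{}\b".format(re.escape(term)), text):
            found.append(term)
    return found

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'

all_found = find_terms(lof_terms, str_content)
print(all_found)
# => ['car', 'car manufacturer', 'popular']
```

The result follows the order of the term list rather than the order of appearance in the string, and each term is scanned for separately, which may be slower than one compiled alternation for very long lists.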
