# TLDR
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.
If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.
# Theory
If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.
If you save all the banned words into a set, it will be very fast to check if another word is included in that set.
Pack the logic into a function, pass this function as an argument to `re.sub`, and you're done!
# Code
```python
import re

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)
```
Converted sentences are:
```
' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe
```
Note that:

- the search is case-insensitive (thanks to `lower()`)
- replacing a word with `""` might leave two spaces (as in your code)
- with Python 3, `\w+` also matches accented characters (e.g. `"ångström"`); see the snippet below
- any non-word character (tab, space, newline, marks, ...) will stay untouched
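If you want to check those last points yourself, here is a minimal standalone snippet (my own illustration, separate from the script above):

```python
import re

def drop_b(match):
    # Drop only the word "b", keep everything else.
    return "" if match.group(0) == "b" else match.group(0)

# Removing a word leaves its surrounding spaces behind:
print(repr(re.sub(r'\w+', drop_b, "a b c")))   # -> 'a  c' (two spaces)

# In Python 3, \w matches Unicode word characters by default:
print(re.findall(r'\w+', "ångström"))          # -> ['ångström']
```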
# Performance
There are a million sentences, `banned_words` has almost 100000 words and the script runs in less than 7s. In comparison, Liteye's answer needed 160s for 10 thousand sentences.

With `n` being the total amount of words and `m` the amount of banned words, OP's and Liteye's code are `O(n*m)`. In comparison, my code should run in `O(n+m)`. Considering that there are many more sentences than banned words, the algorithm becomes `O(n)`.
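The `O(n+m)` estimate relies on set lookup being constant-time. If you want to verify that assumption, a quick `timeit` sketch (my own test, independent of the script above) could look like this:

```python
import timeit

# Set membership costs about the same however large the set is,
# which is what makes the overall algorithm O(n+m):
for size in (10**3, 10**6):
    setup = "words = set(map(str, range(%d)))" % size
    t = timeit.timeit("'0' in words", setup=setup, number=10**6)
    print("set of %7d words: %.2fs for a million lookups" % (size, t))
```

Both sizes should report roughly the same time.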
# Regex union test
What's the complexity of a regex search with a `'(word1|word2|...|wordN)'` pattern? Is it `O(N)` or `O(1)`?
It's pretty hard to grasp the way the regex engine works, so let's write a simple test.
This code extracts `10**i` random English words into a list. It creates the corresponding regex union, and tests it with different words:

- one is clearly not a word (it begins with `#`)
- one is the first word in the list
- one is the last word in the list
- one looks like a word but isn't
```python
import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"(%s)" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print(" %-17s : %.1fms" % (description, time))
```
It outputs:
```
First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
 Surely not a word : 0.7ms
 First word        : 0.8ms
 Last word         : 0.7ms
 Almost a word     : 0.7ms

Union of 100 words
 Surely not a word : 0.7ms
 First word        : 1.1ms
 Last word         : 1.2ms
 Almost a word     : 1.2ms

Union of 1000 words
 Surely not a word : 0.7ms
 First word        : 0.8ms
 Last word         : 9.6ms
 Almost a word     : 10.1ms

Union of 10000 words
 Surely not a word : 1.4ms
 First word        : 1.8ms
 Last word         : 96.3ms
 Almost a word     : 116.6ms

Union of 100000 words
 Surely not a word : 0.7ms
 First word        : 0.8ms
 Last word         : 1227.1ms
 Almost a word     : 1404.1ms
```
So it looks like the search for a single word with a `'(word1|word2|...|wordN)'` pattern has:

- `O(1)` best case
- `O(n/2)` average case, which is still `O(n)`
- `O(n)` worst case

These results are consistent with a simple loop search.
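In other words, the engine seems to try the alternatives one after the other, roughly like this hypothetical loop (a sketch of the observed behaviour, not of the actual engine internals):

```python
def union_match(words, text):
    # Try each alternative in turn and stop at the first one that fits,
    # just like a '(word1|word2|...|wordN)' pattern appears to do.
    for word in words:
        if text.startswith(word):
            return word
    return None
```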
A much faster alternative to a regex union is to create the regex pattern from a trie.
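A trie merges the common prefixes of the banned words, so the engine never has to try thousands of alternatives one by one. As a rough sketch of the idea (my own simplified version, not the linked implementation):

```python
import re

def trie_regex(words):
    """Compile a regex equivalent to '(word1|word2|...)' via a trie."""
    # Build a nested-dict trie; the '' key marks the end of a word.
    trie = {}
    for word in words:
        node = trie
        for char in word:
            node = node.setdefault(char, {})
        node[''] = True

    # Convert the trie into a pattern string, bottom-up.
    def to_pattern(node):
        if list(node) == ['']:
            return ''
        alternatives = [re.escape(char) + to_pattern(child)
                        for char, child in sorted(node.items()) if char]
        pattern = '|'.join(alternatives)
        if len(alternatives) > 1 or '' in node:
            pattern = '(?:%s)' % pattern
        if '' in node:
            pattern += '?'  # the word may also end here
        return pattern

    return re.compile(to_pattern(trie))

union = trie_regex(['foo', 'fob', 'foobar'])
print(union.pattern)                 # fo(?:b|o(?:bar)?)
print(bool(union.match('foobar')))   # True
```

Shared prefixes are factored out once, so the match cost should no longer grow linearly with the number of words in the union.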