Try tuning the batch_size and n_process parameters:
import spacy

nlp = spacy.load("en_core_web_sm")  # or whichever pipeline you are using

def process_abstract(df):
    cleaned_text = []
    docs = nlp.pipe(df["abstract"].to_list(), batch_size=256, n_process=12)
    for doc in docs:
        tokens = [
            token.text
            for token in doc
            if not token.is_punct
            and not token.is_stop
            and not token.like_num
            and token.is_alpha
        ]
        cleaned_text.append(" ".join(tokens).lower())
    return cleaned_text
Note as well that joining on " " may give you some surprises, since spaCy's tokenization rules are more complex than simple whitespace splitting.
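Contractions are a common example: spaCy splits "don't" into two tokens, "do" and "n't", so joining on spaces does not reproduce the original text. A minimal sketch of the pitfall, using a hard-coded token list to stand in for a real spaCy Doc:

```python
# Tokens as spaCy would produce them for "don't panic"
# (the contraction is split into "do" + "n't").
tokens = ["do", "n't", "panic"]

rejoined = " ".join(tokens).lower()
print(rejoined)  # -> "do n't panic", not "don't panic"
```

If you ever need the original spacing back, each spaCy token carries its trailing whitespace, so `"".join(token.text_with_ws for token in doc)` reconstructs `doc.text` exactly.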