Try tuning the batch_size and n_process parameters:
import spacy

nlp = spacy.load("en_core_web_sm")  # or whichever pipeline you are using

def process_abstract(df):
    cleaned_text = []
    docs = nlp.pipe(df["abstract"].to_list(), batch_size=256, n_process=12)
    for doc in docs:
        tokens = [
            token.text
            for token in doc
            if not token.is_punct
            and not token.is_stop
            and not token.like_num
            and token.is_alpha
        ]
        cleaned_text.append(" ".join(tokens).lower())
    return cleaned_text
Note as well that joining on " " may give you some surprises, since spaCy's tokenization rules are more complex than simple whitespace splitting.
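Contractions are a common example: spaCy splits "don't" into two tokens, "do" and "n't", so joining on spaces does not reproduce the original text. A minimal sketch of the pitfall, using a hard-coded token list to stand in for a real spaCy Doc:

```python
# Tokens as spaCy would produce them for "don't panic"
# (the contraction is split into "do" + "n't").
tokens = ["do", "n't", "panic"]

rejoined = " ".join(tokens).lower()
print(rejoined)  # -> "do n't panic", not "don't panic"
```

If you ever need the original spacing back, each spaCy token carries its trailing whitespace, so `"".join(token.text_with_ws for token in doc)` reconstructs `doc.text` exactly.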