Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - processing text with spacy nlp.pipe

I'm processing 40,000 abstracts with spaCy's nlp.pipe using the code below, and it's taking 8 minutes. Is there a way to speed this up further? I've already disabled ner.

import spacy

nlp = spacy.load("en_core_web_md", disable=["ner"])

def process_abstract(df):
    cleaned_text = []
    document = list(nlp.pipe(df['abstract'].values))
    for doc in document:
        text = [token.text for token in doc 
                if token.is_punct==False and 
                token.is_stop==False and 
                token.like_num==False and 
                token.is_alpha==True
                ]
        cleaned_text.append(' '.join(text).lower())
    return cleaned_text

Question from: https://stackoverflow.com/questions/65850018/processing-text-with-spacy-nlp-pipe

1 Answer


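Try tuning the batch_size and n_process parameters of nlp.pipe. batch_size sets how many texts are buffered per batch, and n_process sets the number of worker processes; treat the values below as starting points to adjust for your machine: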

def process_abstract(df):
    cleaned_text = []
    document = nlp.pipe(df["abstract"].to_list(), batch_size=256, n_process=12)
    for doc in document:
        text = [
            token.text
            for token in doc
            if not token.is_punct
            and not token.is_stop
            and not token.like_num
            and token.is_alpha
        ]
        cleaned_text.append(" ".join(text).lower())
    return cleaned_text
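
A further option, which is not part of the original answer and is offered only as a sketch: the four attributes used in the filter (is_punct, is_stop, like_num, is_alpha) are lexical attributes, so they should be available from the tokenizer alone. Assuming that holds for your data, a blank English pipeline avoids running the tagger and parser entirely. The names nlp_blank and clean_texts are made up for this illustration, and the tokenization may differ slightly from en_core_web_md, so it is worth comparing both on a sample first.

```python
import spacy

# Assumption: a tokenizer-only pipeline is enough, because the filter below
# relies only on lexical attributes and not on tags or parses.
nlp_blank = spacy.blank("en")


def clean_texts(texts, batch_size=256):
    """Return one lowercased, filtered string per input text."""
    cleaned = []
    for doc in nlp_blank.pipe(texts, batch_size=batch_size):
        words = [
            token.text
            for token in doc
            if not token.is_punct
            and not token.is_stop
            and not token.like_num
            and token.is_alpha
        ]
        cleaned.append(" ".join(words).lower())
    return cleaned
```

With a DataFrame like the one in the question, this could be called as clean_texts(df["abstract"].to_list()).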

Note also that joining the tokens with " " may give surprising results, because spaCy's tokenization rules are more complex than splitting on whitespace.
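
To illustrate one reading of that caveat, here is a small sketch using a blank English pipeline (an assumption made so that no model download is needed; the sample sentence is invented). spaCy splits contractions and punctuation into separate tokens, so joining token texts with a single space does not reproduce the original string, whereas token.text_with_ws keeps the original spacing.

```python
import spacy

# Assumption: the default English tokenizer is representative enough
# to show the effect; no trained model is loaded here.
nlp_tok = spacy.blank("en")

text = "It's a state-of-the-art model (v2)."
doc = nlp_tok(text)

# Joining on a space inserts whitespace between every pair of tokens,
# including around punctuation that was attached in the original text.
joined = " ".join(token.text for token in doc)

# text_with_ws carries each token's original trailing whitespace,
# so concatenating it rebuilds the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)

print(joined)
print(rebuilt == text)
```

In the filtering function above, tokens are dropped anyway, so the practical effect is that pieces of contractions or hyphenated words may end up as separate words in the cleaned text.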

