Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
882 views
in Technique[技术] by (71.8m points)

python 3.x - Spacy, matcher with entities spanning more than a single token

I am trying to create a matcher that finds negated custom entities in the text. It is working fine for entities that span a single token, but I am having trouble trying to capture entities that span more than one token.

As an example, let's say that my custom entities are animals (and are labeled as token.ent_type_ = "animal")

["cat", "dog", "artic fox"] (note that the last entity has two words).

Now I want to find those entities in the text but negated, so I can create a simple matcher with the following pattern:

[{'lower': 'no'}, {'ENT_TYPE': {'REGEX': 'animal', 'OP': '+'}}]

And for example, I have the following text:

There is no cat in the house and no artic fox in the basement

I can successfully capture no cat and no artic, but the last match is incorrect as the full match should be no artic fox. This is due to the OP: '+' in the pattern that matches a single custom entity instead of two. How can I modify the pattern to prioritize longer matches over shorter ones?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

A solution is to use the doc retokenize method in order to merge the individual tokens of each multi-token entity into a single token:

import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', parse=True, tag=True, entity=True)

animal = ["cat", "dog", "artic fox"]
ruler = EntityRuler(nlp)
for a in animal:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
nlp.add_pipe(ruler)


doc = nlp("There is no cat in the house and no artic fox in the basement")

with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])


from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern =[{'lower': 'no'},{'ENT_TYPE': {'REGEX': 'animal', 'OP': '+'}}]
matcher.add('negated animal', None, pattern)
matches = matcher(doc)


for match_id, start, end in matches:
    span = doc[start:end]
    print(span)

the output is now:

no cat
no artic fox


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...