I am new to spaCy and NLP, and I am running into the issue below while doing sentence segmentation with spaCy.
The text I am trying to split into sentences contains numbered lists (with a space between the number and the item text), like below.
import spacy

nlp = spacy.load('en_core_web_sm')
# Triple quotes so the string literal can span several lines
text = """This is first sentence.
Next is numbered list.
1. Hello World!
2. Hello World2!
3. Hello World!"""
text_sentences = nlp(text)
for sentence in text_sentences.sents:
    print(sentence.text)
Output (1., 2. and 3. are treated as separate sentences) is:
This is first sentence.
Next is numbered list.
1.
Hello World!
2.
Hello World2!
3.
Hello World!
But if there is no space between the number and the item text, sentence segmentation is fine, as below:
import spacy

nlp = spacy.load('en_core_web_sm')
# Same text, but with no space after each list number
text = """This is first sentence.
Next is numbered list.
1.Hello World!
2.Hello World2!
3.Hello World!"""
text_sentences = nlp(text)
for sentence in text_sentences.sents:
    print(sentence.text)
Output (desired) is:
This is first sentence.
Next is numbered list.
1.Hello World!
2.Hello World2!
3.Hello World!
Please suggest whether the sentence detector can be customised so that each number stays in the same sentence as its list item, even when a space follows the number.
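For reference, one possible workaround that leaves the pipeline itself untouched is to post-process the sentence strings: any sentence that consists only of a list marker gets glued onto the sentence after it. This is only a minimal sketch. The helper name merge_list_markers and the LIST_MARKER pattern are made up for illustration and are not part of spaCy, and the sample input is simply the split output shown above.

```python
import re

# Hypothetical helper, not a spaCy API: a "sentence" that is nothing but a
# list marker such as "1." or "12)" is joined to the sentence that follows.
LIST_MARKER = re.compile(r"^\d+[.)]$")


def merge_list_markers(sentences):
    """Return the sentences with bare list markers merged into the next one."""
    merged = []
    pending = None  # a list marker still waiting for its item text
    for raw in sentences:
        sentence = raw.strip()
        if not sentence:
            continue  # skip whitespace-only pieces
        if pending is not None:
            sentence = pending + " " + sentence
            pending = None
        if LIST_MARKER.match(sentence):
            pending = sentence
        else:
            merged.append(sentence)
    if pending is not None:
        merged.append(pending)  # trailing marker with nothing after it
    return merged


# The split output from the first example above
split_output = [
    "This is first sentence.",
    "Next is numbered list.",
    "1.", "Hello World!",
    "2.", "Hello World2!",
    "3.", "Hello World!",
]
for sentence in merge_list_markers(split_output):
    print(sentence)
```

With the code from the question, the input would be the list [sentence.text for sentence in text_sentences.sents]. The result is a list of plain strings rather than spaCy spans, so this only fits cases where the sentence text is all that is needed.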