A regex for extracting sentences from a paragraph in Python

Question

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to extract sentences from a paragraph using regular expressions in Python.
The code I'm testing usually splits the text correctly, but for the following paragraph it does not.

The paragraph:

"But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections." A new type of vaccine?

The code:

def splitParagraphIntoSentences(paragraph):
    import re
    # Split after '.', '!' or '?' followed by 1-2 whitespace characters
    # and then a capital letter.
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    with open("bs.txt", 'r') as f:
        text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
        for i in mylist:
            print(i)

When tested with the above paragraph, the output is exactly the same as the input paragraph, but it should look like this:

But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections

A new type of vaccine

Is there anything wrong with the regular expression?

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:32:07+0000

Riccardo Murri's answer is correct, but I thought I'd throw a bit more light on the subject.
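To make the failure concrete, here is a minimal sketch that runs the pattern from the question against the example paragraph. The first sentence ends with a period followed by a closing quote, so the whitespace that the pattern expects right after the punctuation is never there. The `with_quote` pattern is only a hypothetical illustration of that point, not the regex from either answer.

```python
import re

# The pattern from the question: punctuation, 1-2 whitespace chars, then a capital.
original = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')

paragraph = ('"But in the case of malaria infections and sepsis, dendritic cells '
             'throughout the body are concentrated on alerting the immune system, '
             'which prevents them from detecting and responding to any new '
             'infections." A new type of vaccine?')

# The period is followed by a closing quote, not whitespace, so nothing matches
# and split() returns the whole paragraph as a single item.
original_parts = original.split(paragraph)
print(len(original_parts))

# Hypothetical variant: allow an optional closing quote before the whitespace.
with_quote = re.compile(r'''[.!?]['"]?\s{1,2}(?=[A-Z])''')
quote_parts = with_quote.split(paragraph)
print(quote_parts)
```

Because this variant consumes the punctuation and quote as part of the separator, they are dropped from the resulting pieces.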

A similar question was asked about PHP: "php sentence boundaries detection". My answer there handles exceptions such as "Mr.", "Mrs." and "Jr.". I've adapted that regex to work with Python, which places more restrictions on lookbehinds (each lookbehind must match a fixed-width string). Here is a modified and tested version of your script that uses the new regex:

def splitParagraphIntoSentences(paragraph):
    import re
    sentenceEnders = re.compile(r"""
        # Split sentences on whitespace between them.
        (?:               # Group for two positive lookbehinds.
          (?<=[.!?])      # Either an end of sentence punct,
        | (?<=[.!?]['"])  # or end of sentence punct and quote.
        )                 # End group of two positive lookbehinds.
        (?<!  Mr\.   )    # Don't end sentence on "Mr."
        (?<!  Mrs\.  )    # Don't end sentence on "Mrs."
        (?<!  Jr\.   )    # Don't end sentence on "Jr."
        (?<!  Dr\.   )    # Don't end sentence on "Dr."
        (?<!  Prof\. )    # Don't end sentence on "Prof."
        (?<!  Sr\.   )    # Don't end sentence on "Sr."
        \s+               # Split on whitespace between sentences.
        """,
        re.IGNORECASE | re.VERBOSE)
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    with open("bs.txt", 'r') as f:
        text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
    for i in mylist:
        print(i)
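The regex uses several separate lookbehinds rather than one with alternatives because Python's `re` module rejects lookbehinds whose alternatives differ in length. The following sketch shows both the rejected form and the workaround. The sample string and the variable names are mine, chosen only for illustration.

```python
import re

# A single lookbehind with alternatives of different widths is rejected
# by the re module when the pattern is compiled.
try:
    re.compile(r"(?<=[.!?]|[.!?]['\"])\s+")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re.error:", exc)

# Workaround: one fixed-width lookbehind per alternative, joined in a group.
fixed = re.compile(r"(?:(?<=[.!?])|(?<=[.!?]['\"]))\s+")
fixed_parts = fixed.split('He said "stop." Then he left. Done!')
print(fixed_parts)
```

The same restriction is why each abbreviation ("Mr.", "Mrs.", "Prof." and so on) gets its own negative lookbehind.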

You can see how it handles the special cases and it is easy to add or remove them as required. It correctly parses your example paragraph. It also correctly parses the following test paragraph (which includes more special cases):

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!"
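As a quick check, the test paragraph can be run through the same pattern. This is a compact, self-contained restatement of the regex above. The names `SENTENCE_BREAK` and `test_paragraph` are introduced here for the sketch only.

```python
import re

SENTENCE_BREAK = re.compile(r"""
    (?: (?<=[.!?]) | (?<=[.!?]['"]) )   # sentence punctuation, optionally quoted
    (?<! Mr\.  ) (?<! Mrs\. ) (?<! Jr\. )
    (?<! Dr\.  ) (?<! Prof\.) (?<! Sr\. )
    \s+                                 # whitespace between sentences
    """, re.IGNORECASE | re.VERBOSE)

test_paragraph = (
    'This is sentence one. Sentence two! Sentence three? Sentence "four". '
    'Sentence "five"! Sentence "six"? Sentence "seven." Sentence \'eight!\' '
    'Dr. Jones said: "Mrs. Smith you have a lovely daughter!"'
)

test_sentences = SENTENCE_BREAK.split(test_paragraph)
for sentence in test_sentences:
    print(sentence)
```

The paragraph splits into nine pieces, and the last one keeps "Dr. Jones" and "Mrs. Smith" together.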

Note, however, that other exceptions can still fail, as Riccardo Murri has correctly pointed out.
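One such failure is an abbreviation that is not in the exception list. The sketch below shows the effect and how adding an entry changes it. The helper `make_splitter`, the sample sentence and the "p.m." entry are all hypothetical. They are not part of the script above and not necessarily the cases Riccardo Murri had in mind.

```python
import re

def make_splitter(abbreviations):
    """Build a splitter with one fixed-width negative lookbehind per abbreviation."""
    exceptions = "".join(f"(?<!{re.escape(a)})" for a in abbreviations)
    return re.compile(
        r"""(?:(?<=[.!?])|(?<=[.!?]['"]))""" + exceptions + r"\s+",
        re.IGNORECASE)

base = ["Mr.", "Mrs.", "Jr.", "Dr.", "Prof.", "Sr."]
text = "We met Dr. Lee at 5 p.m. in the lab. It went well."

# "p.m." is not in the list, so the first sentence is cut in two.
default_parts = make_splitter(base).split(text)
print(default_parts)

# Adding "p.m." as an exception keeps the first sentence whole.
extended_parts = make_splitter(base + ["p.m."]).split(text)
print(extended_parts)
```

The trade-off is that a sentence that really does end in "p.m." is then no longer split, so any such list stays a heuristic.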
