Riccardo Murri's answer is correct, but I thought I'd throw a bit more light on the subject.
A similar question was asked about PHP: php sentence boundaries detection. My answer to that question handles exceptions such as "Mr.", "Mrs." and "Jr.". I've adapted that regex to work with Python (which places more restrictions on lookbehinds). Here is a modified and tested version of your script which uses this new regex:
def splitParagraphIntoSentences(paragraph):
    import re
    sentenceEnders = re.compile(r"""
        # Split sentences on whitespace between them.
        (?:               # Group for two positive lookbehinds.
          (?<=[.!?])      # Either an end of sentence punct,
        | (?<=[.!?]['"])  # or end of sentence punct and quote.
        )                 # End group of two positive lookbehinds.
        (?<!  Mr\.   )    # Don't end sentence on "Mr."
        (?<!  Mrs\.  )    # Don't end sentence on "Mrs."
        (?<!  Jr\.   )    # Don't end sentence on "Jr."
        (?<!  Dr\.   )    # Don't end sentence on "Dr."
        (?<!  Prof\. )    # Don't end sentence on "Prof."
        (?<!  Sr\.   )    # Don't end sentence on "Sr."
        \s+               # Split on whitespace between sentences.
        """,
        re.IGNORECASE | re.VERBOSE)
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    # Read the whole input file; the with-block closes it afterwards.
    with open("bs.txt", 'r') as f:
        text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
    for i in mylist:
        print(i)
You can see how it handles the special cases and it is easy to add or remove them as required. It correctly parses your example paragraph. It also correctly parses the following test paragraph (which includes more special cases):
This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!"
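To illustrate how exceptions might be added or removed, here is a minimal sketch that builds the same kind of pattern from a list of abbreviations and runs it on the test paragraph above. The names `ABBREVIATIONS` and `build_sentence_splitter` are hypothetical and are not part of the script; treat this as one possible arrangement rather than the definitive one.

```python
import re

# Hypothetical names for illustration; not part of the original script.
ABBREVIATIONS = ["Mr", "Mrs", "Jr", "Dr", "Prof", "Sr"]

def build_sentence_splitter(abbreviations):
    """Compile a splitter with one negative lookbehind per abbreviation."""
    # Python's re module needs fixed-width lookbehinds, so each abbreviation
    # gets its own (?<!...) rather than one alternation of mixed lengths.
    guards = "".join(r"(?<!%s\.)" % re.escape(abbr) for abbr in abbreviations)
    pattern = r"""(?:(?<=[.!?])|(?<=[.!?]['"]))""" + guards + r"\s+"
    return re.compile(pattern, re.IGNORECASE)

paragraph = (
    'This is sentence one. Sentence two! Sentence three? Sentence "four". '
    'Sentence "five"! Sentence "six"? Sentence "seven." Sentence \'eight!\' '
    'Dr. Jones said: "Mrs. Smith you have a lovely daughter!"'
)

sentences = build_sentence_splitter(ABBREVIATIONS).split(paragraph)
for sentence in sentences:
    print(sentence)
```

With this list the paragraph comes back as nine sentences, and the last one keeps "Dr." and "Mrs." intact. Adding or removing an exception is then a matter of editing the list.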
Note, however, that other exceptions can still fail, as Riccardo Murri has correctly pointed out.
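As a small illustration of that kind of failure, the sketch below runs a compact form of the same pattern on a sentence containing "St.", an abbreviation that is not in the exception list. The sample text is made up for this example, and "St." is only an illustrative choice, not necessarily one of the cases Riccardo Murri names.

```python
import re

# Same pattern as the script above, written compactly (no re.VERBOSE).
splitter = re.compile(
    r"""(?:(?<=[.!?])|(?<=[.!?]['"]))"""
    r"(?<!Mr\.)(?<!Mrs\.)(?<!Jr\.)(?<!Dr\.)(?<!Prof\.)(?<!Sr\.)\s+",
    re.IGNORECASE,
)

# "St." is a made-up example of an abbreviation the pattern does not know.
text = "We walked to St. James Park. It was closed."
pieces = splitter.split(text)
print(pieces)
# → ['We walked to St.', 'James Park.', 'It was closed.']
```

The first sentence is cut in two after "St." because no lookbehind guards against it. Each such abbreviation needs its own entry, so a fixed list is unlikely to cover every case.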