I would use what Natural Language Processing (NLP) and nltk have to offer to extract entities.
Here is an example (heavily based on this gist) that tokenizes each line from a file into sentences, POS-tags them, chunks them, and recursively collects NE (named entity) labels from every chunk. More explanation here:
    import nltk

    def extract_entity_names(t):
        """Recursively collect the text of every NE-labelled subtree."""
        entity_names = []
        # Leaf nodes are plain (word, tag) tuples and have no label() method.
        if hasattr(t, 'label') and t.label:
            if t.label() == 'NE':
                # Join the words of this chunk into a single entity name.
                entity_names.append(' '.join([child[0] for child in t]))
            else:
                for child in t:
                    entity_names.extend(extract_entity_names(child))
        return entity_names

    with open('sample.txt', 'r') as f:
        for line in f:
            sentences = nltk.sent_tokenize(line)
            tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
            tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
            # binary=True labels every named entity simply as 'NE'.
            chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

            entities = []
            for tree in chunked_sentences:
                entities.extend(extract_entity_names(tree))

            print(entities)
For the sample.txt containing:
    Denmark, CET
    Location is Devon, England, GMT time zone
    Australia. Australian Eastern Standard Time. +10h UTC.
    My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
    For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)
It prints:
    ['Denmark', 'CET']
    ['Location', 'Devon', 'England', 'GMT']
    ['Australia', 'Australian Eastern Standard Time']
    ['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
    ['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']
The output is not ideal, but might be a good start for you.
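If you need to narrow the results to locations only, one option (a sketch built on the same pipeline, not part of the code above) is to drop binary=True so the chunker emits typed labels such as GPE, PERSON and ORGANIZATION, and keep only the location-like subtrees:

    import nltk

    def extract_locations(tagged_sentences):
        """Collect chunks whose label marks a location (GPE in NLTK's scheme)."""
        locations = []
        # binary=False (the default) yields typed labels instead of plain 'NE'.
        for tree in nltk.ne_chunk_sents(tagged_sentences, binary=False):
            for subtree in tree.subtrees():
                if subtree.label() == 'GPE':
                    locations.append(' '.join(word for word, tag in subtree.leaves()))
        return locations

With the sample above this should keep entries like 'Denmark' and 'Devon' while dropping most of the time-zone strings, though the off-the-shelf chunker does mislabel entities from time to time.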