Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.3k views
in Technique[技术] by (71.8m points)

nlp - Fast/Optimize N-gram implementations in python

Which ngram implementation is fastest in python?

I've tried to profile nltk's vs scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):

from nltk.util import ngrams as nltkngram
import this, time

def zipngram(text,n=2):
  return zip(*[text.split()[i:] for i in range(n)])

text = this.s

start = time.time()
nltkngram(text.split(), n=2)
print time.time() - start

start = time.time()
zipngram(text, n=2)
print time.time() - start

[out]

0.000213146209717
6.50882720947e-05

Is there any faster implementation for generating ngrams in python?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Some attempts with some profiling. I thought using generators could improve the speed here. But the improvement was not noticeable compared to a slight modification of the original. But if you don't need the full list at the same time, the generator functions should be faster.

import timeit
from itertools import tee, izip, islice

def isplit(source, sep):
    sepsize = len(sep)
    start = 0
    while True:
        idx = source.find(sep, start)
        if idx == -1:
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + sepsize

def pairwise(iterable, n=2):
    return izip(*(islice(it, pos, None) for pos, it in enumerate(tee(iterable, n))))

def zipngram(text, n=2):
    return zip(*[text.split()[i:] for i in range(n)])

def zipngram2(text, n=2):
    words = text.split()
    return pairwise(words, n)


def zipngram3(text, n=2):
    words = text.split()
    return zip(*[words[i:] for i in range(n)])

def zipngram4(text, n=2):
    words = isplit(text, ' ')
    return pairwise(words, n)


s = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
s = s * 10 ** 3

res = []
for n in range(15):

    a = timeit.timeit('zipngram(s, n)', 'from __main__ import zipngram, s, n', number=100)
    b = timeit.timeit('list(zipngram2(s, n))', 'from __main__ import zipngram2, s, n', number=100)
    c = timeit.timeit('zipngram3(s, n)', 'from __main__ import zipngram3, s, n', number=100)
    d = timeit.timeit('list(zipngram4(s, n))', 'from __main__ import zipngram4, s, n', number=100)

    res.append((a, b, c, d))

a, b, c, d = zip(*res)

import matplotlib.pyplot as plt

plt.plot(a, label="zipngram")
plt.plot(b, label="zipngram2")
plt.plot(c, label="zipngram3")
plt.plot(d, label="zipngram4")
plt.legend(loc=0)
plt.show()

For this test data, zipngram2 and zipngram3 seems to be the fastest by a good margin.

enter image description here


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...