
michaelfester/mastodon: A simple next-word prediction engine


Open source project name:

michaelfester/mastodon

Open source project URL:

https://github.com/michaelfester/mastodon

Programming language:

D 62.4%

Introduction:

Mastodon

A simple next-word prediction engine

Quick start

# Fetch a sample corpus
$ mkdir data
$ mkdir data/samples
$ curl http://norvig.com/big.txt -o data/samples/big.txt

# Generate stats using NSP
$ mkdir data/output
$ cd scripts
$ ./generate_stats.sh ../data/samples/big.txt ../data/output/

# Create binary dictionaries
$ cd ..
$ mkdir dictionaries
$ mkdir dictionaries/test
$ cd scripts
$ python makedict.py -u ../data/output/unigrams.txt -n ../data/output/ngrams2.ll,../data/output/ngrams3.ll,../data/output/ngrams4.ll -o ../dictionaries/test/big.dict

# Create binary dictionaries for unit tests
$ python makedict.py -t
$ python unittests.py
$ cd ../cpp
$ make test

Generating statistics

To create a binary dictionary, we need statistics produced by the N-Gram Statistics Package (NSP), available at http://www.d.umn.edu/~tpederse/nsp.html. The script generate_stats.sh in the scripts/ folder runs NSP for this purpose.

A sample corpus can be found at http://norvig.com/big.txt.

$ curl http://norvig.com/big.txt -o data/samples/big.txt

We can generate the desired statistics in the following way:

$ cd scripts
$ ./generate_stats.sh INPUT_FILE OUTPUT_DIR

Unigrams

The script generates a simple word frequency list, unigrams.txt, in OUTPUT_DIR. Each line has the form weight unigram. Example output:

79377 the
39997 of
38076 and
28604 to
21780 in
20910 a
...

The weight is simply the number of occurrences of the corresponding word in the corpus.
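As a minimal sketch of how such a frequency list could be read back, the snippet below parses lines of the form weight unigram into a dictionary. The function name parse_unigrams is hypothetical and not part of Mastodon; the sample lines are taken from the example output above.

```python
import io


def parse_unigrams(lines):
    """Parse 'weight unigram' lines into a {word: count} dict.

    Hypothetical helper for illustration; not part of Mastodon.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines (for example a trailing "...").
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        weight, word = parts
        counts[word] = int(weight)
    return counts


# Three lines copied from the example output above.
sample = io.StringIO("79377 the\n39997 of\n38076 and\n")
unigrams = parse_unigrams(sample)

# Relative frequency of "the" within this three-line sample only.
total = sum(unigrams.values())
the_share = unigrams["the"] / total
```

Passing an open file object instead of the StringIO sample should work the same way, since the function only iterates over lines.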

N-grams

The script then generates lists of bi-, tri-, and four-grams (ngrams2.ll, ngrams3.ll, ngrams4.ll, also located in OUTPUT_DIR). Each line has the form unigram<>unigram<>...<>rank weight (we ignore rank for now). Example output:

of<>the<>2 25053.6988
in<>the<>6 10335.9606
did<>not<>8 9798.6723
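The line format can be split apart with a few string operations. The sketch below assumes every line looks exactly like the examples above (words and rank joined by <>, then whitespace, then the weight); parse_ngram_line is a hypothetical name, not a function from Mastodon or NSP.

```python
def parse_ngram_line(line):
    """Split 'w1<>w2<>...<>rank weight' into (words, rank, weight).

    Hypothetical helper; assumes the exact layout shown in the examples.
    """
    # The weight is the last whitespace-separated field.
    fields, weight = line.rsplit(None, 1)
    # Everything before the final "<>" is a word; the last item is the rank.
    *words, rank = fields.split("<>")
    return tuple(words), int(rank), float(weight)


words, rank, weight = parse_ngram_line("of<>the<>2 25053.6988")
```

The same function handles tri- and four-grams, because the number of words is taken from however many <> separators the line contains.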

Generating dictionaries

To generate a binary dictionary from the NSP output, use the makedict.py script in the python/ folder. Example usage:

$ python makedict.py -u UNIGRAM_FILE -n BIGRAM_FILE,TRIGRAM_FILE,FOURGRAM_FILE -o OUTPUT_FILE

Using dictionaries

Implementations in Python and C++ are currently available for loading a binary dictionary and querying it for:

  • Corrections
  • Completions (Python only)
  • Next-word predictions

Python

Here is a simple usage example in Python:

bindict = BinaryDictionary.from_file('../dictionaries/test/test.dict')
bindict.get_predictions(['hello']) # => [('there',10),('sir',3)]
bindict.get_corrections('yuur')    # => ['your','you','year']
bindict.get_completions('yo', 2)   # => ['you','your']
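To illustrate what these queries return conceptually, here is a toy in-memory stand-in. It is not Mastodon's BinaryDictionary and does not read the binary format: the class name ToyModel, its internals, and the sample counts are all made up for illustration, and treating the second argument of get_completions as a result limit is an assumption based on the example above.

```python
from collections import defaultdict


class ToyModel:
    """Illustrative stand-in; not Mastodon's BinaryDictionary."""

    def __init__(self, unigrams, bigrams):
        self.unigrams = dict(unigrams)      # word -> count
        self.followers = defaultdict(dict)  # previous word -> {next word: weight}
        for (prev, nxt), weight in bigrams.items():
            self.followers[prev][nxt] = weight

    def get_predictions(self, context):
        """Rank candidate next words by bigram weight after the last context word."""
        if not context:
            return []
        candidates = self.followers.get(context[-1], {})
        return sorted(candidates.items(), key=lambda kv: -kv[1])

    def get_completions(self, prefix, limit):
        """Return up to `limit` words starting with `prefix`, most frequent first."""
        matches = [w for w in self.unigrams if w.startswith(prefix)]
        matches.sort(key=lambda w: -self.unigrams[w])
        return matches[:limit]


# Made-up sample data, chosen to mirror the usage example above.
model = ToyModel(
    unigrams={"you": 50, "your": 20, "year": 5},
    bigrams={("hello", "there"): 10, ("hello", "sir"): 3},
)
predictions = model.get_predictions(["hello"])
completions = model.get_completions("yo", 2)
```

This sketch only looks at the last word of the context; a dictionary built from tri- and four-grams could use longer contexts as well.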

C++

Here is a simple usage example in C++:

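BinaryDictionary bindict;
bindict.fromFile("../dictionaries/test/test.dict");

// Next-word predictions for the phrase "how are"
string phrase[] = {"how", "are"};
vector<weighted_string> predictionHolder;
vector<weighted_string> predictions = bindict.getPredictions(phrase, 2, predictionHolder, 4);

// Corrections for a single word; a separate holder avoids redeclaring the same variable
vector<weighted_string> correctionHolder;
vector<weighted_string> corrections = bindict.getCorrections("you", correctionHolder, 100);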

Note that querying for word completions is not yet implemented in C++.

Unit tests

The unit tests are designed to be used with a simple dictionary, located at dictionaries/test/test.dict, and generated using the -t option:

$ python makedict.py -t

Python

The Python unit tests use the unittest module, and are available in python/unittests.py:

$ python unittests.py

C++

The C++ unit tests, located at cpp/tests/unit/test.cpp, are based on the UnitTest++ framework (included). Simply use the provided Makefile in the cpp folder to run the tests:

$ make test


License

Mastodon is released under the MIT license. See LICENSE.md.



