# Fetch a sample corpus
$ mkdir data
$ mkdir data/samples
$ curl http://norvig.com/big.txt -o data/samples/big.txt
# Generate stats using NSP
$ mkdir data/output
$ cd scripts
$ ./generate_stats.sh ../data/samples/big.txt ../data/output/
# Create binary dictionaries
$ cd ..
$ mkdir dictionaries
$ mkdir dictionaries/test
$ cd scripts
$ python makedict.py -u ../data/output/unigrams.txt -n ../data/output/ngrams2.ll,../data/output/ngrams3.ll,../data/output/ngrams4.ll -o ../dictionaries/test/big.dict
# Create binary dictionaries for unit tests
$ python makedict.py -t
$ python unittests.py
$ cd ../cpp
$ make test
Generating statistics
To create a binary dictionary, we need statistics produced by the N-Gram Statistics Package (NSP), available at http://www.d.umn.edu/~tpederse/nsp.html. The script generate_stats.sh in the scripts/ folder runs NSP and writes these statistics.
We can generate the desired statistics in the following way:
$ cd scripts
$ ./generate_stats.sh INPUT_FILE OUTPUT_DIR
Unigrams
The script generates a simple word frequency list, unigrams.txt, in OUTPUT_DIR. Each line has the form weight unigram. Example output:
79377 the
39997 of
38076 and
28604 to
21780 in
20910 a
...
The weight is simply the number of occurrences of the corresponding word in the corpus.
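As an illustration of the weight unigram format, the following Python sketch reads such lines into a dictionary. It is not part of the project: the function name parse_unigrams is hypothetical, and the sketch assumes that each non-empty line holds an integer weight, whitespace, and then the word.

```python
def parse_unigrams(lines):
    """Parse 'weight unigram' lines into a {word: count} dictionary.

    Hypothetical helper for illustration; not part of the project's scripts.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split once on whitespace: the weight comes first, the word second.
        weight, word = line.split(None, 1)
        counts[word] = int(weight)
    return counts


# The first lines of the example output shown above.
sample = ["79377 the", "39997 of", "38076 and"]
counts = parse_unigrams(sample)
total = sum(counts.values())
print(counts["the"], total)
```

Because the function accepts any iterable of lines, an open file object for the unigram file in OUTPUT_DIR can be passed to it directly.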
N-grams
The script then generates lists of bi-, tri-, and four-grams (ngrams2.ll, ngrams3.ll, ngrams4.ll, also located in OUTPUT_DIR) of the form unigram<>unigram<>...<>rank weight (we ignore rank for now). Example output:
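The line format can also be read programmatically. The Python sketch below is an illustration only: the function name parse_ngram_line is hypothetical, the sample line is made up rather than taken from real NSP output, and the sketch assumes that the final <>-separated field starts with an integer rank followed by a numeric weight.

```python
def parse_ngram_line(line):
    """Split 'w1<>w2<>...<>rank weight' into (words, rank, weight).

    Hypothetical helper for illustration; not part of the project's scripts.
    """
    fields = line.strip().split("<>")
    words = tuple(fields[:-1])   # every field except the last is a word
    tail = fields[-1].split()    # the last field holds 'rank weight'
    rank = int(tail[0])
    weight = float(tail[1])
    return words, rank, weight


# Made-up sample line, used only to exercise the parser.
words, rank, weight = parse_ngram_line("of<>the<>1 250.5")
print(words, rank, weight)
```

Any extra fields after the weight are ignored by this sketch, so it should be checked against the actual .ll files before being relied on.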
Note that querying for word completions is not yet implemented in C++.
Unit tests
The unit tests are designed to be used with a simple dictionary, located at dictionaries/test/test.dict, and generated using the -t option:
$ python makedict.py -t
Python
The Python unit tests use the unittest module, and are available in python/unittests.py:
$ python unittests.py
C++
The C++ unit tests, located at cpp/tests/unit/test.cpp, are based on the UnitTest++ framework (included). Simply use the provided Makefile in the cpp folder to run the tests:
$ make test
License
Mastodon is released under the MIT license. See LICENSE.md.