Other features: dropout and the ability to train monolingual language models.
End-to-end pipeline: scripts to preprocess data and compute evaluation scores.
Citations:
If you make use of this code in your research, please cite our paper:
@InProceedings{luong-pham-manning:2015:EMNLP,
  author    = {Luong, Minh-Thang and Pham, Hieu and Manning, Christopher D.},
  title     = {Effective Approaches to Attention-based Neural Machine Translation},
  booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
  year      = {2015},
}
With contributions from:
Hieu Pham [email protected] -- beam-search decoder.
Files
  README.md   - this file
  code/       - main Matlab code
    trainLSTM.m - train models
    testLSTM.m  - decode models
  data/       - toy data
  scripts/    - utility scripts
The code directory further divides into sub-directories:
basic/: defines basic functions such as sigmoid and prime; also includes an efficient way to aggregate embeddings.
layers/: defines various layers such as attention and LSTM.
misc/: things that we haven't categorized yet.
preprocess/: data handling and preprocessing.
print/: prints results and logs for debugging purposes.
The simplest call to trainLSTM (see the syntax and example below) trains a very basic model with all the default settings. We set 'isResume' to 0 so that a new model is trained each time you run the command instead of loading an existing one. See trainLSTM.m for more.
(To run directly from your terminal, check out scripts/train.sh.)
The syntax is:
trainLSTM(trainPrefix,validPrefix,testPrefix,srcLang,tgtLang,srcVocabFile,tgtVocabFile,outDir,varargin)
% Arguments:
% trainPrefix, validPrefix, testPrefix: expect files trainPrefix.srcLang,
% trainPrefix.tgtLang. Similarly for validPrefix and testPrefix.
% These data files contain sequences of integers, one sequence per line.
% srcLang, tgtLang: languages, e.g. en, de.
% srcVocabFile, tgtVocabFile: one word per line.
% outDir: output directory.
% varargin: other optional arguments.
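As an illustration, a training call might look like the following sketch. The data paths, language codes, and output directory are hypothetical placeholders (not necessarily the names shipped in data/), and the 'isResume' option is assumed to be passed as a MATLAB name-value pair via varargin.

% Hypothetical example call; all paths below are placeholders.
trainLSTM('../data/train', '../data/valid', '../data/test', ...
          'de', 'en', '../data/vocab.de', '../data/vocab.en', ...
          '../output/basic', 'isResume', 0);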
The trainer code outputs logs such as:
1, 20, 6.38K, 1, 6.52, gN=11.62
which means: at epoch 1, mini-batch 20, the training speed is 6.38K words/s, the learning rate is 1, the training cost is 6.52, and the gradient norm is 11.62.
Once in a while, the code also evaluates on the valid and test sets. The evaluation log reports additional information such as the current test perplexity (34.40 here), the valid / test costs, and the average absolute values of the model parameters.
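For reference, perplexity and the per-word cost are related by exponentiation, assuming the cost is the average per-word negative log-likelihood in nats (a common convention; the exact definition in this code may differ):

% Assumed relation between the reported cost and perplexity (illustrative only).
testCost = 3.538;                % hypothetical average per-word negative log-likelihood
testPerplexity = exp(testCost);  % ~34.4, matching the perplexity quoted above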
Decode with beamSize 2, collect at most 10 translations (stackSize 10), and use batchSize 1; see the example after the syntax below.
Note that testLSTM implicitly decodes the test file specified during training. To specify a different test file, use 'testPrefix'; see testLSTM.m for more.
(To run directly from your terminal, check out scripts/test.sh.)
Syntax:
testLSTM(modelFiles, beamSize, stackSize, batchSize, outputFile, varargin)
% Arguments:
% modelFiles: single or multiple models to decode. Multiple models are
% separated by commas.
% beamSize: number of hypotheses kept at each time step.
% stackSize: number of translations retrieved.
% batchSize: number of sentences decoded simultaneously. We only ensure
% accuracy of batchSize = 1 for now.
% outputFile: output translation file.
% varargin: other optional arguments.
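As a sketch of the beamSize 2 / stackSize 10 / batchSize 1 setting described above: the model and output paths below are hypothetical placeholders, and the 'testPrefix' option in the second call is assumed to follow the usual name-value convention.

% Hypothetical example; all paths below are placeholders.
testLSTM('../output/basic/model.mat', 2, 10, 1, '../output/basic/translations.txt');

% Decoding a different test file via the assumed 'testPrefix' name-value option.
testLSTM('../output/basic/model.mat', 2, 10, 1, '../output/basic/translations.other.txt', ...
         'testPrefix', '../data/other_test');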
Apart from the "obvious" hyperparameters, we scale the gradients whenever their norm, averaged by the batch size (128), is greater than 5. After training for 5 epochs, we start halving the learning rate each epoch. For further control over the learning-rate schedule, see the 'epochFraction' and 'finetuneRate' options in trainLSTM.m.
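The gradient-scaling rule above amounts to the following minimal, self-contained sketch; the variable names are illustrative and do not correspond to the repo's code.

% Illustrative sketch of gradient scaling; not the repo's actual code.
grad = randn(1000, 1) * 50;   % stand-in for a mini-batch's accumulated gradient
batchSize = 128;
maxGradNorm = 5;
gradNorm = norm(grad);
if gradNorm / batchSize > maxGradNorm
  % rescale so that gradNorm / batchSize equals maxGradNorm
  grad = grad * (maxGradNorm * batchSize) / gradNorm;
end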
Here, we also use source reversing ('isReverse') and the input-feeding approach ('feedInput') described in the paper. Other attention architectures can be specified as follows:
% attnFunc: attention architecture:
%   0: no attention
%   1: global attention
%   2: local attention + monotonic alignments
%   4: local attention + regression for absolute positions (multiplied by distWeights)
% attnOpt: decide how we generate the alignment weights:
%   0: location-based
%   1: content-based, dot product
%   2: content-based, general dot product
%   3: content-based, concat (Montreal style)
'isResume' is set to 0 to avoid loading existing models (which happens by default), so that you can try different attention architectures.
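For instance, a global-attention model with general content-based alignment scores might be trained as in the sketch below. The paths are hypothetical placeholders, and the options are assumed to be passed as MATLAB name-value pairs, following the list above.

% Hypothetical example; paths are placeholders and option values follow the list above.
trainLSTM('../data/train', '../data/valid', '../data/test', ...
          'de', 'en', '../data/vocab.de', '../data/vocab.en', ...
          '../output/attn', 'attnFunc', 1, 'attnOpt', 2, ...
          'isReverse', 1, 'feedInput', 1, 'isResume', 0);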
Then compare with the provided grad-check outputs in data/grad_checks.txt; they should look similar.
Note: the run_grad_checks.sh script runs many different configurations. For many configurations, we set 'initRange' to a large value (10), so you will notice large total gradient differences. This is meant to expose subtle mistakes; if the total diff is < 10, you can mostly be assured that the gradients are correct. We do note that with attnFunc=4, attnOpt=1, the diff is quite large; this remains to be checked, though the model seems to work in practice.
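For reference, a gradient check compares analytic gradients against central finite differences. The generic sketch below is illustrative only and is not the repo's grad-check code; all names in it are made up.

% Generic finite-difference gradient check (illustrative; not this repo's code).
f = @(w) 0.5 * sum(w .^ 2);        % stand-in loss whose analytic gradient is w
w = randn(5, 1);
analyticGrad = w;
delta = 1e-4;
numericGrad = zeros(size(w));
for i = 1:numel(w)
  e = zeros(size(w));
  e(i) = delta;
  numericGrad(i) = (f(w + e) - f(w - e)) / (2 * delta);
end
totalDiff = sum(abs(analyticGrad - numericGrad));  % tiny when the analytic gradient is correct
fprintf('total gradient diff = %g\n', totalDiff);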