Open-source project: sentencepiece
Mirror repository: https://gitee.com/mirrors/sentencepiece

# SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for neural network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/post-processing.

This is not an official Google product.

## Technical highlights
For those unfamiliar with SentencePiece as a software/algorithm, one can read a gentle introduction here.

## Comparisons with other implementations
Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

## Overview

### What is SentencePiece?

SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]. Here are the high level differences from other implementations.

#### The number of unique tokens is predetermined

Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt, which uses the number of merge operations. The number of merge operations is a BPE-specific parameter and is not applicable to other segmentation algorithms, including unigram, word, and character.

#### Trains from raw sentences

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it makes preprocessing complicated, as we have to run language-dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese, where no explicit spaces exist between words.

#### Whitespace is treated as a basic symbol

The first step of Natural Language processing is text tokenization. For example, a standard English tokenizer would segment the text "Hello world." into the following three tokens.

> [Hello] [World] [.]
One observation is that the original input and the tokenized sequence are NOT reversibly convertible. For instance, the information that there is no space between "World" and "." is dropped from the tokenized sequence, since e.g., Tokenize("World.") == Tokenize("World ."). SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.

> Hello▁World.
Then, this text is segmented into small pieces, for example:

> [Hello] [▁Wor] [ld] [.]
Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities:

```python
detokenized = ''.join(pieces).replace('▁', ' ')
```

This feature makes it possible to perform detokenization without relying on language-specific resources.

Note that we cannot apply the same lossless conversion when splitting the sentence with standard word segmenters, since they treat the whitespace as a special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.
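To make the losslessness concrete, here is a minimal, self-contained sketch using the example pieces above (no trained model needed; the piece list is taken from the segmentation shown earlier):

```python
# Pieces produced by the segmentation example above; "▁" (U+2581)
# marks where a whitespace character appeared in the raw text.
pieces = ['Hello', '▁Wor', 'ld', '.']

# Lossless detokenization: concatenate and map the meta symbol back to a space.
detokenized = ''.join(pieces).replace('▁', ' ')
assert detokenized == 'Hello World.'

# A standard word segmenter is not reversible: joining its tokens with spaces
# inserts a space before "." that was not in the original text.
tokens = ['Hello', 'World', '.']
assert ' '.join(tokens) != 'Hello World.'
print(detokenized)
```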
## Subword regularization and BPE-dropout

Subword regularization [Kudo.] and BPE-dropout [Provilkov et al.] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as the robustness of NMT models.

To enable subword regularization, you need to integrate the SentencePiece library (C++/Python) into the NMT system to sample one segmentation for each parameter update, which is different from the standard off-line data preparation. Here is an example with the Python library. You can find that 'New York' is segmented differently on each encode call when enable_sampling=True.

```python
>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
```

## Installation

### Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

```
% pip install sentencepiece
```

For more detail, see Python module.
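With the Python module installed, the per-update sampling described above can be wired into an NMT training loop. A minimal sketch; `train_step` and the two-sentence corpus are hypothetical placeholders for your NMT system, and `spm.model` is assumed to be an already-trained model file:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm.model')
corpus = ['I saw a girl with a telescope.', 'New York is cold in winter.']

def train_step(token_ids):
    # Hypothetical placeholder: one parameter update of your NMT model.
    pass

for epoch in range(3):
    for sentence in corpus:
        # Sample a fresh segmentation for every parameter update, so the
        # model sees virtually augmented data (subword regularization).
        ids = sp.encode(sentence, out_type=int, enable_sampling=True,
                        alpha=0.1, nbest_size=-1)
        train_step(ids)
```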
### Build and install SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

* cmake
* C++11 compiler
* gperftools library (optional; a 10-40% performance improvement can be obtained with tcmalloc)

On Ubuntu, the build tools can be installed with apt-get:

```
% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
```

Then, you can build and install the command line tools as follows.

```
% git clone https://github.com/google/sentencepiece.git
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v
```

On OSX/macOS, replace the last command with sudo update_dyld_shared_cache.

### Build and install using vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

```
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece
```

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

## Usage instructions

### Train SentencePiece Model

```
% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
```

* `--input`: one-sentence-per-line raw corpus file. No need to run a tokenizer, normalizer, or preprocessor; by default, SentencePiece normalizes the input with Unicode NFKC.
* `--model_prefix`: output model name prefix. `<model_name>.model` and `<model_name>.vocab` are generated.
* `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000.
* `--character_coverage`: amount of characters covered by the model. Good defaults are 0.9995 for languages with rich character sets like Japanese or Chinese, and 1.0 for languages with small character sets.
* `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pre-tokenized when using the `word` type.

Use the --help flag to display all parameters for training.
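Training can also be run from the Python wrapper with the same parameters; a minimal sketch, where `corpus.txt` is a hypothetical one-sentence-per-line raw file:

```python
import sentencepiece as spm

# Train directly from a raw corpus file; produces m.model and m.vocab.
spm.SentencePieceTrainer.train(
    input='corpus.txt',      # hypothetical one-sentence-per-line file
    model_prefix='m',
    vocab_size=8000,
    character_coverage=1.0,
    model_type='unigram',
)
```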
### Encode raw text into sentence pieces/ids

```
% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output
```

Use the --extra_options flag to insert the BOS/EOS markers or reverse the input sequence.

```
% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
```

SentencePiece supports nbest segmentation and segmentation sampling with the --output_format=(nbest|sample)_(piece|id) flags.

```
% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
```

### Decode sentence pieces/ids into raw text

```
% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output
```

Use the --extra_options flag to decode the text in reverse order.

```
% spm_decode --extra_options=reverse < input > output
```

### End-to-End Example

```
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
```

You can see that the original input sentence is restored from the vocabulary id sequence.

### Export vocabulary list

```
% spm_export_vocab --model=<model_file> --output=<output file>
```
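For reference, a minimal Python sketch of the same round trip, reusing the `m.model` file produced in the end-to-end example above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

text = 'I saw a girl with a telescope.'
pieces = sp.encode(text, out_type=str)   # subword pieces
ids = sp.encode(text, out_type=int)      # vocabulary ids

# Decoding either representation restores the original sentence exactly.
assert sp.decode(pieces) == text
assert sp.decode(ids) == text
print(pieces)
print(ids)
```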
### Redefine special meta tokens

By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.

```
% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...
```

When setting a -1 id, e.g., bos_id=-1, this special token is disabled. Note that the unknown id cannot be disabled. If you want to assign other special tokens, please see Use custom symbols.

### Vocabulary restriction
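The resulting mapping can be verified from Python; a minimal sketch, assuming a model trained with the redefined ids above (`m.model` is a placeholder filename):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Ids reflect whatever --bos_id/--eos_id/--unk_id were used at training time.
print(sp.bos_id(), sp.eos_id(), sp.unk_id())

# Map the special ids back to their pieces (assumes BOS/EOS were not disabled).
print(sp.id_to_piece(sp.bos_id()), sp.id_to_piece(sp.eos_id()))
```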
spm_encode accepts a --vocabulary and a --vocabulary_threshold option so that spm_encode will only produce symbols which also appear in the vocabulary (with at least some frequency). The usage is basically the same as that of subword-nmt. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model, and get the resulting vocabulary for each:

```
% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2
```
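The same restriction can also be applied in-process from Python instead of via spm_encode's --vocabulary flag (the CLI form is shown next). A minimal sketch, assuming the SetVocabulary binding exposed by the Python wrapper and a tab-separated piece/frequency vocab file as produced by --generate_vocabulary; `vocab.L1` stands in for `{vocab_file}.L1`:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm.model')

# Keep only pieces whose corpus frequency clears the threshold
# (vocab file format assumed: piece<TAB>frequency per line).
threshold = 50
vocab = []
with open('vocab.L1', encoding='utf-8') as f:
    for line in f:
        piece, freq = line.rstrip('\n').split('\t')
        if int(freq) >= threshold:
            vocab.append(piece)

# Restrict segmentation: pieces outside the list are never produced,
# and the encoder falls back to smaller in-vocabulary units instead.
sp.SetVocabulary(vocab)
print(sp.encode('I saw a girl with a telescope.', out_type=str))
```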
Then segment the train/test corpus with the --vocabulary option:

```
% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
```

## Advanced topics