
sentencepiece's Introduction

SentencePiece


SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

This is not an official Google product.

Technical highlights

  • Purely data driven: SentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization (Moses tokenizer/MeCab/KyTea) is not always required.
  • Language independent: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
  • Multiple subword algorithms: BPE [Sennrich et al.] and unigram language model [Kudo] are supported.
  • Subword regularization: SentencePiece implements subword sampling for subword regularization and BPE-dropout which help to improve the robustness and accuracy of NMT models.
  • Fast and lightweight: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
  • Self-contained: The same tokenization/detokenization is obtained as long as the same model file is used.
  • Direct vocabulary id generation: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
  • NFKC-based normalization: SentencePiece performs NFKC-based text normalization.

For those unfamiliar with SentencePiece as a software/algorithm, one can read a gentle introduction here.

Comparisons with other implementations

| Feature | SentencePiece | subword-nmt | WordPiece |
|---------|---------------|-------------|-----------|
| Supported algorithm | BPE, unigram, char, word | BPE | BPE* |
| OSS? | Yes | Yes | Google internal |
| Subword regularization | Yes | No | No |
| Python Library (pip) | Yes | No | N/A |
| C++ Library | Yes | No | N/A |
| Pre-segmentation required? | No | Yes | Yes |
| Customizable normalization (e.g., NFKC) | Yes | No | N/A |
| Direct id generation | Yes | No | N/A |

Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

Overview

What is SentencePiece?

SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo]. Here are the high-level differences from other implementations.

The number of unique tokens is predetermined

Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt that uses the number of merge operations. The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.

Trains from raw sentences

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it makes preprocessing complicated, as we have to run language-dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese, where no explicit spaces exist between words.

Whitespace is treated as a basic symbol

The first step of natural language processing is text tokenization. For example, a standard English tokenizer would segment the text "Hello World." into the following three tokens.

[Hello] [World] [.]

One observation is that the original input and the tokenized sequence are NOT reversibly convertible. For instance, the information that there is no space between “World” and “.” is dropped from the tokenized sequence, since, e.g., Tokenize(“World.”) == Tokenize(“World .”).

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.

  detokenized = ''.join(pieces).replace('▁', ' ')

This feature makes it possible to perform detokenization without relying on language-specific resources.

Note that we cannot apply the same lossless conversions when splitting the sentence with standard word segmenters, since they treat the whitespace as a special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.

  • (en) Hello world. → [Hello] [World] [.] (A space between Hello and World)
  • (ja) こんにちは世界。 → [こんにちは] [世界] [。] (No space between こんにちは and 世界)
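
To make the lossless round trip concrete, here is a minimal Python sketch; m.model is an assumed, already-trained model file, and the exact piece boundaries will differ per model.

import sentencepiece as spm

# Load an assumed, pre-trained model file; pieces shown in comments are examples only.
sp = spm.SentencePieceProcessor(model_file='m.model')

pieces = sp.encode('Hello World.', out_type=str)
print(pieces)  # e.g. ['▁Hello', '▁Wor', 'ld', '.']

# Detokenize: concatenate the pieces and map the meta symbol back to a space.
# The leading space comes from --add_dummy_prefix (true by default).
detokenized = ''.join(pieces).replace('▁', ' ').lstrip()
assert detokenized == 'Hello World.'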

Subword regularization and BPE-dropout

Subword regularization [Kudo] and BPE-dropout [Provilkov et al.] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as the robustness of NMT models.

To enable subword regularization, you need to integrate the SentencePiece library (C++/Python) into the NMT system to sample one segmentation for each parameter update, which is different from the standard off-line data preparation. Here is an example with the Python library. You can see that 'New York' is segmented differently on each SampleEncode (C++) or encode with enable_sampling=True (Python) call. The details of the sampling parameters are found in sentencepiece_processor.h.

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']

Installation

Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

pip install sentencepiece

For more details, see the Python module documentation.
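
As a quick sanity check after installation, the following sketch loads a model and segments a sentence. The model file spm.model is assumed to exist already (see the training sections below for how to create one), and the keyword-style API requires a reasonably recent release.

import sentencepiece as spm

# Load an existing model (assumed to have been trained already).
sp = spm.SentencePieceProcessor(model_file='spm.model')

text = 'I saw a girl with a telescope.'
print(sp.encode(text, out_type=str))  # subword pieces
print(sp.encode(text, out_type=int))  # vocabulary ids
print(sp.decode(sp.encode(text)))     # round trip back to raw text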

Build and install SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

  • cmake
  • C++11 compiler
  • gperftools library (optional, 10-40% performance improvement can be obtained.)

On Ubuntu, the build tools can be installed with apt-get:

% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then, you can build and install command line tools as follows.

% git clone https://github.com/google/sentencepiece.git 
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

On OSX/macOS, replace the last command with sudo update_dyld_shared_cache

Build and install using vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Download and install SentencePiece from signed released wheels

You can download the wheel from the GitHub releases page. We generate SLSA3 signatures using the OpenSSF's slsa-framework/slsa-github-generator during the release process. To verify a release binary:

  1. Install the verification tool from slsa-framework/slsa-verifier#installation.
  2. Download the provenance file attestation.intoto.jsonl from the GitHub releases page.
  3. Run the verifier:
slsa-verifier -artifact-path <the-wheel> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>

Once the verification succeeds, install the wheel:

pip install wheel_file.whl

Usage instructions

Train SentencePiece Model

% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
  • --input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
  • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
  • --vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
  • --character_coverage: amount of characters covered by the model; good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
  • --model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.

Use --help flag to display all parameters for training, or see here for an overview.
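
The same training options are also exposed through the Python wrapper. Below is a rough sketch of the equivalent call; the keyword-argument form requires a reasonably recent sentencepiece release, and corpus.txt is a placeholder file name.

import sentencepiece as spm

# Mirrors the spm_train invocation above; file names are placeholders.
spm.SentencePieceTrainer.train(
    input='corpus.txt',          # one-sentence-per-line raw corpus
    model_prefix='m',            # writes m.model and m.vocab
    vocab_size=8000,
    character_coverage=1.0,
    model_type='unigram',        # or 'bpe', 'char', 'word'
)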

Encode raw text into sentence pieces/ids

% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output

Use --extra_options flag to insert the BOS/EOS markers or reverse the input sequence.

% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)

SentencePiece supports nbest segmentation and segmentation sampling with --output_format=(nbest|sample)_(piece|id) flags.

% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
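
The same nbest and sampling modes are available from the Python wrapper. A small sketch, assuming a trained unigram model m.model and using the classic method names of the wrapper:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# 10-best segmentations (unigram model only), cf. --output_format=nbest_piece.
for pieces in sp.NBestEncodeAsPieces('I saw a girl with a telescope.', 10):
    print(pieces)

# One sampled segmentation, cf. --output_format=sample_piece with alpha=0.5.
print(sp.SampleEncodeAsPieces('I saw a girl with a telescope.', -1, 0.5))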

Decode sentence pieces/ids into raw text

% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output

Use --extra_options flag to decode the text in reverse order.

% spm_decode --extra_options=reverse < input > output
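
Decoding is also available from the Python wrapper. A minimal sketch, assuming a trained m.model; in recent releases decode accepts either id or piece sequences:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

ids = sp.encode('I saw a girl with a telescope.')                 # vocabulary ids
pieces = sp.encode('I saw a girl with a telescope.', out_type=str)

print(sp.decode(ids))     # from ids back to raw text
print(sp.decode(pieces))  # from pieces back to raw text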

End-to-End Example

% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.

You can see that the original input sentence is restored from the vocabulary id sequence.

Export vocabulary list

% spm_export_vocab --model=<model_file> --output=<output file>

<output file> stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
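
The same information can be dumped from Python. A sketch, assuming a recent release with the snake_case accessors; the output file name is a placeholder, and the loop index doubles as the vocabulary id:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Rough equivalent of spm_export_vocab: one "piece<TAB>log probability" per
# line, with the line number acting as the vocabulary id.
with open('m_vocab.txt', 'w', encoding='utf-8') as f:
    for i in range(sp.get_piece_size()):
        f.write(f'{sp.id_to_piece(i)}\t{sp.get_score(i)}\n')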

Redefine special meta tokens

By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.

% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

When an id is set to -1, e.g., --bos_id=-1, the corresponding special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (<pad>) with --pad_id=3.

If you want to assign other special tokens, please see Use custom symbols.
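
The resulting ids can be inspected from the Python wrapper. A small sketch, assuming a trained m.model:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Ids of the special meta tokens; a value of -1 means the token is disabled.
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())

# Mapping between pieces and ids.
print(sp.piece_to_id('</s>'))
print(sp.id_to_piece(sp.unk_id()))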

Vocabulary restriction

spm_encode accepts the --vocabulary and --vocabulary_threshold options so that spm_encode will only produce symbols that also appear in the vocabulary (with at least some frequency). The background of this feature is described in the subword-nmt page.

The usage is basically the same as that of subword-nmt. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model and get the resulting vocabulary for each:

% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

The shuffle command is used just in case, because spm_train loads only the first 10M lines of the corpus by default.

Then segment the train/test corpus with the --vocabulary option:

% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
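
The same restriction can be applied from the Python wrapper. A sketch, assuming a recent release where set_vocabulary/reset_vocabulary are exposed and that the file produced with --generate_vocabulary contains tab-separated piece and frequency columns; file names are placeholders:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm.model')

# Keep only pieces seen at least `threshold` times in the L1 training data.
threshold = 50
valid = []
with open('vocab.L1', encoding='utf-8') as f:
    for line in f:
        piece, freq = line.rstrip('\n').split('\t')
        if int(freq) >= threshold:
            valid.append(piece)

sp.set_vocabulary(valid)   # out-of-vocabulary pieces are split further
print(sp.encode('I saw a girl with a telescope.', out_type=str))
sp.reset_vocabulary()      # restore the full vocabulary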

Advanced topics

sentencepiece's People

Contributors

a2va, akashhansda, akshitj1, alvations, amrzv, boba-and-beer, dependabot[bot], dongjinleekr, erasaur, felixdae, guillaumekln, h-vetinari, halmoni100, jaepil, juliusfrost, kant, kenhys, laurentsimon, mattn, mfojtak, mmistele, neubig, pnacht, resec, ryandesign, stephantul, taku910, tetsuok, tsuchm, unnonouno


sentencepiece's Issues

How to build with -std=c++11

I need to build sentencepiece with -std=c++11, so I set the environment variables shown below.

export CC=/usr/local/gcc-5.4.0/bin/gcc
export CXX=/usr/local/gcc-5.4.0/bin/g++
export CXXFLAGS="$CXXFLAGS -std=c++11"

I can see from the output that the compiler changed to the above gcc-5.4.0 (my default is gcc-4.8.5):

checking if /usr/local/gcc-5.4.0/bin/gcc supports -fno-rtti -fno-exceptions... no
checking for /usr/local/gcc-5.4.0/bin/gcc option to produce PIC... -fPIC -DPIC
checking if /usr/local/gcc-5.4.0/bin/gcc PIC flag -fPIC -DPIC works... yes
checking if /usr/local/gcc-5.4.0/bin/gcc static flag -static works... no

but it does not build with -std=c++11

I even added -std=c++11 manually to CXXFLAGS in the configure file and also in the Makefile, but it still does not build with -std=c++11.

Windows Support

What's sentencepiece's story for Windows support? Are there any future plans to support Windows?

Thank you.

Suppression of isolated ▁'s

To maximize likelihood, there are cases where subword tokens that are seen after a space (meaning the start of a "word") do not get the special underscore because they also appear in the middle of a character combination somewhere else.

Is it possible to suppress this behavior? Meaning, we don't want to have an isolated ▁ as part of the generated vocab list.

make check failed if I rewrite the unk value

This is rather a feature request than an issue.

I want to reserve small word IDs for special purposes, so I set the unk ID to 3.
const uint32 ModelInterface::kUnkID = 3;

The code was built; however, make check failed.

Fail to install Python Wrapper

When running pip install sentencepiece with Python 3.6.1, I got this error:

  building '_sentencepiece' extension
  creating build/temp.linux-x86_64-3.6
  gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/yuchang/miniconda3/include/python3.6m -c sentencepiece_wrap.cxx -o build/temp.linux-x86_64-3.6/sentencepiece_wrap.o -std=c++11 -g -O2 -I/usr/local/include
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
  sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_Load(PyObject*, PyObject*)’:
  sentencepiece_wrap.cxx:3305:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
         PyString_AsStringAndSize(obj1, &str, &str_size);
                                                       ^
  sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_LoadOrDie(PyObject*, PyObject*)’:
  sentencepiece_wrap.cxx:3347:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
         PyString_AsStringAndSize(obj1, &str, &str_size);
                                                       ^

I googled the "PyString_AsStringAndSize", some one said that it means the code does not support python3 , so I change the python version to 2.7.12, then I got:

Collecting sentencepiece
  Using cached sentencepiece-0.0.0.tar.gz
    Complete output from command python setup.py egg_info:
    Package sentencepiece was not found in the pkg-config search path.
    Perhaps you should add the directory containing `sentencepiece.pc'
    to the PKG_CONFIG_PATH environment variable
    No package 'sentencepiece' found
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-joX_PJ/sentencepiece/setup.py", line 28, in <module>
        cmd('pkg-config sentencepiece --cflags'),
      File "/tmp/pip-build-joX_PJ/sentencepiece/setup.py", line 14, in cmd
        return os.popen(line).readlines()[0][:-1].split()
    IndexError: list index out of range
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-joX_PJ/sentencepiece/

My OS is Ubuntu 16.04, with Python 3.6.1 installed by miniconda3 and Python 2.7.12 installed with the system.

Any advice or guidance would be greatly appreciated.

Question on vocab / vocab size

Hi,
I have a question.
Say I do the following with a training corpus train.de / train.en [4500000 lines each]:
I concatenate the 2 files to build the model with vocab_size=32000.
I end up with a vocab file of 32000 lines, so far so good.
I then encode train.en and train.de with the model.
If I look at the "vocabulary" of each tokenized file, I end up with more than 32000 for both of them (39345 and 38299).
What's wrong?
My understanding is that if the total number of lines is less than 10M, then the full corpus is taken into account, and therefore I should end up with 32000 at most.
Thanks.

Frequency weighted training sentences

I don't know if this is already implemented or if there's a workaround for this. Sometimes, due to the large amount of data, what we have are training sentences that are already uniquely condensed (and we just keep their frequency counts somewhere). Of course, this condensed training set does not necessarily reflect the original statistics of the occurrences of each word and may not reflect the optimal choices for the tokens.

Through experiments, I created two separate segmentation models using (1) the 10M most frequent sentences (unique), and (2) 10M random sentences from the entire set (duplicates are spread in the chronological order in which they were generated). Case (2) is of course favored by the empirical evidence suggested by the developers. However, for our task I found that the case (1) model performs better. So the experiment was inconclusive in deciding which seed set to choose next time.

I also found that the maximum size of the training set is 100M sentences, and I wanted to take advantage of this limit since we have a huge amount of data. Of course I could simply sample 100M random sentences from all the training data, but I find that approach quite empirical, especially since we're working on CJK languages.

So I guess in a nutshell:
(1) Is there a way to incorporate frequency counts of sentences?
(2) What is a statistically reliable way to condense a huge amount of data (let's say the original size is 300% of the maximum allowed), but still reflect the (approximate) statistics of the words?

terminate called after throwing an instance of 'Darts::Details::Exception'

After I train the model, and load the model...

Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sentencepiece as spm
>>> 
>>> sp = spm.SentencePieceProcessor()
>>> 
>>> sp.Load('m.model')
terminate called after throwing an instance of 'Darts::Details::Exception'
  what():  /sentencepiece/third_party/darts_clone/darts.h:1143: exception: failed to insert key: zero-length key
Aborted (core dumped)

What should I do?
Thank you.

Python wrapper build is broken for Python 3

Commit 9d82cbd breaks the build of the python wrapper for Python 3. With Python 2, it compiles without problem.

This is the error obtained for python setup.py build:

python setup.py build
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.5
copying sentencepiece.py -> build/lib.linux-x86_64-3.5
running build_ext
building '_sentencepiece' extension
creating build/temp.linux-x86_64-3.5
Traceback (most recent call last):
  File "setup.py", line 47, in <module>
    test_suite = 'sentencepiece_test.suite')
  File "/usr/lib/python3.5/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.5/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/usr/lib/python3.5/distutils/command/build.py", line 135, in run
    self.run_command(cmd_name)
  File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/noe/env/nmt/lib/python3.5/site-packages/setuptools/command/build_ext.py", line 75, in run
    _build_ext.run(self)
  File "/usr/lib/python3.5/distutils/command/build_ext.py", line 338, in run
    self.build_extensions()
  File "/usr/lib/python3.5/distutils/command/build_ext.py", line 447, in build_extensions
    self._build_extensions_serial()
  File "/usr/lib/python3.5/distutils/command/build_ext.py", line 472, in _build_extensions_serial
    self.build_extension(ext)
  File "/home/noe/env/nmt/lib/python3.5/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
    _build_ext.build_extension(self, ext)
  File "/usr/lib/python3.5/distutils/command/build_ext.py", line 532, in build_extension
    depends=ext.depends)
  File "/usr/lib/python3.5/distutils/ccompiler.py", line 574, in compile
    self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
  File "/usr/lib/python3.5/distutils/unixccompiler.py", line 118, in _compile
    extra_postargs)
  File "/usr/lib/python3.5/distutils/ccompiler.py", line 909, in spawn
    spawn(cmd, dry_run=self.dry_run)
  File "/usr/lib/python3.5/distutils/spawn.py", line 36, in spawn
    _spawn_posix(cmd, search_path, dry_run=dry_run)
  File "/usr/lib/python3.5/distutils/spawn.py", line 89, in _spawn_posix
    log.info(' '.join(cmd))
TypeError: sequence item 22: expected str instance, bytes found

Reverting back to commit 6b4daf1, everything compiles fine again.

Restarting sentencepiece on a new dataset

I have a corpus of .xz files from the common crawl. I don't have enough disk space to unzip all the files and concatenate them into a single file. Is there any way to re-start the sentencepiece model on a new corpus of text?

I'd like to loop through each of my files, unzip it to a temp file, feed it to the tokenization model, delete the temp file, and then move on to the next one.

Sentence encoder having issues with Python 2.7.6

We are running into issues using sentencepiece and the text encoder with Python 2.7.6 and TensorFlow 1.8.

module = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-lite/2")

File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/module.py", line 105, in init
self._spec = as_module_spec(spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/module.py", line 31, in as_module_spec
return native_module.load_module_spec(spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/native_module.py", line 99, in load_module_spec
path = compressed_module_resolver.get_default().get_module_path(path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/resolver.py", line 385, in get_module_path
return self._get_module_path(handle)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/resolver.py", line 467, in _get_module_path
return resolver.get_module_path(handle)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/resolver.py", line 385, in get_module_path
return self._get_module_path(handle)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/compressed_module_resolver.py", line 105, in _get_module_path
self._lock_file_timeout_sec())
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/resolver.py", line 313, in atomic_download
download_fn(handle, tmp_dir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_hub/compressed_module_resolver.py", line 101, in download
response = url_opener.open(request)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] _ssl.c:510: EOF occurred in violation of protocol>

Is the module not supported in Python 2.7? It seems to work fine in Python 3.

Symbols in sentencepieces

I am using sentencepiece in Python and have an issue with user-defined symbols.

For example, I trained with

"spm.SentencePieceTrainer.Train('--input=Stream.tsv --model_prefix=m.debug --vocab_size=1000 --input_sentence_size=100000000 --hard_vocab_limit=false --user_defined_symbols=,<
sep>,,,')"

and use

>>> sp.EncodeAsIds('<s>')
[9, 1]
>>> sp.EncodeAsIds('<pad>')
[9, 5]
>>> sp.EncodeAsIds('a<pad>')
[25, 5]

It always adds "9" (which is "") to the id list, except for "a". Is this expected? Or is there any way to remove the "" id?

protobuf issues when installing

When I run ./configure during installation, I get the following error:

./configure: line 17069: syntax error near unexpected token `PROTOBUF,'
./configure: line 17069: `PKG_CHECK_MODULES(PROTOBUF, protoc >= 2.4.0)'

If I comment out the offending line configure passes, but I receive the following error during make:

Undefined symbols for architecture x86_64:
  "google::protobuf::Message::Utf8DebugString() const", referenced from:
      std::__1::__function::__func<main::$_2, std::__1::allocator<main::$_2>, void (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in spm_encode_main.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[3]: *** [spm_encode] Error 1
make[2]: *** [all] Error 2
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

I am running macOS 10.13.1 and libprotoc 3.5.0

Saving model error with user defined tags

Hi,
First of all, I have read this closed issue about the saving model error, but I am pretty sure that my corpus is big enough (it contains 36890548 lines, and I would like to generate a 32k vocab).
Can you help me solve this issue? (Ubuntu 16.04)
This is the command I ran:

sudo spm_train --user_defined_symbols=city --input=corpus_max_t1.en --model_prefix=spm.en --vocab_size=32000 --model_type=bpe --input_sentence_size=37000000

This is the end of the output:

bpe_model_trainer.cc(254) LOG(INFO) Added: freq=370 size=31700 all=805279 active=40873 piece=▁Mystery
bpe_model_trainer.cc(163) LOG(INFO) Updating active symbols. max_freq=370 min_freq=94
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=370 size=31720 all=805278 active=40262 piece=▁multiplied
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=369 size=31740 all=805577 active=40561 piece=▁Agric
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=369 size=31760 all=805579 active=40563 piece=▁realtor
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=368 size=31780 all=805662 active=40646 piece=▁eel
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=368 size=31800 all=805899 active=40883 piece=▁Arturo
bpe_model_trainer.cc(163) LOG(INFO) Updating active symbols. max_freq=368 min_freq=93
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=368 size=31820 all=805940 active=40336 piece=▁Carlisle
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=367 size=31840 all=806037 active=40433 piece=15.
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=367 size=31860 all=806749 active=41145 piece=▁101.
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=367 size=31880 all=806955 active=41351 piece=▁shabby
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=366 size=31900 all=807144 active=41540 piece=▁TMZ
bpe_model_trainer.cc(163) LOG(INFO) Updating active symbols. max_freq=366 min_freq=93
bpe_model_trainer.cc(254) LOG(INFO) Added: freq=366 size=31920 all=807426 active=40638 piece=ylamine
trainer_interface.cc(314) LOG(INFO) Saving model: spm.en.model
trainer_interface.cc(278) [dup.insert(piece).second] city is already defined
Aborted (core dumped)

Purpose of adding a dummy whitespace at the beginning of each sentence

I have seen the following in the help text of spm_train about this parameter:

--add_dummy_prefix (Add dummy whitespace at the beginning of text) type: bool default: true

Is there any explanation of why the default behavior is to add a prefix whitespace? I am just wondering what the intention or advantages of doing this are.

Manually modifying SentencePiece model?

Hello,

I'd like to take an existing sentencepiece model, modify some of its probabilities manually, and then use it to segment text. What would be the easiest way to do this? The ".vocab" file is easy to modify, but the ".model" file is in a binary format, and it wasn't clear what to do here.

Python doesn't yet support `hard_vocab_limit`

First, thanks for developing and releasing this package. It's really excellent and I recommend it to everyone :)

Just a very small request: the Python bindings from pip don't support the new hard_vocab_limit functionality, so when you have time I'd appreciate if you could make a new release to PyPI to that effect.

Spaces in user_defined_symbols?

I would like to define "foo bar" as a user-defined symbol. I see that the model can have whitespace in its tokens; is there a way to easily add this to the user-defined symbols?

Split longer word, rather than word by word

First, I greatly appreciate your library; it's very useful and easy to use. But I've run into a problem: in Vietnamese, a meaningful word sometimes consists of more than one word. For example, in the sentence "I live in Ha Noi", I want "Ha Noi" to stay together after being split. Is there any way or any parameter to handle this case? Best wishes!

How to use the sentencepiece TensorFlow Op

I see you have added a TensorFlow Op in your recent master.
Does this mean we can use the SentencePiece Op in a TensorFlow graph?

I wish to build a libtensorflow_inference.so file with the SentencePieceOp integrated, so that I can have
a single TensorFlow graph to make inferences.

Can you explain how we can utilize the recently merged TensorFlow Op in this way?

Thanks in advance,

Does not recognize \n

Hi, I've found that feeding some multiline text into sentencepiece results in an unknown token for the newline character. How can I get SP to recognize that '\n' is a valid character?

Trouble with reversibility

Thanks for this great tool. For context, I'm working with the python wrapper for the BPE tokenization, and I would like to write my tokenized input to files line by line.

Using the default normalization settings, it looks like I can't get complete (character-by-character) reversibility for some special tokens. If I turn normalization off by setting --normalization_rule_name=identity, I get all sorts of odd tokenizations.

### Excerpt from a tokenization script that tries to encode and write line by line to a file
input_line = input_line.strip()
tokenized_line = [x.decode('utf-8') for x in spp.EncodeAsPieces(input_line)] # need to convert to strings at some point
encoded_output_line = ' '.join(tokenized_line) + '\n'
decoded_input_line = spp.DecodePieces([x.encode() for x in encoded_output_line.split()])
if not input_line == decoded_input_line:
    print("input_line: ", input_line)
    print("decoded_input_line: ", decoded_input_line)
outfile.write(encoded_output_line)

This yields things like the following:

input_line: Ich erkläre die am Donnerstag, dem 25. September 2003, unterbrochene Sitzungsperiode des Europäischen Parlaments für wieder aufgenommen.(1)

decoded_input_line: Ich erkläre die am Donnerstag, dem 25.September 2003, unterbrochene Sitzungsperiode des Europäischen Parlaments für wieder aufgenommen.(1)

See how the space in "25. September" was removed? Spaces are also getting removed in the following examples (this is happening to many, many sentences):

input_line: Fünfhunderttausend russischsprachige Einwohner bzw. 40 % der Bevölkerung, die keine Staatsangehörigkeit besitzen, sind vom politischen Leben ausgeschlossen.

decoded_input_line: Fünfhunderttausend russischsprachige Einwohner bzw. 40% der Bevölkerung, die keine Staatsangehörigkeit besitzen, sind vom politischen Leben ausgeschlossen.

Here is one where the space is removed between words (rather than just around punctuation):

input_line: Mon emprisonnement m'a contraint à me pencher sur l'essentiel quant à moi-même, mon engagement politique et mon pays.

decoded_input_line: Mon emprisonnement m'a contraint à me pencher sur l'essentielquant à moi-même, mon engagement politique et mon pays.

I was hoping to get BPE subword tokenizations from sentencepiece that are completely reversible, so that I could get back the exact original input string. But I'd also like to be able to cache files and write the BPE-encoded inputs to a file. Is this possible, either with a different sentencepiece model or with a different method of writing to the file?

Failed to install python wrapper on MacOS (python 3)

pip install sentencepiece

Collecting sentencepiece
Using cached https://files.pythonhosted.org/packages/ef/ba/17c0c4f8ccc746b2182c7e3c8292be0bdb37fbadeaf467d2f69565160764/sentencepiece-0.0.7.tar.gz
Building wheels for collected packages: sentencepiece
Running setup.py bdist_wheel for sentencepiece ... error
Complete output from command /Users/vostryakov/projects/3env/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/s4/l8hg1ch969d9p96z4bwfllmc0000gp/T/pip-install-gni78bgz/sentencepiece/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" bdist_wheel -d /private/var/folders/s4/l8hg1ch969d9p96z4bwfllmc0000gp/T/pip-wheel-60ftqmg6 --python-tag cp36:
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-10.11-x86_64-3.6
copying sentencepiece.py -> build/lib.macosx-10.11-x86_64-3.6
running build_ext
building '_sentencepiece' extension
creating build/temp.macosx-10.11-x86_64-3.6
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c sentencepiece_wrap.cxx -o build/temp.macosx-10.11-x86_64-3.6/sentencepiece_wrap.o -std=c++11 -g -O2 -I/usr/local/include
sentencepiece_wrap.cxx:3123:10: fatal error: 'sentencepiece_trainer.h' file not found
#include <sentencepiece_trainer.h>
^
1 error generated.
error: command 'clang' failed with exit status 1


Failed building wheel for sentencepiece

I get the same error when I try to build from source.

Build failure due to wrong google/protobuf/descriptor.proto path generation

On my machine, building sentencepiece fails because the path to google/protobuf/descriptor.proto is generated incorrectly.

Executing the following commands

% cd /path/to/sentencepiece
% ./autogen.sh
% ./configure
% make
% make check

results in the following failure:

Making check in src
make[1]: Entering directory '/home/leonard/software/sentencepiece/src'
make  check-am
make[2]: Entering directory '/home/leonard/software/sentencepiece/src'
protoc --cpp_out=. .//usr/include/google/protobuf/descriptor.proto
.//usr/include/google/protobuf/descriptor.proto: No such file or directory
make[2]: *** [Makefile:1350: /usr/include/google/protobuf/descriptor.pb.h] Error 1
make[2]: Leaving directory '/home/leonard/software/sentencepiece/src'
make[1]: *** [Makefile:1206: check] Error 2
make[1]: Leaving directory '/home/leonard/software/sentencepiece/src'
make: *** [Makefile:470: check-recursive] Error 1


Note that /usr/include/google/protobuf/descriptor.proto exists, but the path .//usr/include/google/protobuf/descriptor.proto does not.

I'm using the following software versions

➜  ~ aclocal --version
aclocal (GNU automake) 1.15

➜  ~ autoheader --version
autoheader (GNU Autoconf) 2.69

➜  ~ automake --version
automake (GNU automake) 1.15

➜  ~ autoconf --version
autoconf (GNU Autoconf) 2.69

➜  ~ protoc --version
libprotoc 3.4.0

Is this the tokenizer used in the official Transformer implementation?

Google's official Transformer implementation comes with a subtokenizer that appears to work extremely similarly to SentencePiece, breaking words into subwords, using _ appended to tokens instead of whitespace, and training the vocab in an unsupervised fashion with a pre-fixed target vocab size (32k in the code). I was wondering if you know whether it is actually SentencePiece or, otherwise, if you have a bit of context on the differences.

https://github.com/tensorflow/models/tree/master/official/transformer
https://github.com/tensorflow/models/blob/master/official/transformer/utils/tokenizer.py

Thanks a lot in advance.

make install passes, but am unable to run spm

I followed all the installation steps without a hitch.

make passes, as do make check and sudo make install.

However, when I try to run spm_train, spm_decode, or spm_encode, I get the following error message:

spm_train: error while loading shared libraries: libsentencepiece.so.0: cannot open shared object file: No such file or directory

I am running Ubuntu 16.04.2.

Build error of Python module for Python 3.5.3

Hi,

When building the Python module of sentencepiece for Python 3.5.3, I saw the following build error: "‘PyString_AsStringAndSize’ was not declared in this scope".

In the same environment, I successfully built the Python module of sentencepiece for Python 2.7.13; thus, I believe that my environment meets the build dependencies of the Python module.

Could you give me advice on how to avoid this problem?

$ pip3 --no-cache-dir install --user sentencepiece
Collecting sentencepiece
  Downloading sentencepiece-0.0.0.tar.gz (183kB)
    100% |████████████████████████████████| 184kB 6.2MB/s 
Installing collected packages: sentencepiece
  Running setup.py install for sentencepiece ... error
    Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-yqcu2vo3/sentencepiece/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-oqqfb0ui-record/install-record.txt --single-version-externally-managed --compile --user --prefix=:
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.5
    copying sentencepiece.py -> build/lib.linux-x86_64-3.5
    running build_ext
    building '_sentencepiece' extension
    creating build/temp.linux-x86_64-3.5
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fdebug-prefix-map=/build/python3.5-MLq5fN/python3.5-3.5.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -c sentencepiece_wrap.cxx -o build/temp.linux-x86_64-3.5/sentencepiece_wrap.o -std=c++11 -g -O2 -fdebug-prefix-map=/home/tsuchiya/work/pkg-sentencepiece=. -fstack-protector-strong -Wformat -Werror=format-security
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_Load(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3305:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
           PyString_AsStringAndSize(obj1, &str, &str_size);
                                                         ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_LoadOrDie(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3347:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
           PyString_AsStringAndSize(obj1, &str, &str_size);
                                                         ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_SetEncodeExtraOptions(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3389:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
           PyString_AsStringAndSize(obj1, &str, &str_size);
                                                         ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_SetDecodeExtraOptions(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3431:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
           PyString_AsStringAndSize(obj1, &str, &str_size);
                                                         ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_PieceToId(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3496:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
           PyString_AsStringAndSize(obj1, &str, &str_size);
                                                         ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_IdToPiece(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3543:80: error: ‘PyString_FromStringAndSize’ was not declared in this scope
         resultobj = PyString_FromStringAndSize((&result)->data(), (&result)->size());
                                                                                    ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_Encode(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3665:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
           PyString_AsStringAndSize(obj1, &str, &str_size);
                                                         ^
    sentencepiece_wrap.cxx:3677:95: error: ‘PyString_FromStringAndSize’ was not declared in this scope
         PyList_SetItem(resultobj, i, PyString_FromStringAndSize(result[i].data(), result[i].size()));
                                                                                                   ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_EncodeAsPieces(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3712:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
           PyString_AsStringAndSize(obj1, &str, &str_size);
                                                         ^
    sentencepiece_wrap.cxx:3724:95: error: ‘PyString_FromStringAndSize’ was not declared in this scope
         PyList_SetItem(resultobj, i, PyString_FromStringAndSize(result[i].data(), result[i].size()));
                                                                                                   ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_EncodeAsIds(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3759:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
           PyString_AsStringAndSize(obj1, &str, &str_size);
                                                         ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_Decode(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3811:54: error: ‘PyString_AsStringAndSize’ was not declared in this scope
               PyString_AsStringAndSize(o, &str, &str_size);
                                                          ^
    sentencepiece_wrap.cxx:3826:80: error: ‘PyString_FromStringAndSize’ was not declared in this scope
         resultobj = PyString_FromStringAndSize((&result)->data(), (&result)->size());
                                                                                    ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_DecodePieces(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3866:54: error: ‘PyString_AsStringAndSize’ was not declared in this scope
               PyString_AsStringAndSize(o, &str, &str_size);
                                                          ^
    sentencepiece_wrap.cxx:3881:80: error: ‘PyString_FromStringAndSize’ was not declared in this scope
         resultobj = PyString_FromStringAndSize((&result)->data(), (&result)->size());
                                                                                    ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor_DecodeIds(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3933:80: error: ‘PyString_FromStringAndSize’ was not declared in this scope
         resultobj = PyString_FromStringAndSize((&result)->data(), (&result)->size());
                                                                                    ^
    sentencepiece_wrap.cxx: In function ‘PyObject* _wrap_SentencePieceProcessor___getitem__(PyObject*, PyObject*)’:
    sentencepiece_wrap.cxx:3990:53: error: ‘PyString_AsStringAndSize’ was not declared in this scope
           PyString_AsStringAndSize(obj1, &str, &str_size);
                                                         ^
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    
    ----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-yqcu2vo3/sentencepiece/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-oqqfb0ui-record/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-build-yqcu2vo3/sentencepiece/

num_threads for spm_train

Even though the default num_threads is 16, the load on the CPU is mainly concentrated on one core:
[screenshot: CPU usage concentrated on a single core]

Is this just due to the type of load?

[Q] Would you elaborate on the differences between the 4 corpus size parameters?

I have a 50M-sentence corpus to train on. I'd like to know the difference between the following parameters.
I set them to 50M, 10M, 5M, and 50M respectively (5x the defaults) and got a crash like issue #4 --
CHECK(!pieces.empty()) failed on serialize.
The vocab size I set was 32768.

   --input_sentence_size (maximum size of sentences the trainer loads)  type: int32  default: 10000000
   --mining_sentence_size (maximum size of sentences to make seed sentence piece)  type: int32  default: 2000000
   --seed_sentencepiece_size (the size of seed sentencepieces)  type: int32  default: 1000000
   --training_sentence_size (maximum size of sentences to train sentence pieces)  type: int32  default: 10000000

Ability to avoid rare segmentation causing UNKs

When training a joint SPM model on two or more languages, is there a way to alleviate the problem of a token in language1 being segmented into subunits only seen in language2, causing UNKs at test time?

In subword-nmt, there's a vocabulary threshold for this that allows further segmentation of tokens until the subunits have been seen at least that many times in the relevant language.

Installation failed on Ubuntu

I installed protobuf 3.4 after the apt-get version failed.

./autogen.sh

Running aclocal ...
Running autoheader...
Running libtoolize ..
Running automake ...
Running autoconf ...

./configure

checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking how to print strings... printf
checking for style of include used by make... GNU
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking dependency style of gcc... gcc3
checking for a sed that does not truncate output... /bin/sed
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for fgrep... /bin/grep -F
checking for ld used by gcc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking for ar... ar
checking for archiver @file support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for mt... mt
checking if mt is a manifest tool... no
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for g++ option to produce PIC... -fPIC -DPIC
checking if g++ PIC flag -fPIC -DPIC works... yes
checking if g++ static flag -static works... yes
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... (cached) GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether g++ accepts -g... (cached) yes
checking dependency style of g++... (cached) gcc3
checking for gcc... (cached) gcc
checking whether we are using the GNU C compiler... (cached) yes
checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ISO C89... (cached) none needed
checking whether gcc understands -c and -o together... (cached) yes
checking dependency style of gcc... (cached) gcc3
checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for PROTOBUF... yes
checking nfkc-compile option... no
checking gcov option... no
configure: pkgconfig directory is ${libdir}/pkgconfig
checking for unistd.h... (cached) yes
checking for size_t... yes
checking for working strtod... yes
checking for memchr... yes
checking for memset... yes
checking that generated files are newer than configure... done

make

make all-recursive
make[1]: Entering directory '/home/dori/src/sentencepiece'
Making all in src
make[2]: Entering directory '/home/dori/src/sentencepiece/src'
make all-am
make[3]: Entering directory '/home/dori/src/sentencepiece/src'
g++ -DHAVE_CONFIG_H -I. -I.. -std=c++11 -Wall -O3 -I/usr/include/google/protobuf -D_THREAD_SAFE -MT builder.o -MD -MP -MF .deps/builder.Tpo -c -o builder.o builder.cc
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
#error This file was generated by an older version of protoc which is
^
sentencepiece_model.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
#error incompatible with your Protocol Buffer headers. Please
^
sentencepiece_model.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
#error regenerate this file with a newer version of protoc.
^
Makefile:916: recipe for target 'builder.o' failed
make[3]: *** [builder.o] Error 1
make[3]: Leaving directory '/home/dori/src/sentencepiece/src'
Makefile:686: recipe for target 'all' failed
make[2]: *** [all] Error 2
make[2]: Leaving directory '/home/dori/src/sentencepiece/src'
Makefile:476: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/dori/src/sentencepiece'
Makefile:385: recipe for target 'all' failed
make: *** [all] Error 2

Understanding BOS/EOS symbols

Are these control symbols meant to be semantically meaningful for downstream deep-learning tasks, or are they for internal use by SentencePiece?

After

sp.SetEncodeExtraOptions('eos')

I can encode strings and the </s> token is appended automatically. However, sp does not recognize </s> as a symbol when it appears in the text:

sp.EncodeAsPieces('foo\nbar</s>')
=> ['▁f', 'oo', '\n', 'b', 'ar', '<', '/', 's', '>']
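
This is the expected split of roles: control symbols such as </s> are reserved for the model's own bookkeeping, so their surface strings in raw text are segmented like ordinary characters, whereas a user-defined symbol declared at training time is matched whenever its surface appears. A minimal sketch of the distinction, assuming a hypothetical input.txt, model m.model and an illustrative <sep> symbol:

import sentencepiece as spm

# Sketch only; input.txt, m.model and <sep> are illustrative assumptions.
spm.SentencePieceTrainer.Train(
    '--input=input.txt --model_prefix=m '
    '--vocab_size=8000 --user_defined_symbols=<sep>')

sp = spm.SentencePieceProcessor()
sp.Load('m.model')

sp.SetEncodeExtraOptions('eos')                        # append </s> to every encoding
print(sp.EncodeAsPieces('foo bar'))                    # last piece is '</s>'
print(sp.EncodeAsIds('foo bar')[-1] == sp.eos_id())    # True

# A user-defined symbol, by contrast, IS matched when it occurs in the text:
print(sp.EncodeAsPieces('foo<sep>bar'))                # '<sep>' survives as one piece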

Version of protoc required and alternative to libprotobuf9v5 for Red Hat

Hi,

I am trying to install sentencepiece following the instructions in the README, on a Red Hat distribution, and the make command gives me an error indicating that my protoc version is incompatible with sentencepiece_model.pb.h. May I ask which version was used to generate the file?

Also, it seems that no libprotobuf9v5/libprotobuf-c++ package is available for RHEL (I only found them for Ubuntu and Debian). Is there an alternative? Thanks a lot.

Here is the detailed result:

$ make
make all-recursive
make[1]: Entering directory `/data/sentencepiece'
Making all in src
make[2]: Entering directory `/data/sentencepiece/src'
make all-am
make[3]: Entering directory `/data/sentencepiece/src'
g++ -DHAVE_CONFIG_H -I. -I. -std=c++11 -Wall -O3 -MT builder.o -MD -MP -MF .deps/builder.Tpo -c -o builder.o builder.cc
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
#error This file was generated by a newer version of protoc which is
^
sentencepiece_model.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
#error incompatible with your Protocol Buffer headers. Please update
^
sentencepiece_model.pb.h:14:2: error: #error your headers.
#error your headers.
^
sentencepiece_model.pb.h:22:35: fatal error: google/protobuf/arena.h: No such file or directory
#include <google/protobuf/arena.h>
^
compilation terminated.
make[3]: *** [builder.o] Error 1
make[3]: Leaving directory `/data/sentencepiece/src'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/data/sentencepiece/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/data/sentencepiece'

Then I downloaded the latest release of Protocol Buffer (v3.2.0) in C++, from https://github.com/google/protobuf/releases and compiled it.

When I ran the 'make' command again, I got another error saying that this version is too new:

$ make
make all-recursive
make[1]: Entering directory `/data/sentencepiece'
Making all in src
make[2]: Entering directory `/data/sentencepiece/src'
make all-am
make[3]: Entering directory `/data/sentencepiece/src'
g++ -DHAVE_CONFIG_H -I. -I.. -std=c++11 -Wall -O3 -MT builder.o -MD -MP -MF .deps/builder.Tpo -c -o builder.o builder.cc
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
#error This file was generated by an older version of protoc which is
^
sentencepiece_model.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
#error incompatible with your Protocol Buffer headers. Please
^
sentencepiece_model.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
#error regenerate this file with a newer version of protoc.
^
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h: In member function ‘const string& sentencepiece::TrainerSpec::model_prefix() const’:
sentencepiece_model.pb.h:938:95: error: no matching function for call to ‘google::protobuf::internal::ArenaStringPtr::GetNoArena(const string*) const’
return model_prefix_.GetNoArena(&::google::protobuf::internal::GetEmptyStringAlreadyInited());
^
sentencepiece_model.pb.h:938:95: note: candidate is:
In file included from sentencepiece_model.pb.h:23:0,
from builder.h:22,
from builder.cc:15:
/usr/local/include/google/protobuf/arenastring.h:225:31: note: const string& google::protobuf::internal::ArenaStringPtr::GetNoArena() const
inline const ::std::string& GetNoArena() const { return *ptr_; }
^
/usr/local/include/google/protobuf/arenastring.h:225:31: note: candidate expects 0 arguments, 1 provided
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h: In member function ‘const string& sentencepiece::NormalizerSpec::name() const’:
sentencepiece_model.pb.h:1477:87: error: no matching function for call to ‘google::protobuf::internal::ArenaStringPtr::GetNoArena(const string*) const’
return name_.GetNoArena(&::google::protobuf::internal::GetEmptyStringAlreadyInited());
^
sentencepiece_model.pb.h:1477:87: note: candidate is:
In file included from sentencepiece_model.pb.h:23:0,
from builder.h:22,
from builder.cc:15:
/usr/local/include/google/protobuf/arenastring.h:225:31: note: const string& google::protobuf::internal::ArenaStringPtr::GetNoArena() const
inline const ::std::string& GetNoArena() const { return *ptr_; }
^
/usr/local/include/google/protobuf/arenastring.h:225:31: note: candidate expects 0 arguments, 1 provided
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h: In member function ‘const string& sentencepiece::NormalizerSpec::precompiled_charsmap() const’:
sentencepiece_model.pb.h:1531:103: error: no matching function for call to ‘google::protobuf::internal::ArenaStringPtr::GetNoArena(const string*) const’
return precompiled_charsmap_.GetNoArena(&::google::protobuf::internal::GetEmptyStringAlreadyInited());
^
sentencepiece_model.pb.h:1531:103: note: candidate is:
In file included from sentencepiece_model.pb.h:23:0,
from builder.h:22,
from builder.cc:15:
/usr/local/include/google/protobuf/arenastring.h:225:31: note: const string& google::protobuf::internal::ArenaStringPtr::GetNoArena() const
inline const ::std::string& GetNoArena() const { return *ptr_; }
^
/usr/local/include/google/protobuf/arenastring.h:225:31: note: candidate expects 0 arguments, 1 provided
In file included from builder.h:22:0,
from builder.cc:15:
sentencepiece_model.pb.h: In member function ‘const string& sentencepiece::ModelProto_SentencePiece::piece() const’:
sentencepiece_model.pb.h:1664:88: error: no matching function for call to ‘google::protobuf::internal::ArenaStringPtr::GetNoArena(const string*) const’
return piece_.GetNoArena(&::google::protobuf::internal::GetEmptyStringAlreadyInited());
^
sentencepiece_model.pb.h:1664:88: note: candidate is:
In file included from sentencepiece_model.pb.h:23:0,
from builder.h:22,
from builder.cc:15:
/usr/local/include/google/protobuf/arenastring.h:225:31: note: const string& google::protobuf::internal::ArenaStringPtr::GetNoArena() const
inline const ::std::string& GetNoArena() const { return *ptr_; }
^
/usr/local/include/google/protobuf/arenastring.h:225:31: note: candidate expects 0 arguments, 1 provided
make[3]: *** [builder.o] Error 1
make[3]: Leaving directory `/data/sentencepiece/src'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/data/sentencepiece/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/data/sentencepiece'
make: *** [all] Error 2

Thanks for your help.

Problem about mixed-language word

In Chinese, some words are composed of an English letter and a Chinese character. They should be recognized as one word during tokenization. However, all such mixed-language words are split when I use SentencePiece to tokenize them.

Is this a special feature of this tool that recognizes the language before tokenizing? Can I disable it?
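
Splitting at script boundaries is governed by the split_by_unicode_script training flag (it shows up as split_by_unicode_script: true in the spm_train logs elsewhere in these issues). A minimal sketch of disabling it, with placeholder file names; note that a mixed zh/en word still only becomes a single piece if it is frequent enough in the training data:

import sentencepiece as spm

# Sketch; input.txt and the vocabulary size are placeholders.
# split_by_unicode_script=true (the default) forbids pieces that span
# different scripts, e.g. Latin + Han; setting it to false allows them.
spm.SentencePieceTrainer.Train(
    '--input=input.txt --model_prefix=mixed '
    '--vocab_size=8000 --split_by_unicode_script=false')

sp = spm.SentencePieceProcessor()
sp.Load('mixed.model')
print(sp.EncodeAsPieces('阿Q正传'))   # mixed-script pieces are now possible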

Tokenize words, rather than wordparts

First of all, thanks for sharing such a useful tool! I really like this library.

Second, I'm working on a non-translation task where I think I want to be working with words rather than word parts. Are there any settings I can use in SentencePiece to favor longer word units?
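
Two settings that tend to produce longer, more word-like units are a word-level model type and a larger vocabulary/maximum piece length. A rough sketch with placeholder paths and sizes (the word model assumes whitespace-delimited input):

import sentencepiece as spm

# Option 1: word-level model -- each whitespace-delimited word is one piece.
spm.SentencePieceTrainer.Train(
    '--input=input.txt --model_prefix=word_model '
    '--vocab_size=32000 --model_type=word')

# Option 2: keep unigram/BPE but bias toward longer pieces with a larger
# vocabulary and a longer maximum piece length (the default is 16).
spm.SentencePieceTrainer.Train(
    '--input=input.txt --model_prefix=long_pieces '
    '--vocab_size=32000 --max_sentencepiece_length=32')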

Guidance on multilingual text data

Thanks for the awesome tool. I am working on multilingual text data, specifically a mix of Chinese and English where English words appear between Chinese characters without any space delimiter, like 有hockey羽毛球欖球籃球足球. I don't have a lot of data like this, so I was wondering: will the tool work on such data if I feed in both Chinese-only and English-only text as input? If not, any insights on how this can be handled? Thanks a lot for your help in advance.
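
For what it's worth, SentencePiece is language independent and --input accepts a comma-separated list of files, so one option is to train a single model on the Chinese-only and English-only corpora plus whatever mixed data is available. A sketch with placeholder file names:

import sentencepiece as spm

# Sketch; zh.txt, en.txt and mixed.txt are placeholder corpora.
spm.SentencePieceTrainer.Train(
    '--input=zh.txt,en.txt,mixed.txt --model_prefix=zh_en '
    '--vocab_size=32000 --character_coverage=0.9995')

sp = spm.SentencePieceProcessor()
sp.Load('zh_en.model')
print(sp.EncodeAsPieces('有hockey羽毛球欖球籃球足球'))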

[Feature request] Pattern-based user-defined symbols

As of now, it is possible to specify user-defined symbols to bypass some sequences.
It would be great if we could pass a "pattern" for these symbols, especially when we have plenty of placeholders.
For instance, the pattern could be '(((*)))' where the * matches any string.
Cheers.
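
Until something like this exists, one possible workaround is to normalize every placeholder that matches the pattern to a single user-defined symbol before encoding and restore the originals after decoding. A sketch that assumes a model trained with --user_defined_symbols=<ph>; the symbol name and model path are illustrative:

import re
import sentencepiece as spm

# Workaround sketch, not a built-in feature. <ph> and m.model are assumptions.
PLACEHOLDER = re.compile(r'\(\(\(.*?\)\)\)')   # matches (((anything)))

sp = spm.SentencePieceProcessor()
sp.Load('m.model')   # trained with --user_defined_symbols=<ph>

def encode_with_placeholders(text):
    # Replace each placeholder with the single symbol, remembering the originals.
    originals = PLACEHOLDER.findall(text)
    masked = PLACEHOLDER.sub('<ph>', text)
    return sp.EncodeAsPieces(masked), originals

def decode_with_placeholders(pieces, originals):
    # Decode, then put the original placeholders back in order.
    text = sp.DecodePieces(pieces)
    for original in originals:
        text = text.replace('<ph>', original, 1)
    return text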

The full alphabet is not preserved and the transformation is not reversible.

Hi,
Thank you for publishing such a great research!

After studying your docs we assumed that the default training settings retain the full alphabet, so that the transformation is reversible. Based on that assumption we built our model for a Polish NLP contest, and we are now trying not to get disqualified :( (I hope the competition jury will be reasonable with us.)

We tracked the issue down to the default setting of character_coverage, which is 0.9995.
I understand that this is normally the recommended setting, as the model may work better that way, but it would be good to mention it in the docs to help anyone who needs the transformation to be reversible.

I can submit a PR to emphasize the fact that the default character_coverage is not 1.0, but I'm not sure exactly how this parameter works.

For anyone who comes across this issue: just train your model with --character_coverage=1.0.
In our case, the default coverage of 0.9995 removed as many as 4 characters from our alphabet of 92. They were indeed the least frequently used characters, so I guess the coverage means how much of the text will be restored after reversing the transformation, but that's just a guess.
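
A minimal sketch of the reversible setup, with placeholder paths and sizes: with character_coverage=1.0 every character seen in training gets a piece, so encode/decode round-trips exactly (up to whitespace normalization).

import sentencepiece as spm

# Sketch; input.txt and the vocabulary size are placeholders.
spm.SentencePieceTrainer.Train(
    '--input=input.txt --model_prefix=reversible '
    '--vocab_size=8000 --character_coverage=1.0')

sp = spm.SentencePieceProcessor()
sp.Load('reversible.model')
text = 'zażółć gęślą jaźń'
assert sp.DecodePieces(sp.EncodeAsPieces(text)) == text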

Anyway, this is a super small issue, and the algorithm is very useful. Once again thank you for making the library public.

Regards,
Piotr

Bad alloc

Hello, when running the following on a file with over 5 million sentences:

spm.SentencePieceTrainer.Train('--input=... --model_prefix=prefix --vocab_size=50000')

I'm getting this error:

trainer_interface.cc(235) LOG(INFO) Done! 5175523 sentences are loaded
unigram_model_trainer.cc(117) LOG(INFO) Using 2000000 sentences for making seed sentencepieces
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

Is this a memory issue? How much RAM do I need to run this operation?
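
The std::bad_alloc suggests the trainer ran out of memory during seed-sentencepiece extraction. The flags that bound that step appear in the spm_train logs elsewhere in these issues; a sketch of capping them (the values are illustrative, not a measured requirement, and exact flag availability depends on the SentencePiece version):

import sentencepiece as spm

# Sketch; the sizes below are illustrative assumptions.
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=prefix --vocab_size=50000 '
    '--input_sentence_size=1000000 '       # subsample the 5M-sentence corpus
    '--seed_sentencepiece_size=500000')    # extract fewer seed candidates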

Make fails on Ubuntu 16.04(LTS)

I tried installing by following Build and Install SentencePiece, but make does not pass.

The environment is as follows:

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.2 LTS"
$ uname -r
4.4.0-64-generic

The result of Make is as follows.

$ sudo ldconfig
$ make
make  all-recursive
make[1]: Entering directory '/home/ornew/sentencepiece'
Making all in src
make[2]: Entering directory '/home/ornew/sentencepiece/src'
make  all-am
make[3]: Entering directory '/home/ornew/sentencepiece/src'
/bin/bash ../libtool  --tag=CXX   --mode=link g++  -std=c++11 -Wall -O3 -pthread   -o spm_encode spm_encode_main.o libsentencepiece.la -lprotobuf -pthread -lpthread
libtool: link: g++ -std=c++11 -Wall -O3 -pthread -o .libs/spm_encode spm_encode_main.o -pthread  ./.libs/libsentencepiece.so -lprotobuf -lpthread -pthread
spm_encode_main.o: In function `std::_Function_handler<void (std::string const&), main::{lambda(std::string const&)#3}>::_M_invoke(std::_Any_data const&, std::string const&)':
spm_encode_main.cc:(.text+0x1df): undefined reference to `google::protobuf::Message::Utf8DebugString() const'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::empty_string_'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::ReadString(google::protobuf::io::CodedInputStream*, std::string*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::ReadBytes(google::protobuf::io::CodedInputStream*, std::string*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteBytesMaybeAliased(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::DescriptorPool::FindFileByName(std::string const&) const'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteString(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::Message::InitializationErrorString() const'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::StringTypeHandlerBase::Delete(std::string*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::MessageFactory::InternalRegisterGeneratedFile(char const*, void (*)(std::string const&))'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteStringMaybeAliased(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::internal::StringTypeHandlerBase::New()'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::io::CodedOutputStream::WriteStringWithSizeToArray(std::string const&, unsigned char*)'
./.libs/libsentencepiece.so: undefined reference to `google::protobuf::Message::GetTypeName() const'
collect2: error: ld returned 1 exit status
Makefile:834: recipe for target 'spm_encode' failed
make[3]: *** [spm_encode] Error 1
make[3]: Leaving directory '/home/ornew/sentencepiece/src'
Makefile:678: recipe for target 'all' failed
make[2]: *** [all] Error 2
make[2]: Leaving directory '/home/ornew/sentencepiece/src'
Makefile:418: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/ornew/sentencepiece'
Makefile:350: recipe for target 'all' failed
make: *** [all] Error 2

It seems that there are no packages named libprotobuf-c++ or protocolbuffer.

$ sudo apt-get upgrade -y
$ sudo apt-get update
$ sudo apt-get install libprotobuf-c++ protocolbuffer
Reading package lists... Done
Building dependency tree
Reading state information... Done
Note, selecting 'libprotobuf-c-dev' for regex 'libprotobuf-c+'
Note, selecting 'libprotobuf-c0-dev' for regex 'libprotobuf-c+'
Note, selecting 'libprotobuf-c1' for regex 'libprotobuf-c+'
Note, selecting 'libprotobuf-c1-dbg' for regex 'libprotobuf-c+'
Note, selecting 'libprotobuf-c-dev' instead of 'libprotobuf-c0-dev'
E: Unable to locate package protocolbuffer

Although Protocol Buffers seems to be installed...

$ dpkg -l | grep protobuf
ii  libmirprotobuf3:amd64                  0.21.0+16.04.20160330-0ubuntu1                         amd64        Display server for Ubuntu - RPC definitions
ii  libprotobuf-c-dev                      1.2.1-1                                                amd64        Protocol Buffers C static library and headers (protobuf-c)
ii  libprotobuf-c1                         1.2.1-1                                                amd64        Protocol Buffers C shared library (protobuf-c)
ii  libprotobuf-c1-dbg                     1.2.1-1                                                amd64        Protocol Buffers C shared library debug symbols (protobuf-c)
ii  libprotobuf-dev:amd64                  2.6.1-1.3                                              amd64        protocol buffers C++ library (development files)
ii  libprotobuf-java                       2.6.1-1.3                                              all          Java bindings for protocol buffers
ii  libprotobuf-lite9v5:amd64              2.6.1-1.3                                              amd64        protocol buffers C++ library (lite version)
ii  libprotobuf9v5:amd64                   2.6.1-1.3                                              amd64        protocol buffers C++ library
ii  protobuf-c-compiler                    1.2.1-1                                                amd64        Protocol Buffers C compiler (protobuf-c)
ii  protobuf-compiler                      2.6.1-1.3                                              amd64        compiler for protocol buffer definition files

What should I do?
Thank you.

Core Dumped during Saving model

I installed sentencepiece successfully in Ubuntu 14.04 64 bit.

But when I tried to train a model with a simple input.txt file, a core dump happens while saving the model.

Here is the full log:

$ spm_train --input=input.txt --model_prefix=m_a --vocab_size=1000

unigram_model_trainer.cc(494) LOG(INFO) Starts training with : 
input: "input.txt"
model_prefix: "m_a"
model_type: UNIGRAM
vocab_size: 8000
character_coverage: 0.9995
input_sentence_size: 10000000
mining_sentence_size: 2000000
training_sentence_size: 10000000
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: true
split_by_whitespace: true

trainer_interface.cc(109) LOG(INFO) Loading corpus: input.txt
trainer_interface.cc(126) LOG(INFO) Loading: ▁Kết▁quả▁xổ▁số▁điện▁toán▁Vietlott▁ngày▁6/2/2017	size=0
trainer_interface.cc(148) LOG(INFO) Loaded 45 sentences
trainer_interface.cc(166) LOG(INFO) all chars count=14425
trainer_interface.cc(173) LOG(INFO) Done: 99.9584% characters are covered.
trainer_interface.cc(181) LOG(INFO) alphabet size=134
trainer_interface.cc(211) LOG(INFO) Done! 45 sentences are loaded
unigram_model_trainer.cc(121) LOG(INFO) Using 45 sentences for making seed sentencepieces
unigram_model_trainer.cc(149) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(153) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(204) LOG(INFO) Initialized 830 seed sentencepieces
trainer_interface.cc(215) LOG(INFO) Tokenizing input sentences with whitespace: 45
trainer_interface.cc(224) LOG(INFO) Done! 787
unigram_model_trainer.cc(513) LOG(INFO) Using 787 sentences for EM training
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=606 obj=12.9723 num_tokens=1859 num_tokens/piece=3.06766
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=530 obj=12.0251 num_tokens=1862 num_tokens/piece=3.51321
trainer_interface.cc(284) LOG(INFO) Saving model: m_a.model
trainer_interface.cc(275) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] 
Aborted (core dumped)

However, if I set an appropriate value for vocab_size, it works:

$ spm_train --input=input.txt --model_prefix=m_a --vocab_size=200

unigram_model_trainer.cc(494) LOG(INFO) Starts training with : 
input: "input.txt"
model_prefix: "m_a"
model_type: UNIGRAM
vocab_size: 200
character_coverage: 0.9995
input_sentence_size: 10000000
mining_sentence_size: 2000000
training_sentence_size: 10000000
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: true
split_by_whitespace: true

trainer_interface.cc(109) LOG(INFO) Loading corpus: input.txt
trainer_interface.cc(126) LOG(INFO) Loading: ▁Kết▁quả▁xổ▁số▁điện▁toán▁Vietlott▁ngày▁6/2/2017	size=0
trainer_interface.cc(148) LOG(INFO) Loaded 45 sentences
trainer_interface.cc(166) LOG(INFO) all chars count=14425
trainer_interface.cc(173) LOG(INFO) Done: 99.9584% characters are covered.
trainer_interface.cc(181) LOG(INFO) alphabet size=134
trainer_interface.cc(211) LOG(INFO) Done! 45 sentences are loaded
unigram_model_trainer.cc(121) LOG(INFO) Using 45 sentences for making seed sentencepieces
unigram_model_trainer.cc(149) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(153) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(204) LOG(INFO) Initialized 830 seed sentencepieces
trainer_interface.cc(215) LOG(INFO) Tokenizing input sentences with whitespace: 45
trainer_interface.cc(224) LOG(INFO) Done! 787
unigram_model_trainer.cc(513) LOG(INFO) Using 787 sentences for EM training
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=606 obj=12.9723 num_tokens=1859 num_tokens/piece=3.06766
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=530 obj=12.0251 num_tokens=1862 num_tokens/piece=3.51321
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=397 obj=12.2622 num_tokens=1975 num_tokens/piece=4.97481
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=397 obj=12.1277 num_tokens=1975 num_tokens/piece=4.97481
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=297 obj=12.9592 num_tokens=2182 num_tokens/piece=7.3468
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=297 obj=12.7479 num_tokens=2182 num_tokens/piece=7.3468
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=222 obj=14.1593 num_tokens=2467 num_tokens/piece=11.1126
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=222 obj=13.8631 num_tokens=2467 num_tokens/piece=11.1126
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=220 obj=14.0721 num_tokens=2483 num_tokens/piece=11.2864
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=220 obj=14.0661 num_tokens=2493 num_tokens/piece=11.3318
trainer_interface.cc(284) LOG(INFO) Saving model: m_a.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m_a.vocab

I tried other values of vocab_size with my input data:

$ spm_train --input=input.txt  --model_prefix=m_a --vocab_size=400
-> OK
$ spm_train --input=input.txt  --model_prefix=m_a --vocab_size=500
-> OK
$ spm_train --input=input.txt  --model_prefix=m_a --vocab_size=600
-> Core Dumped
$ spm_train --input=input.txt  --model_prefix=m_a --vocab_size=547
-> OK
$ spm_train --input=input.txt  --model_prefix=m_a --vocab_size=548
-> Core Dumped

How can I choose an appropriate value for vocab_size?
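
The abort comes from the CHECK at trainer_interface.cc(275): the trainer could not produce as many pieces as requested from such a small corpus. A rough sketch of two options, assuming the same input.txt; the --hard_vocab_limit flag is taken from more recent SentencePiece documentation and may not exist in the version used in this issue:

import sentencepiece as spm

# Option 1: request a vocabulary the 45-sentence corpus can actually support.
spm.SentencePieceTrainer.Train(
    '--input=input.txt --model_prefix=m_a --vocab_size=400')

# Option 2 (recent versions only, assumed): treat vocab_size as a soft limit
# so training succeeds even if fewer pieces can be found.
spm.SentencePieceTrainer.Train(
    '--input=input.txt --model_prefix=m_b '
    '--vocab_size=1000 --hard_vocab_limit=false')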

Mac/Ubuntu Installation Fails `syntax error near unexpected token `PROTOBUF,'`

Out:

$ brew install protobuf autoconf automake libtool
Warning: protobuf 3.3.2 is already installed
Warning: autoconf 2.69 is already installed
Warning: automake 1.15.1 is already installed
Warning: libtool 2.4.6_1 is already installed
$ ./autogen.sh
Running aclocal ...
Running autoheader...
Running libtoolize ..
Running automake ...
Running autoconf ...
$ ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... ./install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking build system type... x86_64-apple-darwin16.6.0
checking host system type... x86_64-apple-darwin16.6.0
checking how to print strings... printf
checking for style of include used by make... GNU
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking dependency style of gcc... gcc3
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc... /Library/Developer/CommandLineTools/usr/bin/ld
checking if the linker (/Library/Developer/CommandLineTools/usr/bin/ld) is GNU ld... no
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 196608
checking how to convert x86_64-apple-darwin16.6.0 file names to x86_64-apple-darwin16.6.0 format... func_convert_file_noop
checking how to convert x86_64-apple-darwin16.6.0 file names to toolchain format... func_convert_file_noop
checking for /Library/Developer/CommandLineTools/usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking for ar... ar
checking for archiver @FILE support... no
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for mt... no
checking if : is a manifest tool... no
checking for dsymutil... dsymutil
checking for nmedit... nmedit
checking for lipo... lipo
checking for otool... otool
checking for otool64... no
checking for -single_module linker flag... yes
checking for -exported_symbols_list linker flag... yes
checking for -force_load linker flag... yes
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... yes
checking for gcc option to produce PIC... -fno-common -DPIC
checking if gcc PIC flag -fno-common -DPIC works... yes
checking if gcc static flag -static works... no
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/Library/Developer/CommandLineTools/usr/bin/ld) supports shared libraries... yes
checking dynamic linker characteristics... darwin16.6.0 dyld
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /Library/Developer/CommandLineTools/usr/bin/ld
checking if the linker (/Library/Developer/CommandLineTools/usr/bin/ld) is GNU ld... no
checking whether the g++ linker (/Library/Developer/CommandLineTools/usr/bin/ld) supports shared libraries... yes
checking for g++ option to produce PIC... -fno-common -DPIC
checking if g++ PIC flag -fno-common -DPIC works... yes
checking if g++ static flag -static works... no
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/Library/Developer/CommandLineTools/usr/bin/ld) supports shared libraries... yes
checking dynamic linker characteristics... darwin16.6.0 dyld
checking how to hardcode library paths into programs... immediate
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether g++ accepts -g... (cached) yes
checking dependency style of g++... (cached) gcc3
checking for gcc... (cached) gcc
checking whether we are using the GNU C compiler... (cached) yes
checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ISO C89... (cached) none needed
checking whether gcc understands -c and -o together... (cached) yes
checking dependency style of gcc... (cached) gcc3
./configure: line 17067: syntax error near unexpected token `PROTOBUF,'
./configure: line 17067: `PKG_CHECK_MODULES(PROTOBUF, protobuf >= 2.4.0)'

Brew protobuf info:

Michaels-MacBook-Pro:sentencepiece petrochuk$ brew info protobuf
protobuf: stable 3.3.2 (bottled), HEAD
Protocol buffers (Google's data interchange format)
https://github.com/google/protobuf/
/usr/local/Cellar/protobuf/3.3.2 (260 files, 16.1MB) *
  Poured from bottle on 2017-08-02 at 17:17:01
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/protobuf.rb
==> Dependencies
Build: autoconf ✔, automake ✔, libtool ✔
==> Requirements
Optional: python3 ✔
==> Options
--with-python3
	Build with python3 support
--with-test
	Run build-time check
--without-python
	Build without python support
--HEAD
	Install HEAD version
==> Caveats
Editor support and examples have been installed to:
  /usr/local/opt/protobuf/share/doc/protobuf

Python modules have been installed and Homebrew's site-packages is not
in your Python sys.path, so you will not be able to import the modules
this formula installed. If you plan to develop with these modules,
please run:
  mkdir -p /Users/petrochuk/Library/Python/2.7/lib/python/site-packages
  echo 'import site; site.addsitedir("/usr/local/lib/python2.7/site-packages")' >> /Users/petrochuk/Library/Python/2.7/lib/python/site-packages/homebrew.pth

Problems installing Python bindings on Mac

Hello! I'm having some trouble installing the Python bindings on Mac OS, and just thought I'd mention it here in case anyone had similar trouble. This is within an Anaconda environment, Python 3.

Move into directory and install:

$ cd /path/to/sentencepiece
$ ./autogen.sh
$ ./configure
$ make
$ make install

Success. Try to install Python bindings:

$ python setup.py build
Package sentencepiece was not found in the pkg-config search path.
Perhaps you should add the directory containing `sentencepiece.pc'
to the PKG_CONFIG_PATH environment variable
No package 'sentencepiece' found
Failed to find sentencepiece pkgconfig

Fix the path.

$ export PKG_CONFIG_PATH=`pwd`/..

And try again:

$ python setup.py build
running build
running build_py
running build_ext
building '_sentencepiece' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/envs/python3/include -arch x86_64 -I/anaconda3/envs/python3/include -arch x86_64 -I/anaconda3/envs/python3/include/python3.6m -c sentencepiece_wrap.cxx -o build/temp.macosx-10.7-x86_64-3.6/sentencepiece_wrap.o -std=c++11 -g -O2 -I/Users/neubig/usr/include
In file included from sentencepiece_wrap.cxx:3124:
/Users/neubig/usr/include/sentencepiece_processor.h:141:8: error: no template named 'unique_ptr' in namespace 'std'
  std::unique_ptr<Rep> rep_;
  ~~~~~^
/Users/neubig/usr/include/sentencepiece_processor.h:181:8: error: no template named 'unique_ptr' in namespace 'std'
  std::unique_ptr<std::string> rep_;
  ~~~~~^    
...

The strange thing is that when I run the command on its own by copy-pasting, it seems to work:

$ gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/envs/python3/include -arch x86_64 -I/anaconda3/envs/python3/include -arch x86_64 -I/anaconda3/envs/python3/include/python3.6m -c sentencepiece_wrap.cxx -o build/temp.macosx-10.7-x86_64-3.6/sentencepiece_wrap.o -std=c++11 -g -O2 -I/Users/neubig/usr/include

I'm not sure what would lead to the difference, but I'm stuck here...

Decoding from ids with UNK tokens

The behaviour of the following function calls is strange:

>>> p.DecodeIds(p.EncodeAsIds('test ^^'))
'test  \u2047 '

If '^^' is unknown, why is it decoded as \u2047 and not as the unk token?

It's properly encoded into ids:

>>> p.EncodeAsIds('test ^^') 
[2528, 6, 0]
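
For reference, a small sketch (m.model is a placeholder for the model used here): id 0 is the <unk> piece, and the decoder renders unknowns with a dedicated surface string, U+2047 ' ⁇ ' by default, rather than the literal '<unk>'. In recent versions the surface can be changed at training time with --unk_surface.

import sentencepiece as spm

# Sketch; m.model stands in for the model used in this issue.
sp = spm.SentencePieceProcessor()
sp.Load('m.model')

ids = sp.EncodeAsIds('test ^^')
print(ids)                 # e.g. [2528, 6, 0] -- the 0 is the unknown id
print(sp.unk_id())         # 0
print(sp.IdToPiece(0))     # '<unk>'
print(sp.DecodeIds(ids))   # 'test  ⁇ ' -- unknowns decode to the unk surface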
