Comments (6)
Thank you for the feedback, and sorry for the lack of detailed documentation on these special symbols.
By default, <s> and </s> in the input text are not recognized as BOS/EOS. This prevents users from indirectly changing the encoder's behavior by putting <s> or </s> in their requests, which would be a big problem in a production setting. That's why BOS/EOS are inserted only via SetEncodeExtraOptions at the moment.
However, there is a workaround. --user_defined_symbols actually accepts <s> </s>.
User defined symbols are always treated as one symbol regardless of the context.
Here is an example.
% ./spm_train --input=../data/botchan.txt --model_prefix=m --vocab_size=1000
% echo "hello world</s>" | ./spm_encode --model=m.model
▁he ll o ▁world </ s >
% echo "</s>" | ./spm_decode --model=m.model
(empty line)
% ./spm_train --input=../data/botchan.txt --model_prefix=m --user_defined_symbols='</s>' --vocab_size=1000
% echo "hello world</s>" | ./spm_encode --model=m.model
▁he ll o ▁world </s>
% echo "hello world" | ./spm_encode --model=m.model --extra_options=eos
▁he ll o ▁world </s>
% echo "hello world</s>" | ./spm_encode --model=m.model --extra_options=eos
▁he ll o ▁world </s> </s>
% echo "</s>" | ./spm_decode --model=m.model
</s>
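The behavior above, user-defined symbols always matching as a single piece regardless of context, can be sketched in plain Python. This is a simplified illustration of the idea, not SentencePiece's actual implementation: user-defined symbols are matched first and kept whole, and only the remaining spans go through normal subword segmentation (here a toy character-pair encoder stands in for the real model).

```python
import re

def encode_with_user_symbols(text, user_symbols, subword_encode):
    """Sketch: user-defined symbols are matched first and kept as single
    pieces; only the text between them is segmented into subwords."""
    if not user_symbols:
        return subword_encode(text)
    # Longest symbols first, so overlapping symbols match greedily.
    pattern = "(" + "|".join(
        re.escape(s) for s in sorted(user_symbols, key=len, reverse=True)
    ) + ")"
    pieces = []
    for chunk in re.split(pattern, text):
        if chunk in user_symbols:
            pieces.append(chunk)          # always one piece, never split
        elif chunk:
            pieces.extend(subword_encode(chunk))
    return pieces

# Toy stand-in for a trained model: prefix each word with "▁", split into pairs.
def toy_encode(s):
    return [("▁" + w)[i:i + 2] for w in s.split() for i in range(0, len(w) + 1, 2)]

print(encode_with_user_symbols("hello world</s>", {"</s>"}, toy_encode))
```

With "</s>" registered as a user-defined symbol it survives as one piece, mirroring the second spm_encode transcript above.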
Again, please note that this workaround is only for experiments and is not recommended for production systems.
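One reason the workaround is risky in production is exactly the injection problem described at the top of the thread: with </s> as a user-defined symbol, a raw "</s>" typed by a user maps straight to the EOS piece. A minimal preprocessing sketch (the reserved-symbol list is an assumption; adjust it to your model's vocabulary) would strip such pieces from untrusted input before encoding:

```python
import re

# Assumed reserved pieces; check these against your model's actual vocab.
RESERVED = ["<s>", "</s>", "<unk>", "<pad>"]

def sanitize(text):
    """Remove reserved control pieces from untrusted user input so they
    cannot inject BOS/EOS/UNK/PAD into the encoded sequence."""
    pattern = "|".join(re.escape(s) for s in RESERVED)
    return re.sub(pattern, "", text)

print(sanitize("hello world</s>"))
```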
from sentencepiece.
I think I understand. If </s> clashing in the production system is a problem, is it OK to redefine piece id 2 to something like <EOS>, then add <EOS> to the text input in preprocessing and use it in production as well?
from sentencepiece.
Do you want to define <EOS>=2 and disable </s>? It is partially possible. Please try:
% spm_train ... --unk_id=0 --bos_id=-1 --eos_id=-1 --pad_id=-1 --user_defined_symbols=<EOS>
With this setting, <EOS> will have id 1, because the next empty id after 0 is 1, and the bos/eos/pad ids are disabled (-1). Basically, we can assign arbitrary ids to <s>, </s>, <unk>, and <pad>. The ids for user defined symbols are assigned automatically from 0, 1, 2, ... as long as those ids are not already used.
So the following command will assign BOS=1 and EOS=2:
% spm_train ... --unk_id=0 --bos_id=-1 --eos_id=-1 --pad_id=-1 --user_defined_symbols=<BOS>,<EOS>
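The id layout described above can be sketched as a small simulation (simplified logic inferred from the explanation, not SentencePiece's actual code): meta pieces keep their requested ids, -1 means disabled, and user-defined symbols then fill the lowest unused ids in order.

```python
def assign_ids(unk_id=0, bos_id=-1, eos_id=-1, pad_id=-1, user_defined_symbols=()):
    """Sketch of the id layout: meta pieces take their requested ids
    (-1 disables them), then user-defined symbols fill the lowest free ids."""
    ids = {}
    for piece, pid in (("<unk>", unk_id), ("<s>", bos_id),
                       ("</s>", eos_id), ("<pad>", pad_id)):
        if pid >= 0:
            ids[piece] = pid
    used = set(ids.values())
    next_id = 0
    for sym in user_defined_symbols:
        while next_id in used:
            next_id += 1
        ids[sym] = next_id
        used.add(next_id)
    return ids

print(assign_ids(user_defined_symbols=["<EOS>"]))           # <EOS> gets id 1
print(assign_ids(user_defined_symbols=["<BOS>", "<EOS>"]))  # BOS=1, EOS=2
```

This reproduces both commands above: with only <unk> enabled at id 0, the first free id is 1, so a single user-defined <EOS> lands there, and <BOS>,<EOS> land at 1 and 2.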
from sentencepiece.
Thanks, that makes sense.
I see that if you input ../data/botchan.txt as you do above, each "sentence" is really just an arbitrarily segmented 80-character sequence, no? Does SP's internal unigram model not really care about this, as long as there are enough "sentences" to get a good sense of the conditional probabilities?
But if you add SetEncodeExtraOptions('bos:eos'), you will get inputs like this:
<s>Because of an hereditary recklessness, I have been playing always a</s>
<s>losing game since my childhood. During my grammar school days, I was</s>
<s>once laid up for about a week by jumping from the second story of the</s>
Even if this doesn't matter for SP tokenization, won't it adversely affect NMT or language-model training down the line? For example, a language model would learn that sentences commonly end without punctuation and that periods can appear mid-sentence.
from sentencepiece.
With the default setting, spm doesn't extract pieces that cross whitespace. So, as long as each word is preserved on a single line, the trained model will not be different.
If the DNN model is sensitive to word position (where in the sequence a word appears), using <s>/</s> will affect the results.
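The first point can be illustrated with a small check: if candidate pieces never cross whitespace, the counts of word-internal substrings, which is roughly what training sees under the default setting, are identical for wrapped and unwrapped text. A self-contained sketch, not SentencePiece's actual extraction code:

```python
from collections import Counter

def word_internal_substrings(text, max_len=4):
    """Count candidate pieces that never cross whitespace, mirroring the
    default behavior where pieces do not span word boundaries."""
    counts = Counter()
    for word in text.split():          # split() treats newlines like spaces
        for i in range(len(word)):
            for j in range(i + 1, min(i + 1 + max_len, len(word) + 1)):
                counts[word[i:j]] += 1
    return counts

unwrapped = "Because of an hereditary recklessness I have been playing always"
wrapped = "Because of an hereditary\nrecklessness I have been\nplaying always"

# Same candidate statistics, so training on line-wrapped text is equivalent
# as long as no word is split across lines.
print(word_internal_substrings(unwrapped) == word_internal_substrings(wrapped))
```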
from sentencepiece.
Is there any API to set --user_defined_symbols?
I can only find the function SetEncodeExtraOptions, which cannot set user_defined_symbols.
from sentencepiece.