Comments (6)

taku910 commented on May 5, 2024

Thank you for the feedback, and sorry for the lack of detailed documentation for these special symbols.

By default, <s> and </s> in the input text are not recognized as BOS/EOS. This prevents users from indirectly changing the model's behavior by putting <s> or </s> in their requests, which would be a serious problem in a production setting. That is why BOS/EOS are currently inserted only through SetEncodeExtraOptions.

However, there is a workaround: --user_defined_symbols actually accepts <s> and </s>.
User-defined symbols are always treated as a single symbol regardless of the context.

Here is an example.

% ./spm_train --input=../data/botchan.txt --model_prefix=m --vocab_size=1000
% echo "hello world</s>" | ./spm_encode --model=m.model                     
▁he ll o ▁world </ s >
% echo "</s>" | ./spm_decode --model=m.model

(empty line)
% ./spm_train --input=../data/botchan.txt --model_prefix=m --user_defined_symbols='</s>' --vocab_size=1000
% echo "hello world</s>" | ./spm_encode --model=m.model                                                   
▁he ll o ▁world </s>

% echo "hello world" | ./spm_encode --model=m.model --extra_options=eos 
▁he ll o ▁world </s>

% echo "hello world</s>" | ./spm_encode --model=m.model --extra_options=eos
▁he ll o ▁world </s> </s>

% echo "</s>" | ./spm_decode --model=m.model
</s>

Again, please note that this workaround is only for experiments and is not recommended for production systems.
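
If the workaround is used anyway, one possible precaution is to strip the reserved strings from untrusted text before encoding it. The following is only a minimal sketch of that idea: `strip_reserved_symbols` is a hypothetical helper, not part of SentencePiece, and the default list of reserved strings is an assumption that should match whatever was passed to --user_defined_symbols.

```python
def strip_reserved_symbols(text, reserved=("<s>", "</s>")):
    """Remove reserved symbol strings from untrusted text (hypothetical helper)."""
    previous = None
    # Repeat until nothing changes: removing one match can splice a new one
    # together, e.g. "</</s>s>" becomes "</s>" after a single pass.
    while previous != text:
        previous = text
        for symbol in reserved:
            text = text.replace(symbol, "")
    return text


print(strip_reserved_symbols("hello world</s>"))
```

BOS/EOS can then still be added on the trusted side with --extra_options=eos, as in the transcript above.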

from sentencepiece.

sooheon commented on May 5, 2024

I think I understand. If </s> clashing in a production system is the problem, is it OK to redefine piece id 2 as something like <EOS>, add <EOS> to the text input during preprocessing, and use it in production as well?

taku910 commented on May 5, 2024

Do you want to define <EOS>=2 and disable </s>?

It is partially possible. Please try:
% spm_train ... --unk_id=0 --bos_id=-1 --eos_id=-1 --pad_id=-1 --user_defined_symbols='<EOS>'

With this setting, <EOS> will have id 1, because the next free id after 0 is 1 and the bos/eos/pad ids are disabled (-1). Basically, we can assign arbitrary ids to <s>, </s>, <unk>, and <pad>. The ids for user-defined symbols are assigned automatically from 0, 1, 2, ..., skipping any ids that are already in use.

So, the following command will assign BOS=1 and EOS=2:
% spm_train ... --unk_id=0 --bos_id=-1 --eos_id=-1 --pad_id=-1 --user_defined_symbols='<BOS>,<EOS>'
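
The id layout described above can be modeled in a few lines of Python. This is only a sketch of the rule as stated in this comment (reserved pieces take their requested ids, -1 disables a piece, user-defined symbols fill the lowest free ids in order). It is not SentencePiece's actual code, and `assign_piece_ids` is a hypothetical name.

```python
def assign_piece_ids(user_defined_symbols, unk_id=0, bos_id=1, eos_id=2, pad_id=-1):
    """Model the id layout described in the comment (not the library's code)."""
    ids = {}
    # Reserved pieces take their requested ids; a negative id disables the piece.
    for piece, piece_id in (("<unk>", unk_id), ("<s>", bos_id),
                            ("</s>", eos_id), ("<pad>", pad_id)):
        if piece_id >= 0:
            ids[piece_id] = piece
    # User-defined symbols fill the lowest unused ids, in the order given.
    next_id = 0
    for symbol in user_defined_symbols:
        while next_id in ids:
            next_id += 1
        ids[next_id] = symbol
    return {piece: piece_id for piece_id, piece in sorted(ids.items())}


print(assign_piece_ids(["<EOS>"], bos_id=-1, eos_id=-1))
print(assign_piece_ids(["<BOS>", "<EOS>"], bos_id=-1, eos_id=-1))
```

With bos/eos disabled, the first call maps <EOS> to 1 and the second maps <BOS> to 1 and <EOS> to 2, matching the two commands above.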

sooheon commented on May 5, 2024

Thanks, that makes sense.

I see that if you input ../data/botchan.txt as you do above, each "sentence" is really just an arbitrarily segmented 80-character sequence, no? Does SP's internal unigram model not really care about this, as long as there are enough "sentences" to get a good sense of the conditional probabilities?

But if you add SetEncodeExtraOptions('eos:bos'), you will get inputs like this:

<s>Because of an hereditary recklessness, I have been playing always a</s>
<s>losing game since my childhood. During my grammar school days, I was</s>
<s>once laid up for about a week by jumping from the second story of the</s>

Even if this doesn't matter for SP tokenization, won't it adversely affect NMT or language model training down the line? For example, a language model would learn that sentences commonly end without punctuation and that periods can appear in the middle of sentences.

taku910 commented on May 5, 2024

With the default setting, spm doesn't extract pieces that cross whitespace. So, as long as no word is split across lines, the trained model will not be different.

If the DNN model is sensitive to word position (where the word appears), using <s> and </s> will affect the results.
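
The point about line breaks can be checked without SentencePiece at all. The sketch below assumes, as stated above, that candidate pieces never cross whitespace, so the trainer effectively sees a multiset of whitespace-delimited words. It then shows that re-wrapping text leaves that multiset unchanged as long as no word is split. The `word_counts` helper and the 70-column width are illustrative choices, not anything taken from the library.

```python
import textwrap
from collections import Counter


def word_counts(lines):
    """Count whitespace-delimited words across an iterable of lines."""
    return Counter(word for line in lines for word in line.split())


text = ("Because of an hereditary recklessness, I have been playing always a "
        "losing game since my childhood. During my grammar school days, I was "
        "once laid up for about a week by jumping from the second story of the")

# Wrap at 70 columns without ever splitting a word across lines.
wrapped = textwrap.wrap(text, width=70,
                        break_long_words=False, break_on_hyphens=False)

# The word multiset is identical whether the text is one line or several.
same = word_counts([text]) == word_counts(wrapped)
print(len(wrapped), same)
```

This only covers what the tokenizer trainer sees. A downstream model that receives <s> and </s> at the arbitrary line boundaries still sees different sequences.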

zhangyuhanjc commented on May 5, 2024

Is there any API to achieve --user_defined_symbols='</s>'?
I can only find the function SetEncodeExtraOptions, which cannot set user_defined_symbols.
