Comments (6)
Thank you for the feedback, and sorry for the lack of detailed documentation on these special symbols.
By default, <s> and </s> in the input text are not recognized as BOS/EOS. This prevents users from indirectly changing the encoder's behavior by putting <s> or </s> in their requests, which would be a big problem in a production setting. That's why BOS/EOS are inserted only via SetEncodeExtraOptions at the moment.
However, there is a workaround. --user_defined_symbols actually accepts <s> </s>.
User defined symbols are always treated as one symbol regardless of the context.
Here is an example.
% ./spm_train --input=../data/botchan.txt --model_prefix=m --vocab_size=1000
% echo "hello world</s>" | ./spm_encode --model=m.model
▁he ll o ▁world </ s >
% echo "</s>" | ./spm_decode --model=m.model
(empty line)
% ./spm_train --input=../data/botchan.txt --model_prefix=m --user_defined_symbols='</s>' --vocab_size=1000
% echo "hello world</s>" | ./spm_encode --model=m.model
▁he ll o ▁world </s>
% echo "hello world" | ./spm_encode --model=m.model --extra_options=eos
▁he ll o ▁world </s>
% echo "hello world</s>" | ./spm_encode --model=m.model --extra_options=eos
▁he ll o ▁world </s> </s>
% echo "</s>" | ./spm_decode --model=m.model
</s>
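The behavior above, user-defined symbols always matching as a single piece regardless of context, can be sketched in plain Python. This is a simplified illustration of the idea, not SentencePiece's actual implementation: user-defined symbols are matched first and kept whole, and only the remaining spans go through normal subword segmentation (here a toy character-pair encoder stands in for the real model).

```python
import re

def encode_with_user_symbols(text, user_symbols, subword_encode):
    """Sketch: user-defined symbols are matched first and kept as single
    pieces; only the text between them is segmented into subwords."""
    if not user_symbols:
        return subword_encode(text)
    # Longest symbols first, so overlapping symbols match greedily.
    pattern = "(" + "|".join(
        re.escape(s) for s in sorted(user_symbols, key=len, reverse=True)
    ) + ")"
    pieces = []
    for chunk in re.split(pattern, text):
        if chunk in user_symbols:
            pieces.append(chunk)          # always one piece, never split
        elif chunk:
            pieces.extend(subword_encode(chunk))
    return pieces

# Toy stand-in for a trained model: prefix each word with "▁", split into pairs.
def toy_encode(s):
    return [("▁" + w)[i:i + 2] for w in s.split() for i in range(0, len(w) + 1, 2)]

print(encode_with_user_symbols("hello world</s>", {"</s>"}, toy_encode))
```

With "</s>" registered as a user-defined symbol it survives as one piece, mirroring the second spm_encode transcript above.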
Again, please note that this workaround is only for experiments and is not recommended for production systems.
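One reason the workaround is risky in production is exactly the injection problem described at the top of the thread: with </s> as a user-defined symbol, a raw "</s>" typed by a user maps straight to the EOS piece. A minimal preprocessing sketch (the reserved-symbol list is an assumption; adjust it to your model's vocabulary) would strip such pieces from untrusted input before encoding:

```python
import re

# Assumed reserved pieces; check these against your model's actual vocab.
RESERVED = ["<s>", "</s>", "<unk>", "<pad>"]

def sanitize(text):
    """Remove reserved control pieces from untrusted user input so they
    cannot inject BOS/EOS/UNK/PAD into the encoded sequence."""
    pattern = "|".join(re.escape(s) for s in RESERVED)
    return re.sub(pattern, "", text)

print(sanitize("hello world</s>"))
```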
from sentencepiece.
I think I understand. If </s> clashing in the production system is a problem, is it OK to redefine piece id 2 to something like <EOS>, then add <EOS> to the text input in preprocessing and use it in production as well?
from sentencepiece.
Do you want to define <EOS>=2 and disable </s>? It is partially possible. Please try:
% spm_train ... --unk_id=0 --bos_id=-1 --eos_id=-1 --pad_id=-1 --user_defined_symbols=<EOS>
With this setting, <EOS> will have id 1, because the next empty id after 0 is 1, and the bos/eos/pad ids are disabled (-1). Basically, we can assign arbitrary ids to <s>, </s>, <unk>, and <pad>. The ids for user defined symbols are assigned automatically from 0, 1, 2, ... as long as those ids are not already used.
So the following command will assign BOS=1 and EOS=2:
% spm_train ... --unk_id=0 --bos_id=-1 --eos_id=-1 --pad_id=-1 --user_defined_symbols=<BOS>,<EOS>
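The id layout described above can be sketched as a small simulation (simplified logic inferred from the explanation, not SentencePiece's actual code): meta pieces keep their requested ids, -1 means disabled, and user-defined symbols then fill the lowest unused ids in order.

```python
def assign_ids(unk_id=0, bos_id=-1, eos_id=-1, pad_id=-1, user_defined_symbols=()):
    """Sketch of the id layout: meta pieces take their requested ids
    (-1 disables them), then user-defined symbols fill the lowest free ids."""
    ids = {}
    for piece, pid in (("<unk>", unk_id), ("<s>", bos_id),
                       ("</s>", eos_id), ("<pad>", pad_id)):
        if pid >= 0:
            ids[piece] = pid
    used = set(ids.values())
    next_id = 0
    for sym in user_defined_symbols:
        while next_id in used:
            next_id += 1
        ids[sym] = next_id
        used.add(next_id)
    return ids

print(assign_ids(user_defined_symbols=["<EOS>"]))           # <EOS> gets id 1
print(assign_ids(user_defined_symbols=["<BOS>", "<EOS>"]))  # BOS=1, EOS=2
```

This reproduces both commands above: with only <unk> enabled at id 0, the first free id is 1, so a single user-defined <EOS> lands there, and <BOS>,<EOS> land at 1 and 2.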
from sentencepiece.
Thanks, that makes sense.
I see that if you input ../data/botchan.txt as you do above, each "sentence" is really just an arbitrarily segmented 80-character sequence, no? Does SP's internal unigram model not really care about this, as long as there are enough "sentences" to get a good sense of the conditional probabilities?
But if you add SetEncodeExtraOptions('bos:eos'), you will get inputs like this:
<s>Because of an hereditary recklessness, I have been playing always a</s>
<s>losing game since my childhood. During my grammar school days, I was</s>
<s>once laid up for about a week by jumping from the second story of the</s>
Even if this doesn't matter for SP tokenization, won't it adversely affect NMT or language-model training down the line? For example, a language model would learn that sentences commonly end without punctuation and that periods can appear mid-sentence.
from sentencepiece.
With the default setting, spm doesn't extract pieces that cross whitespace. So, as long as each word is preserved on a single line, the trained model will not be different.
If the DNN model is sensitive to word position (where in the sequence a word appears), using <s>/</s> will affect the results.
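The first point can be illustrated with a small check: if candidate pieces never cross whitespace, the counts of word-internal substrings, which is roughly what training sees under the default setting, are identical for wrapped and unwrapped text. A self-contained sketch, not SentencePiece's actual extraction code:

```python
from collections import Counter

def word_internal_substrings(text, max_len=4):
    """Count candidate pieces that never cross whitespace, mirroring the
    default behavior where pieces do not span word boundaries."""
    counts = Counter()
    for word in text.split():          # split() treats newlines like spaces
        for i in range(len(word)):
            for j in range(i + 1, min(i + 1 + max_len, len(word) + 1)):
                counts[word[i:j]] += 1
    return counts

unwrapped = "Because of an hereditary recklessness I have been playing always"
wrapped = "Because of an hereditary\nrecklessness I have been\nplaying always"

# Same candidate statistics, so training on line-wrapped text is equivalent
# as long as no word is split across lines.
print(word_internal_substrings(unwrapped) == word_internal_substrings(wrapped))
```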
from sentencepiece.
Is there any API to set --user_defined_symbols?
I can only find the function SetEncodeExtraOptions, which cannot set user_defined_symbols.
from sentencepiece.