Comments (7)
@taku910 just to recap my previous question and this one, these are the steps to normalize documents containing the line feed \n:
1) Replace every \n with <n> in the training set, e.g.
awk '{gsub(/\\n/,"<n>")}1' in_dataset.csv > out_dataset.csv
2) Train with spm_train, passing --user_defined_symbols='<n>' to define the custom symbol for the line feed:
spm_train --user_defined_symbols='<n>' --model_prefix=m_spm --vocab_size=32000 --character_coverage=1.0 --model_type=bpe --input=out_dataset.csv
3) Run inference with spm_encode in the usual way, after applying the same <n> replacement to the input:
echo "Hello<n>World" | spm_encode --model=m_spm.model --output_format=piece
▁ H e llo <n> W or ld
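The replacement in step 1 can also be sketched in plain Python instead of awk; a minimal illustration, where <n> is just the placeholder chosen above (the function names are illustrative, not part of sentencepiece):

```python
def escape_newlines(text: str, symbol: str = "<n>") -> str:
    """Replace literal line feeds with a placeholder symbol so a whole
    document fits on one line of spm_train's one-sentence-per-line input."""
    return text.replace("\n", symbol)

def unescape_newlines(text: str, symbol: str = "<n>") -> str:
    """Restore the line feeds after decoding."""
    return text.replace(symbol, "\n")

doc = "Hello\nWorld"
escaped = escape_newlines(doc)
print(escaped)                             # Hello<n>World
print(unescape_newlines(escaped) == doc)   # True
```

The same escaping must be applied to the text fed to spm_encode at inference time, and reversed after spm_decode.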
Since this matters for real-world documents, could you please document this process?
Thanks a lot!
from sentencepiece.
Hey @taku910,
spm_train assumes that training data is one-sentence-per-line format,
meaning that the training data will never contain "\n".
Doesn't that mean the training data should have only one line, to avoid using "\n"? This is new to me; I have never created multiple lines without using "\n".
Thank you for the feedback.
spm_train assumes that training data is one-sentence-per-line format,
meaning that the training data will never contain "\n".
A handy workaround to recognize "\n" is as follows:
- Prepare a custom user-defined symbol for "\n", for instance "<n>".
- Run spm_train with --user_defined_symbols="<n>" so that <n> is recognized as one symbol in any context.
- Escape "\n" in the input sentences to "<n>".
% ./spm_train --input=../data/botchan.txt --model_prefix=m --user_defined_symbols='<n>' --vocab_size=1000
% echo "hello<n>world" | ./spm_encode --model=m.model
▁he ll o <n> w or l d
Note that the symbol "<n>" must be unique enough; otherwise, users could change the behavior by putting a literal "<n>" in their request :)
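The uniqueness caveat above can be mitigated by guarding any pre-existing occurrences of the symbol before escaping; a minimal sketch, assuming a guard token "<raw_n>" that never occurs in the data (both tokens and function names are illustrative, not part of sentencepiece):

```python
def safe_escape(text: str, symbol: str = "<n>", guard: str = "<raw_n>") -> str:
    """Neutralize literal occurrences of the placeholder, then escape
    real line feeds, so user input cannot inject the placeholder."""
    return text.replace(symbol, guard).replace("\n", symbol)

def safe_unescape(text: str, symbol: str = "<n>", guard: str = "<raw_n>") -> str:
    """Invert safe_escape: restore line feeds, then the guarded literals."""
    return text.replace(symbol, "\n").replace(guard, symbol)

print(safe_escape("a<n>b\nc"))   # a<raw_n>b<n>c
print(safe_unescape(safe_escape("a<n>b\nc")) == "a<n>b\nc")  # True
```

This only holds while the guard token itself never appears in the raw data; for untrusted input you would pick a correspondingly unlikely guard.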
Another minor FYI: the default normalization rule has changed from "nfkc" to "nmt_nfkc", which normalizes "\n" into " ". As the normalization rule is embedded in the model file (self-contained), this change does not break backward compatibility.
Is it OK to add "\n" to user-defined-symbols?
It is possible, but at your own risk at this moment, as the vocab file also uses a one-token-per-line format. You may see some side effects there, though sentencepiece itself does not use this vocab file.
Please try the following (at your own risk):
- Copy the default normalization rule data/nmt_nfkc.tsv to new.tsv.
- Remove the rules for \r and \n from new.tsv to stop the normalization around \r and \n.
- Run spm_train --normalization_rule_tsv=new.tsv --user_defined_symbols=\n
(It would be nicer to use the C++/Python API for training so that \n is passed to the trainer properly.)
- Check the *.vocab file to confirm that it contains "\n" as one token.
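The TSV-editing step above can be sketched as a small filter, assuming each rule line maps a space-separated list of hex Unicode code points to its normalized form (e.g. "000A<TAB>0020"); verify this layout against your copy of nmt_nfkc.tsv before relying on it:

```python
# Code points to stop normalizing: 000A is "\n", 000D is "\r".
DROP = {"000A", "000D"}

def strip_newline_rules(lines):
    """Keep only the rules whose source side touches neither \r nor \n."""
    kept = []
    for line in lines:
        src = line.split("\t", 1)[0].split()
        if not DROP & set(src):
            kept.append(line)
    return kept

# Toy rules: an NFC-style composition, and two hypothetical newline rules.
rules = ["0041 0300\t00C0", "000A\t0020", "000D 000A\t0020"]
print(strip_newline_rules(rules))  # ['0041 0300\t00C0']
```

Applied to nmt_nfkc.tsv, the surviving lines would be written out as new.tsv for --normalization_rule_tsv.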
Hm, this sounds dangerous. I think it's better to replace \n in the source text with a custom symbol, as you said.