Comments (7)
@taku910 just to recap my previous question and this one, these are the steps to normalize documents containing the line feed \n:
1) Replace every \n with <n> in the training set, e.g.
awk '{gsub(/\\n/,"<n>")}1' in_dataset.csv > out_dataset.csv
2) Train with spm_train, passing --user_defined_symbols='<n>' to define the custom symbol for the line feed:
spm_train --user_defined_symbols='<n>' --model_prefix=m_spm --vocab_size=32000 --character_coverage=1.0 --model_type=bpe --input=out_dataset.csv
3) Run inference with spm_encode in the usual way, after applying the same <n> replacement to the input:
echo "Hello<n>World" | spm_encode --model=m_spm.model --output_format=piece
▁ H e llo <n> W or ld
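The replacement in step 1 can also be sketched in plain Python instead of awk; a minimal illustration, where <n> is just the placeholder chosen above (the function names are illustrative, not part of sentencepiece):

```python
def escape_newlines(text: str, symbol: str = "<n>") -> str:
    """Replace literal line feeds with a placeholder symbol so a whole
    document fits on one line of spm_train's one-sentence-per-line input."""
    return text.replace("\n", symbol)

def unescape_newlines(text: str, symbol: str = "<n>") -> str:
    """Restore the line feeds after decoding."""
    return text.replace(symbol, "\n")

doc = "Hello\nWorld"
escaped = escape_newlines(doc)
print(escaped)                             # Hello<n>World
print(unescape_newlines(escaped) == doc)   # True
```

The same escaping must be applied to the text fed to spm_encode at inference time, and reversed after spm_decode.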
Since this matters for real-world documents, could you please document this process?
Thanks a lot!
from sentencepiece.
Hey @taku910,
spm_train assumes that training data is one-sentence-per-line format,
meaning that the training data will never contain "\n".
Doesn't that mean the training data should have only one line, to avoid using "\n"? This is new to me; I have never created multiple lines without using "\n".
Thank you for the feedback.
spm_train assumes that training data is one-sentence-per-line format,
meaning that the training data will never contain "\n".
A handy workaround to recognize "\n" is as follows:
- Prepare a custom user-defined symbol for "\n", for instance "<n>".
- Run spm_train with --user_defined_symbols="<n>" so that <n> is recognized as one symbol in any context.
- Escape "\n" in the input sentences to "<n>".
% ./spm_train --input=../data/botchan.txt --model_prefix=m --user_defined_symbols='<n>' --vocab_size=1000
% echo "hello<n>world" | ./spm_encode --model=m.model
▁he ll o <n> w or l d
Note that the symbol "<n>" must be unique enough; otherwise, users could change the behavior by putting a literal "<n>" in their request :)
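The uniqueness caveat above can be mitigated by guarding any pre-existing occurrences of the symbol before escaping; a minimal sketch, assuming a guard token "<raw_n>" that never occurs in the data (both tokens and function names are illustrative, not part of sentencepiece):

```python
def safe_escape(text: str, symbol: str = "<n>", guard: str = "<raw_n>") -> str:
    """Neutralize literal occurrences of the placeholder, then escape
    real line feeds, so user input cannot inject the placeholder."""
    return text.replace(symbol, guard).replace("\n", symbol)

def safe_unescape(text: str, symbol: str = "<n>", guard: str = "<raw_n>") -> str:
    """Invert safe_escape: restore line feeds, then the guarded literals."""
    return text.replace(symbol, "\n").replace(guard, symbol)

print(safe_escape("a<n>b\nc"))   # a<raw_n>b<n>c
print(safe_unescape(safe_escape("a<n>b\nc")) == "a<n>b\nc")  # True
```

This only holds while the guard token itself never appears in the raw data; for untrusted input you would pick a correspondingly unlikely guard.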
Another minor FYI: the default normalization rule has changed from "nfkc" to "nmt_nfkc", which normalizes "\n" into " ". As the normalization rule is embedded in the model file (self-contained), this change does not break backward compatibility.
Is it OK to add "\n" to user-defined-symbols?
It is possible, but at your own risk at this moment, as the vocab file also uses a one-token-per-line format. You may see some side effects there, though sentencepiece itself does not use this vocab file.
Please try the following (at your own risk):
- Copy the default normalization rule data/nmt_nfkc.tsv to new.tsv.
- Remove the rules for \r and \n from new.tsv to stop the normalization around \r and \n.
- Run spm_train --normalization_rule_tsv=new.tsv --user_defined_symbols=\n
(It would be nicer to use the C++/Python API for training so that \n is passed to the trainer properly.)
- Check the *.vocab file to confirm that it contains "\n" as one token.
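The TSV-editing step above can be sketched as a small filter, assuming each rule line maps a space-separated list of hex Unicode code points to its normalized form (e.g. "000A<TAB>0020"); verify this layout against your copy of nmt_nfkc.tsv before relying on it:

```python
# Code points to stop normalizing: 000A is "\n", 000D is "\r".
DROP = {"000A", "000D"}

def strip_newline_rules(lines):
    """Keep only the rules whose source side touches neither \r nor \n."""
    kept = []
    for line in lines:
        src = line.split("\t", 1)[0].split()
        if not DROP & set(src):
            kept.append(line)
    return kept

# Toy rules: an NFC-style composition, and two hypothetical newline rules.
rules = ["0041 0300\t00C0", "000A\t0020", "000D 000A\t0020"]
print(strip_newline_rules(rules))  # ['0041 0300\t00C0']
```

Applied to nmt_nfkc.tsv, the surviving lines would be written out as new.tsv for --normalization_rule_tsv.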
Hm, this sounds dangerous. I think it's better to replace \n in the source text with a custom symbol, as you said.