Comments (7)

loretoparisi commented on May 5, 2024

@taku910 just to recap my previous question and this one, these are the steps to normalize documents that contain the line feed \n:

1) Replace every \n in the training set with <n>, i.e. replace(/\\n/g,'<n>'), for example:

awk '{gsub(/\\n/,"<n>")}1' in_dataset.csv > out_dataset.csv

2) Train with spm_train using --user_defined_symbols='<n>' to define <n> as the custom symbol for the line feed \n:

spm_train --user_defined_symbols='<n>' --model_prefix=m_spm --vocab_size=32000 --character_coverage=1.0 --model_type=bpe --input=out_dataset.csv

3) Run inference with spm_encode in the usual way:

echo "Hello\nWorld" | spm_encode --model=m_spm.model --output_format=piece
▁ H e llo <n> W or ld
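
For reference, steps 1 and 2 can also be sketched in Python. This is only a rough sketch: the helper names are made up for illustration, and it assumes the sentencepiece Python package, whose SentencePieceTrainer.train takes the spm_train options as keyword arguments.

```python
def escape_literal_newlines(line: str, symbol: str = "<n>") -> str:
    """Replace the two-character sequence backslash + n with `symbol`,
    like the awk gsub in step 1."""
    return line.replace("\\n", symbol)


def train_kwargs(input_path: str, symbol: str = "<n>") -> dict:
    """Mirror the spm_train flags from step 2 as keyword arguments."""
    return {
        "input": input_path,
        "model_prefix": "m_spm",
        "vocab_size": 32000,
        "character_coverage": 1.0,
        "model_type": "bpe",
        "user_defined_symbols": [symbol],
    }


def train(input_path: str) -> None:
    # Imported lazily so the helpers above work without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**train_kwargs(input_path))
```

Calling train("out_dataset.csv") should then write m_spm.model and m_spm.vocab, the same as the command line version.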

Since this is pretty important for real-world documents, could you please add this process to the documentation?
Thanks a lot!

from sentencepiece.

mani-rai commented on May 5, 2024

Hey @taku910,

spm_train assumes that training data is one-sentence-per-line format,
meaning that the training data will never contain "\n".

Doesn't that mean the training data should have only one line, to avoid using "\n"? This is new to me; I have never created multiple lines without using "\n".

taku910 commented on May 5, 2024

Thank you for the feedback.

spm_train assumes that training data is one-sentence-per-line format,
meaning that the training data will never contain "\n".

A handy workaround to recognize "\n" is as follows:

  1. Prepare a custom user-defined symbol for "\n", for instance "<n>".
  2. Run spm_train with --user_defined_symbols="<n>" so that <n> is recognized as one symbol in any context.
  3. Escape "\n" in the input sentence into "<n>".
% ./spm_train --input=../data/botchan.txt --model_prefix=m --user_defined_symbols='<n>' --vocab_size=1000         
% echo "hello<n>world" | ./spm_encode --model=m.model 
▁he ll o <n> w or l d         
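
Step 3 and its inverse could look like this in Python. A minimal sketch, assuming sp is an already loaded sentencepiece.SentencePieceProcessor for a model trained with <n> as a user-defined symbol; the helper names are hypothetical.

```python
NEWLINE_SYMBOL = "<n>"  # must match the value given to --user_defined_symbols


def escape_newlines(text: str, symbol: str = NEWLINE_SYMBOL) -> str:
    """Turn real line breaks into the placeholder before encoding."""
    return text.replace("\r\n", "\n").replace("\n", symbol)


def unescape_newlines(text: str, symbol: str = NEWLINE_SYMBOL) -> str:
    """Turn the placeholder back into a line feed after decoding."""
    return text.replace(symbol, "\n")


def encode_with_newlines(sp, text: str):
    # out_type=str asks for pieces instead of ids.
    return sp.encode(escape_newlines(text), out_type=str)


def decode_with_newlines(sp, pieces) -> str:
    return unescape_newlines(sp.decode(pieces))
```

Note that "\r\n" comes back as a plain "\n" after the round trip, which is a choice made in this sketch and not something sentencepiece does.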

Note that the symbol "<n>" must be unique enough. Otherwise, users may change the behavior by putting a literal "<n>" in their request :)
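
One way to guard against that, sketched under the assumption that rejecting such input is acceptable for your application (the function names are hypothetical and not part of sentencepiece):

```python
def find_symbol_collisions(lines, symbol="<n>"):
    """Return the 1-based numbers of the lines that already contain `symbol`."""
    return [i for i, line in enumerate(lines, start=1) if symbol in line]


def escape_newlines_strict(text, symbol="<n>"):
    """Escape line feeds, refusing text that already contains `symbol`."""
    if symbol in text:
        raise ValueError(f"input already contains the reserved symbol {symbol!r}")
    return text.replace("\n", symbol)
```

Running find_symbol_collisions over the training corpus before picking the symbol shows whether the candidate already occurs in real text.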

taku910 commented on May 5, 2024

Another minor FYI: the default normalization rule has changed from "nfkc" to "nmt_nfkc", which normalizes "\n" into " ". As the normalization rule is embedded in the model file (self-contained), this change will not break backward compatibility.

sooheon commented on May 5, 2024

Is it OK to add "\n" to user-defined-symbols?

taku910 commented on May 5, 2024

It is possible, but at your own risk for now, because the vocab file is also in one-token-per-line format. You may see some serious side effects there, though sentencepiece itself does not use this vocab file.

Please try the following (at your own risk):

  • Copy the default normalization rule data/nmt_nfkc.tsv to new.tsv.
  • Remove the rules for \r and \n from new.tsv to stop the normalization around \r and \n.
  • Run spm_train --normalization_rule_tsv=new.tsv --user_defined_symbols=\n
    (It would be nicer to use the C++/Python API for training so that \n is passed to the training program properly.)
  • Check that the *.vocab file contains "\n" as one token.
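
The first two bullets could be scripted roughly as follows. This sketch assumes each rule line holds space-separated hexadecimal code points for the source, then a tab, then the target, and it drops every rule whose source mentions \n (0x0A) or \r (0x0D). Please check that against your copy of nmt_nfkc.tsv first; the function names are made up.

```python
def drop_rules_for(lines, code_points=(0x0A, 0x0D)):
    """Yield the rule lines whose source side contains none of `code_points`."""
    blocked = set(code_points)
    for line in lines:
        source = line.split("\t", 1)[0]
        try:
            src = {int(tok, 16) for tok in source.split()}
        except ValueError:
            yield line  # not a hex source column, keep the line untouched
            continue
        if src & blocked:
            continue  # rule touches \n or \r, so leave it out
        yield line


def write_filtered_rules(src_path, dst_path, code_points=(0x0A, 0x0D)):
    """Copy src_path to dst_path without the rules for `code_points`."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(drop_rules_for(src, code_points))
```

For example, write_filtered_rules("data/nmt_nfkc.tsv", "new.tsv") would produce the file to pass to --normalization_rule_tsv.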

sooheon commented on May 5, 2024

Hm, this sounds dangerous. I think it's better to replace \n in the source text with a custom symbol, as you said.
