Comments (4)

taku910 commented on May 5, 2024

Thank you for using sentencepiece.

Yes, you can simply feed the Chinese and English text together as follows.

% spm_train --input=en.txt,zh.txt --model_prefix=shared ...

As spm_train only loads the first 10M lines (configurable via --input_sentence_size), and the seed vocabulary is generated from the first 2M sentences (--mining_sentence_size), it is better to randomly shuffle and merge the files in advance:

% cat en.txt zh.txt | shuf | head -n 1000000 > shared.txt
% spm_train --input=shared.txt --model_prefix=...
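
For reference, the same training can also be launched from the sentencepiece Python bindings rather than the CLI. This is a minimal sketch assuming the merged shared.txt built above; the vocab_size value is an illustrative placeholder, not a recommendation from this thread.

import sentencepiece as spm

# Train on the pre-shuffled, merged Chinese+English corpus built above.
# model_prefix='shared' produces shared.model and shared.vocab.
spm.SentencePieceTrainer.train(
    input='shared.txt',
    model_prefix='shared',
    vocab_size=32000,  # assumed value for illustration
)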

Background:
By default, SentencePiece uses the Unicode script type as a boundary constraint, i.e., pieces crossing different script types are never extracted. Since Chinese (Han) characters and Latin characters belong to different script types, SentencePiece always puts a boundary between them, e.g., between "hockey" and "羽毛". Passing --split_by_unicode_script=false disables this constraint and may allow pieces like "key羽" to be extracted.
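
For example, here is a minimal sketch of how the flag is passed at training time and how to inspect the resulting segmentation; the flag name is real, but the tiny vocab size and the sample string are made up for illustration:

import sentencepiece as spm

# Train with the Unicode-script boundary constraint disabled.
spm.SentencePieceTrainer.train(
    input='shared.txt',
    model_prefix='nosplit',
    vocab_size=8000,                # assumed value for illustration
    split_by_unicode_script=False,  # allow pieces mixing Latin and Han characters
)

sp = spm.SentencePieceProcessor(model_file='nosplit.model')
# With the default (true), a boundary always falls between "hockey" and "羽毛";
# with false, mixed pieces such as "key羽" may be learned if frequent enough.
print(sp.encode('hockey羽毛球', out_type=str))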

Thank you.

richielo commented on May 5, 2024

Thanks for the prompt reply and explanation. That sounds perfect for my use case. May I ask how I should balance the input sentence size against the mining sentence size? Is there a range of ratios that works best?

neubig commented on May 5, 2024

I'd actually like to add a question: I noticed that there's a hard limit on --mining_sentence_size of 5M. Is there a reason why this needs to be a hard constraint? Or would it be safe for me to modify it?

taku910 commented on May 5, 2024

Ideally, we want to set --mining_sentence_size == --input_sentence_size.

The default --mining_sentence_size is smaller because the mining step is memory-consuming. In this step, frequent substrings are extracted from the corpus to make the seed pieces. The space complexity of the mining step is about 4 * 4 * N bytes, where N is the number of Unicode characters in the corpus.

As this flag only affects seed vocabulary selection, the default setting should work as long as the input is randomly shuffled; the main SentencePiece training step still uses the entire corpus.
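
To make the 4 * 4 * N estimate concrete, here is a back-of-the-envelope calculation; the sentence count matches the 5M limit discussed above, but the average sentence length is an assumption for illustration:

# Rough memory estimate for the seed-mining step: about 4 * 4 * N bytes.
sentences = 5_000_000      # the --mining_sentence_size hard limit
avg_chars = 50             # assumed average Unicode characters per sentence
n = sentences * avg_chars  # N = 250M characters
print(f"~{4 * 4 * n / 1e9:.1f} GB")  # prints ~4.0 GB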

By the way, I will change the hard limit of --mining_sentence_size to match that of the other variables, which should be more consistent.
