Comments (6)
Thank you for the suggestion. I will implement this feature.
from sentencepiece.
Hi,
I've just implemented the vocabulary restriction feature.
The usage is basically the same as in subword-nmt.
% spm_encode --generate_vocabulary --model=spm.model < input.L1 > vocab.L1
% spm_encode --generate_vocabulary --model=spm.model < input.L2 > vocab.L2
% spm_encode --vocabulary=vocab.L1 --vocabulary_threshold=50 --model=spm.model < input > input.seg
% spm_encode --vocabulary=vocab.L2 --vocabulary_threshold=50 --model=spm.model < input > input.seg
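In case it helps, the restriction these flags apply can be sketched with a small hypothetical helper. The vocab-file layout assumed here (one `<piece>\t<frequency>` per line) is an assumption, not confirmed from the docs:

```python
# Hypothetical helper illustrating what --vocabulary together with
# --vocabulary_threshold restrict encoding to. The vocab-file format
# ("<piece>\t<frequency>" per line) is an assumption.
def load_restricted_vocab(lines, threshold=50):
    allowed = set()
    for line in lines:
        piece, freq = line.rstrip("\n").split("\t")
        if int(freq) >= threshold:  # keep only sufficiently frequent pieces
            allowed.add(piece)
    return allowed

# Pieces below the threshold are excluded, so the encoder has to segment
# them into smaller pieces that are still in the allowed set.
vocab = load_restricted_vocab(["\u2581the\t120", "\u2581rare\t3"])
```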
I will update the document soon.
That was quick, thanks!
I have one side question: normally spm_train generates a vocabulary file as well. Where is that file used in the pipeline? Once we encode our training set with an SPM model, we generally let the NMT toolkit create its own vocabulary file, right? So before this change, was the .vocab file there solely for convenience or statistics?
SentencePiece (spm_(encode|decode)) doesn't use the *.vocab file, as the same information is stored in the *.model file. The vocab file is emitted just for reference. An MT toolkit can load the vocab file to manage its vocabulary, and spm_encode can directly generate an id sequence, which bypasses the vocabulary management in the NMT toolkit.
One note: the *.vocab file doesn't contain all the unique tokens in the training corpus. spm_train first builds the valid character set so that character-level coverage reaches 99.95% (this can be configured with the --character_coverage=1.0 parameter). This workaround was introduced to handle CJK languages, which have large character sets and may contain many noisy symbols depending on the corpus. Rare characters in the corpus are treated as UNK and are not in the .vocab file.
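The coverage step described above can be sketched roughly as follows. This is an illustration in the spirit of --character_coverage, not SentencePiece's actual implementation:

```python
from collections import Counter

# Illustration (not SentencePiece's actual algorithm) of choosing a
# character set for a coverage target: keep the most frequent characters
# until they account for the requested fraction of all occurrences.
def choose_charset(corpus, coverage=0.9995):
    counts = Counter(ch for line in corpus for ch in line)
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, n in counts.most_common():
        if covered / total >= coverage:
            break  # the remaining rare characters would map to UNK
        kept.add(ch)
        covered += n
    return kept
```

With coverage=1.0 every character in the corpus is kept, which matches the recommendation to use --character_coverage=1.0 for languages with small alphabets.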
Updated the document.
https://github.com/google/sentencepiece/blob/master/README.md#vocabulary-restriction
Let me close this issue. Thank you!
Hi, in subword-nmt there are two types of tokens: a normal one (e.g. turn) and one that is part of a longer token (e.g. turnaround). With that, we need to handle not only normal tokens, say turn, but also turn@@, since the latter is used to segment a word like turnaround (e.g. turn@@ around), assuming that turnaround does not appear in the merge operations.
As a result, the true vocabulary collected from the encoded training dataset, which the NMT system has to handle, is larger than the number of merge operations in subword-nmt.
I noticed this doesn't seem to be the case with SentencePiece. Specifically, when I count the unique IDs in the encoded training dataset, the count equals the number of merge operations. So I am curious how you handle this (the @@ part), if you can share?
Thanks.
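For what it's worth, the difference I'm describing can be sketched with a small hypothetical conversion (not part of either toolkit): SentencePiece marks the start of a word with a "▁" prefix on a single piece, whereas subword-nmt marks continuation with an "@@" suffix, so the same surface form can correspond to two distinct subword-nmt entries:

```python
# Hypothetical illustration (not part of the SentencePiece API): convert
# SentencePiece-style pieces, where "\u2581" marks the START of a word, into
# subword-nmt-style tokens, where "@@" marks a piece that CONTINUES a word.
def spm_to_subword_nmt(pieces):
    out = []
    for i, piece in enumerate(pieces):
        token = piece.lstrip("\u2581")
        # subword-nmt appends "@@" iff the next piece continues the same word.
        next_continues = i + 1 < len(pieces) and not pieces[i + 1].startswith("\u2581")
        out.append(token + "@@" if next_continues else token)
    return out

# "turnaround" segmented as \u2581turn + around:
print(spm_to_subword_nmt(["\u2581turn", "around"]))  # -> ['turn@@', 'around']
# A standalone word keeps its surface form:
print(spm_to_subword_nmt(["\u2581turn"]))            # -> ['turn']
```

So where subword-nmt needs both turn and turn@@ in its vocabulary, SentencePiece distinguishes the two cases by the "▁" prefix alone, keeping the vocabulary equal to the set of model pieces.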