To maximize likelihood, there are cases where subword tokens that are seen after a spa

Suppression of isolated ▁'s about sentencepiece HOT 6 CLOSED

google commented on May 5, 2024

Suppression of isolated ▁'s

from sentencepiece.

Comments (6)

taku910 commented on May 5, 2024

Thank you for the explanation.

There is a code block that filters out invalid pieces here:
https://github.com/google/sentencepiece/blob/master/src/trainer_interface.cc#L107

You can add the following if-then block:

if (*it == kWSChar && sentencepiece.size() == 1) {
  return false;
}

Then the isolated "_" will not be handled as a piece. Note that in this case, if the input sentence contains "foo_bar" and "b" is OOV, the output may include an independent "_", which is handled as an unknown symbol.
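To make the idea of the check concrete outside the trainer, here is a minimal standalone sketch. It is not SentencePiece code: the helper name IsAcceptablePiece and the constant kSpaceMarker are hypothetical, and it compares UTF-8 bytes rather than the Unicode code points the real trainer iterates over. It only illustrates rejecting a candidate piece that consists of the whitespace marker alone.

```cpp
#include <string_view>

// U+2581, the whitespace marker, encoded as UTF-8. This is a standalone
// stand-in for kWSChar (assumption: we work on UTF-8 bytes, not code points).
constexpr std::string_view kSpaceMarker = "\xe2\x96\x81";

// Hypothetical helper, not part of SentencePiece: returns false for a
// candidate piece that is empty or is the whitespace marker by itself.
bool IsAcceptablePiece(std::string_view piece) {
  if (piece.empty()) return false;
  if (piece == kSpaceMarker) return false;  // isolated marker is rejected
  return true;  // pieces such as marker + "foo", or plain "bar", pass
}
```

With this check, a piece made of the marker followed by other characters is still accepted; only the marker on its own is filtered out.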

We will add a flag to enable this behavior.


taku910 commented on May 5, 2024

Technically possible especially in Latin-based languages, but I would like to know why this behavior is necessary.

Another concern is that in languages without explicit word boundaries (Chinese/Japanese), the "_" symbol rarely appears, so it is natural to handle it as one independent symbol.


feddyfedfed commented on May 5, 2024

We are using the segmented text data for a speech recognition task, and in order to honor the language model probabilities, we need to incorporate the symbol in the vocabulary. We use silence as the symbol's pronunciation so that our search can still include it, but we also wish to experiment with not having it. In any case, if it is going to be too laborious, we can just manually edit our segmented texts. (But I realized that simply editing the segmented texts would result in a change in the vocabulary...)

On a more general note, what is the canonical way of telling SentencePiece NOT to segment between certain character combinations? For example, in Japanese, a contrived example for glides: キャラクター will be segmented to キ　ャラクター (again, this is just a contrived example, but it happens with other words). For a speech recognition task this is problematic, since we have models for the complete phoneme "ky".


feddyfedfed commented on May 5, 2024

Again, many thanks, Taku-san.


taku910 commented on May 5, 2024

I found that the hack I showed doesn't work, and the fix for your proposal is a little tricky especially for BPE segmentation.

BPE iteratively concatenates two frequent symbols to make a new symbol, which means that the two symbols before merging must both be valid tokens. So, suppressing only "_" won't work in BPE segmentation.
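The constraint can be seen in a toy BPE trainer. The sketch below is not SentencePiece's implementation: TrainToyBpe and Merge are hypothetical names, words are pre-split into single-character symbols, and "_" stands in for the whitespace marker. It records which two symbols formed each merge, so that a merged symbol such as "_t" visibly depends on "_" existing as a symbol of its own first.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Symbols = std::vector<std::string>;

// One recorded merge: the two existing symbols that were concatenated.
struct Merge { std::string left, right; };

// Toy BPE trainer (illustrative only): repeatedly merges the most frequent
// adjacent symbol pair and records the pair that produced each new symbol.
std::vector<Merge> TrainToyBpe(std::vector<Symbols> words, int num_merges) {
  std::vector<Merge> merges;
  for (int step = 0; step < num_merges; ++step) {
    // Count adjacent symbol pairs over all words.
    std::map<std::pair<std::string, std::string>, int> counts;
    for (const Symbols& w : words)
      for (std::size_t i = 0; i + 1 < w.size(); ++i) ++counts[{w[i], w[i + 1]}];
    if (counts.empty()) break;
    // Pick the most frequent pair (ties resolved by map order).
    auto best = counts.begin();
    for (auto it = counts.begin(); it != counts.end(); ++it)
      if (it->second > best->second) best = it;
    const auto [left, right] = best->first;
    merges.push_back({left, right});
    // Replace every occurrence of the pair with the merged symbol.
    for (Symbols& w : words) {
      Symbols merged;
      for (std::size_t i = 0; i < w.size(); ++i) {
        if (i + 1 < w.size() && w[i] == left && w[i + 1] == right) {
          merged.push_back(left + right);
          ++i;  // skip the right-hand symbol that was just consumed
        } else {
          merged.push_back(w[i]);
        }
      }
      w = std::move(merged);
    }
  }
  return merges;
}
```

For the words {"_", "t", "o"}, {"_", "t", "a"}, and {"_", "t", "i"}, the first merge joins "_" and "t". If "_" were removed from the symbol inventory, that merge would have nothing to start from.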

In unigram-based segmentation it is possible, but much of the code for vocab filtering is shared by BPE and unigram.

At this moment, we would prefer not to implement this feature, given the concern that the code would become too complicated. Thank you for your understanding.


feddyfedfed commented on May 5, 2024

No problem. At the moment, our workaround of assigning silence models to the special underscore seems to work fine. :) Thank you for considering, though.
