
Comments (6)

taku910 commented on May 5, 2024

Thank you for the explanation.

There is a code block that filters out invalid pieces here:
https://github.com/google/sentencepiece/blob/master/src/trainer_interface.cc#L107

You can add the following if block:

if (*it == kWSChar && sentencepiece.size() == 1) {
  return false;
}

Then the isolated "_" will not be handled as a piece. Note that in this case, if the input sentence contains "foo_bar" and "b" is OOV, the output may include an independent "_", which is handled as an unknown symbol.
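To make the proposed check concrete, here is a minimal, self-contained sketch of such a filter. It is not the code in trainer_interface.cc: the function name `IsValidPiece`, the use of `std::u32string`, and the value of `kWSChar` (assumed here to be U+2581, the whitespace marker shown as "_" in this thread) are assumptions for illustration only. Note that later in this thread the hack is reported not to work as-is.

```cpp
#include <string>

// Assumption: the whitespace marker is U+2581 (shown as "_" in this thread).
constexpr char32_t kWSChar = U'\u2581';

// Hypothetical, simplified piece filter. It only illustrates the rule
// proposed above and does not reproduce SentencePiece's real checks.
bool IsValidPiece(const std::u32string& sentencepiece) {
  // An empty piece is never valid.
  if (sentencepiece.empty()) {
    return false;
  }
  for (auto it = sentencepiece.begin(); it != sentencepiece.end(); ++it) {
    // Proposed rule: a piece that consists of the marker alone is rejected.
    if (*it == kWSChar && sentencepiece.size() == 1) {
      return false;
    }
  }
  return true;
}
```

With this sketch, the marker on its own is rejected, while a piece that merely starts with the marker (marker followed by "foo") is still accepted.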

We will add a flag to enable this behavior.

from sentencepiece.

taku910 commented on May 5, 2024

This is technically possible, especially in Latin-based languages, but I would like to know why this behavior is necessary.

Another concern is that in languages without explicit word boundaries (Chinese/Japanese), the "_" symbol rarely appears, so it is natural to handle it as one independent symbol.


feddyfedfed commented on May 5, 2024

We are using the segmented text data for a speech recognition task, and in order to honor the language model probabilities, we need to include the symbol in the vocabulary. We use silence as the symbol's pronunciation so that our search can still include it, but we also wish to experiment with not having it. In any case, if it's going to be too laborious, we can just manually edit our segmented texts. (But I realized that simply editing the segmented texts would result in a change in the vocabulary...)

On a more general note, what is the canonical way of telling SentencePiece NOT to segment between certain character combinations? For example, in Japanese, a contrived example for glides: キャラクター will be segmented into キ ャラクター (again, this is just a contrived example, but it happens with other words). For a speech recognition task this is problematic, since we have models for the complete phoneme "ky".


feddyfedfed commented on May 5, 2024

Again, many thanks Takuさん.


taku910 commented on May 5, 2024

I found that the hack I showed doesn't work, and the fix for your proposal is a little tricky, especially for BPE segmentation.

BPE iteratively concatenates two frequent symbols to make a new symbol, which means that the two symbols before merging must both be valid tokens. So, suppressing only "_" won't work in BPE segmentation.

In unigram-based segmentation it is possible, but much of the code for vocab filtering is shared by BPE and unigram.

At this moment, we would prefer not to implement this feature, given the concern that the code would become too complicated. Thank you for your understanding.
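The BPE constraint can be illustrated with a toy trainer. This is only a sketch under stated assumptions, not SentencePiece's implementation: `TrainToyBpe`, the `banned` parameter, and the use of an ASCII "_" as the marker are all hypothetical. A symbol listed in `banned` is treated as not being a valid token, so it can never be an input to a merge.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Symbols = std::vector<std::string>;

// Toy BPE trainer (hypothetical). Starts from the symbols in `seq` and
// performs up to `num_merges` merges of the most frequent adjacent pair.
// A pair is only a merge candidate when both of its symbols are in the
// vocabulary; symbols in `banned` are kept out of the vocabulary.
std::set<std::string> TrainToyBpe(Symbols seq, int num_merges,
                                  const std::set<std::string>& banned) {
  std::set<std::string> vocab;
  for (const auto& s : seq) {
    if (banned.count(s) == 0) vocab.insert(s);
  }
  for (int m = 0; m < num_merges; ++m) {
    // Count adjacent pairs whose two sides are both valid tokens.
    std::map<std::pair<std::string, std::string>, int> freq;
    for (std::size_t i = 0; i + 1 < seq.size(); ++i) {
      if (vocab.count(seq[i]) > 0 && vocab.count(seq[i + 1]) > 0) {
        ++freq[{seq[i], seq[i + 1]}];
      }
    }
    if (freq.empty()) break;  // No valid pair is left to merge.
    auto best = freq.begin();
    for (auto it = freq.begin(); it != freq.end(); ++it) {
      if (it->second > best->second) best = it;
    }
    const std::pair<std::string, std::string> top = best->first;
    // Rewrite the sequence, replacing each occurrence of the best pair.
    Symbols merged;
    for (std::size_t i = 0; i < seq.size(); ++i) {
      if (i + 1 < seq.size() && seq[i] == top.first &&
          seq[i + 1] == top.second) {
        merged.push_back(top.first + top.second);
        ++i;  // Skip the right-hand symbol that was just consumed.
      } else {
        merged.push_back(seq[i]);
      }
    }
    seq = std::move(merged);
    vocab.insert(top.first + top.second);
  }
  return vocab;
}
```

On the sequence `_ a b _ a b _ a b` with two merges, the unrestricted run can build "_a" and then "_ab". With "_" banned, only "ab" is created, and no piece that starts with "_" can ever appear, because every such piece would need "_" as a merge input.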


feddyfedfed commented on May 5, 2024

No problem. At the moment, our workaround of assigning silence models to the special underscore seems to work fine. :) Thank you for considering, though.

