Comments (6)
Thank you for the explanation.
The code block that filters invalid pieces is here:
https://github.com/google/sentencepiece/blob/master/src/trainer_interface.cc#L107
You can add the following if-then block:
    if (*it == kWSChar && sentencepiece.size() == 1) {
      return false;
    }
Then, the isolated "_" will not be handled as a piece. Note that in this case, if the input sentence contains "foo_bar" and "b" is OOV, the output may include an independent "_", which is handled as an unknown symbol.
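To make the suggested check concrete, here is a minimal standalone sketch of such a filter. It is not the actual code in trainer_interface.cc: the function name `IsValidPieceSketch` is hypothetical, and the value U+2581 for `kWSChar` (shown as "_" in this thread) is an assumption.

```cpp
#include <string>

// Assumed stand-in for the whitespace marker (U+2581), shown as "_" in this thread.
constexpr char32_t kWSChar = U'\u2581';

// Hypothetical filter, for illustration only.
// Returns false for an empty piece or for a piece that is only the marker.
bool IsValidPieceSketch(const std::u32string& piece) {
  if (piece.empty()) return false;
  for (auto it = piece.begin(); it != piece.end(); ++it) {
    // Reject the isolated whitespace marker, mirroring the if-then block above.
    if (*it == kWSChar && piece.size() == 1) return false;
  }
  return true;
}
```

With this check, a piece such as the marker followed by "foo" is still accepted; only the marker on its own is rejected.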
We will add a flag to enable this behavior.
from sentencepiece.
This is technically possible, especially in Latin-based languages, but I would like to know why this behavior is necessary.
Another concern is that in languages without explicit word boundaries (Chinese/Japanese), the "_" symbol rarely appears, so it is natural to handle it as one independent symbol.
from sentencepiece.
We are using the segmented text data for a speech recognition task, and in order to honor the language model probabilities, we need to incorporate the symbol in the vocabulary. We use silence as the symbol's pronunciation so that our search can still include it, but we also wish to experiment with not having it. In any case, if it's going to be too laborious, we can just manually edit our segmented texts. (But I realized that simply editing the segmented texts would result in a change in the vocabulary...)
On a more general note, what is the canonical way of telling SentencePiece NOT to segment between certain character combinations? For example, in Japanese, a contrived example for glides: キャラクター will be segmented into キ ャラクター (again, this is just a contrived example, but it happens with other words). For a speech recognition task this is problematic, since we have models for the complete phoneme "ky".
from sentencepiece.
Again, many thanks Takuさん.
from sentencepiece.
I found that the hack I showed doesn't work, and the fix for your proposal is a little tricky, especially for BPE segmentation.
BPE iteratively concatenates two frequent symbols to make a new symbol, which means that both symbols must be valid tokens before merging. So, suppressing only "_" won't work in BPE segmentation.
In unigram-based segmentation it is possible, but much of the code for vocab filtering is shared by BPE and unigram.
At this moment, we would rather not implement this feature, given the concern that the code would become too complicated. Thank you for your understanding.
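The constraint can be illustrated with a toy model of merging. This is only a sketch under simplifying assumptions, not SentencePiece's BPE trainer: the `Merge` alias and the `ReachableSymbols` function are hypothetical names, and merge rules are given up front instead of being learned from frequencies.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// A merge rule: (left symbol, right symbol) -> left + right.
using Merge = std::pair<std::string, std::string>;

// Applies merge rules in order and returns every symbol that can be built.
// A rule only fires when both of its inputs already exist as symbols.
std::set<std::string> ReachableSymbols(std::set<std::string> symbols,
                                       const std::vector<Merge>& merges) {
  for (const auto& m : merges) {
    if (symbols.count(m.first) && symbols.count(m.second)) {
      symbols.insert(m.first + m.second);
    }
  }
  return symbols;
}
```

Starting from {"_", "a", "b"} with the rules ("_", "a") and ("_a", "b"), the symbol "_ab" is reachable. Starting from {"a", "b"}, with "_" suppressed, neither "_a" nor "_ab" can be built, which is why dropping the isolated "_" also removes every piece that would be built from it.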
from sentencepiece.
No problem. At the moment, our workaround of assigning silence models to the special underscore seems to work fine. :) Thank you for considering, though.
from sentencepiece.