Comments (6)
Thank you for the suggestion. I will implement this feature.
from sentencepiece.
Hi,
I've just implemented the vocabulary restriction feature.
The usage is basically the same as in subword-nmt.
% spm_encode --generate_vocabulary --model=spm.model < input.L1 > vocab.L1
% spm_encode --generate_vocabulary --model=spm.model < input.L2 > vocab.L2
% spm_encode --vocabulary=vocab.L1 --vocabulary_threshold=50 --model=spm.model < input > input.seg
% spm_encode --vocabulary=vocab.L2 --vocabulary_threshold=50 --model=spm.model < input > input.seg
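In case it helps, the restriction these flags apply can be sketched with a small hypothetical helper. The vocab-file layout assumed here (one `<piece>\t<frequency>` per line) is an assumption, not confirmed from the docs:

```python
# Hypothetical helper illustrating what --vocabulary together with
# --vocabulary_threshold restrict encoding to. The vocab-file format
# ("<piece>\t<frequency>" per line) is an assumption.
def load_restricted_vocab(lines, threshold=50):
    allowed = set()
    for line in lines:
        piece, freq = line.rstrip("\n").split("\t")
        if int(freq) >= threshold:  # keep only sufficiently frequent pieces
            allowed.add(piece)
    return allowed

# Pieces below the threshold are excluded, so the encoder has to segment
# them into smaller pieces that are still in the allowed set.
vocab = load_restricted_vocab(["\u2581the\t120", "\u2581rare\t3"])
```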
I will update the document soon.
That was quick, thanks!
I have one side question: normally spm_train generates a vocabulary file as well. Where is that file used in the pipeline? Once we encode our training set with an SPM model, we generally let the NMT toolkit create its own vocabulary file, right? So before this change, was the .vocab file there solely for convenience or statistics?
SentencePiece (spm_(encode|decode)) doesn't use the *.vocab file, as the same information is stored in the *.model file. The vocab file is emitted just for reference. An MT toolkit can load the vocab file to manage its vocabulary, and spm_encode can directly generate an id sequence, which bypasses the vocabulary management in the NMT toolkit.
One note: the *.vocab file doesn't contain all the unique tokens in the training corpus. spm_train first builds the valid character set so that character-level coverage reaches 99.95% (this can be configured with the --character_coverage=1.0 parameter). This workaround was introduced to handle CJK languages, which have large character sets and may contain many noisy symbols depending on the corpus. Rare characters in the corpus are treated as UNK and are not in the .vocab file.
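The coverage step described above can be sketched roughly as follows. This is an illustration in the spirit of --character_coverage, not SentencePiece's actual implementation:

```python
from collections import Counter

# Illustration (not SentencePiece's actual algorithm) of choosing a
# character set for a coverage target: keep the most frequent characters
# until they account for the requested fraction of all occurrences.
def choose_charset(corpus, coverage=0.9995):
    counts = Counter(ch for line in corpus for ch in line)
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, n in counts.most_common():
        if covered / total >= coverage:
            break  # the remaining rare characters would map to UNK
        kept.add(ch)
        covered += n
    return kept
```

With coverage=1.0 every character in the corpus is kept, which matches the recommendation to use --character_coverage=1.0 for languages with small alphabets.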
Updated the document.
https://github.com/google/sentencepiece/blob/master/README.md#vocabulary-restriction
Let me close this issue. Thank you!
Hi, in subword-nmt there are two types of tokens: a normal one (e.g. turn) and one that is part of a longer token (e.g. turnaround). With that, we need to handle not only normal tokens, say turn, but also turn@@, since the latter is used to segment a word like turnaround (e.g. turn@@ around), assuming that turnaround does not appear in the merge operations.
As a result, the true vocabulary collected from the encoded training dataset, which the NMT system has to handle, is larger than the number of merge operations in subword-nmt.
I noticed this doesn't seem to be the case with SentencePiece. Specifically, when I count the unique IDs in the encoded training dataset, the count equals the number of merge operations. So I am curious how you handle this (the @@ part), if you can share?
Thanks.
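For what it's worth, the difference I'm describing can be sketched with a small hypothetical conversion (not part of either toolkit): SentencePiece marks the start of a word with a "▁" prefix on a single piece, whereas subword-nmt marks continuation with an "@@" suffix, so the same surface form can correspond to two distinct subword-nmt entries:

```python
# Hypothetical illustration (not part of the SentencePiece API): convert
# SentencePiece-style pieces, where "\u2581" marks the START of a word, into
# subword-nmt-style tokens, where "@@" marks a piece that CONTINUES a word.
def spm_to_subword_nmt(pieces):
    out = []
    for i, piece in enumerate(pieces):
        token = piece.lstrip("\u2581")
        # subword-nmt appends "@@" iff the next piece continues the same word.
        next_continues = i + 1 < len(pieces) and not pieces[i + 1].startswith("\u2581")
        out.append(token + "@@" if next_continues else token)
    return out

# "turnaround" segmented as \u2581turn + around:
print(spm_to_subword_nmt(["\u2581turn", "around"]))  # -> ['turn@@', 'around']
# A standalone word keeps its surface form:
print(spm_to_subword_nmt(["\u2581turn"]))            # -> ['turn']
```

So where subword-nmt needs both turn and turn@@ in its vocabulary, SentencePiece distinguishes the two cases by the "▁" prefix alone, keeping the vocabulary equal to the set of model pieces.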