Comments (6)

taku910 commented on May 5, 2024

Thank you for the suggestion. I will implement this feature.

taku910 commented on May 5, 2024

Hi,

I've just implemented the vocabulary restriction feature.
The usage is basically the same as in subword-nmt.

% spm_encode --generate_vocabulary --model=spm.model < input.L1 > vocab.L1
% spm_encode --generate_vocabulary --model=spm.model < input.L2 > vocab.L2
% spm_encode --vocabulary=vocab.L1 --vocabulary_threshold=50 --model=spm.model < input.L1 > input.L1.seg
% spm_encode --vocabulary=vocab.L2 --vocabulary_threshold=50 --model=spm.model < input.L2 > input.L2.seg

I will update the document soon.
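
For those using the Python bindings instead of the command line, the same restriction can be applied in code. Here is a minimal sketch, assuming the processor's LoadVocabulary method in the standard Python bindings (spm.model and vocab.L1 are the files from the commands above):

import sentencepiece as spm

# Load the trained model (produced by spm_train).
sp = spm.SentencePieceProcessor()
sp.Load("spm.model")

# Restrict the active vocabulary: pieces with a frequency below 50 in
# vocab.L1 are disabled and get re-segmented into smaller pieces.
sp.LoadVocabulary("vocab.L1", 50)

# Encoding now only produces pieces that survived the restriction.
print(sp.EncodeAsPieces("an example sentence"))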

ozancaglayan commented on May 5, 2024

That was quick, thanks!

I have one side question: normally, spm_train generates a vocabulary file as well. Where is that file used in the pipeline? Once we encode our training set with an SPM model, what we generally do is let the NMT toolkit create its own vocabulary file, right? So before the above modification, was that .vocab file there solely for convenience or statistics?

taku910 commented on May 5, 2024

SentencePiece (spm_(encode|decode)) doesn't use the *.vocab file, as the same information is stored in the *.model file. The vocab file is emitted just for reference. An MT toolkit can load the vocab file to manage its vocabulary, but spm_encode can also generate the id sequence directly, which bypasses vocabulary management in the NMT toolkit.
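
As a rough illustration of that id-based path, a minimal sketch with the Python bindings (spm.model is a placeholder model file):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")

# Ids come straight from the model, so no external .vocab lookup is
# needed and the NMT toolkit can consume them as-is.
ids = sp.EncodeAsIds("an example sentence")
print(ids)

# Decoding likewise works without the .vocab file.
print(sp.DecodeIds(ids))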

One note: the *.vocab file doesn't contain all of the unique tokens in the training corpus. spm_train first determines the valid character set so that character-level coverage reaches 99.95% (this can be configured with the --character_coverage parameter, e.g. --character_coverage=1.0). This workaround was introduced to handle CJK languages, whose corpora have very large character inventories and may contain many noisy symbols. Rare characters in the corpus are treated as UNK and do not appear in the *.vocab file.
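
For example, full coverage can be requested at training time; a minimal sketch using the Python trainer (corpus.txt and the vocab size are placeholders):

import sentencepiece as spm

# --character_coverage=1.0 keeps every character seen in the corpus, at
# the cost of spending vocabulary slots on rare or noisy symbols.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=spm "
    "--vocab_size=8000 --character_coverage=1.0"
)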

taku910 commented on May 5, 2024

Updated the document.
https://github.com/google/sentencepiece/blob/master/README.md#vocabulary-restriction

Let me close this issue. Thank you!

hoangcuong2011 commented on May 5, 2024

@taku910:

Hi. In subword-nmt there are two types of tokens: a normal one (e.g. turn) and one that is part of a longer token (e.g. turnaround). Because of this, we need to handle not only normal tokens such as turn, but also turn@@, since the latter is used to segment a word like turnaround (e.g. turn@@ around) when turnaround itself does not appear in the merge operations.

As a result, the size of the true vocabulary collected from the encoded training data, which the NMT system must handle, is larger than the number of merge operations in subword-nmt.

I noticed this does not seem to be the case with SentencePiece. Specifically, when I count the unique IDs in the encoded training data, the count is the same as the number of merge operations. So I am curious how you handle this (the @@ part), if you can share?

Thanks.
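
A minimal sketch of the count described above, assuming input.seg holds one whitespace-separated token (or id) sequence per line:

# Count the distinct tokens/ids that occur in an encoded corpus. With
# subword-nmt, "turn" and "turn@@" count separately, so the total can
# exceed the number of merge operations; with SentencePiece it is
# bounded by the model's vocabulary size.
unique = set()
with open("input.seg") as f:
    for line in f:
        unique.update(line.split())
print(len(unique))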
