Comments (10)
- @vody-am the support for a fast SigLIP tokenizer is on its way, and should actually be pretty straightforward. huggingface/transformers#29969
@EricLBuehler we actually shipped this in transformers, but sure, I can have a look.
Most of the tokenizers that are supported in GGUF format should use a Metaspace pre-tokenizer and decoder, a BPE or Unigram model, and either no normalizer or a precompiled charsmap. All the requirements are in [convert_slow](https://github.com/huggingface/transformers/blob/8685b3c5d2dd2550527773d2a02499495a759e31/src/transformers/convert_slow_tokenizer.py#L56) in transformers.
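Concretely, that combination could be assembled with the tokenizers Rust crate along these lines. This is a minimal, hypothetical sketch: the `build_unigram_tokenizer` helper is made up for illustration, and the `Unigram::from`, `Metaspace`, and setter signatures have shifted between crate releases, so check the version you pin.

```rust
use tokenizers::models::unigram::Unigram;
use tokenizers::pre_tokenizers::metaspace::Metaspace;
use tokenizers::Tokenizer;

/// Build a SentencePiece-style Unigram tokenizer from (piece, score) pairs,
/// e.g. as read out of GGUF metadata.
fn build_unigram_tokenizer(
    vocab: Vec<(String, f64)>,
    unk_id: usize,
) -> tokenizers::Result<Tokenizer> {
    // Third argument enables byte fallback (present in recent crate versions).
    let model = Unigram::from(vocab, Some(unk_id), true)?;
    let mut tokenizer = Tokenizer::new(model);
    // Metaspace handles the SentencePiece "▁" space marker; the same struct
    // implements both the PreTokenizer and Decoder traits.
    tokenizer.with_pre_tokenizer(Metaspace::default());
    tokenizer.with_decoder(Metaspace::default());
    // No normalizer here; some models instead need a Precompiled normalizer
    // built from the precompiled charsmap.
    Ok(tokenizer)
}
```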
I'll think about potentially automatically converting sentencepiece.model files to Rust, but the big problem is that I don't want to have to support both sentencepiece and tiktoken, so it might just be example gists / snippets showing how to do this!
oh I also have an interest in reading sentencepiece tokenizers, in order to invoke the SigLIP text transformer in Rust!
EDIT: using the library mentioned by Eric above, I was able to load https://huggingface.co/google/siglip-so400m-patch14-384/blob/main/spiece.model and it seemingly tokenized my input!
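For reference, a sketch of the kind of thing that works here, using the community `sentencepiece` crate (Rust bindings to the official C++ library; not necessarily the same library as mentioned above):

```rust
use sentencepiece::SentencePieceProcessor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // spiece.model downloaded from google/siglip-so400m-patch14-384 on the Hub.
    let spp = SentencePieceProcessor::open("spiece.model")?;
    // Tokenize an example prompt.
    let pieces = spp.encode("a photo of a cat")?;
    for p in &pieces {
        println!("{}\t{}", p.id, p.piece);
    }
    Ok(())
}
```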
Thank you @ArthurZucker for the link! I was actually able to get the GPT2 conversion to work now!
You cannot load a tokenizer.model directly; you need to write a converter. This is because it does not come from the tokenizers library but from either tiktoken or sentencepiece, and there is no secret recipe: we need to adapt to the content of the file, which is not super straightforward.
https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L544 is the simplest way to understand the process!
Ok, I understand. Do you know of a way or a library to do this in Rust without reaching for the Python transformers converter?
A library, no, but we should be able to come up with a small bit of Rust code to do this!
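Something like this could be a starting point (a hypothetical sketch, nothing like this ships in tokenizers today): the .model file is a protobuf, so you can decode just the fields a converter needs with prost. The struct names below are made up; the field numbers follow sentencepiece's sentencepiece_model.proto.

```rust
use prost::Message;

/// Minimal hand-written mirror of sentencepiece's ModelProto, keeping only
/// what a converter needs: the pieces and their scores. Fields we don't
/// declare are skipped by prost during decoding.
#[derive(Clone, PartialEq, Message)]
struct SentencePiece {
    #[prost(string, tag = "1")]
    piece: String,
    #[prost(float, tag = "2")]
    score: f32,
}

#[derive(Clone, PartialEq, Message)]
struct ModelProto {
    #[prost(message, repeated, tag = "1")]
    pieces: Vec<SentencePiece>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bytes = std::fs::read("tokenizer.model")?;
    let model = ModelProto::decode(bytes.as_slice())?;
    // (piece, score) pairs in vocab-id order: exactly the input the
    // Unigram model in tokenizers expects.
    let vocab: Vec<(String, f64)> = model
        .pieces
        .iter()
        .map(|p| (p.piece.clone(), p.score as f64))
        .collect();
    println!("loaded {} pieces", vocab.len());
    Ok(())
}
```

Those (piece, score) pairs, fed into a Unigram model with the Metaspace wiring shown earlier in the thread, are essentially what the SpmConverter in transformers reproduces on the Python side.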
@ArthurZucker are there any specifications or example loaders which I can look at to implement this?
I also have the same question, for llava reasons!
Yes! Actually the best way to do this is to use the converters from transformers, see here: https://github.com/huggingface/transformers/blob/2965b204593df9d5652313386ec280ffbfd1753b/src/transformers/convert_slow_tokenizer.py#L1340. In Rust we would need to read and parse the .model file with a sentencepiece loader.
Ok. Could I use this crate?
One other question: I am implementing GGUF to HF tokenizers conversion in mistral.rs, and have had success with the unigram model. I am adding the gpt2 (i.e. BPE) model, but I was wondering which components of the Tokenizer are required, such as the normalizer, post-processor, etc., and also which decoder to use?
This is what I currently do: https://github.com/EricLBuehler/mistral.rs/blob/d66e5aff1e7faf208469c5bef3c70d45ffda5401/mistralrs-core/src/pipeline/gguf_tokenizer.rs#L116-L142. I would appreciate it if you could take a quick look and see if there is anything obviously wrong!
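For context, this is roughly the shape I have in mind for the gpt2 case: a BPE model with byte-level pre-tokenization and decoding, and no normalizer. A sketch with a hypothetical `build_gpt2_tokenizer` helper; builder and setter signatures vary a bit across tokenizers releases.

```rust
use std::collections::HashMap;

use tokenizers::models::bpe::BPE;
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::Tokenizer;

/// Build a GPT-2-style BPE tokenizer from a vocab and rank-ordered merges,
/// e.g. as read out of GGUF metadata.
fn build_gpt2_tokenizer(
    vocab: HashMap<String, u32>,     // token -> id
    merges: Vec<(String, String)>,   // merge pairs, in rank order
) -> tokenizers::Result<Tokenizer> {
    let bpe = BPE::builder().vocab_and_merges(vocab, merges).build()?;
    let mut tokenizer = Tokenizer::new(bpe);
    // ByteLevel handles GPT-2's byte-to-unicode mapping; the same struct
    // implements both the PreTokenizer and Decoder traits. GPT-2 usually
    // wants add_prefix_space = false rather than the default.
    tokenizer.with_pre_tokenizer(ByteLevel::default());
    tokenizer.with_decoder(ByteLevel::default());
    Ok(tokenizer)
}
```

GPT-2-style tokenizers also often add ByteLevel as the post-processor to trim offsets, though that should not affect the produced token ids.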