elixir-nx / tokenizers
Elixir bindings for 🤗 Tokenizers
Home Page: https://hexdocs.pm/tokenizers
License: Apache License 2.0
Hey,
I am using Bumblebee and I started to see this error sporadically (no error most of the time) from this dependency:
** (ErlangError) Erlang error: "Could not decode field :resource on %ExTokenizersTokenizer{}"
The stacktrace:
(tokenizers 0.3.2) Tokenizers.Native.encode_batch("%Tokenizers.Tokenizer{:resource => #Reference<90589.1590488966.808583178.129331>}", [binary], true)
lib/bumblebee/utils/tokenizers.ex:25 Bumblebee.Utils.Tokenizers.apply/4
lib/xxx/bumblebee_utils.ex:54 XXX.BumblebeeUtils.log_token_count_with_tokenizer/2
lib/task/supervised.ex:89 Task.Supervised.invoke_mfa/2
proc_lib.erl:240 :proc_lib.init_p_do_apply/3
Any idea what might be causing this?
I couldn't find anywhere in the docs whether or not the tokenizers are safe for concurrent use. In my own projects, I've wrapped the tokenizer I want to use in a simple GenServer with a handle_call that does the tokenizing, in order to serialize access, similar to how I would serialize writes to an ETS table. But I wanted to know whether the tokenizers themselves are safe for concurrent use or not.
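For reference, the serialization pattern described above can be sketched with a plain GenServer. The module and function names here are illustrative (not from the library), and the `String.split/1` stand-in would be replaced with a real call such as `Tokenizers.Tokenizer.encode/2` in an actual project:

```elixir
defmodule TokenizerServer do
  @moduledoc """
  Illustrative sketch: serializes access to a tokenize function by
  funneling all calls through a single GenServer process.
  """
  use GenServer

  # Client API
  def start_link(tokenize_fun),
    do: GenServer.start_link(__MODULE__, tokenize_fun, name: __MODULE__)

  def encode(text), do: GenServer.call(__MODULE__, {:encode, text})

  # Server callbacks
  @impl true
  def init(tokenize_fun), do: {:ok, tokenize_fun}

  @impl true
  def handle_call({:encode, text}, _from, tokenize_fun) do
    # All callers block here one at a time, so the underlying
    # tokenize function is never invoked concurrently.
    {:reply, tokenize_fun.(text), tokenize_fun}
  end
end

# Usage with a stand-in tokenize function:
{:ok, _pid} = TokenizerServer.start_link(fn text -> String.split(text) end)
TokenizerServer.encode("hello world")
#=> ["hello", "world"]
```

The trade-off is that the GenServer becomes a throughput bottleneck under load; a pool (e.g. one process per scheduler) would recover parallelism if the tokenizer turns out to be safe for concurrent use.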
This code
Bumblebee.Text.ClipTokenizer.token_to_id(tokenizer, "toy")
returns 14387
but this code
Bumblebee.Text.ClipTokenizer.apply(tokenizer, "toy", tokenizer_options)["input_ids"][0][1]
returns 5988.
5988 is the correct token id for "toy".
tokenizers/lib/tokenizers/tokenizer.ex, line 17 (commit 0190d0a)
Dialyzer is unhappy about the return values when getting a tokenizer, because the above type definition is incorrect. For example, if I build this for the "openai-gpt" tokenizer, the struct looks like:
{:ok,
#Tokenizers.Tokenizer<[
vocab_size: 40478,
continuing_subword_prefix: nil,
dropout: nil,
end_of_word_suffix: "</w>",
fuse_unk: false,
model_type: "bpe",
unk_token: "<unk>"
]>}
I'm trying to figure out how to use this library with OpenAI's cl100k_base tokenizer model, which, to my understanding, is what GPT-3.5 and GPT-4 use.
Do you have any advice on loading this model? I've found some reference libraries, but I haven't had success loading the model from a file. It seems necessary in order to get accurate token counts for use with OpenAI.
Thanks!
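For anyone else reading this thread, the library exposes `Tokenizers.Tokenizer.from_file/1` for loading a tokenizer from disk. One caveat: the upstream cl100k_base file ships in tiktoken's own BPE-ranks format, so a version converted to the Hugging Face `tokenizer.json` format would be needed first. The path below is a placeholder, and this sketch is untested:

```elixir
# Untested sketch. Assumes a cl100k_base tokenizer that has already
# been converted to the Hugging Face `tokenizer.json` format; the
# file path is a placeholder, not a real artifact of this library.
{:ok, tokenizer} = Tokenizers.Tokenizer.from_file("path/to/cl100k_base/tokenizer.json")
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "hello world")
ids = Tokenizers.Encoding.get_ids(encoding)
```

Note that token counts from a converted file should be spot-checked against OpenAI's own tiktoken output before relying on them for billing estimates.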
Hey,
I don't know Rust, but if I understand this correctly, the call to Tokenizers.PreTokenizer.split/3 is forwarded to the underlying Rust type tokenizers::pre_tokenizers::split::Split. Looking at its declaration, it seems to accept some kind of Regex as the value for the pattern argument. Is there a way to use regular expressions in Tokenizers.PreTokenizer.split/3?
Thanks,
Michael
THANK YOU to everyone who has worked hard to bring ML to Elixir!!!! Very exciting!!!!
This bug report may reflect user error, rather than an actual bug, because I'm an Elixir ML n00b still trying to understand all the cool stuff you all have built.
I think the error message I'm seeing occurs because I'm running Ubuntu 22.04.1, which uses OpenSSL 3.0.2 (I see: openssl/jammy-updates,jammy-security,now 3.0.2-0ubuntu1.7 amd64 [installed,automatic]), while tokenizers seems to assume an earlier OpenSSL version (see similar bug reports at https://stackoverflow.com/questions/72133316/ubuntu-22-04-libssl-so-1-1-cannot-open-shared-object-file-no-such-file-or-di). In case it is relevant, I installed Erlang, Elixir, and Rust via asdf.
...
Generated rustler_precompiled app
==> tokenizers
Compiling 6 files (.ex)
16:40:13.320 [debug] Copying NIF from cache and extracting to /home/marlus/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/90822cf8006a0b87e3694ec427cef054/_build/prod/lib/tokenizers/priv/native/libex_tokenizers-v0.1.2-nif-2.16-x86_64-unknown-linux-gnu.so
16:40:13.337 [warn] The on_load function for module Elixir.Tokenizers.Native returned:
{:error,
{:load_failed,
'Failed to load NIF library /home/marlus/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/90822cf8006a0b87e3694ec427cef054/_build/prod/lib/tokenizers/priv/native/libex_tokenizers-v0.1.2-nif-2.16-x86_64-unknown-linux-gnu: \'libssl.so.1.1: cannot open shared object file: No such file or directory\''}}
Generated tokenizers app
Again, thank you for building all this cool stuff.
Cheers,
James
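A common remedy for this class of libssl.so.1.1 errors with rustler_precompiled packages is to compile the NIF locally, so it links against the system's OpenSSL instead of using the precompiled binary. The `TOKENIZERS_BUILD` variable name below is an assumption based on the usual rustler_precompiled convention of a `<PACKAGE>_BUILD` flag; check the package README for the exact name. A Rust toolchain must be installed:

```shell
# Assumption: tokenizers follows the rustler_precompiled convention
# of forcing a from-source build via a <PACKAGE>_BUILD env var.
# Requires a Rust toolchain (e.g. via rustup or asdf).
TOKENIZERS_BUILD=true mix deps.compile tokenizers --force
```

This is a build configuration fragment rather than a standalone program; after recompiling, the `on_load` failure should disappear if the OpenSSL mismatch was the cause.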
I am wondering if (and how) this library can be used to train my own model. I have seen that the Hugging Face tokenizer has a train function. How can I use that function from this library? Sorry for asking; this is all pretty new to me.