
tokenizers's Introduction

Tokenizers

Go bindings for the HuggingFace Tokenizers library.

Installation

Run make build to produce libtokenizers.a, the static library your application needs in order to use these bindings. You also need to tell the linker where to find it, either per invocation: go run -ldflags="-extldflags '-L./path/to/libtokenizers.a'" . or once via the CGO_LDFLAGS environment variable: CGO_LDFLAGS="-L./path/to/libtokenizers.a" to avoid specifying it every time.

Using pre-built binaries

If you don't want to install the Rust toolchain, build the library in Docker: docker build --platform=linux/amd64 -f release/Dockerfile . or use the prebuilt binaries from the releases page, which are available for several platforms.

Getting started

TLDR: working example.

Load a tokenizer from a JSON config:

import "github.com/daulet/tokenizers"

tk, err := tokenizers.FromFile("./data/bert-base-uncased.json")
if err != nil {
    return err
}
// release native resources
defer tk.Close()

Encode text and decode tokens:

fmt.Println("Vocab size:", tk.VocabSize())
// Vocab size: 30522
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", false))
// [2829 4419 14523 2058 1996 13971 3899] [brown fox jumps over the lazy dog]
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", true))
// [101 2829 4419 14523 2058 1996 13971 3899 102] [[CLS] brown fox jumps over the lazy dog [SEP]]
fmt.Println(tk.Decode([]uint32{2829, 4419, 14523, 2058, 1996, 13971, 3899}, true))
// brown fox jumps over the lazy dog

Benchmarks

go test . -run=^\$ -bench=. -benchmem -count=10 > test/benchmark/$(git rev-parse HEAD).txt

Decoding overhead (due to CGO and extra allocations) ranges from 2% to 9%, depending on the benchmark.

go test . -bench=. -benchmem -benchtime=10s

goos: darwin
goarch: arm64
pkg: github.com/daulet/tokenizers
BenchmarkEncodeNTimes-10     	  959494	     12622 ns/op	     232 B/op	      12 allocs/op
BenchmarkEncodeNChars-10      1000000000	     2.046 ns/op	       0 B/op	       0 allocs/op
BenchmarkDecodeNTimes-10     	 2758072	      4345 ns/op	      96 B/op	       3 allocs/op
BenchmarkDecodeNTokens-10    	18689725	     648.5 ns/op	       7 B/op	       0 allocs/op
PASS
ok   github.com/daulet/tokenizers 126.681s

Run equivalent Rust tests with cargo bench.

decode_n_times          time:   [3.9812 µs 3.9874 µs 3.9939 µs]
                        change: [-0.4103% -0.1338% +0.1275%] (p = 0.33 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

decode_n_tokens         time:   [651.72 ns 661.73 ns 675.78 ns]
                        change: [+0.3504% +2.0016% +3.5507%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Contributing

Please refer to CONTRIBUTING.md for information on how to contribute a PR to this project.

tokenizers's People

Contributors

clems4ever, daulet, jmoney, jpekmez, riccardopinosio


tokenizers's Issues

example/main.go run error

I finished make build, but when I run example/main.go, I get the following error:

➜  tokenizers git:(main) ✗ go run example/main.go
---  invalid argument
panic: invalid argument

goroutine 1 [running]:
main.main()
        /home/zhaopanpan/aigc/tokenizers/example/main.go:13 +0x396
exit status 2

And here is my example/main.go:

package main

import (
        "fmt"

        "github.com/daulet/tokenizers"
)

func main() {
        tk, err := tokenizers.FromFile("/home/xxx/aigc/tokenizers/test/data/cohere-tokenizer.json")
        if err != nil {
                fmt.Println("--- ", err)
                panic(err)
        }
        // release native resources
        defer tk.Close()
        fmt.Println("Vocab size:", tk.VocabSize())
        // Vocab size: 30522
        fmt.Println(tk.Encode("brown fox jumps over the lazy dog", false))
        // [2829 4419 14523 2058 1996 13971 3899] [brown fox jumps over the lazy dog]
        fmt.Println(tk.Encode("brown fox jumps over the lazy dog", true))
        // [101 2829 4419 14523 2058 1996 13971 3899 102] [[CLS] brown fox jumps over the lazy dog [SEP]]
        fmt.Println(tk.Decode([]uint32{2829, 4419, 14523, 2058, 1996, 13971, 3899}, true))
        // brown fox jumps over the lazy dog
}

memory issues when using tokenizers

As I utilize the tokenizer, I've observed a continuous rise in memory consumption.
Based on the discussions in golang/go#53440 and the insights in https://dgraph.io/blog/post/manual-memory-management-golang-jemalloc/, it appears the issue stems from glibc not returning freed memory to the operating system promptly.

Considering this, I'm curious: is there any possibility that tokenizers might be adapted to utilize alternative memory allocators like jemalloc or tcmalloc in the future?

Support for offset mapping?

Hey!
Thanks for this great library; it let us avoid installing the whole transformers library just to use the tokenizer!

I want to ask: how can I map the tokens I get from the HuggingFace DistilBertTokenizer back to positions in the input text?
e.g. I have a new GPU -> ["i", "have", "a", "new", "gp", "##u"] -> [(0, 1), (2, 6), ...]

I'm interested in this because, supposing I have attention values assigned to each token, I would like to show which part of the original text each token corresponds to, since the tokenized version is not friendly to non-ML people.

I have not found a solution to this. The library only supports the Encode and Decode methods. Any insights would be appreciated. Thank you!

Performance regression

We've regressed in benchmarks quite a bit from initial release.

benchstat benchmarks/3188ded27885d1002698a0e25f0b32306c430e88.txt benchmarks/$(git rev-parse HEAD).txt
goos: darwin
goarch: arm64
pkg: github.com/daulet/tokenizers
                 │ benchmarks/3188ded27885d1002698a0e25f0b32306c430e88.txt │ benchmarks/38a9a14c1c56b113461b0c7350c72de949e23cc2.txt │
                 │                         sec/op                          │             sec/op               vs base                │
EncodeNTimes-10                                               11.99µ ±  3%                     13.11µ ±   1%    +9.39% (p=0.002 n=6)
EncodeNChars-10                                               2.584n ±  8%                     2.989n ± 272%         ~ (p=0.485 n=6)
DecodeNTimes-10                                               1.701µ ±  3%                     4.535µ ±   2%  +166.66% (p=0.002 n=6)
DecodeNTokens-10                                              193.6n ± 10%                     656.1n ±   3%  +238.78% (p=0.002 n=6)
geomean                                                       317.8n                           584.3n          +83.86%

                 │ benchmarks/3188ded27885d1002698a0e25f0b32306c430e88.txt │ benchmarks/38a9a14c1c56b113461b0c7350c72de949e23cc2.txt │
                 │                          B/op                           │             B/op               vs base                  │
EncodeNTimes-10                                               84.00 ± 0%                       232.00 ± 0%  +176.19% (p=0.002 n=6)
EncodeNChars-10                                               0.000 ± 0%                        0.000 ± 0%         ~ (p=1.000 n=6) ¹
DecodeNTimes-10                                               96.00 ± 0%                        96.00 ± 0%         ~ (p=1.000 n=6) ¹
DecodeNTokens-10                                              7.000 ± 0%                        7.000 ± 0%         ~ (p=1.000 n=6) ¹
geomean                                                                  ²                                   +28.91%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                 │ benchmarks/3188ded27885d1002698a0e25f0b32306c430e88.txt │ benchmarks/38a9a14c1c56b113461b0c7350c72de949e23cc2.txt │
                 │                        allocs/op                        │           allocs/op            vs base                  │
EncodeNTimes-10                                               4.000 ± 0%                       12.000 ± 0%  +200.00% (p=0.002 n=6)
EncodeNChars-10                                               0.000 ± 0%                        0.000 ± 0%         ~ (p=1.000 n=6) ¹
DecodeNTimes-10                                               3.000 ± 0%                        3.000 ± 0%         ~ (p=1.000 n=6) ¹
DecodeNTokens-10                                              0.000 ± 0%                        0.000 ± 0%         ~ (p=1.000 n=6) ¹
geomean                                                                  ²                                   +31.61%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

segfault running example main.go

jaybinks@Jays-Mac-mini example % go run main.go
Vocab size: 30522
SIGSEGV: segmentation violation
PC=0x104359c24 m=0 sigcode=2
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x104313b74, 0x14000052ca8)
	/usr/local/go/src/runtime/cgocall.go:157 +0x44 fp=0x14000052c70 sp=0x14000052c30 pc=0x10428e354
github.com/daulet/tokenizers._Cfunc_encode(0x129904660, 0x600001794000, 0x0)
	_cgo_gotypes.go:132 +0x38 fp=0x14000052ca0 sp=0x14000052c70 pc=0x104312a08
github.com/daulet/tokenizers.(*Tokenizer).Encode.func2(0x12?, 0x1?, 0x80?)
	/Users/jaybinks/go/pkg/mod/github.com/daulet/[email protected]/tokenizer.go:60 +0x64 fp=0x14000052d10 sp=0x14000052ca0 pc=0x104313394
github.com/daulet/tokenizers.(*Tokenizer).Encode(0x104892ea8?, {0x10463a583?, 0x14000052f00?}, 0x2?)
	/Users/jaybinks/go/pkg/mod/github.com/daulet/[email protected]/tokenizer.go:60 +0x8c fp=0x14000052e40 sp=0x14000052d10 pc=0x10431302c
main.main()
	/Users/jaybinks/src/tokenizers/example/main.go:18 +0xd0 fp=0x14000052f30 sp=0x14000052e40 pc=0x1043137e0
runtime.main()
	/usr/local/go/src/runtime/proc.go:267 +0x2bc fp=0x14000052fd0 sp=0x14000052f30 pc=0x1042bde0c
runtime.goexit()
	/usr/local/go/src/runtime/asm_arm64.s:1197 +0x4 fp=0x14000052fd0 sp=0x14000052fd0 pc=0x1042e92e4

I've tried this on my clean Mac mini M2 as well as on an Intel Mac; both crash at line 18, on the first call to tk.Encode().

libtokenizers.a was built with "make build"
go version go1.21.1 darwin/arm64

invalid argument when running example/main.go

I compiled the project with make build and put the resulting libtokenizers.a in the root directory of the tokenizers project.

All the test cases in tokenizer_test.go pass when run via go test.

But when I run the example with go run main.go, the program reads libtokenizers.a correctly but reports an error:

panic: invalid argument

goroutine 1 [running]:
main.main()
        /search/odin/liliang/tokenizers-0.7.1/example/main.go:12 +0x2f2
exit status 2

I've tried master, v0.7.1, v0.8.0, and v0.60.0; they all have the same problem.

My Go version is go1.22.0 linux/amd64.

panic: invalid argument

The issue is not resolved, so I am creating it one more time. I tried Docker and make build, but no matter what I do, the issue remains.

I finished make build, but when I test example/main.go, the following error occurs:

➜ tokenizers git:(main) ✗ go run example/main.go
--- invalid argument
panic: invalid argument

goroutine 1 [running]:
main.main()
/home/zhaopanpan/aigc/tokenizers/example/main.go:13 +0x396
exit status 2

Thread safety issue

Hi, very happy to see encodeWithOptions in main, but I've just spotted that it has made tokenizers non-thread-safe. If I call Encode on multiple goroutines simultaneously, I now get the following:

[signal SIGSEGV: segmentation violation code=0x2 addr=0x7fd978021000 pc=0x10545f6]

goroutine 190 [running]:
runtime.throw({0x199ae0a?, 0x4ac2000?})
/usr/local/go/src/runtime/panic.go:1077 +0x5c fp=0xc00c5e2f00 sp=0xc00c5e2ed0 pc=0x45eedc
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:875 +0x285 fp=0xc00c5e2f60 sp=0xc00c5e2f00 pc=0x4752e5
github.com/daulet/tokenizers.uintVecToSlice(...)
/go/pkg/mod/github.com/!r!j!keevil/[email protected]/tokenizer.go:79
github.com/daulet/tokenizers.(*Tokenizer).EncodeWithOptions(0x450c80?, {0xc002da8960?, 0x1?}, 0x1, {0xc00c5e3250, 0x2, 0x43f588?})
/go/pkg/mod/github.com/!r!j!keevil/[email protected]/tokenizer.go:163 +0x2f6 fp=0xc00c5e3148 sp=0xc00c5e2f60 pc=0x10545f6
