sugarme / tokenizer
NLP tokenizers written in Go language
License: Apache License 2.0
I got the error panic: concurrent map writes in the BPE TokenizeWithCache func. Concurrent read and write operations on the map can lead to a panic.
func (b BPE) TokenizeWithCache(sequence string) (retVal []tokenizer.Token) {
	if hit, ok := b.Cache.cmap[sequence]; ok {
		return b.WordToTokens(hit)
	} else {
		word := b.MergeWord(sequence)
		retVal = b.WordToTokens(*word)
		if b.Cache != nil {
			b.Cache.SetValues([]CacheItem{
				{sequence, *word},
			})
		}
		return retVal
	}
}
Please check~
Hi, thanks for this lib!
I found that a log.Print is used at init: https://github.com/sugarme/tokenizer/blob/master/init.go#L21, which I can't avoid.
My application uses stdout & stderr to communicate. Would you mind removing this log print and moving to slog for finer log-level control?
Thanks.
The old versions have the "file not found" bug when trying to use pretrained tokenizers.
/home/gopath/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:875:22: cannot use 2 * gb (untyped int constant 2147483648) as int value in argument to scanner.Buffer (overflows)
Hi,
First of all, thanks for this great package! I am running inference against a triton server serving transformer models from go, and this library is a tremendous help.
One issue I couldn't figure out from the examples or the code is how to make the BPE tokenizer output encoded BOS and EOS tokens (i.e. < s > and < / s >). I checked that those tokens are part of my vocab.json but it seems they get ignored. I tried manually adding them to the tokenizer as special tokens, tried wrapping my input sentence in "< s > ... < / s >" manually, but I can't seem to get it to work. What am I missing?
Cheers!
edit: changed formatting for < s > so markdown doesn't eat them.
missing mutex lock at line 497
Lines 495 to 501 in 7796975
goroutine 1 [IO wait]:
internal/poll.runtime_pollWait(0xa040e68, 0x72)
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/netpoll.go:229 +0x89
internal/poll.(*pollDesc).wait(0xc000448480, 0x4, 0x0)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000448480)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_unix.go:402 +0x22c
net.(*netFD).accept(0xc000448480)
/usr/local/Cellar/go/1.17.2/libexec/src/net/fd_unix.go:173 +0x35
net.(*TCPListener).accept(0xc0005b3248)
/usr/local/Cellar/go/1.17.2/libexec/src/net/tcpsock_posix.go:140 +0x28
net.(*TCPListener).Accept(0xc0005b3248)
/usr/local/Cellar/go/1.17.2/libexec/src/net/tcpsock.go:262 +0x3d
google.golang.org/grpc.(*Server).Serve(0xc0004ae8c0, {0x1d59fe0, 0xc0005b3248})
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:786 +0x362
main.Listen()
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/listen.go:93 +0x79f
main.glob..func1(0x2407a80, {0x1b9b1e9, 0x0, 0x0})
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/listen.go:41 +0xa5
github.com/spf13/cobra.(*Command).execute(0x2407a80, {0x24497a0, 0x0, 0x0})
/Users/ivosights/go/pkg/mod/github.com/spf13/[email protected]/command.go:860 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0x2407d00)
/Users/ivosights/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
/Users/ivosights/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
main.main()
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/main.go:4 +0x25
goroutine 5 [select]:
database/sql.(*DB).connectionOpener(0xc0001f5ba0, {0x1d5fb38, 0xc0004db7c0})
/usr/local/Cellar/go/1.17.2/libexec/src/database/sql/sql.go:1196 +0x93
created by database/sql.OpenDB
/usr/local/Cellar/go/1.17.2/libexec/src/database/sql/sql.go:794 +0x188
goroutine 50 [select]:
github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher.func1()
/Users/ivosights/go/pkg/mod/github.com/go-sql-driver/[email protected]/connection.go:614 +0xb0
created by github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher
/Users/ivosights/go/pkg/mod/github.com/go-sql-driver/[email protected]/connection.go:611 +0x105
goroutine 35 [select]:
database/sql.(*DB).connectionCleaner(0xc0001f5ba0, 0x0)
/usr/local/Cellar/go/1.17.2/libexec/src/database/sql/sql.go:1068 +0xbd
created by database/sql.(*DB).startCleanerLocked
/usr/local/Cellar/go/1.17.2/libexec/src/database/sql/sql.go:1055 +0x105
goroutine 6 [sleep]:
time.Sleep(0x45d964b800)
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/time.go:193 +0x12e
sentiment/usecase.(*CustomAfinnService).Listen(0x0, {0x1d5fb70, 0xc00003c098})
/Users/ivosights/Project/service-prototype/src/sentiment/usecase/sentiment_afinn_service.go:35 +0x34
main.Listen.func1()
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/listen.go:68 +0x25
created by main.Listen
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/listen.go:67 +0x245
goroutine 7 [select]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc000138b40)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/balancer_conn_wrappers.go:69 +0x95
created by google.golang.org/grpc.newCCBalancerWrapper
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/balancer_conn_wrappers.go:60 +0x1d5
goroutine 8 [chan receive]:
google.golang.org/grpc.(*addrConn).resetTransport(0xc0001362c0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:1214 +0x48f
created by google.golang.org/grpc.(*addrConn).connect
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:844 +0x147
goroutine 40 [IO wait]:
internal/poll.runtime_pollWait(0xa040d80, 0x72)
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/netpoll.go:229 +0x89
internal/poll.(*pollDesc).wait(0xc00009e180, 0xc000ee0000, 0x0)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00009e180, {0xc000ee0000, 0x8000, 0x8000})
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00009e180, {0xc000ee0000, 0x80100000000, 0x4})
/usr/local/Cellar/go/1.17.2/libexec/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0000a0018, {0xc000ee0000, 0x100e794, 0x18})
/usr/local/Cellar/go/1.17.2/libexec/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc000e4a3c0, {0xc0004a2038, 0x9, 0x14bc0c5})
/usr/local/Cellar/go/1.17.2/libexec/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x1d48340, 0xc000e4a3c0}, {0xc0004a2038, 0x9, 0x9}, 0x9)
/usr/local/Cellar/go/1.17.2/libexec/src/io/io.go:328 +0x9a
io.ReadFull(...)
/usr/local/Cellar/go/1.17.2/libexec/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0004a2038, 0x9, 0xc002013190}, {0x1d48340, 0xc000e4a3c0})
/Users/ivosights/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0004a2000)
/Users/ivosights/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:492 +0x95
google.golang.org/grpc/internal/transport.(*http2Client).reader(0xc00000c3c0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:1347 +0x414
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:346 +0x174f
goroutine 41 [select]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc000eda1e0, 0x1)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:407 +0x11b
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc000e4a480)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:527 +0x85
google.golang.org/grpc/internal/transport.newHTTP2Client.func3()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:396 +0x65
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:394 +0x1da5
goroutine 28 [runnable]:
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932
created by google.golang.org/grpc.(*Server).serveStreams.func1
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932 +0x294
goroutine 29 [runnable]:
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932
created by google.golang.org/grpc.(*Server).serveStreams.func1
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932 +0x294
goroutine 27 [runnable]:
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932
created by google.golang.org/grpc.(*Server).serveStreams.func1
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932 +0x294
goroutine 42 [runnable]:
github.com/emirpasic/gods/trees/binaryheap.(*Heap).bubbleDownIndex(0xc0003b4530, 0x0)
/Users/ivosights/go/pkg/mod/github.com/emirpasic/[email protected]/trees/binaryheap/binaryheap.go:123 +0x2f5
github.com/emirpasic/gods/trees/binaryheap.(*Heap).bubbleDown(...)
/Users/ivosights/go/pkg/mod/github.com/emirpasic/[email protected]/trees/binaryheap/binaryheap.go:118
github.com/emirpasic/gods/trees/binaryheap.(*Heap).Pop(0xc0003b4530)
/Users/ivosights/go/pkg/mod/github.com/emirpasic/[email protected]/trees/binaryheap/binaryheap.go:74 +0x288
github.com/sugarme/tokenizer/model/bpe.(*Word).MergeAll(0xc001b81200, 0x15, {0x0, 0x1, 0x2})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/word.go:314 +0x296
github.com/sugarme/tokenizer/model/bpe.(*BPE).MergeWord(0xc000867260, {0xc000c349e0, 0xc000c349e0})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/bpe.go:433 +0x3af
github.com/sugarme/tokenizer/model/bpe.BPE.TokenizeWithCache({0xc000010058, 0xc000010048, 0xc000010060, 0xc000fc7860, 0x0, 0x0, 0x0, 0x0}, {0xc000c349e0, 0x6})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/bpe.go:500 +0xaa
github.com/sugarme/tokenizer/model/bpe.BPE.Tokenize({0xc000010058, 0xc000010048, 0xc000010060, 0xc000fc7860, 0x0, 0x0, 0x0, 0x0}, {0xc000c349e0, 0x6})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/bpe.go:487 +0xc7
github.com/sugarme/tokenizer.(*Tokenizer).doTokenize.func1(0x1af08c0)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:474 +0x31
github.com/sugarme/tokenizer.(*PreTokenizedString).Tokenize(0xc00354a6f0, 0xc000867450)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/pretokenizer.go:123 +0xda
github.com/sugarme/tokenizer.(*Tokenizer).doTokenize(0xc00027a128, 0xc000888000, 0x2445ba0, 0xc00024ed90, 0x0)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:473 +0x47
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingleSequence.func1(0x0, 0x0, {0xc000888000, 0x2000000000351})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:315 +0xba
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingleSequence(0x1a720e0, {{0xc0003b4370, 0xc0008677d0, 0xc000867798}, 0x100e794}, 0xc0008677a8, 0x20)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:334 +0xd4
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x1ad5500, {0x1ad5500, 0xc0003a63c0}, 0x2)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:359 +0x270
sentiment/pkg/sentimentmodule.(*Roberta).Tokenize(0xc000521560, {0xc00354ef00, 0x1ba8622})
/Users/ivosights/Project/service-prototype/src/sentiment/pkg/sentimentmodule/roberta.go:97 +0xc8
sentiment/interface/grpc.(*sentimentServer).Predict(0xc001a640c0, {0x1d5fbe0, 0xc00354a660}, 0x0)
/Users/ivosights/Project/service-prototype/src/sentiment/interface/grpc/sentiment.go:60 +0x2a5
sentiment/proto._Sentiment_Predict_Handler.func1({0x1d5fbe0, 0xc00354a660}, {0x1b42200, 0xc0004ce550})
/Users/ivosights/Project/service-prototype/src/sentiment/proto/sentiment_grpc.pb.go:82 +0x78
github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1({0x1d5fbe0, 0xc00354a570}, {0x1b42200, 0xc0004ce550}, 0xc0003a62e0, 0xc002a1c1f8)
/Users/ivosights/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/logging/logrus/server_interceptors.go:31 +0x102
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x1d5fbe0, 0xc00354a570}, {0x1b42200, 0xc0004ce550})
/Users/ivosights/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x1d5fbe0, 0xc00354a570}, {0x1b42200, 0xc0004ce550}, 0xc0033f4bb8, 0x1ab2c20)
/Users/ivosights/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34 +0xbf
sentiment/proto._Sentiment_Predict_Handler({0x1abb480, 0xc001a640c0}, {0x1d5fbe0, 0xc00354a570}, 0xc00348c2a0, 0xc000d95e30)
/Users/ivosights/Project/service-prototype/src/sentiment/proto/sentiment_grpc.pb.go:84 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004ae8c0, {0x1d6ffa0, 0xc002cc8300}, 0xc0001f6900, 0xc001a64150, 0x23fa9f0, 0x0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:1286 +0xc8f
google.golang.org/grpc.(*Server).handleStream(0xc0004ae8c0, {0x1d6ffa0, 0xc002cc8300}, 0xc0001f6900, 0x0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:1609 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:934 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932 +0x294
goroutine 74 [runnable]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc002cc6190, 0x1)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:407 +0x11b
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc0017ba000)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:527 +0x85
google.golang.org/grpc/internal/transport.newHTTP2Server.func2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:292 +0xc6
created by google.golang.org/grpc/internal/transport.newHTTP2Server
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:289 +0x148f
goroutine 75 [select]:
google.golang.org/grpc/internal/transport.(*http2Server).keepalive(0xc002cc8300)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:993 +0x259
created by google.golang.org/grpc/internal/transport.newHTTP2Server
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:300 +0x14d7
goroutine 76 [runnable]:
syscall.syscall(0x10b64a0, 0xb, 0xc0034ea000, 0x8000)
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/sys_darwin.go:22 +0x3b
syscall.read(0xc00016a000, {0xc0034ea000, 0xc000b87c00, 0x10188a6})
/usr/local/Cellar/go/1.17.2/libexec/src/syscall/zsyscall_darwin_amd64.go:1171 +0x49
syscall.Read(...)
/usr/local/Cellar/go/1.17.2/libexec/src/syscall/syscall_unix.go:189
internal/poll.ignoringEINTRIO(...)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_unix.go:582
internal/poll.(*FD).Read(0xc00016a000, {0xc0034ea000, 0x8000, 0x8000})
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_unix.go:163 +0x285
net.(*netFD).Read(0xc00016a000, {0xc0034ea000, 0x18, 0x18})
/usr/local/Cellar/go/1.17.2/libexec/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0000a0010, {0xc0034ea000, 0x104e6b4, 0x18})
/usr/local/Cellar/go/1.17.2/libexec/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc000efa3c0, {0xc0001622d8, 0x9, 0x1af32a0})
/usr/local/Cellar/go/1.17.2/libexec/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x1d48340, 0xc000efa3c0}, {0xc0001622d8, 0x9, 0x9}, 0x9)
/usr/local/Cellar/go/1.17.2/libexec/src/io/io.go:328 +0x9a
io.ReadFull(...)
/usr/local/Cellar/go/1.17.2/libexec/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0001622d8, 0x9, 0x0}, {0x1d48340, 0xc000efa3c0})
/Users/ivosights/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0001622a0)
/Users/ivosights/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:492 +0x95
google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc002cc8300, 0x0, 0x0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:473 +0xb2
google.golang.org/grpc.(*Server).serveStreams(0xc0004ae8c0, {0x1d6ffa0, 0xc002cc8300})
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:918 +0x142
google.golang.org/grpc.(*Server).handleRawConn.func1()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:868 +0x46
created by google.golang.org/grpc.(*Server).handleRawConn
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:867 +0x472
exit status 2
Problem:
Encoding APIs EncodeSingle, EncodePair and Tokenize are missing an addSpecialTokens param.
Solution:
Add an optional param (default = false) to these methods.
For consistency, check and change to pointer receiver methods at sub packages:
A Tokenizer configured with TruncationParams and PaddingParams does not work properly. Also, the AttentionMask and SpecialTokenMask are missing or contain incorrect values.
The HF tokenization implementation in Rust uses parallelization when possible to achieve high tokenization performance. Does this library implement the same kind of logic, or would it be more performant to expose the Rust bindings through C and import them into Go via cgo?
I encountered a panic while encoding some documents. Unfortunately I can't provide the documents, as they are private.
After a quick look, it seems that pairEncoding in util.go:108 is nil, so the GetIds() call fails.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7a8854]
goroutine 295 [running]:
github.com/sugarme/tokenizer.(*Encoding).GetIds(...)
/home/superman/go/pkg/mod/github.com/sugarme/[email protected]/encoding.go:215
github.com/sugarme/tokenizer.TruncateEncodings(0xc00013e270, 0x0, 0x350?)
/home/superman/go/pkg/mod/github.com/sugarme/[email protected]/util.go:108 +0x54
github.com/sugarme/tokenizer.(*Tokenizer).PostProcess(0xc0000fa000, 0xc00013e270?, 0x0?, 0x1)
/home/superman/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:602 +0xef
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x0?, {0x847520?, 0xc000025c40?}, 0x0?)
/home/superman/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:464 +0x2e5
github.com/sugarme/tokenizer.(*Tokenizer).EncodeBatch.func1(0xdc)
/home/superman/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:647 +0x90
created by github.com/sugarme/tokenizer.(*Tokenizer).EncodeBatch in goroutine 42
/home/superman/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:644 +0xea
It would be convenient to construct a sub-package pretrained
that provides APIs to load some popular pretrained tokenizers. For now, some pretrained such as
Hey,
You mentioned this implementation was heavily inspired by the Huggingface one. I was wondering if it's possible to load a tokenizer trained with Huggingface with this implementation?
If it is, are there any gotchas to be aware of?
This information seems to be lacking from the docs.
Thank you
Problem: the Wordpiece Decoder's Decode method does not strip the continuation prefix and join tokens without a space.
The package is missing the vocabulary for roberta-base in github.com/sugarme/[email protected]/pretrained/model/roberta-base-vocab.json.
Without it, tokenizer.Pretrained.RobertaBase() throws a FileNotFound Error. There aren't any other vocab.json files in model either, so it's possible this issue happens with other pretrained tokenizers.
Thanks for writing such a cool package!
Problem: the Tokenizer struct uses a pointer to the Decoder interface, which prevents implementing the Decoder interface in a specific model.
Solution: use the interface type instead of a pointer to the interface.
Convert int to int64 at the API boundary to make it easier for API consumers.
For example: tokenizer.Decode(ids []int64)
Problem:
func main() {
	tk := getBert()
	truncParams := tokenizer.TruncationParams{
		MaxLength: 25,
		Strategy:  tokenizer.OnlySecond,
		Stride:    0,
	}
	tk.WithTruncation(&truncParams)
	input := "A visually stunning rumination on love."
	pairInput := "This is the long paragraph that I want to put context on it. It is not only about how to deal with anger but also how to maintain being calm at all time."
	encodeInput := tokenizer.NewDualEncodeInput(tokenizer.NewInputSequence(input), tokenizer.NewInputSequence(pairInput))
	pairEn, err := tk.Encode(encodeInput, false)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Ids: %+v\n\n", pairEn.Ids)
	fmt.Printf("Tokens: %+q\n\n", pairEn.Tokens)
	fmt.Printf("Words: %+v\n\n", pairEn.Words)
	fmt.Printf("Overflow: %+v\n\n", pairEn.Overflowing)
}
// Output:
Ids: [1037 17453 14726 19379 12758 2006 2293 1012 2069 2055 2129 2000 3066 2007 4963 2021 2036 2129 2000 5441 2108 5475 2012 2035 2051 1012 2023 2003 1996 2146 20423 2008 1045 2215 2000 2404 6123 2006 2009 1012 2009 2003 2025]
Tokens: ["a" "visually" "stunning" "rum" "##ination" "on" "love" "." "only" "about" "how" "to" "deal" "with" "anger" "but" "also" "how" "to" "maintain" "being" "calm" "at" "all" "time" "." "this" "is" "the" "long" "paragraph" "that" "i" "want" "to" "put" "context" "on" "it" "." "it" "is" "not"]
Words: [0 1 2 3 3 4 5 6 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
Overflow: [{Ids:[1037 17453 14726 19379 12758 2006 2293 1012 2069 2055 2129 2000 3066 2007 4963 2021 2036 2129 2000 5441 2108 5475 2012 2035 2051] TypeIds:[0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Tokens:[a visually stunning rum ##ination on love . only about how to deal with anger but also how to maintain being calm at all time] Offsets:[[0 1] [2 10] [11 19] [20 23] [23 30] [31 33] [34 38] [38 39] [71 75] [76 81] [82 85] [86 88] [89 93] [94 98] [99 104] [105 108] [109 113] [114 117] [118 120] [121 129] [130 135] [136 140] [141 143] [144 147] [148 152]] SpecialTokenMask:[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] AttentionMask:[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Overflowing:[] Words:[0 1 2 3 3 4 5 6 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33]} {Ids:[1037 17453 14726 19379 12758 2006 2293 1012 2069 2055 2129 2000 3066 2007 4963 2021 2036 2129 2000 5441 2108 5475 2012 2035 2051 1012] TypeIds:[0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Tokens:[a visually stunning rum ##ination on love . only about how to deal with anger but also how to maintain being calm at all time .] Offsets:[[0 1] [2 10] [11 19] [20 23] [23 30] [31 33] [34 38] [38 39] [71 75] [76 81] [82 85] [86 88] [89 93] [94 98] [99 104] [105 108] [109 113] [114 117] [118 120] [121 129] [130 135] [136 140] [141 143] [144 147] [148 152] [152 153]] SpecialTokenMask:[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] AttentionMask:[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Overflowing:[] Words:[0 1 2 3 3 4 5 6 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34]}]
The second sequence was not truncated; it was moved to the end instead.
The truncation strategy works fine with the SingleEncodeInput type.
func main() {
	tk := getBert()
	truncParams := tokenizer.TruncationParams{
		MaxLength: 25,
		Strategy:  tokenizer.OnlyFirst,
		Stride:    0,
	}
	tk.WithTruncation(&truncParams)
	pairInput := "This is the long paragraph that I want to put context on it. It is not only about how to deal with anger but also how to maintain being calm at all time."
	encodeInput := tokenizer.NewSingleEncodeInput(tokenizer.NewInputSequence(pairInput))
	pairEn, err := tk.Encode(encodeInput, false)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Ids: %+v\n\n", pairEn.Ids)
	fmt.Printf("Tokens: %+q\n\n", pairEn.Tokens)
	fmt.Printf("Words: %+v\n\n", pairEn.Words)
	fmt.Printf("Overflow: %+v\n\n", pairEn.Overflowing)
}
// Output:
Ids: [2023 2003 1996 2146 20423 2008 1045 2215 2000 2404 6123 2006 2009 1012 2009 2003 2025 2069 2055 2129 2000 3066 2007 4963 2021]
Tokens: ["this" "is" "the" "long" "paragraph" "that" "i" "want" "to" "put" "context" "on" "it" "." "it" "is" "not" "only" "about" "how" "to" "deal" "with" "anger" "but"]
Words: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
Overflow: [{Ids:[2036 2129 2000 5441 2108 5475 2012 2035 2051 1012] TypeIds:[0 0 0 0 0 0 0 0 0 0] Tokens:[also how to maintain being calm at all time .] Offsets:[[109 113] [114 117] [118 120] [121 129] [130 135] [136 140] [141 143] [144 147] [148 152] [152 153]] SpecialTokenMask:[0 0 0 0 0 0 0 0 0 0] AttentionMask:[1 1 1 1 1 1 1 1 1 1] Overflowing:[] Words:[25 26 27 28 29 30 31 32 33 34]}]
Is something planned?
First of all thanks for this great package!
I am trying to tokenize OpenAI CLIP text inputs (which I am not sure is even supported) from Huggingface models' tokenizer.json files. Unfortunately, even though the processor (RobertaProcessing) seems to be supported, it always fails with a nil pointer panic in the post-processing phase. With BERT-style tokenizers, it works perfectly.
The respective tokenizer configs:
https://huggingface.co/openai/clip-vit-base-patch32/raw/main/tokenizer.json
or
https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K/raw/main/tokenizer.json
Stack trace:
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x102eb76e8]
goroutine 4 [running]:
testing.tRunner.func1()
/opt/homebrew/Cellar/go/1.21.0/libexec/src/testing/testing.go:1548 +0x528
panic({0x1034dd0a0?, 0x103b04eb0?})
/opt/homebrew/Cellar/go/1.21.0/libexec/src/runtime/panic.go:920 +0x254
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).addSpecialToken(0x14006dc6a80, 0x0)
/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/processor/roberta.go:133 +0x88
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).Process(0x14006dc6a80, 0x1400954e820, 0x0, 0x1)
/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/processor/roberta.go:96 +0xc8
github.com/sugarme/tokenizer.(*Tokenizer).PostProcess(0x14007979200, 0x1400954e820, 0x0, 0x1)
/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:612 +0x230
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x14007979200, {0x1034f9660, 0x1400918cec0}, 0x1)
/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:464 +0x560
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingle(0x14007979200, {0x103419fd0, 0x1}, {0x140067bd4bf, 0x1, 0x1})
/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:1085 +0xcc
Is this intended in any way? If so, could you please consider supporting CLIP tokenizers?
Thanks in advance.
Serialization of the Tokenizer and all its parts (PreTokenizer, Normalizer, ...) using encoding/gob. It is now easy to save/load an entire tokenizer.
See: https://stackoverflow.com/questions/28020070/golang-serialize-and-deserialize-back
I got the error panic: assignment to entry in nil map when I was trying to tokenize this text:
"C:\Program Files\Java\jre1.8. 0_202\bin\java.exe" -Djava.util.logging.config.file="C:\Users\Administrator\Desktop\apache-tomcat-9.0.52\conf\logging.properties" -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dignore.endorsed.dirs="" -classpath "C:\Users\Administrator\Desktop\apache-tomcat-9.0.52\bin\bootstrap.jar;C:\Users\Administrator\Desktop\apache-tomcat-9.0.52\bin\tomcat-juli.jar" -Dcatalina.base="C:\Users\Administrator\Desktop\apache-tomcat-9.0.52" -Dcatalina.home="C:\Users\Administrator\Desktop\apache-tomcat-9.0.52" -Djava.io.tmpdir="C:\Users\Administrator\Desktop\apache-tomcat-9.0.52\temp" org.apache.catalina.startup.Bootstrap start`
I debugged the source code and found that in tokenize.go line 96 there may be something to correct: for each element e of encoding.GetOverflowing(), e.SequenceRanges is not initialized and appears to be nil. Thus an error occurs when e.SetSequenceIds(i) assigns a value to the nil map. Please check and confirm whether this should be corrected soon!
As the title says, is there already a function I can call to load my personal vocab? I looked at the code in pretrained and found that the vocab file path is fixed.
Do you have a function that receives the path of a new vocab and returns a tokenizer loaded from that path?