
tokenizer's People

Contributors

cwarden, jackielii, kasra73, sugarme, yarcat, yujonglee

tokenizer's Issues

panic: fatal error: concurrent map writes

I got the error "fatal error: concurrent map writes" from the BPE TokenizeWithCache function. It reads and writes the cache map from multiple goroutines without synchronization, which makes the Go runtime abort.

func (b BPE) TokenizeWithCache(sequence string) (retVal []tokenizer.Token) {

	if hit, ok := b.Cache.cmap[sequence]; ok {
		return b.WordToTokens(hit)
	} else {
		word := b.MergeWord(sequence)
		retVal = b.WordToTokens(*word)
		if b.Cache != nil {
			b.Cache.SetValues([]CacheItem{
				{sequence, *word},
			})
		}
		return retVal
	}
}

Please check~
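One possible direction for a fix is to guard every access to the cache map with a lock. The sketch below is only an illustration: safeCache, get, and set are hypothetical names and not the library's actual cache type or API. It shows a map protected by a sync.RWMutex so that concurrent lookups and inserts no longer race.

```go
package main

import (
	"fmt"
	"sync"
)

// safeCache is a hypothetical stand-in for the BPE cache; it is not the
// library's actual type. A sync.RWMutex guards every access to the map.
type safeCache struct {
	mu   sync.RWMutex
	cmap map[string][]string
}

func newSafeCache() *safeCache {
	return &safeCache{cmap: make(map[string][]string)}
}

// get takes the read lock, so concurrent lookups do not block each other.
func (c *safeCache) get(key string) ([]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.cmap[key]
	return v, ok
}

// set takes the write lock, excluding all readers and other writers.
func (c *safeCache) set(key string, val []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cmap[key] = val
}

func main() {
	cache := newSafeCache()
	var wg sync.WaitGroup
	// 100 goroutines hit 10 distinct keys, mixing reads and writes.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			key := fmt.Sprintf("word-%d", i%10)
			if _, ok := cache.get(key); !ok {
				cache.set(key, []string{key})
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(len(cache.cmap)) // 10
}
```

Two goroutines may still both miss and both compute the same entry, but the second write simply overwrites the first with an equal value, so the result stays correct.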

Bump version?

The old versions have the "file not found" bug when trying to use pretrained tokenizers.

BOS/EOS tokens

Hi,

First of all, thanks for this great package! I am running inference against a triton server serving transformer models from go, and this library is a tremendous help.

One issue I couldn't figure out from the examples or the code is how to make the BPE tokenizer output encoded BOS and EOS tokens (i.e. < s > and < / s >). I checked that those tokens are part of my vocab.json but it seems they get ignored. I tried manually adding them to the tokenizer as special tokens, tried wrapping my input sentence in "< s > ... < / s >" manually, but I can't seem to get it to work. What am I missing?

Cheers!

edit: changed formatting for < s > so markdown doesn't eat them.

fatal error: concurrent map read and map write

missing mutex lock at line 497

tokenizer/model/bpe/bpe.go

Lines 495 to 501 in 7796975

func (b BPE) TokenizeWithCache(sequence string) (retVal []tokenizer.Token) {
	if hit, ok := b.Cache.cmap[sequence]; ok {
		return b.WordToTokens(hit)
	} else {
		word := b.MergeWord(sequence)
		retVal = b.WordToTokens(*word)

trace:
goroutine 77 [running]:
runtime.throw({0x1bbc7fe, 0xc001844cc0})
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/panic.go:1198 +0x71 fp=0xc0017c1170 sp=0xc0017c1140 pc=0x1036ab1
runtime.mapaccess2_faststr(0x14, 0x33333333333333, {0xc000c2a018, 0x1})
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/map_faststr.go:116 +0x3d4 fp=0xc0017c11d8 sp=0xc0017c1170 pc=0x1014394
github.com/sugarme/tokenizer/model/bpe.BPE.TokenizeWithCache({0xc000010058, 0xc000010048, 0xc000010060, 0xc000fc7860, 0x0, 0x0, 0x0, 0x0}, {0xc000c2a018, 0x1})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/bpe.go:497 +0x7d fp=0xc0017c1250 sp=0xc0017c11d8 pc=0x153ad1d
github.com/sugarme/tokenizer/model/bpe.BPE.Tokenize({0xc000010058, 0xc000010048, 0xc000010060, 0xc000fc7860, 0x0, 0x0, 0x0, 0x0}, {0xc000c2a018, 0x1})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/bpe.go:487 +0xc7 fp=0xc0017c12b0 sp=0xc0017c1250 pc=0x153ac07
github.com/sugarme/tokenizer/model/bpe.(*BPE).Tokenize(0xc0004da000, {0xc000c2a018, 0xc00016af80})
:1 +0xb0 fp=0xc0017c1350 sp=0xc0017c12b0 pc=0x153ee50
github.com/sugarme/tokenizer.(*Tokenizer).doTokenize.func1(0x1af08c0)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:474 +0x31 fp=0xc0017c1378 sp=0xc0017c1350 pc=0x1534c11
github.com/sugarme/tokenizer.(*PreTokenizedString).Tokenize(0xc002cba9f0, 0xc0017c1450)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/pretokenizer.go:123 +0xda fp=0xc0017c1418 sp=0xc0017c1378 pc=0x153151a
github.com/sugarme/tokenizer.(*Tokenizer).doTokenize(0xc00027a128, 0xc0034fee00, 0x1, 0xc00024ed90, 0x10188a6)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:473 +0x47 fp=0xc0017c1470 sp=0xc0017c1418 pc=0x1534b67
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingleSequence.func1(0x0, 0x0, {0xc0034fee00, 0x200000000036d})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:315 +0xba fp=0xc0017c14c0 sp=0xc0017c1470 pc=0x1533f9a
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingleSequence(0x1a720e0, {{0xc0005225b0, 0xc0017c17d0, 0xc0017c1798}, 0x100e794}, 0xc0017c17a8, 0x20)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:334 +0xd4 fp=0xc0017c1700 sp=0xc0017c14c0 pc=0x1533b54
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x1ad5500, {0x1ad5500, 0xc0005204a0}, 0x2)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:359 +0x270 fp=0xc0017c17e0 sp=0xc0017c1700 pc=0x1534250
sentiment/pkg/sentimentmodule.(*Roberta).Tokenize(0xc000521560, {0xc0034f8000, 0x1ba8622})
/Users/ivosights/Project/service-prototype/src/sentiment/pkg/sentimentmodule/roberta.go:97 +0xc8 fp=0xc0017c18d0 sp=0xc0017c17e0 pc=0x1948b08
sentiment/interface/grpc.(*sentimentServer).Predict(0xc001a640c0, {0x1d5fbe0, 0xc002cba960}, 0x0)
/Users/ivosights/Project/service-prototype/src/sentiment/interface/grpc/sentiment.go:60 +0x2a5 fp=0xc0017c19a8 sp=0xc0017c18d0 pc=0x194c565
sentiment/proto._Sentiment_Predict_Handler.func1({0x1d5fbe0, 0xc002cba960}, {0x1b42200, 0xc002cc63c0})
/Users/ivosights/Project/service-prototype/src/sentiment/proto/sentiment_grpc.pb.go:82 +0x78 fp=0xc0017c19e8 sp=0xc0017c19a8 pc=0x194b7f8
github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1({0x1d5fbe0, 0xc002cba570}, {0x1b42200, 0xc002cc63c0}, 0xc0005201a0, 0xc000efc240)
/Users/ivosights/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/logging/logrus/server_interceptors.go:31 +0x102 fp=0xc0017c1ad8 sp=0xc0017c19e8 pc=0x19a9a02
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x1d5fbe0, 0xc002cba570}, {0x1b42200, 0xc002cc63c0})
/Users/ivosights/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25 +0x3a fp=0xc0017c1b18 sp=0xc0017c1ad8 pc=0x19a89fa
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x1d5fbe0, 0xc002cba570}, {0x1b42200, 0xc002cc63c0}, 0xc000ecebb8, 0x1ab2c20)
/Users/ivosights/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34 +0xbf fp=0xc0017c1b70 sp=0xc0017c1b18 pc=0x19a889f
sentiment/proto._Sentiment_Predict_Handler({0x1abb480, 0xc001a640c0}, {0x1d5fbe0, 0xc002cba570}, 0xc000efa6c0, 0xc000d95e30)
/Users/ivosights/Project/service-prototype/src/sentiment/proto/sentiment_grpc.pb.go:84 +0x138 fp=0xc0017c1bc8 sp=0xc0017c1b70 pc=0x194b6b8
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004ae8c0, {0x1d6ffa0, 0xc002cc8300}, 0xc000120000, 0xc001a64150, 0x23fa9f0, 0x0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:1286 +0xc8f fp=0xc0017c1e48 sp=0xc0017c1bc8 pc=0x150b94f
google.golang.org/grpc.(*Server).handleStream(0xc0004ae8c0, {0x1d6ffa0, 0xc002cc8300}, 0xc000120000, 0x0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:1609 +0xa2a fp=0xc0017c1f68 sp=0xc0017c1e48 pc=0x150f52a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:934 +0x98 fp=0xc0017c1fe0 sp=0xc0017c1f68 pc=0x1509418
runtime.goexit()
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0017c1fe8 sp=0xc0017c1fe0 pc=0x106a161
created by google.golang.org/grpc.(*Server).serveStreams.func1
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932 +0x294

goroutine 1 [IO wait]:
internal/poll.runtime_pollWait(0xa040e68, 0x72)
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/netpoll.go:229 +0x89
internal/poll.(*pollDesc).wait(0xc000448480, 0x4, 0x0)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000448480)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_unix.go:402 +0x22c
net.(*netFD).accept(0xc000448480)
/usr/local/Cellar/go/1.17.2/libexec/src/net/fd_unix.go:173 +0x35
net.(*TCPListener).accept(0xc0005b3248)
/usr/local/Cellar/go/1.17.2/libexec/src/net/tcpsock_posix.go:140 +0x28
net.(*TCPListener).Accept(0xc0005b3248)
/usr/local/Cellar/go/1.17.2/libexec/src/net/tcpsock.go:262 +0x3d
google.golang.org/grpc.(*Server).Serve(0xc0004ae8c0, {0x1d59fe0, 0xc0005b3248})
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:786 +0x362
main.Listen()
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/listen.go:93 +0x79f
main.glob..func1(0x2407a80, {0x1b9b1e9, 0x0, 0x0})
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/listen.go:41 +0xa5
github.com/spf13/cobra.(*Command).execute(0x2407a80, {0x24497a0, 0x0, 0x0})
/Users/ivosights/go/pkg/mod/github.com/spf13/[email protected]/command.go:860 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0x2407d00)
/Users/ivosights/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
/Users/ivosights/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
main.main()
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/main.go:4 +0x25

goroutine 5 [select]:
database/sql.(*DB).connectionOpener(0xc0001f5ba0, {0x1d5fb38, 0xc0004db7c0})
/usr/local/Cellar/go/1.17.2/libexec/src/database/sql/sql.go:1196 +0x93
created by database/sql.OpenDB
/usr/local/Cellar/go/1.17.2/libexec/src/database/sql/sql.go:794 +0x188

goroutine 50 [select]:
github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher.func1()
/Users/ivosights/go/pkg/mod/github.com/go-sql-driver/[email protected]/connection.go:614 +0xb0
created by github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher
/Users/ivosights/go/pkg/mod/github.com/go-sql-driver/[email protected]/connection.go:611 +0x105

goroutine 35 [select]:
database/sql.(*DB).connectionCleaner(0xc0001f5ba0, 0x0)
/usr/local/Cellar/go/1.17.2/libexec/src/database/sql/sql.go:1068 +0xbd
created by database/sql.(*DB).startCleanerLocked
/usr/local/Cellar/go/1.17.2/libexec/src/database/sql/sql.go:1055 +0x105

goroutine 6 [sleep]:
time.Sleep(0x45d964b800)
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/time.go:193 +0x12e
sentiment/usecase.(*CustomAfinnService).Listen(0x0, {0x1d5fb70, 0xc00003c098})
/Users/ivosights/Project/service-prototype/src/sentiment/usecase/sentiment_afinn_service.go:35 +0x34
main.Listen.func1()
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/listen.go:68 +0x25
created by main.Listen
/Users/ivosights/Project/service-prototype/src/sentiment/cmd/grpc/listen.go:67 +0x245

goroutine 7 [select]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc000138b40)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/balancer_conn_wrappers.go:69 +0x95
created by google.golang.org/grpc.newCCBalancerWrapper
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/balancer_conn_wrappers.go:60 +0x1d5

goroutine 8 [chan receive]:
google.golang.org/grpc.(*addrConn).resetTransport(0xc0001362c0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:1214 +0x48f
created by google.golang.org/grpc.(*addrConn).connect
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:844 +0x147

goroutine 40 [IO wait]:
internal/poll.runtime_pollWait(0xa040d80, 0x72)
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/netpoll.go:229 +0x89
internal/poll.(*pollDesc).wait(0xc00009e180, 0xc000ee0000, 0x0)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00009e180, {0xc000ee0000, 0x8000, 0x8000})
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00009e180, {0xc000ee0000, 0x80100000000, 0x4})
/usr/local/Cellar/go/1.17.2/libexec/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0000a0018, {0xc000ee0000, 0x100e794, 0x18})
/usr/local/Cellar/go/1.17.2/libexec/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc000e4a3c0, {0xc0004a2038, 0x9, 0x14bc0c5})
/usr/local/Cellar/go/1.17.2/libexec/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x1d48340, 0xc000e4a3c0}, {0xc0004a2038, 0x9, 0x9}, 0x9)
/usr/local/Cellar/go/1.17.2/libexec/src/io/io.go:328 +0x9a
io.ReadFull(...)
/usr/local/Cellar/go/1.17.2/libexec/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0004a2038, 0x9, 0xc002013190}, {0x1d48340, 0xc000e4a3c0})
/Users/ivosights/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0004a2000)
/Users/ivosights/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:492 +0x95
google.golang.org/grpc/internal/transport.(*http2Client).reader(0xc00000c3c0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:1347 +0x414
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:346 +0x174f

goroutine 41 [select]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc000eda1e0, 0x1)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:407 +0x11b
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc000e4a480)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:527 +0x85
google.golang.org/grpc/internal/transport.newHTTP2Client.func3()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:396 +0x65
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:394 +0x1da5

goroutine 28 [runnable]:
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932
created by google.golang.org/grpc.(*Server).serveStreams.func1
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932 +0x294

goroutine 29 [runnable]:
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932
created by google.golang.org/grpc.(*Server).serveStreams.func1
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932 +0x294

goroutine 27 [runnable]:
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932
created by google.golang.org/grpc.(*Server).serveStreams.func1
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932 +0x294

goroutine 42 [runnable]:
github.com/emirpasic/gods/trees/binaryheap.(*Heap).bubbleDownIndex(0xc0003b4530, 0x0)
/Users/ivosights/go/pkg/mod/github.com/emirpasic/[email protected]/trees/binaryheap/binaryheap.go:123 +0x2f5
github.com/emirpasic/gods/trees/binaryheap.(*Heap).bubbleDown(...)
/Users/ivosights/go/pkg/mod/github.com/emirpasic/[email protected]/trees/binaryheap/binaryheap.go:118
github.com/emirpasic/gods/trees/binaryheap.(*Heap).Pop(0xc0003b4530)
/Users/ivosights/go/pkg/mod/github.com/emirpasic/[email protected]/trees/binaryheap/binaryheap.go:74 +0x288
github.com/sugarme/tokenizer/model/bpe.(*Word).MergeAll(0xc001b81200, 0x15, {0x0, 0x1, 0x2})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/word.go:314 +0x296
github.com/sugarme/tokenizer/model/bpe.(*BPE).MergeWord(0xc000867260, {0xc000c349e0, 0xc000c349e0})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/bpe.go:433 +0x3af
github.com/sugarme/tokenizer/model/bpe.BPE.TokenizeWithCache({0xc000010058, 0xc000010048, 0xc000010060, 0xc000fc7860, 0x0, 0x0, 0x0, 0x0}, {0xc000c349e0, 0x6})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/bpe.go:500 +0xaa
github.com/sugarme/tokenizer/model/bpe.BPE.Tokenize({0xc000010058, 0xc000010048, 0xc000010060, 0xc000fc7860, 0x0, 0x0, 0x0, 0x0}, {0xc000c349e0, 0x6})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/model/bpe/bpe.go:487 +0xc7
github.com/sugarme/tokenizer.(*Tokenizer).doTokenize.func1(0x1af08c0)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:474 +0x31
github.com/sugarme/tokenizer.(*PreTokenizedString).Tokenize(0xc00354a6f0, 0xc000867450)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/pretokenizer.go:123 +0xda
github.com/sugarme/tokenizer.(*Tokenizer).doTokenize(0xc00027a128, 0xc000888000, 0x2445ba0, 0xc00024ed90, 0x0)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:473 +0x47
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingleSequence.func1(0x0, 0x0, {0xc000888000, 0x2000000000351})
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:315 +0xba
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingleSequence(0x1a720e0, {{0xc0003b4370, 0xc0008677d0, 0xc000867798}, 0x100e794}, 0xc0008677a8, 0x20)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:334 +0xd4
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x1ad5500, {0x1ad5500, 0xc0003a63c0}, 0x2)
/Users/ivosights/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:359 +0x270
sentiment/pkg/sentimentmodule.(*Roberta).Tokenize(0xc000521560, {0xc00354ef00, 0x1ba8622})
/Users/ivosights/Project/service-prototype/src/sentiment/pkg/sentimentmodule/roberta.go:97 +0xc8
sentiment/interface/grpc.(*sentimentServer).Predict(0xc001a640c0, {0x1d5fbe0, 0xc00354a660}, 0x0)
/Users/ivosights/Project/service-prototype/src/sentiment/interface/grpc/sentiment.go:60 +0x2a5
sentiment/proto._Sentiment_Predict_Handler.func1({0x1d5fbe0, 0xc00354a660}, {0x1b42200, 0xc0004ce550})
/Users/ivosights/Project/service-prototype/src/sentiment/proto/sentiment_grpc.pb.go:82 +0x78
github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1({0x1d5fbe0, 0xc00354a570}, {0x1b42200, 0xc0004ce550}, 0xc0003a62e0, 0xc002a1c1f8)
/Users/ivosights/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/logging/logrus/server_interceptors.go:31 +0x102
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x1d5fbe0, 0xc00354a570}, {0x1b42200, 0xc0004ce550})
/Users/ivosights/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x1d5fbe0, 0xc00354a570}, {0x1b42200, 0xc0004ce550}, 0xc0033f4bb8, 0x1ab2c20)
/Users/ivosights/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34 +0xbf
sentiment/proto._Sentiment_Predict_Handler({0x1abb480, 0xc001a640c0}, {0x1d5fbe0, 0xc00354a570}, 0xc00348c2a0, 0xc000d95e30)
/Users/ivosights/Project/service-prototype/src/sentiment/proto/sentiment_grpc.pb.go:84 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004ae8c0, {0x1d6ffa0, 0xc002cc8300}, 0xc0001f6900, 0xc001a64150, 0x23fa9f0, 0x0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:1286 +0xc8f
google.golang.org/grpc.(*Server).handleStream(0xc0004ae8c0, {0x1d6ffa0, 0xc002cc8300}, 0xc0001f6900, 0x0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:1609 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:934 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:932 +0x294

goroutine 74 [runnable]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc002cc6190, 0x1)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:407 +0x11b
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc0017ba000)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:527 +0x85
google.golang.org/grpc/internal/transport.newHTTP2Server.func2()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:292 +0xc6
created by google.golang.org/grpc/internal/transport.newHTTP2Server
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:289 +0x148f

goroutine 75 [select]:
google.golang.org/grpc/internal/transport.(*http2Server).keepalive(0xc002cc8300)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:993 +0x259
created by google.golang.org/grpc/internal/transport.newHTTP2Server
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:300 +0x14d7

goroutine 76 [runnable]:
syscall.syscall(0x10b64a0, 0xb, 0xc0034ea000, 0x8000)
/usr/local/Cellar/go/1.17.2/libexec/src/runtime/sys_darwin.go:22 +0x3b
syscall.read(0xc00016a000, {0xc0034ea000, 0xc000b87c00, 0x10188a6})
/usr/local/Cellar/go/1.17.2/libexec/src/syscall/zsyscall_darwin_amd64.go:1171 +0x49
syscall.Read(...)
/usr/local/Cellar/go/1.17.2/libexec/src/syscall/syscall_unix.go:189
internal/poll.ignoringEINTRIO(...)
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_unix.go:582
internal/poll.(*FD).Read(0xc00016a000, {0xc0034ea000, 0x8000, 0x8000})
/usr/local/Cellar/go/1.17.2/libexec/src/internal/poll/fd_unix.go:163 +0x285
net.(*netFD).Read(0xc00016a000, {0xc0034ea000, 0x18, 0x18})
/usr/local/Cellar/go/1.17.2/libexec/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0000a0010, {0xc0034ea000, 0x104e6b4, 0x18})
/usr/local/Cellar/go/1.17.2/libexec/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc000efa3c0, {0xc0001622d8, 0x9, 0x1af32a0})
/usr/local/Cellar/go/1.17.2/libexec/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x1d48340, 0xc000efa3c0}, {0xc0001622d8, 0x9, 0x9}, 0x9)
/usr/local/Cellar/go/1.17.2/libexec/src/io/io.go:328 +0x9a
io.ReadFull(...)
/usr/local/Cellar/go/1.17.2/libexec/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0001622d8, 0x9, 0x0}, {0x1d48340, 0xc000efa3c0})
/Users/ivosights/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0001622a0)
/Users/ivosights/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:492 +0x95
google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc002cc8300, 0x0, 0x0)
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:473 +0xb2
google.golang.org/grpc.(*Server).serveStreams(0xc0004ae8c0, {0x1d6ffa0, 0xc002cc8300})
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:918 +0x142
google.golang.org/grpc.(*Server).handleRawConn.func1()
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:868 +0x46
created by google.golang.org/grpc.(*Server).handleRawConn
/Users/ivosights/go/pkg/mod/google.golang.org/[email protected]/server.go:867 +0x472
exit status 2

pointer receiver

For consistency, check the following sub-packages and change their methods to pointer receivers:

  • Normalizer
  • Pretokenizer
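As a reminder of why receiver consistency matters, here is a small sketch of the difference between value and pointer receivers. The counter type is hypothetical and exists only for illustration; it is not taken from the library.

```go
package main

import "fmt"

// counter is a hypothetical type used only to illustrate receiver semantics.
type counter struct{ n int }

// incValue has a value receiver: it mutates a copy, so the caller sees no change.
func (c counter) incValue() { c.n++ }

// incPointer has a pointer receiver: it mutates the caller's value.
func (c *counter) incPointer() { c.n++ }

func main() {
	var c counter
	c.incValue()
	fmt.Println(c.n) // 0
	c.incPointer()
	fmt.Println(c.n) // 1
}
```

Mixing both kinds on one type also changes which method set a value has, which affects whether the value or only its pointer satisfies an interface.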

Performance / Parallelization Support

The Hugging Face tokenizers implementation in Rust parallelizes where possible to achieve high tokenization throughput. Does this library implement the same kind of parallelism, or would it be more performant to expose the Rust implementation through C bindings and import it into Go?

panic: runtime error: invalid memory address or nil pointer dereference

I encountered a panic while encoding some documents. Unfortunately I can't provide the documents, as they are private.

After a quick look, it seems that pairEncoding in util.go:108 is nil, so the GetIds() call fails.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7a8854]

goroutine 295 [running]:
github.com/sugarme/tokenizer.(*Encoding).GetIds(...)
        /home/superman/go/pkg/mod/github.com/sugarme/[email protected]/encoding.go:215
github.com/sugarme/tokenizer.TruncateEncodings(0xc00013e270, 0x0, 0x350?)
        /home/superman/go/pkg/mod/github.com/sugarme/[email protected]/util.go:108 +0x54
github.com/sugarme/tokenizer.(*Tokenizer).PostProcess(0xc0000fa000, 0xc00013e270?, 0x0?, 0x1)
        /home/superman/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:602 +0xef
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x0?, {0x847520?, 0xc000025c40?}, 0x0?)
        /home/superman/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:464 +0x2e5
github.com/sugarme/tokenizer.(*Tokenizer).EncodeBatch.func1(0xdc)
        /home/superman/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:647 +0x90
created by github.com/sugarme/tokenizer.(*Tokenizer).EncodeBatch in goroutine 42
        /home/superman/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:644 +0xea
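The trace shows a nil second argument reaching TruncateEncodings, which is consistent with a nil pair encoding being dereferenced. A minimal sketch of the kind of guard that would avoid this follows; Encoding is a stripped-down hypothetical type and totalLen is a hypothetical helper, not the library's code.

```go
package main

import "fmt"

// Encoding is a hypothetical, stripped-down type; only the ids are kept.
type Encoding struct{ Ids []int }

// GetIds mirrors the accessor named in the trace. Calling it on a nil
// *Encoding dereferences nil and panics.
func (e *Encoding) GetIds() []int { return e.Ids }

// totalLen counts ids across both encodings, treating a nil pair as empty.
func totalLen(enc, pair *Encoding) int {
	n := len(enc.GetIds())
	if pair != nil {
		n += len(pair.GetIds())
	}
	return n
}

func main() {
	enc := &Encoding{Ids: []int{1, 2, 3}}
	fmt.Println(totalLen(enc, nil)) // 3
}
```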

pretrained tokenizers

It would be convenient to add a pretrained sub-package that provides APIs to load popular pretrained tokenizers. For now, some pretrained tokenizers such as

  • BERT
  • ROBERTA
    ...

Loading tokenizer trained with Huggingface implementation

Hey,
You mentioned that this implementation was heavily inspired by the Hugging Face one. I was wondering whether it is possible to load a tokenizer trained with Hugging Face using this implementation.
If it is, are there any gotchas to be aware of?
I thought this information was missing from the docs.
Thank you

Wordpiece Decoder

Problem: the WordPiece decoder's Decode method should strip the continuation prefix and join those tokens to the previous one without a space, but it does not.
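The expected behavior can be sketched as follows. This is an illustration only: decodeWordPiece is a hypothetical helper, and the "##" prefix is assumed because it is the one used by BERT-style vocabularies.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeWordPiece joins tokens into text. A token that starts with prefix is a
// continuation: the prefix is stripped and the rest is appended to the previous
// token without a space. All other tokens are separated by a single space.
func decodeWordPiece(tokens []string, prefix string) string {
	var sb strings.Builder
	for i, tok := range tokens {
		if strings.HasPrefix(tok, prefix) {
			sb.WriteString(strings.TrimPrefix(tok, prefix))
			continue
		}
		if i > 0 {
			sb.WriteByte(' ')
		}
		sb.WriteString(tok)
	}
	return sb.String()
}

func main() {
	tokens := []string{"rum", "##ination", "on", "love"}
	fmt.Println(decodeWordPiece(tokens, "##")) // rumination on love
}
```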

Missing roberta-base-vocab.json

The package is missing the vocabulary for roberta-base in github.com/sugarme/[email protected]/pretrained/model/roberta-base-vocab.json.

Without it, tokenizer.Pretrained.RobertaBase() throws a FileNotFound Error. There aren't any other vocab.json files in model either, so it's possible this issue happens with other pretrained tokenizers.

Thanks for writing such a cool package!

Decoder using pointer to interface

Problem: the Tokenizer struct holds a pointer to the Decoder interface, which prevents a specific model from implementing the Decoder interface.

Solution: use the interface type instead of a pointer to the interface.
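A minimal sketch of the proposed shape, under simplifying assumptions: the Decoder interface, spaceDecoder, and the Tokenizer struct below are hypothetical reductions for illustration and do not reproduce the library's definitions.

```go
package main

import (
	"fmt"
	"strings"
)

// Decoder is a hypothetical, simplified interface, not the library's definition.
type Decoder interface {
	Decode(tokens []string) string
}

// spaceDecoder is a hypothetical model-specific implementation.
type spaceDecoder struct{}

func (spaceDecoder) Decode(tokens []string) string { return strings.Join(tokens, " ") }

// Tokenizer holds the interface value directly. If the field had type *Decoder,
// a concrete implementation could not be assigned to it; callers would first
// need a variable of type Decoder and then take its address.
type Tokenizer struct {
	decoder Decoder
}

// WithDecoder accepts any type that implements Decoder.
func (t *Tokenizer) WithDecoder(d Decoder) { t.decoder = d }

func main() {
	tk := &Tokenizer{}
	tk.WithDecoder(spaceDecoder{})
	fmt.Println(tk.decoder.Decode([]string{"a", "b"})) // a b
}
```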

int64

Convert int to int64 in the API to make it easier for API consumers.

For example: tokenizer.Decode(ids []int64)
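Until the API takes int64 directly, consumers that need int64 ids have to convert on their side. A small sketch of that conversion, with toInt64 as a hypothetical helper name:

```go
package main

import "fmt"

// toInt64 copies a slice of int ids into a new slice of int64.
func toInt64(ids []int) []int64 {
	out := make([]int64, len(ids))
	for i, id := range ids {
		out[i] = int64(id)
	}
	return out
}

func main() {
	fmt.Println(toInt64([]int{1037, 17453, 14726})) // [1037 17453 14726]
}
```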

Encode - Truncation working incorrectly with DualEncodeInput

Problem:

func main() {
	tk := getBert()
	truncParams := tokenizer.TruncationParams{
		MaxLength: 25,
		Strategy:  tokenizer.OnlySecond,
		Stride:    0,
	}
	tk.WithTruncation(&truncParams)

	input := "A visually stunning rumination on love."
	pairInput := "This is the long paragraph that I want to put context on it. It is not only about how to deal with anger but also how to maintain being calm at all time."

	encodeInput := tokenizer.NewDualEncodeInput(tokenizer.NewInputSequence(input), tokenizer.NewInputSequence(pairInput))
	pairEn, err := tk.Encode(encodeInput, false)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Ids: %+v\n\n", pairEn.Ids)
	fmt.Printf("Tokens: %+q\n\n", pairEn.Tokens)
	fmt.Printf("Words: %+v\n\n", pairEn.Words)
	fmt.Printf("Overflow: %+v\n\n", pairEn.Overflowing)
}

// Output:
Ids: [1037 17453 14726 19379 12758 2006 2293 1012 2069 2055 2129 2000 3066 2007 4963 2021 2036 2129 2000 5441 2108 5475 2012 2035 2051 1012 2023 2003 1996 2146 20423 2008 1045 2215 2000 2404 6123 2006 2009 1012 2009 2003 2025]

Tokens: ["a" "visually" "stunning" "rum" "##ination" "on" "love" "." "only" "about" "how" "to" "deal" "with" "anger" "but" "also" "how" "to" "maintain" "being" "calm" "at" "all" "time" "." "this" "is" "the" "long" "paragraph" "that" "i" "want" "to" "put" "context" "on" "it" "." "it" "is" "not"]

Words: [0 1 2 3 3 4 5 6 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]

Overflow: [{Ids:[1037 17453 14726 19379 12758 2006 2293 1012 2069 2055 2129 2000 3066 2007 4963 2021 2036 2129 2000 5441 2108 5475 2012 2035 2051] TypeIds:[0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Tokens:[a visually stunning rum ##ination on love . only about how to deal with anger but also how to maintain being calm at all time] Offsets:[[0 1] [2 10] [11 19] [20 23] [23 30] [31 33] [34 38] [38 39] [71 75] [76 81] [82 85] [86 88] [89 93] [94 98] [99 104] [105 108] [109 113] [114 117] [118 120] [121 129] [130 135] [136 140] [141 143] [144 147] [148 152]] SpecialTokenMask:[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] AttentionMask:[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Overflowing:[] Words:[0 1 2 3 3 4 5 6 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33]} {Ids:[1037 17453 14726 19379 12758 2006 2293 1012 2069 2055 2129 2000 3066 2007 4963 2021 2036 2129 2000 5441 2108 5475 2012 2035 2051 1012] TypeIds:[0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Tokens:[a visually stunning rum ##ination on love . only about how to deal with anger but also how to maintain being calm at all time .] Offsets:[[0 1] [2 10] [11 19] [20 23] [23 30] [31 33] [34 38] [38 39] [71 75] [76 81] [82 85] [86 88] [89 93] [94 98] [99 104] [105 108] [109 113] [114 117] [118 120] [121 129] [130 135] [136 140] [141 143] [144 147] [148 152] [152 153]] SpecialTokenMask:[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] AttentionMask:[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Overflowing:[] Words:[0 1 2 3 3 4 5 6 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34]}]

The sequence that should have been truncated was not shortened; part of it was moved to the end instead.

Truncation strategy works fine with SingleEncodeInput type.

func main() {
	tk := getBert()
	truncParams := tokenizer.TruncationParams{
		MaxLength: 25,
		Strategy:  tokenizer.OnlyFirst,
		Stride:    0,
	}
	tk.WithTruncation(&truncParams)

	pairInput := "This is the long paragraph that I want to put context on it. It is not only about how to deal with anger but also how to maintain being calm at all time."

	encodeInput := tokenizer.NewSingleEncodeInput(tokenizer.NewInputSequence(pairInput))
	pairEn, err := tk.Encode(encodeInput, false)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Ids: %+v\n\n", pairEn.Ids)
	fmt.Printf("Tokens: %+q\n\n", pairEn.Tokens)
	fmt.Printf("Words: %+v\n\n", pairEn.Words)
	fmt.Printf("Overflow: %+v\n\n", pairEn.Overflowing)
}

// Output:
Ids: [2023 2003 1996 2146 20423 2008 1045 2215 2000 2404 6123 2006 2009 1012 2009 2003 2025 2069 2055 2129 2000 3066 2007 4963 2021]

Tokens: ["this" "is" "the" "long" "paragraph" "that" "i" "want" "to" "put" "context" "on" "it" "." "it" "is" "not" "only" "about" "how" "to" "deal" "with" "anger" "but"]

Words: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]

Overflow: [{Ids:[2036 2129 2000 5441 2108 5475 2012 2035 2051 1012] TypeIds:[0 0 0 0 0 0 0 0 0 0] Tokens:[also how to maintain being calm at all time .] Offsets:[[109 113] [114 117] [118 120] [121 129] [130 135] [136 140] [141 143] [144 147] [148 152] [152 153]] SpecialTokenMask:[0 0 0 0 0 0 0 0 0 0] AttentionMask:[1 1 1 1 1 1 1 1 1 1] Overflowing:[] Words:[25 26 27 28 29 30 31 32 33 34]}]
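For comparison, the behavior this report appears to expect from the OnlySecond strategy can be sketched on plain token slices. This is an assumption about the intended semantics, with special tokens and stride ignored; truncateOnlySecond is a hypothetical helper, not the library's implementation.

```go
package main

import "fmt"

// truncateOnlySecond trims the tail of second so that
// len(first)+len(kept) <= maxLen. The first sequence is never shortened.
// It returns the kept part of second and the overflow that was cut off.
func truncateOnlySecond(first, second []string, maxLen int) (kept, overflow []string) {
	room := maxLen - len(first)
	if room < 0 {
		room = 0
	}
	if room >= len(second) {
		return second, nil
	}
	return second[:room], second[room:]
}

func main() {
	first := []string{"a", "visually", "stunning"}
	second := []string{"this", "is", "the", "long", "paragraph"}
	kept, overflow := truncateOnlySecond(first, second, 5)
	fmt.Println(kept, overflow) // [this is] [the long paragraph]
}
```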

OpenAI CLIP tokenization support?

First of all thanks for this great package!

I am trying to tokenize OpenAI CLIP text inputs (which I am not sure is even supported), from Huggingface models tokenizer.json files. Unfortunately, even though the processor (RobertaProcessing) seems to be supported, it always fails with a nil pointer panic in a postprocessing phase. With BERT style tokenizers, it works perfectly.

The respective tokenizer configs:
https://huggingface.co/openai/clip-vit-base-patch32/raw/main/tokenizer.json
or
https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K/raw/main/tokenizer.json

Stack trace:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x102eb76e8]
goroutine 4 [running]:
testing.tRunner.func1()
	/opt/homebrew/Cellar/go/1.21.0/libexec/src/testing/testing.go:1548 +0x528
panic({0x1034dd0a0?, 0x103b04eb0?})
	/opt/homebrew/Cellar/go/1.21.0/libexec/src/runtime/panic.go:920 +0x254
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).addSpecialToken(0x14006dc6a80, 0x0)
	/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/processor/roberta.go:133 +0x88
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).Process(0x14006dc6a80, 0x1400954e820, 0x0, 0x1)
	/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/processor/roberta.go:96 +0xc8
github.com/sugarme/tokenizer.(*Tokenizer).PostProcess(0x14007979200, 0x1400954e820, 0x0, 0x1)
	/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:612 +0x230
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x14007979200, {0x1034f9660, 0x1400918cec0}, 0x1)
	/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:464 +0x560
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingle(0x14007979200, {0x103419fd0, 0x1}, {0x140067bd4bf, 0x1, 0x1})
	/Users/kristof/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:1085 +0xcc

Is this intended in any way? If so, could you please consider supporting CLIP tokenizers?

Thanks in advance.

panic: assignment to entry in nil map

I got the error panic: assignment to entry in nil map when trying to tokenize the following text:

"C:\Program Files\Java\jre1.8. 0_202\bin\java.exe" -Djava.util.logging.config.file="C:\Users\Administrator\Desktop\apache-tomcat-9.0.52\conf\logging.properties" -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dignore.endorsed.dirs="" -classpath "C:\Users\Administrator\Desktop\apache-tomcat-9.0.52\bin\bootstrap.jar;C:\Users\Administrator\Desktop\apache-tomcat-9.0.52\bin\tomcat-juli.jar" -Dcatalina.base="C:\Users\Administrator\Desktop\apache-tomcat-9.0.52" -Dcatalina.home="C:\Users\Administrator\Desktop\apache-tomcat-9.0.52" -Djava.io.tmpdir="C:\Users\Administrator\Desktop\apache-tomcat-9.0.52\temp" org.apache.catalina.startup.Bootstrap start`

I debugged the source code and found what looks like the cause at tokenize.go, line 96: for each element e of encoding.GetOverflowing(), e.SequenceRanges is never initialized, so it is nil. Calling e.SetSequenceIds(i) then assigns to a nil map, which triggers the panic. Could you please check and confirm whether this will be fixed soon?
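To illustrate the failure mode described here, below is a minimal standalone sketch. The `encoding` type and `setSequenceID` method are simplified, hypothetical stand-ins (not the library's actual types or its fix); the point is only that writing to a nil map panics in Go, and that a lazy `make` before the write avoids it.

```go
package main

import "fmt"

// encoding is a hypothetical stand-in for the library's Encoding type.
// Only the field relevant to the panic is modelled.
type encoding struct {
	ids            []int
	sequenceRanges map[int][2]int // nil until initialized
}

// setSequenceID records that all ids belong to the given sequence.
// Without the nil check, the assignment below would panic with
// "assignment to entry in nil map" on a zero-value encoding.
func (e *encoding) setSequenceID(id int) {
	if e.sequenceRanges == nil {
		e.sequenceRanges = make(map[int][2]int)
	}
	e.sequenceRanges[id] = [2]int{0, len(e.ids)}
}

func main() {
	// An overflow encoding created without initializing its map.
	overflow := &encoding{ids: []int{2036, 2129, 2000}}
	overflow.setSequenceID(0)
	fmt.Println(overflow.sequenceRanges[0]) // prints [0 3]
}
```

An alternative would be to initialize the map wherever the overflow encodings are constructed; which of the two fits the library better is for the maintainers to decide.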

How to load self-made vocab.txt

As the title says, is there already a function I can call to load my own vocab? I looked at the code in pretrained, and it seems the vocab file path is fixed there.

Is there a function that takes the path of a new vocab file and returns a tokenizer loaded from that path?
