I made a toy GPT2 tokenizer as a python rust extension. It seems to be slightly faste

Performance ideas about tiktoken HOT 6 OPEN

openai commented on August 28, 2024

Performance ideas

from tiktoken.

Comments (6)

paplorinc commented on August 28, 2024

I'm currently working on optimizing the tokenizer and the token counter (on the Java implementation at https://github.com/knuddelsgmbh/jtokkit, but most of the tricks should be applicable to other implementations as well).

Benchmark                                                      (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOriginal              data    ss   10  6.503 ± 0.053   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount                      data    ss   10  2.094 ± 0.042   s/op

So far it's about 3x faster, and I still have a few ideas left.
Afterwards I'll check whether the recommendations here are applicable.

christopher-hesse commented on August 28, 2024

I had a version using a trie to do single-pass-ish encoding of an input, but it wasn't correct. I'm not certain how fast a correct version of that trie would be.

hauntsaninja commented on August 28, 2024

Thanks, those are really nice results!

Last time I checked, regex splitting took the majority of the time, so I'd be interested in benchmarking the splitting part if that's easy. I'm potentially interested in specialised code, but we do vary the splitting regexes. Hopefully the PCRE approach demonstrated in the original version of #31 is viable and closes most of the gap.
Nice! The repeated bytes hashing that tiktoken does is clearly not efficient (I was surprised it was viable). Skip list seems like a good way to avoid the O(n) deletes in the loop.
Yeah, tiktoken does this and it was a big piece in ensuring good perf.
I thought about this. I wouldn't want the caller to provide the array because it's annoying to size. But even just returning a NumPy array, to get rid of the overhead of going from a Rust Vec to a Python list, could be good (PyO3 has NumPy bindings), though I haven't benchmarked it.
If you figure out how to do this, let me know! I don't see a way. I'm always uncertain about the perf characteristics of tries, since they're not CPU cache friendly.
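The point about O(n) deletes in the merge loop can be illustrated with a rough Python sketch. It uses a plain next-index linked list as a simpler stand-in for the skip list mentioned above, so merging a pair unlinks a node in O(1) instead of deleting from a list. The function name `bpe_merge` and the `ranks` mapping are hypothetical, not tiktoken's API, and the scan for the best pair is still linear here.

```python
def bpe_merge(piece: bytes, ranks: dict) -> list:
    """Merge adjacent parts by lowest rank, unlinking nodes instead of deleting.

    `ranks` is a hypothetical mapping from merged byte strings to merge priority
    (lower merges first). Returns the final parts as byte strings.
    """
    n = len(piece)
    if n == 0:
        return []
    parts = [piece[i:i + 1] for i in range(n)]
    # nxt[i] is the index of the next live part; n means "end of list".
    nxt = list(range(1, n + 1))

    while True:
        best_rank = None
        best = -1
        i = 0
        # Linear scan for the lowest-ranked adjacent pair (ties: leftmost wins).
        while nxt[i] < n:
            j = nxt[i]
            rank = ranks.get(parts[i] + parts[j])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best = rank, i
            i = j
        if best_rank is None:
            break
        # O(1) unlink: absorb the right neighbour and skip over it.
        j = nxt[best]
        parts[best] += parts[j]
        nxt[best] = nxt[j]

    # Walk the list from the head (index 0 is never unlinked).
    out = []
    i = 0
    while i < n:
        out.append(parts[i])
        i = nxt[i]
    return out
```

With the toy ranks `{b"ab": 0, b"abc": 1, b"bc": 2}`, the input `b"abcab"` ends up as `[b"abc", b"ab"]`. Avoiding the linear scan as well would need an additional priority structure, which this sketch leaves out.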

christopher-hesse commented on August 28, 2024

Definitely easier if PCRE is fast enough, but if there's still a significant speed gain from hand-writing the most common regexps, could be worth it.

Previous script, full encode: csh_bpe 16592057.7610188 bytes / s => 60.3 ns/byte
Previous script, splitting only (commented out the bigram part): csh_bpe 104345021.38634916 bytes / s => 9.6 ns/byte

The caller supplies an array that is the length of the input, out_tokens = np.empty(input.shape, dtype=np.int32). This does cost more memory during encoding, though the caller can copy the used part of the array afterward if they want. Also unclear to me if this has any measurable performance advantage.
Yeah, the cache unfriendliness is worrying, but having a correct trie is definitely the first step. It's not obviously impossible to me, but the correct trie could be gigantic.
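As a rough illustration of the caller-supplied output array, here is a small Python sketch. Everything in it is an assumption made for the example: `encode_into`, `TOY_VOCAB`, and the greedy longest-match "encoder" are hypothetical and are not how tiktoken encodes. The output buffer only needs to support integer item assignment, so the standard-library `array` module is used to keep the block dependency-free; an array created with `np.empty(len(data), dtype=np.int32)` could be passed in the same way.

```python
from array import array

# Hypothetical toy vocabulary; unmatched bytes fall back to their byte value.
TOY_VOCAB = {b"he": 256, b"llo": 257, b" wor": 258}
_MAX_LEN = max(len(key) for key in TOY_VOCAB)


def encode_into(data: bytes, out_tokens) -> int:
    """Write token ids into a caller-supplied buffer and return how many were written.

    The buffer must be at least len(data) long, since a token never covers
    less than one input byte.
    """
    if len(out_tokens) < len(data):
        raise ValueError("out_tokens must be at least as long as the input")
    pos = 0
    count = 0
    while pos < len(data):
        # Try the longest candidate first, down to length 2.
        for length in range(min(_MAX_LEN, len(data) - pos), 1, -1):
            token = TOY_VOCAB.get(data[pos:pos + length])
            if token is not None:
                break
        else:
            # No multi-byte match: emit the single byte as its own token.
            token = data[pos]
            length = 1
        out_tokens[count] = token
        count += 1
        pos += length
    return count


data = b"hello world"
out_tokens = array("i", [0]) * len(data)  # sized to the input, as described above
used = encode_into(data, out_tokens)
tokens = list(out_tokens[:used])          # copy out only the used prefix
```

The extra memory is the gap between `len(data)` and `used`; whether skipping the list conversion is measurably faster would still need benchmarking.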

christopher-hesse commented on August 28, 2024

Feel free to close this if the ideas have been ideated.

alkoumpa commented on August 28, 2024

Hello,

It seems that the slow performance is due to an ineffective implementation of the negative lookahead clause ("\s+(?!\S)") in the fancy_regex library.

A possible solution to mimic the negative lookahead functionality is to remove it from the regex and manually re-add spaces to the matched parts, such as words or numbers. Although this approach achieves the same performance as pcre2, it may not be the most elegant solution.
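One way to read the workaround above is sketched here in Python. The pattern is a simplified stand-in for a GPT-2 style splitting regex (Python's `re` has no `\p{L}` classes, so letters and digits are approximated), and `split_reference` and `split_emulated` are hypothetical names. The emulated version drops the lookahead and instead gives back the last whitespace character by hand whenever a whitespace run is followed by more text, so that character can attach to the next word or number. It only demonstrates that the two splits agree, and says nothing about speed in fancy_regex or pcre2.

```python
import re

# Simplified stand-in for the non-whitespace branches of the splitting pattern.
_BODY = r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+"

# Reference: whitespace handled with the negative lookahead.
REFERENCE = re.compile(_BODY + r"|\s+(?!\S)|\s+")
# Emulation: plain \s+ in a named group so we can tell when it matched.
NO_LOOKAHEAD = re.compile(_BODY + r"|(?P<ws>\s+)")


def split_reference(text: str) -> list:
    return REFERENCE.findall(text)


def split_emulated(text: str) -> list:
    pieces = []
    pos = 0
    n = len(text)
    while pos < n:
        match = NO_LOOKAHEAD.match(text, pos)
        end = match.end()
        # A whitespace run of length > 1 that is followed by more text gives
        # back its last character, which is what \s+(?!\S) would have done.
        if match.group("ws") is not None and end < n and end - pos > 1:
            end -= 1
        pieces.append(text[pos:end])
        pos = end
    return pieces
```

For example, both functions split `"hello   world"` into `["hello", "  ", " world"]`.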
