Comments (6)

paplorinc commented on August 28, 2024

I'm currently working on optimizing the tokenizer and the token counter (on the Java implementation at https://github.com/knuddelsgmbh/jtokkit, but most of the tricks should be applicable to other implementations as well).

Benchmark                                                      (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOriginal              data    ss   10  6.503 ± 0.053   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount                      data    ss   10  2.094 ± 0.042   s/op

So far it's 3x faster, but I still have a few ideas left.
Afterward, I'll check whether the recommendations here are applicable.

from tiktoken.

christopher-hesse commented on August 28, 2024

I had a version using a trie to do single-pass-ish encoding of an input, but it wasn't correct. I'm not certain how fast a correct version of that trie would be.

hauntsaninja commented on August 28, 2024

Thanks, those are really nice results!

  1. Last time I checked, regex splitting was the majority of the time — I'd be interested in benchmarking the splitting part if easy. I'm potentially interested in specialised code, but we do vary the splitting regexes. Hopefully the PCRE approach demonstrated in the original version of #31 is viable and closes most of the gap.
  2. Nice! The repeated bytes hashing that tiktoken does is clearly not efficient (I was surprised it was viable). Skip list seems like a good way to avoid the O(n) deletes in the loop.
  3. Yeah, tiktoken does this and it was a big piece in ensuring good perf.
  4. I thought about this. I wouldn't want the caller to provide the array because it's annoying to size. Even just returning a numpy array to get rid of the overhead of going from a Rust vec to a Python list could be good (PyO3 has numpy bindings), but I haven't benchmarked it.
  5. If you figure out how to do this, let me know! I don't see a way. I'm always uncertain about the perf characteristics of tries, since they're not CPU cache friendly.
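
One way to read the idea in point 2 is sketched below: instead of deleting merged parts from a list (an O(n) shift per merge), keep a successor array and relink it, so each merge is O(1). This is a minimal sketch under assumptions, not tiktoken's or jtokkit's implementation: `bpe_merge` and the toy `ranks` table are hypothetical names, a plain successor array stands in for a true skip list, and the search for the lowest-ranked pair is still a linear scan.

```python
def bpe_merge(piece: bytes, ranks: dict) -> list:
    """Greedy BPE merge that relinks parts instead of deleting them.

    `ranks` is a hypothetical table mapping a merged byte string to its
    rank (lower rank merges first).
    """
    n = len(piece)
    if n == 0:
        return []
    # nxt[i] is the start offset of the part that follows the part starting
    # at offset i; the value n marks the end of the piece.
    nxt = list(range(1, n + 1))
    while True:
        best_rank = None
        best_i = -1
        i = 0
        # Walk the live parts and find the adjacent pair with the lowest rank.
        while nxt[i] < n:
            j = nxt[i]
            rank = ranks.get(piece[i:nxt[j]])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
            i = j
        if best_rank is None:
            break
        # O(1) "delete": skip over the absorbed part by relinking.
        nxt[best_i] = nxt[nxt[best_i]]
    # Collect the surviving parts by following the links.
    parts = []
    i = 0
    while i < n:
        parts.append(piece[i:nxt[i]])
        i = nxt[i]
    return parts


toy_ranks = {b"ab": 0, b"cd": 1, b"abcd": 2}
merged = bpe_merge(b"abcd", toy_ranks)
```

With the toy table above, `ab` and `cd` are merged first and then combined into a single `abcd` part.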

christopher-hesse commented on August 28, 2024

  1. Definitely easier if PCRE is fast enough, but if there's still a significant speed gain from hand-writing the most common regexps, it could be worth it.

Previous script, full encode: csh_bpe 16592057.7610188 bytes / s => 60.3 ns/byte
Previous script, splitting only (commented out the bigram part): csh_bpe 104345021.38634916 bytes / s => 9.6 ns/byte

  4. The caller supplies an array that is the length of the input, out_tokens = np.empty(input.shape, dtype=np.int32). This does cost more memory during encoding, though the caller can copy the used part of the array afterward if they want. It's also unclear to me whether this has any measurable performance advantage.
  5. Yeah, the cache unfriendliness is worrying, but definitely having a correct trie is the first step. It's not obviously impossible to me, but the correct trie could be gigantic.
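
The caller-supplied output array could look roughly like the sketch below. Everything here is an assumption made for illustration: `encode_into` is a hypothetical name, the toy "encoder" just emits one token per run of identical bytes, and a standard-library `array` stands in for the numpy array so the example has no dependencies. The only point is the calling convention: the buffer is sized to the input length (the worst case), and the function returns how many slots it used.

```python
from array import array


def encode_into(data: bytes, out_tokens) -> int:
    """Write token ids into a caller-supplied buffer and return the count.

    Toy encoder (hypothetical): each run of identical bytes becomes one
    token whose id is the byte value, so the output never exceeds len(data).
    """
    if len(out_tokens) < len(data):
        raise ValueError("out_tokens must be at least as long as the input")
    count = 0
    prev = None
    for b in data:
        if b != prev:
            out_tokens[count] = b
            count += 1
            prev = b
    return count


data = b"aaabccdd"
out_tokens = array("i", [0]) * len(data)  # worst-case size: one token per byte
n = encode_into(data, out_tokens)
used = out_tokens[:n]  # copy out only the part that was filled
```

The extra memory is the gap between `len(out_tokens)` and `n`, which is the cost mentioned above.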

christopher-hesse commented on August 28, 2024

Feel free to close this if the ideas have been ideated.

alkoumpa commented on August 28, 2024

Hello,

It seems that the slow performance is due to an inefficient implementation of the negative lookahead clause ("\s+(?!\S)") in the fancy_regex library.

A possible solution to mimic the negative lookahead functionality is to remove it from the regex and manually re-add spaces to the matched parts, such as words or numbers. Although this approach achieves the same performance as pcre2, it may not be the most elegant solution.
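
A rough sketch of that idea in Python, to show the behavior being emulated. The patterns are simplified stand-ins chosen for illustration, not tiktoken's actual splitting regex, and the function names are hypothetical. Python's `re` supports lookahead, so the first function serves as the reference. The second drops the lookahead and instead hands the last whitespace character of a run back to the following piece whenever a non-whitespace character comes next. It says nothing about speed in Rust.

```python
import re

# Simplified stand-in patterns, NOT tiktoken's real splitting regex.
WITH_LOOKAHEAD = re.compile(r" ?\S+|\s+(?!\S)|\s+")
WITHOUT_LOOKAHEAD = re.compile(r"(?P<word> ?\S+)|(?P<ws>\s+)")


def split_with_lookahead(text):
    """Reference behavior, using the negative lookahead directly."""
    return [m.group() for m in WITH_LOOKAHEAD.finditer(text)]


def split_without_lookahead(text):
    """Same pieces, with the lookahead replaced by manual bookkeeping."""
    pieces = []
    pos = 0
    while pos < len(text):
        m = WITHOUT_LOOKAHEAD.match(text, pos)
        end = m.end()
        # A whitespace run of length > 1 that is followed by a non-whitespace
        # character gives up its last character, which is what the
        # backtracking in \s+(?!\S) would do.
        if m.group("ws") is not None and end < len(text) and end - pos > 1:
            end -= 1
        pieces.append(text[pos:end])
        pos = end
    return pieces


sample = "hello   world  \n foo\t\tbar  "
reference = split_with_lookahead(sample)
manual = split_without_lookahead(sample)
```

On this sample both functions produce the same pieces, for example `"  "` followed by `" world"` for the three spaces before `world`.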
