Compare Tokenizers

A test suite comparing Node.js BPE tokenizers for use with AI models.

Intro

This repo contains a small test suite for comparing the results of different Node.js BPE tokenizers for use with LLMs like GPT-3.

Check out OpenAI's tiktoken Rust / Python lib for reference and OpenAI's Tokenizer Playground to experiment with different inputs.

This repo only tests tokenizers aimed at text, not code-specific tokenizers like the ones used by Codex.

Benchmark

Package / encoder                 Average Time (ms)  Variance (ms)
gpt3-tokenizer                    56132              334621
gpt-3-encoder                     31148              333120
gpt-tokenizer gpt2                6792               3562
gpt-tokenizer text-davinci-003    20678              6362
@dqbd/tiktoken gpt2               6792               3562
@dqbd/tiktoken text-davinci-003   6073               178
tiktoken-node gpt2                6005               675
tiktoken-node text-davinci-003    5726               236

(lower times are better)

@dqbd/tiktoken, which is a wasm port of the official Rust tiktoken, is ~3-6x faster than the JS variants, with significantly less memory overhead and variance. 🔥

To reproduce:

pnpm build
node build/bench.mjs
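
The bench script itself is not reproduced in this README. As a rough sketch of how an average time and variance like the ones in the table could be measured, the snippet below times an encoder over a set of fixtures for several runs and summarizes the per-run timings. The `whitespaceEncode` stand-in, the `bench` and `summarize` helpers, the run counts, and the use of population variance are all assumptions made for illustration, not the repo's actual method.

```javascript
// Hypothetical stand-in encoder (NOT a real BPE tokenizer).
// Swap in a real tokenizer's encode function to benchmark it.
const whitespaceEncode = (text) => text.split(/\s+/).filter(Boolean)

// Mean and population variance of a list of timings in milliseconds.
function summarize(times) {
  const mean = times.reduce((sum, t) => sum + t, 0) / times.length
  const variance =
    times.reduce((sum, t) => sum + (t - mean) ** 2, 0) / times.length
  return { mean, variance }
}

// Encode every fixture `iterations` times per run and record the
// wall-clock time of each run. `performance` is a global in modern Node.js.
function bench(encode, fixtures, { runs = 5, iterations = 100 } = {}) {
  const times = []
  for (let r = 0; r < runs; r++) {
    const start = performance.now()
    for (let i = 0; i < iterations; i++) {
      for (const fixture of fixtures) encode(fixture)
    }
    times.push(performance.now() - start)
  }
  return summarize(times)
}

console.log(bench(whitespaceEncode, ['hello', 'hello world']))
```

Timing whole runs rather than single calls keeps timer resolution from dominating the result for very fast encoders.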

Tokenization Tests

This maps over an array of test fixtures in different languages and prints the number of tokens generated for each of the tokenizers.

0) 5 chars "hello" ⇒ {
  'gpt3-tokenizer': 1,
  'gpt-3-encoder': 1,
  'gpt-tokenizer gpt2': 1,
  'gpt-tokenizer text-davinci-003': 1,
  '@dqbd/tiktoken gpt2': 1,
  '@dqbd/tiktoken text-davinci-003': 1,
  'tiktoken-node gpt2': 1,
  'tiktoken-node text-davinci-003': 1
}
1) 17 chars "hello 👋 world 🌍" ⇒ {
  'gpt3-tokenizer': 7,
  'gpt-3-encoder': 7,
  'gpt-tokenizer gpt2': 7,
  'gpt-tokenizer text-davinci-003': 7,
  '@dqbd/tiktoken gpt2': 7,
  '@dqbd/tiktoken text-davinci-003': 7,
  'tiktoken-node gpt2': 7,
  'tiktoken-node text-davinci-003': 7
}
2) 445 chars "Lorem ipsum dolor si..." ⇒ {
  'gpt3-tokenizer': 153,
  'gpt-3-encoder': 153,
  'gpt-tokenizer gpt2': 153,
  'gpt-tokenizer text-davinci-003': 153,
  '@dqbd/tiktoken gpt2': 153,
  '@dqbd/tiktoken text-davinci-003': 153,
  'tiktoken-node gpt2': 153,
  'tiktoken-node text-davinci-003': 153
}
3) 2636 chars "Lorem ipsum dolor si..." ⇒ {
  'gpt3-tokenizer': 939,
  'gpt-3-encoder': 939,
  'gpt-tokenizer gpt2': 939,
  'gpt-tokenizer text-davinci-003': 922,
  '@dqbd/tiktoken gpt2': 939,
  '@dqbd/tiktoken text-davinci-003': 922,
  'tiktoken-node gpt2': 939,
  'tiktoken-node text-davinci-003': 922
}
4) 246 chars "也称乱数假文或者哑元文本， 是印刷及排版..." ⇒ {
  'gpt3-tokenizer': 402,
  'gpt-3-encoder': 402,
  'gpt-tokenizer gpt2': 402,
  'gpt-tokenizer text-davinci-003': 402,
  '@dqbd/tiktoken gpt2': 402,
  '@dqbd/tiktoken text-davinci-003': 402,
  'tiktoken-node gpt2': 402,
  'tiktoken-node text-davinci-003': 402
}
5) 359 chars "利ヘオヒヲ特逆もか意書購サ米公え出主トほ..." ⇒ {
  'gpt3-tokenizer': 621,
  'gpt-3-encoder': 621,
  'gpt-tokenizer gpt2': 621,
  'gpt-tokenizer text-davinci-003': 621,
  '@dqbd/tiktoken gpt2': 621,
  '@dqbd/tiktoken text-davinci-003': 621,
  'tiktoken-node gpt2': 621,
  'tiktoken-node text-davinci-003': 621
}
6) 2799 chars "это текст-"рыба", ча..." ⇒ {
  'gpt3-tokenizer': 2813,
  'gpt-3-encoder': 2813,
  'gpt-tokenizer gpt2': 2813,
  'gpt-tokenizer text-davinci-003': 2811,
  '@dqbd/tiktoken gpt2': 2813,
  '@dqbd/tiktoken text-davinci-003': 2811,
  'tiktoken-node gpt2': 2813,
  'tiktoken-node text-davinci-003': 2811
}
7) 658 chars "If the dull substanc..." ⇒ {
  'gpt3-tokenizer': 175,
  'gpt-3-encoder': 175,
  'gpt-tokenizer gpt2': 175,
  'gpt-tokenizer text-davinci-003': 170,
  '@dqbd/tiktoken gpt2': 175,
  '@dqbd/tiktoken text-davinci-003': 170,
  'tiktoken-node gpt2': 175,
  'tiktoken-node text-davinci-003': 170
}
8) 3189 chars "Enter [two Players a..." ⇒ {
  'gpt3-tokenizer': 876,
  'gpt-3-encoder': 876,
  'gpt-tokenizer gpt2': 876,
  'gpt-tokenizer text-davinci-003': 872,
  '@dqbd/tiktoken gpt2': 876,
  '@dqbd/tiktoken text-davinci-003': 872,
  'tiktoken-node gpt2': 876,
  'tiktoken-node text-davinci-003': 872
}
9) 17170 chars "ANTONY. [To CAESAR] ..." ⇒ {
  'gpt3-tokenizer': 5801,
  'gpt-3-encoder': 5801,
  'gpt-tokenizer gpt2': 5801,
  'gpt-tokenizer text-davinci-003': 5306,
  '@dqbd/tiktoken gpt2': 5801,
  '@dqbd/tiktoken text-davinci-003': 5306,
  'tiktoken-node gpt2': 5801,
  'tiktoken-node text-davinci-003': 5306
}
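
Output of the shape shown above could come from a loop along the following lines. This is only a sketch: the two encoders are hypothetical stand-ins rather than real BPE tokenizers, and `countTokens` and `label` are illustrative helper names, not the repo's code. A real comparison would put each package's encode function into the `tokenizers` map instead.

```javascript
// Hypothetical stand-in encoders, NOT real BPE tokenizers. Each maps a
// string to an array of tokens, like a tokenizer's encode function would.
const tokenizers = {
  'whitespace (stand-in)': (text) => text.split(/\s+/).filter(Boolean),
  'code points (stand-in)': (text) => Array.from(text)
}

const fixtures = ['hello', 'hello 👋 world 🌍']

// Build a { tokenizerName: tokenCount } record for one input string.
function countTokens(text, encoders) {
  const counts = {}
  for (const [name, encode] of Object.entries(encoders)) {
    counts[name] = encode(text).length
  }
  return counts
}

// Truncate long inputs for display (assumed cutoff of 20 characters).
function label(text, max = 20) {
  return text.length > max ? `${text.slice(0, max)}...` : text
}

fixtures.forEach((text, i) => {
  // `text.length` counts UTF-16 code units, so each emoji here counts as 2.
  console.log(
    `${i}) ${text.length} chars "${label(text)}" ⇒`,
    countTokens(text, tokenizers)
  )
})
```

Keeping the encoders in a name-to-function map means adding another tokenizer to the comparison only takes one more entry.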

To reproduce:

pnpm build
node build/index.mjs

License

MIT © Travis Fischer

If you found this project interesting, please consider sponsoring me or following me on Twitter.
