
tiktokenizer's Introduction

Tiktokenizer

Online playground for openai/tiktoken, calculating the correct number of tokens for a given prompt.

Demo video: CleanShot.2023-03-02.at.22.58.11.mp4

Acknowledgments

Special thanks to Diagram for sponsorship and guidance.

tiktokenizer's People

Contributors

darknoon, dqbd, shadcn


tiktokenizer's Issues

How can I get the correct output at the edge?

Using issues for a question... Sorry about that.

When using the following code, the output doesn't match the output from the online Tiktokenizer, and it has a different length:

      // Load the bundled cl100k_base ranks and the lite WASM build,
      // which works on the Vercel Edge runtime.
      import model from "tiktoken/encoders/cl100k_base";
      import { init, Tiktoken } from "tiktoken/lite/init";
      // @ts-expect-error
      import wasm from "tiktoken/lite/tiktoken_bg.wasm?module";

      export const runtime = "edge";
      // ...

      // Initialize the WASM module before constructing the encoder.
      await init((imports) => WebAssembly.instantiate(wasm, imports));
      const inputText = getChatGPTEncoding(messages, "gpt-3.5-turbo");
      const encoding = new Tiktoken(
        model.bpe_ranks,
        model.special_tokens,
        model.pat_str
      );
      const tokens = encoding.encode(inputText);
      encoding.free();
      return new Response(`${tokens}`);

For the following input (the value saved into the inputText variable):

<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?<|im_end|>\n<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n

I get the following tokens for gpt-3.5-turbo at https://tiktokenizer.vercel.app/:

[100264, 882, 1734, 15339, 100265, 1734, 100264, 78191, 1734, 9906, 0, 2650, 649, 358, 7945, 499, 3432, 30, 100265, 1734, 100264, 882, 1734, 13347, 100265, 1734, 100264, 78191, 1734]

But when running the code I get the following tokens:

[27,91,318,5011,91,29,882,198,15339,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198,9906,0,2650,649,358,7945,499,3432,76514,91,318,6345,91,397,27,91,318,5011,91,29,882,198,13347,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198]
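The second output appears to be the literal characters of <|im_start|> etc. encoded as plain text (27 = <, 91 = |, 29 = >, ...), while the playground maps them to single special tokens such as 100264. Assuming the dqbd/tiktoken WASM bindings, whose encode accepts an allowed_special argument, a minimal sketch of the likely fix:

      // Allow special tokens so <|im_start|>, <|im_end|>, etc. are
      // encoded as single tokens (100264, 100265, ...) instead of
      // being broken into their literal characters.
      const tokens = encoding.encode(inputText, "all");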

Token splitting feature

Hello developer, could you add a "token splitting feature" to make it convenient to feed very long texts in batches of a specific token count?
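For reference, a minimal sketch of such splitting, assuming the dqbd/tiktoken WASM API (get_encoding / encode / decode); note that cutting at an arbitrary token boundary can split a multi-byte character:

      import { get_encoding } from "tiktoken";

      // Split text into chunks of at most maxTokens tokens each.
      function splitByTokens(text: string, maxTokens: number): string[] {
        const enc = get_encoding("cl100k_base");
        const tokens = enc.encode(text);
        const chunks: string[] = [];
        const decoder = new TextDecoder();
        for (let i = 0; i < tokens.length; i += maxTokens) {
          // decode() returns UTF-8 bytes, so convert them back to a string.
          const slice = tokens.slice(i, i + maxTokens);
          chunks.push(decoder.decode(enc.decode(slice)));
        }
        enc.free();
        return chunks;
      }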

Quick suggestions

Love this project, David 💯

A few low-hanging suggestions to improve the repo:

  1. Add a description to the GitHub repo's metadata (in the upper right corner, the same place you edit the repo's URL metadata). This is really helpful because it's what shows up in OpenGraph metadata and when the repo is shared around GitHub itself.
  2. Add a screenshot to the readme like in your tweet 😄
  3. Add a license file; just copy one from an existing repo

You'd be amazed at how much these small details impact developer conversions, especially the image.

Keep up the great work && thanks 🙏

Add new OpenAI models

It would be really cool to calculate costs for gpt-4-turbo and the updated prices for gpt-3.5-turbo.

Token count is inconsistent with the OpenAI tokenizer

As shown below:

(two screenshots from 2023-11-21 comparing the token counts)

text:

<|im_start|>dd<|im_sep|>OpenAI's large language models (sometimes referred to as GPT's) process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.<|im_end|><|im_start|>assistant<|im_sep|><|im_end|><|im_start|>assistant<|im_sep|>

I see in your demo that I can use a custom name for a message. (How) Is that also possible in OpenAI's API?

In your demo, you can choose a custom role / name for the author of a message.


But OpenAI's documentation says that when you call their API, the role of a message can only be one of system, user, assistant, and the recently added function, not just any custom name.


Now I'm a bit confused myself. I see that they also offer a name property, but in the past its description said it was only used to monitor specific users for possible abuse.

It never said anything about it being usable to actually let the model know the name of the user.

So, do you know? Can I use a custom name in the API? And if yes, should I just fill it in the role property or the name property?
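For context, the Chat Completions API does expose an optional name property on each message alongside the fixed role values, so a request can look like the sketch below; whether the model actually treats it as the author's name is exactly the ambiguity raised above.

      // Sketch of a request body: role stays one of the fixed values,
      // while the optional name field carries the custom identifier.
      const messages = [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", name: "Alice", content: "Hello!" },
      ];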

Can you add a function to `tiktoken` that automatically adds special characters to chat messages?

I see that the text generated from the messages automatically gets the special tokens added to it, such as <|im_start|> and <|im_end|>\n, and it even always ends with <|im_start|>assistant.

That makes me wonder: when I'm trying to encode and count the tokens of an entire chat using tiktoken, am I responsible for formatting my text in the correct way, with the correct special tokens placed in the correct places?

Or is there a function, where I can just give an array of messages where each item has the following shape?

{ role: "user", content: "I need some help with MS Word!" }

And tiktoken's encoder would then automatically add the right special tokens in the right places? I know it would be fairly trivial to write such a function myself, but if you were willing to add it to tiktoken, I'd have more trust that if OpenAI ever changes the special tokens or formatting they use internally, you'd pick up on it and update the package accordingly.
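For anyone who wants this behavior today, a minimal sketch of such a helper, assuming the ChatML-style format the playground produces in the examples throughout these issues (toChatML is a hypothetical name, not a tiktoken API):

      interface ChatMessage {
        role: "system" | "user" | "assistant";
        content: string;
      }

      // Mirror the playground's formatting: wrap each message in
      // <|im_start|>{role}\n...<|im_end|>\n and end with the
      // <|im_start|>assistant primer.
      function toChatML(messages: ChatMessage[]): string {
        return (
          messages
            .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>\n`)
            .join("") + "<|im_start|>assistant\n"
        );
      }

Encoding the result with special tokens allowed (e.g. encode(text, "all") in the WASM bindings) should then reproduce the playground's counts.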

Why does the tiktokenizer demo output completely different token numbers than the actual npm package?

So in the tiktokenizer demo, the textarea box looks like this (I'm using the gpt-3.5-turbo model):

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Yes, please help<|im_end|>
<|im_start|>assistant

And the token array looks like this:

[100264, 9125, 198, 2675, 527, 264, 11190, 18328, 100265, 198, 100264, 882, 198, 9642, 11, 4587, 1520, 100265, 198, 100264, 78191, 198]

But when I use the exact same text in my JavaScript file, I get completely different tokens in the terminal output. And no, it's not because of the triangles that appear in the string in my editor; that's just the font I'm using.

Can you explain what's going on here?
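The most likely explanation is the same as in the edge-runtime issue above: by default, encode() does not treat <|im_start|> and friends as special tokens, so depending on the version they are either rejected or tokenized as literal text, whereas the playground encodes them as single special tokens (100264, 100265, ...). Assuming the WASM bindings' allowed_special argument, a minimal sketch:

      // Permit special tokens so the output matches the playground.
      const tokens = encoding.encode(text, "all");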
