
tiktokenizer's Introduction

Tiktokenizer

Online playground for openai/tiktoken, calculating the correct number of tokens for a given prompt.

Demo video: CleanShot.2023-03-02.at.22.58.11.mp4

Acknowledgments

Special thanks to Diagram for sponsorship and guidance.

tiktokenizer's People

Contributors

darknoon, dqbd, shadcn


tiktokenizer's Issues

How can I get the correct output at the edge?

Using issues for a question... Sorry about that.

When using the following code, the output doesn't match the output from the online Tiktokenizer, and it has a different length:

      // Load the bundled cl100k_base ranks and the lite WASM build,
      // which works on the Vercel Edge runtime.
      import model from "tiktoken/encoders/cl100k_base";
      import { init, Tiktoken } from "tiktoken/lite/init";
      // @ts-expect-error
      import wasm from "tiktoken/lite/tiktoken_bg.wasm?module";

      export const runtime = "edge";
      // ...

      // Initialize the WASM module before constructing the encoder.
      await init((imports) => WebAssembly.instantiate(wasm, imports));
      const inputText = getChatGPTEncoding(messages, "gpt-3.5-turbo");
      const encoding = new Tiktoken(
        model.bpe_ranks,
        model.special_tokens,
        model.pat_str
      );
      const tokens = encoding.encode(inputText);
      encoding.free();
      return new Response(`${tokens}`);

For the following input (the value saved into the inputText variable):

<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?<|im_end|>\n<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n

I get the following tokens for gpt-3.5-turbo at https://tiktokenizer.vercel.app/:

[100264, 882, 1734, 15339, 100265, 1734, 100264, 78191, 1734, 9906, 0, 2650, 649, 358, 7945, 499, 3432, 30, 100265, 1734, 100264, 882, 1734, 13347, 100265, 1734, 100264, 78191, 1734]

But when running the code I get the following tokens:

[27,91,318,5011,91,29,882,198,15339,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198,9906,0,2650,649,358,7945,499,3432,76514,91,318,6345,91,397,27,91,318,5011,91,29,882,198,13347,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198]
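The second output appears to be the literal characters of <|im_start|> etc. encoded as plain text (27 = <, 91 = |, 29 = >, ...), while the playground maps them to single special tokens such as 100264. Assuming the dqbd/tiktoken WASM bindings, whose encode accepts an allowed_special argument, a minimal sketch of the likely fix:

      // Allow special tokens so <|im_start|>, <|im_end|>, etc. are
      // encoded as single tokens (100264, 100265, ...) instead of
      // being broken into their literal characters.
      const tokens = encoding.encode(inputText, "all");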

Token splitting feature

Hello developer, could you add a "token splitting feature" to make it convenient to feed very long texts in batches of a specific token count?
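For reference, a minimal sketch of such splitting, assuming the dqbd/tiktoken WASM API (get_encoding / encode / decode); note that cutting at an arbitrary token boundary can split a multi-byte character:

      import { get_encoding } from "tiktoken";

      // Split text into chunks of at most maxTokens tokens each.
      function splitByTokens(text: string, maxTokens: number): string[] {
        const enc = get_encoding("cl100k_base");
        const tokens = enc.encode(text);
        const chunks: string[] = [];
        const decoder = new TextDecoder();
        for (let i = 0; i < tokens.length; i += maxTokens) {
          // decode() returns UTF-8 bytes, so convert them back to a string.
          const slice = tokens.slice(i, i + maxTokens);
          chunks.push(decoder.decode(enc.decode(slice)));
        }
        enc.free();
        return chunks;
      }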

Quick suggestions

Love this project, David 💯

A few low-hanging suggestions to improve the repo:

  1. Add a description to the GitHub repo's metadata (in the upper right corner, the same place you edit the repo's URL metadata). This is really helpful because it's what shows up in OpenGraph metadata and when the repo is shared around GitHub itself.
  2. Add a screenshot to the readme like in your tweet 😄
  3. Add a license file; just copy one from an existing repo

You'd be amazed at how much these small details impact developer conversions, especially the image.

Keep up the great work && thanks 🙏

Add new OpenAI models

It would be really cool to calculate costs for gpt-4-turbo and the updated prices for gpt-3.5-turbo.

Token count is inconsistent with the OpenAI tokenizer

As shown below:

(two screenshots from 2023-11-21 comparing the token counts)

text:

<|im_start|>dd<|im_sep|>OpenAI's large language models (sometimes referred to as GPT's) process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.<|im_end|><|im_start|>assistant<|im_sep|><|im_end|><|im_start|>assistant<|im_sep|>

I see in your demo that I can use a custom name for a message. (How) Is that also possible in OpenAI's API?

In your demo, you can choose a custom role / name for the author of a message.


But OpenAI's documentation says that when you call their API, the role of a message can only be one of system, user, assistant, and the recently added function, not just any custom name.


Now I'm a bit confused myself. I see that they also offer a name property, but in the past its description said it was only used to monitor specific users for possible abuse.

It never said anything about it being usable to actually let the model know the name of the user.

So, do you know? Can I use a custom name in the API? And if yes, should I just fill it in the role property or the name property?
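For context, the Chat Completions API does expose an optional name property on each message alongside the fixed role values, so a request can look like the sketch below; whether the model actually treats it as the author's name is exactly the ambiguity raised above.

      // Sketch of a request body: role stays one of the fixed values,
      // while the optional name field carries the custom identifier.
      const messages = [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", name: "Alice", content: "Hello!" },
      ];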

Can you add a function to `tiktoken` that automatically adds special characters to chat messages?

I see that the text generated from the messages automatically gets the special tokens added to it, such as <|im_start|> and <|im_end|>\n, and it even always ends with <|im_start|>assistant.

That makes me wonder: when I'm trying to encode and count the tokens of an entire chat using tiktoken, am I responsible for formatting my text in the correct way, with the correct special tokens placed in the correct places?

Or is there a function, where I can just give an array of messages where each item has the following shape?

{ role: "user", content: "I need some help with MS Word!" }

And tiktoken's encoder would then automatically add the right special tokens in the right places? I know it would be fairly trivial to write such a function myself, but if you were willing to add it to tiktoken, I'd have more trust that if OpenAI ever changes the special tokens or formatting they use internally, you'd pick up on it and update the package accordingly.
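For anyone who wants this behavior today, a minimal sketch of such a helper, assuming the ChatML-style format the playground produces in the examples throughout these issues (toChatML is a hypothetical name, not a tiktoken API):

      interface ChatMessage {
        role: "system" | "user" | "assistant";
        content: string;
      }

      // Mirror the playground's formatting: wrap each message in
      // <|im_start|>{role}\n...<|im_end|>\n and end with the
      // <|im_start|>assistant primer.
      function toChatML(messages: ChatMessage[]): string {
        return (
          messages
            .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>\n`)
            .join("") + "<|im_start|>assistant\n"
        );
      }

Encoding the result with special tokens allowed (e.g. encode(text, "all") in the WASM bindings) should then reproduce the playground's counts.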

Why does the tiktokenizer demo output completely different token numbers than the actual npm package?

So in the tiktokenizer demo, the textarea box looks like this (I'm using the gpt-3.5-turbo model):

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Yes, please help<|im_end|>
<|im_start|>assistant

And the token array looks like this:

[100264, 9125, 198, 2675, 527, 264, 11190, 18328, 100265, 198, 100264, 882, 198, 9642, 11, 4587, 1520, 100265, 198, 100264, 78191, 198]

But when I use the exact same text in my JavaScript file, I get completely different tokens in the terminal output. And no, it's not because of the triangles that appear in the string in my editor; that's just the font I'm using.

Can you explain what's going on here?
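The most likely explanation is the same as in the edge-runtime issue above: by default, encode() does not treat <|im_start|> and friends as special tokens, so depending on the version they are either rejected or tokenized as literal text, whereas the playground encodes them as single special tokens (100264, 100265, ...). Assuming the WASM bindings' allowed_special argument, a minimal sketch:

      // Permit special tokens so the output matches the playground.
      const tokens = encoding.encode(text, "all");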
