Online playground for openai/tiktoken, calculating the correct number of tokens for a given prompt.
Special thanks to Diagram for sponsorship and guidance.
Online playground for OpenAPI tokenizers
Home Page: https://tiktokenizer.vercel.app
License: MIT License
Using issues for a question... Sorry about that.
When I use the following code, the output doesn't match the output of the online Tiktokenizer, and it has a different length:
import model from "tiktoken/encoders/cl100k_base";
import { init, Tiktoken } from "tiktoken/lite/init";
// @ts-expect-error
import wasm from "tiktoken/lite/tiktoken_bg.wasm?module";
export const runtime = "edge";
// ...
await init((imports) => WebAssembly.instantiate(wasm, imports));
const inputText = getChatGPTEncoding(messages, "gpt-3.5-turbo");
const encoding = new Tiktoken(
model.bpe_ranks,
model.special_tokens,
model.pat_str
);
const tokens = encoding.encode(inputText);
encoding.free();
return new Response(`${tokens}`);
For the following input (the value saved in the inputText variable):
<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?<|im_end|>\n<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n
I get the following tokens for gpt-3.5-turbo at https://tiktokenizer.vercel.app/:
[100264, 882, 1734, 15339, 100265, 1734, 100264, 78191, 1734, 9906, 0, 2650, 649, 358, 7945, 499, 3432, 30, 100265, 1734, 100264, 882, 1734, 13347, 100265, 1734, 100264, 78191, 1734]
But when running the code I get the following tokens:
[27,91,318,5011,91,29,882,198,15339,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198,9906,0,2650,649,358,7945,499,3432,76514,91,318,6345,91,397,27,91,318,5011,91,29,882,198,13347,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198]
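Comparing the two lists above, the playground emits a single id for each chat marker (100264 before each role, 100265 after each message), while the code output starts each marker with 27, 91, 318, 5011, 91, 29, which looks like <|im_start|> being encoded as ordinary text. The playground list also has 1734 where the code output has 198, which suggests the \n sequences were entered there as a literal backslash and n rather than as newlines. The following is a minimal sketch of the first difference only, not the tiktoken implementation: splitOnSpecialTokens and CHAT_SPECIAL_TOKENS are hypothetical names, and the two ids are simply the ones read off the playground output above.

```typescript
// Hypothetical illustration: treat registered special tokens as single ids
// and leave everything else as ordinary text for the regular BPE encoder.
type Segment =
  | { kind: "special"; id: number }
  | { kind: "text"; text: string };

// Ids read off the playground output quoted above (assumed, not looked up).
const CHAT_SPECIAL_TOKENS = new Map<string, number>([
  ["<|im_start|>", 100264],
  ["<|im_end|>", 100265],
]);

// Escape characters that have a meaning inside a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function splitOnSpecialTokens(
  text: string,
  special: Map<string, number>
): Segment[] {
  if (special.size === 0) {
    return text === "" ? [] : [{ kind: "text", text }];
  }
  // A capturing group makes String.split keep the matched markers.
  const names = Array.from(special.keys()).map(escapeRegExp).join("|");
  const pattern = new RegExp(`(${names})`, "g");
  const segments: Segment[] = [];
  for (const part of text.split(pattern)) {
    if (part === "") continue;
    const id = special.get(part);
    if (id !== undefined) {
      segments.push({ kind: "special", id });
    } else {
      segments.push({ kind: "text", text: part });
    }
  }
  return segments;
}

const demoSegments = splitOnSpecialTokens(
  "<|im_start|>user\nhello<|im_end|>\n",
  CHAT_SPECIAL_TOKENS
);
console.log(demoSegments);
```

With an empty map, the whole string comes back as one text segment, which corresponds to the behaviour seen in the code output: the markers are not registered as special tokens, so they are tokenized character by character.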
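Hello developer, could you add a "token splitting" feature, so that a long text can conveniently be fed in batches of a specific token count?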
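A rough sketch of what such a split could look like once the text has already been encoded into token ids. chunkTokens is a hypothetical helper, not a feature of the playground or of tiktoken, and it only slices the id array; it does not try to respect sentence or message boundaries.

```typescript
// Hypothetical helper: split an array of token ids into consecutive batches
// of at most maxTokensPerBatch ids each.
function chunkTokens(tokens: number[], maxTokensPerBatch: number): number[][] {
  if (!Number.isInteger(maxTokensPerBatch) || maxTokensPerBatch <= 0) {
    throw new RangeError("maxTokensPerBatch must be a positive integer");
  }
  const batches: number[][] = [];
  for (let start = 0; start < tokens.length; start += maxTokensPerBatch) {
    batches.push(tokens.slice(start, start + maxTokensPerBatch));
  }
  return batches;
}

// Example with placeholder ids; real ids would come from an encoder.
const exampleBatches = chunkTokens([1, 2, 3, 4, 5, 6, 7], 3);
console.log(exampleBatches);
```

Because byte-level encodings can spread one character over several tokens, a batch boundary may fall in the middle of a character, so decoding each batch separately is not guaranteed to reproduce the original text exactly.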
Love this project, David 💯
A few low-hanging suggestions to improve the repo:
You'd be amazed at how much these small details impact developer conversions, especially the image.
Keep up the great work && thanks 🙏
Changing tokenizer resets the textarea. I just lost my example! :)
Steps to reproduce:
Go to https://tiktokenizer.vercel.app/
Enter anything into user content input box:
Now change the tokenizer from the dropdown box to e.g. gpt2:
...and the example gets reset:
It would be really cool to calculate for gpt4-turbo and the updated prices for gpt3.5-turbo
As shown below:
text:
<|im_start|>dd<|im_sep|>OpenAI's large language models (sometimes referred to as GPT's) process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.<|im_end|><|im_start|>assistant<|im_sep|><|im_end|><|im_start|>assistant<|im_sep|>
In your demo, you can choose a custom role / name for the author of a message.
But OpenAI's documentation says that when you call their API, the role of a message can only be one of system, user, assistant, and the recently added function, not an arbitrary custom name.
Now I'm a bit confused myself. I see that they also offer a name property, but in the past the description said that it was only used to monitor specific users for possible abuse. Like this:
It never said that it could be used to actually let the model know the name of the user.
So, do you know? Can I use a custom name in the API? And if yes, should I put it in the role property or in the name property?
Hello,
Trying to run tiktokenizer from the app folder and I'm getting the following error:
Error: Element type is invalid. Received a promise that resolves to: [object Promise]. Lazy element type must resolve to a class or function.
Here is the branch to reproduce the error.
I got a PR suggesting to add a link to your Tiktokenizer app from our OpenAI tokenization tutorial: openai/openai-cookbook#604
Would this be ok with you?
I don't want to send lots of traffic your way if:
Cheers,
Ted
I see that the text generated from the messages automatically gets the special tokens added to it, such as <|im_start|> and <|im_end|>\n, and it even always ends with <|im_start|>assistant.
That makes me wonder: when I'm trying to encode and count the tokens of an entire chat using tiktoken, am I responsible for formatting my text correctly, with the correct special tokens in the correct places?
Or is there a function to which I can just pass an array of messages, where each item has the following shape?
{ role: "user", content: "I need some help with MS Word!" }
And would tiktoken's encoder then automatically add the right special tokens in the right places? I know it would be fairly trivial to write such a function myself, but if you'd be willing to add it to tiktoken, I would have more trust that if OpenAI ever changes anything about the special tokens or formatting they use internally, you'd probably pick up on that and update your package accordingly.
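For reference, a hand-written version of such a function might look like the sketch below. formatChatPrompt and ChatMessage are hypothetical names, not part of tiktoken, and the layout is only inferred from the text the playground displays (each message wrapped in <|im_start|> and <|im_end|>, followed by a trailing <|im_start|>assistant line); it is not a documented OpenAI format.

```typescript
// Hypothetical message shape, matching the object shown in the question.
interface ChatMessage {
  role: string;
  content: string;
}

// Build the prompt text in the layout the playground appears to use:
// <|im_start|>{role}\n{content}<|im_end|>\n for each message, followed by
// an opening <|im_start|>assistant\n for the reply.
function formatChatPrompt(messages: ChatMessage[]): string {
  const body = messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>\n`)
    .join("");
  return `${body}<|im_start|>assistant\n`;
}

const examplePrompt = formatChatPrompt([
  { role: "system", content: "You are a helpful assistant" },
  { role: "user", content: "I need some help with MS Word!" },
]);
console.log(examplePrompt);
```

The resulting string still has to be encoded with the chat markers treated as special tokens; otherwise they are tokenized as ordinary characters and the count will not match the playground.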
9 times out of 10, if I'm switching to a new model, it's specifically to compare and visualize prompt sizes between different encoders and models, on the same prompt. Just a little tweak that would personally save me about 100 copy and pastes every day.
It's spring, and new encodings have come out with the new model.
So in the tiktokenizer demo, the textarea box looks like this (I'm using the gpt-3.5-turbo model):
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Yes, please help<|im_end|>
<|im_start|>assistant
And the token array looks like this:
[100264, 9125, 198, 2675, 527, 264, 11190, 18328, 100265, 198, 100264, 882, 198, 9642, 11, 4587, 1520, 100265, 198, 100264, 78191, 198]
But when I use the exact same text in my JavaScript file:
I get completely different tokens, as you can see in the terminal panel on the right side.
And no, it's not because of the triangles you see in the string in my screenshot; that's just the font I'm using.
Can you explain what's going on here?