gpt-3-encoder-php's Introduction

GPT-3-Encoder-Decoder-PHP

PHP BPE Text Encoder/Decoder for GPT-2 / GPT-3

About

GPT-2 and GPT-3 use byte pair encoding (BPE) to turn text into a series of integers to feed into the model. This is a PHP implementation of OpenAI's original Python encoder and decoder, which can be found here. The main inspiration for this encoder was the Node.js version, found here.
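
To make the idea of byte pair encoding concrete, here is a toy sketch of the merge loop. It is not the code from this library: the function name `toy_bpe` and the merge table are made up for illustration, and the real encoder works on bytes using OpenAI's published merge table and vocabulary.

```php
<?php
// Toy illustration only: repeatedly merges the adjacent pair of symbols
// that has the lowest (best) rank in a made-up merge table.
function toy_bpe(array $symbols, array $ranks): array
{
    while (count($symbols) > 1) {
        // Find the best-ranked adjacent pair that appears in the table.
        $best = null;
        $bestRank = PHP_INT_MAX;
        for ($i = 0; $i < count($symbols) - 1; $i++) {
            $pair = $symbols[$i] . ' ' . $symbols[$i + 1];
            if (isset($ranks[$pair]) && $ranks[$pair] < $bestRank) {
                $bestRank = $ranks[$pair];
                $best = [$symbols[$i], $symbols[$i + 1]];
            }
        }
        if ($best === null) {
            break; // no known merge applies any more
        }
        // Merge every occurrence of the best pair, left to right.
        $merged = [];
        for ($i = 0; $i < count($symbols); $i++) {
            if ($i < count($symbols) - 1
                && $symbols[$i] === $best[0]
                && $symbols[$i + 1] === $best[1]) {
                $merged[] = $best[0] . $best[1];
                $i++; // skip the second half of the merged pair
            } else {
                $merged[] = $symbols[$i];
            }
        }
        $symbols = $merged;
    }
    return $symbols;
}

// Hypothetical merge table (pair => rank) and vocabulary (token => id).
$ranks = ['l o' => 0, 'lo w' => 1, 'e r' => 2];
$vocab = ['low' => 0, 'er' => 1];

$tokens = toy_bpe(['l', 'o', 'w', 'e', 'r'], $ranks); // ['low', 'er']
$ids = array_map(fn(string $t): int => $vocab[$t], $tokens); // [0, 1]
```

The final step, looking each merged symbol up in a vocabulary, is what turns the text into the series of integers mentioned above.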

You can test the results by comparing the output generated by this script with the official tokenizer page from OpenAI (https://platform.openai.com/tokenizer).

This specific encoder and decoder is used in the Aiomatic WordPress plugin to count the number of tokens a string will use when sent to the OpenAI API. Check more of my work on my website.

Usage

The mbstring PHP extension is needed for this tool to work correctly (in case non-ASCII characters are present in the tokenized text): details here on how to install mbstring
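
A quick way to confirm that the extension is available before tokenizing might look like the following. The helper name `has_mbstring` is hypothetical and not part of this library; `extension_loaded` is a standard PHP function.

```php
<?php
// Hypothetical helper (not part of this library): reports whether the
// mbstring extension is loaded in the current PHP runtime.
function has_mbstring(): bool
{
    return extension_loaded('mbstring');
}

if (!has_mbstring()) {
    // error_log() works in both CLI and web SAPIs.
    error_log('mbstring is not loaded; non-ASCII text may be tokenized incorrectly.');
}
```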

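// Sample text mixing ordinary words, an emoji, and a digit sequence.
$prompt = "Many words map to one token, but some don't: indivisible. Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾 Sequences of characters commonly found next to each other may be grouped together: 1234567890";

// Encode the text into an array of integer token IDs.
$token_array = gpt_encode($prompt);

// Decode the token IDs back into the original text.
$original_text = gpt_decode($token_array);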
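
Since the typical use is counting tokens, a small wrapper can be handy. The sketch below assumes the encoder returns an array of token IDs, as in the usage above. `count_tokens` and `round_trips` are hypothetical helpers, and the byte-level stand-in encoder is there only so the snippet runs on its own; it is not the real BPE encoder.

```php
<?php
// Hypothetical helper: number of tokens the given encoder produces.
function count_tokens(callable $encode, string $text): int
{
    return count($encode($text));
}

// Hypothetical helper: checks that decoding the encoded text
// gives back exactly the original string.
function round_trips(callable $encode, callable $decode, string $text): bool
{
    return $decode($encode($text)) === $text;
}

// Stand-in encoder/decoder for demonstration: one "token" per byte.
$byte_encode = fn(string $t): array => array_values(unpack('C*', $t));
$byte_decode = fn(array $ids): string => implode('', array_map('chr', $ids));

echo count_tokens($byte_encode, 'abc'), "\n"; // prints 3 with the stand-in
```

With the library loaded, the same helpers would be called as `count_tokens('gpt_encode', $prompt)` and `round_trips('gpt_encode', 'gpt_decode', $prompt)`.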

gpt-3-encoder-php's People

Contributors

coderevolutionplugins

gpt-3-encoder-php's Issues

The result of gpt_encode for Chinese characters may not be correct

The expected token counts were generated by the OpenAI tokenizer (https://platform.openai.com/tokenizer), while the actual counts were produced by the gpt_encode function.

Wrong cases:

  • 一天一蘋果,醫生遠離我! Expected: 25, Actual: 24
  • 一天一蘋果,醫生遠離我 Expected: 22, Actual: 21
  • 一天一蘋果 醫生遠離我 Expected: 19, Actual: 18
  • 一天一蘋果,醫生遠離我 apple a day keeps the doctor away Expected: 29, Actual: 28

Correct cases:

  • 蘋果 Expected: 6, Actual: 6
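
A report like this is easier to reproduce with a small comparison harness. The sketch below assumes the encoder is passed in as a callable that returns an array of token IDs. `find_mismatches` is a hypothetical helper, and the expected counts have to be taken by hand from the OpenAI tokenizer page. The byte-level stand-in encoder only makes the snippet self-contained; it is not the real encoder, so its counts differ.

```php
<?php
/**
 * Hypothetical helper: returns the cases whose token count differs from
 * the expected value. $cases maps input text => expected token count.
 */
function find_mismatches(callable $encode, array $cases): array
{
    $mismatches = [];
    foreach ($cases as $text => $expected) {
        $actual = count($encode((string) $text));
        if ($actual !== $expected) {
            $mismatches[$text] = ['expected' => $expected, 'actual' => $actual];
        }
    }
    return $mismatches;
}

// Stand-in encoder for demonstration: one "token" per UTF-8 byte.
$byte_encoder = fn(string $t): array => array_values(unpack('C*', $t));

$cases = [
    '蘋果' => 6,
    '一天一蘋果,醫生遠離我' => 22,
];

foreach (find_mismatches($byte_encoder, $cases) as $text => $r) {
    printf("%s Expected: %d, Actual: %d\n", $text, $r['expected'], $r['actual']);
}
```

With the library loaded, passing `'gpt_encode'` instead of the stand-in would list only the inputs where the script disagrees with the expected counts.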

Extremely slow and causing time-outs?

Hi,

Thanks again for the script.

I do have an issue, though: the script seems to be very slow and causes our scripts to time out, even after increasing the time-out to 120 seconds, in a loop over a 200-row array. We count tokens for each array value, each at most 200-500 words or so.

When we turned the script/checks off, our scripts ran within 2-4 seconds. Is this a known issue? Is there anything we can do to help troubleshoot and see what is lagging, and where?

Cheers.
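
One way to see where the time goes is to time the token-counting call in isolation and to avoid re-encoding identical strings. The sketch below is only an illustration under assumptions: `timed_call` and `make_cached_counter` are hypothetical helpers, the encoder is passed in as a callable returning an array of token IDs, and the byte-level stand-in is not the real encoder. Caching only helps when the same values repeat across rows.

```php
<?php
// Hypothetical helper: runs $fn once and returns [result, elapsed seconds].
function timed_call(callable $fn, ...$args): array
{
    $start = microtime(true);
    $result = $fn(...$args);
    return [$result, microtime(true) - $start];
}

// Hypothetical helper: wraps an encoder so that each distinct string is
// encoded only once; repeats are answered from an in-memory cache.
function make_cached_counter(callable $encode): callable
{
    $cache = [];
    return function (string $text) use ($encode, &$cache): int {
        $key = md5($text); // fixed-size cache key, even for long texts
        if (!isset($cache[$key])) {
            $cache[$key] = count($encode($text));
        }
        return $cache[$key];
    };
}

// Stand-in encoder that also records how often it is actually called.
$calls = 0;
$stand_in = function (string $t) use (&$calls): array {
    $calls++;
    return array_values(unpack('C*', $t));
};

$count_tokens = make_cached_counter($stand_in);
$rows = ['same text', 'same text', 'other'];

[$counts, $seconds] = timed_call('array_map', $count_tokens, $rows);
printf("%d rows in %.4f s using %d encoder calls\n", count($rows), $seconds, $calls);
```

Timing a single row against the real encoder in the same way would show whether the cost is per call or grows with text length.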

problem with Chinese

English works without problems,
but there is a problem when switching to Chinese:

$prompt = "我是 GPT-4,一款先进的自然语言生成 AI。";
// gpt_encode() returns an array of token IDs, so count() is needed to get the total.
echo 'Count: ' . count(gpt_encode($prompt)) . ' tokens';

Count: 34 tokens

in reality: 19 tokens

Error in gpt_utf8_encode function

The function declares a parameter $s that is never used in the body. Additionally, the function starts with
$str .= $str;
which reads $str before it has been defined.

The parameter should simply be named $str instead of $s.

Move sample code to a different file

It would be nice to move the sample code from the end of gpt3-encoder.php to an independent file. That way we can use the functions without getting that sample execution.
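
Until the sample is moved, one common PHP idiom is to guard such code so that it runs only when the file is executed directly, not when it is included. This is a minimal sketch and not code from the repository; the helper `is_entry_script` is hypothetical.

```php
<?php
// Hypothetical helper: true only when $file is the script PHP was started
// with. $entry defaults to the running script's path.
function is_entry_script(string $file, ?string $entry = null): bool
{
    $entry = $entry ?? ($_SERVER['SCRIPT_FILENAME'] ?? '');
    $a = realpath($file);
    $b = realpath($entry);
    // realpath() returns false for paths that do not exist.
    return $a !== false && $b !== false && $a === $b;
}

// At the end of the library file, the sample could be wrapped like this,
// so that including the file elsewhere does not trigger it:
if (is_entry_script(__FILE__)) {
    echo "Running sample code\n";
}
```

Moving the sample into its own file that includes the library, as suggested, is still the cleaner option, since it keeps the library free of side effects entirely.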
