Comments (2)
Also the caching is very undocumented!
from tiktoken.
Thanks for the report! Since you posted this issue there have been a couple of relevant improvements. We no longer use blobfile in most cases; there's much less scope for an Azure hang with just a plain unauthenticated HTTP request. I also changed the default cache location a little for better cross-platform support.
By "pass in the directory directly", where exactly are you envisioning passing that? I'm unlikely to complicate interfaces for this feature request (unless it proves popular). Some workarounds in the meantime:
- Construct the Encoding objects directly, using whatever file loading logic you'd like
- Assign to `os.environ` before calling `get_encoding`
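For anyone landing here later, the two workarounds could look roughly like the sketch below. Treat it as an illustration under assumptions, not as tiktoken's documented interface: the comment above does not preserve which `os.environ` key to set, so `TIKTOKEN_CACHE_DIR` is my assumption about the variable the loader reads; `use_cache_dir` and `load_ranks` are hypothetical helper names; and the `<base64 token> <rank>` line layout is an assumption about the `.tiktoken` file format.

```python
import base64
import os
from pathlib import Path


def use_cache_dir(path):
    """Workaround 2: point tiktoken's cache at `path` (hypothetical helper).

    Call this before the first get_encoding call so the setting is
    already in place when the encoding file is looked up.
    """
    cache = Path(path).expanduser().resolve()
    cache.mkdir(parents=True, exist_ok=True)
    # Assumed variable name; the comment above does not spell it out.
    os.environ["TIKTOKEN_CACHE_DIR"] = str(cache)
    return cache


def load_ranks(path):
    """Workaround 1: parse a local encoding file with your own loading logic.

    Assumes every non-empty line is "<base64 token> <integer rank>".
    Returns a dict mapping token bytes to rank.
    """
    ranks = {}
    for line in Path(path).read_bytes().splitlines():
        if line:
            token, rank = line.split()
            ranks[base64.b64decode(token)] = int(rank)
    return ranks
```

With tiktoken installed, calling `use_cache_dir(...)` and then `tiktoken.get_encoding(...)` covers the second workaround. For the first, the dict from `load_ranks` would be passed as the mergeable ranks when constructing an `Encoding` yourself; the split pattern and special tokens still have to match the encoding you are loading, and they are not shown here.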
Thanks for the feedback on documentation. There aren't really dedicated docs for tiktoken right now. When I get time to add some, I'll make sure to mention this somewhere.