Comments (6)
Which version are you using?
from scrapegraph-ai.
This is on the refactoring-tokenization branch. :)
It looks like Ollama does not provide a tokenization endpoint yet, and (perhaps because of this) the LangChain implementation of get_num_tokens is incomplete; see ollama/ollama#1716.
I think we should leave this issue open until Ollama adds a tokenization endpoint. Hopefully that won't take too long, as there is already an open PR for it.
In the meantime I will put in a PR to update the branch so that it uses the LangChain implementation as an interim solution, which should improve once Ollama itself is improved.
I did come across an alternative, which is to use the Hugging Face package to download the tokenizer directly. However, you need to know the Hugging Face model ID and, for gated models (including Llama 3.1), pass a Hugging Face API key. I suspect we don't want to add that complexity here if a better solution is not too far away.
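To illustrate the interim approach described above, here is a minimal sketch, not the code from the PR. It assumes the model object exposes LangChain's `get_num_tokens(text)` method and falls back to a rough estimate if that call is unavailable or fails. The helper name `count_tokens` and the 4-characters-per-token heuristic are assumptions made for illustration only.

```python
import math
from typing import Any, Optional


def count_tokens(text: str, llm: Optional[Any] = None) -> int:
    """Count tokens with llm.get_num_tokens if possible, else estimate.

    `llm` is expected to be a LangChain model object; any object with a
    callable `get_num_tokens(text)` attribute works (duck typing).
    """
    counter = getattr(llm, "get_num_tokens", None)
    if callable(counter):
        try:
            return int(counter(text))
        except Exception:
            # The model-specific counter failed (for example, a missing
            # optional dependency), so fall through to the estimate.
            pass

    # Assumption: roughly 4 characters per token. This is only a coarse
    # approximation and is not tied to any particular tokenizer.
    if not text:
        return 0
    return max(1, math.ceil(len(text) / 4))
```

With this shape, a proper Ollama token count could later replace the estimate without changing the callers.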
I've added the tokenization code for Ollama and Mistral. I noticed that the new chunking code in pre/beta calls the tokenizer for every word. For Mistral that takes a few seconds per word, so on long webpages it can run for a very long time. I have therefore switched it back to using semchunk, which seems to make only a few calls to the tokenizer. Let me know if another approach is preferred.
Also, for Ollama token counting to be fully correct, we depend on Ollama implementing token counting endpoints, as per the comment above. Hopefully, once that is done, LangChain will implement it within their existing API and no further changes will be needed in Scrapegraph.
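The cost difference between per-word tokenizer calls and chunk-level calls can be sketched as follows. This is a toy comparison under stated assumptions, not semchunk's actual algorithm and not the pre/beta code: `CallCounter`, `chunk_per_word`, and `chunk_by_halving` are hypothetical names, and the token counter used in the usage lines simply counts whitespace-separated words.

```python
from typing import Callable, List


class CallCounter:
    """Wrap a token-counting function and record how often it is called."""

    def __init__(self, fn: Callable[[str], int]) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, text: str) -> int:
        self.calls += 1
        return self.fn(text)


def chunk_per_word(text: str, chunk_size: int,
                   count_tokens: Callable[[str], int]) -> List[str]:
    """Naive chunking: one tokenizer call for every word."""
    chunks, current, used = [], [], 0
    for word in text.split():
        n = count_tokens(word)
        if current and used + n > chunk_size:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_by_halving(text: str, chunk_size: int,
                     count_tokens: Callable[[str], int]) -> List[str]:
    """Count whole spans and split in half only when a span is too large."""

    def split(words: List[str]) -> List[str]:
        if not words:
            return []
        joined = " ".join(words)
        # A single word cannot be split further, so skip the tokenizer call.
        if len(words) == 1 or count_tokens(joined) <= chunk_size:
            return [joined]
        mid = len(words) // 2
        return split(words[:mid]) + split(words[mid:])

    return split(text.split())


# Usage: compare how many tokenizer calls each strategy makes.
sample = " ".join(f"w{i}" for i in range(1000))
per_word = CallCounter(lambda t: len(t.split()))
halving = CallCounter(lambda t: len(t.split()))
chunk_per_word(sample, 200, per_word)
chunk_by_halving(sample, 200, halving)
print(per_word.calls, halving.calls)
```

The halving variant produces less tightly packed chunks, but the number of tokenizer calls grows with the number of chunks rather than the number of words, which matters when each call takes seconds.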
🎉 This issue has been resolved in version 1.19.0-beta.7 🎉
The release is available on:
- GitHub release (v1.19.0-beta.7)
Your semantic-release bot 📦🚀
Hi, please update to the new beta.
🎉 This issue has been resolved in version 1.20.0-beta.1 🎉
The release is available on:
- GitHub release (v1.20.0-beta.1)
Your semantic-release bot 📦🚀