Comments (6)

VinciGit00 avatar VinciGit00 commented on September 14, 2024

Which version are you using?

from scrapegraph-ai.

tm-robinson avatar tm-robinson commented on September 14, 2024

This is on the refactoring-tokenization branch. :)

It looks like Ollama does not yet provide a tokenization endpoint and, perhaps because of this, the LangChain implementation of get_num_tokens is incomplete. See ollama/ollama#1716.

I think we should leave this issue open until Ollama adds a tokenization endpoint. Hopefully that won't take too long, as there is already an open PR for it.

In the meantime, I will open a PR to update the branch so that it uses the LangChain implementation as an interim solution. That implementation should hopefully improve once Ollama adds the endpoint.
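To illustrate the interim approach, here is a minimal sketch of delegating token counting to whatever the model object exposes. It assumes the model is a LangChain-style object with a get_num_tokens(text) method; the helper name count_tokens, the caught exception types, and the four-characters-per-token fallback are assumptions for illustration, not code from the branch.

```python
from typing import Any


def count_tokens(model: Any, text: str) -> int:
    """Return the model's own token count, or a rough estimate if that fails."""
    get_num_tokens = getattr(model, "get_num_tokens", None)
    if callable(get_num_tokens):
        try:
            return int(get_num_tokens(text))
        except (ImportError, NotImplementedError):
            # Assumption: an unavailable tokenizer surfaces as one of these errors.
            pass
    # Assumption: roughly 4 characters per token, a crude heuristic only.
    return max(1, len(text) // 4) if text else 0
```

Because the count comes from the model object itself, any later improvement to its get_num_tokens would be picked up without changing the caller.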

I did come across an alternative: using the Hugging Face package to download the tokenizer directly. However, you need to know the Hugging Face model ID and, for gated models (including Llama 3.1), pass a Hugging Face API key in order to use it. I suspect we don't want to add that complexity here if a better solution is not too far away.
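For reference, the Hugging Face alternative could look roughly like the sketch below. It assumes the transformers package is installed and that the caller already knows the matching Hub model ID; the function name count_tokens_with_hf and its parameters are hypothetical and not part of Scrapegraph.

```python
from typing import Optional


def count_tokens_with_hf(text: str, model_id: str, hf_token: Optional[str] = None) -> int:
    """Count tokens by loading the model's tokenizer from the Hugging Face Hub."""
    # Imported lazily so this module still loads when transformers is absent.
    from transformers import AutoTokenizer

    # hf_token is only needed for gated repositories; None works for public ones.
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
    # encode() adds special tokens by default, so they are included in the count.
    return len(tokenizer.encode(text))
```

This shows the extra configuration mentioned above: every Ollama model name would need to be mapped to a Hub model ID, and gated models would also need a token.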

tm-robinson avatar tm-robinson commented on September 14, 2024

I've added the tokenization code for Ollama and Mistral. I noticed that the new chunking code in pre/beta calls the tokenizer for every word. For Mistral, each call takes a few seconds, so on long webpages it can run for a very long time. I have therefore switched it back to using semchunk, which only seems to make a few calls to the tokenizer. Let me know if another approach is preferred.
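To make the cost difference concrete, here is a small self-contained sketch that counts tokenizer calls for two chunking strategies. It is not semchunk's actual algorithm: chunk_by_halving is a simplified stand-in that counts tokens on whole spans and splits only when a span is too large, and CountingTokenizer is a hypothetical whitespace tokenizer used purely to count calls.

```python
from typing import Callable, List


class CountingTokenizer:
    """Hypothetical stand-in for a slow tokenizer; records how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text: str) -> int:
        self.calls += 1
        return len(text.split())  # assumption: one token per whitespace-separated word


def chunk_per_word(text: str, chunk_size: int, count_tokens: Callable[[str], int]) -> List[str]:
    """Build chunks by tokenizing every word separately (one call per word)."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for word in text.split():
        n = count_tokens(word)
        if current and used + n > chunk_size:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_by_halving(text: str, chunk_size: int, count_tokens: Callable[[str], int]) -> List[str]:
    """Count tokens on whole spans and split a span in half only if it is too large."""
    words = text.split()
    if not words:
        return []
    # A single word cannot be split further, so it is returned as-is.
    if len(words) == 1 or count_tokens(text) <= chunk_size:
        return [" ".join(words)]
    mid = len(words) // 2
    left = " ".join(words[:mid])
    right = " ".join(words[mid:])
    return (chunk_by_halving(left, chunk_size, count_tokens)
            + chunk_by_halving(right, chunk_size, count_tokens))
```

On a 1000-word input with a chunk size of 100, the per-word version makes 1000 tokenizer calls, while the halving version makes 31. With a tokenizer that takes seconds per call, that gap dominates the running time.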

Also, for Ollama token counting to be fully correct, we depend on Ollama implementing token counting endpoints, as per the comment above. Hopefully, once that is done, LangChain will implement it within their existing API, so no further changes will be needed in Scrapegraph.

github-actions avatar github-actions commented on September 14, 2024

🎉 This issue has been resolved in version 1.19.0-beta.7 🎉

The release is available on:

Your semantic-release bot 📦🚀

VinciGit00 avatar VinciGit00 commented on September 14, 2024

Hi,
please update to the new beta.

github-actions avatar github-actions commented on September 14, 2024

🎉 This issue has been resolved in version 1.20.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀
