
Comments (11)

Ananderz avatar Ananderz commented on June 11, 2024

I am getting the exact same problem as you with the CUDA extension not installed.

It is also saying "Please use the tie_weights method before using the infer_auto_device_map function."

GPTQ is much slower than GGML for me as well.

from chatdocs.

Ciaranwuk avatar Ciaranwuk commented on June 11, 2024

GPTQ is much slower than GGML for me as well.

Have you checked that your model is small enough to fit on your GPU and run efficiently? I did find it sped up, just not very much. The only time I found GPTQ slower was when I was running a 7 GB (13B-parameter) model on a 12 GB card, because the VRAM was being maxed out.
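As a rough rule of thumb (my own back-of-the-envelope arithmetic, not anything from the repo): the model file has to fit in VRAM with headroom left over for the CUDA context, the KV cache, and activations, and that headroom grows with context length. A tiny sketch of the check:

```python
def fits_comfortably(model_file_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check: quantized weights plus runtime overhead must fit in VRAM.

    overhead_gb is a guess covering the CUDA context, KV cache and
    activations; the real number depends on context length and batch size.
    """
    return model_file_gb + overhead_gb < vram_gb

# The case above: a 7 GB (13B, 4-bit) model on a 12 GB card
print(fits_comfortably(7.0, 12.0))                   # True, but with little margin
print(fits_comfortably(7.0, 12.0, overhead_gb=5.0))  # False: a long context can push past the limit
```

With a long prompt or big batch the overhead alone can eat the remaining 5 GB, which matches the "RAM being maxed out" symptom described above.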

Ananderz avatar Ananderz commented on June 11, 2024

@Ciaranwuk yup!

I am using wiz-vic 7b uncensored GGML with an RTX 3060 with 12 GB VRAM.

I tried the same model, wiz-vic 7b uncensored GPTQ, and it was probably around 4 times slower.

Maybe I don't have the correct settings for GPTQ. I know how to optimize GGML models with batch size, context length, etc., but I don't know how to tune GPTQ models for my card.

I also have NOT figured out how to stream the text generation with GPTQ; it gives me the whole reply in one chunk!

Got suggestions?

My GGML prompting on wiz-vic 7b is lightning fast; it responds in less than a second.
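For reference, the GGML knobs mentioned above live in the chatdocs config file. The key names below are from memory of the chatdocs README and the ctransformers config options, so double-check them there before copying; the model file name is just an illustration:

```yaml
# chatdocs.yml (sketch; verify key names against the chatdocs README)
ctransformers:
  model: TheBloke/Wizard-Vicuna-7B-Uncensored-GGML
  model_file: Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
  model_type: llama
  config:
    context_length: 2048
    batch_size: 256
    gpu_layers: 50   # how many layers to offload to the GPU
```

There is no equivalent per-card tuning section for GPTQ in this setup, which is part of why the GPTQ path feels harder to optimize.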

Ciaranwuk avatar Ciaranwuk commented on June 11, 2024

I eventually managed to get the speed where I was expecting. It turned out I had two versions of CUDA (still not sure from which packages) running at the same time. I had to update nvcc to match the PyTorch installation (11.8), which I got off the PyTorch website. The bottom of issue #21 (of this repo) has a good step-by-step on the setup.
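In other words, the fix boils down to making the toolkit version from `nvcc --version` agree with the CUDA build PyTorch reports via `torch.version.cuda`. Those two commands are real; the comparison helper below is my own sketch for anyone scripting the check:

```python
def cuda_versions_match(nvcc_version: str, torch_cuda_version: str) -> bool:
    """Compare major.minor of the CUDA toolkit (as printed by `nvcc --version`,
    e.g. "11.8") with the CUDA build PyTorch was compiled against
    (`python -c "import torch; print(torch.version.cuda)"`).
    """
    def major_minor(version: str) -> tuple:
        parts = version.split(".")
        return (int(parts[0]), int(parts[1]))

    return major_minor(nvcc_version) == major_minor(torch_cuda_version)

print(cuda_versions_match("11.8", "11.8"))  # matched setup -> True
print(cuda_versions_match("12.1", "11.8"))  # the mismatch described above -> False
```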

Very surprised you're getting GGML to run that fast. Have you checked that it is actually drawing from the database? I've found that if the database doesn't exist the models run waaaaaay faster, but obviously don't read the documents.

marella avatar marella commented on June 11, 2024

When I run chatdocs ui command it raises a message "CUDA extension not installed"

If you are seeing this message then it will run very slow. Try installing a prebuilt binary from their releases page:

pip install auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl

I also have NOT figured out how to stream the text generation with GPTQ; it gives me the whole reply in one chunk!

Only ggml (ctransformers) models support streaming.
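For anyone looking for the loop shape: a ctransformers model called with `stream=True` yields text pieces one at a time. The sketch below uses a stub generator in place of a real model (loading an actual GGML file is environment-specific), but the consuming loop is the pattern in question:

```python
from typing import Iterator

def fake_llm(prompt: str, stream: bool = True) -> Iterator[str]:
    # Stub standing in for a ctransformers model instance, whose call
    # with stream=True yields the reply piece by piece instead of one chunk.
    for piece in ["Hello", ", ", "world", "!"]:
        yield piece

# Streaming consumption: print each piece as it arrives
reply = ""
for token in fake_llm("Hi", stream=True):
    print(token, end="", flush=True)
    reply += token
print()
```

Swap `fake_llm` for the loaded model object and the loop stays the same; GPTQ backends here return the full string at once, so there is nothing to iterate over.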

Ananderz avatar Ananderz commented on June 11, 2024

@Ciaranwuk it is drawing from the database. It's lightning fast with GGML.

@marella thanks for clarification about the streaming! I will probably stick with GGML then! :) the 7b models are so fast. I am trying to find a way to make the 13b models as fast because I have 12GB of VRAM. This is why I have tried GPTQ

abhishekrai43 avatar abhishekrai43 commented on June 11, 2024

@Ananderz 7b models are so fast... Can you please share your code? I am stuck here with question answering. I had it set up on a different VM and it was working perfectly, but that VM was gone before I could save my work. I am using Windows Server 2022 and have the same "CUDA extension not installed" message. I need a way to get fast question answering over documents, and I need to do it fast. If there is some other script I can use, please share. I have CUDA, the GPU, etc. set up and available. I was using it this way earlier: https://stackoverflow.com/questions/76553771/langchain-prints-context-before-question-and-answer. Any variation of the same where I can use the fastest model (GGML/GPTQ doesn't matter) that uses the GPU is all I need.

abhishekrai43 avatar abhishekrai43 commented on June 11, 2024

When I run chatdocs ui command it raises a message "CUDA extension not installed"

If you are seeing this message then it will run very slow. Try installing a prebuilt binary from their releases page:

pip install auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl

I also have NOT figured out how to stream the text generation with GPTQ; it gives me the whole reply in one chunk!

Only ggml (ctransformers) models support streaming.

Can this not be installed on Windows Server with Python 3.11? I checked the releases; none match. :(
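The reason none match: wheel filenames encode the target interpreter (PEP 427), and the `cp310` in the filename above means CPython 3.10, so it cannot be installed on Python 3.11; that would need a `cp311` build. A quick parse (my own helper, not part of any tool) shows where the tag sits:

```python
def wheel_python_tag(wheel_name: str) -> str:
    """Extract the Python tag from a wheel filename, which follows
    name-version-pythontag-abitag-platformtag.whl (PEP 427)."""
    stem = wheel_name[: -len(".whl")]
    parts = stem.split("-")
    return parts[-3]  # the Python tag, e.g. "cp310"

tag = wheel_python_tag("auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl")
print(tag)  # cp310 -> built for CPython 3.10, not 3.11
```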

Ciaranwuk avatar Ciaranwuk commented on June 11, 2024

@abhishekrai43 if you had it working before, you probably just need to create a new virtual env and reinstall. I also found that my CUDA download (that marella mentioned higher up) needed to match my nvcc installation, and that I needed to restart my PC after all that. Once I had everything running on CUDA 11.8 and had restarted, the "CUDA extension not installed" message went away.

abhishekrai43 avatar abhishekrai43 commented on June 11, 2024

@Ciaranwuk Thanks for this. Will try

Ciaranwuk avatar Ciaranwuk commented on June 11, 2024

I'm closing this now, as I managed to get it working once I got my environment set up right.
