
Comments (11)

Ananderz avatar Ananderz commented on June 11, 2024

I am getting the exact same problem as you with the CUDA extension not installed.

It is also saying "Please use the tie_weights method before using the infer_auto_device_map function."

GPTQ is much slower than GGML for me as well.

from chatdocs.

Ciaranwuk avatar Ciaranwuk commented on June 11, 2024

GPTQ is much slower than GGML for me as well.

Have you checked that your model is small enough to fit on your GPU and run efficiently? I did find it sped up, just not very much. The only time I found GPTQ slower was when I was running a 7 GB (13B-parameter) model on a 12 GB card, because the VRAM was being maxed out.
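As a rough rule of thumb (my own back-of-the-envelope arithmetic, not anything from the repo): the model file has to fit in VRAM with headroom left over for the CUDA context, the KV cache, and activations, and that headroom grows with context length. A tiny sketch of the check:

```python
def fits_comfortably(model_file_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check: quantized weights plus runtime overhead must fit in VRAM.

    overhead_gb is a guess covering the CUDA context, KV cache and
    activations; the real number depends on context length and batch size.
    """
    return model_file_gb + overhead_gb < vram_gb

# The case above: a 7 GB (13B, 4-bit) model on a 12 GB card
print(fits_comfortably(7.0, 12.0))                   # True, but with little margin
print(fits_comfortably(7.0, 12.0, overhead_gb=5.0))  # False: a long context can push past the limit
```

With a long prompt or big batch the overhead alone can eat the remaining 5 GB, which matches the "RAM being maxed out" symptom described above.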

Ananderz avatar Ananderz commented on June 11, 2024

@Ciaranwuk yup!

I am using wiz-vic 7b uncensored GGML with an RTX 3060 with 12 GB VRAM.

I tried the same model, wiz-vic 7b uncensored GPTQ, and it was probably around 4 times slower.

Maybe I don't have the correct settings for GPTQ. I know how to optimize GGML models with batch size, context length, etc., but I don't know how to tune GPTQ models for my card.

I also have NOT figured out how to stream the text generation with GPTQ; it gives me the whole reply in one chunk!

Got suggestions?

My GGML prompting on wiz-vic 7b is lightning fast; it responds in less than a second.
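For reference, the GGML knobs mentioned above live in the chatdocs config file. The key names below are from memory of the chatdocs README and the ctransformers config options, so double-check them there before copying; the model file name is just an illustration:

```yaml
# chatdocs.yml (sketch; verify key names against the chatdocs README)
ctransformers:
  model: TheBloke/Wizard-Vicuna-7B-Uncensored-GGML
  model_file: Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
  model_type: llama
  config:
    context_length: 2048
    batch_size: 256
    gpu_layers: 50   # how many layers to offload to the GPU
```

There is no equivalent per-card tuning section for GPTQ in this setup, which is part of why the GPTQ path feels harder to optimize.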

Ciaranwuk avatar Ciaranwuk commented on June 11, 2024

I eventually managed to get the speed where I was expecting. It turned out I had two versions of CUDA (still not sure from which packages) running at the same time. I had to update nvcc to match the PyTorch installation (11.8), which I got off the PyTorch website. The bottom of issue #21 (of this repo) has a good step-by-step on the setup.
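In other words, the fix boils down to making the toolkit version from `nvcc --version` agree with the CUDA build PyTorch reports via `torch.version.cuda`. Those two commands are real; the comparison helper below is my own sketch for anyone scripting the check:

```python
def cuda_versions_match(nvcc_version: str, torch_cuda_version: str) -> bool:
    """Compare major.minor of the CUDA toolkit (as printed by `nvcc --version`,
    e.g. "11.8") with the CUDA build PyTorch was compiled against
    (`python -c "import torch; print(torch.version.cuda)"`).
    """
    def major_minor(version: str) -> tuple:
        parts = version.split(".")
        return (int(parts[0]), int(parts[1]))

    return major_minor(nvcc_version) == major_minor(torch_cuda_version)

print(cuda_versions_match("11.8", "11.8"))  # matched setup -> True
print(cuda_versions_match("12.1", "11.8"))  # the mismatch described above -> False
```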

Very surprised you're getting GGML to run that fast. Have you checked that it is actually drawing from the database? I've found that if the database doesn't exist the models run waaaaaay faster, but obviously don't read the documents.

marella avatar marella commented on June 11, 2024

When I run chatdocs ui command it raises a message "CUDA extension not installed"

If you are seeing this message then it will run very slow. Try installing a prebuilt binary from their releases page:

pip install auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl

I also have NOT figured out how to stream the text generation with GPTQ; it gives me the whole reply in one chunk!

Only ggml (ctransformers) models support streaming.
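For anyone looking for the loop shape: a ctransformers model called with `stream=True` yields text pieces one at a time. The sketch below uses a stub generator in place of a real model (loading an actual GGML file is environment-specific), but the consuming loop is the pattern in question:

```python
from typing import Iterator

def fake_llm(prompt: str, stream: bool = True) -> Iterator[str]:
    # Stub standing in for a ctransformers model instance, whose call
    # with stream=True yields the reply piece by piece instead of one chunk.
    for piece in ["Hello", ", ", "world", "!"]:
        yield piece

# Streaming consumption: print each piece as it arrives
reply = ""
for token in fake_llm("Hi", stream=True):
    print(token, end="", flush=True)
    reply += token
print()
```

Swap `fake_llm` for the loaded model object and the loop stays the same; GPTQ backends here return the full string at once, so there is nothing to iterate over.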

Ananderz avatar Ananderz commented on June 11, 2024

@Ciaranwuk it is drawing from the database. It's lightning fast with GGML.

@marella thanks for clarification about the streaming! I will probably stick with GGML then! :) the 7b models are so fast. I am trying to find a way to make the 13b models as fast because I have 12GB of VRAM. This is why I have tried GPTQ

abhishekrai43 avatar abhishekrai43 commented on June 11, 2024

@Ananderz 7b models are so fast... Can you please share your code? I am stuck here with question answering. I had it set up on a different VM and it was working perfectly, but that VM was gone before I could save my work. I am using Windows Server 2022 and have the same "CUDA extension not installed" message. I need a way to get fast question answering over documents, and I need to do it fast. If there is some other script I can use, please share. I have CUDA, the GPU, etc. set up and available. I was using it this way earlier: https://stackoverflow.com/questions/76553771/langchain-prints-context-before-question-and-answer. Any variation of the same where I can use the fastest model (GGML/GPTQ doesn't matter) that uses the GPU is all I need.

abhishekrai43 avatar abhishekrai43 commented on June 11, 2024

When I run chatdocs ui command it raises a message "CUDA extension not installed"

If you are seeing this message then it will run very slow. Try installing a prebuilt binary from their releases page:

pip install auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl

I also have NOT figured out how to stream the text generation with GPTQ; it gives me the whole reply in one chunk!

Only ggml (ctransformers) models support streaming.

Can this not be installed on Windows Server with Python 3.11? I checked the releases; none match. :(
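The reason none match: wheel filenames encode the target interpreter (PEP 427), and the `cp310` in the filename above means CPython 3.10, so it cannot be installed on Python 3.11; that would need a `cp311` build. A quick parse (my own helper, not part of any tool) shows where the tag sits:

```python
def wheel_python_tag(wheel_name: str) -> str:
    """Extract the Python tag from a wheel filename, which follows
    name-version-pythontag-abitag-platformtag.whl (PEP 427)."""
    stem = wheel_name[: -len(".whl")]
    parts = stem.split("-")
    return parts[-3]  # the Python tag, e.g. "cp310"

tag = wheel_python_tag("auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl")
print(tag)  # cp310 -> built for CPython 3.10, not 3.11
```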

Ciaranwuk avatar Ciaranwuk commented on June 11, 2024

@abhishekrai43 if you had it working before, you probably just need to create a new virtual env and reinstall. I also found that my CUDA download (that marella mentioned higher up) needed to match my nvcc installation, and that I needed to restart my PC after all that. Once I had everything running on CUDA 11.8 and had restarted, the "CUDA extension not installed" message went away.

abhishekrai43 avatar abhishekrai43 commented on June 11, 2024

@Ciaranwuk Thanks for this. Will try

Ciaranwuk avatar Ciaranwuk commented on June 11, 2024

I'm closing this now, as I managed to get it working once I got my environment set up right.
