Comments (11)
I am getting the exact same problem as you with the CUDA extension not installed.
It is also saying: please use the "tie_weights" method before using the "infer_auto_device" function.
GPTQ is much slower than GGML for me as well.
from chatdocs.
> GPTQ is much slower than GGML for me as well.

Have you checked that your model is small enough to fit on your GPU and run efficiently? I did find it sped up, just not by much. The only time I found GPTQ slower was when I was running a 7 GB (13B-parameter) model on a 12 GB card, because VRAM was maxed out.
@Ciaranwuk yup!
I am using Wizard-Vicuna 7B Uncensored GGML on an RTX 3060 with 12 GB of VRAM.
I tried the same model as GPTQ (Wizard-Vicuna 7B Uncensored GPTQ) and it was roughly 4 times slower.
Maybe I don't have the correct settings for GPTQ. I know how to optimize GGML models with batch size, context length, etc., but I don't know how to tune GPTQ models for my card.
I also have NOT figured out how to stream the text generation with GPTQ; it gives me the reply in one chunk!
Got suggestions?
My GGML prompting on Wizard-Vicuna 7B is lightning fast; it responds in less than a second.
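For reference, the GGML knobs mentioned above (context length, GPU offload) go in chatdocs.yml. A sketch assuming the ctransformers config keys and an example model id; verify the exact keys against the chatdocs README for your version:

```yaml
ctransformers:
  model: TheBloke/Wizard-Vicuna-7B-Uncensored-GGML       # example model id
  model_file: Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
  config:
    context_length: 2048
    gpu_layers: 50    # number of layers offloaded to the GPU
```

Raising `gpu_layers` until VRAM is nearly full is the usual way to trade GPU memory for speed with GGML models.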
Eventually managed to get the speed where I was expecting. It turned out I had two versions of CUDA installed at the same time (still not sure from which packages). I had to update nvcc to match the PyTorch installation (11.8), which I got off the PyTorch website. The bottom of issue #21 (in this repo) has a good step-by-step on the setup.
Very surprised you're getting GGML to run that fast. Have you checked that it is actually drawing from the database? I've found that if the database doesn't exist, the models run way faster, but obviously they don't read the documents.
> When I run the chatdocs ui command it raises a message "CUDA extension not installed"

If you are seeing this message then it will run very slowly. Try installing a prebuilt binary from their releases page:

pip install auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl

> I also have NOT figured out how to stream the text generation with GPTQ; it gives me the reply in one chunk!
Only ggml (ctransformers) models support streaming.
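Since only ctransformers models stream, the consumption pattern looks like the sketch below. `fake_llm` is a stub generator standing in for a real model so this runs anywhere; with ctransformers the real call would be `llm(prompt, stream=True)` on a model loaded via `AutoModelForCausalLM.from_pretrained`, which yields tokens one at a time:

```python
# Sketch of consuming a streamed reply, ctransformers-style.
# fake_llm is a stub in place of a real model; tokens here are made up.
def fake_llm(prompt, stream=True):
    for token in ["GG", "ML ", "is ", "fast"]:
        yield token

chunks = []
for token in fake_llm("Why is GGML fast?", stream=True):
    print(token, end="", flush=True)  # tokens appear as they are generated
    chunks.append(token)
print()
reply = "".join(chunks)
```

With GPTQ (transformers-based) backends in chatdocs there is no such iterator, which is why the reply arrives in one chunk.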
@Ciaranwuk it is drawing from the database. It's lightning fast with GGML.
@marella thanks for the clarification about streaming! I will probably stick with GGML then :) The 7B models are so fast. I am trying to find a way to make the 13B models as fast, because I have 12 GB of VRAM. This is why I tried GPTQ.
@Ananderz 7B models are so fast... Can you please share your code? I am stuck here with question answering. I had it set up on a different VM and it was working perfectly, but before I could save my work that VM was gone. I am using Windows Server 2022 and get the same message that the CUDA extension is not installed. I need a way to get fast question answering over documents, and I need it fast. If there is some other script I can use, please share. I have CUDA, the GPU, etc. set up and available. I was using it this way earlier: https://stackoverflow.com/questions/76553771/langchain-prints-context-before-question-and-answer. Any variation of the same where I can use the fastest model (GGML/GPTQ, doesn't matter) that uses the GPU is all I need.
> When I run the chatdocs ui command it raises a message "CUDA extension not installed"
>
> If you are seeing this message then it will run very slowly. Try installing a prebuilt binary from their releases page:
>
> pip install auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl
>
> I also have NOT figured out how to stream the text generation with GPTQ; it gives me the reply in one chunk!
>
> Only ggml (ctransformers) models support streaming.
This cannot be installed on Windows Server with Python 3.11? I checked the releases; none match. :(
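The wheel filename itself explains the mismatch: its tags encode the build requirements, and `cp310` means CPython 3.10, so it will not install on Python 3.11. A small sketch splitting the name according to the standard wheel naming convention:

```python
# Wheel filenames follow: name-version-pythontag-abitag-platformtag.whl
wheel = "auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl"
name, version, python_tag, abi_tag, platform_tag = wheel.removesuffix(".whl").split("-")

print(version)       # 0.2.2+cu118 -> built against CUDA 11.8
print(python_tag)    # cp310 -> requires CPython 3.10, hence the 3.11 mismatch
print(platform_tag)  # win_amd64 -> 64-bit Windows
```

So the options are a wheel whose tag matches the running interpreter (e.g. a cp311 build, if one exists), or installing Python 3.10 in the virtual environment.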
@abhishekrai43 if you had it working before, you probably just need to create a new virtual env and reinstall. I also found that my CUDA download (the one marella mentioned higher up) needed to match my nvcc installation, and that I needed to restart my PC after all that. Once I had everything running on CUDA 11.8 and had restarted, the "CUDA extension not installed" message went away.
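The version match described above can be checked programmatically. A minimal sketch comparing hardcoded example strings; in practice you would read `torch.version.cuda` and capture the output of `nvcc --version` (e.g. via `subprocess.run`):

```python
import re

# In practice: torch_cuda = torch.version.cuda
torch_cuda = "11.8"
# In practice: nvcc_output = output of `nvcc --version`
nvcc_output = "Cuda compilation tools, release 11.8, V11.8.89"

# nvcc reports its toolkit version after the word "release"
nvcc_version = re.search(r"release (\d+\.\d+)", nvcc_output).group(1)
match = (torch_cuda == nvcc_version)
print("nvcc:", nvcc_version, "| torch:", torch_cuda, "| match:", match)
```

If the two differ, extensions like auto-gptq built against one toolkit can fail to load under the other, producing the "CUDA extension not installed" message.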
@Ciaranwuk Thanks for this. Will try
I'm closing this now, as I managed to get it working once I got my environment set up right.