Comments (2)
Please refer to https://github.com/PygmalionAI/aphrodite-engine/wiki/8.-Quantization#gguf-dev-branch for how to use Llama 3 GGUF models (the Llama 3 tokenizer isn't a LlamaTokenizer).
The performance issue, on the other hand, can't be solved easily, as for most quants there isn't an fp32 kernel.
from aphrodite-engine.
Thanks for replying. I definitely missed that part while banging my head trying different things to make this thing not go so slow. Does the GGUF kernel run in FP32? An FP32 kernel does exist in llama.cpp, but as I understand it, aphrodite converts GGUF to safetensors first anyway?
EDIT: OK, so I finally figured it out. For anyone trying to run aphrodite on Pascal non-GP100 GPUs, here is how:
- Create a miniconda environment:
conda create -n aphrodite python=3.11
- Install CUDA:
conda install -y -c "nvidia/label/cuda-12.1.1" cuda
- Install pytorch:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
- Install aphrodite from the root of the cloned repository:
pip install -e .
- Run aphrodite with either exl2 (slow as balls) or GGUF (fast)
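If you are not sure whether your card counts as "Pascal non-GP100", here is a minimal sketch for checking. It assumes that GP100 reports compute capability 6.0 and that the other Pascal chips (such as the GP102 in the Titan X Pascal) report 6.1 or 6.2. The helper names `is_pascal_non_gp100` and `local_capabilities` are made up for illustration and are not part of aphrodite. The query relies on `torch.cuda.get_device_capability` and simply returns nothing if PyTorch or CUDA is unavailable.

```python
def is_pascal_non_gp100(capability):
    """Return True for compute capability 6.x other than 6.0 (GP100).

    Assumption: Pascal is the only architecture with major version 6.
    """
    major, minor = capability
    return major == 6 and minor != 0


def local_capabilities():
    """Return the (major, minor) capability of each visible CUDA device.

    Returns an empty list when PyTorch is not installed or no CUDA
    device is available, so the script is safe to run anywhere.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]


for index, cap in enumerate(local_capabilities()):
    kind = "Pascal non-GP100" if is_pascal_non_gp100(cap) else "other"
    print(f"GPU {index}: sm_{cap[0]}{cap[1]} ({kind})")
```

Run it inside the conda environment after installing pytorch. A Titan X Pascal should show up as sm_61.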
On Llama 3 8B Q4_K_M, I get about 40 t/s on a single request on a GTX Titan X Pascal 12GB, but performance tanks as soon as there are more than 4 parallel requests. The GPU core seems to be pegged at 100% utilization and choking on matrix multiplications without the help of Tensor Cores. It's OK, but definitely not usable for many parallel requests.
Completed 16 prompts and produced 967 tokens in 30.533 seconds.
Average TPS across all 1 threads: 31.7 - Individual Threads: Min TPS: 31.7, Max TPS: 31.7
Completed 16 prompts and produced 4361 tokens in 61.385 seconds.
Average TPS across all 4 threads: 71.0 - Individual Threads: Min TPS: 15.5, Max TPS: 18.9
Completed 16 prompts and produced 9541 tokens in 484.179 seconds.
Average TPS across all 8 threads: 19.7 - Individual Threads: Min TPS: 2.2, Max TPS: 2.7
Completed 16 prompts and produced 14610 tokens in 544.334 seconds.
Average TPS across all 12 threads: 26.8 - Individual Threads: Min TPS: 1.9, Max TPS: 2.7
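To make the numbers above easier to compare, here is a small sketch that parses the summary lines and recomputes throughput as tokens divided by wall-clock seconds. The log text is copied from the output above. The regular expressions and the `parse_runs` helper are my own and are not part of the benchmark tool, so treat them as assumptions about the log format.

```python
import re

# Summary lines copied verbatim from the benchmark output above.
LOG = """\
Completed 16 prompts and produced 967 tokens in 30.533 seconds.
Average TPS across all 1 threads: 31.7 - Individual Threads: Min TPS: 31.7, Max TPS: 31.7
Completed 16 prompts and produced 4361 tokens in 61.385 seconds.
Average TPS across all 4 threads: 71.0 - Individual Threads: Min TPS: 15.5, Max TPS: 18.9
Completed 16 prompts and produced 9541 tokens in 484.179 seconds.
Average TPS across all 8 threads: 19.7 - Individual Threads: Min TPS: 2.2, Max TPS: 2.7
Completed 16 prompts and produced 14610 tokens in 544.334 seconds.
Average TPS across all 12 threads: 26.8 - Individual Threads: Min TPS: 1.9, Max TPS: 2.7
"""

RUN_RE = re.compile(r"produced (\d+) tokens in ([\d.]+) seconds")
THREADS_RE = re.compile(r"across all (\d+) threads: ([\d.]+)")


def parse_runs(log):
    """Pair each 'Completed' line with the 'Average TPS' line after it."""
    totals = RUN_RE.findall(log)
    averages = THREADS_RE.findall(log)
    runs = []
    for (tokens, seconds), (threads, reported) in zip(totals, averages):
        runs.append({
            "threads": int(threads),
            "tokens": int(tokens),
            "seconds": float(seconds),
            "reported_tps": float(reported),
            # Aggregate throughput: all tokens over total wall-clock time.
            "aggregate_tps": int(tokens) / float(seconds),
        })
    return runs


runs = parse_runs(LOG)
for run in runs:
    per_thread = run["aggregate_tps"] / run["threads"]
    print(
        f'{run["threads"]:>2} threads: '
        f'{run["aggregate_tps"]:.1f} tok/s total, '
        f'{per_thread:.1f} tok/s per thread'
    )
```

The recomputed totals line up with the reported "Average TPS" values, so that figure is the aggregate across all threads rather than a per-request speed. Total throughput peaks at 4 threads and drops sharply at 8 and 12.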