[Usage]: What to set to get acceptable performance on Pascal GPUs? (Non-P100) (aphrodite-engine, closed)

Nero10578 commented on June 16, 2024

Comments (2)

sgsdxzy commented on June 16, 2024

Please refer to https://github.com/PygmalionAI/aphrodite-engine/wiki/8.-Quantization#gguf-dev-branch for how to use Llama 3 GGUF models (the Llama 3 tokenizer isn't a LlamaTokenizer).
The performance issue, on the other hand, can't be solved easily, because for most quants there isn't an FP32 kernel.

Nero10578 commented on June 16, 2024

Thanks for replying. I definitely missed that part while banging my head trying different things to make this thing not run so slowly. Does the GGUF kernel run in FP32? One does exist in llama.cpp, but as I understand it, aphrodite converts GGUF to safetensors first anyway?

EDIT: OK, so I finally figured it out. For anyone trying to run aphrodite on Pascal non-GP100 GPUs, here is how:

1. Create a miniconda environment: conda create -n aphrodite python=3.11
2. Install CUDA: conda install -y -c "nvidia/label/cuda-12.1.1" cuda
3. Install pytorch: pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
4. Install aphrodite (run from the aphrodite-engine source checkout): pip install -e .
5. Run aphrodite with either exl2 (very slow) or GGUF (fast)
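
Before following the steps above, it may help to confirm that the card really is a non-GP100 Pascal part. The sketch below is not from the thread: it assumes the usual mapping in which GP100 (Tesla P100) reports CUDA compute capability 6.0 and the other desktop Pascal chips (GTX 10-series, Titan X Pascal) report 6.1. The helper names `is_non_gp100_pascal` and `describe` are hypothetical; `torch.cuda.get_device_capability` is the only PyTorch call used, and only if PyTorch is installed.

```python
def is_non_gp100_pascal(capability: tuple[int, int]) -> bool:
    """True for compute capability 6.1 (assumed: desktop Pascal other than GP100)."""
    return capability == (6, 1)


def describe(capability: tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into a short note."""
    major, minor = capability
    if is_non_gp100_pascal(capability):
        return f"sm_{major}{minor}: Pascal, non-GP100 (the case discussed in this thread)"
    if capability == (6, 0):
        return f"sm_{major}{minor}: Pascal GP100 (P100)"
    return f"sm_{major}{minor}: not a desktop Pascal GPU"


if __name__ == "__main__":
    # PyTorch is optional here; without it the script only reports that fact.
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        print(describe(torch.cuda.get_device_capability(0)))
    else:
        print("No CUDA device visible to PyTorch")
```

If this reports `sm_61`, the install recipe above is the one that applies.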

On Llama 3 8B Q4_K_M, I can get about 40 t/s on a single request on a GTX Titan X Pascal 12GB, but performance tanks immediately with more than 4 parallel requests. The GPU core does seem to be fully utilized at 100% and choking on matrix multiplications without the help of Tensor Cores. It's OK, but definitely not usable for many parallel requests.

Completed 16 prompts and produced 967 tokens in 30.533 seconds.
Average TPS across all 1 threads: 31.7 - Individual Threads: Min TPS: 31.7, Max TPS: 31.7

Completed 16 prompts and produced 4361 tokens in 61.385 seconds.
Average TPS across all 4 threads: 71.0 - Individual Threads: Min TPS: 15.5, Max TPS: 18.9

Completed 16 prompts and produced 9541 tokens in 484.179 seconds.
Average TPS across all 8 threads: 19.7 - Individual Threads: Min TPS: 2.2, Max TPS: 2.7

Completed 16 prompts and produced 14610 tokens in 544.334 seconds.
Average TPS across all 12 threads: 26.8 - Individual Threads: Min TPS: 1.9, Max TPS: 2.7
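
The "Average TPS across all N threads" figures above appear to be aggregate throughput, that is, total tokens divided by wall-clock seconds. A minimal sketch that re-derives them from the log, assuming that reading; the regular expression and the `summarize` helper are illustrative and not part of the benchmark tool:

```python
import re

# Benchmark output copied verbatim from the post above.
LOG = """\
Completed 16 prompts and produced 967 tokens in 30.533 seconds.
Average TPS across all 1 threads: 31.7 - Individual Threads: Min TPS: 31.7, Max TPS: 31.7

Completed 16 prompts and produced 4361 tokens in 61.385 seconds.
Average TPS across all 4 threads: 71.0 - Individual Threads: Min TPS: 15.5, Max TPS: 18.9

Completed 16 prompts and produced 9541 tokens in 484.179 seconds.
Average TPS across all 8 threads: 19.7 - Individual Threads: Min TPS: 2.2, Max TPS: 2.7

Completed 16 prompts and produced 14610 tokens in 544.334 seconds.
Average TPS across all 12 threads: 26.8 - Individual Threads: Min TPS: 1.9, Max TPS: 2.7
"""

# One match per run: token count, elapsed seconds, thread count, reported TPS.
RUN_RE = re.compile(
    r"produced (?P<tokens>\d+) tokens in (?P<seconds>[\d.]+) seconds\.\s+"
    r"Average TPS across all (?P<threads>\d+) threads: (?P<tps>[\d.]+)"
)


def summarize(log: str) -> list[dict]:
    """Recompute aggregate and mean per-thread tokens/second for each run."""
    rows = []
    for m in RUN_RE.finditer(log):
        threads = int(m["threads"])
        aggregate = int(m["tokens"]) / float(m["seconds"])
        rows.append({
            "threads": threads,
            "aggregate_tps": round(aggregate, 1),
            "reported_tps": float(m["tps"]),
            "per_thread_tps": round(aggregate / threads, 1),
        })
    return rows


if __name__ == "__main__":
    for row in summarize(LOG):
        print(row)
```

Read this way, aggregate throughput peaks at 4 threads and drops sharply at 8 and 12, which matches the observation that more than 4 parallel requests tank performance.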
