Comments (10)
I got further when I loaded a GPTQ model. It turns out you have to specify the quantization explicitly or you get an OOM, which isn't very intuitive. Unfortunately, context still consumes a LOT of memory: even at batch size 1, I don't get how I can't load a GPTQ model with more than 4096 tokens of context.
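For a rough sense of why long contexts are expensive, the KV cache grows linearly with context length. A sketch of the arithmetic, where the architecture numbers (80 layers, 8 grouped-query KV heads, head dim 128, typical for a Llama-2-70B-class model) are assumptions, not taken from this thread:

```python
# Rough KV-cache size estimate for a Llama-2-70B-class model.
# Assumed architecture numbers (not from this thread): 80 layers,
# 8 grouped-query KV heads, head dimension 128, fp16 cache (2 bytes/elem).
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2

# Both K and V are cached, for every layer, for every token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

context = 4096
cache_gib = bytes_per_token * context / 2**30
print(f"{bytes_per_token} bytes/token -> {cache_gib:.2f} GiB at {context} ctx")
# → 327680 bytes/token -> 1.25 GiB at 4096 ctx
```

The cache itself is modest at 4k context; the weights (tens of GiB for a 70B GPTQ model) plus activation scratch space are usually what push an asymmetric multi-GPU setup over the edge.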
Now on all 4 GPUs: determine_num_available_blocks causes a deadlock and the GPUs get stuck.
Removing flash_attn got rid of the deadlock, but now I get 2.5 t/s and each GPU sits at 77% memory utilization. I thought this was supposed to be faster than pipeline parallel?
from aphrodite-engine.
Tensor parallelism by nature doesn't work well with an asymmetric setup. The 2080 Ti is dragging the 3090s down in terms of both VRAM and speed. For optimal performance you really need 4x 3090.
You can add --enable-chunked-prefill to your launch options to save VRAM.
from aphrodite-engine.
It only has 2 GB less than a 3090. Compute-wise, yes, it's a bit slower, but with pure exllama or other engines the hit isn't that bad.
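Even a small VRAM gap matters under tensor parallelism, because the weights and cache are sharded evenly and every rank is limited by the smallest card. A rough sketch, where the per-card figures (a 22 GB 2080 Ti alongside 24 GB 3090s) are assumptions based on the "2 GB less" remark above:

```python
# Effective memory pool under even tensor-parallel sharding: every rank
# can only use as much as the smallest GPU provides.
# Assumed per-card VRAM in GiB (22 GiB 2080 Ti + three 24 GiB 3090s).
vram = [22, 24, 24, 24]

naive_total = sum(vram)            # what adding up nvidia-smi suggests
effective = min(vram) * len(vram)  # what an evenly-sharded TP run can use

print(f"naive total: {naive_total} GiB, effective: {effective} GiB")
# → naive total: 94 GiB, effective: 88 GiB
```

The gap is memory the 3090s physically have but the even sharding cannot touch, which is one reason the mixed setup fits less context than the raw totals suggest.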
When I try chunked prefill I get:
ERROR: File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/triton/language/semantic.py", line 1207, in dot
ERROR: assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
ERROR: File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/triton/language/semantic.py", line 1183, in assert_dtypes_valid
ERROR: assert lhs_dtype == rhs_dtype, f"First input ({lhs_dtype}) and second input ({rhs_dtype}) must have the same dtype!"
ERROR: ^^^^^^^^^^^^^^^^^^^^^^
ERROR: AssertionError: First input (fp16) and second input (uint8) must have the same dtype!
I have tried every combination of --max-num-batched-tokens, --max-model-len, and --kv-cache-dtype fp8, and setting max requests to 1, but no dice.
On 2 GPUs I can only fit 8192.
INFO: Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO: Model = '/mnt/7d815d93-e74c-4d1e-b1da-6d7e1d187a17/models/Midnight-Miqu-70B-v1.0_GPTQ32G'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gptq
INFO: Context Length = 8192
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = fp8
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO: Using FlashAttention backend.
(RayWorkerAphrodite pid=935938) INFO: Using FlashAttention backend.
INFO: Aphrodite is using nccl==2.21.5
(RayWorkerAphrodite pid=935938) INFO: Aphrodite is using nccl==2.21.5
INFO: reading GPU P2P access cache from /home/supermicro/.config/aphrodite/gpu_p2p_access_cache_for_0,2.json
(RayWorkerAphrodite pid=935938) INFO: reading GPU P2P access cache from /home/supermicro/.config/aphrodite/gpu_p2p_access_cache_for_0,2.json
INFO: Model weights loaded. Memory usage: 19.82 GiB x 2 = 39.63 GiB
(RayWorkerAphrodite pid=935938) INFO: Model weights loaded. Memory usage: 19.82 GiB x 2 = 39.63 GiB
INFO: # GPU blocks: 537, # CPU blocks: 3276
INFO: Minimum concurrency: 1.05x
INFO: Maximum sequence length allowed in the cache: 8592
(RayWorkerAphrodite pid=935938) INFO: Maximum sequence length allowed in the cache: 8592
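The cache numbers in the log are internally consistent: the engine allocates the KV cache in fixed-size blocks, and with a block size of 16 tokens (an assumption; the log doesn't print it), 537 GPU blocks give exactly the reported limit:

```python
# Max cacheable sequence length = number of GPU blocks * tokens per block.
# Block size of 16 is assumed (the engine default); it is not in the log.
gpu_blocks, block_size = 537, 16
max_seq_len = gpu_blocks * block_size
print(max_seq_len)  # → 8592, matching "Maximum sequence length" above
```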
That 4-bit cache is really something. I can normally fit 32k with a GPTQ model like this, and 16k with 5-bit EXL2, all in 48 GB.
Throughput is not massively better either:
INFO: Avg prompt throughput: 157.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.9%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 12.8%, CPU KV cache usage: 0.0%
In this case the cards have PCIe 3.0 x16 and NVLink. Perhaps I'm missing a setting to better suit single-batch use, or it can only go fast when the number of batches is >1.
from aphrodite-engine.
At the moment, FP8 can't work with chunked prefill/context shifting. There's some work being done in this branch to address this issue.
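The traceback earlier in the thread is consistent with that: an fp8 KV cache is stored as raw uint8, and a fused prefill kernel that doesn't dequantize before its dot product fails the dtype check. A minimal Python sketch of that kind of guard (the function name mirrors the traceback, but the code is illustrative, not Aphrodite's or Triton's actual implementation):

```python
# Hypothetical sketch of the dtype guard a fused kernel runs before a
# dot product; illustrative only, not Aphrodite/Triton internals.
def assert_dtypes_valid(lhs_dtype: str, rhs_dtype: str) -> None:
    assert lhs_dtype == rhs_dtype, (
        f"First input ({lhs_dtype}) and second input ({rhs_dtype}) "
        "must have the same dtype!"
    )

# Query activations stay fp16, but the fp8 cache is stored as uint8,
# so the guard trips unless the cache is dequantized first.
try:
    assert_dtypes_valid("fp16", "uint8")
except AssertionError as err:
    print(err)
```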
from aphrodite-engine.
I hit a similar bug:
Environment:
4x 3090, CUDA 12.4, Aphrodite 0.5.3, 96 GB of VRAM total, tensor parallel = 4.
When I try to load elinas_Meta-Llama-3-120B-Instruct-4.0bpw-exl2 (61 GB), it runs out of VRAM instantly; it doesn't even attempt to load the model from disk.
But it can load Meta-Llama-3-70B-Instruct-8.0bpw-h8-exl2 just fine, even though that model is bigger at 68 GB.
Both models load fine with exllamav2.
from aphrodite-engine.
You need to specify -q exl2
for EXL2 models if they were quantized with older versions of exllamav2 and don't have a quantization config in config.json.
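That would explain the symptom above: if the engine doesn't recognize the checkpoint as quantized, it presumably plans memory as if the weights were unquantized fp16, and a 120B-parameter model at 2 bytes per parameter is nowhere close to fitting. A rough sketch, where the 120e9 parameter count is an assumption read off the model name:

```python
# If an exl2 checkpoint isn't recognized as quantized, memory may be
# planned as if the weights were fp16 (2 bytes per parameter).
# Parameter count assumed from the model name, not measured.
params = 120e9
fp16_gib = params * 2 / 2**30

total_vram_gib = 96  # 4x 3090 from the report above
print(f"fp16 plan: {fp16_gib:.0f} GiB vs {total_vram_gib} GiB available")
# → fp16 plan: 224 GiB vs 96 GiB available
```

The 70B model works because its config carries the quantization info, so it is sized at its real on-disk footprint rather than the fp16 estimate.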
from aphrodite-engine.
So to compile, I still need to do it on CUDA 11.8? I'm using a CUDA 12.x conda environment and had trouble: the build couldn't find ninja despite it being installed and available on the command line.
from aphrodite-engine.
Related Issues (20)
- [Installation]: Cannot install the library
- [Bug]: Unable to use OpenAI API with an auth key via a web browser due to OPTIONS preflight request returning 401. HOT 1
- [Bug]: HOT 1
- [Usage]: Please provide the environment variable that closes the KoboldAI Lite page.
- [Performance]: Memory Usage Fix for gguf. HOT 3
- [Installation]: ValueError: 17 is not a valid GGMLQuantizationType HOT 21
- [Installation]: Upload Aphrodite v0.5.2 On Pypi.org HOT 3
- [Usage]: What to set to get acceptable performance on Pascal GPUs? (Non-P100) HOT 2
- [Installation]: Installing from source does not work. undefined symbol: _ZN3c104cuda14ExchangeDeviceEa HOT 8
- [Bug]: PermissionError: [Errno 13] Permission denied: '/app/aphrodite-engine/.triton' HOT 3
- [Bug]: LoRA broken when TP>1
- [Bug]: LoRA fails to load HOT 1
- [Feature]: Exllamav2 Q4 cache HOT 2
- [Usage]: Lora Adapter Parameter while inferencing HOT 1
- [Bug]: Flash attention cannot be used on v0.5.3 HOT 7
- [Bug]: GPUExecutor throwing 'TypeError: 'type' object is not subscriptable' on 0.5.3 HOT 2
- [Bug]: torch._dynamo.exc.BackendCompilerFailed with command-r-plus HOT 3
- [Bug]: Cannot load llama-3 gguf based models HOT 1
- [Bug]: Int8 k/v cache calibrate don't work with QWen model?