
Comments (5)

hugoabonizio commented on June 9, 2024

@LaurentMazare I'm sorry to bother, but I just want to ask: is it possible to use the current implementation of quantized models in a multi-GPU setup (like the llama_multiprocess example)? If not, is there any plan to support this feature in the future?

I appreciate your work on pushing forward the CUDA kernels for quantization.


LaurentMazare commented on June 9, 2024

I'm not sure that the technique used for llama-multiprocess would make sense here. The llama-multiprocess version is useful when some tensors have to be shared across different GPUs, but I don't think quantized models would be large enough for that to actually be useful.
If the goal is just to have multiple models that live on different GPUs, then that should be reasonably easy to do even with the current API by creating one device per GPU that you want to target, but maybe you're after something more complex than this?
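A minimal sketch of the "one device per GPU" setup, assuming the `candle_core` crate (not code from the thread); the model-loading step is elided since it depends on which quantized model you use:

```rust
use candle_core::{Device, Result};

fn main() -> Result<()> {
    // One candle Device per GPU ordinal you want to target.
    let devices: Vec<Device> = (0..4)
        .map(Device::new_cuda)
        .collect::<Result<Vec<_>>>()?;

    for (i, dev) in devices.iter().enumerate() {
        // Load an independent copy of the model here, creating its weights
        // with `dev` so they live on that GPU (elided; model-specific).
        let _ = dev;
        println!("GPU {i} ready");
    }
    Ok(())
}
```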


hugoabonizio commented on June 9, 2024

I'm after sharding larger models that wouldn't fit on a single 24GB GPU and could instead be split across, for example, 4 of them. If I'm not mistaken, llama.cpp supports multi-GPU through pipeline parallelism but supported tensor splitting between GPUs before that.
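For context, here is a rough sketch (not from the thread) of what tensor splitting means, using plain `candle_core` ops: one weight matrix is split column-wise across two hypothetical GPUs, each half is multiplied locally, and the partial results are gathered back onto one device; shapes and device ordinals are made up for illustration. In a real tensor-parallel setup the gather/reduce steps would typically go through something like NCCL rather than explicit `to_device` copies.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev0 = Device::new_cuda(0)?;
    let dev1 = Device::new_cuda(1)?;

    // One linear layer's weight, split column-wise between the two GPUs.
    let x = Tensor::randn(0f32, 1f32, (4, 1024), &dev0)?;
    let w = Tensor::randn(0f32, 1f32, (1024, 4096), &Device::Cpu)?;
    let w0 = w.narrow(1, 0, 2048)?.to_device(&dev0)?;
    let w1 = w.narrow(1, 2048, 2048)?.to_device(&dev1)?;

    // Each GPU computes half of the output columns...
    let y0 = x.matmul(&w0)?;
    let y1 = x.to_device(&dev1)?.matmul(&w1)?;

    // ...and the halves are gathered on one device and concatenated.
    let y1 = y1.to_device(&dev0)?;
    let y = Tensor::cat(&[&y0, &y1], 1)?;
    println!("{:?}", y.shape()); // expected (4, 4096)
    Ok(())
}
```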


LaurentMazare commented on June 9, 2024

If there is no need to shard a single tensor across multiple GPUs, I would recommend doing something a lot simpler than llama-multiprocess and instead putting the different weights on different GPUs. I guess that is likely what the pipeline processing of llama.cpp is doing.
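A minimal sketch of that simpler, pipeline-style approach, assuming the `candle_core` and `candle_nn` crates (not code from the thread); the two `Linear` layers stand in for whatever blocks the real model uses:

```rust
use candle_core::{Device, Module, Result, Tensor};
use candle_nn::Linear;

fn main() -> Result<()> {
    let dev0 = Device::new_cuda(0)?;
    let dev1 = Device::new_cuda(1)?;

    // Two "layers" living on different GPUs; candle_nn::Linear expects
    // a weight of shape (out_features, in_features).
    let l0 = Linear::new(Tensor::randn(0f32, 1f32, (512, 512), &dev0)?, None);
    let l1 = Linear::new(Tensor::randn(0f32, 1f32, (512, 512), &dev1)?, None);

    // Pipeline-style forward: run on GPU 0, move the activations, run on GPU 1.
    let x = Tensor::randn(0f32, 1f32, (1, 512), &dev0)?;
    let h = l0.forward(&x)?;
    let y = l1.forward(&h.to_device(&dev1)?)?;
    println!("{:?}", y.shape());
    Ok(())
}
```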


hugoabonizio commented on June 9, 2024

Unfortunately, sharding the tensors is necessary, both for larger models (40B+ parameters) and to speed up inference at larger batch sizes. My use case is an API serving multiple concurrent requests.

Is the solution you're suggesting of putting different weights (layers?) on different GPUs similar to transformers' device_map? I suppose it's slower than sharding, right?

