Comments (5)
@LaurentMazare I'm sorry to bother, but I just want to ask: is it possible to use the current implementation of quantized models in a multi-GPU setup (like the llama_multiprocess example)? If not, is there any plan to support this feature in the future?
I appreciate your work on pushing forward the CUDA kernels for quantization.
I'm not sure that the technique used for llama-multiprocess would make sense here. The llama-multiprocess version is useful when some tensors have to be shared across different GPUs, but I don't think there are quantized models large enough for this to actually be useful?
If the goal is just to have multiple models that live on different GPUs, then that part should be reasonably easy to do even with the current API by creating one device per GPU that you want to target, but maybe you're after something more complex than this?
I'm after sharding larger models that wouldn't fit on a single 24GB GPU and could instead be split across, for example, 4 of them. If I'm not mistaken, llama.cpp supports multi-GPU through pipeline parallelism but supported tensor splitting between GPUs before that.
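To make the memory constraint concrete, here is a rough back-of-the-envelope sketch in Rust. It only counts weight storage and is not a measurement. The assumptions are mine, not from the thread: about 4.5 bits per weight (a Q4_0-style layout storing 32 weights in 18 bytes), a hypothetical 70B-parameter model, and roughly 20 GB of each 24 GB card left for weights after setting aside room for the KV cache and activations. The function names are made up for illustration.

```rust
/// Approximate bytes needed to store the weights alone
/// (ignores KV cache, activations, and runtime overhead).
fn weight_bytes(num_params: u64, bits_per_weight: f64) -> f64 {
    num_params as f64 * bits_per_weight / 8.0
}

/// Smallest number of identical devices whose combined usable memory
/// can hold `total_bytes` of weights.
fn min_devices(total_bytes: f64, usable_bytes_per_device: f64) -> u64 {
    (total_bytes / usable_bytes_per_device).ceil() as u64
}

fn main() {
    // Hypothetical model size, chosen only for illustration.
    let params: u64 = 70_000_000_000;
    // Assumption: Q4_0-style quantization, 18 bytes per block of 32 weights.
    let bits_per_weight = 4.5;
    // Assumption: ~20 GB of a 24 GB card is available for weights.
    let usable_per_device = 20.0e9;

    let total = weight_bytes(params, bits_per_weight);
    println!(
        "weights: {:.1} GB, devices needed: {}",
        total / 1e9,
        min_devices(total, usable_per_device)
    );
}
```

Under these assumptions the weights alone exceed what a single card can hold, which is the situation described above. The real headroom depends on context length and batch size.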
If there is no need to shard one tensor on multiple GPUs, I would recommend doing something a lot simpler than llama-multiprocess and instead putting the different weights on different GPUs. I guess that's likely what the pipeline processing of llama.cpp is doing.
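A minimal sketch of what "different weights on different GPUs" could look like, assuming the split is done per transformer layer in contiguous blocks. The helper `layer_device_map` is hypothetical and not part of candle. In candle each index it returns would presumably correspond to a device created with `Device::new_cuda(ordinal)`, with activations moved between devices at block boundaries (for example via `Tensor::to_device`).

```rust
/// Assign `num_layers` consecutive layers to `num_devices` devices in
/// contiguous blocks. Entry `i` of the result is the device ordinal for
/// layer `i`. When the split is uneven, the earlier devices get one
/// extra layer each.
fn layer_device_map(num_layers: usize, num_devices: usize) -> Vec<usize> {
    assert!(num_devices > 0, "need at least one device");
    let base = num_layers / num_devices;
    let extra = num_layers % num_devices;
    let mut map = Vec::with_capacity(num_layers);
    for device in 0..num_devices {
        // The first `extra` devices take one more layer than the rest.
        let count = base + usize::from(device < extra);
        map.extend(std::iter::repeat(device).take(count));
    }
    map
}

fn main() {
    // Hypothetical example: 10 layers spread over 4 GPUs.
    let map = layer_device_map(10, 4);
    for (layer, device) in map.iter().enumerate() {
        println!("layer {layer} -> cuda:{device}");
    }
}
```

With this kind of placement only one device is busy at a time for a single request, so it addresses memory capacity rather than latency. Sharding individual tensors is the approach that targets the latter.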
Unfortunately, sharding the tensors is necessary both to fit larger models (40B+ params) and to speed up larger batch sizes. My use case is an API serving multiple concurrent requests.
Is the solution you're suggesting of putting different weights (layers?) on different GPUs similar to transformers' device_map? I suppose it's slower than sharding, right?