b4rtaz / distributed-llama
Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
License: MIT License
I'm running inference with q40 weights of llama-3-70b-instruct across 3 x86_64 machines, and I'm getting this on my root node:
main: src/transformer.cpp:17: MatmulSlice::MatmulSlice(FloatType, int, int, int): Assertion `d % nSlices == 0' failed.
Any suggestions?
What about multi-core support on a stand-alone dual-socket motherboard?
I promised to share results for the Turing RK1 module. It arrived yesterday, so I took the chance to run distributed-llama on it.
Hardware: 8 cores, 32 GB RAM. Storage: 1 TB NVMe SSD
OS: custom Ubuntu Server
Model: llama-2-7b
sudo nice -n -20 ./main inference \
--model /mnt/bigdata/llama-2-7b/dllama_llama-2-7b_q40.bin \
--tokenizer ./tokenizer.bin \
--weights-float-type q40 \
--buffer-float-type q80 \
--prompt "Hello world" \
--steps 16 \
--nthreads 4
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 372 ms I 372 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 378 ms I 378 ms T 0 ms S 0 kB R 0 kB world
🔶 G 369 ms I 367 ms T 1 ms S 0 kB R 0 kB ,
🔶 G 379 ms I 379 ms T 0 ms S 0 kB R 0 kB I
🔶 G 424 ms I 397 ms T 27 ms S 0 kB R 0 kB '
🔶 G 376 ms I 376 ms T 0 ms S 0 kB R 0 kB m
🔶 G 378 ms I 377 ms T 0 ms S 0 kB R 0 kB E
🔶 G 407 ms I 407 ms T 0 ms S 0 kB R 0 kB .
🔶 G 383 ms I 380 ms T 0 ms S 0 kB R 0 kB ΡΡΡΠ½Ρ
🔶 G 372 ms I 371 ms T 1 ms S 0 kB R 0 kB
🔶 G 379 ms I 378 ms T 0 ms S 0 kB R 0 kB 2
🔶 G 374 ms I 373 ms T 0 ms S 0 kB R 0 kB 0
🔶 G 382 ms I 381 ms T 0 ms S 0 kB R 0 kB 1
🔶 G 375 ms I 373 ms T 2 ms S 0 kB R 0 kB 8
🔶 G 378 ms I 377 ms T 1 ms S 0 kB R 0 kB at
🔶 G 382 ms I 382 ms T 0 ms S 0 kB R 0 kB
Generated tokens: 16
Avg generation time: 381.75 ms
Avg inference time: 379.25 ms
Avg transfer time: 2.00 ms
I set up a single master-worker pair to experiment with distributed-llama. The master is an RPi 5 with 8 GB RAM, and the only worker is an RPi 4 with the same amount of memory.
When I run inference, the master crashes after a while with a segfault, and the worker also quits because the socket connection is closed.
Any idea why? I tried with the smallest model, llama-2-7b.
Terminal capture from master
segabor@bigfive:~/src/distributed-llama $ ./run.sh
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 2
./run.sh: line 9: 268004 Segmentation fault sudo nice -n -20 ./main inference --model /mnt/data/llama-2-7b/dllama_llama-2-7b_q40.bin --tokenizer ./tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 30.0.0.12:9998
Worker capture:
^Csegabor@lohere:~/src/distributed-llama$ ./run.sh
Listening on 0.0.0.0:9998...
Client connected
💡 sliceIndex: 1
💡 nSlices: 2
⏩ Received 56918016 bytes for block 0 (83092 kB/s)
⏩ Received 56918016 bytes for block 1 (112709 kB/s)
⏩ Received 56918016 bytes for block 2 (112486 kB/s)
⏩ Received 56918016 bytes for block 3 (112709 kB/s)
⏩ Received 56918016 bytes for block 4 (91069 kB/s)
⏩ Received 56918016 bytes for block 5 (114986 kB/s)
⏩ Received 56918016 bytes for block 6 (103865 kB/s)
⏩ Received 56918016 bytes for block 7 (106190 kB/s)
⏩ Received 56918016 bytes for block 8 (112709 kB/s)
⏩ Received 56918016 bytes for block 9 (63172 kB/s)
⏩ Received 56918016 bytes for block 10 (63172 kB/s)
⏩ Received 56918016 bytes for block 11 (63313 kB/s)
⏩ Received 56918016 bytes for block 12 (63313 kB/s)
⏩ Received 56918016 bytes for block 13 (63172 kB/s)
⏩ Received 56918016 bytes for block 14 (60810 kB/s)
⏩ Received 56918016 bytes for block 15 (64097 kB/s)
⏩ Received 56918016 bytes for block 16 (60551 kB/s)
⏩ Received 56918016 bytes for block 17 (60358 kB/s)
⏩ Received 56918016 bytes for block 18 (60423 kB/s)
⏩ Received 56918016 bytes for block 19 (61600 kB/s)
⏩ Received 56918016 bytes for block 20 (62205 kB/s)
⏩ Received 56918016 bytes for block 21 (61136 kB/s)
⏩ Received 56918016 bytes for block 22 (62138 kB/s)
⏩ Received 56918016 bytes for block 23 (64753 kB/s)
⏩ Received 56918016 bytes for block 24 (100208 kB/s)
⏩ Received 56918016 bytes for block 25 (112486 kB/s)
⏩ Received 56918016 bytes for block 26 (112486 kB/s)
⏩ Received 56918016 bytes for block 27 (114064 kB/s)
⏩ Received 56918016 bytes for block 28 (111823 kB/s)
⏩ Received 56918016 bytes for block 29 (111168 kB/s)
Error receiving data: socket closed
When downloading the model from the Meta website using the emailed URL, I often run into network problems (403 Forbidden). Is support for Hugging Face models possible? Thanks!
A very impressive job!
But it doesn't seem to support GPUs. Would you consider adding GPU acceleration?
Any suggestions for porting this project to CUDA/HIP?
Thanks for any help!
Sorry if I'm asking a stupid question. I'm a newbie LLM user and have been using Ollama; I love its ease of use. However, I'm also limited by my hardware, so distributed-llama seems like a promising solution for me. But I don't know how to use the models provided by Ollama. Is that feasible at all?
Dear Author,
Your contribution is valuable to the open-source community. The distributed-llama repo implements tensor parallelism from scratch, and the results are significant. However, there are still improvements that could be made. Because my coding ability is too poor to make these improvements myself, I hope you can look at my suggestions below.
When I run version 0.1.0 of the repo, I find that the softmax operations in MultiHead are conducted on the root node only, and this operation costs a significant portion of the total time. Second, the synFfnA and synFfn2 functions also cost a lot of time.
In fact, these challenges were already identified in this paper: https://arxiv.org/abs/1909.08053. Its solution is twofold: it runs the attention mechanism (softmax) on every worker, and it splits two consecutive weight matrices by column and by row respectively, reducing two synchronization operations to one.
If you are willing to make further improvements to the repo, the following is a mature solution for every component of Llama 2 using tensor parallelism and sequence parallelism:
https://pytorch.org/tutorials/intermediate/TP_tutorial.html
However, it's implemented in Python, so you would be the first to implement the solution in C++.
Thanks for your contribution!!!
Best Regards
Is it in the scope of the project to eventually provide a WebAssembly version?
Hi Mr Bart,
Your distributed-llama is great! However, are there any clear instructions for setting up the whole environment from scratch? I'm interested in distributed-llama, but I lack experience with Raspberry Pi devices, so I don't even know how to connect the devices to my PC or install the dependencies on my PC or on the Raspberry Pis. Could you please help me with it? Thank you so much!
I got this error while compiling:
src/socket.cpp:61:34: error: ‘time’ was not declared in this scope
   61 |     time_t now = time(NULL);
      |                  ^~~~
src/socket.cpp:12:1: note: ‘time’ is defined in header ‘<ctime>’; did you forget to ‘#include <ctime>’?
   11 | #include <stdexcept>
  +++ |+#include <ctime>
   12 |
Adding #include <ctime> to src/socket.cpp fixed the compile error for me. Please note that I don't code in C++, so please forgive any ignorance I have on this issue.