distributed-llama's Issues

Assertion `d % nSlices == 0' failed.

I'm running inference on a q40-quantized llama-3-70b-instruct across 3 x86_64 machines, and I'm getting this on my root node:
main: src/transformer.cpp:17: MatmulSlice::MatmulSlice(FloatType, int, int, int): Assertion `d % nSlices == 0' failed.

Any suggestions?
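For context, if I read the assertion right, MatmulSlice requires the output dimension d to divide evenly by the number of slices. llama-3-70b's hidden dimension is 8192, a power of two, so it cannot be split across 3 nodes. A standalone sketch of the check (illustrative, not the repo's actual code):

#include <cstdio>

// Illustration of the check that fails in MatmulSlice:
// the output dimension d must split evenly across nSlices.
int main() {
    const int d = 8192;  // llama-3-70b hidden dimension, a power of two
    for (int nSlices = 1; nSlices <= 4; nSlices++)
        printf("nSlices=%d -> d %% nSlices = %d (%s)\n", nSlices,
               d % nSlices, d % nSlices == 0 ? "ok" : "assertion fails");
    return 0;
}

So, assuming the weight file reports dim 8192, running on 1, 2, or 4 machines instead of 3 should get past this assert.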

Turing RK1 compute module results

I promised to share results for the Turing RK1 module. It arrived yesterday, so I took the chance to run Distributed Llama on it.
Hardware: 8 cores, 32 GB RAM, 1 TB NVMe SSD storage
OS: custom Ubuntu Server
Model: llama-2-7b

Command

sudo nice -n -20 ./main inference \
  --model /mnt/bigdata/llama-2-7b/dllama_llama-2-7b_q40.bin \
  --tokenizer ./tokenizer.bin \
  --weights-float-type q40 \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4

Result

πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 11008
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 32
πŸ’‘ vocabSize: 32000
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 1
⏩ Loaded 4242882560 bytes
πŸ”Ά G  372 ms I  372 ms T    0 ms S      0 kB R      0 kB Hello
πŸ”Ά G  378 ms I  378 ms T    0 ms S      0 kB R      0 kB  world
πŸ”Ά G  369 ms I  367 ms T    1 ms S      0 kB R      0 kB ,
πŸ”Ά G  379 ms I  379 ms T    0 ms S      0 kB R      0 kB  I
πŸ”Ά G  424 ms I  397 ms T   27 ms S      0 kB R      0 kB '
πŸ”Ά G  376 ms I  376 ms T    0 ms S      0 kB R      0 kB m
πŸ”Ά G  378 ms I  377 ms T    0 ms S      0 kB R      0 kB  E
πŸ”Ά G  407 ms I  407 ms T    0 ms S      0 kB R      0 kB .
πŸ”Ά G  383 ms I  380 ms T    0 ms S      0 kB R      0 kB  січня
πŸ”Ά G  372 ms I  371 ms T    1 ms S      0 kB R      0 kB  
πŸ”Ά G  379 ms I  378 ms T    0 ms S      0 kB R      0 kB 2
πŸ”Ά G  374 ms I  373 ms T    0 ms S      0 kB R      0 kB 0
πŸ”Ά G  382 ms I  381 ms T    0 ms S      0 kB R      0 kB 1
πŸ”Ά G  375 ms I  373 ms T    2 ms S      0 kB R      0 kB 8
πŸ”Ά G  378 ms I  377 ms T    1 ms S      0 kB R      0 kB  at
πŸ”Ά G  382 ms I  382 ms T    0 ms S      0 kB R      0 kB  
Generated tokens:    16
Avg generation time: 381.75 ms
Avg inference time:  379.25 ms
Avg transfer time:   2.00 ms
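That average generation time works out to roughly 1000 / 381.75 ≈ 2.6 tokens/s on a single RK1 node, with transfer time negligible since nSlices is 1.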

Master process crashes, running out of memory, on an 8 GB RPi 5

I set up a single master–worker pair to experiment with Distributed Llama. The master is an RPi 5 with 8 GB RAM, and the only worker is an RPi 4 with the same amount of memory.
When I run the inference, the master crashes after a while with a segfault. The worker also quits because the socket connection is closed.
Any idea why? I tried the smallest model, llama-2-7b.

Terminal capture from master

segabor@bigfive:~/src/distributed-llama $ ./run.sh 
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 11008
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 32
πŸ’‘ vocabSize: 32000
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 2
./run.sh: line 9: 268004 Segmentation fault      sudo nice -n -20 ./main inference --model /mnt/data/llama-2-7b/dllama_llama-2-7b_q40.bin --tokenizer ./tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 30.0.0.12:9998

Worker capture:

^Csegabor@lohere:~/src/distributed-llama$ ./run.sh 
Listening on 0.0.0.0:9998...
Client connected
πŸ’‘ sliceIndex: 1
πŸ’‘ nSlices: 2
⏩ Received 56918016 bytes for block 0 (83092 kB/s)
⏩ Received 56918016 bytes for block 1 (112709 kB/s)
⏩ Received 56918016 bytes for block 2 (112486 kB/s)
⏩ Received 56918016 bytes for block 3 (112709 kB/s)
⏩ Received 56918016 bytes for block 4 (91069 kB/s)
⏩ Received 56918016 bytes for block 5 (114986 kB/s)
⏩ Received 56918016 bytes for block 6 (103865 kB/s)
⏩ Received 56918016 bytes for block 7 (106190 kB/s)
⏩ Received 56918016 bytes for block 8 (112709 kB/s)
⏩ Received 56918016 bytes for block 9 (63172 kB/s)
⏩ Received 56918016 bytes for block 10 (63172 kB/s)
⏩ Received 56918016 bytes for block 11 (63313 kB/s)
⏩ Received 56918016 bytes for block 12 (63313 kB/s)
⏩ Received 56918016 bytes for block 13 (63172 kB/s)
⏩ Received 56918016 bytes for block 14 (60810 kB/s)
⏩ Received 56918016 bytes for block 15 (64097 kB/s)
⏩ Received 56918016 bytes for block 16 (60551 kB/s)
⏩ Received 56918016 bytes for block 17 (60358 kB/s)
⏩ Received 56918016 bytes for block 18 (60423 kB/s)
⏩ Received 56918016 bytes for block 19 (61600 kB/s)
⏩ Received 56918016 bytes for block 20 (62205 kB/s)
⏩ Received 56918016 bytes for block 21 (61136 kB/s)
⏩ Received 56918016 bytes for block 22 (62138 kB/s)
⏩ Received 56918016 bytes for block 23 (64753 kB/s)
⏩ Received 56918016 bytes for block 24 (100208 kB/s)
⏩ Received 56918016 bytes for block 25 (112486 kB/s)
⏩ Received 56918016 bytes for block 26 (112486 kB/s)
⏩ Received 56918016 bytes for block 27 (114064 kB/s)
⏩ Received 56918016 bytes for block 28 (111823 kB/s)
⏩ Received 56918016 bytes for block 29 (111168 kB/s)
Error receiving data: socket closed
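For scale: the worker had already received 30 × 56918016 bytes ≈ 1.7 GB when the socket closed, so the master crashed while it was still distributing the 32 transformer blocks.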

Support Hugging Face models

When downloading the model from the Meta website using the emailed URL, I often run into 403 Forbidden network errors. Would it be possible to support Hugging Face models? Thanks!

Can I use Ollama models?

Sorry if this is a stupid question. I am a newbie LLM user; I have been using Ollama and love its ease of use. However, I am also limited by my hardware, and distributed-llama seems a promising solution for me. But I don't know how to use the models provided by Ollama. Is it feasible at all?

[Feature Suggestion] Tensor Parallelism for Accelerating LLM

Dear Author,

Your contribution is critical for the open-source community. The distributed-llama repo has implemented tensor parallelism from scratch, and the results are impressively good. However, there are still improvements that could be made. Because of my poor coding ability I am not able to make them myself, so I hope you can look at my suggestions below.

Challenge: root node's special task and synchronization

When I run repo version 0.1.0, I find that the softmax operations in MultiHead are conducted on the root node only, and this costs a significant portion of the total time. Second, the synFfnA and synFfn2 functions also cost a lot of time.

Mature solutions

In fact, these challenges have already been addressed in this paper: https://arxiv.org/abs/1909.08053. Its solution is shown in the image below:

[Image: the paper's partitioning of the attention and MLP blocks across workers]

First, it conducts the attention mechanism (softmax included) on every worker. Second, the two consecutive weight matrices are split column-wise and row-wise respectively, reducing the two synchronization operations to one.
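To make the column/row split concrete, here is a minimal sketch of my own (not distributed-llama's code; the dimensions, weights, and SiLU activation are made up for the example). The first matrix is split by columns, so each worker holds complete output elements and can apply the activation locally; the second is split by rows, so each worker produces a partial result for the full output, and a single all-reduce (simulated here by summing in a loop) finishes the job:

#include <cstdio>
#include <cmath>
#include <vector>

// Z = SiLU(X * A) * B computed with nSlices "workers" (simulated by a loop).
// A (d x h) is split by columns, B (h x d) by rows, so only the partial
// Z sums need synchronizing: one all-reduce instead of two.
int main() {
    const int d = 4, h = 8, nSlices = 2;      // toy sizes; h % nSlices == 0
    std::vector<float> x(d), a(d * h), b(h * d), z(d, 0.0f);
    for (int i = 0; i < d; i++) x[i] = 0.1f * (i + 1);
    for (int i = 0; i < d * h; i++) a[i] = 0.01f * i;
    for (int i = 0; i < h * d; i++) b[i] = 0.02f * i;

    const int hs = h / nSlices;               // columns of A / rows of B per worker
    for (int s = 0; s < nSlices; s++) {       // each iteration = one worker
        std::vector<float> y(hs);
        for (int j = 0; j < hs; j++) {        // column-parallel X*A: complete outputs
            float v = 0.0f;
            for (int i = 0; i < d; i++) v += x[i] * a[i * h + s * hs + j];
            y[j] = v / (1.0f + std::exp(-v)); // SiLU applied locally, no sync needed
        }
        for (int k = 0; k < d; k++)           // row-parallel Y*B: partial Z
            for (int j = 0; j < hs; j++)
                z[k] += y[j] * b[(s * hs + j) * d + k]; // the single "all-reduce"
    }
    for (int k = 0; k < d; k++) printf("z[%d] = %f\n", k, z[k]);
    return 0;
}

The key property is that the activation sits between the two matmuls, yet no synchronization happens there, because the column split leaves each worker with fully computed elements of the intermediate vector.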

If you are willing to make further improvements to the repo, the following is a mature solution for every component of llama2 using tensor parallelism and sequence parallelism:
https://pytorch.org/tutorials/intermediate/TP_tutorial.html
However, it is implemented in Python, so you would be the first to implement the solution in C++.

Thanks for your contribution!!!
Best Regards

WebAssembly version

Is it in the scope of the project to eventually provide a WebAssembly version?

Need help setting up all the devices

Hi Mr Bart,

Your distributed-llama is great! However, are there any clear instructions for setting up the whole environment from scratch? I'm interested in distributed-llama, but I lack experience with Raspberry Pi devices, so I don't even know how to connect the devices to my PC or install the dependencies on my PC or the Raspberry Pis. Could you please help me with it? Thank you so much!

Compilation error related to a missing <ctime> include

I got this error while compiling:

src/socket.cpp:61:34: error: β€˜time’ was not declared in this scope
   61 |                     time_t now = time(NULL);
      |                                  ^~~~
src/socket.cpp:12:1: note: β€˜time’ is defined in header β€˜<ctime>’; did you forget to β€˜#include <ctime>’?
   11 | #include <stdexcept>
  +++ |+#include <ctime>
   12 | 

Adding #include <ctime> to src/socket.cpp fixed the compilation error for me. Please note that I do not code in C++, so please forgive any ignorance on my part.
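For anyone hitting the same error: the fix matches the compiler's own suggestion, so the include list at the top of src/socket.cpp would gain one line (surrounding includes as shown in the note above):

#include <stdexcept>
#include <ctime>   // provides time() and time_t used at src/socket.cpp:61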
