b4rtaz / distributed-llama
Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
License: MIT License
I'm running inference with q40 weights of llama-3-70b-instruct across 3 x86_64 machines, and I'm getting this on my root node:
main: src/transformer.cpp:17: MatmulSlice::MatmulSlice(FloatType, int, int, int): Assertion `d % nSlices == 0' failed.
Any suggestions?
What about multi-core support on a stand-alone dual-socket motherboard?
I promised to share results for the Turing RK1 module. It arrived yesterday, so I took the chance to run distributed-llama on it.
Hardware: 8 cores, 32 GB RAM. Storage: 1 TB NVMe SSD
OS: custom Ubuntu Server
Model: llama-2-7b
sudo nice -n -20 ./main inference \
--model /mnt/bigdata/llama-2-7b/dllama_llama-2-7b_q40.bin \
--tokenizer ./tokenizer.bin \
--weights-float-type q40 \
--buffer-float-type q80 \
--prompt "Hello world" \
--steps 16 \
--nthreads 4
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 372 ms I 372 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 378 ms I 378 ms T 0 ms S 0 kB R 0 kB world
🔶 G 369 ms I 367 ms T 1 ms S 0 kB R 0 kB ,
🔶 G 379 ms I 379 ms T 0 ms S 0 kB R 0 kB I
🔶 G 424 ms I 397 ms T 27 ms S 0 kB R 0 kB '
🔶 G 376 ms I 376 ms T 0 ms S 0 kB R 0 kB m
🔶 G 378 ms I 377 ms T 0 ms S 0 kB R 0 kB E
🔶 G 407 ms I 407 ms T 0 ms S 0 kB R 0 kB .
🔶 G 383 ms I 380 ms T 0 ms S 0 kB R 0 kB ΡΡΡΠ½Ρ
🔶 G 372 ms I 371 ms T 1 ms S 0 kB R 0 kB
🔶 G 379 ms I 378 ms T 0 ms S 0 kB R 0 kB 2
🔶 G 374 ms I 373 ms T 0 ms S 0 kB R 0 kB 0
🔶 G 382 ms I 381 ms T 0 ms S 0 kB R 0 kB 1
🔶 G 375 ms I 373 ms T 2 ms S 0 kB R 0 kB 8
🔶 G 378 ms I 377 ms T 1 ms S 0 kB R 0 kB at
🔶 G 382 ms I 382 ms T 0 ms S 0 kB R 0 kB
Generated tokens: 16
Avg generation time: 381.75 ms
Avg inference time: 379.25 ms
Avg transfer time: 2.00 ms
I set up a single master-worker pair to experiment with distributed-llama. The master is an RPi 5 with 8 GB RAM, and the only worker is an RPi 4 with the same amount of memory.
When I run inference, the master crashes after a while with a segfault, and the worker also quits because the socket connection is closed.
Any idea why? I tried with the smallest model, llama-2-7b.
Terminal capture from master
segabor@bigfive:~/src/distributed-llama $ ./run.sh
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 2
./run.sh: line 9: 268004 Segmentation fault sudo nice -n -20 ./main inference --model /mnt/data/llama-2-7b/dllama_llama-2-7b_q40.bin --tokenizer ./tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 30.0.0.12:9998
Worker capture:
^Csegabor@lohere:~/src/distributed-llama$ ./run.sh
Listening on 0.0.0.0:9998...
Client connected
💡 sliceIndex: 1
💡 nSlices: 2
⏩ Received 56918016 bytes for block 0 (83092 kB/s)
⏩ Received 56918016 bytes for block 1 (112709 kB/s)
⏩ Received 56918016 bytes for block 2 (112486 kB/s)
⏩ Received 56918016 bytes for block 3 (112709 kB/s)
⏩ Received 56918016 bytes for block 4 (91069 kB/s)
⏩ Received 56918016 bytes for block 5 (114986 kB/s)
⏩ Received 56918016 bytes for block 6 (103865 kB/s)
⏩ Received 56918016 bytes for block 7 (106190 kB/s)
⏩ Received 56918016 bytes for block 8 (112709 kB/s)
⏩ Received 56918016 bytes for block 9 (63172 kB/s)
⏩ Received 56918016 bytes for block 10 (63172 kB/s)
⏩ Received 56918016 bytes for block 11 (63313 kB/s)
⏩ Received 56918016 bytes for block 12 (63313 kB/s)
⏩ Received 56918016 bytes for block 13 (63172 kB/s)
⏩ Received 56918016 bytes for block 14 (60810 kB/s)
⏩ Received 56918016 bytes for block 15 (64097 kB/s)
⏩ Received 56918016 bytes for block 16 (60551 kB/s)
⏩ Received 56918016 bytes for block 17 (60358 kB/s)
⏩ Received 56918016 bytes for block 18 (60423 kB/s)
⏩ Received 56918016 bytes for block 19 (61600 kB/s)
⏩ Received 56918016 bytes for block 20 (62205 kB/s)
⏩ Received 56918016 bytes for block 21 (61136 kB/s)
⏩ Received 56918016 bytes for block 22 (62138 kB/s)
⏩ Received 56918016 bytes for block 23 (64753 kB/s)
⏩ Received 56918016 bytes for block 24 (100208 kB/s)
⏩ Received 56918016 bytes for block 25 (112486 kB/s)
⏩ Received 56918016 bytes for block 26 (112486 kB/s)
⏩ Received 56918016 bytes for block 27 (114064 kB/s)
⏩ Received 56918016 bytes for block 28 (111823 kB/s)
⏩ Received 56918016 bytes for block 29 (111168 kB/s)
Error receiving data: socket closed
When downloading the model from the Meta website using the emailed URL, I often run into network problems (403 Forbidden). Is support for Hugging Face models possible? Thanks!
A very impressive job!
But it doesn't seem to support GPUs. Would you consider adding GPU acceleration?
Any suggestions for porting this project to CUDA/HIP?
Thanks for any help!
Sorry if I'm asking a stupid question. I'm a newbie LLM user and have been using Ollama; I love its ease of use. However, I'm also limited by my hardware, so distributed-llama seems like a promising solution for me. But I don't know how to use the models provided by Ollama. Is that feasible at all?
Dear Author,
Your contribution is valuable to the open-source community. The distributed-llama repo implements tensor parallelism from scratch, and the results are significant. However, there are still improvements that could be made. Because my coding ability is too poor to make these improvements myself, I hope you can look at my suggestions below.
When I run version 0.1.0 of the repo, I find that the softmax operations in MultiHead are conducted on the root node only, and this operation costs a significant portion of the total time. Second, the synFfnA and synFfn2 functions also cost a lot of time.
In fact, these challenges were already identified in this paper: https://arxiv.org/abs/1909.08053. Its solution is twofold: it runs the attention mechanism (softmax) on every worker, and it splits two consecutive weight matrices by column and by row respectively, reducing two synchronization operations to one.
If you are willing to make further improvements to the repo, the following is a mature solution for every component of Llama 2 using tensor parallelism and sequence parallelism:
https://pytorch.org/tutorials/intermediate/TP_tutorial.html
However, it's implemented in Python, so you would be the first to implement the solution in C++.
Thanks for your contribution!!!
Best Regards
Is it in the scope of the project to eventually provide a WebAssembly version?
Hi Mr Bart,
Your distributed-llama is great! However, are there any clear instructions for setting up the whole environment from scratch? I'm interested in distributed-llama, but I lack experience with Raspberry Pi devices, so I don't even know how to connect the devices to my PC or install the dependencies on my PC or on the Raspberry Pis. Could you please help me with it? Thank you so much!
I got this error while compiling:
src/socket.cpp:61:34: error: ‘time’ was not declared in this scope
   61 |     time_t now = time(NULL);
      |                  ^~~~
src/socket.cpp:12:1: note: ‘time’ is defined in header ‘<ctime>’; did you forget to ‘#include <ctime>’?
   11 | #include <stdexcept>
  +++ |+#include <ctime>
   12 |
Adding #include <ctime> to src/socket.cpp fixed the compile error for me. Please note that I don't code in C++, so please forgive any ignorance I have on this issue.