bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading

Home Page: https://petals.dev

License: MIT License

Languages: Python 99.73%, Shell 0.05%, Dockerfile 0.21%
Topics: bloom, deep-learning, distributed-systems, language-models, large-language-models, machine-learning, neural-networks, pytorch, volunteer-computing, pipeline-parallelism, tensor-parallelism, guanaco, llama, chatbot, gpt, transformer, nlp, pretrained-models, llama2, falcon

petals's Introduction


Run large language models at home, BitTorrent-style.
Fine-tuning and inference up to 10x faster than offloading


Generate text with distributed Llama 2 (70B), Falcon (40B+), BLOOM (176B) (or their derivatives), and fine‑tune them for your own tasks — right from your desktop computer or Google Colab:

from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Choose any model available at https://health.petals.dev
model_name = "petals-team/StableBeluga2"  # This one is fine-tuned Llama 2 (70B)

# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run the model as if it were on your computer
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat...

🚀  Try now in Colab

🔏 Privacy. Your data will be processed with the help of other people in the public swarm. Learn more about privacy here. For sensitive data, you can set up a private swarm among people you trust.

🦙 Want to run Llama 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, then run huggingface-cli login in the terminal before loading the model. Or just try it in our chatbot app.

💬 Any questions? Ping us in our Discord!

Connect your GPU and increase Petals capacity

Petals is a community-run system — we rely on people sharing their GPUs. You can check out the available models and help serve one of them! As an example, here is how to host a part of Stable Beluga 2 on your GPU:

🐧 Linux + Anaconda. Run these commands for NVIDIA GPUs (or follow this for AMD):

conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install git+https://github.com/bigscience-workshop/petals
python -m petals.cli.run_server petals-team/StableBeluga2

🪟 Windows + WSL. Follow this guide on our Wiki.

🐋 Docker. Run our Docker image for NVIDIA GPUs (or follow this for AMD):

sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
    learningathome/petals:main \
    python -m petals.cli.run_server --port 31330 petals-team/StableBeluga2

🍏 macOS + Apple M1/M2 GPU. Install Homebrew, then run these commands:

brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2

📚  Learn more (how to use multiple GPUs, start the server on boot, etc.)

💬 Any questions? Ping us in our Discord!

🦙 Want to host Llama 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, generate an 🔑 access token, then add --token YOUR_TOKEN_HERE to the python -m petals.cli.run_server command.

🔒 Security. Hosting a server does not allow others to run custom code on your computer. Learn more here.

🏆 Thank you! Once you load and host 10+ blocks, we can show your name or link on the swarm monitor as a way to say thanks. You can specify them with --public_name YOUR_NAME.

How does it work?

  • You load a small part of the model, then join a network of people serving the other parts. Single‑batch inference runs at up to 6 tokens/sec for Llama 2 (70B) and up to 4 tokens/sec for Falcon (180B) — enough for chatbots and interactive apps.
  • You can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of PyTorch and 🤗 Transformers.
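For example, here is a hedged sketch of the "custom paths / hidden states" idea. It assumes the BLOOM-based client exposes its decoder stack as model.transformer.h, a RemoteSequential that supports slicing (as discussed in the issues below); attribute names may differ for other model families, so treat this as illustrative rather than the canonical API:

import torch
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
with torch.no_grad():
    hidden = model.transformer.word_embeddings(inputs)            # local input embeddings
    hidden = model.transformer.word_embeddings_layernorm(hidden)  # local pre-layernorm
    hidden = model.transformer.h[0:8](hidden)                     # run only the first 8 remote blocks
print(hidden.shape)  # intermediate hidden states, not final logits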

📜  Read paper            📚  See FAQ

📚 Tutorials, examples, and more

Basic tutorials:

  • Getting started: tutorial
  • Prompt-tune Llama-65B for text semantic classification: tutorial
  • Prompt-tune BLOOM to create a personified chatbot: tutorial

Useful tools:

  • Chatbot web app (connects to the public swarm): https://chat.petals.dev
  • Swarm health monitor: https://health.petals.dev

Advanced guides:

  • Launch a private swarm: guide
  • Run a custom model: guide

Benchmarks

Please see Section 3.3 of our paper.

🛠️ Contributing

Please see our FAQ on contributing.

📜 Citation

Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. Petals: Collaborative Inference and Fine-tuning of Large Models. arXiv preprint arXiv:2209.01188, 2022.

@article{borzunov2022petals,
  title = {Petals: Collaborative Inference and Fine-tuning of Large Models},
  author = {Borzunov, Alexander and Baranchuk, Dmitry and Dettmers, Tim and Ryabinin, Max and Belkada, Younes and Chumachenko, Artem and Samygin, Pavel and Raffel, Colin},
  journal = {arXiv preprint arXiv:2209.01188},
  year = {2022},
  url = {https://arxiv.org/abs/2209.01188}
}

This project is a part of the BigScience research workshop.

petals's People

Contributors

artek0chumak, borzunov, bot66, dbaranchuk, dvmazur, eltociear, greenfatguy, justheuristic, mryab, muhtasham, tonywang16, vadi2, vahe1994, zsc


petals's Issues

Is Petals the Solution that ChatGPT needs?

I recently found out about Petals, and the way it optimizes running LLMs on your local machine through a BitTorrent-style approach seemed very innovative to me.

Right now, it is estimated that running ChatGPT costs over $100k per day, and I was wondering whether this cost could be brought down using solutions like Petals, as this could also help the environment by reducing the massive footprint that such a complex operation causes.

Specify minimal GPU requirements for contributing

I tried to contribute to the swarm using an 8 GB card and quickly realized that, even when setting the PyTorch fragmented split size to 512 MB, I could not use this card to contribute to inference. It would be nice to have a section in the README that specifies the requirements.

CPU-only server never passes the throughput check

I am building a solution that needs to support the most basic of hardware. As such, I am building a number of default profiles, to fit different hardware configurations.

I've already confirmed that the Petals Docker image works great, when I expose my GPU to the image.

However, if I don't (if I'm a user who doesn't have a dedicated GPU), the Docker image just hangs indefinitely:

[INFO] Measuring network and compute throughput. This takes about a minute and will be cached for future runs

I can't join the swarm at all.

Is this by design? Does hivemind reject CPU-only clients by default?

Better throughput estimation

Current swarm load-balancing relies on a single throughput value, implemented in #21.

It works, but there are a few ways to make this work better:

  • speedtest-based speed can be unreliable. Some users reported that speedtest does not work
    • quick fix: if speedtest fails, warn and set throughput to 100 Mbps; add an env variable to specifically set the network throughput
  • when a server holds many layers, it is less affected by low network throughput
    • find a better way to account for num_blocks;
    • for example, min(compute_throughput, network_throughput * min(num_blocks, 5))
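A toy sketch of this combination rule and the speedtest fallback (all function, variable, and env-variable names are hypothetical; this is not the current Petals implementation):

import os
from typing import Optional

def effective_network_throughput(speedtest_mbps: Optional[float]) -> float:
    """Fallback: if speedtest failed, warn and assume 100 Mbps; allow an env-variable override."""
    override = os.environ.get("PETALS_NETWORK_MBPS")  # hypothetical variable name
    if override is not None:
        return float(override)
    if speedtest_mbps is None:
        print("Warning: speedtest failed, assuming 100 Mbps")
        return 100.0
    return speedtest_mbps

def combined_throughput(compute_throughput: float, network_throughput: float, num_blocks: int) -> float:
    """A server holding more blocks is less network-bound, so scale the network term by min(num_blocks, 5)."""
    return min(compute_throughput, network_throughput * min(num_blocks, 5))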

Additional information that could help clients find a better chain of servers for training/inference:

  • declare compute and network throughput separately
  • [maybe] declare whether or not a peer is directly reachable (not using relay)

[Feature Request] Direct server-server communication ("and then" clause)

Based on conversations with @borzunov , @dbaranchuk

Premise: currently in rpc_inference, each client sends inputs to a given server, collects responses from that server, then manually sends those outputs to the next server; this is needed for full fault-tolerance, in case one of the servers disconnects. A faster option is to send data directly from server 1 to server 2, if we can make it without compromising fault-tolerance -- and without insane code complexity.

Proposed solution: in rpc_inference, whenever a client sends a pb2 request, it can add a metadata key, e.g. "next_peer", which denotes the peer id of the next server. When a server finishes computing that request, it will immediately send the results to the specified peer_id and mark them as "hidden states for session {inference_session_id}" - assuming that the next peer currently takes part in the same session.

On the receiving end, each server awaits asyncio.wait(request_from_client, request_from_previous_server), whichever comes first. If the request from the previous server arrives first, the current server begins processing it immediately, but still waits for the client's data to ensure that the results are valid.

Sending data to the next server is not guaranteed: the requested server will simply fire a request and forget about it.
Notably, the server will still return hidden states to the client as usual. The extra communication is fine because rpc_inference does not use much network throughput ("Mbps"), being more sensitive to latency ("ping").
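A rough asyncio sketch of the receiving side, just to make the ordering concrete (all helpers and queue objects here are hypothetical placeholders, not the actual handler code):

import asyncio

def start_compute(hidden_states):
    ...  # placeholder: launch the block's forward pass early

def verify_equal(received_from_server, received_from_client):
    ...  # placeholder: the security check described in the notes below

async def receive_inference_inputs(client_queue: asyncio.Queue, prev_server_queue: asyncio.Queue):
    client_task = asyncio.create_task(client_queue.get())
    prev_task = asyncio.create_task(prev_server_queue.get())
    done, _ = await asyncio.wait({client_task, prev_task}, return_when=asyncio.FIRST_COMPLETED)

    if prev_task in done:
        hidden_states = prev_task.result()
        start_compute(hidden_states)              # begin processing immediately
        client_copy = await client_task           # but still wait for the client's copy
        verify_equal(hidden_states, client_copy)  # and make sure the results are valid
    else:
        hidden_states = client_task.result()
        prev_task.cancel()                        # the direct server-to-server push never arrived
        start_compute(hidden_states)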

Notes:

  • client can request a different next_peer after each new inference step. This happens if one of the "next servers" disconnected from the inference session. Servers should send each hidden_states to the server that was specified in the current request.next_peer
  • if a server receives a request that doesn't correspond to any active session, it simply ignores the request. this is fine because if that request was valid, the client will still send the same data later
  • [security] since the previous server can be faulty/malicious, the "next peer" server should check that the data it received from previous peer is equal to the data it eventually received from client; when we implement full verification, the server can simply sign the next peer message so it can be used as a proof of (benign or malicious) activity
    • if this took place, a server may have to re-send inference message; we can support this by specifying the current length in the server's response
  • [security] the server-to-server traffic caused by the client is strictly less than client-to-server traffic, which eliminates the potential misuse via ddos amplification
  • the current-best routing strategy would still work decently for this algorithm because it uses a strictly non-optimistic (time>=actual) performance model

@dbaranchuk also proposed a clever alternative solution, where each server runs its own fault-tolerant inference session to subsequent servers. This can be a better solution if we find a way to limit the memory / bandwidth usage on a server.

[CODE] port HF bloom to hivemind

Why: so we can play with it in inference mode

Original bloom code: huggingface/transformers#17474

The quest is to

  • implement bloom transformer layer as a hivemind expert.
  • prepare a huggingface model that only has bloom embeddings and logits, but runs all transformer layers via hivemind.RemoteExpert

exception 'Torch not compiled with CUDA enabled' when forcing '--device cuda'

Following the install instructions on the README.md into a dedicated conda env, the server throws an exception as follows:

python -m petals.cli.run_server bigscience/bloom-petals --device cuda
Jan 10 05:56:52.030 [INFO] Automatic dht prefix: bigscience/bloom-petals
Jan 10 05:56:58.672 [INFO] Connecting to the public swarm, peer_id = 12D3KooWRKGMtAyEzYsqk1fcT4Eu9CC5jd7YeyTHCoQgDcw1nh4i
Jan 10 05:56:58.673 [INFO] Model weights will be loaded in 8-bit format
Traceback (most recent call last):
  File "/home/a/anaconda3/envs/petals/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/a/anaconda3/envs/petals/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/petals/cli/run_server.py", line 213, in <module>
    main()
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/petals/cli/run_server.py", line 196, in main
    server = Server(
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/server.py", line 168, in __init__
    num_blocks = self._choose_num_blocks()
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/server.py", line 232, in _choose_num_blocks
    total_memory = torch.cuda.get_device_properties(self.device).total_memory
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/torch/cuda/__init__.py", line 221, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Is it a cudatoolkit version mismatch with what I have?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   31C    P0    26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Critical Code Analysis {backend.py}

Context: This issue contains the outputs of an LM-based code analysis tool ran by @versoindustries on a part of Petals code. [comment added by @borzunov]

\\ GPT-3 | Codex | ChatGPT (Unofficial ChatGPT API) ///
~ A fine-tuned complex GPT model wrapped into a program, plus a VS Code extension, used to analyze codebases at a reasonable cost ~
I will slowly add in other files. I haven't dived all the way into the codebase, but as I do, the model will get a clearer picture of what's going on. Some of these may not actually be issues if the code is used by a library or source that the model doesn't know about yet.

The assert not param.requires_grad and assert not buf.requires_grad checks in the constructor of the TransformerBackend class may cause problems if the model's parameters or buffers are expected to accumulate gradients.

The inference_pool and forward_pool variables are being assigned the same instance of PrioritizedTaskPool, which could lead to unexpected behavior when processing requests for forward and inference.

The self.cache_bytes_per_token variable is being assigned the value of Counter() which is not being used later on the code and is not being used for any operations.

The max_batch_size variable is being used in the constructor of PrioritizedTaskPool but it is not defined in the code.

The self.shard_num_heads variable is being used without being defined or assigned any value before.

The backward_pool variable is defined but it is not being used in the code.

The self.inference_schema variable is defined but it is not being used in the code.

The class TransformerBackend inherits from ModuleBackend but it is not being used.

The import of BloomConfig is not being used in the code.

The import of BloomAttention is not being used in the code.

The self.dtype variable is being used but it is not being defined or assigned any value before.

The self.memory_cache variable is defined but it is not being used in the code.

The import of InferenceMetadata is not being used in the code.

The import of Handle is not being used in the code.

The import of is_dummy is not being used in the code.

The self.forward_pool and self.inference_pool are being defined with the same PrioritizedTaskPool instance, which could cause confusion and unexpected behavior when handling forward and inference requests.

The self.forward_pool and self.backward_pool are defined with the same max_batch_size variable, which is not defined in the code. It might cause an error if this variable is not passed as an argument.

The self.config variable is defined in the constructor but it is not used anywhere in the code.

The *args and **kwargs passed to the constructor are not used in the code, which might cause confusion and unexpected behavior if they are passed with specific values.

The from __future__ import annotations statement at the top of the code is not needed and does not affect the execution of the code in any way.

The self.inference_schema variable is defined but it is not used in the code. It is unclear if it is intended to be used for validation or documentation of the input and output schema of the inference_step method.

The self.cache_bytes_per_token variable is defined but it is not used in the code. It is unclear if it is intended to be used for memory management or performance optimization.

The self.get_inference_cache_descriptors method is defined but it is not used in the code. It is unclear what its intended purpose is and how it is related to the self.cache_bytes_per_token variable.

The batch_size and max_length arguments passed to the self.get_inference_cache_descriptors method are not used in the code. It is unclear if they are intended to be used for memory management or performance optimization.

The self.dtype variable is defined in the constructor but it is not used in the code. It is unclear what the intended use of this variable is.

The self.shard_num_heads variable is defined but it is not used in the code. It is unclear what the intended use of this variable is.

The self.memory_cache variable is defined in the constructor but it is not used in the code. It is unclear what the intended use of this variable is.

There is no clear error handling mechanism in the code. If an error occurs, it might go unnoticed and cause unexpected behavior.

The code is not commented, which makes it difficult to understand the intended behavior and the meaning of the variables and methods.

The code is not well organized, making it difficult to understand the flow of execution and the dependencies between the different parts of the code.

The code is not written in a modular or reusable way, making it difficult to reuse parts of the code for other projects or applications.

The self.forward, self.backward and self.inference_step methods are being used in the constructor of the TransformerBackend class, but they are not defined in the code. It is unclear how these methods are supposed to work and what their intended behavior is.

The self.inference_pool variable is defined but it is not used in the code. It is unclear what the intended use of this variable is.

The self.args_schema, self.kwargs_schema variables are used in the constructor but they are not defined in the code. It is unclear what their intended use is.

tensor_parallel and petals are not standard Python libraries, and it is not clear what they are or what they are used for.

The self.inference_schema variable is defined but it is not used in the code. It is unclear what the intended use of this variable is.

Overall, the code seems to be in the middle of development and not ready for use. It has multiple issues that need to be addressed and cleaned up before it can be used in any real-world scenario.


🤗 Transformers compatibility issues

Hello,

I'm trying to make the DistributedBloomForCausalLM work with our library inseq to extract feature attributions from BLOOM generations. However, at the moment I am facing some issues that prevent me from using the distributed model:

  1. Inseq assumes the possibility of producing a structured output from model.generate by passing the return_dict_in_generate=True argument, as supported by HuggingFace. In your current implementation, there doesn't seem to be a way to extract such outputs, so when we access the property sequences an exception is thrown. To reproduce:
import torch
import inseq
from transformers import BloomTokenizerFast 
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()
inseq_model = inseq.load_model(model=model, tokenizer="bigscience/bloom-petals", attribution_method="saliency")
out = inseq_model.attribute(
    "A cat in French is \"",
    generation_args={"max_new_tokens": 3}
)
╭──────────────────────────── Traceback (most recent call last) ────────────────────────────╮
│ <ipython-input-7-60ac37021f03>:1 in <module>                                              │
│ /usr/local/lib/python3.8/dist-packages/inseq/models/attribution_model.py:184 in attribute │
│                                                                                           │
│   181 │   │   │   )                                                                       │
│   182 │   │   if not constrained_decoding:                                                │
│   183 │   │   │   encoded_input = self.encode(input_texts, return_baseline=True, include_ │
│ ❱ 184 │   │   │   generated_texts = self.generate(encoded_input, return_generation_output │
│   185 │   │   logger.debug(f"reference_texts={generated_texts}")                          │
│   186 │   │   attribution_method = self.get_attribution_method(method, override_default_a │
│   187 │   │   attributed_fn = self.get_attributed_fn(attributed_fn)                       │
│                                                                                           │
│ /usr/local/lib/python3.8/dist-packages/inseq/models/model_decorators.py:13 in             │
│ attribution_free_wrapper                                                                  │
│                                                                                           │
│   10 │   │   if self.is_hooked:                                                           │
│   11 │   │   │   was_hooked = True                                                        │
│   12 │   │   │   self.attribution_method.unhook()                                         │
│ ❱ 13 │   │   out = f(self, *args, **kwargs)                                               │
│   14 │   │   if was_hooked:                                                               │
│   15 │   │   │   self.attribution_method.hook()                                           │
│   16 │   │   return out                                                                   │
│                                                                                           │
│ /usr/local/lib/python3.8/dist-packages/inseq/models/huggingface_model.py:190 in generate  │
│                                                                                           │
│   187 │   │   │   **kwargs,                                                               │
│   188 │   │   )                                                                           │
│   189 │   │   texts = self.tokenizer.batch_decode(                                        │
│ ❱ 190 │   │   │   generation_out.sequences,                                               │
│   191 │   │   │   skip_special_tokens=True,                                               │
│   192 │   │   )                                                                           │
│   193 │   │   if return_generation_output:                                                │
╰───────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'Tensor' object has no attribute 'sequences'
  2. Using Inseq we can bypass the generation step by attributing a pre-specified generation. In that case, feature attributions will be performed by calling normal forward/backward passes on the model step by step. If I try this by adapting the call to model.attribute as:
out = inseq_model.attribute(
    "A cat in French is \"",
	generated_texts="A cat in French is \"chat\"",
    generation_args={"max_new_tokens": 3}
)

I get the following error:

╭──────────────────────────── Traceback (most recent call last) ────────────────────────────╮
│ /usr/local/lib/python3.8/dist-packages/petals/client/remote_model.py:163 in forward       │
│                                                                                           │
│   160 │   │   attention_mask: Optional[torch.Tensor] = None,                              │
│   161 │   │   **kwargs,                                                                   │
│   162 │   ):                                                                              │
│ ❱ 163 │   │   assert attention_mask is None, "DistributedBloomModel does not support atte │
│   164 │   │                                                                               │
│   165 │   │   for k, v in kwargs.items():                                                 │
│   166 │   │   │   if not (v is None or v is False):                                       │
╰───────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: DistributedBloomModel does not support attention masks right now

Correct me if I'm wrong, but I believe both return_dict_in_generate and attention_mask support should be achievable for the petals implementation, right? Would you consider supporting such usage? Thanks in advance! 🙂

Access to the public model requires HF API token

When run:
python -m cli.run_server --prefix bloom6b3 --converted_model_name_or_path bigscience/test-bloomd-6b3 \
    --block_indices 3:5 --torch_dtype float32 --identity_path ./server1.id --host_maddrs /ip4/127.0.0.1/tcp/31337

Fails with:
OSError: You specified use_auth_token=True, but a huggingface token was not found.

Fix: set use_auth_token=False here:
https://github.com/learning-at-home/bloom-demo/blob/ca3c08acc1c36d1da396ff95d9016522ea84b83c/src/server/server.py#L134

and here:
https://github.com/learning-at-home/bloom-demo/blob/ca3c08acc1c36d1da396ff95d9016522ea84b83c/src/bloom/from_pretrained.py#L73

Setting the device to CPU in the hyperparameters used for training in the sst2 prompt tuning example

Hi Big-Science team,

quick question: Why is the device set to 'CPU' in these training hyperparameters?

Especially when the notebook recommends using a GPU.

MODEL_NAME = "bigscience/bloom-petals" # select model you like
NUM_PREFIX_TOKENS = 16
DEVICE = 'cpu'
BATCH_SIZE = 16
LR = 1e-2
WEIGHT_DECAY = 0.0
NUM_EPOCHS = 3
SEED = 42
MODEL_MAX_LENGTH = 64
TUNING_MODE = 'ptune' # choose between ['ptune', 'deep_ptune']

Thanks

Best regards

Jerome

Failed finding central directory

I believe the server got stuck on the following error, because this is the last output in the console and it is many hours old. There is no subsequent line about reading the block from the HF Hub. Restarting the server helped, and it works normally now.

It might be caused by network downtime, but I suppose the server should survive short outages like this.

Jan 24 06:36:00.449 [WARN] [/home/petals/src/petals/bloom/from_pretrained.py._load_state_dict:121] Failed to load block 64 from HF Hub (retry in 10240 sec)
Traceback (most recent call last):
  File "/home/petals/src/petals/bloom/from_pretrained.py", line 118, in _load_state_dict
    return torch.load(archive_file, map_location="cpu")
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 777, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 282, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

try bloomz in chat.petals.ml

There are rumors that bloomz makes for a far closer approximation of instruct/ChatGPT than the original BLOOM. Let's see how it works with chat.petals!

Steps:

  1. convert the model to petals format, like this, upload to yourname/bloomz-petals
    • conversion script requires 400gb ram (can be optimized)
  2. run some petals servers with the new model:
    • python -m petals.cli.run_server yourname/bloomz-petals
  3. run a version of chat.petals with the new model
  4. let contributors play with it

Note: since we don't have active volunteers for the model (yet!), I can help you find some GPUs to bootstrap the model

Inference issues on Volta-based swarm

Hello folks,

Trying to run a private swarm on 7x Volta-generation GPUs. As suggested by the docs, I've set torch_dtype to float16 and NUM_BLOCKS to 10 (these are 32 GB GPUs) and removed the --load_in_8bit argument. All 7 are running on the same Linux host.

Swarm starts ok and loads all the model blocks, but many of included tests fail.

Even the most basic generation seems to always produce the same token (0 == UNK).

Are older GPUs even supported? There are some notes in the documentation on what to set for pre-Turing GPUs, but the arXiv paper says the server needs a Turing or later generation GPU.

If older GPUs are supported, do I also need to specify torch_dtype=float16 when instantiating the model?
(I get RuntimeError: "LayerNormKernelImpl" not implemented for 'Half' when running .generate() in this case.)

It is torch 1.12.1+cu113 on cuda 11.3

This is what i get as tests:

tests/test_aux_functions.py::test_throughput_basic FAILED <-- this system is behind proxy, i think this is expected
tests/test_block_exact_match.py::test_remote_block_exact_match FAILED
tests/test_chained_calls.py::test_forward_backward_exact_match FAILED
tests/test_chained_calls.py::test_chained_inference_exact_match FAILED
tests/test_full_model.py::test_full_model_exact_match[True] FAILED
tests/test_full_model.py::test_full_model_exact_match[False] FAILED
tests/test_full_model.py::test_greedy_generation PASSED
tests/test_full_model.py::test_sampling[sampling_options0] SKIPPED (Sampling is currently not consistent with outputs from Transformers)
tests/test_full_model.py::test_sampling[sampling_options1] SKIPPED (Sampling is currently not consistent with outputs from Transformers)
tests/test_full_model.py::test_sampling[sampling_options2] SKIPPED (Sampling is currently not consistent with outputs from Transformers)
tests/test_full_model.py::test_sampling[sampling_options3] SKIPPED (Sampling is currently not consistent with outputs from Transformers)
tests/test_full_model.py::test_beam_search_generation FAILED
tests/test_linear8bitlt.py::test_layout_exact_match SKIPPED (this test requires a turing-generation or newer GPU, see bitsandbytes docs)
tests/test_linear8bitlt.py::test_linear_exact_match SKIPPED (this test requires a turing-generation or newer GPU, see bitsandbytes docs)
tests/test_linear8bitlt.py::test_linear_no_igemmlt PASSED
tests/test_priority_pool.py::test_priority_pools PASSED
tests/test_remote_sequential.py::test_remote_sequential FAILED
tests/test_remote_sequential.py::test_remote_sequential_prompts FAILED

The greedy search test seems to pass, but I'm suspicious... could it be an issue with the test?

Here is what i see from .generate():

model = DistributedBloomForCausalLM.from_pretrained("bigscience/bloom-petals", initial_peers=INITIAL_PEERS)
inputs = tokenizer("Cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
Cat sat on
type(outputs), outputs.shape, outputs
(torch.Tensor,
torch.Size([1, 8]),
tensor([[40171, 13770, 664, 0, 0, 0, 0, 0]]))

If needed, can provide the log from tests.

Investigate QUIC (v1) reliability

Our network layer supports quic like this: hivemind.DHT(..., host_maddrs=['/ip4/1.2.3.4/udp/1337/quic'])
However, petals servers currently default to TCP-only host maddrs, unless user specifies --host_maddrs.
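For reference, a single peer could listen on both transports like this (Python; the addresses and ports are illustrative placeholders, and get_visible_maddrs is only used to confirm what the peer advertises):

import hivemind

# Illustrative only: advertise both TCP and QUIC, mirroring the host_maddrs pattern quoted above
dht = hivemind.DHT(
    host_maddrs=["/ip4/0.0.0.0/tcp/31337", "/ip4/0.0.0.0/udp/31337/quic"],
    start=True,
)
print(dht.get_visible_maddrs())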

In other hivemind-based experiments, we found that QUIC is superior to TCP when operating under household NAT because udp hole punching is more reliable than tcp hole punching. It would be great if we could enable it by default.

The reason why QUIC is not in the default maddrs is that we haven't tested it thoroughly enough and we fear that it might cause throughput issues.

Quest: try running a QUIC-only peer in the public swarm, bombard it with requests from your PC, Colab, or some publicly accessible machine, and check if it works alright.

Criteria (suggestion):

  • cycles per second, forward and inference (vs TCP)
  • retries / relay fallbacks (vs TCP)

We should check for cases where QUIC makes the system unusable (10x slower or does not work at all).

If some cases are slower by tens of percent, this is fine. If there are cases where QUIC is 2x slower or similar, we can check that running a server with both TCP and QUIC is still as fast as TCP-only - and if so, it is fine to enable QUIC in main.

Multi-GPU support?

Are there docs on enabling multi-GPU support? If not, can we have a feature request to add this? Per the default install instructions, I'm only seeing GPU0 being utilized. Maybe it's as simple as adding a "device_map='auto'" in the model creation?

"NoneType object is not callable" on stopping P2P

I have a very simple inference testing script. No threading or any advanced stuff, basically "hello world" inference on Petals. Everything goes well, but when the script exits, I always get this error:

Exception ignored in: <function P2P.__del__ at 0x7f4ac1feed40>
Traceback (most recent call last):
  File "/home/dev/.local/lib/python3.10/site-packages/hivemind/p2p/p2p_daemon.py", line 632, in __del__
  File "/home/dev/.local/lib/python3.10/site-packages/hivemind/p2p/p2p_daemon.py", line 659, in _terminate
  File "/home/dev/.local/lib/python3.10/site-packages/multiaddr/multiaddr.py", line 254, in value_for_protocol
TypeError: 'NoneType' object is not callable

It is a rather cosmetic issue, but something is not OK there.

[CODE] Basic Inferencing API on hivemind.Server aka rpc_inference

Why: talk to a 176B model run on hundreds of small devices.

Implement an extended hivemind.Server that has forward/backward as usual, and an additional RPC named forward_incremental (stream<->stream)

Here's the protocol for forward_incremental:

  1. client sends a request containing:
    • requested layers
    • requested max sequence length
    • [optional: bid?]
  2. server responds with an info protobuf that contains:
    • bool accepted: if True, server decides to let the client run inference and will await the first request for T=10 seconds.
      • [optional: queue?]
    • [float queue length: 0 if accepted right now, N if it needs to wait for N other nodes to finish before running]
    • [float throughput: server's estimated computation time, including time in queue]
  3. client sends prefix embeddings:
    • Tensor prefix input_embeddings [1, prefix_length, hidden_size] with compression
    • [optional prefix attention mask [prefix_length, prefix_length], default = tril]
  4. server runs a forward pass, saves attention caches, and returns:
    • Tensor prefix output_embeddings [1, prefix_length, hidden_size] with compression
  5. client sends another token's input embeddings:
    • Tensor input_embeddings [1, 1, hidden_size] with compression
    • [optional prefix attention mask [1, prefix_length + prev_tokens], default = tril]
  6. server runs a forward pass with the attention cache and returns:
    • Tensor output_embeddings
    • current length

GOTO step 5 while current length <= max length
If client does not send ping in T seconds (maybe empty message if no data yet), server closes connection.
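A client-side sketch of this loop; the stream object, message fields, and embed_next_token helper are hypothetical placeholders meant only to make the message order concrete:

import torch

def embed_next_token(last_hidden_state):
    ...  # placeholder: run local logits, sample a token, return its [1, 1, hidden_size] embedding

def run_forward_incremental(stream, prefix_embeddings: torch.Tensor, max_length: int):
    # steps 1-2: handshake - request layers and a max sequence length, check that the server accepted
    stream.send({"requested_layers": (0, 8), "max_sequence_length": max_length})
    info = stream.receive()
    if not info["accepted"]:
        raise RuntimeError("server rejected the inference session")

    # steps 3-4: send the whole prefix once, receive the prefix outputs
    stream.send({"input_embeddings": prefix_embeddings})        # [1, prefix_length, hidden_size]
    outputs = stream.receive()["output_embeddings"]
    current_length = prefix_embeddings.shape[1]

    # steps 5-6: then one token at a time, reusing the server-side attention cache
    while current_length < max_length:
        next_token_embedding = embed_next_token(outputs[:, -1, :])
        stream.send({"input_embeddings": next_token_embedding})  # [1, 1, hidden_size]
        response = stream.receive()
        outputs = response["output_embeddings"]
        current_length = response["current_length"]
    return outputs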

Don't think about it:

  • support fixed max length for now, e.g. 1024 or 2048?
  • inference up to 256 steps excluding prefix - to ensure we don't spend too long with the same node?
  • select one or more of that node's consecutive layers to run inference on at once?
  • send more than one token at a time?
  • option to backtrack for a few tokens for beam search inference?
  • beam search with multiple hypotheses - and an option to reorder them internally?

Getting Petals to run on macOS

The primary motivation is:

  • to get as much high-bandwidth memory as possible, in a low-cost way (thanks to Apple's unified memory model)
  • to be easily used for training / inference
  • it's probably going to be slower than 3090s (I have no idea), but I don't think that's the point here
  • it could potentially also be used with a large number of "student laptops"

As of the latest beta, PyTorch includes optimizations for the M1 Metal GPU.

This presents an interesting possibility of scaling up more easily and affordably. For example, to hit 352 GB of memory...
(assuming up to 75% of a Mac's memory is allocated to the GPU; you could in theory go above 75%, but I suspect we need at least 25% for the OS and filesystem operations)

  • Number of nodes: 4
  • Ram allocated per node: 96GB (75% of 128GB)
  • Upfront cost of nodes: $23,200.00 ($5,800.00 / node)
  • Max KWh: 0.86 (0.215 KWh / node)

However, if you were to try to build this using A100s, for example:

  • Number of nodes: 5
  • Ram allocated per node: 80GB
  • Upfront cost of nodes: $65,000.00 ($13,000.00 / node)
  • Max KWh: 1.5 (0.300 KWh / node)
  • Price & Energy usage exclude overheads for CPU, RAM, Motherboard, Storage, Cooling, and networking

Also as outlined, an alternative would be 30 student laptops / Mac minis...

  • Number of nodes: 30
  • Ram allocated per node: 12GB (75% of 16GB)
  • Upfront cost**: $33,000.00 ($1,100 / node)
  • Max KWh**: 4.5 (0.150 KWh / node)
    ** not that it matters in this case

This makes it possibly one of the most accessible ways for students to set up a private swarm and try training on their own hardware in a data lab.

[RESEARCH] LM API merit system

Why:

  • contributors who support the swarm over long time should feel that they are appreciated
  • one client should not be able to DDOS the entire swarm - it should be prioritized according to some pre-defined system

Optional:

  • client that has higher point total may end up prioritized on the processing queue
  • contributors who run their GPUs should be motivated to use the model for something of theirs (free coupon effect)

Demo constraint: the first public version must not have a mechanism for converting internal points into anything except for priority usage and participant self-worth (e.g. via leaderboards).

Client-side code improvements [Yozh-todo-list]

Important stuff

  • optimal (instead of random) routing in RemoteSequenceManager
  • fault tolerance in RemoteSequentialInferenceSession
  • quick slicing with RemoteSequential[5:15] ( @TimDettmers )
  • BloomForCausalLM.generate ( @artek0chumak )
  • remove lm_head WEIGHT from checkpoint and/or model, convert BloomModel instead (to save 2x client RAM) [@dbaranchuk did it]

Minor stuff

This list contains all the things I've delegated to future me. They are typically too obscure for anyone else to worry about.

  • fault tolerance inside RemoteTransformerBlock (as opposed to RemoteSequential)

  • in SequenceInfo, keep block_infos as TimedStorage instead of just RemoteSequenceInfo for automatic removal of expired servers

  • make RemoteSequential print the state of the model ( @TimDettmers )

  • less frequent DHT lookups in RemoteSequenceInfo

  • change BloomForCausalLM.forward to call RemoteSequential instead of working layer-by-layer (why: to allow chaining subsequent blocks efficiently)

Note from #45 by @borzunov

Please consider running a loop instead, maybe using just --num_blocks without explicit --block_indices. If it gets more complicated than just repeating something N times, please move it to a Python script.

how to profile petals inference time cost

Hello all,

I launched my own swarm on 24 computers according to Launch-your-own-swarm.
Each computer has 1 GPU and 32 GB of memory, and hosts 3 blocks of petals-bloom.

It works fine but takes a long time per inference.

Following "Use the model", I only run inference with BLOOM without training the model.

import torch
import torch.nn.functional as F
import transformers
from src import DistributedBloomForCausalLM

initial_peers = [TODO_put_one_or_more_server_addresses_here]  # e.g. ["/ip4/127.0.0.1/tcp/more/stuff/here"]
tokenizer = transformers.BloomTokenizerFast.from_pretrained("bigscience/bloom-petals")
model = DistributedBloomForCausalLM.from_pretrained(
  "bigscience/bloom-petals", initial_peers=initial_peers, low_cpu_mem_usage=True, torch_dtype=torch.float32
)

inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
# use max_new_tokens instead of max_length 
remote_outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.decode(remote_outputs[0]))

I tried different max_new_tokens values, and the time cost is almost 2 s/token.

Is there a way to profile the inference performance? I wonder why it is so slow.
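A crude way to measure this from the client side (reusing the model and tokenizer from the snippet above; this only shows the average wall-clock cost per generated token, not a per-server breakdown):

import time

for n_tokens in (1, 10, 20):
    inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
    start = time.perf_counter()
    model.generate(inputs, max_new_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    # the first token also pays for routing and processing the whole prefix
    print(f"max_new_tokens={n_tokens}: {elapsed:.1f} s total, {elapsed / n_tokens:.2f} s/token")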

Thanks

Miscellaneous server-side improvements

  • task_pool

  • runtime

  • handler

    • verify what happens if servers have mixed torch_dtype and/or compression
    • actually follow forward/backward/inference schema instead of hard-coding
    • extract the code for adding prompts into a separate file
    • consider merging the code from hivemind's ConnectionHandler instead of inheriting
    • add a test that covers _rpc_inference with prompts
    • add a test that covers _rpc_inference with hypo-ids
  • MemoryCache

    • when running inference over multiple layers on the same server, avoid passing layer activations between cpu<->gpu by
      storing them in MemoryCache
      • before implementing this, gotta check if this will bring any performance benefit
    • LRU-offload stale cache from gpu to ram
  • point system

    • make sure points are integers everywhere
    • implement a nonzero prioritizer :)
    • move client-side spending policy to sequence_manager

[CODE] miscellaneous small issues for later

Things that can be done to improve the code, but were left out to launch MVP faster:

  • server-side: connection_handler, backend, runtime
    • modify task pool to pass cache handles as pure-Python integers? (currently they are converted to tensors)
    • when running inference over multiple layers on the same server, avoid passing layer activations between cpu<->gpu by
      storing them in MemoryCache
      - moved to #68
    • optimize disk space. Right now, a server will eventually download all BLOOM blocks and store them in the HF cache. Check for disk space in advance and/or figure out some cache eviction policy.
  • server-side: MemoryCache
    • in allocate_cache, if there is not enough memory, wait for memory to be freed by existing tasks up to a given timeout.
      - note: this can be done using mp.Condition
    • allocate cache as one contiguous buffer to avoid fragmentation
      - note: this feature is active as of #779959bc; we will eventually switch back to the non-cached version. Rationale: we did not observe significant issues from fragmentation, but contiguous buffers did complicate the code
    • quantize cached values using bitsandbytes
      - wontfix (as of 2022.01.02): our current code relies on transformers' default bloom implementation, so we can't intervene in attention internals
    • LRU-offload cache from gpu to ram?
      - moved to #68
  • client-side: internals
    • make begin_inference_session into a contextmanager
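For the last client-side item, a hedged sketch of the context manager wrapper (the begin_inference_session and close method names are hypothetical placeholders for whatever the client actually exposes):

from contextlib import contextmanager

@contextmanager
def inference_session(client, max_length: int):
    session = client.begin_inference_session(max_length=max_length)  # hypothetical method
    try:
        yield session
    finally:
        session.close()  # always release server-side attention caches

# usage sketch:
# with inference_session(remote_sequential, max_length=2048) as sess:
#     outputs = sess.step(hidden_states)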

[CODE] optimize BloomAttention, remove unnecessary host-to-device transfers

TL;DR: the current code for BloomAttention is surprisingly inefficient, with obvious problems that should be easy to fix.

It also misses a ton of opportunities for in-place ops for memory savings, but that's the least of its problems.

[CODE] fault-tolerant inference (client side)

Why: this is the main "value added" by the LM API: users can open a Colab notebook and run a gigantic model.

Depends on #3

Solution (sketch)
Each client holds embeddings/logits on their side (loads a special HuggingFace model)

Inputs: prefix(tokens), sampling parameters (top-k, top-p, etc)

Init:

  1. tokenize prefix tokens and compute embeddings
  2. for each stage, send out K initial requests to random servers in parallel (no tensors are sent yet). Servers respond with their throughput, queue sizes, and latency. Pick the best server in each group.
    • figure out how servers represent themselves in DHT
    • [heuristic] if the chosen server has several consecutive stages, always choose to run all these stages.
  3. client finds a sequence of servers that collectively have all model layers
    • [optional] record non-selected servers locally as backups for generation
  4. client runs prefix through all stages, stores inter-stage activations

Generation:

  6. get the last prefix token embedding from the final stage
  7. run it through the logits, run top-k/top-p/whatever inference
    • optional: use knn (e.g. hnsw from faiss) to quickly find the nearest token; alternative = Colab GPU
  8. embed this token using local embeddings
  9. send it through one pipeline stage at a time
    • on each stage, add one more vector to prefix_embeddings
  10. GOTO 6

Fault recovery:

  • If a given pipeline stage fails, find a replacement stage using the same procedure as during initial stage selection
  • feed the new stage with locally_saved intermediate embeddings that serve as inputs to that stage
  • continue inferencing normally
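A control-flow sketch of this recovery loop (every helper and argument here is a hypothetical placeholder; the point is only the retry-and-replay structure):

def find_replacement_server(stage_index):
    ...  # placeholder: same procedure as the initial stage selection

def run_stage_with_recovery(stage_index, stage_inputs, servers, saved_activations):
    saved_activations[stage_index] = stage_inputs  # keep the stage's inputs for future replays
    while True:
        try:
            return servers[stage_index].forward(stage_inputs)        # hypothetical RPC call
        except ConnectionError:
            servers[stage_index] = find_replacement_server(stage_index)
            stage_inputs = saved_activations[stage_index]            # feed the new stage the saved inputs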

The prompt tuning example (prompt-tuning-sst2) doesn't work

Hi,
I have tried to run the notebook example you have published (without editing) in Colab, but it doesn't work...
I get the following error:

RuntimeError                              Traceback (most recent call last)
[<ipython-input-12-7f7d7fa267e9>](https://localhost:8080/#) in <module>
     17 
     18         model.train()
---> 19         outputs = model(**batch)
     20         loss = outputs.loss
     21         loss.backward()

3 frames
[/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py](https://localhost:8080/#) in forward(self, input)
    112 
    113     def forward(self, input: Tensor) -> Tensor:
--> 114         return F.linear(input, self.weight, self.bias)
    115 
    116     def extra_repr(self) -> str:

RuntimeError: expected scalar type Half but found Float

It is probably due to an out-of-date transformers version, but petals 1.1.1 requires transformers==4.25.1.

Integration with gensyn.ai

tl;dr: This is just an idea that I unfortunately don't have time to work on right now, but it would be cool if anyone picked it up.

I learnt about Petals from Simon William's Blog Post and thought that tight "2nd party" integration with gensyn.ai could be interesting.

Gensyn.ai is a blockchain-based, incentive-driven distributed computing network, and Petals is a framework for distributed LLM training.

If/when I ever have the time, this would be fun to work on, but I'm just jotting down the idea publicly for now.

Add prompt tuning to the basic example

The following code can be added to the basic example to run prompt tuning and demonstrate that it works.

Quest: go to the main example notebook (in the README) and add some simple example based on prompt tuning.

Here's a basic example. I'm not sure it isn't missing something, but if it is, we can help add it back.

import torch
from transformers import BloomTokenizerFast
from petals.client import DistributedBloomForCausalLM
assert 'model' not in globals(), "please restart the kernel"
 
MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME, padding_side='right')
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME, tuning_mode='deep_ptune', pre_seq_len=3
).cuda()

inputs = tokenizer("A quick brown fox ", return_tensors="pt")["input_ids"].cuda()
remote_outputs = model.generate(inputs, max_new_tokens=7)
print("generated:", tokenizer.decode(remote_outputs[0]))
 
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

the_fox_is_innocent = tokenizer("A quick brown fox did not jump over the lazy dog", return_tensors="pt")["input_ids"].cuda()
for i in range(50):
  loss = model(input_ids=the_fox_is_innocent, labels=the_fox_is_innocent).loss
  opt.zero_grad()
  loss.backward()
  opt.step()
  print(f"loss[{i}] = {loss.item():.3f}")

inputs = tokenizer("A quick brown fox ", return_tensors="pt")["input_ids"].cuda()
remote_outputs = model.generate(inputs, max_new_tokens=7)
print("generated:", tokenizer.decode(remote_outputs[0]))

More convenient test runner?

[as suggested by @GreenFatGuy ]

As of now, running tests locally is inconvenient as it requires spinning up servers.
We cannot afford to launch servers for every test as it will increase CI time more than 5x.

Are there any standard tricks that allow pytest to spin up the necessary servers automatically, but use the same set of servers in all tests?
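One standard trick is a session-scoped pytest fixture, sketched below (launch_test_servers is a hypothetical placeholder for whatever actually starts a local swarm):

# conftest.py sketch: start the servers once per pytest session and reuse them in every test
import pytest

@pytest.fixture(scope="session")
def test_swarm():
    servers = launch_test_servers(num_servers=2, blocks_per_server=12)  # hypothetical helper
    try:
        yield servers  # every test requesting this fixture shares the same servers
    finally:
        for server in servers:
            server.shutdown()

def test_forward_pass(test_swarm):
    ...  # connect a client to test_swarm's initial peers and run assertions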

[RESEARCH] 8-bit bloom / opt

Why: the more stages we can fit on the same GPU, the lower the latency users will get on everything.

  • investigate ways to cast model to 8 bits
  • investigate ways to use less memory for attention cache
    • quantize? sparsify?

@TimDettmers knows this part best, ask him before we do anything

[DESIGN] auction-like priorities for servers

[for the record: this was proposed by @TimDettmers ]

Currently, hivemind-server treats all requests on a first-come, first-served basis.
If we want to reward active participants with faster inference/training, we could change that into an auction.

Here's what the client-server interaction looks like:

  • server gives client its stats, the current-highest bid, and maybe some metadata for bidding, e.g. the lowest serviced bids over last T seconds
  • client makes a bid - and signs it in such a way that it becomes a commitment ( see #6 )
  • in TaskPool.iterate_minibatches, server will now generate minibatches in the order of highest bid first
  • in TaskPool.priority, server will now set pool's priority based on highest bid in the pool, instead of wait time

As suggested by @GreenFatGuy , we need to think through how to deal with situations when low bids on high-demand servers won't ever be processed, and will hence take up memory on both client and server. First order solution: add absolute expiration time to each request, drop requests that hit expiration time.
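A toy sketch of bid-ordered scheduling with absolute expiration times (this is not the TaskPool implementation, just the ordering and expiry idea described above):

import heapq
import time

class BidQueue:
    def __init__(self):
        self._heap = []  # entries: (-bid, arrival_time, expires_at, task); bid negated for a max-heap

    def submit(self, task, bid: float, expires_at: float):
        heapq.heappush(self._heap, (-bid, time.monotonic(), expires_at, task))

    def next_batch(self, batch_size: int):
        batch = []
        while self._heap and len(batch) < batch_size:
            neg_bid, _, expires_at, task = heapq.heappop(self._heap)
            if time.monotonic() > expires_at:
                continue  # drop requests that hit their expiration time
            batch.append((task, -neg_bid))
        return batch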

Roadmap (tentative)

Current tasks:

  • prototype bloom points system @borzunov (#6 )
  • local tensor parallelism ( #143 , using BlackSamorez/tensor_parallel by @BlackSamorez and @IaroslavLisniak )
  • increase default max sequence length (from #146 )
  • allow running a server without open ports @Vahe1994
  • option to download pre-quantized blocks (@mryab )
  • improved routing (@justheuristic )
  • newest-latest libp2p - @Vahe1994
  • touch up fine-tuning examples, make sure they work in reasonable time ( @justheuristic )
  • a way to temporarily shutdown petals server
    • suggested by @craffel : when running a petals server on a machine that is often in use, people should be able to shut off petals servers while running their experiments
    • suggested behavior: shut down asap, restart once gpus are not in use for T minutes
  • Wanna contribute?
    • go to our discord server and ask around!
    • always in demand:
      • contribute examples (recommended but not required: create an issue / draft first, before you code them)
      • OS support / hardware support ( e.g. see #147 )
      • more models: OPT-175B, switch-XXL, whatever comes into fashion
      • host a server! (see README)

End of December: cover more use cases

  • tutorial: generation notebook
  • tutorial: prompt-tuning notebook
  • PreLayer prompt-tuning - mentioned as one of the baselines in https://arxiv.org/abs/2106.09685 - DONE
  • inference with prompt-tuning ( #13 by @artek0chumak)
  • advanced inference: beam search, constraints/fusion, LoRA/AdaMix ( @artek0chumak , #13 )
  • some kind of hub for tutorials, e.g. a minimalistic website
  • alpha test: let more people play with 176B model (where? no-brainer: bigscience, stability, discord)
  • rich inference interface for designing custom generation algorithms (by: @artek0chumak )
  • let servers run requests with different priorities ( #8 by: @GreenFatGuy )
  • By this point, we must answer the main questions: (1) will people use it? (2a) what for? (2b) why not?

End of July-August: make it reliable, test with early adopters

End of June: build a proof-of-concept

  • agree on the user interface (see #5 (comment) )
  • run simple (but correct!) inference with a smaller model (for generation)
  • do simple (but correct!) forward/backward with frozen layers (for prompt tuning)
  • client can dynamically choose which remote servers to use for inference ( by: @justheuristic )
  • create basic correctness tests for later
  • check if 8-bit compression is remotely feasible ( by: @TimDettmers )
  • it's okay if the code is not super reliable for now
  • it's okay if servers have to be set up manually for now
  • begin investigating: quantized weights, quantized communication, automatic server allocation, "bloom points"

Important, but not urgent:

Error: BFloat16 Unsupported scalar when trying to execute across multiple GPUs with BFloat16 & 8-Bits

I tried to run BLOOM distributed across multiple A100 GPUs with 8-bit quantization and BFloat16, but ran into this error while trying to execute a slightly adjusted version of the example script:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA SETUP: CUDA runtime path found: /datadrive/miniconda3/envs/petals/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
Oct 18 09:52:07.795 [WARN] [/datadrive/repos/petals/src/client/remote_sequential.py.__init__:34] RemoteSequential is in active development; expect adventures
Some weights of DistributedBloomForCausalLM were not initialized from the model checkpoint at bloom-testing/test-bloomd-560m-main and are newly initialized: ['lm_head.word_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/datadrive/repos/petals/simple_test_script.py", line 17, in <module>
    remote_outputs = model.generate(inputs, max_length=100)
  File "/datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/datadrive/repos/petals/src/client/remote_generation.py", line 113, in generate
    hidden_state = sess.step(embs, prompts=intermediate_prompts, hypo_ids=hypo_ids)[:, -1]
  File "/datadrive/repos/petals/src/client/inference_session.py", line 200, in step
    outputs = session.step(inputs, prompts[self.chosen_spans[0].start : self.chosen_spans[0].end], **kwargs)
  File "/datadrive/repos/petals/src/client/inference_session.py", line 109, in step
    tensors=[
  File "/datadrive/repos/petals/src/client/inference_session.py", line 110, in <listcomp>
    serialize_torch_tensor(tensor.to(proto.dtype), proto.compression)
  File "/datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/hivemind/compression/serialization.py", line 41, in serialize_torch_tensor
    return compression.compress(tensor, info, allow_inplace)
  File "/datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/hivemind/compression/base.py", line 83, in compress
    array = tensor.detach().numpy()
TypeError: Got unsupported ScalarType BFloat16

The code of the example script:

import os

import torch
import torch.nn.functional as F
import transformers
from src import DistributedBloomForCausalLM

MODEL_NAME = "bloom-testing/test-bloomd-560m-main"  # "bigscience/bloom-petals"
initial_peer = os.getenv("initial_peer")
initial_peers = [initial_peer]  # e.g. ["/ip4/127.0.0.1/tcp/more/stuff/here"]
tokenizer = transformers.BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME, initial_peers=initial_peers, low_cpu_mem_usage=True, torch_dtype=torch.float32
)  # this model has only embeddings / logits, all transformer blocks rely on remote servers

# model = model.to('cuda')
inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
remote_outputs = model.generate(inputs, max_length=100)
print(tokenizer.decode(remote_outputs[0]))  # "a cat sat in the back of the car,"

# "train" input embeddings by backprop through distributed transformer blocks
model.transformer.word_embeddings.weight.requires_grad = True
outputs = model.forward(input_ids=inputs)
loss = F.cross_entropy(outputs.logits.flatten(0, 1), inputs.flatten())
loss.backward()
print("Gradients (norm):", model.transformer.word_embeddings.weight.grad.norm())

Server launched via commands:

python -m cli.run_server bloom-testing/test-bloomd-560m-main --num_blocks 12 --torch_dtype bfloat16 --host_maddrs /ip4/0.0.0.0/tcp/31337 --load_in_8bit

python -m cli.run_server bloom-testing/test-bloomd-560m-main  --torch_dtype bfloat16 --host_maddrs /ip4/127.0.0.1/tcp/0 --load_in_8bit --initial_peers /ip4/127.0.0.1/tcp/31337/p2p/QmTHnjwKQFzvxrPesrSjtaL5eKUVdHfLsxV87vx8RFH21U --block_indices 12:24 --device cuda:1

Packages in the environment (installed via requirements.txt):

# packages in environment at /datadrive/miniconda3/envs/petals:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
accelerate                0.10.0                   pypi_0    pypi
aiohttp                   3.8.3                    pypi_0    pypi
aiosignal                 1.2.0                    pypi_0    pypi
asttokens                 2.0.5              pyhd3eb1b0_0  
async-timeout             4.0.2                    pypi_0    pypi
attrs                     22.1.0                   pypi_0    pypi
backcall                  0.2.0              pyhd3eb1b0_0  
base58                    2.1.1                    pypi_0    pypi
bitsandbytes              0.34.0                   pypi_0    pypi
blas                      1.0                         mkl  
brotlipy                  0.7.0           py39h27cfd23_1003  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2022.07.19           h06a4308_0  
certifi                   2022.9.24        py39h06a4308_0  
cffi                      1.15.1           py39h74dc2b5_0  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
click                     8.1.3                    pypi_0    pypi
configargparse            1.5.3                    pypi_0    pypi
cryptography              37.0.1           py39h9ce1e76_0  
cudatoolkit               11.3.1               h2bc3f7f_2  
datasets                  2.5.2                    pypi_0    pypi
debugpy                   1.5.1            py39h295c915_0  
decorator                 5.1.1              pyhd3eb1b0_0  
dill                      0.3.5.1                  pypi_0    pypi
docker-pycreds            0.4.0                    pypi_0    pypi
entrypoints               0.4              py39h06a4308_0  
executing                 0.8.3              pyhd3eb1b0_0  
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.8.0                    pypi_0    pypi
freetype                  2.11.0               h70c0345_0  
frozenlist                1.3.1                    pypi_0    pypi
fsspec                    2022.8.2                 pypi_0    pypi
giflib                    5.2.1                h7b6447c_0  
gitdb                     4.0.9                    pypi_0    pypi
gitpython                 3.1.29                   pypi_0    pypi
gmp                       6.2.1                h295c915_3  
gnutls                    3.6.15               he1e5248_0  
grpcio                    1.49.1                   pypi_0    pypi
grpcio-tools              1.48.2                   pypi_0    pypi
hivemind                  1.1.1                    pypi_0    pypi
huggingface-hub           0.7.0                    pypi_0    pypi
humanfriendly             10.0                     pypi_0    pypi
idna                      3.3                pyhd3eb1b0_0  
intel-openmp              2021.4.0          h06a4308_3561  
ipykernel                 6.15.2           py39h06a4308_0  
ipython                   8.4.0            py39h06a4308_0  
jedi                      0.18.1           py39h06a4308_1  
jpeg                      9e                   h7f8727e_0  
jupyter_client            7.3.5            py39h06a4308_0  
jupyter_core              4.11.1           py39h06a4308_0  
lame                      3.100                h7b6447c_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libdeflate                1.8                  h7f8727e_5  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libiconv                  1.16                 h7f8727e_2  
libidn2                   2.3.2                h7f8727e_0  
libpng                    1.6.37               hbc83047_0  
libsodium                 1.0.18               h7b6447c_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.16.0               h27cfd23_0  
libtiff                   4.4.0                hecacb30_0  
libunistring              0.9.10               h27cfd23_0  
libwebp                   1.2.4                h11a3e52_0  
libwebp-base              1.2.4                h5eee18b_0  
lz4-c                     1.9.3                h295c915_1  
matplotlib-inline         0.1.6            py39h06a4308_0  
mkl                       2021.4.0           h06a4308_640  
mkl-service               2.4.0            py39h7f8727e_0  
mkl_fft                   1.3.1            py39hd3c417c_0  
mkl_random                1.2.2            py39h51133e4_0  
msgpack                   1.0.4                    pypi_0    pypi
multiaddr                 0.0.9                    pypi_0    pypi
multidict                 6.0.2                    pypi_0    pypi
multiprocess              0.70.13                  pypi_0    pypi
ncurses                   6.3                  h5eee18b_3  
nest-asyncio              1.5.5            py39h06a4308_0  
netaddr                   0.8.0                    pypi_0    pypi
nettle                    3.7.3                hbbd107a_1  
numpy                     1.23.1           py39h6c91a56_0  
numpy-base                1.23.1           py39ha15fc14_0  
openh264                  2.1.1                h4ff587b_0  
openssl                   1.1.1q               h7f8727e_0  
packaging                 21.3               pyhd3eb1b0_0  
pandas                    1.5.0                    pypi_0    pypi
parso                     0.8.3              pyhd3eb1b0_0  
pathtools                 0.1.2                    pypi_0    pypi
pexpect                   4.8.0              pyhd3eb1b0_3  
pickleshare               0.7.5           pyhd3eb1b0_1003  
pillow                    9.2.0            py39hace64e9_1  
pip                       22.2.2           py39h06a4308_0  
prefetch-generator        1.0.1                    pypi_0    pypi
promise                   2.3                      pypi_0    pypi
prompt-toolkit            3.0.20             pyhd3eb1b0_0  
protobuf                  3.20.3                   pypi_0    pypi
psutil                    5.9.2                    pypi_0    pypi
ptyprocess                0.7.0              pyhd3eb1b0_2  
pure_eval                 0.2.2              pyhd3eb1b0_0  
pyarrow                   9.0.0                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pydantic                  1.10.2                   pypi_0    pypi
pygments                  2.11.2             pyhd3eb1b0_0  
pymultihash               0.8.2                    pypi_0    pypi
pyopenssl                 22.0.0             pyhd3eb1b0_0  
pyparsing                 3.0.9            py39h06a4308_0  
pysocks                   1.7.1            py39h06a4308_0  
python                    3.9.13               haa1d7c7_1  
python-dateutil           2.8.2              pyhd3eb1b0_0  
pytorch                   1.12.1          py3.9_cuda11.3_cudnn8.3.2_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2022.4                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
pyzmq                     23.2.0           py39h6a678d5_0  
readline                  8.1.2                h7f8727e_1  
regex                     2022.9.13                pypi_0    pypi
requests                  2.28.1           py39h06a4308_0  
responses                 0.18.0                   pypi_0    pypi
scipy                     1.9.2                    pypi_0    pypi
sentry-sdk                1.9.10                   pypi_0    pypi
setproctitle              1.3.2                    pypi_0    pypi
setuptools                63.4.1           py39h06a4308_0  
shortuuid                 1.0.9                    pypi_0    pypi
six                       1.16.0             pyhd3eb1b0_1  
smmap                     5.0.0                    pypi_0    pypi
sortedcontainers          2.4.0                    pypi_0    pypi
sqlite                    3.39.3               h5082296_0  
stack_data                0.2.0              pyhd3eb1b0_0  
tk                        8.6.12               h1ccaba5_0  
tokenizers                0.12.1                   pypi_0    pypi
torchaudio                0.12.1               py39_cu113    pytorch
torchvision               0.13.1               py39_cu113    pytorch
tornado                   6.2              py39h5eee18b_0  
tqdm                      4.64.1                   pypi_0    pypi
traitlets                 5.1.1              pyhd3eb1b0_0  
transformers              4.21.3                   pypi_0    pypi
typing_extensions         4.3.0            py39h06a4308_0  
tzdata                    2022c                h04d1e81_0  
urllib3                   1.26.11          py39h06a4308_0  
uvloop                    0.17.0                   pypi_0    pypi
varint                    1.0.2                    pypi_0    pypi
wandb                     0.13.4                   pypi_0    pypi
wcwidth                   0.2.5              pyhd3eb1b0_0  
wheel                     0.37.1             pyhd3eb1b0_0  
xxhash                    3.0.0                    pypi_0    pypi
xz                        5.2.6                h5eee18b_0  
yarl                      1.8.1                    pypi_0    pypi
zeromq                    4.3.4                h2531618_0  
zlib                      1.2.12               h5eee18b_3  
zstd                      1.5.2                ha4553b6_0

I used the small model only for debugging purposes; I need to distribute the model across multiple GPUs since I intend to run the 176B BLOOM version. I tried to naively convert the tensor at that line to a supported dtype, but then another error occurred further down the line.

Since I want to do prompt tuning on 8x 40GB A100s, I think I have to use BFloat16 & 8-bit, or is there another solution/workaround with good performance?
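
For context on the traceback above: NumPy has no bfloat16 dtype, so torch.Tensor.numpy() fails for bfloat16 tensors with exactly this error. Below is a minimal illustration of the failure and of the generic upcast-before-conversion pattern; it is a sketch of the underlying limitation, not necessarily the right fix inside hivemind's compression code:

import torch

x = torch.randn(2, 3, dtype=torch.bfloat16)
# x.numpy()  # raises TypeError: Got unsupported ScalarType BFloat16 (NumPy has no bfloat16 dtype)
x_np = x.to(torch.float32).numpy()  # upcasting to float32 first makes the conversion possible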

Tensor parallelism exception

Dual P100 setup:

(petals) qtr@pve:~/petals$ python -m petals.cli.run_server bigscience/bloomz-petals --tensor_parallel_devices cuda:0 cuda:1 --num_blocks 6
Jan 31 07:28:41.703 [INFO] Running Petals 1.1.2
Jan 31 07:28:56.893 [INFO] Direct reachability: 3/5
Jan 31 07:28:56.937 [INFO] This server will run DHT in full peer mode
Jan 31 07:28:59.927 [INFO] Connecting to the public swarm, peer_id = 12D3KooWMMiL1fw5oPpsegj4E1iu7Woqqc9rLzjrGoannTdJNxY1
Jan 31 07:29:00.045 [INFO] Model weights will be split between cuda:0, cuda:1
Jan 31 07:29:00.082 [WARN] [/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/server.py.__init__:167] Tensor parallelism doesn't work properly with 8-bit weights yet, loading weights in 16-bit. You can explicitly set `--load_in_8bit True` to override this
Jan 31 07:29:00.082 [INFO] Model weights will be loaded in auto format
Jan 31 07:29:00.082 [INFO] Attention cache for all blocks will consume up to 3.00 GiB
Jan 31 07:29:00.084 [INFO] Loading throughput info
Jan 31 07:29:01.221 [INFO] Announced that blocks [7, 8, 9, 10, 11, 12] are joining
Jan 31 07:29:01.622 [INFO] Reachability service started
Jan 31 07:29:29.358 [INFO] Loaded bigscience/bloomz-petals block 7, <All keys matched successfully>
Jan 31 07:30:04.819 [INFO] Loaded bigscience/bloomz-petals block 8, <All keys matched successfully>
Jan 31 07:30:35.309 [INFO] Loaded bigscience/bloomz-petals block 9, <All keys matched successfully>
Jan 31 07:31:05.673 [INFO] Loaded bigscience/bloomz-petals block 10, <All keys matched successfully>
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [03:15<00:00, 25.2MB/s]
Jan 31 07:34:51.803 [INFO] Loaded bigscience/bloomz-petals block 11, <All keys matched successfully>
Jan 31 07:35:30.357 [INFO] Loaded bigscience/bloomz-petals block 12, <All keys matched successfully>
Jan 31 07:35:34.023 [INFO] Server is reachable from the Internet. It will appear at http://health.petals.ml soon
Jan 31 07:35:35.407 [INFO] Started
Jan 31 08:16:10.959 [INFO] rpc_inference.open(blocks=7:13, remote_peer=...AM4oxF)
Jan 31 08:16:10.962 [INFO] rpc_inference.wait_for_alloc(size=0.16 GiB), already used 0.00/3.00 GiB (0.0%)
Jan 31 08:16:10.965 [INFO] rpc_inference.alloc(size=0.16 GiB)
Jan 31 08:16:11.371 [ERROR] [hivemind.moe.server.runtime.run:104] Caught indices should be either on cpu or on the same device as the indexed tensor (cuda:1), attempting to recover
Traceback (most recent call last):
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/hivemind/moe/server/runtime.py", line 91, in run
    outputs = pool.process_func(*batch)
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/backend.py", line 175, in __call__
    (hidden_states,) = self.backends[inference_info.uid].inference_step(hidden_states, hypo_ids, inference_info)
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/backend.py", line 92, in inference_step
    self._reorder_cache_inplace(cache_tensors, hypo_ids)
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/backend.py", line 102, in _reorder_cache_inplace
    cache_tensor[...] = cache_tensor[hypo_ids]  # in-place reorder cache by hypo ids
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
Jan 31 08:16:11.407 [INFO] rpc_inference.close(blocks=7:13, remote_peer=...AM4oxF)
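
For reference, a minimal reproduction of the device mismatch behind this traceback (assumes two visible GPUs; an illustration only, not the actual Petals fix):

import torch

cache = torch.zeros(4, 8, device="cuda:1")              # the attention cache lives on the second GPU
hypo_ids = torch.tensor([0, 2, 1, 3], device="cuda:0")  # the index tensor arrives on a different GPU
# cache[...] = cache[hypo_ids]  # raises: indices should be either on cpu or on the same device as the indexed tensor
cache[...] = cache[hypo_ids.to(cache.device)]           # moving the indices to the cache's device avoids the mismatch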

Outdated server version warning

  • Add a warning message when starting a Petals server if the version in use is outdated.
    • This should prevent servers from joining the swarm when they are not up-to-date.

BONUS: also warn currently running servers about a new update, and ban them after a defined grace period if the update has not been performed.
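
A minimal sketch of what such a startup check could look like; the constants and the source of the minimum version are assumptions, not an existing Petals mechanism:

from packaging import version

LOCAL_VERSION = "1.1.2"    # version of the server that is starting up
MINIMUM_VERSION = "1.1.5"  # hypothetically fetched from the swarm or a well-known URL

def check_version(local: str, minimum: str) -> None:
    # Refuse to join the swarm (or at least warn loudly) if this server is too old
    if version.parse(local) < version.parse(minimum):
        raise RuntimeError(f"Petals {local} is outdated; please upgrade to >= {minimum} before joining the swarm")

check_version(LOCAL_VERSION, MINIMUM_VERSION)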

chat.petals integration with Hugging Face Spaces

Currently, you can use the model either directly via this notebook or through http://chat.petals.ml/.

The latter is a self-hosted chat app that runs code similar to the colab notebook, but wraps it with a simple web server. The source code is available here.

It would be great if anybody could fork the chat project and reuse it for their pet projects. The thing is, not everyone has a server to host that web app, and it is not always convenient to share Jupyter notebooks for everything.

To make it easier to create similar apps, we can create an example with Spaces. Hugging Face Spaces lets you run simple web apps on infrastructure paid for by HF.

References:

Task: create something similar to chat.petals on top of HF Spaces. One way to do this is to fork @younesbelkada's project and update it to the latest stable version of Petals.
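
For instance, a minimal (hypothetical) Gradio app for a Space could wrap the distributed client shown elsewhere in this document; the model name is a placeholder, the import path assumes the packaged library rather than the src layout used in the bug report above, and this is not the actual chat.petals code:

import gradio as gr
import transformers
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"  # placeholder; any Petals-served model should work

tokenizer = transformers.BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

def complete(prompt: str) -> str:
    # Run generation through the public swarm and return the decoded text
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0])

gr.Interface(fn=complete, inputs="text", outputs="text").launch()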

Running on SLURM

Having to hard-code IP addresses makes it very hard to run Petals on a SLURM cluster. There, I submit batch jobs that are then run on some node of the partition I specified, so I do not know beforehand the IP of the node (or of any nodes) that will run a Petals server instance.

So one thing that would be helpful is "self discovery" of Petals server instances inside a specified network.
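
One possible workaround, sketched below under the assumption that all nodes share a filesystem: each job resolves its own IP at start-up, publishes its multiaddr to a shared rendezvous file, and later jobs read that file to build their initial_peers list. The path, file format, and helper names are hypothetical, not an existing Petals feature.

import socket
from pathlib import Path

RENDEZVOUS = Path("/shared/petals_initial_peers.txt")  # hypothetical file visible to all SLURM nodes

def my_ip() -> str:
    # Resolve this node's primary IP at runtime instead of hard-coding it in the batch script
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 80))  # no packets are actually sent; this only selects the outgoing interface
    ip = s.getsockname()[0]
    s.close()
    return ip

def publish(peer_maddr: str) -> None:
    # Called by the bootstrap server once its /ip4/.../p2p/... address is known
    with RENDEZVOUS.open("a") as f:
        f.write(peer_maddr + "\n")

def discover() -> list:
    # Later jobs pass these addresses as initial_peers
    return RENDEZVOUS.read_text().split() if RENDEZVOUS.exists() else []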

Implement standard inference modes

It would be awesome to have a collection of standard inference methods, e.g. greedy, temperature, top-p, top-k, and eventually tree/beam search and batched inference once we support them on the backend.

For now, perhaps, it would be best to implement them as standalone functions / functors that take the full model as input and do the inference. Eventually, we'll figure out the best way of integrating them together.
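
As a concrete (hypothetical) example of the standalone-function style described above, a top-k sampler that only assumes a causal LM returning .logits might look like this; greedy, temperature, and top-p variants would follow the same shape with a different filtering step:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_top_k(model, input_ids, max_new_tokens=20, k=50, temperature=1.0):
    for _ in range(max_new_tokens):
        logits = model(input_ids=input_ids).logits[:, -1, :] / temperature
        topk_vals, topk_idx = torch.topk(logits, k)                    # keep only the k most likely tokens
        probs = F.softmax(topk_vals, dim=-1)
        next_token = topk_idx.gather(-1, torch.multinomial(probs, 1))  # sample within the top-k set
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids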

Roadmap (tentative)

  1. sampling, greedy, top-k, nucleus, etc -- with obligatory support for prefixes
  2. inference with prompt-tuned model
  3. beam search (requires changes on backend)

.. and then, in no particular order,

  • inference with LoRA / AdaMix
  • user-defined constraints, other crazy stuff

[DESIGN] user experience

This issue is about how to make distributed BLOOM user-friendly. See the comments below for an interface prototype.

A bunch of things that we may want to make alongside the tech so that people can... you know... use it :)

  • basic website (see training-transformers-together.github.io/ )

    • what to expect from it?
      • like openai api
    • why can't i just use ram/ssd offload?
      • it's faster together; passing activations for large models is faster than swapping layers
    • legal disclaimer: by using our model you agree to follow the model's license, read more here
  • inference notebook

    • as responsive as possible, print words as they appear (see the streaming sketch after this list)
    • option to show what happens under the hood (i.e. running through this guy; that guy disconnected; waiting for this)
    • [optional] make it easy to customize sampling method
  • prompt-tuning notebook

    • pick some task that can be solved relatively quickly with visible results
    • TODO ask @artek0chumak on best practices
    • [optional] explain how to add other adapters / ptuning modes
    • [optional] inference prompt-tuned model? push adapters to HF?
    • [optional] collaborative prompt-tuning?
  • inference website

    • the default option can simply run inference notebook on server hosted by us
  • System Demonstration
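
As referenced in the inference-notebook item above, here is a minimal streaming-print sketch that assumes an HF-style generate() API; it naively re-sends the whole prefix on every step, while the real notebook would likely use a more efficient dedicated inference session:

import sys
import torch

@torch.no_grad()
def stream_generate(model, tokenizer, prompt, max_new_tokens=30):
    # Print each new token as soon as it is produced instead of waiting for the full sequence
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    for _ in range(max_new_tokens):
        input_ids = model.generate(input_ids, max_new_tokens=1, do_sample=False)
        sys.stdout.write(tokenizer.decode(input_ids[0, -1:]))
        sys.stdout.flush()
    print()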
