bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading

Home Page: https://petals.dev

License: MIT License

Languages: Python 99.73%, Shell 0.05%, Dockerfile 0.21%
Topics: bloom, deep-learning, distributed-systems, language-models, large-language-models, machine-learning, neural-networks, pytorch, volunteer-computing, pipeline-parallelism, tensor-parallelism, guanaco, llama, chatbot, gpt, transformer, nlp, pretrained-models, llama2, falcon

petals's Introduction


Run large language models at home, BitTorrent-style.
Fine-tuning and inference up to 10x faster than offloading


Generate text with distributed Llama 2 (70B), Falcon (40B+), BLOOM (176B) (or their derivatives), and fine‑tune them for your own tasks — right from your desktop computer or Google Colab:

from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Choose any model available at https://health.petals.dev
model_name = "petals-team/StableBeluga2"  # This one is fine-tuned Llama 2 (70B)

# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run the model as if it were on your computer
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat...

🚀  Try now in Colab

🔏 Privacy. Your data will be processed with the help of other people in the public swarm. Learn more about privacy here. For sensitive data, you can set up a private swarm among people you trust.

🦙 Want to run Llama 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, then run huggingface-cli login in the terminal before loading the model. Or just try it in our chatbot app.

💬 Any questions? Ping us in our Discord!

Connect your GPU and increase Petals capacity

Petals is a community-run system — we rely on people sharing their GPUs. You can check out the available models and help serve one of them! As an example, here is how to host a part of Stable Beluga 2 on your GPU:

🐧 Linux + Anaconda. Run these commands for NVIDIA GPUs (or follow this for AMD):

conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install git+https://github.com/bigscience-workshop/petals
python -m petals.cli.run_server petals-team/StableBeluga2

🪟 Windows + WSL. Follow this guide on our Wiki.

🐋 Docker. Run our Docker image for NVIDIA GPUs (or follow this for AMD):

sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
    learningathome/petals:main \
    python -m petals.cli.run_server --port 31330 petals-team/StableBeluga2

🍏 macOS + Apple M1/M2 GPU. Install Homebrew, then run these commands:

brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2

📚  Learn more (how to use multiple GPUs, start the server on boot, etc.)

💬 Any questions? Ping us in our Discord!

🦙 Want to host Llama 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, generate an 🔑 access token, then add --token YOUR_TOKEN_HERE to the python -m petals.cli.run_server command.

🔒 Security. Hosting a server does not allow others to run custom code on your computer. Learn more here.

🏆 Thank you! Once you load and host 10+ blocks, we can show your name or link on the swarm monitor as a way to say thanks. You can specify them with --public_name YOUR_NAME.

How does it work?

  • You load a small part of the model, then join a network of people serving the other parts. Single‑batch inference runs at up to 6 tokens/sec for Llama 2 (70B) and up to 4 tokens/sec for Falcon (180B) — enough for chatbots and interactive apps.
  • You can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of PyTorch and 🤗 Transformers.
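For example, here is a hedged sketch of the "custom paths / hidden states" idea. It assumes the BLOOM-based client exposes its decoder stack as model.transformer.h, a RemoteSequential that supports slicing (as discussed in the issues below); attribute names may differ for other model families, so treat this as illustrative rather than the canonical API:

import torch
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
with torch.no_grad():
    hidden = model.transformer.word_embeddings(inputs)            # local input embeddings
    hidden = model.transformer.word_embeddings_layernorm(hidden)  # local pre-layernorm
    hidden = model.transformer.h[0:8](hidden)                     # run only the first 8 remote blocks
print(hidden.shape)  # intermediate hidden states, not final logits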

📜  Read paper            📚  See FAQ

📚 Tutorials, examples, and more

Basic tutorials:

  • Getting started: tutorial
  • Prompt-tune Llama-65B for text semantic classification: tutorial
  • Prompt-tune BLOOM to create a personified chatbot: tutorial

Useful tools:

  • Chatbot web app (connects to the public swarm): https://chat.petals.dev
  • Swarm health monitor: https://health.petals.dev

Advanced guides:

  • Launch a private swarm: guide
  • Run a custom model: guide

Benchmarks

Please see Section 3.3 of our paper.

🛠️ Contributing

Please see our FAQ on contributing.

📜 Citation

Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. Petals: Collaborative Inference and Fine-tuning of Large Models. arXiv preprint arXiv:2209.01188, 2022.

@article{borzunov2022petals,
  title = {Petals: Collaborative Inference and Fine-tuning of Large Models},
  author = {Borzunov, Alexander and Baranchuk, Dmitry and Dettmers, Tim and Ryabinin, Max and Belkada, Younes and Chumachenko, Artem and Samygin, Pavel and Raffel, Colin},
  journal = {arXiv preprint arXiv:2209.01188},
  year = {2022},
  url = {https://arxiv.org/abs/2209.01188}
}

This project is a part of the BigScience research workshop.

petals's People

Contributors

artek0chumak, borzunov, bot66, dbaranchuk, dvmazur, eltociear, greenfatguy, justheuristic, mryab, muhtasham, tonywang16, vadi2, vahe1994, zsc


petals's Issues

Is Petals the Solution that ChatGPT needs?

I recently found out about Petals, and the way it optimizes running LLMs on your local machine through a BitTorrent-style approach seemed very innovative to me.

Right now, it is estimated that running ChatGPT costs over $100k per day, and I was wondering whether this cost could be brought down using solutions like Petals, as this could also help the environment by reducing the massive footprint that such a complex operation causes.

Specify minimal GPU requirements for contributing

I tried to contribute to the swarm using an 8 GB card and quickly realized that, even when setting the PyTorch fragmented split size to 512 MB, I could not use this card to contribute to inference. It would be nice to have a section in the README that specifies the requirements.

CPU-only server never passes the throughput check

I am building a solution that needs to support the most basic of hardware. As such, I am building a number of default profiles, to fit different hardware configurations.

I've already confirmed that the Petals Docker image works great, when I expose my GPU to the image.

However, if I don't (if I'm a user who doesn't have a dedicated GPU), the Docker image just hangs indefinitely:

[INFO] Measuring network and compute throughput. This takes about a minute and will be cached for future runs

I can't join the swarm at all.

Is this by design? Does hivemind reject CPU-only clients by default?

Better throughput estimation

Current swarm load-balancing relies on a single throughput value, implemented in #21.

It works, but there are a few ways to make this work better:

  • speedtest-based speed can be unreliable. Some users reported that speedtest does not work
    • quick fix: if speedtest fails, warn and set throughput to 100 Mbps; add an env variable to specifically set the network throughput
  • when a server holds many layers, it is less affected by low network throughput
    • find a better way to account for num_blocks;
    • for example, min(compute_throughput, network_throughput * min(num_blocks, 5))
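A toy sketch of this combination rule and the speedtest fallback (all function, variable, and env-variable names are hypothetical; this is not the current Petals implementation):

import os
from typing import Optional

def effective_network_throughput(speedtest_mbps: Optional[float]) -> float:
    """Fallback: if speedtest failed, warn and assume 100 Mbps; allow an env-variable override."""
    override = os.environ.get("PETALS_NETWORK_MBPS")  # hypothetical variable name
    if override is not None:
        return float(override)
    if speedtest_mbps is None:
        print("Warning: speedtest failed, assuming 100 Mbps")
        return 100.0
    return speedtest_mbps

def combined_throughput(compute_throughput: float, network_throughput: float, num_blocks: int) -> float:
    """A server holding more blocks is less network-bound, so scale the network term by min(num_blocks, 5)."""
    return min(compute_throughput, network_throughput * min(num_blocks, 5))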

Additional information that could help clients find a better chain of servers for training/inference:

  • declare compute and network throughput separately
  • [maybe] declare whether or not a peer is directly reachable (not using relay)

[Feature Request] Direct server-server communication ("and then" clause)

Based on conversations with @borzunov , @dbaranchuk

Premise: currently in rpc_inference, each client sends inputs to a given server, collects responses from that server, then manually sends those outputs to the next server; this is needed for full fault-tolerance, in case one of the servers disconnects. A faster option is to send data directly from server 1 to server 2, if we can make it without compromising fault-tolerance -- and without insane code complexity.

Proposed solution: in rpc_inference, whenever a client sends a pb2 request, it can add a metadata key, e.g. "next_peer", which denotes the peer id of the next server. When a server finishes computing that request, it will immediately send the results to the specified peer_id and mark them as "hidden states for session {inference_session_id}" - assuming that the next peer currently takes part in the same session.

On the receiving end, each server awaits asyncio.wait(request_from_client, request_from_previous_server), whichever comes first. If the request from the previous server arrives first, the current server begins processing it immediately, but still waits for the client's data to ensure that the results are valid.

Sending data to the next server is not guaranteed: the requested server will simply fire a request and forget about it.
Notably, the server will still return hidden states to the client as usual. The extra communication is fine because rpc_inference does not use much network throughput ("Mbps"), being more sensitive to latency ("ping").
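A rough asyncio sketch of the receiving side, just to make the ordering concrete (all helpers and queue objects here are hypothetical placeholders, not the actual handler code):

import asyncio

def start_compute(hidden_states):
    ...  # placeholder: launch the block's forward pass early

def verify_equal(received_from_server, received_from_client):
    ...  # placeholder: the security check described in the notes below

async def receive_inference_inputs(client_queue: asyncio.Queue, prev_server_queue: asyncio.Queue):
    client_task = asyncio.create_task(client_queue.get())
    prev_task = asyncio.create_task(prev_server_queue.get())
    done, _ = await asyncio.wait({client_task, prev_task}, return_when=asyncio.FIRST_COMPLETED)

    if prev_task in done:
        hidden_states = prev_task.result()
        start_compute(hidden_states)              # begin processing immediately
        client_copy = await client_task           # but still wait for the client's copy
        verify_equal(hidden_states, client_copy)  # and make sure the results are valid
    else:
        hidden_states = client_task.result()
        prev_task.cancel()                        # the direct server-to-server push never arrived
        start_compute(hidden_states)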

Notes:

  • client can request a different next_peer after each new inference step. This happens if one of the "next servers" disconnected from the inference session. Servers should send each hidden_states to the server that was specified in the current request.next_peer
  • if a server receives a request that doesn't correspond to any active session, it simply ignores the request. this is fine because if that request was valid, the client will still send the same data later
  • [security] since the previous server can be faulty/malicious, the "next peer" server should check that the data it received from previous peer is equal to the data it eventually received from client; when we implement full verification, the server can simply sign the next peer message so it can be used as a proof of (benign or malicious) activity
    • if this took place, a server may have to re-send inference message; we can support this by specifying the current length in the server's response
  • [security] the server-to-server traffic caused by the client is strictly less than client-to-server traffic, which eliminates the potential misuse via ddos amplification
  • the current-best routing strategy would still work decently for this algorithm because it uses a strictly non-optimistic (time>=actual) performance model

@dbaranchuk also proposed a clever alternative solution, where each server runs its own fault-tolerant inference session to subsequent servers. This can be a better solution if we find a way to limit the memory / bandwidth usage on a server.

[CODE] port HF bloom to hivemind

Why: so we can play with it in inference mode

Original bloom code: huggingface/transformers#17474

The quest is to

  • implement bloom transformer layer as a hivemind expert.
  • prepare a huggingface model that only has bloom embeddings and logits, but runs all transformer layers via hivemind.RemoteExpert

exception 'Torch not compiled with CUDA enabled' when forcing '--device cuda'

Following the install instructions on the README.md into a dedicated conda env, the server throws an exception as follows:

python -m petals.cli.run_server bigscience/bloom-petals --device cuda
Jan 10 05:56:52.030 [INFO] Automatic dht prefix: bigscience/bloom-petals
Jan 10 05:56:58.672 [INFO] Connecting to the public swarm, peer_id = 12D3KooWRKGMtAyEzYsqk1fcT4Eu9CC5jd7YeyTHCoQgDcw1nh4i
Jan 10 05:56:58.673 [INFO] Model weights will be loaded in 8-bit format
Traceback (most recent call last):
  File "/home/a/anaconda3/envs/petals/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/a/anaconda3/envs/petals/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/petals/cli/run_server.py", line 213, in <module>
    main()
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/petals/cli/run_server.py", line 196, in main
    server = Server(
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/server.py", line 168, in __init__
    num_blocks = self._choose_num_blocks()
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/server.py", line 232, in _choose_num_blocks
    total_memory = torch.cuda.get_device_properties(self.device).total_memory
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/a/anaconda3/envs/petals/lib/python3.10/site-packages/torch/cuda/__init__.py", line 221, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Is it a cudatoolkit version mismatch with what I have?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   31C    P0    26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Critical Code Analysis {backend.py}

Context: This issue contains the outputs of an LM-based code analysis tool ran by @versoindustries on a part of Petals code. [comment added by @borzunov]

\\ GPT-3 | Codex | ChatGPT (Unofficial ChatGPT API) ///
~ A fine-tuned complex GPT model wrapped into a program, plus a VS Code extension, used to analyze codebases at a reasonable cost ~
I will slowly add in other files. I haven't dived all the way into the codebase, but as I do, the model will get a clearer picture of what's going on. Some of these may not actually be issues if the code is used by a library or source that the model doesn't know about yet.

The assert not param.requires_grad and assert not buf.requires_grad checks in the constructor of the TransformerBackend class may cause problems if the model's parameters or buffers are expected to accumulate gradients.

The inference_pool and forward_pool variables are being assigned the same instance of PrioritizedTaskPool, which could lead to unexpected behavior when processing requests for forward and inference.

The self.cache_bytes_per_token variable is being assigned the value of Counter() which is not being used later on the code and is not being used for any operations.

The max_batch_size variable is being used in the constructor of PrioritizedTaskPool but it is not defined in the code.

The self.shard_num_heads variable is being used without being defined or assigned any value before.

The backward_pool variable is defined but it is not being used in the code.

The self.inference_schema variable is defined but it is not being used in the code.

The class TransformerBackend inherits from ModuleBackend but it is not being used.

The import of BloomConfig is not being used in the code.

The import of BloomAttention is not being used in the code.

The self.dtype variable is being used but it is not being defined or assigned any value before.

The self.memory_cache variable is defined but it is not being used in the code.

The import of InferenceMetadata is not being used in the code.

The import of Handle is not being used in the code.

The import of is_dummy is not being used in the code.

The self.forward_pool and self.inference_pool are being defined with the same PrioritizedTaskPool instance, which could cause confusion and unexpected behavior when handling forward and inference requests.

The self.forward_pool and self.backward_pool are defined with the same max_batch_size variable, which is not defined in the code. It might cause an error if this variable is not passed as an argument.

The self.config variable is defined in the constructor but it is not used anywhere in the code.

The *args and **kwargs passed to the constructor are not used in the code, which might cause confusion and unexpected behavior if they are passed with specific values.

The from __future__ import annotations statement at the top of the code is not needed and does not affect the execution of the code in any way.

The self.inference_schema variable is defined but it is not used in the code. It is unclear if it is intended to be used for validation or documentation of the input and output schema of the inference_step method.

The self.cache_bytes_per_token variable is defined but it is not used in the code. It is unclear if it is intended to be used for memory management or performance optimization.

The self.get_inference_cache_descriptors method is defined but it is not used in the code. It is unclear what its intended purpose is and how it is related to the self.cache_bytes_per_token variable.

The batch_size and max_length arguments passed to the self.get_inference_cache_descriptors method are not used in the code. It is unclear if they are intended to be used for memory management or performance optimization.

The self.dtype variable is defined in the constructor but it is not used in the code. It is unclear what the intended use of this variable is.

The self.shard_num_heads variable is defined but it is not used in the code. It is unclear what the intended use of this variable is.

The self.memory_cache variable is defined in the constructor but it is not used in the code. It is unclear what the intended use of this variable is.

There is no clear error handling mechanism in the code. If an error occurs, it might go unnoticed and cause unexpected behavior.

The code is not commented, which makes it difficult to understand the intended behavior and the meaning of the variables and methods.

The code is not well organized, making it difficult to understand the flow of execution and the dependencies between the different parts of the code.

The code is not written in a modular or reusable way, making it difficult to reuse parts of the code for other projects or applications.

The self.forward, self.backward and self.inference_step methods are being used in the constructor of the TransformerBackend class, but they are not defined in the code. It is unclear how these methods are supposed to work and what their intended behavior is.

The self.inference_pool variable is defined but it is not used in the code. It is unclear what the intended use of this variable is.

The self.args_schema, self.kwargs_schema variables are used in the constructor but they are not defined in the code. It is unclear what their intended use is.

tensor_parallel and petals are not standard Python libraries, and it is not clear what they are or what they are used for.

The self.inference_schema variable is defined but it is not used in the code. It is unclear what the intended use of this variable is.

Overall, the code seems to be in the middle of development and not ready for use. It has multiple issues that need to be addressed and cleaned up before it can be used in any real-world scenario.


🤗 Transformers compatibility issues

Hello,

I'm trying to make the DistributedBloomForCausalLM work with our library inseq to extract feature attributions from BLOOM generations. However, at the moment I am facing some issues that prevent me from using the distributed model:

  1. Inseq assumes the possibility of producing a structured output from model.generate by passing the return_dict_in_generate=True argument, as supported by HuggingFace. In your current implementation, there doesn't seem to be a way to extract such outputs, so when we access the property sequences an exception is thrown. To reproduce:
import torch
import inseq
from transformers import BloomTokenizerFast 
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()
inseq_model = inseq.load_model(model=model, tokenizer="bigscience/bloom-petals", attribution_method="saliency")
out = inseq_model.attribute(
    "A cat in French is \"",
    generation_args={"max_new_tokens": 3}
)
╭──────────────────────────── Traceback (most recent call last) ────────────────────────────╮
│ <ipython-input-7-60ac37021f03>:1 in <module>                                              │
│ /usr/local/lib/python3.8/dist-packages/inseq/models/attribution_model.py:184 in attribute │
│                                                                                           │
│   181 │   │   │   )                                                                       │
│   182 │   │   if not constrained_decoding:                                                │
│   183 │   │   │   encoded_input = self.encode(input_texts, return_baseline=True, include_ │
│ ❱ 184 │   │   │   generated_texts = self.generate(encoded_input, return_generation_output │
│   185 │   │   logger.debug(f"reference_texts={generated_texts}")                          │
│   186 │   │   attribution_method = self.get_attribution_method(method, override_default_a │
│   187 │   │   attributed_fn = self.get_attributed_fn(attributed_fn)                       │
│                                                                                           │
│ /usr/local/lib/python3.8/dist-packages/inseq/models/model_decorators.py:13 in             │
│ attribution_free_wrapper                                                                  │
│                                                                                           │
│   10 │   │   if self.is_hooked:                                                           │
│   11 │   │   │   was_hooked = True                                                        │
│   12 │   │   │   self.attribution_method.unhook()                                         │
│ ❱ 13 │   │   out = f(self, *args, **kwargs)                                               │
│   14 │   │   if was_hooked:                                                               │
│   15 │   │   │   self.attribution_method.hook()                                           │
│   16 │   │   return out                                                                   │
│                                                                                           │
│ /usr/local/lib/python3.8/dist-packages/inseq/models/huggingface_model.py:190 in generate  │
│                                                                                           │
│   187 │   │   │   **kwargs,                                                               │
│   188 │   │   )                                                                           │
│   189 │   │   texts = self.tokenizer.batch_decode(                                        │
│ ❱ 190 │   │   │   generation_out.sequences,                                               │
│   191 │   │   │   skip_special_tokens=True,                                               │
│   192 │   │   )                                                                           │
│   193 │   │   if return_generation_output:                                                │
╰───────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'Tensor' object has no attribute 'sequences'
  2. Using Inseq we can bypass the generation step by attributing a pre-specified generation. In that case, feature attributions will be performed by calling normal forward/backward passes on the model step by step. If I try this by adapting the call to model.attribute as:
out = inseq_model.attribute(
    "A cat in French is \"",
	generated_texts="A cat in French is \"chat\"",
    generation_args={"max_new_tokens": 3}
)

I get the following error:

╭──────────────────────────── Traceback (most recent call last) ────────────────────────────╮
│ /usr/local/lib/python3.8/dist-packages/petals/client/remote_model.py:163 in forward       │
│                                                                                           │
│   160 │   │   attention_mask: Optional[torch.Tensor] = None,                              │
│   161 │   │   **kwargs,                                                                   │
│   162 │   ):                                                                              │
│ ❱ 163 │   │   assert attention_mask is None, "DistributedBloomModel does not support atte │
│   164 │   │                                                                               │
│   165 │   │   for k, v in kwargs.items():                                                 │
│   166 │   │   │   if not (v is None or v is False):                                       │
╰───────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: DistributedBloomModel does not support attention masks right now

Correct me if I'm wrong, but I believe both return_dict_in_generate and attention_mask support should be achievable for the petals implementation, right? Would you consider supporting such usage? Thanks in advance! 🙂

Access to the public model requires HF API token

When run:
python -m cli.run_server --prefix bloom6b3 --converted_model_name_or_path bigscience/test-bloomd-6b3 \
    --block_indices 3:5 --torch_dtype float32 --identity_path ./server1.id --host_maddrs /ip4/127.0.0.1/tcp/31337

Fails with:
OSError: You specified use_auth_token=True, but a huggingface token was not found.

Fix: set use_auth_token=False here:
https://github.com/learning-at-home/bloom-demo/blob/ca3c08acc1c36d1da396ff95d9016522ea84b83c/src/server/server.py#L134

and here:
https://github.com/learning-at-home/bloom-demo/blob/ca3c08acc1c36d1da396ff95d9016522ea84b83c/src/bloom/from_pretrained.py#L73

Setting the device to CPU in the hyperparameters used for training in the sst2 prompt tuning example

Hi Big-Science team,

quick question: Why is the device set to 'CPU' in these training hyperparameters?

Especially when the notebook recommends using a GPU.

MODEL_NAME = "bigscience/bloom-petals" # select model you like
NUM_PREFIX_TOKENS = 16
DEVICE = 'cpu'
BATCH_SIZE = 16
LR = 1e-2
WEIGHT_DECAY = 0.0
NUM_EPOCHS = 3
SEED = 42
MODEL_MAX_LENGTH = 64
TUNING_MODE = 'ptune' # choose between ['ptune', 'deep_ptune']

Thanks

Best regards

Jerome

Failed finding central directory

I believe the server got stuck on the following error, because this is the last output in the console and it is many hours old. There is no subsequent line about reading the block from the HF Hub. Restarting the server helped, and it works normally now.

It might be caused by network downtime, but I suppose the server should survive short outages like this.

Jan 24 06:36:00.449 [WARN] [/home/petals/src/petals/bloom/from_pretrained.py._load_state_dict:121] Failed to load block 64 from HF Hub (retry in 10240 sec)
Traceback (most recent call last):
  File "/home/petals/src/petals/bloom/from_pretrained.py", line 118, in _load_state_dict
    return torch.load(archive_file, map_location="cpu")
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 777, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 282, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

try bloomz in chat.petals.ml

There are rumors that bloomz makes for a far closer approximation of instruct/ChatGPT than the original BLOOM. Let's see how it works with chat.petals!

Steps:

  1. convert the model to petals format, like this, upload to yourname/bloomz-petals
    • conversion script requires 400gb ram (can be optimized)
  2. run some petals servers with the new model:
    • python -m petals.cli.run_server yourname/bloomz-petals
  3. run a version of chat.petals with the new model
  4. let contributors play with it

Note: since we don't have active volunteers for the model (yet!), I can help you find some GPUs to bootstrap the model

Inference issues on Volta-based swarm

Hello folks,

Trying to run a private swarm on 7x Volta-generation GPUs. As suggested by the docs, I've set torch_dtype to float16 and NUM_BLOCKS to 10 (these are 32 GB GPUs) and removed the --load_in_8bit argument. All 7 are running on the same Linux host.

Swarm starts ok and loads all the model blocks, but many of included tests fail.

Even the most basic generation seems to always produce the same token (0 == UNK).

Are older GPUs even supported? There are some notes in the documentation on what to set for pre-Turing GPUs, but the arXiv paper says the server needs a Turing or later generation GPU.

If older GPUs are supported, do I also need to specify torch_dtype=float16 when instantiating the model?
(I get RuntimeError: "LayerNormKernelImpl" not implemented for 'Half' when running .generate() in this case.)

It is torch 1.12.1+cu113 on cuda 11.3

This is what i get as tests:

tests/test_aux_functions.py::test_throughput_basic FAILED <-- this system is behind proxy, i think this is expected
tests/test_block_exact_match.py::test_remote_block_exact_match FAILED
tests/test_chained_calls.py::test_forward_backward_exact_match FAILED
tests/test_chained_calls.py::test_chained_inference_exact_match FAILED
tests/test_full_model.py::test_full_model_exact_match[True] FAILED
tests/test_full_model.py::test_full_model_exact_match[False] FAILED
tests/test_full_model.py::test_greedy_generation PASSED
tests/test_full_model.py::test_sampling[sampling_options0] SKIPPED (Sampling is currently not consistent with outputs from Transformers)
tests/test_full_model.py::test_sampling[sampling_options1] SKIPPED (Sampling is currently not consistent with outputs from Transformers)
tests/test_full_model.py::test_sampling[sampling_options2] SKIPPED (Sampling is currently not consistent with outputs from Transformers)
tests/test_full_model.py::test_sampling[sampling_options3] SKIPPED (Sampling is currently not consistent with outputs from Transformers)
tests/test_full_model.py::test_beam_search_generation FAILED
tests/test_linear8bitlt.py::test_layout_exact_match SKIPPED (this test requires a turing-generation or newer GPU, see bitsandbytes docs)
tests/test_linear8bitlt.py::test_linear_exact_match SKIPPED (this test requires a turing-generation or newer GPU, see bitsandbytes docs)
tests/test_linear8bitlt.py::test_linear_no_igemmlt PASSED
tests/test_priority_pool.py::test_priority_pools PASSED
tests/test_remote_sequential.py::test_remote_sequential FAILED
tests/test_remote_sequential.py::test_remote_sequential_prompts FAILED

The greedy search test seems to pass, but I'm suspicious... could it be an issue with the test?

Here is what i see from .generate():

model = DistributedBloomForCausalLM.from_pretrained("bigscience/bloom-petals", initial_peers=INITIAL_PEERS)
inputs = tokenizer("Cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
Cat sat on
type(outputs), outputs.shape, outputs
(torch.Tensor,
torch.Size([1, 8]),
tensor([[40171, 13770, 664, 0, 0, 0, 0, 0]]))

If needed, can provide the log from tests.

Investigate QUIC (v1) reliability

Our network layer supports quic like this: hivemind.DHT(..., host_maddrs=['/ip4/1.2.3.4/udp/1337/quic'])
However, petals servers currently default to TCP-only host maddrs, unless user specifies --host_maddrs.
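For reference, a single peer could listen on both transports like this (Python; the addresses and ports are illustrative placeholders, and get_visible_maddrs is only used to confirm what the peer advertises):

import hivemind

# Illustrative only: advertise both TCP and QUIC, mirroring the host_maddrs pattern quoted above
dht = hivemind.DHT(
    host_maddrs=["/ip4/0.0.0.0/tcp/31337", "/ip4/0.0.0.0/udp/31337/quic"],
    start=True,
)
print(dht.get_visible_maddrs())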

In other hivemind-based experiments, we found that QUIC is superior to TCP when operating under household NAT because udp hole punching is more reliable than tcp hole punching. It would be great if we could enable it by default.

The reason why QUIC is not in the default maddrs is that we haven't tested it thoroughly enough and we fear that it might cause throughput issues.

Quest: try running a QUIC-only peer in the public swarm, bombard it with requests from your PC, Colab, or some publicly accessible machine, and check if it works alright.

Criteria (suggestion):

  • cycles per second, forward and inference (vs TCP)
  • retries / relay fallbacks (vs TCP)

We should check for cases where QUIC makes the system unusable (10x slower or does not work at all).

If some cases are slower by tens of percent, this is fine. If there are cases where QUIC is 2x slower or similar, we can check that running a server with both TCP and QUIC is still as fast as TCP-only - and if so, it is fine to enable QUIC in main.

Multi-GPU support?

Are there docs on enabling multi-GPU support? If not, can we have a feature request to add this? Per the default install instructions, I'm only seeing GPU0 being utilized. Maybe it's as simple as adding a "device_map='auto'" in the model creation?

"NoneType object is not callable" on stopping P2P

I have a very simple inference testing script. No threading or any advanced stuff, basically "hello world" inference on Petals. Everything goes well, but when the script exits, I always get this error:

Exception ignored in: <function P2P.__del__ at 0x7f4ac1feed40>
Traceback (most recent call last):
  File "/home/dev/.local/lib/python3.10/site-packages/hivemind/p2p/p2p_daemon.py", line 632, in __del__
  File "/home/dev/.local/lib/python3.10/site-packages/hivemind/p2p/p2p_daemon.py", line 659, in _terminate
  File "/home/dev/.local/lib/python3.10/site-packages/multiaddr/multiaddr.py", line 254, in value_for_protocol
TypeError: 'NoneType' object is not callable

It is a rather cosmetic issue, but something is not OK there.

[CODE] Basic Inferencing API on hivemind.Server aka rpc_inference

Why: talk to a 176B model run on hundreds of small devices.

Implement an extended hivemind.Server that has forward/backward as usual, and an additional RPC named forward_incremental (stream<->stream)

Here's the protocol for forward_incremental:

  1. client sends a request containing:
    • requested layers
    • requested max sequence length
    • [optional: bid?]
  2. server responds with an info protobuf that contains:
    • bool accepted: if True, server decides to let the client run inference and will await the first request for T=10 seconds.
      • [optional: queue?]
    • [float queue length: 0 if accepted right now, N if it needs to wait for N other nodes to finish before running]
    • [float throughput: server's estimated computation time, including time in queue]
  3. client sends prefix embeddings:
    • Tensor prefix input_embeddings [1, prefix_length, hidden_size] with compression
    • [optional prefix attention mask [prefix_length, prefix_length], default = tril]
  4. server runs a forward pass, saves attention caches, and returns:
    • Tensor prefix output_embeddings [1, prefix_length, hidden_size] with compression
  5. client sends another token's input embeddings:
    • Tensor input_embeddings [1, 1, hidden_size] with compression
    • [optional prefix attention mask [1, prefix_length + prev_tokens], default = tril]
  6. server runs a forward pass with the attention cache and returns:
    • Tensor output_embeddings
    • current length

GOTO step 5 while current length <= max length
If client does not send ping in T seconds (maybe empty message if no data yet), server closes connection.
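A client-side sketch of this loop; the stream object, message fields, and embed_next_token helper are hypothetical placeholders meant only to make the message order concrete:

import torch

def embed_next_token(last_hidden_state):
    ...  # placeholder: run local logits, sample a token, return its [1, 1, hidden_size] embedding

def run_forward_incremental(stream, prefix_embeddings: torch.Tensor, max_length: int):
    # steps 1-2: handshake - request layers and a max sequence length, check that the server accepted
    stream.send({"requested_layers": (0, 8), "max_sequence_length": max_length})
    info = stream.receive()
    if not info["accepted"]:
        raise RuntimeError("server rejected the inference session")

    # steps 3-4: send the whole prefix once, receive the prefix outputs
    stream.send({"input_embeddings": prefix_embeddings})        # [1, prefix_length, hidden_size]
    outputs = stream.receive()["output_embeddings"]
    current_length = prefix_embeddings.shape[1]

    # steps 5-6: then one token at a time, reusing the server-side attention cache
    while current_length < max_length:
        next_token_embedding = embed_next_token(outputs[:, -1, :])
        stream.send({"input_embeddings": next_token_embedding})  # [1, 1, hidden_size]
        response = stream.receive()
        outputs = response["output_embeddings"]
        current_length = response["current_length"]
    return outputs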

Don't think about it:

  • support fixed max length for now, e.g. 1024 or 2048?
  • inference up to 256 steps excluding prefix - to ensure we don't spend too long with the same node?
  • select one or more of that node's consecutive layers to run inference on at once?
  • send more than one token at a time?
  • option to backtrack for a few tokens for beam search inference?
  • beam search with multiple hypotheses - and an option to reorder them internally?

Getting Petals to run on macOS

The primary motivation is:

  • to get as much high-bandwidth memory as possible, in a low-cost way (thanks to Apple's unified memory model)
  • to be easily used for training / inference
  • it's probably going to be slower than 3090s (I have no idea), but I don't think that's the point here
  • it could potentially also be used with a large number of "student laptops"

As of the latest beta, PyTorch includes optimizations for the M1 Metal GPU.

This presents an interesting possibility of scaling up more easily and affordably. For example, to hit 352 GB of memory...
(assuming up to 75% of a Mac's memory is allocated to the GPU; you could in theory go above 75%, but I suspect we need at least 25% for the OS and filesystem operations)

  • Number of nodes: 4
  • Ram allocated per node: 96GB (75% of 128GB)
  • Upfront cost of nodes: $23,200.00 ($5,800.00 / node)
  • Max KWh: 0.86 (0.215 KWh / node)

However, if you were to try to build this using A100s, for example:

  • Number of nodes: 5
  • Ram allocated per node: 80GB
  • Upfront cost of nodes: $65,000.00 ($13,000.00 / node)
  • Max KWh: 1.5 (0.300 KWh / node)
  • Price & Energy usage exclude overheads for CPU, RAM, Motherboard, Storage, Cooling, and networking

Also as outlined, an alternative would be 30 student laptops / Mac minis...

  • Number of nodes: 30
  • Ram allocated per node: 12GB (75% of 16GB)
  • Upfront cost**: $33,000.00 ($1,100 / node)
  • Max KWh**: 4.5 (0.150 KWh / node)
    ** not that it matters in this case

This makes it possibly one of the most accessible ways for students to set up a private swarm and try training on their own hardware in a data lab.

[RESEARCH] LM API merit system

Why:

  • contributors who support the swarm over long time should feel that they are appreciated
  • one client should not be able to DDOS the entire swarm - it should be prioritized according to some pre-defined system

Optional:

  • client that has higher point total may end up prioritized on the processing queue
  • contributors who run their GPUs should be motivated to use the model for something of theirs (free coupon effect)

Demo constraint: the first public version must not have a mechanism for converting internal points into anything except for priority usage and participant self-worth (e.g. via leaderboards).

Client-side code improvements [Yozh-todo-list]

Important stuff

  • optimal (instead of random) routing in RemoteSequenceManager
  • fault tolerance in RemoteSequentialInferenceSession
  • quick slicing with RemoteSequential[5:15] ( @TimDettmers )
  • BloomForCausalLM.generate ( @artek0chumak )
  • remove lm_head WEIGHT from checkpoint and/or model, convert BloomModel instead (to save 2x client RAM) [@dbaranchuk did it]

Minor stuff

This list contains all the things I've delegated to future me. They are typically too obscure for anyone else to worry about.

  • fault tolerance inside RemoteTransformerBlock (as opposed to RemoteSequential)

  • in SequenceInfo, keep block_infos as TimedStorage instead of just RemoteSequenceInfo for automatic removal of expired servers

  • make RemoteSequential print the state of the model ( @TimDettmers )

  • less frequent DHT lookups in RemoteSequenceInfo

  • change BloomForCausalLM.forward to call RemoteSequential instead of working layer-by-layer (why: to allow chaining subsequent blocks efficiently)

Note from #45 by @borzunov

Please consider running a loop instead, maybe using just --num_blocks without explicit --block_indices. If it gets more complicated than just repeating something N times, please move it to a Python script.

how to profile petals inference time cost

Hello all,

I launched my own swarm on 24 computers according to Launch-your-own-swarm.
Each computer has 1 GPU and 32 GB of memory, and hosts 3 blocks of petals-bloom.

It works fine but takes a long time per inference.

Following "Use the model", I only run inference with BLOOM without training the model.

import torch
import torch.nn.functional as F
import transformers
from src import DistributedBloomForCausalLM

initial_peers = [TODO_put_one_or_more_server_addresses_here]  # e.g. ["/ip4/127.0.0.1/tcp/more/stuff/here"]
tokenizer = transformers.BloomTokenizerFast.from_pretrained("bigscience/bloom-petals")
model = DistributedBloomForCausalLM.from_pretrained(
  "bigscience/bloom-petals", initial_peers=initial_peers, low_cpu_mem_usage=True, torch_dtype=torch.float32
)

inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
# use max_new_tokens instead of max_length 
remote_outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.decode(remote_outputs[0]))

I tried different max_new_tokens values, and the time cost is almost 2 s/token.

Is there a way to profile the inference performance? I wonder why it is so slow.
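A crude way to measure this from the client side (reusing the model and tokenizer from the snippet above; this only shows the average wall-clock cost per generated token, not a per-server breakdown):

import time

for n_tokens in (1, 10, 20):
    inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
    start = time.perf_counter()
    model.generate(inputs, max_new_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    # the first token also pays for routing and processing the whole prefix
    print(f"max_new_tokens={n_tokens}: {elapsed:.1f} s total, {elapsed / n_tokens:.2f} s/token")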

Thanks

Miscellaneous server-side improvements

  • task_pool

  • runtime

  • handler

    • verify what happens if servers have mixed torch_dtype and/or compression
    • actually follow forward/backward/inference schema instead of hard-coding
    • extract the code for adding prompts into a separate file
    • consider merging the code from hivemind's ConnectionHandler instead of inheriting
    • add a test that covers _rpc_inference with prompts
    • add a test that covers _rpc_inference with hypo-ids
  • MemoryCache

    • when running inference over multiple layers on the same server, avoid passing layer activations between cpu<->gpu by
      storing them in MemoryCache
      • before implementing this, gotta check if this will bring any performance benefit
    • LRU-offload stale cache from gpu to ram
  • point system

    • make sure points are integers everywhere
    • implement a nonzero prioritizer :)
    • move client-side spending policy to sequence_manager

[CODE] miscellaneous small issues for later

Things that can be done to improve the code, but were left out to launch MVP faster:

  • server-side: connection_handler, backend, runtime
    • modify task pool to pass cache handles as pure-Python integers? (currently they are converted to tensors)
    • when running inference over multiple layers on the same server, avoid passing layer activations between cpu<->gpu by
      storing them in MemoryCache
      - moved to #68
    • optimize disk space. Right now, a server will eventually download all BLOOM blocks and store them in the HF cache. Check for disk space in advance and/or figure out some cache eviction policy.
  • server-side: MemoryCache
    • in allocate_cache, if there is not enough memory, wait for memory to be freed by existing tasks up to a given timeout.
      - note: this can be done using mp.Condition
    • allocate cache as one contiguous buffer to avoid fragmentation
      - note: this feature is active as of #779959bc; we will eventually switch back to the non-cached version. Rationale: we did not observe significant issues from fragmentation, but contiguous buffers did complicate the code
    • quantize cached values using bitsandbytes
      - wontfix (as of 2022.01.02): our current code relies on transformers' default bloom implementation, so we can't intervene in attention internals
    • LRU-offload cache from gpu to ram?
      - moved to #68
  • client-side: internals
    • make begin_inference_session into a contextmanager
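For the last client-side item, a hedged sketch of the context manager wrapper (the begin_inference_session and close method names are hypothetical placeholders for whatever the client actually exposes):

from contextlib import contextmanager

@contextmanager
def inference_session(client, max_length: int):
    session = client.begin_inference_session(max_length=max_length)  # hypothetical method
    try:
        yield session
    finally:
        session.close()  # always release server-side attention caches

# usage sketch:
# with inference_session(remote_sequential, max_length=2048) as sess:
#     outputs = sess.step(hidden_states)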

[CODE] optimize BloomAttention, remove unnecessary host-to-device transfers

TL;DR: the current code for BloomAttention is surprisingly inefficient, with obvious problems that should be easy to fix.

It also misses a ton of opportunities for in-place ops for memory savings, but that's the least of its problems.

[CODE] fault-tolerant inference (client side)

Why: this is the main "value added" by the LM API: users can open a Colab notebook and run a gigantic model.

Depends on #3

Solution (sketch)
Each client holds embeddings/logits on their side (loads a special HuggingFace model)

Inputs: prefix(tokens), sampling parameters (top-k, top-p, etc)

Init:

  1. tokenize prefix tokens and compute embeddings
  2. for each stage, send out K initial requests to random servers in parallel (no tensors are sent yet). Servers respond with their throughput, queue sizes, and latency. Pick the best server in each group.
    • figure out how servers represent themselves in DHT
    • [heuristic] if the chosen server has several consecutive stages, always choose to run all these stages.
  3. client finds a sequence of servers that collectively have all model layers
    • [optional] record non-selected servers locally as backups for generation
  4. client runs prefix through all stages, stores inter-stage activations

Generation:

  6. get the last prefix token embedding from the final stage
  7. run it through the logits, run top-k/top-p/whatever inference
    • optional: use knn (e.g. hnsw from faiss) to quickly find the nearest token; alternative = Colab GPU
  8. embed this token using local embeddings
  9. send it through one pipeline stage at a time
    • on each stage, add one more vector to prefix_embeddings
  10. GOTO 6

Fault recovery:

  • If a given pipeline stage fails, find a replacement stage using the same procedure as during initial stage selection
  • feed the new stage with locally_saved intermediate embeddings that serve as inputs to that stage
  • continue inferencing normally
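A control-flow sketch of this recovery loop (every helper and argument here is a hypothetical placeholder; the point is only the retry-and-replay structure):

def find_replacement_server(stage_index):
    ...  # placeholder: same procedure as the initial stage selection

def run_stage_with_recovery(stage_index, stage_inputs, servers, saved_activations):
    saved_activations[stage_index] = stage_inputs  # keep the stage's inputs for future replays
    while True:
        try:
            return servers[stage_index].forward(stage_inputs)        # hypothetical RPC call
        except ConnectionError:
            servers[stage_index] = find_replacement_server(stage_index)
            stage_inputs = saved_activations[stage_index]            # feed the new stage the saved inputs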

The prompt tuning example (prompt-tuning-sst2) doesn't work

Hi,
I have tried to run the notebook example you have published (without editing) in Colab, but it doesn't work...
I get the following error:

RuntimeError                              Traceback (most recent call last)
[<ipython-input-12-7f7d7fa267e9>](https://localhost:8080/#) in <module>
     17 
     18         model.train()
---> 19         outputs = model(**batch)
     20         loss = outputs.loss
     21         loss.backward()

3 frames
[/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py](https://localhost:8080/#) in forward(self, input)
    112 
    113     def forward(self, input: Tensor) -> Tensor:
--> 114         return F.linear(input, self.weight, self.bias)
    115 
    116     def extra_repr(self) -> str:

RuntimeError: expected scalar type Half but found Float

It is probably due to an out-of-date transformers version, but petals 1.1.1 requires transformers==4.25.1.

Integration with gensyn.ai

tl;dr: This is just an idea that I unfortunately don't have time to work on right now, but it would be cool if anyone picked it up.

I learnt about Petals from Simon William's Blog Post and thought that tight "2nd party" integration with gensyn.ai could be interesting.

Gensyn.ai is a blockchain-based, incentive-driven distributed computing network, and Petals is a framework for distributed LLM training.

If/when I ever have the time, this would be fun to work on, but I'm just jotting down the idea publicly for now.

Add prompt tuning to the basic example

The following code can be added to the basic example to run prompt tuning and demonstrate that it works.

Quest: go to the main example notebook (in the README) and add some simple example based on prompt tuning.

Here's a basic example. I'm not sure it isn't missing something, but if it is, we can help add it back.

import torch
from transformers import BloomTokenizerFast
from petals.client import DistributedBloomForCausalLM
assert 'model' not in globals(), "please restart the kernel"
 
MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME, padding_side='right')
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME, tuning_mode='deep_ptune', pre_seq_len=3
).cuda()

inputs = tokenizer("A quick brown fox ", return_tensors="pt")["input_ids"].cuda()
remote_outputs = model.generate(inputs, max_new_tokens=7)
print("generated:", tokenizer.decode(remote_outputs[0]))
 
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

the_fox_is_innocent = tokenizer("A quick brown fox did not jump over the lazy dog", return_tensors="pt")["input_ids"].cuda()
for i in range(50):
  loss = model(input_ids=the_fox_is_innocent, labels=the_fox_is_innocent).loss
  opt.zero_grad()
  loss.backward()
  opt.step()
  print(f"loss[{i}] = {loss.item():.3f}")

inputs = tokenizer("A quick brown fox ", return_tensors="pt")["input_ids"].cuda()
remote_outputs = model.generate(inputs, max_new_tokens=7)
print("generated:", tokenizer.decode(remote_outputs[0]))

More convenient test runner?

[as suggested by @GreenFatGuy ]

As of now, running tests locally is inconvenient as it requires spinning up servers.
We cannot afford to launch servers for every test as it will increase CI time more than 5x.

Are there any standard tricks that allow pytest to spin up the necessary servers automatically, but use the same set of servers in all tests?
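One standard trick is a session-scoped pytest fixture, sketched below (launch_test_servers is a hypothetical placeholder for whatever actually starts a local swarm):

# conftest.py sketch: start the servers once per pytest session and reuse them in every test
import pytest

@pytest.fixture(scope="session")
def test_swarm():
    servers = launch_test_servers(num_servers=2, blocks_per_server=12)  # hypothetical helper
    try:
        yield servers  # every test requesting this fixture shares the same servers
    finally:
        for server in servers:
            server.shutdown()

def test_forward_pass(test_swarm):
    ...  # connect a client to test_swarm's initial peers and run assertions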

[RESEARCH] 8-bit bloom / opt

Why: the more stages we can fit on the same GPU, the lower the latency users will get on everything.

  • investigate ways to cast model to 8 bits
  • investigate ways to use less memory for attention cache
    • quantize? sparsify?

@TimDettmers knows this part best, ask him before we do anything

[DESIGN] auction-like priorities for servers

[for the record: this was proposed by @TimDettmers ]

Currently, hivemind-server treats all requests on a first-come, first-served basis.
If we want to reward active participants with faster inference/training, we could change that into an auction.

Here's what the client-server interaction looks like:

  • server gives client its stats, the current-highest bid, and maybe some metadata for bidding, e.g. the lowest serviced bids over last T seconds
  • client makes a bid - and signs it in such a way that it becomes a commitment ( see #6 )
  • in TaskPool.iterate_minibatches, server will now generate minibatches in the order of highest bid first
  • in TaskPool.priority, server will now set pool's priority based on highest bid in the pool, instead of wait time

As suggested by @GreenFatGuy , we need to think through how to deal with situations when low bids on high-demand servers won't ever be processed, and will hence take up memory on both client and server. First order solution: add absolute expiration time to each request, drop requests that hit expiration time.
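A toy sketch of bid-ordered scheduling with absolute expiration times (this is not the TaskPool implementation, just the ordering and expiry idea described above):

import heapq
import time

class BidQueue:
    def __init__(self):
        self._heap = []  # entries: (-bid, arrival_time, expires_at, task); bid negated for a max-heap

    def submit(self, task, bid: float, expires_at: float):
        heapq.heappush(self._heap, (-bid, time.monotonic(), expires_at, task))

    def next_batch(self, batch_size: int):
        batch = []
        while self._heap and len(batch) < batch_size:
            neg_bid, _, expires_at, task = heapq.heappop(self._heap)
            if time.monotonic() > expires_at:
                continue  # drop requests that hit their expiration time
            batch.append((task, -neg_bid))
        return batch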

Roadmap (tentative)

Current tasks:

  • prototype bloom points system @borzunov (#6 )
  • local tensor parallelism ( #143 , using BlackSamorez/tensor_parallel by @BlackSamorez and @IaroslavLisniak )
  • increase default max sequence length (from #146 )
  • allow running a server without open ports @Vahe1994
  • option to download pre-quantized blocks (@mryab )
  • improved routing (@justheuristic )
  • newest-latest libp2p - @Vahe1994
  • touch up fine-tuning examples, make sure they work in reasonable time ( @justheuristic )
  • a way to temporarily shutdown petals server
    • suggested by @craffel : when running a petals server on a machine that is often in use, people should be able to shut off petals servers while running their experiments
    • suggested behavior: shut down asap, restart once gpus are not in use for T minutes
  • Wanna contribute?
    • go to our discord server and ask around!
    • always in demand:
      • contribute examples (recommended but not required: create an issue / draft first, before you code them)
      • OS support / hardware support ( e.g. see #147 )
      • more models: OPT-175B, switch-XXL, whatever comes into fashion
      • host a server! (see README)

End of December: cover more use cases

  • tutorial: generation notebook
  • tutorial: prompt-tuning notebook
  • PreLayer prompt-tuning - mentioned as one of the baselines in https://arxiv.org/abs/2106.09685 - DONE
  • inference with prompt-tuning ( #13 by @artek0chumak)
  • advanced inference: beam search, constraints/fusion, LoRA/AdaMix ( @artek0chumak , #13 )
  • some kind of hub for tutorials, e.g. a minimalistic website
  • alpha test: let more people play with 176B model (where? no-brainer: bigscience, stability, discord)
  • rich inference interface for designing custom generation algorithms (by: @artek0chumak )
  • let servers run requests with different priorities ( #8 by: @GreenFatGuy )
  • By this point, we must answer the main questions: (1) will people use it? (2a) what for? (2b) why not?

End of July-August: make it reliable, test with early adopters

End of June: build a proof-of-concept

  • agree on the user interface (see #5 (comment) )
  • run simple (but correct!) inference with a smaller model (for generation)
  • do simple (but correct!) forward/backward with frozen layers (for prompt tuning)
  • client can dynamically choose which remote servers to use for inference ( by: @justheuristic )
  • create basic correctness tests for later
  • check if 8-bit compression is remotely feasible ( by: @TimDettmers )
  • it's okay if the code is not super reliable for now
  • it's okay if servers have to be set up manually for now
  • begin investigating: quantized weights, quantized communication, automatic server allocation, "bloom points"

Important, but not urgent:

Error: BFloat16 Unsupported scalar when trying to execute across multiple GPUs with BFloat16 & 8-Bits

I tried to run BLOOM distributed across multiple A100 GPUs with 8-bit quantization and BFloat16, but ran into this error while trying to execute a slightly adjusted version of the example script:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA SETUP: CUDA runtime path found: /datadrive/miniconda3/envs/petals/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
Oct 18 09:52:07.795 [WARN] [/datadrive/repos/petals/src/client/remote_sequential.py.__init__:34] RemoteSequential is in active development; expect adventures
Some weights of DistributedBloomForCausalLM were not initialized from the model checkpoint at bloom-testing/test-bloomd-560m-main and are newly initialized: ['lm_head.word_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/datadrive/repos/petals/simple_test_script.py", line 17, in <module>
    remote_outputs = model.generate(inputs, max_length=100)
  File "/datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/datadrive/repos/petals/src/client/remote_generation.py", line 113, in generate
    hidden_state = sess.step(embs, prompts=intermediate_prompts, hypo_ids=hypo_ids)[:, -1]
  File "/datadrive/repos/petals/src/client/inference_session.py", line 200, in step
    outputs = session.step(inputs, prompts[self.chosen_spans[0].start : self.chosen_spans[0].end], **kwargs)
  File "/datadrive/repos/petals/src/client/inference_session.py", line 109, in step
    tensors=[
  File "/datadrive/repos/petals/src/client/inference_session.py", line 110, in <listcomp>
    serialize_torch_tensor(tensor.to(proto.dtype), proto.compression)
  File "/datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/hivemind/compression/serialization.py", line 41, in serialize_torch_tensor
    return compression.compress(tensor, info, allow_inplace)
  File "/datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/hivemind/compression/base.py", line 83, in compress
    array = tensor.detach().numpy()
TypeError: Got unsupported ScalarType BFloat16

The code of the example script:

import os

import torch
import torch.nn.functional as F
import transformers
from src import DistributedBloomForCausalLM

MODEL_NAME = "bloom-testing/test-bloomd-560m-main"  # "bigscience/bloom-petals"
initial_peer = os.getenv("initial_peer")
initial_peers = [initial_peer]  # e.g. ["/ip4/127.0.0.1/tcp/more/stuff/here"]
tokenizer = transformers.BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME, initial_peers=initial_peers, low_cpu_mem_usage=True, torch_dtype=torch.float32
)  # this model has only embeddings / logits, all transformer blocks rely on remote servers

# model = model.to('cuda')
inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
remote_outputs = model.generate(inputs, max_length=100)
print(tokenizer.decode(remote_outputs[0]))  # "a cat sat in the back of the car,"

# "train" input embeddings by backprop through distributed transformer blocks
model.transformer.word_embeddings.weight.requires_grad = True
outputs = model.forward(input_ids=inputs)
loss = F.cross_entropy(outputs.logits.flatten(0, 1), inputs.flatten())
loss.backward()
print("Gradients (norm):", model.transformer.word_embeddings.weight.grad.norm())

Server launched via commands:

python -m cli.run_server bloom-testing/test-bloomd-560m-main --num_blocks 12 --torch_dtype bfloat16 --host_maddrs /ip4/0.0.0.0/tcp/31337 --load_in_8bit

python -m cli.run_server bloom-testing/test-bloomd-560m-main  --torch_dtype bfloat16 --host_maddrs /ip4/127.0.0.1/tcp/0 --load_in_8bit --initial_peers /ip4/127.0.0.1/tcp/31337/p2p/QmTHnjwKQFzvxrPesrSjtaL5eKUVdHfLsxV87vx8RFH21U --block_indices 12:24 --device cuda:1

Packages in the environment (installed via requirements.txt):

# packages in environment at /datadrive/miniconda3/envs/petals:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
accelerate                0.10.0                   pypi_0    pypi
aiohttp                   3.8.3                    pypi_0    pypi
aiosignal                 1.2.0                    pypi_0    pypi
asttokens                 2.0.5              pyhd3eb1b0_0  
async-timeout             4.0.2                    pypi_0    pypi
attrs                     22.1.0                   pypi_0    pypi
backcall                  0.2.0              pyhd3eb1b0_0  
base58                    2.1.1                    pypi_0    pypi
bitsandbytes              0.34.0                   pypi_0    pypi
blas                      1.0                         mkl  
brotlipy                  0.7.0           py39h27cfd23_1003  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2022.07.19           h06a4308_0  
certifi                   2022.9.24        py39h06a4308_0  
cffi                      1.15.1           py39h74dc2b5_0  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
click                     8.1.3                    pypi_0    pypi
configargparse            1.5.3                    pypi_0    pypi
cryptography              37.0.1           py39h9ce1e76_0  
cudatoolkit               11.3.1               h2bc3f7f_2  
datasets                  2.5.2                    pypi_0    pypi
debugpy                   1.5.1            py39h295c915_0  
decorator                 5.1.1              pyhd3eb1b0_0  
dill                      0.3.5.1                  pypi_0    pypi
docker-pycreds            0.4.0                    pypi_0    pypi
entrypoints               0.4              py39h06a4308_0  
executing                 0.8.3              pyhd3eb1b0_0  
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.8.0                    pypi_0    pypi
freetype                  2.11.0               h70c0345_0  
frozenlist                1.3.1                    pypi_0    pypi
fsspec                    2022.8.2                 pypi_0    pypi
giflib                    5.2.1                h7b6447c_0  
gitdb                     4.0.9                    pypi_0    pypi
gitpython                 3.1.29                   pypi_0    pypi
gmp                       6.2.1                h295c915_3  
gnutls                    3.6.15               he1e5248_0  
grpcio                    1.49.1                   pypi_0    pypi
grpcio-tools              1.48.2                   pypi_0    pypi
hivemind                  1.1.1                    pypi_0    pypi
huggingface-hub           0.7.0                    pypi_0    pypi
humanfriendly             10.0                     pypi_0    pypi
idna                      3.3                pyhd3eb1b0_0  
intel-openmp              2021.4.0          h06a4308_3561  
ipykernel                 6.15.2           py39h06a4308_0  
ipython                   8.4.0            py39h06a4308_0  
jedi                      0.18.1           py39h06a4308_1  
jpeg                      9e                   h7f8727e_0  
jupyter_client            7.3.5            py39h06a4308_0  
jupyter_core              4.11.1           py39h06a4308_0  
lame                      3.100                h7b6447c_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libdeflate                1.8                  h7f8727e_5  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libiconv                  1.16                 h7f8727e_2  
libidn2                   2.3.2                h7f8727e_0  
libpng                    1.6.37               hbc83047_0  
libsodium                 1.0.18               h7b6447c_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.16.0               h27cfd23_0  
libtiff                   4.4.0                hecacb30_0  
libunistring              0.9.10               h27cfd23_0  
libwebp                   1.2.4                h11a3e52_0  
libwebp-base              1.2.4                h5eee18b_0  
lz4-c                     1.9.3                h295c915_1  
matplotlib-inline         0.1.6            py39h06a4308_0  
mkl                       2021.4.0           h06a4308_640  
mkl-service               2.4.0            py39h7f8727e_0  
mkl_fft                   1.3.1            py39hd3c417c_0  
mkl_random                1.2.2            py39h51133e4_0  
msgpack                   1.0.4                    pypi_0    pypi
multiaddr                 0.0.9                    pypi_0    pypi
multidict                 6.0.2                    pypi_0    pypi
multiprocess              0.70.13                  pypi_0    pypi
ncurses                   6.3                  h5eee18b_3  
nest-asyncio              1.5.5            py39h06a4308_0  
netaddr                   0.8.0                    pypi_0    pypi
nettle                    3.7.3                hbbd107a_1  
numpy                     1.23.1           py39h6c91a56_0  
numpy-base                1.23.1           py39ha15fc14_0  
openh264                  2.1.1                h4ff587b_0  
openssl                   1.1.1q               h7f8727e_0  
packaging                 21.3               pyhd3eb1b0_0  
pandas                    1.5.0                    pypi_0    pypi
parso                     0.8.3              pyhd3eb1b0_0  
pathtools                 0.1.2                    pypi_0    pypi
pexpect                   4.8.0              pyhd3eb1b0_3  
pickleshare               0.7.5           pyhd3eb1b0_1003  
pillow                    9.2.0            py39hace64e9_1  
pip                       22.2.2           py39h06a4308_0  
prefetch-generator        1.0.1                    pypi_0    pypi
promise                   2.3                      pypi_0    pypi
prompt-toolkit            3.0.20             pyhd3eb1b0_0  
protobuf                  3.20.3                   pypi_0    pypi
psutil                    5.9.2                    pypi_0    pypi
ptyprocess                0.7.0              pyhd3eb1b0_2  
pure_eval                 0.2.2              pyhd3eb1b0_0  
pyarrow                   9.0.0                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pydantic                  1.10.2                   pypi_0    pypi
pygments                  2.11.2             pyhd3eb1b0_0  
pymultihash               0.8.2                    pypi_0    pypi
pyopenssl                 22.0.0             pyhd3eb1b0_0  
pyparsing                 3.0.9            py39h06a4308_0  
pysocks                   1.7.1            py39h06a4308_0  
python                    3.9.13               haa1d7c7_1  
python-dateutil           2.8.2              pyhd3eb1b0_0  
pytorch                   1.12.1          py3.9_cuda11.3_cudnn8.3.2_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2022.4                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
pyzmq                     23.2.0           py39h6a678d5_0  
readline                  8.1.2                h7f8727e_1  
regex                     2022.9.13                pypi_0    pypi
requests                  2.28.1           py39h06a4308_0  
responses                 0.18.0                   pypi_0    pypi
scipy                     1.9.2                    pypi_0    pypi
sentry-sdk                1.9.10                   pypi_0    pypi
setproctitle              1.3.2                    pypi_0    pypi
setuptools                63.4.1           py39h06a4308_0  
shortuuid                 1.0.9                    pypi_0    pypi
six                       1.16.0             pyhd3eb1b0_1  
smmap                     5.0.0                    pypi_0    pypi
sortedcontainers          2.4.0                    pypi_0    pypi
sqlite                    3.39.3               h5082296_0  
stack_data                0.2.0              pyhd3eb1b0_0  
tk                        8.6.12               h1ccaba5_0  
tokenizers                0.12.1                   pypi_0    pypi
torchaudio                0.12.1               py39_cu113    pytorch
torchvision               0.13.1               py39_cu113    pytorch
tornado                   6.2              py39h5eee18b_0  
tqdm                      4.64.1                   pypi_0    pypi
traitlets                 5.1.1              pyhd3eb1b0_0  
transformers              4.21.3                   pypi_0    pypi
typing_extensions         4.3.0            py39h06a4308_0  
tzdata                    2022c                h04d1e81_0  
urllib3                   1.26.11          py39h06a4308_0  
uvloop                    0.17.0                   pypi_0    pypi
varint                    1.0.2                    pypi_0    pypi
wandb                     0.13.4                   pypi_0    pypi
wcwidth                   0.2.5              pyhd3eb1b0_0  
wheel                     0.37.1             pyhd3eb1b0_0  
xxhash                    3.0.0                    pypi_0    pypi
xz                        5.2.6                h5eee18b_0  
yarl                      1.8.1                    pypi_0    pypi
zeromq                    4.3.4                h2531618_0  
zlib                      1.2.12               h5eee18b_3  
zstd                      1.5.2                ha4553b6_0

I used the small model only for debugging purposes; I need to distribute the model across multiple GPUs since I intend to run the 176B BLOOM version. I tried to naively convert the tensor at that line to a supported dtype, but then another error occurred further down the line.

Since I want to do prompt tuning on 8x 40GB A100s, I think I have to use BFloat16 & 8-bit, or is there another solution/workaround with good performance?
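
For context on the traceback above: NumPy has no bfloat16 dtype, so torch.Tensor.numpy() fails for bfloat16 tensors with exactly this error. Below is a minimal illustration of the failure and of the generic upcast-before-conversion pattern; it is a sketch of the underlying limitation, not necessarily the right fix inside hivemind's compression code:

import torch

x = torch.randn(2, 3, dtype=torch.bfloat16)
# x.numpy()  # raises TypeError: Got unsupported ScalarType BFloat16 (NumPy has no bfloat16 dtype)
x_np = x.to(torch.float32).numpy()  # upcasting to float32 first makes the conversion possible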

Tensor parallelism exception

Dual P100 setup:

(petals) qtr@pve:~/petals$ python -m petals.cli.run_server bigscience/bloomz-petals --tensor_parallel_devices cuda:0 cuda:1 --num_blocks 6
Jan 31 07:28:41.703 [INFO] Running Petals 1.1.2
Jan 31 07:28:56.893 [INFO] Direct reachability: 3/5
Jan 31 07:28:56.937 [INFO] This server will run DHT in full peer mode
Jan 31 07:28:59.927 [INFO] Connecting to the public swarm, peer_id = 12D3KooWMMiL1fw5oPpsegj4E1iu7Woqqc9rLzjrGoannTdJNxY1
Jan 31 07:29:00.045 [INFO] Model weights will be split between cuda:0, cuda:1
Jan 31 07:29:00.082 [WARN] [/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/server.py.__init__:167] Tensor parallelism doesn't work properly with 8-bit weights yet, loading weights in 16-bit. You can explicitly set `--load_in_8bit True` to override this
Jan 31 07:29:00.082 [INFO] Model weights will be loaded in auto format
Jan 31 07:29:00.082 [INFO] Attention cache for all blocks will consume up to 3.00 GiB
Jan 31 07:29:00.084 [INFO] Loading throughput info
Jan 31 07:29:01.221 [INFO] Announced that blocks [7, 8, 9, 10, 11, 12] are joining
Jan 31 07:29:01.622 [INFO] Reachability service started
Jan 31 07:29:29.358 [INFO] Loaded bigscience/bloomz-petals block 7, <All keys matched successfully>
Jan 31 07:30:04.819 [INFO] Loaded bigscience/bloomz-petals block 8, <All keys matched successfully>
Jan 31 07:30:35.309 [INFO] Loaded bigscience/bloomz-petals block 9, <All keys matched successfully>
Jan 31 07:31:05.673 [INFO] Loaded bigscience/bloomz-petals block 10, <All keys matched successfully>
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [03:15<00:00, 25.2MB/s]
Jan 31 07:34:51.803 [INFO] Loaded bigscience/bloomz-petals block 11, <All keys matched successfully>
Jan 31 07:35:30.357 [INFO] Loaded bigscience/bloomz-petals block 12, <All keys matched successfully>
Jan 31 07:35:34.023 [INFO] Server is reachable from the Internet. It will appear at http://health.petals.ml soon
Jan 31 07:35:35.407 [INFO] Started
Jan 31 08:16:10.959 [INFO] rpc_inference.open(blocks=7:13, remote_peer=...AM4oxF)
Jan 31 08:16:10.962 [INFO] rpc_inference.wait_for_alloc(size=0.16 GiB), already used 0.00/3.00 GiB (0.0%)
Jan 31 08:16:10.965 [INFO] rpc_inference.alloc(size=0.16 GiB)
Jan 31 08:16:11.371 [ERROR] [hivemind.moe.server.runtime.run:104] Caught indices should be either on cpu or on the same device as the indexed tensor (cuda:1), attempting to recover
Traceback (most recent call last):
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/hivemind/moe/server/runtime.py", line 91, in run
    outputs = pool.process_func(*batch)
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/backend.py", line 175, in __call__
    (hidden_states,) = self.backends[inference_info.uid].inference_step(hidden_states, hypo_ids, inference_info)
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/backend.py", line 92, in inference_step
    self._reorder_cache_inplace(cache_tensors, hypo_ids)
  File "/home/qtr/anaconda3/envs/petals/lib/python3.10/site-packages/petals/server/backend.py", line 102, in _reorder_cache_inplace
    cache_tensor[...] = cache_tensor[hypo_ids]  # in-place reorder cache by hypo ids
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
Jan 31 08:16:11.407 [INFO] rpc_inference.close(blocks=7:13, remote_peer=...AM4oxF)
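
For reference, a minimal reproduction of the device mismatch behind this traceback (assumes two visible GPUs; an illustration only, not the actual Petals fix):

import torch

cache = torch.zeros(4, 8, device="cuda:1")              # the attention cache lives on the second GPU
hypo_ids = torch.tensor([0, 2, 1, 3], device="cuda:0")  # the index tensor arrives on a different GPU
# cache[...] = cache[hypo_ids]  # raises: indices should be either on cpu or on the same device as the indexed tensor
cache[...] = cache[hypo_ids.to(cache.device)]           # moving the indices to the cache's device avoids the mismatch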

Outdated server version warning

  • Add a warning message when starting a Petals server if the version in use is outdated.
    • This should prevent servers from joining the swarm when they are not up-to-date.

BONUS: also warn currently running servers about a new update, and ban them after a defined grace period if the update has not been performed.
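
A minimal sketch of what such a startup check could look like; the constants and the source of the minimum version are assumptions, not an existing Petals mechanism:

from packaging import version

LOCAL_VERSION = "1.1.2"    # version of the server that is starting up
MINIMUM_VERSION = "1.1.5"  # hypothetically fetched from the swarm or a well-known URL

def check_version(local: str, minimum: str) -> None:
    # Refuse to join the swarm (or at least warn loudly) if this server is too old
    if version.parse(local) < version.parse(minimum):
        raise RuntimeError(f"Petals {local} is outdated; please upgrade to >= {minimum} before joining the swarm")

check_version(LOCAL_VERSION, MINIMUM_VERSION)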

chat.petals integration with Hugging Face Spaces

Currently, you can use the model either directly via this notebook or through http://chat.petals.ml/.

The latter is a self-hosted chat app that runs code similar to the colab notebook, but wraps it with a simple web server. The source code is available here.

It would be great if anybody could fork the chat project and reuse it for their pet projects. The thing is, not everyone has a server to host that web app, and it is not always convenient to share Jupyter notebooks for everything.

To make it easier to create similar apps, we can create an example with Spaces. Hugging Face Spaces lets you run simple web apps on infrastructure paid for by HF.

References:

Task: create something similar to chat.petals on top of HF Spaces. One way to do this is to fork @younesbelkada's project and update it to the latest stable version of Petals.
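
For instance, a minimal (hypothetical) Gradio app for a Space could wrap the distributed client shown elsewhere in this document; the model name is a placeholder, the import path assumes the packaged library rather than the src layout used in the bug report above, and this is not the actual chat.petals code:

import gradio as gr
import transformers
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"  # placeholder; any Petals-served model should work

tokenizer = transformers.BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

def complete(prompt: str) -> str:
    # Run generation through the public swarm and return the decoded text
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0])

gr.Interface(fn=complete, inputs="text", outputs="text").launch()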

Running on SLURM

Having to hard-code IP addresses makes it very hard to run Petals on a SLURM cluster. There, I submit batch jobs that are then run on some node of the partition I specified, so I do not know beforehand the IP of the node (or of any nodes) that will run a Petals server instance.

So one thing that would be helpful is "self discovery" of Petals server instances inside a specified network.
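
One possible workaround, sketched below under the assumption that all nodes share a filesystem: each job resolves its own IP at start-up, publishes its multiaddr to a shared rendezvous file, and later jobs read that file to build their initial_peers list. The path, file format, and helper names are hypothetical, not an existing Petals feature.

import socket
from pathlib import Path

RENDEZVOUS = Path("/shared/petals_initial_peers.txt")  # hypothetical file visible to all SLURM nodes

def my_ip() -> str:
    # Resolve this node's primary IP at runtime instead of hard-coding it in the batch script
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 80))  # no packets are actually sent; this only selects the outgoing interface
    ip = s.getsockname()[0]
    s.close()
    return ip

def publish(peer_maddr: str) -> None:
    # Called by the bootstrap server once its /ip4/.../p2p/... address is known
    with RENDEZVOUS.open("a") as f:
        f.write(peer_maddr + "\n")

def discover() -> list:
    # Later jobs pass these addresses as initial_peers
    return RENDEZVOUS.read_text().split() if RENDEZVOUS.exists() else []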

Implement standard inference modes

It would be awesome to have a collection of standard inference methods, e.g. greedy, temperature, top-p, top-k, and eventually tree/beam search and batched inference once we support them on the backend.

For now, perhaps, it would be best to implement them as standalone functions / functors that take the full model as input and do the inference. Eventually, we'll figure out the best way of integrating them together.
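
As a concrete (hypothetical) example of the standalone-function style described above, a top-k sampler that only assumes a causal LM returning .logits might look like this; greedy, temperature, and top-p variants would follow the same shape with a different filtering step:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_top_k(model, input_ids, max_new_tokens=20, k=50, temperature=1.0):
    for _ in range(max_new_tokens):
        logits = model(input_ids=input_ids).logits[:, -1, :] / temperature
        topk_vals, topk_idx = torch.topk(logits, k)                    # keep only the k most likely tokens
        probs = F.softmax(topk_vals, dim=-1)
        next_token = topk_idx.gather(-1, torch.multinomial(probs, 1))  # sample within the top-k set
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids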

Roadmap (tentative)

  1. sampling, greedy, top-k, nucleus, etc -- with obligatory support for prefixes
  2. inference with prompt-tuned model
  3. beam search (requires changes on backend)

.. and then, in no particular order,

  • inference with LoRA / AdaMix
  • user-defined constraints, other crazy stuff

[DESIGN] user experience

This issue is about how to make distributed BLOOM user-friendly. See the comments below for an interface prototype.

A bunch of things that we may want to make alongside the tech so that people can... you know... use it :)

  • basic website (see training-transformers-together.github.io/ )

    • what to expect from it?
      • like openai api
    • why can't i just use ram/ssd offload?
      • it's faster together; passing activations for large models is faster than swapping layers
    • legal disclaimer: by using our model you agree to follow the model's license, read more here
  • inference notebook

    • as responsive as possible, print words as they appear (see the streaming sketch after this list)
    • option to show what happens under the hood (i.e. running through this guy; that guy disconnected; waiting for this)
    • [optional] make it easy to customize sampling method
  • prompt-tuning notebook

    • pick some task that can be solved relatively quickly with visible results
    • TODO ask @artek0chumak on best practices
    • [optional] explain how to add other adapters / ptuning modes
    • [optional] inference prompt-tuned model? push adapters to HF?
    • [optional] collaborative prompt-tuning?
  • inference website

    • the default option can simply run inference notebook on server hosted by us
  • System Demonstration
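
As referenced in the inference-notebook item above, here is a minimal streaming-print sketch that assumes an HF-style generate() API; it naively re-sends the whole prefix on every step, while the real notebook would likely use a more efficient dedicated inference session:

import sys
import torch

@torch.no_grad()
def stream_generate(model, tokenizer, prompt, max_new_tokens=30):
    # Print each new token as soon as it is produced instead of waiting for the full sequence
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    for _ in range(max_new_tokens):
        input_ids = model.generate(input_ids, max_new_tokens=1, do_sample=False)
        sys.stdout.write(tokenizer.decode(input_ids[0, -1:]))
        sys.stdout.flush()
    print()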
