Comments (15)
Disable Adam offload, or keep each model within a single node.
Also, --rollout_batch_size 792 seems like a large value, so it will be slow.
from openrlhf.
Actually, rollout is quite fast: I roll out 792 samples in just a few minutes.
I've found that the slowness is due to slow communication between nodes.
When a training step gets stuck, I inspect the status via the Ray dashboard. It shows that the inter-node communication speed is 11.73 MB/s, which is far too slow for NCCL. So I suspect the NCCL backend is not actually being used, but I don't know how to verify that, or how to make sure NCCL is activated.
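For scale, here's a back-of-envelope sketch of why 11.73 MB/s makes weight sync look like a hang (assuming a 70B model in bf16; the 25 GB/s rate is an illustrative InfiniBand-class figure, not a measurement from this cluster):

```python
def sync_time_seconds(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Time to push one full copy of the weights over a single link."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s

# ~140 GB of bf16 weights for a 70B-parameter model
slow = sync_time_seconds(70e9, 2, 11.73e6)   # the 11.73 MB/s seen on the dashboard
fast = sync_time_seconds(70e9, 2, 25e9)      # a healthy IB-class link (illustrative)

print(f"at 11.73 MB/s: {slow / 3600:.1f} hours per sync")   # ~3.3 hours
print(f"at 25 GB/s:    {fast:.1f} seconds per sync")        # 5.6 seconds
```

At Ethernet-management-plane speeds, a single weight broadcast would take hours, which is indistinguishable from a hang in practice.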
from openrlhf.
I'm not using Adam offload.
From the dashboard, each model is on one node:
6 actors on node 0,
6 reference actors on node 1,
1 reward actor on node 0,
and the rest are vLLM engines.
from openrlhf.
OpenRLHF only supports NCCL weight sync with vLLM 0.4.1 – 0.4.2.
If each model is on a separate node, the communication only occurs during weight synchronization.
We also need to consider what connects these machines (InfiniBand or Ethernet?).
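On that point, a quick stdlib heuristic for checking whether a node even exposes an InfiniBand-style interface (a sketch only; interface naming conventions vary by driver, and "ib*" is just the common default):

```python
import socket

def infiniband_candidates(names=None):
    """Return interface names that look like InfiniBand (conventionally 'ib0', 'ib1', ...).

    If no IB interface exists, NCCL can only fall back to TCP sockets over
    Ethernet, which caps cross-node bandwidth far below what weight sync
    for a 70B model needs.
    """
    if names is None:
        # socket.if_nameindex() lists the host's network interfaces (Linux)
        names = [name for _, name in socket.if_nameindex()]
    return [n for n in names if n.startswith("ib")]
```

Running this on each node (or just `ip link` in a shell) tells you whether IB hardware is even visible to the OS before debugging NCCL itself.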
from openrlhf.
Could you share how you set up multi-node training for Llama 3 70B, e.g. the cluster setup and how Ray is configured?
I think it's most likely a communication problem, so I'm also asking my GPU provider for hardware details.
from openrlhf.
Also, in the code we create several torch process groups, which contain remote actors or vLLM engines.
When we build these process groups and specify nccl as the backend, how does Ray know the communication specifics needed to actually use NCCL? I suspect it's using Ethernet for communication.
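Ray itself doesn't choose the transport: `torch.distributed.init_process_group(backend="nccl")` does, and NCCL then picks a NIC based on its environment variables. One way to steer and inspect that choice is to forward the standard NCCL knobs to every worker, e.g. via Ray's `runtime_env`. A minimal sketch (the helper name is mine; the env vars are standard NCCL settings):

```python
def nccl_runtime_env(ifname="ib0", debug="INFO"):
    """Build a Ray runtime_env dict that forwards NCCL knobs to each worker.

    NCCL_DEBUG=INFO makes NCCL log at init which transport it selected
    (IB verbs vs. plain TCP sockets); NCCL_SOCKET_IFNAME pins the NIC.
    """
    return {
        "env_vars": {
            "NCCL_DEBUG": debug,           # log transport selection at group init
            "NCCL_SOCKET_IFNAME": ifname,  # e.g. "ib0" for InfiniBand, "eth0" for Ethernet
        }
    }
```

With `ray.init(runtime_env=nccl_runtime_env())`, the NCCL init logs on each worker will say explicitly whether it is using IB verbs or falling back to sockets over the wrong interface.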
from openrlhf.
We have tested on both 16 A100s (2 nodes) and 32 A100s (4 nodes), and the 70B model works fine. Of course, our cluster is in an NCCL environment.
see https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_ppo_llama_ray_70b.sh
from openrlhf.
Yeah, I've checked that file before; at this point I almost know every file in the OpenRLHF repo, haha.
What I want to ask about is how the Ray cluster is set up, and the details of how torch process groups work with Ray remote actors.
from openrlhf.
Ray is merely a glue/process launcher and does not affect or participate in DeepSpeed's NCCL communication.
We have provided the ray launch script on slurm (+NCCL) cluster:
https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_ppo_llama_ray_slurm.sh
It just uses ray start xxxx.
from openrlhf.
Hmm, I see. I'm using ray start and ray job submit as well.
I printed out the backend used in the process groups, and it's nccl.
But I still have no idea why node-to-node communication is that slow with NCCL.
However, I'm sure that torchrun and mpirun can successfully use NCCL, because we can run pretraining on similar clusters. Unfortunately, I can't use torchrun to start vLLM because it hangs. I haven't tried mpirun; have you tried it before?
from openrlhf.
This is very strange; I haven't encountered a similar issue before. Ray doesn't affect single-node or multi-node communication, as Ray is only responsible for launching the DeepSpeed processes on different machines; DeepSpeed then handles the communication on its own. We've only introduced one additional communication step: using NCCL to synchronize weights from DeepSpeed to vLLM (this may cross nodes).
from openrlhf.
Yeah, I used Grafana to visualize the results. I verified that training is indeed stuck at broadcasting the new weights to the vLLM engines (across nodes).
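One way to confirm where it stalls without Grafana is to wrap the suspect call in a watchdog timer. A pure-stdlib sketch (the function being timed, e.g. the cross-node broadcast, is passed in as a callable; `run_with_watchdog` is a hypothetical helper, not OpenRLHF code):

```python
import threading

def run_with_watchdog(fn, timeout_s, label="op"):
    """Run fn() while a watchdog thread waits; flag if fn overruns timeout_s.

    Returns (result, hung): hung is True if fn was still running when the
    timeout expired -- useful for pinning down which call actually stalls.
    """
    done = threading.Event()
    hung = threading.Event()

    def watchdog():
        # done.wait returns False only if the timeout expired before fn finished
        if not done.wait(timeout_s):
            hung.set()
            print(f"[watchdog] {label} still running after {timeout_s}s -- likely hung")

    threading.Thread(target=watchdog, daemon=True).start()
    result = fn()  # e.g. lambda: broadcast the weights to the vLLM engines
    done.set()
    return result, hung.is_set()
```

Wrapping each phase of the step (rollout, training, weight sync) this way localizes the hang to a single call rather than a whole step.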
from openrlhf.
(screenshot attachment; the original signed GitHub image URL has expired)
from openrlhf.
(screenshot attachment; the original signed GitHub image URL has expired)
from openrlhf.
This is likely due to a failure in establishing the NCCL group; the broadcast then hangs here:
OpenRLHF/openrlhf/trainer/ray/ppo_actor.py
Line 158 in 6a13d95
Could you try vLLM 0.4.2, or use --vllm_sync_backend gloo?
from openrlhf.