Comments (15)
Disable Adam offload, or keep each model within a single node.
Also, --rollout_batch_size 792 seems like a large value, so it will be slow.
from openrlhf.
Actually, rollout is quite fast: I roll out 792 samples in just a few minutes.
I've found that the slowness is due to slow communication between nodes.
When a training step gets stuck, I inspect the status via the Ray dashboard. It shows that the inter-node communication speed is 11.73 MB/s, which is far too slow for NCCL. So I suspect the NCCL backend is not actually being used, but I don't know how to verify that, or how to make sure NCCL is activated.
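For scale, here's a back-of-envelope sketch of why 11.73 MB/s makes weight sync look like a hang (assuming a 70B model in bf16; the 25 GB/s rate is an illustrative InfiniBand-class figure, not a measurement from this cluster):

```python
def sync_time_seconds(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Time to push one full copy of the weights over a single link."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s

# ~140 GB of bf16 weights for a 70B-parameter model
slow = sync_time_seconds(70e9, 2, 11.73e6)   # the 11.73 MB/s seen on the dashboard
fast = sync_time_seconds(70e9, 2, 25e9)      # a healthy IB-class link (illustrative)

print(f"at 11.73 MB/s: {slow / 3600:.1f} hours per sync")   # ~3.3 hours
print(f"at 25 GB/s:    {fast:.1f} seconds per sync")        # 5.6 seconds
```

At Ethernet-management-plane speeds, a single weight broadcast would take hours, which is indistinguishable from a hang in practice.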
from openrlhf.
I'm not using Adam offload.
From the dashboard, each model is on one node:
6 actors on node 0,
6 reference actors on node 1,
1 reward actor on node 0,
and the rest are vLLM engines.
from openrlhf.
OpenRLHF only supports NCCL weight sync with vLLM 0.4.1 – 0.4.2.
If each model is on a separate node, the communication only occurs during weight synchronization.
We also need to consider what connects these machines (InfiniBand or Ethernet?).
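On that point, a quick stdlib heuristic for checking whether a node even exposes an InfiniBand-style interface (a sketch only; interface naming conventions vary by driver, and "ib*" is just the common default):

```python
import socket

def infiniband_candidates(names=None):
    """Return interface names that look like InfiniBand (conventionally 'ib0', 'ib1', ...).

    If no IB interface exists, NCCL can only fall back to TCP sockets over
    Ethernet, which caps cross-node bandwidth far below what weight sync
    for a 70B model needs.
    """
    if names is None:
        # socket.if_nameindex() lists the host's network interfaces (Linux)
        names = [name for _, name in socket.if_nameindex()]
    return [n for n in names if n.startswith("ib")]
```

Running this on each node (or just `ip link` in a shell) tells you whether IB hardware is even visible to the OS before debugging NCCL itself.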
from openrlhf.
Could you share how you set up multi-node training for Llama 3 70B, e.g. the cluster setup and how Ray is configured?
I think it's most likely a communication problem, so I'm also asking my GPU provider for hardware details.
from openrlhf.
Also, in the code we create several torch process groups, which contain remote actors or vLLM engines.
When we build these process groups and specify nccl as the backend, how does Ray know the communication specifics needed to actually use NCCL? I suspect it's using Ethernet for communication.
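Ray itself doesn't choose the transport: `torch.distributed.init_process_group(backend="nccl")` does, and NCCL then picks a NIC based on its environment variables. One way to steer and inspect that choice is to forward the standard NCCL knobs to every worker, e.g. via Ray's `runtime_env`. A minimal sketch (the helper name is mine; the env vars are standard NCCL settings):

```python
def nccl_runtime_env(ifname="ib0", debug="INFO"):
    """Build a Ray runtime_env dict that forwards NCCL knobs to each worker.

    NCCL_DEBUG=INFO makes NCCL log at init which transport it selected
    (IB verbs vs. plain TCP sockets); NCCL_SOCKET_IFNAME pins the NIC.
    """
    return {
        "env_vars": {
            "NCCL_DEBUG": debug,           # log transport selection at group init
            "NCCL_SOCKET_IFNAME": ifname,  # e.g. "ib0" for InfiniBand, "eth0" for Ethernet
        }
    }
```

With `ray.init(runtime_env=nccl_runtime_env())`, the NCCL init logs on each worker will say explicitly whether it is using IB verbs or falling back to sockets over the wrong interface.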
from openrlhf.
We have tested on both 16 A100s (2 nodes) and 32 A100s (4 nodes), and the 70B model works fine. Of course, our cluster is in an NCCL environment.
see https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_ppo_llama_ray_70b.sh
from openrlhf.
Yeah, I've checked that file before; at this point I almost know every file in the OpenRLHF repo, haha.
What I want to ask about is how the Ray cluster is set up, and the details of how torch process groups work with Ray remote actors.
from openrlhf.
Ray is merely a glue/process launcher and does not affect or participate in DeepSpeed's NCCL communication.
We have provided the ray launch script on slurm (+NCCL) cluster:
https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_ppo_llama_ray_slurm.sh
It just uses ray start xxxx.
from openrlhf.
Hmm, I see. I'm using ray start and ray job submit as well.
I printed out the backend used in the process groups, and it's nccl.
But I still have no idea why node-to-node communication is that slow with NCCL.
However, I'm sure that torchrun and mpirun can successfully use NCCL, because we can run pretraining on similar clusters. Unfortunately, I can't use torchrun to start vLLM because it hangs. I haven't tried mpirun; have you tried it before?
from openrlhf.
This is very strange; I haven't encountered a similar issue before. Ray doesn't affect single-node or multi-node communication, as Ray is only responsible for launching the DeepSpeed processes on different machines; DeepSpeed then handles the communication on its own. We've only introduced one additional communication step: using NCCL to synchronize weights from DeepSpeed to vLLM (this may cross nodes).
from openrlhf.
Yeah, I used Grafana to visualize the results. I verified that training is indeed stuck at broadcasting the new weights to the vLLM engines (across nodes).
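One way to confirm where it stalls without Grafana is to wrap the suspect call in a watchdog timer. A pure-stdlib sketch (the function being timed, e.g. the cross-node broadcast, is passed in as a callable; `run_with_watchdog` is a hypothetical helper, not OpenRLHF code):

```python
import threading

def run_with_watchdog(fn, timeout_s, label="op"):
    """Run fn() while a watchdog thread waits; flag if fn overruns timeout_s.

    Returns (result, hung): hung is True if fn was still running when the
    timeout expired -- useful for pinning down which call actually stalls.
    """
    done = threading.Event()
    hung = threading.Event()

    def watchdog():
        # done.wait returns False only if the timeout expired before fn finished
        if not done.wait(timeout_s):
            hung.set()
            print(f"[watchdog] {label} still running after {timeout_s}s -- likely hung")

    threading.Thread(target=watchdog, daemon=True).start()
    result = fn()  # e.g. lambda: broadcast the weights to the vLLM engines
    done.set()
    return result, hung.is_set()
```

Wrapping each phase of the step (rollout, training, weight sync) this way localizes the hang to a single call rather than a whole step.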
from openrlhf.
(screenshot attachment; the original signed GitHub image URL has expired)
from openrlhf.
(screenshot attachment; the original signed GitHub image URL has expired)
from openrlhf.
This is likely due to a failure in establishing the NCCL group; the broadcast then hangs here:
OpenRLHF/openrlhf/trainer/ray/ppo_actor.py
Line 158 in 6a13d95
Could you try vLLM 0.4.2, or use --vllm_sync_backend gloo?
from openrlhf.