Comments (15)

hijkzzz commented on July 30, 2024

Disable Adam offload, or keep each model on a single node.
Also, --rollout_batch_size 792 is a large value, so it will be slow.

babu111 commented on July 30, 2024

Actually, rollout is quite fast; I roll out 792 samples in just a few minutes.
I've found that the slowness comes from slow communication between nodes.
When the training step gets stuck, I visualize the status with the Ray dashboard. It shows that the communication speed between nodes is 11.73 MB/s, which is far too slow for NCCL. So I suspect the NCCL backend is not actually being used, but I don't know how to check whether that's true, or how to make sure NCCL is activated.
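
One minimal way to check (a sketch, assuming a throwaway script can be launched with torchrun on both nodes; the endpoint below is a placeholder): set NCCL_DEBUG=INFO before the first collective so NCCL logs which transport it selected (NET/IB vs NET/Socket) and which interface it bound to, and time a large broadcast to get a bandwidth figure to compare against the 11.73 MB/s reading.

```python
# check_nccl.py -- minimal sketch, not OpenRLHF code. Launch on both nodes with e.g.:
#   torchrun --nnodes=2 --nproc_per_node=1 --rdzv_backend=c10d \
#            --rdzv_endpoint=<head-node-ip>:29400 check_nccl.py
import os
import time

import torch
import torch.distributed as dist

# Must be set before the first NCCL call; NCCL then logs the chosen transport and NIC.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

buf = torch.empty(256 * 1024 * 1024, device="cuda")  # 1 GiB of fp32

dist.broadcast(buf, src=0)   # warm-up: builds the NCCL communicator
torch.cuda.synchronize()

start = time.time()
dist.broadcast(buf, src=0)
torch.cuda.synchronize()
elapsed = time.time() - start

if dist.get_rank() != 0:
    print(f"cross-node broadcast bandwidth: ~{buf.numel() * 4 / elapsed / 1e9:.2f} GB/s")
dist.destroy_process_group()
```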

babu111 commented on July 30, 2024

I'm not using Adam offload.
From the dashboard, each model is on one node:
6 actor-model workers on node 0,
6 reference-model workers on node 1,
1 reward-model worker on node 0,
and the rest are vLLM engines.

hijkzzz commented on July 30, 2024

OpenRLHF only supports NCCL weight sync with vLLM 0.4.1–0.4.2.
If each model is on a separate node, communication only occurs during weight synchronization.
We also need to consider what interconnect links these machines (InfiniBand or Ethernet?).
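
For reference, these are the standard NCCL environment variables that steer transport selection. This is a sketch only, with placeholder interface/device names, and the variables have to be present in the environment of every Ray worker process (e.g. exported before running ray start on each node), not just in the driver.

```python
# Sketch only: standard NCCL environment variables; "eth0" and "mlx5_0" are placeholders
# for whatever interfaces/HCAs the nodes actually have.
import os

os.environ["NCCL_DEBUG"] = "INFO"          # log which transport/interface NCCL picks

# If InfiniBand is available: keep IB enabled and (optionally) name the HCAs explicitly.
os.environ["NCCL_IB_DISABLE"] = "0"
os.environ["NCCL_IB_HCA"] = "mlx5_0"       # placeholder HCA name

# If Ethernet only: pin NCCL's socket transport to the fast NIC instead of letting it
# fall back to a slow management interface (a common cause of ~10 MB/s cross-node links).
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder interface name
```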

babu111 commented on July 30, 2024

Could you share how you set up the multi-node cluster to train Llama 3 70B, e.g. the cluster topology and how Ray is configured?
I think it's most likely a problem with the comms, so I'm also asking my GPU provider for hardware details.

babu111 commented on July 30, 2024

Also, in the code we create several torch process groups, and these groups contain remote actors or vLLM engines.
When we build the process groups and specify NCCL as the backend, how does Ray know the communication details needed to actually use NCCL? I suspect it's using Ethernet for comms.
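
For what it's worth, the general pattern looks roughly like the sketch below (illustrative, not OpenRLHF's actual code): each Ray actor calls torch.distributed.init_process_group itself, rendezvousing at MASTER_ADDR:MASTER_PORT. Ray only places the processes; the backend="nccl" argument is what routes the collectives through NCCL, and NCCL then chooses the network transport on its own.

```python
# Minimal sketch of a torch process group spanning Ray actors (illustrative, not OpenRLHF code).
import os

import ray
import torch
import torch.distributed as dist


@ray.remote(num_gpus=1)
class Worker:
    def setup(self, master_addr: str, master_port: int, rank: int, world_size: int) -> None:
        # The rendezvous is plain torch.distributed (a TCP store at master_addr:master_port);
        # Ray is not involved beyond having started this process on some node.
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(0)  # num_gpus=1 => exactly one visible GPU per actor

    def broadcast_test(self) -> float:
        t = torch.full((4,), float(dist.get_rank()), device="cuda")
        dist.broadcast(t, src=0)  # this collective runs over NCCL, not over Ray
        return float(t.sum())


ray.init(address="auto")
workers = [Worker.remote() for _ in range(2)]
# "10.0.0.1" is a placeholder for the IP of the node hosting rank 0.
ray.get([w.setup.remote("10.0.0.1", 29500, i, 2) for i, w in enumerate(workers)])
print(ray.get([w.broadcast_test.remote() for w in workers]))  # expects [0.0, 0.0]
```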

hijkzzz commented on July 30, 2024

> Also, in the code we create several torch process groups, and these groups contain remote actors or vLLM engines. When we build the process groups and specify NCCL as the backend, how does Ray know the communication details needed to actually use NCCL? I suspect it's using Ethernet for comms.

We have tested on both 16 A100s (2 nodes) and 32 A100s (4 nodes), and the 70B model works fine. Of course, our cluster has a proper NCCL environment.
See https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_ppo_llama_ray_70b.sh

babu111 commented on July 30, 2024

Yeah, I've checked that file before. I've almost reached the point where I know every file in the OpenRLHF repo, haha.
What I want to ask about is how the Ray cluster is set up, and the details of how the torch process groups work with Ray remote actors.

hijkzzz commented on July 30, 2024

Ray is merely a glue layer / process launcher and does not affect or participate in DeepSpeed's NCCL communication.
We have provided a Ray launch script for a Slurm (+NCCL) cluster:
https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_ppo_llama_ray_slurm.sh
It just uses ray start xxxx.

babu111 commented on July 30, 2024

Hmm, I see. I'm using ray start and ray job submit as well.
I printed out the backend used by the process groups, and it's NCCL.
But I still have no idea why node-to-node communication is that slow with NCCL.

However, I'm sure that torchrun and mpirun can successfully use NCCL, because we can run pretraining on similar clusters. Unfortunately, I can't use torchrun to start vLLM because it hangs. I haven't tried mpirun. Have you tried it before?

hijkzzz commented on July 30, 2024

This is very strange; I haven't encountered a similar issue before. Ray doesn't affect whether communication is single-node or multi-node: Ray is only responsible for launching the DeepSpeed processes on the different machines, and DeepSpeed then handles the communication on its own. We've only introduced one additional communication step, which uses NCCL to synchronize weights from DeepSpeed to vLLM (and this may cross nodes).
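
For context, that extra step looks roughly like the sketch below (names are illustrative, not the exact OpenRLHF code): rank 0 of a dedicated model-update group is the training process holding the fresh weights, and it broadcasts each parameter to the vLLM engine ranks. If the 11.73 MB/s figure really is the usable cross-node bandwidth, pushing the bf16 weights of a 70B model (~140 GB) through this step would take on the order of hours, which would look exactly like a hang.

```python
# Illustrative sketch of the DeepSpeed -> vLLM weight-sync step (not the exact OpenRLHF code).
import torch
import torch.distributed as dist


def sync_weights_to_vllm(model: torch.nn.Module, model_update_group) -> None:
    # Every rank in the group must iterate the parameters in the same order and issue a
    # broadcast for each one; rank 0 (the trainer) is the source and the vLLM engine
    # workers are the receivers. A failed group or mismatched call order makes this hang.
    for param in model.parameters():
        dist.broadcast(param.data, src=0, group=model_update_group)
```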

babu111 commented on July 30, 2024

Yeah, I used Grafana to visualize the metrics. I verified that training is indeed stuck at broadcasting the new weights to the vLLM engines (across nodes).

babu111 commented on July 30, 2024

[screenshot] This is with vllm_tensor_parallel_size 2.

babu111 commented on July 30, 2024

[screenshot] This is with vllm_tensor_parallel_size 1.

hijkzzz commented on July 30, 2024

This is likely due to a failure in establishing the NCCL group:

self._model_update_group = init_process_group(

and then the broadcast hangs here:
torch.distributed.broadcast(param.data, 0, group=self._model_update_group)

Could you try vLLM 0.4.2 or use --vllm_sync_backend gloo?
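
For reference, the gloo fallback switches the backend of the model-update group, so the weight broadcast travels over plain TCP sockets instead of NCCL; a rough sketch is below (illustrative, not OpenRLHF's exact call site). It is slower than healthy NCCL over InfiniBand, but it sidesteps a misbehaving NCCL transport entirely.

```python
# Rough sketch of the backend switch (illustrative, not OpenRLHF's exact call site).
import torch.distributed as dist


def make_model_update_group(ranks, backend: str = "gloo"):
    # backend="gloo" corresponds to --vllm_sync_backend gloo; "nccl" is the default path.
    # new_group() must be called by all processes in the already-initialized default group.
    return dist.new_group(ranks=ranks, backend=backend)
```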
