We recently added a DeviceMesh.from_group() API to su


[DeviceMesh] Add support for `group: Tuple[ProcessGroup, ...]` in `from_group()` about pytorch HOT 2 OPEN

awgu commented on May 25, 2024

[DeviceMesh] Add support for `group: Tuple[ProcessGroup, ...]` in `from_group()`


Comments (2)

wanchaol commented on May 25, 2024

@awgu I thought a bit about this. The recovery math can be quite complicated in N-D scenarios; even for 2-D/3-D it seems to be a non-trivial amount of code. I'm wondering if we should handle this the way we handle things like device_type: when users want to construct a device mesh from a pg, they also need to tell us a bit more about their subpg structure by passing in a mesh_tensor in addition to device_type. Then, under the hood, we just do some simple validation to make sure the pg_ranks are the same as the mesh_tensor dimension values.

The mesh tensor dim values can be derived easily, similar to this: https://github.com/pytorch/pytorch/blob/main/torch/distributed/device_mesh.py#L289
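To make the validation idea concrete, here is a minimal pure-Python sketch; it is not the actual DeviceMesh code. The mesh is modeled as a dict from coordinates to ranks instead of a tensor, and `arange_mesh`, `ranks_along_dim`, and `validate_groups` are hypothetical helper names. In a real job the per-group rank lists would presumably come from something like `torch.distributed.get_process_group_ranks(pg)`; here they are passed in as plain lists, one per mesh dim, in mesh-dim order.

```python
from itertools import product


def arange_mesh(shape):
    """Row-major {coordinate: rank} mesh, like torch.arange(n).view(shape)."""
    coords = product(*(range(n) for n in shape))
    return {coord: rank for rank, coord in enumerate(coords)}


def ranks_along_dim(mesh, shape, rank, dim):
    """Ranks that share every coordinate with `rank` except along `dim`."""
    coord = next(c for c, r in mesh.items() if r == rank)
    return [mesh[coord[:dim] + (i,) + coord[dim + 1:]] for i in range(shape[dim])]


def validate_groups(mesh, shape, rank, group_ranks):
    """Raise ValueError unless each group's ranks match the mesh slice through `rank`."""
    if len(group_ranks) != len(shape):
        raise ValueError(f"expected {len(shape)} groups, got {len(group_ranks)}")
    for dim, ranks in enumerate(group_ranks):
        expected = ranks_along_dim(mesh, shape, rank, dim)
        if list(ranks) != expected:
            raise ValueError(f"dim {dim}: group ranks {list(ranks)} != mesh ranks {expected}")


# Usage with a (4, 8) mesh: rank 0's two groups pass validation without raising.
mesh = arange_mesh((4, 8))
validate_groups(mesh, (4, 8), 0, [[0, 8, 16, 24], list(range(8))])
```

The point of this design is that the user-supplied mesh is the source of truth, so no recovery math is needed; the check only compares each group against one slice of the mesh.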


awgu commented on May 25, 2024

I am not clear how to recover the mesh tensor from the process groups in the general case. Each rank can get the ranks of the process groups passed to it, but to recover the mesh, we need to do some math.

For example, if we have mesh = torch.arange(32).view(4, 8), then rank 0 sees inter-node PG with ranks (0, 8, 16, 24) and intra-node PG with ranks (0, 1, 2, 3, 4, 5, 6, 7). We can see that the intra-node (0, 1, 2, 3, 4, 5, 6, 7) increments by 1 each time and use that to fill out the ranks along each element in (0, 8, 16, 24) to get back mesh.

Now, if we have, say, mesh = torch.arange(128).view(4, 4, 8), where the rightmost dim is excluded (e.g. 8-way TP with (4, 4)-way HSDP), then rank 0 sees "intra-node" ranks (0, 8, 16, 24) and "inter-node" ranks (0, 32, 64, 96). We can similarly see that (0, 8, 16, 24) increments by 8 each time and use that to fill out the ranks along each element in (0, 32, 64, 96).

Now, how does this generalize to the user passing in N process groups?
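One possible answer, sketched under a strong assumption: if the mesh is a regularly strided layout (for example a slice of `torch.arange(n).view(...)`, as in both examples above), then the offsets along each dim simply add up, and the entry at index (i_0, ..., i_{N-1}) is the local rank plus the sum over dims of (group_d[i_d] - rank). The pure-Python sketch below uses a hypothetical helper name, `recover_mesh`, takes the rank lists in mesh-dim order, and returns the mesh as a dict from coordinates to ranks. It does not work for an arbitrary mesh tensor with permuted ranks.

```python
from itertools import product


def recover_mesh(rank, group_ranks):
    """Rebuild (shape, {coordinate: rank}) from one rank's per-dim group rank lists.

    Assumption: the mesh is regularly strided, so the offset contributed by
    each dim is independent of the other dims and the offsets can be summed.
    """
    for dim, ranks in enumerate(group_ranks):
        if rank not in ranks:
            raise ValueError(f"rank {rank} is not in the group for dim {dim}")
    shape = tuple(len(ranks) for ranks in group_ranks)
    mesh = {}
    for coord in product(*(range(n) for n in shape)):
        # Start from the local rank and add the offset along each dim.
        mesh[coord] = rank + sum(
            ranks[i] - rank for ranks, i in zip(group_ranks, coord)
        )
    return shape, mesh


# The (4, 4)-way HSDP example with the 8-way TP dim excluded, as seen by rank 0.
shape, mesh = recover_mesh(0, [[0, 32, 64, 96], [0, 8, 16, 24]])
assert shape == (4, 4)
assert all(mesh[(i, j)] == 32 * i + 8 * j for i in range(4) for j in range(4))
```

Because every rank only knows its own groups, each rank runs this locally; under the strided assumption all ranks arrive at the same mesh.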

