
Comments (18)

hclearner commented on September 15, 2024

Understood. If moe_group and dp_group are equal, the pipeline can indeed be packed more tightly. The error is as follows.
Here are the groups printed by each rank:

mp_ranks:  [0, 1]
mp_ranks:  [0, 1]
mp_ranks:  [2, 3]
mp_ranks:  [2, 3]
mp_ranks:  [4, 5]
mp_ranks:  [4, 5]
mp_ranks:  [6, 7]
mp_ranks:  [6, 7]
dp_ranks:  [0, 2]
dp_ranks:  [0, 2]
dp_ranks:  [1, 3]
dp_ranks:  [1, 3]
dp_ranks:  [4, 6]
dp_ranks:  [4, 6]
dp_ranks:  [5, 7]
dp_ranks:  [5, 7]
moe_ranks:  [0, 1, 2, 3]
moe_ranks:  [0, 1, 2, 3]
moe_ranks:  [0, 1, 2, 3]
moe_ranks:  [0, 1, 2, 3]
moe_ranks:  [4, 5, 6, 7]
moe_ranks:  [4, 5, 6, 7]
moe_ranks:  [4, 5, 6, 7]
moe_ranks:  [4, 5, 6, 7]

The error:
NCCL WARN Send : invalid root 4 (root should be in the 0..4 range)
Tracing it, this corresponds to line 18 of cuda/global_exchange.h, the function fmoe_cuda_expert_exchange_impl; the rank passed to send or recv exceeds the limit.
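The failure mode can be sketched in plain Python (a minimal illustration with hypothetical names; the actual check lives inside NCCL): the root/peer index handed to a collective must be a group-local index in [0, group_size), not a global rank.

```python
# Hypothetical illustration: NCCL expects a group-local index, not a
# global rank, as the root/peer of a send or recv.
group_ranks = [0, 1, 2, 3]       # a moe_group of size 4
group_size = len(group_ranks)

def to_local(global_rank):
    # Map a global rank to its index inside the group.
    return group_ranks.index(global_rank)

assert to_local(2) == 2
# A rank from the *other* group, e.g. 4, is not in this group at all,
# so using it directly as an index trips the "invalid root" check:
assert 4 not in group_ranks
assert not (0 <= 4 < group_size)
```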

from fastmoe.

laekov commented on September 15, 2024

mp_group assumes that the inputs of all processes in the group are duplicates; inside mp_group the input is sliced, so it does not significantly reduce memory usage. The feature of partitioning experts to reduce memory consumption is still under development.


hclearner commented on September 15, 2024

So the current version does not actually help much with reducing memory usage? The introduction says different experts are placed on different GPUs via model parallelism.
One more question: is pipeline parallelism across moe_groups supported? Would that help reduce memory usage?


laekov commented on September 15, 2024

Placing different experts on different GPUs does not make the memory footprint smaller than a dense model; rather, the total model size grows with the number of GPUs.
Pipeline parallelism is supported in PR #59 and can reduce memory usage.


hclearner commented on September 15, 2024

Got it. num_experts in FMoETransformerMLP is the number of experts per GPU; I had misunderstood, thanks. So the final compute grid partition is mp_groups * np_groups * moe_groups, right?
To confirm: with 8 (2*2*2) processes, mp_groups is [0,1][2,3][4,5][6,7], np_groups is [0,2][1,3][4,6][5,7], and moe_groups is [0,1,2,3][4,5,6,7].


laekov commented on September 15, 2024

> Got it. num_experts in FMoETransformerMLP is the number of experts per GPU; I had misunderstood, thanks. So the final compute grid partition is mp_groups * np_groups * moe_groups, right? To confirm: with 8 (2*2*2) processes, mp_groups is [0,1][2,3][4,5][6,7], np_groups is [0,2][1,3][4,6][5,7], and moe_groups is [0,1,2,3][4,5,6,7].

What is np_group?


hclearner commented on September 15, 2024

> Got it. num_experts in FMoETransformerMLP is the number of experts per GPU; I had misunderstood, thanks. So the final compute grid partition is mp_groups * np_groups * moe_groups, right? To confirm: with 8 (2*2*2) processes, mp_groups is [0,1][2,3][4,5][6,7], np_groups is [0,2][1,3][4,6][5,7], and moe_groups is [0,1,2,3][4,5,6,7].
>
> What is np_group?

Sorry, that was a typo; I meant dp_group.


laekov commented on September 15, 2024

Without pipeline, dp_group should be [0,2,4,6][1,3,5,7], and moe_group [0-7].


hclearner commented on September 15, 2024

> Without pipeline, dp_group should be [0,2,4,6][1,3,5,7], and moe_group [0-7].

OK, and what if dp_group = mp_group = moe_group = 2? My understanding is that moe_group is the pipeline.


hclearner commented on September 15, 2024

If the process grid is 2 * 2 * 2, mp_group is [0,1][2,3][4,5][6,7] and dp_group is [0,2][1,3][4,6][5,7]. Based on the moe_groups setup in megatron/layers.py, it should correspond to [0,1,2,3][4,5,6,7], but that setting raises an NCCL error; setting it to [0,4][1,5][2,6][3,7] also raises an NCCL error. Is there a reference for setting up this grid? My setup is as follows:

    for j in range(args.world_size // args.mp_size):
        ranks = [j * args.mp_size + i for i in range(args.mp_size)]
        mp_group_tmp = torch.distributed.new_group(ranks, backend="nccl")
        if args.rank in ranks:
            mp_group = mp_group_tmp

    for k in range(args.pipeline_model_parallel_size):
        for j in range(args.mp_size):
            ranks = [
                i * args.mp_size + j + k * (args.world_size // args.pipeline_model_parallel_size)
                for i in range(args.world_size // (args.mp_size * args.pipeline_model_parallel_size))
            ]
            dp_group_tmp = torch.distributed.new_group(ranks, backend="nccl")
            if args.rank in ranks:
                dp_group = dp_group_tmp

    stage_size = args.world_size // args.pipeline_model_parallel_size
    for j in range(stage_size):
        ranks = [i * stage_size + j for i in range(args.pipeline_model_parallel_size)]
        moe_group_tmp = torch.distributed.new_group(ranks, backend="nccl")
        if args.rank in ranks:
            moe_group = moe_group_tmp
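For reference, the three loops can be dry-run without torch.distributed (a sketch assuming world_size=8, mp_size=2, pipeline_model_parallel_size=2). Note that the moe_group loop as written yields [0,4][1,5][2,6][3,7] rather than [0,1,2,3][4,5,6,7]:

```python
# Dry-run of the group construction (hypothetical fixed sizes, no torch).
world_size, mp_size, pipe_size = 8, 2, 2

mp_groups = [[j * mp_size + i for i in range(mp_size)]
             for j in range(world_size // mp_size)]

dp_groups = [[i * mp_size + j + k * (world_size // pipe_size)
              for i in range(world_size // (mp_size * pipe_size))]
             for k in range(pipe_size) for j in range(mp_size)]

stage_size = world_size // pipe_size
moe_groups = [[i * stage_size + j for i in range(pipe_size)]
              for j in range(stage_size)]

print(mp_groups)   # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(dp_groups)   # [[0, 2], [1, 3], [4, 6], [5, 7]]
print(moe_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]] -- not [0,1,2,3][4,5,6,7]
```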


laekov commented on September 15, 2024

> OK, and what if dp_group = mp_group = moe_group = 2? My understanding is that moe_group is the pipeline.

First, moe_group is what fastmoe consumes; it is created by fastmoe's Megatron compatibility layer. In Megatron, a pipeline group contains all the workers of one pipeline, whereas the moe_group we need contains all the workers of the same pipeline stage, so we create our own moe_group in the Megatron compatibility layer.

> If the process grid is 2 * 2 * 2, mp_group is [0,1][2,3][4,5][6,7] and dp_group is [0,2][1,3][4,6][5,7]. Based on the moe_groups setup in megatron/layers.py, it should correspond to [0,1,2,3][4,5,6,7], but that setting raises an NCCL error; setting it to [0,4][1,5][2,6][3,7] also raises an NCCL error. Is there a reference for setting up this grid? My setup is as follows:

Setting moe_groups to [0,1,2,3][4,5,6,7] should be correct. Could you post the exact NCCL error?

btw, in v0.3.0 we plan to make moe_group identical to dp_group, and then do model parallelism over the experts within mp_group.
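The stage-wise grouping described above (one moe_group per pipeline stage, holding all of that stage's workers) can be sketched in pure Python with hypothetical sizes:

```python
# Hypothetical sizes: 8 ranks, 2 pipeline stages -> 4 ranks per stage.
world_size, pipe_size = 8, 2
stage_size = world_size // pipe_size

# One moe_group per pipeline stage, each containing that stage's workers.
moe_groups = [list(range(k * stage_size, (k + 1) * stage_size))
              for k in range(pipe_size)]
print(moe_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```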


laekov commented on September 15, 2024

I suspect the world_size passed to the FMoE class is wrong. Please check it: world_size should be exactly the size of moe_group.
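In the 2*2*2 grid discussed in this thread, that works out as follows (hypothetical arithmetic, assuming one moe_group per pipeline stage):

```python
# The moe_group spans one pipeline stage, so the world_size seen by FMoE
# is the global process count divided by the number of stages.
global_world_size = 8
pipe_parallel_size = 2
moe_world_size = global_world_size // pipe_parallel_size  # size of moe_group
print(moe_world_size)  # 4
```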


hclearner commented on September 15, 2024

So the world_size passed to DistributedGroupedDataParallel is model_parallel_size * data_parallel_size * pipe_parallel_size, while the world_size for FMoETransformerMLP is model_parallel_size * data_parallel_size?


laekov commented on September 15, 2024

DistributedGroupedDataParallel does not take a world_size argument, does it?


hclearner commented on September 15, 2024

Hah, sorry, I misremembered. After changing it to world_size / pipe_parallel_size it runs now. But with pipe=2, memory usage still has not dropped.


laekov commented on September 15, 2024

As long as the dp count is unchanged, memory usage should stay roughly the same.


hclearner commented on September 15, 2024

Is the total number of experts in the whole model experts_per_gpu * model_parallel_size, or experts_per_gpu * model_parallel_size * pipe_parallel_size? If it is the latter, then unchanged memory usage makes sense.


laekov commented on September 15, 2024

> Is the total number of experts in the whole model experts_per_gpu * model_parallel_size, or experts_per_gpu * model_parallel_size * pipe_parallel_size? If it is the latter, then unchanged memory usage makes sense.

The number of experts in each transformer block is data_parallel_size * model_parallel_size.
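As a hypothetical count (factoring in the per-GPU num_experts that the earlier comments established):

```python
# Hypothetical numbers; per the reply, experts per transformer block scale
# with data_parallel_size * model_parallel_size. Pipeline stages hold
# different blocks, so pipe_parallel_size does not multiply in here.
num_experts_per_gpu = 2
data_parallel_size = 2
model_parallel_size = 2
experts_per_block = num_experts_per_gpu * data_parallel_size * model_parallel_size
print(experts_per_block)  # 8
```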


