Comments (18)
Understood. If moe_group and dp_group are equal, the pipeline can indeed be packed more tightly. The error is shown below.
Here are the groups that each rank prints:
mp_ranks: [0, 1]
mp_ranks: [0, 1]
mp_ranks: [2, 3]
mp_ranks: [2, 3]
mp_ranks: [4, 5]
mp_ranks: [4, 5]
mp_ranks: [6, 7]
mp_ranks: [6, 7]
dp_ranks: [0, 2]
dp_ranks: [0, 2]
dp_ranks: [1, 3]
dp_ranks: [1, 3]
dp_ranks: [4, 6]
dp_ranks: [4, 6]
dp_ranks: [5, 7]
dp_ranks: [5, 7]
moe_ranks: [0, 1, 2, 3]
moe_ranks: [0, 1, 2, 3]
moe_ranks: [0, 1, 2, 3]
moe_ranks: [0, 1, 2, 3]
moe_ranks: [4, 5, 6, 7]
moe_ranks: [4, 5, 6, 7]
moe_ranks: [4, 5, 6, 7]
moe_ranks: [4, 5, 6, 7]
The error is:
NCCL WARN Send : invalid root 4 (root should be in the 0..4 range)
Tracing it back points to line 18 of cuda/global_exchange.h, the function fmoe_cuda_expert_exchange_impl; the peer rank goes out of range during send or recv.
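A plausible reading of the error (my assumption, consistent with the diagnosis later in this thread but not confirmed here): NCCL addresses peers inside a communicator by communicator-local rank in [0, group_size), not by global rank, so a peer index of 4 in a communicator of size 4 is rejected. A minimal sketch of the required mapping (the helper name is hypothetical, not FastMoE's API):

```python
def group_local_rank(global_rank, group_ranks):
    """Translate a global rank to its 0-based index inside a process group.

    NCCL collectives and point-to-point ops on a sub-communicator index
    peers by this local rank, so a global rank must be translated first.
    """
    if global_rank not in group_ranks:
        raise ValueError(f"rank {global_rank} is not in group {group_ranks}")
    return group_ranks.index(global_rank)

# A moe_group of size 4: valid peer indices are 0..3.
moe_group = [4, 5, 6, 7]
print(group_local_rank(6, moe_group))  # -> 2
# Using the global rank 4 directly as a peer index would be rejected,
# matching "invalid root 4 (root should be in the 0..4 range)".
```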
from fastmoe.
mp_group assumes that the inputs of all processes in the group are duplicated; inside the mp_group the input is sliced, so it does not significantly reduce memory usage. Splitting the experts to reduce memory consumption is still under development.
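A rough sketch of what "duplicated input, sliced inside the group" means (illustrative only; the helper and the sliced dimension are my assumptions, not FastMoE's actual implementation):

```python
import numpy as np

def mp_slice(x, mp_rank, mp_size):
    """Each mp rank starts from the SAME full input (duplicated across
    the group) and keeps only its shard along the first dimension."""
    shard = x.shape[0] // mp_size
    return x[mp_rank * shard:(mp_rank + 1) * shard]

full = np.arange(8).reshape(8, 1)   # the same tensor exists on every rank
shard0 = mp_slice(full, 0, 2)       # rank 0 keeps rows 0..3
shard1 = mp_slice(full, 1, 2)       # rank 1 keeps rows 4..7
# Every rank still had to hold `full` first, which is why this slicing
# alone does not significantly reduce peak memory.
```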
from fastmoe.
So the current version does not help much with reducing memory usage? The introduction says different experts are placed on different GPUs via model parallelism.
One more question: is pipeline parallelism over moe_groups supported, and would it help reduce memory usage?
from fastmoe.
Placing different experts on different GPUs does not make the memory footprint smaller than that of a dense model, but the total model size grows with the number of GPUs.
Pipeline parallelism is supported in PR #59 and can reduce memory usage.
from fastmoe.
Got it. num_experts in FMoETransformerMLP is the number of experts per GPU; I had misunderstood, thanks. So the final compute grid is partitioned as mp_groups * np_groups * moe_groups, right?
To confirm: with 8 processes (2*2*2), mp_groups is [0,1][2,3][4,5][6,7], np_groups is [0,2][1,3][4,6][5,7], and moe_groups is [0,1,2,3][4,5,6,7]?
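The 2*2*2 layout described above can be checked with a short script (my own sketch of the grouping rule discussed in this thread, not FastMoE code):

```python
def make_groups(mp, dp, pipe):
    """Build mp/dp/moe groups for an (mp, dp, pipe) process grid,
    following the layout discussed in this thread."""
    world = mp * dp * pipe
    stage = world // pipe  # ranks per pipeline stage
    # mp groups: consecutive runs of mp ranks
    mp_groups = [list(range(j * mp, (j + 1) * mp)) for j in range(world // mp)]
    # dp groups: same mp index within one stage, strided by mp
    dp_groups = [[k * stage + j + i * mp for i in range(dp)]
                 for k in range(pipe) for j in range(mp)]
    # moe groups: all ranks of one pipeline stage
    moe_groups = [list(range(k * stage, (k + 1) * stage)) for k in range(pipe)]
    return mp_groups, dp_groups, moe_groups

mp_g, dp_g, moe_g = make_groups(2, 2, 2)
print(mp_g)   # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(dp_g)   # [[0, 2], [1, 3], [4, 6], [5, 7]]
print(moe_g)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```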
from fastmoe.
What is np_group?
from fastmoe.
Sorry, that was a typo. I meant dp_group.
from fastmoe.
Without a pipeline, dp_group should be [0,2,4,6][1,3,5,7], and moe_group should be [0-7].
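The no-pipeline case can be checked the same way (my sketch; with 8 ranks and mp=2, no pipeline means dp=4 and pipe=1):

```python
mp, dp, pipe = 2, 4, 1
world = mp * dp * pipe
stage = world // pipe  # one stage holding all 8 ranks

# dp groups: same mp index within the stage, strided by mp
dp_groups = [[k * stage + j + i * mp for i in range(dp)]
             for k in range(pipe) for j in range(mp)]
# moe groups: all ranks of the (single) stage
moe_groups = [list(range(k * stage, (k + 1) * stage)) for k in range(pipe)]

print(dp_groups)   # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(moe_groups)  # [[0, 1, 2, 3, 4, 5, 6, 7]]
```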
from fastmoe.
OK. And what if dp_group = mp_group = moe_group = 2? My understanding is that the moe_group is then the pipeline.
from fastmoe.
If the process grid is 2*2*2, mp_group is [0,1][2,3][4,5][6,7] and dp_group is [0,2][1,3][4,6][5,7]. Based on the moe_groups setup in megatron/layers.py, it should be [0,1,2,3][4,5,6,7], but that setting raises an NCCL error, and setting it to [0,4][1,5][2,6][3,7] also raises an NCCL error. Is there a reference for setting up this grid? I configured it as follows:
for j in range(args.world_size // args.mp_size):
    ranks = [j * args.mp_size + i for i in range(args.mp_size)]
    mp_group_tmp = torch.distributed.new_group(ranks, backend="nccl")
    if args.rank in ranks:
        mp_group = mp_group_tmp

for k in range(args.pipeline_model_parallel_size):
    for j in range(args.mp_size):
        ranks = [
            i * args.mp_size + j + k * (args.world_size // args.pipeline_model_parallel_size)
            for i in range(args.world_size // (args.mp_size * args.pipeline_model_parallel_size))
        ]
        dp_group_tmp = torch.distributed.new_group(ranks, backend="nccl")
        if args.rank in ranks:
            dp_group = dp_group_tmp

stage_size = args.world_size // args.pipeline_model_parallel_size
for j in range(0, stage_size):
    ranks = [i * stage_size + j for i in range(args.pipeline_model_parallel_size)]
    moe_group_tmp = torch.distributed.new_group(ranks, backend="nccl")
    if args.rank in ranks:
        moe_group = moe_group_tmp
from fastmoe.
First, moe_group is consumed by fastmoe and is created by fastmoe's Megatron compatibility layer. In Megatron, a pipeline group contains all workers along one pipeline, whereas the moe_group we need contains all workers in the same pipeline stage, so we create our own moe_group in the Megatron compatibility layer.
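To make the distinction concrete, here is my own illustration for the 2*2*2 grid: Megatron's pipeline group ties together one worker per stage, while the moe_group ties together all workers of one stage.

```python
mp, dp, pipe = 2, 2, 2
world = mp * dp * pipe
stage = world // pipe  # 4 ranks per pipeline stage

# Megatron pipeline groups: the same position taken across all stages
pipeline_groups = [[r + k * stage for k in range(pipe)] for r in range(stage)]
# fastmoe moe groups: all ranks within a single stage
moe_groups = [list(range(k * stage, (k + 1) * stage)) for k in range(pipe)]

print(pipeline_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(moe_groups)       # [[0, 1, 2, 3], [4, 5, 6, 7]]
```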
Setting moe_groups to [0,1,2,3][4,5,6,7] should be correct. Could you share the exact NCCL error?
By the way, in v0.3.0 we plan to make moe_group identical to dp_group, and then do model parallelism over the experts within mp_group.
from fastmoe.
I suspect the world_size passed to the FMoE class is wrong; please check. world_size should be exactly the size of the moe_group.
from fastmoe.
So the world_size passed to DistributedGroupedDataParallel is model_parallel_size * data_parallel_size * pipe_parallel_size, while the world_size for FMoETransformerMLP is model_parallel_size * data_parallel_size?
from fastmoe.
DistributedGroupedDataParallel does not take a world_size argument, does it?
from fastmoe.
Hah, sorry, I misremembered. After changing it to world_size / pipe_parallel_size it runs now. However, with pipe=2 the memory usage still has not dropped.
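Assuming, as suggested earlier in the thread, that FMoE's world_size must equal the moe_group size, the arithmetic for this grid works out as follows (just the numbers, not FastMoE code):

```python
mp, dp, pipe = 2, 2, 2
world = mp * dp * pipe            # 8 processes in total
fmoe_world_size = world // pipe   # size of each moe_group = mp * dp
print(fmoe_world_size)  # -> 4
```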
from fastmoe.
As long as the dp count is unchanged, memory usage should stay roughly the same.
from fastmoe.
Is the total number of experts in the model experts_per_gpu * model_parallel_size, or experts_per_gpu * model_parallel_size * pipe_parallel_size? If it is the latter, the unchanged memory usage makes sense.
from fastmoe.
The number of experts per transformer block is data_parallel_size * model_parallel_size.
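So, for example, with 2 experts per GPU on the 2*2*2 grid (illustrative arithmetic; the per-GPU count is my assumption):

```python
experts_per_gpu = 2            # num_experts in FMoETransformerMLP
mp, dp, pipe = 2, 2, 2
# pipe does not multiply in: pipeline stages split layers, not experts
experts_per_block = experts_per_gpu * dp * mp
print(experts_per_block)  # -> 8
```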
from fastmoe.