
fastmoe's Introduction

Release note | Documentation in Chinese | Slack workspace

Introduction

An easy-to-use and efficient system to support the Mixture of Experts (MoE) model for PyTorch.

Installation

Prerequisites

PyTorch with CUDA is required. The repository is currently tested with PyTorch v1.10.0 and CUDA 11.3, and is designed to be compatible with older and newer versions.

The minimum supported PyTorch version is 1.7.2 with CUDA 10. However, there are a few known issues that require manual modification of FastMoE's code for specific older dependencies.

If the distributed expert feature is enabled, NCCL with P2P communication support, typically versions >=2.7.5, is needed.

Installing

FastMoE contains a set of customized PyTorch operators, including both C and Python components. Run python setup.py install to install FastMoE and start using it for training.

A step-by-step tutorial for the installation procedure can be found here.

The distributed expert feature is enabled by default. If you want to disable it, pass environment variable USE_NCCL=0 to the setup script.

Note that an extra NCCL developer package is needed, and it has to be consistent with the NCCL version used by your PyTorch, which can be inspected by running torch.cuda.nccl.version(). The official PyTorch docker image is recommended, as the environment is well set up there. Otherwise, you can visit the download page of all NCCL versions and pick the package that suits your setup.
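For a quick sanity check, the NCCL and CUDA versions bundled with your PyTorch build can be printed from Python using standard PyTorch APIs:

import torch

# NCCL version that this PyTorch build ships with; the NCCL developer
# package you install must match it.
print(torch.cuda.nccl.version())
# CUDA version this PyTorch binary was built with, for reference.
print(torch.version.cuda)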

Usage

FMoEfy a Transformer model

Transformer is currently one of the most popular models to be extended with MoE. Using FastMoE, a Transformer-based model can be extended into an MoE model with the one-key plugin shown below.

For example, when using Megatron-LM, the following lines help you easily scale the MLP layers up to multiple experts.

model = ...

from fmoe.megatron import fmoefy
model = fmoefy(model, fmoe_num_experts=<number of experts per worker>)

train(model, ...)

A detailed tutorial to moefy Megatron-LM can be found here.

Using FastMoE as a PyTorch module

An example MoE transformer model can be seen in the Transformer-XL example. The easiest way is to replace the MLP layers with FMoE layers.
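A minimal sketch of such a drop-in replacement, assuming the FMoETransformerMLP constructor arguments num_expert, d_model and d_hidden as used in the Transformer-XL example:

import torch
from fmoe import FMoETransformerMLP

# An MoE feed-forward block with 4 experts on this worker,
# model width 1024 and per-expert hidden width 4096.
moe_mlp = FMoETransformerMLP(num_expert=4, d_model=1024, d_hidden=4096)

x = torch.randn(8, 16, 1024)   # (batch, sequence length, d_model)
y = moe_mlp(x)                 # output keeps the input shape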

Using FastMoE in Parallel

FastMoE supports multiple ways of parallel training. See the comprehensive document on parallelism for details. The two simplest ways of using FastMoE in parallel are shown below.

Data Parallel

In FastMoE's data parallel mode, both the gate and the experts are replicated on each worker. The following figure shows the forward pass of a 3-expert MoE with 2-way data parallel.

For data parallelism, no extra coding is needed. FastMoE works seamlessly with PyTorch's DataParallel or DistributedDataParallel. The only drawback of data parallelism is that the number of experts is constrained by each worker's memory.
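For instance, a model containing FMoE layers is wrapped exactly like any other module. This is standard PyTorch DDP usage, nothing FastMoE-specific; build_model below is a placeholder for your own model constructor:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes the processes were launched with torchrun / torch.distributed.launch,
# which sets the rendezvous environment variables
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model().cuda()   # build_model(): your own constructor containing FMoE layers
model = DDP(model, device_ids=[torch.cuda.current_device()])
# Train as usual: gates and experts are replicated on every worker and
# their gradients are all-reduced by DDP like ordinary parameters.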

Expert Parallel (also called Model Parallel in some previous versions)

In FastMoE's expert parallel mode, the gate network is still replicated on each worker, but the experts are placed separately across workers. Thus, at the cost of additional communication, FastMoE enjoys a large expert pool whose size is proportional to the number of workers.

The following figure shows the forward pass of a 6-expert MoE with 2-way model parallel. Note that experts 1-3 are located in worker 1 while experts 4-6 are located in worker 2.

FastMoE's expert parallel mode requires sophisticated parallel strategies that neither PyTorch nor Megatron-LM provided when FastMoE was created. The fmoe.DistributedGroupedDataParallel module is introduced to replace PyTorch's DDP module.
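A minimal sketch of the replacement, assuming the default communication groups; the exact constructor arguments and gradient synchronization details are described in the parallelism document:

import torch.distributed as dist
from fmoe.distributed import DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")
model = build_model().cuda()   # build_model(): your own constructor containing FMoE layers
# Replaces torch.nn.parallel.DistributedDataParallel: non-expert parameters
# are still synchronized across workers, while each worker keeps its own experts.
model = DistributedGroupedDataParallel(model)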

Faster Performance Features

From a PPoPP'22 paper, FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models, we have adopted techniques to make FastMoE's model parallelism much more efficient.

These optimizations are named Faster Performance Features, and they can be enabled via several environment variables. Their usage and constraints are detailed in a separate document.

Citation

For the core FastMoE system.

@article{he2021fastmoe,
      title={FastMoE: A Fast Mixture-of-Expert Training System}, 
      author={Jiaao He and Jiezhong Qiu and Aohan Zeng and Zhilin Yang and Jidong Zhai and Jie Tang},
      journal={arXiv preprint arXiv:2103.13262},
      year={2021}
}

For the faster performance features.

@inproceedings{he2022fastermoe,
    author = {He, Jiaao and Zhai, Jidong and Antunes, Tiago and Wang, Haojie and Luo, Fuwen and Shi, Shangfeng and Li, Qin},
    title = {FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models},
    year = {2022},
    isbn = {9781450392044},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3503221.3508418},
    doi = {10.1145/3503221.3508418},
    booktitle = {Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
    pages = {120–134},
    numpages = {15},
    keywords = {parallelism, distributed deep learning, performance modeling},
    location = {Seoul, Republic of Korea},
    series = {PPoPP '22}
}

Troubleshooting / Discussion

If you have any problems using FastMoE, or you are interested in getting involved in developing FastMoE, feel free to join our Slack channel.

fastmoe's People

Contributors

cbockman, cclauss, co1lin, cobalt-27, fragile-azalea, hclearner, heheda12345, heihaierr, helloworldlty, ijkilchenko, kimiyoung, laekov, lopuhin, roastduck, santurini, sengxian, serendipitycoding, stefan-it, tiagomantunes, xptree, ymjiang, yongbowin, zhang-rq, zhuzilin, zihangdai, zms1999


fastmoe's Issues

Compatible version of Megatron-LM

"FastMoE currently works with both v2.0 and v2.1 release of Megatron-LM."
May I ask where I can find the v2.0/v2.1 versions of Megatron? To my knowledge, the newest version of Megatron on PyPI is 1.15, and I can't find any information about versions in its git project either.

Installation error for PyTorch version under 1.8.0

Describe the bug
Failed to build fastmoe on PyTorch 1.7.1 due to the error located in "global_exchange.cpp".

To Reproduce
Steps to reproduce the behavior:

  1. export NCCL=1
  2. python setup.py install

Logs

FAILED: /home/zhengyi_ex/code/image_classification/build/temp.linux-x86_64-3.6/cuda/global_exchange.o
c++ -MMD -MF /home/zhengyi_ex/code/image_classification/build/temp.linux-x86_64-3.6/cuda/global_exchange.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.6m -c -c /home/zhengyi_ex/code/image_classification/cuda/global_exchange.cpp -o /home/zhengyi_ex/code/image_classification/build/temp.linux-x86_64-3.6/cuda/global_exchange.o -DFMOE_USE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /opt/conda/lib/python3.6/site-packages/torch/include/ATen/Parallel.h:149:0,
                 from /opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                 from /opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                 from /opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                 from /opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/all.h:12,
                 from /opt/conda/lib/python3.6/site-packages/torch/include/torch/extension.h:4,
                 from /home/zhengyi_ex/code/image_classification/cuda/global_exchange.cpp:3:
/opt/conda/lib/python3.6/site-packages/torch/include/ATen/ParallelOpenMP.h:84:0: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
 #pragma omp parallel for if ((end - begin) >= grain_size)
 ^
/home/zhengyi_ex/code/image_classification/cuda/global_exchange.cpp: In member function ‘ncclComm* HackNCCLGroup::getcomm(c10::Device)’:
/home/zhengyi_ex/code/image_classification/cuda/global_exchange.cpp:85:23: error: ‘c10d::OpType’ has not been declared
                 c10d::OpType::SEND,
                       ^
ninja: build stopped: subcommand failed.

Platform

  • Device: [e.g. NVIDIA P100]
  • CUDA version: [e.g. 10.2]
  • NCCL version: [e.g. 2.7.8]
  • PyTorch: 1.7.1

setup.py failed

Describe the bug
I'm using the docker image suggested in the Megatron-LM repo. I pulled the container using docker pull nvcr.io/nvidia/pytorch:20.12-py3. Inside the container, I tried to clone this repo and build fastMoE, which failed.

To Reproduce
Steps to reproduce the behavior:

  1. launch the container using docker run --gpus all -it --rm --ipc=host -v /home/me/:/home/me/ --name pytorch-moe <image_ID>
  2. Once inside the container, clone this repo and execute USE_NCCL=1 python setup.py install
  3. It fails to build global_exchange.o

Error Message
Here is a complete message dump:

running install
running bdist_egg
running egg_info
creating fastmoe.egg-info
writing fastmoe.egg-info/PKG-INFO
writing dependency_links to fastmoe.egg-info/dependency_links.txt
writing top-level names to fastmoe.egg-info/top_level.txt
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
reading manifest file 'fastmoe.egg-info/SOURCES.txt'
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/fmoe
copying fmoe/functions.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/layers.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/transformer.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/balance.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/utils.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/__init__.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/linear.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/distributed.py -> build/lib.linux-x86_64-3.8/fmoe
creating build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/layers.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/checkpoint.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/balance.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/utils.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/patch.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/__init__.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/distributed.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
creating build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/naive_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/base_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/zero_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/switch_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/noisy_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/gshard_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/swipe_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/utils.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/__init__.py -> build/lib.linux-x86_64-3.8/fmoe/gates
running build_ext
building 'fmoe_cuda' extension
creating /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8
creating /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda
Emitting ninja build file /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/6] c++ -MMD -MF /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/stream_manager.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.8 -c -c /home/edwardhu/fastmoe/cuda/stream_manager.cpp -o /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/stream_manager.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
[2/6] c++ -MMD -MF /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.8 -c -c /home/edwardhu/fastmoe/cuda/global_exchange.cpp -o /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++14
FAILED: /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/global_exchange.o 
c++ -MMD -MF /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.8 -c -c /home/edwardhu/fastmoe/cuda/global_exchange.cpp -o /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /opt/conda/lib/python3.8/site-packages/torch/include/ATen/Parallel.h:149,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/all.h:12,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/extension.h:4,
                 from /home/edwardhu/fastmoe/cuda/global_exchange.cpp:3:
/opt/conda/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:84: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
   84 | #pragma omp parallel for if ((end - begin) >= grain_size)
      | 
/home/edwardhu/fastmoe/cuda/global_exchange.cpp: In member function ‘ncclComm* HackNCCLGroup::getcomm(c10::Device)’:
/home/edwardhu/fastmoe/cuda/global_exchange.cpp:118:38: error: no matching function for call to ‘HackNCCLGroup::broadcastUniqueNCCLID(ncclUniqueId*)’
  118 |         broadcastUniqueNCCLID(&ncclID);
      |                                      ^
In file included from /home/edwardhu/fastmoe/cuda/global_exchange.cpp:101:
/opt/conda/lib/python3.8/site-packages/torch/include/c10d/ProcessGroupNCCL.hpp:509:8: note: candidate: ‘void c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, const string&, int)’
  509 |   void broadcastUniqueNCCLID(
      |        ^~~~~~~~~~~~~~~~~~~~~e.g. 1.8.0
/opt/conda/lib/python3.8/site-packages/torch/include/c10d/ProcessGroupNCCL.hpp:509:8: note:   candidate expects 4 arguments, 1 provided
[3/6] c++ -MMD -MF /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/fmoe_cuda.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.8 -c -c /home/edwardhu/fastmoe/cuda/fmoe_cuda.cpp -o /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/fmoe_cuda.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /opt/conda/lib/python3.8/site-packages/torch/include/ATen/Parallel.h:149,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/all.h:12,
                 from /opt/conda/lib/python3.8/site-packages/torch/include/torch/extension.h:4,
                 from /home/edwardhu/fastmoe/cuda/fmoe_cuda.cpp:3:
/opt/conda/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:84: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
   84 | #pragma omp parallel for if ((end - begin) >= grain_size)
      | 
[4/6] /usr/local/cuda/bin/nvcc -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.8 -c -c /home/edwardhu/fastmoe/cuda/parallel_linear.cu -o /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/parallel_linear.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++14 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_75,code=sm_75
[5/6] /usr/local/cuda/bin/nvcc -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.8 -c -c /home/edwardhu/fastmoe/cuda/local_exchange.cu -o /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/local_exchange.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++14 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_75,code=sm_75
[6/6] /usr/local/cuda/bin/nvcc -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.8 -c -c /home/edwardhu/fastmoe/cuda/balancing.cu -o /home/edwardhu/fastmoe/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++14 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_75,code=sm_75
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1549, in _run_ninja_build
    subprocess.run(
  File "/opt/conda/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "setup.py", line 38, in <module>
    setuptools.setup(
  File "/opt/conda/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/lib/python3.8/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/opt/conda/lib/python3.8/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install.py", line 67, in run
    self.do_egg_install()
  File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install.py", line 109, in do_egg_install
    self.run_command('bdist_egg')
  File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 167, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/opt/conda/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 153, in call_command
    self.run_command(cmdname)
  File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/opt/conda/lib/python3.8/distutils/command/install_lib.py", line 107, in build
    self.run_command('build_ext')
  File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
    _build_ext.build_ext.run(self)
  File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 683, in build_extensions
    build_ext.build_extensions(self)
  File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 194, in build_extensions
    self.build_extension(ext)
  File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
    _build_ext.build_extension(self, ext)
  File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
    objects = self.compiler.compile(sources,
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 503, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1261, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1565, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

Platform (docker environment)

The environment is the unmodified image nvcr.io/nvidia/pytorch:20.12-py3.

  • Device: [3 x NVIDIA V100 (16G)]
  • OS: [Ubuntu 20.04.1 LTS]
  • CUDA version: [11.3]
  • NCCL version: [2.8.3]
  • PyTorch version: [1.8.0a0+1606899]

Additional context
Building fastMoE under the official pytorch container with tag 1.9.1-cuda11.1-cudnn8-devel seems fine. Not sure whether earlier PyTorch versions are deprecated or unsupported by fastMoE.

Support a detailed install introduction

I want to try your exciting work, but I have trouble with the installation.
Could you please give me detailed setup steps, e.g. all dependencies with versions in a requirements.txt, and so on?

Error occurs when I run GPT in Megatron

Describe the bug
After I modified Megatron v2.2 according to the fmoefy-v2.2.patch and pretrained GPT, the error below occurred.

########## ERROR ##########
building GPT model ...
Traceback (most recent call last):
File "pretrain_gpt.py", line 148, in
pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
File "/home/liji09/moe_demo/megatron_moe/Megatron-LM/megatron/training.py", line 116, in pretrain
model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
File "/home/liji09/moe_demo/megatron_moe/Megatron-LM/megatron/training.py", line 276, in setup_model_and_optimizer
model = get_model(model_provider_func)
File "/home/liji09/moe_demo/megatron_moe/Megatron-LM/megatron/training.py", line 192, in get_model
model = model_provider_func()
File "pretrain_gpt.py", line 52, in model_provider
model = fmoefy(model, num_experts=4)
File "/opt/conda/lib/python3.8/site-packages/fastmoe-0.2.0-py3.8-linux-x86_64.egg/fmoe/megatron/layers.py", line 191, in fmoefy
l.mlp = MegatronMLP(args, mpu.get_model_parallel_group(), idx)
File "/opt/conda/lib/python3.8/site-packages/fastmoe-0.2.0-py3.8-linux-x86_64.egg/fmoe/megatron/layers.py", line 105, in init
super().init(
File "/opt/conda/lib/python3.8/site-packages/fastmoe-0.2.0-py3.8-linux-x86_64.egg/fmoe/transformer.py", line 66, in init
self.experts = _Expert(
File "/opt/conda/lib/python3.8/site-packages/fastmoe-0.2.0-py3.8-linux-x86_64.egg/fmoe/transformer.py", line 18, in init
self.htoh4 = FMoELinear(num_expert, d_model, d_hidden, bias=True, rank=rank)
File "/opt/conda/lib/python3.8/site-packages/fastmoe-0.2.0-py3.8-linux-x86_64.egg/fmoe/layers.py", line 34, in init
self.weight = nn.Parameter(torch.Tensor(num_expert, out_feat, in_feat))
TypeError: new(): argument 'size' must be tuple of ints, but found element of type NoneType at pos 2
./examples/pretrain_gpt.sh: line 43: --fmoefy: command not found
########### ERROR ##########

Platform

  • Device: Tesla v100
  • OS: Ubuntu
  • CUDA version: 11.1
  • NCCL version: 2.8.3

the settings of DistributedGroupedDataParallel api

In distributed mode, besides setting mp_group in DistributedGroupedDataParallel and FMoETransformerMLP, is there anything else that needs to be updated? When using multiple experts, increasing the number of mp_groups does not noticeably reduce GPU memory usage. Thanks.

The program hangs up in gpt

Describe the bug
When I run the MoE GPT in Megatron with a single device or multiple devices, the program hangs after printing 'training ...'.

Logs
The program hangs.

time (ms) | model and optimizer: 5870.73 | train/valid/test data iterators: 558.40
[after dataloaders are built] datetime: 2021-06-10 03:49:39
done with setups ...
training ...
[before the start of training step] datetime: 2021-06-10 03:49:39

Platform

  • Device: Tesla v100
  • OS: Ubuntu
  • CUDA version: 11.1
  • NCCL version: 2.8.3

When running fastmoe with model parallel, the training process hangs

Describe the bug
FastMoE is very meaningful work. We are currently trying out FastMoE, but we have run into some usage problems and hope to get your help.
When using model parallelism together with FastMoE, the program hangs.

To Reproduce
The relevant software versions are as follows:

cuda version:11.1
pytorch: 1.8.0
fastmoe: v0.2.1
Megatron: v2.2
NCCL: 2.8.3

Run the following command (a single machine with 4 V100 GPUs, 2-way data parallelism + 2-way model parallelism):

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       pretrain_gpt.py \
       --tensor-model-parallel-size 2 \
       --pipeline-model-parallel-size 2 \
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --micro-batch-size 4 \
       --global-batch-size 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --train-iters 500000 \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file gpt2-vocab.json \
       --merge-file gpt2-merges.txt \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --checkpoint-activations \
       --log-interval 100 \
       --save-interval 10000 \
       --eval-interval 1000 \
       --eval-iters 10 \
       --num-experts 4 \
       --top-k 2 \
       --fmoefy

With the command above, the program hangs immediately.
I tried modifying clip_gradients.py to remove its all_reduce operation; then the program runs for tens to hundreds of steps before hanging.
We hope to get your help with the problems above. Thanks.

Adding Expert Prototyping to FastMoE

Hi, thanks for providing an end-to-end PyTorch training framework for MoE models. We have recently implemented MoE in TensorFlow and found that categorizing experts into different groups can bring improvements in model quality. More details can be found in our paper https://arxiv.org/abs/2105.15082. I wonder if it is possible to add this feature, as FastMoE really facilitates research in sparse expert models.

Generally, this strategy categorizes experts into different groups, each of which has its own gating function for routing. It is compatible with conventional routing methods like Switch or top-2 routing, as you can set the group number to 1. We find that increasing the value of k in top-k can improve model performance, and that k top-1 routing can achieve a similar effect. It is also possible to try out more complex strategies, say k top-k' and so on.

We have a code snippet in the appendix, which may be helpful.
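Since that appendix snippet is not reproduced here, the following is a purely illustrative sketch of the k top-1 grouping idea, with hypothetical class and argument names (not the authors' code): experts are split into k groups and each token is routed to the top-1 expert of every group.

import torch
import torch.nn as nn

class GroupedTop1Gate(nn.Module):
    def __init__(self, d_model, num_expert, num_group):
        super().__init__()
        assert num_expert % num_group == 0
        self.group_size = num_expert // num_group
        # one linear gate per expert group
        self.gates = nn.ModuleList(
            nn.Linear(d_model, self.group_size) for _ in range(num_group)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        idx, score = [], []
        for g, gate in enumerate(self.gates):
            prob = gate(x).softmax(dim=-1)             # (tokens, group_size)
            top_val, top_idx = prob.max(dim=-1)        # top-1 within this group
            idx.append(top_idx + g * self.group_size)  # convert to a global expert index
            score.append(top_val)
        # k = num_group experts are selected per token, one from each group
        return torch.stack(idx, dim=-1), torch.stack(score, dim=-1)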

magic number (256) in CUDA functions

There is a magic number (256) in both CUDA functions moe_cuda_local_scatter_impl and moe_cuda_local_gather_impl. I cannot understand what it means, and I am not sure whether it is a potential bug in fastmoe. Is it related to hardware parameters?

Related code:
batch_scatter_kernel<scalar_t> <<<batch_size, 256, 0, smgr->stream(0)>>>(in_feat, d_pos, input, input_buf);

Does fastmoe support distributed training with multiple machines?

Hi there,

I have installed fastmoe, and a model with distributed experts can be trained successfully on a single machine. However, when the experts are distributed on multiple machines, the torch.distributed subprocess dies with <Signals.SIGSEGV: 11>. Have you experimented with fastmoe in distributed training across multiple machines?

many thanks~

How to load pretrained weights to FMoETransformerMLP

Hi, thank you for your great work!
Describe the bug
I'm using a T5 model from Huggingface transformers. I changed T5LayerFF() to FMoETransformerMLP() and tried to load pre-trained weights: model = T5ForConditionalGeneration.from_pretrained('t5-base'). However, the weights of T5LayerFF() are not transferred to FMoETransformerMLP(). I wonder if there is a way to do it.

To Reproduce
change the T5LayerFF() in Huggingface T5 to FMoETransformerMLP()

Logs
Some weights of the model checkpoint at t5-base were not used when initializing T5ForConditionalGeneration: ['decoder.block.7.layer.2.DenseReluDense.wo.weight', 'decoder.block.9.layer.2.DenseReluDense.wi.weight', 'encoder.block.3.layer.1.DenseReluDense.wi.weight', 'encoder.block.4.layer.1.DenseReluDense.wi.weight', 'encoder.block.10.layer.1.DenseReluDense.wi.weight', 'decoder.block.8.layer.2.DenseReluDense.wi.weight', 'decoder.block.5.layer.2.DenseReluDense.wi.weight', 'decoder.block.3.layer.2.DenseReluDense.wo.weight', 'encoder.block.8.layer.1.DenseReluDense.wo.weight', 'decoder.block.7.layer.2.DenseReluDense.wi.weight', 'encoder.block.0.layer.1.DenseReluDense.wo.weight', 'decoder.block.6.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.2.DenseReluDense.wi.weight', 'encoder.block.7.layer.1.DenseReluDense.wo.weight', 'encoder.block.6.layer.1.DenseReluDense.wi.weight', 'encoder.block.5.layer.1.DenseReluDense.wi.weight', 'encoder.block.10.layer.1.DenseReluDense.wo.weight', 'encoder.block.8.layer.1.DenseReluDense.wi.weight', 'decoder.block.6.layer.2.DenseReluDense.wi.weight', 'decoder.block.9.layer.2.DenseReluDense.wo.weight', 'decoder.block.4.layer.2.DenseReluDense.wo.weight', 'decoder.block.10.layer.2.DenseReluDense.wi.weight', 'decoder.block.11.layer.2.DenseReluDense.wi.weight', 'encoder.block.11.layer.1.DenseReluDense.wi.weight', 'encoder.block.11.layer.1.DenseReluDense.wo.weight', 'decoder.block.10.layer.2.DenseReluDense.wo.weight', 'decoder.block.1.layer.2.DenseReluDense.wo.weight', 'encoder.block.6.layer.1.DenseReluDense.wo.weight', 'encoder.block.1.layer.1.DenseReluDense.wi.weight', 'encoder.block.2.layer.1.DenseReluDense.wo.weight', 'encoder.block.9.layer.1.DenseReluDense.wi.weight', 'decoder.block.2.layer.2.DenseReluDense.wo.weight', 'encoder.block.2.layer.1.DenseReluDense.wi.weight', 'encoder.block.3.layer.1.DenseReluDense.wo.weight', 'decoder.block.2.layer.2.DenseReluDense.wi.weight', 'decoder.block.11.layer.2.DenseReluDense.wo.weight', 'encoder.block.9.layer.1.DenseReluDense.wo.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.4.layer.2.DenseReluDense.wi.weight', 'decoder.block.1.layer.2.DenseReluDense.wi.weight', 'encoder.block.5.layer.1.DenseReluDense.wo.weight', 'encoder.block.0.layer.1.DenseReluDense.wi.weight', 'decoder.block.5.layer.2.DenseReluDense.wo.weight', 'encoder.block.7.layer.1.DenseReluDense.wi.weight', 'encoder.block.4.layer.1.DenseReluDense.wo.weight', 'encoder.block.1.layer.1.DenseReluDense.wo.weight', 'decoder.block.8.layer.2.DenseReluDense.wo.weight', 'decoder.block.3.layer.2.DenseReluDense.wi.weight']

  • This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

data parallel with fmoe

Test scenarios:
(1) 1GPU, num_experts=8, batch_size=8, expert_dp_num="none", dp_rank=0, world_size=1
(2) 4GPU, num_experts=2, batch_size=2, expert_dp_num="none", dp_rank=[0,1,2,3], world_size=4

In the two scenarios above, with the same lr, the loss at each step should be about the same. However, in actual tests the loss in the second scenario drops noticeably more slowly than in the first. What could be the reason?

Problems when running the tests in the test folder after a successful installation

Hi, after the installation I ran into some problems when running the tests in the test folder, and I hope you can provide some help.
1. When running the test files in the test folder with multiple GPUs, the following problem appears:
(screenshot attached)

2. Does moe have unit-test or CI-test scripts to refer to? Some problems appeared when running the tests with multiple GPUs.
Command used: mpirun -np 2 ./test.sh test_gate.py

NCCL related problems occurred during installation

Describe the bug
NCCL issues always arise during fastmoe compilation and installation. I tried downloading other versions of the NCCL archive, including the include and lib directories, for the build to use, but still encountered the same problem. The error is as follows:

(screenshot attached)

Platform

  • Device: [NVIDIA V100]
  • OS: [centos]
  • CUDA version: [10.1]
  • NCCL version: [2.8.3]
  • PyTorch version: [1.8.0]

Ran into this CUDA error

CUDA error at /home/FMoe_test/fastmoe/cuda/stream_manager.cpp:31 code=101(cudaErrorInvalidDevice) "cudaSetDevice(device)"

Could you help me with this issue? My device info: ubuntu16.04, torch1.8.0, cuda 10.2

Installation error

Describe the bug
Failed to build fastmoe in the docker images that megatron provides.
https://ngc.nvidia.com/catalog/containers/nvidia:pytorch

To Reproduce
Steps to reproduce the behavior:
USE_NCCL=1 python setup.py install
Expected behavior
Installed successfully.
Logs
FAILED: /root/paddlejob/toyer_switch/fastmoe/build/temp.linux-x86_64-3.8/cuda/global_exchange.o
error: no matching function for call to ‘HackNCCLGroup::broadcastUniqueNCCLID(ncclUniqueId*)’
91 | broadcastUniqueNCCLID(&ncclID);
Platform

  • Device: NVIDIA V100
  • OS:Ubuntu
  • CUDA version: 11.1
  • NCCL version: 2.8.3

Confusion in the paper [FastMoE: A Fast Mixture-of-Expert Training System]

Dear authors,
I like your project and find it very useful. I am reading your paper and trying to grasp some basic concepts in MoE.
However, I have some confusion in the paper.

  1. Figure 4: should input 5 be sent to expert 2, according to the gate output? In the paper, it is sent to expert 0.
  2. Page 6: you mentioned the load balance problem. Was it solved in this project, or is it inherent in the model parallelism?

[NCCL Error] Enable distributed expert feature

Hi,

I installed fastmoe using USE_NCCL=1 python setup.py install.

When I set "expert_dp_comm" to "dp", the training process is fine. But when I set "expert_dp_comm" to "none" (i.e., each worker serves several unique expert networks), the process hits an NCCL error:

NCCL Error at /home/h/code_gpt/fmoe-package/cuda/moe_comm_kernel.cu:29 value 4

I'm looking forward to your help!

My environment:
python 1.8
nccl 2.8.3
cuda 10.1

[Roadmap] v0.2 release plan

v0.2 Release

[Feature] Load Balance

Implement various gating mechanisms from GShard and MoE for load balance.
Refer to Section 2.2 of https://arxiv.org/pdf/2006.16668.pdf and https://arxiv.org/abs/1701.06538 for details. @Sengxian

  • GShard's Expert capacity
  • GShard's Local group dispatching
  • GShard's Auxiliary loss
  • GShard's Random routing
  • MoE's noisy gate

[Enhancement] FP16 Training

  • FP16 backward time inspection (@laekov)

[Feature] MoE Model Save and Load

Model save and reload for MoE with model parallel, where experts are located in different workers (@xptree)

  • FP32
  • FP16; FP16 optimizers have additional FP32 master weights to deal with.

[Enhancement] MoELinear

  • Accept bias term in MoELinear.apply and support bias via cuBLAS's beta (@TiagoMAntunes)

Maybe later (after v0.2)

  • Support DeepSpeed, compatible with the ZeRO optimizer.

Instability While Training Transformer-XL with FastMoE

Hi there,

I have been running experiments with the FastMoE implementation on Transformer-XL; however, the gradients overflow when using mixed precision with the dynamic loss scaling option.

Here are the settings of the training:

  • Base Transformer-XL (12 layers)
  • enwik8 dataset
  • Number of experts: 16
  • fp16 and dynamic loss scaling
  • The remaining parameters are the default ones as in run_enwik8_base_moe.sh

Output:

| epoch  24 step   185600 |   3900 batches | lr 0.000139 | ms/batch 659.04 | loss  0.65 | bpc   0.93429
| epoch  24 step   185800 |   4100 batches | lr 0.000139 | ms/batch 658.89 | loss  0.65 | bpc   0.94323
Gradient overflow.  Skipping step, reducing loss scale to 16384.0
Gradient overflow.  Skipping step, reducing loss scale to 8192.0
Gradient overflow.  Skipping step, reducing loss scale to 4096.0
...
Gradient overflow.  Skipping step, reducing loss scale to 3.16e-322
Gradient overflow.  Skipping step, reducing loss scale to 1.6e-322
Gradient overflow.  Skipping step, reducing loss scale to 8e-323
Gradient overflow.  Skipping step, reducing loss scale to 4e-323
Gradient overflow.  Skipping step, reducing loss scale to 2e-323
Gradient overflow.  Skipping step, reducing loss scale to 1e-323
Gradient overflow.  Skipping step, reducing loss scale to 5e-324
Gradient overflow.  Skipping step, reducing loss scale to 0.0
Traceback (most recent call last):
ZeroDivisionError: float division by zero

Here is my system setup:

  • 4 GPUs
  • Pytorch 1.7
  • CUDA 11.1.0
  • NCCL 2.7.8

I was wondering whether you have faced a similar issue in your experiments.

Thanks!

Installation in pytorch==1.10.0 failure

I tried fastmoe under pytorch==1.10.0 but it fails. Please have a look, thanks.

Describe the bug
Installation in pytorch==1.10.0 via USE_NCCL=1 python setup.py install fails.

To Reproduce
Steps to reproduce the behavior:

  1. install pytorch via pip install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
  2. install apex via https://github.com/NVIDIA/apex#linux
  3. Run USE_NCCL=1 python setup.py install on a V100

Expected behavior
Success

Logs

running install
running bdist_egg
running egg_info
writing fastmoe.egg-info/PKG-INFO
writing dependency_links to fastmoe.egg-info/dependency_links.txt
writing top-level names to fastmoe.egg-info/top_level.txt
reading manifest file 'fastmoe.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'fmoe_cuda' extension
Emitting ninja build file /home/***/fastmoe_shen/build/temp.linux-x86_64-3.7/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF /home/***/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/global_exchange.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.7/site-packages/torch/include -I/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -c -c /home/***/fastmoe_shen/cuda/global_exchange.cpp -o /home/suntengxu/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/global_exchange.o -DFMOE_USE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/***/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/global_exchange.o 
c++ -MMD -MF /home/***/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/global_exchange.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.7/site-packages/torch/include -I/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -c -c /home/***/fastmoe_shen/cuda/global_exchange.cpp -o /home/***/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/global_exchange.o -DFMOE_USE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/home/***/fastmoe_shen/cuda/global_exchange.cpp:76:29: error: ‘c10d’ has not been declared
 class HackNCCLGroup: public c10d::ProcessGroupNCCL {
                             ^~~~
/home/***/fastmoe_shen/cuda/global_exchange.cpp:76:35: error: expected ‘{’ before ‘ProcessGroupNCCL’
 class HackNCCLGroup: public c10d::ProcessGroupNCCL {
                                   ^~~~~~~~~~~~~~~~
/home/***/fastmoe_shen/cuda/global_exchange.cpp:77:1: error: expected primary-expression before ‘public’
 public:
 ^~~~~~
/home/***/fastmoe_shen/cuda/global_exchange.cpp:77:1: error: expected ‘}’ before ‘public’
/home/***/fastmoe_shen/cuda/global_exchange.cpp:97:1: error: expected declaration before ‘}’ token
 };
 ^
[2/2] c++ -MMD -MF /home/***/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/fmoe_cuda.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.7/site-packages/torch/include -I/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -c -c /home/***/fastmoe_shen/cuda/fmoe_cuda.cpp -o /home/***/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/fmoe_cuda.o -DFMOE_USE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/***/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/fmoe_cuda.o 
c++ -MMD -MF /home/suntengxu/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/fmoe_cuda.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.7/site-packages/torch/include -I/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -c -c /home/***/fastmoe_shen/cuda/fmoe_cuda.cpp -o /home/***/fastmoe_shen/build/temp.linux-x86_64-3.7/cuda/fmoe_cuda.o -DFMOE_USE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/home/***/fastmoe_shen/cuda/fmoe_cuda.cpp:21:19: error: variable or field ‘_ensure_nccl’ declared void
 void _ensure_nccl(c10d::ProcessGroupNCCL& p, torch::Tensor t);
                   ^~~~
/home/***/fastmoe_shen/cuda/fmoe_cuda.cpp:21:19: error: ‘c10d’ has not been declared
/home/***/fastmoe_shen/cuda/fmoe_cuda.cpp:21:43: error: ‘p’ was not declared in this scope
 void _ensure_nccl(c10d::ProcessGroupNCCL& p, torch::Tensor t);
                                           ^
/home/***/fastmoe_shen/cuda/fmoe_cuda.cpp:21:43: note: suggested alternatives:
In file included from /opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/Dimname.h:3:0,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/NamedTensor.h:3,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:24,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/ATen/Tensor.h:3,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/ATen/Context.h:4,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/ATen/ATen.h:9,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
                 from /opt/conda/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                 from /home/suntengxu/fastmoe_shen/cuda/fmoe_cuda.cpp:3:
/opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/aten_interned_strings.h:962:9: note:   ‘c10::attr::p’
 _(attr, p) \
         ^
/opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/interned_strings.h:615:35: note: in definition of macro ‘DEFINE_SYMBOL’
   namespace ns { constexpr Symbol s(static_cast<unique_t>(_keys::ns##_##s)); }
                                   ^
/opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/interned_strings.h:448:3: note: in expansion of macro ‘FORALL_ATTR_BASE_SYMBOLS’
   FORALL_ATTR_BASE_SYMBOLS(_)        \
   ^~~~~~~~~~~~~~~~~~~~~~~~
/opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/interned_strings.h:616:1: note: in expansion of macro ‘FORALL_NS_SYMBOLS’
 FORALL_NS_SYMBOLS(DEFINE_SYMBOL)
 ^~~~~~~~~~~~~~~~~
/opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/aten_interned_strings.h:962:9: note:   ‘c10::attr::p’
 _(attr, p) \
         ^
/opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/interned_strings.h:615:35: note: in definition of macro ‘DEFINE_SYMBOL’
   namespace ns { constexpr Symbol s(static_cast<unique_t>(_keys::ns##_##s)); }
                                   ^
/opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/interned_strings.h:448:3: note: in expansion of macro ‘FORALL_ATTR_BASE_SYMBOLS’
   FORALL_ATTR_BASE_SYMBOLS(_)        \
   ^~~~~~~~~~~~~~~~~~~~~~~~
/opt/conda/lib/python3.7/site-packages/torch/include/ATen/core/interned_strings.h:616:1: note: in expansion of macro ‘FORALL_NS_SYMBOLS’
 FORALL_NS_SYMBOLS(DEFINE_SYMBOL)
 ^~~~~~~~~~~~~~~~~
/home/suntengxu/fastmoe_shen/cuda/fmoe_cuda.cpp:21:60: error: expected primary-expression before ‘t’
 void _ensure_nccl(c10d::ProcessGroupNCCL& p, torch::Tensor t);
                                                            ^
/home/suntengxu/fastmoe_shen/cuda/fmoe_cuda.cpp: In function ‘void pybind11_init_fmoe_cuda(pybind11::module_&)’:
/home/suntengxu/fastmoe_shen/cuda/fmoe_cuda.cpp:61:27: error: ‘_ensure_nccl’ was not declared in this scope
     m.def("ensure_nccl", &_ensure_nccl, "FastMoE ensure torch nccl comm");
                           ^~~~~~~~~~~~
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
    env=env)
  File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "setup.py", line 52, in <module>
    'build_ext': BuildExtension
  File "/opt/conda/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py", line 67, in run
    self.do_egg_install()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py", line 109, in do_egg_install
    self.run_command('bdist_egg')
  File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)
  File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/opt/conda/lib/python3.7/distutils/command/install_lib.py", line 107, in build
    self.run_command('build_ext')
  File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 735, in build_extensions
    build_ext.build_extensions(self)
  File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
    self._build_extensions_serial()
  File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
    self.build_extension(ext)
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
    _build_ext.build_extension(self, ext)
  File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
    depends=ext.depends)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 565, in unix_wrap_ninja_compile
    with_cuda=with_cuda)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1404, in _write_ninja_file_and_compile_objects
    error_prefix='Error compiling objects for extension')
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

Platform

  • Device: NVIDIA V100
  • OS: Ubuntu 18.04.5 LTS
  • CUDA version: 11.1
  • NCCL version: [e.g. 2.7.8-1]
  • PyTorch version: 1.10.0
  • gcc: 7.5.0
  • ninja version: 1.9.0 / 1.10.2


It seems like some code error, as shown below:

/home/***/fastmoe_shen/cuda/global_exchange.cpp:76:29: error: ‘c10d’ has not been declared
 class HackNCCLGroup: public c10d::ProcessGroupNCCL {
                             ^~~~
/home/***/fastmoe_shen/cuda/global_exchange.cpp:76:35: error: expected ‘{’ before ‘ProcessGroupNCCL’
 class HackNCCLGroup: public c10d::ProcessGroupNCCL {
                                   ^~~~~~~~~~~~~~~~
/home/***/fastmoe_shen/cuda/global_exchange.cpp:77:1: error: expected primary-expression before ‘public’
 public:
 ^~~~~~
/home/***/fastmoe_shen/cuda/global_exchange.cpp:77:1: error: expected ‘}’ before ‘public’
/home/***/fastmoe_shen/cuda/global_exchange.cpp:97:1: error: expected declaration before ‘}’ token
 };
/home/***/fastmoe_shen/cuda/fmoe_cuda.cpp:21:19: error: variable or field ‘_ensure_nccl’ declared void
 void _ensure_nccl(c10d::ProcessGroupNCCL& p, torch::Tensor t);
                   ^~~~
/home/***/fastmoe_shen/cuda/fmoe_cuda.cpp:21:19: error: ‘c10d’ has not been declared
/home/***/fastmoe_shen/cuda/fmoe_cuda.cpp:21:43: error: ‘p’ was not declared in this scope
 void _ensure_nccl(c10d::ProcessGroupNCCL& p, torch::Tensor t);

but it builds successfully under PyTorch 1.9.0. This is very weird. Could it be a version problem?
Any idea on how to solve this would be appreciated.
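For what it's worth, the c10d headers and namespace were reorganized around PyTorch 1.10, so a mismatch between the FastMoE sources and the installed PyTorch is a plausible cause. Below is a hedged sketch (not FastMoE's actual setup.py; the macro name FMOE_NEW_C10D_LAYOUT is hypothetical) of how a build script could branch on the installed PyTorch version:

# Hedged sketch: choose a compile-time define based on the installed PyTorch
# version, so the C++ sources can adapt to the newer c10d header/namespace layout.
import torch

major, minor = (int(x) for x in torch.__version__.split(".")[:2])

cxx_flags = []
if (major, minor) >= (1, 10):
    cxx_flags.append("-DFMOE_NEW_C10D_LAYOUT")  # hypothetical macro name
print(cxx_flags)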

[RoadMap] v0.1.0 first release plan

Functions

  • A model-injection-style, easy-to-use user interface.
  • Support both data parallel and model parallel, and a hybrid of the two, using a new customized DDP module.
  • Remove the dependency on a specific modified Megatron repository.
  • Support customized nn.Module as experts.

Document and infrastructure

  • Use PyTest instead of the current shell script, and add more tests.
  • Installation and usage guide.
  • Launching using PyTorch's native distributed launcher.
  • Explanation of functions and code structure.
  • Add the Apache v2.0 license in the last commit before releasing.

Performance

  • A well-organized benchmark

Expert capacity

Hi

I was just wondering if the FastMoE implementation uses the concept of expert capacity as described in the Switch Transformer paper. In other words, if we have 8 tokens and 4 experts, the expert capacity would be 2 (without considering the capacity factor). So in this scenario, if more than 2 tokens get assigned to a given expert, do the excess tokens get dropped as in the Switch Transformer formulation, or do they still get processed in FastMoE?
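For reference, the Switch Transformer-style expert capacity mentioned above can be computed with simple arithmetic (a plain illustration, not FastMoE code):

# Expert capacity: how many tokens each expert may keep before overflow tokens
# would be dropped in the Switch Transformer formulation.
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.0):
    return math.ceil(num_tokens / num_experts * capacity_factor)

print(expert_capacity(8, 4))       # 2 tokens per expert, matching the example above
print(expert_capacity(8, 4, 1.5))  # 3 tokens per expert with a 1.5 capacity factor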

PyPI Package

Could you push FastMoE to PyPI?
As a workaround, we're currently using git+https://github.com/laekov/fastmoe in our requirements.txt, but a proper pip integration would be very much appreciated.

Balance loss is not optimized

loss_list = [l.mlp.gate.get_loss(clear=False) for l in model.language_model.transformer.layers]
(loss, state_dict), bal_loss = (
    output,
    (
        torch.tensor(
            loss_list, device=loss_list[0].device
        ).mean()
        * args.balance_loss_weight
    ).float(),
)

torch.tensor(loss_list) will remove the grad_fn of each loss. See https://pytorch.org/docs/stable/generated/torch.tensor.html?highlight=tensor#torch.tensor:

When data is a tensor x, torch.tensor() reads out ‘the data’ from whatever it is passed, and constructs a leaf variable. Therefore torch.tensor(x) is equivalent to x.clone().detach()

We should use torch.cat instead, i.e., torch.cat([loss.unsqueeze(0) for loss in loss_list]).
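A minimal, self-contained illustration of the difference (the parameter w is a stand-in for whatever produces the per-layer gate losses):

import torch

w = torch.ones(3, requires_grad=True)
loss_list = [w[i] * 2.0 for i in range(3)]   # stand-ins for per-layer balance losses

bal_loss = torch.cat([l.unsqueeze(0) for l in loss_list]).mean()
print(bal_loss.requires_grad)                # True: grad_fn is preserved
bal_loss.backward()
print(w.grad)                                # tensor([0.6667, 0.6667, 0.6667])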

Can't find ProcessGroupNCCL.hpp

Hi all. I've installed fmoe without the USE_NCCL option successfully. However, when I turned this option on, I got the following error:

cuda/moe.cpp:112:37: fatal error: c10d/ProcessGroupNCCL.hpp: No such file or directory.

Environment:
PyTorch 1.3, CUDA 10.0, Linux

Looking forward to your advice.
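As a side note, the missing header is searched for in PyTorch's own include directories, so one hedged way to check whether the installed PyTorch actually ships it is:

# Check whether c10d/ProcessGroupNCCL.hpp exists in the torch include paths
# (it is needed when building FastMoE with the distributed feature enabled).
import os
from torch.utils.cpp_extension import include_paths

hits = [os.path.join(p, "c10d", "ProcessGroupNCCL.hpp")
        for p in include_paths()
        if os.path.exists(os.path.join(p, "c10d", "ProcessGroupNCCL.hpp"))]
print(hits or "ProcessGroupNCCL.hpp not found in the torch include paths")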

python setup.py install error

I installed nccl-repo-ubuntu1604-2.8.3-ga-cuda10.1,
but the installation reports the following error:
File "/home/pai/lib/python3.6/distutils/command/build_ext.py", line 448, in build_extensions
self._build_extensions_serial()
File "/home/pai/lib/python3.6/distutils/command/build_ext.py", line 473, in _build_extensions_serial
self.build_extension(ext)
File "/home/pai/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
_build_ext.build_extension(self, ext)
File "/home/pai/lib/python3.6/distutils/command/build_ext.py", line 533, in build_extension
depends=ext.depends)
File "/home/pai/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 524, in unix_wrap_ninja_compile
cuda_post_cflags = unix_cuda_flags(cuda_post_cflags)
File "/home/pai/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 423, in unix_cuda_flags
cflags + _get_cuda_arch_flags(cflags))
File "/home/pai/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1561, in _get_cuda_arch_flags
arch_list[-1] += '+PTX'

What could be the cause of this?

Clip-grad in Megatron does not cope with MoE

Describe the bug
In Megatron's gradient clipping, the square sum of all local parameters is used, which leads to numerical inequivalence.

Expected behavior
We should add up the squared gradients of all experts.
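A hedged sketch of the idea (not the actual Megatron/FastMoE patch): the replicated parameters contribute their squared gradient norm once, while the expert parameters, which differ on every worker, are summed across workers before the square root is taken.

import torch
import torch.distributed as dist

def global_grad_norm(shared_params, expert_params):
    # Squared gradient norm of parameters that are replicated on every worker.
    shared_sq = sum(p.grad.float().pow(2).sum() for p in shared_params if p.grad is not None)
    # Expert parameters live on different workers, so their contribution must
    # be summed over the whole process group.
    expert_sq = sum(p.grad.float().pow(2).sum() for p in expert_params if p.grad is not None)
    expert_sq = torch.as_tensor(expert_sq)
    dist.all_reduce(expert_sq)
    return (shared_sq + expert_sq).float().sqrt()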

How does FastMoE compare to the Switch Transformer

Hi there,

Thank you for releasing this repo! This is a fantastic system and will benefit the research of many of us.

Quick question: how does FastMoE compare to the Switch Transformer?

I understand that a direct comparison is not feasible because mesh-tensorflow requires TPUs. But do you have some approximate metrics?

For example, looking at Figure 7, I see that applying FastMoE only saves roughly 30% of wall-clock time to reach the final loss of the baseline. However, my impression from reading the Switch Transformer paper is that applying MoE is 6-7x faster when training a T5-base model.

I understand that there are several variables that are not controlled here. Hardware, models, training hyperparameters are all different. That said, I am curious to hear your perspective on this. What are the key factors that make FastMoE much slower?

Thank you in advance!

Improvements in validation/test ppl of the transformer-xl with MoE on wt103

Hi there,

I have been running some experiments with mixture of experts and Transformer-XL. I noticed that while the training ppl improves significantly with the inclusion of MoE, the improvement is not reflected in the validation/test perplexities.

Here is an example with transformer-xl trained on wt103 dataset (the numbers correspond to the best model saved after full training):

Baseline transformer-xl with 16-layers:

  • Training ppl: 20.48
  • Validation ppl: 22.835
  • Test ppl: 23.710

MoE transformer-xl with 16-layers and 16-experts:

  • Training ppl: 14.53
  • Validation ppl: 21.859
  • Test ppl: 22.682

While we do get a minor improvement with MoE, it seems to me that the model is overfitting when trained with MoE. I have tried adjusting the dropout rate of the expert layers as suggested by Switch Transformer, and while it brought training and evaluation performance closer, it has not led to a better test ppl. Have you faced similar issues?

setup.py install error

Describe the bug
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1673, in _run_ninja_build
    env=env)
  File "/opt/conda/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "setup.py", line 67, in <module>
    'build_ext': BuildExtension
  File "/opt/conda/lib/python3.6/site-packages/setuptools/__init__.py", line 145, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/install.py", line 67, in run
    self.do_egg_install()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/install.py", line 109, in do_egg_install
    self.run_command('bdist_egg')
  File "/opt/conda/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/bdist_egg.py", line 172, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/bdist_egg.py", line 158, in call_command
    self.run_command(cmdname)
  File "/opt/conda/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/opt/conda/lib/python3.6/distutils/command/install_lib.py", line 107, in build
    self.run_command('build_ext')
  File "/opt/conda/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 84, in run
    _build_ext.run(self)
  File "/opt/conda/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
    _build_ext.build_ext.run(self)
  File "/opt/conda/lib/python3.6/distutils/command/build_ext.py", line 339, in run
    self.build_extensions()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 708, in build_extensions
    build_ext.build_extensions(self)
  File "/opt/conda/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 194, in build_extensions
    self.build_extension(ext)
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 205, in build_extension
    _build_ext.build_extension(self, ext)
  File "/opt/conda/lib/python3.6/distutils/command/build_ext.py", line 533, in build_extension
    depends=ext.depends)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 538, in unix_wrap_ninja_compile
    with_cuda=with_cuda)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1359, in _write_ninja_file_and_compile_objects
    error_prefix='Error compiling objects for extension')
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1683, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

To Reproduce
Running python setup.py install reports the error above.

Platform

  • Device: NVIDIA V100
  • CUDA version: 10.2
  • PyTorch version: 1.8.0

The program hangs at the forward function when using model parallel in Megatron-LM

Thanks for your work! I love it very much!
I've met a problem and hope you can help me. Thanks a lot!

Platform

  • V100, single node, 8 GPUs
  • PyTorch 1.8.0
  • CUDA 11.1
  • cuDNN 8

Update

  • If pipeline-model-parallel-size = 1 (with tensor-model-parallel-size > 1), the program runs well.
  • If pipeline-model-parallel-size > 1, the program hangs.

Hi, an error occurs when I install with USE_NCCL=1:

Hi, an error occurs when I install with USE_NCCL=1:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Env:

  1. torch 1.8
  2. cuda 10.2
  3. nccl 2.7.8

Has anyone had a similar problem?

Originally posted by @BinHeRunning in #16 (comment)

About Megatron (AttributeError: module 'fmoe_cuda' has no attribute 'ensure_nccl')

Describe the bug

Error when training Megatron with switch gate.

[AttributeError: module 'fmoe_cuda' has no attribute 'ensure_nccl']

########## ERROR ##########
...
File "/opt/conda/lib/python3.8/site-packages/fastmoe-0.2.0-py3.8-linux-x86_64.egg/fmoe/functions.py", line 32, in count_by_gate
_ensure_nccl(gate)
File "/opt/conda/lib/python3.8/site-packages/fastmoe-0.2.0-py3.8-linux-x86_64.egg/fmoe/functions.py", line 16, in _ensure_nccl
fmoe_cuda.ensure_nccl(comm, t)
AttributeError: module 'fmoe_cuda' has no attribute 'ensure_nccl'
Traceback (most recent call last):
...
########### ERROR ##########

Platform

  • Device: NVIDIA V100 * 2
  • Docker: nvcr.io/nvidia/pytorch:20.12-py3
  • Version: Megatron v2.2

MoE Arguments

def _add_fmoe_args(parser):
    group = parser.add_argument_group(title='fmoe')
    group.add_argument('--num-experts', type=int, default=2, help='Num of experts')
    group.add_argument('--top-k', type=int, default=1, help='top-k experts')
    group.add_argument('--balance-strategy', default="switch",
                       choices=['naive', 'noisy', 'gshard', 'switch'],
                       help='Balance strategy for experts')
    return parser
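This error usually means that the installed fmoe_cuda extension was compiled without the distributed (NCCL) feature, e.g. because USE_NCCL=0 was set or the NCCL developer package did not match. A quick, hedged check:

# If ensure_nccl is absent, rebuild FastMoE with the distributed feature, e.g.
# `USE_NCCL=1 python setup.py install`, with a matching NCCL developer package.
import fmoe_cuda

print(hasattr(fmoe_cuda, "ensure_nccl"))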

Balance loss does not support FP16 training

if hasattr(model, 'module'):
    model = model.module
balance_dict_tensor = torch.vstack(
    [l.mlp.gate.get_loss(clear=True) for l in model.language_model.transformer.layers]
)

if hasattr(model, 'module'):
    model = model.module
loss_list = [l.mlp.gate.get_loss(clear=False).view(1)
             for l in model.language_model.transformer.layers]

The above code does not handle the case of FP16 training, and raises the error AttributeError: 'FP16Module' object has no attribute 'language_model'. For FP16 training, it should be something like:

 while hasattr(model, 'module'): 
     model = model.module 

Here, model.module is an FP16Module and model.module.module is the actual Megatron model.

How can I fmoefy a BERT properly?

Like what you have done in fmoefying GPT with Megatron, I tried to fmoefy a BERT model. However, the model I got didn't outperform the baseline BERT; the loss curves of BERT with and without MoE are just similar.

May I ask whether you have tried to fmoefy a BERT successfully? Do we need to make some additional modifications for this? Thanks a lot!

Adaptation guidelines for Megatron v2.4

Hi developers,

It seems that the current patch for v2.2 no longer works directly with v2.4. I tried to migrate the code line by line, but here is the error log at runtime:

Traceback (most recent call last):
  File "/root/Megatron/pretrain_gpt.py", line 189, in <module>
    args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
  File "/root/Megatron/megatron/training.py", line 124, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/root/Megatron/megatron/training.py", line 323, in setup_model_and_optimizer
    model = get_model(model_provider_func)
  File "/root/Megatron/megatron/training.py", line 269, in get_model
    for model_module in model]
  File "/root/Megatron/megatron/training.py", line 269, in <listcomp>
    for model_module in model]
TypeError: __init__() takes 2 positional arguments but 4 were given

Is there any guideline for me to fmoefy Megatron v2.4? Thanks.
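A hedged guess at the cause (not an official adaptation guide): Megatron v2.4's get_model wraps each model chunk in its DDP class with extra positional arguments, which fmoe.DistributedGroupedDataParallel does not accept, hence the TypeError. A sketch of a thin adapter that simply drops those extra arguments (the class name is hypothetical):

from fmoe import DistributedGroupedDataParallel

class FMoEMegatronDDP(DistributedGroupedDataParallel):
    # Accept and ignore the Megatron-v2.4-specific DDP options that
    # FastMoE's DDP module does not use.
    def __init__(self, module, *unused_megatron_ddp_args, **kwargs):
        super().__init__(module, **kwargs)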
