Hi, I installed fastmoe using USE_NCCL=1

[NCCL Error] Enable distributed expert feature about fastmoe HOT 6 CLOSED

laekov commented on August 10, 2024

[NCCL Error] Enable distributed expert feature

from fastmoe.

Comments (6)

laekov commented on August 10, 2024 1

The distributed experts feature is by default enabled in fmoefy. You may double check the place where you call the function.
In our experiment, we use NVIDIA V100 32GB GPUs. 12 experts are placed on each GPU. In other words, our --num-expert is set to 12.
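
To make the arithmetic explicit: if --num-expert counts experts per worker, as described above, the global expert count is that value multiplied by the number of GPUs. A minimal Python sketch follows. The function names are illustrative only and are not part of FastMoE.

```python
def total_experts(experts_per_worker: int, world_size: int) -> int:
    """Global expert count when every worker (GPU) hosts `experts_per_worker` experts."""
    if experts_per_worker <= 0 or world_size <= 0:
        raise ValueError("both arguments must be positive")
    return experts_per_worker * world_size


def per_worker_setting(target_total: int, world_size: int) -> int:
    """Per-worker value needed to reach `target_total` experts across `world_size` GPUs."""
    if target_total <= 0 or world_size <= 0:
        raise ValueError("both arguments must be positive")
    if target_total % world_size != 0:
        raise ValueError("target_total must be divisible by world_size")
    return target_total // world_size


if __name__ == "__main__":
    print(total_experts(12, 8))        # 12 experts per GPU on 8 GPUs
    print(per_worker_setting(96, 8))   # per-GPU value for 96 experts on 8 GPUs
```

Under the same reading, passing 32 on an 8-GPU machine would mean 256 experts in total rather than 32.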


laekov commented on August 10, 2024

This may be because you did not initialize nccl. Can you please provide a minimum script that can reproduce the error? Thx.
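
For reference, a minimal sketch of what initializing NCCL usually looks like in a plain PyTorch script. It assumes one process per GPU and a launcher that exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT. The helper names are hypothetical and this is not FastMoE's own initialization code.

```python
import os


def read_dist_env(env=None):
    """Collect the rendezvous settings that a distributed launcher is assumed to export."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", 29500)),
    }


def init_nccl(env=None):
    """Initialize the default process group with the NCCL backend."""
    import torch  # imported lazily so read_dist_env works without PyTorch
    import torch.distributed as dist

    cfg = read_dist_env(env)
    torch.cuda.set_device(cfg["local_rank"])  # bind this process to one GPU
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{cfg['master_addr']}:{cfg['master_port']}",
        rank=cfg["rank"],
        world_size=cfg["world_size"],
    )
    return cfg


def smoke_test():
    """Sum a tensor of ones across ranks; every rank should end up with world_size."""
    import torch
    import torch.distributed as dist

    cfg = init_nccl()
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)  # default reduction is SUM
    print(f"rank {cfg['rank']}: all_reduce -> {t.item()}")
    dist.destroy_process_group()
```

Calling smoke_test() from every process checks that the process group is up before any FastMoE layer is constructed. If it already fails here, the problem lies in the NCCL or PyTorch setup rather than in FastMoE.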


BinHeRunning commented on August 10, 2024

This may be because you did not initialize nccl. Can you please provide a minimum script that can reproduce the error? Thx.

I downloaded the NCCL package from https://github.com/NVIDIA/nccl/archive/refs/tags/v2.8.3-1.tar.gz
and built and installed NCCL as described in the NCCL repository:

$ cd nccl
$ make -j src.build

$ sudo apt install build-essential devscripts debhelper fakeroot
$ make pkg.debian.build
$ ls build/pkg/deb/

$ dpkg -i libnccl2_2.8.3-1+cuda10.1_amd64.deb
$ dpkg -i libnccl-dev_2.8.3-1+cuda10.1_amd64.deb
Running apt search nccl shows that NCCL 2.8.3 is installed:

Sorting... Done
Full Text Search... Done
libhttpasyncclient-java/bionic 4.1.3-1 all
HTTP/1.1 compliant asynchronous HTTP agent implementation

libnccl-dev/now 2.8.3-1+cuda10.1 amd64 [installed,local]
NVIDIA Collective Communication Library (NCCL) Development Files

libnccl2/now 2.8.3-1+cuda10.1 amd64 [installed,local]
NVIDIA Collective Communication Library (NCCL) Runtime

libpuppetlabs-http-client-clojure/bionic 0.9.0-1 all
Clojure wrapper around libhttpasyncclient-java

libvncclient1/bionic-security,bionic-updates 0.9.11+dfsg-1ubuntu1.4 amd64
API to write one's own VNC server - client library

libvncclient1-dbg/bionic-security,bionic-updates 0.9.11+dfsg-1ubuntu1.4 amd64
debugging symbols for libvncclient

python-ncclient/bionic 0.5.3-4 all
Python library for NETCONF clients (Python 2)

python-ncclient-doc/bionic 0.5.3-4 all
Documentation for python-ncclient (Python library for NETCONF clients)

python3-ncclient/bionic 0.5.3-4 all
Python library for NETCONF clients (Python 3)
Then I installed fastmoe using USE_NCCL=1 python setup.py install
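
One thing worth checking after such an install is which NCCL version PyTorch itself reports, since that can differ from the system libnccl packages listed above. The sketch below is a hedged example: the helper names are hypothetical, while torch.cuda.nccl.version() and torch.distributed.is_nccl_available() are existing PyTorch calls. Depending on the PyTorch release, the version comes back either as an integer code or as a tuple, so the decoder handles both.

```python
def decode_nccl_version(code):
    """Turn an NCCL version code (int) or tuple into a (major, minor, patch) tuple."""
    if isinstance(code, tuple):
        return tuple(code[:3])
    if code < 10000:
        # Encoding used up to NCCL 2.8: major*1000 + minor*100 + patch
        return (code // 1000, (code % 1000) // 100, code % 100)
    # Encoding used from NCCL 2.9 on: major*10000 + minor*100 + patch
    return (code // 10000, (code % 10000) // 100, code % 100)


def report_nccl():
    """Print what the installed PyTorch reports about NCCL (needs a CUDA build of PyTorch)."""
    import torch  # imported lazily so decode_nccl_version works without PyTorch
    import torch.distributed as dist

    print("NCCL backend available:", dist.is_nccl_available())
    print("NCCL version seen by PyTorch:", decode_nccl_version(torch.cuda.nccl.version()))
```

With this encoding, the code 2803 corresponds to the 2.8.3 packages installed above. A mismatch between the version PyTorch reports and the system library is one possible source of NCCL errors, though it is not confirmed as the cause here.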


BinHeRunning commented on August 10, 2024

This may be because you did not initialize nccl. Can you please provide a minimum script that can reproduce the error? Thx.

"The repository is currently tested with PyTorch v1.8.0 and CUDA 10, with designed compatibility to older versions."
Can you provide a Docker image with CUDA 10?


Sengxian commented on August 10, 2024

We built a Docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have verified that it can be used directly to install FastMoE with the distributed expert feature.

It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708


BinHeRunning commented on August 10, 2024

We built a Docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have verified that it can be used directly to install FastMoE with the distributed expert feature.

It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708

Thanks for the docker image.

I installed fastmoe using USE_NCCL=1, but when I run GPT2 (L12-H768, intermediate size 1536, top-2) on an 8-GPU machine, the largest expert number I can reach is 32. However, the FastMoE paper reports 96 experts.

When I increase the expert number to 48 (batch size per GPU: 1), a CUDA OOM error occurs.

It seems that the distributed expert feature was not activated. Do you have any suggestions?
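
A rough back-of-the-envelope estimate can help tell whether the expert count is being applied per GPU. The sketch below uses the model sizes quoted above (hidden size 768, intermediate size 1536, 12 layers) and makes several assumptions that are not stated in this thread: every layer is an MoE layer, each expert is a two-layer FFN with biases, and training keeps fp32 weights, gradients and two Adam moments (16 bytes per parameter). Activations and non-expert weights are ignored, so the result is only a lower bound.

```python
def expert_params(d_model: int, d_hidden: int) -> int:
    """Parameters of one two-layer FFN expert, including both bias vectors."""
    return 2 * d_model * d_hidden + d_hidden + d_model


def expert_memory_gib(experts_on_gpu, n_moe_layers, d_model, d_hidden, bytes_per_param=16):
    """Rough per-GPU memory (GiB) taken by expert weights and optimizer state.

    bytes_per_param=16 assumes fp32 weights, fp32 gradients and two fp32 Adam moments.
    """
    total = experts_on_gpu * n_moe_layers * expert_params(d_model, d_hidden)
    return total * bytes_per_param / 2**30


if __name__ == "__main__":
    # Compare a few candidate per-GPU expert counts under the assumptions above.
    for n in (12, 32, 48):
        print(n, "experts per GPU:", round(expert_memory_gib(n, 12, 768, 1536), 1), "GiB")
```

Under these assumptions, memory grows linearly with the number of experts held on a single GPU, so an OOM at 48 would be consistent with the setting being counted per GPU rather than across all 8 GPUs.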
