
Comments (6)

laekov avatar laekov commented on August 10, 2024 1

The distributed experts feature is enabled by default in fmoefy. You may want to double-check the place where you call the function.
In our experiments, we use NVIDIA V100 32GB GPUs. 12 experts are placed on each GPU; in other words, our --num-expert is set to 12.
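
As a rough illustration of the numbers above: if --num-expert is the number of experts hosted on each worker, the global expert count is that value multiplied by the number of workers. The sketch below assumes 8 workers (the 8-GPU setup described in the question in this thread); the function name is hypothetical and not part of FastMoE.

```python
def total_experts(num_expert_per_worker: int, world_size: int) -> int:
    """Global expert count when every worker hosts the same number of experts."""
    if num_expert_per_worker < 1 or world_size < 1:
        raise ValueError("both arguments must be positive")
    return num_expert_per_worker * world_size


if __name__ == "__main__":
    # 12 experts per GPU; the 8 GPUs are an assumption taken from the question.
    print(total_experts(12, 8))
```

Under the same reading, passing 32 as --num-expert on 8 GPUs would already correspond to more experts in total than the configuration described above.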

from fastmoe.

laekov avatar laekov commented on August 10, 2024

This may be because you did not initialize NCCL. Could you please provide a minimal script that reproduces the error? Thx.
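
For reference, a minimal sketch of what initializing NCCL usually looks like in a PyTorch script. It is not taken from FastMoE; the helper names are hypothetical, and it assumes the default env:// rendezvous, which reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment.

```python
import os

# Variables read by torch.distributed's default env:// initialization.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_dist_env(env=None):
    """Return the env:// rendezvous variables that are not set."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def init_nccl(local_rank: int) -> None:
    """Bind this process to one GPU and create the NCCL process group."""
    # Imported lazily so the environment check above works without PyTorch.
    import torch
    import torch.distributed as dist

    missing = missing_dist_env()
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # default init_method is env://
```

Calling init_nccl once per process, before any model that uses distributed experts is built, is the usual order; how local_rank is obtained depends on the launcher.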


BinHeRunning avatar BinHeRunning commented on August 10, 2024

> This may be because you did not initialize nccl. Can you please provide a minimum script that can reproduce the error? Thx.

  1. Download the NCCL package from https://github.com/NVIDIA/nccl/archive/refs/tags/v2.8.3-1.tar.gz

  2. Build and install NCCL as described in the NCCL repository

    $ cd nccl
    $ make -j src.build

    $ sudo apt install build-essential devscripts debhelper fakeroot
    $ make pkg.debian.build
    $ ls build/pkg/deb/

    $ sudo dpkg -i build/pkg/deb/libnccl2_2.8.3-1+cuda10.1_amd64.deb
    $ sudo dpkg -i build/pkg/deb/libnccl-dev_2.8.3-1+cuda10.1_amd64.deb

  3. Run apt search nccl, which shows that NCCL 2.8.3 is installed

    Sorting... Done
    Full Text Search... Done
    libhttpasyncclient-java/bionic 4.1.3-1 all
    HTTP/1.1 compliant asynchronous HTTP agent implementation

    libnccl-dev/now 2.8.3-1+cuda10.1 amd64 [installed,local]
    NVIDIA Collective Communication Library (NCCL) Development Files

    libnccl2/now 2.8.3-1+cuda10.1 amd64 [installed,local]
    NVIDIA Collective Communication Library (NCCL) Runtime

    libpuppetlabs-http-client-clojure/bionic 0.9.0-1 all
    Clojure wrapper around libhttpasyncclient-java

    libvncclient1/bionic-security,bionic-updates 0.9.11+dfsg-1ubuntu1.4 amd64
    API to write one's own VNC server - client library

    libvncclient1-dbg/bionic-security,bionic-updates 0.9.11+dfsg-1ubuntu1.4 amd64
    debugging symbols for libvncclient

    python-ncclient/bionic 0.5.3-4 all
    Python library for NETCONF clients (Python 2)

    python-ncclient-doc/bionic 0.5.3-4 all
    Documentation for python-ncclient (Python library for NETCONF clients)

    python3-ncclient/bionic 0.5.3-4 all
    Python library for NETCONF clients (Python 3)

  4. Installed FastMoE using USE_NCCL=1 python setup.py install
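
The check in step 3 can be sketched as a small script that picks the installed NCCL packages out of the apt search output and reports the NCCL and CUDA versions encoded in the package version string. The function name and the regular expression are illustrative assumptions based only on the line format shown above.

```python
import re

# Matches lines such as
#   libnccl2/now 2.8.3-1+cuda10.1 amd64 [installed,local]
_NCCL_LINE = re.compile(
    r"^(libnccl[\w-]*)/\S+\s+(\d+\.\d+\.\d+)-\d+\+cuda(\d+\.\d+)\s+\S+\s+\[installed"
)


def installed_nccl_packages(apt_output: str):
    """Return (package, nccl_version, cuda_version) for each installed NCCL package."""
    found = []
    for line in apt_output.splitlines():
        match = _NCCL_LINE.match(line.strip())
        if match:
            found.append(match.groups())
    return found
```

The CUDA suffix reported this way (10.1 in the output above) can then be compared by hand with the CUDA version of the PyTorch build in use.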


BinHeRunning avatar BinHeRunning commented on August 10, 2024

> This may be because you did not initialize nccl. Can you please provide a minimum script that can reproduce the error? Thx.

"The repository is currently tested with PyTorch v1.8.0 and CUDA 10, with designed compatibility to older versions."
Could you provide a Docker image with CUDA 10?


Sengxian avatar Sengxian commented on August 10, 2024

We built a Docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have verified that FastMoE can be installed in it directly with the distributed expert feature enabled.

It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708


BinHeRunning avatar BinHeRunning commented on August 10, 2024

> We built a docker image with PyTorch 1.8.0, CUDA 10.2, NCCL 2.7.8 and we have tested this image that it can be used directly to install FastMoE with distributed expert feature.
>
> It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708

Thanks for the docker image.

I installed FastMoE with USE_NCCL=1, but when I run GPT-2 (L12-H768, intermediate size 1536, top-2) on an 8-GPU machine, the expert number can only be increased to 32, whereas the FastMoE paper reports 96 experts.

When I increase the expert number to 48 (batch size per GPU: 1), a CUDA OOM error occurs.

It seems that the distributed expert feature was not activated. Do you have any suggestions?
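
A back-of-envelope sketch of why a larger expert number runs out of memory, if --num-expert counts the experts held on each GPU (as the maintainer's reply in this thread indicates). Everything here is an assumption for illustration: each expert is taken to be two linear layers with biases, all 12 layers are taken to carry an MoE block, and weights, gradients and two Adam moment buffers are all stored in fp32. Activations and non-expert parameters are ignored, and the function names are hypothetical.

```python
def expert_params(d_model: int, d_hidden: int) -> int:
    """Parameters of one expert: two linear layers with biases (an assumption)."""
    return d_model * d_hidden + d_hidden + d_hidden * d_model + d_model


def expert_bytes_per_gpu(num_expert, num_layers, d_model, d_hidden,
                         bytes_per_param=4, copies=4):
    """Bytes for local expert weights, gradients and optimizer state.

    copies=4 assumes fp32 weights + gradients + two Adam moment buffers.
    """
    params = num_expert * num_layers * expert_params(d_model, d_hidden)
    return params * bytes_per_param * copies


if __name__ == "__main__":
    # L12-H768 with intermediate size 1536, as in the configuration above.
    for n in (12, 32, 48):
        gib = expert_bytes_per_gpu(n, 12, 768, 1536) / 2**30
        print(f"{n} experts per GPU: about {gib:.1f} GiB")
```

The estimate grows linearly with the per-GPU expert number, so it is only a lower bound on what the GPU must hold.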
