trocker commented on June 28, 2024

I want to close this ticket with a couple of comments on how I resolved a whole bunch of issues on an AMD Radeon RX 7900 XTX.

1.) The issue above was caused by mismatched PyTorch builds. One was compiled against ROCm 5.6 while the other was compiled against ROCm 6.0 (even though ROCm itself had been upgraded to 6.0 on both nodes).

2.) The master node was throwing a SIGSEGV, which showed up in /var/log/syslog as something happening with librocm/libhip. It turns out you absolutely have to set export PYTORCH_ROCM_ARCH="gfx1100" for this SIGSEGV to go away. Consider setting export HSA_OVERRIDE_GFX_VERSION=11.0.0 as well. These will resolve the SIGSEGV on the master node.

3.) The worker node still had a whole bunch of NCCL (RCCL) related issues. I found that setting the following environment variables clears all of them.

export LOGLEVEL=DEBUG
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=INFO

# Disable the InfiniBand and GPU peer-to-peer transports
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO

if [ "$SLURM_PROCID" = "0" ]; then
    # Change these to your actual Ethernet interfaces (list them with ifconfig)
    export NCCL_SOCKET_IFNAME="enp74s0"
else
    export NCCL_SOCKET_IFNAME="enp5s0"
fi

export PYTORCH_ROCM_ARCH="gfx1100"
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0
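
Since a single missing variable was enough to bring the segfault back for me, it may help to check the environment on each node before launching. Here is a minimal sketch of such a check; missing_vars is a hypothetical helper name of my own, not part of ROCm or PyTorch.

```shell
# Hypothetical helper: print the name of every listed variable
# that is unset or empty in the current shell.
missing_vars() {
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible)
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

Running missing_vars PYTORCH_ROCM_ARCH HSA_OVERRIDE_GFX_VERSION NCCL_SOCKET_IFNAME after sourcing the exports above should print nothing; any name it prints still needs to be set on that node.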

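The debug flags, obviously, are there to get more information into the error logs. Also, the backend for torchrun is different from the backend for init_process_group! torchrun takes a rendezvous backend (for example c10d), which only controls how the launcher processes find each other, while init_process_group takes the communication backend (nccl, which maps to RCCL on ROCm builds).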
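
To make the torchrun side of that distinction concrete, here is a rough sketch that only prints a launch command. The helper name torchrun_cmd, the job id job1, the endpoint, and the script name are all placeholders I made up, and one process per node is an assumption that matches HIP_VISIBLE_DEVICES=0 above.

```shell
# Hypothetical helper: print (not run) the torchrun command for one node.
# Arguments: number of nodes, rendezvous endpoint (host:port), training script.
torchrun_cmd() {
    nnodes=$1
    endpoint=$2
    script=$3
    # --rdzv_backend is the launcher's rendezvous backend (c10d here).
    # The backend string given to init_process_group() inside the script
    # is a separate setting and is not affected by this flag.
    printf 'torchrun --nnodes=%s --nproc_per_node=1 --rdzv_id=job1 --rdzv_backend=c10d --rdzv_endpoint=%s %s\n' \
        "$nnodes" "$endpoint" "$script"
}
```

For example, torchrun_cmd 2 master:29500 train.py prints the command line you would run on each of two nodes, with master:29500 and train.py swapped for your own host and script.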

You should now be able to test dist.send(), dist.recv(), and dist.barrier()! Congratulations!
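
For anyone hitting the version mismatch from point 1, a quick comparison of the PyTorch build tag on each node might catch it early. This is a sketch that assumes your wheels carry a local version suffix such as +rocm5.6 (builds from source may not); build_tag is a hypothetical helper name and the version strings in the comments are only illustrations.

```shell
# Hypothetical helper: extract the local build tag (the part after "+")
# from a PyTorch version string, e.g. "2.1.0+rocm5.6" gives "rocm5.6".
build_tag() {
    case "$1" in
        *+*) printf '%s\n' "${1#*+}" ;;   # strip everything up to the first "+"
        *)   printf 'none\n' ;;           # no local build tag present
    esac
}
```

On each node, build_tag "$(python3 -c 'import torch; print(torch.__version__)')" should print the same tag. On ROCm builds, torch.version.hip also reports the HIP version the wheel was compiled against.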

from pytorch.
