Comments (1)
I want to close this ticket with a couple of comments on how I resolved a whole bunch of issues on the AMD Radeon RX 7900 XTX.
1.) The issue above was caused by a mismatch between PyTorch builds. One was compiled against ROCm 5.6 while the other was compiled against ROCm 6.0 (even though ROCm itself had been upgraded to 6.0 on both).
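A quick way to spot this kind of mismatch is to print the ROCm/HIP version each PyTorch build was compiled against and compare the output across nodes. The snippet below is a minimal sketch: `torch.version.hip` is the real attribute (it is `None` on non-ROCm builds), while the helper name `rocm_major_minor` is made up for illustration.

```python
import re


def rocm_major_minor(hip_version):
    """Extract (major, minor) from a HIP version string, or None if absent."""
    if not hip_version:
        return None
    match = re.match(r"(\d+)\.(\d+)", hip_version)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        # Run this on every node; the major.minor values should agree.
        print("torch:", torch.__version__)
        print("built against ROCm/HIP:", torch.version.hip)
        print("major.minor:", rocm_major_minor(torch.version.hip))
```

If the two nodes report different major.minor values, reinstall PyTorch on one of them so that both use a build for the same ROCm release.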
2.) The master node was throwing a SIGSEGV error, which showed up in /var/log/syslog as something happening with librocm/libhip. Turns out you absolutely have to set export PYTORCH_ROCM_ARCH="gfx1100" for this SIGSEGV error to go away. Consider setting export HSA_OVERRIDE_GFX_VERSION=11.0.0 as well. These will resolve the SIGSEGV on the master node.
3.) Worker node still had a whole bunch of NCCL (RCCL) related issues. I found out that setting the following environment variables will clear all of those.
export LOGLEVEL=DEBUG
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
if [ "$SLURM_PROCID" = "0" ]; then
    # Change these to your actual ethernet interfaces (run ifconfig to list them)
    export NCCL_SOCKET_IFNAME="enp74s0"
else
    export NCCL_SOCKET_IFNAME="enp5s0"
fi
export PYTORCH_ROCM_ARCH="gfx1100"
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0
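The debug flags, obviously, are there to get more information into the error logs. Also, note that the backend for torchrun (the rendezvous backend, e.g. c10d) is not the same thing as the backend passed to init_process_group (the communication backend, e.g. nccl)!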
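Because the interface names in the script are specific to each machine, it can help to confirm that whatever NCCL_SOCKET_IFNAME is set to actually exists on the node before launching. The following is a rough sketch using only the Python standard library; the helper name `missing_interfaces` is hypothetical, and it only handles plain, comma-separated exact names (NCCL itself also accepts prefixes and the `^` / `=` modifiers, which this check ignores).

```python
import os
import socket


def missing_interfaces(requested, available):
    """Return the requested interface names that are not in `available`."""
    return [name for name in requested if name not in available]


if __name__ == "__main__":
    # Interfaces known to the kernel on this node (Unix only).
    available = [name for _, name in socket.if_nameindex()]
    requested = [
        name
        for name in os.environ.get("NCCL_SOCKET_IFNAME", "").split(",")
        if name
    ]
    missing = missing_interfaces(requested, available)
    if missing:
        print("not found on this node:", ", ".join(missing))
        print("available interfaces:", ", ".join(available))
    else:
        print("NCCL_SOCKET_IFNAME looks consistent with this node")
```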
You should now be able to test dist.send(), dist.recv(), and dist.barrier()! Congratulations!
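For reference, a minimal smoke test of those three calls might look like the sketch below. It assumes two processes launched with torchrun (which sets RANK and WORLD_SIZE), one GPU per node as implied by HIP_VISIBLE_DEVICES=0, and the "nccl" communication backend, which ROCm builds of PyTorch serve through RCCL. The helper `read_rank_env` and the tensor contents are made up for illustration.

```python
import os


def read_rank_env(env):
    """Return (rank, world_size) from the variables torchrun sets."""
    return int(env["RANK"]), int(env["WORLD_SIZE"])


def main():
    # Imported here so the helper above is usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    rank, world_size = read_rank_env(os.environ)
    if world_size < 2:
        raise RuntimeError("this test needs at least two processes")

    # Communication backend; rank and world size come from the environment.
    dist.init_process_group(backend="nccl")

    # Assumes a single visible GPU per node, so device index 0 is used.
    device = torch.device("cuda", 0)
    torch.cuda.set_device(device)

    if rank == 0:
        payload = torch.arange(4, dtype=torch.float32, device=device)
        dist.send(payload, dst=1)
        print(f"rank 0 sent {payload.tolist()}")
    elif rank == 1:
        payload = torch.empty(4, dtype=torch.float32, device=device)
        dist.recv(payload, src=0)
        print(f"rank 1 received {payload.tolist()}")

    # Every rank waits here until all ranks have arrived.
    dist.barrier()
    dist.destroy_process_group()


# Only run the distributed part when launched by torchrun (RANK is set).
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

Saved under a name of your choosing (say p2p_check.py), this could be started on each node with something like torchrun --nnodes=2 --nproc_per_node=1 --rdzv-backend=c10d --rdzv-endpoint=HOST:PORT p2p_check.py, where HOST:PORT is a placeholder for your master node's address and a free port.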