rccl-tests's Introduction

RCCL Tests

These tests check both the performance and the correctness of RCCL operations. They can be compiled against RCCL.

Build

To build the tests, just type make.

If HIP is not installed in /opt/rocm, you may specify HIP_HOME. Similarly, if RCCL is not installed in /usr, you may specify NCCL_HOME and CUSTOM_RCCL_LIB.

$ make HIP_HOME=/path/to/hip NCCL_HOME=/path/to/rccl CUSTOM_RCCL_LIB=/path/to/rccl/lib/librccl.so

RCCL tests rely on MPI to run on multiple processes and, hence, multiple nodes. If you want to compile the tests with MPI support, set MPI=1 and set MPI_HOME to the path where MPI is installed.

$ make MPI=1 MPI_HOME=/path/to/mpi HIP_HOME=/path/to/hip RCCL_HOME=/path/to/rccl

RCCL tests can also be built using cmake. A typical sequence will be:

$ mkdir build
$ cd build
$ CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_PREFIX_PATH=/path/to/rccl ..
$ make

When using the cmake build procedure, please make sure that RCCL has also been built using cmake (i.e. not using the install.sh script), since cmake looks for the cmake target and config files created during the RCCL build.
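
For reference, a minimal sketch of a cmake-based RCCL build that produces these target and config files (install paths here are placeholders; consult the RCCL documentation for the authoritative options):

$ cd /path/to/rccl
$ mkdir build && cd build
$ CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_INSTALL_PREFIX=/path/to/rccl-install ..
$ make -j 16
$ make install

The rccl-tests cmake build can then be pointed at /path/to/rccl-install via -DCMAKE_PREFIX_PATH.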

Using the cmake method also has the advantage that the build automatically checks for MPI installations, i.e. it is not necessary to explicitly request an MPI build. A user can request a particular MPI library by using the MPI_PATH variable. MPI support can be explicitly disabled by adding the -DNO_MPI=1 flag to the cmake command line.
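
For example (a hedged sketch; paths are placeholders, and the exact way MPI_PATH is consumed may depend on the project's CMakeLists):

$ CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_PREFIX_PATH=/path/to/rccl -DMPI_PATH=/path/to/mpi ..

or, to disable MPI support entirely:

$ CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_PREFIX_PATH=/path/to/rccl -DNO_MPI=1 ..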

Usage

RCCL tests can run on multiple processes, multiple threads, and multiple HIP devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument. The total number of ranks (= HIP devices) equals (number of processes) * (number of threads) * (number of GPUs per thread).
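
For example (an illustrative sketch using an MPI-enabled build; counts are placeholders): 2 MPI processes, each running 2 threads with 2 GPUs per thread, gives 2 * 2 * 2 = 8 total ranks:

$ mpirun -np 2 ./build/all_reduce_perf -t 2 -g 2 -b 8 -e 128M -f 2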

Quick examples

Run on 8 GPUs (-g 8), scanning from 8 bytes to 128 MB:

$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

Run with MPI on 10 processes (potentially on multiple nodes) with 4 GPUs each, for a total of 40 GPUs:

$ mpirun -np 10 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

For performance-oriented runs, both single-node and multi-node, we suggest using one MPI process per GPU and -g 1. A run on 8 GPUs then looks like:

$ mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

Running with 1 MPI process per GPU ensures a 1:1 mapping between CPUs and GPUs, which can be beneficial for smaller message sizes and better represents the real-world use of RCCL in deep-learning frameworks like PyTorch and TensorFlow.

Performance

See the Performance page for an explanation of the reported numbers, in particular the "busbw" column.
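
As a rough guide (this restates the standard nccl-tests performance model; the Performance page remains the authoritative reference), "algbw" is the message size divided by the measured time, and for AllReduce "busbw" scales it by the amount of data each rank actually has to move:

    busbw = algbw * 2 * (nranks - 1) / nranks

so that busbw reflects the bandwidth achieved on the hardware links and is comparable across different rank counts.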

Arguments

All tests support the same set of arguments (a combined example follows the list):

  • Number of GPUs
    • -t,--nthreads <num threads> number of threads per process. Default: 1.
    • -g,--ngpus <GPUs per thread> number of GPUs per thread. Default: 1.
  • Sizes to scan
    • -b,--minbytes <min size in bytes> minimum size to start with. Default: 32M.
    • -e,--maxbytes <max size in bytes> maximum size to end at. Default: 32M.
    • Increments can be either a fixed step or a multiplication factor. Only one of the two should be used:
      • -i,--stepbytes <increment size> fixed increment between sizes. Default: 1M.
      • -f,--stepfactor <increment factor> multiplication factor between sizes. Default: disabled.
  • RCCL operations arguments
    • -o,--op <sum/prod/min/max/avg/all> Specify which reduction operation to perform. Only relevant for reduction operations like AllReduce, Reduce or ReduceScatter. Default: Sum.
    • -d,--datatype <nccltype/all> Specify which datatype to use. Default: Float.
    • -r,--root <root/all> Specify which root to use. Only for operations with a root like Broadcast or Reduce. Default: 0.
    • -y,--memory_type <coarse/fine/host/managed> Default: Coarse.
    • -s,--stress_cycles <number of cycles> Default: 1.
    • -u,--cumask <d0,d1,d2,d3> Default: None.
  • Performance
    • -n,--iters <iteration count> number of iterations. Default: 20.
    • -w,--warmup_iters <warmup iteration count> number of warmup iterations (not timed). Default: 5.
    • -m,--agg_iters <aggregation count> number of operations to aggregate together in each iteration. Default: 1.
    • -a,--average <0/1/2/3> Report performance as an average across all ranks (MPI=1 only). <0=Rank0,1=Avg,2=Min,3=Max>. Default: 1.
  • Test operation
    • -p,--parallel_init <0/1> use threads to initialize NCCL in parallel. Default: 0.
    • -c,--check <check iteration count> perform count iterations, checking correctness of results on each iteration. This can be quite slow on large numbers of GPUs. Default: 1.
    • -z,--blocking <0/1> Make NCCL collectives blocking, i.e. have CPUs wait and sync after each collective. Default: 0.
    • -G,--cudagraph <num graph launches> Capture iterations as a CUDA graph and then replay the specified number of times. Default: 0.
    • -F,--cache_flush <cache flush after every -F iteration> Enable cache flush after every -F iteration. Default: 0 (no cache flush).
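
As a combined illustration (a hedged example; argument values are placeholders and the accepted datatype spellings may vary between versions), the following runs an fp16 AllReduce sweep from 8 bytes to 8 GB, doubling the size at each step, with 50 timed iterations after 10 warmup iterations, one GPU per MPI process, and result checking enabled:

$ mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -o sum -d half -n 50 -w 10 -c 1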

Unit tests

Unit tests for rccl-tests are implemented with pytest (python3 is also required). A few notes on the unit tests:

  1. The LD_LIBRARY_PATH environment variable will need to be set to include /path/to/rccl-install/lib/ in order to run the unit tests.
  2. The HSA_FORCE_FINE_GRAIN_PCIE environment variable will need to be set to 1 in order to run the unit tests which use fine-grained memory type.

The unit tests can be invoked from the rccl-tests root or from the test subfolder. An example invocation:

$ LD_LIBRARY_PATH=/path/to/rccl-install/lib/ HSA_FORCE_FINE_GRAIN_PCIE=1 python3 -m pytest
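
The same invocation also works from the test subfolder, optionally with pytest's verbose flag (paths are placeholders):

$ cd test
$ LD_LIBRARY_PATH=/path/to/rccl-install/lib/ HSA_FORCE_FINE_GRAIN_PCIE=1 python3 -m pytest -v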

Copyright

RCCL tests are provided under the BSD license.

All source code and accompanying documentation is copyright (c) 2016-2021, NVIDIA CORPORATION. All rights reserved.

All modifications are copyright (c) 2019 Advanced Micro Devices, Inc. All rights reserved.


rccl-tests's Issues

Test NCCL failure common.cu:1285 : internal error

Hi, when I try to run the example from the README, ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4, I'm getting this error:

# nThreads: 1 nGpus: 4 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid 117738 on sharbox-ultra device  0 [0000:63:00.0] AMD Instinct MI210
#   Rank  1 Pid 117738 on sharbox-ultra device  1 [0000:43:00.0] AMD Instinct MI210
#   Rank  2 Pid 117738 on sharbox-ultra device  2 [0000:30:00.0] AMD Instinct MI210
#   Rank  3 Pid 117738 on sharbox-ultra device  3 [0000:03:00.0] AMD Instinct MI210
sharbox-ultra: Test NCCL failure common.cu:1285 'internal error - please report this issue to the NCCL developers'
 .. sharbox-ultra pid 117738: Test failure common.cu:1161

When I tried to run the tests from the rccl build itself, I got this error output:

[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from AllReduce
[ RUN      ] AllReduce.OutOfPlace
[ INFO     ] Calling PIPE_READ to Child 0

sharbox-ultra:204802:204802 [0] /home/elias/rccl/build/release/hipify/src/init.cc:125 NCCL WARN Missing "amd_iommu=on" from kernel command line which can lead to system instablity or hang!

sharbox-ultra:204802:204802 [0] /home/elias/rccl/build/release/hipify/src/init.cc:127 NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instablity or hang!

sharbox-ultra:204802:204802 [0] /home/elias/rccl/build/release/hipify/src/init.cc:132 NCCL WARN Missing "HSA_FORCE_FINE_GRAIN_PCIE=1" from environment which can lead to low RCCL performance, system instablity or hang!
[ INFO     ] Got PIPE_READ 128 from Child 0
[ INFO     ] Calling PIPE_READ to Child 0
[ INFO     ] Got PIPE_READ 4 from Child 0
[ INFO     ] Calling PIPE_READ to Child 0
RCCL version 2.18.3+hip5.5 develop:6ecf771+

sharbox-ultra:204802:204836 [1] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:220 NCCL WARN hipIpcGetMemHandle failed : invalid argument

sharbox-ultra:204802:204836 [1] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:222 NCCL WARN Cuda failure 'invalid argument'

sharbox-ultra:204802:204836 [1] /home/elias/rccl/build/release/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 1] Failed to execute operation Setup from rank 1, retcode 1

sharbox-ultra:204802:204837 [0] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:220 NCCL WARN hipIpcGetMemHandle failed : invalid argument

sharbox-ultra:204802:204837 [0] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:222 NCCL WARN Cuda failure 'invalid argument'

sharbox-ultra:204802:204837 [0] /home/elias/rccl/build/release/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 0] Failed to execute operation Setup from rank 0, retcode 1

sharbox-ultra:204802:204835 [1] /home/elias/rccl/build/release/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer sharbox-ultra<48295>

sharbox-ultra:204802:204835 [1] /home/elias/rccl/build/release/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x7fe95066b760

sharbox-ultra:204802:204834 [0] /home/elias/rccl/build/release/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer sharbox-ultra<35867>

sharbox-ultra:204802:204834 [0] /home/elias/rccl/build/release/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x7fe958288240
[ ERROR    ] Child process 0 fails NCCL call ncclGroupEnd with code 3
[ ERROR    ] Child 0 failed on command [INIT_COMMS]:
[ INFO     ] Got PIPE_READ 4 from Child 0
[ ERROR    ] Child 0 reports failure
/home/elias/rccl/test/common/TestBed.cpp:178: Failure
Expected equality of these values:
  response
    Which is: 1
  TEST_SUCCESS
    Which is: 0
[  FAILED  ] AllReduce.OutOfPlace (665 ms)

Do you have any idea of what could be causing this crash?
It seems like the invalid arguments are the root of this issue, but I'm unsure what to do about it.

Test HIP failure common.cu:1129 'hipErrorInvalidDevice' while running test

When testing my rccl installation using rccl-tests, I get Test HIP failure common.cu:1129 'hipErrorInvalidDevice'.

My configuration:
Device: AMD Starship/Matisse Reserved SPP
Distribution: "Ubuntu 20.04.4 LTS"

Repro steps:
export HCC_AMDGPU_TARGET=gfx90a
cd home
git clone https://github.com/ROCmSoftwarePlatform/rccl-tests.git
cd rccl-tests && make
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

Any idea on what could be wrong with my environment? I was able to run the tests last week.

I really appreciate any help!

All_reduce_perf segfaults with Custom Built RCCL

Problem Description

all_reduce_perf segfaults with a custom-built RCCL. It works fine if RCCL is taken from /opt/rocm-6.0.0/lib.

Operating System

SLES15.4

CPU

AMD EPYC 7A53

GPU

AMD Instinct MI250

ROCm Version

ROCm 6.0.0

ROCm Component

rccl

Steps to Reproduce

Libfabric 1.15.2.0
RCCL was custom-built using:
CXX=hipcc cmake -DCMAKE_PREFIX_PATH=${RCCL_ROOT} -DCMAKE_INSTALL_PREFIX=${RCCL_ROOT}

AWS Libfabric
./autogen.sh
CC=hipcc ./configure --prefix=${RCCL_ROOT} --with-hip=/opt/rocm-6.0.0/ --with-rccl=$RCCL_ROOT --with-libfabric=$OFI_ROOT --prefix=$RCCL_ROOT --disable-tests --with-gdrcopy=$GDRCOPY --with-mpi=$MPI_HOME

git clone https://github.com/ROCm/rccl-tests.git
cd rccl-tests
make MPI=1 MPI_HOME=${MPI_HOME} HIP_HOME=/opt/rocm-6.0.0/ CUSTOM_RCCL_LIB=${RCCL_ROOT}/lib

all_reduce_perf segfaults at:
GDB:
#0 0x000014ca4d0a14ca in ncclTopoCompute(ncclTopoSystem*, ncclTopoGraph*) ()
from /<CUSTOM_PATH>/lib/librccl.so.1

rccl-tests gets stuck on gfx1100

I'm testing RCCL connectivity between two gfx1100 devices. The rocm-bandwidth-test is fine, but the rccl-tests get stuck.
The output is:

# nThreads: 1 nGpus: 2 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#   Rank  0 Pid  61927 on  deepspeed device  0 [0000:83:00.0] Radeon RX 7900 XTX
#   Rank  1 Pid  61927 on  deepspeed device  1 [0000:03:00.0] Radeon RX 7900 XTX
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

No further output is produced and the program does not move on.
The same problem is observed when using the PyTorch distributed package. The following code gets stuck too:

# main.py
import os
import torch
import torch.distributed as dist

dist.init_process_group('nccl')
rank = int(os.getenv('LOCAL_RANK', 0))
world = int(os.getenv('WORLD_SIZE', 1))

torch.cuda.set_device(rank)
a = torch.ones(1, device='cuda')  # tensor to broadcast from rank 0
dist.broadcast(a, 0)              # broadcast is in place; this is where it hangs

launched with:

$ torchrun --nnode 1 --nproc-per-node 2 main.py

Does rccl support gfx1100?

Multi-GPU Support with External Pinning

In my HPC environment, srun pins MPI ranks to specific cores and GPUs (by setting ROCR_VISIBLE_DEVICES). However, this conflicts with rccl-tests, which tries to select GPUs manually based on the MPI rank.

I have fixed this in my own build (frobnitzem@5b347ee) by always running the step gpuid = gpuid % args->localNumDevices, regardless of whether args->enable_multiranks is true or not.

I suggest adopting this change and reverting the update d16d1fb, which throws an error in this case instead.
