
npkit's Introduction

Introduction

NPKit (Networking Profiling Kit) is a profiling framework designed for popular collective communication libraries (CCLs), including Microsoft MSCCL, NVIDIA NCCL and AMD RCCL. It enables users to insert customized profiling events into different CCL components, especially into giant GPU kernels. These events are then automatically placed onto a unified timeline in Google Trace Event Format, which users can load into a trace viewer to understand the CCLs' workflow and performance.
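For readers who have not seen the Trace Event Format before, the following is a minimal sketch of what a timeline file in that format can look like. The event name, process/thread IDs and `args` values are invented for illustration and are not NPKit's actual output; only the `traceEvents` list and the `name`/`ph`/`ts`/`dur`/`pid`/`tid`/`args` fields come from the format itself.

```python
import json

def make_complete_event(name, start_us, dur_us, pid, tid, args=None):
    """Build one "complete" event (ph="X") with a start time and a duration."""
    return {
        "name": name,
        "ph": "X",          # "X" marks a complete event (start + duration)
        "ts": start_us,     # start timestamp in microseconds
        "dur": dur_us,      # duration in microseconds
        "pid": pid,         # row group in the viewer
        "tid": tid,         # row inside the group
        "args": args or {}, # free-form attributes shown when a block is selected
    }

# Hypothetical event; the name and attribute keys are illustrative only.
trace = {
    "traceEvents": [
        make_complete_event("EXAMPLE_DATA_TRANSFER", 100.0, 25.0, pid=0, tid=1,
                            args={"size": 4096, "channel": 1}),
    ]
}

trace_json = json.dumps(trace, indent=2)
print(trace_json)
```

A file with this content can be opened in a Trace Event viewer such as chrome://tracing or Perfetto.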

NPKit is easy to use. It runs with any workload that uses CCLs. Users only need to dynamically link their workload binary to CCLs built with NPKit enabled; the unified timeline with profiling events is then generated automatically.

NPKit is lightweight. During each run, users can choose to only enable profiling events they care about to minimize overhead caused by NPKit.
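The build commands quoted in the issues further down select events through `-DENABLE_NPKIT_EVENT_..._ENTRY` / `_EXIT` defines passed in `NPKIT_FLAGS`. As a rough sketch, assuming that naming pattern holds for the events of interest, a small helper can assemble the flag string. `build_npkit_flags` is a hypothetical helper, not part of NPKit, and whether a given event exists in a given patch set should be checked against `npkit_event.h`.

```python
def build_npkit_flags(event_names, time_sync=True):
    """Assemble an NPKIT_FLAGS string from event base names.

    Assumption: each event has an _ENTRY and an _EXIT define, following the
    pattern seen in the quoted build commands.
    """
    flags = ["-DENABLE_NPKIT"]
    if time_sync:
        flags.append("-DENABLE_NPKIT_EVENT_TIME_SYNC_CPU")
        flags.append("-DENABLE_NPKIT_EVENT_TIME_SYNC_GPU")
    for name in event_names:
        flags.append(f"-DENABLE_NPKIT_EVENT_{name}_ENTRY")
        flags.append(f"-DENABLE_NPKIT_EVENT_{name}_EXIT")
    return " ".join(flags)

npkit_flags = build_npkit_flags(["PRIM_LL128_DATA_PROCESS", "NET_SEND", "NET_RECV"])
print(f'make -j src.build NPKIT_FLAGS="{npkit_flags}"')
```

With the three base names above, the helper reproduces the flag set used in the first issue's `make` command.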

Below is an example of an NPKit timeline result. Green blocks are LL128 data transfer times on the GPU, and each line represents an independent data flow (typically mapped to a channel or thread block). Red/purple blocks are net send/recv times on the CPU. Each block carries additional attributes, such as data size and channel ID.
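Each block on such a timeline corresponds to an entry event and a matching exit event. The sketch below shows one way to pair them into blocks with a start time and a duration. It assumes events arrive as `(name, timestamp, track)` tuples sorted by time and that names end in `_ENTRY` / `_EXIT`; this record layout is hypothetical and is not how NPKit stores its dump files.

```python
def pair_entry_exit(records):
    """Pair *_ENTRY and *_EXIT records on the same track into duration blocks.

    records: iterable of (event_name, timestamp_us, track), sorted by timestamp.
    Unmatched EXIT records are ignored; unmatched ENTRY records are dropped.
    """
    open_events = {}  # (base_name, track) -> entry timestamp
    blocks = []
    for name, ts, track in records:
        if name.endswith("_ENTRY"):
            open_events[(name[: -len("_ENTRY")], track)] = ts
        elif name.endswith("_EXIT"):
            key = (name[: -len("_EXIT")], track)
            start = open_events.pop(key, None)
            if start is not None:
                blocks.append({"name": key[0], "track": track,
                               "start_us": start, "dur_us": ts - start})
    return blocks

# Made-up timestamps for illustration.
sample_records = [
    ("NET_SEND_ENTRY", 10, 0),
    ("PRIM_LL128_DATA_PROCESS_ENTRY", 12, 1),
    ("NET_SEND_EXIT", 25, 0),
    ("PRIM_LL128_DATA_PROCESS_EXIT", 40, 1),
]
blocks = pair_entry_exit(sample_records)
for block in blocks:
    print(block)
```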

NPKit Result Example

Quick Start

Please check msccl_samples for MSCCL quick start, nccl_samples for NCCL quick start and rccl_samples for RCCL quick start.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

npkit's People

Contributors

yzygitzh


npkit's Issues

Empty trace file

Describe the bug
Hi authors, I followed the usage example but got an empty trace file, and I am not sure why. For every dump file, no parsed_gpu_event satisfies the condition at https://github.com/microsoft/NPKit/blob/main/npkit_for_nccl_v2.10.3-1/samples/npkit/npkit_post_process.py.diff#L79. My commands are below.

To Reproduce
Steps to reproduce the behavior:

$ git clone https://github.com/nvidia/nccl nccl-v2.10.3-1
$ cd nccl-v2.10.3-1
$ git checkout 7e51592
$ find ../npkit_for_nccl_v2.10.3-1/ | grep '.diff$' | awk '{print "git apply "$1}' | bash
$ make -j src.build NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"
$ cd samples
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make MPI=1 MPI_HOME=/home/xinglinpan/mpi/openmpi-4.1.4/ CUDA_HOME=/usr/local/cuda-10.2/ NCCL_HOME=/home/xinglinpan/npkit/npkit_result/npkit_src/
$ CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib:/home/xinglinpan/mpi/openmpi-4.1.4/lib/  mpirun -np 4 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 -mca coll_hcoll_enable 0 -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent 8192 -x NCCL_UCX_TLS=rc_x,cuda_copy,cuda_ipc -x NCCL_UCX_RNDV_THRESH=0 -x NCCL_UCX_RNDV_SCHEME=get_zcopy -x UCX_RC_MLX5_TM_ENABLE=y -x NPKIT_DUMP_DIR=./  ./all_reduce_perf -b 8 -e 128M -f 2 -g 1
$ python npkit_post_process.py --npkit_dump_dir=/home/xinglinpan/nccl-tests/build --npkit_event_header_path=/home/xinglinpan/npkit/nccl-v2.10.3-1/src/include/npkit/npkit_event.h --output_dir=./

Logs

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid   3229 on       gpu9 device  0 [0x3d] NVIDIA GeForce RTX 2080 Ti
#   Rank  1 Pid   3230 on       gpu9 device  1 [0x3e] NVIDIA GeForce RTX 2080 Ti
#   Rank  2 Pid   3231 on       gpu9 device  2 [0xb1] NVIDIA GeForce RTX 2080 Ti
#   Rank  3 Pid   3232 on       gpu9 device  3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum    18.34    0.00    0.00  1e-07    56.46    0.00    0.00  0e+00
          16             4     float     sum    20.02    0.00    0.00  3e-08    24.77    0.00    0.00  3e-08
          32             8     float     sum    21.58    0.00    0.00  3e-08    23.57    0.00    0.00  3e-08
          64            16     float     sum    28.89    0.00    0.00  3e-08    25.10    0.00    0.00  3e-08
         128            32     float     sum    18.51    0.01    0.01  3e-08    17.26    0.01    0.01  3e-08
         256            64     float     sum    17.96    0.01    0.02  3e-08    18.64    0.01    0.02  3e-08
         512           128     float     sum    16.71    0.03    0.05  3e-08    20.96    0.02    0.04  1e-08
        1024           256     float     sum    27.02    0.04    0.06  1e-07    27.71    0.04    0.06  1e-07
        2048           512     float     sum    28.58    0.07    0.11  1e-07    24.29    0.08    0.13  1e-07
        4096          1024     float     sum    27.17    0.15    0.23  2e-07    30.42    0.13    0.20  2e-07
        8192          2048     float     sum    30.83    0.27    0.40  2e-07    36.36    0.23    0.34  2e-07
       16384          4096     float     sum    30.71    0.53    0.80  2e-07    40.37    0.41    0.61  2e-07
       32768          8192     float     sum    268.6    0.12    0.18  2e-07    43.18    0.76    1.14  2e-07
       65536         16384     float     sum    58.56    1.12    1.68  2e-07    64.99    1.01    1.51  2e-07
      131072         32768     float     sum    102.0    1.28    1.93  2e-07    105.5    1.24    1.86  2e-07
      262144         65536     float     sum    167.2    1.57    2.35  2e-07    178.3    1.47    2.21  2e-07
      524288        131072     float     sum    276.3    1.90    2.85  2e-07    239.1    2.19    3.29  2e-07
     1048576        262144     float     sum    360.9    2.91    4.36  2e-07    358.3    2.93    4.39  2e-07
     2097152        524288     float     sum    726.3    2.89    4.33  2e-07    729.2    2.88    4.31  2e-07
     4194304       1048576     float     sum   1442.7    2.91    4.36  2e-07   2084.9    2.01    3.02  2e-07
     8388608       2097152     float     sum   4161.9    2.02    3.02  2e-07   2828.9    2.97    4.45  2e-07
    16777216       4194304     float     sum   6272.6    2.67    4.01  2e-07   6720.3    2.50    3.74  2e-07
    33554432       8388608     float     sum    13248    2.53    3.80  2e-07    13023    2.58    3.86  2e-07
    67108864      16777216     float     sum    26097    2.57    3.86  2e-07    25792    2.60    3.90  2e-07
   134217728      33554432     float     sum    49738    2.70    4.05  2e-07    49205    2.73    4.09  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.71279

Platform

  • Device: GeForce RTX 2080Ti * 4
  • OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • CUDA version: 10.2
  • NCCL version: 2.7.8-1
  • PyTorch version: 1.9.1
  • Python Version: 3.8
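When a trace comes out empty, one quick check is whether the directory passed as `--npkit_dump_dir` actually contains non-empty dump files; note that `NPKIT_DUMP_DIR=./` in the run command is relative to wherever `mpirun` was launched. The sketch below only summarizes a directory's contents and makes no assumption about NPKit's real file names; the file names in the demo are made up.

```python
import os
import re
import tempfile
from collections import Counter

def summarize_dump_dir(dump_dir):
    """Group files in dump_dir by name pattern and collect zero-byte files."""
    patterns = Counter()
    empty_files = []
    for entry in sorted(os.listdir(dump_dir)):
        path = os.path.join(dump_dir, entry)
        if not os.path.isfile(path):
            continue
        # Replace digit runs so per-rank / per-buffer files fall into one group.
        patterns[re.sub(r"\d+", "N", entry)] += 1
        if os.path.getsize(path) == 0:
            empty_files.append(entry)
    return patterns, empty_files

# Demo on a throwaway directory with made-up file names (not NPKit's real ones).
with tempfile.TemporaryDirectory() as demo_dir:
    for name, payload in [("cpu_dump_0", b"x"), ("cpu_dump_1", b"x"), ("gpu_dump_0", b"")]:
        with open(os.path.join(demo_dir, name), "wb") as f:
            f.write(payload)
    demo_patterns, demo_empty = summarize_dump_dir(demo_dir)

for pattern, count in sorted(demo_patterns.items()):
    print(f"{pattern}: {count} file(s)")
print("empty:", demo_empty)
```

If no GPU-side files show up at all, or they are all empty, the problem lies before post-processing, for example in which library the test binary actually loaded at run time.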

Question about the misalignment of the generated files

I used NPKit to generate profiling files on two machines, and the timestamps in the files from the two machines do not appear to be aligned.
Here is an example. Process 3 and Process 4 are from different machines, and their traces do not line up.
image

Here is my code with NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT".

CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib/:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ mpirun --prefix /home/xinglinpan/mpi/openmpi-4.1.4/ -np 8 -x NPKIT_DUMP_DIR=./ -x LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib/:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ -x NCCL_DEBUG=TRACE -x NCCL_DEBUG_SUBSYS=GRAPH -H gpu9:4,gpu10:4 ./build/alltoall_perf -b 64M -e 64M -f 2 -g 1 -n 1 -w 0

By the way, the events seem to be divided into 6 parts. With -n 0 there are still 4 parts (i.e., only initialization is completed), so is the AlltoAll itself carried out by the events in the last 2 parts?
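To put a number on a misalignment like the one described, one option is to compare the time range each process covers in the generated trace. This is a minimal sketch, assuming the trace is a Trace Event Format JSON object with a `traceEvents` list whose events carry `pid` and `ts` (and optionally `dur`); the sample events are invented. It only measures the offset and says nothing about how NPKit synchronizes clocks across machines.

```python
def per_pid_time_range(trace):
    """Return {pid: (earliest_ts, latest_end)} over all timestamped events."""
    ranges = {}
    for ev in trace.get("traceEvents", []):
        if "ts" not in ev or "pid" not in ev:
            continue  # skip metadata or malformed entries
        start = ev["ts"]
        end = start + ev.get("dur", 0)
        lo, hi = ranges.get(ev["pid"], (start, end))
        ranges[ev["pid"]] = (min(lo, start), max(hi, end))
    return ranges

# Invented events: process 4 starts 5000 us later than process 3.
sample_trace = {
    "traceEvents": [
        {"name": "a", "ph": "X", "pid": 3, "tid": 0, "ts": 100, "dur": 50},
        {"name": "b", "ph": "X", "pid": 3, "tid": 0, "ts": 200, "dur": 30},
        {"name": "a", "ph": "X", "pid": 4, "tid": 0, "ts": 5100, "dur": 50},
    ]
}
ranges = per_pid_time_range(sample_trace)
start_offset_us = ranges[4][0] - ranges[3][0]
print(ranges, start_offset_us)
```

For a real trace, load the file with `json.load` and pass the result to `per_pid_time_range`.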

Unable to generate GPU traces for MSCCL

I have 8 machines, each with a single GPU. When following the build instructions for NCCL I get traces for both CPU and GPU events, but after following the steps for MSCCL I only get traces for CPU events. Below is each step taken to try and get GPU traces with MSCCL.

git clone https://github.com/microsoft/NPKit.git
cd NPKit
git clone https://github.com/microsoft/msccl msccl-master-e52c525
cd msccl-master-e52c525
git checkout e52c525
find ../npkit_for_msccl_master_e52c525/ | grep '.diff$' | awk '{print "git apply "$1}' | bash
make -j src.build src.build NVCC_GENCODE="-arch=sm_80" NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"

cd ..
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests/
make MPI=1 MPI_HOME=/usr/local/openmpi/ NCCL_HOME=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build -j

cd ..
mkdir dump_files
mkdir trace_files

# root directory copied to all machines

mpirun -H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara rm -f /home/jasonfantl/NPKit/MSCCL/NPKit/dump_files/* && \
mpirun \
    --tag-output \
    -H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara \
    -x PATH \
    -x LD_PRELOAD=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build/lib/libnccl.so.2 \
    -x LD_LIBRARY_PATH=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build:/usr/local/openmpi/lib:/usr/local/cuda/lib64:/usr/local/openmpi/lib:$LD_LIBRARY_PATH  \
    -x NCCL_P2P_DISABLE=1 \
    -x NCCL_SHM_DISABLE=1 \
    -x NCCL_SOCKET_IFNAME=wan0 \
    -x NCCL_NET=IB \
    -x NCCL_IB_GID_INDEX=3 \
    -x NCCL_IB_HCA=mlx5 \
    -x NCCL_NET_GDR_LEVEL=SYS  \
    -x NCCL_ALGO=MSCCL \
    -x NCCL_PROTO=LL \
    -x NPKIT_DUMP_DIR=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
    -x MSCCL_XML_FILES=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl_samples/msccl_algo_sample.xml \
    /home/jasonfantl/NPKit/MSCCL/NPKit/nccl-tests/build/all_reduce_perf -b 1048576 -e 1048576 -f 2 -g 1 -c 1 -n 100 -w 100 -z 0

python /home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/samples/npkit/npkit_post_process.py \
  --npkit_dump_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
  --npkit_event_header_path=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/src/include/npkit/npkit_event.h \
  --output_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/trace_files

A potentially useful note: while trying different settings with NCCL_ALGO=RING, I noticed that NCCL_PROTO=LL (with -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT) doesn't produce GPU traces, but NCCL_PROTO=LL128 (with -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT) does. I also believe there's a typo in npkit_post_process.py line 77: curr_cpu_base_time needs to be replaced with curr_gpu_base_time in order to parse.

The current MSCCL example npkit_runner.sh uses NPKIT=1 as the build flag, which does not seem to enable any traces at all. I saw the MSCCL example had recently used -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_ENTRY -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_EXIT, which also didn't produce traces.

What are the correct build flags to generate GPU traces with MSCCL?
