nvidia / cuda-profiler

Tools and extensions for CUDA profiling

Perl 4.69% Makefile 0.44% CWeb 4.60% CMake 5.07% C++ 11.60% Python 73.61%

cuda-profiler's Introduction

Extension	Extends tool	Description
one-hop profiling	NVIDIA Visual Profiler	Remotely profile a CUDA program when the machine actually running it is not accessible from the machine running the NVIDIA Visual Profiler
NVTX MPI Wrappers	nvprof	Inserts NVTX ranges for many common Message Passing Interface (MPI) functions.

cuda-profiler's Issues

nvtx_pmpi Fortran interface crashes when using MPI_IN_PLACE

nvtx_pmpi maps Fortran MPI_* calls onto C PMPI_* calls itself, rather than leaving that step to the underlying MPI library.
Unfortunately it gets some things wrong in the process, in particular the handling of special constants that are passed in place of data pointers, such as MPI_IN_PLACE and MPI_BOTTOM.

I have a workaround for OpenMPI/SpectrumMPI, but it's not general, and I'm not sure it's possible to do this generically in the first place. The first question is whether there is interest in addressing this issue. If so, it would be worth discussing options for how to do it.

dlprof tools

The dlprof tool analyzed my deep learning model and reported that the data shapes do not meet the Tensor Core requirements. The original script defines five fully connected layers, with shapes (8,1024), (1024,1024), (1024,512), (512,1), and batch=128. With batch=64, the improvements suggested by DLProf no longer appear. Why? I did not change the shape (512,1) to (512,8).

My code is on GitHub: https://github.com/fenfaqingnian/dlprof_v100/tree/master/Profiler_DLprof_TF1-master
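To make the shape question concrete, here is a minimal sketch of the kind of alignment check involved. It assumes the commonly cited guideline that FP16 Tensor Core GEMMs want every dimension to be a multiple of 8. Whether DLProf applies exactly this rule is an assumption, and `misaligned_dims` is a hypothetical helper, not part of DLProf.

```python
from typing import List, Tuple

# Assumed FP16 Tensor Core alignment; not taken from DLProf itself.
ALIGNMENT = 8


def misaligned_dims(batch: int,
                    layers: List[Tuple[int, int]],
                    alignment: int = ALIGNMENT) -> List[Tuple[str, int]]:
    """Return (name, value) for every dimension that is not a multiple of `alignment`."""
    problems = []
    if batch % alignment:
        problems.append(("batch", batch))
    for index, (fan_in, fan_out) in enumerate(layers):
        if fan_in % alignment:
            problems.append((f"layer{index}.in", fan_in))
        if fan_out % alignment:
            problems.append((f"layer{index}.out", fan_out))
    return problems


# Layer shapes as listed in the issue.
layers = [(8, 1024), (1024, 1024), (1024, 512), (512, 1)]
print(misaligned_dims(128, layers))
print(misaligned_dims(64, layers))
```

Under this assumed rule, both batch sizes flag only the output width of the (512,1) layer, so the multiple-of-8 check alone does not explain why the recommendation disappears at batch=64.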

compile failure using PGI

git reflog
5a6577f (HEAD -> master, origin/master, origin/HEAD) HEAD@{0}: clone: from https://github.com/NVIDIA/cuda-profiler.git

65> make
python2.7 wrap/wrap.py -f -o nvtx_pmpi.c nvtx.w
mpicc -I/nasa/cuda/10.1/include -DPIC -fPIC -c nvtx_pmpi.c -o nvtx_pmpi.o
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 627)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 729)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 729)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 825)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 825)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 921)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 921)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1017)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1017)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1017)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1017)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1065)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1065)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1065)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1065)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1065)
PGC-S-0094-Illegal type conversion required (nvtx_pmpi.c: 1065)
PGC/x86-64 Linux 19.5-0: compilation completed with severe errors
Makefile:5: recipe for target 'nvtx_pmpi.o' failed
make: *** [nvtx_pmpi.o] Error 2

The same source compiles fine with gcc-8.2.0.

Nvprof event resampling

Dear developers:

How can I reduce the size of the nvvp files that nvprof writes by re-sampling events? The one option I found in the nvprof help that looked relevant is --continuous-sampling-interval. When I set it to 10 ms (the default is 2 ms according to the documentation), the output files are the same size as when I do not specify it. Is this a bug, or am I missing something?

Thanks,
Shelton

MPI annotation option does not output any MPI information

Dear Nvprof developers:

I want to use nvprof to profile my CUDA+MPI application, but a small test shows that the option --annotate-mpi openmpi does not produce any information about the MPI calls, contrary to what the nvprof documentation describes. The details of the test are as follows:

Sample Test:
From Link: http://geco.mines.edu/tesla/cuda_tutorial_mio/
Source Files: mpi_hello_gpu.cu, vecadd.cu
OpenMPI Version: 4.0.2
Cuda Version: 10.1
Command: $ mpirun -np 2 nvprof --annotate-mpi openmpi ./mpi_cuda

Output (using 2 MPI processes):
rank 0 of 2 on p3dev02 received bcastme[3]=3 [gpu 0]
rank 1 of 2 on p3dev02 received bcastme[3]=3 [gpu 1]
==70253== NVPROF is profiling process 70253, command: ./mpi_cuda
==70254== NVPROF is profiling process 70254, command: ./mpi_cuda
rank 0: cudaGetDevice()=0
rank 1: cudaGetDevice()=1
rank 1: C[0]=0.000000
ranksum= 1
==70253== Profiling application: ./mpi_cuda
==70253== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   62.58%  3.1040us         2  1.5520us  1.3440us  1.7600us  [CUDA memcpy HtoD]
                   37.42%  1.8560us         1  1.8560us  1.8560us  1.8560us  [CUDA memcpy DtoH]
      API calls:   86.74%  352.44ms         3  117.48ms  10.267us  352.42ms  cudaMalloc
                    5.39%  21.910ms       582  37.645us     258ns  2.0794ms  cuDeviceGetAttribute
                    4.75%  19.303ms     50000     386ns     303ns  102.73us  cudaLaunchKernel
                    2.07%  8.3917ms         6  1.3986ms  1.1406ms  1.4661ms  cuDeviceTotalMem
                    0.68%  2.7607ms         1  2.7607ms  2.7607ms  2.7607ms  cudaGetDeviceProperties
                    0.34%  1.3713ms         6  228.55us  215.41us  247.59us  cuDeviceGetName
                    0.02%  66.319us         3  22.106us  14.092us  30.931us  cudaMemcpy
                    0.01%  20.708us         3  6.9020us  1.8690us  16.755us  cudaFree
                    0.00%  12.278us         6  2.0460us  1.3700us  4.3850us  cuDeviceGetPCIBusId
                    0.00%  7.5770us        12     631ns     375ns     973ns  cuDeviceGet
                    0.00%  6.6190us         1  6.6190us  6.6190us  6.6190us  cudaSetDevice
                    0.00%  6.2070us         4  1.5510us     867ns  2.3670us  cuPointerGetAttributes
                    0.00%  2.3390us         6     389ns     354ns     461ns  cuDeviceGetUuid
                    0.00%  1.8280us         3     609ns     437ns     780ns  cuDeviceGetCount
                    0.00%  1.5210us         1  1.5210us  1.5210us  1.5210us  cudaGetDevice
                    0.00%  1.2300us         1  1.2300us  1.2300us  1.2300us  cudaGetDeviceCount
==70254== Profiling application: ./mpi_cuda
==70254== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  179.83ms     50000  3.5960us  3.5510us  4.0640us  vecAdd(float*, float*, float*)
                    0.00%  3.0400us         2  1.5200us  1.3440us  1.6960us  [CUDA memcpy HtoD]
                    0.00%  2.0480us         1  2.0480us  2.0480us  2.0480us  [CUDA memcpy DtoH]
      API calls:   68.49%  884.64ms     50000  17.692us  16.647us  1.4335ms  cudaLaunchKernel
                   28.85%  372.61ms         3  124.20ms  15.212us  372.57ms  cudaMalloc
                    1.55%  20.003ms       582  34.368us     453ns  1.2518ms  cuDeviceGetAttribute
                    0.76%  9.7675ms         6  1.6279ms  1.6077ms  1.6602ms  cuDeviceTotalMem
                    0.25%  3.2029ms         1  3.2029ms  3.2029ms  3.2029ms  cudaGetDeviceProperties
                    0.10%  1.2356ms         6  205.93us  135.78us  224.53us  cuDeviceGetName
                    0.01%  103.42us         3  34.473us  19.464us  60.273us  cudaMemcpy
                    0.00%  60.895us         3  20.298us  4.2420us  51.665us  cudaFree
                    0.00%  16.364us         4  4.0910us  2.0370us  9.1220us  cuPointerGetAttributes
                    0.00%  14.154us         6  2.3590us  1.9510us  3.1620us  cuDeviceGetPCIBusId
                    0.00%  11.338us        12     944ns     580ns  1.5080us  cuDeviceGet
                    0.00%  7.3840us         1  7.3840us  7.3840us  7.3840us  cudaSetDevice
                    0.00%  3.8410us         6     640ns     592ns     673ns  cuDeviceGetUuid
                    0.00%  2.7020us         3     900ns     699ns  1.0970us  cuDeviceGetCount
                    0.00%  1.9360us         1  1.9360us  1.9360us  1.9360us  cudaGetDevice
                    0.00%  1.2750us         1  1.2750us  1.2750us  1.2750us  cudaGetDeviceCount

Hope you can reproduce the issue.

Best,
Shelton

Remote profiling - failed to Create a new session (Ctrl + N)

I'm running Visual Profiler on Windows and trying to remotely profile an Ubuntu machine.
My Windows machine does not have an NVIDIA GPU.
When I try to create a new session, I get the following message and Visual Profiler exits:
"Unable to locate CUDA libraries and establish connection with CUDA driver"
