Comments (20)

gawain102000 commented on April 28, 2024

Hope you can have a look! The testing log is attached as well.

Bo

jlebar commented on April 28, 2024

Thanks for the bug report!

These are quite probably bugs in cuDNN. I don't believe we have tested with cuDNN 7.4.2 yet ourselves.

@timshen91 wdyt?

timshen91 commented on April 28, 2024

I'll take a look. If it's indeed a cuDNN 7.4.2 regression, I'll let the Nvidia folks know.

gawain102000 commented on April 28, 2024

Thanks for your quick response! Please see whether you can reproduce this issue; if more info would help, please let me know here.

Thanks

gawain102000 commented on April 28, 2024

Hi everyone,
The test also fails intermittently on cuDNN 7.4.1.

Thanks

timshen91 commented on April 28, 2024

@gawain102000, I'm unable to reproduce it. See the full log below.

Which host compiler did you use?

~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
INFO: Invocation ID: 8da274cc-63d5-49f3-b9c3-2a9718323c70
INFO: Analysed target //:cudnn_test (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //:cudnn_test up-to-date:
  bazel-bin/cudnn_test
INFO: Elapsed time: 0.202s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: external/bazel_tools/tools/test/test-setup.sh ./cudnn_test '--gtest_filter=Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME'
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //:cudnn_test
-----------------------------------------------------------------------------
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1212 23:10:47.320444 37724 cudnn_util.cc:68] Running cuDNN v7.4.1 for CUDA 9.0.0 on TITAN V
Note: Google Test filter = Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Conv3d/ConvolutionTest
[ RUN      ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[       OK ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME (20 ms)
[----------] 1 test from Conv3d/ConvolutionTest (20 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (20 ms total)
[  PASSED  ] 1 test.

timshen91 commented on April 28, 2024

OK, I did reproduce it. It's non-deterministic.

~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME" --gtest_repeat=100 2>&1 | grep FAILED
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST

gawain102000 commented on April 28, 2024

Hi @timshen91
Cool! Thanks for reproducing this! I worked on Ubuntu 16.04 with the default gcc:

gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

Thanks
Bo

timshen91 commented on April 28, 2024

Hi @gawain102000 ,

I noticed something. For this backward filter, for each convolution call, nvidia_libs_test by default fills the result buffer with NaNs.

The code is in TEST_P(ConvolutionTest, CompareResults), and the call is to FillWithNaNs(). If I remove that line, all failures disappear.

I suspect that cuDNN misbehaves when the result buffer contains garbage NaNs before the convolution runs.
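
For context, a minimal sketch of the NaN-prefill idea (the actual FillWithNaNs() in nvidia_libs_test may be implemented differently):

#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical sketch: fill a device buffer of fp16 values with NaNs by
// setting every byte to 0xFF. The fp16 bit pattern 0xFFFF has all exponent
// bits set and a nonzero mantissa, which encodes a (negative, quiet) NaN.
void FillDeviceBufferWithNaNs(void *device_ptr, size_t num_elements) {
  cudaMemset(device_ptr, 0xFF, num_elements * sizeof(__half));
}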

gawain102000 commented on April 28, 2024

Hi @timshen91

Thanks for investigating this! For the error log "[2788]: 0.22229 vs nan, error = nan", the correct value at [2788] should be 0.22229, and the computed value is NaN. I think several factors could make this issue non-deterministic:

(1) The output at [2788] was never updated, or
(2) the output at [2788] was updated with the correct value and then overwritten with NaN, or
(3) cuDNN computed NaN at [2788], which is incorrect.

BTW, both nvidia_libs_test and cuDNN include some asynchronous CUDA calls, and I cannot be sure they all behave as expected. This is only a guess.

Hope the above info is helpful for you! A sketch for telling cases (1) and (3) apart is below.
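
A hypothetical diagnostic, with illustrative names: prefill the result buffer with a distinctive finite sentinel instead of NaN, run the convolution, and inspect the failing element. If the sentinel survives, the element was never written (case 1); if a NaN appears even though the buffer held no NaNs, cuDNN produced it (case 3).

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

void InspectOutputElement(__half *d_output, size_t num_elements, size_t idx) {
  // Distinctive finite sentinel; exactly representable in fp16.
  const __half kSentinel = __float2half(-123.0f);
  std::vector<__half> h_buf(num_elements, kSentinel);
  cudaMemcpy(d_output, h_buf.data(), num_elements * sizeof(__half),
             cudaMemcpyHostToDevice);

  // ... run cudnnConvolutionBackwardFilter() with beta = 0 here ...

  cudaMemcpy(h_buf.data(), d_output, num_elements * sizeof(__half),
             cudaMemcpyDeviceToHost);
  float v = __half2float(h_buf[idx]);
  if (v == -123.0f) {
    printf("element %zu was never written\n", idx);
  } else if (std::isnan(v)) {
    printf("element %zu: cuDNN produced a NaN\n", idx);
  } else {
    printf("element %zu = %f\n", idx, v);
  }
}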

Thanks
Bo

timshen91 commented on April 28, 2024

@gawain102000, can you try to reproduce it locally, directly with cuDNN, with the filter buffer filled with NaNs before calling the conv?

gawain102000 commented on April 28, 2024

Hi @timshen91
Yes, on the cuDNN side we can see which API nvidia_libs_test called, but running tests on only that API, I currently still cannot reproduce this issue.
I will double-check.

Thanks

timshen91 commented on April 28, 2024

Wait, did you actually try to do that with the result buffer filled with NaNs?

gawain102000 commented on April 28, 2024

There is an option to fill with NaN, and I need to double-check whether it really works.

Thanks

gawain102000 commented on April 28, 2024

BTW, I cannot be sure how device memory is requested and managed by nvidia_libs_test. There are two possibilities, as follows (see the sketch after this list):

The first way:
(1) Request one block of memory with the total size of input plus output, and let ptrT point to its beginning. Then (2) let ptrI (the input pointer) point to ptrT, and let ptrO (the output pointer) point to ptrT + sizeof(input).

The second way:
(1) Request the total memory for the input and let ptrI point to it.
(2) Request the total memory for the output and let ptrO point to it.
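
A minimal sketch of the two strategies, with hypothetical sizes and the pointer names from above:

#include <cuda_runtime.h>

// First way: one allocation shared by input and output.
void AllocateAsOneBlock(size_t input_bytes, size_t output_bytes) {
  char *ptrT = NULL;
  cudaMalloc((void **)&ptrT, input_bytes + output_bytes);
  char *ptrI = ptrT;                // input at the start of the block
  char *ptrO = ptrT + input_bytes;  // output right after the input
  (void)ptrI; (void)ptrO;
}

// Second way: independent allocations, each aligned by cudaMalloc.
void AllocateSeparately(size_t input_bytes, size_t output_bytes) {
  void *ptrI = NULL;
  void *ptrO = NULL;
  cudaMalloc(&ptrI, input_bytes);
  cudaMalloc(&ptrO, output_bytes);
}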

Thanks
Bo

gawain102000 commented on April 28, 2024

Hi @timshen91

We use the following flow to run the test on cuDNN (a consolidated sketch follows below):

(1) When beta is zero, always fill hfilterOutput with NaN on the host and then copy it to the device:
// Setting every byte to 0xFF makes each fp16 value 0xFFFF, which is a NaN.
memset(hfilterOutput, 0xFF, filterOutputDesc.totalSize * sizeof(half));
cudaMemcpy(dfilterOutput, hfilterOutput, filterOutputDesc.totalSize * sizeof(half), cudaMemcpyHostToDevice);

(2) Use cudaMalloc for the Input:
cudaMalloc((void **)&devPtrI, InputTest.totalSize * sizeof(half));

(3) Use cudaMalloc for the InputDiff:
cudaMalloc((void **)&devPtrIdiff, InputDiffTest.totalSize * sizeof(half));

So each buffer gets its own cudaMalloc allocation, with a pointer to the beginning of the allocation, which falls on 32-, 64-, 128- or 512-byte segments of device memory aligned to their size.

Currently I still see no issue on my side. Hope the above is helpful for you!
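
For completeness, a consolidated sketch of this flow, ending in the actual cuDNN call. Descriptor setup is elided and all names are illustrative; the failing case is a 3D convolution, so a real repro would build the descriptors with the cudnnSet*NdDescriptor functions using the shapes from the test name.

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cudnn.h>

void RunBackwardFilterWithNaNPrefill(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc, const __half *d_x,
    cudnnTensorDescriptor_t dyDesc, const __half *d_dy,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnConvolutionBwdFilterAlgo_t algo,
    void *d_workspace, size_t workspace_bytes,
    cudnnFilterDescriptor_t dwDesc, __half *d_dw, size_t dw_elements) {
  // Prefill the filter-gradient buffer with fp16 NaNs (0xFFFF per element).
  cudaMemset(d_dw, 0xFF, dw_elements * sizeof(__half));

  // With beta = 0, cuDNN should fully overwrite d_dw, so the prefilled
  // NaNs must not influence the result. Alpha and beta are passed as
  // floats here.
  const float alpha = 1.0f, beta = 0.0f;
  cudnnConvolutionBackwardFilter(handle, &alpha, xDesc, d_x, dyDesc, d_dy,
                                 convDesc, algo, d_workspace, workspace_bytes,
                                 &beta, dwDesc, d_dw);
}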

Thanks
Bo

chsigg commented on April 28, 2024

All tensors are allocated individually in nvidia_libs_test (you should see this in a CUDA API trace). We rely on the default alignment being sufficient.

As Tim pointed out though, the failure isn't consistent (7 out of 100?), which would suggest this is a timing issue. Have you tried repeating your direct testing a number of times?
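
One way to do that, as a hypothetical harness reusing the sketch from the earlier comment: repeat the direct call and scan the result for NaNs after each run.

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Returns true if any element of the device buffer is a NaN.
bool OutputHasNaN(const __half *d_dw, size_t dw_elements) {
  std::vector<__half> h(dw_elements);
  cudaMemcpy(h.data(), d_dw, dw_elements * sizeof(__half),
             cudaMemcpyDeviceToHost);
  for (size_t i = 0; i < dw_elements; ++i) {
    if (std::isnan(__half2float(h[i]))) return true;
  }
  return false;
}

// Usage, matching the intermittent failure rate seen above:
//   for (int run = 0; run < 100; ++run) {
//     RunBackwardFilterWithNaNPrefill(/* ... */);
//     if (OutputHasNaN(d_dw, dw_elements)) printf("run %d produced NaN\n", run);
//   }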

gawain102000 commented on April 28, 2024

Hi @chsigg

Thanks for your response and your team's help on this issue! Yes, we are repeating the direct testing many times locally.

Thanks

mruberry commented on April 28, 2024

@nluehr is investigating this issue now.

gawain102000 commented on April 28, 2024

Thanks for everybody's help! This looks like a cuDNN issue rather than a framework issue. Since a cuDNN engineer is investigating it internally, please allow me to close this here. If you have any questions, please let me know.

Thanks
