Comments (20)
Hope you can help take a look! The testing log is attached as well.
Bo
from nvidia_libs_test.
Thanks for the bug report!
These are quite probably bugs in cudnn. I don't believe we have tested with cudnn 7.4.2 yet ourselves.
@timshen91 wdyt?
I'll take a look. If it's indeed a cuDNN 7.4.2 regression, I'll let the Nvidia folks know.
Thanks for your quick response! Please check whether you can reproduce this issue, and if you need any more information, please let me know here.
Thanks
Hi everyone,
It also fails intermittently on cuDNN 7.4.1.
Thanks
@gawain102000, I'm unable to reproduce it. See the full log below.
Which host compiler did you use?
~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
INFO: Invocation ID: 8da274cc-63d5-49f3-b9c3-2a9718323c70
INFO: Analysed target //:cudnn_test (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //:cudnn_test up-to-date:
bazel-bin/cudnn_test
INFO: Elapsed time: 0.202s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: external/bazel_tools/tools/test/test-setup.sh ./cudnn_test '--gtest_filter=Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME'
INFO: Build completed successfully, 1 total action
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //:cudnn_test
-----------------------------------------------------------------------------
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1212 23:10:47.320444 37724 cudnn_util.cc:68] Running cuDNN v7.4.1 for CUDA 9.0.0 on TITAN V
Note: Google Test filter = Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Conv3d/ConvolutionTest
[ RUN ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[ OK ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME (20 ms)
[----------] 1 test from Conv3d/ConvolutionTest (20 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (20 ms total)
[ PASSED ] 1 test.
OK, I did reproduce it. It's non-deterministic.
~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME" --gtest_repeat=100 2>&1 | grep FAILED
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
Hi @timshen91
Cool! Thanks for reproducing this! I am working on Ubuntu 16.04 with the default gcc:
gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Thanks
Bo
Hi @gawain102000 ,
I noticed something. For this backward filter, for each convolution call, nvidia_libs_test by default fills the result buffer with NaNs.
The code is in TEST_P(ConvolutionTest, CompareResults), and the call is to FillWithNaNs(). If I remove that line, all failures disappear.
I suspect that cuDNN misbehaves when the result buffer contains garbage NaNs before the convolution runs.
Hi @timshen91
Thanks for investigating this! Regarding the error log "[2788]: 0.22229 vs nan, error = nan": the correct value at [2788] should be 0.22229, and the computed value is NaN. I think several factors could make this issue non-deterministic:
(1) The output at [2788] was never updated, or
(2) The output at [2788] was updated with the correct value and then overwritten with NaN, or
(3) cuDNN produced NaN at [2788], which is incorrect.
BTW, both nvidia_libs_test and cuDNN make some asynchronous CUDA calls, and I cannot be sure they behave as expected. This is only a guess.
I hope the above info is helpful to you!
Thanks
Bo
@gawain102000 can you try to reproduce it locally, directly with cuDNN, with the filter buffer filled with NaNs before calling the convolution?
Hi @timshen91
Yes, we can see which cuDNN API was called by nvidia_libs_test, and by running tests directly on that API alone, I still cannot reproduce this issue.
I will double-check.
Thanks
Wait, did you actually try to do that with the result buffer filled with NaNs?
There is an option to fill with NaN, and I need to double-check whether it really works.
Thanks
BTW, I cannot be sure how device memory is requested and managed by nvidia_libs_test. There are two possibilities.
The first way:
(1) Request one block of memory with the combined size of input and output, and let ptrT point to its beginning. Then (2) let ptrI (the input pointer) point to ptrT, and let ptrO (the output pointer) point to ptrT + sizeof(input).
The second way:
(1) Request the memory for the input and let ptrI point to it.
(2) Request the memory for the output and let ptrO point to it.
Thanks
Bo
Hi @timshen91
We use the following flow to run the test against cuDNN:
(1) When beta is zero, always fill hfilterOutput with NaN on the host and then copy it to the device:
memset(hfilterOutput, 0xFF, filterOutputDesc.totalSize * sizeof(half));
cudaMemcpy(dfilterOutput, hfilterOutput, filterOutputDesc.totalSize * sizeof(half), cudaMemcpyHostToDevice);
(2) Use cudaMalloc for the input:
cudaMalloc((void **)&devPtrI, InputTest.totalSize * sizeof(half));
(3) Use cudaMalloc for the input diff:
cudaMalloc((void **)&devPtrIdiff, InputDiffTest.totalSize * sizeof(half));
So each buffer gets its own cudaMalloc allocation, with each pointer at the beginning of its allocation; cudaMalloc places these in 32-, 64-, 128- or 512-byte segments of device memory that are aligned to their size.
Currently, I still see no issue on my side. I hope the above is helpful for you!
Thanks
Bo
All tensors are allocated individually in nvidia_libs_test (you should see this in a CUDA API trace). We rely on the default alignment being sufficient.
As Tim pointed out, though, the failure isn't consistent (7 out of 100?), which suggests this is a timing issue. Have you tried repeating your direct testing a number of times?
Hi @chsigg
Thanks for your response and your team's help on this issue! Yes, we are repeating the direct test many times locally.
Thanks
@nluehr is investigating this issue now.
Thanks for everybody's help! This looks like a cuDNN issue rather than a framework issue. Since a cuDNN engineer is investigating it internally, please allow me to close this here. If you have any questions, please let me know.
Thanks