Comments (20)

gawain102000 commented on April 28, 2024

Hope you can have a look! The testing log is attached as well.

Bo

jlebar commented on April 28, 2024

Thanks for the bug report!

These are quite probably bugs in cuDNN. I don't believe we have tested with cuDNN 7.4.2 yet ourselves.

@timshen91 wdyt?

timshen91 commented on April 28, 2024

I'll take a look. If it's indeed a cuDNN 7.4.2 regression, I'll let the Nvidia folks know.

gawain102000 commented on April 28, 2024

Thanks for your quick response! Please see whether you can reproduce this issue; if more info would help, please let me know here.

Thanks

gawain102000 commented on April 28, 2024

Hi everyone,
The test also fails intermittently on cuDNN 7.4.1.

Thanks

timshen91 commented on April 28, 2024

@gawain102000, I'm unable to reproduce it. See the full log below.

Which host compiler did you use?

~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
INFO: Invocation ID: 8da274cc-63d5-49f3-b9c3-2a9718323c70
INFO: Analysed target //:cudnn_test (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //:cudnn_test up-to-date:
  bazel-bin/cudnn_test
INFO: Elapsed time: 0.202s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: external/bazel_tools/tools/test/test-setup.sh ./cudnn_test '--gtest_filter=Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME'
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //:cudnn_test
-----------------------------------------------------------------------------
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1212 23:10:47.320444 37724 cudnn_util.cc:68] Running cuDNN v7.4.1 for CUDA 9.0.0 on TITAN V
Note: Google Test filter = Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Conv3d/ConvolutionTest
[ RUN      ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[       OK ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME (20 ms)
[----------] 1 test from Conv3d/ConvolutionTest (20 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (20 ms total)
[  PASSED  ] 1 test.

timshen91 commented on April 28, 2024

OK, I did reproduce it. It's non-deterministic.

~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME" --gtest_repeat=100 2>&1 | grep FAILED
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST

gawain102000 commented on April 28, 2024

Hi @timshen91
Cool! Thanks for reproducing this! I worked on Ubuntu 16.04 with the default gcc:

gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

Thanks
Bo

timshen91 commented on April 28, 2024

Hi @gawain102000 ,

I noticed something. For this backward filter, for each convolution call, nvidia_libs_test by default fills the result buffer with NaNs.

The code is in TEST_P(ConvolutionTest, CompareResults), and the call is to FillWithNaNs(). If I remove that line, all failures disappear.

I suspect that cuDNN misbehaves when the result buffer contains garbage NaNs before the convolution runs.
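
For context, a minimal sketch of the NaN-prefill idea (the actual FillWithNaNs() in nvidia_libs_test may be implemented differently):

#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical sketch: fill a device buffer of fp16 values with NaNs by
// setting every byte to 0xFF. The fp16 bit pattern 0xFFFF has all exponent
// bits set and a nonzero mantissa, which encodes a (negative, quiet) NaN.
void FillDeviceBufferWithNaNs(void *device_ptr, size_t num_elements) {
  cudaMemset(device_ptr, 0xFF, num_elements * sizeof(__half));
}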

gawain102000 commented on April 28, 2024

Hi @timshen91

Thanks for investigating this! For the error log "[2788]: 0.22229 vs nan, error = nan", the correct value at [2788] should be 0.22229, and the computed value is NaN. I think several factors could make this issue non-deterministic:

(1) The output at [2788] was never updated, or
(2) the output at [2788] was updated with the correct value and then overwritten with NaN, or
(3) cuDNN computed NaN at [2788], which is incorrect.

BTW, both nvidia_libs_test and cuDNN include some asynchronous CUDA calls, and I cannot be sure they all behave as expected. This is only a guess.

Hope the above info is helpful for you! A sketch for telling cases (1) and (3) apart is below.
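
A hypothetical diagnostic, with illustrative names: prefill the result buffer with a distinctive finite sentinel instead of NaN, run the convolution, and inspect the failing element. If the sentinel survives, the element was never written (case 1); if a NaN appears even though the buffer held no NaNs, cuDNN produced it (case 3).

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

void InspectOutputElement(__half *d_output, size_t num_elements, size_t idx) {
  // Distinctive finite sentinel; exactly representable in fp16.
  const __half kSentinel = __float2half(-123.0f);
  std::vector<__half> h_buf(num_elements, kSentinel);
  cudaMemcpy(d_output, h_buf.data(), num_elements * sizeof(__half),
             cudaMemcpyHostToDevice);

  // ... run cudnnConvolutionBackwardFilter() with beta = 0 here ...

  cudaMemcpy(h_buf.data(), d_output, num_elements * sizeof(__half),
             cudaMemcpyDeviceToHost);
  float v = __half2float(h_buf[idx]);
  if (v == -123.0f) {
    printf("element %zu was never written\n", idx);
  } else if (std::isnan(v)) {
    printf("element %zu: cuDNN produced a NaN\n", idx);
  } else {
    printf("element %zu = %f\n", idx, v);
  }
}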

Thanks
Bo

timshen91 commented on April 28, 2024

@gawain102000, can you try to reproduce it locally, directly with cuDNN, with the filter buffer filled with NaNs before calling the conv?

gawain102000 commented on April 28, 2024

Hi @timshen91
Yes, on the cuDNN side we can see which API nvidia_libs_test called, but running tests on only that API, I currently still cannot reproduce this issue.
I will double-check.

Thanks

timshen91 commented on April 28, 2024

Wait, did you actually try to do that with the result buffer filled with NaNs?

gawain102000 commented on April 28, 2024

There is an option to fill with NaN, and I need to double-check whether it really works.

Thanks

gawain102000 commented on April 28, 2024

BTW, I cannot be sure how device memory is requested and managed by nvidia_libs_test. There are two possibilities, as follows (see the sketch after this list):

The first way:
(1) Request one block of memory with the total size of input plus output, and let ptrT point to its beginning. Then (2) let ptrI (the input pointer) point to ptrT, and let ptrO (the output pointer) point to ptrT + sizeof(input).

The second way:
(1) Request the total memory for the input and let ptrI point to it.
(2) Request the total memory for the output and let ptrO point to it.
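
A minimal sketch of the two strategies, with hypothetical sizes and the pointer names from above:

#include <cuda_runtime.h>

// First way: one allocation shared by input and output.
void AllocateAsOneBlock(size_t input_bytes, size_t output_bytes) {
  char *ptrT = NULL;
  cudaMalloc((void **)&ptrT, input_bytes + output_bytes);
  char *ptrI = ptrT;                // input at the start of the block
  char *ptrO = ptrT + input_bytes;  // output right after the input
  (void)ptrI; (void)ptrO;
}

// Second way: independent allocations, each aligned by cudaMalloc.
void AllocateSeparately(size_t input_bytes, size_t output_bytes) {
  void *ptrI = NULL;
  void *ptrO = NULL;
  cudaMalloc(&ptrI, input_bytes);
  cudaMalloc(&ptrO, output_bytes);
}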

Thanks
Bo

gawain102000 commented on April 28, 2024

Hi @timshen91

We use the following flow to run the test on cuDNN (a consolidated sketch follows below):

(1) When beta is zero, always fill hfilterOutput with NaN on the host and then copy it to the device:
// Setting every byte to 0xFF makes each fp16 value 0xFFFF, which is a NaN.
memset(hfilterOutput, 0xFF, filterOutputDesc.totalSize * sizeof(half));
cudaMemcpy(dfilterOutput, hfilterOutput, filterOutputDesc.totalSize * sizeof(half), cudaMemcpyHostToDevice);

(2) Use cudaMalloc for the Input:
cudaMalloc((void **)&devPtrI, InputTest.totalSize * sizeof(half));

(3) Use cudaMalloc for the InputDiff:
cudaMalloc((void **)&devPtrIdiff, InputDiffTest.totalSize * sizeof(half));

So each buffer gets its own cudaMalloc allocation, with a pointer to the beginning of the allocation, which falls on 32-, 64-, 128- or 512-byte segments of device memory aligned to their size.

Currently I still see no issue on my side. Hope the above is helpful for you!
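
For completeness, a consolidated sketch of this flow, ending in the actual cuDNN call. Descriptor setup is elided and all names are illustrative; the failing case is a 3D convolution, so a real repro would build the descriptors with the cudnnSet*NdDescriptor functions using the shapes from the test name.

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cudnn.h>

void RunBackwardFilterWithNaNPrefill(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc, const __half *d_x,
    cudnnTensorDescriptor_t dyDesc, const __half *d_dy,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnConvolutionBwdFilterAlgo_t algo,
    void *d_workspace, size_t workspace_bytes,
    cudnnFilterDescriptor_t dwDesc, __half *d_dw, size_t dw_elements) {
  // Prefill the filter-gradient buffer with fp16 NaNs (0xFFFF per element).
  cudaMemset(d_dw, 0xFF, dw_elements * sizeof(__half));

  // With beta = 0, cuDNN should fully overwrite d_dw, so the prefilled
  // NaNs must not influence the result. Alpha and beta are passed as
  // floats here.
  const float alpha = 1.0f, beta = 0.0f;
  cudnnConvolutionBackwardFilter(handle, &alpha, xDesc, d_x, dyDesc, d_dy,
                                 convDesc, algo, d_workspace, workspace_bytes,
                                 &beta, dwDesc, d_dw);
}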

Thanks
Bo

chsigg commented on April 28, 2024

All tensors are allocated individually in nvidia_libs_test (you should see this in a CUDA API trace). We rely on the default alignment being sufficient.

As Tim pointed out though, the failure isn't consistent (7 out of 100?), which would suggest this is a timing issue. Have you tried repeating your direct testing a number of times?
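
One way to do that, as a hypothetical harness reusing the sketch from the earlier comment: repeat the direct call and scan the result for NaNs after each run.

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Returns true if any element of the device buffer is a NaN.
bool OutputHasNaN(const __half *d_dw, size_t dw_elements) {
  std::vector<__half> h(dw_elements);
  cudaMemcpy(h.data(), d_dw, dw_elements * sizeof(__half),
             cudaMemcpyDeviceToHost);
  for (size_t i = 0; i < dw_elements; ++i) {
    if (std::isnan(__half2float(h[i]))) return true;
  }
  return false;
}

// Usage, matching the intermittent failure rate seen above:
//   for (int run = 0; run < 100; ++run) {
//     RunBackwardFilterWithNaNPrefill(/* ... */);
//     if (OutputHasNaN(d_dw, dw_elements)) printf("run %d produced NaN\n", run);
//   }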

gawain102000 commented on April 28, 2024

Hi @chsigg

Thanks for your response and your team's help on this issue! Yes, we are repeating the direct testing many times locally.

Thanks

mruberry commented on April 28, 2024

@nluehr is investigating this issue now.

gawain102000 commented on April 28, 2024

Thanks for everybody's help! This looks like a cuDNN issue rather than a framework issue. Since a cuDNN engineer is investigating it internally, please allow me to close this here. If you have any questions, please let me know.

Thanks
