google / nvidia_libs_test

Tests and benchmarks for cudnn (and in the future, other nvidia libraries)

License: Apache License 2.0

C++ 91.20% Cuda 2.97% Starlark 5.83%

cudnn benchmark cuda gpu gpu-computing

nvidia_libs_test's Introduction

Tests and Benchmarks for cuDNN

The repository contains a set of convolution tests and benchmarks for NVIDIA's cuDNN library.

This is not an officially supported Google product.

Prerequisites

Install Bazel (see the Bazel installation instructions).

Install the CUDA Toolkit (CUDA 8 is the minimum supported version).

Install the cuDNN SDK (cuDNN 6 is the minimum supported version).

Common parameters

Bazel parameters:

-c opt: Build with optimizations. Recommended for benchmarks.
--action_env=CUDA_PATH=<path>: Path to the CUDA SDK directory. Default is /usr/local/cuda.
--action_env=CUDNN_PATH=<path>: Path to the cuDNN SDK directory. Default is CUDA_PATH.
--action_env=CC=<compiler>: Name of (or path to) the compiler. Examples: clang, gcc-6.

Executable parameters:

--cuda_device=<device_id>: CUDA device to use. Default is 0.
--device_memory_limit_mb=<size>: Upper limit of device memory (in megabytes) to use for cuDNN workspace after tensors have been allocated. Negative values specify an offset from the memory available at startup. Default is 4096.
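
The two readings of --device_memory_limit_mb (absolute limit versus negative offset) can be sketched as below. This is only an illustration of the description above, not the repository's code: the helper name ResolveMemoryLimitBytes and the clamping to zero are assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not from the repository): converts the flag value,
// given in megabytes, into a byte budget for the cuDNN workspace.
std::int64_t ResolveMemoryLimitBytes(std::int64_t limit_mb,
                                     std::int64_t free_bytes_at_startup) {
  constexpr std::int64_t kBytesPerMb = std::int64_t{1} << 20;
  const std::int64_t limit_bytes = limit_mb * kBytesPerMb;
  if (limit_mb >= 0) {
    return limit_bytes;  // Non-negative value: an absolute upper limit.
  }
  // Negative value: offset from the memory available at startup.
  // Clamping at zero is an assumption made for this sketch.
  return std::max<std::int64_t>(0, free_bytes_at_startup + limit_bytes);
}
```

Under this reading, the default of 4096 yields a 4 GiB budget, while -1024 on a device with 16 GiB free at startup yields 15 GiB.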

Test instructions

bazel run [bazel parameters] //:cudnn_test -- [test parameters]

Bazel can run tests in a sandbox (which allows reporting crashes as failures). To run in a sandbox, replace 'bazel run ...' with 'bazel test ...' and prefix each test parameter with '--test_arg=':

bazel test [bazel parameters] //:cudnn_test --test_arg=[test parameter 1] ...

Test parameters:

--gtest_filter=<pattern>: Only run tests that match the given pattern. Example pattern: '*FWD*'.
--proto_path=<path>: Path to textproto file that contains additional convolutions to run. Default is 'cudnn_tests.textproto'.
--gtest_random_seed=<value>: Seed for random generator. Changing the value produces tests with a different mix of tensor shapes, filter sizes, etc. Default is 0.
--gtest_also_run_disabled_tests: Include disabled tests (i.e. tests with names that start with DISABLED_).
--help for more options.
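
To illustrate why a fixed --gtest_random_seed gives a reproducible mix of tensor shapes, here is a minimal sketch of seeded shape generation. It is not the repository's generator: the function RandomShape, its parameters, and the choice of std::mt19937 are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical generator (not from the repository): draws `rank` dimensions
// in [1, max_dim] from an engine seeded with `seed`.
std::vector<int> RandomShape(std::uint32_t seed, std::size_t rank, int max_dim) {
  std::mt19937 engine(seed);  // Same seed, same sequence of draws.
  std::uniform_int_distribution<int> dimension(1, max_dim);
  std::vector<int> shape(rank);
  for (int& d : shape) {
    d = dimension(engine);
  }
  return shape;
}
```

Calling RandomShape twice with the same seed returns the same shape, which is what makes a failing randomized test repeatable when the seed is reported alongside it.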

Benchmark instructions

bazel run [bazel parameters] //:cudnn_benchmark -- [benchmark parameters]

Benchmark parameters:

--benchmark_filter=<regex>: Only run benchmarks that match the given regex pattern. Example: 'BWD_DATA'.
--proto_path=<path>: Path to textproto file that contains additional convolutions to run. Default is 'cudnn_benchmarks.textproto'.
--timing=<method>: How to measure execution time. One of 'kernel-duration' (default), 'kernel-cycles', or 'host-duration'.
--help and --helpfull for more options.
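
As a rough idea of what a 'host-duration' style measurement involves, the sketch below times a callable with a monotonic host clock. It is not the benchmark's implementation; MeasureHostDurationMs is a hypothetical name, and the 'kernel-duration' and 'kernel-cycles' methods rely on device-side measurements that are not shown here.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not from the repository): returns the wall-clock time,
// in milliseconds, that `work` takes as seen from the host.
double MeasureHostDurationMs(const std::function<void()>& work) {
  const auto start = std::chrono::steady_clock::now();
  // For GPU work, the callable must also wait for the device to finish
  // (for example via cudaDeviceSynchronize), because kernel launches are
  // asynchronous with respect to the host.
  work();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Host-side timing includes launch and synchronization overhead, which is one plausible reason a benchmark would offer kernel-level timing as a separate option.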

nvidia_libs_test's Issues

An issue was reported when running cuda-memcheck --tool racecheck

When running the command "cuda-memcheck --tool racecheck --print-level error --flush-to-disk no --error-exitcode 1 /usr/bin/bazel run //:cudnn_test --action_env=CUDNN_PATH=/home/swqa/.vulcan/install/cuda --action_env=CUDA_PATH=/home/swqa/.vulcan/install/cuda -- --gtest_filter=CONVOLUTION_FWD_NCHW_TENSOR_OP_52x7x112x4_873x7x3x3_VALID_GetAlgo_v7" on a TITAN V, the following issue was reported:
"
[ RUN ] FromFile/ConvolutionTest.CompareResults/CONVOLUTION_FWD_NCHW_TENSOR_OP_52x7x112x4_873x7x3x3_VALID_GetAlgo_v7
F1023 04:04:30.495419 17575 cuda_util.cc:92] Check failed: OkStatus() == GetStatus(cudaFree(ptr_)) (ok vs. CUDA Runtime API error 'an illegal memory access was encountered')
*** Check failure stack trace: ***
@ 0x186dde0 google::LogMessage::Fail()
@ 0x186dd24 google::LogMessage::SendToLog()
@ 0x186d675 google::LogMessage::Flush()
@ 0x1870aee google::LogMessageFatal::~LogMessageFatal()
@ 0x46c42b nvidia_libs_test::DeviceMemory::~DeviceMemory()
@ 0x40e9d9 _ZN16nvidia_libs_test12_GLOBAL__N_114RunConvolutionEddRKSt10unique_ptrI12cudnnContextNS_6detail18CudnnHandleDeleterEERKNS_11ConvolutionERKN4absl7variantIJ25cudnnConvolutionFwdAlgo_t29cudnnConvolutionBwdDataAlgo_t31cudnnConvolutionBwdFilterAlgo_tEEE
@ 0x410b42 nvidia_libs_test::(anonymous namespace)::ConvolutionTest_CompareResults_Test::TestBody()
@ 0x18bf017 testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@ 0x18ba07f testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x189f35e testing::Test::Run()
@ 0x189fc50 testing::TestInfo::Run()
@ 0x18a02a5 testing::TestCase::Run()
@ 0x18a72a1 testing::internal::UnitTestImpl::RunAllTests()
@ 0x18bfd3f testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@ 0x18bacb5 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x18a5f0f testing::UnitTest::Run()
@ 0x451181 RUN_ALL_TESTS()
@ 0x4509e8 main
@ 0x7fb41c5ff830 __libc_start_main
@ 0x40d639 _start
@ (nil) (unknown)
========= CUDA-MEMCHECK
========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings)
"

Thanks
Bo

build failed

bazel run //:cudnn_benchmark
ERROR: /data//.cache/bazel/bazel/dad4edfde16590c5e6cb1df1644cc6bb/external/rules_proto/proto/private/native.bzl:22:19: name 'ProtoInfo' is not defined
ERROR: /data//tmp/nvidia_libs_test/BUILD:98:1: error loading package '@com_google_protobuf//': Extension 'proto/private/native.bzl' has errors and referenced by '//:cudnn_benchmark'
ERROR: Analysis of target '//:cudnn_benchmark' failed; build aborted: error loading package '@com_google_protobuf//': Extension 'proto/private/native.bzl' has errors
INFO: Elapsed time: 5.034s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (6 packages loaded)
currently loading: @com_google_protobuf// ... (3 packages)
ERROR: Build failed. Not running target

kernel_timer.cc build failed with "error: 'uint64' was not declared in this scope domain_cycles.end(), uint64{0})"

Build cudnn_benchmark
bazel run //:cudnn_benchmark -c opt --action_env=CUDNN_PATH=/data/tmp/cuda --action_env=CUDA_PATH=/data/tmp/cuda
kernel_timer.cc build failed with the following error:
###############
kernel_timer.cc:194:68: error: 'uint64' was not declared in this scope
domain_cycles.end(), uint64{0});
^
################
This error was caused by the latest commit, a378e0f.
It should be uint64_t{0}.
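
A minimal sketch of the suggested fix, assuming the failing line is the tail of a std::accumulate-style reduction over cycle counts (the surrounding code is not shown in the report, so the function name SumCycles and the container type are hypothetical):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical reduction (not the repository's code): sums per-domain cycle
// counts. The initial value uses the standard std::uint64_t from <cstdint>
// rather than the non-standard `uint64` alias, which is not declared here.
std::uint64_t SumCycles(const std::vector<std::uint64_t>& domain_cycles) {
  return std::accumulate(domain_cycles.begin(), domain_cycles.end(),
                         std::uint64_t{0});
}
```

The type of the initial value matters for std::accumulate: it determines the accumulator type, so a plain 0 would accumulate in int and could overflow for large cycle counts.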

Windows support

I would find this incredibly helpful to verify the correctness of my cuDNN installation, but it appears to me that the Bazel code supports only Linux. (Please correct me if I am wrong.)

an intermittent error, failed on V100

bazel run //:cudnn_test --action_env=CUDNN_PATH=cuda9.0_cudnn_v7.4.1/cuda --action_env=CUDA_PATH=cuda -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
DEBUG: /home/lab/.vulcan/install/cuda/_tests/google_cudnn_test/codes/nvidia_libs_test-master/cuda_configure.bzl:12:5: Using CUDA from /home/lab/.vulcan/install/cuda
DEBUG: /home/lab/.vulcan/install/cuda/_tests/google_cudnn_test/codes/nvidia_libs_test-master/cuda_configure.bzl:13:5: Using cuDNN from /home/lab/bow/project/5_software/cuda9.0_cudnn_v7.4.1/cuda
INFO: Analysed target //:cudnn_test (0 packages loaded).
INFO: Found 1 target...
Target //:cudnn_test up-to-date:
bazel-bin/cudnn_test
INFO: Elapsed time: 0.112s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: external/bazel_tools/tools/test/test-setup.sh ./cudnn_test '--gtest_filter=*Conv3d/ConvolutionTest.CompareRINFO: Build completed successfully, 1 total action
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //:cudnn_test
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1212 02:57:55.774238 18306 cudnn_util.cc:68] Running cuDNN v7.4.1 for CUDA 9.0.0 on Tesla V100-DGXS-16GB
Note: Google Test filter = Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Conv3d/ConvolutionTest
[ RUN ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
cudnn_conv_test.cc:462: Failure
Value of: IsOk(TensorDataEqual(ref_result_data, *result_data, *result_desc, tolerance))
Actual: false (6 elements differ more than 10. Largest differences:
[2788]: 0.22229 vs nan, error = nan
[5904]: 0 vs nan, error = nan
[1744]: 0 vs nan, error = nan
[2784]: 0 vs nan, error = nan
[1748]: 0.221191 vs nan, error = nan
[5908]: 0.220581 vs nan, error = nan)
Expected: true
format: TENSOR_NCHW
data_type: DATA_HALF
compute_mode: DATA_HALF
math_type: DEFAULT_MATH
algo: CONVOLUTION_BWD_FILTER_ALGO_1
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
reference {
input {
dimension: 82
dimension: 4
dimension: 79
dimension: 9
dimension: 2
data_type: DATA_DOUBLE
format: TENSOR_NCHW
}
filter {
dimension: 12
dimension: 4
dimension: 2
dimension: 13
dimension: 5
data_type: DATA_DOUBLE
format: TENSOR_NCHW
}
convolution {
pad: 1
pad: 6
pad: 2
compute_mode: DATA_DOUBLE
}
one_minus_alpha: 0.99996569585949024
bwd_filter_algo: CONVOLUTION_BWD_FILTER_ALGO_0
label: "CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
}
test {
input {
data_type: DATA_HALF
format: TENSOR_NCHW
}
filter {
data_type: DATA_HALF
format: TENSOR_NCHW
}
convolution {
compute_mode: DATA_HALF
math_type: DEFAULT_MATH
}
all_algos: CONVOLUTION_BWD_FILTER
}
(21 ms)
[----------] 1 test from Conv3d/ConvolutionTest (21 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (21 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
reference {
input {
dimension: 82
dimension: 4
dimension: 79
dimension: 9
dimension: 2
data_type: DATA_DOUBLE
format: TENSOR_NCHW
}
filter {
dimension: 12
dimension: 4
dimension: 2
dimension: 13
dimension: 5
data_type: DATA_DOUBLE
format: TENSOR_NCHW
}
convolution {
pad: 1
pad: 6
pad: 2
compute_mode: DATA_DOUBLE
}
one_minus_alpha: 0.99996569585949024
bwd_filter_algo: CONVOLUTION_BWD_FILTER_ALGO_0
label: "CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
}
test {
input {
data_type: DATA_HALF
format: TENSOR_NCHW
}
filter {
data_type: DATA_HALF
format: TENSOR_NCHW
}
convolution {
compute_mode: DATA_HALF
math_type: DEFAULT_MATH
}
all_algos: CONVOLUTION_BWD_FILTER
}

1 FAILED TEST
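
The log above reports "error = nan" for every differing element. The sketch below shows one way an element-wise tolerance check can count NaN results as mismatches. It is not the repository's TensorDataEqual; the function name CountMismatches and the absolute-error metric are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical comparison (not the repository's TensorDataEqual): counts the
// elements whose absolute error exceeds `tolerance`, treating NaN as a mismatch.
std::size_t CountMismatches(const std::vector<double>& reference,
                            const std::vector<double>& actual,
                            double tolerance) {
  std::size_t mismatches = 0;
  const std::size_t count = std::min(reference.size(), actual.size());
  for (std::size_t i = 0; i < count; ++i) {
    const double error = std::abs(reference[i] - actual[i]);
    // Any comparison with NaN is false, so `error > tolerance` would miss
    // NaN results. Negating `error <= tolerance` catches them.
    if (!(error <= tolerance)) {
      ++mismatches;
    }
  }
  return mismatches;
}
```

With this formulation a pair such as 0.22229 vs nan is counted as a failure regardless of how large the tolerance is.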

cuda/extras/CUPTI/include/cupti_result.h relocated in CUDA11.1

This file was relocated under /usr/local/cuda in CUDA 11.1 (installed from a local deb).

cuda_util.h:26:10: fatal error: cuda/extras/CUPTI/include/cupti_result.h: No such file or directory
 #include "cuda/extras/CUPTI/include/cupti_result.h"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Target //:cudnn_test failed to build
Use --verbose_failures to see the command lines of failed build steps.
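
One possible way to tolerate both header layouts is a preprocessor fallback, sketched below. It assumes the relocated header is reachable as <cupti_result.h> on the compiler's include path, which the report does not confirm. The macro EXAMPLE_HAVE_CUPTI_RESULT and the constant kHaveCuptiResult are hypothetical names.

```cpp
// Pick whichever location of the CUPTI header is visible to the compiler.
#if defined(__has_include)
#if __has_include("cuda/extras/CUPTI/include/cupti_result.h")
// Layout expected by the failing #include in cuda_util.h.
#include "cuda/extras/CUPTI/include/cupti_result.h"
#define EXAMPLE_HAVE_CUPTI_RESULT 1
#elif __has_include(<cupti_result.h>)
// Assumed alternative: the header sits directly on the include path.
#include <cupti_result.h>
#define EXAMPLE_HAVE_CUPTI_RESULT 1
#endif
#endif

#ifndef EXAMPLE_HAVE_CUPTI_RESULT
#define EXAMPLE_HAVE_CUPTI_RESULT 0  // Neither location was found.
#endif

// Lets the rest of the code branch on header availability.
constexpr bool kHaveCuptiResult = EXAMPLE_HAVE_CUPTI_RESULT != 0;
```

In a Bazel build the header must also be exposed by the rule that sets up the CUDA repository, so a source-level fallback like this may not be sufficient on its own.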