
nvbench's Introduction

Overview

This project is a work-in-progress. Everything is subject to change.

NVBench is a C++17 library designed to simplify CUDA kernel benchmarking. It features:

  • Parameter sweeps: a powerful and flexible "axis" system explores a kernel's configuration space. Parameters may be dynamic numbers/strings or static types.
  • Runtime customization: A rich command-line interface allows redefinition of parameter axes, CUDA device selection, locking GPU clocks (Volta+), changing output formats, and more.
  • Throughput calculations: Compute and report:
    • Item throughput (elements/second)
    • Global memory bandwidth usage (bytes/second and per-device %-of-peak-bw)
  • Multiple output formats: Currently supports markdown (default) and CSV output.
  • Manual timer mode: (optional) Explicitly start/stop timing in a benchmark implementation.
  • Multiple measurement types:
    • Cold Measurements:
      • Each sample runs the benchmark once with a clean device L2 cache.
      • GPU and CPU times are reported.
    • Batch Measurements:
      • Executes the benchmark multiple times back-to-back and records total time.
      • Reports the average execution time (total time / number of executions).
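
The manual timer mode mentioned above can be sketched as follows. This is a hedged illustration: `my_kernel` and `num_blocks` are placeholders you would supply; the `nvbench::exec_tag::timer` tag asks NVBench to pass a timer into the functor.

```cuda
// Manual timer mode: explicitly bracket the timed region so untimed
// per-sample setup/reset work is excluded from the measurement.
void timer_benchmark(nvbench::state& state)
{
  state.exec(nvbench::exec_tag::timer,
             [](nvbench::launch& launch, auto& timer) {
               // Untimed per-sample setup may run here.
               timer.start();
               my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
               timer.stop();
             });
}
NVBENCH_BENCH(timer_benchmark);
```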

Supported Compilers and Tools

  • CMake >= 3.23.1
  • CUDA Toolkit + nvcc: 11.1 -> 12.4
  • g++: 7 -> 12
  • clang++: 9 -> 18
  • cl.exe: 2019 -> 2022 (19.29 -> 19.39)
  • Headers are tested with C++17 -> C++20.

Getting Started

Minimal Benchmark

A basic kernel benchmark can be created with just a few lines of CUDA C++:

void my_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) {
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
  });
}
NVBENCH_BENCH(my_benchmark);

See Benchmarks for information on customizing benchmarks and implementing parameter sweeps.
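
For illustration, a minimal parameter sweep might look like the following hedged sketch: `my_kernel` is a placeholder, and the axis name and range are arbitrary.

```cuda
// An int64 power-of-two axis sweeps NumElements over 2^16 .. 2^24;
// the benchmark reads the current axis value from its state.
void sweep_benchmark(nvbench::state& state)
{
  const auto num_elements = state.get_int64("NumElements");
  state.exec([num_elements](nvbench::launch& launch) {
    my_kernel<<<256, 256, 0, launch.get_stream()>>>(/* num_elements, ... */);
  });
}
NVBENCH_BENCH(sweep_benchmark)
  .add_int64_power_of_two_axis("NumElements", nvbench::range(16, 24, 2));
```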

Command Line Interface

Each benchmark executable produced by NVBench provides a rich set of command-line options for configuring benchmark execution at runtime. See the CLI overview and CLI axis specification for more information.

Examples

This repository provides a number of examples that demonstrate various NVBench features and use cases.

Building Examples

To build the examples:

mkdir -p build
cd build
cmake -DNVBench_ENABLE_EXAMPLES=ON -DCMAKE_CUDA_ARCHITECTURES=70 .. && make

Be sure to set CMAKE_CUDA_ARCHITECTURES based on the GPU you are running on.

Examples are built by default into build/bin and are prefixed with nvbench.example.

Example output from `nvbench.example.throughput`
# Devices

## [0] `Quadro GV100`
* SM Version: 700 (PTX Version: 700)
* Number of SMs: 80
* SM Default Clock Rate: 1627 MHz
* Global Memory: 32163 MiB Free / 32508 MiB Total
* Global Memory Bus Peak: 870 GiB/sec (4096-bit DDR @850MHz)
* Max Shared Memory: 96 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run:  throughput_bench [Device=0]
Warn: Current measurement timed out (15.00s) while over noise threshold (1.26% > 0.50%)
Pass: Cold: 0.262392ms GPU, 0.267860ms CPU, 7.19s total GPU, 27393x
Pass: Batch: 0.261963ms GPU, 7.18s total GPU, 27394x

# Benchmark Results

## throughput_bench

### [0] Quadro GV100

| NumElements |  DataSize  | Samples |  CPU Time  | Noise |  GPU Time  | Noise | Elem/s  | GlobalMem BW  | BWPeak | Batch GPU  | Batch  |
|-------------|------------|---------|------------|-------|------------|-------|---------|---------------|--------|------------|--------|
|    16777216 | 64.000 MiB |  27393x | 267.860 us | 1.25% | 262.392 us | 1.26% | 63.940G | 476.387 GiB/s | 58.77% | 261.963 us | 27394x |
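
The Elem/s, GlobalMem BW, and BWPeak columns above are derived from throughput hints the benchmark registers on its state. A hedged sketch (`copy_kernel` and the value type are illustrative):

```cuda
void throughput_bench(nvbench::state& state)
{
  const auto num_elements = state.get_int64("NumElements");

  // These hints populate the Elem/s, GlobalMem BW, and BWPeak columns.
  state.add_element_count(num_elements);
  state.add_global_memory_reads<nvbench::int32_t>(num_elements);
  state.add_global_memory_writes<nvbench::int32_t>(num_elements);

  state.exec([](nvbench::launch& launch) {
    copy_kernel<<<256, 256, 0, launch.get_stream()>>>(/* ... */);
  });
}
```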

Demo Project

To get started using NVBench with your own kernels, consider trying out the NVBench Demo Project.

nvbench_demo provides a simple CMake project that uses NVBench to build an example benchmark. It's a great way to experiment with the library without a lot of investment.

Contributing

Contributions are welcome!

For current issues, see the issue board. Issues labeled for first-time contributors are a good place to start.

Tests

To build nvbench tests:

mkdir -p build
cd build
cmake -DNVBench_ENABLE_TESTING=ON .. && make

Tests are built by default into build/bin and prefixed with nvbench.test.

To run all tests:

make test

or

ctest

License

NVBench is released under the Apache 2.0 License with LLVM exceptions. See LICENSE.

Scope and Related Projects

NVBench will measure the CPU and CUDA GPU execution time of a single host-side critical region per benchmark. It is intended for regression testing and parameter tuning of individual kernels. For in-depth analysis of end-to-end performance of multiple applications, the NVIDIA Nsight tools are more appropriate.

NVBench is focused on evaluating the performance of CUDA kernels and is not optimized for CPU microbenchmarks. This may change in the future, but for now, consider using Google Benchmark for high resolution CPU benchmarks.

nvbench's People

Contributors

aaronmondal, alliepiper, andmax, brycelelbach, cliffburdick, gevtushenko, hahnjo, jrhemstad, miscco, pointkernel, psvvsp, robertmaynard, s-o-t, shwina, vyasr


nvbench's Issues

Build fails with fmt 8.1.1

Building with gcc 11.1 and fmt 8.1.1 fails with the following message:

In file included from /usr/include/fmt/format.h:48,
                 from /tmp/nvbench/nvbench/detail/throw.cuh:21,
                 from /tmp/nvbench/nvbench/axes_metadata.cxx:21:
/usr/include/fmt/core.h: In instantiation of ‘constexpr fmt::v8::detail::value<Context> fmt::v8::detail::make_arg(T&&) [with bool IS_PACKED = true; Context = fmt::v8::basic_format_context<fmt::v8::appender, char>; fmt::v8::detail::type <anonymous> = fmt::v8::detail::type::custom_type; T = nvbench::axis_type&; typename std::enable_if<IS_PACKED, int>::type <anonymous> = 0]’:
/usr/include/fmt/core.h:1855:77:   required from ‘constexpr fmt::v8::format_arg_store<Context, Args>::format_arg_store(T&& ...) [with T = {std::basic_string_view<char, std::char_traits<char> >&, nvbench::axis_type&, nvbench::axis_type&}; Context = fmt::v8::basic_format_context<fmt::v8::appender, char>; Args = {std::basic_string_view<char, std::char_traits<char> >, nvbench::axis_type, nvbench::axis_type}]’
/usr/include/fmt/core.h:1872:38:   required from ‘constexpr fmt::v8::format_arg_store<Context, typename std::remove_cv<typename std::remove_reference<Args>::type>::type ...> fmt::v8::make_format_args(Args&& ...) [with Context = fmt::v8::basic_format_context<fmt::v8::appender, char>; Args = {std::basic_string_view<char, std::char_traits<char> >&, nvbench::axis_type&, nvbench::axis_type&}]’
/usr/include/fmt/core.h:3119:44:   required from ‘std::string fmt::v8::format(fmt::v8::format_string<T ...>, T&& ...) [with T = {std::basic_string_view<char, std::char_traits<char> >&, nvbench::axis_type&, nvbench::axis_type}; std::string = std::__cxx11::basic_string<char>; fmt::v8::format_string<T ...> = fmt::v8::basic_format_string<char, std::basic_string_view<char, std::char_traits<char> >&, nvbench::axis_type&, nvbench::axis_type>]’
/tmp/nvbench/nvbench/axes_metadata.cxx:225:5:   required from here
/usr/include/fmt/core.h:1728:7: error: static assertion failed: Cannot format an argument. To make type T formattable provide a formatter<T> specialization: https://fmt.dev/latest/api.html#udt
 1728 |       formattable,
      |       ^~~~~~~~~~~
/usr/include/fmt/core.h:1728:7: note: ‘formattable’ evaluates to false

Add option to add plain text description to a benchmark declaration

One of my favorite things about Catch2 is that you can provide a plain text, human readable description with each test.

TEST_CASE( "Factorials are computed", "[factorial]" ) {
    REQUIRE( Factorial(1) == 1 );
    REQUIRE( Factorial(2) == 2 );
    REQUIRE( Factorial(3) == 6 );
    REQUIRE( Factorial(10) == 3628800 );
}

It would be nice to extend this idea to benchmark declaration.

Taking from the example project, this could look something like:

NVBENCH_BENCH("Benchmarks sleep across a range of values", sleep_benchmark)
    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5))
    .set_timeout(1); // Limit to one second per measurement.

or it could be an additional member to the builder:

NVBENCH_BENCH(sleep_benchmark)
    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5))
    .set_timeout(1) // Limit to one second per measurement.
    .description("Benchmarks sleep across a range of values");

This description could then be included when doing things like --list.

`test_export` tests fail on MSVC

The test_export tests added in #42 do not work with MSVC when run from a VS2019 dev prompt. The test executables fail to run:

[screenshot of the failing test executables omitted]

This could be fixed by manually specifying the PATH to the nvbench library in my terminal, but it seems like there should be a way for this to work automatically.

Add a DoNotOptimize that works in device code

It would be nice to have a utility similar to Google Benchmark's DoNotOptimize that attempts to prevent compiler optimizations from discarding results.

GBench's version uses inline assembly tricks that won't work with PTX because even if the generated PTX avoids the optimizer, OCG and SASS generation will almost certainly undo any attempt made to avoid optimization. We'll have to be a little trickier with the device equivalent.

Make nvbench targets use SYSTEM includes

When using nvbench::main in a target, nvbench adds an include path for nvbench using -I instead of -isystem. This causes tools like clang-tidy to pick up nvbench as outside of the system scope, and subsequently all the dependencies of nvbench are picked up too (libfmt, for example).

Add hidden benchmarks

As @jrhemstad mentioned here, one could define hidden benchmarks with a "hidden" tag, using the tag feature proposed in #81, so that such benchmarks do not run by default.

Add optional types to type axes

It would be nice to be able to specify

NVBENCH_BENCH_TYPES(my_benchmark, NVBENCH_TYPE_AXES_OPT({my_types, my_optional_types}))
  .set_type_axis_names({"ValueType"});

such that my_bench_executable --benchmark my_benchmark executes only (the Cartesian product of) the non-optional my_types, while my_bench_executable --benchmark my_benchmark -a ValueType=SomeOptionalType still works; i.e., the full Cartesian product of the concatenation of my_types and my_optional_types with the other type axes (not shown above) is instantiated.

Enable CUPTI for CTK <= 11.2.

There are some compilation errors with CUPTI from CTK 11.2:

FAILED: nvbench/CMakeFiles/nvbench.dir/cupti_profiler.cxx.o
/home/av/code/utils/ccache-install/bin/g++ -DFMT_LOCALE -Dnvbench_EXPORTS -I/home/av/code/src/nvbench -I/home/av/code/build/nvbench-ctk_11.2 -I/home/av/code/build/nvbench-ctk_11.2/_deps/fmt-src/include -I/home/av/code/build/nvbench-ctk_11.2/nvbench/detail -isystem /usr/local/cuda_11.2.0_460.22/include -isystem /usr/local/cuda_11.2.0_460.22/extras/CUPTI/include -isystem /home/av/code/build/nvbench-ctk_11.2/_deps/nlohmann_json-src/include -O3 -DNDEBUG -fPIC -Wall -Wextra -Wconversion -Woverloaded-virtual -Wcast-qual -Wpointer-arith -Wunused-parameter -Wvla -Werror -std=gnu++1z -MD -MT nvbench/CMakeFiles/nvbench.dir/cupti_profiler.cxx.o -MF nvbench/CMakeFiles/nvbench.dir/cupti_profiler.cxx.o.d -o nvbench/CMakeFiles/nvbench.dir/cupti_profiler.cxx.o -c /home/av/code/src/nvbench/nvbench/cupti_profiler.cxx
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:154:3: error: ‘NVPW_MetricsEvaluator’ does not name a type; did you mean ‘NVPA_MetricUserData’?
   NVPW_MetricsEvaluator *evaluator_ptr;
   ^~~~~~~~~~~~~~~~~~~~~
   NVPA_MetricUserData
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:157:38: error: expected ‘)’ before ‘*’ token
   eval_request(NVPW_MetricsEvaluator *evaluator_ptr,
                                      ^
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:199:3: error: ‘NVPW_MetricEvalRequest’ does not name a type; did you mean ‘NVPA_RawMetricRequest’?
   NVPW_MetricEvalRequest request;
   ^~~~~~~~~~~~~~~~~~~~~~
   NVPA_RawMetricRequest
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx: In member function ‘std::vector<const char*> nvbench::detail::{anonymous}::eval_request::get_raw_dependencies()’:
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:179:5: error: ‘NVPW_MetricsEvaluator_GetMetricRawDependencies_Params’ was not declared in this scope
     NVPW_MetricsEvaluator_GetMetricRawDependencies_Params params{};
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:179:5: note: suggested alternative: ‘NVPW_MetricsContext_GetMetricNames_Begin_Params’
     NVPW_MetricsEvaluator_GetMetricRawDependencies_Params params{};
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     NVPW_MetricsContext_GetMetricNames_Begin_Params
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:181:5: error: ‘params’ was not declared in this scope
     params.structSize =
     ^~~~~~
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:182:7: error: ‘NVPW_MetricsEvaluator_GetMetricRawDependencies_Params_STRUCT_SIZE’ was not declared in this scope
       NVPW_MetricsEvaluator_GetMetricRawDependencies_Params_STRUCT_SIZE;
       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:182:7: note: suggested alternative: ‘NVPW_MetricsContext_GetMetricNames_Begin_Params_STRUCT_SIZE’
       NVPW_MetricsEvaluator_GetMetricRawDependencies_Params_STRUCT_SIZE;
       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       NVPW_MetricsContext_GetMetricNames_Begin_Params_STRUCT_SIZE
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:183:42: error: ‘evaluator_ptr’ was not declared in this scope
     params.pMetricsEvaluator           = evaluator_ptr;
                                          ^~~~~~~~~~~~~
/home/av/code/src/nvbench/nvbench/cupti_profiler.cxx:184:43: error: ‘request’ was not declared in this scope
     params.pMetricEvalRequests         = &request;
                                           ^~~~~~~

In a moment I'll push a PR that disables CUPTI support by default for CTK <= 11.2.

[FEA] Built-in bandwidth measurement via CUPTI

Does it make sense to introduce profiling tools like CUPTI into nvbench?

In that way, nvbench can provide precise bandwidth measurement by using CUPTI metrics like gld_throughput and gst_throughput.

Instead of manually specifying data type and size:

state.add_global_memory_reads<InputType>(size);
state.add_global_memory_writes<OutputType>(size);

Users could then simply write NVBENCH_BENCH(bm).add_global_memory_bandwidth(); to add such a measurement to their benchmarks.

Support benchmarking kernels that cannot take an explicit stream

From the example, nvbench expects all kernels to be executed on the stream provided by launch.get_stream().

void my_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) { 
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
  });
}
NVBENCH_BENCH(my_benchmark);

This can be problematic when attempting to benchmark functions that contain kernel calls, but do not expose stream parameters (for one reason or another) on which those kernels should run. It would be nice to still be able to benchmark such functions.
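
Until first-class support exists, one hedged workaround is to tell NVBench that the functor synchronizes internally via `nvbench::exec_tag::sync`. Here `opaque_api_call` is a hypothetical function that launches kernels on a stream it does not expose:

```cuda
void opaque_benchmark(nvbench::state& state)
{
  // exec_tag::sync tells NVBench the functor blocks until the work is
  // done, so timing is taken around the whole (synchronizing) call.
  state.exec(nvbench::exec_tag::sync, [](nvbench::launch&) {
    opaque_api_call();        // launches on a stream we cannot see
    cudaDeviceSynchronize();  // wait for that work to complete
  });
}
NVBENCH_BENCH(opaque_benchmark);
```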

[FEA] Use NVML to manage clocks, etc

Adding NVML as an optional dependency would allow some cool features:

  • Lock clock frequency.
    • Per-device default frequency.
    • Per-device maximum frequency.
    • Explicit frequency.
  • Log various device stats per measurement
    • SM/Mem clock frequencies.
    • Device utilization
    • Power state/usage
  • Check throttle state after each measurement.
    • Log a warning with the throttle reason and details (e.g. for thermal throttle, show current temp and thresholds).

Option parser test failing

~/nvbench/build$ ./bin/nvbench.test.option_parser
nvbench/nvbench/option_parser.cu:672: Error parsing --axis ` PO2s [ pow2 ] = 7 `:
nvbench/nvbench/option_parser.cu:694: Invalid flag for int64 axis: ` �|`

git bisect points to 15434ce, which is fairly surprising, since that really shouldn't affect behavior.

Too many timeouts -- bad convergence checks for fast / noisy benchmarks

The current "keep measuring until rel stdev drops below threshold" approach is fragile. Some benchmarks, particularly those that run in <50us, tend to have unavoidably high relative standard deviation due to kernel launch overhead, etc.

Such tests have no choice but to run until the timeout is reached. Many of them converge to a stable result in less than a few seconds, but instead they keep running for 15 seconds by default.

I'm working on a new set of convergence checks that will address this. Early results show that the runtime for thrust+cub's benchmarks can be reduced from 30 hours (!!) per device to just a few hours just by tweaking the convergence logic.

Check for CUDA errors after running benchmark

We recently had an issue where a benchmark kernel caused an illegal memory access, and the error was asynchronously reported in an unrelated NVBench CUDA API call. Any errors emitted during the execution of a benchmark should be correctly reported as originating from the benchmark execution.

We should add NVBENCH_CUDA_CALL(cudaGetLastError()); or similar in the detail/measure_* runners after each kernel execution.

Don't Print Unused Devices

When running a benchmark, a bunch of information about each device is printed. However, if you specify the --device parameter, you usually don't care about the other devices on the system. This issue requests that the output for unused devices be removed.

Reasonable enum defaults don't work

Not specifying the optional NVBENCH_DECLARE_ENUM_TYPE_STRINGS causes compiler errors:

error: no instance of overloaded function "std::to_string" matches the argument list

This should be easy to reproduce by commenting out the NVBENCH_DECLARE_ENUM_TYPE_STRINGS part of the enums.cu example.

My setup is CUDA 11.4.3 and gcc 10.3.0.

Detect if GPU is being used for graphical purposes

Recent results show that noise can increase by up to 50% when an X server is running on the device. To warn users about a noisy environment, we could check whether the GPU is being used for graphics. This information is available via NVML:

#include <iostream>
#include <nvml.h>

void chck(nvmlReturn_t status) {
    if (status != NVML_SUCCESS) {
        // Generic message: chck wraps every NVML call, not just nvmlInit.
        throw std::runtime_error(std::string("NVML call failed: ") + nvmlErrorString(status));
    }
}

struct NVML_RAII {
    bool initialized = false;

    NVML_RAII() {
        chck(nvmlInit());
        initialized = true;
    }

    ~NVML_RAII() {
        if (initialized) {
            chck(nvmlShutdown());
        }
    }
};

int main() {
    NVML_RAII nvml_raii;
    unsigned int device_count;

    chck(nvmlDeviceGetCount(&device_count));

    for (unsigned int i = 0; i < device_count; i++) {
        nvmlDevice_t device;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        chck(nvmlDeviceGetHandleByIndex(i, &device));
        chck(nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE));

        std::cout << "\t" << name << std::endl;

        unsigned int graphics_processes = 0;
        nvmlDeviceGetGraphicsRunningProcesses(device, &graphics_processes, nullptr);

        if (graphics_processes) {
            std::cerr << "X-Server is running, please turn it off (" << graphics_processes << ")" << std::endl;
        }

        unsigned int compute_processes = 0;
        nvmlDeviceGetComputeRunningProcesses(device, &compute_processes, nullptr);

        if (compute_processes) {
            std::cerr << "Target device is in use (" << compute_processes << ")" << std::endl;
        }
    }

    return 0;
}

The device header could easily be overlooked, so we could also call attention to a noisy environment by adding a column to the nvbench output.

Enable specifying type axes name together with type list

Problem

Today, I can provide a meaningful name for a type axis using the set_type_axes_names function.

For example,

using ctcs_types = nvbench::type_list<nvbench::int8_t,
                                      nvbench::int16_t,
                                      nvbench::int32_t,
                                      nvbench::float32_t,
                                      nvbench::int64_t,
                                      nvbench::float64_t>;
NVBENCH_BENCH_TYPES(copy_type_conversion_sweep,   NVBENCH_TYPE_AXES(ctcs_types, ctcs_types))
  .set_type_axes_names({"In", "Out"});

However, setting the names out-of-band like this makes it easy to have mismatches in either the number of arguments or which type a name corresponds to. This is especially likely to occur when copy/pasting an example or existing benchmark and modifying it.

As it happens, a mismatch in the number of arguments currently causes a segfault (which led me to open this issue in the first place).

Proposed Solution

It would be nice to be able to provide a name for a type list/axis inline.

This could be done a few ways. One way would be to add something like a named_type_list:

struct my_named_type_list {
   // Assuming a "name" member; the original snippet omitted the identifier.
   static constexpr auto name = "My type list name";
   using types = nvbench::type_list<nvbench::int8_t, nvbench::int16_t>;
};

NVBENCH_BENCH_TYPES(copy_type_conversion_sweep,  NVBENCH_TYPE_AXES(my_named_type_list) )

This could be inconvenient if you want to use the same type list for multiple axes with different names (as in the example above), so another option would be specifying the name along with the list:

NVBENCH_BENCH_TYPES(copy_type_conversion_sweep,   NVBENCH_TYPE_AXES({"In", ctcs_types}, {"Out", ctcs_types}));

There are a bunch of different ways this could be accomplished, and I am open to suggestions. The main idea is to prevent mismatches between names and type axes by allowing the name and type list to be defined together instead of separately.

CUDA version requirement

This repo can't be compiled with CUDA 11.0 (or older).
In my environment (CUDA 11.0), a compile error (below) occurred while building json_printer.cu.o.

[  3%] Building CXX object _deps/fmt-build/CMakeFiles/fmt.dir/src/format.cc.o
[  6%] Building CXX object _deps/fmt-build/CMakeFiles/fmt.dir/src/os.cc.o
[ 10%] Linking CXX static library libfmt.a
[ 10%] Built target fmt
Scanning dependencies of target nvbench
[ 13%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/axes_metadata.cu.o
[ 16%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/axis_base.cu.o
[ 20%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/benchmark_base.cu.o
[ 23%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/benchmark_manager.cu.o
[ 26%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/blocking_kernel.cu.o
[ 30%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/csv_printer.cu.o
[ 33%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/cuda_call.cu.o
[ 36%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/device_info.cu.o
[ 40%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/device_manager.cu.o
[ 43%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/float64_axis.cu.o
[ 46%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/int64_axis.cu.o
[ 50%] Building CUDA object nvbench/CMakeFiles/nvbench.dir/json_printer.cu.o
Segmentation fault (core dumped)
make[2]: *** [nvbench/CMakeFiles/nvbench.dir/build.make:225: nvbench/CMakeFiles/nvbench.dir/json_printer.cu.o] Error 139
make[1]: *** [CMakeFiles/Makefile2:220: nvbench/CMakeFiles/nvbench.dir/all] Error 2
make: *** [Makefile:103: all] Error 2

I think the reason for this error is that the host gcc compiler used with CUDA 11.0 doesn't fully support C++17 (I am not sure). I was able to compile with CUDA 11.2.

The bad thing about this error is that it gives the user no information; in fact, it took me some time to identify the cause.
Could you either make it clear in the documentation that CUDA 11.2 or higher is required, or add a CMake check on the CUDA version? (I think the latter would be more user-friendly.)

Build errors while using library with .cpp

I got plenty of -fpermissive and libstdc++-related errors while playing with the library examples and accidentally building the simplest benchmark as a .cpp file.
My minimal example, simplest.cpp, was:

#include <nvbench/nvbench.cuh>

void f() {}

(I added it to the library's examples and CMakeLists.txt; it also needs CUDA::cudart from find_package(CUDAToolkit).)

I got these compile errors:

[ 42%] Building CXX object examples/CMakeFiles/nvbench.example.simplest.dir/simplest.cpp.o
In file included from /root/nvbench/nvbench/benchmark_base.cuh:21,
                 from /root/nvbench/nvbench/benchmark.cuh:21,
                 from /root/nvbench/nvbench/nvbench.cuh:21,
                 from /root/nvbench/examples/simplest.cpp:1:
/root/nvbench/nvbench/axes_metadata.cuh: In lambda function:
/root/nvbench/nvbench/axes_metadata.cuh:119:33: error: expected primary-expression before ‘>’ token
  119 |       axis->set_inputs<type_list>();
      |                                 ^
/root/nvbench/nvbench/axes_metadata.cuh:119:35: error: expected primary-expression before ‘)’ token
  119 |       axis->set_inputs<type_list>();
      |                                   ^
In file included from /root/nvbench/nvbench/device_info.cuh:21,
                 from /root/nvbench/nvbench/benchmark_base.cuh:22,
                 from /root/nvbench/nvbench/benchmark.cuh:21,
                 from /root/nvbench/nvbench/nvbench.cuh:21,
                 from /root/nvbench/examples/simplest.cpp:1:
/root/nvbench/nvbench/device_info.cuh: In function ‘int nvbench::detail::get_ptx_version(int)’:
/root/nvbench/nvbench/device_info.cuh:218:51: error: invalid conversion from ‘void (*)()’ to ‘const void*’ [-fpermissive]
  218 |     cudaFuncGetAttributes(&attr, nvbench::detail::noop_kernel_ptr));
      |                                  ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
      |                                                   |
      |                                                   void (*)()
/root/nvbench/nvbench/cuda_call.cuh:29:49: note: in definition of macro ‘NVBENCH_CUDA_CALL’
   29 |     const cudaError_t nvbench_cuda_call_error = call;                          \
      |                                                 ^~~~
In file included from /root/nvbench/nvbench/cuda_call.cuh:21,
                 from /root/nvbench/nvbench/device_info.cuh:21,
                 from /root/nvbench/nvbench/benchmark_base.cuh:22,
                 from /root/nvbench/nvbench/benchmark.cuh:21,
                 from /root/nvbench/nvbench/nvbench.cuh:21,
                 from /root/nvbench/examples/simplest.cpp:1:
/usr/local/cuda/include/cuda_runtime_api.h:4186:125: note:   initializing argument 2 of ‘cudaError_t cudaFuncGetAttributes(cudaFuncAttributes*, const void*)’
 4186 | extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaFuncGetAttributes(struct cudaFuncAttributes *attr, const void *func);
      |                                                                                                                 ~~~~~~~~~~~~^~~~
In file included from /root/nvbench/nvbench/device_info.cuh:21,
                 from /root/nvbench/nvbench/benchmark_base.cuh:22,
                 from /root/nvbench/nvbench/benchmark.cuh:21,
                 from /root/nvbench/nvbench/nvbench.cuh:21,
                 from /root/nvbench/examples/simplest.cpp:1:
/root/nvbench/nvbench/detail/l2flush.cuh: In constructor ‘nvbench::detail::l2flush::l2flush()’:
/root/nvbench/nvbench/detail/l2flush.cuh:38:36: error: invalid conversion from ‘int**’ to ‘void**’ [-fpermissive]
   38 |       NVBENCH_CUDA_CALL(cudaMalloc(&m_l2_buffer, m_l2_size));
      |                                    ^~~~~~~~~~~~
      |                                    |
      |                                    int**
/root/nvbench/nvbench/cuda_call.cuh:29:49: note: in definition of macro ‘NVBENCH_CUDA_CALL’
   29 |     const cudaError_t nvbench_cuda_call_error = call;                          \
      |                                                 ^~~~
In file included from /root/nvbench/nvbench/cuda_call.cuh:21,
                 from /root/nvbench/nvbench/device_info.cuh:21,
                 from /root/nvbench/nvbench/benchmark_base.cuh:22,
                 from /root/nvbench/nvbench/benchmark.cuh:21,
                 from /root/nvbench/nvbench/nvbench.cuh:21,
                 from /root/nvbench/examples/simplest.cpp:1:
/usr/local/cuda/include/cuda_runtime_api.h:4626:76: note:   initializing argument 1 of ‘cudaError_t cudaMalloc(void**, size_t)’
 4626 | extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaMalloc(void **devPtr, size_t size);
      |                                                                     ~~~~~~~^~~~~~
In file included from /root/nvbench/nvbench/detail/measure_cold.cuh:31,
                 from /root/nvbench/nvbench/detail/state_exec.cuh:30,
                 from /root/nvbench/nvbench/state.cuh:278,
                 from /root/nvbench/nvbench/benchmark_base.cuh:24,
                 from /root/nvbench/nvbench/benchmark.cuh:21,
                 from /root/nvbench/nvbench/nvbench.cuh:21,
                 from /root/nvbench/examples/simplest.cpp:1:
/root/nvbench/nvbench/detail/statistics.cuh: In function ‘nvbench::float64_t nvbench::detail::compute_noise(const std::vector<double>&, nvbench::float64_t)’:
/root/nvbench/nvbench/detail/statistics.cuh:60:31: error: ‘sqrt’ is not a member of ‘std’; did you mean ‘sort’?
   60 |   const auto abs_stdev = std::sqrt(variance);
      |                               ^~~~
      |                               sort
make[2]: *** [examples/CMakeFiles/nvbench.example.simplest.dir/build.make:76: examples/CMakeFiles/nvbench.example.simplest.dir/simplest.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:371: examples/CMakeFiles/nvbench.example.simplest.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

I have partial fixes for the -fpermissive errors (it seems nvcc uses C language rules for implicit casts from function pointers to void* and from T** to void**) and for the missing #include <cmath> (which, to my mind, is the more concerning error: as lib[std]c++ evolves, implicit #includes can disappear, and this will bite again in the future), but the type_list error is weird. 🤷‍♂️

Thanks for your attention.

`enum_type_list` test fails on nvcc + gcc 7

There appears to be a bug in template resolution affecting nvcc+gcc. In the test program below, the two calls to type_name in test3() should return different strings, but they both return the "scoped" enum string.

  • Removing any of the other code in test1() or test2() makes the bug disappear, and the expected strings are printed.
  • This is not reproducible on nvcc+MSVC or with plain g++.

Minimal reproducer:

#include <string>
#include <tuple>
#include <type_traits>
#include <typeinfo>

#include <cstdio>

enum class scoped { val1, val2 };
enum unscoped { uval1, uval2 };

template <typename...>
struct type_list {};

namespace detail
{
  template <std::size_t I, typename... Ts>
  auto get(type_list<Ts...>)
    -> std::tuple_element_t<I, std::tuple<Ts...>>;
}

template <std::size_t Index, typename TypeList>
using get = decltype(detail::get<Index>(TypeList{}));

template <auto Value, typename T = decltype(Value)>
struct enum_type : std::integral_constant<T, Value>
{};

template <auto... Ts>
using enum_type_list = type_list<enum_type<Ts>...>;

template <typename T>
std::string type_name()
{
  return typeid(T).name();
}

void test1()
{
  if (!std::is_same_v<enum_type_list<scoped::val1,
                                     scoped::val2>,
                      type_list<enum_type<scoped::val1>,
                                enum_type<scoped::val2>>>)
  {
    printf("FAIL\n");
  }
}

void test2()
{
  using values = enum_type_list<unscoped::uval1,
				unscoped::uval2>;
  using elem_1  = get<0, values>;
  printf("%s\n", type_name<elem_1>().c_str());
}

void test3()
{
  printf("\n");
  printf("These should be different, but aren't:\n");
  printf("%s\n", type_name<enum_type<scoped::val1>>().c_str());
  printf("%s\n", type_name<enum_type<unscoped::uval1>>().c_str());
  printf("\n");
  printf("These should be different, and are:\n");
  printf("%s\n", typeid(enum_type<scoped::val1>).name());
  printf("%s\n", typeid(enum_type<unscoped::uval1>).name());
}

int main()
{
  test1();
  test2();
  test3();
}

Output:

$ nvcc -std=c++17 t.cu -run
9enum_typeIL6scoped0ES0_E

These should be different, but aren't:
9enum_typeIL6scoped0ES0_E
9enum_typeIL6scoped0ES0_E

These should be different, and are:
9enum_typeIL6scoped0ES0_E
9enum_typeIL8unscoped0ES0_E

Same code on gcc doesn't have the bug (godbolt):

$ cp t.cu t.cxx
$ g++ -std=c++17 t.cxx && ./a.out
9enum_typeIL6scoped0ES0_E

These should be different, but aren't:
9enum_typeIL6scoped0ES0_E
9enum_typeIL8unscoped0ES0_E

These should be different, and are:
9enum_typeIL6scoped0ES0_E
9enum_typeIL8unscoped0ES0_E

Version info:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0

$ g++ --version
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Add a `zip_axis` / `tuple_axis` that iterates through multiple parameter axes in lockstep

Overview

RAPIDS needs a way to iterate through correlated axes in lockstep.

Example Usecase

Consider a benchmark that takes three parameters "X", "Y", and "Z", where X is an int, Y is a float, and Z is a string.

We want to run two instances of this benchmark:

  1. "X" is 402, "Y" is 0.6, and "Z" is "foo".
  2. "X" is 862, "Y" is 0.2, and "Z" is "bar".

Using regular axes here is troublesome, since those expand to a cartesian product by default. Defining these axes naively will produce 8 parametrizations.

Existing Solutions

Derived parameters

If Y and Z could be derived from X, we could just define X, compute Y and Z, then add summaries to the state for markdown output.

However, these parameters may not always be easily related, as in the example above.

Skipping

nvbench::state::skip provides a mechanism to skip some of the configurations in the cartesian product, and can be used to slice out just the configurations of interest.

However, this is tedious and fragile. You'd need to maintain some sort of validation logic that stays in sync with the axis values.

Lookup Tables

Each set of values could be put in a lookup function, and a single integer axis could be used to enumerate each desired set of parameters.

This is subpar. There is no way to override the actual values at runtime with -a options; only the index is modifiable. By default, the output will be cryptic, reporting only the test case index, not the actual values used. (This could be worked around by hiding/adding summaries in the benchmark body.)

Proposed Solution

Add a new zip_axis (tuple_axis?) type that appears as multiple distinct axes, but is defined by specifying discrete sets of inputs as tuples. This effectively adds an abstraction that simplifies the Lookup Table solution.

void my_bench(nvbench::state &state)
{
  const auto x = state.get_int64("X");   // 402, 862
  const auto y = state.get_float64("Y"); // 0.6, 0.2
  const auto z = state.get_string("Z");  // "foo", "bar"

  // ...
}

// (a)
auto my_values = nvbench::zip_axis_values<nvbench::int64_t, nvbench::float64_t, std::string>
{
  // Names for subaxes:
  {"X", "Y", "Z"},
  //  Tuples of grouped parameters
  {402, 0.6, "foo"},
  {862, 0.2, "bar"},
  // ...
};
NVBENCH_BENCH(my_bench)
  .add_zip_axis("ZippedValues", my_values);

// or

// (b)
NVBENCH_BENCH(my_bench)
  .add_zip_axis("ZippedValues",
                        "X", {402, 862},
                        "Y", {0.6, 0.2},
                        "Z", {"foo", "bar"});
  • The (a) form is nice because it lays out the grouped values together, making it easy to check that all subaxes are synced and have the same number of values.
  • The (b) form is convenient for small axes.

This should play nicely with other axes -- the cartesian product of all other axes + the zipped values should still be generated.

Open Questions

Command line interactions with -a

Proposed: If any subaxes are redefined, all subaxes must be redefined and have the same length, e.g.

my_bench -b 0 -a "X:[285,128,42]" -a "Y:[0.1,3.4,1.2]" -a "Z:[bing,bang,bong]"

If any axes are mismatched in length or unspecified, throw an error. Maybe we allow redefinition of a single subaxis only when the new definition is the same length as the hard-coded axis.

This could be punted on for the first version of this, since it'll be complicated no matter what approach we use, and it's not strictly necessary to meet the immediate need for zipped axes.

Markdown Output

Proposed: The zip axis should be transparent here -- ignore it and just treat the subaxes as regular axes:

|  X  |  Y  |   Z   | ... |
|-----|-----|-------|-----|
| 402 | 0.6 | "foo" | ... |
| 862 | 0.2 | "bar" | ... |

CSV Output

Proposed: Add columns with subaxes values, as well as a value that identifies the index in the zip, e.g.

ZippedValues, X, Y, Z, ...
0, 402, 0.6, "foo", ...
1, 862, 0.2, "bar", ...

JSON Output

This will be tricky and require some thought. We'll need to address both the per-benchmark axis definition as well as the per-state parameter encoding.

Implementation

This will likely be a new subclass of nvbench::axis_base that holds a vector of other nvbench::axis_bases.

Several areas will need to be updated, including but not limited to:

  • nvbench::axis_type enum
  • nvbench::option_parser for handling CLI
  • nvbench::markdown_printer, nvbench::csv_printer, nvbench::json_printer
  • nvbench::benchmark_base to add the add_zip_axis
  • nvbench::detail::state_generator and state_generator_iterator
  • nvbench::axis_metadata will need to learn how to handle these cleanly
  • Unit tests
  • Examples

Feature Request: Setup and Teardown for `benchmark_base`

To implement a setup step that is shared among all states of a benchmark_base, a current solution could be:

void my_benchmark(nvbench::state& state) {
  static int num_execs = 0;
  if (num_execs == 0) {
    // SETUP calls here
    // e.g. expensive hard disk I/O
    ++num_execs;
  }
  state.exec([](nvbench::launch& launch) { 
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>(/* uses data from SETUP */);
  });
}
NVBENCH_BENCH(my_benchmark).add_int64_axis("i", {1, 2, 3, 4, 5, ..., 99});

It would be more convenient to have an explicit setup and teardown functionality (similar to e.g. Boost Test). For example:

void my_benchmark_setup() {
  // SETUP calls here
}

void my_benchmark_teardown() {
  // TEARDOWN calls here
}

void my_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) { 
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>(/* uses data from SETUP */);
  });
}
NVBENCH_BENCH(my_benchmark).add_int64_axis("i", {1, 2, 3, 4, 5, ..., 99})
  .register_setup(my_benchmark_setup).register_teardown(my_benchmark_teardown);

A discussion about what the setup and teardown registration should look like would be helpful.

Add an option to limit number of executions of a benchmark

In addition to setting a --min_samples it would be nice to be able to set a --max_samples as well.

My use case is profiling: I'd like to use a profiler like Nsight Systems or Nsight Compute on my benchmark code, and I don't want to profile an extraneous number of executions.

A --profile option would be a handy shorthand for --max_samples=1.

Add sample outputs for each example

We should add an output directory under nvbench/examples that contains the json, csv, and md outputs for each benchmark on a multidevice system. This should be linked to from the main README.

Add `--profile` flag to fixup various profiling issues

#10 was a similar issue about adding support for profiling -- this issue resulted in the --run-once option, which disables warmup runs and only does a single execution of the KernelLauncher.

However, there is another issue when using CUDA profilers. The profiler will force the CPU to block until the kernel is executed, and this causes a deadlock when a benchmark uses the blocking_kernel. This can be worked around by using nvbench::exec_tag::sync to disable the blocking kernel, but we should add a runtime switch to disable this as well.

  • Add --disable-blocking-kernel to simply switch off the blocking kernel
  • Add --profile, which implies --disable-blocking-kernel --run-once

[FEA] Support templated fixture tests

All cudf benchmarks use templated fixture tests to handle rmm memory pool allocation/deallocation. It would be great if nvbench supported this functionality.

Export raw samples as binary blobs

Add an option to the json formatter that would print raw time samples to a binary output file:

# Running
my_bench --json my_bench.json --write-samples

# Creates
my_bench.json         # Contains references to below files:
my_bench.json.0.0.bin # Raw timings from bench 0 config 0
my_bench.json.0.1.bin # bench 0 config 1
my_bench.json.1.0.bin # bench 1 config 0
# etc...

We can then write tooling to process these into histograms, etc for analysis, and if we implement it carefully we can let other printers have access to the raw samples.

  • Add a mechanism to pass printer-specific options on the command line.
  • Add a new virtual to printer_base: void process_time_samples(state, vector_of_time_samples). Allow the printer to add new summaries to the state object (processed results, filenames, etc).
    • Update printer_multiplex to forward calls to the new virtual.
  • Implement a binary data writer for printer_json
    • Write samples to .bin file
      • Experiment with different output encodings -- e.g. integral nanoseconds vs. floating point seconds, which compresses better?
    • Add new summary to state with filename
  • Add a --write-samples CLI option for the JSON printer

cc: @jrhemstad @senior-zero since we've all wanted histograms at some point 😁

Use rapids-cmake

We should use the new rapids-cmake framework internally, as it simplifies a lot of dependency management and CUDA-related stuff.

nvbench::range drops the last element

void nvbench_test(nvbench::state& state)
{
  std::cout << state.get_float64("Occupancy") << "\n";
}

NVBENCH_BENCH(nvbench_test).add_float64_axis("Occupancy", nvbench::range(0.7, 0.9, 0.1));

Execution of the above code will work as expected and the output will be:

Run:  nvbench_test [Device=0 Occupancy=0.7]
0.7
Run:  nvbench_test [Device=0 Occupancy=0.8]
0.8
Run:  nvbench_test [Device=0 Occupancy=0.9]
0.9

If I change stride to 0.05:

NVBENCH_BENCH(nvbench_test).add_float64_axis("Occupancy", nvbench::range(0.7, 0.9, 0.05));

The output is shown below, where the last element of the range (0.9) is not included:

Run:  nvbench_test [Device=0 Occupancy=0.7]
0.7
Run:  nvbench_test [Device=0 Occupancy=0.75]
0.75
Run:  nvbench_test [Device=0 Occupancy=0.8]
0.8
Run:  nvbench_test [Device=0 Occupancy=0.85]
0.85

Enable stricter warning flags

NVBench currently builds with default warning flags. We need to add more (in cmake/NVBenchConfigTarget.cmake) to help catch issues before they're merged.

What's the right way to include nvbench to an existing cmake project?

Currently, I am using FetchContent and only worrying about the include directory:

include(FetchContent)
set(FETCHCONTENT_QUIET ON)

message(STATUS "Cloning External Project: NVBench")
get_filename_component(FC_BASE "../externals"
                REALPATH BASE_DIR "${CMAKE_BINARY_DIR}")
set(FETCHCONTENT_BASE_DIR ${FC_BASE})

FetchContent_Declare(
    nvbench
    GIT_REPOSITORY https://github.com/NVIDIA/nvbench.git
    GIT_TAG        main
)

FetchContent_GetProperties(nvbench)
if(NOT nvbench_POPULATED)
  FetchContent_Populate(
    nvbench
  )
endif()
set(NVBENCH_INCLUDE_DIR "${nvbench_SOURCE_DIR}")

Then I can link my project with nvbench using:

include(${PROJECT_SOURCE_DIR}/cmake/FetchNVBench.cmake)
target_include_directories(my_project
    INTERFACE ${NVBENCH_INCLUDE_DIR}
)

Compiling a simple example using nvbench, I get the following error:

/nvbench/device_info.cuh:21:10: fatal error: nvbench/config.cuh: No such file or directory
 #include <nvbench/config.cuh>

By the way, in my simple test, all I am doing is:

#include <nvbench/nvbench.cuh>
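For what it's worth, nvbench/config.cuh is generated at configure time, so consuming only the source include directory will always miss it. A hedged sketch of a configuration that actually builds NVBench and links its exported target (the nvbench::nvbench target name matches what the custom-main docs mention):

```cmake
include(FetchContent)

FetchContent_Declare(
    nvbench
    GIT_REPOSITORY https://github.com/NVIDIA/nvbench.git
    GIT_TAG        main
)
# Configure and build nvbench as part of this project; this generates
# nvbench/config.cuh and defines the exported targets.
FetchContent_MakeAvailable(nvbench)

# Link the target instead of hand-wiring include directories.
target_link_libraries(my_project PRIVATE nvbench::nvbench)
```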

Add example showing how to parse additional arguments

Several users have asked about how to add custom arguments to their benchmarks (e.g. #86).

This is done by implementing an application-specific main(argc, argv) function and linking to the nvbench::nvbench CMake target instead of nvbench::main. The custom main will need to parse out the additional arguments, and then use the macros defined in main.cuh to parse the remaining args and run NVBench. Something like:

int main(int argc, char** argv)
{
  // parse custom args and remove them from (or rebuild) argc/argv here
  
  NVBENCH_MAIN_BODY(argc, argv);
}

A singleton or some global state is then needed to communicate the custom options to the benchmarks.

We should add an example that shows how to do this.

[FEA] Show bandwidth data in base-ten format

The current terminal output shows global memory bandwidth (GlobalMem BW) in base-two format by default:

| Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  | GlobalMem BW  | BWPeak |
|---------|------------|-------|------------|-------|----------|---------------|--------|
|     11x | 168.574 ms | 2.20% | 168.569 ms | 2.20% |   1.592b |  11.865 GiB/s |  1.42% |
|     11x | 189.253 ms | 1.74% | 189.249 ms | 1.74% |   1.418b |  10.568 GiB/s |  1.26% |
|     10x |   10.472 s | 0.70% |   10.472 s | 0.70% |  25.633m | 195.568 MiB/s |  0.02% |
|     11x | 170.236 ms | 2.08% | 170.230 ms | 2.08% |   1.577b |  11.749 GiB/s |  1.40% |

It would be useful to have a base-10 display option (e.g., GB/s instead of GiB/s) which is more consistent with hardware specs/whitepaper numbers.

Document CUDA_ARCH considerations when installing NVBench

Update

The information below is relevant, but outdated. #42 is adding install rules for NVBench. It's still extremely important to make sure that either:

  1. All projects involved are building with the same CMAKE_CUDA_ARCHITECTURES flags, or
  2. The installed NVBench is compiled with CMAKE_CUDA_ARCHITECTURES set to contain every arch it could possibly run against.

But rather than not allow installation, we should just document the sharp edges clearly in the README.

Outdated

After some internal discussions, we ended up deciding that it would be best to not install nvbench or provide packages for it.

Since the library provides detailed information that is tightly coupled with the compilation process (virtual architecture targets, etc), the nvbench library must be compiled with the same CUDA flags as any downstream benchmarks. Otherwise, we'd report misleading per-device information and would likely introduce some ODR violations that result in unpredictable and hard to debug behavior. Ultimately, this means that we cannot provide a pre-built library that is well-behaved.

To avoid these issues, it's best to build nvbench as part of the downstream benchmark suite, including it via CPM or git submodule, similar to https://github.com/NVIDIA/nvbench_demo or https://github.com/allisonvacanti/thrust_benchmark. This way, we can ensure that the resulting binaries are accurate and stable.

This may change someday, but would require careful consideration of the issues and edgecases that would arise from sharing the nvbench library implementation.

I need to document this better and remove some of the misleading comments from our CMake code, which was written before I realized that the library should not be pre-built or shared between projects.

(Taken from PR #20)

Add benchmark groups/sets/tags

It would enable a nicer (shorter) way of specifying benchmarks in the CLI interface if one could define groups/sets of benchmarks in the source code.

One way of doing this could be to leverage existing namespaces, i.e. being able to call bench --namespace foo to execute all benchmarks in that namespace (and sub-namespaces), although I don't know how easy this is to implement compared to just adding a new class.

Accept Benchmark Name Wildcards

Right now it looks like benchmark names must be given in full, but it would be useful to support a regex/wildcard syntax as gbench does.

Python Comparison Scripts

NVBench has a work-in-progress JSON output format and I'm working on a very basic python script to compare two JSON files.

We should grow this functionality into a more complete set of analysis tools. At minimum, this should cover the features provided by Google Benchmark's excellent comparison scripts.

If anyone is interested in writing some python to help with this, let me know. I'll update this issue once I have finalized the JSON output format.

Basic Regression Testing

  • P0: Compare two json files:
    compare.py baseline.json test.json
  • P0: Specify a custom error threshold:
    compare.py --gpu-threshold 5 baseline.json test.json
    (gpu-threshold, cpu-threshold, batch-threshold)
  • P2: Run a benchmark executable and compare with a json
    compare.py baseline.json --run test.exe -b 3 -a T=[I32,U64] -a Elements[pow2]=30

These should:

  • Compare the benchmarks with the same name + config.
  • Print abs/rel changes for cpu/gpu/batch measurements.
  • Highlight any entries that exceed a threshold time.
  • Return an error code if any exceed thresholds.

Analysis modes

Compare benchmarks with different names. Answers questions:

  • How much faster is benchmark X for input type T vs. U for a variety of input sizes?
  • Does Algorithm X take more time to run than Algorithms Y for the same inputs?

These will need some way of specifying the sets of configurations to compare. Google Benchmark has worked out a general syntax for specifying this; we should adapt what they've done to use the NVBench axis syntax.

Output

Ideally markdown formatted, similar to NVBench's default output.


[QST] Passing an integer as a template parameter

// SIZE can be 1, 2, 4, 8, 16 or 32
template <typename T, int SIZE>
void nvbench_test(nvbench::state &state)
{
  ...
}

Is there a way to pass integers (or values) as template parameters to a target benchmark in nvbench? I.e., passing SIZE to nvbench_test in the above example.
