
rsbench's People

Contributors

heshpdx, jtramm, stephan-rohr


rsbench's Issues

SYCL "simulation only" runtime statistics misleading

For the SYCL version, we are currently reporting runtime statistics for both the kernel initialization / JIT compilation and the actual execution. This can produce misleading output on some systems, e.g.:

Total Time Statistics (SYCL+OpenCL Init / JIT Compilation + Simulation Kernel)
Runtime:                XXXXXXX seconds
Lookups:               XXXXXXXXXX
Lookups/s:            XXXXXXXXXX
Simulation Kernel Only Statistics
Runtime:               0.00001 seconds
Lookups/s:             1,000,000,000,000,000
Verification checksum: (Valid)

Timing these phases separately, as we do now, bakes in assumptions about the asynchronous behavior of SYCL that do not appear to hold in all cases with all compilers on all machines. Instead, we should report only the total runtime.

Possible buffer overrun of poles buffer

This line

Pole * contiguous = (Pole *) malloc( input.n_nuclides * input.avg_n_poles * sizeof(Pole));

... assumes that the total number of poles equals avg_n_poles * n_nuclides.
But that is only true if we never hit this line:

R[i] = 1;

If we do, then the total is larger and RSBench overruns the buffer.
This is a marginal case and highly unlikely with a large avg_n_poles, but it is still a problem for lower values.

This can be fixed locally by summing up n_poles inside generate_poles instead of multiplying by the average.
I am not sure whether that has other side effects, though.
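
For illustration, a minimal sketch of that local fix, assuming the per-nuclide pole counts are gathered before the allocation (the helper name and the Pole stand-in below are hypothetical, not RSBench's actual generate_poles signature):

#include <stdlib.h>

typedef struct { double dummy; } Pole;   /* stand-in for RSBench's Pole struct */

/* Size the contiguous buffer from the real per-nuclide pole counts
 * instead of assuming n_nuclides * avg_n_poles poles in total. */
Pole * alloc_contiguous_poles( const int * n_poles_per_nuclide, int n_nuclides )
{
    size_t total_poles = 0;
    for( int i = 0; i < n_nuclides; i++ )
        total_poles += (size_t) n_poles_per_nuclide[i];  /* individual counts may exceed the average */

    return (Pole *) malloc( total_poles * sizeof(Pole) );
}

The same total can then be reused when carving the contiguous buffer into per-nuclide slices, so the pointer arithmetic stays consistent with the actual counts.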

CUDA error and Libomptarget error: openmp-offload method

Simulation crashes at runtime for openmp-offload mode:

Compiler: LLVM 14.0.0 (nightly build: February 2nd 2022) + cudatoolkit/21.9_11.4
Machine: Perlmutter (NVIDIA A100 GPU + AMD Milan CPU)

Reproducer

  1. cd openmp-offload
  2. export CC=clang
  3. make
  4. ./rsbench -m event

No changes were made to the Makefile

Beginning baseline event based simulation on device...
CUDA error: an illegal memory access was encountered
Libomptarget error: Copying data from device failed.
Libomptarget error: Call to targetDataEnd failed, abort target.
Libomptarget error: Failed to process data after launching the kernel.
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
simulation.c:24:2: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted

a typo in readme?

"RSBench represents the multipole method of perfoming continuous energy macroscopic neutron cross section lookups."

perfoming->performing

Implement XL and XXL sizes

The README documents XL and XXL sizes, but only small and large appear to be implemented. These would indeed be helpful for benchmarking large nodes. In the meantime, can you suggest command line options that can scale the problem in ways that make sense, e.g., to an order of magnitude larger than the current large size? Thank you.

Result of computation is never checked -> optimising compilers skew results

Again, similar to XSbench issue:
in main.c:
calculate_macro_xs( macro_xs, mat, E, input, data, sigTfactors, &abrarov, &alls );
The results in macro_xs are never checked. Simply adding asm volatile (""::"m"(macro_xs[0]),...) brings performance back in line for aggressively optimising compilers (e.g. when adding LTO to the GCC optimisation flags).
The resulting performance difference is significant again:
$ while true; do res1=$(./rsbench -s small | awk '/Lookups.s:/ {print $2}'); res2=$(./rsbench.force_use -s small | awk '/Lookups.s:/ {print $2}'); echo $res1 $res2; done
808,197 401,135
792,076 367,372
765,152 366,358
Showing a performance difference of 2x.

So please make sure that the benchmark results get used, either by employing similar asm volatile barriers, or by adding a running sum over the results (and printing or asm-volatile-consuming it).
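
For illustration, a minimal sketch of such a barrier (the force_use helper is hypothetical, not RSBench code):

/* Tell the compiler that every entry of macro_xs is read as a memory
 * operand; this prevents the lookup results from being optimised away
 * under aggressive optimisation / LTO, while emitting no instructions. */
static inline void force_use( const double * macro_xs, int n )
{
    for( int i = 0; i < n; i++ )
        __asm__ volatile ( "" : : "m" (macro_xs[i]) );
}

/* Usage sketch, right after each lookup:
 *   calculate_macro_xs( macro_xs, mat, E, input, data, sigTfactors, &abrarov, &alls );
 *   force_use( macro_xs, 4 );   // 4 assumed to be the number of entries in macro_xs
 */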

Unchecked mallocs

Similar to my recent issue in XSbench, RSbench also has unchecked mallocs that can cause segfaults.

$ grep -n malloc *.c
init.c:56: Pole ** R = (Pole **) malloc( input.n_nuclides * sizeof( Pole *));
init.c:57: Pole * contiguous = (Pole *) malloc( input.n_nuclides * input.avg_n_poles * sizeof(Pole));
init.c:89: Window ** R = (Window **) malloc( input.n_nuclides * sizeof( Window *));
init.c:90: Window * contiguous = (Window *) malloc( input.n_nuclides * input.avg_n_windows * sizeof(Window));
init.c:128: double ** R = (double **) malloc( input.n_nuclides * sizeof( double * ));
init.c:129: double * contiguous = (double *) malloc( input.n_nuclides * input.numL * sizeof(double));
main.c:120: (complex double *) malloc( input.numL * sizeof(complex double) );
material.c:17: int * num_nucs = (int*)malloc(12*sizeof(int));
material.c:44: int ** mats = (int **) malloc( 12 * sizeof(int *) );
material.c:46: mats[i] = (int *) malloc(num_nucs[i] * sizeof(int) );
material.c:112: double ** concs = (double **)malloc( 12 * sizeof( double *) );
material.c:115: concs[i] = (double *)malloc( num_nucs[i] * sizeof(double) );
papi.c:252: int * events = malloc(num_papi_events * sizeof(int));
papi.c:257: long_long * values = malloc( num_papi_events * sizeof(long_long));

Not quite as bad as in XSbench, because the default allocation is smaller (~250MB), but it would still be good to have checked mallocs.
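
For reference, a minimal sketch of a checked allocation wrapper (hypothetical helper, not part of RSBench) that the calls above could be routed through:

#include <stdio.h>
#include <stdlib.h>

/* Abort with a clear message on allocation failure instead of
 * segfaulting later on a NULL pointer dereference. */
static void * checked_malloc( size_t bytes )
{
    void * p = malloc( bytes );
    if( p == NULL )
    {
        fprintf( stderr, "Error: malloc of %zu bytes failed\n", bytes );
        exit( EXIT_FAILURE );
    }
    return p;
}

/* Usage sketch:
 *   Pole * contiguous = (Pole *) checked_malloc( input.n_nuclides * input.avg_n_poles * sizeof(Pole) );
 */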

Request for Adding New Programming Models - Kokkos and RAJA

Our research group is currently doing research on parallel programming models and is interested in contributing to the RSBench project by adding new models such as Kokkos and RAJA.

As part of the plan, we propose restructuring the project to follow a structure similar to BabelStream's, which would allow for better organization and maintainability.

We would like to know whether there are any current plans within the RSBench project regarding the addition of these models. We are eager to contribute to the project by implementing them.

Any feedback, suggestions, or guidance on this proposal would be highly appreciated. We are looking forward to collaborating with the RSBench community to further improve this valuable benchmarking tool.

Thank you for considering our request.
