jonasdelacour / lockstepdualisation

standalone repository for lockstep parallel dualisation

C++ 73.56% Cuda 7.45% CMake 2.34% Jupyter Notebook 7.69% Makefile 0.38% Python 8.60%

lockstepdualisation's Introduction

LockstepDualisation Artefact Description

This repository contains the code for the paper "Lockstep-Parallel Dualization of Surface Triangulations" by Jonas Dornonville de la Cour, Carl-Johannes Johnsen, and James Emil Avery. The paper was submitted to the 2023 International Conference for High Performance Computing, Networking, Storage, and Analysis.

Instructions

Software Prerequisites

Linux or macOS (Tested on Ubuntu 18.04, Ubuntu 22.04, Arch Linux 6, and macOS 13.3.1). The CPU version works under Windows WSL with Ubuntu 22.04; the GPU version does not, because Windows does not support NVIDIA Unified Memory.
CMake 3.18 or higher (Tested with CMake 3.23 and 3.26)
C++ compiler with C++17 support (Tested and verified with gcc 7.5, 11.3, and 12.2. Does not work with clang)
Nvidia CUDA Toolkit 11.8 or higher
Nvidia GPU with compute capability 5.0 or higher
Git
Fortran compiler (Tested with gfortran 7.5, 11.3, and 12.2)

Build

Quickstart

Download the code using the --recursive flag: the code benchmarks against the dualization implementation from http://github.com/jamesavery/fullerenes/ and includes it as a submodule.

git clone --recursive git@github.com:jonasdelacour/LockstepDualisation.git
cd LockstepDualisation

To build the code, run the validation, and run the benchmarks in one step, type

make all

Each of the benchmarks produces a CSV file containing the results, and generates the benchmark plots. The benchmark and validation output will be placed in a directory named output/<hostname>.

To build only, or to run the validation or the benchmarks separately, run one of

make build

make validation

make benchmarks

The validation checks the results from all the parallel implementations against the reference sequential dualization implementation in the Fullerene software package. For every $n$ in $[20,24,26,\ldots,200]$, the check is performed against a random sample of 10,000 dual $C_n$ fullerene isomer graphs (or the full isomer space if smaller than 10,000). We verify that the results are identical.
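
The size range and sampling rule described above can be sketched as follows. This is a minimal illustration, not the repository's validation code: the function names `valid_sizes` and `sample_size` are hypothetical. The value 22 is absent from the range because no C22 fullerene exists.

```python
def valid_sizes(n_max=200):
    """Sizes covered by the validation: 20, then every even n from 24 up to n_max."""
    return [20] + list(range(24, n_max + 1, 2))


def sample_size(n_isomers, cap=10_000):
    """Number of isomers checked for one size.

    The full isomer space is used when it holds fewer than `cap` isomers;
    otherwise a random sample of `cap` isomers is drawn.
    """
    return min(n_isomers, cap)


if __name__ == "__main__":
    sizes = valid_sizes()
    print(len(sizes), sizes[:4], sizes[-1])
    print(sample_size(1))      # C20 has a single isomer, so the whole space is checked
    print(sample_size(10**6))  # hypothetical large isomer count, capped at 10000
```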

The benchmarks can also be performed interactively with the Jupyter notebook, reproduce.ipynb.

Manual build

If the automatic build fails, the individual steps to build and run the software are as follows:

Fetch the Fullerene software package as a submodule (for reference comparisons)

git submodule update --init

After this, the benchmarks can be built using CMake and make:

mkdir build
cd build/
cmake ..
make -j

Manual Run

After building, return to the repository root directory before running the benchmarks. The executables are located in the build/benchmarks and build/validation directories. The benchmark executables are:

build/benchmarks/baseline
build/benchmarks/omp_multicore
build/benchmarks/single_gpu
build/benchmarks/multi_gpu

The GPU benchmarks will only be built if the CUDA toolkit is available.

All executables take the same command line parameters. For example:

./build/benchmarks/multi_gpu <Ntriangles> <Ngraphs> <Nruns> <Nwarmup> <variant:0|1>

Ntriangles: one of [20, 24, 26, 28, ... , 200] (to match the fullerene test-data). Default: 200
Ngraphs: batch size, i.e. the number of graphs to dualise in parallel.
Nruns: number of repeated runs. Default: 10. To reproduce results from the paper, set to 100 (but takes longer).
Nwarmup: number of warmup runs. Default: 1
variant: Kernel variant.

For GPU, kernel 0 uses one thread per triangle (Ntriangles threads), and kernel 1 uses one thread per vertex.
For CPU, kernel 0 is the shared-memory parallel version, and kernel 1 is the task-parallel version.
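
To make the two GPU variants concrete: a triangulation of a sphere with N triangles has 3N/2 edges and, by Euler's formula V - E + F = 2, N/2 + 2 vertices. The sketch below computes the resulting number of threads per graph for each variant. It assumes spherical topology (as for fullerene duals), and the function names are hypothetical, not part of the repository.

```python
def n_vertices(n_triangles):
    """Vertex count of a sphere triangulation with n_triangles faces.

    Each triangle has 3 edges and each edge is shared by 2 triangles,
    so E = 3F/2; Euler's formula V - E + F = 2 then gives V = F/2 + 2.
    """
    if n_triangles % 2 != 0:
        raise ValueError("a closed triangulation has an even number of triangles")
    return n_triangles // 2 + 2


def threads_per_graph(n_triangles, variant):
    """Threads used per graph: variant 0 is per triangle, variant 1 is per vertex."""
    if variant == 0:
        return n_triangles
    if variant == 1:
        return n_vertices(n_triangles)
    raise ValueError("variant must be 0 or 1")


if __name__ == "__main__":
    for variant in (0, 1):
        print(variant, threads_per_graph(200, variant))
```

For Ntriangles = 200, variant 0 uses 200 threads per graph and variant 1 uses 102.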

For example,

./build/benchmarks/single_gpu 100 1000000 100 1 1

runs the single-GPU benchmark for a million C100 fullerene isomers, repeated 100 times for statistics, with a single warmup-run, using GPU kernel 1.
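
When scripting many runs, the argument list can be assembled and checked before launching an executable. The following is a minimal sketch under stated assumptions: the helper names `benchmark_command` and `run_benchmark` are hypothetical, the defaults for Nruns and Nwarmup are the ones listed above, and Ngraphs and the variant are required because no defaults are documented for them here.

```python
import subprocess

# Sizes matching the fullerene test data: 20, 24, 26, ..., 200.
VALID_NTRIANGLES = [20] + list(range(24, 201, 2))


def benchmark_command(executable, n_triangles, n_graphs, variant, n_runs=10, n_warmup=1):
    """Build the argument list <Ntriangles> <Ngraphs> <Nruns> <Nwarmup> <variant>."""
    if n_triangles not in VALID_NTRIANGLES:
        raise ValueError(f"Ntriangles must be one of 20, 24, 26, ..., 200; got {n_triangles}")
    if variant not in (0, 1):
        raise ValueError("variant must be 0 or 1")
    if n_graphs < 1 or n_runs < 1 or n_warmup < 0:
        raise ValueError("Ngraphs and Nruns must be positive, Nwarmup non-negative")
    return [executable, str(n_triangles), str(n_graphs),
            str(n_runs), str(n_warmup), str(variant)]


def run_benchmark(*args, **kwargs):
    """Run one benchmark from the repository root; raises if the process fails."""
    return subprocess.run(benchmark_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Same run as the example above; only prints the command line.
    cmd = benchmark_command("./build/benchmarks/single_gpu", 100, 1_000_000, 1, n_runs=100)
    print(" ".join(cmd))
```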

lockstepdualisation's Issues

sycl_benchmark crashes for N>128 on LUMI-G

Running for N>128 passes validation, but crashes in benchmark.

averyjam@nid005021:~/dualize/LockstepDualisation/build> ./validation/sycl/sycl_validation gpu 200 200 
Validating SYCL implementation for gpu device: gfx90a:sramecc+:xnack-.
N = 200
Success!

averyjam@nid005021:~/dualize/LockstepDualisation/build> ./benchmarks/sycl/sycl_benchmark gpu 200
Dualising 1000000 triangulation graphs, each with 200 triangles, repeated 10 times and with 1 warmup runs.
Platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)
        NOT USING: Intel(R) FPGA Emulation Device has 4 compute-units.
Platform: Intel(R) OpenCL
        NOT USING: AMD EPYC 7A53 64-Core Processor                 has 4 compute-units.
Platform: AMD HIP BACKEND
        USING    : gfx90a:sramecc+:xnack- has 110 compute-units.
Using 1 gpu-devices
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377767,0,0], local id: [167,0,0] Assertion `false` failed.
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377768,0,0], local id: [168,0,0] Assertion `false` failed.
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377769,0,0], local id: [169,0,0] Assertion `false` failed.
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377770,0,0], local id: [170,0,0] Assertion `false` failed.
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377771,0,0], local id: [171,0,0] Assertion `false` failed.
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377772,0,0], local id: [172,0,0] Assertion `false` failed.
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377773,0,0], local id: [173,0,0] Assertion `false` failed.
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377774,0,0], local id: [174,0,0] Assertion `false` failed.
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377776,0,0], local id: [176,0,0] Assertion `false` failed.
/users/averyjam/dualize/LockstepDualisation/src/sycl/dual.cc:31: K DeviceDualGraph<6, unsigned short>::dedge_ix(const K, const K) const [MaxDegree = 6, K = unsigned short]: global id: [1377778,0,0], local id: [178,0,0] Assertion `false` failed.
:0:rocdevice.cpp            :2652: 1910724722915 us: 1686 : [tid:0x14a7b1aef700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Aborted

Running for N<=128 works for both. Why?

averyjam@nid005021:~/dualize/LockstepDualisation/build> ./validation/sycl/sycl_validation gpu 128 128
Validating SYCL implementation for gpu device: gfx90a:sramecc+:xnack-.
N = 128
Success!

averyjam@nid005021:~/dualize/LockstepDualisation/build> ./benchmarks/sycl/sycl_benchmark gpu 128
Dualising 1000000 triangulation graphs, each with 128 triangles, repeated 10 times and with 1 warmup runs.
Platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)
        NOT USING: Intel(R) FPGA Emulation Device has 4 compute-units.
Platform: Intel(R) OpenCL
        NOT USING: AMD EPYC 7A53 64-Core Processor                 has 4 compute-units.
Platform: AMD HIP BACKEND
        USING    : gfx90a:sramecc+:xnack- has 110 compute-units.
Using 1 gpu-devices
Mean Time per Graph: 26.4305 +/- 7.02391 ns
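
For reference, the reported mean time per graph can be converted into throughput and time per batch. This is a small sketch that assumes "Mean Time per Graph" is the batch time divided by the number of graphs in the batch; the function names are hypothetical.

```python
def graphs_per_second(mean_ns_per_graph):
    """Throughput implied by a mean time per graph given in nanoseconds."""
    return 1e9 / mean_ns_per_graph


def batch_time_ms(mean_ns_per_graph, n_graphs):
    """Time for one batch of n_graphs, in milliseconds."""
    return mean_ns_per_graph * n_graphs * 1e-6


if __name__ == "__main__":
    mean_ns = 26.4305  # value reported above for N = 128
    print(f"{graphs_per_second(mean_ns) / 1e6:.1f} million graphs per second")
    print(f"{batch_time_ms(mean_ns, 1_000_000):.2f} ms per batch of 1000000 graphs")
```

With the figure above, this corresponds to roughly 37.8 million graphs per second, or about 26.4 ms per batch of one million graphs.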
