Measure Performance without OpenGL (cuda-to-sycl-nbody, open)

codeplaysoftware commented on May 24, 2024

Measure Performance without OpenGL

from cuda-to-sycl-nbody.

Comments (7)

vitduck commented on May 24, 2024

Hi Duncan,

Thanks for your reply.

I do agree that a second target without a graphical component is better than removing OpenGL altogether.
Looking at the code, it seems that the rendering is strongly coupled with the simulation part.
So I am not sure it is worth the effort on your end to isolate it.

For now, I will set up a Linux box to test the code.

DuncanMcBain commented on May 24, 2024

Hi @vitduck,

As a temporary solution, it might be possible to use the approach described here instead of messing around with the X virtual framebuffer, though I haven't tried it personally. It should be possible to compile Mesa and LLVMpipe without installing them system-wide.

We don't have any quick fixes for removing the graphical dependency, but it's something we're considering doing in some fashion. It might be possible to simply remove the OpenGL code from the main file, though if we pick this task up I'd like to make a second target that builds from a separate main file with no graphical component.
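
To make the idea of a render-free target concrete, here is a minimal sketch of what a headless timing loop could look like. It is not the project's code: `simulate_step` and `run_headless` are hypothetical names, and the step body is a plain CPU placeholder standing in for the real CUDA or SYCL kernel launch.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one simulation step; in a real build this would
// launch the device kernel and wait for it to finish.
inline void simulate_step(std::vector<float>& positions, float dt) {
  for (float& p : positions) p += dt;
}

// Runs `steps` iterations with no window, no OpenGL context, and no drawing,
// and returns the wall-clock time of each step in milliseconds.
inline std::vector<double> run_headless(std::size_t n_particles,
                                        std::size_t steps) {
  std::vector<float> positions(n_particles, 0.0f);
  std::vector<double> times_ms;
  times_ms.reserve(steps);
  for (std::size_t i = 0; i < steps; ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    simulate_step(positions, 0.005f);
    const auto t1 = std::chrono::steady_clock::now();
    times_ms.push_back(
        std::chrono::duration<double, std::milli>(t1 - t0).count());
  }
  return times_ms;
}
```

Keeping the loop in its own entry point means the benchmark binary never links against the windowing or OpenGL libraries, which is the property being asked for in this issue.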

Duncan.

DuncanMcBain commented on May 24, 2024

Hi @vitduck,

We have a PR open that should fix this issue (#30).

I hope this helps!

vitduck commented on May 24, 2024

Hi @DuncanMcBain,
Thanks very much for the notice.

I am testing the latest commit as follows:

$ module purge 
$ module load cuda/10.1 
$ sh scripts/build_cuda.sh no_render 
-- The CXX compiler identification is GNU 4.8.5
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The CUDA compiler identification is NVIDIA 10.1.243
-- Check for working CUDA compiler: /apps/cuda/10.1/bin/nvcc
-- Check for working CUDA compiler: /apps/cuda/10.1/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1") 
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /apps/cuda/10.1 (found version "10.1") 
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    GLEW_LIBRARY


-- Build files have been written to: /scratch/optpar01/work/2024/cuda-to-sycl-nbody/build_cuda
Scanning dependencies of target nbody_cuda
[ 25%] Building CXX object src/CMakeFiles/nbody_cuda.dir/nbody.cpp.o
[ 50%] Building CXX object src/CMakeFiles/nbody_cuda.dir/sim_param.cpp.o
[ 75%] Building CUDA object src/CMakeFiles/nbody_cuda.dir/simulator.cu.o
[100%] Linking CXX executable ../../nbody_cuda
[100%] Built target nbody_cuda
Scanning dependencies of target release
[100%] Built target release

So OpenGL libs are no longer required!

However, I encountered the following error when running the compiled binary:

$ ./scripts/run_nbody.sh -b cuda 100 10  
GPUassert: initialization error /scratch/optpar01/work/2024/cuda-to-sycl-nbody/src/simulator.cuh 94

Looking at the relevant line of simulator.cuh, it is just a standard cudaMalloc:

 92   ParticleData_d(size_t n) {
 93     // Allocate device memory for particle coords & velocity...
 94     gpuErrchk(cudaMalloc((void **)&x, sizeof(coords_t) * n));
 95     gpuErrchk(cudaMalloc((void **)&y, sizeof(coords_t) * n));
 96     gpuErrchk(cudaMalloc((void **)&z, sizeof(coords_t) * n));
 97   };

I tried a smaller system size as well, but the error persists (we have 40 GB of memory).
Do you have any insight into this issue?

DuncanMcBain commented on May 24, 2024

Hi @vitduck,

We won't really be able to help with the pure CUDA version of the code (we didn't write it), but if you're able to try the SYCL version we'd be happy to help with that!

vitduck commented on May 24, 2024

Duncan,
Sorry for the oversight on my part. The aforementioned CUDA error was due to the MIG partitioning.
Both the CUDA and the SYCL-migrated codes can now be built and run without rendering.

Could you kindly confirm whether the following output is expected?
(If I understand correctly, the kernel time is measured in ms.)

Backend enumeration

$ sycl-ls 
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.6.0.22_223734]
[opencl:cpu:1] Intel(R) OpenCL, AMD EPYC 7543 32-Core Processor                 3.0 [2023.16.6.0.22_223734]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-80GB 8.8 [CUDA 11.6]

CUDA performance

$ ./nbody_cuda 50 10 0.999998 0.005 1.0e-7 2 10000
... 
At step 10000 kernel time is 15.4361 and mean is 15.435 and stddev is: 0.0853953

SYCL/CUDA performance

$ SYCL_DEVICE_FILTER=cuda ./nbody_dpcpp 50 10 0.999998 0.005 1.0e-7 2 10000
...
At step 10000 kernel time is 8.60655 and mean is 8.60897 and stddev is: 0.0694211

I would have expected rough parity between native CUDA and SYCL, with a slight edge for the former.
Instead, the result shows that SYCL/CUDA is nearly two times faster (mean kernel time of 8.61 versus 15.44).
I am not sure how to interpret this outcome.
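
As a quick sanity check on the two reported means, the ratio and the size of the gap relative to the reported spread can be worked out as below. This is only an illustrative sketch: `speedup` and `gap_in_stddevs` are hypothetical helpers, and comparing against the larger of the two standard deviations is a rough heuristic rather than a formal significance test.

```cpp
#include <algorithm>
#include <cmath>

// Ratio of two mean kernel times; values above 1 mean `candidate` is faster.
inline double speedup(double baseline, double candidate) {
  return baseline / candidate;
}

// Gap between the two means in units of the larger standard deviation,
// a rough indication of whether the difference exceeds run-to-run noise.
inline double gap_in_stddevs(double mean_a, double sd_a,
                             double mean_b, double sd_b) {
  return std::fabs(mean_a - mean_b) / std::max(sd_a, sd_b);
}
```

With the figures quoted above, `speedup(15.435, 8.60897)` comes to roughly 1.79, and `gap_in_stddevs(15.435, 0.0853953, 8.60897, 0.0694211)` is far above 1, so the difference is not explained by step-to-step jitter alone.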

DuncanMcBain commented on May 24, 2024

Hi @vitduck,
We have a section in the README (the last one) that covers performance. Back when we were working on this, we managed to get roughly the same results from CUDA and SYCL on a 3060 GPU. The software stack has changed since then, so it's hard to say exactly what might now be similar or different.

I'll check with a colleague; we might be able to send you some of our updated numbers. You could also use the NVIDIA Nsight Compute profiling tool to see if anything obvious is going on.
