tugrul512bit / libgpgpu

Multi-GPU & CPU OpenCL kernel executor with load-balancing as if there is one big GPU.

License: MIT License

C++ 52.02% C 43.60% CMake 4.38%

gpgpu heterogeneous-computing load-balancing multi-device multi-gpu opencl high-performance low-latency parallel-computing

Forkers: mindrunner, mfkiwl

libgpgpu's Issues

Do scope tests for host parameters and Computer instance.

The internal data of parameter objects must not be destroyed before the Computer object is. The same applies to worker objects and their threads.

Test with different scenarios:

Inside a class, as fields.
Inside vectors
Outside but different scopes
Static allocations
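
A minimal sketch of what such a scope test could check, assuming mock types in place of the real classes. MockParameter, MockComputer, Holder and both functions are hypothetical names, not part of libgpgpu. It relies only on standard C++ rules: class fields and local variables are destroyed in reverse order of declaration, so declaring the parameter before the computer keeps the parameter's data alive until the computer is gone.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the library's parameter and Computer types.
// Each one only records its own destruction in a shared log.
struct MockParameter {
    std::vector<std::string>* log;
    ~MockParameter() { log->push_back("parameter"); }
};

struct MockComputer {
    std::vector<std::string>* log;
    ~MockComputer() { log->push_back("computer"); }
};

// Scenario "inside a class, as fields": fields are destroyed in reverse
// declaration order, so the computer is destroyed before the parameter.
struct Holder {
    MockParameter parameter;
    MockComputer computer;
};

std::vector<std::string> destructionOrderAsFields() {
    std::vector<std::string> log;
    {
        Holder h{ {&log}, {&log} };
        (void)h;
    }
    return log;
}

// Scenario "same scope, local variables": locals are also destroyed in
// reverse order of construction.
std::vector<std::string> destructionOrderAsLocals() {
    std::vector<std::string> log;
    {
        MockParameter parameter{&log};
        MockComputer computer{&log};
    }
    return log;
}
```

Swapping the two declarations in either scenario reverses the logged order, which is the unsafe case the issue describes.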

Add kernel generator method to create kernel headers that combine tens of small arrays into a single big array.

Some devices cannot allocate a single array bigger than 128 MB. With this feature, all free VRAM (for example 16 GB) can be used at once. It only requires power-of-2 sized sub-arrays and log(n) indexing steps to select an element, where n is the number of sub-arrays. As long as all work-items do similar indexing, it should run fast.

I don't know how else an OpenCL 1.2 kernel can access the whole VRAM through a single buffer parameter.
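
The indexing idea can be sketched on the host side in plain C++. This is only an illustration under assumptions: splitIndex and loadFrom4 are hypothetical names, the element type is float, and four equally sized power-of-2 sub-arrays stand in for the "tens" of buffers a generated kernel header would receive. The flat index is split with a shift and a mask, and the sub-array is chosen in log2(4) = 2 binary selection steps.

```cpp
#include <cstddef>

// Position of an element inside the virtual big array.
struct ChunkIndex {
    std::size_t chunk;   // which sub-array
    std::size_t offset;  // element inside that sub-array
};

// Splits a flat index, assuming every sub-array holds 2^log2ChunkSize elements.
inline ChunkIndex splitIndex(std::size_t flatIndex, unsigned log2ChunkSize) {
    const std::size_t mask = (std::size_t{1} << log2ChunkSize) - 1;
    return { flatIndex >> log2ChunkSize, flatIndex & mask };
}

// Reads one element from four sub-arrays as if they were one array.
// Two binary selections pick the sub-array: bit 0 first, then bit 1.
inline float loadFrom4(const float* a0, const float* a1,
                       const float* a2, const float* a3,
                       std::size_t flatIndex, unsigned log2ChunkSize) {
    const ChunkIndex ci = splitIndex(flatIndex, log2ChunkSize);
    const float* lowPair  = (ci.chunk & 1) ? a1 : a0;
    const float* highPair = (ci.chunk & 1) ? a3 : a2;
    const float* selected = (ci.chunk & 2) ? highPair : lowPair;
    return selected[ci.offset];
}
```

A generated kernel header would presumably emit the same selection tree in OpenCL C, with one extra selection level each time the number of sub-arrays doubles.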

Use of CL_MEM_USE_HOST_PTR is undefined behavior when multiple device buffers use the same host pointer.

If a buffer has CL_MEM_USE_HOST_PTR, remove the buffer from the worker array and use it directly on the HostParameter for mapping/unmapping.
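
One way to find the affected buffers is sketched below, assuming a simple description of each device buffer. DeviceBufferInfo and findSharedHostPointers are hypothetical and not taken from libgpgpu. The function only reports which host pointers back more than one CL_MEM_USE_HOST_PTR buffer; it does not touch OpenCL itself.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical description of one device-side buffer.
struct DeviceBufferInfo {
    const void* hostPtr;  // host memory backing the buffer
    bool usesHostPtr;     // true if created with CL_MEM_USE_HOST_PTR
};

// Returns every host pointer referenced by more than one
// CL_MEM_USE_HOST_PTR buffer, which is the case the issue calls unsafe.
std::vector<const void*> findSharedHostPointers(
        const std::vector<DeviceBufferInfo>& buffers) {
    std::map<const void*, std::size_t> useCount;
    for (const DeviceBufferInfo& b : buffers) {
        if (b.usesHostPtr) {
            ++useCount[b.hostPtr];
        }
    }
    std::vector<const void*> shared;
    for (const auto& entry : useCount) {
        if (entry.second > 1) {
            shared.push_back(entry.first);
        }
    }
    return shared;
}
```

Buffers reported here are the candidates for being collapsed into a single buffer owned by the host parameter, as the issue proposes.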

Benchmark parameters whose sizes are multiples of 4096 bytes for I/O performance

If they are not fast, add an explicit pinning option for fast device I/O.
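
For the benchmark, parameter sizes have to be brought to a multiple of 4096 bytes first. A small sketch of that arithmetic follows; kAlignBytes, roundUpToAlign and isAlignedSize are hypothetical helper names, and 4096 is simply the value named in the issue.

```cpp
#include <cstddef>

// Size granularity named in the issue.
constexpr std::size_t kAlignBytes = 4096;

// Rounds a byte count up to the next multiple of kAlignBytes.
constexpr std::size_t roundUpToAlign(std::size_t bytes) {
    return (bytes + kAlignBytes - 1) / kAlignBytes * kAlignBytes;
}

// True if the byte count is already a multiple of kAlignBytes.
constexpr bool isAlignedSize(std::size_t bytes) {
    return bytes % kAlignBytes == 0;
}
```

A benchmark could then time the same transfer once with the raw size and once with roundUpToAlign(size) to see whether the rounded size makes a measurable difference.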

Add comments, make unnecessary struct fields private, and move the implementations from .h files into .cpp files.

Add caching for host buffers that are bigger than RAM.

Backing store: SSD with 3GB/s bandwidth, serialized access from all threads.

L2: combined VRAM of all devices, at PCIe bandwidth (so four Titans could make a good cache layer) but with high latency; LRU eviction.

L1: RAM that has 60+ GB/s for DDR5 (even more if data fits into CPU cache), direct-mapped.

This works only with dynamic load balancing and a static chunk size, and only on OpenCL 2.x. The kernel has to express its memory-region requests either through atomic signaling (only on RAM-sharing devices) or through periodic buffer copies (on normal devices).
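
The direct-mapped L1 layer can be sketched as follows. This is a simplified model under assumptions: DirectMappedCache is a hypothetical class, the backing store is an in-memory vector standing in for the SSD, elements are int, and the backing size is a multiple of the chunk size. Each chunk maps to exactly one slot (chunk index modulo slot count), and a dirty slot is written back when another chunk evicts it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical direct-mapped cache of fixed-size chunks in front of a
// slower backing store (modelled here as a plain vector).
class DirectMappedCache {
public:
    DirectMappedCache(std::vector<int>& backing, std::size_t chunkElems,
                      std::size_t numSlots)
        : backing_(backing), chunkElems_(chunkElems),
          slots_(numSlots), misses_(0) {}

    int read(std::size_t index) {
        return load(index / chunkElems_).data[index % chunkElems_];
    }

    void write(std::size_t index, int value) {
        Slot& s = load(index / chunkElems_);
        s.data[index % chunkElems_] = value;
        s.dirty = true;  // written back on eviction or flush
    }

    // Writes every dirty slot back to the backing store.
    void flush() {
        for (Slot& s : slots_) writeBack(s);
    }

    std::size_t misses() const { return misses_; }

private:
    struct Slot {
        bool valid = false;
        bool dirty = false;
        std::size_t chunk = 0;
        std::vector<int> data;
    };

    // Returns the slot holding the chunk, fetching it on a miss.
    Slot& load(std::size_t chunk) {
        Slot& s = slots_[chunk % slots_.size()];
        if (!s.valid || s.chunk != chunk) {
            writeBack(s);  // evict the previous occupant first
            const std::size_t begin = chunk * chunkElems_;
            s.data.assign(backing_.begin() + begin,
                          backing_.begin() + begin + chunkElems_);
            s.chunk = chunk;
            s.valid = true;
            ++misses_;
        }
        return s;
    }

    void writeBack(Slot& s) {
        if (s.valid && s.dirty) {
            std::copy(s.data.begin(), s.data.end(),
                      backing_.begin() + s.chunk * chunkElems_);
        }
        s.dirty = false;
    }

    std::vector<int>& backing_;
    std::size_t chunkElems_;
    std::vector<Slot> slots_;
    std::size_t misses_;
};
```

The L2 layer described above would use LRU replacement instead of the fixed modulo mapping; that part is not shown here.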

Add kernel-chaining and kernel-repeating features to decrease unnecessary buffer copies

Good for:

Reduction algorithms.

Complex algorithms where the output of one kernel is used directly by another kernel.

State machines with temporary/non-host arrays.
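
To illustrate why chaining saves copies, here is a host-only model, not the library's API. Stage, TransferCount, runChained and halveBySum are hypothetical names, and a std::vector stands in for a device-resident buffer. All stages run on the same buffer, so only one copy in each direction is counted no matter how many stages are chained. Repeating halveBySum models the reduction case and assumes a power-of-2 input length.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// A stage models one kernel launch that works in place on device memory.
using Stage = std::function<void(std::vector<float>&)>;

// Counts modelled host<->device copies.
struct TransferCount {
    std::size_t toDevice = 0;
    std::size_t toHost = 0;
};

// Runs all stages back to back on one device-resident buffer, copying
// the data in once before the first stage and out once after the last.
std::vector<float> runChained(const std::vector<float>& host,
                              const std::vector<Stage>& stages,
                              TransferCount& count) {
    std::vector<float> device = host;  // single upload
    ++count.toDevice;
    for (const Stage& stage : stages) {
        stage(device);                 // intermediate results stay on the device
    }
    ++count.toHost;                    // single download
    return device;
}

// One reduction pass: adds the upper half onto the lower half and
// shrinks the buffer. Assumes an even number of elements.
void halveBySum(std::vector<float>& buf) {
    const std::size_t half = buf.size() / 2;
    for (std::size_t i = 0; i < half; ++i) {
        buf[i] += buf[i + half];
    }
    buf.resize(half);
}
```

Without chaining, each of the repeated passes would need its own upload and download, so the number of copies would grow with the number of stages.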

Add caching for kernel compilations to decrease compile times for duplicated devices.

Good for initialization performance and development time.
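
A possible shape for such a cache is sketched below, assuming that duplicated devices report the same device name and can therefore share one compilation result. KernelCache and CompiledKernel are hypothetical, and the compile step is injected as a callback so that the sketch does not depend on OpenCL. In a real implementation the cached value might be a program binary obtained through clGetProgramInfo with CL_PROGRAM_BINARIES and reloaded with clCreateProgramWithBinary, but that is an assumption, not something the issue specifies.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a compiled program.
struct CompiledKernel {
    std::string deviceName;
    std::string source;
};

// Caches compilation results keyed by (device name, kernel source).
class KernelCache {
public:
    using Compiler =
        std::function<CompiledKernel(const std::string&, const std::string&)>;

    explicit KernelCache(Compiler compile) : compile_(std::move(compile)) {}

    // Compiles only on the first request for a given key.
    const CompiledKernel& get(const std::string& deviceName,
                              const std::string& source) {
        const auto key = std::make_pair(deviceName, source);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            it = cache_.emplace(key, compile_(deviceName, source)).first;
        }
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    Compiler compile_;
    std::map<std::pair<std::string, std::string>, CompiledKernel> cache_;
};
```

With two identical GPUs and one different device, the same kernel source would be compiled twice instead of three times.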

Device buffers with read=true, write=true and readAll=true should be read/written in sync with all devices.

It should work like this:

copy all read+write buffers
sync
copy all inputs --> run all kernels --> copy all outputs
sync
copy all read+write buffers

This ordering is needed because if a device writes its result before another device reads its input (or worse, while a RAM-sharing device is working directly on the host buffer), the results are undefined.

But when a buffer is read-only or write-only, all non-RAM-sharing devices can work independently (overlapping read-only, or non-overlapping write-only/read-only). Reading and writing the same buffer is also fine when the readAll flag is not set.

If two devices share RAM, their mapping should be unified on a single device buffer to avoid OpenCL-side undefined behavior (todo).
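
The rule described in this issue can be condensed into a small predicate. This is a sketch under assumptions: BufferFlags and needsGlobalSync are hypothetical names, and the three booleans mirror the read, write and readAll settings named in the issue title.

```cpp
// Hypothetical per-buffer settings, named after the flags in the issue.
struct BufferFlags {
    bool read;     // devices read the buffer as kernel input
    bool write;    // devices write kernel output into the buffer
    bool readAll;  // every device reads the whole buffer, not only its own part
};

// True if the buffer must be copied in the synchronized phases, i.e. when
// every device both reads the whole buffer and writes into it.
inline bool needsGlobalSync(const BufferFlags& f) {
    return f.read && f.write && f.readAll;
}
```

Buffers for which this returns false can stay in the independent "copy all inputs, run all kernels, copy all outputs" phase of each device.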
