libgpgpu's Introduction

libgpgpu's People

Contributors

tugrul512bit

Forkers

mindrunner mfkiwl

libgpgpu's Issues

Do scope tests for host parameters and Computer instance.

The internal data of parameter objects must not be destroyed before the Computer object that uses them. The same applies to worker objects and their threads.

Test with different scenarios:

  • Inside a class, as fields
  • Inside vectors
  • Outside a class, in different scopes
  • Static allocations
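
One way such a scope test could look is sketched below. This is not libgpgpu's actual implementation: `Buffer`, `HostParameter`, `Computer` and `innerScopeScenario` are hypothetical stand-ins, and the sketch assumes the internal data is held through shared ownership so that destruction order stops mattering.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the internal data of a host parameter.
struct Buffer {
    std::vector<float> data;
    explicit Buffer(std::size_t n) : data(n, 0.0f) {}
};

// Hypothetical parameter handle: owns its buffer through a shared_ptr.
class HostParameter {
public:
    explicit HostParameter(std::size_t n) : buf_(std::make_shared<Buffer>(n)) {}
    std::shared_ptr<Buffer> internal() const { return buf_; }
private:
    std::shared_ptr<Buffer> buf_;
};

// Hypothetical computer: keeps a reference to every buffer it was given,
// so a buffer outlives its parameter handle if the handle dies first.
class Computer {
public:
    void bind(const HostParameter& p) { bound_.push_back(p.internal()); }
    float read(std::size_t param, std::size_t i) const {
        assert(param < bound_.size());
        return bound_[param]->data.at(i);
    }
    void write(std::size_t param, std::size_t i, float v) {
        assert(param < bound_.size());
        bound_[param]->data.at(i) = v;
    }
private:
    std::vector<std::shared_ptr<Buffer>> bound_;
};

// Scenario "different scopes": the parameter is destroyed before the computer.
inline float innerScopeScenario() {
    Computer c;
    {
        HostParameter p(16);
        c.bind(p);
    } // p is destroyed here; its buffer is still alive inside c
    c.write(0, 3, 42.0f);
    return c.read(0, 3);
}
```

The same pattern can be repeated for the other scenarios (fields of a class, elements of a vector, static objects) by changing only where `HostParameter` and `Computer` are declared.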

Add a kernel generator method that creates kernel headers combining tens of small arrays into a single big array.

Some devices cannot allocate arrays bigger than 128 MB. With this feature, all free VRAM (for example, 16 GB) can be used at once. It only requires power-of-2 sized sub-arrays and O(log n) indexing steps to select an element, where n is the number of sub-arrays. As long as all work-items do similar indexing, it should be fast.

I don't know how else an OpenCL 1.2 kernel can access the whole VRAM from a single buffer parameter.
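
A minimal sketch of what such a generator could emit, assuming float elements and sub-arrays passed as separate kernel arguments. The names `genSelect`, `genLoadFunction`, `loadBig` and `a0`, `a1`, ... are made up for illustration and are not part of libgpgpu. The sub-array is picked by a binary decision tree, which gives roughly log2(n) comparisons per access.

```cpp
#include <cassert>
#include <string>

// Emits an OpenCL C expression that reads element `off` of sub-array `id`
// among a<first> .. a<first+count-1>, using a binary decision tree.
inline std::string genSelect(int first, int count) {
    assert(count >= 1);
    if (count == 1) return "a" + std::to_string(first) + "[off]";
    int half = count / 2;
    return "((id < " + std::to_string(first + half) + ") ? "
         + genSelect(first, half) + " : "
         + genSelect(first + half, count - half) + ")";
}

// Emits a helper function for a kernel header. `log2Size` is the log2 of the
// (power-of-2) element count of each sub-array.
inline std::string genLoadFunction(int numArrays, int log2Size) {
    assert(numArrays >= 1 && log2Size >= 0 && log2Size < 31);
    std::string src = "float loadBig(ulong i";
    for (int k = 0; k < numArrays; ++k)
        src += ", __global const float* a" + std::to_string(k);
    src += ") {\n";
    // High bits pick the sub-array, low bits pick the element inside it.
    src += "    uint id = (uint)(i >> " + std::to_string(log2Size) + ");\n";
    src += "    uint off = (uint)(i & " + std::to_string((1u << log2Size) - 1u) + "u);\n";
    src += "    return " + genSelect(0, numArrays) + ";\n}\n";
    return src;
}
```

For example, `genLoadFunction(4, 20)` produces a `loadBig` function that treats four sub-arrays of 2^20 floats each as one array of 2^22 floats.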

Add caching for host buffers that are bigger than RAM.

Backing store: SSD with 3 GB/s bandwidth, serialized access from all threads.

L2: combined VRAM of all devices, reached at PCIe bandwidth (so four Titans can make a good cache layer) but with high latency; LRU eviction.

L1: RAM, with 60+ GB/s for DDR5 (even more if the data fits into CPU cache); direct-mapped.

This works only for dynamic load balancing with a static chunk size, and only on OpenCL 2.x. The kernel has to express which memory region it requests, either through atomic signaling (only on RAM-sharing devices) or, on normal devices, through periodic buffer copies.
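
As an illustration of the direct-mapped L1 layer only, here is a minimal host-side sketch. It is an assumption-laden model, not libgpgpu code: `DirectMappedCache` is a hypothetical class, a `std::vector` stands in for the SSD-backed store, and the L2 (VRAM, LRU) layer and thread safety are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical direct-mapped cache of fixed-size pages in front of a slow
// backing store. Page p always lands in slot (p % numSlots).
class DirectMappedCache {
public:
    DirectMappedCache(std::vector<float>& backing, std::size_t pageSize, std::size_t numSlots)
        : backing_(backing), pageSize_(pageSize), slots_(numSlots) {
        assert(pageSize > 0 && numSlots > 0);
        assert(backing.size() % pageSize == 0);
        for (auto& s : slots_) s.data.resize(pageSize);
    }
    ~DirectMappedCache() { flush(); }

    float get(std::size_t i) { return slotFor(i).data[i % pageSize_]; }
    void set(std::size_t i, float v) {
        Slot& s = slotFor(i);
        s.data[i % pageSize_] = v;
        s.dirty = true;
    }
    // Writes every modified page back to the backing store.
    void flush() { for (auto& s : slots_) writeBack(s); }
    std::size_t misses() const { return misses_; }

private:
    struct Slot {
        std::vector<float> data;
        std::size_t page = 0;
        bool valid = false;
        bool dirty = false;
    };
    void writeBack(Slot& s) {
        if (s.valid && s.dirty) {
            auto dst = backing_.begin() + static_cast<std::ptrdiff_t>(s.page * pageSize_);
            std::copy(s.data.begin(), s.data.end(), dst);
        }
        s.dirty = false;
    }
    Slot& slotFor(std::size_t i) {
        assert(i < backing_.size());
        std::size_t page = i / pageSize_;
        Slot& s = slots_[page % slots_.size()];
        if (!s.valid || s.page != page) {   // miss: evict, then load the page
            writeBack(s);
            auto src = backing_.begin() + static_cast<std::ptrdiff_t>(page * pageSize_);
            std::copy(src, src + static_cast<std::ptrdiff_t>(pageSize_), s.data.begin());
            s.page = page;
            s.valid = true;
            ++misses_;
        }
        return s;
    }
    std::vector<float>& backing_;
    std::size_t pageSize_;
    std::vector<Slot> slots_;
    std::size_t misses_ = 0;
};
```

Direct mapping keeps the lookup to one division and one modulo per access, which suits the fast RAM layer; the slower L2 layer can afford the bookkeeping that LRU needs.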

Device buffers with read=true, write=true and readAll=true should be read and written in sync across all devices.

It should work like this:

  • copy all read+write buffers
  • sync
  • copy all inputs --> run all kernels --> copy all outputs
  • sync
  • copy all read+write buffers

Otherwise, if a device writes its result before another device reads its input (or worse, while a RAM-sharing device is working directly on the host buffer), the results are undefined.
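
The ordering of the steps above can be sketched as follows. This is a sequential model that only records the order of operations; `Device`, `runStep` and the log strings are hypothetical, `sync` is a placeholder for a real barrier across all device queues, and reading the first read+write copy as host-to-device and the last one as device-to-host is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical device handle; only the name matters for this model.
struct Device {
    std::string name;
};

// Runs one compute step over all devices and returns the order of operations.
inline std::vector<std::string> runStep(const std::vector<Device>& devs) {
    std::vector<std::string> log;
    auto forAll = [&](const std::string& what) {
        for (const auto& d : devs) log.push_back(d.name + ": " + what);
    };
    // Placeholder: a real implementation would wait for every device queue here.
    auto sync = [&] { log.push_back("sync"); };

    forAll("copy read+write buffers to device");   // assumed direction
    sync();
    for (const auto& d : devs) {                   // devices are independent in this phase
        log.push_back(d.name + ": copy inputs");
        log.push_back(d.name + ": run kernels");
        log.push_back(d.name + ": copy outputs");
    }
    sync();
    forAll("copy read+write buffers to host");     // assumed direction
    return log;
}
```

The two `sync` points are what keep one device's write-back from overlapping with another device's read of the same host buffer.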

When a buffer is read-only or write-only, all non-RAM-sharing devices can work independently (overlapping read-only, or non-overlapping write-only/read-only). Reading and writing is also fine when the readAll flag is not set.
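
Under that reading, the decision reduces to a small predicate. The sketch below is only an illustration: `BufferFlags` and `needsGlobalSync` are hypothetical names, and only the three flags come from the text.

```cpp
#include <cassert>

// Hypothetical flag set mirroring the read / write / readAll options.
struct BufferFlags {
    bool read;
    bool write;
    bool readAll;
};

// A buffer forces a sync across all devices only when every device reads the
// whole buffer and also writes to it; every other combination can overlap.
inline bool needsGlobalSync(const BufferFlags& f) {
    return f.read && f.write && f.readAll;
}
```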

If two devices share RAM, their mappings should be unified onto a single device buffer to avoid undefined behavior on the OpenCL side. (todo)
