
Comments (10)

arvoelke commented on June 12, 2024

(Note: this is a continuation of nengo/nengo#1050)

LouisCastricato commented on June 12, 2024

Also, perhaps this is a better example of OCL not avoiding branching. The branching present in this implementation would most likely become a significant bottleneck at high enough throughput.

hunse commented on June 12, 2024

Those all sound like good ideas!

One thing to keep in mind is the types of models that we typically run, and the types of operations that are most prevalent. One good example is the circular convolution benchmark; profile_circconv.py in the same folder runs just one simulation with profiling. In that network, the GEMV operations are far and away the most expensive, so I think anything that can speed them up will lead to significant overall improvements. The neuron step functions are also reasonably costly, so removing some of the branching there could help.

As you pointed out, it's also important to maintain support for many types of GPUs. For example, if we wanted to add dynamic parallelism, I'd want it done in such a way that a) GPUs that don't support it can just run using what we've got now, and b) the added complexity is factored out as much as possible, so that one doesn't have to understand dynamic parallelism to understand how things are running on basic GPUs.
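
For what it's worth, in CUDA terms that factoring can be as simple as a host-side capability check that picks a kernel variant at plan time (a minimal sketch with hypothetical kernel names; nengo-ocl itself would need the OpenCL equivalent of this query):

```cuda
#include <cuda_runtime.h>

__global__ void step_basic(float *x, int n)   { /* existing flat kernel */ }
__global__ void step_dynamic(float *x, int n) { /* variant using child launches */ }

void launch_step(float *x, int n)
{
    int device = 0, major = 0, minor = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);

    // Dynamic parallelism requires compute capability 3.5+.
    // Older GPUs quietly fall back to the existing kernel.
    if (major > 3 || (major == 3 && minor >= 5))
        step_dynamic<<<(n + 255) / 256, 256>>>(x, n);
    else
        step_basic<<<(n + 255) / 256, 256>>>(x, n);
}
```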

LouisCastricato commented on June 12, 2024

Yeah, and to add to that, I think we can even drop the requirements a bit. Imagine running nengo-ocl on a Raspberry Pi or a Parallella. If we were to implement an APU mode and support precomputing the kernel call list, then we would probably have no problem supporting a bunch of odd and interesting devices.

I'd even be down for doing some research into how Nengo performs on a huge cluster of Raspberry Pis, or a Beowulf cluster of ARM processors.

LouisCastricato commented on June 12, 2024

Also, I've implemented most of the neuron step functions without branching in CUDA. I'll look into porting them over this week.

LouisCastricato commented on June 12, 2024

Do you mind explaining what GEMV does? Maybe it's because I'm fairly new to computational neuroscience, but I tried reading through the file and got lost rather quickly.

Edit:
I decided it was in my best interest to audit the performance of your code and provide some hopefully useful feedback.

I started looking through gemv. I think the first challenge is figuring out exactly how many instructions each of these branches executes. Branching is okay as long as the branches perform the same instructions in the same order, just on different operands. E.g.:

if (x > y) return 0; return 1;

is okay, since both branches do the same instruction with different values; the divergence here is essentially equivalent to having no branching at all.
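
To make the distinction concrete, here's a CUDA-flavored sketch (illustrative only, not code from the repo):

```cuda
// Harmless "branch": both sides are the same instruction on different
// values, so the compiler emits a predicated select (selp) and the
// warp never actually diverges.
__device__ int cmp(float x, float y)
{
    return (x > y) ? 0 : 1;   // equivalent to: return x <= y;
}

// Real divergence: the two paths execute different instructions, so a
// warp containing both cases runs the paths serially, idling part of
// its threads each time.
__device__ float diverge(float x)
{
    if (x > 0.0f)
        return expf(x);
    return logf(-x);
}
```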

Just from a brief look through, you can probably expect a 20-30% reduction in the number of wasted cycles by properly optimizing gemv. I don't know how large a performance increase that translates into, but probably more than 3%, which is significant in the long run.

TODO:

element_wise_inc in plan should use shared memory. It also needs to be slightly redesigned to avoid stalling while loading shared memory.
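
For reference, the core operation is just Y[i] += A[i] * X[i], so a minimal grid-stride version looks like the following (a sketch that ignores the broadcasting cases the real plan handles; since each element is touched exactly once, whether shared-memory staging helps will depend on the access pattern):

```cuda
// Grid-stride element-wise increment: Y[i] += A[i] * X[i].
// Adjacent threads read adjacent elements, so the global memory
// traffic is fully coalesced.
__global__ void elementwise_inc(float *Y, const float *A,
                                const float *X, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        Y[i] += A[i] * X[i];
}
```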

linearfilter needs to be scrapped and rewritten. I can't even begin to explain where its issues are. If you run Nsight (or some other GPU profiler) on it, you will see that only a tenth of the threads that were told to run it are active at any given point; it's running at 10% of its maximum efficiency.

Probes looks okay besides the first branch; that needs to be fixed. I would recommend breaking it up into two passes and using dynamic parallelism here. It seems like a ripe use case.

direct looks perfectly fine

LIF needs some work. It wastes too much memory every time the kernel is initialized. Luckily, I've already implemented a much more memory-efficient one in CUDA; I can port it over ASAP.

TODO: could precompute -expm1(-dtu / tau)

It's probably negligible, though; almost all performance issues on GPUs are memory- and/or divergence-related.
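
Concretely, since the timestep and time constant are fixed for a whole simulation, that factor can be hoisted out of the kernel and passed in as a constant (a sketch with illustrative names, not the actual plan code):

```cuda
#include <math.h>

// LIF voltage update: V += (J - V) * (1 - exp(-dt/tau)),
// using the identity -expm1f(-dt/tau) == 1 - exp(-dt/tau).
__global__ void lif_voltage(float *V, const float *J, float decay, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        V[i] += (J[i] - V[i]) * decay;
}

void step(float *V, const float *J, float dt, float tau, int n)
{
    // Computed once per simulation, not once per thread per step.
    float decay = -expm1f(-dt / tau);
    lif_voltage<<<(n + 255) / 256, 256>>>(V, J, decay, n);
}
```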

lif_rate looks ok

template is a bit interesting lol

rng looks fine

White noise needs some work. Nsight was giving an efficiency rating of about 40%. That's okay if it isn't called very often, but I don't think that's the case.

I have no idea what present input does, but that's a LOT of memory loading for half a dozen math instructions. It may need some revision.

Why not use a professionally made convolution implementation? Yours doesn't seem to do anything special compared to standard 2D convolution, so perhaps you should use an established library? Not to discredit you, obviously, but it's fairly likely that a library implementation would be significantly faster.

Same goes for pool2d


Back to gemv

With the block dot product you had the right idea but the wrong tactic. I need to look over it a bit more, but you need to use:

1. Shared memory.
2. Atomic block reductions (e.g., the GPU shuffle ♫).
3. No branching; it can easily be removed here, and I think it'll make a huge performance difference.
4. A 3D layout instead of a 2D one, to get rid of that for loop.

This is typically how a performance-oriented matrix multiplication is implemented in CUDA, so it should provide large benefits here too. Alternatively, there may be ways to unroll the for loop when building the network. A sketch of points 2 and 3 follows.
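
A minimal warp-per-row GEMV with a shuffle reduction might look like this (illustrative CUDA, not the current nengo-ocl plan, which is OpenCL and handles many small matrices at once):

```cuda
// One warp per output row: y[row] = dot(A[row, :], x).
// Lanes stride across the row (coalesced loads), keep partial sums
// in registers, and reduce with warp shuffles, with no shared memory
// and no divergent branching in the reduction.
__global__ void gemv_warp_per_row(const float *A, const float *x,
                                  float *y, int m, int n)
{
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= m) return;  // whole warps exit together, so no divergence

    float sum = 0.0f;
    for (int j = lane; j < n; j += 32)
        sum += A[row * n + j] * x[j];

    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[row] = sum;
}
```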

LouisCastricato commented on June 12, 2024

Also, I want to move signals to F16 rather than F32: lower memory bandwidth requirements, and I don't think it'll make much of a difference in the end result, since it's so noisy to start with anyway.

hunse commented on June 12, 2024

> Also, I want to move signals to F16 rather than F32: lower memory bandwidth requirements, and I don't think it'll make much of a difference in the end result, since it's so noisy to start with anyway.

I would never hard-code this, but it would be great to have as an option. There's actually an issue in Nengo about changing signal dtypes.
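
For reference, the usual compromise is to store signals in half precision but do the arithmetic in single precision (a CUDA sketch; in nengo-ocl this would go through the cl_khr_fp16 OpenCL extension instead):

```cuda
#include <cuda_fp16.h>

// Signals stored as half (halving the memory traffic), arithmetic done
// in float so the accumulation keeps single precision.
__global__ void axpy_f16(__half *y, const __half *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float yi = __half2float(y[i]) + a * __half2float(x[i]);
        y[i] = __float2half(yi);
    }
}
```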

> Do you mind explaining what GEMV does?

It's a general matrix-vector multiply: the BLAS GEMV operation y ← αAx + βy.
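
In reference form (a plain loop, just to show what the kernels are computing):

```cuda
// Reference GEMV: y = alpha * A * x + beta * y, with A an m-by-n matrix.
void gemv_ref(int m, int n, float alpha, const float *A,
              const float *x, float beta, float *y)
{
    for (int i = 0; i < m; ++i) {
        float dot = 0.0f;
        for (int j = 0; j < n; ++j)
            dot += A[i * n + j] * x[j];
        y[i] = alpha * dot + beta * y[i];
    }
}
```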

> White noise needs some work. Nsight was giving an efficiency rating of about 40%. That's okay if it isn't called very often, but I don't think that's the case.

That definitely is the case. We don't use it very much right now.

GEMV and the LIF steps are far and away the main culprits. Also, there are actually a lot of copies; even though the copy kernel itself is not bad, because the copies aren't getting grouped together, they result in a lot of kernel calls and thus a lot of overhead.
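
One standard way to group them is to batch all the small copies into a single launch by passing arrays of offsets and lengths (a hedged sketch, assuming the signals live in one packed buffer; not the current implementation):

```cuda
// One launch performs many small copies: copy k moves lengths[k] floats
// from src + src_off[k] to dst + dst_off[k]. One block per copy keeps
// the indexing trivial; launch with gridDim.x == number of copies.
__global__ void batched_copy(float *dst, const float *src,
                             const int *dst_off, const int *src_off,
                             const int *lengths)
{
    int k = blockIdx.x;
    for (int i = threadIdx.x; i < lengths[k]; i += blockDim.x)
        dst[dst_off[k] + i] = src[src_off[k] + i];
}
```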

from nengo-ocl.

LouisCastricato commented on June 12, 2024

Hmm, okay. The first thing I'm going to do is look into cleaning up LIF and managing the copy calls. I think solving the issues with the copy calls is more of a CPU-bound problem than a GPU-bound one. Perhaps look into multithreading the creation of the copy calls?

The kernel is fine, I agree. I can't really find any issues with it.

Fixing GEMV may take a very long time; it was only recently that NVIDIA and AMD began designing their card architectures to better handle reductions like the ones GEMV requires. This is going to take some serious work, but it looks like a fun challenge :)

from nengo-ocl.

drasmuss commented on June 12, 2024

Closing this since it's several years old at this point, but there are still good ideas in this thread for anyone who wants to take them on!
