thresholder should use the parallel spikespace (brian2cuda, closed)

brian-team commented on June 11, 2024

thresholder should use the parallel spikespace

from brian2cuda.

Comments (2)

moritzaugustin commented on June 11, 2024

I think @Kwartke already implemented this, so it might be worth scanning the old branches.

denisalevi commented on June 11, 2024

It turns out that using atomicAdd on shared memory instead of global memory in the thresholder kernel is slower (on a GTX Titan Black). The two implementations were compared in a performance plot for N=10000 neurons, with every n-th neuron spiking in each time step.
The time measurements were taken with the nvprof command line profiler and are averages over 10 kernel calls.
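For readers who want to see the two variants side by side, here is a minimal sketch. It is not the brian2cuda code: the spikespace layout (N neuron slots followed by one counter), the kernel names, the membrane-potential array `v`, the threshold `v_th` and the host helper `run_thresholder` are all assumptions made for illustration, and the shared-memory variant shows only one common way of moving the per-spike atomic into shared memory. It assumes N > 0.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Assumed (hypothetical) layout: spikespace holds N + 1 ints. Entries
// [0, n_spikes) are indices of spiking neurons, entry N is the spike counter.

// Variant 1: every spiking thread does one atomicAdd on the global counter.
__global__ void thresholder_global(const float* v, float v_th, int N, int* spikespace)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && v[i] > v_th) {
        int pos = atomicAdd(&spikespace[N], 1);  // global atomic per spike
        spikespace[pos] = i;
    }
}

// Variant 2: spiking threads do an atomicAdd on a shared counter; one thread
// per block then reserves the block's slice with a single global atomicAdd.
__global__ void thresholder_shared(const float* v, float v_th, int N, int* spikespace)
{
    __shared__ int block_count;   // number of spikes in this block
    __shared__ int block_offset;  // start of this block's slice in spikespace
    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool spiked = (i < N) && (v[i] > v_th);
    int local_pos = 0;
    if (spiked) local_pos = atomicAdd(&block_count, 1);  // shared atomic per spike
    __syncthreads();

    if (threadIdx.x == 0)
        block_offset = atomicAdd(&spikespace[N], block_count);  // global atomic per block
    __syncthreads();

    if (spiked) spikespace[block_offset + local_pos] = i;
}

// Hypothetical host helper: runs one variant and returns the spiking indices,
// sorted because the order in which atomics are served is not deterministic.
std::vector<int> run_thresholder(const std::vector<float>& v, float v_th, bool use_shared)
{
    const int N = static_cast<int>(v.size());
    float* d_v = nullptr;
    int* d_spikespace = nullptr;
    cudaMalloc(&d_v, N * sizeof(float));
    cudaMalloc(&d_spikespace, (N + 1) * sizeof(int));
    cudaMemcpy(d_v, v.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_spikespace, 0, (N + 1) * sizeof(int));  // also resets the counter

    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    if (use_shared)
        thresholder_shared<<<blocks, threads>>>(d_v, v_th, N, d_spikespace);
    else
        thresholder_global<<<blocks, threads>>>(d_v, v_th, N, d_spikespace);

    std::vector<int> h(N + 1);
    cudaMemcpy(h.data(), d_spikespace, (N + 1) * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_v);
    cudaFree(d_spikespace);

    std::vector<int> spikes(h.begin(), h.begin() + h[N]);
    std::sort(spikes.begin(), spikes.end());
    return spikes;
}
```

Both variants should report the same set of spiking neurons; only the number and location of the atomic operations differ, which is what the measurements in this comment compare.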

Timing only the atomicAdd instructions within the kernel, using clock() in the kernel code (for the case of all neurons spiking), shows the same result for the two implementations:

using shared atomics takes ~35.5 us per kernel call
using global atomics takes ~9.6 us per kernel call

The code for the time measurements and instructions for reproducing them can be found in the dev/issues/issue9_spikespace folder (commit dee9bf7).
clock() returns the value of a per-multiprocessor counter that is incremented every clock cycle, so the measured time is the time the device spends executing the thread (including waiting for and replaying conflicting atomics) (as explained here).
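As a rough illustration of this kind of in-kernel timing, the sketch below brackets a single global atomicAdd with cycle-counter reads. It is not the code from the dev/issues/issue9_spikespace folder: the kernel name, the `AtomicTiming` struct and both host helpers are hypothetical, and it uses clock64() in place of clock() to avoid counter wraparound. How closely the cycle counts reflect the latency of the atomic depends on how the compiler schedules the instructions, so the numbers are indicative only. It assumes N > 0.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical result type: final counter value and mean cycles per thread.
struct AtomicTiming {
    int counter;
    double mean_cycles;
};

// Every thread increments one global counter (the "all neurons spiking" case)
// and records how many cycles elapsed around the atomicAdd.
__global__ void time_global_atomic(int* counter, int* slots, long long* cycles, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    long long start = clock64();
    int old = atomicAdd(counter, 1);
    long long stop = clock64();
    slots[i] = old;             // use the returned value so it is not discarded
    cycles[i] = stop - start;
}

// Hypothetical host helper: launches the kernel and averages the cycle counts.
AtomicTiming time_atomics(int N)
{
    int* d_counter = nullptr;
    int* d_slots = nullptr;
    long long* d_cycles = nullptr;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMalloc(&d_slots, N * sizeof(int));
    cudaMalloc(&d_cycles, N * sizeof(long long));
    cudaMemset(d_counter, 0, sizeof(int));

    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    time_global_atomic<<<blocks, threads>>>(d_counter, d_slots, d_cycles, N);

    int counter = 0;
    std::vector<long long> cycles(N);
    cudaMemcpy(&counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(cycles.data(), d_cycles, N * sizeof(long long), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    cudaFree(d_slots);
    cudaFree(d_cycles);

    double sum = 0.0;
    for (long long c : cycles) sum += static_cast<double>(c);
    return AtomicTiming{counter, sum / N};
}

// Converts a cycle count to microseconds using the device's peak clock rate,
// which the runtime reports in kHz.
double cycles_to_us(double cycles, int device = 0)
{
    int khz = 0;
    cudaDeviceGetAttribute(&khz, cudaDevAttrClockRate, device);
    return cycles * 1000.0 / khz;
}
```

Calling `time_atomics(10000)` and passing `mean_cycles` to `cycles_to_us` gives a per-thread estimate; it is not expected to reproduce the per-kernel figures quoted above.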

This NVIDIA blog post gives a good explanation and compares shared and global atomics for a similar use case:

Kepler emulates shared memory atomics in software [...].

However the Maxwell architecture features hardware support for shared memory atomics and we can clearly see that in all cases the shared atomics version performs best.

This explains the results of our performance measurements (the GTX Titan Black is a Kepler GPU). The performance gain with shared atomics on Maxwell architectures in the blog post above is also only a factor of 2 (independent of the number of conflicting atomics).

Therefore, we keep the implementation using global atomics. Optional shared atomics for Maxwell architectures could be considered at a later development stage.
The implementation using shared atomics can be found in the issue9_spikespace branch (eb215d4).

Closing the issue.
