brian-team / brian2cuda Goto Github PK

A brian2 extension to simulate spiking neural networks on GPUs

Home Page: https://brian2cuda.readthedocs.io/

License: GNU General Public License v3.0

Python 64.47% C++ 6.63% Cuda 24.77% Makefile 0.26% Shell 3.00% C 0.28% Jupyter Notebook 0.53% Batchfile 0.05%

python biological-simulations brian brian2 code-generation computational-neuroscience differential-equations gpu gpu-acceleration neuroscience

brian2cuda's People

Contributors

Stargazers

Watchers

Forkers

gitter-badger hytang85 starhxh jakencool nigroup troymiao denisalevi edinburgher babyblue26 achuchmala sliorbar wisamreid

brian2cuda's Issues

spatialdateupdate kernel_*_integration requires too many registers

Using the example compartmental/rall.py in cuda_standalone mode on my GT 610 GPU the kernel execution fails due to CUDA error (during integration): too many resources requested for launch (when calling the *_integration kernel).
The problems seems to be the kernel register usage and the resolution is to remove (at least two) unused pointers from the kernel argument list. However, as the latter is generated in cuda_standalone/device.py and we either have to reduce the items in uses_variables within the respective template (and explicitly declare variables in kernels) or otherwise we need to have a way of explicitly excluding variables from the list %DEVICE_PARAMETERS% (and for %KERNEL_VARIABLES% similarily).

related with iss #4

improve spatialneuron implementation

make spatialneurons implementation work again
reduce 3 to 1 linear systems, cf. cpp_standalone
outline with link to code how to improve (separate issue) linear systems solution efficiency
add precalculation of the max. no of compartment for appropriate kernel grid sizing in the spatialstateupdater.cu template
implement everything which was lately added in the cpp_standalone version
solve the register usage issue #6

code finalization

more to come...

We compile for gpu architecture `sm_20` by default

Currently we have -arch=sm_20 as default nvcc argument. But if we benchmark performance, shouldn't we compile for the architecture that our GPU has (which is sm_35)?

@moritzaugustin any reason not to let the compiler choose the GPU architecture based on the GPU by default? The user can still change this with prefs.devices.cuda.extra_compile_args_nvcc (once I have implemented it, see #35).

IOError due to not written 'restuls/last_run_info.txt' in cuda_standalone mode

When running a brian2 script in cuda_standalone mode with compile=True and run=True arguments for device.build(), following error is thrown:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    device.build(directory='./', compile=True, run=True, debug=False)
  File "/mnt/antares_raid/home/denisalevi/projects/dev_brian2cuda/brian2cuda_repo/brian2cuda/device.py", line 586, in build
    self.run(directory, with_output, run_args)
  File "/home/denisalevi/anaconda2/envs/dev_b2c/lib/python2.7/site-packages/brian2/devices/cpp_standalone/device.py", line 797, in run
    last_run_info = open('results/last_run_info.txt', 'r').read()
IOError: [Errno 2] No such file or directory: 'results/last_run_info.txt'

This error can be reproduced with:

from brian2 import *
import brian2cuda                                                     

set_device('cuda_standalone')
device.build(directory='./', compile=True, run=True, debug=False)

@moritzaugustin could you check if you can reproduce this error?

The CUDAStandaloneDevice object calls the CPPStandaloneDevice.run() method (which it inherits) and that tries to open results/last_run_info.txt. In cpp_standalone mode, the created objects.cpp file writes the results/last_run_info.txt in its _write_arrays() function. In cuda_standalone mode, it doesn't.

So either override the run() method in the CUDAStandaloneDevice class or change the brian2cuda/templates/objects.cu template to write the 'restuls/last_run_info.txt' file as done in brian2/devices/cpp_standalone/templates/objects.cpp.

Check functionality of `results/last_run_info.txt' file!

Optionally collect profiling info inside kernel calls

In C++ standalone every codeobject records the time it needs for execution (using std::clock()). This info is then written to disk as results/profiling_info.txt. Feature tests and speed tests do collect profiling info (with our update to brian2.0) and raise an error if the file is not found. My workaround is, that _write_arrays() now creates a dummy profiling info file of the same format as in cpp_standalone with runtime = 0 for all codeobjects.

But it wouldn't harm to implement the profiling info just as cpp_standalone does (meaning each host side function call collects the time it needs for execution, with alle kernel calls being part of that). This could probably be done quite easily.

And then there is also the option to implement the same thing for each actual kernel call. The only thing is, that inside each kernel call we would need to add the value to global memory. This would mean one atomicAdd per thread and kernel call (or coalesced write and then some reduction algorithm). But either way, this increases simulation time unnecessary when profiling info isn't needed. So either we add something like

{{ % if profile % }}
// collect time
{{ % end if % }}

to the templates. Or we just leave it like it is and use nvprof for kernel time profiling, since that is what it does anyways. Guess that is better.

If we decide not to collect profiling info at all, we should overwrite CPPStandaloneDevice.get_profiling_info() to handle the not existence of the profiling_info.txt instead of writing a dummy file. That would be cleaner.

speed tests

initial version of the usual speed tests cpp_standalone vs cuda_standalone vs brian2genn
figures & relevant profiling output can be inserted here and the discussion can start :-)

in addition it would be good to have the Python speed (as well as feature) test scripts in an appropriate folder in the repository to allow easy repetition of the "experiments"

finish zero and const delay spike propagation mode

parallel implementation (multiple blocks and threads; blocked data structures)
put pre_size_by... in a register
detect the (seldom) case of multiple synapses for a neuronal pair and enforce 1 thread per block in this case (compared to multiple threads as above when for each pre neuron there is at most one connection to any post neuron)
optional: load spikespace (portion) into shared memory
tests (feature_tests and all suitable real examples)

copy labels from brian2 repository

@mstimberg could you please make available the labels (to allow tagging of issues) from https://github.com/brian-team/brian2 for this repository as well? in particular i would like to have labels for: priority, cleanup, enhancement but optimally just copy all labels here (I think there are tools available for that purpose)

push into synaptic queue in parallel per block

evaluate alternatives and implement after discussion the best:

on initilization phase calculate per pre neuron (in the resp., post neuron block): the number of outgoing synapsis with delay 0dt, 1dt, ... Nmax*dt AND enforce sorting of the blocked connectivity matrix s.t. the synapses are ordered after delays => employ the number above parallel threads to push into the SAME delay queue in parallel AND in parallel for the different delays
old idea (should be contained in above one): make an assignment of threadid-delayid (maybe outer loop necessary) to push into queue[delayid] in parallel for each block having multiple threads vs. currently one active thread (id 0) in the actual push phase
also cf. current brian2 cpp_standalone multithreading for ideas

add to the queue also the postneuron id to allow better parallel synaptic code objects, c.f. issue #15

Compilation error: undefined "dev_array_neurongroup_not_refractory"

When executing the following code:

from brian2 import *
import brian2cuda

set_device('cuda_standalone')

neurons = NeuronGroup(1, 'v:1', threshold='True')
run(1*ms)
device.build(directory='build', compile=True, run=False, debug=False)

a compilation error occurs:

code_objects/neurongroup_group_variable_set_conditional_codeobject.cu(98): error: identifier "dev_array_neurongroup_not_refractory" is undefined

1 error detected in the compilation of "/tmp/tmpxft_00003e22_00000000-7_neurongroup_group_variable_set_conditional_codeobject.cpp1.ii".
make: *** [code_objects/neurongroup_group_variable_set_conditional_codeobject.o] Error 2

When setting the device to 'cpp_standalone', everything is fine and the neuron spikes once every timestep as expected.

advance to more recent (frozen) brian2 commit due to brian2genn

Use the following versions in accordance with brian-team/brian2genn#25 (comment)

brian2: 71f12749e24f1e5eb6c775c331d8ad5060daad95
brian2genn: a00d75961c9eb95fcbee5c05bbf8cc7602c86765
genn: 0cb5825cbc45ce3b2c4fc0c499c6377b8b31f3fd

Optimize parallel synapse pushing (when not in bundle_mode)

Parallel push is working now (#16), but there are still multiple, easy optimization possibilities, which I didn't have time to do yet. They can be found in the TODO comments in CSpikeQueue::push() in spikequeue.h and the run kernel function in synapes_push_spikes.cu.

And when doing this, maybe rename CSpikeQueue to CudaSpikeQueue...

invalid clock member access in ratemonitor.cu template

When running any example with a ratemonitor ( e.g. adaptation_oscillation.py or STDP_standalone.py) the compilations fails:
makefile:35: recipe for target 'code_objects/ratemonitor_codeobject.o' failed code_objects/ratemonitor_codeobject.cu(87): error: class "Clock" has no member "i"

Line 6 of the ratemonitor.cu template seems to wrongly access the current timestep index.

SynapticPathways using subsets of the same NeuronGroup as source should loop only once through the spikespace when pushing

When using subsets of a NeuronGroup as sources of SynapticPathways, currently each SynapticPathway loops once through the entire spikespace of the source NeuronGroup and pushes only those synpase ids into the spikequeues, which neuron ids are part of the source of the SynapticPathway. So when we have multiple SynapticPathways using different subsets of the same NeuronGroup as source, then we loop through the same spikespace multiple times. This should in general be possible to avoid by just looping through the spikespace once and pushing the synapse ids into the spikequeue of the correct SynapticPathway.
Just opening this issue to keep this in mind. This could be tackled once we have more detailed profiling results, where we can see if its even worth it.

Example code snippet which would result in 2 times looping through the same spikepace at each timestep (once using a boolean to define the subset and once a Subgroup):

P = NeuronGroup(4000, ...)
Syn0 = Synapses(P, P, ...)
Syn0.connect('i<3200', ...)
Syn1 = Synapses(P[3200:], P, ...) 
Syn1.connect(True, ...)

prevent excessive kernel register usage

it seems that big models (with many variables) lead to unexecutable kernels because of exceeding the max. no of registers per thread (https://en.wikipedia.org/wiki/CUDA). this is related with iss #5

if this is indeed true we should change the storage of pointers within kernels from registers to shared or global memory

alternatively explicitly support only up to ... eqs and raise NotImplemented during code generation otherwise)

Clean up makefile generation

The makefile generation is kind of a mess right now. Only for linux systems stuff is set up correctly, for windows the CUDAStandaloneDevice::generate_makefile() and the win_makefile template haven't really been adjusted to cuda yet. This needs a windows VM to test, so I won't touch it for now.

spikemonitor should return sorted data

the template spikemonitor.cu contains code which combines after a simulation the arrays for spike time and neuron index for the blocks. however, the final time array should be sorted by time (and the spike time array of course accordingly) which is as far as I see not yet implemented. it can of course be done as comfortable as possible, e.g. on the host side

Add prefs for random number generation

Since we are using cuRAND for random number generation, which has quite a few options, e.g. for different generators, we should add its options to prefs.codegen.cuda, so the user can change them.

correct the reported simulation time

when running a working example such as STDP the simulation time is wrongly reported:

96.0611 s (96%) simulated in 391 s, estimated 16 s remaining.
98.6048 s (98%) simulated in 401 s, estimated 6 s remaining.
100 s (100%) simulated in 4.06472e+14 s
Number of synapses: 1000
Number of synapses: 1000
Number of spikes: 1499574

check readme commit hashes after testing brian2genn

Investigate the effect of sorting synIDs for coalescing

This should include an evaluation (profiling) for difference sorting options and should possibly be possible to set by the user in the brian2cuda preferences.

'no_or_const_delay_mode' is not set to True for no or scalar delays

Testing for no or scalar delays is not implemented correctly. In standalone mode the delay parameter and therefore its .scalar and .size flags are not accessible before simulation hasn't been run. At least not in the way it was intended in the current version here in commit bac8b5a.

I would leave this for now and look into it once the no_or_const_delay_mode implementation is finished and we updated to brian2 master (#22).

some thoughts:

as far as I remember string expressions such as delay = '5 * ms * rand()' or delay = 'i * ms' get translated into c++/cuda code in standalone mode and thats why their values are not available before simulation
check when the delay.scalar and delay.size flags are updated
if they are updated after simulation, maybe some checking for appearances of rand(), i, xi etc. in the delay expression might be necessary in order to determine weather there will be different delays for different synapses or not (check available brian2 functions for string characterisation)

Until this is fixed, no_or_const_delay_mode = False is hard coded here!

Rename synaptic transmission modes

Since we now will have 3 different synaptic transmission modes, we can't use a bool anymore.
Instead of no_or_const_delay_mode == True/False, @moritzaugustin suggested to use something along the lines of

synaptic_transmission == HOMOGENOUS_DELAYS
synaptic_transmission == HETEROGENOUS_DELAYS_INDIVIDUAL
synaptic_transmission == HETEROGENOUS_DELAYS_BUNDLED

Depending on the mode, the templates should then use different code.

This should be introduced when doing #12 and #36 .

Final evaluation and documentation

In the end we want comparisons between brian2cuda, brian2genn and cpp_standalone for different modi. And documentation of brian2cuda performance increase for different optimizations, changes in concepts and data structures.

Documentations should therefore include visualizations of concepts and data structures used.

thresholder should use the parallel spikespace

the parallel spikespace which is used in the synaptic propagation/effect application & spike monitor is so-far not yet used
currently: mainly the first block (atomicAdd on global memory) fills the spike space -- almost purely sequentially
should be: all blocks in parallel should fill their respective portion of the spike space using only shared memory atomicAdd (for index computation and local spike count in smem) and only at the very end an atomicAdd on the global memory to find the total no of spikes (global spike count) as the sum of the local spike counts

Set up compiler preferences correclty for nvcc usage

Currently the compiler flags specified for cpp compilers are passed to nvcc.

CudaStandaloneDevice::build() gets compiler_flags from brian2/codegen/cpp_prefs.py::get_compiler_and_args() and passes those to CudaStandaloneDevice::generate_makefile(), where they are used in the makefile template as arguments for the nvcc call.

For now I added the following line to brian2cuda/cuda_cenerator.py to get rid of the default gcc compiler flags that are not compatible with a nvcc call (namely -ffast-math, -fno-finite-math-only and -march=native).

prefs.codegen.cpp.extra_compile_args_gcc = ['-w', '-O3']

Projects do compile with this options, but we should probably create a cuda_prefs.py file analog to the cpp_prefs.py file, which would then also give the user control over nvcc compiler flags. And then there should be a way to pass compiler flags for the cpp compiler, that is called by nvcc to compiles the host code. This is where the prefs.codegen.cpp arguments should be passed to. And I also haven't looked much into nvcc compiler flags, which should be done to choose a proper default.

Add cuda implementations to TimedArray and BinomialFunction classes

This lets a few brian2 test suite tests fail. Probably easy to do, we can just use the C++ implementation.

Deterministic neuron model occasionally return too few spikes when repeatedly run on GPU

Running the following code

from brian2 import * 
import brian2cuda 

set_device('cuda_standalone') 

all_neurons = NeuronGroup(10000, 'v:1', threshold='True')

neuron_mon = SpikeMonitor(all_neurons) 

run(1*ms) 
device.build(directory='./', compile=True, run=True, debug=False)

and then executing the compiled cuda binary multiple times, e.g. with

for _ in `seq 1 10`:; do ./main; done

results occasionally in too few "Number of spikes" to be printed.
10000 neurons x 10 time steps = 100000 spikes should be the right result, while sometimes btwn 95000 and 99999 spikes are recorded (or printed by the spikemonitor).

Profile CUBA, CUBA_DISTDELAYS and STDP examples

First detailed profiling of some brian2 examples with brian2cuda after updating brian2 version (#22).

Change unsigned integer datatypes to signed

Our current CSpikeQueue(spikequeue.h) and cudaVector (cudaVector.h) classes seem to have a random mix of size32_t, int and unsigend int datatypes. Or at least I could not really figure out a consistent reasoning behind it. Now I'm not quite sure what datatypes would make sense here.

Looking at the CSpikeQueue from brian2 (cspikequeue.cpp), I have a few questions @mstimberg

It looks like signed integer types are used for neuron IDs and synapse IDs. Are we ever handling negative IDs, or is there another reason not to use unsigned integer types?
Then for synapse IDs in CSpikeQueue.synapses, size32_t is used and for synapse IDs in the CSpikeQueue.queue int is used. Why not use the same datatype in both?

My understanding so far is, that the C++ standard requires unsigned int to hold only at least values in the range 0 to 65535 and int in the range -32767 to 32767. And the latter seems to me to little for synapse IDs? Or is any computer running brian2 simulations anyways going to use a computer architecture / compiler where int can store a higher range and it simply doesn't really matter?

Sort EventMonitor data by index within each time step

Currently our EventMonitor (~ SpikeMonitor) returns times and indices sorted by time but for each time step the indices are not sorted. So we get e.g. this

times =   [0,0,0, 1,1,1,1, 2,2,2,2,2]
indices = [5,1,2, 1,3,2,4, 4,2,5,1,3]

A few tests form the brian2 test suite fail because of this, since brian2 returns the indices sorted as well, like this

times =   [0,0,0, 1,1,1,1, 2,2,2,2,2]
indices = [1,2,5, 1,2,3,4, 1,2,3,4,5]

This is low priority right now, but I guess we should return the data in the same format as brian2.

We could save the number of spikes at each time step and then use that to sort the indices in the end of the simulation using thrust::sort. Or if we want to save one copy from shared to global memory per time step, we could write our own kernel for sorting.

Parallelize synaptic effects application in `target` mode for heterogeneous delays WITHOUT bundle mode and WITHOUT atomics

EDIT by denisalevi:
Summary: sort synapse IDs in spike queues by post neuron IDs and use thread <-> postNeuron correspondence.

currently as the current synapse queue contains synapse ids for which the synaptic effects have to be applied. this is done in a sequential fashion for each block.

[see denis comment for the first attempt too]

we should make it parallel for each block by

loading synapse ids and postneuron ids of the spike queue into shared memory (outer loop might be necessary)
sort that data w.r.t postneuron id (using bitonic merge sort)
run the synaptic code in parallel with the relation threadid-occuringpostneuronid

Spikemonitor kernel is performance bottleneck for networks without synapses.

Profiling IF_curve_LIF.py and IF_curve_Hodgkin_Huxley.py examples from brian2 shows that the performance bottleneck is the spikemonitor kernel.

Below are profiling results (using nvprof) for three different implementation szenarios.

The unchanged not-parallel spikespace, using the unchanged spikemonitor (that is assuming a parallel spikespace) (dee9bf7)
A quick and dirty modified spikemonitor, that now assumes a not-parallel spikespace (and should now also return sorted data automatically) (branch issue10_sorted_spikemonitor, 609005d)
The parallel spikespace (from the slower thresholder implementation with shared atomicAdds from issue #9) with the spikemonitor from 1. (that assumes the parallel spikespace) (branch issue9_spikespace, eb215d4)

Some of the results using N=10000 neurons in the IF_curve_LIF.py example:

Kernel	1.	2.	3.
run_spikemonitor	280 us (78%)	250 us (79%)	213 us (73%)
copy_spikemonitor	670 ms (19%)	540 ms (17%)	665 ms (23%)
thresholder	2.6 us (0.7%)	2 us (0.8%)	5.3 us (1.8%)

times are [average times per kernel call] and percentages are [time spent in that kernel (all kernel calls) / total time]:

That means:

For all of the above scenarios the spikemonitor kernel is the bottleneck (especially when simulating many neurons) and performance only varies around 20% between them.
Weather the thresholder is filling the spikespace in parallel or not really doesn't matter, since even for many neurons it takes up blow 2% of the time
Even though the copy_spikemonitor kernel is only called once, it actually takes quite long if there are a lot of neurons.

My conclusions:

No need for a parallel spikespace, it doesn't solve the "slowness" of the spikemonitor
The serialized spikemonitor (2.) might even be a little faster with a little clean up (right now there is some unnecessary copying happening). And then we could advance to other topics
The following might actually increase performance: In the thresholder just use a register, that is either -1 (if the neuron didn't spike) or threadID if it spiked and just write it to the spikespace at idx=threadID. This way the spikespace has a lot of -1 values inbetween neuronID values, but we don't need any atomics. Then use the thrust::copy_if function, that returns a vector of all not--1 values from the spikespace and the size of that vector (using some parallel GPU algorithm). And then just add that returned vector to a the spikemonitor (which can then just be thrust::device_vector). We should talk about this approach.

Below are the detailed profiling results:

Profiling of IF_curve_Hodgkin_Huxley.py for different number of neurons N, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):

HODGKIN HUXLEY, N=100
==10548== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 52.48%  291.81ms     20000  14.590us  13.440us  17.024us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, char*, double*, double*, double*, double*)
 24.20%  134.56ms     20000  6.7270us  1.6640us  30.271ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 10.57%  58.748ms     60024     978ns     928ns  2.6880us  [CUDA memcpy HtoD]
  8.40%  46.715ms     20000  2.3350us  1.6960us  2.7840us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  4.16%  23.111ms         1  23.111ms  23.111ms  23.111ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  0.16%  878.66us         1  878.66us  878.66us  878.66us  _run_spikemonitor_codeobject_init(unsigned int)
  0.01%  80.896us         1  80.896us  80.896us  80.896us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
  0.01%  77.632us        18  4.3120us  2.0800us  26.752us  [CUDA memcpy DtoH]
  0.00%  5.3120us         1  5.3120us  5.3120us  5.3120us  _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
  0.00%  3.8080us         1  3.8080us  3.8080us  3.8080us  _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, int*, double*)


HODGKIN HUXLEY, N=1000
==31099== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 57.69%  828.03ms     20000  41.401us  1.5680us  225.32ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 18.55%  266.23ms     20000  13.311us  13.088us  16.128us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, char*, double*, double*, double*, double*)
 16.26%  233.40ms         1  233.40ms  233.40ms  233.40ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  4.04%  57.938ms     60024     965ns     928ns  3.1680us  [CUDA memcpy HtoD]
  3.36%  48.259ms     20000  2.4120us  1.7280us  3.0720us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  0.06%  876.13us         1  876.13us  876.13us  876.13us  _run_spikemonitor_codeobject_init(unsigned int)
  0.03%  434.05us        18  24.113us  2.1120us  251.49us  [CUDA memcpy DtoH]
  0.01%  81.472us         1  81.472us  81.472us  81.472us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
  0.00%  5.2480us         1  5.2480us  5.2480us  5.2480us  _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)

Profiling of IF_curve_LIF.py for different number of neurons `N``, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):

LIF, N=1000
==32636== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 58.81%  235.41ms     10000  23.540us  1.3760us  60.282ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 16.55%  66.264ms         1  66.264ms  66.264ms  66.264ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  8.37%  33.502ms     10000  3.3500us  3.1680us  4.3200us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  7.30%  29.222ms     30021     973ns     928ns  4.8640us  [CUDA memcpy HtoD]
  5.08%  20.352ms     10000  2.0350us  1.4400us  2.6240us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  3.60%  14.400ms     10000  1.4400us  1.3120us  1.9200us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.22%  879.91us         1  879.91us  879.91us  879.91us  _run_spikemonitor_codeobject_init(unsigned int)
  0.04%  151.84us        15  10.122us  2.1120us  73.184us  [CUDA memcpy DtoH]
  0.02%  81.056us         1  81.056us  81.056us  81.056us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
  0.00%  5.5360us         1  5.5360us  5.5360us  5.5360us  _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
  0.00%  3.3920us         1  3.3920us  3.3920us  3.3920us  _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)


LIF, N=10000
==1277== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 78.12%  2.79040s     10000  279.04us  1.2800us  901.28ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 18.71%  668.47ms         1  668.47ms  668.47ms  668.47ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  1.11%  39.806ms     10000  3.9800us  3.9040us  5.4400us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  0.82%  29.177ms     30021     971ns     928ns  28.000us  [CUDA memcpy HtoD]
  0.72%  25.781ms     10000  2.5780us  1.7600us  3.4240us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  0.44%  15.810ms     10000  1.5810us  1.5040us  2.2400us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.05%  1.7333ms        15  115.56us  2.1120us  1.2212ms  [CUDA memcpy DtoH]
  0.02%  881.92us         1  881.92us  881.92us  881.92us  _run_spikemonitor_codeobject_init(unsigned int)
  0.00%  81.280us         1  81.280us  81.280us  81.280us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)

After a quick and dirty change of the spikemonitor (adapted for not parallel spikespace) (2.):

LIF, N=1000, SERIALIZED SPIKEMONITOR
==15351== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 58.65%  214.83ms     10000  21.482us  1.5360us  55.489ms  _run_spikemonitor_codeobject_kernel(unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 14.67%  53.725ms         1  53.725ms  53.725ms  53.725ms  _copy_spikemonitor_codeobject_kernel(int*, double*, bool)
  9.15%  33.501ms     10000  3.3500us  3.1360us  4.3520us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  7.98%  29.222ms     30021     973ns     928ns  7.1680us  [CUDA memcpy HtoD]
  5.56%  20.374ms     10000  2.0370us  1.4400us  2.6240us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  3.91%  14.317ms     10000  1.4310us  1.3440us  1.9520us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.04%  151.36us        15  10.090us  2.1120us  73.280us  [CUDA memcpy DtoH]
  0.02%  77.632us         1  77.632us  77.632us  77.632us  _run_debugmsg_spikemonitor_codeobject_kernel(void)
  0.01%  53.280us         1  53.280us  53.280us  53.280us  _run_spikemonitor_codeobject_init(void)
  0.00%  3.2960us         1  3.2960us  3.2960us  3.2960us  _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)


LIF, N=10000, SERIALIZED SPIKEMONITOR
==11854== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 79.31%  2.51796s     10000  251.80us  1.5680us  827.75ms  _run_spikemonitor_codeobject_kernel(unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 17.06%  541.50ms         1  541.50ms  541.50ms  541.50ms  _copy_spikemonitor_codeobject_kernel(int*, double*, bool)
  1.28%  40.739ms     10000  4.0730us  3.9680us  5.3760us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  0.92%  29.247ms     30021     974ns     928ns  28.032us  [CUDA memcpy HtoD]
  0.81%  25.818ms     10000  2.5810us  1.6640us  3.4240us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  0.56%  17.714ms     10000  1.7710us  1.6000us  2.1440us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.05%  1.7398ms        15  115.99us  2.1120us  1.2275ms  [CUDA memcpy DtoH]
  0.00%  76.928us         1  76.928us  76.928us  76.928us  _run_debugmsg_spikemonitor_codeobject_kernel(void)
  0.00%  53.568us         1  53.568us  53.568us  53.568us  _run_spikemonitor_codeobject_init(void)

Using the parallel spikespace (filled in parallel by the modified thresholder) and the parallel spikemonitor (3.):

LIF, N=1000, PARALLEL THRESHOLDER
==16310== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 60.58%  268.46ms     10000  26.846us  1.3760us  15.050ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 15.26%  67.602ms         1  67.602ms  67.602ms  67.602ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  8.03%  35.578ms     10000  3.5570us  3.2960us  4.4160us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  6.64%  29.410ms     30021     979ns     928ns  3.5200us  [CUDA memcpy HtoD]
  5.62%  24.903ms     10000  2.4900us  1.7280us  5.1520us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  3.62%  16.056ms     10000  1.6050us  1.4080us  1.8880us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.20%  876.83us         1  876.83us  876.83us  876.83us  _run_spikemonitor_codeobject_init(unsigned int)
  0.03%  151.74us        15  10.116us  2.1120us  73.248us  [CUDA memcpy DtoH]
  0.02%  81.664us         1  81.664us  81.664us  81.664us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
  0.00%  5.6640us         1  5.6640us  5.6640us  5.6640us  _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
  0.00%  3.2640us         1  3.2640us  3.2640us  3.2640us  _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)


LIF, N=10000, PARALLEL THRESHOLDER
==16906== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 72.62%  2.13823s     10000  213.82us  1.3760us  112.15ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 22.60%  665.34ms         1  665.34ms  665.34ms  665.34ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  1.80%  53.087ms     10000  5.3080us  2.0160us  18.304us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  1.36%  39.906ms     10000  3.9900us  3.8720us  5.1520us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  0.99%  29.177ms     30021     971ns     928ns  28.128us  [CUDA memcpy HtoD]
  0.54%  15.887ms     10000  1.5880us  1.5360us  1.9840us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.06%  1.7550ms        15  117.00us  2.1120us  1.2425ms  [CUDA memcpy DtoH]
  0.03%  879.14us         1  879.14us  879.14us  879.14us  _run_spikemonitor_codeobject_init(unsigned int)
  0.00%  81.632us         1  81.632us  81.632us  81.632us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)

Summed variables implementation fix and update

There was a bug in brian2 when using the (summed) keyword (brian-team/brian2#704).
The fix might be rather easy if we just adapt the changes to the cpp template from the corresponding PR (brian-team/brian2#705).

But since we talked about changing the cuda implementation of summed variables anyways, I will leave this for now.

Parallelize `source` mode of synaptic effects application

Currently the 'source' serialization mode uses only one thread and one block and loops for each spiking neuron through all post neurons (total serialization). There are probably ways to parallelize this.

But the question is if this is worth is at all. Meaning: will we gain enough parallelization to be faster then cpp_standalone for some use cases?

And what models might use multiple serialization modes and might benefit from performance increase in the post and syn modes while still using pre modes? For those a better pre mode would still be beneficial even if it is slower then in cpp_standalone. There were some STDP model with traces mentioned. This should be checked.

Modulo function not correctly implemented

Running this code:

from brian2 import *  
import brian2cuda     

set_device('cuda_standalone')                                       
neuron = NeuronGroup(1, 'v:1', threshold='i%2==0')             
run(1*ms)                                                           
device.build(directory='./', compile=True, run=True, debug=False)

results in the following error:

code_objects/neurongroup_thresholder_codeobject.cu(90): error: identifier "_brian_mod" is undefined

1 error detected in the compilation of "/tmp/tmpxft_00005f4d_00000000-7_neurongroup_thresholder_codeobject.cpp1.ii".
make: *** [code_objects/neurongroup_thresholder_codeobject.o] Error 2

Looks like there is something wrong with the modulo function, in cpp_standalone mode everything is fine.

In brian2genn _brian_mod is replaced by fmod in the GeNNDevice() class in device.py.

Update brian2cuda to use current brian2 master

There seem to be some changes in the spikequeue in brian2.

Following PRs from brian2 referenced possible changes for brian2cuda:

specification of minimal device capability

from the used cuda features alone the minimal compute capability is 2.0
when running some tests on asterope (device capability 2.1):

CUBA does not show any spikes with cuda_standalone (but not cpp_standalone)
the feature test test_cuda_standalone.py (copied from test_cpp_standalone.py replacing all cpp_standalone occurences with cuda_standalone) does not work as expected: AssertionError in line 424)

running cuda-memcheck ./main -- e.g. for the 1. example above -- a cudaErrorLaunchOutofResources error is reported due to too many resources requested for launch on CUDA API call to cudaLaunch

it is likely that compute capability 3.5 is required due to the available registers per thread (c.f. https://en.wikipedia.org/wiki/CUDA) -- we have to check which kernels are problematic if we wanted to relax the required cc version no

this is related with iss #4

Parallelize serializations in 'target' serialization_modes

Compute extra queue with postIDs (connectivity matrix with pre -> post => ...post_id_by_pre analogon to ...synapses_id_by_pre)
Sort synIDs by postID (first easy version shared memory, later try e.g. bitonic sort)
Use thread <-> synapse correspondence, but let threads which postID has already occurred (can be computed from 2.) return and the first thread having a certain postID loop through all synapses with that postID, applying the synaptic effects. (We assume that postIDs only rarely occur more then once in relation to the total number of synapses in the queue, therefore we accept the resulting thread divergence from returning threads -> needs profiling)
with tid <-> syn correspondence:

if (shared_postID[tid-1] == shared_postID[tid]) {return};
int k = tid;
while (shared_postID[k] == shared_postID[tid])
    // apply synaptic effect to postID
    k += 1

Sort same postID by synID (for potentially better coalescing -> needs profiling, see #32 )

Implement Nemo-inspired variant of heterogeneous delay mode

Instead of pushing individual synapse ids separately, as a second alternative we can push synapse "bundle" ids into the respective delay queues, where each synapse bundle is defined by (firing presynaptic neuron id, delay, postsynaptic group) -- and the numbering is done e.g. within each postsynaptic group separately.
This procedure should reduce the cost of the push operation signifcantly given that the synapse bundles have not too small size. Furthermore the application of the synaptic effects can be doine exactly as in the no or const delay mode -- not sorting of active synapses required! More precisely: the postsynaptic groups independently iterate sequentially through the active synapse bundles but apply the effects of all contained synapses in parallel among the resp. threads of the block.
Note: in order to push the synaptic bundle ids in parallel for a given postsyn group block the number of synaptic bundles for each presyn. neuron has to be known -- but this is exactly the number of "unique delays".

The idea is by (Fidjeland and & Shanahan 2009, 2010).

The benchmark of the two heterogenous delay modes should include an example with uniform delay distribution (for too small numbers of synapses within each postsyn group the old propagation might be faster) and with a more bulky one (the bundles should perform much better) -- e.g. shifted peaky distribution.

Spikegenerator period argument not working

These tests are failing:

brian2.tests.test_spikegenerator.test_spikenerator_period
brian2.tests.test_spikegenerator.test_spikenerator_rounding_period
brian2.tests.test_spikegenerator.test_spikegenerator_change_spikes

It looks like the generated spikes are not repeated after period time has passed.

StateMonitor idexing with slice syntax doesn't work

For some reason using [1:] style indexing for a StateMonitor returns an empty array while e.g. [1] indexing returns the correct value.

...
mon = StateMonitor(...)
...
run(2*defaultclock.dt)  # -> len(mon.t) is 2
print(mon.t[1:])  # returns []
print(mon[1])  # returns Quantity([defaultclock.dt])

This seems to apply for any kind of [3:] style indexing of the StateMonitor.
This breaks e.g. brian2.tests.test_monitor.test_state_monitor_indexing.

Benchmark performance against brian2genn and cpp_standalone

This should be the first benchmarking (speed tests) against brian2genn and cpp_standalone after updating brian2cuda to use brian2 2.0 (#22), but before major optimizations of brian2genn.
It should include automated benchmarking scripts to compare effects of later optimizations.

cuba examples gives error in clean environment

when using clean versions of moritzaugustin:brian2/frozen and brian-team/master the CUBA example fails in file cpp_standalone/device.py in line 797 when not successfully reading results/last_run_info.txt (no such file or directory)

Parallelize ratemonitor when using subgroups

When using a PopulationRateMonitor with a NeuronGroup, which is not a Subgroup, we can just read out the last element of the spikespace to get the number of spiking neurons for the current time step to calculate the instantaneous rate.
When using subgroups though, we can't do that, since a spikespace belongs to an entire Neurongroup. Currently we call each timestep a kernel with 1 thread and 1 block, which then tests if the ratemonitor belongs to a subgroup and if so, loops through all spiking neurons in the spikespace and increments a counter for each neuron belonging to the subgroup. We should check if doing it in parallel with e.g atomicAdd is worth it performancewise.

Cpp standalone uses the fact, that their neuron indices are sorted in the spikespace to get the first and last spiking neuron of the subgroup. Since our spikespace is not sorted by neuron indices, we can't do that.

serializing_mode is not set correctly

In the CUDAStandaloneDevice in brian2cuda/device.py, the serializing_mode flag seems to be set incorrect. It is e.g set to 'post' when using Synapses(..., pre=...), which should not result in any serialization at all.

With commit ba61c2b serializing_mode = 'syn' is hard coded until this is fixed.

Single precision preference

As some users might be interested in using single precision due to the strong arithemetic performance of consumer grade nvidia GPUs in comparison to double precision we should offer this option.
For this a preference (float type or similar) should be choosable which is by default double and on demand float -- and all templates and the code generation should rely on it.

Overload `clip` function for differnt data types

Our implementation of the clip function currently casts all arguments to double. When implementing #37, we should also add at least an option for casting to float. Maybe even overload the function for different types as done with the modulo function (while taking care of type safety when comparing different data types).

Support of `integer` and `long double` types for CUDA math functions

The CUDA math library functions are only overloaded for single (float) and double precision (double) arguments. That means if we want to support integer arithmetics in device functions, we need to overload them ourselves. In C++ they are overloaded additionally for int and long double arguments.

Question is, do we create overloaded functions and add them all to CUDACodeGenerator.universal_support_code as is currently done with _brian_mod (here)?

Or would it be cleaner to do that in an extra header file that we add to brianlib? And instead of overloading it, we could also just template it like it is done for int_ in stdint_compat.h. We could then just cast everything to float, using the single precision math functions and have one specialisation for double, using the double precision math function, like this:

template<typename T> __host__ __device__ int function(T value)
{
    return cudaMathAPI_function((float)value);
}
template<> inline __host__ __device__ int function(double value)
{
    return cudaMathAPI_function(value);
}

And since in some cases there is some easier way to calculate a function for integer types, we could instead do the integer calculation and only specialize for float and double. And we could even add #ifdef __CUDA_ARCH ... to have different code for host or device function calls.

Another thing is, that long double arithmetics are not supported on CUDA at all. Question is, do we simply cast long double to double or let it fail instead. Right now the _brian_mod function just fails for long double arguments with something like

error: more than one instance of overloaded function "_brian_mod" matches the argument list: ...

And since CUDA doesn't have any exception handling, the user wouldn't know whats happening.

@mstimberg Are there actual use cases where people use long double data types? And is there an easy way to check for it and give a warning or error when using cuda_standalone?
And how does genn do it? From what I could see, genn just uses fmod, so I suppose it only supports float and double?