brian-team / brian2cuda Goto Github PK
View Code? Open in Web Editor NEWA brian2 extension to simulate spiking neural networks on GPUs
Home Page: https://brian2cuda.readthedocs.io/
License: GNU General Public License v3.0
A brian2 extension to simulate spiking neural networks on GPUs
Home Page: https://brian2cuda.readthedocs.io/
License: GNU General Public License v3.0
Parallel push is working now (#16), but there are still multiple, easy optimization possibilities, which I didn't have time to do yet. They can be found in the TODO comments in CSpikeQueue::push()
in spikequeue.h and the run kernel function in synapes_push_spikes.cu.
And when doing this, maybe rename CSpikeQueue
to CudaSpikeQueue
...
This should be the first benchmarking (speed tests) against brian2genn and cpp_standalone after updating brian2cuda to use brian2 2.0 (#22), but before major optimizations of brian2genn.
It should include automated benchmarking scripts to compare effects of later optimizations.
THREADS_PER_BLOCK
is always equal to blockDim.x
=> redundancy and register waste?num_blocks(...)
and num_threads(...)
passages in common_group.cu
: comment and remove redundancy with the functions having the same names in group_variable_set_conditional.cu
cudaDeviceSetSharedMemConfig()
to cudaSharedMemBankSizeEightByte
cudaFuncCachePreferL1
via cudaFuncSetCacheConfig()
-1
to other threads of the block to stop them loading unnecessary -1
values from global memorymore to come...
when running a working example such as STDP the simulation time is wrongly reported:
96.0611 s (96%) simulated in 391 s, estimated 16 s remaining.
98.6048 s (98%) simulated in 401 s, estimated 6 s remaining.
100 s (100%) simulated in 4.06472e+14 s
Number of synapses: 1000
Number of synapses: 1000
Number of spikes: 1499574
The makefile generation is kind of a mess right now. Only for linux systems stuff is set up correctly, for windows the CUDAStandaloneDevice::generate_makefile()
and the win_makefile
template haven't really been adjusted to cuda yet. This needs a windows VM to test, so I won't touch it for now.
The CUDA math library functions are only overloaded for single (float
) and double precision (double
) arguments. That means if we want to support integer arithmetics in device functions, we need to overload them ourselves. In C++ they are overloaded additionally for int
and long double
arguments.
Question is, do we create overloaded functions and add them all to CUDACodeGenerator.universal_support_code
as is currently done with _brian_mod
(here)?
Or would it be cleaner to do that in an extra header file that we add to brianlib
? And instead of overloading it, we could also just template it like it is done for int_
in stdint_compat.h. We could then just cast everything to float, using the single precision math functions and have one specialisation for double
, using the double precision math function, like this:
template<typename T> __host__ __device__ int function(T value)
{
return cudaMathAPI_function((float)value);
}
template<> inline __host__ __device__ int function(double value)
{
return cudaMathAPI_function(value);
}
And since in some cases there is some easier way to calculate a function for integer types, we could instead do the integer calculation and only specialize for float and double. And we could even add #ifdef __CUDA_ARCH ...
to have different code for host or device function calls.
Another thing is, that long double
arithmetics are not supported on CUDA at all. Question is, do we simply cast long double
to double
or let it fail instead. Right now the _brian_mod
function just fails for long double
arguments with something like
error: more than one instance of overloaded function "_brian_mod" matches the argument list: ...
And since CUDA doesn't have any exception handling, the user wouldn't know whats happening.
@mstimberg Are there actual use cases where people use long double
data types? And is there an easy way to check for it and give a warning or error when using cuda_standalone
?
And how does genn do it? From what I could see, genn just uses fmod
, so I suppose it only supports float
and double
?
In the CUDAStandaloneDevice
in brian2cuda/device.py
, the serializing_mode
flag seems to be set incorrect. It is e.g set to 'post' when using Synapses(..., pre=...)
, which should not result in any serialization at all.
With commit ba61c2b serializing_mode = 'syn'
is hard coded until this is fixed.
Since we are using cuRAND for random number generation, which has quite a few options, e.g. for different generators, we should add its options to prefs.codegen.cuda
, so the user can change them.
Running this code:
from brian2 import *
import brian2cuda
set_device('cuda_standalone')
neuron = NeuronGroup(1, 'v:1', threshold='i%2==0')
run(1*ms)
device.build(directory='./', compile=True, run=True, debug=False)
results in the following error:
code_objects/neurongroup_thresholder_codeobject.cu(90): error: identifier "_brian_mod" is undefined
1 error detected in the compilation of "/tmp/tmpxft_00005f4d_00000000-7_neurongroup_thresholder_codeobject.cpp1.ii".
make: *** [code_objects/neurongroup_thresholder_codeobject.o] Error 2
Looks like there is something wrong with the modulo function, in cpp_standalone
mode everything is fine.
In brian2genn _brian_mod
is replaced by fmod
in the GeNNDevice()
class in device.py
.
Profiling IF_curve_LIF.py
and IF_curve_Hodgkin_Huxley.py
examples from brian2
shows that the performance bottleneck is the spikemonitor kernel.
Below are profiling results (using nvprof
) for three different implementation szenarios.
issue10_sorted_spikemonitor
, 609005d)atomicAdd
s from issue #9) with the spikemonitor from 1. (that assumes the parallel spikespace) (branch issue9_spikespace
, eb215d4)Some of the results using N=10000
neurons in the IF_curve_LIF.py
example:
Kernel | 1. | 2. | 3. |
---|---|---|---|
run_spikemonitor | 280 us (78%) | 250 us (79%) | 213 us (73%) |
copy_spikemonitor | 670 ms (19%) | 540 ms (17%) | 665 ms (23%) |
thresholder | 2.6 us (0.7%) | 2 us (0.8%) | 5.3 us (1.8%) |
times are [average times per kernel call] and percentages are [time spent in that kernel (all kernel calls) / total time]:
That means:
My conclusions:
-1
(if the neuron didn't spike) or threadID
if it spiked and just write it to the spikespace at idx=threadID
. This way the spikespace has a lot of -1
values inbetween neuronID values, but we don't need any atomics. Then use the thrust::copy_if
function, that returns a vector of all not--1
values from the spikespace and the size of that vector (using some parallel GPU algorithm). And then just add that returned vector to a the spikemonitor (which can then just be thrust::device_vector
). We should talk about this approach.Below are the detailed profiling results:
Profiling of IF_curve_Hodgkin_Huxley.py
for different number of neurons N
, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):
HODGKIN HUXLEY, N=100
==10548== Profiling result:
Time(%) Time Calls Avg Min Max Name
52.48% 291.81ms 20000 14.590us 13.440us 17.024us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, char*, double*, double*, double*, double*)
24.20% 134.56ms 20000 6.7270us 1.6640us 30.271ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
10.57% 58.748ms 60024 978ns 928ns 2.6880us [CUDA memcpy HtoD]
8.40% 46.715ms 20000 2.3350us 1.6960us 2.7840us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
4.16% 23.111ms 1 23.111ms 23.111ms 23.111ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
0.16% 878.66us 1 878.66us 878.66us 878.66us _run_spikemonitor_codeobject_init(unsigned int)
0.01% 80.896us 1 80.896us 80.896us 80.896us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
0.01% 77.632us 18 4.3120us 2.0800us 26.752us [CUDA memcpy DtoH]
0.00% 5.3120us 1 5.3120us 5.3120us 5.3120us _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
0.00% 3.8080us 1 3.8080us 3.8080us 3.8080us _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, int*, double*)
HODGKIN HUXLEY, N=1000
==31099== Profiling result:
Time(%) Time Calls Avg Min Max Name
57.69% 828.03ms 20000 41.401us 1.5680us 225.32ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
18.55% 266.23ms 20000 13.311us 13.088us 16.128us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, char*, double*, double*, double*, double*)
16.26% 233.40ms 1 233.40ms 233.40ms 233.40ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
4.04% 57.938ms 60024 965ns 928ns 3.1680us [CUDA memcpy HtoD]
3.36% 48.259ms 20000 2.4120us 1.7280us 3.0720us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
0.06% 876.13us 1 876.13us 876.13us 876.13us _run_spikemonitor_codeobject_init(unsigned int)
0.03% 434.05us 18 24.113us 2.1120us 251.49us [CUDA memcpy DtoH]
0.01% 81.472us 1 81.472us 81.472us 81.472us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
0.00% 5.2480us 1 5.2480us 5.2480us 5.2480us _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
Profiling of IF_curve_LIF.py
for different number of neurons `N``, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):
LIF, N=1000
==32636== Profiling result:
Time(%) Time Calls Avg Min Max Name
58.81% 235.41ms 10000 23.540us 1.3760us 60.282ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
16.55% 66.264ms 1 66.264ms 66.264ms 66.264ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
8.37% 33.502ms 10000 3.3500us 3.1680us 4.3200us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
7.30% 29.222ms 30021 973ns 928ns 4.8640us [CUDA memcpy HtoD]
5.08% 20.352ms 10000 2.0350us 1.4400us 2.6240us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
3.60% 14.400ms 10000 1.4400us 1.3120us 1.9200us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.22% 879.91us 1 879.91us 879.91us 879.91us _run_spikemonitor_codeobject_init(unsigned int)
0.04% 151.84us 15 10.122us 2.1120us 73.184us [CUDA memcpy DtoH]
0.02% 81.056us 1 81.056us 81.056us 81.056us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
0.00% 5.5360us 1 5.5360us 5.5360us 5.5360us _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
0.00% 3.3920us 1 3.3920us 3.3920us 3.3920us _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)
LIF, N=10000
==1277== Profiling result:
Time(%) Time Calls Avg Min Max Name
78.12% 2.79040s 10000 279.04us 1.2800us 901.28ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
18.71% 668.47ms 1 668.47ms 668.47ms 668.47ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
1.11% 39.806ms 10000 3.9800us 3.9040us 5.4400us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
0.82% 29.177ms 30021 971ns 928ns 28.000us [CUDA memcpy HtoD]
0.72% 25.781ms 10000 2.5780us 1.7600us 3.4240us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
0.44% 15.810ms 10000 1.5810us 1.5040us 2.2400us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.05% 1.7333ms 15 115.56us 2.1120us 1.2212ms [CUDA memcpy DtoH]
0.02% 881.92us 1 881.92us 881.92us 881.92us _run_spikemonitor_codeobject_init(unsigned int)
0.00% 81.280us 1 81.280us 81.280us 81.280us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
After a quick and dirty change of the spikemonitor (adapted for not parallel spikespace) (2.):
LIF, N=1000, SERIALIZED SPIKEMONITOR
==15351== Profiling result:
Time(%) Time Calls Avg Min Max Name
58.65% 214.83ms 10000 21.482us 1.5360us 55.489ms _run_spikemonitor_codeobject_kernel(unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
14.67% 53.725ms 1 53.725ms 53.725ms 53.725ms _copy_spikemonitor_codeobject_kernel(int*, double*, bool)
9.15% 33.501ms 10000 3.3500us 3.1360us 4.3520us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
7.98% 29.222ms 30021 973ns 928ns 7.1680us [CUDA memcpy HtoD]
5.56% 20.374ms 10000 2.0370us 1.4400us 2.6240us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
3.91% 14.317ms 10000 1.4310us 1.3440us 1.9520us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.04% 151.36us 15 10.090us 2.1120us 73.280us [CUDA memcpy DtoH]
0.02% 77.632us 1 77.632us 77.632us 77.632us _run_debugmsg_spikemonitor_codeobject_kernel(void)
0.01% 53.280us 1 53.280us 53.280us 53.280us _run_spikemonitor_codeobject_init(void)
0.00% 3.2960us 1 3.2960us 3.2960us 3.2960us _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)
LIF, N=10000, SERIALIZED SPIKEMONITOR
==11854== Profiling result:
Time(%) Time Calls Avg Min Max Name
79.31% 2.51796s 10000 251.80us 1.5680us 827.75ms _run_spikemonitor_codeobject_kernel(unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
17.06% 541.50ms 1 541.50ms 541.50ms 541.50ms _copy_spikemonitor_codeobject_kernel(int*, double*, bool)
1.28% 40.739ms 10000 4.0730us 3.9680us 5.3760us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
0.92% 29.247ms 30021 974ns 928ns 28.032us [CUDA memcpy HtoD]
0.81% 25.818ms 10000 2.5810us 1.6640us 3.4240us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
0.56% 17.714ms 10000 1.7710us 1.6000us 2.1440us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.05% 1.7398ms 15 115.99us 2.1120us 1.2275ms [CUDA memcpy DtoH]
0.00% 76.928us 1 76.928us 76.928us 76.928us _run_debugmsg_spikemonitor_codeobject_kernel(void)
0.00% 53.568us 1 53.568us 53.568us 53.568us _run_spikemonitor_codeobject_init(void)
Using the parallel spikespace (filled in parallel by the modified thresholder) and the parallel spikemonitor (3.):
LIF, N=1000, PARALLEL THRESHOLDER
==16310== Profiling result:
Time(%) Time Calls Avg Min Max Name
60.58% 268.46ms 10000 26.846us 1.3760us 15.050ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
15.26% 67.602ms 1 67.602ms 67.602ms 67.602ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
8.03% 35.578ms 10000 3.5570us 3.2960us 4.4160us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
6.64% 29.410ms 30021 979ns 928ns 3.5200us [CUDA memcpy HtoD]
5.62% 24.903ms 10000 2.4900us 1.7280us 5.1520us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
3.62% 16.056ms 10000 1.6050us 1.4080us 1.8880us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.20% 876.83us 1 876.83us 876.83us 876.83us _run_spikemonitor_codeobject_init(unsigned int)
0.03% 151.74us 15 10.116us 2.1120us 73.248us [CUDA memcpy DtoH]
0.02% 81.664us 1 81.664us 81.664us 81.664us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
0.00% 5.6640us 1 5.6640us 5.6640us 5.6640us _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
0.00% 3.2640us 1 3.2640us 3.2640us 3.2640us _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)
LIF, N=10000, PARALLEL THRESHOLDER
==16906== Profiling result:
Time(%) Time Calls Avg Min Max Name
72.62% 2.13823s 10000 213.82us 1.3760us 112.15ms _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
22.60% 665.34ms 1 665.34ms 665.34ms 665.34ms _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
1.80% 53.087ms 10000 5.3080us 2.0160us 18.304us kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
1.36% 39.906ms 10000 3.9900us 3.8720us 5.1520us kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
0.99% 29.177ms 30021 971ns 928ns 28.128us [CUDA memcpy HtoD]
0.54% 15.887ms 10000 1.5880us 1.5360us 1.9840us kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
0.06% 1.7550ms 15 117.00us 2.1120us 1.2425ms [CUDA memcpy DtoH]
0.03% 879.14us 1 879.14us 879.14us 879.14us _run_spikemonitor_codeobject_init(unsigned int)
0.00% 81.632us 1 81.632us 81.632us 81.632us _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
When running a brian2 script in cuda_standalone
mode with compile=True
and run=True
arguments for device.build()
, following error is thrown:
Traceback (most recent call last):
File "test.py", line 5, in <module>
device.build(directory='./', compile=True, run=True, debug=False)
File "/mnt/antares_raid/home/denisalevi/projects/dev_brian2cuda/brian2cuda_repo/brian2cuda/device.py", line 586, in build
self.run(directory, with_output, run_args)
File "/home/denisalevi/anaconda2/envs/dev_b2c/lib/python2.7/site-packages/brian2/devices/cpp_standalone/device.py", line 797, in run
last_run_info = open('results/last_run_info.txt', 'r').read()
IOError: [Errno 2] No such file or directory: 'results/last_run_info.txt'
This error can be reproduced with:
from brian2 import *
import brian2cuda
set_device('cuda_standalone')
device.build(directory='./', compile=True, run=True, debug=False)
@moritzaugustin could you check if you can reproduce this error?
The CUDAStandaloneDevice
object calls the CPPStandaloneDevice.run()
method (which it inherits) and that tries to open results/last_run_info.txt
. In cpp_standalone
mode, the created objects.cpp
file writes the results/last_run_info.txt
in its _write_arrays()
function. In cuda_standalone
mode, it doesn't.
So either override the run()
method in the CUDAStandaloneDevice
class or change the brian2cuda/templates/objects.cu
template to write the 'restuls/last_run_info.txt' file as done in brian2/devices/cpp_standalone/templates/objects.cpp
.
Check functionality of `results/last_run_info.txt' file!
First detailed profiling of some brian2 examples with brian2cuda after updating brian2 version (#22).
Currently we have -arch=sm_20
as default nvcc
argument. But if we benchmark performance, shouldn't we compile for the architecture that our GPU has (which is sm_35
)?
@moritzaugustin any reason not to let the compiler choose the GPU architecture based on the GPU by default? The user can still change this with prefs.devices.cuda.extra_compile_args_nvcc
(once I have implemented it, see #35).
EDIT by denisalevi:
Summary: sort synapse IDs in spike queues by post neuron IDs and use thread <-> postNeuron
correspondence.
currently as the current synapse queue contains synapse ids for which the synaptic effects have to be applied. this is done in a sequential fashion for each block.
[see denis comment for the first attempt too]
we should make it parallel for each block by
Use the following versions in accordance with brian-team/brian2genn#25 (comment)
brian2: 71f12749e24f1e5eb6c775c331d8ad5060daad95
brian2genn: a00d75961c9eb95fcbee5c05bbf8cc7602c86765
genn: 0cb5825cbc45ce3b2c4fc0c499c6377b8b31f3fd
Using the example compartmental/rall.py
in cuda_standalone mode on my GT 610 GPU the kernel execution fails due to CUDA error (during integration): too many resources requested for launch
(when calling the *_integration kernel).
The problems seems to be the kernel register usage and the resolution is to remove (at least two) unused pointers from the kernel argument list. However, as the latter is generated in cuda_standalone/device.py
and we either have to reduce the items in uses_variables
within the respective template (and explicitly declare variables in kernels) or otherwise we need to have a way of explicitly excluding variables from the list %DEVICE_PARAMETERS%
(and for %KERNEL_VARIABLES%
similarily).
related with iss #4
initial version of the usual speed tests cpp_standalone
vs cuda_standalone
vs brian2genn
figures & relevant profiling output can be inserted here and the discussion can start :-)
in addition it would be good to have the Python speed (as well as feature) test scripts in an appropriate folder in the repository to allow easy repetition of the "experiments"
In the end we want comparisons between brian2cuda, brian2genn and cpp_standalone for different modi. And documentation of brian2cuda performance increase for different optimizations, changes in concepts and data structures.
Documentations should therefore include visualizations of concepts and data structures used.
This lets a few brian2 test suite tests fail. Probably easy to do, we can just use the C++ implementation.
Currently the 'source'
serialization mode uses only one thread and one block and loops for each spiking neuron through all post neurons (total serialization). There are probably ways to parallelize this.
But the question is if this is worth is at all. Meaning: will we gain enough parallelization to be faster then cpp_standalone for some use cases?
And what models might use multiple serialization modes and might benefit from performance increase in the post
and syn
modes while still using pre
modes? For those a better pre
mode would still be beneficial even if it is slower then in cpp_standalone. There were some STDP model with traces mentioned. This should be checked.
When using subsets of a NeuronGroup as sources of SynapticPathways, currently each SynapticPathway loops once through the entire spikespace of the source NeuronGroup and pushes only those synpase ids into the spikequeues, which neuron ids are part of the source of the SynapticPathway. So when we have multiple SynapticPathways using different subsets of the same NeuronGroup as source, then we loop through the same spikespace multiple times. This should in general be possible to avoid by just looping through the spikespace once and pushing the synapse ids into the spikequeue of the correct SynapticPathway.
Just opening this issue to keep this in mind. This could be tackled once we have more detailed profiling results, where we can see if its even worth it.
Example code snippet which would result in 2 times looping through the same spikepace at each timestep (once using a boolean to define the subset and once a Subgroup):
P = NeuronGroup(4000, ...)
Syn0 = Synapses(P, P, ...)
Syn0.connect('i<3200', ...)
Syn1 = Synapses(P[3200:], P, ...)
Syn1.connect(True, ...)
Running the following code
from brian2 import *
import brian2cuda
set_device('cuda_standalone')
all_neurons = NeuronGroup(10000, 'v:1', threshold='True')
neuron_mon = SpikeMonitor(all_neurons)
run(1*ms)
device.build(directory='./', compile=True, run=True, debug=False)
and then executing the compiled cuda binary multiple times, e.g. with
for _ in `seq 1 10`:; do ./main; done
results occasionally in too few "Number of spikes" to be printed.
10000 neurons x 10 time steps = 100000 spikes
should be the right result, while sometimes btwn 95000 and 99999
spikes are recorded (or printed by the spikemonitor).
the parallel spikespace which is used in the synaptic propagation/effect application & spike monitor is so-far not yet used
currently: mainly the first block (atomicAdd
on global memory) fills the spike space -- almost purely sequentially
should be: all blocks in parallel should fill their respective portion of the spike space using only shared memory atomicAdd
(for index computation and local spike count in smem) and only at the very end an atomicAdd
on the global memory to find the total no of spikes (global spike count) as the sum of the local spike counts
In C++ standalone every codeobject records the time it needs for execution (using std::clock()
). This info is then written to disk as results/profiling_info.txt
. Feature tests and speed tests do collect profiling info (with our update to brian2.0) and raise an error if the file is not found. My workaround is, that _write_arrays()
now creates a dummy profiling info file of the same format as in cpp_standalone with runtime = 0 for all codeobjects.
But it wouldn't harm to implement the profiling info just as cpp_standalone does (meaning each host side function call collects the time it needs for execution, with alle kernel calls being part of that). This could probably be done quite easily.
And then there is also the option to implement the same thing for each actual kernel call. The only thing is, that inside each kernel call we would need to add the value to global memory. This would mean one atomicAdd
per thread and kernel call (or coalesced write and then some reduction algorithm). But either way, this increases simulation time unnecessary when profiling info isn't needed. So either we add something like
{{ % if profile % }}
// collect time
{{ % end if % }}
to the templates. Or we just leave it like it is and use nvprof
for kernel time profiling, since that is what it does anyways. Guess that is better.
If we decide not to collect profiling info at all, we should overwrite CPPStandaloneDevice.get_profiling_info()
to handle the not existence of the profiling_info.txt
instead of writing a dummy file. That would be cleaner.
Testing for no or scalar delays is not implemented correctly. In standalone mode the delay parameter and therefore its .scalar
and .size
flags are not accessible before simulation hasn't been run. At least not in the way it was intended in the current version here in commit bac8b5a.
I would leave this for now and look into it once the no_or_const_delay_mode
implementation is finished and we updated to brian2 master (#22).
some thoughts:
delay = '5 * ms * rand()'
or delay = 'i * ms'
get translated into c++/cuda code in standalone mode and thats why their values are not available before simulationdelay.scalar
and delay.size
flags are updatedrand()
, i
, xi
etc. in the delay expression might be necessary in order to determine weather there will be different delays for different synapses or not (check available brian2 functions for string characterisation)Until this is fixed, no_or_const_delay_mode = False
is hard coded here!
it seems that big models (with many variables) lead to unexecutable kernels because of exceeding the max. no of registers per thread (https://en.wikipedia.org/wiki/CUDA). this is related with iss #5
if this is indeed true we should change the storage of pointers within kernels from registers to shared or global memory
alternatively explicitly support only up to ... eqs and raise NotImplemented
during code generation otherwise)
When running any example with a ratemonitor ( e.g. adaptation_oscillation.py
or STDP_standalone.py
) the compilations fails:
makefile:35: recipe for target 'code_objects/ratemonitor_codeobject.o' failed code_objects/ratemonitor_codeobject.cu(87): error: class "Clock" has no member "i"
Line 6 of the ratemonitor.cu
template seems to wrongly access the current timestep index.
For some reason using [1:]
style indexing for a StateMonitor
returns an empty array while e.g. [1]
indexing returns the correct value.
...
mon = StateMonitor(...)
...
run(2*defaultclock.dt) # -> len(mon.t) is 2
print(mon.t[1:]) # returns []
print(mon[1]) # returns Quantity([defaultclock.dt])
This seems to apply for any kind of [3:]
style indexing of the StateMonitor.
This breaks e.g. brian2.tests.test_monitor.test_state_monitor_indexing.
There seem to be some changes in the spikequeue in brian2.
Following PRs from brian2 referenced possible changes for brian2cuda:
When executing the following code:
from brian2 import *
import brian2cuda
set_device('cuda_standalone')
neurons = NeuronGroup(1, 'v:1', threshold='True')
run(1*ms)
device.build(directory='build', compile=True, run=False, debug=False)
a compilation error occurs:
code_objects/neurongroup_group_variable_set_conditional_codeobject.cu(98): error: identifier "dev_array_neurongroup_not_refractory" is undefined
1 error detected in the compilation of "/tmp/tmpxft_00003e22_00000000-7_neurongroup_group_variable_set_conditional_codeobject.cpp1.ii".
make: *** [code_objects/neurongroup_group_variable_set_conditional_codeobject.o] Error 2
When setting the device to 'cpp_standalone'
, everything is fine and the neuron spikes once every timestep as expected.
As some users might be interested in using single precision due to the strong arithemetic performance of consumer grade nvidia GPUs in comparison to double precision we should offer this option.
For this a preference (float type or similar) should be choosable which is by default double and on demand float -- and all templates and the code generation should rely on it.
evaluate alternatives and implement after discussion the best:
add to the queue also the postneuron id to allow better parallel synaptic code objects, c.f. issue #15
There was a bug in brian2 when using the (summed)
keyword (brian-team/brian2#704).
The fix might be rather easy if we just adapt the changes to the cpp template from the corresponding PR (brian-team/brian2#705).
But since we talked about changing the cuda implementation of summed variables anyways, I will leave this for now.
Our current CSpikeQueue
(spikequeue.h) and cudaVector
(cudaVector.h) classes seem to have a random mix of size32_t
, int
and unsigend int
datatypes. Or at least I could not really figure out a consistent reasoning behind it. Now I'm not quite sure what datatypes would make sense here.
Looking at the CSpikeQueue
from brian2
(cspikequeue.cpp), I have a few questions @mstimberg
CSpikeQueue.synapses
, size32_t
is used and for synapse IDs in the CSpikeQueue.queue
int
is used. Why not use the same datatype in both?My understanding so far is, that the C++ standard requires unsigned int
to hold only at least values in the range 0 to 65535
and int
in the range -32767 to 32767
. And the latter seems to me to little for synapse IDs? Or is any computer running brian2
simulations anyways going to use a computer architecture / compiler where int
can store a higher range and it simply doesn't really matter?
Since we now will have 3 different synaptic transmission modes, we can't use a bool anymore.
Instead of no_or_const_delay_mode == True/False
, @moritzaugustin suggested to use something along the lines of
synaptic_transmission == HOMOGENOUS_DELAYS
synaptic_transmission == HETEROGENOUS_DELAYS_INDIVIDUAL
synaptic_transmission == HETEROGENOUS_DELAYS_BUNDLED
Depending on the mode, the templates should then use different code.
Instead of pushing individual synapse ids separately, as a second alternative we can push synapse "bundle" ids into the respective delay queues, where each synapse bundle is defined by (firing presynaptic neuron id, delay, postsynaptic group) -- and the numbering is done e.g. within each postsynaptic group separately.
This procedure should reduce the cost of the push operation signifcantly given that the synapse bundles have not too small size. Furthermore the application of the synaptic effects can be doine exactly as in the no or const delay mode -- not sorting of active synapses required! More precisely: the postsynaptic groups independently iterate sequentially through the active synapse bundles but apply the effects of all contained synapses in parallel among the resp. threads of the block.
Note: in order to push the synaptic bundle ids in parallel for a given postsyn group block the number of synaptic bundles for each presyn. neuron has to be known -- but this is exactly the number of "unique delays".
The idea is by (Fidjeland and & Shanahan 2009, 2010).
The benchmark of the two heterogenous delay modes should include an example with uniform delay distribution (for too small numbers of synapses within each postsyn group the old propagation might be faster) and with a more bulky one (the bundles should perform much better) -- e.g. shifted peaky distribution.
from the used cuda features alone the minimal compute capability is 2.0
when running some tests on asterope (device capability 2.1):
cuda_standalone
(but not cpp_standalone
)test_cuda_standalone.py
(copied from test_cpp_standalone.py
replacing all cpp_standalone
occurences with cuda_standalone
) does not work as expected: AssertionError in line 424
)running cuda-memcheck ./main
-- e.g. for the 1. example above -- a cudaErrorLaunchOutofResources
error is reported due to too many resources requested for launch on CUDA API call to cudaLaunch
it is likely that compute capability 3.5 is required due to the available registers per thread (c.f. https://en.wikipedia.org/wiki/CUDA) -- we have to check which kernels are problematic if we wanted to relax the required cc version no
this is related with iss #4
pre -> post
=> ...post_id_by_pre
analogon to ...synapses_id_by_pre
)thread <-> synapse
correspondence, but let threads which postID has already occurred (can be computed from 2.) return and the first thread having a certain postID loop through all synapses with that postID, applying the synaptic effects. (We assume that postIDs only rarely occur more then once in relation to the total number of synapses in the queue, therefore we accept the resulting thread divergence from returning threads -> needs profiling)tid <-> syn
correspondence:if (shared_postID[tid-1] == shared_postID[tid]) {return};
int k = tid;
while (shared_postID[k] == shared_postID[tid])
// apply synaptic effect to postID
k += 1
Currently our EventMonitor
(~ SpikeMonitor
) returns times and indices sorted by time but for each time step the indices are not sorted. So we get e.g. this
times = [0,0,0, 1,1,1,1, 2,2,2,2,2]
indices = [5,1,2, 1,3,2,4, 4,2,5,1,3]
A few tests form the brian2 test suite fail because of this, since brian2 returns the indices sorted as well, like this
times = [0,0,0, 1,1,1,1, 2,2,2,2,2]
indices = [1,2,5, 1,2,3,4, 1,2,3,4,5]
This is low priority right now, but I guess we should return the data in the same format as brian2.
We could save the number of spikes at each time step and then use that to sort the indices in the end of the simulation using thrust::sort
. Or if we want to save one copy from shared to global memory per time step, we could write our own kernel for sorting.
the template spikemonitor.cu
contains code which combines after a simulation the arrays for spike time and neuron index for the blocks. however, the final time array should be sorted by time (and the spike time array of course accordingly) which is as far as I see not yet implemented. it can of course be done as comfortable as possible, e.g. on the host side
When using a PopulationRateMonitor
with a NeuronGroup
, which is not a Subgroup
, we can just read out the last element of the spikespace to get the number of spiking neurons for the current time step to calculate the instantaneous rate.
When using subgroups though, we can't do that, since a spikespace belongs to an entire Neurongroup. Currently we call each timestep a kernel with 1 thread and 1 block, which then tests if the ratemonitor belongs to a subgroup and if so, loops through all spiking neurons in the spikespace and increments a counter for each neuron belonging to the subgroup. We should check if doing it in parallel with e.g atomicAdd
is worth it performancewise.
Cpp standalone uses the fact, that their neuron indices are sorted in the spikespace to get the first and last spiking neuron of the subgroup. Since our spikespace is not sorted by neuron indices, we can't do that.
spatialstateupdater.cu
templatecpp_standalone
versionThis should include an evaluation (profiling) for difference sorting options and should possibly be possible to set by the user in the brian2cuda preferences.
Currently the compiler flags specified for cpp compilers are passed to nvcc.
CudaStandaloneDevice::build()
gets compiler_flags
from brian2/codegen/cpp_prefs.py::get_compiler_and_args()
and passes those to CudaStandaloneDevice::generate_makefile()
, where they are used in the makefile template as arguments for the nvcc
call.
For now I added the following line to brian2cuda/cuda_cenerator.py
to get rid of the default gcc compiler flags that are not compatible with a nvcc call (namely -ffast-math
, -fno-finite-math-only
and -march=native
).
prefs.codegen.cpp.extra_compile_args_gcc = ['-w', '-O3']
Projects do compile with this options, but we should probably create a cuda_prefs.py
file analog to the cpp_prefs.py
file, which would then also give the user control over nvcc compiler flags. And then there should be a way to pass compiler flags for the cpp compiler, that is called by nvcc to compiles the host code. This is where the prefs.codegen.cpp
arguments should be passed to. And I also haven't looked much into nvcc compiler flags, which should be done to choose a proper default.
These tests are failing:
brian2.tests.test_spikegenerator.test_spikenerator_period
brian2.tests.test_spikegenerator.test_spikenerator_rounding_period
brian2.tests.test_spikegenerator.test_spikegenerator_change_spikes
It looks like the generated spikes are not repeated after period
time has passed.
Our implementation of the clip
function currently casts all arguments to double
. When implementing #37, we should also add at least an option for casting to float
. Maybe even overload the function for different types as done with the modulo function (while taking care of type safety when comparing different data types).
@mstimberg could you please make available the labels (to allow tagging of issues) from https://github.com/brian-team/brian2 for this repository as well? in particular i would like to have labels for: priority, cleanup, enhancement but optimally just copy all labels here (I think there are tools available for that purpose)
when using clean versions of moritzaugustin:brian2/frozen and brian-team/master the CUBA example fails in file cpp_standalone/device.py
in line 797 when not successfully reading results/last_run_info.txt
(no such file or directory
)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.