Comments (13)
Performance-wise, after a quick look at your code, you could try some of these:
- When the kernel is compiled, the buffer sizes (as well as the decimation parameter) are already known, so setting them via a macro could unroll the inner loop and optimize some arithmetic operations. This of course disallows caching the compiled kernel program to run it with different parameters...
- If your data size is a multiple of 8/16 elements (or you can pad it to be), you may want to rewrite the kernel to use 8-/16-element vector types (i.e. load and store float16 values). This could greatly improve performance.
- Even if the data cannot be processed entirely as vectorized types, loading and storing vectors (and thus issuing fewer memory-access instructions) alone can give a good performance boost, since memory access is usually the bottleneck on the VC4.
E.g. this code has only half the memory accesses and takes (without scheduling overhead) about 25% fewer cycles to execute:
__kernel void fir_filter_process_memory(__global const float2 *restrict input,
                                        __global const float2 *restrict taps,
                                        const unsigned int taps_len,
                                        __global float2 *output,
                                        const unsigned int decimation,
                                        const unsigned int output_len) {
    for (unsigned int i = 0; i < output_len; i++) {
        int output_offset = get_global_id(0) * output_len + i;
        int input_offset = output_offset * decimation;
        float real0 = 0.0f;
        float imag0 = 0.0f;
        for (unsigned int j = 0; j < taps_len; j++) {
            float2 in = input[input_offset + j];
            float2 tap = taps[j];
            real0 += (in.x * tap.x) - (in.y * tap.y);
            imag0 += (in.x * tap.y) + (in.y * tap.x);
        }
        output[output_offset] = (float2)(real0, imag0);
    }
}
This code could be twice as fast (again ignoring scheduling overhead), but it requires the input and taps buffers to be padded to 8 floats and a constant tap count of 3:
__kernel void fir_filter_process_vectorized(__global const float *restrict input,
                                            __global const float *restrict taps,
                                            __global float2 *output,
                                            const unsigned int decimation,
                                            const unsigned int output_len) {
    for (unsigned int i = 0; i < output_len; i++) {
        int output_offset = get_global_id(0) * output_len + i;
        int input_offset = output_offset * 2 * decimation;
        float8 in = *((__global float8*)(input + input_offset));
        float8 tap = *((__global float8*)taps);
        float real0 = 0.0f;
        float imag0 = 0.0f;
        // tap 1
        real0 += (in.s0 * tap.s0) - (in.s1 * tap.s1);
        imag0 += (in.s0 * tap.s1) + (in.s1 * tap.s0);
        // tap 2
        real0 += (in.s2 * tap.s2) - (in.s3 * tap.s3);
        imag0 += (in.s2 * tap.s3) + (in.s3 * tap.s2);
        // tap 3
        real0 += (in.s4 * tap.s4) - (in.s5 * tap.s5);
        imag0 += (in.s4 * tap.s5) + (in.s5 * tap.s4);
        output[output_offset] = (float2)(real0, imag0);
    }
}
from vc4cl.
The different interfaces to access the QPUs have some "quirks", e.g. the mailbox interface does not allow kernel executions taking longer than 1 s, and the VCHIQ GPU service does not wait for the kernel to actually finish.
Depending on whether you run the program as root or as a "normal" user, a different one of these interfaces is used.
Can you rerun your program (with and without "sudo") with the VC4CL_DEBUG=system environment variable set and post the (VC4CL) log output?
from vc4cl.
Without sudo:
VC4CL_DEBUG=system ./perf_fir_filter
output length max: 199
output length max: 204
working_len_total: 10996
clGetDeviceIDs: 0
clCreateContext: 0
clCreateCommandQueue: 0
allocated working buf: 10996
gpuserv: vc_gpuserv_init: starting initialisation
[VC4CL](perf_fir_filter): [VC4CL] Using VCSM (CMA) for: memory allocation
[VC4CL](perf_fir_filter): [VC4CL] Using VCHI for: kernel execution
...
average time: 0.003389
With sudo:
sudo VC4CL_DEBUG=system ./perf_fir_filter
output length max: 199
output length max: 204
working_len_total: 10996
clGetDeviceIDs: 0
clCreateContext: 0
clCreateCommandQueue: 0
allocated working buf: 10996
[VC4CL](perf_fir_filter): [VC4CL] Using mailbox for: memory allocation, system queries
[VC4CL](perf_fir_filter): [VC4CL] Using V3D for: kernel execution, profiling, system queries
...
average time: 0.051971
OK. So the performance difference can be explained by the different memory allocation types.
from vc4cl.
I turned on VCHI and now the sudo run is slow too, so this is consistent with the "no sudo" case.
It looks like VCHI is causing this slowness. After some number of executions, it started calculating very slowly. But the output is still valid (so it is not a timeout):
done: 1 9 0.002897
output: 0.000051934, 0.421165079
done: 2 0 0.126826
output: 0.000051934, 0.421165079
from vc4cl.
Interesting results. I never got to an actual performance comparison of the different memory allocation methods, since non-root execution does not work reliably (see the tickets I linked above).
IIRC, for VCSM the kernel manages the memory and shares it by communicating asynchronously with the firmware, while for mailbox memory management the kernel is completely unaware of what is going on. So maybe the involvement of the kernel causes the slow-down... Of course it could very well also be that I just messed up the caching flags...
If you are interested in playing around a bit more, you could rerun the program with the VC4CL_CACHE_FORCE environment variable set to 0 (uncached), 1 (host-cached), 2 (GPU-cached) and 3 (both cached), see here.
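For reference, a small dry-run sketch that just prints the four commands to try (perf_fir_filter is the benchmark binary used in this thread; adjust to your own client):

```shell
# Print the benchmark invocation for each VC4CL_CACHE_FORCE mode:
# 0 = uncached, 1 = host-cached, 2 = GPU-cached, 3 = both cached.
# This only echoes the commands (a dry run); drop the echo to run them.
for mode in 0 1 2 3; do
    echo "sudo VC4CL_CACHE_FORCE=$mode ./perf_fir_filter"
done
```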
Generally, I would recommend running any VC4CL client as root (I know, not so great), allocating memory via the mailbox and executing kernels via the V3D registers (the defaults for root), since this is by far the most-tested combination.
from vc4cl.
It gets weirder with more tests.
- Loop executed 2 times:
- without sudo: average time: 0.004408, output: 0.000000000, 0.000000048 0.000284149, 0.000077057
- with sudo: average time: 0.102403, output: 0.000051934, 0.421165079 0.305830151, 0.193653673
- VC4CL_CACHE_FORCE=0 + loop executed 2 times:
- without sudo: average time: 0.004297, output: -0.002558924, 0.426762044 0.305830151, 0.193653673
- with sudo (timeout): average time: 60.123802, output: 0.000000000, 0.000000000 0.000000000, 0.000000000
- VC4CL_CACHE_FORCE=1 + loop executed 2 times:
- without sudo: average time: 0.004470, output: 0.000000000, 0.000000048 0.000284149, 0.000077057
- with sudo: average time: 0.102424, output: 0.000051934, 0.421165079 0.305830151, 0.193653673
- VC4CL_CACHE_FORCE=2 + loop executed 2 times:
- without sudo: average time: 0.004419, output: 0.000000000, 0.000000048 0.000284149, 0.000077057
- with sudo: average time: 30.062032 (!!!!!), output: 0.000000000, 0.000000000 0.000280094, 0.000072954
- VC4CL_CACHE_FORCE=3 + loop executed 2 times:
- without sudo: average time: 0.004467, output: 0.000000000, 0.000000048 0.000284149, 0.000077057
- with sudo: average time: 0.102449, output: 0.000051934, 0.421165079 0.305830151, 0.193653673
So it looks like:
- non-sudo execution (VCHI) doesn't wait for the result.
- That's why it is so fast (25x faster).
- That's why the output is corrupted. It doesn't wait for the second execution to complete, so the buffer with the previous results is returned, or some rubbish is returned in the case of VC4CL_CACHE_FORCE=0.
- GPU-cached memory produces rubbish when used with sudo.
- Only sudo with host-cached memory produces valid results.
The performance gap between sudo/no-sudo is very important here, because executing a very similar filter on the CPU (+neon) gives "average time: 0.070378", which lies between the sudo and non-sudo timings.
There are a number of things I can try to optimise performance, but first of all I would like to make the program stable :)
Some optimisations:
- Cache the filter taps in local memory
- Output results into local memory first and then copy them into global memory
- Use map/unmap for the external buffer
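The first of these optimisations (caching the taps in local memory) might look roughly like this. This is an untested sketch, and it assumes the host allocates the __local buffer with clSetKernelArg(kernel, 6, taps_len * sizeof(cl_float2), NULL):

```c
// Sketch: copy the taps into __local memory once per work-group, then read
// them from there in the inner loop. Untested on the VC4.
__kernel void fir_filter_process_local(__global const float2 *restrict input,
                                       __global const float2 *restrict taps,
                                       const unsigned int taps_len,
                                       __global float2 *output,
                                       const unsigned int decimation,
                                       const unsigned int output_len,
                                       __local float2 *taps_cache) {
    // Each work-item copies a strided part of the taps into local memory.
    for (unsigned int j = get_local_id(0); j < taps_len; j += get_local_size(0))
        taps_cache[j] = taps[j];
    barrier(CLK_LOCAL_MEM_FENCE); // wait until all taps are cached

    for (unsigned int i = 0; i < output_len; i++) {
        int output_offset = get_global_id(0) * output_len + i;
        int input_offset = output_offset * decimation;
        float real0 = 0.0f;
        float imag0 = 0.0f;
        for (unsigned int j = 0; j < taps_len; j++) {
            float2 in = input[input_offset + j];
            float2 tap = taps_cache[j]; // read from local instead of global memory
            real0 += (in.x * tap.x) - (in.y * tap.y);
            imag0 += (in.x * tap.y) + (in.y * tap.x);
        }
        output[output_offset] = (float2)(real0, imag0);
    }
}
```

Whether this actually helps depends on how VC4CL maps __local memory on the VC4, so it is worth benchmarking rather than assuming.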
from vc4cl.
Ok. I will go with sudo-enabled access and start optimising the code.
This is not related to this issue, but I switched to float8 and got a 5x performance boost:
- original - average time: 0.503197
- float8 (https://github.com/dernasherbrezon/clDsp/blob/main/fir_filter_float8.cl) - average time: 0.176660
- CPU (neon via libvolk) - average time: 0.261358
So it looks like it is possible to beat CPU on this task.
from vc4cl.
I have been extensively testing the timeout issue for the last several days:
- It seems mailbox call 0x00030011 returns before the actual computation completes. Similar to the issue with VCHIQ. I've made a loop that runs the same kernel + sleep(1).
Normally the computation takes ~16k us and lots of execution cycles:
[VC4CL](VC4CL Queue Han): Elapsed time: 16264us
[VC4CL](VC4CL Queue Han): Clock speed: 0
[VC4CL](VC4CL Queue Han): Instruction count: 461
[VC4CL](VC4CL Queue Han): Explicit uniform count: 3
[VC4CL](VC4CL Queue Han): QPUs used: 12
[VC4CL](VC4CL Queue Han): Kernel repetition count: 17
[VC4CL](VC4CL Queue Han): Execution cycles: 38808180
But after some time it returns earlier:
[VC4CL](VC4CL Queue Han): Elapsed time: 548us
[VC4CL](VC4CL Queue Han): Clock speed: 0
[VC4CL](VC4CL Queue Han): Instruction count: 461
[VC4CL](VC4CL Queue Han): Explicit uniform count: 3
[VC4CL](VC4CL Queue Han): QPUs used: 12
[VC4CL](VC4CL Queue Han): Kernel repetition count: 17
[VC4CL](VC4CL Queue Han): Execution cycles: 801568
Note the much smaller execution cycle count. However, thanks to the sleep(1), I can submit the same kernel again. It still returns early, but the execution cycles (which are read from the GPU) stay relatively normal:
[VC4CL](VC4CL Queue Han): Elapsed time: 548us
[VC4CL](VC4CL Queue Han): Clock speed: 0
[VC4CL](VC4CL Queue Han): Instruction count: 461
[VC4CL](VC4CL Queue Han): Explicit uniform count: 3
[VC4CL](VC4CL Queue Han): QPUs used: 12
[VC4CL](VC4CL Queue Han): Kernel repetition count: 17
[VC4CL](VC4CL Queue Han): Execution cycles: 38803212
So it looks like the GPU computes the kernel, but for some reason the mailbox interface returns early.
- Buffer size has no effect. I can reduce it and still get these early returns.
- A simple kernel works well. I tried the following kernel with the same buffers and input parameters and never saw early returns:
__kernel void fir_filter_process(__global const float *restrict input,
                                 __global const float *restrict taps,
                                 __global float *output) {
    int output_offset = get_global_id(0) * 2;
    output[output_offset] = 1;
    output[output_offset + 1] = 2;
}
from vc4cl.
Couple more observations:
- If I remove the sleep(1), then after an early return I cannot run the application again. Does it crash the GPU (?) or saturate some internal buffer in ThreadX? Only a power cycle helps.
- With the sleep(1) I can re-run the application and get the same 16k us execution times. So something resets/re-initializes between two subsequent executions.
from vc4cl.
Tried running code similar to add.py:
for x in range(100):
    start = time.time()
    drv.execute(
        n_threads=n_threads,
        program=code,
        uniforms=uniforms
    )
    elapsed_gpu = time.time() - start
    print('GPU: {:.4f} sec'.format(elapsed_gpu))
using py-videocore and got the same timeout:
GPU: 0.0044 sec
GPU: 0.0044 sec
Traceback (most recent call last):
File "add.py", line 118, in <module>
uniforms=uniforms
File "/usr/local/lib/python3.7/dist-packages/videocore/driver.py", line 238, in execute
r = self.mailbox.execute_qpu(n_threads, message.address, 0, timeout)
File "/usr/local/lib/python3.7/dist-packages/videocore/mailbox.py", line 110, in f
r = self._simple_call(name, tag, req_fmt, res_fmt, list(args))[5:]
File "/usr/local/lib/python3.7/dist-packages/videocore/mailbox.py", line 98, in _simple_call
ioctl(self.fd, IOCTL_MAILBOX, buf, True)
TimeoutError: [Errno 110] Connection timed out
from vc4cl.
Tried bullseye and got a timeout after the very first execution when executing via the mailbox interface.
[VC4CL](VC4CL Queue Han): Mailbox buffer before: 00000020 00000000 00030002 00000008 00000004 00000005 00000000 00000000
[VC4CL](VC4CL Queue Han): Mailbox buffer after: 00000020 80000000 00030002 00000008 80000008 00000005 11e1a300 00000000
[VC4CL](VC4CL Queue Han): Mailbox request: succeeded
[VC4CL](VC4CL Queue Han): Mailbox buffer before: 00000028 00000000 00030011 00000010 00000010 0000000c bebdc290 00000000 00007530 00000000
[VC4CL](VC4CL Queue Han): ioctl_set_msg failed: -1
[VC4CL] Error in mbox_property: Connection timed out
Here is the firmware version:
pi@raspberrypi:~ $ sudo vcgencmd version
Oct 29 2021 10:49:08
Copyright (c) 2012 Broadcom
version b8a114e5a9877e91ca8f26d1a5ce904b2ad3cf13 (clean) (release) (start)
from vc4cl.
Isn't this also related to raspberrypi/linux#4321?
Can you check how long it actually took to time out (from the start of that particular Mailbox call to the timeout error)?
from vc4cl.
Isn't this also related to raspberrypi/linux#4321?
Unlikely. I'm executing exactly the same code all the time and it takes ~16264 us to execute. On "buster" I got a timeout after several executions, while on "bullseye" it fails instantly.
I think it relates to: raspberrypi/firmware#1582 or at least looks very similar.
Another observation (not sure if related): the GPU firmware becomes corrupted. For example, other modules responsible for changing the frequency stop working.
from vc4cl.