You are using the GPU very inefficiently, so I assume kernel execution overhead weighs heavily in the total execution time.
For example, you should get a 10- to 16-fold performance increase if you process 16 float values at once by using a float16 vector,
and even more if you process more than a single vector per work-item.
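The effect of the suggestion above can be sketched on the host side with plain numpy (this is only an illustration of the work-item arithmetic, not actual pyopencl code; the buffer size matches the one discussed later in this thread):

```python
import numpy as np

# A scalar kernel needs one work-item per float; a float16-vector kernel
# needs one work-item per *16* floats, i.e. 16x fewer kernel invocations.
n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

scalar_work_items = n        # one element per work-item
vector_work_items = n // 16  # one 16-float vector per work-item

# The vectorized sum gives the same result, just 16 values at a time:
res_vectorized = (a.reshape(-1, 16) + b.reshape(-1, 16)).reshape(-1)
assert np.allclose(res_vectorized, a + b)
print(scalar_work_items, vector_work_items)  # 1000000 62500
```

The kernel does the same amount of arithmetic either way; the win comes from amortizing the per-work-item overhead over 16 elements.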
from vc4cl.
Thanks,
I just saw the second part of this issue where you very clearly explained the results. Sorry for the double issue.
For some reason, replacing float with float16 in the kernel hangs the program when using VC4CL (not with pocl), but I'll check.
Thanks!
It works with half:
#pragma OPENCL EXTENSION cl_khr_fp16: enable
__kernel void sum(__global half* a_g, __global const half* b_g, __global half* res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid];
}
But with float16 it crashes the RPi.
Exact same performance: 288 ms.
What exactly do you mean by "crashes the RPi"?
half does not provide better performance than float (except maybe in memory access), since the calculations are all done in float anyway.
The program hangs when using "float16" in the kernel (Ctrl-C doesn't stop it); I need to reboot the RPi as ssh doesn't answer anymore.
Code attached below.
Generate 1000000 random numbers of type <class 'numpy.float16'>
Buffer types: float16 float16 float16
Create OpenCL context and queue
Choose platform:
[0] <pyopencl.Platform 'OpenCL for the Raspberry Pi VideoCore IV GPU' at 0x1ebdb44>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x73e556f8>
Choice [0]:0
Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
Create buffers
Reading kernel file
Compiling kernel
Executing computation
Then nothing
When using pocl, it doesn't hang but crashes "normally", and the result is not consistent:
Generate 1000000 random numbers of type <class 'numpy.float16'>
Buffer types: float16 float16 float16
Create OpenCL context and queue
Choose platform:
[0] <pyopencl.Platform 'OpenCL for the Raspberry Pi VideoCore IV GPU' at 0x24d5b44>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x73e8d6f8>
Choice [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
Create buffers
Reading kernel file
Compiling kernel
Executing computation
Transferring result to host
Computing on the host using numpy
local type: float16
Comparing results
--------------------------------------------------------------------------------------------------
demo_float16.py:58: RuntimeWarning: invalid value encountered in subtract
print("Difference :{}".format(res_np - res_local))
--------------------------------------------------------------------------------------------------
Difference :[ 1.2704e+04 -2.4756e-01 -9.5654e-01 ... -4.3994e-01 -8.8965e-01
-1.1426e-01]
[0.4468 0.567 0.7197 0.6445 0.995 ]
[0.6294 0.2476 0.1945 0.6084 0.5654]
[ 1.270e+04 5.669e-01 -4.245e-02 6.929e-01 -3.712e+03]
[1.076 0.8145 0.914 1.253 1.561 ]
--------------------------------------------------------------------------------------------------
demo_float16.py:64: RuntimeWarning: invalid value encountered in subtract
print("Checking the norm between both: {}".format(np.linalg.norm(res_np - res_local)))
--------------------------------------------------------------------------------------------------
Checking the norm between both: nan
--------------------------------------------------------------------------------------------------
/usr/local/lib/python3.7/dist-packages/numpy/core/numeric.py:2297: RuntimeWarning: invalid value encountered in multiply
x = x * ones_like(cond)
--------------------------------------------------------------------------------------------------
Checking results are mostly the same: False
comparing execution times
openCL: 292.114323 ms
openCL copy from device to host: 2.665573 ms
numpy: 66.555573 ms
--------------------------------------------------------------------------------------------------
munmap_chunk(): invalid pointer
[E] Thu Jul 16 22:19:46 2020: Received signal: SIGABRT
[E] Thu Jul 16 22:19:46 2020: (1) /usr/local/lib/libVC4CC.so.1.2 : +0x7c5ab4 [0x6d627ab4]
[E] Thu Jul 16 22:19:46 2020: (2) /lib/arm-linux-gnueabihf/libc.so.6 : __default_rt_sa_restorer+0 [0x76d19130]
--------------------------------------------------------------------------------------------------
You have a buffer of 1000000 float values and access it with 1000000 work-items reading 16 float values each, so you read far past the end of the allocated buffer.
The VideoCore IV (or at least the component reading the memory) stalls if inaccessible memory is read, which in this case is kind of expected.
You will need to decrease the global_work_size passed to enqueue_nd_range_kernel by dividing it by 16; then it should work consistently on VC4CL and pocl.
Just for my understanding, where in the code do you see that each work-item is reading 16 floats each?
When I look at the pyopencl "hello world" example, it doesn't divide the number of elements by 16:
https://documen.tician.de/pyopencl/index.html
So I modified to:
ev = cl.enqueue_nd_range_kernel(queue=queue, kernel=sum_knl, global_work_size=(vector_size//16,), local_work_size=local_work_size)
and this time it doesn't hang, it is REALLY faster (22 ms) but the result is not as expected:
demo_float16.py:58: RuntimeWarning: invalid value encountered in subtract
print("Difference :{}".format(res_np - res_local))
Difference :[ 1.0144e+02 -6.5234e-01 6.8359e-03 ... -3.1787e-01 -1.1328e-01 -1.5137e-02]
A: [0.006306 0.7876 0.535 0.8457 0.5815 ]
B: [0.657 0.6787 0.00826 0.1714 0.3862 ]
C(openCL): [102.1 0.8145 0.5503 0.8457 0.437 ]
C(cpu) [0.6636 1.467 0.5435 1.018 0.968 ]
Do you think pyopencl doesn't decode the float16 numbers correctly?
Just for my understanding, where in the code do you see that each work-item is reading 16 floats each?
Your OpenCL kernel code:
#pragma OPENCL EXTENSION cl_khr_fp16: enable
__kernel void sum(__global float16* a_g, __global const float16* b_g, __global float16* res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid]; // <-- here
}
The code basically does this: for the Nth work-item (where you run up to 1000000), load the Nth float16 vector (a vector of 16 floats) from the input buffers a_g and b_g, and store it at the Nth float16 vector of the output buffer res_g.
The Nth float16 vector of a base address (e.g. a_g[gid]) consists of the next 16 float values starting at byte address base + N * sizeof(float16) (this is the normal address calculation for the subscript operator []).
Thus, if you have a buffer of X float entries and run with more than floor(X/16) work-items, you will read outside of the allocated buffer.
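The address arithmetic can be checked with a few lines of plain Python (sizes taken from the discussion: 1,000,000 floats, sizeof(float) == 4):

```python
n_floats = 1_000_000
buffer_bytes = n_floats * 4          # sizeof(float) == 4

# An OpenCL float16 is 16 floats, so sizeof(float16) == 64 bytes,
# and work-item N reads 16 floats from byte offset N * 64.
sizeof_float16 = 16 * 4

# Launching n_floats work-items touches 16x more bytes than the buffer has:
bytes_touched = n_floats * sizeof_float16
assert bytes_touched == 16 * buffer_bytes

# Only floor(n_floats / 16) work-items stay exactly within the buffer:
in_bounds_work_items = n_floats // 16
assert in_bounds_work_items * sizeof_float16 == buffer_bytes
```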
and this time it doesn't hang, it is REALLY faster (22 ms) but the result is not as expected:
Is the result wrong for pocl, VC4CL or both?
Do you think pyopencl doesn't decode the float16 numbers correctly?
From how I read your code, pyopencl does not even know that the kernels calculate 16 float values at once, so I doubt that is the problem here.
One problem that I see is the ambiguity of float16.
In your Python files, you specify the numpy array as type float16, which is a 16-bit floating point type (otherwise known as half). The OpenCL float16 type is a vector of 16 32-bit floating point values. So maybe the numpy conversion to float16 (half) screws something up.
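The size mismatch behind this ambiguity can be verified directly in numpy:

```python
import numpy as np

# numpy's "float16" is a 2-byte half-precision *scalar* (IEEE 754 half),
# while OpenCL's "float16" is a *vector* of 16 single-precision floats.
assert np.dtype(np.float16).itemsize == 2   # numpy float16 == OpenCL half
assert np.dtype(np.float32).itemsize == 4   # numpy float32 == OpenCL float

opencl_float16_bytes = 16 * np.dtype(np.float32).itemsize
print(opencl_float16_bytes)  # 64: one OpenCL float16 spans 64 bytes
```

So a buffer filled with numpy float16 values is 32x smaller per element than what an OpenCL float16* kernel argument expects to read.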
What you could try:
- On the OpenCL kernel side, calculate with float16 (a 16-element vector of floats), like the kernel code at the top of this message.
- On the host side, don't do any conversions. The only difference between the scalar float version and the float16 version should be global_work_size=(vector_size,) vs. global_work_size=(vector_size//16,).
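The host-side recipe from the two bullets above can be sketched without pyopencl (numpy stands in for the device buffers; the variable names are illustrative):

```python
import numpy as np

vector_size = 1_000_000
a_np = np.random.rand(vector_size).astype(np.float32)  # stay in float32,
b_np = np.random.rand(vector_size).astype(np.float32)  # no half conversion

# Scalar kernel launch: one work-item per float.
scalar_global_work_size = (vector_size,)

# float16-vector kernel launch: one work-item per 16 floats. This is the
# ONLY host-side change -- buffers and dtypes stay exactly the same.
vector_global_work_size = (vector_size // 16,)
assert scalar_global_work_size[0] == 16 * vector_global_work_size[0]

# Either launch computes the same element-wise sum:
res_np = a_np + b_np
assert res_np.dtype == np.float32
```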
Ohhhhhhhhhhhhhhhhhhhhhhhhhhh!
When I read "The OpenCL float16 type is a vector of 16 32-bit floating point values", all the little lights in my head flashed at the same time!!! (quite a nice moment I would say after days of being puzzled).
I don't know why, but I had understood that the VideoCore was better at managing 16-bit floats than 32-bit floats, and since a float16/half type exists, it seemed to make sense... but no, this is a parallelization issue, not a size issue :D
So, I changed the pyopencl code back to float32, divided the global_work_size by 16, and now it works perfectly! And in 31 ms!
Now this fully makes sense to me:
Preferred / native vector sizes
char 16 / 16
short 16 / 16
int 16 / 16
long 0 / 0
half 0 / 0 (n/a)
float 16 / 16
double 0 / 0 (n/a)
sudo python3 demo_float16.py
Generate 1000000 random numbers of type <class 'numpy.float32'>
Buffer types: float32 float32 float32
Create OpenCL context and queue
Choose platform:
[0] <pyopencl.Platform 'OpenCL for the Raspberry Pi VideoCore IV GPU' at 0x2693b44>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x739db6f8>
Choice [0]:
Set the environment variable PYOPENCL_CTX='' to avoid being asked again.
Create buffers
Reading kernel file
Compiling kernel
Executing computation
Transferring result to host
Computing on the host using numpy
local type: float32
Comparing results
Difference :[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00
-1.1920929e-07 0.0000000e+00]
[0.07957439 0.38969344 0.61542416 0.23726429 0.80809623]
[0.21152946 0.32104772 0.77827895 0.54055446 0.9818086 ]
[0.29110384 0.71074116 1.3937031 0.77781874 1.7899048 ]
[0.29110384 0.71074116 1.3937031 0.77781874 1.7899048 ]
Checking the norm between both: 5.2845014579361305e-05
Checking results are mostly the same: True
comparing execution times
openCL: 31.172292 ms
openCL copy from device to host: 23.143335 ms
numpy: 15.477813 ms
Using pocl:
sudo python3 demo_float16.py
Generate 1000000 random numbers of type <class 'numpy.float32'>
Buffer types: float32 float32 float32
Create OpenCL context and queue
Choose platform:
[0] <pyopencl.Platform 'OpenCL for the Raspberry Pi VideoCore IV GPU' at 0x172fb3c>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x73a036f8>
Choice [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
Create buffers
Reading kernel file
Compiling kernel
Executing computation
Transferring result to host
Computing on the host using numpy
local type: float32
Comparing results
Difference :[0. 0. 0. ... 0. 0. 0.]
[0.11475492 0.3573052 0.63840175 0.5940225 0.46965545]
[0.40806714 0.43409592 0.953978 0.62630785 0.94360524]
[0.5228221 0.79140115 1.5923798 1.2203304 1.4132607 ]
[0.5228221 0.79140115 1.5923798 1.2203304 1.4132607 ]
Checking the norm between both: 0.0
Checking results are mostly the same: True
comparing execution times
openCL: 24.763333 ms
openCL copy from device to host: 4.438073 ms
numpy: 15.358959 ms
Thanks a lot.... and I'm really sorry to have bothered you so much. That's really great to be able to learn OpenCL on the Raspberry Pi! Fascinating!