Comments (9)

doe300 commented on September 23, 2024

You are using the GPU very inefficiently, so I assume the kernel execution overhead weighs heavily in the total execution time.

E.g. you should get a 10- to 16-times performance increase if you process 16 float values at once by using a float16 vector.
And even more if you process more than a single vector per work-item.
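The arithmetic itself is unchanged by vectorization; here is a minimal numpy sketch (with a hypothetical buffer size) of why processing 16 lanes per work-item produces the same result as the scalar kernel:

```python
import numpy as np

n = 1_000_000  # hypothetical buffer size, divisible by 16
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

# scalar kernel: one float per work-item (n work-items)
scalar = a + b

# vectorized kernel: one float16 vector (16 floats) per work-item (n // 16 work-items)
vectorized = (a.reshape(-1, 16) + b.reshape(-1, 16)).reshape(-1)

assert np.array_equal(scalar, vectorized)
```

The speedup comes purely from launching 16x fewer work-items, each doing 16x the work.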

from vc4cl.

shazz commented on September 23, 2024

Thanks,
I just saw the 2nd part of this issue where you very clearly explained the results. Sorry for the double issue.

For some reason, replacing float with float16 in the kernel hangs the program when using VC4CL (not with pocl), but I'll check.

Thanks !


shazz commented on September 23, 2024

It works with half:

#pragma OPENCL EXTENSION cl_khr_fp16: enable
__kernel void sum(__global half* a_g, __global const half* b_g, __global half* res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}

But with float16 it crashes the RPi.
Exact same performance: 288 ms.


doe300 commented on September 23, 2024

What exactly do you mean by "crashes the rpi"?

half does not provide better performance than float (except maybe in memory access), since the calculations are all done in float anyway.
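numpy behaves the same way as the hardware here: its half-precision arithmetic is carried out in single precision and the result is rounded back to half. A small sketch of that equivalence:

```python
import numpy as np

a = np.float16(0.1)
b = np.float16(0.2)

# half + half is computed in float32 and rounded back to half,
# so it matches an explicit float32 add followed by a cast
direct = a + b
via_float = np.float16(np.float32(a) + np.float32(b))

assert direct == via_float
assert direct.dtype == np.float16
```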


shazz commented on September 23, 2024

The program hangs when using "float16" in the kernel (Ctrl-C doesn't stop it), and I need to reboot the RPi since SSH no longer responds.
Code attached below.

Generate 1000000 random numbers of type <class 'numpy.float16'>
Buffer types: float16 float16 float16
Create OpenCL context and queue
Choose platform:
[0] <pyopencl.Platform 'OpenCL for the Raspberry Pi VideoCore IV GPU' at 0x1ebdb44>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x73e556f8>
Choice [0]:0
Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
Create buffers
Reading kernel file
Compiling kernel
Executing computation

Then nothing

When using pocl, it doesn't hang but crashes "normally", and the result is not consistent:

Generate 1000000 random numbers of type <class 'numpy.float16'>
Buffer types: float16 float16 float16
Create OpenCL context and queue
Choose platform:
[0] <pyopencl.Platform 'OpenCL for the Raspberry Pi VideoCore IV GPU' at 0x24d5b44>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x73e8d6f8>
Choice [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
Create buffers
Reading kernel file
Compiling kernel
Executing computation
Transferring result to host
Computing on the host using numpy
local type: float16
Comparing results
--------------------------------------------------------------------------------------------------
demo_float16.py:58: RuntimeWarning: invalid value encountered in subtract
  print("Difference :{}".format(res_np - res_local))
--------------------------------------------------------------------------------------------------
Difference :[ 1.2704e+04 -2.4756e-01 -9.5654e-01 ... -4.3994e-01 -8.8965e-01
 -1.1426e-01]
[0.4468 0.567  0.7197 0.6445 0.995 ]
[0.6294 0.2476 0.1945 0.6084 0.5654]
[ 1.270e+04  5.669e-01 -4.245e-02  6.929e-01 -3.712e+03]
[1.076  0.8145 0.914  1.253  1.561 ]
--------------------------------------------------------------------------------------------------
demo_float16.py:64: RuntimeWarning: invalid value encountered in subtract
  print("Checking the norm between both: {}".format(np.linalg.norm(res_np - res_local)))
--------------------------------------------------------------------------------------------------
Checking the norm between both: nan
--------------------------------------------------------------------------------------------------
/usr/local/lib/python3.7/dist-packages/numpy/core/numeric.py:2297: RuntimeWarning: invalid value encountered in multiply
  x = x * ones_like(cond)
--------------------------------------------------------------------------------------------------
Checking results are mostly the same:  False
comparing execution times
openCL: 292.114323 ms
openCL copy from device to host: 2.665573 ms
numpy: 66.555573 ms
--------------------------------------------------------------------------------------------------
munmap_chunk(): invalid pointer
[E] Thu Jul 16 22:19:46 2020: Received signal: SIGABRT
[E] Thu Jul 16 22:19:46 2020:  (1) /usr/local/lib/libVC4CC.so.1.2 : +0x7c5ab4 [0x6d627ab4]
[E] Thu Jul 16 22:19:46 2020:  (2) /lib/arm-linux-gnueabihf/libc.so.6 : __default_rt_sa_restorer+0 [0x76d19130]
--------------------------------------------------------------------------------------------------

pyOpenCL.zip


doe300 commented on September 23, 2024

You have a buffer of 1000000 float numbers and access it with 1000000 work-items, each reading 16 float numbers. So you read far beyond the allocated buffer.
The VideoCore IV (or at least the component reading the memory) stalls if inaccessible memory is read, which in this case is to be expected.

You will need to decrease the global_work_size passed to enqueue_nd_range_kernel by dividing it by 16; then it should work consistently on both VC4CL and pocl.
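The byte arithmetic behind that advice can be checked directly (sizes taken from the log above; sizeof(float16) is an assumption based on 16 x 4-byte floats):

```python
import numpy as np

n_floats = 1_000_000
buffer_bytes = n_floats * np.dtype(np.float32).itemsize  # 4 MB allocated
vec_bytes = 16 * np.dtype(np.float32).itemsize           # sizeof(float16) = 64 bytes

# n_floats work-items, each reading one float16 vector, touch 16x the buffer
assert n_floats * vec_bytes == 16 * buffer_bytes

# dividing the global work size by 16 covers the buffer exactly
assert (n_floats // 16) * vec_bytes == buffer_bytes
```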


shazz commented on September 23, 2024

Just for my understanding, where in the code do you see that each work-item reads 16 floats?

When I look at the pyopencl "hello world" example, it doesn't divide the number of elements by 16:
https://documen.tician.de/pyopencl/index.html

So I modified to:

ev = cl.enqueue_nd_range_kernel(queue=queue, kernel=sum_knl, global_work_size=(vector_size//16,), local_work_size=local_work_size)

and this time it doesn't hang, and it is REALLY faster (22 ms), but the result is not as expected:

demo_float16.py:58: RuntimeWarning: invalid value encountered in subtract
  print("Difference :{}".format(res_np - res_local))
Difference :[ 1.0144e+02 -6.5234e-01  6.8359e-03 ... -3.1787e-01 -1.1328e-01 -1.5137e-02]
A: [0.006306 0.7876   0.535    0.8457   0.5815  ]
B: [0.657   0.6787  0.00826 0.1714  0.3862 ]
C(openCL): [102.1      0.8145   0.5503   0.8457   0.437 ]
C(cpu) [0.6636 1.467  0.5435 1.018  0.968 ]

Do you think pyopencl doesn't decode the float16 numbers correctly?


doe300 commented on September 23, 2024

Just for my understanding, where in the code do you see that each work-item reads 16 floats?

Your OpenCL kernel code:

#pragma OPENCL EXTENSION cl_khr_fp16: enable
__kernel void sum(__global float16* a_g, __global const float16* b_g, __global float16* res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid]; // <-- here
}

The code basically does this:
For the Nth work-item (where you run up to 1000000), load the Nth float16 vector (vector of 16 floats) from input buffers a_g and b_g and store at the Nth float16 vector of the output buffer res_g.
The Nth float16 vector of a base address (e.g. a_g[gid]) consists of the 16 float values starting at byte address base + N * sizeof(float16) (this is the normal address calculation for the subscript operator []).
Thus, if you have a buffer of X float entries and run with more than floor(X/16) work-items, you will read outside of the allocated buffer.
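That address calculation can be written out as a sketch (the helper function is hypothetical, just spelling out the [] formula above):

```python
sizeof_float = 4
sizeof_float16 = 16 * sizeof_float  # 64 bytes

def element_byte_address(base, gid, elem_size):
    # address of the gid-th element for the [] operator: base + gid * sizeof(elem)
    return base + gid * elem_size

# a_g[gid] as float16 spans bytes [gid * 64, gid * 64 + 64)
assert element_byte_address(0, 3, sizeof_float16) == 192

# with X floats allocated, the last safe float16 index is floor(X/16) - 1:
# its read ends exactly at the end of the buffer
X = 1_000_000
last_safe = X // 16 - 1
assert element_byte_address(0, last_safe, sizeof_float16) + sizeof_float16 == X * sizeof_float
```

Any gid beyond last_safe reads past the allocation, which is what stalled the VideoCore IV.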

and this time it doesn't hang, it is REALLY faster (22ms) but the result in not as expected:

Is the result wrong for pocl, VC4CL or both?

Do you think pyopencl doesn't decode the float16 numbers correctly?

From how I read your code, pyopencl does not even know that the kernels calculate 16 float values at once, so I doubt that is the problem here.

One problem that I see is the ambiguity of float16.

In your Python files, you specify the numpy array as type float16, which is a 16-bit floating point type (also otherwise known as half). The OpenCL float16 type is a vector of 16 32-bit floating point values. So maybe the numpy conversion to float16 (half) screws something up.
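The size difference makes the ambiguity concrete (numpy's float16 on the host vs. OpenCL C's float16 in the kernel):

```python
import numpy as np

# numpy's float16 is IEEE 754 half precision: 2 bytes per value
assert np.dtype(np.float16).itemsize == 2

# OpenCL C's float16 is a vector of 16 single-precision floats: 64 bytes
opencl_float16_bytes = 16 * np.dtype(np.float32).itemsize
assert opencl_float16_bytes == 64

# so each element the kernel reads as float16 is 32x larger than
# each element a numpy float16 host buffer actually provides
assert opencl_float16_bytes // np.dtype(np.float16).itemsize == 32
```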

What you could try:

  • On the OpenCL kernel side, calculate with float16 (16-element vector of floats), like the kernel code at the top of this message
  • On the host side, don't do any conversions. The only difference between the scalar float version and the float16 version should be global_work_size=(vector_size,) vs. global_work_size=(vector_size/16,).


shazz commented on September 23, 2024

Ohhhhhhhhhhhhhhhhhhhhhhhhhhh!

When I read "The OpenCL float16 type is a vector of 16 32-bit floating point values", all the little lights in my head flashed at the same time!!! (quite a nice moment I would say after days of being puzzled).

I don't know why, but I had understood that the VideoCore was better at managing 16-bit floats than 32-bit floats, and since a float16/half type exists, it seemed to make sense... but no, this is a parallelization issue, not a size issue :D

So, I changed the pyopencl code back to float32, divided the global_work_size by 16 and now it works perfectly! And in 31ms!

Now this fully makes sense to me:

  Preferred / native vector sizes
    char                                                16 / 16
    short                                               16 / 16
    int                                                 16 / 16
    long                                                 0 / 0
    half                                                 0 / 0        (n/a)
    float                                               16 / 16
    double                                               0 / 0        (n/a)
sudo python3 demo_float16.py
Generate 1000000 random numbers of type <class 'numpy.float32'>
Buffer types: float32 float32 float32
Create OpenCL context and queue
Choose platform:
[0] <pyopencl.Platform 'OpenCL for the Raspberry Pi VideoCore IV GPU' at 0x2693b44>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x739db6f8>
Choice [0]:
Set the environment variable PYOPENCL_CTX='' to avoid being asked again.
Create buffers
Reading kernel file
Compiling kernel
Executing computation
Transferring result to host
Computing on the host using numpy
local type: float32
Comparing results
Difference :[ 0.0000000e+00  0.0000000e+00  0.0000000e+00 ...  0.0000000e+00
 -1.1920929e-07  0.0000000e+00]
[0.07957439 0.38969344 0.61542416 0.23726429 0.80809623]
[0.21152946 0.32104772 0.77827895 0.54055446 0.9818086 ]
[0.29110384 0.71074116 1.3937031  0.77781874 1.7899048 ]
[0.29110384 0.71074116 1.3937031  0.77781874 1.7899048 ]
Checking the norm between both: 5.2845014579361305e-05
Checking results are mostly the same:  True
comparing execution times
openCL: 31.172292 ms
openCL copy from device to host: 23.143335 ms
numpy: 15.477813 ms

Using pocl:

sudo python3 demo_float16.py
Generate 1000000 random numbers of type <class 'numpy.float32'>
Buffer types: float32 float32 float32
Create OpenCL context and queue
Choose platform:
[0] <pyopencl.Platform 'OpenCL for the Raspberry Pi VideoCore IV GPU' at 0x172fb3c>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x73a036f8>
Choice [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
Create buffers
Reading kernel file
Compiling kernel
Executing computation
Transferring result to host
Computing on the host using numpy
local type: float32
Comparing results
Difference :[0. 0. 0. ... 0. 0. 0.]
[0.11475492 0.3573052  0.63840175 0.5940225  0.46965545]
[0.40806714 0.43409592 0.953978   0.62630785 0.94360524]
[0.5228221  0.79140115 1.5923798  1.2203304  1.4132607 ]
[0.5228221  0.79140115 1.5923798  1.2203304  1.4132607 ]
Checking the norm between both: 0.0
Checking results are mostly the same:  True
comparing execution times
openCL: 24.763333 ms
openCL copy from device to host: 4.438073 ms
numpy: 15.358959 ms

Thanks a lot.... and I'm really sorry to have bothered you so much. That's really great to be able to learn OpenCL on the Raspberry Pi! Fascinating!

