The cl kernel I am trying to compile is mcxcl - the code download URL and test command

Optimizer error: There should be no more function calls - convert_float_rtef ()

doe300 commented on May 25, 2024

Optimizer error: There should be no more function calls - convert_float_rtef ()

from vc4cl.

Comments (8)

doe300 commented on May 25, 2024

Looks like I forgot to implement the OpenCL built-in function convert_float_rte in the standard library.

I'm offline for the next couple of days, so the fix might take some time.


fangq commented on May 25, 2024

No problem. I look forward to testing this new function once it is added. Thank you again for developing this wonderful library!


doe300 commented on May 25, 2024

doe300/VC4CLStdLib@0cf109f adds these functions, can you re-check?


fangq commented on May 25, 2024

Thanks @doe300. I recompiled vc4clstdlib and vc4c, reinstalled both packages, and reran my benchmark (./run_benchmark1.sh -n 1e5 -J "-DUSE_LL5_RAND"). The previously observed optimizer error no longer shows up.

Unfortunately, this is still not enough to get my kernel to run on the RPI. The program stalled after printing the line

Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL -DUSE_LL5_RAND

Here is the source code that prints this message:

https://github.com/fangq/mcxcl/blob/master/src/mcx_host.cpp#L457-L458

It looks to me as if the clBuildProgram() call kept running and never returned; otherwise, it would either print an error message or print "build program complete" on line 477.

From the output of the top command, I no longer see clang, and the mcxcl thread is taking 100% of the CPU. It has been running for over 50 minutes without printing any new message. I am going to kill it.

Can you suggest how to debug this and find out what is keeping clBuildProgram from returning?

thanks again.


fangq commented on May 25, 2024

I left the code running overnight. This morning, I saw the error message below on the screen; it must have been terminated many hours into the compilation:

root@raspberrypi:/home/pi/temp/mcxcl/example/benchmark# ./run_benchmark1.sh -n 1e5 -J "-DUSE_LL5_RAND"
==============================================================================
=                       Monte Carlo eXtreme (MCX) -- OpenCL                  =
...
- code name: [Vanilla MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] Logistic-Lattice [Seed Length] 5
initializing streams ...	init complete : 1 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL  -DUSE_LL5_RAND
completed clBuildProgram
completed clGetProgramBuildInfo
Kernel build log:
[W] Fri Mar  2 00:30:27 2018: Warnings in precompilation:
[W] Fri Mar  2 00:30:27 2018: <stdin>:1039:1: warning: null character ignored
<U+0000>
^
1 warning generated.

[W] Fri Mar  2 01:22:50 2018: Register conflict resolver has exceeded its maximum rounds, there might still be errors!
[E] Fri Mar  2 01:22:50 2018: Error assigning local to register: %energylaunched.1606
[E] Fri Mar  2 01:22:50 2018:  (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xac [0xb66ca484]
[E] Fri Mar  2 01:22:50 2018:  (2) /usr/local/lib/libVC4CC.so.1.2 : vc4c::qpu_asm::GraphColoring::toRegisterMap() const+0x150 [0xb6778e18]
[E] Fri Mar  2 01:22:50 2018:  (3) /usr/local/lib/libVC4CC.so.1.2 : vc4c::qpu_asm::CodeGenerator::generateInstructions(vc4c::Method&)+0x464 [0xb6765f4c]
[E] Fri Mar  2 01:22:50 2018:  (4) /usr/local/lib/libVC4CC.so.1.2 : +0x2edee8 [0xb66cbee8]
[E] Fri Mar  2 01:22:50 2018:  (5) /usr/local/lib/libVC4CC.so.1.2 : +0x2ee0c8 [0xb66cc0c8]
[E] Fri Mar  2 01:22:50 2018:  (6) /usr/local/lib/libVC4CC.so.1.2 : +0x2ef7f8 [0xb66cd7f8]
[E] Fri Mar  2 01:22:50 2018:  (7) /usr/local/lib/libVC4CC.so.1.2 : std::function<void ()>::operator()() const+0x40 [0xb66d01b4]
[E] Fri Mar  2 01:22:50 2018:  (8) /usr/local/lib/libVC4CC.so.1.2 : threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const+0x58 [0xb66cf3d8]
[E] Fri Mar  2 01:22:50 2018:  (9) /usr/local/lib/libVC4CC.so.1.2 : void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)+0x38 [0xb66d717c]
[E] Fri Mar  2 01:22:50 2018:  (10) /usr/local/lib/libVC4CC.so.1.2 : std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()()+0x1c [0xb66d70e8]
[E] Fri Mar  2 01:22:50 2018:  (11) /usr/local/lib/libVC4CC.so.1.2 : std::thread::_State_impl<std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()> >::_M_run()+0x20 [0xb66d7054]
[E] Fri Mar  2 01:22:50 2018:  (12) /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 : +0x9c9dc [0xb6e9a9dc]
[E] Fri Mar  2 01:22:51 2018: Background worker threw error: Label/Register Mapping: There are erroneous register-associations!
[E] Fri Mar  2 01:22:53 2018: Compiler threw exception: Label/Register Mapping: There are erroneous register-associations!


MCX ERROR(2):Error: Failed to build program executable! in unit mcx_host.cpp:476
terminate called without an active exception
Aborted


doe300 commented on May 25, 2024

My assumption:
The input you are trying to compile is very complex (e.g. has long kernels). Register allocation is the slowest part of VC4C, especially if it cannot find a suitable register mapping for all locals (as in your case). In that case, it applies some fixes and retries a few times, which of course prolongs the execution time even more.
Depending on the structure of the failing kernel, VC4C may not be able to find a valid register mapping at all, since it doesn't support register spilling yet.
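
The `GraphColoring::toRegisterMap` frame in the backtrace suggests the allocator is graph-colouring based. The following is a minimal Python sketch of that general technique, not VC4C's actual algorithm: locals that are live at the same time "interfere" and must get different registers, and without spilling the allocation simply fails once a local has no free register left. The names `color_registers` and `clique` are hypothetical helpers introduced only for this illustration, and the greedy ordering used here is an assumption (a greedy pass can also fail on graphs that a smarter ordering would colour).

```python
from typing import Dict, Optional, Set


def color_registers(interference: Dict[str, Set[str]],
                    num_registers: int) -> Optional[Dict[str, int]]:
    """Greedy colouring: give each local the lowest register that no
    interfering (simultaneously live) local already uses.

    Returns None when some local cannot be assigned; an allocator with
    spilling would move that local to memory instead of giving up.
    """
    assignment: Dict[str, int] = {}
    # Handle the most constrained locals first (ties broken by name).
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    for local in order:
        taken = {assignment[n] for n in interference[local] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if not free:
            return None  # no valid mapping without spilling
        assignment[local] = free[0]
    return assignment


def clique(names) -> Dict[str, Set[str]]:
    """Interference graph in which every local is live together with all others."""
    names = list(names)
    return {a: {b for b in names if b != a} for a in names}


live_together = clique(["a", "b", "c"])
print(color_registers(live_together, 3))  # a mapping exists
print(color_registers(live_together, 2))  # None: one local would need spilling
```

A kernel that keeps more values live at once than there are usable registers corresponds to the second call: no number of retries helps, only spilling or restructuring the kernel does.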


fangq commented on May 25, 2024

@doe300, sorry for the delay in getting back to you.

MCXCL needs at least 53 registers, based on our results on NVIDIA/AMD/Intel CPUs/GPUs. If vc4c does not support spilling, this can be an issue.

What is the maximum number of registers per kernel supported by a VideoCore IV GPU? If there is no such restriction, but there is a total size limit for a block, I can perhaps reduce the block size.

Also, is there a "wavefront/warp" size in VideoCore IV? On NVIDIA hardware, I typically set my block size to 32 or 64; on AMD GPUs, I typically use a block size of 64.


doe300 commented on May 25, 2024

A single QPU has 64 + 3/4 registers available. Since kernels are not threaded, every kernel has the same number of available registers. However, not all registers can be used together in the same instruction, and additional registers may be reserved for variables not specified in the kernel source code.
The VideoCore IV has 12 processors (QPUs), which can execute independently and limit the work-group size to 12 work-items, one per QPU. Each QPU has 16 SIMD elements, which execute the same code and can be leveraged by using vector types (e.g. changing int to int16).
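
To make those numbers concrete for choosing a block size, here is a rough host-side sizing sketch in Python. It assumes a kernel that has been rewritten so that each work-item processes 16 elements through a 16-wide vector type; the helper name `launch_sizes` and its parameters are hypothetical and not part of VC4CL or MCXCL.

```python
import math

NUM_QPUS = 12     # work-group size limit quoted above: one work-item per QPU
SIMD_WIDTH = 16   # SIMD elements per QPU, reachable through 16-wide vector types


def launch_sizes(num_elements: int,
                 vector_width: int = SIMD_WIDTH,
                 max_local: int = NUM_QPUS):
    """Return (global_size, local_size) for a kernel in which every
    work-item handles `vector_width` elements at once."""
    if num_elements <= 0:
        raise ValueError("num_elements must be positive")
    # Number of work-items needed when each one covers vector_width elements.
    work_items = math.ceil(num_elements / vector_width)
    local_size = min(max_local, work_items)
    # OpenCL 1.x requires the global size to be a multiple of the local size,
    # so round up; the padded work-items need a bounds check in the kernel.
    global_size = math.ceil(work_items / local_size) * local_size
    return global_size, local_size


print(launch_sizes(192))      # one full group: 12 work-items x 16 lanes
print(launch_sizes(100_000))
```

With these assumptions, a local size of 32 or 64 as used on NVIDIA or AMD hardware would exceed the 12 work-item limit, so the parallelism has to come from the vector width instead.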

I think the best/only way to solve this is to implement register-spilling, since this problem will occur on any more complex kernel.
