Comments (15)
Hmm...I've been testing the waitIRQ version heavily the past week and I haven't seen this behavior yet. Do you just wait an extended period of time to make it happen? 100,000 interrupts is approximately 100 seconds? Or do you use a rate other than 1kHz? When you say stuck, do you just miss an interrupt or two or does mk lock up and need to be killed? The zynq version has been running for hours continuously without any symptoms of hard hanging except on shutdown. Occasionally, 1 in 10 or 15, on shutdown the rtapi will begin spamming timeout and no connection errors and require a manual intervention. That's the only error I've run into with wait irq.
What should I do to try and duplicate?
As for GPIO, I will try to post zturn today or tomorrow, and I can help get a custom config that will expose some timing gpio pins for you. I've been doing u-boot debugging to get NFS booting on the microzed, but I think I can find a little time to finish a basic zturn design.
from mksocfpga.
well with posix it happens within minutes, yes; I have it running now for 20hrs with an RT thread and it did not happen
stuck means: the waitirq function just sits in the read().
The shutdown hang is something else I still need to address - seems the device read is not interruptible.
'timeout' typically means rtapi_app went away or crashed - the command thread is separate, so it should keep running whatever is happening in HAL threads
duplicate: adapt the config= line in above config fragment and run it, then observe the hm2.0.irq-count pin. If it stops increasing, you see the same symptom. I saw no suspicious linuxcnc.log or dmesg entries.
from mksocfpga.
On 6/10/2016 7:24 AM, Michael Haberler wrote:
> well with posix it happens within minutes, yes; I have it running now for 20hrs
> with an RT thread and it did not happen
> stuck means: the waitirq function just sits in the read().
Where's the interrupt code? I'm either looking in the wrong file(s) or
the wrong branch(es).
Usually, a problem like this means an interrupt was incorrectly cleared
(without being properly seen by software), but I'm not sure if the issue
can happen that way since there's a single IRQ from the hm2 hardware.
The process I'm talking about is like so:
- Interrupt A occurs
- Software IRQ process is triggered and sees IRQ A happened
- Interrupt B occurs
- Software clears the IRQ A, which ALSO clears IRQ B!
- ...software waits forever for IRQ B
I'm not sure if something like this can happen in the IRQ handling for a
UIO device, but IIRC the ARM core only really has one actual hardware
interrupt (two if you count the FIRQ), so it might be possible even
though the hm2 only generates one interrupt.
Similar deadlocks are also possible with improper handling of interrupt
enable/mask bits.
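The sequence above can be replayed in a toy model. This is only a sketch of the race as described, assuming a single pending latch that software clears unconditionally; `irq_model` and its helpers are hypothetical names and are not taken from the hm2 or UIO code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one pending-interrupt latch (not the hm2/UIO code). */
struct irq_model {
    bool pending;      /* latched by "hardware" on every interrupt */
    unsigned raised;   /* interrupts the hardware generated */
    unsigned handled;  /* interrupts software actually serviced */
};

static void hw_raise(struct irq_model *m)
{
    m->pending = true;
    m->raised++;
}

/* Software notices the latch; true if there is something to service. */
static bool sw_observe(const struct irq_model *m)
{
    return m->pending;
}

/* Software services one interrupt and clears the latch unconditionally. */
static void sw_service_and_clear(struct irq_model *m)
{
    m->handled++;
    m->pending = false;
}

/* Replays the A/B sequence and returns the number of lost interrupts. */
static unsigned replay_lost_irq(void)
{
    struct irq_model m = { false, 0, 0 };

    hw_raise(&m);                  /* interrupt A occurs */
    if (sw_observe(&m)) {          /* software sees A */
        hw_raise(&m);              /* interrupt B arrives before the clear */
        sw_service_and_clear(&m);  /* clearing A also wipes B */
    }
    assert(!m.pending);            /* nothing is left to wake the waiter */
    return m.raised - m.handled;
}
```

In this model one interrupt goes missing, and a waiter that blocks until the latch is set again would sit there until the hardware raises another one.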
Charles Steinkuehler
from mksocfpga.
IRQ code is here: https://github.com/machinekit/machinekit/blob/master/src/hal/drivers/mesa-hostmot2/hm2_soc_ol.c#L615-L643
that above scenario could well be it!
from mksocfpga.
I was thinking about this while driving and I guess an easy way out would be to have a timeout condition on the blocking read. If the timeout occurs, then we check to see if the flags need to be reset, then make the choice to block until the next interrupt by restarting the read, or we finish the function normally and let other hal calls get their share of the CPU.
From what I've read about kernel latency to interrupts, the double interrupt described above is a real possibility, and the only fix for that is a preemptible (RT) kernel. If posix is a requirement though, the timeout on read will at least keep the bits flowing even if they're serviced irregularly.
The other option is simply slowing down the interrupt rate since the hardware cannot support the current rate with the latency included.
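A rough sketch of the timeout idea, assuming the device file supports poll() and delivers a 32-bit interrupt count on read(), as UIO devices generally do. `wait_irq_timeout` is a hypothetical helper, not the existing waitirq function in hm2_soc_ol.c.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Returns 1 if an interrupt count was read into *count,
 * 0 if timeout_ms elapsed first, -1 on error. */
static int wait_irq_timeout(int fd, int timeout_ms, uint32_t *count)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    assert(count != NULL);
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);  /* restart if a signal interrupted us */

    if (rc <= 0)
        return rc;                       /* 0: timed out, -1: poll() failed */
    if (!(pfd.revents & POLLIN))
        return -1;                       /* POLLERR/POLLHUP without data */

    /* Data is ready, so this read() does not block. */
    if (read(fd, count, sizeof(*count)) != (ssize_t)sizeof(*count))
        return -1;
    return 1;
}
```

On a 0 return the caller can check whether the interrupt flags need to be reset and then either call the helper again or return to let the other HAL functs run.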
from mksocfpga.
this might touch on the same issue
is there any point in experimenting with level- vs edge-triggered IRQs?
from mksocfpga.
semi-related: the shutdown hangs if employing a read() on a device file (or some other fd for that matter; I am experiencing the same with eventfd(2)) - the scenario is:
- halcmd shutdown causes the thread to be stopped through hal_exit_threads() ->...->rtapi_task_delete_hook() which does a pthread_join()
- the pthread_join() waits on read() to terminate
- all this is happening with rtapi_app holding the rtapi_mutex
- rtapi_app is deadlocked in the halcmd request handler
there are two ways out of this:
- send a signal to the RT thread if it does not exit right away, which should terminate the read()
- do not use read() but rather poll() and do the poll on 2 file descriptors - the one we are interested in, and an eventfd which is just used for shutdown signaling - the shutdown sequence would write() to the eventfd before doing the pthread_join, having the poll() return
I'll explore both in turn and see how I fare - I think the first option (the signal) is less intrusive on API use
I already verified that closing the device file does not cancel the read :-/
update: in fact a pthread_cancel() might do as read() is on the list of cancellation points
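For reference, the second option (poll() on two descriptors) could look roughly like this. It is a sketch only: `wait_irq_or_shutdown` and `request_shutdown` are hypothetical names, and nothing here is taken from the rtapi sources.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

enum wait_result { WAIT_ERROR = -1, WAIT_IRQ = 0, WAIT_SHUTDOWN = 1 };

/* Blocks until either the device fd or the shutdown eventfd is readable. */
static enum wait_result wait_irq_or_shutdown(int irq_fd, int shutdown_fd)
{
    struct pollfd fds[2] = {
        { .fd = irq_fd,      .events = POLLIN },
        { .fd = shutdown_fd, .events = POLLIN },
    };
    int rc;

    assert(irq_fd >= 0 && shutdown_fd >= 0);
    do {
        rc = poll(fds, 2, -1);           /* no timeout: wait for either fd */
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return WAIT_ERROR;
    if (fds[1].revents & POLLIN)
        return WAIT_SHUTDOWN;            /* shutdown takes priority */
    if (fds[0].revents & POLLIN)
        return WAIT_IRQ;                 /* caller may now read() irq_fd */
    return WAIT_ERROR;
}

/* Called by the shutdown sequence before pthread_join(). */
static int request_shutdown(int shutdown_fd)
{
    uint64_t one = 1;                    /* eventfd writes are 8 bytes wide */

    if (write(shutdown_fd, &one, sizeof one) != (ssize_t)sizeof one)
        return -1;
    return 0;
}
```

The shutdown descriptor would come from `eventfd(0, 0)`; once `request_shutdown()` has been called, every later wait returns `WAIT_SHUTDOWN` immediately, so the thread can leave its loop and the join completes.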
from mksocfpga.
well luckily it seems the pthread_cancel() does the job of terminating the read() (src/rtapi/rt-preempt.c):
@dkhughes - mind trying this patch and see if this gets your shutdown hang sorted?
@@ -197,14 +197,21 @@ void _rtapi_task_delete_hook(task_data *task, int task_id) {
     void *returncode;
     /* Signal thread termination and wait for the thread to exit. */
     if (!extra_task_data[task_id].deleted) {
         extra_task_data[task_id].deleted = 1;
+
+        err = pthread_cancel(extra_task_data[task_id].thread);
+        if (err)
+            rtapi_print_msg(RTAPI_MSG_ERR,
+                            "pthread_cancel() on RT thread '%s': %d %s\n",
+                            task->name, err, strerror(err));
         err = pthread_join(extra_task_data[task_id].thread, &returncode);
         if (err)
-            rtapi_print_msg
-                (RTAPI_MSG_ERR, "pthread_join() on realtime thread failed\n");
+            rtapi_print_msg(RTAPI_MSG_ERR,
+                            "pthread_join() on RT thread '%s': %d %s\n",
+                            task->name, err, strerror(err));
     }
     /* Free the thread stack. */
     free(extra_task_data[task_id].stackaddr);
     extra_task_data[task_id].stackaddr = NULL;
 }
the more I think of it - this patch is seriously needed: ANY thread (rt or posix) doing not just HAL but any form of blocking system calls will be subjected to this shutdown hang otherwise
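The behaviour the patch relies on can be checked outside rtapi. Below is a minimal standalone sketch, assuming default (deferred) cancellation and a pipe standing in for the device file; `blocked_reader` and `cancel_blocked_reader` are made-up names for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Thread body: blocks in read() on a descriptor that never gets data. */
static void *blocked_reader(void *arg)
{
    char buf;
    ssize_t n;

    assert(arg != NULL);
    n = read(*(int *)arg, &buf, 1);  /* read() is a cancellation point */
    (void)n;
    return NULL;                     /* only reached if read() returns */
}

/* Returns 1 if the thread was cancelled inside read(),
 * 0 if it returned normally, -1 on setup failure. */
static int cancel_blocked_reader(void)
{
    int p[2];
    pthread_t t;
    void *ret = NULL;
    int result = -1;

    if (pipe(p) != 0)
        return -1;
    if (pthread_create(&t, NULL, blocked_reader, &p[0]) == 0) {
        pthread_cancel(t);           /* acted on at the read() */
        if (pthread_join(t, &ret) == 0)
            result = (ret == PTHREAD_CANCELED);
    }
    close(p[0]);
    close(p[1]);
    return result;
}
```

With deferred cancellation it does not matter whether the cancel request arrives before or after the thread has entered read(); in both cases the join should report `PTHREAD_CANCELED`.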
from mksocfpga.
on the socfpga, the above patch reliably removes the hang on exiting 'halrun -I irqtest.hal'
on an amd64, this triggers an obscure pthread_cancel() bug causing a segv in the terminating thread
google for '_Unwind_ForcedUnwind crash' or 'pthread_cancel segv' - tons of hits, no clear resolution
from mksocfpga.
I think I'm on the trail to this one - very subtle
on 'unload' of a comp, the thread functs and pins of this comp are unlinked from the threads, then the comp is unloaded - the theory being that thereafter comp code and data cannot be referenced anymore
in the case of a comp doing a blocking call, e.g. read(), the thread is blocked within this read even after the functs and pins are unlinked and the comp unloaded
a later delthread (implicit in shutdown) cancels the system call originating in a - by now unloaded - comp (really a shared library), meaning the code and data segment of this comp are invalid, causing the crash on return from the system call
I guess the resolution is - extend the rather obscure 'halcmd unload all' to delete all threads before any comp is unloaded - in the legacy code, threads were exported by the motion and threads components only and an unload of those implicitly deleted the threads
to round out my monologue ;) yes, the above worked and a patch is coming which covers the 'unload all' and 'halcmd shutdown' cases
it does NOT cover the case where a comp using blocking system calls is loaded, a thread has been started calling this funct, and the user removes the comp with 'unload comp' - this still results in an rtapi crash
the HAL data structures just do not support expressing this kind of referential integrity relation easily
I do see an alternative, more proper fix through shared library reference counting: if a shared library (component) already was loaded with dlopen(), another dlopen() just increases the reference count on the shared library handle; dlclose() decrements the refcount and unmaps the shared library when the refcount drops to zero. I wonder if this is worth the trouble for now
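The refcounting behaviour can be illustrated with plain dlopen()/dlclose(). This is a sketch of the general mechanism only, not of how rtapi loads components; `usable_after_one_close` is a hypothetical helper.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Opens `lib` twice, closes it once, and reports whether `sym` still
 * resolves through the remaining handle:
 * 1 = yes, 0 = no, -1 = the library could not be loaded. */
static int usable_after_one_close(const char *lib, const char *sym)
{
    void *first, *second;
    int ok;

    assert(sym != NULL);
    first = dlopen(lib, RTLD_NOW);   /* first reference, e.g. the loader */
    if (first == NULL)
        return -1;
    second = dlopen(lib, RTLD_NOW);  /* second reference, e.g. a thread pinning the comp */
    if (second == NULL) {
        dlclose(first);
        return -1;
    }

    dlclose(first);                  /* "unload": one reference remains */
    ok = dlsym(second, sym) != NULL; /* code is still mapped and resolvable */
    dlclose(second);                 /* last reference dropped */
    return ok;
}
```

Applied to the crash above, a thread that calls into a comp would hold the second reference, so an 'unload' would only drop the first one and the code would stay mapped until the thread lets go. On a typical glibc system, `usable_after_one_close("libm.so.6", "cos")` would be expected to return 1.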
from mksocfpga.
@mhaberler Impressive detective work. So, to make sure I understand what is happening - the thread is stuck in the blocking function call, and the component is deleted out from underneath it?
from mksocfpga.
yes, exactly - a scenario which cannot happen with the current nonblocking thread functs
PR coming soonish
from mksocfpga.
@dkhughes - machinekit/machinekit#962 should make the exit hang go away
one can still crash rtapi by an explicit unloadrt hm2_soc_ol while running but maybe we'll find a fix for that downstream - for now 'a restriction' as it's easily fixed by preceding the unloadrt with 'delthread all' (that is pretty much what the patch does for 'unload all')
update: merged
from mksocfpga.
@mhaberler I've used the patches for a day or so now and the exit hangs have disappeared, great work! I have been looking for side effects to the change but my tests haven't shown anything yet.
from mksocfpga.
@dkhughes great to hear, thanks!
re the actual topic of this issue.. next stop is rebuilding uio_pdrv_genirq from source assuming the solution will be at that level
after all Linus might have had a case ;)
from mksocfpga.