A long-running block I/O operation (we saw it with a long, slow TRIM operation, but expect anything longer than ~30s would do) can cause a Linux guest to attempt to stop and reset the AHCI controller.
The issue appears to be that the `SIGCONT` which is sent by `blockif_cancel`, and which is expected to be delivered via the mevent thread calling back into `blockif_sigcont_handler`, never arrives, and so `blockif_cancel` blocks forever. The interesting backtrace is:
```
thread #13: tid = 0x2d000a5, 0x00007fff8c0aac8a libsystem_kernel.dylib`__psynch_cvwait + 10, name = 'vcpu:1'
frame #0: 0x00007fff8c0aac8a libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007fff8c19496a libsystem_pthread.dylib`_pthread_cond_wait + 712
frame #2: 0x000000010ea97316 com.docker.hyperkit`pci_ahci_write + 128 at block_if.c:899 [opt]
frame #3: 0x000000010ea97296 com.docker.hyperkit`pci_ahci_write [inlined] ahci_port_stop(p=<unavailable>) + 180 at pci_ahci.c:445 [opt]
frame #4: 0x000000010ea971e2 com.docker.hyperkit`pci_ahci_write [inlined] pci_ahci_port_write(sc=<unavailable>, offset=<unavailable>, value=<unavailable>) + 492 at pci_ahci.c:2051 [opt]
frame #5: 0x000000010ea96ff6 com.docker.hyperkit`pci_ahci_write(vcpu=<unavailable>, pi=<unavailable>, baridx=<unavailable>, offset=<unavailable>, size=<unavailable>, value=<unavailable>) + 498 at pci_ahci.c:2156 [opt]
frame #6: 0x000000010ea9c974 com.docker.hyperkit`pci_emul_mem_handler(vcpu=1, dir=<unavailable>, addr=<unavailable>, size=<unavailable>, val=<unavailable>, arg1=0x00007fd1ca7000a0, arg2=<unavailable>) + 284 at pci_emul.c:394 [opt]
frame #7: 0x000000010ea9686d com.docker.hyperkit`mem_write(unused=<unavailable>, vcpu=<unavailable>, gpa=<unavailable>, wval=<unavailable>, size=<unavailable>, arg=<unavailable>) + 47 at mem.c:151 [opt]
frame #8: 0x000000010ea9292b com.docker.hyperkit`vmm_emulate_instruction [inlined] emulate_mov(vm=<unavailable>, vcpuid=<unavailable>, gpa=<unavailable>, vie=<unavailable>, memread=<unavailable>, memwrite=(com.docker.hyperkit`mem_write at mem.c:147), arg=<unavailable>) + 442 at vmm_instruction_emul.c:497 [opt]
frame #9: 0x000000010ea92771 com.docker.hyperkit`vmm_emulate_instruction(vm=<unavailable>, vcpuid=<unavailable>, gpa=<unavailable>, vie=<unavailable>, paging=<unavailable>, memread=<unavailable>, memwrite=<unavailable>, memarg=<unavailable>) + 2930 at vmm_instruction_emul.c:1421 [opt]
frame #10: 0x000000010ea91739 com.docker.hyperkit`xh_vm_emulate_instruction(vcpu=<unavailable>, gpa=<unavailable>, vie=<unavailable>, paging=<unavailable>, memread=<unavailable>, memwrite=<unavailable>, memarg=<unavailable>) + 87 at vmm_api.c:828 [opt]
frame #11: 0x000000010eaab6d0 com.docker.hyperkit`vmexit_inst_emul [inlined] emulate_mem(paddr=<unavailable>, vie=0x000000010edcd670, paging=0x000000010edcd658) + 31 at mem.c:202 [opt]
frame #12: 0x000000010eaab6b1 com.docker.hyperkit`vmexit_inst_emul(vme=<unavailable>, pvcpu=<unavailable>) + 180 at hyperkit.c:486 [opt]
frame #13: 0x000000010eaabf3b com.docker.hyperkit`vcpu_thread [inlined] vcpu_loop(vcpu=1) + 116 at hyperkit.c:630 [opt]
frame #14: 0x000000010eaabec7 com.docker.hyperkit`vcpu_thread(param=<unavailable>) + 1076 at hyperkit.c:276 [opt]
frame #15: 0x00007fff8c193aab libsystem_pthread.dylib`_pthread_body + 180
frame #16: 0x00007fff8c1939f7 libsystem_pthread.dylib`_pthread_start + 286
frame #17: 0x00007fff8c193221 libsystem_pthread.dylib`thread_start + 13
```
This can be reproduced by e.g. introducing a `sleep(35)` into `blockif_proc`'s `BOP_DELETE` handler and then running `fstrim` on a filesystem from inside the guest. ijc@d8af9d6 contains some code to do that along with some debugging around the `blockif_cancel` code paths.
This seems likely to be down to a difference in the `kevent`/`kqueue` semantics between FreeBSD (where this code originates, via bhyve) and OSX. FreeBSD kqueue(2) and OSX kevent(2) differ in their descriptions of `EVFILT_SIGNAL` in that the OSX version says explicitly:

> Only signals sent to the process, not to a particular thread, will trigger the filter.

while the FreeBSD one does not. The signal is sent with `pthread_kill(be->be_tid, SIGCONT);` and so would be expected to be subject to this caveat.
However there is something I do not understand about the original code on FreeBSD which makes me reluctant to just start coding a fix.
There are 3 threads involved:

- The VCPU I/O emulation thread (`ioemu`), which runs the device model, e.g. the pci_ahci.c code.
- The block I/O thread (`blkio`), which performs the actual I/O against the underlying backend devices.
- The `mevent` thread, which listens for various events using kqueue/kevent. This is actually the process' original main thread, which calls `mevent_dispatch` after initialisation.
There are 2 sets of queues involved:

- The three blockif request queues (`bc->bc_freeq`, `bc->bc_pendq` and `bc->bc_busyq`) which between them contain the `BLOCKIF_MAXREQ` elements (of type `struct blockif_elem`) in `bc->bc_reqs`. These are protected by `bc->bc_mtx` and `bc->bc_cond`.
- The (global) `blockif_bse_head`, which contains a chain of `struct blockif_sig_elem *`, protected through the use of `atomic_cmpset_ptr`. Each `blockif_sig_elem` contains a `bse_mtx` and a `bse_cond` used for completion.
In normal processing the `ioemu` thread processes an MMIO (e.g. for pci_ahci.c, from `ahci_handle_rw`, `atapi_read`, `ahci_handle_dsm_trim` and others) and calls a helper function which passes a `blockif_req` to `blockif_request`, which calls `blockif_enqueue` to enqueue a `blockif_elem` and then pokes `bc->bc_cond`.
This wakes the `blkio` thread (a single thread here, though it can be multiple in bhyve upstream), which was waiting on `bc->bc_cond`. It then calls `blockif_dequeue`, which takes the `blockif_elem` and sets `be->be_status` to BUSY and `be->be_tid` to the thread which is claiming the work (that is, `blkio`).
The `blkio` thread then processes the I/O via `blockif_proc`, which issues various blocking system calls. When the I/O completes, an ioemu-provided callback is called (for pci_ahci.c this would be `ata_ioreq_cb` or `atapi_ioreq_cb`; these update the emulation state etc., i.e. marking the command complete). This callback is passed `err`, which is either 0 (success) or an errno value (failure). Finally the `blkio` thread calls `blockif_complete`, which frees the `blockif_elem`.
This all seems reasonable enough.
However upon cancellation, which happens in `blockif_cancel` (in the case of pci_ahci.c this is called from `ahci_port_stop` on the `ioemu` thread), things are more complex.
If the `blockif_request` is not active, that is, the corresponding `blockif_elem` has not been claimed by the `blkio` thread (it is still on `bc->bc_pendq`), then `blockif_cancel` simply calls `blockif_complete`. Also simple enough.
However if the `blockif_request` is active, that is, the corresponding `blockif_elem` is on `bc->bc_busyq` and therefore has a non-zero `be_tid` and a `be_status` of BUSY, then `blockif_cancel` allocates a new `struct blockif_sig_elem bse` on the stack and adds it to the global `blockif_bse_head` (using `atomic_cmpset_ptr`). It then sends a `SIGCONT` to `be_tid` with `pthread_kill(be->be_tid, SIGCONT);` and blocks waiting for the embedded `bse_cond` to be signalled and the `blockif_sig_elem` marked complete.
So far so good, but at this point I lose track of what is going on, because the `SIGCONT` is delivered via `kqueue` to the `mevent` thread and the `blockif_sigcont_handler` callback, not to the `blkio` thread. The callback handler does nothing other than walk the global list, marking each `blockif_sig_elem` complete and kicking the corresponding `bse_cond` (which wakes the `ioemu` thread). It takes no action WRT the `blkio` thread.
The only way I can see this working on FreeBSD is that receiving the `SIGCONT` causes the system call the `blkio` thread is currently in to return with `EINTR` (or similar), while the actual signal is delivered to another thread via the `kevent`. This has a subtle dependency on the ordering of the events (the system call must return before the signal handler callback is called) and is not something made clear in any of the documentation I've been able to find.
I'm also not sure what happens if the `blkio` thread is merely on its way to making the system call at the point where the cancellation signal arrives. It seems like it would block once it actually made the call. That might be harmless (since things rely on these calls returning via their normal return path to signal the error), or it might result in things not being cancelled as expected.
This needs more thought and investigation; I shall ask on the bhyve list, hence setting down my understanding here.