libvfio-user's Introduction

libvfio-user

vfio-user is a framework that allows implementing PCI devices in userspace. Clients (such as qemu) talk the vfio-user protocol over a UNIX socket to a server. This library, libvfio-user, provides an API for implementing such servers.

vfio-user example block diagram

VFIO is a kernel facility for providing secure access to PCI devices in userspace (including pass-through to a VM). With vfio-user, instead of talking to the kernel, all interactions are done in userspace, without requiring any kernel component; the kernel VFIO implementation is not used at all for a vfio-user device.

Put another way, vfio-user is to VFIO as vhost-user is to vhost.

The vfio-user protocol is intentionally modelled after the VFIO ioctl() interface, and shares many of its definitions. However, there is not an exact equivalence: for example, IOMMU groups are not represented in vfio-user.

There are many different purposes you might put this library to, such as prototyping novel devices, testing frameworks, implementing alternatives to qemu's device emulation, adapting a device class to work over a network, etc.

The library abstracts most of the complexity around representing the device. Applications using libvfio-user provide a description of the device (e.g. region and IRQ information) and a set of callbacks which are invoked by libvfio-user when those regions are accessed.

Memory Mapping the Device

The device driver can allow parts of the virtual device to be memory mapped by the virtual machine (e.g. the PCI BARs). The business logic needs to implement the mmap callback and reply to the request passing the memory address whose backing pages are then used to satisfy the original mmap call; more details here.

Interrupts

Interrupts are implemented via eventfds passed from the client and registered with the library. libvfio-user consumers can then trigger interrupts by writing to the eventfd.
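As a rough illustration of the eventfd mechanism itself (not of the libvfio-user API), the sketch below shows what "writing to the eventfd" amounts to on Linux. The helper names trigger_irq and consume_irqs are hypothetical and exist only for this example; it assumes a plain (non-semaphore) eventfd, where each write adds to a 64-bit counter and a read returns and resets it.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of libvfio-user): signal one interrupt by
 * adding 1 to the eventfd counter. Returns 0 on success, -1 on failure.
 */
int trigger_irq(int irq_fd)
{
    uint64_t one = 1;
    ssize_t n = write(irq_fd, &one, sizeof(one));

    return n == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Hypothetical helper for the receiving side: read and reset the counter.
 * The value read is the number of triggers since the previous read.
 * Returns -1 if nothing could be read.
 */
int64_t consume_irqs(int irq_fd)
{
    uint64_t count = 0;
    ssize_t n = read(irq_fd, &count, sizeof(count));

    return n == (ssize_t)sizeof(count) ? (int64_t)count : -1;
}
```

Because the counter accumulates, several triggers issued before the receiver reads are coalesced into a single wakeup carrying the total.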

Building libvfio-user

Build requirements:

  • meson (v0.53.0 or above)
  • apt install libjson-c-dev libcmocka-dev or
  • yum install json-c-devel libcmocka-devel

The kernel headers are necessary because VFIO structs and defines are reused.

To build:

meson build
ninja -C build

Finally build your program and link with libvfio-user.so.

Supported features

With the client support found in cloud-hypervisor or the in-development qemu support, most guest VM use cases will work. See below for some details on how to try this out.

However, guests with an IOMMU (vIOMMU) will not currently work: the number of DMA regions is strictly limited, and there are also issues with some server implementations such as SPDK's virtual NVMe controller.

Currently, libvfio-user has explicit support for PCI devices only. In addition, only PCI endpoints are supported (no bridges etc.).

API

The API is currently documented via the libvfio-user header file, along with some additional documentation.

The library (and the protocol) are actively under development, and should not yet be considered a stable API or interface.

The API is not thread safe, but individual vfu_ctx_t handles can be used separately by each thread: that is, there is no global library state.

Mailing List & Chat

libvfio-user development is discussed in [email protected]. Subscribe here: https://lists.gnu.org/mailman/listinfo/libvfio-user-devel.

We are on Slack at libvfio-user.slack.com (invite link); or IRC at #qemu on OFTC.

Contributing

Contributions are welcome; please file an issue or open a PR. Anything substantial is worth discussing with us first.

Please make sure to mark any commits with Signed-off-by (git commit -s), which signals agreement with the Developer Certificate of Origin v1.1.

Running make pre-push will do the same checks as done in GitHub CI. After merging, a Coverity scan is also done.

See Testing for details on how the library is tested.

Examples

The samples directory contains various libvfio-user examples.

lspci

lspci implements an example of how to dump the PCI header of a libvfio-user device and examine it with lspci(8):

# lspci -vv -F <(build/samples/lspci)
00:00.0 Non-VGA unclassified device: Device 0000:0000
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Region 0: I/O ports at <unassigned> [disabled]
        Region 1: I/O ports at <unassigned> [disabled]
        Region 2: I/O ports at <unassigned> [disabled]
        Region 3: I/O ports at <unassigned> [disabled]
        Region 4: I/O ports at <unassigned> [disabled]
        Region 5: I/O ports at <unassigned> [disabled]
        Capabilities: [40] Power Management version 0
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

The above sample implements a very simple PCI device that supports the Power Management PCI capability. The sample can be trivially modified to change the PCI configuration space header and add more PCI capabilities.

Client/Server Implementation

Client/server implements a basic client/server model where basic tasks are performed.

The server implements a device that can be programmed to trigger interrupts (INTx) to the client. This is done by writing the desired time in seconds since Epoch to BAR0. The server then triggers an eventfd-based IRQ and then a message-based one (in order to demonstrate how it's done when passing of file descriptors isn't possible/desirable). The device also works as memory storage: BAR1 can be freely written to/read from by the host.

Since this is a completely made up device, there's no kernel driver (yet). Client implements a client that knows how to drive this particular device (that would normally be QEMU + guest VM + kernel driver).

The client exercises all commands in the vfio-user protocol, and then proceeds to perform live migration. The client spawns the destination server (this would be normally done by libvirt) and then migrates the device state, before switching entirely to the destination server. We re-use the source client instead of spawning a destination one as this is something libvirt/QEMU would normally do.

To spice things up, the client programs the source server to trigger an interrupt and then migrates to the destination server; the programmed interrupt is delivered by the destination server. Also, while the device is being live migrated, the client spawns a thread that constantly writes to BAR1 in a tight loop. This thread emulates the guest VM accessing the device while the main thread (what would normally be QEMU) is driving the migration.

Start the source server as follows (pick whatever you like for /tmp/vfio-user.sock):

rm -f /tmp/vfio-user.sock* ; build/samples/server -v /tmp/vfio-user.sock

And then the client:

build/samples/client /tmp/vfio-user.sock

After a couple of seconds the client will start live migration. The source server will exit and the destination server will start; watch the client terminal for destination server messages.

shadow_ioeventfd_server

shadow_ioeventfd_server.c and shadow_ioeventfd_speed_test.c are used to demonstrate the benefits of shadow ioeventfd; see ioregionfd for more information.

Other usage notes

qemu

Step-by-step instructions for using libvfio-user with qemu can be found here.

SPDK

SPDK uses libvfio-user to implement a virtual NVMe controller: see docs/spdk.md for more details.

libvirt

You can configure vfio-user devices in a libvirt domain configuration:

  1. Add xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0' to the domain element.

  2. Enable sharing of the guest's RAM:

<memoryBacking>
  <source type='file'/>
  <access mode='shared'/>
</memoryBacking>

  3. Pass the vfio-user device:

<qemu:commandline>
  <qemu:arg value='-device'/>
  <qemu:arg value='vfio-user-pci,socket=/var/run/vfio-user.sock,x-enable-migration=on'/>
</qemu:commandline>

Live migration

The master branch of libvfio-user implements live migration with a protocol based on vfio's v2 protocol. Currently, there is no support for this in any qemu client. For current use cases that support live migration, such as SPDK, you should refer to the migration-v1 branch.

History

This project was formerly known as "muser", short for "Mediated Userspace Device". It implemented a proof-of-concept VFIO mediated device in userspace. Normally, VFIO mdev devices require a kernel module; muser implemented a small kernel module that forwarded onto userspace. The old kernel-module-based implementation can be found in the kmod branch.

libvfio-user's People

Contributors

awarus, berrange, brvtalcake, changpe1, dreiss, florianfreudiger, franciozzy, gierens, jakelly10, jfgd, jimharris, jlevon, jraman567, limiao-intel, mnissler-rivos, mpiszczek, scop, stefanharh, swapnili, tmakatos, w-henderson

libvfio-user's Issues

use kvmalloc or kvmalloc_array in get_dma_map

Quoting @swapnili :

kmalloc() is faster than vmalloc() and also the accesses to the kmalloc'ed memory is faster as additional translation is not required. So mostly in the non-huge page cases we can leverage the kmalloc'ed memory.

Related to #37.

bad page map kernel warning when QEMU guest uses hugepages for guest memory

When using hugepages for the guest memory:

-object memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages,share=on -mem-prealloc -numa node,memdev=mem

the kernel (5.4.0-rc8) emits the following warning multiple times:

[2065774.775141] BUG: Bad page map in process gpio-pci-idio-1  pte:80000001bcea0027 pmd:22f4c4067
[2065774.775980] page:ffffddd2c6f3a800 refcount:0 mapcount:-1 mapping:dead000000000400 index:0xa0 compound_mapcount: -1
[2065774.776641] hugetlbfs_aops name:"qemu_back_mem.mem.NGh5A4"
[2065774.777329] flags: 0x2ffff8000000000()
[2065774.778116] raw: 02ffff8000000000 ffffddd2c6f38001 ffffddd0c6f3a808 dead000000000400
[2065774.778733] raw: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[2065774.779350] page dumped because: bad pte
[2065774.779965] addr:00007fb15f4f4000 vm_flags:1c0000fb anon_vma:0000000000000000 mapping:ffff8c0329d59570 index:a0
[2065774.780608] file:00000000-0000-0000-0000-000000000000 fault:0x0 mmap:libmuser_mmap [muser] readpage:0x0
[2065774.781307] CPU: 4 PID: 96138 Comm: gpio-pci-idio-1 Tainted: G           O      5.4.0-rc8+ #11
[2065774.782361] Hardware name: Nutanix AHV, BIOS 1.9.1-5.el6 04/01/2014
[2065774.783352] Call Trace:
[2065774.784611]  dump_stack+0x66/0x8b
[2065774.785866]  print_bad_pte+0x1d1/0x2a0
[2065774.786652]  unmap_page_range+0x7b6/0xab0
[2065774.787410]  ? __switch_to_asm+0x40/0x70
[2065774.788114]  unmap_vmas+0x81/0xf0
[2065774.788811]  unmap_region+0xae/0x120
[2065774.789501]  __do_munmap+0x2aa/0x4d0
[2065774.790190]  __vm_munmap+0x6f/0xc0
[2065774.790868]  __x64_sys_munmap+0x27/0x30
[2065774.791531]  do_syscall_64+0x52/0x180
[2065774.792184]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[2065774.792828] RIP: 0033:0x7fb19f5386e7
[2065774.793476] Code: c7 c0 ff ff ff ff eb 8d 48 8b 15 ac 47 2b 00 f7 d8 64 89 02 e9 5b ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2b 00 f7 d8 64 89 01 48
[2065774.794832] RSP: 002b:00007ffdc6f671d8 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
[2065774.795524] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb19f5386e7
[2065774.796232] RDX: 0000000040000000 RSI: 0000000040000000 RDI: 00007fb15f454000
[2065774.796934] RBP: 00007ffdc6f67200 R08: 0000000000000000 R09: 0000000000000000
[2065774.797628] R10: 0000000000000001 R11: 0000000000000206 R12: 000055e3b81fb830
[2065774.798324] R13: 00007ffdc6f67a40 R14: 0000000000000000 R15: 0000000000000000
[2065774.799088] Disabling lock debugging due to kernel taint

reported by [email protected]

Surprisingly the GPIO sample seems to work, although attempting to unload muser.ko results in much more serious errors.

The problem might be that libmuser calls mmap without the MAP_HUGE flags in this case.

Compilation error on Fedora 30 OS

Error information:
[ 8%] Building C object lib/CMakeFiles/muser.dir/libmuser_pci.c.o
/home/changpe1/libmuser/lib/libmuser_pci.c: In function ‘muser_pci_hdr_write_bar’:
/home/changpe1/libmuser/lib/libmuser_pci.c:61:24: error: taking address of packed member of ‘union ’ may result in an unaligned pointer value [-Werror=address-of-packed-member]
61 | bar = (uint32_t *) & lm_get_pci_config_space(lm_ctx)->hdr.bars[bar_index];
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[3]: *** [lib/CMakeFiles/muser.dir/build.make:89: lib/CMakeFiles/muser.dir/libmuser_pci.c.o] Error 1
make[2]: *** [CMakeFiles/Makefile2:91: lib/CMakeFiles/muser.dir/all] Error 2

Version: commit 380f37d
OS: Fedora 30
Gcc:Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)

create mailing list on Savannah

We had previously tried to create a mailing list on Savannah but had problems having the project accepted because of its dual license. Now that we've dropped muser.ko and MUSER's license is purely 3-Clause BSD, it should be easier to re-apply.

don't install the same fd multiple times

During DMA map we grab the fd and blindly install it in libmuser; however, we might have already installed an fd for that file. We should keep track of that and avoid installing a new fd for the same file multiple times. Also, the interface between muser.ko and the DMA library is not clear, as muser.ko opens the file and it's the DMA library's responsibility to close it. See the discussions in #49 for more information.
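One possible way to track this in userspace is sketched below. It is only an illustration under stated assumptions: the fd_registry type and fd_registry_install function are hypothetical names, not libmuser code, and "same file" is assumed to mean "same device and inode as reported by fstat". The sketch also makes the ownership rule explicit: the registry closes a duplicate fd as soon as it detects one.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_FILES 16

/* Hypothetical bookkeeping types; none of these names come from libmuser. */
struct fd_entry {
    dev_t dev;
    ino_t ino;
    int fd;
};

struct fd_registry {
    struct fd_entry entries[MAX_FILES];
    size_t count;
};

/*
 * Returns the fd to use for the file behind new_fd. If the same file (same
 * device and inode) is already registered, new_fd is closed and the fd that
 * was registered earlier is returned. Returns -1 if fstat fails or the
 * table is full.
 */
int fd_registry_install(struct fd_registry *reg, int new_fd)
{
    struct stat st;

    if (fstat(new_fd, &st) == -1) {
        return -1;
    }
    for (size_t i = 0; i < reg->count; i++) {
        struct fd_entry *e = &reg->entries[i];

        if (e->dev == st.st_dev && e->ino == st.st_ino) {
            if (e->fd != new_fd) {
                close(new_fd); /* duplicate: drop it, reuse the old fd */
            }
            return e->fd;
        }
    }
    if (reg->count == MAX_FILES) {
        return -1;
    }
    reg->entries[reg->count].dev = st.st_dev;
    reg->entries[reg->count].ino = st.st_ino;
    reg->entries[reg->count].fd = new_fd;
    reg->count++;
    return new_fd;
}
```

A fixed-size table keeps the sketch short; a real implementation would also need a way to remove entries when the last mapping of a file goes away.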

MUSER fails to build if 4.x kernel headers aren't installed

MUSER cannot build on 3.x kernels (I've only tried 3.10.0) as they lack sparse capabilities. We're able to build the kernel module by setting the KDIR environment variable accordingly; however, when I try to do the same for the library (by adding include_directories(${KDIR}/include/uapi/linux) to lib/CMakeLists.txt), it fails to build because of missing definitions and/or redefinitions.

don't call attach callback in lm_ctx_create() for LM_TRANS_SOCK

      // Attach to the muser control device.
-    lm_ctx->conn_fd = transports_ops[dev_info->trans].attach(lm_ctx);
     if ((dev_info->flags & LM_FLAG_ATTACH_NB) == 0) {
+        lm_ctx->conn_fd = transports_ops[dev_info->trans].attach(lm_ctx);

The socket conn_fd can only be valid in lm_ctx_try_attach() function

@@ -2376,12 +2376,19 @@ int
 lm_ctx_try_attach(lm_ctx_t *lm_ctx)
 {
     assert(lm_ctx != NULL);
+    int ret;

     if ((lm_ctx->flags & LM_FLAG_ATTACH_NB) == 0) {
         errno = EINVAL;
         return -1;
     }
-    return transports_ops[lm_ctx->trans].attach(lm_ctx);
+
+    ret = transports_ops[lm_ctx->trans].attach(lm_ctx);
+    if (ret == -1) {
+        return -1;
+    }
+    lm_ctx->conn_fd = ret;
+    return 0;
 }

Look into not keeping DMA pages pinned

We currently pin pages when they are registered for DMA and keep them in mudev's "dma_list". That is done to support libmuser restarts, as muser.ko always has the pages to provide back. For VMs, it means their memory cannot be swapped (similar limitation to pass-through devices).

This issue is to track the work of looking into how we can get away with not keeping the pages pinned and yet support libmuser restarts.

make libmuser pollable

To make libmuser pollable we most likely have to implement the poll callback in libmuser_fops.

gpio-pci-idio-16 system crash

[ 1418.366630] muser muser: muser_iommu_dma_map: DMA map vaddr=0x7f7cf7e00000 iova=0x0-0xa0000
[ 1418.366631] muser muser: find_file_for_vaddr: no file for vaddr=0x7f7cf7e00000
[ 1418.366632] muser muser: muser_iommu_dma_map: DMA map vaddr=0x7f7f0ca20000 iova=0xe0000-0x100000
[ 1418.366633] muser muser: find_file_for_vaddr: no file for vaddr=0x7f7f0ca20000
[ 1418.366633] muser muser: muser_iommu_dma_map: DMA map vaddr=0x7f7db7e00000 iova=0x100000000-0x200000000
[ 1418.366634] muser muser: find_file_for_vaddr: no file for vaddr=0x7f7db7e00000
[ 1418.366634] muser muser: muser_iommu_dma_map: DMA map vaddr=0x7f7f0ca00000 iova=0xfffc0000-0x100000000
[ 1418.366635] muser muser: find_file_for_vaddr: no file for vaddr=0x7f7f0ca00000
[ 1418.366635] muser muser: muser_iommu_dma_map: DMA map vaddr=0x7f7cf7f00000 iova=0x100000-0xc0000000
[ 1418.366636] muser muser: find_file_for_vaddr: no file for vaddr=0x7f7cf7f00000
[ 1418.366636] muser muser: muser_iommu_dma_map: DMA map vaddr=0x7f7f0c800000 iova=0xc0000-0xe0000
[ 1418.366637] muser muser: find_file_for_vaddr: no file for vaddr=0x7f7f0c800000
[ 1418.366639] muser muser: muser_ioctl: mdev=000000000d244be4, cmd=15211, arg=0x7FFC89670DE0
[ 1418.366763] muser muser: muser_ioctl: mdev=000000000d244be4, cmd=15212, arg=0x5563DFD02F90
[ 1418.366806] muser muser: muser_ioctl: mdev=000000000d244be4, cmd=15212, arg=0x5563DFD02FC0
[ 1418.366811] muser muser: muser_ioctl: mdev=000000000d244be4, cmd=15212, arg=0x5563DFD02FF0
[ 1418.366820] muser muser: muser_ioctl: mdev=000000000d244be4, cmd=15212, arg=0x5563DFD034E0
[ 1418.366826] muser muser: muser_ioctl: mdev=000000000d244be4, cmd=15212, arg=0x5563DFD03510
[ 1418.366831] muser muser: muser_ioctl: mdev=000000000d244be4, cmd=15212, arg=0x5563DFD03540
[ 1418.366836] muser muser: muser_ioctl: mdev=000000000d244be4, cmd=15212, arg=0x5563DFD03570
[ 1418.366840] muser muser: muser_ioctl: mdev=000000000d244be4, cmd=15213, arg=0x7FFC89670E80
[ 1418.366855] muser muser: muser_read: R 100@70000000000
[ 1418.366863] muser muser: libmuser_write: received data from libmuser
[ 1418.366866] BUG: unable to handle kernel paging request at 0000000001eed890
[ 1418.366867] PGD fe3f1f067 P4D fe3f1f067 PUD ffc7bd067 PMD fe3c9c067 PTE 8000000fe7ec7867
[ 1418.366870] Oops: 0001 [#1] SMP NOPTI
[ 1418.366871] CPU: 6 PID: 3638 Comm: gpio-pci-idio-1 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-193.1.2.el8_2.x86_64 #1
[ 1418.366872] Hardware name: Dell Inc. OptiPlex 7070/0YNVJG, BIOS 1.0.1 04/19/2019
[ 1418.366875] RIP: 0010:hex_dump_to_buffer+0x99/0x490
[ 1418.366876] Code: ed 0f 84 d7 01 00 00 4d 85 e4 0f 84 05 01 00 00 c7 04 24 01 00 00 00 49 83 fd 01 0f 86 e0 02 00 00 48 8b 44 24 08 48 8d 55 01 <0f> b6 30 89 f0 c0 e8 04 83 e0 0f 0f b6 80 80 4e 68 a2 88 45 00 49
[ 1418.366876] RSP: 0018:ffffb856877a7d68 EFLAGS: 00010206
[ 1418.366877] RAX: 0000000001eed890 RBX: 0000000000000010 RCX: 0000000000000010
[ 1418.366878] RDX: ffffb856877a7de6 RSI: 0000000000000010 RDI: 0000000000000010
[ 1418.366878] RBP: ffffb856877a7de5 R08: 0000000000000021 R09: 0000000000000083
[ 1418.366879] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000010
[ 1418.366879] R13: 0000000000000083 R14: 0000000001eed890 R15: 00000000000000f0
[ 1418.366880] FS:  00007f0cc1f39740(0000) GS:ffff93207c380000(0000) knlGS:0000000000000000
[ 1418.366881] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1418.366881] CR2: 0000000001eed890 CR3: 0000000fdf6e6003 CR4: 00000000003626e0
[ 1418.366882] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1418.366882] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1418.366883] Call Trace:
[ 1418.366886]  print_hex_dump+0x9a/0x100
[ 1418.366888]  ? _copy_from_user+0x30/0x60
[ 1418.366890]  ? libmuser_unl_ioctl+0xeb/0x610 [muser]
[ 1418.366892]  ? _cond_resched+0x15/0x30
[ 1418.366894]  ? __inode_security_revalidate+0x4c/0x60
[ 1418.366895]  libmuser_write+0xe5/0x1c8 [muser]
[ 1418.366897]  vfs_write+0xa5/0x1a0
[ 1418.366898]  ksys_write+0x4f/0xb0
[ 1418.366900]  do_syscall_64+0x5b/0x1a0
[ 1418.366901]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 1418.366902] RIP: 0033:0x7f0cc184db28
[ 1418.366903] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 35 4b 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 1418.366904] RSP: 002b:00007fff73082e68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1418.366904] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0cc184db28
[ 1418.366905] RDX: 0000000000000100 RSI: 0000000001eed890 RDI: 0000000000000003
[ 1418.366905] RBP: 00007fff73082ea0 R08: 00000000000000f8 R09: 0000000000000000
[ 1418.366906] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000400770
[ 1418.366906] R13: 00007fff730836c0 R14: 0000000000000000 R15: 0000000000000000

This happens when trying to start the VM with the parameters shown in the README.

kernel warning without nested virtualization

The following was observed when testing MUSER with kernel 5.4 (need to get exact version):

[ 3299.086419] muser muser: muser_create_dev: new device 00000000-0000-0000-0000-000000000000
[ 3299.086445] vfio_mdev 00000000-0000-0000-0000-000000000000: Adding to iommu group 61
[ 3299.086447] vfio_mdev 00000000-0000-0000-0000-000000000000: MDEV: group_id = 61
[ 3333.950405] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
[ 3333.977350] muser muser: libmuser_mmap_dma: mmap_dma: end 0x7F9A995FB000 - start 0x7F9A9955B000 (0xA0000), off = 0x0
[ 3333.977359] ------------[ cut here ]------------
[ 3333.977364] WARNING: CPU: 0 PID: 41540 at /home/changpe1/kernel/linux/mm/rmap.c:1199 page_add_file_rmap+0x1cd/0x210
[ 3333.977365] Modules linked in: muser(OE) vfio_mdev mdev vfio_pci vfio_virqfd vfio_iommu_type1 vfio xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iscsi_tcp libiscsi_tcp rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_uverbs ib_core intel_rapl_msr intel_rapl_common ppdev parport_pc parport sb_edac x86_pkg_temp_thermal intel_powerclamp fuse coretemp vmw_vsock_vmci_transport kvm_intel vsock kvm vmw_vmci irqbypass sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support intel_cstate intel_uncore intel_rapl_perf ipmi_si mei_me ipmi_devintf ipmi_msghandler pcspkr joydev mei mxm_wmi ioatdma lpc_ich i2c_i801 acpi_power_meter acpi_pad xfs libcrc32c mlx5_core mgag200 drm_kms_helper drm_vram_helper ttm drm
[ 3333.977401]  igb nvme crc32c_intel nvme_core mlxfw pci_hyperv_intf ptp dca pps_core i2c_algo_bit wmi
[ 3333.977406] CPU: 0 PID: 41540 Comm: reactor_0 Tainted: G           OE     5.4.0-rc7+ #1
[ 3333.977407] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
[ 3333.977408] RIP: 0010:page_add_file_rmap+0x1cd/0x210
[ 3333.977410] Code: 49 fd ff 48 63 54 24 04 e9 fc fe ff ff 48 c7 c6 10 0f 13 8f 48 89 df e8 a1 54 fe ff 0f 0b 48 89 87 80 00 00 00 e9 1c ff ff ff <0f> 0b e9 80 fe ff ff be 0f 00 00 00 48 89 c7 e8 3f 30 fd ff e9 0d
[ 3333.977411] RSP: 0018:ffffa9ce06fa7ca8 EFLAGS: 00010246
[ 3333.977412] RAX: 0017ffffc001000e RBX: fffff74218918000 RCX: 0000000000000000
[ 3333.977412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffff74218918000
[ 3333.977413] RBP: 00007f9a9955b000 R08: 000fffffffe00000 R09: 00000000000a0000
[ 3333.977414] R10: ffff964547e73d08 R11: 0000000000000000 R12: ffff9645123e0ad8
[ 3333.977414] R13: ffff96451b7e0000 R14: 8000000000000027 R15: ffff9645468ab6a8
[ 3333.977415] FS:  00007f9a8ec63700(0000) GS:ffff96455f200000(0000) knlGS:0000000000000000
[ 3333.977416] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3333.977416] CR2: 00007f9a9814d738 CR3: 000000085a874002 CR4: 00000000001626f0
[ 3333.977417] Call Trace:
[ 3333.977423]  vm_insert_page+0x19a/0x2a0
[ 3333.977428]  vm_insert_pages+0x40/0x150 [muser]
[ 3333.977430]  libmuser_mmap+0xb1/0x380 [muser]
[ 3333.977431]  ? kmem_cache_alloc+0x166/0x220
[ 3333.977433]  mmap_region+0x3fd/0x600
[ 3333.977435]  do_mmap+0x479/0x5f0
[ 3333.977439]  ? security_mmap_file+0x5e/0xc0
[ 3333.977442]  vm_mmap_pgoff+0xd2/0x120
[ 3333.977444]  ksys_mmap_pgoff+0x199/0x230
[ 3333.977459]  do_syscall_64+0x55/0x180
[ 3333.977463]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This was not with nested virtualization. QEMU was run as follows (need to get exact QEMU version as well):

build/x86_64-softmmu/qemu-system-x86_64 --enable-kvm -cpu host -smp 4 -m 4G -object memory-backend-file,id=mem0,size=4G,mem-path=/dev/hugepages,share=on -numa node,memdev=mem0 -drive file=/root/fedora.img,if=none,id=disk -device ide-hd,drive=disk,bootindex=0 -net user,hostfwd=tcp::10022-:22 -net nic -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000000000000 -vnc 0.0.0.0:1

add option to automatically unmap device memory in lm_ctx_destroy

If device emulation has allocated device memory and does not unmap it after lm_ctx_destroy has been called, then the muser device cannot be removed until that memory is unmapped. This behavior is desirable in order to support restartable device emulation. However, there are cases where we'd like device emulation to go away and having lm_ctx_destroy automatically unmap such memory would make coding easier. My proposal is to add a flag in lm_ctx_destroy to unmap such memory (should be false by default).

make returning errors consistent

Some functions return -1 and set errno, others return -errno. We should be consistent, e.g. make all functions return -errno.
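To make the two conventions concrete, here is a small sketch of adapters that translate between them. The function names to_neg_errno and to_errno_style are hypothetical and not part of the library; they only show how a result in one style maps onto the other.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical adapter: turn a "-1 and errno" result into the "-errno"
 * convention. Must be called right after the failing call, before anything
 * else can overwrite errno.
 */
int to_neg_errno(int ret)
{
    return ret == -1 ? -errno : ret;
}

/*
 * Hypothetical adapter for the opposite direction: turn a negative errno
 * value into "-1 and errno". Non-negative results pass through unchanged.
 */
int to_errno_style(int ret)
{
    if (ret < 0) {
        errno = -ret;
        return -1;
    }
    return ret;
}
```

Returning -errno everywhere has the advantage that the error travels with the return value and cannot be clobbered by an intervening call that also sets errno.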

libmuser unnecessarily requires IOMMU group as LM UUID

libmuser requires the LM context UUID to be a number so that it maps to an IOMMU group. This is a relic of trying to make libmuser work with the unmodified VFIO client in QEMU. Instead it should simply be a path to the UNIX domain socket.

Missing DMA unmap notifications after QEMU shuts down or fails may cause unmap errors

QEMU does not send any DMA unmap notification when it shuts down or fails, so errors may occur when we quit QEMU first and then quit the userspace program that uses libmuser.

Normally, QEMU notifies libmuser of DMA map/unmap events and libmuser maintains the mapping table. However, if the QEMU process has ended, it sends no DMA unmap notifications on shutdown, so the mapping table in libmuser still holds the old entries.
In dma.c, libmuser unmaps all mapped regions via dma_controller_destroy when the userspace program quits. If the QEMU process no longer exists, the unmap may fail because it targets a region that no longer exists. We see a large number of kernel bug messages in dmesg when this happens.

I think we need to do some cleanup if QEMU fails. Please correct me if anything here is wrong.

optimize VirtIO doorbell kicks

VirtIO doorbell kicks don’t carry any information; each one is just a memory access. Therefore we don’t have to fully emulate the write (which means VM->KVM->QEMU->VFIO->muser.ko->libmuser and back). Instead, we can set up an ioeventfd notifier so the kick is VM->KVM and back. In KVM, all that happens is an eventfd write which libmuser can catch via epoll or similar (or even spin on it if it wants). But currently we have no mechanism to implement that notifier, so emulating a VirtIO device on MUSER would be much slower than QEMU or vhost.
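The receiving half of that idea could look roughly like the sketch below: an eventfd is registered with epoll and the device emulation waits on it. This is a minimal illustration only. The names doorbell_watch and doorbell_wait are hypothetical, and the KVM side (binding the eventfd to the doorbell address) is not shown; any write to the eventfd stands in for a guest kick.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: register a doorbell eventfd with an epoll instance.
 * Returns 0 on success, -1 on failure.
 */
int doorbell_watch(int epfd, int doorbell_fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = doorbell_fd };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, doorbell_fd, &ev);
}

/*
 * Hypothetical helper: wait up to timeout_ms for a kick. Returns the number
 * of kicks coalesced in the eventfd counter, 0 on timeout, -1 on error.
 */
int64_t doorbell_wait(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    uint64_t kicks = 0;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    if (n <= 0) {
        return n; /* 0: timed out, -1: error */
    }
    /* Reading resets the counter so the next kick wakes us again. */
    if (read(ev.data.fd, &kicks, sizeof(kicks)) != (ssize_t)sizeof(kicks)) {
        return -1;
    }
    return (int64_t)kicks;
}
```

Since a kick carries no payload, the counter value only says how many kicks were coalesced; the emulation still has to inspect the virtqueue to find the work.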

Compilation error with Werror option

based on commit 4191486

00:00:27.938 [ 8%] Building C object samples/CMakeFiles/test_read.dir/test_read.c.o
00:00:27.938 [ 8%] Building C object samples/CMakeFiles/test_dma_map.dir/test_dma_map.c.o
00:00:27.938 [ 13%] Building C object lib/CMakeFiles/muser.dir/dma.c.o
00:00:27.938 [ 21%] Building C object lib/CMakeFiles/muser.dir/libmuser.c.o
00:00:27.938 [ 21%] Building C object lib/CMakeFiles/muser.dir/libmuser_pci.c.o
00:00:27.939 [ 26%] Building C object samples/CMakeFiles/test_mmap.dir/test_mmap.c.o
00:00:27.939 [ 30%] Building C object lib/CMakeFiles/muser.dir/cap.c.o
00:00:27.939 [ 34%] Building C object samples/CMakeFiles/client.dir/client.c.o
00:00:27.939 [ 39%] Building C object samples/CMakeFiles/client.dir//lib/libmuser.c.o
00:00:27.939 [ 47%] Building C object samples/CMakeFiles/client.dir//lib/libmuser_pci.c.o
00:00:27.939 [ 47%] Building C object samples/CMakeFiles/client.dir//lib/dma.c.o
00:00:27.939 [ 52%] Building C object samples/CMakeFiles/client.dir//lib/cap.c.o
00:00:27.939 [ 65%] Linking C executable test_read
00:00:27.939 [ 65%] Linking C executable test_dma_map
00:00:27.939 [ 65%] Linking C executable test_mmap
00:00:27.978 [ 65%] Built target test_dma_map
00:00:27.978 [ 65%] Built target test_mmap
00:00:27.978 [ 65%] Built target test_read
00:00:28.075 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/cap.c: In function ‘handle_pm_write’:
00:00:28.075 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/cap.c:208:6: error: this statement may fall through [-Werror=implicit-fallthrough=]
00:00:28.075 208 | if (count != sizeof(struct pc)) {
00:00:28.075 | ^
00:00:28.075 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/cap.c:212:2: note: here
00:00:28.075 212 | case offsetof(struct pmcap, pmcs):
00:00:28.075 | ^~~~
00:00:28.075 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/cap.c: In function ‘caps_create’:
00:00:28.075 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/cap.c:467:9: error: ‘caps’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
00:00:28.075 467 | free(caps);
00:00:28.075 | ^~~~~~~~~~
00:00:28.075 cc1: all warnings being treated as errors
00:00:28.075 make[5]: *** [lib/CMakeFiles/muser.dir/build.make:122: lib/CMakeFiles/muser.dir/cap.c.o] Error 1
00:00:28.075 make[5]: *** Waiting for unfinished jobs....
00:00:28.138 In file included from /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/dma.c:44:
00:00:28.138 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/dma.h: In function ‘dma_unmap_addr’:
00:00:28.138 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/dma.h:303:9: error: variable ‘r’ set but not used [-Werror=unused-but-set-variable]
00:00:28.138 303 | int r;
00:00:28.138 | ^
00:00:28.138 cc1: all warnings being treated as errors
00:00:28.139 make[5]: *** [lib/CMakeFiles/muser.dir/build.make:83: lib/CMakeFiles/muser.dir/dma.c.o] Error 1
00:00:28.539 In file included from /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c:60:
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/dma.h: In function ‘dma_unmap_addr’:
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/dma.h:303:9: error: variable ‘r’ set but not used [-Werror=unused-but-set-variable]
00:00:28.539 303 | int r;
00:00:28.539 | ^
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c: In function ‘recv_blocking’:
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c:134:14: error: variable ‘fret’ set but not used [-Werror=unused-but-set-variable]
00:00:28.539 134 | int ret, fret;
00:00:28.539 | ^~~~
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c: In function ‘dev_get_info’:
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c:1173:24: error: unused parameter ‘lm_ctx’ [-Werror=unused-parameter]
00:00:28.539 1173 | dev_get_info(lm_ctx_t *lm_ctx, struct vfio_device_info *dev_info)
00:00:28.539 | ~~~~~~~~~~^~~~~~
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c: In function ‘muser_mmap’:
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c:1463:22: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
00:00:28.539 1463 | ((int)&addr), err);
00:00:28.539 | ^~~~~~~~~~~~
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c: In function ‘handle_device_set_irqs’:
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c:808:18: error: ‘data’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
00:00:28.539 808 | i++, d32++) {
00:00:28.539 | ~~~^

00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c:1865:11: note: ‘data’ was declared here
00:00:28.539 1865 | void *data;
00:00:28.539 | ^~~~
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c: In function ‘process_request’:
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c:1720:12: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
00:00:28.539 1720 | return ret;
00:00:28.539 | ^~~
00:00:28.539 /var/jenkins/workspace/BlobFS-autotest/spdk/muser/lib/libmuser.c:1648:13: note: ‘ret’ was declared here
00:00:28.539 1648 | ssize_t ret;
00:00:28.539 | ^~~
00:00:28.539 cc1: all warnings being treated as errors
00:00:28.539 make[5]: *** [lib/CMakeFiles/muser.dir/build.make:96: lib/CMakeFiles/muser.dir/libmuser.c.o] Error 1
00:00:28.540 make[4]: *** [CMakeFiles/Makefile2:147: lib/CMakeFiles/muser.dir/all] Error 2
00:00:28.540 make[4]: *** Waiting for unfinished jobs....
00:00:28.574 [ 69%] Linking C executable client
00:00:28.633 [ 69%] Built target client
00:00:28.634 make[3]: *** [Makefile:150: all] Error 2
00:00:28.635 make[2]: *** [Makefile:56: install] Error 2
00:00:28.635 make[1]: *** [Makefile:48: all] Error 2
00:00:28.636 make: *** [/var/jenkins/workspace/BlobFS-autotest/spdk/mk/spdk.subdirs.mk:44: muserbuild] Error 2
00:00:28.636 make: *** Waiting for unfinished jobs....

support for live migration

Implementing live migration requires providing a migration region with a VFIO region info type capability. Currently we only support the sparse mmap capability, so we don't require the user to create a VFIO capability header and chain it with the sparse mmap capability; we simply receive sparse areas via lm_reg_info_t.mmap_areas at context creation time. This made sense in the past, as there was no other VFIO region capability that we had to support; however, this now needs to change. Rather than adding a new member to lm_reg_info_t for passing the VFIO region info type capability (along with mmap_areas), we should refactor the code to take a generic VFIO region capability, just like we do for the PCI capabilities (lm_cap_t). This way we'll instantly support any new VFIO region capability that might be added in the future.

So first, we need to make this change. Second, we need to extend the client/server samples to use this code. Third, we can continue working on the actual live migration.

Supported PCI specification version

Investigate whether libmuser needs to advertise the PCI specification version it supports, and whether we want to support multiple versions. Maybe start by updating the README with the currently supported PCI specification version.

need an export mapfile

If we're going to provide a stable interface, we need to build with a mapfile:

  • to specify the exact version in which each symbol is exported in the dynamic symbol table
  • to hide internal symbols that shouldn't be visible to consumers

VFIO does not notify MUSER for DMA unmap events

Because we no longer pin pages using vfio_pin_pages, VFIO doesn't send MUSER DMA unmap events. This causes problems for libmuser because we get overlapping DMA regions that we don't know how to handle properly, and it is also a resource leak. One way to deal with this is to hack VFIO to blindly send the DMA unmap event (even if the driver hasn't pinned any page from that DMA region): https://www.redhat.com/archives/vfio-users/2020-February/msg00016.html.

Another way would be not to use this hack and instead implicitly unmap the previous regions clobbered by the new region. The problem with this approach is that there can be leaks, and handling overlapping regions can be complicated. We need to check how VFIO handles overlapping regions.
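The implicit-unmap approach can be sketched roughly as follows. This is a minimal illustration, assuming a small fixed-size table of regions; dma_region_t, dma_table_t, MAX_REGIONS and dma_map_clobber are hypothetical names and not part of libmuser, and the sketch says nothing about how VFIO itself treats overlaps.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 8 /* hypothetical limit, for illustration only */

/* Hypothetical DMA region descriptor covering [iova, iova + size). */
typedef struct {
    uint64_t iova;
    uint64_t size;
    bool     in_use;
} dma_region_t;

typedef struct {
    dma_region_t regions[MAX_REGIONS];
} dma_table_t;

/* Two half-open ranges overlap iff each one starts before the other ends. */
static bool
ranges_overlap(uint64_t a, uint64_t a_size, uint64_t b, uint64_t b_size)
{
    return a < b + b_size && b < a + a_size;
}

/*
 * Adds a region, first dropping every existing region it overlaps.
 * Returns the number of regions implicitly unmapped, or -1 if the range is
 * empty, wraps around, or the table is full.
 */
static int
dma_map_clobber(dma_table_t *t, uint64_t iova, uint64_t size)
{
    int dropped = 0;
    dma_region_t *slot = NULL;

    if (size == 0 || iova + size < iova) {
        return -1;
    }
    for (size_t i = 0; i < MAX_REGIONS; i++) {
        dma_region_t *r = &t->regions[i];

        if (r->in_use && ranges_overlap(r->iova, r->size, iova, size)) {
            /* A real implementation would also release fds and mappings. */
            r->in_use = false;
            dropped++;
        }
        if (!r->in_use && slot == NULL) {
            slot = r; /* remember the first free slot for the new region */
        }
    }
    if (slot == NULL) {
        return -1;
    }
    slot->iova = iova;
    slot->size = size;
    slot->in_use = true;
    return dropped;
}
```

Dropping whole regions is the simplest policy; splitting a partially overlapped region would avoid losing the non-overlapping part, at the cost of the complexity the issue mentions.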

failed to open /dev/vfio/0: No such file or directory

using libvirt on centos7
selinux disabled
tried to manually modify permissions on /dev/vfio/0
of course compiled and loaded all needed modules.
and running the gpio sample

all I am getting is:
2020-06-18T10:35:37.783291Z qemu-kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000000000000: vfio 00000000-0000-0000-0000-000000000000: failed to open /dev/vfio/0: No such file or directory

DMA notifier fails if using hugepages and GB of guest memory

When using hugepages and 3 GB of guest memory, QEMU fails as follows:

qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000000000000: vfio 00000000-0000-0000-0000-000000000000: error getting device from group 0: Inappropriate ioctl for device
Verify all devices in group 0 are bound to vfio-<bus> or pci-stub and not already in use

The kernel emits the following warning:

[ 2484.634813] notifier failed for iova=100000 vaddr=7f50a3700000 size=bff00000

As [email protected] mentioned, this might be because of the kmalloc in muser_iommu_dma_map. If this is the case then it should be trivial to fix, e.g. by using vmalloc.

automatically remove DMA regions when QEMU exits

We don't receive DMA unmap operations for active DMA regions when QEMU exits (even with the implementation that uses vfio_pin_pages). This results in resources never being freed in libmuser. We can't assume that libmuser will exit after QEMU exits, since a single libmuser process can be virtualizing multiple devices, so we need to fix this. In muser_close we free all DMA regions, so we could tell libmuser to free all of them as well. Another approach would be to rely on the assumption that when QEMU exits, lm_ctx will soon be destroyed along with its DMA controller, so all the DMA regions will be freed.

@franciozzy @swapnili thoughts?

compile libmuser as a static library

The existing code compiles libmuser as a shared library by default, so we need to export LD_LIBRARY_PATH when linking against libmuser. This doesn't work for SPDK unit test binaries, so I would like to have a static library option, e.g. make static or make SHARED=no.

assorted refactoring suggestions

  • Each command that receives data in addition to the vfio_user_header does its own recv/read call; we should eliminate this and do a single recv(..., hdr->msg_size - sizeof(*hdr), ...) in process_request instead.
  • Function dev_get_sparse_mmap_cap assumes that it might talk to the kernel (hence the realloc); we should instead allocate a buffer large enough to hold struct vfio_region_info plus the capabilities.
  • In function dev_get_sparse_mmap_cap, filling in the type and sparse mmap capabilities should go into separate functions.
  • Allocating the response is intentionally rather complicated, for efficiency reasons: depending on the type of the command, we either allocate memory or point to a local function variable. We should make this simpler.
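The first suggestion, reading the whole payload in one place, could look roughly like the sketch below. It assumes that msg_size counts the header as well, which the expression above implies. The header layout is simplified, and recv_message and MAX_MSG_SIZE are hypothetical names rather than existing library code.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Simplified, hypothetical header; the real vfio_user_header has more fields. */
struct vfio_user_header {
    uint16_t msg_id;
    uint16_t cmd;
    uint32_t msg_size; /* total message size, header included (assumption) */
};

#define MAX_MSG_SIZE 65536u /* hypothetical cap that bounds the allocation */

/*
 * Reads one complete message from sock. On success returns 0, fills *hdr and
 * stores a malloc'ed payload in *payload (NULL if the message has no payload).
 * Returns a negative errno value on failure.
 */
static int
recv_message(int sock, struct vfio_user_header *hdr, void **payload)
{
    ssize_t n;
    size_t len;

    *payload = NULL;

    n = recv(sock, hdr, sizeof(*hdr), MSG_WAITALL);
    if (n < 0) {
        return -errno;
    }
    if ((size_t)n != sizeof(*hdr)) {
        return -EINVAL; /* short read or peer closed the socket */
    }
    if (hdr->msg_size < sizeof(*hdr) || hdr->msg_size > MAX_MSG_SIZE) {
        return -EINVAL; /* never trust a client-supplied size */
    }

    len = hdr->msg_size - sizeof(*hdr);
    if (len == 0) {
        return 0;
    }
    *payload = malloc(len);
    if (*payload == NULL) {
        return -ENOMEM;
    }
    /* Single receive for everything that follows the header. */
    n = recv(sock, *payload, len, MSG_WAITALL);
    if (n < 0 || (size_t)n != len) {
        int err = n < 0 ? errno : EINVAL;

        free(*payload);
        *payload = NULL;
        return -err;
    }
    return 0;
}
```

With the payload already in memory, each command handler only has to parse a buffer, which should also make the response allocation easier to simplify.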

struct file behind DMA region is never released

When we receive a DMA map operation we find the struct file behind the VMA and install an fd to libmuser (and with PR #46 we attempt to eliminate duplicates), however we never release that file, resulting in a leak.

rework PCI capabilities

Currently handling of standard PCI capabilities is left to the device implementation, however that shouldn't be the case as it's the same for all kinds of devices.

replace travis

Travis is essentially dead: we probably want to replace it with GitHub Actions or something similar.

kernel warning if muser app dies while guest is doing PCI read

Because of an unrelated bug, the process providing device emulation died (specifically it's muser SPDK, https://github.com/tmakatos/spdk):

[    5.793367] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10
qemu-system-x86_64: vfio_region_write(00000000-0000-0000-0000-000000000000:region0+0x14, 0x0,4) failed: Operation not permitted
qemu-system-x86_64: vfio_region_read(00000000-0000-0000-0000-000000000000:region0+0x1c, 4) failed: Timer expired
[   10.837434] nvme nvme0: Removing after probe failure status: -19
qemu-system-x86_64: vfio_region_read(00000000-0000-0000-0000-000000000000:region0+0x1c, 4) failed: Timer expired
qemu-system-x86_64: vfio_pci_write_config(00000000-0000-0000-0000-000000000000, 0x4a, 0x3, 0x2) failed: Operation not permitted
qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x3d, 0x1) failed: Timer expired
qemu-system-x86_64: vfio 00000000-0000-0000-0000-000000000000: Failed to set up TRIGGER eventfd signaling for interrupt INTX-0: VFIO_DEVICE_SET_IRQS failure: Operation not permitted
qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x4, 0x2) failed: Timer expired
qemu-system-x86_64: vfio_pci_write_config(00000000-0000-0000-0000-000000000000, 0x4, 0xfbc2, 0x2) failed: Operation not permitted
qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x5c, 0x2) failed: Timer expired
qemu-system-x86_64: vfio_pci_write_config(00000000-0000-0000-0000-000000000000, 0x5c, 0xffc0, 0x2) failed: Operation not permitted
qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x4, 0x2) failed: Timer expired
qemu-system-x86_64: vfio_pci_read_config(00000000-0000-0000-0000-000000000000, 0x0, 0x4) failed: Timer expired

The host kernel:

[1476060.673557] reactor_0[217443]: segfault at fc ip 0000558c28d78f64 sp 00007fcec1dfaa10 error 4 in nvmf_tgt[558c28cba000+1df000]
[1476060.675394] Code: 00 00 00 00 e8 58 8b 00 00 48 8b 45 f8 0f b6 00 83 e0 01 84 c0 0f 85 ae 01 00 00 48 8b 45 e8 48 8b 80 78 07 00 00 48 8b 40 20 <0f> b6 80 fc 00 00 00 83 e0 01 84 c0 0f 84 8d 01 00 00 48 8d 05 c3
[1476060.709588] muser muser: libmuser_release: moving command back in list
[1476065.701220] muser muser: muser_process_cmd: giving up, no response for cmd 3
[1476070.821272] muser muser: muser_process_cmd: giving up, no response for cmd 2
[1476075.941249] muser muser: muser_process_cmd: giving up, no response for cmd 2
[1476081.061339] muser muser: muser_process_cmd: giving up, no response for cmd 3
[1476081.063049] muser muser: muser_ioctl_setup_cmd: ignore DATA_NONE index=2 start=0 count=0
[1476086.181391] muser muser: muser_process_cmd: giving up, no response for cmd 1
[1476091.301391] muser muser: muser_process_cmd: giving up, no response for cmd 2
[1476096.421497] muser muser: muser_process_cmd: giving up, no response for cmd 1
[1476101.541559] muser muser: muser_process_cmd: giving up, no response for cmd 2
[1476106.661604] muser muser: muser_process_cmd: giving up, no response for cmd 3
[1476111.781594] muser muser: muser_process_cmd: giving up, no response for cmd 2
[1476116.901669] muser muser: muser_process_cmd: giving up, no response for cmd 3
[1476122.021798] muser muser: muser_process_cmd: giving up, no response for cmd 2
[1476127.653830] muser muser: muser_process_cmd: giving up, no response for cmd 2
[1476258.950949] muser muser: MDEV: Unregistering
[1476268.622393] BUG: unable to handle page fault for address: 0000000234ff6038
[1476268.623160] #PF: supervisor read access in kernel mode
[1476268.623692] #PF: error_code(0x0000) - not-present page
[1476268.624243] PGD 0 P4D 0
[1476268.624763] Oops: 0000 [#1] SMP PTI
[1476268.625273] CPU: 5 PID: 217452 Comm: qemu-system-x86 Tainted: G        W  O      5.4.0-rc8+ #11
[1476268.625787] Hardware name: Nutanix AHV, BIOS 1.9.1-5.el6 04/01/2014
[1476268.626319] RIP: 0010:__mutex_lock.isra.7+0xd0/0x4d0
[1476268.626895] Code: 39 c2 74 b7 48 89 c2 eb 9c 65 48 8b 04 25 c0 6b 01 00 48 8b 00 a8 08 0f 85 8f 00 00 00 49 8b 07 48 83 e0 f8 0f 84 77 03 00 00 <8b> 50 38 85 d2 0f 85 06 01 00 00 85 d2 74 73 4d 8d 67 0c 4c 89 e7
[1476268.628101] RSP: 0018:ffffb1fc00ab3b80 EFLAGS: 00010206
[1476268.628693] RAX: 0000000234ff6000 RBX: ffff902a2e5a6100 RCX: 0000000234ff6000
[1476268.629309] RDX: 0000000234ff6000 RSI: ffff902a2f3ff080 RDI: ffff902a2e5a6150
[1476268.629983] RBP: ffffb1fc00ab3c20 R08: 0000000000000000 R09: 0000000000000000
[1476268.630617] R10: ffffb1fc00ab3c40 R11: ffff902a2b2a0d10 R12: ffffb1fc00ab3c40
[1476268.631228] R13: ffff902a2cfc1900 R14: 0000000000000002 R15: ffff902a2e5a6150
[1476268.631883] FS:  0000000000000000(0000) GS:ffff902a34140000(0000) knlGS:0000000000000000
[1476268.632548] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1476268.633209] CR2: 0000000234ff6038 CR3: 00000000a3c0a004 CR4: 00000000001606e0
[1476268.633915] Call Trace:
[1476268.634628]  ? locks_remove_posix+0xc8/0x160
[1476268.635327]  ? dma_unmap_all+0x45/0x150 [muser]
[1476268.636009]  dma_unmap_all+0x45/0x150 [muser]
[1476268.636734]  muser_close+0x1a/0x90 [muser]
[1476268.637413]  vfio_mdev_release+0x1e/0x30 [vfio_mdev]
[1476268.638123]  vfio_device_fops_release+0x1e/0x40 [vfio]
[1476268.638813]  __fput+0xbe/0x250
[1476268.639551]  task_work_run+0x8a/0xb0
[1476268.640219]  do_exit+0x2e0/0xbb0
[1476268.640846]  do_group_exit+0x3a/0xa0
[1476268.641444]  get_signal+0x16d/0x8e0
[1476268.642051]  ? __wake_up_common+0x96/0x180
[1476268.642607]  do_signal+0x30/0x690
[1476268.643191]  ? eventfd_write+0xbe/0x2a0
[1476268.643779]  ? __x64_sys_futex+0x87/0x170
[1476268.644334]  exit_to_usermode_loop+0x91/0xf0
[1476268.644933]  do_syscall_64+0x159/0x180
[1476268.645497]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1476268.646081] RIP: 0033:0x7fc799ec317f
[1476268.646633] Code: Bad RIP value.
[1476268.647251] RSP: 002b:00007fc73d3e3510 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[1476268.647872] RAX: fffffffffffffe00 RBX: 000055a5aa5ab1f0 RCX: 00007fc799ec317f
[1476268.648531] RDX: 00000000000014e7 RSI: 0000000000000080 RDI: 000055a5aa5ab1f4
[1476268.649146] RBP: 000055a5a90e97e0 R08: 000055a5a90e9700 R09: 0000000000000a73
[1476268.649789] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a5a885b774
[1476268.650422] R13: 00000000000004dd R14: 00007fc73cbe6000 R15: 0000000000000003
[1476268.651078] Modules linked in: muser(O-) tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag vfio_mdev mdev vfio_iommu_type1 vfio binfmt_misc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
bochs_drm drm_vram_helper snd_pcm ttm aesni_intel snd_timer crypto_simd drm_kms_helper snd cryptd glue_helper soundcore evdev joydev sg drm serio_raw pcspkr virtio_balloon button ip_tables x_tables autofs4 ext4
crc32c_generic crc16 mbcache jbd2 hid_generic usbhid hid sr_mod cdrom sd_mod ata_generic virtio_net net_failover virtio_scsi failover crc32c_intel ata_piix uhci_hcd ehci_pci ehci_hcd libata psmouse virtio_pci v$
rtio_ring i2c_piix4 usbcore virtio scsi_mod floppy [last unloaded: muser]
[1476268.655755] CR2: 0000000234ff6038
[1476268.656534] ---[ end trace 7cf53999121ec065 ]---
[1476268.657271] RIP: 0010:__mutex_lock.isra.7+0xd0/0x4d0
[1476268.658503] Code: 39 c2 74 b7 48 89 c2 eb 9c 65 48 8b 04 25 c0 6b 01 00 48 8b 00 a8 08 0f 85 8f 00 00 00 49 8b 07 48 83 e0 f8 0f 84 77 03 00 00 <8b> 50 38 85 d2 0f 85 06 01 00 00 85 d2 74 73 4d 8d 67 0c 4c 89 e7
[1476268.661380] RSP: 0018:ffffb1fc00ab3b80 EFLAGS: 00010206
[1476268.662773] RAX: 0000000234ff6000 RBX: ffff902a2e5a6100 RCX: 0000000234ff6000
[1476268.664248] RDX: 0000000234ff6000 RSI: ffff902a2f3ff080 RDI: ffff902a2e5a6150
[1476268.665585] RBP: ffffb1fc00ab3c20 R08: 0000000000000000 R09: 0000000000000000
[1476268.667002] R10: ffffb1fc00ab3c40 R11: ffff902a2b2a0d10 R12: ffffb1fc00ab3c40
[1476268.668475] R13: ffff902a2cfc1900 R14: 0000000000000002 R15: ffff902a2e5a6150
[1476268.669943] FS:  0000000000000000(0000) GS:ffff902a34140000(0000) knlGS:0000000000000000
[1476268.671391] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1476268.672700] CR2: 00007fc799ec3155 CR3: 00000000a3c0a004 CR4: 00000000001606e0
[1476268.674077] Fixing recursive fault but reboot is needed!
[1476284.630049] vfio_mdev 00000000-0000-0000-0000-000000000000: Device is currently in use, task "rmmod" (223678) blocked until device is released

I then ran rmmod trying to clean up, which got stuck (included in previous stack trace). The device emulation application cannot be killed. This requires a hard reset to rectify.

FYI @swapnili

Support other and extended PCI capabilities, e.g. Advanced Error Reporting (AER)

Today a device can provide capabilities using the following API:
vfu_pci_setup_caps(vfu_ctx_t *vfu_ctx, vfu_cap_t **caps, int nr_caps)
For the convenience of device implementations, vfu_cap_t is defined in libvfio-user.h.

We need to investigate unsupported capabilities and revisit the API to have a way to support them.
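For context, extended capabilities such as AER are laid out differently from standard ones. They live in PCI Express extended configuration space starting at offset 0x100. Each has a 32-bit header holding a 16-bit ID, a 4-bit version and a 12-bit next pointer. The sketch below walks such a list in a raw config space buffer. The macro and function names are illustrative only and are not libvfio-user API.

```c
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_EXT_SIZE 4096   /* PCIe extended config space size */
#define EXT_CAP_START      0x100  /* first extended capability header */
#define EXT_CAP_ID_AER     0x0001 /* Advanced Error Reporting */

/* Config space is little-endian regardless of host byte order. */
static uint32_t
read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Returns the offset of the extended capability with the given ID,
 * or 0 if it is not present.
 */
static size_t
find_ext_cap(const uint8_t cfg[CFG_SPACE_EXT_SIZE], uint16_t cap_id)
{
    size_t off = EXT_CAP_START;
    /* Bound the walk so a malformed (cyclic) list cannot loop forever. */
    int budget = (CFG_SPACE_EXT_SIZE - EXT_CAP_START) / 8;

    while (budget-- > 0 && off >= EXT_CAP_START &&
           off <= CFG_SPACE_EXT_SIZE - 4) {
        uint32_t hdr = read_le32(cfg + off);

        if (hdr == 0 || hdr == 0xffffffffu) {
            break; /* no (further) extended capabilities */
        }
        if ((hdr & 0xffff) == cap_id) {
            return off;
        }
        off = (hdr >> 20) & 0xffc; /* 12-bit next pointer, dword aligned */
    }
    return 0;
}
```

An API that only chains standard capabilities through the 8-bit next pointers in the first 256 bytes cannot express this second list, which is one reason the interface may need revisiting.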

audit API surface for security

Later on, when the API is firmer, we need to self-audit the code for robustness in the face of a malicious client (for example - no unconstrained allocations based on user-supplied values).
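As one example of the kind of check such an audit would look for, the sketch below bounds an allocation whose element count comes from the client. The cap value and the name alloc_from_client_count are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_CLIENT_ALLOC ((size_t)1 << 20) /* hypothetical 1 MiB cap */

/*
 * Allocates nmemb * size zeroed bytes for a client-supplied element count.
 * Returns NULL if either value is zero, if the product would exceed the cap
 * (which also rules out multiplication overflow), or if calloc fails.
 */
static void *
alloc_from_client_count(size_t nmemb, size_t size)
{
    if (nmemb == 0 || size == 0) {
        return NULL;
    }
    /* Division avoids computing nmemb * size before it is known to fit. */
    if (nmemb > MAX_CLIENT_ALLOC / size) {
        return NULL;
    }
    return calloc(nmemb, size);
}
```

Routing every client-driven allocation through one helper like this makes the audit a matter of checking call sites rather than re-deriving the bounds each time.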

Host kernel panic with "gpio-pci-idio-16" sample

Host kernel panics when "gpio-pci-idio-16" sample is used with QEMU.

Host
Kernel version (git describe): v5.4-rc8-2-gc74386d
QEMU version: v4.1.0-1750-g591b3bd
muser version: v0.1-5-g2e35483

Steps to reproduce the issue

  1. muser/patches/vfio.diff was applied to above linux kernel
  2. Kernel was built and installed
  3. Built muser by executing make; make install
  4. Rebooted Host
  5. Installed muser driver by using modprobe muser command
  6. Created Mdev using echo f2d37405-dfa3-4c9b-98e4-cb1a01800ad3 > /sys/class/muser/muser/mdev_supported_types/muser-1/create command
  7. Launched the sample app using ./build/dbg/samples/gpio-pci-idio-16 f2d37405-dfa3-4c9b-98e4-cb1a01800ad3 command
  8. Launched QEMU with the following options:
    -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/f2d37405-dfa3-4c9b-98e4-cb1a01800ad3
    -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=2G
    -numa node,nodeid=0,cpus=0,memdev=ram-node0 \

backwards binary compatibility of libmuser

In libmuser we export some critical structs, e.g. lm_dev_info, lm_cap_t. Doing so makes coding for the user simpler, however we risk not being able to upgrade libmuser while maintaining backward compatibility, since we won't be able to modify these structs without breaking it. Therefore it would be better to make these structs opaque and introduce public functions for e.g. initializing lm_dev_info. @swapnili @franciozzy thoughts?

If we still want to keep some structs public, we might have to do something similar to VFIO: introduce an argsz member at the beginning of the struct so that we know the amount of data passed by the user and only append new struct members whenever we have to change them.
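The argsz idea can be illustrated with a small sketch. The struct and function below are hypothetical stand-ins, not the actual lm_dev_info definition: an older caller passes a shorter struct, and the library copies only what the caller provided and zero-fills the members it does not know about.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical public struct: v1 ended at 'flags', 'nr_regions' came later. */
typedef struct {
    uint32_t argsz;      /* size of the struct as the caller knows it */
    uint32_t flags;
    uint32_t nr_regions; /* appended in a later version */
} example_dev_info_t;

/* Smallest size any caller may pass: everything up to the first addition. */
#define EXAMPLE_DEV_INFO_MINSZ offsetof(example_dev_info_t, nr_regions)

/*
 * Imports a caller-provided struct of any known version into the current
 * layout. Members beyond the caller's argsz default to zero.
 * Returns 0 on success, -1 if argsz is smaller than the v1 layout.
 */
static int
example_dev_info_import(example_dev_info_t *dst, const void *src)
{
    uint32_t argsz;

    memcpy(&argsz, src, sizeof(argsz));
    if (argsz < EXAMPLE_DEV_INFO_MINSZ) {
        return -1;
    }
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, argsz < sizeof(*dst) ? argsz : sizeof(*dst));
    return 0;
}
```

This only works if new members are strictly appended and existing ones never change size or meaning, which is the discipline VFIO follows for its ioctl structures.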
