axboe / liburing
Library providing helpers for the Linux kernel io_uring support
License: MIT License
liburing
--------

This is the io_uring library, liburing. liburing provides helpers to set up and
tear down io_uring instances, and also a simplified interface for applications
that don't need (or want) to deal with the full kernel side implementation.

For more info on io_uring, please see: https://kernel.dk/io_uring.pdf

Subscribe to [email protected] for io_uring related discussions and
development for both kernel and userspace. The list is archived here:
https://lore.kernel.org/io-uring/

kernel version dependency
-------------------------

liburing itself is not tied to any specific kernel release, so it's possible
to use the newest liburing release even on older kernels (and vice versa).
Newer features may only be available on more recent kernels, obviously.

ulimit settings
---------------

io_uring accounts the memory it needs under the RLIMIT_MEMLOCK rlimit, which
can be quite low on some setups (64K). The default is usually enough for most
use cases, but bigger rings or things like registered buffers deplete it
quickly. root isn't under this restriction, but regular users are. Going into
detail on how to bump the limit on various systems is beyond the scope of this
little blurb, but check /etc/security/limits.conf for user specific settings,
or /etc/systemd/user.conf and /etc/systemd/system.conf for systemd setups.
This affects kernels 5.11 and earlier; newer kernels are less dependent on
RLIMIT_MEMLOCK, as it is only used for registering buffers.

Regression tests
----------------

The bulk of liburing is actually regression/unit tests for both liburing and
the kernel io_uring support. Please note that this suite isn't expected to
pass on older kernels, and may even crash or hang older kernels!

Building liburing
-----------------

    #
    # Prepare build config (optional).
    #
    #  --cc  specifies the C compiler.
    #  --cxx specifies the C++ compiler.
    #
    ./configure --cc=gcc --cxx=g++;

    # Build liburing.
    make -j$(nproc);

    # Install liburing (headers, shared/static libs, and manpage).
    sudo make install;

See './configure --help' for more information about build config options.

FFI support
-----------

By default, the build results in 4 lib files:

    2 shared libs: liburing.so, liburing-ffi.so
    2 static libs: liburing.a, liburing-ffi.a

liburing's main public interface lives in liburing.h as 'static inline'
functions. Languages and applications that can't use 'static inline'
functions, or that wish to consume liburing purely as a binary dependency,
should link against the FFI variants. liburing-ffi contains definitions for
every 'static inline' function.

License
-------

All software contained within this repo is dual licensed LGPL and MIT, see
COPYING and LICENSE, except for a header coming from the kernel which is dual
licensed GPL with a Linux-syscall-note exception and MIT, see COPYING.GPL and
<https://spdx.org/licenses/Linux-syscall-note.html>.

Jens Axboe 2022-05-19
Currently io_uring_peek_cqe filters out all CQEs with user_data set to LIBURING_UDATA_TIMEOUT, while io_uring_peek_batch_cqe does not. Is this the desired behavior or not?
Sample code:
#include "liburing.h"
#include <stdio.h>
#include <errno.h>
int main(int argc, char const *argv[])
{
    int ret;
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    ret = io_uring_queue_init(32, &ring, 0);
    if (ret) {
        fprintf(stderr, "queue init failed: %d\n", ret);
        return ret;
    }

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "sqe get failed\n");
        return 1;
    }
    // this one gets filtered
    io_uring_prep_nop(sqe);
    io_uring_sqe_set_data(sqe, (void *)LIBURING_UDATA_TIMEOUT);

    ret = io_uring_submit_and_wait(&ring, 1);
    if (ret != 1) {
        fprintf(stderr, "submit failed: %d\n", ret);
        return 1;
    }

    ret = io_uring_peek_cqe(&ring, &cqe);
    if (ret != -EAGAIN) {
        fprintf(stderr, "peek failed: %d\n", ret);
        return ret;
    }

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "sqe get failed\n");
        return 1;
    }
    // this one is not filtered
    io_uring_prep_nop(sqe);
    io_uring_sqe_set_data(sqe, (void *)LIBURING_UDATA_TIMEOUT);

    ret = io_uring_submit_and_wait(&ring, 1);
    if (ret != 1) {
        fprintf(stderr, "submit failed: %d\n", ret);
        return ret;
    }

    ret = io_uring_peek_batch_cqe(&ring, &cqe, 1);
    if (ret != 1) {
        fprintf(stderr, "peek batch failed, expected 1, got: %d\n", ret);
        return ret;
    }

    if (cqe->user_data != LIBURING_UDATA_TIMEOUT) {
        fprintf(stderr, "LIBURING_UDATA_TIMEOUT expected\n");
        return 1;
    }
    return 0;
}
Threading support can be very useful in async programming, for example thread joining and condvar waiting.
Futex support would be a good start, IMO.
Hi, I've probably found another problem, triggered when my tests run in parallel.
Should it be OK to set up a separate io_uring independently in multiple threads?
I've created a simple test to reproduce it (at least on my Ryzen 7 3700X with Fedora 31, kernel 5.3.12). Basically, it fails during the io_uring_setup call with ENOMEM.
I've tried to follow this guide to figure something out, but I'm not into kernel dev, so this all seems like too much voodoo for me ;-)
Anyway, here's the trace output if it helps somehow:
1) | __x64_sys_io_uring_setup() {
1) | io_uring_setup() {
1) | capable() {
1) | ns_capable_common() {
1) <...>-633600 => <...>-633604
1) | security_capable() {
1) <...>-633604 => <...>-633600
1) 0.230 us | cap_capable();
1) 0.762 us | }
1) 1.202 us | }
1) 1.623 us | }
1) 0.270 us | free_uid();
1) 2.735 us | }
1) 3.467 us | }
And here is a test to reproduce it.
#include <stdio.h>
#include <pthread.h>
#include "liburing.h"

struct thread_info_t {
    pthread_t tid;
    int num;
};

static void *doTest(void *arg) {
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    struct io_uring_sqe *sqe;
    struct thread_info_t *ti;
    int ret;

    ti = (struct thread_info_t *)arg;
    printf("%d: start\n", ti->num);

    ret = io_uring_queue_init(128, &ring, 0);
    if (ret) {
        printf("%d: ring setup failed: %d\n", ti->num, ret);
        return arg;
    }
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        printf("%d: get sqe failed\n", ti->num);
        return arg;
    }
    io_uring_prep_nop(sqe);
    ret = io_uring_submit(&ring);
    if (ret <= 0) {
        printf("%d: sqe submit failed: %d\n", ti->num, ret);
        return arg;
    }
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        printf("%d: wait completion %d\n", ti->num, ret);
        return arg;
    }
    io_uring_cqe_seen(&ring, cqe);
    printf("%d: done\n", ti->num);
    return NULL;
}

int main(int argc, char *argv[])
{
    struct thread_info_t threads[10];
    int ret;
    void *res;

    for (int i = 0; i < 10; i++) {
        threads[i].num = i;
        ret = pthread_create(&threads[i].tid, NULL, doTest, &threads[i]);
        if (ret) {
            fprintf(stderr, "Thread create failed\n");
            return 1;
        }
    }
    for (int i = 0; i < 10; i++) {
        ret = pthread_join(threads[i].tid, &res);
        if (ret) {
            fprintf(stderr, "Thread join failed\n");
            return 1;
        }
        if (res) {
            fprintf(stderr, "Test failed\n");
            return 1;
        }
    }
    return 0;
}
One of my outputs is:
0: start
1: start
2: start
3: start
4: start
4: ring setup failed: -12
5: start
5: ring setup failed: -12
6: start
0: done
6: ring setup failed: -12
3: done
7: start
7: ring setup failed: -12
1: done
8: start
8: ring setup failed: -12
9: start
9: ring setup failed: -12
2: done
Test failed
I have an open socket from which I'm reading data, with the following behavior:
(1) If the socket is blocking, then preparing a read [io_uring_prep_readv] and submitting it [io_uring_submit] causes the submit to block until data can be read from the socket.
(2) If the socket is non-blocking, then doing (1) causes EAGAIN to be returned on the CQE, unless the socket has data available.
(3) If the socket is non-blocking, then polling the socket for input [io_uring_prep_poll_add + POLLIN], flagging it with IOSQE_IO_LINK, followed by a consecutive read SQE causes the error EINVAL to be returned on the poll CQE.
Ideally (3) would work and perform the read when the socket receives input. To get it to work, I have to split up the poll and the read, and only submit the latter after I receive the former.
Likewise when sending data on a socket. In the rare occurrence where the output buffer is full, instead of registering a POLLOUT and retrying the write, it'd be nice to send the data and only have to worry about my total outstanding operations.
Perhaps I'm missing something, but can this be supported? Thanks!
This issue only happens when you submit multiple read requests.
#include <liburing.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

char str1[32768];
char str2[32768];
struct io_uring ring;

void prep(int fd, char *str) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    struct iovec iov = {
        .iov_base = str,
        .iov_len = sizeof(str1),
    };
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);
    sqe->user_data = fd;
    io_uring_submit(&ring);
    printf("SUBMIT: %d\n", fd);
}

void wait() {
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("FINISH: %d with res %d\n", (int)cqe->user_data, cqe->res);
    io_uring_cqe_seen(&ring, cqe);
}

int main() {
    io_uring_queue_init(32, &ring, 0);
    int fd1 = open("/path/to/large/file1", O_RDONLY);
    int fd2 = open("/path/to/large/file2", O_RDONLY);
    prep(fd1, str1);
    prep(fd2, str2);
    wait();
    wait();
    close(fd2);
    close(fd1);
    io_uring_queue_exit(&ring);
}
Before executing the program, run sync && echo 3 > /proc/sys/vm/drop_caches && swapoff -a && swapon -a
You will get -EFAULT when debugging the program using GDB
(gdb) run
Starting program: /root/test/./test
SUBMIT: 8
SUBMIT: 9
FINISH: 9 with res -14
FINISH: 8 with res 32768
[Inferior 1 (process 16788) exited normally]
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64
If you run it directly, the program will crash with segfault.
$ ./test
SUBMIT: 4
SUBMIT: 5
FINISH: 4 with res 32768
FINISH: 5 with res 3568
[1] 16893 segmentation fault (core dumped) ./test
If you run the program again without dropping caches, it will work as expected
Linux localhost 5.4.0-1.el7.elrepo.x86_64 #1 SMP Mon Nov 25 09:18:09 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
Original post: hakasenyang/openssl-patch#22 (comment)
EDIT: verified on 5.5rc too
Is there a way to manage user access/permissions?
For example, say I use liburing in a web server running as root, perhaps with IORING_SETUP_SQPOLL. I wouldn't want my web server to run everything with root privileges; maybe some of the operations (read/write, ...) need to run as other users.
Maybe this is something we could set on the sqe:
sqe = io_uring_get_sqe(ring)
sqe.setuid = 123
On error it would fail with a permission denied error.
One use case for IOSQE_IO_LINK is zero-copy IO operations, but it's hard to determine how many bytes were actually read.
Take an echo server, for example. It's just ACCEPT -> RECV -> SEND -> CLOSE, but it's hard or impossible to do in a zero-copy way. The problem is:
For 2, man 2 read says:
It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal.
So even a simple READ -> WRITE link chain may not always be reliable.
I suggest adding a flag called IOSQE_IO_USE_PREV_RES (the name is not decided), which works only with IORING_OP_{WRITE,SEND}, must be used together with IOSQE_IO_LINK, and indicates that the current operation's buffer size is set by the previous operation's return code. If the previous return code is <= 0, the operation should generate an error.
What do you think?
Why do the prep helpers have off_t in their arguments? Example:
liburing/src/include/liburing.h
Line 166 in 8f24d3c
liburing/src/include/liburing/io_uring.h
Line 22 in 8f24d3c
I have a socket open that has an asynchronous recvmsg (io_uring_prep_recvmsg + io_uring_sqe_set_data) outstanding. No data is being supplied by the other end. Subsequently, the recvmsg is being canceled (io_uring_prep_cancel). The CQE for the cancel is giving -114 (-EALREADY) which is expected, however the CQE for the recvmsg is receiving -512, which is not.
Not sure where the result is being generated. In the case it's a kernel issue, I'm testing with https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2019-12-01/.
It's a fact that users don't want to maintain an offset themselves. Seeking operations are still widely used, but there's no way to do them using io_uring (lseek+read+lseek is not atomic).
I think readv/writev without offsets are still reasonable for asio: operations on different fds can still run in parallel, as shown in ucontext-cp.
libuv currently has to punt these operations to the threadpool. Let's support it natively.
Cancellation is very important for socket programming. I know canceling an IO operation requires hardware/driver support, but if the operation hasn't started yet (i.e. EAGAIN), it should be cancelable.
Otherwise, we have to fall back to IORING_OP_POLL_ADD.
The operations defined in https://github.com/axboe/liburing/blob/master/src/include/liburing/io_uring.h#L81-L88 are missing man page descriptions:
IORING_OP_CONNECT,
IORING_OP_FALLOCATE,
IORING_OP_OPENAT,
IORING_OP_CLOSE,
IORING_OP_FILES_UPDATE,
IORING_OP_STATX,
IORING_OP_READ,
IORING_OP_WRITE,
When benchmarking an echo server written with io_uring, I found that adding a poll_add sqe before readv/recvmsg could result in about a 30% performance boost:
https://github.com/CarterLi/io_uring-echo-server/blob/switch/io_uring_echo_server.c#L14
131729 request/sec
VS 98694 request/sec
using rust_echo_bench.
That was unexpected. AFAIK readv/recvmsg are async operations themselves; adding a poll_add sqe shouldn't help, and should instead cause an extra context switch (because it will wake io_uring_enter).
After some investigation, I found that the program without poll_add creates lots of kernel threads called io_wqe_worker, while the program with poll_add doesn't.
I don't know how poll_add works, but it seems that poll_add has a much lower cost than async read. Is this expected? And, maybe a silly question: could we implement async read as poll_add plus a nonblocking read?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <liburing.h>
int main() {
    char buffer[1024];
    struct io_uring ring;
    struct iovec iov;
    struct sockaddr_in saddr;
    struct msghdr msg;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int sockfd = 0, clientfd = 0, ret;

    io_uring_queue_init(32, &ring, 0);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        perror("socket");
        goto err;
    }
    saddr = (struct sockaddr_in) {
        .sin_family = AF_INET,
        .sin_addr = {
            .s_addr = htonl(INADDR_ANY),
        },
        .sin_port = htons(12345),
    };
    ret = bind(sockfd, (struct sockaddr *)&saddr, sizeof(saddr));
    if (ret < 0) {
        perror("bind");
        goto err;
    }
    ret = listen(sockfd, 32);
    if (ret < 0) {
        perror("listen");
        goto err;
    }
    clientfd = accept(sockfd, NULL, NULL);
    if (clientfd < 0) {
        perror("accept");
        goto err;
    }

    iov = (struct iovec) {
        .iov_base = buffer,
        .iov_len = sizeof(buffer),
    };
    msg = (struct msghdr) {
        .msg_namelen = sizeof(struct sockaddr_in),
        .msg_iov = &iov,
        .msg_iovlen = 1,
    };
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recvmsg(sqe, clientfd, &msg, 0);
    ret = io_uring_submit_and_wait(&ring, 1);
    if (ret <= 0) {
        perror("io_uring_submit_and_wait");
        goto err;
    }
    io_uring_peek_cqe(&ring, &cqe);
    if (cqe->res < 0) {
        printf("recvmsg failed: %d\n", cqe->res);
        goto err;
    }

err:
    io_uring_queue_exit(&ring);
    close(clientfd);
    close(sockfd);
}
$ clang -g -luring -o test test.c
$ ./test # On another terminal: curl -v localhost:12345
recvmsg failed: -14 # Not constantly
$ uname -a # https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2019-12-15/
Linux carter-virtual-machine 5.5.0-999-generic #201912142104 SMP Sun Dec 15 02:07:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
I can't reproduce it on Linux 5.4; it may relate to https://lore.kernel.org/io-uring/[email protected]/T/#m919b41ecbf5049c15df15e8cbf2ff982acc37cc9
io_uring_get_sqe sometimes fails to find a vacant sqe when SQPOLL is enabled, even though there is free space. Running the following test case always produces io_uring_get_sqe failed, space left: 8:
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/poll.h>
#include "liburing.h"
#define NUM_ENTRIES 8
int setup_and_run();
int main(int argc, char *argv[])
{
    for (int j = 0; j < 100; j++)
    {
        int ret = setup_and_run();
        if (ret)
        {
            return ret;
        }
    }
    return 0;
}

int setup_and_run()
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct io_uring_params p;
    struct io_uring ring;
    int ret, data;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;
    ret = io_uring_queue_init_params(NUM_ENTRIES, &ring, &p);
    if (ret)
    {
        fprintf(stderr, "ring create failed: %d\n", ret);
        return 1;
    }
    if (p.sq_entries != NUM_ENTRIES)
    {
        fprintf(stderr, "ring create failed, wanted %d sq entries, got: %d entries\n", NUM_ENTRIES, p.sq_entries);
        return 1;
    }

    for (int i = 0; i < NUM_ENTRIES; i++)
    {
        sqe = io_uring_get_sqe(&ring);
        if (!sqe)
        {
            fprintf(stderr, "io_uring_get_sqe failed\n");
            return 1;
        }
        io_uring_prep_nop(sqe);
        io_uring_sqe_set_data(sqe, (void *)(unsigned long)42);
    }
    ret = io_uring_submit(&ring);
    if (!ret)
    {
        fprintf(stderr, "io_uring_submit failed\n");
        return -1;
    }

    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret == 0)
    {
        data = (unsigned long)io_uring_cqe_get_data(cqe);
        if (data != 42)
        {
            fprintf(stderr, "invalid data: %d\n", data);
            return data;
        }
        int space_left = io_uring_sq_space_left(&ring);
        sqe = io_uring_get_sqe(&ring);
        if (sqe == NULL)
        {
            fprintf(stderr, "io_uring_get_sqe failed, space left: %d\n", space_left);
            return 1;
        }
    }
    else
    {
        fprintf(stderr, "io_uring_wait_cqe failed: %d\n", ret);
        return ret;
    }
    io_uring_queue_exit(&ring);
    return 0;
}
// test.c
#include <liburing.h>
#include <stdio.h>
#include <time.h>
int main() {
    struct io_uring ring;
    io_uring_queue_init(32, &ring, 0);
    printf("0: %ld\n", time(NULL));
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_nop(sqe);
        io_uring_submit(&ring);

        struct __kernel_timespec ts = {
            .tv_sec = 10,
            .tv_nsec = 0,
        };
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe_timeout(&ring, &cqe, &ts);
        io_uring_cqe_seen(&ring, cqe);
        printf("1: %ld\n", time(NULL));
    }
    {
        struct __kernel_timespec ts = {
            .tv_sec = 1,
            .tv_nsec = 0,
        };
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_timeout(sqe, &ts, 0, 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
        printf("2: %ld\n", time(NULL));
    }
    io_uring_queue_exit(&ring);
    return 0;
}
Actual: The last io_uring_prep_timeout waits for 10s
$ clang test.c -luring -o test
$ ./test
0: 1575128130
1: 1575128130
2: 1575128140
Expected: it should only wait for 1s.
Linux carter-virtual-machine 5.4.0-999-generic #201911282213 SMP Fri Nov 29 03:17:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2019-11-29/
Hi, I have a question about io_uring_enter and EAGAIN.
When to_submit is zero, can io_uring_enter return EAGAIN?
When to_submit is not zero, can io_uring_enter return 0? And if so, when does it return 0, and when EAGAIN?
I noticed that for IORING_OP_TIMEOUT, if the completion event count is not set, it defaults to 1. That's not very useful, in my opinion. I suggest that if sqe->off equals 0, IORING_OP_TIMEOUT should act like a plain timer; that is to say, IORING_OP_TIMEOUT wouldn't be completed by other requests' completions.
With this change, timerfd could be partially replaced (interval timers are not a good fit for io_uring, though).
Yes, it's a breaking change. sqe->off == -1 is also worth considering.
There is no way to know whether a file descriptor supports {read,write}_iter, and sometimes we don't even know the type of an fd, for example STDIN/OUT/ERR_FILENO. We have to try IORING_OP_READV first; if we get -EINVAL, we have to fall back to IORING_OP_POLL_ADD and a plain read, which is inconvenient and slow.
Why don't the man pages contain any documentation about this library?
And is this faster than epoll? If so, by what factor?
Need a space in
Line 236 in c0fcb7f
I've been trying and failing to read from an fd opened with eventfd through IORING_OP_READ:
int fd = eventfd(0, 0);
io_uring_prep_read(sqe, fd, &event, sizeof(eventfd_t), 0);
This fails consistently with EINVAL.
Polling with io_uring_prep_poll_add(sqe, fd, 0) and then reading with read(fd, &event, sizeof(eventfd_t)) works.
Am I missing something, or is reading from an eventfd directly through io_uring not supported?
Thanks
Hi,
I looked through the files under test and examples, and found that some iov_base buffers are allocated by posix_memalign and others by malloc, or are even char * literals.
I then did some experiments. It seems that liburing does not care about memory alignment. Is this true? Thanks in advance.
liburing/test/500f9fbadef8-test.c
Line 27 in 1ed37c5
Line 20 in 5569609
This program never terminates:
#include "liburing.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct io_uring ring;
    int ring_flags, ret, data;

    ring_flags = IORING_SETUP_SQPOLL;
    ret = io_uring_queue_init(64, &ring, ring_flags);
    if (ret) {
        fprintf(stderr, "ring create failed: %d\n", ret);
        return 1;
    }
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "sqe get failed\n");
        return 1;
    }
    io_uring_prep_nop(sqe);
    io_uring_sqe_set_data(sqe, (void *)(unsigned long)42);
    io_uring_submit_and_wait(&ring, 1);

    ret = io_uring_peek_cqe(&ring, &cqe);
    if (ret) {
        fprintf(stderr, "cqe get failed\n");
        return 1;
    }
    data = (unsigned long)io_uring_cqe_get_data(cqe);
    if (data != 42) {
        fprintf(stderr, "invalid data: %d\n", data);
        return 1;
    }
    return 0;
}
Changing this line
Line 177 in 4cc37de
to
if (sq_ring_needs_enter(ring, &flags) || wait_nr) {
fixes it. If I'm wrong, I'm sorry.
I'm issuing an accept() SQE subsequently followed by a connect() SQE. The connect result is success, however the accept() returns with CQE status ENOTCONN. Ignoring the error, the connected socket is fine and can issue I/O.
I presume this has something to do with the asynchronous connect case. Running with the Linux kernel at e31736d9fae841e8a1612f263136454af10f476a (12/14).
We support IORING_OP_{READ,WRITE}_FIXED but don't support IORING_OP_{SEND,RECV}_FIXED.
Is that problematic? Will it result in type narrowing?
Noticed this:
liburing/src/include/liburing.h
Line 267 in 2e7d744
Then I found this:
https://git.kernel.dk/cgit/linux-block/tree/fs/io_uring.c?h=for-5.5/io_uring-post#n2353
Is it reserved for future use? It seems kind of strange and not very useful to me.
For an IOSQE_IO_LINK chain, people usually don't care about individual operations before the whole link chain is completed. Currently io_uring_wait_cqe is woken for every operation's completion. io_uring_wait_cqes can partially resolve this issue, but it has its own limitations: for example, with an IORING_OP_ACCEPT operation pending for new connections (which needs io_uring_wait_cqes(1)) and multiple RECV-SEND chains serving existing connections (which need io_uring_wait_cqes(2)), we end up having to use io_uring_wait_cqe.
Suggestion: add a new flag named IOSQE_IO_NO_AWAKE which indicates that an operation should not wake io_uring_enter. It could resolve both problems:
Since IOSQE_IO_NO_AWAKE is set when preparing operations, we don't need to touch the global event loop.
IOSQE_IO_NO_AWAKE can be set for sqes separately. For example, if we set RECV(IOSQE_IO_NO_AWAKE)-SEND, then io_uring_wait_cqe should work fine.
Sorry for my bad English if I can't explain myself clearly.
A program written by me became a zombie process for some reason. I didn't fork any other processes, nor do anything special, just normal stuff.
I was testing IOSQE_IO_LINK, FIXED_FILES and FIXED_BUFFERS, if that helps.
It can't be consistently reproduced, but it has happened several times. I couldn't kill the process. When I was rebooting the system, I got:
$ uname -a 23:46:16
Linux carter-virtual-machine 5.5.0-999-generic #202002082109 SMP Sun Feb 9 02:13:41 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <liburing.h>
int main() {
    char buffer[1024];
    struct io_uring ring;
    struct sockaddr_in saddr;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int sockfd = 0, clientfd = 0, ret;

    io_uring_queue_init(32, &ring, 0);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        perror("socket");
        goto err;
    }
    saddr = (struct sockaddr_in) {
        .sin_family = AF_INET,
        .sin_addr = {
            .s_addr = htonl(INADDR_ANY),
        },
        .sin_port = htons(12345),
    };
    ret = bind(sockfd, (struct sockaddr *)&saddr, sizeof(saddr));
    if (ret < 0) {
        perror("bind");
        goto err;
    }
    ret = listen(sockfd, 32);
    if (ret < 0) {
        perror("listen");
        goto err;
    }
    clientfd = accept(sockfd, NULL, NULL);
    if (clientfd < 0) {
        perror("accept");
        goto err;
    }

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, clientfd, buffer, sizeof(buffer), 0);
    ret = io_uring_submit_and_wait(&ring, 1);
    if (ret <= 0) {
        perror("io_uring_submit_and_wait");
        goto err;
    }
    io_uring_peek_cqe(&ring, &cqe);
    if (cqe->res < 0) {
        printf("recv failed: %d\n", cqe->res);
        goto err;
    }

err:
    io_uring_queue_exit(&ring);
    close(clientfd);
    close(sockfd);
}
$ gcc recv.c -o recv -luring
$ ./recv # On another terminal: curl -v localhost:12345
recv failed: -14
$ uname -a # https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2020-01-31/
Linux carter-virtual-machine 5.5.0-999-generic #202001302109 SMP Fri Jan 31 02:15:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
I think there is something wrong with io_uring_cqe.res. When I call something like:
cqe = io_uring_cqe()
... io_uring_prep_readv
cqe.res  # outputs, say, "5"
cqe = io_uring_cqe()
... io_uring_prep_readv
cqe.res  # outputs the same value as read 1, "5"
while read 2's content length/buffer size is totally different!
Not sure what's going on.
I am testing the following code, which spawns 5 threads and drives IOs on each of them. It randomly hangs both with and without the IORING_SETUP_SQPOLL flag.
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdbool.h>
#include "liburing.h"
#define BS 4096
#define QD 32
#define MAX_OBJECTS 5
static struct io_uring ring[MAX_OBJECTS];
static int dev_fd;
static int ios;
static bool sqpoll;
static struct iovec iov;

static void *setup_iov_base(size_t size)
{
    void *buf;
    int fd;

    if (posix_memalign(&buf, BS, size) != 0) {
        printf("mem aligned failed\n");
        return NULL;
    }
    fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) {
        printf("Failed to open urandom. rc=%d\n", fd);
        return NULL;
    }
    read(fd, buf, size);
    close(fd);
    return buf;
}

static int init(void)
{
    struct io_uring_params p = { 0 };
    int i, rc;

    if (sqpoll) {
        p.flags = IORING_SETUP_SQPOLL;
        printf("Initializing liburing with SQPOLL flag\n");
    }
    dev_fd = open("/dev/nvme1n1", O_RDWR | O_DIRECT);
    if (dev_fd < 0) {
        printf("Failed to open nvme device. rc=%d\n", dev_fd);
        return dev_fd;
    }
    for (i = 0; i < MAX_OBJECTS; i++) {
        rc = io_uring_queue_init_params(QD, &ring[i], &p);
        if (rc != 0) {
            printf("queue_init failed. rc=%d\n", rc);
            return rc;
        }
        if (sqpoll) {
            rc = io_uring_register_files(&ring[i], &dev_fd, 1);
            if (rc < 0) {
                printf("Failed to register files. rc=%d\n", rc);
                return rc;
            }
        }
    }
    iov.iov_base = setup_iov_base(BS);
    iov.iov_len = BS;
    return 0;
}

static inline void submit_to_kernel(char *failure_message, int thread_id)
{
    int rc;

    rc = io_uring_submit(&ring[thread_id]);
    if (rc < 0) {
        printf("%s. rc=%d\n", failure_message, rc);
    }
}

static struct io_uring_sqe *get_sqe(int tid, int *yield)
{
    struct io_uring_sqe *sqe;

    while ((sqe = io_uring_get_sqe(&ring[tid])) == NULL) {
        submit_to_kernel("Failure to wake napping thread", tid);
        *yield = *yield + 1;
        pthread_yield();
    }
    return sqe;
}

static void *submit_io(void *input)
{
    off_t offset = 0;
    int thread_id = *((int *)input);
    int total_ios = ios;
    int yield = 0;

    while (total_ios != 0) {
        struct io_uring_sqe *sqe = get_sqe(thread_id, &yield);

        if (sqpoll) {
            io_uring_prep_writev(sqe, 0, &iov, 1, offset);
            sqe->flags |= IOSQE_FIXED_FILE;
        } else {
            io_uring_prep_writev(sqe, dev_fd, &iov, 1, offset);
        }
        sqe->user_data = offset;
        total_ios--;
        if (total_ios % QD == 0) {
            submit_to_kernel("Failed to submit new IO", thread_id);
        }
        offset += BS;
    }
    printf("[thread_id=%d] Submission complete. yield=%d\n", thread_id, yield);
    return NULL;
}

static void *reap_io_completions(void *input)
{
    int thread_id = *((int *)input);
    int failed_ios = 0, rc;
    int total_ios = ios;

    while (total_ios != 0) {
        struct io_uring_cqe *cqe = NULL;

        rc = io_uring_wait_cqe(&ring[thread_id], &cqe);
        if (rc < 0 || cqe->res != BS) {
            printf("thread_id=%d rc=%d cqe->res=%d offset=%llu\n", thread_id, rc, cqe->res, cqe->user_data);
            failed_ios++;
        }
        total_ios--;
        io_uring_cqe_seen(&ring[thread_id], cqe);
    }
    printf("[thread_id=%d] Failed IO count=%d\n", thread_id, failed_ios);
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t submit[MAX_OBJECTS], complete[MAX_OBJECTS];
    int t_ids[MAX_OBJECTS];
    int i, rc;

    if (argc != 3) {
        printf("Expected three arguments\n");
        return -EINVAL;
    }
    ios = atoi(argv[1]);
    sqpoll = atoi(argv[2]) == 1;

    rc = init();
    if (rc != 0) {
        return rc;
    }
    for (i = 0; i < MAX_OBJECTS; i++) {
        t_ids[i] = i;
        rc = pthread_create(&submit[i], NULL, submit_io, &t_ids[i]);
        if (rc < 0) {
            printf("Failed to create submit thread. rc=%d\n", rc);
            return rc;
        }
        rc = pthread_create(&complete[i], NULL, reap_io_completions, &t_ids[i]);
        if (rc < 0) {
            printf("Failed to create complete thread. rc=%d\n", rc);
            return rc;
        }
    }
    for (i = 0; i < MAX_OBJECTS; i++) {
        pthread_join(submit[i], NULL);
        printf("Reaped submit thread_id=%d\n", i);
        pthread_join(complete[i], NULL);
        printf("Reaped complete thread_id=%d\n", i);
        io_uring_queue_exit(&ring[i]);
    }
    close(dev_fd);
    return 0;
}
Following are example outputs.
Success without SQPOLL
[root@ip-10-0-58-7 liburing]# ./examples/iouring-object 500 0
[thread_id=0] Submission complete. yield=0
Reaped submit thread_id=0
[thread_id=1] Submission complete. yield=0
[thread_id=3] Submission complete. yield=0
[thread_id=2] Submission complete. yield=0
[thread_id=4] Submission complete. yield=0
[thread_id=0] Failed IO count=0
Reaped complete thread_id=0
[thread_id=1] Failed IO count=0
[thread_id=3] Failed IO count=0
[thread_id=2] Failed IO count=0
[thread_id=4] Failed IO count=0
Reaped submit thread_id=1
Reaped complete thread_id=1
Reaped submit thread_id=2
Reaped complete thread_id=2
Reaped submit thread_id=3
Reaped complete thread_id=3
Reaped submit thread_id=4
Reaped complete thread_id=4
Success with SQPOLL
[root@ip-10-0-58-7 liburing]# ./examples/iouring-object 500 1
Initializing liburing with SQPOLL flag
[thread_id=1] Submission complete. yield=883
[thread_id=0] Submission complete. yield=989
Reaped submit thread_id=0
[thread_id=2] Submission complete. yield=1070
[thread_id=3] Submission complete. yield=963
[thread_id=4] Submission complete. yield=966
[thread_id=1] Failed IO count=0
[thread_id=0] Failed IO count=0
Reaped complete thread_id=0
[thread_id=2] Failed IO count=0
[thread_id=3] Failed IO count=0
[thread_id=4] Failed IO count=0
Reaped submit thread_id=1
Reaped complete thread_id=1
Reaped submit thread_id=2
Reaped complete thread_id=2
Reaped submit thread_id=3
Reaped complete thread_id=3
Reaped submit thread_id=4
Reaped complete thread_id=4
Failure without SQPOLL
[root@ip-10-0-58-7 liburing]# ./examples/iouring-object 2000 0
[thread_id=0] Submission complete. yield=0
[thread_id=1] Submission complete. yield=0
Reaped submit thread_id=0
[thread_id=3] Submission complete. yield=0
[thread_id=2] Submission complete. yield=0
[thread_id=4] Submission complete. yield=0
[thread_id=1] Failed IO count=0
[thread_id=2] Failed IO count=0
[thread_id=4] Failed IO count=0
^C
Failure with SQPOLL
[root@ip-10-0-58-7 liburing]# ./examples/iouring-object 5000 1
Initializing liburing with SQPOLL flag
[thread_id=4] Submission complete. yield=32130
[thread_id=1] Submission complete. yield=65518
[thread_id=3] Submission complete. yield=67878
[thread_id=0] Submission complete. yield=72433
Reaped submit thread_id=0
[thread_id=2] Submission complete. yield=73237
[thread_id=1] Failed IO count=0
[thread_id=3] Failed IO count=0
[thread_id=0] Failed IO count=0
Reaped complete thread_id=0
[thread_id=2] Failed IO count=0
Reaped submit thread_id=1
Reaped complete thread_id=1
Reaped submit thread_id=2
Reaped complete thread_id=2
Reaped submit thread_id=3
Reaped complete thread_id=3
Reaped submit thread_id=4
^C
liburing commit - a68caac
Linux kernel has been built from https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.4.1.tar.xz
Calling io_uring_setup with ...SQPOLL returns -1 with errno = 1 (EPERM)
After several failed searches of the documentation, I eventually found in the kernel code that it requires CAP_SYS_ADMIN.
Could you add a few words to liburing's comments, or maybe to a newer version of "Efficient IO with io_uring", to warn new SQPOLL enthusiasts about this possible error?
BTW, this privilege check really makes SQPOLL hard to use, since the user process has to run with escalated privileges, or do without SQPOLL at all...
IORING_OP_TIMEOUT returns -ETIME when it expires, which is considered an error and breaks the entire link. As a result, operations chained after IORING_OP_TIMEOUT with IOSQE_IO_LINK will always be canceled.
#include <unistd.h>
#include <liburing.h>

int main() {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    struct io_uring_sqe *sqe1 = io_uring_get_sqe(&ring);
    struct __kernel_timespec ts = {
        .tv_sec = 1,
        .tv_nsec = 0,
    };
    io_uring_prep_timeout(sqe1, &ts, 0, 0);
    io_uring_sqe_set_flags(sqe1, IOSQE_IO_LINK);

    struct io_uring_sqe *sqe2 = io_uring_get_sqe(&ring);
    struct iovec iov = {
        .iov_base = "OK\n",
        .iov_len = sizeof("OK\n"),
    };
    io_uring_prep_writev(sqe2, STDERR_FILENO, &iov, 1, 0);

    io_uring_submit_and_wait(&ring, 2);
    io_uring_queue_exit(&ring);
}
Expected: waits 1s then print "OK"
Actual: nothing is printed
The man pages seem to indicate that it is fine to register additional/different buffers during the lifetime of a ring. It would help to add to the documentation a statement about when a call to unregister is considered safe.
It comes down to the following question: "Is it safe to unregister and re-register buffers while an operation like IORING_OP_READ_FIXED has been submitted but not yet completed?" Or, in other words: "Would one have to wait until all scheduled operations on registered buffers are completed before unregistering?" I assume the latter, but that makes it non-trivial to register/unregister buffers at runtime based on the demand of the application. In this case, one would have to drain the ring (IOSQE_IO_DRAIN) before registering additional buffers, which would likely cause a hiccup in throughput.
Thanks for your work on io_uring, it's a really stellar interface! I'm working on wrapping liburing in a Rust library to make it accessible from Rust (as well as higher-level memory-safe integrations into our async/.await ecosystem, which haven't borne fruit yet).
First, I just want to confirm this is the best place for you to receive questions & pull requests. Let me know if not.
My main question: what is the backwards-compatibility story for liburing right now? In general, would you say you will not remove or break APIs exposed by liburing (obviously excepting `__`-prefixed functions, for example)? I noticed that you recently removed the syscall helpers from liburing, but I think that was a special case because you expect them to be upstreamed to glibc.
I'm asking to determine my own versioning for my Rust wrappers. Most Rust users use cargo to perform version resolution for them, and cargo makes strong assumptions about backwards compatibility between "semver compatible" versions, so I just need to figure out if I should prepare for possible breaking changes between updates to liburing.
Proposed change:
struct io_uring_params {
    __u32 sq_entries;
    __u32 cq_entries;
    __u32 flags;
    __u32 sq_thread_cpu;
    __u32 sq_thread_idle;
    __u32 features;
    __u32 op_last; /* IORING_OP_LAST, set by kernel */
    __u32 resv[3];
    struct io_sqring_offsets sq_off;
    struct io_cqring_offsets cq_off;
};
It's not clear if/when fallocate blocks (a notable exception being network file systems that support it), but regardless it would be nice to include it in the io_uring framework, since we could link fsyncs to commit the file metadata changes.
Hi,
I'm trying to test various aspects of io_uring and came to the fsync test in this lib.
I noticed this line: https://github.com/axboe/liburing/blob/master/test/fsync.c#L117
and added just a printf to let me know whether my kernel is OK with it or not.
Unfortunately, it doesn't work.
I've tested it on kernels 5.2.x (where this flag was introduced) and 5.3.x.
It just returns that error and I don't understand why.
When I added IOSQE_IO_LINK to the submission flags of the flush operation, it started to run OK.
But that seems to be against what is documented, and against what is actually tested with IOSQE_IO_DRAIN.
I've also searched the internet for someone who had already faced the same issue, and found only this maybe-relevant post: https://www.mail-archive.com/[email protected]/msg39033.html, but without any follow-up.
Is this an issue or some misunderstanding? Thanks!
PS: This line https://github.com/axboe/liburing/blob/master/test/fsync.c#L101 should probably be if (ret == -EINVAL), but that is unrelated to this problem.
On a 5.5 kernel, if the file size is less than the iovec size, cqe.res will be equal to 0. On a 5.4 kernel, cqe.res contains the correct number of bytes read.
Sample code:
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/poll.h>
#include "liburing.h"

#define BUF_SIZE 4096
#define FILE_SIZE 1024

static int create_file(const char *file)
{
    ssize_t ret;
    char *buf;
    int fd;

    buf = malloc(FILE_SIZE);
    memset(buf, 0xaa, FILE_SIZE);

    fd = open(file, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open file");
        return 1;
    }
    ret = write(fd, buf, FILE_SIZE);
    close(fd);
    return ret != FILE_SIZE;
}

int main(int argc, char* argv[])
{
    int ret, fd;
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct iovec vec;

    vec.iov_base = malloc(BUF_SIZE);
    vec.iov_len = BUF_SIZE;

    if (create_file(".basic-r")) {
        fprintf(stderr, "file creation failed\n");
        return 1;
    }
    fd = open(".basic-r", O_RDONLY);
    if (fd < 0) {
        perror("file open");
        return 1;
    }
    ret = io_uring_queue_init(32, &ring, 0);
    if (ret)
        return ret;

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "sqe get failed\n");
        return 1;
    }
    io_uring_prep_readv(sqe, fd, &vec, 1, 0);

    ret = io_uring_submit(&ring);
    if (ret != 1) {
        return 1;
    }
    ret = io_uring_wait_cqes(&ring, &cqe, 1, 0, 0);
    if (ret) {
        return 1;
    }
    fprintf(stderr, "cqe res %d", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    return 0;
}
// connect.c
#include <liburing.h>
#include <stdio.h>
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    struct addrinfo hints = {
        .ai_family = AF_UNSPEC,
        .ai_socktype = SOCK_STREAM,
    }, *addr;
    if (getaddrinfo("github.com", "http", &hints, &addr) < 0) {
        return 1;
    }
    int clientfd = socket(addr->ai_family, addr->ai_socktype, addr->ai_protocol);
    if (clientfd < 0) return 2;
#ifndef USE_PLAIN_CONNECT
    struct io_uring ring;
    io_uring_queue_init(32, &ring, 0);
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_connect(sqe, clientfd, addr->ai_addr, addr->ai_addrlen);
    io_uring_submit(&ring);
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
    int ret = cqe->res;
    io_uring_queue_exit(&ring);
#else
    int ret = connect(clientfd, addr->ai_addr, addr->ai_addrlen);
#endif
    printf("%d\n", ret);
    close(clientfd);
    return 0;
}
$ clang connect.c -luring -o connect && ./connect
-115
$ clang connect.c -luring -o connect -DUSE_PLAIN_CONNECT && ./connect
0
$ uname -a
Linux carter-virtual-machine 5.4.0-999-generic #201911282213 SMP Fri Nov 29 03:17:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
io_uring currently only supports vectored reads and writes (except for the _fixed operations). While vectored reads and writes are in theory a superset of single reads and writes, the required indirection of the array of iovecs presents some problems.
In particular, I'm interested in creating a memory safe abstraction of io_uring's completion-based API in Rust. Realistically, the best way to do this is for the abstraction to have logical ownership of the buffers until the IO is complete. The naive solution would be to just always allocate an intermediate buffer, which would mean an extra allocation for every read or write operation. There are better solutions which avoid the allocation, but they can be tricky to implement.
It would be easier to create a safe API for unvectored read/write (the common case) if it were supported directly by the io_uring interface. Then the abstraction would only need to manage the lifetime of the actual buffer, and not the indirection array as well.
Calling io_uring_wait_cqe with nothing to wait for is a bug, but currently there is no (easy?) way to get the number of pending requests. We should detect the bug and return -EINVAL or something else.
We could also add io_uring_pending_requests to get the number of pending requests (counting completed requests still sitting in the CQ).
My toy program spins up two threads: it submits IOs from one thread and reaps completions from the other. I have set up liburing using the IORING_SETUP_SQPOLL flag.
Following is my code:
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "liburing.h"

#define DEVICE_SIZE (512ULL << 30)
#define BS 4096
#define QD 32

static struct io_uring ring;
static int dev_fd;

static void *setup_iov_base(size_t size)
{
    void *buf;
    int fd;

    if (posix_memalign(&buf, BS, size) != 0) {
        printf("mem aligned failed\n");
        return NULL;
    }
    fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) {
        printf("Failed to open urandom. rc=%d\n", fd);
        return NULL;
    }
    read(fd, buf, size);
    close(fd);
    return buf;
}

static int init(void)
{
    struct io_uring_params p = { 0 };
    time_t t;
    int rc;

    /* Implies no syscalls to submit IOs */
    p.flags = IORING_SETUP_SQPOLL;
    rc = io_uring_queue_init_params(QD, &ring, &p);
    if (rc != 0) {
        printf("queue_init failed. rc=%d\n", rc);
        return rc;
    }

    dev_fd = open("/dev/nvme1n1", O_RDWR | O_DIRECT);
    if (dev_fd < 0) {
        printf("Failed to open nvme device. rc=%d\n", dev_fd);
        return dev_fd;
    }

    /* SQPOLL only works with fixed files. */
    rc = io_uring_register_files(&ring, &dev_fd, 1);
    if (rc < 0) {
        printf("Failed to register files. rc=%d\n", rc);
        return rc;
    }
    srand((unsigned) time(&t));
    return 0;
}

static inline void submit_to_kernel(char *failure_message)
{
    int rc;

    rc = io_uring_submit(&ring);
    if (rc < 0) {
        printf("%s. rc=%d\n", failure_message, rc);
    }
}

static struct io_uring_sqe *get_sqe(int *yield)
{
    struct io_uring_sqe *sqe;

    while ((sqe = io_uring_get_sqe(&ring)) == NULL) {
        /* Kick kernel thread if it is taking a nap */
        submit_to_kernel("Failure to wake napping thread");
        *yield = *yield + 1;
        /* TODO: Use condition variables */
        pthread_yield();
    }
    return sqe;
}

static void *submit_io(void *input)
{
    char *buf = setup_iov_base(BS);
    off_t offset = 0;
    int total_ios = *((int *)input);
    int yield = 0;

    while (total_ios != 0) {
        struct io_uring_sqe *sqe = get_sqe(&yield);
        struct iovec iov = {
            .iov_base = buf,
            .iov_len = BS,
        };
        io_uring_prep_writev(sqe, 0, &iov, 1, offset);
        sqe->flags |= IOSQE_FIXED_FILE;
        sqe->user_data = offset;
        total_ios--;
        if (total_ios % QD == 0) {
            submit_to_kernel("Failed to submit new IO");
        }
        offset += BS;
    }
    printf("submit_io yield %d times\n", yield);
    return NULL;
}

static void *reap_io_completions(void *input)
{
    int total_ios = *((int *)input);
    int failed_ios = 0;

    while (total_ios != 0) {
        struct io_uring_cqe *cqe = NULL;

        /* This call blocks if no CQE entries are available */
        int rc = io_uring_wait_cqe(&ring, &cqe);
        if (rc < 0 || cqe->res != BS) {
            printf("rc=%d cqe->res=%d offset=%llu\n", rc, cqe->res, cqe->user_data);
            failed_ios++;
        }
        total_ios--;
        io_uring_cqe_seen(&ring, cqe);
    }
    printf("Failed IO count=%d\n", failed_ios);
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t submit, complete;
    int total_ios;
    int rc;

    if (argc != 2) {
        printf("Expected two arguments\n");
        return -EINVAL;
    }
    total_ios = atoi(argv[1]);

    rc = init();
    if (rc != 0) {
        return rc;
    }
    rc = pthread_create(&submit, NULL, submit_io, &total_ios);
    if (rc < 0) {
        printf("Failed to create submit thread. rc=%d\n", rc);
        return rc;
    }
    rc = pthread_create(&complete, NULL, reap_io_completions, &total_ios);
    if (rc < 0) {
        printf("Failed to create complete thread. rc=%d\n", rc);
        return rc;
    }
    pthread_join(submit, NULL);
    pthread_join(complete, NULL);

    io_uring_queue_exit(&ring);
    close(dev_fd);
    return 0;
}
Here is my output for multiple runs:
[root@ip-10-0-58-7 liburing]# ./examples/iouringthread 65
submit_io yield 74 times
rc=0 cqe->res=-14 offset=258048
rc=0 cqe->res=-14 offset=262144
Failed IO count=2
[root@ip-10-0-58-7 liburing]# ./examples/iouringthread 65
submit_io yield 194 times
rc=0 cqe->res=-14 offset=204800
rc=0 cqe->res=-14 offset=208896
rc=0 cqe->res=-14 offset=212992
rc=0 cqe->res=-14 offset=217088
rc=0 cqe->res=-14 offset=221184
rc=0 cqe->res=-14 offset=225280
rc=0 cqe->res=-14 offset=229376
rc=0 cqe->res=-14 offset=233472
rc=0 cqe->res=-14 offset=237568
rc=0 cqe->res=-14 offset=241664
rc=0 cqe->res=-14 offset=245760
rc=0 cqe->res=-14 offset=249856
rc=0 cqe->res=-14 offset=253952
rc=0 cqe->res=-14 offset=258048
rc=0 cqe->res=-14 offset=262144
Failed IO count=15
[root@ip-10-0-58-7 liburing]# ./examples/iouringthread 65
submit_io yield 69 times
rc=0 cqe->res=-14 offset=196608
rc=0 cqe->res=-14 offset=200704
rc=0 cqe->res=-14 offset=204800
rc=0 cqe->res=-14 offset=208896
rc=0 cqe->res=-14 offset=212992
rc=0 cqe->res=-14 offset=217088
rc=0 cqe->res=-14 offset=221184
rc=0 cqe->res=-14 offset=225280
rc=0 cqe->res=-14 offset=229376
rc=0 cqe->res=-14 offset=233472
rc=0 cqe->res=-14 offset=237568
rc=0 cqe->res=-14 offset=241664
rc=0 cqe->res=-14 offset=245760
rc=0 cqe->res=-14 offset=249856
rc=0 cqe->res=-14 offset=253952
rc=0 cqe->res=-14 offset=258048
rc=0 cqe->res=-14 offset=262144
Failed IO count=17
If I update the code to not use SQPOLL it works just fine.
liburing commit ID - a68caac
Timed waiting is commonly used and is a necessary feature for replacing epoll_wait (for fewer syscalls and better integration with io_uring file AIO).
AFAIK the old Linux AIO does support timed waits. Please consider adding timed-wait support to io_uring too.
#include <unistd.h>
#include <sys/signalfd.h>
#include <sys/poll.h>
#include <liburing.h>

int main() {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    sigprocmask(SIG_BLOCK, &mask, NULL);
    int sfd = signalfd(-1, &mask, SFD_NONBLOCK);

    struct io_uring ring;
    io_uring_queue_init(32, &ring, 0);
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, sfd, POLLIN);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(sfd);
    return 0;
}
Ctrl+C should terminate the program, but it doesn't. Similar code works with epoll: https://gist.github.com/CarterLi/b8db2fcfea689b96eeae382c38130afb
Linux Ubuntu 5.3.0-10-generic #11-Ubuntu SMP Mon Sep 9 15:12:17 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Currently we forbid flags for IORING_OP_TIMEOUT:
https://github.com/torvalds/linux/blob/63de37476ebd1e9bab6a9e17186dc5aa1da9ea99/fs/io_uring.c#L2456
I think that's reasonable for io_uring_wait_cqe_timeout. But for a pure timeout (i.e. REQ_F_TIMEOUT_NOSEQ), this operation should behave like other operations and should allow the common sqe flags.
This is a valid usage:
And this should be valid too: