ixy's Introduction

ixy - a userspace network driver in 1000 lines of code

ixy is a simple userspace packet processing framework. It takes exclusive control of a network adapter and implements the whole driver in userspace. Its architecture is similar to DPDK and Snabb and completely different from (seemingly similar) frameworks such as netmap, pfq, pf_ring, or XDP (all of which rely on kernel components). In fact, reading both DPDK and Snabb drivers was crucial to understand some parts of the Intel 82599 datasheet better.

Check out our research paper "User Space Network Drivers" [BibTeX] or watch the recording of our talk at 34C3 to learn more.

ixy is designed for educational purposes to learn how a network card works at the driver level. The lack of kernel code and external libraries allows you to read through the whole code, from startup down to the lowest level of the driver. Low-level functions like handling DMA descriptors are rarely more than a single function call away from your application logic.

A whole ixy app, including the whole driver, is only ~1000 lines of C code. Check out the ixy-fwd and ixy-pktgen example apps and look through the code. The code often references sections in the Intel 82599 datasheet or the VirtIO specification, so keep them open while reading the code. You will be surprised how simple a full driver for a network card can be.

Don't like C? We also have implementations in other languages (Rust, Go, C#, Java, OCaml, Haskell, Swift, JavaScript, and Python).

Features

  • Driver for Intel NICs in the ixgbe family, i.e., the 82599ES family (aka Intel X520)
  • Driver for paravirtualized virtio NICs
  • Less than 1000 lines of C code for a packet forwarder including the whole driver (w/o virtio and VFIO support, see minimal branch)
  • No kernel modules needed (except vfio-pci when using the IOMMU / VFIO)
  • Can run without root privileges (when using the IOMMU / VFIO)
  • IOMMU support (see Using the IOMMU / VFIO)
  • Interrupt support (when using VFIO)
  • Simple API with memory management, similar to DPDK, easier to use than APIs based on a ring interface (e.g., netmap)
  • Support for multiple device queues and multiple threads
  • Super fast, can forward > 25 million packets per second on a single 3.0 GHz CPU core
  • Super simple to use (when not using VFIO): no dependencies, no annoying drivers to load, bind, or manage - see step-by-step tutorial below
  • BSD license

Supported hardware

Tested on an Intel 82599ES (aka Intel X520), X540, and X550. Might not work on all variants of these NICs because our link setup code is a little bit dodgy.

How does it work?

Check out our research paper "User Space Network Drivers" [BibTeX] for a detailed evaluation.

If you prefer to dive into the code: Start by reading the apps in src/app then follow the function calls into the driver. The comments in the code refer to the Intel 82599 datasheet (Revision 3.3, March 2016).

Compiling ixy and running the examples

Caution

Your NIC has full DMA access to your memory. A misconfigured NIC will cause memory corruption that might crash your server or even destroy your filesystem. Do not run this on any system that has anything remotely important on it if you want to modify the driver. Our version is also not necessarily safe and might be buggy. You have been warned.

Running ixy will unbind the driver of the given PCIe device without checking if it is in use. This means the NIC will disappear from the system. Do not run this on NICs that you need. We currently have a simple check if the device is actually a NIC, but trying to use another device could crash your system.

  1. Install the following dependencies

    • gcc >= 4.8
    • make
    • cmake

    Run this on Debian/Ubuntu to install them:

    sudo apt-get install -y build-essential cmake
    
  2. Configure 2MB hugepages in /mnt/huge using our script:

    cd ixy
    sudo ./setup-hugetlbfs.sh
    
  3. Run cmake and make

    cmake .
    make
    
  4. That's it! You can now run the included examples, e.g.:

    sudo ./ixy-pktgen 0000:XX:YY.Z
    

    Replace the PCI address as needed. All examples expect fully qualified PCIe bus addresses, i.e., typically prefixed with 0000:, as arguments. You can use lspci from the pciutils (Debian/Ubuntu) package to find the bus address. For example, lspci shows my 82599ES NIC as

    03:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

    which means that I have to pass 0000:03:00.0 as parameter to use it.

Using the IOMMU / VFIO

The usage of the IOMMU via the vfio-pci driver is implemented for ixgbe devices (Intel X520, X540, and X550). Using VFIO will also enable interrupt support. To use it, you have to:

  1. Enable the IOMMU in the BIOS. On most Intel machines, the BIOS entry is called VT-d and has to be enabled in addition to any other virtualization technique.

  2. Enable the IOMMU in the Linux kernel. Add intel_iommu=on to your kernel cmdline (if you are using GRUB, add it to the GRUB_CMDLINE_LINUX line in /etc/default/grub and re-run update-grub).

  3. Get the PCI address, vendor and device ID: lspci -nn | grep Ether returns something like 05:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 [8086:1528] (rev 01). In this case, 0000:05:00.0 is our PCI Address, and 8086 and 1528 are the vendor and device id, respectively.

  4. Unbind the device from the ixgbe driver. echo $PCI_ADDRESS > /sys/bus/pci/devices/$PCI_ADDRESS/driver/unbind

  5. Enable the vfio-pci driver. modprobe vfio-pci

  6. Bind the device to the vfio-pci driver. echo $VENDOR_ID $DEVICE_ID > /sys/bus/pci/drivers/vfio-pci/new_id

  7. Chown the device to the user. chown $USER:$GROUP /dev/vfio/*

  8. That's it! Now you can compile and run ixy as stated above!

Wish list

We don't plan to implement every single feature, but a few more things would be nice to have. The list is in no particular order.

Implement at least one other driver beside ixgbe and VirtIO

NICs that rely too much on firmware (e.g., Intel XL710) are not fun, because you end up only talking to a firmware that does everything. The same is true for NICs like the ones by Mellanox that keep a lot of magic in kernel modules, even when being used by frameworks like DPDK.

Interesting candidates would be NICs from the Intel igb and e1000e families, as they are quite common and reasonably cheap.

Better NUMA support

PCIe devices are attached to a specific CPU in NUMA systems. DMA memory should be pinned to the correct NUMA node. Threads handling packet reception should also be pinned to the same NUMA node.

NUMA handling must currently be done via numactl outside of ixy. Implementing it within ixy is annoying without depending on libnuma, so it's not implemented here.

RSS support

What's the point of having multiple rx queues if there is no good way to distribute the traffic to them? Shouldn't be too hard. See Section 7.1.2.8 in the datasheet.
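For reference, the hash used here is a Toeplitz hash over selected header fields, and the low bits of the result index a redirection table that maps to a queue. Below is a naive bit-by-bit sketch of the hash; the function name and layout are ours, not from ixy, and a real implementation would precompute per-byte lookup tables instead of looping over bits:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// naive bit-by-bit Toeplitz hash: for every set bit i of the input, XOR in the
// 32-bit window of the secret key starting at key bit i.
// key must be at least datalen + 4 bytes long.
static uint32_t toeplitz_hash(const uint8_t* key, const uint8_t* data, size_t datalen) {
	uint32_t hash = 0;
	for (size_t i = 0; i < datalen * 8; i++) {
		if (data[i / 8] & (0x80u >> (i % 8))) {
			uint32_t window = 0;
			for (size_t b = 0; b < 32; b++) {
				size_t kbit = i + b;
				if (key[kbit / 8] & (0x80u >> (kbit % 8))) {
					window |= 1u << (31 - b);
				}
			}
			hash ^= window;
		}
	}
	return hash;
}
```

The receive queue for a packet would then be something like `reta[hash & 0x7f]` for a 128-entry redirection table.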

tcpdump-like example

A simple rx-only app that writes packets to a .pcap file based on mmap and fallocate. Most of the code can be re-used from libmoon's pcap.lua.

Multi-threaded mempools

A mempool can currently only be accessed by a single thread, i.e., a packet must be allocated and freed by the same thread. This prevents apps that implement different processing steps on different cores, a common multi-threading model for complex chains.

The limiting factor is the list of free buffers in the mempool, which is a simple stack at the moment. Replacing it with a lock-free stack or queue would make it safe for use by multiple threads.

The good news is that multi-threaded mempools are essentially the same problem as passing packets between threads: both can be implemented with the same queue/ring buffer data structure.
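As a sketch of what such a shared structure could look like, here is a minimal single-producer/single-consumer ring buffer built on C11 atomics. Nothing here is in ixy today, and names like `spsc_ring` are hypothetical; a multi-producer variant would need compare-and-swap on the indices:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256  // power of two so we can mask instead of using modulo

struct spsc_ring {
	void* slots[RING_SIZE];
	_Atomic size_t head;  // written only by the producer
	_Atomic size_t tail;  // written only by the consumer
};

// enqueue one buffer pointer; returns false if the ring is full
static bool ring_push(struct spsc_ring* r, void* ptr) {
	size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	if (head - tail == RING_SIZE) {
		return false;
	}
	r->slots[head & (RING_SIZE - 1)] = ptr;
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

// dequeue one buffer pointer; returns NULL if the ring is empty
static void* ring_pop(struct spsc_ring* r) {
	size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
	if (head == tail) {
		return NULL;
	}
	void* ptr = r->slots[tail & (RING_SIZE - 1)];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return ptr;
}
```

One such ring per thread pair is enough both for handing packets down a pipeline and for returning free buffers to the owning mempool.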

FAQ

Why C and not a more reasonable language?

It's the lowest common denominator that everyone should be able to understand -- this is for educational purposes only. I've taken care to keep the code simple and understandable.

Check out our implementations in other languages (Rust, Go, C#, Java, OCaml, Haskell, Swift, JavaScript, and Python).

I can't get line rate :(

There's a weird problem on some systems that causes it to slow down if the CPU is too fast. DPDK had the same problem in the past. Try applying bidirectional traffic to the forwarder and/or underclock your CPU to speed up ixy.

It's more than ~1000 lines! There is a huge ixgbe_type.h file.

ixgbe_type.h is copied from the Intel driver, it's only used as a machine-readable version of the datasheet. ixy only uses #define definitions for registers and the two relatively simple structs for the DMA descriptors. Overall, ixy uses less than 100 lines of the file and we could remove the remainder.

But it's nice to have all the struct definitions right there when implementing a new driver feature. Copy & pasting magic values from the datasheet is significantly less fun. Another interesting approach to making these values available is writing a parser for the tables in the datasheet. Snabb does this.

Should I use this for my production app?

No.

When should I use this?

To understand how a NIC works and to understand how a framework like DPDK or Snabb works.

ixy's People

Contributors

ackxolotl, bobo1239, bonkf, emmericp, huberste, maybeanerd, mmisono, pudelkom, tzwickl, werekraken


ixy's Issues

Broken loop

ixy/src/libixy-vfio.c

Lines 153 to 165 in e04d2d1

for (int i = VFIO_PCI_MSIX_IRQ_INDEX; i >= 0; i--) {
	struct vfio_irq_info irq = {.argsz = sizeof(irq), .index = i};
	check_err(ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq), "get IRQ Info");
	/* if this vector cannot be used with eventfd continue with next*/
	if ((irq.flags & VFIO_IRQ_INFO_EVENTFD) == 0) {
		error("IRQ doesn't support Event FD");
		continue;
	}
	return i;
}

The loop is executed at most one time since error() calls abort().

Update README

After the recent changes and additions the readme should be updated.

run ixy-pktgen on virtualbox virtio err: Device does not support required features

i run virtualbox 6.0 on macos 10.14.6
guest os is ubuntu 16.04.

guest network device is set to Paravirtualized network adapter (virtio-net), as documented in https://www.virtualbox.org/manual/ch06.html

when i run app:
    /home/zzlu/Downloads/uio_test/ixy/cmake-build-debug-remote/ixy-pktgen 0000:00:03.0
    [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/pci.c:58 pci_open_resource(): Opening PCI resource at /sys/bus/pci/devices/0000:00:03.0/config
    [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/pci.c:58 pci_open_resource(): Opening PCI resource at /sys/bus/pci/devices/0000:00:03.0/config
    [INFO ] /home/zzlu/Downloads/uio_test/ixy/src/driver/virtio.c:352 virtio_init(): Detected virtio legacy network card
    [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/pci.c:58 pci_open_resource(): Opening PCI resource at /sys/bus/pci/devices/0000:00:03.0/resource0
    [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/driver/virtio.c:275 virtio_legacy_init(): Configuring bar0
    [DEBUG] /home/zzlu/Downloads/uio_test/ixy/src/driver/virtio.c:284 virtio_legacy_init(): Host features: 410fdda3
    [ERROR] /home/zzlu/Downloads/uio_test/ixy/src/driver/virtio.c:289 virtio_legacy_init(): Device does not support required features

I tracked this error down in the code: my host features lack VIRTIO_F_ANY_LAYOUT.
How can I fix this?

VIRTIO-PCI: failed to open /sys/bus/pci/devices/<BDF>/resource0

  1. There is no resource0, only resource, in my Ubuntu 20.04 VM
     • So I changed resource0 in virtio.c to resource.
  2. /sys/bus/pci/devices/<BDF>/resource is read-only, so ixy-pktgen failed with permission denied
     • I added write permissions to /sys/bus/pci/devices/<BDF>/resource (I also tried 777)
  3. Then it can open the resource, but it fails at device.h:154 write_io8(): pwrite io resource after virtio_legacy_init(): Configuring bar0

Is there something wrong with my usage? Can anyone help?

SEGFAULT when using more than 1 queue

We have an issue when using more than 1 queue on ixgbe

Stack trace:
1 pkt_buf_free
2 ixgbe_tx_batch

Both queues are processed in a single thread; we tried using a single mempool as well as a mempool per queue. The result is the same - the segfault occurs with any queue other than #0.

Thoughts on driver wrapping

Universal structure contained in every driver and exposed to the user:

struct ixy_device {
    char* driver_name;
    uint16_t (*tx_batch)(struct ixy_device*, struct pkt_buf*, uint16_t);
    uint16_t (*rx_batch)(struct ixy_device*, struct pkt_buf*, uint16_t);
};

Specific driver implementations include this struct somewhere:

struct virtio_device {
    int fd;
    struct virt_queue* rx, *tx, *ctrl;

    struct ixy_device ixy;
};

In the public API we expose two functions, which just forward the calls to the fn pointers in the struct:

uint16_t ixy_rx_batch(struct ixy_device* dev, struct pkt_buf* bufs, uint16_t num_bufs);
uint16_t ixy_tx_batch(struct ixy_device* dev, struct pkt_buf* bufs, uint16_t num_bufs);

Which of course point to the appropriate driver implementations, so the following is always correct code:

#define IXY_TO_VIRT(dev) container_of(dev, struct virtio_device, ixy)

static uint16_t virtio_rx_batch(struct ixy_device* dev, ...) {
    struct virtio_device* virt = IXY_TO_VIRT(dev);
    ...
}

This unwrapping is only needed on the semi-public entry functions to a driver. Internally it can just pass its struct around.
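For readers unfamiliar with the pattern: container_of recovers a pointer to the enclosing struct from a pointer to one of its embedded members via offsetof. A minimal self-contained version (struct fields trimmed down from the ones above):

```c
#include <assert.h>
#include <stddef.h>

// classic container_of: subtract the member's offset from the member's address
// to get back the address of the enclosing struct
#define container_of(ptr, type, member) \
	((type*)((char*)(ptr) - offsetof(type, member)))

struct ixy_device {
	const char* driver_name;
};

struct virtio_device {
	int fd;
	struct ixy_device ixy;  // embedded generic part
};

#define IXY_TO_VIRT(dev) container_of(dev, struct virtio_device, ixy)
```

Because the arithmetic is a compile-time constant offset, the "unwrapping" is free at runtime; the only real cost of the scheme is the indirect call through the function pointer.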

Pros

  • Unified user-interface, an ixy-fwd app works with every driver
  • Very common practice

Cons

  • One level of pointer indirection; Performance impact?
  • Too much magic?
  • Even required for just two drivers?

Implement self-test

Tests are important! And also difficult when you have hardware dependencies :(

We can implement a simple full system test by adding another example application: ixy-dump which dumps packets to a pcap file.

We can then use all three example applications together for a system test

  • send packets with ixy-pktgen (and let's add a sequence number generator here)
  • forward them with ixy-fwd
  • dump them to a pcap with ixy-dump

Then we can just check if the generated pcap file contains packets with increasing sequence numbers. This can be run on a single server with four connected interfaces.

The important part here is that this tests the actual applications, because there is nothing worse than having examples that just don't work because no one ever tests them.

great 34C3 talk

thanks for your work and talk. i loved the fast speaking style - keeps me focused 👍 (i hope you'll get more time at the next congress(es)!)

Thoughts on NUMA

NUMA is really important for performance. There are two things to consider: thread-pinning and memory-pinning. Thread pinning is trivial and can be done with the usual affinity mask. The best way to pin memory is by linking against libnuma.
A dependency, eeww. But a simple dependency (just a wrapper for a few syscalls) that I'd see on a level with libpthread; a necessary evil.

Let's look at a forwarding application on a NUMA system with NICs connected to both CPUs.
It will typically have at least one thread per NIC that handles incoming packets and forwards them somewhere. It might need to cross a NUMA-boundary to do so.
In our experience, it's most efficient to pin both the thread and packet memory to the CPU node to which NIC receiving packets is connected. Sending from the wrong node is not as bad as receiving to the wrong node. Also, we (usually) can't know where to send the packets when receiving them, so we can't pin the memory correctly for that.

How to implement this?

  • read numa_node in NIC's sysfs directory to figure out where it's connected to
  • use libnuma to set a memory policy before allocating memory for it
  • pin the thread correctly
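The thread-pinning part really is just a few lines with the standard Linux affinity API; memory pinning is the part that would pull in libnuma. A sketch of the former (pin_to_cpu is a hypothetical helper; a NUMA-aware app would pick a CPU from the node reported by the NIC's numa_node file in sysfs):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

// pin the calling thread to a single CPU core
static int pin_to_cpu(int cpu) {
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);  // 0 = current thread
}
```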

Sounds easy, right?
But is it worth implementing it? What do we gain beside added complexity?
Sure, this is obviously a must-have feature for a real-world high-performance driver.

But we've decided against implementing it for now.
Almost everyone will just look at the code and that NUMA stuff is not particularly interesting compared to the rest and it just adds noise.

That doesn't mean you can't use ixy on a NUMA system.
We obviously want to run some benchmarks and performance tests with different NUMA scenarios and we are just going to use the numactl command for that:

 numactl --strict --membind=0 --cpunodebind=0 ./ixy-pktgen <id> <id>

That works just fine with the current memory allocator and allows us to benchmark all relevant scenarios on a NUMA system with NICs attached to both nodes.

mmap can be used for allocating such memory regions for DMA

mmap can be used for allocating such memory regions for DMA.

mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_LOCKED, -1, 0);

This kind of allocation is done in programs that need specific properties on allocated memory blocks as well (e.g. Wine).

Thoughts on memory pools

Memory pools should be

  • fast
  • simple
  • multi-threaded
  • allow bulk alloc/free

Currently they are only fast and simple. Let's see if we can get the other two properties as well without losing speed and simplicity.

A memory pool is just a memory allocator for a fixed number of fixed-size buffers. So its core is some data structure that keeps track of a fixed number of pointers. Reasonable data structures for that are ring buffers and stacks. Currently it's a stack.

Implementing bulk allow/free for that stack is a trivial change. But multi-threaded fast stacks (especially with bulk operations) are extremely difficult (they are a standard example for lock-free data structures suffering from the ABA bug...). That would go against our design goal of keeping it simple...

So let's use a queue? Multi-threaded queues are relatively simple. However, queues suffer from a different problem: poor temporal data locality, because they cycle through all the buffers -- a stack re-uses buffers that you just recently used.
The fix for that is per-thread caches in the memory pool. There are two problems with that: it's probably no longer simple, and it doesn't work with all use cases. One scenario where this fails is a pipeline application where packets aren't sent out on the same thread that received them. And that is the primary motivation why we wanted a multi-threaded memory pool in the first place.

An interesting reference is DPDK which defaults to a queue-based memory pool with thread-local caches. And it has the same problem. Their solution for pipeline-based applications is a stack with a spin lock: http://dpdk.org/ml/archives/dev/2016-July/043106.html

Why not use some library? We want to keep ixy free of external dependencies, you should be able to go through the whole code and understand all parts, including the core data structures.

What are we going to do? Probably a stack with a spin lock. And benchmarking! Probably queue vs. stack, single-threaded stack vs. multi-threaded stack in an uncontended scenario, and some realistic contention setup. NUMA?
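A minimal version of that spin-locked stack with C11 atomics could look like this (a sketch, not what ixy currently contains; bulk alloc/free would simply loop over several buffers under a single lock acquisition):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define STACK_CAP 1024

// fixed-capacity stack of free buffer pointers, protected by a spin lock
struct locked_stack {
	atomic_flag lock;
	size_t top;
	void* bufs[STACK_CAP];
};

static bool stack_push(struct locked_stack* s, void* buf) {
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;  // spin until we hold the lock
	bool ok = s->top < STACK_CAP;
	if (ok) {
		s->bufs[s->top++] = buf;
	}
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
	return ok;
}

static void* stack_pop(struct locked_stack* s) {
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;
	void* buf = s->top > 0 ? s->bufs[--s->top] : NULL;
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
	return buf;
}
```

The lock keeps the LIFO behavior (and hence the cache locality) of the current single-threaded stack while making concurrent access safe; whether the spinning costs more than a lock-free queue would is exactly what the proposed benchmarks should answer.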

ppms function doesn't calculate packets per microsecond

/**
 * Calculate packets per microsecond based on the received number of packets and the elapsed time in nanoseconds since the
 * last calculation.
 * @param received_pkts Number of received packets.
 * @param elapsed_time_nanos Time elapsed in nanoseconds since the last calculation.
 * @return Packets per microsecond.
 */
static uint64_t ppms(uint64_t received_pkts, uint64_t elapsed_time_nanos) {
	return received_pkts / (elapsed_time_nanos / 1000000);
}

micro = 10 ^ -6
nano = 10 ^ -9

By using (elapsed_time_nanos / 1000000), this function calculates the amount of received packets per millisecond, so I don't know if the comments are wrong micro -> milli or the divisor is wrong 1000000 -> 1000.
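Indeed: dividing nanoseconds by 1000000 yields milliseconds, so the code and the doc comment disagree. A version matching the documented unit would divide by 1000 (ppus is a hypothetical name; the zero check guards against intervals shorter than a microsecond, which would otherwise divide by zero):

```c
#include <assert.h>
#include <stdint.h>

// packets per microsecond: 1 us = 1000 ns, so convert the elapsed nanoseconds
// to microseconds by dividing by 1000
static uint64_t ppus(uint64_t received_pkts, uint64_t elapsed_time_nanos) {
	uint64_t elapsed_micros = elapsed_time_nanos / 1000;
	return elapsed_micros ? received_pkts / elapsed_micros : 0;
}
```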
