
sysbox's Introduction

sysbox


Introduction

Sysbox is an open-source and free container runtime (a specialized "runc"), originally developed by Nestybox (acquired by Docker in May 2022), that enhances containers in two key ways:

  • Improves container isolation:

    • Linux user-namespace on all containers (i.e., root user in the container has zero privileges on the host).

    • Virtualizes portions of procfs & sysfs inside the container.

    • Hides host info inside the container.

    • Locks the container's initial mounts, and more.

  • Enables containers to run the same workloads as VMs:

    • With Sysbox, containers can run system-level software such as systemd, Docker, Kubernetes, K3s, buildx, legacy apps, and more seamlessly & securely.

    • This software can run inside Sysbox containers without modification and without using special versions of the software (e.g., rootless variants).

    • No privileged containers, no complex images, no tricky entrypoints, no special volume mounts, etc.

Think of it as a "container supercharger": it enables your existing container managers / orchestrators (e.g., Docker, Kubernetes, etc.) to deploy containers that have hardened isolation and can run almost any workload that runs in VMs.

Sysbox does this by making the container resemble a VM-like environment as much as possible, using advanced OS virtualization techniques.

Unlike alternative runtimes such as Kata and KubeVirt, it does not use VMs. This makes it easier to use (particularly in cloud environments, since it avoids nested virtualization), although it does not provide the level of isolation that VM-based runtimes do. See here for a comparison.

There is no need to learn new tools or modify your existing container images or workflows to take advantage of Sysbox. Simply install it and point your container manager / orchestrator to it to deploy enhanced containers.

Sysbox can live side-by-side with other container runtimes on the same host (e.g., the default OCI runc, Kata, etc.). You can easily choose which containers or pods to run with each, depending on your needs.
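
With Docker, for example, running Sysbox alongside the default runc boils down to having the sysbox-runc runtime registered in /etc/docker/daemon.json (the Sysbox installer typically does this for you) and selecting it per container via --runtime. A minimal sketch:

{
  "runtimes": {
    "sysbox-runc": {
      "path": "/usr/local/sbin/sysbox-runc"
    }
  }
}

Containers launched without --runtime=sysbox-runc continue to use the default runtime.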

Demo Videos

Contents

License

Sysbox is free and open-source, licensed under the Apache License, Version 2.0. See the LICENSE file for details.

Relationship to Nestybox & Docker

Sysbox was originally developed by Nestybox. As Nestybox is now part of Docker, Docker is the main sponsor of the Sysbox project.

Having said this, Sysbox is a community open-source project and it's not officially supported by Docker (i.e., Docker subscriptions do not include Sysbox support). Support is provided on a best-effort basis via this GitHub repo or via the Sysbox Slack Workspace.

We encourage participation from the community to help evolve and improve Sysbox, with the goal of increasing the use cases and benefits it enables. External maintainers and contributors are welcomed.

Motivation

Sysbox solves problems such as:

  • Enhancing the isolation of containerized microservices (root in the container maps to an unprivileged user on the host).

  • Enabling a highly capable root user inside the container without compromising host security.

  • Securing CI/CD pipelines by enabling Docker-in-Docker (DinD) or Kubernetes-in-Docker (KinD) without insecure privileged containers or host Docker socket mounts (see the example after this list).

  • Enabling the use of containers as "VM-like" environments for development, local testing, learning, etc., with strong isolation and the ability to run systemd, Docker, IDEs, and more inside the container.

  • Running legacy apps inside containers (instead of less efficient VMs).

  • Replacing VMs with an easier, faster, more efficient, and more portable container-based alternative, one that can be deployed across cloud environments easily.

  • Partitioning bare-metal hosts into multiple isolated compute environments with 2X the density of VMs (i.e., deploy twice as many VM-like containers as VMs on the same hardware at the same performance).

  • Partitioning cloud instances (e.g., EC2, GCP, etc.) into multiple isolated compute environments without resorting to expensive nested virtualization.
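
As an example of the CI/CD use case above, here is a minimal Docker-in-Docker sketch using one of the Nestybox reference images described later in this README; note that no --privileged flag or host Docker socket mount is needed:

$ docker run --runtime=sysbox-runc -it --rm nestybox/ubuntu-focal-systemd-docker
# inside the system container, the inner Docker daemon is started by systemd:
root@my_cont:/# docker run -it --rm alpine echo "inner container works"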

How it Works

[image: how Sysbox works]

Sysbox installs easily on Linux hosts (bare-metal, VM, on-prem, cloud, etc.). It works on all major cloud-based IaaS and Kubernetes services (e.g., EC2, GCP, GKE, EKS, AKS, Rancher, etc.).

Once installed, Sysbox works under the covers: you use Docker, Kubernetes, etc. to deploy containers with it.

For example, this simple Docker command creates a container with Sysbox:

$ docker run --runtime=sysbox-runc -it any_image

You get a well-isolated container capable of seamlessly running microservices as well as system-level software that normally runs on VMs (e.g., systemd, Docker, Kubernetes, etc.).
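
A quick way to observe the hardened isolation (a sketch; the exact host ID range will vary per container) is to print the user-namespace mapping from inside the container:

$ docker run --runtime=sysbox-runc --rm alpine cat /proc/self/uid_map
         0     165536      65536

The first column is the container-side user ID (0 = root), and the second is the unprivileged host ID it maps to.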

More on how to use Sysbox here.

Comparison to Related Technologies

[image: comparison of Sysbox to related container runtimes]

As shown, Sysbox enables unprivileged containers to run system-level workloads such as systemd, Docker, Kubernetes, etc., seamlessly, while giving you a balanced approach between container isolation, performance, efficiency, and portability.

And it does this with minimal configuration changes to your existing infra: just install Sysbox and configure your container manager/orchestrator to launch containers with it, using the image of your choice.

Note that while Sysbox hardens the isolation of standard containers and obviates the need for insecure privileged containers in many scenarios, it does not (yet) provide the same level of isolation as VM-based alternatives or user-space OSes like gVisor. Therefore, for scenarios where the highest level of isolation is required, alternatives such as KubeVirt may be preferable (at the expense of lower performance and efficiency and higher complexity and cost).

See this blog post for more.

Audience

The Sysbox project is intended for anyone looking to experiment, invent, learn, and build systems using system containers. It's cutting-edge OS virtualization, and contributions are welcomed.

Sysbox Enterprise Edition [DEPRECATED]

Prior to the acquisition by Docker on 05/2022, Nestybox offered Sysbox Enterprise as an enhanced version of Sysbox (e.g., more security, more workloads, and official support).

After the acquisition, however, Sysbox Enterprise is no longer offered as a standalone product; it has instead been incorporated into Docker Desktop (see Docker Hardened Desktop).

NOTE: As Sysbox Enterprise is no longer offered as a standalone product, Docker plans to make some Sysbox Enterprise features available in Sysbox Community Edition. The features are TBD and your feedback on this is welcome.

Sysbox Features

The table below summarizes the key features of the Sysbox container runtime.

It also provides a comparison between the Sysbox Community Edition (i.e., this repo) and the previously available Sysbox Enterprise Edition (now deprecated).

[image: Sysbox feature comparison table]

More on the Sysbox features here.

If you have questions, you can reach us here.

System Containers

We call the containers deployed by Sysbox system containers, to highlight the fact that they can run not just micro-services (as regular containers do), but also system software such as Docker, Kubernetes, Systemd, inner containers, etc.

More on system containers here.

Installation

Host Requirements

The Sysbox host must meet the following requirements:

  • It must be running one of the supported Linux distros and be a machine with a supported architecture (e.g., amd64, arm64).

  • We recommend a minimum of 4 CPUs (e.g., 2 cores with 2 hyperthreads) and 4GB of RAM. Though this is not a hard requirement, smaller configurations may slow down Sysbox.
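
A quick way to check the recommended sizing above (standard Linux tools, nothing Sysbox-specific):

$ nproc     # CPU count; 4 or more recommended
$ free -h   # total RAM; 4GB or more recommended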

Installing Sysbox

The method of installation depends on the environment where Sysbox will be installed:

Using Sysbox

Once Sysbox is installed, you create a container using your container manager or orchestrator (e.g., Docker or Kubernetes) and an image of your choice.

Docker command example:

$ docker run --runtime=sysbox-runc --rm -it --hostname my_cont registry.nestybox.com/nestybox/ubuntu-bionic-systemd-docker
root@my_cont:/#

Kubernetes pod spec example:

apiVersion: v1
kind: Pod
metadata:
  name: ubu-bio-systemd-docker
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  runtimeClassName: sysbox-runc
  containers:
  - name: ubu-bio-systemd-docker
    image: registry.nestybox.com/nestybox/ubuntu-bionic-systemd-docker
    command: ["/sbin/init"]
  restartPolicy: Never
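
The pod spec above assumes a RuntimeClass named sysbox-runc has been defined on the cluster. A minimal sketch of such an object (the exact setup depends on how Sysbox was installed on the nodes):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sysbox-runc
handler: sysbox-runc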

You can choose whatever container image you want; Sysbox places no requirements on the image.

Nestybox makes several reference images available in its Docker Hub and GitHub Container Registry repos. These images typically include systemd, Docker, Kubernetes, and more inside the containers. The Dockerfiles are here. Feel free to use and modify them per your needs.

Documentation

We strive to provide good documentation; it's a key component of the Sysbox project.

We have several documents to help you get started and get the best out of Sysbox.

Performance

Sysbox is fast and efficient, as described in this Nestybox blog post.

The containers created by Sysbox have similar performance to those created by the OCI runc (the default runtime for Docker and Kubernetes).

Even containers deployed inside the system containers have excellent performance, though there is a slight overhead for network IO (as expected, since packets emitted by inner containers go through an additional network interface / bridge inside the system container).

Now, if you use Sysbox to deploy system containers that replace VMs, then the performance and efficiency gains are significant: you can deploy 2X as many system containers as VMs on the same server and get the same performance, and do this with a fraction of the memory and storage consumption. The blog post referenced above has more on this.

Under the Covers

Sysbox was forked from the excellent OCI runc in early 2019 and it stands on the shoulders of the work done by the OCI runc developers.

Having said this, Sysbox adds significant functionality on top. It's written in Go, and it is currently composed of three components: sysbox-runc, sysbox-fs, and sysbox-mgr.

Sysbox uses many OS-virtualization features of the Linux kernel and complements these with OS-virtualization techniques implemented in user-space. These include using all Linux namespaces (in particular the user-namespace), partial virtualization of procfs and sysfs, selective syscall trapping, and more.
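
The partial procfs virtualization is easy to observe. For example (a sketch; output values will vary), /proc/uptime inside a Sysbox container reflects the container's uptime rather than the host's:

$ docker run --runtime=sysbox-runc --rm alpine cat /proc/uptime
0.25 0.22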

More on Sysbox's design can be found in the Sysbox user guide.

Sysbox does not use hardware virtualization

Though the containers generated by Sysbox resemble virtual machines in some ways (e.g., you can run as root, run multiple services, and deploy Docker and K8s inside), Sysbox does not use hardware virtualization.

Sysbox is a pure OS-virtualization technology meant to create containers that can run applications as well as system-level software, easily and securely.

This makes the containers created by Sysbox fast, efficient, and portable (i.e., they aren't tied to a hypervisor).

Isolation-wise, it's fair to say that Sysbox containers provide stronger isolation than regular Docker containers (by virtue of using the Linux user-namespace and a lightweight OS shim), but weaker isolation than VMs (by sharing the Linux kernel among containers).

Contributing

We welcome contributions to Sysbox, whether they are small documentation changes, bug fixes, or feature additions. Please see the contribution guidelines and developer's guide for more info.

Security

See the User Guide's Security Chapter for info on how Sysbox secures containers.

If you find bugs or issues that may expose a Sysbox vulnerability, please report these by sending an email to [email protected]. Please do not open security issues in this repo. Thanks!

In addition, a few vulnerabilities have recently been found in the Linux kernel that in some cases reduce or negate the enhanced isolation provided by Sysbox containers. Fortunately they are all fixed in recent Linux kernels. See the Sysbox User Guide's Vulnerabilities & CVEs chapter for more info, and reach out on the Sysbox Slack channel for further questions.

Troubleshooting & Support

Support is currently offered on a best-effort basis.

If you have a question or comment, we'd love to hear it. You can reach us on our Slack channel or file an issue on this GitHub repo.

If you spot a problem with Sysbox, please search the existing issues, as they may describe the problem and provide a workaround.

Check also the Troubleshooting document.

Uninstallation

Prior to uninstalling Sysbox, make sure all containers deployed with it are stopped and removed.
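
With Docker, for example, you can find and remove any remaining Sysbox containers with something along these lines (a sketch):

$ for id in $(docker ps -aq); do
>   runtime=$(docker inspect --format '{{.HostConfig.Runtime}}' "$id")
>   [ "$runtime" = "sysbox-runc" ] && docker rm -f "$id"
> done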

The method of uninstallation depends on the method used to install Sysbox:

Roadmap

The following is a list of features in the Sysbox roadmap.

We list these here so that our users can get a better idea of where we are going and can give us feedback on which of these they like best (or least).

Here is a short list; the Sysbox issue tracker has many more.

  • Support for more Linux distros.

  • More improvements to procfs and sysfs virtualization.

  • Continued improvements to container isolation.

  • Exposing host devices inside system containers with proper permissions.

Contact

Slack: Sysbox Slack Workspace

Email: [email protected]

We are available from Monday-Friday, 9am-5pm Pacific Time.

Thank You

We thank you very much for using and/or contributing to Sysbox. We hope you find it interesting and that it helps you use containers in new and more powerful ways.

sysbox's People

Contributors

abalmos, arukiidou, augustasv, ctalledo, dekusdenial, dmullero, elreydetoda, felipecrs, gadiener, ihid, jawnsy, joanbm, jonnymcc, lolli42, manasugi, martinkunkel2, optimalbrew, rayshard, rodnymolina, roussidis, scop, stoically, uagaro, zalsader


sysbox's Issues

Publish docker images to GitHub Container Registry

Since Docker Hub has imposed pull rate limits, I'm no longer able to pull anything from there at my company. So I have to manually mirror the image to another registry that I can access.

This is a suggestion for you to push the Docker images not only to Docker Hub, but also to the recently introduced GitHub Container Registry.

It would be awesome to be able to:

$ docker pull ghcr.io/nestybox/ubuntu-focal-systemd-docker

I strongly suggest using GitHub Actions to automate it :).
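
For illustration, a minimal GitHub Actions sketch of such automation (the workflow name, build context, and image tag below are hypothetical placeholders):

name: publish-image
on: push
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v2
        with:
          context: ubuntu-focal-systemd-docker
          push: true
          tags: ghcr.io/nestybox/ubuntu-focal-systemd-docker:latest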

Add support for K3d + Sysbox

The goal here is to allow the K3d tool to create K8s clusters running over Sysbox containers. Sysbox would offer the following benefits over K3d's default runtime (runc):

  • Security: Sysbox would allow secure containers to be utilized as K8s-nodes -- currently K3d can only run over 'privileged' containers.
  • Simplicity: K3d's Dockerfile images could be drastically simplified as Sysbox absorbs most of the complexity required to allow K8s execution.
  • Flexibility: Sysbox imposes no restrictions on the docker images to utilize, so users would have more flexibility to define their own (customized) K8s-node images, including inner images corresponding to K8s components and/or applications.

Various test cases failing due to SELinux's default activation (CentOS)

Problem already understood and fixed. Opening issue just for tracking purposes. (Ref #62)

In scenarios where the kernel is operating with SELinux turned on (CentOS / Red Hat), file permissions are displayed with an extra character (".") to indicate the presence of an SELinux security context:

$ ls -lrt /proc/uptime
-r--r--r--. 1 root root 0 Jul 24 02:00 /proc/uptime

<-- This caused multiple test failures ...

ok 14 pid-ns hierarchy
not ok 15 proc lookup
# (in test file tests/sysfs/proc.bats, line 45)
#   `[[ "$want_perm" == "$got_perm" ]]' failed
# sysbox-runc run -d --console-socket /work/console.sock syscont (status=0):
#
# sysbox-runc exec syscont sh -c ls -ld /proc/uptime (status=0):
# -r--r--r--    1 root     root             0 Jul 23 01:04 /proc/uptime
ok 16 proc read-only

<-- The fix is to adjust our test helper functions to avoid comparing this 11th character and stick to the first ten. More elegant solutions were attempted by adding extra logic to the helper functions to detect the presence/absence of SELinux, but that opened up another set of issues that broke other test cases, so we decided to go for the simplest but consistent solution.

rpm installation fails inside a sysbox redhat inner container

  1. Start a sysbox container with dockerd:
docker run --rm --runtime=sysbox-runc -it nestybox/ubuntu-focal-systemd-docker
  2. Start a container inside and attempt to install the filesystem package:
docker run -it registry.access.redhat.com/ubi8/ubi:8.2-347 /bin/bash

# Inside container.
dnf install filesystem -y --downloadonly
rpm -Uvh /var/cache/dnf/ubi-8-baseos-53c30a88cff3796c/packages/filesystem-3.8-3.el8.x86_64.rpm --force

Output from the install:

Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:filesystem-3.8-3.el8             ################################# [ 50%]
error: unpacking of archive failed on file /proc: cpio: chown failed - No such file or directory
error: filesystem-3.8-3.el8.x86_64: install failed
error: filesystem-3.8-2.el8.x86_64: erase skipped


Can sysbox with systemd as ENTRYPOINT execute the other command after container start?

I need to run the Docker container with systemd as PID 1 (i.e., as the ENTRYPOINT), so I followed the Nestybox Dockerfile from here https://github.com/nestybox/dockerfiles/blob/main/ubuntu-bionic-systemd/Dockerfile. But in this case I also need to run a custom bash script after the container starts, and it doesn't execute the command:

FROM nestybox/ubuntu-bionic-systemd:latest
...
...
CMD [";docker-entrypoint.sh" ]

If I check the process list inside the container, systemd loaded successfully but docker-entrypoint.sh was not executed.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  77412  8352 ?        Ss   09:51   0:00 /sbin/init --log-level=err ;docker-entrypoint.sh
root       262  0.0  0.1  78476  9452 ?        Ss   09:51   0:00 /lib/systemd/systemd-journald
root       394  0.0  0.0  45012  3604 ?        Ss   09:51   0:00 /lib/systemd/systemd-udevd
systemd+   548  0.0  0.0  70656  5096 ?        Ss   09:51   0:00 /lib/systemd/systemd-resolved
root       759  0.0  0.0  28352  2712 ?        Ss   09:51   0:00 /usr/sbin/cron -f

NOTE: the script contains logic to initialize data and other things.

So I need systemd to run in my gitlab-runner Docker image and to use Docker inside it; any suggestions?
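
One common pattern for this (a sketch, not from the original thread; the unit and script names are hypothetical) is to keep systemd as PID 1 and let it run the script via a one-shot unit enabled at image build time:

# /etc/systemd/system/entrypoint.service
[Unit]
Description=Run custom entrypoint script after boot

[Service]
Type=oneshot
ExecStart=/usr/local/bin/docker-entrypoint.sh

[Install]
WantedBy=multi-user.target

# In the Dockerfile:
#   COPY docker-entrypoint.sh /usr/local/bin/
#   RUN systemctl enable entrypoint.service

This also avoids the stray ";" that ends up appended to the /sbin/init command line in the process listing above.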

[Bug] Unable to run container with --runtime=sysbox-runc when Docker data root is on an LVM

After installing Sysbox on Ubuntu Focal, either from the package release or from source, I cannot run a container with sysbox-runc; I always get the same error:

$ docker run --runtime=sysbox-runc hello-world
docker: Error response from daemon: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:342: getting the final child's pid from pipe caused \"EOF\"": unknown.
ERRO[0000] error waiting for container: context canceled

whereas with runc it works.

I tried this without success.

$ uname -a
Linux charles 5.4.0-58-generic #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.4.2-docker)

Server:
 Containers: 2
  Running: 0
  Paused: 0
  Stopped: 2
 Images: 206
 Server Version: 20.10.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc sysbox-runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-58-generic
 Operating System: Ubuntu 20.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 31.01GiB
 Name: charles
 ID: LREV:LW7X:THPL:CJY4:W7V2:3PDY:OQ4R:EPPX:QA5F:3BTJ:JKTI:7C6W
 Docker Root Dir: /home/clement/encrypted/system/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.25.0.0/16, Size: 24

WARNING: No swap limit support
WARNING: No blkio weight support
WARNING: No blkio weight_device support
$ sudo cat /etc/docker/daemon.json 
{
  "data-root": "/home/clement/encrypted/system/docker",
  "bip": "172.20.0.1/16",
  "runtimes": {
       "sysbox-runc": {
          "path": "/usr/local/sbin/sysbox-runc"
       }
   },
  "default-address-pools": [
    {
      "base": "172.25.0.0/16",
      "size": 24
    }
  ]
}

docker-py does not seem to have access to the `docker daemon` inside a docker.

Issue
Hello,

I am interested in possibly using sysbox to get Docker-in-Docker capabilities.

I have installed the runtime as specified in the README; then, when running an image with this runtime, I try to run the following Python command (with the docker-py SDK):

import docker
env = docker.from_env()

and I get the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/docker/client.py", line 101, in from_env
    **kwargs_from_env(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/client.py", line 45, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 197, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 222, in _retrieve_server_version
    'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', PermissionError(13, 'Permission denied'))

It seems that it is failing to access the docker daemon.

What is the proper way to have docker in docker using sysbox?

Thank you!
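
For context, a sketch of a working setup (assuming an image whose inner Docker daemon is started by systemd, such as the Nestybox reference images, and with docker-py installed inside the container); docker-py then talks to the inner daemon's /var/run/docker.sock:

$ docker run --runtime=sysbox-runc -it --rm nestybox/ubuntu-focal-systemd-docker
root@my_cont:/# python3 -c "import docker; print(docker.from_env().ping())"
True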

Overlayfs can't be mounted from within a sys container (except on Ubuntu)

In the mainline Linux kernel, it's not possible to mount overlayfs from within a container (or, more accurately, from outside the initial user-namespace). Doing so results in a permission-denied error.

To reproduce the issue, simply enter a user-namespace and then mount overlayfs:

cd /home/chino/sandbox/overlayfs
mkdir lower upper work merged
unshare -i -m -n -p -u -U -C -f --mount-proc -r /bin/bash
mount -t overlay overlay -o lowerdir=/home/chino/sandbox/overlayfs/lower,upperdir=/home/chino/sandbox/overlayfs/upper,workdir=/home/chino/sandbox/overlayfs/work /home/chino/sandbox/overlayfs/merged
mount: /home/chino/sandbox/overlayfs/merged: permission denied.

It's not clear to me why this restriction exists; it may be related to the security issue described in this lwn.net article.

This is a problem as it won't allow us to run system containers on ext4, because if an inner docker daemon is launched, the inner docker will try to mount overlayfs for the container images and this operation will fail.

Fortunately, the problem does not occur on Ubuntu. There appears to be a patch from Ubuntu that allows this. As described here:

"Ubuntu carries a patch that allows overlayfs mounting inside of an unprivileged user namespace, so we were carrying the fix mentioned above as a delta against the upstream Linux kernel since the issue didn't affect upstream overlayfs. "

Note that the problem does not affect system containers on btrfs, because in that case overlayfs is not used by the inner docker; it uses btrfs subvolumes.

(Ref #62)

Add support for K8s.io KinD + Sysbox

The goal here is to allow the KinD tool to create K8s clusters running over Sysbox containers. Sysbox would offer the following benefits over KinD's default runtime (runc):

  • Security: Sysbox would allow secure containers to be utilized as K8s-nodes -- currently KinD can only run over 'privileged' containers.
  • Simplicity: KinD's Dockerfile images could be drastically simplified as Sysbox absorbs most of the complexity required to allow K8s execution.
  • Flexibility: Sysbox imposes no restrictions on the docker images to utilize, so users would have more flexibility to define their own (customized) K8s-node images, including inner images corresponding to K8s components and/or applications.

The following issue has been filed in KinD's project to track this effort: kubernetes-sigs/kind#1772.

Sysbox unable to handle read-only rootfs requirement

Problem was initially observed in a Cloudron setup. For security purposes, Cloudron sets container's spec so that its rootfs is mounted as read-only.

There are two different issues being reported here:

  1. Sysbox expects a 'read-write' rootfs in the received OCI spec. We already had a fix for this one, but it hasn't been merged yet. This prevents a read-only container from properly launching:
$ docker run --runtime=sysbox-runc -it --rm --read-only --name test-1 nestybox/ubuntu-focal-systemd-docker
docker: Error response from daemon: OCI runtime create failed: error in the container spec: invalid or unsupported container spec: root path must be read-write but it's set to read-only: unknown.
  2. In scenarios involving Docker's custom networks (which is the case in Cloudron), Sysbox adjusts the DNS resolution process to avoid forwarding issues in nested container setups. As part of this adjustment, sysbox-runc writes into the container's /etc/resolv.conf file. However, this is not allowed in read-only setups, which triggers the following failure:
$ docker run --runtime=sysbox-runc -it --rm --read-only --name test-2 --net=rodny-testing ubuntu
docker: Error response from daemon: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"switching Docker DNS: rootfs_linux.go:1247: writing /etc/resolv.conf caused \\\"open /etc/resolv.conf: read-only file system\\\"\"": unknown.

/cc @vRobM

Sysbox does not handle nested bind-mounts into the container correctly

When launching a container with Docker + Sysbox with nested host bind-mounts, the inner bind-mount shows up with nobody:nogroup permissions.

For example, below we launch a sysbox container with nested host bind-mounts of ~/tmp/hometest -> /home/admin and ~/tmp/hometest/docker -> /var/lib/docker:

$ docker run --runtime=sysbox-runc -it --rm --mount type=bind,source=$HOME/tmp/hometest,target=/home/admin --mount type=bind,source=$HOME/tmp/hometest/docker,target=/var/lib/docker nestybox/alpine-docker-dbg

/ # cd var/lib
/var/lib # ls -l
total 28
drwxr-xr-x    2 root     root          4096 May 29  2020 apk
drwxr-xr-x    3 root     root          4096 Dec  2 03:59 containerd
drwxr-xr-x    2 nobody   nobody        4096 Dec  2 03:45 docker
drwxr-xr-x    2 root     root          4096 Oct  5 16:01 iptables
drwxr-xr-x    2 root     root          4096 Dec  2 03:59 kubelet
drwxr-xr-x    2 root     root          4096 May 29  2020 misc
drwxr-xr-x    2 root     root          4096 May 29  2020 udhcpd

/var/lib # ls -l /home
total 4
drwxr-xr-x    3 root     root          4096 Dec  2 03:45 admin

As shown, the inner bind-mount (~/tmp/hometest/docker -> /var/lib/docker) shows up as nobody:nogroup inside the container.

Sysbox build attributes missing from "--version" output

I followed the guide here https://github.com/nestybox/sysbox/blob/master/docs/developers-guide/build.md and downloaded the source code from the release page, but it fails when trying to run make sysbox, like this:

OS version: Ubuntu 18.04.5 LTS
Kernel version: 5.4.0-48-generic #52~18.04.1-Ubuntu SMP

** Building sysbox **

docker run --privileged --rm --hostname sysbox-build --name sysbox-build -v /root/sysbox-0.2.1:/root/nestybox/sysbox -v /root/go/pkg/mod:/go/pkg/mod -v /lib/modules/5.4.0-48-generic:/lib/modules/5.4.0-48-generic:ro -v /usr/src/:/usr/src/:ro -v /usr/src/:/usr/src/:ro sysbox-test /bin/bash -c "buildContainerInit sysbox-local"
cd sysbox-runc && make clean
make[1]: Entering directory '/root/nestybox/sysbox/sysbox-runc'
make[1]: Leaving directory '/root/nestybox/sysbox/sysbox-runc'
make[1]: *** No rule to make target 'clean'.  Stop.
Makefile:325: recipe for target 'clean' failed
make: *** [clean] Error 2
Makefile:110: recipe for target 'sysbox' failed
make: *** [sysbox] Error 2

Add support for the systemd cgroups

Sysbox-runc does not yet support systemd cgroups. This means that when creating a container, sysbox-runc writes the cgroup constraints for the container directly into the host's /sys/fs/cgroup directories, rather than asking systemd to allocate cgroup resources for the container.

As a result, if the higher-level container manager (e.g., Docker/containerd, CRI-O, etc.) is configured to use systemd cgroups (e.g., in Docker this is done by starting the docker engine with dockerd --exec-opt native.cgroupdriver=systemd), then sysbox-runc will emit this error:

flag provided but not defined: -systemd-cgroup

Having systemd allocate cgroup resources for the container is preferable in scenarios where you want a single entity (systemd) managing all cgroup resource assignments in the system.

It's also helpful because some container managers (e.g., CRI-O) default to systemd cgroup management. For example, fixing this is a nice-to-have for issue #64.

Note that here we are talking about systemd cgroup V1 support. This issue does not cover cgroup V2 support (that's a separate and larger task).
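
For reference, this is the equivalent Docker daemon configuration that selects the systemd cgroup driver (i.e., the setting sysbox-runc does not yet support), expressed in /etc/docker/daemon.json:

{
  "exec-opts": ["native.cgroupdriver=systemd"]
}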

Extend sysbox to allow podman's rootful containers to run within sys-containers

The goal here is to allow Sysbox to run podman inside a system container. Refer to this podman issue for details about the use-case.

After analyzing the issue and making a few adjustments to sysbox, I'm now running into this one:

rmolina@dev-vm1:~$ docker run -it --rm --device=/dev/fuse --runtime=sysbox-runc quay.io/podman/stable bash
[root@c9f908a8ef7a /]#

[root@c9f908a8ef7a /]# podman run hello-world
Trying to pull registry.fedoraproject.org/hello-world...
  manifest unknown: manifest unknown
Trying to pull registry.access.redhat.com/hello-world...
  name unknown: Repo not found
Trying to pull registry.centos.org/hello-world...
  manifest unknown: manifest unknown
Trying to pull docker.io/library/hello-world...
Getting image source signatures
Copying blob 0e03bdcc26d7 [--------------------------------------] 0.0b / 0.0b
Copying config bf756fb1ae done
Writing manifest to image destination
Storing signatures
Error: openat2 `proc`: Operation not permitted: OCI runtime permission denied error
[root@c9f908a8ef7a /]#

<-- Strace output below -- note that syscall 0x1b5 == 437 == openat2() ...

[pid 2968594] 16:46:16 syscall_0x1b5(0x6, 0x55f30681c180, 0x7ffdc8e34730, 0x18, 0, 0x28000000000000) = -1 EPERM (Operation not permitted) <0.000006>
[pid 2968594] 16:46:16 close(6)         = 0 <0.000007>
[pid 2968594] 16:46:16 write(7, "\1\0\0\0\1\0\0\0openat2 `proc`\0", 23) = 23 <0.000018>
[pid 2968594] 16:46:16 exit_group(1 <unfinished ...>

Looks like a seccomp issue preventing openat2() execution. The fix may need to extend libseccomp, as openat2() doesn't seem to be supported (at least not in our private version). If that's the case, and we see nothing else, I believe the fix for this one should be an easy one.

/cc @felipecrs @rhatdan @giuseppe

sysbox-runc cgroup test failure due to oom_killer

The problem is consistently reproduced on some Ubuntu Bionic boxes (e.g., GCP ones), but we haven't been able to replicate it in equivalent local VMs yet. We also haven't been able to reproduce it on Ubuntu Eoan or Focal; the issue seems to be Bionic-specific.

rodny@gcp-ubuntu-bionic:~$ uname -a
Linux gcp-ubuntu-bionic 5.3.0-1032-gcp #34~18.04.1-Ubuntu SMP Tue Jul 14 22:07:36 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

rodny@gcp-ubuntu-bionic:~$ docker -v
Docker version 19.03.12, build 48a66213fe

root@sysbox-test:~/nestybox/sysbox# docker -v
Docker version 19.03.12, build 48a66213fe
root@sysbox-test:~/nestybox/sysbox# bats -t sysbox-runc/tests/integration/update.bats
1..2
not ok 1 update
# (in test file sysbox-runc/tests/integration/update.bats, line 66)
#   `[ "$status" -eq 0 ]' failed
# runc list (status=0):
# ID          PID         STATUS      BUNDLE      CREATED     OWNER
# runc list (status=0):
# ID          PID         STATUS      BUNDLE      CREATED     OWNER
# runc list (status=0):
# ID          PID         STATUS      BUNDLE      CREATED     OWNER
# runc run -d --console-socket /root/console.sock test_update (status=1):
# time="2020-08-13T18:02:26Z" level=warning msg="signal: killed"
# time="2020-08-13T18:02:26Z" level=error msg="container_linux.go:364: starting container process caused \"process_linux.go:533: container init caused \\\"read init-p: connection reset by peer\\\"\""
# container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"read init-p: connection reset by peer\""
# runc list (status=0):
# ID          PID         STATUS      BUNDLE      CREATED     OWNER
# runc list (status=0):
# ID          PID         STATUS      BUNDLE      CREATED     OWNER
# runc list (status=0):
# ID          PID         STATUS      BUNDLE      CREATED     OWNER
ok 2 # skip (SYSBOX ISSUE #714) update rt period and runtime
root@sysbox-test:~/nestybox/sysbox#

As seen above, the sys container is unable to complete its initialization due to a 'kill' signal triggered by the oom-killer (see below), because the container's memory footprint exceeds the mem cgroup limits preconfigured in the testcase:

Aug 13 07:00:11 gcp-ubuntu-bionic kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-999
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: CPU: 0 PID: 7148 Comm: runc:[1:CHILD] Not tainted 5.3.0-1032-gcp #34~18.04.1-Ubuntu
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: Call Trace:
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  dump_stack+0x6d/0x95
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  dump_header+0x4f/0x200
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  oom_kill_process+0xe6/0x120
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  out_of_memory+0x109/0x510
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  mem_cgroup_out_of_memory+0xbb/0xd0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  try_charge+0x79a/0x7d0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  mem_cgroup_try_charge+0x75/0x190
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  mem_cgroup_try_charge_delay+0x22/0x50
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  wp_page_copy+0x118/0x7d0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  ? reuse_swap_page+0x10e/0x330
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  ? filemap_map_pages+0x18f/0x380
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  do_wp_page+0x91/0x530
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  __handle_mm_fault+0x9fe/0x1260
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  ? __switch_to_xtra+0x1ab/0x590
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  handle_mm_fault+0xcb/0x210
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  __do_page_fault+0x2a1/0x4d0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  do_page_fault+0x2c/0xe0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel:  page_fault+0x34/0x40
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: RIP: 0033:0x7fe42b299c38
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: Code: ce ea ff ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 be 01 00 00 00 41 55 41 54 31 c0 55 53 48 89 fb 48 83 ec 28 <f0> 0f b1 35 08 77 21 00 74 1a 48 8d 3d ff 76 21 00 48 81 ec 80 00
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: RSP: 002b:00007ffd305b9950 EFLAGS: 00010206
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: RAX: 0000000000000000 RBX: 00007ffd305b99b0 RCX: 0000000000000041
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ffd305b99b0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000010
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: memory: usage 32768kB, limit 32768kB, failcnt 23
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: kmem: usage 372kB, limit 16384kB, failcnt 0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: Memory cgroup stats for /docker/4436bfefd7df2cc83b27a66c6e6419bc3939c21c4c8b4e5fa3d3b98f78a6ab9a/runc-cgroups-integration-test/test-cgroup:
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: anon 6782976
                                          file 26357760
                                          kernel_stack 0
                                          slab 0
                                          sock 0
                                          shmem 26357760
                                          file_mapped 4730880
                                          file_dirty 0
                                          file_writeback 0
                                          anon_thp 0
                                          inactive_anon 26357760
                                          active_anon 6758400
                                          inactive_file 0
                                          active_file 0
                                          unevictable 0
                                          slab_reclaimable 0
                                          slab_unreclaimable 0
                                          pgfault 3069
                                          pgmajfault 0
                                          workingset_refault 0
                                          workingset_activate 0
                                          workingset_nodereclaim 0
                                          pgrefill 0
                                          pgscan 0
                                          pgsteal 0
                                          pgactivate 0
                                          pgdeactivate 0
                                          pglazyfree 0
                                          pglazyfreed 0
                                          thp_fault_alloc 0
                                          thp_collapse_alloc 0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: Tasks state (memory values in pages):
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: [   7133]     0  7133     7330     2942    90112        0          -999 runc:[0:PARENT]
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: [   7148]     0  7148     7330     1662    77824        0          -999 runc:[1:CHILD]
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test-cgroup,mems_allowed=0,oom_memcg=/docker/4436bfefd7df2cc83b27a66c6e6419bc3939c21c4c8b4e5fa3d3b98f78a6ab9a/runc-cgroups-integration-test/test-cgroup,task_memcg=/docker/4436bfefd7df2cc83b27a66c6e6419bc3939c21c4c8b4e5fa3d3b98f78a6ab9a/runc-cgroups-integration-test/test-cgroup,task=runc:[1:CHILD],pid=7148,uid=0
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: Memory cgroup out of memory: Killed process 7148 (runc:[1:CHILD]) total-vm:29320kB, anon-rss:6612kB, file-rss:0kB, shmem-rss:36kB
Aug 13 07:00:11 gcp-ubuntu-bionic kernel: oom_reaper: reaped process 7148 (runc:[1:CHILD]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

As per the testcase (below), the hard limit should be set at "33554432" bytes, which fully matches the utilization numbers seen above ("32768kB"), so the oom-killer seems to be doing what it should ...

    # Set some initial known values
    DATA=$(cat <<EOF
    "memory": {
        "limit": 33554432,
        "reservation": 25165824,
        "kernel": 16777216,
        "kernelTCP": 11534336
    },

As seen below, container's spec is looking good too: received cgroup attributes match the configured ones ...

"resources": {"memory": { "limit": 33554432, "reservation": 25165824, "kernel": 16777216, "kernelTCP": 11534336 }, "cpu": { "shares": 100, "quota": 500000, "period": 1000000, "cpus": "0" }, "pids"\
: { "limit": 20 }},

So it seems that the failure is just a consequence of the container simply using more memory than the testcase expects.

Sysbox Nesting (future support?)

I understand that sysbox nesting is currently a limitation.

Is this a feature that is planned to be supported in the future?


I'm aware of some solutions like this Y Combinator comment. However, this comes with its drawbacks, as mentioned in the well-known "Using Docker-in-Docker for your CI or testing environment? Think twice." blog post.

I can see this being a common use case such as for a CI server:

  1. You run your CI server in a docker container.
  2. Each build job runs inside a container
  3. The build jobs would like to do things like build, publish, etc.

For this case, 3 levels deep may be adequate, but I'm sure there are less common but valid use cases which would involve many more levels of nesting.

Unexpected error during NotifReceive() execution

Hi @ctalledo @rodnymolina

I have error logs like this. I'm not sure if this is why our containers sometimes crash; we can't kill the running containers without restarting the Docker daemon.

WARN[2020-12-23 05:30:53] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 26 pid 22364
WARN[2020-12-23 07:28:28] sysbox-fs caught signal: terminated
WARN[2020-12-23 12:55:42] sysbox-fs caught signal: terminated
WARN[2020-12-23 16:09:01] Error decoding received nsenterMsg response: read nsenterPipe-p: connection reset by peer
WARN[2020-12-23 16:18:02] Sysbox-fs first child process error status: pid = 6723
WARN[2020-12-23 16:18:02] Sysbox-fs first child process error status: pid = 6722
WARN[2020-12-23 21:50:24] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 14 pid 29780
WARN[2020-12-24 10:35:02] Error decoding received nsenterMsg response: read nsenterPipe-p: connection reset by peer
WARN[2020-12-24 18:57:02] Error decoding received nsenterMsg response: read nsenterPipe-p: connection reset by peer
WARN[2020-12-25 14:14:44] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 30 pid 3068
WARN[2020-12-25 15:58:03] Error decoding received nsenterMsg response: read nsenterPipe-p: connection reset by peer
WARN[2020-12-25 22:36:04] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 29 pid 6920
WARN[2020-12-26 01:39:33] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 190 pid 25454
WARN[2020-12-26 09:39:34] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 156 pid 26159
WARN[2020-12-26 13:50:29] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 77 pid 17239
WARN[2020-12-26 19:15:10] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 78 pid 21232
WARN[2020-12-26 21:25:06] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 87 pid 23710
WARN[2020-12-27 00:35:03] Unexpected error during NotifReceive() execution (bad file descriptor) on fd 53 pid 6342
WARN[2020-12-27 01:48:35] Unexpected error during NotifReceive() execution (inappropriate ioctl for device) on fd 168 pid 7878
WARN[2020-12-27 07:14:40] Unexpected error during NotifReceive() execution (bad file descriptor) on fd 43 pid 30142

Gitlab Runner + Sysbox fail to set volume permissions

Hi,
I've completed the steps listed in: https://blog.nestybox.com/2020/10/21/gitlab-dind.html#gitlab-runner-deploys-jobs-in-system-containers (It's really good reading!)
When I try to run a build, gitlab-runner fails with the following error:
WARNING: Preparation failed: adding cache volume: set volume permissions: running permission container "19d34278bd14d99363c7f03e49c5f178c38e508fbfa755143eed00fe543a23f7" for volume "runner-twkxhdzq-project-504-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": starting permission container: Error response from daemon: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"process_linux.go:504: handleReqOp caused \\\"rootfs_init_linux.go:249: bind mounting /var/lib/docker/containers/19d34278bd14d99363c7f03e49c5f178c38e508fbfa755143eed00fe543a23f7/mounts/shm to dev/shm caused \\\\\\\"lstat /var/lib/docker/containers/19d34278bd14d99363c7f03e49c5f178c38e508fbfa755143eed00fe543a23f7/mounts: permission denied\\\\\\\"\\\"\"": unknown (linux_set.go:100:1s) job=7871 project=504 runner=twkxHdzQ

If I use runc it works just fine and I can complete the build.

/etc/docker/daemon.json:

{
    "default-runtime": "sysbox-runc",
    "runtimes": {
        "sysbox-runc": {
            "path": "/usr/local/sbin/sysbox-runc"
        }
    },
    "bip": "172.20.0.1/16",
    "default-address-pools": [
        {
            "base": "172.25.0.0/16",
            "size": 24
        }
    ]
}

/etc/gitlab-runner/config.toml:

concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "CloudRunnerVM"
  url = "https://gitlab.server"
  token = "token"
  executor = "docker"
  [runners.docker]
    tls_verify = true
    image = "docker:19.03.12"
    privileged = false
    disable_cache = false
    volumes = ["/cache", "/certs/client"]
    runtime = "sysbox-runc"

.gitlab-ci.yml:

image: docker:19.03.12
services:
  - docker:19.03.12-dind
build:
    stage: build
    script:
      - docker run -i -t ubuntu date

Gitlab-runner version: 13.5.0
Docker version: 19.03.13
sysbox version: 0.2.1-0

Sometimes sysbox stopped

I've installed sysbox from source (compiled), but sometimes the runtime stops. How can I make it run again automatically if sysbox stops? It seems the scr/sysbox script doesn't do that. I also have a suggestion: add a date suffix to the log filenames, like sysbox-mgr-14-11-2020.log.

Thanks
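
One possible approach (a sketch; it assumes the Sysbox daemons are managed by systemd units such as sysbox-mgr.service and sysbox-fs.service, as in package-based installs) is a systemd drop-in that restarts the daemons automatically:

# /etc/systemd/system/sysbox-mgr.service.d/override.conf
# (create a similar drop-in for sysbox-fs.service)
[Service]
Restart=always
RestartSec=5

$ sudo systemctl daemon-reload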

Unable to mount sysfs (EPERM) during sys-container initialization

Ran into this issue while trying to launch K8s PODs as system containers.

Within one of the privileged containers acting as K8s nodes, we attempt to launch a POD making use of sysbox as the runtime.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-sysbox-1
spec:
  runtimeClassName: sysbox-runc
  containers:
  - name: nginx
    image: nginx
EOF
  • Sysbox-runc returns an EPERM error while trying to build the sandbox (pause) container:
rmolina@heavy-vm-bionic:~/wsp/04-26-2020/kind$ kubectl describe pod nginx-sysbox-1
...
Events:
  Type     Reason                  Age   From                  Message
  ----     ------                  ----  ----                  -------
  Normal   Scheduled               13s   default-scheduler     Successfully assigned default/nginx-sysbox-1 to kind-worker
  Warning  FailedCreatePodSandBox  10s   kubelet, kind-worker  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:532: container init caused \"rootfs_linux.go:58: setting up rootfs mounts caused \\\"rootfs_linux.go:884: mounting \\\\\\\"sysfs\\\\\\\" to rootfs \\\\\\\"/run/containerd/io.containerd.runtime.v1.linux/k8s.io/834a03c15468dedaf708cb525f0fe3016d61aabf7e40e29dd2cf57475ad18ee0/rootfs\\\\\\\" at \\\\\\\"sys\\\\\\\" caused \\\\\\\"operation not permitted\\\\\\\"\\\"\"": unknown
  • We reproduce the exact same issue when replacing the kubectl CLI with the crictl tool for POD creation:
root@kind-worker:~/kindestnode-rodny# crictl run --runtime=sysbox-runc container-config.json pod-config.json
FATA[3001] Running container failed: run pod sandbox failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:532: container init caused \"rootfs_linux.go:58: setting up rootfs mounts caused \\\"rootfs_linux.go:884: mounting \\\\\\\"sysfs\\\\\\\" to rootfs \\\\\\\"/run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs\\\\\\\" at \\\\\\\"sys\\\\\\\" caused \\\\\\\"operation not permitted\\\\\\\"\\\"\"": unknown
  • Let's place a breakpoint right before the sysfs mount instruction to have a chance at peeking at the state of the incipient container. [ all the instructions below are executed within the 'sysbox-runc init' context ]
rmolina@heavy-vm-bionic:~/wsp/04-26-2020/sysbox/kindestnode-rodny$ ps -ef | egrep "sysbox-runc init"
231072    6149  6066  0 23:41 ?        00:00:00 /usr/local/sbin/sysbox-runc init

rmolina@heavy-vm-bionic:~/wsp/04-26-2020/sysbox/kindestnode-rodny$ sudo nsenter -t 6149 -a bash
root@kind-worker:/# findmnt
TARGET                                                                                 SOURCE                                             FSTYPE    OPTIONS
/                                                                                      overlay                                            overlay   rw,relatime,lowerdir=/var/lib/docker/overlay2/l/KMVXHZYLEJZPKOU2TTPZUFSHBF:/var/lib/docker/ove
|-/proc                                                                                proc                                               proc      rw,nosuid,nodev,noexec,relatime
| `-/proc/sys/fs/binfmt_misc                                                           systemd-1                                          autofs    rw,relatime,fd=39,pgrp=0,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=94183586
|-/dev                                                                                 tmpfs                                              tmpfs     rw,nosuid,size=65536k,mode=755
| |-/dev/pts                                                                           devpts                                             devpts    rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666
| |-/dev/mqueue                                                                        mqueue                                             mqueue    rw,nosuid,nodev,noexec,relatime
| |-/dev/shm                                                                           shm                                                tmpfs     rw,nosuid,nodev,noexec,relatime,size=65536k
| |-/dev/console                                                                       devpts[/0]                                         devpts    rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666
| `-/dev/hugepages                                                                     hugetlbfs                                          hugetlbfs rw,relatime,pagesize=2M
|-/sys                                                                                 sysfs                                              sysfs     rw,nosuid,nodev,noexec,relatime
| |-/sys/fs/cgroup                                                                     tmpfs                                              tmpfs     ro,nosuid,nodev,noexec,mode=755
| | |-/sys/fs/cgroup/systemd                                                           cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,xattr,name=systemd
| | |-/sys/fs/cgroup/devices                                                           cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,devices
| | |-/sys/fs/cgroup/cpuset                                                            cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,cpuset
| | |-/sys/fs/cgroup/cpu,cpuacct                                                       cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,cpu,cpuacct
| | |-/sys/fs/cgroup/blkio                                                             cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,blkio
| | |-/sys/fs/cgroup/memory                                                            cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,memory
| | |-/sys/fs/cgroup/hugetlb                                                           cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,hugetlb
| | |-/sys/fs/cgroup/rdma                                                              cgroup                                             cgroup    rw,nosuid,nodev,noexec,relatime,rdma
| | |-/sys/fs/cgroup/net_cls,net_prio                                                  cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,net_cls,net_prio
| | |-/sys/fs/cgroup/freezer                                                           cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,freezer
| | |-/sys/fs/cgroup/pids                                                              cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,pids
| | |-/sys/fs/cgroup/perf_event                                                        cgroup[/../../..]                                  cgroup    rw,nosuid,nodev,noexec,relatime,perf_event
| | `-/sys/fs/cgroup/unified                                                           cgroup2[/../../../..]                              cgroup2   rw,nosuid,nodev,noexec,relatime,nsdelegate
| |-/sys/kernel/debug                                                                  debugfs                                            debugfs   rw,nosuid,nodev,noexec,relatime
| `-/sys/fs/fuse/connections                                                           fusectl                                            fusectl   rw,nosuid,nodev,noexec,relatime
|-/run                                                                                 tmpfs                                              tmpfs     rw,nosuid,nodev,noexec,relatime
| |-/run/lock                                                                          tmpfs                                              tmpfs     rw,nosuid,nodev,noexec,relatime,size=5120k
| |-/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a5ba02ce9661b191df25560ca67694f2b00601b887db936c3d38d882082e837a/shm
| |                                                                                    shm                                                tmpfs     rw,nosuid,nodev,noexec,relatime,size=65536k
| |-/run/containerd/io.containerd.runtime.v2.task/k8s.io/a5ba02ce9661b191df25560ca67694f2b00601b887db936c3d38d882082e837a/rootfs
| |                                                                                    overlay                                            overlay   rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs
| |-/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a9294afb871b35503d3111c60346f3c69d67757d52032ef6018f3fa782a2f02b/shm
| |                                                                                    shm                                                tmpfs     rw,nosuid,nodev,noexec,relatime,size=65536k
| |-/run/containerd/io.containerd.runtime.v2.task/k8s.io/a9294afb871b35503d3111c60346f3c69d67757d52032ef6018f3fa782a2f02b/rootfs
| |                                                                                    overlay                                            overlay   rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs
| |-/run/containerd/io.containerd.runtime.v2.task/k8s.io/6f2abce13e25839fca0689db233038506514c1c342586c1a33505665769211c5/rootfs
| |                                                                                    overlay                                            overlay   rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/16/f
| |-/run/containerd/io.containerd.runtime.v2.task/k8s.io/225162f17f1794063d62865b698bddf1cdac1d5e0569c51e033b9f1ae6e100a1/rootfs
| |                                                                                    overlay                                            overlay   rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/f
| |-/run/containerd/io.containerd.runtime.v1.linux/k8s.io/.a0127560524578936a435db0554660ed362eb19ac1e7dfff1ceb9306bbd5d137/rootfs
| |                                                                                    overlay                                            overlay   rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs
| | `-/run/containerd/io.containerd.runtime.v1.linux/k8s.io/.a0127560524578936a435db0554660ed362eb19ac1e7dfff1ceb9306bbd5d137/rootfs
| |                                                                                    /run/containerd/io.containerd.runtime.v1.linux/k8s.io/a0127560524578936a435db0554660ed362eb19ac1e7dfff1ceb9306bbd5d137/rootfs
| |                                                                                                                                       shiftfs   rw,relatime,mark
| |-/run/netns/cni-2b9b363a-c622-4b7f-017c-08688285fd7c                                nsfs[net:[4026532972]]                             nsfs      rw
| |-/run/containerd/io.containerd.grpc.v1.cri/sandboxes/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/shm
| |                                                                                    shm                                                tmpfs     rw,nosuid,nodev,noexec,relatime,size=65536k
| | `-/run/containerd/io.containerd.grpc.v1.cri/sandboxes/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/shm
| |                                                                                    /run/containerd/io.containerd.grpc.v1.cri/sandboxes/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/shm
| |                                                                                                                                       shiftfs   rw,relatime,mark
| |   `-/run/containerd/io.containerd.grpc.v1.cri/sandboxes/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/shm
| |                                                                                    /run/containerd/io.containerd.grpc.v1.cri/sandboxes/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/shm
| |                                                                                                                                       shiftfs   rw,relatime
| `-/run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs
|                                                                                      overlay                                            overlay   rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs
|   `-/run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs
|                                                                                      /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs
|                                                                                                                                         shiftfs   rw,relatime,mark
|     `-/run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs
|                                                                                      /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs
|                                                                                                                                         shiftfs   rw,relatime,mark
|       `-/run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs
|                                                                                      .                                                  shiftfs   rw,relatime
|-/tmp                                                                                 tmpfs                                              tmpfs     rw,nosuid,nodev,noexec,relatime
|-/root/nestybox/sysbox                                                                /dev/sda3[/home/rmolina/wsp/04-26-2020/sysbox]     ext4      rw,relatime,errors=remount-ro
|-/var                                                                                 /dev/sda3[/var/lib/docker/volumes/8e4ddb7f4ee97e215df9d125de2bb10dd2e6cd9f3303e109e3ba1b552db24f1f/_data]
|                                                                                                                                         ext4      rw,relatime,errors=remount-ro
| |-/var/lib/kubelet                                                                   /dev/sda3[/var/lib/docker/volumes/8e4ddb7f4ee97e215df9d125de2bb10dd2e6cd9f3303e109e3ba1b552db24f1f/_data/lib/kubelet]
| |                                                                                                                                       ext4      rw,relatime,errors=remount-ro
| | |-/var/lib/kubelet/pods/4a12d0af-20af-48bc-8297-3f63a2be05d9/volumes/kubernetes.io~secret/kube-proxy-token-ll9tc
| | |                                                                                  tmpfs                                              tmpfs     rw,relatime
| | `-/var/lib/kubelet/pods/ae9626cb-2388-44fb-8bd6-0aaca66dc88d/volumes/kubernetes.io~secret/kindnet-token-hcnt8
| |                                                                                    tmpfs                                              tmpfs     rw,relatime
| `-/var/lib/sysboxfs/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508 sysboxfs                                           fuse      rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
|-/usr/lib/modules                                                                     /dev/sda3[/lib/modules]                            ext4      ro,relatime,errors=remount-ro
| `-/usr/lib/modules/5.3.0-46-generic                                                  /lib/modules/5.3.0-46-generic                      shiftfs   rw,relatime,mark
|   `-/usr/lib/modules/5.3.0-46-generic                                                /lib/modules/5.3.0-46-generic                      shiftfs   rw,relatime
|-/usr/src/linux-headers-5.3.0-46                                                      /dev/sda3[/usr/src/linux-headers-5.3.0-46]         ext4      ro,relatime,errors=remount-ro
| `-/usr/src/linux-headers-5.3.0-46                                                    /usr/src/linux-headers-5.3.0-46                    shiftfs   rw,relatime,mark
|   `-/usr/src/linux-headers-5.3.0-46                                                  /usr/src/linux-headers-5.3.0-46                    shiftfs   rw,relatime
|-/usr/src/linux-headers-5.3.0-46-generic                                              /dev/sda3[/usr/src/linux-headers-5.3.0-46-generic] ext4      ro,relatime,errors=remount-ro
| `-/usr/src/linux-headers-5.3.0-46-generic                                            /usr/src/linux-headers-5.3.0-46-generic            shiftfs   rw,relatime,mark
|   `-/usr/src/linux-headers-5.3.0-46-generic                                          /usr/src/linux-headers-5.3.0-46-generic            shiftfs   rw,relatime
|-/etc/resolv.conf                                                                     /dev/sda3[/var/lib/docker/containers/693268e234832ba5b97d02763bd8ab37a236610024382f9c9c4cec561ffb9e29/resolv.conf]
|                                                                                                                                         ext4      rw,relatime,errors=remount-ro
|-/etc/hostname                                                                        /dev/sda3[/var/lib/docker/containers/693268e234832ba5b97d02763bd8ab37a236610024382f9c9c4cec561ffb9e29/hostname]
|                                                                                                                                         ext4      rw,relatime,errors=remount-ro
`-/etc/hosts                                                                           /dev/sda3[/var/lib/docker/containers/693268e234832ba5b97d02763bd8ab37a236610024382f9c9c4cec561ffb9e29/hosts]
                                                                                                                                          ext4      rw,relatime,errors=remount-ro
root@kind-worker:/#
  • Once again, an EPERM is returned if we manually attempt to mount sysfs in the container's rootfs:
root@kind-worker:/# mount -t sysfs sysfs /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs/sys
mount: /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs/sys: permission denied.
root@kind-worker:/#
  • However, no issue is observed when trying to mount procfs:
root@kind-worker:/# mkdir /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs/proc
root@kind-worker:/#
root@kind-worker:/# ls -lrt /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs
total 740
-rwxr-xr-x 1 root root 742472 Dec 20  2017 pause
drwxr-xr-x 2 root root   4096 May  8 23:41 dev
drwxr-xr-x 2 root root   4096 May  8 23:41 sys
drwxr-xr-x 2 root root   4096 May  8 23:44 proc
root@kind-worker:/#
root@kind-worker:/# mount -t proc proc /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs/proc
root@kind-worker:/# findmnt
TARGET                                                                                 SOURCE                                             FSTYPE    OPTIONS
/                                                                                      overlay                                            overlay   rw,relatime,lowerdir=/var/lib/docker/overlay2/l/KMVXHZYLEJZPKOU2TTPZUFSHBF:/var/lib/docker/ove
|-/proc                                                                                proc                                               proc      rw,nosuid,nodev,noexec,relatime
...
|       `-/run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs
|                                                                                      .                                                  shiftfs   rw,relatime
|         `-/run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs/proc
|                                                                                      proc                                               proc      rw,relatime
...
  • Notice that there's nothing wrong with the 'rootfs/sys' folder created by sysbox-runc; the same issue is observed with a manually created 'sys' folder:
root@kind-worker:/# mkdir /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs/sysnew
root@kind-worker:/# mount -t sysfs sysfs /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs/sysnew
mount: /run/containerd/io.containerd.runtime.v1.linux/k8s.io/617d677a65ffb36b8872629d3894b23ec5f47501dc4e54fb0bb8337006d5d508/rootfs/sysnew: permission denied.
root@kind-worker:/#
  • Obviously, the problem is not with the rootfs path but with the sysfs filesystem itself. For some reason the kernel is preventing sysfs from being mounted.
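
  • A reasonable first diagnostic (a sketch, not performed here) would be to check whether an LSM such as AppArmor is denying the mount, since LSM denials also surface as EPERM:

root@kind-worker:/# dmesg | grep -i -E "apparmor|audit" | tail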

Sysbox-fs crash while mounting /proc in sys-container's chrooted environment

The problem can be easily reproduced by creating a chroot context within a sys container and mounting any sysbox-managed resource.

rmolina@heavy-vm-bionic:~$ docker run --runtime=sysbox-runc -it --rm --name test-1 --hostname test-1 nestybox/ubuntu-bionic-systemd-docker
admin@test-1:~$

admin@test-1:~$ mkdir rootfs
admin@test-1:~$ sudo cp -a /usr rootfs/
admin@test-1:~$ sudo cp -a /lib rootfs/
admin@test-1:~$ sudo cp -a /lib64 rootfs/
admin@test-1:~$ sudo cp -a /bin rootfs/

admin@test-1:~$ sudo chroot /home/admin/rootfs/
bash-4.4#

bash-4.4# ls -lrt /
total 16
drwxr-xr-x  2 0 0 4096 Apr  3 17:13 lib64
drwxr-xr-x  2 0 0 4096 Jun 10 23:57 bin
drwxr-xr-x 11 0 0 4096 Jun 11 00:01 usr
drwxr-xr-x 10 0 0 4096 Jul 17 19:21 lib

bash-4.4# mkdir proc

bash-4.4# mount -t proc proc /proc
mount: /proc: mount(2) system call failed: Function not implemented.
bash-4.4#

<-- Sysbox-fs crashes right away while processing the mount-syscall ...

DEBU[2020-07-17 19:24:26] Received mount syscall from pid 9851
DEBU[2020-07-17 19:24:26] source: proc, target = /proc, fstype = proc, flags = 0xc0ed0000, data =
DEBU[2020-07-17 19:24:26] Processing new procfs mount: source: proc, target = /proc, fstype = proc, flags = 0xc0ed0000, data =
INFO[2020-07-17 19:24:27] Initiating sysbox-fs engine ...

Add some improvements for logging

Basically I have two recommendations for the next update:

  • Disable color output; as shown in the image below, the color escape codes end up as unusable characters in the logs.
  • Shorten container IDs such as f4e18b30b8df6ed2100117e7595822db8d50a07925129a1341ee50e4da7857a2 to 8 characters, similar to the short-ID convention used for container IDs elsewhere.

[image: screenshot showing color escape codes in the log output]

All of these would be good logging improvements for Sysbox, along with JSON formatting and timestamps.

Thanks.

Is there a way to move /var/lib/sysbox to another location?

Hello, we are running Sysbox on Ubuntu Focal. Due to the limited size of /, we opted to use /workspace/docker (a RAID-0 SSD array) for Docker storage. However, when running docker pull inside a system container, we found that / filled up. After further investigation, the images are saved under /var/lib/sysbox. So, is there a way to also move /var/lib/sysbox to /workspace/sysbox? If not, would manually mounting /var/lib/docker into the system container work? Thanks
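
A possible workaround (untested, and assuming a systemd-based install whose sysbox services can be stopped while no containers are running) would be to relocate the directory and bind-mount it back in place:

sudo systemctl stop sysbox
sudo mv /var/lib/sysbox /workspace/sysbox
sudo mkdir /var/lib/sysbox
sudo mount --bind /workspace/sysbox /var/lib/sysbox
sudo systemctl start sysbox

The bind mount can be made persistent with an /etc/fstab entry.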

Sysbox-mgr leaks file descriptors during container creation

The sysbox-mgr has a file descriptor leak during container creation and/or removal.

This leak causes sysbox to fail to create containers after more than ~512 containers are created and removed, because at that point the default Linux limit on open file descriptors (1024) is reached.

When the error occurs, we see the following in the sysbox-mgr log (/var/log/sysbox-mgr.log):

WARN[2020-11-09 19:59:25] sync-out for container 8b1f5eced84ca3ec1ded7b5bebfdaf32a0a490a57a7e84c2887936498994b6ea failed: sync-out for volume backing [var-lib-docker var-lib-kubelet var-lib-containerd-ovfs] failed: volume sync-out failed: failed to sync /var/lib/sysbox/containerd/8b1f5eced84ca3ec1ded7b5bebfdaf32a0a490a57a7e84c2887936498994b6ea/ to /var/lib/docker/165536.165536/overlay2/214855bdc8d28deed38e653fe9a780e7c1a519274e4b34c04613c829ba013170/merged/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs:  fork/exec /usr/bin/rsync: too many open files
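
A simple way to confirm the leak (a diagnostic sketch, assuming sysbox-mgr runs as a single process) is to watch the daemon's open file-descriptor count as containers are created and removed; it should return to its baseline, but instead keeps growing:

ls /proc/$(pidof sysbox-mgr)/fd | wc -l
for i in $(seq 1 20); do docker run --runtime=sysbox-runc --rm alpine true; done
ls /proc/$(pidof sysbox-mgr)/fd | wc -l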

Ubuntu Focal displays 'setrlimit' error during 'sudo' execution

The following error is dumped to the console whenever a command is executed through the 'sudo' interface:

admin@test-3:~$ touch test

admin@test-3:~$ ls
test

admin@test-3:~$ sudo ls
sudo: setrlimit(RLIMIT_CORE): Operation not permitted
test

admin@test-3:~$

Notice that the command is properly executed, so it looks like a cosmetic issue. In any case, the problem has nothing to do with Sysbox; in fact, it's being actively worked on by the Canonical folks:

https://bugs.launchpad.net/juju/+bug/1867799
https://discuss.linuxcontainers.org/t/sudo-on-ubuntu-20-04-focal/7508/4

I'm creating this issue to make sure we update our external Docker images (and associated Dockerfiles) once that problem is fixed, which may require updating the 'sudo' package.

Docker-compose does not work with "sysbox" as a default runtime

Everything works great with the docker run command, but nothing works with docker-compose.

Docker run works:

I've configured sysbox-runc as a default runtime.
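
For reference, that configuration presumably looks something like this in /etc/docker/daemon.json (the binary path assumes a standard package install), followed by a Docker daemon restart:

{
    "default-runtime": "sysbox-runc",
    "runtimes": {
        "sysbox-runc": {
            "path": "/usr/bin/sysbox-runc"
        }
    }
}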

docker run -ti  ubuntu:latest

docker inspect shows that everything is fine:

docker inspect 4921e6bd074a|grep Runtime
            "Runtime": "sysbox-runc",
            "CpuRealtimeRuntime": 0, 

Docker-compose does not work:

docker-compose.test.yml

version: "3.1"

services:
  ubuntu:
    image: ubuntu:latest
    restart: always
docker git:(master) ✗ docker-compose -f docker-compose.test.yml up                                                                
Starting docker_ubuntu_1 ... error

ERROR: for docker_ubuntu_1  Cannot start service ubuntu: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"process_linux.go:504: handleReqOp caused \\\"rootfs_init_linux.go:249: bind mounting /var/lib/docker/containers/744072f486fdffb8abf303ef4467267de1f53defaa98e391a507930a3336b06c/mounts/shm to dev/shm caused \\\\\\\"lstat /var/lib/docker/containers/744072f486fdffb8abf303ef4467267de1f53defaa98e391a507930a3336b06c/mounts: permission denied\\\\\\\"\\\"\"": unknown

ERROR: for ubuntu  Cannot start service ubuntu: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"process_linux.go:504: handleReqOp caused \\\"rootfs_init_linux.go:249: bind mounting /var/lib/docker/containers/744072f486fdffb8abf303ef4467267de1f53defaa98e391a507930a3336b06c/mounts/shm to dev/shm caused \\\\\\\"lstat /var/lib/docker/containers/744072f486fdffb8abf303ef4467267de1f53defaa98e391a507930a3336b06c/mounts: permission denied\\\\\\\"\\\"\"": unknown
ERROR: Encountered errors while bringing up the project.

I have tried running the command as root, but it changes nothing.
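
A useful data point for debugging (a sketch; the container ID is taken from the error above, and the path may be gone once the failed container is cleaned up) would be the ownership and mode of the path the runtime fails to lstat:

sudo ls -ld /var/lib/docker/containers/744072f486fdffb8abf303ef4467267de1f53defaa98e391a507930a3336b06c/mounts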

Bug: Sysbox v0.2.1 Error response from daemon on Ubuntu 18.04

After building from source and running the given example:

docker run --runtime=sysbox-runc -it debian

I get the following error:

docker: Error response from daemon: OCI runtime create failed: failed to register with sysbox-mgr: failed to invoke Register via grpc: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/sysbox/sysmgr.sock: connect: no such file or directory": unknown.
ERRO[0000] error waiting for container: context canceled 
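
The error indicates Docker cannot reach the sysbox-mgr socket, which suggests the Sysbox daemons are not running. A quick check (assuming the systemd units shipped with the package; when building from source the daemons may need to be started manually):

sudo systemctl status sysbox-mgr sysbox-fs
ls -l /run/sysbox/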

Docker swarm service fails with published ports in sysbox container

Hi

Within a sysbox container, docker swarm services fail to launch if told to publish ports.

I'm setting up my environment on Ubuntu 19.10 (GNU/Linux 5.3.0-64-generic x86_64), running sysbox 0.2.0-0.ubuntu-eoan (installed from your deb package), like this:

$ docker run --rm -it --runtime sysbox-runc nestybox/alpine-docker:latest sh
# dockerd >/tmp/docker.log 2>&1 &
# docker swarm init
Swarm initialized: current node (extmf6chuqym48w5wia3uzvd0) is now a manager.

Now this service create works:

/ # docker service create --name test --replicas 1 amd64/debian:buster bash -c 'sleep 86400'
tou0md7naand0y1m4jsku57dw
overall progress: 1 out of 1 tasks 
1/1: running   [==================================================>] 
verify: Service converged 

But this does not:

# docker service create --name test --replicas 2 -p 10000 amd64/debian:buster bash -c 'sleep 86400'
l3395q4e70n2hwukcjxgipcts
overall progress: 0 out of 2 tasks 
1/2: failed to create new osl sandbox: namespace creation reexec command failed… 
2/2: failed to create new osl sandbox: namespace creation reexec command failed… 

Here are the contents of docker.log:

time="2020-09-26T15:25:33.819736125Z" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=172.17.0.2 Adv-addr=172.17.0.2 Data-addr= Remote-addr-list=[] MTU=1500"
time="2020-09-26T15:25:33.820028978Z" level=info msg="New memberlist node - Node:c513d71cd7a2 will use memberlist nodeID:58131b79aaa6 with config:&{NodeID:58131b79aaa6 Hostname:c513d71cd7a2 BindAddr:0.0.0.0 AdvertiseAddr:172.17.0.2 BindPort:0 Keys:[[159 142 20 25 168 232 5 122 170 0 227 50 29 31 220 13] [39 213 80 118 185 7 165 134 169 20 47 229 222 174 154 239] [179 18 46 196 81 153 61 145 17 154 194 156 228 209 56 144]] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2020-09-26T15:25:33.820845460Z" level=info msg="Node 58131b79aaa6/172.17.0.2, joined gossip cluster"
time="2020-09-26T15:25:33.821006266Z" level=info msg="Node 58131b79aaa6/172.17.0.2, added to nodes list"
time="2020-09-26T15:25:33.871868881Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh1" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh1: no such file or directory"
time="2020-09-26T15:25:33.891259538Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh2" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh2: no such file or directory"
time="2020-09-26T15:25:33.911785143Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh3" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh3: no such file or directory"
time="2020-09-26T15:25:33.917030579Z" level=error msg="Failed creating ingress network: failed to create new osl sandbox: namespace creation reexec command failed: fork/exec /proc/self/exe: operation not permitted"
time="2020-09-26T15:25:33.917114942Z" level=warning msg="Peer operation failed:Unable to find the peerDB for nid:aq30ph66ai8zarjasj1csf3u8 op:&{3 aq30ph66ai8zarjasj1csf3u8  [] [] [] [] false false false func1}"
time="2020-09-26T15:25:51.334093694Z" level=error msg="fatal task error" error="failed to create new osl sandbox: namespace creation reexec command failed: fork/exec /proc/self/exe: operation not permitted" module=node/agent/taskmanager node.id=ewae02c7c9x8nptme5fsaay9o service.id=ozpi6gug8gk7npevwf0c95ypb task.id=qe11zd763pb35jzh9bvjqp6j1
time="2020-09-26T15:25:51.334179564Z" level=warning msg="Peer operation failed:Unable to find the peerDB for nid:aq30ph66ai8zarjasj1csf3u8 op:&{3 aq30ph66ai8zarjasj1csf3u8  [] [] [] [] false false false func1}"
time="2020-09-26T15:25:51.513798213Z" level=error msg="fatal task error" error="failed to create new osl sandbox: namespace creation reexec command failed: fork/exec /proc/self/exe: operation not permitted" module=node/agent/taskmanager node.id=ewae02c7c9x8nptme5fsaay9o service.id=ozpi6gug8gk7npevwf0c95ypb task.id=kbhxn5zturzg38xy9mv7dh49m
time="2020-09-26T15:25:51.513879294Z" level=warning msg="Peer operation failed:Unable to find the peerDB for nid:aq30ph66ai8zarjasj1csf3u8 op:&{3 aq30ph66ai8zarjasj1csf3u8  [] [] [] [] false false false func1}"
time="2020-09-26T15:25:56.515898859Z" level=error msg="fatal task error" error="failed to create new osl sandbox: namespace creation reexec command failed: fork/exec /proc/self/exe: operation not permitted" module=node/agent/taskmanager node.id=ewae02c7c9x8nptme5fsaay9o service.id=ozpi6gug8gk7npevwf0c95ypb task.id=jwe3o66ru0pomh7nrc2gmyluh
time="2020-09-26T15:25:56.515920086Z" level=warning msg="Peer operation failed:Unable to find the peerDB for nid:aq30ph66ai8zarjasj1csf3u8 op:&{3 aq30ph66ai8zarjasj1csf3u8  [] [] [] [] false false false func1}"
time="2020-09-26T15:26:01.517817277Z" level=error msg="fatal task error" error="failed to create new osl sandbox: namespace creation reexec command failed: fork/exec /proc/self/exe: operation not permitted" module=node/agent/taskmanager node.id=ewae02c7c9x8nptme5fsaay9o service.id=ozpi6gug8gk7npevwf0c95ypb task.id=qkidiwmj6v3g4yvmanbdo4kiq
time="2020-09-26T15:26:01.518010223Z" level=warning msg="Peer operation failed:Unable to find the peerDB for nid:aq30ph66ai8zarjasj1csf3u8 op:&{3 aq30ph66ai8zarjasj1csf3u8  [] [] [] [] false false false func1}"
time="2020-09-26T15:26:06.521589054Z" level=error msg="fatal task error" error="failed to create new osl sandbox: namespace creation reexec command failed: fork/exec /proc/self/exe: operation not permitted" module=node/agent/taskmanager node.id=ewae02c7c9x8nptme5fsaay9o service.id=ozpi6gug8gk7npevwf0c95ypb task.id=e2gpsmc8j8bfah5e37zwhbfox
time="2020-09-26T15:26:06.521648916Z" level=warning msg="Peer operation failed:Unable to find the peerDB for nid:aq30ph66ai8zarjasj1csf3u8 op:&{3 aq30ph66ai8zarjasj1csf3u8  [] [] [] [] false false false func1}"
time="2020-09-26T15:26:06.584220810Z" level=warning msg="underweighting node ewae02c7c9x8nptme5fsaay9o for service ozpi6gug8gk7npevwf0c95ypb because it experienced 5 failures or rejections within 5m0s" module=node node.id=ewae02c7c9x8nptme5fsaay9o

It does seem that /proc/sys/net/ipv4/neigh/default is not available in the sysbox container, while it is available on the host.
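
A quick way to confirm that observation (a sketch) is to compare the path on the host vs. inside a Sysbox container:

ls /proc/sys/net/ipv4/neigh/default/                                                # host: gc_thresh1, gc_thresh2, gc_thresh3, ...
docker run --rm --runtime=sysbox-runc alpine ls /proc/sys/net/ipv4/neigh/default    # fails in the affected setup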

Sysbox fails to honor tmpfs mounts over the container's /tmp on systemd-based containers

For container images that have systemd inside, sysbox automatically mounts tmpfs on the container's /tmp (to satisfy a systemd requirement). The tmpfs mount is hardcoded with an upper limit of 64MB.

In some cases, users may want that tmpfs mount to be larger. The way to do this is to have users explicitly mount tmpfs over the container's /tmp. For example:

docker run --runtime=sysbox-runc -it --rm --tmpfs /tmp:rw,noexec,nosuid,size=131072k nestybox/ubuntu-bionic-systemd-docker

This however does not currently work, as sysbox has a bug in which it's ignoring the tmpfs mount over /tmp.

Let's fix this.
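
Once fixed, the behavior should be easy to verify from outside the container (a sketch; 'tmp-test' is a hypothetical container name). With the 131072k flag above, /tmp should report ~128M rather than the hardcoded 64M:

docker run --runtime=sysbox-runc -d --rm --name tmp-test --tmpfs /tmp:rw,noexec,nosuid,size=131072k nestybox/ubuntu-bionic-systemd-docker
docker exec tmp-test df -h /tmp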

Add support for k3OS + Sysbox

It would be very useful if there were a way to install sysbox on k3OS. k3OS uses Ubuntu's LTS kernel, which is currently 5.4.

However, it does not have apt or dpkg for installing packages.

It seems that k3OS uses runc instead of Docker, due to k3s.

Extend Sysbox support to the RPM camp (Red Hat, CentOS, Fedora)

Sysbox is currently only supported on the Ubuntu distribution. Our goal is to make Sysbox a distro-agnostic runtime, but for that to happen these distributions must be capable of running relatively recent kernels, so that Sysbox's kernel dependencies can be met.

In Fedora's case (> 31) these dependencies are already satisfied (kernel >= 5.5); however, on CentOS and Red Hat we may need to wait a bit longer (unless a user is willing to upgrade the kernel without vendor support).

Leaving kernel compatibility aside, our goal with this EPIC is to keep track of the overall implementation effort that needs to be carried out for Sysbox to support these distributions.

Sysbox integration-test failures due to interface MTU mismatch

The problem affects many of Sysbox's integration testcases in which large packets need to be exchanged (e.g. docker pull). At this point it is not clear whether the problem can be reproduced in regular system containers (L1) and/or their child app containers.

At a high level the problem's symptom is very obvious: network transactions initiated by an L2 sys container (within Sysbox's privileged test container) that require the exchange of large packets can stall during "docker pull" execution.

The following elements are required for the problem to reproduce:

  • The host's egress-facing network iface (A) is configured with a lower-than-expected MTU (1460 bytes) -- a typical scenario in server farms.

  • The host's docker0 bridge iface (B) is configured with an MTU of 1500 bytes.

  • L1's egress iface (C) is configured with an MTU of 1500 bytes.

  • L1's docker0 iface (D) is configured with an MTU of 1500 bytes.

  • IP packets generated from app containers have the "don't-fragment" (DF) bit set.

From a topological perspective, this is the path to be traversed by every packet generated within the L2 container:

data-center-fabric <--> host's egress-iface (A) <--> host's docker0 iface (B) <--> L1's egress-iface (C) <--> L1's docker0 iface (D)

Notice that the MTU along this path is always 1500 bytes, except on the last network element (from the L2 container's perspective), where the MTU of A is 1460 bytes.

As per the IETF PMTU-discovery specs, upon arrival of a datagram larger than the egress iface's MTU, if the "DF" bit is set in its L3 header, the packet must be dropped and an ICMP "Fragmentation Needed" message sent back to hint to the origin that it must reduce the packet size in subsequent attempts. The network stack at the origin should then create an entry to keep track of the discovered MTU value associated with the remote IP peer. At the same time, the network stack must notify the application of the need to adjust its PDU size; at this point the application typically reduces the size of subsequent packets so that they can reach the remote end.

In our scenario, we expect interface B (the host's docker0) to be the spot at which the ICMP fragmentation messages are generated, as this is where the MTU discrepancy lies. That's precisely what we observe during problem reproduction:

16:24:11.481123 IP 172.17.0.1 > 172.18.0.6: ICMP ec2-52-5-11-128.compute-1.amazonaws.com unreachable - need to frag (mtu 1460), length 556
16:24:11.551969 IP ec2-52-5-11-128.compute-1.amazonaws.com.https > 172.18.0.6.55310: Flags [.], ack 1587, win 119, options [nop,nop,TS val 4082793925 ecr 1493915319], length 0
16:24:11.632656 IP 172.18.0.6.55310 > ec2-52-5-11-128.compute-1.amazonaws.com.https: Flags [P.], seq 3035:3104, ack 5326, win 501, options [nop,nop,TS val 1493915471 ecr 4082793925], length 69
16:24:11.662150 IP ec2-52-5-11-128.compute-1.amazonaws.com.https > 172.18.0.6.55310: Flags [.], ack 1587, win 128, options [nop,nop,TS val 4082794035 ecr 1493915319,nop,nop,sack 1 {3035:3104}], length 0
16:24:11.662203 IP 172.18.0.6.55310 > ec2-52-5-11-128.compute-1.amazonaws.com.https: Flags [.], seq 1587:3035, ack 5326, win 501, options [nop,nop,TS val 1493915500 ecr 4082794035], length 1448
16:24:11.662249 IP 172.17.0.1 > 172.18.0.6: ICMP ec2-52-5-11-128.compute-1.amazonaws.com unreachable - need to frag (mtu 1460), length 556
16:24:11.900648 IP 172.18.0.6.55310 > ec2-52-5-11-128.compute-1.amazonaws.com.https: Flags [.], seq 1587:3035, ack 5326, win 501, options [nop,nop,TS val 1493915739 ecr 4082794035], length 1448
16:24:11.900764 IP 172.17.0.1 > 172.18.0.6: ICMP ec2-52-5-11-128.compute-1.amazonaws.com unreachable - need to frag (mtu 1460), length 556
16:24:12.392772 IP 172.18.0.6.55310 > ec2-52-5-11-128.compute-1.amazonaws.com.https: Flags [.], seq 1587:3035, ack 5326, win 501, options [nop,nop,TS val 1493916231 ecr 4082794035], length 144

However, we don't see the source application ("docker pull" in this case) reacting to this message, so it continues generating packets of the same size, and communication ultimately stalls.
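
Independent of the PMTUD investigation, a common mitigation in such environments (not necessarily the fix adopted for this issue) is to align the inner Docker daemon's MTU with the smallest MTU along the path, e.g. inside the L1 sys container:

echo '{ "mtu": 1460 }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker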

Podman + Sysbox Integration

Sysbox should support Podman.

Podman is RedHat's default container-management tool. Podman is especially useful during development, as it can simply launch containers (through a Docker-like CLI), and it can easily convert container-definition files into K8s YAML recipes, which is quite handy for quickly testing new pods within a K8s cluster. In addition, Podman supports per-container user-namespace UID mappings (see the sketch after the list below), which gels well with Sysbox.

Scott McCarthy and Giuseppe Scrivano from RedHat suggested that we consider:

  • Using Podman to deploy system containers with Sysbox.

  • Enabling system containers to run Podman inside.
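
For reference, Podman's per-container user-namespace mapping (independent of Sysbox) looks like this under rootful Podman:

podman run --uidmap 0:100000:65536 --gidmap 0:100000:65536 --rm alpine cat /proc/self/uid_map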

K8s + Sysbox: mount sysfs fails (EPERM) during pod creation

I ran into this one while trying to scope the level of effort required to launch K8s pods through the Sysbox runtime.

I initially stumbled into issue #66, which hasn't been properly fixed yet, and then reproduced the problem described herein. Notice that even though the symptoms are identical (i.e., unable to mount sysfs), the cause seems to be different in this case, which is why we are tracking this issue separately.

After multiple attempts at bisecting the container's OCI spec, I was able to identify the spec instruction causing this problem; however, the low-level root cause has not been found yet.

The problem is reproduced whenever a sandbox container (e.g. "pause") is instantiated by the K8s master. There's nothing especially relevant in the spec of this container, except for the fact that a "path" element is passed as part of the network-namespace entry:

        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            },
            {
                "path": "/var/run/netns/cni-ca69f110-38f9-4be8-dca4-10cbb16f8695",
                "type": "network"
            }
        ],

As per the OCI specification, a compliant runtime is expected to place the to-be-created container in the network namespace indicated by this file (which in turn represents a bind-mount of a "/proc/<pid>/ns/net"):

path (string, OPTIONAL) - namespace file. This value MUST be an absolute path in the runtime mount namespace. The runtime MUST place the container process in the namespace associated with that path. The runtime MUST generate an error if path is not associated with a namespace of type type. If path is not specified, the runtime MUST create a new container namespace of type type.

We can re-create the observed behavior by following the steps indicated below ...

Let's start by creating the shared network namespace that our pod will be part of:

rmolina@heavy-vm-bionic:~/wsp$ sudo ip netns add test-ns-1

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ ls -li /run/netns/
total 0
4026532321 -r--r--r-- 1 root root 0 May 13 02:31 test-ns-1
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ findmnt
...
├─/run                                tmpfs                  tmpfs       rw,nosuid,noexec,relatime,size=815200k,mode=755
│ ├─/run/lock                         tmpfs                  tmpfs       rw,nosuid,nodev,noexec,relatime,size=5120k
│ ├─/run/user/1000                    tmpfs                  tmpfs       rw,nosuid,nodev,relatime,size=815196k,mode=700,uid=1000,gid=1000
│ ├─/run/user/1001                    tmpfs                  tmpfs       rw,nosuid,nodev,relatime,size=815196k,mode=700,uid=1001,gid=1001
│ ├─/run/netns/test-ns-1              nsfs[net:[4026532321]] nsfs        rw
│ └─/run/netns                        tmpfs[/netns]          tmpfs       rw,nosuid,noexec,relatime,size=815200k,mode=755
│   └─/run/netns/test-ns-1            nsfs[net:[4026532321]] nsfs        rw
├─/boot                               /dev/sda1              ext4        rw,relatime
...

Let's now add this network-ns file to our own hand-crafted spec:

		"namespaces": [
			{
				"type": "pid"
			},
		        {
			        "path": "/var/run/netns/test-ns-1",
				"type": "network"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			},
			{
				"type": "mount"
			},
			{
				"type": "cgroup"
			}
		],

The problem is reproduced right away:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo sysbox-runc run ubuntu-1
container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"rootfs_linux.go:58: setting up rootfs mounts caused \\\"rootfs_linux.go:928: mounting \\\\\\\"sysfs\\\\\\\" to rootfs \\\\\\\"/home/rmolina/wsp/05-12-2020/sysbox/ubuntu/rootfs\\\\\\\" at \\\\\\\"sys\\\\\\\" caused \\\\\\\"operation not permitted\\\\\\\"\\\"\""
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

As expected, the problem is not reproduced with upstream runc in its default configuration (no user-ns); if it were, all K8s deployments would fail as well. However, the exact same issue is reproduced the moment we request user-ns creation.

There is no issue with runc when relying on the above spec:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo runc run ubuntu-1
#

Let's modify the spec to explicitly activate user-ns creation:

root@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu# cat /etc/subuid
lxd:100000:65536
root:100000:65536
vagrant:165536:65536
rmolina:231072:65536
sysbox:296608:268435456

root@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu# cat config.json
...
        "linux": {
        "uidMappings": [
            {
                "hostID": 296608,
                "containerID": 0,
                "size": 268435456
            }
        ],
        "gidMappings": [
            {
                "hostID": 296608,
                "containerID": 0,
                "size": 268435456
            }
        ],
            "namespaces": [
                        {
                                "type": "pid"
                        },
                        {
                                "path": "/var/run/netns/test-ns-1",
                                "type": "network"
                        },
                        {
                                "type": "ipc"
                        },
                        {
                                "type": "uts"
                        },
                        {
                                "type": "mount"
                        },
                        {
                                "type": "user"
                        },
                        {
                                "type": "cgroup"
                        }
                ],
...

Trying runc once again shows the same problem reported by sysbox-runc:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo runc run ubuntu-1
WARN[0000] exit status 1
ERRO[0000] container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"sysfs\\\" to rootfs \\\"/home/rmolina/wsp/05-12-2020/sysbox/ubuntu/rootfs\\\" at \\\"/sys\\\" caused \\\"operation not permitted\\\"\""
container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"sysfs\\\" to rootfs \\\"/home/rmolina/wsp/05-12-2020/sysbox/ubuntu/rootfs\\\" at \\\"/sys\\\" caused \\\"operation not permitted\\\"\""
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

The problem seems to be caused by some sort of kernel limitation or requirement imposed on user namespaces and their relationship with network namespaces (possibly the kernel rule that sysfs may only be mounted from a user namespace that also owns the caller's network namespace). Note that the issue is also reproduced when leaving the runtimes out of the equation:


<-- With network-ns:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo unshare -m -u -i -n -p -U -f -r bash -c "mkdir /root/sys && mount -t sysfs sysfs /root/sys"
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ echo $?
0

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo rm -rf /root/sys

<-- No network-ns:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo unshare -m -u -i -p -U -f -r bash -c "mkdir /root/sys && mount -t sysfs sysfs /root/sys"
mount: /root/sys: permission denied.
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

More details to come ...

The Vagrant Docker provider fails to stop Docker containers when using Sysbox

Playing around with Vagrant, I was easily able to create a container using the Docker provider + Sysbox. Here is the Vagrantfile:

# encoding: utf-8
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.provider "docker" do |d|
    d.image = "alpine"
    d.cmd = ["tail", "-f", "/dev/null"]
    d.create_args = ["--runtime=sysbox-runc"]
  end
end

Running vagrant up works without a problem.

However, vagrant halt or vagrant destroy -f fail with a cryptic error:

ctalledo@nestybox-srv-01:~/vagrant/sysbox$ vagrant destroy -f                                                                                                                                                                                                                                                       [228/9101]
==> default: Stopping container...                                                                                                                             
==> default: Deleting the container...                                                                                                                         
Traceback (most recent call last):                                                                                                                             
        84: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/batch_action.rb:86:in `block (2 levels) in run'
        83: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/machine.rb:198:in `action'         
        82: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/machine.rb:198:in `call'                          
        81: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/environment.rb:613:in `lock'         
        80: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/machine.rb:212:in `block in action'  
        79: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/machine.rb:240:in `action_raw'                    
        78: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `run'       
        77: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/busy.rb:19:in `busy'                         
        76: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `block in run'
        75: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builder.rb:116:in `call'                   
        74: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'  
        73: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builtin/call.rb:53:in `call'                                  
        72: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `run'                      
        71: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/busy.rb:19:in `busy'                  
        70: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `block in run'
        69: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builder.rb:116:in `call'         
        68: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'          
        67: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:107:in `block in finalize_action'
        66: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        65: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builtin/call.rb:53:in `call'
        64: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `run'
        63: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/busy.rb:19:in `busy'
        62: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `block in run'
        61: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builder.rb:116:in `call'
        60: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        59: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:107:in `block in finalize_action'
        58: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        57: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:107:in `block in finalize_action'
        56: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        55: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builtin/call.rb:53:in `call'
        54: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `run'
        53: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/busy.rb:19:in `busy'
        52: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `block in run'
        51: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builder.rb:116:in `call'
        50: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        49: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:107:in `block in finalize_action'
        48: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        47: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builtin/call.rb:53:in `call'
        46: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `run'
        45: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/busy.rb:19:in `busy'
        44: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `block in run'
        43: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builder.rb:116:in `call'
        42: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        41: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:107:in `block in finalize_action'
        40: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        39: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builtin/config_validate.rb:25:in `call'
        38: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        37: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builtin/provisioner_cleanup.rb:25:in `call'
        36: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        35: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builtin/env_set.rb:19:in `call'
        34: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        33: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builtin/call.rb:53:in `call'
        32: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `run'
        31: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/busy.rb:19:in `busy'
        30: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `block in run'
        29: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builder.rb:116:in `call'
        28: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        27: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:107:in `block in finalize_action'
        26: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        25: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:107:in `block in finalize_action'
         24: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        23: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builtin/call.rb:53:in `call'
        22: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `run'
        21: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/busy.rb:19:in `busy'
        20: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/runner.rb:89:in `block in run'
        19: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/builder.rb:116:in `call'
        18: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        17: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:107:in `block in finalize_action'
        16: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        15: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/plugins/providers/docker/action/stop.rb:16:in `call'
        14: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        13: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:107:in `block in finalize_action'
        12: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
        11: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/plugins/providers/docker/action/host_machine_sync_folders_disable.rb:18:in `call'
        10: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/action/warden.rb:48:in `call'
         9: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/plugins/providers/docker/action/destroy.rb:23:in `call'
         8: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/plugins/providers/docker/driver.rb:174:in `rm'
         7: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/plugins/providers/docker/driver.rb:120:in `created?'
         6: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/plugins/providers/docker/driver.rb:280:in `execute'
         5: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/plugins/providers/docker/executor/local.rb:16:in `execute'
         4: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/busy.rb:19:in `busy'
         3: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/plugins/providers/docker/executor/local.rb:17:in `block in execute'
         2: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/subprocess.rb:22:in `execute'
         1: from /opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/subprocess.rb:62:in `execute'
/opt/vagrant/embedded/gems/2.2.9/gems/vagrant-2.2.9/lib/vagrant/util/subprocess.rb:62:in `pwd': No such file or directory - getcwd (Errno::ENOENT)

I am not quite sure what's going on, but in Linux this type of error generally occurs when a command is executed from within a directory that no longer exists. That does not appear to be the case here, at least on the surface.

Speculating here: I noticed that when Vagrant runs the container, it mounts the host directory holding the Vagrantfile into the container's /vagrant directory. This causes Sysbox to mount shiftfs on the host directory where the Vagrantfile is located, which in turn makes that directory non-exec. I wonder if this is playing a role somehow.

Also: after the error occurs, the container is stopped but not removed. It must be removed explicitly with "docker rm".

Feature request: release binaries on GitHub Releases

It would be very cool if we could have the binaries (.deb?) uploaded to GitHub Releases, so newcomers could easily install Sysbox on their machines without having to build it themselves, even though the build process is quite simple.

I suggest the entire process happen automatically under CI, such as GitHub Actions.

In case you're interested in enforcing Semantic Versioning, I really suggest you take a look at semantic-release.
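
As a sketch of what the CI publishing step could look like (tag, file name, and notes are hypothetical), the GitHub CLI makes the upload a one-liner:

gh release create v0.3.0 sysbox-ce_0.3.0.linux_amd64.deb --title "v0.3.0" --notes "See CHANGELOG"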

sysbox-mgr testcase failure due to unexpected linux-headers (CentOS)

This was found as part of adding support for CentOS in Sysbox. (Refs #62)

A couple of sysbox-mgr testcases fail on CentOS. Sysbox is just complaining that the expected 'modules' folder mounted from the host fs has not been found; probably just a consequence of the different naming convention utilized by CentOS vs. the Ubuntu one we are used to.

ok 20 kubeletVolMgr non-persistence
ok 21 kubeletVolMgr consecutive restart
ok 22 kubeletVolMgr sync-out
not ok 23 kernel lib-module mount
# (in test file tests/sysmgr/mount.bats, line 22)
#   `[[ "$output" =~ "on /lib/modules/${kernel_rel}".+"ro,relatime" ]]' failed
# docker run --runtime=sysbox-runc -d --rm nestybox/alpine-docker-dbg:latest tail -f /dev/null (status=0):
# a779d41adcfe924f829c97807df7138939ca28872ffe84e6b53d227b4a2e45e4
# docker ps --format {{.ID}} (status=0):
# a779d41adcfe
# docker exec a779d41adcfe sh -c mount | grep "/lib/modules/5.7.9-1.el8.elrepo.x86_64" (status=0):
# /dev/vda1 on /lib/modules/5.7.9-1.el8.elrepo.x86_64 type xfs (ro,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
not ok 24 kernel headers mounts
# (in test file tests/sysmgr/mount.bats, line 40)
#   `[[ "${lines[0]}" =~ "on /usr/src/linux-headers-${kernel_rel}".+"ro,relatime" ]]' failed
# docker run --runtime=sysbox-runc -d --rm nestybox/alpine-docker-dbg:latest tail -f /dev/null (status=0):
# 0fccabae938c797b948dfd734d05296a4c1ddc2feb57f7047d21862e69ea4804
# docker ps --format {{.ID}} (status=0):
# 0fccabae938c
# a779d41adcfe
# docker exec 0fccabae938c sh -c mount | grep "/usr/src/linux-headers-5.7.9-1.el8.elrepo.x86_64" (status=0):
# /dev/vda1 on /usr/src/linux-headers-5.7.9-1.el8.elrepo.x86_64 type xfs (ro,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
# /dev/vda1 on /usr/src/linux-headers-5.7.9-1.el8.elrepo.x86_64 type xfs (ro,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
ok 25 # skip (needs UID shifting) shiftfsMgr basic
ok 26 # skip (needs UID shifting) shiftfsMgr multiple syscont

For the time being, I have worked around the issue with a soft-link:

[vagrant@centos-8-vm sysbox]$ sudo ln -sf /usr/src/kernels/5.7.9-1.el8.elrepo.x86_64 /usr/src/linux-headers-5.7.9-1.el8.elrepo.x86_64
[vagrant@centos-8-vm sysbox]$ sudo ls -lrt /usr/src/
total 0
drwxr-xr-x. 2 root root  6 May 11  2019 debug
drwxr-xr-x. 3 root root 39 Jul 19 06:34 kernels
lrwxrwxrwx. 1 root root 42 Jul 19 06:52 linux-headers-5.7.9-1.el8.elrepo.x86_64 -> /usr/src/kernels/5.7.9-1.el8.elrepo.x86_64
[vagrant@centos-8-vm sysbox]$
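A slightly more general form of the workaround, in case we need it on other RHEL-family boxes (just a sketch; not part of any installer yet):

#!/bin/sh
# RHEL-family distros place kernel headers under /usr/src/kernels/<rel>,
# while the tests expect the Debian-style /usr/src/linux-headers-<rel>.
kernel_rel="$(uname -r)"
if [ -d "/usr/src/kernels/${kernel_rel}" ] && \
   [ ! -e "/usr/src/linux-headers-${kernel_rel}" ]; then
    sudo ln -sf "/usr/src/kernels/${kernel_rel}" \
                "/usr/src/linux-headers-${kernel_rel}"
fi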

Sysbox unable to build in GCP's Ubuntu VMs due to missing libseccomp.h

I'm able to consistently reproduce this one in GCP's Ubuntu-Bionic and Ubuntu-Focal VMs:

$ make sysbox
...
** Building sysbox **

docker run --privileged --rm --hostname sysbox-build --name sysbox-build -v /home/rodny/wsp/11-17-2020/sysbox:/root/nestybox/sysbox -v /pkg/mod:/go/pkg/mod -v /lib/modules/5.4.0-1021-gcp:/lib/modules/5.4.0-1021-gcp:ro -v /usr/include/linux/seccomp.h:/usr/include/linux/seccomp.h:ro -v /usr/src/linux-headers-5.4.0-1021-gcp:/usr/src/linux-headers-5.4.0-1021-gcp:ro -v /usr/src/linux-gcp-5.4-headers-5.4.0-1021:/usr/src/linux-gcp-5.4-headers-5.4.0-1021:ro sysbox-test /bin/bash -c "buildContainerInit sysbox-local"
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/usr/include/linux/seccomp.h\\\" to rootfs \\\"/var/lib/docker/overlay2/2f75d4b5e4395f0b4d3c4ad7b48867464892568de36c30112abf4287976ae8d6/merged\\\" at \\\"/var/lib/docker/overlay2/2f75d4b5e4395f0b4d3c4ad7b48867464892568de36c30112abf4287976ae8d6/merged/usr/include/linux/seccomp.h\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type.
Makefile:127: recipe for target 'sysbox' failed
make: *** [sysbox] Error 125

There seem to be two separate issues to fix here:

  1. The installer should add libseccomp-dev as a package dependency, as only then can we guarantee that /usr/include/libseccomp.h will be present (see the sketch after this list). This may not be the proper header to utilize, though, as we may need the seccomp.h that comes with the current kernel headers.

  2. There seems to be a mismatch between the libseccomp.h path that Sysbox expects (i.e., /usr/include/linux/libseccomp.h) and the one where Ubuntu installs this header (i.e., /usr/include/libseccomp.h).
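Until the installer handles the dependency, something along these lines unblocks the build on those VMs (a sketch for Ubuntu; the exact header location may vary by distro, hence the dpkg query):

$ sudo apt-get update && sudo apt-get install -y libseccomp-dev
$ dpkg -L libseccomp-dev | grep '\.h$'    # confirm where the headers actually landed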

Add support for Sysbox PODs

This enables orchestration of system containers by allowing K8s to create PODs powered by the Sysbox runtime (see the sketch after the list below).

As a start, we want to support:

  • K8s + Docker
  • K8s + containerd
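For reference, the user-facing side of this would presumably go through K8s' RuntimeClass mechanism, roughly as sketched below (assuming the handler ends up being registered as sysbox-runc; names and image are illustrative):

$ kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: sysbox-runc
handler: sysbox-runc
---
apiVersion: v1
kind: Pod
metadata:
  name: syscont-pod
spec:
  runtimeClassName: sysbox-runc
  containers:
  - name: syscont
    image: nestybox/ubuntu-bionic-systemd-docker
EOF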

Implement GPU passthrough functionality to provide hw-acceleration to inner-containers

Our goal here is to allow Sysbox container hierarchies to make use of the hardware-acceleration capabilities offered by system devices. Device 'passthrough' is a concept that applies naturally to system containers, and as such it has been on our minds since Sysbox's early days, but it recently came up again as part of a conversation with @jamierajewski (thanks for that).

A couple of scenarios where this would be useful:

  • Development environments: A sysbox container would wrap inner containers with GUI requirements.
  • K8s environments: A sysbox container could act as a K8s node and have PODs running apps with GUI demands.

Even though most of the concepts described here apply to any GPU, we will limit the scope of this issue to Nvidia GPUs; let's create separate issues for other GPUs.

At high-level, these are some of the requirements that Sysbox would need to meet:

  • Sysbox should identify the GPU devices in the host and expose them automatically to the sysbox containers (through 'devices' oci-spec attribute).

  • Sysbox should provide a mechanism that allows cuda-toolkit and related nvidia tools, which are required at host-level, to be shared (bind-mounted?) with sysbox containers. This would address two problems:

    1. Cuda-toolkit and drivers installed within sysbox containers must fully match those installed in the host (end-user wouldn't know which version to fetch at image build-time).
    2. Cuda packages are quite large, which would bloat sysbox images.

  • Sysbox should allow proper execution of the nvidia-container-runtime within the system containers; that is, Sysbox should expose all the abstractions required for the nvidia runtime to operate as if it were running on the host.

This list of requirements will obviously change as we further understand the problem. The sketch below illustrates the kind of manual setup we would be automating.
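To make the first two requirements concrete, this is roughly the manual, per-container setup that Sysbox would need to automate (illustrative only; device nodes and library paths are driver-dependent, and whether all of this works under sysbox-runc today is part of what this issue must determine):

$ docker run -it --rm \
      --device=/dev/nvidia0 --device=/dev/nvidiactl --device=/dev/nvidia-uvm \
      -v /usr/local/cuda:/usr/local/cuda:ro \
      -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi:ro \
      ubuntu:20.04 bash

(nvidia-smi would additionally need the host's libnvidia-ml library bind-mounted; omitted here for brevity.)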

Istio-proxy forwarding issue in K8s' nested containers setup

[ originally reported by @kylecarbs ]

In a K8s setup with a POD running a docker-in-docker (DinD) image, traffic generated within the inner containers is blackholed in the host's network namespace. No forwarding issue is observed when the traffic is generated within the (privileged) POD itself.

Environment:

rodny@gke-cluster-1-default-pool-e78f2962-q6x8:~/bin$ uname -r
5.4.0-1024-gcp

rodny@gke-cluster-1-default-pool-e78f2962-q6x8:~/bin$ lsb_release -d
Description:	Ubuntu 18.04.5 LTS

rodny@gke-cluster-1-default-pool-e78f2962-q6x8:~/bin$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-gke.1504", GitCommit:"17061f5bd4ee34f72c9281d49f94b4f3ac31ac25", GitTreeState:"clean", BuildDate:"2020-10-19T17:02:11Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-gke.1504", GitCommit:"17061f5bd4ee34f72c9281d49f94b4f3ac31ac25", GitTreeState:"clean", BuildDate:"2020-10-19T17:00:22Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}

In this setup, Istio's sidecar-injection is automatically configured for the 'default' namespace, so istio-init properly sets up the iptables rules as expected:

root@sysbox-in-docker-6d8fc47bb6-xgz6v:/# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
LOG        all  --  anywhere             anywhere             LOG level debug prefix "rodny-nat-prerouting "
ISTIO_INBOUND  tcp  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target     prot opt source               destination
LOG        all  --  anywhere             anywhere             LOG level debug prefix "rodny-nat-input "

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
LOG        all  --  anywhere             anywhere             LOG level debug prefix "rodny-nat-output "
ISTIO_OUTPUT  tcp  --  anywhere             anywhere
DOCKER     all  --  anywhere            !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
LOG        all  --  anywhere             anywhere             LOG level debug prefix "rodny-nat-postrouting "
MASQUERADE  all  --  172.24.0.0/16        anywhere

Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

Chain ISTIO_INBOUND (1 references)
target     prot opt source               destination
RETURN     tcp  --  anywhere             anywhere             tcp dpt:ssh
RETURN     tcp  --  anywhere             anywhere             tcp dpt:15020
ISTIO_IN_REDIRECT  tcp  --  anywhere             anywhere

Chain ISTIO_IN_REDIRECT (2 references)
target     prot opt source               destination
REDIRECT   tcp  --  anywhere             anywhere             redir ports 15006

Chain ISTIO_OUTPUT (1 references)
target     prot opt source               destination
RETURN     all  --  127.0.0.6            anywhere
ISTIO_IN_REDIRECT  all  --  anywhere            !localhost
RETURN     all  --  anywhere             anywhere             owner UID match 1337
RETURN     all  --  anywhere             anywhere             owner GID match 1337
RETURN     all  --  anywhere             localhost
ISTIO_REDIRECT  all  --  anywhere             anywhere

Chain ISTIO_REDIRECT (1 references)
target     prot opt source               destination
REDIRECT   tcp  --  anywhere             anywhere             redir ports 15001
root@sysbox-in-docker-6d8fc47bb6-xgz6v:/#

The problem is specifically with TCP traffic initiated from the inner containers, which is blackholed at the host level. It seems to be a direct consequence of this forwarding sequence:

  • The TCP SYN packet generated from the inner container is processed by the PREROUTING chain and redirected to Istio's inbound handler (port 15006). Note that the source address here (172.24.0.2) corresponds to the egress iface of the inner container, and 172.24.0.1 is docker0's interface within this POD.

    ... nat-prerouting IN=docker0 OUT= PHYSIN=vethfd9bdf3 MAC=02:42:c7:d0:5d:05:02:42:ac:18:00:02:08:00 SRC=172.24.0.2 DST=74.125.20.101

  • The packet exits the Istio logic and is now processed by the OUTPUT chain; however, it is now sourced from 127.0.0.6:

    ... nat-output IN= OUT=eth0 SRC=127.0.0.6 DST=74.125.20.101 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52031 DF PROTO=TCP SPT=51503 DPT=80 WINDOW=42600 RES=0x00 SYN URGP=0

  • The packet hits the POSTROUTING chain:

    ... nat-postrouting IN= OUT=eth0 SRC=127.0.0.6 DST=74.125.20.101 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52031 DF PROTO=TCP SPT=51503 DPT=80 WINDOW=42600 RES=0x00 SYN URGP=0

  • The packet ends up being discarded in the host network namespace, as '127.0.0.6' is not routable.

    ... kernel: IPv4: martian source 74.125.20.101 from 127.0.0.6, on dev cbr0

I have found a workaround that basically masquerades all traffic hitting the POSTROUTING chain with source-address == '127.0.0.6', but I feel that this may not be a proper / generic-enough solution for this problem.

$ iptables -t nat -A POSTROUTING -s 127.0.0.6 ! -o docker0 -j MASQUERADE

$ iptables -L -t nat
...
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 MASQUERADE  all  --  *      !docker0  172.24.0.0/16        0.0.0.0/0
    0     0 MASQUERADE  all  --  *      !docker0  127.0.0.6            0.0.0.0/0
...

The problem is clearly not a Sysbox issue, as traffic is dropped regardless of whether the inner container is launched with Sysbox or the regular runc. However, we are tracking it here since this is a common Sysbox deployment setup in K8s scenarios.
