tailhook / lithos

Process supervisor that supports linux containers

Home Page: http://lithos.readthedocs.org

License: MIT License

Makefile 0.62% Shell 0.60% Rust 98.78%

linux process supervisor rust containers

lithos's Issues

Fix error message on cgroup creation error

Currently the error message is just "can't create cgroup". We need to know which cgroup could not be created.
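A minimal sketch of what a more informative error could look like, assuming the cgroup is created with a plain mkdir inside the cgroup filesystem. `create_cgroup_dir` is a hypothetical helper written for illustration, not Lithos's actual code, and the message wording is only a suggestion:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: creates a cgroup directory and, on failure,
/// reports which path could not be created along with the OS error.
fn create_cgroup_dir(path: &Path) -> Result<(), String> {
    fs::create_dir(path)
        .map_err(|e: io::Error| format!("can't create cgroup {:?}: {}", path, e))
}
```

With this shape, the caller gets both the failing path and the underlying reason (permission denied, missing parent, and so on) in one message.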

The purpose of sandbox vs. process vs. container config separation?

I still don’t understand the high-level concept of Lithos’ configuration. Why is the configuration of a container split into three configs (sandbox, process and container), and specifically why are sandbox and process not a single config? What’s the idea behind this and how is it supposed to be used? Could you please provide an example of a non-trivial setup that demonstrates this?

In #8 (comment) you wrote:

Well, technically we have a concept of sandboxes just for that. I.e. you allow users to upload images to certain directory configured in a sandbox. And arrange some script to update their /etc/lithos/processes/<sandbox-name>.yaml. This allows them to freely update their images, add and remove services, but they can't escape a sandbox.

This makes sense, except for one thing – cgroup limits are configured only in the container config (inside the image). Cgroup limits are exactly what I’d like to enforce for users without letting them change them.

Lithos crashes when `reuse-port` is enabled

The traceback from a release build is not very useful, but here it is:

thread '<main>' panicked at 'called `Option::unwrap()` on a `None` value', ../src/libcore/option.rs:330
stack backtrace:
   1:     0x560e5ea4c3d0 - sys::backtrace::tracing::imp::write::h3675b4f0ca767761Xcv
   2:     0x560e5ea4f8fb - panicking::default_handler::_$u7b$$u7b$closure$u7d$$u7d$::closure.44519
   3:     0x560e5ea4f568 - panicking::default_handler::h18faf4fbd296d909lSz
   4:     0x560e5ea3c94c - sys_common::unwind::begin_unwind_inner::hfb5d07d6e405c6bbg1t
   5:     0x560e5ea3cdd8 - sys_common::unwind::begin_unwind_fmt::h8b491a76ae84af35m0t
   6:     0x560e5ea4b981 - rust_begin_unwind
   7:     0x560e5ea7f61f - panicking::panic_fmt::h98b8cbb286f5298alcM
   8:     0x560e5ea7f8f8 - panicking::panic::h4265c0105caa1121SaM
   9:     0x560e5e910006 - run::hb0fe367667823cd6nab
  10:     0x560e5e96ca05 - main::h163a2cd1736181e3R3b
  11:     0x560e5ea4f1c4 - sys_common::unwind::try::try_fn::h14622312129452522850
  12:     0x560e5ea4b90b - __rust_try
  13:     0x560e5ea4ec5b - rt::lang_start::h0ba42f7a8c46a626rKz
  14:     0x7f6749f94f44 - __libc_start_main
  15:     0x560e5e8fab78 - <unknown>

Lithos vs Vagga

Good day, @tailhook !

Tell me, please: how could Lithos be compared with Vagga?

Thanks!

Add support for memfd

Motivation

Sometimes we want to pass some values to the running process in a very lightweight way at runtime. Use cases include:

Enabling debug logging on a seemingly inaccessible process
Enabling runtime profiling
Throttling a heavily loaded process

Background

memfd is a Linux mechanism for communicating via shared memory in a safe way. Memfds are shared between processes as file descriptors. A memfd is basically an anonymous temporary file.
We already have support for passing file descriptors for TCP sockets. While it's a bit hacky, it's tremendously useful for some applications.

Solution

Lithos side:

Pass memfd to the process that contains some fixed-size-and-position equivalent of:

log-level: warn

Update log-level directly in the target memory

Application's task:

Map the respective memory position to an in-code variable
Check it in appropriate places

The expected cost of reading the variable is a single atomic load instruction, so you can check it in every logging call. Even Rust can afford it.
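The application-side check can be sketched as follows. This is only an illustration under stated assumptions: the static `LOG_LEVEL` stands in for the byte that would live in the memfd-backed mapping (the mapping itself is omitted because it needs more than the standard library), and the `Level` encoding, `enabled` and `set_level` are hypothetical names, since the proposal only shows `log-level: warn`:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical numeric encoding of log levels (not defined by the proposal).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum Level {
    Error = 0,
    Warn = 1,
    Info = 2,
    Debug = 3,
}

/// Stand-in for the fixed-position byte inside the shared mapping.
static LOG_LEVEL: AtomicU8 = AtomicU8::new(Level::Warn as u8);

/// The check done at every logging call: one relaxed atomic load.
fn enabled(level: Level) -> bool {
    (level as u8) <= LOG_LEVEL.load(Ordering::Relaxed)
}

/// Models the supervisor side updating the value directly in memory.
fn set_level(level: Level) {
    LOG_LEVEL.store(level as u8, Ordering::Relaxed);
}
```

A logging macro would call `enabled(Level::Debug)` before formatting its message, so a disabled level costs little more than the load and a comparison.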

Applicability

Surely, this is not for full-blown configs. We think this feature is needed for things that:

Are hard to reproduce on a test cluster
Need some feature enabled (like debug logging) that can't be enabled at all times
Have a setting that is easy to read from and write to shared memory
Or, alternatively, require immediate action that can't wait for a full service (or even config) reload: throttling, stopping accepting connections, etc.

On the other hand, most things that traditionally were implemented via signals might use this mechanism, because in most cases signal handlers do exactly the same thing: set some global flag and continue working until the normal code flow checks the flag.

Alternatives

Fetch the value of the variable via a web API. The downsides are:
- Cost of the request
- Cost of the request library in the process, which is useless 99.9% of the time
- Can't be enabled fast enough: in a synchronous process it will probably be too late by the time it is enabled, and in an async process it may also be a problem if the process is heavily loaded, for example if you need to debug an ENFILE error
Use signals such as SIGUSR1. The downsides are:
- There are only a few signals (there are real-time ones, but it's hard to find which are unused), so it may work for enable/disable, but not for setting the actual log level
- It's hard to make sure the signal is set up and will enable the feature at the right point in time (not after a request or something)
- EINTR handling still has issues in many languages and environments
Use a plain shared memory file:
- Can't be sealed, so it can potentially crash the application
- Not much simpler than memfd

/cc @anti-social, @popravich, @vharitonsky

Examples don't work

% sudo ./example_configs.sh 
[sudo] password for a: 
Copying examples/py into the system
WARNING: This Command will remove /etc/lithos from the system
... hopefully you run this in a virtual machine
... but let you think for 10 seconds
10 \r9 \r8 \r7 \r6 \r5 \r4 \r3 \r2 \r1 \r0 \rOkay proceeding...
building file list ... done

sent 188 bytes  received 11 bytes  398.00 bytes/sec
total size is 309  speedup is 1.55
Config /home/a/proj/lithos/vagga.yaml cannot be read: Error parsing config /home/a/proj/lithos/vagga.yaml: /home/a/proj/lithos/vagga.yaml:89:7: Parse Error: Expected scalar, sequence or mapping, got Anchor

Tested with vagga versions both master and 0.6.2

Bridged network setup fails on arping check step on Alpine Linux

While trying to set up a bridged network on Alpine Linux v3.7, the initialization procedure setup_network fails while checking interface availability with arping, with the following error:

Fatal error running "container/container.0": arping failed: exited with code 1

The full log of the container is attached below:

[2018-06-19T11:46:53Z][DEBUG]src/mount.rs:148: Making private "/"
[2018-06-19T11:46:53Z][DEBUG]src/mount.rs:132: Remount readonly: "/run/lithos/mnt"
[2018-06-19T11:46:53Z][INFO] [container/container.0] Starting container
[2018-06-19T11:46:53Z][DEBUG]src/bin/lithos_knot/setup_network.rs:106: Running "ip" "link" "add" "li_ca02f9_0001" "type" "veth" "peer" "name" "li-ca02f9-0001"
[2018-06-19T11:46:53Z][DEBUG]src/bin/lithos_knot/setup_network.rs:119: Running "ip" "link" "set" "dev" "li_ca02f9_0001" "netns" "/proc/28231/fd/6"
[2018-06-19T11:46:53Z][DEBUG]src/bin/lithos_knot/setup_network.rs:131: Running "brctl" "addif" "br0" "li_ca02f9_0001"
[2018-06-19T11:46:53Z][DEBUG]src/bin/lithos_knot/setup_network.rs:142: Running "ip" "link" "set" "li_ca02f9_0001" "up"
[2018-06-19T11:46:53Z][DEBUG]src/bin/lithos_knot/setup_network.rs:155: Running "ip" "link" "set" "lo" "up"
[2018-06-19T11:46:53Z][DEBUG]src/bin/lithos_knot/setup_network.rs:168: Running "ip" "addr" "add" "10.0.0.1/24" "dev" "li-ca02f9-0001"
[2018-06-19T11:46:53Z][DEBUG]src/bin/lithos_knot/setup_network.rs:179: Running "ip" "link" "set" "li-ca02f9-0001" "up"
[2018-06-19T11:46:53Z][DEBUG]src/bin/lithos_knot/setup_network.rs:191: Running "ip" "route" "add" "default" "via" "10.0.0.1"
[2018-06-19T11:46:53Z][DEBUG]src/bin/lithos_knot/setup_network.rs:203: Running "arping" "-U" "10.0.0.1" "-c1"
[2018-06-19T11:46:53Z][ERROR] Fatal error running "container/container.0": arping failed: exited with code 1

Running the lithos_tree process under strace as follows indicates why this is happening:

$ strace -f /usr/bin/lithos_tree
... unrelated output omitted ...

[pid 32223] execve("/usr/bin/arping", ["/usr/bin/arping", "-U", "10.0.0.1", "-c1"], 0x55d76cb2d0a0 /* 1 var */) = 0

... unrelated output omitted ...

[pid 32223] writev(2, [{iov_base="", iov_len=0}, {iov_base="arping: Too many args on command"..., iov_len=61}], 2arping: Too many args on command line. Expected at most one.
) = 61
[pid 32223] exit_group(1)               = ?
[pid 32223] +++ exited with 1 +++

The lithos configurations used are:

master config:

log_level: debug
devfs-dir: /dev

sandbox config:

allow-users: [0, 1, 100]
allow-groups: [0, 101]
image-dir: /var/lib/lithos/images
writable-paths:
  /var/log/container: /data
  /var/lib/container: /log
bridged-network:
  bridge: br0
  network: 10.0.0.0/24
  default_gateway: 10.0.0.1

process config:

container:
  image: container
  config: /config/container.yaml
  ip-addresses: [10.0.0.1]

It turns out that Alpine ships a different arping implementation (specifically the one by Thomas Habets), which validates argument order more strictly than the arping available in other, more mainstream Linux distributions (where it is usually provided by the iputils package).

Simply reordering the arguments is sufficient to make it work with both implementations of arping (I have tested only on Fedora and Alpine).
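The reordering can be sketched like this: every option goes before the positional destination address, instead of the `-U 10.0.0.1 -c1` order seen in the debug log. This uses `std::process::Command` rather than the `unshare::Command` that Lithos uses, and `arping_args` and `arping_command` are hypothetical helper names; the binary path is the one from the strace output above:

```rust
use std::net::IpAddr;
use std::process::Command;

/// Arguments for the arping check, with all options placed before
/// the positional destination address.
fn arping_args(ip: IpAddr) -> Vec<String> {
    vec!["-U".to_string(), "-c1".to_string(), ip.to_string()]
}

/// Builds (but does not run) the arping command.
fn arping_command(ip: IpAddr) -> Command {
    let mut cmd = Command::new("/usr/bin/arping");
    cmd.args(arping_args(ip));
    cmd
}
```

Keeping the argument list in one small function makes the ordering easy to check without spawning the process.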

Allow passing a bound socket via fd: 0

Some programs (e.g. java) support being started on demand by the xinetd daemon, which binds a socket on behalf of the program and passes it to the program as file descriptor 0, i.e. STDIN.

Unfortunately fd: 0 cannot be used in Lithos:

tcp-ports:
  8080:
    fd: 0
    host: 0.0.0.0
    reuse-addr: true

$ lithos_tree
thread 'main' panicked at 'Stdio file descriptors must be configured with respective methods instead of passing fd 0 to `file_descritor()`', .../github.com-1ecc6299db9ec823/unshare-0.2.0/src/fds.rs:33:13

The unshare crate explicitly checks whether the given fd is greater than 2 and panics when it’s not (see fds.rs:32).

Could you please make it work even for fd 0?
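For context, this is roughly what the receiving program does with such a descriptor, sketched in Rust under the assumption that the parent has already bound the socket and put it into the listening state. `adopt_listener` is a hypothetical helper; an xinetd-style program would call it with fd 0:

```rust
use std::net::TcpListener;
use std::os::unix::io::{FromRawFd, RawFd};

/// Wraps an inherited, already-bound and listening socket.
///
/// # Safety
/// `fd` must be an open listening TCP socket that nothing else in the
/// process owns, because the returned listener closes it on drop.
unsafe fn adopt_listener(fd: RawFd) -> TcpListener {
    unsafe { TcpListener::from_raw_fd(fd) }
}
```

The program never calls bind itself, which is why the supervisor needs a way to place the socket at the exact fd number the program expects.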

ARPing timeouts in bridged network setup

With the bridged_network setup, the arping call in setup_network that checks veth interface availability fails with a timeout, as shown by the following output of the lithos_tree process:

--- 10.0.0.1 statistics ---
1 packets transmitted, 0 packets received, 100% unanswered (0 extra)

Fatal error: arping failed: exited with code 1
ARPING 10.0.0.1
Timeout

--- 10.0.0.1 statistics ---
1 packets transmitted, 0 packets received, 100% unanswered (0 extra)

Fatal error: arping failed: exited with code 1
ARPING 10.0.0.1
Timeout

A tcpdump on the host (bridge interface) shows no ARP response being sent:

tcpdump -i br0 -en "icmp or arp"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:15:05.564340 4a:41:e6:41:9f:a9 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
15:15:06.900662 82:84:a5:0c:48:b1 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
15:15:08.253697 ce:91:ed:2d:a7:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
15:15:09.583953 ba:3f:bc:a6:ac:e8 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
15:15:10.930642 3a:f1:e9:d9:3f:23 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
^C
5 packets captured
5 packets received by filter
0 packets dropped by kernel

bridge info on host:

br0       Link encap:Ethernet  HWaddr 00:00:00:00:00:00
          inet addr:10.0.0.200  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::1440:b0ff:fec7:5926/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1047 errors:0 dropped:0 overruns:0 frame:0
          TX packets:727 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:62207 (60.7 KiB)  TX bytes:59939 (58.5 KiB)

veth info in the lithos container:

nsenter -n -p --target 1647 /bin/sh
ifconfig -a
li-ca02f9-0001 Link encap:Ethernet  HWaddr 5A:23:45:F5:F9:B4
          inet addr:10.0.0.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::5823:45ff:fef5:f9b4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:12 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:956 (956.0 B)  TX bytes:760 (760.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:16 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:1376 (1.3 KiB)  TX bytes:1376 (1.3 KiB)

I have trouble understanding what this check is trying to accomplish, because:

the address (here 10.0.0.1) used in the arping call is that of the veth interface located inside the container (in the child namespace)
arping is invoked from the same namespace in which the interface is located (i.e. the child namespace)

That means that the veth device with the 10.0.0.1 address assigned (in the container) effectively sends an ARP broadcast asking "Tell me who has 10.0.0.1" to the bridge (located on the host) that the other end of the veth pair is attached to, and (as expected) receives no answer. The only one that could answer is the local interface itself, which is the one sending the link broadcast. Note that in this case arping behaves differently than e.g. ping, which allows pinging a local interface.

I suspect that the arping call should be performed in the host (parent) namespace instead of the child namespace, so that ARP is sent from the bridge to the veth and not the other way around, as happens now. Possibly a specific bridge interface should be selected: what if different hosts with the 10.0.0.1 address are reachable from multiple NICs of the host?

Please see attached patch for better understanding of my proposal:

diff --git a/src/bin/lithos_knot/setup_network.rs b/src/bin/lithos_knot/setup_network.rs
index f8df4a6..db9227d 100644
--- a/src/bin/lithos_knot/setup_network.rs
+++ b/src/bin/lithos_knot/setup_network.rs
@@ -193,23 +193,27 @@ fn _setup_bridged(sandbox: &SandboxConfig, _child: &ChildInstance, ip: IpAddr)
             Ok(s) if s.success() => {}
             Ok(s) => bail!("ip route failed: {}", s),
             Err(e) => bail!("ip route failed: {}", e),
         }
     }

+    setns(parent_ns.as_raw_fd(), CloneFlags::CLONE_NEWNET)?;
+
     let mut cmd = unshare::Command::new("/usr/bin/arping");
     cmd.arg("-U");
     cmd.arg("-c1");
     cmd.arg(&format!("{}", ip));
     debug!("Running {}", cmd.display(&Style::short()));
     match cmd.status() {
         Ok(s) if s.success() => {}
         Ok(s) => bail!("arping failed: {}", s),
         Err(e) => bail!("arping failed: {}", e),
     }

+    setns(my_ns.as_raw_fd(), CloneFlags::CLONE_NEWNET)?;
+
     Ok(())
 }

 fn _setup_isolated(_sandbox: &SandboxConfig, _child: &ChildInstance)
     -> Result<(), Error>
 {

I observed this error on Alpine v3.7 after PR #15 was applied.

Thank you for any ideas or opinions on this.

BTW, many thanks to you and the other contributors for the awesome containerization tools (not just lithos). They save us a lot of time and sanity by not having to deal with Docker :).

Variable of generic type?

According to the documentation of Container Configuration, there are only three types of variables – TcpPort, Choice and Name – and all used variables must be declared. I’ve also checked the sources, and it seems that only these three types are really accepted.

That’s very restrictive. I’d like to declare, e.g., a variable for the base URI of the application and use it in environ to pass it into the container as an environment variable or via arguments. Is this a wrong usage of variables?

Running “full” OS with Lithos

Hi,

It seems that Lithos is designed for running services that may be isolated using namespaces, cgroups, capabilities(?), i.e. running them inside “containers.” I wonder, can I also use it to run a “full OS” (Alpine Linux, to be specific) with a traditional init system (OpenRC) and multiple services, including an OpenSSH server for users to connect to? Like with LXC, which I currently use. Theoretically it should be possible, but it seems that it’s not really designed for such a use case (?), so are there any limitations or design decisions that make it a really bad idea?

Storing secrets

Motivation

In the past we have relied on keeping secrets in the filesystem and mounting them as:

volumes:
  /secrets: !Readonly /secrets

This is fine for smaller installations but has the following downsides:

Applications must be able to read secrets from there
Users deploying applications are unaware of what is in the /secrets dir (i.e. whether the config there is up to date)
Secrets are plain text on the host system
We need to sync them to machines somehow

Solution

Add the following to sandbox config:

secrets-private-key: /etc/some/file

Add the following to container config:

secret-environ:
  SOME_SECRET: I7OO1RBTRYk+oAZ6n/dhRMCDXwgW

The value of SOME_SECRET is the actual value encrypted with a public key.

When a process is started, lithos will decrypt those values and pass them as environment variables to the process.

Upsides:

Passing secrets as env vars is a known industry practice
It requires minimal support from applications (and most already support that)
Secrets are versioned even if the contents are unknown
Anybody with commit privileges can add a secret (no need to access the server by SSH or other means)

Downsides:

You have to publish the public key somehow
If the private key is compromised you need to redeploy every project (however, it's safe to assume that if the private key is compromised, all the secrets are compromised, so you should change all the secrets anyway, which is presumably much more work than changing the environment itself)
Many keys will be duplicated if multiple lithos configs need the same secrets; this is a convenience vs explicitness trade-off

Alternatives / Extra Features

We may also add the following to sandbox:

environ-secrets-file: /encrypted/secret.file # file is encrypted

And the following to container config:

external-secret-environ: [VAR1, VAR2]

This allows changing the environ more dynamically, but it means /encrypted/secret.file has to be deployed separately.

Thoughts?

/cc @popravich, @anti-social, @jirutka, #13

Add cantal metrics

Metrics which we can track:

Number of restarts for each process
Number of containers in each sandbox
