
zos's Introduction


0-OS is an autonomous operating system designed to expose raw compute, storage, and network capacity.

This repository hosts the V2 of 0-OS, which is a complete rewrite from scratch. If you want to know about the history and the decisions that motivated the creation of V2, you can read this article.

0-OS is mainly used to run nodes on the ThreeFold Grid. Head to https://threefold.io and https://wiki.threefold.io to learn more about ThreeFold and the grid.

Documentation

Start exploring the code base by first checking the documentation and specification documents.

An FAQ is also available for the most common questions.

Setting up your development environment

If you want to contribute, read the contribution guidelines and the documentation to set up your development environment.

Grid Networks

0-OS is deployed on four different "flavors" of network:

  • production network: releases of stable versions. Used to run the real grid with real money. Can never be reset. Only stable, battle-tested features reach this level. You can find the dashboard here
  • test network: mostly stable features that need to be tested at scale; allows previewing and testing new features. Always the latest and greatest. This network can occasionally be reset, but should be relatively stable. You can find the dashboard here
  • QA network: mostly unstable features that need to be tested internally; allows previewing and testing new features. Can be behind development. This network can occasionally be reset, but should be relatively stable. You can find the dashboard here
  • dev network: an ephemeral network set up only to develop and test new features. Can be created and reset at any time. You can find the dashboard here

Learn more about the different networks by reading the upgrade documentation.

Provisioning of workloads

ZOS does not expose a public interface to control it. Instead, it waits for reservations to appear on a trusted source; once a reservation is available, the node applies it to reality. You can start reading about provisioning in this document.
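
As an illustration, the pull-based flow could look roughly like the following Go sketch; the Source interface, the apply callback, and the polling cadence are assumptions for illustration, not the actual zos API.

package provision

import (
    "log"
    "time"
)

// Reservation is a simplified stand-in for a workload reservation.
type Reservation struct {
    ID   string
    Type string // container, volume, network, ...
}

// Source is a hypothetical trusted reservation source (e.g. the BCDB).
type Source interface {
    // Poll returns the reservations targeting this node since the last call.
    Poll(nodeID string) ([]Reservation, error)
}

// Loop pulls reservations from the trusted source and applies them.
func Loop(nodeID string, src Source, apply func(Reservation) error) {
    for {
        reservations, err := src.Poll(nodeID)
        if err != nil {
            log.Println("poll failed:", err)
        }
        for _, r := range reservations {
            if err := apply(r); err != nil {
                log.Println("provisioning failed:", r.ID, err)
            }
        }
        time.Sleep(10 * time.Second) // assumed cadence
    }
}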

Owners

@maxux @muhamadazmy @delandtj @leesmet

Community

If you have some questions or just want to hang out, you can find us on:


zos's Issues

Avoid generating the wireguard key on the node

As requested by @despiegk, we need to remove the generation of the WireGuard key from the node and move it to the user side.

So the user will have to generate a key pair for each member of their network and then publish these key pairs in the BCDB.
This means we will need to encrypt part of the network object with the public key of the node, so only the node will be able to read its private key.

The current layout is not made for something like this; it currently only contains the public keys of the WireGuard peers of a network.
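
To make the encryption step concrete, here is a minimal Go sketch assuming NaCl sealed boxes are used on top of the Curve25519 keys that WireGuard already employs; this is an illustration, not the agreed design.

package main

import (
    "crypto/rand"
    "encoding/base64"
    "fmt"

    "golang.org/x/crypto/nacl/box"
)

func main() {
    // The user generates a Curve25519 key pair for a network member;
    // WireGuard keys use the same curve.
    memberPub, memberPriv, err := box.GenerateKey(rand.Reader)
    if err != nil {
        panic(err)
    }

    // Stand-in for the node public key, which would normally be
    // fetched from the node identity in the BCDB.
    nodePub, _, err := box.GenerateKey(rand.Reader)
    if err != nil {
        panic(err)
    }

    // Seal the member private key to the node: only the node can
    // open it with its own private key.
    sealed, err := box.SealAnonymous(nil, memberPriv[:], nodePub, rand.Reader)
    if err != nil {
        panic(err)
    }

    fmt.Println("member public key :", base64.StdEncoding.EncodeToString(memberPub[:]))
    fmt.Println("sealed private key:", base64.StdEncoding.EncodeToString(sealed))
}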

Define container profiles

After the discussion in #5 (comment),
we agreed to provide different container "profiles".
This issue is about choosing what kinds of profiles we want and writing the config.json file for each of them.

Implement storage module

A basic implementation of storage, which shows what the high-level architecture will look like.
@muhamadazmy

  • Btrfs abstraction
  • Btrfs unit tests
  • Device abstraction

@LeeSmet

  • Module interface
  • Module interface implementation
  • Cache preparation ?
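
A rough sketch of what the module interface items above could amount to, with all names hypothetical:

package storage

// DeviceType distinguishes the class of disk backing a pool.
type DeviceType string

const (
    SSDDevice DeviceType = "ssd"
    HDDDevice DeviceType = "hdd"
)

// Filesystem is a volume carved out of a btrfs storage pool.
type Filesystem struct {
    Name string
    Path string
    Size uint64 // bytes
}

// Module is the contract the storage daemon exposes to other modules.
type Module interface {
    // CreateFilesystem allocates a filesystem of the given size on a
    // pool backed by the requested device type.
    CreateFilesystem(name string, size uint64, kind DeviceType) (Filesystem, error)
    // ReleaseFilesystem removes the filesystem and frees its space.
    ReleaseFilesystem(name string) error
    // Path looks up an existing filesystem by name.
    Path(name string) (Filesystem, error)
}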

Design/Implement Virtual machine module [fcvm]

Technology:

Firecracker has been chosen to run the VMs.

To investigate:

  • check if k3os runs fine on firecracker: https://github.com/rancher/k3os/releases/tag/v0.8.0
  • test if a firecracker VM can use a tap interface connected to a network resource bridge
  • test that a raw file works properly as a way to give the VM access to disks

Module Design.

This module will be quite similar to the container module. Both need to expose methods to start/stop/inspect a VM/container.

There is an SDK for the firecracker API: https://github.com/firecracker-microvm/firecracker-go-sdk
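
For reference, starting a microVM through that SDK could look roughly like the sketch below; the kernel/rootfs paths, tap name, and MAC address are placeholders, and field names may differ between SDK versions.

package main

import (
    "context"

    firecracker "github.com/firecracker-microvm/firecracker-go-sdk"
    models "github.com/firecracker-microvm/firecracker-go-sdk/client/models"
)

func main() {
    ctx := context.Background()

    cfg := firecracker.Config{
        SocketPath:      "/tmp/firecracker.sock",
        KernelImagePath: "/var/cache/vmlinux", // placeholder path
        KernelArgs:      "console=ttyS0 reboot=k panic=1",
        Drives: []models.Drive{{
            DriveID:      firecracker.String("rootfs"),
            PathOnHost:   firecracker.String("/var/cache/k3os.img"), // raw file as disk
            IsRootDevice: firecracker.Bool(true),
            IsReadOnly:   firecracker.Bool(false),
        }},
        NetworkInterfaces: []firecracker.NetworkInterface{{
            // tap device pre-created and attached to the network
            // resource bridge by networkd
            StaticConfiguration: &firecracker.StaticNetworkConfiguration{
                HostDevName: "tap0",
                MacAddress:  "AA:FC:00:00:00:01",
            },
        }},
        MachineCfg: models.MachineConfiguration{
            VcpuCount:  firecracker.Int64(1),
            MemSizeMib: firecracker.Int64(512),
        },
    }

    m, err := firecracker.NewMachine(ctx, cfg)
    if err != nil {
        panic(err)
    }
    if err := m.Start(ctx); err != nil {
        panic(err)
    }
    // block until the VM exits
    if err := m.Wait(ctx); err != nil {
        panic(err)
    }
}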

Todo

  • create udhcpd config (doing it in a separate NS is easier for networkd) -> no longer needed: we found a way to assign an IP statically to the VM
  • create a subvolume on a disk (where?) to host images
    • VMs have their caveats: RAW image files are large, but at the same time we need them to be fast. Storaged allows creating a filesystem on top of multiple disks, so if big disks are required we could use this feature.
      After some performance testing, if the speed is not good enough, we could investigate bcache.
  • create flist for firecracker/K3OS images (or integrate firecracker into zos (0-initramfs)); the binaries are small:
    ls -lh build/cargo_target/x86_64-unknown-linux-musl/release/{firecracker,jailer}
    -rwxr-xr-x 2 delandtj delandtj 2.7M Jan 14 10:05 build/cargo_target/x86_64-unknown-linux-musl/release/firecracker
    -rwxr-xr-x 2 delandtj delandtj 2.4M Jan 14 09:58 build/cargo_target/x86_64-unknown-linux-musl/release/jailer
  • create tap device and attach to NR -> needs schema definition?
    There is no schema definition for VMs; I think that will be necessary

  • managing reservation size for volumes is not really specified; we wouldn't want quotas to break things

  • create an automated install procedure for k3os:

    • prepare tap, attach to NR, bring up, disable_ipv6 (networkd)
    • prepare volume: fallocate, truncate (storaged)
    • use the API to configure the fc VM instance (vmd)
      • mac addr
      • volume name / place
      • k3os.mode=install (and others)
    • boot autoinstall, wait? kill? (the k3os automated install halts by default instead of rebooting :-/) (provisiond)
    • start it (provisiond)

network: Logic to request a new network resource for a node joining a network

When a node needs to join a network for the first time, a new network resource (NR) needs to be added to the tenant network object (TNo).

To create a new NR, the TNoDB requires two pieces of information from the node:

  • a free port
  • the public wireguard key of the node for this TNo

We need to implement this communication between the node and the TNoDB. Currently the networker interface only exposes a method to publish the public WireGuard key: https://github.com/threefoldtech/zosv2/blob/82028f15dd91604c6fd443dda2d7332b632e99f4/modules/network.go#L12-L18

This interface needs to be rethought: PublishWGPubKey should become a method that asks the TNoDB to add a new NR to the TNo, as sketched below.
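
Something along these lines, with the method name and signature purely illustrative:

package network

// TNoDB is a sketch of the reworked interface; RequestNetworkResource
// replaces PublishWGPubKey, and its name and signature are hypothetical.
type TNoDB interface {
    // RequestNetworkResource asks the TNoDB to add a new network
    // resource for this node to the tenant network object, handing
    // over the free listen port and the WireGuard public key the node
    // generated for this TNo.
    RequestNetworkResource(tnoID string, wgPubKey string, listenPort uint16) error
}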

Design container module

The container module is going to be responsible for exposing an interface on top of the chosen OCI runtime.
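
An illustrative sketch of such an interface; all names are hypothetical and the spec is heavily simplified:

package container

// ContainerID uniquely identifies a container on the node.
type ContainerID string

// Spec is a simplified description of what to run.
type Spec struct {
    FList      string            // rootfs flist URL
    Entrypoint string            // command to execute
    Env        map[string]string // environment variables
}

// Module wraps the chosen OCI runtime.
type Module interface {
    // Run creates and starts a container in the given namespace.
    Run(ns string, spec Spec) (ContainerID, error)
    // Inspect returns the spec of a running container.
    Inspect(ns string, id ContainerID) (Spec, error)
    // Delete stops and removes a container.
    Delete(ns string, id ContainerID) error
}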

Document file location of each module

I would like to have a clear view of all the files that each module needs to be able to write/create.

Then from there we can reflect on whether this is the best layout or if we can improve things a bit.

The main idea is that I want to avoid any module being able to write anywhere on the filesystem. That weakens security and makes things harder to debug.

Research and design for versioning, update and upgrades

One of the main features we want to provide with this version of 0-OS is "auto-update". The idea is that the system needs to be able to update itself and each of its components with the minimum downtime possible for the workloads. In most cases, minimum means no downtime at all.

Keeping this in mind actually drives quite a few of the design decisions for all the modules of 0-OS.
One of them I would like to discuss here is how we are going to handle versioning of the system.

Since every component is by definition modular and can be swapped in place for another one or another version, a single version for the OS is not going to be meaningful. So instead of sticking with a plain version, I had another idea:
having 3 "flavors" of 0-OS, main, dev and test, like tfchain does.

Having these 3 "flavors" will allow us to actually have different networks in the grid:

  • Dev: an ephemeral network only set up to develop and test new features. Can be created and reset at any time.
  • Test: mostly stable features that need to be tested at scale; allows preview and testing of new features. Always the latest and greatest. This network can be reset sometimes, but should be relatively stable.
  • Main: releases of stable versions. Used to run the real grid with real money. Can never be reset. Only stable, battle-tested features reach this level.

This allows each component to progress at its own pace and keep its own semantic version. A flavor will be composed of different versions of each module. Once a new version of a module is ready, it will just make its way through the 3 flavors, from dev to main.

Each "flavor" will actually create a separate grid; a node will only ever connect to other nodes with the same "flavor". This simplifies the code, because you don't have to deal with different versions and can be sure of the features of the node you're talking to.

A new feature will bubble up from dev to main. Every time a feature is promoted to the next level, all the nodes in that network will receive the update automatically.
This makes upgrading a network trivial, and it also ensures that the upgrade procedures are tested at scale at least twice before reaching main net.

storaged: /var/cache is mounted a second time when it is already mounted

When storaged is restarted and /var/cache is already mounted, it will mount it again, producing a double mountpoint like:

/dev/sda on /var/cache type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/zos-cache)
/dev/sda on /var/cache type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/zos-cache)

Storaged should not mount it a second time.
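
A minimal guard, assuming /proc/self/mounts as the source of truth, could look like this (the function name is hypothetical):

package storaged

import (
    "os"
    "strings"
)

// isMounted reports whether target already appears as a mount point in
// /proc/self/mounts.
func isMounted(target string) (bool, error) {
    data, err := os.ReadFile("/proc/self/mounts")
    if err != nil {
        return false, err
    }
    for _, line := range strings.Split(string(data), "\n") {
        // each entry is "device mountpoint fstype options ..."
        fields := strings.Fields(line)
        if len(fields) >= 2 && fields[1] == target {
            return true, nil
        }
    }
    return false, nil
}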

Storage: disk caching

Right now the storage module does a scan of the disks every time something changes the disk layout (filesystem creation, in the future maybe partitioning, ...). This prevents us from ever powering down the disks completely.

It should be possible to manually update the in-memory device representation when these actions are done, allowing us to power down the disks and avoid a rescan.

Extract network manipulation logic from tnodb

Currently, most of the logic regarding network object manipulation lives inside the tnodb_mock.

While this makes things super easy to use, it doesn't really fit the concept where nodes only provision what they get from the BCDB. Instead, the node is now dynamically talking with the TNoDB by itself, and the network object is modified without the owning user knowing it.

To solve this and move full control to the user, we need to let the user create the network object themselves (using the lib; doing it manually is way too complex).
The user can then send the network object as a provisioning request to the BCDB.
The BCDB will only validate the content of the network object. If the network object is not correct, the provisioning will be refused by the BCDB.

Tasks:

  • extract all the network object manipulation logic into a library (fb3d753)
  • remove some endpoint from the tnodb_mock (fb3d753)
    • create network
    • add member
    • add user
  • implement a full network object validation that will be used by the bcdb when provisioning a network: will be done in #132
  • update provisiond to support a network object as a provisioning request (aed0c23)
  • networkd should not watch the network object anymore, but just react upon provisiond requests (the watcher could be moved to provisiond) (aed0c23)

containerd zero-fs integration

  • Check if it's possible to implement a new image type to support zero-fs directly in containerd
  • Plan B: the rootfs of the container is mounted first, then passed to containerd in the runc spec

Provisioning module has no notion of reservation expiration

Currently, in the provisioning module, reservations have no notion of time. We lack information about how long a reservation needs to stay live.

We need to add a duration to the Reservation struct, so we can know whether a reservation is still valid or should be deleted from the system.
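
A sketch of the extension; the field names are illustrative, not the final struct:

package provision

import "time"

// Reservation with an added lifetime.
type Reservation struct {
    ID       string
    Created  time.Time
    Duration time.Duration // how long the reservation stays valid
}

// Expired reports whether the reservation should be decommissioned.
func (r Reservation) Expired() bool {
    return time.Now().After(r.Created.Add(r.Duration))
}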

Inter node communication

I would like to discuss here how we are going to create a network of nodes and how these nodes are going to communicate.

The idea behind the TFgrid is that each 0-OS node is a stateless capacity provider and the ThreeBots are the directors.
Since the ThreeBots need to be able to reach every single node in the network, and the ThreeBots run on top of the 0-OS nodes, the nodes need to be able to connect to each other.
This idea is simple enough, but practically it raises some questions.

  • How do the nodes know about each other?
  • What does a node do to join a network?
  • How does the network handle a node leaving the network?

Zerotier was the way to go for these things so far. But it has proven not to be scalable, extremely hard to manage, and not usable at all in an environment where a lot of nodes are on the same LAN.

I think some kind of inter-node communication protocol has to be designed to create a fully distributed network where nodes organize themselves and can route information through the network: trying to create direct p2p connections between nodes when possible, and finding routes through publicly reachable nodes when not.

Enable missing unit tests in CI

Now that most of the WIP modules have been merged to master, I would like us to start having a proper build pipeline and CI to run as many tests as possible, and to look at how we need to use go mod to version all the code.

I'm still pretty new to go mod so I don't know if what we have today is valid or not.

  • Enable travis-ci
  • Prepare the test environment for CI (some of the packages are already being tested, still to be activated)
    • flist
    • provision
    • storage
    • gedis
  • Create a Makefile for building all binaries

define clear boot flow

We can organize the service boot into stages by creating pseudo stages. For example, we can create a service called init that execs 'true' as a oneshot and depends on all the boot services (udev, settle, network, etc.); all other second-stage services can then depend only on init, as sketched below.
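
As a sketch, assuming zinit's service file format and placeholder service names:

# /etc/zinit/init.yaml -- the pseudo stage gate
exec: "true"
oneshot: true
after:
  - udevd
  - udev-trigger
  - networkd

# /etc/zinit/containerd.yaml -- a second-stage service
exec: containerd
after:
  - init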

zinit: PID 1 for ZOS v2

Project

https://github.com/threefoldtech/zinit

Tasks

  • process manager in rust/tokio POC
  • resolve dependencies of the services
  • load services config from a directory
  • simple unix socket API for management
  • command line tool to manage the process manager (status, stop, start, restart, and reload)
  • zombie reaping 😨

allow streaming container logs to remote endpoints

Since there will be no way for users to connect directly to a 0-OS node, we want to make it possible for a node to stream the logs of a container to a remote endpoint.

During the creation of the container, the user will specify the endpoint location and type. Then during the lifetime of the container, the 0-OS node will stream the logs to the endpoint.

As first supported endpoint types:

  • redis
  • 0-db
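
A sketch of what the user could pass at container creation time; the type names and the connection string format are assumptions:

package logger

// Endpoint describes where and how to ship container logs.
type Endpoint struct {
    Type     string // "redis" or "zdb"
    Location string // e.g. redis://host:6379/channel
}

// Streams pairs an endpoint with each output stream of the container.
type Streams struct {
    Stdout Endpoint
    Stderr Endpoint
}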

Design storage module

The storage module will be responsible for everything related to storing information on a long-term medium, usually a disk.

Sub components:

  • 0-db management
    • capacity planning, reservation of 0-db and 0-db namespaces
  • disks management
    • formatting of the disks
    • health monitoring
    • management of volumes used by containers

Design node identity

We need to decide how we are going to identify a node.

Since most of the tfgrid identity system will be based on key pairs, we could generate a key pair for each node and use the base64-encoded version of the public key as the identity of the node.

Example: https://play.golang.org/p/IycmGa1USUi

Now this brings one problem: this key pair will have to be stored on the node's disk itself, which means we bring some state into 0-OS. That is something we want to avoid.
One possible solution would be to use a deterministic seed that is unique per node (hardware serial number, ...) to generate the key pair.
This would make it possible to regenerate the key pair at any time. But I'm not sure this is doable without compromising security: if someone can deduce the seed, they can impersonate that node.
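
To illustrate the deterministic-seed idea (and its risk), a small Go sketch; hashing the serial number like this is just the idea under discussion, not a settled design:

package identity

import (
    "crypto/ed25519"
    "crypto/sha256"
    "encoding/base64"
)

// NodeID derives a stable identity from a hardware-bound value such as
// the serial number. Anyone who learns the serial can re-derive the
// private key, which is exactly the security concern raised above.
func NodeID(serial string) string {
    seed := sha256.Sum256([]byte(serial)) // 32-byte deterministic seed
    priv := ed25519.NewKeyFromSeed(seed[:])
    pub := priv.Public().(ed25519.PublicKey)
    return base64.StdEncoding.EncodeToString(pub)
}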

IPC infrastructure

We need to decide how the API layer will discover and talk to all the low-level modules.
Requirements:

  • modules need to be discoverable
  • modules need to be upgradable without impacting the higher layers
  • capable of propagating events up the stack

Design network module

Since we are going to use CNI as much as possible for network configuration, the network module is not going to be responsible for actually doing network configuration. Instead, it will expose an API that lets a user create CNI-compatible configuration files.

0-OS is by nature a multi-user platform, so we need to be able to provide private networks for users, so that the containers of user A running on a node won't have access to the containers of user B.
The network module will be responsible for the management of these private network configurations on a node.

Only create storagepools when needed

Right now the storage module greedily takes all available devices and creates storage pools on them when it starts. This should be changed to only create a storage pool when a filesystem is requested and there is no more space in the existing pools.

Resource IDs and ownership of resources

Reservations can (and will) happen in multiple steps. For example, to start a container you probably have to do the following:

  • Create a storage volume
  • Allocate network resource
  • Create container, and assign both the volume, and the network resource to that container.

The problem is that when creating a container you need to pass the information of the volume (maybe the volume's full path) as well as the network resource identity. But there is no way to guarantee that the caller of the container creation is the rightful owner of the associated resources (volume and/or network).

We think that the IDs of the allocated resources should carry some information about the owner of the resource. In that case, if the owner of the volume and the network matches the owner of the container, the container creation should pass.

Proposal

The ID of a resource can be a JWT whose payload contains both the needed resource information and the tenant ID. The JWT is signed with the node's private key, so it is only valid on the node where the resource is allocated.

For example, you first allocate a volume; the volume ID can hold:

{
  "path": "/pool/volume",
  "tenant": "id of the owner"
}

Then, on container creation, the JWT is passed as the volume ID. The node can verify that the JWT has a valid signature, then match the tenant ID against the tenant ID of the container.create caller. Once verified, the volume path can be mounted inside the container.

The only drawback of this technique is the size of the JWT.
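
A sketch of minting such an ID, using the golang-jwt library and an ed25519 node key as one possible implementation choice:

package resourceid

import (
    "crypto/ed25519"

    "github.com/golang-jwt/jwt/v4"
)

// VolumeID signs the volume path and tenant into a token that can only
// be verified with this node's key, making it valid on this node only.
func VolumeID(nodeKey ed25519.PrivateKey, path, tenant string) (string, error) {
    claims := jwt.MapClaims{
        "path":   path,
        "tenant": tenant,
    }
    token := jwt.NewWithClaims(jwt.SigningMethodEdDSA, claims)
    return token.SignedString(nodeKey)
}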

Design security structure

0-OS is by nature a shared system. Its main goal is to allow different users to use the capacity provided by the hardware.
This of course raises security concerns:

  • How do I ensure that the data I write on a disk is not going to be accessible by someone else?
  • How can we ensure that the networks of one user are not reachable by another?
  • How do I authenticate users when they talk to the public API of the OS?
  • How are we going to allocate resources to a certain user (we are a reservation-based system)?
  • ...

All these questions need to be resolved and be a first thought when developing all the modules that compose 0-OS.
