
rook's Introduction

Rook


What is Rook?

Rook is an open source cloud-native storage orchestrator for Kubernetes, providing the platform, framework, and support for Ceph storage to natively integrate with Kubernetes.

Ceph is a distributed storage system that provides file, block and object storage and is deployed in large scale production clusters.

Rook automates deployment and management of Ceph to provide self-managing, self-scaling, and self-healing storage services. The Rook operator does this by building on Kubernetes resources to deploy, configure, provision, scale, upgrade, and monitor Ceph.

The status of the Ceph storage provider is Stable. Features and improvements will be planned for many future versions. Upgrades between versions are provided to ensure backward compatibility between releases.

Rook is hosted by the Cloud Native Computing Foundation (CNCF) as a graduated level project. If you are a company that wants to help shape the evolution of technologies that are container-packaged, dynamically-scheduled and microservices-oriented, consider joining the CNCF. For details about who's involved and how Rook plays a role, read the CNCF announcement.

Getting Started and Documentation

For installation, deployment, and administration, see our Documentation and QuickStart Guide.

Contributing

We welcome contributions. See Contributing to get started.

Report a Bug

For filing bugs, suggesting improvements, or requesting new features, please open an issue.

Reporting Security Vulnerabilities

If you find a vulnerability or a potential vulnerability in Rook, please let us know immediately at [email protected]. We'll send a confirmation email to acknowledge your report, and a follow-up email once we've determined whether the issue is valid.

For further details, please see the complete security release process.

Contact

Please use the following to reach members of the community:

Community Meeting

A regular community meeting takes place every other Tuesday at 9:00 AM PT (Pacific Time). Convert to your local timezone.

Any changes to the meeting schedule will be added to the agenda doc and posted to Slack #announcements.

Anyone who wants to discuss the direction of the project, design and implementation reviews, or general questions with the broader community is welcome and encouraged to join.

Official Releases

Official releases of Rook can be found on the releases page. Please note that it is strongly recommended that you use official releases of Rook, as unreleased versions from the master branch are subject to changes and incompatibilities that will not be supported in the official releases. Builds from the master branch can have functionality changed and even removed at any time without compatibility support and without prior notice.

Licensing

Rook is under the Apache 2.0 license.


rook's People

Contributors

alimaredia, anmolsachan, aruniiird, autosupport, bassam, blaineexe, dangula, dependabot[bot], dotnwat, galexrt, humblec, jbw976, jmolmo, kokhang, leseb, madhu-1, parth-gr, phlogistonjohn, rakshith-r, rkachach, rohan47, rohantmp, sabbot, satoru-takeuchi, sebastianriese, sp98, subhamkrai, thotz, travisn, umangachapagain


rook's Issues

Add support for etcd failover scenario

When a castled server that is already a member of the etcd cluster fails, etcdmgr should create an instance of embedded etcd on an existing healthy node to replace the failed one.

OSD orchestration should return success to the leader immediately after it starts config

Device initialization for OSDs takes on the order of minutes. Having the agent configure all devices before returning to the orchestration leader causes several problems:

  • Orchestrations are blocked for a long period of time
  • New machines coming online may have to wait a long time to be configured
  • More critical services such as etcd and ceph mons will be blocked from failover orchestration for long periods of time.

Completion of OSD configuration is largely independent of core orchestration. The only thing the orchestrator really cares about is whether the agent is actively configuring the OSDs. There are no orchestration failover scenarios that OSDs need to worry about, unlike the more critical etcd and mon services.

Upon a request to configure OSDs, the agent can immediately return success to the orchestrator, indicating that it is working on the configuration. There is no need to wait for all the OSDs to complete.

For applications that need a guarantee that storage is available (i.e., OSD configuration is complete), the orchestrator can provide a helper that signals the progress of available OSDs.
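
A minimal sketch of the agent side under this proposal; the types and helpers here are hypothetical stand-ins, not the actual castle code:

package osd

import "log"

// osdAgent is a hypothetical stand-in for the castle OSD agent.
type osdAgent struct{}

func (a *osdAgent) configureOSD(device string) error { return nil } // placeholder for the real work
func (a *osdAgent) markReady(device string)          { log.Printf("osd on %s is ready", device) }

// ConfigureDevices reports success to the orchestrator immediately and
// completes OSD configuration in the background.
func (a *osdAgent) ConfigureDevices(devices []string) error {
	go func() {
		for _, d := range devices {
			if err := a.configureOSD(d); err != nil {
				log.Printf("failed to configure osd on %s: %v", d, err)
				continue
			}
			a.markReady(d) // progress signal for callers that need storage
		}
	}()
	return nil // the orchestrator only needs to know the work has started
}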

cephmgr/client package cannot use cephmgr/client/test package due to cyclical dependency

We cannot use the mocked types from cephmgr/client/test to test the cephmgr/client package itself because of a cyclical dependency. This indicates a layering issue.

We could move some of the helper methods (e.g. pool.go, auth.go) to a package that is a peer of cephmgr/client. Essentially, usage of the cephmgr/client interfaces should not be within the same package, or else they cannot be tested using the mocked implementations of those interfaces.

/castle/pkg/cephmgr/client
> go test
# github.com/quantum/castle/pkg/cephmgr/client
import cycle not allowed in test
package github.com/quantum/castle/pkg/cephmgr/client (test)
    imports github.com/quantum/castle/pkg/cephmgr/client/test
    imports github.com/quantum/castle/pkg/cephmgr/client

FAIL    github.com/quantum/castle/pkg/cephmgr/client [setup failed]
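
One idiomatic way to break the cycle without moving the helpers is an external test package: a _test.go file in the same directory may declare package client_test, which can import client/test freely because client_test -> client/test -> client is acyclic. A sketch (the mock type and helper names are hypothetical):

// pkg/cephmgr/client/pool_test.go
package client_test // external test package: breaks the import cycle

import (
	"testing"

	"github.com/quantum/castle/pkg/cephmgr/client"
	clienttest "github.com/quantum/castle/pkg/cephmgr/client/test"
)

func TestCreatePool(t *testing.T) {
	conn := &clienttest.MockConnection{} // hypothetical mock from client/test
	if err := client.CreatePool(conn, "mypool"); err != nil { // hypothetical helper
		t.Fatal(err)
	}
}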

I'd like a command-line way to learn about monitors

I'd like a command-line way to learn about the monitors in a cluster (where they are [IP, hostname], how many there are, their health, and other stats). I'm thinking of something that works like the "node ls" command.

Perhaps "rook monitor ls" or just "rook monitors".

Windows build fails due to linux only functionality in syscall.Kill

When building on Windows with

build/run make -j4 cross

A build error occurs:

# github.com/quantum/castle/pkg/util/proc
pkg/util/proc/procmanager.go:141: undefined: syscall.Kill

That functionality is only available on Linux. The dependency occurs because castlectl pulls in the proc package for its Executor functionality; it doesn't need the ProcManager that comes along with it (which contains the Linux-only syscall.Kill).

Consider splitting into new packages or using build tags for the Linux-only code, as sketched below.
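
A minimal sketch of the build-tag option, assuming the kill logic can be isolated in its own files (file names are illustrative; the era's // +build linux syntax works the same as the //go:build form shown):

// pkg/util/proc/kill_linux.go
//go:build linux

package proc

import "syscall"

func kill(pid int) error {
	return syscall.Kill(pid, syscall.SIGKILL)
}

// pkg/util/proc/kill_other.go
//go:build !linux

package proc

import "fmt"

func kill(pid int) error {
	return fmt.Errorf("killing processes is not supported on this platform")
}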

embedded ceph should not shell out to use crush tool

When castle is running on Ubuntu, creating storage pools fails because the mons appear to depend on crushtool being in $PATH.
The castled API handler log shows:

failed to create new pool '{Name:jaredPool1 Number:0}': mon_command osd pool create failed, buf: , info: crushtool check failed with -22: crushtool: exec failed: (2) No such file or directory: cephd: Invalid argument

This ceph mailing list thread explains that the mons need crushtool:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003392.html

Here's the codepath where it's used:
https://github.com/quantum/ceph/blob/47b199109bbff1db37ddff9461652e30d79df330/src/mon/OSDMonitor.cc#L4849

This works on CoreOS because we have some ceph tools embedded in the image: https://github.com/quantum/coreos-overlay/blob/master/sys-cluster/ceph/ceph-9999.ebuild#L76

There may be more unexpected dependencies on the ceph tools lurking in the ceph code.

Repro steps on Ubuntu:

./bin/linux_amd64/castled
./bin/linux_amd64/castlectl pool create --pool-name="mypool1"

Trim the tool dependencies for castled

Oct 13 22:07:02 castle00 rkt[2122]: 2016-10-13 22:07:02.895783 I | Running command: lsblk --all -n -l --output KNAME
Oct 13 22:07:02 castle00 rkt[2122]: 2016-10-13 22:07:02.895985 I | error while discovering hardware: failed to list all devices: Failed to complete lsblk all: exec: "lsblk": executable file not found in $PATH
Oct 13 22:07:02 castle00 rkt[2122]: waiting for ctrl-c interrupt...
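
The failure above is castled shelling out to lsblk, which the minimal image doesn't carry. One way to trim the dependency is to read sysfs directly instead; a rough sketch, not the actual inventory code:

package inventory

import "io/ioutil"

// listDevices enumerates block device names from sysfs directly, avoiding a
// runtime dependency on the lsblk binary being present in the image.
func listDevices() ([]string, error) {
	entries, err := ioutil.ReadDir("/sys/block")
	if err != nil {
		return nil, err
	}
	names := []string{}
	for _, e := range entries {
		names = append(names, e.Name()) // e.g. "sda", "sdb"
	}
	return names, nil
}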

build: CTRL+C hangs sometimes when running in build container

When running a build inside a container using build/run make -j4, hitting CTRL+C sometimes hangs. An explicit docker stop cross-build will stop it. This is likely due to make not handling signals correctly when it is started inside a container as PID 1.
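
The usual fix is to run the build under a small init-style wrapper that forwards signals to the child (or a purpose-built init such as tini). A minimal sketch of such a wrapper in Go:

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// usage: wrapper <command> [args...]
func main() {
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}
	// PID 1 gets no default signal handlers, so forward signals explicitly.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		for s := range sigs {
			cmd.Process.Signal(s)
		}
	}()
	cmd.Wait()
}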

Usage information should not be displayed when there is an error

Usage information is currently displayed when there is an error:

> bin/castled 
2016-10-12 13:00:22.717331 I | cluster max size is:  1
2016-10-12 13:00:22.744193 I | currentNodes:  []
2016-10-12 13:00:22.744218 I | current localURL:  http://127.0.0.1:2379
2016-10-12 13:00:22.744258 I | creating a new embedded etcd...
2016-10-12 13:00:22.744338 I | conf: {e16d06178d2c471eb4b77c9489f00be7 [{http  <nil> 127.0.0.1:2380   false  }] [{http  <nil> 127.0.0.1:2379   false  }] [{http  <nil> 127.0.0.1:2380   false  }] [{http  <nil> 127.0.0.1:2379   false  }] /tmp/etcd-data}
2016-10-12 13:00:22.744356 I | client urls to set listeners for:  [{http  <nil> 127.0.0.1:2379   false  }]
Error: listen tcp 127.0.0.1:2379: bind: address already in use
Usage:
  castled [flags]
  castled [command]

Available Commands:
  version     Print the version number of castled

Flags:
      --devices string         comma separated list of devices to use
      --discovery-url string   etcd discovery URL. Example: http://discovery.castle.com/26bd83c92e7145e6b103f623263f61df
      --etcd-members string    etcd members to connect to. Overrides the discovery URL. Example: http://10.23.45.56:2379
      --force-format           true to force the format of any specified devices, even if they already have a filesystem.  BE CAREFUL!
  -h, --help                   help for castled
      --location string        location of this node for CRUSH placement
      --private-ipv4 string    private IPv4 address for this machine (default "127.0.0.1")

Use "castled [command] --help" for more information about a command.

castled error: listen tcp 127.0.0.1:2379: bind: address already in use
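
The usage text above looks like cobra output; if so, setting SilenceUsage on the root command prints only the error on runtime failures. A sketch (startDaemon is a hypothetical entry point):

package main

import "github.com/spf13/cobra"

func startDaemon() error { return nil } // hypothetical daemon entry point

func newRootCmd() *cobra.Command {
	return &cobra.Command{
		Use:          "castled",
		SilenceUsage: true, // on a runtime error, print only the error
		RunE: func(cmd *cobra.Command, args []string) error {
			return startDaemon()
		},
	}
}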

Refreshing etcd clients when etcd membership changes

Each castled server creates an etcd client instance and passes it to other parts of the code as part of the clusterd context. Since clusterd supports dynamic resizing of the etcd cluster, the original etcd client can become outdated, leading to timeouts and failure of castled.
The fix should update the etcd client in the context whenever the etcd membership changes.
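
Assuming the etcd v2 client (github.com/coreos/etcd/client) that was current at the time, its built-in AutoSync already does most of this by periodically refreshing the endpoint list from cluster membership; a sketch:

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/client"
)

func keepClientFresh(c client.Client) {
	// AutoSync blocks, re-querying the member list every interval so the
	// client tracks etcd membership changes made by clusterd.
	go func() {
		if err := c.AutoSync(context.Background(), 10*time.Second); err != nil {
			log.Printf("etcd endpoint sync stopped: %v", err)
		}
	}()
}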

debug logging should be easier

When castled starts up, it overwrites any existing ceph config files with its hard-coded logging levels. This makes it impossible to enable DEBUG logging without rebuilding the binary.

We should make it easier to enable DEBUG logging, especially for bootstrapping scenarios. Perhaps a --debug switch to castled, as sketched below?
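
A sketch of the flag wiring only; the rootCmd layout and the plumbing from the flag to the ceph log levels are hypothetical:

package main

import "github.com/spf13/cobra"

var debug bool

var rootCmd = &cobra.Command{Use: "castled"}

func init() {
	// --debug would override the hard-coded ceph log levels at startup
	rootCmd.PersistentFlags().BoolVar(&debug, "debug", false,
		"enable DEBUG logging for castled and the ceph daemons it manages")
}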

Allow vendoring to be optional in the Makefile

To support quick/dirty dev scenarios where you may have manually copied some files into the vendor directory, it would be nice if the Makefile could be given an explicit option to skip vendoring (glide install) so the temporary dirty dev changes are not overwritten.
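
A sketch of one way to expose this in the Makefile (the VENDOR variable is illustrative, not an existing option):

# make VENDOR=0 skips glide install so hand-copied vendor/ files survive
VENDOR ?= 1

build:
ifeq ($(VENDOR),1)
	glide install
endif
	go build ./...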

orchestration timeout for osds is too aggressive

The timeout for the orchestration leader is two minutes when waiting for a node to respond. If a node has many disks on which to configure osds, it will take longer than this timeout and fail the orchestration even though the osds will still succeed.

A simple way to extend the timeout is a heartbeat of some kind after each osd is configured. This signals to the leader that the node is still working, so the timeout can be reset.
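
A sketch of the leader-side wait loop with a heartbeat-resettable timeout; the channel plumbing is hypothetical:

package orchestrator

import (
	"fmt"
	"time"
)

// waitForOSDConfig waits for an agent to finish, resetting the timeout each
// time a heartbeat arrives (e.g. after every osd the node configures).
func waitForOSDConfig(done, heartbeat <-chan struct{}, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case <-done:
			return nil
		case <-heartbeat:
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(timeout) // node is still making progress
		case <-timer.C:
			return fmt.Errorf("node made no progress for %v", timeout)
		}
	}
}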

rookd did not stop correctly when running under systemd

systemctl stop castled.service
Oct 13 22:29:03 castle00 rkt[2122]: 2016-10-13 22:29:03.708259 I | Node 172.20.20.10 has age of 0s
Oct 13 22:29:08 castle00 rkt[2122]: 2016-10-13 22:29:08.709403 I | Discovered 1 nodes
Oct 13 22:29:08 castle00 rkt[2122]: 2016-10-13 22:29:08.710259 I | Node 172.20.20.10 has age of 0s
Oct 13 22:29:12 castle00 systemd[1]: Stopping Castle Daemon - software defined storage...
Oct 13 22:29:12 castle00 systemd[1]: castled.service: Killing process 2213 (castled) with signal SIGKILL.
Oct 13 22:29:12 castle00 systemd[1]: Stopped Castle Daemon - software defined storage.
core@castle00 ~ $ 
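
systemctl stop sends SIGTERM first and escalates to SIGKILL after a timeout, which is exactly what the log shows. The daemon needs to catch SIGTERM and tear down its children promptly; a minimal sketch with hypothetical helpers:

package main

import (
	"os"
	"os/signal"
	"syscall"
)

func stopChildDaemons() {}           // hypothetical: terminate mons/osds cleanly
func runDaemon()        { select {} } // hypothetical main loop

func main() {
	sigs := make(chan os.Signal, 1)
	// Exit on SIGTERM before systemd escalates to SIGKILL.
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sigs
		stopChildDaemons()
		os.Exit(0)
	}()
	runDaemon()
}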

track disks with uuid rather than serial number

Hardware discovery and orchestration currently rely on the disk serial number as a constant identity. However, a serial number is not always available, as seen in containers. We need to rely solely on UUIDs for a stable disk identity.
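
udev already maintains stable symlinks that can serve as this identity; a sketch that maps stable IDs back to kernel device names by reading /dev/disk/by-id (/dev/disk/by-uuid works the same way):

package inventory

import (
	"io/ioutil"
	"os"
	"path/filepath"
)

// stableIDs maps udev-maintained stable identifiers to kernel device names,
// e.g. "wwn-0x5000c5008e5cd04b" -> "sdb".
func stableIDs() (map[string]string, error) {
	const dir = "/dev/disk/by-id"
	entries, err := ioutil.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	ids := map[string]string{}
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(dir, e.Name()))
		if err != nil {
			continue
		}
		ids[e.Name()] = filepath.Base(target) // target looks like ../../sdb
	}
	return ids, nil
}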

osds should be configured on a new node as long as there is mon quorum

When a node comes up, its devices should be configured with osds even if monitor configuration fails during orchestration. The osd config does, at a minimum, require mon quorum.

For example, say we already have one node in the cluster with a single monitor running. Now two new machines come online and the orchestration chooses to increase from 1 to 3 monitors. If the new monitors fail to start, there is no reason to skip configuration of osds on nodes 2 and 3, assuming the first mon is still healthy.
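
In orchestration terms, a mon failure should only gate osd configuration on quorum, not on full success. Roughly, with all names hypothetical:

package orchestrator

import "log"

func configureMons(node string) error { return nil }  // hypothetical
func monsInQuorum() bool              { return true } // hypothetical
func configureOSDs(node string)       {}              // hypothetical

// orchestrateNode lets a mon failure log a warning without blocking osd
// configuration, as long as the existing mons still have quorum.
func orchestrateNode(node string) {
	if err := configureMons(node); err != nil {
		log.Printf("mon orchestration failed on %s: %v", node, err)
	}
	if monsInQuorum() {
		configureOSDs(node)
	}
}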

build: manage vendor directories when building in a container

On macOS we rsync the source tree from the host to the build container. This is a one-way sync: if the vendor directory is not populated before calling build/run, it will be populated inside the container but never copied back to the host. As a result, each time build/run runs it has to fetch the vendor directory again.

An easy workaround is to call make vendor on the host before calling build/run, but we need a better solution. We could bind mount just the vendor directory, but this issue gets in the way: Masterminds/glide#642.

Also related: when we install glide during make, it goes into the tools dir and has the same issue as above. In addition, glide is installed for a single architecture, so if we bind mount it, it will be the wrong arch for the host on a Mac.

incremental builds on macOS not effective due to wrong timezone

Looks like we need to set the timezone in the container to match the host.

bassam@bassamQ [AWS production] ~/Projects/src/github.com/quantum/castle (master)
> stat pkg/clusterd/inventory/hardware.go 
16777220 119333050 -rw-r--r-- 1 bassam staff 0 1988 "Oct 18 22:28:40 2016" "Oct  9 17:05:44 2016" "Oct  9 17:05:44 2016" "Oct  9 17:05:44 2016" 4096 8 0 pkg/clusterd/inventory/hardware.go
bassam@bassamQ [AWS production] ~/Projects/src/github.com/quantum/castle (master)
> build/run stat pkg/clusterd/inventory/disk.go                                             
  File: 'pkg/clusterd/inventory/disk.go'
  Size: 9098        Blocks: 24         IO Block: 4096   regular file
Device: fe02h/65026d    Inode: 2501685     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  501/ UNKNOWN)   Gid: (   20/ dialout)
Access: 2016-10-19 20:24:29.319294497 +0000
Modify: 2016-10-19 21:14:35.000000000 +0000
Change: 2016-10-19 20:21:08.969798242 +0000
 Birth: -

Need a way to specify which disks to use and not use

Currently CASTLED_DATA_DEVICES supports device names such as sdb, which are not stable and can vary on every boot. We need to support different options for specifying which disks to use and not use for storage.

One option would be to let the user use ALL disks for storage except the system disk. It would be easy to find the system disk and exclude it. For example,

CASTLED_DATA_DEVICES=all

In this case, the system disk should be excluded with an INFO message, instead of:

Oct 24 16:03:26 castle00 rkt[1528]: 2016-10-24 16:03:26.685353 I | ERROR: failed to config osd on device sdd. failed device sdd. device sdd already formatted with ext4

Another option would be to support an arbitrary filtering criteria for disks based on information obtainable by libblkid or lsblk, for example:

CASTLED_DATA_DEVICES="SUBSYSTEM=block:scsi:pci,SIZE>=5TB"
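
A rough sketch of the "all" handling with the system disk excluded at INFO level; the function and its inputs are hypothetical:

package inventory

import (
	"log"
	"strings"
)

// selectDevices interprets CASTLED_DATA_DEVICES: "all" means every disk
// except the system disk, which is skipped with an INFO message.
func selectDevices(spec string, disks []string, systemDisk string) []string {
	if spec != "all" {
		return strings.Split(spec, ",") // existing name-based behavior
	}
	var selected []string
	for _, d := range disks {
		if d == systemDisk {
			log.Printf("INFO: skipping system disk %s", d)
			continue
		}
		selected = append(selected, d)
	}
	return selected
}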

Speed up the build

Look into using ccache to speed up clean builds. Also pass the -j parameter to make and to nested projects like rocksdb. Finally, for CircleCI, cache the docker image.

Launch processes through one code path instead of two

The process management needs to be cleaned up. The ProcManager needs to use the Executor to launch processes instead of launching processes itself. This will also ensure the logging from child processes is captured consistently.
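
A sketch of the consolidation; the Executor interface exists in the codebase, but the method shown here is illustrative:

package proc

import "os/exec"

// Executor is the single abstraction for launching processes; castlectl
// already consumes it, and ProcManager should delegate to it too.
type Executor interface {
	StartExecuteCommand(name string, args ...string) (*exec.Cmd, error)
}

type ProcManager struct {
	executor Executor
}

// Start routes through the Executor so child-process logging is captured
// in one place instead of two.
func (p *ProcManager) Start(name string, args ...string) (*exec.Cmd, error) {
	return p.executor.StartExecuteCommand(name, args...)
}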

Disk support, tools needed and implication on security model

We would like castled to automatically manage disk devices, including partitioning them, formatting them, etc. For example:

castled --data-devices=/dev/sdb,/dev/sdc

castled should prepare these disks any way it wants, including formatting them for bluestore, etc.

There are two considerations worth highlighting for this approach:

  1. Disk operations almost always require root privs
  2. Disk tools like sgdisk are required for this to work

It's our goal to minimize dependencies on the host distro to support running in minimal containers or on hosts like CoreOS.

One possible approach to balance the two issues is to add a new verb to castled:

castled prepare --data-devices=/dev/sdb,/dev/sdc

This would require root privileges, and also requires some Linux tools like sgdisk to be available. castled can also support a flag that automatically prepares disks if they are not already prepared:

castled --data-devices=/dev/sdb,/dev/sdc --data-devices-prepare=auto

--data-devices-prepare should default to "auto" but do nothing (and not require root) if the disks are already prepared. It can also be set to "disabled" or "false".

This would enable the caller to decide whether to prepare the disks ahead of time, and run castled in a non-root account, or give castled root privs and let it auto-prepare.

With regard to the tools castled needs to partition devices, it seems wise to use Alpine Linux in our containers for tools like sgdisk and lsblk. While it's possible to write these in Go/cgo and remove the dependency (see bassam/rook-old@b07e06e for an example of a cgo lsblk), we should hold off until we understand all the dependencies we need. #73 is also related to this.

support for erasure coded storage pools

Storage pools are currently created with the default settings for replication. castlectl should also support operations on erasure code profiles, so that erasure coded storage pools can be created.

The listing of storage pool details should also include information about replication/erasure code profile.
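
For reference, the underlying Ceph operations that castlectl would wrap look like this (the profile name, k/m values, and pg counts are illustrative):

ceph osd erasure-code-profile set myprofile k=2 m=1
ceph osd pool create ecpool 128 128 erasure myprofile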

Bluestore osd configuration intermittently fails with (22) Invalid argument

Starting the bluestore osds in the demo vagrant environment, we intermittently see the osd fail with the following error:

Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.147976 c937080 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.177217 c937080 -1 bluestore(/var/lib/castled/osd0) _read_fsid unparsable uuid
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.189627 c937080 -1 bdev(/var/lib/castled/osd0/block) open open got: (22) Invalid argument
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.189686 c937080 -1 OSD::mkfs: ObjectStore::mkfs failed with error -22
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.189758 c937080 -1  ** ERROR: error creating empty object store in /var/lib/castled/osd0: (22) Invalid argument
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.191071 I | ERROR: failed to config osd on device sdd. failed to initialize OSD at /var/lib/castled/osd0: failed osd mkfs for OSD ID 0, UUID 27eaf968-5e1c-4ddd-a967-6e02291c3c4e, dataDir /var/lib/castled/osd0: failed to run osd: exit status 1

etcd quorum is broken on multi-node cluster

When an etcd cluster size greater than one is specified, the first machine will wait for more machines. However, later machines will come up and attempt to continue even though quorum has not formed.

rookd daemons such as MON do not have the right parent process

When running in rkt on CoreOS, this is the output of systemctl status castled:

 ├─machine.slice
           │ └─castled.service
           │   ├─2065 /usr/bin/castled daemon --type=mon -- --foreground --cluster=castlecluster --name=mon.mon0 --mon-data=/tmp/mon0/mon.mon0 --conf=/tmp/mon0/castlecluster.config --public-addr=172.20.20.10:6790
           │   └─2233 /usr/bin/castled

Note that process 2065 is not a child of process 2233.

Also, when process 2233 is killed, 2065 remains. This tells me something is wrong with how we start child processes.

If etcd state is lost, cluster may be unrecoverable

If the state stored in etcd is lost it may not be possible to recover the cluster.

  1. The etcd state should be backed up somewhere.
  2. In the case of an etcd state loss the cluster should be able to come up with empty etcd.

Number 2 is a problem because the discovery.etcd.io token names etcd members that are no longer valid. This could be fixed by an enhanced discovery service.
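
For point 1, the etcd v2 tooling already provides a one-shot backup command that could be run periodically (paths illustrative):

etcdctl backup --data-dir /var/lib/etcd --backup-dir /var/lib/etcd-backup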

Steps to reproduce:

  1. Issue a new discovery token.
  2. Boot a multi-node cluster onto ramfs and allow the cluster to come up.
  3. Reboot the cluster with the same discovery token from step 1.
  4. Nodes will hang because of the confused state of etcd.

Expected - either:

  • Etcd state would be restored from backup OR
  • Discovery information would be reset and the cluster would be brought up "cold".

Dan

Proper use of public/private ip addresses

The public and private IP addresses are passed as command line parameters, but only the private IP is currently used everywhere. We need to use the public IP wherever public networking is involved, such as for mons and osds.
