
rook's Introduction

Rook


What is Rook?

Rook is an open source cloud-native storage orchestrator for Kubernetes, providing the platform, framework, and support for Ceph storage to natively integrate with Kubernetes.

Ceph is a distributed storage system that provides file, block and object storage and is deployed in large scale production clusters.

Rook automates deployment and management of Ceph to provide self-managing, self-scaling, and self-healing storage services. The Rook operator does this by building on Kubernetes resources to deploy, configure, provision, scale, upgrade, and monitor Ceph.

The status of the Ceph storage provider is Stable. Features and improvements will be planned for many future versions. Upgrades between versions are provided to ensure backward compatibility between releases.

Rook is hosted by the Cloud Native Computing Foundation (CNCF) as a graduated level project. If you are a company that wants to help shape the evolution of technologies that are container-packaged, dynamically-scheduled and microservices-oriented, consider joining the CNCF. For details about who's involved and how Rook plays a role, read the CNCF announcement.

Getting Started and Documentation

For installation, deployment, and administration, see our Documentation and QuickStart Guide.

Contributing

We welcome contributions. See Contributing to get started.

Report a Bug

For filing bugs, suggesting improvements, or requesting new features, please open an issue.

Reporting Security Vulnerabilities

If you find a vulnerability or a potential vulnerability in Rook, please let us know immediately at [email protected]. We'll send a confirmation email to acknowledge your report, and a follow-up email once we've determined whether the issue is valid.

For further details, please see the complete security release process.

Contact

Please use the following to reach members of the community:

Community Meeting

A regular community meeting takes place every other Tuesday at 9:00 AM PT (Pacific Time). Convert to your local timezone.

Any changes to the meeting schedule will be added to the agenda doc and posted to Slack #announcements.

Anyone who wants to discuss the direction of the project, design and implementation reviews, or general questions with the broader community is welcome and encouraged to join.

Official Releases

Official releases of Rook can be found on the releases page. Please note that it is strongly recommended that you use official releases of Rook, as unreleased versions from the master branch are subject to changes and incompatibilities that will not be supported in the official releases. Builds from the master branch can have functionality changed and even removed at any time without compatibility support and without prior notice.

Licensing

Rook is under the Apache 2.0 license.


rook's People

Contributors

alimaredia, anmolsachan, aruniiird, autosupport, bassam, blaineexe, dangula, dependabot[bot], dotnwat, galexrt, humblec, jbw976, jmolmo, kokhang, leseb, madhu-1, parth-gr, phlogistonjohn, rakshith-r, rkachach, rohan47, rohantmp, sabbot, satoru-takeuchi, sebastianriese, sp98, subhamkrai, thotz, travisn, umangachapagain


rook's Issues

Add support for etcd failover scenario

When a castled server that is already a member of the etcd cluster fails, etcdmgr should create an instance of embedded etcd on an existing healthy node to replace the failed one.

OSD orchestration should return success to the leader immediately after it starts config

Device initialization for OSDs takes on the order of minutes. Having the agent configure all devices before returning to the orchestration leader causes several problems:

  • Orchestrations are blocked for a long period of time
  • New machines coming online may have to wait a long time to be configured
  • More critical services such as etcd and ceph mons will be blocked from failover orchestration for long periods of time.

Completion of OSD configuration is largely independent of core orchestration. The only thing the orchestrator really cares about is whether the agent is actively configuring the OSDs. There are no orchestration failover scenarios that OSDs need to worry about, unlike the more critical etcd and mon services.

Upon a request to configure OSDs, the agent can immediately return success to the orchestrator, indicating that it is working on the configuration. There is no need to wait for all the OSDs to complete.

For applications that need a guarantee that storage is available (i.e., OSD configuration is complete), the orchestrator can provide a helper that signals the progress of available OSDs.
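
A minimal sketch of the agent side under this proposal; the types and helpers here are hypothetical stand-ins, not the actual castle code:

package osd

import "log"

// osdAgent is a hypothetical stand-in for the castle OSD agent.
type osdAgent struct{}

func (a *osdAgent) configureOSD(device string) error { return nil } // placeholder for the real work
func (a *osdAgent) markReady(device string)          { log.Printf("osd on %s is ready", device) }

// ConfigureDevices reports success to the orchestrator immediately and
// completes OSD configuration in the background.
func (a *osdAgent) ConfigureDevices(devices []string) error {
	go func() {
		for _, d := range devices {
			if err := a.configureOSD(d); err != nil {
				log.Printf("failed to configure osd on %s: %v", d, err)
				continue
			}
			a.markReady(d) // progress signal for callers that need storage
		}
	}()
	return nil // the orchestrator only needs to know the work has started
}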

cephmgr/client package cannot use cephmgr/client/test package due to cyclical dependency

We cannot use the mocked types from cephmgr/client/test to test the cephmgr/client package itself because of a cyclical dependency. This indicates a layering issue.

We could move some of the helper methods (e.g. pool.go, auth.go) to a package that is a peer of cephmgr/client. Essentially, usage of the cephmgr/client interfaces should not be within the same package, or else they cannot be tested using the mocked implementations of those interfaces.

/castle/pkg/cephmgr/client
> go test
# github.com/quantum/castle/pkg/cephmgr/client
import cycle not allowed in test
package github.com/quantum/castle/pkg/cephmgr/client (test)
    imports github.com/quantum/castle/pkg/cephmgr/client/test
    imports github.com/quantum/castle/pkg/cephmgr/client

FAIL    github.com/quantum/castle/pkg/cephmgr/client [setup failed]
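
One idiomatic way to break the cycle without moving the helpers is an external test package: a _test.go file in the same directory may declare package client_test, which can import client/test freely because client_test -> client/test -> client is acyclic. A sketch (the mock type and helper names are hypothetical):

// pkg/cephmgr/client/pool_test.go
package client_test // external test package: breaks the import cycle

import (
	"testing"

	"github.com/quantum/castle/pkg/cephmgr/client"
	clienttest "github.com/quantum/castle/pkg/cephmgr/client/test"
)

func TestCreatePool(t *testing.T) {
	conn := &clienttest.MockConnection{} // hypothetical mock from client/test
	if err := client.CreatePool(conn, "mypool"); err != nil { // hypothetical helper
		t.Fatal(err)
	}
}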

I'd like a command-line way to learn about monitors

I'd like a command-line way to learn about the monitors in a cluster (where they are [IP, hostname], how many there are, their health, and other stats). I'm thinking of something that works like the "node ls" command.

Perhaps "rook monitor ls" or just "rook monitors".

Windows build fails due to linux only functionality in syscall.Kill

When building on Windows with

build/run make -j4 cross

A build error occurs:

# github.com/quantum/castle/pkg/util/proc
pkg/util/proc/procmanager.go:141: undefined: syscall.Kill

That functionality is only available on Linux. The dependency occurs because castlectl pulls in the proc package for its Executor functionality; it doesn't need the ProcManager that comes along with it (which contains the Linux-only syscall.Kill).

Consider splitting into new packages or using build tags for the Linux-only code, as sketched below.
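
A minimal sketch of the build-tag option, assuming the kill logic can be isolated in its own files (file names are illustrative; the era's // +build linux syntax works the same as the //go:build form shown):

// pkg/util/proc/kill_linux.go
//go:build linux

package proc

import "syscall"

func kill(pid int) error {
	return syscall.Kill(pid, syscall.SIGKILL)
}

// pkg/util/proc/kill_other.go
//go:build !linux

package proc

import "fmt"

func kill(pid int) error {
	return fmt.Errorf("killing processes is not supported on this platform")
}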

embedded ceph should not shell out to use crush tool

When castle is running on Ubuntu, creating storage pools fails because the mons appear to depend on crushtool being in $PATH.
The castled API handler log shows:

failed to create new pool '{Name:jaredPool1 Number:0}': mon_command osd pool create failed, buf: , info: crushtool check failed with -22: crushtool: exec failed: (2) No such file or directory: cephd: Invalid argument

This ceph mailing list thread explains that the mons need crushtool:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003392.html

Here's the codepath where it's used:
https://github.com/quantum/ceph/blob/47b199109bbff1db37ddff9461652e30d79df330/src/mon/OSDMonitor.cc#L4849

This works on CoreOS because we have some ceph tools embedded in the image: https://github.com/quantum/coreos-overlay/blob/master/sys-cluster/ceph/ceph-9999.ebuild#L76

There may be more unexpected dependencies on the ceph tools lurking in the ceph code.

Repro steps on Ubuntu:

./bin/linux_amd64/castled
./bin/linux_amd64/castlectl pool create --pool-name="mypool1"

Trim the tool dependencies for castled

Oct 13 22:07:02 castle00 rkt[2122]: 2016-10-13 22:07:02.895783 I | Running command: lsblk --all -n -l --output KNAME
Oct 13 22:07:02 castle00 rkt[2122]: 2016-10-13 22:07:02.895985 I | error while discovering hardware: failed to list all devices: Failed to complete lsblk all: exec: "lsblk": executable file not found in $PATH
Oct 13 22:07:02 castle00 rkt[2122]: waiting for ctrl-c interrupt...
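
The failure above is castled shelling out to lsblk, which the minimal image doesn't carry. One way to trim the dependency is to read sysfs directly instead; a rough sketch, not the actual inventory code:

package inventory

import "io/ioutil"

// listDevices enumerates block device names from sysfs directly, avoiding a
// runtime dependency on the lsblk binary being present in the image.
func listDevices() ([]string, error) {
	entries, err := ioutil.ReadDir("/sys/block")
	if err != nil {
		return nil, err
	}
	names := []string{}
	for _, e := range entries {
		names = append(names, e.Name()) // e.g. "sda", "sdb"
	}
	return names, nil
}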

build: CTRL+C hangs sometimes when running in build container

When running a build inside a container using build/run make -j4, hitting CTRL+C sometimes hangs. An explicit docker stop cross-build will stop it. This is likely due to make not handling signals correctly when it is started inside a container as PID 1.
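
The usual fix is to run the build under a small init-style wrapper that forwards signals to the child (or a purpose-built init such as tini). A minimal sketch of such a wrapper in Go:

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// usage: wrapper <command> [args...]
func main() {
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}
	// PID 1 gets no default signal handlers, so forward signals explicitly.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		for s := range sigs {
			cmd.Process.Signal(s)
		}
	}()
	cmd.Wait()
}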

Usage information should not be displayed when there is an error

Usage information is currently displayed when there is an error:

> bin/castled 
2016-10-12 13:00:22.717331 I | cluster max size is:  1
2016-10-12 13:00:22.744193 I | currentNodes:  []
2016-10-12 13:00:22.744218 I | current localURL:  http://127.0.0.1:2379
2016-10-12 13:00:22.744258 I | creating a new embedded etcd...
2016-10-12 13:00:22.744338 I | conf: {e16d06178d2c471eb4b77c9489f00be7 [{http  <nil> 127.0.0.1:2380   false  }] [{http  <nil> 127.0.0.1:2379   false  }] [{http  <nil> 127.0.0.1:2380   false  }] [{http  <nil> 127.0.0.1:2379   false  }] /tmp/etcd-data}
2016-10-12 13:00:22.744356 I | client urls to set listeners for:  [{http  <nil> 127.0.0.1:2379   false  }]
Error: listen tcp 127.0.0.1:2379: bind: address already in use
Usage:
  castled [flags]
  castled [command]

Available Commands:
  version     Print the version number of castled

Flags:
      --devices string         comma separated list of devices to use
      --discovery-url string   etcd discovery URL. Example: http://discovery.castle.com/26bd83c92e7145e6b103f623263f61df
      --etcd-members string    etcd members to connect to. Overrides the discovery URL. Example: http://10.23.45.56:2379
      --force-format           true to force the format of any specified devices, even if they already have a filesystem.  BE CAREFUL!
  -h, --help                   help for castled
      --location string        location of this node for CRUSH placement
      --private-ipv4 string    private IPv4 address for this machine (default "127.0.0.1")

Use "castled [command] --help" for more information about a command.

castled error: listen tcp 127.0.0.1:2379: bind: address already in use
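
The usage text above looks like cobra output; if so, setting SilenceUsage on the root command prints only the error on runtime failures. A sketch (startDaemon is a hypothetical entry point):

package main

import "github.com/spf13/cobra"

func startDaemon() error { return nil } // hypothetical daemon entry point

func newRootCmd() *cobra.Command {
	return &cobra.Command{
		Use:          "castled",
		SilenceUsage: true, // on a runtime error, print only the error
		RunE: func(cmd *cobra.Command, args []string) error {
			return startDaemon()
		},
	}
}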

Refreshing etcd clients when etcd membership changes

Each castled server creates an etcd client instance and passes it to other parts of the code as part of the clusterd context. Since clusterd supports dynamic resizing of the etcd cluster, the original etcd client can become outdated, leading to timeouts and failure of castled.
The fix should update the etcd client in the context whenever the etcd membership changes.
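
Assuming the etcd v2 client (github.com/coreos/etcd/client) that was current at the time, its built-in AutoSync already does most of this by periodically refreshing the endpoint list from cluster membership; a sketch:

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/client"
)

func keepClientFresh(c client.Client) {
	// AutoSync blocks, re-querying the member list every interval so the
	// client tracks etcd membership changes made by clusterd.
	go func() {
		if err := c.AutoSync(context.Background(), 10*time.Second); err != nil {
			log.Printf("etcd endpoint sync stopped: %v", err)
		}
	}()
}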

debug logging should be easier

When castled starts up, it overwrites any existing ceph config files with its hard-coded logging levels. This makes it impossible to enable DEBUG logging without rebuilding the binary.

We should make it easier to enable DEBUG logging, especially for bootstrapping scenarios. Perhaps a --debug switch to castled, as sketched below?
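
A sketch of the flag wiring only; the rootCmd layout and the plumbing from the flag to the ceph log levels are hypothetical:

package main

import "github.com/spf13/cobra"

var debug bool

var rootCmd = &cobra.Command{Use: "castled"}

func init() {
	// --debug would override the hard-coded ceph log levels at startup
	rootCmd.PersistentFlags().BoolVar(&debug, "debug", false,
		"enable DEBUG logging for castled and the ceph daemons it manages")
}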

Allow vendoring to be optional in the Makefile

To support quick/dirty dev scenarios where you may have manually copied some files into the vendor directory, it would be nice if the Makefile could be given an explicit option to skip vendoring (glide install) so the temporary dirty dev changes are not overwritten.
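
A sketch of one way to expose this in the Makefile (the VENDOR variable is illustrative, not an existing option):

# make VENDOR=0 skips glide install so hand-copied vendor/ files survive
VENDOR ?= 1

build:
ifeq ($(VENDOR),1)
	glide install
endif
	go build ./...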

orchestration timeout for osds is too aggressive

The timeout for the orchestration leader is two minutes when waiting for a node to respond. If a node has many disks on which to configure osds, it will take longer than this timeout and fail the orchestration even though the osds will still succeed.

A simple way to extend the timeout is a heartbeat of some kind after each osd is configured. This signals to the leader that the node is still working, so the timeout can be reset.
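
A sketch of the leader-side wait loop with a heartbeat-resettable timeout; the channel plumbing is hypothetical:

package orchestrator

import (
	"fmt"
	"time"
)

// waitForOSDConfig waits for an agent to finish, resetting the timeout each
// time a heartbeat arrives (e.g. after every osd the node configures).
func waitForOSDConfig(done, heartbeat <-chan struct{}, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case <-done:
			return nil
		case <-heartbeat:
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(timeout) // node is still making progress
		case <-timer.C:
			return fmt.Errorf("node made no progress for %v", timeout)
		}
	}
}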

rookd did not stop correctly when running under systemd

systemctl stop castled.service
Oct 13 22:29:03 castle00 rkt[2122]: 2016-10-13 22:29:03.708259 I | Node 172.20.20.10 has age of 0s
Oct 13 22:29:08 castle00 rkt[2122]: 2016-10-13 22:29:08.709403 I | Discovered 1 nodes
Oct 13 22:29:08 castle00 rkt[2122]: 2016-10-13 22:29:08.710259 I | Node 172.20.20.10 has age of 0s
Oct 13 22:29:12 castle00 systemd[1]: Stopping Castle Daemon - software defined storage...
Oct 13 22:29:12 castle00 systemd[1]: castled.service: Killing process 2213 (castled) with signal SIGKILL.
Oct 13 22:29:12 castle00 systemd[1]: Stopped Castle Daemon - software defined storage.
core@castle00 ~ $ 
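
systemctl stop sends SIGTERM first and escalates to SIGKILL after a timeout, which is exactly what the log shows. The daemon needs to catch SIGTERM and tear down its children promptly; a minimal sketch with hypothetical helpers:

package main

import (
	"os"
	"os/signal"
	"syscall"
)

func stopChildDaemons() {}           // hypothetical: terminate mons/osds cleanly
func runDaemon()        { select {} } // hypothetical main loop

func main() {
	sigs := make(chan os.Signal, 1)
	// Exit on SIGTERM before systemd escalates to SIGKILL.
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sigs
		stopChildDaemons()
		os.Exit(0)
	}()
	runDaemon()
}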

track disks with uuid rather than serial number

Hardware discovery and orchestration currently rely on the disk serial number as a constant identity. However, a serial number is not always available, as seen in containers. We need to rely solely on UUIDs for a stable disk identity.
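
udev already maintains stable symlinks that can serve as this identity; a sketch that maps stable IDs back to kernel device names by reading /dev/disk/by-id (/dev/disk/by-uuid works the same way):

package inventory

import (
	"io/ioutil"
	"os"
	"path/filepath"
)

// stableIDs maps udev-maintained stable identifiers to kernel device names,
// e.g. "wwn-0x5000c5008e5cd04b" -> "sdb".
func stableIDs() (map[string]string, error) {
	const dir = "/dev/disk/by-id"
	entries, err := ioutil.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	ids := map[string]string{}
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(dir, e.Name()))
		if err != nil {
			continue
		}
		ids[e.Name()] = filepath.Base(target) // target looks like ../../sdb
	}
	return ids, nil
}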

osds should be configured on a new node as long as there is mon quorum

When a node comes up, its devices should be configured with osds even if monitor configuration fails during orchestration. The osd config does, at a minimum, require mon quorum.

For example, say we already have one node in the cluster with a single monitor running. Now two new machines come online and the orchestration chooses to increase from 1 to 3 monitors. If the new monitors fail to start, there is no reason to skip configuration of osds on nodes 2 and 3, assuming the first mon is still healthy.
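
In orchestration terms, a mon failure should only gate osd configuration on quorum, not on full success. Roughly, with all names hypothetical:

package orchestrator

import "log"

func configureMons(node string) error { return nil }  // hypothetical
func monsInQuorum() bool              { return true } // hypothetical
func configureOSDs(node string)       {}              // hypothetical

// orchestrateNode lets a mon failure log a warning without blocking osd
// configuration, as long as the existing mons still have quorum.
func orchestrateNode(node string) {
	if err := configureMons(node); err != nil {
		log.Printf("mon orchestration failed on %s: %v", node, err)
	}
	if monsInQuorum() {
		configureOSDs(node)
	}
}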

build: manage vendor directories when building in a container

On macOS we rsync the source tree from the host to the build container. This is a one-way sync: if the vendor directory is not populated before calling build/run, it will be populated inside the container but never copied back to the host. As a result, each time build/run runs it has to fetch the vendor directory again.

An easy workaround is to call make vendor on the host before calling build/run, but we need a better solution. We could bind mount just the vendor directory, but this issue gets in the way: Masterminds/glide#642.

Also related: when we install glide during make, it goes into the tools dir and has the same issue as above. In addition, glide is installed for a single architecture, so if we bind mount it, it will be the wrong arch for the host on a Mac.

incremental builds on macOS not effective due to wrong timezone

Looks like we need to set the timezone in the container to match the host.

bassam@bassamQ [AWS production] ~/Projects/src/github.com/quantum/castle (master)
> stat pkg/clusterd/inventory/hardware.go 
16777220 119333050 -rw-r--r-- 1 bassam staff 0 1988 "Oct 18 22:28:40 2016" "Oct  9 17:05:44 2016" "Oct  9 17:05:44 2016" "Oct  9 17:05:44 2016" 4096 8 0 pkg/clusterd/inventory/hardware.go
bassam@bassamQ [AWS production] ~/Projects/src/github.com/quantum/castle (master)
> build/run stat pkg/clusterd/inventory/disk.go                                             
  File: 'pkg/clusterd/inventory/disk.go'
  Size: 9098        Blocks: 24         IO Block: 4096   regular file
Device: fe02h/65026d    Inode: 2501685     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  501/ UNKNOWN)   Gid: (   20/ dialout)
Access: 2016-10-19 20:24:29.319294497 +0000
Modify: 2016-10-19 21:14:35.000000000 +0000
Change: 2016-10-19 20:21:08.969798242 +0000
 Birth: -

Need a way to specify which disks to use and not use

Currently CASTLED_DATA_DEVICES supports device names such as sdb, which are not stable and can vary on every boot. We need to support different options for specifying which disks to use and not use for storage.

One option would be to let the user use ALL disks for storage except the system disk. It would be easy to find the system disk and exclude it. For example,

CASTLED_DATA_DEVICES=all

In this case, the system disk should be excluded with an INFO message, instead of:

Oct 24 16:03:26 castle00 rkt[1528]: 2016-10-24 16:03:26.685353 I | ERROR: failed to config osd on device sdd. failed device sdd. device sdd already formatted with ext4

Another option would be to support an arbitrary filtering criteria for disks based on information obtainable by libblkid or lsblk, for example:

CASTLED_DATA_DEVICES="SUBSYSTEM=block:scsi:pci,SIZE>=5TB"
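
A rough sketch of the "all" handling with the system disk excluded at INFO level; the function and its inputs are hypothetical:

package inventory

import (
	"log"
	"strings"
)

// selectDevices interprets CASTLED_DATA_DEVICES: "all" means every disk
// except the system disk, which is skipped with an INFO message.
func selectDevices(spec string, disks []string, systemDisk string) []string {
	if spec != "all" {
		return strings.Split(spec, ",") // existing name-based behavior
	}
	var selected []string
	for _, d := range disks {
		if d == systemDisk {
			log.Printf("INFO: skipping system disk %s", d)
			continue
		}
		selected = append(selected, d)
	}
	return selected
}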

Speed up the build

Look into using ccache to speed up clean builds. Also pass the -j parameter to make and to nested projects like rocksdb. Finally, for CircleCI, cache the docker image.

Launch processes through one code path instead of two

The process management needs to be cleaned up. The ProcManager needs to use the Executor to launch processes instead of launching processes itself. This will also ensure the logging from child processes is captured consistently.
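
A sketch of the consolidation; the Executor interface exists in the codebase, but the method shown here is illustrative:

package proc

import "os/exec"

// Executor is the single abstraction for launching processes; castlectl
// already consumes it, and ProcManager should delegate to it too.
type Executor interface {
	StartExecuteCommand(name string, args ...string) (*exec.Cmd, error)
}

type ProcManager struct {
	executor Executor
}

// Start routes through the Executor so child-process logging is captured
// in one place instead of two.
func (p *ProcManager) Start(name string, args ...string) (*exec.Cmd, error) {
	return p.executor.StartExecuteCommand(name, args...)
}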

Disk support, tools needed and implication on security model

We would like castled to automatically manage disk devices, including partitioning them, formatting them, etc. For example:

castled --data-devices=/dev/sdb,/dev/sdc

castled should prepare these disks any way it wants, including formatting them for bluestore, etc.

There are two considerations worth highlighting for this approach:

  1. Disk operations almost always require root privs
  2. Disk tools like sgdisk are required for this to work

It's our goal to minimize dependencies on the host distro to support running in minimal containers or on hosts like CoreOS.

One possible approach to balance the two issues is to add a new verb to castled:

castled prepare --data-devices=/dev/sdb,/dev/sdc

This would require root privileges, and also requires some Linux tools like sgdisk to be available. castled can also support a flag that automatically prepares disks if they are not already prepared:

castled --data-devices=/dev/sdb,/dev/sdc --data-devices-prepare=auto

--data-devices-prepare should default to "auto" but do nothing (and not require root) if the disks are already prepared. It can also be set to "disabled" or "false".

This would enable the caller to decide whether to prepare the disks ahead of time, and run castled in a non-root account, or give castled root privs and let it auto-prepare.

With regard to the tools castled needs to partition devices, it seems wise to use Alpine Linux in our containers for tools like sgdisk and lsblk. While it's possible to write these in Go/cgo and remove the dependency (see bassam/rook-old@b07e06e for an example of a cgo lsblk), we should hold off until we understand all the dependencies we need. #73 is also related to this.

support for erasure coded storage pools

Storage pools are currently created with the default settings for replication. castlectl should also support operations on erasure code profiles, so that erasure coded storage pools can be created.

The listing of storage pool details should also include information about replication/erasure code profile.
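
For reference, the underlying Ceph operations that castlectl would wrap look like this (the profile name, k/m values, and pg counts are illustrative):

ceph osd erasure-code-profile set myprofile k=2 m=1
ceph osd pool create ecpool 128 128 erasure myprofile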

Bluestore osd configuration intermittently fails with (22) Invalid argument

Starting the bluestore osds in the demo vagrant environment, we intermittently see the osd fail with the following error:

Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.147976 c937080 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.177217 c937080 -1 bluestore(/var/lib/castled/osd0) _read_fsid unparsable uuid
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.189627 c937080 -1 bdev(/var/lib/castled/osd0/block) open open got: (22) Invalid argument
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.189686 c937080 -1 OSD::mkfs: ObjectStore::mkfs failed with error -22
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.189758 c937080 -1  ** ERROR: error creating empty object store in /var/lib/castled/osd0: (22) Invalid argument
Oct 24 16:13:31 castle01 rkt[1568]: 2016-10-24 16:13:31.191071 I | ERROR: failed to config osd on device sdd. failed to initialize OSD at /var/lib/castled/osd0: failed osd mkfs for OSD ID 0, UUID 27eaf968-5e1c-4ddd-a967-6e02291c3c4e, dataDir /var/lib/castled/osd0: failed to run osd: exit status 1

etcd quorum is broken on multi-node cluster

When an etcd cluster size greater than one is specified, the first machine will wait for more machines. However, later machines will come up and attempt to continue even though quorum has not formed.

rookd daemons such as MON do not have the right parent process

When running in rkt on CoreOS, this is the output of systemctl status castled:

 ├─machine.slice
           │ └─castled.service
           │   ├─2065 /usr/bin/castled daemon --type=mon -- --foreground --cluster=castlecluster --name=mon.mon0 --mon-data=/tmp/mon0/mon.mon0 --conf=/tmp/mon0/castlecluster.config --public-addr=172.20.20.10:6790
           │   └─2233 /usr/bin/castled

Note that process 2065 is not a child of process 2233.

Also, when process 2233 is killed, 2065 remains. This tells me something is wrong with how we start child processes.

If etcd state is lost, cluster may be unrecoverable

If the state stored in etcd is lost it may not be possible to recover the cluster.

  1. The etcd state should be backed up somewhere.
  2. In the case of an etcd state loss the cluster should be able to come up with empty etcd.

Number 2 is a problem because the discovery.etcd.io token names etcd members that are no longer valid. This could be fixed by an enhanced discovery service.
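
For point 1, the etcd v2 tooling already provides a one-shot backup command that could be run periodically (paths illustrative):

etcdctl backup --data-dir /var/lib/etcd --backup-dir /var/lib/etcd-backup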

Steps to reproduce:

  1. Issue a new discovery token.
  2. Boot a multi-node cluster onto ramfs and allow the cluster to come up.
  3. Reboot the cluster with the same discovery token from step 1.
  4. Nodes will hang because of the confused state of etcd.

Expected - either:

  • Etcd state would be restored from backup OR
  • Discovery information would be reset and the cluster would be brought up "cold".

Dan

Proper use of public/private ip addresses

The public and private IP addresses are passed as command line parameters, but only the private IP is currently used everywhere. We need to use the public IP wherever public networking is involved, such as for mons and osds.
