torus's Introduction

Torus

Torus Overview

Torus is an open source project for distributed storage coordinated through etcd.

Torus provides a resource pool and basic file primitives from a set of daemons running atop multiple nodes. These primitives are made consistent by being append-only and coordinated by etcd. From these primitives, a Torus server can support multiple types of volumes, the semantics of which can be broken into subprojects. It ships with a simple block-device volume plugin, but is extensible to more.

Quick-glance overview

Sharding is done via a consistent hash function, controlled in the simple case by a hash ring algorithm, but fully extensible to arbitrary maps, rack-awareness, and other nice features. The project name comes from this: a hash 'ring' plus a 'volume' is a torus.

Project Status

Development on Torus at CoreOS stopped as of Feb 2017. We started Torus as a prototype in June 2016 to build a storage system that could be easily operated on top of Kubernetes. We have proven out that model with this project. But, we didn't achieve the development velocity over the 8 months that we had hoped for when we started out, and as such we didn't achieve the depth of community engagement we had hoped for either.

If you have immediate storage needs, Kubernetes can plug in to dozens of other storage options that are external to Kubernetes, including AWS/Azure/Google/OpenStack/etc. block storage, Ceph, Gluster, and NFS.

We are also seeing the emergence of projects, like Rook, which create storage systems that run on top of Kubernetes as Operators. We expect to see more systems like this in the future, because Kubernetes is a perfect platform for running distributed storage systems.

If you are interested in continuing the project feel free to fork and continue; we can update this README if a particular fork gets solid traction.

For further questions email [email protected].

Trying out Torus

To get started quickly using Torus for the first time, start with the guide to running your first Torus cluster, learn more about setting up Torus on Kubernetes using FlexVolumes in contrib, or create a Torus cluster on bare metal.

Contributing to Torus

Torus is an open source project and contributors are welcome! Join us on IRC at #coreos on freenode.net, file an issue here on GitHub, check out bigger plans under the kind/design tag, pick up low-hanging-fruit bugs for issue ideas, and see the project layout for a guide to the sections that might interest you.

Licensing

Unless otherwise noted, all code in the Torus repository is licensed under the Apache 2.0 license. Some portions of the codebase are derived from other projects under different licenses; the appropriate information can be found in the header of those source files, as applicable.

torus's People

Contributors

barakmich, betawaffle, dghubble, discordianfish, egustafson, ericchiang, glevand, gyuho, heyitsanthony, ircody, jonboulle, jzelinskie, kopiczko, mawalu, mdlayher, mischief, muxator, nak3, philips, pmoust, rothgar, s-urbaniak, sgotti, shawnps, xiang90

torus's Issues

Recreate etcd lease

Suppose the laptop turns off. We come back and can't write keys because our lease has expired in etcd. We need to safely recover.
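
A minimal sketch of one possible recovery path, assuming the etcd clientv3 API and a hypothetical key name and TTL: grant a fresh lease, re-attach our keys to it, and keep it alive from then on.

// Sketch only: recover after our old etcd lease has expired by granting
// a new lease and re-attaching our keys to it. Key name and TTL are
// hypothetical, not the actual Torus values.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func recoverLease(cli *clientv3.Client, key, value string) (clientv3.LeaseID, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The old lease is already gone, so just grant a new one.
	lease, err := cli.Grant(ctx, 30) // 30-second TTL
	if err != nil {
		return 0, err
	}
	// Re-write the key under the new lease.
	if _, err := cli.Put(ctx, key, value, clientv3.WithLease(lease.ID)); err != nil {
		return 0, err
	}
	// Keep the new lease alive in the background from now on.
	if _, err := cli.KeepAlive(context.Background(), lease.ID); err != nil {
		return 0, err
	}
	return lease.ID, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	if _, err := recoverLease(cli, "/torus/peers/my-uuid", "alive"); err != nil {
		log.Fatal(err)
	}
}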

agromount: wrap in a rkt fly container

As I couldn't get the Kubernetes integration to work, I tried running the latest image in a container instead. It didn't work either, but it should be a start.

#!/bin/bash
# Wrapper for launching agromount via rkt-fly stage1 (adapted from the kubelet wrapper).

set -e

AGRO_ACI="${AGRO_ACI:-quay.io/coreos/agro}"

exec /usr/bin/rkt run \
  --volume run,kind=host,source=/run \
  --mount volume=run,target=/run \
  --volume volume-data,kind=host,source=/tmp/foo \
  --volume volume-plugin,kind=host,source=/tmp/foo  \
  --insecure-options=image \
  $RKT_OPTS \
  --stage1-path=/usr/share/rkt/stage1-fly.aci \
  docker://${AGRO_ACI}:latest --exec=/go/bin/agromount -- "$@" 2> /dev/null

core@localhost ~ $ cat /etc/rkt/auth.d/docker.json
{
    "rktKind": "dockerAuth",
    "rktVersion": "v1",
    "registries": ["quay.io"],
    "credentials": {
        "user": "coreos+agro",
        "password": "stuff"
    }
}

RFC: Remove INode Store Abstraction

Let's simplify:

Right now, there's a separate INode store. I did that early on so that I could think about other abstractions, but it's time to centralize the storage.

My proposal is this: INodes consist of blocks, with special BlockRefs.

What's special about them? Well, BlockRefs would have a type now. We can steal a few bits (I'm thinking 16 from the volume field, meaning 64k potential block types and 2^48, or about 2.8e14, different volumes -- which is still plenty). The INode abstraction translates *model.INodes into blocks and back again.

This simplifies all the steps greatly (no second store) and has another advantage: The ring becomes even more the source of truth. If you want to modify how INodes are replicated, that's a change to the ring.

This has an added advantage: new and different block types can now be known to the ring. For instance, it's probably unwise to keep Reed-Solomon blocks on all the same hosts as their original blocks (or probabilistically so). Or maybe you want to keep them on the same rack, or whatever.

So too with alternate replication strategies.

And this simplifies RPCs; it's just blocks.

This also obviates #63.
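
A rough sketch of the bit-stealing idea, using illustrative names rather than the real Torus types: the top 16 bits of the 64-bit volume field carry the block type, leaving 48 bits for the volume ID.

// Sketch of the proposed packing: steal the top 16 bits of the 64-bit
// volume field for a block type, leaving 48 bits (2^48 volumes).
// Names here are illustrative, not the actual Torus types.
package main

import "fmt"

type BlockType uint16

const (
	TypeBlock BlockType = iota // ordinary data block
	TypeINode                  // serialized INode stored as blocks
)

// packVolume combines a 16-bit block type with a 48-bit volume ID.
func packVolume(t BlockType, volume uint64) uint64 {
	return uint64(t)<<48 | (volume & 0xFFFFFFFFFFFF)
}

// unpackVolume splits the packed field back into type and volume ID.
func unpackVolume(v uint64) (BlockType, uint64) {
	return BlockType(v >> 48), v & 0xFFFFFFFFFFFF
}

func main() {
	packed := packVolume(TypeINode, 42)
	t, vol := unpackVolume(packed)
	fmt.Println(t, vol) // 1 42
}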

cmux for now, proper http1.1/http2 later

Looks like this discussion is still happening: https://github.com/grpc/grpc-common/issues/284

(Tick marks to avoid linking the issue)

https://github.com/soheilhy/cmux is a reasonable workaround that exists today. Let's do it -- one port for HTTP API and internal gRPC. And when proper single-port support comes through grpc, then everyone wins.
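
A sketch of the workaround (the port and handlers are placeholders): cmux matches gRPC traffic by its HTTP/2 content-type header and hands everything else to the HTTP API, all on one listener.

// Sketch only: one port shared by gRPC and the HTTP API via cmux.
package main

import (
	"log"
	"net"
	"net/http"

	"github.com/soheilhy/cmux"
	"google.golang.org/grpc"
)

func main() {
	l, err := net.Listen("tcp", ":40000")
	if err != nil {
		log.Fatal(err)
	}
	m := cmux.New(l)

	// gRPC connections are matched on the HTTP/2 content-type header.
	grpcL := m.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))
	// Everything else falls through to the HTTP API.
	httpL := m.Match(cmux.Any())

	grpcServer := grpc.NewServer()
	httpServer := &http.Server{Handler: http.DefaultServeMux}

	go grpcServer.Serve(grpcL)
	go httpServer.Serve(httpL)

	log.Fatal(m.Serve())
}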

Client Library & CLI

Start /client and an associated /cmd/agroctl -- the design of which is a little bit clever:

Start /client as another concrete implementation of the agro.Server interface. The difference being that the inode storage, the file abstraction, and so on, are fairly simple stubs into a new set of things: RemoteINodeStorage, RemoteBlockStorage, and RemoteMetadata. These basically make HTTP calls for all their implementations, calling the internal/http endpoint as a proxy and off we go.

The cool part is then agroctl (and the meta-client calls; I would guess things like ls) can self-host. The long-future goal being that it could talk directly to etcd for metadata if it needed, or could do local caching, or can do a lot of things -- which will make the FUSE client quite easy to implement well, without needing to proxy everything. In effect, a long-running FUSE daemon could itself be another node that doesn't register as storage, but speaks the native node-to-node interface.

Meanwhile, for other scenarios, the proxy works great.
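
A toy sketch of the remote-stub idea; the endpoint path and types are hypothetical, not the real Torus interfaces, but the shape is the same: implement the storage interface by calling the server's HTTP proxy endpoint.

// Sketch only: a block store whose reads proxy through a peer's HTTP
// endpoint. The /block/{ref} path is made up for illustration.
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

type RemoteBlockStorage struct {
	baseURL string // e.g. "http://peer:4321"
	client  *http.Client
}

// GetBlock fetches one block's bytes from the remote peer.
func (r *RemoteBlockStorage) GetBlock(ref string) ([]byte, error) {
	resp, err := r.client.Get(fmt.Sprintf("%s/block/%s", r.baseURL, ref))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("remote block %s: %s", ref, resp.Status)
	}
	return ioutil.ReadAll(resp.Body)
}

func main() {
	store := &RemoteBlockStorage{baseURL: "http://127.0.0.1:4321", client: http.DefaultClient}
	if _, err := store.GetBlock("example-ref"); err != nil {
		fmt.Println("expected to fail without a running server:", err)
	}
}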

build failure

$ go version
go version go1.5.1 darwin/amd64
$ make
go build ./cmd/agro
# github.com/barakmich/agro/models
models/rpc.pb.go:287: cannot use _AgroStorage_Block_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
models/rpc.pb.go:291: cannot use _AgroStorage_INode_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
models/rpc.pb.go:295: cannot use _AgroStorage_PutBlock_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
models/rpc.pb.go:299: cannot use _AgroStorage_PutINode_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
# github.com/barakmich/agro/internal/etcdproto/etcdserverpb
internal/etcdproto/etcdserverpb/rpc.pb.go:914: cannot use _KV_Range_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
internal/etcdproto/etcdserverpb/rpc.pb.go:918: cannot use _KV_Put_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
internal/etcdproto/etcdserverpb/rpc.pb.go:922: cannot use _KV_DeleteRange_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
internal/etcdproto/etcdserverpb/rpc.pb.go:926: cannot use _KV_Txn_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
internal/etcdproto/etcdserverpb/rpc.pb.go:930: cannot use _KV_Compact_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
make: *** [build] Error 2
$

Rebalance Optimization

As of PR #65, rebalancing is done by iterating over the store and copying over desired blocks/inodes. This is optimized for the case where you're deleting more than you're keeping. I would venture that most rebalances will only delete a small number of values, and would thus benefit from a strategy where you iterate over the blocks/inodes and mark what you're going to delete, rather than copying what you'll keep.
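
A toy sketch of the suggested strategy, using a stand-in in-memory store rather than the real interfaces: iterate once, collect the refs this node no longer owns under the new ring, then delete only those.

// Sketch only: "mark then delete" rebalancing over a toy in-memory store.
package main

import "fmt"

type Ref string

// memStore is a stand-in for the block/inode store.
type memStore map[Ref][]byte

func (m memStore) List() []Ref {
	var refs []Ref
	for r := range m {
		refs = append(refs, r)
	}
	return refs
}

func (m memStore) Delete(r Ref) { delete(m, r) }

// rebalanceByDeletion marks refs this node no longer owns under the new
// ring and deletes only those, instead of copying everything it keeps.
func rebalanceByDeletion(s memStore, ownedLocally func(Ref) bool) {
	var toDelete []Ref
	for _, ref := range s.List() {
		if !ownedLocally(ref) {
			toDelete = append(toDelete, ref)
		}
	}
	for _, ref := range toDelete {
		s.Delete(ref)
	}
}

func main() {
	s := memStore{"a": nil, "b": nil, "c": nil}
	rebalanceByDeletion(s, func(r Ref) bool { return r != "b" })
	fmt.Println(len(s)) // 2
}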

Peer Registration

Peers need to register with the MDS (extend the MDS interface to do this), but that's all. It's up to the client library to create (and commit) a (future) ring from a list of potential peers.

Registration should include a bunch of metadata about the host. How many blocks can it hold? What IP/port is it listening on? Does it have an optional name? What's its permanent UUID?

Along with this is a heartbeat. Heartbeat(UUID, status, timeout) gets saved to the MDS every so often.

(This works with #24 in that a client won't register or heartbeat!)
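
A sketch of the registration metadata and heartbeat described above; the struct fields and MDS methods are illustrative, not the actual interface.

// Sketch only: illustrative registration metadata and heartbeat calls.
package main

import (
	"fmt"
	"time"
)

// PeerInfo is the kind of metadata registration should carry.
type PeerInfo struct {
	UUID        string // permanent UUID
	Address     string // IP:port the peer listens on
	Name        string // optional human-readable name
	TotalBlocks uint64 // how many blocks this peer can hold
}

// fakeMDS stands in for the metadata service; a real one would write to etcd.
type fakeMDS struct{}

func (fakeMDS) Register(p PeerInfo) error { fmt.Println("registered", p.UUID); return nil }

func (fakeMDS) Heartbeat(uuid, status string, timeout time.Duration) error {
	fmt.Println("heartbeat", uuid, status, timeout)
	return nil
}

func main() {
	mds := fakeMDS{}
	p := PeerInfo{UUID: "example-uuid", Address: "10.0.0.1:40000", Name: "node1", TotalBlocks: 1 << 20}
	mds.Register(p)

	// Heartbeat periodically; a missed timeout means the peer is considered down.
	for i := 0; i < 3; i++ {
		mds.Heartbeat(p.UUID, "OK", 15*time.Second)
		time.Sleep(100 * time.Millisecond) // demo interval; a real loop would track the timeout
	}
}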

package: `mv types proto`

I really hate the name "types" for packages. It pretty much implies absolutely nothing and it isn't clear unless you look at the package source that it's for protos. We should call this package proto so that it's clear when you're using the types that they're protobufs.

POSIX Interface (via FUSE)

We should provide a POSIX-compliant interface. The standard way to do that is with FUSE, and we can use Bazil's Go FUSE library to build it.

We may or may not want to have it as a separate command-line tool.
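
A minimal sketch using Bazil's Go FUSE library (bazil.org/fuse): mount a filesystem whose root is a read-only directory. A real implementation would translate FUSE operations into Torus file primitives; the mountpoint and types here are placeholders.

// Sketch only: the smallest possible bazil.org/fuse mount. Everything
// Torus-specific is left as a placeholder.
package main

import (
	"context"
	"log"
	"os"

	"bazil.org/fuse"
	"bazil.org/fuse/fs"
)

// torusFS would wrap a Torus volume; here it only serves an empty root.
type torusFS struct{}

func (torusFS) Root() (fs.Node, error) { return rootDir{}, nil }

type rootDir struct{}

func (rootDir) Attr(ctx context.Context, a *fuse.Attr) error {
	a.Inode = 1
	a.Mode = os.ModeDir | 0555
	return nil
}

func main() {
	conn, err := fuse.Mount("/mnt/torus", fuse.FSName("torus"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Serve blocks until the filesystem is unmounted.
	if err := fs.Serve(conn, torusFS{}); err != nil {
		log.Fatal(err)
	}
}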

Drain/Fail

Add specific removal types to agroctl -- things which will block until the node is removed (safe or hard, in either case).

Adding a mode for really ramping up the rebalance traffic if you're draining also falls under this bug. Right now it's rate limited, but I think the better option would be a fast divestment of data.

Persistence

We have MFiles; let's use them to store blocks. This requires keeping a block map (easy enough).

We also have need of an INode store. Longer-term, this could be the same block store too -- if we split INodes (serialized protos) into blocks when written. But that's a deliberate abstraction. For now, a simple, separate BoltDB will do the trick. It does mean we'll be using more space than we advertise, but running conservatively (i.e., --size=something-less-than-half-a-disk) will be fine through a few phases.

To persist the temporary metadata? A flat file for that would be fine.
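
A sketch of the interim BoltDB plan (the bucket and key names are made up): one bucket of serialized INodes keyed by ref, written and read back inside Bolt transactions.

// Sketch only: a separate BoltDB file holding serialized INodes.
package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("inodes.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a serialized INode proto under its ref.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("inodes"))
		if err != nil {
			return err
		}
		return b.Put([]byte("volume1/inode42"), []byte("serialized-proto-bytes"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back.
	err = db.View(func(tx *bolt.Tx) error {
		data := tx.Bucket([]byte("inodes")).Get([]byte("volume1/inode42"))
		log.Printf("read %d bytes", len(data))
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}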

Proposal: Lose the VolumeID in the INodeRef

As currently implemented, there's no need for the VolumeID in the INodeRef; it simply takes up space because right now there's only a single source of INodes, with no duplicates.

Now, there are two options:

  1. Lose the VolumeID and have a single source of INodes:
    • Pros:
      • Smaller Refs
      • Smaller MDS storage (no need for an inode counter for every volume)
    • Cons:
      • Smaller representation
        • With 8k blocks, we're still representing up to 10^18 Yottabytes.
      • Potential etcd contention with a lot of write traffic.
        • Could be almost nullified by adding an "INC" operation in etcd, which atomically adds one to the value and returns that count (a sketch of emulating this follows after the list).
  2. Keep the VolumeID and have an inode chain per volume:
    • Pros:
      • etcd contention only per-volume -- if we're mounting a lot of these, with an average ratio of almost 1:1 per machine that has it mounted, that means very little contention.
      • Friggin huge representation
        • 10^18 Yottabytes -- PER VOLUME. For an insane 10^37 Yottabytes of possible representation space. That's 10 tera-yottabytes.
    • Cons:
      • More MDS storage (O(volumes * 4 bytes), so tiny)
      • Bigger refs (could be more smartly encoded)
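
etcd doesn't expose an INC primitive today, but it can be emulated with a compare-and-swap transaction loop. A hedged sketch against clientv3, with a hypothetical counter key:

// Sketch only: emulate an atomic INC with an etcd v3 CAS transaction loop.
package main

import (
	"context"
	"fmt"
	"log"
	"strconv"

	"github.com/coreos/etcd/clientv3"
)

// atomicInc bumps the counter stored at key by one and returns the new value.
func atomicInc(cli *clientv3.Client, key string) (int64, error) {
	for {
		resp, err := cli.Get(context.Background(), key)
		if err != nil {
			return 0, err
		}
		var cur, rev int64
		if len(resp.Kvs) > 0 {
			cur, _ = strconv.ParseInt(string(resp.Kvs[0].Value), 10, 64)
			rev = resp.Kvs[0].ModRevision
		}
		next := cur + 1
		// Only write if nobody else has modified the key since we read it.
		txnResp, err := cli.Txn(context.Background()).
			If(clientv3.Compare(clientv3.ModRevision(key), "=", rev)).
			Then(clientv3.OpPut(key, strconv.FormatInt(next, 10))).
			Commit()
		if err != nil {
			return 0, err
		}
		if txnResp.Succeeded {
			return next, nil
		}
		// Lost the race; retry.
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	n, err := atomicInc(cli, "/torus/volumes/vol1/inode")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("next inode:", n)
}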

etcdv3 MDS

We could waste some time building a scaffolding one (via the HTTP API or something), but why not go straight to the real plan.

The good news is the v3 API is hella cleaner and more powerful than it used to be. Build proto. Send proto over gRPC. Done.

Logging and Sane Errors

Go through and vet the logging/errors. Add more logs using a favorite logging package (it need not be capnslog, but I'm game. Just needs to be BSD/Apache2-licensed). Add errors.go files at the appropriate layers. Use them.

entrypoint.sh can only use a single etcd host

With the current structure of entrypoint.sh, only a single etcd host can be added. Potentially this could be overridden by supplying a string of hosts as ETCD_HOST, but in that case it will also require setting ETCD_PORT="" to avoid the automatic assignment of that variable.

Command-line Tool

We should have a command-line tool (agro), which has subcommands allowing a user to add a new file and get an existing file.

Streaming Writes

I should be able to stream content to a file in multiple chunks, without each chunk needing to be flushed to disk. The chunks should not be visible to other readers until I choose to explicitly commit.

tl;dr: Writes should be independent from commits in the low-level API.
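
A toy illustration of the split (not the real Torus API): writes land in a staging buffer and only become visible, in one step, when the caller commits.

// Sketch only: writes are staged; Commit publishes them all at once.
package main

import "fmt"

// stagedFile stages writes privately and publishes them on Commit.
type stagedFile struct {
	staged    []byte
	published []byte
}

// WriteAt stages data at the given offset; readers of "published" see nothing yet.
func (f *stagedFile) WriteAt(p []byte, off int64) (int, error) {
	if need := int(off) + len(p); need > len(f.staged) {
		f.staged = append(f.staged, make([]byte, need-len(f.staged))...)
	}
	copy(f.staged[off:], p)
	return len(p), nil
}

// Commit publishes everything staged so far in one step.
func (f *stagedFile) Commit() error {
	f.published = append([]byte(nil), f.staged...)
	return nil
}

func main() {
	f := &stagedFile{}
	f.WriteAt([]byte("hello"), 0)
	fmt.Printf("before commit: %q\n", f.published) // ""
	f.Commit()
	fmt.Printf("after commit:  %q\n", f.published) // "hello"
}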

documentation: how to use torus as a library

Related to #147, I think we need to document how people can build applications on top of torus easily. I know that @barakmich has been working on the libraries and factoring things out so that building the next app on top of torus is easy to do, but I think we should do this pre-launch. At least paint the rough picture.

can't go get a private repo

Some of us don't have go get set up to use SSH. Instead of go get, maybe:

git clone git@github.com:coreos/agro.git $GOPATH/src/github.com/coreos/agro
cd $GOPATH/src/github.com/coreos/agro
go get ./...

Garbage Collection

Data which has been replaced should be periodically removed to free up space.

make fails

$ make
go build ./cmd/agro
go build ./cmd/agroctl
# github.com/coreos/agro/cmd/agroctl
cmd/agroctl/agroctl.go:29: undefined: volumeCommand
Makefile:2: recipe for target 'build' failed
make: *** [build] Error 2

agromount: error while using flexvolume

May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: panic: runtime error: slice bounds out of range
May 13 13:39:01 localhost agro[20293]: goroutine 70 [running]:
May 13 13:39:01 localhost agro[20293]: panic(0xc04cc0, 0xc82000c030)
May 13 13:39:01 localhost agro[20293]:         /usr/lib/go/src/runtime/panic.go:481 +0x3e6
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro/distributor.(*Distributor).GetBlock(0xc82006bc20, 0x7f0c4fd90510, 0xc820280b10, 0x3, 0x4, 0x1, 0x0, 0x0, 0x0, 0x0, .
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/distributor/storage.go:37 +0xec3
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro/blockset.(*baseBlockset).GetBlock(0xc820283a40, 0x7f0c4fd90510, 0xc820280b10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/blockset/base.go:55 +0x41a
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro/blockset.(*crcBlockset).GetBlock(0xc82027b1d0, 0x7f0c4fd90510, 0xc820280b10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/blockset/crc.go:58 +0x25a
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro.(*File).openBlock(0xc82022c960, 0x0, 0x0, 0x0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/file.go:151 +0x725
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro.(*File).WriteAt(0xc82022c960, 0xc8203ca01c, 0x1000, 0xfffe4, 0x0, 0x0, 0x0, 0x0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/file.go:296 +0x1e38
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro/internal/nbd.(*NBD).handle(0xc820283a80, 0xc8202962a0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/internal/nbd/nbd.go:240 +0x95e
May 13 13:39:01 localhost agro[20293]: created by github.com/coreos/agro/internal/nbd.(*NBD).Serve
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/internal/nbd/nbd.go:176 +0x2bd

Count All The Things! (Prometheus)

Add prometheus in a lot of places. Count block reads, count files opened, count files that are write-dirty, count metadata interactions, etc, etc....
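
A sketch with the Prometheus Go client; the metric names and listen port are made up.

// Sketch only: a couple of Prometheus counters plus a /metrics endpoint.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	blockReads = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "torus_block_reads_total",
		Help: "Number of blocks read from the store.",
	})
	filesOpened = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "torus_files_opened_total",
		Help: "Number of files opened.",
	})
)

func init() {
	prometheus.MustRegister(blockReads, filesOpened)
}

func main() {
	// Increment wherever a block is read or a file is opened.
	blockReads.Inc()
	filesOpened.Inc()

	// Expose /metrics for scraping.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}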

metadata/etcd: Drop and Reconnect semantics

We're happily passing contexts now, and timeouts can happen. In the case of timeout, a drop-and-reconnect is warranted. However, we currently have just one connection. We need to fix this.

Spitballing names

Agro is the current name; it's the name of the horse from Shadow of the Colossus.

Following the Colossus idea, but staying relevant to computer science: Bletchley Park was where Colossus, the computer, was created by codebreakers (including Alan Turing) during the war. My proposal is to call the project "bletchley".

INode replication vs Block replication.

There is a parameter called INodeReplication, which defines how many replicas each INode is expected to have.

What is the plan for handling block replication? For N copies, we can simply reuse the INodeReplication parameter. But when doing RS (Reed-Solomon) encoding for blocks, a replication factor does not make a lot of sense.

Another question: are we trying to replicate blocks on the same nodes as their INodes?

hitting etcd request size limit

error creating volume agroblock2: rpc error: code = 3 desc = "etcdserver: request is too large"

when trying to create a 512M block volume.

The marshalled inode size in createBlockVol ends up being ~1.75M, and etcd recently set a limit of 1.5M (etcd-io/etcd@3556722).

Lingo

Hash out the lingo on the wiki

documentation: document partition and other "recoverable" failures

The very first questions we are going to get are about how this thing fails in various scenarios:

  • What happens in single machine failure?
  • What happens during network partition between peers?
  • What happens during network partition of etcd?

And in each of these scenarios we should provide information about:

  • Increased latency
  • Read/write availability
  • Estimated time to recovery

documentation: simple agrocat command line tool

I think part of the experiment with this project is to see what people build outside of block devices. We should create a simple tool that lets people cat a stream into, and cat a stream out of, an agro volume.

Structured Logging

We should have good structured logging from the start. Looking for suggestions.
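
One possible direction, purely as an example of the shape we want (logrus-style fields attached to every message):

// Example only: structured fields on a log line, here with logrus.
package main

import log "github.com/sirupsen/logrus"

func main() {
	log.WithFields(log.Fields{
		"volume": "agroblock2",
		"peer":   "c88654ce-190f-11e6-a8bf-faef23e236fc",
		"op":     "WriteAll",
	}).Error("block cannot be retrieved")
}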
