torus's Introduction

Torus

Torus Overview

Torus is an open source project for distributed storage coordinated through etcd.

Torus provides a resource pool and basic file primitives from a set of daemons running atop multiple nodes. These primitives are made consistent by being append-only and coordinated by etcd. From these primitives, a Torus server can support multiple types of volumes, the semantics of which can be broken into subprojects. It ships with a simple block-device volume plugin, but is extensible to more.

Quick-glance overview

Sharding is done via a consistent hash function, controlled in the simple case by a hash ring algorithm, but fully extensible to arbitrary maps, rack-awareness, and other nice features. The project name comes from this: a hash 'ring' plus a 'volume' is a torus.

Project Status

Development on Torus at CoreOS stopped as of Feb 2017. We started Torus as a prototype in June 2016 to build a storage system that could be easily operated on top of Kubernetes. We have proven out that model with this project. But, we didn't achieve the development velocity over the 8 months that we had hoped for when we started out, and as such we didn't achieve the depth of community engagement we had hoped for either.

If you have immediate storage needs, Kubernetes can plug in to dozens of other storage options that are external to Kubernetes, including AWS/Azure/Google/OpenStack/etc. block storage, Ceph, Gluster, and NFS.

We are also seeing the emergence of projects, like Rook, which create storage systems that run on top of Kubernetes as Operators. We expect to see more systems like this in the future, because Kubernetes is a perfect platform for running distributed storage systems.

If you are interested in continuing the project feel free to fork and continue; we can update this README if a particular fork gets solid traction.

For further questions email [email protected].

Trying out Torus

To get started quickly using Torus for the first time, start with the guide to running your first Torus cluster, learn more about setting up Torus on Kubernetes using FlexVolumes in contrib, or create a Torus cluster on bare metal.

Contributing to Torus

Torus is an open source project and contributors are welcome! Join us on IRC at #coreos on freenode.net, file an issue here on GitHub, check out bigger plans under the kind/design tag, pick up low-hanging-fruit bugs for issue ideas, and see the project layout for a guide to the sections that might interest you.

Licensing

Unless otherwise noted, all code in the Torus repository is licensed under the Apache 2.0 license. Some portions of the codebase are derived from other projects under different licenses; the appropriate information can be found in the header of those source files, as applicable.

torus's People

Contributors

barakmich, betawaffle, dghubble, discordianfish, egustafson, ericchiang, glevand, gyuho, heyitsanthony, ircody, jonboulle, jzelinskie, kopiczko, mawalu, mdlayher, mischief, muxator, nak3, philips, pmoust, rothgar, s-urbaniak, sgotti, shawnps, xiang90

torus's Issues

Recreate etcd lease

Suppose the laptop turns off. We come back and can't write keys because our lease has expired in etcd. We need to safely recover.
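
A minimal sketch of one possible recovery path, assuming the etcd clientv3 API and a hypothetical key name and TTL: grant a fresh lease, re-attach our keys to it, and keep it alive from then on.

// Sketch only: recover after our old etcd lease has expired by granting
// a new lease and re-attaching our keys to it. Key name and TTL are
// hypothetical, not the actual Torus values.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func recoverLease(cli *clientv3.Client, key, value string) (clientv3.LeaseID, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The old lease is already gone, so just grant a new one.
	lease, err := cli.Grant(ctx, 30) // 30-second TTL
	if err != nil {
		return 0, err
	}
	// Re-write the key under the new lease.
	if _, err := cli.Put(ctx, key, value, clientv3.WithLease(lease.ID)); err != nil {
		return 0, err
	}
	// Keep the new lease alive in the background from now on.
	if _, err := cli.KeepAlive(context.Background(), lease.ID); err != nil {
		return 0, err
	}
	return lease.ID, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	if _, err := recoverLease(cli, "/torus/peers/my-uuid", "alive"); err != nil {
		log.Fatal(err)
	}
}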

agromount: wrap in a rkt fly container

As I couldn't get the Kubernetes integration to work, I tried running the latest image in a container instead. It didn't work either, but it should be a start.

#!/bin/bash
# Wrapper for launching agromount via rkt-fly stage1 (adapted from the kubelet wrapper).

set -e

AGRO_ACI="${AGRO_ACI:-quay.io/coreos/agro}"

exec /usr/bin/rkt run \
  --volume run,kind=host,source=/run \
  --mount volume=run,target=/run \
  --volume volume-data,kind=host,source=/tmp/foo \
  --volume volume-plugin,kind=host,source=/tmp/foo  \
  --insecure-options=image \
  $RKT_OPTS \
  --stage1-path=/usr/share/rkt/stage1-fly.aci \
  docker://${AGRO_ACI}:latest --exec=/go/bin/agromount -- "$@" 2> /dev/null

core@localhost ~ $ cat /etc/rkt/auth.d/docker.json
{
    "rktKind": "dockerAuth",
    "rktVersion": "v1",
    "registries": ["quay.io"],
    "credentials": {
        "user": "coreos+agro",
        "password": "stuff"
    }
}

RFC: Remove INode Store Abstraction

Let's simplify:

Right now, there's a separate INode store. I did that early on so that I could think about other abstractions, but it's time to centralize the storage.

My proposal is this: INodes consist of blocks, with special BlockRefs.

What's special about them? Well, BlockRefs would have a type now. We can steal a few bits (I'm thinking 16 from the volume field, meaning 64k potential block types and 2^48, or about 2.8e14, different volumes -- which is still plenty). The INode abstraction translates *model.INodes into blocks and back again.

This simplifies all the steps greatly (no second store) and has another advantage: The ring becomes even more the source of truth. If you want to modify how INodes are replicated, that's a change to the ring.

This has an added advantage: new and different block types can now be known to the ring. For instance, it's probably unwise to keep Reed-Solomon blocks on all the same hosts as their original blocks (or probabilistically so). Or maybe you want to keep them on the same rack, or whatever.

So too with alternate replication strategies.

And this simplifies RPCs; it's just blocks.

This also obviates #63.
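
A rough sketch of the bit-stealing idea, using illustrative names rather than the real Torus types: the top 16 bits of the 64-bit volume field carry the block type, leaving 48 bits for the volume ID.

// Sketch of the proposed packing: steal the top 16 bits of the 64-bit
// volume field for a block type, leaving 48 bits (2^48 volumes).
// Names here are illustrative, not the actual Torus types.
package main

import "fmt"

type BlockType uint16

const (
	TypeBlock BlockType = iota // ordinary data block
	TypeINode                  // serialized INode stored as blocks
)

// packVolume combines a 16-bit block type with a 48-bit volume ID.
func packVolume(t BlockType, volume uint64) uint64 {
	return uint64(t)<<48 | (volume & 0xFFFFFFFFFFFF)
}

// unpackVolume splits the packed field back into type and volume ID.
func unpackVolume(v uint64) (BlockType, uint64) {
	return BlockType(v >> 48), v & 0xFFFFFFFFFFFF
}

func main() {
	packed := packVolume(TypeINode, 42)
	t, vol := unpackVolume(packed)
	fmt.Println(t, vol) // 1 42
}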

cmux for now, proper http1.1/http2 later

Looks like this discussion is still happening: https://github.com/grpc/grpc-common/issues/284

(Tick marks to avoid linking the issue)

https://github.com/soheilhy/cmux is a reasonable workaround that exists today. Let's do it -- one port for HTTP API and internal gRPC. And when proper single-port support comes through grpc, then everyone wins.
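
A sketch of the workaround (the port and handlers are placeholders): cmux matches gRPC traffic by its HTTP/2 content-type header and hands everything else to the HTTP API, all on one listener.

// Sketch only: one port shared by gRPC and the HTTP API via cmux.
package main

import (
	"log"
	"net"
	"net/http"

	"github.com/soheilhy/cmux"
	"google.golang.org/grpc"
)

func main() {
	l, err := net.Listen("tcp", ":40000")
	if err != nil {
		log.Fatal(err)
	}
	m := cmux.New(l)

	// gRPC connections are matched on the HTTP/2 content-type header.
	grpcL := m.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))
	// Everything else falls through to the HTTP API.
	httpL := m.Match(cmux.Any())

	grpcServer := grpc.NewServer()
	httpServer := &http.Server{Handler: http.DefaultServeMux}

	go grpcServer.Serve(grpcL)
	go httpServer.Serve(httpL)

	log.Fatal(m.Serve())
}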

Client Library & CLI

Start /client and an associated /cmd/agroctl -- the design of which is a little bit clever:

Start /client as another concrete implementation of the agro.Server interface. The difference being that the inode storage, the file abstraction, and so on, are fairly simple stubs into a new set of things: RemoteINodeStorage, RemoteBlockStorage, and RemoteMetadata. These basically make HTTP calls for all their implementations, calling the internal/http endpoint as a proxy and off we go.

The cool part is then agroctl (and the meta-client calls; I would guess things like ls) can self-host. The long-future goal being that it could talk directly to etcd for metadata if it needed, or could do local caching, or can do a lot of things -- which will make the FUSE client quite easy to implement well, without needing to proxy everything. In effect, a long-running FUSE daemon could itself be another node that doesn't register as storage, but speaks the native node-to-node interface.

Meanwhile, for other scenarios, the proxy works great.
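
A toy sketch of the remote-stub idea; the endpoint path and types are hypothetical, not the real Torus interfaces, but the shape is the same: implement the storage interface by calling the server's HTTP proxy endpoint.

// Sketch only: a block store whose reads proxy through a peer's HTTP
// endpoint. The /block/{ref} path is made up for illustration.
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

type RemoteBlockStorage struct {
	baseURL string // e.g. "http://peer:4321"
	client  *http.Client
}

// GetBlock fetches one block's bytes from the remote peer.
func (r *RemoteBlockStorage) GetBlock(ref string) ([]byte, error) {
	resp, err := r.client.Get(fmt.Sprintf("%s/block/%s", r.baseURL, ref))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("remote block %s: %s", ref, resp.Status)
	}
	return ioutil.ReadAll(resp.Body)
}

func main() {
	store := &RemoteBlockStorage{baseURL: "http://127.0.0.1:4321", client: http.DefaultClient}
	if _, err := store.GetBlock("example-ref"); err != nil {
		fmt.Println("expected to fail without a running server:", err)
	}
}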

build failure

$ go version
go version go1.5.1 darwin/amd64
$ make
go build ./cmd/agro
# github.com/barakmich/agro/models
models/rpc.pb.go:287: cannot use _AgroStorage_Block_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
models/rpc.pb.go:291: cannot use _AgroStorage_INode_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
models/rpc.pb.go:295: cannot use _AgroStorage_PutBlock_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
models/rpc.pb.go:299: cannot use _AgroStorage_PutINode_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
# github.com/barakmich/agro/internal/etcdproto/etcdserverpb
internal/etcdproto/etcdserverpb/rpc.pb.go:914: cannot use _KV_Range_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
internal/etcdproto/etcdserverpb/rpc.pb.go:918: cannot use _KV_Put_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
internal/etcdproto/etcdserverpb/rpc.pb.go:922: cannot use _KV_DeleteRange_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
internal/etcdproto/etcdserverpb/rpc.pb.go:926: cannot use _KV_Txn_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
internal/etcdproto/etcdserverpb/rpc.pb.go:930: cannot use _KV_Compact_Handler (type func(interface {}, context.Context, func(interface {}) error) (interface {}, error)) as type grpc.methodHandler in field value
make: *** [build] Error 2
$

Rebalance Optimization

As of PR #65, rebalancing is done by iterating over the store and copying over desired blocks/inodes. This is optimized for the case where you're deleting more than you're keeping. I would venture that most rebalances will only delete a small number of values, and would thus benefit from a strategy where you iterate over the blocks/inodes and mark what you're going to delete, rather than copying what you'll keep.
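
A toy sketch of the suggested strategy, using a stand-in in-memory store rather than the real interfaces: iterate once, collect the refs this node no longer owns under the new ring, then delete only those.

// Sketch only: "mark then delete" rebalancing over a toy in-memory store.
package main

import "fmt"

type Ref string

// memStore is a stand-in for the block/inode store.
type memStore map[Ref][]byte

func (m memStore) List() []Ref {
	var refs []Ref
	for r := range m {
		refs = append(refs, r)
	}
	return refs
}

func (m memStore) Delete(r Ref) { delete(m, r) }

// rebalanceByDeletion marks refs this node no longer owns under the new
// ring and deletes only those, instead of copying everything it keeps.
func rebalanceByDeletion(s memStore, ownedLocally func(Ref) bool) {
	var toDelete []Ref
	for _, ref := range s.List() {
		if !ownedLocally(ref) {
			toDelete = append(toDelete, ref)
		}
	}
	for _, ref := range toDelete {
		s.Delete(ref)
	}
}

func main() {
	s := memStore{"a": nil, "b": nil, "c": nil}
	rebalanceByDeletion(s, func(r Ref) bool { return r != "b" })
	fmt.Println(len(s)) // 2
}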

Peer Registration

Peers need to register with the MDS (extend the MDS interface to do this), but that's all. It's up to the client library to create (and commit) a (future) ring from a list of potential peers.

Registration should include a bunch of metadata about the host. How many blocks can it hold? What IP/port is it listening on? Does it have an optional name? What's its permanent UUID?

Along with this is a heartbeat. Heartbeat(UUID, status, timeout) gets saved to the MDS every so often.

(This works with #24 in that a client won't register or heartbeat!)
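
A sketch of the registration metadata and heartbeat described above; the struct fields and MDS methods are illustrative, not the actual interface.

// Sketch only: illustrative registration metadata and heartbeat calls.
package main

import (
	"fmt"
	"time"
)

// PeerInfo is the kind of metadata registration should carry.
type PeerInfo struct {
	UUID        string // permanent UUID
	Address     string // IP:port the peer listens on
	Name        string // optional human-readable name
	TotalBlocks uint64 // how many blocks this peer can hold
}

// fakeMDS stands in for the metadata service; a real one would write to etcd.
type fakeMDS struct{}

func (fakeMDS) Register(p PeerInfo) error { fmt.Println("registered", p.UUID); return nil }

func (fakeMDS) Heartbeat(uuid, status string, timeout time.Duration) error {
	fmt.Println("heartbeat", uuid, status, timeout)
	return nil
}

func main() {
	mds := fakeMDS{}
	p := PeerInfo{UUID: "example-uuid", Address: "10.0.0.1:40000", Name: "node1", TotalBlocks: 1 << 20}
	mds.Register(p)

	// Heartbeat periodically; a missed timeout means the peer is considered down.
	for i := 0; i < 3; i++ {
		mds.Heartbeat(p.UUID, "OK", 15*time.Second)
		time.Sleep(100 * time.Millisecond) // demo interval; a real loop would track the timeout
	}
}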

package: `mv types proto`

I really hate the name "types" for packages. It pretty much implies absolutely nothing and it isn't clear unless you look at the package source that it's for protos. We should call this package proto so that it's clear when you're using the types that they're protobufs.

POSIX Interface (via FUSE)

We should provide a POSIX-compliant interface. The standard way to do that is with FUSE, and we can use Bazil's Go FUSE library to build it.

We may or may not want to have it as a separate command-line tool.
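
A minimal sketch using Bazil's Go FUSE library (bazil.org/fuse): mount a filesystem whose root is a read-only directory. A real implementation would translate FUSE operations into Torus file primitives; the mountpoint and types here are placeholders.

// Sketch only: the smallest possible bazil.org/fuse mount. Everything
// Torus-specific is left as a placeholder.
package main

import (
	"context"
	"log"
	"os"

	"bazil.org/fuse"
	"bazil.org/fuse/fs"
)

// torusFS would wrap a Torus volume; here it only serves an empty root.
type torusFS struct{}

func (torusFS) Root() (fs.Node, error) { return rootDir{}, nil }

type rootDir struct{}

func (rootDir) Attr(ctx context.Context, a *fuse.Attr) error {
	a.Inode = 1
	a.Mode = os.ModeDir | 0555
	return nil
}

func main() {
	conn, err := fuse.Mount("/mnt/torus", fuse.FSName("torus"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Serve blocks until the filesystem is unmounted.
	if err := fs.Serve(conn, torusFS{}); err != nil {
		log.Fatal(err)
	}
}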

Drain/Fail

Add specific removal types to agroctl -- things which will block until the node is removed (safe or hard, in either case).

Adding a mode for really ramping up the rebalance traffic if you're draining also falls under this bug. Right now it's rate limited, but I think the better option would be a fast divestment of data.

Persistence

We have MFiles; let's use them to store blocks. This requires keeping a block map (easy enough).

We also have need of an INode store. Longer-term, this could be the same block store too -- if we split INodes (serialized protos) into blocks when written. But that's a deliberate abstraction. For now, a simple, separate BoltDB will do the trick. It does mean we'll be using more space than we advertise, but running conservatively (i.e., --size=something-less-than-half-a-disk) will be fine through a few phases.

To persist the temporary metadata? A flat file for that would be fine.
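
A sketch of the interim BoltDB plan (the bucket and key names are made up): one bucket of serialized INodes keyed by ref, written and read back inside Bolt transactions.

// Sketch only: a separate BoltDB file holding serialized INodes.
package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("inodes.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a serialized INode proto under its ref.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("inodes"))
		if err != nil {
			return err
		}
		return b.Put([]byte("volume1/inode42"), []byte("serialized-proto-bytes"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back.
	err = db.View(func(tx *bolt.Tx) error {
		data := tx.Bucket([]byte("inodes")).Get([]byte("volume1/inode42"))
		log.Printf("read %d bytes", len(data))
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}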

Proposal: Lose the VolumeID in the INodeRef

As currently implemented, there's no need for the VolumeID in the INodeRef; it simply takes up space because right now there's only a single source of INodes, with no duplicates.

Now, there are two options:

  1. Lose the VolumeID and have a single source of INodes:
    • Pros:
      • Smaller Refs
      • Smaller MDS storage (no need for an inode counter for every volume)
    • Cons:
      • Smaller representation
        • With 8k blocks, we're still representing up to 10^18 Yottabytes.
      • Potential etcd contention with a lot of write traffic.
        • Could be almost nullified by adding an "INC" operation in etcd, which atomically adds one to the value and returns that count (a sketch of emulating this follows after the list).
  2. Keep the VolumeID and have an inode chain per volume:
    • Pros:
      • etcd contention only per-volume -- if we're mounting a lot of these, with an average ratio of almost 1:1 per machine that has it mounted, that means very little contention.
      • Friggin huge representation
        • 10^18 Yottabytes -- PER VOLUME. For an insane 10^37 Yottabytes of possible representation space. That's 10 tera-yottabytes.
    • Cons:
      • More MDS storage (O(volumes * 4 bytes), so tiny)
      • Bigger refs (could be more smartly encoded)
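
etcd doesn't expose an INC primitive today, but it can be emulated with a compare-and-swap transaction loop. A hedged sketch against clientv3, with a hypothetical counter key:

// Sketch only: emulate an atomic INC with an etcd v3 CAS transaction loop.
package main

import (
	"context"
	"fmt"
	"log"
	"strconv"

	"github.com/coreos/etcd/clientv3"
)

// atomicInc bumps the counter stored at key by one and returns the new value.
func atomicInc(cli *clientv3.Client, key string) (int64, error) {
	for {
		resp, err := cli.Get(context.Background(), key)
		if err != nil {
			return 0, err
		}
		var cur, rev int64
		if len(resp.Kvs) > 0 {
			cur, _ = strconv.ParseInt(string(resp.Kvs[0].Value), 10, 64)
			rev = resp.Kvs[0].ModRevision
		}
		next := cur + 1
		// Only write if nobody else has modified the key since we read it.
		txnResp, err := cli.Txn(context.Background()).
			If(clientv3.Compare(clientv3.ModRevision(key), "=", rev)).
			Then(clientv3.OpPut(key, strconv.FormatInt(next, 10))).
			Commit()
		if err != nil {
			return 0, err
		}
		if txnResp.Succeeded {
			return next, nil
		}
		// Lost the race; retry.
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	n, err := atomicInc(cli, "/torus/volumes/vol1/inode")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("next inode:", n)
}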

etcdv3 MDS

We could waste some time building a scaffolding one (via the HTTP API or something), but why not go straight to the real plan.

The good news is the v3 API is hella cleaner and more powerful than it used to be. Build proto. Send proto over gRPC. Done.

Logging and Sane Errors

Go through and vet the logging/errors. Add more logs using a favorite logging package (it need not be capnslog, but I'm game. Just needs to be BSD/Apache2-licensed). Add errors.go files at the appropriate layers. Use them.

entrypoint.sh can only use a single etcd host

With the current structure of entrypoint.sh, only a single etcd host can be added. Potentially this could be overridden by supplying a string of hosts as ETCD_HOST, but in that case it will also require setting ETCD_PORT="" to avoid the automatic assignment of that variable.

Command-line Tool

We should have a command-line tool (agro), which has subcommands allowing a user to add a new file and get an existing file.

Streaming Writes

I should be able to stream content to a file in multiple chunks, without each chunk needing to be flushed to disk. The chunks should not be visible to other readers until I choose to explicitly commit.

tl;dr: Writes should be independent from commits in the low-level API.
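
A toy illustration of the split (not the real Torus API): writes land in a staging buffer and only become visible, in one step, when the caller commits.

// Sketch only: writes are staged; Commit publishes them all at once.
package main

import "fmt"

// stagedFile stages writes privately and publishes them on Commit.
type stagedFile struct {
	staged    []byte
	published []byte
}

// WriteAt stages data at the given offset; readers of "published" see nothing yet.
func (f *stagedFile) WriteAt(p []byte, off int64) (int, error) {
	if need := int(off) + len(p); need > len(f.staged) {
		f.staged = append(f.staged, make([]byte, need-len(f.staged))...)
	}
	copy(f.staged[off:], p)
	return len(p), nil
}

// Commit publishes everything staged so far in one step.
func (f *stagedFile) Commit() error {
	f.published = append([]byte(nil), f.staged...)
	return nil
}

func main() {
	f := &stagedFile{}
	f.WriteAt([]byte("hello"), 0)
	fmt.Printf("before commit: %q\n", f.published) // ""
	f.Commit()
	fmt.Printf("after commit:  %q\n", f.published) // "hello"
}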

documentation: how to use torus as a library

Related to #147, I think we need to document how people can build applications on top of torus easily. I know that @barakmich has been working on the libraries and factoring things out so that building the next app on top of torus is easy to do, but I think we should do this pre-launch. At least paint the rough picture.

can't go get a private repo

Some of us don't have go get set up to use SSH. Instead of go get, maybe:

git clone git@github.com:coreos/agro.git $GOPATH/src/github.com/coreos/agro
cd $GOPATH/src/github.com/coreos/agro
go get ./...

Garbage Collection

Data which has been replaced should be periodically removed to free up space.

make fails

$ make
go build ./cmd/agro
go build ./cmd/agroctl
# github.com/coreos/agro/cmd/agroctl
cmd/agroctl/agroctl.go:29: undefined: volumeCommand
Makefile:2: recipe for target 'build' failed
make: *** [build] Error 2

agromount: error while using flexvolume

May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: couldn't dial: dial tcp 10.2.57.5:40000: getsockopt: connection refused
May 13 13:39:01 localhost agro[20293]: error WriteAll to peer c88654ce-190f-11e6-a8bf-faef23e236fc: agro: block cannot be retrieved
May 13 13:39:01 localhost agro[20293]: panic: runtime error: slice bounds out of range
May 13 13:39:01 localhost agro[20293]: goroutine 70 [running]:
May 13 13:39:01 localhost agro[20293]: panic(0xc04cc0, 0xc82000c030)
May 13 13:39:01 localhost agro[20293]:         /usr/lib/go/src/runtime/panic.go:481 +0x3e6
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro/distributor.(*Distributor).GetBlock(0xc82006bc20, 0x7f0c4fd90510, 0xc820280b10, 0x3, 0x4, 0x1, 0x0, 0x0, 0x0, 0x0, .
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/distributor/storage.go:37 +0xec3
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro/blockset.(*baseBlockset).GetBlock(0xc820283a40, 0x7f0c4fd90510, 0xc820280b10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/blockset/base.go:55 +0x41a
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro/blockset.(*crcBlockset).GetBlock(0xc82027b1d0, 0x7f0c4fd90510, 0xc820280b10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/blockset/crc.go:58 +0x25a
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro.(*File).openBlock(0xc82022c960, 0x0, 0x0, 0x0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/file.go:151 +0x725
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro.(*File).WriteAt(0xc82022c960, 0xc8203ca01c, 0x1000, 0xfffe4, 0x0, 0x0, 0x0, 0x0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/file.go:296 +0x1e38
May 13 13:39:01 localhost agro[20293]: github.com/coreos/agro/internal/nbd.(*NBD).handle(0xc820283a80, 0xc8202962a0)
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/internal/nbd/nbd.go:240 +0x95e
May 13 13:39:01 localhost agro[20293]: created by github.com/coreos/agro/internal/nbd.(*NBD).Serve
May 13 13:39:01 localhost agro[20293]:         /home/barak/src/agro/src/github.com/coreos/agro/internal/nbd/nbd.go:176 +0x2bd

Count All The Things! (Prometheus)

Add prometheus in a lot of places. Count block reads, count files opened, count files that are write-dirty, count metadata interactions, etc, etc....
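
A sketch with the Prometheus Go client; the metric names and listen port are made up.

// Sketch only: a couple of Prometheus counters plus a /metrics endpoint.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	blockReads = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "torus_block_reads_total",
		Help: "Number of blocks read from the store.",
	})
	filesOpened = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "torus_files_opened_total",
		Help: "Number of files opened.",
	})
)

func init() {
	prometheus.MustRegister(blockReads, filesOpened)
}

func main() {
	// Increment wherever a block is read or a file is opened.
	blockReads.Inc()
	filesOpened.Inc()

	// Expose /metrics for scraping.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}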

metadata/etcd: Drop and Reconnect semantics

We're happily passing contexts now, and timeouts can happen. In the case of timeout, a drop-and-reconnect is warranted. However, we currently have just one connection. We need to fix this.

Spitballing names

Agro is the current name; it's the name of the horse from Shadow of the Colossus.

Following the Colossus idea, but staying relevant to computer science: Bletchley Park was where Colossus, the computer, was created by codebreakers (including Alan Turing) during the war. My proposal is to call the project "bletchley".

INode replication vs Block replication.

There is a parameter called INodeReplication, which defines how many replicas each INode is expected to have.

What is the plan for handling block replication? For N copies, we can simply reuse the INodeReplication parameter. But when doing RS (Reed-Solomon) encoding for blocks, a replication factor does not make a lot of sense.

Another question: are we trying to replicate blocks on the same nodes as their INodes?

hitting etcd request size limit

error creating volume agroblock2: rpc error: code = 3 desc = "etcdserver: request is too large"

when trying to create a 512M block volume.

The marshalled inode size in createBlockVol ends up being ~1.75M, and etcd recently set a limit of 1.5M (etcd-io/etcd@3556722).

Lingo

Hash out the lingo on the wiki

documentation: document partition and other "recoverable" failures

The very first questions we are going to get are about how this thing fails in various scenarios:

  • What happens in single machine failure?
  • What happens during network partition between peers?
  • What happens during network partition of etcd?

And in each of these scenarios we should provide information about:

  • Increased latency
  • Read/write availability
  • Estimated time to recovery

documentation: simple agrocat command line tool

I think part of the experiment with this project is to see what people build outside of block devices. We should create a simple tool that lets people cat a stream into, and cat a stream out of, an agro volume.

Structured Logging

We should have good structured logging from the start. Looking for suggestions.
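
One possible direction, purely as an example of the shape we want (logrus-style fields attached to every message):

// Example only: structured fields on a log line, here with logrus.
package main

import log "github.com/sirupsen/logrus"

func main() {
	log.WithFields(log.Fields{
		"volume": "agroblock2",
		"peer":   "c88654ce-190f-11e6-a8bf-faef23e236fc",
		"op":     "WriteAll",
	}).Error("block cannot be retrieved")
}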
