deislabs / bindle

Bindle: Object Storage for Collections

License: Apache License 2.0

Rust 99.45% Makefile 0.42% Dockerfile 0.13%

bindle's Introduction

Bindle: Aggregate Object Storage

Please note that this is pre-1.0 software, meaning that breaking changes are possible (and even likely) before it hits 1.0. However, these changes are documented and the project is safe to use for real use cases.

Aggregate Object Storage Means Keeping Related Things Together

A photo album. A sock drawer. A bookshelf. We like storage solutions that let us keep related things in a single location.

Consider the humble silverware drawer. When we set the table for dinner, it's convenient to open one drawer and get the forks, spoons, and knives. Yes, a fork is a different thing than a knife. Yes, there are multiple different kinds of spoons. And, yes, silverware is not even uniform in size or shape brand-to-brand, model-to-model. Some people keep chopsticks with the silverware. Others toss in those tiny spreader things you use to slather on the cream cheese at a fancy party. In my house we keep the straws in the silverware drawer. Drawers are flexible. They can accommodate these variances.

Bindle is the digital silverware drawer.

More specifically, Bindle is an aggregate object storage system. Merriam-Webster defines an aggregate as "a mass or body of units or parts somewhat loosely associated with one another." The fundamental feature of Bindle is that it provides you with a way to group your associated objects into an organized and named unit. It thinks in terms of aggregates.

The Usefulness of Aggregate Object Storage

Again with the silverware, the attraction of the metaphor is how relatable it is. We understand why we need silverware drawers. But why do we want aggregate object storage?

Let's look at a classic C/C++ program developed in a UNIX-like environment. Many times, the modern C/C++ application makes use of shared objects (SOs). These are binary files that contain the compiled version of a shared library. Say we have a top-level program called maple that uses several shared object files. In order to execute our maple program, we need to make sure we have all of the required SO files present on the same filesystem as maple. And when moving maple to another system, we need to also move all of those SOs. So maple is really an aggregate application: To run it, we need to keep track of several different pieces.

We can jump to a different part of our tech stack for another example. A web application may have components in HTML, in JavaScript, CSS, and even images or other media. No one thing on that list is "the application." The application is an aggregate of all of those resources.

In the last few years, we've even seen the emergence of large-scale distributed computing. In this world, aggregates of microservices together make up a single application. Indeed, as the industry progresses, it is essential that we learn how to capture the definition of an application as an aggregate of related programs.

Bindle is a tool for treating the group of related parts as a thing itself. In geology an "aggregate" is a single chunk of rock that is composed of an assortment of individual minerals. Modern applications are like geological aggregates. We want to be able to talk about the individual parts, but within the context of the whole that they represent.

What Does an Aggregate Application Look Like?

To take our web application example above, Bindle would see that application as a unit that looked something like this:

my-web-app 1.2.3
  |- index.html
  |- style.css
  |- library.js
  |- pretty-picture.jpg

Our top-level application, my-web-app 1.2.3, is composed of four individual parts that all need to be present. It's simultaneously important for us to talk about the aggregate as a whole while still appreciating the individuality of each of its parts.

Bindle goes one step further, though: It allows you to express relationships between these parts. Keeping with the silverware drawer example, it lets you say "these are both spoons, but this spoon is only used when we are having soup, while that one is used for tea."

To that end, Bindle supports a more complex notion of composition. We might have a case where one part of the application has requirements that can be satisfied by multiple different parts. Here's an example: Say we have an application that reads through a pool of sports data and makes projections about who will win this weekend's SportsBall game.

A frontend is the user interface, and it connects to some prediction engine. The prediction engine might use a set of simple statistical prediction rules, or it might use a sophisticated machine learning algorithm. Which engine we use may be determined by a range of factors, including the capabilities of the system on which it is run or the accuracy demanded of the output.

Bindle can model this situation by keeping all of the objects stored together, and letting the client figure out which combination it needs. So the Bindle looks like this:

sports-predictionator 2.0.0
  |
  |- predictionator-frontend
  |- One of
        |- lib-machine-learning
        |- lib-statistical-prediction

A client with plenty of time and resources might select the lib-machine-learning one, while a constrained client might pick the simple statistical formulas. But the bindle describes both possibilities.

While it is not apparent from these simple examples, Bindle provides information that helps runtimes make these decisions. Take a look at the specifications to get into the details.

Still, there's a little more to the Bindle story.

Don't Store the Same Thing Twice

Rewinding to our example of the C program that used shared objects, one important word in this design is shared. Modern applications, be they web applications or system tools, benefit from sharing. Bindle, too, cares about this. Sharing is also good when it comes to cost. "Storage is cheap." No, it's really not. And bandwidth charges are certainly far from cheap. Bindle is structured so that:

An object is only stored once (where "object" here means "unique stream of bytes")
Clients only have to pull objects they don't already have -- and it's easy for them to figure this out.
Servers inform clients about when they need to send data. There's no reason for a client to be compelled to upload data that the server already has.
All of this is done with content addressable storage and cryptographically secure hashing and signing.

Because of this, Bindle can keep download times fast, bandwidth costs low, and storage space minimized--all without sacrificing data integrity.
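The "only pull what you don't already have" rule can be sketched as a set difference over content hashes. This is a minimal illustration, not Bindle's client code: the function name parcels_to_fetch and the use of plain hash strings are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Returns the parcel hashes listed in an invoice that are not yet in the
/// local cache, preserving invoice order and skipping repeated hashes.
/// (Hypothetical helper; Bindle's real client API may differ.)
fn parcels_to_fetch<'a>(
    invoice_hashes: &[&'a str],
    local_cache: &HashSet<String>,
) -> Vec<&'a str> {
    // Tracks hashes already scheduled, so identical content is fetched once.
    let mut seen = HashSet::new();
    invoice_hashes
        .iter()
        .copied()
        .filter(|hash| !local_cache.contains(*hash) && seen.insert(*hash))
        .collect()
}
```

Because objects are identified by the hash of their bytes, two parcels with identical content collapse to one download, and anything already cached is skipped entirely.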

Enough with talk, let's get down to the business of using Bindle.

Using Bindle

To build Bindle, you can use make or cargo:

$ # Recommended
$ make build
$ # The above is approximately equivalent to:
$ cargo build --features=cli --bin bindle
$ cargo build --all-features --bin bindle-server

The binaries will be built in target/debug/bindle and target/debug/bindle-server. For both client and server, the --help flag will print out documentation.

Starting the Server

To start the compiled server, simply run target/debug/bindle-server. If you would like to see the available options, use the --help flag.

If you would like to run the server with cargo run (useful when debugging), use make serve or make serve-tls. (The first time you run make serve-tls, it will prompt you to create a testing TLS cert.)

Supplying a Configuration File

The Bindle server looks for a configuration file in $XDG_DATA/bindle/server.toml. If it finds one, it loads configuration from there. You can override this location with the --config-path path/to/some.toml flag.

address = "127.0.0.1:8080"
bindle-directory = "/var/run/bindle"
cert-path = "/etc/ssl/bindle/certificate.pem"
key-path = "/etc/ssl/bindle/key.pem"

Running the Client

If you compiled, the client is in target/debug/bindle. You can also run from source with cargo run --features=cli --bin=bindle or $(make client) (e.g. $(make client) --help).

You will need to either supply the --server parameter on the command line or set the BINDLE_URL environment variable.

$ export BINDLE_URL="http://localhost:8080/v1"
$ # Running from build
$ target/debug/bindle --help
$ # Running from Cargo
$ cargo run --bin bindle --features=cli -- --help
$ # Running from make
$ $(make client) --help

For more, see the docs.

Concepts

In the Bindle system, the term bindle refers to a bundle of related data called parcels. A bindle might be simple, containing only a single binary file. Or it may be complex, containing hundreds of discrete data objects (files, libraries, or whatnot). It can represent a layer diagram, like Docker, or just a regular file download. With experimental conditions, it can even represent packages containing mandatory, optional, and conditional components.

A bindle is composed of several parts:

The invoice (invoice.toml) contains information about the bindle (name, description...) as well as a manifest of parcels (individual data items).
A parcel has a data file (parcel.dat) that contains opaque, arbitrary data.

A bindle hub is a service that manages storage and retrieval of bindles. It is available via an HTTP/2 connection (almost always over TLS). A hub supports the following actions:

GET: Get a bindle and any of its parcels that you don't currently have
POST: Push a bindle and any of its parcels that the hub currently doesn't have
DELETE: Remove a bindle

Note that you cannot modify any part of a bindle. Not the payload. Not the name. Not even the description. Bindles are truly immutable. It's like the post office: Once you ship a package, you can't go back and change it. This greatly increases the security of the entire system.

Bindle Names

There are many fancy naming conventions in the world. But Bindle eschews the fancy in favor of the easy. Bindle names are paths. The following are all valid bindle names:

mybindle
mybindle.txt
example.com/stuff/mybindle
mybindle/v1.2.3

While all of the above are valid bindle names, those that end with a version string (a SemVer) have some special features. Thus, we recommend using versioned bindle names:

mybindle/v1.0.0
mybindle/v1.0.1-beta.1+ab21321
example.com/stuff/mybindle/v1.2.3
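To make the naming rule concrete, here is a rough sketch of splitting a name into its path and an optional trailing version. The functions split_name and looks_like_semver are hypothetical and only check the numeric MAJOR.MINOR.PATCH core; Bindle's own parser may accept or reject different inputs.

```rust
/// Splits a bindle name into its path and an optional trailing version.
/// A trailing segment counts as a version only if it looks like
/// `vMAJOR.MINOR.PATCH`, optionally followed by `-prerelease` or `+build`.
/// (Hypothetical helper, not Bindle's actual parser.)
fn split_name(name: &str) -> (&str, Option<&str>) {
    if let Some((path, last)) = name.rsplit_once('/') {
        if let Some(version) = last.strip_prefix('v') {
            if looks_like_semver(version) {
                return (path, Some(version));
            }
        }
    }
    (name, None)
}

/// Checks only the numeric core; pre-release and build tags are not validated.
fn looks_like_semver(version: &str) -> bool {
    let core = version.split(|c| c == '-' || c == '+').next().unwrap_or("");
    let parts: Vec<&str> = core.split('.').collect();
    parts.len() == 3
        && parts
            .iter()
            .all(|p| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit()))
}
```

With this sketch, unversioned names such as mybindle.txt come back whole, while mybindle/v1.2.3 splits into the path mybindle and the version 1.2.3.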

First-class Semver

One frequently used convention in the software world is versioning. And one standard for version numbering is called SemVer. Bindle has strong support for SemVer.

Each Bindle invoice MUST have a semantic version. There is no head or latest in Bindle. Every release is named with a specific version number. With this strong notion of versioning, we can track exact objects with no ambiguity. (Remember, Bindles are immutable. A version number is always attached to exactly one release.)

And SemVer queries are a way of locating "near relatives" of bindles.

For example, searching for 1.2.3 of a bindle will return an exact version. Searching for 1.2 will return the latest patch release of the 1.2 version of the bindle (which might be 1.2.3 or perhaps 1.2.4, etc).

Version ranges must be explicitly requested as queries. A direct fetch against v1.2 will only return a bindle whose version string is an exact match to v1.2. But a version query for v1.2 will return what Bindle thinks is the most appropriate matching version.
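The difference between an exact fetch and a version query can be illustrated with a small resolver. This is a simplified sketch under stated assumptions: versions are plain (major, minor, patch) tuples, pre-release tags are ignored, and the names parse_query and resolve_query are invented for the example. The matching rules Bindle actually applies are defined in its specification.

```rust
/// A release version as (major, minor, patch); pre-release tags are ignored here.
type Version = (u64, u64, u64);

/// Parses "1", "1.2", or "1.2.3" (with an optional leading 'v') into its
/// numeric components. Returns None for anything else.
fn parse_query(query: &str) -> Option<Vec<u64>> {
    let trimmed = query.strip_prefix('v').unwrap_or(query);
    let parts: Result<Vec<u64>, _> =
        trimmed.split('.').map(|p| p.parse::<u64>()).collect();
    match parts {
        Ok(p) if !p.is_empty() && p.len() <= 3 => Some(p),
        _ => None,
    }
}

/// Returns the highest available version whose leading components match
/// the query, or None if nothing matches.
fn resolve_query(query: &str, available: &[Version]) -> Option<Version> {
    let wanted = parse_query(query)?;
    available
        .iter()
        .copied()
        .filter(|&(major, minor, patch)| {
            let have = [major, minor, patch];
            wanted.iter().zip(have.iter()).all(|(w, h)| w == h)
        })
        .max()
}
```

Given releases 1.2.3, 1.2.4, and 1.3.0, a query for 1.2 resolves to 1.2.4 in this sketch, while a query for 1.2.3 matches only that exact release.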

The Bindle Specification

The docs folder on this site contains the beginnings of a formal specification for Bindle. The best place to start is with the Bindle Specification.

Okay, IRL what's a "bindle"

The word "bindle" means a cloth-wrapped parcel, typically of clothing. In popular U.S. culture, hobos were portrayed as carrying bindles represented as a stick with a handkerchief-wrapped bundle at the end. Also, it sounds cute.

bindle's Issues

There is almost no documentation on building/running Bindle

And what is present in the README is actually wrong and outdated.

As an operator, I can verify the signatures on bindles (Bindle)

To complete this, I am working on the general key management in Bindle, with emphasis on creator and host key signing/verifying.

Handling Authn/z for bindle

How do we want to handle authorization and authentication for Bindle? Our current idea for authentication is to either suggest a reverse proxy that handles the authentication part or some sort of connection to a third party service.

For authorization, we have a working idea of creating a new "authorization" crate that has a base Authorizable trait anything can implement. We could then connect this into the server side of things to ensure access. We will also need a way to give a bindle server access to a database of groups/users assigned to each bindle. A mock up of what the code might look like can be found here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3e9c2ea72dfb9442310c8b629e075775

Client needs to generate signing key

The client needs to generate a signing key for the creator role exactly once. We should probably add a command for this.

$ bindle create-key 'My Name <my@email>'

The key should be stored in the Bindle client's config directory with a name like client-key/creator-key or something to distinguish it from a server key. The mode of this file MUST be set to something like 600 or 700. Alternately, we could store it in the system keyring if that is broadly supported.

I think that the file could be as simple as the base64-encoded (or PEM-encoded, if you prefer for some reason) bytes of the keypair. Possibly a better design would be to store it in TOML in a format similar to the keyring.toml described in the spec, but with a private key. This might allow a single client to have multiple keys (e.g. one for creator, one for approver, etc. or one for current, multiple for rotated).

Key generation code is all in signature.rs.

While we should support key rotation in the future, we probably don't need to do so now.

Standalone bindle tarball support

Currently the standalone module does not support the optional tarball mechanism. We should implement both pushing and writing to tarballs for easier portability.

Change BINDLE_SERVER_URL to BINDLE_URL

See deislabs/hippo-cli#25

Server needs to validate `creator` signature on invoice receipt

When the server receives a new invoice, it MUST verify that the invoice is signed, that the signature is valid, and that the signing key used is already in the server's keyring. Any failure on this MUST lead to rejecting the bindle.

Set the correct permissions for create-key on Windows

With #175 and #115, we have Unix support for setting read/write/execute permissions only for the owner of the file. However, it is a little unclear what should be done on Windows. I think the best option is to set the creator/owner to have read/write access and deny everything else, but there could be a better option for Windows. Also, as far as I can tell, there is no way to manually pass security options down to the CreateFileA method in the Windows API, so this will require importing the windows crate and calling that API manually.

Bindle search `-q` requires a value

According to the output of bindle search -h:

-q, --query <query>               name of the bindle to search for, an empty query means all
                                      bindles

But if you do bindle search -q '' (or any similar variation), you get the following error:

error: The argument '--query <query>' requires a value but none was supplied

USAGE:
    bindle --server <server-url> search [OPTIONS]

For more information try --help

I am not sure which is right. I suspect Clap does not like empty values. We could use something like * if we wanted.

`bindle push` needs to sign using the `creator` role for the current user's key

Some flags might be necessary to override defaults, but here is what I am thinking as the default behavior:

User runs bindle push
Client loads user's signing secret key, failing if the key is not found
Client loads the invoice.toml
Client signs the invoice object (support is in signature.rs)
Client appends signature as a [[signature]] block on the invoice.toml
- Option A: The invoice with the signature is stored on disk
- Option B: This version of the invoice is in memory only, since there is little value in keeping the signature on the artifact on disk
Client pushes the invoice to Server

Plumb the signing logic into the server

Now that we have the signing code in place, we need to plumb it into the server (and provider if necessary). The server should be able to validate and sign the bindle it receives.

Configuration Files and Config Dir

Bindle should have a configuration directory with the following files:

server.toml: Basic configuration, like port, hostname, etc.
keyring.toml: Public keyrings (possibly temporary)
signing.key: mode 600 file with just the secret key used for signing packages

The configuration directory should default to a well-known location (either $XDG_CONFIG_DIR or /etc). It should be automatically created if it does not already exist. And this directory name should be overridable from CLI flags.

For starters, we don't need the keyring or key file -- just server.toml.

We may also want/need a client.toml for configuring a Bindle client.

Add logging

We need to add logging for the server.

Invoice/Parcel signing

The signing needs to be implemented on bindles and parcels.

Proxy: Sign bindles in get_yanked_invoice

Right now, the proxy signs invoices on creation. I think the proxy probably should also sign (with a proxy record) any request that it gets from an upstream.

See #127

Server needs to sign invoices on receipt

When the server receives an invoice, it needs to sign the invoice upon reception, using its host key as described in the signing-spec.md.

Signing: API server needs to load strategy

At some point, the API server should be able to load the signature verification strategy. For now, it defaults to GreedyVerification, which is the strategy that the specification recommends.

Should ensure that `config_dir() + "bindle"` exists

Right now, Bindle doesn't check to see if the dir exists before attempting to operate on it. While this is fine for reads, for writes it is problematic. See create-key for an example.
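A minimal sketch of the fix, using only the standard library. The helper name ensure_bindle_dir is made up for illustration, and the base path is passed in rather than looked up, so the example does not depend on how Bindle resolves its config directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Creates `<base>/bindle` (and any missing parents) if needed and returns
/// its path. Succeeds without changes if the directory already exists.
/// (Hypothetical helper; `base` would come from the config-dir lookup.)
fn ensure_bindle_dir(base: &Path) -> io::Result<PathBuf> {
    let dir = base.join("bindle");
    fs::create_dir_all(&dir)?;
    Ok(dir)
}
```

Calling something like this before any write (for example, at the start of create-key) would make the operation safe to repeat, since create_dir_all does not fail when the directory is already present.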

File locking doesn't work on windows

As evidenced by the flakes we have in our Windows tests, it looks like the file locking in the file provider doesn't work on Windows. I think this could be solved by using the Windows-specific OpenOptionsExt and setting the share mode to 0.
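A rough sketch of what that could look like. The function name open_exclusive is hypothetical, and whether this alone resolves the test flakes is untested. The share_mode call is from std::os::windows::fs::OpenOptionsExt and is compiled only on Windows, so the same function still builds on other platforms.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

/// Opens (or creates) a file for reading and writing. On Windows, a share
/// mode of 0 asks the OS to deny other opens of the file until this handle
/// is closed. On other platforms no extra locking is applied here.
fn open_exclusive(path: &Path) -> io::Result<File> {
    let mut opts = OpenOptions::new();
    opts.read(true).write(true).create(true);
    #[cfg(windows)]
    {
        use std::os::windows::fs::OpenOptionsExt;
        opts.share_mode(0);
    }
    opts.open(path)
}
```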

Support JSON format for requests and responses

Although TOML is great for hand-coding invoices, it's not as widely or as well supported as JSON for parsing or programmatic composition.

It would be convenient if the server supported Accept: application/json on invoice GETs and Content-Type: application/json on invoice POSTs. This would allow JSON parsers and serialisers to be used on platforms where these are better maintained than TOML parsers (such as .NET).

Key management endpoint

Do we want to provide an endpoint for fetching the public keys that Bindle knows about? /cc @radu-matei

Bug: create-key should make sure the secret key file mode is 0700

Right now, it doesn't check.
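One possible shape for the check on Unix, as a sketch only. The names write_secret_key and is_owner_only are invented here, and the example uses mode 0o600 and tests that no group or other bits are set, which is an assumption about what "owner only" should mean rather than a statement of the intended final mode.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Writes key material to a new file that only the owner can read or write.
fn write_secret_key(path: &Path, key_bytes: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // refuse to overwrite an existing key file
        .mode(0o600)
        .open(path)?;
    file.write_all(key_bytes)
}

/// Returns true if the file grants no permissions to group or others.
fn is_owner_only(path: &Path) -> io::Result<bool> {
    let mode = fs::metadata(path)?.permissions().mode();
    Ok(mode & 0o077 == 0)
}
```

A check like is_owner_only could run right after key creation and again whenever the key is loaded, so a file whose permissions were loosened later is also caught.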

Accept charset in Content-Type header

When creating an invoice, if you send a Content-Type of application/toml; charset=..., the server rejects it with 400 Bad Request, because it tries to exactly match the string application/toml and the additional clause confuses it. However this is allowed by the HTTP spec and is sent by default by at least the .NET HttpClient/StringContent class. The server should accept the charset declaration.
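As an illustration of the more lenient comparison, the sketch below strips any parameters before comparing the media type. The function media_type_matches is hand-rolled for this example and is not the server's code; a dedicated media-type parser would handle more edge cases.

```rust
/// Returns true if a Content-Type header value names the given media type,
/// ignoring any parameters such as `charset` and ignoring ASCII case.
/// (Hypothetical helper for illustration.)
fn media_type_matches(header_value: &str, expected: &str) -> bool {
    // Everything before the first ';' is the type/subtype pair.
    let essence = header_value.split(';').next().unwrap_or("").trim();
    essence.eq_ignore_ascii_case(expected)
}
```

With this approach, both application/toml and application/toml; charset=utf-8 are accepted, while application/json is still rejected.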

Add client tracing

#128 adds support for tracing on all the server parts. We should also add tracing to the client parts.

Inconsistent client/server flag and environment variable names

For bindle directory:
client: --bindle-dir and BINDLE_DIR
server: --directory and BINDLE_DIRECTORY

For server url:
client: --server, -s and BINDLE_URL
server: --address, -i and BINDLE_IP_ADDRESS_PORT

Exposing signing helpers in the bindle CLI

Do we want to add a new subcommand for signing bindles to the CLI?

Hashing parcels: Should we bump this up a layer?

Right now, when Bindle loads a parcel off of the file system it checks the hash. But this is done in the storage driver. Really, this should be done a layer higher so that storage drivers merely load the bindle.

Update e2e tests to use creator keys

Right now we don't pass a keychain to the server inside of TestController for our e2e tests. This means that we can't sign with creator keys yet in tests (and we don't even test that yet). TestController::new should generate a private/public key for the creator and use that public key in its keyring. The private key should be added as a field on the controller for use in signing test bindles.

Server needs to generate keys

The server should generate a host key (alternately, the bindle create-key command could just create host keys for you). This should be stored in the bindle config dir.

See #103

Hard delete a bindle

One of the key principles of the Bindle architecture is immutability - no bindle can ever be modified or deleted.

This is great until:

Somebody creates eight dozen apps with variants of "hello world" as their names, and you run out of demo namespace
Somebody stores illegal stuff on your Bindle server

The first is merely vexing, and probably not a major issue on production Bindle servers. The second, however, results in Mickey Mouse and Goofy kicking down the server operator's door at 4am to hand them a writ for possession of 400 slightly different encodings of Frozen II.

It would be good if the server operator, at least, could do something about this. It need not be exposed to ordinary users and would be fine strictly as an admin operation.

We would need to consider whether hard deletion would continue to reserve a spot in the namespace. If it did not, it would create an avenue for someone to spoof a bindle, though this would require admin collusion, and is unlikely to happen to bindles that are actually in legitimate use. If it did, then the "polluted test namespace" would continue to be an issue, albeit a minor annoyance rather than anything serious.

Add Change notification to bindle server

Bindle server should support event notification for changes. Clients should be able to subscribe to an event stream, messages, or a webhook to get notifications about new bindles, new versions, etc.

This will enable many scenarios, including application updates, Slack notifications, etc.

Use timeouts on download operations in the client

Right now we don't have a timeout value for the client library. This means that if anything happens to the connection (or it is malformed), it can just hang (as happened in #148). We should set a default timeout and allow the user to specify a value when constructing the client. We also will need logic to handle a failed download. Basically, if timeout of X is reached, emit an error and remove the partially downloaded parcel.

Is there a listing?

I don't see any way to list bindles in the protocol spec. Was curious.

New endpoint for uploading JUST a signature for a parcel

I imagine we need something like

POST _i/${INVOICE}/sign

Which would take JUST a Signature object, and would attach it to the invoice.

The endpoint MUST:

verify that the signature is valid
verify that the key is known to the server
verify that the same key was not already used to sign the invoice in question
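The three checks could be ordered roughly as follows. This is a sketch with invented names (SignError, check_signature_upload); keys are plain strings, and the signature_is_valid flag stands in for real cryptographic verification, which is out of scope here.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum SignError {
    InvalidSignature,
    UnknownKey,
    AlreadySigned,
}

/// Applies the three checks in order and reports the first failure.
/// `signature_is_valid` is a placeholder for actual signature verification.
fn check_signature_upload(
    key: &str,
    signature_is_valid: bool,
    keyring: &HashSet<String>,
    existing_signers: &HashSet<String>,
) -> Result<(), SignError> {
    if !signature_is_valid {
        return Err(SignError::InvalidSignature);
    }
    if !keyring.contains(key) {
        return Err(SignError::UnknownKey);
    }
    if existing_signers.contains(key) {
        return Err(SignError::AlreadySigned);
    }
    Ok(())
}
```

Returning a distinct error per check would also let the endpoint map each failure to its own response, rather than a single generic rejection.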

[Discussion] Do not return Accepted for invoice creation

When you POST an invoice, the invoice resource is created immediately and is available immediately. This maps to the HTTP 201 Created status code.

However, if there are any parcels not currently uploaded, we currently return HTTP 202 Accepted. The intended usage of this is for resources that are created asynchronously and may not be available immediately. This is misleading as the invoice has definitely been created, but not all of its parcels are available.

Should we return Created for this case too? We would probably need to overload the response body, though, to say if any parcels are unavailable, which is a bit iffy. Or are we okay with the use of the async creation status?

`InvalidId(InvalidId)`

When the client reports an invalid ID error, it should say what the ID is.
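One way to do this is to carry the offending ID inside the error variant, so that the Display output can include it. The ClientError type below is a simplified stand-in for illustration, not the client's actual error enum, and the message wording is an assumption.

```rust
use std::fmt;

#[derive(Debug)]
enum ClientError {
    /// Carries the offending ID so the message can show it.
    InvalidId { id: String, reason: String },
}

impl fmt::Display for ClientError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ClientError::InvalidId { id, reason } => {
                write!(f, "invalid bindle ID '{}': {}", id, reason)
            }
        }
    }
}

impl std::error::Error for ClientError {}
```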

Use the `mime` crate for parsing the `Accept` header

Currently our code does some manual parsing. Now that #153 is using the mime library for proper parsing, we should replace some of our own code with it in the SerializedData type.

Client must apply validation strategy on signatures

The signing-spec defines several possible strategies for validating signatures. The Bindle client needs to follow at least one of these. I would suggest that at first we go with:

On get or get-invoice, the client verifies that the host is known and valid, and that either the creator or one of the approver keys is known and valid.

Discussion: `push` command names

We currently have the following commands:

push              push a bindle and all its parcels to the server
push-file         push an arbitrary file as a parcel to the server
push-invoice      push an invoice file to the server

The push name is so general it makes it easy to mistake for the other commands. I wonder if it is worth renaming it, e.g. to push-id or push-bindle. Perhaps push-file might be clearer as push-parcel too?

Alternatively we could keep push but make it cleverer:

If the argument looks like a bindle id, and that bindle id exists locally, push the bindle
If the argument is named invoice.toml, push it as an invoice
Otherwise, push it as a parcel

This adds complexity and possible unpredictability though!

We could also restructure the push commands as subcommands a la bindle push invoice -f invoice.toml. This would allow a user to type bindle push and get help text specifically on the various push options.

What do folks think? Would any of these be improvements or is what we have fine?

Add standard labels or groups for `README` and `LICENSE` files

We know that tooling will need to be able to find licenses, and we suspect that tooling will benefit from having a standard place for a top-level document. So it has been proposed that we add a standard way of labeling README and LICENSE files.

See deislabs/hippo-cli#21

Not all `bindle` commands require globally required Bindle URL

Not all of the bindle commands require a Bindle server URL. However, it appears that Clap requires that either the env var is set or the flag is explicitly passed.

ACTION REQUIRED: Microsoft needs this repository to complete compliance info

There are open compliance tasks that need to be reviewed for your bindle repo.

Action required: 3 compliance tasks

To bring this repository to the standard required for 2021, we require Microsoft administrators of this GitHub repository to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your GitHub repo.

Microsoft repo admins: Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/deislabs/repos/bindle/compliance

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

Better concurrent streaming

What are ways that we can improve the concurrency by which Bindle clients can fetch parcels?

In the original proposal we had an HTTP/3 method mapped out where a single session could be used to fetch the invoice, then fetch all of the necessary parcels in parallel. I suspect this could be at least partially done in HTTP/2.

Improve server errors

For example, while testing Create Invoice:

I screwed up serialisation and had it sending [parcels] instead of [parcel]. Got a 400 and no hint as to why.
I forgot that a successful test would create the bindle, and so was at a loss when running the test again gave me a 400.

In both cases, a clear error message would have helped.

Also, in the second case, consider 409 Conflict instead of 400 Bad Request. It was a perfectly good request, just the resource already existed and couldn't be overwritten.

(If Bindle is sending a clear error message and Axios is throwing it away on the client, then feel free to close this issue.)

Default Bindle server directory should be config_data()

The current default directory is /tmp. It should use XDG data, which is dir::config_data()

Do not use home_dir() for cache

We should be using the XDG cache directory instead of $HOME/.bindle/bindles for cache.

I think dirs::cache_dir() provides this.

In bin/cli/main.rs:

let bindle_dir = opts
    .bindle_dir
    .unwrap_or_else(|| dirs::home_dir().unwrap().join(".bindle/bindles"));

Implement filters in the CLI

The Bindle client should have a command that takes an invoice and some filters (groups and feature toggles) and gives back a list of matching parcels from that invoice.

Missing parcels schema not clear from protocol spec

The protocol spec for the missing parcels query says the result is "a list of label objects for missing parcels." The result is actually a TOML map with a single key missing whose value is an array of labels.

(I will send a PR for this, just don't have a working copy in a suitable state right now.)

Need to check parcel length on the server (Client hangs when parcel size is wrong)

I recently created a parcel like this:

[[parcel]]
[parcel.label]
name = "examples/mkbindle.rs"
mediaType = "text/plain"
size = 1483
sha256 = "9f12ae3891d31fa9ec13089d6c9b8f46d1900c88a51a64f582f669fd4ec8a5ff"
[parcel.label.feature.wagi]
file = "true"
[parcel.conditions]
memberOf = ["files"]

But I had transposed the last two digits of size. It should have been 1438.

When the Bindle server constructed the HTTP response, it set Content-Length: 1483, and then sent 1438 bytes.

That little typo broke the Bindle client, which listens until it has reached exactly the byte count from Content-Length. So the client will just hang with no error.

So probably we should at least fix this in the Bindle client. On the server, we should (a) check the content length of the parcel before sending or (b) verify the content length of the parcel on reception. These are not mutually exclusive; we could do both.
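Either server-side check comes down to comparing the declared size with the bytes actually on hand. A minimal sketch, with the LengthMismatch type and check_parcel_length function invented for illustration:

```rust
#[derive(Debug, PartialEq)]
struct LengthMismatch {
    declared: u64,
    actual: u64,
}

/// Compares the size declared in a parcel label with the length of the
/// data itself, returning both numbers on mismatch so the error is useful.
fn check_parcel_length(declared: u64, data: &[u8]) -> Result<(), LengthMismatch> {
    let actual = data.len() as u64;
    if actual == declared {
        Ok(())
    } else {
        Err(LengthMismatch { declared, actual })
    }
}
```

In the scenario above, a label declaring 1483 bytes for a 1438-byte parcel would fail this check up front, instead of producing a response whose Content-Length the client waits on forever.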

Server query doesn't seem to follow spec

The demo server has a bindle called enterprise.com/warpcore but if you query the demo server for warpcore it returns 0 results.

I thought this might be an old build but I cloned latest and seeded it with several bindles including my/fancy/bindle and your/fancy/bindle but querying for fancy still gave 0 results.

My reading of https://github.com/deislabs/bindle/blob/master/docs/protocol-spec.md#strict-mode is that substrings should match even in strict mode. I don't know whether the issue is with the spec or the implementation though!