facebook / akd Goto Github PK

View Code? Open in Web Editor NEW

219.0 219.0 31.0 3.33 MB

An implementation of an auditable key directory

License: Apache License 2.0

Rust 100.00%

akd's People

Contributors

Stargazers

Watchers

akd's Issues

Documentation

Adding documentation for all API calls.

Need error handling for inserting duplicate element into azks

Right now, if you insert an element that is already a member of azks, we get a HistoryTreeNodeErr(DirectionIsNone) error. Perhaps it would be best to clarify that it is a duplicate insertion error?

We can also add a test for this error.

Improve debugging by adding error contexts

We have errors that only include the error string in the error. When these errors surface up, they don't have any context on where and why they occurred (e.g., a MySQL error). We should improve these by adding

More context to the errors, e.g., GetData error in batch_get should mention that the operation was batch get.
Stack trace info. --> Since the stack trace leading to the error is cleared up when the error is caught, we lose this info.
Line numbers and file names when the error occurred.

Move UserData to storage layer

Currently the SeemlessDirectory struct stores a hashmap for UserData. This needs to be moved to the storage layer.

Problem with running more than one azks batch updates

At the end of two runs of azks.batch_insert_leaves(insertion_set.clone()); on different insertion sets, the root node does not seem to update properly. This results in the append-only proofs failing.

Upgrade winter_* to 0.2.0

winter_crypto and associated libs (_math, _utils) seem to have released a new version, with better handling of bytes and big namespace cleanups. It'd be worthwhile to update to the latest version for these improvements

Call multiple lookup operations async

To improve the efficiency of bulk/multi lookups, we could call lookups async and await on the results of all lookups -- rather than awaiting on them individually.

Simplifying code

The code in history_tree_node.rs and append_only_zks.rs can be further simplified. A few suggestions:

Remove any remaining instances of unwrap().
Look for repeated code among conditionals and abstract it out into a separate function.

Storage contract for values between epoch range

With adding the limited_key_history call in directory.rs we now need the ability to select a series of user values given a minimum epoch (start_epoch). This should be a new call on the storage contract which doesn't exist today.

For usage, see directory.rs:L366-368.

AKD client verification crate with limited dependencies

Due to the high likelihood that client's will verify the proofs with limited resources, it would be helpful to have a separate crate with as few dependencies as possible for verification of the proofs provided from the AKD. This would then simply be a dependency of the main AKD crate or could be maintained independently.

Adding VRF-based computation of labels

We want to use a VRF to compute labels in seemless_directory.rs and then verify these VRFs in lookup_verify, key_history_verify etc.

See: https://crates.io/crates/vrf for one option. Below is a starting checklist of things to consider:

We will need to convert the output of the VRF into NodeLabel type. We may want to change the NodeLabel type as a result, or implement a trait instead.
VRF keys need to be stored in some sort of config file, so that the server can generate VRF values.
In the Java implementation of SEEMless, we used BouncyCastle, and at the time, we had to hash to the curve from strings/bytearrays. We need to hash to the correct space in this case as well.
If we hash to a curve and later hash the output value, we should be careful about the security guarantees offered by the various parameter sizes. Would be helpful to add documentation for this.

MySQL error `HistoryTreeNode`s `batch_get`

For big-sized trees, mysql_tests are failing in batch_get due to errors similar to

ERROR HY000 (1366): Incorrect integer value: '??%????\u{3}?\\??N??\u{4}a??H??~~S??!DU9D' for column 'label_val' at row 1

Simplifying the functions in `history_tree_node.rs` and using mutation.

Many of the functions in history_tree_node.rs as overly complicated to avoid passing mutable references. We need to restructure and simplify code to avoid this and for it to be easier to build on.

Proof structs should have some properties merged into tuples

In MembershipProof, the sibling hashes and sibling labels are a 1-1 mapping and therefore should be a Vec<(NodeLabel, Digest)> rather than 2 separate properties.

Same thing in a NonMembershipProof with longest_prefix_children*, they can be merged into a single property of a tuple.

This syntax is warranted as it's already used in AppendOnlyProof with the inserted and unchanged nodes.

[long-running] Changing the storage contract requires 2x publishes to crates.io

The problem

Due to the change in the storage contract, there are delays in the time between publishing akd to crates.io and the new version being available to cargo to build akd_mysql.

This happens because when breaking the contract between akd and akd_mysql, akd_mysql will rely on the latest version of akd which doesn't exist in crates.io until it's published. For example, if both crates are at v1.0.0, and we change the contract, and publish v2.0.0, akd_mysql depends on akdv2.0.0 which was _just_published. There appears to be a race in crates.io that the package isn't instantly available after publish, so the cargo publish --dry-run balks saying akdv2.0.0 can't be found but since we don't have a multi-publish, akdv2.0.0 is published and cannot be reverted. Therefore we're in a "broken" state where half of the crates are published.

Current workaround

The current workaround is we need to publish akd with the update, akd_mysql will fail, and we version bump both and publish again. The 2nd time crates.io has updated and the index is updated, akd_mysql can build and publish.

See PRs tagged with publish_race to identify instances where this is needed.

Reason to fix

This is ugly and just clutters crates.io with partial versions of akd with sometimes matching akd_mysql crates.

Batching reads and writes

We'd like to batch reads and writes so that there is a balance between the amount of required serialization and the amount of cache used to store deserialized structs in memory.

Fix the #[cfg(test)] hackiness with storage.rs

I ran into an issue with running tests in parallel over shared state. This needs to be fixed and the state_map field of HistoryTreeNode removed.

Integrating the implemented AIR

Reduce the number of things that are "pub"

Right now, there are way too many functions / structs that are pub, which I think do not need to be. We should reduce this as much as possible, preferring pub(crate).

Creating commits for values

At the moment the values being committed in the tree for prototyping are all dummies. We would like to replace the value_to_bytes function in directory.rs to return actual commitments instead. This involves 3 main steps:

Changing the function value_to_bytes to return a byte array derived from a digest instead.
To make the commitment hiding, generating and storing a nonce for each committed value.
Including the nonce in the proofs sent to the client, and having the client verify these nonces.

Optimize for size-bounded cache

Currently we have a timed cache with unbounded space. Lifetime of the items in the cache is 30 seconds.

With the growing tree size and increasing number of requested operations (e.g., lookups), the size of the cache may become a limiting factor for how many operations we can handle. In that case, we will need to manage the cache more effectively.

To this end, some initial ideas are:

Varying expiration times for nodes depending on their use frequency.
Prioritize some (e.g., most used?) nodes during batch_get operations.

cc @afterdusk

Option to initialize AKD without write ability

Currently, when AKD is initialized, it will fetch the Azks record in storage. If it is unable to find the record, it will attempt to write the root node (and its epoch 0 state) along with the Azks record.

In a distributed deployment with dedicated readers and writers, we do not want the readers to perform this write on startup. Instead, they should fail on being unable to retrieve the Azks record.

Creating an AIR for the append-only proof verification

We need to write an AIR that verifies an Append-only proof, so that the client doesn't actually need to receive nodes or rely only an auditor.

Creating and storing persistent VRF public and secret keys.

Now that we have a PR implementing the VRFs, we should find a way to securely generate, store and access the VRF secret key and to publish a VRF public key.

The vrf crate could probably support key generation but still it needs to be stored somewhere.

Cleanup errors in the library

There's a lot of error codes in the errors.rs library which are incorrectly utilized or not utilized at all. These codes should be cleaned up to only what's necessary and not duplicated in multiple structures.

Differentiate between private and non-private mode

Some applications may be amenable to a lower privacy level, such as certificate transparency. For these examples, we do not need a VRF and can thus be more efficient. We should allow both private and non-private modes for this crate. Perhaps using features.

Remove calls to `unwrap()`

We need to strip the repo of any panicking calls, such as unwrap. This may also include changing return types for various messages.

Profiling for optimizations

Profiling using a Rust profiler, such as those documented: https://nnethercote.github.io/perf-book/profiling.html

Add examples for using `seemless_directory` APIs

Add examples for how to use SEEMlessDirectory's various operations in the examples directory.

Make caching behaviour for `set_state_map` and `get_state_map`

Move the tree_repr_set and tree_repr_get to Storable trait to cache all Storables at any time.

Make set_node_child_without_hash return nothing

Right now, when trying to migrate over state maps to use Storage, we are running into an issue that I think might be caused by this line:

https://github.com/novifinancial/SEEMless/blob/main/src/history_tree_node.rs#L173-L175

Here, we are potentially copying a child, implicitly cloning its state map. This gets problematic when we move to global storage. This could be solved by ensuring that set_node_child_without_hash doesn't return anything, and we find a better way to structure the code to achieve the same thing (without having to clone a child).

Serialized storage

We would like for the Storage to always get string values and get string values. The specific implementation of SEEMless could then serialize and deserialize accordingly.

Create an enum that encompasses all the types of data a SEEMless directory may ever need to store.
Implement serialization and deserialization for said enum. This would also include writing a wrapper around the function which gets keys, so that the location in the statemap is deterministic based on the type of object being written.

Fixing membership and non-membership verifications

Currently, the membership and non-membership proof generation and verification both include membership proof operations as sub-routines. However, this is never actually used in the non-membership verification (see: https://github.com/novifinancial/akd/blob/b0a10e8e9cd17199a890686b36696f5da8f9d820/akd/src/client.rs#L82). This needs to be fixed and will likely require fixing get_membership_proof_and_node in append_only_zks.rs.

Bulk lookup proofs

In production, under high-load we might need the ability to perform "bulk" lookup proofs. This means that we should do the same breadth-first-search for a collection of users as is done under the Directory::publish operation (as is done in append_only_zks.rs:191).

This means, when doing a lookup we should calculate all the prefixes and pre-load (via BFS walk of batch-retrievals) all the nodes along the path into the cache. Then we can generate the proofs without doing many individual retrievals.

Eliminate dummy children nodes

Writes can be made more efficient if we don't write dummy nodes -- seemingly this won't affect functionality.

Verification failure with small trees

Lookup and key history verifications encounter failures with small trees. This manifests with the Blake3 Hasher as well as with Sha3_256.

Repros:

Integrate implementation for demo purposes

Remove cargo bench from CI

Cargo bench was added in 51b756c, however it is no longer in use. We should remove it from CI workflows.

Adding benchmarks

Once #17 is closed, we should write benchmarks for its functions and various component parts so we have some sort of profile on what, if anything, needs to be optimised further and how this compares to the Java implementation in the paper.

AkdKey and Values should be standardized

The types AkdKey and Values should be standardized in their naming, definition and usage. They're a little hazard at the moment. (one is Akd... the other is just ..., etc).

Additionally we should migrate to binary (&[u8]) instead of strings to save on storage sizing for at least the value, if not the key also. All our operations are done on the binary values, so it's just for readability that it's strings. This will induce data-layer changes in the akd_mysql crate as well, as we'll be changing data-layer types.

Put state_map into storage and only read needed parts

Putting state_map into storage so that large streams of data do not need to be read when reading a node.

Add serialization/deserialization for all public-facing structs

Every pub struct should implement Serialize and Deserialize from serde.

At the very least, in https://github.com/novifinancial/akd/blob/main/akd/src/proof_structs.rs, all of these structs should also #[derive(Serialize, Deserialize)].

Add History Proof validation to the "lean" client

The akd_client crate should also support the other validation functions that are in the base akd client.

Initially we should start with the history proof, and eventually we'll add audit too.

Batch load `get_user_state`

In batch_lookup we can batch get_user_state as well.

Append-only proof failing when root has single or no child.

The append only proof causes stack overflow in case we start from a point where the root had no children, for example.

Handle absence of ValueState

In a production system, data in storage is pruned to manage storage size and adhere to privacy commitments. For AKD, the only prunable record is ValueState, which contains the data payload (key material for E2EE messaging).

Today, operations fail when ValueState cannot be retrieved. We need to modify the library to handle the scenario gracefully and propagate the information to the caller.

Invalidate cache on new Azks epoch

In a distributed deployment, the cache may contain stale records. We want the cache to be invalidated when a new Azks record with a higher epoch number is detected, so that stale HistoryTreeNodes are not used to generate proofs. This involves remembering the epoch of the Azks record in the cache, and at TTL expiration checking it against the newly fetched Azks record to see if the epoch number changed. If it has, the entire cache should be dumped.

Unbounded ```HistoryTreeNode``` field

HistoryTreeNode has an epochs field that is used to identify the epochs which an update took place.
As it stands now, the field can grow to an unbounded size. This will pose an issue when the corresponding DB record hits either soft or hard limits imposed by the DB schema or architecture.

ESTIMATED LIMITS

Epochs are u64 = 8 bytes.

For mysql, with the current VARBINARY(2000) definition, 2KB / 8B = ~ 250 values (we could always raise the max value in the definition, this is just for illustration)
For some alternate key-value db (KVDB) with say 100KB recommended soft limit for record size, ignoring other overhead, 100KB / 8B = ~ 12500 values

SOLUTION?

The associated epochs for a HistoryTreeNode can be obtained directly from the HistoryNodeState table given the node label, since each update corresponds to a HistoryNodeState.

For mysql, this can be done through one call with a join, or a separate, concurrent call to the DB.
For the alternate KVDB, one call is also sufficient. Instead of a "get" operation to fetch a singular HistoryTreeNode record, we can use a "scan" operation to fetch both the HistoryTreeNode record AND associated HistoryNodeState records at one go, provided we have a good sharding setup.

Accumulating the data from separate tables will incur some computation time on the database server and we will need to profile and assess the impact. However, I think this is necessary to avoid the unbounded record size issue. Are there any other potential solutions I've missed?

Completing the functions in `seemless_directory.rs`

Functions to be implemented and tested:

AKD Quorum distributed change auditor

Overview

To prevent a split-view attack the root hash of every epoch needs to have a signature signed by a private key which cannot be leaked from the quorum (via shared-secret methodology). This quorum participates to independently validate the append-only proof of the key directory and each participant provides their partial shard of the quorum signing key and when enough participants agree, the changes are signed off on and stored in persistent storage.

That way a proof only needs to give the root hash, and the signature on it to ascertain the quorum has agreed on the changes, and the AKD (or any other 3rd party) cannot generate its own signatures.

Eventually these auditors can be participants from external entities who can participate in the quorum vote.

Requirements

The Quorum Key (QK) is long-lived and securely managed such that individual (or a few) participants cannot generate their own signatures and invalidate the trust of the system
The quorum participants can evolve over time (i.e. removal in the face of failures, and additions for future collaborative growth or upgrades).
Communication channels are secure, such that 3rd parties cannot sniff the key shares over the wire and compromise the QK.
The public key of the QK is publicly distributed such that it's easy to client-side validate the validity of the epochs without additional I/O operations to a potentially malicious acting key directory.

Initial design overview

A Quorum Key is generated at the beginning of the life of the quorum, and the private key is broken into "shards" via Shamir secret sharing.
These shards are transmitted to the quorum participants who hold them in secure storage. The shards are generated with the following properties

There are 3 f + 1 nodes in the quorum
2 f + 1 nodes need to agree to reconstruct the signing key (quorum key)
The public key is given publicly to every client which will need to verify the signature

Communcation with the quorum

The collection of nodes in this quorum receive commands from an external 3rd party (key directory epoch notification or admin interface for example).
The messages they can receive are the following

New epoch publication - The AKD has published a new epoch and the quorum should validate the changes and generate a signature
Quorum member enrollment - The quorum should attempt to add the new member to the set. A series of independent tests will be
performed against the new member and if they pass, a new shard-set will be generated and the node enrolled
Quorum member removal - The quorum has a member which should be removed. If 2 f + 1 nodes agree, the Quorum Key is reconstructed and new shards are generated and
distributed to the remaining membership.

For any of these messages, whichever node receives the request is denoted as the leader. We do not need the full RAFT
protocol, since we have no need for a persistent leader and the nodes are effectively stateless in between operations.
The temporary request leader is responsible for communicating with the other quorum members and gathering
their votes, reconstructing the shards, and either signing the commitment or enrolling the new member (re-generating shards and transmitting them).

Inter-node messages:

Vote - args: (proof and hash information)
a. Reply: Success(shard) or Failure (errors)
Generate enrollment test - args: (member information, public key, etc)
a. Reply: A "test" of enrollment for a new quorum member, with the result temporarily stored in the local machine
Process enrollment test result - args: (member reply to specific proof)
a. Reply: If test successful, return the shard to the "leader". If not, return failure and don't disclose the shard
Remove member - args: (node_id)
a. Reply: Agree(shard) or Disagree (error)
Update stored shard - args: (new_shard, commitment_over_shard_signed_by_quorum_key)
a. We need to provide a commitment, so the membership doesn't replace shards with invalid information which is potentially a different quorum key and
corrupting the quorum

Retrieval of the latest commitment

Since the public key is available to anyone who cares, external parties can read directly from the storage layer to minimize I/O to the quorum, as they
will likely be resource constrained with validating the commitments between epochs and processing membership changes.

NOTE: If the storage layer is mutated directly, then the quorum will start failing commitments and signatures will not match so the system fault will be detectable.