bbva / qed

The scalable, auditable and high-performance tamper-evident log project

License: Apache License 2.0

Go 92.75% Shell 2.32% HCL 2.28% Python 0.13% C++ 0.68% Dockerfile 0.20% C 1.65%

merkle-tree sparse-merkle-tree cryptography forensics latin byzantine-failures tamper-evident verifiable-data-structures verify lsm-tree

qed's Introduction

QED - Scalable, auditable and high-performance tamper-evident log

QED is open-source software that allows you to establish trust relationships by leveraging verifiable cryptographic proofs.

It can be used in multiple scenarios:

Data transfers.
System (or application or business) logging.
Distributed business transactions.
Etc.

QED guarantees that the system itself, even when deployed on a non-trusted server, cannot be modified without being detected. It also provides verifiable cryptographic proofs whose generation time and size grow logarithmically with the number of entries.

QED is scalable, resilient and ops friendly:

Designed to manage billions of events per shard
Over 2000 operations per second per shard under sustained load
Consistent replication through RAFT
Operable and instrumented with dozens of metrics
Zero config files, fully documented single binary

Documentation

You can find the complete documentation at: Documentation

Project code

You can find the project code at Github

Authors

QED was made by Hyperscale BBVA-Labs Team.

License

QED is Open Source and available under the Apache 2 license.

Contributions

Contributions are very welcome. See docs/source/contributing/contributing.rst or skim existing tickets to see where you could help out.

qed's Issues

docs/usage.md is outdated

The documentation needs to be updated to match the current state of the commands.

Implement internal statistics using expvar package

The current approach to saving and analyzing statistics about the trees is naive and badly integrated into the application. A better approach seems to be the expvar package, which ships with the Go standard library and is already used by Badger. See https://golang.org/pkg/expvar/ for the current documentation.
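A minimal sketch of what expvar-based tree statistics could look like. The map name qed_tree_stats, the counter suffix _adds and the helper functions are assumptions made for illustration; they are not taken from the QED code base.

```go
package main

import (
	"expvar"
	"fmt"
)

// treeStats is a hypothetical published map holding one counter per tree.
// expvar.NewMap registers it under the given name and panics on reuse,
// so it is created exactly once at package level.
var treeStats = expvar.NewMap("qed_tree_stats")

// recordAdd increments the insertion counter of the named tree.
func recordAdd(tree string) {
	treeStats.Add(tree+"_adds", 1)
}

// addCount reads a counter back; it returns 0 if the counter was never set.
func addCount(tree string) int64 {
	v, ok := treeStats.Get(tree + "_adds").(*expvar.Int)
	if !ok {
		return 0
	}
	return v.Value()
}

func main() {
	recordAdd("hyper")
	recordAdd("hyper")
	recordAdd("history")
	// String renders the map as JSON, the same form served at /debug/vars.
	fmt.Println(treeStats.String())
}
```

Importing expvar also registers a /debug/vars handler on http.DefaultServeMux, so the counters become visible as soon as the process serves that mux.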

Feature: bulk inserts

Tasks:

Improve godoc contents

Unify cli commands

We should use a single base command for every interaction with the server. Therefore, the server command should be integrated into qed.

The sub-command hierarchy would be:

qed
  |-- start
  |-- stop
  |-- client
  |     |-- add
  |     |-- membership
  |-- auditor

The start sub-command will include flags to enable the profiling and tampering endpoints, and another to launch the server in the background (-d):

qed start -d --profiling --tampering
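The proposed hierarchy could be dispatched roughly as follows. This sketch uses only the standard flag package so that it stays self-contained; the real CLI is built on Cobra, and the names startOptions, parseStart and dispatch are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// startOptions holds the flags proposed for "qed start".
type startOptions struct {
	Daemon    bool
	Profiling bool
	Tampering bool
}

// parseStart parses the arguments that follow the "start" sub-command.
func parseStart(args []string) (startOptions, error) {
	var o startOptions
	fs := flag.NewFlagSet("start", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // the caller decides how to report errors
	fs.BoolVar(&o.Daemon, "d", false, "launch the server in the background")
	fs.BoolVar(&o.Profiling, "profiling", false, "enable profiling endpoints")
	fs.BoolVar(&o.Tampering, "tampering", false, "enable tampering endpoints")
	err := fs.Parse(args)
	return o, err
}

// dispatch routes the first argument to its sub-command and returns a
// description of what would be executed.
func dispatch(args []string) (string, error) {
	if len(args) == 0 {
		return "", errors.New("missing sub-command")
	}
	switch args[0] {
	case "start":
		o, err := parseStart(args[1:])
		if err != nil {
			return "", err
		}
		return fmt.Sprintf("start daemon=%t profiling=%t tampering=%t",
			o.Daemon, o.Profiling, o.Tampering), nil
	case "stop", "client", "auditor":
		return args[0], nil
	default:
		return "", fmt.Errorf("unknown sub-command %q", args[0])
	}
}

func main() {
	out, err := dispatch(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```

The flag package treats one and two leading dashes as equivalent, so both -d and --profiling parse as written in the example command.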

Deduplicate HTTP sanitization in API handlers

Reuse https://github.com/BBVA/qed/blob/master/tests/riot.go#L300 in api package

Documentation contributions

Propose a way to contribute to the documentation outside the build cycle.

QED client library topology discovery timeout

The configuration parameter DiscoveryTimeout is not used inside client.discover(). When creating a client with the default options, changing that timeout and setting an incorrect endpoint, the client creation hangs on the discover operation instead of timing out.

Investigate performance drop

It seems the latest Go version gives us at least a 10% performance drop in unit benchmarks, which is amplified in e2e benchmarks.

We need to read the changes from one version to the next. We also need to identify which parts of our code are affected and analyze whether we can change anything to mitigate the effect.

gRPC protocol support

We should support gRPC, as seamlessly as possible.

Simplify Position interface

We should reduce the number of exposed methods included in Position abstraction because it should be designed to be used outside the trees. For instance, methods like ShouldBeCached, Key or Height could be internal to each tree implementation.

Pruners should return errors

We are swallowing some errors, or failing with panics, when the Prune() methods are executed to build the pruned trees.

Prune() common.Visitable

We should change the method signature to return some errors that should be handled at upper layers.

Prune() (common.Visitable, error)

Add Riot support for a list of QED cluster IP addresses

We need to improve riot by adding support for multiple IP addresses.

Implement a server component

We should bind the start and stop operations of all HTTP endpoints together under a single struct. Also, this component should be able to react to SIGTERM signals and do a graceful shutdown.

Remove Expvars

if possible

Implement tests for FIFOReadThroughCache

Design: Start() and Stop()/Shutdown()

I think any component with Start() and a Stop() methods should make those methods non-blocking. We would need to block only those without the Stop() method.

The start should create a goroutine and the stop should end it. Without any leaking, and without the user of the API knowing the internals used to do that.

Thoughts?

Remove index table

We are using the index table to map from event hashes to history tree versions, but that responsibility should belong exclusively to the hyper tree, given that it now stores the raw version in the shortcut leaves.

In this manner, we could eliminate the need for another table to support fast mappings. With this change, every membership operation must first query the hyper tree before generating the audit path from the history tree, and thus incurs a latency penalty. However, given that the hyper tree is the only one that holds a lock for queries, in theory it shouldn't reduce balloon's throughput.

This change helps to reduce space and write amplification in storage.

Implement incremental proofs

Now that we have completed a draft implementation of the membership query and its verification process, we are ready to undertake the generation of incremental proofs in order to verify the temporal consistency of a sequential flow of events.

The history tree is in charge of generating incremental proofs P between commitments Ci and Cj, where i <= j:

P <- H.IncGen(Ci, Cj).

Once the client has received the proof, it should be able to verify that Cj fixes every event fixed by the recomputed C'i (where i <= j):

{accept, reject} <- P.IncVerify(C'i, Cj)

Tasks:

E2E tests
Extend HTTP API
Implement query functionality in Balloon
Implement verify functionality in Balloon
Extend client to include incremental query
Extend client to include verification

Unable to Query Membership concurrently

When we try to query membership concurrently, we receive the following error:

2018/10/03 17:42:50 http: panic serving [::1]:59546: d.nx != 0
goroutine 400165 [running]:
net/http.(*conn).serve.func1(0xc022b49720)
/usr/local/go/src/net/http/server.go:1746 +0xd0
panic(0x8edd00, 0xa4e670)
/usr/local/go/src/runtime/panic.go:513 +0x1b9
crypto/sha256.(*digest).checkSum(0xc00fcc37c8, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/crypto/sha256/sha256.go:253 +0x1db
crypto/sha256.(*digest).Sum(0xc0000a6200, 0x0, 0x0, 0x0, 0xb, 0x0, 0x0)
/usr/local/go/src/crypto/sha256/sha256.go:229 +0x69
github.com/bbva/qed/hashing.(*Sha256Hasher).Do(0xc000066470, 0xc011b5e6a0, 0x1, 0x1, 0xc00fc52908, 0x0, 0x0)
/home/spark/go/src/github.com/bbva/qed/hashing/hash.go:74 +0xb5
github.com/bbva/qed/balloon.Balloon.QueryMembership(0x186a0, 0x9e4ba0, 0x7f807c8f7360, 0xc005510048, 0xc0000af500, 0xc00551c1c0, 0xa55e20, 0xc000066470, 0xc01158ec50, 0xb, ...)
/home/spark/go/src/github.com/bbva/qed/balloon/balloon.go:222 +0xe1
github.com/bbva/qed/raftwal.BalloonFSM.QueryMembership(0x9e4ba0, 0xa59cc0, 0xc005510048, 0xc0000af740, 0xc01c1a6300, 0x0, 0x0, 0x0, 0xc01158ec50, 0xb, ...)
/home/spark/go/src/github.com/bbva/qed/raftwal/fsm.go:89 +0x87
github.com/bbva/qed/raftwal.RaftBalloon.QueryMembership(0x7ffca83d569c, 0x11, 0x9ba7b3, 0x5, 0xc0000240e2, 0x9, 0xc000116000, 0xc000158400, 0xc00553a1b0, 0xc00000c940, ...)
/home/spark/go/src/github.com/bbva/qed/raftwal/raft.go:408 +0x87
github.com/bbva/qed/api/apihttp.Membership.func1(0xa55de0, 0xc011b5e660, 0xc0170c6500)
/home/spark/go/src/github.com/bbva/qed/api/apihttp/apihttp.go:167 +0x1e1
net/http.HandlerFunc.ServeHTTP(0xc00000c680, 0xa55de0, 0xc011b5e660, 0xc0170c6500)
/usr/local/go/src/net/http/server.go:1964 +0x44
github.com/bbva/qed/api/apihttp.AuthHandlerMiddleware.func1(0xa55de0, 0xc011b5e660, 0xc0170c6500)
/home/spark/go/src/github.com/bbva/qed/api/apihttp/apihttp.go:246 +0xc5
net/http.HandlerFunc.ServeHTTP(0xc0000664d0, 0xa55de0, 0xc011b5e660, 0xc0170c6500)
/usr/local/go/src/net/http/server.go:1964 +0x44
net/http.(*ServeMux).ServeHTTP(0xc005531740, 0xa55de0, 0xc011b5e660, 0xc0170c6500)
/usr/local/go/src/net/http/server.go:2361 +0x127
github.com/bbva/qed/api/apihttp.LogHandler.func1(0xa56b20, 0xc01c6b7960, 0xc0170c6500)
/home/spark/go/src/github.com/bbva/qed/api/apihttp/apihttp.go:290 +0xda
net/http.HandlerFunc.ServeHTTP(0xc00000c6c0, 0xa56b20, 0xc01c6b7960, 0xc0170c6500)
/usr/local/go/src/net/http/server.go:1964 +0x44
net/http.serverHandler.ServeHTTP(0xc000132a90, 0xa56b20, 0xc01c6b7960, 0xc0170c6500)
/usr/local/go/src/net/http/server.go:2741 +0xab
net/http.(*conn).serve(0xc022b49720, 0xa571a0, 0xc025696f00)
/usr/local/go/src/net/http/server.go:1847 +0x646
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2851 +0x2f5

It seems to be related to the fact that we are sharing the same hasher across all clients.

Refactor: public API ingestion should return a 204 empty response

Instead of waiting for the response, we should return 204.
This will hinder our tests/e2e.

Pointer receivers clean up

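We should use pointer receivers (with a *) only when the method modifies the data structure; otherwise, use value receivers.

import "fmt"

type A struct {
    value int
}

// Add modifies the receiver, so it takes a pointer.
func (a *A) Add(x int) { a.value += x }
// Show only reads the receiver, so it takes a value.
func (a A) Show() { fmt.Println(a.value) }

This way we make sure we do not modify our structures when we don't intend to.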

Implement tests for CollectMutationsVisitor

Centralize bash scripts in `/scripts` directory

All the scripts currently live in the /tests directory. Since we use them as the canonical way to launch environments, sometimes outside the scope of tests (QA, performance...), I believe it would be worthwhile to give them their own directory.

HyperTree: bug adding nodes.

Given the following test in "hyper/tree_test.go":

func TestAdd(t *testing.T) {

	testCases := []struct {
		eventDigest      []byte
		expectedRootHash []byte
	}{
		{[]byte{0x0}, []byte{0x0}},
		{[]byte{0x1}, []byte{0x1}},
		{[]byte{0x2}, []byte{0x3}},
		{[]byte{0x3}, []byte{0x0}},
		{[]byte{0x4}, []byte{0x4}},
		{[]byte{0x5}, []byte{0x1}},
		{[]byte{0x6}, []byte{0x7}},
		{[]byte{0x7}, []byte{0x0}},
		{[]byte{0x8}, []byte{0x8}},
		{[]byte{0x9}, []byte{0x1}},
	}

	hasher := new(hashing.XorHasher)

	leaves, close := openBPlusStorage()
	defer close()
	cache := cache.NewSimpleCache(2)
	tree := NewFakeTree(string(0x0), cache, leaves, hasher)

	for i, c := range testCases {
		index := make([]byte, 8)
		binary.LittleEndian.PutUint64(index, uint64(i))

		rh, err := tree.Add(c.eventDigest, index)
		assert.Nil(t, err, "Error adding to the tree: %v", err)
		assert.Equal(t, c.expectedRootHash, rh, "Incorrect root hash for index %d", i)
	}

}

Printing the insertions up to test case 2 gives:

;;;;
: [0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0]
;;;;
: [0 0 0 0 0 0 0 0 0] [1 0 0 0 0 0 0 0]
: [1 0 0 0 0 0 0 0 0] [1 0 0 0 0 0 0 0]
;;;;
: [0 0 0 0 0 0 0 0 0] [2 0 0 0 0 0 0 0]
: [1 0 0 0 0 0 0 0 0] [2 0 0 0 0 0 0 0]
: [2 0 0 0 0 0 0 0 0] [2 0 0 0 0 0 0 0]

It turns out that when inserting a new index into the hyper tree, the function "tree.fromStorage" returns the same value for all existing leaves.

riot client is not resilient to QED leader restarts

The riot client library must try to detect topology changes when a QED leader dies.

Unify randomBytes function in testutils

Currently, we have randomBytes function in every test file where it's necessary. We should unify them in a single function placed under testutils/rand package.
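A minimal sketch of what the shared helper could look like. It uses crypto/rand and panics on failure, which is an assumption about acceptable behavior in test code; in the proposed layout it would live in the testutils/rand package rather than in package main.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// randomBytes returns n bytes read from the system's secure random source.
// It panics on failure, which keeps call sites in tests short.
func randomBytes(n int) []byte {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return b
}

func main() {
	key := randomBytes(32)
	fmt.Printf("%d random bytes: %x\n", len(key), key)
}
```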

Implement snapshot sign process

We need to sign the snapshots that are being published, using PKI mechanisms.

Benchmark: Raft behavior while testing Membership() throughput

We need to measure the system write/query performance when using raft replication, and querying only 1 follower (instead of both).

We should consider the following scenarios:

Multi node:
One Leader & two followers{1..2}
Start the cluster
Preload N events to the leader
Query Membership of the N events to follower 1
Perform continuous write load on the leader
At the same time, perform Query Membership of the previous N events ONLY to follower 1 (notice that follower 2 is query-load free)

QED input parameter check

When setting up URLs from the command line, we need to make sure they are well-formed URLs instead of passing arbitrary strings up to the library APIs.

This is quite annoying when we are required to use http:// in an endpoint definition instead of hostname:port, and the program does not fail whatever you put in.

We must check all parameters properly.
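A sketch of such a check with net/url. The function name validateEndpoint and the decision to accept only http and https are assumptions for the example.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateEndpoint rejects anything that is not an absolute http(s) URL
// with a host, so malformed values fail before reaching the client library.
func validateEndpoint(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("invalid endpoint %q: %w", raw, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("invalid endpoint %q: scheme must be http or https", raw)
	}
	if u.Hostname() == "" {
		return nil, fmt.Errorf("invalid endpoint %q: missing host", raw)
	}
	return u, nil
}

func main() {
	for _, raw := range []string{"http://localhost:8080", "localhost:8080", "random string"} {
		if _, err := validateEndpoint(raw); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", raw)
	}
}
```

Note that url.Parse alone accepts "localhost:8080" (it reads "localhost" as the scheme), which is why the explicit scheme check is needed.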

Separate integration tests from unit tests

Identify and segregate tests by their purpose: unit, integration, etc. For instance, the balloon_test.go file currently contains both unit and integration tests. The latter should be moved to another package under the tests folder.

Improve cli error handling

I've noticed that whenever the CLI returns an error, Cobra always prints the usage message, even when we pass the right parameters.

How to reproduce:

$ go run ../main.go -k pepe client membership --key 0 --version 1 -l info

QedClient: 2018/10/11 11:00:05.575712 /home/spark/go/src/github.com/bbva/qed/cmd/client_membership.go:53: Querying key [ 0 ] with version [ 1 ]
Error: Unexpected server error
Usage:
  qed client membership [flags]

Flags:
  -h, --help                   help for membership
      --historyDigest string   Digest of the history tree
      --hyperDigest string     Digest of the hyper tree
      --key string             Key to query
      --verify                 Do verify received proof
      --version uint           Version to query

Global Flags:
  -k, --apikey string     Server api key
  -e, --endpoint string   Endpoint for REST requests on (host:port) (default "http://localhost:8080")
  -l, --log string        Choose between log levels: silent, error, info and debug (default "error")

exit status 255

We've seen a possible solution in this Cobra issue

Move event hashing out of balloon internals

Currently, inserted events get hashed in the Add method before inserting into both trees. This means that the raw event, whose size could be quite large compared with a 32B hash, is first stored in the WAL and then replicated to other nodes of the Raft cluster. We can avoid these unnecessary storage space and network traffic penalties by allowing Balloon to also accept event hashes instead of raw events in the Add method and hashing the events at the HTTP layer, before applying them to the WAL.

Change QueryMembership signature

Currently, QueryMembership method in the hyper tree looks like this:

func (t *HyperTree) QueryMembership(eventDigest hashing.Digest) (proof *QueryProof, err error)

When the event doesn't exist, it returns an ErrKeyNotFound. But this type of error shouldn't propagate up to the Balloon layer. It would be more convenient to change the hyper tree method's signature to hide this kind of error and return a bool that signals non-existence:

func (t *HyperTree) QueryMembership(eventDigest hashing.Digest) (proof *QueryProof, answer bool, err error)

With this change, we wouldn't need to check the length of the proof in Balloon in order to set the Exists flag:

if len(hyperProof.Value) > 0 {
	proof.Exists = true
	proof.ActualVersion = util.BytesAsUint64(hyperProof.Value)
}
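The proposed signature could behave as in the following sketch. The store type, the use of []byte instead of hashing.Digest and the empty proof returned on absence are simplifications assumed for the example; the real non-membership proof would still carry an audit path.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrKeyNotFound mirrors the storage-level error named in the issue.
var ErrKeyNotFound = errors.New("key not found")

// store is a hypothetical leaf storage that reports missing keys as errors.
type store map[string][]byte

func (s store) Get(key []byte) ([]byte, error) {
	v, ok := s[string(key)]
	if !ok {
		return nil, ErrKeyNotFound
	}
	return v, nil
}

// QueryProof is reduced to the single field the Balloon snippet inspects.
type QueryProof struct {
	Value []byte
}

// HyperTree is a stand-in holding only the leaf storage.
type HyperTree struct {
	leaves store
}

// QueryMembership reports absence through the bool and keeps err for real
// failures, so ErrKeyNotFound never reaches the caller.
func (t *HyperTree) QueryMembership(eventDigest []byte) (*QueryProof, bool, error) {
	value, err := t.leaves.Get(eventDigest)
	switch {
	case errors.Is(err, ErrKeyNotFound):
		return &QueryProof{}, false, nil
	case err != nil:
		return nil, false, err
	}
	return &QueryProof{Value: value}, true, nil
}

func main() {
	tree := &HyperTree{leaves: store{"digest-0": {0, 0, 0, 0, 0, 0, 0, 1}}}
	for _, d := range []string{"digest-0", "digest-1"} {
		proof, exists, err := tree.QueryMembership([]byte(d))
		fmt.Println(d, exists, proof.Value, err)
	}
}
```

With this shape, Balloon can set proof.Exists directly from the returned bool instead of inspecting the length of the value.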

Remove balloon/storage package

Remove the balloon/storage package and move its functionality to a balloon/storage.go file inside the balloon package.

We don't need a package there anymore, as the interface implementations don't need to be aware of it.

Improve component API design

We need to state clear boundaries between components and their relations. For example, in server/server.go we need to include almost all components separately, injecting dependencies instead of configuration.

We did this to ease the testing, enable the tampering, etc. But we can design this with the same functionality without exposing the balloon components if we introduce new constructors for specific needs, instead of exposing everything.

I also think this will improve our API and interface designs, adding only what's really needed and using composition to build complexity.

Please add places where we can fix this situation:

server/server.go --> clean up by simplifying balloon constructor and API

Create benchmark suite to test Membership() throughput

We need to measure the system's Membership throughput because it is a critical parameter for system operation: Membership performance is expected to be greater than write performance, and write throughput should not drop when both operations are performed simultaneously.

We should consider the following scenarios:

Single node:
Start a single node
Preload N events
Query Membership of the N events
Perform continuous write load on the leader
At the same time, perform Query Membership of the previous N events

Multi node:
One Leader & 2 followers{1..2}
Start the cluster
Preload N events to the leader
Query Membership of the N events to follower 1
Perform continuous write load on the leader
At the same time, perform Query Membership of the previous N events to all the followers
One Leader & 4 followers{1..4}
(Same as before)

Cloud benchmarking

Prepare and execute a benchmark plan on different clouds to test performance in different providers and VM flavors.

Create benchmark suite to test Incremental() throughput

Remove snapshot channel from the FSM

The process of sending newer snapshots to the snapshot channel (now named agentsQueue) after inserting the event into the balloon must be moved out of the critical path of the insertion operation.

Given that changes are applied to the FSM serially by a single thread, the queuing could stall insertions if the channel gets full. Snapshots should be sent to the channel after committing the command into the WAL, just after resolving the Apply future in the Raft node. This way, the goroutine that handles the HTTP request is responsible for sending the snapshots, freeing up the Raft applying thread.

Improve re-join cluster after restart

When QED runs in cluster mode, with all the nodes joined via Raft and a leader elected, if the current leader goes down and is brought up again, it is unable to rejoin the cluster.

We should improve our join process for both new clusters and clusters with an existing configuration.

Add persistence to hyper tree's cache

In order to improve durability and keep the hyper tree consistent under shutdown or failure scenarios, we need to implement a persistent storage of the cache on disk, so the next time that the server starts up, all of the previously cached data is still available.

Improve our test coverage by increasing the number of scenarios tested in each part of QED.
Document each test's objective and remove duplicate tests
Increase the quality of our fakes and their documentation
Increase the number and quality of acceptance tests
Test corner cases

adding gorelease in azure

After creating a GITHUB_TOKEN for the manual release we need to improve this workflow:

discover how to store secrets in azure-pipelines
generate a GITHUB_TOKEN from a bot, or our ORG to be independent of the users
create the task in the pipeline and only run when a tag is uploaded (and ensure commit and tag are uploaded synchronously)