
Flow Data Provisioning Service


The Flow Data Provisioning Service (DPS) aims to provide a scalable and efficient way to access the history of the Flow execution state, both for current live sporks and for past sporks.

The state of past sporks is indexed by reading an execution node's protocol state and state trie write-ahead log. Optionally, a root checkpoint can be provided to bootstrap the state from before the spork's start. In more specific terms, indexing a past spork requires a Badger key-value database containing the Flow protocol state of the spork and a LedgerWAL with all the trie updates that happened during the spork.

Indexing the live spork works similarly, but it reads the protocol state by acting as a consensus follower, and it reads the execution-related data from records written to a Google Cloud Storage bucket by an execution node.

The Flow DPS maintains multiple specialized indexes for different purposes. Unlike the execution node's state trie, the index for ledger registers allows random access to the execution state at any block height, which enables state retrieval at any point in history and overcomes the pruning limit seen on execution nodes.

Documentation

Binaries

Below are links to the individual documentation for the binaries within this repository.

APIs

The DPS API gives access to historical data at any given height.

There are also additional API layers that can be run on top of the DPS API:

Developer Documentation

Dependencies

Go v1.16 or higher is required to compile flow-dps. Only Linux amd64 builds are supported, because of the dependency on the flow-go/crypto package. Please also make sure that your GOPATH is exported in your environment, as it is needed to generate the DPS API.

If you want to make changes to the gRPC API, the following dependencies are required as well.

Once they are installed, you can run go generate ./... from the root of this repository to update the generated protobuf files.

In order to build the live binary, the following extra steps and dependencies are required:

Please note that the flow-go repository should be cloned inside the root folder of the DPS repository, keeping its default name, so that the Go module replace statement works as intended: replace github.com/onflow/flow-go/crypto => ./flow-go/crypto.

  • git clone git@github.com:onflow/flow-go.git
  • cd flow-go/crypto
  • git checkout c0afa789365eb7a22713ed76b8de1e3efaf3a70a
  • go generate

You can then verify that the installation of the flow-go crypto package was successful by running that package's tests.

Build

You can build every binary by running go build -tags=relic -o . ./... from the root of the repository.


Issues

Rosetta: initialize FVM on top of DPS state

As the basis for the Rosetta Data API, we need to make the Flow Virtual Machine (FVM) available on top of our indexing data store. This should be rather straightforward: we can inject our ledger interface as the FVM environment. This will allow us to retrieve resources from Flow accounts to look up balances.
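As a rough illustration, the injection could look something like the sketch below; the IndexReader interface, the Get signature, and the register key composition are assumptions for illustration, not the actual DPS or FVM API.

package index

// IndexReader is a hypothetical view of the DPS register index.
type IndexReader interface {
	Register(height uint64, key string) ([]byte, error)
}

// LedgerAtHeight pins all register reads to one block height, so
// the FVM resolves account resources against historical state.
type LedgerAtHeight struct {
	height uint64
	index  IndexReader
}

// Get satisfies a read-only ledger shape that an FVM environment
// could consume; writes are simply not supported.
func (l *LedgerAtHeight) Get(owner, controller, key string) ([]byte, error) {
	return l.index.Register(l.height, owner+"/"+controller+"/"+key)
}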

Storage: replace dictionary with main network version

The Zstandard dictionary currently in use was built using ~1k trie updates from the local network, which represent only account creations and token movements. On the main network, register values are a lot more varied.

In general, the training data should be about 100x the size of the dictionary itself. We can build a script that uses the default Zstandard dictionary size, and then keeps piping trie update values from the middle of the LedgerWAL into files until we reach 100x that size. This should allow us to build an optimal dictionary for compression on the main network.
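A minimal sketch of the training step, assuming the trie update values have already been collected; github.com/valyala/gozstd is one possible library choice, and the ~110 KiB default dictionary size is an approximation.

package dict

import "github.com/valyala/gozstd"

// defaultDictSize approximates Zstandard's default dictionary size.
const defaultDictSize = 112640 // ~110 KiB

// buildDict trains a dictionary once the collected samples reach
// roughly 100x the target dictionary size.
func buildDict(samples [][]byte) []byte {
	total := 0
	for _, sample := range samples {
		total += len(sample)
	}
	if total < 100*defaultDictSize {
		return nil // keep piping trie update values into the sample set
	}
	return gozstd.BuildDict(samples, defaultDictSize)
}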

Add license

We need to add a license for pkg.go.dev to be allowed to index our packages, which is required for #4 and currently blocking #14 as a draft.

Index: make bootstrapping optional

We should make sure that we don't re-index on every start, but only if a protocol state and execution state are actually provided as parameters. Otherwise, we should just open the database and do a sanity check to see if there is anything in it.

Implement Rosetta Data API

Once we have a working FVM instantiation on top of our own indexed storage, we can implement the first parts of the Rosetta Data API for account balances. This will enable lookup of any liquid balance at any block height.

State: add caching layer

The Store currently only retrieves payloads directly from the index. This is not viable as a strategy once the Badger database grows beyond a certain point. We also need to keep track of a number of parameters, like the last indexed height and state commitment. Any auxiliary data that is read often should also be cached.

The following tasks are needed to create a good caching layer (a rough sketch of the wrapper follows the list):

  • add support for marking registers as dirty when they are written, and otherwise assuming the register value has not changed and thus simply extending the last range in cache when present;
  • find a way to eject only the least recently used ranges from the cache, which should provide a sufficiently effective strategy to limit cache size; and
  • build a wrapper around the core storage that implements the same API as the core storage, but adds the caching layer around it (allowing us to turn the cache off if desired).
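A rough sketch of the wrapper, with a hypothetical Storage interface standing in for the core storage API; the range bookkeeping is heavily simplified, and github.com/hashicorp/golang-lru provides the least-recently-used ejection.

package cache

import lru "github.com/hashicorp/golang-lru"

// Storage is a stand-in for the core storage API (assumed shape).
type Storage interface {
	Payload(height uint64, reg string) ([]byte, error)
}

// entry caches a payload together with the height range [from, to]
// over which the register is known not to have been written.
type entry struct {
	from, to uint64
	payload  []byte
}

// Cached wraps the core storage behind the same API; a nil cache
// turns caching off entirely.
type Cached struct {
	core  Storage
	cache *lru.Cache
}

func (c *Cached) Payload(height uint64, reg string) ([]byte, error) {
	if c.cache != nil {
		if v, ok := c.cache.Get(reg); ok {
			e := v.(entry)
			if height >= e.from && height <= e.to {
				return e.payload, nil
			}
		}
	}
	payload, err := c.core.Payload(height, reg)
	if err != nil {
		return nil, err
	}
	if c.cache != nil {
		// A full implementation would extend the existing range when
		// the register was not marked dirty at this height.
		c.cache.Add(reg, entry{from: height, to: height, payload: payload})
	}
	return payload, nil
}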

Index: investigate more compact header type for storage

We currently encode the full Flow header using CBOR. There might be a number of fields, including big fields like signatures, that we don't really need for the DPS. We could convert the Flow header to a DPS-specific header and encode that instead.
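For illustration, a DPS header could keep only the fields we actually serve and drop the consensus signatures; the exact field set is an assumption, and github.com/fxamacker/cbor/v2 is used as the codec.

package header

import "github.com/fxamacker/cbor/v2"

// Header is a hypothetical DPS-specific header; it drops the
// consensus signatures carried by the full Flow header.
type Header struct {
	ChainID     string
	Height      uint64
	ParentID    [32]byte
	PayloadHash [32]byte
	Timestamp   int64
}

// encode serializes the slim header with the same codec as before,
// just over a much smaller struct.
func encode(h Header) ([]byte, error) {
	return cbor.Marshal(h)
}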

Mapper: handle execution forks

From what I can see, on mainnet-8, we sealed state commitment 033ed4a966cf8ed2c09912a67b45574ff66b845baca2b31036390543e39aacc7 at height 13413534. There are two updates in the WAL for that state commitment. One of them updates the trie to root hash ae26fde7f2f69668e16dc5042e28e809485d842b11d42f89ff048ad11a773d97, the other one updates the trie to root hash 68bbc76d7882f7e6211c727a04725c0ee8024d189e712d281fb16fb749f395bb. There is no subsequent trie.Update building on either of these root hashes in the LedgerWAL, so we don't know how to resolve the fork.

We need to investigate to identify the problem. Here are some possibilities:

  • some trie updates are not recorded in the WAL, in which case we should be able to find one of the subsequent blocks' state commitment in the WAL again;
  • the trie doesn't update properly, potentially only when the second path is the valid one, i.e. the rewind causes the trie to update to an incorrect root hash which is not found in the WAL.

Live Mapping

Description

The DPS should be able to connect to an execution node at startup when given its address, and get trie updates, blocks and events from it. The mapper should use a live version of the Feeder and Chain components, and be able to handle live mapping of a Flow network.

Indexer: optimize Badger transaction performance

Evaluate the trade-off between using one big transaction and many smaller transactions, both in terms of performance and in terms of conflicts between transactions on a live spork. Also evaluate at various granularities (five separate calls, versus within calls, i.e. per delta).

Mapper: compact the Badger value log

Currently, every time we run the DPS with the same input trie/WAL, we end up increasing the Badger DB value log by the same size. We now run a compaction explicitly once, at the end of the bootstrapping loop. However, for ongoing sporks, it would be good to run this every X number of indexed blocks, just to be sure to keep the database in a good state for ongoing operations.
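A sketch of what the periodic compaction could look like; the interval and discard ratio are placeholders, and the call relies on Badger's RunValueLogGC.

package mapper

import badger "github.com/dgraph-io/badger/v2"

// compactEvery triggers value log garbage collection once every
// `interval` indexed blocks; call it from the indexing loop.
func compactEvery(db *badger.DB, height uint64, interval uint64) {
	if interval == 0 || height%interval != 0 {
		return
	}
	// RunValueLogGC rewrites at most one value log file per call, so
	// loop until it returns an error (badger.ErrNoRewrite when done).
	for db.RunValueLogGC(0.5) == nil {
	}
}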

Unit test coverage

Description

This epic is about adding unit tests to the project as well as the infrastructure that goes along with it to guarantee that they are maintained and remain useful.

Rosetta: hard-code all chain & token-dependent stuff

There is currently no need to support anything but the three standard Flow chains with their hard-coded contract addresses, nor any kind of third-party tokens. We can simplify the current Rosetta components to progress faster.

State: support storage transaction wrappers

We might want to introduce a number of wrappers that allow us to handle typical edge cases, like updating a value that is already in the DB, inserting a value when nothing was found, or retrying when there is a transaction conflict.

// signature
func RetryOnConflict(op func(*badger.Txn) error) func(*badger.Txn) error

// usage
_ = db.Update(RetryOnConflict(RetrieveHeaderByHeight(height, &header)))

// signature
func InsertIfMissing(retrieve, insert func(*badger.Txn) error) func(*badger.Txn) error

// usage
_ = db.Update(InsertIfMissing(RetrieveLastCommit(&commit), SaveLastCommit(&init)))
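As a minimal sketch, InsertIfMissing could look like the code below, assuming the retrieve operation surfaces badger.ErrKeyNotFound when nothing is stored. Note that retrying on conflicts would have to wrap the db.Update call itself, since Badger only reports conflicts at commit time.

package storage

import (
	"errors"

	badger "github.com/dgraph-io/badger/v2"
)

// InsertIfMissing runs the insert operation only when the retrieve
// operation finds no existing value.
func InsertIfMissing(retrieve, insert func(*badger.Txn) error) func(*badger.Txn) error {
	return func(tx *badger.Txn) error {
		err := retrieve(tx)
		if errors.Is(err, badger.ErrKeyNotFound) {
			return insert(tx)
		}
		return err
	}
}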

Add compression to KV store

The limiting factor in terms of instance resources will probably be hard-disk space, so it makes sense to add compression after encoding our payloads. Zstandard seems like the best choice, and we might even be able to use a relatively high compression level: since we will mostly be decompressing, and Zstandard's decompression speed is roughly constant regardless of compression level, the overhead compared to disk I/O should be minimal. We can even train a dictionary to improve things further.
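A sketch of what the compression layer could look like with a trained dictionary, again using github.com/valyala/gozstd as one option; all names are illustrative.

package storage

import "github.com/valyala/gozstd"

// codec compresses payloads after encoding and before they hit the
// KV store, and decompresses them on the way out.
type codec struct {
	cdict *gozstd.CDict
	ddict *gozstd.DDict
}

func newCodec(dict []byte) (*codec, error) {
	cd, err := gozstd.NewCDict(dict)
	if err != nil {
		return nil, err
	}
	dd, err := gozstd.NewDDict(dict)
	if err != nil {
		return nil, err
	}
	return &codec{cdict: cd, ddict: dd}, nil
}

func (c *codec) compress(payload []byte) []byte {
	return gozstd.CompressDict(nil, payload, c.cdict)
}

func (c *codec) decompress(blob []byte) ([]byte, error) {
	return gozstd.DecompressDict(nil, blob, c.ddict)
}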

Generate minimalist WAL and `root.checkpoint` file

We need to write a small utility tool that does the following:

  • Takes as input the state of a consensus node, a valid checkpoint (which should contain enough transactions to include at least one state commitment for a finalized block), and the tries from an execution node.
  • Goes through the tries of the checkpoint until it finds the latest one.
  • Goes through the deltas of that checkpoint until it finds a state commitment for a finalized block.
  • Keeps going through the deltas while writing a LedgerWAL with all changes after the aforementioned state commitment.

The tool then outputs:

  • The valid checkpoint, renamed to root.checkpoint
  • A stripped version of the LedgerWAL with only the relevant state changes
  • A modification in the protocol state database to set the correct block height as the root block height

Implement gRPC API

We can implement a gRPC API (and/or JSON-RPC API) wrapper around the Ledger API, which should be useful to support simple remote executions of Cadence scripts.

  • Implement gRPC API.
  • Add a file to allow running go generate to rebuild protobuf files (see the sketch after this list).
  • Add a GH Action script to automatically reject PRs if running go generate ./... results in a diff.
  • Document the required packages to build protobuf files in README.md (and introduction.md if merged by then).
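The file enabling go generate could be as simple as the sketch below; the package name, proto file name, and output flags are assumptions about the eventual layout.

// Package api holds the generated protobuf bindings for the DPS API.
package api

// Hypothetical directive; adjust the proto path and plugins to the
// actual repository layout.
//go:generate protoc --go_out=. --go-grpc_out=. api.proto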

Fix last height and commit at startup

Currently, the application crashes when indexing for the first time, because it can't retrieve a value for the last height and last commit. If they are not found, we should default them to zero and the hash of an empty trie, respectively. This should then behave correctly (or fail cleanly) in all possible scenarios.
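A sketch of the fallback, with hypothetical accessors and an emptyTrieRootHash helper standing in for however the hash of the empty trie is obtained.

package index

import (
	"errors"

	badger "github.com/dgraph-io/badger/v2"
)

// bootstrapDefaults returns the last indexed height and commit,
// substituting defaults when the index has never been written to.
func bootstrapDefaults(tx *badger.Txn) (uint64, []byte, error) {
	height, err := retrieveLastHeight(tx) // hypothetical accessor
	if errors.Is(err, badger.ErrKeyNotFound) {
		height, err = 0, nil
	}
	if err != nil {
		return 0, nil, err
	}
	commit, err := retrieveLastCommit(tx) // hypothetical accessor
	if errors.Is(err, badger.ErrKeyNotFound) {
		commit, err = emptyTrieRootHash(), nil // hash of the empty trie
	}
	if err != nil {
		return 0, nil, err
	}
	return height, commit, nil
}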

State: move storage API into separate package

The current way of manually creating the key byte slices is not sustainable; it should be factored into another package that hides the low-level details and makes it less prone to bugs and human error. The way it is done in the Flow Go repository is good inspiration, but we can probably keep it a bit simpler.
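For instance, a small keys helper could hide the byte-slice construction entirely; the prefix constants and key layout here are illustrative.

package storage

import "encoding/binary"

// Illustrative prefix bytes for the different indexes.
const (
	prefixHeader  byte = 0x01
	prefixPayload byte = 0x02
)

// headerKey builds the lookup key for a header at a given height:
// one prefix byte followed by the big-endian encoded height.
func headerKey(height uint64) []byte {
	key := make([]byte, 1+8)
	key[0] = prefixHeader
	binary.BigEndian.PutUint64(key[1:], height)
	return key
}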

Add architecture document

Create a document in docs/architecture.md which details the architecture of the DPS and how the different components interact with each other as well as with external entities.

This document will have to be kept up to date as new PRs come in, so it might make sense to also add a PR template that reminds PR authors to update the document whenever their changes affect it.

Documentation Overview

Describe the project, its roadmap and its architecture in the README.md file at the root of the repository.

Add introduction document

Create a docs/introduction.md file which describes:

  • An overview of the Flow technology or at least relevant links to Flow documentation
  • A glossary to define terms frequently used in other parts of the documentation (tree/trie/forest, WAL, checkpoints, registers, sporks, etc.)
  • Links to the different relevant parts of flow-go (Ledger interface, Write-Ahead Log, ReplayOnForest, etc.)
  • How to set up a test environment for Flow DPS
    • By building flow-go and running the localnet stack to generate state
    • By then cherry-picking from this state and loading it into DPS

One of the important points to keep in mind while writing this document is that it should remain relevant with as little need for maintenance as possible.
