onflow / flow-archive
Flow Archive
This project forked from optakt/flow-dps
License: Apache License 2.0
We currently have an issue where the Archive node on staging has not indexed any new blocks since the last deploy.
This should be fixed so we can continue using this environment for testing and validating ideas.
Problem Definition
This Epic is concerned with the Streaming Client Library & DPS, a part of the Flow Access Node Refactoring. It is related to issue #2466.
The Flow Data Provisioning Service has a Live Binary for live sporks and Indexer for past sporks. It stores indexed data in a Badger database, which can be accessed by block height. It does so by acting as an unstaked consensus follower and downloading block execution records from a Google Cloud Storage bucket – which is constantly updated by an Execution Node.
With the development of the Execution Data Sync engine on the Access Node/Observer Service and streaming functionality of the Execution Data API, this implementation is outdated.
Proposed Solution
Since the Observer Service, Access Node, and Consensus Follower Library are all abstracted, DPS must be updated to reflect these changes.
DPS should use the Streaming Client Library to register itself as an execution data consumer/subscriber with the Execution Data API on an Access Node or an Observer. It should no longer use GCP Streamer to draw from a Google Cloud Bucket.
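As a rough illustration of the proposed flow, here is a minimal Go sketch of what the consumer loop could look like. The Streaming Client Library did not exist at the time of writing, so every type and method name below is a hypothetical stand-in, not an actual API.

```go
// Hypothetical sketch of DPS consuming execution data through the proposed
// Streaming Client Library instead of the GCP Streamer. All names invented.
package dps

import (
	"context"
	"log"
)

// ExecutionRecord stands in for a block's execution data (trie updates, events).
type ExecutionRecord struct {
	BlockHeight uint64
	Payload     []byte
}

// ExecutionDataSubscription is a hypothetical handle returned by the library.
type ExecutionDataSubscription interface {
	// Recv blocks until the next execution record is available,
	// delivered in order of seal per the definition of done below.
	Recv(ctx context.Context) (*ExecutionRecord, error)
}

// StreamingClient is a hypothetical facade over the Execution Data API of an
// Access Node or Observer.
type StreamingClient interface {
	SubscribeExecutionData(ctx context.Context, fromHeight uint64) (ExecutionDataSubscription, error)
}

// followExecutionData replaces the GCP Streamer loop: instead of polling a
// bucket, it consumes records pushed by the upstream AN/Observer, including
// catch-up data from fromHeight onward.
func followExecutionData(ctx context.Context, client StreamingClient, fromHeight uint64, index func(*ExecutionRecord) error) error {
	sub, err := client.SubscribeExecutionData(ctx, fromHeight)
	if err != nil {
		return err
	}
	for {
		rec, err := sub.Recv(ctx)
		if err != nil {
			return err // includes context cancellation
		}
		if err := index(rec); err != nil {
			log.Printf("failed to index height %d: %v", rec.BlockHeight, err)
			return err
		}
	}
}
```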
Step Definitions
• Update Flow DPS Live Binary to pull block execution records from the Streaming Client Library instead of a Google Cloud Bucket through GCP Streamer.
• Update Flow DPS Indexer to pull block execution records from the Streaming Client Library instead of a Google Cloud Bucket through GCP Streamer.
• Update DPS Execution Tracker to receive block execution records from the Streaming Client Library instead of a Google Cloud Bucket through GCP Streamer.
Definition of Done
• DPS no longer uses GCP Streamer or pulls from a Google Cloud Bucket
• DPS instead initializes a Streaming Client Library and requests a network key for a host AN or Observer
• DPS registers itself as an execution data consumer/subscriber to an Execution Data API
• DPS receives catch-up execution data from the Execution Data API through the Streaming Client Library
• DPS Live Binary pulls current spork data from Execution Data API
• DPS Indexer pulls past spork data from Execution Data API
• Streaming Client Library sends execution data in order of seal
• Streaming Client Library calls callback for all registered finalization listeners
DPS currently uses the V5 format of the checkpoint file, where everything is stored in a single file, rather than the V6 format, which is partitioned across several files.
Update DPS code to:
• take the checkpoint directory in GCP as a flag
• verify that trieUpdates are applied correctly in the mapper
Upgrade badgerdb to the latest version (and apply compatibility changes in flow-go) in order to possibly improve GetRegisters latency.
For localnet testing, we don't want to have to upload/download execution data from GCP. Instead, add a feature to ENs to write the data to disk, and a feature to DPS to consume it from disk.
The end result should be that when this feature is enabled, ENs "upload" ComputationResult
data by writing them to a file on the local filesystem. DPS should "download" the data by reading the file from disk.
This new directory can then be mounted into both localnet ENs and DPS using a shared volume.
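A minimal sketch of what the disk-based exchange could look like, assuming execution data is serialized one file per block height in the shared directory; the function names and file layout are illustrative, not the actual flow-go implementation.

```go
// Sketch of the proposed local-disk exchange for localnet testing.
package localnet

import (
	"fmt"
	"os"
	"path/filepath"
)

// WriteExecutionData is what an EN would call instead of uploading to GCP.
// Writing to a temp file and renaming makes the file appear atomically, so
// DPS never observes a partially written record.
func WriteExecutionData(dir string, height uint64, data []byte) error {
	tmp := filepath.Join(dir, fmt.Sprintf(".%d.cbor.tmp", height))
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, fmt.Sprintf("%d.cbor", height)))
}

// ReadExecutionData is what DPS would call instead of downloading from GCP.
// A wrapped os.ErrNotExist signals that the EN has not produced this height yet.
func ReadExecutionData(dir string, height uint64) ([]byte, error) {
	return os.ReadFile(filepath.Join(dir, fmt.Sprintf("%d.cbor", height)))
}
```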
Set up an integration to run a DPS container in a localnet network. This should be done in a way that allows for quick iterative development on both DPS and the flow-go nodes.
We've done a similar integration for the access nodes rate limiter here:
https://github.com/onflow/access-api-rate-limit/tree/main/test
That may or may not make sense for the DPS usecase.
We need a convenient way to run DPS with a localnet cluster to do manual testing locally with smaller datasets. This is an Epic to track the work required to get that working.
The goal is to enable fast and efficient iterative development on DPS and flow nodes. The end result should enable a developer to make changes to local DPS and flow nodes, quickly refresh the localnet environment, and run manual tests.
DPS currently cannot handle forks. The way we store/retrieve indexes may need to be changed, which may have other compatibility implications.
Create a client that streams execution and block data from ANs/Observers to DPS storage.
Problem Definition
The legacy Access API contained methods to both read and write (transaction) the blockchain. On Flow, reading is accomplished by read calls or scripts querying the blockchain state. Transactions are write calls which mutate blockchain state.
Without DPS, Access Nodes forward transactions to Collection Nodes, answer client read calls locally, and delegate script execution to Execution Nodes. This workload is very read-heavy and poses numerous scalability concerns. When deployed, DPS provides access to the execution state at any block height and does not increase network load.
The Flow API Service, modularized from the legacy Access API, will serve as an entry-point API for transactions, queries, and Cadence scripts. While transactions will still make their way to the network through an Access Node, all other calls should be handled by BDS or DPS if deployed.
Proposed Solution
• Interface on Flow API translates Protocol API and Execution Data API calls into DPS API calls
• Flow API handles routing of scripts (read-only) and transactions (write) to the corresponding target (see the routing sketch after this list)
• Flow API communicates with DPS API for read calls, running against indexed state data
• Leverage DPS for state indexing, reducing load on Access Nodes/Observer Services
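The routing sketch referenced above, in hypothetical Go: the interface and type names are invented for illustration and do not correspond to the real Flow API Service code.

```go
// Hypothetical routing layer on the Flow API: writes go upstream to an
// Access Node, reads go to the DPS API when one is configured.
package flowapi

import (
	"context"
	"errors"
)

type Script struct{ Source []byte }
type Transaction struct{ Payload []byte }

// AccessUpstream is the Access Node connection used for write calls.
type AccessUpstream interface {
	SendTransaction(ctx context.Context, tx *Transaction) error
}

// DPSBackend serves read calls against DPS-indexed state.
type DPSBackend interface {
	ExecuteScriptAtHeight(ctx context.Context, height uint64, script *Script) ([]byte, error)
}

type Router struct {
	upstream AccessUpstream
	dps      DPSBackend // nil when no DPS is deployed downstream
}

// SendTransaction always goes to the network through the Access Node.
func (r *Router) SendTransaction(ctx context.Context, tx *Transaction) error {
	return r.upstream.SendTransaction(ctx, tx)
}

// ExecuteScript is read-only, so it is served from DPS-indexed state; per the
// definition of done below, the call errors when no downstream DPS is bound.
func (r *Router) ExecuteScript(ctx context.Context, height uint64, s *Script) ([]byte, error) {
	if r.dps == nil {
		return nil, errors.New("no downstream DPS configured")
	}
	return r.dps.ExecuteScriptAtHeight(ctx, height, s)
}
```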
Step Definitions
• Flow API can bind to upstream DPS API
• Implement DPS API-Access API translation interface on the Flow API
• If Flow API binds to DPS API, it sends all calls to DPS API except the unimplemented methods: GetExecutionResultForBlockID, GetLatestProtocolStateSnapshot, and SendTransaction.
• Endpoints that query protocol state should be translated at the Flow API level and then sent to DPS API for results
• Endpoints that query execution state should be translated at the Flow API level and then sent to DPS API for results
Definition of Done
• Protocol API endpoints which query protocol state are translated by Flow API-DPS API interface and sent to DPS
• Execution Data API endpoints which query execution state are translated by Flow API-DPS API interface and sent to DPS
• Results returned from DPS API are served to clients
• If there is no downstream DPS, an error is returned
• DPS runs state queries from its local IndexDB and returns results through the DPS API.
• DPS indexed blockchain state is used to answer execution state requests from the DPS API.
• Those answers and synced state data are cached for timely future access.
When the archive node has imported the checkpoint, it's better to validate that the state commitment matches the root hash of the checkpoint.
Previous task: #66
Running the following Cadence script panics the FVM:
getCurrentBlock().height
Problem
DPS indexing requires all block data CBORs to be available in the designated GCP bucket. If a file is missing, DPS waits for it before syncing further blocks, which means indexing will not continue until that file is in the GCP bucket. This will persist until DPS switches over to streaming the state from Observers.
Solutions
• 503 from GCP
The current documentation needs to be updated, and the examples for how to run Observer need to be really solid and work out of the gate, considering support for Docker and other runtime environment/OS/arch concerns.
Currently, the Archive node spawns a new FVM instance that communicates with the live server via internal HTTP calls to get the register values it needs to execute scripts.
We want to come up with a proof of concept where the script is executed on the same machine as the node itself, pulling the data it needs from disk rather than making a network request (which has proven to have unreliable latency as the node gets busier handling multiple requests).
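A sketch of the proof-of-concept shape, with hypothetical interfaces: the FVM's register reads are satisfied by a function closing over the node's local store instead of an HTTP client. flow-go's actual FVM storage snapshot interfaces differ in detail.

```go
// Hypothetical in-process register reads for archive-node script execution.
package archive

import "context"

// RegisterStore is the node's on-disk index (e.g. Badger), already colocated
// with the script-execution process.
type RegisterStore interface {
	Get(ctx context.Context, height uint64, owner, key string) ([]byte, error)
}

// LocalReads adapts the store into the read callback a script execution
// environment needs, removing the HTTP round-trip per register.
func LocalReads(store RegisterStore, height uint64) func(owner, key string) ([]byte, error) {
	return func(owner, key string) ([]byte, error) {
		// Each register read is now a disk/page-cache hit instead of a
		// network request with queueing delay under load.
		return store.Get(context.Background(), height, owner, key)
	}
}
```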
Make sure DPS runs properly on testnet35 / mainnet18.
DPS's monitoring is based on log queries, and some operational/performance issues can crop up and go undetected unless someone proactively looks at the logs.
During investigation of another, unrelated problem on mainnet's EN1, the team noticed that the Archive Node seemed to be indexing a few blocks behind:
go run . -d -a dps-001.mainnet20.nodes.onflow.org:5005
Index first: 40171634
Index last: 40664221
Original reports:
{"level":"error","component":"gcp_streamer","error":"could not pull execution record (name: 353f39a7586d4b761fe0e0f763b67e47c73bd31fbb178ee3aba54be6c72de9b0.cbor): could not decode execution record: cbor: exceeded max number of elements 131072 for CBOR array","time":"2022-11-17T03:29:08.667080836Z","message":"could not download execution records"}
Originally reported by @m4ksio
Currently, the archive node throws a computation limit exceeded error similar to the Execution node for scripts or transactions that require a lot of computation.
While it makes sense to have an execution node throw this error, it doesn't make sense for the archive node to throw it. An operator running an archive node should not be subjected to any computation limits on their own node.
This issue is to remove the computation limit from the archive node.
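A possible shape for this change, assuming flow-go's fvm package and its WithComputationLimit context option; the exact constructor signature has varied across flow-go versions, so treat this as a sketch rather than the actual patch.

```go
// Sketch: build the archive node's script-execution context with an
// effectively unlimited computation budget, since the operator bears the
// cost on their own hardware.
package archive

import (
	"math"

	"github.com/onflow/flow-go/fvm"
)

// newScriptContext is illustrative; fvm.NewContext's signature differs
// between flow-go versions, so verify against the version in use.
func newScriptContext() fvm.Context {
	return fvm.NewContext(
		fvm.WithComputationLimit(math.MaxUint64),
	)
}
```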
The Fraud Prevention Team at DL used a script to query ALL account balances at a particular block height.
The script has now stopped working, given the restrictions on past block heights introduced by the last mainnet spork (mainnet-19).
Janez ran the script against DPS as the alternative to the Execution node and ran into issues.
The DPS takes incrementally longer to return register values when queried repeatedly for different registers. The script thus cannot finish execution in a reasonable amount of time; it may take several hours to complete.
His message on Slack:
My main problem is that the Archive <-> client communication is too slow to run scripts locally at a reasonable speed.
So trying to run like 10k scripts takes so long I gave up before 10%
Just adding multiple connections/clients, and running X scripts concurrently helps, but not much, and it seems to have a big impact on the archive node.
The main issue is that when a script runs it requests 1 register at a time.
So my current idea is to run many scripts at a time but only have 1 client requesting registers from the Archive node.
While the client is requesting, all the scripts that need a register are aggregated into a pending GetRegisterValues request. When a GetRegisterValues call returns, it dispatches results to all the pending scripts and sends the next request to the Archive node with the aggregated list of registers needed for the next batch of scripts.
This seems to be working, but I'll report back on whether it's feasible to run ~10k scripts in a reasonable amount of time, and the code that does this.
I have noticed that the memory used by the Archive node goes up quite a bit, while I'm doing this. Is this ok?
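For illustration, here is a minimal Go sketch of the batching scheme described in the message: concurrent script goroutines park their reads in a shared pending set, and a single loop issues one aggregated fetch at a time. All type names are invented, and fetch stands in for the real GetRegisterValues RPC.

```go
// Sketch of coalescing many scripts' register reads into batched requests
// against the Archive node.
package fraudworker

import "sync"

type RegisterID struct{ Owner, Key string }

type pendingRead struct {
	id   RegisterID
	done chan []byte
}

type Batcher struct {
	mu      sync.Mutex
	pending []pendingRead
	fetch   func(ids []RegisterID) map[RegisterID][]byte // one RPC to the Archive node
	kick    chan struct{}
}

func NewBatcher(fetch func([]RegisterID) map[RegisterID][]byte) *Batcher {
	b := &Batcher{fetch: fetch, kick: make(chan struct{}, 1)}
	go b.loop()
	return b
}

// Get is called from each script's execution goroutine; it blocks until the
// next aggregated request returns.
func (b *Batcher) Get(id RegisterID) []byte {
	done := make(chan []byte, 1)
	b.mu.Lock()
	b.pending = append(b.pending, pendingRead{id, done})
	b.mu.Unlock()
	select {
	case b.kick <- struct{}{}:
	default: // a flush is already scheduled
	}
	return <-done
}

func (b *Batcher) loop() {
	for range b.kick {
		b.mu.Lock()
		batch := b.pending
		b.pending = nil
		b.mu.Unlock()
		if len(batch) == 0 {
			continue
		}
		ids := make([]RegisterID, len(batch))
		for i, p := range batch {
			ids[i] = p.id
		}
		values := b.fetch(ids) // single in-flight request, as described above
		for _, p := range batch {
			p.done <- values[p.id]
		}
	}
}
```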
We are in the process of comparing get-register return times between the two implementations (using Badger v2 vs. Badger v3) to decide whether we should move ahead with the upgrade now or later.
Deployment steps for dps-live are still manual; since the command and system setup are the same and the parameters are readily available for each spork, this can be automated.
Use the manual deploy guide as a reference and automate setup of the /var/flow/bootstrap directory.
@franklywatson commented on Tue Feb 15 2022
To date, making and testing changes on the DPS has been a slow and cumbersome process: it requires large data sets, offers limited ability to debug and troubleshoot locally, and provides no way to avoid loading multi-GB files to verify that it is functioning.
This task is to document the work required to make developing with DPS much more straightforward, with a faster cycle time and a quick bootstrap for developers.
Details incomplete
When flow-go updates to a new release tag for each spork, we manually have to update DPS's dependencies. This means that we also have to manually generate new binaries, create a release tag, and upload the created images to Docker. Images created on an M1 Mac are incompatible with the chromeOS operating system in GCP.
Create a process similar to flow-go's, where dps-live and dps-server Docker images are automatically built on GCP machines and uploaded to the Container Registry.
This way, we can rely on the @latest tag for the image.
I run the Docker image, but on startup:
2022-10-19T15:19:24.155899406+02:00 stderr F {"level":"error","error":"could not decode protocol snapshot: json: cannot unmarshal string into Go struct field Seal.LatestSeal.FinalState of type flow.StateCommitment","time":"2022-10-19T13:19:24Z","message":"could not initialize protocol state"}
We currently have the RN deployed on staging with badgerdb v3, and it would be helpful to measure the actual difference in read speeds against the badgerdb v2 deployment we have running on mainnet right now.
Document the process of setting up DPS to run with localnet, as well as how to refresh the environment while making changes. This should also include examples of test queries (e.g., send a transaction and query the updated state data in DPS).
We're currently in the process of upgrading the RN key-value store from badgerdb/v2 to badgerdb/v3. However, it became apparent that the read latency of this library may be too high for the number of read operations we need to perform on this node when it's under high usage.
Part of this work is to study key-value store alternatives that are more efficient than Badger when performing many read operations, with pebble being the current top contender, as it has been tested previously in our new event-indexer.
Replace badgerdb with another key-value storage engine and test the results in a production-like environment (the staging RN would be ideal for this) using heavy scripts (like the fraud worker).
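As a starting point for that study, a rough read-latency harness against pebble (github.com/cockroachdb/pebble, whose API this uses); the key layout and harness shape are assumptions, and a real comparison would replay the same key set against Badger as well.

```go
// Measures sequential point-lookup latency over a captured key set,
// mirroring the Archive node's read-heavy access pattern.
package kvbench

import (
	"time"

	"github.com/cockroachdb/pebble"
)

func timeReads(path string, keys [][]byte) (time.Duration, error) {
	db, err := pebble.Open(path, &pebble.Options{ReadOnly: true})
	if err != nil {
		return 0, err
	}
	defer db.Close()

	start := time.Now()
	for _, k := range keys {
		value, closer, err := db.Get(k)
		if err == pebble.ErrNotFound {
			continue // absent keys are expected in a replayed trace
		}
		if err != nil {
			return 0, err
		}
		_ = value // a real harness would checksum or copy the value
		closer.Close()
	}
	return time.Since(start), nil
}
```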
The bootstrap step imports payloads from a checkpoint file, which takes a long time.
It's better to use a big instance to do the import, and after the payloads have been imported from the checkpoint file into storage, we can use a smaller instance to continue catching up and indexing each block.
With a command-line utility, the node can dedicate the big instance's resources to bootstrapping. We could also consider simplifying the existing FSM by replacing the bootstrap state, since the work is done by the bootstrap utility, with a check that bootstrapping has been completed.
Create a new epic, figure out the steps required, and make a transition plan to offload script execution from ENs to RNs.
The Archive node cannot replace the historical EN, because the bootstrap phase assumes the checkpoint file is for the root block, whereas for a historical EN the checkpoint is for the last sealed block.
RN currently doesn't have the ability to bootstrap from a certain block with the checkpoint at that block.
We need to be able to specify the start-index-height and checkpoint-file flags, allowing the RN to bootstrap by importing the checkpoint and indexing the payloads at the specified height.
If they are not specified, the height is chain.Root and the checkpoint file is just the root.checkpoint file.
If they are specified, they will be ignored if the database has already been bootstrapped.
If they are specified, the checkpoint file should only contain a single trie.
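A minimal sketch of the two flags and the rules above, using the standard library flag package for brevity (the real binaries may use a different flag library); the bootstrapped check is a placeholder for a real database lookup.

```go
// Sketch of the proposed bootstrap flags for the RN.
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Defaults follow the rules above: height falls back to the chain root
	// and the checkpoint to the root.checkpoint file.
	startIndexHeight := flag.Uint64("start-index-height", 0,
		"height whose payloads the checkpoint contains; 0 means the chain root")
	checkpointFile := flag.String("checkpoint-file", "root.checkpoint",
		"checkpoint to import; must contain a single trie when a height is given")
	flag.Parse()

	bootstrapped := false // placeholder: would be read from the database
	if bootstrapped {
		// Per the rules above: both flags are ignored once the DB is bootstrapped.
		fmt.Println("database already bootstrapped; ignoring flags")
		return
	}
	fmt.Printf("bootstrapping from %s at height %d\n", *checkpointFile, *startIndexHeight)
}
```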
Problem Definition
This Epic is concerned with the Streaming Client Library, a part of the Flow Access Node Refactoring. It is related to issue #2465.
The Flow Data Provisioning Service (DPS) has a Live Binary for live sporks and Indexer for past sporks. It stores indexed data in a Badger database, which can be accessed by block height. Currently, it does so by bootstrapping a Consensus Follower. This implementation is outdated.
Proposed Solution
Since the Observer Service, Access Node, and Consensus Follower Library are all abstracted, DPS must be updated to reflect these changes. DPS should use the Streaming Client Library to register itself as a block consumer/subscriber with the Protocol API on an Access Node or an Observer.
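As with the execution-data epic above, here is a hypothetical Go sketch of the subscription shape; none of these names existed at the time of writing, and the real Streaming Client Library may look quite different.

```go
// Hypothetical sketch: DPS registers a finalization listener with the
// Streaming Client Library, which relays the Protocol API's callbacks.
package dps

import "context"

type BlockHeader struct {
	Height uint64
	ID     [32]byte
}

// FinalizationListener is invoked once per finalized block, in order.
type FinalizationListener func(header *BlockHeader)

type ProtocolStream interface {
	// OnBlockFinalized registers a callback; the library also replays
	// catch-up blocks from the subscriber's last known height.
	OnBlockFinalized(ctx context.Context, fromHeight uint64, l FinalizationListener) error
}

// followChain replaces the Consensus Follower bootstrap: chain data arrives
// through the stream rather than through a locally run follower node.
func followChain(ctx context.Context, stream ProtocolStream, fromHeight uint64, index func(*BlockHeader) error) error {
	return stream.OnBlockFinalized(ctx, fromHeight, func(h *BlockHeader) {
		if err := index(h); err != nil {
			// A real implementation would surface this to the FSM/mapper.
			return
		}
	})
}
```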
Step Definitions
• Update Flow DPS Live Binary to pull chain data from the Streaming Client Library instead of Consensus Follower
• Update Flow DPS Indexer to pull chain data from the Streaming Client Library instead of Consensus Follower
• Update DPS Consensus Tracker to receive chain data from the Streaming Client Library instead of Consensus Follower
Definition of Done
• DPS no longer initializes the Consensus Follower Library
• DPS instead initializes a Streaming Client Library and requests a network key for a host AN or Observer
• DPS registers itself as a block subscriber/finalization listener to a Protocol API
• Protocol API calls callback for all registered finalization listeners
• DPS receives catch-up blocks from the Protocol API through the Streaming Client Library
• DPS Live Binary pulls current spork data from Protocol API
• DPS Indexer pulls past spork data from Protocol API
This is the intermediate step for streaming the execution data
The polling implementation has some extra requirements; consider the following scenario:
Let's say the latest block is 1000, and register A is updated at heights 500 and 700 with the following values:
height 500: register A is 3
height 700: register A is 4
When the archive node is catching up and has indexed up to height 600, and we run a state query against it for block 800, it would think it has indexed everything (which is wrong, because it has only indexed up to block 600) and would return register A's value as 3 for block 800, which is wrong.
If the archive node has only indexed up to height 600, it should return "not found" for state queries against any block above 600, such as 800.
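In code form, the guard this scenario calls for might look like the following sketch; Index and its fields are illustrative, not the actual flow-archive types.

```go
// Reject state queries above the last indexed height instead of silently
// serving stale values.
package archive

import "fmt"

var ErrNotIndexed = fmt.Errorf("height not indexed yet")

type Index struct {
	latestIndexed uint64
	// get returns the value of a register as of the given height, using the
	// usual "most recent update at or below height" lookup.
	get func(height uint64, register string) ([]byte, error)
}

func (ix *Index) RegisterAt(height uint64, register string) ([]byte, error) {
	if height > ix.latestIndexed {
		// Without this check, a query at height 800 while indexing sits at
		// 600 would wrongly return the value written at height 500.
		return nil, fmt.Errorf("%w: requested %d, indexed up to %d",
			ErrNotIndexed, height, ix.latestIndexed)
	}
	return ix.get(height, register)
}
```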
Use mainnet staging and the flow-archive-access validator with `v0.29-concurrent-importing-simplify-logs` on flow-archive.
The Archive node is missing a few blocks since the CBOR files were not uploaded to the GCP buckets by the Execution node.
The script to extract past blocks from the Execution node is failing for block 41040071.
As part of the work to connect the RN with the new AN state streaming API, data verification during block indexing has become a blocker; it can be implemented some other way in the future, since the root hash is included with each block streamed to the Archive node via this new API.
This will allow us to have an initial version of a trieless Archive node live that does not require so much memory, especially after booting up.
Problem: Script execution is increasing load on ENs, and there is no reason ENs need to perform script execution: any node that has access to execution state can execute scripts, since scripts don't modify execution state.
Proposed solution:
Support script execution on RNs and migrate the script execution from ENs to RNs.
Prevent script execution requests from being routed to ENs.
Problem
DPS metrics do not extend beyond Go- and Badger-related metrics. Since we need to be more alert to performance bottlenecks, we'll require more metrics for each sub-operation (including indexing and finalized block sync).
Solutions
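One possible solution sketch using the Prometheus Go client (github.com/prometheus/client_golang): per-sub-operation histograms plus a last-indexed-height gauge, so lag can be alerted on without log queries. The metric names are suggestions, not existing DPS metrics.

```go
// Sketch of sub-operation metrics for DPS.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Latency per sub-operation, labelled so indexing and finalized block
	// sync can be alerted on separately instead of grepping logs.
	opDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "dps",
			Name:      "operation_duration_seconds",
			Help:      "Duration of DPS sub-operations.",
		},
		[]string{"operation"}, // e.g. "index_block", "block_sync"
	)

	lastIndexedHeight = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "dps",
		Name:      "last_indexed_height",
		Help:      "Height of the most recently indexed block.",
	})
)

func init() {
	prometheus.MustRegister(opDuration, lastIndexedHeight)
}

// ObserveIndexed records one indexing round; lag alerts can then compare
// dps_last_indexed_height against the network's latest sealed height.
func ObserveIndexed(height uint64, seconds float64) {
	opDuration.WithLabelValues("index_block").Observe(seconds)
	lastIndexedHeight.Set(float64(height))
}
```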
In this new proposed version of the flow fraud worker, we have to pull balance data only for accounts that changed within a specific block height range (since the initial data snapshot used as a reference).
To implement that, a client example of how to pull this data from an Access Node would be beneficial.
DPS nodes are large and expensive machines to operate. While dps-live does require large memory and CPU resources, dps-server does not, as it serves its data from disk. This means that once a spork is complete, we can swap out the DPS machine for a lighter and cheaper one, until inter-spork compatibility is implemented.
Create an automated GCP way to: