
clio's Introduction

Clio


Clio is an XRP Ledger API server optimized for RPC calls over WebSocket or JSON-RPC. It stores validated historical ledger and transaction data in a more space-efficient format, using up to 4 times less space than rippled.

Clio can be configured to store data in Apache Cassandra or ScyllaDB, enabling scalable read throughput. Multiple Clio nodes can share access to the same dataset, which allows for a highly available cluster of Clio nodes without the need for redundant data storage or computation.

📡 Clio and rippled

Clio offers the full rippled API, with the caveat that Clio by default only returns validated data. This means that ledger_index defaults to validated instead of current for all requests. Other non-validated data, such as information about queued transactions, is also not returned.

Clio retrieves data from a designated group of rippled nodes instead of connecting to the peer-to-peer network. For requests that require access to the peer-to-peer network, such as fee or submit, Clio automatically forwards the request to a rippled node and propagates the response back to the client. To access non-validated data for any request, simply add ledger_index: "current" to the request, and Clio will forward the request to rippled.
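
As a rough illustration, here is a sketch of a request that Clio would forward, built with Boost.JSON; the helper function and the placeholder account are illustrative only, while the field names follow the public rippled/Clio WebSocket API.

#include <boost/json.hpp>

// Illustrative helper: any request carrying ledger_index: "current" is
// forwarded by Clio to one of its configured rippled nodes.
boost::json::object
makeForwardedAccountInfoRequest()
{
    boost::json::object request;
    request["command"] = "account_info";
    request["account"] = "r...";            // placeholder account ID
    request["ledger_index"] = "current";    // triggers forwarding to rippled
    return request;
}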

Note

Clio requires access to at least one rippled node, which can run on the same machine as Clio or separately.

📚 Learn more about Clio

Below are some useful docs to learn more about Clio.

For Developers:

For Operators:

General reference material:

🆘 Help

Feel free to open an issue if you have a feature request or something doesn't work as expected. If you have any questions about building, running, contributing, or using Clio, or anything else, you can always start a new discussion.

clio's People

Contributors

arihantkothari, bronek, chiva, cindyyan317, cjcobb23, dependabot[bot], ethanlabelle, github-actions[bot], godexsoft, intelliot, jscottbranson, kassaking7, kuznetsss, ledhed2222, legleux, manojsdoshi, maria-robobug, mduo13, mwni, natenichols, officialfrancismendoza, peterchen13579, shawnxie999, shichengripple001, shoukozumi, thejohnfreeman, undertome, wietsewind, ximinez, yinyiqian1


clio's Issues

New API

Design a new and more consistent API

Return error when data is stale

If the database is not caught up with the network, we should return an error or at least a warning. Reasons for not being caught up include: historical data is still being loaded, no writer nodes are deployed, or there are no in-sync rippled nodes to talk to.

We can detect this by the recorded close time of the most recent ledger in the db.
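
A minimal sketch of such a check, assuming the close time is stored as seconds since the Ripple epoch (2000-01-01T00:00:00Z, Unix time 946684800) and that the acceptable age comes from config; the names are illustrative, not clio's actual interfaces.

#include <chrono>
#include <cstdint>

constexpr std::int64_t kRippleEpochOffset = 946684800;  // Unix time of the Ripple epoch

bool
isDatabaseStale(std::uint32_t latestCloseTime, std::chrono::seconds maxAge)
{
    auto const nowUnix = std::chrono::duration_cast<std::chrono::seconds>(
                             std::chrono::system_clock::now().time_since_epoch())
                             .count();
    auto const closeUnix = kRippleEpochOffset + static_cast<std::int64_t>(latestCloseTime);
    return (nowUnix - closeUnix) > maxAge.count();
}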

Parallel Load

Reporting mode has a tool for loading the SHAMap into Cassandra, but clio has to manually load all data into its native data format.

We should add functionality that allows for multiple write nodes to load data into the same database. I.E.

node 1 loads ledgers 32,000 to 20,000,000
node 2 loads ledgers 20,000,001 to 40,000,000
node 3 loads ledgers 40,000,001 to 60,000,000
node 4 loads ledgers 60,000,001 to present

Revert the account_tx change for large rows

There was a change to the account_tx data model in Cassandra that tried to prevent large rows. This change significantly increased the complexity of the code and data model, and also degraded performance in some cases. We should revert this change, and if we actually see an issue with large rows in production, circle back.

Lite Mode

Consider adding a lite mode that doesn't store the full ledger state, but rather just stores transactions and ledger headers. Research should be done to determine if this is an actual use case.

Forward validations and manifests

We need to forward validations and manifests received over websocket from rippled. Reporting mode does this, so we can probably just copy the code.

ASIO: Consider using strands instead of multiple io contexts

There are multiple io contexts used in the application. This is generally done to prevent different types of jobs from competing for the same resources. For instance, CassandraBackend has its own io context for async requests, ETLLoadBalancer has its own io context for connecting to rippled, and the RPC handlers post to their own io context. Each of these io contexts is handled by its own single thread, except for the io context used by the RPC handlers, which is handled by a configurable number of threads.

Using strands would force sequential execution of work posted to the io context, which should achieve virtually the same thing, where at any given time only one thread can be executing work posted to that strand. Using strands would be easier to maintain than having multiple separate io contexts. However, we want to make sure the RPC handlers can't starve out the other work posted to the same io context, as the other work is essential for healthy functioning of the server. A compromise might be to have one io context for RPC handlers, and one io context for everything else. More research is needed to determine the best approach here.
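
A minimal sketch of the strand approach, assuming Boost.Asio (which clio already uses); the subsystem names are placeholders and this is not clio's actual code.

#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main()
{
    boost::asio::io_context ioc;

    // One strand per subsystem: work posted to a strand executes sequentially,
    // which is the same guarantee the separate io contexts give us today.
    auto backendStrand = boost::asio::make_strand(ioc);
    auto etlStrand = boost::asio::make_strand(ioc);

    boost::asio::post(backendStrand, [] { /* e.g. a Cassandra callback */ });
    boost::asio::post(etlStrand, [] { /* e.g. rippled connection work */ });

    // A single shared thread pool services all strands.
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&ioc] { ioc.run(); });
    for (auto& t : threads)
        t.join();
}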

Enhance logging

We should audit the logs and make sure everything that's being logged makes sense, and everything that makes sense is being logged, at the appropriate levels.
We should make some progress on this before the beta, mainly so we have good logs to debug things.

Rename this repository

This repository needs to be renamed. Ideally the name would be unique, creative, descriptive, easy to say/remember, and not associated with Ripple.

Querying the open ledger

We should look into storing the open ledger in Redis, or something similar.

Or maybe each clio server can keep its own cache.

Investigate large partitions in Cassandra

Some of the partitions in our data model are very large.

The worst offender is the keys table, where there are millions of records in a single partition. For keys, we could do something like have the partition key be the ledger sequence AND the first byte of the key.

The account_tx table also has very large partitions, which correspond to accounts with a lot of transactions. One solution to this would be to have the partition key be the account AND the first x bytes of the ledger sequence. In this manner we can split up the records for a given account, so they don't all correspond to the same partition.

This problem could also exist in the objects table, if a ledger object is modified very often over a long period of time. For this, we could do something similar to the solution described above for account_tx.

The other tables don't appear to have this problem.
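
As a rough illustration of the bucketing idea for account_tx, the partition key could combine the account with the high bits of the ledger sequence; the bucket width below (2^20 ledgers) is an assumption, not a decided value, and the types are illustrative.

#include <cstdint>
#include <string>
#include <utility>

struct AccountTxPartition
{
    std::string account;   // partition key, part 1: the account ID
    std::uint32_t bucket;  // partition key, part 2: ledger_sequence >> 20
};

AccountTxPartition
makeAccountTxPartition(std::string account, std::uint32_t ledgerSequence)
{
    // Transactions for one account are spread across one partition per
    // ~1 million ledgers instead of a single ever-growing partition.
    return {std::move(account), ledgerSequence >> 20};
}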

error: badSyntax

When issuing requests to clio over JSON-RPC, it always responds with the error badSyntax and no further details.
Reproducible via:

curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"command": "server_info"}' \
$ADDRESS

Don't publish stale data

We should consider not publishing ledgers during a historical data load. We could refrain from publishing any ledger that's more than a certain amount of time old. This could be controlled by a config parameter.

Cache calls to fetchLedgerRange()

fetchLedgerRange() does not actually need to hit the database. Instead, the range can be cached. We can query the range on startup for the initial population. For writing nodes, we update the cached range when we write to the database. For read-only nodes, we update the cached range when we publish.
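
A sketch of what the cached range could look like; the class and method names are hypothetical, not clio's actual interfaces.

#include <algorithm>
#include <cstdint>
#include <mutex>
#include <optional>
#include <shared_mutex>

struct LedgerRange
{
    std::uint32_t minSequence = 0;
    std::uint32_t maxSequence = 0;
};

class CachedLedgerRange
{
    mutable std::shared_mutex mtx_;
    std::optional<LedgerRange> range_;

public:
    // Called once at startup with the range fetched from the database.
    void initialize(LedgerRange range)
    {
        std::unique_lock lock{mtx_};
        range_ = range;
    }

    // Called after writing (writer nodes) or publishing (read-only nodes) a ledger.
    void advanceTo(std::uint32_t sequence)
    {
        std::unique_lock lock{mtx_};
        if (!range_)
            range_ = LedgerRange{sequence, sequence};
        else
            range_->maxSequence = std::max(range_->maxSequence, sequence);
    }

    std::optional<LedgerRange> get() const
    {
        std::shared_lock lock{mtx_};
        return range_;
    }
};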

Support fetching ledger diffs

Ledger diffs (all the added, modified and deleted objects that resulted from applying the ledger's transactions) are not supported by the current data model. In both Postgres and Cassandra, this would involve a full table scan. However, ledger diffs could be very useful for a variety of use cases. The ETL mechanism of clio itself uses ledger diffs.

There are three ways to go about this:

  1. Add an index to the database. We could add a hash index to Postgres and a secondary index to Cassandra.
  2. Add a second table. We could have a table that maps ledger sequence to a list of hashes. The hashes are the objects in the diff. Lookups are then two reads.
  3. Process the transaction metadata. The transaction metadata describes every object that was created, modified or deleted as part of applying that transaction. We also need to include the skiplist.

Option 1 is the simplest. Options 1 and 2 both use significant amounts of space on the db side. Option 3, though, doesn't use any extra space, and only results in a little extra processing on the clio side.
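
A hedged sketch of option 3, deriving a ledger's diff from transaction metadata; the AffectedNodes/CreatedNode/ModifiedNode/DeletedNode/LedgerIndex field names follow the standard XRPL metadata format, while the function itself and its types are illustrative.

#include <boost/json.hpp>
#include <set>
#include <string>
#include <vector>

// Collect the IDs of every ledger object created, modified or deleted by the
// transactions in one ledger, using their metadata.
std::set<std::string>
diffFromMetadata(std::vector<boost::json::object> const& transactionMetas)
{
    std::set<std::string> changedObjectIds;
    for (auto const& meta : transactionMetas)
    {
        for (auto const& nodeValue : meta.at("AffectedNodes").as_array())
        {
            auto const& node = nodeValue.as_object();
            // Exactly one of these keys is present per affected node.
            for (auto const* kind : {"CreatedNode", "ModifiedNode", "DeletedNode"})
            {
                if (auto const* inner = node.if_contains(kind))
                    changedObjectIds.insert(
                        inner->as_object().at("LedgerIndex").as_string().c_str());
            }
        }
    }
    // Note: per the issue, the skiplist changes outside of transaction
    // metadata and would need to be added to the diff separately.
    return changedObjectIds;
}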

Use `can_delete` to tell rippled when it's ok to delete ledgers

Every time clio finishes processing a ledger, clio should send a can_delete request to rippled, with can_delete set to the ledger which clio just processed. This tells rippled that it's safe to delete that ledger from its nodestore.

rippled may respond with a notEnabled error, if advisory deletion is not enabled. However, I think clio should just keep sending these messages, because the rippled node could be restarted with an updated config, and the messages are tiny.

This will require a new method in ETLLoadBalancer that sends a websocket request to ALL of the connected rippled nodes, instead of one at random.
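
A minimal sketch of the request itself, using Boost.JSON; can_delete and its parameter belong to rippled's admin API, while the helper function is hypothetical and the proposed broadcast-to-all-sources method in ETLLoadBalancer is not shown.

#include <boost/json.hpp>
#include <cstdint>

// Build the admin request that tells rippled the given ledger (and anything
// older) is safe to delete from its nodestore.
boost::json::object
makeCanDeleteRequest(std::uint32_t processedLedgerSequence)
{
    boost::json::object request;
    request["command"] = "can_delete";
    request["can_delete"] = processedLedgerSequence;
    return request;
}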

Timeline

Create a timeline of tasks to get clio to version 1.0, and define what version 1.0 is.

Add counter support to server_info

We need to add support for the different counters that rippled includes in server_info counters. Not all of the counters will be applicable to clio, but we need to at least support the counters that are used to determine throughput, latency, and age, so that we can monitor the performance of clio.

Determine DOS Guard parameters

The DOS Guard reads config parameters to determine the max bytes per second to serve to clients. We need to determine the correct default parameters here.

Consider adding a special table for current state of order book

book_offers is a little slow. Fetching 200 offers from a book can take 2 seconds. I'm not sure how fast this needs to be, but we should consider adding an extra table in the database for the current state of each order book. We could delete from this table as part of ETL.

Don't use rippled JSON

We use rippled JSON to serialize ledger objects and transactions to JSON. It's not used when users specify the binary flag. We should rewrite the rippled serialization to JSON code in clio and use boost JSON.

DDOS prevention

We need some sort of basic DDOS prevention built in. This DDOS prevention needs to exist without altering the rippled API (such as requiring some sort of API token). We can just use something similar to rippled's rate limiting feature, where it tracks the IP. The mechanism could exist locally, or live in the database.

Unwrap result.result when forwarding

We need to do something like this in clio: XRPLF/rippled#3804

When we forward an RPC, we forward over websocket. But if the original client requested via JSON-RPC, the resulting response is incorrect, and contains a double nested result field, like so:

{
   "result" : {
      "api_version" : 1,
      "forwarded" : true,
      "result" : {
         "current_ledger_size" : "41",
         "current_queue_size" : "0",
         "drops" : {
            "base_fee" : "10",
            "median_fee" : "5000",
            "minimum_fee" : "10",
            "open_ledger_fee" : "10"
         },
         "expected_ledger_size" : "81",
         "ledger_current_index" : 62544876,
         "levels" : {
            "median_level" : "128000",
            "minimum_level" : "256",
            "open_ledger_level" : "256",
            "reference_level" : "256"
         },
         "max_queue_size" : "2000"
      },
      "status" : "success",
      "type" : "response",
      "warnings" : [
         {
            "id" : 1004,
            "message" : "This is a reporting server.  The default behavior of a reporting server is to only return validated data. If you are looking for not yet validated data, include \"ledger_index : current\" in your request, which will cause this server to forward the request to a p2p node. If the forward is successful the response will include \"forwarded\" : \"true\""
         }
      ]
   }
}
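
A sketch of the unwrapping step, assuming Boost.JSON; the function name is hypothetical.

#include <boost/json.hpp>

// If a forwarded JSON-RPC response contains a nested "result" object, lift
// the inner fields up one level and drop the redundant inner object.
void unwrapForwardedResult(boost::json::object& response)
{
    auto* outer = response.if_contains("result");
    if (!outer || !outer->is_object())
        return;

    auto& outerObj = outer->as_object();
    if (auto* inner = outerObj.if_contains("result"); inner && inner->is_object())
    {
        auto innerObj = inner->as_object();  // copy before erasing
        outerObj.erase("result");
        for (auto const& kv : innerObj)
            outerObj[kv.key()] = kv.value();
    }
}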



Make database fetches truly async

Currently, we use cassandra's async API to fetch records, but we really just submit a batch of async requests and then wait on a condition variable until they all complete. It would be better to actually free up the thread that makes the async requests, so that way the thread can handle additional RPC calls while waiting for the async requests to complete. This will give us better throughput overall for reads.

We need to figure out how to make async postgres queries.

It would be best to change all database reads to be asynchronous, even the ones that currently use the synchronous cassandra API.
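
One possible shape for this, sketched with Boost.Asio callbacks instead of a condition variable; the types and names are illustrative, and a real implementation would hook into the Cassandra driver's future callbacks.

#include <boost/asio.hpp>
#include <atomic>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

void
fetchBatchAsync(
    boost::asio::io_context& ioc,
    std::vector<std::function<void(std::function<void()>)>> const& asyncReads,
    std::function<void()> onAllDone)
{
    // (A real implementation would also handle an empty batch.)
    auto remaining = std::make_shared<std::atomic<std::size_t>>(asyncReads.size());
    auto done = std::make_shared<std::function<void()>>(std::move(onAllDone));

    for (auto const& submitRead : asyncReads)
    {
        // Each element stands in for one async database read; it takes the
        // completion callback the driver would invoke when the row arrives.
        submitRead([&ioc, remaining, done] {
            if (remaining->fetch_sub(1) == 1)
                boost::asio::post(ioc, *done);  // resume later; no thread was parked
        });
    }
}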

Support API versioning

rippled actually has a version 2 of the API that is not yet enabled. We need to implement support for it.

Prevent user from accidentally changing key shift

The indexer_key_shift config value can't change once the database is populated. Otherwise, the application is going to be looking for the keys at different flag ledgers than were written previously. We need to either detect this change on startup, or just store the shift in the database.
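
A sketch of the startup check; the names are hypothetical, and storing the shift in the database is assumed here.

#include <cstdint>
#include <optional>
#include <stdexcept>

void validateKeyShift(
    std::uint32_t configuredShift,
    std::optional<std::uint32_t> storedShift)  // value previously recorded in the database
{
    if (!storedShift)
        return;  // fresh database: the configured value will be recorded
    if (*storedShift != configuredShift)
        throw std::runtime_error(
            "indexer_key_shift does not match the value the database was "
            "populated with; refusing to start");
}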

Backfill

We should look into adding functionality for clio to backfill data. Right now, Clio only moves forward in time, so if you have an existing database, there's no way to get older data without just wiping the database and starting over. There still should be no history gaps. I think it should be straightforward to apply diffs in reverse, though things might get complicated with the successor table being developed as part of caching.

Implement subscriptions

Rippled has a subscribe RPC, where clients can subscribe to various events. We need to implement the infrastructure, and support "transactions", "transactions_proposed", "accounts", "accounts_proposed" and "ledgers" at a minimum.

doBooksRepair should iterate over the books table

Right now, doBooksRepair() iterates over the entire ledger, and checks if each object is an offer object. Instead, we should iterate through the books table. In Cassandra, we'll need to use TOKEN.

Unify JSON serialization format of transactions

There are many API methods that can return transactions. Currently, the return formats are inconsistent. The following table shows where fields are located as of rippled v1.12.0 (unchanged since v1.0.1).

Note: In the following table, the "top" level is with respect to the individual object representing the transaction; for methods that return arrays of transactions, the "top" level is per element of the array. "Transaction Instructions" refer to the canonical fields of the transaction as defined in the transaction format, such as the Account and Flags fields.

| Method | Transaction Instructions | Transaction Metadata | hash | ledger_index | inLedger | validated | ledger_hash | Binary blob | date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ledger (with transactions expanded) | Top level | metaData field (JSON) or in meta field¹ (binary) | Top level | Above transaction level | None | Above transaction level | None | tx_blob field¹ | None |
| tx | Top level | meta field | Top level | Top level | Top level | Top level | None | tx field¹ | Top level |
| tx_history (deprecated) | Top level | None | Top level | Top level | Top level | None | None | Not supported | None |
| account_tx | tx field | meta field | tx field | tx field | tx field | Top level | None | tx_blob field¹ | None |
| transaction_entry (would like to deprecate) | tx_json field | metadata field | tx_json field | Top level | None | Top level | Top level | Not supported | None |
| sign, sign_for, submit, and submit_multisigned | tx_json field | (N/A) | tx_json field | (N/A) | (N/A) | (N/A) | (N/A) | tx_blob field | None |
| Streams from subscribe | transaction field | meta field² | Top level | Top level² | None | Top level² | Top level² | Not supported | transaction field |
| Data API v2 methods | tx field | meta field | Top level | Top level | None | None (Data API only reports validated transaction data) | None | tx¹ | Top level (as ISO8601 timestamp, not Ripple time) |

¹ Only if the request asked for binary data
² The transactions_proposed stream omits these fields because the transactions' outcomes are not yet final.

Recommendation

Change all of rippled's methods to serialize transactions to JSON in a single consistent format. The format I suggest is essentially the same format the Data API uses, with modifications to accommodate some formats only the rippled APIs handle. Specifically:

  1. Always use tx_blob for binary format. Always use tx_json for JSON format transaction instructions.
  2. Only include the canonical transaction instructions ("uppercase fields") and signing fields in the tx_json field with transaction instructions. Move all "lowercase" fields like hash and ledger_index out to the top level of the transaction. (Note: the issuer/currency/value sub-fields of amounts are canonical fields even though they are lowercase.)
  3. Always use meta for JSON metadata. Use meta_blob for binary metadata.
  4. Remove the inLedger field entirely. (It was a deprecated alias for ledger_index, which is the ledger index of the ledger that includes this transaction.)
  5. Add the date field to all transactions that are from a closed ledger. Most of the time this would be the close time of the parent ledger. Make this a UTC ISO8601 timestamp with whole-seconds resolution, for example 2018-07-22T16:37:55Z. Omit this field from transactions that are not in closed ledgers.
  6. Add the ledger_hash field to any transactions when pulling them from a closed ledger. Omit this field from transactions that are not in closed ledgers.
  7. Add the validated boolean field to all transactions, even when this information is redundant because of context (for example, listing transactions in a validated ledger).

Refactor successor logic and keys table

Currently, the logic to get the next object in a ledger, given a certain object in the ledger, is rather complicated. This is due to the data model: we flattened the tree, so there's no way to iterate through a single ledger. Instead, we end up iterating over all of the data and filtering out data we don't want. This gets slower as the total dataset grows over time, eventually becoming unusable. To alleviate this, we ended up creating a table called keys, which has one row for roughly every 1 million ledgers, consisting of all of the object IDs present in that ledger and in every ledger in the prior 1 million. This works, as the dataset size is somewhat bounded. We still end up filtering out records we don't need, but fewer than if the dataset were not bounded in this way.

There are a couple of issues with this though. First and foremost, the code that makes this happen is very complicated. The code kicks off async jobs every flag ledger, has logic to determine if a flag ledger is missing and thus kick off some writes, has recursion logic to drop down to the previous flag if a flag ledger is incomplete, etc. Furthermore, while the performance is fine for ledger_data, it's terrible for book_offers. It can take several hundred milliseconds to get just one successor object.

We should refactor and get rid of the keys table, and everything related. I think we can instead keep a successor table, that looks something like this:

CREATE TABLE successor (
key blob,
seq int,
next blob,
PRIMARY KEY(key, seq)
)

Then, given an object ID and a ledger sequence, one can read the table to get the next object. The hard part is for clio to populate this table, but clio can actually leverage rippled, since rippled uses a SHAMap and thus finding the successor is easy. clio can just ask rippled for successor information as part of ETL.
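
A sketch of how iteration could then work against that table; fetchSuccessor is a hypothetical accessor, assumed to return the next value written at the greatest seq less than or equal to the requested ledger sequence for the given key.

#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using ObjectId = std::vector<std::uint8_t>;

// Walk a single ledger's objects by repeatedly asking for the successor of
// the current key as of the given ledger sequence.
std::vector<ObjectId> iterateLedger(
    std::uint32_t ledgerSequence,
    ObjectId firstKey,  // e.g. the all-zero key to start from the beginning
    std::function<std::optional<ObjectId>(ObjectId const&, std::uint32_t)> fetchSuccessor)
{
    std::vector<ObjectId> keys;
    ObjectId current = std::move(firstKey);
    while (auto next = fetchSuccessor(current, ledgerSequence))
    {
        keys.push_back(*next);
        current = *next;
    }
    return keys;
}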

Handle all account_tx flags

There are some additional flags for account_tx that we don't yet handle. For instance, we don't support forward iteration (old to new).

Remove the big caches in BackendIndexer

The really large sets and maps in BackendIndexer are slowing down ETL by at least a few hundred transactions per second. The big caches are only used during each flag ledger, to write everything in the current ledger to the next flag row. However, we don't need to keep the entire ledger in memory. We can just read the ledger from the database via the ledger_data algorithm. So if we just wrote ledger x, and ledger x is a flag ledger, we then iterate through ledger x and write everything to x + (1 << shift). This will make writing the flag ledgers slower, but that's ok since it should happen very infrequently.

For books, the situation is not as simple, since we don't currently have a way to iterate through all of the books in a ledger. So we could start with just removing the keys cache. Or we could just store the books, but not the contents of the book, and look up the contents at each flag ledger.

I think it would be worth removing the keys cache and seeing if that gives us a speedup.

Create a tool to cryptographically verify clio's dataset

We should create a tool that allows one to verify that their clio dataset is correct and complete. This tool should be able to verify state data and transaction data for a single ledger or for a range of ledgers.

One way to do this would be to recreate the SHAMap and then check that the hashes match. Reporting mode does something similar to this when it builds a new ledger. It is unclear how to do this outside of rippled, since we would need the SHAMap and the SHAMap is not available in xrpl_core.

Redesign proxy architecture

The proxy architecture is very simple, but may not be the best design. The way it works now is: if a request needs to be proxied (either it is one of the requests that is always proxied, like submit, or it specifies ledger_index: current), clio picks an ETL source at random and forwards the request there. This works, but there are a number of improvements that could be made.

  1. Consider disabling proxy for certain requests. For instance, ledger_data probably should never be proxied. The call is iterative, and iterating the whole ledger takes minutes, during which time the open ledger changes many times over. The list of RPCs that actually need access to the open ledger is quite small, so we could probably disable proxying on quite a bit.
  2. Think about how to provide a consistent view of the open ledger. clio uses a collection of etl sources, and each of those has a different view of the open ledger. Transaction propagation results in a sort of eventual consistency, though there is no guarantee on consistency prior to consensus, at which point the data is validated and clio writes it to the database. When we pick a random p2p node to proxy a request to, subsequent requests are likely to go to a different p2p node. So if a user calls server_info and uses the load_factor to calculate fees, and then calls submit to submit the transaction, submit will probably go to a different p2p node than server_info did, which could cause incorrect fee calculation. A similar situation can occur with account_info.
  3. Have clio be smarter about which node to proxy to. clio could track the load factor of each p2p node, and forward requests to the node with the lowest load factor, or some other measure of idleness. At a certain point, clio should reject proxy requests if the p2p nodes are in danger of going out of sync.
  4. Consider caching data in clio. We proxy server_info to get the load factor, but we could probably just query each p2p node once per second to get their load factor, and then cache that in clio. We might be able to do something like this for account_info as well.
  5. Enable separating ETL sources and proxy endpoints. Proxy endpoints could get bombarded with traffic and go out of sync, and then can no longer be ETL sources, and the database will stop getting updates.
  6. Make the proxy asynchronous, so we are not sitting on a thread until the response comes back.

Lastly, what we really need is some way to handle a lot of requests that require a p2p node without causing the nodes to desync. Reporting nodes simply slow down when there are too many requests, forcing requests to wait in a queue and eventually returning an error rpcTOO_BUSY. p2p nodes on the other hand lose sync, at which point they can't handle any requests until they resync, which can take several seconds or longer.

Cache the ledger

We should cache the most recently validated ledger (state data) in memory, and use this for any RPCs that request state data from the most recently validated ledger. This will speed up a lot of queries.

We need to be careful not to delete state data that an RPC is currently using. For example, if a ledger_data RPC is being handled, and then a new ledger is validated, we need to make sure we don't delete data and mess up the handling of the currently running ledger_data RPC. We can probably do this through some sort of locking and lazy deletion. RPCs lock the ledger they want, and then when they finish, delete the data if no one else has locked it and new data is available.
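
One way to get the locking and lazy deletion, sketched here with shared ownership; this is illustrative, not a decided design.

#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <vector>

using LedgerState = std::map<std::vector<std::uint8_t>, std::vector<std::uint8_t>>;

class LedgerCache
{
    std::mutex mtx_;
    std::shared_ptr<LedgerState const> latest_;

public:
    // The publisher installs a new validated ledger; the old state stays
    // alive as long as some RPC still holds a shared_ptr to it.
    void update(std::shared_ptr<LedgerState const> newState)
    {
        std::lock_guard lock{mtx_};
        latest_ = std::move(newState);
    }

    // RPC handlers "lock" the ledger they want simply by keeping the
    // returned shared_ptr for the duration of the request.
    std::shared_ptr<LedgerState const> acquire()
    {
        std::lock_guard lock{mtx_};
        return latest_;
    }
};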
