Comments (15)

kellymclaughlin commented on August 10, 2024

Kicking off the discussion with a few candidates:

  • RocksDB - This is Facebook's fork of Google's LevelDB, modified to suit high-end servers rather than cell phones. This would be an option 3 candidate; it is just a non-replicated KV store. Also worth looking into are some write-throttling changes made by the ArangoDB folks to smooth out performance on not-so-high-end servers.
  • WiredTiger - This is another option 3 candidate. WiredTiger was acquired by MongoDB and is now the default storage backend for that product. Like RocksDB, it's a non-replicated KV store.
  • Riak - This would be an option 2 candidate. Perhaps a dark-horse candidate given the demise of Basho, but there are some reasons I think it's worth considering. I had initially discounted it because, on top of everything else, I felt that introducing an eventually consistent data store could prove problematic, but then I remembered that Riak has a feature to create buckets that are strongly consistent. That convinced me it's at least worth a bit of further investigation.
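For reference, Riak's strong consistency is opt-in per bucket type. A minimal sketch of enabling it, assuming Riak 2.x with `strong_consistency = on` set in riak.conf on every node (the bucket-type name `sc` here is illustrative, not from the RFD):

```shell
# Create a bucket type whose buckets use strongly consistent operations,
# then activate it and confirm the property took effect.
riak-admin bucket-type create sc '{"props":{"consistent":true}}'
riak-admin bucket-type activate sc
riak-admin bucket-type status sc   # expect: consistent: true
```

Buckets under this type trade Riak's usual availability behavior for single-key linearizability, which is the property that makes it worth a second look here.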

from rfd.

bradjonesca commented on August 10, 2024

+1 for Riak ... it incorporates CRDTs

jclulow commented on August 10, 2024

I'd really like us to target a native SmartOS zone rather than an LX zone. There are a number of things that are currently better (for us) in a native zone; e.g., DTrace and MDB support is at its best, and everything will be built with frame pointers preserved and symbols included. There's also the fact that an LX zone is not, by itself, enough to run a Linux program -- we'd also need to figure out what distribution of Linux bits to ship in an appropriate image and commit to supporting that.

I think we should change the item about zones in Qualitative Requirements to target just native SmartOS zones for now. If that unexpectedly becomes a serious impediment, we could revisit at that time.

jaredmorrow commented on August 10, 2024

After spending a day on CockroachDB, I'm going to table that for now. I could spend a couple of days getting it to build on SmartOS just for us to not consider it.

@kellymclaughlin talking with Russell at the NHS (UK), this branch is where they are going with Riak: towards a 2.2.5 release. I'd suggest that if we look at Riak, we go that route.

pfmooney commented on August 10, 2024

I'd really like us to target a native SmartOS zone rather than an LX zone

I second this sentiment. Dealing with the lack of frame pointers and symbols is a significant hassle when trying to debug something on Linux or LX. As far as I know, there is no mainstream distribution which makes preservation of those items a priority.

sean- commented on August 10, 2024

@jaredmorrow I spent a decent amount of time on portability for CRDB but didn't get it fully working before shifting gears to the prefaulter last summer. Most of the prerequisite work to get CRDB working natively on illumos has been done and could probably be finished with a week or so of work.

kellymclaughlin commented on August 10, 2024

I'd really like us to target a native SmartOS zone rather than an LX zone.

I concur. We might as well start with that as a constraint and only relax it if we really must. I'll update the RFD to reflect that. Thanks!

jaredmorrow commented on August 10, 2024

@sean- good to know. I'd already fixed one Go library yesterday, but it seemed like jemalloc was pretty tied into the product. @kellymclaughlin and I discussed being more sure of which DBs we'll actively chase before spending too much more time getting the ones that won't build to build.

sean- commented on August 10, 2024

@jaredmorrow IIRC, jemalloc is used by RocksDB, but not in the Go-written portions of the database. jemalloc(3) is exceptional, but if it doesn't work, we can run without it at a performance loss, which probably isn't critical for the application at hand. Anyway, I can dig in more. I wouldn't gate the choice of database on portability concerns unless the amount of work is vast or risky (e.g., a dependency on a recent JVM). In the case of CRDB, the amount of work to be done is pretty modest (and mostly complete, with the exception of some OS-specific portability bits required by sigar).

gmake TAGS=stdmalloc ...

jaredmorrow commented on August 10, 2024

Some updates from me, and some other databases I looked at...

@sean- I got CockroachDB building and spoke with someone from Cockroach Labs. We would certainly want gosigar ported to Solaris beyond what I did to get it building (that is, I stubbed out empty functions). CRDB is still on the table; it just needs some work before we could reasonably test a cluster.

Other DBs I looked at as candidates:

  • Aerospike - Closed source; per our RFD, nothing else matters after that.
  • VoltDB - Looks promising and has solid engineering, but I think in-memory-only would be too much of a change in how our operators would have to work with it.
  • Hazelcast - In-memory-only, and it got absolutely creamed in the Jepsen tests.
  • Cassandra - Without vector clocks and a default of LWW conflict resolution, I don't see how it would fit our model, though it does have rack awareness.
  • RethinkDB - Zero Solaris support, and the company is now defunct. This would be a pure OSS play and we'd have to handle our own Solaris port and support.

cburroughs commented on August 10, 2024

Cassandra - Without vector clocks and a default of LWW conflict resolution, I don't see how it would fit our model, though it does have rack awareness.

The custom terminology is messy, but Cassandra does have a way to do something more than just last-write-wins: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlLtwtTransactions.html (with a correspondingly different set of tradeoffs).
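As a sketch of what that looks like in practice: lightweight transactions are expressed with `IF` clauses in CQL and use Paxos under the hood rather than plain last-write-wins. The keyspace, table, and column names below are hypothetical, and a running cluster is assumed:

```shell
# Conditional insert: succeeds only if the row does not already exist.
cqlsh -e "INSERT INTO manta.objects (bucket, name, etag)
          VALUES ('mybucket', 'obj1', 'abc123')
          IF NOT EXISTS;"

# Conditional update: applies only while the stored etag still matches,
# i.e. a compare-and-set rather than a blind overwrite.
cqlsh -e "UPDATE manta.objects SET etag = 'def456'
          WHERE bucket = 'mybucket' AND name = 'obj1'
          IF etag = 'abc123';"
```

Each conditional statement returns an `[applied]` column indicating whether the condition held, at the cost of extra round trips per operation.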

misterbisson commented on August 10, 2024

There are some clear performance benefits to buckets over the Manta directory model, but on its own (without broad S3 compatibility, a substantial reduction in the cost of storage, and other benefits), it's unlikely that existing customers with substantial usage footprints can justify a migration from one API to another.

This RFD assumes that new APIs are a given, but we need to investigate how we might achieve some of the performance and scaling benefits of buckets without changing the APIs.

kellymclaughlin commented on August 10, 2024

This RFD assumes that new APIs are a given, but we need to investigate how we might achieve some of the performance and scaling benefits of buckets without changing the APIs.

Great point. There are definitely two distinct things going on here. There's the feature request for the buckets API, and then there's the investigation into the most appropriate data storage solution for the bucket metadata. Ideally, in the end, we end up with only a single metadata storage solution. The directory-based API imposes some different requirements that could make this difficult (e.g. needing efficient directory listings), but it's something I've been keeping in mind while doing the research.

timkordas commented on August 10, 2024

I have a concrete question about structure of the storage. I'm going to phrase it with CockroachDB in mind; but it probably applies to any buckets solution.

When using a distributed database for the workload Manta has seen recently, many small writes will be made by many client instances.

Is the proposal here to remove sharding entirely? That is, would the commit clique for any DC be all DBs? Would webapi/muskie hit a new metadata-storage tier without any electric-moray-style routing to shard based on consistent hash? Or is the proposal to replace some combination of Moray/Postgres with a similarly sharded collection of CockroachDB "clusters"?

I don't have data at hand to back me up; but my intuition is that CockroachDB (or similar) may have difficulties handling the commit burden of thousands of transactions per second when the required quorum size gets big (where big may be in the dozens to hundreds) and I think that the number of nodes required just for I/O reasons will remain large, even with a relaxed metadata contract given the buckets API.

I think a key test (which may be easy to run?) is to evaluate the I/O requirement to get a number (how many nodes N would be required to write the metadata for M buckets per second), and then see how many transactions per second a system can do with N nodes in the CockroachDB cluster.
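To make that proposed test concrete, here is a back-of-envelope sketch of the first half (the node count). All numbers below are assumptions for illustration only, not measurements from any of the systems under discussion:

```shell
#!/bin/sh
# Rough sizing: how many nodes N are needed for M metadata writes/sec,
# given an assumed per-node commit capacity, before replication overhead.
M=5000         # target metadata writes/sec (assumed)
PER_NODE=500   # assumed per-node committed tx/sec under quorum commit
RF=3           # replication factor

N=$(( (M + PER_NODE - 1) / PER_NODE ))   # ceiling division
echo "nodes for raw write load: $N"
echo "with ${RF}x replication, write I/O scales as $(( N * RF )) node-equivalents"
```

The second half of the test is then to stand up an N-node cluster and measure whether it actually sustains M transactions per second, since quorum size and replication traffic can push real throughput well below the naive estimate.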

I really believe that the fault-tolerance in your proposal will be a vast improvement over the replicated Postgres setup; I just worry about operating very large clusters of Cockroaches.

I think you lay out plans for some of this in your evaluation; but wanted a little more.

kellymclaughlin commented on August 10, 2024

Is the proposal here to remove sharding entirely? That is, the commit-clique for any DC would be all DBs ? That is, webapi/muskie would hit a new metadata-storage tier without any electric-moray-style routing to shard based on consistent hash ? Or is the proposal here to replace some combination of Moray/Postgres with a similarly sharded collection of CockroachDB "clusters."

We've considered both ideas, actually, though most recently we've focused on the former as a target for our initial testing. Instead of using electric-moray for consistent hashing to distribute the data, we would lean on the data distribution provided by the distributed database. The main reason we chose to focus on this approach for the initial testing is that it is much more operationally appealing to manage a single database cluster rather than numerous clusters. Obviously if that falls apart at scale it may not hold, but as a starting point I find it the better option.

I don't have data at hand to back me up; but my intuition is that CockroachDB (or similar) may have difficulties handling the commit burden of thousands of transactions per second when the required quorum size gets big (where big may be in the dozens to hundreds) and I think that the number of nodes required just for I/O reasons will remain large, even with a relaxed metadata contract given the buckets API.

I share this concern and the problem of testing at a sufficient scale before deciding to proceed with the option of a new distributed database is something I want to be very careful to explore.

I think you lay out plans for some of this in your evaluation; but wanted a little more.

Our plan is to start modestly and compare numbers between a small CockroachDB cluster (5 nodes), a small RethinkDB cluster (also 5 nodes), and a single shard of replicated Postgres. We modified muskie to talk directly to each of these options to try to ensure a fair comparison and also avoid the overhead of changing moray to work with our schema. The goal of the initial testing is to compare the throughput and latency we see for each option and let that guide the next steps. If either of the distributed databases performs well enough in this small-scale test to suggest it could be a viable option, then we'll move on to larger-scale tests and compare it to Postgres in a multi-shard environment. Then there would need to be operational testing to see what it is like to add and remove cluster nodes under load, and to get a feel for what managing it under duress would be like.

Thanks for the questions! If I didn't answer anything fully, let me know and I'll try again.
