quickwit-oss / chitchat
Cluster membership protocol with failure detection, inspired by Cassandra and DynamoDB
License: MIT License
This crate looks super-cool; have you given any thought to how to allow nodes to have multiple addresses? For example, in a mixed IPv4 and IPv6 stack world, some nodes may have both an IPv4 address and an IPv6 address, and some may have only one or the other. This is a big problem deploying Cassandra in mixed-stack environments (since its gossip address can only be a single socket address).
I'm running the tests from "chitchat-test" locally on my Windows machine with the following script (trying to mimic the bash version as best I can):
Get-Process "chitchat-test" | Stop-Process
cargo build --release
for ($i = 10000; $i -lt 10100; $i++)
{
$listen_addr = "127.0.0.1:$i";
Write-Host $listen_addr;
Start-Process -NoNewWindow "cargo" -ArgumentList "run --release -- --listen_addr $listen_addr --seed 127.0.0.1:10002 --node_id node_$i"
}
Read-Host
Are the following results expected behavior, or am I maybe running into some windows bugs?
{
"cluster_id": "testing",
"cluster_state": {
"node_state_snapshots": [
{
"chitchat_id": {
"node_id": "node_10000",
"generation_id": 1711736008,
"gossip_advertise_addr": "127.0.0.1:10000"
},
"node_state": {
"chitchat_id": {
"node_id": "node_10000",
"generation_id": 1711736008,
"gossip_advertise_addr": "127.0.0.1:10000"
},
"heartbeat": 2,
"key_values": {},
"max_version": 0,
"last_gc_version": 0
}
}
],
"seed_addrs": [
"127.0.0.1:10002"
]
},
"live_nodes": [
{
"node_id": "node_10000",
"generation_id": 1711736008,
"gossip_advertise_addr": "127.0.0.1:10000"
}
],
"dead_nodes": []
}
The other services on ports greater than the seed's start up fine.
Is any of this expected behavior, or do we think maybe it's Windows-related?
Right now, the delta mtu is expressed in a number of records.
We want it to be bytes (and not exceed 1KB).
We can probably do that with a writer wrapping a buffer, exposing something like:
/// Atomically adds a record to the buffer.
///
/// Returns true if the record was successfully added.
/// Returns false if adding the record would exceed the mtu.
fn add_record(&mut self, peer: &str, key: &str, value: &str) -> bool;
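As a concrete illustration, here is a minimal, self-contained sketch of such a writer, assuming a simple u16-length-prefixed encoding; DeltaWriter and its encoding are hypothetical, not chitchat's actual implementation.

const MTU: usize = 1_024;

struct DeltaWriter {
    buf: Vec<u8>,
}

impl DeltaWriter {
    fn new() -> Self {
        DeltaWriter { buf: Vec::with_capacity(MTU) }
    }

    // Serialized size of a record: three u16-length-prefixed strings.
    fn record_len(peer: &str, key: &str, value: &str) -> usize {
        3 * 2 + peer.len() + key.len() + value.len()
    }

    // Returns false (leaving the buffer untouched) if adding the record
    // would push the serialized delta past the MTU in bytes.
    fn add_record(&mut self, peer: &str, key: &str, value: &str) -> bool {
        if self.buf.len() + Self::record_len(peer, key, value) > MTU {
            return false;
        }
        for field in [peer, key, value] {
            self.buf.extend_from_slice(&(field.len() as u16).to_le_bytes());
            self.buf.extend_from_slice(field.as_bytes());
        }
        true
    }
}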
This is motivated by this PR: quickwit-oss/quickwit#1838
In Quickwit, we need to build members from pairs of (key, value) stored in NodeState. Being able to define a build_from_node_state method on the Member struct would be nice.
This is also consistent with the fact that the ClusterStateSnapshot struct is public.
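For illustration, a hypothetical sketch of what such a constructor could look like; the Member fields and the key name are assumptions based on the snapshots elsewhere on this page, not the actual Quickwit types.

use std::collections::HashMap;

struct Member {
    node_id: String,
    grpc_advertise_addr: Option<String>,
}

impl Member {
    // Builds a Member from the (key, value) pairs gossiped in a NodeState.
    fn build_from_node_state(node_id: &str, key_values: &HashMap<String, String>) -> Member {
        Member {
            node_id: node_id.to_string(),
            grpc_advertise_addr: key_values.get("grpc_advertise_addr").cloned(),
        }
    }
}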
Spotted on a deployment.
Reset nodes are briefly identified as alive. The reason is that we call report_heartbeat in the reset procedure. The purpose of this call is to make sure that we have a record in the failure detector for the node.
Currently, this crate does not do much logging, making it hard to see what's going on under the hood.
Hey there, was wondering how many nodes folks have tested chitchat up to? Not a theoretical maximum, but a known one?
Cheers
Sean
Hey!
Awesome library, thanks for all of the work on it.
Small issue I'm running into when trying to make a custom transport.
The chitchat::serialize module is marked as pub(crate), which means you can't serialize a ChitchatMessage outside of the crate. Since ChitchatMessage isn't marked with Serde's Serialize/Deserialize either, there's actually no way to serialize messages for a custom transport.
This seems like an easy fix unless there's a specific reason not to: just remove the pub(crate) and leave pub on chitchat::serialize.
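To make the need concrete, here is what a custom transport wants to write, mirrored with a local trait since the real one is pub(crate); the trait shape is an assumption based on this issue, not chitchat's confirmed API.

trait Serializable {
    fn serialize(&self, buf: &mut Vec<u8>);
}

// What a custom transport wants to do; today this is impossible because
// the real trait lives in a pub(crate) module.
fn encode<M: Serializable>(msg: &M) -> Vec<u8> {
    let mut buf = Vec::new();
    msg.serialize(&mut buf);
    buf
}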
Following the original paper, chitchat currently first shares nodes with the highest number of stale values.
As a side effect, nodes that are not emitting many KVs are gossiped last.
In Quickwit, under a little bit of load (1,000 indexes on 10 indexers), this has a very dire effect:
indexers that reconnect have to gossip the entire cluster state (~10MB) before being able to get any information about the metastore.
Knowing at least one node with the metastore service is required for nodes to declare themselves live.
Right now scuttlebutt relies on UDP with a hardcoded mtu of 1KB.
Right now, GC of deleted keys happens on all nodes independently.
This works fine, but after a reset, a node may need many gossip round trips to be up to date again.
It might get a last_gc_version of V first, then talk with another peer, be unlucky and get reset to a last_gc_version V', and have this bad luck keep going over and over again.
If instead, writer nodes were emitting their GC version as a KV key, we would probably have one or two last_gc_version values over an entire cluster.
If we updated that key only every 3 minutes or so, we would have the guarantee that a node playing catch-up would manage to do so in at most 2 resets (and usually a single one would suffice).
In addition, all nodes would not have to keep track of delete timestamps. Instead, only the writing node would keep a queue of GC versions.
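A rough, self-contained sketch of the proposal, with the KV store modeled as a plain map; all names here are illustrative rather than chitchat's actual API.

use std::collections::BTreeMap;
use std::time::{Duration, Instant};

struct NodeState {
    key_values: BTreeMap<String, String>,
    last_published: Instant,
}

impl NodeState {
    // Re-publish the GC version at most every 3 minutes, per the idea above,
    // so the value stays stable long enough for catching-up nodes.
    fn maybe_publish_gc_version(&mut self, last_gc_version: u64) {
        if self.last_published.elapsed() >= Duration::from_secs(180) {
            self.key_values
                .insert("last_gc_version".to_string(), last_gc_version.to_string());
            self.last_published = Instant::now();
        }
    }
}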
Right now, one needs to specify their public address.
It can however, in some environments, be a non-trivial thing to know (k8s, using the :0 trick, etc.).
Only the seeds really need to have a known address.
Other nodes can exhibit their public gossip address as they contact a peer.
To avoid dead nodes resurrecting for a few moments on newly joined nodes (because of the initial interval on the failure detector), it's best to exclude dead nodes from the gossip messages (digest, delta).
This can even pave the way for easier garbage collection: when a node is known to all cluster members as dead, its state will no longer be shared, even after we delete it from the cluster state.
Right now we might be updating the heartbeat too often...
Simulation and visualization should help make that decision.
Due to the nature of UDP, the existence of resets, and the fact that we are gossipping with several nodes at the same time, it is possible for obsolete deltas to arrive.
We probably need to implement a leave/join method for a node.
Currently, we have shutdown, which stops the UDP server and consumes the server.
@fulmicoton WDYT? https://github.com/quickwit-oss/quickwit/pull/1164/files#diff-5f02eaf8b40edf6dfb71eab3ea3c59888c42f987d19b1934b9c03bc176f4ec50R191
Right now we pick peers at random amongst all peers... We can probably do a bit better.
We should also avoid the network partition caused by the number of seeds: https://issues.apache.org/jira/browse/CASSANDRA-150
Hello, @fulmicoton!
Could you please explain why you are keeping the NodeState structure private while ClusterState's methods
pub fn node_state(&self, node_id: &NodeId) -> Option<&NodeState> {..}
pub fn node_state_mut(&mut self, node_id: &NodeId) -> &mut NodeState {..}
return NodeState as a result?
Thank you!
Currently, we use marked_for_deletion_grace_period to set a deletion grace period expressed as a number of versions.
This means two things:
- We reset nodes whose max_version is very low compared to others: if node_1.max_version + marked_for_deletion_grace_period < node_state.max_version, we will reset the node state of node_1.
- We garbage collect a key marked for deletion once versioned_value.version + marked_for_deletion_grace_period < max_version.
But this logic is flawed. If we set n keys in a given node... the max_version will increase by at least n. And this is a big issue, as now the garbage collection can happen way faster than a user expects.
We can do two things:
If a node A gets reset after a partition, the node B which sent the reset should have sent some key(s), or a sentinel key if no key is supposed to be alive. If the MTU prevents sending all of these keys, it's possible that another node C then sends A a key with a version high enough to be acceptable (but smaller than the max version B would have sent), even though it has already been deleted from B's point of view.
I'm thinking more of pointers to papers or already existing implementations from which we get inspiration.
I have in mind this paper:
Unit tests are using manually cherry-picked ports that do not overlap... That is a minor papercut, but it might also break unit tests if one of those ports is busy with a different application on the system.
We can probably find a way to rely on port 0 in unit tests and have the OS pick an available port for us.
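A minimal sketch of the port-0 trick with the standard library: let the OS assign a free UDP port, then read back the actual address to hand to the node under test.

use std::net::{SocketAddr, UdpSocket};

// Bind to port 0 so the OS assigns a free port, then read it back.
// Keeping the returned socket bound avoids another process grabbing the port.
fn free_udp_socket() -> std::io::Result<(UdpSocket, SocketAddr)> {
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    let addr = socket.local_addr()?;
    Ok((socket, addr))
}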
Currently, the scuttlebutt messages are not versioned.
Even if these are transient messages, we might want to make changes that do not require stopping and restarting all nodes at the same time.
For the moment, adding a simple version number should be sufficient to be "ready for the future".
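A minimal sketch of that idea: prefix every serialized message with a protocol-version byte and drop datagrams with an unknown version. The constant and framing are illustrative.

const PROTOCOL_VERSION: u8 = 1;

fn frame(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(payload.len() + 1);
    buf.push(PROTOCOL_VERSION);
    buf.extend_from_slice(payload);
    buf
}

fn unframe(datagram: &[u8]) -> Option<&[u8]> {
    match datagram.split_first() {
        Some((&PROTOCOL_VERSION, payload)) => Some(payload),
        _ => None, // unknown version: drop the message
    }
}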
There is already an unrelated crate published at https://crates.io/crates/scuttlebutt.
It has seen no activity on master since 2017, so its maintainer might agree to retract their releases and transfer the name to you. There are no identified dependents, but the crate has been downloaded 22 times in the last 90 days (probably scrapers, but there is no way to tell).
Currently, it's hard to test the different stages a node in a cluster can be in.
It would be nice to have a kind of simulator built around tokio::task
that can start a cluster and simulate packet delay or node failure for testing.
Look into https://github.com/jepsen-io/maelstrom
We need a proper single document explaining what a node identity is.
We have:
- broadcast_rpc_address (using Cassandra terminology): the host:port at which other nodes can reach us.
- node_id: defined in the quickwit.yaml file. Today, if it is missing, we autogenerate it on startup.
If specified, the node_id is something that never changes. We can add a generation number to disambiguate two different runs, as sketched below.
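A sketch of the generation-number idea, stamping a UNIX timestamp at process start; the field names mirror the ChitchatId structs visible in the snapshots on this page.

use std::time::{SystemTime, UNIX_EPOCH};

struct ChitchatId {
    node_id: String,
    generation_id: u64,
    gossip_advertise_addr: String,
}

fn new_identity(node_id: &str, gossip_advertise_addr: &str) -> ChitchatId {
    // A fresh generation per process start disambiguates two runs of the
    // same node_id.
    let generation_id = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before UNIX epoch")
        .as_secs();
    ChitchatId {
        node_id: node_id.to_string(),
        generation_id,
        gossip_advertise_addr: gossip_advertise_addr.to_string(),
    }
}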
A couple of problems to consider:
Currently, we can only test this with unit tests and manual tests. This does not allow us to test with many nodes.
This should probably live in scuttlebutt-test.
Currently, we can only request the list of live nodes from the ScuttleButt instance. It would be nice to provide a way for clients to listen to membership changes, to avoid clients continuously polling the list of live cluster nodes.
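One possible shape for this, sketched with a tokio watch channel (the names are illustrative, not an actual chitchat API): the gossip loop publishes the live-node set on every change, and clients await updates instead of polling.

use std::collections::BTreeSet;
use tokio::sync::watch;

type LiveNodes = BTreeSet<String>;

// The gossip loop holds the Sender and calls `send` whenever the live set
// changes; any number of clients can hold a Receiver.
fn membership_channel() -> (watch::Sender<LiveNodes>, watch::Receiver<LiveNodes>) {
    watch::channel(LiveNodes::new())
}

async fn watch_members(mut rx: watch::Receiver<LiveNodes>) {
    // Wakes up only when the membership actually changes.
    while rx.changed().await.is_ok() {
        let live_nodes = rx.borrow().clone();
        println!("live nodes changed: {live_nodes:?}");
    }
}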
To reproduce:
Create two nodes node_{1,2}.
Restart node 2.
You will end up with two instances of node 2 in the state, both staying in READY mode, despite the heartbeat of the old instance never being updated.
{
"cluster_id": "quickwit-default-cluster",
"self_node_id": "node-1",
"ready_nodes": [
{
"node_id": "node-1",
"generation_id": 1699243351019449000,
"gossip_advertise_addr": "127.0.0.1:7280"
},
{
"node_id": "node-2",
"generation_id": 1699243416590203000,
"gossip_advertise_addr": "127.0.0.1:8280"
}
],
"live_nodes": [],
"dead_nodes": [
{
"node_id": "node-2",
"generation_id": 1699242914104763000,
"gossip_advertise_addr": "127.0.0.1:8280"
}
],
"chitchat_state_snapshot": {
"node_state_snapshots": [
{
"chitchat_id": {
"node_id": "node-1",
"generation_id": 1699243351019449000,
"gossip_advertise_addr": "127.0.0.1:7280"
},
"node_state": {
"heartbeat": 487,
"key_values": {
"enabled_services": {
"value": "janitor,control_plane,indexer,searcher,metastore",
"version": 1,
"tombstone": null
},
"grpc_advertise_addr": {
"value": "127.0.0.1:7281",
"version": 2,
"tombstone": null
},
"indexer_capacity": {
"value": "2000",
"version": 4,
"tombstone": null
},
"indexing_task:cflogs:01HE82PCCWZNHSZKVZSJ2ARB73:_ingest-api-source": {
"value": "1",
"version": 117,
"tombstone": null
},
"indexing_task:otel-logs-v0_6:01HE82P019G3THGA3JK9Q833WA:_ingest-api-source": {
"value": "1",
"version": 118,
"tombstone": null
},
"indexing_task:otel-traces-v0_6:01HE82P02NKPZX17J3304XY4T2:_ingest-api-source": {
"value": "1",
"version": 119,
"tombstone": null
},
"readiness": {
"value": "READY",
"version": 116,
"tombstone": null
}
},
"max_version": 119
}
},
{
"chitchat_id": {
"node_id": "node-2",
"generation_id": 1699242903139100000,
"gossip_advertise_addr": "127.0.0.1:8280"
},
"node_state": {
"heartbeat": 3,
"key_values": {
"enabled_services": {
"value": "indexer,searcher",
"version": 1,
"tombstone": null
},
"grpc_advertise_addr": {
"value": "127.0.0.1:8281",
"version": 2,
"tombstone": null
},
"indexer_capacity": {
"value": "4000",
"version": 4,
"tombstone": null
},
"readiness": {
"value": "NOT_READY",
"version": 3,
"tombstone": null
}
},
"max_version": 4
}
},
{
"chitchat_id": {
"node_id": "node-2",
"generation_id": 1699242914104763000,
"gossip_advertise_addr": "127.0.0.1:8280"
},
"node_state": {
"heartbeat": 492,
"key_values": {
"enabled_services": {
"value": "indexer,searcher,control_plane",
"version": 1,
"tombstone": null
},
"grpc_advertise_addr": {
"value": "127.0.0.1:8281",
"version": 2,
"tombstone": null
},
"indexer_capacity": {
"value": "4000",
"version": 4,
"tombstone": null
},
"indexing_task:cflogs:01HE82PCCWZNHSZKVZSJ2ARB73:_ingest-api-source": {
"value": "1",
"version": 114,
"tombstone": null
},
"indexing_task:otel-logs-v0_6:01HE82P019G3THGA3JK9Q833WA:_ingest-api-source": {
"value": "1",
"version": 115,
"tombstone": null
},
"indexing_task:otel-traces-v0_6:01HE82P02NKPZX17J3304XY4T2:_ingest-api-source": {
"value": "1",
"version": 116,
"tombstone": null
},
"readiness": {
"value": "READY",
"version": 113,
"tombstone": null
}
},
"max_version": 116
}
},
{
"chitchat_id": {
"node_id": "node-2",
"generation_id": 1699243416590203000,
"gossip_advertise_addr": "127.0.0.1:8280"
},
"node_state": {
"heartbeat": 421,
"key_values": {
"enabled_services": {
"value": "control_plane,searcher",
"version": 1,
"tombstone": null
},
"grpc_advertise_addr": {
"value": "127.0.0.1:8281",
"version": 2,
"tombstone": null
},
"readiness": {
"value": "READY",
"version": 45,
"tombstone": null
}
},
"max_version": 45
}
}
],
"seed_addrs": []
}
}
Hi - great library, thank you for sharing. I just had a question that maybe indicates a gap in the documentation.
Is it safe to assume that the live nodes (returned by live_nodes_watcher) will never contain more than one ChitchatId with the same node_id?
Based on this, I think that is the case, but wanted to confirm.
Hi, this project looks awesome, thank you! The license is ambiguous, though. The LICENSE file is MIT, but the source code headers generated by .license_header.txt are commercial AGPL v3.0. Folks probably can't use this project in open source until that ambiguity is clarified.
Opening an issue to track this TODO.
I'm not sure how you can optimize iter_stale_key_values with the max version... If floor_version is greater than or equal to max_version, you don't have to iterate over all values.
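A simplified sketch of that shortcut, with stand-in types for chitchat's:

use std::collections::BTreeMap;

struct VersionedValue {
    value: String,
    version: u64,
}

// Returns the KVs strictly fresher than `floor_version`, skipping the scan
// entirely when the peer is already caught up.
fn stale_key_values<'a>(
    key_values: &'a BTreeMap<String, VersionedValue>,
    max_version: u64,
    floor_version: u64,
) -> Vec<(&'a str, &'a VersionedValue)> {
    if floor_version >= max_version {
        return Vec::new(); // nothing can be stale: no iteration needed
    }
    key_values
        .iter()
        .filter(|(_, vv)| vv.version > floor_version)
        .map(|(key, vv)| (key.as_str(), vv))
        .collect()
}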
Currently, if you have a cluster of nodes in chitchat and a node dies, each time chitchat attempts to gossip with that node, you'll have a fairly vague error logged, which leads to a lot of unhelpful logs being repeated over and over until the node re-joins the cluster.
A possible solution would be to make the error returned by the Transport trait more verbose, so we can distinguish a disconnect from other failures and not log the error if it's a node we already know is dead.
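For instance, a more structured transport error could let the gossip loop stay quiet about known-dead peers; the enum and helper below are illustrative, not chitchat's actual Transport API.

use std::collections::HashSet;
use std::net::SocketAddr;

#[derive(Debug)]
enum TransportError {
    // The peer is unreachable (e.g. connection refused): expected for dead nodes.
    Disconnected(SocketAddr),
    // Anything else is worth logging.
    Other(std::io::Error),
}

fn log_send_error(err: &TransportError, known_dead: &HashSet<SocketAddr>) {
    match err {
        TransportError::Disconnected(addr) if known_dead.contains(addr) => {
            // Known-dead node: stay quiet instead of spamming the logs.
        }
        other => eprintln!("gossip send failed: {other:?}"),
    }
}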
We have two MTU-related bugs in chitchat.
1- The UDP transport has its own definition of the MTU and bounces all messages exceeding 1400 bytes, regardless of what the chitchat MTU was set to.
2- The Syn message does not have any sampling logic. With a low MTU (1400 bytes), we cannot send Syn messages in clusters with a lot of members.
As a quick fix, we remove the configurable MTU and fix it at 60KB.
With that setting, chitchat works ok with hundreds of nodes.
Currently, when a cluster node dies, it is kept in the cluster node list even if the node never comes back.
This can make the cluster node list grow very large for a long-running node.
We want to remove a dead node from the cluster node list at some point.
Currently, during node setup, there is no way to set the initial key-value pairs for that node. Key-values can only be set once the node is already instantiated. In some situations, we want known values to be available on the node instance as soon as the node is instantiated.
Let's have an argument to accept initial key-value pairs when creating a scuttlebutt instance.
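A sketch of what such an argument could look like; the config struct and field names are assumptions, not the crate's actual API.

struct ChitchatConfig {
    node_id: String,
    listen_addr: String,
    seeds: Vec<String>,
    // Applied to the node's own state before the first gossip round.
    initial_key_values: Vec<(String, String)>,
}

fn example_config() -> ChitchatConfig {
    ChitchatConfig {
        node_id: "node_1".to_string(),
        listen_addr: "127.0.0.1:10000".to_string(),
        seeds: vec!["127.0.0.1:10002".to_string()],
        initial_key_values: vec![(
            "grpc_advertise_addr".to_string(),
            "127.0.0.1:7281".to_string(),
        )],
    }
}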
For security reasons, it would be great if chitchat supported encryption.
Consul is using AES-256+GCM for the gossip protocol, and it seems to be implemented in memberlist. See also, Secure Gossip Communication with Encryption.
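For illustration, a sketch of symmetric datagram encryption in that spirit, using the aes-gcm crate with a random 96-bit nonce prepended to each datagram; the crate choice and key handling are assumptions, and key distribution is left out entirely.

use aes_gcm::aead::{Aead, AeadCore, KeyInit, OsRng};
use aes_gcm::{Aes256Gcm, Key};

fn encrypt_datagram(key: &Key<Aes256Gcm>, plaintext: &[u8]) -> Vec<u8> {
    let cipher = Aes256Gcm::new(key);
    // Random 96-bit nonce; AES-GCM nonces must never repeat under one key.
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let ciphertext = cipher.encrypt(&nonce, plaintext).expect("encryption failure");
    // Ship the nonce with the datagram so the receiver can decrypt.
    let mut out = nonce.to_vec();
    out.extend_from_slice(&ciphertext);
    out
}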
The current key GC logic is broken. It relies on the heartbeat, assuming that all keys that were updated before the last heartbeat have been received. This is however incorrect:
the heartbeat is sent with every single delta.
If a delta of a node is partial (meaning it does not contain all of the stale KVs of a node), the delta receiver might update its heartbeat without receiving some anterior KV updates.
For instance, the following scenario leads to a KV reappearing after deletion:
- node1 inserts key K
- key K is replicated to node2
- node1 deletes key K at version V
- some time passes
- node2 receives a partial update not containing key K but updating node1's heartbeat
- the grace period elapses
- node1 GCs key K
More than just being a bug, relying on the heartbeat like this prevents us from using the heartbeat sent in digests to detect which nodes are alive.
Hey there, I am running through some tests with your test application. Four nodes in the cluster. I terminate a single node and bring it back online, and it seems like the time it takes for the terminated node to rejoin the cluster and get off the dead list is highly variable. Sometimes it takes 30 seconds, sometimes 5 minutes, sometimes never?
Maybe I have an improper configuration? If I query the recently terminated node, it shows all nodes live, but the other 3 nodes have trouble knowing the previously terminated node is now live.
I have tried this on the main branch. Any advice would be great. Very cool system.