quickwit-oss / chitchat
Cluster membership protocol with failure detection, inspired by Cassandra and DynamoDB
License: MIT License
This crate looks super-cool; have you given any thought to how to allow nodes to have multiple addresses? For example, in a mixed IPv4 and IPv6 stack world, some nodes may have both an IPv4 address and an IPv6 address, and some may have only one or the other. This is a big problem deploying Cassandra in mixed-stack environments (since its gossip address can only be a single socket address).
I'm running the tests from "chitchat-test" locally on my Windows machine with the following script (trying to mimic the bash version as best I can):
Get-Process "chitchat-test" | Stop-Process
cargo build --release
for ($i = 10000; $i -lt 10100; $i++)
{
$listen_addr = "127.0.0.1:$i";
Write-Host $listen_addr;
Start-Process -NoNewWindow "cargo" -ArgumentList "run --release -- --listen_addr $listen_addr --seed 127.0.0.1:10002 --node_id node_$i"
}
Read-Host
Are the following results expected behavior, or am I maybe running into some windows bugs?
{
"cluster_id": "testing",
"cluster_state": {
"node_state_snapshots": [
{
"chitchat_id": {
"node_id": "node_10000",
"generation_id": 1711736008,
"gossip_advertise_addr": "127.0.0.1:10000"
},
"node_state": {
"chitchat_id": {
"node_id": "node_10000",
"generation_id": 1711736008,
"gossip_advertise_addr": "127.0.0.1:10000"
},
"heartbeat": 2,
"key_values": {},
"max_version": 0,
"last_gc_version": 0
}
}
],
"seed_addrs": [
"127.0.0.1:10002"
]
},
"live_nodes": [
{
"node_id": "node_10000",
"generation_id": 1711736008,
"gossip_advertise_addr": "127.0.0.1:10000"
}
],
"dead_nodes": []
}
The other services on ports greater than the seed's start up fine.
Is any of this expected behavior, or do we think maybe it's Windows-related?
Right now, the delta mtu is expressed in a number of records.
We want it to be bytes (and not exceed 1KB).
We can probably do that with a writer wrapping a buffer, exposing something like:
/// Atomically adds a record to the buffer.
///
/// Returns true if the record was successfully added.
/// Returns false if adding the record would exceed the mtu.
fn add_record(&mut self, peer: &str, key: &str, value: &str) -> bool;
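As a concrete illustration, here is a minimal, self-contained sketch of such a writer, assuming a simple u16-length-prefixed encoding; DeltaWriter and its encoding are hypothetical, not chitchat's actual implementation.

const MTU: usize = 1_024;

struct DeltaWriter {
    buf: Vec<u8>,
}

impl DeltaWriter {
    fn new() -> Self {
        DeltaWriter { buf: Vec::with_capacity(MTU) }
    }

    // Serialized size of a record: three u16-length-prefixed strings.
    fn record_len(peer: &str, key: &str, value: &str) -> usize {
        3 * 2 + peer.len() + key.len() + value.len()
    }

    // Returns false (leaving the buffer untouched) if adding the record
    // would push the serialized delta past the MTU in bytes.
    fn add_record(&mut self, peer: &str, key: &str, value: &str) -> bool {
        if self.buf.len() + Self::record_len(peer, key, value) > MTU {
            return false;
        }
        for field in [peer, key, value] {
            self.buf.extend_from_slice(&(field.len() as u16).to_le_bytes());
            self.buf.extend_from_slice(field.as_bytes());
        }
        true
    }
}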
This is motivated by this PR: quickwit-oss/quickwit#1838
In Quickwit, we need to build members from pairs of (key, value) stored in NodeState. Being able to define a build_from_node_state method on the Member struct would be nice.
This is also consistent with the fact that the ClusterStateSnapshot struct is public.
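For illustration, a hypothetical sketch of what such a constructor could look like; the Member fields and the key name are assumptions based on the snapshots elsewhere on this page, not the actual Quickwit types.

use std::collections::HashMap;

struct Member {
    node_id: String,
    grpc_advertise_addr: Option<String>,
}

impl Member {
    // Builds a Member from the (key, value) pairs gossiped in a NodeState.
    fn build_from_node_state(node_id: &str, key_values: &HashMap<String, String>) -> Member {
        Member {
            node_id: node_id.to_string(),
            grpc_advertise_addr: key_values.get("grpc_advertise_addr").cloned(),
        }
    }
}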
Spotted on a deployment.
Reset nodes are briefly identified as alive. The reason is that we call report_heartbeat in the reset procedure. The purpose of this call is to make sure that we have a record in the failure detector for the node.
Currently, this crate does not do much logging, making it hard to see what's going on under the hood.
Hey there, was wondering how many nodes folks have tested chitchat up to? Not a theoretical maximum, but a known one?
Cheers
Sean
Hey!
Awesome library, thanks for all of the work on it.
Small issue I'm running into when trying to make a custom transport.
The chitchat::serialize module is marked as pub(crate), which means you can't serialize a ChitchatMessage outside of the crate. Since ChitchatMessage isn't marked with Serde's Serialize/Deserialize either, there's actually no way to serialize messages for a custom transport.
This seems like an easy fix unless there's a specific reason not to: just remove the pub(crate) and leave pub on chitchat::serialize.
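To make the need concrete, here is what a custom transport wants to write, mirrored with a local trait since the real one is pub(crate); the trait shape is an assumption based on this issue, not chitchat's confirmed API.

trait Serializable {
    fn serialize(&self, buf: &mut Vec<u8>);
}

// What a custom transport wants to do; today this is impossible because
// the real trait lives in a pub(crate) module.
fn encode<M: Serializable>(msg: &M) -> Vec<u8> {
    let mut buf = Vec::new();
    msg.serialize(&mut buf);
    buf
}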
Following the original paper, chitchat currently first shares nodes with the highest number of stale values.
As a side effect, nodes that are not emitting many KVs are gossiped last.
In Quickwit, under a little bit of load (1,000 indexes on 10 indexers), this has a very dire effect:
indexers that reconnect have to gossip the entire cluster state (~10MB) before being able to get any information about the metastore.
Knowing at least one node with the metastore service is required for nodes to declare themselves live.
Right now scuttlebutt relies on UDP with a hardcoded mtu of 1KB.
Right now, GC of deleted keys happens on all nodes independently.
This works fine, but after a reset, a node may need many gossip round trips to be up to date again.
It might get a last_gc_version of V first, then talk with another peer, be unlucky and get reset to a last_gc_version V', and have this bad luck keep going over and over again.
If instead, writer nodes were emitting their GC version as a KV key, we would probably have one or two last_gc_version values over an entire cluster.
If we updated that key only every 3 minutes or so, we would have the guarantee that a node playing catch-up would manage to do so in at most 2 resets (and usually a single one would suffice).
In addition, all nodes would not have to keep track of delete timestamps. Instead, only the writing node would keep a queue of GC versions.
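A rough, self-contained sketch of the proposal, with the KV store modeled as a plain map; all names here are illustrative rather than chitchat's actual API.

use std::collections::BTreeMap;
use std::time::{Duration, Instant};

struct NodeState {
    key_values: BTreeMap<String, String>,
    last_published: Instant,
}

impl NodeState {
    // Re-publish the GC version at most every 3 minutes, per the idea above,
    // so the value stays stable long enough for catching-up nodes.
    fn maybe_publish_gc_version(&mut self, last_gc_version: u64) {
        if self.last_published.elapsed() >= Duration::from_secs(180) {
            self.key_values
                .insert("last_gc_version".to_string(), last_gc_version.to_string());
            self.last_published = Instant::now();
        }
    }
}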
Right now, one needs to specify their public address.
It can however, in some environments, be a non-trivial thing to know (k8s, using the :0 trick, etc.).
Only the seeds really need to have a known address.
Other nodes can exhibit their public gossip address as they contact a peer.
To avoid dead nodes resurrecting for a few moments on newly joined nodes (because of the initial interval on the failure detector), it's best to exclude dead nodes from the gossip messages (digest, delta).
This can even pave the way for easier garbage collection: when a node is known to all cluster members as dead, its state will no longer be shared, even after we delete it from the cluster state.
Right now we might be updating the heartbeat too often...
Simulation and visualization should help make that decision.
Due to the nature of UDP, the existence of resets, and the fact that we are gossipping with several nodes at the same time, it is possible for obsolete deltas to arrive.
We probably need to implement a leave/join method for a node.
Currently, we have shutdown, which stops the UDP server and consumes the server.
@fulmicoton WDYT? https://github.com/quickwit-oss/quickwit/pull/1164/files#diff-5f02eaf8b40edf6dfb71eab3ea3c59888c42f987d19b1934b9c03bc176f4ec50R191
Right now we pick peers at random amongst all peers... We can probably do a bit better.
We should also avoid the network partition caused by the number of seeds: https://issues.apache.org/jira/browse/CASSANDRA-150
Hello, @fulmicoton!
Could you please explain why you are keeping the NodeState structure private while ClusterState's methods
pub fn node_state(&self, node_id: &NodeId) -> Option<&NodeState> {..}
pub fn node_state_mut(&mut self, node_id: &NodeId) -> &mut NodeState {..}
return NodeState as a result?
Thank you!
Currently, we use marked_for_deletion_grace_period to set a deletion grace period expressed as a number of versions.
This means two things:
- We reset nodes whose max_version is very low compared to others: if node_1.max_version + marked_for_deletion_grace_period < node_state.max_version, we will reset the node state of node_1.
- We garbage collect a key marked for deletion once versioned_value.version + marked_for_deletion_grace_period < max_version.
But this logic is flawed. If we set n keys in a given node... the max_version will increase by at least n. And this is a big issue, as now the garbage collection can happen way faster than a user expects.
We can do two things:
If a node A gets reset after a partition, the node B which sent the reset should have sent some key(s), or a sentinel key if no key is supposed to be alive. If the MTU prevents sending all of these keys, it's possible that another node C then sends A a key with a version high enough to be acceptable (but smaller than the max version B would have sent), even though it has already been deleted from B's point of view.
I'm thinking more of pointers to papers or already existing implementations from which we get inspiration.
I have in mind this paper:
Unit tests are using manually cherry-picked ports that do not overlap... That is a minor papercut, but it might also break unit tests if one of those ports is busy with a different application on the system.
We can probably find a way to rely on port 0 in unit tests and have the OS pick an available port for us.
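A minimal sketch of the port-0 trick with the standard library: let the OS assign a free UDP port, then read back the actual address to hand to the node under test.

use std::net::{SocketAddr, UdpSocket};

// Bind to port 0 so the OS assigns a free port, then read it back.
// Keeping the returned socket bound avoids another process grabbing the port.
fn free_udp_socket() -> std::io::Result<(UdpSocket, SocketAddr)> {
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    let addr = socket.local_addr()?;
    Ok((socket, addr))
}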
Currently, the scuttlebutt messages are not versioned.
Even if these are transient messages, we might want to make changes that do not require stopping and restarting all nodes at the same time.
For the moment, adding a simple version number should be sufficient to be "ready for the future".
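A minimal sketch of that idea: prefix every serialized message with a protocol-version byte and drop datagrams with an unknown version. The constant and framing are illustrative.

const PROTOCOL_VERSION: u8 = 1;

fn frame(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(payload.len() + 1);
    buf.push(PROTOCOL_VERSION);
    buf.extend_from_slice(payload);
    buf
}

fn unframe(datagram: &[u8]) -> Option<&[u8]> {
    match datagram.split_first() {
        Some((&PROTOCOL_VERSION, payload)) => Some(payload),
        _ => None, // unknown version: drop the message
    }
}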
There is already an unrelated crate published at https://crates.io/crates/scuttlebutt.
It has seen no activity on master since 2017, so its maintainer might agree to retract their releases and transfer the name to you. There are no identified dependents, but the crate has been downloaded 22 times in the last 90 days (probably scrapers, but there is no way to tell).
Currently, it's hard to test the different stages a node in a cluster can be in.
It would be nice to have a kind of simulator built around tokio::task
that can start a cluster and simulate packet delay or node failure for testing.
Look into https://github.com/jepsen-io/maelstrom
We need a proper single document explaining what a node identity is.
We have:
- broadcast_rpc_address (using Cassandra terminology): the host:port at which other nodes can reach us.
- node_id: defined in the quickwit.yaml file. Today, if it is missing, we autogenerate it on startup.
If specified, the node_id is something that never changes. We can add a generation number to disambiguate two different runs, as sketched below.
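A sketch of the generation-number idea, stamping a UNIX timestamp at process start; the field names mirror the ChitchatId structs visible in the snapshots on this page.

use std::time::{SystemTime, UNIX_EPOCH};

struct ChitchatId {
    node_id: String,
    generation_id: u64,
    gossip_advertise_addr: String,
}

fn new_identity(node_id: &str, gossip_advertise_addr: &str) -> ChitchatId {
    // A fresh generation per process start disambiguates two runs of the
    // same node_id.
    let generation_id = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before UNIX epoch")
        .as_secs();
    ChitchatId {
        node_id: node_id.to_string(),
        generation_id,
        gossip_advertise_addr: gossip_advertise_addr.to_string(),
    }
}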
A couple of problems to consider:
Currently, we can only test this with unit tests and manual tests. This does not allow us to test with many nodes.
This should probably live in scuttlebutt-test.
Currently, we can only request the list of live nodes from the ScuttleButt instance. It would be nice to provide a way for clients to listen to membership changes, to avoid clients continuously polling the list of live cluster nodes.
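One possible shape for this, sketched with a tokio watch channel (the names are illustrative, not an actual chitchat API): the gossip loop publishes the live-node set on every change, and clients await updates instead of polling.

use std::collections::BTreeSet;
use tokio::sync::watch;

type LiveNodes = BTreeSet<String>;

// The gossip loop holds the Sender and calls `send` whenever the live set
// changes; any number of clients can hold a Receiver.
fn membership_channel() -> (watch::Sender<LiveNodes>, watch::Receiver<LiveNodes>) {
    watch::channel(LiveNodes::new())
}

async fn watch_members(mut rx: watch::Receiver<LiveNodes>) {
    // Wakes up only when the membership actually changes.
    while rx.changed().await.is_ok() {
        let live_nodes = rx.borrow().clone();
        println!("live nodes changed: {live_nodes:?}");
    }
}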
To reproduce:
Create two nodes node_{1,2}.
Restart node 2.
You will end up with two instances of node 2 in the state, both staying in READY mode, despite the heartbeat of the old instance never being updated.
{
"cluster_id": "quickwit-default-cluster",
"self_node_id": "node-1",
"ready_nodes": [
{
"node_id": "node-1",
"generation_id": 1699243351019449000,
"gossip_advertise_addr": "127.0.0.1:7280"
},
{
"node_id": "node-2",
"generation_id": 1699243416590203000,
"gossip_advertise_addr": "127.0.0.1:8280"
}
],
"live_nodes": [],
"dead_nodes": [
{
"node_id": "node-2",
"generation_id": 1699242914104763000,
"gossip_advertise_addr": "127.0.0.1:8280"
}
],
"chitchat_state_snapshot": {
"node_state_snapshots": [
{
"chitchat_id": {
"node_id": "node-1",
"generation_id": 1699243351019449000,
"gossip_advertise_addr": "127.0.0.1:7280"
},
"node_state": {
"heartbeat": 487,
"key_values": {
"enabled_services": {
"value": "janitor,control_plane,indexer,searcher,metastore",
"version": 1,
"tombstone": null
},
"grpc_advertise_addr": {
"value": "127.0.0.1:7281",
"version": 2,
"tombstone": null
},
"indexer_capacity": {
"value": "2000",
"version": 4,
"tombstone": null
},
"indexing_task:cflogs:01HE82PCCWZNHSZKVZSJ2ARB73:_ingest-api-source": {
"value": "1",
"version": 117,
"tombstone": null
},
"indexing_task:otel-logs-v0_6:01HE82P019G3THGA3JK9Q833WA:_ingest-api-source": {
"value": "1",
"version": 118,
"tombstone": null
},
"indexing_task:otel-traces-v0_6:01HE82P02NKPZX17J3304XY4T2:_ingest-api-source": {
"value": "1",
"version": 119,
"tombstone": null
},
"readiness": {
"value": "READY",
"version": 116,
"tombstone": null
}
},
"max_version": 119
}
},
{
"chitchat_id": {
"node_id": "node-2",
"generation_id": 1699242903139100000,
"gossip_advertise_addr": "127.0.0.1:8280"
},
"node_state": {
"heartbeat": 3,
"key_values": {
"enabled_services": {
"value": "indexer,searcher",
"version": 1,
"tombstone": null
},
"grpc_advertise_addr": {
"value": "127.0.0.1:8281",
"version": 2,
"tombstone": null
},
"indexer_capacity": {
"value": "4000",
"version": 4,
"tombstone": null
},
"readiness": {
"value": "NOT_READY",
"version": 3,
"tombstone": null
}
},
"max_version": 4
}
},
{
"chitchat_id": {
"node_id": "node-2",
"generation_id": 1699242914104763000,
"gossip_advertise_addr": "127.0.0.1:8280"
},
"node_state": {
"heartbeat": 492,
"key_values": {
"enabled_services": {
"value": "indexer,searcher,control_plane",
"version": 1,
"tombstone": null
},
"grpc_advertise_addr": {
"value": "127.0.0.1:8281",
"version": 2,
"tombstone": null
},
"indexer_capacity": {
"value": "4000",
"version": 4,
"tombstone": null
},
"indexing_task:cflogs:01HE82PCCWZNHSZKVZSJ2ARB73:_ingest-api-source": {
"value": "1",
"version": 114,
"tombstone": null
},
"indexing_task:otel-logs-v0_6:01HE82P019G3THGA3JK9Q833WA:_ingest-api-source": {
"value": "1",
"version": 115,
"tombstone": null
},
"indexing_task:otel-traces-v0_6:01HE82P02NKPZX17J3304XY4T2:_ingest-api-source": {
"value": "1",
"version": 116,
"tombstone": null
},
"readiness": {
"value": "READY",
"version": 113,
"tombstone": null
}
},
"max_version": 116
}
},
{
"chitchat_id": {
"node_id": "node-2",
"generation_id": 1699243416590203000,
"gossip_advertise_addr": "127.0.0.1:8280"
},
"node_state": {
"heartbeat": 421,
"key_values": {
"enabled_services": {
"value": "control_plane,searcher",
"version": 1,
"tombstone": null
},
"grpc_advertise_addr": {
"value": "127.0.0.1:8281",
"version": 2,
"tombstone": null
},
"readiness": {
"value": "READY",
"version": 45,
"tombstone": null
}
},
"max_version": 45
}
}
],
"seed_addrs": []
}
}
Hi - great library, thank you for sharing. I just had a question that maybe indicates a gap in the documentation.
Is it safe to assume that the live nodes (returned by live_nodes_watcher) will never contain more than one ChitchatId with the same node_id?
Based on this, I think that is the case, but wanted to confirm.
Hi, this project looks awesome, thank you! The license is ambiguous, though. The LICENSE file is MIT, but the source code headers generated by .license_header.txt are commercial AGPL v3.0. Folks probably can't use this project in open source until that ambiguity is clarified.
Opening an issue to track this TODO.
I'm not sure how you can optimize iter_stale_key_values with the max version... If floor_version is greater than or equal to max_version, you don't have to iterate over all values.
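A simplified sketch of that shortcut, with stand-in types for chitchat's:

use std::collections::BTreeMap;

struct VersionedValue {
    value: String,
    version: u64,
}

// Returns the KVs strictly fresher than `floor_version`, skipping the scan
// entirely when the peer is already caught up.
fn stale_key_values<'a>(
    key_values: &'a BTreeMap<String, VersionedValue>,
    max_version: u64,
    floor_version: u64,
) -> Vec<(&'a str, &'a VersionedValue)> {
    if floor_version >= max_version {
        return Vec::new(); // nothing can be stale: no iteration needed
    }
    key_values
        .iter()
        .filter(|(_, vv)| vv.version > floor_version)
        .map(|(key, vv)| (key.as_str(), vv))
        .collect()
}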
Currently, if you have a cluster of nodes in chitchat and a node dies, each time chitchat attempts to gossip with that node, you'll have a fairly vague error logged, which leads to a lot of unhelpful logs being repeated over and over until the node re-joins the cluster.
A possible solution would be to make the error returned by the Transport trait more verbose, so we can distinguish a disconnect from other failures and not log the error if it's a node we already know is dead.
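For instance, a more structured transport error could let the gossip loop stay quiet about known-dead peers; the enum and helper below are illustrative, not chitchat's actual Transport API.

use std::collections::HashSet;
use std::net::SocketAddr;

#[derive(Debug)]
enum TransportError {
    // The peer is unreachable (e.g. connection refused): expected for dead nodes.
    Disconnected(SocketAddr),
    // Anything else is worth logging.
    Other(std::io::Error),
}

fn log_send_error(err: &TransportError, known_dead: &HashSet<SocketAddr>) {
    match err {
        TransportError::Disconnected(addr) if known_dead.contains(addr) => {
            // Known-dead node: stay quiet instead of spamming the logs.
        }
        other => eprintln!("gossip send failed: {other:?}"),
    }
}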
We have two MTU-related bugs in chitchat.
1- The UDP transport has its own definition of the MTU and bounces all messages exceeding 1400 bytes, regardless of what the chitchat MTU was set to.
2- The Syn message does not have any sampling logic. With a low MTU (1400 bytes), we cannot send Syn messages in clusters with a lot of members.
As a quick fix, we remove the configurable MTU and fix it at 60KB.
With that setting, chitchat works ok with hundreds of nodes.
Currently, when a cluster node dies, it is kept in the cluster node list even if the node never comes back.
This can make the cluster node list grow very large for a long-running node.
We want to remove a dead node from the cluster node list at some point.
Currently, during node setup, there is no way to set the initial key-value pairs for that node. Key-values can only be set once the node is already instantiated. In some situations, we want known values to be available on the node instance as soon as the node is instantiated.
Let's have an argument to accept initial key-value pairs when creating a scuttlebutt instance.
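A sketch of what such an argument could look like; the config struct and field names are assumptions, not the crate's actual API.

struct ChitchatConfig {
    node_id: String,
    listen_addr: String,
    seeds: Vec<String>,
    // Applied to the node's own state before the first gossip round.
    initial_key_values: Vec<(String, String)>,
}

fn example_config() -> ChitchatConfig {
    ChitchatConfig {
        node_id: "node_1".to_string(),
        listen_addr: "127.0.0.1:10000".to_string(),
        seeds: vec!["127.0.0.1:10002".to_string()],
        initial_key_values: vec![(
            "grpc_advertise_addr".to_string(),
            "127.0.0.1:7281".to_string(),
        )],
    }
}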
For security reasons, it would be great if chitchat supported encryption.
Consul is using AES-256+GCM for the gossip protocol, and it seems to be implemented in memberlist. See also, Secure Gossip Communication with Encryption.
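For illustration, a sketch of symmetric datagram encryption in that spirit, using the aes-gcm crate with a random 96-bit nonce prepended to each datagram; the crate choice and key handling are assumptions, and key distribution is left out entirely.

use aes_gcm::aead::{Aead, AeadCore, KeyInit, OsRng};
use aes_gcm::{Aes256Gcm, Key};

fn encrypt_datagram(key: &Key<Aes256Gcm>, plaintext: &[u8]) -> Vec<u8> {
    let cipher = Aes256Gcm::new(key);
    // Random 96-bit nonce; AES-GCM nonces must never repeat under one key.
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let ciphertext = cipher.encrypt(&nonce, plaintext).expect("encryption failure");
    // Ship the nonce with the datagram so the receiver can decrypt.
    let mut out = nonce.to_vec();
    out.extend_from_slice(&ciphertext);
    out
}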
The current key GC logic is broken. It relies on the heartbeat, assuming that all keys that were updated before the last heartbeat have been received. This is however incorrect:
the heartbeat is sent with every single delta.
If a delta of a node is partial (meaning it does not contain all of the stale KVs of a node), the delta receiver might update its heartbeat without receiving some anterior KV updates.
For instance, the following scenario leads to a KV reappearing after deletion:
- node1 inserts key K
- key K is replicated to node2
- node1 deletes key K at version V
- some time passes
- node2 receives a partial update not containing key K but updating node1's heartbeat
- the grace period elapses
- node1 GCs key K
More than just being a bug, relying on the heartbeat like this prevents us from using the heartbeat sent in digests to detect which nodes are alive.
Hey there, I am running through some tests with your test application. Four nodes in the cluster. I terminate a single node and bring it back online, and it seems like the time it takes for the terminated node to rejoin the cluster and get off the dead list is highly variable. Sometimes it takes 30 seconds, sometimes 5 minutes, sometimes never?
Maybe I have an improper configuration? If I query the recently terminated node, it shows all nodes live, but the other 3 nodes have trouble knowing the previously terminated node is now live.
I have tried this on the main branch. Any advice would be great. Very cool system.