chitchat's People

Contributors

canardleteer, chrisduerr, densone, evanxg852000, fmassot, fulmicoton, getong, guilload, ianterrell, irevoire, raphaelmarinier, trinity-1686a, xvello

chitchat's Issues

multiple addresses?

This crate looks super-cool; have you given any thought to how to allow nodes to have multiple addresses? For example, in a mixed IPv4 and IPv6 stack world, some nodes may have both an IPv4 address and an IPv6 address, and some may have only one or the other. This is a big problem when deploying Cassandra in mixed-stack environments (since its gossip address can only be a single socket address).

Weird behavior when running tests

I'm running the tests from "chitchat-test" locally on my Windows machine with the following script (trying to mimic the bash version as best I can):

Get-Process "chitchat-test" | Stop-Process

cargo build --release

for ($i = 10000; $i -lt 10100; $i++)
{
    $listen_addr = "127.0.0.1:$i";
    Write-Host $listen_addr;

    Start-Process -NoNewWindow "cargo" -ArgumentList "run --release -- --listen_addr $listen_addr --seed 127.0.0.1:10002 --node_id node_$i"
}

Read-Host

Are the following results expected behavior, or am I maybe running into some windows bugs?

  1. The services started at 127.0.0.1:10000 and 127.0.0.1:10001 (before the seed address port) never receive any peers, and are stuck with a rather bare "state", regardless of how long I leave it running:
{
  "cluster_id": "testing",
  "cluster_state": {
    "node_state_snapshots": [
      {
        "chitchat_id": {
          "node_id": "node_10000",
          "generation_id": 1711736008,
          "gossip_advertise_addr": "127.0.0.1:10000"
        },
        "node_state": {
          "chitchat_id": {
            "node_id": "node_10000",
            "generation_id": 1711736008,
            "gossip_advertise_addr": "127.0.0.1:10000"
          },
          "heartbeat": 2,
          "key_values": {},
          "max_version": 0,
          "last_gc_version": 0
        }
      }
    ],
    "seed_addrs": [
      "127.0.0.1:10002"
    ]
  },
  "live_nodes": [
    {
      "node_id": "node_10000",
      "generation_id": 1711736008,
      "gossip_advertise_addr": "127.0.0.1:10000"
    }
  ],
  "dead_nodes": []
}

The other services on ports greater than the seed start up fine.

  2. If I select a random process and kill it, all heartbeats stop (even for the services which received no peers). The state is still readable as JSON with a GET request to 127.0.0.1:10XXX, but the heartbeat does not increment, and no new peers are registered if I try adding them after the stoppage.

Is any of this expected behavior, or do we think maybe it's Windows-related?

Write delta updates with an MTU expressed in bytes

Right now, the delta MTU is expressed in a number of records.
We want it to be expressed in bytes (and not exceed 1KB).

We can probably do that with a writer wrapping a buffer:

/// Atomically adds a record to the buffer,
///
/// Returns true if the record was successfully added
/// Returns false if adding the record would exceed the mtu.
fn add_record(peer: &str, key: &str, value: &str) -> bool;
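
A minimal sketch of what such a wrapper could look like (illustrative only; the length-prefixed encoding below is a stand-in for chitchat's real serialization):

struct DeltaWriter {
    mtu: usize,
    buffer: Vec<u8>,
}

impl DeltaWriter {
    fn new(mtu: usize) -> Self {
        DeltaWriter { mtu, buffer: Vec::new() }
    }

    /// Atomically adds a record to the buffer.
    /// Returns false (and leaves the buffer untouched) if adding the record
    /// would make the buffer exceed the MTU in bytes.
    fn add_record(&mut self, peer: &str, key: &str, value: &str) -> bool {
        // 2-byte length prefix per field, then the field bytes.
        let record_len = 3 * 2 + peer.len() + key.len() + value.len();
        if self.buffer.len() + record_len > self.mtu {
            return false;
        }
        for field in [peer, key, value] {
            self.buffer.extend_from_slice(&(field.len() as u16).to_le_bytes());
            self.buffer.extend_from_slice(field.as_bytes());
        }
        true
    }
}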

Make the `NodeState` struct public

This is motivated by this PR: quickwit-oss/quickwit#1838

In Quickwit, we need to build members from pairs of (key, value) stored in NodeState. Being able to define a build_from_node_state method on the Member struct would be nice.

This is also consistent with the fact that the ClusterStateSnapshot struct is public.
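
For illustration, a rough sketch of what Quickwit could then do, assuming a public NodeState exposing a getter along the lines of NodeState::get(key) -> Option<&str> (the getter name and the Member fields below are assumptions, not confirmed API):

use chitchat::NodeState;

struct Member {
    node_id: String,
    grpc_advertise_addr: String,
}

impl Member {
    fn build_from_node_state(node_id: &str, node_state: &NodeState) -> Option<Member> {
        // Read the (key, value) pairs gossiped by the node.
        let grpc_advertise_addr = node_state.get("grpc_advertise_addr")?.to_string();
        Some(Member {
            node_id: node_id.to_string(),
            grpc_advertise_addr,
        })
    }
}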

Avoid identifying reset dead nodes as alive

Spotted on a deployment.

Reset nodes are briefly identified as alive.
The reason is that we call report_heartbeat in the reset procedure.

The purpose of this call is to make sure that we have a record in the failure detector for the node.

Add more logging

Currently, this crate does not do much logging, making it hard to see what's going on under the hood.

Max nodes

Hey there, I was wondering how many nodes folks have tested chitchat up to? Not a theoretical limit, but a known maximum?

Cheers

Sean

Custom transports don't have access to serialize

Hey! 👋

Awesome library, thanks for all of the work on it.

Small issue I'm running into when trying to make a custom transport.

  • The chitchat::serialize module is marked as pub(crate), which means you can't serialize a ChitchatMessage outside of the crate.
  • Additionally, since ChitchatMessage isn't marked as Serde's Serialize/Deserialize, there's actually no way to serialize messages for a custom transport.

This seems like an easy fix unless there's a specific reason not to: just replace the pub(crate) with pub on chitchat::serialize.
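
For illustration, here is what the second point would enable if ChitchatMessage derived serde's Serialize/Deserialize (it currently does not, which is exactly the problem); bincode is just one example format:

use chitchat::ChitchatMessage;

// Hypothetical: only compiles if ChitchatMessage implements serde's traits.
fn encode(msg: &ChitchatMessage) -> Vec<u8> {
    bincode::serialize(msg).expect("serializing a gossip message should not fail")
}

fn decode(bytes: &[u8]) -> Option<ChitchatMessage> {
    bincode::deserialize(bytes).ok()
}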

Change in chitchat gossiping priority.

Following the original paper, chitchat currently first shares nodes with the highest number of stale values.

As a side effect, nodes that are not emitting many KVs are gossiped last.
In Quickwit, under a little bit of load (1,000 indexes on 10 indexers), this has a very dire effect.

Indexers that reconnect have to gossip the entire cluster state (~10MB) before being able to get any information about the metastore.

Knowing at least one node with the metastore service is required for nodes to declare themselves as live.

Have the writer node emit its own stable version (aka last_gc_version today)

Right now, GC of deleted keys happens on all nodes independently.

This works fine, but after a reset, a node may need many gossip round trips to be up to date again.
It might get a last_gc_version of V first, then talk with another peer, be unlucky and get reset to a last_gc_version V',
and have this bad luck repeat over and over again.

If instead writer nodes were emitting the GC version as a KV key, we would probably have one or two distinct last_gc_version values over an entire cluster.

If we updated that key only every 3 minutes or so, we would have the guarantee that a node playing catch-up would be able to do so in at most 2 resets (and usually a single one would suffice).

In addition, all nodes would not have to keep track of delete timestamps. Instead, only the writing node would keep a queue of GC versions.
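
A sketch of the proposal (the key name and the NodeState::set call are illustrative assumptions, not confirmed chitchat API): the writer republishes its GC version as an ordinary key-value on a slow cadence, so a node catching up reads it like any other key.

use std::time::{Duration, Instant};

// Hypothetical helper: republish the writer's GC version every ~3 minutes
// as a plain KV, instead of having every node track delete timestamps.
fn maybe_publish_gc_version(
    node_state: &mut chitchat::NodeState,
    last_gc_version: u64,
    last_published: &mut Instant,
) {
    if last_published.elapsed() >= Duration::from_secs(180) {
        // Assumes a public setter along the lines of NodeState::set(key, value).
        node_state.set("last_gc_version", last_gc_version.to_string());
        *last_published = Instant::now();
    }
}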

Autodetect gossip public address

Right now, a node needs to specify its own public address.
It can however be a non-trivial thing to know in some environments (k8s, using the :0 trick, etc.).

Only the seeds really need to have a known address.
Other nodes can exhibit their public gossip address as they contact a peer.

Avoid gossiping about dead nodes

To avoid dead nodes resurrecting for a few moments on newly joined nodes (because of the initial interval of the failure detector), it's best to exclude dead nodes from the gossip messages (digest, delta).
This can even pave the way for easier garbage collection: when a node is known to all cluster members as dead, its state will no longer be shared, even once we delete it from the cluster state.

Delta validation

Due to the nature of UDP, the existence of resets, and the fact that we are
gossiping to several nodes at the same time, it is possible for obsolete
deltas to arrive.

NodeState struct visibility

Hello, @fulmicoton!
Could you please explain why you are keeping the NodeState structure private while ClusterState's methods

pub fn node_state(&self, node_id: &NodeId) -> Option<&NodeState> {..}
pub fn node_state_mut(&mut self, node_id: &NodeId) -> &mut NodeState  {..}

return NodeState as a result?

Thank you!

Parameter `marked_for_deletion_grace_period` needs to be reviewed

Currently, we use marked_for_deletion_grace_period to set a deletion grace period expressed as a number of versions.

This means two things:

  1. an old node that did not communicate with other nodes will have its max_version very low compared to others. If node_1.max_version + marked_for_deletion_grace_period < node_state.max_version, we will reset the node state of node_1.
  2. a key marked for deletion will be deleted if versioned_value.version + marked_for_deletion_grace_period < max_version.

But this logic is flawed. If we set n keys in a given node... the max_version will be at least n. And this is a big issue, as the garbage collection can now happen way faster than a user expects.
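
A small illustration of why the version-based grace period collapses under load (all numbers made up for the example):

fn grace_period_example() {
    // The user intends "10,000 versions" as a comfortable grace period...
    let marked_for_deletion_grace_period: u64 = 10_000;
    // ...but a busy node that sets 50,000 keys pushes its max_version to at least 50,000,
    let max_version: u64 = 50_000;
    // so a key deleted at version 1,000 satisfies the GC predicate almost immediately,
    // far sooner than the user expects.
    let deleted_key_version: u64 = 1_000;
    assert!(deleted_key_version + marked_for_deletion_grace_period < max_version);
}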

We can do two things:

  • separate the heartbeat version from the other keys, so this parameter can be used safely for case 1)
  • and then we still need to handle the GC of keys marked for deletion. I'm not sure what we should do; setting a high value could be good enough.

a node can think a key is still alive despite it having been deleted, after a partition

If a node A gets reset after a partition, the node B which sent the reset should have sent some key(s), or a sentinel key if no key is supposed to be alive. If the MTU prevents sending all these keys, it's possible that another node C then sends to A a key with a version high enough to be acceptable (but smaller than the max version B would have sent), which has already been deleted from B's point of view.

ports in unit tests.

Unit tests are using manually cherry-picked ports that do not overlap... That is a paper cut, but it might also break unit tests if one of these ports is already busy with a different application on the system.

We can probably find a way to rely on port 0 in unit tests and have the OS pick an available port for us.
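
A sketch of the port-0 approach using only std: bind to port 0, let the OS pick a free port, then read back the address that was actually bound and use it as the gossip address for the test.

use std::net::UdpSocket;

#[test]
fn gossip_test_with_os_assigned_port() {
    let socket = UdpSocket::bind("127.0.0.1:0").expect("bind should succeed");
    // The OS picked a currently-free port; no cherry-picking needed.
    let gossip_addr = socket.local_addr().expect("bound socket has an address");
    println!("gossip address for this test: {gossip_addr}");
}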

Add versioning to messages.

Currently the scuttlebutt messages are not versioned.

Even if these are transient messages, we might want to make changes that do not require stopping and restarting all nodes at the same time.
For the moment, adding a simple version number should be sufficient to be "ready for the future".
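
One possible shape for this, sketched below (not chitchat's actual wire format): prefix every serialized message with a protocol version byte and drop messages whose version we do not understand.

const PROTOCOL_VERSION: u8 = 1;

fn encode_with_version(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(1 + payload.len());
    buf.push(PROTOCOL_VERSION);
    buf.extend_from_slice(payload);
    buf
}

fn decode_with_version(bytes: &[u8]) -> Option<&[u8]> {
    match bytes.split_first() {
        // Known version: hand the payload to the regular deserializer.
        Some((&PROTOCOL_VERSION, payload)) => Some(payload),
        // Unknown version or empty datagram: drop it.
        _ => None,
    }
}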

Scuttlebutt name is already taken on crates.io

There is already an unrelated crate published at https://crates.io/crates/scuttlebutt.

  • As this crate has had no release or commit on master since 2017, its maintainer might agree to retract their releases and transfer the name to you. There are no identified dependents, but the crate has been downloaded 22 times in the last 90 days (probably scrapers, but there is no way to tell).
  • Otherwise, you will need to rename the crate before publishing it. The sooner is probably the better, to reduce the number of dead links this will create.

Sort out and document our philosophy on what the identity of a node is

We need a proper single document explaining what a node's identity is.

We have

  • the broadcast_rpc_address: (using Cassandra terminology) the host:port at which other nodes can reach us.
  • the node_id: defined in the quickwit.yaml file. Today, if it is missing, we autogenerate it on startup.

If specified, the node_id is something that never changes. We can add a generation number to disambiguate two different runs.

A couple of problems to consider:

  • between two runs, we probably want to start fresh on the local scuttlebutt state. We don't want other nodes to gossip to us stuff from the past.
  • between two runs, we want nodes to know that our new stuff is new... even if they have a version number that is higher.
  • nodes do not know their broadcast_rpc_address.
  • we probably want to use the node_id for the cache placement. In other words, there are some benefits to keeping it "sticky". If a node leaves and rejoins, it is a good thing if it takes back its original place in the cache placement logic.

Add integration tests

Currently, we can only test this with unit tests and manual tests. This does not allow us to test with many nodes.
We should:

  • Refactor layout
  • Set up CI for checking: license, lint, test
  • Add integration tests for the executable scuttlebutt-test

Broadcast membership changes

Currently, we can only request the list of live nodes from the ScuttleButt instance. It would be nice to provide a way for clients to listen to membership changes, in order to avoid clients continuously polling the list of live cluster nodes.
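
A sketch of the kind of API this is asking for, using a tokio watch channel (illustrative, not ScuttleButt's actual interface): the gossip loop publishes the live node set, and clients await changes instead of polling.

use std::collections::HashSet;
use tokio::sync::watch;

struct MembershipBroadcaster {
    tx: watch::Sender<HashSet<String>>,
    // Keep one receiver alive so publishing never fails, even with no subscribers yet.
    _rx: watch::Receiver<HashSet<String>>,
}

impl MembershipBroadcaster {
    fn new() -> Self {
        let (tx, _rx) = watch::channel(HashSet::new());
        MembershipBroadcaster { tx, _rx }
    }

    /// Called by the gossip loop whenever the set of live nodes changes.
    fn publish(&self, live_nodes: HashSet<String>) {
        let _ = self.tx.send(live_nodes);
    }

    /// Clients await `.changed()` on this receiver instead of polling.
    fn subscribe(&self) -> watch::Receiver<HashSet<String>> {
        self.tx.subscribe()
    }
}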

Generation increase in chitchat does not really clean up the state

To reproduce.

Create two nodes node_{1,2}.
Restart node 2.

You will end up with two node-2 entries in the state, both staying in READY mode despite the heartbeat never being updated.

{
  "cluster_id": "quickwit-default-cluster",
  "self_node_id": "node-1",
  "ready_nodes": [
    {
      "node_id": "node-1",
      "generation_id": 1699243351019449000,
      "gossip_advertise_addr": "127.0.0.1:7280"
    },
    {
      "node_id": "node-2",
      "generation_id": 1699243416590203000,
      "gossip_advertise_addr": "127.0.0.1:8280"
    }
  ],
  "live_nodes": [],
  "dead_nodes": [
    {
      "node_id": "node-2",
      "generation_id": 1699242914104763000,
      "gossip_advertise_addr": "127.0.0.1:8280"
    }
  ],
  "chitchat_state_snapshot": {
    "node_state_snapshots": [
      {
        "chitchat_id": {
          "node_id": "node-1",
          "generation_id": 1699243351019449000,
          "gossip_advertise_addr": "127.0.0.1:7280"
        },
        "node_state": {
          "heartbeat": 487,
          "key_values": {
            "enabled_services": {
              "value": "janitor,control_plane,indexer,searcher,metastore",
              "version": 1,
              "tombstone": null
            },
            "grpc_advertise_addr": {
              "value": "127.0.0.1:7281",
              "version": 2,
              "tombstone": null
            },
            "indexer_capacity": {
              "value": "2000",
              "version": 4,
              "tombstone": null
            },
            "indexing_task:cflogs:01HE82PCCWZNHSZKVZSJ2ARB73:_ingest-api-source": {
              "value": "1",
              "version": 117,
              "tombstone": null
            },
            "indexing_task:otel-logs-v0_6:01HE82P019G3THGA3JK9Q833WA:_ingest-api-source": {
              "value": "1",
              "version": 118,
              "tombstone": null
            },
            "indexing_task:otel-traces-v0_6:01HE82P02NKPZX17J3304XY4T2:_ingest-api-source": {
              "value": "1",
              "version": 119,
              "tombstone": null
            },
            "readiness": {
              "value": "READY",
              "version": 116,
              "tombstone": null
            }
          },
          "max_version": 119
        }
      },
      {
        "chitchat_id": {
          "node_id": "node-2",
          "generation_id": 1699242903139100000,
          "gossip_advertise_addr": "127.0.0.1:8280"
        },
        "node_state": {
          "heartbeat": 3,
          "key_values": {
            "enabled_services": {
              "value": "indexer,searcher",
              "version": 1,
              "tombstone": null
            },
            "grpc_advertise_addr": {
              "value": "127.0.0.1:8281",
              "version": 2,
              "tombstone": null
            },
            "indexer_capacity": {
              "value": "4000",
              "version": 4,
              "tombstone": null
            },
            "readiness": {
              "value": "NOT_READY",
              "version": 3,
              "tombstone": null
            }
          },
          "max_version": 4
        }
      },
      {
        "chitchat_id": {
          "node_id": "node-2",
          "generation_id": 1699242914104763000,
          "gossip_advertise_addr": "127.0.0.1:8280"
        },
        "node_state": {
          "heartbeat": 492,
          "key_values": {
            "enabled_services": {
              "value": "indexer,searcher,control_plane",
              "version": 1,
              "tombstone": null
            },
            "grpc_advertise_addr": {
              "value": "127.0.0.1:8281",
              "version": 2,
              "tombstone": null
            },
            "indexer_capacity": {
              "value": "4000",
              "version": 4,
              "tombstone": null
            },
            "indexing_task:cflogs:01HE82PCCWZNHSZKVZSJ2ARB73:_ingest-api-source": {
              "value": "1",
              "version": 114,
              "tombstone": null
            },
            "indexing_task:otel-logs-v0_6:01HE82P019G3THGA3JK9Q833WA:_ingest-api-source": {
              "value": "1",
              "version": 115,
              "tombstone": null
            },
            "indexing_task:otel-traces-v0_6:01HE82P02NKPZX17J3304XY4T2:_ingest-api-source": {
              "value": "1",
              "version": 116,
              "tombstone": null
            },
            "readiness": {
              "value": "READY",
              "version": 113,
              "tombstone": null
            }
          },
          "max_version": 116
        }
      },
      {
        "chitchat_id": {
          "node_id": "node-2",
          "generation_id": 1699243416590203000,
          "gossip_advertise_addr": "127.0.0.1:8280"
        },
        "node_state": {
          "heartbeat": 421,
          "key_values": {
            "enabled_services": {
              "value": "control_plane,searcher",
              "version": 1,
              "tombstone": null
            },
            "grpc_advertise_addr": {
              "value": "127.0.0.1:8281",
              "version": 2,
              "tombstone": null
            },
            "readiness": {
              "value": "READY",
              "version": 45,
              "tombstone": null
            }
          },
          "max_version": 45
        }
      }
    ],
    "seed_addrs": []
  }
}

Docs: Invariants on BTreeMap<ChitchatId, NodeState> (live_nodes)

Hi - great library, thank you for sharing. I just had a question that maybe indicates a gap in the documentation.

Is it safe to assume that the live nodes (returned by live_nodes_watcher) will never contain more than 1 ChitchatId with the same node_id?

Based on this I think that is the case but wanted to confirm.

Conflicting MIT and AGPL licenses

Hi, this project looks awesome, thank you 🎉! The license is ambiguous, though. The LICENSE file is MIT but the source code headers generated by .license_header.txt are commercial AGPL v3.0. Folks probably can't use this project in open source until that ambiguity is clarified.

Adjust logging of gossip errors to make it less annoying.

Currently, if you have a cluster of nodes in chitchat and a node dies, every time chitchat attempts to gossip with that node a fairly vague error is logged, which leads to a lot of unhelpful log lines being repeated over and over until the node re-joins the cluster.

A possible solution would be to make the error returned by the Transport trait more verbose, so we can distinguish a disconnect from other failures and skip logging the error if it's a node we already know is dead.
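
One way to make the error more telling, sketched below (a hypothetical error type, not the current Transport API): distinguish an unreachable peer from other failures so the gossip loop can stay quiet about nodes it already knows are dead.

use std::net::SocketAddr;

#[derive(Debug)]
enum TransportError {
    /// The peer did not answer or refused the connection: expected for dead nodes.
    PeerUnreachable(SocketAddr),
    /// Anything else is unexpected and worth logging.
    Other(String),
}

fn log_gossip_error(err: &TransportError, peer_is_known_dead: bool) {
    match err {
        TransportError::PeerUnreachable(_) if peer_is_known_dead => {
            // The failure detector already flagged this node: skip the noisy log line.
        }
        other => eprintln!("gossip error: {other:?}"),
    }
}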

MTU asymmetry / Low MTU not supported

We have two MTU-related bugs in chitchat.

1- The UDP transport has its own definition of the MTU and bounces all messages exceeding 1400 bytes, regardless of what the chitchat MTU was set to.
2- The Syn messages do not have any sampling logic. With a low MTU (1400 bytes), we cannot send Syn messages in clusters with a lot of members.

As a quick fix, we remove the configurable MTU and fix it to 60KB.
With that setting, chitchat works OK with hundreds of nodes.

Remove dead node from the cluster node list eventually

Currently, when a cluster node dies, it is kept in the cluster node list even if it never comes back.
This can make the cluster node list grow very large for a long-running node.
We want to remove dead nodes from the cluster node list at some point.

Add initial key values

Currently, during node setup, there is no way to set the initial key-value pairs for that node. Key-values can only be set once the node is already instantiated. In some situations, we want known values to be available on the node instance as soon as the node is instantiated.
Let's have an argument to accept initial key-value pairs when creating a scuttlebutt instance.
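
A sketch of the requested API (the initial_key_values argument and the spawn_scuttlebutt name are hypothetical; key names are taken from the examples above):

fn main() {
    // Hypothetical usage: pass initial key-values at construction time so they
    // are gossiped from the very first round.
    let initial_key_values: Vec<(String, String)> = vec![
        ("grpc_advertise_addr".to_string(), "127.0.0.1:7281".to_string()),
        ("enabled_services".to_string(), "indexer,searcher".to_string()),
    ];
    // ...and hand them to the constructor, e.g. something along the lines of
    // `spawn_scuttlebutt(config, initial_key_values, transport)`.
    let _ = initial_key_values;
}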

Key GC logic is broken

The current key GC logic is broken. It relies on heartbeat, assuming that all keys that were updated before the last heartbeat have been received. This is however incorrect.

The heartbeat is sent with every single delta.
If a delta of a node is partial (meaning it does not contain all of the stale KVs of that node), the receiver might update that node's heartbeat without having received some earlier KV updates.

For instance, the following scenario leads to a KV reappearing after deletion.

- node1 inserts key K
- key K is replicated to node 2
- node1 deletes key K at version V
- some time passes
- node2 receives a partial update that does not contain key K but updates node1's heartbeat
- grace period is elapsed
- node1 GC key K

More than just being a bug, relying on the heartbeat like this prevents us from using the heartbeat sent in digests to
discover and detect which nodes are alive.

Time it takes for a node to rejoin the cluster

Hey there, I am running through some tests with your test application. Four nodes in the cluster. I terminate a single node and bring it back online, and it seems like the time it takes for the terminated node to rejoin the cluster and no longer be on the dead list is highly variable. Sometimes it will take 30 seconds, sometimes 5 minutes, sometimes never?

Maybe I am making an improper configuration? If I query the recently terminated node, it shows all nodes live, but the other 3 nodes have trouble knowing that the previously terminated node is now live.

I have tried this on the main branch. Any advice would be great. Very cool system.
