Comments (11)

madolson commented on July 21, 2024

I like this idea! I've seen issues in the past, especially with cluster mode, where a node ends up being marked as a primary with no data. It seems like the handshake should include the fact that the replica has data and expects this primary to still have data, unless the replica sees the same replication ID (indicating the primary did an intentional FLUSHALL of its own data).

We will need to document the behavior in such a way that an admin knows that the cluster might not self-recover and may require operator intervention.
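A minimal sketch of the replica-side check described above, assuming hypothetical flags (`replica_has_data`, `primary_replid_matches`, `primary_is_empty`) that would in practice come from the PSYNC handshake; this is illustrative only, not the actual Valkey handshake code:

```python
def should_accept_full_resync(replica_has_data: bool,
                              primary_replid_matches: bool,
                              primary_is_empty: bool) -> bool:
    """Replica-side guard (sketch): decide whether a full resync may wipe
    the local dataset."""
    if not replica_has_data:
        # Nothing to protect; a full resync is always acceptable.
        return True
    if primary_replid_matches:
        # Same replication ID: an empty primary here implies an
        # intentional FLUSHALL, so honor it.
        return True
    if primary_is_empty:
        # New replication ID *and* an empty primary: this looks like a
        # primary that restarted without its data. Refuse the resync and
        # flag the node for operator attention instead of wiping data.
        return False
    return True
```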

madolson commented on July 21, 2024

I'd think when replication is reconfigured to change which nodes are primaries and which are replicas that would flush out the old primary state data that could wedge up the cluster, but perhaps there's detail and nuance here you know of that I don't.

It's really just that an operator could have caused the flush. I think we might be able to detect this case as well. In either case, I'm going to throw this in the roadmap.

PingXie commented on July 21, 2024

This is an unsolvable problem since we prioritize availability over consistency,

If we apply the argument of "prioritizing availability over consistency" to this situation, we shouldn't take this proposal but instead let the data flush happen. On the contrary, what is proposed here muddies that stance even further. Users have some availability, but not at the level they would expect (HA is lost). They also have no easy way to recover the data in the replica, since it runs on a different replid. This proposal, if implemented, will always require admin intervention to fix the now-stranded replicas.

I understand that the project was built with a bias towards availability, but I think it has both gone too far and been applied inconsistently. This is a bit of a generalization (since we are discussing replication here), but this practice in general is the reason why the cluster is really hard to manage and why we are still fixing fundamental bugs in it.

ag-TJNII commented on July 21, 2024

We will need to document the behavior in such a way that an admin knows that the cluster might not self-recover and may require operator intervention.

For the real-world cases I've been testing where the nodes come up empty, I'd expect the new node to be promoted to primary (either via Sentinel or via clustering), and the old node that came up empty to be reconfigured as a replica and begin restoring the dataset from another node. In my testing this happens as long as I prevent the original replica(s) from clearing their data out via bad replication before the failover happens. I'd think that when replication is reconfigured to change which nodes are primaries and which are replicas, that would flush out the old primary state data that could wedge up the cluster, but perhaps there's detail and nuance here that you know of and I don't.

PingXie commented on July 21, 2024

This option will always break the replication and require admin intervention, and until then the setup will be running in non-HA mode.

The core problem is that the primary didn't stay "down" long enough to trigger the election reliably. I think a correct solution should explore the following instead:

  1. A primary node needs to keep track of the number of replicas it has across restarts.
  2. A primary node needs to be able to get back to business immediately after a (re)start if it has no replicas, but otherwise needs to wait long enough to ensure an election happens.

In the cluster case, 1) can be achieved via nodes.conf, and cluster_node_timeout can be used for 2). My knowledge about Sentinel is limited, but I guess Sentinel would need to be responsible for both 1) and 2).
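To make the two rules concrete, here is a rough sketch in Python, assuming the previous replica count is recovered from nodes.conf and that `failover_happened` is a hypothetical callback checking cluster state; this illustrates the idea, not the actual Valkey startup path:

```python
import time

def primary_startup_gate(replicas_before_restart: int,
                         cluster_node_timeout_ms: int,
                         failover_happened) -> str:
    """Sketch of the two rules above: a restarting primary that previously
    had replicas waits long enough for an election before serving again."""
    if replicas_before_restart == 0:
        # No replica can take over, so there is nothing to wait for.
        return "serve-as-primary"

    deadline = time.monotonic() + cluster_node_timeout_ms / 1000.0
    while time.monotonic() < deadline:
        if failover_happened():
            # A replica won the election; rejoin the cluster as a replica.
            return "rejoin-as-replica"
        time.sleep(0.1)

    # No failover occurred within the timeout; resume serving as primary.
    return "serve-as-primary"
```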

madolson commented on July 21, 2024

The core problem is that the primary didn't stay "down" long enough to trigger the election reliably.

This is an unsolvable problem since we prioritize availability over consistency, and your proposal is only a partial solution covering a subset of the problems while introducing new race conditions and more failure modes for operators to worry about. It's always possible that the node with data becomes apparently unavailable, so we prioritize electing a node without data. This would be somewhat of a shift away from availability towards consistency, in CAP terms.

madolson commented on July 21, 2024

[Deleted]

I just realized the case I was mentioning only applies to how AWS does it, so that's just a problem that we have. I think Ping is right that you just need to better protect nodes that are coming back empty.

zuiderkwast commented on July 21, 2024

We prioritize availability over consistency to some extent, and I think that's fine as long as the impact is limited. It's fine that we use asynchronous replication and that we can lose a few writes when a primary crashes. WAIT-ing for all replicas can prevent the write from being lost in that case.
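As a concrete illustration of the WAIT pattern mentioned above (a sketch assuming the redis-py client and a placeholder key name; WAIT takes a replica count and a timeout in milliseconds and returns how many replicas acknowledged):

```python
import redis  # assumes the redis-py client, which also speaks RESP to Valkey

r = redis.Redis(host="localhost", port=6379)
r.set("mykey", "value")

# Block until at least 1 replica has acknowledged the preceding writes,
# or until 1000 ms have passed; returns the number of acknowledging replicas.
acked = r.execute_command("WAIT", 1, 1000)
print(f"{acked} replica(s) acknowledged the write")
```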

It's not OK though to lose the whole data set. I want the cluster to prevent this from happening without manual intervention. Ping's ideas make sense to me.

If I understand this correctly, the primary needs to lose its data while keeping its nodes.conf, so it still has its place in the cluster. If so, I can't see why we can't store some more info in nodes.conf to prevent this from happening.

  • Using the existing information in nodes.conf, the primary can see how many replicas it had before restart, so it could wait for a failover (Ping's suggestion). Should it do this only if it starts with no keys?
  • We could store the replication offset in nodes.conf to let the primary detect that it lost its data and trigger (or let it wait for) a failover, as sketched below.
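A minimal sketch of the offset idea in the second bullet, with hypothetical helpers (`load_offset_from_nodes_conf`, `dbsize`) standing in for the real persistence and keyspace APIs; it only shows the detection step, not how a failover would then be triggered:

```python
def lost_data_on_restart(persisted_repl_offset: int,
                         current_key_count: int) -> bool:
    """Sketch: compare the replication offset saved before shutdown with
    the dataset found after restart. A non-zero offset together with an
    empty keyspace suggests the primary came back without its data."""
    return persisted_repl_offset > 0 and current_key_count == 0

# Hypothetical usage at startup:
#   if lost_data_on_restart(load_offset_from_nodes_conf(), dbsize()):
#       wait_for_failover_instead_of_serving()
```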

Another possibility: If the replicas can detect that the primary came back empty-handed, they could escalate/gossip this to the cluster in some way to mark the node as FAIL and trigger a failover.

hwware commented on July 21, 2024

I think in standalone mode we already solve this problem through the master-reboot-down-after-period mymaster setting in sentinel.conf.

The steps:

  1. Node 1 is the primary
  2. Node 2 is replicating node 1
  3. Node 1 drops
  4. Node 1 comes back without its DB, in an empty state
  5. Node 2 reconnects and performs a full resync
  6. Node 2 is now empty, having propagated the data loss

In real production, the gap between step 3 and step 4 cannot be too long, otherwise users will not accept it.
In fact, the gap between steps 3 and 4 is the reboot time, so the master-reboot-down-after-period timing should solve it, because otherwise the replica could not be failed over.
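For context, a hedged sentinel.conf sketch of the setting referenced above (master name, address, quorum, and timings are placeholders; the exact semantics should be checked against the Sentinel documentation):

```
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000

# After detecting that the monitored primary has rebooted, treat it as down
# for this many milliseconds so a replica can be failed over before the
# (possibly empty) primary is trusted again. 0 disables the behavior.
sentinel master-reboot-down-after-period mymaster 60000
```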

zuiderkwast commented on July 21, 2024

@hwware I don't understand how master-reboot-down-after-period can solve this in a Sentinel setup. If the primary has lost its data, it comes up fast. If it still has its data, it takes more time to load its data and become available. It's the first case we want to prevent here.

hwware commented on July 21, 2024

@hwware I don't understand how master-reboot-down-after-period can solve this in a Sentinel setup. If the primary has lost its data, it comes up fast. If it still has its data, it takes more time to load its data and become available. It's the first case we want to prevent here.

In step 4, when node 1 comes back without its DB, the master node has a "reboot" signal. If the master node has no "reboot" signal, it means one of 2 cases:

  1. The master node just lost its connection or cannot respond to other nodes because it is busy, but it should still hold its data.
  2. The master node was shut down for a long time, and I am not sure whether clients could tolerate the master node not working for that long.

Maybe the master-reboot-down-after-period parameter cannot solve all the problems described in the top comment, but at least it could solve a subset of them.
