Comments (11)
I like this idea! I've seen issues in the past, especially with cluster mode, where a node ends up being marked as a primary with no data. It seems like it should likely be part of the handshake that the replica has data and expects this primary to still have data unless it sees the same replication ID (indicating the primary did an intentional flushall of its own data).
We will need to document the behavior in such a way that an admin knows that the cluster might not self-recover and may require operator intervention.
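The handshake idea above could be sketched as a replica-side decision, roughly like this (a minimal sketch with invented names; these are not Valkey's actual handshake fields):

```python
# Hypothetical sketch of the replica-side check suggested above: before
# accepting a full resync, the replica compares the primary's advertised
# replication ID with the one it last saw. All names are illustrative.

def should_accept_full_resync(last_known_replid: str,
                              primary_replid: str,
                              primary_dataset_empty: bool) -> bool:
    """Refuse a full resync from an empty primary unless the replication
    ID matches, which would indicate an intentional FLUSHALL rather than
    a primary that restarted without its data."""
    if not primary_dataset_empty:
        return True  # primary has data; accept the resync as usual
    # Empty primary: only trust it if it proves continuity of history.
    return primary_replid == last_known_replid
```

A replica that returns False here would refuse to flush itself and flag the situation for the operator instead of propagating the data loss.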
from valkey.
> I'd think when replication is reconfigured to change which nodes are primaries and which are replicas that would flush out the old primary state data that could wedge up the cluster, but perhaps there's detail and nuance here you know of that I don't.
It's really just that an operator could have caused the flush. I think we might be able to detect this case as well. In either case, I'm going to throw this into the roadmap.
> This is an unsolvable problem since we prioritize availability over consistency,
If we apply the argument of "prioritizing availability over consistency" to this situation, we shouldn't adopt this proposal but instead let the data flush happen. On the contrary, what is proposed here muddies that stance even further: users have some availability, but not at the level they would expect (HA is lost), and they have no easy way to recover the data in the replica, since it runs on a different replid. This proposal, if implemented, will always require admin intervention to fix the now-stranded replicas.
I understand that the project was built with a bias towards availability, but I think it has gone too far and is applied inconsistently. To generalize a bit (since we are discussing replication here), this practice is a big part of why the cluster is so hard to manage and why we are still fixing fundamental bugs in it.
> We will need to document the behavior in such a way that an admin knows that the cluster might not self-recover and may require operator intervention.
For the real-world cases I've been testing where the nodes come up empty I'd expect the new node to be promoted to the primary (either via Sentinel or via clustering), and the old node that came up empty should be reconfigured to be a replica and begin restoring the dataset from another node. In my testing this happens as long as I prevent the original replica(s) from clearing their data out via bad replication before the failover happens. I'd think when replication is reconfigured to change which nodes are primaries and which are replicas that would flush out the old primary state data that could wedge up the cluster, but perhaps there's detail and nuance here you know of that I don't.
This option will always break replication and require admin intervention, and until then the setup will be running in non-HA mode.
The core problem is that the primary didn't stay "down" long enough for the election to trigger reliably. I think a correct solution should explore the following instead:
1. A primary node needs to keep track of the number of replicas it has across restarts.
2. A primary node needs to be able to get back to business immediately after a (re)start if it has no replicas, but otherwise needs to wait long enough to ensure an election happens.
In the cluster case, 1) can be achieved via nodes.conf, and cluster_node_timeout can be used for 2). My knowledge about Sentinel is limited, but I guess Sentinel would need to be responsible for both 1) and 2).
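The two rules above could be sketched roughly as a startup gate (a minimal sketch; the function and parameter names are invented for illustration, and Valkey's nodes.conf does not currently persist a replica count for this purpose):

```python
# Illustrative sketch of the proposal: an empty primary that previously had
# replicas holds off on serving writes long enough for a failover to happen.
# cluster-node-timeout is used as a stand-in bound on failure detection time.

def startup_gate_seconds(replicas_before_restart: int,
                         dataset_is_empty: bool,
                         cluster_node_timeout_ms: int) -> float:
    """Return how long (in seconds) a restarted primary should refuse
    writes, giving its replicas a chance to win an election."""
    if replicas_before_restart == 0 or not dataset_is_empty:
        # Rule 2, first half: no replicas to protect, or data intact,
        # so get back to business immediately.
        return 0.0
    # Rule 2, second half: empty dataset plus known replicas means we
    # should wait at least one failure-detection window.
    return cluster_node_timeout_ms / 1000.0
```

A real implementation would also need to decide what happens after the wait expires without a failover; this sketch only captures the gating decision.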
> The core problem is that the primary didn't stay "down" long enough to trigger the election reliably.
This is an unsolvable problem, since we prioritize availability over consistency. Your proposal is only a partial solution to a subset of the problems, and it introduces new race conditions and more failure modes for operators to worry about. It's always possible that the node with data becomes apparently unavailable, in which case we prioritize electing a node without data. This proposal shifts somewhat away from availability towards consistency, in CAP terms.
> [Deleted]
I just realized the case I was mentioning only applies to how AWS does it, so that's just a problem that we have. I think Ping is right that you just need better protection against nodes coming back empty.
We prioritize availability over consistency to some extent, and I think that's fine as long as the impact is limited. It's fine that we use asynchronous replication and that we can lose a few writes when a primary crashes. WAIT-ing for all replicas can prevent the write from being lost in that case.
It's not OK though to lose the whole data set. I want the cluster to prevent this from happening without manual intervention. Ping's ideas make sense to me.
If I understand this correctly, the primary needs to lose its data while keeping its nodes.conf, so it still has its place in the cluster. If so, I can't see why we couldn't store some more info in nodes.conf to prevent this from happening.
- Using the existing information in nodes.conf, the primary can see how many replicas it had before restart, so it could wait for a failover (Ping's suggestion). Should it do this only if it starts with no keys?
- We could store the replication offset in nodes.conf to let the primary detect that it lost its data and trigger (or let it wait for) a failover.
Another possibility: If the replicas can detect that the primary came back empty-handed, they could escalate/gossip this to the cluster in some way to mark the node as FAIL and trigger a failover.
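The offset-in-nodes.conf idea could be sketched as a simple restart-time check (hypothetical names; nodes.conf does not store these fields today):

```python
# Sketch of the second bullet above: persist the replication offset in
# nodes.conf, and on restart compare it with the offset implied by the
# dataset that was actually loaded from disk.

def lost_data_after_restart(persisted_offset: int,
                            loaded_offset: int) -> bool:
    """An empty or stale dataset shows up as a loaded offset behind the
    offset that was recorded before the restart."""
    return loaded_offset < persisted_offset
```

A primary that detects this condition could refuse to resume as primary and instead wait for (or trigger) a failover to a replica that still has the data, as suggested above.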
I think in standalone mode we already solve this problem through the master-reboot-down-after-period mymaster parameter in sentinel.conf.
The steps:
- Node 1 is the primary
- Node 2 is replicating node 1
- Node 1 drops
- Node 1 comes back without its DB, in an empty state
- Node 2 reconnects and performs a full resync
- Node 2 is now empty, having propagated the data loss
In real production, the gap between steps 3 and 4 cannot be too long, otherwise users will not accept it.
In fact, the gap between steps 3 and 4 is the reboot time, so tuning master-reboot-down-after-period should solve it: the rebooted primary is treated as down long enough for a replica to be failed over first.
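For reference, the directive under discussion is set in sentinel.conf, roughly like this (the master name and timeout value here are examples):

```
# sentinel.conf (example values; "mymaster" is the monitored primary's name)
#
# Treat the primary as subjectively down for this many milliseconds after
# it reports a reboot, giving replicas a chance to be failed over before
# the possibly-empty primary rejoins. A value of 0 disables the feature.
sentinel master-reboot-down-after-period mymaster 60000
```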
@hwware I don't understand how master-reboot-down-after-period can solve this in a Sentinel setup. If the primary has lost its data, it comes up fast. If it still has its data, it takes more time to load its data and become available. It's the first case we want to prevent here.
> @hwware I don't understand how master-reboot-down-after-period can solve this in a Sentinel setup. If the primary has lost its data, it comes up fast. If it still has its data, it takes more time to load its data and become available. It's the first case we want to prevent here.
In step 4, when node 1 comes back without its DB, the primary carries a "reboot" signal. If the primary has no "reboot" signal, it means one of two cases:
- The primary just lost connectivity, or could not respond to other nodes because it was busy, but it should still hold its data.
- The primary was shut down for a long time, and I am not sure clients could tolerate the primary not working for that long.
Maybe the master-reboot-down-after-period parameter cannot solve all the problems in the top comment, but at least it could solve a subset of them.