These are actually questions, not issues. I can't seem to find any info on this.

Questions: Read selection strategy and failing replica node handling about fred.rs HOT 5 CLOSED

aembke commented on September 13, 2024

Questions: Read selection strategy and failing replica node handling

from fred.rs.

Comments (5)

aembke commented on September 13, 2024

So currently Fred doesn't use replica nodes at all. I have a big TODO in the code to handle this in the future, but for now the library ignores them.

There are some big gotchas with Redis and replicas due to the fact that replication is asynchronous, and for my purposes at least at work this is a problem. Redis recently added the WAIT command to deal with some of this, but frankly I'm not a fan of that strategy. There are a number of distributed systems folks that have written better blog posts, etc, on this, so I won't repeat it here.

I do plan on adding replica node support for reads in the future, but it's not likely something I'm going to get to for a bit. As you correctly point out, this significantly complicates the failure mode scenarios. If you do need that today you can always point a centralized client at a replica and ensure that you only use read-only commands, but that's not a great solution.

Also, if you use the sentinel interface you can automatically fail over all commands/connections to a replica, though that's mostly just dodging your question.

My initial thoughts on this from a priority standpoint are that pointing reads at replicas is only really useful when you use a cluster with cross-AZ replication, such that the cost of adding cluster nodes is quite high because every added node multiplies your costs through its replicas in other AZs. In that case it makes sense to try to direct read load to replicas, assuming you're ok with occasional consistency issues. However, if you're using a centralized deployment and trying to use replication for load balancing purposes then you're almost always going to be better off moving to a cluster. That's just my opinion based on my experience though, so take it with a grain of salt. The use case I outlined is very real for me at work, but seems a bit uncommon for most people, so I pushed read replica support out a ways in my plans for this library.

From an implementation standpoint here are the open questions I had noted for this, and why it's maybe more complicated than it looks.

Identifying commands as read vs write is pretty easy.
The connection management is complicated. You pointed out the failure mode scenario, but the happy path is also complicated.
There are potential consistency issues that come with this.
Cluster rebalancing becomes more complicated.
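To make the first item above concrete, here is a minimal sketch of read vs write classification and the routing decision that follows from it. This is not fred's API or implementation: `is_read_only`, `Target`, and `route` are hypothetical names, and the command list is a small illustrative subset. A real client would more likely rely on the `readonly` flag that the server reports in its `COMMAND` output.

```rust
/// Where a command should be sent. Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
enum Target {
    Primary,
    Replica,
}

/// Hypothetical helper, not part of fred: returns true if `command` is in a
/// small hard-coded set of read-only Redis commands.
fn is_read_only(command: &str) -> bool {
    // Illustrative subset only; this list is an assumption, not exhaustive.
    const READ_ONLY: &[&str] = &[
        "GET", "MGET", "EXISTS", "TTL", "HGET", "HGETALL", "LRANGE", "SMEMBERS",
    ];
    let name = command.trim();
    READ_ONLY.iter().any(|c| c.eq_ignore_ascii_case(name))
}

/// Route read-only commands to a replica only when the caller opted in.
/// Everything else, including unknown commands, goes to the primary.
fn route(command: &str, replicas_enabled: bool) -> Target {
    if replicas_enabled && is_read_only(command) {
        Target::Replica
    } else {
        Target::Primary
    }
}
```

Defaulting unknown commands to the primary keeps the sketch safe: a misclassified read only costs some load on the primary, while a misclassified write sent to a replica would fail.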

Consider the following scenario:

You're using Elasticache or something equivalent in one of the big cloud providers.
You're using cross-AZ replication for redundancy purposes. Therefore your costs are non trivial per added replica. Your bandwidth costs are also not negligible.
You have >=2 replicas per primary node where the 2 replicas are in different AZs.
At least one of the replicas is in an AZ that is closer to your primary node on the network, so you have a preferred failover order, and a preference on which replica should receive commands.

To handle all these use cases the client would need new interfaces for callers to not only enable read-only commands to go to replicas, but also some way to specify the order that replicas should receive commands. Or said another way, you may be using replication for failover purposes, or for load balancing. But it makes a big difference, and it's difficult to declare that information, especially in the face of changing cluster topologies when nodes fail or slots are rebalanced. It can also massively inflate your network usage per application node if you have multiple replicas per primary and/or a lot of primary nodes.
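As a rough illustration of what such an interface might have to express, the sketch below lets a caller declare whether replicas are for failover only, for reads in a preferred order, or for load balancing. Everything here is hypothetical (`Replica`, `ReplicaPolicy`, `pick_read_node`, and the `priority` field are invented for this example) and it ignores the hard parts described above, such as topology changes and slot rebalancing.

```rust
/// A replica as seen by the client. Hypothetical type for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Replica {
    addr: String,
    /// Lower values are preferred, e.g. a replica in a nearby AZ gets 0.
    priority: u32,
    healthy: bool,
}

/// How the caller intends replicas to be used. Hypothetical, not fred's API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReplicaPolicy {
    /// Replicas exist only for failover; reads always go to the primary.
    FailoverOnly,
    /// Send reads to the most preferred healthy replica.
    PreferredOrder,
    /// Spread reads across all healthy replicas.
    RoundRobin,
}

/// Pick the address that should serve a read-only command.
/// Falls back to the primary when no healthy replica is available.
fn pick_read_node<'a>(
    primary: &'a str,
    replicas: &'a [Replica],
    policy: ReplicaPolicy,
    request_count: usize,
) -> &'a str {
    let healthy: Vec<&'a Replica> = replicas.iter().filter(|r| r.healthy).collect();
    if healthy.is_empty() {
        return primary;
    }
    let chosen: &'a Replica = match policy {
        ReplicaPolicy::FailoverOnly => return primary,
        ReplicaPolicy::PreferredOrder => healthy
            .iter()
            .copied()
            .min_by_key(|r| r.priority)
            .unwrap_or(healthy[0]),
        ReplicaPolicy::RoundRobin => healthy[request_count % healthy.len()],
    };
    chosen.addr.as_str()
}
```

Even in this toy form the caller has to supply a priority per replica and keep it accurate as nodes come and go, which is the declaration problem described above.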

I should be clear though - if you use replicas behind a sentinel layer, or behind a well-managed cluster deployment layer (such as Elasticache, Redis Labs, or k8s), then everything will work. If a primary node goes down your infra should promote a replica to a primary node, and then this information will appear in the CLUSTER NODES response, and fred will handle it properly. The complexity I'm speaking of comes from trying to use a replica for more than failover purposes.
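For readers unfamiliar with how a promotion shows up, here is a small sketch of reading roles out of a CLUSTER NODES response. It is not how fred parses the response; `Role`, `NodeInfo`, and `parse_cluster_nodes` are hypothetical names, and the sketch only assumes the documented line layout of node ID, address, and comma-separated flags as the first three fields.

```rust
/// Role reported in the flags field. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum Role {
    Primary,
    Replica,
}

/// The subset of a CLUSTER NODES line this sketch cares about.
#[derive(Debug, PartialEq)]
struct NodeInfo {
    id: String,
    addr: String,
    role: Role,
    failed: bool,
}

/// Parse the first three fields of each CLUSTER NODES line:
/// `<id> <ip:port@cport> <flags> ...`. Lines without a `master` or `slave`
/// flag, or with too few fields, are skipped.
fn parse_cluster_nodes(response: &str) -> Vec<NodeInfo> {
    response
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let id = fields.next()?;
            let raw_addr = fields.next()?;
            let flags: Vec<&str> = fields.next()?.split(',').collect();
            let role = if flags.contains(&"master") {
                Role::Primary
            } else if flags.contains(&"slave") {
                Role::Replica
            } else {
                return None;
            };
            // Drop the cluster bus port suffix (`@cport`) if present.
            let addr = raw_addr.split('@').next().unwrap_or(raw_addr);
            Some(NodeInfo {
                id: id.to_string(),
                addr: addr.to_string(),
                role,
                failed: flags.contains(&"fail"),
            })
        })
        .collect()
}
```

After a failover, the promoted node's line carries the `master` flag, so a client that re-reads this response sees it as a primary without knowing it used to be a replica.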

tsukit commented on September 13, 2024

Thanks so much @aembke for the in-depth info! This is pretty helpful. Our internal clusters are currently sentinel-managed (soon to be running in cluster mode) and we use replicas to distribute reads, as most of the stacks we have are read heavy. When do you think the ability to read from replicas (and the ability to handle their failure) will be available? Any ballpark estimate is highly appreciated.

aembke commented on September 13, 2024

So if you're using the sentinel interface the failover scenarios are handled today. The same for clusters assuming you're using some sort of management layer that can run the CLUSTER commands to promote a replica to a primary node such that it'll show up as a primary in subsequent CLUSTER NODES responses.

As far as sending reads to replicas without any failure taking place - that's probably a couple months out. My biggest focus after the upcoming 5.0.0 release will be on deep refactoring to make transactions easier to reason about, and then I'll likely start performance tuning work including this kind of automatically-send-reads-to-replicas work.

tsukit commented on September 13, 2024

Thanks @aembke. Look forward to having that support.

aembke commented on September 13, 2024

Sounds good, when that's ready or nearing completion I'll tag you so you have a heads up. In the meantime I'm going to close this out and track this in a set of wiki pages that I'm working on for this repo.
