Comments (20)
Yes, looks suspicious. There's no activity in the replica's logs until my code runs replicaof no one after sensing that the master is having connectivity issues:
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.620872 2006 protocol_client.cc:180] Resolved endpoint dragonfly-24c3a611-6.testproject-dw2f.aiven.local:28849 to fda7:a938:5bfe:5fa6:0:4a9:374:f236:28849
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.628690 2006 replica.cc:536] Started full sync with dragonfly-24c3a611-6.testproject-dw2f.aiven.local:28849
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.629586 2006 replica.cc:556] full sync finished in 3 ms
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.629644 2006 replica.cc:627] Transitioned into stable sync
Apr 16 13:16:27 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:16:27.574677 2006 server_family.cc:2378] Replicating NO ONE
from dragonfly.
I am pretty sure there is a logical explanation about this. Let me get back to you :)
from dragonfly.
After some digging with iptables I managed to reproduce :) I will debug and patch this.
from dragonfly.
Hi @kostasrim
Yes, there's something different in my scenario compared to what you explain above. I apply these iptables rules on the master node only, so the replica is not affected at all. Therefore the statements below do not hold when requests are made against the replica; they would hold only if my code were making requests to the master node, which is not the case:
For (1) redis-cli will stop working, so issuing replicaof no one will never reach the replica because the packets are dropped when received
For (2) redis-cli will stop working, because the packets are dropped when they are sent
Once the master gets this iptables configuration, my failover code gets executed and the replica gets promoted to master. There are no issues with the connection to the replica at this stage.
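For illustration, the promotion step of such a failover could be sketched as below. This is not the reporter's actual failover code: the function names, the 5-second timeout, and the use of a raw socket are assumptions made for this example. It sends REPLICAOF NO ONE over the Redis wire protocol (RESP) and uses a socket timeout so that a hanging command surfaces as an error instead of blocking the caller forever.

```python
import socket


def encode_resp(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n" % len(data) + data + b"\r\n")
    return b"".join(out)


def promote_replica(host: str, port: int, timeout: float = 5.0) -> str:
    """Send REPLICAOF NO ONE to the replica and return its raw reply line.

    Raises socket.timeout (a subclass of OSError) if the replica does not
    answer within `timeout` seconds, which is how a stuck command would
    show up to the caller.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_resp("REPLICAOF", "NO", "ONE"))
        return sock.recv(1024).decode().strip()
```

A caller would treat a "+OK" reply as a successful promotion and a timeout as the hang described in this issue.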
from dragonfly.
Hi @safa-topal, thank you for reporting this. I will take a look and come back :)
from dragonfly.
Hi @safa-topal, it doesn't seem to reproduce on my side so maybe I am doing something wrong?
So:
- I started the DF master and replica
- I populated the master with debug populate via redis-cli
- I killed the master process, which made the replica try to reconnect
- I ran replicaof no one and it returned OK immediately
from dragonfly.
Hi @kostasrim,
I can reproduce it with roughly the same flow (except on the 3rd step I cut the connectivity with iptables config instead of killing the master process).
Curious, are there any logs indicating that replica is trying to reconnect to master? I don't see any such logs during my tests.
from dragonfly.
Hi @safa-topal I can retry with iptables config instead but I am quite packed for the rest of the day.
I used --alsologtostderr and it should print that the replica is trying to re-establish the connection with the master.
p.s. I doubt that iptables config is the problem but I should verify
from dragonfly.
I also don't think iptables will make a difference in this case.
I have -alsologtostderr enabled and it made no difference to the logs after the network failure: nothing is logged about retrying the connection to the master.
from dragonfly.
Your logs should be filled with entries like these:
E20240416 16:10:54.929634 28736 replica.cc:196] Error connecting to localhost:6379 system:111
I20240416 16:10:55.429829 28736 protocol_client.cc:180] Resolved endpoint localhost:6379 to 127.0.0.1:6379
E20240416 16:10:55.429966 28736 protocol_client.cc:221] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20240416 16:10:55.430018 28736 replica.cc:196] Error connecting to localhost:6379 system:111
I20240416 16:10:55.930435 28736 protocol_client.cc:180] Resolved endpoint localhost:6379 to 127.0.0.1:6379
It is suspicious that you don't get them. I will try with iptables and ping back
from dragonfly.
@safa-topal what happens if you kill/stop the master without the iptables?
from dragonfly.
@romange in that case, what happens is identical to what @kostasrim experiences:
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: I20240416 14:29:09.430425 70 replica.cc:657] Exit stable sync
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: W20240416 14:29:09.430497 70 replica.cc:244] Error stable sync with dragonfly.local:28849 system:103 Software caused connection abort
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: I20240416 14:29:09.930770 70 protocol_client.cc:180] Resolved endpoint dragonfly.local:28849 to fda7:a938:5bfe:5fa6:0:4a9:afc:ba6e:28849
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: E20240416 14:29:09.931649 70 replica.cc:196] Error connecting to dragonfly.local:28849 system:111
from dragonfly.
When you use iptables, the replica does not know that the master is dead and it relies on TCP keepalive settings to recognize a closed socket. And, I just checked: we do not actually configure TCP keepalive on the replication connections on the replica; instead we have the following comment: https://github.com/dragonflydb/dragonfly/blob/main/src/server/protocol_client.cc#L225
from dragonfly.
@romange I remember Roy experimented with those... I can take a look tomorrow
from dragonfly.
@kostasrim I do not think they will help with the original issue of "replicaof no one" being stuck. I would actually check at what step it is stuck when performing this command. My guess is the socket->Shutdown call but I am not sure.
@safa-topal could you please tell us how exactly you block traffic with iptables?
from dragonfly.
When you use iptables, the replica does not know that the master is dead and it relies on TCP keepalive settings to recognize a closed socket. And, I just checked: we do not actually configure TCP keepalive on the replication connections on the replica; instead we have the following comment: https://github.com/dragonflydb/dragonfly/blob/main/src/server/protocol_client.cc#L225
This comment explains why we do not see the "reconnect" logs, but it does not explain why "replicaof no one" hangs
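To make the keepalive point concrete, here is a minimal sketch of enabling TCP keepalive on a client socket. It is not Dragonfly's code (Dragonfly is written in C++ and, as noted above, does not configure keepalive on this connection); the function name and the 30/10/3 values are assumptions chosen for illustration. The TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are Linux-specific, so the sketch only sets them where Python exposes them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 30,
                     interval: int = 10, count: int = 3) -> None:
    """Turn on TCP keepalive probes for an existing TCP socket.

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes after which the kernel drops the connection
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs below exist on Linux; skip any that are missing.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With these example values, a peer that silently stops answering (for instance because its packets are dropped by a firewall) would be detected after roughly idle + interval * count = 60 seconds, at which point a blocked read fails and the reconnect logic can start.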
from dragonfly.
Hi @safa-topal , just to be 100% sure, you dropped the connection via:
sudo iptables -A INPUT -s 127.0.0.1 -p tcp --dport 6379 -j DROP
or something similar?
I found the bug and patched it but I wanted to double check.
from dragonfly.
Hi @kostasrim, yes, looks similar. These are the iptables directives I've used:
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A OUTPUT -p tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT
-A INPUT -j DROP
-A OUTPUT -j DROP
from dragonfly.
@safa-topal hmm I wonder if we see the same issue now because:
(1) -A INPUT -j DROP drops all packets destined for the host computer.
(2) -A OUTPUT -j DROP drops all packets originating from the host computer.
Both of them have side effects in this context:
For (1) redis-cli will stop working, so issuing replicaof no one will never reach the replica because the packets are dropped when received
For (2) redis-cli will stop working, because the packets are dropped when they are sent
Also, both of them will make the system unstable (on my Ubuntu it crashes a few things) as they drop all kinds of packets. So in this context I think these two are improper (meaning that my patch won't fix them since the issue is with how iptables is used).
Also both of them do not work with version 1.14.5
Now for the:
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A OUTPUT -p tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT
Doesn't actually have an effect because it modifies the ACCEPT which is later DROPPED. If taken standalone it has no impact. Whereas if you use this:
sudo iptables -A INPUT -s 127.0.0.1 -p tcp --dport 6379 -j DROP
it will fail (now it's patched) because it will block all incoming traffic to the master.
Let me know if you see something different. I just want to be 100% sure we are on the same page.
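As a rough mental model of how iptables walks a chain: rules are checked in the order they were appended, and the first matching rule with a terminating target (ACCEPT or DROP) decides the packet. The sketch below is a simplified simulation of the INPUT chain from the directives quoted above. The Rule class and verdict function are hypothetical names invented for this example, and it models only protocol and port matching (no state matching, no source address).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    target: str                   # "ACCEPT" or "DROP"
    proto: Optional[str] = None   # None matches any protocol
    dport: Optional[int] = None   # None matches any destination port

    def matches(self, proto: str, dport: int) -> bool:
        return ((self.proto is None or self.proto == proto)
                and (self.dport is None or self.dport == dport))


def verdict(chain: List[Rule], proto: str, dport: int,
            policy: str = "ACCEPT") -> str:
    """Return the target of the first matching rule, else the chain policy."""
    for rule in chain:
        if rule.matches(proto, dport):
            return rule.target
    return policy


# Simplified INPUT chain:
#   -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
#   -A INPUT -j DROP
INPUT_CHAIN = [Rule("ACCEPT", proto="tcp", dport=22), Rule("DROP")]
```

Under this model, verdict(INPUT_CHAIN, "tcp", 22) yields ACCEPT while a packet to any other port, such as the replication port 28849 seen in the logs, yields DROP, so on the host where the rules are applied everything except SSH is cut off.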
from dragonfly.
Oh I see, if it's specific to the master instance then it has similar semantics with my config as well. I expect my patch to work :)
from dragonfly.