Git Product home page Git Product logo

Comments (20)

safa-topal avatar safa-topal commented on June 3, 2024 1

Yes, looks suspicious. There's no activity in the logs of replica I have until my code runs replicaof no one after sensing that master is having connectivity issues:

Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.620872  2006 protocol_client.cc:180] Resolved endpoint dragonfly-24c3a611-6.testproject-dw2f.aiven.local:28849 to fda7:a938:5bfe:5fa6:0:4a9:374:f236:28849
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.628690  2006 replica.cc:536] Started full sync with dragonfly-24c3a611-6.testproject-dw2f.aiven.local:28849
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.629586  2006 replica.cc:556] full sync finished in 3 ms
Apr 16 13:14:48 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:14:48.629644  2006 replica.cc:627] Transitioned into stable sync
Apr 16 13:16:27 dragonfly-24c3a611-7 taskset[2005]: I20240416 13:16:27.574677  2006 server_family.cc:2378] Replicating NO ONE

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024 1

I am pretty sure there is a logical explanation about this. Let me get back to you :)

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024 1

After some digging with iptables I managed to reproduce :) I will debug and patch this.

from dragonfly.

safa-topal avatar safa-topal commented on June 3, 2024 1

Hi @kostasrim

Yes there's something different in my scenario compared to what you explain above. I run these iptables on the master node only so replica is not affected at all. Therefore below statements are not true when requests made against the replica but true "if" my code was making requests to master node -- which is not the case:

For (1) redis-cli will stop working, so issuing replicaof no one will never reach replica because the packets are dropped when received
For(2) redis-cli will stop working, because the packets are droped when they are sent

Once master gets these iptables configuration then my failover code gets executed and then replica gets promoted to be master. There are no issues with connection to the replica at this stage.

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024

Hi @safa-topal, thank you for reporting this. I will take a look and come back :)

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024

Hi @safa-topal, it doesn't seem to reproduce on my side so maybe I am doing something wrong?

So:

  1. I started DF master and replica
  2. I debug populated master via redis-cli
  3. I killed the master process which made replica try to reconnect
  4. I run replicaof no one and it returned OK immediately

from dragonfly.

safa-topal avatar safa-topal commented on June 3, 2024

Hi @kostasrim,

I can reproduce it with roughly the same flow (except on the 3rd step I cut the connectivity with iptables config instead of killing the master process).

Curious, are there any logs indicating that replica is trying to reconnect to master? I don't see any such logs during my tests.

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024

Hi @safa-topal I can retry with iptables config instead but I am quite packed for the rest of the day.

I used --alsologtostderr and it should print that the replica is trying to restablish the connection with master.

p.s. I doubt that iptables config is the problem but I should verify

from dragonfly.

safa-topal avatar safa-topal commented on June 3, 2024

I also don't think iptables will make a difference on this case.

I have the -alsologtostderr enabled and didn't make a difference with the logs after network failure, it doesn't log anything related to retrying connection with master.

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024

Your logs should be filled:

E20240416 16:10:54.929634 28736 replica.cc:196] Error connecting to localhost:6379 system:111
I20240416 16:10:55.429829 28736 protocol_client.cc:180] Resolved endpoint localhost:6379 to 127.0.0.1:6379
E20240416 16:10:55.429966 28736 protocol_client.cc:221] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20240416 16:10:55.430018 28736 replica.cc:196] Error connecting to localhost:6379 system:111
I20240416 16:10:55.930435 28736 protocol_client.cc:180] Resolved endpoint localhost:6379 to 127.0.0.1:6379

Which is suspicious that you don't get them. I will try with iptables and ping back

from dragonfly.

romange avatar romange commented on June 3, 2024

@safa-topal what happens if you kill/stop the master without the iptables?

from dragonfly.

safa-topal avatar safa-topal commented on June 3, 2024

@romange on that case, what happens is identical to what @kostasrim experiences:

Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: I20240416 14:29:09.430425    70 replica.cc:657] Exit stable sync
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: W20240416 14:29:09.430497    70 replica.cc:244] Error stable sync with dragonfly.local:28849 system:103 Software caused connection abort
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: I20240416 14:29:09.930770    70 protocol_client.cc:180] Resolved endpoint dragonfly.local:28849 to fda7:a938:5bfe:5fa6:0:4a9:afc:ba6e:28849
Apr 16 14:29:09 dragonfly-24c3a611-8 taskset[68]: E20240416 14:29:09.931649    70 replica.cc:196] Error connecting to dragonfly.local:28849 system:111

from dragonfly.

romange avatar romange commented on June 3, 2024

when you use iptables, a replica does not know that the master is dead and it relies on tcp keep alive settings to recognize a closed socket. And, I just checked - we do not actually configure TCP keep alive on our replication connections on replica, instead we have the following comment https://github.com/dragonflydb/dragonfly/blob/main/src/server/protocol_client.cc#L225

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024

@romange I remember Roy experimented with those... I can take a look tomorrow

from dragonfly.

romange avatar romange commented on June 3, 2024

@kostasrim I do not think they will help with the original issue of "replica no one" being stuck. I would actually check at what step it is stuck when performing this command. My guess is socket->Shutdown call but I am not sure.
@safa-topal could you please tell us how exactly you block traffic with iptables?

from dragonfly.

romange avatar romange commented on June 3, 2024

when you use iptables, a replica does not know that the master is dead and it relies on tcp keep alive settings to recognize a closed socket. And, I just checked - we do not actually configure TCP keep alive on our replication connections on replica, instead we have the following comment https://github.com/dragonflydb/dragonfly/blob/main/src/server/protocol_client.cc#L225

This comment explained why we do not see the "reconnect" logs, but it does not explain why "replica no one" hangs

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024

Hi @safa-topal , just to be 100% sure, you dropped the connection via:

sudo iptables -A INPUT -s 127.0.0.1 -p tcp --dport 6379 -j DROP or something similar ?

I found the bug and patched it but I wanted to double check.

from dragonfly.

safa-topal avatar safa-topal commented on June 3, 2024

@kostasrim, yes, looks similar. This is the iptable directives I've used:

-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A OUTPUT -p tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT
-A INPUT -j DROP
-A OUTPUT -j DROP

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024

@safa-topal hmm I wonder if we see the same issue now because:

  1. -A INPUT -j DROP drops all packets destined for the host computer.
  2. -A OUTPUT -j DROP drops all packets originating from the host computer

Both of them have side effects in this context:

For (1) redis-cli will stop working, so issuing replicaof no one will never reach replica because the packets are dropped when received
For(2) redis-cli will stop working, because the packets are droped when they are sent

Also both of them will make the system unstable (on my ubuntu it crashes a few things) as it drops all kinds of packets. So in this context I think these two are improper (meaning that my patch won't fix them since the issue is with how iptables is used).

Also both of them do not work with version 1.14.5

Now for the:

-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A OUTPUT -p tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT

Doesn't actually have an effect because it modifies the ACCEPT which is later DROPED. If taken standalone it has no impact. Whereas if you use this:

sudo iptables -A INPUT -s 127.0.0.1 -p tcp --dport 6379 -j DROP it will fail (now it's patched) because it will block all incoming traffic to the master.

Let me know if you see something different. I just want to be 100% sure we are on the same page.

from dragonfly.

kostasrim avatar kostasrim commented on June 3, 2024

Oh I see, if it's specific to the master instance then it has similar semantics with my config as well. I expect my patch to work :)

from dragonfly.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.