
Comments (9)

markmandel commented on June 17, 2024

We were chatting about this on Discord (I think we settled on 30s and 25 seconds respectively), but to make sure we implement this right (and because my head hurts a little - so making sure I've got the right order), the solution should probably be:

  1. On the relay make the default --idle-request-interval 30s
  2. On the agent make the default --idle-request-interval 35s so it has a nicer buffer to wait.

And we should write some docs to ensure that if you want to tweak this value, you don't make the agent idle interval equal to or less than the relay idle interval.

Did I make sure to capture that correctly?

from quilkin.

markmandel commented on June 17, 2024

Got some logs? Share where you see the failures happening?

I am noting that both PRs touch files in /src/cli/ - wondering if it's something in there?


XAMPPRocky commented on June 17, 2024

@markmandel We tried replicating it again but couldn't. It's odd: I could consistently replicate it over the weekend on multiple machines, but when Monday came around I couldn't. I'll post more if I see it again, but what happens is essentially that the agent warns that the connection has failed and then reconnects.


markmandel commented on June 17, 2024

Yurk. Flaky issues in production are the worst to nail down.


markmandel commented on June 17, 2024

Wondering if what I'm seeing in #877 is similar to what you are seeing here? In the process of debugging, but figured I would check.


MarkSeward commented on June 17, 2024

Hi folks, we appear to have a similar issue in our dev environment. Every 30 seconds the agent claims to lose its connection to the relay:

{"timestamp":"2024-05-01T08:30:30.061180Z","level":"WARN","fields":{"message":"lost connection to relay server, retrying"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:30:30.061207Z","level":"INFO","fields":{"message":"attempting to connect to `http://quilkin-relay-agones:7900/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:30:30.062011Z","level":"INFO","fields":{"message":"Connected to management server"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}

...

{"timestamp":"2024-05-01T08:31:00.064008Z","level":"WARN","fields":{"message":"lost connection to relay server, retrying"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:31:00.064038Z","level":"INFO","fields":{"message":"attempting to connect to `http://quilkin-relay-agones:7900/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:31:00.064979Z","level":"INFO","fields":{"message":"Connected to management server"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}

This sequence repeats every 30 seconds.
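For anyone who wants to confirm the interval from their own logs, here is a minimal sketch that computes the gap between consecutive "lost connection" timestamps. The function names are hypothetical (not part of Quilkin), and it assumes UTC timestamps in the format shown above that all fall on the same day.

```rust
/// Parses the time-of-day part of a UTC timestamp such as
/// "2024-05-01T08:30:30.061180Z" into seconds since midnight.
/// Hypothetical helper for illustration; not part of Quilkin.
fn seconds_since_midnight(timestamp: &str) -> Option<f64> {
    // Split "date T time" and drop the trailing 'Z'.
    let (_, time) = timestamp.split_once('T')?;
    let time = time.strip_suffix('Z')?;

    let mut parts = time.split(':');
    let hours: f64 = parts.next()?.parse().ok()?;
    let minutes: f64 = parts.next()?.parse().ok()?;
    let seconds: f64 = parts.next()?.parse().ok()?;

    Some(hours * 3600.0 + minutes * 60.0 + seconds)
}

/// Returns the gaps, in seconds, between consecutive timestamps.
/// Assumes all timestamps are on the same day and in ascending order.
fn reconnect_gaps(timestamps: &[&str]) -> Option<Vec<f64>> {
    let secs = timestamps
        .iter()
        .map(|t| seconds_since_midnight(t))
        .collect::<Option<Vec<f64>>>()?;

    Some(secs.windows(2).map(|pair| pair[1] - pair[0]).collect())
}
```

Feeding it the two WARN timestamps above (08:30:30.061180 and 08:31:00.064008) gives a single gap of roughly 30.003 seconds, which matches the 30 second default.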

So far we looked at:

  • 'debug' and 'trace' level logs for both agent and relay
  • tcpdump of traffic between agent and relay
  • some netstat/lsof/strace inspection in the agent container

Observations:

  • agent logs claim lost connection to relay every 30 seconds, immediately followed by a successful retry and reconnect
  • tcpdump shows a repeating pattern: a new TCP connection is established from agent to relay, some data is exchanged, 30 seconds later the agent initiates teardown by sending a FIN
  • debug logs also show the agent is sending (and relay is receiving) HTTP/2 GOAWAY every 30 seconds
  • the surrounding logs differ between each 30-second interval; there is no clear pattern leading up to or following the lost connection message

We tried looking for timers and found IDLE_REQUEST_INTERVAL, which defaults to 30 seconds. We changed this using --idle-request-interval-secs "60" to see if this influences the behaviour we see, and the "lost connection" issue vanished! The connection between agent and relay remained intact for as long as I watched it.

If we restore IDLE_REQUEST_INTERVAL to 30 seconds, the "lost connection" behaviour returns, so it seems at least right now we are able to reproduce this in our environment.

Is there any more information we could provide to help isolate the cause?


swermin commented on June 17, 2024

> We were chatting about this on Discord (I think we settled on 30s and 25 seconds respectively), but to make sure we implement this right (and because my head hurts a little - so making sure I've got the right order), the solution should probably be:
>
>   1. On the relay make the default --idle-request-interval 30s
>   2. On the agent make the default --idle-request-interval 35s so it has a nicer buffer to wait.
>
> And we should write some docs to ensure that if you want to tweak this value, you don't make the agent idle interval equal to or less than the relay idle interval.
>
> Did I make sure to capture that correctly?

Yes, you did capture that correctly. The agent needs a small buffer so that it does not prematurely close the connection.
From our investigations, 5 seconds is more than enough, but your mileage may vary.
I also strongly agree with you there that having some docs would be great!


MarkSeward commented on June 17, 2024

Thanks, after a bunch of troubleshooting last week we found the same thing: relay idle interval needs to be shorter than agent idle interval. We went with 30s on the agent and 25s on the relay in the end. This seems to be stable so far. +1 on the docs.
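In case it helps whoever writes the docs, the rule could be expressed as a startup sanity check along these lines. This is only a sketch: `check_idle_intervals` and the `min_buffer` parameter are hypothetical names, not anything that exists in Quilkin today.

```rust
use std::time::Duration;

/// Hypothetical check (not part of Quilkin): the agent idle interval must be
/// strictly longer than the relay idle interval, by at least `min_buffer`.
/// On success, returns the buffer the agent has to wait.
fn check_idle_intervals(
    relay: Duration,
    agent: Duration,
    min_buffer: Duration,
) -> Result<Duration, String> {
    // `checked_sub` yields None when the agent interval is shorter than the relay's.
    match agent.checked_sub(relay) {
        Some(buffer) if !buffer.is_zero() && buffer >= min_buffer => Ok(buffer),
        _ => Err(format!(
            "agent idle interval ({:?}) must exceed relay idle interval ({:?}) by at least {:?}",
            agent, relay, min_buffer
        )),
    }
}
```

With the values we settled on (relay 25s, agent 30s, 5s minimum buffer) the check passes, while 30s on both sides, which is the configuration that produced the reconnect loop for us, is rejected.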


XAMPPRocky commented on June 17, 2024

I think the simplest solution is to remove the idle interval on the agent. I don't think it has turned out to be intuitive, and connection stability has improved a lot since it was added.

