Comments (9)
Got some logs? Can you share where you see the failures happening?
I am noting that both PRs touch files in `/src/cli/` - wondering if it's something in there?
@markmandel We tried replicating it again, but couldn't. It's odd: I could consistently replicate it over the weekend on multiple machines, but come Monday I couldn't replicate it at all. I'll post more if I see it again, but what happens is essentially that the agent warns that the connection has failed and reconnects.
Yurk. Flaky issues in production are the worst to nail down.
Wondering if what I'm seeing in #877 is similar to what you are seeing here? In the process of debugging, but figured I would check.
Hi folks, we appear to have a similar issue in our dev environment. Every 30 seconds the agent claims to lose its connection to the relay:
{"timestamp":"2024-05-01T08:30:30.061180Z","level":"WARN","fields":{"message":"lost connection to relay server, retrying"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:30:30.061207Z","level":"INFO","fields":{"message":"attempting to connect to `http://quilkin-relay-agones:7900/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:30:30.062011Z","level":"INFO","fields":{"message":"Connected to management server"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
...
{"timestamp":"2024-05-01T08:31:00.064008Z","level":"WARN","fields":{"message":"lost connection to relay server, retrying"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:31:00.064038Z","level":"INFO","fields":{"message":"attempting to connect to `http://quilkin-relay-agones:7900/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:31:00.064979Z","level":"INFO","fields":{"message":"Connected to management server"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
This sequence repeats every 30 seconds.
So far we have looked at:
- `debug` and `trace` level logs for both agent and relay
- `tcpdump` of traffic between agent and relay
- some `netstat`/`lsof`/`strace` inspection in the agent container
Observations:
- agent logs claim a lost connection to the relay every 30 seconds, immediately followed by a successful retry and reconnect
- `tcpdump` shows a repeating pattern: a new TCP connection is established from agent to relay, some data is exchanged, and 30 seconds later the agent initiates teardown by sending a FIN
- debug logs also show the agent sending (and the relay receiving) an HTTP/2 GOAWAY every 30 seconds
- the surrounding logs differ between each 30-second interval; there is no clear pattern leading up to or following the lost-connection message
We tried looking for timers and found `IDLE_REQUEST_INTERVAL`, which defaults to 30 seconds. We changed this using `--idle-request-interval-secs "60"` to see whether it influences the behaviour we see, and the "lost connection" issue vanished! The connection between agent and relay remained intact for as long as I watched it.
If we restore `IDLE_REQUEST_INTERVAL` to 30 seconds, the "lost connection" behaviour returns, so it seems that, at least right now, we are able to reproduce this in our environment.
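For what it's worth, here is a minimal sketch of the mechanism this suggests to us (not quilkin's actual code; only the constant name is taken from the source): an idle timer that re-arms on every received message and tears the stream down when it fires. If the relay sends something roughly every 30 seconds, a 60-second timer never fires, which would explain why raising the flag made the warnings disappear.

```rust
// Sketch of the suspected idle teardown (illustrative, not quilkin's
// actual code). The idle timer re-arms on every received message; if
// nothing arrives within the interval, the stream is dropped and the
// caller logs "lost connection" and reconnects.
use std::time::Duration;
use tokio::{sync::mpsc, time::timeout};

const IDLE_REQUEST_INTERVAL: Duration = Duration::from_secs(30);

async fn run_stream(mut responses: mpsc::Receiver<Vec<u8>>) {
    loop {
        match timeout(IDLE_REQUEST_INTERVAL, responses.recv()).await {
            // A message arrived in time: handle it and re-arm the timer.
            Ok(Some(_response)) => { /* apply config update */ }
            // Sender gone or timer fired: drop the stream so the outer
            // loop can emit the WARN and dial the relay again.
            Ok(None) | Err(_) => break,
        }
    }
}
```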
Is there any more information we could provide to help isolate the cause?
We were chatting about this on Discord (I think we settled on 30s and 25s respectively), but to make sure we implement this right (and because my head hurts a little, so I'm making sure I've got the right order), the solution should probably be:
- On the relay, make the default `--idle-request-interval` 30s.
- On the agent, make the default `--idle-request-interval` 35s so it has a nicer buffer to wait.

And we should write some docs to make clear that if you want to tweak these values, you must not make the relay's interval equal to or greater than the agent's.
Did I make sure to capture that correctly?
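In code, I imagine the defaults would look something like this (a sketch using a clap derive; the struct and field names are made up, not quilkin's actual CLI definitions):

```rust
// Sketch only: what the proposed defaults could look like with a clap
// derive. These structs and field names are illustrative.
use clap::Parser;

#[derive(Parser)]
struct RelayArgs {
    /// How often the relay sends a request to keep streams active.
    #[clap(long, default_value = "30")]
    idle_request_interval_secs: u64,
}

#[derive(Parser)]
struct AgentArgs {
    /// How long the agent waits without traffic before dropping the
    /// connection; must stay greater than the relay's interval.
    #[clap(long, default_value = "35")]
    idle_request_interval_secs: u64,
}
```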
Yes, you did capture that correctly. The agent needs a small buffer so that it does not prematurely close the connection: the relay only sends something roughly every 30 seconds, so an agent that also gives up after exactly 30 seconds is racing its own timer against the relay's next message. From our investigations 5 seconds of headroom is more than enough, but your mileage may vary.
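To make the race concrete, here is a toy reproduction (a sketch, not quilkin code; every name in it is made up), with the intervals scaled from seconds to milliseconds so it runs quickly:

```rust
// Toy repro of the race: a sender ticks every `send_ms` while the
// receiver waits at most `idle_ms` for each message, re-arming the
// timer on every message, exactly like an idle timer.
use std::time::Duration;
use tokio::{sync::mpsc, time};

async fn receiver_survives(send_ms: u64, idle_ms: u64) -> bool {
    let (tx, mut rx) = mpsc::channel::<()>(1);
    tokio::spawn(async move {
        let mut tick = time::interval(Duration::from_millis(send_ms));
        tick.tick().await; // the first tick completes immediately; skip it
        for _ in 0..5 {
            tick.tick().await;
            if tx.send(()).await.is_err() {
                return;
            }
        }
    });
    for _ in 0..5 {
        if time::timeout(Duration::from_millis(idle_ms), rx.recv())
            .await
            .is_err()
        {
            return false; // the idle timer fired before the next message
        }
    }
    true
}

#[tokio::main]
async fn main() {
    // Equal intervals: timer and message race each other; flaky outcome.
    println!("30/30 survives: {}", receiver_survives(30, 30).await);
    // Receiver waits longer than the sender's interval: reliably stable.
    println!("30/35 survives: {}", receiver_survives(30, 35).await);
}
```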
I also strongly agree with you there that having some docs would be great!
Thanks, after a bunch of troubleshooting last week we found the same thing: the relay's idle interval needs to be shorter than the agent's. We went with 30s on the agent and 25s on the relay in the end, and it has been stable so far. +1 on the docs.
I think the simplest solution is to remove the idle interval on the agent; it hasn't turned out to be intuitive, and connection stability has improved since it was added anyway.