Comments (9)
Got some logs? Can you share where you see the failures happening?
I am noting that both PRs touch files in `/src/cli/` - wondering if it's something in there?
@markmandel We tried replicating it again, but couldn't. It's odd: I could consistently replicate it over the weekend on multiple machines, but come Monday I couldn't replicate it at all. I'll post more if I see it again, but what happens is essentially that the agent warns that the connection has failed and reconnects.
Yurk. Flaky issues in production are the worst to nail down.
Wondering if what I'm seeing in #877 is similar to what you are seeing here? In the process of debugging, but figured I would check.
Hi folks, we appear to have a similar issue in our dev environment. Every 30 seconds the agent claims to lose its connection to the relay:
{"timestamp":"2024-05-01T08:30:30.061180Z","level":"WARN","fields":{"message":"lost connection to relay server, retrying"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:30:30.061207Z","level":"INFO","fields":{"message":"attempting to connect to `http://quilkin-relay-agones:7900/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:30:30.062011Z","level":"INFO","fields":{"message":"Connected to management server"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
...
{"timestamp":"2024-05-01T08:31:00.064008Z","level":"WARN","fields":{"message":"lost connection to relay server, retrying"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:31:00.064038Z","level":"INFO","fields":{"message":"attempting to connect to `http://quilkin-relay-agones:7900/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2024-05-01T08:31:00.064979Z","level":"INFO","fields":{"message":"Connected to management server"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
This sequence repeats every 30 seconds.
So far we have looked at:
- `debug` and `trace` level logs for both agent and relay
- `tcpdump` of traffic between agent and relay
- some `netstat`/`lsof`/`strace` inspection in the agent container
Observations:
- agent logs claim a lost connection to the relay every 30 seconds, immediately followed by a successful retry and reconnect
- `tcpdump` shows a repeating pattern: a new TCP connection is established from agent to relay, some data is exchanged, and 30 seconds later the agent initiates teardown by sending a FIN
- debug logs also show the agent sending (and the relay receiving) an HTTP/2 GOAWAY every 30 seconds
- the surrounding logs differ between each 30-second interval; there is no clear pattern leading up to or following the lost-connection message
We tried looking for timers and found `IDLE_REQUEST_INTERVAL`, which defaults to 30 seconds. We changed this using `--idle-request-interval-secs "60"` to see whether it influences the behaviour we see, and the "lost connection" issue vanished! The connection between agent and relay remained intact for as long as I watched it.
If we restore `IDLE_REQUEST_INTERVAL` to 30 seconds, the "lost connection" behaviour returns, so it seems that, at least right now, we are able to reproduce this in our environment.
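For what it's worth, here is a minimal sketch of the mechanism this suggests to us (not quilkin's actual code; only the constant name is taken from the source): an idle timer that re-arms on every received message and tears the stream down when it fires. If the relay sends something roughly every 30 seconds, a 60-second timer never fires, which would explain why raising the flag made the warnings disappear.

```rust
// Sketch of the suspected idle teardown (illustrative, not quilkin's
// actual code). The idle timer re-arms on every received message; if
// nothing arrives within the interval, the stream is dropped and the
// caller logs "lost connection" and reconnects.
use std::time::Duration;
use tokio::{sync::mpsc, time::timeout};

const IDLE_REQUEST_INTERVAL: Duration = Duration::from_secs(30);

async fn run_stream(mut responses: mpsc::Receiver<Vec<u8>>) {
    loop {
        match timeout(IDLE_REQUEST_INTERVAL, responses.recv()).await {
            // A message arrived in time: handle it and re-arm the timer.
            Ok(Some(_response)) => { /* apply config update */ }
            // Sender gone or timer fired: drop the stream so the outer
            // loop can emit the WARN and dial the relay again.
            Ok(None) | Err(_) => break,
        }
    }
}
```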
Is there any more information we could provide to help isolate the cause?
We were chatting about this on Discord (I think we settled on 30s and 25s respectively), but to make sure we implement this right (and because my head hurts a little, so I'm making sure I've got the right order), the solution should probably be:
- On the relay, make the default `--idle-request-interval` 30s.
- On the agent, make the default `--idle-request-interval` 35s so it has a nicer buffer to wait.

And we should write some docs to make clear that if you want to tweak these values, you must not make the relay's interval equal to or greater than the agent's.
Did I make sure to capture that correctly?
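In code, I imagine the defaults would look something like this (a sketch using a clap derive; the struct and field names are made up, not quilkin's actual CLI definitions):

```rust
// Sketch only: what the proposed defaults could look like with a clap
// derive. These structs and field names are illustrative.
use clap::Parser;

#[derive(Parser)]
struct RelayArgs {
    /// How often the relay sends a request to keep streams active.
    #[clap(long, default_value = "30")]
    idle_request_interval_secs: u64,
}

#[derive(Parser)]
struct AgentArgs {
    /// How long the agent waits without traffic before dropping the
    /// connection; must stay greater than the relay's interval.
    #[clap(long, default_value = "35")]
    idle_request_interval_secs: u64,
}
```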
Yes, you did capture that correctly. The agent needs a small buffer so that it does not prematurely close the connection: the relay only sends something roughly every 30 seconds, so an agent that also gives up after exactly 30 seconds is racing its own timer against the relay's next message. From our investigations 5 seconds of headroom is more than enough, but your mileage may vary.
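To make the race concrete, here is a toy reproduction (a sketch, not quilkin code; every name in it is made up), with the intervals scaled from seconds to milliseconds so it runs quickly:

```rust
// Toy repro of the race: a sender ticks every `send_ms` while the
// receiver waits at most `idle_ms` for each message, re-arming the
// timer on every message, exactly like an idle timer.
use std::time::Duration;
use tokio::{sync::mpsc, time};

async fn receiver_survives(send_ms: u64, idle_ms: u64) -> bool {
    let (tx, mut rx) = mpsc::channel::<()>(1);
    tokio::spawn(async move {
        let mut tick = time::interval(Duration::from_millis(send_ms));
        tick.tick().await; // the first tick completes immediately; skip it
        for _ in 0..5 {
            tick.tick().await;
            if tx.send(()).await.is_err() {
                return;
            }
        }
    });
    for _ in 0..5 {
        if time::timeout(Duration::from_millis(idle_ms), rx.recv())
            .await
            .is_err()
        {
            return false; // the idle timer fired before the next message
        }
    }
    true
}

#[tokio::main]
async fn main() {
    // Equal intervals: timer and message race each other; flaky outcome.
    println!("30/30 survives: {}", receiver_survives(30, 30).await);
    // Receiver waits longer than the sender's interval: reliably stable.
    println!("30/35 survives: {}", receiver_survives(30, 35).await);
}
```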
I also strongly agree with you there that having some docs would be great!
Thanks, after a bunch of troubleshooting last week we found the same thing: the relay's idle interval needs to be shorter than the agent's. We went with 30s on the agent and 25s on the relay in the end, and it has been stable so far. +1 on the docs.
I think the simplest solution is to remove the idle interval on the agent; it hasn't turned out to be intuitive, and connection stability has improved since it was added anyway.