Comments (14)
I should note we also upgraded the drives in our ctl nodes to SSDs, which seemed to do the trick for a bit, but etcd still fell over. If this were a hardware or network issue, I'd expect we'd have seen it in the previous version too. I've tried restarting all the nodes multiple times and keep seeing the same behavior: it's fine for a while, and then eventually etcd gets overwhelmed and dies.
from okd.
Oh, and oddly, around the time of the issue we get loads of 'peer netns reference is invalid' messages from the garbage-collector namespace:
extract-2024-01-04T21_29_11.613Z.csv
from okd.
{"msg":"apply request took too long","took":"2.00022516s","request":"key:\"/kubernetes.io/health\" ","caller":"etcdserver/util.go:170","level":"warn","prefix":"read-only range ","response":"","error":"context canceled","expected-duration":"200ms","ts":"2024-01-04T20:52:04.900678Z"}
It takes etcd 2 seconds to do a read operation, so the disk is too slow for it. SSD or NVMe disks are a requirement for OKD.
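To put a number on a warning like the one quoted above, a small script can compare `took` against `expected-duration`. This is only a sketch: the function names `parse_go_duration` and `slowdown` are made up for illustration, and it assumes the log line is valid JSON with the inner quotes escaped.

```python
import json
import re

# Seconds per unit for the Go-style duration strings etcd prints.
_UNITS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|µs|us|ms|s|m|h)")


def parse_go_duration(text):
    """Convert a duration such as '2.00022516s' or '200ms' to seconds."""
    parts = _PART.findall(text)
    if not parts:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(value) * _UNITS[unit] for value, unit in parts)


def slowdown(line):
    """Return took / expected-duration for a slow-apply warning, else None."""
    entry = json.loads(line)
    if entry.get("msg") != "apply request took too long":
        return None
    took = parse_go_duration(entry["took"])
    expected = parse_go_duration(entry["expected-duration"])
    return took / expected
```

For the warning in this thread that works out to roughly 10x the expected 200ms.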
from okd.
We're using SSDs, and before etcd falls over the request latency is totally normal; check my "before" logs.
from okd.
Also prior to this version we had totally acceptable etcd performance
from okd.
I mean, here's our etcd metrics:
it's possible I'm running this query wrong, but the latency looks pretty normal?
from okd.
I have no idea what the anomalies at 2 PM and just now are
from okd.
And that's the thing that's so weird about this: if I reboot the control plane nodes, OKD seems to behave for a reasonable amount of time with no signs of high latency:
from okd.
but at some point it just gets a storm of reads or something from the apiserver which topples it
from okd.
{"level":"info","ts":"2024-01-05T01:04:41.203378Z","caller":"traceutil/trace.go:171","msg":"trace[1574369963] range","detail":"{range_begin:/kubernetes.io/events/; range_end:/kubernetes.io/events0; response_count:250; response_revision:1138218247; }","duration":"105.575164ms","start":"2024-01-05T01:04:41.097779Z","end":"2024-01-05T01:04:41.203354Z","steps":["trace[1574369963] 'range keys from in-memory index tree' (duration: 103.958977ms)"],"step_count":1}
{"level":"info","ts":"2024-01-05T01:05:40.560993Z","caller":"traceutil/trace.go:171","msg":"trace[1657201551] range","detail":"{range_begin:/kubernetes.io/events/hoby-feedback/exposer-ongsovhjcos7k8j79ni71h29962l557g8a6q8due9n0p3sr5fj5g.17a6f94efeedff69\u0000; range_end:/kubernetes.io/events0; response_count:10000; response_revision:1138219153; }","duration":"113.96833ms","start":"2024-01-05T01:05:40.447007Z","end":"2024-01-05T01:05:40.560975Z","steps":["trace[1657201551] 'range keys from in-memory index tree' (duration: 86.5626ms)","trace[1657201551] 'range keys from bolt db' (duration: 26.781581ms)"],"step_count":2}
{"level":"info","ts":"2024-01-05T01:07:19.046712Z","caller":"etcdserver/server.go:1404","msg":"triggering snapshot","local-member-id":"26c01424b8ce265f","local-member-applied-index":1302462815,"local-member-snapshot-index":1302452814,"local-member-snapshot-count":10000}
{"level":"info","ts":"2024-01-05T01:07:19.051514Z","caller":"etcdserver/server.go:2422","msg":"saved snapshot","snapshot-index":1302462815}
{"level":"info","ts":"2024-01-05T01:07:19.051698Z","caller":"etcdserver/server.go:2452","msg":"compacted Raft logs","compact-index":1302457815}
Like, our latencies are fairly normal almost all the time, but suddenly we get a bunch of requests or something which causes etcd to get overwhelmed.
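One way to check for a burst like that is to bucket the `trace[...] range` log lines by minute and add up their `response_count` values. A rough sketch follows; `range_load_per_minute` is a made-up helper, and as far as I know etcd only logs these traces for requests that were already slow, so the counts are a lower bound rather than the real request rate.

```python
import json
import re
from collections import defaultdict

_RESPONSE_COUNT = re.compile(r"response_count:(\d+)")


def range_load_per_minute(lines):
    """Group etcd 'trace[...] range' log lines by minute.

    Returns {minute: {"requests": n, "keys": total response_count}}.
    Lines that are not JSON or not range traces are skipped.
    """
    buckets = defaultdict(lambda: {"requests": 0, "keys": 0})
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        if not (msg.startswith("trace[") and msg.endswith("] range")):
            continue
        minute = entry["ts"][:16]  # e.g. "2024-01-05T01:04"
        match = _RESPONSE_COUNT.search(entry.get("detail", ""))
        buckets[minute]["requests"] += 1
        buckets[minute]["keys"] += int(match.group(1)) if match else 0
    return dict(buckets)
```

Feeding it the etcd container log (one JSON object per line) should show whether the minutes before a failure have noticeably more or larger range reads than a quiet period.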
from okd.
We're using these SSDs on our control plane, they back etcd:
[core@okd4-ctl02-nrh ~]$ cat /sys/block/sdb/device/model
VK000960GWJPF
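Besides the model string, the kernel's own view of the device can be read from `/sys/block/<device>/queue/rotational` (`0` for non-rotational, `1` for spinning disks). A minimal sketch, assuming the device is `sdb` as above; `is_rotational` is a made-up helper, and the flag may not be reliable for disks behind a RAID controller.

```python
from pathlib import Path


def is_rotational(device, sys_block="/sys/block"):
    """Return True if the kernel reports the block device as a spinning disk."""
    flag_file = Path(sys_block) / device / "queue" / "rotational"
    return flag_file.read_text().strip() == "1"


# Example (run on the node itself):
#   print(is_rotational("sdb"))
```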
from okd.
So, sometimes the disks are okay and you can open a console - and sometimes the disk is slow and so etcd is timing out? This is clearly a hardware / IO usage problem, not sure what OKD can do about this.
from okd.
@vrutkovs turns out the new version seems to have changed network handling: our control plane nodes have statically configured addresses, and the generated default NetworkManager config wasn't limited to any particular interface, so we were having intermittent networking issues caused by having two default routes and two interfaces with the same IP. I think this is a regression or something, because we didn't have this problem on older versions.
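For anyone wanting to check a node for the same symptom, the JSON output of `ip -j route show default` and `ip -j addr show` can be scanned for more than one default route or an address that appears on more than one interface. This is a sketch under assumptions: the helper names are made up, and it relies on the `dst`, `dev`, `ifname`, `addr_info` and `local` fields that iproute2 prints in JSON mode.

```python
import json
import subprocess
from collections import defaultdict


def ip_json(*args):
    """Run `ip -j <args>` and return the parsed JSON output."""
    result = subprocess.run(["ip", "-j", *args], check=True,
                            capture_output=True, text=True)
    return json.loads(result.stdout)


def default_route_devices(routes):
    """Return the device of every default route, in listing order."""
    return [route.get("dev") for route in routes if route.get("dst") == "default"]


def duplicated_addresses(links):
    """Map each address configured on more than one interface to those interfaces."""
    owners = defaultdict(set)
    for link in links:
        for info in link.get("addr_info", []):
            owners[info["local"]].add(link["ifname"])
    return {addr: sorted(names) for addr, names in owners.items() if len(names) > 1}


# Example (run on the node itself):
#   print(default_route_devices(ip_json("route", "show", "default")))
#   print(duplicated_addresses(ip_json("addr", "show")))
```

More than one entry from the first call, or any entry from the second, would match the situation described above.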
from okd.
Manually editing the NetworkManager configs fixed it.
from okd.