Comments (12)
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
from kubernetes.
/sig scalability
[measurement call TestMetrics - TestMetrics error: [action gather failed for SystemPodMetrics measurement: restart counts violation: RestartCount(node-problem-detector-gsg92, node-problem-detector)=1, want <= 0]]
It might be related to NPD (node-problem-detector).
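For context, the failure above means the SystemPodMetrics measurement saw a system pod container restart more often than its threshold allows (0 by default here). A minimal sketch of that kind of restart-count check — hypothetical names and data, not clusterloader2's actual code:

```python
# Sketch of a SystemPodMetrics-style restart-count check (hypothetical,
# not clusterloader2's implementation). Containers without an override
# are held to the default threshold of 0 restarts.
def restart_violations(pods, thresholds=None, default_threshold=0):
    thresholds = thresholds or {}
    violations = []
    for pod in pods:
        for c in pod["containers"]:
            # Per-container override if present, else the default.
            allowed = thresholds.get(c["name"], default_threshold)
            if c["restartCount"] > allowed:
                violations.append(
                    f"RestartCount({pod['name']}, {c['name']})="
                    f"{c['restartCount']}, want <= {allowed}")
    return violations

pods = [{"name": "node-problem-detector-gsg92",
         "containers": [{"name": "node-problem-detector",
                         "restartCount": 1}]}]
print(restart_violations(pods))
# -> ['RestartCount(node-problem-detector-gsg92, node-problem-detector)=1, want <= 0']
```

With a threshold of 0 for node-problem-detector, a single restart anywhere in the 5000-node cluster is enough to fail the whole run.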
The same test failure happened again on 02-25:

ClusterLoaderV2: load overall (testing/load/config.yaml) | 1h46m27s
{ Failure :0 [measurement call TestMetrics - TestMetrics error: [action gather failed for SystemPodMetrics measurement: restart counts violation: RestartCount(node-problem-detector-bpt42, node-problem-detector)=1, want <= 0; RestartCount(node-problem-detector-gxnpn, node-problem-detector)=1, want <= 0]] :0}
ClusterLoaderV2: load: [step: 29] gathering measurements [09] - TestMetrics | 8s
{ Failure :0 [measurement call TestMetrics - TestMetrics error: [action gather failed for SystemPodMetrics measurement: restart counts violation: RestartCount(node-problem-detector-bpt42, node-problem-detector)=1, want <= 0; RestartCount(node-problem-detector-gxnpn, node-problem-detector)=1, want <= 0]] :0}
e2e.go: ClusterLoaderV2
error during /home/prow/go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --experimental-gcp-snapshot-prometheus-disk=true --experimental-prometheus-disk-snapshot-name=ci-kubernetes-e2e-gce-scale-performance-1761798551869853696 --experimental-prometheus-snapshot-to-report-dir=true --nodes=5000 --prometheus-scrape-node-exporter --provider=gce --report-dir=/logs/artifacts --testconfig=testing/load/config.yaml --testconfig=testing/huge-service/config.yaml --testconfig=testing/access-tokens/config.yaml --testoverrides=./testing/experiments/enable_restart_count_check.yaml --testoverrides=./testing/experiments/ignore_known_gce_container_restarts.yaml --testoverrides=./testing/overrides/5000_nodes.yaml: exit status 1
/sig node
/sig node
/cc @SergeyKanzhelev @kubernetes/sig-node-leads
The restart of node-problem-detector caused the flake. I also opened kubernetes/node-problem-detector#866 to track it in the NPD repo.
I0225 17:35:11.161300 20814 ooms_tracker.go:284] OOM detected: &Event{ObjectMeta:{gce-scale-cluster-minion-heapster.17b72a4a62c4a7d8 default d60b06ff-11d1-4cb8-be1f-7264e084769b 85768 0 2024-02-25 17:11:22 +0000 UTC map[] map[] [] [] [{node-problem-detector Update v1 2024-02-25 17:11:22 +0000 UTC FieldsV1 {"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}} }]},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:gce-scale-cluster-minion-heapster,UID:gce-scale-cluster-minion-heapster,APIVersion:,ResourceVersion:,FieldPath:,},Reason:OOMKilling,Message:Memory cgroup out of memory: Killed process 4550 (metrics-server) total-vm:834776kB, anon-rss:100100kB, file-rss:33900kB, shmem-rss:0kB, UID:65534 pgtables:400kB oom_score_adj:999,Source:EventSource{Component:kernel-monitor,Host:gce-scale-cluster-minion-heapster,},FirstTimestamp:2024-02-25 17:11:22 +0000 UTC,LastTimestamp:2024-02-25 17:11:22 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I missed this log. It is caused by an OOM kill; adding some memory may help. I submitted #123500 to fix it.
{
"name": "node-problem-detector-qrt5w",
"containers": [
{
"name": "node-problem-detector",
"restartCount": 1,
"lastRestartReason": "\u0026ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Unknown,Message:,StartedAt:2024-02-25 17:24:13 +0000 UTC,FinishedAt:2024-02-25 18:19:59 +0000 UTC,ContainerID:containerd://685c8897aa5538550976c05ee7e23fc8a6d8325d9d6e0fcf183556d621cac0df,}"
}
]
},
Does OOMKilling,Message:Memory cgroup out of memory: Killed process 4550 (metrics-server) total-vm:834776kB, anon-rss:100100kB, file-rss:33900kB, shmem-rss:0kB, UID:65534 pgtables:400kB oom_score_adj:999
mean that metrics-server was killed?
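Yes — in a kernel OOM-kill line, the process named in parentheses after "Killed process <pid>" is the one that was killed; NPD (kernel-monitor) only reported the event. A quick parse of the message above, assuming this standard kernel log format:

```python
import re

# The OOMKilling event message from the log above.
msg = ("Memory cgroup out of memory: Killed process 4550 (metrics-server) "
       "total-vm:834776kB, anon-rss:100100kB, file-rss:33900kB, "
       "shmem-rss:0kB, UID:65534 pgtables:400kB oom_score_adj:999")

# The kernel names the killed process in parentheses after the PID.
m = re.search(r"Killed process (\d+) \(([^)]+)\)", msg)
pid, process = int(m.group(1)), m.group(2)
print(pid, process)  # 4550 metrics-server
```

So the victim here is metrics-server (PID 4550), not node-problem-detector.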
/remove-sig node
Sorry, I misread the log.
NPD found that the metrics server was OOMed.
This is about the metrics server, not NPD. Sorry for the noise.
I may add some memory for metrics-server and check why its memory usage changed. I heard that metrics-server recently released a new version; I will check whether that is related.
/sig node
{
"failures": [],
"ignored": [],
"past": [
{
"node": "gce-scale-cluster-minion-heapster",
"process": "metrics-server",
"memory": "834776kB",
"pid": "4550",
"time": "2024-02-25T17:11:22Z"
},
{
"node": "gce-scale-cluster-minion-heapster",
"process": "metrics-server",
"memory": "834776kB",
"pid": "4550",
"time": "2024-02-25T17:11:22Z"
},
{
"node": "gce-scale-cluster-minion-heapster",
"process": "metrics-server",
"memory": "829304kB",
"pid": "5059",
"time": "2024-02-25T17:11:47Z"
},
{
"node": "gce-scale-cluster-minion-heapster",
"process": "metrics-server",
"memory": "829304kB",
"pid": "5072",
"time": "2024-02-25T17:11:47Z"
}
]
}
The OOM metrics show that the metrics server was OOMed.
This is odd: in https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/metrics-server/metrics-server-deployment.yaml#L83-L91, only metrics-server-nanny has a memory limit.
It uses [registry.k8s.io/autoscaling/addon-resizer:1.8.14](http://registry.k8s.io/autoscaling/addon-resizer:1.8.14), whose command is /pod_nanny.
The metrics-server container has no memory limit, yet it got OOM-killed. Does this indicate that the node itself is out of memory?
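A small sketch of the kind of scan that surfaces this: walk a pod spec and flag containers that have no memory limit (the spec fragment below is a hypothetical, simplified slice of the metrics-server deployment, following the Kubernetes container `resources` schema):

```python
# Hypothetical, simplified slice of the metrics-server pod spec:
# only the nanny sidecar carries a memory limit.
spec = {"containers": [
    {"name": "metrics-server", "resources": {}},
    {"name": "metrics-server-nanny",
     "resources": {"limits": {"memory": "300Mi"}}},
]}

def containers_without_memory_limit(spec):
    # A container with no resources.limits.memory is only bounded by
    # the pod-level / node-allocatable memory cgroup above it.
    return [c["name"] for c in spec["containers"]
            if "memory" not in c.get("resources", {}).get("limits", {})]

print(containers_without_memory_limit(spec))  # ['metrics-server']
```

Note that "Memory cgroup out of memory" does not necessarily mean the node ran out of memory: a container without its own limit still sits inside an enclosing memory cgroup (pod or node allocatable), and exceeding that can trigger the kill.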
The last run failed in https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1762308682881699840
{
"name": "node-problem-detector-tgl8h",
"containers": [
{
"name": "node-problem-detector",
"restartCount": 0,
"lastRestartReason": "\u0026ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Unknown,Message:,StartedAt:2024-02-27 03:02:53 +0000 UTC,FinishedAt:2024-02-27 04:25:09 +0000 UTC,ContainerID:containerd://d02da804c9ed0bbe19592a39bfe063b8bc0354d676a0fbd72c04771950881fae,}"
}
]
},
node-problem-detector again shows ContainerStateTerminated.
I0227 05:03:31.555871 20953 phase_latency.go:146] PodStartupLatency: perc50: 14m29.018886827s, perc90: 28m44.625518807s, perc99: 32m2.19303496s; threshold 1h0m0s
I0227 05:03:32.023466 20953 system_pod_metrics.go:229] Loaded restart count threshold overrides: map[coredns:2 fluentd-gcp:999 konnectivity-agent:2 kube-controller-manager:1 kube-proxy:2 kube-scheduler:1 l7-lb-controller:1 metadata-proxy:2 prometheus-to-sd-exporter:2]
I0227 05:03:32.090252 20953 ooms_tracker.go:230] ClusterOOMsTracker: gathering cluster OOMs tracking measurement
E0227 05:03:32.090376 20953 test_metrics.go:250] TestMetrics: [action gather failed for SystemPodMetrics measurement: restart counts violation: RestartCount(node-problem-detector-tgl8h, node-problem-detector)=1, want <= 0; RestartCount(node-problem-detector-vpsmz, node-problem-detector)=1, want <= 0]
I0227 05:03:32.090413 20953 simple_test_executor.go:171] Step "[step: 29] gathering measurements" ended
W0227 05:03:32.090434 20953 simple_test_executor.go:174] Got errors during step execution: [measurement call TestMetrics - TestMetrics error: [action gather failed for SystemPodMetrics measurement: restart counts violation: RestartCount(node-problem-detector-tgl8h, node-problem-detector)=1, want <= 0; RestartCount(node-problem-detector-vpsmz, node-problem-detector)=1, want <= 0]]
I0227 05:03:32.090470 20953 simple_test_executor.go:80] Waiting for the chaos monkey subroutine to end...
I0227 05:03:32.090480 20953 simple_test_executor.go:82] Chaos monkey ended.
I0227 05:03:32.200575 20953 simple_test_executor.go:102]
I0227 05:03:32.212110 20953 probes.go:153] Stopping DnsLookupLatency probe...
I0227 05:03:47.374772 20953 probes.go:153] Stopping InClusterNetworkLatency probe...
I0227 05:04:22.740309 20953 simple_test_executor.go:400] Resources cleanup time: 50.539700215s
E0227 05:04:22.740344 20953 clusterloader.go:248] --------------------------------------------------------------------------------
E0227 05:04:22.740349 20953 clusterloader.go:249] Test Finished
E0227 05:04:22.740352 20953 clusterloader.go:250] Test: testing/load/config.yaml
E0227 05:04:22.740356 20953 clusterloader.go:251] Status: Fail
E0227 05:04:22.740359 20953 clusterloader.go:253] Errors: [measurement call TestMetrics - TestMetrics error: [action gather failed for SystemPodMetrics measurement: restart counts violation: RestartCount(node-problem-detector-tgl8h, node-problem-detector)=1, want <= 0; RestartCount(node-problem-detector-vpsmz, node-problem-detector)=1, want <= 0]]
E0227 05:04:22.740364 20953 clusterloader.go:255] --------------------------------------------------------------------------------
I0227 05:04:22.740423 20953 clusterloader.go:238] --------------------------------------------------------------------------------
I0227 05:04:22.740429 20953 clusterloader.go:239] Running testing/huge-service/config.yaml
I0227 05:04:22.740433 20953 clusterloader.go:240] --------------------------------------------------------------------------------
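The "Loaded restart count threshold overrides" line in the log above is a Go map printed with `%v`; parsing it back into a dict confirms that node-problem-detector has no override, so the default threshold of 0 applies. A quick parse, assuming this exact log format:

```python
import re

# The overrides line from the system_pod_metrics.go log above.
line = ("Loaded restart count threshold overrides: map[coredns:2 "
        "fluentd-gcp:999 konnectivity-agent:2 kube-controller-manager:1 "
        "kube-proxy:2 kube-scheduler:1 l7-lb-controller:1 metadata-proxy:2 "
        "prometheus-to-sd-exporter:2]")

# Pull the space-separated key:value pairs out of the Go-style map[...] dump.
body = re.search(r"map\[([^\]]*)\]", line).group(1)
overrides = {k: int(v) for k, v in (kv.split(":") for kv in body.split())}

print(overrides.get("node-problem-detector", 0))  # 0 -> any restart fails
```

This is why a single NPD restart fails the run while, say, coredns is allowed up to 2 restarts.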
{
  "name": "konnectivity-agent-szcls",
  "containers": [
    {
      "name": "konnectivity-agent",
      "restartCount": 1,
      "lastRestartReason": "\u0026ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Unknown,Message:,StartedAt:2024-02-27 17:16:07 +0000 UTC,FinishedAt:2024-02-27 18:25:42 +0000 UTC,ContainerID:containerd://49271453bd375c93608b8c63fe5d7c6f28feb05692b139cee46231d6c8fa9be1,}"
    }
  ]
},
{
  "name": "node-problem-detector-2b6fb",
  "containers": [
    {
      "name": "node-problem-detector",
      "restartCount": 1,
      "lastRestartReason": "\u0026ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Unknown,Message:,StartedAt:2024-02-27 17:16:06 +0000 UTC,FinishedAt:2024-02-27 18:25:42 +0000 UTC,ContainerID:containerd://fb253b230595790fec72900367c2b6a48fec8f06781c034acd90d7c213c8fb2a,}"
    }
  ]
},
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1762523335750586368/artifacts/SystemPodMetrics_load_2024-02-27T19:18:48Z.json
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1762523335750586368
The konnectivity-agent status is the same as NPD's, and the OOM tracker caught 2 OOM events on the node, which may be the reason for the pod restarts on that node.
{
  "failures": [],
  "ignored": [],
  "past": [
    {
      "node": "gce-scale-cluster-minion-heapster",
      "process": "metrics-server",
      "memory": "1358608kB",
      "pid": "4676",
      "time": "2024-02-27T17:10:04Z"
    },
    {
      "node": "gce-scale-cluster-minion-heapster",
      "process": "metrics-server",
      "memory": "1358608kB",
      "pid": "4690",
      "time": "2024-02-27T17:10:04Z"
    }
  ]
}
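Comparing the OOM-kill records across the two runs, metrics-server's total-vm at kill time grew from 834776kB (02-25 run) to 1358608kB (02-27 run). A quick check of the growth, using the values from the two ClusterOOMsTracker dumps above:

```python
# total-vm at OOM-kill time, taken from the two ClusterOOMsTracker dumps.
feb25_kb = 834776   # 2024-02-25 run
feb27_kb = 1358608  # 2024-02-27 run

growth = (feb27_kb - feb25_kb) / feb25_kb
print(f"{growth:.0%}")  # 63%
```

A ~63% jump in memory at kill time between consecutive runs supports checking whether a recent metrics-server release changed its memory profile.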
Adding some investigation details from kubernetes/perf-tests#2579.