Comments (6)
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
from kubernetes.
/sig node
Hi, thanks for the report. It's possible we have some issues in handling device plugin reconnections. Do we have a simple reproducer by any chance?
Actually, it is hard to reproduce the same condition. I'm not sure the root cause is reconnection, but I think the reconnection mechanism between the kubelet and the plugin could be improved (I haven't found any reconnection logic on the kubelet side), couldn't it?
> Actually, it is hard to reproduce the same condition. I'm not sure the root cause is reconnection, but I think the reconnection mechanism between the kubelet and the plugin could be improved (I haven't found any reconnection logic on the kubelet side), couldn't it?
It is the responsibility of the device plugin to handle reconnections, both on crashes/restarts of the device plugin itself and when tolerating kubelet restarts: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#handling-kubelet-restarts
I'd be more than supportive of making the kubelet code more robust in the presence of such events, but we need to narrow down the scope a bit, hence my previous question.
Thanks for your support, I agree with your opinion. Here are some additional notes:
- Regarding https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#handling-kubelet-restarts: the connection is restored through re-registration, which is triggered by monitoring the sock file in the plugin directory. I have confirmed that this function works without problems. However, when this issue occurred, the sock file in the directory was intact and had not been removed. When I then restarted kubelet, the plugin re-registered and the value reported by kubelet was correctly updated.
- In the kubelet logs, I did not clearly observe any disconnection from the plugin. I only found logs like this:
However, both of these log entries are timestamped after the failure occurred. :(
- For details about this failure, I also carefully reviewed the plugin code and raised a related issue with its maintainers. See: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/issues/109