Comments (8)
/sig scheduling
from kubernetes.
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
from kubernetes.
/assign
from kubernetes.
I don't consider this a bug, as PreEnqueue was not initially designed for Pods that went through the scheduling cycle already.
/remove-kind bug
/kind feature
In any case, I'm happy to review a reasonable proposal.
from kubernetes.
Current design is reasonable to me, so again ping @Huang-Wei
from kubernetes.
@kerthcet moving the discussion here:
To reply to your last comment
I don't know if we have such case, but if a pod is gated again, it means it's not ready, maybe it should restart the scheduling lifecycle.. my two cents. Let's track this under your issue then.
#125527 (comment)
At least, we have a use case here -
https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang
(wow hello again 😅 I've brought it into several discussion)
What it does with PreEnqueue is:
We also use PreEnqueue in https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang (gang scheduling); PreEnqueue allows Pods to enter scheduling cycles iff all Pods in the group are now ready to retry scheduling.
e.g., when gangA-pod1 is rejected by plugin-1 and gangA-pod2 is rejected by plugin-2, we want to requeue all gangA Pods iff the scheduling queue observes some cluster events that could change the scheduling results of gangA-pod1 and gangA-pod2 (i.e., when the QHints of plugin-1 and plugin-2 return Queue for those Pods).
#125223 (comment)
So, Pods that once got un-gated could get gated again.
To clarify, from a less subjective view, we have two options here:
- Add "once PreEnqueue allows a Pod to enter a scheduling cycle, PreEnqueue shouldn't gate the Pod again" to the PreEnqueue specification.
- Allow "even after PreEnqueue allows a Pod to enter a scheduling cycle once, PreEnqueue could gate the Pod again", and fix the backoff behaviour that the issue description argues in "What did you expect to happen?".
As @alculquicondor said, I admit the flow gated -> un-gated -> (goes through scheduling cycles) -> gated again hasn't been officially considered until now, since the scheduling gate, which doesn't allow that flow, was the original motivation for establishing PreEnqueue.
So, the current situation is that we don't say anything about that case; people like us can technically implement that flow with almost no problem, except for this minor issue.
Current design is reasonable to me
#125538 (comment)
(As you know) I tend toward the second option, since I don't see any big reason/concern that requires us to limit the usage of PreEnqueue the way the first option does.
I'm curious why the current behaviour looks reasonable to you.
from kubernetes.
I can break this into two parts:
- A single Pod
- A group of Pods
I was mostly pointing to the single-Pod case: I feel a re-gated Pod should restart the whole scheduling lifecycle. backoffQ is designed to avoid blocking the queue, but gated Pods never enter the queue in the first place.
For a group of Pods, it's much more complex. If a Pod fails scheduling, should it be marked as unschedulable or gated again? Probably unschedulable, in which case it's already under backoff. Those Pods should be re-scheduled first next time, because they're the main reason blocking the whole group. And when scheduling times out, how should these Pods be handled: should assumed Pods be gated again, while unscheduled Pods remain unscheduled, backing off in the background?
So the total workflow is still unclear to me, but that's what I'm working for right now. Always open to new ideas :)
from kubernetes.
Okay, so first, this is not a direct answer to your last comment (sorry), but let me confirm one thing, as I feel like we're not on the same page.
This issue does not depend on which kind of Pods they are. The reason I cannot justify the current behaviour is more general.
Reviewing the concept of backoff (just in case), backoff time is a penalty that we impose on Pods when they consume a scheduling cycle, but they didn't get scheduled and came back to the queue.
So, regardless of whether a Pod is gated or not, it is supposed to get a penalty if it wasted scheduling cycles before. That's the law in the scheduler: an obligation Pods must meet before retrying scheduling.
However, the current situation this issue highlights is that, if Pods get gated again, they somehow become immune to this penalty.
Let's say a Pod has failed scheduling 1k times before; if it was gated in the queue even for a moment, it bypasses backoff, which is illegal. That's the point I want to clarify with this PR.
Then, answering your questions,
I feel a re-gated Pod should restart the whole lifecycle of scheduling, backoffQ is designed for not blocking the queue however gated Pods will not enter into the queue.
Why is it better to restart a whole lifecycle of scheduling?
My point is, again -
Let's say Pods have failed at scheduling 1k times before, but if they were being gated in the queue even during a moment, they bypass backoff, which is illegal.
So, I believe the Pod should not restart the whole lifecycle; instead, the scheduling queue should remember that this Pod has failed scheduling 1k times, and impose a backoff penalty on it accordingly.
Use-case-wise, DRA's PreEnqueue, which is not a plugin that schedules a group of Pods, could gate a Pod again after once allowing it. (e.g., when a claim for Pod-A is deleted somehow after Pod-A's scheduling failure, it gates Pod-A until the claim is recreated.)
For a group of Pods
Sorry, I don't quite understand what you have in your mind.
But let me elaborate on our gang plugin's behaviour around retrying:
- Let's say Pod1, Pod2 and Pod3 are in the same group and they're just created, going into scheduling cycles.
- Pod1 and Pod2 go through the Filter phase and wait for Pod3 (WaitOnPermit).
- Pod3, unfortunately, is rejected by noderesourcefit.
- Pod3 goes back to unschedQ.
- We also bring Pod1 and Pod2 back to unschedQ, without waiting for their timeout. The reason is that "waiting Pods reserve too large a space in a cluster" kubernetes/enhancements#4671 (comment).
- Now, Pod1 and Pod2 have unschedPlugin: gang, and Pod3 has unschedPlugin: noderesourcefit.
- This group won't be schedulable until Pod3's failure is resolved. So, Pod1 and Pod2 will be gated until Pod3's failure is solved.
- When the scheduler receives an event that is registered in noderesourcefit ([if QHint is enabled] and when the QHint returns Queue for the event), Pod3 is moved from unschedQ to activeQ or backoffQ because Pod3 may be schedulable now.
- Then, Pod1 and Pod2 are moved to activeQ/backoffQ too, so that all Pods in this group proceed to scheduling cycles again.
- Then, if you're lucky, all Pods are scheduled successfully. If not, some of them might be unschedulable, and we do the same requeueing process again.
from kubernetes.