Comments (8)
/sig scheduling
from kubernetes.
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
from kubernetes.
/assign
from kubernetes.
I don't consider this a bug, as PreEnqueue was not initially designed for Pods that went through the scheduling cycle already.
/remove-kind bug
/kind feature
In any case, I'm happy to review a reasonable proposal.
from kubernetes.
Current design is reasonable to me, so again ping @Huang-Wei
from kubernetes.
@kerthcet moving the discussion here:
To reply to your last comment
I don't know if we have such case, but if a pod is gated again, it means it's not ready, maybe it should restart the scheduling lifecycle.. my two cents. Let's track this under your issue then.
#125527 (comment)
At least, we have a use case here -
https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang
(wow hello again 😅 I've brought it into several discussion)
What it does with PreEnqueue is:
We also use PreEnqueue in https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang (gang scheduling); PreEnqueue allows Pods to enter scheduling cycles iff all Pods in the group are now ready to retry scheduling.
e.g., when gangA-pod1 is rejected by plugin-1 and gangA-pod2 is rejected by plugin-2, we want to requeue all gangA Pods iff the scheduling queue observes some cluster events that could change the scheduling results of gangA-pod1 and gangA-pod2 (i.e., when the QHints of plugin-1 and plugin-2 return Queue for those Pods).
#125223 (comment)
So, Pods that once got un-gated could get gated again.
To clarify, from a less subjective view, we have two options here:
- Add "once PreEnqueue allows a Pod to enter a scheduling cycle, PreEnqueue shouldn't gate the Pod again" to the PreEnqueue specification.
- Allow "even after PreEnqueue allows a Pod to enter a scheduling cycle once, PreEnqueue could gate the Pod again", and fix the backoff behaviour that the issue description argues in "What did you expect to happen?".
As @alculquicondor said, I admit the flow gated -> un-gated -> (goes through scheduling cycles) -> gated again hasn't been officially considered until now, since the scheduling gate, which doesn't allow that flow, was the original motivation for establishing PreEnqueue.
So, the current situation is that we don't say anything about that case; people like us can technically implement that flow with almost no problem, except for this minor issue.
Current design is reasonable to me
#125538 (comment)
(As you know) I tend toward the second option, since I don't see any big reason/concern that requires us to limit the usage of PreEnqueue the way the first option does.
I'm curious why the current behaviour looks reasonable to you.
from kubernetes.
I can break this into two parts:
- A single Pod
- A group of Pods
I was mostly pointing to the single-Pod case: I feel a re-gated Pod should restart the whole scheduling lifecycle. backoffQ is designed to avoid blocking the queue, but gated Pods never enter the queue in the first place.
For a group of Pods, it's much more complex. If a Pod fails scheduling, should it be marked as unschedulable or gated again? Probably unschedulable, in which case it's already under backoff. Those Pods should be re-scheduled first next time, because they're the main reason blocking the whole group. And when scheduling times out, how should these Pods be handled: should assumed Pods be gated again, while unscheduled Pods remain unscheduled, backing off in the background?
So the total workflow is still unclear to me, but that's what I'm working for right now. Always open to new ideas :)
from kubernetes.
Okay, so first, this is not a direct answer to your last comment (sorry), but let me confirm one thing, as I feel like we're not on the same page.
This issue does not depend on which kind of Pods they are. The reason I cannot justify the current behaviour is more general.
Reviewing the concept of backoff (just in case), backoff time is a penalty that we impose on Pods when they consume a scheduling cycle, but they didn't get scheduled and came back to the queue.
So, regardless of whether a Pod is gated or not, it is supposed to get a penalty if it wasted scheduling cycles before. That's the law in the scheduler: an obligation Pods must meet before retrying scheduling.
However, the current situation this issue highlights is that, if Pods get gated again, they somehow become immune to this penalty.
Let's say a Pod has failed scheduling 1k times before; if it was gated in the queue even for a moment, it bypasses backoff, which is illegal. That's the point I want to clarify with this PR.
Then, answering your questions,
I feel a re-gated Pod should restart the whole lifecycle of scheduling, backoffQ is designed for not blocking the queue however gated Pods will not enter into the queue.
Why is it better to restart a whole lifecycle of scheduling?
My point is, again -
Let's say Pods have failed at scheduling 1k times before, but if they were being gated in the queue even during a moment, they bypass backoff, which is illegal.
So, I believe the Pod should not restart the whole lifecycle; instead, the scheduling queue should remember that this Pod has failed scheduling 1k times, and impose a backoff penalty on it accordingly.
Use-case-wise, DRA's PreEnqueue, which is not a plugin that schedules a group of Pods, could gate a Pod again after once allowing it. (e.g., when a claim for Pod-A is deleted somehow after Pod-A's scheduling failure, it gates Pod-A until the claim is recreated.)
For a group of Pods
Sorry, I don't quite understand what you have in your mind.
But let me elaborate on our gang plugin's behaviour around retrying:
- Let's say Pod1, Pod2 and Pod3 are in the same group and they're just created, going into scheduling cycles.
- Pod1 and Pod2 go through the Filter phase and wait for Pod3 (WaitOnPermit).
- Pod3, unfortunately, is rejected by noderesourcefit.
- Pod3 goes back to unschedQ.
- We also bring Pod1 and Pod2 back to unschedQ, without waiting for their timeout. The reason is that "waiting Pods reserve too large a space in a cluster" kubernetes/enhancements#4671 (comment).
- Now, Pod1 and Pod2 have unschedPlugin: gang, and Pod3 has unschedPlugin: noderesourcefit.
- This group won't be schedulable until Pod3's failure is resolved. So, Pod1 and Pod2 will be gated until Pod3's failure is solved.
- When the scheduler receives an event that is registered in noderesourcefit ([if QHint is enabled] and when the QHint returns Queue for the event), Pod3 is moved from unschedQ to activeQ or backoffQ because Pod3 may be schedulable now.
- Then, Pod1 and Pod2 are moved to activeQ/backoffQ too, so that all Pods in this group proceed to scheduling cycles again.
- Then, if you're lucky, all Pods are scheduled successfully. If not, some of them might be unschedulable, and we do the same requeueing process again.
from kubernetes.