Report At our organization (Flipkart), we manage large k8s cluster

Colocating metrics provider along with the operator causes HPA delays if not configured properly about keda HOT 9 OPEN

bharathguvvala commented on July 17, 2024

Colocating metrics provider along with the operator causes HPA delays if not configured properly

from keda.

Comments (9)

JorTurFer commented on July 17, 2024

Hello @bharathguvvala
Thanks for reporting the gap in the docs, and sorry for the problem :(

I think the problem should be addressed by improving our docs on this and on how to solve it. I'd address it through docs because updating the CR is part of normal KEDA operation (it doesn't depend on the fallback) and the status is part of the CRD, so users can build checking systems on top of it.

Just to give some context about that change: the underlying controller added rate limit support with quite restrictive values (5 and 10 IIRC), and we increased those values for the most common type of user (< 100 ScaledObjects) to prevent bursting the API server by mistake.

If you are willing to help with the documentation, it'd be awesome! Real user experiences are the best teachers for other folks


bharathguvvala commented on July 17, 2024

Willing to participate in the discussion and to contribute to code/documentation.


bharathguvvala commented on July 17, 2024

@JorTurFer For scaledobjects that have not been configured with the fallback option, isn't it better to skip updating the fallback health status on every metrics read call, which would avoid redundant API calls to the K8s API Server? Instead, this condition could be updated from the controller during scaledobject reconciliations. This also means that the fallback health stats are only updated for scaledobjects where the fallback is enabled, and for the rest they are defaulted at the time of the scaledobject reconciliation.


bharathguvvala commented on July 17, 2024

Regarding the documentation, I'll raise a PR in the next couple of days that provides guidance on setting up and configuring KEDA on large clusters, i.e. those with a high number of deployments.


JorTurFer commented on July 17, 2024

@JorTurFer For scaledobjects that have not been configured with the fallback option, isn't it better to skip updating the fallback health status on every metrics read call, which would avoid redundant API calls to the K8s API Server? Instead, this condition could be updated from the controller during scaledobject reconciliations. This also means that the fallback health stats are only updated for scaledobjects where the fallback is enabled, and for the rest they are defaulted at the time of the scaledobject reconciliation.

Your point is interesting; it's true that updating the fallback all the time when the feature is disabled doesn't make sense. Checking the code, I have noticed that we may be updating the value even when there isn't any change.

Checking the fallback logic, we are calling updateStatus, and there we don't check whether the status has really changed before patching the resource:

keda/pkg/fallback/fallback.go

Lines 115 to 129 in f2d86a8

	func updateStatus(ctx context.Context, client runtimeclient.Client, scaledObject *kedav1alpha1.ScaledObject, status *kedav1alpha1.ScaledObjectStatus, metricSpec v2.MetricSpec) {
		patch := runtimeclient.MergeFrom(scaledObject.DeepCopy())

		if fallbackExistsInScaledObject(scaledObject, metricSpec) {
			status.Conditions.SetFallbackCondition(metav1.ConditionTrue, "FallbackExists", "At least one trigger is falling back on this scaled object")
		} else {
			status.Conditions.SetFallbackCondition(metav1.ConditionFalse, "NoFallbackFound", "No fallbacks are active on this scaled object")
		}

		scaledObject.Status = *status
		err := client.Status().Patch(ctx, scaledObject, patch)
		if err != nil {
			log.Error(err, "failed to patch ScaledObjects Status", "scaledObject.Namespace", scaledObject.Namespace, "scaledObject.Name", scaledObject.Name)
		}
	}

I think that we can improve that logic to reduce the calls to the API server. @zroubalik @dttung2905 @wozniakjan WDYT?


wozniakjan commented on July 17, 2024

Checking the fallback logic, we are calling updateStatus, and there we don't check whether the status has really changed before patching

That is a pretty good optimization, especially given there is already a DeepCopy available

keda/pkg/fallback/fallback.go

Line 116 in f2d86a8

patch := runtimeclient.MergeFrom(scaledObject.DeepCopy())

and the check could be as simple as !reflect.DeepEqual()
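The check suggested above could look roughly like the following. This is only a sketch of the idea, not KEDA's code: `Status`, `Condition`, `countingPatcher` and `updateStatusIfChanged` are hypothetical stand-ins so the example runs without a cluster, and the real function would compare the ScaledObject status before calling `client.Status().Patch`.

```go
package main

import (
	"fmt"
	"reflect"
)

// Condition and Status are simplified, hypothetical stand-ins for the
// ScaledObject status types; they are not the real kedav1alpha1 definitions.
type Condition struct {
	Type   string
	Status string
	Reason string
}

type Status struct {
	Conditions []Condition
	Health     map[string]int // metric name -> consecutive failure count
}

// countingPatcher stands in for the Kubernetes client and counts how many
// status patches would have been sent to the API server.
type countingPatcher struct{ calls int }

func (p *countingPatcher) PatchStatus(s Status) { p.calls++ }

// updateStatusIfChanged sends a patch only when the desired status differs
// from the current one. It reports whether a patch was sent.
// Note: reflect.DeepEqual treats a nil map/slice and an empty one as
// different, so both sides should be built the same way.
func updateStatusIfChanged(p *countingPatcher, current *Status, desired Status) bool {
	if reflect.DeepEqual(*current, desired) {
		return false // nothing changed, skip the API call
	}
	p.PatchStatus(desired)
	*current = desired
	return true
}

func main() {
	p := &countingPatcher{}
	noFallback := func() Status {
		return Status{
			Conditions: []Condition{{Type: "Fallback", Status: "False", Reason: "NoFallbackFound"}},
			Health:     map[string]int{"s0-cpu": 0},
		}
	}
	current := noFallback()

	// Same status again: no patch is sent.
	updateStatusIfChanged(p, &current, noFallback())

	// The condition flips: exactly one patch is sent.
	updateStatusIfChanged(p, &current, Status{
		Conditions: []Condition{{Type: "Fallback", Status: "True", Reason: "FallbackExists"}},
		Health:     map[string]int{"s0-cpu": 3},
	})

	fmt.Println("patches sent:", p.calls) // prints "patches sent: 1"
}
```

With a guard like this, repeated metric reads that leave the status untouched cost no API server requests.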

Regarding the documentation, I'll raise a PR in the next couple of days that provides guidance on setting up and configuring KEDA on large clusters, i.e. those with a high number of deployments.

Thank you @bharathguvvala, that would be terrific. Also, feel free to introduce code improvements; smaller PRs are generally easier to get merged.


bharathguvvala commented on July 17, 2024

@wozniakjan @JorTurFer I have made a change to disable health status updates if the fallback for the scaledobject is not configured. I will go ahead and add the tests if the intent of this change is okayed. On top of this, we could add logic to avoid redundant updates where a fallback is configured in a scaledobject.
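To illustrate the intent of that change (this is not the actual patch), here is a minimal sketch in which the metrics read path skips the status write when no fallback is configured. The `ScaledObject`, `Fallback`, `statusWriter` and `recordMetricRead` names are simplified, hypothetical stand-ins for this example.

```go
package main

import "fmt"

// Fallback and ScaledObject are simplified, hypothetical stand-ins for the
// KEDA types, kept only to show the control flow.
type Fallback struct {
	FailureThreshold int
	Replicas         int
}

type ScaledObject struct {
	Name     string
	Fallback *Fallback // nil when the user has not configured a fallback
}

// statusWriter counts the status updates that would reach the API server.
type statusWriter struct{ calls int }

func (w *statusWriter) write(so *ScaledObject) { w.calls++ }

// recordMetricRead models the metrics read path: the health status is only
// written for objects that opted in to the fallback feature.
func recordMetricRead(w *statusWriter, so *ScaledObject) {
	if so.Fallback == nil {
		return // fallback not configured: no health status update
	}
	w.write(so)
}

func main() {
	w := &statusWriter{}
	plain := &ScaledObject{Name: "no-fallback"}
	guarded := &ScaledObject{Name: "with-fallback", Fallback: &Fallback{FailureThreshold: 3, Replicas: 2}}

	// Three metric reads per object, as an HPA would issue periodically.
	for i := 0; i < 3; i++ {
		recordMetricRead(w, plain)
		recordMetricRead(w, guarded)
	}
	fmt.Println("status writes:", w.calls) // prints "status writes: 3"
}
```

Only the object with a fallback produces writes, so on a cluster where most ScaledObjects have no fallback the read path stops generating API traffic for them.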


bharathguvvala commented on July 17, 2024

@JorTurFer @wozniakjan Taking a step back, I was wondering whether this information about the error count needs to be written back to the scaledobject status at all. Since this is transient information used to implement a form of circuit breaking, isn't it more appropriate to keep it inside the operator (in memory) and only update the condition whenever it flips?

I presume this information is written to the status only to make it persistent so that it survives operator restarts, but that is expensive: the update is performed in the read path of GetMetrics and can affect GetMetrics latencies, which in turn can affect the autoscaler SLOs. If the same error count can be reconstructed from scratch from the new set of errors, why not avoid persisting it?


