Application status about application HOT 12 CLOSED

konryd commented on July 29, 2024

Application status

from application.

Comments (12)

erictune commented on July 29, 2024

regarding option 3: Aggregating events that apply to members of an App is a good feature for a UI.
But, events are not a good way to determine current state. They can be lost, and they can conflict, and they are not required to be delivered or emitted.

regarding option 4: What if the App controller were responsible for listing all the "top-level objects" that match the App in the App's .status? @kow3ns WDYT? Then the UI only needs to check the status of each of those objects.
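
To illustrate option 4, here is a minimal sketch in Python. It assumes a hypothetical `status.components` list that the App controller would populate. The field names `components`, `kind`, `name`, and `ready` are invented for this example and are not part of the Application CRD.

```python
from typing import Dict, List

# Hypothetical shape of an Application object whose controller has
# listed the matching top-level objects in .status (assumed fields).
app = {
    "metadata": {"name": "wordpress"},
    "status": {
        "components": [
            {"kind": "Deployment", "name": "wordpress", "ready": True},
            {"kind": "StatefulSet", "name": "mysql", "ready": False},
        ]
    },
}


def not_ready_components(application: Dict) -> List[str]:
    """Return "Kind/name" for every listed component that is not ready."""
    components = application.get("status", {}).get("components", [])
    return [
        f'{c["kind"]}/{c["name"]}'
        for c in components
        if not c.get("ready", False)
    ]


print(not_ready_components(app))
```

A UI could display this list as-is or collapse it into one indicator. Either way it reads only the App object, not every resource behind it.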

konryd commented on July 29, 2024

events are not a good way to determine current state. They can be lost, and they can conflict, and they are not required to be delivered or emitted

Aren't these worries theoretical? Events are used to generate a pod's status, e.g. in Kubernetes Dashboard. I am not aware of users complaining.

What if the App controller were responsible for listing all the "top-level objects" that match the App in the App's .status

That would be convenient for listing a single app's components and their statuses. For computing the aggregate status of an application, though, it would be comparable to the current solution, especially when computing statuses for all applications: we'd end up with API calls covering all individual resources, which might span most of the cluster.

prydonius commented on July 29, 2024

I'm curious about how people use healthchecks to determine healthiness in the context of an application today. I would imagine a lot of simpler applications (e.g. ones we have Helm charts for today) would only need a single health check for the main services. For example, in WordPress this would be the WordPress deployment and the MySQL deployment. It may even be enough in this case to just check the health of the WordPress deployment as it should fail if it cannot connect to MySQL. I think it's good practice for a service to have a healthcheck endpoint that checks that it can successfully connect to dependent services.

It'd be helpful for me to see examples of the more complex applications/architectures that use an extra service to provide an aggregate healthcheck for the application, or would require many healthchecks for each individual component.

konryd commented on July 29, 2024

@prydonius I like the idea of using health checks. Do you have thoughts on the exact implementation? What do you think about making it a service reference in the style of #12, plus some fields for homing in on the exact port/path?

prydonius commented on July 29, 2024

@konryd would it not be easier to use workload references (I believe just these five: Deployment, StatefulSet, ReplicaSet, DaemonSet, Pod), as we can leverage the readiness/liveness checks that these already do? Now that I think about it, this also gives app devs quite a lot of flexibility to decide whether they want to aggregate healthchecks across different services they've deployed, or deploy a single healthcheck service to compute the aggregated health.

For example, in WordPress I would just point to the WordPress Deployment as this covers the database connection too. For a more complex app, I could opt to bundle a separate Pod/Deployment to reduce the overhead of querying every workload in the app bundle.
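
A minimal sketch of what reading readiness off a referenced workload could look like, written in Python against plain dicts shaped like the Kubernetes API objects. The function names are made up for this example. The Deployment check is a simplification: it only compares `status.readyReplicas` with `spec.replicas` and ignores in-progress rollouts.

```python
from typing import Dict


def deployment_ready(deployment: Dict) -> bool:
    """True when at least the desired number of replicas report ready."""
    # spec.replicas defaults to 1 when unset; status.readyReplicas is
    # omitted from the object when it is zero.
    desired = deployment.get("spec", {}).get("replicas", 1)
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return ready >= desired


def pod_ready(pod: Dict) -> bool:
    """True when the Pod has a Ready condition with status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )
```

In the WordPress case, only the WordPress Deployment would be passed to `deployment_ready`, since its readiness probe is expected to fail when MySQL is unreachable.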

This all assumes that we only need to check the health of an app's workloads. Are there other resource types that would make sense to check? PVCs/PVs, ConfigMaps, Secrets would all be part of workload readiness (the pods won't start if these aren't bound/don't exist). A LoadBalancer-type Service might not be accessible whilst the cloud provider provisions an IP/hostname, so there may be a need to check Services as you suggested (but for this case we may just be able to check for type: LoadBalancer and if an IP/hostname has been assigned). Anything else I could be missing?
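
The LoadBalancer case could be checked roughly as follows. This is a sketch in Python over a plain dict shaped like a Service object. The function name is hypothetical, and treating every non-LoadBalancer Service as provisioned is an assumption made for this example.

```python
from typing import Dict


def load_balancer_provisioned(service: Dict) -> bool:
    """For type: LoadBalancer Services, check that an IP or hostname is assigned.

    Services of any other type are treated as provisioned (assumption).
    """
    if service.get("spec", {}).get("type") != "LoadBalancer":
        return True
    # status.loadBalancer.ingress stays empty until the cloud provider
    # has assigned an address.
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    return any(entry.get("ip") or entry.get("hostname") for entry in ingress)
```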

lwander commented on July 29, 2024

The Application resource is intended to be used by UIs. One of the most useful features of those is reporting the health of the displayed resources.

Is reporting the health meant for human consumers in a UI, programmatic consumers via the API, or some combination of both?

If you're leaning more toward the former, I don't know if a single "OK" vs. "Error" is necessary, or even enough. As @prydonius points out, there's a large set of resources (not even counting CRDs) that might be worth rolling up into application health. If we want to provide users with a quick look into the status of their application, maybe the UI should surface each resource's health individually rather than try to guess at the overall status. This can include aggregating pod states, health checks, events, etc., all into your application dashboard.

On the programmatic side, coming up with a top-level heuristic can be very tricky. What if your pods fail their readiness probes a few times before coming online? Any application status that depends on this will flap between "OK" and "Error" each time you do a deployment. Furthermore, you might have more sophisticated notions of health built into your metrics/monitoring system that aren't natively exposed to the Kubernetes API, making it tricky for the application CRD to infer anything about your app's status. For example, you might be monitoring some internal queue depth, and when a certain threshold is exceeded you don't want to mark your pods as unhealthy (now no requests are processed); you want to page an SRE or scale up your deployments.

I'm pointing this out because we'd like to take advantage of this CRD in Spinnaker, which has its own notion of applications. The view surfaced to an operator gives quite a bit of detail about what's going on in your cluster, without trying to roll status up into a top-level object. Take for example this one pod that won't be scheduled in this application:

konryd commented on July 29, 2024

@prydonius I started with just services to avoid the need to re-implement their round-robin semantics, but I'm not against referencing workloads directly - nice thing about this is it wouldn't require adding a service just for the health check.

As for the unprovisioned LB case: this can't be detected with a blackbox approach (i.e. a healthcheck) from inside the cluster, which gets us back to testing components (or only services) one by one, which, as @lwander says, is often more appropriate to do on the client side / UI.

konryd commented on July 29, 2024

@lwander I think the status should be helpful in both cases: lightweight, glanceable status for humans and some reasonable check for automated tools (e.g. testers). It doesn't need to provide all of this on the API level, though.

I'd like to focus on items that require some input from the user and thus benefit from standardization in a CRD. Querying each component one by one is something that clients can do (and proceed as they wish: either roll up or display in full), whereas selecting an individual healthcheck to represent the health of a whole app requires human input.

fejta-bot commented on July 29, 2024

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot commented on July 29, 2024

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot commented on July 29, 2024

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot commented on July 29, 2024

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
