Comments (7)
@MadhavJivrajani Great! I will let you know when I have a doc.
from kuberay.
I have already worked on a document. I will let you know when it is ready for review.
from kuberay.
Honestly, I don't think KubeRay should handle and expose K8s Pod errors. You can think of RayCluster as equivalent to multiple ReplicaSets. ReplicaSetStatus doesn't include "Pod failure" in its status. Maybe we can introduce a new conditions field to handle Pod-level observability. Currently, the RayCluster state includes Failed, which is quite undefined and makes the state machine rather messy. I am planning to refactor the RayCluster status soon. If you are interested, we can work on it together, or you can provide feedback on my design document.
from kuberay.
I'd be happy to help out here if help is needed @han-steve!
from kuberay.
Thanks for the response. I agree that the status state machine can get messy with pod failure statuses. An alternative would be to use the Conditions field to reflect the errors in the underlying cluster. For example, ReplicaSet and Deployment use a Condition to inform a user that the pods fail to scale up due to a resource quota error. They also produce events that can be easily seen with a kubectl describe.
Our goal is to surface the underlying error to the user so they know if a job is pending or stuck due to resource quota errors. If there's no plan to surface these conditions, we'll query the associated ray cluster for this info to show to the user. Thanks again for taking a look!
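To make the Conditions idea above concrete, here is a minimal sketch in Go. It is not KubeRay code: the Condition struct is a simplified local stand-in for a Kubernetes status condition, and setCondition and findCondition are hypothetical helpers. The type ReplicaFailure and reason FailedCreate follow what ReplicaSet reports when Pod creation fails; the Pod name and message text are made up for illustration.

```go
package main

import "fmt"

// Condition is a simplified, local stand-in for a Kubernetes status
// condition. It is an assumption for illustration, not the real API type.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// setCondition (hypothetical helper) updates the condition with the same
// Type in place, or appends it when no such condition exists yet.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// findCondition (hypothetical helper) returns the condition of the given
// type and whether it was found.
func findCondition(conds []Condition, t string) (Condition, bool) {
	for _, c := range conds {
		if c.Type == t {
			return c, true
		}
	}
	return Condition{}, false
}

func main() {
	var conds []Condition

	// A controller could record a quota failure like this.
	conds = setCondition(conds, Condition{
		Type:    "ReplicaFailure",
		Status:  "True",
		Reason:  "FailedCreate",
		Message: `pods "worker-0" is forbidden: exceeded quota`, // made-up message
	})

	// A client can then tell "stuck on quota" apart from "still pending".
	if c, ok := findCondition(conds, "ReplicaFailure"); ok && c.Status == "True" {
		fmt.Printf("%s: %s\n", c.Reason, c.Message)
	}
}
```

Keeping one entry per condition type means a client only has to look up a single well-known type instead of interpreting a free-form state string.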
from kuberay.
Hi @han-steve @MadhavJivrajani,
I have scheduled a meeting for the RayCluster status improvement work stream on July 10, 8:30 - 8:55 AM PT. You can add the following Google calendar to subscribe to events for the Ray / KubeRay open-source community.
from kuberay.