Comments (7)
Which Ray images are you using? You should use images that include aarch64 in the image tag.
from kuberay.
@kevin85421 yes, I'm using aarch64 images, 2.22.0-py310-aarch64 for Ray to be exact.
@kevin85421 do you have any idea what may be happening? This blocks me.
I tried the following on my Mac M1, and my RayCluster is healthy; no pods have been killed.
kind create cluster
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64
- We may have some differences: (1) kind vs. minikube, (2) M1 vs. M2, (3) different instructions.
  - You can try kind to determine whether the issue is minikube-only or not.
  - Use exactly the same instructions as above in your environment.
Btw, are you in the Ray Slack workspace? It would be helpful to join it. Other KubeRay users can also share their experiences. You can join the #kuberay-questions channel.
@kevin85421 what container runtime do you use? Colima or Docker Desktop?
I use Docker.
Ok @kevin85421, I think I found the culprit: some weird behaviour with the worker.minReplicas parameter when autoscaling is enabled (head.enableInTreeAutoscaling: true).
Example cases:
- worker: replicas: 4, minReplicas: 0, maxReplicas: 1000
  I get 4 pods launched, then (after about 60s) all 4 fail the readiness probe and get killed.
- worker: replicas: 4, minReplicas: 2, maxReplicas: 1000
  I get 4 pods launched, then (after about 60s) 2 fail the readiness probe and die; 2 stay healthy and work.
- If I set no min (worker: replicas: 4, maxReplicas: 1000), I get 4 pods launched, then (after about 60s) 3 fail the readiness probe and get killed; 1 stays healthy and works.
- If I set worker.replicas = worker.minReplicas = 4, I get all 4 working properly.
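To make the pattern in these cases easier to see, here is a minimal Python sketch that tabulates the four observations and checks whether the number of pods still healthy after about 60s equals minReplicas. The fallback value of 1 for an unset minReplicas is an assumption chosen to be consistent with the third case; it is not confirmed anywhere in this thread. The names `observations`, `effective_min` and `matches_min_replicas` are hypothetical helpers for illustration only.

```python
# Observations copied from the example cases: configured replicas, configured
# minReplicas (None = not set), and pods still healthy after about 60s.
ASSUMED_DEFAULT_MIN_REPLICAS = 1  # assumption, not confirmed in this thread

observations = [
    {"replicas": 4, "min_replicas": 0, "healthy": 0},
    {"replicas": 4, "min_replicas": 2, "healthy": 2},
    {"replicas": 4, "min_replicas": None, "healthy": 1},
    {"replicas": 4, "min_replicas": 4, "healthy": 4},
]


def effective_min(min_replicas):
    """Return minReplicas, falling back to the assumed default when unset."""
    if min_replicas is None:
        return ASSUMED_DEFAULT_MIN_REPLICAS
    return min_replicas


def matches_min_replicas(obs):
    """True if the surviving pod count equals the effective minReplicas."""
    return obs["healthy"] == effective_min(obs["min_replicas"])


for obs in observations:
    print(obs, "-> healthy == minReplicas:", matches_min_replicas(obs))
```

Under that assumption every case lines up with minReplicas rather than replicas, which is only a correlation in four data points and not proof of what the operator or the autoscaler does internally.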
I also noticed that not setting worker.maxReplicas leads to weird behaviour as well (the number of pods does not match the request), and the head node throws an error, with the autoscaler not working properly.
So I see two possible things here (which may be interconnected):
- KubeRay uses worker.minReplicas as the default when the autoscaler is on, after recovering from a readiness probe failure (which is unexpected, as it should use the worker.replicas value)?
- Readiness probes fail only on pods not tracked by the autoscaler (not sure why)?
Disabling enableInTreeAutoscaling makes everything work as expected.
What do you think?