Comments (3)
Regarding the source of the delay when PublishNotReadyAddresses=false, I recorded a demo of creating a RayCluster with one worker in a fresh Kind environment:
Full Demo Recording:
output.mp4
Timeline Breakdown:
- Kind took about 36 seconds to download the Ray image.
- The worker took about 53 seconds to pass the ray health-check.
- The worker took another 10 seconds to pass the k8s readiness check.
Result Screenshot:
As a result, the RayCluster took about 100 seconds to become fully ready.
Root Cause Analysis
From the timeline breakdown above, we can see that most of the time was spent on multiple retries of ray health-check.
To inspect what is happening underneath, we can export GRPC_VERBOSITY=DEBUG before running ray health-check and preserve its output:
diff --git a/ray-operator/controllers/ray/common/pod.go b/ray-operator/controllers/ray/common/pod.go
index 9939e63..1d747b7 100644
--- a/ray-operator/controllers/ray/common/pod.go
+++ b/ray-operator/controllers/ray/common/pod.go
@@ -172,16 +172,19 @@ func DefaultWorkerPodTemplate(ctx context.Context, instance rayv1.RayCluster, wo
Args: []string{
fmt.Sprintf(`
SECONDS=0
while true; do
if (( SECONDS <= 120 )); then
+ export GRPC_VERBOSITY=DEBUG
- if ray health-check --address %s:%s > /dev/null 2>&1; then
+ if ray health-check --address %s:%s; then
break
fi
echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
Then we apply the RayCluster again to a fresh Kind environment:
Full Debug Recording:
output2.mp4
Debug Screenshot:
We can now confirm that ray health-check can fail multiple times with the Domain name not found error, which is indeed caused by PublishNotReadyAddresses=false.
from kuberay.
Regarding the second question: Should we also set PublishNotReadyAddresses=true for other service types?
My previous thought was that other service types would not raise the Domain name not found error because they have a virtual IP, so there was no need to set PublishNotReadyAddresses=true for them.
However, given that
- KubeRay guarantees that a Head service applies to only one Pod at any given time, so there is no risk of reaching the wrong Head.
- Letting users access the Ray dashboard through the service for troubleshooting when the readiness check fails is an important use case.
I think we can safely set PublishNotReadyAddresses=true for all Head service types. WDYT @kevin85421? I'd value your feedback on this.
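The difference between the two policies can be sketched as follows. This is only an illustration: serviceSpec is a local stand-in for the few corev1.ServiceSpec fields discussed here (not the real Kubernetes type), both function names are hypothetical, and the "headless only" variant assumes that was the earlier reasoning.

```go
package main

import "fmt"

// serviceSpec is a local stand-in (an assumption for this sketch) for the
// fields of corev1.ServiceSpec that matter in this discussion.
type serviceSpec struct {
	Type                     string // "ClusterIP", "NodePort", or "LoadBalancer"
	ClusterIP                string // "None" marks a headless service
	PublishNotReadyAddresses bool
}

// headlessOnly models the earlier reasoning: only headless services need the
// flag, because only they have no virtual IP to resolve to.
func headlessOnly(spec serviceSpec) serviceSpec {
	spec.PublishNotReadyAddresses = spec.ClusterIP == "None"
	return spec
}

// allHeadServices models the proposal: set the flag for every Head service
// type, so the dashboard stays reachable while readiness is failing.
func allHeadServices(spec serviceSpec) serviceSpec {
	spec.PublishNotReadyAddresses = true
	return spec
}

func main() {
	specs := []serviceSpec{
		{Type: "ClusterIP", ClusterIP: "None"},
		{Type: "ClusterIP"},
		{Type: "NodePort"},
		{Type: "LoadBalancer"},
	}
	for _, s := range specs {
		fmt.Printf("type=%-12s headless=%-5t headlessOnly=%-5t allHeadServices=%t\n",
			s.Type, s.ClusterIP == "None",
			headlessOnly(s).PublishNotReadyAddresses,
			allHeadServices(s).PublishNotReadyAddresses)
	}
}
```

In the real operator the same decision would be applied to the PublishNotReadyAddresses field of the Service spec that is built for the Head Pod.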
The worker took about 53 seconds to pass the ray health-check.
Interesting. I am very surprised that DNS registration takes much longer than I expected, even though the head Pod is already ready. Thank you for the investigation!
I think we can safely set PublishNotReadyAddresses=true for all Head service types.
Makes sense to me.