
kuberay's People

Contributors

akanso, andrewsykim, anishasthana, architkulkarni, astefanutti, basasuya, blublinsky, brucez-anyscale, chenk008, davidxia, dependabot[bot], dmitrigekhtman, evalaiyc98, haoxins, harryge00, jasoonn, jeffwan, kevin85421, missiontomars, pingsutw, richardsliu, rueian, scarlet25151, shrekris-anyscale, tedhtchang, vinayakankugoyal, wilsonwang371, yicheng-lu-llll, z103cb, zcin


kuberay's Issues

[Feature][backend] Implement gRPC Backend service

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

part of #54

This issue tracks the implementation of the gRPC backend service for Cluster, ComputeTemplate, Image, etc.
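As a rough sketch of what the backend surface could look like, here is a minimal pure-Go service interface with an in-memory stand-in for the Kubernetes-backed implementation. All names here are hypothetical; the real types would come from the generated protobuf/gRPC code.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterService is a hypothetical sketch of the surface the gRPC backend
// would expose for Cluster objects.
type ClusterService interface {
	CreateCluster(name string) error
	GetCluster(name string) (string, error)
}

// inMemoryService is a stand-in for the Kubernetes-backed implementation.
type inMemoryService struct {
	clusters map[string]string
}

func newInMemoryService() *inMemoryService {
	return &inMemoryService{clusters: map[string]string{}}
}

func (s *inMemoryService) CreateCluster(name string) error {
	if _, ok := s.clusters[name]; ok {
		return errors.New("cluster already exists")
	}
	s.clusters[name] = "Running"
	return nil
}

func (s *inMemoryService) GetCluster(name string) (string, error) {
	state, ok := s.clusters[name]
	if !ok {
		return "", errors.New("not found")
	}
	return state, nil
}

func main() {
	var svc ClusterService = newInMemoryService()
	_ = svc.CreateCluster("demo")
	state, _ := svc.GetCluster("demo")
	fmt.Println(state)
}
```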

Use case

No response

Related issues

#54

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Create a docs folder to track design docs

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

We need to create a docs folder at the root of the repository to track large-feature and new sub-project designs, etc.

Use case

To give developers rich information to participate in the community and understand the design principles.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Enable code formatting, go imports, linters in CI

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

We may not always remember to do this manually, which results in inconsistent formatting.
Let's create a standard to help us maintain the code here.

  • gofmt
  • goimports
  • golangci-lint
curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(go env GOPATH)/bin v1.23.7

golangci-lint run  ./... # introduce linter config.yaml if necessary

I think the overall idea is to improve project quality. See suggestions here https://goreportcard.com/report/github.com/ray-project/kuberay


Use case

To avoid inconsistencies across contributions from different contributors.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Restart worker pod when raylet exited

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Currently, the ray-operator generates the worker pod command:

ulimit -n 65536; ray start  --node-ip-address=$MY_POD_IP  --redis-password=LetMeInRay  --address=raycluster-heterogeneous-head-svc:6379  && sleep infinity

If the raylet in the worker pod exits, it is not restarted, so the pod becomes useless without a raylet.

Use case

If the raylet in the worker pod exits, it should be restarted.

One option is starting Ray with --block and removing the sleep infinity.
Another option is restarting the pod so the raylet starts again; maybe we can add a default liveness probe.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

upgrade CRD to `apiextensions.k8s.io/v1`

upgrade CRD from apiextensions.k8s.io/v1beta1 to apiextensions.k8s.io/v1

to avoid:

Warning: apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition

Chore: Prepare for v0.1.0 release

After #17, we should prepare a v0.1.0 branch. Before the official release, a few things need to be done. I created a milestone to track the progress.

  1. Find a place to host the container image so users can have a one-command installation experience #51
  2. Clean up go mod links. #39
  3. Clean up README.md documentation.
  4. Prepare community profiles - Code of conduct, Contributing, License etc. #49
  5. [Feature] Provide hosted operator container image #51
  6. Fix Flaky tests #35

/cc @akanso @chenk008 @chaomengyuan

[Feature] [coreapi] Scripts to generate go clients and swagger files

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Sub story of #53

We need to generate go_client and swagger files for #82 and #83. Guidance needs to be provided as well to onboard developers.

Use case

We need an easier way to generate clients during development.

Related issues

#53

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Create a root level .gitignore file

Since we may create more folders in this repo, we need a root-level .gitignore file. This can be seeded from ray-operator/.gitignore. We can still keep per-project/folder .gitignore files for specific entries.

Support in-tree autoscaler in ray operator

The controller supports scaling in arbitrary pods via the following API. This is extremely helpful for users who use an out-of-tree autoscaler.

https://github.com/ray-project/ray-contrib/blob/f4076b4ec5bfae4cea6d9b66a1ec4e63680ca366/ray-operator/api/v1alpha1/raycluster_types.go#L56-L60

In our case, we would still like to use the in-tree autoscaler. The major differences are:

  1. we want to enable the in-tree autoscaler in the head pod via --autoscaling-config, and the head pod will start the monitor process
  2. the config actually has to come from the operator, so the operator needs to convert the RayCluster custom resource to a config file that the in-tree autoscaler can use. example here
  3. since the head and the operator are hosted in different pods, the operator needs to create the ConfigMap and mount it to the head pod transparently.

A new field has been reserved in the API to support this change.

https://github.com/ray-project/ray-contrib/pull/22/files#diff-edc3be4feb67012c143a57fcaefafb4c95e4cd6e661a67bb2ad1da340255bc00R21-R22

[Feature] Enable e2e tests using examples

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

This is a follow-up request to #42.

We should enable e2e testing using some examples from the operator. That helps track example changes and makes sure they are always working.

Use case

Users will use the examples directly in their clusters.

Related issues

#42

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Create community profile

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

We plan to attract more contributors and visibility for this repo. Compared to the recommended community standards, we at least need to add some basic profiles like:

  • Code of conduct
  • Contributing
  • License

Use case

Help users understand how they can interact with this repo.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Support nightly container image

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Build nightly container images for every commit. This should be part of the GitHub workflow.

Use case

Users can use a version with fixes from master without waiting for a stable release.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Support clientset for RayCluster objects

We have some backend services written in Golang that want to interact with RayCluster Kubernetes objects, and they need typed clients.

Kubebuilder itself doesn't generate clients for objects. It would be great to leverage code generation to produce a clientset for external systems to interact with.

[Bug] Deleting and recreating a RayCluster can lead to a reconcile error.

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Because there is not yet rolling update support, it's often necessary to stop and restart the RayCluster in quick succession to push updates. It seems in some cases this can lead to an error where there are temporarily two head node pods for the cluster at once:

2021-10-15T03:39:42.599Z	INFO	raycluster-controller	Reconciling RayCluster	{"cluster name": "cluster-3"}
2021-10-15T03:39:42.599Z	INFO	RayCluster-Controller	checkSvcName 	{"svc name amended": "cluster-3-service"}
2021-10-15T03:39:42.606Z	INFO	raycluster-controller	Pod service already exist,no need to create
2021-10-15T03:39:42.606Z	INFO	raycluster-controller	checkPods 	{"more than 1 head pod found for cluster": "cluster-3"}
E1015 03:39:42.606835       1 runtime.go:73] Observed a panic: runtime.boundsError{x:1, y:1, signed:true, code:0x0} (runtime error: index out of range [1] with length 1)
goroutine 291 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1344420, 0xc00099e840)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:69 +0x7b
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:51 +0x89
panic(0x1344420, 0xc00099e840)
	/usr/local/go/src/runtime/panic.go:969 +0x1b9
ray-operator/controllers.(*RayClusterReconciler).checkPods(0xc00073e600, 0xc0000d5c00, 0xc0005409f0, 0x22, 0x0, 0x1a)
	/workspace/controllers/raycluster_controller.go:128 +0x2b15
ray-operator/controllers.(*RayClusterReconciler).Reconcile(0xc00073e600, 0xc000513630, 0x9, 0xc00050d220, 0x1a, 0x5f3c0599, 0xc0000b33b0, 0xc00052e048, 0xc00052e000)
	/workspace/controllers/raycluster_controller.go:94 +0x475
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00029e000, 0x12d3ce0, 0xc00000c080, 0xc00070a500)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00029e000, 0xc0004b6f00)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00029e000)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00062b020)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00062b020, 0x3b9aca00, 0x0, 0x1488001, 0xc0000b8180)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0x105
k8s.io/apimachinery/pkg/util/wait.Until(0xc00062b020, 0x3b9aca00, 0xc0000b8180)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:157 +0x331
panic: runtime error: index out of range [1] with length 1 [recovered]
	panic: runtime error: index out of range [1] with length 1

goroutine 291 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:58 +0x10c
panic(0x1344420, 0xc00099e840)
	/usr/local/go/src/runtime/panic.go:969 +0x1b9
ray-operator/controllers.(*RayClusterReconciler).checkPods(0xc00073e600, 0xc0000d5c00, 0xc0005409f0, 0x22, 0x0, 0x1a)
	/workspace/controllers/raycluster_controller.go:128 +0x2b15
ray-operator/controllers.(*RayClusterReconciler).Reconcile(0xc00073e600, 0xc000513630, 0x9, 0xc00050d220, 0x1a, 0x5f3c0599, 0xc0000b33b0, 0xc00052e048, 0xc00052e000)
	/workspace/controllers/raycluster_controller.go:94 +0x475
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00029e000, 0x12d3ce0, 0xc00000c080, 0xc00070a500)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00029e000, 0xc0004b6f00)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00029e000)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00062b020)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00062b020, 0x3b9aca00, 0x0, 0x1488001, 0xc0000b8180)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0x105
k8s.io/apimachinery/pkg/util/wait.Until(0xc00062b020, 0x3b9aca00, 0xc0000b8180)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:157 +0x331

Reproduction script

This seems to be a race condition, as it doesn't always happen. The general approach was to delete a cluster and then recreate it with the same name.
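A hedged sketch of a defensive fix for the panic above: instead of indexing past the end of the filtered head-pod list, keep one head pod and report any extras for deletion. The function name and string-based pod list are illustrative, not the actual checkPods signature.

```go
package main

import "fmt"

// reconcileHeadPods handles 0, 1, or many head pods without panicking.
// With more than one head pod (e.g. a delete+recreate race), it keeps the
// first and returns the rest as candidates for deletion.
func reconcileHeadPods(headPods []string) (keep string, deleteExtras []string, err error) {
	switch len(headPods) {
	case 0:
		return "", nil, fmt.Errorf("no head pod found")
	case 1:
		return headPods[0], nil, nil
	default:
		// More than one head pod: schedule the extras for deletion
		// instead of indexing out of range.
		return headPods[0], headPods[1:], nil
	}
}

func main() {
	keep, extras, _ := reconcileHeadPods([]string{"head-a", "head-b"})
	fmt.Println(keep, extras)
}
```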

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Easier way to build the cluster image

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Prepare a way to install the dependencies and build a custom image for users.

Use case

For a dedicated cluster that runs a specific workload, users normally won't use a job-level runtime. Instead, they expect the cluster to have dependencies installed directly so they can kick off the job immediately.

In that case, we need to create the Ray cluster with preinstalled dependencies.
Users can definitely run a Dockerfile on their own, but it's not easy for a platform owner to share the images and index the dependencies.

Let's implement this in a managed way.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Verify podName before attempting to create the pod

Currently:

podName := strings.ToLower(instance.Name + common.DashSymbol + string(rayiov1alpha1.HeadNode) + common.DashSymbol)

This can yield an invalid podName longer than 63 characters. We need to check that the names we are building are valid.

Flaky tests: should update a raycluster object deleting a random pod

https://github.com/ray-project/ray-contrib/runs/3628922513

The post-submit job failed. I think these are flaky tests. I will put in some time to figure out the problem.

Summarizing 2 Failures:

[Fail] Inside the default namespace When creating a raycluster [It] should update a raycluster object deleting a random pod 
/home/runner/work/ray-contrib/ray-contrib/ray-operator/controllers/raycluster_controller_test.go:196

[Fail] Inside the default namespace When creating a raycluster [It] should have only 2 running worker 
/home/runner/work/ray-contrib/ray-contrib/ray-operator/controllers/raycluster_controller_test.go:203

Ran 10 of 10 Specs in 23.194 seconds
FAIL! -- 8 Passed | 2 Failed | 0 Pending | 0 Skipped
--- FAIL: TestAPIs (23.19s)
FAIL

Q) Is it possible to use python client outside of the k8s cluster now?

Hi folks, I've been trying to use this new ray-operator to control a RayCluster in a k8s cluster.

I've deployed the RayCluster sample to my minikube for testing with samples/ray-cluster.mini.yaml.

There is now a ray-cluster-head-svc, but this svc doesn't expose port 10001 for using the Python client (ray.init(address=...)).

Even though I could reach the Ray dashboard with kubectl port-forward, I'm stuck on how to ray.init against my new RayCluster in the k8s cluster.

Is this feature not implemented yet? Or would just specifying the port at spec.headGroupSpec.template.spec.containers[0].ports enable ray.init?

[discussion] folder structure to support further sub-projects

Due to previous project naming, we put the ray-operator in a separate folder. I currently notice some limitations in extending this project with more sub-projects and want to discuss them with you.

Option 1. Each sub-project is an individual go module

Pros: the project is organized by modules, and it's easy to tell which code belongs to which module.
Cons: several individual Go modules that cannot share common code.

kuberay
├── proto
├── backend
│   ├── ...
│   └── go.mod
├── cli
│   ├── ....
│   └── go.mod
└── ray-operator
    ├── api
    ├── bin
    ├── config
    ├── controllers
    ├── go.mod
    ├── go.sum
    └── main.go
module github.com/ray-project/kuberay/ray-operator
module github.com/ray-project/kuberay/backend
module github.com/ray-project/kuberay/plugin

Option 2. Use a single module for kuberay project

Let's assume we have different components: plugins, the ray operator, backends. Even if we put everything into the same project, they are built separately and packaged in separate container images.

Pros: good for testing and sharing utils; easy to reference APIs (no need for replace or go get).
Cons: not friendly for module users?

kuberay
├── api
│   ├── raycluster
│   │   └── v1alpha1
│   │       ├── groupversion_info.go
│   │       ├── raycluster_types.go
│   │       ├── raycluster_types_test.go
│   │       └── zz_generated.deepcopy.go
│   ├── notebook
│   │   └── v1alpha1
│   │       ├── groupversion_info.go
│   │       └── notebook_types.go
│   └── image
│       └── v1alpha1
│           ├── groupversion_info.go
│           └── image_types.go
├── cmd
│   ├── ray
│   │   └── main.go
│   ├── kubectl-plugin
│   │   └── main.go
│   └── notebook
│       └── main.go
├── pkg
│   └── service
│       └── backend.go
├── controllers
│   ├── common
│   ├── raycluster_controller.go
│   ├── notebook_controller.go
│   └── utils
├── go.mod
├── go.sum
├── hack
│   └── boilerplate.go.txt
└── main.go

/cc @akanso @chenk008 @chaomengyuan

[Feature] [coreapi] Code generation for protobuf message

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

part of #53

To keep the PRs small and maintainable, let's have the generated code in this separate PR.

Use case

No response

Related issues

#53

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Autodetect GPU resources to advertise to Ray.

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

For Ray pods configured to use GPUs via one of the standard device drivers, read the number of GPUs from the resource limits and automatically add that number to the --num-gpus argument of ray start when --num-gpus is not already specified.

The implementation should be straightforward and similar to existing CPU detection logic.
See the discussion in ray-project/ray#20265
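The proposed detection can be sketched as a small helper. This is illustrative only: the resource-key matching (suffix "/gpu", e.g. nvidia.com/gpu) and function name are assumptions, not the operator's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// appendNumGPUs adds --num-gpus to the ray start command when the container's
// resource limits request GPUs and the user has not already set the flag.
func appendNumGPUs(startCmd string, limits map[string]int64) string {
	if strings.Contains(startCmd, "--num-gpus") {
		return startCmd // explicit user setting wins
	}
	for resource, count := range limits {
		// Match GPU-style extended resources such as "nvidia.com/gpu".
		if strings.HasSuffix(resource, "/gpu") && count > 0 {
			return fmt.Sprintf("%s --num-gpus=%d", startCmd, count)
		}
	}
	return startCmd
}

func main() {
	cmd := appendNumGPUs("ray start --head", map[string]int64{"nvidia.com/gpu": 2})
	fmt.Println(cmd)
}
```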

Use case

This will simplify configuration for GPU workloads.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Remove BAZEL build in ray-operator project

Bazel is being removed as the build system for Kubernetes itself, in favor of the existing make build tooling that releases have always used.

We should remove BAZEL from this project as well. The existing tooling is good enough.


Consolidate the different operators into a unified Golang ray operator

After talking with @chenk008 @caitengwei @akanso, I think we already have some consensus on this plan. Before we move forward, let's discuss some detailed items that are still not clear. Please have a review and leave your comments.

/cc @chenk008 @caitengwei @akanso @chaomengyuan @wilsonwang371
/cc @DmitriGekhtman @yiranwang52

API

It might be easiest to follow one of the three styles and make the required changes to meet everyone's needs. Ant Group uses some low-level objects to gain granular control and reusability in its API. MSFT and Bytedance both use PodTemplate for strong consistency with Pod specs; the distinctions between the two are in the details.

Let's use the MSFT API as the template here and see if anything else needs to be added.

  1. Scale in arbitrary pods
    Ant and MSFT use different ways to support this feature in the API. I personally think there are tradeoffs: IdList can exactly reflect the accurate state of the cluster, but the IdList could become long for large clusters. workersToDelete overcomes the large-cluster pain point, though the information on RayClusterSpec could be outdated, which is still acceptable. I think at this moment we both agree to support workersToDelete first. If we have further requirements, we can extend ScaleStrategy for richer configurations.

Ant Group
https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/antgroup/deploy/ray-operator/api/v1alpha1/raycluster_types.go#L89-L93

MSFT
https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L56-L60

  2. Remove HeadService
    https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L17-L19

I feel the ideal way is to add one more field to HeadGroupSpec and ask the controller to generate it. The only thing the user wants to control is what kind of service (including ingress) to create. It could be ClusterIP, NodePort, LoadBalancer, or Ingress, based on the network environment. Ports should be exposed based on the head pod settings.

  3. Add more status in the RayClusterStatus.

https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/msft-operator/ray-operator/api/v1alpha1/raycluster_types.go#L63-L71

This is not enough to reflect cluster status. I feel it's better to extend the status to concentrate more on cluster-level status, like Bytedance did here: https://github.com/ray-project/ray-contrib/blob/881b59aa44d184a7bdf177b8782d4e1be358fcdf/bytedance/pkg/api/v1alpha1/raycluster_types.go#L87-L102

  4. Support in-tree and out-of-tree autoscaler

It seems both Ant and MSFT use an out-of-tree autoscaler, while Bytedance still uses the in-tree one. The changes on the operator side are:

  • in-tree: automatically append --autoscaling-config=$PATH to the head commands. The operator also needs to convert the custom resource to an autoscaling config, store it in a ConfigMap, and mount it to the head pod.
  • out-of-tree: explicitly append --no-monitor to head start commands.

It would be good to add a switch to cluster.spec to control the logic. Do you have different opinions?

AutoscalingEnabled *bool `json:"autoscalingEnabled,omitempty"`
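The proposed switch could be wired into head command generation roughly like this. A sketch under stated assumptions: the config path and defaulting-to-out-of-tree behavior are illustrative, not an agreed design.

```go
package main

import "fmt"

// headStartArgs builds the head start arguments from the proposed
// AutoscalingEnabled field: in-tree appends --autoscaling-config,
// out-of-tree appends --no-monitor.
func headStartArgs(autoscalingEnabled *bool) []string {
	args := []string{"ray start", "--head"}
	// A nil pointer means the field was omitted; default to out-of-tree here.
	if autoscalingEnabled != nil && *autoscalingEnabled {
		args = append(args, "--autoscaling-config=/etc/ray/autoscaling.yaml")
	} else {
		args = append(args, "--no-monitor")
	}
	return args
}

func main() {
	enabled := true
	fmt.Println(headStartArgs(&enabled))
}
```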
  5. Don't use webhook

Kubebuilder uses webhooks for validation and defaulting. I feel that's a little heavy and not a performance-efficient approach.
I suggest adding basic CRD validation and using defaulters.go instead.

Management

  1. Do we want to manage the ray-operator code at the root level or keep a ray-operator umbrella folder? It looks like ray-contrib aims to hold code from different communities, but currently there are only a few ray-operator implementations.
├── ray-contrib
│   ├── BUILD.bazel
│   ├── Dockerfile
│   ├── LICENSE
│   ├── Makefile
│   ├── README.md
│   ├── api
│   ├── config
│   ├── controllers
│   ├── go.mod
│   ├── go.sum
│   ├── hack
│   ├── main
│   └── main.go

or

└── ray-contrib
    ├── README.md
    └── ray-operator
  2. Code review
    What review style do we want to follow? Should every PR be approved by reviewers from two different companies?

  3. Release
    Even though every company may have a downstream version, we'd like to effectively track and control changes. I suggest following semantic versioning.

  • For each minor version, let's cut a release branch to track bug fixes and patches.
  • Cut a tag for every release
# Branches
release-0.1
release-0.2

# Tags
v0.1.0
v0.1.1
v0.2.0
v0.2.1
v0.2.2

I will prepare some release tools to help manage this easily.

Testing and Tools

  1. Presubmit testing
    MSFT has already enabled GitHub workflows that trigger basic testing against submitted PRs. I think we should keep doing this and add e2e testing to guarantee code quality.

  2. Host container image

This would be super helpful for community members to achieve one-click deployment.
I am not sure if Anyscale can create a robot token for access to Docker Hub. GitHub supports encrypted secrets, and we can use a GitHub Action to retrieve the token and push to the Docker registry.

  3. Bazel build

I notice both Ant Group and MSFT added BAZEL here and here. I feel this is complicated and not the native way to manage Golang dependencies. The Kubernetes community has already removed BAZEL, and I would suggest not using it in the new project.

  4. Kubebuilder 2.0 vs 3.0 and Kubernetes version.

Kubebuilder has already released 3.0; I think it would be good to upgrade to the latest version.
Let's also pin major stack versions for the first release: Kubernetes 1.19.x and Go 1.15.

Please flag any concerns with the above plans and any features that are missing.

[Feature] [backend] Build the Ray and Kubernetes client manager

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

The backend service needs to talk with Kubernetes to CRUD objects. It needs both Kubernetes clients and CRD clients (#29).

We will need to build a client manager abstraction for the upper layers to interact with.

Use case

No response

Related issues

#29 #54

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Support ingress for Ray head service

In the latest API change, a new field has been added to indicate whether the user wants an Ingress object created with the head service.
This is helpful for environments that use an ingress controller to expose external traffic.

When the operator runs reconcileServices, it should check the field's value. By default this should be false; the controller should create an Ingress object only if the value is true.

type HeadGroupSpec struct {
        // EnableIngress indicates whether operator should create ingress object for head service or not.
        EnableIngress *bool `json:"enableIngress,omitempty"`
       .....
}

Proposal: KubeRay a toolkit to run Ray applications on Kubernetes

Background

Infrastructure management and compute orchestration are critical to production Ray users, who would like to scale their applications in an effectively unlimited compute environment with zero code changes. Since Kubernetes has become the de facto container orchestrator for the enterprise, users leverage Kubernetes as a substrate for the execution of distributed Ray programs.

The community provides a Python Ray operator implementation. However, due to some special needs, Ant Group, Microsoft, and Bytedance put in effort to build a Golang-based operator and decouple the autoscaler from the operator itself (see the design for details). All of us are using this solution in our production environments.

Due to historical reasons, we had three folders named after companies in this project. After our collaboration in #17, there's only one ray-operator under ray-contrib, which is a big step for further evolution.

Proposal

To keep reducing maintenance effort and simplifying the user experience, more tools around Kubernetes and the Ray operator should be developed. That means we plan to contribute more tools to this repo. However, we feel the current repo name is not used properly; technically speaking, any Ray project could be added here. We think it might be better to reorganize the Kubernetes-related work into a separate repo that concentrates on the Ray user's experience on Kubernetes.

Besides the ray-operator, some tools we plan to work on, or have already developed downstream, are:

  • Kubectl plugin/CLI to operate CRD objects
  • Kubernetes event dumper for ray clusters/pod/services
  • Operator Integration with Kubernetes node problem detector
  • Kubernetes based workspace to easily submit ray jobs.
  • Prometheus stack integration for monitoring
  • ...
    (credits @chenk008 @caitengwei from AntGroup, @Jeffwan from Bytedance and @akanso from Microsoft)

Maybe we can call it KubeRay, a toolkit consisting of different Kubernetes components, where users can choose a combination based on their Kubernetes environment. I think creating a new repo like ray-project/kuberay is better, and ray-contrib can be used for incubating ideas. I think KubeRay will help attract more people to participate in the community, and it will also help grow Ray's influence in the CNCF/Kubernetes community. Lots of users are moving ML/DL workloads to Kubernetes, and they should try Ray using this solution.

WDYT? Any feedback is welcome!

/cc @chenk008 @caitengwei @akanso @chaomengyuan
/cc @zhe-thoughts @ericl @richardliaw @DmitriGekhtman @yiranwang52

Add helm chart

A Helm chart is a collection of files that describe a related set of Kubernetes resources. It can help users deploy the ray-operator and Ray clusters conveniently.

[Feature] Add build status, coverage status and go report card into the project

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

It would be better to add some badges to the project.


Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] [coreapi] Support Core protobuf message for backend services

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Sub story of #53

We need to define the core protocol buffer messages to represent Cluster, HeadGroup, ComputeTemplate, etc.

A few stories will depend on this one.

Use case

No response

Related issues

#53

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Generic data abstraction on top of CRD

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

In our system, not everyone uses kubectl to operate clusters directly. There are a few major reasons:

  1. The current ray operator is friendly to users who are familiar with the Kubernetes operator pattern, but for most data scientists this approach actually increases their learning curve.

  2. Using kubectl requires a sophisticated permission system. Some Kubernetes clusters don't enable user-level authentication; in my company, we use loose RBAC management, and the corporate SSO system is not integrated with Kubernetes OIDC at all.

For the above reasons, I think it's worth building a generic abstraction on top of the RayCluster CRD. With core API support, we can easily build backend services, a CLI, etc. to bridge to users. Underneath, it still uses Kubernetes to manage the real data.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

/cc @chenk008 @akanso @chaomengyuan

[Feature][backend] Prepare Docker image and kubernetes manifest for service

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

The backend service should be deployable in the same cluster as the ray-operator. We will need to prepare the manifest and Dockerfile for the backend service.

Use case

No response

Related issues

#54

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] [coreapi] Support grpc service and gprc gateway

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Sub story of #53

With the support of #82, we can define gRPC services and a reverse proxy (gateway) for HTTP services.
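
As a sketch of how the two fit together, a service definition could carry a grpc-gateway HTTP annotation like the following (service, method, and route names are assumptions, not the actual KubeRay API):

```proto
syntax = "proto3";
package proto;

import "google/api/annotations.proto";

// Illustrative sketch: the grpc-gateway reverse proxy reads the
// google.api.http option and maps the RPC to an HTTP endpoint.
service ClusterService {
  rpc CreateCluster(CreateClusterRequest) returns (Cluster) {
    option (google.api.http) = {
      post: "/apis/v1alpha2/namespaces/{namespace}/clusters"
      body: "cluster"
    };
  }
}

message CreateClusterRequest {
  string namespace = 1;
  Cluster cluster = 2;
}

message Cluster {
  string name = 1;
  string namespace = 2;
}
```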

Use case

Backend service stubs can be generated automatically.

Related issues

#53

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[planning] 0.2 features collection

Let's have some discussion about the features we want to deliver in the next release.

What I'd like to include:

  1. core proto API definition which can be used to build backend services and CLI tools
  2. A gRPC and proxy service to accept cluster CRUD requests
  3. A kubectl plugin to easily help operate cluster and debug cluster status
  4. #62
  5. #28
  6. #27
  7. Easier kuberay deployment steps. #41 #114
  8. E2E workshop

Failed to install current CRD

There are two minor problems with the current manifests.

  1. metadata.annotations: Too long: must have at most 262144 bytes
  2. CRD violates openAPIV3Schema.

These can be addressed by updating the generation rules and upgrading controller-tools: a newer controller-gen emits the required defaults for list-map keys, and shrinking the generated CRD (or using server-side apply instead of client-side kubectl apply) keeps it under the 262144-byte annotation limit. I created this issue in case someone needs some clues.

➜  ray-operator git:(api_changes) ✗ make install

/Users/jiaxin/go/bin/controller-gen "crd:trivialVersions=true" rbac:roleName=manager-role webhook paths="./..." output:crd:artifacts:config=config/crd/bases
kustomize build config/crd | kubectl apply -f -
The CustomResourceDefinition "rayclusters.ray.io" is invalid:
* metadata.annotations: Too long: must have at most 262144 bytes
* spec.validation.openAPIV3Schema.properties[spec].properties[headGroupSpec].properties[template].properties[spec].properties[containers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property
* spec.validation.openAPIV3Schema.properties[spec].properties[headGroupSpec].properties[template].properties[spec].properties[initContainers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property
* spec.validation.openAPIV3Schema.properties[spec].properties[workerGroupSpecs].items.properties[template].properties[spec].properties[initContainers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property
* spec.validation.openAPIV3Schema.properties[spec].properties[workerGroupSpecs].items.properties[template].properties[spec].properties[containers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property
make: *** [install] Error 1

[Feature] Add/remove instances from an active Ray Cluster

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

With a Ray cluster up and running, it would be nice if we could add more instances to the cluster (or remove inactive instances).

An enhancement to this feature would be to provide a min/max in the spec, and have the Ray cluster automatically allocate/deallocate instances based on the active workload.
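
The enhancement above suggests a worker group spec along these lines; the exact field names may differ from the CRD, so treat this as a sketch:

```yaml
workerGroupSpecs:
  - groupName: small-group
    replicas: 3      # desired instances right now; patch this to add/remove nodes
    minReplicas: 1   # lower bound for automatic scaling
    maxReplicas: 10  # upper bound for automatic scaling
```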

@Jeffwan

Use case

From time to time, we may run into the situation where we didn't allocate enough instances for a Ray cluster and would like more instances without starting over.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Use kubebuilder way to build binary and run test in Github CI

Currently, we are using the following way to build and test code.

https://github.com/ray-project/ray-contrib/blob/f4076b4ec5bfae4cea6d9b66a1ec4e63680ca366/.github/workflows/test-job.yaml#L47-L53

In the ray-operator project, we can switch to make manager and make test instead. The major reason is that they can find more problems for us. One example: manifest generation may fail, which indicates our code is not compatible with controller-tools.
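
Assuming the standard kubebuilder layout, the corresponding GitHub Actions steps could look roughly like this (paths and step names are assumptions):

```yaml
- name: Build operator via kubebuilder target
  run: make manager
  working-directory: ray-operator
- name: Run envtest-based unit tests
  run: make test
  working-directory: ray-operator
```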

Add support for rolling updates when cluster config changes are applied

For example, the user may wish to patch the cluster spec with a new container image. When this happens, ideally the operator would reconcile the new spec with the running pods by spinning up new pods, draining traffic from the old ones, then removing the old pods. This would primarily be useful for long-running clusters (e.g., Ray Serve) that should ideally be upgraded with zero downtime.

cc @Jeffwan

[Feature] Design a restful style backend services to handle cluster operations

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

A follow up issue of #53.

A web service is one of the most convenient ways to interact with users: it doesn't require any domain knowledge. At the same time, a RESTful API doesn't need any dependencies; curl plus a JSON payload is everything a user needs.

Underneath, this backend service should leverage the generated clients to interact with RayCluster custom resources (depends on #29). The chain of component calls would be:

client -> backend service (using the client to create a CR) -> apiserver
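
To illustrate the middle hop of that chain, here is a hedged Go sketch that turns a hypothetical JSON payload into an unstructured RayCluster object that a Kubernetes dynamic client could then submit to the apiserver. The payload shape and the spec fields are assumptions for illustration, not the actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterRequest is a hypothetical JSON payload the REST endpoint could accept.
type clusterRequest struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	Image     string `json:"image"`
	Workers   int    `json:"workers"`
}

// buildRayCluster turns the request into an unstructured RayCluster object.
// Field names under spec are illustrative, not the exact CRD schema.
func buildRayCluster(req clusterRequest) map[string]interface{} {
	return map[string]interface{}{
		"apiVersion": "ray.io/v1alpha1",
		"kind":       "RayCluster",
		"metadata": map[string]interface{}{
			"name":      req.Name,
			"namespace": req.Namespace,
		},
		"spec": map[string]interface{}{
			"headGroupSpec": map[string]interface{}{
				"image": req.Image,
			},
			"workerGroupSpecs": []interface{}{
				map[string]interface{}{
					"groupName": "default-group",
					"replicas":  req.Workers,
					"image":     req.Image,
				},
			},
		},
	}
}

func main() {
	// Decode a curl-style JSON payload and print the object we would create.
	payload := []byte(`{"name":"demo","namespace":"ray","image":"rayproject/ray:latest","workers":2}`)
	var req clusterRequest
	if err := json.Unmarshal(payload, &req); err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(buildRayCluster(req), "", "  ")
	fmt.Println(string(out))
}
```

A real handler would pass this object to a generated or dynamic client instead of printing it.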

Use case

Users who don't have kubectl knowledge or permissions can use the REST API to create/delete Ray clusters on Kubernetes.

Related issues

#53
#29

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Create issue and pull request templates

I've noticed more community members coming to the repo and submitting issues now.

Issue and pull request templates help us provide more guidance for opening issues/PRs while letting contributors supply specific, structured information about their issue/PR. This helps ensure maintainers receive the desired information and reduces back-and-forth communication.

https://www.google.com/search?client=safari&rls=en&q=issue+and+pull+request+templates&ie=UTF-8&oe=UTF-8

/cc @akanso @chenk008 @chaomengyuan

[Feature] Provide hosted operator container image

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Currently, users have to build their own image and push it to a registry in order to use the ray-operator, which is super inconvenient. We should provide a hosted image.

Currently, most upstream Ray images are hosted at https://hub.docker.com/u/rayproject. However, I don't think we can use this because it would require the community to share credentials or build common infrastructure. We can probably register a Docker Hub account or use the GitHub registry instead.

Note: the Docker Hub free tier has a well-known rate-limit issue. We can use it for the short term and look for a sponsor if the quota cannot meet our needs some day.

Use case

Users will have a smooth experience installing the ray-operator: no need to build and manage images anymore.

Related issues

#47

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Bug] Worker group pods stuck at initialization

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen


There are two issues:

  1. Kubernetes actually allows creating resources with names that start with a numeric character. Our name-check logic replaces the first character with "r" for naming conventions.

```go
// cannot start with a numeric value
if unicode.IsDigit(rune(s[0])) {
    s = "r" + s[1:]
}
// cannot start with a punctuation
if unicode.IsPunct(rune(s[0])) {
    fmt.Println(s)
    s = "r" + s[1:]
}
return s
```

  2. For worker pods, the pod name is constructed from the cluster name + role + worker group name, which can exceed 63 characters. The script truncates the name, which is expected, but the $RAY_IP injected into the pod to connect to the head service is not built with the same logic, which leads to workers getting stuck in the initialization phase.

```go
if len(s) > maxLenght {
    // shorten the name
    offset := int(math.Abs(float64(maxLenght) - float64(len(s))))
    fmt.Printf("pod name is too long: len = %v, we will shorten it by offset = %v\n", len(s), offset)
    s = s[offset:]
}
```

The major problem is that we use a UUID as the cluster name, which is too long, but the validation logic also needs better protection against over-long names.
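
One way to avoid the mismatch is to route every generated name, including the head service name used for $RAY_IP, through the same truncation helper. A minimal Go sketch of that idea (function names here are assumptions, not the actual KubeRay code):

```go
package main

import "fmt"

const maxNameLength = 63 // Kubernetes DNS-1123 label limit

// shortenName truncates a generated name to the Kubernetes limit, keeping the
// trailing characters so group/role suffixes stay distinguishable.
func shortenName(s string) string {
	if len(s) > maxNameLength {
		s = s[len(s)-maxNameLength:]
	}
	return s
}

// headServiceName derives the head service name from the cluster name with
// the SAME truncation rule as pod names, so RAY_IP always resolves.
func headServiceName(clusterName string) string {
	return shortenName(clusterName + "-head-svc")
}

func main() {
	long := "0123456789-0123456789-0123456789-0123456789-0123456789-cluster"
	svc := headServiceName(long)
	fmt.Println(svc, len(svc))
}
```

The key point is that both the pod name and the RAY_IP value derive from one helper, so they can never disagree about the truncated cluster name.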

Reproduction script

Create a Ray cluster with a long name and at least one worker group named small-group.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] container image auto-release scripts

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Create a script to automate image release process.

  • for master image: nightly
  • for testing image: ${commit_hash}
  • for release branch image: v0.1.0 etc.

Use case

Currently, we build the image manually, retag it, and then publish it to Docker Hub. This is too much manual work and error-prone.

Related issues

#68
#69

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Ray cluster monitor dashboard

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

A user-friendly dashboard for people not familiar with Ray or Kubernetes to monitor resource usage (CPU/memory/IO, etc.) and the status of the current job (one dynamic Ray cluster per job).

A more interactive UI that users can operate directly, such as starting/pausing/stopping jobs, filtering jobs per submitter, etc.

Use case

Ray Search project: the HQ team members are requesting interactive UIs.

Related issues

No

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Verify operator image building in CI

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Currently we only run make build in CI to verify that the binary builds. This doesn't cover changes to the Dockerfile, etc.
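
A possible CI addition, sketched as a GitHub Actions step (image tag and paths are assumptions about the repo layout):

```yaml
- name: Verify operator image builds
  run: docker build -t kuberay/operator:${{ github.sha }} .
  working-directory: ray-operator
```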

Use case

Catch changes to Dockerfile and container related commands, etc

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Support to generate PersistentVolumeClaim for Pod

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

In some cases we need to mount a PersistentVolume into a pod. For now, we can add volumes that refer to a PersistentVolumeClaim in the PodSpec, but we have to create the PersistentVolumeClaim manually.

I am looking for a way to create the PersistentVolumeClaim automatically.
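
For reference, the kind of PVC the operator could generate from a declared volume might look like this; the name, size, and storage class are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ray-worker-data    # hypothetical name derived from cluster + group
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi       # size would come from a new field in the CRD
  storageClassName: gp2    # e.g. AWS EBS; cluster-specific
```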

Use case

Some Ray jobs need to save files to persistent storage, so we need to mount a disk (e.g. AWS EBS, Aliyun disks) into the pod.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

CI failure: CustomResourceDefinition apiextensions.k8s.io/v1 can not be found

post-submit failed https://github.com/ray-project/ray-contrib/runs/3594181472

It complains that CustomResourceDefinition apiextensions.k8s.io/v1 cannot be found.

Failure [6.262 seconds]
[BeforeSuite] BeforeSuite 
/home/runner/work/ray-contrib/ray-contrib/ray-operator/controllers/suite_test.go:53

  Unexpected error:
      <*meta.NoKindMatchError | 0xc0003951c0>: {
          GroupKind: {
              Group: "apiextensions.k8s.io",
              Kind: "CustomResourceDefinition",
          },
          SearchedVersions: ["v1"],
      }
      no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1"
  occurred

  /home/runner/work/ray-contrib/ray-contrib/ray-operator/controllers/suite_test.go:64
------------------------------

The problem is related to the Kubernetes version: I think we still use an older version that doesn't support apiextensions.k8s.io/v1. The Kubernetes version comes from the envtest binaries in suite_test.go.

https://github.com/ray-project/ray-contrib/blob/7965aab3c31e47cde96b2745489f7d01b8544e5c/.github/workflows/test-job.yaml#L35-L37

The interesting part is that the presubmit job succeeded without any issues. I will keep an eye on the presubmit job logs and check whether it actually worked as expected.
