

ytsaurus-k8s-operator

YTsaurus is a distributed storage and processing platform for big data with support for MapReduce model, a distributed file system and a NoSQL key-value database.

This operator helps you to deploy YTsaurus using Kubernetes.

Description

The operator is currently available as an alpha version and can deploy a new YTsaurus cluster from scratch, primarily for testing purposes. It can also perform automated cluster upgrades, which involve downtime.

Getting Started

You’ll need a Kubernetes cluster to run against. You can use KIND to get a local cluster for testing, or run against a remote cluster. Note: Your controller will automatically use the current context in your kubeconfig file (i.e. whatever cluster kubectl cluster-info shows).

You can install pre-built versions of the operator via the Helm chart.

Next, prepare the Ytsaurus specification; see the provided samples and the API Reference.

Running on the cluster

  1. Install Instances of Custom Resources:
kubectl apply -f config/samples/cluster_v1_demo.yaml
  2. Build and push your image to the location specified by IMG:
make docker-build docker-push IMG=<some-registry>/ytsaurus-k8s-operator:tag
  3. Deploy the controller to the cluster with the image specified by IMG:
make deploy IMG=<some-registry>/ytsaurus-k8s-operator:tag

Uninstall CRDs

To delete the CRDs from the cluster:

make uninstall

Undeploy controller

Undeploy the controller from the cluster:

make undeploy

Contributing

We are glad to welcome new contributors!

  1. Please read the contributor's guide.
  2. We can accept your contribution to YTsaurus after you have signed the contributor's license agreement (the CLA).
  3. Please don't forget to add a note to your pull request stating that you agree to the terms of the CLA.

How it works

This project aims to follow the Kubernetes Operator pattern.

It uses controllers, which provide a reconcile function responsible for synchronizing resources until the desired state is reached on the cluster.
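A minimal sketch of that pattern using controller-runtime (the type name and the empty body are illustrative, not the operator's actual code):

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// YtsaurusReconciler is a simplified stand-in for the operator's controller.
type YtsaurusReconciler struct {
	client.Client
}

// Reconcile compares the observed state of the resource named in req with the
// desired state described by its spec and is called again until they converge.
func (r *YtsaurusReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the Ytsaurus object for req.NamespacedName.
	// 2. Diff the desired spec against the observed child objects.
	// 3. Create/update child objects, update the status, and requeue if needed.
	return ctrl.Result{}, nil
}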

Test It Out

  1. Install the CRDs into the cluster:
make install
  2. Run your controller (this will run in the foreground, so switch to a new terminal if you want to leave it running):
make run

NOTE: You can also run this in one step by running: make install run

Modifying the API definitions

If you are editing the API definitions, generate the manifests such as CRs or CRDs using:

make manifests

NOTE: Run make --help for more information on all potential make targets

More information can be found via the Kubebuilder Documentation

License

Copyright 2023.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

ytsaurus-k8s-operator's People

Contributors

savnadya, koct9i, l0kix2, zlobober, tinarsky, achulkov2, psushin, qurname2, chizhonkova, gritukan, kruftik, sgburtsev, rebenkoy, krisha11, krock21, amrivkin, kristevalex, e1pp, kaikash, dpoluyanov, kmalov, kozubaeff, andrewlanin, alexvsalexvsalex, verytable, omgronny, sashamelentyev, hdnpth, kontakter, renadeen


ytsaurus-k8s-operator's Issues

QT ACO namespace configuration

We need to set good default settings for the queries ACO namespace that will be created during the init job.

Suggested format:

$ yt get //sys/access_control_object_namespaces/queries/@acl
[
    {
        "action" = "allow";
        "subjects" = [
            "owner";
        ];
        "permissions" = [
            "read";
            "write";
            "administer";
            "remove";
        ];
        "inheritance_mode" = "immediate_descendants_only";
    };
    {
        "action" = "allow";
        "subjects" = [
            "users";
        ];
        "permissions" = [
            "modify_children";
        ];
        "inheritance_mode" = "object_only";
    };
]

$ yt get //sys/access_control_object_namespaces/queries/nobody/@principal_acl
[]

$ yt get //sys/access_control_object_namespaces/queries/everyone/@principal_acl
[
    {
        "action" = "allow";
        "subjects" = [
            "everyone";
        ];
        "permissions" = [
            "read";
            "use";
        ];
        "inheritance_mode" = "object_and_descendants";
    };
]

$ yt get //sys/access_control_object_namespaces/queries/everyone-use/@principal_acl
[
    {
        "action" = "allow";
        "subjects" = [
            "everyone";
        ];
        "permissions" = [
            "use";
        ];
        "inheritance_mode" = "object_and_descendants";
    };
]

Research memory consumption in operator

As shown in #90, operator memory consumption can exceed 256 MiB even for a cluster with hundreds of nodes.
We need to investigate whether there are memory leaks or inefficient memory usage.

One of the options could be to add pprof to the operator and a regular process that dumps samples to a configured PVC.
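A minimal sketch of the pprof part, using the standard net/http/pprof registration (the port and the idea of a sidecar dumping profiles to the PVC are assumptions):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a dedicated port; profiles could then be fetched with
	// `go tool pprof http://<pod>:6060/debug/pprof/heap` or periodically
	// dumped to a mounted PVC by a sidecar process.
	log.Println(http.ListenAndServe(":6060", nil))
}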

Fixed host addresses for masters

Let us recall that each pod has two addresses: the pod address (which looks like ms-2.masters.<namespace>.svc.cluster.local) and the host address, which is the FQDN of the underlying k8s node.

Problem: with hostNetwork = true we expose masters in a very hacky way.

  1. We specify hostNetwork = true for pods, making server processes bind to host network interfaces (and thus available from outside the k8s cluster). Actually, all our server processes bind to wildcard network interfaces, so masters also bind to CNI network interfaces, which allows accessing them via their pod domain names.
  2. We still use pod names in the cluster_connection section of other components, as well as in master configs describing master cells; this works because of the remark in p. 1.
  3. In order for a master to identify itself in the list of peers in the cell, it searches for its FQDN in the list of peers. This does not work well because the FQDN resolves to a host address while the list of peers contains pod addresses, so we use an address resolver hack: we set /address_resolver/localhost_name_override to the master's pod address via env variable substitution. This allows the master to start and work.

This approach has a lot of downsides.

  1. We are mixing pod addresses and host addresses across components; when the k8s overlay network uses IPv4 and the host network uses IPv6, this requires setting up dual-stack address resolution in all components, which is in general a bad practice.
  2. We are relying on KubeDNS for pod addresses, which has a number of disadvantages (the most notorious being the c-ares DNS spoofing issue, which sometimes breaks DNS resolution entirely).
  3. Masters register under pod addresses in //sys/primary_masters, which makes it hard to find their host addresses from outside the cluster (still possible via annotations, but not convenient at all). This complicates integration with outside tools such as pull-based monitoring.
  4. Finally, the incorrect master addresses get into the cluster connection, which makes it impossible to connect components outside of the cluster using the //sys/@cluster_connection config.

The proposal is to switch to fixing the k8s nodes for masters a priori by their FQDNs. If the host FQDNs of these nodes are known, the user must specify them as follows:

masterHostAddresses:
- 4932:
  - host1.external.address
  - host2.external.address
  - host3.external.address

Here 4932 is the cell tag of a master cell (for potential extension to the multicell case).

This option should be specified if and only if hostNetwork: true holds. In that case, three changes in logic must apply.

  1. Specified host addresses must be put in master configs.
  2. Specified host addresses must be put in cluster connection configs of all components.
  3. The master StatefulSet must get an additional affinity constraint of the form externalHostnameLabel IN (host1.external.address, host2.external.address, host3.external.address), where externalHostnameLabel is kubernetes.io/hostname by default but may be overridden when the value of kubernetes.io/hostname does not resolve externally and some other label contains the actual external addresses of the nodes (see the sketch below).

Also, the address_resolver hack must be dropped unconditionally. It is ugly.
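A hedged sketch of the affinity constraint from item 3, built with standard k8s core/v1 types (the helper name and the surrounding operator wiring are assumptions; the label key and host list come from the example above):

package main

import corev1 "k8s.io/api/core/v1"

// masterNodeAffinity pins master pods to the explicitly listed k8s nodes.
func masterNodeAffinity(hostnameLabel string, hosts []string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      hostnameLabel, // kubernetes.io/hostname by default
						Operator: corev1.NodeSelectorOpIn,
						Values:   hosts, // host1.external.address, ...
					}},
				}},
			},
		},
	}
}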

[Feature] Add ability to update Ytsaurus spec

I would like to propose an enhancement to the Ytsaurus CRD by adding advanced configuration options. These options will enable more granular control over Ytsaurus specifications and cater to varying use cases. The features I am proposing to add are:

  • Update Core Image: Introduce an ability to update the coreImage field under the spec to change the core image of Ytsaurus.
  • Add Instance specs: Allow users to add new instance specifications dynamically.
  • Edit Instance specs: Allow users to edit existing instance specifications. This includes:
    • Instance Count
    • Resources
    • Affinity
    • Volumes, VolumeMounts, Locations
    • Custom labels?
  • Remove Instance specs: Implement an option to remove instance specifications. Special care should be taken while removing data node and master specifications to prevent data loss or inconsistency.

Operation archive init job should wait for sys bundle to be healthy

When looking at cluster initialization during any of the e2e tests, one can see the following errors in the init-job-op-archive pod. They get retried eventually, but the backoff duration increases the length of the already very slow tests.

achulkov2@nebius-yt-dev:~$ kubectl logs yt-scheduler-init-job-op-archive-btqb6  -nquerytrackeraco
++ export YT_DRIVER_CONFIG_PATH=/config/client.yson
++ YT_DRIVER_CONFIG_PATH=/config/client.yson
+++ /usr/bin/ytserver-all --version
+++ head -c4
++ export YTSAURUS_VERSION=23.1
++ YTSAURUS_VERSION=23.1
++ /usr/bin/init_operation_archive --force --latest --proxy http-proxies.querytrackeraco.svc.cluster.local
2024-01-10 19:37:20,124 - INFO - Transforming archive from 48 to 48 version
2024-01-10 19:37:20,134 - INFO - Mounting table //sys/operations_archive/jobs
Traceback (most recent call last):
  File "/usr/bin/init_operation_archive", line 749, in <module>
    main()
  File "/usr/bin/init_operation_archive", line 744, in main
    force=args.force,
  File "/usr/bin/init_operation_archive", line 731, in run
    transform_archive(client, next_version, target_version, force, archive_path, shard_count=shard_count)
  File "/usr/bin/init_operation_archive", line 639, in transform_archive
    mount_table(client, path)
  File "/usr/bin/init_operation_archive", line 55, in mount_table
    client.mount_table(path, sync=True)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/client_impl_yandex.py", line 1394, in mount_table
    freeze=freeze, sync=sync, target_cell_ids=target_cell_ids)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/dynamic_table_commands.py", line 524, in mount_table
    response = make_request("mount_table", params, client=client)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/driver.py", line 126, in make_request
    client=client)
  File "<decorator-gen-3>", line 2, in make_request
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/common.py", line 422, in forbidden_inside_job
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_driver.py", line 301, in make_request
    client=client)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 455, in make_request_with_retries
    return RequestRetrier(method=method, url=url, **kwargs).run()
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/retries.py", line 79, in run
    return self.action()
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 410, in action
    _raise_for_status(response, request_info)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 290, in _raise_for_status
    raise error_exc
yt.common.YtResponseError: Error committing transaction 1-44d-10001-b753
    Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361
        No healthy tablet cells in bundle "sys"

***** Details:
Received HTTP response with error    
    origin          yt-scheduler-init-job-op-archive-btqb6 on 2024-01-10T19:37:20.203965Z    
    url             http://http-proxies.querytrackeraco.svc.cluster.local/api/v4/mount_table    
    request_headers {
                      "User-Agent": "Python wrapper 0.13-dev-5f8638fc66f6e59c7a06708ed508804986a6579f",
                      "Accept-Encoding": "gzip, identity",
                      "X-Started-By": "{\"pid\"=17;\"user\"=\"root\";}",
                      "X-YT-Header-Format": "<format=text>yson",
                      "Content-Type": "application/x-yt-yson-text",
                      "X-YT-Correlation-Id": "d71f4e98-4f2880b3-9213c0d0-9a5a9336"
                    }    
    response_headers {
                      "Content-Length": "1242",
                      "X-YT-Response-Message": "Error committing transaction 1-44d-10001-b753",
                      "X-YT-Response-Code": "1",
                      "X-YT-Response-Parameters": {},
                      "X-YT-Trace-Id": "c0235705-98e9c7a-369cf397-97d28dd7",
                      "X-YT-Error": "{\"code\":1,\"message\":\"Error committing transaction 1-44d-10001-b753\",\"attributes\":{\"host\":\"hp-0.http-proxies.querytrackeraco.svc.cluster.local\",\"pid\":1,\"tid\":12837479201307132255,\"fid\":18446447647636925386,\"datetime\":\"2024-01-10T19:37:20.202367Z\",\"trace_id\":\"c0235705-98e9c7a-369cf397-97d28dd7\",\"span_id\":1636727892750608515,\"cluster_id\":\"Native(Name=test-ytsaurus)\",\"path\":\"//sys/operations_archive/jobs\"},\"inner_errors\":[{\"code\":1,\"message\":\"Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361\",\"attributes\":{\"host\":\"hp-0.http-proxies.querytrackeraco.svc.cluster.local\",\"pid\":1,\"tid\":12837479201307132255,\"fid\":18446447647636925386,\"datetime\":\"2024-01-10T19:37:20.202206Z\",\"trace_id\":\"c0235705-98e9c7a-369cf397-97d28dd7\",\"span_id\":1636727892750608515},\"inner_errors\":[{\"code\":1,\"message\":\"No healthy tablet cells in bundle \\\"sys\\\"\",\"attributes\":{\"request_id\":\"dc5643d9-124e57a5-cf4b0583-8753d056\",\"connection_id\":\"6b2e13-a3e8b3e0-314a5f40-69069dfd\",\"verification_mode\":\"none\",\"realm_id\":\"65726e65-ad6b7562-10259-79747361\",\"timeout\":30000,\"method\":\"CommitTransaction\",\"address\":\"ms-0.masters.querytrackeraco.svc.cluster.local:9010\",\"encryption_mode\":\"optional\",\"service\":\"TransactionSupervisorService\"}}]}]}",
                      "X-YT-Request-Id": "93a09617-71caa1ec-cbfe7e46-922f5a1f",
                      "Content-Type": "application/json",
                      "Cache-Control": "no-store",
                      "X-YT-Proxy": "hp-0.http-proxies.querytrackeraco.svc.cluster.local",
                      "Authorization": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
                    }    
    params          {
                      "suppress_transaction_coordinator_sync": false,
                      "path": "//sys/operations_archive/jobs",
                      "freeze": false,
                      "mutation_id": "124ef88f-86123fd-62afd823-512f2084",
                      "retry": false
                    }    
    transparent     True
Error committing transaction 1-44d-10001-b753    
    origin          hp-0.http-proxies.querytrackeraco.svc.cluster.local on 2024-01-10T19:37:20.202367Z (pid 1, tid b227e3515560815f, fid fffef266ed3d2bca)    
    trace_id        c0235705-98e9c7a-369cf397-97d28dd7    
    span_id         1636727892750608515    
    cluster_id      Native(Name=test-ytsaurus)    
    path            //sys/operations_archive/jobs
Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361    
    origin          hp-0.http-proxies.querytrackeraco.svc.cluster.local on 2024-01-10T19:37:20.202206Z (pid 1, tid b227e3515560815f, fid fffef266ed3d2bca)    
    trace_id        c0235705-98e9c7a-369cf397-97d28dd7    
    span_id         1636727892750608515
No healthy tablet cells in bundle "sys"    
    origin          yt-scheduler-init-job-op-archive-btqb6 on 2024-01-10T19:37:20.204007Z    
    request_id      dc5643d9-124e57a5-cf4b0583-8753d056    
    connection_id   6b2e13-a3e8b3e0-314a5f40-69069dfd    
    verification_mode none    
    realm_id        65726e65-ad6b7562-10259-79747361    
    timeout         30000    
    method          CommitTransaction    
    address         ms-0.masters.querytrackeraco.svc.cluster.local:9010    
    encryption_mode optional    
    service         TransactionSupervisorService

We should wait for the tablet cells to be healthy before running the init job.
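A hedged sketch of such a wait, polling the bundle's @health attribute before the init job is allowed to run (the client interface here is a hypothetical stand-in, not the operator's real YT client):

package main

import (
	"context"
	"fmt"
	"time"
)

// cypressClient is a hypothetical minimal interface over a YT client.
type cypressClient interface {
	GetNode(ctx context.Context, path string, result any) error
}

// waitForBundleHealth blocks until //sys/tablet_cell_bundles/<bundle>/@health
// reports "good", or the context is cancelled.
func waitForBundleHealth(ctx context.Context, yc cypressClient, bundle string, interval time.Duration) error {
	path := fmt.Sprintf("//sys/tablet_cell_bundles/%s/@health", bundle)
	for {
		var health string
		if err := yc.GetNode(ctx, path, &health); err == nil && health == "good" {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}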

Strawberry controller fails to start with an error of missing `jupyt` path

An error in strawberry controller logs:

panic: call a4d6e19-3ee832b4-a54d258b-718c84e7 failed: error resolving path //sys/strawberry/jupyt: node //sys/strawberry has no child with key "jupyt"

goroutine 1 [running]:
go.ytsaurus.tech/yt/chyt/controller/internal/agent.(*Agent).Start(0xc0003b6600)
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/internal/agent/agent.go:355 +0x3b0
go.ytsaurus.tech/yt/chyt/controller/internal/app.(*App).Run(0xc00058fc78, 0xc0004b80c0)
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/internal/app/app.go:187 +0x555
main.doRun()
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/run.go:59 +0x1d5
main.wrapRun.func1(0x10f6ce0?, {0xb60355?, 0x2?, 0x2?})
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/main.go:30 +0x1d
github.com/spf13/cobra.(*Command).execute(0x10f6ce0, {0xc000356660, 0x2, 0x2})
	/actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:944 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0x11004c0)
	/actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:1068 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	/actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:992
main.main()
	/actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/main.go:38 +0x25

Seems like we just need to add "jupyt" here: https://github.com/ytsaurus/yt-k8s-operator/blob/main/pkg/ytconfig/chyt.go#L48

Implement some signal from the operator that will mean "all good" in terms of operator

Currently we have the Running cluster status, but it doesn't mean that the operator has finished all reconciliations.
For example, the cluster stays Running if a full update couldn't start (e.g. if EnableFullUpdate is not set), and there should be a way for a human and/or monitoring to detect such situations without checking the logs.

Maybe we should just use a status other than Running when the operator is blocked, and that would be enough.

  • There is a separate related issue about having a signal that the operator has fully applied the spec: currently the status can be Running when only a minimal number of pods is created, but I consider that an independent issue. Though it may also be resolved with some extra status meaning "cluster is usable, but not everything is running".
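One hedged option is to expose this as a dedicated status condition rather than overloading Running; a sketch using the standard apimachinery helpers (the condition type and the helper function are made up for illustration):

package main

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markFullyReconciled records whether the operator considers the cluster fully
// reconciled; conditions would live in the Ytsaurus status and be readable by
// humans and monitoring alike.
func markFullyReconciled(conditions *[]metav1.Condition, ok bool, reason, msg string) {
	status := metav1.ConditionTrue
	if !ok {
		status = metav1.ConditionFalse // e.g. blocked waiting for EnableFullUpdate
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "FullyReconciled", // hypothetical condition name
		Status:  status,
		Reason:  reason,
		Message: msg,
	})
}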

Automatically allow everyone to use ch_public CHYT clique

Currently, ch_public clique is created with default access control (//sys/access_control_object_namespaces/chyt/ch_public)

"principal_acl": [
        {
            "action": "allow",
            "inheritance_mode": "object_and_descendants",
            "permissions": [
                "use"
            ],
            "subjects": [
                "chyt_releaser"
            ]
        },
        {
            "action": "allow",
            "inheritance_mode": "object_and_descendants",
            "permissions": [
                "read",
                "remove",
                "manage"
            ],
            "subjects": [
                "chyt_releaser"
            ]
        }
    ],

And it requires manual actions (and superuser rights) to allow everyone to use it.
Can we add an "allow everyone to use" ACL entry automatically during the ch_public init job?

Init job script is not recreated with operator update

I've updated the operator to a version containing commit 7ff25c0
and expected that the update-op-archive job would have the new updated script, but it turns out the script is the same as it was before the update.

I've found that the init script is not set directly in the job but in a config map, which, it seems, is not being recreated.

kubectl get cm op-archive-yt-scheduler-init-job-config -oyaml | grep -A10 init-cluster.sh
  init-cluster.sh: |2-

    set -e
    set -x

    export YT_DRIVER_CONFIG_PATH=/config/client.yson
    /usr/bin/init_operation_archive --force --latest --proxy http-proxies.xxx.svc.cluster.local
    /usr/bin/yt set //sys/cluster_nodes/@config '{"%true" = {job_agent={enable_job_reporter=%true}}}'
kind: ConfigMap
metadata:
  creationTimestamp: "2023-11-09T18:44:06Z"

Remote node CRD

We'd like to have the ability to describe a group of nodes (data/exec/tablet) as a separate CRD, providing the endpoints of a remote cluster. Such a group of nodes (called remote nodes) will work the same as if they were described within the Ytsaurus CRD, but they can belong to a different k8s cluster, which may be useful.

I'd suggest having a CRD RemoteNodeGroup, which looks exactly the same as dataNodes/execNodes/tabletNodes (and, ideally, reuses as much of existing code as possible).

The main difference is that there should be a field "remoteClusterSpec" which must be a reference to a resource (CRD) of type RemoteYtsaurusSpec.

RemoteYtsaurusSpec must describe the information necessary for connecting a node to a remote cluster. It must be a structure, currently containing one field:

MasterAddresses: []string

But in the future it will probably be extended by various other fields.
When ytsaurus/ytsaurus#248 completes:

RemoteClusterUrl: string // allows taking the cluster connection via an HTTP call to RemoteClusterUrl + "/cluster_connection".

When authentication in native protocol is employed:

Credentials: SomeKindOfCredentials
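A hedged Go sketch of what the proposed types might look like (field names follow the issue text; everything else is illustrative):

package v1

import corev1 "k8s.io/api/core/v1"

// RemoteYtsaurusSpec describes how remote nodes connect to an existing cluster.
type RemoteYtsaurusSpec struct {
	MasterAddresses []string `json:"masterAddresses"`
	// Possible future fields, per the issue:
	// RemoteClusterUrl string
	// Credentials      <some kind of credentials>
}

// RemoteNodeGroupSpec mirrors the dataNodes/execNodes/tabletNodes specs and adds
// a reference to the RemoteYtsaurus resource describing the target cluster.
type RemoteNodeGroupSpec struct {
	RemoteClusterSpec corev1.LocalObjectReference `json:"remoteClusterSpec"`
	InstanceCount     int32                       `json:"instanceCount"`
	// ... resources, volumes, locations, etc., reused from the existing node specs
}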

QT ACOs are not created during operator update

After updating the operator 0.5.0 -> 0.6.0, the cluster to 23.2, and QT to 0.0.5, this error appears when running QT queries:

Access control object "nobody" does not exist

This is because the necessary ACOs are created only when QT is initialized, not when it is updated:
https://github.com/ytsaurus/yt-k8s-operator/blob/main/pkg/components/query_tracker.go#L193
We need to run this job on every update.

As a temporary solution, they can be created manually:

yt create access_control_object_namespace --attr '{name=queries}'
yt create access_control_object --attr '{namespace=queries;name=nobody}'

Support rolling update for masters

We want masters to update via leader switches, with minimal downtime.

The implementation can look like this:

  • we add some flag/enum (at the Ytsaurus level or the master level) to enable the new rolling strategy for masters instead of a full update
  • we call yt_execute with switch_leader to ms-0 and wait for everything to become healthy
  • we set the masters' sts updateStrategy = RollingUpdate with .spec.updateStrategy.rollingUpdate.partition = len(masters)-1, apply the sts update and wait until ms-2 is updated and has caught up (see the sketch below)
  • (if possible) build snapshots and compare the md5 of all snapshots (research convergence_check)
  • proceed with .spec.updateStrategy.rollingUpdate.partition = len(masters)-2 and so on until only ms-0 is left un-updated
  • call switch_leader to len(masters)-1
  • remove updateStrategy.rollingUpdate / sync and wait until ms-0 is updated
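A hedged sketch of the partition bookkeeping from the RollingUpdate step above, using standard apps/v1 types (the helper name is hypothetical):

package main

import appsv1 "k8s.io/api/apps/v1"

// setMasterPartition configures the masters' StatefulSet so that only pods with
// ordinal >= partition are updated, letting masters be rolled one by one from
// the highest ordinal down to ms-0.
func setMasterPartition(sts *appsv1.StatefulSet, partition int32) {
	sts.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.RollingUpdateStatefulSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
			Partition: &partition,
		},
	}
}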

Strawberry not starting

Operator version: af42b5642038e9d0b2ee6746a96403a16c78e845

Strawberry version: 0.0.5 e6ba5a781e10717a4ce2611c4c623ed1087137e1e359f2b2af1391dd7bfc04aa

Error:

panic: call 718cac8d-86e46ba7-354f650c-9db060c9 failed: error resolving path //sys/strawberry/chyt: node //sys/strawberry has no child with key "chyt"

goroutine 1 [running]:
go.ytsaurus.tech/yt/chyt/controller/internal/agent.(*Agent).Start(0xc000000600)
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/internal/agent/agent.go:355 +0x390
go.ytsaurus.tech/yt/chyt/controller/internal/app.(*App).Run(0xc0002d5c78)
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/internal/app/app.go:186 +0x466
main.doRun()
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/run.go:58 +0x1ab
main.wrapRun.func1(0x10bd260?, {0xb3de4a?, 0x2?, 0x2?})
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/main.go:30 +0x1d
github.com/spf13/cobra.(*Command).execute(0x10bd260, {0xc0001a0640, 0x2, 0x2})
        /actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:920 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0x10c6780)
        /actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:1044 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /actions-runner/_work/ytsaurus/go/pkg/mod/github.com/spf13/[email protected]/command.go:968
main.main()
        /actions-runner/_work/ytsaurus/ytsaurus/ytsaurus/yt/chyt/controller/cmd/chyt-controller/main.go:38 +0x25

Ytsaurus spec:

 strawberry:
    image: ytsaurus/strawberry:0.0.5
    resources:
      requests:
        cpu: 4
        memory: 8Gi
      limits:
        cpu: 8
        memory: 12Gi

There is a //sys/strawberry directory in Cypress, but there is nothing in it.

Setting nodePort

We would like to set the node port for all components via the configuration file.
For example:

httpProxies:
  - serviceType: NodePort
    instanceCount: 3
    nodePort: 31860

When I apply this config, I get an error:
strict decoding error: unknown field "spec.httpProxies[0].nodePort", unknown field "spec.rpcProxies[0].nodePort"
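A hedged sketch of what supporting this could look like: a hypothetical NodePort field in the proxy spec, mapped onto the standard Service port (the spec type and field names are illustrative, not the current API):

package main

import corev1 "k8s.io/api/core/v1"

// HTTPProxySpec is a hypothetical fragment of the proxy spec with the new field.
type HTTPProxySpec struct {
	ServiceType corev1.ServiceType
	NodePort    *int32 // only meaningful when ServiceType is NodePort
}

// buildServicePort maps the spec onto the Service port definition.
func buildServicePort(spec HTTPProxySpec) corev1.ServicePort {
	port := corev1.ServicePort{Name: "http", Port: 80}
	if spec.ServiceType == corev1.ServiceTypeNodePort && spec.NodePort != nil {
		port.NodePort = *spec.NodePort // e.g. 31860 from the example above
	}
	return port
}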

Replace EnableFullUpdate field with something better.

Currently we have the EnableFullUpdate field in the main Ytsaurus CRD; if it is set to true, the operator considers that it may recreate all the pods and fully update the YT cluster.
The idea is that a full update should be controlled by a human, but the problem is that after a deploy we often forget to set EnableFullUpdate back to false.

It would be better to replace it with something that can be changed back by the operator itself after a full update has been triggered and approved by a human.

Ideas are appreciated.

Incorrect jupyter FQDN breaks SPYT examples

Details:

Helm operator: 0.4.1
Cluster config: https://github.com/ytsaurus/yt-k8s-operator/blob/main/config/samples/0.4.0/cluster_v1_demo.yaml
Jupyter config: https://github.com/ytsaurus/yt-k8s-operator/tree/main/config/samples/jupyter

Steps to reproduce:

  1. (additional step) Change CHYT to strawberry
  2. Setup YTsaurus cluster
  3. Update SPYT library to match cluster: pip install -U ytsaurus-spyt==1.72.0 --user and restart kernel
  4. Open SPYT examples.ipynb in jupyter
  5. Change table path to any existing table in YTsaurus
  6. Run cells down to reading table

Expected behavior:

Table is successfully read by SPYT and displayed.

Current behavior:

SPYT execution nodes can't access the driver. SPYT returns WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.

Reason:

The Spark executor gets an incorrect FQDN for Jupyter: --driver-url spark://CoarseGrainedScheduler@jupyterlab-0.jupyter-headless.default.svc.cluster.local:27001. This FQDN is the output of hostname, but the k8s cluster doesn't have such an FQDN. See the stderr log from the executor. The same behavior occurs on each executor node.

I also made a clean setup of Ubuntu with ytsaurus-spyt. In that case the Spark executor also gets the hostname output, i.e. plain ubuntu for the default image.

Proposal:

  1. Provide a way to pass a custom driver URL to Spark executors. This would allow bypassing the hostname lookup entirely.
  2. Fix the Jupyter configuration so that SPYT works out of the box.
  3. Add information about this potential hostname confusion to the docs.

Thank you!

Operator should react on all out of sync k8s components

Currently the operator detects diffs only in YSON configs and images, but if we deploy a new version of the operator that wants to add something to the pod specs, this won't be applied.

It is expected that the operator compares the entire desired state with the state of the world and tries to bring the components to the desired state. The changes the operator makes should lead to the minimum possible downtime.

Ytop-chart-controller crashes

The pod ytsaurus-ytop-chart-controller-manager is in the infinite CrashLoopBackOff state.

The used configuration:

  • ytop-chart v 0.4.0
  • coreImage: ytsaurus/ytsaurus:stable-23.1.0-relwithdebinfo

The log is attached.
chart_fail.log

Refactor operator reconciliation flow

Currently we have an implementation where the Ytsaurus cluster update flow is hard to understand because it is rather complex and scattered across the codebase. As a consequence, it is error-prone and hard to maintain and operate on real clusters.

We want to simplify things by implementing an approach where components are brought to the desired state one after another in a strict linear sequence. I.e., each reconciliation loop will go through that sequence (well known and explicitly described in a single place in the codebase), find the first component which is NOT in the desired state and execute one action which should bring that component closer to the desired state.

We expect that such a flow would be easier to understand and would also make it easy to troubleshoot which stage of the update the operator and the cluster are currently in.

As a downside of this approach, we expect that a linear update of components would be slower, which might be something to improve in the future.
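A minimal sketch of the described linear flow (the component interface and its methods are illustrative):

package main

import "context"

// component is a hypothetical interface implemented by every cluster component.
type component interface {
	Name() string
	IsInDesiredState(ctx context.Context) (bool, error)
	// DoOneStep performs a single action that moves the component closer to the
	// desired state (create a config map, restart pods, run an init job, ...).
	DoOneStep(ctx context.Context) error
}

// syncOnce walks the well-known ordered sequence, finds the first component
// that is out of sync and performs exactly one action on it.
func syncOnce(ctx context.Context, ordered []component) error {
	for _, c := range ordered {
		ok, err := c.IsInDesiredState(ctx)
		if err != nil {
			return err
		}
		if !ok {
			return c.DoOneStep(ctx)
		}
	}
	return nil // everything is already in the desired state
}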

Job images must be in sync with the components they update

Not sure if it is a problem for all the components, but it is a bug for the master exit-read-only job: if you run the master on a 23.1 ytsaurus image and the master exit-read-only job on a 23.2 ytsaurus image, the job hangs because there is no such command in 23.1.

Maybe all nil instance spec images should be set to the core image in the defaulting webhook, and then we can use instanceSpec.Image in both the job and the server.
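A hedged sketch of that defaulting idea (the types are hypothetical stand-ins for the real spec types):

package main

// InstanceSpec is a hypothetical fragment of a component spec.
type InstanceSpec struct {
	Image *string
}

// defaultImages is what the defaulting webhook could do: any instance spec
// without an explicit image inherits the cluster's core image, so init/update
// jobs and servers built from the same spec always run the same image.
func defaultImages(coreImage string, specs []*InstanceSpec) {
	for _, s := range specs {
		if s.Image == nil {
			img := coreImage
			s.Image = &img
		}
	}
}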
