During cluster initialization in any of the e2e tests, the following error shows up in the logs of the init-job-op-archive
pod. The job is eventually retried, but the retry backoff makes the already very slow tests take even longer.
achulkov2@nebius-yt-dev:~$ kubectl logs yt-scheduler-init-job-op-archive-btqb6 -nquerytrackeraco
++ export YT_DRIVER_CONFIG_PATH=/config/client.yson
++ YT_DRIVER_CONFIG_PATH=/config/client.yson
+++ /usr/bin/ytserver-all --version
+++ head -c4
++ export YTSAURUS_VERSION=23.1
++ YTSAURUS_VERSION=23.1
++ /usr/bin/init_operation_archive --force --latest --proxy http-proxies.querytrackeraco.svc.cluster.local
2024-01-10 19:37:20,124 - INFO - Transforming archive from 48 to 48 version
2024-01-10 19:37:20,134 - INFO - Mounting table //sys/operations_archive/jobs
Traceback (most recent call last):
File "/usr/bin/init_operation_archive", line 749, in <module>
main()
File "/usr/bin/init_operation_archive", line 744, in main
force=args.force,
File "/usr/bin/init_operation_archive", line 731, in run
transform_archive(client, next_version, target_version, force, archive_path, shard_count=shard_count)
File "/usr/bin/init_operation_archive", line 639, in transform_archive
mount_table(client, path)
File "/usr/bin/init_operation_archive", line 55, in mount_table
client.mount_table(path, sync=True)
File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/client_impl_yandex.py", line 1394, in mount_table
freeze=freeze, sync=sync, target_cell_ids=target_cell_ids)
File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/dynamic_table_commands.py", line 524, in mount_table
response = make_request("mount_table", params, client=client)
File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/driver.py", line 126, in make_request
client=client)
File "<decorator-gen-3>", line 2, in make_request
File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/common.py", line 422, in forbidden_inside_job
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_driver.py", line 301, in make_request
client=client)
File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 455, in make_request_with_retries
return RequestRetrier(method=method, url=url, **kwargs).run()
File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/retries.py", line 79, in run
return self.action()
File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 410, in action
_raise_for_status(response, request_info)
File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 290, in _raise_for_status
raise error_exc
yt.common.YtResponseError: Error committing transaction 1-44d-10001-b753
Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361
No healthy tablet cells in bundle "sys"
***** Details:
Received HTTP response with error
origin yt-scheduler-init-job-op-archive-btqb6 on 2024-01-10T19:37:20.203965Z
url http://http-proxies.querytrackeraco.svc.cluster.local/api/v4/mount_table
request_headers {
"User-Agent": "Python wrapper 0.13-dev-5f8638fc66f6e59c7a06708ed508804986a6579f",
"Accept-Encoding": "gzip, identity",
"X-Started-By": "{\"pid\"=17;\"user\"=\"root\";}",
"X-YT-Header-Format": "<format=text>yson",
"Content-Type": "application/x-yt-yson-text",
"X-YT-Correlation-Id": "d71f4e98-4f2880b3-9213c0d0-9a5a9336"
}
response_headers {
"Content-Length": "1242",
"X-YT-Response-Message": "Error committing transaction 1-44d-10001-b753",
"X-YT-Response-Code": "1",
"X-YT-Response-Parameters": {},
"X-YT-Trace-Id": "c0235705-98e9c7a-369cf397-97d28dd7",
"X-YT-Error": "{\"code\":1,\"message\":\"Error committing transaction 1-44d-10001-b753\",\"attributes\":{\"host\":\"hp-0.http-proxies.querytrackeraco.svc.cluster.local\",\"pid\":1,\"tid\":12837479201307132255,\"fid\":18446447647636925386,\"datetime\":\"2024-01-10T19:37:20.202367Z\",\"trace_id\":\"c0235705-98e9c7a-369cf397-97d28dd7\",\"span_id\":1636727892750608515,\"cluster_id\":\"Native(Name=test-ytsaurus)\",\"path\":\"//sys/operations_archive/jobs\"},\"inner_errors\":[{\"code\":1,\"message\":\"Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361\",\"attributes\":{\"host\":\"hp-0.http-proxies.querytrackeraco.svc.cluster.local\",\"pid\":1,\"tid\":12837479201307132255,\"fid\":18446447647636925386,\"datetime\":\"2024-01-10T19:37:20.202206Z\",\"trace_id\":\"c0235705-98e9c7a-369cf397-97d28dd7\",\"span_id\":1636727892750608515},\"inner_errors\":[{\"code\":1,\"message\":\"No healthy tablet cells in bundle \\\"sys\\\"\",\"attributes\":{\"request_id\":\"dc5643d9-124e57a5-cf4b0583-8753d056\",\"connection_id\":\"6b2e13-a3e8b3e0-314a5f40-69069dfd\",\"verification_mode\":\"none\",\"realm_id\":\"65726e65-ad6b7562-10259-79747361\",\"timeout\":30000,\"method\":\"CommitTransaction\",\"address\":\"ms-0.masters.querytrackeraco.svc.cluster.local:9010\",\"encryption_mode\":\"optional\",\"service\":\"TransactionSupervisorService\"}}]}]}",
"X-YT-Request-Id": "93a09617-71caa1ec-cbfe7e46-922f5a1f",
"Content-Type": "application/json",
"Cache-Control": "no-store",
"X-YT-Proxy": "hp-0.http-proxies.querytrackeraco.svc.cluster.local",
"Authorization": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
params {
"suppress_transaction_coordinator_sync": false,
"path": "//sys/operations_archive/jobs",
"freeze": false,
"mutation_id": "124ef88f-86123fd-62afd823-512f2084",
"retry": false
}
transparent True
Error committing transaction 1-44d-10001-b753
origin hp-0.http-proxies.querytrackeraco.svc.cluster.local on 2024-01-10T19:37:20.202367Z (pid 1, tid b227e3515560815f, fid fffef266ed3d2bca)
trace_id c0235705-98e9c7a-369cf397-97d28dd7
span_id 1636727892750608515
cluster_id Native(Name=test-ytsaurus)
path //sys/operations_archive/jobs
Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361
origin hp-0.http-proxies.querytrackeraco.svc.cluster.local on 2024-01-10T19:37:20.202206Z (pid 1, tid b227e3515560815f, fid fffef266ed3d2bca)
trace_id c0235705-98e9c7a-369cf397-97d28dd7
span_id 1636727892750608515
No healthy tablet cells in bundle "sys"
origin yt-scheduler-init-job-op-archive-btqb6 on 2024-01-10T19:37:20.204007Z
request_id dc5643d9-124e57a5-cf4b0583-8753d056
connection_id 6b2e13-a3e8b3e0-314a5f40-69069dfd
verification_mode none
realm_id 65726e65-ad6b7562-10259-79747361
timeout 30000
method CommitTransaction
address ms-0.masters.querytrackeraco.svc.cluster.local:9010
encryption_mode optional
service TransactionSupervisorService
We should wait until the tablet cells in the "sys" bundle are healthy before running the init job.
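As a rough illustration of the proposed wait, here is a minimal polling sketch in Python. It is not the operator's actual code. The function name wait_for_healthy_cells and its parameters are made up for this example, and the health value "good" and the Cypress attributes mentioned in the trailing comment are assumptions about how tablet cell health is exposed.

```python
import time
from typing import Callable, Dict


def wait_for_healthy_cells(
    get_cell_health: Callable[[], Dict[str, str]],
    timeout: float = 300.0,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until at least one cell exists and every cell reports "good".

    get_cell_health is a caller-supplied (hypothetical) callback returning
    a mapping of cell id to health state. Raises TimeoutError once the
    deadline has passed without all cells becoming healthy.
    """
    deadline = clock() + timeout
    while True:
        health = get_cell_health()
        # An empty mapping means no cells have been created yet, so keep waiting.
        if health and all(state == "good" for state in health.values()):
            return
        if clock() >= deadline:
            raise TimeoutError(
                f"tablet cells not healthy after {timeout}s: {health}"
            )
        sleep(poll_interval)


# Hypothetical wiring with the yt Python wrapper. The attribute names
# "health" and "tablet_cell_bundle" are assumptions, not verified here:
#
# def sys_bundle_health(client):
#     cells = client.list(
#         "//sys/tablet_cells",
#         attributes=["health", "tablet_cell_bundle"],
#     )
#     return {
#         str(cell): cell.attributes["health"]
#         for cell in cells
#         if cell.attributes.get("tablet_cell_bundle") == "sys"
#     }
#
# wait_for_healthy_cells(lambda: sys_bundle_health(client))
```

Running such a check before init_operation_archive starts would let the first mount_table call succeed instead of relying on the job's retry backoff.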