mozilla-releng / k8s-autoscale
Autoscale scriptworkers in GCP
License: Mozilla Public License 2.0
From time to time we hit an issue where a scriptworker task gets killed. When k8s-autoscale periodically scans each pool, it counts pending tasks and running workers, and if it thinks there are too many running workers it tells k8s to stop them. scriptworker gets SIGUSR1, which tells it to stop after the current task. However, if it is not done after terminationGracePeriodSeconds (currently 20 minutes, except for treescript, where it is 1 hour), it gets SIGTERM and terminates the running task, which then has to be rerun for no good reason.
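To make the "stop after the current task" behaviour concrete, here is a minimal sketch of that pattern in Python. It is not scriptworker's actual code: the GracefulWorker class, its method names, and the on_task hook are hypothetical and exist only for illustration. The point is that the signal handler only sets a flag, and the loop checks the flag between tasks, so a task that is already running is never interrupted by SIGUSR1 itself.

```python
import signal


class GracefulWorker:
    """Hypothetical worker loop that finishes its current task before exiting."""

    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.stop_requested = False
        self.completed = []

    def request_stop(self, signum=None, frame=None):
        # Signal handlers receive (signum, frame); only set a flag here.
        self.stop_requested = True

    def install(self):
        # SIGUSR1 only exists on Unix-like systems.
        signal.signal(signal.SIGUSR1, self.request_stop)

    def run(self, on_task=None):
        for task in self.tasks:
            # The flag is checked between tasks, never in the middle of one.
            if self.stop_requested:
                break
            if on_task is not None:
                on_task(self, task)  # stand-in for doing the real work
            self.completed.append(task)
        return self.completed
```

In this sketch a stop request that arrives during a task still lets that task complete. The problem described above starts only when the task outlasts the grace period and SIGTERM follows.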
Right now we submit PATCH requests even when the desired number of replicas is the same as the number of running replicas. The request does nothing, but it would be better to skip the API call entirely.
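A minimal sketch of the proposed guard, assuming the scaling call can be passed in as a function. The names scale_if_needed and patch_scale are hypothetical; patch_scale stands in for whatever the project uses to send the PATCH request.

```python
def scale_if_needed(running, desired, patch_scale):
    """Call patch_scale(desired) only when the replica count would change.

    Returns True if a request was sent, False if it was skipped.
    """
    if desired == running:
        # Nothing to change, so avoid the API round trip entirely.
        return False
    patch_scale(desired)
    return True
```

Passing the request function in as a parameter keeps the comparison easy to test without a cluster.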
This Mozilla repository has been identified as lacking a LICENSE.md file. This repository does have licensing information in the README.md file. To make it easier for users (and scanning tools) to find licensing information please add a LICENSE.md file with that information to the root directory of the project.
Mozilla staff can access more information in our Software Licensing Runbook – search for “Licensing Runbook” in Confluence to find it.
If you have any questions you can contact Daniel Nazer who can be reached at dnazer on Mozilla email or Slack.
READMELIC-2023-01
from #4 (review)
At the moment we use a hardcoded period to sleep between polls. It would be better to use async and a separate poll_interval (already in the configs) per worker type.
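One way this could look with asyncio is sketched below. Only poll_interval comes from the description above; the function names, the "worker_type" config key, the handle callback, and the iterations parameter (added so the loop can be bounded in tests) are assumptions for illustration.

```python
import asyncio


async def poll_worker_type(worker_type, poll_interval, handle, iterations=None):
    """Poll one worker type on its own schedule.

    iterations=None means run forever; a number bounds the loop.
    """
    count = 0
    while iterations is None or count < iterations:
        await handle(worker_type)
        count += 1
        # Sleeping here yields to the other worker types' loops.
        await asyncio.sleep(poll_interval)


async def run_all(worker_types, handle, iterations=None):
    """Run one independent polling loop per configured worker type."""
    await asyncio.gather(
        *(
            poll_worker_type(wt["worker_type"], wt["poll_interval"], handle, iterations)
            for wt in worker_types
        )
    )
```

Each worker type sleeps on its own interval, so a slow or rarely polled pool does not delay the others.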
It would be great to check the configs against some schema, so we don't make silly mistakes.
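As a rough illustration of such a check, the sketch below validates one worker-type entry against a hand-written schema using only the standard library. The key names are taken from the log fields and poll_interval mentioned elsewhere on this page. Whether the real config uses exactly these keys and types is an assumption, and a dedicated schema library could replace this hand-rolled check.

```python
# Assumed shape of one worker-type entry; adjust to the real config.
SCHEMA = {
    "worker_type": str,
    "provisioner": str,
    "deployment_name": str,
    "deployment_namespace": str,
    "min_replicas": int,
    "poll_interval": int,
}


def validate_worker_config(config, schema=SCHEMA):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key, expected in schema.items():
        if key not in config:
            errors.append(f"missing key: {key}")
        elif isinstance(config[key], bool) or not isinstance(config[key], expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"{key}: expected {expected.__name__}")
    for key in config:
        if key not in schema:
            # Unknown keys are often typos of real ones.
            errors.append(f"unknown key: {key}")
    return errors
```

Running this at startup and refusing to start on a non-empty result would surface typos before the first poll.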
The last few log messages were:
{
"insertId": "n1eotk8hjonkexfb2",
"jsonPayload": {
"Type": "k8s_autoscale.main",
"Fields": {
"min_replicas": 0,
"provisioner": "scriptworker-k8s",
"deployment_name": "bouncer-prod-relengworker-firefoxci-comm-3-1",
"deployment_namespace": "prod-bouncer",
"msg": "Handling worker type. Getting the number of running replicas...",
"worker_type": "comm-3-bouncer"
},
"Hostname": "k8s-autoscale-prod-relengworker-app-1-8566fb748b-wskc2",
"EnvVersion": "2.0",
"Timestamp": 1600935067087724500,
"Pid": 1,
"Severity": 6,
"Logger": "Dockerflow"
},
"resource": {
"type": "k8s_container",
"labels": {
"cluster_name": "relengworker-prod-v1",
"project_id": "moz-fx-relengworker-prod-a67d",
"container_name": "k8s-autoscale",
"location": "us-west1",
"namespace_name": "prod-k8s-autoscale",
"pod_name": "k8s-autoscale-prod-relengworker-app-1-8566fb748b-wskc2"
}
},
"timestamp": "2020-09-24T08:11:07.087987874Z",
"severity": "INFO",
"labels": {
"k8s-pod/pod-template-hash": "8566fb748b",
"k8s-pod/app_kubernetes_io/managed-by": "jenkins",
"k8s-pod/app_kubernetes_io/part-of": "k8s-autoscale",
"k8s-pod/fullname": "k8s-autoscale-prod-relengworker-app-1",
"k8s-pod/jenkins-build-id": "1373",
"k8s-pod/app_kubernetes_io/name": "k8s-autoscale",
"k8s-pod/app_kubernetes_io/version": "1.0.0",
"k8s-pod/app_kubernetes_io/instance": "prod",
"k8s-pod/app_kubernetes_io/component": "scriptworker"
},
"logName": "projects/moz-fx-relengworker-prod-a67d/logs/stdout",
"receiveTimestamp": "2020-09-24T08:11:12.049170032Z"
}
{
"insertId": "n1eotk8hjonkexfb3",
"jsonPayload": {
"Logger": "Dockerflow",
"EnvVersion": "2.0",
"Fields": {
"running": 0,
"provisioner": "scriptworker-k8s",
"deployment_name": "bouncer-prod-relengworker-firefoxci-comm-3-1",
"min_replicas": 0,
"msg": "Calculating capacity",
"worker_type": "comm-3-bouncer",
"deployment_namespace": "prod-bouncer"
},
"Severity": 6,
"Timestamp": 1600935067106041900,
"Pid": 1,
"Hostname": "k8s-autoscale-prod-relengworker-app-1-8566fb748b-wskc2",
"Type": "k8s_autoscale.main"
},
"resource": {
"type": "k8s_container",
"labels": {
"cluster_name": "relengworker-prod-v1",
"namespace_name": "prod-k8s-autoscale",
"container_name": "k8s-autoscale",
"project_id": "moz-fx-relengworker-prod-a67d",
"location": "us-west1",
"pod_name": "k8s-autoscale-prod-relengworker-app-1-8566fb748b-wskc2"
}
},
"timestamp": "2020-09-24T08:11:07.111627894Z",
"severity": "INFO",
"labels": {
"k8s-pod/app_kubernetes_io/component": "scriptworker",
"k8s-pod/app_kubernetes_io/instance": "prod",
"k8s-pod/pod-template-hash": "8566fb748b",
"k8s-pod/app_kubernetes_io/name": "k8s-autoscale",
"k8s-pod/app_kubernetes_io/version": "1.0.0",
"k8s-pod/app_kubernetes_io/managed-by": "jenkins",
"k8s-pod/app_kubernetes_io/part-of": "k8s-autoscale",
"k8s-pod/fullname": "k8s-autoscale-prod-relengworker-app-1",
"k8s-pod/jenkins-build-id": "1373"
},
"logName": "projects/moz-fx-relengworker-prod-a67d/logs/stdout",
"receiveTimestamp": "2020-09-24T08:11:12.049170032Z"
}
{
"insertId": "n1eotk8hjonkexfb4",
"jsonPayload": {
"Severity": 6,
"Logger": "Dockerflow",
"Fields": {
"provisioner": "scriptworker-k8s",
"worker_type": "comm-3-bouncer",
"msg": "Checking pending",
"deployment_namespace": "prod-bouncer",
"running": 0,
"min_replicas": 0,
"capacity": 1,
"deployment_name": "bouncer-prod-relengworker-firefoxci-comm-3-1"
},
"Type": "k8s_autoscale.main",
"EnvVersion": "2.0",
"Timestamp": 1600935067106374100,
"Pid": 1,
"Hostname": "k8s-autoscale-prod-relengworker-app-1-8566fb748b-wskc2"
},
"resource": {
"type": "k8s_container",
"labels": {
"container_name": "k8s-autoscale",
"cluster_name": "relengworker-prod-v1",
"project_id": "moz-fx-relengworker-prod-a67d",
"location": "us-west1",
"namespace_name": "prod-k8s-autoscale",
"pod_name": "k8s-autoscale-prod-relengworker-app-1-8566fb748b-wskc2"
}
},
"timestamp": "2020-09-24T08:11:07.111681044Z",
"severity": "INFO",
"labels": {
"k8s-pod/app_kubernetes_io/component": "scriptworker",
"k8s-pod/app_kubernetes_io/part-of": "k8s-autoscale",
"k8s-pod/app_kubernetes_io/managed-by": "jenkins",
"k8s-pod/app_kubernetes_io/name": "k8s-autoscale",
"k8s-pod/app_kubernetes_io/instance": "prod",
"k8s-pod/jenkins-build-id": "1373",
"k8s-pod/pod-template-hash": "8566fb748b",
"k8s-pod/fullname": "k8s-autoscale-prod-relengworker-app-1",
"k8s-pod/app_kubernetes_io/version": "1.0.0"
},
"logName": "projects/moz-fx-relengworker-prod-a67d/logs/stdout",
"receiveTimestamp": "2020-09-24T08:11:12.049170032Z"
}
I'd like to get to a point where the entire release process can be tested on try-comm-central. A few worker types are still missing:
comm-1-beetmover
comm-1-bouncer
comm-1-tree