
cluster-autoscaler-operator's Introduction

Cluster Autoscaler Operator

The cluster-autoscaler-operator manages deployments of the OpenShift Cluster Autoscaler using the cluster-api provider.

Custom Resource Definitions

The operator manages the following custom resources:

  • ClusterAutoscaler: This is a singleton resource which controls the configuration of the cluster's autoscaler instance. The operator will only respond to the ClusterAutoscaler resource named "default" in the managed namespace, i.e. the value of the WATCH_NAMESPACE environment variable. (Example)

    The fields in the spec for ClusterAutoscaler resources correspond to command-line arguments to the cluster-autoscaler. The example linked above results in the following invocation:

      Command:
        cluster-autoscaler
      Args:
        --logtostderr
        --balance-similar-node-groups=true
        --v=1
        --cloud-provider=clusterapi
        --namespace=openshift-machine-api
        --leader-elect-lease-duration=137s
        --leader-elect-renew-deadline=107s
        --leader-elect-retry-period=26s
        --expendable-pods-priority-cutoff=-10
        --max-nodes-total=24
        --cores-total=8:128
        --memory-total=4:256
        --gpu-total=nvidia.com/gpu:0:16
        --gpu-total=amd.com/gpu:0:4
        --scale-down-enabled=true
        --scale-down-delay-after-add=10s
        --scale-down-delay-after-delete=10s
        --scale-down-delay-after-failure=10s
        --scale-down-utilization-threshold=0.4
        --ignore-daemonsets-utilization=false
        --skip-nodes-with-local-storage=true
    
  • MachineAutoscaler: This resource targets a node group and manages the annotations that enable and configure autoscaling for that group, e.g. its minimum and maximum size. Currently only MachineSet objects can be targeted. (Example) Minimal sketches of both resource types follow this list.
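
Since the example manifests are only linked above, here is a minimal sketch of each resource. The ClusterAutoscaler field names are assumed from the invocation shown above; the values, and the MachineAutoscaler name and replica counts, are illustrative rather than copies of the linked examples. A ClusterAutoscaler might look like:

    apiVersion: autoscaling.openshift.io/v1
    kind: ClusterAutoscaler
    metadata:
      name: default                  # the operator only responds to the "default" instance
    spec:
      balanceSimilarNodeGroups: true
      skipNodesWithLocalStorage: true
      podPriorityThreshold: -10      # --expendable-pods-priority-cutoff
      resourceLimits:
        maxNodesTotal: 24
        cores:
          min: 8
          max: 128
        memory:
          min: 4
          max: 256
        gpus:
        - type: nvidia.com/gpu
          min: 0
          max: 16
        - type: amd.com/gpu
          min: 0
          max: 4
      scaleDown:
        enabled: true
        delayAfterAdd: 10s
        delayAfterDelete: 10s
        delayAfterFailure: 10s
        utilizationThreshold: "0.4"

and a MachineAutoscaler targeting a hypothetical MachineSet might look like:

    apiVersion: autoscaling.openshift.io/v1beta1
    kind: MachineAutoscaler
    metadata:
      name: worker-us-east-1a        # illustrative; typically matches the MachineSet name
      namespace: openshift-machine-api
    spec:
      minReplicas: 1
      maxReplicas: 12
      scaleTargetRef:
        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet
        name: worker-us-east-1a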

Development

Build, Test, & Run
$ make build
$ make test

$ export WATCH_NAMESPACE=openshift-machine-api
$ ./bin/cluster-autoscaler-operator -alsologtostderr

The Cluster Autoscaler Operator is designed to be deployed on OpenShift by the Cluster Version Operator, but it's possible to run it directly on any vanilla Kubernetes cluster that has the machine-api components available. To do so, apply the manifests in the install directory: kubectl apply -f ./install

This will create the openshift-machine-api namespace, register the custom resource definitions, configure RBAC policies, and create a deployment for the operator.

End-to-End Tests

You can run the e2e test suite with make test-e2e. These tests assume the presence of a cluster already running the operator, and that the KUBECONFIG environment variable points to a configuration granting admin rights on said cluster.

If you run the make targets in a container with podman and encounter permission issues, see the hacking guide.

Validating Webhooks

By default the operator starts an HTTP server for webhooks and registers a ValidatingWebhookConfiguration with the API server for both the ClusterAutoscaler and MachineAutoscaler types. This can be disabled via the WEBHOOKS_ENABLED environment variable. At the moment, reconciliation of the webhook configuration is only performed once at startup after leader-election has succeeded.

If the webhook server is enabled, you must provide a TLS certificate and key, as well as a CA certificate, to the operator. The location of these files is controlled by the WEBHOOKS_CERT_DIR environment variable, which defaults to /etc/cluster-autoscaler-operator/tls

The files must be in the following locations:

  • ${WEBHOOKS_CERT_DIR}/tls.crt
  • ${WEBHOOKS_CERT_DIR}/tls.key
  • ${WEBHOOKS_CERT_DIR}/service-ca/ca-cert.pem

The default cluster-autoscaler-operator deployment on OpenShift will generate the TLS assets automatically with the help of the OpenShift service-ca-operator. This works by annotating the Service object associated with the operator, which causes the service-ca-operator to generate a TLS certificate and inject it into a Secret, which is then mounted into the operator pod. Additionally, the service-ca-operator injects its CA certificate into a ConfigMap, which is also mounted. The operator then uses the TLS certificate and key to secure the webhook HTTP server, and injects the CA certificate into the webhook configuration registered with the API server.
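
To make that wiring concrete, here is a minimal sketch of the annotations involved, assuming the standard service-ca-operator annotations; the Service, Secret, and ConfigMap names are illustrative and not necessarily those in the shipped manifests:

    apiVersion: v1
    kind: Service
    metadata:
      name: cluster-autoscaler-operator
      namespace: openshift-machine-api
      annotations:
        # service-ca-operator generates a serving certificate and key for this
        # Service and stores them in the named Secret, which the operator
        # deployment mounts at ${WEBHOOKS_CERT_DIR}.
        service.beta.openshift.io/serving-cert-secret-name: cluster-autoscaler-operator-cert
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-autoscaler-operator-ca
      namespace: openshift-machine-api
      annotations:
        # service-ca-operator injects its CA bundle into this ConfigMap, which
        # the operator deployment mounts at ${WEBHOOKS_CERT_DIR}/service-ca.
        service.beta.openshift.io/inject-cabundle: "true"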

Updates to the TLS certificate and key are handled transparently. The controller-runtime library the operator is built on watches the files mounted in the pod for changes and updates the HTTP server's TLS configuration. Updates to the CA certificate are not handled automatically; however, a restart of the operator will load the new CA certificate and update the webhook configuration. In practice this is rarely a problem because CA certificates are generally long-lived, and the webhook configuration is set to ignore communication failures since the validations are merely a convenience.

cluster-autoscaler-operator's Issues

install blocked by empty cluster-autoscaler ClusterOperator

$ oc get clusteroperator cluster-autoscaler -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: 2019-02-25T20:10:10Z
  generation: 1
  name: cluster-autoscaler
  resourceVersion: "8185"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/cluster-autoscaler
  uid: 5a92b388-3939-11e9-ac79-06529dbe11fc
spec: {}

cluster-autoscaler-operator pod is Running

>>> 2019-02-25 14:10:04.45925468 -0600 CST m=+1078.985419586 (created) -> ContainerCreating
>>> 2019-02-25 14:10:06.070423667 -0600 CST m=+1080.596588583 ContainerCreating -> Running

****************************************************************
Pod openshift-machine-api/cluster-autoscaler-operator-6888c5b57b-fcjg7 started
****************************************************************
I0225 20:10:07.902042       1 main.go:14] Go Version: go1.10.3
I0225 20:10:08.099820       1 main.go:15] Go OS/Arch: linux/amd64
I0225 20:10:08.099844       1 main.go:16] Version: cluster-autoscaler-operator v0.0.0-60-gda772db-dirty
W0225 20:10:10.300302       1 machineautoscaler_controller.go:107] Removing support for unregistered target type: cluster.k8s.io/v1alpha1, Kind=MachineDeployment
I0225 20:10:10.400064       1 main.go:30] Starting cluster-autoscaler-operator
I0225 20:10:10.400219       1 leaderelection.go:185] attempting to acquire leader lease  openshift-machine-api/cluster-autoscaler-operator-leader...
I0225 20:10:10.600183       1 leaderelection.go:194] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
E0225 20:10:10.999872       1 reflector.go:205] github.com/openshift/cluster-autoscaler-operator/vendor/sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.Deployment: an error on the server ("apiserver is shutting down.") has prevented the request from succeeding (get deployments.apps)
E0225 20:10:10.999940       1 reflector.go:205] github.com/openshift/cluster-autoscaler-operator/vendor/sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1alpha1.MachineAutoscaler: an error on the server ("apiserver is shutting down.") has prevented the request from succeeding (get machineautoscalers.autoscaling.openshift.io)
E0225 20:10:11.000058       1 reflector.go:205] github.com/openshift/cluster-autoscaler-operator/vendor/sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1alpha1.ClusterAutoscaler: an error on the server ("apiserver is shutting down.") has prevented the request from succeeding (get clusterautoscalers.autoscaling.openshift.io)
W0225 20:10:11.015258       1 reflector.go:341] github.com/openshift/cluster-autoscaler-operator/vendor/sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: watch of *unstructured.Unstructured ended with: very short watch: github.com/openshift/cluster-autoscaler-operator/vendor/sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Unexpected watch close - watch lasted less than a second and no items received
W0225 20:10:11.015372       1 reflector.go:341] github.com/openshift/cluster-autoscaler-operator/vendor/sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: watch of *unstructured.Unstructured ended with: very short watch: github.com/openshift/cluster-autoscaler-operator/vendor/sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Unexpected watch close - watch lasted less than a second and no items received

The situation never resolves (at least not within 30 minutes) and results in a failed installation.

@derekwaynecarr @smarterclayton

Autoscaler on user-provisioned infrastructure

Reading the documentation, I saw that the cluster-autoscaler works only in clusters where the Machine API is operational.
Does this mean it's not possible to use it on user-provisioned infrastructure?

Thanks for the help

Config logic for skip-nodes-with-local-storage is flawed

There is an example that sets the following option:

skipNodesWithLocalStorage: true

However, setting this option to false has no effect; the deployment is not updated.

This is because the configuration logic is flawed:

if ca.Spec.SkipNodesWithLocalStorage != nil && *ca.Spec.SkipNodesWithLocalStorage {
args = append(args, SkipNodesWithLocalStorage.String())
}

But the autoscaler needs to run with --skip-nodes-with-local-storage=false if you want it to scale down nodes that run pods using emptyDir.
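
To make the expectation concrete, here is a sketch of a spec that should (but currently does not) propagate the flag, using the field names from the example above:

    apiVersion: autoscaling.openshift.io/v1
    kind: ClusterAutoscaler
    metadata:
      name: default
    spec:
      # Expected to yield --skip-nodes-with-local-storage=false on the
      # cluster-autoscaler deployment; with the logic quoted above the flag
      # is simply omitted instead.
      skipNodesWithLocalStorage: false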

Operator reports failing status

Default install; the operator reports the following status:

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: '2019-01-07T19:24:54Z'
  generation: 1
  name: cluster-autoscaler-operator
  resourceVersion: '42230'
  selfLink: /apis/config.openshift.io/v1/clusteroperators/cluster-autoscaler-operator
  uid: e95572fd-12b1-11e9-8f4e-0243bc135e96
spec: {}
status:
  conditions:
    - lastTransitionTime: '2019-01-07T19:53:55Z'
      status: 'True'
      type: Available
    - lastTransitionTime: '2019-01-07T19:53:55Z'
      status: 'False'
      type: Progressing
    - lastTransitionTime: '2019-01-07T19:53:55Z'
      message: machine-api-operator not ready
      reason: MissingDependency
      status: 'True'
      type: Failing
  extension: null
  version: v0.0.0-12-ge17a761-dirty

@bison

Update type for spec.resourceLimits

The types.go definition for resource limits should match the vanilla Kubernetes type definition for resource limits.

// ResourceList is a set of (resource name, quantity) pairs.
type ResourceList map[ResourceName]resource.Quantity

We can express a range for particular resources by just having a min/max, similar to LimitRange:

	// Max usage constraints on this kind by resource name.
	// +optional
	Max ResourceList `json:"max,omitempty" protobuf:"bytes,2,rep,name=max,casttype=ResourceList,castkey=ResourceName"`
	// Min usage constraints on this kind by resource name.
	// +optional
	Min ResourceList `json:"min,omitempty" protobuf:"bytes,3,rep,name=min,casttype=ResourceList,castkey=ResourceName"`
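
Under that proposal the ClusterAutoscaler spec might look roughly like this (a hypothetical shape; the current API uses the dedicated cores/memory/gpus fields instead):

    spec:
      resourceLimits:
        min:
          cpu: "8"
          memory: 4Gi
        max:
          cpu: "128"
          memory: 256Gi
          nvidia.com/gpu: "16"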

/cc @bison @frobware @ingvagabund

Assign a priority class to pods

Priority classes docs:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class

Example: https://github.com/openshift/cluster-monitoring-operator/search?q=priority&unscoped_q=priority

Notes: The pre-configured system priority classes (system-node-critical and system-cluster-critical) can only be assigned to pods in kube-system or openshift-* namespaces. Most likely, core operators and their pods should be assigned system-cluster-critical. Please do not assign system-node-critical (the highest priority) unless you are really sure about it.
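
For example, a fragment of the operator Deployment's pod template with a priority class assigned (an illustrative sketch, not the shipped manifest):

    spec:
      template:
        spec:
          # system-cluster-critical is allowed here because the operator runs
          # in an openshift-* namespace.
          priorityClassName: system-cluster-critical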
