Git Product home page Git Product logo

configure-alertmanager-operator's Introduction

configure-alertmanager-operator

Go Report Card GoDoc codecov License

Summary

The Configure Alertmanager Operator was created for the OpenShift Dedicated platform to dynamically manage Alertmanager configurations based on the presence or absence of secrets containing a GoAlert URLs, Pager Duty RoutingKey, and Dead Man's Snitch URL. When the secret is created/updated/deleted, the associated Receiver and Route will be created/updated/deleted within the Alertmanager config.

The operator contains the following components:

  • Secret controller: watches the openshift-monitoring namespace for any changes to relevant Secrets or ConfigMaps that are used in the configuration of Alertmanager. For more information on this see Secret Controller below.

  • Types library: these types are imported from the Alertmanager Config library and pared down to suit our config needs. (Since their library is intended for internal use only).

Secret Controller

The Secret Controller watches over the resources in the table below. Changes to these resources will prompt the controller to reconcile.

Resource Type Resource Namespace/Name Reason for watching
Secret openshift-monitoring/alertmanager-main Represents the Alertmanager Configuration that the operator creates/maintains the state of.
Secret openshift-monitoring/goalert-secret Indicates that the operator should configure GoAlert routing. Contains 3 values used by GoAlert; URL for high alerts, low alerts, and a heartbeat.
Secret openshift-monitoring/pd-secret Indicates that the operator should configure PagerDuty routing. Contains the PagerDuty API Key that is used for PagerDuty communications.
Secret openshift-monitoring/dms-secret Indicates that the operator should configure DeadmansSnitch routing. Contains the DeadmansSnitch URL that the Alertmanager should report readiness to.
ConfigMap openshift-monitoring/ocm-agent Indicates that the operator should configure OCM Agent routing. Contains the OCM Agent service URL that Alertmanager should route alerts to.
ConfigMap openshift-monitoring/managed-namespaces Defines a list of OpenShift "managed" namespaces. The operator will route alerts originating from these namespaces to PagerDuty and/or GoAlert.
ConfigMap openshift-monitoring/ocp-namespaces Defines a list of OpenShift Container Platform namespaces. The operator will route alerts originating from these namespaces to PagerDuty and/or GoAlert.

Cluster Readiness

To avoid alert noise while a cluster is in the early stages of being installed and configured, this operator waits to configure Pager Duty -- effectively silencing alerts -- until a predetermined set of health checks, performed by osd-cluster-ready, has completed.

This determination is made through the presence of a completed Job named osd-cluster-ready in the openshift-monitoring namespace.

Metrics

The Configure Alertmanager Operator exposes the following Prometheus metrics:

Metric name Purpose
ga_secret_exists indicates that a Secret named goalert-secret exists in the openshift-monitoring namespace.
pd_secret_exists indicates that a Secret named pd-secret exists in the openshift-monitoring namespace.
dms_secret_exists indicates that a Secret named dms-secret exists in the openshift-monitoring namespace.
am_secret_exists indicates that a Secret named alertmanager-main exists in the openshift-monitoring namespace.
managed_namespaces_configmap_exists indicates that a ConfigMap named managed-namespaces exists in the openshift-monitoring namespace.
ocp_namespaces_configmap_exists indicates that a ConfigMap named ocp-namespaces exists in the openshift-monitoring namespace.
am_secret_contains_ga indicates the GoAlert receiver is present in alertmanager.yaml.
am_secret_contains_pd indicates the Pager Duty receiver is present in alertmanager.yaml.
am_secret_contains_dms indicates the Dead Man's Snitch receiver is present in alertmanager.yaml.

The operator creates a Service and ServiceMonitor named configure-alertmanager-operator to expose these metrics to Prometheus.

Alerts

The following alerts are added to Prometheus as part of configure-alertmanager-operator:

  • Mismatch between DMS secret and DMS Alertmanager config.
  • Mismatch between GoAlert secret and GoAlert Alertmanager config.
  • Mismatch between PD secret and PD Alertmanager config.
  • Alertmanager config secret does not exist.

Testing

Tips for testing on a personal cluster:

Building

You may build (make docker-build) and push (make docker-push) the operator image to a personal repository by overriding components of the image URI:

  • IMAGE_REGISTRY overrides the registry (default: quay.io)
  • IMAGE_REPOSITORY overrides the organization (default: app-sre)
  • IMAGE_NAME overrides the repository name (default: managed-cluster-validating-webhooks)
  • OPERATOR_IMAGE_TAG overrides the image tag. (By default this is generated based on the current commit of your local clone of the git repository; but make docker-build will also always tag latest)

For example, to build, tag, and push quay.io/my-user/configure-alertmanager-operator:latest, you can run:

make IMAGE_REPOSITORY=my-user docker-build docker-push

Deploying

Prevent Overwrites

Note: This step requires elevated permissions

This operator is managed by OLM, so you must switch that off, or your changes to the operator's Deployment will be overwritten:

oc scale deploy/cluster-version-operator --replicas=0 -n openshift-cluster-version
oc scale deploy/olm-operator --replicas=0 -n openshift-operator-lifecycle-manager

NOTE: Don't forget to revert these changes when you have finished testing:

oc scale deploy/olm-operator --replicas=1 -n openshift-operator-lifecycle-manager
oc scale deploy/cluster-version-operator --replicas=1 -n openshift-cluster-version

Replace the Image

Edit the operator's deployment (oc set image deployment configure-alertmanager-operator -n openshift-monitoring *=<IMAGE>), replacing the image: with the URI of the image you built above. The deployment will automatically delete and replace the running pod.

NOTE: If you are testing coordination with the osd-cluster-ready job, you may need to set the MAX_CLUSTER_AGE_MINUTES environment variable in the deployment's configure-alertmanager-operator container definition. For example, to ensure the osd-cluster-ready Job is checked in a cluster less than 1048576 minutes (~two years) old:

        containers:
        - command:
          - configure-alertmanager-operator
          env:
          - name: WATCH_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: OPERATOR_NAME
            value: configure-alertmanager-operator
          ### Add this entry ###
          - name: MAX_CLUSTER_AGE_MINUTES
            value: "1048576"
          image: quay.io/2uasimojo/configure-alertmanager-operator:latest
          imagePullPolicy: Always
          name: configure-alertmanager-operator
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File

configure-alertmanager-operator's People

Contributors

2uasimojo avatar bdematte avatar bmeng avatar boranx avatar cblecker avatar clcollins avatar dak1n1 avatar dependabot[bot] avatar dofinn avatar fahlmant avatar grdryn avatar iamkirkbater avatar jewzaam avatar jharrington22 avatar lisa avatar mjlshen avatar mrbarge avatar nautilux avatar ninabauer avatar openshift-ci[bot] avatar openshift-merge-bot[bot] avatar openshift-merge-robot avatar ravitri avatar rendhalver avatar robotmaxtron avatar sam-nguyen7 avatar theautoroboto avatar tnierman avatar typeid avatar yithian avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.