prometheus-am-executor

The prometheus-am-executor is an HTTP server that receives alerts from the Prometheus Alertmanager and executes a given command with alert details set as environment variables.

Building

Requirements

1. Clone this repository

git clone https://github.com/imgix/prometheus-am-executor.git

2. Compile the prometheus-am-executor binary

go build

Usage

Usage: ./prometheus-am-executor [options] script [args..]

  -f string
        YAML config file to use
  -l string
        HTTP Port to listen on (default ":8080")
  -v    Enable verbose/debug logging

The executor runs the provided script(s) (set via cli or yaml config file) with the following environment variables set:

  • AMX_RECEIVER: name of receiver in the AM triggering the alert
  • AMX_STATUS: alert status
  • AMX_EXTERNAL_URL: URL to reach alertmanager
  • AMX_ALERT_LEN: Number of alerts; for iterating through AMX_ALERT_<n>.. vars
  • AMX_LABEL_<label>: alert label pairs
  • AMX_GLABEL_<label>: label pairs used to group alert
  • AMX_ANNOTATION_<key>: alert annotation key/value pairs
  • AMX_ALERT_<n>_STATUS: status of alert
  • AMX_ALERT_<n>_START: start of alert in seconds since epoch
  • AMX_ALERT_<n>_END: end of alert, 0 for firing alerts
  • AMX_ALERT_<n>_URL: URL to metric in prometheus
  • AMX_ALERT_<n>_LABEL_<label>: alert label pairs
  • AMX_ALERT_<n>_ANNOTATION_<key>: alert annotation key/value pairs
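
As a rough sketch of how a handler script might consume these variables, the following POSIX shell function walks the per-alert variables using AMX_ALERT_LEN. The function name summarize_alerts and the output format are illustrative only, and the sketch assumes the <n> index starts at 1, which is worth confirming against the version you run.

```shell
#!/bin/sh
# Illustrative handler sketch; only the AMX_* names come from the list above.
# Assumption: per-alert indices run from 1 to AMX_ALERT_LEN.

summarize_alerts() {
    i=1
    while [ "$i" -le "${AMX_ALERT_LEN:-0}" ]; do
        # Build the variable name for alert $i and read it indirectly.
        eval "status=\${AMX_ALERT_${i}_STATUS:-unknown}"
        eval "start=\${AMX_ALERT_${i}_START:-0}"
        echo "alert $i: status=$status start=$start"
        i=$((i + 1))
    done
}

echo "receiver=${AMX_RECEIVER:-unset} status=${AMX_STATUS:-unset}"
summarize_alerts
```

Saved as an executable file, a script like this could be passed as the script argument on the command line or as cmd in the config file.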

Using a configuration file

If the -f flag is set, the program will read the given YAML file as configuration on startup. Any settings specified at the cli take precedence over the same settings defined in a config file.

This feature is useful if you wish to configure prometheus-am-executor to dispatch to multiple processes based on what labels match between an alert and a command configuration.

An example config file is provided in the examples directory.

Configuration file format

---
listen_address: ":8080"
verbose: false
commands:
  - cmd: echo
    args: ["banana", "tomato"]
    match_labels:
      "env": "testing"
      "owner": "me"
  - cmd: /bin/true

Parameter        Use
listen_address   HTTP Port to listen on. Equivalent to the -l cli flag.
verbose          Enable verbose/debug logging. Equivalent to the -v cli flag.
commands         A config section that specifies one or more commands to execute when alerts are received.
cmd              The name or path to the command you want to execute.
args             Optional arguments that you want to pass to the command
match_labels     What alert labels you'd like to use, to determine if the command should be executed. All specified labels must match in order for the command to be executed. If match_labels isn't specified, the command will be executed for all alerts.

In the above configuration example, /bin/true will be executed for all alerts, and echo will be executed when an alert has the labels env="testing" and owner="me".
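
The matching rule can be pictured with a small shell sketch. This is not the executor's implementation, which lives in the Go program. labels_match is a hypothetical helper that compares name=value pairs against the AMX_LABEL_<label> variables, purely to show the "all specified labels must match, and no labels matches everything" behaviour.

```shell
#!/bin/sh
# Hypothetical illustration of the match_labels rule; not the executor's code.
# Each argument is a name=value pair that must equal AMX_LABEL_<name>.

labels_match() {
    for pair in "$@"; do
        name=${pair%%=*}
        want=${pair#*=}
        # Only accept names that are safe to use in a variable name with eval.
        case $name in
            ''|*[!A-Za-z0-9_]*) return 1 ;;
        esac
        eval "have=\${AMX_LABEL_${name}:-}"
        [ "$have" = "$want" ] || return 1
    done
    # No pairs given, or every pair matched.
    return 0
}

if labels_match "env=testing" "owner=me"; then
    echo "echo would run"
fi
if labels_match; then
    echo "/bin/true would run"
fi
```

The two calls at the end mirror the two commands in the example config: the first needs both labels to match, the second has no labels and therefore always matches.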

Testing configuration file changes

If you'd like to check the behaviour of a configuration file when prometheus-am-executor receives alerts, you can use the curl command to replay an alert. An example alert payload is provided in the examples directory.

1. Start prometheus-am-executor with your configuration file

./prometheus-am-executor -f examples/executor.yml -v

2. Send an alert to prometheus-am-executor

Make sure the port used in the curl command matches the listen address you specified. Alertmanager delivers webhooks as HTTP POST requests, so the replay uses POST as well.

curl --include -H 'Content-Type: application/json' --data-binary "@examples/alert_payload.json" -X POST 'http://localhost:23222/'

3. Check the output of prometheus-am-executor

Example: Reboot systems with errors

Sometimes a system might exhibit errors that require a hard reboot. This is an example of how to use Prometheus and prometheus-am-executor to reboot a machine based on an alert while making sure enough instances are in service all the time.

Let's assume the counter app_errors_unrecoverable_total should trigger a reboot if increased by 1. To make sure enough instances are in service all the time, the reboot should only get triggered if more than 80% of all instances are reachable in the load balancer. An alerting expression would look like this:

ALERT RebootMachine IF
	increase(app_errors_unrecoverable_total[15m]) > 0 AND
	avg by(backend) (haproxy_server_up{backend="app"}) > 0.8

This will trigger an alert RebootMachine if app_errors_unrecoverable_total increased in the last 15 minutes and more than 80% of all servers for backend app are up.

Now the alert needs to get routed to prometheus-am-executor like in this alertmanager config example.

Finally, prometheus-am-executor needs to be pointed to a reboot script:

./prometheus-am-executor examples/reboot

As soon as the counter increases by 1, an alert gets triggered and the alertmanager routes the alert to prometheus-am-executor which executes the reboot script.
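
For orientation, a reboot handler could look roughly like the sketch below. It is a hypothetical script, not the repository's examples/reboot, and it assumes the handler should only act when AMX_STATUS is "firing", since Alertmanager can also send a notification when an alert resolves. REBOOT_CMD is an illustrative variable introduced here, not an executor setting.

```shell
#!/bin/sh
# Hypothetical handler; this is not the repository's examples/reboot script.

handle_alert() {
    # Alertmanager reports "firing" or "resolved"; only act on firing alerts.
    if [ "${AMX_STATUS:-}" != "firing" ]; then
        echo "status is '${AMX_STATUS:-unset}', nothing to do"
        return 0
    fi
    # REBOOT_CMD is an illustrative knob, not an executor setting. It defaults
    # to a harmless echo so the sketch can be tried without restarting anything.
    ${REBOOT_CMD:-echo would reboot}
}

handle_alert
```

In real use REBOOT_CMD would be set to whatever command actually restarts the host.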

Caveats

To make sure a system doesn't get rebooted multiple times, the repeat_interval needs to be longer than the interval used for increase(). As long as that's the case, prometheus-am-executor will run the provided script only once.

increase(app_errors_unrecoverable_total[15m]) uses the value of app_errors_unrecoverable_total from 15 minutes ago to calculate the increase, so the metric must already exist before the counter increase happens. The Prometheus client library sets counters to 0 by default, but only for metrics without dynamic labels. Otherwise the metric only appears the first time it is set. The alert won't get triggered if the metric uses dynamic labels and was incremented for the very first time (the change from 'unknown' to its first value is not counted as an increase). Therefore you need to initialize all error counters with 0.

Since the alert gets triggered if the counter increased in the last 15 minutes, it resolves again after 15 minutes without a counter increase. It's therefore important that the alert gets processed within those 15 minutes, or the system won't get rebooted.

Contributors

discordianfish, otakup0pe, colakong, shamil, chm052
