imgix / prometheus-am-executor

Execute command based on Prometheus alerts

License: BSD 2-Clause "Simplified" License

Go 99.54% Makefile 0.46%

prometheus-am-executor's Introduction

prometheus-am-executor

The prometheus-am-executor is an HTTP server that receives alerts from the Prometheus Alertmanager and executes a given command with alert details set as environment variables.

ℹ️ This project's development is currently stale

We haven't needed to update this program in some time. If you are looking for something that has similar functionality and is more actively maintained, @aantn has suggested their project: Robusta (docs)

Issue 7 has discussion relating to the status of this project.

Building

Requirements

1. Clone this repository

git clone https://github.com/imgix/prometheus-am-executor.git

2. Compile the `prometheus-am-executor` binary

go test -count 1 -v ./...

go build

Usage

Usage: ./prometheus-am-executor [options] script [args..]

  -f string
        YAML config file to use
  -l string
    	HTTP Port to listen on (default ":8080")
  -v	Enable verbose/debug logging

The executor runs the provided script(s) (set via cli or yaml config file) with the following environment variables set:

AMX_RECEIVER: name of receiver in the AM triggering the alert
AMX_STATUS: alert status
AMX_EXTERNAL_URL: URL to reach alertmanager
AMX_ALERT_LEN: Number of alerts; for iterating through AMX_ALERT_<n>.. vars
AMX_LABEL_<label>: alert label pairs
AMX_GLABEL_<label>: label pairs used to group alert
AMX_ANNOTATION_<key>: alert annotation key/value pairs
AMX_ALERT_<n>_STATUS: status of alert
AMX_ALERT_<n>_START: start of alert in seconds since epoch
AMX_ALERT_<n>_END: end of alert, 0 for firing alerts
AMX_ALERT_<n>_URL: URL to metric in prometheus
AMX_ALERT_<n>_FINGERPRINT: Message Fingerprint
AMX_ALERT_<n>_LABEL_<label>: alert label pairs
AMX_ALERT_<n>_ANNOTATION_<key>: alert annotation key/value pairs
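
As a rough illustration of how a called program might consume these variables, here is a minimal sketch of a handler written in Go. The `Alert` type and the `alertsFromEnv` helper are hypothetical names, not part of prometheus-am-executor, and the sketch assumes that `<n>` counts from 1; adjust the loop if your build numbers alerts differently.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Alert holds a few of the per-alert values exposed as AMX_ALERT_<n>_* variables.
// This type is hypothetical; it only exists for this example.
type Alert struct {
	Status      string
	Start       int64 // seconds since epoch
	End         int64 // 0 while the alert is still firing
	Fingerprint string
}

// alertsFromEnv reads AMX_ALERT_LEN and then the AMX_ALERT_<n>_* variables.
// getenv is passed in so the function can be tried without a real environment.
// Assumption: <n> starts at 1.
func alertsFromEnv(getenv func(string) string) ([]Alert, error) {
	raw := getenv("AMX_ALERT_LEN")
	if raw == "" {
		return nil, nil // not started by the executor, nothing to iterate
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return nil, fmt.Errorf("AMX_ALERT_LEN: %w", err)
	}
	var alerts []Alert
	for i := 1; i <= n; i++ {
		prefix := "AMX_ALERT_" + strconv.Itoa(i) + "_"
		// Missing or malformed timestamps are treated as 0 in this sketch.
		start, _ := strconv.ParseInt(getenv(prefix+"START"), 10, 64)
		end, _ := strconv.ParseInt(getenv(prefix+"END"), 10, 64)
		alerts = append(alerts, Alert{
			Status:      getenv(prefix + "STATUS"),
			Start:       start,
			End:         end,
			Fingerprint: getenv(prefix + "FINGERPRINT"),
		})
	}
	return alerts, nil
}

func main() {
	alerts, err := alertsFromEnv(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d alert(s), overall status %q\n", len(alerts), os.Getenv("AMX_STATUS"))
	for i, a := range alerts {
		fmt.Printf("alert %d: status=%s start=%d end=%d fingerprint=%s\n",
			i+1, a.Status, a.Start, a.End, a.Fingerprint)
	}
}
```

Passing the lookup function as a parameter keeps the parsing logic easy to test with a plain map instead of the process environment.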

Using a configuration file

If the -f flag is set, the program will read the given YAML file as configuration on startup. Any settings specified at the cli take precedence over the same settings defined in a config file.

This feature is useful if you wish to configure prometheus-am-executor to dispatch to multiple processes based on what labels match between an alert and a command configuration.

An example config file is provided in the examples directory.

Configuration file format

---
listen_address: ":23222"
verbose: false
# tls_key: "certs/key.pem"
# tls_crt: "certs/cert.pem"
commands:
  - cmd: echo
    args: ["banana", "tomato"]
    match_labels:
      "env": "testing"
      "owner": "me"
    notify_on_failure: false
  - cmd: /bin/true
    max: 3
    ignore_resolved: true
  - cmd: /bin/sleep
    args: ["10s"]
    resolved_signal: SIGUSR1

Parameter	Use
`listen_address`	HTTP Port to listen on. Equivalent to the `-l` cli flag.
`verbose`	Enable verbose/debug logging. Equivalent to the `-v` cli flag.
`tls_key`	The TLS Key file for an optional TLS listener.
`tls_crt`	The TLS Certificate file for an optional TLS listener.
`commands`	A config section that specifies one or more commands to execute when alerts are received.
`cmd`	The name or path to the command you want to execute.
`args`	Optional arguments that you want to pass to the command
`match_labels`	What alert labels you'd like to use, to determine if the command should be executed. All specified labels must match in order for the command to be executed. If `match_labels` isn't specified, the command will be executed for all alerts.
`notify_on_failure`	By default, if any executed command returns a non-zero exit code, the caller (alertmanager) is notified with an HTTP 500 status code in the response. This will likely result in alertmanager considering the message a 'failure to notify' and re-sending the alert to am-executor. If this is not the desired behaviour, set `notify_on_failure` to `false`.
`max`	The maximum instances of this command that can be running at the same time. A zero or negative value is interpreted as 'no limit'.
`ignore_resolved`	By default when an alertmanager message indicating the alerts are 'resolved' is received, any commands matching the alarm are sent a signal if they are still active. If this is not desired behaviour, set this to `true`.
`resolved_signal`	Specify which signal to send to matching commands that are still running when the triggering alert is resolved. (default: SIGKILL)
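
The `notify_on_failure` rule described in the table can be pictured as a small decision function. The sketch below is only an illustration of that rule, not the project's actual code; the `outcome` type and `responseStatus` function are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
)

// outcome is a hypothetical record of one finished command.
type outcome struct {
	ExitCode        int
	NotifyOnFailure bool // mirrors the command's notify_on_failure setting
}

// responseStatus returns 500 when at least one command that reports failures
// exited with a non-zero code, and 200 otherwise.
func responseStatus(outcomes []outcome) int {
	for _, o := range outcomes {
		if o.ExitCode != 0 && o.NotifyOnFailure {
			return http.StatusInternalServerError
		}
	}
	return http.StatusOK
}

func main() {
	// A failing command with notify_on_failure: false does not affect the response.
	fmt.Println(responseStatus([]outcome{{ExitCode: 1, NotifyOnFailure: false}})) // 200
	// With notify_on_failure left enabled, the same failure is reported to the caller.
	fmt.Println(responseStatus([]outcome{{ExitCode: 1, NotifyOnFailure: true}})) // 500
}
```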

In the above configuration example:

echo will be executed when an alert has the labels env="testing" and owner="me", and receives SIGKILL if the triggering alarm resolves while it's still running. If the command fails, the source of the alert isn't notified.
/bin/true will be executed for all alerts, with at most 3 instances running at the same time, and doesn't receive a signal if the triggering alarm resolves while it's running.
/bin/sleep is executed for all alerts, and receives the SIGUSR1 signal if the triggering alarm resolves while it's still running.

Creating TLS Certificates

With the following commands you can create a TLS key and certificate for testing purposes.

mkdir certs
cd certs
go run $(go env GOROOT)/src/crypto/tls/generate_cert.go --rsa-bits=2048 --host=localhost

Testing configuration file changes

If you'd like to check the behaviour of a configuration file when prometheus-am-executor receives alerts, you can use the curl command to replay an alert. An example alert payload is provided in the examples directory.

1. Start prometheus-am-executor with your configuration file

./prometheus-am-executor -f examples/executor.yml -v

2. Send an alert to prometheus-am-executor

Make sure the port used in the curl command matches whatever you specified.

curl --include -H 'Content-Type: application/json' --data-binary "@examples/alert_payload.json" -X POST 'http://localhost:23222/'

3. Check the output of prometheus-am-executor

Example: Reboot systems with errors

Sometimes a system might exhibit errors that require a hard reboot. This is an example of how to use Prometheus and prometheus-am-executor to reboot a machine based on an alert while making sure enough instances are in service at all times.

Let's assume the counter app_errors_unrecoverable_total should trigger a reboot if it increases by 1. To make sure enough instances are in service at all times, the reboot should only get triggered if at least 80% of all instances are reachable in the load balancer. An alerting expression would look like this:

ALERT RebootMachine IF
	increase(app_errors_unrecoverable_total[15m]) > 0 AND
	avg by(backend) (haproxy_server_up{backend="app"}) > 0.8

This will trigger the alert RebootMachine if app_errors_unrecoverable_total increased in the last 15 minutes and at least 80% of all servers for backend app are up.

Now the alert needs to get routed to prometheus-am-executor like in this alertmanager config example.

Finally prometheus-am-executor needs to be pointed to a reboot script:

./prometheus-am-executor examples/reboot

As soon as the counter increases by 1, an alert gets triggered and the alertmanager routes the alert to prometheus-am-executor which executes the reboot script.

Caveats

To make sure a system doesn't get rebooted multiple times, the repeat_interval needs to be longer than the interval used for increase(). As long as that's the case, prometheus-am-executor will run the provided script only once.

Because increase(app_errors_unrecoverable_total[15m]) uses the value of app_errors_unrecoverable_total from 15 minutes ago to calculate the increase, the metric must already exist before the counter is incremented. The Prometheus client library sets counters to 0 by default, but only for metrics without dynamic labels. Otherwise the metric only appears the first time it is set. The alert won't get triggered if the metric uses dynamic labels and was incremented for the very first time, because Prometheus has no earlier value to compare against and therefore sees no increase. Therefore you need to initialize all error counters with 0.
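
The effect can be seen with a deliberately simplified model of increase(). The sketch below only subtracts the first sample in the window from the last one; the real Prometheus function also extrapolates to the window boundaries and handles counter resets, so treat `simpleIncrease` (a hypothetical name) as an illustration of the caveat rather than a reimplementation.

```go
package main

import "fmt"

// simpleIncrease returns last-minus-first over the samples in a window.
// It returns false when fewer than two samples exist, since there is
// nothing to compare against.
func simpleIncrease(window []float64) (float64, bool) {
	if len(window) < 2 {
		return 0, false
	}
	return window[len(window)-1] - window[0], true
}

func main() {
	// Counter was initialized to 0 before the first error: the increase is visible.
	fmt.Println(simpleIncrease([]float64{0, 0, 1})) // 1 true

	// Series only appeared once it was first incremented: every sample is 1,
	// so no increase is seen and the alert never fires.
	fmt.Println(simpleIncrease([]float64{1, 1, 1})) // 0 true

	// A single sample in the window yields no result at all.
	fmt.Println(simpleIncrease([]float64{1})) // 0 false
}
```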

Since the alert gets triggered if the counter increased in the last 15 minutes, the alert resolves after 15 minutes without counter increase, so it's important that the alert gets processed in those 15 minutes or the system won't get rebooted.

prometheus-am-executor's Issues

doesn't build with prometheus/client_golang:master

See prometheus/client_golang#600. The prometheus.Handler() function has been removed.

request id in logger

Greetings. Would it be possible to add generated request id (uuid or smth) attached to request verbose output and script's stdout/stderr? Sometimes it receives multiple alerts for same script simultaneously and I think it would be a useful feature.

Check support for multi-line alarm annotations

It's possible that users may be interested in the contents of an alarm's annotations when control passes to the called program. I'm uncertain what the behaviour would be, if an env var contained newline characters.

This issue is to check the behaviour when using annotations with newline characters. If the contents are different than what prometheus-am-executor received (truncated, etc), make note of it in README.

Add functionality for executing different commands, based on matching alarm labels

This depends on #20

Currently, prometheus-am-executor can only call one program for every alarm it receives. If you wanted to call multiple programs, you'd need to run multiple instances of prometheus-am-executor.

This issue is to extend functionality so that different commands could be executed, based on matching alarm labels defined in a config file.

Is this an active project ?

Hi,
Is this an active project ?
There seems to be no development for a long time.
Is there a similar project somewhere ?

thanks

Support alarm deduplication for unique alarm entities within a time-frame

This issue is to add functionality that can deduplicate unique alarm entities received within a time-frame.

An example use-case is where you have multiple prometheus servers configured the same way that issue alarms to multiple alertmanagers, which all point to an executor. If the same alarm is received from multiple alertmanagers within a time-frame, we could support configuration that would say to execute matching commands once, not once per matching alarm.

Add test case for `handleWebhook`

Depends on #18

This would have us check that handleWebhook behaves as expected for some test inputs.

executor.yaml file

Hi. It may not be an issue, but I have a problem with the executor.yaml file. Can someone spare 10 minutes to help me solve it?

Getting error at webhook url

Build executor and triggered alert for testing but got this error at URL

"unexpected end of JSON input"

run multiple script file

Hi Everyone,

I want to run multiple scripts, but I don't know how to run multiple commands at once. Can anyone help me?

Thank you so much!

Cannot run a script

I am trying to use the dockerised version.
I have deployed the container and it accepts alert from Alert Manager but when trying to execute my script it outputs:

root@vagrant:/home/vagrant# docker service logs omilia_promexecutor
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:42:39 Listening on :8080 and running [script.sh]
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:37 unexpected end of JSON input
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:37 http: multiple response.WriteHeader calls
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:37 exec: "script.sh": executable file not found in $PATH
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:38 unexpected end of JSON input
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:38 http: multiple response.WriteHeader calls
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:38 exec: "script.sh": executable file not found in $PATH
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:43 exec: "script.sh": executable file not found in $PATH
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:43 exec: "script.sh": executable file not found in $PATH
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:44 exec: "script.sh": executable file not found in $PATH
omilia_promexecutor.1.6r4zs5izmfqi@vagrant    | 2020/03/20 08:43:45 exec: "script.sh": executable file not found in $PATH

prometheus-am-executor as a docker container

Can you guide me how I can use it as container (docker)

MacOS with Go throwing error "undefined: syscall.SIGCLD" and similar error

While building the project in MacOS (11.1) and Go version (go version go1.14.3 darwin/amd64), I am seeing the following error of undefined Signal:

$ go build .
# github.com/imgix/prometheus-am-executor
./command.go:40:16: undefined: syscall.SIGCLD
./command.go:50:16: undefined: syscall.SIGPOLL
./command.go:52:16: undefined: syscall.SIGPWR
./command.go:55:16: undefined: syscall.SIGSTKFLT
./command.go:63:16: undefined: syscall.SIGUNUSED

Once I removed the mentioned signals from command.go, it was built successfully.

Add ability to kill exec process with SIGTERM instead of SIGKILL when resolved alert arrived

Greetings. I think it would be useful to be able to use SIGTERM instead of SIGKILL via the configuration file (since it is not supported on all OSes).

github links obsolete

Hello,

I saw that main.go contains two obsolete links:

"github.com/prometheus/alertmanager/template"
"github.com/prometheus/client_golang/prometheus"

From what i see now there is :

"github.com/prometheus/alertmanager/tree/master/template"
"github.com/prometheus/client_golang/tree/master/prometheus"

Allow for command delays when matching alarms/alarm entities exceeds a threshold in a time-frame

This issue is to add functionality to introduce delays in executing a particular command, if

the number of matching alarms within a time-frame exceeds a threshold, and/or
the number of unique entities in the matching alarms within a time-frame exceeds a threshold

An example of a use-case is that if you are executing a command that manages the state of a fleet of services, you want to control how much of the fleet that is affected at the same time if you received an alarm that affects the entire fleet. This is helpful for avoiding situations where a problem in monitoring itself triggers an issue by applying an operation to an entire fleet at the same time.

Add feature to enable mutual TLS

Currently, it appears to be possible to specify a TLS config, but that's only to secure the communication channel between the client and server process. As an improvement to security, it would be great if mutual TLS could be enabled so that only select clients with the proper certificate can connect to the executor.

Fix the LICENSE file contents, so that package docs show up properly on pkg.go.dev

README and documentation for this project doesn't show up in pkg.go.dev. The reason given is that the license file doesn't match the golang package directory website's license policy.

This issue is to determine what needs to change in our LICENSE file, in order for README/Docs to show up properly in pkg.go.dev.

Some helpful links for this issue:

pkg.go.dev license policy
The OSI formatting of the BSD 2-clause license that we're using (which is detected by GitHub)

panic: send on closed channel

Hi! It seems that there is a race condition in server.go. The following error occurs sometimes:

goroutine 68 [running]:
main.(*Server).amFiring.func3({0xc00006ea80, 0xc00007ac00})
  /app/server.go:245 +0x1c5
created by main.(*Server).amFiring
  /app/server.go:270 +0x4a7

Refactor main.go, to use a config struct instead of global vars

This will

help make handleWebhook easier to test
unblock development of some other features

Support cancelling running command for an alarm if a resolve condition is received.

If we receive a message indicating that an alarm was 'resolved', we could support killing the processes that were run for that alarm.

problem with testing

while testing i keep getting the following error 'unexpected end of JSON input'

Unable to run prometheus-am-executor.

I have cloned prometheus-am-executor to my directory. And then I run ~/.prometheus-am-executor it does not work. I have already configured Prometheus and alertmanager. Help me how to do it.

panic: runtime error: invalid memory address or nil pointer dereference

hi.
i'm using prometheus-am-executor and i have a question.
i executed reboot script that default file in ./prometheus-am-executor and i got runtime error.

how can i fix this error?
Should i modify go file?
[root@controller1 prometheus-am-executor]# ./prometheus-am-executor -v examples/reboot
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x8c12fc]

goroutine 1 [running]:
main.readCli(0x203000, 0x680, 0xc000235c00)
/home/prometheus-am-executor/config.go:100 +0x29c
main.readConfig(0x98d0a0, 0x30, 0x30)
/home/prometheus-am-executor/config.go:117 +0x26
main.main()
/home/prometheus-am-executor/main.go:36 +0x42

thanks for your help.

Update CircleCI to run go build, and upload packages for different platforms

Add functionality for reading config from a file

This depends on #18

This issue would populate the config struct defined in #18. It would allow us to do things like specify different commands to execute based on alarm labels received.

The config file might as well be yaml format, since everything else prometheus-related is yaml.

Functionality for passing alarm data to commands via STDIN

Some users might not want to use environment variables when handling alarms. This issue is to add functionality for passing alarm data to commands via STDIN. Formatting of the alarm data could be JSON.