
consul-alerts's Introduction

This project is no longer supported by AcalephStorage. Please use https://github.com/fusiondog/consul-alerts for active development going forward.

consul-alerts

Join the chat at https://gitter.im/AcalephStorage/consul-alerts

A highly available daemon for sending notifications and reminders based on Consul health checks.

Under the covers, consul-alerts leverages Consul's own leadership election and KV store to provide automatic failover and seamless operation in the case of a consul-alerts node failure and ensures that your notifications are still sent.

consul-alerts provides a high degree of configuration including:

  • Several built-in Notifiers for distribution of health check alerts (email, sns, pagerduty, etc.)
  • The ability to create Notification Profiles, sets of Notifiers which will respond to the given alert when a configurable threshold is exceeded
  • Multiple degrees of customization for Notifiers and Blacklisting of alerts (service, check id or host)

Requirements

  1. Consul 0.4+. Get it here.
  2. Configured GOPATH.

Releases

Stable releases are here.

The latest releases can be found here:

Installation

$ go get github.com/AcalephStorage/consul-alerts
$ go install

This should install consul-alerts to $GOPATH/bin

or pull the image from docker:

$ docker pull acaleph/consul-alerts

Usage

$ consul-alerts start

By default, this runs the daemon and API at localhost:9000 and connects to the local consul agent (localhost:8500) and default datacenter (dc1). These can be overridden by the following flags:

$ consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --consul-dc=dc1 --consul-acl-token=""

Once the daemon is running, it can act as a handler for consul watches. At the moment only checks and events are supported.

$ consul watch -type checks consul-alerts watch checks --alert-addr=localhost:9000
$ consul watch -type event consul-alerts watch event --alert-addr=localhost:9000

or run the watchers on the agent the daemon connects to by adding the following flags when starting consul-alerts:

$ consul-alerts start --watch-events --watch-checks

Usage - Docker

There are a few options for running in docker.

The first option is to use the consul agent built into the container. This option requires overriding the default entry point and running an exec to launch consul-alerts.

Start consul:

docker run -ti \
  --rm -p 9000:9000 \
  --hostname consul-alerts \
  --name consul-alerts \
  --entrypoint=/bin/consul \
  acaleph/consul-alerts \
  agent -data-dir /data -server -bootstrap -client=0.0.0.0

Then in a separate terminal start consul-alerts:

$ docker exec -ti consul-alerts /bin/consul-alerts start --alert-addr=0.0.0.0:9000 --log-level=info --watch-events --watch-checks

The second option is to link to an existing consul container through docker networking and the --link option. This method can more easily share the consul instance with other containers such as vault.

First launch consul container:

$ docker run \
  -p 8400:8400 \
  -p 8500:8500 \
  -p 8600:53/udp \
  --hostname consul \
  --name consul \
  progrium/consul \
  -server -bootstrap -ui-dir /ui

Then run consul alerts container:

$ docker run -ti \
  -p 9000:9000 \
  --hostname consul-alerts \
  --name consul-alerts \
  --link consul:consul \
  acaleph/consul-alerts start \
  --consul-addr=consul:8500 \
  --log-level=info --watch-events --watch-checks

The last option is to launch the container and point it at a remote consul instance:

$ docker run -ti \
  -p 9000:9000 \
  --hostname consul-alerts \
  --name consul-alerts \
  acaleph/consul-alerts start \
  --consul-addr=remote-consul-server.domain.tld:8500 \
  --log-level=info --watch-events --watch-checks

NOTE: Don't change --alert-addr when using the docker container.

Configuration

To ensure consistency between instances, configuration is stored in Consul's KV with the prefix consul-alerts/config/. consul-alerts works out of the box without any customization by using the defaults documented below and leverages the KV settings as overrides.

A few suggestions on operating and bootstrapping your consul-alerts configuration via the KV store are located in the Operations section below.

If ACLs are enabled, the following policy should be configured for the consul-alerts token:

key "consul-alerts" {
  policy = "write"
}

service "" {
  policy = "read"
}

event "" {
  policy = "read"
}

session "" {
  policy = "write"
}

Health Checks

Health checking is enabled by default and is at the core of what consul-alerts provides. The Health Check functionality is responsible for triggering a notification when a given consul check has changed status. To prevent flapping, notifications are only sent when a check status has been consistent for a specific time in seconds (60 by default). The threshold can be set globally or for a particular node, check, service, or any combination of them.

Configuration Options: The default Health Check configuration can be customized by setting keys under the prefix: consul-alerts/config/checks/

key description
enabled Globally enable the Health Check functionality. [Default: true]
change-threshold The time, in seconds, that a check must be in a given status before an alert is sent [Default: 60]
single/{{ node }}/{{ serviceID }}/{{ checkID }}/change-threshold Overrides change-threshold for a specific check associated with a particular service running on a particular node
check/{{ checkID }}/change-threshold Overrides change-threshold for a specific check
service/{{ serviceID }}/change-threshold Overrides change-threshold for a specific service
node/{{ node }}/change-threshold Overrides change-threshold for a specific node

When change-threshold is overridden multiple times, the most specific condition will be used based on the following order: (most specific) single > check > service > node > global settings > default settings (least specific).
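
For example, assuming a local agent at localhost:8500 and a hypothetical service ID nginx, the thresholds could be written through Consul's KV HTTP API (a sketch; adjust the address, values, and IDs to your environment):

$ curl -X PUT -d '120' http://localhost:8500/v1/kv/consul-alerts/config/checks/change-threshold
$ curl -X PUT -d '300' http://localhost:8500/v1/kv/consul-alerts/config/checks/service/nginx/change-threshold

With both keys set, checks belonging to the nginx service use the 300-second threshold while all other checks fall back to the 120-second global value.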

Notification Profiles

Notification Profiles allow the operator to customize how often and to which Notifiers alerts are sent, via the Interval and NotifList attributes described below.

Profiles are configured as keys with the prefix: consul-alerts/config/notif-profiles/.

Notification Profile Specification

Key: The name of the Notification Profile

Ex. emailer_only would be located at consul-alerts/config/notif-profiles/emailer_only

Value: A JSON object adhering to the schema shown below.

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "title": "Notifier Profile Schema.",
  "description": "Defines a given Notifier Profile's configuration",
  "properties": {
    "Interval": {
      "type": "integer",
      "title": "Reminder Interval.",
      "description": "Defines the Interval (in minutes) which Reminders should be sent to the given Notifiers.  Should be a multiple of 5."
    },
    "NotifList": {
      "type": "object",
      "title": "Hash of Notifiers to configure.",
      "description": "A listing of Notifier names with a boolean value indicating if it should be enabled or not.",
      "patternProperties" : {
        ".{1,}" : { "type" : "string" }
      }
    },
    "VarOverrides": {
      "type": "object",
      "title": "Hash of Notifier variables to override.",
      "description": "A listing of Notifier names with hash values containing the parameters to be overridden",
      "patternProperties" : {
        ".{1,}" : { "type" : "object" }
      }
    }
  },
  "required": [
    "Interval",
    "NotifList"
  ]
}

Notification Profile Examples

Notification Profile to only send Emails with reminders every 10 minutes:

Key: consul-alerts/config/notif-profiles/emailer_only

Value:

{
  "Interval": 10,
  "NotifList": {
    "log":false,
    "email":true
  }
}

NOTE: While it is completely valid to explicitly disable a Notifier in a Notifier Profile, it is not necessary. In the event that a Notification Profile is used, only Notifiers which are explicitly defined and enabled will be used. In the example above then, we could have omitted the "log": false in the NotifList and achieved the same results.
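
As a sketch of how such a profile could be stored (assuming the JSON value above is saved locally as emailer_only.json and a Consul agent is reachable at localhost:8500):

$ curl -X PUT --data-binary @emailer_only.json http://localhost:8500/v1/kv/consul-alerts/config/notif-profiles/emailer_only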

Example - Notification Profile to only send to PagerDuty but never send reminders:

Key: consul-alerts/config/notif-profiles/pagerduty_no_reminders

Value:

{
  "Interval": 0,
  "NotifList": {
    "pagerduty":true
  }
}

NOTE: The Interval being set to 0 disables Reminders from being sent for a given alert. If the service stays in a critical status for an extended period, only that first notification will be sent.

Example - Notification Profile to only send Emails to the overridden receivers:

Key: consul-alerts/config/notif-profiles/emailer_overridden

Value:

{
  "Interval": 10,
  "NotifList": {
    "email":true
  },
  "VarOverrides": {
    "email": {
      "receivers": ["[email protected]"]
    }
  }
}

Example - Notification Profile to disable Slack:

Key: consul-alerts/config/notif-profiles/slack_off

Value:

{
  "Interval": 0,
  "NotifList": {
    "slack":false
  }
}

Notification Profile Activation

It is possible to activate Notification Profiles in two ways: for a specific entity, or for a set of entities matching a regular expression. For a specific item the selection is done by setting keys in consul-alerts/config/notif-selection/services/, consul-alerts/config/notif-selection/checks/, consul-alerts/config/notif-selection/hosts/, or consul-alerts/config/notif-selection/status/, with the appropriate service, check, host name, or status as the key and the desired Notification Profile name as the value. To activate a Notification Profile for a set of entities matching a regular expression, create a JSON map of type regexp -> notification-profile as the value for the keys consul-alerts/config/notif-selection/services, consul-alerts/config/notif-selection/checks, consul-alerts/config/notif-selection/hosts, or consul-alerts/config/notif-selection/status.

Example - Notification Profile activated for all services whose names start with infra-

Key: consul-alerts/config/notif-selection/services

Value:

{
  "^infra-.*$": "infra-support-profile"
}

Example - Disable Slack notifications when the status is passing

Key: consul-alerts/config/notif-selection/status/passing

Value: slack_off
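
For illustration, both activation styles can be written through the KV HTTP API. The service name my-api, the profile names, and the regular expression below are hypothetical:

$ curl -X PUT -d 'emailer_only' http://localhost:8500/v1/kv/consul-alerts/config/notif-selection/services/my-api
$ curl -X PUT -d '{"^infra-.*$": "infra-support-profile"}' http://localhost:8500/v1/kv/consul-alerts/config/notif-selection/services

The first command selects a profile for one specific service; the second selects profiles for every service whose name matches a regular expression.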

In addition to the service, check, and host specific Notification Profiles, the operator can set up a default Notification Profile by creating one at the key consul-alerts/config/notif-profiles/default, which acts as a fallback in the event a specific Notification Profile is not found. If there are no Notification Profiles matching the criteria, consul-alerts will send the notification to the full list of enabled Notifiers and no reminders will be sent.

As consul-alerts processes a given notification, it performs a series of lookups to associate the event with a Notification Profile, matching on:

  • Service
  • Check
  • Host
  • Status
  • Default

NOTE: An event will only trigger notification for the FIRST Notification Profile that meets its criteria.

Reminders resend the notifications at programmable intervals until they are resolved or added to the blacklist. Reminders are processed every five minutes; therefore, Interval values should be a multiple of five. If the Interval value is 0 or not set, reminders will not be sent.

Enable/Disable Specific Health Checks

There are multiple ways to enable/disable health check notifications: mark them by node, serviceID, checkID, regular expression, or mark individually by node/serviceID/checkID. This is done by adding a KV entry in consul-alerts/config/checks/blacklist/.... Removing the entry will re-enable the check notifications.

Disable all notifications by node

Add a KV entry with the key consul-alerts/config/checks/blacklist/nodes/{{ nodeName }}. This will disable notifications for the specified nodeName.

Disable all notifications for the nodes matching regular expressions

Add a KV entry with the key consul-alerts/config/checks/blacklist/nodes and the value containing a list of regular expressions. This will disable notifications for all nodes whose names match at least one of the regular expressions.
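
A sketch of such an entry via the KV HTTP API, with hypothetical node-name patterns:

$ curl -X PUT -d '["^staging-.*$", "^build-agent-[0-9]+$"]' http://localhost:8500/v1/kv/consul-alerts/config/checks/blacklist/nodes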

Disable all notifications by service

Add a KV entry with the key consul-alerts/config/checks/blacklist/services/{{ serviceId }}. This will disable notifications for the specified serviceId.

Disable all notifications for the services matching regular expressions

Add a KV entry with the key consul-alerts/config/checks/blacklist/services and the value containing a list of regular expressions. This will disable notifications for all services whose names match at least one of the regular expressions.

Disable all notifications by healthCheck

Add a KV entry with the key consul-alerts/config/checks/blacklist/checks/{{ checkId }}. This will disable notifications for the specified checkId.

Disable all notifications for the healthChecks matching regular expressions

Add a KV entry with the key consul-alerts/config/checks/blacklist/checks and the value containing a list of regular expressions. This will disable notifications for all health checks whose names match at least one of the regular expressions.

Disable a single health check

Add a KV entry with the key consul-alerts/config/checks/blacklist/single/{{ node }}/{{ serviceId }}/{{ checkId }}. This will disable the specific health check. If the health check is not associated with a service, use _ as the serviceId.
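
For example (a sketch with hypothetical node and service IDs; the section above only requires the key to exist, so empty values are written here):

$ curl -X PUT -d '' http://localhost:8500/v1/kv/consul-alerts/config/checks/blacklist/single/web-1/nginx/service:nginx
$ curl -X PUT -d '' http://localhost:8500/v1/kv/consul-alerts/config/checks/blacklist/single/web-1/_/serfHealth

The second command blacklists the node-level serfHealth check on web-1, which has no associated service, so _ is used as the serviceId.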

Events

Event handling is enabled by default. This delegates any consul event received by the agent to the list of handlers configured. To disable event handling, set consul-alerts/config/events/enabled to false.

Handlers can be configured by adding them to consul-alerts/config/events/handlers. This should be a JSON array of strings. Each string should point to an executable. The event data should be read from stdin.
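
A minimal sketch of configuring a handler via the KV HTTP API (the script path is a placeholder; any executable that reads the event JSON from stdin will work):

$ curl -X PUT -d '["/usr/local/bin/my-event-handler.sh"]' http://localhost:8500/v1/kv/consul-alerts/config/events/handlers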

Notifiers

There are several built-in notifiers. Only the Log notifier is enabled by default. Details on enabling and configuring these are documented for each Notifier.

Custom Notifiers

It is also possible to add custom notifiers similar to custom event handlers. Custom notifiers can be added as keys with command path string values in consul-alerts/config/notifiers/custom/. The keys will be used as notifier names in the profiles.
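
For example (a sketch; the notifier name and script path are placeholders):

$ curl -X PUT -d '/usr/local/bin/notify-teams.sh' http://localhost:8500/v1/kv/consul-alerts/config/notifiers/custom/teams

The name teams can then be referenced in a Notification Profile's NotifList like any built-in notifier.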

Logger

This logs any health check notification to a file. To disable this notifier, set consul-alerts/config/notifiers/log/enabled to false.

The log file is set to /tmp/consul-notifications.log by default. This can be changed by changing consul-alerts/config/notifiers/log/path.
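
Both settings can be changed through the KV HTTP API, for example (the log path is a placeholder):

$ curl -X PUT -d 'false' http://localhost:8500/v1/kv/consul-alerts/config/notifiers/log/enabled
$ curl -X PUT -d '/var/log/consul-alerts/notifications.log' http://localhost:8500/v1/kv/consul-alerts/config/notifiers/log/path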

Email

This emails the health notifications. To enable this, set consul-alerts/config/notifiers/email/enabled to true.

The email and SMTP details need to be configured:

prefix: consul-alerts/config/notifiers/email/

key description
enabled Enable the email notifier. [Default: false]
cluster-name The name of the cluster. [Default: "Consul Alerts"]
url The SMTP server url
port The SMTP server port
username The SMTP username
password The SMTP password
sender-alias The sender alias. [Default: "Consul Alerts"]
sender-email The sender email
receivers The emails of the receivers. JSON array of string
template Path to custom email template. [Default: internal template]
one-per-alert Whether to send one email per alert [Default: false]
one-per-node Whether to send one email per node [Default: false] (overridden by one-per-alert)

The template can be any Go HTML template. A TemplateData instance will be passed to the template.
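
Putting the table above together, a minimal sketch of an email setup written through the KV HTTP API (the SMTP host, port, sender, and receivers are placeholders; add the username and password keys if your SMTP server requires authentication):

$ curl -X PUT -d 'true' http://localhost:8500/v1/kv/consul-alerts/config/notifiers/email/enabled
$ curl -X PUT -d 'smtp.example.com' http://localhost:8500/v1/kv/consul-alerts/config/notifiers/email/url
$ curl -X PUT -d '587' http://localhost:8500/v1/kv/consul-alerts/config/notifiers/email/port
$ curl -X PUT -d 'alerts@example.com' http://localhost:8500/v1/kv/consul-alerts/config/notifiers/email/sender-email
$ curl -X PUT -d '["ops@example.com"]' http://localhost:8500/v1/kv/consul-alerts/config/notifiers/email/receivers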

InfluxDB

This sends the notifications as series points in InfluxDB. Set consul-alerts/config/notifiers/influxdb/enabled to true to enable it. The InfluxDB details need to be set too.

prefix: consul-alerts/config/notifiers/influxdb/

key description
enabled Enable the influxdb notifier. [Default: false]
host The influxdb host. (eg. localhost:8086)
username The influxdb username
password The influxdb password
database The influxdb database name
series-name The series name for the points

Slack

Slack integration is also supported. To enable, set consul-alerts/config/notifiers/slack/enabled to true. Slack details need to be configured.

prefix: consul-alerts/config/notifiers/slack/

key description
enabled Enable the Slack notifier. [Default: false]
cluster-name The name of the cluster. [Default: Consul Alerts]
url The incoming-webhook url (mandatory) [eg: https://hooks.slack.com...]
channel The channel to post the notification (mandatory) [eg: #consul-alerts or @consul-alerts]
username The username to appear on the post [eg: Consul Alerts]
icon-url URL of a custom image for the notification [eg: http://someimage.com/someimage.png]
icon-emoji Emoji (if not using icon-url) for the notification [eg: :ghost:]
detailed Enable "pretty" Slack notifications [Default: false]

In order to enable Slack integration, you have to create a new Incoming WebHook, then use the token created by that action.

Mattermost

Mattermost integration is also supported. To enable, set consul-alerts/config/notifiers/mattermost/enabled to true. Mattermost details need to be configured.

prefix: consul-alerts/config/notifiers/mattermost/

key description
enabled Enable the Mattermost notifier. [Default: false]
cluster-name The name of the cluster. [Default: "Consul Alerts"]
url The mattermost url (mandatory)
username The mattermost username (mandatory)
password The mattermost password (mandatory)
team The mattermost team (mandatory)
channel The channel to post the notification (mandatory)

Notifications can also be sent to Incoming Webhooks. To enable, set consul-alerts/config/notifiers/mattermost-webhook/enabled to true and set consul-alerts/config/notifiers/mattermost-webhook/url to the URL of the webhook created in the previous step.

prefix: consul-alerts/config/notifiers/mattermost-webhook/

key description
enabled Enable the Mattermost Webhook notifier. [Default: false]
cluster-name The name of the cluster. [Default: Consul Alerts]
url The incoming-webhook url (mandatory) [eg: https://mattermost.com/hooks/...]
channel The channel to post the notification (mandatory) [eg: consul-alerts]
username The username to appear on the post [eg: Consul Alerts]
icon-url URL of a custom image for the notification [eg: http://someimage.com/someimage.png]

PagerDuty

To enable the PagerDuty built-in notifier, set consul-alerts/config/notifiers/pagerduty/enabled to true. This is disabled by default. The service key and client details also need to be configured.

prefix: consul-alerts/config/notifiers/pagerduty/

key description
enabled Enable the PagerDuty notifier. [Default: false]
service-key Service key to access PagerDuty
client-name The monitoring client name
client-url The monitoring client url
max-retry The upper limit of retries on failure. [Default: 0 for no retries]
retry-base-interval The base delay in seconds before a retry. [Default: 30 seconds ]

HipChat

To enable the HipChat built-in notifier, set consul-alerts/config/notifiers/hipchat/enabled to true. HipChat details need to be configured.

prefix: consul-alerts/config/notifiers/hipchat/

key description
enabled Enable the HipChat notifier. [Default: false]
from The name to send notifications as (optional)
cluster-name The name of the cluster. [Default: "Consul Alerts"]
base-url HipChat base url [Default: https://api.hipchat.com/v2/]
room-id The room to post to (mandatory)
auth-token Authentication token (mandatory)

The auth-token needs to be a room notification token for the room-id being posted to. See HipChat API docs.

The default base-url works for HipChat-hosted rooms. You only need to override it if you are running your own server.

OpsGenie

To enable the OpsGenie built-in notifier, set consul-alerts/config/notifiers/opsgenie/enabled to true. OpsGenie details need to be configured.

prefix: consul-alerts/config/notifiers/opsgenie/

key description
enabled Enable the OpsGenie notifier. [Default: false]
cluster-name The name of the cluster. [Default: "Consul Alerts"]
api-key API Key (mandatory)

Amazon Web Services Simple Notification Service ("SNS")

To enable the AWS SNS built-in notifier, set consul-alerts/config/notifiers/awssns/enabled to true. AWS SNS details need to be configured.

prefix: consul-alerts/config/notifiers/awssns/

key description
enabled Enable the AWS SNS notifier. [Default: false]
cluster-name The name of the cluster. [Default: "Consul Alerts"]
region AWS Region (mandatory)
topic-arn Topic ARN to publish to. (mandatory)
template Path to custom template. [Default: internal template]

VictorOps

To enable the VictorOps built-in notifier, set consul-alerts/config/notifiers/victorops/enabled to true. VictorOps details need to be configured.

prefix: consul-alerts/config/notifiers/victorops/

key description
enabled Enable the VictorOps notifier. [Default: false]
api-key API Key (mandatory)
routing-key Routing Key (mandatory)

HTTP Endpoint

To enable the HTTP endpoint built-in notifier, set consul-alerts/config/notifiers/http-endpoint/enabled to true. HTTP endpoint details need to be configured.

prefix: consul-alerts/config/notifiers/http-endpoint/

key description
enabled Enable the http-endpoint notifier. [Default: false]
cluster-name The name of the cluster. [Default: "Consul Alerts"]
base-url Base URL of the HTTP endpoint (mandatory)
endpoint The endpoint to append to the end of base-url
payload The payload to send to the HTTP endpoint (mandatory)

The value of 'payload' must be a JSON map of strings. Each value will be rendered as a template.

{
  "message": "{{ range $name, $checks := .Nodes }}{{ range $check := $checks }}{{ $name }}:{{$check.Service}}:{{$check.Check}} is {{$check.Status}}.{{ end }}{{ end }}"
}

iLert

To enable iLert built-in notifier, set consul-alerts/config/notifiers/ilert/enabled to true. Service API key needs to be configured.

prefix: consul-alerts/config/notifiers/ilert/

key description
enabled Enable the iLert notifier. [Default: false]
api-key The API key of the alert source. (mandatory)
incident-key-template Format of the incident key. [Default: {{.Node}}:{{.Service}}:{{.Check}}]

Health Check via API

Health status can also be queried via the API. This can be used for compatibility with nagios, sensu, or other monitoring tools. To get the status of a specific check, use the following endpoint.

http://consul-alerts:9000/v1/health?node=<node>&service=<serviceId>&check=<checkId>

This will return the output of the check and the following HTTP codes:

Status   HTTP code
passing  200
warning  503
critical 503
unknown  404

http://consul-alerts:9000/v1/health/wildcard?node=<node>&service=<serviceId>&check=<checkId>&status=<status>&alwaysOk=true&ignoreBlacklist=true

v1/health/wildcard is similar to v1/health but returns all matched checks (omitted service/node/check params are treated as matching any). Values are returned in JSON form, with status code 503 if any of the matched checks is in critical state.

Additional params are ignoreBlacklist and alwaysOk; the latter forces the status code to 200 regardless of check status.
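
As a sketch (the node, service, and check IDs below are hypothetical), the endpoints can be exercised with curl; the -w option prints the HTTP status code that maps to the table above:

$ curl -s -o /dev/null -w '%{http_code}\n' 'http://consul-alerts:9000/v1/health?node=web-1&service=nginx&check=service:nginx'
$ curl -s 'http://consul-alerts:9000/v1/health/wildcard?service=nginx&ignoreBlacklist=true'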

Operations

Configuration may be set manually through the consul UI or API, using configuration management tools such as Chef, Puppet, or Ansible, or backed up and restored using consulate.

Consulate Example:

consulate kv backup consul-alerts/config -f consul-alerts-config.json
consulate kv restore consul-alerts/config -f consul-alerts-config.json --prune

Contribution

PRs are more than welcome. Just fork, create a feature branch, and open a PR. We love PRs. :)

TODO

Needs better doc and some cleanup too. :)

consul-alerts's People

Contributors

ch-acctg, dagvl, darkcrux, davidsiefert, dbresson, divolgin, dlackty, flosell, fpietka, fusiondog, gerrrr, gitter-badger, hexedpackets, hunter, jrnt30, ketzacoatl, lyrixx, matthewlowry, mboorstin, msupino, perlence, pharaujo, roman-vynar, rrreeeyyy, stephenweber, they4kman, timurstrekalov, tsaridas, tuannh99, udzura


consul-alerts's Issues

No warning when alert-addr is already in use

I started the consul-alert daemon with no error, but when I ran the consul watch consul-alert watch checks command, it returned an error:

INFO[0005] consul-alert daemon is not running.

Then I searched for the error in the source code and found that this error occurs when http://consul-alert/v1/info is not available.

Then I tried telnet localhost 9000 and could connect to that port; finally I found that PHP-FPM was taking that port, so the consul-alert daemon had not started completely.

So my question is: if the port is taken and the daemon cannot bootstrap because of that, why is there no error or warning? Is that a bug or a special case with PHP-FPM?

Context for notifications

Nagios-herald (https://github.com/etsy/nagios-herald) is a nice idea for adding context to email notifications sent out from Nagios. It might make sense to add a similar feature to consul-alerts.

Initial thoughts are to include a graphite query URL (with variable substitution?) to the consul KV with specific relation to the checks. That way we can pull out a window of metrics as an image and include in the notification emails.

Feedback and ideas welcome!

Node blacklist does not work as expected

  • During our deploy, I blacklist all nodes. And after the deploy, I un-blacklist the nodes.
  • I have several services that have checks with TTL.
  • During the deploy, these services are down

And so, when I re-enable the monitoring, I get N notifications like that:

insight-d1.consumer-0:insight.analysis_feedback:Service 'insight.analysis_feedback' check is passing.
insight-d1.consumer-0:insight.analysis_feedback:Service 'insight.analysis_feedback' check is passing.
insight-d1.consumer-0:insight.analysis_feedback_collector:Service 'insight.analysis_feedback_collector' check is passing.
insight-d1.consumer-0:insight.analysis_kill:Service 'insight.analysis_kill' check is passing.
insight-d1.consumer-0:insight.analysis_new:Service 'insight.analysis_new' check is passing.
insight-d1.consumer-0:insight.comment_add:Service 'insight.comment_add' check is passing.
insight-d1.consumer-0:insight.mail:Service 'insight.mail' check is passing.
insight-d1.consumer-0:insight.notification:Service 'insight.notification' check is passing.
insight-d1.consumer-0:insight.scm_hook:Service 'insight.scm_hook' check is passing.

I should not, because these nodes were down only during the deploy. Moreover, I never get the "service is down" notification.

Some failures are not reported

I upgraded to v0.3, and now, when I shut down a service (nginx), I don't get an alert, whereas I got one with v0.2.

Basically, I have one node A:

root@frontend-0:/home/ubuntu$ cat /etc/consul.d/check-http-200.json 
{
  "check": {
    "name": "http-200",
    "script": "curl -si --fail http://127.0.0.1:8080/monitoring/200",
    "interval": "1s"
  }
}

When I stop nginx, consul detects this failure correctly:

# consul monitor // on node A
2015/02/20 14:19:08 [WARN] Check 'http-200' is now critical
2015/02/20 14:19:09 [WARN] Check 'http-200' is now critical

On node B, there is:
consul in server mode
consul-ui
consul-alerts

consul also detects the failure, but consul-alerts does not.
There is nothing special in the logs (neither /tmp/consul-notifications.log nor /var/log/consul-alerts.log (process STDOUT and STDERR)).

Leader election can hang

Unfortunately I have not been able to reproduce this yet, but I have a three-node Consul cluster where I am running one instance of consul-alerts on the same server as each Consul node, and the cluster got into a state where the node whose host name was in the consul-alert/leader key believed it was not the leader.

Restarting consul-alerts on the node in question did not appear to resolve the issue, but restarting the consul node did, so I'm not sure if this is a consul-alerts issue or a consul issue or something else.

A couple of observations regardless where the problem is:

Some mechanism is needed to verify that the leader is still live, and not just sufficiently undead to hold on to the lock.

One approach is to force leadership elections every N minutes, and to make the other nodes alert if the current leader does not relinquish the lock as part of that. Another alternative is for the other nodes to register health checks against the alert endpoint of the leader, and alert if those health checks fail, even when they're not the leader.

An alternative might be for the cluster leader to provide a heartbeat by regularly updating an un-locked key, and for the other members to verify that the key has been recently updated and otherwise alert.

Unable to Build

Disclaimer I'm very new to Go, so this might be user error.

Since I didn't see a pre-built release anywhere (if there is one, please let me know), I'm attempting to build per the README, but running into the error below:

I'm attempting to use the official golang image since I don't have anything installed on my machine for Go Development

docker run -it golang bash
root@fcfab98eb5e1:/go# git clone https://github.com/AcalephStorage/consul-alerts.git

root@fcfab98eb5e1:/go# cd consul-alerts/

root@fcfab98eb5e1:/go/consul-alerts# make deps
--> Getting Dependencies
downloading github.com/armon/consul-api
downloading github.com/docopt/docopt-go
downloading github.com/Sirupsen/logrus

root@fcfab98eb5e1:/go/consul-alerts# make install-global
--> Testing application
check-handler.go:11:2: cannot find package "github.com/AcalephStorage/consul-alerts/consul" in any of:
    /usr/src/go/src/pkg/github.com/AcalephStorage/consul-alerts/consul (from $GOROOT)
    /go/consul-alerts/_vendor/src/github.com/AcalephStorage/consul-alerts/consul (from $GOPATH)
    /go/consul-alerts/src/github.com/AcalephStorage/consul-alerts/consul
    /go/src/github.com/AcalephStorage/consul-alerts/consul
check-handler.go:12:2: cannot find package "github.com/AcalephStorage/consul-alerts/notifier" in any of:
    /usr/src/go/src/pkg/github.com/AcalephStorage/consul-alerts/notifier (from $GOROOT)
    /go/consul-alerts/_vendor/src/github.com/AcalephStorage/consul-alerts/notifier (from $GOPATH)
    /go/consul-alerts/src/github.com/AcalephStorage/consul-alerts/notifier
    /go/src/github.com/AcalephStorage/consul-alerts/notifier
notifier/influxdb-notifier.go:5:2: cannot find package "github.com/influxdb/influxdb/client" in any of:
    /usr/src/go/src/pkg/github.com/influxdb/influxdb/client (from $GOROOT)
    /go/consul-alerts/_vendor/src/github.com/influxdb/influxdb/client (from $GOPATH)
    /go/consul-alerts/src/github.com/influxdb/influxdb/client
    /go/src/github.com/influxdb/influxdb/client
gom:  exit status 1
Makefile:26: recipe for target 'test' failed
make: *** [test] Error 1

Did I do something wrong, or are the docs missing a step?
Thanks

Unable to use slack notification

I'm unable to set up Slack notifications.
Whatever I do, the log output is always

INFO[0015] Slack notification sent.

(even when I try with a wrong token).

So it's not easy to debug...

VictorOps notifier

Great tool to monitor our infrastructure!

We are interested in a VictorOps notifier. Are there any plans to support this?
If not, are any other people interested?

Scale of consul-alerts

Does anyone have any numbers on the number of servers and service check alerts that consul-alerts can process? We have an infrastructure with a four-digit number of servers, each running about 8-10 service checks.

I was curious if anyone had experienced any latency in processing and receiving the notification for either the down host or service. Currently consul-alerts takes approximately half an hour to an hour just to start up. Is this normal?

Does consul-alerts provide any metrics that we can extract to show its performance?

No notification is triggered if unable to retrieve health check status

If the Consul connection fails or consul-alerts is otherwise unable to retrieve the health check status, it does two things that IMHO are broken:

  • It doesn't appear to trigger a notification.
  • It carries out backoff so that if the cluster comes back online, the longer it has been offline, the longer it will take before the status change is noticed.

I'm not sure yet if this is down to "consul watch" or "consul-alerts watch checks"

No Alerts

I haven't actually managed to get this to give me any alerts as yet.

I've entered in my email details, enabled it, and when my services go critical, I still get no email. Any idea what I might be doing wrong?

hipchat integration, question

Hi,

Great idea so far. Just wondering if you can confirm what the base-url should be; I've tried a few combinations but none seem to be working so far :(

Thanks,
Matt

Alerts on Warning status as well as Critical status

Hi,
I'm using the email notification and so far am only able to get messages about Critical statuses.

Is there an option for Warning status notifications as well?

I might have overlooked where to set that option, any tips much appreciated.

Thank you.

Discussion: HA

So we were talking internally today about how best to ensure the alerts service is "always up". There are clearly lots of ways to do this, but here is what we came up with, and I'm interested in whether a similar approach makes sense to be supported by the service itself.

Goals

  • We don't want a dead instance/node to result in us losing alerting for any duration of time. A few seconds or a minute is OK.
  • We'd like to avoid the traditional (AKA complicated) HA solutions involving things like Pacemaker and friends.
  • We don't want multiple alerts fired for a single state change, i.e. if we have 5 consul servers and a service goes into error, we only want one email.

Current Direction
We want to install the alerts daemon on every consul server. We are then going to drop a small cron script that is going to check whether the current node is the leader or not. In all cases the daemon will continue to run; however, if the node is not the current leader we are going to disable the watch. So effectively we'll have 4 "warm standbys", and should the leader fail, we'll let Consul do its thing around re-election and piggyback on that.

We are using Chef to manage the daemon, but we don't converge Chef frequently enough for this to be effective. Similar logic would exist in the Chef recipe so Chef wasn't fighting with the cron script.

Awesomeness
If there were some sort of leader_mode flag that could be passed to consul-alerts at startup, this would be perfect for our use case. The daemon would continue to run and the watch would stay in place on all the servers; however, when consul-alerts sees it's running on the same node as the master, it would process alerts, and if it's not running on the master it would simply be a noop.

I'm totally open to any other ideas folks might have on how best to pull off the goals.

Thanks

How to run --watch-checks from docker container?

When using --watch-checks / --watch-events during a container run, the logs say:

$ docker run --name consul-alerts -d acaleph/consul-alerts start --consul-addr=localhost:8500 --consul-dc=dc1 --watch-checks
time="2015-03-29T22:34:59Z" level="info" msg="Shutting down watcher --> Something went wrong running consul watch:  exec: \"consul\": executable file not found in $PATH"

Unable to send mail using GMail SMTP; TLS issue?

Hello

First, thanks for this great tool!
It seems I managed to set up everything, and in the logs I could observe check notifications happening. But when I tried to use the mail notifier with GMail SMTP, I received no email, and in the log I typically see:

INFO[0086] Processing health checks for notification.   
INFO[0213] Unable to send notification: dial tcp 64.233.171.108:587: connection timed out

(64.233.171.108 is the IP of smtp.gmail.com)

Sorry, I know very little about Golang to debug...
But I think it may be because GMail SMTP forces you to use TLS.
Here is an example how they do it in Python for instance http://segfault.in/2010/12/sending-gmail-from-python/

Especially see session.starttls() before login and sendmail.

So maybe we too should use StartTLS() in your email-notifier before auth, here:
https://github.com/AcalephStorage/consul-alerts/blob/master/notifier/email-notifier.go#L86 ?

Again, this is pure extrapolation, but maybe we could have a KV to force TLS, and if this option is true, then you call StartTLS().

I'm sorry, in fact I was unable to build the image because of my lack of Golang knowledge. It would be cool if you could have a section explaining how to build the project (I'm using Ubuntu). I typically get this error when doing go install:

root@cloudmaster:~/golang/src/github.com/AcalephStorage/consul-alerts# go install                                                                                      
check-handler.go:14:2: cannot find package "github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/Sirupsen/logrus" in any of:
        /root/.gvm/gos/go1.3.3/src/pkg/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/Sirupsen/logrus (from $GOROOT)
        /root/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/Sirupsen/logrus (from $GOPATH)
consul-alerts.go:17:2: cannot find package "github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/docopt/docopt-go" in any of:
        /root/.gvm/gos/go1.3.3/src/pkg/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/docopt/docopt-go (from $GOROOT)
        /root/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/docopt/docopt-go (from $GOPATH)
check-handler.go:11:2: cannot find package "github.com/AcalephStorage/consul-alerts/consul" in any of:
        /root/.gvm/gos/go1.3.3/src/pkg/github.com/AcalephStorage/consul-alerts/consul (from $GOROOT)
        /root/src/github.com/AcalephStorage/consul-alerts/consul (from $GOPATH)
check-handler.go:12:2: cannot find package "github.com/AcalephStorage/consul-alerts/notifier" in any of:
        /root/.gvm/gos/go1.3.3/src/pkg/github.com/AcalephStorage/consul-alerts/notifier (from $GOROOT)
        /root/src/github.com/AcalephStorage/consul-alerts/notifier (from $GOPATH)

If you help me build I could test the TLS thing for you...

If starting consul-alerts when Consul is starting, consul-alerts seemingly gets stuck without exiting *or* alerting

This happened when starting consul-alerts (in Docker container built with "go get ..." based on the "golang" container) before Consul was fully initialized on the host:

root@1dab0b580258:/go/bin# /go/bin/consul-alerts start --alert-addr=0.0.0.0:9000 --consul-addr=192.168.0.44:8500
INFO[0000] Checking consul agent connection...
INFO[0000] Unable to load custom config, using default instead: Unexpected response code: 500
INFO[0000] Consul Alerts daemon started
INFO[0000] Consul Alerts Host: 1dab0b580258
INFO[0000] Consul Agent: 192.168.0.44:8500
INFO[0000] Consul Datacenter: dc1 
WARN[0000] Unable to retrieve list of sessions.
ERRO[0000] Unable to create new sessions: Unexpected response code: 500 (No cluster leader)
ERRO[0000] Failed to run Consul KV Acquire: Unexpected response code: 500 (No cluster leader)

At this point it hung until I ctrl-c'd it. On restart, it then started fine.

Consul-alert panics

time="2015-02-20T13:58:42Z" level=warning msg="Unable to retrieve node name."
panic: interface conversion: interface is nil, not string

goroutine 13 [running]:
github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper.(*Candidate).retrieveNode(0xc20800d900)
        /usr/local/go/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper/skipper.go:141 +0x257
github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper.(*Candidate).campaign(0xc20800d900)
        /usr/local/go/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper/skipper.go:99 +0x36
github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper.(*Candidate).campaign(0xc20800d900)
        /usr/local/go/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper/skipper.go:131 +0x994
github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper.(*Candidate).campaign(0xc20800d900)
        /usr/local/go/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper/skipper.go:131 +0x994
created by github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper.(*Candidate).RunForElection
        /usr/local/go/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper/skipper.go:46 +0x32

goroutine 1 [chan receive]:
main.daemonMode(0xc2080abc80)
        /home/ubuntu/.buildbox/Agent/acaleph/consul-alerts/consul-alerts.go:110 +0x111e
main.main()
        /home/ubuntu/.buildbox/Agent/acaleph/consul-alerts/consul-alerts.go:49 +0x13c

goroutine 5 [syscall]:
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
        /usr/local/go/src/os/signal/signal_unix.go:27 +0x35
goroutine 48 [select]:
net/http.(*persistConn).writeLoop(0xc208064580)
        /usr/local/go/src/net/http/transport.go:945 +0x41d
created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:661 +0xcbc

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 47 [IO wait]:
net.(*pollDesc).Wait(0xc2080b2fb0, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc2080b2fb0, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc2080b2f50, 0xc208076000, 0x1000, 0x1000, 0x0, 0x7f0d378f6d30, 0xc2080b1ba8)
        /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc2080360c8, 0xc208076000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:121 +0xdc
net/http.noteEOFReader.Read(0x7f0d378f8460, 0xc2080360c8, 0xc2080645d8, 0xc208076000, 0x1000, 0x1000, 0x756ea0, 0x0, 0x0)
        /usr/local/go/src/net/http/transport.go:1270 +0x6e
net/http.(*noteEOFReader).Read(0xc208080fc0, 0xc208076000, 0x1000, 0x1000, 0xc208012000, 0x0, 0x0)
        <autogenerated>:125 +0xd4
bufio.(*Reader).fill(0xc20814aa20)
        /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Peek(0xc20814aa20, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/bufio/bufio.go:132 +0xf0
net/http.(*persistConn).readLoop(0xc208064580)
        /usr/local/go/src/net/http/transport.go:842 +0xa4
created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:660 +0xc9f

goroutine 14 [syscall]:
syscall.Syscall6(0x3d, 0x3cfd, 0xc2080c4c14, 0x0, 0xc208063320, 0x0, 0x0, 0x7f0d378e5000, 0x40a0b1, 0xc20814a380)
        /usr/local/go/src/syscall/asm_linux_amd64.s:46 +0x5
syscall.wait4(0x3cfd, 0xc2080c4c14, 0x0, 0xc208063320, 0x90, 0x0, 0x0)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:124 +0x79
syscall.Wait4(0x3cfd, 0xc2080c4c5c, 0x0, 0xc208063320, 0x0, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall_linux.go:224 +0x60
os.(*Process).wait(0xc2081455c0, 0x0, 0x0, 0x0)
        /usr/local/go/src/os/exec_unix.go:22 +0x103
os.(*Process).Wait(0xc2081455c0, 0xc208065080, 0x0, 0x0)
        /usr/local/go/src/os/doc.go:45 +0x3a
os/exec.(*Cmd).Wait(0xc208066f00, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:364 +0x23c
os/exec.(*Cmd).Run(0xc208066f00, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:246 +0x71
main.runWatcher(0xc20800c53e, 0xe, 0xc20800c67d, 0x3, 0x822f30, 0x6)
        /home/ubuntu/.buildbox/Agent/acaleph/consul-alerts/watchers.go:26 +0x261
created by main.daemonMode
        /home/ubuntu/.buildbox/Agent/acaleph/consul-alerts/consul-alerts.go:93 +0xe20

goroutine 15 [chan receive]:
main.processEvents()
        /home/ubuntu/.buildbox/Agent/acaleph/consul-alerts/event-handler.go:42 +0x72
created by main.daemonMode
        /home/ubuntu/.buildbox/Agent/acaleph/consul-alerts/consul-alerts.go:99 +0xe8f

goroutine 16 [chan receive]:
main.processChecks()
        /home/ubuntu/.buildbox/Agent/acaleph/consul-alerts/check-handler.go:51 +0x47
created by main.daemonMode
        /home/ubuntu/.buildbox/Agent/acaleph/consul-alerts/consul-alerts.go:100 +0xea0

goroutine 18 [IO wait]:
net.(*pollDesc).Wait(0xc20814c6f0, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20814c6f0, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc20814c690, 0x0, 0x7f0d378f6d30, 0xc2081514c0)
        /usr/local/go/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc208036540, 0x48cede, 0x0, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:234 +0x4e
net/http.tcpKeepAliveListener.Accept(0xc208036540, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1976 +0x4c
net/http.(*Server).Serve(0xc20814a2a0, 0x7f0d378f8c68, 0xc208036540, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1728 +0x92
net/http.(*Server).ListenAndServe(0xc20814a2a0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1718 +0x154
net/http.ListenAndServe(0xc2080522ec, 0xe, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1808 +0xba
created by main.daemonMode
        /home/ubuntu/.buildbox/Agent/acaleph/consul-alerts/consul-alerts.go:106 +0xf76

Consul-alerts crashes if Consul dies

Rather than triggering an alert, when killing Consul while consul-alerts is running in a Docker container (just built based on the golang Docker container, using "go get ..."), I get this:

Prior to taking down Consul:

time="2015-01-27T17:07:30Z" level="info" msg="Checking consul agent connection..."
time="2015-01-27T17:07:30Z" level="info" msg="Consul Alerts daemon started"
time="2015-01-27T17:07:30Z" level="info" msg="Consul Alerts Host: 1dab0b580258"
time="2015-01-27T17:07:30Z" level="info" msg="Consul Agent: 192.168.0.44:8500"
time="2015-01-27T17:07:30Z" level="info" msg="Consul Datacenter: dc1"
time="2015-01-27T17:07:50Z" level=warning msg="Unable to retrieve node name." 

Consul gets killed here. Few seconds later:

panic: interface conversion: interface is nil, not string

goroutine 11 [running]:
github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper.(*Candidate).retrieveNode(0xc20802e960)
        /go/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper/skipper.go:141 +0x257
github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper.(*Candidate).campaign(0xc20802e960)
        /go/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper/skipper.go:99 +0x36
github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper.(*Candidate).campaign(0xc20802e960)
        /go/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper/skipper.go:131 +0x994
created by github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper.(*Candidate).RunForElection
        /go/src/github.com/AcalephStorage/consul-alerts/Godeps/_workspace/src/github.com/darkcrux/consul-skipper/skipper.go:46 +0x32

goroutine 1 [chan receive]:
main.daemonMode(0xc2080c3c80)
        /go/src/github.com/AcalephStorage/consul-alerts/consul-alerts.go:110 +0x111e
main.main()
        /go/src/github.com/AcalephStorage/consul-alerts/consul-alerts.go:49 +0x13c

goroutine 5 [syscall]:
os/signal.loop()
        /usr/src/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
        /usr/src/go/src/os/signal/signal_unix.go:27 +0x35

goroutine 12 [chan receive]:
main.processEvents()
        /go/src/github.com/AcalephStorage/consul-alerts/event-handler.go:42 +0x72
created by main.daemonMode
        /go/src/github.com/AcalephStorage/consul-alerts/consul-alerts.go:99 +0xe8f

goroutine 13 [chan receive]:
main.processChecks()
        /go/src/github.com/AcalephStorage/consul-alerts/check-handler.go:51 +0x47
created by main.daemonMode
        /go/src/github.com/AcalephStorage/consul-alerts/consul-alerts.go:100 +0xea0

goroutine 14 [IO wait]:
net.(*pollDesc).Wait(0xc2080bfc60, 0x72, 0x0, 0x0)
        /usr/src/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc2080bfc60, 0x0, 0x0)
        /usr/src/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc2080bfc00, 0x0, 0x7f4682481d30, 0xc2080c7088)
        /usr/src/go/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc20802c1b0, 0x48c824, 0x0, 0x0)
        /usr/src/go/src/net/tcpsock_posix.go:234 +0x4e
net/http.tcpKeepAliveListener.Accept(0xc20802c1b0, 0x0, 0x0, 0x0, 0x0)
        /usr/src/go/src/net/http/server.go:1976 +0x4c
net/http.(*Server).Serve(0xc208054de0, 0x7f46824838b0, 0xc20802c1b0, 0x0, 0x0)
        /usr/src/go/src/net/http/server.go:1728 +0x92
net/http.(*Server).ListenAndServe(0xc208054de0, 0x0, 0x0)
        /usr/src/go/src/net/http/server.go:1718 +0x154
net/http.ListenAndServe(0x7fff18f9eed8, 0xc, 0x0, 0x0, 0x0, 0x0)
        /usr/src/go/src/net/http/server.go:1808 +0xba
created by main.daemonMode
        /go/src/github.com/AcalephStorage/consul-alerts/consul-alerts.go:106 +0xf76

Better Logging

Cause everyone loves better logs. :)

Just need to remove confusing logs and add better error logs.

Question: how many consul-alerts instances should be running

Hi,

I have consul-alerts up and running OK, very pleased!

I'm quite new to Consul so still figuring out the finer details and best practices.

One question I have is how many instances of consul-alerts should be running?

In my case I have a 7 node cluster.

3 x consul servers (key/value)
4 x consul clients

Each client is running the UI web admin.

All 7 consul instances run a mix of custom services/checks and nagios plugin checks.

Should consul-alerts run on:

a) all 7 nodes
b) the 4 client nodes
c) the 3 server nodes
d) single instance?
e) something else?

Any advice here or suggestions would be much appreciated.

If I'm on the wrong track altogether with my consul architecture do please point that out.

Thanks!

Per-Service Threshold

Currently, it looks like the threshold level is a global setting for all services. Obviously this is a good idea for flapping services. However, I'd like to see it become far more granular. Not all services are equal and some services may flap quicker than others.

So maybe something like this in the KV store:

CONSUL-ALERTS/CONFIG/CHECKS/CHANGE-THRESHOLD/SERVICE/WARN.. PASS.. FAIL

Thoughts?

Custom configuration by CLI

Add support for loading a default custom configuration via command/api instead of manually editing consul's KV

multi-data center configuration

In my use case I have consul running in two data centers. I'd like to configure the SMTP server to use for email notifications to be one that is local to that DC. It seems that since the K/V store is "global", I can't populate a data center specific configuration. My consul-alert daemons in each DC pick up the same configuration. Are there plans to support something like this?

For right now, I'm just relying on some host file hackery to accomplish this, but I think it would be great if it could be supported.

Support more feature in the API

Right now, it's not easy to build a dashboard to expose all check statuses.
You could say I have to look at the status directly in consul. But I prefer
to read the data in consul-alerts because:

  • I'm able to blacklist some checks
  • consul-alerts does not use the exact check status, but a "mean" of the last N seconds.

So it would be nice to have more endpoints:

  • list of services with the "state"
  • list of nodes with the "state"
  • Be able to filter by state on all endpoints
  • ....

What do you think?

No notification when consul leader down

I'm trying consul-alerts with 3 consul servers.
When a server node which is not the leader is down, a Serf Health Status failure is notified.
But when the leader node is down, no notification occurs.
Is it a bug?

Add support for enable/disable some checks

I have a lot of checks in my DC, but I don't want every check to trigger an alert. Some checks are not so important (RAM usage, load usage); they are here just for a quick overview of our DC, and to quickly find an error when something goes wrong.

So, I see at least two options to configure this:

  1. Add a black-list of check names: consul-alerts/config/checks/black-list : ["load_usage", "memory_usage"].
  2. By check on node: consul-alerts/checks/consumer-0.insight-d1/_/disk_usage_disable: true

I think the first option better fits my use case, because I want to ignore all (load|memory)_usage checks. But the second one is more granular.

What do you think?

Some thoughts about consul-alerts

We have been using consul-alerts for a few weeks now, and I want to share my thoughts.

I like it, but I think there is room for improvement.

I like the notification system. It's simple and efficient. We planned to build a full HTML+JS application to manage it, but there is no ETA for it. So if someone has already started / finished one, it would be awesome.

I like the black list feature

I like the retention system to avoid flapping. But I think it could be more fine-tunable, because we are running some checks every 5 minutes (like zip integrity), so they do not work well with the consul-alerts retention.

I don't like the notification messages. We have about 40 nodes, and on each node from 3 to 30 checks. So now it's really hard to have a good overview of the macro and micro state of the cluster.
For example, in my mailbox I have these email subjects sorted by date:

SensioLabs is HEALTHY
SensioLabs is HEALTHY
SensioLabs is UNSTABLE
SensioLabs is HEALTHY
SensioLabs is UNSTABLE
SensioLabs is UNSTABLE
SensioLabs is HEALTHY
SensioLabs is HEALTHY

So I have to open each mail to know which node is OK or KO. Then, I have to remember all KO checks, because when I get a new mail, consul-alerts lists only the nodes whose state changed.

So I propose something like:

  • in each mail / notification, list all checks in a warning or critical state
  • in each mail / notification, list all checks in a healthy state in comparison with the previous mail / notification

To not break the BC, we could use a flag to enable / disable this feature.

No Emails Received if only one address fails

I've got everything up and running and am liking things so far. One issue I noticed is that I seem to not get any emails if there is a failure on just one address. I'm using Amazon SES and have two email addresses configured under receivers. One is a verified email address and one is not. I see the following message logged (odd that it's an info rather than an error?): level="info" msg="Unable to send notification: 554 Message rejected: Email address is not verified." and I do not get the email notification. If I remove the unverified email from the array, I do get the email as expected.

It might also be helpful if it logged the recipient it was trying to send to as part of the message

config.HttpClient.Timeout

Hi,

I'm very new to Go (haven't used it at all before) but, I have Consul up and running OK.

I am having an error installing consul-alerts - I could be on a wrong version or missing a dependency.

I'm on ubuntu 14.04.

$ go version
go version go1.2.1 linux/amd64
$ go get github.com/AcalephStorage/consul-alerts
# github.com/AcalephStorage/consul-alerts/consul
.go/src/github.com/AcalephStorage/consul-alerts/consul/client.go:33: config.HttpClient.Timeout undefined (type *http.Client has no field or method Timeout)

Can you see where I'm going wrong?

Thanks.

Consul-alerts fails if Consul cluster is unavailable on startup

If I specify a --consul-addr pointing to a Consul node that is down, I get:

INFO[0000] Cluster has no leader or is unreacheable. Get http://10.0.42.1:8500/v1/status/leader?dc=dc1: dial tcp 10.0.42.1:8500: connection refused

.. and consul-alerts dies. This at the very least ought to be documented behaviour. However I don't think it's very nice behaviour even if documented.

I'd propose the following:

  • Allow "bootstrapping" notifier configuration via a JSON config file or command line options to be able to obtain a basic configuration even if Consul can't be reached.
  • Allow specifying multiple consul nodes on command line and try each before bailing out.

question on email notifier

I've got things up and running and really digging the project so far, thanks for releasing this!.
I had a question on the email notifier. I was surprised to see a subject that indicated the system was healthy even though 100% of checks (all of one right now) were in error. I've included a screenshot below.

Is this expected? What are the criteria around what the subject says? Thanks again!

[screenshot]

Do not query consul too often when consul replies with a 500

Right now, if the consul cluster has no elected leader yet,
consul-alerts is going to try to query consul a lot.

It's not harmful, unless you collect consul's logs.
IMHO, there is no need for this intensive querying.
1 req / sec should be enough.

What do you think?

question on logging statement

What does this log line actually indicate? level="info" msg="replacing health check request queue." It's not causing me any problems (I don't think), just curious what it means.
