samber / awesome-prometheus-alerts
🚨 Collection of Prometheus alerting rules
Home Page: https://samber.github.io/awesome-prometheus-alerts/
License: Other
I collect NFS stats using node-exporter and I want to monitor them, but I have no idea how to set the thresholds, for example for NFS request latency. Can someone give me advice about thresholds for NFS stats?
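A possible starting point, assuming the node-exporter mountstats collector is enabled and exposes per-operation request-time and request counters (the metric names below come from that collector; verify them against your node-exporter version's /metrics output). Dividing time spent by operations completed gives average latency per request, the same pattern the disk latency rule further down this page uses; the 100ms threshold is only a placeholder to tune:
- alert: NfsRequestLatencyHigh
  expr: rate(node_mountstats_nfs_operations_request_time_seconds_total[5m]) / rate(node_mountstats_nfs_operations_requests_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "NFS request latency high (instance {{ $labels.instance }})"
    description: "Average NFS request latency is above 100ms\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"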
kubernetes-cadvisor is a process that runs in the kubelet. Consider changing the query for ContainerCpuUsage from
(sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
to
(sum(rate(container_cpu_usage_seconds_total{job!="kubernetes-cadvisor"}[3m])) BY (instance, name) * 100) > 80
so that the alert won't fire constantly for cadvisor?
Hi,
Thanks for this great overview!
I am testing the rules and I have a question about it.
I just implemented these rules: https://awesome-prometheus-alerts.grep.to/rules#docker-containers. They are firing, but I am not sure the query is correct.
The queries in question all use sum by (ip), but those metrics do not have any ip label on them.
I am running the following versions:
Are there some requirements to get the ip label, or should it be changed to instance?
Is there any standard or agreed meaning for the severity levels?
I can't see anywhere that documents it - apologies if it exists.
Since node_exporter 0.16, all metrics measuring a size in bytes have had "_bytes" appended to their original names.
E.g. node_filesystem_free became node_filesystem_free_bytes
I have data from postgres-exporter: https://github.com/socialwifi/docker-postgres-exporter/blob/master/queries.yaml#L1
But it also returns a very big value for the master node, and I don't know how to filter it out at the Alertmanager level.
604800 > pg_replication_lag > 10
Looks very stupid, but works fine, assuming alerts should fire and be fixed within one week.
However, the master node can change at any time, and the replication lag reading for the master will be lower during the change, so I keep looking for a better solution.
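One possible refinement, assuming the exporter also exposes a pg_replication_is_replica gauge (1 on replicas, 0 on the master), is to restrict the alert to replicas instead of bounding the lag from above; a variant of this appears further down this page:
(pg_replication_lag > 10) and ON(instance) (pg_replication_is_replica == 1)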
Hi,
An idea to add for Elasticsearch:
- alert: ElasticsearchDiskUsageTooHigh
  expr: (1 - (elasticsearch_filesystem_data_available_bytes{} / elasticsearch_filesystem_data_size_bytes{})) * 100 > 90
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Elasticsearch Disk Usage Too High (instance {{ $labels.instance }})"
    description: "The disk usage is over 90% for 5m\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ElasticsearchDiskUsageWarning
  expr: (1 - (elasticsearch_filesystem_data_available_bytes{} / elasticsearch_filesystem_data_size_bytes{})) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Elasticsearch Disk Usage warning (instance {{ $labels.instance }})"
    description: "The disk usage is over 80% for 5m\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I recently checked the alerts for our cluster and found that some of them would never fire, because the metric we wanted to be alerted on was not exported at all.
I tackled it by adding an alert that fires on "absent" values:
expr: absent(up{job="myjob"})
Might it be helpful to include information about this issue in general?
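For reference, a minimal sketch of a complete rule around that expression, in the same format as the other alerts on this page (the job name myjob is a placeholder):
- alert: JobMetricsAbsent
  expr: absent(up{job="myjob"})
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Metrics absent for job myjob"
    description: "No up series is being scraped for job myjob, so alerts based on its metrics can never fire.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"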
This is a good way to be notified in case of an intrusion, a DDoS attack, or just a dumb user doing something stupid with a server.
If average traffic over the last 5 min. increases a lot compared to the last 5h, you get an alert.
(sum by (instance) (rate(node_network_transmit_bytes_total[5m]))) / (sum by (instance) (rate(node_network_transmit_bytes_total[5h])))
(sum by (instance) (rate(node_network_receive_bytes_total[5m]))) / (sum by (instance) (rate(node_network_receive_bytes_total[5h])))
(Note: rate() is used here rather than avg_over_time() on the raw counters, since node_network_*_bytes_total are counters and averaging their ever-increasing raw values would not measure throughput.)
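A sketch of how this could be packaged as a rule, assuming an arbitrary 5x threshold (recent throughput five times the 5-hour average); tune the factor to your own baseline:
- alert: HostUnusualNetworkTraffic
  expr: (sum by (instance) (rate(node_network_transmit_bytes_total[5m]))) / (sum by (instance) (rate(node_network_transmit_bytes_total[5h]))) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Unusual network traffic (instance {{ $labels.instance }})"
    description: "Transmit throughput over the last 5m is more than 5x the 5h average\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"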
Looking at this alert:
- alert: MysqlRestarted
  expr: mysql_global_status_uptime < 60
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MySQL restarted (instance {{ $labels.instance }})"
    description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
we need to change either for or expr, because mysql_global_status_uptime is in seconds and it won't stay < 60 for 5 minutes.
I'm very new to Prometheus, and I have these two ideas:
1.
expr: mysql_global_status_uptime < 70
for: 1m
This will fire an alert almost immediately after a MySQL restart.
2.
expr: mysql_global_status_uptime < 310
for: 5m
This will fire after 5 minutes.
I would personally go with the first option, but let me know if I'm mistaken about anything here.
The query of this alert (awesome-prometheus-alerts/_data/rules.yml, line 365 in 951d801) should be (note the added parentheses):
(pg_replication_lag > 10) and ON(instance) (pg_replication_is_replica == 1)
I use the helm chart prometheus-9.1.0.
The alert rule:
- alert: CorednsPanicCount
  expr: increase(coredns_panic_count_total[10min]) > 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "CoreDNS Panic Count (instance {{ $labels.instance }})"
    description: "Number of CoreDNS panics encountered\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
seems to be incorrect. I see this in the log:
level=error ts=2019-09-13T15:07:29.344Z caller=main.go:757 msg="Failed to apply
\"CorednsPanicCount\": could not parse expression: parse error at char 36: bad duration syntax: \"10mi\""
Strangely, it looks like 10min becomes 10mi in the log.
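The cause is that 10min is not a valid PromQL duration (valid units are ms, s, m, h, d, w, and y), which is presumably why the error message shows the truncated 10mi. The fix is simply:
expr: increase(coredns_panic_count_total[10m]) > 0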
Copying and pasting these rules is OK, but it would be easier if there were a generated set of rules files from rules.yml that would "just work".
Is there any reason this isn't generated? If not, are there any templating tools that anyone knows of that could achieve this?
Hello, I do not understand why this alert divides time by total:
- alert: HostUnusualDiskReadLatency
  expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual disk read latency (instance {{ $labels.instance }})"
    description: "Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Example:
From:
- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
To:
- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
An effect field would enable us to improve the alert templates.
Hello and thanks for this great list of alerts!
I've been using them for a while on my Raspberry Pi cluster and have noticed that, although I pass --docker_only=true, for each metric there is one series with no name label (I don't know what it stands for), which can trigger alerts quite often. I name all my containers, and when I filter with {name!=""} I get just the metrics for the containers and no false alarms.
Should the cadvisor alerts filter out these metrics with no name label, or am I missing something here? What do they signify, especially since I run cadvisor with --docker_only=true?
Best regards
Panos
Hello,
The alert query for Kubernetes pod health needs to be fixed: kube-state-metrics sets the value for the pod's current phase to 1 (awesome-prometheus-alerts/_data/rules.yml, line 989 in c20227b).
Correct expression:
min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} == 1)[1h:])
- alert: OpenebsUsedPoolCapacity
  expr: openebs_used_pool_capacity_percent > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OpenEBS pool uses more than 80% of its capacity (instance {{ $labels.instance }})"
    description: "OpenEBS pool uses more than 80% of its capacity\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I think we should change this expr:
probe_http_status_code <= 199 OR probe_http_status_code >= 300
to:
probe_http_status_code <= 199 OR probe_http_status_code >= 400
because status codes 301 and 302 are HTTP-to-HTTPS redirects and are valid.
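Wrapped into a full rule in the style of the others on this page (a sketch; the status-code boundaries follow the suggestion above):
- alert: BlackboxProbeHttpFailure
  expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Blackbox probe HTTP failure (instance {{ $labels.instance }})"
    description: "HTTP status code is not in the 200-399 range\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"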
Hey,
Regarding awesome-prometheus-alerts/_data/rules.yml, lines 533 to 536 in 3040fe5, I recommend replacing the expression with something like this:
avg_over_time(probe_duration_seconds{env="Prod"}[15m]) > 1
What do others think? Should I open a PR?
Hello,
Do you maybe have updated RabbitMQ rules? Most of these are not working: the metrics specified in the expressions do not exist, so I have made similar ones (kind of). Any feedback is appreciated, thanks.
Do you think it would be interesting to translate all those rules from node_exporter to Netdata metrics, so that a different source can be used for the metrics?
I use Netdata as a metrics source for Prometheus myself, so I will translate the rules I consider interesting for my own use. I can share them if you think that could be useful.
Django-prometheus exporter: https://github.com/korfuri/django-prometheus
One thing that was slightly frustrating when copying these rules over to my deployment was that if you just want to chuck all the alerts into the relevant YAML file, you have to copy-paste each one individually.
Some sort of presentation that would let you grab a relevant exporter's rules in one solid chunk for ease of copypasta would be very welcome.
This may reduce load on prometheus instances 🔥
After adding alert 2.7, I noticed that I'm getting disk-fill alerts for the /run/user/1113 mountpoint, since the rule also applies to all tmpfs filesystems. Perhaps we should restrict the rule to ext4, xfs, etc., or add a not-equal condition to rule out tmpfs?
predict_linear(node_filesystem_free_bytes{fstype=~"ext4|xfs"}[1h], 4 * 3600) < 0
Exporter => https://github.com/prometheus/mysqld_exporter
I'm currently testing some of your alert rules and stumbled across the CPU load snippet. I'm not 100% sure how exactly it works, but if I got this straight it should trigger an alert when the 15-minute load average per core stays greater than 2 for 30 minutes ... right?
- alert: CpuLoad
  expr: node_load15 / (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) > 2
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "CPU load (instance {{ $labels.instance }})"
    description: "CPU load (15m) is high\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I ran the stress tool yesterday for ~1.5h, which pushed CPU usage to 100% within this timespan. During this load test, I executed the following expression as a Prometheus query - the returned value crawled upwards towards 1.0, but never completely reached it:
node_load15 / (count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))
As expected, no alert was triggered. Am I doing something wrong, or is something wrong with the CPU load snippet?
UPDATE: I found that the following expr seems to deliver more accurate results in my case. I'm not quite sure why one would want to track mode="system" (as in your expr) instead of inverting the CPU's mode="idle" metric.
- alert: CpuLoad
  expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High CPU load (instance {{ $labels.instance }})"
    description: "CPU load is high\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
https://awesome-prometheus-alerts.grep.to/rules#traefik
I say that because v2 (released just 9 days ago) has different metric names and drops the backend/frontend concept, so I suppose these alerts won't work on it.
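For v2, something along these lines might work as a replacement for the backend error-rate alert; this is a sketch assuming the v2 metric traefik_service_requests_total with a code label, and an arbitrary 5% threshold:
- alert: TraefikServiceHttp5xxErrors
  expr: sum by (service) (rate(traefik_service_requests_total{code=~"5.."}[3m])) / sum by (service) (rate(traefik_service_requests_total[3m])) * 100 > 5
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Traefik service 5xx errors (service {{ $labels.service }})"
    description: "More than 5% of requests to {{ $labels.service }} returned 5xx over the last 3 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"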
- alert: KubernetesPodNotHealthy
  expr: min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h]) > 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Kubernetes Pod not healthy (instance {{ $labels.instance }})"
    description: "Pod has been in a non-ready state for longer than an hour.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I want to use this, but the expr doesn't seem right. I get an error like:
Error executing query: invalid parameter 'query': 1:107: parse error: ranges only allowed for vector selectors
If I use min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) instead, the result is OK.
Today, we add for: 5m to alerts automatically.
Use case: mysql_global_status_uptime < 60 would never trigger.
5m can be the default value.
Basic resource monitoring
Databases and brokers
Reverse proxies and load balancers
Runtimes
Orchestrators
Network and storage
Other
It would be handy if we could add an MDRaid alert for md RAID array degradation. Here's what I've got:
- alert: MDRaidDegrade
  expr: (node_md_disks - node_md_disks_active) != 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL - Node {{ $labels.instance }} has a DEGRADED RAID."
    description: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID {{ $labels.device }}.\n VALUE = {{ $value }}"
Hello,
Thanks for your Prometheus alerts which are, indeed, well and truly awesome! They've been very helpful in migrating a project away from icinga2 towards Prometheus.
There's one alert here that I find will always trigger: the node alert named ContextSwitching. If I plot the graph for the query, I find that idle servers generally sit at around 2,100 context switches, while a moderately busy one is at 50.5k.
These are generally multi-core processors; does that factor into it at all? Whatever the case, I think there is an issue with the PromQL expression that you might want to be aware of. Thanks again.
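Core count plausibly does matter, since more cores can sustain more context switches. One hedged idea (not the rule as published) is to normalize the switch rate by CPU count before thresholding, reusing the CPU-counting trick from the CpuLoad rule above; the 10000 figure is a placeholder to tune against your own baseline:
rate(node_context_switches_total[5m]) / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 10000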
In rule 7.7 (Dead locks, "PostgreSQL has dead-locks"), the expression
pg_stat_database_deadlocks{pg_stat_database_de}[1m]
is invalid.
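The {pg_stat_database_de} selector is a stray fragment, and a bare range selector cannot be used as an alert expression anyway. A plausible corrected expression, assuming the intent is to alert on newly occurring deadlocks, would be:
increase(pg_stat_database_deadlocks[1m]) > 0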