samber / awesome-prometheus-alerts
🚨 Collection of Prometheus alerting rules
Home Page: https://samber.github.io/awesome-prometheus-alerts/
License: Other
I collect NFS stats using node-exporter and I want to monitor them, but I have no idea how to set the thresholds, for example for NFS request latency. Can someone give me advice about thresholds for NFS stats?
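A possible starting point, assuming the node-exporter mountstats collector is enabled and exposes per-operation request-time and request counters (the metric names below come from that collector; verify them against your node-exporter version's /metrics output). Dividing time spent by operations completed gives average latency per request, the same pattern the disk latency rule further down this page uses; the 100ms threshold is only a placeholder to tune:
- alert: NfsRequestLatencyHigh
  expr: rate(node_mountstats_nfs_operations_request_time_seconds_total[5m]) / rate(node_mountstats_nfs_operations_requests_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "NFS request latency high (instance {{ $labels.instance }})"
    description: "Average NFS request latency is above 100ms\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"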
kubernetes-cadvisor is a process that runs in the kubelet. Consider changing the query for ContainerCpuUsage from
(sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
to
(sum(rate(container_cpu_usage_seconds_total{job!="kubernetes-cadvisor"}[3m])) BY (instance, name) * 100) > 80
so that the alert won't fire constantly for cadvisor?
Hi,
Thanks for this great overview!
I am testing the rules and I have a question about it.
I just implemented these rules: https://awesome-prometheus-alerts.grep.to/rules#docker-containers. They are firing, but I am not sure the query is correct.
The queries in question all use sum by (ip), but those metrics do not have any ip label on them.
I am running the following versions:
Are there some requirements to get the ip label, or should it be changed to instance?
Is there any standard or agreed meaning for the severity levels?
I can't see anywhere that documents it - apologies if it exists.
Since node_exporter 0.16, all metrics measuring a size in bytes have had "_bytes" appended to their original names.
E.g. node_filesystem_free became node_filesystem_free_bytes
I have data from postgres-exporter: https://github.com/socialwifi/docker-postgres-exporter/blob/master/queries.yaml#L1
But it also returns a very big value for the master node, and I don't know how to filter it out at the Alertmanager level.
604800 > pg_replication_lag > 10
Looks very stupid, but works fine, assuming alerts should fire and be fixed within one week.
However, the master node can change at any time, and the replication lag reading for the master will be lower during the change, so I keep looking for a better solution.
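One possible refinement, assuming the exporter also exposes a pg_replication_is_replica gauge (1 on replicas, 0 on the master), is to restrict the alert to replicas instead of bounding the lag from above; a variant of this appears further down this page:
(pg_replication_lag > 10) and ON(instance) (pg_replication_is_replica == 1)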
Hi,
An idea to add for Elasticsearch:
- alert: ElasticsearchDiskUsageTooHigh
  expr: (1 - (elasticsearch_filesystem_data_available_bytes{} / elasticsearch_filesystem_data_size_bytes{})) * 100 > 90
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Elasticsearch Disk Usage Too High (instance {{ $labels.instance }})"
    description: "The disk usage is over 90% for 5m\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ElasticsearchDiskUsageWarning
  expr: (1 - (elasticsearch_filesystem_data_available_bytes{} / elasticsearch_filesystem_data_size_bytes{})) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Elasticsearch Disk Usage warning (instance {{ $labels.instance }})"
    description: "The disk usage is over 80% for 5m\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I recently checked the alerts for our cluster and found that some of them would never fire, because the metric we wanted to be alerted on was not exported at all.
I tackled it by adding an alert that fires on "absent" values:
expr: absent(up{job="myjob"})
Might it be helpful to include information about this issue in general?
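For reference, a minimal sketch of a complete rule around that expression, in the same format as the other alerts on this page (the job name myjob is a placeholder):
- alert: JobMetricsAbsent
  expr: absent(up{job="myjob"})
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Metrics absent for job myjob"
    description: "No up series is being scraped for job myjob, so alerts based on its metrics can never fire.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"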
This is a good way to be notified in case of an intrusion, a DDoS attack, or just a dumb user doing something stupid with a server.
If average traffic over the last 5 min. increases a lot compared to the last 5h, you get an alert.
(sum by (instance) (rate(node_network_transmit_bytes_total[5m]))) / (sum by (instance) (rate(node_network_transmit_bytes_total[5h])))
(sum by (instance) (rate(node_network_receive_bytes_total[5m]))) / (sum by (instance) (rate(node_network_receive_bytes_total[5h])))
(Note: rate() is used here rather than avg_over_time() on the raw counters, since node_network_*_bytes_total are counters and averaging their ever-increasing raw values would not measure throughput.)
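A sketch of how this could be packaged as a rule, assuming an arbitrary 5x threshold (recent throughput five times the 5-hour average); tune the factor to your own baseline:
- alert: HostUnusualNetworkTraffic
  expr: (sum by (instance) (rate(node_network_transmit_bytes_total[5m]))) / (sum by (instance) (rate(node_network_transmit_bytes_total[5h]))) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Unusual network traffic (instance {{ $labels.instance }})"
    description: "Transmit throughput over the last 5m is more than 5x the 5h average\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"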
Looking at this alert:
- alert: MysqlRestarted
  expr: mysql_global_status_uptime < 60
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MySQL restarted (instance {{ $labels.instance }})"
    description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
we need to change either for or expr, because mysql_global_status_uptime is in seconds and it won't stay < 60 for 5 minutes.
I'm very new to Prometheus, and I have these two ideas:
1.
expr: mysql_global_status_uptime < 70
for: 1m
This will fire an alert almost immediately after a MySQL restart.
2.
expr: mysql_global_status_uptime < 310
for: 5m
This will fire after 5 minutes.
I would personally go with the first option, but let me know if I'm mistaken about anything here.
The query of this alert (awesome-prometheus-alerts/_data/rules.yml, line 365 in 951d801) should be (note the added parentheses):
(pg_replication_lag > 10) and ON(instance) (pg_replication_is_replica == 1)
I use the helm chart prometheus-9.1.0.
The alert rule:
- alert: CorednsPanicCount
  expr: increase(coredns_panic_count_total[10min]) > 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "CoreDNS Panic Count (instance {{ $labels.instance }})"
    description: "Number of CoreDNS panics encountered\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
seems to be incorrect. I see this in the log:
level=error ts=2019-09-13T15:07:29.344Z caller=main.go:757 msg="Failed to apply
\"CorednsPanicCount\": could not parse expression: parse error at char 36: bad duration syntax: \"10mi\""
Strangely, it looks like 10min becomes 10mi in the log.
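The cause is that 10min is not a valid PromQL duration (valid units are ms, s, m, h, d, w, and y), which is presumably why the error message shows the truncated 10mi. The fix is simply:
expr: increase(coredns_panic_count_total[10m]) > 0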
Copying and pasting these rules is OK, but it would be easier if there were a generated set of rules files from rules.yml that would "just work".
Is there any reason this isn't generated? If not, are there any templating tools that anyone knows of that could achieve this?
Hello, I do not understand why this alert divides time by total:
- alert: HostUnusualDiskReadLatency
  expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual disk read latency (instance {{ $labels.instance }})"
    description: "Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Example:
From:
- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
To:
- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
An effect field would enable us to improve the alert templates.
Hello and thanks for this great list of alerts!
I've been using them for a while on my Raspberry Pi cluster and have noticed that, although I pass --docker_only=true, for each metric there is one series with no name label (I don't know what it stands for), which can trigger alerts quite often. I name all my containers, and when I filter with {name!=""} I get just the metrics for the containers and no false alarms.
Should the cadvisor alerts filter out these metrics with no name label, or am I missing something here? What do they signify, especially since I run cadvisor with --docker_only=true?
Best regards
Panos
Hello,
The alert query for Kubernetes pod health needs to be fixed: kube-state-metrics sets the value for the pod's current phase to 1 (awesome-prometheus-alerts/_data/rules.yml, line 989 in c20227b).
Correct expression:
min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} == 1)[1h:])
- alert: OpenebsUsedPoolCapacity
  expr: openebs_used_pool_capacity_percent > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OpenEBS pool uses more than 80% of its capacity (instance {{ $labels.instance }})"
    description: "OpenEBS pool uses more than 80% of its capacity\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I think we should change this expr:
probe_http_status_code <= 199 OR probe_http_status_code >= 300
to:
probe_http_status_code <= 199 OR probe_http_status_code >= 400
because status codes 301 and 302 are HTTP-to-HTTPS redirects and are valid.
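Wrapped into a full rule in the style of the others on this page (a sketch; the status-code boundaries follow the suggestion above):
- alert: BlackboxProbeHttpFailure
  expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Blackbox probe HTTP failure (instance {{ $labels.instance }})"
    description: "HTTP status code is not in the 200-399 range\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"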
Hey,
Regarding awesome-prometheus-alerts/_data/rules.yml, lines 533 to 536 in 3040fe5, I recommend replacing the expression with something like this:
avg_over_time(probe_duration_seconds{env="Prod"}[15m]) > 1
What do others think? Should I open a PR?
Hello,
Do you maybe have updated RabbitMQ rules? Most of these are not working: the metrics specified in the expressions do not exist, so I have made similar ones (kind of). Any feedback is appreciated, thanks.
Do you think it would be interesting to translate all those rules from node_exporter to Netdata metrics, so that a different source can be used for the metrics?
I use Netdata as a metrics source for Prometheus myself, so I will translate the rules I consider interesting for my own use. I can share them if you think that could be useful.
Django-prometheus exporter: https://github.com/korfuri/django-prometheus
One thing that was slightly frustrating when copying these rules over to my deployment was that if you just want to chuck all the alerts into the relevant YAML file, you have to copy-paste each one individually.
Some sort of presentation that would let you grab a relevant exporter's rules in one solid chunk for ease of copypasta would be very welcome.
This may reduce load on prometheus instances 🔥
After adding alert 2.7, I noticed that I'm getting disk-fill alerts for the /run/user/1113 mountpoint, since the rule also applies to all tmpfs filesystems. Perhaps we should restrict the rule to ext4, xfs, etc., or add a not-equal condition to rule out tmpfs?
predict_linear(node_filesystem_free_bytes{fstype=~"ext4|xfs"}[1h], 4 * 3600) < 0
Exporter => https://github.com/prometheus/mysqld_exporter
I'm currently testing some of your alert rules and stumbled across the CPU load snippet. I'm not 100% sure how exactly it works, but if I got this straight it should trigger an alert when the 15-minute load average per core stays greater than 2 for 30 minutes ... right?
- alert: CpuLoad
  expr: node_load15 / (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) > 2
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "CPU load (instance {{ $labels.instance }})"
    description: "CPU load (15m) is high\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I ran the stress tool yesterday for ~1.5h, which pushed CPU usage to 100% within this timespan. During this load test, I executed the following expression as a Prometheus query - the returned value crawled upwards towards 1.0, but never completely reached it:
node_load15 / (count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))
As expected, no alert was triggered. Am I doing something wrong, or is something wrong with the CPU load snippet?
UPDATE: I found that the following expr seems to deliver more accurate results in my case. I'm not quite sure why one would want to track mode="system" (as in your expr) instead of inverting the CPU's mode="idle" metric.
- alert: CpuLoad
  expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High CPU load (instance {{ $labels.instance }})"
    description: "CPU load is high\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
https://awesome-prometheus-alerts.grep.to/rules#traefik
I say that because v2 (released just 9 days ago) has different metric names and drops the backend/frontend concept, so I suppose these alerts won't work on it.
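For v2, something along these lines might work as a replacement for the backend error-rate alert; this is a sketch assuming the v2 metric traefik_service_requests_total with a code label, and an arbitrary 5% threshold:
- alert: TraefikServiceHttp5xxErrors
  expr: sum by (service) (rate(traefik_service_requests_total{code=~"5.."}[3m])) / sum by (service) (rate(traefik_service_requests_total[3m])) * 100 > 5
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Traefik service 5xx errors (service {{ $labels.service }})"
    description: "More than 5% of requests to {{ $labels.service }} returned 5xx over the last 3 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"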
- alert: KubernetesPodNotHealthy
  expr: min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h]) > 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Kubernetes Pod not healthy (instance {{ $labels.instance }})"
    description: "Pod has been in a non-ready state for longer than an hour.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I want to use this, but the expr doesn't seem right. I get an error like:
Error executing query: invalid parameter 'query': 1:107: parse error: ranges only allowed for vector selectors
If I use min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) instead, the result is OK.
Today, we add for: 5m to alerts automatically.
Use case: mysql_global_status_uptime < 60 would never trigger.
5m can be the default value.
Basic resource monitoring
Databases and brokers
Reverse proxies and load balancers
Runtimes
Orchestrators
Network and storage
Other
It would be handy if we could add an MDRaid alert for md RAID array degradation. Here's what I've got:
- alert: MDRaidDegrade
  expr: (node_md_disks - node_md_disks_active) != 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL - Node {{ $labels.instance }} has a DEGRADED RAID."
    description: "CRITICAL - Node {{ $labels.instance }} has DEGRADED RAID {{ $labels.device }}.\n VALUE = {{ $value }}"
Hello,
Thanks for your Prometheus alerts which are, indeed, well and truly awesome! They've been very helpful in migrating a project away from icinga2 towards Prometheus.
There's one alert here that I find will always trigger: the node alert named ContextSwitching. If I plot the graph for the query, I find that idle servers generally sit at around 2,100 context switches, while a moderately busy one is at 50.5k.
These are generally multi-core processors; does that factor into it at all? Whatever the case, I think there is an issue with the PromQL expression that you might want to be aware of. Thanks again.
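Core count plausibly does matter, since more cores can sustain more context switches. One hedged idea (not the rule as published) is to normalize the switch rate by CPU count before thresholding, reusing the CPU-counting trick from the CpuLoad rule above; the 10000 figure is a placeholder to tune against your own baseline:
rate(node_context_switches_total[5m]) / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 10000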
In rule 7.7 (Dead locks, "PostgreSQL has dead-locks"), the expression
pg_stat_database_deadlocks{pg_stat_database_de}[1m]
is invalid.
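The {pg_stat_database_de} selector is a stray fragment, and a bare range selector cannot be used as an alert expression anyway. A plausible corrected expression, assuming the intent is to alert on newly occurring deadlocks, would be:
increase(pg_stat_database_deadlocks[1m]) > 0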