Let's say I have an outage on one of the servers I'm monitoring and it's inaccessible bu

Have a way to mute alert until it's resolved to receive a resolved notification once it's fixed (alertmanager issue, open)

freak12techno commented on September 2, 2024

Have a way to mute alert until it's resolved to receive a resolved notification once it's fixed

from alertmanager.

Comments (11)

grobinson-grafana commented on September 2, 2024

Hi! 👋 It sounds to me like you want a silence to expire when it is no longer silencing any active alerts.

I think there are a couple of problems that we would need to solve to add such a feature. For example:

1. Alertmanager does not persist alerts to disk, so if an Alertmanager is restarted all of its alerts will be lost, and with such a feature all of its silences would then be expired too. This is undesirable because the alerts might not have actually resolved. Prometheus will resend all alerts to Alertmanager after the resend delay and you may receive duplicate notifications.
2. If Alertmanager is run in HA (high availability) mode and one Alertmanager becomes partitioned, its alerts will resolve because Prometheus is unable to communicate with that Alertmanager. The partitioned Alertmanager will expire its silences as it has no more active alerts. When the partition recovers, the Alertmanager that expired its silences will gossip these expirations to the other Alertmanagers, expiring them on those Alertmanagers too.

freak12techno commented on September 2, 2024

@grobinson-grafana seems so.

For issues that you outlined:

1. Can this be solved by saving alerts to disk every time Alertmanager receives one and loading them from disk on startup if they're present, or do you think there are other caveats with this approach?
2. I don't know a lot about Alertmanager HA internals, but let's say there is a cluster of 3 nodes and one is partitioned and loses the mute because all of its alerts are resolved. Once it comes back, won't the other 2 nodes disagree, and won't the consensus be that this mute isn't in fact removed? (And if 2/3 nodes are partitioned, pretty sure expiring mutes aren't gonna be the biggest problem here lol)

grobinson-grafana commented on September 2, 2024

1. Yes, that's right! The problem is that Alertmanager is stateless, so some kind of embedded database will need to be evaluated and then all the code will need to be written to use it.
2. Alertmanager doesn't use consensus for gossiping silences; it's a case of last write wins. Since the expiration was the most recent event, the other Alertmanagers will believe it to be the correct one.

freak12techno commented on September 2, 2024

@grobinson-grafana

1. From what I'm seeing, it shouldn't be difficult:

   - when creating/editing/deleting an alert, just dump whatever alerts there are to disk
   - when starting, load the alerts from the saved state if it's present
   - afaik it's not a proper database, but basically an alerts snapshot, so no need to sync between the file and Alertmanager in cases other than the two above

   Do you think that introduces new troubles?

2. Was the team ever considering using the consensus model? I wonder if it has any payoffs other than kinda being a requirement for the feature I propose, and if it adds more problems.

grobinson-grafana commented on September 2, 2024

1. Do you think you have time to work on this? I think the best place to start is to evaluate some of the embedded k/v stores such as bbolt to see which would be the most appropriate.
2. Yes, but the current Alertmanager design is that alerts should continue to work even if all but one Alertmanager are down. If we add consensus then we need N/2+1 Alertmanagers to be up at all times.
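
As a small worked example of the N/2+1 requirement, the sketch below compares how many replicas may be down under a majority quorum versus the "all but one" design described above. The helper names `quorum` and `tolerated` are made up for illustration.

```go
package main

import "fmt"

// quorum returns the minimum number of members that must be up for a
// majority-based consensus protocol to make progress: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many members may be down while a majority remains.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 2, 3, 5} {
		fmt.Printf("cluster=%d quorum=%d consensus tolerates %d down, all-but-one design tolerates %d down\n",
			n, quorum(n), tolerated(n), n-1)
	}
}
```

For a 3-node cluster this gives a quorum of 2, so consensus would survive one node being down, while the current design keeps working with two down.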

freak12techno commented on September 2, 2024

@grobinson-grafana for 1) I can try implementing it myself, but I'm not sure if I can manage 2) or if it's even feasible.

grobinson-grafana commented on September 2, 2024

Hi! 👋 Do you have time to evaluate some embedded k/v stores? That would be a fantastic contribution as we have discussed durable storage for Alertmanager in the past but haven't decided what to use.

For example, I know that Grafana Loki uses bbolt, but it would be nice to see a comparison of some other embedded databases. You could even include sqlite3. Alertmanager has avoided being dependent on other processes as it needs to operate even when these are unavailable, so that means no MySQL, PostgreSQL, memcache, redis, etc.

Second, it is not uncommon for users to have Alertmanager installations with 10,000s of alerts, so it would be nice to see some performance comparisons of different databases. I expect the workload to be write-heavy as reads will only happen at startup time.
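
One way such a comparison might be set up is a small harness that drives each candidate store through the same write-heavy workload followed by a single full read, as at startup. The `Store` interface, the in-memory `memStore` placeholder, and `runWorkload` below are all hypothetical; a real evaluation would wrap bbolt or another embedded store behind the interface and use realistic alert payloads.

```go
package main

import (
	"fmt"
	"time"
)

// Store is a hypothetical minimal interface a candidate store would be wrapped in.
type Store interface {
	Put(key string, value []byte) error
	ForEach(fn func(key string, value []byte) error) error
}

// memStore is an in-memory placeholder so the harness runs without
// third-party dependencies. It is not a durable store.
type memStore struct{ m map[string][]byte }

func newMemStore() *memStore { return &memStore{m: make(map[string][]byte)} }

func (s *memStore) Put(k string, v []byte) error { s.m[k] = v; return nil }

func (s *memStore) ForEach(fn func(string, []byte) error) error {
	for k, v := range s.m {
		if err := fn(k, v); err != nil {
			return err
		}
	}
	return nil
}

// runWorkload updates `alerts` distinct keys `rounds` times (repeated updates
// of the same alerts), then reads everything back once (startup).
func runWorkload(s Store, alerts, rounds int) (writeDur, readDur time.Duration, loaded int, err error) {
	payload := make([]byte, 256) // assumed payload size, purely illustrative
	start := time.Now()
	for r := 0; r < rounds; r++ {
		for i := 0; i < alerts; i++ {
			if err = s.Put(fmt.Sprintf("alert-%d", i), payload); err != nil {
				return
			}
		}
	}
	writeDur = time.Since(start)

	start = time.Now()
	err = s.ForEach(func(string, []byte) error { loaded++; return nil })
	readDur = time.Since(start)
	return
}

func main() {
	w, r, n, err := runWorkload(newMemStore(), 10000, 5)
	if err != nil {
		panic(err)
	}
	fmt.Printf("writes took %v, startup read of %d alerts took %v\n", w, n, r)
}
```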

freak12techno commented on September 2, 2024

@grobinson-grafana so I looked a bit into how it's done for silences. Apparently it's all serialised into some binary format and stored on disk as a single file. Do you think it makes sense to do it the same way for alerts here as well, or would it be better to do it via a proper db? Basically we only need to read from it once to load all the alerts when starting Alertmanager and to write to it once an alert is created/updated.

(One issue I see with that approach is that if there are 10k inserts creating alerts, then every time it'll have to overwrite the whole file, which is not nice.)

grobinson-grafana commented on September 2, 2024

> One issue I see with that approach is that if there are 10k inserts creating alerts, then every time it'll have to overwrite the whole file, which is not nice.

Yes! That's the issue! :) It works for silences because silences are not created very often and you don't tend to have very many of them. But alerts are very different, and Alertmanager can be receiving 1000s of alerts per minute (i.e. the EndsAt timestamp needs to be updated to stop firing alerts from resolving).

freak12techno commented on September 2, 2024

@grobinson-grafana okay, from my point of view, sqlite3 here doesn't make a lot of sense as it adds another layer of complexity by having to deal with db schema, so I think this won't be the best approach here.

Among other kv databases, besides bbolt that you've suggested, one cool option I found is https://github.com/dgraph-io/badger - it has quite a big community (it has more GitHub stars than bbolt), is used by a lot of projects, and seems to be maintained. I haven't used either of these in my projects, so I can mostly look at library popularity and whether it's maintained - both seem fine on that front.

What do you think?

schustersv commented on September 2, 2024

Just as a quick note: there's kthxbye, which automatically extends silences (prefixed by "ACK!" in the default configuration) that still match firing alerts before they expire.
While you still won't receive a resolved notification this way, you can set the silence duration to a shorter span and don't need to look after a long-running silence.
