
Comments (10)

grobinson-grafana avatar grobinson-grafana commented on June 5, 2024

Hi! 👋 It sounds to me like you want a silence to expire when it is no longer silencing any active alerts.

I think there are a couple of problems that we would need to solve to add such a feature. For example:

  1. Alertmanager does not persist alerts to disk, so if an Alertmanager is restarted all of its alerts are lost. With the proposed feature, all of its silences would then expire as well. This is undesirable because the alerts might not have actually resolved. Prometheus will resend all alerts to Alertmanager after the resend delay, and you may receive duplicate notifications.
  2. If Alertmanager is run in HA (high availability) and one Alertmanager becomes partitioned, then its alerts will resolve, as Prometheus will be unable to communicate with that Alertmanager. The partitioned Alertmanager will expire its silences as it has no more active alerts. When the partition recovers, the Alertmanager that expired its silences will gossip these expirations to the other Alertmanagers, expiring them on those Alertmanagers too.

from alertmanager.

freak12techno avatar freak12techno commented on June 5, 2024

@grobinson-grafana seems so.

For issues that you outlined:

  1. Can this be solved by saving alerts to disk every time Alertmanager receives one and loading them from disk on startup if they are present, or do you think there are other caveats with this approach?
  2. I don't know a lot about Alertmanager HA internals, but let's say there is a cluster of 3 nodes and one is partitioned and loses the mute because all of its alerts are resolved. Once it comes back, won't the other 2 nodes disagree, and won't the consensus be that this mute isn't in fact removed? (And if 2/3 nodes are partitioned, pretty sure expiring mutes aren't gonna be the biggest problem here lol)

grobinson-grafana avatar grobinson-grafana commented on June 5, 2024
  1. Yes that's right! The problem is that Alertmanager is stateless, so some kind of embedded database will need to be evaluated and then all the code will need to be written to use it.
  2. Alertmanager doesn't use consensus for gossiping silences; it's a case of last write wins. Since the expiration was the most recent event, the other Alertmanagers will believe it to be the correct one.
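
To make the last-write-wins point concrete, here is a minimal sketch in Python (not Alertmanager's actual Go code). The `Silence` class, its fields, and the `merge` function are hypothetical names invented for illustration; the only assumption taken from the discussion is that the most recently written version of a silence replaces older ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Silence:
    """Hypothetical, simplified silence record (not Alertmanager's real type)."""
    id: str
    ends_at: float     # time at which the silence stops muting alerts
    updated_at: float  # time at which this version was written


def merge(local: dict, incoming: Silence) -> bool:
    """Last write wins: keep whichever version has the newer updated_at.

    Returns True if the incoming version replaced the local one.
    """
    current = local.get(incoming.id)
    if current is None or incoming.updated_at > current.updated_at:
        local[incoming.id] = incoming
        return True
    return False


# A healthy node holds a silence that is meant to last until t=1000.
healthy = {"s1": Silence("s1", ends_at=1000.0, updated_at=10.0)}

# A partitioned node saw no active alerts and expired the silence at t=50.
expired = Silence("s1", ends_at=50.0, updated_at=50.0)

# When the partition heals, the expiration is the newest write, so it wins.
merge(healthy, expired)
print(healthy["s1"].ends_at)
```

In this sketch there is no vote between nodes: a single newer write from one node overrides the state held by every other node, which is why a 2-versus-1 majority would not protect the silence.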

freak12techno avatar freak12techno commented on June 5, 2024

@grobinson-grafana

  1. From what I'm seeing, it shouldn't be difficult:
  • when creating/editing/deleting the alert, just dump whatever alerts there are to the disk
  • when starting, load the alerts from state if it's present
  • afaik it's not a proper database, but basically a snapshot of the alerts, so there is no need to sync between the file and Alertmanager in cases other than the two above

Do you think that introduces new troubles?
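
A rough sketch of the dump-and-load idea, written in Python for brevity rather than in Alertmanager's Go. The function names, the JSON format, and the file layout are assumptions made for illustration only. It writes to a temporary file and renames it, so that a crash in the middle of a write does not leave a half-written snapshot behind.

```python
import json
import os
import tempfile


def save_snapshot(path: str, alerts: dict) -> None:
    """Write all alerts to `path`, replacing the old snapshot atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(alerts, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the data reaches the disk
        os.replace(tmp_path, path)  # swap the new snapshot into place
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path: str) -> dict:
    """Load alerts at startup; a missing file means there is no prior state."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

Note that every call to `save_snapshot` rewrites the whole file, so the cost of one save grows with the total number of alerts held, not with the size of the change.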

  2. Did the team ever consider using the consensus model? I wonder if it has any payoffs other than kinda being the requirement for the feature I propose, and if it adds more problems.

grobinson-grafana avatar grobinson-grafana commented on June 5, 2024
  1. Do you think you have time to work on this? I think the best place to start is to evaluate some of the embedded k/v stores such as bbolt to see which would be the most appropriate.
  2. Yes, but the current Alertmanager design is that alerts should continue to work even if all but one of the Alertmanagers are down. If we add consensus then we need N/2+1 Alertmanagers to be up at all times.
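
The N/2+1 requirement can be worked through with a few lines of Python. This is plain majority-quorum arithmetic (with integer division), not code from Alertmanager, and the function names are made up for the example.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster: floor(n / 2) + 1."""
    return n // 2 + 1


def tolerated_failures_with_consensus(n: int) -> int:
    """Nodes that may be down while a majority is still reachable."""
    return n - quorum(n)


def tolerated_failures_today(n: int) -> int:
    """Per the design described above: works as long as one node is up."""
    return n - 1


for n in (1, 2, 3, 5):
    print(
        f"n={n}: quorum={quorum(n)}, "
        f"consensus tolerates {tolerated_failures_with_consensus(n)} down, "
        f"current design tolerates {tolerated_failures_today(n)} down"
    )
```

For a 3-node cluster, consensus would need 2 nodes up and tolerate 1 failure, whereas the design described above keeps working with 2 nodes down.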

freak12techno avatar freak12techno commented on June 5, 2024

@grobinson-grafana for 1) I can try implementing it myself, but I'm not sure if I can manage 2) or if it's even feasible.

grobinson-grafana avatar grobinson-grafana commented on June 5, 2024

Hi! 👋 Do you have time to evaluate some embedded k/v stores? That would be a fantastic contribution as we have discussed durable storage for Alertmanager in the past but haven't decided what to use.

For example, I know that Grafana Loki uses bbolt, but it would be nice to see a comparison of some other embedded databases. You could even include sqlite3. Alertmanager has avoided being dependent on other processes as it needs to operate even when these are unavailable, so that means no MySQL, PostgreSQL, memcache, redis, etc.

Second, it is not uncommon for users to have Alertmanager installations with tens of thousands of alerts, so it would be nice to see some performance comparisons of different databases. I expect the workload to be write-heavy, as reads will only happen at startup time.
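
One possible shape for such a comparison, sketched in Python with the standard-library `sqlite3` module as a single candidate. The table layout, function names, alert count, and payload size are all assumptions for illustration. A real evaluation would be written in Go, would cover bbolt and other embedded stores, and would use a file on disk rather than an in-memory database.

```python
import sqlite3
import time


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a store with one key/value table (hypothetical layout)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alerts ("
        "fingerprint TEXT PRIMARY KEY, payload BLOB NOT NULL)"
    )
    return conn


def write_alerts(conn: sqlite3.Connection, alerts: list) -> float:
    """Upsert (fingerprint, payload) pairs in one transaction; return seconds taken."""
    start = time.perf_counter()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO alerts (fingerprint, payload) VALUES (?, ?)",
            alerts,
        )
    return time.perf_counter() - start


def load_alerts(conn: sqlite3.Connection) -> dict:
    """Read everything back, as would happen once at startup."""
    return dict(conn.execute("SELECT fingerprint, payload FROM alerts"))


conn = open_store()
alerts = [(f"alert-{i}", b"x" * 512) for i in range(10_000)]
elapsed = write_alerts(conn, alerts)
print(f"{len(alerts)} upserts in {elapsed:.3f}s")
```

Timings from a sketch like this only mean something when the same workload is run against each candidate on the same hardware and with durable, on-disk storage.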

freak12techno avatar freak12techno commented on June 5, 2024

@grobinson-grafana so I looked a bit into how it's done for silences. Apparently it's all serialised into some binary format and stored on disk as a single file. Do you think it makes sense to do it the same way for alerts here as well, or would it be better to do it via a proper db? Basically we only need to read from it once to load all the alerts when starting Alertmanager and to write to it once an alert is created/updated.

(One issue I see with that approach is that if there are 10k inserts creating alerts, then every time it'll have to overwrite the whole file, which is not nice.)

grobinson-grafana avatar grobinson-grafana commented on June 5, 2024

> One issue I see with that approach is that if there are 10k inserts creating alerts, then every time it'll have to overwrite the whole file, which is not nice.

Yes! That's the issue! :) It works for silences because silences are not created very often and you don't tend to have very many of them. But alerts are very different, and Alertmanager can be receiving 1000s of alerts per minute (i.e. the EndsAt timestamp needs to be updated to stop firing alerts from resolving).
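
A back-of-the-envelope comparison of the two write patterns, in Python. The alert count, alert size, and update rate below are illustrative assumptions rather than measurements, and the model ignores batching as well as the internal overhead of any real key/value store.

```python
def snapshot_bytes_written(num_alerts: int, alert_size: int, updates: int) -> int:
    """Bytes written if every update rewrites the whole snapshot file."""
    return num_alerts * alert_size * updates


def per_record_bytes_written(alert_size: int, updates: int) -> int:
    """Bytes written if every update writes only the changed alert."""
    return alert_size * updates


NUM_ALERTS = 10_000         # assumed number of alerts held in memory
ALERT_SIZE = 1024           # assumed serialized size of one alert, in bytes
UPDATES_PER_MINUTE = 1_000  # assumed rate of incoming alert updates

whole_file = snapshot_bytes_written(NUM_ALERTS, ALERT_SIZE, UPDATES_PER_MINUTE)
per_record = per_record_bytes_written(ALERT_SIZE, UPDATES_PER_MINUTE)

print(f"whole-file rewrite: {whole_file:,} bytes per minute")
print(f"per-record write:   {per_record:,} bytes per minute")
print(f"ratio:              {whole_file // per_record:,}x")
```

Under these assumptions the whole-file approach writes as many times more data as there are alerts, which is why a pattern that is fine for a small number of rarely changing silences does not carry over to alerts.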

freak12techno avatar freak12techno commented on June 5, 2024

@grobinson-grafana okay, from my point of view, sqlite3 doesn't make a lot of sense here, as it adds another layer of complexity by having to deal with a db schema, so I don't think it's the best approach.

As for other kv databases besides the bbolt that you've suggested, one cool option I found is https://github.com/dgraph-io/badger. It has quite a big community (it has more GitHub stars than bbolt), is used by a lot of projects, and seems to be maintained. I haven't used either of these in my projects, so I can mostly only look at how popular each library is and whether it's maintained, and both look good on those counts.

What do you think?
