akkadotnet-healthcheck's People

Contributors

aaronontheweb, arkatufus, cumpsd, dependabot-preview[bot], dependabot[bot], eaba, izavala

akkadotnet-healthcheck's Issues

Akka.HealthCheck.Persistence unit tests

Need to validate the following via unit tests:

  1. Should be able to load the AkkaPersistenceLivenessProbeProvider from HOCON configuration when starting up an ActorSystem with it configured.
  2. The underlying AkkaPersistenceLivenessProbe should be able to correctly handle subscriptions in any state.
  3. The AkkaPersistenceLivenessProbe should correctly report that Akka.Persistence is available when it is.
  4. The AkkaPersistenceLivenessProbe should correctly report that Akka.Persistence is NOT available when it isn't able to load at startup.
  5. The AkkaPersistenceLivenessProbe should correctly report that Akka.Persistence has become unavailable AFTER initially being available (simulate a later loss of connectivity).
  6. Steps 3, 4, and 5 should test both the Akka.Persistence Journal AND the SnapshotStore.

We will likely need to create some custom Akka.Persistence Journal and SnapshotStore implementations in order to test these. Please take a look at some of the tests we have in the Akka.Persistence test suite:

https://github.com/akkadotnet/akka.net/blob/e15b935b59e739220f28cb197bb191211e912bd4/src/core/Akka.Persistence.Tests/PersistentActorFailureSpec.cs#L49-L114

and

https://github.com/akkadotnet/akka.net/blob/e15b935b59e739220f28cb197bb191211e912bd4/src/core/Akka.Persistence.Tests/SnapshotFailureRobustnessSpec.cs#L156-L168
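The "available first, unavailable later" scenario from step 5 can be sketched without any Akka types. Everything below is hypothetical: ToggleableStore stands in for a custom Journal or SnapshotStore, and ProbeSketch.CheckAsync stands in for whatever the real probe does. In an actual test the toggle would presumably live inside a class derived from the Akka.Persistence base classes, as in the linked specs.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a journal or snapshot store; NOT an Akka.Persistence type.
public sealed class ToggleableStore
{
    private volatile bool _available = true;

    // Simulate losing and regaining connectivity to the backing store.
    public void Fail() => _available = false;
    public void Recover() => _available = true;

    // Succeeds while the store is available, otherwise returns a faulted task.
    public Task WriteAsync(string payload) =>
        _available
            ? Task.CompletedTask
            : Task.FromException(new InvalidOperationException("store unavailable"));
}

public static class ProbeSketch
{
    // Returns true when a write succeeds and false when the store rejects it.
    public static async Task<bool> CheckAsync(ToggleableStore store)
    {
        try
        {
            await store.WriteAsync("ping");
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }
}
```

A test would call CheckAsync once while the store is available, call Fail(), and then assert that the next check reports the store as unavailable.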

Rewrite Cluster health checks

These health checks are poisonous and can prevent cluster formation. They need to be relaxed.

Liveness

AkkaClusterLivenessProbe - Liveness probe for clustering.

Reports healthy when:
The ActorSystem joined a cluster.
The ActorSystem is connected to a cluster
Reports unhealthy when:
The ActorSystem just started and has not joined a cluster.
The ActorSystem left the cluster.

Rewrite to

ClusterLivenessProbeProvider - Liveness probe for clustering.
Reports healthy when:
Reports unhealthy when:
The ActorSystem is leaving the cluster.

Readiness

AkkaClusterReadinessProbe - Readiness probe for clustering.

Reports healthy when:
The ActorSystem joined a cluster.
The ActorSystem is connected to a cluster
Reports unhealthy when:
The ActorSystem just started and has not joined a cluster.
All other nodes in the cluster are unreachable.

Rewrite to:

ClusterReadinessProbeProvider - Readiness probe for clustering.
Reports healthy when:
Reports unhealthy when:
All other nodes in the cluster are unreachable.
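A minimal sketch of the relaxed rules as plain decision functions. This is not the library's implementation: NodeState and RelaxedClusterProbes are hypothetical names, and the sketch assumes that the empty "Reports healthy when:" lists mean "healthy in every other case". It also assumes that a node with no other known members counts as ready, so that a freshly started node cannot block cluster formation.

```csharp
// Hypothetical, simplified view of a node's cluster lifecycle.
public enum NodeState { Starting, Joined, Leaving }

public static class RelaxedClusterProbes
{
    // Liveness: unhealthy only while the ActorSystem is leaving the cluster.
    public static bool IsLive(NodeState state) => state != NodeState.Leaving;

    // Readiness: unhealthy only when there are other members and all of them are unreachable.
    public static bool IsReady(int otherMembers, int unreachableOthers) =>
        otherMembers == 0 || unreachableOthers < otherMembers;
}
```

Under these assumptions a node that has just started and has not yet joined is both live and ready, which is the main difference from the current probes.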

Do I need Akka.HealthCheck.Cluster?

I would like to use Akka.NET clustering together with health checks. Do I need Akka.HealthCheck.Cluster, or is Akka.HealthCheck enough?

Do you have an example of how to check whether the cluster is up and running, and how to use this for monitoring?

Thank you

Add debug logging option for transports

Add the ability to turn on debug logging for all built-in transports for both liveness and readiness probes.

The logs should also make it clear whether or not it's the liveness OR readiness probe writing to the transport.

How can you perform multiple health checks?

It seems you can only configure a single probe for the readiness (liveness) check. Isn't it possible to run multiple readiness (liveness) checks, like 'cluster readiness', 'persistence journal check', 'my custom check1', ...?
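To make the request concrete, here is a sketch of what combining several named checks into a single result could look like. It is independent of Akka.HealthCheck and does not describe an existing feature: CompositeCheck and its members are hypothetical names, and each check is modelled as a simple Func<bool>.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CompositeCheck
{
    // Runs every named check and returns the names of the failing ones, sorted.
    public static IReadOnlyList<string> FailedChecks(
        IReadOnlyDictionary<string, Func<bool>> checks) =>
        checks.Where(kv => !kv.Value())
              .Select(kv => kv.Key)
              .OrderBy(name => name, StringComparer.Ordinal)
              .ToList();

    // The composite is healthy only when no individual check fails.
    public static bool IsHealthy(IReadOnlyDictionary<string, Func<bool>> checks) =>
        FailedChecks(checks).Count == 0;
}
```

Returning the failing names, rather than only a boolean, lets the readiness (liveness) message say which check caused the failure.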

Missing Logic in AkkaPersistenceLivenessProbe

The SuicideProbe class checks the state of the persistence journal.
It recovers the last event and snapshot from the journal.
Afterwards it writes a new event and snapshot to the journal
and deletes all old events and snapshots.

The issue is that it already sends a RecoveryStatus back on successful recovery,
without checking whether the new event or snapshot was persisted successfully.

So when the journal succeeds only at recovery (read mode) and not at persisting (write mode),
the RecoveryStatus will still always be successful.

The bottom line is that the write and the delete of the new "hit" messages are
not used by the health check itself and only hit one sector of the SSD every 10 seconds.
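One way to close this gap is to report success only when both the read path and the write path have succeeded. The sketch below is an assumption about how such a status could be evaluated, not the library's code: PersistenceStatusSketch is a hypothetical type, and the four flags are illustrative names for the recovery and persist outcomes.

```csharp
// Hypothetical status type; not an Akka.HealthCheck class.
public sealed record PersistenceStatusSketch(
    bool JournalRecovered,
    bool SnapshotRecovered,
    bool JournalPersisted,
    bool SnapshotSaved)
{
    // Healthy only when recovery (read) AND persistence (write) both succeeded.
    public bool IsHealthy =>
        JournalRecovered && SnapshotRecovered && JournalPersisted && SnapshotSaved;
}
```

With this rule a journal that can be read but not written reports unhealthy, which is the case the issue describes.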

Akka.Persistence health checks create blobs but don't clean up after themselves

While analysing our storage account's blob snapshot folder, I saw that I had a lot of Akka.HealthCheck snapshots. In the standup yesterday it was mentioned that the SuicideProbe should clean up the journal/snapshot store after its tests. This is definitely not the case:

image

At the moment this cluster only has 3 nodes, so the snapshots/journal records of old, recycled nodes (pods) are still lingering around.

It would be nice to remove the snapshot/journal records used for the health probe after the probe has finished. Of course there is the scenario where a pod crashes during the health probe, in which case you still get undeleted journal records/snapshots, but the chance of that happening is really low and I can live with that.

Socket probe transport doesn't accept / handle incoming connections

Given how this is implemented:

_socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
_socket.Bind(new IPEndPoint(IPAddress.IPv6Any, Settings.Port));
_socket.Listen(10);

We never actually handle the incoming socket requests, thus the liveness probe will eventually fail given enough time. Need to actually handle the socket request and verify it by sending back some trivial piece of data.

Can you make this fix and verify it via a TCP integration test @izavala ?
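A minimal sketch of accepting a connection and answering with a trivial payload, together with a client that could serve as the core of a TCP integration test. This is not the project's actual fix: SocketProbeSketch, the "OK" payload, and the use of TcpListener on the loopback address are all assumptions made for illustration.

```csharp
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

public static class SocketProbeSketch
{
    // Accepts a single connection, writes a trivial payload, and closes the connection.
    public static async Task ServeOnceAsync(TcpListener listener)
    {
        using TcpClient client = await listener.AcceptTcpClientAsync();
        byte[] payload = Encoding.ASCII.GetBytes("OK");
        await client.GetStream().WriteAsync(payload, 0, payload.Length);
    }

    // Connects the way an external prober would and returns everything the server sent.
    public static async Task<string> ProbeAsync(int port)
    {
        using var client = new TcpClient();
        await client.ConnectAsync(IPAddress.Loopback, port);
        using var received = new MemoryStream();
        // CopyToAsync completes once the server closes its side of the connection.
        await client.GetStream().CopyToAsync(received);
        return Encoding.ASCII.GetString(received.ToArray());
    }
}
```

A test can start a TcpListener on port 0 to get a free port, run ServeOnceAsync, and assert on the string returned by ProbeAsync. A real transport would call the accept step in a loop for as long as the probe is healthy.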

Need to reduce log verbosity on Akka.Persistence liveness probes

[INFO][2/7/2023 10:35:01 PM][Thread 0055][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Persistence probe terminated. Recreating...
[INFO][2/7/2023 10:35:11 PM][Thread 0052][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=True, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
[INFO][2/7/2023 10:35:21 PM][Thread 0054][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=False, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
[INFO][2/7/2023 10:35:31 PM][Thread 0055][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=True, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
[INFO][2/7/2023 10:35:41 PM][Thread 0010][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=True, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.

This needs to go into debug logging or not get logged at all unless there's a problem.
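The requested behaviour amounts to choosing the log level from the status itself. The sketch below assumes that any false flag counts as "a problem"; ProbeLogLevel and ProbeLogging are hypothetical names rather than Akka.NET logging types.

```csharp
// Hypothetical log levels for this sketch; not Akka.NET's logging API.
public enum ProbeLogLevel { Debug, Warning }

public static class ProbeLogging
{
    // Routine all-green statuses go to Debug; anything else is raised to Warning.
    public static ProbeLogLevel LevelFor(
        bool journalRecovered,
        bool snapshotRecovered,
        bool journalPersisted,
        bool snapshotSaved)
    {
        bool allHealthy =
            journalRecovered && snapshotRecovered && journalPersisted && snapshotSaved;
        return allHealthy ? ProbeLogLevel.Debug : ProbeLogLevel.Warning;
    }
}
```

Under that assumption only the third status line above, where SnapshotRecovered=False, would still be visible at the default log level.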
