petabridge / akkadotnet-healthcheck

Healthchecks for Akka.NET Applications :hospital:

License: Apache License 2.0

Batchfile 0.04% F# 5.12% PowerShell 5.72% Shell 1.27% C# 87.85%

akka-cluster akkadotnet azure-arm docker k8s

akkadotnet-healthcheck's People

aaronontheweb izavala johnnyr1985 ingted arkatufus tstojecki wesselkranenborg cumpsd eaba

akkadotnet-healthcheck's Issues

Rename Akka.Cluster.HealthCheck to Akka.HealthCheck.Cluster

Need to rename all projects and the NuGet packages accordingly.

Akka.HealthCheck.Persistence unit tests

Need to validate the following via unit tests:

Should be able to load the AkkaPersistenceLivenessProbeProvider from HOCON configuration when starting up an ActorSystem with it configured
The underlying AkkaPersistenceLivenessProbe should be able to correctly handle subscriptions in any state.
The AkkaPersistenceLivenessProbe should correctly report that Akka.Persistence is available when it is.
The AkkaPersistenceLivenessProbe should correctly report that Akka.Persistence is NOT available when it isn't able to load at startup.
The AkkaPersistenceLivenessProbe should correctly report that Akka.Persistence has become unavailable AFTER initially being available (simulate a later loss of connectivity).
Steps 4,5,6 should test both the Akka.Persistence Journal AND the SnapshotStore.

We will likely need to create some custom Akka.Persistence Journal and SnapshotStore implementations in order to test these successfully. Please take a look at some of the tests we have in Akka.NET:

https://github.com/akkadotnet/akka.net/blob/e15b935b59e739220f28cb197bb191211e912bd4/src/core/Akka.Persistence.Tests/PersistentActorFailureSpec.cs#L49-L114

and

https://github.com/akkadotnet/akka.net/blob/e15b935b59e739220f28cb197bb191211e912bd4/src/core/Akka.Persistence.Tests/SnapshotFailureRobustnessSpec.cs#L156-L168

AkkaPersistenceLivenessProbe subscription test failing due to LivenessStatus not being true

#24

Akka Cluster HealthCheck Probe Readiness Test

Need to implement Recover for LivenessStatus, with a test to make sure it correctly signals when the cluster is up as well as when it is no longer reachable.

Rewrite Cluster health checks

These health checks are poisonous and can prevent cluster formation. They need to be relaxed.

Liveness

AkkaClusterLivenessProbe - Liveness probe for clustering.

Reports healthy when:
The ActorSystem joined a cluster.
The ActorSystem is connected to a cluster.
Reports unhealthy when:
The ActorSystem just started and has not joined a cluster.
The ActorSystem left the cluster.

Rewrite to:

ClusterLivenessProbeProvider - Liveness probe for clustering.
Reports healthy when:
Reports unhealthy when:
The ActorSystem is leaving the cluster.

Readiness

AkkaClusterReadinessProbe - Readiness probe for clustering.

Reports healthy when:
The ActorSystem joined a cluster.
The ActorSystem is connected to a cluster.
Reports unhealthy when:
The ActorSystem just started and has not joined a cluster.
All other nodes in the cluster are unreachable.

Rewrite to:

ClusterReadinessProbeProvider - Readiness probe for clustering.
Reports healthy when:
Reports unhealthy when:
All other nodes in the cluster are unreachable.

Add startup message indicating that liveness / readiness probes have been configured successfully

Should log this automatically without any configuration settings - just to let the end-user know in the startup logs that the system is running with one or both of these tools enabled.

Do I need Akka.HealthCheck.Cluster?

I would like to use Akka.NET clustering together with health checks. Do I need Akka.HealthCheck.Cluster, or is Akka.HealthCheck enough?

Do you have an example of how to check whether the cluster is up and running, and how to use that for monitoring?

Thank you

Can't install Akka.HealthCheck.Hosting.Web into .NET 7.0 application

WebApiTemplate.App.csproj: [NU1100] Unable to resolve 'Akka.HealthCheck.Hosting.Web (>= 1.0.0)' for 'net7.0'. PackageSourceMapping is enabled, the following source(s) were not considered: nuget.****

Is the issue here how we're targeting ASP.NET?

need to validate the nuget publication before 1.0 ships

    LGTM - need to validate the nuget publication locally (or check what the build server produced.) Don't want to publish any sample projects and need to make sure that the correct `README.md` files are included.

Originally posted by @Aaronontheweb in #148 (review)

Add debug logging option for transports

Add the ability to turn on debug logging for all built-in transports for both liveness and readiness probes.

The logs should also make it clear whether or not it's the liveness OR readiness probe writing to the transport.

How can you perform multiple health checks?

It seems you can only configure a single probe for the readiness (or liveness) check. Isn't it possible to run multiple readiness (or liveness) checks, like 'cluster readiness', 'persistence journal check', 'my custom check1', ...?

Missing Logic in AkkaPersistenceLivenessProbe

The SuicideProbe class checks the state of the persistence journal.
It recovers the last event and snapshot from the journal,
then writes a new event and snapshot to the journal
and deletes all old events and snapshots.

The issue is that it already sends a RecoveryStatus back on successful recovery,
without checking whether the new event or snapshot was persisted successfully.

So when the journal succeeds only at recovery (read mode) and not at persisting (write mode),
the RecoveryStatus will still always be successful.

The bottom line is that the write and the delete of the new "hit" messages are
not used by the health check itself; they only hit one sector of the SSD every 10 seconds.
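
To make the point concrete, here is a minimal sketch of a status that only counts as healthy once the write path has been confirmed too. The type below is a hypothetical stand-in, not the library's actual class; its four flag names simply mirror the fields that show up in the probe's log output (JournalRecovered, SnapshotRecovered, JournalPersisted, SnapshotSaved). How a real probe should treat a first run where no snapshot exists yet is left open here.

```csharp
// Hypothetical status type for illustration only; not the Akka.HealthCheck implementation.
public sealed class PersistenceLivenessStatus
{
    public PersistenceLivenessStatus(
        bool journalRecovered, bool snapshotRecovered,
        bool journalPersisted, bool snapshotSaved)
    {
        JournalRecovered = journalRecovered;
        SnapshotRecovered = snapshotRecovered;
        JournalPersisted = journalPersisted;
        SnapshotSaved = snapshotSaved;
    }

    // Read path: set when recovery of the event / snapshot succeeded.
    public bool JournalRecovered { get; }
    public bool SnapshotRecovered { get; }

    // Write path: set only after the new event / snapshot was confirmed as stored.
    public bool JournalPersisted { get; }
    public bool SnapshotSaved { get; }

    // Assumption: healthy means both the read path and the write path worked.
    // A journal that can recover but cannot persist is reported as unhealthy.
    public bool IsHealthy =>
        JournalRecovered && SnapshotRecovered && JournalPersisted && SnapshotSaved;
}
```

With this shape the probe would report back only after the persist and snapshot-save callbacks have fired (or failed), instead of right after recovery.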

Akka.Persistence health checks create blobs but don't clean up after themselves

While analysing the blob snapshot folder of our storage account, I saw that I had a lot of Akka.HealthCheck snapshots. In the standup yesterday it was mentioned that the SuicideProbe should clean up the journal/snapshot store after its tests. This is definitely not the case:

At the moment this cluster only has 3 nodes, so old recycled nodes (pods) still have snapshots/journal records lingering around.

It would be nice to remove the snapshot/journal records used for the health probe after the probe is finished. Of course there is the scenario where a pod crashes during the health probe and you still get undeleted journal/snapshot records, but the chance of that happening is really low and I can live with that.

Socket probe transport doesn't accept / handle incoming connections

Given how this is implemented:

akkadotnet-healthcheck/src/Akka.HealthCheck/Transports/Sockets/SocketStatusTransport.cs

Lines 35 to 37 in 85ef553

	_socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
	_socket.Bind(new IPEndPoint(IPAddress.IPv6Any, Settings.Port));
	_socket.Listen(10);

We never actually handle the incoming socket requests, thus the liveness probe will eventually fail given enough time. Need to actually handle the socket request and verify it by sending back some trivial piece of data.
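
A rough sketch of what handling the request could look like, assuming a plain accept loop is acceptable here. `ProbeListener`, its `Port` property and the `"akka.net"` reply bytes are all made up for illustration (the issue only asks for "some trivial piece of data"); only the `System.Net.Sockets` calls are real .NET APIs.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

// Hypothetical stand-in for the accept handling missing from SocketStatusTransport.
public sealed class ProbeListener : IDisposable
{
    // Arbitrary payload chosen for this sketch.
    private static readonly byte[] Reply = Encoding.ASCII.GetBytes("akka.net");
    private readonly Socket _socket;

    public ProbeListener(int port)
    {
        // Same setup as the existing transport code.
        _socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
        _socket.Bind(new IPEndPoint(IPAddress.IPv6Any, port));
        _socket.Listen(10);
        _ = AcceptLoopAsync(); // fire and forget; the loop ends when the socket is disposed
    }

    // Actual bound port (useful when constructed with port 0).
    public int Port => ((IPEndPoint)_socket.LocalEndPoint).Port;

    private async Task AcceptLoopAsync()
    {
        while (true)
        {
            Socket client;
            try { client = await _socket.AcceptAsync(); }
            catch (ObjectDisposedException) { return; } // listener was disposed
            catch (SocketException) { return; }         // listener was shut down

            using (client)
            {
                // Answer every probe connection with the trivial payload, then close it.
                try { await client.SendAsync(new ArraySegment<byte>(Reply), SocketFlags.None); }
                catch (SocketException) { /* client disconnected early; keep accepting */ }
            }
        }
    }

    public void Dispose() => _socket.Dispose();
}
```

A TCP integration test could then connect to the port, read the reply and assert on it, which also proves that connections are accepted rather than left to pile up in the listen backlog.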

Can you make this fix and verify it via a TCP integration test @izavala ?

Add debug option to log startup configuration at launch

Should do a dump of all of the built-in HealthCheck settings at launch, so users can troubleshoot when first configuring the probes and transports.

Need to reduce log verbosity on Akka.Persistence liveness probes

[INFO][2/7/2023 10:35:01 PM][Thread 0055][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Persistence probe terminated. Recreating...
[INFO][2/7/2023 10:35:11 PM][Thread 0052][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=True, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
[INFO][2/7/2023 10:35:21 PM][Thread 0054][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=False, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
[INFO][2/7/2023 10:35:31 PM][Thread 0055][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=True, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
[INFO][2/7/2023 10:35:41 PM][Thread 0010][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=True, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.

This needs to go into debug logging or not get logged at all unless there's a problem.