instaclustr / cassandra-operator

Kubernetes operator for Apache Cassandra

Home Page: https://instaclustr.com

License: Apache License 2.0

Java 11.35% Shell 15.05% Dockerfile 2.38% Makefile 3.25% Go 67.11% Mustache 0.86%

kubernetes kubernetes-operator cassandra nosql storage ops netapp-public

cassandra-operator's Issues

Make K8s lookups less brittle

In a number of places we look up user-supplied values that might not exist, and we don't handle the missing case, which causes the operator to exit. Let's handle not finding k8s objects.
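A minimal sketch of one way to do this, assuming a lookup helper that turns a "not found" response into an empty `Optional` instead of letting the exception stop the service. `SafeLookup`, `ApiError`, and `fetch` are hypothetical names for illustration, not operator code; `ApiError` stands in for a client exception that carries an HTTP status code.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Callable;

// Sketch only: every name in this class is hypothetical.
final class SafeLookup {

    /** Stand-in for an API exception that carries an HTTP status code. */
    static final class ApiError extends Exception {
        final int code;

        ApiError(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /** Runs the call; maps a 404 to Optional.empty() and rethrows everything else. */
    static <T> Optional<T> lookup(Callable<T> call) throws Exception {
        try {
            return Optional.ofNullable(call.call());
        } catch (ApiError e) {
            if (e.code == 404) {
                return Optional.empty();
            }
            throw e;
        }
    }

    /** Fake API call backed by a map, used only to exercise lookup(). */
    static String fetch(Map<String, String> store, String name) throws ApiError {
        String value = store.get(name);
        if (value == null) {
            throw new ApiError(404, "Not Found");
        }
        return value;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> configMaps = Map.of("cassandra-config", "data");
        Optional<String> found = lookup(() -> fetch(configMaps, "cassandra-config"));
        Optional<String> missing = lookup(() -> fetch(configMaps, "user-supplied-name"));
        System.out.println(found.isPresent() + " " + missing.isPresent());
    }
}
```

Callers can then decide per lookup whether a missing object means "skip", "create it", or "report a validation error", rather than every miss being fatal.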

Cluster Scaling Down

Statefulset deletes the last pod
Run decommission on the node

DC can't be recreated after deleting it

Possibly related to issue #54.

Custom Seed provider for Kubernetes

Restore Cassandra Cluster data

Should be the same size cluster as original

Fork backup tool to separate repository

The backup tool that is part of the toolbelt of the operator is getting some attention outside of the kubernetes functionality being provided with the operator project. At this point the backup module presents enough functionality that it could be a stand-alone project.

The important step is to make sure that the cassandra-operator remains unaffected. A somewhat clean way of proceeding here would be to segregate the backup module and add it back as a dependency that can be used by cassandra-operator.

To keep the backup module stable an initial release would be required. Post this release further breaking changes could be introduced to the master branch whilst a 1.0.0 branch could be maintained for bug fixing. This approach will allow for the cassandra-operator to introduce breaking changes on backup's master branch if necessary.

Project release is not ready

Looks like we are not ready to give out a project v1.0.0 release since we still have some bugs to be tackled.

Support namespaces other than `default`

Currently `default` is hard-coded into every API call that takes a namespace parameter.

There is a CLI option. It should be wired up via Guice and `@Inject`-ed into all the places that need a namespace.

Issue when creating backup snapshot

23:13:20.114 [BackupControllerService] ERROR o.g.j.m.i.WriterInterceptorExecutor - MessageBodyWriter not found for media type=application/json, type=class com.instaclustr.backup.BackupArguments, genericType=class com.instaclustr.backup.BackupArguments.

Embed operator git commit + version in manifest and print on startup

Would be nice to see the git commit SHA and the operator version printed in the logs on startup.
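As a rough sketch of what this could look like, the operator could read its own JAR manifest and log one line at startup. `Implementation-Version` is a standard manifest attribute; `Git-Commit` is a hypothetical custom attribute that the build would have to write, and the sample manifest text in `main` is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Sketch only: BuildInfo and the "Git-Commit" attribute are assumptions.
final class BuildInfo {

    /** Builds the startup banner from the manifest's main attributes. */
    static String describe(Manifest manifest) {
        Attributes attrs = manifest.getMainAttributes();
        String version = attrs.getValue("Implementation-Version");
        String commit = attrs.getValue("Git-Commit");
        return "cassandra-operator version=" + (version != null ? version : "unknown")
                + " commit=" + (commit != null ? commit : "unknown");
    }

    public static void main(String[] args) throws IOException {
        // In a real JAR the manifest would be loaded from META-INF/MANIFEST.MF.
        String text = "Manifest-Version: 1.0\r\n"
                + "Implementation-Version: 1.0.0\r\n"
                + "Git-Commit: abc1234\r\n"
                + "\r\n";
        Manifest manifest = new Manifest(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(describe(manifest));
    }
}
```

Falling back to "unknown" keeps startup from failing when the operator runs from an IDE or an unpackaged build with no manifest entries.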

Rolling Upgrades with no downtime [minor or major Cassandra version changes]

Readiness Probe and healthcheck per node

Develop a set of scripts to perform health checks. As a best practice, stay away from 'nodetool status'.

Failure Scenario: Failure of a C* Pod

Change podManagementPolicy from "OrderedReady" to "Parallel" to speed up cluster provisioning

The current seed provider is able to support "Parallel" Pod creation. TBD for other advanced seed providers.

DC Deprovision Issue

Deployed the operator on PKS using the YAML here, and got the exception below while deleting a test DataCenter cluster. It looks like a lack of permissions on the PV:

16:07:19.830 [ControllerService] ERROR c.g.c.util.concurrent.ServiceManager - Service ControllerService [FAILED] has failed in the RUNNING state.
io.kubernetes.client.ApiException: Forbidden
        at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
        at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
        at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaimWithHttpInfo(CoreV1Api.java:30090)
        at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaim(CoreV1Api.java:30072)
        at com.instaclustr.cassandra.operator.service.ControllerService.deletePersistentVolumeAndPersistentVolumeClaim(ControllerService.java:524)
        at com.instaclustr.cassandra.operator.service.ControllerService.deleteDataCenter(ControllerService.java:411)
        at com.instaclustr.cassandra.operator.service.ControllerService.run(ControllerService.java:130)
        at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
        at java.lang.Thread.run(Thread.java:748)
16:07:19.833 [ControllerService] ERROR com.instaclustr.guava.Application - Service ControllerService [FAILED] failed. Shutting down.

Heartbeat race condition

Looks like the heartbeat detector can start before a node is ready, and then gives up.

6:41:34.227 [JMX client heartbeat 2] WARN c.i.c.o.j.CassandraConnectionFactory - JMX connection to /10.8.1.8 unexpectedly failed. javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin@33ab4d49][type=jmx.remote.connection.failed][message=Failed to communicate with the server: java.net.ConnectException: Connection refused (Connection refused)]
16:41:34.229 [JMX client heartbeat 2] WARN javax.management.remote.misc - Failed to check connection: java.net.ConnectException: Connection refused (Connection refused)
16:41:34.229 [JMX client heartbeat 2] WARN javax.management.remote.misc - stopping
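One possible mitigation, sketched under the assumption that a refused connection simply means the node is not listening yet: retry the initial connection with a growing delay instead of stopping on the first failure. `ConnectRetry` and `withBackoff` are hypothetical names, and the `connect` callable stands in for whatever opens the JMX connection.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

// Sketch only: not the operator's connection factory.
final class ConnectRetry {

    /** Calls connect until it succeeds or maxAttempts is used up, doubling the delay each time. */
    static <T> T withBackoff(Callable<T> connect, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (ConnectException e) {
                last = e; // node not accepting connections yet; wait and retry
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last; // every attempt was refused
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withBackoff(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("Connection refused");
            }
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Only `ConnectException` is retried here; any other failure propagates immediately so that real errors are not hidden behind the retry loop.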

The operator is not working at all :(

Hi,

I'm just installing the operator and it's failing with these errors:

18:50:10.198 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Creating Custom Resource Definition cassandra-datacenters.stable.instaclustr.com
18:50:10.234 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Custom Resource Definition cassandra-datacenters.stable.instaclustr.com already exists.
18:50:10.237 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Creating Custom Resource Definition cassandra-clusters.stable.instaclustr.com
18:50:10.243 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Custom Resource Definition cassandra-clusters.stable.instaclustr.com already exists.
18:50:10.248 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Creating Custom Resource Definition cassandra-backups.stable.instaclustr.com
18:50:10.254 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Custom Resource Definition cassandra-backups.stable.instaclustr.com already exists.
18:50:10.372 [main] INFO  com.instaclustr.guava.Application - Services to start: [ControllerService [NEW], GarbageCollectorService [NEW], CassandraHealthCheckService [NEW], BackupControllerService [NEW], WatchService(Cluster) [NEW], WatchService(DataCenter) [NEW], WatchService(V1beta2StatefulSet) [NEW], WatchService(V1ConfigMap) [NEW], WatchService(Backup) [NEW]]
18:50:10.374 [main] INFO  com.instaclustr.guava.Application - Starting services.
18:50:10.394 [WatchService(Cluster)] ERROR c.g.c.util.concurrent.ServiceManager - Service WatchService(Cluster) [FAILED] has failed in the RUNNING state.
io.kubernetes.client.ApiException: Not Found
	at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
	at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
	at com.instaclustr.k8s.watch.WatchService.syncResourceCache(WatchService.java:149)
	at com.instaclustr.k8s.watch.WatchService.run(WatchService.java:91)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
	at java.lang.Thread.run(Thread.java:748)
18:50:10.396 [WatchService(Cluster)] ERROR com.instaclustr.guava.Application - Service WatchService(Cluster) [FAILED] failed. Shutting down.
18:50:10.411 [ServiceManager Shutdown Hook] INFO  com.instaclustr.guava.Application - Shutting down [ControllerService [RUNNING], GarbageCollectorService [RUNNING], CassandraHealthCheckService [RUNNING], BackupControllerService [RUNNING], WatchService(DataCenter) [RUNNING], WatchService(V1beta2StatefulSet) [RUNNING], WatchService(V1ConfigMap) [RUNNING]].
18:50:10.416 [main] ERROR com.instaclustr.guava.Application - Services [WatchService(Cluster) [FAILED]] failed to start.
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while calling command (com.instaclustr.cassandra.operator.Operator@3b220bcb): java.lang.IllegalStateException: Expected to be healthy after starting. The following services are not running: {STARTING=[WatchService(Backup) [RUNNING]], FAILED=[WatchService(Cluster) [FAILED]]}
	at picocli.CommandLine.execute(CommandLine.java:880)
	at picocli.CommandLine.access$700(CommandLine.java:111)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:1037)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:1005)
	at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:913)
	at picocli.CommandLine.parseWithHandlers(CommandLine.java:1196)
	at picocli.CommandLine.call(CommandLine.java:1448)
	at picocli.CommandLine.call(CommandLine.java:1413)
	at com.instaclustr.cassandra.operator.Operator.main(Operator.java:82)
Caused by: java.lang.IllegalStateException: Expected to be healthy after starting. The following services are not running: {STARTING=[WatchService(Backup) [RUNNING]], FAILED=[WatchService(Cluster) [FAILED]]}
	at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.checkHealthy(ServiceManager.java:741)
	at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.awaitHealthy(ServiceManager.java:568)
	at com.google.common.util.concurrent.ServiceManager.awaitHealthy(ServiceManager.java:329)
	at com.instaclustr.guava.Application.call(Application.java:59)
	at com.instaclustr.cassandra.operator.Operator.call(Operator.java:129)
	at com.instaclustr.cassandra.operator.Operator.call(Operator.java:29)
	at picocli.CommandLine.execute(CommandLine.java:873)
	... 8 more
	Suppressed: com.google.common.util.concurrent.ServiceManager$FailedService: WatchService(Cluster) [FAILED]
	Caused by: io.kubernetes.client.ApiException: Not Found
		at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
		at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
		at com.instaclustr.k8s.watch.WatchService.syncResourceCache(WatchService.java:149)
		at com.instaclustr.k8s.watch.WatchService.run(WatchService.java:91)
		at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
		at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
		at java.lang.Thread.run(Thread.java:748)
18:51:10.431 [ServiceManager Shutdown Hook] WARN  com.instaclustr.guava.Application - Timeout waiting for [WatchService(DataCenter) [STOPPING], WatchService(V1beta2StatefulSet) [STOPPING], WatchService(V1ConfigMap) [STOPPING], WatchService(Backup) [STOPPING]] to stop. Retrying.
java.util.concurrent.TimeoutException: Timeout waiting for the services to stop. The following services have not stopped: {STOPPING=[WatchService(DataCenter) [STOPPING], WatchService(V1beta2StatefulSet) [STOPPING], WatchService(V1ConfigMap) [STOPPING], WatchService(Backup) [STOPPING]]}
	at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.awaitStopped(ServiceManager.java:586)
	at com.google.common.util.concurrent.ServiceManager.awaitStopped(ServiceManager.java:365)
	at com.instaclustr.guava.Application.lambda$call$0(Application.java:37)
	at java.lang.Thread.run(Thread.java:748)
18:51:10.525 [ServiceManager Shutdown Hook] INFO  com.instaclustr.guava.Application - Successfully shut down all services.

I took the default configuration of the operator Helm chart... Can someone help me please?

Thanks,

Unable to create "CassandraDataCenter" resource on PKS

Issue: Unable to create "CassandraDataCenter" resource

`kubectl create -f test-dc.yaml` gives the error below:

Error from server (NotFound): error when creating "test-dc.yaml": the server could not find the requested resource (post cassandradatacentres.stable.instaclustr.com)

Recreate Missing Primitive Resources with Watchers

Implement a thread that scans cluster status periodically. In addition to the provisioning work in Issue #6, this thread should stay alive for the operator's whole life cycle. It should also detect failed or missing resources and recreate them, holding the cluster in the desired state.
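The core of such a scan can be sketched as a single reconcile pass that diffs the resources a data center should have against those actually observed. `ReconcilePass`, `missing`, and the resource names below are hypothetical; a scheduler such as `java.util.concurrent.ScheduledExecutorService` (or Guava's `AbstractScheduledService`, which appears in the operator's stack traces) could run the pass at a fixed interval.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch only: resource names are made up for illustration.
final class ReconcilePass {

    /** Returns the desired resource names that were not observed, in desired order. */
    static Set<String> missing(List<String> desired, Set<String> observed) {
        Set<String> result = new LinkedHashSet<>(desired);
        result.removeAll(observed);
        return result;
    }

    public static void main(String[] args) {
        List<String> desired = List.of(
                "statefulset/test-dc",
                "service/test-dc-nodes",
                "configmap/test-dc-config");
        Set<String> observed = Set.of("statefulset/test-dc");
        for (String name : missing(desired, observed)) {
            System.out.println("recreate " + name);
        }
    }
}
```

Keeping the diff a pure function makes it easy to test without a cluster; the side-effecting "recreate" step can then be layered on top.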

Cluster Scaling Up

Enable basic security with Cassandra Clusters

Basic RBAC with Cassandra admin role

Enable deletion of C* Data Center(DC) in Operator

Parameterize Operator inputs

Helm package

Create a helm package for the operator

Can't install the cassandra chart

Hi,

Following your instructions I get the following error when trying to install Cassandra (not the operator):
helm : Error: unable to recognize "": no matches for stable.instaclustr.com/, Kind=CassandraDataCenter

I just followed the README, is it working for you?

Thank you 😃

Warning "Unprocessable Entity"

When a scale command is triggered, the following warning is logged:

16:14:15.727 [ControllerService] WARN  c.i.c.operator.k8s.K8sResourceUtils - Failed to update resource. This will be a hard exception in the future.
io.kubernetes.client.ApiException: Unprocessable Entity
	at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
	at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
	at io.kubernetes.client.ApiClient.execute(ApiClient.java:781)
	at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.lambda$createOrReplaceResource$1(K8sResourceUtils.java:51)
	at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.createOrReplaceResource(K8sResourceUtils.java:41)
	at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.createOrReplaceResource(K8sResourceUtils.java:49)
	at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.createOrReplaceNamespaceService(K8sResourceUtils.java:60)
	at com.instaclustr.cassandra.operator.service.ControllerService.createOrReplaceNodesService(ControllerService.java:279)
	at com.instaclustr.cassandra.operator.service.ControllerService.createOrReplaceDataCenter(ControllerService.java:128)
	at com.instaclustr.cassandra.operator.service.ControllerService.run(ControllerService.java:119)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
	at java.lang.Thread.run(Thread.java:748)

StatefulSet for Cassandra, optimized with operator design

Collaborate on combining the StatefulSets from Pivotal and Instaclustr into one that serves as a building block for the operator.

Rolling Upgrades with no downtime [Docker Image changes]

Multi-AZ topologies

The behavior of Kubernetes scheduling pods across different failure domains is somewhat undefined and hard to control. See for details:

kubernetes/kubernetes#41598
kubernetes/kubernetes#44798
kubernetes/community#1857

Despite recent fixes to statefulsets, we still cannot inject failure domain information into the pods. So we still need to do a statefulset per failure domain.

This also gives us some flexibility in tolerating outages and nodes being down in a single StatefulSet.
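With one StatefulSet per failure domain, the operator has to decide how many replicas each StatefulSet gets. A minimal sketch, assuming an even spread where the first zones absorb any remainder; `RackLayout`, `replicasPerZone`, and the zone names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: one map entry corresponds to one StatefulSet.
final class RackLayout {

    /** Spreads replicas as evenly as possible; earlier zones receive the remainder. */
    static Map<String, Integer> replicasPerZone(List<String> zones, int replicas) {
        if (zones.isEmpty() || replicas < 0) {
            throw new IllegalArgumentException("need at least one zone and replicas >= 0");
        }
        Map<String, Integer> layout = new LinkedHashMap<>();
        int base = replicas / zones.size();
        int extra = replicas % zones.size();
        for (int i = 0; i < zones.size(); i++) {
            layout.put(zones.get(i), base + (i < extra ? 1 : 0));
        }
        return layout;
    }

    public static void main(String[] args) {
        // Seven nodes over three zones: prints {zone-a=3, zone-b=2, zone-c=2}
        System.out.println(replicasPerZone(List.of("zone-a", "zone-b", "zone-c"), 7));
    }
}
```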

Null Pointer Exception prevents cluster creation

I installed via Helm, and saw that the operator was up and running:

k get pods
NAME                                  READY     STATUS    RESTARTS   AGE
cassandra-operator-549744c8fc-w6p47   1/1       Running   0          44m

But then when I create a cluster, also via Helm, the Cassandra pods are not created and I see this in the operator's logs:

java.lang.NullPointerException: null
	at com.instaclustr.cassandra.operator.controller.DataCenterReconciliationController.createOrReplaceNodesService(DataCenterReconciliationController.java:417)
	at com.instaclustr.cassandra.operator.controller.DataCenterReconciliationController.reconcileDataCenter(DataCenterReconciliationController.java:66)
	at com.instaclustr.cassandra.operator.service.OperatorService.run(OperatorService.java:106)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
	at java.lang.Thread.run(Thread.java:748)

Any ideas what I might have done wrong?

Backups of Cassandra cluster data

Should be periodic cron and options to upload to S3/Object store of choice

Operator Restarted Once when Creating DC

18:14:33.431 [CassandraHealthCheckService RUNNING] ERROR c.g.c.util.concurrent.ServiceManager - Service CassandraHealthCheckService [FAILED] has failed in the RUNNING state.
java.lang.NullPointerException: null
	at com.google.common.net.InetAddresses.ipStringToBytes(InetAddresses.java:162)
	at com.google.common.net.InetAddresses.forString(InetAddresses.java:136)
	at com.instaclustr.cassandra.operator.service.CassandraHealthCheckService.runOneIteration(CassandraHealthCheckService.java:63)
	at com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:193)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

(?) Memory continuously increasing

Hi,

I said in my previous issue I will wait for a more improved operator to be able to use Jaeger easily... So I deleted my Cassandra cluster instance but left the operator.

In the meantime one of my nodes has crashed so I looked for the cause of the problem. Looking at all my pods/statefulsets/deployments metrics I noticed that the Cassandra operator memory had increased over time:

And that's the only one doing that. I "suspect" it for crashing its host node. For sure I could set request and limit resources to kill the pod before it eats all the node memory but that doesn't explain the consumption graph over time.

Moreover, currently the operator pod memory is stable around 200MB... 😄

I can't tell for sure there is a memory leak inside the operator, but I wanted to share with you this "experience".

I would have loved to also share logs of the operator but I don't have them (I'm really sorry).

Have a good day,

Prometheus Monitoring + metrics

Leverage Instaclustr Prometheus exporter

Rolling upgrade with image change & cassandra minor version change

A test to see if statefulset can handle this properly.

Support Operator Lifecycle Manager

Watch and see how the CoreOS OLM project goes in terms of adoption. Support OLM management:
https://github.com/operator-framework/operator-lifecycle-manager

Setup Circle CI

Failed to pull image "gcr.io/cassandra-operator/cassandra-sidecar-dev"

I see you uploaded almost everything to gcr.io, but cassandra-sidecar is missing.
After installing the operator and starting the cluster I have this error:

Failed to pull image "gcr.io/cassandra-operator/cassandra-sidecar-dev": rpc error: code = Unknown desc = unauthorized: authentication required

Failure Scenario: Failure of entire C* cluster

PVs will retain the entire SSTables on IaaS persistent disks
Need to work on recovery scenarios

Operator Config for deploying to K8s env

Warnings on com.squareup.okhttp.OkHttpClient

The operator periodically logs warnings like the one below:

[OkHttp ConnectionPool] WARN  com.squareup.okhttp.OkHttpClient - A connection to https://192.168.99.101:8443/ was leaked. Did you forget to close a response body?

Lack of Permission while deleting DC cluster on PKS

16:07:19.830 [ControllerService] ERROR c.g.c.util.concurrent.ServiceManager - Service ControllerService [FAILED] has failed in the RUNNING state.
io.kubernetes.client.ApiException: Forbidden
        at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
        at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
        at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaimWithHttpInfo(CoreV1Api.java:30090)
        at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaim(CoreV1Api.java:30072)
        at com.instaclustr.cassandra.operator.service.ControllerService.deletePersistentVolumeAndPersistentVolumeClaim(ControllerService.java:524)
        at com.instaclustr.cassandra.operator.service.ControllerService.deleteDataCenter(ControllerService.java:411)
        at com.instaclustr.cassandra.operator.service.ControllerService.run(ControllerService.java:130)
        at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
        at java.lang.Thread.run(Thread.java:748)
16:07:19.833 [ControllerService] ERROR com.instaclustr.guava.Application - Service ControllerService [FAILED] failed. Shutting down.

Cluster Scaling Down (Duplicate of Issue #16)

WARNING on JMX Client Heartbeat

Deleting a DC causes warnings like the ones below:

17:25:58.041 [JMX client heartbeat 3] WARN  c.i.c.o.j.CassandraConnectionFactory - JMX connection to /172.17.0.4 unexpectedly failed. javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin@341b15ad][type=jmx.remote.connection.failed][message=Failed to communicate with the server: java.net.NoRouteToHostException: No route to host (Host unreachable)]
17:26:01.170 [JMX client heartbeat 3] WARN  javax.management.remote.misc - Failed to check connection: java.net.NoRouteToHostException: No route to host (Host unreachable)
17:26:01.171 [JMX client heartbeat 3] WARN  javax.management.remote.misc - stopping

Watch the StatefulSet - Clean up decom'd PVs

Currently, if the DC is scaled down by more than one node (e.g., 6 -> 3 nodes), the controller will get "stuck" after it has scaled down the statefulset, because it isn't notified when the pod has finished being deleted. I'm hoping that a watch on the statefulset objects will be enough for the controller to observe that the pod has been deleted (statefulset.status.currentReplicas will change) and trigger a reconcile to kick off the next decommission.

For now, if you re-apply the DC definition after the node has decommissioned, keeping the same size but modifying a random-named field, it'll trigger a reconcile (e.g., create field spec.xyz with value 1, then change it to 2 and re-apply).

It also looks like deletion needs to be improved. It turns out the logic of scaling down a statefulset before deleting it is required (oops), so I'll add it back in. But we also need to remove PVs and PV claims. The PVs also need to be removed on decommission. Otherwise, if a scale down followed by a scale up is attempted, C* will fail to start because the pod gets a PV with the data directory of an already decommissioned node.
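The one-node-at-a-time flow described above can be sketched as a small planning helper. It relies on two StatefulSet conventions: the pod with the highest ordinal is removed first, and each pod's claim is named `<claim template>-<statefulset>-<ordinal>`. `ScaleDownStep` and its method names, and the `data-volume` and `test-dc` names, are hypothetical.

```java
// Sketch only: plans the steps, it does not talk to Kubernetes or Cassandra.
final class ScaleDownStep {

    /** Ordinal of the pod a StatefulSet removes next: always the highest one. */
    static int podToDecommission(int currentReplicas) {
        if (currentReplicas < 1) {
            throw new IllegalArgumentException("no pods left to decommission");
        }
        return currentReplicas - 1;
    }

    /** Replica count to apply next: step down by one until the target is reached. */
    static int nextReplicas(int currentReplicas, int desiredReplicas) {
        return currentReplicas > desiredReplicas ? currentReplicas - 1 : currentReplicas;
    }

    /** Claim name derived for a StatefulSet pod: template-statefulset-ordinal. */
    static String claimName(String claimTemplate, String statefulSet, int ordinal) {
        return claimTemplate + "-" + statefulSet + "-" + ordinal;
    }

    public static void main(String[] args) {
        int current = 6;
        int desired = 3;
        while (current > desired) {
            int ordinal = podToDecommission(current);
            System.out.println("decommission test-dc-" + ordinal
                    + ", then delete " + claimName("data-volume", "test-dc", ordinal));
            current = nextReplicas(current, desired);
        }
    }
}
```

Each loop iteration corresponds to one reconcile triggered by the statefulset watch: decommission the highest-ordinal node, shrink the statefulset by one, clean up that pod's claim, and wait for the next event.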
