instaclustr / cassandra-operator
Kubernetes operator for Apache Cassandra
Home Page: https://instaclustr.com
License: Apache License 2.0
There are a number of places where we look up user-supplied values that might not exist and we don't handle that, which causes the operator to exit. Let's handle not finding k8s objects.
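One way to handle a missing object is to return an empty result instead of letting an exception take the service down. The sketch below is only an illustration: `ResourceLookup` and its in-memory map are hypothetical stand-ins for a Kubernetes lookup, not the operator's actual code.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a Kubernetes lookup; not the operator's real API.
final class ResourceLookup {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    void put(String name, String value) {
        store.put(name, value);
    }

    // Returns Optional.empty() instead of throwing when the object does not exist.
    Optional<String> find(String name) {
        return Optional.ofNullable(store.get(name));
    }

    // The caller decides what "missing" means: here, log and fall back to a default.
    String findOrDefault(String name, String fallback) {
        return find(name).orElseGet(() -> {
            System.err.println("Resource '" + name + "' not found, using fallback");
            return fallback;
        });
    }
}
```

With this shape, a user-supplied name that does not resolve becomes a logged condition the caller can act on rather than a fatal error.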
Statefulset deletes the last pod
Run decommission on the node
Maybe related with issue #54
Should be the same size cluster as original
The backup tool that is part of the toolbelt of the operator is getting some attention outside of the kubernetes functionality being provided with the operator project. At this point the backup module presents enough functionality that it could be a stand-alone project.
The important step is to make sure that the cassandra-operator remains unaffected. A somewhat clean way of proceeding here would be to segregate the backup module and add it back as a dependency that can be used by cassandra-operator.
To keep the backup module stable an initial release would be required. Post this release further breaking changes could be introduced to the master branch whilst a 1.0.0 branch could be maintained for bug fixing. This approach will allow for the cassandra-operator to introduce breaking changes on backup's master branch if necessary.
Looks like we are not ready to give out a project v1.0.0 release since we still have some bugs to be tackled.
Currently "default" is hard-coded into every API call that takes a namespace parameter.
There is a CLI option. This should be wired up via Guice and @Inject-ed into all the places that need a namespace.
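A minimal sketch of the idea, using plain constructor injection so it runs without Guice on the classpath. `NamespacedClient` is a hypothetical name; with Guice the constructor would carry `@Inject` and the namespace would be bound from the CLI option in a module.

```java
import java.util.Objects;

// Hypothetical class; with Guice the constructor would be annotated with @Inject
// and the namespace bound once from the parsed CLI option.
final class NamespacedClient {
    private final String namespace;

    NamespacedClient(String namespace) {
        this.namespace = Objects.requireNonNull(namespace, "namespace");
    }

    // Builds the REST path for a StatefulSet using the injected namespace
    // instead of a hard-coded "default".
    String statefulSetPath(String name) {
        return "/apis/apps/v1/namespaces/" + namespace + "/statefulsets/" + name;
    }
}
```

The point of the design is that the namespace is decided in one place and every caller receives it, so no API call needs its own literal.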
23:13:20.114 [BackupControllerService] ERROR o.g.j.m.i.WriterInterceptorExecutor - MessageBodyWriter not found for media type=application/json, type=class com.instaclustr.backup.BackupArguments, genericType=class com.instaclustr.backup.BackupArguments.
Would be nice to see the git commit SHA and the operator version printed in the logs on startup.
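A rough sketch of such a startup banner, assuming the build writes the version and commit into a properties file. The property keys `operator.version` and `git.commit.id` are assumptions for illustration, not keys the project is known to use.

```java
import java.util.Properties;

final class BuildInfo {
    // Formats a startup banner from build properties. The keys are hypothetical
    // and would have to be written at build time; missing keys print "unknown".
    static String banner(Properties props) {
        String version = props.getProperty("operator.version", "unknown");
        String sha = props.getProperty("git.commit.id", "unknown");
        return "cassandra-operator version " + version + " (git " + sha + ")";
    }
}
```

Logging this one line before the services start would make it clear which build produced a given log file.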
Develop a set of scripts to perform health check operation. As a best practice, stay away from 'nodetool status'
Current seed provider is able to support "parallel" Pod creation. TBD for other advanced seed providers.
Deployed the operator on PKS using the YAML here, and got the exception below while deleting a test DataCenter cluster. It seems to be a lack of permissions on the PV:
16:07:19.830 [ControllerService] ERROR c.g.c.util.concurrent.ServiceManager - Service ControllerService [FAILED] has failed in the RUNNING state.
io.kubernetes.client.ApiException: Forbidden
at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaimWithHttpInfo(CoreV1Api.java:30090)
at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaim(CoreV1Api.java:30072)
at com.instaclustr.cassandra.operator.service.ControllerService.deletePersistentVolumeAndPersistentVolumeClaim(ControllerService.java:524)
at com.instaclustr.cassandra.operator.service.ControllerService.deleteDataCenter(ControllerService.java:411)
at com.instaclustr.cassandra.operator.service.ControllerService.run(ControllerService.java:130)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
at java.lang.Thread.run(Thread.java:748)
16:07:19.833 [ControllerService] ERROR com.instaclustr.guava.Application - Service ControllerService [FAILED] failed. Shutting down.
Looks like the heartbeat detector starts before a node is potentially ready and then gives up.
6:41:34.227 [JMX client heartbeat 2] WARN c.i.c.o.j.CassandraConnectionFactory - JMX connection to /10.8.1.8 unexpectedly failed. javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin@33ab4d49][type=jmx.remote.connection.failed][message=Failed to communicate with the server: java.net.ConnectException: Connection refused (Connection refused)]
16:41:34.229 [JMX client heartbeat 2] WARN javax.management.remote.misc - Failed to check connection: java.net.ConnectException: Connection refused (Connection refused)
16:41:34.229 [JMX client heartbeat 2] WARN javax.management.remote.misc - stopping
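One possible mitigation is to retry the initial connection with a growing delay before giving up. The helper below is a generic sketch under that assumption: `RetryingConnector` is a hypothetical name and the `Supplier` stands in for whatever actually opens the JMX connection.

```java
import java.util.function.Supplier;

final class RetryingConnector {
    // Calls connect up to maxAttempts times, doubling the sleep between attempts,
    // and rethrows the last failure if every attempt fails.
    static <T> T connectWithRetry(Supplier<T> connect, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting to retry", ie);
                    }
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

A node that refuses connections for its first few seconds would then be picked up on a later attempt instead of being abandoned.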
Hi,
I'm just installing the operator and it's failing with these errors:
18:50:10.198 [main] INFO c.i.c.o.p.o.CreateCustomResourceDefinitions - Creating Custom Resource Definition cassandra-datacenters.stable.instaclustr.com
18:50:10.234 [main] INFO c.i.c.o.p.o.CreateCustomResourceDefinitions - Custom Resource Definition cassandra-datacenters.stable.instaclustr.com already exists.
18:50:10.237 [main] INFO c.i.c.o.p.o.CreateCustomResourceDefinitions - Creating Custom Resource Definition cassandra-clusters.stable.instaclustr.com
18:50:10.243 [main] INFO c.i.c.o.p.o.CreateCustomResourceDefinitions - Custom Resource Definition cassandra-clusters.stable.instaclustr.com already exists.
18:50:10.248 [main] INFO c.i.c.o.p.o.CreateCustomResourceDefinitions - Creating Custom Resource Definition cassandra-backups.stable.instaclustr.com
18:50:10.254 [main] INFO c.i.c.o.p.o.CreateCustomResourceDefinitions - Custom Resource Definition cassandra-backups.stable.instaclustr.com already exists.
18:50:10.372 [main] INFO com.instaclustr.guava.Application - Services to start: [ControllerService [NEW], GarbageCollectorService [NEW], CassandraHealthCheckService [NEW], BackupControllerService [NEW], WatchService(Cluster) [NEW], WatchService(DataCenter) [NEW], WatchService(V1beta2StatefulSet) [NEW], WatchService(V1ConfigMap) [NEW], WatchService(Backup) [NEW]]
18:50:10.374 [main] INFO com.instaclustr.guava.Application - Starting services.
18:50:10.394 [WatchService(Cluster)] ERROR c.g.c.util.concurrent.ServiceManager - Service WatchService(Cluster) [FAILED] has failed in the RUNNING state.
io.kubernetes.client.ApiException: Not Found
at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
at com.instaclustr.k8s.watch.WatchService.syncResourceCache(WatchService.java:149)
at com.instaclustr.k8s.watch.WatchService.run(WatchService.java:91)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
at java.lang.Thread.run(Thread.java:748)
18:50:10.396 [WatchService(Cluster)] ERROR com.instaclustr.guava.Application - Service WatchService(Cluster) [FAILED] failed. Shutting down.
18:50:10.411 [ServiceManager Shutdown Hook] INFO com.instaclustr.guava.Application - Shutting down [ControllerService [RUNNING], GarbageCollectorService [RUNNING], CassandraHealthCheckService [RUNNING], BackupControllerService [RUNNING], WatchService(DataCenter) [RUNNING], WatchService(V1beta2StatefulSet) [RUNNING], WatchService(V1ConfigMap) [RUNNING]].
18:50:10.416 [main] ERROR com.instaclustr.guava.Application - Services [WatchService(Cluster) [FAILED]] failed to start.
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while calling command (com.instaclustr.cassandra.operator.Operator@3b220bcb): java.lang.IllegalStateException: Expected to be healthy after starting. The following services are not running: {STARTING=[WatchService(Backup) [RUNNING]], FAILED=[WatchService(Cluster) [FAILED]]}
at picocli.CommandLine.execute(CommandLine.java:880)
at picocli.CommandLine.access$700(CommandLine.java:111)
at picocli.CommandLine$RunLast.handle(CommandLine.java:1037)
at picocli.CommandLine$RunLast.handle(CommandLine.java:1005)
at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:913)
at picocli.CommandLine.parseWithHandlers(CommandLine.java:1196)
at picocli.CommandLine.call(CommandLine.java:1448)
at picocli.CommandLine.call(CommandLine.java:1413)
at com.instaclustr.cassandra.operator.Operator.main(Operator.java:82)
Caused by: java.lang.IllegalStateException: Expected to be healthy after starting. The following services are not running: {STARTING=[WatchService(Backup) [RUNNING]], FAILED=[WatchService(Cluster) [FAILED]]}
at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.checkHealthy(ServiceManager.java:741)
at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.awaitHealthy(ServiceManager.java:568)
at com.google.common.util.concurrent.ServiceManager.awaitHealthy(ServiceManager.java:329)
at com.instaclustr.guava.Application.call(Application.java:59)
at com.instaclustr.cassandra.operator.Operator.call(Operator.java:129)
at com.instaclustr.cassandra.operator.Operator.call(Operator.java:29)
at picocli.CommandLine.execute(CommandLine.java:873)
... 8 more
Suppressed: com.google.common.util.concurrent.ServiceManager$FailedService: WatchService(Cluster) [FAILED]
Caused by: io.kubernetes.client.ApiException: Not Found
at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
at com.instaclustr.k8s.watch.WatchService.syncResourceCache(WatchService.java:149)
at com.instaclustr.k8s.watch.WatchService.run(WatchService.java:91)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
at java.lang.Thread.run(Thread.java:748)
18:51:10.431 [ServiceManager Shutdown Hook] WARN com.instaclustr.guava.Application - Timeout waiting for [WatchService(DataCenter) [STOPPING], WatchService(V1beta2StatefulSet) [STOPPING], WatchService(V1ConfigMap) [STOPPING], WatchService(Backup) [STOPPING]] to stop. Retrying.
java.util.concurrent.TimeoutException: Timeout waiting for the services to stop. The following services have not stopped: {STOPPING=[WatchService(DataCenter) [STOPPING], WatchService(V1beta2StatefulSet) [STOPPING], WatchService(V1ConfigMap) [STOPPING], WatchService(Backup) [STOPPING]]}
at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.awaitStopped(ServiceManager.java:586)
at com.google.common.util.concurrent.ServiceManager.awaitStopped(ServiceManager.java:365)
at com.instaclustr.guava.Application.lambda$call$0(Application.java:37)
at java.lang.Thread.run(Thread.java:748)
18:51:10.525 [ServiceManager Shutdown Hook] INFO com.instaclustr.guava.Application - Successfully shut down all services.
I took the default configuration of the operator Helm chart... Can someone help me please?
Thanks,
Issue: Unable to create "CassandraDataCenter" resource
kubectl create -f test-dc.yaml gives below error
Error from server (NotFound): error when creating "test-dc.yaml": the server could not find the requested resource (post cassandradatacentres.stable.instaclustr.com)
Implement a thread to scan cluster status periodically. In addition to Issue #6 for provisioning, this thread should be alive through operator's life cycle. Also it should be able to detect resource failure/missing and rescue them back, hold the cluster in a desired state.
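A minimal sketch of such a scanner, assuming the desired and observed resources can each be represented as a set of names. `ClusterScanner` and the `observe` supplier are hypothetical; a real implementation would recreate resources rather than print.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class ClusterScanner {
    // Names that should exist but were not observed.
    static Set<String> missing(Set<String> desired, Set<String> observed) {
        Set<String> result = new TreeSet<>(desired);
        result.removeAll(observed);
        return result;
    }

    // Scans at a fixed delay for as long as the returned executor is alive.
    static ScheduledExecutorService start(Set<String> desired,
                                          Supplier<Set<String>> observe,
                                          long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(() -> {
            try {
                for (String name : missing(desired, observe.get())) {
                    System.out.println("Would recreate missing resource: " + name);
                }
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and continue.
                System.err.println("Scan failed: " + e);
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }
}
```

Shutting the executor down when the operator stops ties the scanner's lifetime to the operator's.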
Basic RBAC with Cassandra admin role
Create a helm package for the operator
Hi,
Following your instructions I get the following error when trying to install Cassandra (not the operator):
helm : Error: unable to recognize "": no matches for stable.instaclustr.com/, Kind=CassandraDataCenter
I just followed the README, is it working for you?
Thank you
When a scale command is triggered, got the warning as below:
16:14:15.727 [ControllerService] WARN c.i.c.operator.k8s.K8sResourceUtils - Failed to update resource. This will be a hard exception in the future.
io.kubernetes.client.ApiException: Unprocessable Entity
at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
at io.kubernetes.client.ApiClient.execute(ApiClient.java:781)
at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.lambda$createOrReplaceResource$1(K8sResourceUtils.java:51)
at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.createOrReplaceResource(K8sResourceUtils.java:41)
at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.createOrReplaceResource(K8sResourceUtils.java:49)
at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.createOrReplaceNamespaceService(K8sResourceUtils.java:60)
at com.instaclustr.cassandra.operator.service.ControllerService.createOrReplaceNodesService(ControllerService.java:279)
at com.instaclustr.cassandra.operator.service.ControllerService.createOrReplaceDataCenter(ControllerService.java:128)
at com.instaclustr.cassandra.operator.service.ControllerService.run(ControllerService.java:119)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
at java.lang.Thread.run(Thread.java:748)
Collaborate on combining the statefulsets from Pivotal + InstaClustr and create one as a building block for Operator
The behavior of Kubernetes scheduling pods across different failure domains is somewhat undefined and hard to control. See for details:
kubernetes/kubernetes#41598
kubernetes/kubernetes#44798
kubernetes/community#1857
Despite recent fixes to statefulsets, we still cannot inject failure domain information into the pods. So we still need to do a statefulset per failure domain.
This also gives us some flexibility in tolerating outages and nodes being down in a single SS.
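With one statefulset per failure domain, the operator has to decide how many replicas each statefulset gets. The sketch below assumes an even split in list order; `RackPlanner` and the zone names used with it are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RackPlanner {
    // Splits totalNodes as evenly as possible across racks, in list order;
    // the first (totalNodes % racks.size()) racks receive one extra node.
    static Map<String, Integer> replicasPerRack(List<String> racks, int totalNodes) {
        if (racks.isEmpty()) {
            throw new IllegalArgumentException("at least one rack is required");
        }
        if (totalNodes < 0) {
            throw new IllegalArgumentException("totalNodes must be >= 0");
        }
        Map<String, Integer> plan = new LinkedHashMap<>();
        int base = totalNodes / racks.size();
        int extra = totalNodes % racks.size();
        for (int i = 0; i < racks.size(); i++) {
            plan.put(racks.get(i), base + (i < extra ? 1 : 0));
        }
        return plan;
    }
}
```

Each map entry would correspond to one statefulset and its replica count.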
I installed via Helm, and saw that the operator was up and running:
k get pods
NAME READY STATUS RESTARTS AGE
cassandra-operator-549744c8fc-w6p47 1/1 Running 0 44m
But then when I create a cluster, also via Helm, the Cassandra pods are not created and I see this in the operator's logs:
java.lang.NullPointerException: null
at com.instaclustr.cassandra.operator.controller.DataCenterReconciliationController.createOrReplaceNodesService(DataCenterReconciliationController.java:417)
at com.instaclustr.cassandra.operator.controller.DataCenterReconciliationController.reconcileDataCenter(DataCenterReconciliationController.java:66)
at com.instaclustr.cassandra.operator.service.OperatorService.run(OperatorService.java:106)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
at java.lang.Thread.run(Thread.java:748)
Any ideas what I might have done wrong?
Should be periodic cron and options to upload to S3/Object store of choice
18:14:33.431 [CassandraHealthCheckService RUNNING] ERROR c.g.c.util.concurrent.ServiceManager - Service CassandraHealthCheckService [FAILED] has failed in the RUNNING state.
java.lang.NullPointerException: null
at com.google.common.net.InetAddresses.ipStringToBytes(InetAddresses.java:162)
at com.google.common.net.InetAddresses.forString(InetAddresses.java:136)
at com.instaclustr.cassandra.operator.service.CassandraHealthCheckService.runOneIteration(CassandraHealthCheckService.java:63)
at com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:193)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Hi,
I said in my previous issue I will wait for a more improved operator to be able to use Jaeger easily... So I deleted my Cassandra cluster instance but left the operator.
In the meantime one of my nodes has crashed so I looked for the cause of the problem. Looking at all my pods/statefulsets/deployments metrics I noticed that the Cassandra operator memory had increased over time:
And that's the only one doing that. I "suspect" it for crashing its host node. For sure I could set request and limit resources to kill the pod before it eats all the node memory but that doesn't explain the consumption graph over time.
Moreover, currently the operator pod memory is stable around 200MB...
I can't tell for sure there is a memory leak inside the operator, but I wanted to share with you this "experience".
I would have loved to also share logs of the operator but I don't have them (I'm really sorry).
Have a good day,
Leverage Instaclustr Prometheus exporter
A test to see if statefulset can handle this properly.
Watch and see how the coreos OLM project goes in terms of adoption. Support OLM management
https://github.com/operator-framework/operator-lifecycle-manager
I see you uploaded almost everything to gcr.io, but cassandra-sidecar is missing.
After installing the operator and starting the cluster I have this error:
Failed to pull image "gcr.io/cassandra-operator/cassandra-sidecar-dev": rpc error: code = Unknown desc = unauthorized: authentication required
PVs will store the entire SSTables on IaaS persistent disks
Need to work on recovery scenarios
Operator keeps throwing out warnings like below periodically:
[OkHttp ConnectionPool] WARN com.squareup.okhttp.OkHttpClient - A connection to https://192.168.99.101:8443/ was leaked. Did you forget to close a response body?
Deleting a DC causes warnings like the ones below:
17:25:58.041 [JMX client heartbeat 3] WARN c.i.c.o.j.CassandraConnectionFactory - JMX connection to /172.17.0.4 unexpectedly failed. javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin@341b15ad][type=jmx.remote.connection.failed][message=Failed to communicate with the server: java.net.NoRouteToHostException: No route to host (Host unreachable)]
17:26:01.170 [JMX client heartbeat 3] WARN javax.management.remote.misc - Failed to check connection: java.net.NoRouteToHostException: No route to host (Host unreachable)
17:26:01.171 [JMX client heartbeat 3] WARN javax.management.remote.misc - stopping
Currently, if the DC is scaled down by more than one node (e.g., 6 -> 3 nodes), then the controller will get "stuck" after it has scaled down the statefulset, as it currently isn't notified when the pod has finished being deleted. I'm hoping that a watch on the statefulset objects will be enough for the controller to observe that the pod has been deleted (statefulset.status.currentReplicas will change) and trigger a reconcile to kick off the next decommission.
For now, if you re-apply the DC definition after the node has decommissioned, keeping the same size but modifying a random-named field, it'll trigger a reconcile (e.g., create field spec.xyz with value 1, then change it to 2 and re-apply).
It also looks like deletion needs to be improved. Turns out the logic of scaling down a statefulset before deleting it is required (oops) -- I'll add it back in. But we also need to remove PVs and PV claims. The PVs also need to be removed on decommission. Otherwise, if a scale down then scale up is attempted, C* will fail to start as the pod gets a PV with a data directory of an already decommissioned node.
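The one-node-at-a-time scale-down described in this issue can be sketched as a pure decision function that a reconcile would call each time the statefulset changes. `ScaleDownPlanner` and its `Action` values are hypothetical names, and the comparison of spec replicas against status.currentReplicas is an assumption about how pod deletion would be observed.

```java
final class ScaleDownPlanner {
    enum Action { NONE, WAIT_FOR_POD_DELETION, DECOMMISSION_NEXT }

    // Decides the next step from the desired DC size, the StatefulSet's spec
    // replicas, and its observed status.currentReplicas.
    static Action next(int desiredNodes, int specReplicas, int currentReplicas) {
        if (currentReplicas > specReplicas) {
            // The previously removed pod has not finished terminating yet.
            return Action.WAIT_FOR_POD_DELETION;
        }
        if (specReplicas > desiredNodes) {
            // Decommission one more node, then lower spec replicas by one.
            return Action.DECOMMISSION_NEXT;
        }
        return Action.NONE;
    }
}
```

Because each watch event re-evaluates the function, a 6 -> 3 scale-down would proceed as three separate decommissions without manual re-applies.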
Currently, "build" & "build-all" tag images with the repo short hash & "latest". This is useful for development.
Ideally the Cassandra image should be tagged with the version of C* it represents, e.g. 3.2.11.
This makes it obvious what version is contained within, and will be necessary once we need to support multiple C* versions.
Should this be MVP story?
Improve the operator CLI to support enabling the K8s API client debug/trace logging. Currently this needs to be done via a custom logback.xml, which is painful when run on K8s/via Docker.
Also improve the current -v option -- the description doesn't match the functionality (currently always enables TRACE).
For C* cluster deployment, we should be able to monitor the progress of the cluster creation status/failures.