cassandra-operator's People

Contributors

alourie, ameyb, an-tex, andrew-waters, bbromhead, benbromhead, dependabot[bot], efim-a-efim, johananl, lyubentt, mavimo, mmonemali, rimolive, senax, slater-ben, smiklosovic, srteam2020, temujin9, xek, zegelin

cassandra-operator's Issues

Make K8s lookups less brittle

There are a number of places where we look up user-supplied values that might not exist and don't handle the missing case, which causes the operator to exit. Let's handle not finding k8s objects.
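
One possible shape for this, as a minimal sketch: wrap each lookup so that a 404 becomes an empty Optional while every other API failure still propagates. The nested `ApiException` here is a hypothetical stand-in for `io.kubernetes.client.ApiException` (which exposes the HTTP status via `getCode()`), and `SafeLookup`, `ApiCall` and `find` are illustrative names, not existing operator code.

```java
import java.util.Optional;

public class SafeLookup {
    /** Hypothetical stand-in for io.kubernetes.client.ApiException, so the sketch is self-contained. */
    public static class ApiException extends Exception {
        private final int code;

        public ApiException(int code, String message) {
            super(message);
            this.code = code;
        }

        public int getCode() {
            return code;
        }
    }

    /** A lookup that may fail with an ApiException. */
    @FunctionalInterface
    public interface ApiCall<T> {
        T call() throws ApiException;
    }

    /** Returns Optional.empty() when the API answers 404; rethrows any other failure. */
    public static <T> Optional<T> find(ApiCall<T> call) throws ApiException {
        try {
            return Optional.ofNullable(call.call());
        } catch (ApiException e) {
            if (e.getCode() == 404) {
                return Optional.empty();
            }
            throw e;
        }
    }

    public static void main(String[] args) throws ApiException {
        // Simulate a lookup for an object that does not exist.
        Optional<String> missing = find(() -> {
            throw new ApiException(404, "Not Found");
        });
        System.out.println("found? " + missing.isPresent());
    }
}
```

Callers can then decide per call site whether a missing object means "skip", "create it", or "report an error on the custom resource", instead of letting the service fail.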

Fork backup tool to separate repository

The backup tool that is part of the toolbelt of the operator is getting some attention outside of the kubernetes functionality being provided with the operator project. At this point the backup module presents enough functionality that it could be a stand-alone project.

The important step is to make sure that the cassandra-operator remains unaffected. A somewhat clean way of proceeding here would be to segregate the backup module and add it back as a dependency that can be used by cassandra-operator.

To keep the backup module stable an initial release would be required. Post this release further breaking changes could be introduced to the master branch whilst a 1.0.0 branch could be maintained for bug fixing. This approach will allow for the cassandra-operator to introduce breaking changes on backup's master branch if necessary.

Support namespaces other than `default`

Currently `default` is hard-coded into every API call that takes a namespace parameter.

There is a CLI option. It should be wired up via Guice and @Inject-ed into all the places that need a namespace.
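
A rough sketch of the idea, using plain constructor injection so it runs without Guice on the classpath. With Guice, the `PvcPaths` constructor would carry `@Inject` and the namespace value would be bound in a module. `OperatorNamespace` and `PvcPaths` are hypothetical names, and falling back to `default` when no option is given is an assumption.

```java
public class NamespaceInjection {
    /** Hypothetical holder for the namespace parsed from the CLI option. */
    public static final class OperatorNamespace {
        private final String value;

        public OperatorNamespace(String value) {
            // Assumption: fall back to "default" when the option is absent.
            this.value = (value == null || value.isEmpty()) ? "default" : value;
        }

        public String get() {
            return value;
        }
    }

    /** Hypothetical component; with Guice this constructor would be annotated @Inject. */
    public static final class PvcPaths {
        private final OperatorNamespace namespace;

        public PvcPaths(OperatorNamespace namespace) {
            this.namespace = namespace;
        }

        /** Builds the REST path for a claim from the injected namespace rather than a literal. */
        public String claimPath(String name) {
            return "/api/v1/namespaces/" + namespace.get() + "/persistentvolumeclaims/" + name;
        }
    }

    public static void main(String[] args) {
        OperatorNamespace ns = new OperatorNamespace(args.length > 0 ? args[0] : null);
        System.out.println(new PvcPaths(ns).claimPath("data-volume-0"));
    }
}
```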

Issue when creating backup snapshot

23:13:20.114 [BackupControllerService] ERROR o.g.j.m.i.WriterInterceptorExecutor - MessageBodyWriter not found for media type=application/json, type=class com.instaclustr.backup.BackupArguments, genericType=class com.instaclustr.backup.BackupArguments.

DC Deprovision Issue

Deployed the operator on PKS using the YAML here and got the exception below while deleting a test DataCenter cluster. It looks like a lack of permissions on the PV:

16:07:19.830 [ControllerService] ERROR c.g.c.util.concurrent.ServiceManager - Service ControllerService [FAILED] has failed in the RUNNING state.
io.kubernetes.client.ApiException: Forbidden
        at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
        at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
        at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaimWithHttpInfo(CoreV1Api.java:30090)
        at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaim(CoreV1Api.java:30072)
        at com.instaclustr.cassandra.operator.service.ControllerService.deletePersistentVolumeAndPersistentVolumeClaim(ControllerService.java:524)
        at com.instaclustr.cassandra.operator.service.ControllerService.deleteDataCenter(ControllerService.java:411)
        at com.instaclustr.cassandra.operator.service.ControllerService.run(ControllerService.java:130)
        at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
        at java.lang.Thread.run(Thread.java:748)
16:07:19.833 [ControllerService] ERROR com.instaclustr.guava.Application - Service ControllerService [FAILED] failed. Shutting down.

Heartbeat race condition

Looks like the heartbeat detector can start before a node is ready, and then gives up.

6:41:34.227 [JMX client heartbeat 2] WARN c.i.c.o.j.CassandraConnectionFactory - JMX connection to /10.8.1.8 unexpectedly failed. javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin@33ab4d49][type=jmx.remote.connection.failed][message=Failed to communicate with the server: java.net.ConnectException: Connection refused (Connection refused)]
16:41:34.229 [JMX client heartbeat 2] WARN javax.management.remote.misc - Failed to check connection: java.net.ConnectException: Connection refused (Connection refused)
16:41:34.229 [JMX client heartbeat 2] WARN javax.management.remote.misc - stopping
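
One way to avoid giving up on the first refused connection would be to retry with a growing delay. The sketch below is generic and does not touch JMX: `RetryConnect` and `withRetry` are hypothetical names, and the attempt count and delays are arbitrary values chosen for illustration.

```java
import java.util.concurrent.Callable;

public class RetryConnect {
    /**
     * Calls connect up to maxAttempts times, sleeping between failures and
     * doubling the delay each time. Rethrows the last failure if every attempt fails.
     */
    public static <T> T withRetry(Callable<T> connect, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulate a node whose port refuses the first two connection attempts.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.net.ConnectException("Connection refused");
            }
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```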

The operator is not working at all :(

Hi,

I'm just installing the operator and it's failing with these errors:

18:50:10.198 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Creating Custom Resource Definition cassandra-datacenters.stable.instaclustr.com
18:50:10.234 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Custom Resource Definition cassandra-datacenters.stable.instaclustr.com already exists.
18:50:10.237 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Creating Custom Resource Definition cassandra-clusters.stable.instaclustr.com
18:50:10.243 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Custom Resource Definition cassandra-clusters.stable.instaclustr.com already exists.
18:50:10.248 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Creating Custom Resource Definition cassandra-backups.stable.instaclustr.com
18:50:10.254 [main] INFO  c.i.c.o.p.o.CreateCustomResourceDefinitions - Custom Resource Definition cassandra-backups.stable.instaclustr.com already exists.
18:50:10.372 [main] INFO  com.instaclustr.guava.Application - Services to start: [ControllerService [NEW], GarbageCollectorService [NEW], CassandraHealthCheckService [NEW], BackupControllerService [NEW], WatchService(Cluster) [NEW], WatchService(DataCenter) [NEW], WatchService(V1beta2StatefulSet) [NEW], WatchService(V1ConfigMap) [NEW], WatchService(Backup) [NEW]]
18:50:10.374 [main] INFO  com.instaclustr.guava.Application - Starting services.
18:50:10.394 [WatchService(Cluster)] ERROR c.g.c.util.concurrent.ServiceManager - Service WatchService(Cluster) [FAILED] has failed in the RUNNING state.
io.kubernetes.client.ApiException: Not Found
	at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
	at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
	at com.instaclustr.k8s.watch.WatchService.syncResourceCache(WatchService.java:149)
	at com.instaclustr.k8s.watch.WatchService.run(WatchService.java:91)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
	at java.lang.Thread.run(Thread.java:748)
18:50:10.396 [WatchService(Cluster)] ERROR com.instaclustr.guava.Application - Service WatchService(Cluster) [FAILED] failed. Shutting down.
18:50:10.411 [ServiceManager Shutdown Hook] INFO  com.instaclustr.guava.Application - Shutting down [ControllerService [RUNNING], GarbageCollectorService [RUNNING], CassandraHealthCheckService [RUNNING], BackupControllerService [RUNNING], WatchService(DataCenter) [RUNNING], WatchService(V1beta2StatefulSet) [RUNNING], WatchService(V1ConfigMap) [RUNNING]].
18:50:10.416 [main] ERROR com.instaclustr.guava.Application - Services [WatchService(Cluster) [FAILED]] failed to start.
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while calling command (com.instaclustr.cassandra.operator.Operator@3b220bcb): java.lang.IllegalStateException: Expected to be healthy after starting. The following services are not running: {STARTING=[WatchService(Backup) [RUNNING]], FAILED=[WatchService(Cluster) [FAILED]]}
	at picocli.CommandLine.execute(CommandLine.java:880)
	at picocli.CommandLine.access$700(CommandLine.java:111)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:1037)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:1005)
	at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:913)
	at picocli.CommandLine.parseWithHandlers(CommandLine.java:1196)
	at picocli.CommandLine.call(CommandLine.java:1448)
	at picocli.CommandLine.call(CommandLine.java:1413)
	at com.instaclustr.cassandra.operator.Operator.main(Operator.java:82)
Caused by: java.lang.IllegalStateException: Expected to be healthy after starting. The following services are not running: {STARTING=[WatchService(Backup) [RUNNING]], FAILED=[WatchService(Cluster) [FAILED]]}
	at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.checkHealthy(ServiceManager.java:741)
	at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.awaitHealthy(ServiceManager.java:568)
	at com.google.common.util.concurrent.ServiceManager.awaitHealthy(ServiceManager.java:329)
	at com.instaclustr.guava.Application.call(Application.java:59)
	at com.instaclustr.cassandra.operator.Operator.call(Operator.java:129)
	at com.instaclustr.cassandra.operator.Operator.call(Operator.java:29)
	at picocli.CommandLine.execute(CommandLine.java:873)
	... 8 more
	Suppressed: com.google.common.util.concurrent.ServiceManager$FailedService: WatchService(Cluster) [FAILED]
	Caused by: io.kubernetes.client.ApiException: Not Found
		at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
		at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
		at com.instaclustr.k8s.watch.WatchService.syncResourceCache(WatchService.java:149)
		at com.instaclustr.k8s.watch.WatchService.run(WatchService.java:91)
		at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
		at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
		at java.lang.Thread.run(Thread.java:748)
18:51:10.431 [ServiceManager Shutdown Hook] WARN  com.instaclustr.guava.Application - Timeout waiting for [WatchService(DataCenter) [STOPPING], WatchService(V1beta2StatefulSet) [STOPPING], WatchService(V1ConfigMap) [STOPPING], WatchService(Backup) [STOPPING]] to stop. Retrying.
java.util.concurrent.TimeoutException: Timeout waiting for the services to stop. The following services have not stopped: {STOPPING=[WatchService(DataCenter) [STOPPING], WatchService(V1beta2StatefulSet) [STOPPING], WatchService(V1ConfigMap) [STOPPING], WatchService(Backup) [STOPPING]]}
	at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.awaitStopped(ServiceManager.java:586)
	at com.google.common.util.concurrent.ServiceManager.awaitStopped(ServiceManager.java:365)
	at com.instaclustr.guava.Application.lambda$call$0(Application.java:37)
	at java.lang.Thread.run(Thread.java:748)
18:51:10.525 [ServiceManager Shutdown Hook] INFO  com.instaclustr.guava.Application - Successfully shut down all services.

I took the default configuration of the operator Helm chart... Can someone help me please?

Thanks,

Unable to create "CassandraDataCenter" resource on PKS

Issue: Unable to create "CassandraDataCenter" resource

kubectl create -f test-dc.yaml gives the error below:

Error from server (NotFound): error when creating "test-dc.yaml": the server could not find the requested resource (post cassandradatacentres.stable.instaclustr.com)

Recreate Missing Primitive Resources with Watchers

Implement a thread that scans cluster status periodically. In addition to the provisioning work in Issue #6, this thread should stay alive for the operator's whole life cycle. It should also detect failed or missing resources and recreate them, holding the cluster in the desired state.
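
A minimal sketch of such a scanner, assuming the desired and observed resource names can each be produced as a set. `PeriodicReconciler`, `missing` and `start` are hypothetical names, and the resource names in `main` are made up for the demo.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class PeriodicReconciler {
    /** Names that are desired but not currently observed. */
    public static Set<String> missing(Set<String> desired, Set<String> observed) {
        Set<String> result = new LinkedHashSet<>(desired);
        result.removeAll(observed);
        return result;
    }

    /** Schedules a scan every periodSeconds; each missing name is handed to recreate. */
    public static ScheduledExecutorService start(Supplier<Set<String>> desired,
                                                 Supplier<Set<String>> observed,
                                                 Consumer<String> recreate,
                                                 long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                missing(desired.get(), observed.get()).forEach(recreate);
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so report it and carry on.
                System.err.println("scan failed: " + e);
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = start(
                () -> new HashSet<>(Arrays.asList("seed-service", "nodes-service")),
                () -> new HashSet<>(Arrays.asList("seed-service")),
                name -> System.out.println("recreating " + name),
                1);
        Thread.sleep(200);
        scheduler.shutdownNow();
    }
}
```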

Can't install the cassandra chart

Hi,

Following your instructions I get the following error when trying to install Cassandra (not the operator):
helm : Error: unable to recognize "": no matches for stable.instaclustr.com/, Kind=CassandraDataCenter

I just followed the README, is it working for you?

Thank you 😃

Warning "Unprocessable Entity"

When a scale command is triggered, the operator logs the warning below:

16:14:15.727 [ControllerService] WARN  c.i.c.operator.k8s.K8sResourceUtils - Failed to update resource. This will be a hard exception in the future.
io.kubernetes.client.ApiException: Unprocessable Entity
	at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
	at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
	at io.kubernetes.client.ApiClient.execute(ApiClient.java:781)
	at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.lambda$createOrReplaceResource$1(K8sResourceUtils.java:51)
	at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.createOrReplaceResource(K8sResourceUtils.java:41)
	at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.createOrReplaceResource(K8sResourceUtils.java:49)
	at com.instaclustr.cassandra.operator.k8s.K8sResourceUtils.createOrReplaceNamespaceService(K8sResourceUtils.java:60)
	at com.instaclustr.cassandra.operator.service.ControllerService.createOrReplaceNodesService(ControllerService.java:279)
	at com.instaclustr.cassandra.operator.service.ControllerService.createOrReplaceDataCenter(ControllerService.java:128)
	at com.instaclustr.cassandra.operator.service.ControllerService.run(ControllerService.java:119)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
	at java.lang.Thread.run(Thread.java:748)

Multi-AZ topologies

The behavior of Kubernetes when scheduling pods across different failure domains is somewhat undefined and hard to control. For details, see:

kubernetes/kubernetes#41598
kubernetes/kubernetes#44798
kubernetes/community#1857

Despite recent fixes to statefulsets, we still cannot inject failure domain information into the pods. So we still need to do a statefulset per failure domain.

This also gives us some flexibility in tolerating outages and nodes being down in a single StatefulSet.
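
With one statefulset per failure domain, the operator has to decide how many replicas each one gets. A minimal sketch of an even split follows; the split policy itself is an assumption (nothing above prescribes it), and `RackSplit`, `replicasPerZone` and the zone names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RackSplit {
    /** Spreads totalReplicas across zones as evenly as possible; earlier zones take the remainder. */
    public static Map<String, Integer> replicasPerZone(int totalReplicas, List<String> zones) {
        if (zones.isEmpty()) {
            throw new IllegalArgumentException("at least one zone is required");
        }
        if (totalReplicas < 0) {
            throw new IllegalArgumentException("totalReplicas must be >= 0");
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        int base = totalReplicas / zones.size();
        int extra = totalReplicas % zones.size();
        for (int i = 0; i < zones.size(); i++) {
            result.put(zones.get(i), base + (i < extra ? 1 : 0));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each map entry would correspond to one statefulset pinned to that zone.
        System.out.println(replicasPerZone(7, Arrays.asList("zone-a", "zone-b", "zone-c")));
    }
}
```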

Null Pointer Exception prevents cluster creation

I installed via Helm, and saw that the operator was up and running:

k get pods
NAME                                  READY     STATUS    RESTARTS   AGE
cassandra-operator-549744c8fc-w6p47   1/1       Running   0          44m

But then when I create a cluster, also via Helm, the Cassandra pods are not created and I see this in the operator's logs:

java.lang.NullPointerException: null
	at com.instaclustr.cassandra.operator.controller.DataCenterReconciliationController.createOrReplaceNodesService(DataCenterReconciliationController.java:417)
	at com.instaclustr.cassandra.operator.controller.DataCenterReconciliationController.reconcileDataCenter(DataCenterReconciliationController.java:66)
	at com.instaclustr.cassandra.operator.service.OperatorService.run(OperatorService.java:106)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
	at java.lang.Thread.run(Thread.java:748)

Any ideas what I might have done wrong?

Operator Restarted Once when Creating DC

18:14:33.431 [CassandraHealthCheckService RUNNING] ERROR c.g.c.util.concurrent.ServiceManager - Service CassandraHealthCheckService [FAILED] has failed in the RUNNING state.
java.lang.NullPointerException: null
	at com.google.common.net.InetAddresses.ipStringToBytes(InetAddresses.java:162)
	at com.google.common.net.InetAddresses.forString(InetAddresses.java:136)
	at com.instaclustr.cassandra.operator.service.CassandraHealthCheckService.runOneIteration(CassandraHealthCheckService.java:63)
	at com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:193)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

(?) Memory continuously increasing

Hi,

I said in my previous issue I will wait for a more improved operator to be able to use Jaeger easily... So I deleted my Cassandra cluster instance but left the operator.

In the meantime one of my nodes has crashed so I looked for the cause of the problem. Looking at all my pods/statefulsets/deployments metrics I noticed that the Cassandra operator memory had increased over time:
[image: graph of the operator's memory usage over time]

And that's the only one doing that. I "suspect" it for crashing its host node. For sure I could set request and limit resources to kill the pod before it eats all the node memory but that doesn't explain the consumption graph over time.

Moreover, currently the operator pod memory is stable around 200MB... 😄

I can't tell for sure there is a memory leak inside the operator, but I wanted to share with you this "experience".

I would have loved to also share logs of the operator but I don't have them (I'm really sorry).

Have a good day,

Failed to pull image "gcr.io/cassandra-operator/cassandra-sidecar-dev"

I see you uploaded almost everything to gcr.io, but cassandra-sidecar is missing.
After installing the operator and starting the cluster I have this error:

Failed to pull image "gcr.io/cassandra-operator/cassandra-sidecar-dev": rpc error: code = Unknown desc = unauthorized: authentication required

Warnings on com.squareup.okhttp.OkHttpClient

The operator periodically logs warnings like the one below:

[OkHttp ConnectionPool] WARN  com.squareup.okhttp.OkHttpClient - A connection to https://192.168.99.101:8443/ was leaked. Did you forget to close a response body?

Lack of Permission while deleting DC cluster on PKS

16:07:19.830 [ControllerService] ERROR c.g.c.util.concurrent.ServiceManager - Service ControllerService [FAILED] has failed in the RUNNING state.
io.kubernetes.client.ApiException: Forbidden
        at io.kubernetes.client.ApiClient.handleResponse(ApiClient.java:882)
        at io.kubernetes.client.ApiClient.execute(ApiClient.java:798)
        at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaimWithHttpInfo(CoreV1Api.java:30090)
        at io.kubernetes.client.apis.CoreV1Api.readNamespacedPersistentVolumeClaim(CoreV1Api.java:30072)
        at com.instaclustr.cassandra.operator.service.ControllerService.deletePersistentVolumeAndPersistentVolumeClaim(ControllerService.java:524)
        at com.instaclustr.cassandra.operator.service.ControllerService.deleteDataCenter(ControllerService.java:411)
        at com.instaclustr.cassandra.operator.service.ControllerService.run(ControllerService.java:130)
        at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
        at java.lang.Thread.run(Thread.java:748)
16:07:19.833 [ControllerService] ERROR com.instaclustr.guava.Application - Service ControllerService [FAILED] failed. Shutting down.

WARNING on JMX Client Heartbeat

Deleting a DC causes the warnings below:

17:25:58.041 [JMX client heartbeat 3] WARN  c.i.c.o.j.CassandraConnectionFactory - JMX connection to /172.17.0.4 unexpectedly failed. javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin@341b15ad][type=jmx.remote.connection.failed][message=Failed to communicate with the server: java.net.NoRouteToHostException: No route to host (Host unreachable)]
17:26:01.170 [JMX client heartbeat 3] WARN  javax.management.remote.misc - Failed to check connection: java.net.NoRouteToHostException: No route to host (Host unreachable)
17:26:01.171 [JMX client heartbeat 3] WARN  javax.management.remote.misc - stopping

Watch the StatefulSet - Clean up decom'd PVs

Currently, if the DC is scaled down by more than one node (e.g., 6 -> 3 nodes), then the controller will get "stuck" after it has scaled down the statefulset, as it currently isn't notified when the pod has finished being deleted. I'm hoping that a watch on the statefulset objects will be enough for the controller to observe that the pod has been deleted (statefulset.status.currentReplicas will change) and trigger a reconcile to kick off the next decommission.

For now, if you re-apply the DC definition after the node has decommissioned, keeping the same size but modifying a random-named field, it'll trigger a reconcile (e.g., create field spec.xyz with value 1, then change it to 2 and re-apply).

It also looks like deletion needs to be improved. It turns out the logic of scaling down a statefulset before deleting it is required (oops), so I'll add it back in. But we also need to remove PVs and PV claims. The PVs also need to be removed on decommission. Otherwise, if a scale down followed by a scale up is attempted, C* will fail to start because the pod gets a PV with the data directory of an already decommissioned node.
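
To make the intended one-node-at-a-time flow concrete, here is a rough sketch of the decision a reconcile could take each time a statefulset watch event arrives. It assumes the reconcile sees the desired node count from the DC spec, the statefulset's spec replicas, and its status.currentReplicas. `ScaleDownStep`, `Action` and `next` are hypothetical names, and the actual controller logic may differ.

```java
public class ScaleDownStep {
    public enum Action { WAIT, DECOMMISSION_ONE, SCALE_UP, DONE }

    /**
     * Picks the next step for a data center.
     *
     * @param desiredNodes    node count requested in the DC definition
     * @param specReplicas    replicas currently requested on the statefulset
     * @param currentReplicas replicas reported in the statefulset status
     */
    public static Action next(int desiredNodes, int specReplicas, int currentReplicas) {
        if (currentReplicas != specReplicas) {
            // A pod is still being created or deleted; wait for the next watch event.
            return Action.WAIT;
        }
        if (specReplicas > desiredNodes) {
            // Decommission a single node, then shrink the statefulset by one.
            return Action.DECOMMISSION_ONE;
        }
        if (specReplicas < desiredNodes) {
            return Action.SCALE_UP;
        }
        return Action.DONE;
    }

    public static void main(String[] args) {
        System.out.println(next(3, 6, 6)); // settled at 6, target is 3
        System.out.println(next(3, 5, 6)); // pod deletion still in progress
        System.out.println(next(3, 3, 3)); // target reached
    }
}
```

Removing the PV and PV claim of the decommissioned node would belong to the same step that shrinks the statefulset, so that a later scale up starts from an empty data directory.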

Tag `cassandra` image with Cassandra version

Currently, build & build-all tag images with the repo short hash & latest. This is useful for development.

Ideally the Cassandra image should be tagged with the version of C* it represents, e.g. 3.2.11.
This makes it obvious what version is contained within, and will be necessary once we need to support multiple C* versions.

Add better logging configurability

Improve the operator CLI interface to support enabling the K8s API client debug/trace logging. Currently this needs to be done via a custom logback.xml, which is painful when run on K8s/via Docker.

Also improve the current -v option -- the description doesn't match the functionality (it currently always enables TRACE).
