
Kubernetes Autoscaling for Apache Pulsar (KAAP)

Kubernetes Autoscaling for Apache Pulsar (KAAP) simplifies running Apache Pulsar on Kubernetes by applying the familiar Operator pattern to Pulsar's components and horizontally scaling resources up or down based on CPU and memory workloads.

The KAAP operator's broker autoscaling integrates with the Pulsar broker's load balancer, which has insight into all other brokers' workloads. With this information, KAAP can make smarter resource management decisions than the Kubernetes HorizontalPodAutoscaler.

KAAP's BookKeeper autoscaling solution is similarly Pulsar-native. BookKeeper nodes are scaled up in response to running low on storage, and because of BookKeeper's segment-based design, the new storage is available immediately for use by the cluster, with no log stream rebalancing required.

When KAAP sees low storage usage on a BookKeeper node, the node is automatically scaled down (decommissioned) to free up volume usage and reduce storage costs. This scale-down is done in a safe, controlled manner which ensures no data loss and guarantees the configured replication factor for all messages. For example, if your replication factor is 3 (write and ack quorum of 3), 3 replicas are maintained at all times during the scale-down to ensure data can be recovered, even if there is a failure during the scale-down phase. Scaling down bookies has been a consistent pain point in Pulsar, and KAAP automates this without sacrificing Pulsar's data guarantees.

Operating and maintaining Apache Pulsar clusters traditionally involves complex manual configurations, making it challenging for developers and operators to effectively manage the system's lifecycle. However, with the KAAP operator, these complexities are abstracted away, enabling developers to focus on their applications rather than the underlying infrastructure.

Some of the key features and benefits of the KAAP operator include:

  • Easy Deployment: Deploying an Apache Pulsar cluster on Kubernetes is simplified through declarative configurations and automation provided by the operator.

  • Scalability: The KAAP operator enables effortless scaling of Pulsar clusters by automatically handling the creation and configuration of new Pulsar brokers and bookies as per defined rules. The broker autoscaling is integrated with the Pulsar broker load balancer to make smart resource management decisions, and bookkeepers are scaled up and down based on storage usage in a safe, controlled manner.

  • High Availability: The operator implements best practices for high availability, ensuring that Pulsar clusters are fault-tolerant and can sustain failures without service disruptions.

  • Lifecycle Management: The operator takes care of common Pulsar cluster lifecycle tasks, such as cluster creation, upgrade, configuration updates, and graceful shutdowns.

We also offer the KAAP Stack if you're looking for more Kubernetes-native tooling deployed with your Pulsar cluster. Along with the PulsarCluster CRDs, the KAAP Stack also includes:

  • Pulsar Operator
  • Prometheus Stack (Grafana)
  • Pulsar Grafana dashboards
  • Cert Manager
  • Keycloak

The KAAP Stack is also included in this repository.

Whether you are a developer looking to leverage the power of Apache Pulsar in your Kubernetes environment, or an operator seeking to streamline the management of Pulsar clusters, the KAAP Operator provides a robust and user-friendly solution.

If you're running Luna Streaming, the DataStax distribution of Apache Pulsar, KAAP is 100% compatible with your existing Pulsar cluster.

Documentation

Full documentation is available in the DataStax Streaming Documentation or at this repo's GitHub Pages site.

Install KAAP Operator

This example installs only the KAAP operator. See the KAAP Stack below to install KAAP with an operator, a Pulsar cluster, and the Prometheus monitoring stack.

  1. Install the DataStax KAAP Helm repository:
helm repo add kaap https://datastax.github.io/kaap
helm repo update
  2. Install the KAAP operator Helm chart:
helm install kaap kaap/kaap

Result:

NAME: kaap
LAST DEPLOYED: Wed Jun 28 11:37:45 2023
NAMESPACE: pulsar-cluster
STATUS: deployed
REVISION: 1
TEST SUITE: None
  3. Ensure the operator is up and running:
kubectl get deployment

Result:

NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
kaap                  1/1     1            1           52m
  4. You've now installed the KAAP operator. By default, when KAAP is installed, the PulsarCluster CRDs are also created. This setting is defined in the KAAP values.yaml file as crd: {create: true}.

  5. To see available CRDs:

kubectl get crds | grep kaap

Result:

autorecoveries.kaap.oss.datastax.com             2023-06-28T15:37:39Z
bastions.kaap.oss.datastax.com                   2023-06-28T15:37:39Z
bookkeepers.kaap.oss.datastax.com                2023-06-28T15:37:39Z
brokers.kaap.oss.datastax.com                    2023-06-28T15:37:40Z
functionsworkers.kaap.oss.datastax.com           2023-06-28T15:37:40Z
proxies.kaap.oss.datastax.com                    2023-06-28T15:37:40Z
pulsarclusters.kaap.oss.datastax.com             2023-06-28T15:37:41Z
zookeepers.kaap.oss.datastax.com                 2023-06-28T15:37:41Z
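
With the operator running and the CRDs registered, a PulsarCluster resource can be applied to create a cluster. The sketch below is illustrative only: the apiVersion, field names, and sizing are assumptions, so consult the CRD reference (for example with kubectl explain pulsarclusters) before using it.

# Hypothetical minimal PulsarCluster; apiVersion and field names are assumptions.
apiVersion: kaap.oss.datastax.com/v1alpha1   # verify the served version with: kubectl get crd pulsarclusters.kaap.oss.datastax.com
kind: PulsarCluster
metadata:
  name: pulsar
spec:
  global:
    name: pulsar
    image: datastax/lunastreaming-all:2.10_4.0   # image taken from examples elsewhere in this document
  zookeeper:
    replicas: 3
  bookkeeper:
    replicas: 3
  broker:
    replicas: 2
  proxy:
    replicas: 1

Apply it with kubectl apply -f pulsar-cluster.yaml and the operator reconciles the components.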

For more, see the DataStax Streaming Documentation or this repo's GitHub Pages site.

Install KAAP Stack Operator with a Pulsar cluster

This example deploys the KAAP Stack operator along with a minimally sized Pulsar cluster for development testing.

  1. Install the DataStax KAAP Helm repository:
helm repo add kaap https://datastax.github.io/kaap
helm repo update
  2. Install the KAAP Stack operator Helm chart with the custom dev-cluster values file:
helm install pulsar kaap/kaap-stack --values helm/examples/dev-cluster/values.yaml

Result:

NAME: pulsar
LAST DEPLOYED: Thu Jun 29 14:30:20 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
  3. Ensure the operator is up and running:
kubectl get deployment

Result:

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
kaap                                  1/1     1            1           5m19s
pulsar-autorecovery                   1/1     1            1           3m19s
pulsar-bastion                        1/1     1            1           3m19s
pulsar-grafana                        1/1     1            1           5m19s
pulsar-kube-prometheus-sta-operator   1/1     1            1           5m19s
pulsar-kube-state-metrics             1/1     1            1           5m19s
pulsar-proxy                          1/1     1            1           3m19s
  4. You've now installed the KAAP Stack operator with a Pulsar cluster. By default, when KAAP is installed, the PulsarCluster CRDs are also created. This setting is defined in the KAAP values.yaml file as crd: {create: true}.

  5. To see available CRDs:

kubectl get crds | grep kaap

Result:

autorecoveries.kaap.oss.datastax.com             2023-06-28T15:37:39Z
bastions.kaap.oss.datastax.com                   2023-06-28T15:37:39Z
bookkeepers.kaap.oss.datastax.com                2023-06-28T15:37:39Z
brokers.kaap.oss.datastax.com                    2023-06-28T15:37:40Z
functionsworkers.kaap.oss.datastax.com           2023-06-28T15:37:40Z
proxies.kaap.oss.datastax.com                    2023-06-28T15:37:40Z
pulsarclusters.kaap.oss.datastax.com             2023-06-28T15:37:41Z
zookeepers.kaap.oss.datastax.com                 2023-06-28T15:37:41Z

For more, see the DataStax Streaming Documentation or this repo's GitHub Pages site.

Uninstall KAAP

To uninstall KAAP:

helm delete kaap

Result:

release "kaap" uninstalled

For more, see the DataStax Streaming Documentation or this repo's GitHub Pages site.

Resources

For more, see the DataStax Streaming Documentation or this repo's GitHub Pages site.

kaap's People

Contributors

dan1el-k, dependabot[bot], dlg99, eolivelli, ericsyh, mendonk, michaeljmarshall, mmirelli, nicoloboschi, pgier

kaap's Issues

Incorrect default values for prometheus scrape annotations

It seems that all the components (bookkeeper, zookeeper, etc.) use the same default Prometheus pod annotations:

annotations:
  prometheus.io/port: "8080"
  prometheus.io/scrape: "true"

I think this is coming from here:

Map<String, String> annotations = new HashMap<>();
annotations.put("prometheus.io/scrape", "true");
annotations.put("prometheus.io/port", "8080");
if (customPodAnnotations != null) {

The problem is that these default ports don't work for all the pods. The default ports should be as follows:

autorecovery - 8000
bastion - none
bookkeeper - 8000
broker - 8080
function - 6750
proxy - 8080
zookeeper - 8000
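
Until the defaults are corrected, one possible workaround is to override the annotations per component in the PulsarCluster spec. This is only a sketch; the podAnnotations field name is an assumption based on the customPodAnnotations handling quoted above.

# Sketch only; the exact field name for per-component pod annotations is an assumption.
spec:
  bookkeeper:
    podAnnotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8000"
  zookeeper:
    podAnnotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8000"
  autorecovery:
    podAnnotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8000"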

Deleting a PulsarCluster resource immediately starts deleting all the resources

I guess this is probably intentional, but immediately after deleting a PulsarCluster object, the operator starts deleting all the components. I wonder if there should be some kind of setting in the operator to determine the behaviour when a PulsarCluster object is deleted. It might be better to leave the resources orphaned and give the user more options for deleting them. Maybe each cluster should have a flag that determines what happens to the orphaned resources if the cluster is deleted.

Sidecars and Env empty values prevent pods from running

When <component>.sidecars or <component>.env are set to empty ([]), the operator continuously re-applies values to the pods of <component> at runtime while waiting for the component to become ready. This is an example log fetched from the operator pod when zookeeper had sidecars and env set to empty:

11:51:03 INFO  [com.dat.oss.pul.crd.SpecDiffer] (ReconcilerExecutor-pulsar-zk-controller-70) 'zookeeper.env' value differs:
    was: []
    now: [{"name":"ZOOKEEPER_SERVERS","value":"pulsar-ls-us-central1-cluster-d-zookeeper-0,pulsar-ls-us-central1-cluster-d-zookeeper-1,pulsar-ls-us-central1-cluster-d-zookeeper-2,pulsar-ls-us-central1-cluster-d-zookeeper-3,pulsar-ls-us-central1-cluster-d-zookeeper-4"}]

11:51:03 INFO  [com.dat.oss.pul.crd.SpecDiffer] (ReconcilerExecutor-pulsar-zk-controller-70) 'zookeeper.sidecars' value differs:
    was: []
    now: [{"args":["bin\/apply-config-from-env.py conf\/zookeeper.conf && bin\/generate-zookeeper-config.sh conf\/zookeeper.conf && OPTS=\"${OPTS} -Dlog4j2.formatMsgNoLookups=true\" exec bin\/pulsar zookeeper"],"image":"datastax\/lunastreaming-all:2.10_4.0","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":3,"periodSeconds":30,"timeoutSeconds":30,"successThreshold":1,"initialDelaySeconds":20,"exec":{"command":["timeout","30","bin\/pulsar-zookeeper-ruok.sh"]}},"name":"pulsar-ls-us-central1-cluster-d-zookeeper","readinessProbe":{"failureThreshold":3,"periodSeconds":30,"timeoutSeconds":30,"successThreshold":1,"initialDelaySeconds":20,"exec":{"command":["timeout","30","bin\/pulsar-zookeeper-ruok.sh"]}},"resources":{"requests":{"memory":"1200Mi","cpu":"500m"}},"env":[{"name":"ZOOKEEPER_SERVERS","value":"pulsar-ls-us-central1-cluster-d-zookeeper-0,pulsar-ls-us-central1-cluster-d-zookeeper-1,pulsar-ls-us-central1-cluster-d-zookeeper-2,pulsar-ls-us-central1-cluster-d-zookeeper-3,pulsar-ls-us-central1-cluster-d-zookeeper-4"}],"ports":[{"name":"client","containerPort":2181},{"name":"server","containerPort":2888},{"name":"leader-election","containerPort":3888}],"envFrom":[{"configMapRef":{"name":"pulsar-ls-us-central1-cluster-d-zookeeper"}}],"command":["sh","-c"],"volumeMounts":[{"mountPath":"\/pulsar\/data","name":"pulsar-ls-us-central1-cluster-d-zookeeper-data"}]}]

11:51:03 INFO  [com.dat.oss.pul.con.PulsarClusterController] (ReconcilerExecutor-pulsar-cluster-app-71) waiting for zookeeper to become ready

The components affected by this are zookeeper, broker, bastion, autorecovery, and proxy, but only zookeeper seems to have problems with env set to empty.

Thoughts and feedback

Config as JSON

I know most components have quite a list of possible configurations. I see the "config" option of each component in the API spec is formatted as JSON. So, my assumption is that all defaults for the given component follow the Pulsar documentation - this operator is not changing anything. And to override a value, one would follow Pulsar's spec.

If I wanted to turn off non-persistent topic creation:

PulsarCluster:
  spec:
    broker:
      config: "{\"enableNonPersistentTopics\":false}"

Or I could change the broker service port (which would break things):

PulsarCluster:
  spec:
    broker:
      config: "{\"enableNonPersistentTopics\":false,\"brokerServicePort\":3333}"

It seems that there are quite a few configuration values that you would want the operator to explicitly control and not give someone the ability to change (like the brokerServicePort). Would it be better to declare all possible config values individually? Then the operator can be more obvious about what it supports and possibly override Pulsar's default values in a documentable way. Also, because there are so many config values, declaring them as one JSON string is error-prone and very hard to read.

There are many cases in Pulsar components where K8s has external provisions (like Service ports and Ingress). Leaving config as a JSON string puts the operator in a place where it either A) is forced to silently overwrite values without telling anyone, or B) has to hope that the person deploying doesn't provide a bad value for certain things.
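
As a sketch of what individually declared configuration could look like, compared with the single JSON string above (hypothetical structure, not the current API):

# Hypothetical structured config instead of one JSON string.
PulsarCluster:
  spec:
    broker:
      config:
        enableNonPersistentTopics: false
        # brokerServicePort deliberately not exposed; the operator owns it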

Logging

How/where would I set log4j values? Say I wanted to turn up logging on my bookies, where would I provide log4j.appender.CONSOLE.Threshold=DEBUG?

Values decision tree

I am assuming that the operator assigns values in the following order:

  1. individual component param
  2. global param
  3. pulsar default

Cert issuer

I'd like to offer a different way of thinking about a certificate issuer. Currently, there are provisions for self-signed using cert-manager or passwords to "Override the default secret name from where to load the certificates". But what if I wanted to use a different issuer (enterprise, Let's Encrypt, or cloud-based)? Instead of being so explicit, why not accept any Issuer or ClusterIssuer object in the operator? Then the stack chart can have provisions for setting up a self-signed issuer and provide the object to the operator. That way the limiting factor is the use of cert-manager, but the Issuer and its configuration is left to the person deploying this.
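
For example, a standard cert-manager ClusterIssuer for Let's Encrypt that a user might want to hand to the operator looks like the following (this is plain cert-manager configuration, not KAAP-specific; the email and ingress class are placeholders):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                 # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - http01:
        ingress:
          class: nginx                     # placeholder ingress class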

Consolidate TLS values

PulsarCluster.spec.
PulsarCluster.spec.global.tls.
PulsarCluster.spec.global.tls.certProvisioner.selfSigned.

Could there be some consolidation of values? Maybe each component has PulsarCluster.spec.<component>.tls and then there is a global PulsarCluster.spec.global.tls.

Helm charts

Having a Helm chart for both the operator and the stack seems redundant. How about removing the operator Helm chart? Then your options are to create a cluster using the operator resources directly (kubectl apply -f PulsarCluster.yaml) or to create a cluster (with optional add-ons) with the stack chart (helm install my-cluster datastax/pulsar-stack -f values.yaml).

Documentation

Descriptions in the CRD are essential. There are so many different tools one can use to manage K8s objects that the CRD is the only common thread to guarantee a good experience. From there one would look to the operator repo, which should have the API reference and some simple getting-started commands/YAML. OperatorHub and ArtifactHub can point back to the repo. Then in the Luna Streaming docs, we can provide deep knowledge about the operator, how it works, its benefits, and many examples/guides.

Proxy configmap is rendered with wrong broker URLs when cluster.spec.global.components.brokerBaseName is configured

By configuring:

cluster:
  spec:
    global:
      components: # legacy names to be compatible with existing clusters
        brokerBaseName: neuron-pulsar-broker

The service is created correctly:

kind: Service
apiVersion: v1
metadata:
  name: dev01-neuron-pulsar-broker
  namespace: dev01-neuron-pulsar
  labels:
    app: pulsar
    cluster: dev01
    component: neuron-pulsar-broker
    resource-set: broker
spec:
  clusterIP: None
  ipFamilies:
    - IPv4
  ports:
    - name: http
      protocol: TCP
  ...
  ...
  selector:
    app: pulsar
    cluster: dev01
    component: neuron-pulsar-broker

but both dev01-neuron-pulsar-proxy and dev01-neuron-pulsar-proxy-ws config maps contain the wrong broker URLs because broker- is prefixed but should not be: the wrong URL http://dev01-broker-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:8080/ is used instead of the correct URL http://dev01-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:8080/.

Proxy config map with wrong values for PULSAR_PREFIX_brokerServiceURL, PULSAR_PREFIX_brokerServiceURLTLS, PULSAR_PREFIX_brokerWebServiceURL, and PULSAR_PREFIX_brokerWebServiceURLTLS:

kind: ConfigMap
apiVersion: v1
metadata:
  name: dev01-neuron-pulsar-proxy
  namespace: dev01-neuron-pulsar
  labels:
    app: pulsar
    cluster: dev01
    component: neuron-pulsar-proxy
    resource-set: proxy
data:
  PULSAR_PREFIX_numHttpServerThreads: '10'
  PULSAR_PREFIX_tlsHostnameVerificationEnabled: 'false'
  PULSAR_LOG_ROOT_LEVEL: info
  PULSAR_LOG_LEVEL: info
  PULSAR_PREFIX_brokerServiceURL: pulsar://dev01-broker-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:6650/
  PULSAR_PREFIX_brokerServiceURLTLS: pulsar+ssl://dev01-broker-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:6651/
  PULSAR_PREFIX_brokerWebServiceURL: http://dev01-broker-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:8080/
  PULSAR_PREFIX_brokerWebServiceURLTLS: https://dev01-broker-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:8443/
  PULSAR_EXTRA_OPTS: '-Dpulsar.log.root.level=info'
  PULSAR_PREFIX_authenticateMetricsEndpoint: 'false'
  PULSAR_PREFIX_tlsEnabledWithKeyStore: 'false'
  PULSAR_PREFIX_clusterName: dev01
  PULSAR_PREFIX_configurationStoreServers: 'dev01-neuron-pulsar-zookeeper-ca.dev01-neuron-pulsar.svc.cluster.local:2181'
  PULSAR_PREFIX_zookeeperServers: 'dev01-neuron-pulsar-zookeeper-ca.dev01-neuron-pulsar.svc.cluster.local:2181'
  PULSAR_PREFIX_forwardAuthorizationCredentials: 'true'

Proxy WS config map with wrong values for PULSAR_PREFIX_serviceUrl, PULSAR_PREFIX_serviceUrlTls, PULSAR_PREFIX_brokerServiceUrl, and PULSAR_PREFIX_brokerServiceUrlTls:

kind: ConfigMap
apiVersion: v1
metadata:
  name: dev01-neuron-pulsar-proxy-ws
  namespace: dev01-neuron-pulsar
  labels:
    app: pulsar
    cluster: dev01
    component: neuron-pulsar-proxy
    resource-set: proxy
data:
  PULSAR_PREFIX_numHttpServerThreads: '10'
  PULSAR_PREFIX_tlsHostnameVerificationEnabled: 'false'
  PULSAR_LOG_ROOT_LEVEL: info
  PULSAR_PREFIX_serviceUrl: http://dev01-broker-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:8080/
  PULSAR_PREFIX_serviceUrlTls: https://dev01-broker-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:8443/
  PULSAR_PREFIX_brokerServiceUrl: pulsar://dev01-broker-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:6650/
  PULSAR_PREFIX_brokerServiceUrlTls: pulsar+ssl://dev01-broker-neuron-pulsar-broker.dev01-neuron-pulsar.svc.cluster.local:6651/
  PULSAR_LOG_LEVEL: info
  PULSAR_PREFIX_webServicePort: '8000'
  PULSAR_PREFIX_zookeeperServers: 'dev01-neuron-pulsar-zookeeper-ca.dev01-neuron-pulsar.svc.cluster.local:2181'
  PULSAR_EXTRA_OPTS: '-Dpulsar.log.root.level=info'
  PULSAR_PREFIX_tlsEnabledWithKeyStore: 'false'
  PULSAR_PREFIX_clusterName: dev01
  PULSAR_PREFIX_configurationStoreServers: 'dev01-neuron-pulsar-zookeeper-ca.dev01-neuron-pulsar.svc.cluster.local:2181'

Don't use KubernetesRuntimeFactory because it doesn't work

Currently the operator sets this in functions_worker.conf by default. This should not be set because it doesn't work.

functionRuntimeFactoryClassName: org.apache.pulsar.functions.runtime.kubernetes.KubernetesRuntimeFactory

The problem is that when this is enabled, it causes an incorrect Pulsar admin URL to be used in the generated connector pod.

Related: apache/pulsar#9213

[bug] Migration-tool automatically generates bookie metaformat init-container

When creating the CRDs for v0.1.0 via the migration tool triggered on a cluster using an old version of the operator, the tool generates the bookie metaformat init-container. This prevents the operator from deploying BookKeeper, as the metaformat init-container specification is auto-generated at deploy time.

Here is an example of what the migration tool generates in the bookie's CRDs section:

        initContainers:
        - args:
          - bin/apply-config-from-env.py conf/bookkeeper.conf && bin/bookkeeper shell
            metaformat --nonInteractive || true;
          command:
          - sh
          - -c
          envFrom:
          - configMapRef:
              name: pulsar-bookkeeper
          image: datastax/lunastreaming-all:2.10_4.5
          imagePullPolicy: IfNotPresent
          name: pulsar-bookkeeper-metadata-format
          resources: {
            }
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        config: []

A simple workaround is removing the metaformat init-container:

 initContainers: []

additionalVolumes setting should allow empty values

If I add additionalVolumes: to a resource, but don't add any volumes, I get this error:

PulsarCluster.kaap.oss.datastax.com "pulsar" is invalid: spec.proxy.additionalVolumes: Invalid value: "null": spec.proxy.additionalVolumes in body must be of type object: "null"

This field should allow empty/null values, and similarly the sub-fields volumes and volumeMounts should allow empty config.
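
Until empty values are accepted, a workaround is to either omit additionalVolumes entirely or populate the sub-fields. A sketch of a populated configuration (volume name and mount path are placeholders, and the sub-field names follow the wording of this issue rather than a verified schema):

spec:
  proxy:
    additionalVolumes:
      volumes:
      - name: extra-config          # placeholder volume
        emptyDir: {}
      volumeMounts:
      - name: extra-config
        mountPath: /pulsar/extra    # placeholder mount path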

Deploy WebSocket Proxy in its own Pod

In order for the service account token volume projection to work for authentication, we need to provide a way to break the WebSocket Proxy into its own container. It's up for debate whether we should make this toggleable or always run it as its own container.

Failed to downgrade from version 3.0.0 to 2.10.x

I tried to downgrade a cluster from 3.0.0 to 2.10.5 and the process became stuck due to bookkeeper crashing and never reaching a ready state.

Errors in bookkeeper logs immediately before each crash:

2023-07-28T18:12:07,404+0000 [main] ERROR org.apache.bookkeeper.server.Main - Failed to build bookie server
org.apache.bookkeeper.bookie.BookieException$InvalidCookieException: instanceId null is not matching with 656d0f97-6d6e-40fa-b319-c008893cbf58
        at org.apache.bookkeeper.bookie.Cookie.verifyInternal(Cookie.java:168) ~[org.apache.bookkeeper-bookkeeper-server-4.16.1.jar:4.16.1]
        at org.apache.bookkeeper.bookie.Cookie.verify(Cookie.java:173) ~[org.apache.bookkeeper-bookkeeper-server-4.16.1.jar:4.16.1]
        at org.apache.bookkeeper.bookie.LegacyCookieValidation.verifyAndGetMissingDirs(LegacyCookieValidation.java:199) ~[org.apache.bookkeeper-bookkeeper-server-4.16.1.jar:4.16.1]
        at org.apache.bookkeeper.bookie.LegacyCookieValidation.checkCookies(LegacyCookieValidation.java:84) ~[org.apache.bookkeeper-bookkeeper-server-4.16.1.jar:4.16.1]
        at org.apache.bookkeeper.server.EmbeddedServer$Builder.build(EmbeddedServer.java:408) ~[org.apache.bookkeeper-bookkeeper-server-4.16.1.jar:4.16.1]
        at org.apache.bookkeeper.server.Main.buildBookieServer(Main.java:277) ~[org.apache.bookkeeper-bookkeeper-server-4.16.1.jar:4.16.1]
        at org.apache.bookkeeper.server.Main.doMain(Main.java:216) ~[org.apache.bookkeeper-bookkeeper-server-4.16.1.jar:4.16.1]
        at org.apache.bookkeeper.server.Main.main(Main.java:199) ~[org.apache.bookkeeper-bookkeeper-server-4.16.1.jar:4.16.1]

Errors in operator logs:

18:10:54 INFO  [com.dat.oss.kaa.con.PulsarClusterController] (ReconcilerExecutor-pulsar-cluster-app-95) waiting for bookkeeper to become ready
18:10:55 INFO  [com.dat.oss.kaa.con.boo.BookKeeperController] (ReconcilerExecutor-pulsar-bk-controller-94) Initializing bookie racks for bookkeeper-set 'bookkeeper'
18:10:55 ERROR [com.dat.oss.kaa.con.AbstractController] (ReconcilerExecutor-pulsar-bk-controller-94) Error during reconciliation for resource bookkeepers.kaap.oss.datastax.com with name pulsar-bookkeeper: KeeperErrorCode = NoNode for /bookies: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bookies
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2028)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$2.forPath(GetDataBuilderImpl.java:145)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$2.forPath(GetDataBuilderImpl.java:141)
        at com.datastax.oss.kaap.controllers.bookkeeper.racks.client.ZkClientRackClient$ZkNodeOp.get(ZkClientRackClient.java:140)
        at com.datastax.oss.kaap.controllers.bookkeeper.racks.BookKeeperRackMonitor.internalRun(BookKeeperRackMonitor.java:73)
        at com.datastax.oss.kaap.controllers.bookkeeper.racks.BookKeeperRackDaemon.triggerSync(BookKeeperRackDaemon.java:58)
        at com.datastax.oss.kaap.controllers.bookkeeper.BookKeeperController.compareLastAppliedSetSpec(BookKeeperController.java:249)
        at com.datastax.oss.kaap.controllers.bookkeeper.BookKeeperController.compareLastAppliedSetSpec(BookKeeperController.java:52)
        at com.datastax.oss.kaap.controllers.AbstractResourceSetsController.patchResources(AbstractResourceSetsController.java:125)
        at com.datastax.oss.kaap.controllers.AbstractController.reconcile(AbstractController.java:139)
        at com.datastax.oss.kaap.controllers.AbstractController.reconcile(AbstractController.java:62)
        at com.datastax.oss.kaap.controllers.bookkeeper.BookKeeperController_ClientProxy.reconcile(Unknown Source)
        at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:145)
        at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:103)
        at io.javaoperatorsdk.operator.monitoring.micrometer.MicrometerMetrics.lambda$timeControllerExecution$0(MicrometerMetrics.java:86)
        at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:69)
        at io.javaoperatorsdk.operator.monitoring.micrometer.MicrometerMetrics.timeControllerExecution(MicrometerMetrics.java:84)
        at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:102)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:141)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:121)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:91)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:64)
        at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:415)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

Simplify configuration by always adding PULSAR_PREFIX_

One frequent issue for new users is figuring out when to add PULSAR_PREFIX_ to configuration keys. Perhaps we can remove this confusion by always prepending configuration from the custom resource with PULSAR_PREFIX_? Then, users won't need to know when to add that prefix.
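
To illustrate the confusion, today the user must know which form a given key requires (the key below is just an example from broker.conf):

# Today: the prefixed form has to be used for some keys.
broker:
  config:
    PULSAR_PREFIX_brokerDeleteInactiveTopicsEnabled: "false"

# With the prefix always prepended by the operator, the plain Pulsar key would be enough:
#   broker:
#     config:
#       brokerDeleteInactiveTopicsEnabled: "false"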

Option to manage CRDs separately from the main cluster object

I'm running into some problems due to the large size of the cluster CRD. I sometimes get memory errors in K9s/kubectl, and it can be a bit difficult to find the config I'm looking for. I would like to have a mode where the cluster CRD just has a flag to tell it to manage a proxy CRD. That way I can work directly with the proxy config without affecting config of the other components.

global.tls.functionsWorker.enabled is ignored

The functionsWorker TLS setting seems to be ignored. When I have TLS enabled globally but disabled on the functionsWorker, the functionsWorker still gets TLS settings activated.

  global:
    tls:
      enabled: true
      functionsWorker:
        enabled: false

Move Pulsar scrape configuration into Service/PodMonitors

Currently the Pulsar pod scrape configuration is included directly in the kaap-stack Helm values.

additionalScrapeConfigs:
- job_name: 'pulsar-pods'
  honor_labels: true
  kubernetes_sd_configs:
  - role: pod
    # namespaces:
    #   names:
    #   - pulsar
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_label_component]
    action: replace
    target_label: job
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name

The problem is that this cannot be used across namespaces. So if kube-prometheus-stack is installed into a different namespace, the config would have to be copied from the values file.

An alternative is to create the config using a PodMonitor and/or ServiceMonitor. That would allow the kaap-stack to still install the monitoring config, but allow kube-prometheus-stack to load it across namespaces.

For an example, see https://github.com/riptano/streaming-dataplane-argocd/pull/22
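
A minimal PodMonitor along those lines might look like the following (the namespace, label selector, and port name are assumptions; the app: pulsar label is taken from the pod labels shown elsewhere in this document):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pulsar-pods
  namespace: pulsar                # namespace where the Pulsar pods run (assumption)
spec:
  selector:
    matchLabels:
      app: pulsar                  # label applied to KAAP-managed pods
  podMetricsEndpoints:
  - port: http                     # port name is an assumption; use each component's metrics port
    path: /metrics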

remove global TLS enable/disable setting

The behavior of this setting is a bit confusing

global:
  tls:
    enabled: false

If it's set to false it acts as a global override, but if it's set to true then it allows sub-resources to enable their TLS settings. I think it would be simpler if this option was just removed, and we just enable TLS on the components that need it.
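
Under that proposal, TLS would be enabled only where it is needed, for example (hedged sketch; the per-component layout follows the PulsarCluster.spec.<component>.tls idea from the consolidation comment above):

# Sketch of per-component TLS without a global switch (field names assumed).
spec:
  broker:
    tls:
      enabled: true
  proxy:
    tls:
      enabled: true
  functionsWorker:
    tls:
      enabled: false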

config objects like `config` and `env` should allow empty/null values

If I set broker.env to an empty/null value the operator crashes. The operator should handle these errors more gracefully.

Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize value of type `java.util.LinkedHashMap<java.lang.String,java.lang.Object>` from Array value (token `JsonToken.START_ARRAY`)
 at [Source: (BufferedInputStream); line: 1, column: 6612] (through reference chain: io.fabric8.kubernetes.api.model.DefaultKubernetesResourceList["items"]->java.util.ArrayList[0]->com.datastax.oss.kaap.crds.cluster.PulsarCluster["spec"]->com.datastax.oss.kaap.crds.cluster.PulsarClusterSpec["broker"]->com.datastax.oss.kaap.crds.broker.BrokerSpec["config"])
        at com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59)
        at com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1746)
        at com.fasterxml.jackson.databind.DeserializationContext.handleUnexpectedToken(DeserializationContext.java:1520)
        at com.fasterxml.jackson.databind.deser.std.StdDeserializer._deserializeFromArray(StdDeserializer.java:222)
        at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:457)
        at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:32)
        at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129)
        at io.fabric8.kubernetes.model.jackson.SettableBeanPropertyDelegate.deserializeAndSet(SettableBeanPropertyDelegate.java:134)
        at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:314)
        at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:177)
        at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129)
        at io.fabric8.kubernetes.model.jackson.SettableBeanPropertyDelegate.deserializeAndSet(SettableBeanPropertyDelegate.java:134)
        at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:314)
        at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:177)
        at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129)
        at io.fabric8.kubernetes.model.jackson.SettableBeanPropertyDelegate.deserializeAndSet(SettableBeanPropertyDelegate.java:134)
        at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:314)
        at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:177)
        at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer._deserializeFromArray(CollectionDeserializer.java:359)
        at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:244)
        at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:28)
        at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129)
        at io.fabric8.kubernetes.model.jackson.SettableBeanPropertyDelegate.deserializeAndSet(SettableBeanPropertyDelegate.java:134)
        at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:314)
        at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:177)
        at com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:323)
        at com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:2105)
        at com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1481)
        at io.fabric8.kubernetes.client.utils.Serialization.unmarshal(Serialization.java:253)
        ... 19 more

operator throws exceptions when labels don't match

Configuring custom pod labels, I ran into a situation where the labels I set don't match the selectors. This type of mismatch should be prevented by the operator. In other words, the operator should only allow valid configuration; I'm not sure why there are separate config fields for labels, podLabels, and matchLabels.

01:31:26 ERROR [com.dat.oss.kaa.con.AbstractController] (ReconcilerExecutor-pulsar-bk-controller-103) Error during reconciliation for resource bookkeepers.kaap.oss.datastax.com with name pulsar-bookkeeper: Failure executing: POST at: https://172.20.0.1:443/apis/apps/v1/namespaces/pulsar/statefulsets. Message: StatefulSet.apps "pulsar-bookkeeper" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"app":"pulsar", "cluster":"pulsar-aws-apsoutheast1", "component":"bookkeeper", "resource-set":"bookkeeper"}: `selector` does not match template `labels`. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.template.metadata.labels, message=Invalid value: map[string]string{"app":"pulsar", "cluster":"pulsar-aws-apsoutheast1", "component":"bookkeeper", "resource-set":"bookkeeper"}: `selector` does not match template `labels`, reason=FieldValueInvalid, additionalProperties={})], group=apps, kind=StatefulSet, name=pulsar-bookkeeper, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=StatefulSet.apps "pulsar-bookkeeper" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"app":"pulsar", "cluster":"pulsar-aws-apsoutheast1", "component":"bookkeeper", "resource-set":"bookkeeper"}: `selector` does not match template `labels`, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://172.20.0.1:443/apis/apps/v1/namespaces/pulsar/statefulsets. Message: StatefulSet.apps "pulsar-bookkeeper" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"app":"pulsar", "cluster":"pulsar-aws-apsoutheast1", "component":"bookkeeper", "resource-set":"bookkeeper"}: `selector` does not match template `labels`. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.template.metadata.labels, message=Invalid value: map[string]string{"app":"pulsar", "cluster":"pulsar-aws-apsoutheast1", "component":"bookkeeper", "resource-set":"bookkeeper"}: `selector` does not match template `labels`, reason=FieldValueInvalid, additionalProperties={})], group=apps, kind=StatefulSet, name=pulsar-bookkeeper, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=StatefulSet.apps "pulsar-bookkeeper" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"app":"pulsar", "cluster":"pulsar-aws-apsoutheast1", "component":"bookkeeper", "resource-set":"bookkeeper"}: `selector` does not match template `labels`, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
        at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:536)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:570)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:554)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:347)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:704)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:93)
        at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1107)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93)
        at com.datastax.oss.kaap.controllers.BaseResourcesFactory.patchResource(BaseResourcesFactory.java:150)
        at com.datastax.oss.kaap.controllers.bookkeeper.BookKeeperResourcesFactory.patchStatefulSet(BookKeeperResourcesFactory.java:212)
        at com.datastax.oss.kaap.controllers.bookkeeper.BookKeeperController.patchResourceSet(BookKeeperController.java:115)
        at com.datastax.oss.kaap.controllers.AbstractResourceSetsController.patchResources(AbstractResourceSetsController.java:143)
        at com.datastax.oss.kaap.controllers.AbstractController.reconcile(AbstractController.java:139)
        at com.datastax.oss.kaap.controllers.AbstractController.reconcile(AbstractController.java:62)
        at com.datastax.oss.kaap.controllers.bookkeeper.BookKeeperController_ClientProxy.reconcile(Unknown Source)
        at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:145)
        at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:103)
        at io.javaoperatorsdk.operator.monitoring.micrometer.MicrometerMetrics.lambda$timeControllerExecution$0(MicrometerMetrics.java:86)
        at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:69)
        at io.javaoperatorsdk.operator.monitoring.micrometer.MicrometerMetrics.timeControllerExecution(MicrometerMetrics.java:84)
        at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:102)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:139)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:119)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:89)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
        at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

Migration tool doesn't find services with "-local" suffix

We have services configured with a "-local" suffix. When running the migration tool against this cluster, it fails with this error:

17:09:49.469 [main] DEBUG org.jboss.logging - Logging Provider: org.jboss.logging.Slf4jLoggerProvider
Exception in thread "main" java.lang.IllegalStateException: Expected service with name mycluster-proxy not found
	at com.datastax.oss.kaap.migrationtool.specs.BaseSpecGenerator.getService(BaseSpecGenerator.java:142)
	at com.datastax.oss.kaap.migrationtool.specs.BaseSpecGenerator.requireService(BaseSpecGenerator.java:133)
	at com.datastax.oss.kaap.migrationtool.specs.ProxySetSpecGenerator.internalGenerateSpec(ProxySetSpecGenerator.java:82)
	at com.datastax.oss.kaap.migrationtool.specs.ProxySetSpecGenerator.<init>(ProxySetSpecGenerator.java:59)
	at com.datastax.oss.kaap.migrationtool.specs.ProxySpecGenerator.internalGenerateSpec(ProxySpecGenerator.java:53)
	at com.datastax.oss.kaap.migrationtool.specs.ProxySpecGenerator.<init>(ProxySpecGenerator.java:46)
	at com.datastax.oss.kaap.migrationtool.PulsarClusterResourceGenerator.<init>(PulsarClusterResourceGenerator.java:72)
	at com.datastax.oss.kaap.migrationtool.SpecGenerator.generatePulsarClusterSpec(SpecGenerator.java:153)
	at com.datastax.oss.kaap.migrationtool.SpecGenerator.generate(SpecGenerator.java:132)
	at com.datastax.oss.kaap.migrationtool.SpecGenerator.generate(SpecGenerator.java:119)
	at com.datastax.oss.kaap.migrationtool.Main$GenerateCmd.run(Main.java:98)
	at com.datastax.oss.kaap.migrationtool.Main.main(Main.java:73)

stack trace when there is a problem with zookeeper nodes

This should probably just be an error or warning message instead of a stack trace.

17:35:46 ERROR [com.dat.oss.kaa.con.AbstractController] (ReconcilerExecutor-pulsar-bk-controller-93) Error during reconciliation for resource bookkeepers.kaap.oss.datastax.com with name pulsar-bookkeeper: KeeperErrorCode = NoNode for /bookies: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bookies
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2028)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
        at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$2.forPath(GetDataBuilderImpl.java:145)
        at org.apache.curator.framework.imps.GetDataBuilderImpl$2.forPath(GetDataBuilderImpl.java:141)
        at com.datastax.oss.kaap.controllers.bookkeeper.racks.client.ZkClientRackClient$ZkNodeOp.get(ZkClientRackClient.java:140)
        at com.datastax.oss.kaap.controllers.bookkeeper.racks.BookKeeperRackMonitor.internalRun(BookKeeperRackMonitor.java:73)
        at com.datastax.oss.kaap.controllers.bookkeeper.racks.BookKeeperRackDaemon.triggerSync(BookKeeperRackDaemon.java:58)
        at com.datastax.oss.kaap.controllers.bookkeeper.BookKeeperController.compareLastAppliedSetSpec(BookKeeperController.java:249)
        at com.datastax.oss.kaap.controllers.bookkeeper.BookKeeperController.compareLastAppliedSetSpec(BookKeeperController.java:52)
        at com.datastax.oss.kaap.controllers.AbstractResourceSetsController.patchResources(AbstractResourceSetsController.java:125)
        at com.datastax.oss.kaap.controllers.AbstractController.reconcile(AbstractController.java:139)
        at com.datastax.oss.kaap.controllers.AbstractController.reconcile(AbstractController.java:62)
        at com.datastax.oss.kaap.controllers.bookkeeper.BookKeeperController_ClientProxy.reconcile(Unknown Source)
        at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:145)
        at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:103)
        at io.javaoperatorsdk.operator.monitoring.micrometer.MicrometerMetrics.lambda$timeControllerExecution$0(MicrometerMetrics.java:86)
        at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:69)
        at io.javaoperatorsdk.operator.monitoring.micrometer.MicrometerMetrics.timeControllerExecution(MicrometerMetrics.java:84)
        at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:102)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:139)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:119)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:89)
        at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
        at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

JVM memory settings should not be hard-coded

Currently, there appear to be hard-coded JVM memory settings for various components. For example, zookeeper sets -Xms and -Xmx to 1g regardless of how much memory the container has. Modern JVMs are able to detect the container memory limits and automatically set good defaults for things like heap size, so we should probably use the default memory settings in most cases.
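
For example, instead of a fixed -Xms1g -Xmx1g, a component could be given container-aware flags through the PULSAR_MEM environment variable used by the standard Pulsar images (the values below are illustrative, and whether KAAP passes this key through unchanged is an assumption):

# Illustrative only; percentages are examples, not recommendations.
zookeeper:
  config:
    PULSAR_MEM: "-XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:+ExitOnOutOfMemoryError"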

TLS on the proxy is always disabled

It seems the global TLS setting always overrides the specific proxy setting for TLS, even when it's not specified. It looks like this is because there is a default value of false in the proxy resource.

Improve logging configuration

Right now it's hard to understand how to configure logging for a particular component. A dedicated logging section would probably be useful.

migration tool should strip out runtime annotations

There are a lot of annotations that are added automatically by Kubernetes or by Helm, etc. For example:

          checksum/config: 06fad642cbb2b00903b8c47039cd347a8bb7b345b9f3ea0222a1363272f1c5f2
          kubectl.kubernetes.io/restartedAt: '2023-05-19T17:22:21-05:00'

          meta.helm.sh/release-name: astradev-gcp-pulsar-useast1
          meta.helm.sh/release-namespace: pulsar

These should not be included in the generated values file because they are added automatically at deploy or run time.

migration tool should only put non-default values in the generated values file

To keep the generated values more concise and allow for easier review, the migration tool should only print values which override the defaults.

An alternative might be a tool to remove all default values from an existing values file. Then you could take the output of the current migration tool and run it through the cleanup/minimize tool.

Fix failing multi-node CI tests

The Multi nodes CI job always fails with timeouts. This test should be fixed or disabled.

Error:  Failures: 
Error:    RacksTest.testBookKeeper:185->BaseK8sEnvTest.applyOperatorDeploymentAndCRDs:264->BaseK8sEnvTest.awaitOperatorRunning:544 » ConditionTimeout
Error:    RacksTest.testBroker:45->BaseK8sEnvTest.applyOperatorDeploymentAndCRDs:264->BaseK8sEnvTest.awaitOperatorRunning:544 » ConditionTimeout
Error:    RacksTest.testProxy:114->BaseK8sEnvTest.applyOperatorDeploymentAndCRDs:264->BaseK8sEnvTest.awaitOperatorRunning:544 » ConditionTimeout

proxy shouldn't start until broker is ready

Currently it appears that proxy and broker are updated at the same time. The proxy can't do much without the broker, so it would be better if the proxy starts after the broker startup is complete.

Omitting PULSAR_GC VM settings may lead to very high Proxy memory usage

Hi guys,

Issue: Extremely high memory usage by the pulsar-proxy pods (4GB for proxy+websocket containers at idle)

Steps to reproduce:

Fixes attempted

  • changing pulsar image (tried variants of 3.0.0, 3.0.2, default)
  • setting resource limits for proxy (caused OOM kills, as expected)
  • setting higher requests for proxy
  • disabling just the websocket - this led to halving the memory usage (now only 2GB)!

I then did a vanilla deployment of Pulsar via the official Pulsar Helm chart. That deployment used ~200 MB of memory. I compared the configmaps between it and the KAAP proxy and noticed that KAAP doesn't specify many JVM settings, especially with respect to the GC.

I set these explicitly in the proxy config, using the values I saw in the official Pulsar instance. The resulting proxy instances show normal memory consumption (~345 MB).

The config section I added was:

  proxy:
    replicas: 1
    resources:
      requests:
        cpu: "0.2"
        memory: "128Mi"
    config:
      PULSAR_GC: |
        -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=4 -XX:ConcGCThreads=4 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem

I saw that these were set in the past but were removed in this commit for understandable reasons. My experience, and the fact that they are still explicitly set by the Pulsar devs, may be a good reason to bring them back...

If I can help triage this further, please let me know.

Environment

  • Azure Kubernetes Cluster (AKS)
  • kubectl v 1.28
  • k8s v: v1.25 (<< slightly behind, but don't expect this to matter here?)

-s

bastion additionalVolumes

I don't see an option in the spec for additionalVolumes on the bastion. Currently the bastion can't connect to the proxy because it doesn't have the required tokens mounted.
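
A sketch of the kind of spec this would enable, modeled on how volumes are commonly declared in Kubernetes resources (the field names and secret names here describe a proposed API and example values, not the current spec):

  bastion:
    additionalVolumes:
      volumes:
        - name: token-superuser
          secret:
            secretName: token-superuser
      mounts:
        - name: token-superuser
          mountPath: /pulsar/token-superuser
          readOnly: true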

Temporary error during secrets provisioning

During secret provisioning, this error appears for a couple of cycles:

Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.40.0.1/api/v1/namespaces/pulsar-operator-test-qejpebzl/secrets. Message: Secret in version "v1" cannot be handled as a Secret: illegal base64 data at input byte 1620. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=Secret in version "v1" cannot be handled as a Secret: illegal base64 data at input byte 1620, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=BadRequest, status=Failure, additionalProperties={}).
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:709)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:689)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:640)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:576)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$retryWithExponentialBackoff$2(OperationSupport.java:618)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$4.onResponse(OkHttpClientImpl.java:277)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    ... 3 more

INSTALLATION FAILED: type mismatch on prometheus-node-exporter: %!t(<nil>)

Error message:

Error: INSTALLATION FAILED: type mismatch on prometheus-node-exporter: %!t(<nil>)

Reproduction steps:

kind create cluster
helm install pulsar kaap/kaap-stack --values helm/examples/dev-cluster/values.yaml

Environment:

  • MacBook M1
  • Docker & Helm & Kind
    > kind --version
    kind version 0.20.0
    > helm version
    version.BuildInfo{Version:"v3.13.2", GitCommit:"2a2fb3b98829f1e0be6fb18af2f6599e0f4e8243", GitTreeState:"clean", GoVersion:"go1.21.4"}
    > docker version 
    Client:
     Cloud integration: v1.0.35+desktop.5
     Version:           24.0.7
     API version:       1.43
     Go version:        go1.20.10
     Git commit:        afdd53b
     Built:             Thu Oct 26 09:04:20 2023
     OS/Arch:           darwin/arm64
     Context:           desktop-linux
    
    Server: Docker Desktop 4.26.1 (131620)
     Engine:
      Version:          24.0.7
      API version:      1.43 (minimum version 1.12)
      Go version:       go1.20.10
      Git commit:       311b9ff
      Built:            Thu Oct 26 09:08:15 2023
      OS/Arch:          linux/arm64
      Experimental:     false
     containerd:
      Version:          1.6.25
      GitCommit:        d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
     runc:
      Version:          1.1.10
      GitCommit:        v1.1.10-0-g18a0cb0
     docker-init:
      Version:          0.19.0
      GitCommit:        de40ad0

Service order during downgrade

I noticed that when changing the Pulsar version, the same update order (1. zookeeper, 2. bookkeeper, etc.) is used for downgrades as for upgrades. I would have expected the order to be reversed during a downgrade, i.e. the Pulsar proxy updated first, then the broker, and so on.

Proxy configmap is not automatically re-created after deletion

If I manually delete the proxy configmap, the operator can get into a bad state where it waits for the proxy pods to come up, but they never can because the configmap is missing.

A workaround was to disable and then re-enable TLS on the proxy, which forces a new configmap to be created.
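
For reference, the failure can be reproduced roughly like this (the configmap name and namespace are assumptions and depend on the cluster name):

# hypothetical resource names; substitute your cluster's proxy configmap
kubectl delete configmap pulsar-proxy -n pulsar
# the operator then waits for proxy pods that can never become ready,
# because the configmap they mount no longer exists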
