
kamaji-etcd's People

Contributors

bsctl, lixd, maruina, prometherion, ptx96, skalanetworks, sn4psh0t


kamaji-etcd's Issues

cfssl issue with arm64

Hi there!

I wanted to test clastix/kamaji but unfortunately I only have access to arm64 machines.
During the deployment I noticed that the script below won't run due to architecture incompatibilities.
https://github.com/clastix/kamaji-etcd/blob/master/scripts/certs-renew.sh#L28

It turns out that cfssl version 1.6.1 is not available for arm64 platforms. I don't know whether bumping the version would break things, or whether the pinned version is intentional. If not, I would create a PR to bump the version :)
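A portable download could also detect the machine architecture instead of hard-coding one. A minimal sketch, assuming upstream cfssl releases follow a `cfssl_<version>_linux_<arch>` naming scheme and that a version with arm64 builds exists (the version number below is illustrative; verify against the actual release page):

```shell
# Sketch: derive the cfssl release artifact name from the local architecture.
CFSSL_VERSION="1.6.4"   # hypothetical version that ships arm64 builds

case "$(uname -m)" in
  x86_64)        CFSSL_ARCH=amd64 ;;
  aarch64|arm64) CFSSL_ARCH=arm64 ;;
  *) echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac

# The script could then download this artifact instead of a fixed amd64 one.
echo "cfssl_${CFSSL_VERSION}_linux_${CFSSL_ARCH}"
```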

Kind regards
Tim

Scheduled job for etcd snapshot

Snapshotting the etcd cluster on a regular basis serves as a durable backup for an etcd keyspace. By taking periodic snapshots of an etcd member’s backend database, an etcd cluster can be recovered to a point in time with a known good state.

A CronJob with regular snapshot of etcd data is required. The job should push the snapshot on a remote S3 location.

For each etcd node in the cluster, the etcd cluster health is checked first. If the node reports the cluster as healthy, a snapshot is created from it and optionally uploaded to S3. Since every etcd node uploads its snapshot, the copy on S3 is always the one from the last node to upload. Because each snapshot is taken only after a successful health check, it can be considered a valid snapshot of the data in the etcd cluster.

For example:

Option                                    Action                                        Default
etcd Snapshot Backup S3 Target            Select the S3 bucket where snapshots are saved   ---
Recurring etcd Snapshot Enabled           Enable/disable recurring snapshots            False
Recurring etcd Snapshot Creation Period   Cron-like schedule for recurring snapshots    0 23 * * *
Recurring etcd Snapshot Retention Count   Number of snapshots to retain                 10
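The requirement above could be sketched as a CronJob along these lines (the names, image, and upload step are illustrative assumptions, not the chart's actual manifest):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
spec:
  schedule: "0 23 * * *"           # default schedule from the table above
  successfulJobsHistoryLimit: 10   # mirrors the retention count
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: quay.io/coreos/etcd:v3.5.6
              command:
                - /bin/sh
                - -c
                - |
                  # Check health first, then snapshot; endpoint and TLS
                  # flags are omitted here for brevity.
                  etcdctl endpoint health && \
                  etcdctl snapshot save /tmp/snapshot.db
                  # Upload /tmp/snapshot.db to the configured S3 bucket here.
```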

Unable to perform helm upgrade

Issue

The DataStore upgrade is not working due to an error with Helm hooks.

How to reproduce

$: helm upgrade --install etcd-nvme clastix/kamaji-etcd --namespace kamaji-system --create-namespace --set "datastore.enabled=true"
Error: UPGRADE FAILED: post-upgrade hooks failed: Internal error occurred: failed calling webhook "vdatastore.kb.io": failed to call webhook: the server could not find the requested resource

The reported error is due to Kamaji being unable to handle DELETE actions properly, which hides the underlying root cause: the kamaji-etcd chart should not delete the DataStore resource at all.

Version

NAME            NAMESPACE       REVISION        UPDATED                                         STATUS          CHART                   APP VERSION
etcd-nvme       kamaji-system   9               2023-08-19 20:17:11.658751094 +0200 CEST        failed          kamaji-etcd-0.3.0       3.5.6

BUG helm chart will install but not uninstall or update

Hi,

we created an additional datastore of type etcd using your helm chart clastix/kamaji-etcd. It successfully installs, but as soon as you try to update or delete the datastore / the helm chart, you get the following error:

helm upgrade kamaji-etcd-2 clastix/kamaji-etcd -n kamaji-etcd-2 -f values.yaml
Error: UPGRADE FAILED: post-upgrade hooks failed: admission webhook "vdatastore.kb.io" denied the request: unable to decode into *v1alpha1.DataStore: there is no content to decode

Actually, it seems (most?) resources do get updated. However, you also get the error if you try to kubectl delete datastores.kamaji.clastix.io kamaji-etcd-2, resulting in a half-installed datastore: it is not entirely removed.

We use kamaji v0.3.3 together with cluster-api v1.5.0 with infrastructure-vsphere v1.7.0 and your kamaji plugin control-plane-kamaji v0.3.0.

Provide a procedure to renew certificates

Currently, the self generated CA certificate has 5 years of lifetime and other certificates have 1 year. Provide a procedure/script to renew certificates on a live setup without service disruption.
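Before renewing, it helps to check how much lifetime a certificate has left. A minimal sketch: here we generate a throwaway self-signed CA (5-year lifetime, matching the default above) purely to demonstrate the check; on a live setup you would extract the certificate from the etcd certs Secret instead.

```shell
# Generate a demo CA cert valid for 5 years (1825 days).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo-ca.key -out /tmp/demo-ca.crt \
  -days 1825 -subj "/CN=demo-etcd-ca" 2>/dev/null

# Print the notAfter date; renewal should happen well before this point.
openssl x509 -in /tmp/demo-ca.crt -noout -enddate
```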

Setup jobs are missing tolerations

In a strict environment where workload scheduling is controlled by nodeSelector and taints/tolerations, I cannot install the chart if there are no taint-free node pools.
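The fix would be for the setup Job templates to honour the same scheduling values as the StatefulSet. A hypothetical values.yaml override (the toleration key and value are examples, not chart defaults):

```yaml
nodeSelector:
  kubernetes.io/os: linux
tolerations:
  - key: "dedicated"        # illustrative taint key
    operator: "Equal"
    value: "etcd"
    effect: "NoSchedule"
```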

maybe the storageClass value name is wrong

The template uses storageClassName:

  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        storageClassName: {{ .Values.persistenVolumeClaim.storageClassName }}
        accessModes:
        {{- range .Values.persistenVolumeClaim.accessModes }}
        - {{ . | quote }}
        {{- end }}
        resources:
          requests:
            storage: {{ .Values.persistenVolumeClaim.size }}

but values.yaml names the key storageClass:

persistenVolumeClaim:
  # -- The size of persistent storage for etcd data 
  size: 10Gi
  # -- A specific storage class
  storageClass: ""
  # -- The Access Mode to storage
  accessModes:
  - ReadWriteOnce

Specifying the storage class with helm install xxx --set persistenVolumeClaim.storageClass=xxx does not work; after renaming the key in values.yaml to storageClassName, it works.

Or is there a special usage I'm missing? I am new to Helm.
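One possible fix (an assumption, not the maintainers' chosen resolution) is to keep the documented values.yaml key storageClass and reference it consistently from the template:

```yaml
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      # Reference the key actually documented in values.yaml
      storageClassName: {{ .Values.persistenVolumeClaim.storageClass }}
```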

helm: Allow for iRSA

The helm chart currently only allows for hard-coded environment variable based AWS IAM access key and secret key input. It does not allow for not-specifying these and letting iRSA handle it via the kubernetes service account.
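With IRSA, the pod obtains AWS credentials through its service account, so the chart would only need to support a service-account annotation and allow the access/secret key environment variables to be omitted. A sketch (the role ARN is illustrative):

```yaml
serviceAccount:
  create: true
  annotations:
    # Standard EKS annotation that binds the SA to an IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kamaji-etcd-backup
```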

Publish `kamaji-etcd` chart on the public Clastix repo

Publish kamaji-etcd chart on the public Clastix repo so the end user can install it:

helm repo add clastix https://clastix.github.io/charts
helm install kamaji-etcd clastix/kamaji-etcd -n etcd-system --create-namespace

Manage volumeClaimTemplates annotations

We should adapt helm charts to easily add annotations to sts volumeClaimtemplates:

this could be useful, for example, to create local volumes with local-path-provisioner or to backup volumes with velero
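A hypothetical values.yaml shape for this (key name assumed; the annotation is only an example), with the chart propagating the map onto the volumeClaimTemplates metadata:

```yaml
persistenVolumeClaim:
  customAnnotations:
    example.com/backup: "enabled"   # illustrative annotation
```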

Move to an Operator

Currently, kamaji-etcd is deployed as a Helm Chart while some maintenance tasks like snapshotting, backup, restore, certificates rotation, are delegated to scripting. As evolution of this project, we should move to a Kubernetes Operator, in order to encode operational knowledge into a software machine driven by Kubernetes.

Create a Kamaji Datastore

It would be nice to have a datastore.kamaji.clastix.io resource created when the chart is deployed.
This can be enabled in the values.yaml chart as the creation of the datastore fails if the Kamaji Controller is not already installed.

Support for externally referenced S3 credentials and quoted plain values

Hi,

we created an additional datastore of type etcd using your helm chart clastix/kamaji-etcd.
We successfully set it up with an S3 backup. However, in our values.yaml we had to define our password like this (not the real password):

  s3:
    #...
    bucket: kamaji/kamaji-etcd-2/
    accessKey: kamaji
    secretKey: "'xY%foo$ba)R'"

This is necessary so that the single quotes end up in the mc command; otherwise the shell would try to interpret the password. Maybe the password should be quoted by default in the container's command.
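The quoting problem can be reproduced in plain shell. Expanding the secret through a double-quoted variable (or an environment variable) avoids interpretation without embedding literal single quotes in values.yaml. The secret below is illustrative:

```shell
# Illustrative secret containing shell metacharacters ($, %, parenthesis).
SECRET_KEY='xY%foo$ba)R'

# Unsafe pattern: splicing the raw value into a command string makes the
# shell expand $ba and choke on the unmatched parenthesis, e.g.:
#   sh -c "mc alias set s3 https://s3.example.com kamaji $SECRET_KEY"

# Safe pattern: expand the variable inside double quotes so the value is
# passed as a single literal argument.
printf '%s\n' "$SECRET_KEY"
```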

We use kamaji v0.3.3 together with cluster-api v1.5.0 with infrastructure-vsphere v1.7.0 and your kamaji plugin control-plane-kamaji v0.3.0.

Error installing with Kubernetes v1.25

Installing the chart leaves a few pods in ErrImagePull because of:

Warning  Failed     23m (x4 over 25m)     kubelet            Failed to pull image "clastix/kubectl:v1.25": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/clastix/kubectl:v1.25": failed to resolve reference "docker.io/clastix/kubectl:v1.25": docker.io/clastix/kubectl:v1.25: not found

It seems you haven't published v1.25 at https://hub.docker.com/r/clastix/kubectl/tags

The datastore resource is not created

The datastore resource is not created because the required secret is missing. The creation of the datastore needs to be deferred until after the secret is created.
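One way to order this within Helm is with hook weights: resources in the same hook run in ascending weight order, so the DataStore can be given a higher weight than whatever Job produces the secret. A sketch (the weight values are assumptions, not the chart's actual annotations):

```yaml
# On the DataStore manifest: run late in the post-install/upgrade phase.
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-weight: "10"   # the certs Job would use a lower weight
```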

service account missing

Hi @prometherion ,

I think there is a new bug here:

I'm using the following values.yaml (which is basically a modified version of the original one):

#cat kamaji-etcd_values.yaml| grep -Ev '^\s*(#|$)'
replicas: 5
serviceAccount:
  create: true
  name: ""
image:
  repository: quay.io/coreos/etcd
  tag: ""
  pullPolicy: IfNotPresent
peerApiPort: 2380
clientPort: 2379
metricsPort: 2381
livenessProbe: {}
extraArgs: []
autoCompactionMode: periodic
autoCompactionRetention: 5m
snapshotCount: "10000"
quotaBackendBytes: "8589934592" # 8Gi
persistenVolumeClaim:
  size: 2Gi
  storageClassName: "block-small-sc"
  accessModes:
  - ReadWriteOnce
  customAnnotations: {}
defragmentation:
  schedule: "*/15 * * * *" # https://crontab.guru/
backup:
  enabled: true
  all: false
  schedule: "*/15 * * * *" # https://crontab.guru/
  snapshotNamePrefix: kpoc-tenant1
  snapshotDateFormat: $(date +%Y%m%d-%H%M)
  s3:
    url: https:/s3.our.domain
    bucket: kamaji/kpoc-tenant1-etcd/
    retention: "--expiry-days 3"
    accessKey:
      valueFrom:
        secretKeyRef:
          key: access_key
          name: minio-key
    secretKey:
      valueFrom:
        secretKeyRef:
          key: secret_key
          name: minio-key
    image:
      repository: minio/mc
      tag: "RELEASE.2022-11-07T23-47-39Z"
      pullPolicy: IfNotPresent
podLabels:
  application: kpoc-tenant1-etcd
podAnnotations: {}
securityContext:
  allowPrivilegeEscalation: false
priorityClassName: system-cluster-critical
resources:
  limits: {}
  requests: {}
nodeSelector:
  kubernetes.io/os: linux
tolerations: []
affinity: {}
topologySpreadConstraints: []
datastore:
  enabled: true
serviceMonitor:
  enabled: false
  namespace: ''
  labels: {}
  annotations: {}
  matchLabels: {}
  targetLabels: []
  serviceAccount:
    name: etcd
    namespace: etcd-system
  endpoint:
    interval: "15s"
    scrapeTimeout: ""
    metricRelabelings: []
    relabelings: []
alerts:
  enabled: false
  namespace: ''
  labels: {}
  annotations: {}
  rules: []

If I apply this with your fork (from back when you implemented #43; branch "issues/43", commit "a521b27d830f4bf16908f03cab146039ab4a1401" is what I still have checked out), then everything works fine.

helm install kpoc-tenant1-etcd ~/tmp/kamaji/kamaji-etcd-prometherion/charts/kamaji-etcd -n kpoc-tenant1 -f kamaji-etcd_values.yaml
kg job
#NAME                                COMPLETIONS   DURATION   AGE
#kpoc-tenant1-etcd-backup-28253640   1/1           6m22s      21m
#kpoc-tenant1-etcd-backup-28253655   1/1           9s         6m22s
#kpoc-tenant1-etcd-defrag-28253640   1/1           80s        21m
#kpoc-tenant1-etcd-defrag-28253655   1/1           81s        6m22s
#kpoc-tenant1-etcd-etcd-setup-1      1/1           94s        33m
kg sa
#NAME                SECRETS   AGE
#default             0         33m
#kpoc-tenant1-etcd   0         33m

However, if I apply the very same config with the latest upstream branch, then it doesn't work because the relevant serviceaccount is not created:

helm install kpoc-tenant2-etcd clastix/kamaji-etcd -n kpoc-tenant2 -f kamaji-etcd_values.yaml
kg job
#NAME                                COMPLETIONS   DURATION   AGE
#kpoc-tenant2-etcd-backup-28253655   0/1           8m27s      8m27s
#kpoc-tenant2-etcd-defrag-28253655   0/1           8m27s      8m27s
kg sa
#NAME      SECRETS   AGE
#default   0         16m
k describe job kpoc-tenant2-etcd-backup-28253655
#...
#  Warning  FailedCreate  5m6s (x6 over 9m56s)  job-controller  Error creating: pods "kpoc-tenant2-etcd-backup-28253655-" is forbidden: error looking up service account kpoc-tenant2/kpoc-tenant2-etcd: serviceaccount "kpoc-tenant2-etcd" not found

And actually, while ensuring I could reproduce the issue, I found that the service account is in fact created and then seems to be deleted after the install has finished:

kg sa
#NAME                SECRETS   AGE
#default             0         113s
#kpoc-tenant2-etcd   0         75s
kg sa
#NAME      SECRETS   AGE
#default   0         2m6s

BUG issue 41 not fixed

see #41

Hi @prometherion ,

It seems #41 was not a duplicate of #38; we still have the same issue with v0.4.0.
Looking into #38, it actually shows a different error message.
Again, our error is:

Error: UPGRADE FAILED: post-upgrade hooks failed: admission webhook "vdatastore.kb.io" denied the request: unable to decode into *v1alpha1.DataStore: there is no content to decode

->

unable to decode into *v1alpha1.DataStore: there is no content to decode

This sounds like there may be an issue with the CRD or something?

This is one of our datastores in question:

apiVersion: kamaji.clastix.io/v1alpha1
kind: DataStore
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade,post-rollback
    helm.sh/hook-weight: "5"
  creationTimestamp: "2023-08-28T12:00:22Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: kamaji-etcd-2
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kamaji-etcd
    app.kubernetes.io/version: 3.5.6
    helm.sh/chart: kamaji-etcd-0.4.0
  name: kamaji-etcd-2
  resourceVersion: "5406229"
  uid: e37abb97-7840-4157-b8f2-61164f540cb9
spec:
  driver: etcd
  endpoints:
  - kamaji-etcd-2-0.kamaji-etcd-2.kamaji-etcd-2.svc.cluster.local:2379
  - kamaji-etcd-2-1.kamaji-etcd-2.kamaji-etcd-2.svc.cluster.local:2379
  - kamaji-etcd-2-2.kamaji-etcd-2.kamaji-etcd-2.svc.cluster.local:2379
  tlsConfig:
    certificateAuthority:
      certificate:
        secretReference:
          keyPath: ca.crt
          name: kamaji-etcd-2-certs
          namespace: kamaji-etcd-2
      privateKey:
        secretReference:
          keyPath: ca.key
          name: kamaji-etcd-2-certs
          namespace: kamaji-etcd-2
    clientCertificate:
      certificate:
        secretReference:
          keyPath: tls.crt
          name: kamaji-etcd-2-root-client-certs
          namespace: kamaji-etcd-2
      privateKey:
        secretReference:
          keyPath: tls.key
          name: kamaji-etcd-2-root-client-certs
          namespace: kamaji-etcd-2
status:
  usedBy:
  - bcp-cluster-test1/bcp-cluster-test1

The same error can actually be triggered when trying to delete the datastore:

k delete datastores.kamaji.clastix.io kamaji-etcd-2
Error from server: admission webhook "vdatastore.kb.io" denied the request: unable to decode into *v1alpha1.DataStore: there is no content to decode

(nothing actually gets deleted)

Memory segmentation violation in Kamaji Controller

Description

Observed a segmentation violation in Kamaji Controller when using etcd as datastore

1.6693292987985492e+09  INFO    controller.tenantcontrolplane   marked for deletion, performing clean-up        {"reconciler group": "kamaji.clastix.io", "reconciler kind": "TenantControlPlane", "name": "tenant-01", "namespace": "default"}
{"level":"warn","ts":"2022-11-24T22:34:58.917Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0018a8700/etcd-0.etcd.kamaji-system.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: revision of auth store is old"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x16fb550]

It looks related to etcd versions earlier than 3.5.6, according to this issue.

We need to upgrade the etcd version in the Helm chart.
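Until the chart default changes, the image can be pinned from values.yaml (the keys below match the values file shown elsewhere in this page; the tag is the version the linked issue reports as fixed):

```yaml
image:
  repository: quay.io/coreos/etcd
  tag: "v3.5.6"   # >= 3.5.6 avoids the auth-store revision panic
  pullPolicy: IfNotPresent
```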

BUG restore.sh script relies on optional annotation

Hi,

we tested how well an etcd backup restore currently works.
We use kamaji v0.3.3 together with cluster-api v1.5.0 with infrastructure-vsphere v1.7.0 and your kamaji plugin control-plane-kamaji v0.3.0.

We created an additional datastore of type etcd using your helm chart clastix/kamaji-etcd.
The backup job to s3 succeeds and we were able to restore the etcd using your restore.sh script, but we had to make some changes to it:

diff --git a/scripts/restore.sh b/scripts/restore.sh
index dddec4f..dc8b1f8 100755
--- a/scripts/restore.sh
+++ b/scripts/restore.sh
@@ -19,6 +19,7 @@ set -eu -o pipefail
 KAMAJI_TCP_SNAP_URL=$1
 
 # Service variables
+KAMAJI_PODS_JSON_BEFORE=$(kubectl get pods -n $ETCD_NAMESPACE -l app.kubernetes.io/instance=$ETCD_NAME -o json)
 KAMAJI_PODS_JSON="kubectl get pods -n $ETCD_NAMESPACE -l app.kubernetes.io/instance=$ETCD_NAME -o json"
 KAMAJI_TCP_SNAP="snapshot.db"
 TMP_FOLDER="/tmp"
@@ -105,8 +106,8 @@ etcdFolderSwitch() {
     PVC_VOLUMENAME=$($KAMAJI_PVC_JSON |\
       jq -j '.spec.volumeName')
   
-    PVC_NODE=$($KAMAJI_PVC_JSON |\
-      jq -j '.metadata.annotations["volume.kubernetes.io/selected-node"]')
+    PVC_NODE=$(echo $KAMAJI_PODS_JSON_BEFORE |\
+      jq -r --arg pvc_name $DATA '.items[] | select(.spec.volumes[].persistentVolumeClaim.claimName == $pvc_name) | .spec.nodeName')
  
   cat <<EOF | kubectl apply -f -
   apiVersion: batch/v1

I'm aware this may not be the best way to fix it, but the underlying problem seems to be:

The annotation "volume.kubernetes.io/selected-node" does not necessarily exist.
We use Trident as the CSI storage provider and suspect that is why the annotation is missing for us.
In any case, it may not be the best option to rely on this annotation.
