
kamaji-etcd's People

Contributors

bsctl, lixd, maruina, prometherion, ptx96, skalanetworks, sn4psh0t


kamaji-etcd's Issues

cfssl issue with arm64

Hi there!

I wanted to test clastix/kamaji but unfortunately I only have access to arm64 machines.
During the deployment I noticed that the script below won't run due to architecture incompatibilities.
https://github.com/clastix/kamaji-etcd/blob/master/scripts/certs-renew.sh#L28

It turns out that cfssl version 1.6.1 is not available for arm64 platforms. I don't know whether bumping the version would break things, or whether the pinned version is intentional. If not, I would create a PR to bump the version :)
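A portable download could also detect the machine architecture instead of hard-coding one. A minimal sketch, assuming upstream cfssl releases follow a `cfssl_<version>_linux_<arch>` naming scheme and that a version with arm64 builds exists (the version number below is illustrative; verify against the actual release page):

```shell
# Sketch: derive the cfssl release artifact name from the local architecture.
CFSSL_VERSION="1.6.4"   # hypothetical version that ships arm64 builds

case "$(uname -m)" in
  x86_64)        CFSSL_ARCH=amd64 ;;
  aarch64|arm64) CFSSL_ARCH=arm64 ;;
  *) echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac

# The script could then download this artifact instead of a fixed amd64 one.
echo "cfssl_${CFSSL_VERSION}_linux_${CFSSL_ARCH}"
```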

Kind regards
Tim

Scheduled job for etcd snapshot

Snapshotting the etcd cluster on a regular basis serves as a durable backup for an etcd keyspace. By taking periodic snapshots of an etcd member’s backend database, an etcd cluster can be recovered to a point in time with a known good state.

A CronJob with regular snapshot of etcd data is required. The job should push the snapshot on a remote S3 location.

For each etcd node in the cluster, the etcd cluster health is checked first. If the node reports the cluster as healthy, a snapshot is created from it and optionally uploaded to S3. Since every etcd node uploads its snapshot, the copy on S3 is always the one from the last node to upload. Because each snapshot is taken only after a successful health check, it can be considered a valid snapshot of the data in the etcd cluster.

For example:

Option                                    Action                                        Default
etcd Snapshot Backup S3 Target            Select the S3 bucket where snapshots are saved   ---
Recurring etcd Snapshot Enabled           Enable/disable recurring snapshots            False
Recurring etcd Snapshot Creation Period   Cron-like schedule for recurring snapshots    0 23 * * *
Recurring etcd Snapshot Retention Count   Number of snapshots to retain                 10
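The requirement above could be sketched as a CronJob along these lines (the names, image, and upload step are illustrative assumptions, not the chart's actual manifest):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
spec:
  schedule: "0 23 * * *"           # default schedule from the table above
  successfulJobsHistoryLimit: 10   # mirrors the retention count
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: quay.io/coreos/etcd:v3.5.6
              command:
                - /bin/sh
                - -c
                - |
                  # Check health first, then snapshot; endpoint and TLS
                  # flags are omitted here for brevity.
                  etcdctl endpoint health && \
                  etcdctl snapshot save /tmp/snapshot.db
                  # Upload /tmp/snapshot.db to the configured S3 bucket here.
```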

Unable to perform helm upgrade

Issue

The DataStore upgrade is not working due to an error with Helm hooks.

How to reproduce

$: helm upgrade --install etcd-nvme clastix/kamaji-etcd --namespace kamaji-system --create-namespace --set "datastore.enabled=true"
Error: UPGRADE FAILED: post-upgrade hooks failed: Internal error occurred: failed calling webhook "vdatastore.kb.io": failed to call webhook: the server could not find the requested resource

The reported error is due to Kamaji being unable to handle DELETE actions properly, which hides the underlying root cause: the kamaji-etcd chart should not delete the DataStore resource at all.

Version

NAME            NAMESPACE       REVISION        UPDATED                                         STATUS          CHART                   APP VERSION
etcd-nvme       kamaji-system   9               2023-08-19 20:17:11.658751094 +0200 CEST        failed          kamaji-etcd-0.3.0       3.5.6

BUG helm chart will install but not uninstall or update

Hi,

we created an additional datastore of type etcd using your helm chart clastix/kamaji-etcd. It successfully installs, but as soon as you try to update or delete the datastore / the helm chart, you get the following error:

helm upgrade kamaji-etcd-2 clastix/kamaji-etcd -n kamaji-etcd-2 -f values.yaml
Error: UPGRADE FAILED: post-upgrade hooks failed: admission webhook "vdatastore.kb.io" denied the request: unable to decode into *v1alpha1.DataStore: there is no content to decode

Actually, it seems (most?) resources do get updated. However, you also get the error if you try to kubectl delete datastores.kamaji.clastix.io kamaji-etcd-2, resulting in a half-installed datastore: it is not entirely removed.

We use kamaji v0.3.3 together with cluster-api v1.5.0 with infrastructure-vsphere v1.7.0 and your kamaji plugin control-plane-kamaji v0.3.0.

Provide a procedure to renew certificates

Currently, the self generated CA certificate has 5 years of lifetime and other certificates have 1 year. Provide a procedure/script to renew certificates on a live setup without service disruption.
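Before renewing, it helps to check how much lifetime a certificate has left. A minimal sketch: here we generate a throwaway self-signed CA (5-year lifetime, matching the default above) purely to demonstrate the check; on a live setup you would extract the certificate from the etcd certs Secret instead.

```shell
# Generate a demo CA cert valid for 5 years (1825 days).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo-ca.key -out /tmp/demo-ca.crt \
  -days 1825 -subj "/CN=demo-etcd-ca" 2>/dev/null

# Print the notAfter date; renewal should happen well before this point.
openssl x509 -in /tmp/demo-ca.crt -noout -enddate
```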

Setup jobs are missing tolerations

In a strict environment where workload scheduling is controlled by nodeSelector and taints/tolerations, I cannot install the chart if there are no taint-free node pools.
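The fix would be for the setup Job templates to honour the same scheduling values as the StatefulSet. A hypothetical values.yaml override (the toleration key and value are examples, not chart defaults):

```yaml
nodeSelector:
  kubernetes.io/os: linux
tolerations:
  - key: "dedicated"        # illustrative taint key
    operator: "Equal"
    value: "etcd"
    effect: "NoSchedule"
```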

maybe the storageClass value name is wrong

The template uses storageClassName:

  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        storageClassName: {{ .Values.persistenVolumeClaim.storageClassName }}
        accessModes:
        {{- range .Values.persistenVolumeClaim.accessModes }}
        - {{ . | quote }}
        {{- end }}
        resources:
          requests:
            storage: {{ .Values.persistenVolumeClaim.size }}

but values.yaml names the key storageClass:

persistenVolumeClaim:
  # -- The size of persistent storage for etcd data 
  size: 10Gi
  # -- A specific storage class
  storageClass: ""
  # -- The Access Mode to storage
  accessModes:
  - ReadWriteOnce

Specifying the storage class with helm install xxx --set persistenVolumeClaim.storageClass=xxx does not work; after renaming the key in values.yaml to storageClassName, it works.

Or is there a special usage I'm missing? I am new to Helm.
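One possible fix (an assumption, not the maintainers' chosen resolution) is to keep the documented values.yaml key storageClass and reference it consistently from the template:

```yaml
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      # Reference the key actually documented in values.yaml
      storageClassName: {{ .Values.persistenVolumeClaim.storageClass }}
```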

helm: Allow for iRSA

The helm chart currently only allows for hard-coded environment variable based AWS IAM access key and secret key input. It does not allow for not-specifying these and letting iRSA handle it via the kubernetes service account.
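With IRSA, the pod obtains AWS credentials through its service account, so the chart would only need to support a service-account annotation and allow the access/secret key environment variables to be omitted. A sketch (the role ARN is illustrative):

```yaml
serviceAccount:
  create: true
  annotations:
    # Standard EKS annotation that binds the SA to an IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kamaji-etcd-backup
```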

Publish `kamaji-etcd` chart on the public Clastix repo

Publish kamaji-etcd chart on the public Clastix repo so the end user can install it:

helm repo add clastix https://clastix.github.io/charts
helm install kamaji-etcd clastix/kamaji-etcd -n etcd-system --create-namespace

Manage volumeClaimTemplates annotations

We should adapt helm charts to easily add annotations to sts volumeClaimtemplates:

this could be useful, for example, to create local volumes with local-path-provisioner or to backup volumes with velero
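A hypothetical values.yaml shape for this (key name assumed; the annotation is only an example), with the chart propagating the map onto the volumeClaimTemplates metadata:

```yaml
persistenVolumeClaim:
  customAnnotations:
    example.com/backup: "enabled"   # illustrative annotation
```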

Move to an Operator

Currently, kamaji-etcd is deployed as a Helm Chart while some maintenance tasks like snapshotting, backup, restore, certificates rotation, are delegated to scripting. As evolution of this project, we should move to a Kubernetes Operator, in order to encode operational knowledge into a software machine driven by Kubernetes.

Create a Kamaji Datastore

It would be nice to have a datastore.kamaji.clastix.io resource created when the chart is deployed.
This can be enabled in the values.yaml chart as the creation of the datastore fails if the Kamaji Controller is not already installed.

Support for externally referenced S3 credentials and quoted plain values

Hi,

we created an additional datastore of type etcd using your helm chart clastix/kamaji-etcd.
We successfully set it up with an S3 backup. However, in our values.yaml we had to define our password like this (not the real password):

  s3:
    #...
    bucket: kamaji/kamaji-etcd-2/
    accessKey: kamaji
    secretKey: "'xY%foo$ba)R'"

This is necessary so that the single quotes end up in the mc command; otherwise the shell would try to interpret the password. Maybe the password should be quoted by default in the container's command.
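The quoting problem can be reproduced in plain shell. Expanding the secret through a double-quoted variable (or an environment variable) avoids interpretation without embedding literal single quotes in values.yaml. The secret below is illustrative:

```shell
# Illustrative secret containing shell metacharacters ($, %, parenthesis).
SECRET_KEY='xY%foo$ba)R'

# Unsafe pattern: splicing the raw value into a command string makes the
# shell expand $ba and choke on the unmatched parenthesis, e.g.:
#   sh -c "mc alias set s3 https://s3.example.com kamaji $SECRET_KEY"

# Safe pattern: expand the variable inside double quotes so the value is
# passed as a single literal argument.
printf '%s\n' "$SECRET_KEY"
```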

We use kamaji v0.3.3 together with cluster-api v1.5.0 with infrastructure-vsphere v1.7.0 and your kamaji plugin control-plane-kamaji v0.3.0.

Error installing with Kubernetes v1.25

Installing the chart leaves a few pods in ErrImagePull because of:

Warning  Failed     23m (x4 over 25m)     kubelet            Failed to pull image "clastix/kubectl:v1.25": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/clastix/kubectl:v1.25": failed to resolve reference "docker.io/clastix/kubectl:v1.25": docker.io/clastix/kubectl:v1.25: not found

It seems you haven't published v1.25 at https://hub.docker.com/r/clastix/kubectl/tags

The datastore resource is not created

The datastore resource is not created because the required secret is missing. The creation of the datastore needs to be deferred until after the secret is created.
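One way to order this within Helm is with hook weights: resources in the same hook run in ascending weight order, so the DataStore can be given a higher weight than whatever Job produces the secret. A sketch (the weight values are assumptions, not the chart's actual annotations):

```yaml
# On the DataStore manifest: run late in the post-install/upgrade phase.
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-weight: "10"   # the certs Job would use a lower weight
```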

service account missing

Hi @prometherion ,

I think there is a new bug here:

I'm using the following values.yaml (which is basically a modified version of the original one):

#cat kamaji-etcd_values.yaml| grep -Ev '^\s*(#|$)'
replicas: 5
serviceAccount:
  create: true
  name: ""
image:
  repository: quay.io/coreos/etcd
  tag: ""
  pullPolicy: IfNotPresent
peerApiPort: 2380
clientPort: 2379
metricsPort: 2381
livenessProbe: {}
extraArgs: []
autoCompactionMode: periodic
autoCompactionRetention: 5m
snapshotCount: "10000"
quotaBackendBytes: "8589934592" # 8Gi
persistenVolumeClaim:
  size: 2Gi
  storageClassName: "block-small-sc"
  accessModes:
  - ReadWriteOnce
  customAnnotations: {}
defragmentation:
  schedule: "*/15 * * * *" # https://crontab.guru/
backup:
  enabled: true
  all: false
  schedule: "*/15 * * * *" # https://crontab.guru/
  snapshotNamePrefix: kpoc-tenant1
  snapshotDateFormat: $(date +%Y%m%d-%H%M)
  s3:
    url: https:/s3.our.domain
    bucket: kamaji/kpoc-tenant1-etcd/
    retention: "--expiry-days 3"
    accessKey:
      valueFrom:
        secretKeyRef:
          key: access_key
          name: minio-key
    secretKey:
      valueFrom:
        secretKeyRef:
          key: secret_key
          name: minio-key
    image:
      repository: minio/mc
      tag: "RELEASE.2022-11-07T23-47-39Z"
      pullPolicy: IfNotPresent
podLabels:
  application: kpoc-tenant1-etcd
podAnnotations: {}
securityContext:
  allowPrivilegeEscalation: false
priorityClassName: system-cluster-critical
resources:
  limits: {}
  requests: {}
nodeSelector:
  kubernetes.io/os: linux
tolerations: []
affinity: {}
topologySpreadConstraints: []
datastore:
  enabled: true
serviceMonitor:
  enabled: false
  namespace: ''
  labels: {}
  annotations: {}
  matchLabels: {}
  targetLabels: []
  serviceAccount:
    name: etcd
    namespace: etcd-system
  endpoint:
    interval: "15s"
    scrapeTimeout: ""
    metricRelabelings: []
    relabelings: []
alerts:
  enabled: false
  namespace: ''
  labels: {}
  annotations: {}
  rules: []

If I apply this with your fork (from back when you implemented #43; branch "issues/43", commit "a521b27d830f4bf16908f03cab146039ab4a1401" is what I still have checked out), then everything works fine.

helm install kpoc-tenant1-etcd ~/tmp/kamaji/kamaji-etcd-prometherion/charts/kamaji-etcd -n kpoc-tenant1 -f kamaji-etcd_values.yaml
kg job
#NAME                                COMPLETIONS   DURATION   AGE
#kpoc-tenant1-etcd-backup-28253640   1/1           6m22s      21m
#kpoc-tenant1-etcd-backup-28253655   1/1           9s         6m22s
#kpoc-tenant1-etcd-defrag-28253640   1/1           80s        21m
#kpoc-tenant1-etcd-defrag-28253655   1/1           81s        6m22s
#kpoc-tenant1-etcd-etcd-setup-1      1/1           94s        33m
kg sa
#NAME                SECRETS   AGE
#default             0         33m
#kpoc-tenant1-etcd   0         33m

However, if I apply the very same config with the latest upstream branch, then it doesn't work because the relevant serviceaccount is not created:

helm install kpoc-tenant2-etcd clastix/kamaji-etcd -n kpoc-tenant2 -f kamaji-etcd_values.yaml
kg job
#NAME                                COMPLETIONS   DURATION   AGE
#kpoc-tenant2-etcd-backup-28253655   0/1           8m27s      8m27s
#kpoc-tenant2-etcd-defrag-28253655   0/1           8m27s      8m27s
kg sa
#NAME      SECRETS   AGE
#default   0         16m
k describe job kpoc-tenant2-etcd-backup-28253655
#...
#  Warning  FailedCreate  5m6s (x6 over 9m56s)  job-controller  Error creating: pods "kpoc-tenant2-etcd-backup-28253655-" is forbidden: error looking up service account kpoc-tenant2/kpoc-tenant2-etcd: serviceaccount "kpoc-tenant2-etcd" not found

And actually, while ensuring I could reproduce the issue, I found that the service account is in fact created and then seems to be deleted after the install has finished:

kg sa
#NAME                SECRETS   AGE
#default             0         113s
#kpoc-tenant2-etcd   0         75s
kg sa
#NAME      SECRETS   AGE
#default   0         2m6s

BUG issue 41 not fixed

see #41

Hi @prometherion ,

It seems #41 was not a duplicate of #38; we still have the same issue with v0.4.0.
Looking into #38, it actually shows a different error message.
Again, our error is:

Error: UPGRADE FAILED: post-upgrade hooks failed: admission webhook "vdatastore.kb.io" denied the request: unable to decode into *v1alpha1.DataStore: there is no content to decode

->

unable to decode into *v1alpha1.DataStore: there is no content to decode

This sounds like there may be an issue with the CRD or something?

This is one of our datastores in question:

apiVersion: kamaji.clastix.io/v1alpha1
kind: DataStore
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade,post-rollback
    helm.sh/hook-weight: "5"
  creationTimestamp: "2023-08-28T12:00:22Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: kamaji-etcd-2
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kamaji-etcd
    app.kubernetes.io/version: 3.5.6
    helm.sh/chart: kamaji-etcd-0.4.0
  name: kamaji-etcd-2
  resourceVersion: "5406229"
  uid: e37abb97-7840-4157-b8f2-61164f540cb9
spec:
  driver: etcd
  endpoints:
  - kamaji-etcd-2-0.kamaji-etcd-2.kamaji-etcd-2.svc.cluster.local:2379
  - kamaji-etcd-2-1.kamaji-etcd-2.kamaji-etcd-2.svc.cluster.local:2379
  - kamaji-etcd-2-2.kamaji-etcd-2.kamaji-etcd-2.svc.cluster.local:2379
  tlsConfig:
    certificateAuthority:
      certificate:
        secretReference:
          keyPath: ca.crt
          name: kamaji-etcd-2-certs
          namespace: kamaji-etcd-2
      privateKey:
        secretReference:
          keyPath: ca.key
          name: kamaji-etcd-2-certs
          namespace: kamaji-etcd-2
    clientCertificate:
      certificate:
        secretReference:
          keyPath: tls.crt
          name: kamaji-etcd-2-root-client-certs
          namespace: kamaji-etcd-2
      privateKey:
        secretReference:
          keyPath: tls.key
          name: kamaji-etcd-2-root-client-certs
          namespace: kamaji-etcd-2
status:
  usedBy:
  - bcp-cluster-test1/bcp-cluster-test1

The same error can actually be triggered when trying to delete the datastore:

k delete datastores.kamaji.clastix.io kamaji-etcd-2
Error from server: admission webhook "vdatastore.kb.io" denied the request: unable to decode into *v1alpha1.DataStore: there is no content to decode

(nothing actually gets deleted)

Memory segmentation violation in Kamaji Controller

Description

Observed a segmentation violation in Kamaji Controller when using etcd as datastore

1.6693292987985492e+09  INFO    controller.tenantcontrolplane   marked for deletion, performing clean-up        {"reconciler group": "kamaji.clastix.io", "reconciler kind": "TenantControlPlane", "name": "tenant-01", "namespace": "default"}
{"level":"warn","ts":"2022-11-24T22:34:58.917Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0018a8700/etcd-0.etcd.kamaji-system.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: revision of auth store is old"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x16fb550]

It looks related to etcd versions earlier than 3.5.6, according to this issue.

We need to upgrade the etcd version in the Helm chart.
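Until the chart default changes, the image can be pinned from values.yaml (the keys below match the values file shown elsewhere in this page; the tag is the version the linked issue reports as fixed):

```yaml
image:
  repository: quay.io/coreos/etcd
  tag: "v3.5.6"   # >= 3.5.6 avoids the auth-store revision panic
  pullPolicy: IfNotPresent
```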

BUG restore.sh script relies on optional annotation

Hi,

we tested how well an etcd backup restore currently works.
We use kamaji v0.3.3 together with cluster-api v1.5.0 with infrastructure-vsphere v1.7.0 and your kamaji plugin control-plane-kamaji v0.3.0.

We created an additional datastore of type etcd using your helm chart clastix/kamaji-etcd.
The backup job to s3 succeeds and we were able to restore the etcd using your restore.sh script, but we had to make some changes to it:

diff --git a/scripts/restore.sh b/scripts/restore.sh
index dddec4f..dc8b1f8 100755
--- a/scripts/restore.sh
+++ b/scripts/restore.sh
@@ -19,6 +19,7 @@ set -eu -o pipefail
 KAMAJI_TCP_SNAP_URL=$1
 
 # Service variables
+KAMAJI_PODS_JSON_BEFORE=$(kubectl get pods -n $ETCD_NAMESPACE -l app.kubernetes.io/instance=$ETCD_NAME -o json)
 KAMAJI_PODS_JSON="kubectl get pods -n $ETCD_NAMESPACE -l app.kubernetes.io/instance=$ETCD_NAME -o json"
 KAMAJI_TCP_SNAP="snapshot.db"
 TMP_FOLDER="/tmp"
@@ -105,8 +106,8 @@ etcdFolderSwitch() {
     PVC_VOLUMENAME=$($KAMAJI_PVC_JSON |\
       jq -j '.spec.volumeName')
   
-    PVC_NODE=$($KAMAJI_PVC_JSON |\
-      jq -j '.metadata.annotations["volume.kubernetes.io/selected-node"]')
+    PVC_NODE=$(echo $KAMAJI_PODS_JSON_BEFORE |\
+      jq -r --arg pvc_name $DATA '.items[] | select(.spec.volumes[].persistentVolumeClaim.claimName == $pvc_name) | .spec.nodeName')
  
   cat <<EOF | kubectl apply -f -
   apiVersion: batch/v1

I'm aware this may not be the best way to fix it, but the underlying problem seems to be:

The annotation "volume.kubernetes.io/selected-node" does not necessarily exist.
We use Trident as the CSI storage provider and suspect that is why the annotation is missing for us.
In any case, it may not be the best option to rely on this annotation.
