Comments (9)

Kidswiss commented on May 27, 2024

Hi @dns2utf8

Interesting, I've never observed the operator deleting the newest jobs first.

By default k8up keeps 6 finished jobs around. If you delete the pod it won't delete the job because the pod is a child of the job and deletion propagation is from top to bottom. So deleting the job will delete the pod, too.

Each job is in turn a child of a backup object (kubectl get backups). Also worth noting: if a backup run fails, it will recreate the pods, too (I think 6 times by default). So if the backup was failing and retries happened, a newer pod may actually belong to an older backup job, giving the impression that the newer jobs were deleted first.

The relationship between the objects is: schedule -> backup -> job -> pod. Deleting a parent will delete its children.

Can you please try again and observe the backup objects in your namespace, for example with watch kubectl get backups, without manually deleting any finished jobs?
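
For example, to follow that chain by hand (just a sketch; the namespace, pod and job names below are placeholders, substitute your own):

$ watch kubectl -n <your-namespace> get backups
# print the owner of a backup pod (should be the Job)
$ kubectl -n <your-namespace> get pod <backup-pod> -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}{"\n"}'
# print the owner of that job (should be the Backup object)
$ kubectl -n <your-namespace> get job <backup-job> -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}{"\n"}'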

Regards
Simon

dns2utf8 commented on May 27, 2024

Hi Simon

Something is very strange. I don't get any backups:

$ kubectl get backups
No resources found.

I cleaned up the remaining jobs by hand an hour ago:

$ kubectl get jobs --all-namespaces | rg backupjob
devops              backupjob-1579018440                                   1/1           46s        11m
myapp-staging       backupjob-1579017060                                   1/1           13s        34m

The pods still don't survive long enough:

$ kubectl get pods --all-namespaces | rg backupjob
$

The operator has the following logs during the same time:

2020/01/14 16:14:00 [INFO] scheduled-backup-schedule-devops-1579018440 for repo s3:http://s3.bucket.internal:9000/k8up is queued waiting for jobs [Prune Check] to finish
2020/01/14 16:14:00 [INFO] All blocking jobs on s3:http://s3.bucket.internal:9000/k8up for scheduled-backup-schedule-devops-1579018440 are now finished
2020/01/14 16:14:00 [INFO] New backup job received scheduled-backup-schedule-devops-1579018440 in namespace devops
2020/01/14 16:14:00 [INFO] Listing all PVCs with annotation appuio.ch/backup in namespace devops
2020/01/14 16:14:00 [INFO] PVC devops/gitlab-minio doesn't have annotation, adding to list...
2020/01/14 16:14:00 [INFO] PVC devops/gitlab-minio-old isn't RWX
2020/01/14 16:14:00 [INFO] PVC devops/gitlab-postgresql annotation is false. Skipping
2020/01/14 16:14:00 [INFO] PVC devops/gitlab-postgresql-old isn't RWX
2020/01/14 16:14:00 [INFO] PVC devops/gitlab-prometheus-old isn't RWX
2020/01/14 16:14:00 [INFO] PVC devops/gitlab-prometheus-server doesn't have annotation, adding to list...
2020/01/14 16:14:00 [INFO] PVC devops/gitlab-redis doesn't have annotation, adding to list...
2020/01/14 16:14:00 [INFO] PVC devops/gitlab-redis-old isn't RWX
2020/01/14 16:14:00 [INFO] PVC devops/repo-data-gitlab-gitaly-0 doesn't have annotation, adding to list...
2020/01/14 16:14:00 [INFO] PVC devops/repo-data-gitlab-gitaly-0-old isn't RWX
2020/01/14 16:14:00 [INFO] devops/backupjob-1579018440 is running
2020/01/14 16:14:00 [INFO] devops/backupjob-1579018440 is running
2020/01/14 16:14:05 [INFO] devops/backupjob-1579018440 is running
2020/01/14 16:14:35 [INFO] devops/backupjob-1579018440 is running
2020/01/14 16:14:46 [INFO] devops/backupjob-1579018440 finished successfully
2020/01/14 16:14:46 [INFO] Cleaning up 11/21 jobs
2020/01/14 16:14:46 [INFO] Removing job scheduled-backup-schedule-devops-1578586440 limit reached
2020/01/14 16:14:46 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578586440
2020/01/14 16:14:46 [INFO] Removing job scheduled-backup-schedule-devops-1578615240 limit reached
2020/01/14 16:14:46 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578615240
2020/01/14 16:14:46 [INFO] Removing job scheduled-backup-schedule-devops-1578644040 limit reached
2020/01/14 16:14:46 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578644040
2020/01/14 16:14:46 [INFO] Removing job scheduled-backup-schedule-devops-1578658440 limit reached
2020/01/14 16:14:46 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578658440
2020/01/14 16:14:46 [INFO] Removing job scheduled-backup-schedule-devops-1578672840 limit reached
2020/01/14 16:14:46 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578672840
2020/01/14 16:14:47 [INFO] Removing job scheduled-backup-schedule-devops-1578701640 limit reached
2020/01/14 16:14:47 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578701640
2020/01/14 16:14:47 [INFO] Removing job scheduled-backup-schedule-devops-1578730440 limit reached
2020/01/14 16:14:47 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578730440
2020/01/14 16:14:47 [INFO] Removing job scheduled-backup-schedule-devops-1578744840 limit reached
2020/01/14 16:14:47 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578744840
2020/01/14 16:14:47 [INFO] Removing job scheduled-backup-schedule-devops-1578759240 limit reached
2020/01/14 16:14:47 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578759240
2020/01/14 16:14:47 [INFO] Removing job scheduled-backup-schedule-devops-1578788040 limit reached
2020/01/14 16:14:47 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578788040
2020/01/14 16:14:47 [INFO] Removing job scheduled-backup-schedule-devops-1578932040 limit reached
2020/01/14 16:14:47 [INFO] Cleanup backup scheduled-backup-schedule-devops-1578932040

Hope this helps.

Best regards,
Stefan

Kidswiss commented on May 27, 2024

Hi

Thanks for the details!

> Something is very strange. I don't get any backups:
Backups are namespaced, so you'll need --all-namespaces if you're in the wrong one.

Is the operator currently processing any schedules? If not, it may be stuck; try restarting it. I think you're using an older release (the helm charts haven't been updated to the newest releases yet) and are hitting some deadlocking in the operator. Those issues should be fixed in newer releases. It's still in somewhat early development ;)
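
Something along these lines should do it (a rough sketch; the namespace and deployment name depend on how you installed the operator, e.g. via the helm chart, so adjust them accordingly):

$ kubectl -n k8up-operator get deployments
$ kubectl -n k8up-operator rollout restart deployment <k8up-operator-deployment>
# or simply delete the operator pod and let the deployment recreate it
$ kubectl -n k8up-operator delete pod <k8up-operator-pod>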

Your log snippet shows the operator deleting the backups in ascending timestamp order, i.e. starting with the oldest one (lowest timestamp). According to the logs it's doing it the right way around.
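
You can decode the unix timestamps in those job names to double-check the ordering, e.g. with GNU date (on macOS/BSD use date -r instead of -d), which should print something like:

$ date -u -d @1578586440
Thu Jan  9 16:14:00 UTC 2020
$ date -u -d @1578932040
Mon Jan 13 16:14:00 UTC 2020

So the oldest removed job is from Jan 9, the newest removed one is from Jan 13, and the job that just finished (1579018440) is from Jan 14.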

dns2utf8 commented on May 27, 2024

Good morning

> Backups are namespaced, so you'll need --all-namespaces if you're in the wrong one.

My bad, I ran it again just now. Sadly I don't get any k8up backups, only velero ones:

$ kubectl get backups --all-namespaces
NAMESPACE   NAME                                      AGE
velero      all-namespaces-stateless-20191218123559   27d
...
velero      all-namespaces-stateless-20200115003819   7h40m
velero      all-namespaces-stateless-20200115063819   100m
velero      bi-hourly-stateless-20200113081718        2d
velero      bi-hourly-stateless-20200113101718        46h
velero      bi-hourly-stateless-20200113121718        44h
...
velero      bi-hourly-stateless-20200115081719        118s

Right now k8up tries to run archive jobs that fail with Error occurred: Bucket name contains invalid characters.
Strangely enough, the PVC (snapshot-test-restore-test-mfw) it tries to archive is from a different namespace ...

I am killing the operator pod now to see if it recovers.

Cheers,
Stefan

Kidswiss commented on May 27, 2024

Hi

Can you show me the schedules you use for the backups/archives?

The archive job is per backup repository, not per namespace. The archive simply takes everything it finds in the repository and dumps the latest snapshot to the other S3 bucket, so that's expected behaviour. I recommend backing up no more than a few namespaces to the same S3 bucket; restic has some limitations with mutually exclusive operations on the same bucket.
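
If you want to see for yourself what ends up in that shared repository, you can point the restic CLI at it directly (a sketch; the endpoint is the one from your operator logs, and the credentials and repo password are the same ones your Schedule's backend references):

$ export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... RESTIC_PASSWORD=...
$ restic -r s3:http://s3.bucket.internal:9000/k8up snapshots

You should see snapshots from every namespace that writes to that bucket, which is also why an archive run can pick up PVCs from other namespaces.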

Regards
Simon

dns2utf8 commented on May 27, 2024

Hi

This config is currently deployed in two namespaces:

---
apiVersion: backup.appuio.ch/v1alpha1
kind: Schedule
metadata:
  name: schedule-tubee-staging
  namespace: myapp-staging
  #namespace: devops
spec:
  backend:
    s3:
      endpoint: http://s3.bucket.internal:9000
      bucket: k8up
      accessKeyIDSecretRef:
        name: backup-credentials
        key: username
      secretAccessKeySecretRef:
        name: backup-credentials
        key: password
    repoPasswordSecretRef:
      name: backup-repo
      key: password
  archive:
    schedule: '14 4 * * *'
    restoreMethod:
      s3:
        endpoint: http://s3.bucket.internal:9000/
        bucket: k8up
        accessKeyIDSecretRef:
          name: backup-credentials
          key: username
        secretAccessKeySecretRef:
          name: backup-credentials
          key: password
  backup:
    schedule: '5 9,10,11,12,13,14,15,16,17,18 * * *'
    keepJobs: 48
    #promURL: https://prometheus-io-instance
  check:
    schedule: '20 5 * * 1-5'
    #promURL: https://prometheus-io-instance
  prune:
    schedule: '14 5 * * 0'
    retention:
      keepLast: 25
      keepDaily: 14

I am currently using the same bucket for everything and separating the jobs by time.

Do you think one archive job in, e.g., the k8up-operator namespace would be the preferred setup?

Cheers,
Stefan

Kidswiss commented on May 27, 2024

Hi @dns2utf8

I see that you use the same bucket for the backups and the archives. The archive function is intended for long-term archival data. It works by doing actual restores into tar.gz files and uploading them to another bucket (for example one with Glacier), so it will use quite a lot of space and isn't really compatible with the restic repository format. Do you really need that to run on a daily basis? What do you want to accomplish?

Yes, it would make sense to define the archive in only one namespace so that only one runs. Otherwise it could potentially block the prune job that runs an hour later, because the archive can take a lot of time and is mutually exclusive with the prune job.

Best Regards
Simon

dns2utf8 commented on May 27, 2024

Hi Simon

The initial idea was to have a daily snapshot of all the important data for LTS and disaster recovery.
The hourly restic backups during the day are there in case somebody deletes something by accident.

For now I have solved the problem by deleting all the prune and archive jobs.
The new strategy will be discussed tomorrow.

Best,
Stefan

Kidswiss commented on May 27, 2024

Hi Stefan

Well, in that case I'd solve it with appropriate retention rules for the prune instead of daily archives. I'd recommend monthly archives if that kind of data protection is needed for your use case.
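
As a rough sketch (the cron expressions and retention values below are only illustrative, adjust them to your needs), the relevant parts of the Schedule could then look like this:

  archive:
    # run once a month instead of daily
    schedule: '14 4 1 * *'
  prune:
    schedule: '14 5 * * 0'
    retention:
      keepLast: 25
      keepDaily: 14
      keepWeekly: 8
      keepMonthly: 6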

If all other issues have been fixed we can close the ticket.

If you need advice on setting up schedules that avoid restic's locking problems, please open a new ticket. We have customers that use k8up for very large deployments with archival schedules, so we have already worked out some of the quirks that come with that :)

Regards
Simon
