I found an odd issue: it seems that if there's a Failed run of a CronJob, ...

Failed CronJob runs get re-raised until cleaned up (and don't have message)

wichert commented on September 27, 2024

from k8s-sentry.

Comments (10)

williscool commented on September 27, 2024

Having this issue also with a regular Job... it completely blew out my dev Sentry quota o.o

wichert commented on September 27, 2024

That definitely sounds like a real bug. I'll try to reproduce this.

wichert commented on September 27, 2024

@daaain Isn't the behaviour correct for you? I would expect a Sentry event every time the created Pod failed. For example with this Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: failure
spec:
  template:
    spec:
      containers:
      - name: failure
        image: busybox
        command: ["sh",  "-c", "echo Failing now ; /bin/false"]
      restartPolicy: Never
  backoffLimit: 4

Kubernetes will try to create and run a pod four times, resulting in four failure events. It sounds like your CronJob is set up to do 8 attempts per hour?

wichert commented on September 27, 2024

@williscool Can you show me what your Job resource looks like? In my test k8s-sentry does not report more errors than the number of times Kubernetes tries to run the job, so I'm wondering how you get 1500 events for a single Job.

daaain commented on September 27, 2024

@wichert it's a single run of the CronJob that was repeated in Sentry; the runs afterwards were successful.

It's running in a cluster with 1.16.13-gke.401

This is the template from GKE with some irrelevant bits removed:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etl-sync-worker-bike-public-module-state
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      labels:
        app: etl-sync-worker-bike-public-module-state
    spec:
      backoffLimit: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command: ...
            image: ...
            name: etl-sync-worker-bike-public-module-state
          restartPolicy: Never
          terminationGracePeriodSeconds: 30
  schedule: 0/2 * * * *
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 3

daaain commented on September 27, 2024

Oh actually, I just realised this might be relevant: I have a postStart lifecycle command, which might be what failed the run. That would explain why there wasn't any output.

williscool commented on September 27, 2024

So my job is written in Pulumi TypeScript for k8s, but it's pretty easy to follow if you know what a normal YAML file looks like

the main event is

yarn install --production=false --no-progress && yarn jest

So far I think this is what the issue was: I had a test that was failing in a way that kept a database connection open, which hung the jest process, so it kept returning exit 1. Somehow, every couple of minutes k8s-sentry was observing that failure and sending it to Sentry.

// Imports assumed from the identifiers used below.
import * as k8s from "@pulumi/kubernetes";
import * as pulumi from "@pulumi/pulumi";

// projectName, watcherImage, notificationServiceSentryDsn, dburl and cluster
// are defined elsewhere in the program (not shown).
const testJob = new k8s.batch.v1.Job(`${projectName}-test-job`, {
    spec: {
        backoffLimit: 0, // only run once
        template: {
            metadata: {
                generateName: `${projectName}-test-job-`,
            },
            spec: {
                containers: [
                    {
                        name: `${projectName}-test`,
                        image: watcherImage,
                        command: ["/bin/sh"],
                        args: ["-c", "yarn install --production=false --no-progress && yarn jest"],
                        env: [
                            { name: "DD_ENV", value: pulumi.getStack() },
                            { name: "SENTRY_DSN", value: notificationServiceSentryDsn },
                            { name: "PG_DATABASE_URL", value: dburl },
                            // ... more env vars and such ...
                        ],
                    },
                ],
                restartPolicy: "Never",
            },
        },
    },
}, { provider: cluster.provider });

I had to set a filter, and by the looks of things it's still an issue even though I have this test passing now.

daaain commented on September 27, 2024

Some new info: I had a CronJob fail a few times overnight by running out of memory and getting OOMKilled with exit code 137. That didn't trigger the repetition, but it also didn't have any error message after the Pod name, so at least one part of the problem seems to be easier to reproduce.
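
As a side note on the exit code mentioned above: by the usual shell convention, an exit code above 128 means the process was terminated by a signal, with the signal number being the code minus 128. The helper below is a hypothetical illustration of that arithmetic and is not part of k8s-sentry.

```typescript
// Hypothetical helper (not from k8s-sentry): decode a container exit code
// using the common convention "128 + signal number" for signal deaths.
function signalFromExitCode(exitCode: number): number | null {
  // Codes 129..255 conventionally encode a terminating signal.
  return exitCode > 128 && exitCode < 256 ? exitCode - 128 : null;
}

const oomKilledExit = 137;
console.log(signalFromExitCode(oomKilledExit)); // 9, which is SIGKILL on Linux
console.log(signalFromExitCode(1)); // null, an ordinary non-zero exit
```

So 137 is consistent with the container being killed with SIGKILL, which is what happens on an OOM kill.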

Tenzer commented on September 27, 2024

We've been hit by this bug a few times as well, most recently over the last ~22 hours.

We have a cron job that runs every 5 minutes in Kubernetes, and it had 1 failed invocation yesterday which was retried 3 times. We only noticed today that the 4 failed pods resulting from this had caused 10k errors to be reported to Sentry. Deleting the failed pods stopped the errors from being reported.

The number of errors reported approximately matches 4 pods each getting an error reported every 30 seconds for 22 hours.
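
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and the 30-second interval is the reporting rate observed in this comment, not a documented k8s-sentry setting.

```typescript
// Hypothetical helper: estimate how many events are sent if every failed pod
// is re-reported once per interval for the given number of hours.
function estimateReports(failedPods: number, intervalSeconds: number, hours: number): number {
  const reportsPerPod = Math.floor((hours * 3600) / intervalSeconds);
  return failedPods * reportsPerPod;
}

// 4 failed pods, one report every 30 seconds, for 22 hours.
console.log(estimateReports(4, 30, 22)); // 10560
```

That gives 2640 reports per pod and 10560 in total, which lines up with the roughly 10k errors seen in Sentry.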

Looking through the source code, it stands out to me that the isNewTermination check is only used when it's not the entire pod that has failed. If that could be added for fully failed pods, we could at least avoid getting the errors reported to Sentry repeatedly, chewing away at the quotas.

Would that be a change that would make sense?
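
To make the idea concrete, here is a minimal sketch of de-duplicating reports for fully failed pods. It is not k8s-sentry's actual code (which is not shown in this thread): the `FailedPod` type, the `TerminationTracker` class and `reportFailures` are all hypothetical, and only the name isNewTermination is borrowed from the comment above.

```typescript
// Hypothetical shape of a failed pod; NOT k8s-sentry's real data structure.
interface FailedPod {
  uid: string;        // pod UID
  name: string;
  finishedAt: string; // termination timestamp as reported by the cluster
}

class TerminationTracker {
  private seen = new Set<string>();

  // True only the first time a given (pod, termination time) pair is observed.
  isNewTermination(pod: FailedPod): boolean {
    const key = `${pod.uid}/${pod.finishedAt}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

// Sends one event per new termination and returns how many were sent.
function reportFailures(
  pods: FailedPod[],
  tracker: TerminationTracker,
  send: (pod: FailedPod) => void,
): number {
  let sent = 0;
  for (const pod of pods) {
    if (tracker.isNewTermination(pod)) {
      send(pod);
      sent++;
    }
  }
  return sent;
}

// Simulate three polling rounds that all observe the same 4 failed pods.
const pods: FailedPod[] = [1, 2, 3, 4].map((i) => ({
  uid: `uid-${i}`,
  name: `job-pod-${i}`,
  finishedAt: "2020-01-01T00:00:00Z",
}));
const tracker = new TerminationTracker();
let total = 0;
for (let round = 0; round < 3; round++) {
  total += reportFailures(pods, tracker, () => {});
}
console.log(total); // 4
```

With a check like this, the same four failed pods produce four events no matter how often they are observed. A real implementation would also need to forget entries once pods are deleted so the set does not grow without bound.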

Tenzer commented on September 27, 2024

I can see #14 already makes that change as part of a bigger change in how pods are handled when they fail, and getting it merged would help solve this problem, I believe.
