
Comments (10)

williscool avatar williscool commented on September 27, 2024

Having this issue also with a regular job... completely blew out my dev Sentry quota o.o

image

from k8s-sentry.

wichert avatar wichert commented on September 27, 2024

That definitely sounds like a real bug. I'll try to reproduce this.


wichert avatar wichert commented on September 27, 2024

@daaain Isn't the behaviour correct for you? I would expect a Sentry event every time the created Pod failed. For example with this Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: failure
spec:
  template:
    spec:
      containers:
      - name: failure
        image: busybox
        command: ["sh",  "-c", "echo Failing now ; /bin/false"]
      restartPolicy: Never
  backoffLimit: 4

Kubernetes will try to create and run a pod four times, resulting in four failure events. It sounds like your CronJob is set up to do 8 attempts per hour?


wichert avatar wichert commented on September 27, 2024

@williscool Can you show me what your Job resource looks like? In my test k8s-sentry does not report more errors than the number of times Kubernetes tries to run the job, so I'm wondering how you get 1500 events for a single Job.


daaain avatar daaain commented on September 27, 2024

@wichert it's a single run of the CronJob which was repeated in Sentry; the runs afterwards were successful.

It's running in a cluster with 1.16.13-gke.401

This is the template from GKE with some irrelevant bits removed:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etl-sync-worker-bike-public-module-state
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      labels:
        app: etl-sync-worker-bike-public-module-state
    spec:
      backoffLimit: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command: ...
            image: ...
            name: etl-sync-worker-bike-public-module-state
          restartPolicy: Never
          terminationGracePeriodSeconds: 30
  schedule: 0/2 * * * *
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 3


daaain avatar daaain commented on September 27, 2024

Oh actually, just realised that this might be relevant: I have a postStart lifecycle command, which might be what failed the run. That would explain why there wasn't any output.


williscool avatar williscool commented on September 27, 2024

So my job is written in Pulumi TypeScript (k8s), but it's pretty easy to follow if you know what a normal YAML file looks like

the main event is

yarn install --production=false --no-progress && yarn jest

so far I think this is what the issue was: I had a test that was failing in a way that kept a database connection open, which hung the jest process, so it kept returning exit 1. Somehow, every couple of minutes k8s-sentry was observing that failure and sending it to Sentry

const testJob = new k8s.batch.v1.Job(`${projectName}-test-job`, {
    spec: {
        backoffLimit: 0, // only run once
        template: {
            metadata: {
                generateName: `${projectName}-test-job-`,
            },
            spec: {
                containers: [
                    { 
                        name: `${projectName}-test`,
                        image: watcherImage,
                        command: ["/bin/sh"],
                        args: ["-c", "yarn install --production=false --no-progress && yarn jest"],
                        env: [
                            { name: "DD_ENV", value: pulumi.getStack() },
                            { name: "SENTRY_DSN", value: notificationServiceSentryDsn },
                            { name: "PG_DATABASE_URL", value: dburl },
                            ///..  more env vars and such...
                        ]

                    },
                ],
                restartPolicy: "Never",
            }
        },
    },
}, { provider: cluster.provider });

had to set a filter

image

and by the looks of things

image

it's still an issue even though I have this test passing now


daaain avatar daaain commented on September 27, 2024

Some new info: I had a CronJob failing overnight a few times by running out of memory and getting OOMKilled with exit code 137 which didn't trigger the repetition, but also didn't have any error message after the Pod name, so at least one part of the problem seems to be easier to reproduce.


Tenzer avatar Tenzer commented on September 27, 2024

We've been hit by this bug a few times as well, most recently over the last ~22 hours.

We have a cron job that runs every 5 minutes in Kubernetes. It had 1 failed invocation yesterday, which was retried 3 times, and we only noticed today that the 4 resulting failed pods had caused 10k errors to be reported to Sentry. Deleting the failed pods stopped the errors from being reported.

The number of errors reported approximately matches 4 pods each getting an error reported every 30 seconds for 22 hours.
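
As a rough sanity check of that estimate, here is the arithmetic spelled out. This is only a sketch: the 30-second interval and the 22-hour window are the approximate figures quoted above, not measured values.

```typescript
// Rough estimate of how many events are sent when every failed pod
// is re-reported at a fixed interval.
const failedPods = 4;
const reportIntervalSeconds = 30; // approximate interval quoted above
const durationHours = 22;         // approximate duration quoted above

const reportsPerPod = (durationHours * 3600) / reportIntervalSeconds;
const totalReports = failedPods * reportsPerPod;

console.log(totalReports); // 10560
```

That lands close to the ~10k errors seen in Sentry.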

Looking through the source code, it stands out to me that the isNewTermination check is only used when it's not the entire pod that has failed. If that check could be added for fully failed pods, we could at least avoid getting the errors reported to Sentry repeatedly, chewing away at the quotas.

Would that be a change that would make sense?
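
To illustrate the idea, here is a minimal sketch of that kind of de-duplication: remember which failed pods were already reported, keyed by pod UID, and skip repeats. This is not k8s-sentry's actual code, and the names `FailedPod`, `FailureDeduplicator`, `shouldReport` and `forget` are hypothetical.

```typescript
// Hypothetical shape of a pod as seen by the reporter; only the fields
// this sketch needs.
interface FailedPod {
  uid: string;
  name: string;
  phase: "Pending" | "Running" | "Succeeded" | "Failed" | "Unknown";
}

class FailureDeduplicator {
  private reported = new Set<string>();

  // Returns true only the first time a given failed pod is seen.
  shouldReport(pod: FailedPod): boolean {
    if (pod.phase !== "Failed") return false;
    if (this.reported.has(pod.uid)) return false;
    this.reported.add(pod.uid);
    return true;
  }

  // Forget a pod once it has been deleted so the set does not grow forever.
  forget(uid: string): void {
    this.reported.delete(uid);
  }
}

const dedup = new FailureDeduplicator();
const pod: FailedPod = { uid: "abc-123", name: "job-xyz", phase: "Failed" };

const first = dedup.shouldReport(pod);  // first sighting of the failure
const second = dedup.shouldReport(pod); // same pod seen again later

console.log(first, second); // true false
```

With something along these lines, a pod that stays in the Failed phase would produce one event rather than one per check.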


Tenzer avatar Tenzer commented on September 27, 2024

I can see #14 already makes that change as part of a bigger change in how pods are handled when they fail, and I believe getting that change merged would help solve this problem.
