Comments (10)
Having this issue also with a regular job... completely blew out my dev Sentry quota o.o
That definitely sounds like a real bug. I'll try to reproduce this.
@daaain Isn't the behaviour correct for you? I would expect a Sentry event every time the created Pod failed. For example with this Job:
apiVersion: batch/v1
kind: Job
metadata:
  name: failure
spec:
  template:
    spec:
      containers:
      - name: failure
        image: busybox
        command: ["sh", "-c", "echo Failing now ; /bin/false"]
      restartPolicy: Never
  backoffLimit: 4
Kubernetes will try to create and run a pod four times, resulting in four failure events. It sounds like your CronJob is set up to do 8 attempts per hour?
@williscool Can you show me what your Job resource looks like? In my test k8s-sentry does not report more errors than the number of times Kubernetes tries to run the job, so I'm wondering how you get 1500 events for a single Job.
@wichert it's a single run of the CronJob that was repeated in Sentry; the runs afterwards were successful.
It's running in a cluster on Kubernetes 1.16.13-gke.401.
This is the template from GKE with some irrelevant bits removed:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etl-sync-worker-bike-public-module-state
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      labels:
        app: etl-sync-worker-bike-public-module-state
    spec:
      backoffLimit: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command: ...
            image: ...
            name: etl-sync-worker-bike-public-module-state
          restartPolicy: Never
          terminationGracePeriodSeconds: 30
  schedule: 0/2 * * * *
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 3
Oh actually, I just realised this might be relevant: I have a postStart lifecycle command, which might be what failed the run and would explain why there wasn't any output.
So my job is written in Pulumi TypeScript for Kubernetes, but it's pretty easy to follow from what a normal YAML file looks like.
The main event is:
yarn install --production=false --no-progress && yarn jest
So far I think this is what the issue was: I had a test that was failing in a way that kept a database connection open, which hung the jest process, so it kept returning exit 1. Somehow, every couple of minutes, k8s-sentry was observing that failure and sending it to Sentry.
import * as k8s from "@pulumi/kubernetes";
import * as pulumi from "@pulumi/pulumi";

// projectName, watcherImage, notificationServiceSentryDsn, dburl and cluster
// are defined elsewhere in the Pulumi program.
const testJob = new k8s.batch.v1.Job(`${projectName}-test-job`, {
  spec: {
    backoffLimit: 0, // only run once
    template: {
      metadata: {
        generateName: `${projectName}-test-job-`,
      },
      spec: {
        containers: [
          {
            name: `${projectName}-test`,
            image: watcherImage,
            command: ["/bin/sh"],
            args: ["-c", "yarn install --production=false --no-progress && yarn jest"],
            env: [
              { name: "DD_ENV", value: pulumi.getStack() },
              { name: "SENTRY_DSN", value: notificationServiceSentryDsn },
              { name: "PG_DATABASE_URL", value: dburl },
              ///.. more env vars and such...
            ],
          },
        ],
        restartPolicy: "Never",
      },
    },
  },
}, { provider: cluster.provider });
I had to set up a filter, and by the looks of things it's still an issue even though I have this test passing now.
Some new info: I had a CronJob failing overnight a few times by running out of memory and getting OOMKilled with exit code 137. This didn't trigger the repetition, but there also wasn't any error message after the Pod name, so at least one part of the problem seems to be easier to reproduce.
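For what it's worth, the termination reason and exit code are available on the container status even when the container wrote no output, so a message like "OOMKilled (exit code 137)" could be built as a fallback. A minimal sketch in Go, with a hypothetical helper name (this is not the actual k8s-sentry code):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// describeTermination builds a one-line event message from a terminated
// container status, falling back to the termination reason and exit code
// when the container left no termination message of its own.
func describeTermination(podName string, status v1.ContainerStatus) string {
	t := status.State.Terminated
	if t == nil {
		return ""
	}
	msg := t.Message
	if msg == "" {
		msg = fmt.Sprintf("%s (exit code %d)", t.Reason, t.ExitCode)
	}
	return fmt.Sprintf("%s: %s", podName, msg)
}

func main() {
	status := v1.ContainerStatus{
		Name: "etl-sync-worker",
		State: v1.ContainerState{Terminated: &v1.ContainerStateTerminated{
			Reason:   "OOMKilled",
			ExitCode: 137,
		}},
	}
	// Prints: etl-sync-worker-1234: OOMKilled (exit code 137)
	fmt.Println(describeTermination("etl-sync-worker-1234", status))
}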
We've been hit by this bug a few times as well, most recently over the last ~22 hours.
We have a cron job that runs every 5 minutes in Kubernetes. It had one failed invocation yesterday, which was retried 3 times, and we only noticed today that the 4 failed pods left behind had caused 10k errors to be reported to Sentry. Deleting the failed pods stopped the errors from being reported.
The number of errors reported approximately matches 4 pods each getting an error reported every 30 seconds for 22 hours.
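That works out to 4 pods × 2 reports per minute × 60 minutes × 22 hours = 10,560 events, right in line with the ~10k we saw.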
Looking through the source code, it stands out to me that the isNewTermination check is only used when it's not the entire pod that has failed. If that check could also be applied to fully failed pods, we could at least avoid having the same errors reported to Sentry repeatedly, chewing away at the quota.
Would that be a change that would make sense? Below is a rough sketch of the kind of guard I mean.
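The idea, with hypothetical names and structure (not the project's actual code): remember each termination by pod UID, container name, and finish time, and skip anything already seen, so a failed pod that sticks around doesn't produce a new event on every resync.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// terminationKey identifies one container termination: same pod, container,
// and finish time means the same failure observed again on a later resync.
type terminationKey struct {
	podUID    string
	container string
	finished  string
}

var seen = map[terminationKey]bool{}

// isNewTermination reports whether this termination has not been reported
// before, and records it so later resyncs of the same failed pod are skipped.
func isNewTermination(pod *v1.Pod, status v1.ContainerStatus) bool {
	t := status.State.Terminated
	if t == nil {
		return false
	}
	key := terminationKey{string(pod.UID), status.Name, t.FinishedAt.String()}
	if seen[key] {
		return false
	}
	seen[key] = true
	return true
}

func main() {
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{UID: "abc-123"}}
	status := v1.ContainerStatus{
		Name:  "failure",
		State: v1.ContainerState{Terminated: &v1.ContainerStateTerminated{ExitCode: 1}},
	}
	fmt.Println(isNewTermination(pod, status)) // true: first observation, report it
	fmt.Println(isNewTermination(pod, status)) // false: resync of the same failure
}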
I can see #14 already makes that change as part of a bigger rework of how failed pods are handled; getting that change merged would, I believe, help solve this problem.