A short, simple story, but it got on my nerves because I had to cancel a dinner.
In the end it turned out to be a huge decision-changer for my company. Everything looked good until this failure... I decided to delay our k8s-based hosting offerings for now :|
I was running k8s clusters on AWS, provisioned with kops.
Friday, 5PM. Last task remaining for the week: change the instance sizes of the k8s cluster. All services are correctly distributed over N nodes, what could possibly go wrong?
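For context, resizing a kops-managed cluster roughly goes like this (a sketch; the instance-group name "nodes" is the kops default, and $CLUSTER_NAME / $KOPS_STATE_STORE are placeholders for your own setup):

```shell
# Edit the instance group to change machineType (opens $EDITOR).
kops edit ig nodes --name=$CLUSTER_NAME --state=$KOPS_STATE_STORE

# Apply the change to the cloud resources.
kops update cluster --name=$CLUSTER_NAME --yes

# Roll the nodes so they come back with the new instance size.
# Each node is drained, terminated, and replaced -- and every fresh
# node has to pull all its container images again. That image pull
# is exactly the step that broke for me.
kops rolling-update cluster --name=$CLUSTER_NAME --yes
```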
The culprit: hub.docker.com's CDN. I'm not sure where it's hosted, but for some reason it became extremely slow from AWS, with downloads of ca. 5-10 kB/s. It works like a charm from other AWS regions and from non-AWS datacenters. It just did not work for me.
So... I had to cancel my evening plans (and the cluster ended up unavailable), because:
- each node spends >30 minutes trying to start, then fails and is replaced. Basic k8s services can't start, as container images can't be downloaded in a reasonable time. There's no quick fail: on each start attempt, pulling images crawls along and eventually times out
- I had an open "kops update" running in a terminal window on my local workstation. I found no information on whether I could safely interrupt this operation, and it would disconnect if I unplugged the network cable from my laptop.
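One mitigation I've looked at since: stop depending on hub.docker.com's CDN for node startup by pointing each node's Docker daemon at a registry mirror. A sketch, assuming a Docker-based node and using mirror.gcr.io (Google's public Docker Hub mirror) as an example; a private registry or ECR would work just as well:

```shell
# Write /etc/docker/daemon.json on the node so image pulls go through
# a mirror instead of hub.docker.com's CDN. mirror.gcr.io is used here
# as an example mirror endpoint.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "registry-mirrors": ["https://mirror.gcr.io"]
}
EOF
# Restart the daemon to pick up the new config.
sudo systemctl restart docker
```

With kops this would normally be baked into the instance-group spec or node bootstrap rather than applied by hand, so nodes come up with the mirror already configured.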
Solution:
- cancel your dinner
- wait a few hours until the CDN bandwidth stabilizes
- rethink many, many times whether our company should offer production k8s services...