A short, simple story, but it got on my nerves because I had to cancel a dinner.
In the end it turned out to be a huge decision-changer for my company. Everything looked good until this failure... I decided to delay our k8s-based hosting offerings for now :|
I was running k8s clusters on AWS, provisioned with kops.
Friday, 5PM. Last task remaining for the week: change the instance sizes of the k8s cluster. All services are correctly distributed over N nodes, what could possibly go wrong?
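For context, resizing a kops-managed cluster roughly goes like this (a sketch; the instance-group name "nodes" is the kops default, and $CLUSTER_NAME / $KOPS_STATE_STORE are placeholders for your own setup):

```shell
# Edit the instance group to change machineType (opens $EDITOR).
kops edit ig nodes --name=$CLUSTER_NAME --state=$KOPS_STATE_STORE

# Apply the change to the cloud resources.
kops update cluster --name=$CLUSTER_NAME --yes

# Roll the nodes so they come back with the new instance size.
# Each node is drained, terminated, and replaced -- and every fresh
# node has to pull all its container images again. That image pull
# is exactly the step that broke for me.
kops rolling-update cluster --name=$CLUSTER_NAME --yes
```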
The culprit: hub.docker.com's CDN. I'm not sure where it's hosted, but for some reason it became extremely slow from AWS, with downloads of ca. 5-10 kB/s. It works like a charm from other AWS regions and from non-AWS datacenters. It just did not work for me.
So... I had to cancel my evening plans (and the cluster ended up unavailable), because:
- each node spends >30 minutes trying to start, then fails and is replaced. Basic k8s services can't start, as container images can't be downloaded in a reasonable time. There's no quick fail: on each start attempt, pulling images crawls along and eventually times out
- I had an open "kops update" running in a terminal window on my local workstation. I found no information on whether I could safely interrupt this operation, and it would disconnect if I unplugged the network cable from my laptop.
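One mitigation I've looked at since: stop depending on hub.docker.com's CDN for node startup by pointing each node's Docker daemon at a registry mirror. A sketch, assuming a Docker-based node and using mirror.gcr.io (Google's public Docker Hub mirror) as an example; a private registry or ECR would work just as well:

```shell
# Write /etc/docker/daemon.json on the node so image pulls go through
# a mirror instead of hub.docker.com's CDN. mirror.gcr.io is used here
# as an example mirror endpoint.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "registry-mirrors": ["https://mirror.gcr.io"]
}
EOF
# Restart the daemon to pick up the new config.
sudo systemctl restart docker
```

With kops this would normally be baked into the instance-group spec or node bootstrap rather than applied by hand, so nodes come up with the mirror already configured.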
Solution:
- cancel your dinner
- wait a few hours until the CDN bandwidth stabilizes
- rethink many, many times whether our company should offer production k8s services...