A List of Post-mortems!

##Table of Contents

Config Errors

Hardware/Power Failures

Conflicts

Uncategorized

Other lists of postmortems

Analysis

Contributors

Config Errors

Cloudflare. A bad config (router rule) caused all of their edge routers to crash, taking down all of Cloudflare.

Etsy. Sending multicast traffic without properly configuring switches caused an Etsy global outage.

Facebook. A bad config took down both Facebook and Instagram.

Google. A bad config (autogenerated) took down most Google services.

Google. A bad config caused a quota service to fail, which caused multiple services to fail (including gmail).

Google. / was checked into the URL blacklist, causing every URL to show a warning.

Microsoft. A bad config took down Azure storage.

Stack Overflow. A bad firewall config blocked stackexchange/stackoverflow.

TravisCI. A configuration issue (incomplete password rotation) led to "leaking" VMs, leading to elevated build queue times.

Valve. Although there's no official postmortem, it looks like a bad BGP config severed Valve's connection to Level 3, Telia, and Abovenet/Zayo, which resulted in a global Steam outage.

Hardware/Power Failures

Amazon. An unknown event caused a transformer to fail. One of the PLCs that checks that generator power is in phase failed for an unknown reason, which prevented a set of backup generators from coming online. This affected EC2, EBS, and RDS in EU West.

Amazon. Bad weather caued power failures throughout AWS US East. A single backup generator failed to deliver stable power when power switched over to backup and the generator was loaded. This is despite having passed a load tests two months earlier, and passing weekly power-on tests.

FirstEnergy / General Electric. FirstEnergy had a local failure when some transmission lines hit untrimmed foliage. The normal process is to have an alarm go off, which causes human operators to re-distribute power. But the GE system that was monitoring this had a bug which prevented the alarm from getting triggered, which eventually caused a cascading failure that eventually affected 55 million people.

Google. Successive lightning strikes on their European datacenter (europe-west1-b) caused loss of power to Google Compute Engine storage systems within that region. I/O errors were observed on a subset of Standard Persistent Disks (HDDs) and permanent data loss was observed on a small fraction of those.

Sun/Oracle. Sun famously didn't include ECC in a couple generations of server parts. This resulted in data corruption and crashing. Following Sun's typical MO, they made customers that reported a bug sign an NDA before explaining the issue.

Conflicts

CCP Games A typo and a name conflict caused the installer to sometimes delete the boot.ini file on installation of an expansion for EVE Online - with consequences.

GoCardless. All queries on a critical PostgreSQL table were blocked by the combination of an extremely fast database migration and a long-running read query, causing 15 seconds of downtime.

Knight Capital. A combination of conflicting deployed versions and re-using a previously used bit caused a $460M loss.

Uncategorized

Allegro. The Allegro platform suffered a failure of a subsystem responsible for asynchronous distributed task processing. The problem affected many areas, e.g. features such as purchasing numerous offers via cart and bulk offer editing (including price list editing) did not work at all. Moreover, it partially failed to send daily newsletter with new offers. Also some parts of internal administration panel were affected.

Amazon. Message corruption caused the distributed server state function to overwhelm resources on the S3 request processing fleet.

Amazon. Human error during a routine networking upgrade led to a resource crunch, exacerbated by software bugs, that ultimately resulted in an outage across all US East Availability Zones as well as a loss of 0.07% of volumes.

Amazon. Elastic Load Balancer ran into problems when "a maintenance process that was inadvertently run against the production ELB state data".

Amazon. A "network disruption" caused metadata services to experience load that caused resposne times to exceed timeout values, causing storage nodes to take themselves down. Nodes that took themselves down continued to retry, ensuring that load on metadata services couldn't decrease.

AppNexus. A double free revealed by a database update caused all "impression bus" servers to crash simultaneously. This wasn't caught in staging and made it into production because a time delay is required to trigger the bug, and the staging period didn't have a built-in delay.

Bitly. Hosted source code repo contained credentials granting access to bitly backups, including hashed passwords.

BrowserStack. An old prototype machine with the Shellshock vulnerability still active had secret keys on it which ultimately led to a security breach of the Production system.

CCP Games. A problematic logging channel caused cluster nodes dying off during the cluster start sequence after rolling out a new game patch.

CircleCI. A GitHub outage and recovery caused an unexpectedly large incoming load. For reasons that aren't specified, a large load causes CircleCI's queue system to slow down, in this case to handling one transaction per minute.

Dropbox. This postmortem is pretty thin and I'm not sure what happened. It sounds like, maybe, a scheduled OS upgrade somehow caused some machines to get wiped out, which took out some databases.

European Space Agency. An overflow occured when converting a 16-bit number to a 64-bit numer in the Ariane 5 intertial guidance system, causing the rocket to crash. The actual overflow occured in code that wasn't necessary for operation but was running anyway. According to one account, this caused a diagnostic error message to get printed out, and the diagnostic error message was somehow interpreted as actual valid data. According to another account, no trap handler was installed for the overflow.

Etsy. First, a deploy that was supposed to be a small bugfix deploy also caused live databases to get upgraded on running production machines. To make sure that this didn't cause any corruption, Etsy stopped serving traffic to run integrity checks. Second, an overflow in ids (signed 32-bit ints) caused some database operations to fail. Etsy didn't trust that this wouldn't result in data corruption and took down the site while the upgrade got pushed.

Gitlab. After the primary locked up and was restarted, it was brought back up with the wrong filesystem, causing a global outage.

Google. Checking the vendor string instead of feature flags renders NaCl unusable on otherwise compatible non-mainstream hardware platforms.

Google. A mail system emailed people more than 20 times. This happened because mail was sent with a batch cron job that sent mail to everyone who was marked as waiting for mail. This was a non-atomic operation and the batch job didn't mark people as not waiting until all messages were sent.

GPS/GLONASS. A bad update that caused incorrect orbital mechanics calculations caused GPS satellites that use GLONASS to broadcast incorrect positions for 10 hours. The bug was noticed and rolled back almost immediately due to (?) this didn't fix the issue.

Healthcare.gov.

Heroku. Having a system that requires scheduled manual updates resulted in an error which caused US customers to be unable to scale, stop or restart dynos, or route HTTP traffic, and also prevented all customers from being able to deploy.

Intel. A scripting bug caused the generation of the divider logic in the Pentium to very occasionally produce incorrect results. The bug wasn't caught in testing because of an incorrect assumption in a proof of correctness.

Joyent. Operations on Manta were blocked because a lock couldn't be obtained on their PostgreSQL metadata servers. This was due to a combination of PostgreSQL's transaction wraparound maintence taking a lock on something, and a Joyent query that unecessarily tried to take a global lock.

Kickstarter. Primary DB became inconsistent with all replicas, which wasn't detected until a query failed. This was caused by a MySQL bug which sometimes caused order by to be ignored.

Medium. Polish users were unable to use their "Ś" key on Medium.

NASA. Use of different units of measurement (metric vs. English) caused Mars Climate Orbiter to fail. This is basically the same issue Sweden ran into in 1628 with its ship.

Netflix. An EBS outage in one availability zone was mitigated by migrating to other availability zones.

Sentry. Transaction ID Wraparound in Postgres caused Sentry to go down for most of a working day.

Spotify. Lack of exponential backoff in a microservice caused a cascading failure, leading to notable service degradation.

Stack Exchange. Enabling StackEgg for all users resulted in heavy load on load balancers and consequently, a DDoS.

Strava. Hit the signed integer limit on a primary key, causing uploads to fail.

Stripe. Manual operations are regularly executed on production databases. A manual operation was done incorrectly (missing dependency), causing the Stripe API to go down for 90 minutes.

Sweden. Use of different rulers by builders caused the Vasa to be more heavily built on its port side and the ship's designer, not having built a ship with two gun decks before, overbuilt the upper decks, leading to a design that was top heavy. Twenty minutes into its maiden voyage in 1628, the ship heeled to port and sank.

Therac-25. The Therac-25 was a radiation therapy machine involved in at least six accidents between 1985 and 1987 in which patients were given massive overdoses of radiation. Because of concurrent programming errors, it sometimes gave its patients radiation doses that were thousands of times greater than normal, resulting in death or serious injury.

Valve. Steam's desktop client deleted all local files and directories. The thing I find most interesting about this is that, after this blew up on social media, there were widespread reports that this was reported to Valve months earlier. But Valve doesn't triage most bugs, resulting in an extremely long time-to-mitigate, despite having multiple bugreports on this issue.

Yeller. A network partition in a cluster caused some messages to get delayed, up to 6-7 hours. For reasons that aren't clear, a rolling restart of the cluster healed the partition. There's some suspicious that it was due to cached routes, but there wasn't enough logging information to tell for sure.

Unfortunately, most of the interesting post-mortems I know about are locked inside confidential pages at Google and Microsoft. Please add more links if you know of any interesting public post mortems! is a pretty good resource; other links to collections of post mortems are also appreciated.

Analysis

How Complex Systems Fail

John Allspaw on Resilience Engineering

Contributors

Ahmet Alp Balkan
Amber Yust
Anuj Pahuja
BigEd/Ed S?
Brian Scanlan
Brock Boland
Brian Scanlan
Chris Higgs
Connor Shea
Dan Luu
Dan Nguyen
David Pate
Florent Genette
Franck Arnulfo
Grey Baker
Jacob Kaplan-Moss
James Graham
Jason Dusek
John Daily
jomo
Julia Hansbrough
Julian Szulc
Kunal Mehta
Luan Cestari
Mark Dennehy
Matt Day
Michael Robinson
Nat Welch
Nick Sweeting
Nate Parsons
Raul Ochoa
Samuel Hunter
Siddharth Kannan
Tim Freeman
Tom Crayford
Vaibhav Bhembre
Vincent Ambo

havk64 / post-mortems Goto Github PK

post-mortems's Introduction

A List of Post-mortems!

Config Errors

Hardware/Power Failures

Conflicts

Uncategorized

Other lists of postmortems

Analysis

Contributors

post-mortems's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent