Postmortems

Postmortem lessons - a survey from Dan Luu

In this article Dan Luu surveys postmortem reports looking for lessons:
Reading postmortems

Examples from the article:

most bugs were due to bad error handling. 92% of those failures are actually from errors that are handled incorrectly.

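
To make the error-handling point concrete, here is a minimal sketch that contrasts an error that is caught but not really handled with one that is surfaced to the caller. It is not taken from the article or from any system it surveys; the function names, the `ConfigError` class, and the `replicas` setting are all hypothetical.

```python
import json


def load_replica_count_swallowing(raw: str) -> int:
    """Anti-pattern: the handler hides the failure behind a made-up default."""
    try:
        return int(json.loads(raw)["replicas"])
    except Exception:
        # The error is caught but not handled: callers cannot tell a real
        # value of 1 from corrupt input.
        return 1


class ConfigError(ValueError):
    """Hypothetical error raised when the replica setting cannot be read."""


def load_replica_count(raw: str) -> int:
    """Catch only the expected failures and re-raise them with context."""
    try:
        count = int(json.loads(raw)["replicas"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        raise ConfigError(f"cannot read 'replicas' from input: {exc!r}") from exc
    if count < 1:
        raise ConfigError(f"'replicas' must be >= 1, got {count}")
    return count
```

With corrupt input, the first version quietly returns 1 and the problem shows up later somewhere else; the second fails immediately at the point where the bad input arrived.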

Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages.

There are a lot of cases where the outage happened because a human was expected to flawlessly execute a series of instructions and failed to do so. That’s exactly the kind of thing that programs are good at!
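
As a rough illustration of handing such a series of instructions to a program, the sketch below runs each step in order, verifies it, and stops at the first step that fails verification. The `Step` type, `run_runbook`, and the in-memory `state` dictionary are assumptions made up for this example, not anything described in the article.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # performs the change
    check: Callable[[], bool]   # confirms the change took effect


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first step whose check fails."""
    completed: List[str] = []
    for step in steps:
        step.action()
        if not step.check():
            raise RuntimeError(
                f"step {step.name!r} failed verification; "
                f"completed so far: {completed}"
            )
        completed.append(step.name)
    return completed


# Hypothetical in-memory "cluster" standing in for real infrastructure.
state = {"drained": False, "upgraded": False}

steps = [
    Step("drain", lambda: state.update(drained=True), lambda: state["drained"]),
    Step("upgrade", lambda: state.update(upgraded=True), lambda: state["upgraded"]),
]

print(run_runbook(steps))
```

The point of the design is that the order of steps and the verification after each one live in code, so nobody has to remember to do them correctly under pressure.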

Found via his recent post, which is rather interesting in its own right:
Some reasons to measure

Kyle’s early work found critical flaws in nearly everything he tested
Many of these problems had existed for quite a while