In this article, Dan Luu surveys postmortem reports, looking for lessons:
Reading postmortems
Examples from the article:
most bugs were due to bad error handling. 92% of those failures are actually from errors that are handled incorrectly.
Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages.
There are a lot of cases where the outage happened because a human was expected to flawlessly execute a series of instructions and failed to do so. That’s exactly the kind of thing that programs are good at!
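To make the last point concrete, here is a minimal sketch of a manual procedure turned into a program. It is not from the article: `Step`, `RunbookError`, and `run_runbook` are hypothetical names made up for illustration, and the steps are plain Python callables standing in for whatever an operator would otherwise type by hand. The runner stops at the first failing step and reports which one failed, rather than swallowing the error and carrying on.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One instruction from a hypothetical runbook."""
    name: str
    action: Callable[[], None]  # expected to raise an exception on failure


class RunbookError(RuntimeError):
    """Raised when a step fails; the message names the failing step."""


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order and return the names of the completed steps.

    Stops at the first failure instead of continuing with a
    half-applied change.
    """
    completed: List[str] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            # Surface the error with context; do not ignore it.
            raise RunbookError(
                f"step {step.name!r} failed after completing {completed}: {exc}"
            ) from exc
        completed.append(step.name)
    return completed
```

Each step either completes or halts the whole run with a message saying what had already been done, which is the part a human following a checklist tends to get wrong under pressure.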
Found via his recent post, which is rather interesting in its own right:
Some reasons to measure
Kyle’s early work found critical flaws in nearly everything he tested
Many of these problems had existed for quite a while