Detailed postmortem from Twilio on their recent billing problems.

It’s interesting to see how one of their initial reactions actually made things worse (“Observing extreme load on the host, the redis process on redis-master was misdiagnosed as requiring a restart to recover”) - I wonder how common this is in incident response?
http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html

I think it is common. Overload can be surprisingly hard to recognize if it’s very rare, which it would be in a well-provisioned system. Then your mindset is “I’m sure this is not overload. What else could cause these symptoms? Maybe a thread is stuck. Let’s see if a restart helps.” This is especially easy if you have seen three problems that were fixed with restarts just last week, while overload has never been a problem.
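
To make that concrete: overload usually does leave fingerprints in the server’s own stats if you think to look before restarting. Here’s a rough sketch (not from the postmortem) of pulling the relevant fields from Redis’s INFO output on a host like redis-master, using the redis-py client; the field names are real INFO fields, everything else is illustrative:

    # Sanity check before restarting: does this look like overload, or a hung process?
    import redis

    def overload_signals(host="redis-master", port=6379):
        info = redis.Redis(host=host, port=port, socket_timeout=2).info()
        return {
            # A flood of clients or commands points at load, not a stuck thread.
            "connected_clients": info["connected_clients"],
            "instantaneous_ops_per_sec": info["instantaneous_ops_per_sec"],
            "blocked_clients": info["blocked_clients"],
            # Connections Redis refused because the maxclients limit was hit.
            "rejected_connections": info["rejected_connections"],
        }

    if __name__ == "__main__":
        for name, value in overload_signals().items():
            print(f"{name}: {value}")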

Do you know of a good solution to this?

“Let’s see if a restart helps” is a very common approach in lots of organizations.

Yes - there’s likely confirmation bias in play here. I’ve seen it tackled with checklists that deliberately eliminate the less likely causes early in the incident response, which forces the responder to question their assumptions about what’s causing the symptoms they observe.
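
As a minimal sketch of that idea (the questions and structure here are made up, not taken from any particular runbook), the checklist can be encoded so the responder has to answer the boring questions, and leave a record of the answers, before a restart is even on the table:

    # Illustrative incident checklist: rule out the likely causes explicitly
    # before reaching for "let's see if a restart helps".
    CHECKLIST = [
        "Is request/command volume elevated versus the same time last week?",
        "Are connection counts or queue depths climbing, or flat?",
        "Did a deploy, config change, or failover happen in the last hour?",
        "Is the process actually unresponsive, or just slow under load?",
    ]

    def run_checklist():
        answers = []
        for i, question in enumerate(CHECKLIST, start=1):
            answers.append((question, input(f"[{i}/{len(CHECKLIST)}] {question} ")))
        return answers

    if __name__ == "__main__":
        # The recorded answers double as postmortem evidence of what was
        # ruled out before any disruptive action was taken.
        for question, answer in run_checklist():
            print(f"- {question} -> {answer}")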