Postmortems

A postmortem for another App Engine outage (2011-03-08): this one due to a Python runtime software update: “While the new Python runtime
contained no known issues, a performance optimization in a system
update pushed on March 3rd included a bug which would cause future
updates to App Engine runtimes to disrupt running applications as the
new runtime rolled out.”

There’s another App Engine postmortem at
https://groups.google.com/forum/#!msg/google-appengine/p2QKJ0OSLc8/7MtZ3YC9TqQJ
for a two-hour outage on 2010-02-24:
“We failed to plan for the case of a power outage that might affect
some, but not all, of our machines in a datacenter (in this case,
about 25%). In particular, this led to incorrect analysis of the
serving state of the failed datacenter and when it might recover”

Previously in this community, the 2 July 2009 outage, caused by a GFS bug:
https://plus.google.com/u/0/110357001884194145645/posts/BWVkgofMMKt
https://groups.google.com/forum/?fromgroups=#!topic/google-appengine/jJ0aRAvRJeY

Oops, I’d confused two App Engine outages: adjusted the post to include both.
See also http://www.datacenterknowledge.com/archives/2010/03/08/when-the-power-goes-out-at-google/ which summarises and tells us “The scope and detail of the [postmortem] report drew plaudits.”