Postmortems

Joyent outage of 27 May 2014

Joyent outage of 27 May.
http://www.joyent.com/blog/postmortem-for-outage-of-us-east-1-may-27-2014

Global reboot of datacenter due to operator error: "The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter. " and “The extended API outage time was due to the need for manual recovery of stateful components in the control plane. While the system is designed to handle 2F+1 failures of any stateful system, rebooting the datacenter resulted in complete failure of all these components, and they did not maintain enough history to bring themselves online.”

Joyent “mortified”: https://news.ycombinator.com/item?id=7806972

Turns out there’s a first hand account - see this 2017 talk by Bryan Cantrill, CTO:

(Slides here (pdf))

(via Dan Luu’s collaborative postmortems list)