Github were down for two hours - a power glitch at first, with a whole class of machines then failing to boot and also an unintended boot-time dependency on Redis service which was overloaded. Key phrase: “Drain the flea power.”
“We will be updating our application’s test suite to explicitly ensure that our application processes start even when certain external systems are unavailable and we are improving our circuit breakers so we can gracefully degrade functionality when these backend services are down.”
A lot of companies depend on github in their workflows, so github downtime is a direct productivity loss. That might be a failure mode in itself.
via Dan Luu’s big list at