Cascading failure at Slack caused by rollback

EdS · July 12, 2020, 8:02pm

This story describes the technical details of the problems that caused the Slack downtime on May 12th, 2020. To learn more about the process behind incident response for same outage, read Ryan Katkov’s post, “All Hands on Deck”.

Comments from HN:

TL;DR First a performance bug was caught during rollout, and rolled back within a few minutes. However this triggered their auto-scaling of web apps to ramp up to more instances than a hard limit they had. This in turn triggered a bug in how they update the list of hosts in their load balancer, causing it to not get updated with new instances, and eventually go stale. After 8 hours the only real remaining instances in the list were the few oldest ones, so when they then scaled down the number of instances again, these old instances were the first to be shut down, causing the outage, because the instances that should have taken over were not listed in the stale load balancer host list.

But also

Its a load balancer bug

and

No, its not a load balancer bug. It was a bug with the home built program they built for managing config load ins TO the load balancer. The load balancer did exactly what it was instructed to do–but the slack developed app for helping the load balancer deal with their overly complicated back end discovery problems failed to keep the load balancers updated with accurate information on backend servers available for serving.