Cloudflare, a global CDN, DNS and edge-compute provider, had a big outage today, caused by an internal routing update colliding with a new internal architecture.
… caused by a change that was part of a long-running project to increase resilience in our busiest locations.
While we did use a stagger procedure for this change, the stagger policy did not include an MCP data center until the final step.
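The gap in the stagger policy can be illustrated with a small check on a rollout plan. This is only a sketch: the `Site` type, the architecture labels, and the plan itself are hypothetical and do not reflect Cloudflare's actual tooling. It flags any architecture that a staged rollout first touches in its final stage, which is the situation the quote describes for MCP data centers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    architecture: str  # illustrative labels such as "legacy" or "mcp"


def first_stage_per_architecture(stages):
    """Map each architecture to the index of the first stage that includes it."""
    first_seen = {}
    for index, stage in enumerate(stages):
        for site in stage:
            first_seen.setdefault(site.architecture, index)
    return first_seen


def architectures_only_in_final_stage(stages):
    """Return architectures that are not exercised until the last stage."""
    last = len(stages) - 1
    if last <= 0:
        return []  # a single-stage plan has no staggering to check
    first_seen = first_stage_per_architecture(stages)
    return sorted(arch for arch, index in first_seen.items() if index == last)


# Hypothetical plan: the new architecture only appears at the very end.
risky_plan = [
    [Site("canary-1", "legacy")],
    [Site("small-1", "legacy"), Site("small-2", "legacy")],
    [Site("big-1", "mcp"), Site("big-2", "mcp")],
]

# Hypothetical alternative: one site of the new architecture moves to an early stage.
safer_plan = [
    [Site("canary-1", "legacy"), Site("big-1", "mcp")],
    [Site("small-1", "legacy"), Site("small-2", "legacy")],
    [Site("big-2", "mcp")],
]

print(architectures_only_in_final_stage(risky_plan))
print(architectures_only_in_final_stage(safer_plan))
```

A check like this could run before a rollout starts, so that a plan which defers an entire class of location to the last step is rejected rather than discovered in production.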
Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. We have backup procedures for handling such an event and used them to take control of the affected locations.
… also caused our internal load balancing system Multimog to stop working … our smaller compute clusters in an MCP received the same amount of traffic as our largest clusters, causing the smaller ones to overload.
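The overload described above follows from simple arithmetic, sketched below. The cluster names, capacities, and traffic figure are made up for illustration, and nothing here represents how Multimog actually works. The sketch only compares an even split (what happens when capacity-aware balancing is lost) with a split proportional to capacity.

```python
def split_evenly(total_rps, capacities):
    """Give every cluster the same share, ignoring its size."""
    share = total_rps / len(capacities)
    return {name: share for name in capacities}


def split_by_capacity(total_rps, capacities):
    """Give every cluster a share proportional to its capacity."""
    total_capacity = sum(capacities.values())
    return {
        name: total_rps * capacity / total_capacity
        for name, capacity in capacities.items()
    }


def overloaded(assignment, capacities):
    """Return the clusters whose assigned load exceeds their capacity."""
    return sorted(
        name for name, load in assignment.items() if load > capacities[name]
    )


# Hypothetical capacities in requests per second.
capacities = {"large-a": 1000, "large-b": 1000, "small-a": 200, "small-b": 200}
total_rps = 1800

even = split_evenly(total_rps, capacities)
weighted = split_by_capacity(total_rps, capacities)

print(overloaded(even, capacities))
print(overloaded(weighted, capacities))
```

With these assumed numbers, the even split sends 450 requests per second to every cluster, more than twice what the small ones can handle, while the proportional split keeps every cluster at 75% of its capacity.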
In the discussion on HN, we are treated to a list of other BGP incidents:
Yet another BGP caused outage. At some point we should collect all of them:
- Cloudflare 2022 (this one)
- Facebook 2021: Understanding how Facebook disappeared from the internet | Hacker News - this one probably had the single biggest impact, since engineers got locked out of their systems, which made the fixing part look like a sci-fi movie
- (Indirectly caused by BGP: Cloudflare 2020: https://blog.cloudflare.com/cloudflare-outage-on-july-17-202…)
- Google Cloud 2020: Google told BGP to forget its Euro-cloud – after first writing bad access control lists • The Register
- IBM Cloud 2020: https://www.bleepingcomputer.com/news/technology/ibm-cloud-g…
- Cloudflare 2019: Route Leak Impacting Cloudflare | Hacker News
- Amazon 2018: https://www.techtarget.com/searchsecurity/news/252439945/BGP…
- AWS: https://www.thousandeyes.com/blog/route-leak-causes-amazon-a… (2015)
- Youtube: https://www.infoworld.com/article/2648947/youtube-outage-und… (2008)
- Google 2016, configuration management bug/BGP: Google Cloud Status Dashboard
- Valve 2015: https://www.thousandeyes.com/blog/steam-outage-monitor-data-…
- Cloudflare 2013: Today's Outage Post Mortem

and Google in Japan 2017: https://www.internetsociety.org/blog/2017/08/google-leaked-p…