A collation of BGP incidents (and a Cloudflare postmortem)

A big outage today at Cloudflare, a global CDN, DNS and edge-compute provider, was caused by an internal routing update colliding with a new internal architecture.

… caused by a change that was part of a long-running project to increase resilience in our busiest locations.

While we did use a stagger procedure for this change, the stagger policy did not include an MCP [Multi-Colo PoP] data center until the final step.
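The flaw in that stagger policy can be sketched in a few lines. This is a hypothetical illustration, not Cloudflare's actual deployment tooling: the location names and the `stagger_waves` helper are invented, but the structure matches the postmortem's description, where every MCP site lands in the final wave together instead of any of them serving as an early canary.

```python
# Hypothetical sketch of the stagger policy described above: non-MCP
# locations are updated in early waves, while every MCP (Multi-Colo PoP)
# data center lands in the final wave -- so a bad change reaches all of
# the busiest sites at once instead of one canary at a time.

def stagger_waves(locations, waves=3):
    """Split locations into deployment waves, deferring MCPs to the end."""
    non_mcp = [loc for loc in locations if not loc["mcp"]]
    mcp = [loc for loc in locations if loc["mcp"]]
    # Spread non-MCP sites across the early waves...
    size = max(1, len(non_mcp) // (waves - 1))
    early = [non_mcp[i:i + size] for i in range(0, len(non_mcp), size)]
    # ...and push *all* MCP sites into the last wave together.
    return early + [mcp]

# Invented example locations (airport codes are placeholders).
locations = [
    {"name": "sjc", "mcp": True},
    {"name": "lhr", "mcp": True},
    {"name": "akl", "mcp": False},
    {"name": "bom", "mcp": False},
    {"name": "cpt", "mcp": False},
    {"name": "yyz", "mcp": False},
]

waves = stagger_waves(locations)
print([[loc["name"] for loc in wave] for wave in waves])
# → [['akl', 'bom'], ['cpt', 'yyz'], ['sjc', 'lhr']]
```

The last wave contains every MCP location, so by the time the change reached any MCP, there was no smaller MCP left to catch the problem first.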

Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. We have backup procedures for handling such an event and used them to take control of the affected locations.

… also caused our internal load balancing system Multimog to stop working … our smaller compute clusters in an MCP received the same amount of traffic as our largest clusters, causing the smaller ones to overload.
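The overload mechanism is simple arithmetic: without capacity weighting, traffic is split evenly, and a small cluster can be handed several times its capacity. This is a hypothetical sketch (the cluster names and numbers are invented; Multimog's real algorithm is not public), contrasting a uniform split with a capacity-weighted one.

```python
# Hypothetical sketch of the failure mode described above: when the
# internal load balancer (Multimog, in Cloudflare's case) stops working,
# every cluster receives an equal share of traffic, and the smallest
# clusters are pushed far past their capacity.

def uniform_split(total_rps, clusters):
    """Failure mode: capacity weighting is lost, so equal shares."""
    share = total_rps / len(clusters)
    return {name: share for name in clusters}

def weighted_split(total_rps, clusters):
    """Normal operation: traffic proportional to each cluster's capacity."""
    total_capacity = sum(clusters.values())
    return {name: total_rps * cap / total_capacity
            for name, cap in clusters.items()}

# Invented capacities (requests/sec) for clusters inside one location.
clusters = {"big-a": 8000, "big-b": 8000, "small-c": 1000, "small-d": 1000}
total_rps = 12000

for name, load in uniform_split(total_rps, clusters).items():
    print(f"{name}: {load:.0f} rps ({load / clusters[name]:.0%} of capacity)")
# Uniform split gives every cluster 3000 rps: comfortable for the big
# clusters, but 300% of capacity for the small ones, which then overload.
```

With the weighted split the small clusters would receive about 667 rps each, well within capacity; losing the weighting is what tips them over.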

In the discussion on HN, we are treated to a list of other BGP incidents:
Yet another BGP-caused outage. At some point we should collect all of them:

Some more:

Google in Japan 2017: https://www.internetsociety.org/blog/2017/08/google-leaked-p…