Postmortems

Internet broken: Cloudflare blames Verizon and a BGP Optimizer

This is Cloudflare’s description of the BGP leak, laying blame on Verizon, with specific remediations:
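The mechanics of a leak like this come down to longest-prefix matching: a "BGP optimizer" fabricates more-specific sub-prefixes of someone else's address space, and once those escape upstream, routers everywhere prefer them over the legitimate, less-specific aggregate. A minimal sketch of that preference rule, assuming a toy routing table (the prefixes and path labels here are illustrative, not the incident's actual routes):

```python
import ipaddress

# Hypothetical routing table: a legitimate /20 announcement plus two
# more-specific /21s fabricated by a BGP optimizer and leaked upstream.
# (Prefixes and path labels are illustrative, not the actual routes
# involved in the incident.)
routes = [
    (ipaddress.ip_network("104.20.0.0/20"), "legitimate path"),
    (ipaddress.ip_network("104.20.0.0/21"), "leaked path via BGP optimizer"),
    (ipaddress.ip_network("104.20.8.0/21"), "leaked path via BGP optimizer"),
]

def best_route(addr):
    """Longest-prefix match: the most specific covering route wins,
    regardless of how trustworthy its origin is."""
    addr = ipaddress.ip_address(addr)
    candidates = [(net, path) for net, path in routes if addr in net]
    return max(candidates, key=lambda r: r[0].prefixlen)

net, path = best_route("104.20.3.1")
print(net, "->", path)  # the leaked /21 beats the legitimate /20
```

This is why the remediations Cloudflare pushed for (RPKI origin validation, IRR-based filtering, max-prefix limits) all aim at rejecting the bogus more-specifics before they enter the table, since nothing in path selection itself will disprefer them.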

From the discussion on HN defending the rapidity and specificity of the blog post:

The sequence of events went a bit like this:

Team in London started working the problem and called in reinforcements from elsewhere;

Upper management (me and one other person) got involved as it was serious/not resolved fast;

I spoke with the network team in London, who seemed to have a good handle on the problem and how they were working to resolve it, but we decided to wake a couple of other smart folks up to make sure we had all the best people on it;

Problem got resolved by the diligence of an engineer in London getting through to and talking with DQE;

Some people went back to bed;

Tom worked on writing our internal incident report so that details were captured fast and people had visibility. He then volunteered to be point on writing the public blog (1415 UTC);

Folks in California woke up and got involved with the blog. Ton of people contributed to it from around the world with Tom fielding all the changes and ideas;

Very senior people at Cloudflare (including legal) signed off and we posted (1958 UTC).

No one had an axe to grind with Verizon. We were working a complex problem affecting a good chunk of our traffic and customers. Everyone was calm and collected and thoughtful throughout.

Shout out to the Support team who handled an additional 1,000 support requests during the incident!

Edit: there’s an earlier discussion on HN (“Route Leak Impacting Cloudflare”, previously “Cloudflare is observing network performance issues”) containing these snippets:

Final update from me. This was a widespread problem that affected 2,400 networks (representing 20,000 prefixes) including us, Amazon, Linode, Google, Facebook and others.

and

This appears to be a routing problem. All our systems are running normally but traffic isn’t getting to us for a portion of our domains.

1128 UTC update Looks like we’re dealing with a route leak and we’re talking directly with the leaker and Level3 at the moment.

1131 UTC update Just to be clear this isn’t affecting all our traffic or all our domains or all countries. A portion of traffic isn’t hitting Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.

1134 UTC update We are now certain we are dealing with a route leak.

@dang etc.: could someone update the title to reflect the status page “Route Leak Impacting Cloudflare”

1147 UTC update Staring at internal graphs, it looks like global traffic is now at 97% of expected, so the impact is lessening.

1204 UTC update This leak is more widespread than just Cloudflare.

1208 UTC update Amazon Web Services now reporting external networking problem https://status.aws.amazon.com/

1230 UTC update We are working with networks around the world and are observing network routes for Google and AWS being leaked as well.

1239 UTC update Traffic levels are returning to normal.