Postmortems

Cloudflare down for 30 mins - regular expression trouble

Cloudflare’s customer sites and their own website were down for about 30 minutes, returning 502 errors. The root cause was an update to Web Application Firewall rules: the new rules were deployed in a simulation mode, meaning their actions aren’t applied to traffic, but the rules themselves are still evaluated against it, and one of them included a very expensive regular expression which pegged CPUs.
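The expense comes from catastrophic backtracking. A minimal sketch of the problem, assuming a pattern along the lines of the simplified ".*.*=.*" discussed in Cloudflare's write-up (two adjacent greedy wildcards ahead of a literal):

```python
import re
import time

# Two adjacent greedy ".*" before a literal "=". On input containing
# no "=", the engine must try every split point between the two ".*"
# before giving up: O(n^2) backtracking steps, which is what pegs a CPU.
pattern = re.compile(r".*.*=.*")

def failure_time(n):
    """Time how long a failing match takes against n non-matching chars."""
    s = "x" * n  # no "=" anywhere, so the match must fail
    start = time.perf_counter()
    assert pattern.match(s) is None
    return time.perf_counter() - start

# Doubling the input roughly quadruples the time to reject it.
for n in (2000, 4000, 8000):
    print(f"n={n}: {failure_time(n):.4f}s")
```

Quadratic rejection is bad enough when the regex runs on every request; patterns with nested quantifiers can be exponential.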

Blog post:

This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred. Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.

Cloudflare run a CDN with many global points of presence and proxy a large proportion of website traffic. They also run the 1.1.1.1 DNS resolver and act as a DNS provider.

Many web services were affected, along with workflows and deployments that rely on online services.

Though some of these outages might not have been caused by Cloudflare, the ones I spot-checked all appear related: Medium, DigitalOcean, Shopify, CodeShip, Pingdom, and many more. The impact is staggering.

Via the discussion at HN.

Another interesting observation I’ve seen about this is that it seems to have been caused by failing to treat configuration as something that needs to be tested, just like code, and deployed thoughtfully.
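The idea can be sketched as a small pre-deploy gate that benchmarks a rule's regex against adversarial inputs before rollout. This is a hypothetical sketch, not Cloudflare's actual tooling; the function name and time budget are made up:

```python
import re
import time

def check_rule_performance(pattern_text, samples, budget_seconds=0.02):
    """Pre-deploy gate: reject a rule whose regex blows a time budget.

    Hypothetical helper: the point is simply to test configuration
    (here, a WAF regex) against worst-case inputs, just like code.
    """
    compiled = re.compile(pattern_text)
    for sample in samples:
        start = time.perf_counter()
        compiled.match(sample)  # one anchored attempt per sample
        if time.perf_counter() - start > budget_seconds:
            return False
    return True

# A long input with no "=" triggers quadratic backtracking in the
# pathological pattern, so the gate rejects it; a rewrite without
# adjacent greedy wildcards matches the same strings and passes.
adversarial = ["x" * 10000]
assert check_rule_performance(r".*.*=.*", adversarial) is False
assert check_rule_performance(r"[^=]*=.*", adversarial) is True
```

One of the incident's actual remediations went further than testing: moving to a regex engine with guaranteed linear-time matching, which removes this failure mode entirely.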

I see from the comments that they plan a more thorough blog post - that should be good.

And here’s the full post-mortem:

It’s an excellent read, with 11 action points. It notes, rightly, that in a well-built system, we see failures only when several things go wrong at once. There are no simple failures with single root causes.

There’s a discussion here:
https://news.ycombinator.com/item?id=20421538