Postmortems

Cascade of doom: UK Covid dashboard

A major update of the database software a couple of weeks before Halloween weekend caused a failure at just the wrong time to get support from the US.

I had no time to do much at that point; it was 2:50pm already, and all I knew was that I needed to reduce the load on the database to enable the deployment of data. All I could think of at that point was to forcibly reduce the number of connections to the database to reduce the load, and so I turned off 70% of our servers. That did the trick, and we managed to deploy the first chunk of data at 3:08pm. We finished by the skin of our teeth at 3:58pm, I turned the servers back on, and we managed to release the data at 4:01pm.

The root cause turned out to be a new JIT feature that is enabled by default and that, for this system’s typical load, is hugely expensive and a net loss.

Also note that this critical national infrastructure is very under-resourced:

Again, you’re looking at it from the outside, which is always easier. We don’t always have the resources to do all the nice things that are written in textbooks. I created this service from scratch in two weeks, and have since been working on average 16 hours a day to maintain it.

As for being that close to the bleeding edge, there was a reason for it, which I outlined in my response to another comment. This service relies on modern technologies primarily because it is very demanding, and nothing remotely like it has been done in the public sector before. It is the most hit service in the history of UK Government’s online services.

via MeFi