A fix for a cosmetic problem caused data loss as a side effect; the hotfix for that side effect caused more data loss.
We knew this code was critical and, per our typical process, we had two domain experts provide code reviews for the change – but sadly, we didn’t spot the bug.
Our investigation uncovered what we thought was an impossible situation: a small number of our WorldServers had loaded without the correct configuration which fixed the corruption issues from 2.7.1. Unfortunately, anyone whose characters had been accessed using one of these out-of-date servers encountered the character-corrupting problem.
Normalisation of deviance:
Back in October, to handle increased CPU and player load for Shadowkeep’s launch, we spun up more servers than we have ever used before for this task. Running with this many servers has had some small side effects … one issue was that a small percentage (less than 1 percent) of these servers would crash on start-up due to the volume of servers overwhelming one of the backing databases. Our workaround for this was to simply manually restart the crashed servers each time we detected this issue, and this appeared to address the problem without any discernable side effects …
We have verification systems that detect these sorts of version misconfigurations, but the WorldServer crashes and subsequent manual restarts caused the servers to also skip the verification process. Prior to this morning, we had believed skipping these overrides and verifications to be impossible.
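The failure mode here is that a second start path (manual restart after a crash) bypassed the check that the normal path ran. A minimal sketch of the usual defence, assuming a single start-up entry point that every path must go through and that fails closed on a mismatch. All names (`Server`, `start`, `verify`, `ConfigMismatch`) and the version labels are hypothetical and are not taken from the actual WorldServer code:

```python
from dataclasses import dataclass


class ConfigMismatch(Exception):
    """Raised when a server's loaded configuration is not the expected one."""


@dataclass
class Server:
    # Hypothetical model of a server; not the real WorldServer.
    name: str
    config_version: str
    verified: bool = False
    serving: bool = False


def verify(server: Server, expected_version: str) -> None:
    """Check the loaded configuration against the expected version."""
    if server.config_version != expected_version:
        raise ConfigMismatch(
            f"{server.name}: loaded {server.config_version!r}, "
            f"expected {expected_version!r}"
        )
    server.verified = True


def start(server: Server, expected_version: str) -> Server:
    """Single entry point for first boot and for manual restarts alike.

    Verification state is reset on every start, so a restart cannot
    inherit or skip an earlier check. The server only begins serving
    after verification succeeds (fail closed).
    """
    server.verified = False
    server.serving = False
    verify(server, expected_version)  # raises before serving on mismatch
    server.serving = True
    return server
```

With this shape, an operator restarting a crashed server calls the same `start` function as the automated path, so a stale server raises `ConfigMismatch` instead of quietly accepting traffic.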