Also, this is not quite a post-mortem, but an explanation of how a large data loss occurred (and the personal results thereof).
In the author’s defense, there seems to have been an appalling lack of dev-ops controls in place. Engineers shouldn’t be working against a consumer-facing database, and if backups aren’t going as planned, someone should be notified before it becomes a problem.
He should never have been placed in such a position. If anyone should have resigned over the incident, it should be the person who made the decision to do dev against a production database.
That happens every day in ‘startups’. It’s the largest technical debt factor that can close a company down before it gets far.
Agreed that clearing the Users table was a terrible thing to happen, and a costly one. Making the person who did it apologize in a company-wide meeting would be even worse.
It’s a bigger failure that happened here: a failure to mentor the junior developer and make sure he knew the implications of working on a prod DB, and a failure to take adequate precautions to recover from something like this. It could have been him, or a natural disaster, or a hacker. It was just a disaster waiting to happen, in my opinion.
There are multiple fails here, none of which are the fault of the author. Human mistakes are inevitable, and the system and processes should be designed to make them really hard to do, and easy to recover from. Running untested code against the prod db is a fail, and so is the complete lack of backups for their valuable data. Telling your junior engineers to be careful with prod data is not a solution. After this incident, the more senior engineers should have apologised to the company and acknowledged that anyone could have made that mistake. Maybe the guy would still have left but damn, it was totally not his fault.
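To make "really hard to do" concrete, here is a minimal sketch of one possible guardrail: a check that refuses table-wiping SQL against a protected environment unless the caller explicitly confirms it. Everything in it (the `check_statement` helper, the `UnsafeStatement` exception, the environment name `"production"`) is hypothetical and not taken from the incident being discussed. The pattern matching is deliberately crude, so treat it as an illustration of the idea, not a substitute for separate environments, restricted credentials, and tested backups.

```python
import re

# Hypothetical environment names; adjust to whatever your deployment uses.
PROTECTED_ENVS = {"production"}

# DROP and TRUNCATE are always treated as destructive.
_DESTRUCTIVE = re.compile(r"^\s*(drop|truncate)\b", re.IGNORECASE)
# DELETE and UPDATE are flagged only when no WHERE clause appears anywhere after them.
_UNSCOPED = re.compile(
    r"^\s*(delete|update)\b(?!.*\bwhere\b)", re.IGNORECASE | re.DOTALL
)


class UnsafeStatement(Exception):
    """Raised when a destructive statement targets a protected environment."""


def check_statement(sql, env, confirmed=False):
    """Raise UnsafeStatement for table-wiping SQL on a protected environment.

    Returns None when the statement is allowed. Passing confirmed=True is the
    explicit, deliberate override.
    """
    if env not in PROTECTED_ENVS or confirmed:
        return None
    if _DESTRUCTIVE.match(sql) or _UNSCOPED.match(sql):
        raise UnsafeStatement(
            f"Refusing to run destructive statement on {env!r}: {sql.strip()!r}"
        )
    return None
```

A caller would run `check_statement(sql, env)` before handing the statement to the database driver, so that something like `DELETE FROM users` fails loudly on production while the same statement with a `WHERE` clause, or run against a development database, goes through.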