We’re getting ready to review and redesign our internal post-mortem process at work. If you had the opportunity to do a refresh on your process, what questions, activities, and/or deliverables would you want to ensure were part of your new way of doing things? Why?
I would fix the engineering processes
Hmm, on re-reading, I was unclear.
“If you could do a refresh on your post-mortem process . . .”
The key is to make the goals clear. A good post-mortem should identify root cause(s) and ways to improve. It should not be an exercise in finger pointing or assigning blame. This ensures that you get honest answers rather than ass-covering.
Try drawing some inspiration from this:
Keep it blameless, and use a really good template so you don’t forget things that should be included.
Summary, timeline (don’t forget timezones) including clear markers of the start and end of the outage/impact, what went well and what went poorly, action items, and any appendixes (full chat logs, system logs, that sort of thing).
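If it helps to make that template concrete, here is a rough sketch of it as a Python data structure. The class and field names (`PostMortem`, `TimelineEntry`, `missing_sections`, and so on) are my own invention for illustration, not any standard. It assumes you want to reject timestamps without a timezone and treat appendixes as optional.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


def require_timezone(moment: datetime) -> datetime:
    """Reject naive timestamps so nobody has to guess the timezone later."""
    if moment.utcoffset() is None:
        raise ValueError(f"timestamp {moment!r} has no timezone")
    return moment


@dataclass
class TimelineEntry:
    at: datetime  # must be timezone-aware
    note: str

    def __post_init__(self) -> None:
        require_timezone(self.at)


@dataclass
class PostMortem:
    summary: str
    impact_start: datetime  # clear marker: when the impact began
    impact_end: datetime    # clear marker: when the impact ended
    timeline: list[TimelineEntry] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)
    appendixes: list[str] = field(default_factory=list)  # optional: chat logs, system logs

    def __post_init__(self) -> None:
        require_timezone(self.impact_start)
        require_timezone(self.impact_end)
        if self.impact_end < self.impact_start:
            raise ValueError("impact_end is earlier than impact_start")

    def impact_duration(self) -> timedelta:
        # Aware datetimes subtract correctly even across different timezones.
        return self.impact_end - self.impact_start

    def missing_sections(self) -> list[str]:
        """Names of the required sections that are still empty."""
        required = ("timeline", "went_well", "went_poorly", "action_items")
        return [name for name in required if not getattr(self, name)]


if __name__ == "__main__":
    # Hypothetical incident: start logged in UTC, end logged in UTC+2.
    draft = PostMortem(
        summary="Checkout API returned errors",
        impact_start=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
        impact_end=datetime(2024, 1, 1, 13, 0, tzinfo=timezone(timedelta(hours=2))),
    )
    print(draft.impact_duration())
    print(draft.missing_sections())
```

The point of `missing_sections` is just the "so you don’t forget things" part: run it before the review meeting and you can see which sections nobody has filled in yet.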
I’ve been encouraging clients to ask how they could modify systems, processes, or personnel to detect, measure, and respond to future instances of the incident(s) being reviewed, and then to desk-check how the incident might have been handled if those suggestions had been in place. Documenting what happened (and why) is important and necessary, but it’s not sufficient. My experience is that when a team goes through this type of simulation, they’re better able to deal with subsequent incidents. As a side note, I try to have the most junior person on a team present as a second scribe for post-mortems. They can ask the most wonderful and revealing questions.