2013-08-14

Don't play the blame game

An article I dug up by the guys behind the "Etsy" marketplace website struck a chord with me. In it they explain why blaming engineers for making mistakes is perhaps the worst thing you could do. They express it in terms of a vicious cycle, where the key steps are:

1. Engineer takes an action that contributes to a failure or incident.
2. Engineer is punished, shamed, blamed, or retrained.
3. Reduced trust between engineers on the ground (the "sharp end") and management (the "blunt end") looking for someone to scapegoat.
4. Engineers become silent on details about actions/situations/observations, resulting in "Cover-Your-Ass" engineering.

CYA engineering leads, as night follows day, to a workplace where no-one wants to take responsibility for doing anything. This works (for some definition of "works") well in a government environment, but for a business where innovation and change are key to survival it's a rapidly fatal affliction.

All well and good, but what's the alternative? Should you just let engineers make mistakes willy-nilly with no consequence? If that were acceptable, you could even hire PPE graduates for engineering jobs and save yourself the social dysfunctions of real engineers.

The solution adopted by Etsy (and very few other engineering organisations I've encountered) is a culture of blameless post-mortems. Whenever something goes wrong with significant impact - the website goes down, sales are lost, software runs amok - then, once the immediate incident has been dealt with, everyone will expect a post-mortem to be written. In somewhere like an investment bank this would traditionally be written by a manager who avoids technical detail and seeks to blame anyone and everyone but his own team; this is not helpful. Instead, a good post-mortem culture requires the engineer closest to the incident to write up the post-mortem, and ideally post it for circulation and discussion within a small number of days. The post-mortem should detail what went wrong and what actually happened - ideally incorporating a timeline, relevant fragments of IM and email discussions, and pointers to logs and graphs - how it was resolved, and most importantly the actions that need to be taken to prevent this kind of problem happening again.

It's fine in the post-mortem to identify people who made incorrect decisions; indeed, it's expected that in a stressful, time-pressured and unusual situation engineers and others will make bad calls. What isn't acceptable in the blameless post-mortem is to stop the analysis there: "Fred decided to repush the old version of the binary, which ended up breaking all customers, not just the 1-2 originally affected." Instead we ask ourselves: why did Fred do this? Did he have bad information about the problem? In that case the system monitoring may need improvement. Was he following an out-of-date playbook instruction? In that case someone needs to bring the playbook up to date. Was he too inexperienced to realise the consequences of what he was doing? In that case, perhaps there should be a minimum level of experience for on-call engineers in charge of an incident. What you can't say is "Fred did this because he's an idiot and should be fired."

John Allspaw from Etsy explains why this approach works:

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error.

I've seen spirited discussions over post-mortems, but crucially they are not about blame - or, when the discussion starts to veer in that direction, a senior engineer steps in to put it back on track. Good engineers hate seeing the same mistake happen again and again. A post-mortem lets the team understand, without fear of facts being concealed by blame-avoidance, what went wrong, and puts the team in a good position to make a fix. And if the fix is ineffective and another outage occurs, there's the original post-mortem to include in the analysis: "Why did we make the wrong diagnosis? What did we miss in our analysis of the right fix?"

Now if engineers can do this and make it work - and it seems to work well for Etsy - is there any reason we can't incorporate the same approach into government? When politicians, policy advisors or other policy makers screw up, how about we get them to describe what went wrong, why, and what can be done differently in future to prevent the same thing happening again? Of course, that would require a politician to admit to being wrong in the first place, so I'm not holding my breath.
