
Who Broke Prod? DevOpsDays London 2018

My talk at DevOpsDays London 2018 was inspired by a conversation with one of my previous teams the morning after I’d spent a long weekend on call. Instead of searching for the “human” cause of a production outage, we should focus on facts, truthfulness and accountability. I wanted to share with the world some actionable improvement katas we can all practice to help us prepare for and embrace failure.

Self-Defence

Failure is a form of feedback on the work we’ve been busting our gut for – and most humans aren’t well-equipped to respond positively to poor feedback. We tend to roll ourselves into a little conceptual ball and put up our emotional defences.

The most alarming thing about our human response to failure is our instinctive need to seek out a single root cause, to pinpoint the failure and to deflect the focus from ourselves. It’s a human self-defence mechanism to try and find someone else, or some external factor, to blame so that we escape any punishment that might accompany the blame or failure. Apportioning blame is a survival instinct. When we think that someone might feel we are to blame, we curl up into our little armadillo ball, put up our barriers and stop collaborating. That’s when small failures can become big failures, and when visibility, honesty and accountability start to suffer, preventing us from learning from our collective failures.

We need to learn to be comfortable lowering our defences, and to stop seeking blame.

Incident

Those self-defence mechanisms quickly come into play during an outage – if we’re worried about putting a foot wrong then we tend to shut down and stop communicating in case we say or do something that will get us in trouble. It may seem counter-intuitive but one of the ways to try and get others to trust us is to do EXACTLY the things that we think will expose us to punishment. Starting with transparency…

During a major incident you need to share. Even over-share. Be brutal. Don’t be afraid to declare quite how bad the situation is. If you’re worried that you’re going to get into trouble, you tend to try and disguise the scale of the problem but if you’re honest with yourself, and others, people can see and assist. Be honest now so that nobody can accuse you later.

Post-Mortem

In my humble opinion, if the only place in your process where you’re actively trying to eliminate blame is AFTER the event, then there’s something very wrong – growing a blameless culture starts well before anything goes boom. That said, the incident post-mortem can be one of the most uncomfortable experiences in our business and is definitely the place in which blame and finger-pointing can be most obvious; so it’s a good starting point if you want to start building a less accusatory culture in your workplace.

If you’re interested in conducting blameless post-mortems then I highly recommend the book Beyond Blame by Dave Zwieback. It’s a short read, written as a story, a bit like The Phoenix Project. It’s designed to show you that it is possible to turn around behaviours so that we all stop hunting for someone or something to blame and instead start to focus on facts and actions. Dave Zwieback lays out something called the Learning Review Framework; simply renaming this event as a “Learning Review” instead of a post-mortem helps us to reset our expectations.

Visibility

If we are able to see and spot failure early then we start to learn patterns that lead to failure. And when we visualise failure publicly, we learn to get used to that uncomfortable feeling of having failed. Those people who seek to blame others quickly change their tune if the things they are complaining about are perfectly visible for all to see.

At a very basic level, visualising your production system with simple health monitors and metrics will help us to spot failure and will get us used to seeing it right there in front of our eyes, and we’ll get used to that feeling of disappointment when the systems we’ve poured our heart and soul into go boom.

So how do we decide what metrics, charts and health monitors to visualise on our walls to help us spot failure? Well, there’s a whole host of metrics and monitoring tooling available that will allow us to create all the pretty charts that our hearts could possibly desire – that, in itself, can be dangerous. If we build SLA charts and complex metrics graphs based only on what we’ve learned from incidents that have happened in the past, we might well miss the scary stuff that we couldn’t possibly predict. Instead, I’d recommend building charts and indicators around your normal behaviour. Go and visualise what “normal” looks like in your system. Build charts that monitor the normal behaviour of a service, or the normal response patterns of an API. Go chart the normal system usage. Or normal storage size. Understand how data normally moves through your system, from the load balancers, through different services, through queues, and data stores… map it out and build charts that visualise, for everyone to see, the everyday normal stuff happening in your world. Once your developers know what normal looks like for their services, and when your ops teams know what normal looks like for the system stats, when your support team know what a normal day looks like in the system… then you’re all equipped to spot failure early and work together to fix it.
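To make “chart what normal looks like” a little more concrete, here is a minimal sketch of one way to do it. It is not taken from the talk: the function names, the sample readings and the three-standard-deviation tolerance are all assumptions chosen for illustration. The idea is simply to summarise an ordinary day of readings for a metric, then flag anything that strays a long way from that summary.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarise what "normal" looks like for one metric (hypothetical helper)."""
    return {"mean": mean(samples), "stdev": stdev(samples)}


def is_abnormal(value, baseline, tolerance=3.0):
    """Return True when value sits more than `tolerance` standard
    deviations away from the baseline mean. The default of 3.0 is an
    assumption, not a recommendation from the talk."""
    spread = baseline["stdev"]
    if spread == 0:
        # A perfectly flat baseline: any change at all is unusual.
        return value != baseline["mean"]
    return abs(value - baseline["mean"]) > tolerance * spread


# Made-up requests-per-minute readings from an ordinary day.
normal_day = [118, 122, 120, 119, 121, 123, 117, 120]
baseline = build_baseline(normal_day)

for reading in (121, 40):
    label = "unusual" if is_abnormal(reading, baseline) else "normal"
    print(f"{reading} requests/min looks {label}")
```

A real monitoring tool will do something far richer than this (daily and weekly patterns, for a start), but the principle is the same: the baseline comes from everyday behaviour, not from the last incident.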

Reward

If you punish people for making mistakes, you teach them that it’s not OK to try… and fail, and worse still, you’re teaching them to hide the mistakes they do make – punishment in the event of failure causes us to lose the honesty and transparency we’re working so hard to build. If we want to work in a company where innovation and creativity thrive, we need to help build an environment in which people feel safe to try new things. Punishment stifles creativity. Nobody is going to step up and try to find a new but risky way of solving a complex problem if they’re worried how people will react if they get it wrong – we need the psychological safety.

So how do we unknit that sort of behaviour and encourage our colleagues, our senior leaders and key stakeholders to reward the good things we do, rather than punishing us for the mistakes we might make along the way? How can we teach our managers and leaders that it is OK to fail… but that if we do consistently make mistakes, then perhaps we need some coaching to improve, rather than to be side-lined or bypassed for future opportunities? I think the answer lies here. With all of us. We can all start to make a difference by setting an example and displaying the behaviours we expect to see in others. You don’t need to be a leader to start rewarding good behaviours and building a culture in which those around you feel safe.

Many, many thanks to the DevOpsDays London organisers for hosting an inclusive, thought-provoking conference with a delicate mix of tech and culture, and for offering me an opportunity to face a fear by making my public-speaking debut.
