DevOps Metric: Mean time to recovery (MTTR) Definition and reasoning

Important to: VP Operations, CTO
Definition: When a production failure occurs, how long does it take to recover from the issue?
How to measure: Varies between systems. A common metric is the average production downtime over the last ten downtimes.
 

Expected outcome: MTTR should become lower and lower as DevOps maturity grows.
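
As a minimal sketch of the "average over the last ten downtimes" measurement described above: the Downtime record, its field names, and the sample data below are assumptions made for illustration, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Downtime:
    """One production outage (hypothetical record shape)."""
    started: datetime    # when the failure began
    recovered: datetime  # when service was restored


def mean_time_to_recovery(downtimes, window=10):
    """Average outage duration over the most recent `window` downtimes."""
    # Keep only the latest `window` outages, ordered by start time.
    recent = sorted(downtimes, key=lambda d: d.started)[-window:]
    if not recent:
        raise ValueError("no downtimes recorded")
    total = sum((d.recovered - d.started for d in recent), timedelta())
    return total / len(recent)


# Made-up sample data: three outages lasting 30, 10 and 20 minutes.
sample = [
    Downtime(datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),
    Downtime(datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 14, 10)),
    Downtime(datetime(2024, 3, 20, 22, 0), datetime(2024, 3, 20, 22, 20)),
]
print(mean_time_to_recovery(sample))  # 0:20:00
```

Tracking this number over time is what lets you check the expected outcome: the value should trend downward as the organization matures.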

MTTR vs Mean time to Failure
To understand MTTR, we have to first understand its evil brother: MTTF or "Mean time to failure" which is used by many organizations' operational IT departments today.

MTTF means "How much time passes between failures of my system in production?".
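
For comparison, here is a rough sketch of how MTTF could be computed from the same kind of outage history. It assumes each outage is a hypothetical (failed_at, recovered_at) pair and measures the uptime from one recovery to the next failure; the sample data is made up.

```python
from datetime import datetime, timedelta


def mean_time_to_failure(outages):
    """Average uptime between consecutive outages.

    `outages` is a list of (failed_at, recovered_at) datetime pairs
    (an assumed shape, used only for illustration).
    """
    ordered = sorted(outages)
    if len(ordered) < 2:
        raise ValueError("need at least two outages to measure time between failures")
    # Uptime runs from the recovery of one outage to the start of the next.
    uptimes = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(uptimes, timedelta()) / len(uptimes)


# Made-up history: 10 days of uptime, then 20 days of uptime.
history = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 1, 0)),
    (datetime(2024, 1, 11, 1, 0), datetime(2024, 1, 11, 2, 0)),
    (datetime(2024, 1, 31, 2, 0), datetime(2024, 1, 31, 3, 0)),
]
print(mean_time_to_failure(history))  # 15 days, 0:00:00
```

Note that this number says nothing about how long each outage lasted, which is exactly the gap MTTR fills.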

Why is MTTR more valuable?

There are two arguments in favor of giving MTTR a higher priority than MTTF (we don't want to ignore the fact that we fail often, but it's not as important as MTTR).

Perception.
If Amazon.com went down once every three years, but took a whole day to recover, consumers wouldn't care that the issue had not happened for three years. All anyone would talk about is the long recovery time. But if Amazon.com went down three times a day for less than one second each time, it would barely be noticeable.
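
To put rough numbers on the comparison above, here is a small sketch that totals the downtime per year for each scenario. It assumes a 365-day year and treats "less than one second" as a full second (an upper bound); the function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: ignore leap years


def downtime_seconds_per_year(failures, over_years, seconds_per_failure):
    """Total downtime per year for a given failure rate and recovery time."""
    return failures * seconds_per_failure / over_years


# Scenario 1: one failure every three years, a whole day to recover.
rare_but_slow = downtime_seconds_per_year(1, 3, SECONDS_PER_DAY)

# Scenario 2: three failures a day, at most one second to recover.
frequent_but_fast = downtime_seconds_per_year(3 * DAYS_PER_YEAR, 1, 1)

print(f"rare but slow:     {rare_but_slow / 3600:.2f} hours per year")
print(f"frequent but fast: {frequent_but_fast / 60:.2f} minutes per year")
```

Under these assumptions the rare-but-slow scenario accumulates about 8 hours of downtime per year, while the frequent-but-fast one stays under 19 minutes, even though it fails more than a thousand times as often.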

Wrong Incentives.
Let's consider the incentive MTTF creates on an operations department: The less often failure happens in the first place, the more stable your system is, the more bonus you get at the end of the year in your paycheck.

What's the best way to keep a system stable? Don't touch it!

Avoid any changes to the system and release as little as possible, in a very controlled, waterfall-ish manner. Make releasing so painful that only someone who really, really wants to release will go through all the trouble of doing so.

Sound familiar?

This "Stability above all else" behavior goes exactly against the common theme of DevOps: release continuously and seize opportunities as soon as you can.

The main culprit for breeding this anti-agile behavior is a systemic influence problem: what we measure pushes people toward behavior that hurts the organization. (You can read more about this in my book "Elastic Leadership", in the chapter about "Influence Forces"; here is a blog post that talks about them in more detail.)

Developers can't rest on their laurels and claim that operations are the only ones to blame for slowing down the continuous delivery train here. Developers have their own version of the "Stability above all else" behavior, which can often be seen in their reluctance to merge their changes into the main branch of source control (where the build pipelines get their main input from).

Ask developers if they'd like to code directly on the main branch, and many in enterprise situations will tell you they'd be afraid to do so for fear of "breaking the build". Developers are trying to keep the main branch as "stable" as possible so that the version going off to release (which is in itself a long and arduous process, as we saw before) has no reason to come back with quality issues.

"Breaking the build" is the developer's version of "mean time to failure", and again, the incentives from management are usually the culprit. If middle managers tell their developers it's wrong to break a build, then they are instilling exactly the same fear that operations has. The realization that "builds are made to be broken" is a bit tough to swallow for developers who fear that they will hold up the entire pipeline and all the other teams that depend on them.

Again, the same thought here applies: Failure is going to happen, so focus on the recovery aspect: how long does it take to create a fix for something that stops the build? If you have automated unit tests, acceptance tests and environments, and you're doing test-driven development, fixing an issue that stops the pipeline can usually be a matter of minutes: code the fix, along with tests, see that you didn't break anything, and check it into the main branch.

Both operations and development have the same fear: don't rock the boat. How does that fit in with seizing opportunities as quickly as possible? How does it help make "mean time to change" as short as possible?


My answer is that measuring and rewarding MTTF above all else absolutely does not support a more agile organization. The fear of "breaking the build" and the urge to "keep the system stable" are among the reasons many organizations fail at adopting agile processes. They try to force "release often" down the throat of an organization that is measured and rewarded when everything stays stable, instead of being measured and rewarded on how often it can change and how fast it can recover when a failure occurs.


MTTR is a very interesting case of DevOps culture taking something well established and turning it on its head. If DevOps is about build, measure, learn, and fast feedback cycles, then it should become an undeniable truth that "whatever can go wrong, will go wrong". If you embrace that mantra, you can start measuring "mean time to recovery" instead.

The MTTR incentive can drive the following behaviors:

  • Build resilience into the operations system as well as the code.
  • Build faster feedback mechanisms into the pipeline so that a fix can go through the pipeline faster.
  • Create a system and code architecture that minimizes dependencies between teams, systems and products, so that failures do not propagate easily and deployment can be faster and partial.
  • Add and start using better logging and monitoring systems.
  • Create a pipeline that drives the deployment of an application fix into production as quickly as possible.
  • Make it just as easy and fast to deploy a fix as it is to roll back a version.

With mean time to recovery, you're incentivizing people based on how much they contribute to the "concept to cash" value stream, or the "mean time to change" number.