A Metrics Framework for Continuous Delivery

Here’s a framework I like to start with when I discuss what types of metrics can help or harm you in your journey to continuous delivery:

Lagging Indicators

At the organizational level, unit tests are usually part of a bigger set of goals, typically related to continuous delivery. If that's the case for you, I highly recommend using the four common DevOps metrics:

• Deployment Frequency—How often an organization successfully releases to production

• Lead Time for Changes—The amount of time it takes a feature request to get into production

NOTE: Many places incorrectly publish this as "the amount of time it takes a commit to get into production," which is only part of the journey a feature goes through from an organizational standpoint. If you're measuring from commit time, you're closer to measuring the "cycle time" of a feature from commit up to a specific point. Lead time is made up of multiple cycle times.

• Change Failure Rate—The percentage of deployments causing a failure in production (also tracked as the number of failures found in production per release, deployment, or time period)

• Time to Restore Service—How long it takes an organization to recover from a failure in production

These four are what we'd call "lagging indicators," and they are very hard to fake (although pretty easy to measure in most places). They are great for making sure we do not lie to ourselves about the results of our experiments.
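To make the definitions concrete, here is a minimal sketch (Python, with made-up data and field names) of how these four metrics could be computed from a simple deployment log. It illustrates the calculations only; it is not a prescription for how to instrument your own systems.

```python
from datetime import datetime

# Made-up deployment log: when each change was requested, when it reached
# production, whether the deployment caused a production failure, and how long
# it took to restore service if it did.
deployments = [
    {"requested": datetime(2019, 5, 20), "deployed": datetime(2019, 6, 1), "failed": False, "restore_hours": 0},
    {"requested": datetime(2019, 5, 28), "deployed": datetime(2019, 6, 3), "failed": True,  "restore_hours": 4},
    {"requested": datetime(2019, 6, 1),  "deployed": datetime(2019, 6, 7), "failed": False, "restore_hours": 0},
]
period_days = 30

# Deployment Frequency: successful releases per day over the period
deployment_frequency = len(deployments) / period_days

# Lead Time for Changes: request-to-production (not commit-to-production), averaged
lead_times = [(d["deployed"] - d["requested"]).days for d in deployments]
avg_lead_time_days = sum(lead_times) / len(lead_times)

# Change Failure Rate: share of deployments that caused a production failure
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

# Time to Restore Service: average recovery time over the failed deployments
restore_times = [d["restore_hours"] for d in deployments if d["failed"]]
time_to_restore_hours = sum(restore_times) / len(restore_times) if restore_times else 0.0

print(deployment_frequency, avg_lead_time_days, change_failure_rate, time_to_restore_hours)
```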

Leading Indicators

However, many times we'd like a faster feedback loop telling us we're going the right way; that's where "leading indicators" come in.

Leading indicators are things we can control day to day: code coverage, number of tests, build run time and more. They are easier to "fake," but, combined with lagging indicators, they can often provide early signs that we might be going the right way.

We can measure leading indicators at the team and middle management level, but we have to make sure to always have the lagging indicators as well, so we do not lie to ourselves that we’re doing well when we’re not.

Metric groups and categories

I usually break the leading indicators into two groups:

·        Team Level (metrics an individual team can control)

·        Engineering Management Level (metrics that require cross-team collaboration, or aggregate metrics across multiple teams)

I also like to categorize them based on what they will be used to solve:

·        Progress – used to provide visibility and support decision making on the plan

·        Bottlenecks and feedback – as the name implies

·        Quality – leading indicators indirectly connected to the quality lagging indicator (escaped bugs in production)

·        Skills – track that we are slowly removing knowledge barriers inside teams or across teams

·        Learning – are we acting like a learning organization?

 

The metrics are mostly quantitative (i.e., they are numbers that can be measured), but a few are qualitative, in that you ask people how they feel or think about something. The ones I use are:

·        On a scale of 1-5, how much confidence do you have in the tests (that they can and will find bugs in the code if they arise)? (Take the average across the team or teams.)

·        Same question but for the code – that it does what it is supposed to do.

 These are just surveys you can ask at each retrospective meeting, and they take 5 minutes to answer.

Trend lines are your friends

For all leading indicators and lagging indicators, you want to see trend lines, not just snapshots of numbers. Lines over time are how you see whether you're getting better or worse.

Don’t fall into the trap of a nice dashboard with large numbers on it if your goal is improvement over time. Trend lines tell you whether you're better this week than you were last week. Numbers do not.

Numbers without context are neither good nor bad; they only mean something in relation to last week, last month, the last release, etc. Remember, we're about change here. So trend lines are your friends.
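As a tiny illustration (the weekly numbers are made up), it's the direction of the deltas that tells the improvement story, not any single snapshot:

```python
# Made-up weekly snapshots of one leading indicator (build run time, in minutes).
weekly_build_minutes = [62, 58, 55, 57, 49, 44]

# "44 minutes" on a dashboard says nothing by itself; the week-over-week deltas do.
deltas = [this_week - last_week
          for last_week, this_week in zip(weekly_build_minutes, weekly_build_minutes[1:])]
print("week-over-week change:", deltas)  # mostly negative => trending down => improving
```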


If a Build Takes 4 hours, Run It Every 4 Hours

(Note: You might need to fix your environment bottlenecks first before being able to act on this blog post)

Builds (i.e., compile + run tests + deploy + more tests, etc.) represent a bottleneck for getting feedback on our code so we can be confident about making more changes or releases. There are two main factors here:

  • How often a build runs (Per commit, hourly, weekly etc..)

  • How long a build takes to finish

Many companies are also in the process of reducing reliance on manual, repetitive QA cycles, but still require some manual tests or verifications to be part of the process while the transition is happening (while building out more automated tests, for example, or if there are some types of tests which currently don't make sense to automate). So we'll add one extra factor here:

  • When the manual verification takes place

Sometimes the manual verification can only take place after the build has been finished due to deployment requirements, environment requirements or a requirement for the built binaries to be available. This means the build acts as one of the constraints on our process.

“How Often?” is the new constraint

We can assume the following based on Theory of Constraints:

  • The longer time passes between each build, the more changes are integrated in each build from various team members

  • This means the batch size for changes is larger, which means:

    • we need to test/verify more things - manual testing takes longer, thus making the feedback loop longer

    • There are more compounded risks due to more changes, which makes everyone more nervous and more likely to ask for more verification to be done

What can we do to reduce the batch size of each build, and get a faster feedback loop?

The immediate answer is “run the build more often”. But, how often can that be?

The anti-pattern I see many teams fall into is the following line of thinking:

Because the build takes a long time (4 hours), let’s only run it at night or once a week

But the thinking should be reversed:

Because the build takes 4 hours, let’s run it every 4 hours

What do we get by that?

  • The feedback loop time is reduced to between 4 and 8 hours, instead of 24 hours on a daily build or a few days on a weekly build. Assuming a developer is making changes right now, they can get on the next “build train” that will start in a couple of hours and end 4 hours later. It’s not perfect, but it’s much, much better than waiting until the next day or the end of the week, or having to wait until 4pm to push that important change to the master branch.

  • We always have a “latest stable build” which is the last green build that happens every 4 hours. Assuming this also deploys to a dynamic demo environment, it means we always have a fresh-enough demo to show, and if not, just wait 4 hours.

  • The batch size is much smaller now: how many changes can fit into 4 hours? Fewer than can fit into 24 hours, that’s for sure. Which means verification can be much faster too.

This also means that if I need to manually verify an issue I fixed, I don’t have to wait until tomorrow to verify it as part of an integrated build; I only have to wait 8 hours as a worst-case scenario (and it's the same day half the time!).

“How Long Does it Take?” is the next constraint

Now (and only now, after we’ve fixed the bigger constraint of when the build runs) that we have reduced the time between builds, the build's run time is the next largest constraint keeping the feedback loop from shortening.

Every change we make to shorten the build time (by parallelizing tests or steps, for example) now has a compounded effect: The build time is our feedback cycle time. If the build now takes only one hour instead of four, we can run the build on an hourly basis.

So, manually verifying a change I made, as part of an integrated build, can now happen within 1-2 hours after making the change, committing and pushing it. Definitely same-day.

What if the build takes 30 minutes? You get the idea. The loop is now 30-60 minutes long.
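The arithmetic is simple enough to sketch. A minimal illustration (using the numbers from this post) of the worst-case feedback loop for a change that just misses a build:

```python
def worst_case_feedback_hours(build_hours, hours_between_builds):
    # A change pushed just after a build starts waits for the next build to
    # begin, and then for that build to finish.
    return hours_between_builds + build_hours

print(worst_case_feedback_hours(4, 24))    # 4-hour build, once a day  -> 28 hours
print(worst_case_feedback_hours(4, 4))     # 4-hour build, every 4h    -> 8 hours
print(worst_case_feedback_hours(1, 1))     # 1-hour build, hourly      -> 2 hours
print(worst_case_feedback_hours(0.5, 0.5)) # 30-minute build           -> 1 hour
```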

Environments could be the original constraint (bottleneck)

This whole conversation might not even be feasible for you yet, because of the following issue.

Sometimes teams are forced to run the build less because they don’t want to deploy to a static environment while someone else might be using it. I wrote about this before: Static environments are a huge bottleneck.

So a huge factor here is your ability to have an environment per build that gets destroyed after a specified time or other constraint. The good news is that if you’re using docker and things like Kubernetes, this ability is 5 minutes away with K8s namespaces that are dynamically generated per build.
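As a rough sketch of what that can look like (Python driving kubectl through subprocess; it assumes kubectl is installed and authenticated against your cluster, and that your CI server exposes a unique build identifier, read here from a hypothetical BUILD_ID variable):

```python
import os
import subprocess

# A throwaway namespace name derived from the current build's identifier.
namespace = f"build-{os.environ.get('BUILD_ID', 'local')}"

# Create the per-build namespace...
subprocess.run(["kubectl", "create", "namespace", namespace], check=True)
try:
    # ...deploy the application into it and run the tests here (details omitted)...
    pass
finally:
    # ...and destroy it when the build finishes, so no static environment lingers.
    subprocess.run(["kubectl", "delete", "namespace", namespace], check=True)
```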

99% of static environments should be a thing of the past, and we need to know to ask for this from IT departments that keep the old ideas and simply implement them on new tools that enable newer ways of working.

You might have to fix this issue, or at least create one static, automation-only environment that humans do not touch, so you can open up this constraint and, once it has been solved, start working from the top of this article.


Co-Ops: Enabling True Continuous Delivery with Cooperative Pipelines & Processes (Title Feedback Needed)

“Co-Ops” is a term that one of the attendees at a talk I was giving about cooperative pipelines came up with. I wish I knew his name. I asked him to email me afterwards so I could credit him, but he never did.

Co-Ops describes the act of working with a cooperative pipeline: a pipeline that every expertise contributes to (Dev, Test, Ops, Sec, Compliance, etc.). A SINGLE cooperative pipeline, with humans removed from tactical pipeline decisions.

I find it a really great naming scheme, and I think I’ll use it to title my book “Co-Ops: Enabling True Continuous Delivery with Cooperative Pipelines & Processes.”

I’m still not fully “sold” on the full title. If you have ideas, I’d love to hear them. Your thoughts are welcome!


Continuous Delivery Israel Meetup - First Meeting June 24 2019

I’m pleased to announce that we will have the first ever meeting of the Continuous Delivery Israel meetup, at the AWS Floor28 offices in Tel Aviv, on June 24th, 2019, at 6PM.

The Group’s page is located at CDIsrael.com



If you would like to be a speaker or sponsor, contact me.

AGENDA:
-------------
18:00-18:30 Networking & Pizza
18:30 - 18:40 Goals for CD Israel + Logistics
18:40 - 19:20 Ant Weiss: Optimizing the Delivery Pipeline for Flow
19:20 - 19:30 Break
19:30 - 20:15 TBD - We are looking for one more speaker!

ABOUT:

Continuous Delivery is much more than simply Dev + Ops. It is also tech leads, compliance, security, project managers, scrum masters, architects, and everything in between - let's learn from each other!

We deal with TECHNIQUES, PRACTICES, PROCESSES (and sometimes tools) that enable true continuous delivery - in large and small companies.

We also discuss influencing such change in large and growing organizations.

Bottlenecks & Pipelines: We feel pipelines are the core engines that can drive a software company; that Dev, Ops, Test, Sec and compliance should all be working together around common automated pipelines; and that we should look at the processes and metrics that push for and against adoption.

Challenges and insights: We will learn from each other about challenges and techniques that enable true continuous delivery.

Great for Ops, Dev, Security, compliance and agile leadership folks (PM, scrum masters, etc)


"Pipeline Driven" vs. "We have Jenkins Jobs"

High performing and continuously delivering organizations such as Netflix, Amazon, Google, Wix and others don’t just have pipelines. They don’t just automate stuff and put jobs on it. That’s the easy part.

Many organizations I have consulted for also have pipelines: Jenkins jobs, TeamCity projects, TFS configurations. But they don’t perform nearly as well, as fast, or as consistently (and they are not happy with their current status).

Places like Netflix and Amazon don’t just have automated pipelines. They are driven by pipelines.

What does that even mean compared to “traditional” pipeline approaches?

It means important decisions are made in the pipelines with little to no human intervention. These places trust their pipelines to make decisions for them instead of making humans the bottleneck.

In many places, humans take on approving things like:

  • Merges between branches and pull requests

  • Deployment to environments

  • Setting up or killing environments

  • Enabling or disabling features

  • Pushing a feature to production (or other environments)

  • Security

  • Compliance

  • Configuration

  • Load balancing, scaling up or down

  • Tests passing or not

  • Builds passing or not

  • and more..

Companies can have humans make all these decisions and still be fully surrounded by a bunch of pipelines (automated jobs) that they trigger when they decide the time is right.

Here lies the difference between a pipeline driven organization and a traditional one. To be pipeline driven we remove the onus of approval from humans, and give it to pipelines.

This requires several things:

  • That we trust a pipeline enough to let it make good decisions

  • This means we need to teach it to make good decisions by adding important steps to it in the form of various automated tests, security tests, compliance tests etc.

  • That we reframe questions from “did X say it’s OK” to “Did the pipeline pass?”

  • That the various knowledge silos in the organization shift their work towards mainly supporting and propping up pipelines, related to their area of knowledge, that can make good decisions:

    • Ops folks will, among other things, help create infrastructure pipelines, infrastructure tests, environment configurations in code that are run in pipelines

    • Security folks will, among other things, help create security tests and specs that can be checked by automated pipelines and run by each team as needed.

    • Test folks will help developers learn how to write automated tests at various levels for the pipelines, and help with the test infrastructure

    • Developers will teach the pipeline (with the help of ops) how to deploy the application, which tests to run, and to do it per commit.

    • Everyone will write code that will run in a pipeline.

    • Everyone will slowly reduce their decisions and slowly let the pipeline make more and more of these decisions.

At the end of the day, what separates the Netflixes of the world from the rest is a simple truth: they worked really hard to create pipelines that they trust enough to make the hard decisions, and they use them continuously without waiting for humans in the process.
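As a small, hedged illustration of what "letting the pipeline decide" can look like, here is a minimal sketch of a gate step. The numbers and thresholds are made up; the point is only that the go/no-go decision is encoded in the pipeline rather than handed to a person:

```python
import sys

# In a real pipeline these numbers would be parsed from the reports produced by
# earlier steps; they are hard-coded here only to keep the sketch self-contained.
test_results = {"passed": 412, "failed": 0}
security_scan = {"critical_findings": 0}

# The pipeline, not a person, decides whether the deploy stage may run.
if test_results["failed"] > 0 or security_scan["critical_findings"] > 0:
    sys.exit("Gate failed: not deploying.")  # non-zero exit stops the pipeline

print("Gate passed: the deploy stage may proceed.")
```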


Delivery Pipelines and Discovery Pipelines

In a previous post I discussed the notion of having two separate pipelines. I want to revisit this with a bit more detail.

When we’re talking about Continuous Delivery, we’re very interested in the speed of the feedback loop. When you’re expecting to merge into the master branch many times a day, a 5 minute difference in the run of a pipeline is crucial to making things flow productively and not waiting too much.

People tend to create two types of pipelines to deal with this issue: Fast pipelines and slow pipelines. They’ll use the slow pipeline to run long running e2e tests for example.

That really only slows down the rate of delivery. If those e2e tests are so important that you won’t deliver until you know their results, you’ll have to wait 24 hours to deliver, regardless of how small the changes are.

Delivery Stoppers

We can use the term Delivery Stoppers to create a logical bucket where we can put all things that we feel we cannot deliver unless they are “green”. e2e tests usually fit that bucket. So do unit tests, security tests, compilation steps etc.

Discovery Pipelines

What doesn’t fit that bucket? Code complexity is useful to know, but it usually doesn’t prevent us from delivering a new version when we know we need to fix some complicated code. Performance (under a specific threshold) is a useful metric, but should it break our build to know that our performance has increased by 1%? Or should it tell us that we need to add a backlog item to take care of that issue?

If those things aren’t delivery stoppers, let’s call them discoveries. We want to know about them, but they should not be a build breaker.

So now let’s consider our pipelines from that point:

We can create Delivery pipelines

  • that run all things that can prevent delivery if they fail.

  • That end with a deployment of a deliverable product

  • That give us confidence we didn’t screw up badly from a functional stand point.

  • Has to run fast (even if it has slow tests - we’ll have to deal with making it run fast enough)

We can create a discovery pipeline

  • that runs all the tasks that result in interesting discoveries and KPIs

  • Gives us new things to consider as technical debt or non functional requirements

  • Can result in new backlog items

  • Does not result in deployment.

  • Feeds a dashboard of KPIs

  • Can take a long time (that’s why it’s separate)

Here’s an example of such pipelines:

(Diagram: Delivery Pipeline and Discovery Pipeline)
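In text form (the stage names are illustrative only), the split might look something like this:

```python
# Illustrative only: the two pipelines expressed as ordered stage lists.
delivery_pipeline = [
    "compile",
    "unit tests",
    "API tests",
    "e2e tests",            # slow, but still a delivery stopper
    "security tests",
    "deploy",               # ends with a deployed, deliverable product
]

discovery_pipeline = [
    "code complexity analysis",
    "long-running load/performance tests",
    "dependency/license audit",
    "publish KPIs to dashboard",  # ends with numbers, not a deployment
]
```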

I know I said that every 5 minutes on a delivery pipeline makes a huge difference - so how come I’m willing to put long running e2e tests there?

Because they are delivery stoppers. We have to have them running per commit.

What do we do about the fact they are running slowly?

  • We can run the tests in parallel on multiple environments/build agents

  • We can split the tests into multiple test suites and run them in even more parallelized processes on multiple environments.

  • We can optimize the tests themselves

  • We can remove unneeded/duplicated tests (if they already exist at a lower level such as unit test, API tests etc.)

But we do not move them into the nightly pipeline. We deal with the fact that our delivery pipeline takes a long time, instead of enabling the problem to go any further.

Delivery pipelines and discovery pipelines - I think I can live with that.


Red Pipelines and Build Whisperers: Getting to a Trustworthy Noise Free Test Pipeline

People wonder why it takes such a long time to transform the way a large organization works. Here’s just a small nugget, a small tip of a small iceberg, in a small corner of a large building in a large state, filled with IT folks trying to get their jobs done.

And they have a nightly test pipeline.

And nobody cares about it.

I already explained why this might be an anti pattern, but another thing that usually happens is that the nightly pipeline is not trustworthy:

  • There is a high noise ratio. In many places, the “nightly” is almost always red, and it takes a “specialist” (usually from the QA department) to “decrypt” the status of the build and tell developers if everything is actually OK or not. I like to call these pipelines “Red unless Green” to signify that they are red by default, and green almost by accident, if at all.

  • Because there is a high noise ratio - almost everyone disregards the results of the build, and so even if the build fails and finds real issues, they will be ignored and dismissed.

  • Many of the tests in the pipeline are fragile. They fail sometimes, and sometimes they don’t - but almost every build has some failing tests. Which leads to all the points above.

When we have that kind of situation there are several things we might want to consider moving towards:

  1. Getting the pipeline to become trustworthy: a.k.a “Green unless Red”. Right now it is at “Red unless Green”.

  2. Getting the pipeline to run fast enough so that we can execute it on each commit instead of at night.

  3. Merging the nightly test pipeline into the Delivery per commit pipeline.

First, I want to make sure we understand how many risks are involved in “Red Unless Green” (RUG) pipelines:

RUG (Red Unless Green) pipeline risks:

  1. Tests may indicate a real issue but everyone dismisses the results because the build is always red.

  2. Green tests are also ignored - so any confidence benefits we might get to help us run faster are dismissed. This often leads to manual testing, since we cannot trust the nightly build, wasting even more time.

  3. Test code goes unmaintained since nobody cares about the tests.

  4. New tests are not added, or added as “lip service” since nobody cares about the feedback the build provides: both wasting time and providing no value.

Let’s add Build Whisperers to those risks.

The Build Whisperer

If you do have a person that is looking at the tests and can decipher their output to tell if they are actually in a good state, we call that person a “Build Whisperer” - for no one else can understand the build the way that person does.

That person is also a human bottleneck, acting as a manual dashboard for the team. This presents several risks as well:

  • This person can take a while to decipher the results of the build before communicating status

  • The person can also make mistakes trying to understand the results

  • If the person is not available or working on other tasks - status is delayed, sometimes by days or even weeks.

  • If it is one person’s job to understand the build, then it’s nobody else’s job (“not my job” syndrome). This allows everyone else to dismiss responsibility for the build status.

  • If only one person watches the build, only one person cares about it. This might seem too much like the last point, but let me put it this way: if many people watch the build, many people slowly start caring about it more and more.

To get over these risks, we decide, as a team, that the build will provide the final say on whether the product is deliverable or not.

It then becomes our job to make the build robust enough and trustworthy enough that we want to listen to it and deliver based on its results, and not based on a human making a decision about manual test cases.

What can we do?

  1. Configure the nightly build to trigger as many times a day as possible (if it takes 5 hours to run, execute it 4 times a day). Don’t put it on the developer’s wall yet. This is just to increase the feedback loop speed as we fix the build.

  2. DELETE or IGNORE all red tests.

  3. Run the build if it is not running yet.

  4. If there are red tests, repeat 1-3.

  5. If the build has been green for three days we have achieved a BASELINE GREEN BUILD.

  6. At this point we can show the build on the Team’s Wall

  7. At this point we can treat red tests with actual work to fix the tests or the bugs they discover. We can finally feel worried about red tests, and gain more confidence when the tests are green.

  8. We can slowly add back ignored tests that we deem necessary, adding them one by one, and fixing them in the process.

  9. Continue adding new and old tests to the build incrementally to gain more confidence (not just E2E tests. Use a test recipe.)

  10. Make the build run faster (parallelize items, break up test suites, increase amount of agents and environments) so you can ultimately merge the nightly build into the per-commit delivery build.

If point #2 scares you - remember that those tests are not really bringing you value today. You are delivering your product even though they are failing.
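To make the "baseline green" idea from step 5 concrete, here is a minimal sketch (Python, with a made-up build history; the three-day threshold is simply the one from the list above):

```python
from datetime import datetime, timedelta

# Made-up build history: (finished_at, was_green), oldest first.
history = [
    (datetime(2019, 6, 20, 9, 0), False),
    (datetime(2019, 6, 21, 9, 0), True),
    (datetime(2019, 6, 22, 9, 0), True),
    (datetime(2019, 6, 24, 9, 0), True),
]

def baseline_green(history, days=3):
    # True if every build in the last `days` days was green (and there was at least one).
    cutoff = history[-1][0] - timedelta(days=days)
    recent = [green for finished, green in history if finished >= cutoff]
    return bool(recent) and all(recent)

print(baseline_green(history))  # True: the last three days of builds were all green
```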


A Pipeline Friendly Layered Testing Strategy & Recipe for DEV and QA

The key point of running a pipeline is to get feedback, which in turn is supposed to provide us with confidence. That’s because a pipeline is really just a big test - it’s either green or red, and it’s composed of multiple small tests that are either green or red.

Types of feedback

We can divide the types of tests in a pipeline into two main groups:

Break/fail feedback

  • Provides a go-no-go for an increment of the code to be releasable and deployed

  • Great for unit tests, e2e, system tests, security tests and other binary forms of feedback

Continuous Monitoring Feedback

  • An ongoing KPIs pipeline for non binary feedback

  • Great for code analysis and complexity scanning, high load performance testing, long running non functional tests that provide non binary feedback (“good to know”)

To get faster feedback I usually have two pipelines, one for each type of feedback I care about:

  • Delivery pipeline: Pass/fail pipeline

  • Discovery Pipeline: Continuous KPI pipeline in parallel.

The point of the Delivery feedback is to be used for continuous delivery – a go/no go that also deploys our code if all seems green. It ends with deployed code, hopefully in production.

The point of the learning feedback is to provide learning missions for the team (“Our code complexity is too high… let’s learn how to deal with it”), as well as to show whether those learnings are effective over time. It does not deploy anything except for the purpose of running specialized tests or analyzing code and its various KPIs. It ends with numbers on a dashboard.

In this article I will focus on the confidence we gain from the delivery pipeline, and how we can get even more speed out of it while still maintaining a high level of confidence in the tests that it runs.

It’s important to note that speed is a big motivator for writing this article in the first place, and splitting into “discovery” and “delivery” pipelines is yet another technique to keep in your arsenal, regardless of what I write further down in this article.

But let’s get back to the idea of “confidence”.

What does confidence look like?

In the delivery realm, what type of confidence are we looking for?

  • Confidence that we didn’t break our code

  • Confidence that our tests are good

  • Confidence that the product does the right thing

Without Confidence…

If we don’t have confidence in the tests this can manifest in two main ways:

  • If the test is red and we don’t have confidence in the test - you might hear the words “Don’t worry about it” - we go on as if everything is OK and assume the test is wrong (potentially not paying attention to real issues in the code).

  • If the test is green and we don’t have confidence in the test - we still debug or do manual testing of the same scenario (wasting any time the test should have saved us). We might even be afraid to merge to the master branches, deploy or do other things that affect others.

With Confidence…

This confidence allows us to do wonderful things:

  • Deploy early and often

  • Add, change and fix code early and often.

So what’s the problem?

Let’s look at common types of tests that we can write and run in a delivery pipeline:

There are quite a few types to choose from. That means there are usually several questions teams need to answer when they come to the realization that they want lots of delivery confidence through automated tests:

  1. Which type of tests should we focus on?

  2. How do we make sure the tests don’t take hours and hours to run, so we can get faster feedback?

  3. How can we avoid test duplication between various kinds of tests on the same functionality?

  4. We are still (in some cases) working in a QA+Dev fashion. How can we increase collaboration and communication between the groups and create more knowledge sharing?

  5. How can we inch towards DEV taking on automated test ownership as part of our transformation?

  6. How can we be confident we have the right tests for the feature/user story?


To start making an informed decision, let’s consider several key things for each test type:

  1. How much confidence does the test provide? (higher up usually means more confidence – in fact, nothing beats an app running in production for knowing everything is OK)

  2. How long is the test run time? (feedback loop time)(higher up means slower)

  3. How much ROI do I get for each new test? (The first test of each type will usually provide the highest ROI; the second one usually has to repeat parts of the first, so ROI is diminished.)

  4. How easy is it to write a new test? (higher up is usually more difficult)

  5. How easy is it to maintain a test? (higher up is usually more difficult)

  6. How easy is it to pinpoint where the problem is when the test fails? (lower down is easier)

Where teams shoot themselves in the foot

Many teams will try to avoid (rightfully so) repeating the same test twice, and since End to End (aka “e2e”) tests usually provide lots of confidence, the teams will focus mostly on that layer of testing for confidence.

So the diagram might look like this for a specific feature/user story:

It might also be the case that a separate team is working on the end to end tests while developers are also writing unit tests (yes, let’s be realistic – the real world is always shades of grey – and enterprises are slow moving giants. We have to be able to deal with current working practices during the transformation), but there might be plenty of duplication or missing pieces between the two types of tests:

In both of these cases (and other variations in between) we end up with a growing issue: These e2e tests will take a long time to run as soon as we have more than a dozen.

Each test can easily take anywhere from 30 seconds to a couple of minutes. It only takes 20-30 e2e tests to make the build run for an hour.

That’s not a fast feedback loop. That’s a candidate for a “let’s just run that stuff at night and pray we see green in the morning” anti-pattern.

Indeed, many teams at some point opt for that option and run these long running tests at night, possibly in a separate “nightly” pipeline which is different from the “continuous integration” pipeline.

So now we might have the worst of both worlds:

What’s wrong with this picture?

  • The pipeline that really determines the go-no-go delivery is the nightly pipeline. Which means 12-24 hours of feedback loop on the “real truth” about the status of the code (“come back tomorrow morning to see if your code broke anything”) .

  • Developers might get a false sense of confidence from just the CI (continuous integration) pipeline.

  • To know if you can deliver, you now have to look at two different locations (and it might also mean different people or departments are looking at each board instead of the whole team)

The best of both worlds?

Can we get high confidence with fast feedback? Close.

What if we try to get the confidence of e2e tests but still get some of the nice fast feedback that unit tests give us? Let’s try this scheme on for size:

  • Given Feature 1

  • We can test the standard scenario 1.1 as an e2e test

  • For any variation on that scenario, we always write that variation in either a unit test or an integration test (faster feedback).

Now we get both the “top to bottom” confidence and the added confidence that more complicated variations in the logic are also tested in fast tests at a lower level. Here’s how that might end up looking:
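Here is a minimal, purely illustrative sketch in that spirit (pytest style, with a tiny in-file stand-in for the real application and UI driver): the standard login scenario is covered once, top to bottom, and the variations are covered by fast, low-level tests.

```python
# Stand-ins for the real system, kept in-file so the example is self-contained.
VALID_ACCOUNTS = {"user@example.com": "correct-password"}

def check_credentials(email, password):
    return VALID_ACCOUNTS.get(email) == password

class FakeLoginPage:
    """Stands in for a real browser/UI driver in the e2e test."""
    def __init__(self):
        self.logged_in = False
    def login(self, email, password):
        self.logged_in = check_credentials(email, password)
    def is_logged_in(self):
        return self.logged_in

def test_successful_login_end_to_end():
    # Standard scenario 1.1: covered once at the (faked) e2e level.
    page = FakeLoginPage()
    page.login("user@example.com", "correct-password")
    assert page.is_logged_in()

def test_login_rejects_wrong_password():
    # Variation: covered at the fast, low level, against the logic directly.
    assert check_credentials("user@example.com", "wrong-password") is False

def test_login_rejects_unknown_user():
    assert check_credentials("nobody@example.com", "whatever") is False
```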

Creating a Test Recipe

Based on this strategy, here’s a simple tactical maneuver that I’ve found helpful in various teams I’ve consulted with:

Before starting to code a feature or a user story, the developer sits with another person to create a “Test Recipe”. That other person could be another developer, a QA person assigned to the feature (if that’s your working process), an architect, or anyone else you’d feel comfortable discussing testing ideas with.

A Test Recipe, as I like to call it, is simply a list of the various tests that we think might make sense for this particular feature or user story, including the level at which each test scenario will live.

A test recipe is NOT:

  • A list of manual tests

  • A binding commitment

  • A list of test cases in a test planning piece of software

  • A public report, user story or any other kind of promise to a stakeholder.

  • A complete and exhaustive list of all test permutations and possibilities

At its core it’s a simple list of 5-20 lines of text, detailing simple scenarios to be tested in an automated fashion, and at what level. The list can be changed, added to, or subtracted from. Consider it a “comment”.

I usually like to just put it right there as a bottom comment in the user story or feature in JIRA or whatever program you’re using.

Here’s what it might look like:
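The original screenshot isn't reproduced here, but a hypothetical recipe for an imaginary "password reset" user story might read something like:

```
Test recipe: password reset (hypothetical example)
- Successful reset via emailed link              -> e2e
- Reset link expires after 24 hours              -> integration
- Reset rejected for an unknown email address    -> unit
- New password must meet strength rules          -> unit
- Audit log entry written on successful reset    -> integration
- Rate limiting kicks in after 5 failed attempts -> API
```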

When do I write and use a test recipe?

Just before coding a feature or a user story, sit down with another person and try to come up with various scenarios to be tested, and discuss at which level that scenario would be better off tested.

This meeting is usually no longer than 5-15 minutes, and after it, coding begins, including the writing of the tests (if you’re doing TDD, you’d start with a test first).

In orgs where there are automation or QA roles, the developer will take on writing the lower-level tests, and the automation expert (yes, QA folks without automation abilities will need to learn them slowly) will focus on writing the higher-level tests in parallel, while coding of the feature is taking place.

If you are working with feature toggles, then those feature toggles will also be checked as part of the tests, so that if a feature is off, its tests will not run.

Simple rules for a test recipe

  1. Faster. Prefer writing tests at lower levels

    • Unless a high level test is the only way for you to gain confidence that the feature works

    • Unless there is no way to write a low level test for this scenario that makes you feel comfortable

  2. Confidence. The recipe is done when you can tell yourself “If all these tests passed, I feel pretty good about this feature working.” If you can’t say that, write more scenarios that would make you say that.

  3. Revise: Feel free to add or remove tests from the list as you code, just make sure to notify the other person you sat with.

  4. Just in time: Write this recipe just before starting to code, when you know who is going to code it, and coding is about to start the same day.

  5. Pair. Don’t write it alone if you can help it. Two people think in different ways and it’s important to talk through the scenarios and learn from each other about testing ideas and mindset.

  6. Don’t repeat yourself. If this scenario is already covered by an existing test (perhaps from a previous feature), there is no need to repeat the scenario at that level (usually e2e).

  7. Don’t repeat yourself again. Try not to repeat the same scenario at multiple levels. If you’re checking a successful login at the e2e level, lower level tests will only check variations of that scenario (logging in with different providers, unsuccessful login results etc..).

  8. More, faster. A good rule of thumb is to end up with a ratio of at least 1 to 5 between each level (for one e2e test you might end up with 5 or more lower-level tests for other scenarios).

  9. Pragmatic. Don’t feel the need to write a test at all levels. For some features or user stories you might only have unit tests. For others you might only have API or e2e tests. As long as you don’t repeat scenarios, and as long as the recipe gives you the confidence described in rule 2, you’re fine.

Benefits of Layered Test Strategies and Test Recipes

Using this strategy, we can gain several things we might not have considered:

Faster feedback

OK, that one we did consider, but it’s the most notable.

Single delivery pipeline

If we can get the tests to run fast, we can stop running e2e tests at night and start putting them as part of the CI pipeline that runs on each commit. That’s real feedback for developers when they need it.

Less test duplication

We don’t waste time and effort writing and running the same test in multiple layers, or maintaining it multiple times if it breaks.

Knowledge sharing

Because test recipes are done in pairs, there is much better knowledge sharing and caring about the tests, especially if you have a separate QA department. Test recipes will “force” devs and QA to have real conversations about mindset and testing scenarios and come up with a better, smaller, faster plan to gain confidence, together. Real teamwork is closer.

Thoughts?

Hope you find this useful. Feel free to drop me a line on twitter @royosherove or email me roy AT 5whys dot com (or comment on this post!)

PS

Test strategies aren’t usually enough when you already have a large-ish list of long running e2e tests. You’ll still have to optimize the pipelines. There are many ways, but a rather simple one (given enough build agents) is to parallelize the steps in the pipelines. Here’s a simple example of that:
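The original diagram isn't shown here, so here is a minimal local stand-in for the idea (Python; the suite commands are placeholders): independent suites that used to run one after another now run at the same time. In a real pipeline you would express the same thing with your CI server's parallel stages across several build agents.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder commands for three independent test suites; swap in your real ones.
suites = [
    ["python", "-c", "print('api tests passed')"],
    ["python", "-c", "print('ui tests passed')"],
    ["python", "-c", "print('security tests passed')"],
]

with ThreadPoolExecutor() as pool:
    return_codes = list(pool.map(lambda cmd: subprocess.run(cmd).returncode, suites))

# The stage as a whole fails if any parallel branch fails.
assert all(code == 0 for code in return_codes), "a parallel test suite failed"
```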


Always start with a "Why" when transforming to DevOps.

I’ve been to quite a few DevOps and Agile transformations, big and small. Some just want to up the coverage of their tests (not recommended!), others want to revamp their team structures, and some want to change their process.

There are other points of view as well: Many internal and external consultants who want to make a difference try changing things but end up failing for many reasons.

My advice, which I’ve learned the hard way (like anything of true value) - is to have a good “Why” in your head at all times.

“Why am I doing this?”, “What’s the point of this change?”, “What are we trying to fix by changing this?” , “What is the cost of NOT changing this?”.

You have to have a really good answer for all of these, which are just a variation of the age old “What’s your motive?” question.

Not only do you need to have a good answer, you need to fully believe that this answer is what would really drive things for the better. If you’re aiming to change something and you yourself are not sure there’s any point in doing this, or that it could actually succeed, if you can’t even imagine a world where that success has been accomplished, a reality where things have changed (be it a month from now or 5 years from now) - it’s going to be very difficult.

Change always has detractors and naysayers. And even though from their point of view there might be truth to their reasons for not agreeing with you, their truth should not hinder your own belief that this change is needed.

Where this “Why” comes in handy the most, for me, is when someone comes with a really good argument for “Why not”. You can’t know everything, and you can bet on being surprised at some of the good reasons people give for why things shouldn’t or cannot be done, or why they think things will not succeed, or are doomed to make things worse. It’s really handy because at those dark times, when you’re unsure yourself that this is the right thing to do, you can come back to your original “Why” and remind yourself why this makes sense to you. And you can use that “Why” to measure any fact you encounter, and see whether it still holds true or not. If it no longer holds true - you might have truly discovered or learned something new, and would possibly need to change strategy or goals.

Here’s an example of “Why” that I have used:

“Developers should have fast feedback loops that enable them to move fast with confidence”.

Strategy:

  • Teach about single trunk branching technique

  • Teach about Feature Flags

  • Help team design a new branching model that reduces amount of merges

  • Help team design pipeline that triggers on each commit

  • Get the pipeline to run fast, in parallel, on multiple environments

  • Teach team about unit testing, TDD and integration tests

That should get a team far enough on that “Why”. And here’s what happens in real life, especially in large organizations:

Every bullet point above will be challenged, cross-examined, denounced as the devil, and offered multiple variations that seem close but are not; and in many teams, very few people will automatically support it.

But this “Why” is what keeps my moral compass afloat. No matter what the discussion is - and especially if it becomes confusing, or goes deep into alternative-suggestion land - I keep asking myself if that “Why” is being answered. And I can use it as my main reason for each one of these things, because it is absolutely true in my mind - these points help push that “Why” - and doing them in other ways might seem close but can push us farther away from it.

Always have a “Why” - and you’ll never be truly stumped, even when confusion takes over.

You could have many “Whys” - and not all of them will be in play at all times. But for each action you push for, there should be a “Why” behind it.

In my mind it reminds me a bit of TDD - any production code has a reason to exist: a failing test at some point that pointed out that this functionality is missing. Tests are the “Why” of code functionality.

In DevOps and Continuous Delivery, a big “Why” can be “We want to reduce our cycle time. “ Then from that you can derive smaller “Whys”, but they all live up to the big one, and support it.


Ten DevOps & Agility Metrics to Check at the Team Level

When I coach teams that are getting into the DevOps and Continuous Delivery mindset, a common question that comes up is "What should we measure?"

Measuring is a core piece of change - how do you know you're progressing without measuring anything?

Here are ten ideas for things you can measure to see if your team is getting closer to a DevOps and continuous delivery skillset. It's important to realize that what we are measuring are end symptoms - results. The core behaviors that need to change can be varied quite a bit, but at the end of the day, we want to see real progress in things that matter to us from a continuous delivery perspective.

 

  1. Cycle time (you want to see this number going down). If you put a GoPro on a user story, from the moment it enters the mind of a customer or PO, and track what it goes through until it is live in production, you get a calendar-time number that represents your core delivery cycle time. It can take weeks, months and sometimes years in large organizations. It usually is a big surprise. I'll write about this more in a separate blog post. The idea is to see cycle time reduced over time, so you actually deliver faster and become more competitive.
  2. Time from red build to green build (you want to see this number going down). Take the last few instances of a red-to-green build (count from the first red build until the first green build after it) to get how long, on average, it takes to make a red build green (see the sketch after this list). This is how effective your team is at dealing with a build failure. Build failures are a good thing - they tell us what's really going on. We should not avoid them. But we should be taking care of them quickly and efficiently (for example, you can set up "build keeper" shifts - every day someone else is in charge of build investigations and of pushing the issue to the right people in the team).
  3. Amount of open pull requests on a daily basis, and closed pull requests, coupled with the avg. time a pull request has been open (you want to see closed requests going up, request time going down and open requests being stable or going down). This gives us a measure of team communication and collaboration - how often does code get reviewed, and how often is code stuck waiting for a review. A trend of open pull requests going up could mean the team has a bottleneck in the code review area. The same is true for very long pull request times.
  4. Frequency of merges to trunk (this should be going up or staying stable). If your code gets merged to trunk every few days or weeks, it means that whatever your build pipeline is building and delivering is days- or weeks-old code. It is also a path to many types of risks, such as not getting feedback fast enough on how your code integrates with everyone else's, and your code not being deployed and available to turn on with a feature flag. Generally it's a pathway for people who are afraid of exposing their work to the world, thus potentially creating hours and sometimes days of pain down the line.
  5. Test code coverage (coupled with test reviews) (you want to see this go up or stay stable at a high level, while watching closely for the quality of code reviews). I always like to say that low code coverage means only one thing - you are missing tests. But high code coverage is meaningless unless you have the code reviewed, because human nature leads us to fulfill whatever we are measured on. So sometimes you can see teams writing tests with no asserts just to get high code coverage. This is where the code reviews come in.
  6. Amount of tests (this should obviously be going up as you add new functionality to your product).
  7. Pipeline run time (this should be declining or staying at a low level). The slower your automated build pipeline is, the slower your feedback is. This helps you know if the steps you are taking also help speed up the feedback cycle.
  8. Pipeline visibility in team rooms (you want to see this go up or stay stable at a high level). This is a metric that tells you about commitment to visual indicators, information radiators etc. It's a small but important part of team non-verbal communication, and increases the team's ability to respond quickly to important events.
  9. Team pairing time (should be going up or staying stable at a medium or high level) - we can measure this to see if we have knowledge sharing going on.
  10. Amount of feature flags (should be going up as the team learns about feature flags, and then stay stable; if it continues to increase, it means you're not getting rid of feature flags fast enough, which can lead to trouble down the line).
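For metric #2, here is a minimal sketch of the calculation (Python, with a made-up build history; the shape of your own build server's data will differ):

```python
from datetime import datetime

# Made-up build history as (finished_at, status), oldest first.
builds = [
    (datetime(2019, 6, 1, 9, 0),  "green"),
    (datetime(2019, 6, 1, 13, 0), "red"),
    (datetime(2019, 6, 1, 15, 0), "red"),
    (datetime(2019, 6, 2, 10, 0), "green"),
    (datetime(2019, 6, 3, 11, 0), "red"),
    (datetime(2019, 6, 3, 12, 0), "green"),
]

def red_to_green_durations(builds):
    # Time from the first red build of each red streak to the first green build after it.
    durations, first_red = [], None
    for finished, status in builds:
        if status == "red" and first_red is None:
            first_red = finished
        elif status == "green" and first_red is not None:
            durations.append(finished - first_red)
            first_red = None
    return durations

durations = red_to_green_durations(builds)
avg_hours = sum(d.total_seconds() for d in durations) / len(durations) / 3600
print(durations, avg_hours)  # two red streaks: 21h and 1h, average 11h
```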

Two bonus metrics:

  1. feature size estimate (should be staying stable or going down) - helps to track how well the team estimates feature sizes or to check the variance of the feature sizes you estimate.
  2. Bus factor count (should be going down and staying down) - how many people on the team are single points of knowledge (bus factors).

Ephemeral Environments for DevOps: Why, What and How?

In the previous post I discussed the issues with having static environments. In this post I cover one of the solutions for those issues: Ephemeral environments.

Ephemeral environments are also sometimes called “Dynamic environments”, “Temporary environments”, “on-demand environments” or “short-lived environments”.

The idea is that instead of environments “hanging around” waiting for someone to use them, the CI/CD pipeline stages are responsible for instantiating and destroying the environments they will run against.

For example, we might have a pipeline with the following stages:

  1. Build

  2. Test:Integration&Quality

  3. Test:functional

  4. Test:Load&Security

  5. Approval

  6. Deploy:Prod

(Image courtesy of cloudbees)

In a traditional static environment configuration, each stage (perhaps except the build stage) would be configured to run against a static environment that is already built and is waiting for it, or some stages might share the same environment, which causes all the issues I mentioned previously.

In an ephemeral environment configuration, each relevant stage would contain two extra actions: one at the beginning, and one at the end, that spin up an environment for the purpose of testing, and spin it down at the end of the stage. 

The first step (1) is to compile and run fast unit tests, followed by putting the binaries in a special binary repository such as Artifactory.

There is also a stage (2) that creates a pre-baked environment as a set of AMIs or VM images (or containers) to be instantiated later:

  1. Build & Unit Test

    1. Build Binaries, run unit tests

    2. Save binaries to artifact management

  2. Pre-Bake Staging Environment

    1. Instantiate Base AMIs

    2. Provision OS/Middleware components

    3. Provision/Install application

    4. Save AMIs for later instantiation as STAGING environment (in places such as S3, artifactory etc.)

  3. Test:Integration&Quality

    1. Spin up staging environment

    2. Run tests

    3. Spin down staging environment

  4. Test:functional

    1. Spin up staging environment

    2. Run tests

    3. Spin down staging environment

  5. Test:Load&Security

    1. Spin up staging environment

    2. Run tests

    3. Spin down staging environment

  6. Approval

    1. Spin up staging environment

    2. Run approval tests/wait for approval and provide a link to the environment for humans to look into the environment

    3. Spin down staging environment

  7. Deploy:Prod

    1. Spin up staging environment

    2. Data replication

    3. Switch DNS from old production to new environment

    4. Spin down old prod environment (this is a very simplistic solution)
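The spin up / run tests / spin down pattern that repeats in each test stage can be wrapped once. Here is a minimal sketch (Python; the spin-up and spin-down commands are placeholders for whatever your cloud or cluster tooling actually uses):

```python
import contextlib
import subprocess
import uuid

@contextlib.contextmanager
def staging_environment(prebaked_image_id):
    # Placeholder commands: in practice these would call CloudFormation,
    # Terraform, kubectl, or similar, using the pre-baked image.
    env_name = f"staging-{uuid.uuid4().hex[:8]}"
    subprocess.run(["echo", "spin up", env_name, "from", prebaked_image_id], check=True)
    try:
        yield env_name
    finally:
        # Always spin the environment down, even if the tests in the stage fail.
        subprocess.run(["echo", "spin down", env_name], check=True)

# Each test stage instantiates its own copy of the pre-baked staging image.
with staging_environment("prebaked-staging-image") as env:
    subprocess.run(["echo", "run integration tests against", env], check=True)
```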

A few notes:

·       Pre-Baked Environment:

Notice that the environment we are spinning up and spinning down is always the same environment, and it is a staging environment with the application pre-loaded on top of it.

 

Staging environments are designed to look exactly like production (in fact, in this case, we are using staging as a production environment in the final stage).

The reason we are always using the same environment template, is because:

  • This provides environmental consistency between all tests and stages, and removes any false positives or negatives. If something works or doesn’t work, it is likely to have the same effect in production.

  • Environments are pre-installed with the application, which means that we are always testing the exact same artifacts, so we get artifact consistency

  • Because environments are pre-installed, we are also implicitly testing the installation/deployment scripts of the application.

 Only one Install.

Also notice that there is no “installation” after the pre-baking stage, which means we also don’t “deploy” into production. We simply instantiate a new production environment in parallel.

We “install-once, promote many” which means we get Installation consistency across the stages.

Blue-Green Deployment

Deploying to production just means we instantiate a new pre-baked environment in the production zone (for example a special VPC if we are dealing with AWS), which runs in parallel with the “real” production. Then we slowly soak up production data, let the two systems run in parallel, and eventually either switch a DNS record to the new servers, or slowly drain the production load balancer into the new servers (there are other approaches to this that are beyond the scope of this article).

Speed

Another advantage of this set up is that because each stage can have its own environment, we can run some stages in parallel, so, in this case we can run all the various tests in parallel, which will save us valuable time:

  1. Build & Unit Test

    1. Build Binaries, run unit tests

    2. Save binaries to artifact management

  2. Pre-Bake Staging Environment

    1. Instantiate Base AMIs

    2. Provision OS/Middleware components

    3. Provision/Install application

    4. Save AMIs for later instantiation as STAGING environment (in places such as S3, artifactory etc.)

  3. HAPPENING IN PARALLEL:

    1. Test:Integration&Quality

      1. Spin up staging environment

      2. Run tests

      3. Spin down staging environment

    2. Test:functional

      1. Spin up staging environment

      2. Run tests

      3. Spin down staging environment

    3. Test:Load&Security

      1. Spin up staging environment

      2. Run tests

      3. Spin down staging environment

  4. Approval

    1. Spin up staging environment

    2. Run approval tests/wait for approval and provide a link to the environment for humans to look into the environment

    3. Spin down staging environment

  5. Deploy:Prod

    1. Spin up staging environment

    2. Data replication

    3. Switch DNS from old production to new environment

    4. Spin down old prod environment (this is a very simplistic solution)

How?

One tool to look into for managing environments, and also killing them easily later, would be Chef:Provision, which can be invoked from the Jenkins command line but also saves the state for spinning down the environment later. It also follows the toolchain values we discussed before on this blog.

In the Docker world, given a pre-baked set of Docker images, we can use Kubernetes to create ephemeral environments very easily, and destroy them at will.

Jenkins-X would be a good tool to look into specifically for those types of environments.


The four biggest issues with having static environments

An environment is a set of one or more servers, configured to host the application we are developing, and with the application already installed on them, available for either manual or automated testing.

What are static environments?

Static environments are environments that are long lived. We do not destroy the environment, but instead keep loading it with the latest and greatest version of the application.

For example, we might have the following static environments:

“DEV”, “STAGE” and “PROD”.

Each one is used for a different purpose and by different crowds. What’s common about them is that they are all long lived (sometimes for months or years), and this creates several issues for the organization:

1.     Environment Rot: As time goes by, the application is continuously installed on the environments and configuration is done on them (manually). This creates an ongoing flux of changes to each environment that leads to several problems:

a.     Inconsistency between environments (false positives or negatives)

  • Any deployment or tests results you get in one environment may not reflect what you actually get in production. For example, tests that pass in “Model” could be passing due to a specific configuration in MODEL that does not exist in other environments, meaning we’d get a false positive.
  • Bugs that happen in one of the environments might not happen in production, which is a false negative.

b.     Inability to reproduce issues between environments

If a bug manifests in one environment but cannot be reproduced in another, because the environments are different “pets,” the issue becomes very hard to track down.

2.    Long and costly maintenance times

Because environments are treated as “pets” (i.e., you name them, you treat them when they are sick, and each one is a unique snowflake that has its ups and downs), it takes a lot of time and manual, error-prone activity to maintain an environment and bring it back up if it crashes.

This causes a delay whenever a team needs to test the product on an environment.

3.    Queuing

Because of teams waiting to deploy to an environment, and because there is a limited number of these environments (they are costly to set up, maintain and pay for to keep running 24-7), queues start to form as teams wait for environments to become available.

This queueing can also be caused because multiple teams are expecting to use the same environment, and so each team waits for other teams to finish working with the environment before they can start working.

This causes release delays.

4.    Waste of money

Static environments usually run 24-7, and in a private or public cloud scenario this might mean paying per hour per machine or VM instance. However, most environments are only used during work hours, which in many organizations means up to 16 hours of idle paid time per day.

Solution: 

In the next post I'll cover ephemeral environments and how they solve the issues mentioned here.


Continuous Delivery Values

In my last post I mentioned that the toolchain needs to respect the continuous delivery values. What are those values? Many of them derive from lean thinking, eXtreme Programming ideas and the book "Continuous Delivery".

Michael Wagner, a colleague and mentor of mine at Dell EMC, has described them as follows.

The core value is:

Our highest priority is to satisfy the customer through early and continuous delivery of valuable software

 

The principles are:

  • The process for releasing/deploying software MUST be repeatable and reliable
  • Automate Everything down to bare metal
  • If something’s difficult or painful, do it more often
  • Keep EVERYTHING in source control
  • Done means “released”
  • Build quality in
  • Everyone has responsibility for the release process
  • Improve Continuously

The Practices are:

  • Build binaries (or image, or containers, or any artifact that will be deployed) only once, promote many
  • Use precisely the same mechanism to deploy to every environment
  • Smoke test your deployment
  • Organizational policy is implemented as automated steps in the pipeline
  • If anything fails, stop the line

 


Guidelines for selecting tools for continuous delivery toolchains

If the tool you use does not support continuous delivery values, you're going to have a bad time implementing CI/CD with fully automated pipelines.

Here are some rules for the road:

  1. The first rule is: don't select your toolchain until you have designed the pipeline you want to have
  2. Every action or configuration can become code in source control, so you can version things and get an audit trail of changes
  3. Everything that can be invoked or configured has an automation endpoint through a command line or API
  4. Every command line returns zero for success and non-zero for failure (see the sketch after this list)
  5. If you have to log in to a UI to configure or execute something, you're going down the wrong path: your CI/CD tool (like Jenkins) is the user, not a human
  6. Queues should be avoidable: if the tool can only do one task at a time but multiple builds need it, a queue will form. The tool should support parallel work by multiple pipelines, or some other way to avoid queuing, so pipelines run faster rather than slower while still feeding all the feedback you need into the pipeline
  7. Results should be easily visible in the pipeline, or importable via API or command line: you need to be able to see the results in the pipeline log and understand pipeline issues without chasing different teams and tools
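Rules 3, 4 and 7 in practice: a pipeline step is usually just a thin script that calls the tool's command line, checks the exit code, and surfaces the results. The sketch below assumes a hypothetical scanner CLI called `security-scan` with a `--output` flag; substitute whichever tool you are evaluating and its real flags:

```python
# A pipeline step as a thin wrapper around a tool's command line.
# "security-scan" and its --output flag are hypothetical stand-ins; the
# pattern is what matters: invoke via CLI, check the exit code, surface results.
import json
import subprocess
import sys

result = subprocess.run(
    ["security-scan", "--output", "results.json"],
    capture_output=True,
    text=True,
)

# Rule 4: zero means success, anything else fails the pipeline step.
if result.returncode != 0:
    print(result.stdout)
    print(result.stderr, file=sys.stderr)
    sys.exit(result.returncode)

# Rule 7: pull the results into the pipeline log so nobody has to log in
# to another tool's UI to understand what happened.
with open("results.json") as report:
    findings = json.load(report)
print(f"Scan finished with {len(findings)} findings.")
```

If the tool you are evaluating cannot be driven this way, that is a strong signal it will fight your pipeline rather than serve it.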

The first rule of continuous delivery toolchains

Continuous delivery transformations are hard enough in many areas already: changing people's behavior, changing policies, architecting things differently - all these things are difficult on their own.  

A toolchain that prevents you from building pipelines that fully support CI/CD is one problem you can avoid right off the bat.

The simple rule is:

Don't select your toolchain before you've designed the pipeline you wish to have

Don't decide you're going to use a specific tool for the new way of working before you've done at least the following:

  • Choose a project in your organization that exemplifies an "average" project (if there are many types, choose two or three).
  • Do a value stream mapping of that project from concept to cash: from the point an idea is raised, to it getting into a backlog, to that code ending up in production. Then time the value stream and see how long each step takes.
  • Once you have a value stream, design a conceptual pipeline that automates as much of the value stream as possible (some things cannot be automated without changing organizational policies); see the sketch after this list.
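Even a rough model of the value stream helps here. The sketch below represents it as a list of steps with measured durations and flags the ones a pipeline could absorb; the step names and numbers are invented purely for illustration:

```python
# A toy model of a value stream: step name, measured duration in hours, and
# whether the step is a candidate for pipeline automation.
# All step names and durations below are invented for illustration.
value_stream = [
    ("Idea approved into backlog",    80, False),
    ("Development and code review",   40, False),
    ("Build and unit tests",           2, True),
    ("Deploy to test environment",    16, True),
    ("Manual regression testing",     60, True),
    ("Change approval board",         72, False),
    ("Deploy to production",           8, True),
]

total = sum(hours for _, hours, _ in value_stream)
automatable = sum(hours for _, hours, can_automate in value_stream if can_automate)

print(f"Lead time today:              {total} hours")
print(f"Time a pipeline could absorb: {automatable} hours")

# The biggest single step is usually where the conceptual pipeline
# (or a policy change) should focus first.
bottleneck = max(value_stream, key=lambda step: step[1])
print(f"Biggest bottleneck: {bottleneck[0]} ({bottleneck[1]} hours)")
```
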

Now that you have a conceptual pipeline (or several), you can start deciding which tools best fit the actions you would like to pull off in the pipeline.

What you'll find is that some tools will not lend themselves easily to pipeline work, and some tools will fit in just right.

Remember: the pipeline drives the tool selection. Not the other way around.

You should not let the tool dictate how your pipeline works, because you're trying to change the way the organization works, not map the organization to a tool that might hinder your progress.

 

 


The four most common root causes that slow down enterprise continuous delivery

In my journey as a developer, architect, team leader, CTO, director, coach and consultant for software development, the four most common "anti-patterns" I come across in the wild that generate the most problems are:

  1. Manual Testing
  2. Static Environments
  3. Manual configuration of those environments
  4. Organizational Policies

 

Manual Testing

This one seems pretty straightforward: because testing is manual, it takes a really long time, which slows down the rate of releases as everyone waits for testing to be done.

There are other side effects to this approach though:

Because testing is manual:

  • It is hard to reproduce bugs, or even the same testing steps, consistently
  • It is prone to human error
  • It is boring and frustrating for the people who do it (which feeds the common feeling that 'QA is a stepping stone to becoming a developer', and we do not want that! We want testers who love their job and provide great value!)
  • The list of manual tests keeps growing, and so does the manual work involved. As more features are added you get lower and lower coverage, since people can only spend so much time testing a release before the company goes belly up.
  • It creates a knowledge silo in the organization and a "throw it over the fence" mentality for developers who are only interested in hitting a date, not in whether the feature quality is up to par ("not my job").
  • This knowledge silo also creates psychological "bubbles" around people in different groups that 'protect' them from knowing what happens before and after they do their work. People stop caring, or have little idea of how they contribute and where they stand in the long pipeline of delivering software to production ("Once I sign off on it, I have no idea what the next step is. I just check a box in ServiceNow and move on with my life.")
  • You never have enough time to automate your tests, because you are too busy testing things manually!

So manual testing is a scaling issue, a consistency issue and a reason people leave the company.

Static Environments

An environment is usually a collection of one or more servers sitting inside or outside the organization (private or public cloud, and sometimes just some physical machines the developers might have lying around if they get really desperate).

In the traditional software development life cycle that uses pipelines, code is built and then promoted through the environments, which progressively look more and more like production, with "Staging" usually being the final environment standing between the code and production deployment.

How code moves through the environments varies: some organizations force a merge to a special branch for each type of environment (i.e. "branch per promotion"). This is usually not recommended, because it means that for each environment the code has to be built and compiled again, which breaks a cardinal rule of CI/CD: build only once, promote many; that's how you get consistency. But that's a subject for another blog post.

More problematic is the fact that the environments are "static" in the first place. Static means they are long-lived: an environment is (hopefully) instantiated as a set of virtual machines somewhere in a cloud, and then declared "DEV" or "TEST" or "STAGE" or any other name that signifies where it sits in the organizational pipeline that leads to production. A specific group of people then gets access to it and uses it to (as we said in the first item) test things manually.

But even if teams run automated tests on them, the fact that the environment is long-lived is an issue for multiple reasons:

  • Because they are static, there is only a fixed number of such environments (usually a very low number), which easily causes bottlenecks in the organization: "We'd like to run our tests in this environment, but between 9 and 5 some people are using it", or "We'd like to deploy to this environment, but people are expecting this environment to have an older version".
  • Because they are static, they become "stale" and "dirty" over time: each deploy or configuration change becomes a patch on top of the existing environment, turning it into a unique "snowflake" that cannot be recreated elsewhere. This creates inconsistency between the environments, which leads to problems like "The bug appears in this environment, but not in that environment, and we have no idea why".
  • They cost a lot of money: an environment other than production is usually only utilized while people are at work; otherwise it just sits there, ticking away CPU time on Amazon or electricity on-prem, wasting money. Imagine a fleet of 100 testing machines that run in parallel during load testing, but load testing only happens once or three times a day; for the other 12 hours a day those 100 machines just sit there, costing money.

So static environments are a scale issue, a consistency issue and a cost issue.

 

Manual Configuration of environments

In many organizations the static environments are manually maintained by a team of people dedicated to the well-being of those snowflakes: they patch them, they mend their wounds, they open firewall ports, they grant access to folders and DNS, and they know those systems well.

But manual configuration is an issue for multiple reasons:

  • Just like manual testing, it takes a lot of time to do anything useful with an environment: bringing one up and operationalizing it takes painstaking dedication over multiple days, sometimes weeks, of getting everything sorted with the internal organization. Simple questions come up and have to be dealt with: who pays for this machine? Who gets access? What service accounts are needed? How exposed is it? Will it contain information that might need encryption? Will compliance folks have an issue with this machine being exposed to the public internet? These and more have to be answered, usually by multiple groups of people within a large organization. The same goes for any special change to an existing environment, or, god forbid, debugging a stray application that does not work on this snowflake but works everywhere else: a true nightmare to get such access, if you have ever been involved in such an effort.
  • Not only is it slow, it also does not scale: onboarding new projects in the organization takes a long time if they need a build server or a few "static" environments: that's days or weeks away. This slows the rate of innovation.
  • There is no consistency between the environments, since everything is done manually and people find it hard to repeat things exactly. Without consistency, the feedback you get from deploying onto an environment might not be true, so the time until you really know whether your feature works can be very long: maybe you'll find out only in production, weeks or months from now. And we all know that the longer you wait for feedback, the more it costs to fix issues.
  • Just as importantly, there is no record of the changes made to environment configuration: all changes are manual, so there is no telling what was changed in an environment, and rolling back a change is difficult or impossible.

So manual configuration of environments hurts time to market, consistency, operational cost and the rate of innovation.

 

Organizational Policies

If your organization requires manual human approval for everything that goes into production, this policy can cause several issues:

  • No matter how much automation you have, the policy that drives the process will force manual intervention, and automation will not be accepted (especially auto-deploy to production)
  • It will slow down time to market
  • It will make small changes take a long time, and large changes take even longer (if not the same)

Here is the guidance I usually offer in these cases:

If you are doing manual testing

you probably want to automate things, but you either don't have time to automate because you are too busy doing manual tests, or you have people (not "resources", "people"!) who do not have the automation skills yet:

If you don't have time to automate

you are probably in Survival Mode. You are over-committed. You will have to change your commitments, and have a good hard talk with your peers and technical leadership about doing so, so you get more time to automate. Otherwise you are in a vicious cycle that only leads to a worse place: the less time you have to automate, the more you test manually, and so you get even less time to automate.

If your people are missing the skills:

coach them and give them time to learn those new skills (or possibly hire an expert who can teach them)

 

 

If you are using static environments

Look into moving to ephemeral environments: environments that are spun up and torn down automatically before and after a specific pipeline stage, such as testing.
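A minimal sketch of the idea, assuming the environment is defined as code (Terraform in this example) and that the test stage brings it up, runs the tests against it, and always tears it down. The workspace name and the test command are made-up placeholders:

```python
# Ephemeral environment for a single pipeline stage: create, test, destroy.
# Assumes the environment is described in Terraform files in the current
# directory; the workspace name and test command are hypothetical.
import subprocess
import sys
import uuid

workspace = f"test-{uuid.uuid4().hex[:8]}"  # a unique, disposable environment

def run(*command: str) -> None:
    print("+", " ".join(command))
    subprocess.run(command, check=True)

try:
    run("terraform", "workspace", "new", workspace)
    run("terraform", "apply", "-auto-approve")
    run("pytest", "tests/acceptance")          # run tests against the fresh env
except subprocess.CalledProcessError as error:
    print(f"Stage failed: {error}", file=sys.stderr)
    sys.exit(1)
finally:
    # Tear down no matter what, so nothing is left running (or costing money).
    subprocess.run(["terraform", "destroy", "-auto-approve"])
    subprocess.run(["terraform", "workspace", "select", "default"])
    subprocess.run(["terraform", "workspace", "delete", workspace])
```

Because every run gets a fresh environment, there is no rot, no queuing on a shared "TEST" box, and nothing sitting idle overnight.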

 

 

If you are configuring your static environments manually:

Look into tools such as Chef, Terraform or Puppet to automate the creation of those environments in a consistent, fully automated way that also keeps configuration current. This also solves the problem of having no record of configuration changes, since every change is a change to a file in source control, and the tool takes care of provisioning and configuration for you. Less human error, plus auditing and versioning: who wouldn't want that? It's the essence of infrastructure as code: treating our infrastructure the same way we treat our software, through code. It also gives you the ability to test your environments and configuration with tools such as Chef Compliance, InSpec and others.
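As a small illustration of "keeping configuration current": once the environment is defined in source control, a scheduled pipeline job can detect drift between the code and the real environment. The sketch below leans on Terraform's `plan -detailed-exitcode` behavior (exit code 2 meaning pending changes); treat it as a sketch and verify the flag against the Terraform version you actually run:

```python
# Nightly drift check: compare the environment definition in source control
# with what is actually running, and fail loudly if they have diverged.
# Relies on `terraform plan -detailed-exitcode`, which (in the versions I have
# used) exits 0 for "no changes", 1 for errors, and 2 for "changes pending".
import subprocess
import sys

result = subprocess.run(
    ["terraform", "plan", "-detailed-exitcode", "-no-color"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    print("No drift: the environment matches what is in source control.")
elif result.returncode == 2:
    print("Drift detected! Someone changed the environment outside of code:")
    print(result.stdout)
    sys.exit(1)
else:
    print(result.stderr, file=sys.stderr)
    sys.exit(result.returncode)
```
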

 

 

If you have organizational policies that prevent automated deployment:

  • Look into creating a specialized, cross-functional team of people within the organization that is tasked with creating a new policy allowing some categories of application changes to be "pre-approved".
  • This enables pipelines to deploy changes that fall into this category directly to production, without going through a very complicated approval process that takes a long time.
  • For other types of changes, automate as much as you can of the compliance, security and approval checks (yes, even ServiceNow has APIs you can use, did you know?), and then put them into a special approval pipeline that removes as much of the manual burden from the change committees as possible, so that their meetings become more frequent and faster: "we want to make this change, it has already passed automated compliance and security tests in a staging environment; are we good to go?"
  • For changes that affect other systems in the organization, look into triggering those external pipelines so the dependent systems are deployed alongside your new system, in parallel, to a test environment, and then make sure tests are run on the dependent systems to see if you broke them. Pipelines that trigger other pipelines can be a powerful way to detect breaking changes across a company with many dependencies. I will expand on this in a later post. A minimal sketch of such a trigger appears after this list.
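As a rough illustration of pipelines triggering other pipelines: most CI servers expose an HTTP endpoint that starts a job. The sketch below posts to a hypothetical trigger URL with a token; the URL, token and payload are placeholders, and every CI tool has its own real API for this:

```python
# Trigger a dependent team's pipeline after our own deploy to a shared test
# environment succeeds. The URL, token and payload are hypothetical
# placeholders; consult your CI server's documentation for its real API.
import json
import sys
import urllib.error
import urllib.request

TRIGGER_URL = "https://ci.example.com/api/pipelines/dependent-system/trigger"
TOKEN = "replace-with-a-secret-from-your-vault"

payload = json.dumps({
    "upstream": "our-new-system",
    "environment": "shared-test",
    "artifact_version": "1.4.2",
}).encode("utf-8")

request = urllib.request.Request(
    TRIGGER_URL,
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {TOKEN}",
    },
)

try:
    with urllib.request.urlopen(request, timeout=30) as response:
        print(f"Downstream pipeline triggered, HTTP {response.status}")
except urllib.error.HTTPError as error:
    print(f"Trigger failed: HTTP {error.code}", file=sys.stderr)
    sys.exit(1)
```
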

 

 


    DevOps Metric: Amount of Defects and Issues: Definition

    • Important to: VP of QA, Operations
    • Definition: How many bugs or tickets are found in production, how many recurring bugs show up in production, and how many deployment issues are encountered during a release.
    • How to measure: Tickets from customers and support records; for deployment issues, post-release retrospectives (see the sketch after this list).
    • Expected outcome: The number of deployment issues, defects and recurring defects should decrease as DevOps maturity grows.
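A small sketch of how you might compute this from a ticket export, assuming a CSV with `release` and `type` columns. The file name and column names are made up; adapt them to whatever your ticketing system actually exports:

```python
# Count production defects per release from a ticket export.
# "tickets.csv" and its column names are hypothetical; adjust to match
# whatever your ticketing/support system actually exports.
import csv
from collections import Counter

defects_per_release = Counter()

with open("tickets.csv", newline="") as export:
    for ticket in csv.DictReader(export):
        if ticket["type"] == "production-defect":
            defects_per_release[ticket["release"]] += 1

# Print as a trend over releases, not a single snapshot.
for release in sorted(defects_per_release):
    count = defects_per_release[release]
    print(f"{release}: {count} {'#' * count}")
```

Remember: you want the trend line over releases, not a single number for one release.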

    Many organizations make an almost-conscious choice: go fast, or build with high quality. Usually they lose on both ends.

    Defects and deployment issues are a good gauge of DevOps maturity. You should become faster while not compromising on quality, because quality gates are fully automated in the form of tests at multiple steps, instead of manual, error-prone steps.

    The same applies to error-prone deployments: they almost stop happening, and when they do happen, the deployment script itself is fixed and can be tested before being used to deploy to production, just like in a regular software development life cycle. The ROI is huge.


    DevOps Metric: Frequency of Releases: Definition

    • Important to: Release Manager, CIO
    • Definition: How often an official release is deployed to production, into paying customers' hands.
    • How to measure: If there is no pipeline, use the release schedule; otherwise, use history data from the pipeline's production deploy step (see the sketch after this list).
    • Expected outcome: The time between releases should become shorter and shorter (in some cases an order of magnitude shorter) as DevOps maturity grows.
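A sketch of how to turn the pipeline's production deploy history into this metric, assuming you can export the deploy dates; the list below is hard-coded purely for illustration:

```python
# Compute release frequency and the trend of time between releases from the
# pipeline's production deploy history. The dates are hard-coded here for
# illustration; in practice you would export them from your CI tool.
from datetime import datetime

deploys = [
    "2023-01-10", "2023-02-21", "2023-03-28",   # roughly monthly at first
    "2023-04-14", "2023-04-28", "2023-05-12",   # then every two weeks
    "2023-05-19", "2023-05-26", "2023-06-02",   # then weekly
]

dates = [datetime.strptime(d, "%Y-%m-%d") for d in deploys]
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

print(f"Releases in period: {len(dates)}")
print(f"Days between releases, oldest to newest: {gaps}")
print(f"Average gap: {sum(gaps) / len(gaps):.1f} days (watch the trend, not the snapshot)")
```
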

    It's one thing for an organization's IT to create pipelines so fast that they can release at will. The other side of that coin belongs to the business.

    If IT is no longer a bottleneck, it is the business' responsibility to decide *when* to release new versions. 

    In traditional organizations IT is usually too slow to build and deploy releases as fast as the business requires, so the business is forced to operate at the speed of IT, which might mean 4 releases a month, 4 releases a year, or even fewer.

    In a DevOps culture, every code check-in (a 'commit') is a potentially releasable product version, assuming it has passed the whole continuous delivery pipeline all the way up to the staging servers.

    We can use the "Frequency of Releases" metric to decide how many releases we want to do per year, month or week, and then measure our progress.

    As IT becomes faster, we can shorten the scheduled release cadence. As the pipeline supports more and more automated policy and compliance checks as feedback cycles, more time is shaved off manual processes and potential releases become easier to achieve.

    The question "How often should we pull the trigger and deploy the latest potential release that has passed through the pipeline?" shifts from being IT's problem to being owned solely by business stakeholders.

    As release schedules become more frequent, the organization becomes more competitive in the marketplace.

    In some organizations (Netflix and Amazon are good examples), the decision on a release schedule is removed completely, and the pipeline's "flow" of releases that automatically get deployed determines the release frequency, which might be as often as every minute, or even more often.