The four most common root causes that slow down enterprise continuous delivery
In my journey as a developer, architect, team leader, CTO, director, coach and consultant for software development, the four most common "anti-patterns" I come across in the wild that generate the most problems are:
- Manual Testing
- Static Environments
- Manual configuration of those environments
- Organizational Policies
Manual Testing
This one seems pretty straightforward: because testing is manual, it takes a really long time, which slows down the rate of releases as everyone waits for testing to be done.
There are other side effects to this approach though:
Because testing is manual:
- It is hard to reproduce bugs, or even to repeat the same testing steps consistently
- It is prone to human error
- It is boring and creates frustration for the people who do it (which leads to the common feeling that 'QA is a stepping stone to becoming a developer' - and we do not want that! We want testers who love their job and provide great value!)
- The list of manual tests grows with every feature - and so does the manual work of running all of them for every release. You end up with very low test coverage, since people can only spend so much time testing a release before the company goes belly up.
- It creates a knowledge silo in the organization, and a "throw it over the fence" mentality for developers who are only interested in hitting a date, but not whether the feature quality is up to par ("not my job").
- This knowledge silo also creates psychological "bubbles" around people in different groups, which 'protect' them from information about what happens before and after they do their work. Essentially, people stop caring, or have little knowledge of how they contribute and where they stand in the long pipeline of delivering software to production ("Once I sign off on it, I have no idea what the next step is - I just check a box in ServiceNow and move on with my life").
- You never have enough time to automate your tests, because you are too busy testing things manually!
So manual testing is a scaling issue, a consistency issue and a reason people leave the company.
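To make the contrast concrete, here is a minimal sketch of what one of those repetitive manual checks can look like once it is automated (Python and pytest are used purely as an illustration; the endpoint, credentials and expected response are hypothetical):

```python
# test_smoke.py - a hypothetical smoke test replacing a manual "log in and check the dashboard" step.
# Assumes a test instance of the application is reachable at APP_URL.
import os

import requests

APP_URL = os.environ.get("APP_URL", "http://localhost:8080")


def test_login_returns_token():
    # The same steps a manual tester would click through, executed identically every time.
    response = requests.post(
        f"{APP_URL}/api/login",
        json={"username": "test-user", "password": "test-password"},
        timeout=10,
    )
    assert response.status_code == 200
    assert "token" in response.json()


def test_dashboard_requires_authentication():
    # Unauthenticated access should be rejected - easy to forget in a manual run, trivial here.
    response = requests.get(f"{APP_URL}/api/dashboard", timeout=10)
    assert response.status_code in (401, 403)
```

Once a check looks like this, it runs on every commit, identically, in seconds - which is exactly what the manual version can never do.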
Static Environments
An environment is usually a collection of one or more servers sitting inside or outside the organization (private or public cloud, and sometimes just some physical machines the developers might have lying around if they get really desperate).
In the traditional software development life cycle that uses pipelines, code is built and then promoted through the environments, which progressively look more and more like production, with "Staging" usually being the final environment that stands between the code and production deployment.
How code moves through the environments varies: some organizations force a merge to a special branch for each type of environment (i.e. "branch per promotion"). This is usually not recommended, because it means that for each environment the code has to be built and compiled again, which breaks a cardinal rule of CI/CD: build only once, promote many - that's how you get consistency. But that's a subject for another blog post.
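To illustrate "build only once, promote many" (a sketch only - the registry, image name and tags below are hypothetical), promotion can be as simple as re-tagging the exact artifact that already passed the previous stage, instead of rebuilding it from a per-environment branch:

```python
# promote.py - hypothetical promotion step: move the SAME image that passed the previous stage forward.
# The registry, image name and tag scheme are placeholders; adapt them to your own pipeline.
import subprocess
import sys


def promote(image: str, build_id: str, target_env: str) -> None:
    source = f"{image}:{build_id}"    # the immutable artifact produced by the single build
    target = f"{image}:{target_env}"  # e.g. myapp:staging - a pointer, not a rebuild

    subprocess.run(["docker", "pull", source], check=True)
    subprocess.run(["docker", "tag", source, target], check=True)
    subprocess.run(["docker", "push", target], check=True)


if __name__ == "__main__":
    # usage: python promote.py registry.example.com/myapp 1234 staging
    promote(sys.argv[1], sys.argv[2], sys.argv[3])
```

The point is that the binary (or image) that reaches production is byte-for-byte the one that was tested, not a fresh build of "the same" code.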
More problematic is the fact that the environments are "static" in the first place. Static means they are long-lived: an environment is (hopefully) instantiated as a set of virtual machines somewhere in a cloud, and then declared "DEV" or "TEST" or "STAGE" or any other name that signifies where it is supposed to be used in the organizational pipeline that leads to production. Then a specific group of people gets access to it and uses it to (as we said in the first item) test things manually.
But even if they test things with automated tests, the fact that the environment is long-lived is an issue for multiple reasons:
- Because they are static, there is only a fixed number of such environments (usually a very low number), which can easily cause bottlenecks in the organization: "We'd like to run our tests in this environment, but between 9 and 5 some people are using it" or "We'd like to deploy to this environment, but people are expecting this environment to have an older version"
- Because they are static, they become "stale" and "dirty" over time: each deploy or configuration change becomes a patch on top of the existing environment, turning it into a unique "snowflake" that cannot be recreated elsewhere. This creates inconsistency between the environments, which leads to problems like "The bug appears in this environment, but not in that environment, and we have no idea why"
- It costs a lot of money: an environment other than production is usually only utilized while people are at work; otherwise it just sits there, ticking away CPU time on Amazon, or electricity on-prem, wasting money. Imagine that you have a fleet of 100 testing machines in an environment that are utilized in parallel during load testing, but load testing only happens once or three times a day - for the other 12 hours a day these 100 machines just sit there, costing money.
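To put a hypothetical number on that last point: 100 instances at, say, $0.17 an hour (roughly a c5.xlarge on-demand rate) idling 12 hours a day comes to about $200 a day, or around $6,000 a month, spent on machines doing nothing - and that is before storage, load balancers and licenses. Your instance types and rates will vary, but the shape of the waste is the same.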
So static environments are a scale issue, a consistency issue and a cost issue.
Manual Configuration of Environments
In many organizations the static environments are manually maintained by a team of people dedicated to the well-being of those snowflakes: they patch them, they mend their wounds, they open firewall ports, they provide access to folders and DNS, and they know those systems well.
But manual configuration is an issue for multiple reasons:
- Just like manual testing, it takes a lot of time to do anything useful with an environment: bringing it up and operationalizing it takes painstaking dedication over multiple days, sometimes weeks, of getting everything sorted with the internal organization. Simple questions arise and have to be dealt with: who pays for this machine? Who gets access? What service accounts are needed? How exposed is it? Will it contain information that might need encryption? Will compliance folks have an issue with this machine being exposed to the public internet? These and more have to be answered, usually by multiple groups of people within a large organization. The same goes for any special changes to an existing environment, or, god forbid, debugging a stray application that's not working on this snowflake but does work everywhere else - a true nightmare to get such access, if you've ever been involved in such an effort.
- Not only is it slow, it also does not scale: onboarding new projects in the organization takes a long time - if they need a build server or a few "static" environments, that's days or weeks away. This slows the rate of innovation.
- There is no consistency between the environments, since everything is done manually and people find it hard to repeat things exactly. Without consistency, the feedback you get from deploying onto that environment might not be true, and the time to really know whether your feature works might be very long: maybe you'll only find out in production, a few weeks or months from now. And we all know that the longer you wait for feedback, the more it costs to fix issues.
- Importantly, there is no record of any changes made to environment configuration: all changes are manual, so there is no telling what was changed in an environment, and rolling back changes is difficult or impossible.
So manual configuration of environments hurts time to market, consistency, operational cost and the rate of innovation.
Organizational Policies
If your organization requires manual human intervention to approve everything that goes into production, this policy can cause several issues:
- No matter how much automation you have, the policy that drives the process will force manual intervention, and automation will not be accepted (especially auto-deploy to production)
- It will slow down time to market
- It will make small changes take just as long as large changes, and large changes take even longer
Here is the guidance I usually offer in these cases:
If you are doing manual testing
you probably want to automate things, but either you don't have time to automate because you are too busy doing manual tests, or you have people (not "resources" - people!) who do not have the automation skills yet:
If you don't have time to automate
you are probably in Survival Mode. You are over-committed. You will have to have a good, hard talk with your peers and technical leadership about changing your commitments so you get more time to automate; otherwise you are in a vicious cycle that only leads to a worse place (the less time you have to automate, the more you test manually, and so you get even less time to automate).
If your people are missing the skills:
coach them and give them time to learn those new skills (or possibly hire an expert to teach them)
If you are using static environments
- look into moving to Ephemeral environments: environments that are spun up and torn down automatically before and after a specific pipeline stage, such as testing.
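As a rough sketch of the idea (Terraform and pytest are used here only as examples; the directory layout and paths are assumptions about your setup), a pipeline stage can create the environment from code, run the tests against it, and tear it down again no matter what happened in between:

```python
# ephemeral_test_stage.py - hypothetical pipeline stage: create environment, test, always destroy.
import subprocess

ENV_DIR = "infra/test-env"  # placeholder path to the environment's Terraform configuration


def terraform(*args: str) -> None:
    subprocess.run(["terraform", *args], cwd=ENV_DIR, check=True)


def main() -> None:
    terraform("init", "-input=false")
    try:
        # Spin the environment up from code - identical every time it is created.
        terraform("apply", "-auto-approve", "-input=false")
        # Run the automated test suite against the freshly created environment.
        subprocess.run(["python", "-m", "pytest", "tests/"], check=True)
    finally:
        # Tear it down whether the tests passed or failed, so nothing is left running (or billing).
        terraform("destroy", "-auto-approve", "-input=false")


if __name__ == "__main__":
    main()
```

The try/finally is the important part: the environment disappears even when the tests fail, so there is nothing left around to get stale, dirty or expensive.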
If you are configuring your static environments manually:
Look into tools such as Chef Provisioner, Terraform or Puppet to automate the creation of those environments in a consistent, fully automated way that also keeps configuration current. This solves the issue of having no record of configuration changes to environments, since every change takes the form of changing a file in source control, with the tool taking care of the provisioning and configuration for you. Less human error, auditing, versioning: who wouldn't want that? It's the essence of infrastructure as code: treating our infrastructure the same way we treat our software - through code! This also gives you the ability to test your environments and configuration with tools such as Chef Compliance, InSpec and others.
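The last point - testing the environments themselves - deserves a small illustration. Tools like InSpec have their own DSL for this, but even a plain test (a Python sketch here, with made-up hostnames and ports) conveys the idea: the expected shape of the environment becomes an executable check instead of tribal knowledge:

```python
# test_environment.py - hypothetical environment checks, in the spirit of tools like InSpec.
# Hostnames and ports are placeholders for whatever your environment is supposed to look like.
import socket

import pytest

EXPECTED_OPEN_PORTS = [("app.test.internal", 443), ("db.test.internal", 5432)]


@pytest.mark.parametrize("host,port", EXPECTED_OPEN_PORTS)
def test_expected_port_is_reachable(host: str, port: int) -> None:
    # If provisioning drifted or a firewall rule went missing, this fails long before a deploy does.
    with socket.create_connection((host, port), timeout=5):
        pass
```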
If you have organizational policies that prevent automated deployment:
- Look into creating a specialized, cross-functional team within the organization that is tasked with creating a new policy allowing some categories of application changes to be "pre-approved".
- This enables pipelines to deploy changes that fall into this category directly to production, without going through a very complicated approval process that takes a long time.
- For other types of changes, automate as many of the compliance, security and approval checks as you can (yes, even ServiceNow has APIs you can use - did you know? See the sketch after this list), and then put them into a special approval pipeline that removes as much of the manual burden from the change committees as possible, so that their meetings become more frequent and faster: "We want to make this change, it has already passed automated compliance and security tests in a staging environment - are we good to go?"
- For changes that affect other systems in the organization, you can look into triggering those external systems' pipelines so they get deployed alongside your new system into a test environment, and then making sure tests are run on the dependent system to see if you broke it. Pipelines that trigger other pipelines can be an ultimate weapon for detecting cross-company changes with many dependencies. I will expand on this in a later post.
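On the ServiceNow point above, here is a minimal sketch of what "the pipeline files the change request for you" can look like, using ServiceNow's REST Table API (the instance URL, credential handling and field values below are placeholders - your instance's required fields will differ):

```python
# create_change_request.py - hypothetical pipeline step that files a ServiceNow change request
# via the REST Table API instead of a human filling in a form.
import os

import requests

INSTANCE = os.environ["SNOW_INSTANCE"]  # e.g. "https://yourcompany.service-now.com"
USER = os.environ["SNOW_USER"]
PASSWORD = os.environ["SNOW_PASSWORD"]


def create_change_request(summary: str, build_id: str) -> str:
    response = requests.post(
        f"{INSTANCE}/api/now/table/change_request",
        auth=(USER, PASSWORD),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        json={
            "short_description": summary,
            "description": f"Automated change for build {build_id}: "
                           "compliance and security checks already passed.",
        },
        timeout=30,
    )
    response.raise_for_status()
    # Return the record id so the pipeline can poll it for approval before deploying.
    return response.json()["result"]["sys_id"]
```

The pipeline can then wait on that record (or a ServiceNow approval workflow can call back) before the final deploy step runs, instead of a person copying build details into a form by hand.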