If a Build Takes 4 Hours, Run It Every 4 Hours

(Note: You might need to fix your environment bottlenecks before you can act on this blog post.)

Builds (i.e., compile + run tests + deploy + more tests, etc.) are a bottleneck for getting feedback on our code, and that feedback is what lets us make more changes or releases with confidence. There are two main factors here:

  • How often a build runs (per commit, hourly, weekly, etc.)

  • How long a build takes to finish

Many companies are also reducing their reliance on manual, repetitive QA cycles, but still need some manual tests or verifications during the transition (for example, while more automated tests are being built, or for types of tests that currently don’t make sense to automate). So we’ll add one more factor here:

  • When the manual verification takes place

Sometimes the manual verification can only take place after the build has finished, because of deployment requirements, environment requirements or the need for the built binaries to be available. This means the build acts as one of the constraints on our process.

“How Often?” Is the New Constraint

We can assume the following based on the Theory of Constraints:

  • The more time passes between builds, the more changes from various team members are integrated into each build

  • This means the batch size of changes is larger, which means:

    • We need to test and verify more things: manual testing takes longer, which makes the feedback loop longer

    • More changes carry more compounded risk, which makes everyone more nervous and more likely to ask for more verification to be done
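
To put rough numbers on the batch size, here is a minimal Python sketch. It assumes a steady commit rate, which real teams don’t have. The function name and the rate of 3 commits per working hour are made up for illustration.

```python
def expected_batch_size(commits_per_hour: float, working_hours_between_builds: float) -> float:
    """Average number of commits that land in a single build."""
    return commits_per_hour * working_hours_between_builds


# Hypothetical team: 3 commits per working hour, 8 working hours a day.
COMMITS_PER_HOUR = 3

for label, gap_hours in [("every 4 hours", 4), ("nightly", 8), ("weekly", 40)]:
    batch = expected_batch_size(COMMITS_PER_HOUR, gap_hours)
    print(f"{label}: about {batch:.0f} commits to verify per build")
```

Whatever the real rate is, the number of changes to verify per build grows in step with the gap between builds.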

What can we do to reduce the batch size of each build, and get a faster feedback loop?

The immediate answer is “run the build more often”. But, how often can that be?

An anti-pattern I see in many teams is the following line of thinking:

Because the build takes a long time (4 hours), let’s only run it at night or once a week

But the thinking should be reversed:

Because the build takes 4 hours, let’s run it every 4 hours

What do we get by that?

  • The feedback loop shrinks to 4-8 hours, instead of 24 hours with a daily build or a few days with a weekly build. A developer making changes right now can get on the next “build train”, which will start in a couple of hours and end 4 hours later. It’s not perfect, but it’s much better than waiting until the next day or the end of the week, and there is less reason to hold that important change until 4pm before pushing it to the master branch.

  • We always have a “latest stable build”: the most recent green build, with a new candidate arriving every 4 hours. Assuming the build also deploys to a dynamic demo environment, we always have a fresh-enough demo to show, and if it isn’t fresh enough, we just wait 4 hours.

  • The batch size is much smaller now. How many changes can fit into 4 hours? Fewer than fit into 24 hours, that’s for sure. This means verification can be much faster too.

This also means that if I need to manually verify an issue I fixed, I don’t have to wait until tomorrow to verify it as part of an integrated build. In the worst case I wait 8 hours (same day half the time!).
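
The 4-8 hour range above can be checked with a small sketch. It assumes builds start on a fixed schedule (midnight, 4am, 8am and so on) and never queue or fail. The function names and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def next_build_start(commit_time: datetime, first_build: datetime, interval: timedelta) -> datetime:
    """Return the start of the first scheduled build at or after commit_time."""
    if commit_time <= first_build:
        return first_build
    elapsed = commit_time - first_build
    # Ceiling division: how many whole intervals until a build starts at or after the commit.
    intervals = -(-elapsed // interval)
    return first_build + intervals * interval


def feedback_time(commit_time: datetime, first_build: datetime,
                  interval: timedelta, build_duration: timedelta) -> timedelta:
    """Time from the commit until the build that contains it has finished."""
    start = next_build_start(commit_time, first_build, interval)
    return (start - commit_time) + build_duration


schedule_start = datetime(2024, 1, 1, 0, 0)  # hypothetical first build of the day
four_hours = timedelta(hours=4)

for commit in (datetime(2024, 1, 1, 10, 0),   # mid-cycle
               datetime(2024, 1, 1, 12, 0),   # just caught the train
               datetime(2024, 1, 1, 12, 1)):  # just missed it
    print(commit.time(), "->", feedback_time(commit, schedule_start, four_hours, four_hours))
```

A commit that just catches the train gets feedback in 4 hours, and one that just misses it waits almost 8. Swap in a one-hour interval and a one-hour build and the same functions give a 1-2 hour loop.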

“How Long Does It Take?” Is the Next Constraint

Now that we’ve reduced the time between builds (and only now, after fixing the bigger constraint of when the build runs), the build time itself is the next largest constraint keeping the feedback loop from getting shorter.

Every change we make to shorten the build time (by parallelizing tests or steps, for example) now has a compounded effect: the build time is our feedback cycle time. If the build now takes only one hour instead of four, we can run it every hour.

So, manually verifying a change in an integrated build can now happen within 1-2 hours of making, committing and pushing that change. Definitely same-day.

What if the build takes 30 minutes? You get the idea. The loop is now 30-60 minutes long.
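
As a rough illustration of what parallelizing buys, the sketch below estimates wall-clock time when test suites are spread across several workers. The suite durations are invented. The scheduling is a simple greedy heuristic (longest suite first, onto the least-loaded worker), not what any particular CI tool does, and it ignores steps that can’t be split, such as compile or deploy.

```python
import heapq


def parallel_build_minutes(suite_minutes: list[float], workers: int) -> float:
    """Estimate wall-clock minutes when suites are spread over `workers` machines."""
    if workers < 1:
        raise ValueError("need at least one worker")
    loads = [0.0] * workers  # running total of minutes assigned to each worker
    heapq.heapify(loads)
    # Longest suite first, always onto the worker with the least work so far.
    for minutes in sorted(suite_minutes, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + minutes)
    return max(loads)


# Hypothetical suites adding up to a 4-hour (240-minute) serial build.
suites = [90, 60, 45, 30, 15]

for workers in (1, 2, 4):
    print(f"{workers} worker(s): {parallel_build_minutes(suites, workers):.0f} minutes")
```

With these made-up numbers, two workers cut the build to 120 minutes, and four workers only get it down to 90 because the longest suite becomes the floor. Past that point, shortening the build means splitting that suite, not adding workers.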

Environments Could Be the Original Constraint (Bottlenecks)

This whole conversation might not be feasible for you if environments are your bottleneck.

Sometimes teams are forced to run the build less often because they don’t want to deploy to a static environment while someone else might be using it. I wrote about this before: Static environments are a huge bottleneck.

So a huge factor here is your ability to have an environment per build that gets destroyed after a specified time or when some other condition is met. The good news is that if you’re using Docker and something like Kubernetes, this ability is 5 minutes away: K8s namespaces can be generated dynamically per build.
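
Here is a minimal sketch of the environment-per-build idea, assuming kubectl is installed and pointed at your cluster. It only builds the commands and does not run them. The function names, the `build` prefix and the `k8s/` manifest directory are hypothetical; `kubectl create namespace`, `kubectl apply` and `kubectl delete namespace` are standard kubectl commands.

```python
import re


def namespace_for_build(build_id: str, prefix: str = "build") -> str:
    """Turn a build identifier into a valid Kubernetes namespace name.

    Namespace names must be DNS labels: lowercase letters, digits and '-',
    at most 63 characters, starting and ending with a letter or digit.
    """
    raw = f"{prefix}-{build_id}".lower()
    cleaned = re.sub(r"[^a-z0-9-]+", "-", raw)  # replace anything not allowed
    cleaned = cleaned[:63].strip("-")           # enforce length and edge characters
    if not cleaned:
        raise ValueError("build_id produced an empty namespace name")
    return cleaned


def environment_commands(build_id: str) -> dict[str, list[str]]:
    """Commands to create, fill and destroy a throwaway environment for one build."""
    namespace = namespace_for_build(build_id)
    return {
        "create": ["kubectl", "create", "namespace", namespace],
        # "k8s/" is a hypothetical directory holding the app's manifests.
        "deploy": ["kubectl", "apply", "--namespace", namespace, "-f", "k8s/"],
        "destroy": ["kubectl", "delete", "namespace", namespace],
    }


for step, command in environment_commands("Feature/Login_42").items():
    print(step, "->", " ".join(command))
```

Each command list can be handed to `subprocess.run(command, check=True)` from the build script, with the destroy step scheduled for whenever the environment should expire.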

99% of static environments should be a thing of the past. We need to know to ask for this from IT departments that hold on to old ideas and simply reimplement them on new tools that make newer ideas possible.

You might have to fix this issue first, or at least create one static, automation-only environment that humans do not touch. Once that is solved, the constraint opens up and you can start working from the top of this article.