Cost Of Software Hygiene

If your senior software engineers can’t clone and make a working copy of your software system, and then contribute to your project in a day in meaningful ways, you have a bad software practice.

If you are a senior engineer and haven’t tested this capability on your project, or if getting the system up and running, from a customer perspective, is not well documented and a trivial number of steps, you are failing at software.

If your excuse is that “the software is way too complicated” to start up, you have failed.

Any software business can create a development environment where an engineer can make a working copy of the system in a day, and use it to meaningfully contribute. Consider Linux itself, Apache Kafka, and Apache Druid. These are incredibly complex systems, and this truth holds for all of them.

If you don’t have this quality, it means you yourself can’t get a new computer and start working again almost instantly. This is the ultimate ‘dev velocity’ failing: if you yourself can’t be productive on a new computer, then you are also unable to make a new working copy of the system.

As soon as you can’t easily make another copy of the system, you will have failed, from a productivity standpoint.

I’m not being overly dramatic.

What happens next is that you can’t have more groups of people working independently on the system, because they can’t make their own QA environment where they can work on the system and test it independently until it is fully ready.

You need almost as many copies of the software system as you have developers.

You want it to be easy to make, and easy for people to isolate themselves, and run the entire thing, until they have actually used and verified their own changes.
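One way to picture that isolation is to give every developer their own namespace and their own block of ports, so that full copies of the system can run side by side without colliding. The sketch below is only an illustration under assumptions of mine: the `qa-` prefix, the base port, and the block size are hypothetical and not taken from any real system.

```python
def allocate_environments(developers, base_port=20000, ports_per_env=100):
    """Give each developer a private namespace and a non-overlapping port block.

    All values here are illustrative assumptions: the "qa-" prefix,
    the base port, and the block size are not from any real system.
    """
    environments = {}
    # dict.fromkeys drops duplicate names while keeping the original order.
    for slot, name in enumerate(dict.fromkeys(developers)):
        first_port = base_port + slot * ports_per_env
        last_port = first_port + ports_per_env - 1
        if last_port > 65535:
            raise ValueError("not enough ports for this many environments")
        environments[name] = {
            "namespace": f"qa-{name}",
            "ports": range(first_port, last_port + 1),
        }
    return environments


if __name__ == "__main__":
    for name, env in allocate_environments(["alice", "bob", "carol"]).items():
        print(name, env["namespace"], env["ports"][0], env["ports"][-1])
```

Slots are handed out in list order, so two developers never share a port, and adding a developer at the end of the list does not move anyone else.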

I’m not talking about some small system; I’m referring to Amazon Ads size systems and major portions of entire investment banks. At hedge funds, this is in fact what we do to keep teams productive. We need the hygiene to survive. We don’t make excuses; we solve the challenges and reap the rewards of the hygienic environment. It does feel painful, but it drives velocity more than any other investment, and so it is done, because at a hedge fund we are truly interested in the success of the team and the fund, not our own. While no one gets proper credit for the hygiene investments, the software thrives, which can help the fund survive, and so we focus on the most important things, despite lacking “attribution” to ourselves.

If developers can’t mint their own QA environment, you can’t scale your operation. These are facts I’ve seen first hand in some of the biggest dot com and financial software shops in the world. I’ve also made sure, in at least a few investment banks and hedge funds, that we were able to make a new copy of the entire system rapidly, in a few hours or a day. I did this out of fear, so that I myself could be effective. Others rallied to the cause. This is what a passionate team does to make sure everyone is effective.

At Lehman Brothers, we could start and run the entire ‘POINT’ bond reporting and valuation system in a few minutes on any developer’s machine. This system was a many-tiered distributed beast, an incredibly complex financial reporting system. Any part of it could be built and tested almost instantly, and we could make a new copy to verify a build almost instantly. This was back in 2003. We did it because we knew it was critical for our own survival. I did the same thing again at ExodusPoint, where I designed the majority of the fixed income software systems; same story. At ExodusPoint a tiny team can work across the entire hedge fund on a moment’s notice. In high finance, we have this kind of software hygiene because it is necessary to survive, so we figure it out.

Without the ability to create another working copy of your entire software system, in an hour or two, you have failed. You can’t make another QA, and then a new team or new person can’t spin up and work independently.

From this lack of hygiene, everything deteriorates into everyone stepping on each other, and giant broken systems where nothing is ever really working in your one or two QA environments, because so many people are throwing code at them. Everyone is frustrated, and then what do you do wrong? Instead of fixing this first problem and making it easy to make more copies of the system, you think Microservices will save you. They will not save you, and you make the problem worse instead of better.

So start out with the ability to easily create another working copy of your SaaS software system. There is no downside to this level of software hygiene, and it solves more of your problems than anything else. Make sure it is practiced across all teams, and have someone verify that teams are holding to a high standard here, such that a senior engineer can assemble another copy of your entire system somewhere else, fast, in an hour or less.
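One way to have someone verify that standard is to script it: run the documented setup steps in a fresh, empty directory, time them, and fail if they break or blow the budget. What follows is a minimal sketch of that idea, not a definitive tool. The function name `run_bootstrap`, the `StepResult` record, and the placeholder steps are all assumptions of mine; a real step list would be your own clone, build, and start commands.

```python
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass


@dataclass
class StepResult:
    command: list
    seconds: float
    ok: bool


def run_bootstrap(steps, budget_seconds):
    """Run each setup step in a fresh, empty directory and time it.

    Stops at the first failing step, or once the time budget is spent.
    Returns (passed, results) where results holds one entry per step run.
    """
    results = []
    started = time.monotonic()
    with tempfile.TemporaryDirectory() as workdir:
        for command in steps:
            remaining = budget_seconds - (time.monotonic() - started)
            if remaining <= 0:
                return False, results  # budget spent before this step
            step_started = time.monotonic()
            try:
                completed = subprocess.run(command, cwd=workdir, timeout=remaining)
                ok = completed.returncode == 0
            except (subprocess.TimeoutExpired, OSError):
                ok = False  # step ran out of time or could not be launched
            results.append(StepResult(command, time.monotonic() - step_started, ok))
            if not ok:
                return False, results
    return True, results


if __name__ == "__main__":
    # Placeholder steps so the sketch runs anywhere (hypothetical stand-ins
    # for your own clone, build, and start commands).
    steps = [
        [sys.executable, "-c", "print('clone')"],
        [sys.executable, "-c", "print('build')"],
        [sys.executable, "-c", "print('start and smoke test')"],
    ]
    passed, results = run_bootstrap(steps, budget_seconds=3600)
    for result in results:
        status = "ok" if result.ok else "FAILED"
        print(f"{status:6} {result.seconds:6.1f}s  {' '.join(result.command)}")
    print("PASS" if passed else "FAIL")
```

Run on a schedule from a clean machine, a check like this turns "an hour or less" from an aspiration into something that fails loudly the day it stops being true.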

A fallacy I’ve heard is “this is too expensive to maintain.” In fact, failing to invest in this kind of hygiene is the most expensive decision you can ever make.

If this is not supported, figure out how to do it before any rearchitecture of any sort. Lacking it means you’ve made the system unnecessarily complicated, and you need to eliminate the complexity, not introduce more, to solve your dev velocity problems.

We know this is true because Linux, Kafka, Druid and many other highly complex software systems enjoy this capability.
They do this because they need it to be effective, and to welcome new contributors.

You need this exact same quality in your business system, at all costs. To deny this is effectively to say that it is better to have 3x the staff than 1x the staff plus this small up-front dev velocity investment. This is the correct value to attribute to a hygienic software environment, despite it being “hard to measure.”

This hygiene may not get investment, because the velocity improvements are nearly impossible to measure, but the improvement exists nonetheless.

Once you’ve done it the right way, you know.

Chris Rodier
