In Defence of Breaking Changes

James Uther
2020-10-02

Is nuance absolutely awesome, or simply rubbish?
- The News Quiz, 103:2

For the purposes of this post let's assume it's simply rubbish.

Received wisdom is that breaking changes to supporting software (OS, libraries, services, etc.) are bad. This makes intuitive sense: an API is a contract, and contracts are to be honoured.

We have SemVer to attempt to manage changes. Rich Hickey thinks that SemVer is wrong and that you should just accrete (only ever add to an API, never change or remove). Platform providers go to great lengths to let you keep running your broken software; here are some Microsoft war stories. And of course, Linus has an opinion (and is working on expressing his opinions in a more constructive way).
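To make the SemVer bargain concrete, here is a minimal Python sketch of the compatibility check a package manager might perform under it. The helper names are hypothetical, and pre-release tags and build metadata are ignored; the only point is that SemVer promises breakage is confined to a MAJOR bump.

```python
# Minimal sketch of the SemVer compatibility promise (MAJOR.MINOR.PATCH).
# Hypothetical helpers; pre-release tags and build metadata are not handled.

def parse(version: str) -> tuple:
    """Split "1.4.2" into (1, 4, 2)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def is_compatible(required: str, candidate: str) -> bool:
    """True if `candidate` may replace `required` without breaking callers,
    assuming the publisher follows SemVer faithfully."""
    req, cand = parse(required), parse(candidate)
    if req[0] == 0:
        # SemVer says anything may change during 0.y.z, so this sketch
        # only treats an exact match as safe.
        return cand == req
    # Same MAJOR version, and not older than what was asked for.
    return cand[0] == req[0] and cand >= req


print(is_compatible("1.2.3", "1.4.0"))  # True: minor bump
print(is_compatible("1.2.3", "2.0.0"))  # False: a major bump may break
```

Everything rests on "assuming the publisher follows SemVer faithfully": the version number is a claim about a change, not a test of it.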

But there's a holdout. Recently Steve Yegge had a rant about how Google keeps ignoring this sensible consensus. And we all know that's true. Just try saying 'Google Reader' to a bunch of engineers and feel the wave of unresolved loss, and that was a free service, not a paid and supported API. But why does Google do that? Now, the first law of software engineering is “You are not Google”. You do not have a magic money machine in the basement and a good share of the world's engineering talent working for you (if you do, DM me). It turns out that internally Google uses that money and talent to do things the hard, right way (sometimes).

“God is change,” says the prophet in Octavia Butler's “Parable of the Sower”. [Sidenote: “In the ongoing contest over which dystopian classic is most applicable to our time […] Butler's novel […] may be unmatched.” (ref)] Google faced the fact that things change, and that “Software Engineering is Programming integrated over time”. Getting a program to run once is programming. Getting it to keep running when the OS gets upgraded, the team changes, requirements change, dependencies are upgraded, laws change, businesses change, […] is engineering. For good software to remain good software into the future, the code must be malleable and deployable.

Let's narrow this to managing dependencies. Managing versioning between modules is a problem that scales quadratically, since roughly every pair of modules is a potential incompatibility, and it is worse than you think given Hyrum's law: with enough users, every observable behaviour of a module will be depended on by somebody, so in the limit there is no such thing as a private interface. In Google's case it's their own code, so they can tackle the problem at the root. Their solution seems to be 'live at head'. It ideally goes something like this:

everything is comprehensively covered by automated testing
a module is published (and all of it is public interface; see Hyrum's law)
it becomes widely used
a new version is published. It breaks a test somewhere.
a decision is made whether to fix the module, or fix the consumer
if many consumers will break, and they all need to be fixed, a tool is shipped to automate the fix
given it's all in an obsessively tested monorepo, the change is shipped when 'Google works with this change'
the tail of supported versions is therefore very short, if it exists at all
A relatively recent presentation, “Live at Head”, gives a far more authoritative view.
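The loop in the list above can be sketched in a few lines of Python. Everything here is hypothetical: a toy "monorepo" of consumer source strings, a v2 of a shared module that renames `fetch` to `get`, a migration tool that is a plain text substitution (a real one would rewrite the syntax tree), and a decision rule that always prefers fixing the consumers over fixing the module. It only illustrates the shape: find what breaks, run the tool, and ship only when every test passes again.

```python
import types
from typing import Dict


# Hypothetical v2 of a shared module: `fetch` has been renamed to `get`.
def make_lib_v2() -> types.SimpleNamespace:
    return types.SimpleNamespace(get=lambda key: f"value:{key}")


# Hypothetical consumers in the monorepo, each stored as source text.
CONSUMERS: Dict[str, str] = {
    "billing": "def run(lib):\n    return lib.fetch('invoice')\n",
    "search": "def run(lib):\n    return lib.fetch('query')\n",
    "mail": "def run(lib):\n    return 'does not use fetch'\n",
}


def passes(source: str, lib: types.SimpleNamespace) -> bool:
    """Stand-in for a consumer's test suite: load it and call run(lib)."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        namespace["run"](lib)
        return True
    except Exception:
        return False


def migrate(source: str) -> str:
    """Hypothetical migration tool shipped with v2 (text substitution only)."""
    return source.replace("lib.fetch(", "lib.get(")


def land_change(consumers: Dict[str, str], lib: types.SimpleNamespace) -> Dict[str, str]:
    """Return the consumers' source after landing the new version, or raise."""
    broken = [name for name, src in consumers.items() if not passes(src, lib)]
    # Decision point: this sketch always fixes the consumers with the tool.
    fixed = {name: (migrate(src) if name in broken else src)
             for name, src in consumers.items()}
    still_broken = [name for name, src in fixed.items() if not passes(src, lib)]
    if still_broken:
        raise RuntimeError(f"not shipping, still broken: {still_broken}")
    return fixed  # every consumer's tests pass with the change, so it ships


shipped = land_change(CONSUMERS, make_lib_v2())  # billing and search migrated
```

`land_change` refuses to ship while anything is still broken, which is the "works with this change" criterion in miniature; a consumer the tool cannot fix becomes a job for a person.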

In essence, the contract changes from “We will not break the contract” to “We might break the contract but will provide good tools/docs (perhaps API credits?) to assist migration.”

Now, the point of the Yegge rant above is that Google is not doing this reliably in its public APIs. It's hard! But let's assume utopia. What would a software project that uses these 'live at head' APIs look like? A few things seem necessary:

the engineering is live, as in there is institutional memory about how to make changes and get them into production
the code is well provisioned with automated tests
there is a live CI system that speculatively upgrades dependencies
if a test breaks because of a dependency upgrade there is someone who can step in to investigate
the step above is much simpler if the code is regular enough for automated upgrade tools to work. Newer languages like Go are designed with this in mind (gofmt gives all Go code one canonical layout, and 'go fix' rewrites programs that use old APIs)
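A CI job that speculatively upgrades dependencies might look roughly like the following Python sketch. The dependency names, versions and the `run_tests` callback are invented for illustration; a real system would resolve versions from a registry and run the project's actual test suite. Upgrades are tried one at a time so that a failing test points at a single culprit, which is exactly what the person stepping in to investigate needs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class UpgradeReport:
    pins: Dict[str, str]  # versions in force after this CI run
    upgraded: Dict[str, str] = field(default_factory=dict)
    needs_human: Dict[str, str] = field(default_factory=dict)


def speculative_upgrade(
    pinned: Dict[str, str],
    latest: Dict[str, str],
    run_tests: Callable[[Dict[str, str]], bool],
) -> UpgradeReport:
    """Try each newer version on its own; keep it if the tests still pass,
    otherwise leave the old pin and flag the upgrade for a person."""
    current = dict(pinned)
    report = UpgradeReport(pins=current)
    for name, version in latest.items():
        if name not in current or current[name] == version:
            continue  # not one of our dependencies, or already at head
        candidate = {**current, name: version}
        if run_tests(candidate):
            current[name] = version
            report.upgraded[name] = version
        else:
            report.needs_human[name] = version
    return report


report = speculative_upgrade(
    pinned={"http": "1.4.0", "parser": "2.1.0"},
    latest={"http": "1.5.0", "parser": "3.0.0"},
    run_tests=lambda deps: deps["parser"] != "3.0.0",  # pretend parser 3 breaks us
)
print(report.upgraded)     # {'http': '1.5.0'}
print(report.needs_human)  # {'parser': '3.0.0'}
```

Running this on every upstream release keeps the project close to head, and `needs_human` is the short queue of work for whoever holds the institutional memory.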

This is not a big investment, and it is well worth it. But it is a change of mindset. A software project for an enterprise customer would not just hand over a binary. It wouldn't even just hand over a service and a monthly AWS bill. The project would involve setting up and leaving behind (or maintaining, for a reasonable consideration) the engineering capability listed above. In return, the customer would have software that has all the latest security patches, performance improvements and so forth, and whoever is running the service would have the capability to optimise runtime platforms, etc. That's a win/win from where I'm standing.

(Originally here)