don’t add that button – blog.moritzhaarmann.de

We very recently bought a car. And I was surprised how many models are on the market that still use manual shifting. I know how to drive them, probably quite well even, but you couldn't get me to buy one of those. The machine is just much better at shifting the gears up and down than a human. And also, it's not an activity I enjoy doing. Which brings me to the point of this post.

Every system that reaches a certain level of complexity picks up an operational dimension at some point. Whether that's only some regular database maintenance or more involved tasks depends entirely on the system, but let's pretend for a second we're looking at a system that loads product data from a source and provides that data to other services using something like HTTP and JSON. Of all the kinds of services I've worked with, this is probably the one I've seen most often.

As you build this system, it evolves from just being a database that gets populated by running a job every once in a while into something more complex. You figure that hitting your database for every single GET might make things slow, so you add some object caching. That's fun, and it really helps with your performance. As you go, you add more bells, more whistles. There's nothing wrong with that, but there's one thing to be very vigilant about: the first button.

In our hypothetical example, imagine you find out about an edge case where the object cache does not always get purged correctly when new data wanders into the system. It's an edge case, it doesn't happen too often, and really, there are five more important items on your to-do list. So you add a button. Or a how-to in your internal wiki. Some means of manually resolving that problematic situation. And that's the first button. Don't build the first button. Why?

You do not want to build manual controls for anything that the machine can do without input from a human. Purging a cache doesn't require any input from a human. There are no parameters. It's simply "fix the situation". Once you start to build controls for things that the machine should have been doing by itself in the first place, you actually build a new kind of solution – one that contains one or many humans as orchestrators. And that's the worst kind of solution, for two reasons. Firstly, you're introducing a really unstable, non-deterministic and not always available element into your system. That's generally not a good idea. Secondly, it just doesn't scale. If it were only about pushing one button, fair enough. But people need to understand which conditions are problematic, how to detect them, and then go into a system to resolve something. That's a lot of training and contextual knowledge that needs to be shared.

Build systems that run themselves. In our example here, the very moment you detect that there's a problem with cache invalidation is the right moment to fix that cache invalidation. This can be fixed; it just needs the right investment. It might take a little longer, but fixing the problem right there is the right solution – not having someone push random buttons.
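To make that a bit more concrete, here's a minimal sketch of what "the machine does it by itself" could look like for the hypothetical product service. Everything in it is an assumption for illustration: the `ProductStore` class, its method names, and the two plain dictionaries standing in for a real database and a real object cache. The only point is that the import path purges the cache entry as part of the write, so there is nothing left for a human to purge.

```python
from typing import Dict, Iterable, Optional


class ProductStore:
    """Hypothetical product service: a 'database' plus an object cache."""

    def __init__(self) -> None:
        self._db: Dict[str, dict] = {}     # stands in for the real database
        self._cache: Dict[str, dict] = {}  # stands in for the object cache
        self.db_reads = 0                  # counts cache misses, for illustration

    def get(self, product_id: str) -> Optional[dict]:
        """Read-through lookup: serve from the cache, fall back to the database."""
        cached = self._cache.get(product_id)
        if cached is not None:
            return cached
        self.db_reads += 1
        product = self._db.get(product_id)
        if product is not None:
            self._cache[product_id] = product
        return product

    def import_products(self, products: Iterable[dict]) -> None:
        """Import job: write new data and purge the matching cache entries."""
        for product in products:
            product_id = product["id"]
            self._db[product_id] = dict(product)
            # The purge is part of the write itself. No button, no wiki page.
            self._cache.pop(product_id, None)


store = ProductStore()
store.import_products([{"id": "p1", "price": 10}])
first = store.get("p1")    # cache miss, read from the database
store.import_products([{"id": "p1", "price": 12}])
second = store.get("p1")   # entry was purged on import, so this is fresh data
```

A real system has more moving parts than two dictionaries, of course (several cache nodes, failed purges, race conditions), and that's exactly where the investment goes. But the shape stays the same: whatever writes the data is also responsible for making the cache consistent.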

Build robust systems, not fancy buttons.