Testing for Failure

This morning I did something I've never done before.

I walked into my office, plugged in my laptop, and killed the power to my server rack.

You see, after Storm Ciara hit the UK, I installed a UPS to ensure a managed shutdown of my servers. This morning was my first test of it.

One thing I have seen recently is that predicting failures often isn't enough. You need to cause them so that you can:

Understand what happens in a failure. Seeing a failure first-hand reveals further mitigations and helps you recognise similar failures faster.
Verify any backup or recovery systems work as expected.
Identify any knock-on issues.

Looking for knock-on issues can be incredibly valuable. I recently had a case where we had a database distributed across three nodes. An infrastructure issue caused all three nodes to go down, and only one recovered.

When that one node came back online, the system started processing transactions again, or at least as many as a single node could handle. All of this was expected behaviour.

However, it turned out that all of those failed transactions were filling the logs on the single running node, which then failed too because its disk filled up.
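One common guard against this kind of log-driven disk exhaustion is to cap how much space the logs may use. The sketch below is only an illustration, not the fix applied to the system described here: it uses Python's standard RotatingFileHandler, and the logger name, file path, and size limits are assumptions made up for the example.

```python
import logging
import logging.handlers
import os
import tempfile


def make_bounded_logger(log_path, max_bytes=2048, backups=2, name="txn-failures"):
    """Return a logger whose files cannot grow without limit.

    The name "txn-failures" and the default sizes are hypothetical values
    chosen for this sketch, not settings from any real system.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING)
    logger.propagate = False  # keep records out of the root logger's handlers

    # Roll over before the current file would exceed max_bytes, keeping at
    # most `backups` old files. Older records are discarded on rollover.
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def total_log_bytes(directory):
    """Sum the sizes of all files in the log directory."""
    return sum(
        os.path.getsize(os.path.join(directory, entry))
        for entry in os.listdir(directory)
    )


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    log = make_bounded_logger(os.path.join(log_dir, "failures.log"))

    # Simulate a flood of failed transactions hitting one surviving node.
    for txn_id in range(10_000):
        log.error("transaction %d failed: node unavailable", txn_id)

    print("files kept:", sorted(os.listdir(log_dir)))
    print("bytes on disk:", total_log_bytes(log_dir))
```

The trade-off is that the oldest log records are thrown away once the cap is reached, so this protects the disk at the cost of losing history. Whether that is acceptable depends on the system, which is exactly the sort of thing a failure test would tell you.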

The secondary failure was easy to prevent in future, but it might have been missed without testing. So if you have a system that you intend to be resilient, pull network cables, shut down machines, restore from backup, and see how it reacts.

And the result of my UPS test? A failure - the shutdown sequence of the UPS doesn't work as I thought, so it was a test worth doing!

21 September 2020 | Software Engineering