This morning I did something I’ve never done before.
I walked into my office, plugged in my laptop, and killed the power to my server rack.
You see, since Storm Ciara hit the UK, I’ve installed a UPS to ensure a managed shutdown of my servers. This morning was my first test.
One thing I have seen recently is that predicting failures often isn’t enough. You need to cause them so that you can:
Looking for knock-on issues can be incredibly valuable. I recently had a case where we had a database distributed across three nodes. An infrastructure issue caused all three nodes to go down, and only one recovered.
As expected, when that one node came back online, the system started processing transactions again (as far as that single node could handle them).
However, it turned out that all of the failed transactions started filling the logs on the single running node, which then failed as well because its disk was full.
The secondary failure was easy to prevent in future but might have been missed without testing. So if you have a system that you intend to be resilient, pull network cables, shut down machines, restore from backup, and see how it reacts.
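As a minimal sketch of the kind of guard that can catch a filling disk before it takes a node down: the check below is an illustration only, not the fix used in the case above. The function names, the monitored path, and the 90% threshold are all assumptions; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def disk_used_fraction(path: str = "/") -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem at `path` in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total


def should_alert(used_fraction: float, threshold: float = 0.9) -> bool:
    """Hypothetical policy: alert once usage reaches the threshold."""
    return used_fraction >= threshold


if __name__ == "__main__":
    # "/" is an assumed path; point this at the volume that holds the logs.
    fraction = disk_used_fraction("/")
    if should_alert(fraction):
        print(f"WARNING: disk is {fraction:.0%} full")
    else:
        print(f"OK: disk is {fraction:.0%} full")
```

Run on a schedule during a failure test, a check like this makes the slow secondary failure visible while there is still time to react, rather than after the disk is already full.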
And the result of my UPS test? A failure – the shutdown sequence of the UPS doesn’t work as I thought, so it was a test worth doing!