Testing for Failure

By James McNally | Software Engineering


This morning I did something I’ve never done before.

I walked into my office, plugged in my laptop, and killed the power to my server rack.

You see, since Storm Ciara hit the UK, I’ve installed a UPS to ensure a managed shutdown of my servers. This morning was my first test of it.

One thing I have seen recently is that predicting failures often isn’t enough. You need to cause them so that you can:

  1. Understand what happens in a failure. Seeing a failure first-hand helps you identify further mitigations and recognise the same failure faster next time.
  2. Verify any backup or recovery systems work as expected.
  3. Identify any knock-on issues.

Looking for knock-on issues can be incredibly valuable. I recently had a case where we had a database distributed across three nodes. An infrastructure issue caused all three nodes to go down, and only one recovered.

When that one node came back online, the system started processing transactions again, as far as a single node could handle them. That was all expected behaviour.

However, it turned out that all of the failed transactions were filling the logs on the single running node, which then failed as well when its disk filled up.

The secondary failure was easy to prevent in future but might have been missed without testing. So if you have a system that you intend to be resilient, pull network cables, shut down machines, restore from backup and see how it reacts.
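As an illustration of the kind of check that can catch a knock-on issue like the full disk, here is a minimal Python sketch of a disk headroom check that could run alongside a failure test. It is not what we actually deployed. The function names, the 20% threshold and the "." placeholder path are all assumptions for the example, so point it at wherever your logs really live.

```python
import shutil


def has_headroom(total_bytes: int, free_bytes: int,
                 min_free_fraction: float = 0.2) -> bool:
    """Return True if at least min_free_fraction of the disk is still free.

    The 20% default is an arbitrary example threshold, not a recommendation.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes >= min_free_fraction


def check_log_disk(path: str, min_free_fraction: float = 0.2) -> bool:
    """Check free space on the disk that holds `path`.

    `path` is a hypothetical log directory; shutil.disk_usage reports
    total, used and free bytes for the filesystem containing it.
    """
    usage = shutil.disk_usage(path)
    return has_headroom(usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    # "." is a placeholder; substitute the directory your logs are written to.
    if check_log_disk("."):
        print("Disk headroom OK")
    else:
        print("WARNING: log disk is running low on space")
```

Running something like this on a schedule while you pull cables or shut down nodes gives you a chance to see the disk filling before it takes the last node with it.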

And the result of my UPS test? A failure – the shutdown sequence of the UPS doesn’t work as I thought, so it was a test worth doing!


About the Author

I founded Wiresmith Technology to help engineers improve their systems and products with quality measurement systems. I'm a Certified LabVIEW Architect, Certified LabVIEW Embedded Developer and LabVIEW Champion.
