Handling Power Outages in Reliable Systems

I write this just after Storm Ciara passed over the UK. As I drove into work, I passed a tree down on a power line - and thought nothing of it.

When I got into the office, it turned out the power had been out, and that had caused two failures.

These are non-critical systems, so I knew I might have issues in these scenarios. However, as I'm in the business of working on systems where reliability is important, this is my way of reviewing these failures and what I could do if these systems were critical.

Computers and Power

Many computers don't like losing power. But the power source is often out of our control - so what do we do?

The first step is to understand the implications of failure. In my case, for the server, the problem is usually the disks. If the power is cut during write operations, you may get incomplete writes to the drive - especially in more complicated setups such as RAID.

In measurement and automation systems, you must also consider the systems they interact with. What happens if you stop driving that solenoid? Or stop updating a process output?
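One common way to answer the "what happens if you stop driving it?" question is a watchdog on the receiving side: if the controller stops checking in, the outputs are driven to a known safe state. The sketch below is only an illustration of that idea, not something from my systems; the class name `OutputWatchdog`, the `set_safe_state` callback and the timeout value are all hypothetical.

```python
import time


class OutputWatchdog:
    """Calls a safe-state callback if the controller stops checking in.

    Hypothetical sketch: names and behaviour are assumptions for illustration.
    """

    def __init__(self, timeout_s, set_safe_state, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.set_safe_state = set_safe_state  # e.g. de-energise a solenoid
        self.clock = clock                    # injectable so it can be tested
        self.last_kick = clock()
        self.tripped = False

    def kick(self):
        """Called by the control loop each cycle to show it is still alive."""
        self.last_kick = self.clock()

    def poll(self):
        """Called periodically; trips once if no kick arrived within the timeout."""
        if self.tripped:
            return  # latched: stays tripped until someone investigates
        if self.clock() - self.last_kick > self.timeout_s:
            self.tripped = True
            self.set_safe_state()
```

The watchdog latches once tripped, on the assumption that a person should check the process before outputs are driven again. Whether "safe" means off, on, or holding the last value depends entirely on the process being controlled.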

Uninterruptible Power Supply (UPS)

Installing a UPS has been on my todo list for 1 day too long! A UPS is a power supply with a local battery.

These are available for computer equipment from companies like APC, and also for industrial use cases from companies such as Phoenix Contact.

There is an important consideration with a UPS though - they aren't designed to keep things running! Or not for very long.

If you look at most UPS systems, they will keep your system running for maybe 30 minutes. The idea is that this is enough time for your system to identify the fault (by communicating with the UPS) and shut down to a safe state.
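As a rough sketch of that shutdown decision: the system only needs to know whether it is on battery and how much runtime the UPS estimates is left, then compare that with how long a clean shutdown takes. The `UpsStatus` fields, the function name and the 60 second margin below are assumptions for illustration; in practice the status would come from the UPS vendor's software or a monitoring daemon.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Snapshot of the UPS state. Field names are hypothetical."""
    on_battery: bool
    runtime_remaining_s: float  # UPS's own estimate of battery time left


def should_shut_down(status, shutdown_takes_s, margin_s=60.0):
    """Return True once remaining runtime is close to what a clean shutdown needs."""
    if not status.on_battery:
        return False  # mains power is present, nothing to do
    # Leave a margin because the runtime estimate is only an estimate.
    return status.runtime_remaining_s <= shutdown_takes_s + margin_s
```

For example, with a shutdown that takes two minutes, a UPS reporting 30 minutes of runtime gives no reason to act yet, while one reporting 150 seconds does. A simpler policy is to shut down after a fixed time on battery, which avoids trusting the runtime estimate at all.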

Defence in Depth

Once you know the kind of problems you may have, you can then develop strategies to avoid issues. Note the plural!

Having multiple safeguards can be important to handle variations on the failures. You may not be able to predict every case.

Although I lacked the UPS, my server system already had two safeguards in place.

I used ReFS - a new file system from Microsoft, which provides more protection against disk corruption.

I also have a battery-backed RAID controller - this means it has time to finish its writes during a power cut.

Despite lacking the protection of a UPS - I ended up with just two virtual hard drives being corrupted. I've still got to analyse why these two were affected, but it looks like one drive probably had some problems.

These were the drives that contained my backups. That means I've lost my backup history (since I didn't have additional backups of these). If they had been system drives, I could have restored them from full image backups that I take every day.

(Note: I should review the cost of this. Previously, given this wasn't critical, the price of having the full image backups replicated again was too high.)

Separating Components

You may think: if these weren't critical, how did I notice so fast?

When designing my backup system, I recognised that having a separate backup server would be useful, but the costs were prohibitive for a company of my size and it wasn't clear I could get my existing backup server to work with one.

So instead I hosted the backups on the same server as the network services. You can guess what happened next.

The failure of the backup services has prevented me from running some network services until I recover the disks. This was a mistake!

When building high-reliability systems, you must consider the critical and support elements separately. For example, on a recent project, we needed to go from a Windows application monitoring a process to doing some control. So we split the interface from the primary measurement components and ran them on separate systems (the measurement components on a Real-Time OS). This ensures that failures or mistakes in the user interface (such as someone closing the window) don't impact the process.

The Other System: Recovery

I said two systems failed. The other was a system on long-term test for a project. This failure highlighted another fun consequence of power failure: what happens when the power comes back on?

In the IT space, an acceptable solution might be to keep things off to be manually brought online (and checked). This approach may apply to some industrial uses as well.

Another option is to have the system power on automatically when the power recovers. This assumes the software is written to pick up where it left off. I've implemented this support on a number of projects. The software must store its state so we can load it on boot and carry on where the last power cycle left off.
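A minimal sketch of storing and reloading state, assuming the state fits in a small JSON file. It is not the code from those projects; the function names and file layout are hypothetical. The key idea is to write to a temporary file and then swap it into place, so a power cut during the write should not destroy the last good copy.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state to a temp file, then swap it into place.

    A partial write should never replace the last good copy.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to disk first
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # don't leave stray temp files behind
        raise


def load_state(path, default):
    """Return the last saved state, or default on first boot or an unreadable file."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
```

On boot the application would call `load_state` and carry on from whatever step it finds, calling `save_state` whenever that step changes. This sketch does not sync the containing directory and still relies on the disk and file system honouring the flush, so it complements rather than replaces the hardware safeguards above.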

In either case, you must consider how the system comes back online and if it will do so in a safe manner.

What Will I Do Next

I will buy a UPS! It has been on the cards for a while and it would already have paid for itself in time saved recovering the systems.