Remembering Fail-Safe

The term “fail safe” is commonly used to mean something foolproof, or a system with backups to prevent failure.  In other words, “safe from failure”.

That’s a shame, since we have plenty of words that already mean that.  My dictionary defines fail-safe as … a system … that ensures safety if the system fails to operate properly.  The original meaning was “safe in case of failure”.  Things break.  How do we head off catastrophe?

Real-world examples

The TCP network protocol “guarantees” delivery, but it’s fail-safe.  If a packet can’t be delivered even after retries, as happens, the connection is dropped and the failure is reported, rather than partial or corrupted data being accepted as if it were complete.
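The same "refuse rather than accept partial data" stance can be taken at the application level. The following is a minimal sketch, not a description of TCP itself: it assumes a hypothetical message format with a 4-byte big-endian length prefix, and the names `read_exact`, `read_message`, and `IncompleteMessage` are invented for illustration.

```python
import io
import struct


class IncompleteMessage(Exception):
    """Raised when the stream ends before a whole message has arrived."""


def read_exact(stream, n):
    """Return exactly n bytes from the stream, or raise; never a short read."""
    data = b""
    while len(data) < n:
        chunk = stream.read(n - len(data))
        if not chunk:
            # The stream ended early: fail loudly instead of returning a fragment.
            raise IncompleteMessage(f"expected {n} bytes, got {len(data)}")
        data += chunk
    return data


def read_message(stream):
    """Read one message in the assumed format: 4-byte length, then the payload."""
    (length,) = struct.unpack(">I", read_exact(stream, 4))
    return read_exact(stream, length)
```

A caller either gets a complete message or an exception; there is no code path that hands back half a payload.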

In the movie Die Hard, the engineers of Nakatomi Plaza decided that safety meant that in the event of a power failure all the security systems of the building would be disabled.  In the movie that meant the bad guys could get into the vault.  In the real world, that decision would prevent people from being locked into the building.

After thousands of deaths resulting from train accidents, train car brakes are now engaged by default.  A pressure line powered by the locomotive pulls the brake pads away from the wheels.  If any part of the braking system (the non-braking system?) fails, the brakes are pressed against the wheels.

Airplanes use positive indicators for the status of important functions such as the landing gear being down.  Instead of an error light if the gear has failed, there’s a no-error light if the gear is locked.  Should the sensor, wiring, or bulb fail, the indication is that the gear is not down.  Better to have the gear down and think it’s not than to think it’s down when it isn’t.
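In software, the positive-indicator idea translates to a status check whose default answer is "not confirmed". A minimal sketch, assuming a hypothetical `read_sensor` callable that returns `True` when the gear is locked (the names here are illustrative, not from any avionics system):

```python
from enum import Enum


class GearStatus(Enum):
    LOCKED_DOWN = "locked down"
    NOT_CONFIRMED = "not confirmed"


def gear_status(read_sensor):
    """Report LOCKED_DOWN only on explicit positive confirmation."""
    try:
        reading = read_sensor()
    except Exception:
        # A broken sensor or wire looks the same as "gear not down".
        return GearStatus.NOT_CONFIRMED
    if reading is True:
        return GearStatus.LOCKED_DOWN
    # None, garbage values, or False all fall through to the safe answer.
    return GearStatus.NOT_CONFIRMED
```

The design choice is that every failure mode, including ones nobody anticipated, lands on the cautious answer, because only one narrow path produces the reassuring one.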

Value in software

The idea that we should expect failure isn’t novel; it’s called testing.  But arguably the primary purpose of testing is to identify defects in the software so we can avoid failure in production.  Is there value in assuming that we won’t succeed in preventing every possible anomalous condition, including our code not doing what we expect?  Consider the questions that fail-safe raises.

What can fail?

Your software has bugs in it.  Networks go down.  You may get broken input.  You may get correct input that breaks your system because you didn’t know the correct format.  You may get data in the wrong order.  Software you didn’t write but you’re counting on may fail.

What is “safe”?

What’s the best result when failure happens?  Roll back a transaction?  Immediately kill a system?  Display an error?  Throw an exception?
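Rolling back a transaction is perhaps the easiest of these to show. The sketch below uses Python's standard `sqlite3` module, whose connection object commits when used as a context manager and the block succeeds, and rolls back when the block raises. The `accounts` table and the `transfer` function are assumptions made up for this example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move money between accounts; any failure leaves both balances untouched."""
    # "with conn" commits if the block finishes and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Raising here undoes both updates above.
            raise ValueError("insufficient funds")
```

Here "safe" means "as if nothing happened": the caller still gets an exception, but the data never sits in a half-updated state.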

How do we get back from “safe” to operational again?

Having decided what failure means and how to enter a safe mode, we may still not have asked ourselves how to get things going again.  If we reject entry of a file that contains erroneous data, how do we notify someone to deal with that?  How do we get it out of a queue to be processed again?
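One possible answer to these questions is to set rejected files aside, tell someone, and offer an explicit way back in. The following is only a sketch under assumptions: `FileIntake`, its `validate` and `notify` callables, and the in-memory queue are all hypothetical stand-ins for whatever storage and alerting a real system would use.

```python
from collections import deque


class FileIntake:
    """Process files; quarantine rejects, notify someone, allow resubmission."""

    def __init__(self, validate, notify):
        self.pending = deque()   # files waiting to be processed
        self.quarantine = {}     # rejected files, kept by name for later repair
        self.accepted = []       # names of files that passed validation
        self.validate = validate  # hypothetical: raises on erroneous data
        self.notify = notify      # hypothetical: alerts a person

    def submit(self, name, content):
        self.pending.append((name, content))

    def process_all(self):
        while self.pending:
            name, content = self.pending.popleft()
            try:
                self.validate(content)
            except Exception as exc:
                # Safe state: nothing is lost, nothing bad is accepted.
                self.quarantine[name] = content
                self.notify(f"{name} rejected: {exc}")
            else:
                self.accepted.append(name)

    def resubmit(self, name, fixed_content=None):
        """Move a quarantined file back into the queue, optionally corrected."""
        content = self.quarantine.pop(name)
        if fixed_content is not None:
            content = fixed_content
        self.pending.append((name, content))
```

The point of `resubmit` is that the road back to operational is designed up front rather than improvised after the first rejection.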