Postmortems

Originally shared by Yonatan Zunger

A fundamental principle of good engineering is that you design the whole system to function well, not just the part you’re concentrating on. Most systems include humans as components – as operators, maintainers, passengers, or even obstacles. And when you fail to take that seriously into account in your design, you make a fundamental design error which can have lethal consequences.

It appears that the cause of the SpaceShipTwo crash was precisely of this sort: the designers never considered the possibility that a particular switch might be flipped at an incorrect time. In this case, it was flipped only a few seconds too soon, at a speed of Mach 0.8 instead of Mach 1.4. (This was under rocket power, where acceleration is fast, so a few seconds cover a large change in speed.) That caused the tail system to unlock too soon, be ripped free by acceleration, and destroy the spacecraft, killing the co-pilot and severely injuring the pilot.

Scaled Composites’ design philosophy of “relying on human skill instead of computers” here reeks of test pilots’ overconfidence: the belief that the pilots are so good that they would never make a mistake. But at these speeds, under these g-forces and stresses, and over repeated test flights, it’s never hard for an error to happen.

There are a few design principles which apply here.

(1) It should not be easy to do something catastrophic. There are only a few circumstances under which it is safe for the feathers to unlock, for example, and those are easy to detect based on the flight profile; at any other time, the system should refuse to unlock them unless the operator gives a confirmatory “yes, I really mean that” signal.

(2) Mechanical tasks that can lead to disaster are a bad idea. Humans have limited bandwidth to process things: while our brain’s vision center is enormously powerful, our conscious mind’s ability to think through things works at language speed, a few ideas per second. Here, that bandwidth was spent on a basically mechanical task: flipping the unlock switch at a particular, precise time. This requires the human to pay attention, time something accurately, and flip a switch, at a moment when they should simply be watching out for emergencies. Since the time of unlock is already known long before takeoff, a better design would be for the unlock to happen automatically at the right time – unless the risks from having an automatic unlocker (perhaps due to a reliability issue, or having a complex part prone to failure) exceed the benefits of removing the manual step.
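To make principles (1) and (2) concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `FeatherLock` class, its method names, and the single-number safety check are illustrative assumptions, not a description of SpaceShipTwo's actual controls. The only figure taken from the account above is Mach 1.4, and a real flight envelope would depend on far more than one threshold.

```python
from dataclasses import dataclass

# Assumption: the safe-unlock condition is reduced to one threshold, using
# the Mach 1.4 figure quoted above. This is a simplification for illustration.
SAFE_UNLOCK_MACH = 1.4


@dataclass
class FeatherLock:
    """Hypothetical lock on the tail ("feather") system."""

    locked: bool = True

    def request_unlock(self, mach: float, confirmed: bool = False) -> bool:
        """Principle (1): refuse outside the safe envelope unless the
        operator explicitly confirms. Returns True if the lock is open."""
        if mach >= SAFE_UNLOCK_MACH or confirmed:
            self.locked = False
        return not self.locked

    def auto_step(self, mach: float) -> None:
        """Principle (2): unlock automatically once the scheduled
        condition is met, so nobody has to time a switch by hand."""
        if self.locked and mach >= SAFE_UNLOCK_MACH:
            self.locked = False


lock = FeatherLock()
lock.request_unlock(mach=0.8)                   # refused: too early
lock.request_unlock(mach=0.8, confirmed=True)   # allowed: "yes, I really mean that"
```

The point of the sketch is the shape of the interface: the early request is not impossible, it just takes a deliberate second step, while the routine case needs no human action at all.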

What’s important to learn from this accident is that this error isn’t specific to that one mechanism: the same analysis needs to be applied across the entire design of the system. Every single potential or scheduled human action needs to be reviewed in this way.

An excellent perspective on this comes from James Mahaffey’s book Atomic Accidents, a catalogue of things that have gone horribly wrong. In the analysis, you see repeatedly that once designs progressed beyond the initial experimental “you’re doing WHAT?!” stage, almost all accidents came from humans pushing the wrong button at the wrong time.

Generally, good practice looks like:

(A) Have clear status indicators so that a human can tell, at a glance, the current status of the system, and if anything is in an anomalous state.

(B) Have “deep status” indicators that let a human understand the full state of some part of the system, so that if something is registering an anomaly, they can figure out what it is.

(C) Have a system of manual controls for the components. Then look at the flows of operation, and when there is a sequence which can be automated, build an automation system on top of those manual controls. (So that if automation fails or is incorrect for any reason, you can switch back to manual behavior.)

(D) The system’s general behavior should be “run yourself on an autonomous schedule. When it looks like the situation may be going beyond the system’s abilities to deal with on its own – e.g., an anomaly whose mitigation isn’t something that’s been automated – alert a human.”

The job of humans is then to sit there and pay attention, both for any time when the system calls for help, and for any sign that the system may need to call for help and not realize it.
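Practices (A) through (D) can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real control system: the `Plant` class, its single valve, and the "overpressure" anomaly are all invented for illustration. What it tries to show is the layering: automation only ever acts through the manual control, and anything it has no automated mitigation for goes to a human.

```python
from typing import Callable, Dict, List


class Plant:
    """Hypothetical system: one manual control with automation on top."""

    def __init__(self, alert: Callable[[str], None]) -> None:
        self.valve_open = False          # state of the one component
        self.automatic = True            # automation can be switched off
        self.anomalies: List[str] = []   # anomalies nobody has resolved yet
        self.alert = alert               # how the system calls for help
        # Anomalies whose mitigation has been automated. Each mitigation
        # is built from the manual controls, never around them.
        self.mitigations: Dict[str, Callable[[], None]] = {
            "overpressure": lambda: self.set_valve(True),
        }

    def set_valve(self, is_open: bool) -> None:
        """(C) Manual control: works whether or not automation is on."""
        self.valve_open = is_open

    def status(self) -> str:
        """(A) At-a-glance status."""
        return "ANOMALY" if self.anomalies else "OK"

    def deep_status(self) -> dict:
        """(B) Full state, for working out what an anomaly actually is."""
        return {
            "valve_open": self.valve_open,
            "automatic": self.automatic,
            "anomalies": list(self.anomalies),
        }

    def report(self, anomaly: str) -> None:
        """(D) Mitigate automatically when possible, else alert a human."""
        fix = self.mitigations.get(anomaly) if self.automatic else None
        if fix is not None:
            fix()
        else:
            self.anomalies.append(anomaly)
            self.alert(f"needs attention: {anomaly}")


alerts: List[str] = []
plant = Plant(alert=alerts.append)
plant.report("overpressure")    # handled automatically via set_valve
plant.report("sensor_dropout")  # no automated mitigation: a human is alerted
```

Switching `automatic` to `False` leaves `set_valve` fully usable, which is the fallback described in (C).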

This wasn’t about a lack of a backup system: this was about a fundamentally improper view of humans as a component of a critical system.