Computer-Related Incidents with Commercial Aircraft

Unknown FBW aircraft type, Byzantine failures in Flight Control System (FCS), n.d.


Synopsis In their paper Byzantine Fault Tolerance, from Theory to Reality (in S. Anderson, M. Felici and B. Littlewood, eds, Computer Safety, Reliability, and Security, 22nd International Conference, SAFECOMP 2003, Edinburgh, UK, September 23-26, 2003, Lecture Notes in Computer Science volume 2788, Springer-Verlag, 2003), Kevin Driscoll, Brendan Hall, Hakan Sivencrona, and Phil Zumsteg recount a series of incidents involving the digital control system of a major Fly-By-Wire commercial aircraft type, which almost led to the type being grounded. Driscoll et al. describe the incidents thus:

This aircraft had a massively redundant system (theoretically, enough redundancy to tolerate at least two Byzantine faults), but no amount of redundancy can succeed in the event of a Byzantine fault unless the system has been designed specifically to tolerate these faults. In this case, each Byzantine fault occurrence caused the simultaneous failure of two or three "independent" units. The calculated probability of two or three simultaneous random hardware failures in the reporting period was 5 x 10**(-13) and 6 x 10**(-23) respectively. After several of these incidents, it was clear that these were not multiple random failures, but a systematic problem. The fleet was just a few days away from being grounded, when a fix was identified that could be implemented fast enough to prevent idling a large number of expensive aircraft.

Byzantine faults are faults in which agents (sensors, computers) in a distributed system "lie" to their interlocutors: they do not fail silently but distribute erroneous data, or data which is read differently by different receivers. The name arose from a whimsical analogy by Lamport, Shostak and Pease to a group of Byzantine generals trying to reach agreement in a situation in which no one trusts anyone else to speak the truth. The classic papers from twenty years ago are [Refs], and I understand that they arose from SRI International's involvement in trying to verify formally the operating system of the first digital flight control computer, SIFT.
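The difference between a silent failure and a Byzantine one can be sketched in a few lines of Python. This is only an illustration of the definition above, not a model of the aircraft system: the channel names, the values, and the rule that the first receiver gets the bad value are all hypothetical.

```python
def fail_silent(receivers):
    """A fail-silent source delivers nothing: every receiver sees the same absence."""
    return {r: None for r in receivers}


def byzantine(receivers, good_value, bad_value):
    """Hypothetical Byzantine source: the first receiver gets bad_value,
    all the others get good_value, so the receivers' views differ."""
    return {r: (bad_value if i == 0 else good_value)
            for i, r in enumerate(receivers)}


def consistent(view):
    """True if every receiver ended up with the same reading."""
    return len(set(view.values())) == 1


# Assumed channel names, for illustration only.
channels = ["channel-A", "channel-B", "channel-C"]

silent_view = fail_silent(channels)
lying_view = byzantine(channels, good_value=2.5, bad_value=7.0)

print(consistent(silent_view))  # True: all receivers can agree the source is dead
print(consistent(lying_view))   # False: receivers now hold conflicting data
```

In the silent case every receiver reaches the same conclusion and can discard the source. In the Byzantine case each receiver holds a plausible value, and none can tell from its own data alone that the others saw something different.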

Dealing with Byzantine faults became an extremely active area of distributed computing theory, but practitioners did not take them so seriously at first, perhaps partially due to the very high resource consumption of the solutions: Lamport, Shostak and Pease showed that any correct algorithm to achieve consensus requires a large number of processors (roughly speaking, at least 3n+1, where n is the number of "liars" to be tolerated) and a lot of processor cycles. It follows that solutions judged to be practical are unlikely to be complete solutions, and therefore one must analyse the actual problem space more closely to find out where one can most profitably handle possible problems, and which areas one can ignore.
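The 3n+1 bound can be made concrete with a small Python sketch in the style of the recursive "oral messages" algorithm from the Byzantine generals literature. It is a toy under stated assumptions, not the algorithm used on any aircraft: the function names, the default order, and the rule by which a liar picks its messages (ATTACK to even-numbered receivers, RETREAT to odd-numbered ones) are all hypothetical.

```python
from collections import Counter

DEFAULT = "RETREAT"  # assumed fallback order when the votes are tied


def min_nodes(liars):
    """Smallest number of nodes that can tolerate the given number of liars."""
    return 3 * liars + 1


def majority(votes):
    """Most common vote, or DEFAULT when the top two counts are tied."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return DEFAULT
    return ranked[0][0]


def send(sender, receiver, value, traitors):
    """Loyal senders pass the value on unchanged. A traitor (hypothetical
    behaviour) tells even-numbered receivers ATTACK and odd ones RETREAT."""
    if sender in traitors:
        return "ATTACK" if receiver % 2 == 0 else "RETREAT"
    return value


def om(m, commander, lieutenants, value, traitors):
    """Run m rounds of relaying; return {lieutenant: decided value}."""
    received = {l: send(commander, l, value, traitors) for l in lieutenants}
    if m == 0:
        return received
    # Each lieutenant relays what it received to all the other lieutenants.
    relayed = {
        j: om(m - 1, j, [l for l in lieutenants if l != j], received[j], traitors)
        for j in lieutenants
    }
    # Each lieutenant votes over its own value and the relayed ones.
    return {
        i: majority([received[i]] + [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }


# Four nodes, one liar: the loyal lieutenants agree in both cases.
lying_lieutenant = om(1, 0, [1, 2, 3], "ATTACK", traitors={3})
lying_commander = om(1, 0, [1, 2, 3], "ATTACK", traitors={0})

# Three nodes, one liar: loyal lieutenant 1 ends up disobeying a loyal commander.
too_few = om(1, 0, [1, 2], "ATTACK", traitors={2})

print(min_nodes(1), lying_lieutenant, lying_commander, too_few)
```

With four nodes the loyal lieutenants reach the same decision whether the liar is the commander or a lieutenant. With three nodes, lieutenant 1 receives one ATTACK and one RETREAT and cannot tell which sender lied. The recursion also hints at the cost in processor cycles: each extra round of relaying multiplies the number of messages by roughly the number of nodes.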

I discussed this incident in more detail in an essay, Flight Control System Software Anomalies, in the Risks Digest, Volume 24 Number 3, 7 September 2005.