Synopsis

On 4 June 1996 the maiden flight of the Ariane 5 launcher ended in failure, about 40 seconds after initiation of the flight sequence. At an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded. The failure was caused by "complete loss of guidance and attitude information" 30 seconds after liftoff. To quote the synopsis of the official report: "This loss of information was due to specification and design errors in the software of the inertial reference system. The extensive reviews and tests carried out during the Ariane 5 development programme did not include adequate analysis and testing of the inertial reference system or of the complete flight control system, which could have detected the potential failure." Because of this conclusion, the accident has generated considerable public and private discussion amongst experts and lay persons.

Code was reused from the Ariane 4 guidance system. The Ariane 4 has different flight characteristics from the Ariane 5 in the first 30 seconds of flight, and exception conditions were generated on both IGS channels of the Ariane 5.

Even though the Ariane is not a transport category airplane, I include it as an instructive example. It suggests that we have as much reason, or more, to worry about the `new, improved, extended Mark 2 version' as about the original version of FBW software. Henry Petroski, in Design Paradigms: Case Histories of Error and Judgement in Engineering (Cambridge University Press, 1994), makes this very point about the history of bridge-building in the nineteenth and twentieth centuries. Petroski notes that failures often came not from the first, careful, conservative implementation of a design, but from its extension.

The European Space Agency has provided a summary of the Ariane accident report as a Press Release, and also the full text of the Inquiry Board Report on the Web.
The problem was caused by an `Operand Error' in converting data in a subroutine from 64-bit floating point to 16-bit signed integer. One value was too large to be converted, creating the Operand Error. This was not explicitly handled in the program (although others were) and so the computer, the Inertial Reference System (SRI), halted, as specified in other requirements. There are two SRIs, one `active', one `hot back-up', and the active one halted just after the back-up, for the same reason. Since no inertial guidance was now available, and the control system depends on it, we can say that the destructive consequence was the result of `Garbage in, garbage out' (GIGO). The conversion error occurred in a routine which had been reused from the Ariane 4, whose early trajectory was different from that of the Ariane 5. The variable containing the calculation of Horizontal Bias (BH), a quantity related to the horizontal velocity, thus went out of `planned' bounds (`planned' for the Ariane 4) and caused the Operand Error. Many software engineering issues arise from this case history.
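The conversion described above can be illustrated with a minimal Python sketch. It is not the flight code: the names `OperandError` and `to_int16` are hypothetical, truncation toward zero is an assumed rounding rule, and the input values are invented for illustration rather than taken from the flight data. It shows only that a 64-bit floating-point value outside the 16-bit signed range cannot be converted, and that the caller has to decide what happens then.

```python
import math

# Bounds of a 16-bit signed integer.
INT16_MIN = -32768
INT16_MAX = 32767


class OperandError(ArithmeticError):
    """Hypothetical stand-in for the Operand Error named in the report."""


def to_int16(value: float) -> int:
    """Convert a float to a 16-bit signed integer, or raise OperandError.

    Assumption: the fractional part is truncated toward zero.
    """
    if not math.isfinite(value):
        raise OperandError(f"{value!r} is not a finite number")
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated
```

With these assumptions, `to_int16(1234.5)` returns `1234`, while `to_int16(40000.0)` raises `OperandError`. If no caller handles the exception, the program stops, which is loosely analogous to the SRI halting.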
After renewed discussion of the Ariane 501 case on the Safety-Critical Systems Mailing List in 2005 (largely under the misleading title "Unnecessary/unwanted COTS software functionality"), attention was drawn to the paper Ariane 5: Learning from Failure by Colin O'Halloran, a member of the accident investigation committee. The paper was published as pp47-55 of Proceedings of the Workshop on Dependable Systems Evolution, held at the FM'05 conference at the University of Newcastle upon Tyne, 18 July 2005. Slides from the talk are also available. I drew some conclusions from O'Halloran's paper in the contribution Ariane Notes to the Safety-Critical Systems Mailing List.
Nancy Leveson has also analysed the accident in a draft of a book, available at pp133-50 ("especially p 147") of A New Approach to System Safety Engineering.
Jean-Marc Jézéquel and Bertrand Meyer wrote a paper, Design by Contract: The Lessons of Ariane, IEEE Computer 30(2):129-130 January 1997, in which they argued that a different choice of programming language would have avoided the problem. Taken at face value, they are clearly right: a language which forced explicit exception handling of all data type errors as well as other non-normal program states (whether expected or not) would have required an occurrence of an Operand Error in this conversion to be explicitly handled. To reproduce the problem, a programmer would have had to write a handler which said `Do Nothing'. One can imagine that as part of the safety case for any new system, it would be required that such no-op handlers be tagged and inspected. An explicit inspection would have caught the problem before launch. So, of course, would other measures. Jézéquel and Meyer thus have to make the case that the programming language would have highlighted such mistakes in a more reliable manner than other measures. Ken Garlington argues in his Critique of "Put it in the contract: The lessons of Ariane" [sic] that they do not succeed in making this case.
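To make the idea of a stated contract concrete, here is a minimal Python sketch of a precondition attached to a conversion routine. It is written in the spirit of the paper's title only and is not the authors' own code. The `require` decorator, the `ContractViolation` exception, the `convert` function and the bound `MAXIMUM_BIAS` are all hypothetical names chosen for illustration, and the bound is simply assumed to match the 16-bit signed range.

```python
import functools

# Assumed bound, chosen to match the largest 16-bit signed integer.
MAXIMUM_BIAS = 32767.0


class ContractViolation(Exception):
    """Hypothetical exception raised when a stated precondition does not hold."""


def require(predicate, description):
    """Attach a precondition to a function and check it on every call."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not predicate(*args, **kwargs):
                raise ContractViolation(
                    f"{func.__name__}: precondition failed: {description}"
                )
            return func(*args, **kwargs)
        # Keep the contract as data so that reviewers and tools can list it.
        wrapper.precondition = description
        return wrapper
    return decorate


@require(
    lambda horizontal_bias: abs(horizontal_bias) <= MAXIMUM_BIAS,
    "abs(horizontal_bias) <= MAXIMUM_BIAS",
)
def convert(horizontal_bias: float) -> int:
    """Convert a bias value to an integer; callers must respect the bound."""
    return int(horizontal_bias)
```

Under these assumptions, `convert(100.9)` returns `100` and `convert(1e6)` raises `ContractViolation`. The point of the sketch is that the assumption about the input range is written next to the routine, where a reuse review can see it. Whether that is more reliable than other measures is exactly the question the text raises.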
The paper The Ariane 5 Accident: A Programming Problem? by Peter Ladkin discusses how the circumstances of the Ariane Flight 501 failure have been characterised, in the light of the extensive discussion of the failure amongst computer scientists. Gérard Le Lann has proposed in his article The Failure of Satellite Launcher Ariane 4.5 that the failure has little connection with software, but is a systems engineering failure, and his argument is compelling. Le Lann's analysis is also supported by inspection of the WB-Graph of the Ariane 501 Failure, prepared by Karsten Loer from the significant events and states mentioned in the ESA Accident Report.
This is not the first time that computers critical to flight control of an expensive, complex and carefully-engineered system have failed. See The 1981 Space Shuttle Incident.