The X-31 and A320 Warsaw Crashes: Whodunnit?

Peter B. Ladkin

Research Report RVS-RR-96-08

"Who done it?-Professor Plum." There's no more to say.
D. Lewis (1)


Wondering about Mishaps with Computer-Based Flight Control

On 19 January 1995, an X-31A research aircraft crashed in California due to suspected pitot-static system icing. The pitot-static system in this airplane provides inputs to the computer-based control system of the airplane. I discussed this accident with colleagues, amongst them Mary Shafer (SR-71 Lead Engineer at NASA Dryden, where the flight took place, and moderator of sci.aeronautics), and reported on it in a Risks-16.89 article. I noted that although pitot-icing is a very ordinary phenomenon even for tiny airplanes like my Piper Archer, connecting the pitot probe directly to the computer-based flight control system caused a catastrophe that my Archer would never suffer.

The confirmation of the cause of the mishap was reported in Flight International, 22-28 November 1995, p26:

A NASA mishap-investigation board has concluded that an accumulation of ice in or on the unheated pitot-static system was the root cause of the crash of its X-31 experimental aircraft at Edwards AFB, California, on 19 January [1995]. The NASA board says that the build-up of ice blocked the small tube in the pitot static-pressure [sic] system.

In my Risks-report on the Lufthansa A320 runway overrun accident in Warsaw in Risks-15.13, I suggested that there may have been a failure of requirements specification (also noted in Peter Neumann's book Computer-Related Risks, Addison-Wesley). Some recent RISKS articles since November 1995 have discussed the X-31 crash further. Andy Fuller in Ice cause of X-31 crash ascribes the accident to "..a bug in the flight control software and sensor hardware..". Marty Gomez in X-31 crash follow-up noted that the "flight computer" was not meant to detect such sensor failures, therefore it did what it was designed to do, and that therefore there was no "bug". Fuller in X-31 crash follow-up (the same article) agreed that it was not in fact a "software bug" but a "design error", and sees a "recurring problem of software designers (and system designers) getting inadequate requirements specifications." In Follow-up on X-31 crash, Fuller notes the difference between an air data computer, which collects the readings from sensors such as the pitot which iced, and the flight control computer which implements the control laws, given the sensor values from the air data computer. He also suggests that the pilot was not correctly briefed that the pitot heat was disconnected, and therefore that the "immediate cause of this incident was a break in communication between the maintainers and the pilot." I noted in The X-31 Kieling over that design failures and requirements failures are not the same thing (illustrated by John Rushby's anecdote about the AFTI F-16 mishap, in which there was a design failure, but no software-, hardware-, or requirements failure); that it's premature to conclude that it was a requirements failure; and that airplanes with good old hydro-mechanical control systems have suffered the same fate (courtesy of Shafer). After a jolly Christmas, Peter Mellor (City University expert on software and system safety in commercial air transport) joined in, in Re: X-31 crash follow-up. He believes that " a "software bug" is simply a "system design fault" which happens to be located in software rather than in hardware. (I prefer the term "fault" rather than "error" to denote "something wrong with the way the system has been designed".)" and therefore that Fuller's original claim was correct. He notes that trying to decide if it is a `software' or `hardware' bug is often a misplaced activity, since software and hardware are mutable: "Different designs may provide the same overall function by partitioning the sub-functions differently between hardware and software." He illustrates his point with some comments on the Warsaw A320 overrun accident in September 1993. The Warsaw final report had mentioned the braking logic as a contributory factor (as I'd noted in my earlier RISKS note and Clive Leyman notes below). Mellor pointed out that changes could have been made by a software modification, or by a hardware modification (reducing the load required for the squat switch on the main gear to close and thus fulfil the weight-on-wheels criterion required for the braking systems to deploy). Finally, Mellor offered the following opinions:

Then followed some private correspondence. Carsten Thomas (Daimler-Benz researcher with knowledge of Airbus designs) noted that it would not always be possible for engineers to take into account "the whole system", since that would in general be far too complex. Thus, a good development process is necessary, with adequate design documents and division of responsibility between engineers, which ensure that problems are caught at some stage or other. Nancy Leveson (Boeing Professor of Computer Science at the University of Washington and author of the manual Safeware) agreed with Thomas that such problems as with the X-31 should not be considered to be software problems, since most software engineers wouldn't have either the information or the training to be able to handle such problems. Mellor noted that City University students are taught to develop complete, consistent and unambiguous specifications. I challenged him to define those terms. Leveson referred me to her paper Software Requirements Analysis for Real-Time Process Control Systems in the IEEE Transactions on Software Engineering 17(3), March 1991, 241-258 for some definitions and criteria for complete, unambiguous specifications.

The Real Scoop

There are people who really know what went on with the X-31 and with the Warsaw accident.

Mary Shafer said that an explicit decision was made not to include pitot heat on the boom, and the FCS was not modified in any way to try to compensate for icing which might occur. There was therefore simply an operational mishap, not a pilot error or "software bug". She points out the enormous care with which research flights are planned, critiqued and supervised at Dryden.

Clive Leyman was a BAe Chief Engineer and worked on the landing gear systems for the A320 at the time that the gear and braking system modifications were made, actually some 18 months to two years before the Warsaw accident. The modifications were not regarded as a safety issue. He took issue with Mellor's claim that if the system behaves undesirably, that that's automatically a design fault. Users must also use the system in a reasonable manner, and one cannot reasonably expect a system to cope with abuse. We are dependent upon the pilots to fly the airplane within its normal aerodynamic and performance limits, but that even though the airplane may be designed to tolerate deviation from those limits, it can't practically be designed to tolerate all such deviation. He agrees with Leveson and Thomas that software engineers aren't likely to have the expertise to analyse the requirements and design. As did Thomas, he stresses careful engineering practice and thorough review of system requirements documents.

Leyman has also produced an engineering analysis of the Warsaw accident which suggests that at the approach speeds used, there was so much lift that it was aerodynamically impossible to compress the squat switches, that the landing distance was extended also because the airplane was slightly high on approach, that the airplane would probably have run off the runway anyway even if it had not aquaplaned, and that the landing distances were much longer than the procedures manual would lead the crew to expect. A copy of this report will appear soon in these WWW pages.

There sure seems to be a lot to discuss.

Words, words, words: What Do They All Mean?
So, how should the X-31 mishap be classified? What was the cause? Was it lack of proper briefing, a `software bug', a `design fault', a communication problem? Or, maybe, was it pitot ice? Was there nevertheless a `software problem', `software bug', a system design problem, or a requirements problem? Does the A320 crash show a `software' problem, a `hardware problem', a requirements problem, pilot error -- and let's not forget the high speed of landing, the inaccurate wind report, the waterlogged runway, the aquaplaning, and the earth bank which the plane ran into beyond the runway overrun area. The official report determines the causes of the Warsaw accident (I quote in full):
Causes of the accident were incorrect decisions and actions of the flight crew taken in situation when the information about windshear at the approach to the runway was received. Windshear was produced by the front just passing the aerodrome; the front was accompanied by intensive variation of wind parameters as well as by heavy rain on the aerodrome itself.

Actions of the flight crew were also affected by design features of the aircraft which limited the feasibility of applying available braking systems as well as by insufficient information in the aircraft operations manual (AOM) relating to the increase of the landing distance.

Thus, the Polish authorities say the cause was pilot error of judgement and action in the prevailing weather situation; the cause of the weather situation was a front, which caused variation of wind and heavy rain; and flight crew actions were affected by design features and insufficient information about increase in landing distance resulting from pilot decisions and actions.

It helps to agree on a precise vocabulary. The RISKS discussants use the words "fault" and "failure", and associate these with software. Leveson notes that fault and failure have specific meanings to engineers. She uses the term error, and Mellor prefers fault.

I'll describe the event as a mishap and inquire about the causes of the mishap. I'll see if I can get away with using the word problem instead of fault, failure or error.

Is there's a mishap, we can inquire if there was a software, requirements or system design problem. What we decide will affect whom we hold responsible for the mishap. But responsibility does not necessarily imply blame. The US Federal Aviation Regulations, Subchapter A, Part 1, Paragaph 1.1, define the pilot-in-command as the "pilot responsible for the operation and safety of an aircraft during flight time". Further, Subchapter F, Part 91, Paragraph 91.3 defines: Responsibility and authority of the pilot in command.

  • (a) The pilot-in-command of an aircraft is directly responsible for, and is the final authority as to, the operation of that aircraft.
  • (b) In an in-flight emergency requiring immediate action, the pilot-in-command may deviate from any rule of this part to the extent required to meet that emergency.
  • ...
In both the X-31 and A320 accidents, the pilots-in-command were responsible (but I note that rules of operation for NASA test flights may not explicitly be covered by Part 91). That doesn't mean the accidents were the pilots' fault! We must separate responsibility from blame.

What's an accident? US FARs Part 830, Paragraph 830.2 defines:

Aircraft accident means an occurrence associated with the operation of an aircraft which takes place between the time any person boards the aircraft with the intention of flight and all such persons have disembarked and in which any person suffers death or serious injury, or in which the aircraft receives substantial damage.

Let's try to sort out causes, errors, responsibilities and blame.

Causes

"What's going on here?"-No need to mention that you're digesting your dinner
D. Lewis (1)

Accidents are defined by death and/or substantial damage. First, some common-sense talk on reasons and causes. I proceed with a series of "why?....because...." statements. One can think of each because as a cause or a reason. All causes are reasons, but not all reasons are causes. When in doubt, I say reason. I also use two more words, function and entail. A function is quite simply a mathematical or physical function -- its value is defined by the value of its arguments (it would be inappropriate to call arguments causes of function values); and I say that a decision entails another decision to distinguish human reasoning processes from statements of causality. One might be tempted to think of entailment as a kind of psychological `cause', and indeed they do seem to be similar, but they are only the same if one believes in complete psychological determinism and that there is no such thing as free will, and since few philosophers or other people hold such views, I think it appropriate to develop explanations here which are agnostic on such matters!

All of the following assertions should be understood fundamentally as why?....because.... statements.

The X-31 Accident

The Lufthansa A320 Accident in Warsaw

In order to explain my use of the words cause and reason (function and entail were explained above), I must first ask the reader whether she more-or-less agrees with the truth of all of these why?....because.... sentences. If so, they can form the basis for a clarification of the meanings of these words. (I make no claim that I have gone as far as I could with such a list of why?....because.... sentences.)

Aristotle was the first major figure explicitly to consider causality and what it meant. He noted that causes could form chains, A causes B and B causes C therefore A causes C, and considered first (or final) causes, the front ends of causal chains, as well as the individual links in a chain, efficient causes. Causal chains appear in the nursery rhyme for want of a nail, the shoe was lost; for want of a shoe .... ... the battle was lost and also in the more recent chaotic `folk saying' about a hurricane depending on the beat of a butterfly's wing. David Hume held that causality was just an illusion, following from human psychology. We see A causing B, but in fact we're just used to noting that A is always followed by B and there's no further meaning to causality than this (Treatise of Human Nature and Enquiry Concerning Human Understanding). He also suggested a counterfactual view, in Enquiry, Section 7, noted by Lewis (3). Kant was disturbed enough by Hume's sceptical arguments to write the Critique of Pure Reason in an attempt to refute them. Although some of Kant's major arguments are nowadays uniformaly judged mistaken (he thought, for example, that Euclidean geometry and strict physical determinism were necessary for experience, whereas it seems to us that relativistic geometry and the Heisenberg Uncertainty Principle are fundamental scientific principles), the arguments of Hume have not ever been canonically refuted.

Interesting discussions of causality may be found nowadays in biology, especially concerned with the mechanisms of evolution. Steven Jay Gould's essays, Richard Dawkins' books The Selfish Gene (6) and The Blind Watchmaker (7) discuss how large-scale structural changes can emerge simply through the operation of probabilistic laws on minor changes which give certain populations an advantage in competition for resources and reproduction. One may wonder whether causality could ever be found in the application of mathematical probabilistic calculations, but a similar wonder should arise, although with different details, when considering quantum-mechanics. Elliot Sober's book Philosophy of Biology (8) is a readable introduction to these questions in biology.

Causality is tricky stuff and a major topic in analytic philosophy and philosophy of science. The four main approaches seem to be the reductionist, the counterfactual, the realist, and the probabilistic. The first two correspond to the two views of Hume, the third to do with necessary and sufficient conditions, also with so-called deductive-nomological explanation, and the last a more recent idea of scientific and thus causal non-determinism.

Artificial Intelligence researchers interested in logical methods for planning have used a system called the Situation Calculus, whose semantics is similar to or the same as such linear-time temporal logics as TLA, which is used in specification and verification of systems (the syntax, however, differs). Various paradoxes and anomalies have arisen with purely formal attempts to consider the consequences of actions. The currently most successful attempts to solve these problems mostly rely on axiomatising and using notions of causality inside of the Situation Calculus. This is basically a realist view. Purely formal attempts inside the logic of the Situation Calculus itself have not yet been very successful. (See the proceedings of the biennial IJCAI and ECAI and annual AAAI conferences, and the journals Artificial Intelligence and the Journal of Artificial Intelligence Research for articles.)

John McDermid (Professor of Computer Science at the University of York, Director of the Dependable Computing Systems Center and the University Technology Center, and editor of the Handbook of Software Engineering) has emphasised how important he considers formal notions of causality for requirements analysis. A paper on causal models has appeared (10) . See my Reasons and Causes for a discussion.

Given how hard it seems to be, we cannot hope to clarify the meaning of causes in a short essay on airplanes. But we can explain how a few of the confusions about causes arise.

First, let's look at the statement A causes B. What are A and C?. Since there's a verb in the middle, they must be singular terms -- referring to what? Events or situations. For brevity, I'll call both kinds of thing events. Naively, the statement A causes C seems to mean the same as whenever event A occurs, event C must follow. What does must mean? Does it mean inevitable? Hume's reductionist view suggested it meant usually. That seems to require event types: when an event of type similar to A occurs, usually an event of type similar to C follows and indeed it did in this case (the second conjunct is necessary for explaining actual events). His counterfactual view is that if event A had not occurred, neither would event C have occurred: in his own words, "where, if the first object had not been, the second never had existed." That's a form of inevitability.

I'll play fast and loose with whether the arguments to `causes' are singular terms for events, or whether `causes' is a sentential connective, connecting assertions. The naive intertranslation above allows us either, without too much grief. Let's look at the most narrow interpretation of inevitability: logical implication. Let B be a tautology, a statement that is true by virtue of its form. Would it be appropriate to say A causes (A and B)? Most people's intuition says no. Two statements that are logically equivalent must describe the same event and an event cannot `cause' itself. So the word `causes' joins two event descriptions that are more loosely related than by logical equivalence.

So the sense of must involved in causality is not that of logical necessity. What about physics? We have theories, that are not obviously pure logic, of how the world works. We say there are `physical laws', that describe how, when I step in front of the rapidly moving bus, the bus can't stop in time and I will be knocked down. We mean something like: take these laws P as true; then P and (the bus is moving at 50kph) and (I am in the bus's trajectory 3 meters ahead of the bus and I don't move) logically implies I and the bus will meet while the bus is still moving. Since actions and events are involved and the events aren't all contemporaneous, this sense of logically implies had better be that of temporal logic. Hume the realist would say that P is just a description of our habits of observation, and has no other status. Kant would say that P is a necessary consequence of the meta-physical structure within which we have to view what happens in the world.

But this formal conception won't work, either. It's consistent with physical law, even though absurdly unlikely, that all the atoms in the bus jump backwards at the same time and thus the bus simply, instantaneously, stops 1 meter away from me. So it can't logically follow from P that I and the bus meet -- it can't be that causality means `logical consequence of physical law'. Similarly, intuitively one may ackowledge that exposure to huge amounts of radiation causes cancer, but it's not inevitable that you get cancer if you're run over by the bus before the cancer has a chance to develop. Causality doesn't imply inevitability. But also, inevitability doesn't imply causality, as we've seen from considering logical consequence.

There is however something compelling about this idea of inevitable provided that nothing unusual intervenes (the appropriate modification of Hume's counterfactual view). This notion is similar to that of the meaning of contrary-to-fact conditionals proposed by philosophers such as Robert Stalnaker and David Lewis (5). Lewis has argued that causality may be explained by counterfactuals (3). A contrary-to-fact conditional is one in which the antecedent is false: the statement if Ladkin were to be in England now, he'd have four legs is widely held to be false, and the statement if Ladkin were to be in England now, he'd have two legs or fewer is widely held to be true. A truth-functional interpretation of the if..then.. holds both to be true since the antecedent is false. Modal logic originally arose as an attempt to explain the `inevitability' of conditionals that truth-functional interpretations couldn't capture. But a traditional modal semantics doesn't appear to help us, since it rests on a formal explanation of what's meant by inevitability, and we're back where we started. Lewis's semantics works by saying that a contrary-to-fact conditional is true just in case, in some nearest possible world in which the antecedent is true, the consequent is also true. Stalnaker's semantics is broadly similar, except that he thinks there is a unique `nearest possible world'. To understand this semantics, one must understand the notion of possible world and also the meaning of nearest. Possible worlds are also used in the semantics of modality-- a statement is inevitably true just in case it's true in all possible worlds, and a statement is true just in case it's true in the actual world (`this' one, the one we're `in'). `Nearest' means the possible worlds that are basically like ours, but differ only in the most minimal ways. So, presumably, a possible world in which the atoms in the bus all jump backwards at the same time is not very `near', since it's unbelievably improbable. Neither is a world in which the physical `laws' are so different that the bus can stop on a dime. Nor a world in which I magically materialise on the sidewalk before the bus reaches me. In all the nearest worlds, I and the bus meet disastrously.

That's near enough for us, I think. Lewis and others have explained the notions of `nearest' and `possible world' clearly enough for us to be able to use these notions for engineering. The why?....because.... statements above can intuitively be given a Lewis-type semantics. Because they are not exlusively causal statements, but include reasons that are not causes, they avoid many of the objections that have been levelled against Lewis's account. In the nearest world in which control was not lost, the X-31 didn't crash. In the nearest world in which the pitot didn't ice, control was not lost. In the nearest world in which the appropriate atmospheric conditions weren't in the flight path, the pitot didn't ice. In the nearest world in which ground control noticed the atmospheric conditions, the plane didn't fly into them. And so forth.

But notice what this implies about causal chains. A causes B and B causes C does not necessarily imply A causes C -- only if the nearest worlds in which B is true to those nearest to us in which A is true are also those worlds nearest to us in which B is true. And this is not necessarily the case. But one can recapture transitivity by the following definition: A tr-causes B if and only if there is a sequence of events X.1, X.2, ..., X.n such that A causes X.1 causes X.2 causes .... X.n causes B.

Thus there is no need to try for transitivity. All the information lies in the causal chain already, and one can always use the transitive closure if one desires a transitive relation. A more pressing problem is as follows. Suppose we're trying to establish a `nearest' (or proximate) cause, and we believe A causes C. How do we ensure that there isn't a B such that A causes B and B causes C? Suppose we find one. Can we really now be sure that A causes C? Causality is no longer transitive. So we have to be more careful. And in accident investigations as well as requirements specification, more care is a good thing.

But if causality is not transitive, we have to write down all the links in all the causal chains. Could we draw a `causal graph'? Any irreflexive, antisymmetric binary relation can be represented by a directed graph. Points or nodes of the graph are statements I care about, and I draw an arrow from the node labelled A to the node labelled B just in case A causes B. But note that causes aren't unique. Looking at the A320 list above, one can see that events may have multiple causes -- A and B and C together cause D. Causality is a polyadic relation. So maybe we should be trying to draw a `causal hypergraph'. Hypergraphs can be notoriously difficult to draw, but luckily since the hypergraph must be acyclic (from the properties of causality), we can draw an arrow (`directed edge') from A to D just in case at least A and possibly other events cause D, and then recover the instance A and B and C together cause D of the polyadic causality relation just by observing in the graph that D has three arrows coming into it, whose sources are A, B and C. Note that we still need to color the edges to distinguish the different types of arrows corresponding to causes, functions and entailment (and also maybe `actual-inference' -- see below). Even though I have thus shown we can represent this structure as a colored directed graph, I shall continue to call it the `causal hypergraph'.

I propose that in order intuitively to analyse causes, we should write down all the descriptions of events and decisions that we think are possibly relevant (and talk to our colleagues to see if they have any that we've missed) and then arrange them in a causal hypergraph. (Such a procedure may remind the reader of fault tree analysis. I believe there are similarities.) This is the method I used to obtain my why?....because.... statements above. The reader will now easily be able to see that I just wrote down a linear description of a causal hypergraph. It might be a useful exercise at this point for the reader to draw the causal hypergraph from the information given above about the Warsaw accident.

What about functions and entailment? First, entailment. I noted above that entailment was different from causality, although similar to a kind of `psychological causality'. I believe it would be appropriate in this context to interpret `decision A entails decision B' as meaning something like `a person responsible for making decision A who is properly executing her responsibility will also make decision B as a consequence of making A'.

This notion of entailment could lead to difficulties: how can we be sure that it was really reasonable and proper to make decision B given decision A had already been made? For more precise description, one would want to contrast a entailment which was `reasonable and proper' to make, with the historical actual reasoning of the individuals involved in the events. For example, deciding not to turn on pitot heat entails deciding not to fly the plane through clouds in icing conditions according to the `reasonable and proper' interpretation, however it may be that an individual whacky pilot believed that flying outside the clouds caused ice, but flying inside did not, and therefore decided to fly inside a cloud and `therefore' not to turn on pitot heat. In this case, it would be a causal factor that the pilot's reasoning differed from the `entailment'. We might therefore want to use another word, say actually-inferred, to describe the relation between two historical decisions of an individual involved, whether it was an entailment or not. Thus a difference between `actually-inferred' and `entails' can be a causal factor. It doesn't appear to have been in the examples above, so I don't pursue this further here.

Notice also that decisions do not lead inevitably to actions. I can decide to do A, and do A; or I can decide to do A and simply not do A. But if I do do A having already decided to do A, it's reasonable to take my decision as a reason for my action. And it's reasonable to take either situation as a causal factor in an accident analysis. For example, the accident report assigns the `probable cause' of the Air India A320 accident at Bangalore on 14 February 1990 to:

Failure of the pilots to realise the gravity of the situation and respond immediately towards proper action of moving the throttles, even after the Radio altitude call-outs [...], in spite of knowing that the plane was in idle/open descent mode. [...]
By contrast, if the pilots of the Warsaw A320 had not increased airspeed by 20 knots even after deciding to do so, we could speculate that that accident might not have happened in the way it did. Both decision therefore action and decision but not action are legitimate edges of the causal hypergraph.

Second, functions. A function plays the following role in causality. If A causes the value of x to be u and B causes the value of y to be v and F is a function of x and y, then the occurrences of A and B are the joint reasons why the value of F is F(u,v). One might even want to say that they jointly cause the value of F to be F(u,v). This can obviously be extended to multiple causes. Thus, mathematical functions play a role similar to that of a `logical sum of causes'.

Finally, this vocabulary and structure allows us to see how both Fuller's comments and the Warsaw report are misleading, even though they may be correct. Fuller seems to be trying to identify a distal cause (the beginning of a long causal chain) rather than proximate causes (the smallest links in a causal chain). He seems to me to be assuming that causes are transitive, and furthermore that causality is a binary (`graph') relation rather than a hypergraph relation. He wants to find one distal cause (rather than the inevitably many) which he then wants to call THE cause. He first identifies it in the software, then in the design, then in the requirements discussion. He's entitled to follow this inquiry, of course, but it's not the complete story. The causal hypergraph is more like the complete story. And sorry, but loss of control caused the crash and pitot ice caused the loss of control. The Warsaw report also misses out many nodes in the causal hypergraph. They also seem to be trying to identify distal causes rather than proximate ones (the only causes mentioned are distal in my hypergraph). One may note that they leave out one internal node in the hypergraph for which Polish authorities are responsible (the position of the earth bank -- which draws attention to itself also because the hypergraph is of width one right before this node!), and that they do not consider as causes the wind report or the runway surface, although these are source nodes (distal causes) in the causal hypergraph. The report writers may believe that identifying only distal causes is appropriate, but we can see that they leave out causal information by doing so.

Faults in Systems

This topic arises often in discussions such as this. Gomez and Fuller, and even Mellor, give me the impression that they believe that if there's a fault, it's either a software fault or a hardware fault. That's actually incorrect. It's either a software fault or a hardware fault or a specification fault. Mellor might say `Yes, but specification is software'. Well, I doubt it. If one investigates the use of these words, and the reasoning that goes with them, one agrees quickly with Mellor that software and hardware are somewhat mutable, but neither of them are mutable with specification. As Mellor said, various parts of the design can be implemented in software or in hardware and it doesn't matter when analysing the design which it is. But try that trick with software and spec, or with hardware and spec, and it doesn't even seem to make sense. Software and hardware are thus more similar to each other in systems than either are to things called specifications. The mutability of software and hardware has also been emphasised in the context of reliability by Leyman, who has noted privately that in actuation systems which have part of their operation implemented by computers and other parts by hydraulics, that the hydraulics are more likely to fail first: it seems that hydraulic systems spring leaks more often than ethernet cables drop bits or send the wrong ones. The disconnection of specification from software is illustrated by a comment of Shafer concerning the X-31, that F4s have suffered a similar fate for similar reasons to the X31, even though there's no digital computer in sight in this part of an F4.

Some other arguments for treating specification as separate from software or hardware are given in my paper Correctness in System Engineering. But I warn you that it's full of what I consider careful philosophical analysis of certain kinds of reasoning. I'm no Peter Hoeg. You might want to wait until you're *really* insomniac before trying it.

But if systems are not fundamentally bipartite (`software' and `hardware'), are they fundamentally tripartite or quadripartite, or does the number of parts depend on whim? I propose, that although the total number of parts depends on whim (i.e., how finely you want to partition your system is up to you), that systems are fundamentally tripartite. Requirements specification, middleware (software, firmware, lots of what is traditionally called hardware), and the truly physical bit at the bottom. The reason for this tripartition is that problems in each of these parts are of different types and should be handled differently, whereas problems within one of these parts are of fundamentally similar types. I should note that middleware yields a host of intermediate specifications and designs.

A requirements specification is an attempt to say exactly what it is that will count as a satisfactory system. As Mellor says, requirements should be a) consistent, b) unambiguous, c) complete. Consistent means that there should be at least one system behavior that satisfies the spec; alternatively that FALSE is not a logical consequence of the spec. Unambiguous means that they should be written in a language with a precise semantics (whether it's English or logic -- but one touchstone of precision is that you are able to translate your requirements into some formal language with a precise semantics). Complete means that they exclude all intuitively-unwanted behavior and include all intuitively-wanted behavior. If you think these three definitions too crude, feel free to use your own, but please ensure that you can at least express these three concepts somehow in your language, because if not, you won't be able to say things that you have to ensure in lower-level specifications such as the design specification, as I note in the following paragraph. So mishaps arise when requirements are not consistent, complete and unambiguous.

A requirements specification doesn't tell you necessarily how to go about implementing your system. A design is roughly speaking a specification that describes how to implement a system. A design normally includes vocabulary that simply doesn't appear in the requirements. For example, it may be a requirement of your system that it doesn't deadlock, but it will usually be in the design where you say that the system has precisely 33 concurrently-executing coroutines. The relation of design to requirements specification is (rather, should be) that of logical implication. You need to prove that your 33 coroutines don't deadlock when you run them. (Hence your design specification must also be unambiguous otherwise you won't be able to prove anything. And it will only logically imply the requirements specification if it can exhibit no behavior that isn't allowed by the requirements. And it had better be consistent if you expect it to do anything at all!)

This relation of specification to design may be repeated lower down in middleware. Pseudocode with subroutines may be written for each of the 33 coroutines, and then in principle it should be provable that the pseudocode implements the design. Similarly, it should be provable that the C code written by your contractors implements the pseudocode. Or you may prefer to test instead, if you think you can. Similarly, for C code and pseudocode, substitute assembler instruction and chip design, as in the SRI AAMP5 pilot project, described in Formal Verification of an Avionics Microprocessor and Applying Formal Verification to a Commercial Microprocessor. How many specification/ design pairs you have depends on your patience. For a hierarchy of such pairs, see the Computational Logic Inc verified stack. All these belong to middleware. The relation between two descriptions in middleware should be that of logical implication. If this relation doesn't hold between two descriptions, there's a behavior that is allowed by the more detailed that is forbidden by the more abstract and so you allow the possibility of a mishap when you run the system. So mishaps arise from middleware when a lower-level description does not logically imply a higher-level description, or when the highest-level middleware description doesn't logically imply the requirements specification.

At the bottom lie the physical pieces. These pieces have a description which belongs to middleware. This description may be as high-level as an instruction set (if you're writing PC software, and you buy chips from a vendor and you trust the vendor) or as low-level as a compiled VHDL design. Note that if one considers the instruction set as the lowest description, others may still consider this as part of middleware and wish to delve deeper. In principle, one stops with the description of physical properties of the actual physical bits and pieces. Unlike at the top end, where there is precisely one requirements specification, one cannot look at a system design and building process and say a priori where the bottom boundary will be, how far down into the minutiae one wants to go. Nevertheless, wherever that may be, it exists. One has a description (or many) that ends the specification-implementation chain, and this description purports to be of the physical bits.

Now, the relation between the actual physical bits and their description suffers from all the features of science applying to the actual world. The description will necessarily be written in terms of or be dependent on an understanding of `physical laws' and it is standard fare in modern philosophy of science that there is normally no hard-and-fast logical guarantee that the physical bits always behave so as to satisfy your description (for some exceptions, see my Correctness paper). When they don't so behave, we say that there is a hardware failure and that this is due to a fault, in the normal engineering meanings of these terms. So, at the hardware level, a mishap is a failure and occurs when there is a fault, in the normal physical sense in which this is meant in engineering.

Thus, except for the requirements specification and the lowest-level hardware descriptions, every formal document (in Parnas's parlance, used also by Thomas and Leyman) in the system development is both a specification (of a lower-level part) and a design (for a more abstract part), and the relation between these documents is that of logical implication. The relation between the highest-level middleware specification and the requirements is also that of logical implication. The lowest documents are physical descriptions according to physical science of how the physical bits behave.

The requirements specification stands apart in how it must be checked. Consistency is a formal property, which can be ascertained directly; non-ambiguity also, under my definition (`non-ambiguity' as used by Mellor and Leveson seems to me to have more to do with what I called `completeness'). These properties, or lack of them, are also shared with middleware documents. However, ascertaining completeness is problematic. Unlike the case for middleware, there is no hard-and-fast criterion which will allow you to know if your specification is complete. Leveson's paper and book include a large collection of desiderata, which are similar to internal consistency checks. They have the status of `if your spec doesn't satisfy this, then look *real hard* at it right here, because it's probably unsatisfactory'. Such a list is a list of heuristics. The paper omits at least one that others have found useful, namely `write your requirements two different ways, say as automata and as temporal logic, and then formally prove the two logically equivalent, filling in the gaps you find as you go'. One may think this is a lot of work and entirely too complicated, but Gerard Holzmann of Bell Labs has had success with it. In any case, they're all still heuristics, no matter how good they are. This is the problem with `completeness'.

(I do have arguments in my `correctness' paper that there are some specs that you can *prove* complete, but the only *interesting* such specs that I could think of are those which specify I/O-type calculations of specifically mathematical functions, and of course what Leveson talks about in her paper is control and other reactive systems. `Safety' for a pocket calculator is a much less interesting concept......)

So when Mellor teaches students to ensure their design is complete (I think he means `requirements'), he's drumming into them to use as many heuristics as possible to analyse their requirements specifications. Whereas, when they're checking their design specifications against their requirements specifications, they can in principle use an automated theorem prover and not leave the room until they've proved it. And likewise for any of the middleware specifications. That's the difference. Requirements analysis and specification remains necessarily an art. A scientific art, but an art nevertheless.

I should emphasise that I'm not advocating any particular development process here. I'm explaining the status of documents. If your most reliable way of getting correct systems is to wave a magic wand over a pile of dead leaves and turn them into a deliverable Exabyte tape which your client reads into his Sparc20, who am I to argue? After all, it works for Ratbert.

What are the consequences of this tripartition for the causal hypergraph? With proper mathematical verification, middleware contributes only to the function edges of the causal hypergraph, and if there is a behavior of the system not allowed by the requirements specification which is not a hardware failure in the sense above, it's a middleware failure and can be so noted on the relevant function edge of the causal hypergraph. Hardware failures can be noted as distal causes, that is, leaves, and requirements failures would become apparent from the connections in the overall graph itself.

Responsibility and Blame

Prof Plum wuz framed
P. Ladkin
Back to top.

We've already noted that responsibility and blame are two different concepts, which are sometimes confused. One could speculate that the perceived inclination of accident investigation authorities to blame the pilot if all else fails stems from the statutory assignation of responsibility to the pilot-in-command.

Who to blame when your system breaks or your plane crashes? That surely depends in part on where you locate the fault. A system has at least three parts. Three parts always worth separating (I have argued) are requirements specification, middleware, and `hardware' (remembering that middleware may contain what has traditionally been called hardware). The way engineers reason, an unsatisfactory whole means an unsatisfactory part (`or an unsatisfactory connection between parts', as one reviewer of my Correctness paper said, but of course a `connection between parts' is itself a part - see Peter Simons' book Parts (9) for a thorough discussion of, well, parts). So, your middleware can be OK (you've proved the design implies the requirements) and your hardware can be OK but your pitot can still ice up and your plane crash. You can't blame your software engineers or your hardware engineers. You may want to blame your requirements review board, but remember that they didn't `fail' -- they only had heuristics to work with and they applied those heuristics diligently: they considered pitot icing and decided that they weren't going to fly the plane into such conditions. But as it happens they did anyway, without catching it in time, not that one has much time at those sorts of speeds. We can look at the causal hypergraph and even believe there exists a problem, but to blame someone for it is entirely a different matter and seems to me, and I hope to you to, rather pointless.

The Lufthansa A320 pilots knew the handbook, knew the weather, listened to what they were told by the tower, took their 20kts overspeed which the handbook allowed for countering windshear, looked for the windshear that their wind report led them to expect, and mentally promised themselves and their passengers not to let the airplane float, as the handbook warns against. Unfortunately, as Leyman has shown, at those speeds, with a normal approach angle, it's aerodynamically impossible to stop the airplane floating. Also, the handbook doesn't provide a means of calculating precisely what the extended landing distance would be even without float (as the report noted). But even if it had, as Leyman has also shown, the float pushes the plane off the end of the runway anyway, even without aquaplaning. And even if you did magically include all those figures in the handbook, it would still be unlikely to tell you to expect to use 1000 extra feet of runway if you're 14ft high on glideslope at 50ft. Maybe it should have done all these things, but I suspect most procedures manuals for commercial airplanes don't use that degree of precision (and there are arguments for thinking they shouldn't -- too much information inhibits quick search). After all, it took a British Aerospace Chief Engineer and a lot of hindsight even to notice the discrepancies. So does one really want to blame the pilots? What about the fly-by-wire flight control? A braking system logic issue does figure in the list of causal factors. But that's a story for another time. If your finger really itches to point, check out the causal hypergraph first, and here's hoping you have a lot of fingers.......

Here's wishing everyone safe flying.

Peter Ladkin


Back to top.

(1): David Lewis, Causal Explanation, in Philosophical Papers, ii, Oxford University Press, 1986, 214-240. Also in (2), 182-206. Back to top.
(2): David-Hillel Ruben, ed., Explanation, Oxford Readings in Philosophy Series, Oxford University Press, 1993.
(3): David Lewis, Causation, Journal of Philosophy 70, 1973, 556-567. Also in (4), 193-204. Back
(4): Ernest Sosa and Michael Tooley, eds., Causation, Oxford Readings in Philosophy Series, Oxford University Press, 1993.
(5): David Lewis, Counterfactuals, Basil Blackwell, 1973.Back
(6) Richard Dawkins: The Selfish Gene, Oxford University Press, 1976, 1989. Back
(7) Richard Dawkins: The Blind Watchmaker, Longmans 1986, Penguin 1991. Back
(8): Elliott Sober, Philosophy of Biology, Dimensions of Philosophy Series, Oxford University Press, 1993. Back
(9): Peter Simons, Parts : A Study in Ontology, Oxford University Press, 1987. Back
(10): Jonathan Moffett, Jon Hall, Andrew Coombes and John McDermid, A Model for a Causal Logic for Requirements Engineering, Journal of Requirements Engineering, to appear, 1996. Back