Artifactual systems such as computer systems are constructed with a purpose in mind: we may say, in order to fulfil a goal (Sea95). There is much work on how to state and fulfil such goals, namely the study of specification and verification. There is much less work on studying failure to fulfil a goal. We might say that while verification is concerned with proof, failure analysis is concerned with counterexamples.
Computer systems exhibit behavior, and behavior is generally taken to consist of sequences of states. Events are changes of state, and actions are sets of events; see, e.g., (LamTLA). These may be referred to in other contexts as event tokens and event types respectively (Ste97a). This event-state ontology appears to be appropriate (Lad96c), although there is some discussion about it (Ste97a).
In `traditional' engineering, failure was often correlated with a persistent failure state: behavior was simple and foreseeable, and failure was normally exhibited by a lack of structural integrity. In other words, something broke, and what's broke stays broke. (Logicians and verification specialists would say that the relevant liveness properties reduce to safety properties, which are formal-logically less complex.) As systems become behaviorally more complex, this is in general no longer so. A so-called reactive system may exhibit unwanted behavior, may fail, even though no individual state is itself inappropriate; the sequence of states, however, is wrong. A trivial example is that of putting money into a vending machine, whereupon one obtains the item, but the machine fails to give change. Each action is appropriate in and of itself, and each state is appropriate in and of itself, but giving no change is not appropriate given the sequence of actions and states that came before. (One may expect in general in such cases that the system fails to satisfy liveness properties that are not logically reducible to safety properties.)
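The vending-machine example can be made concrete with a small sketch, in Python, using invented state names: every individual state in the failing trace passes a state-by-state check, and only a check over the whole sequence reveals the failure.

```python
# A behavior is a sequence of states; events are the changes between them.
# Each state below is unobjectionable on its own; the failure is a property
# of the whole sequence. (State names are invented for illustration.)

def states_ok(trace):
    """Every individual state is acceptable (a safety-style, per-state check)."""
    acceptable = {"idle", "paid", "item_dispensed", "change_given"}
    return all(s in acceptable for s in trace)

def eventually_gives_change(trace):
    """After payment, change must eventually be given (a liveness-style check,
    decidable here only because the trace is finite)."""
    if "paid" not in trace:
        return True
    return "change_given" in trace[trace.index("paid"):]

failing_trace = ["idle", "paid", "item_dispensed", "idle"]
print(states_ok(failing_trace))                # True: every state is fine in isolation
print(eventually_gives_change(failing_trace))  # False: the sequence is wrong
```

The point of the sketch is only that the second predicate is irreducibly a property of the sequence, not of any state in it.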
Failures, that is, counterexamples, consist of unwanted behavior sequences. When the machine stands by itself as an isolated system, these unwanted behavior sequences can be noted and traced to origins in the design of the system, whether software or hardware. This process is by now moderately well understood, and specification, verification, debugging and testing methods are constantly improving. But what about a situation in which the machine is not isolated, but is designed to operate in a context which may be differently engineered, or which may not be an artifact at all? Consider avionics systems, embedded in an aircraft, which is itself flying around in Mother Nature under the control of a human pilot. When the system must react to the outside environment and to the needs of its human controllers, system analysis becomes much more difficult. The ubiquitous tendency in computer design to require that users adapt themselves to the system, documented in (Roc97), can no longer suffice for safety-critical systems such as these. It is now widely agreed (in the US at least) that a crew-centered approach (Bil97) must be taken to automation in aviation. This conclusion is also implicitly argued for in general in (Roc97).
Commercial aviation is a particularly suitable domain for research into the analysis of failure, for the following reasons (1):
How do we know these are problems? Because of thorough accident and incident analysis. A recent accident which exhibited many of the interface problems above is that of American Airlines Flight 965 into Cali, Colombia, on 20 December 1995 (2). We have analysed the Cali accident from the final report (Col96) using the WB-graph method (GeLa97), and the results appear below.
Roughly speaking, Lewis's semantics for the assertion that A is a causal factor of B, in which A and B are events or states, is that in the nearest possible world in which A did not happen, neither did B. This relies on a notion of possible worlds, which is very similar to the ontology of the Kripke semantics for modal logic. The notion of `nearest' possible world is particular to the semantics of counterfactual conditionals, and we leave this notion as intuitive. For example, suppose my office door is open. The `nearest' possible world in which my door is shut is one in which my door is shut, air currents around it behave appropriately, sound through it is muffled as it should be, but broadly speaking everything else remains the same. A further-away world would be one in which someone else who is not me is sitting here typing, and an even further-away world is one in which this whole environment is situated in Ghana rather than Germany.
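The shape of this definition can be caricatured in a few lines of Python. This is only an illustration, not a serious model: the set of worlds, the facts in them, and the distance function (counting differing facts, a crude stand-in for Lewis's similarity ordering) are all invented for the office-door example.

```python
# Toy Lewis-style counterfactual test: A is a causal factor of B iff in the
# nearest world in which A does not hold, B does not hold either.
# Worlds are modelled as sets of facts.

ACTUAL = {"door_open", "draft"}

WORLDS = [
    {"door_open", "draft"},   # the actual world: door open, a draft
    set(),                    # door shut, hence no draft; otherwise the same
    {"typing_stranger"},      # a much more different world
]

def distance(world):
    """Crude similarity ordering: count facts that differ from actuality."""
    return len(ACTUAL ^ world)

def causal_factor(a, b):
    """True iff in the nearest world lacking fact a, fact b is also absent."""
    candidates = [w for w in WORLDS if a not in w]
    nearest = min(candidates, key=distance)
    return b not in nearest

print(causal_factor("door_open", "draft"))  # True in this toy model
```

Under these assumed worlds, the door being open counts as a causal factor of the draft, because the nearest shut-door world lacks the draft.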
The informal definition of causal factor refers to events and states as particulars, as individuals: A and B look like referring expressions. Although events may be regarded as particulars (Dav67, Ste97a), states are commonly taken to be individuated by the collection of state predicates which are true in that state. This makes them types: a given state is said to occur or recur, whereas a given event happens, and that's it. A type of event can recur, but not an event itself. Thus, for this informal definition to make sense, it looks as if one must specify a specific occurrence of a (partial) state, differentiating either by timestamp, by position in the causal chain, or by some similar method. Such a differentiation method could also be regarded, and has been by some, as identifying the state as a durative event, distinguished by its initial event (when the partial state first manifests itself) and its terminal event (when the partial state ceases to be). We may leave these ontological considerations aside here.
We consider Lewis's semantics as a semantics of causal explanation, rather than of `causality' itself (whatever that may be) for two reasons:
The WB-graph method is, very generally:
It is assumed that a suitable narrative is provided: the WB-graph method is a method for analysing causal relations between already-identified significant events and states and does not do field-analysis. The analysis in the second step has been successfully performed with Boolean logic (GeLa97, Lad96e, Lad96d), but it can be extended into a temporal-logical analysis (Lad96c).
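The Boolean-logic reading of the analysis can be sketched as follows, in Python. The node names are taken from the Cali example later in the text; the dictionary and the rendering function are our own illustrative scaffolding, not part of the WB-graph method itself. The factors recorded for a node are read as a conjunction taken to be (counterfactually) sufficient for it.

```python
# Sketch of the Boolean reading of a WB-graph: the recorded causal factors
# of a node are read as a conjunction sufficient for that node.
# Node names abbreviated from the Cali example in the text.

WHY = {
    "E1":  ["S11", "E12"],            # aircraft impacts mountain
    "S11": ["E111", "S112", "E113"],  # GPWS manoeuvre failed
    "E12": ["S121", "S122"],          # AC in mountainous terrain
}

def explanation(node):
    """Render the explanation of a node as a Boolean formula, or flag it
    as a source node if no causal factors are recorded for it."""
    factors = WHY.get(node)
    if not factors:
        return node + " is a source node (no recorded causal factors)"
    return node + " <== " + " & ".join(factors)

print(explanation("E1"))   # E1 <== S11 & E12
```

Extending this sketch to a temporal-logical analysis, as (Lad96c) does, would replace the bare conjunction with temporally ordered formulas.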
The WB-graph for the Cali accident is shown in the figure below.
An Encapsulated Postscript version of this figure is also available, for those wishing a more finely-resolved version for zooming or printing.
The reason for performing a WB-graph analysis is to put the explanation of accidents on a rigorous logical foundation. We have already been able to identify reasoning mistakes in accident reports using this method. The three accident reports analysed in (GeLa97, Lad96e, Lad96d) all contained facts which were significantly causally related to the accident, which appear in the WB-graph analysis as causes, but which are not contained in the list of `probable cause/contributing factors' of the report. We regard this as a logical oversight. (Formally, they appear in the WB-analysis as causal factors that are not themselves explained by any further causal factors; i.e., as source nodes with out-edges but no in-edges.) Some might speculate that there are administrative, political or other social grounds for excluding them from the list of original causal factors, but this is not our interest here. We regard the WB-graph analysis as demonstrating that logical mistakes were made, thereby justifying the use of the WB-analysis to put accident reporting on a rigorous foundation.
However, when using the method, how may we ourselves maintain internal consistency checks? Namely, we need
A WB-graph can be a complex object: the WB-graph of the events and states specifically handled in the Cali accident report contains about 60 nodes, and even more edges. That of a twenty-second altitude-deviation incident from a simulator `flight' contains 25 or so nodes (PaLa97). Complex objects are relatively difficult to build and maintain, because small mistakes are harder to catch and correct. The question arises, then, whether a WB-graph can be automatically built and maintained from the basic information concerning its construction.
In the WB-graph method, each event or state predicate, starting from the accident event, is queried to discover the sufficient causal factors of that event or state. Each of these causal factors is then queried in turn, until a suitably complete or finished account is obtained. In (GeLa97) this process was followed `by hand'. We show here the result of (re)writing this development as a program in the linguistic description language DATR (EvGa). See also (Gib-a) for more DATR tools, including an interpreter. DATR is a pure inheritance language: DATR programs are evaluated, much as LISP programs are evaluated, but its evaluation mechanism is that of inheritance. In our use of DATR, each node of a WB-graph corresponds to a DATR node. The extensions of each node correspond to attributes: whether the node represents an event or a state; what its causal factors are (pointers to other nodes); and what it is a causal factor of (also pointers to other nodes). The graph is therefore built from purely local information about each node, and the DATR interpreter (Gib-b) or compiler (Gib-c) evaluates the entire collection of nodes against structural queries, yielding exactly the same information as in the (correct) WB-graph - in principle, simply constructing the graph dynamically as required. A faster DATR implementation may be found at (Sch96).
It is a relatively simple matter to query the DATR program to find out the `root reasons' of an event or state predicate (namely, the `source nodes' of the subgraph leading to the node: see (GeLa97)), and also to determine the events and state predicates to which a given node causally contributes (namely, the nodes in the transitive closure of the `causal factor of' relation from the given node). Both these questions need to be answered during an accident investigation for each causal factor present. These calculations are, of course, performed error-free by an error-free DATR interpreter or compiler.
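Both queries are ordinary graph computations, and the following Python sketch shows their shape; the edge list is abbreviated from the Cali example in the text (it is not the full 60-node graph), and the function names are ours.

```python
# 'Causal factor of' edges, abbreviated from the Cali WB-graph:
# an edge (a, b) means a is a causal factor of b.
EDGES = [
    ("S11", "E1"), ("E12", "E1"),
    ("E111", "S11"), ("S112", "S11"), ("E113", "S11"),
    ("S121", "E12"), ("S122", "E12"),
    ("S121", "E113"), ("S122", "E113"),
]

FACTORS_OF = {}   # node -> its direct causal factors (its in-edges)
for a, b in EDGES:
    FACTORS_OF.setdefault(b, []).append(a)

def ancestors(node):
    """All direct or indirect causal factors of a node."""
    seen, stack = set(), list(FACTORS_OF.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(FACTORS_OF.get(n, []))
    return seen

def root_reasons(node):
    """The source nodes of the subgraph leading to node: causal factors
    that are not themselves explained by any further causal factors."""
    return {n for n in ancestors(node) if n not in FACTORS_OF}

def contributes_to(node):
    """All nodes to which the given node is a direct or indirect causal factor
    (the transitive closure of 'causal factor of' from the node)."""
    return {b for b in FACTORS_OF if node in ancestors(b)}

print(sorted(root_reasons("E1")))      # ['E111', 'S112', 'S121', 'S122']
print(sorted(contributes_to("S122")))  # ['E1', 'E113', 'E12', 'S11']
```

A DATR interpreter performs the equivalent traversals when evaluating the <why> and <next> queries shown below.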
Here is a part of the Cali-accident DATR file:
E1:
    <event> == 'aircraft impacts mountain'
    <why>   == S11:<state> S11:<why> E12:<event> E12:<why>
    <prev>  == S11:<state> E12:<event>
    <next>  == '_'.

S11:
    <state> == 'GPWS manoeuvre failed since S111:<state>'
    <why>   == E111:<event> E111:<why> S112:<state> S112:<why> E113:<event> E113:<why>
    <prev>  == E111:<event> S112:<state> E113:<event>
    <next>  == E1:<event>.

E12:
    <event> == 'AC in mountainous terrain'
    <why>   == S121:<state> S121:<why> S122:<state> S122:<why>
    <prev>  == S121:<state> S122:<state>
    <next>  == E1:<event>.

E111:
    <event> == 'GPWS manoeuvre initiated'
    <why>   == E1111:<event> E1111:<why>
    <prev>  == E1111:<event>
    <next>  == S11:<state>.

S112:
    <state> == 'AC did not exhibit optimal climb performance'
    <why>   == S1121:<state> S1121:<why> S1122:<state> S1122:<why>
    <prev>  == S1121:<state> S1122:<state>
    <next>  == S11:<state>.

E113:
    <event> == 'AC very close to mountains'
    <why>   == S121:<state> S121:<why> S122:<state> S122:<why>
    <prev>  == S121:<state> S122:<state>
    <next>  == S11:<state>.

S122:
    <state> == 'AC flying too low for cleared airspace (3rdD): since '
    <why>   == E1221:<event> E1221:<why> S1222:<state> S1222:<why>
    <prev>  == E1221:<event> S1222:<state>
    <next>  == E113:<event> E12:<event>.

S121:
    <state> == 'AC on wrong course/position (2D-planar)'
    <why>   == E1211:<event> E1211:<why> S1212:<state> S1212:<why> E1213:<event> E1213:<why>
    <prev>  == E1211:<event> S1212:<state> E1213:<event>
    <next>  == E113:<event> E12:<event>.

Let us choose a state or an event from the WB-graph; for example, the event E1. The following information is obtainable from the DATR program:
E1:<event> -> 'aircraft impacts mountain'

Or one can ask why this event happened: E1:<why> lists all the events and state predicates that are direct or indirect causal factors of the queried event.
E1:<why> -> 'aircraft impacts mountain'
            'GPWS manoeuvre failed since S111:<state>'
            'GPWS manoeuvre initiated' ...
            'AC did not exhibit optimal climb performance' ...
            'AC in mountainous terrain'
            'AC on wrong course/position (2D-planar)' ...
            'AC flying too low for cleared airspace (3rdD): since ' ...
            'AC very close to mountains' ...

The ellipses indicate that the program gives many more reasons.
S11:<prev> -> 'GPWS manoeuvre initiated'
              'AC did not exhibit optimal climb performance'
              'AC very close to mountains'
S11:<next> -> 'aircraft impacts mountain'

An underscore, _, denotes that an explanation of the state is not defined, or that the next or previous state or event is not defined.
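One kind of internal consistency check asked for above can itself be automated: the <prev> and <next> pointers of the node records must be mutual, and the causal-factor relation must be acyclic. The following Python sketch (with the node records abbreviated from the Cali fragment, and the function names our own) shows both checks.

```python
# Consistency checks over abbreviated WB-graph node records:
# PREV maps a node to its causal factors, NEXT maps a node to what it
# is a causal factor of, mirroring the <prev>/<next> DATR attributes.

PREV = {
    "E1":  ["S11", "E12"],
    "S11": ["E111", "S112", "E113"],
    "E12": ["S121", "S122"],
}
NEXT = {
    "S11": ["E1"], "E12": ["E1"],
    "E111": ["S11"], "S112": ["S11"], "E113": ["S11"],
    "S121": ["E12"], "S122": ["E12"],
}

def mutual(prev, nxt):
    """Every <prev> pointer must be answered by a matching <next> pointer,
    and vice versa."""
    forward = all(n in nxt.get(p, []) for n, ps in prev.items() for p in ps)
    backward = all(p in prev.get(n, []) for p, ns in nxt.items() for n in ns)
    return forward and backward

def acyclic(prev):
    """Depth-first check that following <prev> pointers never revisits a node."""
    def visit(node, path):
        if node in path:
            return False
        return all(visit(p, path | {node}) for p in prev.get(node, []))
    return all(visit(n, set()) for n in prev)

print(mutual(PREV, NEXT), acyclic(PREV))  # True True for this fragment
```

Checks of this kind are exactly what becomes error-prone when a 60-node graph is maintained by hand, and trivial when the records are machine-readable.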
"It is usually between a consequent and the sum of several antecedents; the concurrence of them all being requisite to produce, that is, to be certain of being followed by, the consequent. In such cases it is very common to single out only one of the antecedents under the denomination of Cause, calling the others merely Conditions. ... The real Cause is the whole of these antecedents; and we have, philosophically speaking, no right to give the name of cause to one of them exclusively of the others." (Mil43, p. 214)
(Col96): Aeronautica Civil of the Republic of Colombia, Aircraft Accident Report: Controlled Flight Into Terrain, American Airlines Flight 965, Boeing 757-223, N651AA, Near Cali, Colombia, December 20 1995, 1996. Also available in full in (LadCOMP).
(Ger97.2): T. Gerdsmeier, Cali DATR Program, http://www.rvs.uni-bielefeld.de/~thorsten/CaliDATR.html, January 1997.
(GeLa97): T. Gerdsmeier, P. B. Ladkin and K. Loer, Analysing the Cali Accident With a WB-Graph, at http://www.rvs.uni-bielefeld.de Publications, January 1997. Also to appear in the Proceedings of the Glasgow Workshop on Human Error and Systems Development, March 1997.
(Gib-b): D. Gibbon, Minimal Unix DATR: HyprLex Testbed Interface, http://coral.lili.uni-bielefeld.de/DATR/testbed.html.
(JoTe97): C. W. Johnson and A. J. Telford, Extending the Application of Formal Methods to Analyse Human Error and System Failure During Accident Investigations, Software Engineering Journal 11(6):355-365, November 1996.
(Lad96b): Peter B. Ladkin, Some Dubious Theses in the Tense Logic of Accidents, Technical Report RVS-RR-96-14, RVS Group, Faculty of Technology, University of Bielefeld. Available at http://www.rvs.uni-bielefeld.de
(Lad96c): Peter B. Ladkin, Explaining Failure With Tense Logic, Technical Report RVS-RR-96-13, RVS Group, Faculty of Technology, University of Bielefeld. Available at http://www.rvs.uni-bielefeld.de
(Lad96d): Peter B. Ladkin, Formalism Helps in Describing Accidents, Technical Report RVS-RR-96-12, RVS Group, Faculty of Technology, University of Bielefeld. Available at http://www.rvs.uni-bielefeld.de
(Lad96e): Peter B. Ladkin, The X-31 and A320 Warsaw Crashes: Whodunnit?, Technical Report RVS-RR-96-08, RVS Group, Faculty of Technology, University of Bielefeld. Available at http://www.rvs.uni-bielefeld.de
(MoHa96): Jonathan Moffett, Jon Hall, Andrew Coombes and John McDermid, A Model for a Causal Logic for Requirements Engineering, Requirements Engineering 1(1):27-46, March 1996. Also in ftp://ftp.cs.york.ac.uk/hise_reports/req_capture/causal.ps.Z.
(SoTo93): Ernest Sosa and Michael Tooley, eds., Causation, Oxford Readings in Philosophy Series, Oxford University Press, 1993.
(vWr73): Georg Henrik von Wright, On the Logic and Epistemology of the Causal Relation, in P. Suppes et al., eds., Logic, Methodology and Philosophy of Science IV, Amsterdam: North-Holland, 1973. Also in (SoTo93), 193-204.