Formalising Failure Analysis

University of Bielefeld - Faculty of technology

Networks and distributed Systems Research group of Prof. Peter B. Ladkin, Ph.D.
Back to Abstracts of References and Incidents	Back to Root

Research Report RVS-Occ-97-06

Formalising Failure Analysis

Thorsten Gerdsmeier, Peter Ladkin, Karsten Loer
RVS Group, Technische Fakultät, Universität Bielefeld

3 June 1997

: Abstract: We consider systems which are artifacts designed to fulfil a goal. Most computer systems are of this sort. Criteria for success have been well-studied in the domain of specification and verification. Failure has been less well studied. The study of failure is normally exemplified by case histories. In other work, we have devised a method for determining and representing causality in case histories, called the WB-graph method. Methods need tools. We describe a tool for building and analysing WB-graphs built in DATR, a language devised for computational linguistics.

The Background

Artifactual systems such as computer systems are constructed with a purpose in mind, we can say in order to fulfil a goal (Sea95). There is much work on how to state and fulfil such goals, namely, the study of specification and verification. There is much less work on studying failure to fulfil a goal. We might say that while verification is concerned with proof, failure analysis is concerned with counterxamples.

Computer systems exhibit behavior, and behavior is generally taken to consist of sequences of states. Events are changes of state, and actions are sets of events e.g., (LamTLA). These may be referred to in other contexts as event tokens and event types respectively (Ste97a). This event-state ontology appears to be appropriate (Lad96c), although there is some discussion about it (Ste97a).

In `traditional' engineering, failure is often correlated with a persistent failure state, because behavior was simple, foreseeable, and failure was normally exhibited by a lack of structural integrity. In other words, something broke, and what's broke stays broke. (Logicians and verification specialists would say that relevant liveness properties reduce to safety properties, which are formal-logically less complex.) As systems become behaviorally more complex, this is in general no longer so. A so-called reactive system may exhibit unwanted behavior, may fail, even though no individual state is itself inappropriate. The sequence of states, however, is wrong. A trivial example could be that of putting money into a vending machine, whereupon one obtains the item, but the machine fails to give change. Each action is appropriate in and of itself, each state is appropriate in and of itself, but giving no change is not appropriate given the sequence of actions and states before. (One may expect in general in such cases that the system fails to satisfy liveness properties that are not logically reducible to safety properties.)

Failures, counterexamples, consist of unwanted behavior sequences. When the machine is standing by itself, an isolated system, then these unwanted behavior sequences can be noted and traced to origins in the design of the system, either software or hardware. This process is by now moderately well understood, and specification, verification, debugging and testing methods are constantly improving. But how about a situation in which the machine is not isolated, but is designed to operate in a context which may be differently engineered or which may not be an artifact at all, such as avionics systems, embedded in an aircraft, which is itself flying around in Mother Nature, under the control of a human pilot? When the system must react to the outside environment and to the needs of their human controllers, system analysis becomes much more difficult. The ubiquitous tendency in computer design that users must adapt themselves to the system, documented in (Roc97), can no longer suffice for safety-critical systems such as these. It is now widely agreed (in the US at least) that a crew-centered approach (Bil97), must be taken to automation in aviation. This conclusion is also implicitly argued for in general in (Roc97).

Commercial aviation is a particularly suitable domain for research into the analysis of failure, for the following reasons (1):

commercial aviation has been the subject of intense automation efforts; leading to significant and complex embedded computational systems;
commercial aircraft accidents (i.e., failures) are investigated with the highest degree of thoroughness possible;
although the same is true of other failures such as those in process control and power generation, airplane accidents occur with sufficient frequency to allow generalisations;
when a physical or system contribution to failure is identified, solutions can be implemented relatively straightforwardly in the rest of the fleet;
the systems are very dependent for their correct operation on time-critical human control;
the systems are not designed to tolerate catastrophic failure (there are no ejection devices);
most failures, while manifested by their state (the definition of `accident' in aviation is death or significant structural damage to the aircraft) are caused by a behavior: a sequence of interdependent actions, rather than a single state change;
this behavior and the causal analysis thereof are complicated, because of the physical environment and the varying degrees of human control of and responsibility for operation of the system.

Some of the human-machine interface problems summarised by (Roc97, p116) are

too heavy a workload associated with reprogramming flight management systems;
too much heads-down time in the cockpit attending to the systems (pilots are encouraged to keep heads-up and focussed on what is going on outside the airplane as much as possible);
deterioration of flying skills because of over-reliance on automation;
increasing complacency, lack of vigilance, and boredom;
lack of `situational awareness' when automated systems fail, making if difficult to identify and correct problems;
reluctance to take over from automated systems even in the face of compelling evidence that something is wrong.

How do we know these are problems? Because of thorough accident and incident analysis. A recent accident which exhibited many of the interface problems above is that of American Airlines 965 into Cali, Colombia, on 20 December 1995 (2). We have analysed the Cali accident from the final report (Col96) using the WB-graph method (GeLa97) and the results appear below

The Logical Semantics of Causal Explanation

The WB-Graph method is based on a formal semantics for causality introduced by David Lewis (Lew73b , Lew86 ), in turn based on his logical semantics for counterfactual conditionals (Lew73a). Unlike other analysts, who have attempted to capture causal relations entirely within a tense logic (MoHa96 , JoTe96 ), which the second author has criticised in (Lad96a , Lad96b ) respectively, we believe there are good reasons for separating the temporal relations, even necessary temporal relations, from causal relations, after the manner of von Wright (vWr73).

Roughly speaking, the semantics of Lewis for the assertion that A is a causal factor of B, in which A, respectively B, is an event or state, is that in the nearest possible world in which A did not happen, neither did B. This relies on a notion of possible worlds, which is very similar to the ontology of the Kripke semantics for modal logic. The notion of `nearest' possible worlds is particular to the semantics for counterfactual conditionals, and we leave this notion as intuitive. For example, suppose my office door is open. The `nearest' possible world in which my door is shut is one in which my door is shut, air currents around it behave appropriately, sound through it is muffled as it should be, but broadly speaking everything else remains the same. A further-away world would be one in which someone else who is not me is sitting here typing, and an even further-away world is one in which this whole environment is situated in Ghana rather than Germany.

The informal definition of causal factor refers to events and states as particulars, as individuals: A and B look like referring expressions. Although events may be regarded as particulars (Dav67 , Ste97a ), states are commonly taken to be individuated by the collection of state predicates which are true in that state. This makes them types: a given state is said to occur or recur, whereas a given event happens, and that's it. A type of event can recur, but not an event itself. Thus for this informal definition to make sense, it looks as if one must specify a specific occurrence of a (partial) state, differentiating either by timestamp or by position in the causal chain or some such method. Such a differentiation method could also be regarded, and has been by some, as identifying the state as an durative event, distinguished by its initial event (when the partial state first manifests itself) and terminal event (when the partial state ceases to be). We may leave these ontological considerations aside here.

We consider Lewis's semantics as a semantics of causal explanation, rather than of `causality' itself (whatever that may be) for two reasons:

it includes purely logical notions: it is trivial to observe that A is a causal factor of B if A is a logical consequence of B (noted by Lewis himself in (Lew86)). Logical consequence is not normally thought of as having anything to do with causality, but it has a lot to do with explanation;
although the argument that a given event may have many causal factors is over a century old ((Mil43), see (3)), recent work of Davidson, as interpreted by Steward (Ste97a, Ch 7; Ste97b ) suggests that an event may indeed have a singular cause, even though explaining why that event happened may require the invocation of many factors: the state of the environment, etc. The `network model' (Ste97a) of causation due to Mill is, according to Steward, an explanatory device and not a metaphysical reality.

A further technical problem for the Lewis semantics arises with so-called `overdetermination': the occurrence of two or more independent causal factors B and C for A that are individually sufficient: in the `nearest possible world' in which B does not occur, C occurs and, because it is sufficient, A occurs. Thus, according to the Lewis semantics, B would not be a causal factor for A; and mutatis mutandis with C. There is work suggesting how such situations may be handled (Pen97), but for our `engineering' purposes, such cases can simply be noted and analysed `by hand' as exceptions.

The WB-Graph Method

The WB-graph method is, very generally:

to identify `significant' events and system states (including states of human operators) from a narrative of the incident;
where necessary, to express these as Boolean or temporal-logical formulas - the `atomic events/states' are those corresponding to the propositional variables in this analysis;
finally, to use the Lewis-test between all pairs of propositional variables from the previous step to build the graph of the causal-factor relation between them.

It is assumed that a suitable narrative is provided: the WB-graph method is a method for analysing causal relations between already-identified significant events and states and does not do field-analysis. The analysis in the second step has been successfully performed with Boolean logic (GeLa97, Lad96e , Lad96d ), but it can be extended into a temporal-logical analysis (Lad96c).

The WB-graph for the Cali accident is

Back to Background section

An Encapsulated Postscript version of this figure is also available, for those wishing a more finely-resolved version for zooming or printing.

Analysing Accident Reports

The reason for performing a WB-graph analysis is to put the explanation of accidents on a rigorous logical foundation. We have already been able to identify reasoning mistakes in accident reports using this method. The three accident reports analysed in (GeLa97, Lad96e Lad96d), all contained facts which were significantly causally related to the accident, which appear in the WB-graph analysis as causes, but which are not contained in the list of `probable cause/contributing factors' of the report. We regard this as a logical oversight. (Formally, they appear in the WB-analysis as causal factors that are not themselves explained by any further causal factors; i.e., as source nodes with out-edges but no in-edges.) Some might speculate that there are administrative, political or other social grounds for excluding them from the list of original causal factors, but this is not our interest here. We regard the WB-graph analysis as demonstrating that logical mistakes were made, thereby justifying the use of the WB-analysis to put accident reporting on a rigorous foundation.

However, when using the method, how may we ourselves maintain internal consistency checks? Namely, we need

to ensure that all causally-relevant facts mentioned in the narrative are actually included in the WB-graph;
to ensure that all the causal relations between the causally-relevant facts are in fact exhibited in the graph we draw.

To this end, we use the programming language DATR.

A Tool For Analysing and Building WB-Graphs

The mathematical structure of a WB-graph is that of a labelled directed graph. Nodes represent either events or assertions about system or environment states (`state predicates'), and an edge is placed between two nodes if the event/state labelling the tail node is a reason (following the Lewis semantics) why the event/state labelling the head node came to pass.

A WB-graph can be a complex object: the WB-graph of the events and states specifically handled in the Cali accident report contains about 60 nodes and more edges. That of a twenty-second altitude-deviation incident from a simulator `flight' contains 25 or so nodes (PaLa97). Complex objects are relatively difficult to build and maintain, because small mistakes are harder to catch and correct. The question arises, then, whether a WB-graph can be automatically built and maintained from the basic information concerning its construction.

In the WB-graph method, each event or state predicate, starting from the accident event, is queried to discover what are the sufficient causal factors of that event or state. Each of these causal factors is then queried in turn, until a suitably complete or finished account is obtained. In (GeLa97) this process was followed `by hand'. We show here the result of (re)writing this development as a program in the linguistic description language DATR (EvGa). See also (Gib-a) for more DATR tools, including an interpreter. DATR is a pure inheritance language: DATR programs are evaluated, much as LISP programs are evaluated, but its evaluation mechanism is that of inheritance. In our use of DATR, each node of a WB-graph corresponds to a DATR-node. The extensions of each node correspond to attributes; whether the node represents an event or a state; what its causal factors are (pointers to other nodes); what it is a causal factor of (also pointers to other nodes). The graph is therefore built from purely local information about each node, and the DATR interpreter (Gib-b) or compiler (Gib-c) evaluates the entire collection of nodes against structural queries, yielding exactly the same information as in the (correct) WB-graph - in principle, simply constructing the graph dynamically as required. A faster DATR implementation may be found at (Sch96).

Advantages of the DATR representation

Despite that the Cali WB-graph was thoroughly checked by hand by three people, evaluation of the DATR version nevertheless found errors in the hand-generated graph: two nodes representing events had as causal factors only nodes representing state predicates (a clear violation of the normal ontology of causal factors!); and one node was construed as an event in one place and a state predicate in another.

It is a relatively simple matter to query the DATR program to find out what are the `root reasons' of an event or state predicate (namely, the `source nodes' of the subgraph leading to the node: see (GeLa97)), and also to determine the events and state predicates to which a given node causally contributes (namely, nodes on the transitive closure of the `causal factor of' relation to the given node). Both these questions need to be answered during an accident investigation for each causal factor present. These calculations are, of course, performed error-free by an error-free DATR interpreter or compiler.

How to use the tool

The output of the DATR program may be compared with the textual form of the Cali-WB-Graph in (GeLa97). The DATR program is available for testing at (Ger97.1). One may need to refer to the textual form in (GeLa97).

Here is a part of the Cali-accident DATR file:

          E1:
          <event> == 'aircraft impacts mountain'
          <why>   == S11:<state> S11:<why> E12:<event> E12:<why>
          <prev>  == S11:<state> E12:<event>
          <next>  == '_'. 
          
          S11:
          <state> == 'GWPS monoeuvre failed since S111:<state>'
          <why>   == E111:<event> E111:<why> S112:<state> S112:<why> E113:<event> E113:<why>
          <prev>  == E111:<event> S112:<state> E113:<event>
          <next>  == E1:<event>.
          
          E12:
          <event> == 'AC in mountainous terrain'
          <why>   == S121:<state> S121:<why> S122:<state> S122:<why>
          <prev>  == S121:<state> S122:<state>
          <next>  == E1:<event>.
          
          E111:
          <event> == 'GWPS manoeuvre initiated'
          <why>   == E1111:<event> E1111:<why>
          <prev>  == E1111:<event>
          <next>  == S11:<state>.
          
          S112:
          <state> == 'AC did not exhibit optimal climb performance'
          <why>   == S1121:<state> S1121:<why> S1122:<state> S1122:<why>
          <prev>  == S1121:<state> S1122:<state>
          <next>  == S11:<state>.
          
          E113:
          <event> == 'AC very close to mountains' 
          <why>   == S121:<state> S121:<why> S122:<state> S122:<why>
          <prev>  == S121:<state> S122:<state>
          <next>  == S11:<state>.
          
          S122:
          <state> == 'AC flying too low for cleared airspace (3rdD): since [121311]'
          <why>   == E1221:<event> E1221:<why> S1222:<state> S1222:<why>
          <prev>  == E1221:<event> S1222:<state> 
          <next>  == E113:<event> E12:<event>.
          
          S121:
          <state> == 'AC on wrong course/position (2D-planar)'
          <why>   == E1211:<event> E1211:<why> S1212:<state> S1212:<why>
                     E1213:<event> E1213:<why>
          <prev>  == E1211:<event> S1212:<state> E1213:<event>
          <next>  == E113:<event> E12:<event>.

Let us choose a state or an event from the WB-graph; for example, the event [1] . The following information is obtainable from the DATR program:
E1:<event> gives a description of the event.

          E1:<event> -> 'aircraft impacts mountain'

Or one can ask why this event happened: E1:<why> lists all the events and state predicates that are direct or indirect causal factors of the queried state.

          E1:<why> ->  'aircraft impacts mountain'
                                  'GWPS monoeuvre failed since S111:<state>'
                                        'GWPS manoeuvre initiated'
                                        ...
                                        'AC did not exhibit optimal climb performance'
                                        ... 
                                  'AC in mountainous terrain' 
                                        'AC on wrong course/position (2D-planar)'
                                        ...
                                        'AC flying too low for cleared airspace (3rdD): since [121311]
                                        ... 
                                        'AC very close to mountains' 
                                        ...

The three points denote that the program gives many more reasons.
S11:<prev> enumerates the state predicates and events that are immediate causal factors of the queried state; and S11:<next> the state predicates/events of which the queried state is a causal factor.

          S11:<prev> -> 'GWPS manoeuvre initiated'
                              'AC did not exhibit optimal climb performance'
                              'AC very close to mountains'
          
          S11:<next> -> 'aircraft impacts mountain'.

An underscore, _, denotes that an explanation of the state is not defined or that the next or previous state or event is not defined.

Conclusions

We have justified failure analysis as a long-overlooked part of rigorous computer science development, and argued that it is increasingly important for embedded safety-critical systems. We have proposed and justified the selection of commercial aviation as a source of examples, and described briefly the WB-graph causal reasoning method. We have given an example of a medium-to-large WB-graph, and described how WB-graphs may be easily programmed and checked for consistency in the language DATR.

Attributions and Acknowledgements

The WB-graph method was originally devised by Ladkin, and significantly improved by Ladkin and Everett Palmer of NASA Ames (PaLa97). The essay analysing the causal history in the Cali accident report is joint work of Gerdsmeier, Ladkin and Loer. This version of the WB-graph of the Cali accident was developed by Loer following a representation suggested by Palmer. The implementation of WB-graphs as DATR programs is the work of Gerdsmeier (Ger97). following a suggestion of Ladkin.

Footnotes

(1): That is, apart from the fact that the second author is a keen pilot. Back

(2): Particularly poignant for computer scientists because Paris Kanellakis of Brown University was killed in the crash with his family. Back

(3): Mill, as quoted by Steward (Ste97a, p214), says

It is usually between a consequent and the sum of several antecedents; the concurrence of them all being requisite to produce, that is, to be certain of being followed by the consequent. In such cases it is very common to single out only one of the antecedents under the denomination of Cause, calling the others merely Conditions....The real Cause is the whole of these antecedents; and we have, philosophically speaking, no right to give the name of causes to one of them exclusively of the others. (Mil43, p214).

Back

References

(Bil97): Charles E. Billings, Aviation Automation, New Jersey: Lawrence Erlbaum Associates, 1997 Back

(Col96): Aeronautica Civil of The Republic of Columbia, Aircraft Accident Report: Controlled Flight Into Terrain, American Airlines Flight 965, Boeing 757-223, N651AA, Near Cali, Colombia, December 20 1995, Author, 1996. Also available in full in (LadCOMP). Back

(Dav67): Donald Davidson, Causal Relations, Journal of Philosophy 64, 1967, 691-703. Also in (80) and (SoTo93). Back

(Dav80): Donald Davidson, Essays on Actions and Events, Oxford University Press, 1980. Back

(3): R. Evans, G. Gazdar and W. Keller, The DATR Web pages at Sussex. Back

(Ger97): T. Gerdsmeier, A Tool For Building and Analysing WB-graphs, Technical Report RVS-RR-97-02, at http://www.rvs.uni-bielefeld.de Publications, January 1997. Back

(Ger97.1): T. Gerdsmeier, Cali DATR Program, http://www.rvs.uni-bielefeld.de/~thorsten/CaliDATR.html, January 1997. Back

(GeLa97): T. Gerdsmeier, P. B. Ladkin and K. Loer, Analysing the Cali Accident With a WB-Graph, at http://www.rvs.uni-bielefeld.de Publications, January 1997. Also to appear in the Proceedings of the Glasgow Workshop on Human Error and Systems Development, March 1997. Back

(Gib-a): D. Gibbon, ILEX/DATR, http://coral.lili.uni-bielefeld.de/DATR. Back

(Gib-b): D. Gibbon, Minimal Unix DATR: HyprLex Testbed Interface, http://coral.lili.uni-bielefeld.de/DATR/testbed.html. Back

(Gib-c): D. Gibbon, ILEX/DATR Online Compendium, http://coral.lili.uni-bielefeld.de/DATR/. Back

(JoTe97):C. W. Johnson and A. J. Telford, Extending the Application of Formal Methods to Analyse Human Error and System Failure During Accident Investigations, Software Engineering Journal 11(6):355-365, November 1996. Back

(LadCOMP): Peter B. Ladkin, ed., Computer-Related Incidents with Commercial Aircraft, compendium of accident reports, commentary and discussion, at http://www.rvs.uni-bielefeld.de Back

(Lad96a): Peter B. Ladkin, Reasons and Causes, Technical Report RVS-RR-96-09, RVS Group, Faculty of Technology, University of Bielefeld. Available at http://www.rvs.uni-bielefeld.de Back

(Lad96b): Peter B. Ladkin, Some Dubious Theses in the Tense Logic of Accidents, Technical Report RVS-RR-96-14, RVS Group, Faculty of Technology, University of Bielefeld. Available at http://www.rvs.uni-bielefeld.de Back

(Lad96c): Peter B. Ladkin, Explaining Failure With Tense Logic, Technical Report RVS-RR-96-13, RVS Group, Faculty of Technology, University of Bielefeld. Available at http://www.rvs.uni-bielefeld.de Back

(Lad96d): Peter B. Ladkin, Formalism Helps in Describing Accidents, Technical Report RVS-RR-96-12, RVS Group, Faculty of Technology, University of Bielefeld. Available at http://www.rvs.uni-bielefeld.de Back

(Lad96e): Peter B. Ladkin, The X-31 and A320 Warsaw Crashes: Whodunnit?, Technical Report RVS-RR-96-08, RVS Group, Faculty of Technology, University of Bielefeld. Available at http://www.rvs.uni-bielefeld.de Back

(LadCOMP): Peter B. Ladkin, ed., Computer-Related Incidents with Commercial Aircraft, compendium of accident reports, commentary and discussion, at http://www.rvs.uni-bielefeld.de Back

(LamTLA): Leslie Lamport, The Temporal Logic of Actions Pages. Back

(Lew73a): David Lewis, Counterfactuals, Oxford: Basil Blackwell, 1973. Back

(Lew73b): David Lewis, Causation, Journal of Philosophy 70, 1973, 556-567. Also in (4), 193-204. Back

(Lew86): David Lewis, Causal Explanation, in Philosophical Papers, ii, Oxford University Press, 1986, 214-240. Also in (Rub93), 182-206. Back

(Mil43): John Stuart Mill, A System of Logic, 8th edn., 1843; London: Longmans, 1873. Quoted in (Ste97a, p214). Back

(MoHa96): Jonathan Moffett, Jon Hall, Andrew Coombes and John McDermid, A Model for a Causal Logic for Requirements Engineering, Requirements Engineering 1(1) (a local copy):27-46, March 1996. Also in (ftp://ftp.cs.york.ac.uk/hise_reports/req_capture/Causal.pdf). Back

(2): E. A. Palmer and P. B. Ladkin, Analysing An `Oops' Incident, in progress, will be available from http://www.rvs.uni-bielefeld.de Back

(Pen97): Alan Penczek, Disjunctive Properties and Causal Efficacy, Philosophical Studies 86(2):203-219, May 1997. Back

(Roc97): Gene L. Rochlin, Trapped in the Net, New Jersey: Princeton University Press, 1997 Back

(Rub93): David-Hillel Ruben, ed., Explanation, Oxford Readings in Philosophy Series, Oxford University Press, 1993. Back

(Sch96): C. Schillo, ZDATR, http://coral.lili.uni-bielefeld.de/DATR/Zdatr/, 1996. Back

(Sea95): John R. Searle, The Construction of Social Reality, New York: Simon and Schuster, 1995; London: Penguin, 1996. Back

(SoTo93): Ernest Sosa and Michael Tooley, eds., Causation, Oxford Readings in Philosophy Series, Oxford University Press, 1993. Back

(Ste97a): Helen Steward, The Ontology of Mind: Events, States and Processes, Oxford, Clarendon Press, 1997. Back

(Ste97b): Helen Steward, On the Notion of Cause `Philosophically Speaking', Proceedings of the Aristotelian Society XCVII(2):127-140, 1997. Back

(vWr73): Georg Hendrik von Wright, On the Logic and Epistomology of the Causal Relation, in P. Suppes et al., eds., Logic, Methodology and Philosophy of Science IV, Amsterdam: North-Holland, 1973. Also in (SoTo93), 193-204. Back


Copyright © 1999 Peter B. Ladkin, 1999-02-08
	Last modification on 1999-06-15 by Michael Blume