The goal of this document is to provide a reference list of criteria which can be used to compare different methods of root-causal analysis of accidents and incidents, and a handbook for applying these criteria. Each individual criterion is to have a
Since this document is evolving and is intended for general use, we invite contributions and commentary. We acknowledge contributors at the bottom of the document.
Is the method targeted towards the "sophisticated user" (does it require use of techniques such as theorem proving which require specific expertise)? Is it suitable for use by domain experts? The ideal criterion is that the method should be easy to use (with a minimum amount of training, preferably requiring no proprietary tool) by the average engineer. (Jens Braband firstname.lastname@example.org)
This criterion is self-explanatory. What is there in the way of tool support? (Luke Emmet email@example.com)
Is the method scalable? Can the method be used cost-effectively for minor incidents as well as major accidents? Can you apply a subset of the method to small, or to less-significant, problems and the "full monty" to large, or to significant, problems?
In inquiring about the scalability of a method, one should keep in mind that there are objective differences in problem and analysis complexity, that is, differences which depend on characteristics of the incident to be analysed and not on properties of the analysis alone. For example, one might claim that the 1988 A320 accident at Habsheim is an easier accident to analyse than the train crash at Ladbroke Grove outside Paddington, despite a decade of public disagreement over faked data recorders and recordings concerning Habsheim. Here is the argument.
According to the official report, whose analysis was verified by a Why-Because Analysis, the accident at Habsheim can be put down to: bad flight planning and preparation on the part of the cockpit crew (CRW), bad management oversight (Air France broke its own rules and allowed CRW to plan to break French regulations governing airshow demonstration flights), lack of skill (the pilot flying had no experience in airshow demonstration flights or in test flying), bad execution (the pilot flying failed to follow his flight plan, and improvised, including breaking his own altitude limits), the ground layout (there were trees at the end of a runway), and, maybe, characteristics of the aircraft. That comprises six relatively clear factors, of which some are obvious without much investigation (for example, the ground layout). Setting aside the aircraft characteristics, which are the one contested item, that leaves five factors on which all agree, despite the public disagreements. The report is easy to read and, apart from some dissension about possibly missing data and perhaps slightly different aircraft characteristics, it is definitive.
Compare the situation at Ladbroke Grove. There are at least nine different domains in which factors appear, according to the Why-Because Analysis performed by Ernesto de Stefano, and the official (Cullen) report focused on only one of them. First of all, that is, objectively, a difference in complexity, due to subsystem constitution, if you like. Second, the disagreement between the official analysis and the technical WB-Analysis shows that the factors, whatever they might be, are not so easy or obvious to grasp that everyone can do it. And, finally, it is hard to see any way in which de Stefano's results can adequately be summarised so as to fit into a WB-graph the size of the Habsheim WB-graph while still retaining its explanatory force. One may thus reasonably conclude that Ladbroke Grove is a more complex accident than Habsheim to analyse, although there was essentially no public disagreement over the analysis.
So the question of scalability asks whether the complexity of analysis using the method scales with the complexity of the problem. For the WB-Analyses of the Habsheim and Ladbroke Grove accidents, one would consider whether the approximately twenty nodes of the former against the approximately ninety nodes of the latter constituted an appropriate measure of the objective problem complexity. (Peter Ladkin firstname.lastname@example.org)
What is the nature of the method's graphical representation?
The motivating principle is that a picture is worth a thousand words. It is often more comprehensible to display the results of an analysis method as an image, a graph, or other form of illustration, than as purely written text.
We argue that there is a reason for this phenomenon. Besides containing descriptions of the facts surrounding an incident, say written in a language L, a written text must also indicate the causal interrelations of those facts, which necessitates some use of specific technical language, say T, along with a method to distinguish this language from the language in which the facts of the incident are expressed. This method would usually be the syntax: T would use keywords, and a certain sentence structure, for example, in which the factual sentences in language L would be inserted. This embedding of L in T, as well as the necessity to distinguish the syntax of T from that in use already in L, leads either to more complicated sentence structures, or to oversimplification of the causal assertions. Neither of these features is desirable: the first leads to cognitive complexity, and the second leads to information loss.
However, a non-textual representation of the causal interrelations and other identified structure, such as taxonomies of failures, facilitates the cognitive distinction of this structure from that of the facts themselves. The causal and taxonomic structure is perceived visually, and the incident facts linguistically, and these two cognitive capabilities are largely independent. The complexity of the understanding of the facts remains at the level of the language L; the visual structure is at the level of the language T (with the embedded L-expressed facts treated as atomic), and the cognitive complexity of the representation remains at the level of L added to T, rather than L embedded in T, as with a purely text-based representation.
Anecdotal experience with text-based and graphical representations of the results of causal analyses has indicated that a representation of causal information as a directed graph allows both experts and non-experts alike to interpret the results more easily, as well as to see structural features that could well be hidden in a textual representation. Examples of such renderings are WB-Graphs (see the WBA Home Page at http://www.rvs.uni-bielefeld.de) and the causal graphs of Pearl (Judea Pearl, Causality: Models, Reasoning and Inference, Cambridge University Press, 2000).
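As a minimal sketch of such a directed-graph rendering, the following Python function writes a list of (cause, effect) pairs as Graphviz DOT text. The choice of DOT, the names `to_dot` and `HABSHEIM_SKETCH`, and the flat edge list are assumptions made for illustration only. The edge list paraphrases the six Habsheim factors named earlier; it is not the published WB-graph, which has approximately twenty nodes.

```python
def _esc(label):
    """Escape backslashes and double quotes so a label is a valid DOT string."""
    return label.replace("\\", "\\\\").replace('"', '\\"')


def to_dot(edges, name="causal_graph"):
    """Return DOT text for a directed graph given as (cause, effect) pairs.

    Each fact is kept as an atomic, quoted node label (the language L);
    the arrows alone carry the causal structure (the language T).
    """
    lines = [f"digraph {name} {{", "  rankdir=BT;"]  # effects drawn above causes
    for cause, effect in edges:
        lines.append(f'  "{_esc(cause)}" -> "{_esc(effect)}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical, flattened illustration: every factor points straight at the accident.
HABSHEIM_SKETCH = [
    ("Bad flight planning and preparation", "Accident"),
    ("Bad management oversight", "Accident"),
    ("Lack of skill", "Accident"),
    ("Bad execution", "Accident"),
    ("Trees at end of runway", "Accident"),
    ("Aircraft characteristics (disputed)", "Accident"),
]

if __name__ == "__main__":
    print(to_dot(HABSHEIM_SKETCH))
```

The resulting text can be passed to any DOT-capable layout tool. The point of the sketch is only that the factual sentences stay untouched inside the node labels, while the causal relation is expressed entirely by the arrows.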
The desirable properties of a graphical representation are:
In pursuit of the second property, it is desirable to display the results of a causal analysis as a unit, say a page in some standard format. The practical limit of page size is A0 in the international standard sizes. If the causal relations can be displayed graphically on a page of size less than or equal to A0 while remaining readable, then there is good reason so to do. If this cannot be accomplished, some method of factoring the results must be applied to render them over multiple pages. A factoring method is desirable even if results may be rendered on one page, for different page sizes are used for different purposes: an A0 rendering of a graph is adequate for comprehension and discussion, but is inappropriate for a report, for which A4 is the standard size.
One method of factoring which has proved relatively helpful for causal graphs in particular is graph-theoretic factoring, in which graphs are separated at their places of least width and each "chunk" is rendered separately, with the cut-set nodes identified in each chunk through color. The entire graph is rendered small, to exhibit the overall graph structure including the (colored) cut-set nodes while rendering any text unreadable. The individual chunks are rendered on separate pages. An example is the WBA of the 1993 A320 Warsaw accident (Ladkin, Höhl), available through the Publications page at http://www.rvs.uni-bielefeld.de. This factoring, however, is of limited use in situations in which the width of the graph is large (say, greater than five nodes), or in which a large proportion of the graph lies in one chunk. An example in which this factoring would seem to be poor is the WBA of the 1995 Cali accident, also at http://www.rvs.uni-bielefeld.de. This type of factoring obviously does not apply to representations of the results in forms other than that of a graph.
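The idea of separating a graph at a place of least width can be sketched as follows. This is one possible reading, not the procedure used for the published WB-graphs: it assumes the causal graph is acyclic, assigns each node a layer by longest-path depth, and treats a layer as a valid cut only if no edge skips over it. All function and node names are hypothetical.

```python
from collections import defaultdict


def layer_depths(edges):
    """Longest-path depth of every node in a DAG given as (cause, effect) pairs."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    depth = {n: 0 for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while ready:  # Kahn's algorithm: a node is popped after all its causes
        u = ready.pop()
        seen += 1
        for v in succ[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if seen != len(nodes):
        raise ValueError("graph contains a cycle")
    return depth


def factor_at_narrowest_layer(edges):
    """Split a DAG at its narrowest interior layer that no edge skips over.

    Returns (cut_set, early_chunk, late_chunk); the cut-set nodes appear in
    both chunks. Returns None if no interior layer separates the graph.
    """
    if not edges:
        return None
    depth = layer_depths(edges)
    layers = defaultdict(set)
    for node, d in depth.items():
        layers[d].add(node)
    candidates = [
        d for d in range(1, max(layers))
        if not any(depth[u] < d < depth[v] for u, v in edges)
    ]
    if not candidates:
        return None
    cut_d = min(candidates, key=lambda d: (len(layers[d]), d))
    early = {n for n, d in depth.items() if d <= cut_d}
    late = {n for n, d in depth.items() if d >= cut_d}
    return layers[cut_d], early, late


# Hypothetical six-node graph: two root causes, one bottleneck node "c".
EXAMPLE_EDGES = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"),
                 ("d", "f"), ("e", "f")]
```

For `EXAMPLE_EDGES` the narrowest usable layer contains only the node `c`, so `c` would be the colored cut-set node shown in both chunks. A graph in which every interior layer is wide, or is bypassed by some edge, yields either a large cut set or no cut at all, which is the limitation described above.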
Another method of factoring consists in identifying subsystem involvement and rendering factors which concern an individual subsystem as one unit, with different units for different subsystems. Such a method was used for the WBA of the Ladbroke Grove accident (de Stefano) to factor a WB-graph, whose minimal readable rendering is likely A2, into separate A4 units. (Contact the author, Ernesto de Stefano, for a copy of his report, in German.)
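Factoring by subsystem can be sketched in the same style. The sketch assumes that every node has already been assigned to exactly one subsystem; the function name, the subsystem labels and the node names are hypothetical and are not taken from de Stefano's report.

```python
from collections import defaultdict


def factor_by_subsystem(edges, subsystem_of):
    """Group (cause, effect) edges into one unit per subsystem.

    Edges whose two ends lie in the same subsystem go into that subsystem's
    unit. Edges that cross subsystems are returned separately, since they
    are what links the separately rendered units together.
    """
    units = defaultdict(list)
    links = []
    for cause, effect in edges:
        if subsystem_of[cause] == subsystem_of[effect]:
            units[subsystem_of[cause]].append((cause, effect))
        else:
            links.append((cause, effect))
    return dict(units), links


# Hypothetical example with two subsystems, "A" and "B".
SUBSYSTEM_OF = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
EDGES = [("a1", "a2"), ("b1", "b2"), ("a2", "b2")]
UNITS, LINKS = factor_by_subsystem(EDGES, SUBSYSTEM_OF)
```

Here each unit could be rendered on its own page, with the single cross-subsystem edge (`a2` to `b2`) shown on an overview page.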
Some representational difficulties may be caused by orthogonal features of causal explanations, such as identifying causal factor relations on the one hand, and classifying them on the other (say, according to latency/immediacy features; according to time of occurrence, which may well involve long intervals or recurrency for latent factors; or according to some taxonomy of human error or of organisational behavior). Attempts to represent all these features in one unit may lead to visual clutter and thereby to cognitive complexity. Some form of factoring is required in such cases.
Ideally, a graphical representation could also display the history of the analysis. For example, the method SOL, with its tools (contact Dr. Babette Fahlbruch), infers the existence of causal factors in the structure of the organisations involved in an incident (called "indirect causes") from the ostensive facts concerning the progression of the incident (a "situational description") using a checklist/questionnaire style approach based on phenomenological and structural taxonomies, designed to control for the "heuristics" which lead to bias in using such "checklist" approaches to causal information gathering. (SOL is to our knowledge unique among causal analysis methods in attempting to control for heuristics.) Not all heuristics are known, and not all known heuristics have accepted controls. It might be judged that a visual rendering of the history of the analysis would allow one more easily to identify and control for heuristics whose features may not have been fully accounted for in the version of the method being used. (Claire Blackett email@example.com)
How modular is the method, and in what ways is it modular? Can its modularity be made to mirror the organisational division of labour and domain expertise in the user group? (Peter Ladkin, I Made Wiryana firstname.lastname@example.org email@example.com)
Are the results of the method reproducible? Do different people using it independently obtain similar results for the same tasks? (Fergus Toolan firstname.lastname@example.org)
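One way to put a number on "similar results" is to compare the sets of causal factors that independent analysts identify. The sketch below uses the Jaccard index, averaged over all pairs of analysts. This measure is our assumption for illustration; the criterion itself does not prescribe any particular measure, and the function names and factor labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections of factors: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty analyses agree trivially
    return len(a & b) / len(a | b)


def mean_pairwise_agreement(results):
    """Average Jaccard index over every pair of analysts' factor sets."""
    pairs = list(combinations(results, 2))
    if not pairs:
        raise ValueError("need results from at least two analysts")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical example: analyst 2 omits one factor that analyst 1 lists.
ANALYST_1 = {"planning", "oversight", "skill", "execution", "ground layout",
             "aircraft characteristics"}
ANALYST_2 = ANALYST_1 - {"aircraft characteristics"}
AGREEMENT = jaccard(ANALYST_1, ANALYST_2)  # 5 shared factors out of 6 in total
```

A value near 1 indicates that the analysts found largely the same factors. The same comparison could be applied to sets of (cause, effect) pairs if agreement on the causal relations, and not only on the factors, is of interest.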
Do there exist reasonable, quick plausibility checks on the results obtained which are independent of the tool? What ways are there of checking the "correctness" of the results? (Luke Emmet email@example.com)
How rigorous is the method? Rigor has two relevant aspects:
Does the method provide guidance on identifying additional causes which have not been identified on initial investigation? (Oliver Lemke firstname.lastname@example.org)
What is the improvement factor, in terms of quality of analysis and expressiveness, from what could be done using other methods, or using a "naive" approach? There are at least two aspects to judging the improvement factor:
How well does the method mesh with the "standard" methods already in use in one's organisation? How well does the method mesh with industry "best practice" to date? (Peter Ladkin email@example.com)
How adaptable is the method to individual requirements? How well does the method accommodate changes (usually called "improvements") in subdomain characterisation? That is expressed somewhat abstractly. More concretely, by example: suppose someone comes up with a new taxonomy for management factors, say reengineers the business, or implements Professor X's taxonomy for human factors in engineering processes. How easily can the method accommodate this new taxonomy? (Peter Ladkin firstname.lastname@example.org)
Webster's dictionary explains coverage as "the extent or degree to which something is observed, analyzed and reported." In the case of root-cause-analysis the focus of the analysis is the detection of error/fault sources which have contributed to the accident under consideration. The term "coverage" describes in this context the extent to which the analysis method used can provide this service. Two situations can be distinguished:
The following questions arise in connection with determining coverage:
Does the method provide support for different viewpoints, for example
How available, and of what quality, is documentation concerning the method? (Jens Braband email@example.com)