Abstract: I consider a classification of complex-system components into digital, mechanical, electrical, wetware and procedural systems, which are not mutually exclusive and have themselves subcategories. Other schemes are also possible, such as one based on the different mathematical and other intellectual tools required. Subsystems do not generally form a static hierarchy, and may be identified according to causality, provenance, or habit. Whether failures are classified as component failures or interaction failures depends on the subsystem identification chosen. There may well be no indivisible, "unit" components of a system. Features which are errors should be distinguished from features which are infelicities. I classify "computer-related factors", discuss some human factor classification schemes, and mention mode confusion as an error factor which has gained prominence with the increased use of digital avionics systems. All this is illustrated and argued by means of aviation accident examples.
A motto amongst investigators says that every accident is different; but similarities have to be found if we are to have any hope of learning lessons to apply to the future. Detecting similarities means classification. Classification of causal factors is useful when different classes correspond to different techniques for avoiding or mitigating the influence of similar factors in the future, and identifying factors in the same class leads to similar prophylactic measures.
There are a number of independent attributes along which factors can be classified:
The US National Transportation Safety Board concluded on March 24, 1999, that all three incidents were most likely caused by "rudder reversal", that is, movement of the rudder in a direction opposite to that commanded by the pilots (or autoflight system, should that be engaged). They further determined that this rudder reversal was caused by a jam in the secondary slide hydraulic valve in the system's main power control unit, leading to an overtravel of the primary valve in the unit (AvW.99.03.29, NTSB.AAR.99.01).
There is still debate in the industry as to whether there was enough evidence to draw such a definitive conclusion about the cause. It is uncontentious, however, that failure modes of the rudder control system were identified as a result of the investigation. These failure modes are being addressed, and will be avoided or mitigated when newly redesigned units are retrofitted to the B737 fleet.
This example shows that identifying the bits is important and sometimes not very easy, but can be fruitful even when the identification is not conclusive.
It is important to realise that the classification into "bits", system components, is not a given. See Section Components, Subcomponents and Interactions below. We shall use the terms "component", "subsystem" and "part" interchangeably for subsystems of more inclusive systems.
The Nature of Components
To enable discussion of classification systems, I propose a partial list of the types of components involved in a complex system such as a commercial aircraft. This list will be called (CompList):
Even should this argument not be accepted,
in conformance with the `traditional' view of systems,
specification failures will still occur of course (classification cannot
change facts quite like that) and will
be classified under `development stage' failure (See Section
Development Stages, below)
rather than under system component failure.
Categories Are Not Mutually Exclusive
These categories are not mutually exclusive. For example, a flight management computer (FMC) is digital (hardware and software) as well as electrical (aviation digital hardware is at the moment all electrical, although optical hardware is coming). However, there are also mechanical, optical, biological and perhaps quantum digital computers, so a digital system is not necessarily electrical. When a suspected failure is localised to an electrical digital system, one must investigate not only the digital aspects (processor halt?) but also the electrical aspects (fried components?) as well as the structural aspects (physically broken component?). The same holds, mutatis mutandis, for the other sorts of digital system.
One should also note that a digital computational system has much in common with an analog computational system. For example, the phenomenon which caused the X-31 crash at NASA Dryden (Edwards AFB, California) had also occurred with F-4 aircraft, which have an analog flight control system, according to Mary Shafer, a flight engineer at Dryden (Sha96).
Further, an FMC contains procedural components (it implements navigation charts as well as digital databases of navigation aids and economic calculations). And of course the aircraft taken as a whole contains subsystems of all these different types. So there are many examples of aircraft subsystems which are multiply classifiable within (CompList).
It has been argued that the Flight Operations Manual should (ideally) constitute an (incomplete) specification of the aircraft system (Lad95). If this is the case, and the (complete, or at least more complete) specification is also a system component, then the Operations Manual will be a subcomponent of the specification component, as well as being a subcomponent of the procedures component.
Also, if the (overall) system specification is indeed a system component, then individual subsystem specifications will be not only components of that subsystem, but also components of the specification component.
Would the fact that these overlaps are created by including specification as an actual component of a system count against making this move, even if justified by the arguments of (Lad96.03)? I do not believe so. There is, as far as I see, no reason to expect a priori that any classification of system components will produce a perfect hierarchical partition of components, such that no component overlaps any other, and no parts of the system are not contained in some collection of components.
Here, I have used the words perfect, hierarchical and partition. I should define what I mean. The definitions are somewhat technical. Perfect means that all system parts have types which are contained in some collection of component types as enumerated; hierarchy means that the type classification forms a tree under the "subcomponent type" relation; and partition means that no component fits under two or more non-comparable subtrees.
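These three definitions can be made concrete in a few lines of Python. The following sketch is illustrative only: the toy parent relation, the basic-type names, and the function names are all invented. It does, however, exhibit the point of the surrounding discussion: a scheme in which an FMC is both digital and electrical can be a hierarchy, and perfect, while failing to be a partition.

```python
# Toy check of "perfect", "hierarchy" and "partition" for a
# classification given as a child -> parent mapping over types.
# All names here are hypothetical.

def ancestors(parent, t):
    """All types on the path from t up to its root, inclusive."""
    out = {t}
    while t in parent:
        t = parent[t]
        out.add(t)
    return out

def is_hierarchy(parent):
    """Hierarchy: the types form a single tree under the
    subcomponent-type relation (one root, no cycles)."""
    roots = set()
    for t in parent:
        seen = set()
        while t in parent:
            if t in seen:
                return False          # a cycle: not a tree
            seen.add(t)
            t = parent[t]
        roots.add(t)
    return len(roots) == 1

def is_perfect(parent, typing):
    """Perfect: every part's types appear somewhere in the scheme."""
    known = set(parent) | set(parent.values())
    return all(t in known for ts in typing.values() for t in ts)

def is_partition(parent, typing):
    """Partition: no component fits under two non-comparable subtrees,
    i.e. any two types of one component lie on one root-to-leaf path."""
    def comparable(a, b):
        return a in ancestors(parent, b) or b in ancestors(parent, a)
    return all(comparable(a, b)
               for ts in typing.values() for a in ts for b in ts)

# "digital" and "electrical" are sibling types; an FMC carries both.
parent = {"digital": "component", "electrical": "component",
          "mechanical": "component"}
typing = {"FMC": {"digital", "electrical"}, "rudder PCU": {"mechanical"}}

print(is_hierarchy(parent), is_perfect(parent, typing),
      is_partition(parent, typing))   # True True False
```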
The inclination towards perfect hierarchical partitioning may follow from reasoning such as this. Classifications may be thought of by some as a product of intellectual exercise (which they certainly are) but thereby entirely a product of our intellect (which they cannot be, if one is in any way a realist, since they describe parts of `the world'). It may be thought further that we may impose any conditions we like upon something that is a pure invention of our minds, and that if it should fail to satisfy a condition, that shows a mental weakness rather than any objective feature of a system; therefore we may require a perfect hierarchical partitioning if we should so wish, and criticise an attempt at classification which does not satisfy it. However, contained in such a line of reasoning is an assumption which removes the major motivation for classifying, namely, that classification is a pure invention of the mind. I wish to develop a classification in order better to understand the world; we should not then be surprised if it should prove hard or impossible to construct a useful classification conforming to certain self-imposed properties. Determining what is "useful" probably requires criteria such as precision rather than criteria such as intellectual neatness, and one has no a priori reason to suppose those criteria should conform to a perfect hierarchical partition. It would be a nice theorem to show, could one do it, but a theorem nonetheless.
To say this is not to come into conflict with the argument that classification systems are a human construct, that is, that one makes choices as to how to build a classification. Even though I may construct a classification, there are objective constraints on the choices I may make. For example, one may have many choices in picking a pair of positive whole numbers whose sum totals 30; 1 and 29, 5 and 25, and so on; that we may make such a choice freely, however, does not affect the theorem that these pairs all add up to 30, and that adding them to 70 will result in 100. These are facts independent of our choice. Similarly, it may be a fact, for all we know, that any "reasonable" choice of classification system for complex systems will inevitably contain multiple cases of overlapping classifications.
So any "bit", any identifiable subsystem,
of the aircraft may well have a multiple classification.
But how do we obtain this classification? And what "bits" are there?
Obtaining a Substantive Classification
It is reasonable to ask after the principles according to which
the content of a
classification scheme is derived. In particular, since the
component classification proposed above is not mutually exclusive,
one could wonder why digital systems are distinguished from electrical
systems, when at the time of writing in commercial aviation
they are in fact all electrical?
Before considering formal features of a classification system, it is well
to ask where the classifications that we use now have come from.
We distinguish digital systems from their substrate (electrical,
optical, formerly mechanical, whatever) because they have certain
features in common, and we have developed techniques for handling these
features effectively, semi-independently of the substrate. In other words,
the notion of digital system can be derived
through an abstraction process.
But is this how it really came about? No, of course not. Babbage designed
(though never completely built) a machine to perform automatic calculations
of a very specific sort. Eckert and Mauchly built a machine to perform
calculations electronically of a different sort. John von Neumann
popularised the idea of stored-program computers. And of course Alan Turing
had the idea of an abstract computation engine before. And so on.
In other words, there are some good theoretical reasons for the
distinction (abstraction) and some social reasons (evolution of computing
through engineering and mathematical discovery).
Furthermore, when one is interested in classification for a specific purpose, such as failure and accident analysis, then there may well be particular common features of failures which do not occur during normal operation.
Accordingly I shall consider three sorts of classification:
Behind any of the categories proposed in, say, (CompList) lies an accumulated body of engineering expertise. Such expertise has often been built up through years of academic and industrial experience and investigation concentrating on certain phenomena. Correlated with this are conferences, information exchange, and other cultural structures which ensure that information concerning this particular domain is very well exchanged amongst the participants. Let us call a domain with this feature a social domain. In contrast, information flow across social domain boundaries is comparatively, and notoriously, thin. For example, people working in general system reliability seem to be fond of saying that software has a reliability of either 1 or 0, because any fault has been there since installation and is latent until triggered. Software reliability people and computer scientists who deal with software and its unreliability, on the other hand, balk at this statement; many cannot believe their ears upon hearing it for the first time. These groups belong to different social domains - unfortunately, for they are attempting to handle the same system component.
Social domains can evolve partly for social reasons, partly for intellectual reasons concerning the subject matter. I use the term classification domain for a purely intellectual classification scheme, one based on formal properties of the domain itself. Feature domains are those based on specific features of the goal of the classification. When the goal is failure and accident analysis, there may be specific features of failures which are not common features of normal behavior. A failure analysis will want to pay attention to these features, though those studying normal operation of the system may not. An example of feature classification for aircraft accident analysis follows at the end of this section; also, the ICAO section headings for the factual information in a commercial aircraft accident report are discussed in Other Classifications: The International Civil Aviation Organisation below.
When devising a classification scheme, social domains form practical constraints. One cannot form new social domains out of thin air. It would make sense to try to accommodate social domains within a classification, rather than constructing classification domains which bring together parts of different social domains. If one were to pursue the latter strategy, it could lead to a situation in which some parts of a classification domain would have much greater and more fruitful interaction with parts of different classification domains (their partners in the social domain to which they belong) and relatively sparse interaction with certain other members of their own classification domain (those which belong to different social domains). Such a situation would not render the classification scheme particularly useful to practitioners, each of whom belongs to a social domain.
On the other hand, it could be argued that a good classification scheme has logic and reasoning behind it, and social domains, even though encouraging information exchange, would be relatively handicapped and ineffective in so far as they do not cohere with or conform to the logic behind a classification. This is an incontestable point. However, to recognise the point in practice would entail that so-called "interdisciplinary" social interactions (conferences, journals, etc.) conforming to the classification domains be initiated and continued until a social domain has been built up which conforms to the classification domain - supposing that one were possible; it might also be the case that the classification simply cannot conform to a social domain, for example if it required more breadth of knowledge than all but a handful of exceptional people possess. This of course happens, will continue to happen, and should be encouraged.
Another argument for making classification domains conform to existing social domains would be that the two factors described above have been permanently active in the past, and that therefore the social domains have indeed grown to conform more or less with a reasonable set of classification domains. Call this the argument from evolution, if one will.
By presenting these considerations, I am not proposing to draw deep conclusions. I am merely pointing out mechanisms at work in the choice of classifications which will help us deal with complex systems. It seems to me that the social mechanisms discussed have strength and relative influence which we do not know, and therefore we cannot effectively adjudicate which of the various situations described we are in. Although one should acknowledge the difference between social domains and classification domains, and not necessarily assume that a social domain makes a good classification scheme, these considerations do justify basing a classification scheme in large part on social domains, which the reader will observe is what I have done.
As an alternative to (CompList), one could consider basing a classification scheme mainly on the types of intellectual equipment one uses in handling design and implementation issues. For example
In short, social domains may be more or less arbitrary; more because social evolution can be based on happenstance, and less because of evolutionary pressure bringing social domains closer to some reasonable classification domain. However, there is practical justification for basing a classification scheme on social domains, even while acknowledging some degree of arbitrariness in the scheme.
Feature domains occur when certain aspects of failures in a domain
recur. For example, aviation accidents often involve fire, whose
accompanying poisonous smoke asphyxiates people; and survival aspects
of accidents include the availability and response of emergency services,
the accessibility of aircraft exits under emergency conditions, the
level of protection from smoke, the level of suppression of fire, and
the level of protection from trauma, amongst other things. Since
survival is regarded by many as one of the most important aspects of aircraft
accidents, the survivability of the accident is assessed, along with
the features that contributed positively and negatively to survival.
These categories do not currently occur within a system classification,
although there does not seem to be a reason from logic why not. For
example, if survivability and fire resistance criteria
were built in to a system specification, then investigating these
aspects would contribute to assessing whether there had been a
specification or design failure of these system components. Because this
is not yet generally the case, these aspects are properly classed for now
as a feature domain. There are certain exceptions to this; the lengthy
discussion about the overwing exits during the certification in Europe of
the new generation B737 aircraft is an example.
Homogeneous and Heterogeneous (Sub)systems
Before considering what subcomponents of a system there can be, and how we identify them, I want to make a distinction between homogeneous systems or subsystems and heterogeneous systems or subsystems. The distinction is supposed to reflect systems which belong to one type, contrasted with systems which have components from many types. As a first cut, then, one could propose defining homogeneity as having components which all belong to just one of the basic types, and heterogeneity as having components which belong to more than one. However, this will not work, because a digital system has mechanical components (the box that contains it; the boards and the chips that are inside it) as well as being also electrical. And the software that it runs defines procedures.
One way of solving the definitional problem is to define the classification by fiat. Consider the following categories:
Given the general idea, above, that homogeneity and heterogeneity
refer to being of one or of many different types,
it follows that any definition of homogeneity and
heterogeneity must be given relative to a chosen classification scheme.
Change the scheme, and homogeneous components may become
heterogeneous according to the new scheme, and vice versa.
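This relativity can be shown in a minimal sketch, with invented component and scheme names: the same flight management computer comes out heterogeneous under a scheme distinguishing digital, electrical and procedural parts, but homogeneous under a coarser scheme that lumps them together.

```python
# Homogeneity relative to a classification scheme. The part names and
# both schemes are hypothetical illustrations.

def homogeneous(component_parts, scheme):
    """A component is homogeneous under a scheme if all its parts fall
    under a single type of that scheme."""
    types = {scheme[p] for p in component_parts}
    return len(types) == 1

fmc_parts = ["processor", "power_supply", "navigation_database"]

# Scheme A distinguishes digital from electrical from procedural parts.
scheme_a = {"processor": "digital",
            "power_supply": "electrical",
            "navigation_database": "procedural"}

# Scheme B lumps everything avionics-related into one type.
scheme_b = {"processor": "avionics",
            "power_supply": "avionics",
            "navigation_database": "avionics"}

print(homogeneous(fmc_parts, scheme_a))  # False: three types
print(homogeneous(fmc_parts, scheme_b))  # True: one type
```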
Soundness, Completeness and Mereological Sums
I have discussed social domains and mentioned feature domains and
classification domains, but I have not yet proposed or discussed
formal criteria such as could be required for a domain to be
considered a classification domain.
We can begin by considering the following three properties:
"Soundness" would mean something like:
How do we determine what the components of the spoiler subsystem are? And how do we check this determination for correctness? We need criteria. Let us consider it from the point of view of failure analysis. One goal in a failure analysis is to isolate a subsystem in which a failure has occurred. When there is a problem with deployment of airbrakes or spoilers, for example, one looks first at components of this subsystem. The logical condition for membership of this subsystem is thus:
There are (at least) two ways to regard such a condition. It could either be a theorem about a component classification, or it could be interpreted as a requirement to be fulfilled by any satisfactory component classification. Following the second option, (FailCond) could be used to determine what the components are of the spoiler subsystem. However, there is a potential problem: it looks as though the phrase the failure of the specification could be a catch-all, classifying anything that did not fit into a particular proposed component structure, no matter how unsatisfactory or incomplete this component structure may be. For example, suppose for some odd reason one failed to consider the SECs to be part of the spoiler subsystem of the A320 (even though the name SEC, for Spoiler-Elevator Computer, gives one a hint that it is indeed a part of that subsystem). Let us call this smaller subsystem without the SECs the "spoiler subsystem", and the complete thing simply the spoiler subsystem (without quotes or italics). Suppose further that there were to be a software error in the SECs (the same error in all three), and that this error caused spoiler deployment to fail in some particular circumstance. Using (FailCond) and the definition of "spoiler subsystem", one would conclude that there had been a "failure of the specification". That's the "catch-all" at work, and it seems to fail us, because we would surely prefer to be able to conclude rather that the spoiler subsystem as conceived was incomplete, and should include the SECs; and that there had been a failure of the SECs due to software error (that is, after all, the way I described the example).
However, let us look at the reasoning a little further. The "spoiler subsystem" itself didn't fail, so by (FailCond) there had been a failure of the specification. The specification, however, is definable independently of (FailCond), namely, by describing the behavior of the spoilers (the physical hardware) in a variety of different flight regimes, including the problem case in which the spoilers (let us say) failed to deploy when the specification says they should have. So the spoiler behavior did not fulfil this specification.
A case in which a system did not fulfil a specification is not a specification error, but the definition of a design or implementation error (see later for a classification of these error types). A specification error would be indicated when the spoilers in fact fulfilled the specification, but something nevertheless was incorrect or inappropriate (see Section Development Stages below). This case does not fulfil that condition. Therefore we conclude there was not a specification error. According to (FailCond), then, some other system component must have suffered failure. But none of the ones described in "spoiler subsystem" did. If we accept (FailCond), there is left only one conclusion that we may draw, namely that "spoiler subsystem" is incomplete - there is some component of the spoiler subsystem that in fact failed (because there was no specification failure), and that part is not part of the "spoiler subsystem". (By the way, the notion of fulfilment of a specification by a design or implementation is completely rigorous, but here is not the place to describe it.)
I conclude that taking (FailCond) as a definition of what constitutes being a subcomponent of a system component will enable substantive inferences concerning system components and does not lead, as initially feared, to wordplay concerning what is a component failure and what a specification failure.
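The inference just rehearsed can be sketched as a small decision procedure. The reading of (FailCond) used here, the branch labels, and all part names are my own simplification of the argument, not the formal condition itself.

```python
# Sketch of the (FailCond) inference pattern, read as: a system failure
# is a failure of some enumerated component or of the specification.
# All names are illustrative.

def diagnose(enumerated_parts, failed_parts, spec_fulfilled):
    """Classify a system failure relative to a proposed parts list."""
    if spec_fulfilled:
        # Behaviour was as specified, yet something was wrong:
        return "specification error"
    if enumerated_parts & failed_parts:
        return "component failure"
    # The specification was not fulfilled, so some component failed;
    # but no enumerated component did. By (FailCond) the parts list
    # must be incomplete.
    return "enumeration incomplete"

# The "spoiler subsystem" without the SECs, facing a SEC software bug:
quoted = {"spoilers", "hydraulics", "deploy logic"}
print(diagnose(quoted, {"SEC software"}, spec_fulfilled=False))
# -> enumeration incomplete

# With the SECs included, the same failure classifies correctly:
full = quoted | {"SEC software"}
print(diagnose(full, {"SEC software"}, spec_fulfilled=False))
# -> component failure
```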
If most subsystems are identified similarly to the spoiler subsystem, then we could expect that system components can include digital parts (the SECs); mechanical and hydraulic parts (spoiler hardware itself; the three hydraulic systems); design parts (the logic connecting WoW condition and spoiler deployment) and specification parts (the rules for spoiler deployment). It is a mélange of component types as enumerated above. The spoiler subsystem itself thus belongs to no single basic type - it is heterogeneous. We took a function as a specification, and one might imagine that most subsystems identified in this manner will be heterogeneous when the function is fairly high-level (control of one of the three axes of motion) and the system itself (the aircraft) is heterogeneous.
What type does the spoiler subsystem in fact have? It is reasonable to suppose that meaningful system components, such as the spoiler subsystem, have as component types all the component types of their subcomponents. Let us call any specific collection of component types the sum of those types. Then we may formulate the principle:
There is an important formal distinction to be noted here. I spoke of a "collection" of types. One may be tempted to think that I meant set of types. This would be a mistake. Consider a system with subsystems, all of which have themselves subsubsystems, and let us assume that each subsubsystem has a basic type. Each subsystem then has as type the collection of (basic) types of its individual subsubsystems. The system has as type the collection of types of its subsystems, which themselves have type that is a collection. So the system has type which is a collection of collections of basic types, which is itself a collection of basic types. However, if "collection" meant "set", then the type of a subsystem would be a set of types of its subsubsystems; that is, a set of basic types. The type of the system would be a set of types of its subsystems, that is, a set of sets of basic types. But it is elementary set theory that a set of sets of basic types is not a set of basic types (unless basic types are special sets called "non-well-founded sets"). Nevertheless, a collection of collections of basic types is a collection of basic types. So collections, in the sense in which I am using the term, are not sets. (This is why the terminology "sum" is appropriate: sums of sums are sums.)
There are certain types of pure sets, called transitive sets (technically: sets which are identical with their unions), which do satisfy the condition we're looking for on collections. However, we can't make that work so easily because prima facie we don't have pure sets (sets whose only members are other sets), since we have basic types. We may be able to find some mathematical encoding that would enable us to identify basic types with pure sets. But why bother? We can sum collections of basic types simply, so there is no need for a translation into the right kind of set theory.
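The collection/set distinction can be made concrete in a few lines of Python (the basic-type names are illustrative): modelling the sum as a flattening union preserves "collection of basic types", whereas modelling it naively as a set of sets does not.

```python
# Summing collections of basic types flattens (sums of sums are sums);
# nesting sets does not. Type names are hypothetical.

def type_sum(*collections):
    """The sum of collections of basic types: again a flat collection
    of basic types."""
    out = set()
    for c in collections:
        out |= c          # flatten by union
    return out

digital = {"digital", "electrical"}
hydromech = {"mechanical", "hydraulic"}

# Summing subsystem types gives a flat collection of basic types:
system_type = type_sum(digital, hydromech)
print(system_type == {"digital", "electrical", "mechanical", "hydraulic"})

# A set of sets of basic types is NOT a set of basic types:
nested = {frozenset(digital), frozenset(hydromech)}
print("digital" in nested)        # False: its members are sets
print("digital" in system_type)   # True
```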
So important system components and subsystems can be classified, not within a single component type, but as a collection, the sum, of types. Along with this formulation of the type of a component comes the standard notion of a component being constituted by all its parts. We speak of a component being the mereological sum of its component parts. This means that when I put all the parts "together", I have the entire component; equivalently, "putting the parts together" is taking the mereological sum.
Components can be the mereological sum of many different collections of parts. A cake is the mereological sum of its two halves; or of its four quarters; and of its eight eighths also. The use of the notion of mereological sum comes when one lists parts or subcomponents; when one "puts them together", then one can observe if one has the entire system or not. If not, it means a part has not been enumerated, and one can identify this part by seeing what is missing from the mereological sum of the subcomponents one had listed.
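This completeness check can be sketched directly: "putting the parts together" is taking the union, and a missing part shows up as the difference from the whole. The cake-quarter names below are hypothetical.

```python
# Mereological sum as union; a missing part is the set difference.

def mereological_sum(parts):
    out = set()
    for p in parts:
        out |= p
    return out

whole_cake = {"NW", "NE", "SW", "SE"}            # four quarters
halves = [{"NW", "NE"}, {"SW", "SE"}]
quarters = [{"NW"}, {"NE"}, {"SW"}, {"SE"}]

# Different decompositions, one and the same sum:
print(mereological_sum(halves) == mereological_sum(quarters) == whole_cake)

# An incomplete enumeration: what is missing is the difference.
listed = [{"NW", "NE"}, {"SW"}]
print(whole_cake - mereological_sum(listed))     # {'SE'}
```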
Consider the following example, due to Leslie Lamport. One has two boxes with input and output wires, each of which outputs a zero if there is a zero on its input; otherwise outputs nothing. Connect these two boxes; the input of one to the output of the other. How do they behave? It is easy to show that either they both input and output nothing; or they both input and output zero. But can one decide whether they do the one or the other? There is nothing in the specification which allows one to: the behavior is not determined. One can think of the situation this way: the system formed by the two boxes contains an identifiable third component, which one may call the interface. The interface behavior consists of either two zeros being passed; or nothing. In order to reason about the subsystem formed by the two boxes connected in this way, it suffices to reason about the behavior of the two boxes (namely, their I/O specification as above), and the interconnection architecture, and thereby derive the constraint on the behavior of the interface. Once one knows these three things, one pretty much knows all there is to know about the system. Can one conclude that the system is the mereological sum of the boxes plus the interface?
Before doing so, one should ask whether the system could be the mereological sum of the boxes alone. The answer is: it cannot. Here is the proof. It is crucial to the system that the output wire of each box be connected to the input wire of the other. Just putting the boxes together, "next to each other" if you like, does not accomplish this, and the corresponding constraint on the interface would not be there. To see this, note that one could build a connected system also by connecting the two input wires to each other, and the two output wires to each other. This would form a system in which the behavior is determined - the system does nothing (neither input can receive, and therefore no output is generated by either). Clearly this is a different system from the first; it has different behavior, and a different configuration. Any argument that concluded that the former system were the mereological sum of the two boxes alone would also suffice to show that the latter would be the mereological sum of the same two boxes; but that cannot be because then the two different systems would be identical, and that they most certainly are not. QED.
We may conclude that the system consists of the two boxes plus the interface. That is, the system is the mereological sum of the two boxes plus the interface. Conversely, we would like to be able to conclude according to (FailCond) that any failure of the system to behave as specified is a failure in one of the boxes or a failure of the interface, and indeed this seems appropriate. Thus may we apply the logical condition above as a test to see whether our list of system parts enumerates a complete set of system components; namely, a list of which the system is the mereological sum.
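Lamport's example can also be checked mechanically: encode the box specification, then enumerate which value assignments to the two wires are consistent with the feedback wiring. This brute-force sketch (names invented) confirms that exactly the two behaviours described survive.

```python
# Lamport's two-box example: each box outputs 0 iff there is a 0 on
# its input, otherwise nothing (modelled as None). Enumerate the
# steady-state behaviours consistent with the feedback connection.

def box(inp):
    """Outputs 0 if there is a 0 on the input; otherwise nothing."""
    return 0 if inp == 0 else None

consistent = []
for a in (0, None):          # candidate value on wire A (out of box 1)
    for b in (0, None):      # candidate value on wire B (out of box 2)
        # Feedback wiring: box 1 reads wire B, box 2 reads wire A.
        if box(b) == a and box(a) == b:
            consistent.append((a, b))

print(consistent)   # [(0, 0), (None, None)]: both pass 0, or nothing
```

Neither solution is preferred by the specification alone, which is exactly the underdetermination the interface component is introduced to express.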
(CompList) is a list of basic types, of which the type of any subsystem is a sum. We may now enumerate the last principle of classification, a form of completeness:
Whether this is true of (CompList) remains to be proved; it is certainly not a given. But I propose that in so far as (CompList) failed to satisfy it, (CompList) would be defective.
Finally, building on this discussion, we may make a simple argument why the specification of a (sub)system is a component of the system. If the specification were to be considered a component, the logical condition for componenthood could be simplified to read:
We now have formal criteria against which to test proposed
classification domains. We have started to see how a satisfactory
component type classification can aid in failure analysis of systems.
But we must also remind ourselves now how complex this nevertheless remains.
Components, Subcomponents and Interactions
Components must work with each other to fulfil joint purposes. A
flight control computer must receive inputs from pilot controls, give
feedback to the instrument displays, and send signals to the aircraft
control actuators. These interactions can sometimes go quite wrong.
Componenthood Does Not Form a Static Hierarchy
One might think that, logically, one can consider a complex system as a static hierarchy of ever more complex components. But consider the following example.
An Airbus A320 has 7 flight control computers of three different sorts: two Elevator Aileron Computers, ELAC 1 and 2; three Spoiler Elevator Computers, SEC 1, 2 and 3; and two Flight Augmentation Computers, FAC 1 and 2. Multiple computers control each flight control surface. So, for example, if one of the ELAC computers that controls the ailerons and elevators completely fails, that is, just stops working, then the ailerons and elevators will be controlled by the other ELAC computer via a completely different hydraulic system. Supposing both ELACs fail, then one obtains roll control (for which the ailerons are used) via the spoilers, controlled by the 3 SEC computers, and the elevators are controlled by SECs 1 and 2.
The electrical flight control system (EFCS) is a static component (in the usual engineering sense) of the flight control system as a whole (FCS): the FCS includes the hardware, namely the control surfaces, that controls roll, pitch and yaw of the aircraft. The FCS can be specified as follows: the subsystem which controls roll, pitch and yaw position and movement, according to certain rules (control laws). A double ELAC failure does not result in a failure of the FCS (that is, a failure of roll control, pitch control or yaw control). However, it is a failure mode of the EFCS, since the ailerons are controlled in normal operation but are no longer controlled after a double ELAC failure. This failure mode results in reduced (but not eliminated) EFCS function. The ELAC itself is a part of the EFCS, and when it stops working, it has no functionality any more; its functionality is eliminated. The ELAC itself has both hardware and software components, and the total failure of the ELAC could be caused by a failure of either hardware or software, or both, or by a design failure in which hardware and software functioned as designed, but the condition (state or sequence of behavior) in which they found themselves was non-functional and not foreseen by the designer.
But what exactly is the component describable as
the elevator control system?
Again, this may be specified as: the system which provides suitable
input to elevator actuators to affect control of the elevators,
in order partially to control pitch position and movement.
Well, if everything is
in order, it has as subcomponents the ELACs 1 and 2, the SECs 1 and
2 and the blue and green hydraulic systems, which are subcomponents
it shares in part with the ailerons. If the blue hydraulic
system fails, it has ELAC 2 and SEC 2 and the green system, and
the ailerons have ELAC 2 and green. If
ELAC 1 fails physically, it has ELAC 2, SECs 1 and 2, and the
blue and green hydraulic systems, and the ailerons have ELAC 2
and green. If ELAC 1 fails in its aileron-control software only,
then elevators have ELACs 1 and 2 and SECs 1 and 2 and blue and
green. So the elevator control system has different components
at different times and under different failure modes of other
components of the system. It's a dynamic thing, formed out of a
changeable configuration of static hardware and software.
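The shifting membership just described can be sketched as a function from failure state to active components. This is purely an illustration of the prose: the computer-to-hydraulics assignment below is inferred from the failure cases given in the text, not taken from Airbus documentation.

```python
# Illustrative sketch of the dynamic membership of the elevator
# control system. The hydraulic assignment is inferred from the
# failure cases in the text and is hypothetical.
HYDRAULICS_OF = {"ELAC1": "blue", "ELAC2": "green",
                 "SEC1": "blue", "SEC2": "green"}

def elevator_control_components(failed):
    """Return the subcomponents currently serving elevator control,
    given the set of completely failed components. A computer failed
    only in its aileron-control software is not passed in `failed`,
    since it still serves the elevators."""
    active_hyd = {h for h in ("blue", "green") if h not in failed}
    active_computers = {c for c, h in HYDRAULICS_OF.items()
                        if c not in failed and h in active_hyd}
    return active_computers | active_hyd

# Everything in order: ELACs 1 and 2, SECs 1 and 2, blue and green.
assert elevator_control_components(set()) == {
    "ELAC1", "ELAC2", "SEC1", "SEC2", "blue", "green"}
# Blue hydraulic system fails: ELAC 2, SEC 2 and the green system.
assert elevator_control_components({"blue"}) == {"ELAC2", "SEC2", "green"}
# ELAC 1 fails physically: ELAC 2, SECs 1 and 2, blue and green.
assert elevator_control_components({"ELAC1"}) == {
    "ELAC2", "SEC1", "SEC2", "blue", "green"}
```

The point of the sketch is that the function's output, the component set, varies with the failure state: componenthood here is a value computed over time, not a static list.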
Identifying subsystems through causal chains can be
intellectually a lot more complicated than it appears at first sight.
It seems it would be inappropriate to classify subsystems
both hierarchically and statically.
Failures Which Are Not Failures of Static Components
One could think of the EFCS of the A320 (as, say, it is described in Spitzer, op. cit.) as consisting in 7 computers which partially interact. Suppose there is a failure in the communication channel connecting ELAC 1 and ELAC 2. Suppose ELAC 1 is waiting for a message from ELAC 2, and ELAC 2 is waiting for a message from ELAC 1. Since they're both waiting, neither message is sent, and they will carry on waiting for ever (this is called a deadlock). Then this is a failure of a component of the EFCS, namely the ELAC component, but is not an ELAC 1 or ELAC 2 failure per se. We classify it as an interaction failure. Where does the notion of interaction failure come from?
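The deadlock pattern just described can be shown concretely. The following is a toy illustration, not avionics code: each "computer" waits for the other's message before sending its own, so neither message is ever sent; a timeout stands in for "waiting for ever" so that the example terminates.

```python
# Toy illustration of the mutual-wait deadlock described in the text.
import queue
import threading

def unit(my_inbox, peer_inbox, name, outcome):
    try:
        my_inbox.get(timeout=0.2)               # wait for the peer first...
        peer_inbox.put("hello from " + name)    # ...never reached
        outcome[name] = "ok"
    except queue.Empty:
        outcome[name] = "deadlocked"

box1, box2, outcome = queue.Queue(), queue.Queue(), {}
t1 = threading.Thread(target=unit, args=(box1, box2, "ELAC1", outcome))
t2 = threading.Thread(target=unit, args=(box2, box1, "ELAC2", outcome))
t1.start(); t2.start(); t1.join(); t2.join()

# Neither unit has failed individually, yet jointly they make no progress.
assert outcome == {"ELAC1": "deadlocked", "ELAC2": "deadlocked"}
```

Note that the `unit` function is, by itself, unobjectionable; the failure arises only from the two instances' interaction, which is exactly the point.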
Note that the idea of an interaction failure comes from identifying
all static components of a subsystem, none of which has individually failed,
and observing nevertheless that a failure of the subsystem has occurred.
It is thus dependent on the identification of static components, and
indeed on identifying the subsystem itself. Thus one classifies a
failure as an interaction failure based on the classification one
is using of system components. It could be that classifying the
components another way would lead one to reclassify the
particular failure so that it is no longer an interaction failure.
The concept of an interaction failure thus cannot be guaranteed to
persist between component classifications.
Are there any Hard-and-Fast Unit Components?
Despite the fluidity of subcomponent identification, one might expect to find some hard-and-fast items that turn out to be practically indivisible units no matter how one classifies. One might be tempted to consider a pilot, for example, as a unit:
One may reasonably remain agnostic about whether there are any
units which persist through all reasonable classification schemes.
Some Ways of Identifying Component Subsystems
We have allowed that a classification of components of a system
may be constructed intellectually however it is convenient to do so,
subject to certain general formal conditions.
But what exactly is convenient, and what not?
Suppose there was a failure to control altitude or glide path effectively on an approach, as in the case of the China Airlines Flight 676 accident to an Airbus A300 aircraft in Taipei, Taiwan on February 16, 1998. This could be deemed to be a failure of the altitude control system (ACS). Nobody may have specifically identified the ACS before this accident, for example, it may not have been an identified component during system build, but that is no reason not to select it for study when needed. The ACS could be specified as: that system which controls altitude and rate of change of altitude through pitch adjustments. In the Taipei accident, a particular phenomenon was observed: divergence from appropriate altitudes for that phase of flight. It may be presumed that the aircraft has a control system which controls altitude. The task would then be to identify this subsystem. Let us consider in what the ACS consists.
The ACS control loop passes from the actual altitude, through the altitude sensor mechanism (the pitot-static system, in this case the static part), through the electrical display systems which display the actual altitude and the target altitude to the pilot, through the pilot's eyes (part of his physiological system) and his optic nervous system to his brain, where it is cognitively processed, including attention, reasoning, decision about what to do, intention to do it, and the cognisant action which is taken on the control stick, which feeds back through the hydro-mechanical control system to the aircraft's pitch control surfaces. So the ACS has the pitot-static system, air data computers, cockpit display systems, the pilot's optic system, nervous system, cognitive capacity, muscle actuation mechanisms, control stick and elevator control mechanisms as components. It also includes (part of) the autopilot, which, when activated, feeds in elevator actuation commands and accompanying stabilator pitch ("pitch trim") according to certain rules. All in all, with digital, electrical, hydro-mechanical and human components, this aircraft subsystem is quite heterogeneous. It also has components, such as the autopilot connection, which are sometimes part of the ACS and sometimes not (depending on whether the autopilot is engaged or not). So the ACS is also not static.
What allowed us to consider this heterogeneous collection of subcomponents as a single system component is that they are involved in some causal chain towards achieving a condition (effective altitude) which we independently had concluded was or should have been a system goal. We can call this the causal-chain method of subsystem identification.
Another way of identifying subsystems is provenance: we get hardware from manufacturers, software from programmers, and pilots from their mums and dads via flight schools. Since errors can occur at each stage in system development, it makes sense to encapsulate the development in the organisation which performed it. (See Section Development Stages, below.)
Yet another way of identifying subsystems is habit. We have been used to considering aircraft and pilot as a single entity since the dawn of aviation, but we have only recently become sensitive to the potentially devastating causal effect of air-ground miscommunications, so we are unused to considering as an identifiable system subcomponent the pilots' vocal and hearing mechanisms, the hardware that interacts with these and sends electromagnetic signals through the ether to and from the hardware in air traffic control stations, and the equivalent physiological and cognitive subsystems of the controllers (as evidence for this lack of familiarity, just look how long it took me to describe that subsystem; we have no recognisable word for it yet).
One unfortunate consequence of identifying subsystems by habit is that
it usually comes along with habitual reasoning - reasoning
that may well be false.
For example, there is traditional
reasoning that when computer components fail, since computers have
hardware and software, it must either be the hardware or the software
which failed. As discussed in
this reasoning is faulty. When a computer fails, it can be
a hardware failure, a software failure, or a design/requirements failure.
Our habits of identifying the visibly causal chains of the working
of a computer system apparently let some of us overlook the invisible
causal chains, such as that from faulty design to faulty operation.
(Some engineers avoid this problem by identifying design/requirements
failures as software failures. But this classification has unfortunate
consequences, as discussed in (Lad96.03),
and therefore I reject it.)
Conclusions Concerning Component Identification
Subcomponent identification can be complicated to understand and analyse. But that is just the way things are. An accident analyst can choose to ignore certain components, for example, pilot subsystems, if his or her goal is to improve the other aircraft subsystems. Aviation medical specialists and human factors specialists are, to the contrary, very interested in the pilot subsystems, with a view to improving them, or improving their reliability.
And I'm content to live with the conclusion that the failure analysis of complex systems is difficult. That's partly how I make my living, after all.
Errors or infelicities can be introduced at any one of these stages.
The developmental stages can be put together with the component types.
One can thus talk about a software implementation error (which I
and most others call simply a "software error"), or a
software-hardware subsystem design error, or a software subsystem
requirements error, and so forth.
The notation may get a bit unwieldy, but at least it's accurate.
We can leave it to the reader to figure out a more felicitous notation.
Errors and Infelicities
Not every design or requirements factor contributing to an accident is the
result of an error. An error has unwanted consequences in every
occurrence. But there are other cases in which the consequences of
a factor can be positive in some cases and negative in others. We call factors of
this nature infelicities in accidents in which they played a negative role.
One such infelicity concerns a specific interlock which many aircraft have, to prevent the actual deployment of thrust reverse while the aircraft is still in flight. Sensors attached to the landing gear measure the compression of the gear, which corresponds to the weight of the aircraft on the wheels (when an aircraft's wings are producing lift, during takeoff and landing as well as flight, there will be less or no weight on the wheels). This is called "weight on wheels" (WoW). Thrust reverse actuation is inhibited when WoW is insufficient.
Such an interlock would have helped, it is assumed, in the case of the Lauda Air B767 accident in Thailand on May 26, 1991. The pilots noted that thrust reverse actuation was indicated on the instrument panel. The actuation is in part electrical, but there is a mechanical (hydraulic) interlock which prevents actual deployment of the reversers when the gear is raised. It is suspected that a failure of this interlock allowed the actuation, commanded through an electrical fault, actually to deploy one reverser, leading to departure from controlled flight and the crash. A WoW interlock is very simple, and is thought to add a layer of protection because it does not have the failure modes of the hydraulic interlock.
In fact, on the A320, the WoW thrust-reverser interlock is implemented in digital logic in the computers which control the reversers. On September 14, 1993, an A320 landing in Warsaw in bad weather did not have sufficient WoW to allow immediate deployment of braking systems, and it continued for 9 seconds after touchdown before reversers and speedbrakes deployed (and a further 4 seconds before wheel brakes deployed). The aircraft ran off the end of the runway, hit an earth bank infelicitously placed at the end of the runway, and burned up (most occupants survived with no or minor injuries).
The same WoW interlock design would have been felicitous in the case of Lauda Air and was infelicitous in the case of Warsaw.
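A minimal sketch of such a digital interlock predicate makes the felicity point concrete. The sensor names and the both-gear condition below are invented for illustration; they are not the actual Airbus or Boeing logic.

```python
# Hypothetical weight-on-wheels interlock predicate (illustrative
# thresholds and conditions, not actual avionics logic).
def reverser_deploy_permitted(left_gear_compressed, right_gear_compressed):
    """Permit thrust-reverser deployment only with weight on both
    main gear struts."""
    return left_gear_compressed and right_gear_compressed

# Lauda Air-style case (in flight): deployment correctly inhibited.
assert not reverser_deploy_permitted(False, False)
# Warsaw-style case: touched down, but with insufficient compression
# on one gear the braking aids remain inhibited - the same condition,
# felicitous in one accident and infelicitous in the other.
assert not reverser_deploy_permitted(True, False)
# Firm touchdown on both gear: deployment permitted.
assert reverser_deploy_permitted(True, True)
```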
Another example could be the automatic configuration of aircraft control surfaces during emergency manoeuvres. Near Cali, Colombia on 20 December 1995, American Airlines Flight 965 hit a mountain while engaged in a Ground Proximity Warning System (GPWS) escape manoeuvre. The GPWS senses rising ground in the vicinity of the aircraft relative to the motion of the aircraft, and verbally warns the pilots. It is calculated that one has about 12 seconds between warning and impact. While executing the escape, the pilots forgot to retract the speed brakes, consequently the aircraft's climb performance was not optimal. It has been mooted (by the U.S. National Transportation Safety Board amongst others) that had the aircraft been automatically reconfigured during the escape manoeuvre (as, for example, the Airbus A320 is), the aircraft might have avoided impact (NTSB.SR.A-96-90) . Of course, it is quite another question how the aircraft got there in the first place, but this is not my concern here. I should point out that the hypothesis that the Cali aircraft might have avoided the mountain with optimal climb performance at GPWS warning has also been doubted (Sim98).
Having such automatic systems as on the A320 could have been
felicitous for the Cali airplane, but would be infelicitous should
they fail to operate effectively, in a case in which there is no
effective manual backup. It is well-known that introducing extra
layers of defence also introduces the possibility of extra failure
modes through failure of the extra defensive systems.
What Are "Computer-Related Factors"?
Here is a collection of words one could use in "daily life"
(such as: talking to journalists) for computer-related factors,
based on (CondList) and the other considerations
discussed. Let me call it (ShortList).
My intent is to pick out prominent or commonly-occurring factors.
Notice that pilot-automation interaction infelicities (PAII) are behaviors, whereas latent errors such as those in software or design are persistent - they are part of the state of the system over a time period (one hopes, until the next software release), but manifest themselves through system behavior.
There are undoubtedly finer classifications to be sought, and one
can argue (as many colleagues have with each other, regularly) for the worth
or lack of worth of finer or coarser distinctions. So be it. It is a list
of distinctions which I have found most useful.
Classifying Specifically Human-Machine
Reason classifies human active error into mistakes, lapses and slips;
Donald Norman classifies into mistakes and slips; Jens Rasmussen into
skill-based, rule-based and knowledge-based mistakes. Further, Ladkin
and Loer have introduced a human active-error classification scheme
called PARDIA. Unlike the methods of dealing with digital-system error,
these classification schemes for human error are not obviously equivalent,
so it seems worthwhile discussing them briefly.
Reason explains his classification as follows (op. cit., p9):
Error [is] a generic term to encompass all those occasions in which a planned sequence of mental or physical activities fails to achieve its intended outcome, and when these failures cannot be attributed to the intervention of some chance agency. [...]
Slips and lapses are errors which result from some failure in the execution and/or storage stage of an action sequence, regardless of whether or not the plan which guided them was adequate to achieve its objective. [...]
Mistakes may be defined as deficiencies or failures in the judgemental and/or inferential processes involved in the selection of an objective or in the specification of the means to achieve it, irrespective of whether or not the actions directed by this decision-scheme run according to plan.
Reason's definitions seem to classify each occurrence of a human-automation interaction infelicity as a specific type of human error. I'm not sure this is appropriate. For example, in the A320, the prominent displayed autopilot data of -3.3 can mean 3.3° descent angle when the autopilot is in track/flight path angle (TRK FPA) mode, and a 3,300 feet per minute descent when it is in heading/vertical speed (HDG V/S) mode. The two modes are interchanged by means of a toggle switch (whose position thereby does not indicate the mode the autopilot is in), and while the mode is annunciated, the annunciation is smaller than the display of value. If one is not paying sufficient attention, it is (was) apparently easy to confuse the two modes, and instances of this mode confusion have been confirmed in some incidents. An Air Inter A320 crashed into Mont St.-Odile on 20 January, 1992, on approach into Strasbourg on a 3,300fpm descent when the aircraft should have been on an 800fpm descent, or a 3.3° glideslope. There could have been a slip (mistoggling the mode) or a lapse (setting the figure; failing to check the mode). However, the device provided an affordance (a combination of constraints and encouragement through design) which could have been argued to have encouraged such an error. It seems more appropriate to classify the error as a pilot-automation interaction error than as a human error with machine affordance; the latter terminology suggests the human as the primary failed agent, whereas PAII is neutral with respect to primary responsibility.
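The display ambiguity at the heart of this PAII can be sketched as follows. The formatting rules here are illustrative stand-ins, not Airbus's actual display logic: the point is only that two very different targets can render as the same prominent figure, leaving the (smaller) mode annunciation as the sole discriminator.

```python
# Sketch of the TRK FPA / HDG V/S display ambiguity (illustrative
# formatting, not actual Airbus logic).
def displayed_value(mode, target):
    if mode == "TRK FPA":        # target: flight path angle in degrees
        return "%.1f" % target
    if mode == "HDG V/S":        # target: vertical speed in ft/min,
        return "%.1f" % (target / 1000.0)  # displayed in thousands
    raise ValueError("unknown mode: " + mode)

# A 3.3 degree descent and a 3,300 fpm descent render identically:
assert displayed_value("TRK FPA", -3.3) == "-3.3"
assert displayed_value("HDG V/S", -3300) == "-3.3"
```

Seen this way, the affordance for confusion lies in the display design itself, which is why classifying the resulting error as purely a human error seems inapt.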
Another important human-error classification is due to Rasmussen and Jensen, who divide errors into
The idea of the Rasmussen classification is that there are automatic actions, such as steering a car or riding a bicycle, that an operator might have learnt but then just performs, without conscious mental control. Learning proceeds by imitation of others, and largely by training through repetitive practice. Falling off the bicycle after one has learnt to ride would be a slip. There are some actions, however, that are performed cognitively and largely consciously by following a system of rules that one keeps in mind during performance of the action. Signalling in traffic is such a rule-based operation. Unlike skill-based operations, one does not repeatedly practice putting one's left hand out in order to learn how to signal: when wishing to turn left on a bicycle, one consciously puts one's hand out according to the rule. This is rule-based action. Little or no conscious reasoning is involved apart from "rule application". Not signalling when one should would be a lapse. Finally, when some form of reasoning is involved in determining the actions to take, one is engaged in knowledge-based behavior, and if one reasons to inappropriate conclusions this is a knowledge-based mistake.
Knowledge-based mistakes correspond roughly to Reason's mistakes; skill-based mistakes to Reason's slips. Rule-based mistakes are lapses. I do not know whether there are lapses that would be classified as skill-based mistakes.
Another classification which is based upon the functional role of a human operator is called PARDIA, for
Norman classifies human active errors into slips and mistakes (Norman, op. cit.) and classifies slips into
My classification was developed independently of Borning's classification (although I had
indeed read his paper some years previously). The high degree of
correlation indicates the convergence of judgement concerning classification
of complex computer-related failure causes and provides prima facie
evidence that this classification forms a social domain.
The major area of difference is in Borning's labelling of
HMII problems as `human error'; as I noted in Section
Errors and Infelicities,
not all contributory factors to a failure can be classified as errors,
and not all failures of interaction can be put down to human (active) error.
This suggests that to form a suitable classification domain from this
social domain, one should maybe focus attention
on the HMII component.
Parnas classified systems (not failures) into analog, digital and hybrid systems.
Parnas's domains are clearly social domains: the first two are
typical areas of mathematical concentration in universities, and the
third is born of the necessity for considering systems which have both
analogue and discrete aspects; digital control systems for example.
Furthermore, since the classification is so coarse (analog, digital
or bits-of-both), there are good grounds for considering it a
classification domain for systems without human operators, as Parnas does.
The reasoning would be: analog systems are those which have no digital
components; hybrid systems are mixes of digital and non-digital components.
It is a mutually exclusive and exhaustive classification, with no
subcategories; so it fulfils (SoundCond),
(TypeCond) and (CompCond) trivially, and presumably
it could be argued to satisfy (FailCond).
This domain would be only
mildly useful for detailed classification of failure, but
suffices for Parnas's goals of arguing in the large about
system reliability, in particular software reliability.
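The reasoning behind Parnas's scheme can be sketched as a trivially exhaustive and mutually exclusive classifier over a system's set of component types (component typing is assumed given; this is a toy illustration, not anything from Parnas's paper):

```python
# Toy classifier for the analog / digital / hybrid scheme.
def parnas_class(component_types):
    has_digital = "digital" in component_types
    has_nondigital = any(t != "digital" for t in component_types)
    if has_digital and has_nondigital:
        return "hybrid"
    return "digital" if has_digital else "analog"

# Analog systems are those with no digital components:
assert parnas_class({"mechanical", "electrical"}) == "analog"
assert parnas_class({"digital"}) == "digital"
# Hybrid systems mix digital and non-digital components:
assert parnas_class({"digital", "mechanical"}) == "hybrid"
```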
The U.S. National Transportation Safety Board
The U.S. National Transportation Safety Board convenes `groups'
during accident investigations, which groups report on particular aspects
of the accident at the Public Hearing, a statutorily required conference
convened by the Board before issuing the final report on the accident.
The public hearing on the Korean Air Lines accident in Guam in 1997
contains "Factual Reports" from the following groups:
The division of the aircraft into Structures, Systems and Powerplants is traditional, and has its reasons in the engineering knowledge brought to bear on the analysis. The distinction between structures and systems mirrors a distinction between engineering of static systems and engineering of dynamic systems, although of course aircraft structures are not themselves static, any more than a tree is static. It could be said more reasonably, maybe, that the purpose of a structure is to remain in more or less the same configuration, despite perturbation; whereas the purpose of a system is to exhibit changing behavior, as required. One may bemoan the lack of division into mechanical, electrical and digital (or analogue and discrete) until one realises that both the accidents happened to `classic' B747 machines, which have no digital systems on board. These considerations lead me to categorise this classification as in part a social domain.
However, the two reports also contain group reports on particular features of interest in the accidents. For example, the Guam accident occurred on a relatively remote hilltop. The performance of the emergency response teams under these conditions is of great interest to survivability, for example. Similarly, the impact was the result of vertical navigation errors in the absence of precision vertical navigation guidance from the ground. The absence of such guidance was annunciated internationally, following standard procedures. The crew apparently followed standard procedures in receiving and processing this information. However, the CVR showed that the captain had considered using the unavailable guidance. The question arises: what was the nature of the problem with the procedures and captain's behavior, that he didn't apply the information when needed? Hence an Operations/Human Performance Group investigated.
Similarly, TWA800 exploded in flight and broke up. Groups were formed to reconstruct this event and inquire after possible causes. These groups are identifiable in the list above.
I conclude that the NTSB lists are in part feature domains.
The International Civil Aviation Organisation
The NTSB groupings reflect in part the typical organisation of the final report
on an aircraft accident required by ICAO of signatories. The
Factual Information section of a report requires the following
There are those who argue that fire and survival aspects should be integrated into the other engineering aspects, and that it is a continuing prophylactic disadvantage to consider them as separate aspects from the engineering. This point of view has some justification, although it is not my intent to argue that here.
That an aircraft is part of a larger system is reflected in the categories of Communications, Aids to Navigation and Aerodrome Information. Air Traffic Control aspects are normally subsumed under one or both of the first two headings. That the air transportation system is open, namely, that its behavior is significantly affected by certain aspects of its environment, is reflected in the presence of the Meteorology section.
Significantly missing from this list of topics is any section reporting on the regulatory and procedural environment, which was recognised as a component of a system in (CondList) . This is part of what authors such as Reason have emphasised as containing significant causal components of accidents (Rea98) (Rea90). This view has been incorporated into the accident investigation reportage of organisations such as the U.S. NTSB, Canadian TSB, U.K. AAIB, and Australian BASI, all independent safety boards, as well as supported by ICAO. This is partly reflected in the Operations/Human Performance Group report in the Guam hearing (NTSB.98.Guam), and is partly reflected in both hearings by the presence of the Maintenance Group Factual Report. (In fact, Reason devotes a whole chapter of (Rea98) to the consideration that Maintenance Can Seriously Damage Your System, showing that this is a feature domain).
We can conclude from the list of topics and the kind of information
traditionally contained therein that the ICAO report structure has further
procedural and bureaucratic goals than just the causal reportage of accidents.
We may further conclude that organisations such as the NTSB create
groupings to reflect that structure, as well as to deepen the causal
investigation where this is necessary, for example the extra groupings to
deal with sequencing of events, structural issues, and fire and explosion
with respect to TWA 800; the air traffic control and emergency management
groups with respect to KAL 801, to reflect appropriate feature domains.
Charles Perrow proposed a classification for system components
that he called DEPOSE. DEPOSE stands for Design, Equipment, Procedures,
Operators, Supplies and materials, and Environment.
While DEPOSE sufficed for Perrow's use, one suspects that it might not fulfil the completeness conditions for a classification domain. For example, no argument was given that every system of interest has a type that is the type of a mereological sum of all components; equivalently, that every system whatsoever has a type that is a sum of D,E,P,O,S and E. I doubt that such an argument can be made, for two reasons:
Further, one might criticise the notions of interactive complexity and coupling on the grounds that they lack proposed measures which could determine what counts as an instance of loose or of tight coupling, and what counts as an instance of interactive simplicity or complexity. These considerations are not, however, germane to the current discussion of componenthood.
Perrow's work is fundamental in
the field of failure analysis of complex systems, and the DEPOSE classification
and the notions of interactive complexity and tight coupling sufficed for
him to make the points he wished to make. The DEPOSE classification is
probably best thought of as a classification domain, even though it may not
satisfy some of the criteria as enumerated above in Section
Soundness, Completeness and Mereological Sums.
I hope to have indicated to the reader
how classification systems can help us understand
causal factors in order to mitigate their reoccurrence. I have
proposed two potential classification domains,
(CondList) and (ShortList).
I hope also to have made clear:
Given the current social domain in aviation, say, as represented
in the classifications discussed above, it seems that many if not most accidents with
computer-related features fall into the interaction infelicity
category. Some arguably fall into the design error or requirements error
categories (for example, the Ariane 501 accident). In comparison,
software and hardware errors that contribute to accidents appear to be
comparatively rare - and let us hope they stay that way!
D. Hughes, Incidents Reveal Mode Confusion.
Automated Cockpits Special Report, Part 1, Aviation Week and
Space Technology, 30 January 1995, p5.
Commercial Transports Face New Scrutiny in
U.S., Aviation Week and Space Technology, March 29, 1999,
Alan Borning, Computer System Reliability and Nuclear War,
Communications of the ACM 30(2):112-31, February 1987.
Stephen Cushing, Fatal Words: Communication Clashes and Aircraft
Crashes, Chicago and London: University of Chicago Press, 1994.
Peter B. Ladkin,
Analysis of a Technical Description of the Airbus
A320 Braking System, High Integrity Systems 1(4):331-49, 1995;
also available through
Peter B. Ladkin,
Research Report RVS-RR-96-03,
also a chapter of (Lad.SFCA),
Peter B. Ladkin,
The X-31 and A320 Warsaw Crashes: Whodunnit?,
also a chapter of (Lad.SFCA),
RVS Group, Faculty of Technology, University of Bielefeld, 1996,
Peter B. Ladkin,
Abstraction and Modelling,
also a chapter of (Lad.SFCA),
RVS Group, Faculty of Technology, University of Bielefeld, 1996,
Peter B. Ladkin (Ed.),
Computer-Related Incidents With
Commercial Aircraft, Document RVS-Comp-01,
RVS Group, Faculty of Technology, University of Bielefeld, 1996-99,
Peter B. Ladkin,
The Success and Failure of Complex
Artifacts, Book RVS-Bk-01,
RVS Group, Faculty of Technology, University of Bielefeld, 1997-99,
Peter B. Ladkin and Karsten Loer,
Formal Reasoning About Incidents, Book RVS-Bk-98-01,
RVS Group, Faculty of Technology, University of Bielefeld, 1998,
Donald Norman, The Psychology of Everyday Things,
New York:Basic Books, 1988.
National Transportation Safety Board, Report
Uncontrolled Descent and Collision with Terrain, USAir Flight 427,
Boeing 737-300, N513AU, Near Aliquippa, Pennsylvania, September 8, 1994
, available through
National Transportation Safety Board,
Safety Recommendation, referring to A-96-90 through -106,
October 16, 1996.
Reproduced on-line in the documents concerning the Cali accident in
National Transportation Safety Board,
Public Hearing: Korean Air Flight 801, Agana, Guam, August 6, 1997,
April 1998, available through
National Transportation Safety Board,
Public Hearing: TWA Flight 800, Atlantic Ocean, Near East Moriches,
New York, July 17, 1996, 1998, available through
David L. Parnas,
Software Aspects of Strategic Defense [sic] Systems,
Communications of the ACM 28(12):1326-35. December 1985.
Charles Perrow, Normal Accidents, New York:Basic Books, 1984.
Jens Rasmussen, Information Processing and Human-Machine Interaction,
Amsterdam: North-Holland, 1986.
J. Rasmussen and A. Jensen, Mental procedures in real-life tasks: A
case study of electronic troubleshooting, Ergonomics 17:293-307, 1974.
James Reason, Human Error, Cambridge University Press, 1990.
James Reason, Managing the Risks
of Organisational Accidents, Aldershot, England and Brookfield,
Vermont: Ashgate Publishing, 1998.
David A. Simmon,
Boeing 757 CFIT Accident at Cali, Colombia, Becomes Focus of
Lessons Learned, Flight Safety Digest,
May-June 1998, Alexandria, VA:Flight Safety Foundation, available
Cary R. Spitzer, Digital Avionics Systems, Second
Edition, McGraw-Hill, 1993.