Abstract: I consider a classification of complex-system components into digital, mechanical, electrical, wetware and procedural systems, which are not mutually exclusive and have themselves subcategories. Other schemes are also possible, such as one based on the different mathematical and other intellectual tools required. Subsystems do not generally form a static hierarchy, and may be identified according to causality, provenance, or habit. Whether failures are classified as component failures or interaction failures depends on the subsystem identification chosen. There may well be no indivisible, "unit" components of a system. Features which are errors should be distinguished from features which are infelicities. I classify "computer-related factors", discuss some human factor classification schemes, and mention mode confusion as an error factor which has gained prominence with the increased use of digital avionics systems. All this is illustrated and argued by means of aviation accident examples.
A motto amongst investigators says that every accident is different; but similarities have to be found if we are to have any hope of learning lessons to apply to the future. Detecting similarities means classification. Classification of causal factors is useful when different classes correspond to different techniques for avoiding or mitigating the influence of similar factors in the future, and identifying factors in the same class leads to similar prophylactic measures.
There are a number of independent attributes along which factors can be classified:
The US National Transportation Safety Board concluded on March 24, 1999, that all three incidents were most likely caused by "rudder reversal", that is, movement of the rudder in a direction opposite to that commanded by the pilots (or autoflight system, should that be engaged). They further determined that this rudder reversal was caused by a jam in the secondary slide hydraulic valve in the system's main power control unit, leading to an overtravel of the primary valve in the unit (AvW.99.03.29, NTSB.AAR.99.01).
There is still debate in the industry as to whether there was enough evidence to draw such a definitive conclusion about the cause. It is uncontentious, however, that failure modes of the rudder control system were identified as a result of the investigation. These failure modes are being addressed, and will be avoided or mitigated when newly redesigned units are retrofitted to the B737 fleet.
This example shows that identifying the bits is important and sometimes not very easy, but can be fruitful even when the identification is not conclusive.
It is important to realise that the classification into "bits", system components, is not a given. See Section Components, Subcomponents and Interactions below. We shall use the terms "component", "subsystem" and "part" interchangeably for subsystems of more inclusive systems.
The Nature of Components
To enable discussion of classification systems, I propose a partial list of the types of components involved in a complex system such as a commercial aircraft. This list will be called (CompList):
Even should this argument not be accepted,
in conformance with the `traditional' view of systems,
specification failures will still occur of course (classification cannot
change facts quite like that) and will
be classified under `development stage' failure (See Section
Development Stages, below)
rather than under system component failure.
Categories Are Not Mutually Exclusive
These categories are not mutually exclusive. For example, a flight management computer (FMC) is digital (hardware and software) as well as electrical (aviation digital hardware is at the moment all electrical, although optical hardware is coming). However, there are also mechanical, optical, biological and perhaps quantum digital computers, so a digital system is not necessarily electrical. When a suspected failure is localised to an electrical digital system, one must investigate not only the digital aspects (processor halt?) but also the electrical aspects (fried components?) as well as the structural aspects (physically broken component?). The same holds, mutatis mutandis, for the other sorts of digital system.
One should also note that a digital computational system has much in common with an analog computational system. For example, the phenomenon which caused the X-31 crash at NASA Dryden (Edwards AFB, California) had also occurred with F-4 aircraft, which have an analog flight control system, according to Mary Shafer, a flight engineer at Dryden (Sha96).
Further, an FMC contains procedural components (it implements navigation charts as well as digital databases of navigation aids and economic calculations). And of course the aircraft taken as a whole contains subsystems of all these different types. So there are many examples of aircraft subsystems which are multiply classifiable within (CompList).
It has been argued that the Flight Operations Manual should (ideally) constitute an (incomplete) specification of the aircraft system (Lad95). If this is the case, and the (complete, or at least more complete) specification is also a system component, then the Operations Manual will be a subcomponent of the specification component, as well as being a subcomponent of the procedures component.
Also, if the (overall) system specification is indeed a system component, then individual subsystem specifications will be not only components of that subsystem, but also components of the specification component.
Would the fact that these overlaps are created by including specification as an actual component of a system count against making this move, even if justified by the arguments of (Lad96.03)? I do not believe so. There is, as far as I see, no reason to expect a priori that any classification of system components will produce a perfect hierarchical partition of components, such that no component overlaps any other, and no parts of the system are not contained in some collection of components.
Here, I have used the words perfect, hierarchical and partition. I should define what I mean. The definitions are somewhat technical. Perfect means that all system parts have types which are contained in some collection of component types as enumerated; hierarchy means that the type classification forms a tree under the "subcomponent type" relation; and partition means that no component fits under two or more non-comparable subtrees.
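These three definitions can be made concrete in a few lines of Python. The following sketch is illustrative only: the toy parent relation, the basic-type names, and the function names are all invented. It does, however, exhibit the point of the surrounding discussion: a scheme in which an FMC is both digital and electrical can be a hierarchy, and perfect, while failing to be a partition.

```python
# Toy check of "perfect", "hierarchy" and "partition" for a
# classification given as a child -> parent mapping over types.
# All names here are hypothetical.

def ancestors(parent, t):
    """All types on the path from t up to its root, inclusive."""
    out = {t}
    while t in parent:
        t = parent[t]
        out.add(t)
    return out

def is_hierarchy(parent):
    """Hierarchy: the types form a single tree under the
    subcomponent-type relation (one root, no cycles)."""
    roots = set()
    for t in parent:
        seen = set()
        while t in parent:
            if t in seen:
                return False          # a cycle: not a tree
            seen.add(t)
            t = parent[t]
        roots.add(t)
    return len(roots) == 1

def is_perfect(parent, typing):
    """Perfect: every part's types appear somewhere in the scheme."""
    known = set(parent) | set(parent.values())
    return all(t in known for ts in typing.values() for t in ts)

def is_partition(parent, typing):
    """Partition: no component fits under two non-comparable subtrees,
    i.e. any two types of one component lie on one root-to-leaf path."""
    def comparable(a, b):
        return a in ancestors(parent, b) or b in ancestors(parent, a)
    return all(comparable(a, b)
               for ts in typing.values() for a in ts for b in ts)

# "digital" and "electrical" are sibling types; an FMC carries both.
parent = {"digital": "component", "electrical": "component",
          "mechanical": "component"}
typing = {"FMC": {"digital", "electrical"}, "rudder PCU": {"mechanical"}}

print(is_hierarchy(parent), is_perfect(parent, typing),
      is_partition(parent, typing))   # True True False
```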
The inclination towards perfect hierarchical partitioning may follow from reasoning such as this. Classifications may be thought of by some as a product of intellectual exercise (which they certainly are) but thereby entirely a product of our intellect (which they cannot be, if one is in any way a realist, since they describe parts of `the world'). It may be thought further that we may impose any conditions we like upon something that is a pure invention of our minds, and that if it should fail to satisfy a condition, that shows a mental weakness rather than any objective feature of a system; therefore we may require a perfect hierarchical partitioning if we should so wish, and criticise an attempt at classification which does not satisfy it. However, contained in such a line of reasoning is an assumption which removes the major motivation for classifying, namely, that classification is a pure invention of the mind. I wish to develop a classification in order better to understand the world; we should not then be surprised if it should prove hard or impossible to construct a useful classification conforming to certain self-imposed properties. Determining what is "useful" probably requires criteria such as precision rather than criteria such as intellectual neatness, and one has no a priori reason to suppose those criteria should conform to a perfect hierarchical partition. It would be a nice theorem to show, could one do it, but a theorem nonetheless.
To say this is not to come into conflict with the argument that classification systems are a human construct, that is, that one makes choices as to how to build a classification. Even though I may construct a classification, there are objective constraints on the choices I may make. For example, one may have many choices in picking a pair of positive whole numbers whose sum totals 30; 1 and 29, 5 and 25, and so on; that we may make such a choice freely, however, does not affect the theorem that these pairs all add up to 30, and that adding them to 70 will result in 100. These are facts independent of our choice. Similarly, it may be a fact, for all we know, that any "reasonable" choice of classification system for complex systems will inevitably contain multiple cases of overlapping classifications.
So any "bit", any identifiable subsystem,
of the aircraft may well have a multiple classification.
But how do we obtain this classification? And what "bits" are there?
Obtaining a Substantive Classification
It is reasonable to ask after the principles according to which
the content of a
classification scheme is derived. In particular, since the
component classification proposed above is not mutually exclusive,
one could wonder why digital systems are distinguished from electrical
systems, when at the time of writing in commercial aviation
they are in fact all electrical?
Before considering formal features of a classification system, it is well
to ask where the classifications that we use now have come from.
We distinguish digital systems from their substrate (electrical,
optical, formerly mechanical, whatever) because they have certain
features in common, and we have developed techniques for handling these
features effectively, semi-independently of the substrate. In other words,
the notion of digital system can be derived
through an abstraction process.
But is this how it really came about? No, of course not. Babbage designed
(though never completely built) a machine to perform automatic calculations
of a very specific sort. Eckert and Mauchly built a machine to perform
calculations electronically of a different sort. John von Neumann
popularised the idea of stored-program computers. And of course Alan Turing
had the idea of an abstract computation engine before. And so on.
In other words, there are some good theoretical reasons for the
distinction (abstraction) and some social reasons (evolution of computing
through engineering and mathematical discovery).
Furthermore, when one is interested in classification for a specific purpose, such as failure and accident analysis, then there may well be particular common features of failures which do not occur during normal operation.
Accordingly I shall consider three sorts of classification:
Behind any of the categories proposed in, say, (CompList) lies an accumulated body of engineering expertise. Such expertise has often been built up through years of academic and industrial experience and investigation concentrating on certain phenomena. Correlated with this are conferences, information exchange, and other cultural structures which ensure that information concerning this particular domain is very well exchanged amongst the participants. Let us call a domain with this feature a social domain. In contrast, information flow across social domain boundaries is comparatively, and notoriously, thin. For example, people working in general system reliability seem to be fond of saying that software has a reliability of either 1 or 0, because any fault has been there since installation and is latent until triggered. Software reliability people and computer scientists who deal with software and its unreliability, on the other hand, balk at this statement; many cannot believe their ears upon hearing it for the first time. These groups belong to different social domains - unfortunately, for they are attempting to handle the same system component.
Social domains can evolve partly for social reasons, partly for intellectual reasons concerning the subject matter. I use the term classification domain for a purely intellectual classification scheme, one based on formal properties of the domain itself. Feature domains are those based on specific features of the goal of the classification. When the goal is failure and accident analysis, there may be specific features of failures which are not common features of normal behavior. A failure analysis will want to pay attention to these features, though those studying normal operation of the system may not. An example of feature classification for aircraft accident analysis follows at the end of this section; also, the ICAO section headings for the factual information in a commercial aircraft accident report are discussed in Other Classifications: The International Civil Aviation Organisation below.
When devising a classification scheme, social domains form practical constraints. One cannot form new social domains out of thin air. It would make sense to try to accommodate social domains within a classification, rather than constructing classification domains which bring together parts of different social domains. If one were to pursue the latter strategy, it could lead to a situation in which some parts of a classification domain would have much greater and more fruitful interaction with parts of different classification domains (their partners in the social domain to which they belong) and relatively sparse interaction with certain other members of their own classification domain (those which belong to different social domains). Such a situation would not render the classification scheme particularly useful to practitioners, each of whom belongs to a social domain.
On the other hand, it could be argued that a good classification scheme has logic and reasoning behind it, and social domains, even though encouraging information exchange, would be relatively handicapped and ineffective in so far as they do not cohere with or conform to the logic behind a classification. This is an incontestable point. However, to recognise the point in practice would entail that so-called "interdisciplinary" social interactions (conferences, journals, etc.) conforming to the classification domains be initiated and continued until a social domain has been built up which conforms to the classification domain - supposing that one were possible; it might also be the case that the classification simply cannot conform to a social domain, for example if it required more breadth of knowledge than all but a handful of exceptional people possess. This of course happens, will continue to happen, and should be encouraged.
Another argument for making classification domains conform to existing social domains would be that the two factors described above have been permanently active in the past, and that therefore the social domains have indeed grown to conform more or less with a reasonable set of classification domains. Call this the argument from evolution, if one will.
By presenting these considerations, I am not proposing to draw deep conclusions. I am merely pointing out mechanisms at work in the choice of classifications which will help us deal with complex systems. It seems to me that the social mechanisms discussed have strength and relative influence which we do not know, and therefore we cannot effectively adjudicate which of the various situations described we are in. Although one should acknowledge the difference between social domains and classification domains, and not necessarily assume that a social domain makes a good classification scheme, these considerations do justify basing a classification scheme in large part on social domains, which the reader will observe is what I have done.
As an alternative to (CompList), one could consider basing a classification scheme mainly on the types of intellectual equipment one uses in handling design and implementation issues. For example
In short, social domains may be more or less arbitrary; more because social evolution can be based on happenstance, and less because of evolutionary pressure bringing social domains closer to some reasonable classification domain. However, there is practical justification for basing a classification scheme on social domains, even while acknowledging some degree of arbitrariness in the scheme.
Feature domains occur when certain aspects of failures in a domain
recur. For example, aviation accidents often involve fire, whose
accompanying poisonous smoke asphyxiates people; and survival aspects
of accidents include the availability and response of emergency services,
the accessibility of aircraft exits under emergency conditions, the
level of protection from smoke, the level of suppression of fire, and
the level of protection from trauma, amongst other things. Since
survival is regarded by many as one of the most important aspects of aircraft
accidents, the survivability of the accident is assessed, along with
the features that contributed positively and negatively to survival.
These categories do not currently occur within a system classification,
although there does not seem to be a reason from logic why not. For
example, if survivability and fire resistance criteria
were built in to a system specification, then investigating these
aspects would contribute to assessing whether there had been a
specification or design failure of these system components. Because this
is not yet generally the case, these aspects are properly classed for now
as a feature domain. There are certain exceptions to this; the lengthy
discussion about the overwing exits during the certification in Europe of
the new generation B737 aircraft is an example.
Homogeneous and Heterogeneous (Sub)systems
Before considering what subcomponents of a system there can be, and how we identify them, I want to make a distinction between homogeneous systems or subsystems and heterogeneous systems or subsystems. The distinction is supposed to reflect systems which belong to one type, contrasted with systems which have components from many types. As a first cut, then, one could propose defining homogeneity as having components which all belong to just one of the basic types, and heterogeneity as having components which belong to more than one. However, this will not work, because a digital system has mechanical components (the box that contains it; the boards and the chips that are inside it) as well as being also electrical. And the software that it runs defines procedures.
One way of solving the definitional problem is to define the classification by fiat. Consider the following categories:
Given the general idea, above, that homogeneity and heterogeneity
refer to being of one or of many different types,
it follows that any definition of homogeneity and
heterogeneity must be given relative to a chosen classification scheme.
Change the scheme, and homogeneous components may become
heterogeneous according to the new scheme, and vice versa.
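This relativity can be shown in a minimal sketch, with invented component and scheme names: the same flight management computer comes out heterogeneous under a scheme distinguishing digital, electrical and procedural parts, but homogeneous under a coarser scheme that lumps them together.

```python
# Homogeneity relative to a classification scheme. The part names and
# both schemes are hypothetical illustrations.

def homogeneous(component_parts, scheme):
    """A component is homogeneous under a scheme if all its parts fall
    under a single type of that scheme."""
    types = {scheme[p] for p in component_parts}
    return len(types) == 1

fmc_parts = ["processor", "power_supply", "navigation_database"]

# Scheme A distinguishes digital from electrical from procedural parts.
scheme_a = {"processor": "digital",
            "power_supply": "electrical",
            "navigation_database": "procedural"}

# Scheme B lumps everything avionics-related into one type.
scheme_b = {"processor": "avionics",
            "power_supply": "avionics",
            "navigation_database": "avionics"}

print(homogeneous(fmc_parts, scheme_a))  # False: three types
print(homogeneous(fmc_parts, scheme_b))  # True: one type
```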
Soundness, Completeness and Mereological Sums
I have discussed social domains and mentioned feature domains and
classification domains, but I have not yet proposed or discussed
formal criteria such as could be required for a domain to be
considered a classification domain.
We can begin by considering the following three properties:
"Soundness" would mean something like:
How do we determine what the components of the spoiler subsystem are? And how do we check this determination for correctness? We need criteria. Let us consider it from the point of view of failure analysis. One goal in a failure analysis is to isolate a subsystem in which a failure has occurred. When there is a problem with deployment of airbrakes or spoilers, for example, one looks first at components of this subsystem. The logical condition for membership of this subsystem is thus:
There are (at least) two ways to regard such a condition. It could either be a theorem about a component classification, or it could be interpreted as a requirement to be fulfilled by any satisfactory component classification. Following the second option, (FailCond) could be used to determine what the components are of the spoiler subsystem. However, there is a potential problem: it looks as though the phrase the failure of the specification could be a catch-all, classifying anything that did not fit into a particular proposed component structure, no matter how unsatisfactory or incomplete this component structure may be. For example, suppose for some odd reason one failed to consider the SECs to be part of the spoiler subsystem of the A320 (even though the name SEC, for Spoiler-Elevator Computer, gives one a hint that it is indeed a part of that subsystem). Let us call this smaller subsystem without the SECs the "spoiler subsystem", and the complete thing simply the spoiler subsystem (without quotes or italics). Suppose further that there were to be a software error in the SECs (the same error in all three), and that this error caused spoiler deployment to fail in some particular circumstance. Using (FailCond) and the definition of "spoiler subsystem", one would conclude that there had been a "failure of the specification". That's the "catch-all" at work, and it seems to fail us, because we would surely prefer to be able to conclude rather that the spoiler subsystem as conceived was incomplete, and should include the SECs; and that there had been a failure of the SECs due to software error (that is, after all, the way I described the example).
However, let us look at the reasoning a little further. The "spoiler subsystem" itself didn't fail, so by (FailCond) there had been a failure of the specification. The specification, however, is definable independently of (FailCond), namely, by describing the behavior of the spoilers (the physical hardware) in a variety of different flight regimes, including the problem case in which the spoilers (let us say) failed to deploy when the specification says they should have. So the spoiler behavior did not fulfil this specification.
A case in which a system did not fulfil a specification is not a specification error, but the definition of a design or implementation error (see later for a classification of these error types). A specification error would be indicated when the spoilers in fact fulfilled the specification, but something nevertheless was incorrect or inappropriate (see Section Development Stages below). This case does not fulfil that condition. Therefore we conclude there was not a specification error. According to (FailCond), then, some other system component must have suffered failure. But none of the ones described in "spoiler subsystem" did. If we accept (FailCond), there is left only one conclusion that we may draw, namely that "spoiler subsystem" is incomplete - there is some component of the spoiler subsystem that in fact failed (because there was no specification failure), and that part is not part of the "spoiler subsystem". (By the way, the notion of fulfilment of a specification by a design or implementation is completely rigorous, but here is not the place to describe it.)
I conclude that taking (FailCond) as a definition of what constitutes being a subcomponent of a system component will enable substantive inferences concerning system components and does not lead, as initially feared, to wordplay concerning what is a component failure and what a specification failure.
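The inference just rehearsed can be sketched as a small decision procedure. The reading of (FailCond) used here, the branch labels, and all part names are my own simplification of the argument, not the formal condition itself.

```python
# Sketch of the (FailCond) inference pattern, read as: a system failure
# is a failure of some enumerated component or of the specification.
# All names are illustrative.

def diagnose(enumerated_parts, failed_parts, spec_fulfilled):
    """Classify a system failure relative to a proposed parts list."""
    if spec_fulfilled:
        # Behaviour was as specified, yet something was wrong:
        return "specification error"
    if enumerated_parts & failed_parts:
        return "component failure"
    # The specification was not fulfilled, so some component failed;
    # but no enumerated component did. By (FailCond) the parts list
    # must be incomplete.
    return "enumeration incomplete"

# The "spoiler subsystem" without the SECs, facing a SEC software bug:
quoted = {"spoilers", "hydraulics", "deploy logic"}
print(diagnose(quoted, {"SEC software"}, spec_fulfilled=False))
# -> enumeration incomplete

# With the SECs included, the same failure classifies correctly:
full = quoted | {"SEC software"}
print(diagnose(full, {"SEC software"}, spec_fulfilled=False))
# -> component failure
```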
If most subsystems are identified similarly to the spoiler subsystem, then we could expect that system components can include digital parts (the SECs); mechanical and hydraulic parts (spoiler hardware itself; the three hydraulic systems); design parts (the logic connecting WoW condition and spoiler deployment) and specification parts (the rules for spoiler deployment). It is a mélange of component types as enumerated above. The spoiler subsystem itself thus belongs to no single basic type - it is heterogeneous. We took a function as a specification, and one might imagine that most subsystems identified in this manner will be heterogeneous when the function is fairly high-level (control of one of the three axes of motion) and the system itself (the aircraft) is heterogeneous.
What type does the spoiler subsystem in fact have? It is reasonable to suppose that meaningful system components, such as the spoiler subsystem, have as component types all the component types of their subcomponents. Let us call any specific collection of component types the sum of those types. Then we may formulate the principle:
There is an important formal distinction to be noted here. I spoke of a "collection" of types. One may be tempted to think that I meant set of types. This would be a mistake. Consider a system with subsystems, all of which have themselves subsubsystems, and let us assume that each subsubsystem has a basic type. Each subsystem then has as type the collection of (basic) types of its individual subsubsystems. The system has as type the collection of types of its subsystems, which themselves have type that is a collection. So the system has type which is a collection of collections of basic types, which is itself a collection of basic types. However, if "collection" meant "set", then the type of a subsystem would be a set of types of its subsubsystems; that is, a set of basic types. The type of the system would be a set of types of its subsystems, that is, a set of sets of basic types. But it is elementary set theory that a set of sets of basic types is not a set of basic types (unless basic types are special sets called "non-well-founded sets"). Nevertheless, a collection of collections of basic types is a collection of basic types. So collections, in the sense in which I am using the term, are not sets. (This is why the terminology "sum" is appropriate: sums of sums are sums.)
There are certain types of pure sets, called transitive sets (technically: sets which are identical with their unions), which do satisfy the condition we're looking for on collections. However, we can't make that work so easily because prima facie we don't have pure sets (sets whose only members are other sets), since we have basic types. We may be able to find some mathematical encoding that would enable us to identify basic types with pure sets. But why bother? We can sum collections of basic types simply, so there is no need for a translation into the right kind of set theory.
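The collection/set distinction can be made concrete in a few lines of Python (the basic-type names are illustrative): modelling the sum as a flattening union preserves "collection of basic types", whereas modelling it naively as a set of sets does not.

```python
# Summing collections of basic types flattens (sums of sums are sums);
# nesting sets does not. Type names are hypothetical.

def type_sum(*collections):
    """The sum of collections of basic types: again a flat collection
    of basic types."""
    out = set()
    for c in collections:
        out |= c          # flatten by union
    return out

digital = {"digital", "electrical"}
hydromech = {"mechanical", "hydraulic"}

# Summing subsystem types gives a flat collection of basic types:
system_type = type_sum(digital, hydromech)
print(system_type == {"digital", "electrical", "mechanical", "hydraulic"})

# A set of sets of basic types is NOT a set of basic types:
nested = {frozenset(digital), frozenset(hydromech)}
print("digital" in nested)        # False: its members are sets
print("digital" in system_type)   # True
```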
So important system components and subsystems can be classified, not within a single component type, but as a collection, the sum, of types. Along with this formulation of the type of a component comes the standard notion of a component being constituted by all its parts. We speak of a component being the mereological sum of its component parts. This means that when I put all the parts "together", I have the entire component; equivalently, "putting the parts together" is taking the mereological sum.
Components can be the mereological sum of many different collections of parts. A cake is the mereological sum of its two halves; or of its four quarters; and of its eight eighths also. The use of the notion of mereological sum comes when one lists parts or subcomponents; when one "puts them together", then one can observe if one has the entire system or not. If not, it means a part has not been enumerated, and one can identify this part by seeing what is missing from the mereological sum of the subcomponents one had listed.
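This completeness check can be sketched directly: "putting the parts together" is taking the union, and a missing part shows up as the difference from the whole. The cake-quarter names below are hypothetical.

```python
# Mereological sum as union; a missing part is the set difference.

def mereological_sum(parts):
    out = set()
    for p in parts:
        out |= p
    return out

whole_cake = {"NW", "NE", "SW", "SE"}            # four quarters
halves = [{"NW", "NE"}, {"SW", "SE"}]
quarters = [{"NW"}, {"NE"}, {"SW"}, {"SE"}]

# Different decompositions, one and the same sum:
print(mereological_sum(halves) == mereological_sum(quarters) == whole_cake)

# An incomplete enumeration: what is missing is the difference.
listed = [{"NW", "NE"}, {"SW"}]
print(whole_cake - mereological_sum(listed))     # {'SE'}
```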
Consider the following example, due to Leslie Lamport. One has two boxes with input and output wires, each of which outputs a zero if there is a zero on its input; otherwise outputs nothing. Connect these two boxes; the input of one to the output of the other. How do they behave? It is easy to show that either they both input and output nothing; or they both input and output zero. But can one decide whether they do the one or the other? There is nothing in the specification which allows one to: the behavior is not determined. One can think of the situation this way: the system formed by the two boxes contains an identifiable third component, which one may call the interface. The interface behavior consists of either two zeros being passed; or nothing. In order to reason about the subsystem formed by the two boxes connected in this way, it suffices to reason about the behavior of the two boxes (namely, their I/O specification as above), and the interconnection architecture, and thereby derive the constraint on the behavior of the interface. Once one knows these three things, one pretty much knows all there is to know about the system. Can one conclude that the system is the mereological sum of the boxes plus the interface?
Before doing so, one should ask whether the system could be the mereological sum of the boxes alone. The answer is: it cannot. Here is the proof. It is crucial to the system that the output wire of each box be connected to the input wire of the other. Just putting the boxes together, "next to each other" if you like, does not accomplish this, and the corresponding constraint on the interface would not be there. To see this, note that one could build a connected system also by connecting the two input wires to each other, and the two output wires to each other. This would form a system in which the behavior is determined - the system does nothing (neither input can receive, and therefore no output is generated by either). Clearly this is a different system from the first; it has different behavior, and a different configuration. Any argument that concluded that the former system were the mereological sum of the two boxes alone would also suffice to show that the latter would be the mereological sum of the same two boxes; but that cannot be because then the two different systems would be identical, and that they most certainly are not. QED.
We may conclude that the system consists of the two boxes plus the interface. That is, the system is the mereological sum of the two boxes plus the interface. Conversely, we would like to be able to conclude according to (FailCond) that any failure of the system to behave as specified is a failure in one of the boxes or a failure of the interface, and indeed this seems appropriate. Thus may we apply the logical condition above as a test to see whether our list of system parts enumerates a complete set of system components; namely, a list of which the system is the mereological sum.
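Lamport's example can also be checked mechanically: encode the box specification, then enumerate which value assignments to the two wires are consistent with the feedback wiring. This brute-force sketch (names invented) confirms that exactly the two behaviours described survive.

```python
# Lamport's two-box example: each box outputs 0 iff there is a 0 on
# its input, otherwise nothing (modelled as None). Enumerate the
# steady-state behaviours consistent with the feedback connection.

def box(inp):
    """Outputs 0 if there is a 0 on the input; otherwise nothing."""
    return 0 if inp == 0 else None

consistent = []
for a in (0, None):          # candidate value on wire A (out of box 1)
    for b in (0, None):      # candidate value on wire B (out of box 2)
        # Feedback wiring: box 1 reads wire B, box 2 reads wire A.
        if box(b) == a and box(a) == b:
            consistent.append((a, b))

print(consistent)   # [(0, 0), (None, None)]: both pass 0, or nothing
```

Neither solution is preferred by the specification alone, which is exactly the underdetermination the interface component is introduced to express.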
(CompList) is a list of basic types, of which the type of any subsystem is a sum. We may now enumerate the last principle of classification, a form of completeness:
Whether this is true of (CompList) remains to be proved; it is certainly not a given. But I propose that in so far as (CompList) failed to satisfy it, (CompList) would be defective.
Finally, building on this discussion, we may make a simple argument why the specification of a (sub)system is a component of the system. If the specification were to be considered a component, the logical condition for componenthood could be simplified to read:
We now have formal criteria against which to test proposed
classification domains. We have started to see how a satisfactory
component type classification can aid in failure analysis of systems.
But we must also remind ourselves now how complex this nevertheless remains.
Components, Subcomponents and Interactions
Components must work with each other to fulfil joint purposes. A
flight control computer must receive inputs from pilot controls, give
feedback to the instrument displays, and send signals to the aircraft
control actuators. These interactions can sometimes go quite wrong.
Componenthood Does Not Form a Static Hierarchy
One might think that, logically, one can consider a complex system as a static hierarchy of ever more complex components. But consider the following example.
An Airbus A320 has 7 flight control computers of three different sorts: two Elevator Aileron Computers, ELAC 1 and 2; three Spoiler Elevator Computers, SEC 1, 2 and 3; and two Flight Augmentation Computers, FAC 1 and 2. Multiple computers control each flight control surface. So, for example, if one of the ELAC computers that controls the ailerons and elevators completely fails, that is, just stops working, then the ailerons and elevators will be controlled by the other ELAC computer via a completely different hydraulic system. Supposing both ELACs fail, then one obtains roll control (for which the ailerons are used) via the spoilers, controlled by the 3 SEC computers, and the elevators are controlled by SECs 1 and 2.
The electrical flight control system (EFCS) is a static component (in the usual engineering sense) of the flight control system as a whole (FCS): the FCS includes the hardware, namely the control surfaces, that controls roll, pitch and yaw of the aircraft. The FCS can be specified as follows: the subsystem which controls roll, pitch and yaw position and movement, according to certain rules (control laws). A double ELAC failure does not result in a failure of the FCS (that is, a failure of roll control, pitch control or yaw control). However, it is a failure mode of the EFCS, since the ailerons are controlled in normal operation but are no longer controlled after a double ELAC failure. This failure mode results in reduced (but not eliminated) EFCS function. The ELAC itself is a part of the EFCS, and when it stops working, it has no functionality any more; its functionality is eliminated. The ELAC itself has both hardware and software components, and the total failure of the ELAC could be caused by a failure of either hardware or software, or both, or by a design failure in which hardware and software functioned as designed, but the condition (state or sequence of behavior) in which they found themselves was non-functional and not foreseen by the designer.
But what exactly is the component describable as
the elevator control system?
Again, this may be specified as: the system which provides suitable
input to elevator actuators to affect control of the elevators,
in order partially to control pitch position and movement.
Well, if everything is
in order, it has as subcomponents the ELACs 1 and 2, the SECs 1 and
2 and the blue and green hydraulic systems, which are subcomponents
it shares in part with the ailerons. If the blue hydraulic
system fails, it has ELAC 2 and SEC 2 and the green system, and
the ailerons have ELAC 2 and green. If
ELAC 1 fails physically, it has ELAC 2, SECs 1 and 2, and the
blue and green hydraulic systems, and the ailerons have ELAC 2
and green. If ELAC 1 fails in its aileron-control software only,
then elevators have ELACs 1 and 2 and SECs 1 and 2 and blue and
green. So the elevator control system has different components
at different times and under different failure modes of other
components of the system. It's a dynamic thing, formed out of a
changeable configuration of static hardware and software.
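The shifting membership just described can be sketched as a function from failure state to active components. This is purely an illustration of the prose: the computer-to-hydraulics assignment below is inferred from the failure cases given in the text, not taken from Airbus documentation.

```python
# Illustrative sketch of the dynamic membership of the elevator
# control system. The hydraulic assignment is inferred from the
# failure cases in the text and is hypothetical.
HYDRAULICS_OF = {"ELAC1": "blue", "ELAC2": "green",
                 "SEC1": "blue", "SEC2": "green"}

def elevator_control_components(failed):
    """Return the subcomponents currently serving elevator control,
    given the set of completely failed components. A computer failed
    only in its aileron-control software is not passed in `failed`,
    since it still serves the elevators."""
    active_hyd = {h for h in ("blue", "green") if h not in failed}
    active_computers = {c for c, h in HYDRAULICS_OF.items()
                        if c not in failed and h in active_hyd}
    return active_computers | active_hyd

# Everything in order: ELACs 1 and 2, SECs 1 and 2, blue and green.
assert elevator_control_components(set()) == {
    "ELAC1", "ELAC2", "SEC1", "SEC2", "blue", "green"}
# Blue hydraulic system fails: ELAC 2, SEC 2 and the green system.
assert elevator_control_components({"blue"}) == {"ELAC2", "SEC2", "green"}
# ELAC 1 fails physically: ELAC 2, SECs 1 and 2, blue and green.
assert elevator_control_components({"ELAC1"}) == {
    "ELAC2", "SEC1", "SEC2", "blue", "green"}
```

The point of the sketch is that the function's output, the component set, varies with the failure state: componenthood here is a value computed over time, not a static list.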
Identifying subsystems through causal chains can be
intellectually a lot more complicated than it appears at first sight.
It seems it would be inappropriate to classify subsystems
both hierarchically and statically.
Failures Which Are Not Failures of Static Components
One could think of the EFCS of the A320 (as, say, it is described in Spitzer, op. cit.) as consisting in 7 computers which partially interact. Suppose there is a failure in the communication channel connecting ELAC 1 and ELAC 2. Suppose ELAC 1 is waiting for a message from ELAC 2, and ELAC 2 is waiting for a message from ELAC 1. Since they're both waiting, neither message is sent, and they will carry on waiting for ever (this is called a deadlock). Then this is a failure of a component of the EFCS, namely the ELAC component, but is not an ELAC 1 or ELAC 2 failure per se. We classify it as an interaction failure. Where does the notion of interaction failure come from?
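The deadlock pattern just described can be shown concretely. The following is a toy illustration, not avionics code: each "computer" waits for the other's message before sending its own, so neither message is ever sent; a timeout stands in for "waiting for ever" so that the example terminates.

```python
# Toy illustration of the mutual-wait deadlock described in the text.
import queue
import threading

def unit(my_inbox, peer_inbox, name, outcome):
    try:
        my_inbox.get(timeout=0.2)               # wait for the peer first...
        peer_inbox.put("hello from " + name)    # ...never reached
        outcome[name] = "ok"
    except queue.Empty:
        outcome[name] = "deadlocked"

box1, box2, outcome = queue.Queue(), queue.Queue(), {}
t1 = threading.Thread(target=unit, args=(box1, box2, "ELAC1", outcome))
t2 = threading.Thread(target=unit, args=(box2, box1, "ELAC2", outcome))
t1.start(); t2.start(); t1.join(); t2.join()

# Neither unit has failed individually, yet jointly they make no progress.
assert outcome == {"ELAC1": "deadlocked", "ELAC2": "deadlocked"}
```

Note that the `unit` function is, by itself, unobjectionable; the failure arises only from the two instances' interaction, which is exactly the point.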
Note that the idea of an interaction failure comes from identifying
all static components of a subsystem, none of which has individually failed,
and observing nevertheless that a failure of the subsystem has occurred.
It is thus dependent on the identification of static components, and
indeed on identifying the subsystem itself. Thus one classifies a
failure as an interaction failure based on the classification one
is using of system components. It could be that classifying the
components another way would lead one to reclassify the
particular failure so that it is no longer an interaction failure.
The concept of an interaction failure thus cannot be guaranteed to
persist between component classifications.
Are there any Hard-and-Fast Unit Components?
Despite the fluidity of subcomponent identification, one might expect to find some hard-and-fast items that turn out to be practically indivisible units no matter how one classifies. One might be tempted to consider a pilot, for example, as a unit:
One may reasonably remain agnostic about whether there are any
units which persist through all reasonable classification schemes.
Some Ways of Identifying Component Subsystems
We have allowed that a classification of components of a system
may be constructed intellectually however it is convenient to do so,
subject to certain general formal conditions.
But what exactly is convenient, and what not?
Suppose there was a failure to control altitude or glide path effectively on an approach, as in the case of the China Airlines Flight 676 accident to an Airbus A300 aircraft in Taipei, Taiwan on February 16, 1998. This could be deemed to be a failure of the altitude control system (ACS). Nobody may have specifically identified the ACS before this accident, for example, it may not have been an identified component during system build, but that is no reason not to select it for study when needed. The ACS could be specified as: that system which controls altitude and rate of change of altitude through pitch adjustments. In the Taipei accident, a particular phenomenon was observed: divergence from appropriate altitudes for that phase of flight. It may be presumed that the aircraft has a control system which controls altitude. The task would then be to identify this subsystem. Let us consider in what the ACS consists.
The ACS control loop passes from the actual altitude, through the altitude sensor mechanism (the pitot-static system, in this case the static part), through the electrical display systems which display the actual altitude and the target altitude to the pilot, through the pilot's eyes (part of his physiological system) and his optic nervous system to his brain, where it is cognitively processed, including attention, reasoning, decision about what to do, intention to do it, and the cognisant action which is taken on the control stick, which feeds back through the hydro-mechanical control system to the aircraft's pitch control surfaces. So the ACS has the pitot-static system, air data computers, cockpit display systems, the pilot's optic system, nervous system, cognitive capacity, muscle actuation mechanisms, control stick and elevator control mechanisms as components. It also includes (part of) the autopilot, which, when activated, feeds in elevator actuation commands and accompanying stabilator pitch ("pitch trim") according to certain rules. All in all, with digital, electrical, hydro-mechanical and human components, this aircraft subsystem is quite heterogeneous. It also has components, such as the autopilot connection, which are sometimes part of the ACS and sometimes not (depending on whether the autopilot is engaged or not). So the ACS is also not static.
What allowed us to consider this heterogeneous collection of subcomponents as a single system component is that they are involved in some causal chain towards achieving a condition (effective altitude) which we independently had concluded was or should have been a system goal. We can call this the causal-chain method of subsystem identification.
Another way of identifying subsystems is provenance: we get hardware from manufacturers, software from programmers, and pilots from their mums and dads via flight schools. Since errors can occur at each stage in system development, it makes sense to encapsulate the development in the organisation which performed it. (See Section Development Stages, below.)
Yet another way of identifying subsystems is habit. We have been used to considering aircraft and pilot as a single entity since the dawn of aviation, but we have only recently become sensitive to the potentially devastating causal effect of air-ground miscommunications, so we are unused to considering as an identifiable system subcomponent the pilots' vocal and hearing mechanisms, the hardware that interacts with these and sends electromagnetic signals through the ether to and from the hardware in air traffic control stations, and the equivalent physiological and cognitive subsystems of the controllers (as evidence for this lack of familiarity, just look how long it took me to describe that subsystem; we have no recognisable word for it yet).
One unfortunate consequence of identifying subsystems by habit is that
it usually comes along with habitual reasoning - reasoning
that may well be false.
For example, there is traditional
reasoning that when computer components fail, since computers have
hardware and software, it must either be the hardware or the software
which failed. As discussed in
this reasoning is faulty. When a computer fails, it can be
a hardware failure, a software failure, or a design/requirements failure.
Our habits of identifying the visibly causal chains of the working
of a computer system apparently let some of us overlook the invisible
causal chains, such as that from faulty design to faulty operation.
(Some engineers avoid this problem by identifying design/requirements
failures as software failures. But this classification has unfortunate
consequences, as discussed in (Lad96.03),
and therefore I reject it.)
Conclusions Concerning Component Identification
Subcomponent identification can be complicated to understand and analyse. But that is just the way things are. An accident analyst can choose to ignore certain components, for example, pilot subsystems, if his or her goal is to improve the other aircraft subsystems. Aviation medical specialists and human factors specialists are, to the contrary, very interested in the pilot subsystems, with a view to improving them, or improving their reliability.
And I'm content to live with the conclusion that the failure analysis of complex systems is difficult. That's partly how I make my living, after all.
Errors or infelicities can be introduced at any one of these stages.
The developmental stages can be put together with the component types.
One can thus talk about a software implementation error (which I
and most others call simply a "software error"), or a
software-hardware subsystem design error, or a software subsystem
requirements error, and so forth.
The notation may get a bit unwieldy, but at least it's accurate.
We can leave it to the reader to figure out a more felicitous notation.
Errors and Infelicities
Not every design or requirements factor contributing to an accident is the
result of an error. An error has unwanted consequences in every
occurrence. But there are other cases in which the consequences of
a factor can be positive in some cases and negative in others. We call factors of
this nature infelicities in accidents in which they played a negative role.
One such infelicity concerns a specific interlock which many aircraft have, to prevent the actual deployment of thrust reverse while the aircraft is still in flight. Sensors attached to the landing gear measure the compression of the gear, which corresponds to the weight of the aircraft on the wheels (when an aircraft's wings are producing lift, during takeoff and landing as well as flight, there will be less or no weight on the wheels). This is called "weight on wheels" (WoW). Thrust reverse actuation is inhibited when WoW is insufficient.
Such an interlock would have helped, it is assumed, in the case of the Lauda Air B767 accident in Thailand on May 26, 1991. The pilots noted that thrust reverse actuation was indicated on the instrument panel. The actuation is in part electrical, but there is a mechanical (hydraulic) interlock which prevents actual deployment of the reversers when the gear is raised. It is suspected that a failure of this interlock allowed the actuation, commanded through an electrical fault, actually to deploy one reverser, leading to departure from controlled flight and the crash. A WoW interlock is very simple, and is thought to add a layer of protection because it does not have the failure modes of the hydraulic interlock.
In fact, on the A320, the WoW thrust-reverser interlock is implemented in digital logic in the computers which control the reversers. On September 14, 1993, an A320 landing in Warsaw in bad weather did not have sufficient WoW to allow immediate deployment of braking systems, and it continued for 9 seconds after touchdown before reversers and speedbrakes deployed (and a further 4 seconds before wheel brakes deployed). The aircraft ran off the end of the runway, hit an earth bank infelicitously placed at the end of the runway, and burned up (most occupants survived with no or minor injuries).
The same WoW interlock design would have been felicitous in the case of Lauda Air and was infelicitous in the case of Warsaw.
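A minimal sketch of such a digital interlock predicate makes the felicity point concrete. The sensor names and the both-gear condition below are invented for illustration; they are not the actual Airbus or Boeing logic.

```python
# Hypothetical weight-on-wheels interlock predicate (illustrative
# thresholds and conditions, not actual avionics logic).
def reverser_deploy_permitted(left_gear_compressed, right_gear_compressed):
    """Permit thrust-reverser deployment only with weight on both
    main gear struts."""
    return left_gear_compressed and right_gear_compressed

# Lauda Air-style case (in flight): deployment correctly inhibited.
assert not reverser_deploy_permitted(False, False)
# Warsaw-style case: touched down, but with insufficient compression
# on one gear the braking aids remain inhibited - the same condition,
# felicitous in one accident and infelicitous in the other.
assert not reverser_deploy_permitted(True, False)
# Firm touchdown on both gear: deployment permitted.
assert reverser_deploy_permitted(True, True)
```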
Another example could be the automatic configuration of aircraft control surfaces during emergency manoeuvres. Near Cali, Colombia on 20 December 1995, American Airlines Flight 965 hit a mountain while engaged in a Ground Proximity Warning System (GPWS) escape manoeuvre. The GPWS senses rising ground in the vicinity of the aircraft relative to the motion of the aircraft, and verbally warns the pilots. It is calculated that one has about 12 seconds between warning and impact. While executing the escape, the pilots forgot to retract the speed brakes, consequently the aircraft's climb performance was not optimal. It has been mooted (by the U.S. National Transportation Safety Board amongst others) that had the aircraft been automatically reconfigured during the escape manoeuvre (as, for example, the Airbus A320 is), the aircraft might have avoided impact (NTSB.SR.A-96-90) . Of course, it is quite another question how the aircraft got there in the first place, but this is not my concern here. I should point out that the hypothesis that the Cali aircraft might have avoided the mountain with optimal climb performance at GPWS warning has also been doubted (Sim98).
Having such automatic systems as on the A320 could have been
felicitous for the Cali airplane, but would be infelicitous should
they fail to operate effectively, in a case in which there is no
effective manual backup. It is well-known that introducing extra
layers of defence also introduces the possibility of extra failure
modes through failure of the extra defensive systems.
What Are "Computer-Related Factors"?
Here is a collection of words one could use in "daily life"
(such as: talking to journalists) for computer-related factors,
based on (CondList) and the other considerations
discussed. Let me call it (ShortList).
My intent is to pick out prominent or commonly-occurring factors.
Notice that pilot-automation interaction infelicities (PAII) are behaviors, whereas latent errors such as those in software or design are persistent - they are part of the state of the system over a time period (one hopes, until the next software release), but manifest themselves through system behavior.
There are undoubtedly finer classifications to be sought, and one
can argue (as many colleagues have with each other, regularly) for the worth
or lack of worth of finer or coarser distinctions. So be it. It is a list
of distinctions which I have found most useful.
Classifying Specifically Human-Machine
Reason classifies human active error into mistakes, lapses and slips;
Donald Norman classifies into mistakes and slips; Jens Rasmussen into
skill-based, rule-based and knowledge-based mistakes. Further, Ladkin
and Loer have introduced a human active-error classification scheme
called PARDIA. Unlike the methods of dealing with digital-system error,
these classification schemes for human error are not obviously equivalent,
so it seems worthwhile discussing them briefly.
Reason explains his classification as follows (op. cit., p9):
Error [is] a generic term to encompass all those occasions in which a planned sequence of mental or physical activities fails to achieve its intended outcome, and when these failures cannot be attributed to the intervention of some chance agency. [...]
Slips and lapses are errors which result from some failure in the execution and/or storage stage of an action sequence, regardless of whether or not the plan which guided them was adequate to achieve its objective. [...]
Mistakes may be defined as deficiencies or failures in the judgemental and/or inferential processes involved in the selection of an objective or in the specification of the means to achieve it, irrespective of whether or not the actions directed by this decision-scheme run according to plan.
Reason's definitions seem to classify each occurrence of a human-automation interaction infelicity as a specific type of human error. I'm not sure this is appropriate. For example, in the A320, the prominent displayed autopilot data of -3.3 can mean 3.3° descent angle when the autopilot is in track/flight path angle (TRK FPA) mode, and a 3,300 feet per minute descent when it is in heading/vertical speed (HDG V/S) mode. The two modes are interchanged by means of a toggle switch (whose position thereby does not indicate the mode the autopilot is in), and while the mode is annunciated, the annunciation is smaller than the display of value. If one is not paying sufficient attention, it is (was) apparently easy to confuse the two modes, and instances of this mode confusion have been confirmed in some incidents. An Air Inter A320 crashed into Mont St.-Odile on 20 January, 1992, on approach into Strasbourg on a 3,300fpm descent when the aircraft should have been on an 800fpm descent, or a 3.3° glideslope. There could have been a slip (mistoggling the mode) or a lapse (setting the figure; failing to check the mode). However, the device provided an affordance (a combination of constraints and encouragement through design) which could have been argued to have encouraged such an error. It seems more appropriate to classify the error as a pilot-automation interaction error than as a human error with machine affordance; the latter terminology suggests the human as the primary failed agent, whereas PAII is neutral with respect to primary responsibility.
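The display ambiguity at the heart of this PAII can be sketched as follows. The formatting rules here are illustrative stand-ins, not Airbus's actual display logic: the point is only that two very different targets can render as the same prominent figure, leaving the (smaller) mode annunciation as the sole discriminator.

```python
# Sketch of the TRK FPA / HDG V/S display ambiguity (illustrative
# formatting, not actual Airbus logic).
def displayed_value(mode, target):
    if mode == "TRK FPA":        # target: flight path angle in degrees
        return "%.1f" % target
    if mode == "HDG V/S":        # target: vertical speed in ft/min,
        return "%.1f" % (target / 1000.0)  # displayed in thousands
    raise ValueError("unknown mode: " + mode)

# A 3.3 degree descent and a 3,300 fpm descent render identically:
assert displayed_value("TRK FPA", -3.3) == "-3.3"
assert displayed_value("HDG V/S", -3300) == "-3.3"
```

Seen this way, the affordance for confusion lies in the display design itself, which is why classifying the resulting error as purely a human error seems inapt.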
Another important human-error classification is due to Rasmussen and Jensen, who divide errors into
The idea of the Rasmussen classification is that there are automatic actions, such as steering a car or riding a bicycle, that an operator might have learnt but then just performs, without conscious mental control. Learning proceeds by imitation of others, and largely by training through repetitive practice. Falling off the bicycle after one has learnt to ride would be a slip. There are some actions, however, that are performed cognitively and largely consciously by following a system of rules that one keeps in mind during performance of the action. Signalling in traffic is such a rule-based operation. Unlike skill-based operations, one does not repeatedly practice putting one's left hand out in order to learn how to signal: when wishing to turn left on a bicycle, one consciously puts one's hand out according to the rule. This is rule-based action. Little or no conscious reasoning is involved apart from "rule application". Not signalling when one should would be a lapse. Finally, when some form of reasoning is involved in determining the actions to take, one is engaged in knowledge-based behavior, and if one reasons to inappropriate conclusions this is a knowledge-based mistake.
Knowledge-based mistakes correspond roughly to Reason's mistakes; skill-based mistakes to Reason's slips. Rule-based mistakes are lapses. I do not know whether there are lapses that would be classified as skill-based mistakes.
Another classification which is based upon the functional role of a human operator is called PARDIA, for
Norman classifies human active errors into slips and mistakes (Norman, op. cit.) and classifies slips into
My classification was developed independently of Borning's classification (although I had
indeed read his paper some years previously). The high degree of
correlation indicates the convergence of judgement concerning classification
of complex computer-related failure causes and provides prima facie
evidence that this classification forms a social domain.
The major area of difference is in Borning's labelling of
HMII problems as `human error'; as I noted in Section
Errors and Infelicities,
not all contributory factors to a failure can be classified as errors,
and not all failures of interaction can be put down to human (active) error.
This suggests that to form a suitable classification domain from this
social domain, one should maybe focus attention
on the HMII component.
Parnas classified systems (not failures) into analog, digital and hybrid systems.
Parnas's domains are clearly social domains: the first two are
typical areas of mathematical concentration in universities, and the
third is born of the necessity for considering systems which have both
analogue and discrete aspects; digital control systems for example.
Furthermore, since the classification is so coarse (analog, digital
or bits-of-both), there are good grounds for considering it a
classification domain for systems without human operators, as Parnas does.
The reasoning would be: analog systems are those which have no digital
components; hybrid systems are mixes of digital and non-digital components.
It is a mutually exclusive and exhaustive classification, with no
subcategories; so it fulfils (SoundCond),
(TypeCond) and (CompCond) trivially, and presumably
it could be argued to satisfy (FailCond).
This domain would be only
mildly useful for detailed classification of failure, but
suffices for Parnas's goals of arguing in the large about
system reliability, in particular software reliability.
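The reasoning behind Parnas's scheme can be sketched as a trivially exhaustive and mutually exclusive classifier over a system's set of component types (component typing is assumed given; this is a toy illustration, not anything from Parnas's paper):

```python
# Toy classifier for the analog / digital / hybrid scheme.
def parnas_class(component_types):
    has_digital = "digital" in component_types
    has_nondigital = any(t != "digital" for t in component_types)
    if has_digital and has_nondigital:
        return "hybrid"
    return "digital" if has_digital else "analog"

# Analog systems are those with no digital components:
assert parnas_class({"mechanical", "electrical"}) == "analog"
assert parnas_class({"digital"}) == "digital"
# Hybrid systems mix digital and non-digital components:
assert parnas_class({"digital", "mechanical"}) == "hybrid"
```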
The U.S. National Transportation Safety Board
The U.S. National Transportation Safety Board convenes `groups'
during accident investigations, which groups report on particular aspects
of the accident at the Public Hearing, a statutorily required conference
convened by the Board before issuing the final report on the accident.
The public hearing on the Korean Air Lines accident in Guam in 1997
contains "Factual Reports" from the following groups:
The division of the aircraft into Structures, Systems and Powerplants is traditional, and has its reasons in the engineering knowledge brought to bear on the analysis. The distinction between structures and systems mirrors a distinction between engineering of static systems and engineering of dynamic systems, although of course aircraft structures are not themselves static, any more than a tree is static. It could be said more reasonably, maybe, that the purpose of a structure is to remain in more or less the same configuration, despite perturbation; whereas the purpose of a system is to exhibit changing behavior, as required. One may bemoan the lack of division into mechanical, electrical and digital (or analogue and discrete) until one realises that both the accidents happened to `classic' B747 machines, which have no digital systems on board. These considerations lead me to categorise this classification as in part a social domain.
However, the two reports also contain group reports on particular features of interest in the accidents. For example, the Guam accident occurred on a relatively remote hilltop. The performance of the emergency response teams under these conditions is of great interest to survivability, for example. Similarly, the impact was the result of vertical navigation errors in the absence of precision vertical navigation guidance from the ground. The absence of such guidance was annunciated internationally, following standard procedures. The crew apparently followed standard procedures in receiving and processing this information. However, the CVR showed that the captain had considered using the unavailable guidance. The question arises: what was the nature of the problem with the procedures and captain's behavior, that he didn't apply the information when needed? Hence an Operations/Human Performance Group investigated.
Similarly, TWA800 exploded in flight and broke up. Groups were formed to reconstruct this event and inquire after possible causes. These groups are identifiable in the list above.
I conclude that the NTSB lists are in part feature domains.
The International Civil Aviation Organisation
The NTSB groupings reflect in part the typical organisation of the final report
on an aircraft accident required by ICAO of signatories. The
Factual Information section of a report requires the following
There are those who argue that fire and survival aspects should be integrated into the other engineering aspects, and that it is a continuing prophylactic disadvantage to consider them as separate aspects from the engineering. This point of view has some justification, although it is not my intent to argue that here.
That an aircraft is part of a larger system is reflected in the categories of Communications, Aids to Navigation and Aerodrome Information. Air Traffic Control aspects are normally subsumed under one or both of the first two headings. That the air transportation system is open, namely, that its behavior is significantly affected by certain aspects of its environment, is reflected in the presence of the Meteorology section.
Significantly missing from this list of topics is any section reporting on the regulatory and procedural environment, which was recognised as a component of a system in (CondList) . This is part of what authors such as Reason have emphasised as containing significant causal components of accidents (Rea98) (Rea90). This view has been incorporated into the accident investigation reportage of organisations such as the U.S. NTSB, Canadian TSB, U.K. AAIB, and Australian BASI, all independent safety boards, as well as supported by ICAO. This is partly reflected in the Operations/Human Performance Group report in the Guam hearing (NTSB.98.Guam), and is partly reflected in both hearings by the presence of the Maintenance Group Factual Report. (In fact, Reason devotes a whole chapter of (Rea98) to the consideration that Maintenance Can Seriously Damage Your System, showing that this is a feature domain).
We can conclude from the list of topics and the kind of information
traditionally contained therein that the ICAO report structure has further
procedural and bureaucratic goals than just the causal reportage of accidents.
We may further conclude that organisations such as the NTSB create
groupings to reflect that structure, as well as to deepen the causal
investigation where this is necessary, for example the extra groupings to
deal with sequencing of events, structural issues, and fire and explosion
with respect to TWA 800; the air traffic control and emergency management
groups with respect to KAL 801, to reflect appropriate feature domains.
Charles Perrow proposed a classification for system components
that he called DEPOSE. DEPOSE stands for Design, Equipment, Procedures,
Operators, Supplies and materials, and Environment.
While DEPOSE sufficed for Perrow's use, one suspects that it might not fulfil the completeness conditions for a classification domain. For example, no argument was given that every system of interest has a type that is the type of a mereological sum of all components; equivalently, that every system whatsoever has a type that is a sum of D,E,P,O,S and E. I doubt that such an argument can be made, for two reasons:
Further, one might criticise the notions of interactive complexity and coupling on the grounds that they lack proposed measures which could determine what counts as an instance of loose or of tight coupling, and what counts as an instance of interactive simplicity or complexity. These considerations are not, however, germane to the current discussion of componenthood.
Perrow's work is fundamental in
the field of failure analysis of complex systems, and the DEPOSE classification
and the notions of interactive complexity and tight coupling sufficed for
him to make the points he wished to make. The DEPOSE classification is
probably best thought of as a classification domain, even though it may not
satisfy some of the criteria as enumerated above in Section
Soundness, Completeness and Mereological Sums.
I hope to have indicated to the reader
how classification systems can help us understand
causal factors in order to mitigate their reoccurrence. I have
proposed two potential classification domains,
(CondList) and (ShortList).
I hope also to have made clear:
Given the current social domain in aviation, say, as represented
in the classifications discussed above, it seems that many if not most accidents with
computer-related features fall into the interaction infelicity
category. Some arguably fall into the design error or requirements error
categories (for example, the Ariane 501 accident). In comparison,
software and hardware errors that contribute to accidents appear to be
comparatively rare - and let us hope they stay that way!
D. Hughes, Incidents Reveal Mode Confusion.
Automated Cockpits Special Report, Part 1, Aviation Week and
Space Technology, 30 January 1995, p5.
Commercial Transports Face New Scrutiny in
U.S., Aviation Week and Space Technology, March 29, 1999,
Alan Borning, Computer System Reliability and Nuclear War,
Communications of the ACM 30(2):112-31, February 1987.
Stephen Cushing, Fatal Words: Communication Clashes and Aircraft
Crashes, Chicago and London: University of Chicago Press, 1994.
Peter B. Ladkin,
Analysis of a Technical Description of the Airbus
A320 Braking System, High Integrity Systems 1(4):331-49, 1995;
also available through
Peter B. Ladkin,
Research Report RVS-RR-96-03,
also a chapter of (Lad.SFCA),
Peter B. Ladkin,
The X-31 and A320 Warsaw Crashes: Whodunnit?,
also a chapter of (Lad.SFCA),
RVS Group, Faculty of Technology, University of Bielefeld, 1996,
Peter B. Ladkin,
Abstraction and Modelling,
also a chapter of (Lad.SFCA),
RVS Group, Faculty of Technology, University of Bielefeld, 1996,
Peter B. Ladkin (Ed.),
Computer-Related Incidents With
Commercial Aircraft, Document RVS-Comp-01,
RVS Group, Faculty of Technology, University of Bielefeld, 1996-99,
Peter B. Ladkin,
The Success and Failure of Complex
Artifacts, Book RVS-Bk-01,
RVS Group, Faculty of Technology, University of Bielefeld, 1997-99,
Peter B. Ladkin and Karsten Loer,
Formal Reasoning About Incidents, Book RVS-Bk-98-01,
RVS Group, Faculty of Technology, University of Bielefeld, 1998,
Donald Norman, The Psychology of Everyday Things,
New York:Basic Books, 1988.
National Transportation Safety Board, Report
Uncontrolled Descent and Collision with Terrain, USAir Flight 427,
Boeing 737-300, N513AU, Near Aliquippa, Pennsylvania, September 8, 1994
, available through
National Transportation Safety Board,
Safety Recommendation, referring to A-96-90 through -106,
October 16, 1996.
Reproduced on-line in the documents concerning the Cali accident in
National Transportation Safety Board,
Public Hearing: Korean Air Flight 801, Agana, Guam, August 6, 1997,
April 1998, available through
National Transportation Safety Board,
Public Hearing: TWA Flight 800, Atlantic Ocean, Near East Moriches,
New York, July 17, 1996, 1998, available through
David L. Parnas,
Software Aspects of Strategic Defense [sic] Systems,
Communications of the ACM 28(12):1326-35. December 1985.
Charles Perrow, Normal Accidents, New York:Basic Books, 1984.
Jens Rasmussen, Information Processing and Human-Machine Interaction,
Amsterdam: North-Holland, 1986.
J. Rasmussen and A. Jensen, Mental procedures in real-life tasks: A
case study of electronic troubleshooting, Ergonomics 17:293-307, 1974.
James Reason, Human Error, Cambridge University Press, 1990.
James Reason, Managing the Risks
of Organisational Accidents, Aldershot, England and Brookfield,
Vermont: Ashgate Publishing, 1998.
David A. Simmon,
Boeing 757 CFIT Accident at Cali, Colombia, Becomes Focus of
Lessons Learned, Flight Safety Digest,
May-June 1998, Alexandria, VA:Flight Safety Foundation, available
Cary R. Spitzer, Digital Avionics Systems, Second
Edition, McGraw-Hill, 1993.