Professor Peter B. Ladkin

Special Report RVS-S-98-01

8 March 1998

The Subcommittee has asked me for my opinion on:

the length of time an audit to determine whether the NERC system would work would take (Sta.19.02.98);
Sir Ronald Mason's suggestion that the software for the NSC should be developed independently of that of NERC, for safety reasons (Mas.10.10.97) (Mas.21.07.97).

In response, I formulated my views on these matters and contacted various colleagues in Great Britain and the United States electronically to discuss those views. Those whose views have been most helpful in discussion are Professors John Knight, Nancy Leveson and Fred Schneider. Additionally, Professor John McDermid has indicated his interest in the issues, but was unable to take part in substantial discussion as he was out of town during this period.

First, I give my answers to the questions above; second, I justify my answer to the first question; third, I justify my answer to the second question. In an appendix, I give some further information on the colleagues with whom I consulted. I ask the Subcommittee to indulge my loquacity in trying to answer their questions to the best of my ability.

Answers

A technical audit is required; one that looks at the actual system more than the management (a `traditional' audit would look more at the management). With the current state of knowledge a technical audit could take anything from weeks to months to a year. I therefore suggest a two-stage process: a pre-audit, involving three to four renowned domain experts, followed by a detailed technical audit. The pre-audit would identify the problem areas in the system, and would run for 60-100 hours; the outcome would include a precise estimate for how long the technical audit would take. The technical audit would follow;
We broadly disagree with Sir Ronald: whether one or two system suppliers are chosen is a peripheral issue with regard to the safety of the NERC and NSC systems.

Two-Phase Technical Audit

A `normal' software audit looks at the project structure and management, and would be relatively good at assessing the quality of the project management, and the prevalence of `good practice'. This is often thought to be a good guide to the quality of the software; indeed, there is such a correlation. However, members of the committee have already formed substantial opinions of the software-development management, in particular of the most recent schedule slippage (TSC.19.11.97) (TSC.29.01.97); confirming those views with a `normal' audit could therefore be regarded as duplication of effort. It would also not directly answer the question the committee would like answered: will the system work as designed?

This question is technical: the functionality of the system needs to be assessed, which requires assessing real-time distributed algorithms and software, not just assessing `good practice' and project management. How fast and effective such a technical audit would be would depend very much on the capability of the experts who do it. I believe it is crucial that the pre-audit at least is performed by the leading authorities in large distributed, real-time, safety-relevant software systems, particularly those with experience of air traffic control systems (such authorities include Professors John Knight, Nancy Leveson, John McDermid and Fred Schneider).

My reasons for my estimate for the pre-audit resources, 3 or 4 experts spending 60-100 hours each, are as follows.

First, it has taken me 60 hours, with unlimited access to project personnel, to understand the overall architecture of a moderately complex real-time system before now, and that was one with few resource-contention issues. The NERC software is larger and more complex, and has resource-contention issues (they can run a few stations simultaneously, but not all of them; the FAA AAS system also exhibited these symptoms). How long it would take to dig into it would depend on how modular the design and the implementation are. Truly modular designs are usually characterised by an initially high degree of reliability, which the NERC system has notably not exhibited (FI.21.05.97) (NATS.28.01.98). That the contractor is taking longer and longer to find and fix the `last few' problems (cf. the confidence of Mr. Semple four months ago that the system would be delivered by now (TSC.19.11.97), with my opinion at roughly the same time (Lad.13.11.97)) is circumstantial evidence that the design is not optimally modular. However, any large project must be distributed amongst many smaller development teams, which must develop their own individual parts, and this constraint enforces a `human resource' upper limit on any lack of modularity.

Second, Fred Schneider broadly agrees:

I envisage [a pre-audit] as 3-4 people sitting in a room, 8 hours/day, getting briefed by [Lockheed-Martin] system architects (with lower-level [Lockheed-Martin] technical people in the room to answer specific technical questions). I think that your guestimate of 2-3 weeks (closer to 3) of such meetings would allow the preaudit team to learn enough about the system to understand where the risk is. This would include identifying possible problems (including both those that the contractor is aware of and those that the contractor is not) as well as asking questions that the contractor can't answer (but whose answers will be important to understanding overall project risk).
(Sch.05.03.98)

(This was a rapid response by Fred to my query, and we assume a common context of discussion. I expect there to be substantial documentation on the design, and we are both assuming here that pre-audit team members would read this documentation privately. This of course consumes more person-hours.)

The output of the pre-audit would be answers to

in which parts is the system at high risk of not functioning as required?
what are the symptoms and the reasoning upon which this judgement is based?
how long would it take (more generally, what resources would it consume: contractor, client and audit-team personnel; how many hours; how much elapsed time; for what extra money) to perform a thorough technical audit?

The output of the audit would be answers to

will the NERC system ever function?
what needs to be done to get it to function?
how long do the contractor and client expect it to take?
how long will it take?

None of my correspondents disagreed with my position that it was not possible to judge how long a technical audit would take until the pre-audit reported. However, I can give some upper and lower bounds. A lower bound would occur if the pre-audit team is able effectively to determine

1. where the technical problems lie and what can be done about them; or else
2. that the system is so hopelessly convoluted that there's no way any team would be able to dig inside it and find out what was up in less than a year.

There is significant evidence against expecting either of these two situations:

Concerning Case 1, let's assume reasonable economic behavior of contractor and client. If the pre-audit team could do it, then they could have been hired by the contractor or the client already to do it, saving both of them a lot of embarrassment before the Subcommittee and the general public, and also improving their chances of getting similar contracts in the future; this didn't happen, suggesting they doubt it would be possible. Besides, I doubt it's a matter of some `Eureka' solution to some persistent bugs, which is the kind of situation that would yield to a pre-audit alone;
concerning Case 2, if the system were that bad, I doubt the client would have been so persistently optimistic, and I doubt they would have awarded the bid for NSC to the same contractor.

The upper bound is of the order of a year, which is the lower time estimate for the `convoluted' case. Whereas I doubt the lower bound would be achieved, from my knowledge of reverse-engineering projects I think it quite conceivable that the upper bound could be achieved.

To back up my resource estimate for the pre-audit with some circumstantial evidence, I note that the research for and preparation of my memo (Lad.13.11.97) took some 40 person-hours of my time (plus some hours for my correspondents at NAV CANADA); and that the `res and prep' for this note will have taken some 20 person-hours of my time (plus some for my correspondents). This alone almost reaches the lower bound of the pre-audit time for one auditor.

The Subcommittee Clerk, Christopher Stanton, has advised me that the Subcommittee has received an estimate that a full technical audit would take three weeks (Sta.19.02.98). If the Subcommittee will pardon my levity, I would note that according to the analysis above this makes the estimator a `lower bounder'...

Does Safety Require Dual-Sourcing?

Sir Ronald Mason points out that flight-control systems for fly-by-wire aircraft such as the Eurofighter will be developed with at least two `channels' (independently-operating subsystems working from the same or similar input data, with the same output parameters). The idea is that a failure of one channel can be detected (say, by some simple comparison) or even corrected (if there are four channels or more: by some sort of `Byzantine agreement' algorithm).

When considering the properties of reliable multi-channel systems, it is important to distinguish between hardware and software failures. A hardware failure usually stops one channel producing output at all - it simply breaks. This is easy to detect, and it is easy to figure out which channel is broken, so one can continue using the other channel and limp back home carefully. Going to multiple channels, such as in the space shuttle, gives you strength in depth, and you don't have an emergency after just one hardware failure. However, software failures can be very different. Software is often thought of as `design' in the aerospace industry (although some of us resist this classification), and when the software does not work right, the hardware might very well continue producing results from it, oblivious to the fact that these results are nutty (I mean, how could it tell?). Now, one has a problem telling which channel is faulty, because one channel is `lying'. In fact, at least four channels are needed to be able to detect and correct one `liar'. It may also be possible, through signal distortion or other selective interference, that hardware can `lie' like this. Such lying, software or hardware, is called `Byzantine failure' after the original identification and first algorithmic solution of such problems by my former SRI colleagues Leslie Lamport, Marshall Pease and Robert Shostak in the late 1970s/early 1980s ((ShLa98, pp132-135) gives a pleasant account of the origins of this work, while minimising the technical detail).
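The `four channels for one liar' figure is an instance of the Lamport-Shostak-Pease bound that agreement with unsigned messages needs at least 3m+1 participants to tolerate m Byzantine ones. The following is a minimal sketch of one round of their oral-messages scheme for m = 1, offered only as an illustration: the function names, the binary values and the `lie` callback are assumptions made for this example, and it is not a description of how any flight-control or ATC system is implemented.

```python
from collections import Counter

def majority(values):
    """Return the strict-majority value, or None if no value has one."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None

def om1(n, commander_value, traitor, lie):
    """One round of oral-messages agreement for n nodes and one faulty node.

    Node 0 is the commander (the source of the value); nodes 1..n-1 are
    lieutenants. `traitor` is the index of the single faulty node, and
    `lie(sender, receiver, value)` is what that node sends instead of the
    truth (hypothetical interface, chosen for this sketch).
    Returns {lieutenant: decision} for the loyal lieutenants only.
    """
    lieutenants = range(1, n)
    # Step 1: the commander sends its value to every lieutenant.
    received = {
        i: lie(0, i, commander_value) if traitor == 0 else commander_value
        for i in lieutenants
    }
    # Step 2: every lieutenant relays what it received to all the others,
    # and each loyal lieutenant takes the majority of what it has seen.
    decisions = {}
    for i in lieutenants:
        if i == traitor:
            continue
        seen = [received[i]]
        for j in lieutenants:
            if j == i:
                continue
            relayed = received[j]
            if j == traitor:
                relayed = lie(j, i, relayed)
            seen.append(relayed)
        decisions[i] = majority(seen)
    return decisions

if __name__ == "__main__":
    flip = lambda sender, receiver, value: 1 - value
    # Four nodes, one lying lieutenant: the loyal lieutenants still decide 1.
    print(om1(4, 1, traitor=3, lie=flip))
    # Three nodes, one lying lieutenant: the loyal one sees one vote each way.
    print(om1(3, 1, traitor=2, lie=flip))
```

With four nodes the loyal lieutenants out-vote the single liar; with three, the one loyal lieutenant sees one vote each way and has no basis for choosing, which is the intuition behind the four-channel requirement.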

Sir Ronald calls this `dual sourcing'; it might be appropriate to call it `multi-sourcing' when more than two channels are used. When applied to smallish software systems which run more-or-less independently, it is generally called `N-version programming', and is indeed a standard tool in the workbag of safety-critical and reliable system designers.

These ideas enable me to reconstruct what I take to be Sir Ronald's reasoning.

Suppose there are two (or more) channels, programmed identically.
- If there's a hardware failure, one channel will cease functioning, this cessation will be detected, and the other channel will take over the function;
- if there's a software failure, since both channels are identically programmed, this failure, this `lie', will appear on both channels, it won't be detected, and you're really in the soup.
Suppose there are two (or more) channels, programmed disparately.
- Hardware failure: same as above;
- software failure: there will be a failure on one channel, not on both; you will detect the difference, and if you can tell which is `lying', you can use the other channel.
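The software half of this reasoning can be sketched in a few lines. The example is a toy under stated assumptions: the `requirement' (double the input), the two program versions and the comparator are all hypothetical, and the disparate case assumes that the second version does not share the first one's fault, which is exactly the assumption examined later in this note.

```python
def run_channels(programs, x):
    """Run each channel's program on the same input and collect the outputs."""
    return [program(x) for program in programs]

def comparator(outputs):
    """Simple comparison: a fault is flagged only when the channels disagree."""
    return "fault detected" if len(set(outputs)) > 1 else "no fault detected"

# Hypothetical requirement: the output is twice the input.
def version_a(x):
    # Contains a design fault: wrong answer for inputs of 100 or more.
    return 2 * x if x < 100 else 0

def version_b(x):
    # A disparately developed version that happens to be correct here.
    return 2 * x

identical = [version_a, version_a]   # two channels, programmed identically
disparate = [version_a, version_b]   # two channels, programmed disparately

if __name__ == "__main__":
    for name, channels in (("identical", identical), ("disparate", disparate)):
        outputs = run_channels(channels, 150)
        print(name, outputs, comparator(outputs))
```

Note that in the disparate case the comparator only reports a disagreement; deciding which channel is `lying' needs further information or more channels.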

Thus `dual-sourcing', to use Sir Ronald's term, seems to give you more chance to work around software failures. The space shuttle, a `very old' design by today's standards, has a number of identical channels, and one disparate channel; Sir Ronald cites the Eurofighter; the Airbus A320/330/340 uses slightly different computer systems with overlapping functions, and each system is designed by different divisions of Sextant Avionique (Sp87, pp131-133) (this is not quite dual-sourcing, but close, if the company allows divisions to develop their individual design standards).

Sir Ronald proposes that a failure to dual-source is "inconsistent with the basic principles of safety-critical software development practice" (Mas.10.10.97). There are three reasons which led me to query Sir Ronald's strong statement:

The fly-by-wire Boeing 777, whose avionics is designed and built by the British company GEC-Marconi, uses one main channel, which has internal fault-tolerance; backup is provided by a very much simpler and very different backup flight control computer, which the pilots are able to switch to manually, and which provides `direct control' (i.e., it more-or-less mimics traditional control via cables and hydraulic lines);
the efficacy of N-version programming was cast into doubt in justly famous work by Professors John Knight and Nancy Leveson (KnLe86) (see also (BrKnLe89) (BrKnLe90)). They showed correlations between the failures of programs developed by supposedly independent teams. You apparently cannot simply assume that different teams working separately are going to make different sorts of mistakes. But why not? Part of the answer may be:
The Jet Propulsion Laboratory at Cal Tech in Pasadena, which builds much of the software for NASA space probes (including the recent Mars Pathfinder), performed a study of software failures in mission-critical software. They found that well over 95 per cent of the problems occurred because of errors in requirements specifications - the critical part of system development in which one states specifically and formally what the system is supposed to accomplish. (This was reported at a software conference by Robyn Lutz of NASA-JPL and Iowa State University, who participated in the study as I understand, but a quick WWW search failed to turn up the precise reference. It's a `folk theorem' in software engineering now.)

The Boeing design casts doubt on Sir Ronald's implicit assertion that dual-sourcing is a `basic principle' of safety-critical software or hardware engineering. Back-ups, alternative solutions, disparate design, certainly, but not necessarily dual-sourcing. No one expects Boeing 777s to fall out of the sky, and none has done so yet, despite the worries of Computer Weekly, who ran a series of articles on it ((CW.24.05.95a), (CW.24.05.95b), (CW.01.06.95), (CW.08.06.95); to which I replied in (Lad.15.06.95)). A single highly-fault-tolerant design, able to tolerate multiple layers of degraded service, with a simple temporary alternative if this all fails, is indeed acceptable to many experts. At worst, the jury is still out on which type of design is `better'. It may very well remain out.

Secondly, the requirements are laid down by the client. The JPL study shows that if the requirements are similar, in particular if they're written by the same client, there is evidence that failures in the software will be correlated, no matter who wrote the various versions of the code and how it was written. The Knight-Leveson studies support that view experimentally, adding evidence also that the similarity of requirements is not the only reason to expect correlated errors. However, the NERC/NSC question is not the N-version programming question per se. The NERC/NSC systems are much bigger than those considered by Knight, Leveson and Lutz and it is not clear how these results will scale up in detail. The broad principle, however, still makes sense no matter what the scale. If your requirements are similar or identical, then requirements failures will show up in all versions, or in none.

The NERC and NSC systems will be working to very similar requirements specifications; they will also have to interface heavily as en-route traffic moves from one FIR to the other; they are specifically designed, as most ATC systems are, to support `classic' degradation-of-service procedures, which typically lead to delays but not to safety compromises (for example, nearly 200 hours in 11 separate incidents of complete system outages of en-route centers in the US between September 12, 1994 and September 12, 1995 involved only one case of loss of separation of controlled aircraft (NTSB.96)). This point has also been made to the Subcommittee by Mr. Semple (TSC.19.11.97). It could therefore be argued against Sir Ronald's analogy to the Eurofighter that superficially the case is more similar to the Boeing 777 than to the Eurofighter or Airbus.

My view is that in fact a simple analogy between flight control systems and the NERC/NSC systems is tenuous to the point of being unhelpful. Some principles apply similarly; some do not. I acknowledge that our entire safety-critical systems knowledge leads only to a limited number of architectural principles, but which principles apply cannot be decided a priori, and Sir Ronald's view is by no means universal amongst experts. Questions of safety can mostly be decided only with detailed knowledge of requirements and constraints on architecture. I put the question also to John Knight, Nancy Leveson, and Fred Schneider, who confirmed my view. John replied:

The complexities [of this question] are such that there is no simple answer. [...] The SYSTEM (both UK and Scotland being thought of as part of a single large system) issues here are more than just similarity vs. dissimilarity. The goal is to meet a variety of complex requirements many of which relate to dependability. For such an important and complex system, any decision about the system architecture has to be made in the context of an analysis of the system and the various trade-offs that need to be made.
(Kni.05.03.98)

Nancy was typically direct. When evaluating her view, the Subcommittee might like to consider that she is widely regarded as the foremost authority on software safety in the world, and is one of the founders of the discipline.

As [Ladkin said], this is not N-version programming. These two systems have to work together, at least at the interfaces. It seems to me more a case of "would you build a plane with a jet on one wing and a prop engine on the other so they won't have the same failures." It's hard enough to integrate components built by a single company.
[...]

[Concerning Sir Ronald's suggestion that `dual-source' is a basic safety principle in software......]
Nonsense. [...] The safety of the new UK system has almost nothing to do with what Sir Ronald is worrying about.
(Lev.05.03.98)

Fred said:

The pragmatics of building two separate systems would depend on their interface. [...] Compare [two cases.] [Case 1:] both systems independently monitoring the same airspace [Case 2:] both systems communicating at the level of internal proprietary database records (corresponding to flight strips). [Case 1] is easier to manage with separately-built systems than [Case 2] is.
There are also user-interface pragmatics. Two separate systems are likely to have different user interfaces. Is there an expectation that controllers who can work one system should be able to sit down and work at the other? (Different manufacturers would make that a questionable proposition, even if the manufacturers were given detailed specifications for the UI.) User-interface details were a major stumbling block for the US AAS system.
The usual reasoning behind "dual sourcing" etc. is that each source will exhibit independent modes of failure. Both of your systems will be built from [similar] specs, though. [See Knight-Leveson]
Second, there is the issue of cost trade-off. More failure-detection/replication at lower levels [could] be more cost effective [than dual-sourcing], arguing for two systems from the same contractor.
Finally, if it is difficult to build one system correctly, then building [two] increases the likelihood that you will build at least one correctly but also decreases the likelihood that you will build both correctly. And if both are not correct are you ahead of the game? That would depend on whether failures are detectable and whether adequate capacity exists for one system to take the load of the other.
(Sch.05.03.98)
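The final point in the quotation above can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each system is built correctly with the same probability p and that the two builds succeed or fail independently; the function name and the sample values of p are hypothetical, and the independence assumption is itself the one called into question by the Knight-Leveson results.

```python
def build_outcomes(p):
    """Return (P(at least one correct), P(both correct)) for two builds.

    Assumes each build is correct with probability p, independently of
    the other (an idealisation made for this sketch).
    """
    at_least_one = 1 - (1 - p) ** 2
    both = p * p
    return at_least_one, both

if __name__ == "__main__":
    for p in (0.5, 0.8):
        at_least_one, both = build_outcomes(p)
        print(f"p={p}: at least one correct {at_least_one:.2f}, both correct {both:.2f}")
```

For any p strictly between 0 and 1, the chance of at least one correct system is above p while the chance of two correct systems is below it, which is the trade-off described in the quotation.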

For these reasons, and more, my colleagues and I reject a categorical statement that single sourcing of the NERC/NSC is `inconsistent with the basic principles of safety-critical software development practice'. We believe that the safety issues here are more subtle, and do not yield to such simple aphorisms.

But I would ask the Subcommittee to note that this conclusion is agnostic. We do not offer any view in this note as to whether single-sourcing or dual-sourcing is more appropriate for NERC/NSC.

References

(BrKnLe89): Susan Brilliant, John Knight and Nancy Leveson, The Consistent Comparison Problem in N-Version Software, IEEE Transactions on Software Engineering 15(11):1481-5, November 1989.

(BrKnLe90): Susan Brilliant, John Knight and Nancy Leveson, Analysis of Faults in an N-Version Software Experiment, IEEE Transactions on Software Engineering 16(2):238-47, February 1990.

(CW.24.05.95a): Charles Walker, Boeing slated over 777 software setup, Computer Weekly, 24 May 1995.

(CW.24.05.95b): Another set of safety-critical doubts, Editorial, Computer Weekly, 24 May 1995.

(CW.01.06.95): Charles Walker, Is Boeing flying in the face of safety?, Computer Weekly, 1 June 1995.

(CW.08.06.95): Charles Walker, Pilot shaken by faults in 777 flight computers, Computer Weekly, 8 June 1995.

(FI.21.05.97): Andrew Doyle, Moving Target, Report on the NERC software problems in Flight International, 21-27 May 1997.

(Kni.05.03.98): John Knight, Reply to P. Ladkin, 5 March 1998.

(KnLe86): John Knight and Nancy Leveson, An experimental evaluation of the assumption of independence in multi-version programming, IEEE Transactions on Software Engineering SE-12(1):96-109, January 1986.

(Lad): Peter Ladkin (ed.), Computer-Related Incidents with Commercial Aircraft, a compendium of references, accident reports, and reliable discussion and commentary. Available through http://www.rvs.uni-bielefeld.de.

(Lad.15.06.95): Peter Ladkin, Flak over Boeing 777 article, Letter to the Editor, Computer Weekly, 15 June 1995.

(Lad.13.11.97): Peter Ladkin, Letter to the Transport Subcommittee, 13 November 1997.

(Lev.05.03.98): Nancy Leveson, Reply to P. Ladkin, 5 March 1998.

(Mas.21.07.97): Sir Ronald Mason, Letter to the Rt. Hon. John Prescott, MP, 21 July 1997.

(Mas.10.10.97): Sir Ronald Mason, Letter to the Transport Subcommittee, 10 October 1997.

(NATS.28.01.98): NATS, Further Advice to the Transport Sub-Committee on Air Traffic Control, 28 January 1998.

(NTSB.96): U.S. National Transportation Safety Board, Special Investigation Report: Air Traffic Control Equipment Outages, Report NTSB/SIR-96/01, 23 January 1996. Available electronically over the WWW in (Lad).

(Sta.19.02.98): Christopher Stanton, Letter to P. Ladkin, 19 February 1998.

(TSC.19.11.97): Transport Subcommittee, Minutes of Meeting of 19 November 1997.

(TSC.29.01.97): Transport Select Committee, Minutes of Meeting of 29 January 1997.

(Sch.05.03.98): Fred B. Schneider, Reply to P. Ladkin, 5 March 1998.

(ShLa98): Dennis Shasha and Cathy Lazere, Out of Their Minds: The Lives and Discoveries of 15 Great Computer Scientists, Copernicus (an imprint of Springer-Verlag), 2nd edition, 1998 (1st edition, 1995).

(Sp87): Cary R. Spitzer, Digital Avionics Systems: Principles and Practice, McGraw-Hill, 2nd edition, 1987.

Appendix: Relevant Information Concerning Participants

John Knight is Professor of Computer Science at the University of Virginia, and before that worked at NASA's Langley Research Center. (His motto: "Be VERY careful which airplanes you fly in.") John has worked extensively on the U.S. ATC system recently to study info-system survivability, and is well aware of the management issues involved as well as some of the technical issues. He is well-known for fundamental studies with Professor Nancy Leveson querying the effectiveness of the so-called `N-version programming' technique for software fault-tolerance that is prevalent in the aerospace industry. I would reckon this work amongst the most fundamental papers in software engineering. John is presently also working on an FAA committee looking at ways to streamline the certification of on-board flight-crucial systems in aircraft. He said he'd be delighted to be involved in a pre-audit, but cautioned that he has some schedule constraints. He is British. His home page is http://www.cs.virginia.edu/brochure/profs/knight.html.

Nancy Leveson is Boeing Professor of Computer Science at the University of Washington, presently Jerome C. Hunsaker Visiting Professor in Aeronautics at MIT. She is widely regarded as the leading authority in software safety in the world, and is one of the founders of the discipline of software safety. She has consulted extensively for NASA, the FAA and other aerospace organisations, and is one of the developers of the requirements specifications for TCAS-II, the second version of the Traffic Alert and Collision Avoidance System, pioneered in the US and about to become mandatory in Europe. Her company is performing safety analyses of the NASA upgrades to the FAA ATC system, and has done an extensive safety analysis of the DFW TRACON. She is a U.S. citizen. Her home page is at http://www.cs.washington.edu/homes/leveson/

John McDermid is Professor of Software Engineering at the University of York, a Director of the British Aerospace Dependable Computing Systems Center and the Director for the Rolls-Royce University Technology Center. The University of York is one of 6 universities rated top (5*) in the Computer Science part of the Research Assessment Exercise. John is well-known for his involvement in industrial computing projects, particularly in aerospace, and for his devotion to `technology transfer' (as it is called) from industry to university. He would be the first British name that comes to mind for many computer scientists for an assessment of safety-critical software in aerospace. He has kept eyes on the NERC project from the `middle distance', as it were. He is willing and in principle able to participate in the first phase of an audit, but noted that his schedule is very tight. He is British. His home page is http://www.cs.york.ac.uk/~jam/

Fred Schneider is Professor of Computer Science at Cornell University in Ithaca, New York State. He is one of the top people in the design and verification of fault-tolerant algorithms for concurrent distributed systems, and some of his designs are the `standard' algorithms used in such systems. He participated with a Cornell colleague in major technical consulting on the FAA's AAS project 1991-93, redesigned the `application fault-tolerance' scheme, and did technical trouble-shooting, also on some of the lower-level fault-tolerant algorithm schemes. He indicated to me his more-than-willingness to participate in a pre-audit, but with, again, concerns about scheduling. He shared with me his concern, derived from his previous experience, that such an audit be performed at the highest level of technical competence available. He is a U.S. citizen. His home page is reachable through http://www.cs.cornell.edu/faculty/index.htm