8 March 1998
The Subcommittee has asked me for my opinion on:
First, I give my answers to the questions above; second, I justify my answers to the first question; third, I justify my answer to the second question. In an appendix, I give some further information on the colleagues with whom I consulted. I ask the Subcommittee to indulge my loquacity in trying to answer their questions to the best of my ability.
This question is technical: the functionality of the system needs to be assessed, which requires assessing real-time distributed algorithms and software, not just assessing `good practice' and project management. How fast and effective such a technical audit would be would depend very much on the capability of the experts who do it. I believe it is crucial that the pre-audit at least is performed by the leading authorities in large distributed, real-time, safety-relevant software systems, particularly those with experience of air traffic control systems (such authorities include Professors John Knight, Nancy Leveson, John McDermid and Fred Schneider).
My reasons for my estimate for the pre-audit resources, 3 or 4 experts spending 60-100 hours each, are as follows.
First, it's taken me 60 hours with unlimited access to project personnel to understand the overall architecture of a moderately complex real-time system before now, and one with few resource-contention issues. The NERC software is larger and more complex, and has resource contention issues (they can run a few stations simultaneously, but not all of them; the FAA AAS system also exhibited these symptoms). How long it would take to dig into it would depend on how modular the design and the implementation are. Truly modular designs are usually characterised by an initially high degree of reliability, which the NERC system has notably not exhibited (FI.21.05.97) (NATS.28.01.98). That the contractor is taking longer and longer to find and fix the `last few' problems (cf. the confidence of Mr. Semple four months ago that the system would be delivered by now (TSC.19.11.97), with my opinion at roughly the same time (Lad.13.11.97)) is circumstantial evidence that the design is not optimally modular. However, any large project must be distributed amongst many smaller development teams, which must develop their own individual parts, and this constraint enforces a `human resource' upper limit on any lack of modularity.
Second, Fred Schneider broadly agrees:
I envisage [a pre-audit] as 3-4 people sitting in a room, 8 hours/day, getting briefed by [Lockheed-Martin] system architects (with lower-level [Lockheed-Martin] technical people in the room to answer specific technical questions). I think that your guestimate of 2-3 weeks (closer to 3) of such meetings would allow the preaudit team to learn enough about the system to understand where the risk is. This would include identifying possible problems (including both those that the contractor is aware of and those that the contractor is not) as well as asking questions that the contractor can't answer (but whose answers will be important to understanding overall project risk).
(This was a rapid response by Fred to my query, and we assume a common context of discussion. I expect there to be substantial documentation on the design, and we are both assuming here that pre-audit team members would read this documentation privately. This of course consumes more person-hours.)
The output of the pre-audit would be answers to
The output of the audit would be answers to
None of my correspondents disagreed with my position that it was not possible to judge how long a technical audit would take until the pre-audit reported. However, I can give some upper and lower bounds. A lower bound would occur if the pre-audit team is able effectively to determine
To back up my resource estimate for the pre-audit with some circumstantial evidence, I note that the research for and preparation of my memo (Lad.13.11.97) took some 40 person-hours of my time (plus some hours for my correspondents at NAV CANADA); and that the `res and prep' for this note will have taken some 20 person-hours of my time (plus some for my correspondents). This alone almost reaches the lower bound of the pre-audit time for one auditor.
The Subcommittee Clerk, Christopher Stanton, has advised me that the Subcommittee has received an estimate that a full technical audit would take three weeks (Sta.19.02.98). If the Subcommittee will pardon my levity, I would note that according to the analysis above this makes the estimator a `lower bounder'......
When considering the properties of reliable multi-channel systems, it is important to distinguish between hardware and software failures. A hardware failure usually stops one channel producing output at all - it simply breaks. This is easy to detect, and easy to figure out which channel is broken, so one can continue using the other channel and limp back home carefully. Going to multiple channels, such as in the space shuttle, gives you strength in depth, and you don't have an emergency after just one hardware failure. However, software failures can be very different. Software is often thought of as `design' in the aerospace industry (although some of us resist this classification), and when the software does not work right, the hardware might very well continue producing results from it, oblivious to the fact that these results are nutty (I mean, how could it tell?). Now, one has a problem telling which channel is faulty, because one channel is `lying'. In fact, at least four channels are needed to be able to detect and correct one `liar'. It may also be possible, through signal distortion or other selective interference, that hardware can also `lie' like this. Such lying, software or hardware, is called `Byzantine failure' after the original identification and first algorithmic solution of such problems by my former SRI colleagues Leslie Lamport, Marshall Pease and Robert Shostak in the late 70's/early 80's ((ShLa98, pp132-135) gives a pleasant account of the origins of this work, while minimising the technical detail).
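The four-channel claim can be illustrated with a minimal sketch of the one-round relay scheme usually called OM(1) in the Byzantine-agreement literature. This is a toy model only, not anything taken from the NERC design: channel 0 is taken to be the source, the other channels relay what they received to each other and take a majority, and the names `om1`, `majority`, `flip` and `two_faced` are invented here for illustration.

```python
from collections import Counter

def majority(values):
    """Return the strict-majority value, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

def om1(n, source_value, liar, lie):
    """Toy one-round relay among n channels; channel 0 is the source.

    liar: index of the single faulty channel, or None.
    lie(sender, receiver, true_value): what the liar sends instead.
    Returns {channel: decided value} for the honest non-source channels.
    """
    def send(sender, receiver, value):
        return lie(sender, receiver, value) if sender == liar else value

    receivers = range(1, n)
    # Round 1: the source sends its value to every other channel.
    received = {i: send(0, i, source_value) for i in receivers}
    # Round 2: each channel relays what it received to all the others,
    # and each honest channel takes the majority of what it has seen.
    decisions = {}
    for i in receivers:
        if i == liar:
            continue
        votes = [received[i]]
        votes += [send(j, i, received[j]) for j in receivers if j != i]
        decisions[i] = majority(votes)
    return decisions

def flip(sender, receiver, value):
    """A liar that always reports the opposite value."""
    return "B" if value == "A" else "A"

def two_faced(sender, receiver, value):
    """A liar that tells different receivers different things."""
    return "A" if receiver % 2 else "B"

if __name__ == "__main__":
    print(om1(4, "A", 3, flip))       # four channels, one lying relay
    print(om1(4, "A", 0, two_faced))  # four channels, lying source
    print(om1(3, "A", 2, flip))       # three channels, one lying relay
```

With four channels the honest channels reach the same decision whether the liar is a relay or the source itself. With three channels, the single honest relay sees one `A' and one `B' and cannot tell which of the other two is lying, so it reaches no decision even though the source was honest.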
Sir Ronald calls this `dual sourcing'; it might be appropriate to call it `multi-sourcing' when more than two channels are used. When applied to smallish software systems which run more-or-less independently, it is generally called `N-version programming', and is indeed a standard tool in the workbag of safety-critical and reliable system designers.
These ideas enable me to reconstruct what I take to be Sir Ronald's reasoning.
Sir Ronald proposes that a failure to dual-source is "inconsistent with the basic principles of safety-critical software development practice" (Mas.10.10.97). There are three reasons which led me to query Sir Ronald's strong statement:
The Boeing design casts doubt on Sir Ronald's implicit assertion that dual-sourcing is a `basic principle' of safety-critical software or hardware engineering. Back-ups, alternative solutions, disparate design, certainly, but not necessarily dual-sourcing. No one expects Boeing 777s to fall out of the sky, and none has done so yet, despite the worries of Computer Weekly, who ran a series of articles on it ((CW.24.05.95a), (CW.24.05.95b), (CW.01.06.95), (CW.08.06.95); to which I replied in (Lad.15.06.95)). A single highly-fault-tolerant design, able to tolerate multiple layers of degraded service, with a simple temporary alternative if this all fails, is indeed acceptable to many experts. At worst, the jury is still out on which type of design is `better'. It may very well remain out.
Secondly, the requirements are laid down by the client. The JPL study shows that if the requirements are similar, in particular if they're written by the same client, there is evidence that failures in the software will be correlated, no matter who wrote the various versions of the code and how it was written. The Knight-Leveson studies support that view experimentally, adding evidence also that the similarity of requirements is not the only reason to expect correlated errors. However, the NERC/NSC question is not the N-version programming question per se. The NERC/NSC systems are much bigger than those considered by Knight, Leveson and Lutz and it is not clear how these results will scale up in detail. The broad principle, however, still makes sense no matter what the scale. If your requirements are similar or identical, then requirements failures will show up in all versions, or in none.
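The broad principle can be shown with a deliberately tiny, made-up example; nothing here is drawn from the NERC/NSC requirements. Suppose the client wants values from 0 to 100 inclusive to be accepted, but the shared requirement mistakenly says `strictly between 0 and 100'. Two versions written independently from that requirement (the hypothetical `version_a` and `version_b` below) agree with each other everywhere, so comparing their outputs detects nothing, yet both are wrong on the same inputs.

```python
def intended(x):
    """What the client actually wanted: 0 and 100 are both acceptable."""
    return 0 <= x <= 100

# Both versions below implement the (hypothetical) faulty requirement
# "strictly between 0 and 100", each in its own way.
def version_a(x):
    return 0 < x < 100

def version_b(x):
    if x <= 0:
        return False
    return not x >= 100

inputs = range(-5, 106)

# A comparison of the two versions finds nothing to report ...
disagreements = [x for x in inputs if version_a(x) != version_b(x)]

# ... although both fail, identically, against the client's intent.
common_failures = [
    x for x in inputs
    if version_a(x) == version_b(x) and version_a(x) != intended(x)
]

if __name__ == "__main__":
    print("disagreements:", disagreements)
    print("common failures:", common_failures)
```

The requirements fault shows up in both versions at the boundary values 0 and 100 and in no comparison between them, which is the sense in which such failures appear `in all versions, or in none'.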
The NERC and NSC systems will be working to very similar requirements specifications; they will also have to interface heavily as en-route traffic moves from one FIR to the other; they are specifically designed, as most ATC systems are, to support `classic' degradation-of-service procedures, which typically lead to delays but not to safety compromises (for example, nearly 200 hours in 11 separate incidents of complete system outages of en-route centers in the US between September 12, 1994 and September 12, 1995 involved only one case of loss of separation of controlled aircraft (NTSB.96)). This point has also been made to the Subcommittee by Mr. Semple (TSC.19.11.97). It could therefore be argued against Sir Ronald's analogy to the Eurofighter that superficially the case is more similar to the Boeing 777 than to the Eurofighter or Airbus.
My view is that in fact a simple analogy between flight control systems and the NERC/NSC systems is tenuous to the point of being unhelpful. Some principles apply similarly; some do not. I acknowledge that our entire safety-critical systems knowledge leads only to a limited number of architectural principles, but which principles apply cannot be decided a priori, and Sir Ronald's view is by no means universal amongst experts. Questions of safety can mostly be decided only with detailed knowledge of requirements and constraints on architecture. I put the question also to John Knight, Nancy Leveson, and Fred Schneider, who confirmed my view. John replied:
The complexities [of this question] are such that there is no simple answer. [...] The SYSTEM (both UK and Scotland being thought of as part of a single large system) issues here are more than just similarity vs. dissimilarity. The goal is to meet a variety of complex requirements many of which relate to dependability. For such an important and complex system, any decision about the system architecture has to be made in the context of an analysis of the system and the various trade-offs that need to be made.
Nancy was typically direct. When evaluating her view, the Subcommittee might like to consider that she is widely regarded as the foremost authority on software safety in the world, and is one of the founders of the discipline.
As [Ladkin said], this is not N-version programming. These two systems have to work together, at least at the interfaces. It seems to me more a case of "would you build a plane with a jet on one wing and a prop engine on the other so they won't have the same failures." It's hard enough to integrate components built by a single company.
[Concerning Sir Ronald's suggestion that `dual-source' is a basic safety principle in software......]
Nonsense. [...] The safety of the new UK system has almost nothing to do with what Sir Ronald is worrying about.
The pragmatics of building two separate systems would depend on their interface. [...] Compare [two cases.] [Case 1:] both systems independently monitoring the same airspace [Case 2:] both systems communicating at the level of internal proprietary database records (corresponding to flight strips). [Case 1] is easier to manage with separately-built systems than [Case 2] is.
There are also user-interface pragmatics. Two separate systems are likely to have different user interfaces. Is there an expectation that controllers who can work one system should be able to sit down and work at the other? (Different manufacturers would make that a questionable proposition, even if the manufacturers were given detailed specifications for the UI.) User-interface details were a major stumbling block for the US AAS system.
The usual reasoning behind "dual sourcing" etc. is that each source will exhibit independent modes of failure. Both of your systems will be built from [similar] specs, though. [See Knight-Leveson]
Second, there is the issue of cost trade-off. More failure-detection/replication at lower levels [could] be more cost effective [than dual-sourcing], arguing for two systems from the same contractor.
Finally, if it is difficult to build one system correctly, then building [two] increases the likelihood that you will build at least one correctly but also decreases the likelihood that you will build both correctly. And if both are not correct are you ahead of the game? That would depend on whether failures are detectable and whether adequate capacity exists for one system to take the load of the other.
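The arithmetic behind this last point can be sketched under one strong simplifying assumption, made here purely for illustration: each system is built correctly with the same probability p, independently of the other. (The Knight-Leveson results cited above are a reason to doubt independence in practice, so the figures are indicative only.) The function name `two_system_odds` is invented for this sketch.

```python
def two_system_odds(p):
    """Chances for two systems, each correct with probability p.

    Assumes the two builds succeed or fail independently, which is
    an illustrative simplification rather than an established fact.
    Returns (P(at least one correct), P(both correct)).
    """
    at_least_one = 1 - (1 - p) ** 2
    both = p ** 2
    return at_least_one, both

if __name__ == "__main__":
    for p in (0.5, 0.8, 0.95):
        at_least_one, both = two_system_odds(p)
        print(f"p={p:.2f}  at least one: {at_least_one:.4f}  both: {both:.4f}")
```

For p = 0.8, for instance, the chance of at least one correct system rises to about 0.96 while the chance that both are correct falls to about 0.64. Whether that trade is worth making depends, as the text says, on whether failures are detectable and whether one system can carry the load of the other.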
For these reasons, and more, my colleagues and I reject a categorical statement that single sourcing of the NERC/NSC is `inconsistent with the basic principles of safety-critical software development practice'. We believe that the safety issues here are more subtle, and do not yield to such simple aphorisms.
But I would ask the Subcommittee to note that this conclusion is agnostic. We do not offer any view in this note as to whether single-sourcing or dual-sourcing is more appropriate for NERC/NSC.
(KnLe86): John Knight and Nancy Leveson, An experimental evaluation of the assumption of independence in multi-version programming, IEEE Transactions on Software Engineering SE-12(1):96-109, January 1986.
(Lad): Peter Ladkin (ed.), Computer-Related Incidents with Commercial Aircraft, a compendium of references, accident reports, and reliable discussion and commentary. Available through http://www.rvs.uni-bielefeld.de.
(NTSB.96): U.S. National Transportation Safety Board, Special Investigation Report: Air Traffic Control Equipment Outages, Report NTSB/SIR-96/01, 23 January 1996. Available electronically over the WWW in (Lad).
(ShLa98): Dennis Shasha and Cathy Lazere, Out of Their Minds: The Lives and Discoveries of 15 Great Computer Scientists, Copernicus (an imprint of Springer-Verlag), 2nd edition, 1998 (1st edition, 1995).
John Knight is Professor of Computer Science at the University of Virginia, and before that worked at NASA's Langley Research Center. (His motto: "Be VERY careful which airplanes you fly in.") John has worked extensively on the U.S. ATC system recently to study info-system survivability, and is well aware of the management issues involved as well as some of the technical issues. He is well-known for fundamental studies with Professor Nancy Leveson querying the effectiveness of the so-called `N-version programming' technique for software fault-tolerance that is prevalent in the aerospace industry. I would reckon this work amongst the most fundamental papers in software engineering. John is presently also working on an FAA committee looking at ways to streamline the certification of on-board flight-crucial systems in aircraft. He said he'd be delighted to be involved in a pre-audit, but cautioned that he has some schedule constraints. He is British. His home page is http://www.cs.virginia.edu/brochure/profs/knight.html.
Nancy Leveson is Boeing Professor of Computer Science at the University of Washington, presently Jerome C. Hunsaker Visiting Professor in Aeronautics at MIT. She is widely regarded as the leading authority in software safety in the world, and is one of the founders of the discipline of software safety. She has consulted extensively for NASA, the FAA and other aerospace organisations, and is one of the developers of the requirements specifications for TCAS-II, the second version of the Traffic Alert and Collision Avoidance System, pioneered in the US and about to become mandatory in Europe. Her company is performing safety analyses of the NASA upgrades to the FAA ATC system, and has done an extensive safety analysis of the DFW TRACON. She is a U.S. citizen. Her home page is at http://www.cs.washington.edu/homes/leveson/
John McDermid is Professor of Software Engineering at the University of York, a Director of the British Aerospace Dependable Computing Systems Center and the Director for the Rolls-Royce University Technology Center. The University of York is one of 6 universities rated top (5*) in the Computer Science part of the Research Assessment Exercise. John is well-known for his involvement in industrial computing projects, particularly in aerospace, and for his devotion to `technology transfer' (as it is called) from industry to university. He would be the first British name that comes to mind for many computer scientists for an assessment of safety-critical software in aerospace. He has kept eyes on the NERC project from the `middle distance', as it were. He is willing and in principle able to participate in the first phase of an audit, but noted that his schedule is very tight. He is British. His home page is http://www.cs.york.ac.uk/~jam/
Fred Schneider is Professor of Computer Science at Cornell University in Ithaca, New York State. He is one of the top people in the design and verification of fault-tolerant algorithms for concurrent distributed systems, and some of his designs are the `standard' algorithms used in such systems. He participated with a Cornell colleague in major technical consulting on the FAA's AAS project 1991-93, redesigned the `application fault-tolerance' scheme, and did technical trouble-shooting, also on some of the lower-level fault-tolerant algorithm schemes. He indicated to me his more-than-willingness to participate in a pre-audit, but with, again, concerns about scheduling. He shared with me his concern, derived from his previous experience, that such an audit be performed at the highest level of technical competence available. He is a U.S. citizen. His home page is reachable through http://www.cs.cornell.edu/faculty/index.htm