Engineering Modeling and Analysis: Sound Methods and Effective Tools

Engineering Modeling and Analysis: Sound Methods and Effective Tools

A Dissertation Presented to the faculty of the School of Engineering and Applied Science, University of Virginia

In Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy
Computer Science

by

David Coppit January 2003

© Copyright January 2003

David Coppit
All rights reserved

Abstract

Developing high quality software tools for specialized domains is difficult. One problem is the cost of developing feature-rich and usable tool interfaces. Another problem is the task of providing a sound basis for trustworthiness of the tool and the overall method which it supports. In this dissertation we present and evaluate an approach which addresses these key difficulties. The approach is based on two concepts: using specialized and tightly integrated mass-market applications to provide the bulk of the tool’s functionality, and the use of formal methods for the precise specification of the tool’s domain-dependent modeling language. We have evaluated our component-based work in part by developing a tool using the technique, deploying it to NASA, and having engineers from across the organization use and evaluate it. In the area of formal methods, we have developed and validated, both informally and formally, a mathematically precise specification of the language employed by an innovative modeling and analysis method for the reliability of fault tolerant systems. We have also developed a prototype tool that shows in concrete terms that our combined approach can work. The chief contribution of this work is a new approach to developing software tools having formal foundations for trustworthiness and sophisticated user interfaces. Constituent contributions include a qualified positive evaluation of the component-based approach, a proof of feasibility of using formal methods for domain-specific modeling languages, and the precise definition of an important modeling language, namely one for dynamic fault tree analysis.


Acknowledgments

Let me first thank my advisor, Kevin Sullivan. It has been only through his patient guidance that I have learned the process of research. Not only has he worked hard to shape my ability to identify, formulate and express important research problems, but he has also introduced me to the community of software engineering researchers, and has provided me with every opportunity to excel in this field. I would also like to thank the rest of my committee: Worthy Martin, Jack Stankovic, Joanne Dugan and John Knight. Joanne Dugan, in particular, has provided invaluable expertise regarding dynamic fault tree modeling and analysis, and has been a pleasure to work with.

I certainly would not be in the field of computer science at all if my father, Jim Coppit, had not bought me that Commodore Vic-20 all those years ago. That computer, and the Commodore 64 which followed, provided the means by which I taught myself my first programming language (BASIC), and later led to my learning and programming an accumulator-based CPU in assembly language. My mother, Wanee An-McCabe, has taught me, by example, the value of strong character. She has continued to support and encourage me through the years, and her own personal triumph over her hardships in Korea is an inspiration. I can always trust that no matter how many degrees I earn, she’ll always remind me not to “get a big head”.

I would also like to thank my friends from Ole Miss. John B. Denley, Elliot Hutchcraft, Chad Harrison, Lynn Michelletti, and Amanda Keith helped me stay sane despite their own heavy courseloads. Of course, the incomparable SIGBeer crowd cannot be forgotten. Thanks go to Sean McCulloch, David Engler, Mike Nahas, John Regehr, Brian “Paco” Hope, Brian White, John Karro, Kimberly Hanks, and Gabriel Ferrer. I’m also grateful for the friendships I forged later on in grad school. Rashmi Srinivasa and Nuts (Anand Natrajan) are good friends and gracious hosts. Glenn Wasson, Mark Morgan, and Karine and Anh Nguyen-Tuong are people whom I was initially only acquainted with and later came to know well. Special thanks to John Regehr, Sean McCulloch, Kim Hanks, and John Karro for providing feedback on a draft of this dissertation.

I’d also like to thank my new family for their support: my parents-in-law Carmen and Maryanne Rodgers, and the rest of clan Rodgers. If they have their way, I’ll go down in the annals of history for saving all of NASA. Daniel, thanks for being a supportive friend–in philo, bro’, in philo.

Finally, there is my wife Dorothy. She has always been there for me, putting her life on hold while I complete my degree. Every minute of these five years has been wonderful, and I’m looking forward to the rest of our lives.

Contents

1 Introduction
  1.1 Motivation
  1.2 Approach
  1.3 Evaluation
  1.4 Thesis, Claims and Contributions
  1.5 Outline

2 Dynamic Fault Tree Modeling and Analysis: Background and State Of the Art
  2.1 Background and Overview
  2.2 Ambiguity Resulting from Imprecise Specification
  2.3 Inadequate Tool Support
  2.4 DFTs as a Domain-Specific Language

3 Formal Methods for Engineering Modeling And Analysis
  3.1 Formal Methods: Overview
  3.2 Formal Specification for Domain-Specific Modeling and Analysis
  3.3 An Experiment: Formalizing DFTs
  3.4 Related Work

4 Formalization of the Dynamic Fault Tree Language
  4.1 Overview of the DFT Specification
  4.2 Formalization of the DFT Specification
  4.3 Validation of the Specification

5 Formal Methods for Modeling and Analysis: Results and Evaluation
  5.1 Results of Formalization
  5.2 Cost
  5.3 Evaluation

6 Package-Oriented Programming
  6.1 Component-Based Software Development
  6.2 Package-Oriented Programming
  6.3 An Evaluation: Package-Oriented Programming for Tools
  6.4 Related Work

7 Galileo: A Tool Built Using the POP Approach
  7.1 Description and Features
  7.2 Development Experiences
  7.3 End-User Evaluation

8 Evaluation of the POP Approach
  8.1 Targeting Industrial Viability
  8.2 Component Capabilities
  8.3 Challenges

9 A Combined Approach to Building Tools
  9.1 Software Tools
  9.2 A New Approach
  9.3 Related Work

10 Nova: A Tool Built Using the Combined Approach
  10.1 Revising the DFT Syntax
  10.2 The Nova Architecture
  10.3 The Textual Dynamic Fault Tree Editor
  10.4 The Graphical Dynamic Fault Tree Editor
  10.5 The Basic Event Model Editor
  10.6 A New DFT Solver
  10.7 The Resulting Nova Modeling and Analysis Tool

11 Evaluation of the Combined Approach
  11.1 Package-Oriented Programming
  11.2 Formal Methods
  11.3 Cost
  11.4 Applicability and Limitations of the Approach

12 Conclusion
  12.1 Summary
  12.2 Future Work

A The Z Formal Specification Language

B A Formal Specification of Dynamic Fault Trees
  B.1 Scope
  B.2 Conventions and Notation
  B.3 Definitions
  B.4 Basic Types
  B.5 Abstract Syntax of Fault Trees
  B.6 Failure Automata
  B.7 Semantics of Fault Trees in Terms of Failure Automata
  B.8 Markov Models
  B.9 Basic Event Models
  B.10 Semantics of Failure Automata in Terms of Markov Models
  B.11 Analyses
  B.12 Fault Tree Subtypes

Bibliography

List of Figures

2.1 A simple static fault tree for a nuclear reactor
2.2 A simple dynamic fault tree
4.1 Specification strategy
4.2 Example fault tree
4.3 Example failure automaton
4.4 Example Markov model
4.5 Original domain check theorem for SpareBeingUsed
4.6 Z/Eves proof commands for SpareBeingUsed domain check
4.7 Proof goal for SpareBeingUsed domain check
4.8 Final domain check theorem for SpareBeingUsed
4.9 Z/Eves proof commands for SpareBeingUsed domain check
4.10 SetOfCausalBasicEventsIsSetOfBasicEvents theorem
4.11 CausalBasicEventsInBasicEvents theorem
4.12 FailureAutomatonTransitionInBasicEvents theorem
4.13 CausalBasicEventsInBasicEvents proof
5.1 PAND with two replicated inputs
5.2 A subtlety concerning non-determinism
5.3 Spare gates taking from a common pool of spares
5.4 Cascaded FDEPs
5.5 Cyclic FDEPs
6.1 Reliasoft Blocksim
6.2 Microsoft Visio
7.1 A screenshot of Galileo/ASSAP 3.0.0
7.2 The architecture of Galileo
9.1 Common structure for tools
10.1 Original depictions of DFT shapes
10.2 Revised depictions of DFT shapes
10.3 The architecture of Nova
10.4 A screenshot of the textual editor
10.5 Checking the syntax of a malformed fault tree
10.6 A screenshot of the graphical editor
10.7 Checking the syntax of a malformed fault tree
10.8 A screenshot of the basic event model editor
10.9 The analysis engine architecture
10.10 A screenshot of Nova
10.11 Cascading contention
10.12 Contention interacting with allocation

List of Tables

11.1 Lines of code written

Chapter 1 Introduction

1.1 Motivation

The growing use of computers in fields outside of computer science has spurred the need for software tools that allow users to express, manipulate, and analyze concepts in their domain. Engineers, for example, often use modeling and analysis tools to gain insight into the design of complex systems. The engineer creates a computer model of the system using the modeling language supported by the tool. The tool then interprets the meaning of the model in order to perform useful analysis. Finally, the engineer uses the analysis results to draw conclusions about the real system being modeled.

In order for new modeling and analysis methods to be developed and employed effectively, two key requirements must be satisfied. First, the methods themselves must have solid foundations for their dependability and trustworthiness. A central component of modeling and analysis methods is the modeling language—in order to trust the overall method, engineers must have confidence that the language has a precise and unambiguous meaning. Second, the method must be implemented in and supported by a high quality software tool. The tool must support the functionality that users desire, and must present that functionality in an easy-to-use manner. Furthermore, the tool must implement the modeling language faithfully, or the engineer will be unable to trust the analysis results that it provides.

Tools that do not meet these requirements are significantly less valuable to engineers. Without a precise definition of the modeling and analysis method, the engineer is unable to state with certainty that a given model is an accurate representation of the system. Software developers who implement the method have no documented specification, and are likely to make arbitrary or incorrect assumptions about the semantics of the modeling language. The complexity of the modeling and analysis method also increases the chance of implementation error, which is particularly troublesome because errors in analysis results can be difficult to detect. Tools that lack features or are hard to use do little to increase the productivity of engineers, and increase the likelihood that the tool will be misused. Users have become accustomed to features and usability on par with mass-market software packages.

Today, we lack technologies and approaches that enable the cost-effective construction of high quality software tools. The methods used in the construction of tools often lack mathematical foundations for dependability due to imprecise or incomplete definitions of the essential modeling languages. We also lack the ability to provide the features and usability that users expect, at a cost that is reasonable for the market of software tools. These problems are made even more acute by the small market for engineering tools, which makes amortization of the costs associated with today’s development approaches difficult.

As Knight observes [43], software tools are increasingly used in the development of safety critical systems, and this important role requires that the software be treated as a critical component of the overall engineering process. Evidence indicates that modeling and analysis tools developed using traditional techniques have deficiencies which can result in incorrect, and perhaps life-threatening, conclusions. In 1996 the United States Nuclear Regulatory Commission issued an alert [55] to all operators and builders of nuclear power plants, warning of significant errors in several tools used in nuclear reactor design and analysis. In this example, the safe design of nuclear reactors might have been compromised had the tools been used without adequate validation of the analysis results. Similarly, Hatton’s analysis of several seismic analysis tools showed that they produced different results even when ostensibly computing the same function [37]. Dugan’s analysis of several reliability analysis tools showed that they contained the same error in their analysis algorithms [4].


In this work we focus on two key problems in the development of high quality tools. The first problem is that delivering large amounts of functionality, along with high levels of usability, is costly using current software development techniques. Users have come to expect complex features such as print preview, unlimited undo, and crash recovery. Furthermore, users expect such functionality to be presented in an easy-to-use interface, the development of which often requires careful usability engineering. Tools that are feature-deprived will be less useful to users, and those that are hard to use will result in lower productivity and increased chance of misuse.

The second problem involves the soundness of the modeling and analysis methods, and the use of mathematically precise notations, tools, and techniques for the specification of the modeling language employed by the methods. Formal methods are rarely applied in practice, due at least in part to their reputation of being prohibitively costly. The second problem, then, is that we lack evidence that formal methods can be applied cost-effectively to ensure the soundness and dependability of modeling and analysis methods and the tools which support them. The use of formal methods, despite their cost, is more easily justified for software that plays a direct role in safety-critical systems. In contrast, applying formal methods to an entire software tool, with its myriad features, may not be justified given its indirect impact on the safety of the system being designed. One promising approach is to apply formal methods to the definition, design, and implementation of only the core modeling language supported by the tool. However, today we lack data which indicates that even this approach can yield benefits at acceptable cost.

Solutions to these two problems can significantly improve the task of developing high quality modeling and analysis tools at low cost. Reducing the high cost of developing features and usability would eliminate a large portion of the overall development costs. If formal methods can be applied effectively for the precise definition of the modeling languages supported by tools, then users can have increased confidence in the models that the tool supports and the analyses it performs.

1.2 Approach

What is needed is a new approach to domain-specific tools that addresses the problems we have identified. In this dissertation we identify and evaluate two key aspects of a new approach which we believe can address these problems. The first aspect involves the use of a relatively new but largely unevaluated component-based development approach to address the high cost of developing feature-rich, easy-to-use interfaces. The second aspect involves the judicious use of formal methods to address the difficulty of discovering and defining a modeling method’s subtle and complex domain-specific language and associated semantics.

This approach is based on the observation that the overall tool architecture can be characterized in terms of a superstructure which provides the model editing capability, and an analysis core which implements the semantics of the modeling language in the form of analysis algorithms. In terms of code size, the analysis core is relatively small compared to the supporting superstructure. However, the user places a larger amount of trust in the analysis core, as it implements the modeling language and performs the critical task of evaluating a given model in terms of its semantics.

The large number of features and the high demand for usability make component reuse attractive for the development of the superstructure. The critical nature of the modeling language supported by the tool makes it an attractive target for mathematically precise definition. The combination of these two approaches promises to reduce the cost of developing the sizable superstructure, while simultaneously increasing the overall dependability by developing a precise specification of the modeling language.

This work is structured in three parts. The first part investigates the feasibility of using the component-based approach to deliver rich functionality and high usability at low cost. The second part investigates the cost-effective application of formal methods for the development of modeling and analysis tools. The third and final part is an evaluation of the overall approach which demonstrates the feasibility of combining the two key approaches.

1.2.1 Package-Oriented Programming: Features and Usability at Low Cost

Package-Oriented Programming (POP) is a relatively new component-based development approach created in industry in which large, architecturally coherent mass-market applications are specialized and integrated as components of the overall system [20, 62, 63]. Such applications are often more than a million lines of code in size, and contain a large amount of carefully engineered functionality. They are advertised as conforming to a set of standards meant to ease their integration, and their volume-priced, mass-market sale as stand-alone applications means that they can be acquired for very little cost. Furthermore, mass-market packages cover general application domains such as textual and graphical editing, which generally correspond to functionality needed by software tools. Assuming that such applications can be used effectively as components, the POP approach has the potential to provide the bulk of the functionality and usability of software tools at a greatly reduced development cost.

The use of applications as components is not a new idea [10]. However, to date there have been few successful CBSD models, and none that support the construction of highly interactive software from multiple components. POP is a candidate approach that shows some promise but has not yet been carefully evaluated. As a key component of this research, we evaluate the feasibility of the package-oriented programming component-based software development approach, with particular emphasis on the construction of modeling and analysis tools.

There are several potential impediments to the success of this approach. First, it is not clear that packages provide the flexibility to be customized to the degree necessary to provide the required functionality of the tool. Second, large components bind numerous design decisions which may contradict the requirements of the component developer. Third, packages may make conflicting assumptions about the manner in which they will be integrated [31]. Fourth, packages may not provide the capabilities which enable their tight integration. Fifth, packages are subject to rapid evolution, the impact of which on component development and maintenance is not clear. Sixth, packages have undocumented limitations and flaws which increase the risk that the required levels of integration and functionality can not be achieved. Seventh, such packages have their own user interfaces which must also be specialized and integrated. Lastly, the extensive internal state and complex user interfaces of mass-market packages complicate testing.

Package-Oriented Programming is a promising approach for the development of feature-rich and highly-usable software tools. However, due to the problems above, we lack an understanding of the feasibility of the POP approach to provide the level of specialization and integration required to build high quality modeling and analysis tools. Central to the component-based aspect of this research is an assessment of the POP approach in this context.

We hypothesize that the POP approach can be used to develop sophisticated, industrially-viable tools for engineering modeling and analysis at a cost substantially lower than traditional methods. To test this hypothesis, we use the approach in the end-to-end development of a tool for reliability modeling and analysis. The tool is based on requirements derived from engineers working in industry, and delivers an innovative modeling method in a sophisticated user interface. The tool is currently in use at NASA, and its evaluation is based in part on survey data from NASA engineers. In attempting to develop such a tool, we gained insight into the key risks and opportunities of using the POP approach to build engineering tools. The result of this work is an in-depth evaluation of the POP model which demonstrates the ability of the approach to deliver the required functionality and usability, and provides insight into the development model with respect to the key challenges.

1.2.2 Formal Methods for Modeling and Analysis

Complex, critical software systems such as engineering modeling and analysis tools require the use of development techniques that help reduce errors in understanding, definition, and implementation. One promising approach is the use of formal methods—mathematically precise notations, tools, and techniques for the specification and implementation of systems. Central to the approach is a formal specification which precisely documents the system. This specification is validated against the understanding of the domain experts, and serves as the basis for a trustworthy implementation and precise user documentation.

Despite the availability and potential benefits of formal methods, such methods are still far from being in routine use for the development of modeling and analysis tools. The reasons [44, 69] are complex and multifaceted, involving issues such as the complexity of the languages and tools, lack of expertise, and the existing software development culture. But one widely cited reason is the perceived high cost associated with their use. The cost concern often prohibits the use of formal methods for all but the most safety-critical systems. In particular, the technical and economic feasibility of applying formal methods to the development of modeling and analysis tools remains unknown.

We hypothesize that it is possible to use formal methods in a cost-effective manner for the precise definition and validation of the domain-specific language supported by the tool. The approach we take to testing this hypothesis is to apply formal methods to the definition of a particular notation for the modeling and analysis of the reliability of fault-tolerant systems. Working in collaboration with domain experts, we formally define the syntax and semantics of the notation. We express the high-level semantics of the notation by defining a mapping from an arbitrary element in the syntactic domain to a corresponding element in an intermediate semantic domain which we have defined. We then refine the semantics from this intermediate semantic domain to well-understood, low-level semantic domains such as Markov chains and binary decision diagrams.

This effort helps to place the overall modeling and analysis method on a firmer foundation for trustworthiness. Once such a specification is defined and validated, there exists a mathematically precise mapping from an arbitrary model in the language to a corresponding representation in the well-understood domain. While the cost of developing and validating such a specification is not negligible, we hypothesize that doing so will significantly increase the trustworthiness of the notation, and, by extension, software tools which implement the notation.

1.3 Evaluation

Our evaluation mirrors the three-part structure of this research. The component-based user interface aspect and the formal methods aspect are evaluated separately and in depth. We then demonstrate the feasibility of their combined use in the development of a single modeling and analysis tool. The application area is that of tools for the analysis of fault-tolerant computer-based systems. The tools in this research support dynamic fault trees (DFTs), a novel notation and technique for the reliability modeling and analysis of systems having complex redundancy management [8, 24]. Fault trees model how component-level failures in an engineered system combine to produce system-level failures.

The evaluation of the POP model is based largely on the experiences garnered from the development of several versions of the Galileo [65] dynamic fault tree modeling and analysis tool. The Galileo project is an end-to-end evaluation of the POP approach in which requirements taken from users in industry form the basis for a tool whose validity is tested in an industrial setting. By targeting the needs of real users, we discovered important issues in the use of the POP approach which may have been overlooked in an academic case study. The development of Galileo has provided the basis for the evaluation of the POP approach in terms of the opportunities it enables, the associated risks, and its performance relative to the challenges it faces.

Early success for this tool led to a contract with NASA to develop a version for use in the aerospace industry. This new version, called Galileo/ASSAP, is now used by engineers at NASA to model faults on the International Space Station. It has been the primary tool used during three NASA-wide workshops on dynamic fault tree modeling and analysis. Participants at these workshops completed two surveys which provided independent evaluation of the ability of the POP approach to deliver tool features and usability.

To evaluate the use of formal methods for modeling methods, we use the Z specification language and associated tools to formally define and validate a specification of the dynamic fault tree notation. We first develop the formal specification, consulting the domain experts for clarification. Once this specification is completed, we meet again with domain experts to informally validate it via inspection. Next the specification is formally validated in certain limited respects using a theorem prover to prove key theorems. The experiences garnered from these efforts allow us to evaluate the suitability of today’s formal methods notations and tools, the extent to which their use helped the definition of the notation, and the costs involved with their use.

The evaluation of the combination of POP and formal methods is based on the development of Nova, an advanced prototype tool for dynamic fault tree modeling and analysis which is similar in function to Galileo. The user interface is implemented using a more aggressive application of the POP approach. The analysis core is based on the semantics of the dynamic fault tree as expressed in the formal specification of the DFT notation. The resulting tool demonstrates that the POP approach can be combined effectively with formal methods to develop tools that have rich feature sets and high usability, as well as formal foundations for dependability. Despite the features, usability, and dependability characteristics of the tool, Nova was implemented in less than 30,000 lines of code with less than two person-years of effort. This work estimate includes the specification and validation of the language associated with the modeling method, as well as its implementation in a feature-rich and usable tool.

1.4 Thesis, Claims and Contributions

This research makes contributions in three areas: component-based software development, applied formal methods, and the domain of reliability analysis of fault-tolerant systems. The thesis of this work is that the use of mass-market applications as components combined with the targeted use of formal methods can contribute significantly to the cost-effective development of high quality software tools. We will show that the POP approach is a potentially valuable component-based model for the development of feature-rich and usable modeling and analysis tools. We will also show that formal methods can be applied cost-effectively to provide formal foundations for complex modeling methods and the software tools that support them. Finally, this dissertation will demonstrate that these two elements can be combined effectively as part of an overall approach to building high quality modeling and analysis tools. My primary contribution to CBSD is an end-to-end experimental evaluation of the POP model, in which I present the essential characteristics of the approach, issues encountered during its use, and its potential as a successful model. As a case study, this research provides much-needed data on POP component extension, specialization, integration, and evolution. The Galileo prototype demonstrates that the POP approach can deliver high levels of functionality and usability, as confirmed by surveys of end users.


My chief contribution in the area of applied formal methods is the demonstrated feasibility of applying formal methods to the development of modeling and analysis tools to provide significant benefits for a modest cost. This case study also supports the notion that the use of formal methods in the design of languages for modeling and analysis methods and tools is needed, and provides data on the suitability, cost-effectiveness, and limitations of modern formal methods for this task.

The overall contribution of this work is the identification and development of two key components for the cost-effective development of high quality modeling and analysis tools. The prototype Nova tool represents a tangible demonstration of the feasibility of this overall approach. Nova is the first DFT tool which combines a sophisticated, highly-usable user interface with model editors and an analysis engine based on a formal definition of the modeling language.

Finally, my focus on the software engineering of Galileo has led to several contributions to the reliability analysis of fault-tolerant computer-based systems. My work improves the soundness of the dynamic fault tree modeling and analysis methodology by providing, for the first time, a reasonably complete specification of the formal semantics for the models expressed in the language. As a result of this effort, I have also developed a revised DFT language which resolves problems in the previous version of the language which were found during the specification effort. Nova is the first tool to implement a formally-based DFT notation using the POP approach. Finally, I have made many contributions in the development of Galileo, the experimental testbed for the POP approach which is currently being used by the NASA space station team to model faults as part of the failure diagnosis and repair process.

1.5 Outline

The next chapter provides background on the domain of dynamic fault tree modeling and analysis, the application domain which provides the basis for evaluating our approach. The remainder of this thesis is structured in three parts: the use of formal methods for the precise specification of modeling languages, the POP approach to building interactive software, and the combination of these two elements for building high quality software tools.


Chapters 3 to 5 address the use of formal methods. Chapter 3 introduces formal methods, discusses related work, and presents the key research questions related to their use for the definition of modeling languages. Chapter 4 describes the development and validation of the specification for DFTs. Chapter 5 presents the results of our specification effort and evaluates the use of formal methods for the definition of the DFT notation.

Chapters 6 to 8 address the use of package-oriented programming. Chapter 6 describes the POP approach, discusses related work, and presents the research questions to be addressed by the research. Chapter 7 presents experiences garnered during the development of several versions of the Galileo tool. Chapter 8 provides an evaluation of the POP approach.

Chapters 9 to 11 describe the combined approach for building tools and its evaluation. Chapter 9 describes the synthesis of the two elements of the approach and discusses related work. Chapter 10 describes the development of the new tool based on the combined approach. Chapter 11 evaluates the overall approach. Chapter 12 discusses future work and concludes.

Appendix A provides an overview of the Z specification language. Appendix B presents the complete formal specification of dynamic fault trees.

Chapter 2 Dynamic Fault Tree Modeling and Analysis: Background and State Of the Art

In this chapter we introduce the application domain in which this research has been conducted, and describe the state of the art prior to this research. In order to evaluate the technical and economic feasibility of our approach, we have employed it in the development of formally-based methods and tools for the domain of reliability modeling and analysis of fault-tolerant computer-based systems. In this domain, domain-specific languages are used to build models of systems with complex failure management. These models represent the failures and failure recovery behaviors of the system, and are analyzed to provide estimations of key system properties such as overall system unreliability. These analysis results are then used by the reliability engineer to assess the reliability of the system being modeled, and to perhaps modify the design of the system to improve its reliability. The particular modeling and analysis method which we address is dynamic fault tree analysis. Central to this method is the dynamic fault tree (DFT) language. We first describe the language informally, then address two concerns which helped to motivate this research: ambiguity in the language resulting from the lack of a complete and mathematically precise definition, and the lack of feature-rich and usable tools. These problems, while expected and accepted during the research and development phase of the DFT language, were obstacles for the eventual use of the approach by practitioners in the design of critical systems.


Figure 2.1: A simple static fault tree for a nuclear reactor

2.1 Background and Overview

Fault trees [70] were originally developed for analysis of the Minuteman missile system [71]. In this language, the occurrence of low-level events or the failure of basic components in a system are modeled using probability distributions. The composition of basic component failures is modeled using combinatorial constructs called gates. Today we refer to such fault trees as static fault trees, as they model Boolean combinations, but not orders, of failures in systems.

The semantics of static fault trees is fairly straightforward—a static fault tree represents a combinatorial composition of probabilities, which is a well understood mathematical problem. For example, if component A has a probability of occurrence of 0.4, and component B has a probability of occurrence of 0.5, then the probability of component A or component B occurring is 0.7 (0.4 for component A + 0.5 for component B − 0.4 × 0.5 for the probability of both occurring, which was counted twice).

Figure 2.1 presents a simple static fault tree. The top-level node, “Core Damage”, an OR gate, models the system as failing if either “Thermal Damage to the Core” or “Physical Damage to the Core” occurs. These gates are in turn modeled as OR gates whose inputs are basic events in the system.
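To make the combinatorial composition concrete, here is a minimal sketch (in Python; it is not part of the dissertation or any of its tools) of how OR gates over independent basic-event probabilities might be evaluated. The 0.4/0.5 values reproduce the worked example above; the probabilities assigned to the Figure 2.1 basic events are illustrative assumptions only.

```python
def or_gate(probs):
    """Probability that at least one of several independent input events occurs."""
    p_none = 1.0
    for p in probs:
        p_none *= (1.0 - p)        # probability this particular input does not occur
    return 1.0 - p_none

# Worked example from the text: P(A) = 0.4, P(B) = 0.5 gives 0.7.
print(or_gate([0.4, 0.5]))                # 0.7

# A tree shaped like Figure 2.1, with assumed basic-event probabilities.
physical = or_gate([0.5, 0.5])            # Mechanical Damage, Explosive Damage
thermal = or_gate([0.1, 0.2, 0.3])        # assumed values for the thermal causes
print(or_gate([thermal, physical]))       # top event: Core Damage
```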

Given such a model, a fault tree tool can perform a number of analyses. The most common analysis is the computation of the overall system failure probability. For example, if “Mechanical Damage” and “Explosive Damage” have constant failure probabilities of 0.5, then the probability of “Physical Damage to the Core” is 0.75. This probability is then composed with the probability computed for “Thermal Damage to the Core” to compute the overall system unreliability.

Static fault trees are in widespread use in the nuclear, aerospace, and railway industries. Fault trees have been developed for many large systems, and some authors have solved fault trees with 1020 basic events [22]. However, static fault trees, as a modeling language, are seriously limited due to their inability to model failure modes that depend on the order in which events occur. For example, a common method of increasing reliability is to use spare components which are swapped in after the primary component fails. The semantics of sparing behavior depend upon the notion of a primary component failing before a spare. Such fault tolerance mechanisms can not be modeled using static fault trees.

As a result, many reliability engineers utilize Markov chains as a more general modeling language which can be used to model combinatorial as well as order-dependent fault tolerance mechanisms. A continuous-time Markov chain is a state machine consisting of states and transitions between these states. Associated with each state is a probability of being in that state. The sum of all the probabilities for all the states must be 1. Associated with each transition is a rate which describes how the probability flows from the source state to the destination state. Given an initial state probability assignment, a Markov chain can be converted into a set of differential equations which are then solved for a given time period using standard methods. The resulting solution indicates the final probability assignments for the states in the Markov chain.
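As an illustration of the solution process just described, the following sketch builds a small hypothetical continuous-time Markov chain and computes its state probabilities at a mission time t. The four-state system, its failure rates, and the use of a matrix exponential (one standard way of solving the underlying differential equations) are assumptions for illustration; numpy and scipy are assumed to be available.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical system with components A and B. States:
# 0 = both working, 1 = A failed, 2 = B failed, 3 = both failed (system down).
lam_a, lam_b = 1e-3, 2e-3          # assumed constant failure rates (per hour)

# Generator matrix Q: off-diagonal entries are transition rates between states;
# each diagonal entry makes its row sum to zero.
Q = np.array([
    [-(lam_a + lam_b), lam_a,  lam_b,  0.0  ],
    [0.0,              -lam_b, 0.0,    lam_b],
    [0.0,              0.0,    -lam_a, lam_a],
    [0.0,              0.0,    0.0,    0.0  ],
])

p0 = np.array([1.0, 0.0, 0.0, 0.0])  # initial probability 1 in the fully operational state
t = 1000.0                           # mission time in hours
p_t = p0 @ expm(Q * t)               # solves dp/dt = p Q for the state probabilities at time t

print(p_t.sum())                     # ~1.0: probability is conserved
print(p_t[3])                        # probability that both components have failed by time t
```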

Markov states are order-independent in that a particular state is independent of the path from the initial state. In the context of reliability engineering, each state of a Markov chain encodes a state of the components in the system. Order-dependent information is stored as part of the state of the system. The initial state corresponds to the “fully operational” state of the system in which all components are fully operational. The other states correspond to situations in which one or more basic events has occurred. The initial state has an initial probability of 1, and the other states have an initial probability of 0. Each transition of the Markov chain corresponds to the occurrence of a basic event, and models how the occurrence of the basic event changes the state of the system. A basic event may represent a basic component or the occurrence of some external event. The transition rate of a transition is the occurrence rate of the associated basic event. The overall Markov chain is constructed by simulating the occurrence of each operational basic event in each state of the system, computing the resulting state, and adding a transition from the original state to the new state. When the Markov chain is solved, the system unreliability is computed as the probability of being in any state in which the top-level system event has occurred.

Unfortunately, Markov chains suffer from the state explosion problem—any nontrivial system can easily require millions of states in the Markov chain. For example, a fault tree with only nine basic events can, in the worst case, result in a Markov chain having over 980,000 states due to the exponential behavior of interacting combinations and permutations of basic event occurrences. (Since each state is a permutation of k occurred events out of n total events, the worst-case number of states is $\sum_{k=0}^{n} P(n, k)$, where $P(n, k)$ is the number of permutations of k elements taken from n total elements.) Even when optimizations are performed during the creation of the Markov chain, the resulting number of states is often exponential in the number of basic events. Developing and validating Markov models of such size by hand is prohibitively costly and extremely error-prone.
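A quick check of the worst-case count cited above, using the formula from the parenthetical: for n = 9 basic events the sum of permutations is 986,410, i.e. “over 980,000” states.

```python
from math import factorial

def worst_case_states(n):
    # sum over k of P(n, k) = n! / (n - k)!, the number of permutations of
    # k occurred events taken from n total events
    return sum(factorial(n) // factorial(n - k) for k in range(n + 1))

print(worst_case_states(9))   # 986410
```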

Dynamic fault trees [8, 24] (DFTs) address these problems by extending the traditional fault tree language to provide gates which allow order to be modeled. They can model dynamic replacement of failed components from pools of spares; failures that occur only if others occur in certain orders; dependencies that propagate the failure of one component to others; and the specification of constraints which limit possible failure orders. Recent advances in dynamic fault trees also allow the modeling of catastrophic single point failures, and systems which undergo multiple phases of operation.

The key benefit of DFTs over Markov chains is that they allow the reliability engineer to model the system using high-level domain-specific constructs. As a result, DFT models are compact compared to Markov chains, with sizes on the order of the number of basic events. The smaller size of the model combined with the domain-specific nature of the constructs helps the engineer to validate models which are expressed in the language. A second benefit of the DFT language is that the state is implicit, and becomes explicit during analysis, when the DFT is automatically converted to an equivalent Markov chain.

The analysis algorithms expand the DFT according to the simulated occurrence of basic events in the model, combined with the semantics of the gates. By automating the translation, the user is relieved of the burden of creating the Markov chain by hand.

Figure 2.2: A simple dynamic fault tree

Figure 2.2 presents a simple dynamic fault tree. In this model “PAND” is a priority-AND gate which occurs if the inputs Event B and Event C occur in order. “FDEP” is a functional dependency which indicates that the dependent events Event B and Event C occur if the trigger event Event A occurs. The state of this DFT consists of the occurrence status of the basic events, as well as the order of occurrence of the inputs to the PAND gate.

We now describe the individual DFT modeling constructs using informal natural language. While natural language is inherently ambiguous and imprecise, we use this description to provide the reader with a high-level understanding of the language. Also note that this description is of the original DFT language prior to our efforts to formalize and revise it, as described in Chapter 4.

• Replicated Basic Events: Basic events model unelaborated events using probability distributions and coverage models. As a convenience to the user, basic events can have a replication value, which allows a basic event to represent several identical events connected to the same locations in the fault tree. This is particularly useful in conjunction with the spare gate, where replacement components can be taken from a pool of identical components until that pool of components is exhausted. Basic events also have a dormancy factor, which attenuates the failure rate when the basic event is used as a warm spare (see below).

• AND: The output event occurs if all the input events have occurred.


• OR: The output event occurs if any of the input events have occurred.

• KOFM: The output event occurs if at least k out of m of the input events have occurred.

• Priority AND (PAND): The output event occurs if all the input events have occurred and they occurred in the order in which they appear as inputs.

• Cold, Warm, Hot Spare Gates (CSP, WSP, HSP): When the primary input occurs, available spare inputs are used in order until none are left, at which time the gate occurs. Spares can be shared among spare gates, in which case the first spare gate to utilize the spare makes the spare inaccessible to the other spare gates. The “temperature” of a spare gate indicates whether unused spares can not occur (cold), occur at a rate attenuated by the dormancy factor of the spare (warm), or occur at their full rates (hot).

• Sequence Enforcer (SEQ): Asserts that events can occur only in a given order. This gate has no output event.

• Functional Dependency (FDEP): Asserts a functional dependency—that the occurrence of the trigger event causes the immediate and simultaneous occurrence of the dependent basic events. This gate has no output event.
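The informal descriptions above can be read as simple predicates over basic-event occurrences. The sketch below is only an illustrative paraphrase of the combinatorial gates and the PAND ordering condition, using hypothetical occurrence times; it is not the formal semantics, which is developed in Chapter 4.

```python
# Inputs are occurrence times of events; None means "has not occurred".
# Illustrative paraphrase only -- the formal semantics appears in Chapter 4.

def and_gate(times):
    return all(t is not None for t in times)

def or_gate(times):
    return any(t is not None for t in times)

def kofm_gate(k, times):
    return sum(t is not None for t in times) >= k

def pand_gate(times):
    # All inputs occurred, in the left-to-right order in which they are listed.
    # (Whether simultaneous occurrences count as "in order" is the ambiguity
    # discussed in Section 2.2.)
    return and_gate(times) and all(a < b for a, b in zip(times, times[1:]))

print(and_gate([2.0, 7.5]))            # True: both inputs have occurred
print(kofm_gate(2, [1.0, None, 4.0]))  # True: at least 2 of the 3 inputs occurred
print(pand_gate([1.0, 4.0]))           # True: inputs occurred in the listed order
print(pand_gate([4.0, 1.0]))           # False: inputs occurred in the wrong order
```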

2.2 Ambiguity Resulting from Imprecise Specification

In order for a user to have confidence in their understanding of the DFT methodology, the language, and the automated translation performed by the tool, it is vital that the semantics of the language be precisely defined. “Semantics” refers to the mapping from expressions in the DFT language to corresponding analytical results.

Prior to this work, the semantics of the DFT language had been expressed only with informal prose, a few isolated examples, and prototype computer programs [5, 8, 24, 25, 36]. Unfortunately, these methods are inadequate for precisely defining complex and subtle modeling languages. Informal prose descriptions are incomplete and inherently ambiguous. Mappings of individual DFTs to Markov models do not capture the general case. Source code and executable implementations are precise, but procedural code is often resistant to human understanding and validation, and in the absence of a high-level specification there is no basis for rigorous verification [26]. There is the risk that the programmer may, perhaps unknowingly, make arbitrary and incorrect decisions about important semantic concerns. Without a sufficiently complete, abstract, and precise specification, it is likely that a complex language such as dynamic fault trees will contain conceptual and semantic errors.

For example, conceptual confusion is evident in the informal definition of the SEQ and FDEP gates in the previous section. These constructs are defined as gates even though they do not compute an output occurrence relation based on their inputs. (An OR gate’s output occurs if any of the inputs occur, but a SEQ gate computes no such relation.)

The example fault tree in Figure 2.2 demonstrates a semantic error in the language. In this case, the occurrence of the trigger Event A causes the simultaneous occurrence of the dependent events, as defined in the informal semantics of the FDEP gate. However, the original informal semantics of the priority-AND gate did not address the issue of simultaneous occurrence of the priority-AND gate inputs. If the ordering is strict, then the PAND gate should not occur if the inputs occur simultaneously. If the ordering is not strict, it should occur in this case.

Ambiguity in the semantics of the language is not simply an academic curiosity. This example was discovered by an engineer at Lockheed-Martin while attempting to model a system with a tool supporting the DFT language. Unfortunately, the informal specification of the language did not address this case, and the DIFTree implementation could not be used as a semantic reference because it did not even behave consistently in this respect. Apparently the implementor did not realize the subtlety and therefore left the semantics inconsistent. When we presented the issue to the domain experts, they acknowledged the ambiguity and suggested that the priority-AND gate be defined as having a strict ordering.

In general, the presence of such semantic problems in the modeling language significantly reduces the trustworthiness of the modeling and analysis method and any tool developed to support it. Importantly, precisely defining the meanings of the modeling constructs in isolation is unlikely to resolve such problems. The language must have a precise semantics for arbitrary combinations of constructs, i.e., for arbitrary well-formed expressions in the modeling language.
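To show how consequential the choice is, the following sketch evaluates the Figure 2.2 scenario, in which the FDEP trigger causes both PAND inputs to occur at the same instant, under the two candidate readings. The encoding of occurrence times is an illustrative assumption; as noted above, the domain experts ultimately chose the strict reading.

```python
def pand_occurs(times, strict):
    """PAND output: all inputs have occurred, in the listed order.
    strict=True  -> simultaneous occurrences do NOT satisfy the ordering.
    strict=False -> simultaneous occurrences DO satisfy the ordering."""
    if any(t is None for t in times):          # some input has not occurred
        return False
    if strict:
        return all(a < b for a, b in zip(times, times[1:]))
    return all(a <= b for a, b in zip(times, times[1:]))

# Figure 2.2: Event A triggers the FDEP, so Event B and Event C occur
# simultaneously (here, both at an arbitrary time 5.0).
event_b = event_c = 5.0
print(pand_occurs([event_b, event_c], strict=True))   # False under strict ordering
print(pand_occurs([event_b, event_c], strict=False))  # True under non-strict ordering
```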

2.3 Inadequate Tool Support

Several tools were developed to support various versions of the DFT modeling language, but they were limited in features and usability. HARP [27], for example, lacked support for a number of DFT constructs. DIFTree [28, 36] is another tool developed to support a more powerful DFT language. This tool supported the modeling constructs which were not supported by HARP, and also implemented an innovative modular solution technique. However, the graphical interface lacked a number of useful features. Both tools were tied to Unix, which was by then not the platform of choice in engineering practice.

Like most research prototypes, these tools were useful for researchers, as they demonstrated the feasibility of implementing a tool to support the language and provided a testbed for further research. But the tool limitations placed the widespread adoption of the dynamic fault tree modeling and analysis approach at risk.

2.4 DFTs as a Domain-Specific Language

In this research we treat the modeling languages supported by software tools as domain-specific formal modeling languages analogous to programming languages. Like a general-purpose programming language, a modeling language has a syntax and semantics. Unlike a general-purpose language, a modeling language does not necessarily target the machine language of a computer, but rather is “compiled” into underlying mathematical expressions which are then solved. For example, the semantics of dynamic fault trees can be expressed, ultimately, in terms of differential equations or probability theory. Also, while many programming languages have a textual syntax, modeling languages often have a graphical representation in addition to a textual one. The development of languages for dynamic fault trees is in many ways similar to the development of programming languages in computer science. Markov chains can be viewed as an “assembly language” of reliability modeling—a simple, well-understood language in which “programs”

are unwieldy to develop and validate. DFTs are a "high-level language" which provides abstract, domain-specific modeling capabilities. Just as a compiler translates a program written in a high-level language into a program written in assembly language, analysis engines for DFTs translate DFT models into Markov models and then eventually to differential equations. As with the development of the first high-level programming languages, the definition and correctness of the translation implementation is a key concern. As in compiler development, reliability researchers have developed a test suite to serve as a standard for the partial correctness of DFT analysis implementations [72]. However, as with all test suites, this DFT benchmark is an incomplete validation of the analysis engine. In contrast to commonly used general-purpose programming languages, modeling languages are often small enough to be amenable to the development of a comprehensive, mathematically precise definition. As we will demonstrate in Chapter 4, we can express the semantics of the language in a denotational style, even developing a mathematically precise abstract syntax for the language and an intermediate representation that is independent of the particular low-level "assembly language" representation such as Markov chains or binary decision diagrams. We present a formal specification of DFTs in Chapter 4, the complete version of which is included in Appendix B.

Chapter 3 Formal Methods for Engineering Modeling And Analysis

In this chapter we present an overview of formal methods, and discuss some of the challenges facing their use. We also describe our experiment in the use of formal methods in a targeted manner to help increase the trustworthiness of modeling methods and supporting languages and tools. We conclude the chapter with a section which discusses related work.

3.1 Formal Methods: Overview

The term formal methods denotes the use of mathematically precise notations and supporting tools to abstractly express, analyze, validate, and verify desirable properties of a system. Formal methods provide precise ways to specify the entities, relationships and functionality of a system. Using a formal notation, the engineer creates an abstract specification of the system. Because the notation has a well-defined syntax and precise semantics, the specification has a precise meaning. In addition, the notation has a set of associated inference rules which enable automated analysis. The use of formal methods can significantly reduce the chance of errors in both the conceptualization and implementation of the system. The mere act of rendering an intuitive understanding to precise, documented form can reveal previously unrecognized and undocumented subtleties. A precise specification also clarifies ambiguities inherent in an informal (e.g. natural-language) specification. Once developed, the specification can be informally validated against the understanding of domain experts in order to ensure that it states the concepts correctly. The specification can also

be validated formally via tool-assisted automated analysis in order to prove consistency theorems, to deduce logical consequences, generate counter-examples to stated theorems, etc. Implementations of the system can also be verified against the specification to help ensure that they perform correctly. For a system of significant complexity, the absence of a documented specification must necessarily cause the engineer to be concerned for the integrity of its conceptual structure and the faithfulness of a purported implementation. van Lamsweerde [69] has cataloged a number of challenges to the application of formal methods. Today’s formal notations are strongly influenced by low-level programming languages: they often have poor support for specifying non-functional properties, and have limited structuring mechanisms. Another important issue is that formal methods are thought to be very expensive—they require high levels of expertise in the use of the notation, analysis techniques, and tools. Today’s tools are also limited in capabilities. They have poor support for reasoning about specifications in the presence of errors, do not support the use of multiple specification techniques, and expose the user to many low-level mathematical details.

3.2 Formal Specification for Domain-Specific Modeling and Analysis

In this work we focus on methods and tools for the modeling and analysis of complex systems. The critical nature of such methods and tools means that their dependability is essential, and a significant portion of the dependability rests on the existence of a specified and validated syntax and semantics of the modeling language employed. In a tool, the concrete syntax is used in the implementation of the editor which the engineer uses to construct a model from an intuitive understanding. The semantics is used in the implementation of the tool’s analysis algorithms, whose results are used by the engineer to make important decisions regarding the system. There are several risks associated with the use of tools that implement poorly designed languages: (1) a poorly designed language can make it hard to model the important characteristics of the system, (2) the engineer’s incorrect understanding of the language may result in the development of incorrect models, and (3) the implementation of analysis algorithms may contain errors

that compromise the results. For many modeling tools, these risks are compounded by the subtle and complex nature of the modeling languages, and by software implementations that are created without the benefit of a precise definition. In order for a modeling and analysis methodology to be trustworthy, it is important that the modeling language that it employs have a precise meaning. However, to the best of our knowledge many modeling methods have supporting languages that lack formal foundations. Without a precise specification of the modeling constructs and composition rules of a language, it is impossible to rigorously validate a model against the system being modeled, or to state precisely what a given analysis result really means. The design of the modeling language itself will be subject to conceptual errors and irregularities. There will be no sound basis for developing and verifying a software implementation of the modeling language, or for developing accurate user documentation. The underlying science of the modeling technique will remain unclear and incomplete. There are several barriers to the widespread use of formal methods in the design and development of modeling and analysis tools. Research on the penetration of formal methods in practice (e.g. [44, 69]) indicates that there is no one simple solution. One key concern, and the one we help to address in this research, is the reputation that formal methods have of being prohibitively costly to apply except in very specific and demanding cases, such as the control software of life-critical systems.

3.3 An Experiment: Formalizing DFTs

We hypothesize that formal methods can be used in a cost-effective manner to significantly improve the trustworthiness of modeling and analysis methods and tools. In particular, we focus on the application of formal methods to the precise definition of the syntax and semantics of the modeling language independent of any particular tool. By cost-effective we mean that it is possible to significantly improve the understanding, design, and implementation of the modeling language with perhaps a person-year of effort. The soundness of the overall method and tools which support it depends on the trustworthiness of the modeling language. As a test we decided to apply this approach

to develop a specification of the dynamic fault tree language supported by DIFTree [24]. During the course of this effort, we discovered a number of design flaws which we were able to resolve in a new version of the DFT language (presented in Section 10.1). It is this new version of the language which we present in this dissertation. In the next chapter we present such a specification for DFTs, written using the Z specification language [59]. The semantics are expressed in a denotational style in which the semantics of arbitrary DFTs are expressed in terms of lower-level semantic models which we call failure automata. This domain is then further refined to the well-understood domain of Markov models. The specification was subjected to informal validation with domain experts and tool-assisted formal validation. Formalizing the DFT language revealed a number of significant conceptual, design, and implementation errors. It also led to significant insights concerning the underlying science and improvements in language design (described in Section 5.1). The next two chapters describe the case study and evaluate the approach. This portion of the overall research makes several contributions. First, we have identified a cost-effective approach for the judicious application of formal methods to increase the dependability of modeling and analysis tools. Second, our data support the claim that the use of formal methods in the design of modeling languages for engineering is badly needed, and that it can be both practical and profitable. Third, the case study has resulted in the first complete and mathematically precise definition of the DFT language. This definition is a significant contribution to the field of fault-tolerant computing. Fourth, the formal definition of the DFT language revealed several errors in conception and design, which were then resolved in a revision to the language (described in Chapter 10).

3.4 Related Work

The semantics of fault tree static gates and basic events had previously been defined informally in terms of probability theory [70]; and fault tree dynamic gates have been described in terms of individual examples mapped to corresponding Markov chains [8, 24, 25, 36]. However, to the best of our knowledge, our work is the first attempt to present a mathematically precise, abstract, and

sufficiently complete semantics for the entire dynamic fault tree language. Our work is analogous to that of Abowd, Allen, and Garlan [1], who formalized software architectural diagramming notations. They use Z to formally define the components and connectors typically used in such diagrams. Having developed this formalism, the authors then use the framework to perform analyses both within and across architectural styles. In another paper [3], Allen and Garlan formalize the notion of component connectors in software architecture. A collection of protocols characterize the participants and their interaction. The authors show how their formal framework can be used to define several common connector types, and how formalization provides for architectural compatibility checking in a manner akin to type checking. Both of these efforts apply formal methods in order to enable formal reasoning about the architecture of software. In contrast, our work seeks to assess the cost-effectiveness of applying formal methods to the specification and development of domain-specific languages used in other engineering disciplines. Our work also presents a more substantial case study involving the mapping of multiple syntactic domains to multiple semantic domains. Evans, et al. have formed the Precise UML (PUML) group in order to formalize the UML modeling language [33]. In a paper describing their work [29], the authors use Z to specify a syntactic and semantic domain for the UML, and to specify a mapping between the domains. Their work supports our basic argument regarding the necessity and applicability of formal methods to engineering models, in general. Kamin and Hyatt [42] present an approach to the design of domain-specific languages which is based on the definition of the domain-specific language in terms of a general-purpose language. The primary benefit—and potential downside—of this approach is that the domain-specific language takes on the characteristics of the more general purpose language. For example, the authors implement the FPIC graphical language using Standard ML. As a result, the FPIC language benefits from the careful design and precise semantics of the ML language, and inherits all of the features of that language. The primary difference in our work is that instead of embedding the domain-specific language into another, we specify a precise and explicit mapping from the language’s syntactic domain to semantic domains which are already established and well-understood.

Chapter 4 Formalization of the Dynamic Fault Tree Language

In this chapter we present our work in applying formal methods to the precise specification of domain-specific languages employed by modeling and analysis methods and supported by tools. This work was done in collaboration with domain experts, who first presented the language informally, allowing us to formalize it by using the Z specification language [59]. The process of formalizing the language revealed gaps in our knowledge, causing us to again consult with domain experts for clarification. Many of the subtleties which we found were known to the experts, but were not documented in a usable form (i.e., they existed only in computer code). In some cases, we discovered previously unknown subtleties in the modeling language. Once the first draft of the specification was complete, we validated it informally with the domain experts. This entailed a review process in which we studied the specification with the experts, often "translating" the formal specification back into natural language. During this process we found additional errors in our understanding of the DFT language, as well as deficiencies in the specification. Finally, the specification was subjected to tool-assisted partial formal validation, the goal of which was to ensure that the specification was internally consistent and meaningful. For example, a number of theorems were proven to ensure that partial functions are applied only to values which are in their domains. We also proved several theorems designed to ensure that the specification met certain global properties.

For example, the number of failures in the fault tree should be monotonically increasing across state transitions, as the modeling language does not currently support repair of components. This formal validation effort establishes the basic soundness of the specification, which is an essential first step toward complete validation. We begin by presenting an overview of the DFT specification. Next we present selected aspects of the specification. The last section of this chapter describes the informal and formal validation of the specification. Due to its length, we do not present the entire specification. Interested readers should consult Appendix B. Some of the content in this chapter is based on work published elsewhere [18, 21].

4.1 Overview of the DFT Specification

The overall structure of the specification is denotational, as depicted in Figure 4.1. DFTs are considered as a syntactic domain, and expressions in this domain are mapped to corresponding expressions in an underlying semantic domain. The initial version of the concrete textual and graphical syntax was previously defined in the DIFTree tool [36]—our purpose is to assign a precise semantics to this language, improving the language in the abstract, and providing a well-defined meaning for models in the language. We first define an abstract syntax domain which provides a common representation-independent abstraction for DFTs. This allows us to represent DFTs in terms of an abstract representation rather than a particular concrete textual or graphical notation. The primary semantic domain is failure automata (FA), instances of which are state machines which model the changing state of the system as failures occur. We formalize the semantics of DFTs by specifying a mapping from arbitrary fault trees in the abstract domain to the FA domain. We partition DFTs and FAs into several types which depend on particular characteristics of the model. Depending on the type, the semantics of members of the domain of failure automata can be expressed in terms of well understood mathematical domains such as Markov chains or Binary decision diagrams (BDDs). In this work we address the subset of DFTs whose semantics can be defined in terms of Markov chains, namely DFTs whose basic event probability distributions are Weibull or exponential only (as indicated by the dark lines in Figure 4.1).

Figure 4.1: Specification strategy (figure: the syntactic domains — Textual Syntax, Graphical Syntax, and Abstract Syntax — mapped to the intermediate Fault Tree Automata domain, which is in turn mapped to Markov Chains, BDDs, and other semantic domains)

Ignoring many important subtleties, the correspondence between a dynamic fault tree and its Markov chain is intuitive. The states of the Markov chain roughly correspond to sequences, or histories, of event occurrences. The initial state of the Markov chain represents a situation in which no events have yet occurred. Transitions in the Markov chain correspond to basic event occurrences that extend these histories, and the rates on these transitions are derived from the probability distributions of the triggering events. A subsequent state differs from a prior state both in the occurrence of the basic event and in additional consequences that follow from the structure of the fault tree, such as occurrences of events associated with gates. The DFT language does not support the modeling of repair, so event histories increase monotonically as one progresses through a Markov chain, and there are no cycles. The resulting Markov chain corresponds in a straightforward manner to a set of differential equations which can be solved using standard methods. The initial probability of being in the initial (all components operational) state is 1, and all other states have an initial probability of 0. When the differential equations are solved, this initial probability is distributed to the other states in the Markov chain. The resulting final probabilities indicate the probability of being in each state based on the user-specified operational time of the system.
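To make this final step concrete, the following sketch (our own illustration with made-up states and rates, not the dissertation's specification or the Galileo/DIFTree solvers) solves a small generator matrix for a fixed operational time and sums the probabilities of the states in which the system-level event has occurred:

import numpy as np
from scipy.linalg import expm

# States 0..3; state 0 is the "all components operational" initial state.
# rates[(i, j)] is the transition rate from state i to state j.
rates = {(0, 1): 1e-3, (0, 2): 5e-4, (1, 3): 5e-4, (2, 3): 1e-3}
n = 4
Q = np.zeros((n, n))
for (i, j), r in rates.items():
    Q[i, j] = r
np.fill_diagonal(Q, -Q.sum(axis=1))       # each row of a generator matrix sums to zero

p0 = np.zeros(n)
p0[0] = 1.0                               # initial probability 1 in the initial state

t = 1000.0                                # user-specified operational time
p_t = p0 @ expm(Q * t)                    # transient probabilities: p(t) = p(0) e^{Qt}

failed = {3}                              # states in which the system-level event has occurred
print(sum(p_t[s] for s in failed))        # overall unreliability at time t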

Figure 4.2: Example fault tree (figure: an AND gate and a spare gate over the basic events Event A, Event B, Event C, and Event D)

Once the Markov chain is generated and solved, the computation of the system-level failure probability is straightforward. The fault tree defines the states in the Markov chain which are system-level failure states: namely those in which the user-identified system-level event has occurred. The overall unreliability is then the sum of the probabilities of being in any of these system-failed states. Figure 4.2 shows a graphical representation of a simple fault tree with a spare gate, an AND gate, and several basic events. The spare gate event occurs if a spare is unavailable (i.e. events A, C, and D all occur). The system-level event associated with the AND gate occurs if both the spare gate and the event B occur. In this fault tree, the spare gate behaves much like an AND gate, but the events C and D occur at reduced rates unless they are in use. Traditionally, reliability engineers have expressed the semantics of DFTs directly in terms of Markov chains. Our work led to the discovery of a potentially important, more general semantic domain—that of failure automata (FAs). FAs provide a crucial intermediate domain between high-level models such as fault trees and low-level representations such as Markov chains. They provide a general model for the order of events and the change in state of a system, and can potentially serve as the basis for the formal definition of other reliability modeling languages. FAs may also serve as a common semantic domain for demonstrating the equivalence of different reliability modeling languages which are in use today. A FA is a state-transition diagram in which each state is an abstraction of the system state.

A state indicates which events have occurred, in what order (the event history), which spares are in use by which spare gates, and other information discussed in more detail in the next section. FAs support the specification of semantics for all subtypes of DFTs, including those for which there are no corresponding Markov chains. For example, the semantics of FAs for DFTs with constant probability distributions can not be expressed as continuous-time Markov chains, and must be expressed in terms of simulation or binary decision diagrams instead. Figure 4.3 illustrates a portion of the FA for the example fault tree in Figure 4.2. In each state, the top row of state components is the state of all the events and the allocation of spares to spare gates. The second row contains a history of the event states up to the current state. The last row indicates whether or not an uncovered failure has occurred. Generally speaking, the initial state (not shown) is the state in which all basic events have not occurred, all spare gates are using their first spare, and the system has not failed uncovered. (The details of covered failure are not important for this discussion.) The history will have only one set of event states which corresponds to the initial state of events. In this example, state 1 is the state in which event B has already occurred. An arc between two FA states is labeled with the basic event whose occurrence caused the state transition. The transition to the upper state in the figure, labeled with an "A", indicates that the occurrence of basic event Event A caused the state change. In the resulting state, Event A has 0 operational replicates and the history is augmented accordingly. In addition, the spare gate is now using the basic event Event C (rather than Event A) because Event A is no longer available. There are two other state transitions which correspond to the occurrence of the remaining operational basic events Event C and Event D. Figure 4.4 presents the Markov model that corresponds to this portion of the FA. Each state in the Markov model corresponds to a state in the FA, and, in this case, transitions also correspond one-to-one. In general, a FA can have multiple arcs between two states. A Markov model will have a single transition corresponding to such a set of arcs. The transition rates between states of the Markov model correspond to the rate of occurrence of the causal basic event, modified by several scale factors. In this case, the rate is modified by the dormancy factor for the unused spares that occurred.
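As a small aside, the dormancy scaling just described can be written down directly; the sketch below uses hypothetical names and values of our own (it is not the specification's definition of the rate mapping):

def markov_transition_rate(causal_event: str,
                           failure_rate: dict[str, float],
                           dormancy: dict[str, float],
                           in_use: set[str]) -> float:
    # Rate of the Markov transition caused by one basic event: the event's failure rate,
    # scaled by its dormancy factor when it is a spare that is not currently in use.
    rate = failure_rate[causal_event]
    if causal_event in dormancy and causal_event not in in_use:
        rate *= dormancy[causal_event]
    return rate

# Event C is an unused spare with dormancy factor 0.1, so its transition rate is reduced tenfold.
print(markov_transition_rate("C", {"C": 1e-3}, {"C": 0.1}, in_use=set()))   # 0.0001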

Figure 4.3: Example failure automaton (figure: a portion of the failure automaton for the fault tree of Figure 4.2, showing numbered states — each with the event states and spare allocation, the event-state history, and the uncovered-failure flag — connected by transitions labeled with the causal basic events A, C, and D)

Figure 4.4: Example Markov model (figure: States 1 through 4, with the transition out of State 1 labeled with the rate for Event A, and the other transitions labeled with the rates for Events C and D multiplied by their dormancy factors)

The specification we have developed not only defines the domains of the models in this example, but it also precisely defines mappings from fault trees to failure automata, and from failure automata to Markov chains for the general case. Through the specification of these domains and mappings, we provide a precise semantics for the DFT language. We will show how the semantics can be the basis for the precise definition of fault tree analyses such as the computation of overall system unreliability.

4.2 Formalization of the DFT Specification

In this section we present the dynamic fault tree specification. Due to its length, we present only the key aspects: the abstract syntax of DFTs, the FA domain, and the mapping from an arbitrary DFT to its associated FA. The entire specification is included in this dissertation as Appendix B, including the semantics of FAs in terms of Markov models. For an introduction to the Z specification language [59], see Appendix A.

4.2.1 DFT Abstract Syntax

In this section we present the abstract syntax of DFTs. We begin by defining an abstract event:

[Event]

Event is a “given type”, and represents failures or event occurrences in the fault tree. This definition allows us to abstract the independent failure of basic events and the dependent failure of gates.

BasicEvents
basicEvents : F Event

This Z schema defines the basic events in the fault tree as a finite set of events. A schema in Z is a structuring mechanism that defines a new type. In this part of the specification we use the schema primarily as a structuring mechanism, and do not take advantage of its type semantics. That is, the BasicEvents schema separates the definition of the basic events of a fault tree from other concerns, such as the definition of gates which we present next.

Gates
andGates : F Event
orGates : F Event
thresholdGates : F Event
pandGates : F Event
spareGates : F Event
gates : F Event
thresholds : Event ⇸ N1

⟨andGates, orGates, thresholdGates, pandGates, spareGates⟩ partition gates    (4.2.1)
dom thresholds = thresholdGates    (4.2.2)

In this schema we define the different types of gates in a fault tree as finite sets of events. We define the gates set as the set of all gates in the fault tree, and thresholds as a function mapping

threshold gates to their associated thresholds. Note that this schema has a set of constraints below the horizontal line. These constraints express relationships between the state components above the horizontal line. For example, line 4.2.1 states that the gate sets are independent, and together form the set of all gates in the fault tree. Line 4.2.2 states that only the threshold gates have threshold values in the thresholds mapping. This schema demonstrates a specification strategy which we use throughout the specification. The threshold values for threshold gates are defined in terms of a function which maps threshold gates to associated threshold values. An alternative approach is to define a new schema type for threshold gates which contains a threshold value as a state component. One difficulty with this latter approach is that this new type is then different from the types of other gates. Because Z does not support polymorphism, the specification becomes complicated as a result of this specification strategy. In an early publication [18] we employed this "object-oriented" approach to defining the fault tree type in Z. However, as we became more experienced in the use of Z, we found that using simple types and function mappings for associated attributes was a more suitable approach. There is a variant of Z called Object-Z which may address this issue more effectively. We chose to use the basic Z language in order to take advantage of the more developed documentation, tools, etc.

InputSequence == iseq Event

The notion of an input sequence is central to the definition of the inputs of gates and constraints. We define an input sequence as an injective sequence of events—that is, a sequence which does not contain repeated events.

Constraints
seqs : F InputSequence
fdeps : Event ↔ InputSequence

Here we present a schema for the SEQ and FDEP constraints. The sequence constraints in the fault tree are simply a finite set of sequences of events. The functional dependencies are a general relation from trigger events to input sequences of dependent events. We use a relation rather than a function because an event may trigger more than one FDEP.

FaultTree
BasicEvents
Gates
Constraints
events : F Event
inputs : InputsMap
replications : ReplicationMap

⟨basicEvents, gates⟩ partition events    (4.2.3)
dom inputs = gates    (4.2.4)
∀ g : gates • ran(inputs g) ⊆ events
∀ g : gates • ¬ IsInputTo(g, g, inputs)    (4.2.5)
∀ sg : spareGates • ran(inputs sg) ⊆ basicEvents    (4.2.6)
∀ sg : spareGates; be : basicEvents | IsDirectlyInputTo(be, sg, inputs) •
  ¬ (∃ g : gates \ spareGates • IsDirectlyInputTo(be, g, inputs))
∀ s : seqs • ran s ⊆ events    (4.2.7)
dom fdeps ⊆ events    (4.2.8)
dom replications = events
∀ t : dom fdeps • replications t = 1
∀ is : ran fdeps • ran is ⊆ basicEvents
∀ g : gates • replications g = 1    (4.2.9)

The FaultTree schema defines the abstract syntax domain of a fault tree. Here we use "schema inclusion" to incorporate the definitions we previously defined in the schemas for the basic events, gates, and constraints. The events set is the set of all events in the fault tree. inputs is a mapping for the inputs of each gate in the fault tree, and replications is a similar mapping for the replications of the events. The constraints state the following (a lightweight executable rendering of a few of them is sketched after the list):
• (4.2.3) An event is either a basic event or a gate. (Recall that a gate can be an AND gate, an OR gate, a threshold gate, a PAND gate, or a spare gate.)
• (4.2.4) Only gates can have inputs. The inputs must be events of the fault tree.
• (4.2.5) No gate can be an input to itself, directly or indirectly (i.e., cycles in gate inputs are not allowed).
• (4.2.6) The inputs to spare gates are only basic events. Basic events that are inputs to spare gates can not be inputs for other types of gates (but note that they can be inputs to constraints).
• (4.2.7) Sequence enforcers must operate over the events of the fault tree.
• (4.2.8) Functional dependencies must be triggered by some event in the fault tree. Every gate and basic event has a replication. The trigger must have a replication of 1, and only basic events can be dependent inputs.
• (4.2.9) All gates must have a replication of 1.
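The following is our own illustrative Python rendering (hypothetical names; it is neither the Z specification nor any Galileo/DIFTree data structure) of the abstract syntax and a few of the well-formedness constraints above, intended only to show how directly the formal constraints translate into executable checks.

from dataclasses import dataclass, field

@dataclass
class FaultTree:
    basic_events: set[str]
    gates: dict[str, str]                      # gate name -> kind: "and", "or", "threshold", "pand", "spare"
    inputs: dict[str, list[str]]               # gate name -> ordered list of input events
    replications: dict[str, int] = field(default_factory=dict)

    def events(self) -> set[str]:
        return self.basic_events | set(self.gates)

    def check(self) -> None:
        # (4.2.3) basic events and gates partition the events
        assert not (self.basic_events & set(self.gates))
        # (4.2.4) exactly the gates have inputs, and every input is an event of the tree
        assert set(self.inputs) == set(self.gates)
        assert all(set(ins) <= self.events() for ins in self.inputs.values())
        # (4.2.5) no gate is an input to itself, directly or indirectly (no cycles)
        def no_cycle(event: str, seen: frozenset = frozenset()) -> None:
            assert event not in seen, f"cycle through {event}"
            for child in self.inputs.get(event, []):
                no_cycle(child, seen | {event})
        for g in self.gates:
            no_cycle(g)
        # (4.2.6) spare gates take only basic events as inputs (the sharing restriction is omitted here)
        for g, kind in self.gates.items():
            if kind == "spare":
                assert set(self.inputs[g]) <= self.basic_events
        # (4.2.9) every gate has replication 1
        assert all(self.replications.get(g, 1) == 1 for g in self.gates)

For example, FaultTree({"A", "B"}, {"G": "and"}, {"G": ["A", "B"]}).check() passes, while making "G" an input to itself would trip the cycle check.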

4.2.2 Failure Automaton Domain

Having defined a formal specification of the abstract syntax of fault trees, we now turn our attention to the intermediate semantic domain, called failure automata.

StateOfEvents == Event → N

StateOfEvents represents the state of all the events for a fault tree. Note that this does not capture the entire state of the fault tree; in particular, the allocation of spares to spare gates is not modeled by StateOfEvents.

History == { h : iseq StateOfEvents | h ≠ ⟨⟩ ∧ (∀ i, j : dom h • dom(h i) = dom(h j)) • h }

A History is specified as a non-repeating sequence of StateOfEvents that represents the changing state of the fault tree over a sequence of causal basic event failures. Every step in the history has the same set of events, although the event states can change.

SpareInUse == { siu : Event ⇸ Event | siu ∈ F(Event × Event) • siu }

We declare SpareInUse as a finite partial function from Event to Event. As we will specify shortly, the domain represents a subset of the spare gates in the fault tree, and the range is the spare being used by the spare gate (if any).

FailureAutomatonState
stateOfEvents : StateOfEvents
history : History
spareInUse : SpareInUse
systemFailedUncovered : Boolean

stateOfEvents = last history

A state in a failure automaton consists of the state of the events, the history, the spare allocation, and the uncovered failure status. The stateOfEvents must be equal to the last state of events in the history.

FailureAutomatonTransition
from : FailureAutomatonState
to : FailureAutomatonState
causalBasicEvent : Event

to.history = from.history ⌢ ⟨to.stateOfEvents⟩
causalBasicEvent ∈ dom from.stateOfEvents
from.stateOfEvents causalBasicEvent < to.stateOfEvents causalBasicEvent

A transition between states consists of a from state, a to state, and an associated causal basic event. The destination state extends the history of the source state by one set of event states. The causal basic event must be one of the events in the event state, and it must be the case that additional replicate failures of the basic event occur in the transition between states.

FailureAutomaton
states : F FailureAutomatonState
transitions : F FailureAutomatonTransition

states = ⋃ { t : transitions • {t.from, t.to} }

A failure automaton consists of a finite set of states and transitions. The predicate constrains the transitions to be between the states of the failure automaton.
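For readers who find data structures easier to follow than schemas, the following is a rough Python rendering of the failure automaton domain (our own sketch with assumed names; it is not part of the Z specification), with the transition constraints checked in the constructor:

from dataclasses import dataclass

Snapshot = dict[str, int]              # event name -> number of occurred replicates

@dataclass
class FAState:
    history: list[Snapshot]            # non-empty; the last snapshot is the current state of events
    spare_in_use: dict[str, str]       # spare gate -> the spare (basic event) it is currently using
    system_failed_uncovered: bool

    @property
    def state_of_events(self) -> Snapshot:
        return self.history[-1]

@dataclass
class FATransition:
    source: FAState
    target: FAState
    causal_basic_event: str

    def __post_init__(self) -> None:
        # the target's history extends the source's history by exactly one snapshot ...
        assert self.target.history[:-1] == self.source.history
        # ... in which additional replicates of the causal basic event have occurred
        assert (self.target.state_of_events[self.causal_basic_event]
                > self.source.state_of_events[self.causal_basic_event])

A failure automaton would then simply be a set of such states together with a set of such transitions between them.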

4.2.3 Semantics of DFTs in Terms of FAs

The bulk of the specification relates to the semantics of each type of gate in a fault tree. To illustrate the issues involved, we present the key aspects of the semantics of priority-and gates. We begin with a precise definition of “occurs in order”, resolving the ambiguity in the case of simultaneous failure which was illustrated in Figure 2.2.

InputsOccurredAndInOrder : P(InputSequence × History × ReplicationMap)

∀ is : InputSequence; h : History; rs : ReplicationMap | ran is ⊆ dom(h 1) •
  InputsOccurredAndInOrder(is, h, rs) ⇐⇒
    NumberOfOccurredReplicatesInInputs(is, last h) = NumberOfReplicatesInInputs(is, rs)    (4.2.10)
    ∧ (∀ i, j : dom is | i < j •    (4.2.11)
        FirstFullOccurrenceTime(h, is i, rs(is i)) ≠ 0 ∧
        FirstOccurrenceTime(h, is j) ≠ 0 ∧
        FirstFullOccurrenceTime(h, is i, rs(is i)) ≤ FirstOccurrenceTime(h, is j))

Given a history and a sequence of events, the value of the InputsOccurredAndInOrder function is true if the replicates in each position fail before or at the same time as the replicates in later positions. This definition depends on two functions, FirstFullOccurrenceTime and FirstOccurrenceTime, which compute the history time step in which all replicates of an event have occurred, or in which one replicate of an event has occurred, respectively. Predicate 4.2.10 states that all the inputs must have occurred. Predicate 4.2.11 states that all the replicates at position i must have occurred (FirstFullOccurrenceTime) at or before the time at which the first replicate at position i + 1 occurs (FirstOccurrenceTime). Note that this specification addresses two subtleties in the DFT language. In using the ≤ symbol, we have defined simultaneous failure of inputs to be "in order". Secondly, the specification addresses ambiguity which arises if some but not all of the replicates of one input fail and then a replicate of a later input fails. In this case, we specify that all the replicates of the first input must fail "in order" with respect to the replicates of the later input.
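The following sketch is our own simplified Python rendering of this check (not the Z; it assumes every input appears in every snapshot and omits the nonzero-time conditions), with simultaneous occurrence counting as in order because the comparison uses <=:

def first_occurrence(history, event):
    # index of the first snapshot in which at least one replicate of event has occurred
    return next(i for i, snap in enumerate(history) if snap[event] >= 1)

def first_full_occurrence(history, event, replication):
    # index of the first snapshot in which all replicates of event have occurred
    return next(i for i, snap in enumerate(history) if snap[event] >= replication)

def occurred_and_in_order(inputs, history, replications):
    last = history[-1]
    # all replicates of all inputs have occurred ...
    if any(last[e] < replications[e] for e in inputs):
        return False
    # ... and each input fully occurs no later than the first occurrence of every later input
    for i, earlier in enumerate(inputs):
        for later in inputs[i + 1:]:
            if first_full_occurrence(history, earlier, replications[earlier]) > first_occurrence(history, later):
                return False
    return True

# Event A (replication 2) fully occurs before the first replicate of Event B: in order.
history = [{"A": 0, "B": 0}, {"A": 2, "B": 0}, {"A": 2, "B": 1}]
print(occurred_and_in_order(["A", "B"], history, {"A": 2, "B": 1}))   # True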

PANDSemantics : P(FaultTree × FailureAutomaton)

∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) •
  PANDSemantics(ft, fa) ⇐⇒
    (∀ pg : ft.pandGates; fas : FailureAutomatonState | fas ∈ fa.states •
      InputsOccurredAndInOrder(ft.inputs pg, fas.history, ft.replications) =⇒ fas.stateOfEvents pg = 1
      ∧ ¬ InputsOccurredAndInOrder(ft.inputs pg, fas.history, ft.replications) =⇒ fas.stateOfEvents pg = 0)

We define the semantics of the priority-and gate as a Boolean function. In order for the function to be applicable, the fault tree and failure automaton events must match—each event in the fault tree must have a corresponding state in the failure automaton. The function is true if each priority-and gate pg in the fault tree behaves correctly in each state fas of the failure automaton. That is, if the inputs have occurred and they occurred in order, then the PAND gate is failed. Otherwise it is operational. The semantics of each gate type is expressed in this manner. The overall semantics of fault trees is then expressed as a function mapping fault trees to failure automata.

FaultTreeSemantics : FaultTree → FailureAutomaton

∀ ft : FaultTree; fa : FailureAutomaton •
  FaultTreeSemantics(ft) = fa ⇐⇒
    FaultTreeAndFailureAutomatonEventsMatch(ft, fa)
    ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa)
    ∧ CausalBasicEventSemantics(ft, fa)
    ∧ UncoveredFailureSemantics(fa)
    ∧ ANDSemantics(ft, fa) ∧ ORSemantics(ft, fa) ∧ ThresholdSemantics(ft, fa)
    ∧ PANDSemantics(ft, fa) ∧ SpareGateSemantics(ft, fa)
    ∧ SEQSemantics(ft, fa) ∧ FDEPSemantics(ft, fa)

This definition specifies the function which maps an arbitrary fault tree to its associated failure automaton. In this specification, we state that the fault tree corresponds to a failure automaton if:
• the events in the FT have state in the FA
• the spares-being-used relation in the FA state refers to spare gates and spares in the FT
• causal basic events fail properly
• basic events fail uncovered properly
• each gate and constraint behaves correctly with respect to its semantics.
This function, along with a similar function which maps failure automata to Markov chains, provides a sufficiently complete and mathematically precise definition of the semantics of dynamic fault trees by providing a mapping from arbitrary fault trees to Markov chains.

4.3 Validation of the Specification

The development of the specification was an iterative process. We would document our understanding of the modeling language, sometimes discovering gaps in our knowledge or inconsistencies in

the definition. We would then consult with our domain experts in order to clarify these issues, either modifying our understanding and the specification, or using the issues to drive revision of the language. The development of a specification of the DFT language clarified and documented our understanding in a mathematically precise manner. However, the complexity of the language and its specification meant that the chance of error (both in concept and expression) was high. To address this risk, we needed to validate the specification to ensure that it represented the semantics intended by the domain experts, and was also consistent and meaningful. Our approach, which we describe in this section, was to subject the specification to both informal and formal validation. Informal validation consisted of several sessions with the domain experts in which we carefully reviewed each line of the specification. Formal validation involved using a theorem proving tool to prove key theorems, thereby increasing our confidence that the specification was sound and correct.

4.3.1 Informal Validation

Informal validation of the specification involved a review with domain experts in which the entire specification was presented. We presented each paragraph of formalism as a natural language "translation". This eased the difficulty of reading the specification on the part of the domain experts. During this review process we found a number of errors in the specification resulting from our misunderstanding of the DFT semantics. For example, we discovered that our specification of the computation of Markov transition rates incorrectly accounted for the dormancy of unused spares. We also found during the review that the specification was precise, but that its expression lacked useful abstractions which would improve the readability and therefore the ease of the validation process. We found it better to encapsulate the use of esoteric Z language features in descriptively named functions. For example, we abstracted confusing constructs such as from ∈ ran(inputs to), hiding this syntax in a function named IsDirectlyInputTo(from, to, inputs). The domain experts could validate the confusing syntax in the function once, then use the function name as a simple conceptual proxy elsewhere in the specification.

While this experience is only an isolated experiment, our formal specification expertise combined with the expert’s domain knowledge helped to reveal the errors we discovered. There is also some evidence that we were becoming more efficient as a team—not only did we acquire a better understanding of the domain, but one domain expert commented toward the end of informal validation that reading the specification had become significantly easier. This is encouraging, as it indicates both that specification and domain experts can work effectively together, and that the learning curve for the formal specification language was not prohibitive, even for domain experts not familiar with formal specification languages such as Z.

4.3.2 Formal Validation

As we will describe in the next chapter, the development and informal validation of the specification yielded significant benefits. However, a formal specification, while useful for thinking carefully about the domain, is not guaranteed to be correct, complete, consistent, or valid. As a result, errors in the specification not found by informal validation are likely to be reflected as errors in an implementation which is based on the specification. Consider the following simple specification.

Type
xs : P Z

5 ⊂ xs
#xs > 1
xs = {3} ∧ xs = {5, −1}

This specification defines a type called “Type” whose state components consist of a single set “xs”, defined as a set of integers. There are several predicates below the state components which demonstrate different types of errors. For example, the first predicate contains a syntax error—“5” should be written as the set “{5}” or set membership should be used instead of subset. The next predicate demonstrates a domain error—the application of a function outside of its domain. The

problem is that the set-size operator "#" only applies to finite sets, but here it is applied to a set that is not necessarily finite. Finally, the last predicate demonstrates an inconsistency in the specification. Formal validation can help identify these types of specification problems. Type errors can be identified with a type checker, and theorem proving tools can be used to find semantic errors. We used the ztc [41] and fuzz [60] type checkers during the development of the specification to ensure that there were no type errors. We then used the Z/Eves [57] syntax checker and theorem prover to check the syntax of the DFT specification again, and to prove certain important theorems about the specification. The bulk of the effort was invested in proving the absence of domain errors, although some effort was made to prove consistency. In the next three sections we describe our experiences using type checking, domain checking, and theorem proving to help formally validate the DFT specification.

4.3.2.1 Type Checking

Type checking was used heavily during initial development of the specification. We type checked the specification as it was written. Misspellings were common, as were simple type errors such as attempting to concatenate "sequence ⌢ element" instead of "sequence ⌢ ⟨element⟩". Once the specification was informally validated, we then imported it into the Z/Eves theorem prover for formal validation. Any changes made to the specification using Z/Eves were translated back manually into the text document. The Z/Eves type checker found numerous misspellings and type errors as we made modifications to the specification.

4.3.2.2 Domain Checking

Consider the definition below of the partial function SpareBeingUsed . SpareBeingUsed , given a spare gate and a failure automaton state, returns the basic event that the spare gate is currently using.

theorem axiom SpareBeingUsed$domainCheck
local SpareBeingUsed ∈ Event × FailureAutomatonState → Event
∧ (sg ∈ Event ∧ fas ∈ FailureAutomatonState)
=⇒ (sg, fas) ∈ dom local SpareBeingUsed
∧ (fas.spareInUse, sg) ∈ applies$to

Figure 4.5: Original domain check theorem for SpareBeingUsed

SpareBeingUsed : Event × FailureAutomatonState → Event

∀ sg : Event; fas : FailureAutomatonState •
  SpareBeingUsed(sg, fas) = fas.spareInUse sg

Z/Eves automatically generates domain check theorems for each Z paragraph. In this case, it generates the theorem shown in Figure 4.5. The domain check theorem states that given an event sg and a failure automaton state fas, (sg, fas) is in the domain of SpareBeingUsed, and the spareInUse partial function for fas can be applied to sg. In order to prove this theorem, we would need to show that the premises lead to the consequences of the implication. By invoking the prove command of the theorem prover, Z/Eves can determine that (sg, fas) is in the domain of SpareBeingUsed. However, the tool can not automatically determine that the spareInUse partial function can be applied to sg. Part of the problem is that the definition of spareInUse is not immediately available to the theorem prover, which does not automatically "delve into" the definition of FailureAutomatonState. To address this problem, we apply the FailureAutomatonState$member theorem that was automatically created by Z/Eves when the FailureAutomatonState definition was checked. This theorem states that if an element is of type FailureAutomatonState, then there exists a local binding (i.e. θFailureAutomatonState) for the element. In this case, we apply the theorem to fas ∈ FailureAutomatonState, which introduces the fact that there exists a FailureAutomatonState binding for fas. Next we execute the prenex command to instantiate that binding. We then invoke the FailureAutomatonState binding to "explode" the definition into its constituent parts, thereby

prove
apply FailureAutomatonState$member to predicate fas ∈ FailureAutomatonState
prenex
invoke FailureAutomatonState
prove by reduce

Figure 4.6: Z/Eves proof commands for SpareBeingUsed domain check

local SpareBeingUsed ∈ Event × FailureAutomatonState → Event
∧ sg ∈ Event
∧ history ∈ iseq (Event → N)
∧ stateOfEvents = last history
∧ spareInUse ∈ Event ⇸ Event
∧ spareInUse ∈ F(Event × Event)
∧ systemFailedUncovered ∈ Boolean
∧ fas = θFailureAutomatonState[stateOfEvents := last history]
∧ ¬ history = ⟨⟩
∧ (∀ i : 1 .. #history; j : 1 .. #history • dom(history i) = dom(history j))
=⇒ sg ∈ dom spareInUse

Figure 4.7: Proof goal for SpareBeingUsed domain check

exposing spareInUse to the current scope. Finally, we run prove by reduce to simplify the result by reducing complex types to their simpler forms. The proof commands, shown in Figure 4.6, yield the proof goal shown in Figure 4.7. The FailureAutomatonState components are now visible, but the consequence sg ∈ dom spareInUse remains unproven. Upon further analysis, we realized that the premises in fact can not be used to prove this statement. For example, we have placed no constraints on sg to ensure that it is indeed a spare gate. This problem led us to realize that SpareBeingUsed was in fact a partial function and not a total function as we had defined it. As a result, we modified the definition of the function so that it was a partial function, as shown below. In this specification, we use the hashed arrow to indicate a partial function. We also define the domain, and include a condition on the definition to ensure that the spare gate is in the domain

theorem axiom SpareBeingUsed$domainCheck
local SpareBeingUsed ∈ Event × FailureAutomatonState ⇸ Event
∧ dom local SpareBeingUsed = { sg : Event; fas : FailureAutomatonState | sg ∈ dom fas.spareInUse • (sg, fas) }
∧ (sg′ ∈ Event ∧ fas′ ∈ FailureAutomatonState)
∧ (sg′, fas′) ∈ dom local SpareBeingUsed
=⇒ (sg′, fas′) ∈ dom local SpareBeingUsed
∧ (fas′.spareInUse, sg′) ∈ applies$to

Figure 4.8: Final domain check theorem for SpareBeingUsed

prove
apply FailureAutomatonState$member to predicate fas′ ∈ FailureAutomatonState
prenex
invoke FailureAutomatonState
invoke SpareInUse
apply extensionality to predicate dom local SpareBeingUsed =
  { sg : Event; fas : FailureAutomatonState | sg ∈ dom fas.spareInUse • (sg, fas) }
instantiate x == (sg′, fas′)
prove

Figure 4.9: Z/Eves proof commands for SpareBeingUsed domain check

of fas.spareInUse.

SpareBeingUsed : Event × FailureAutomatonState ⇸ Event

dom SpareBeingUsed = { sg : Event; fas : FailureAutomatonState | IsUsingSpare(sg, fas) • (sg, fas) }
∀ sg : Event; fas : FailureAutomatonState | (sg, fas) ∈ dom SpareBeingUsed •
  SpareBeingUsed(sg, fas) = fas.spareInUse sg
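A rough programming analogue of the corrected definition (an assumption-laden sketch of ours, not the specification): the function is now explicitly partial, and callers must stay within its domain, here represented by membership in the spare-in-use mapping.

def spare_being_used(spare_gate: str, spare_in_use: dict[str, str]) -> str:
    # Defined only for spare gates that are currently using a spare; applying it
    # outside that domain is an error, mirroring the partial function in the Z.
    if spare_gate not in spare_in_use:
        raise KeyError(f"{spare_gate} is not using a spare in this state")
    return spare_in_use[spare_gate]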

The resulting domain check theorem is shown in Figure 4.8. This theorem can be proven using the sequence of proof commands shown in Figure 4.9.

theorem SetOfCausalBasicEventsIsSetOfBasicEvents
∀ ft : FaultTree; fa : FailureAutomaton | fa = FaultTreeSemantics ft •
  { fat : FailureAutomatonTransition | fat ∈ fa.transitions • fat.causalBasicEvent } = ft.basicEvents

Figure 4.10: SetOfCausalBasicEventsIsSetOfBasicEvents theorem

4.3.2.3 Consistency Theorems

Syntax checks and domain check theorems provide a minimal requirement for the basic soundness of the specification. However, these generic methods of formal validation do not address the risk that the specification may not be correct with respect to the domain being formalized. To help address this issue, we proved several theorems which we developed while writing the specification. For example, we wanted to ensure that our specification allowed only basic events to cause transitions in the failure automaton, and that each basic event was the cause of a transition somewhere in the state machine. The theorem in Figure 4.10 expresses these statements formally. It states that for any fault tree and its associated semantically valid failure automaton, the set of causal basic events from the automaton transitions must be equal to the set of basic events in the fault tree. In attempting to prove this theorem, we found a significant problem with the specification. Our original definition of the Event type was a schema with two state components: an identifier and a replication. Unfortunately, this formulation meant that the causal basic events in the failure automaton would include replication, a notion more appropriately related to the fault tree. It turned out that the specification ensured the event identifiers matched, but made no such guarantees about the replication values. As a result of this discovery, we were forced to reformulate our notion of an “event” to exclude ancillary information such as replication. We modified the specification accordingly to store this information separately in the fault tree, as we presented earlier. Once we solved this problem, we found that we still had difficulties proving the theorem. We could prove that the event associated with every transition was a basic event, but we could not prove that every basic event was associated with a transition. In this case, we realized that this latter

theorem CausalBasicEventsInBasicEvents
∀ ft : FaultTree; fa : FailureAutomaton | fa = FaultTreeSemantics ft •
  { fat : FailureAutomatonTransition | fat ∈ fa.transitions • fat.causalBasicEvent } ⊆ ft.basicEvents

Figure 4.11: CausalBasicEventsInBasicEvents theorem

condition was not in fact true—in some cases the initial state can have a failed basic event. For example, consider a fault tree in which two spare gates share a single basic event, and one spare gate output is the trigger for a functional dependency which has another basic event as a dependency. In the initial state, the shared basic event must be allocated to one of the two spare gates, in which case the other spare gate occurs. (There are two nondeterministic initial states in this case, depending on which spare gate is allocated the shared spare.) If the spare gate which is a trigger to the FDEP occurs, then the second basic event will be forced to occur. In this example, the second basic event has already occurred in the initial state, so it can not occur as part of a state transition later in the failure automaton. That is, the basic event will not be associated with a transition in the state machine. To resolve the problem in the theorem, we changed the equals in the theorem to subset-equals. The new theorem is shown in Figure 4.11. To prove theorems of this kind, we often had to develop lemmas. These lemmas allowed us to prove key aspects of the overall theorem without the distraction (and theorem prover performance penalty) of irrelevant contextual information. For this theorem, we first proved that, for any failure automaton transition in the failure automaton which represents the semantics of a fault tree, the causal basic event of the transition is in the set of basic events for the fault tree. Figure 4.12 shows the theorem. Proving this lemma requires 17 fairly complex proof steps, but allows us to prove the CausalBasicEventsInBasicEvents theorem fairly easily, using the proof commands shown in Figure 4.13. Note the use of the FailureAutomatonTransitionInBasicEvents lemma in the fourth step of the proof.

theorem FailureAutomatonTransitionInBasicEvents
∀ ft : FaultTree; fa : FailureAutomaton; fat : FailureAutomatonTransition |
    fa = FaultTreeSemantics ft ∧ fat ∈ fa.transitions •
  fat.causalBasicEvent ∈ ft.basicEvents

Figure 4.12: FailureAutomatonTransitionInBasicEvents theorem

prove
apply inPower to predicate { fat : FailureAutomatonTransition |
    fat ∈ (FaultTreeSemantics ft).transitions • fat.causalBasicEvent } ∈ P ft.basicEvents
prove
use FailureAutomatonTransitionInBasicEvents
prove
apply FaultTree$member to predicate ft ∈ FaultTree
prenex
invoke FaultTree
invoke BasicEvents
prove

Figure 4.13: CausalBasicEventsInBasicEvents proof

Chapter 5 Formal Methods for Modeling and Analysis: Results and Evaluation

In this chapter we present a number of conceptual, design, and implementation errors which were discovered as a result of our formalization effort [21]. We also discuss the costs involved, and evaluate the approach.

5.1 Results of Formalization

We found the formalization effort to be indispensable. First, it revealed many semantic issues which were not previously documented in a usable form, and also revealed some semantic issues which were not previously known. Second, it revealed a number of design problems in the language which hindered its orthogonality and regularity. Third, it led us to discover important implementation errors in the existing Galileo and DIFTree solver implementations. Fourth, it produced the first definitive specification of the DFT concept. Lastly, it revealed previously unrecognized but important abstractions, especially the failure automaton.

5.1.1 Errors in the Method

While formalizing the DFT language, we discovered a number of subtle issues in the DFT modeling and analysis methodology which were previously undocumented and unknown to us. Even

when such important issues are part of the intuition of the domain experts, without adequate documentation it is difficult to ensure that they are addressed during the development of the method, its implementation in software, or its use by practitioners. For example, we discovered that replicated inputs to a PAND gate complicated the semantics because the order of basic event replicates is undefined. The developers of the DIFTree solver had addressed this issue by restricting the language to prohibit replicated PAND inputs. Unfortunately, such semantic decisions were at best hidden in the implementation instead of being represented in any kind of abstract specification. The following sections illustrate these issues in detail. Our work documents the key subtleties of the DFT language, providing benefits to researchers, tool developers, and users. Researchers (other than the domain experts) benefit from a deeper understanding of the language, which can aid its further development. Tool developers now have a catalog of subtleties which must be carefully addressed by an implementation, and which can aid the verification of such implementations. The specification can also serve as the basis for systematic testing. Finally, users benefit from improved user documentation which explains the semantics of the language in the context of these subtleties.

5.1.1.1 Replicated Inputs

Replication, although seemingly innocuous, causes problems because basic event replicates, as aggregates, have no individual identity. In Figure 5.1, two replicated basic events are connected to a PAND gate. Consider the replicated Event A. If one occurrence of Event A is followed by another occurrence of Event A, it is not clear whether to treat them as having occurred in order or not. Similarly, inputs can be “partially occurred” if only a subset of the replicates have occurred. For example, if a replicate of Event A occurs followed by a replicate of Event B, both inputs to the PAND gate have occurrences, but neither is fully occurred. The existing software implementation of sequence-dependent constructs avoided these issues by introducing special cases in the language. In particular, replicated inputs to order-dependent gates were disallowed. In developing the specification, we realized that we could assign a precise semantics to such gates, and remove such special cases. For example, we consider the PAND gate occurrences in Figure 5.1 to be in order if all of the replicates of Event A occur at or before any of the Event B replicates.

Figure 5.1: PAND with two replicated inputs
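To make this rule concrete, the following sketch shows one way the in-order check for replicated inputs could be expressed. It is an illustrative Python fragment, not the Z specification; the representation of each input as the list of steps at which its replicates occurred, and the function name, are assumptions made for this example.

    # Illustrative sketch only; the normative definition is the Z specification.
    # Each input is represented by the list of occurrence steps of its replicates;
    # replicates that occur simultaneously share the same step number.

    def pand_inputs_in_order(steps_a, steps_b):
        """True if every replicate of the first input occurred at or before
        any replicate of the second input (the rule described above)."""
        if not steps_a or not steps_b:
            return False  # an input with no occurrences is not yet in order
        return max(steps_a) <= min(steps_b)

    print(pand_inputs_in_order([1, 2], [2, 3]))  # True: all of A at or before any B
    print(pand_inputs_in_order([1, 3], [2, 3]))  # False: an A replicate occurs after a B replicate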

5.1.1.2 Simultaneous Event Occurrences

As described in Chapter 2, a key subtlety in the DFT language is simultaneous occurrences resulting from the triggering of an FDEP. Because FDEPs introduce simultaneity into models, each order-dependent construct must be well defined in the presence of this issue. Prior to this work, the significance of simultaneous occurrence had not been recognized. We found that we could specify an overall semantics for in-order occurrence, as described in Section 4.2.3, that accommodates both simultaneity and replicated events. We use this definition for all order-dependent gates, thereby providing a consistent and uniform semantics.

5.1.1.3 Nondeterminism

DFT models, as originally formulated, were believed to be strictly deterministic—the occurrence of a basic event would result in exactly one next state. However, consider the fault tree depicted in Figure 5.2. The spare gates, Spare Gate 1 and Spare Gate 2, are using Event B and Event C, respectively, as indicated by heavy lines. They also share a spare event, Event D, that is not currently in use. The FDEP indicates that if Event A occurs, Event B and Event C occur simultaneously with it. In this case, Spare Gate 1 and Spare Gate 2 have to contend for the single shared spare, Event D. There are two possible outcomes, depending on which spare gate is allocated the spare Event D.

Figure 5.2: A subtlety concerning non-determinism

Figure 5.3: Spare gates taking from a common pool of spares

When we asked the domain experts to clarify the proper semantics, they suggested that the language be modified to prioritize spare gates, thereby removing the nondeterminism. However, during our specification of the semantics, we found that we could accommodate the inherent nondeterminism in the language without having to complicate the semantics with the notion of prioritized inputs. A key lesson of this and the previous issue is that simultaneous occurrences are a central subtlety in the language, and that their interaction with other constraints should be carefully analyzed in order to have confidence in the language.

5.1.1.4 Lack of Identity for Replicates

There is another conflict between replication and spare gates. Replicates lack identity, and yet the semantics of spare gates involve the notion that a spare gate is using a particular spare. The example in Figure 5.3 illustrates the issue. In this case, two spare gates are using two of three replicated spares.


When a replicate of Event A occurs, a non-deterministic choice must be made between the spare allocated to Spare Gate 1, the spare allocated to Spare Gate 2, and the spare that is not in use. An alternative formulation, used in an early software implementation [36], is to “de-anonymize” a replicate when it is allocated to a spare gate, separating it from its group of anonymous replicates and giving it a unique identity. We addressed this problem by exploiting the natural non-determinism of the specification. We allow all possible outcomes, and evenly divide the overall probability across all the next states.
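A small sketch may make the even division concrete. The following Python fragment is illustrative only and uses hypothetical names; it simply shows a transition's probability being split evenly across the nondeterministic next states, which is how we resolve the lack of replicate identity.

    def split_probability(transition_probability, next_states):
        # Evenly divide the probability of the causal occurrence across all
        # nondeterministic successor states (illustrative, not the Z specification).
        share = transition_probability / len(next_states)
        return [(state, share) for state in next_states]

    # Three possible outcomes when a replicate of Event A occurs: the lost replicate
    # may be the one used by Spare Gate 1, the one used by Spare Gate 2, or the unused one.
    print(split_probability(0.3, ["gate 1 loses its spare",
                                  "gate 2 loses its spare",
                                  "unused replicate lost"]))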

5.1.1.5 Spare Gate Primary Inputs Can Not Be Shared

The original formulation of spare gates included a so-called primary input as the first input to the spare gate. The interpretation was that the primary input would be considered to be in use at the start of the mission. Because a basic event can be used by at most one spare gate, primary events could not be shared unless the basic event was replicated. During the development of this specification we found that the notion of a special primary input, while perhaps intuitively satisfying, introduces additional complexity while doing little to reduce the complexity resulting from other issues. As a result, we decided to remove the notion of a primary input, and to treat all inputs uniformly. This change helps make the DFT language more regular and simplifies its semantics.

5.1.1.6 Subtle FDEP Semantics

Functional dependency gates have a number of subtle interactions. The first type of interaction, which we call cascaded FDEPs, occurs when the dependent input of an FDEP is the trigger of another FDEP. The situation is shown in Figure 5.4. The occurrence of a single event can cause the trigger of an FDEP to occur, which in turn triggers a series of other FDEPs in a domino-like fashion. Furthermore, the triggering of the subsequent FDEPs can be indirect—the initial FDEP can cause an OR gate to occur, which then triggers a second FDEP. The second type of interaction, which we call cyclic dependence, occurs when the dependent event of an FDEP is the trigger of another, whose dependent event in turn is the trigger of the first. The interaction is illustrated in Figure 5.5.

Figure 5.4: Cascaded FDEPs

Figure 5.5: Cyclic FDEPs

As a result of a cycle, the occurrence of any basic event in the cycle can cause all the other events to occur. (Infinite cycles of occurrences do not happen because an event, once occurred, can not occur again.) These subtle interactions affect both the semantics of the language and the implementation of those semantics. Cyclic FDEPs mean that more than one event occurrence can result in the same resulting state. When implementing the semantics, special care must be taken to properly propagate the effects of the occurrence of a causal basic event.
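One way to see the propagation issue is as a closure computation over the set of occurred events. The sketch below is illustrative Python with a hypothetical representation (a mapping from each trigger to the events it forces to occur, ignoring the fact that real triggers may be gates rather than basic events); it is not drawn from the Z specification, but it shows why cascaded and cyclic FDEPs can be handled uniformly: an event that has already occurred is never processed again.

    def propagate_fdeps(occurred, causal_event, fdeps):
        # `fdeps` maps a trigger event to the set of events it forces to occur.
        # Returns the set of occurred events after the causal basic event occurs,
        # including everything forced directly or transitively by FDEP triggers.
        occurred = set(occurred) | {causal_event}
        worklist = [causal_event]
        while worklist:
            trigger = worklist.pop()
            for dependent in fdeps.get(trigger, set()):
                if dependent not in occurred:  # an occurred event can not occur again
                    occurred.add(dependent)
                    worklist.append(dependent)
        return occurred

    # Cyclic example in the spirit of Figure 5.5: Event A and Event B trigger each other.
    cyclic = {"Event A": {"Event B"}, "Event B": {"Event A"}}
    print(propagate_fdeps(set(), "Event A", cyclic))  # both events occur; no infinite loop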

5.1.2 DFT Design Irregularities

We discovered several significant omissions, ambiguities, or errors in the previous tacit specification. In Section 10.1 we describe how we resolved these design irregularities in a revision of the language. The following sections describe the issues in more detail. Many of these design irregularities were imposed by the designers of the DFT language in order to avoid having to address the complex semantics of particularly problematic interactions.


For example, the original definition of spare gates required a primary input which could not be shared, and which represented the component being used at the start of the operation of the system. Among other things, the introduction of primary inputs allowed the designers of the language to avoid the semantic complexities which can arise when one component is shared among two or more spare gates at the start of the system's operation. In this case, our formalization effort revealed that spare contention—and therefore nondeterminism—can occur during the operation of the system. In providing a precise semantics for this case, we realized that the primary input construct could be removed from the language. This change also increases the expressiveness of the language—contention can now occur in the initial state, resulting in multiple nondeterministic initial states, and possibly failed initial states.

By identifying and resolving such irregularities, we simplified the language while increasing its expressiveness. We also removed redundant elements, improved the orthogonality of the constructs, and clarified terminology which could result in confusion. These changes ease the implementation effort by reducing the number of special cases in the language and allowing common abstractions to be reused. For example, by removing the special primary input, all gates then have a sequence of events as inputs. Finally, we believe the resulting language is simpler to understand and easier to use, allowing users to more easily express the systems that they model.

5.1.2.1 Terminological Clarifications

We found that a number of terms used in the discussion of DFTs were inaccurate, and did not match the underlying semantics of the language. For example, FDEPs and SEQs were generally referred to as gates, even though they do not compute an occurrence relation over their inputs. We introduced the term “constraints” to make the distinction between these constructs and those that compute an occurrence relation. Similarly, the “transfer gate” is more accurately an “indirect connector” because it allows one to connect to an event in another part of the drawing, but does not compute an occurrence relation. We also formalized the use of the term “event” to describe both basic events and gates, as a gate is simply an event which depends on basic events.


Because multiple events can occur in the transition from one state to another (see Section 5.1.1.6), we introduced the term “causal basic event” to describe the basic event whose occurrence causes all other basic events to occur. We also clarified the terminology regarding basic event failures—while it is appropriate to speak of the failure of a basic event that models a component of the system, it is better to refer to the “occurrence” of a basic event which models an environmental phenomenon such as “pressure exceeds tolerance”.

5.1.2.2 Inputs Made Regular

The original definition of DFTs imposed a number of limitations on the inputs to certain gates in order to simplify the semantics. These limitations reduced the expressiveness of the language and made it more difficult to use. For example, PAND gate inputs were not allowed to be replicated, spare gate spares could not be shared across spare gates of different types, and spare gates had a special “primary” input. We discovered that with careful specification of the semantics of such gates, we could remove such limitations and make the language more regular.

5.1.2.3 KOFM Redundancy Removed

The KOFM gate, as originally formulated, occurs if K out of M inputs occur, where M represents the total number of events input to the gate, taking replication into account. This value was redundant, and required the modeler to ensure the consistency of the M value and the inputs. Our solution was to replace the KOFM gate with a threshold gate having a threshold value analogous to the K value. The semantics of this new gate subsumes that of the KOFM—the gate occurs if at least the threshold number of replicates occur among the inputs. The threshold gate semantics are also more general than those of the KOFM, because the threshold value can exceed the number of input replicates, in which case the threshold gate can not occur.
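The following Python sketch is illustrative only (hypothetical names, plain dictionaries rather than the Z structures); it assumes the reading given above, namely that the threshold gate occurs once at least the threshold number of input replicates have occurred.

    def threshold_gate_occurs(threshold, occurred_replicates):
        # `occurred_replicates` maps each input event to the number of its replicates
        # that have occurred so far.  The gate occurs when the total reaches the threshold.
        return sum(occurred_replicates.values()) >= threshold

    # KOFM-style usage: a 2-out-of-3 configuration expressed as threshold 2.
    print(threshold_gate_occurs(2, {"Event A": 1, "Event B": 1, "Event C": 0}))  # True
    # A threshold larger than the total replication of the inputs can never be met.
    print(threshold_gate_occurs(5, {"Event A": 2, "Event B": 1}))                # False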

5.1.2.4 Spare Gate Orthogonality Improved

Recall that the spares of the cold, warm, and hot spare gates occur at different rates when not in use. The rate of occurrence of an unused spare is 0 if the spare is attached to a cold spare gate, its normal rate if the spare is attached to a hot spare gate, and its normal rate multiplied by a dormancy factor if the spare is attached to a warm spare gate.


This means that not only do spare gates embody the semantics of sparing, but they also influence the occurrence behavior of basic events attached as spares. The original designers of the DFT language intended for the different types of spare gates to indicate the behavior of spares when not in use. By explicitly representing the “temperature” of spares, the user would not have to consult the basic event model, which is separate from the fault tree model. In addition, this redundancy provides additional error checking by comparing the type of the spare gate to the dormancy value of the spare. For example, cold spare gates should have basic events with a dormancy of 0.

We believe that this lack of orthogonality in the spare gates reduces the engineer's ability to reason about the semantics of basic events in isolation, and complicates the language more than necessary. In practice, we found that existing implementations did not utilize the error checking potential, instead letting the dormancy semantics of the spare gate type override the dormancy specified in the basic event. (For example, a spare attached to a cold spare gate would behave as if it had a dormancy of 0, even if the dormancy in the basic event model was not 0.)

We chose to replace the three types of spare gates with a single spare gate that has no influence on the occurrence rate of unused spares. This change improves the orthogonality of the language by separating the sparing behavior from the dormancy concern. The result is that the attenuation of the occurrence rate of unused spares is expressed solely in terms of the dormancy value of the basic event. This allows basic events having different dormancies to be input to the same spare gate.

Our changes to the spare gate also improved the expressiveness of the language. In the previous design, a basic event could not be connected to spare gates of different types, because its dormancy would not be well defined. For example, a spare attached to a cold spare gate would have an effective dormancy of 0, indicating that it could not fail unless it was in use. The same spare could not be attached to a warm spare gate, because the effective dormancy would be nonzero, indicating that the spare could fail.
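A brief sketch may clarify the separation of concerns. This is an illustrative Python fragment with hypothetical names, not part of Galileo or the Z specification; it shows how, with the revised single spare gate, the attenuation of an unused spare's occurrence rate depends only on the basic event's dormancy value.

    def effective_occurrence_rate(base_rate, dormancy, in_use):
        # Full rate while the spare is in use; otherwise the rate is attenuated by
        # the dormancy value of the basic event (0 behaves like a cold spare,
        # 1 like a hot spare, and values in between like a warm spare).
        return base_rate if in_use else base_rate * dormancy

    print(effective_occurrence_rate(1e-4, 0.0, in_use=False))  # 0.0: can not occur while unused
    print(effective_occurrence_rate(1e-4, 0.5, in_use=False))  # 5e-05: attenuated rate
    print(effective_occurrence_rate(1e-4, 0.5, in_use=True))   # 0.0001: full rate while in use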

5.1.3 Errors in the DIFTree Implementation

Our specification effort led us to identify a number of fault trees with uncommon interactions which exercise certain subtleties in the language. Most of these cases were addressed correctly in the DIFTree implementation, but a few were not. In this section we describe some of the implementation errors we discovered.

5.1.3.1 PAND Gate Inconsistent Evaluation

PAND gates were not evaluated consistently in the face of simultaneous occurrences of inputs. The reason is that the implementor had no precise specification to meet, and thus provided an implementation that behaved inconsistently for this case. Although we did not perform an in-depth analysis, one can reasonably assume that the developer inadvertently created an implementation that was dependent on nondeterministic characteristics of the abstract data type such as the order of data in memory.

5.1.3.2 Arbitrary Semantics for Spare Contention

Under contention for a spare, the allocation of the spare was implementation dependent. The user was not notified of cases where multiple spare gates could contend for available spares owing to simultaneous occurrences of primaries due to functional dependencies.

5.1.3.3 Sharing of Primaries Allowed

Sharing of primaries to spare gates was permitted in some cases. Such sharing should have been disallowed as per the designer’s intent that a spare gate should have a unique primary input. Ultimately, this problem was resolved through our removal of the notion of a primary input from the DFT language.

5.2 Cost

The concision of the specification which we have developed belies the effort involved in its development. To arrive at this concise and relatively simple description, we significantly revised the specification several times. These revisions were driven by our increasing expertise in using the Z language, and by the conceptual clarification we gained through the process.

Development of the first draft of the specification took approximately four months, working part time while learning the Z notation. The process was an iterative one in which we developed portions of the specification, discovered gaps in our understanding, and queried domain experts for clarification. The specification also underwent several major revisions as we became more adept in using the Z notation to express the DFT syntax and semantics.

The informal validation of the specification involved approximately ten sessions of one to two hours with domain experts, namely Dugan and a number of her graduate students. This process revealed a number of errors in earlier drafts, as well as opportunities for improving the expression of the semantics using the Z notation. The use of syntax checking tools during the development of the specification was of minimal cost, yet ensured that the specification contained no obvious syntax and type errors.

The formal validation of the specification was significantly more costly and less exhaustive than the initial formulation and informal validation. We spent approximately three months formally validating the specification, working part time while learning to use Z/Eves. The predominant cost was the development of proofs for the domain check theorems described in Section 4.3.2.2. We found that not only was the development of the proof scripts difficult, but that Z/Eves was difficult to use, and that the execution of proofs was computationally expensive. Executing all of the proofs in the specification, for example, requires the user to manually invoke the proof scripts one at a time, and takes approximately two hours of compute time on a 1.2 GHz machine.

Unfortunately, of the approximately 110 proofs in the specification, fewer than 10 are devoted to validating its domain-specific meaning. The remainder are domain check proofs which establish the basic soundness of the specification.


These theorems were useful in that they generally forced us to more carefully define partial functions and other constructs in the specification. However, we found these theorems to be a costly precursor to the proof of more significant theorems about the semantics.

5.3 Evaluation

5.3.1 Formal Methods for Modeling Languages and Tools

Our research in applied formal methods has yielded several contributions. We have shown that formal methods can be applied at a modest cost to significantly benefit the languages which engineers use to model critical systems. By focusing on the modeling language, we discovered and resolved problems independent of any particular tool implementation. Through the application of formal methods, we discovered many conceptual, design, and implementation errors in the DFT language. The work clarified the previous informal understanding of the semantics of DFTs, revealing key subtleties in the language and identifying opportunities for improvement.

Contrary to the widely held view that formal methods are too costly to apply in practice, we found that formal methods provided significant benefit during the formulation and informal validation of the notation at a cost that was quite modest by any reasonable industrial standard. They were, however, less clearly beneficial with respect to formal validation. Although the DFT language is modest in size, we discovered previously undocumented complexity and subtlety, and were able to significantly revise the notation using a formal approach. We believe that developers of more complex languages with similarly incompletely defined semantics would also do well to employ formal methods, as complexity increases the likelihood of unforeseen interactions between language constructs.

We have applied a widely used formalism and associated toolset to the precise definition of an important language. DFTs are representative of a number of languages used in reliability modeling and analysis, as well as of the broader class of software modeling and analysis tools. The semantics we have defined for DFTs also encompasses static fault trees, a language that is in wide use today.


Our work has raised the level of trustworthiness of the semantics of the DFT language, a necessary condition for the dependability of a tool that supports it. The costs involved were not trivial. However, we believe there are compelling reasons to invest in such an effort. The increasing use of modeling and analysis tools in the engineering of safety-critical systems requires that their dependability be high. Our experience indicates that without such an approach, many more errors in the language and its implementation may remain undetected. Further, the effort spent in the precise specification of the language is a one-time cost—once such a specification is complete, it can be used in the development of many tools that support the language. Our experience indicates that formal methods are not only beneficial, but perhaps even necessary to ensure the semantic soundness of languages used to model critical systems.

5.3.2 Contributions to the Domain

Our work has resulted in two tangible contributions to the field of fault-tolerant computing. First, the specification we have developed represents the first reasonably complete and mathematically precise definition of the DFT language. Second, the domain of failure automata which we have defined has the potential to serve as the basis for the formal definition of other reliability modeling languages. It may also serve as a common semantic domain for demonstrating the equivalence of different reliability modeling languages which are in use today.

Our collaboration with domain experts was a fruitful one. Frequent interaction during the creation of the specification allowed us to quickly clarify subtle issues as they were discovered. Our experience working with domain experts during informal validation of the resulting specification was similar to that of Knight et al. [44], who reported that domain experts overcame initial unfamiliarity with the Z notation and were able to study the specification effectively for errors. Our case study suggests that it is both possible and profitable to bring application domain experts (e.g., in reliability engineering) together with experts in rigorous software engineering methods to build engineering modeling and analysis tools with sound engineering foundations.

5.3.3 Formal Validation

Our formal validation effort established the basic soundness of the specification, a necessary step which must be taken before more interesting properties can be proven. Unfortunately, the difficulty of formally validating the specification meant that little time could be invested in the proof of domain-specific theorems.

Despite our focus on the basic soundness of the specification, formal validation revealed numerous errors. Some errors were quite serious and caused us to restructure the specification. However, most of the errors we found required us to increase the level of precision, but did not challenge essential aspects of the specification. This is in contrast to our informal validation effort, which often revealed substantial errors in our documented understanding of the intended semantics of DFTs.

Once the basic soundness of the specification is proven, more interesting theorems which directly relate to the semantics of the language can be checked. We were able to prove a few theorems of this type, but in order to truly trust that the specification expresses the right semantics it is necessary to work with domain experts to formulate and prove a set of essential theorems. In this respect much work remains to be done.

5.3.4 Z Language and Tools

We found the Z notation and its associated tools to have limitations, but none that significantly hindered our specification effort. The most obvious (and apparently common) issue we encountered is that Z has no abstraction for real numbers, so we were forced to define one ourselves. To keep it manageable we abstracted away the axioms of real arithmetic, leaving formal relations between types, but no definition of these relations. More importantly, we encountered an inherent limitation of the notation: in order to prove a key theorem we wanted to prove that a set constructed from a finite set is finite. In other words, we were trying to prove the theorem “∀ s : F X • { x : X | P • x } ∈ F X”, where X is a parameterized type and P is a parameterized predicate involving s. Unfortunately, while the Z notation supports parameterized types, we could not express the parameterized predicate. As a result of such limitations, two of the theorems we developed could not be proven.


While we found formal validation to be useful, it was also very difficult, requiring significant knowledge of both the Z notation and the Z/Eves tool. While learning how to use Z/Eves, we were in frequent communication with the tool's author. Overall, we found the tool to be difficult to use, bound by the computationally intensive theorem proving analysis, and lacking in features. It is clear that developing and informally validating the specification was very beneficial. However, formal validation was less cost-effective due to these difficulties. While we believe that formal validation is critical, it is not clearly affordable today. As a result, we are not yet fully confident of the validity of our specification.

We also experienced difficulties due to the fact that the tools and notations did not support multiple formalisms. For example, we specified the concrete textual syntax using a standard grammar as part of a prototype effort to “complete the chain” from the concrete syntax to its low-level semantics. The mapping between this grammar and the abstract syntax could not be formalized because the notations were not compatible. In our case, this was not a significant problem because this mapping was simple and not as critical as the mapping from the abstract syntax to the failure automaton. As a result, we decided not to formalize the relationship between the concrete and abstract syntaxes.

Chapter 6 Package-Oriented Programming

In this chapter we present background on component-based software development. We then present the package-oriented programming approach, its potential as a model for the component-based development of interactive systems, and the research questions which must be answered in order to evaluate its viability in this regard. We describe our experimental evaluation of the model, as well as related work. Much of the material in this chapter is based on work published elsewhere [20,63,65].

6.1 Component-Based Software Development

For over thirty years software developers have been searching for effective component-based software development (CBSD) techniques [2, 6, 40, 50, 67, 68]. Components are independently developed software modules, and CBSD is the construction of software largely through the integration of such components. CBSD research has led to a deeper understanding of the need for components to have explicit context dependencies, clear specifications, conformance to standards, intellectual property protection, etc. Lampson [45] states that the dream of component-based software development—the building of systems by integrating components from a library of reusable components—remains largely unrealized. To date components have seen modest success as part of the software infrastructure: function, data structure, and graphics libraries, operating systems, and databases. However, the development of a general component industry for software remains a key challenge in software engineering research.


Unlike more mature engineering disciplines, most software is still developed in a line-by-line, craft-like manner. Components such as data structure and function libraries provide generic low-level reuse. Although their generality makes them widely applicable, it also limits the extent of their contribution to the domain-specific functionality of the software in which they are used. Large components with narrow functionality, such as databases, provide more leverage because they implement a larger subset of the functionality of the software.

Such limited success has led some researchers to speculate that both technical and non-technical issues related to software components will hinder their widespread use. In a widely cited paper [31], Garlan and his colleagues documented a set of difficulties encountered in an attempt to integrate a set of large, ostensibly reusable software systems to produce a tool. On that basis they concluded that large-scale component integration faced fundamental difficulties. In particular, their components made conflicting assumptions about the architectures of the systems in which they would be used, making them very hard to integrate. Garlan et al. coined the phrase architectural mismatch to describe this kind of problem.

There are other important challenges. Several observers have noted that designing software for reuse is relatively expensive. Integrators do not control component design. Understanding components can be costly [45]. Incomplete knowledge of the properties of large components is typical and perhaps even inevitable [17]. Even if existing components can be made to work, there is no guarantee that future versions will also work. Global analysis is hard when multiple components are used in a system. This problem is exacerbated when components are integrated dynamically [67]. We lack demonstrably effective payment schemes for many CBSD models, and software intellectual property is difficult to protect.

In a keynote address at the 1999 International Conference on Software Engineering, Butler Lampson went even further, arguing that the component dream has not succeeded and was unlikely to be realized, for three reasons: components make conflicting assumptions; they are costly to develop; and they are costly to understand [45].

6.2 Package-Oriented Programming

Our view is more optimistic. We believe that there is still potential for the development of successful component reuse models. In particular, we need models which (1) strike a balance between low-level, generic components such as GUI libraries and large, highly specialized components such as databases, and (2) support the development of richly interactive software systems such as engineering modeling tools. In this research we investigate a promising model called package-oriented programming (POP) [20, 62, 63].

Package-oriented programming is a software development approach in which multiple commercial off-the-shelf software packages are used as components. The Microsoft Office suite is an example of a set of architecturally compatible mass-market packages. By using such commercial packages as massive components, POP exploits the vast investments that have already been made in their design, construction, and refinement, and the tremendous economies obtained by the volume pricing of mass-market software. In particular, users benefit from careful usability engineering, rich functionality, software familiarity, rich interoperability, and reasonably stable execution for the level of complexity, all at very low cost.

There are two distinct CBSD models that use packages as components. In one, which we call the “platform” approach, a single component is employed as a platform upon which to build a system. Using a database management system as a platform is an example that has been employed successfully for years. In work related to ours, Goldman and Balzer have used PowerPoint as a platform for software architectural drawings and animations [32]. Modern package vendors continue to favor this model, perhaps for economic reasons. For example, the Visio drawing tool provides an API and a set of specialization capabilities with which vendors can build domain-specific applications.

The second model is one in which multiple component applications are integrated tightly into a single application. This is the model that we are evaluating. We believe that it is an important model because it addresses integration issues central to any mature component industry. Today's components also appear to be compatible with this model—many systems need functions in multiple orthogonal sub-domains, each of which tends to be addressed by a different package.


Figure 6.1: Reliasoft Blocksim

Figure 6.2: Microsoft Visio

The potential of this approach is illustrated in Figures 6.1 and 6.2. Figure 6.1 shows a screenshot of Reliasoft Blocksim, a commercial tool for modeling and analyzing reliability block diagrams. Figure 6.2 shows a screenshot of Microsoft Visio, a general-purpose graphical drawing tool. At a glance, it is obvious that both software packages share much functionality in common. They both support text formatting, scrolling and zooming of drawings, a stencil of drawing shapes, print preview, etc. When a Reliasoft salesman was asked about the striking similarity to Microsoft Visio, the salesman replied that Reliasoft copied the Visio interface because users were already familiar and satisfied with it. Reliasoft mimicked the Visio user interface, leveraging the user interface engineering effort invested in Visio’s design. However, they chose to build the interface and functionality from scratch. This example demonstrates the potential of the POP approach—if packages such as Visio can be used effectively as components, a large amount of the effort required to develop a tool such as Blocksim can be avoided.

6.2.1 POP Component Characteristics

POP components have several unique features that differentiate them from components such as function libraries, databases, ActiveX controls [14], or JavaBeans [52]:


• The components provide tremendous function to system integrators within general sub-domains that are important across broad families of systems, e.g., textual and graphical editing, database, etc. This means that fewer components need to be integrated, thereby reducing the cost of design.
• Despite the large amount of reusable functionality that is provided by such components, the price to obtain the components is low. This is because packages are sold not only as components but also in the very large market of stand-alone applications. This large market amortizes the costs associated with developing reusable functionality.
• In addition to narrow specialization mechanisms such as document templates, packages often provide sophisticated mechanisms for end-user programming. POP components often have scripting capabilities that were originally designed for creating macros to automate repetitive tasks, but which can be used by programmers to implement functionality specific to their applications.
• Unlike low-level components, POP components have user interfaces with a large amount of functionality.
• The cost to learn and use systems built from such components is reduced because people already know the particular components and the style of components. As a result, developers can build software having reduced learning costs due to the familiarity and ease of use of the components.
• The problems of component licensing and payment are avoided because end-users buy the components. The integrator just sells application-specific code, including “glue code”, which utilizes the installed packages.
• Modern application suites impose integration standards that help mitigate problems of architectural mismatch.


• Applications are upgraded continually with powerful new features, providing system integrators with large increments of function at very low cost. Evolution is a two-edged sword, in that new versions might not work. However, large installed bases often drive component vendors to maintain backward compatibility. This means that economic pressures exist which discourage component vendors from adversely changing the design of the components.
• Documentation effort is reduced because the functions provided by the components are already documented.
• POP components operate in separate address spaces and with separate threads. In most operating systems this means that communication with POP components is costly because it involves remote procedure calls. Furthermore, POP components can fail independently of the software that uses them.
• POP components have their own data representations that are different from, and independent of, those of the application which uses them. For example, Microsoft Word's internal document representation can be manipulated by the user through the user interface, or by the programmer through the programmatic interface. A related issue is persistence—the overall state of the application is distributed across a set of POP components. In order to save and restore this state the application must save and restore the state of the individual POP components as well.

6.2.2 Research Questions

Despite the obvious attractions of the model, it is not clear that it will work with today's components. To the best of our knowledge, outside of our own work, no systematic evaluations of the model have been published in the literature. Without such an evaluation, it is unclear how the POP approach fares in the face of its many potential challenges.

• Large components bind design decisions that can make it hard or impossible to meet overall application requirements [7, 54].
• Even if a set of components is matched to current requirements, it might be hard or impossible to meet new or changed requirements using given components.


• The size of modern components makes complete and accurate specification of their interfaces unlikely. In the resulting uncertainty, component inadequacies can go unrecognized until late in development, and given components can be used incorrectly or sub-optimally.
• Even if a component is specified fully, the complexity of the specification can make it hard to use effectively.
• Large components consume resources that might not be used in a given application. In practice, this might or might not be a problem for a given user, given the continuing decrease in the cost of hardware.
• A component application programming interface might not be consistent with its user interface. For example, a component might not make all user-level functions available to the integrator. Such a mismatch can mislead the integrator into believing that a component can perform functions that actually are not available, resulting in nasty surprises late in development.
• Components provide their own user interfaces. This eases the problem of making functionality available to users, but creates the problem of specializing and integrating multiple components' interfaces and keeping users from invoking functions that should not be used in the system context. Testing is also complicated because both the user interface and the programmatic interface must be exercised.
• Because POP components run in their own address spaces, integrating them incurs a cross-application communication penalty. In addition, we have found that components can display internal inefficiencies when manipulated via their APIs.
• Components are distributed in binary form, which helps protect vendors' intellectual property. However, without source code it is impossible to investigate or repair limitations, or to adapt components in ways not explicitly provided by the vendor. Of course, even if source code for old, large applications were available, it is unclear that it would be especially helpful due to its size, complexity, required build environment, etc.


• A program that uses multiple executable components is inherently concurrent, but modern components do not provide many functions for concurrency control.
• Unlike traditional components, new versions of mass-market applications may be acquired independently of the developer. As a result, the developer must support multiple versions of components, and it is not clear to what degree new component versions will be backward compatible.

6.3 An Evaluation: Package-Oriented Programming for Tools

In this research we evaluate the feasibility of using the POP approach for the development of modeling and analysis tools. Our evaluation of the POP approach is based on an end-to-end experiment to produce a tool which meets the demanding requirements of industrial users. The application domain is reliability analysis using dynamic fault trees.

Repeated experience has shown that merely speculating on the likelihood of success of a CBSD model is untrustworthy. On the other hand, the issues are too complex to be treated analytically. We believe that a rigorous evaluation of a relatively untested CBSD model requires that it be tested against the demands of real applications in practice. To that end, we attempt to use our approach to develop a tool which supports the innovative DFT notation, then deploy that tool to practicing reliability engineers for their evaluation. Ideally, a model would be applied to many systems, but that is not a feasible experimental method. Our approach to addressing these problems is thus to apply such a model in a case study that aims to produce an industrially viable system representative of a class of important applications. We emphasize industrial viability in order to ensure that the demands of real users apply in the experimental context, thereby helping to avoid the possibility of overlooking critical details related to the application of the model in practice.

The application domain we have chosen, modeling and analysis tools, appears to be well suited for the evaluation of the POP model. Such tools are often structured in terms of an “analysis core”, which implements the semantics of the model in terms of an analysis algorithm, and a “superstructure”, which provides model editing capabilities and other supporting functionality.


The development of the analysis core, while important, is often dwarfed by the effort required to develop a feature-rich and usable superstructure.

Our approach, in a nutshell, was to apply the POP model to the development of a tool for the modeling and analysis of fault-tolerant systems using the dynamic fault tree language. During the course of this effort, we analyzed the performance of the model with respect to the challenges we have described. This gave us valuable insight into the technical and non-technical issues involved in the approach. In addition, we distributed the resulting software to engineers and asked them to evaluate it. This provided an important independent evaluation of the ability of the approach to meet the requirements of real users.

The next chapter describes our efforts to build a tool for dynamic fault tree modeling and analysis using the POP approach. We discuss our experiences using the approach, and present data on end-user evaluation of the resulting tool. Chapter 8 presents an evaluation of the POP approach, addressing the particular challenges described in this chapter.

6.4 Related Work

The feasibility of the package-oriented approach was first discussed by Sullivan and Knight [63]. Their evaluation was based on a very early prototype that did not meet real user requirements. As a result, it did not address many of the challenges we address in this work, including the ability of the approach to deliver software that satisfies real users.

Goldman and Balzer report on the use of PowerPoint as a platform for building a software architecture modeling and analysis tool [32]. They cite many of the same benefits that we discuss here and in earlier work. Importantly, we have explored the integration of multiple components using the POP approach. Our tool was also driven by demanding customer requirements. We have also addressed the component evolution issue.

In the context of U.S. government COTS initiatives, Brownsword et al. describe the impact of components on lifecycle activities such as requirements definition, testing, and maintenance [12].


They also identify the relationship between component selection, requirement specification, and architecture design, and the potentially high cost of specializing components. They provide no real examples.

Fox et al. provide a COTS-based software development process for information system infrastructure [30, 31]. There is a heavy emphasis on early component evaluation and selection, and on risk reduction methods such as prototyping. They also cite the problems of understanding large and complex components, and the strategy of working with components given incomplete knowledge. Similarly, our experience has shown component upgrade to be a significant risk.

Lédeczi et al. [46] describe a generator for design environments. Given a meta-model of a modeling language, their tool generates a design environment. The underlying integration mechanism is COM [56], the same technology that is the basis for our integration of application components. While their work is similar to ours in terms of modeling and analysis environments and the use of COM, our work focuses on the integration of applications as components, whereas theirs deals primarily with the automatic generation of design environments. Furthermore, the components that are used in the generation of the environment are not independently developed mass-market applications.

As in our work, Succi et al. [61] are investigating the integration of POP components. However, they target cross-platform integration using a Java-based architecture as the integration mechanism. In contrast to our work, they do not attempt to achieve tight functional integration, and do not address user interface integration.

Chapter 7 Galileo: A Tool Built Using the POP Approach

This chapter describes our attempt to build an industrially-viable modeling and analysis tool using the package-oriented programming model for component reuse. We first present an overview of the tool, then describe our experiences using POP in its construction. The last section presents survey data collected from end users on their impressions of the tool which we were able to build using the approach. Portions of this chapter have appeared in other publications [19, 26, 64].

7.1 Description and Features

In order to explore and evaluate the POP approach, we have employed it in the development of Galileo [65]. Galileo is a richly functional and easy-to-use tool which supports the dynamic fault tree language and analysis methodology. The bulk of the user-level functionality is provided by a tightly integrated set of mass-market application-components. The current version uses Microsoft Word to implement the textual editing capability, and Microsoft Visio to implement the graphical editing capability. The POP approach has allowed the construction of a tool that appears to be far richer, easier to use, and more easily changed than would otherwise be possible at a comparable level of development cost and code complexity.

The Galileo research group, under the direction of Kevin Sullivan, has investigated the use of the POP approach across a number of component versions. For example, the tool has supported Word 6.0, 95, 97, 2000, and 2002, and Visio versions 4.0, 4.1, 5.0, 2000, and 2002.


In addition to Word and Visio, early versions of the tool also used Internet Explorer as an integrated help system and Microsoft Access as a database for editing basic event data and storing fault trees.

Some of the important features of Galileo are as follows:
• supports the DIFTree [28, 36] modular dynamic fault tree analysis approach
• allows the user to edit a fault tree in either a textual or graphical representation
• provides automatic rendering from the textual view to the graphical view, or vice-versa
• exploits the user interfaces of off-the-shelf packages—e.g., zoom, find-and-replace, print preview, etc.
• exploits the user's familiarity with common applications, significantly reducing training costs
• integrates the separate interfaces of the component applications into a single Galileo interface
• supports a range of component versions
• can operate when some components are not installed on the system

Figure 7.1 shows a screenshot of the tool. The upper-right subwindow displays the graphical view of a fault tree, provided by Microsoft Visio. The subwindow below displays the equivalent textual view, provided by Microsoft Word. The upper-left subwindow shows the graphical “stencil” that holds shapes for use in the graphical construction of fault trees. The multiple application interfaces are composed into a single overall Galileo window using Microsoft's Active Document Architecture. This mechanism manages the integration of package and Galileo menu items, and automatically modifies the interface to reflect that of the currently selected component.

Fault trees can be edited in either the textual or graphical view. The user creates a fault tree in the graphical view by moving fault tree shapes from the stencil to the drawing page, and then connecting them to express the relationships between the shapes. Alternatively, the engineer can build a fault tree using the automated editing functionality accessed via the large buttons at the top of the main window. Users can also edit fault trees in the textual view, which contains the same information as the graphical view and can be edited like any text document.


Figure 7.1: A screenshot of Galileo/ASSAP 3.0.0

Galileo presents views only for those package components that are actually installed on the user's system. A user who lacks Visio sees only the text view based on Word, for example. A user can edit the currently active view, browse both views, and render the active view to the other view for editing. The textual and graphical views are saved as a single file. Galileo manages component initialization, lifetime, and shutdown, as well as the invocation of the underlying analyzer. We use Microsoft's OLE [9] to drive the components through their application programming interfaces. We use the Microsoft Active Document standard [51] to integrate the interfaces of the packages into the Galileo interface. The components conform to Microsoft's underlying Component Object Model (COM) standard [56], which provides for low-level remote procedure call interoperability, among other capabilities.


The Active Document standard [51] was particularly important. It made it possible to integrate multiple application interfaces into a coherent presentation. It supports the merging of menus of separate applications, and their presentation as sub-windows of Galileo. As views are selected, buttons, menus, and other user interface components change to reflect the package interface for that view. Galileo provides its own menu items, which are always present. One allows the user to invoke the view rendering functionality. The other invokes fault-tree-specific actions such as the analyzer.

Figure 7.2 shows the basic architecture of Galileo. The main Galileo mediator coordinates the views and the analysis engines, and implements the main application window. Most of the user interface and modeling function of the tool is provided by the package components. Each component runs as a separate process (indicated by dashed boundaries in the figure), but is integrated into the same user interface. The application-components are encapsulated by wrappers, each of which implements two key functions, get and put. The get operation extracts a fault tree object (a C++ class instance) from a view. The put operation works in reverse, rendering a fault tree as either a text stream or a graphical drawing. Wrappers abstract the details pertaining to particular versions and types of components. Galileo detects installed components at startup and utilizes the wrapper that corresponds to the installed version.

We have developed Galileo based on the needs of industrial users. We developed an early version largely based on a technical evaluation performed by an engineer at Lockheed-Martin, who helped clarify what was present and missing from the tool in terms of usability, modeling capabilities, and analysis capabilities. Engineers and researchers at NASA Langley Research Center (LaRC) also provided feedback on the value of aspects of our early system. From the beginning we have released intermediate versions of the tool for free download from the Web. To date many hundreds of industrial users have acquired the tool. This strategy has also been a source of valuable user feedback. One piece of feedback from a major defense contractor described the inadequacy of the user interface in its lack of support for multiple-page and hierarchical drawings. We took that feedback as the basis for the user interface requirements of the next version we produced.

Figure 7.2: The architecture of Galileo (Microsoft Word and Microsoft Visio behind textual and graphical wrappers with get and put operations, Microsoft Internet Explorer, a consistency maintenance engine, the Galileo mediator and user interface, and the fault tree solvers)
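The wrapper interface described above can be sketched as follows. This is an illustrative Python rendering with hypothetical names; the actual wrappers were implemented in C++ and drive the packages over COM, so the sketch shows only the shape of the get/put abstraction, not the real implementation.

    from abc import ABC, abstractmethod

    class FaultTree:
        # Stand-in for Galileo's internal fault tree representation.
        def __init__(self, events=None):
            self.events = events or []

    class ViewWrapper(ABC):
        # Common interface for the textual (Word) and graphical (Visio) view wrappers;
        # version- and package-specific details are hidden behind these two operations.
        @abstractmethod
        def get(self):
            """Extract a fault tree object from the view (e.g. by parsing the document text)."""

        @abstractmethod
        def put(self, fault_tree):
            """Render a fault tree into the view (e.g. by drawing it as a Visio diagram)."""

    class TextualWrapper(ViewWrapper):
        def __init__(self, word_application):
            self.word = word_application  # COM automation object supplied at startup

        def get(self):
            # A real implementation would read and parse the Word document via OLE.
            return FaultTree()

        def put(self, fault_tree):
            # A real implementation would write the textual representation back to Word.
            pass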

7.2 Development Experiences

In this section we describe observations and experiences gained through the application of the POP approach to the development of Galileo.

7.2.1 Component Capabilities

The original suite of components that we identified as providing leverage for Galileo were Microsoft's Word and Access, and Visio's drawing tool. (At the time, Visio was owned by Visio Corporation; it has since been acquired by Microsoft.) We observed that Word and Visio covered a significant portion of the required model editing functions, and that they had mechanisms for domain-specific specialization.

Access provided basic persistence and concurrent access to multiple fault trees, and the ability to generate reports.

Applications such as those in Microsoft's Office suite provide the programmer certain specialization capabilities. In addition to user-visible mechanisms such as custom menus, applications also expose a programmer-visible object model. This object model allows the programmer to set properties of objects in the application, call methods on objects, or handle events raised by objects. For example, Visio provides user-level customization through the construction of custom stencils of drawing shapes. Each stencil shape has a shapesheet, a spreadsheet with instructions for drawing the shape, as well as properties for its formatting, text label, grouping behavior, etc. For example, a shape may be locked against being flipped horizontally, or may contain context menu items which invoke software functions when selected.

Functions are implemented using a general-purpose programming language such as Visual Basic or C++. In addition to the arguments supplied to a callback, the programmer has access to any objects that are visible in the scope of the function, such as the global Application, ActiveDocument, and ActiveWindow objects. From these objects, the programmer can access collections of objects related to various drawing elements. Objects expose events which allow behaviors to be intercepted and modified. For example, Visio has drawing-level events associated with the opening of documents, the deletion of shapes, and the connection of shapes. However, not all events are exposed. For example, Visio does not expose low-level events associated with mouse movements or individual key presses (with the exception of the double-click of a shape).

7.2.1.1

Integrated Help System

Early versions of Galileo lacked integrated documentation, as the Windows standard help system required specialized development tools and experience. The documentation consisted of a set of web pages on our website which we had developed for users. When Internet Explorer became available as an architecturally compatible component, we saw the opportunity to use it to integrate our existing hypertext-based documentation into the tool. We customized the application to display
only Galileo hypertext documents, and to disable general web browsing functionality. We then easily integrated its interface into that of Galileo.

7.2.1.2

Automatic Graphical Layout

Early versions of Galileo included a graph layout algorithm which would compute positions for the shapes of a given fault tree. Galileo would then communicate with Visio in order to create and position each shape, then connect them. We found this approach to be extremely slow, as each shape required multiple calls across application boundaries. As a result, the usability of the tool was significantly degraded for large fault trees. This was a major concern, as one user told us that fast automatic layout distinguished high-quality tools from others. Visio 2000 included an automatic layout feature, which we prototyped to verify that it would work for fault trees. Success with the prototype gave us confidence that we could utilize Visio's efficient built-in automatic layout function, avoiding cross-application communication costs. The result was impressive: automatic layout that is nearly instantaneous even for large fault trees. This helped overall tool speed by removing cross-application communication for drawing layout.
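A minimal sketch of the in-component approach follows. It is illustrative rather than Galileo's actual code, and it assumes Visio 2000 or later, where the Page object exposes a Layout method.

    ' Lay out the current drawing inside Visio itself, so that no
    ' per-shape calls cross the application boundary.
    Public Sub LayOutFaultTree()
        Application.ScreenUpdating = False   ' suppress redraws during layout
        ActivePage.Layout                    ' Visio computes all shape positions
        Application.ScreenUpdating = True
    End Sub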

7.2.1.3

“Smart” Connectors

In watching users interact with Galileo, we discovered that a common problem was the use of connectors—users had trouble connecting shapes properly, and could not easily locate a mislaid connector in the drawing. To address this problem, we utilized Visio’s built-in programmability to make the connectors dashed instead of solid until both ends are connected to shapes. In about an hour, we were able to identify the event associated with the connection of a connector and the property which controls the style of the connector, and then implement Visual Basic code to check for connections on both ends of the connector and set the property accordingly. Without the programmability of Visio, the rapid implementation of this feature would not have been feasible.
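The essence of that implementation can be sketched in a few lines of document-level VBA (in the drawing's ThisDocument module). This is a reconstruction for illustration, and the specific LinePattern values (1 for solid, 2 for dashed) are assumptions about Visio's stock line patterns.

    ' Keep connectors dashed until both ends are glued to shapes.
    Private Sub Document_ConnectionsAdded(ByVal Connects As Visio.IVConnects)
        Dim i As Integer
        For i = 1 To Connects.Count
            UpdateConnectorStyle Connects(i).FromSheet
        Next i
    End Sub

    Private Sub Document_ConnectionsDeleted(ByVal Connects As Visio.IVConnects)
        Dim i As Integer
        For i = 1 To Connects.Count
            UpdateConnectorStyle Connects(i).FromSheet
        Next i
    End Sub

    Private Sub UpdateConnectorStyle(ByVal connector As Visio.Shape)
        If connector.Connects.Count = 2 Then
            connector.CellsU("LinePattern").FormulaU = "1"   ' both ends glued: solid
        Else
            connector.CellsU("LinePattern").FormulaU = "2"   ' a dangling end: dashed
        End If
    End Sub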


7.2.2


Component Limitations

Many of the difficulties we encountered were due to hidden, undocumented limitations in the components. In this section we describe some of the issues we encountered.

7.2.2.1

Detecting Shape Deletion

In the original design, developed by Sullivan and Knight [63], the goal was to implement an eager update scheme so that modifications in one view would be detected and propagated to the other. In order to implement this functionality, we needed to be able to detect the deletion of a shape in Visio. During our experiments with the application, we discovered that Visio did not expose events for every method of deleting a shape: selecting delete from the context menu, selecting the shape and pressing the delete key, and selecting the shape and pressing CTRL-X. Consequently, we could not detect fine-grained editing operations, which caused us to move to a batch-oriented rendering scheme [63]. The resulting batch-update protocol that we designed for rendering between views requires the user to put the active view in a consistent state before the model can be rendered in the alternate view.

7.2.2.2

Read-Only Views

One side-effect of the batch update scheme is that the user can make changes in one view, then accidentally lose those changes by rendering from the other view. In order to address this issue, we investigated the possibility of making the documents read-only. Unfortunately, the APIs for the components did not support such functionality. We explored an alternative approach in which Windows GUI events were intercepted before reaching the application-component. Our initial prototypes showed that we could easily prevent all events from reaching the component. Unfortunately, this approach would not only prevent the document from being edited, it would also prevent the user from viewing the document by scrolling the view, zooming, etc. While it may have been possible to filter the events, we decided to forgo this approach for two reasons. The first was that customization of component behavior in this manner was expensive, as we were exploring the use of internal aspects of the components without the benefit of a standard interface.


Secondly, depending on such internal features would increase the cost of future maintenance, as these features are not part of a standardized interface. This experience showed us that if one is willing to incur additional cost, it is possible to utilize internal features of POP components in order to implement required functionality. This result is similar to that of Goldman and Balzer [32], who had a similar experience building a domain-specific design environment on top of Microsoft PowerPoint. They had to poll the Windows event queue to infer editing operations that were not exposed by PowerPoint. The alternative we chose to take was to modify the requirements slightly in order to work around the difficulty. We modified our design such that the inactive view is highlighted in red, but editing is not disabled. The solution is somewhat unintuitive and it increases the mental effort on the part of the user. The question is whether the benefits are enough to outweigh such inconveniences. Section 7.3 describes survey results from users which indicate that this solution is adequate for most users.

7.2.2.3

Cross-Page Linking

A new graphical interface feature we added was the ability to hyperlink a connector on one page to a sub-tree on another page. We tried representing bi-directional links by storing Visio unique IDs of linked shapes within the shapes which reference them. A sub-tree linked from many places would thus store such an ID for each such location. Unfortunately, all of the available methods for storing such references placed severe limits on the number of references that could be stored, which would have prevented us from providing the scalability that users desired. Furthermore, we discovered through experimentation that the references would be invalidated if the referred shape was “cut” from and then “pasted” to the drawing page, because the unique IDs changed. When we contacted Visio Corporation, they told us that the upcoming version would have hyperlinks that would give us the functionality we needed. We continued development assuming that the functions would be provided, and they were. In this case, component evolution coupled with knowledge from the component vendor helped us to determine a workable design. The lesson is that design in this style is anticipatory in an interesting way: we were taking risks on future features and fixes that were not assured.
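The hyperlink mechanism that eventually shipped can be sketched as follows. The page and description strings are invented, and this is an illustration of the Visio 2000-era hyperlink objects rather than Galileo's exact code.

    ' Link a page-break connector to the page holding the referenced sub-tree.
    Public Sub LinkToSubTree(ByVal connector As Visio.Shape, ByVal pageName As String)
        Dim hl As Visio.Hyperlink
        Set hl = connector.Hyperlinks.Add
        hl.Description = "Continued on " & pageName
        hl.SubAddress = pageName          ' a page name within the same document
    End Sub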

7.2.2.4

Capacity Limitations of Visio

Examples of capacity limitations include a limit of 31 characters on the length of shape identifiers, a limit of 254 characters on a “list” custom property of a shape, a limit of 127 characters for a “string” custom property, and a limit on the number of handles to Visio shapes that the programmer can hold open at one time. Not all of these limits were documented; some could be discovered only during development. To help mitigate the risks associated with such limitations, we prototyped designs whenever possible.
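The following fragment illustrates the kind of small probe such prototyping involved. It is a reconstruction, not the exact code we used, and it assumes that Visio either truncates or rejects an over-long shape name.

    ' Probe the undocumented limit on shape name length.
    Public Function FindNameLengthLimit(ByVal shp As Visio.Shape) As Long
        Dim n As Long
        On Error Resume Next                     ' an over-long name may raise an error
        For n = 1 To 256
            shp.Name = String$(n, "a")
            If Len(shp.Name) < n Then Exit For   ' or Visio may silently truncate it
        Next n
        FindNameLengthLimit = n - 1
    End Function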

7.2.2.5

Slow Visio Startup

A key problem in the use of Visio as a component is that it takes an unusually long time to load. When started as an application, Visio is ready for use within a few seconds. However, when embedded as a component, Visio takes 30 to 40 seconds to start. This behavior is not limited to Galileo—Microsoft’s Binder application, which also uses the Active Document technology, demonstrates this same problem. We have also documented this problem across multiple versions of Visio. We believe that this problem is not inherent in the use of applications as POP components, as other application components such as Word and Excel can be embedded in just a few seconds. Perhaps because Visio’s COM interfaces were developed by Visio Corporation and not Microsoft, they did not benefit from optimizations applied to Word, Excel, and other Microsoft Office applications. We believe that this problem will be solved by the Visio developers as Visio’s use as an embedded component becomes more widespread.

7.2.3

Component Standards

The architectural mismatch that Garlan et al. [31] experienced can largely be avoided if the components are developed to a set of strict standards. In mature component industries such as electronics, component standards play a critical role. In this section we investigate the effectiveness of standards in the use of POP components.

7.2.3.1

Microsoft Access

The Active Document Architecture (ADA) provides a means of integrating the user interfaces of multiple applications. In this architecture, documents can be embedded within other documents, and when activated, certain user interface elements of the containing application change to reflect the interface of the application which created the embedded document. The development of this technology was critical to the success of Galileo. Without it, users encountered a confusing set of application windows which appeared to be unrelated but which were in fact coordinated by Galileo. When we attempted to use this technology in Galileo, we found that Access did not support the standard, even though it is a member of the Office suite. The lesson is that conformance of components to standards cannot be taken for granted. We stopped using Access, and instead turned to the file system for fault tree storage. A second lesson is that a willingness to search for solutions that provide value is of the essence in this development approach, in which one continually learns about unexpected aspects of the development environment.

7.2.3.2

Visio

Our earlier versions of Galileo were limited to drawing trees on one page. An engineer from industry who downloaded the tool emphasized the need for multi-page drawings. This new requirement placed demands on the Visio component. To mitigate the risk that we could not manipulate Visio pages programmatically, we built a stand-alone, non-Active Document prototype that created a Visio process and built multi-page drawings. During later development we discovered to our surprise that the page manipulation functionality provided by the published API did not work when Visio was integrated using the Active Document Architecture! Its page manipulation interface did not function when a document appeared as a child window in another program. Unfortunately, we could find no workaround. The firm requirement of our “customers” to have this support caused us to contact the component vendor to lobby for correct support of the published API in the next version of the package.

During our discussions with Visio Corporation, it became clear that to get the functionality we needed we had to provide a rationale in business terms. After we described our funding from NASA
Headquarters under our agreement with NASA LaRC, they agreed to fix the problem in the next release. We continued to develop our tool to use the page manipulation functions, in the expectation that a suitable version would become available, which it did. We found this problem only because we were both constructing a system from multiple components, and because we were responding to real requirements from real users. Interestingly, we found the characterization that “component design can not be controlled” to be somewhat inaccurate. In fact, our experience shows that component design can be influenced, but that a compelling case must be made that the given change may benefit the component vendor. Furthermore, we would not have encountered this integration issue had we only evaluated the component in terms of our ability to build functionality on top of it as in the “platform” style of POP component reuse.
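For illustration, the stand-alone prototype mentioned above amounted to little more than the following Visual Basic sketch (a reconstruction; the page name is invented), and this is exactly the interaction that later failed when Visio was embedded as an Active Document.

    ' Drive a separate, full Visio process and manipulate pages directly.
    Public Sub MultiPagePrototype()
        Dim app As Object, doc As Object, pg As Object
        Set app = CreateObject("Visio.Application")   ' stand-alone Visio process
        Set doc = app.Documents.Add("")               ' new blank drawing
        Set pg = doc.Pages.Add                        ' add a second page for a sub-tree
        pg.Name = "Subtree-1"
        ' ... drop and connect shapes on pg here ...
        app.Quit
    End Sub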

7.2.4

The Design Process

We have found the process of developing a system in this style to be one of continually trying to reconcile our understanding of what the user will value with the capabilities and constraints imposed by the component base. Traditional top-down system design is ill-suited for developing applications using this approach [17]. Rather, the early stages of design resulted in an architecture within which we continually searched for workable designs, with a heavy emphasis on prototyping. We found that new versions of components often changed in subtle ways which compromised our design, despite the use of “standardized” interfaces. This forced us to carefully validate new versions of components. We now discuss several dimensions of this approach.

7.2.4.1

High-Level Editing Operations

Earlier versions of Galileo provided only built-in Visio methods for graphical editing of fault trees. For example, to connect two gates, the user would have to drag two gate shapes from the stencil to the canvas, then drag a connector shape to the canvas, then link each end of the connector to each of the shapes. Our potential users complained that this editing interface was fairly tedious. We decided to provide a set of fault tree drawing construction functions to automate most of these
tasks. For example, we provided a function that allows a user to select one gate and then, with the push of a button, add and connect a new gate under it. Our team explored three different sets of high-level operations, attempting to identify useful operation sets that could be implemented efficiently given the constraints of Visio. For example, no operation could be specified that required iteration over every shape in the drawing, due to the cross-application communication penalty. Having settled on what appeared to be both valuable to users and feasible given the component constraints, we needed to actually implement the functionality on top of the only partially known Visio virtual machine. We mitigated the implementability risk using a “bridging” approach. We decomposed the high-level, domain-specific operations into more basic low-level operations. Simultaneously, we performed a bottom-up abstraction of the raw Visio API calls into more domain-specific operations. The condition that these two efforts “meet in the middle” had to be satisfied before we could proceed to implementation. It took three attempts before we found an operation set that we could implement effectively.
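The flavor of the resulting operations can be suggested by the following sketch of a "connect a gate below the selected gate" operation. It is illustrative only: the master objects, the offset, and the procedure name are assumptions, and the real implementation differs in detail.

    ' Drop a new gate under an existing one and glue a connector between them.
    Public Sub AddGateBelow(ByVal parent As Visio.Shape, _
                            ByVal gateMaster As Visio.Master, _
                            ByVal connMaster As Visio.Master)
        Dim pg As Visio.Page
        Set pg = parent.ContainingPage
        Dim child As Visio.Shape, conn As Visio.Shape
        ' Place the new gate a fixed distance below the selected gate.
        Set child = pg.Drop(gateMaster, parent.CellsU("PinX").ResultIU, _
                            parent.CellsU("PinY").ResultIU - 1.5)
        ' Drop a connector and glue its two ends to the gates (dynamic glue).
        Set conn = pg.Drop(connMaster, 0, 0)
        conn.CellsU("BeginX").GlueTo parent.CellsU("PinX")
        conn.CellsU("EndX").GlueTo child.CellsU("PinX")
    End Sub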

7.2.4.2

Augmenting the Graphical Interface

One of Visio’s specialization capabilities is to allow the developer to create custom buttons on the user interface that invoke custom code. Unfortunately, we discovered that although we could use this capability to invoke code written in Visio’s Visual Basic for Applications, we could not use it to invoke the Visual C++ code which implemented the high-level fault tree editing operations. As a workaround, we chose to modify our design so that the buttons were owned by the main Galileo window, and so that activating the buttons would cause Galileo to invoke the necessary functions on Visio. The downside of this design is that the buttons are not managed by the ADA—they do not disappear when Visio is not the active view. In this case, our inability to make the original design work might be attributed in part to our lack of knowledge concerning the proper method of invoking external code. Given our uncertainty about whether the problem was our lack of knowledge or a limitation of the component, we opted to use an alternate design which was feasible and low cost, if not entirely satisfactory.

7.2.4.3

Component Validation

In addition to prototyping, our team used automated techniques to verify performance and capacity assumptions of package components. We used an automated tool to systematically explore performance, and to verify the performance of new component versions [17]. During these experiments we found yet another unexpected component characteristic. In order to speed up rendering in Visio, we disabled screen updating while adding shapes to the page. We found to our surprise that screen updating was re-enabled whenever we created a new drawing on which to place shapes. Thus, in order to ensure that screen updating is disabled, we had to be sure to disable it after creating any new drawing.
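The workaround is simple but easy to forget; a sketch of the pattern follows (the procedure name is illustrative, not Galileo's actual rendering code).

    ' Screen updating must be re-disabled after every new drawing is created,
    ' because creating a document silently re-enables it.
    Public Sub RenderIntoNewDrawing(ByVal app As Visio.Application)
        app.ScreenUpdating = False
        Dim doc As Visio.Document
        Set doc = app.Documents.Add("")   ' this call re-enables screen updating
        app.ScreenUpdating = False        ' so turn it off again before adding shapes
        ' ... add and connect fault tree shapes here ...
        app.ScreenUpdating = True
    End Sub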

7.2.5

Evolution of components

Our experiences dealing with component evolution have been generally positive if risky, despite the negative concerns that many researchers have expressed about the evolution of components. We have already described the benefits of evolution in terms of new features and capabilities. For example, we were able to easily utilize the automatic layout feature and the support for hyperlinks in Visio 2000.

7.2.5.1

Programmatic Interfaces

The components themselves have generally improved over time in terms of the repair of known problems, and in the degree of user-level functionality that was exposed programmatically to the developer. The interface evolution in Microsoft Word is an interesting case study. The first version we used, Word 6.0, had a narrow interface consisting of a single method through which Visual Basic commands were invoked. Our efforts to integrate this component revealed a race condition. Our design used wdEditSelectAll to select the whole document text, and then it retrieved the selected text. However, if we performed the second operation immediately after the first, we would not get all of the text. We found that we could avoid that malfunction by extracting the text after waiting 500 milliseconds for the selection operation to complete.


The next version, Word 95, with a full OLE interface, did not have such a race condition. Unfortunately, we found that calling the Selection() operation followed by GetText() failed because the selection size had a limit much lower than the documented upper limit of 65,000 characters. The alternative design we identified and implemented was to copy the text to the Windows clipboard, and then read the clipboard to acquire the entire text of the document. The next version, Word 97, had none of these limitations, and we were able to implement our functionality without these awkward workarounds. For the Word upgrades from 6.0 to 95, and from 95 to 97, and the Visio upgrades from 4.0 to 4.1, and 4.1 to 5.0, we simply installed new versions of the packages on our machine, ran Galileo, and found that we had an updated tool. The only upgrade incompatibility we observed was the upgrade to Visio 2000, which failed because the component did not support the older interface. In this case, we developed a new implementation of our programmatic interface to Visio which targeted the equivalent functionality in the new component interface. In general, we found that developing new implementations of the programmatic interface was not difficult. In fact, we often did this, even though the older interface was still viable, in order to take advantage of new features in the new interface. There are several lessons we learned in this experience. The first is that components such as Microsoft Word seem to be improving over time. Second is that while components are improving, they will always have extremely large interfaces that are likely to be undocumented, or misdocumented. The components are also complex software entities that may have unknown (even to the vendor) timing or capacity limitations that can cause a particular design to be difficult or infeasible.

7.2.5.2

User Interface Architecture

The need to integrate the programmatic interfaces of components is obvious. What is perhaps not as obvious is the need to integrate their user interfaces. An early version of our tool did not integrate the interfaces of the component applications, but presented multiple package windows to the user. In addition to causing usability problems, such loose integration allowed the user to violate systemwide invariants. For example, the user interfaces of the component applications provided several
mechanisms for closing the application component, which would cause Galileo to fail when it later tried to access the component. Our initial efforts to design around these limitations involved working outside the programmatic integration architecture provided by COM. Our research group developed a technique for modeling the internal behavior of a component (because it is not always exposed), monitoring the user’s actions at the operating system level, and then preventing those actions that violated system invariants [49]. We found this approach to be difficult and risky, as it depended on the interception and manipulation of low-level system events. The advent of the Active Document Architecture, and the conformance to that standard by the packages, was a key development in the success of Galileo. This standard largely removed the problems of user interface integration and masking of undesired behavior.

7.2.5.3

User Interfaces

Because we utilized components with user interfaces, the bulk of our system testing is based on simulated user interaction. Based on a test script, the testing infrastructure manipulates Galileo through its user interface, checking for correct behavior. When Visio 2000 was released, we found that the application could not be manipulated in this manner. This version of the component was simply incompatible with our test environment. We were unable to find a solution, even after contacting both the component and test software vendors. As a result, we were forced to run all of our regression tests manually. Fortunately, Visio 2002 did not have this bug. In another case, we had specialized Visio to display parameters for basic events. In order to improve the usability of the tool, we utilized a feature of Visio which allows the display of some parameters to be dependent on another. For example, we only display the “rate” parameter if the basic event’s distribution value is “exponential”. When the user selects another type of distribution, the parameters of the old distribution are hidden and the new ones are shown. Unfortunately, we found that Visio 2002 does not automatically hide and show parameters as Visio 2000 had. It is unclear whether or not this is by design or a bug in the package. Our solution
was to quickly close and reopen the properties window in order to force a refresh of the displayed properties. These two examples demonstrate an interesting characteristic of the POP approach: unlike APIs, user interfaces undergo frequent and radical evolution. Furthermore, we can expect such evolution, as this is where POP component vendors attempt to increase the value of new versions. As a result, component users must pay careful attention to the risks associated with dependence on user interface features.

7.3

End-User Evaluation

End-user evaluation is important in that it provides an independent assessment of the feasibility of POP for the development of industrially viable software. An early evaluation of the POP approach was performed in the context of an industrial case study on the use of Galileo at the Lockheed-Martin Corporation. An engineer there concluded that Galileo had tremendous potential to aid reliability engineers in that corporation (as reported in an internal study of the tool). It should be noted that this evaluation was based on an early version which did not have an integrated user interface or high-level editing operations. In this section we describe end-user evaluation provided by engineers who used a follow-on version of Galileo during multi-day workshops on reliability engineering. We also describe end-user acceptance of this tool by industry.

7.3.1

Surveys of End-User Satisfaction

On the basis of the potential demonstrated by early versions, NASA LaRC funded the development of a production version called Galileo/ASSAP. The tool was built to a defined set of documented requirements and testing procedures. It has been featured in four workshops on reliability modeling and analysis. The first two-day workshop was conducted at the request of NASA headquarters, and involved a number of engineers from several NASA divisions. The second two-day workshop was conducted at the University of Virginia at the request of NASA Johnson Space Center.


Following their experiences using Galileo during the second workshop, NASA engineers involved with the International Space Station project lobbied to adopt the tool. The remaining two workshops, both nearly a week in length, were held at Johnson Space Center to provide in-depth training of a wider set of engineers working on the space station and space shuttle projects. During each of these workshops engineers were taught the DFT language, and practiced modeling systems using Galileo. They also used the tool to build simple models of systems for which they are responsible. The workshops provided us with a unique opportunity to determine if the POP approach can be used to build tools that meet real user requirements. To that end, we developed two surveys which we distributed to workshop participants.

7.3.1.1

Survey Objectives and Design

Survey participation following the workshops was optional. In order to increase the odds of participation, we created two surveys: a short survey with 34 essential questions, and a longer survey with 77 in-depth questions. The bulk of the questions were multiple choice, with a few short answer and ranking questions. Participants were given the opportunity to comment on every question in order to clarify their answers or give additional information. Questions were designed according to the guidelines suggested by Dillman [23]. Every effort was made to limit the range of interpretation of the questions and otherwise reduce bias.

The overall goal of the surveys was to understand user impressions of the tool in terms of its usability and features. To that end, the surveys contained a number of questions about the difficulty of performing common tasks with the tool. There were also several questions related to the user’s impressions of a tool built using mass-market applications as components. Because we expected a small number of respondents and moderate variation in experience and skills, the surveys also included a number of questions designed to “calibrate” the answers in a subjective manner. For example, several questions in the extended survey explore the user’s skills, tool requirements, and experience with reliability engineering. Other questions ask about the user’s familiarity with other tools, Word and Visio, and Galileo. This information was vital in order to better interpret the responses. For example, users who model systems on a daily basis are more
likely to have stricter requirements for tools. We also believed it important to compare Galileo to tools developed using traditional approaches. Questions were included to assess the user’s impressions of tools that they frequently use, and to compare Galileo’s features and usability.

7.3.1.2

Results

Sixteen engineers from twelve NASA groups and contractors answered both surveys. Ten of these considered themselves to be “familiar” or “very familiar” with reliability analysis techniques in general. The distribution in terms of tool use was fairly bimodal: five users model systems once a month and five model systems every day; six use reliability modeling and analysis tools once a month, and six every day. Almost half of the respondents analyze systems whose failure could lead to over US $1B lost or loss of life. In terms of requirements for tools, users cited “an easy-to-use user interface” and “accurate and precise analysis results” as being the most important characteristics of a tool, above other options such as support for a range of modeling capabilities and speed. Validation of the tool was important to the users—three said it was somewhat important, eight said it was fairly important, and five said it was very important. The two most important factors in the validation of the tool were the use of a comprehensive test suite and a formal specification of the modeling language. Next important were certification by a governmental agency and algorithms published in peer-reviewed journals. Least important was “a decade of sales and development”. Users were asked several questions about the usability of the tool. When asked to rate their satisfaction with the textual editor two of the eight users who answered said they were “somewhat satisfied” and five said they were “fairly satisfied”. When asked how hard the textual view was to use, two of ten respondents said “fairly hard”, six said “not very hard”, and two said “fairly easy”. Only one user indicated that the graphical editor was “fairly difficult” to use. However, seven users said the graphical editor was “fairly easy” to use, and four said it was “very easy” to use. Two of fourteen respondents found common editing operations of the tool to be “not very difficult”, seven found them “fairly easy”, and four found them to be “very easy”.


With regard to features, when we asked how well the domain-specific editing operations of Galileo met their modeling needs, two of fifteen respondents said “well”, nine said “fairly well”, and three said “very well”. When asked about their satisfaction regarding the general editing operations supported natively by the packages, three of thirteen respondents were “somewhat satisfied”, eight were “fairly satisfied”, and one was “very satisfied”. Five of thirteen respondents considered Galileo’s multi-page editing functionality to be adequate, while seven considered it to be fairly adequate or more than adequate. Only one person was dissatisfied with the feature.

Due in part to component limitations, our mechanism for updating one view from another is batch-oriented instead of incremental. Interestingly, all users were satisfied with this approach, despite our intuition that users would prefer a more incremental approach. Another design compromise related to the editing of the inactive view. In this case, Galileo allows editing in the “read-only” view due to component limitations. This design compromise also was acceptable to the users, with two of ten respondents indicating that the approach was “not very difficult” to use, and seven indicating that it was “fairly easy” or “very easy” to use.

We asked several questions about the use of mass-market applications as components. All respondents considered themselves to be at least somewhat familiar with Microsoft Word, with seven considering themselves to be very familiar with the package. Only eight considered themselves somewhat familiar with Visio, and two considered themselves to be very familiar with the package. Galileo’s capability to dynamically adjust to available packages was well received—twelve found the feature to be at least somewhat useful, and eight considered it to be very useful. When asked if their familiarity with the packages helped them use Galileo, two respondents indicated that it did not help, whereas twelve said it helped at least a little, with six saying that it helped a lot. One user cited the resource consumption issue which we had identified as a potential problem. The user had an older machine with limited resources, so the tool did not perform very well.

Five of six respondents, when asked how the usability of Galileo compares to other tools, said that it was the same or better. One person said that it was much worse. All six respondents indicated that the features for constructing a model were the same or better than other tools. Three of four users indicated that the overall tool performance was the same or somewhat better than other tools,
and one user said the performance was much worse.

7.3.2

Adoption of Galileo Into Engineering Practice

Following their experiences using Galileo during a workshop, NASA engineers involved with the International Space Station project lobbied to adopt the tool. Today, NASA has mandated the use of the Galileo/ASSAP version of Galileo to model the causes of observed failures. According to anecdotal feedback from engineers in the fault diagnosis and repair group, the tool’s fault tree editing interface provides editing capabilities which far exceed those of the commercial tools which they had been using.

The NASA engineers have also reported that the ease of use of Galileo has led to a significant change in the engineering process. Previous tools required domain experts to work with reliability engineers in order to develop models. This resulted in an ad hoc relationship between domain experts and reliability engineers for the construction of system models. With Galileo, NASA was able to institute a more rigorous modeling process in which domain experts are able to model the system’s reliability themselves, without necessarily involving reliability engineers.

NASA’s satisfaction with the Galileo tool has also led to follow-on funding for a new version of the tool. This new version is planned to include new features at the request of both NASA Langley Research Center (our primary sponsor) and the ISS group at NASA Johnson Space Center. We believe that the adoption of Galileo by a NASA group and the desire to extend this research effort are good indicators that our approach has succeeded in delivering the features and usability that real users require. This experience provides a data point which supports the use of packages as components to provide the bulk of the functionality and usability of a modeling tool.

Chapter 8 Evaluation of the POP Approach

In this chapter we evaluate the package-oriented programming approach. We first discuss the impact of our ambitious requirement for industrial viability. We then describe the features of the approach which provided for the successful development of the Galileo tool. The final section revisits Section 6.2.2, taking a critical look at the performance of the approach relative to the challenges we described. Portions of this chapter are based on previously published results [20].

Galileo is based on the tight integration of multiple architecturally compatible, commercial off-the-shelf (COTS) packages. It is not built on a single package; instead it integrates multiple packages as co-equal components. Brooks claims that the ability to do this could lead to “radical improvement” in software development productivity [11]. His thoughts on the matter are echoed in the comments of Feigenbaum, the Chief Scientist of the United States Air Force:

We are now living in a software-first world. I think the revolution will be in software building that is now done painstakingly in a craft-like way by the major companies producing packaged software. They create a suite—a cooperating set of applications—that takes the coordinated effort of a large team. What we need to do now in computer science and engineering is to invent a way for everyone to do this at his or her desktop; we need to enable people to “glue” packaged software together so that the packages work as an integrated system. This will be a very significant revolution [15].

By using the POP approach to build Galileo we avoided designing a tremendous amount of
software from scratch. Instead, we only designed and implemented a fault tree data type and underlying analysis techniques; we specialized the packages for our purposes; and we wrote code to drive the packages and to “glue” them together. While we encountered numerous difficulties during this experiment, we were able to develop a fully featured, usable tool in about 41,000 lines of code. This cost is orders of magnitude below that which we would have incurred had we attempted to build the same functionality from scratch.

8.1

Targeting Industrial Viability

We believe that by setting a target of industrial viability, we have been able to build Galileo to a set of requirements which ensures that our evaluation addresses the real problems which arise in practice. In choosing the innovative dynamic fault tree modeling language, we have also been able to draw the interest of industrial users who have participated in the evaluation of the tool. We believe that our tool is representative of a broader class of modeling and analysis tools for engineering. Informal observation of tools for system performance modeling, software architecture, and other such applications shows a common pattern of support for graphical and textual notations and standard user interface functions combined with an underlying set of analysis algorithms. To the first order, Galileo appears to be a reasonable representative of this class of applications. In addition to our technical evaluation of the performance of the POP approach, we have also been fortunate to have independent evaluation of the resulting tool. These evaluations have helped to validate our experimental tool, demonstrating that it indeed can provide the features and usability users expect. The Lockheed-Martin evaluation of an early version of Galileo was instrumental in helping us to identify necessary requirements. Similarly, informal feedback from users who downloaded early versions of the tool helped to clarify our goal of industrial viability. Our survey results of NASA engineers show that Galileo meets, and in many cases exceeds, user expectations in terms of features and usability. In some sense the user’s expectations were set low—they were reminded several times that the software they were using was a beta version. Nevertheless, user comparisons to other tools indicate that Galileo compares well in terms of features,
usability, and performance. When asked what surprised them the most about the tool, several users cited the usability, saying “the ease of use was better than expected”, “[the] program is very user friendly”, and “very friendly user interface”. Several users liked the use of standard packages as components, saying they were surprised that “it embeds other software such as Word and Visio”, it has “transparent linkage between Word and Visio”, and it “[uses] outside code (Visio) for [the] interface”. One user went so far as to say “the reuse of Word/Visio [is] a rather brilliant idea”. Our survey of the requirements of NASA engineers supports our emphasis on usability and trustworthiness. It also highlights the need for careful engineering of tools due to their use in critical engineering contexts. Comments from users indicate that they rely upon tools for rapid design and analysis of systems. In several cases component limitations forced us to modify our designs. The survey results indicate that these less desirable designs were still adequate for their needs. We also found that the reuse of package interfaces was beneficial—not only did users benefit from the careful user interface engineering (as in the case of Visio), but they also benefitted from familiarity with the packages (as in the case of Word).

8.2

Component Capabilities

The POP components we used had a number of capabilities which contributed significantly to their successful use. Most obviously, the functionality provided by the components greatly eased our development effort. For example, Visio’s automatic layout feature removed the need to use our own layout algorithm, and outperformed our own algorithm which required many expensive RPC calls. The rich component APIs allowed us to tightly integrate their functions. For example, Visio exposed functionality for shape creation and manipulation which allowed us to automatically render the graphical depiction of a fault tree. The cost of using the component APIs was greatly reduced because all of the APIs used COM as the underlying communication mechanism. The conformance of the components to the active document standard allowed us to achieve tight integration of
component interfaces within the overall Galileo tool window. We also benefitted from the programmability of the components. Because the components include an embedded Visual Basic interpreter, we were able to implement several functions directly in the component, thereby avoiding the costs associated with cross-application communication. This efficiency also allowed us to implement features which would have been otherwise impossible. For example, our implementation of connectors whose line style dynamically changes as connections are made requires low update latency. User familiarity with the components, as well as their well-engineered user interfaces eased the learning curve for users of the tool. According to user surveys, most users benefitted from their familiarity with Word or Visio. Users also reported usability that was as good or better than commercial software tools. Many users were enthusiastic about the use of mass-market applications as components. We found component evolution to involve some risk, but also many benefits. New component versions often removed problems which we had encountered in previous versions, and provided powerful new functionality which we were often able to exploit with little or no effort. For example, Visio 2002 performs anti-aliasing of the drawing, greatly improving its appearance across a range of zoom factors. We were able to benefit from this improvement without modifying any of our own code. In many cases, we found new versions of components supported previous APIs, so that we were able to simply install the new version in order to be able to take advantage of it.

8.3

Challenges

While we were able to implement the tool which we had set out to build, several times we encountered difficulties which threatened our project. In this section we evaluate the POP approach with respect to the research issues which we described in Section 6.2.2.


8.3.1


Vendor-Bound Design Decisions

Having no control over the design of the POP components which we used, we were often faced with issues which made initial designs infeasible. As a result, we were often forced to explore the capabilities of the component in order to identify an alternative design. In cases where an alternative design could not be found, we found that a slight modification to the requirements made their implementation feasible. Thus, flexible requirements allowed us to more easily navigate the design space. In a few cases, we found that critical requirements could not be modified or implemented with the given components. For example, Visio’s incomplete adherence to the active document standard jeopardized our ability to implement multiple-page modeling functionality. In our experience, component vendors are sensitive to the needs of component developers, as meeting those needs helps to develop the component market for their packages.

8.3.2

Architectural Mismatch

We found that the difficulties encountered by Garlan et al. [31] were largely nonexistent for POP components. A central theme of our work is that the integration of independently designed components can, and indeed can only, be enabled by conformance to shared design rules, or integration architectures [20, 63, 66]. We undertook the work reported here knowing that the components that we would be using all conformed to a set of common integration standards.

Still, we found that conformance to integration standards is not guaranteed. For example, Visio’s multiple page functionality failed when the application was integrated using the Active Document Architecture, and an early version of Microsoft Access didn’t support the standard at all. We suspect that these problems were the result of the ADA being a new standard that components were only beginning to support. As such standards become more widely used, we expect components to have fewer instances of mis-conformance.


8.3.3


Large, Poorly Documented Interfaces

As noted by Lampson [45], we found that POP components have enormous programmatic interfaces which reflect their rich capabilities—the interface for Microsoft Visio 2002, for example, comprises over 7700 methods in nearly 200 classes. Many of the component issues we encountered were due to undocumented limitations in the components. For example, we found undocumented race conditions in Word, and capacity limitations in Visio. These problems contributed to the cost of understanding the interfaces, beyond the design costs we expected. And yet the costs involved in understanding the components’ interfaces in order to utilize the components were orders of magnitude lower than the cost associated with developing equivalent functionality ourselves. In contrast to small components, the leverage gained by using massive components largely negates the cost of understanding and using them. Never did we consider it feasible to “just implement it ourselves”.

8.3.4

Resource Consumption

Experience over the last two decades has shown that Moore’s law has outstripped the growing resource requirements of applications. As a result, most modern PCs come equipped with at least 128 MB of memory, while applications such as Microsoft Word usually require less than 20 MB of memory. Package vendors also implement methods to amortize the cost of using an application, such as loading a small application core at startup, and dynamically loading additional components as necessary. Furthermore, we expect that most software built using the POP approach will use only a few components due to the large amount of functionality they provide. As described earlier, we did encounter one performance difficulty when using Visio as a component. We found that every version of Visio that we tested exhibited COM delays during initial embedding of the application. It is quite possible that if there were a large market demand to use Visio as an embedded component, this performance problem would have been addressed.


8.3.5


User Interface Specialization and Integration

We encountered some difficulty specializing the user interfaces to the application domain. While the components provided mechanisms for user interface manipulation, we sometimes encountered idiosyncratic behaviors or limitations which forced us to seek alternative designs. More important was the issue of identifying functionality which could potentially compromise the overall software. For example, we developed a custom connector shape in Visio, but did not realize that Visio also provided a “connector tool” which users could use to connect shapes. Unfortunately, our implementation expected connectors of the type we had developed, and did not recognize the connectors created by Visio’s connector tool. Early versions of Galileo also had the problem that users could close the application component independently of the tool, causing the tool to crash when it tried to use the then invalid application handle. Most issues such as this were resolved when we modified our tool to use the Active Document Architecture, as this standard was designed for the integration of user interfaces. The lesson here is that integration architectures can largely mitigate integration problems, but the software engineer must also address specific integration issues not addressed by the architectures. We found that the rich feature set of POP components increases the difficulty of ensuring proper behavior of the component in the context of the overall application. Our approach to dealing with this issue was to perform ad hoc testing of the application to discover incompatible behaviors. A more conservative approach is to restrict the user interface of the component, thereby decreasing the likelihood of erroneous behavior at the expense of less functionality.

8.3.6

Application-Boundary Inefficiency

The cost of traversing the application boundary between components often impacted our design. For example, our choice of high-level graphical editing operations was driven by the design rule that no operation should require the traversal of all shapes in the drawing. We set this design rule because we knew that iterating over the shapes would not scale well due to the cross-application communication costs.


Later in this research we realized the extent to which the POP components could be programmed. Because the components supported Visual Basic, we found we could implement all of the high-level editing operations in Visio, thereby avoiding the need for cross-application function calls. For the details of this aspect of the research, see Chapter 10.
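To make the trade-off concrete, the following in-component sketch iterates over every shape on the active page. Written in Visio's embedded Visual Basic it is cheap, whereas driving the same loop from Galileo would cost one cross-process call per shape. The master name is a hypothetical convention, not Galileo's actual one.

    ' Count the basic events on the active page, entirely inside Visio.
    Public Function CountBasicEvents() As Long
        Dim shp As Visio.Shape, n As Long
        For Each shp In ActivePage.Shapes
            If Not shp.Master Is Nothing Then
                If shp.Master.Name = "BasicEvent" Then n = n + 1
            End If
        Next shp
        CountBasicEvents = n
    End Function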

8.3.7

Component Availability, Evolution

The development of Galileo gave us direct insight into the issues surrounding component evolution. During the course of the project, we gained experience working with five versions of both Word and Visio. We found that new versions of components often supported the same interface as older versions, but included new functionality in additional interfaces. As a result, three Word upgrades and two Visio upgrades required no changes to our tool. In many cases, we found the new interfaces of the components to provide additional functionality which we could use to improve our designs. For example, when Visio’s auto-layout feature became available, we replaced our own inefficient layout algorithm. We hypothesize that our positive experiences with component upgrade were due to market pressure on component vendors to support legacy interfaces upon which component developers depend. In contrast, we found that the user interfaces often changed, sometimes quite radically. As a result, we often had to fix problems related to the user interface following an upgrade. Identifying the problems was helped quite a bit by automated regression testing of the tool via the user interface. (Except when the user interface changed such that we could not use our automated testing tool!)

8.3.8

Adapting to Installed Components

In our model of POP-based software design, we distribute the “glue”, and the user must provide the components. As a result, we have no control over the components which are installed. We found our architecture to be adequate for meeting this challenge—Galileo automatically detects the available components and versions and instantiates the correct wrapper. In addition, the Galileo architecture ensures that one of Word or Visio is available so that the user can edit fault trees, and disables functionality that requires both views if only one is available.
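A hedged sketch of this detection step follows; the real implementation is part of Galileo's C++ startup code, and the ProgIDs shown are simply the standard automation names for the packages.

    ' Report the version of an installed component (or "" if absent) so that
    ' the matching wrapper can be instantiated at startup.
    Public Function DetectComponentVersion(ByVal progId As String) As String
        On Error Resume Next
        Dim app As Object
        Set app = CreateObject(progId)        ' e.g. "Visio.Application" or "Word.Application"
        If app Is Nothing Then
            DetectComponentVersion = ""       ' component not installed
        Else
            DetectComponentVersion = app.Version
            app.Quit
        End If
    End Function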

Chapter 9 A Combined Approach to Building Tools

In previous chapters we presented research on the use of formal methods for domain-specific modeling languages and on the use of the package-oriented model for the development of tool superstructure. In this chapter, we characterize the domain of modeling and analysis tools, and argue that these two elements can significantly ease the difficulty of building such software. The last section describes related work.

9.1

Software Tools

Modeling and analysis methods supported by tools are now essential to engineering. Engineers use such tools to build models of systems, and then analyze these models with respect to particular properties. The results of the analyses are used to infer the behavior of the corresponding system, which may or may not exist in reality. Such tools are often used by engineers as a low-cost method of evaluating proposed designs, or to diagnose problems in existing systems.

9.1.1

Tool Requirements

Surveys of users (Section 7.3) indicate that two key requirements for tools are high usability and high accuracy. The high-level requirement regarding usability can be refined into a set of more precise requirements:


• Rich editing functionality: Tools must support a wide range of editing capabilities, including formatting, print preview, model zooming, syntax checking, etc.
• Multiple model views: There may be multiple ways to view a model, and multiple instances of the same type of view.
• Incompatible model editing operations: Individual editing operations in one view may not have analogs in other views.
• Incomplete models in views: It is possible that the user may want to view an incomplete model in one or more views.
• Scalability for large models: Engineering models often get extremely large as engineers add detail to improve the fidelity of the model.
• Integration of databases: Many modeling and analysis tools integrate field data into the model. For example, failure modeling tools integrate data on failure of components.
• View-specific model data: For example, a graphical view of a model may contain information about the position of model components, but this information has no corresponding analog in the textual view.
• Time-intensive analysis: Because the mathematical analysis of a model can take hours or even days depending on the domain, running the analysis function should not hinder the user’s ability to continue to use the tool for model viewing and development.
• Models augmented with additional analysis information: Many types of analyses refer to components in the model. As a result, it is desirable to be able to represent the results of analysis through the modeling views.
• Interoperability: Engineers want tools that integrate well into their existing engineering processes. For example, they want to be able to easily include models or analysis results in reports.

The “high accuracy” requirement voiced by users can also be refined into a set of more precise requirements:


• Well-designed languages: In order to reduce the risk of misuse or misunderstanding, the modeling language supported by a tool should have few redundancies, expressive constructs, regular syntax, no contradictions, etc.
• Precise language semantics: The semantics of the language is central to the analyses performed by the tool. As a result, the semantics must be mathematically precise.
• Precise documentation: In order to reduce the risk of an engineer misusing the language or misinterpreting the meaning of a model, the documentation should be clear and precise.
• Verified implementation: The implementation of the modeling language must be faithful to the semantics of the language.

9.1.2

Traditional Development Approaches Inadequate

Meeting these demanding requirements under tight resource constraints is very difficult. Consider the task of building a highly usable, feature-rich user interface. Traditional approaches are based on the use of object-oriented class libraries and design patterns, and on small-scale components such as typical "ActiveX controls". These techniques do not attack the essence of the problem of developing software tools: building tools from thousands of small classes or components doesn't radically simplify the design problem. As a result, traditional software development techniques are expensive—building software having sophistication on par with mass-market software often requires over a million lines of code. Unfortunately, the market for domain-specific tools is comparatively small, which means that tool developers often have few resources to devote to such an effort.

These problems are exacerbated by the critical nature of software tools, which necessitates a level of dependability beyond that which is typically required of mass-market software. The accuracy of analysis results, for example, is critical to the correct evaluation of the model and, indirectly, the system being modeled. However, high-level domain-specific languages, while providing powerful modeling capabilities, also have complex and subtle semantics which must be precisely defined in order to have confidence in them. Without confidence in the semantics of the language, one can not trust the analyses which depend upon them.

The programming languages and software engineering communities have developed mathematically precise methods for specifying languages. However, these techniques are not applied to modeling languages in practice. One possible reason is cost—the high cost of developing features and usability leaves few resources to devote to formal specification of the languages. As a result, tools which are developed have questionable dependability. This problem is exacerbated by the complexity of the semantics and the often inaccessible source code, which makes end-user verification of analysis results problematic. Users are then, knowingly or unknowingly, trusting the software developer to implement the language semantics correctly.

9.2 A New Approach

In this section we describe a new approach to building modeling and analysis tools which addresses the difficulties of building feature-rich, usable, and trustworthy tools. We begin with a set of observations about the domain of software tools.

9.2.1 Observations

The first observation is that many tools have an architecture as depicted in Figure 9.1. The user interface provides one or more views for editing the model, as well as a means of invoking the analysis engine at the core of the tool, and a means of displaying the results [63]. The core implements an abstract representation of the model as well as the analysis method. The abstract representation provides the basic communication mechanism between the core and the superstructure.

Figure 9.1: Common structure for tools (editing views and a results view in the user interface, their concrete representations, and an analysis engine built around an abstract representation)

Secondly, like most software, the domain-specific functionality of a software tool must be supported by a large amount of additional functionality. As Shaw says:

    Most applications devote less than 10% of their code to the overt function of the system; the other 90% goes into system or administrative code: input and output; user interfaces, text editing, basic graphics, and standard dialogs; communications; data validation and audit trails; basic definitions for the domain such as mathematical or statistical libraries; and so on [58].

For example, a sophisticated tool which implements a particular modeling method must also provide a wealth of additional functionality such as print preview, autosave, autorecovery, autolayout, shape formatting, cut-and-paste, etc. Furthermore, simply implementing this functionality is not enough—it must also be engineered to be usable.

Our third observation is that the major differences between such tools are based on the domain-specific modeling and analysis method that they support. The concrete representations in the views and the operations for them are based on the concrete syntax of the language. The analysis core implements analysis algorithms which depend upon the semantics of the language. The communication between views and the core is based largely on the abstract representation of models in the language.

Finally, we observe that while the domain-specific elements of a tool may account for less than 10% of the code, the engineer places a substantially larger degree of trust in these elements than in the supporting functionality of the tool. The code devoted to implementing the syntax and semantics of the modeling language is crucial to the overall modeling and analysis method. Furthermore, the subtle and complex nature of the method can increase the chance of error in its design or implementation.

9.2.2 Package-Oriented Programming: Superstructure at Low Cost

We believe that the POP approach is particularly suited for the domain of modeling and analysis tools. Mass-market packages have evolved to cover functional domains which correspond to many of the key functions of tools. For example, tool views involve textual editing and graphical editing, and much of the data is best managed in a database. POP components can provide a wealth of general functionality which corresponds to the superstructure of a tool.

POP components also have specialization capabilities which can be used to implement the domain-specific aspects of the tool's user interface. For example, Microsoft Visio allows developers to create a custom stencil of drawing shapes. This capability can be used to implement a graphical editor for the language supported by the tool.

The general functionality provided by POP components, coupled with the domain-specific functionality which can be implemented with their specialization capabilities, is an effective attack on the cost of building the user interface of tools. Because the superstructure functionality is achieved at low cost through reuse, the majority of the effort of developing a tool is avoided.

9.2.3 Formal Methods for Tool Languages

In our combined approach, package-oriented programming is used to provide the bulk of the general-purpose functionality of the tool. What remains are the domain-specific elements: the concrete model representations and the editing operations for them, the abstract internal representation of the model, and the analysis engine. The savings gained through the use of the POP approach help to mitigate the cost of formal definition of the domain-specific modeling language, which in turn eases the trustworthy implementation of the domain-specific elements of the tool. By applying formal methods toward the precise definition of the modeling language supported by the tool, one gains a rigorous basis for the understanding, development, and implementation of the language. The developer can discover and possibly prevent conceptual, design, and implementation errors. As a result, the overall trustworthiness of the tool is increased.

9.2.4 Evaluation

To evaluate the technical feasibility and economics of this approach to building tools, our strategy is to first evaluate the POP and formal methods aspects independently, then demonstrate that they can be combined effectively. Chapters 3, 4, and 5 addressed the evaluation of the formal methods aspect with respect to the formal specification of the dynamic fault tree language. Chapters 6, 7, and 8 addressed the evaluation of the POP aspect through the development of a tool which was not based on a formally-defined modeling language.

To assess the feasibility of the combined approach, we have developed a new tool which uses the POP approach for the superstructure, and whose modeling language is defined using formal methods. The new tool, called Nova, combines the work we have already done in the application domain of dynamic fault tree modeling and analysis. In the development of this tool the formal specification served three purposes: (1) to discover and resolve problems and irregularities in the DFT language, (2) to drive the revision of the concrete representations hosted by the POP views in response to this language revision, and (3) to serve as the semantic basis for implementation of the analysis engine. Finally, the user interface provides more sophisticated domain-specific editing capabilities through much more aggressive specialization of the components.

In the next chapter we present Nova, an advanced prototype tool for the modeling and analysis of dynamic fault trees. This tool was developed using the combined approach, in which packages provide the editing views, and the analysis engine is based on the formal specification of the language. We discuss our experiences using the combined approach, and in Chapter 11, we evaluate the overall approach.

9.3 Related Work

The DIFTree tool by Dugan et al. [28] included a graphical DFT editor built using Tcl/Tk. This editor implemented the original DFT language, and was not based on a formal semantics. Although it lacked much of the functionality of the tool presented in this paper, both the graphical depictions of the shapes and structure-based editing operations inspired similar functionality in Nova.


Widespread adoption of DIFTree was hindered by its non-standard, idiosyncratic interface, and a lack of precise semantics for the DFT language. The Galileo tool [65] is a follow-on tool developed under the guidance of Sullivan and Dugan. As described in Chapter 7, Galileo is an experiment in the use of the POP model for the development of interactive software. Early versions of Galileo utilized the same analysis algorithms used by DIFTree, but this analysis core was reimplemented [48] when it was discovered that it was poorly designed and contained errors such as the one discussed in Section 2.2. It was this effort which made the subtlety and complexity of the DFT language clear, and which motivated the use of formal methods.

There has been quite a bit of work on the development of tools for software engineering. Grundy and Hosking [34] provide an overview of the past research and the current state of the art for software tools. As in our work, they cite features and usability as components of overall quality, but also include synergy between tools and development process, and tool integration and extensibility as important factors. In contrast to our work, they focus on software tools for software engineering, as opposed to the important domain of software tools for engineering modeling and analysis.

Grundy et al. [35] describe their experiences building the Serendipity-II software process tool using components. In contrast to our work, the authors do not utilize applications as components. Instead, they use the JViews framework which they have developed to integrate components built using their JComposer component development environment. This toolset allows the authors to rapidly build software engineering environments which are easier to enhance and extend, integrate with other tools, and deploy to users.

Some work has also been done on the development of generic frameworks for modeling environments. Examples include the Generic Modeling Environment [47], MetaEdit+ [16], and DOME [38]. The approach in this research is to invest in the development of a reusable modeling framework which can be used to instantiate new tools by specifying the aspects specific to the application domain. Our work is distinct in several dimensions. First, our strategy is to address the components used to construct tools, as opposed to an overall reusable framework. In this respect our work is not incompatible with the generic framework approach—it is possible to use POP components within a generic framework. However, our approach is somewhat more flexible, as it allows the developer to tailor the resulting tool via selection and specialization of components. Second, developers of reusable frameworks are faced with the challenge of not only providing a wealth of functionality and high usability, but also making this functionality easily reusable. There is a tension inherent in this approach: in order to recoup the cost of developing reusable functionality, the framework must be reused across a number of applications, but a wide range of applications places higher demands on the generality of the framework. Lastly, our work assumes that the modeling language has a semantics which can not be easily captured using the hierarchical and constraint-based semantics used by generic frameworks such as GME.

Chapter 10 Nova: A Tool Built Using the Combined Approach

In previous chapters we have described and evaluated the two elements of the approach in isolation. In this portion of the thesis we address the feasibility of combining these two elements for the effective construction of modeling and analysis tools. Galileo serves as an experimental testbed for the use of package-oriented programming, but its core modeling language does not have a complete and precise semantics. On the other hand, our work revising and formalizing the DFT notation was done independent of any particular tool implementation. To assess the feasibility of the combined approach, we decided to build a new tool from scratch—Nova—which uses POP to provide the supporting functionality, and supports the new, formally defined DFT notation. Nova is a proof of concept which also provides additional data on the use of POP for tools, and provides a better understanding of the costs involved.

In this chapter we describe the development of Nova using our combined approach. Like Galileo, Nova is a tool for the construction and analysis of dynamic fault tree models. Compared to Galileo, Nova is an advanced prototype tool with several unique properties. First, the DFT language that it supports is a revised version based on our formal specification of the language. The implementation of the editing interface is based on a more aggressive specialization of Visio which implements all of the fault tree editing operations directly in the POP component. The analysis engine is a new implementation based on the formal semantics we have defined. Finally, the overall architecture of the tool is based on our experiences with Galileo.


In Section 10.1, we describe the development of a revised graphical syntax for the dynamic fault tree language which addresses the issues discovered during the formalization of the language. Section 10.2 describes the overall software architecture. Sections 10.3, 10.4, and 10.5 describe the implementation of textual, graphical, and basic event editors for the new DFT language using POP components. Section 10.6 describes the implementation of a new DFT analysis engine based on the formal specification. Lastly, Section 10.7 describes the integration of these components and the resulting tool.

10.1 Revising the DFT Syntax

Many of the domain-specific capabilities of a tool depend upon the modeling language that it supports. As a result, the precise specification of the language can not only improve the trustworthiness of the analysis engine, it can also lead to improvement in the language and the editing capabilities of a tool which supports it. We have already described many of the conceptual and terminological improvements in Section 5.1.2. For example, we made the inputs regular, and removed redundant spare gate types.

In this section we describe revisions to the DFT language which we made in order to resolve syntactic and semantic problems we discovered during the specification of the previous version of the language. We improved the language by resolving or clarifying the issues documented in the specification. These changes removed special cases and redundancy in the modeling language at the level of the abstract representation, and had a direct impact on the resulting concrete representation. We also describe how our specification effort led us to improve the graphical concrete representation of DFTs.

10.1.1 Revision of the Lexical Elements

The lexical elements of a language are the concrete representations of the language elements—the spelling of the words. In a graphical language, the lexical elements are the shapes shown in a drawing.


Four concerns guided our design of the new DFT shapes. The first was adherence to the formal specification. For example, if the formal specification stated that an event did not have an output, then the corresponding shape should not have an output connection point. The second concern was compatibility with legacy shapes—we did not want to force the user to re-learn shapes. The third concern was the intuitive nature of the language—any new shapes should be suggestive of their meaning. Finally, we did not want to design a language that would be difficult to implement. These four concerns were not always in agreement—sometimes we were forced to compromise one for another.

The state components of our FaultTree specification (Section 4.2.1) indicated which portions of the language needed lexical representation. In addition to the basic event, our specification partitioned the set of gates into five types of gates (AND, OR, Threshold, PAND, and Spare) and two types of invariants (FDEP and SEQ). Our resulting DFT language has concrete lexical constructs for each of these eight abstract constructs. In addition, our specification abstracts the notion of connectors, representing inputs implicitly. In the graphical language, we included connectors to make input relations explicit: a direct connector which directly connects shapes, and an indirect connector which connects shapes by naming the input shape.

We restricted the connections to shapes based on the specification. For example, the input constraints starting on line 4.2.4 of the FaultTree specification allowed us to determine that all gates must have input connection points, and all events (basic events and gates) must have output connection points. In our specification a functional dependency has both a trigger input and a dependent event input, so the corresponding drawing shape has a connection point for both. Similarly, the sequence enforcer is a relation over a sequence of inputs, so it has an input connection point.

The resulting DFT lexical elements are shown in Figure 10.2. These shapes reflect the modifications made to the language in the abstract specification. For comparison, the original constructs are shown in Figure 10.1.

Figure 10.1: Original depictions of DFT shapes (AND, OR, KOFM, Cold Spare, Warm Spare, Hot Spare, Priority AND, Sequence Enforcing, Functional Dependency, and Transfer gates; Basic Event; Connector)

Figure 10.2: Revised depictions of DFT shapes (AND, OR, Threshold, Priority AND, and Spare gates; Basic Event; Sequence Enforcing and Functional Dependency constraints; Direct and Indirect Connectors)

The new shapes reflect our revisions to the DFT language. There is a single spare gate instead of three, and the spare gate does not have a separate primary input. We retained several of the shapes, but also standardized the "label above smaller shape" structure in the SEQ and FDEP shapes, and added or modified the shapes of several gates in order to provide more intuitive representations. For example, the threshold gate has a "T"-like structure, the spare gate has several sub-parts, and the sequence enforcer suggests an ordering. We also renamed the shape types based on the clarification in the specification of the differences between gates, basic events, constraints, and connectors. For example, the transfer gate is now called an "indirect connector" and the FDEP gate is now called an "FDEP constraint". These modifications helped to make the language more regular and intuitive.

In studying the specification, we realized that certain shapes had ancillary information essential to understanding the DFT. In particular, threshold gates have threshold values, and basic events have replication. We decided to represent these explicitly in the shapes. In addition, we needed a representation for the ordering of inputs to order-dependent gates, so we decided to indicate the order of an input as a number on the connector. These modifications ensured that all of the information necessary for interpreting the semantics of the DFT is explicitly represented in the drawing.


Our work specifying the DFT language clarified the relationship between the basic event models and the fault tree. We found that unlike the threshold and replication, basic event models are used only in the analysis algorithms and are not necessary to understand the semantics of event occurrences in the model. For this reason we decided not to represent this information explicitly in the syntaxes of the language. Instead, we decided to represent this information in a separate view. See Section 10.4.1.1 for more information.

10.1.2 Revision of the Grammatical Structure

The grammatical structure of a language describes the allowable combinations of lexical elements—the structurally correct sentences. In a graphical representation, the grammatical structure indicates the legal connections of shapes and values for the attributes of the shapes.

Many of the structural rules imposed by the formal specification are implicit in the drawing. For example, the rule that states that the number of basic events is finite is implicitly fulfilled by the definition of the concrete graphical representation. Similarly, the constraint that a basic event can not have an input is implicit in the fact that we did not give the basic event an input connection point.

The other structural rules of a graphical fault tree correspond to the abstract structural rules described in Section 4.2.1. For example, predicate 4.2.5 in the FaultTree schema disallows cycles among gates in the fault tree, and is implemented as a function that checks for cycles in the fault tree drawing.¹ We chose to allow the user to violate most of the global structural constraints while constructing the drawing, invoking a "validity check" which ensures that all the constraints are satisfied. See Section 10.4.2 for more information.

In a few cases, the usability of the language outweighed our desire to adhere to the formal specification. For example, the replication for gates is not represented because it is always 1 in the abstract specification. Similarly, we only require the ordering of inputs to be specified for gates whose semantics are order-sensitive. Lastly, although the specification defines the semantics of gates for no inputs, we require at least one input in the graphical language in order to avoid confusion on the part of the user.

¹ Note, however, that cycles in functional dependencies are allowed.

Figure 10.3: The architecture of Nova (graphical, textual, basic event model, and dynamic solver views hosted in Microsoft Visio, Word, and Excel, connected to the analysis engine and coordinated by the view manager of the user interface through Get and Put operations)

10.2 The Nova Architecture

10.2.1 Overview

Figure 10.3 shows the basic architecture of Nova. There are four views: graphical and textual model editing views, a basic event model view, and a dynamic solver view. Three of the views are implemented using POP components which run in separate address spaces, as indicated by the dashed lines. The Visio and Word POP components implement concrete representations of the fault tree model, along with representation-specific editing operations. For example, the Visio component implements graphical depictions of the fault tree shapes, and provides functionality for changing gate types, checking the graphical syntax, etc.

Each view encapsulates the details of interacting with the POP component or analysis engine, exposing a standard interface to the rest of the architecture. The interface depends on the capabilities of the view, as defined by the subtype of view to which it belongs. For example, the graphical and textual views are readable, indicating that the architecture can call the Get method to extract an abstract fault tree from them. Similarly, the dynamic solver view is writable, indicating that the architecture can call the Put method to apply a fault tree to the view.

The rendering of models from one view to another is coordinated by the view manager, and initiated by the user via the user interface. For example, the user can tell the tool to render from the graphical view to the textual view, or from the textual view to the dynamic solver. In this case, the user interface instructs the view manager to render between the two views if the source view is readable and the destination view writable, and if the source view contains a syntactically valid fault tree.

Nova has two model data types. The primary model transferred between the views is the fault tree abstract data type. In addition, the basic event model abstract data type is acquired from the basic event model view by the dynamic solver when it is asked to solve a fault tree.

The Nova architecture is a generalization and abstraction of the Galileo architecture, and encapsulates many of the difficult design decisions we encountered related to the integration of multiple POP components. The architecture of Nova is similar to that of Galileo (Figure 7.2). Like Galileo, Nova utilizes abstracting wrappers to hide the details of package interaction, and manages models in views using a view manager module. Unlike Galileo, Nova has no consistency management scheme, and treats solvers simply as views, albeit write-only ones.

To build Nova, we first created the fault tree and basic event model classes, both of which derive from the generic Model class. We then created three POP-based views (textual, graphical, and basic event model) by inheriting from the abstract ActiveView class and providing implementations for the interfaces that the views support. For example, because the graphical view supports the Readable interface, we implemented a Get method for the graphical view which parses the Visio drawing and creates a fault tree object. We also created a solver view from the abstract View class which supports the Writable interface but not the Readable interface.
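To make this structure concrete, the following sketch shows roughly how these classes and interfaces might be declared in C++. Only Model, View, ActiveView, the Readable and Writable interfaces, and the Get and Put methods are named in the text; the remaining names, the signatures, and the use of std::unique_ptr are illustrative assumptions rather than Nova's actual declarations.

#include <memory>

// A rough sketch of the view and model class structure described above;
// names not mentioned in the text, and all signatures, are assumptions.
class Model { public: virtual ~Model() = default; };
class FaultTree : public Model { /* abstract DFT representation */ };
class BasicEventModel : public Model { /* failure data for basic events */ };

// Capability interfaces: a readable view can produce an abstract model,
// and a writable view can accept one.
class Readable {
public:
    virtual std::unique_ptr<Model> Get() = 0;
    virtual ~Readable() = default;
};
class Writable {
public:
    virtual void Put(const Model& model) = 0;
    virtual ~Writable() = default;
};

class View { public: virtual ~View() = default; };
class ActiveView : public View { /* embeds and wraps a POP component */ };

// Editing views host a POP component and are both readable and writable;
// the dynamic solver view is treated as a write-only view.
class GraphicalView : public ActiveView, public Readable, public Writable {
public:
    std::unique_ptr<Model> Get() override;   // parses the Visio drawing
    void Put(const Model& model) override;   // unparses into the drawing
    // Definitions elided.
};
class DynamicSolverView : public View, public Writable {
public:
    void Put(const Model& model) override;   // hands the fault tree to the solver
    // Definition elided.
};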

Both editing views were implemented using POP components. All editing functionality was implemented within the component itself using the extension and customization capabilities that it provides. This approach greatly improves the responsiveness of the overall Nova tool because editing operations need not be implemented in terms of inefficient cross-application RPC calls. Similarly, we designed the interface between the POP component and the main Nova tool to reduce communication overhead. For example, the Get and Put methods of the graphical view are implemented by parsing and unparsing the fault tree into a textual form which is passed to the Visio component to be parsed and then unparsed into graphical form.

At run-time, the four views are registered with the document/view builder by Nova's main application class. The architecture then handles the remaining details: the overall user interface, the creation of views and the embedding of POP components, the integration of POP user interfaces, the rendering of one view to another, and the saving and loading of multi-view documents.

In the following sections we describe the implementation of the textual, graphical, basic event model, and dynamic solver views. The final section describes the resulting tool.

10.3 The Textual Dynamic Fault Tree Editor

Nova’s textual editor is based on Microsoft Word. As in the graphical view, we specialized the Word interface to provide editing operations specific to the DFT language. These specializations were implemented using Word’s built-in Visual Basic interpreter and its application API. By implementing all of the editing functionality of the view in the application itself, we avoided many of the performance limitations which constrain Galileo.

10.3.1 Overview Of The Textual Editor

Figure 10.4 is a screenshot of the textual editor. We rely on Word to provide the general editing capabilities, and augment this functionality by adding custom editing menus and toolbars for editing dynamic fault trees. We also mask behaviors in Word which are inappropriate for our purposes.

Figure 10.4: A screenshot of the textual editor

10.3.1.1 Editing Operations

The bulk of the editing functionality of the textual view is provided by Microsoft Word. This functionality has been augmented with a custom fault tree toolbar and menu which allow the user to automatically insert textual templates for each of the shape types. For example, when the user clicks on the sequence enforcer button on the toolbar, the text " seq ;" is automatically inserted in the document. These templates ease the textual editing task by reminding the user of the syntax of the language.

Each template insertion operation was implemented using a library of Visual Basic functions. These functions operate on the document using Word's COM interface to its internal data structures. For example, to insert a template, the code first gets the current cursor location from the Word application, sends the cursor to the beginning of the current line, inserts the line of template text, then moves the cursor to the start of the newly inserted line of text.

10.3.1.2 Inappropriate Functionality Modified Or Hidden

We discovered that Word automatically performs a number of editing modifications which may be useful for editing English text, but which were inappropriate for editing fault trees. For example, Word automatically capitalizes the first word of a sentence, replaces common typing errors, highlights grammar mistakes, and highlights spelling errors. This functionality is clearly inappropriate for a DFT textual editor, so we customized the application to disable this functionality upon editing a fault tree, and enable it when editing is complete.

Unfortunately, Word's automatic formatting options are not document-specific, so that changes affect all open documents. If the user opens another document while Nova has a textual view open, the opened document will not have any of these features enabled. While this problem may be minor, it is an inherent limitation of the Word application which, as far as we are aware, can not be worked around.

10.3.1.3 Enhanced Behavior

Compared to the graphical editor described in the next section, our enhancements to the textual editor were relatively modest. Because keywords in the language such as “and” and “or” can not be used as identifiers, we modified the tool to automatically highlight those keywords as they are typed. This provides a limited form of syntax highlighting similar to that which is used by modern programming environments.

10.3.1.4 Validity Check

The textual editing view provides the user with a method of checking the validity of a fault tree. When the user invokes the "validity check" operation, the textual representation of the fault tree is parsed into an internal representation which is then analyzed. This analysis ensures that the syntax is correct, and is based on the formal specification of both the concrete textual syntax and the abstract fault tree syntax. The abstract syntax checks performed by the validity check operation are:

• No gate is an input to itself, either directly or indirectly.
• The inputs to a spare gate must be basic events.
• A basic event input to a spare gate can not be input to another type of gate.
• The trigger event of an FDEP can not be a basic event with a replication other than 1.
• The dependent events of an FDEP must be basic events.

Figure 10.5: Checking the syntax of a malformed fault tree

Figure 10.5 shows the result of the validity check operation for a syntactically invalid fault tree. In this fault tree, the input of the priority-AND is the OR gate, which causes a cycle in the fault tree. The tool has discovered this cycle, and has added an error to the analysis results window. If the user double-clicks on an error, the tool will take the user to the page and location of the problem, and will highlight the text involved. Interestingly, using the validity check is similar to using the compiler error reporting in an integrated development environment for software development. Users find themselves in a tight edit-check-debug cycle as they repair problems in the fault tree.
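The first of these checks, for example, amounts to detecting a cycle in the gate graph of the abstract fault tree. The sketch below shows one way such a check could be written; the map-based representation and all names are assumptions made for illustration, not the internal representation used by Nova's Visual Basic implementation of the check.

#include <map>
#include <set>
#include <string>
#include <vector>

// Illustrative cycle detection over an abstract fault tree, assuming the tree
// is represented as a map from each gate to its input events. Returns true if
// some gate is an input to itself, either directly or indirectly.
using Event = std::string;
using GateInputs = std::map<Event, std::vector<Event>>;

static bool Reaches(const Event& target, const Event& current,
                    const GateInputs& inputs, std::set<Event>& visited) {
    auto it = inputs.find(current);
    if (it == inputs.end()) return false;          // basic events have no inputs
    for (const Event& in : it->second) {
        if (in == target) return true;             // cycle back to the start gate
        if (visited.insert(in).second && Reaches(target, in, inputs, visited))
            return true;
    }
    return false;
}

bool HasGateCycle(const GateInputs& inputs) {
    for (const auto& entry : inputs) {
        std::set<Event> visited;
        if (Reaches(entry.first, entry.first, inputs, visited)) return true;
    }
    return false;
}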

10.4 The Graphical Dynamic Fault Tree Editor

As in Galileo, our implementation of the new graphical language was built using Microsoft Visio. In this section we present our work specializing the Visio application for the construction of models expressed in the formally-based revision of the DFT language. This work provides additional insight into the specialization capabilities of POP components. In contrast to the Galileo tool, all of the graphical editing functions are implemented directly in the Visio component, demonstrating that the rich specialization capabilities of POP components can enable designs which are less limited by the cost of cross-application communication, and are able to more readily exploit the rich functionality of the component.

Figure 10.6: A screenshot of the graphical editor

10.4.1 Overview Of The Graphical Editor

Figure 10.6 is a screenshot of the graphical editor. Visio provides the general functionality such as zooming, scrolling, formatting, saving, printing, etc. In addition, we utilized Visio's UpdateUI interface to specialize the interface in several dimensions in order to support the domain-specific editing of DFTs in their concrete graphical form. First, a stencil of shapes has been added which contains the graphical depictions as described in the previous section. Second, a menu and toolbar of functions has been added to perform DFT-specific operations such as changing a gate's type or selecting a subtree. Third, certain functions which are inappropriate for DFTs have been removed or replaced in the interface. Fourth, the behavior of Visio has been enhanced to ease the task of building fault trees. Each of these is described in more detail in the following sections.

10.4.1.1 The Stencil

The stencil of shapes that we created implements the graphical depictions shown in Figure 10.2. In addition, the shapes in the stencil implement dynamic constraints. For example, the text box associated with a shape can be manually resized, and also automatically expands to accommodate long text. There is a "control point" that the user can also use to change the size of the DFT shape under the text box. In addition, each shape has a context menu for operations that can be performed on that shape. In addition to the shape name, the threshold gate and basic event also have an extra text field on the shape in which the user can enter the threshold value or replication.

In implementing this functionality we found that Visio has a key limitation: it only allows a shape to have one text box. As a result, we were forced to design the basic event and threshold shapes as a group of two shapes—one having the descriptive label, and one having the threshold or replication. This design complicated the shapes because a change to the size or position of one member of the group required that the other be updated to match.

Unfortunately, we encountered another Visio limitation when implementing the size and position constraints: the update scheme supported by Visio's shapesheets does not easily handle the cyclic updates that result from the modification of a group member. For example, resizing a group member should resize the group, and vice-versa. As a result, we were forced to abandon Visio's shapesheet update method, and to rely on the more complex and powerful built-in scripting language. Like many mass-market packages, Visio contains an embedded Visual Basic interpreter which allows the developer to write code in response to events from the application. We used these capabilities to register callbacks which would be invoked whenever a shape's size or position was changed. These routines would then update the group's size or position in response to changes in any group member's size or position.

10.4.1.2 Editing Operations

Domain-specific editors are sometimes structure editors which maintain the validity of the model at all times by forcing the user to use validity-preserving operations for constructing the model. In contrast, Visio is a free-form editor that allows drawings to be constructed using generic low-level operations. Instead of attempting to change this intrinsic design decision in the Visio application, we decided to implement a free-form fault tree editor, and augment this interface with validity-preserving, structure-based editing operations. Because the user may construct an invalid fault tree using the low-level editing operations, we also provide a validity check operation which can be used to verify that a fault tree is valid.

Each of the structure-based operations is implemented in terms of basic Visio objects, which provide for low-level, generic manipulation of graphical drawings. During the development of Galileo's graphical editing capabilities, we followed a bottom-up implementation which addressed the large degree of uncertainty in the functionality that can be efficiently implemented in terms of Visio objects. We followed the same approach in the development of Nova's graphical editor, building a library of drawing manipulation functions appropriate for DFT manipulation. The high-level functionality was then implemented in terms of this library. For example, the "send subtree to page" operation is implemented in terms of a lower-level "select subtree" DFT operation, which is in turn implemented in terms of Visio's native shape and connection object functionality.

10.4.2 Implementing On-The-Fly Syntactic Constraints

In contrast to the textual view, the shapes in the graphical view correspond directly to fault tree elements. This allowed us to implement on-the-fly syntactic checks to help the user build valid fault trees. For example, the graphical editor does not allow the user to attach an output to another output, or an input to an input.

The syntactic constraints of the formal specification can be divided into local and global constraints. Local constraints are those that are expressed in terms of one or a few related constructs of the fault tree. Global constraints are those expressed in terms of all of the constructs of the fault tree. For example, the constraint which states that the trigger of an FDEP must have a replication of 1 is a local constraint, while the constraint which says that event identifiers must be unique is a global constraint.

We perform validation of local constraints as the fault tree is constructed. However, we defer validation of most global constraints to the validity check, as checking these constraints does not scale well with respect to the size of the fault tree. This decision also helps to address performance issues in Visio, because global checks require traversal of the entire drawing one or more times. Performing such analyses as the fault tree is constructed would make the editor sluggish to use.

10.4.2.1 Visio Functionality Augmented and Modified

In addition to binding functions to shape behavior, we also used the Visio Document and Application objects to extend the interface by creating new menus and toolbars. As shown in Figure 10.6, the editor provides a menu and toolbar containing a number of complex editing operations. Viewing the toolbar buttons from left to right, the user can:

• Add a shape to the current page.
• Add a shape, connect it to all of the selected shapes, and position it relative to the other shapes on the page.
• Change the type of a shape.
• "Smart connect" two shapes. The editor determines the possible input and output connections, automatically selecting the only possible connection or prompting the user to select one of multiple possible connections.
• Change a direct connection to an indirect connection or vice-versa.
• Replace a subtree with an indirect connector, and optionally move the subtree to a new page.
• Replace an indirect connector with the subtree that it references.
• Select the gates and basic events input to the selected shapes.
• Automatically resize the drawing page to the fault tree model.
• Automatically format the selected shapes, or the entire page if no shape is selected.
• Invoke the syntax check.
• Show or hide the syntax check error window.

10.4.2.2 Enhanced Behavior

An important and necessary specialization capability is the specialization of package behavior. In several cases we found the default behavior of Visio to be inappropriate or insufficient for our application. Using Visio's event interception mechanism we were able to redefine or augment the application's behaviors in a number of important ways:

• Duplicate names and invalid connections: We wanted to implement features to help prevent the user from constructing an invalid fault tree. For example, we disallow fault trees in which shape names are not unique. Similarly, we do not allow gates to be connected input-to-input or output-to-output. To help prevent such errors, we augmented the behavior of Visio so that as shape names are changed, we check for duplicates and issue a warning if any are found. Similarly, when a connection is made, we break the connection if the connection between shapes is invalid.

• Directed vs. undirected connectors: Visio by default uses directed connectors. (They are directed even if they have no arrowheads.) In contrast, the connectors in the DFT language are not directed. Generally this is not a problem, but we discovered that Visio's automatic layout algorithm depends on the directionality of the connectors between shapes. To address this issue, we modified Visio's behavior so that when a connector is attached to a connection point on a shape, we intercept the connection event and flip the connector if necessary.

• Proper hyperlink management: During our testing, we discovered a bug in Visio: when a page is deleted, hyperlinks between shapes become invalid because the page numbers within the hyperlinks are not updated correctly. To fix this bug, we had to intercept the page deletion event and update the hyperlinks ourselves. This example is interesting because we were able to use the programmability of the component to work around its own defect.

• Automatic shape coloring: To help the user identify potential system-level events (i.e. those which are not input to other gates), we decided to automatically color the shape red. To implement this feature, we had to intercept every connection event and shape name change in order to correctly color the shape. (Shape names are important because indirect connectors refer to other shapes by name.)

• Visual feedback for unconnected lines: During the Galileo workshops, we discovered that users often had problems properly connecting shapes. In order to provide more visual feedback, we modified the behavior of the connector so that the line would be dashed unless both ends are connected.

• Dynamically glued connectors: Visio supports a feature called "dynamic glue" in which shapes are connected by a connector which automatically selects connection points on the edge of shapes in order to minimize connector length. Unfortunately, this behavior is inappropriate for DFTs, where connection points are not interchangeable. Because Visio provided no user-level ability to disable this feature, we had to automatically detect dynamically glued connectors when a connection was established and convert the dynamic glue to static glue.

• Automatic input numbering: Because dynamic gates require ordered inputs, connectors have a number which indicates their relative order. In order to reduce the chance that a user creates an ordering with repeated values, we automatically compute the next input value when the user tells the tool to add and connect a shape to a dynamic gate.

• Automatic hyperlink creation: We implemented an event handler to respond to the addition or deletion of shapes, automatically adding or removing hyperlinks between indirect connectors and the shapes to which they refer. Later, when we implemented the "send subtree to page" functionality, we found that hyperlinks were automatically updated for shapes in the subtree because the event handlers were implicitly invoked for each shape as the subtree was deleted from one page and added to another.

Figure 10.7: Checking the syntax of a malformed fault tree

10.4.2.3 Validity Check

Syntax constraints that require global analysis are deferred to the semi-automatic validity check. As in the textual view, this operation, when invoked by the user, causes the drawing to be parsed into an abstract fault tree data structure, which is then analyzed to check conformance with the global constraints in the specification. Because both Visio and Word have the same Visual Basic scripting capability, we simply reused the same analysis algorithms from the textual editor.

Figure 10.7 shows the result of the validity check operation for a syntactically invalid fault tree. This fault tree is the graphical equivalent of the textual fault tree shown in Figure 10.5. The editor identifies the cycle, as shown in the error window. If the user double-clicks on the error, the editor will take the user to the page and location of the problem, and will highlight the shapes involved.


Figure 10.8: A screenshot of the basic event model editor

10.5 The Basic Event Model Editor

Figure 10.8 shows the basic event model editing view, implemented using Microsoft Excel. Each basic event name is associated with a set of data, including the distribution type, distribution parameters, coverage parameters, and dormancy. Unlike the graphical and textual fault tree editing views, the interface provided by Excel was adequate for our purposes, and did not have to be augmented or restricted.

The DIFTree and Galileo tools both integrate the basic event model data in the fault tree representation, even though this information is ancillary to the task of modeling relationships between event occurrences. The Nova tool has a separate basic event model view, which more accurately reflects the separation between fault tree and basic event model which exists in the specification. We believe that a separate basic event model view also improves the usability, as such data rarely changes, can be shared among users, and can be used in the development of multiple fault trees.
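The per-event data managed in this view could be represented roughly as in the sketch below. The field names and types are assumptions based only on the categories just listed (distribution type, distribution parameters, coverage parameters, and dormancy); they are not Nova's actual declarations.

#include <string>
#include <vector>

// Illustrative record of the data associated with one basic event in the
// basic event model view; names and types are assumptions, not Nova's code.
struct BasicEventData {
    std::string distribution;         // distribution type, e.g. "exponential"
    std::vector<double> parameters;   // distribution parameters, e.g. a failure rate
    std::vector<double> coverage;     // coverage parameters
    double dormancy = 0.0;            // dormancy factor
};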

10.6 A New DFT Solver

Prior to this research, the analysis engines of DFT tools were based largely on an informal understanding of the semantics of the DFT language. While precise definitions of isolated DFTs in terms of Markov chains existed (as in [8]), software developers were forced to rely upon the natural language descriptions of the DFT constructs in order to implement analysis engines which could handle the general case. In contrast, Nova's DFT analysis algorithms are based on a sufficiently complete, mathematically precise definition of the DFT language.

The ideal method of implementing the specification would be to formally refine the specification to the resulting code in a manner that mathematically ensures the correctness of the resulting system. Unfortunately, this approach is extremely costly for such a complex specification. Instead, we chose to implement the analysis code as a direct, albeit informal, translation of the specification to the target language of C++. By maintaining the abstractions which exist in the specification, we hoped to create an implementation which would be easy to verify for correctness. Importantly, we made a conscious decision to avoid performance optimization in order to reduce the risk of implementation errors, and to avoid complicating the code to the detriment of verification.

10.6.1 The Solver Architecture

Figure 10.9 illustrates the overall architecture of the solver, which corresponds closely to the structure of the specification. The rectangles in the diagram represent modules which are primarily computational in nature, and which correspond to functions in the specification. Arrows represent method calls, and are labeled with the data passed to and from the method being called. Each data element corresponds to a type in the formal specification.

Figure 10.9: The analysis engine architecture (the Fault Tree Unreliability module drives the Fault Tree Semantics, Failure Automaton Semantics, and Markov Model Semantics modules, passing the fault tree, basic event models, system event, failure automaton, Markov model, and mission time between them, and returning a probability)

The Fault Tree Unreliability module manages the various steps of the analysis. As in the SystemUnreliability definition in the formal specification, the Fault Tree Unreliability module requires four arguments: the fault tree to be solved, a function which maps basic events to their associated failure models, an event which represents system failure, and a mission time. In return, the module computes a probability which corresponds to the probability of system failure.

In order to solve the fault tree, the Fault Tree Unreliability module computes the corresponding failure automaton and Markov chain, as defined by the specification. First the Fault Tree Semantics module is used to translate the fault tree into a failure automaton. The resulting failure automaton is passed, along with the basic event model mapping and the fault tree (which contains the replications for each event), to the Failure Automaton Semantics module to generate the corresponding Markov model. Finally, the Markov model and mission time are solved by the Markov Model Semantics module using a Fortran-based differential equation solver. The resulting Markov chain contains the final state probabilities.

To compute the overall unreliability, the Fault Tree Unreliability module iterates over all the states, determining if the state corresponds to a failure automaton state in which the system level event has occurred. If so, the final state probability for that state is added to the overall unreliability probability. This total probability is then returned as the solution to the fault tree.
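The final summation step can be written compactly, as in the sketch below. The flattened SolvedState record is an assumption made purely for illustration; in Nova the failure automaton and the Markov chain are separate structures, with correspondences between their states, as discussed in Section 10.6.3.

#include <vector>

// Illustrative sketch of the last step described above: sum the final state
// probabilities of those Markov states whose corresponding failure automaton
// state has the system-level event occurred.
struct SolvedState {
    double probability;           // final state probability from the Markov solver
    bool system_event_occurred;   // taken from the corresponding automaton state
};

double OverallUnreliability(const std::vector<SolvedState>& states) {
    double unreliability = 0.0;
    for (const SolvedState& s : states)
        if (s.system_event_occurred)
            unreliability += s.probability;
    return unreliability;
}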

10.6.2 A Verification-Amenable Implementation

We employed two key strategies for implementing a solver which would be easy to verify. First, we ensured that the structure of the code matched the structure of the specification. Second, we eschewed any performance optimizations as they were not necessary for the correctness of the code.

Each schema in the specification is implemented as a C++ class whose private members correspond to the state variables in the schema. For example, consider the state variables of the fault tree schema:

FaultTree
    basicEvents : F Event
    andGates : F Event
    orGates : F Event
    thresholdGates : F Event
    pandGates : F Event
    spareGates : F Event
    gates : F Event
    thresholds : Event ⇸ N1
    seqs : F InputSequence
    fdeps : Event ↔ InputSequence
    events : F Event
    inputs : InputsMap
    replications : ReplicationMap

The FaultTree abstraction in the specification corresponds to the Fault_Tree class in the implementation:

class Fault_Tree {
public:
    // Construction, destruction, access and mutation methods elided.

protected:
    std::set<Event> basic_events;                  // basicEvents : F Event

    std::set<Event> and_gates;                     // andGates : F Event
    std::set<Event> or_gates;                      // orGates : F Event
    std::set<Event> threshold_gates;               // thresholdGates : F Event
    std::set<Event> pand_gates;                    // pandGates : F Event
    std::set<Event> spare_gates;                   // spareGates : F Event

    std::set<Event> gates;                         // gates : F Event
    Threshold_Map thresholds;                      // thresholds : Event ⇸ N1
    std::set<InputSequence> seq_constraints;       // seqs : F InputSequence
    std::set<std::pair<Event, InputSequence>>
        fdep_constraints;                          // fdeps : Event ↔ InputSequence
    std::set<Event> events;                        // events : F Event
    Inputs_Map inputs;                             // inputs : InputsMap
    Replication_Map replications;                  // replications : ReplicationMap

    // Debugging data structures elided.
};

This close correspondence to the specification pervades the implementation, and eases the task of verifying its correctness. Many of the data structures in the C++ Standard Template Library [53] correspond to the data structures of Z. One notable exception is the lack of the specialized relational data structures of Z such as bijections and total functions. We used STL's general map class to implement the missing data structures, inheriting from the class, then overriding key methods and adding missing ones.

The current implementation could best be described as "inefficient almost to a fault". The solver, as a faithful implementation of the specification, suffers from the exponential state space explosion described in Chapter 2. There are many obvious performance optimizations which can be performed, such as stopping the state space enumeration after the system has failed (because further failures will not bring the system back up). We believe that this initial implementation, once verified, can serve as a baseline implementation on which future, more efficient, implementations can be based.
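As an example of the kind of wrapper just described, the sketch below adds Z-style function application on top of std::map: applying the function to a key outside its domain is reported as an error rather than silently inserting a default value, as std::map::operator[] would. The class and method names are illustrative; the dissertation does not reproduce the actual wrapper code.

#include <map>
#include <stdexcept>

// Illustrative wrapper that layers Z-style function application over std::map.
template <typename Domain, typename Range>
class Total_Function : public std::map<Domain, Range> {
public:
    // Z's function application f(x): the key must be in the function's domain.
    const Range& apply(const Domain& x) const {
        auto it = this->find(x);
        if (it == this->end())
            throw std::domain_error("function applied outside its domain");
        return it->second;
    }
};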

10.6.3 Implementation Experiences

The entire solver implementation took about 1 month to complete. The abstract data types were very easy to implement, as they were based directly on the Z schemas which we had developed earlier. The remaining effort was spent converting the "what" described in the specification to the "how" in the implementation. Most of the specification functions could be implemented as straightforward imperative statements such as that of the Fault Tree Unreliability computation described above. However, we found that there were two key issues for which a direct implementation of the specification would be exceedingly inefficient, and which required difficult design decisions.

The first problem involved the correspondences between the states and transitions of the failure automaton and the states and transitions of the Markov model. These correspondences are created initially when the Markov chain is built from the failure automaton, and are needed again later during the computation of overall system unreliability, when the algorithm must determine which Markov chain state corresponds to a failure automaton state in which the system level event has occurred. A direct implementation of the specification would involve creating the correspondences when translating the failure automaton to a Markov chain, then "forgetting" the correspondences, then recreating them during the computation of overall system unreliability. Recomputing these correspondences would be excessively inefficient. Furthermore, the implementation would be complex and difficult to verify, as it would have forced us to solve a problem similar to graph isomorphism.

Instead, we decided to save the correspondences built during the translation, and update them as appropriate during the remainder of the algorithm until they are needed during the final system unreliability computation. By modifying the architecture in this way, we diverged slightly from the specification, but avoided implementing a complex and difficult to verify method of recreating the state and transition correspondences. In other words, we compromised the direct correlation between specification and implementation in order to ease verification and avoid a significant performance penalty.

The second implementation issue we faced was the evaluation of the fault tree in response to the failure of a causal basic event. The specification simply states that all gates and constraints behave correctly relative to their semantics and the newly occurred basic event. A naive implementation would enumerate the entire state space, removing those states which violate the semantics of the gates and constraints, or whose basic event states are incorrect. This method, while simple, is highly inefficient.

We decided to implement a method which computes the next state in response to the failure of a causal basic event. For static fault trees, this procedure is straightforward: identify the gates into which the event is an input and evaluate them according to their gate semantics, then recurse for those gates whose output events have changed. Unfortunately, dynamic fault trees can have replication, cyclic dependencies, and spare allocation which must be handled while evaluating the fault tree. For example, if there is contention for spares, this contention must be resolved before the spare gates can be evaluated.
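The static-tree case just described can be sketched as follows. The representation (a map from each gate to its inputs, plus an occurrence threshold of "all inputs" for AND gates, 1 for OR gates, and K for threshold gates) and all of the names are assumptions made for illustration; none of the dynamic constructs (replication, spare contention, functional dependencies) are handled here.

#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Illustrative next-state computation for static gates only: when an event
// newly occurs, re-evaluate each gate that takes it as an input, and recurse
// for any gate whose output event has newly occurred. The caller is expected
// to have already added the newly failed basic event to occurred_events.
using Event = std::string;

struct Static_Tree {
    std::map<Event, std::vector<Event>> inputs;   // gate -> input events
    std::map<Event, std::size_t> threshold;       // AND: #inputs, OR: 1, K-of-M: K
};

void Propagate(const Event& occurred, std::set<Event>& occurred_events,
               const Static_Tree& tree) {
    for (const auto& gate_and_inputs : tree.inputs) {
        const Event& gate = gate_and_inputs.first;
        const std::vector<Event>& ins = gate_and_inputs.second;
        if (occurred_events.count(gate)) continue;   // output already occurred
        if (std::find(ins.begin(), ins.end(), occurred) == ins.end()) continue;
        std::size_t count = 0;
        for (const Event& in : ins) count += occurred_events.count(in);
        if (count >= tree.threshold.at(gate)) {
            occurred_events.insert(gate);            // the gate's output now occurs
            Propagate(gate, occurred_events, tree);  // recurse upward
        }
    }
}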

10.7 The Resulting Nova Modeling and Analysis Tool

10.7.1 Nova Overview

Figure 10.10 shows a screenshot of Nova. The sub-windows of the application correspond to the views described earlier in this chapter. The window at the bottom is the textual view, the window behind the textual view is the graphical view, and the upper window is the basic event model view. The textual and graphical views display two representations of the same fault tree. The user can create additional textual or graphical views, and can render from any textual or graphical view to any other. In addition, the user can render a textual or graphical view to the dynamic solver in order to solve the fault tree in that view.

The Nova tool provides the main window, as well as the frames of the sub-windows in which the views reside. The “File” and “Model View” menus are provided by Nova, while the “Fault Tree” menu is provided by the embedded Word or Visio application, as specialized for DFT editing. The other menus are standard menus provided by Word or Visio, depending on the currently active view.

The integration of the graphical and textual editors was not difficult, and was completed in less than a week. In order to reduce communication overhead, we decided to use the textual representation of the fault tree for communication between Nova and Visio. As a result, we had to implement a textual parser and unparser in C++ for Nova’s TextualView and GraphicalView classes, and a Visual Basic textual parser and unparser for Visio. While we have not performed a quantitative comparison of this design to the one employed by Galileo, we observe that the new design appears to allow for much faster renderings between views.

Figure 10.10: A screenshot of Nova
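The render path implied by this design can be sketched as follows; the View interface and its method names are invented here to show the shape of the adapters, and are not the actual Nova classes.

#include <string>

// Hypothetical adapter interface: every view can emit and accept the textual
// form of the fault tree, so rendering between any two views amounts to
// "unparse here, parse there" over the common interchange representation.
class View {
public:
    virtual ~View() {}
    virtual std::string getTextualForm() const = 0;        // unparse
    virtual void putTextualForm(const std::string& t) = 0; // parse and display
};

// No view-pair-specific translators are needed; the textual representation
// is the single interchange format between the textual and graphical views.
inline void render(const View& from, View& to) {
    to.putTextualForm(from.getTextualForm());
}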

10.7.2 Further POP Component Limitations

Development of Nova provided further experience with the POP approach, as we performed more aggressive specialization of the POP components than was employed in the development of Galileo. During the development of the graphical editor, we encountered performance issues that forced us
to modify our design slightly. For example, we discovered that determining the type of a shape was a very common operation that would access the stencil repeatedly. For this situation we amortized the cost by caching the objects retrieved from the stencil (sketched below). We also found that we had to disable the on-the-fly checks performed in the event handlers while we were making changes to the document programmatically. This was acceptable because, presumably, our automatic editing operations would not violate the syntax of the fault tree. We also found Visio’s UpdateUI application method to be particularly slow, which resulted in noticeable delays in the selection of shapes on the drawing page. As a result, we had to modify our design so that user interface updates are performed during idle times.

In general, performance limitations can seriously compromise a design, because the black-box nature of application packages gives the programmer no recourse for addressing poor package performance. In this case, our redesign resulted in acceptable performance. Interestingly, our discussions at the time with Visio developers revealed that the new version, Visio 2002, would have a new user interface object model (in addition to the old one) with better performance properties. When Visio 2002 was released, we found that the new user interface API was indeed faster, but that it did not support drop-down button lists on the toolbar, which we had used effectively in the previous version of the application. In the end, we decided to retain our previous design, which was possible because Visio 2002 still supported the older user interface object model.

The integration of the textual and graphical editors into the Nova application was greatly aided by the design of the tool architecture. By implementing the TextualView and GraphicalView adapter classes, we ensured that the views could exchange data easily. By using the built-in Active Document support, we ensured that the user interfaces could be easily integrated. And by implementing all of the editing functionality in the application components themselves, the resulting tool did not suffer from the same type of performance problems as Galileo.
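The caching mentioned above can be illustrated with the following fragment; the real code is Visual Basic running inside Visio, so this C++ sketch with invented names only shows the amortization pattern.

#include <map>
#include <string>

// Stand-in for the expensive query against the stencil; the real code asks
// Visio for the master associated with a shape.
static std::string lookUpShapeTypeInStencil(const std::string& shapeName) {
    return "unknown";  // placeholder body for the sketch
}

// Cache stencil lookups so that repeated queries for the same shape pay the
// cost of the stencil access only once.
class ShapeTypeCache {
public:
    const std::string& typeOf(const std::string& shapeName) {
        std::map<std::string, std::string>::iterator it = cache_.find(shapeName);
        if (it == cache_.end())
            it = cache_.insert(std::make_pair(
                shapeName, lookUpShapeTypeInStencil(shapeName))).first;
        return it->second;
    }
private:
    std::map<std::string, std::string> cache_;
};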


Despite these positive experiences, we did encounter a few unanticipated integration problems. The first was that the Visio API provides a means of calling user-defined Visual Basic functions, but does not allow the caller to obtain a return value from the function. This meant that, unlike the Word implementation, the Get operation could not easily extract the textual representation of the graphical fault tree from Visio. In order to work around this problem, we had to export the textual representation from Visio to a file to be read by Nova.

In our initial implementation of the graphical view, toolbar buttons and menu items were dynamically enabled and disabled according to the selected items. When disabled, a button becomes “grayed” and can not be clicked. For example, if no shape is selected, then the user should not be able to invoke the “add and connect” functionality. This behavior worked as expected during implementation, when Visio was used as a stand-alone application. However, we found that when Visio was used as a POP component, the application would remove and add the toolbars and menus during the enable/disable process, resulting in an annoying flashing of the user interface as shapes were selected by the user. We hypothesize that this is a result of an interaction with the Active Document merging of the component with the container interface. In order to address this problem, we chose not to disable and enable the buttons, but instead simply to make them unresponsive to a mouse click.

We also encountered an unusual and inexplicable behavior related to Visio’s use as a component. When Visio’s automatic layout functionality is invoked, the container application loses window focus and is sent to the background! This behavior seems totally unrelated to the task of formatting shapes on a drawing page. Because we did not have the ability to debug and fix Visio’s internal automatic layout routines, we were forced to send a “raise window” event to the Nova application following each layout operation. The result is workable, although suboptimal, because the Nova application flashes as it briefly loses window focus.

Finally, we found that Visio, when used as a component, is exceptionally slow to load. This problem seems to be inherent in Visio’s COM implementation, as it has been observed in versions of Galileo (see Section 7.2.2) as well as in Microsoft Binder, which also uses the Active Document architecture. Unfortunately, there appears to be no known workaround for this problem.

10.7.3 Revision of the Specification

Ideally, the development of a sufficiently detailed and complete specification allows the developer to identify and resolve semantic problems before they become implementation bugs. As described in Chapter 4, the development of the DFT specification did reveal a number of conceptual, design, and implementation errors. However, we found that the specification effort was not perfect: we found three errors in the specification during implementation, and discovered that the specification was not as flexible as we had thought. We also found that the specification defined semantics which were valid, but somewhat counter-intuitive.

In the specification which we had developed, a basic event could occur in one of three ways: masked internally such that the system does not detect it, in the normal manner, or as a catastrophic system-wide failure. The specification made the mistaken assumption that following a catastrophic system-wide failure, another basic event could occur and result in a system that is no longer failed. After consulting with the domain expert, we updated the specification to ensure that the catastrophically failed system stays failed. The discussion with the domain expert also revealed that our definition of the failure automaton, while suitable for our work, lacked flexibility needed for future uses. In particular, the failure automaton state records whether or not a system-wide failure has occurred, but not which basic events have occurred in this manner. This is important for the future specification of component repair, where one might assume that components that fail normally can be repaired, while components causing system-wide failures can not.

The second error concerned the dormancy scale factor, the fraction by which the rate of failure of unused spare components is attenuated. The error was that the specification incorrectly specified a dormancy scale factor for basic events which were not spares. Instead, the specification should state that non-spare basic events have a dormancy scale factor of 1, indicating no attenuation of their failure rate.

The third error we discovered was that the sparing scale factor was not properly defined. This scale factor is intended to account for nondeterministic assignment of spares to spare gates in case of contention. For example, if there are two nondeterministic assignments of a spare, then the rate associated with transitioning to either of the resulting states should be half of the normal value. Among other problems, the definition of the sparing scale factor incorrectly accounted for basic events which caused system-wide failure.

Figure 10.11: Cascading contention

During testing of the implementation, we discovered that the specification had semantics which were valid, but nonintuitive. For example, consider the fault tree in Figure 10.11. In this example, the failure of Event A results in simultaneous failure of Event B and Event C. Spare Gate 1 and Spare Gate 2 then contend for Event D. If Spare Gate 2 loses, then this triggers FDEP2, which forces Event E and Event F to occur, causing further contention for Event G. There are three possible outcomes:
• Spare Gate 1 acquires Event D and Spare Gate 3 acquires Event G
• Spare Gate 1 acquires Event D and Spare Gate 4 acquires Event G
• Spare Gate 2 acquires Event D, so that FDEP2 is not triggered
The specification assigns a 1/3 probability of occurrence to each of these outcomes. This is somewhat counter-intuitive because one would assume that the first two cases are actually sub-cases of Spare Gate 1 acquiring Event D instead of Spare Gate 2. Under this interpretation, the probabilities of the three outcomes should be 1/4, 1/4, and 1/2.
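To make the alternative reading concrete (the arithmetic below is ours, not part of the specification), treating the two contentions as nested rather than as one flat three-way choice gives:

P(Spare Gate 1 acquires Event D, Spare Gate 3 acquires Event G) = 1/2 × 1/2 = 1/4
P(Spare Gate 1 acquires Event D, Spare Gate 4 acquires Event G) = 1/2 × 1/2 = 1/4
P(Spare Gate 2 acquires Event D) = 1/2

whereas the specification treats the three outcomes as a single uniform choice and assigns 1/3 to each.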

Figure 10.12: Contention interacting with allocation

Figure 10.12 shows another counter-intuitive fault tree. In this fault tree, contention may result in the failure of Spare Gate 2, which triggers the failure of Basic Event E. At this point Spare Gate 3 needs its spare, Event D. Normally, one would assume that Event D was already allocated to Spare Gate 1, since Spare Gate 2 had to fail in order for Spare Gate 3 to need a spare in the first place. However, there is no causation in the allocation of spares according to the specification. As a result, there are three outcomes:
• Spare Gate 1 acquires Event D
• Spare Gate 2 acquires Event D
• Spare Gate 3 acquires Event D
Another way to interpret this situation is that all allocation and contention occurs simultaneously, so that the failure of Spare Gate 2 causes Spare Gate 3 to contend with Spare Gate 1 before the latter spare gate is allocated the spare.

One can argue that both of these examples represent degenerate fault trees which would never occur in practice. On the other hand, the formal specification provides a semantics for these fault trees, and constrains the implementor to provide an implementation that obeys these semantics.


Lacking such a specification, the implementor is free to build a system which will have unknown behavior should such a “degenerate” case actually arise in practice.

Chapter 11 Evaluation of the Combined Approach

Our overall evaluation is composed of three parts. First, we evaluated the POP approach through the development of multiple versions of the Galileo tool, as well as through the development of the model editing capabilities of Nova. We developed the Galileo tool to assess the technical feasibility of the POP approach as a model for the component-based construction of interactive systems. We then deployed Galileo to engineers and surveyed their evaluations of the software in order to assess the ability of the approach to deliver the features and usability that users require. As a result of these investigations, we were able to make a qualified positive evaluation of the approach.

Second, we evaluated the targeted use of formal methods for the formal specification and validation of modeling methods and the languages which they employ. We found that significant improvements in concept, design, and implementation can be gained through the use of formal methods. We also demonstrated that formal methods can be cost-effective, although the use of formal validation is less clearly so.

Finally, in this chapter we evaluate the combination of these two elements in the context of the development of the Nova tool. We demonstrate the effective combination of the POP approach for the low-cost construction of the superstructure elements of a tool, with the use of formal methods for the definition of the modeling method supported by the tool. In the next two sections we discuss the impact of the POP approach and the use of formal methods. We then discuss the overall cost of developing Nova. Lastly, we discuss the applicability and limitations of the approach.

11.1 Package-Oriented Programming

Nova is a prototype, and yet in many ways it surpasses existing DFT modeling and analysis tools. Its flexible architecture allows the user to create numerous textual and graphical editing views, and to render freely between these views. It supports many sophisticated features, such as rich undo capabilities, cut-and-paste, printouts spanning multiple pages, on-the-fly syntax highlighting and checking, powerful domain-specific editing operations, complete syntax checking, and integrated error reporting. Furthermore, users benefit from the usability and familiarity associated with the Word and Visio application-components.

The use of packages as components allowed us to build a tool having a multitude of features and high usability at a fraction of the cost associated with a from-scratch approach. Key to harnessing the capabilities of the components were the internal data structures and events which both Word and Visio exposed to the programmer. This, combined with the ability to program the components using the embedded Visual Basic scripting language, allowed us to write complex functions to modify, augment, or restrict the capabilities of the applications without incurring the cost of cross-process communication. It also allowed us to prototype the functionality in the package as a stand-alone application before integrating it in the overall tool.

The graphical editor, for example, automatically performs a number of on-the-fly analyses while the user builds a fault tree, such as automatically disconnecting any invalid connection between shapes. The editing toolbar allows the user to perform complex editing operations, such as connecting a new basic event to multiple selected gates at the same time. The syntax check operation analyzes the graphical fault tree drawing to ensure that it conforms to the abstract syntax described in the formal specification. If it does not, the integrated error window allows the user to jump to the location of any error when its report is double-clicked.


Integration of the textual editor, graphical editor, and basic event model database was not difficult. As was our experience developing Galileo, we encountered several unexpected integration errors whose resolution required new designs. However, not all of the problems we encountered could be resolved. For example, the start-up performance of Visio remained one issue for which there was no easy solution. While this problem is troublesome, it does not inhibit use of the tool once it has been started.

Extrapolating from our usability study with the Galileo tool, we fully expect that users will be satisfied with the usability of Nova. Much of the functionality of the tool is similar to that of Galileo, and it is also built using Microsoft Word and Visio. However, the performance of Nova is much better as a result of the careful partition between in-process model editing functionality and cross-process integration. Further, with “CPU cycles to spare” we were able to implement more sophisticated editing capabilities than those of Galileo.

Much of the ease of developing Nova was due to our previous experience developing Galileo, whose architecture took about two years to stabilize. The overall Nova architecture provides a flexible method of integrating POP-based model editing views, based on adapter classes controlled by a central view manager. The architecture manages the details of creating views, rendering between views, and document persistence. It provides abstract base classes for easily instantiating POP-based views. It also implements view-specific operations within the POP components.

The use of the POP approach allowed us to rapidly build a usable, sophisticated, feature-rich interface at significantly reduced cost. This experience provides a second data point, in addition to Galileo, which suggests that the POP approach can be used effectively for the construction of high quality modeling and analysis tools. Furthermore, it demonstrates that the approach can deliver more advanced tool features and behaviors than previously known.

11.2 Formal Methods

Our use of formal methods, although narrowly applied to the DFT modeling language, reveals not only that more rigorous approaches to developing tool modeling languages are needed, but also that such approaches can contribute significantly to the design of the language and the tools which support it. The process of developing a formal specification, then submitting it to informal validation with domain experts and formal validation with theorem proving tools, revealed numerous conceptual ambiguities, syntactic irregularities, and semantic inconsistencies.

Unit                       Lines of Code
Analysis Engine            9000 C++ (also 2600 C++ test cases)
Textual View               3100 VB (2000 VB shared with graphical view)
Graphical View             8800 VB (2000 VB shared with textual view)
Basic Event Model View     0
Nova (Main)                5700 C++
Total                      27200

Table 11.1: Lines of code written

Through this effort we gained and documented a deeper understanding of the subtleties and problems of the existing DFT language. We then used this knowledge to develop a revised language which resolved the syntactic and semantic problems which we had discovered. The formal specification served as the basis for the redesign of the textual and graphical representations used by the Nova tool. It also served as the basis for the validity check operation implemented in both the textual and graphical views. The specification was also the basis for the Nova DFT analysis engine.

During the course of this effort, we discovered that our intense informal and formal validation effort was not perfect: we discovered three errors in the specification. Two of these errors involved the computation of scale factors using real numbers, a known weakness of the Z formal specification language. One of these errors caused us to study further the coverage modeling feature of the DFT language, and revealed an error in the analysis engine hosted by Galileo.

Unfortunately, formal refinement of a specification to code remains an impractical task. While verification of implementations against specifications is not within the scope of this research, we believe our direct-implementation approach has the potential to ease informal verification. By consciously avoiding performance concerns and closely mirroring the structure of the specification, we hope to reduce the difficulty of correlating the specification to the implementation. Still, we have no proof of the correctness of our algorithms.

11.3 Cost

Overall, we estimate that the Nova tool took approximately a person-year’s worth of work to develop, including the one-time cost associated with the initial formal specification of the DFT language. Table 11.1 shows approximate lines of code for each major module of the code. The numbers indicate commented lines of code written, and do not include code generated automatically by the Microsoft Visual Studio AppWizard. Interestingly, the amount of Visual Basic code written to specialize the textual and graphical views exceeds the amount of code in the main Nova module that integrates them. (The basic event model view required no specialization code.)

But perhaps most important is the amount of code necessary to implement the superstructure capabilities of the tool. Recalling Shaw’s estimate (Section 9.2) that 90% of a typical tool’s code is devoted to the superstructure, our approach has reduced the amount of required superstructure code to a level which is comparable with that of the analysis engine. As a result, the entire Nova application is implemented in less than 30,000 lines of code. The POP approach allowed us to implement the bulk of the tool’s functionality at a fraction of the cost of traditional methods. The resulting Nova application is a mere seven megabytes, not including the Word, Visio, and Excel components.

Our use of formal methods revealed not only errors in the previous implementation of the DFT language, but also errors and ambiguities in the DFT language itself. The revised language which we have developed has a more regular and orthogonal design, as well as a semantics which is defined in a mathematically precise manner. The resulting Nova tool, we believe, has a more substantial basis for trustworthiness than previous tools.

Without a doubt, the use of formal methods incurred a nontrivial cost. However, the substantial savings gained through the use of packages as components helped to mitigate this cost. Furthermore, the use of modeling and analysis tools in the design of critical systems in large part motivates the need to bear this expense. Our experience indicates that the formalization of languages supported by modeling tools is sorely needed, and that doing so is feasible and profitable.

11.4 Applicability and Limitations of the Approach

We have demonstrated the effective combination of POP and formal methods for the development of modeling methods and the tools that support them. POP provides, at low cost, the bulk of the tool superstructure. By employing formal methods, we help address the important risk that the engineering method is unsound. We avoid incurring excessive costs by forgoing the formal specification of the entire tool, while acknowledging that risks still remain in other aspects of the tool, such as the POP components. We also avoid the high costs associated with the formal derivation of a trustworthy implementation from the specification. And yet, despite these cost-saving measures, we have shown that it is possible to build a tool whose modeling method has a formal basis.

This research is based on a set of observed requirements for tools, as described in Section 9.1.1, and a set of assumptions about tools, as described in Section 9.2. This characterization of tools is a generalization, which can be challenged by particular tools that contradict the stated requirements and assumptions. In particular, we assume that the superstructure of the tool represents low risk compared to the analysis core, and that the implementation effort required to provide the superstructure far exceeds that of the core. We feel that our characterization captures a large class of software modeling and analysis tools.

Obviously, tools whose characteristics diverge from the class we have defined may not be suitable for this approach. For example, our approach assumes that model editing views are semi-independent. Some tools, such as Adobe Photoshop, have multiple views on the same representation, such that modifications in one view are immediately reflected in the other. Such a tool would require a level of view integration which, although not impossible, has not been demonstrated by this research. Similarly, this work assumes that the software tool will be used in a critical context, such that the validity of the modeling language and its analysis algorithms is of high concern. In cases where the tool is not a high risk factor, the cost incurred through the use of formal methods may not be worth the associated benefits.


To explore the POP approach, we have been limited to those applications which provide the capabilities suitable for their use as components. For example, the approach requires sufficiently rich external APIs, internal programmability, and compatibility with programmatic and user interface integration standards. To date, such applications have largely been the product of Microsoft Corporation (and of Visio Corporation before it was bought by Microsoft). The availability of suitable POP components can be a limitation to the use of this approach.

Numerous times we have been faced with unanticipated and sometimes inexplicable POP component limitations. These limitations required us to explore the design space in order to find suitable alternative designs. As a result, this approach favors some degree of requirements flexibility. We have found that in cases where critical requirements can not be met, it is sometimes possible to influence vendors to provide a solution. Still, the approach does carry the risk that the desired tool can not be built.

This research does not address the important and difficult problem of formally verifying the correctness of the implementation of the specification. Without a sound basis for arguing the correctness of the analysis engine, users can have only the limited level of trust that results from testing, inspection, and other traditional methods of informal verification. Similarly, the POP components used by the tool can be sources of error. For example, the model extracted from a POP component by the view adapter class may not match the one the user created. This approach does not address this risk directly, except to argue that the complexity and subtlety of the modeling language pose a perhaps greater risk, one that can be addressed using formal methods.

Chapter 12 Conclusion

This chapter concludes this dissertation with a summary of contributions and a discussion of future work.

12.1 Summary

In this research we have identified two key impediments to the development of software modeling and analysis tools: the high cost of developing sophisticated user interfaces, and the informal definition of the modeling and analysis methods themselves. Our approach is a synthesis of two key elements which attack these essential difficulties. The first element is the package-oriented programming approach to component reuse, in which the general-purpose functionality of the tool is provided by volume-priced mass-market applications. The second element is the use of formal techniques for the precise definition of the domain-specific languages employed by the method.

12.1.1 Package-Oriented Programming

Our experience indicates that it is possible to leverage the capabilities of mass-market packages to build tool interfaces at a cost that is significantly lower than the cost to develop comparable functionality using traditional techniques. In general, there is evidence, albeit limited, that POP defines a viable model for component-based software development.


Important factors working in favor of POP include volume pricing through dual-use (application and component) packages, user familiarity, an existing licensing and payment model, de facto decomposition of the application space into general subdomains such as technical drawing and text editing, and support for both application and user interface integration standards.

We found today’s components to have limitations, some of which jeopardized the success of our effort. The components are complex and rife with undocumented behaviors and limitations. We had to use a development style based on prototyping and the ongoing exploration of both component properties and user requirements. The risks are very high. We were caught several times late in the game with serious problems that threatened to undermine our project. For example, if Visio had not come through with both adequate support for hyperlinks and fixes for the multi-page aspects of the API, we would not have been able to implement the multi-page, linked drawing functions that our users required.

In the end, however, we were able to develop tools having sophisticated functionality and high usability. End-user evaluation of Galileo indicated that we had met our goal of delivering a tool that met or exceeded user expectations. Users were surprised and enthusiastic about our use of mass-market applications with which they were already familiar. The success of Galileo has caused NASA to adopt the Galileo/ASSAP tool for documenting problems that occur on the International Space Station.

12.1.2 Formal Methods for Tool Languages

A key aspect of a modeling and analysis method is the modeling language that it employs. Without a precise and complete definition of that language, the method itself, as well as any tool which implements it, lacks a solid foundation for trustworthiness. By expressing the syntax and semantics of the language in a mathematically precise manner, the overall method can be more easily validated, and the implementor has a standard for correctness. None of the kinds of specification methods that are most commonly used today—natural language, semantics for selected special cases, and source code—can adequately meet these requirements. Formal methods for the precise specification of domain concepts and languages exist.


However, these approaches are rarely applied in practice, largely due to the perceived cost involved. We have demonstrated that it is both technically and economically feasible to apply formal methods to the definition of languages for modeling and analysis tools. We used the specification language Z to define, in the denotational style, the syntax and semantics of the language of an important reliability method, dynamic fault tree analysis. We then subjected the specification to both informal and formal validation. During this effort, we discovered a number of previously unknown conceptual, design, and implementation errors in the DFT language. We resolved these issues, resulting in the development of a new, improved version of the language.

12.1.3 Integrated Approach

We have presented an integrated approach that combines both POP and formal methods. The new tool which we developed, Nova, demonstrates that these two approaches can be combined effectively. Nova is the first tool to combine support for a formally specified modeling language with the use of POP components for the user interface. Our work demonstrates that a small team consisting primarily of a graduate student, advisor, and domain expert working part-time over two years can refine and formalize a complex modeling and analysis method, and deliver that method in a high quality tool. We showed that formal specification of the language not only led to the revision of the language, but also greatly eased the implementation—the syntax which we precisely defined was implemented in the model editor, and the semantics was implemented in the analysis engine. The Nova prototype also provides a second data point for the POP approach, demonstrating that POP components support a level of specialization which is required for the construction of high quality tools. We were able to aggressively modify Visio, implementing complex domain-specific functionality using only the built-in specialization capabilities.

12.2 Future Work

There are many avenues of future work resulting from this research related to the POP model for component reuse, applied formal methods, and the domain of fault-tolerant computing.

12.2.1 Package-Oriented Programming

To date, the usability of POP-based applications has been assessed through anecdotes and surveys. While this approach is appropriate and valuable in the initial evaluation of the POP approach, more rigorous analysis of usability, of the kind performed by HCI research, is needed. Key questions include (1) To what extent does the user’s familiarity with the packages lower the learning cost? (2) What is the nature of the tradeoff between aggressive specialization of the packages for the domain-specific application, versus keeping the packages “standard” in order to aid familiarity? (3) Packages provide a wealth of functionality, but (a) how much of this functionality is needed by the tool designer, and (b) how much of the functionality is used by the user?

Secondly, while this research investigates the key issue of integrating POP components, the success of the approach with respect to recursive composition remains unknown. In mature component industries, complex components are themselves built out of components. Two issues are the development of suitable compositional architectures for POP components, and techniques for composing user interfaces. One plausible approach for exploring this issue is to develop APIs for either Galileo or Nova that allow it to be used as a component. This work would seek to leverage existing industrial interest in the development of interoperable suites of reliability analysis tools.

Lastly, this work has addressed the dependability of the modeling language implemented by a tool. However, the use of mass-market applications as components introduces risk, as such applications are not known for their dependability. As a first step, we need methods for assessing risk exposure through the use of packages, and perhaps for limiting the functionality of the packages in order to limit that risk.

12.2.2 Applied Formal Methods

Recent work (e.g. [39]) has suggested that light-weight notations and validation approaches can achieve many of the benefits of more heavy-weight approaches at a greatly reduced cost. Model checking, for example, involves the validation of the specification by enumerating the state space and checking that the theorems hold. This approach is made easier by restricting the specification language to keep the state space explosion problem under control. Two key questions are (1) the extent to which the restriction of the language affects its expressibility, and (2) the limits of the partial state space validation approach compared to the purely symbolic approach used in traditional validation. The DFT specification can serve as a testbed for evaluating these issues. For example, one could attempt to reformulate the specification and associated theorems using a restricted specification language, and then attempt to prove the same theorems as with the heavyweight, symbolic approach.

This research has involved the development and validation of a specification. While this specification is a precondition for the development of a trustworthy implementation, we have not addressed the very difficult question of verifying the correctness of an implementation. One approach is to refine the specification to code, but this is difficult and costly. Another approach is to test the system, perhaps using the specification to generate test cases. This approach is necessarily incomplete, and for highly reliable systems impossible [13]. We need methods for ensuring the correctness of implementations which are less costly than refinement and provide stronger claims than testing.

12.2.3 Fault-Tolerant Computing

The formal specification of the DFT language which we have developed is a significant advance for the field of fault-tolerant computing. However, the language continues to evolve in order to meet the needs of reliability engineers. As a result, there are a number of recently developed features which have not yet been formalized. In particular, we have not yet formalized the semantics of DFT modularization, of so-called “phased mission” fault trees, which model systems that have multiple phases of operation, or of the notion of component sensitivity.


The failure automaton appears to be a useful, general semantic model for reliability languages. The potential for this domain has not yet been fully explored. In particular, it may be possible to express the semantics of many modeling languages in terms of this domain, and even develop a unified semantics for all reliability modeling languages. There are also more practical benefits: if successful, a general analysis engine based on the failure automaton can be used in a wide range of fault modeling and analysis tools hosting different high-level languages.

We have identified two key elements of an approach for developing modeling and analysis tools. It is clear that we are far from a standardized development approach. The role of architecture must be understood and codified—opportunities to abstract and generalize the architecture of Galileo and Nova could provide significant leverage for developers who build tools using this approach. Our development approach was characterized by heavy prototyping, package validation, and extensive testing. Much work remains in the elaboration and formalization of this process, as we develop a standard methodology for building such tools.

Appendix A The Z Formal Specification Language

In this appendix we present a brief overview of the Z (pronounced zed) formal specification notation [59], which we use for our specification. The notation supports the structuring and composition of complex expressions in first-order, typed set theory. We describe only the key concepts and notations used in this paper. In Z, every value has exactly one type. A type is a set of elements that is disjoint from sets that define other types. Z has a number of primitive types, such as natural numbers (N) and integers (Z). The specifier can define a new type of objects without specifying any details of the objects using a given set in Z, which is denoted using square brackets (“[GivenSet]”). Z also provides several mechanisms for defining new types from existing ones. For example, if S is a type, then seq S is a type that comprises the set of all sequences of items of type S; iseq S is the set of injective sequences (without repeated elements); F S comprises all finite sets of elements of type S. Instances of a type can be declared. For example, a statement such as mySeq : iseq Z defines a state element named “mySeq” whose value is in the set of sequences of integers. Sequences are simply partial functions (indicated by ⇸) from positive naturals (N1) to values of a particular type, where the domain of the function ranges from 1 to some value n representing the number of items in the sequence. A sequence is a function, and an expression such as mySeq(2) denotes the application of the function to the value 2, representing the value of the second item in mySeq. The value is undefined if there is no element 2. #mySeq denotes the length of the sequence.


Z has a rich collection of notations for defining relations over sets. Algebraically different kinds of relations are indicated by different types of arrows. For example, A → B is a type comprising the set of functions from the set A to the set B. Given f : A → B, i.e. f is some function whose actual domain is a subset of A and whose co-domain is a subset of B, ran f denotes the range of f, and dom f its domain. Cross-product types are denoted by the cross operator, ×, applied to the constituent types. Given a value, tuple : A × B, the elements of tuple are denoted as first(tuple) and second(tuple). Z provides a mechanism, called the schema, which supports the modular structuring of specifications of complex types. In a nutshell, a schema defines a type by specifying the state components of an element of the given type in terms of types that have already been defined, e.g., by given sets or other schemas, and by specifying invariant relations over these state components that are satisfied by all elements of the given type. Consider an example.

Example1
is : F1 Z
j : N
----------
1 ∈ is

This schema defines a new type, Example1. Items above the middle line declare state components. The schema says that every value of the type has the specified state components: a non-empty finite set of integers is and a natural number j. Invariant relations are stipulated in the predicate part of the schema, below the dividing line. Elements of the Example1 type are such that 1 is in is. An expression such as ex : Example1 declares a value ex of type Example1. The state components of ex are denoted by ex.is and ex.j. Z provides a schema calculus that allows schemas to be composed. In this paper we will make use of schema inclusion. Consider the schema below:


Example2
Example1
k : Z
----------
2 ∉ is

Schema inclusion means that the declarations of the included schema are aggregated textually with the declarations of the including schema. State components with the same name must have the same type and they are unified. Predicates of the included schema are conjoined with the predicates of the including schema. Example2 is thus exactly equivalent to the following schema:

Example2Together
is : F1 Z
j : N
k : Z
----------
1 ∈ is
2 ∉ is

Sets can be constructed using set comprehension. The following set comprehension defines the set of all squares of even numbers:

{ e : Z | e mod 2 = 0 • e² }

e is declared to be an integer such that the remainder after dividing by two is zero. The statement after “•” defines the element for the constructed set. The set of squares of even numbers, as a type, can be named in the following way:

SquaresOfEvens ≙ { e : Z | e mod 2 = 0 • e² }


Z also supports the definition of axioms, which pertain globally to a specification. They are declared in Z in the following way:

factorial : N → N1
----------
factorial(0) = 1
∀ i : N1 • factorial(i) = i ∗ factorial(i − 1)

Here factorial is defined as a recursive function from natural numbers to positive natural numbers. The base case is defined as a predicate on the factorial function, and the factorial function is defined for all non-zero naturals in the normal way.

Appendix B A Formal Specification of Dynamic Fault Trees

B.1 Scope

This appendix formally specifies the abstract syntax and semantics of dynamic fault trees. We formalize the abstract syntax and semantics using Z, with the semantics expressed in terms of a lower-level domain called failure automata. A subset of this lower-level domain is then further formally specified in terms of the well-understood domain of Markov models. Complete specification of the semantics of failure automata will require the use of additional domains besides Markov models, and is not covered in this document. The correspondence between the concrete syntaxes and the abstract syntax is argued informally.

While this document formalizes the key aspects of the dynamic fault tree language, there are several related aspects that are not covered. We do not formally specify the concrete textual and graphical syntax of the language, as this definition can be done in a straightforward manner using standard programming language grammars. We also do not formalize the semantics of the subsets of DFTs which can not be expressed as Markov models. We also do not specify a divide-and-conquer technique for modularizing a DFT, solving the modules independently, and integrating the results. Similarly, we do not address the formal semantics of DFTs with regard to properties of interest besides unreliability, such as the sensitivity of components or the modeling of systems that operate in multiple phases.

B.2 Conventions and Notation

This section defines the conventions and text styles used throughout this document. The notation and convention descriptions specific to the Z notation have been omitted.

defined term:

A defined term is underlined.

variableName:

Variable names begin with a lower case letter. Additional words in the variable name begin with capital letters and are concatenated.

TypeName:

Type names begin with an upper case letter. Additional words in the type name begin with capital letters and are concatenated.

B.3 Definitions

basic event:

A basic event models either the failure of an unelaborated subsystem, or the occurrence of some phenomenon that affects the system.

gate:

A gate models some combination or sequence of event occurrences.

constraint:

A constraint imposes a restriction on the occurrence of events in the model.

event:

An event models the occurrence of some phenomenon that affects the system, or the failure of a system, subsystem, or component of a system. Events can be either basic events or gates.

causal basic event: A basic event, the occurrence of which can cause the occurrence of all other newly occurred events in a fault tree. AND gate:

A gate whose output event is occurred if all of the input events are occurred.

OR gate:

A gate whose output event is occurred if any of the input events are occurred.

threshold gate:

A gate whose output event is occurred if the number of input events that are occurred meets or exceeds a specified threshold.

priority-and (PAND) gate: A gate whose output event is occurred if all of the input events have occurred and if they occurred “in order”. spare gate:

A gate in which spare inputs are used in order until no operational input is available, in which case the event associated with the output of the gate occurs.

functional dependency (FDEP) constraint: A constraint which specifies that the dependent events must occur if the trigger event occurs. sequence (SEQ) constraint: A constraint which specifies that the input events can only occur “in order”. coverage model:

Three values used to model the probability that either a basic event occurs but is not visible to the system, a basic event occurs and can be handled by the system, or a basic event occurs and results in a failure.

dormancy:

A factor between 0 and 1 inclusive that is used to attenuate a spare when it is not in use.


in order occurrence: Two events A and B are said to occur in order if A occurs before or at the same time as B. failure automaton state: the state of a fault tree, consisting of the number of occurrences for each event, the allocation of spares, the history of event occurrences, and the uncovered failure status. history:

a sequence of event states resulting from a sequence of event occurrences.

time step:

one position in a fault tree history

B.4 Basic Types

In this section we begin the formal specification with the definition of the abstract syntax of DFTs in Z. We first define an abstract system of real numbers and operations.

[R]

0R : R
1R : R

We introduce R as a given type, and declare 0R and 1R to be elements of that type. In our use of real numbers, the subscript is used to distinguish between the values and operators used for non-reals and those used for reals.


+R : R × R → R
∗R : R × R → R
/R : R × R ⇸ R
=R : R ↔ R
≠R : R ↔ R
<R : R ↔ R
>R : R ↔ R
≤R : R ↔ R
≥R : R ↔ R
+/R : F R → R
∗f(R) : (R → R) × R → (R → R)
+f(R) : (R → R) × (R → R) → (R → R)
∗pf(R) : (R ⇸ R) × R → (R ⇸ R)
+pf(R) : (R ⇸ R) × (R ⇸ R) → (R ⇸ R)
intToReal : Z → R
----------
intToReal 0 = 0R
intToReal 1 = 1R
dom(/R) = R × R \ {0R}
∀ x, y : Z • intToReal x = intToReal y ⇔ x = y

We introduce type definitions for functions that operate on real numbers and on functions of real numbers, abstracting the definitions. We first declare addition, multiplication, and division as infix functions from pairs of reals to reals, along with the usual comparison relations. Next we define the “distributed summation” operator, which computes the sum of a finite set of reals. We then define four operators for performing the distributed summation and product of both total and partial functions on reals. The function intToReal is used to map integers to reals, similar to a cast in programming languages.

Boolean ::= True | False

We define a Boolean type.

Probability == { p : R | 0R ≤R p ≤R 1R • p }

We define a probability as a real number between the values of 0 and 1 inclusive.

Rate == { r : R | r ≥R 0R • r }

We define a rate as a real number greater than or equal to 0.

Time == { t : R | t >R 0R • t }

We define time to be a real number greater than 0.

B.5 Abstract Syntax of Fault Trees

B.5.1 Event Identifiers and Events

[Event]

Event is a given type that represents failures or event occurrences in the fault tree.

B.5.2 Basic Events

BasicEvents
basicEvents : F Event

The basic events of a system are represented as a finite set of events.

B.5.3 Gates

In this section we specify the abstract syntax of the gates of a dynamic fault tree.

Gates
andGates : F Event
orGates : F Event
thresholdGates : F Event
pandGates : F Event
spareGates : F Event
gates : F Event
thresholds : Event ⇸ N1
----------
⟨andGates, orGates, thresholdGates, pandGates, spareGates⟩ partition gates    (B.5.1)
dom thresholds = thresholdGates    (B.5.2)


This schema defines the gates of a fault tree as finite sets of events, one set for each type of gate. Line B.5.1 states that these sets partition the set of all gates in the fault tree. As stated on line B.5.2, each threshold gate has an associated non-zero natural number that represents the threshold value. A threshold gate occurs if the number of occurred input replicates is greater than or equal to the threshold value.

B.5.4 Constraints

InputSequence == iseq Event

We define an input sequence as simply a sequence of events which does not contain repeated elements. This definition will be used in the definition of the inputs of gates and in the definition of the constraints.

Constraints
seqs : F InputSequence
fdeps : Event ↔ InputSequence

The schema above defines a sequence enforcer as a constraint over non-empty sequences of events. A functional dependency is a partial function from events to non-empty sequences of events. In the predicate we overspecify by defining the dependent events as a sequence instead of a set. We do this to improve readability later in the specification.

B.5.5 Fault Trees

Having specified the basic events, gates, and constraints, in this section we present the full specification of the fault tree abstract syntax.

InputsMap == Event ⇸ InputSequence


First we define a type for mapping a gate to its inputs. This function is partial because basic events are events, but do not have inputs.

IsDirectlyInputTo : P(Event × Event × InputsMap)
----------
∀ from, to : Event; inputs : InputsMap | to ∈ dom inputs •
    IsDirectlyInputTo(from, to, inputs) ⇔ from ∈ ran(inputs to)

IsDirectlyInputTo is true if the “from” event is in the inputs list of the “to” event. The three arguments for this function are the “from” event, the “to” event, and the partial function mapping gates to their inputs.

IsInputTo : P(Event × Event × InputsMap)
----------
∀ from, to : Event; inputs : InputsMap | to ∈ dom inputs •
    IsInputTo(from, to, inputs) ⇔
        IsDirectlyInputTo(from, to, inputs) ∨
        (∃ g : dom inputs • from ∈ ran(inputs g) ∧ IsInputTo(g, to, inputs))

IsInputTo is true if either the from event is directly input to the to event, or if there is some gate to which from is an input and which is an input (recursively) to to. The three arguments for this function are the “from” event, the “to” event, and the partial function mapping gates to their inputs.

ReplicationMap == Event → N1

We also define a replication function which maps events to their replications.


NumberOfReplicatesInInputs : InputSequence × ReplicationMap → N
----------
∀ is : InputSequence; rs : ReplicationMap •
    NumberOfReplicatesInInputs(is, rs) =
        if is = ⟨⟩ then 0
        else rs(head is) + NumberOfReplicatesInInputs((tail is), rs)

This helper function, given a sequence of events and a replication mapping, determines the total number of event replicates in the sequence.

FaultTree
BasicEvents
Gates
Constraints
events : F Event
inputs : InputsMap
replications : ReplicationMap
----------
⟨basicEvents, gates⟩ partition events    (B.5.3)
dom inputs = gates    (B.5.4)
∀ g : gates • ran(inputs g) ⊆ events
∀ g : gates • ¬ IsInputTo(g, g, inputs)
∀ sg : spareGates • ran(inputs sg) ⊆ basicEvents    (B.5.5)
∀ sg : spareGates; be : basicEvents | IsDirectlyInputTo(be, sg, inputs) •
    ¬ (∃ g : gates \ spareGates • IsDirectlyInputTo(be, g, inputs))
∀ s : seqs • ran s ⊆ events    (B.5.6)
dom fdeps ⊆ events    (B.5.7)
dom replications = events
∀ t : dom fdeps • replications t = 1
∀ is : ran fdeps • ran is ⊆ basicEvents
∀ g : gates • replications g = 1    (B.5.8)

The FaultTree schema defines the syntactic structure of a fault tree. The events set is the set of all events in the fault tree. inputs is a mapping for the inputs of each gate in the fault tree, and replications is a similar mapping for the replications of the basic events. The constraints state the following:


• (B.5.3) An event is either a basic event, an AND gate, an OR gate, a threshold gate, a PAND gate, or a spare gate.
• (B.5.4) Only gates can have inputs, the inputs must be one of the events in the fault tree, and no gate can be input to itself (directly or indirectly).
• (B.5.5) The inputs to spare gates are only basic events. Basic events that are inputs to spare gates can not be inputs for other types of gates (but they can be inputs to constraints).
• (B.5.6) Sequence enforcers must operate over the events of the fault tree.
• (B.5.7) Functional dependencies must be triggered by some event in the fault tree, every gate and basic event has a replication, the trigger must have a replication of 1, and only basic events can be dependent inputs.
• (B.5.8) All gates must have a replication of 1.

Note that full connectivity is not required by our specification—although all gates must have 1 or more inputs and must be input to the system level event, some of the basic events in the fault tree may not be inputs to any gate. This generality does not affect the semantics of dynamic fault trees, and will be useful in later specifications that build upon this one.

B.6 Failure Automata

In this section we specify the domain of failure automata.

B.6.1 States of Events and Histories

StateOfEvents == Event → N

StateOfEvents represents the state of all the events for a fault tree. Note that this does not capture the entire state of the fault tree; in particular, the allocation of spares to spare gates is not modeled.

History == { h : iseq StateOfEvents | h ≠ ⟨⟩ ∧ (∀ i, j : dom h • dom(h i) = dom(h j)) • h }

A History is specified as a non-repeating sequence of StateOfEvents that represents the changing state of the fault tree over a set of discrete time steps. Every step in the history has the same set of events, although the event states can change.
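A minimal Python sketch of this invariant, assuming a history is encoded as a list of dictionaries mapping each event to the number of its replicates that have occurred:

```python
def is_valid_history(history):
    """Check the History invariant on a list of event-state dicts.

    The history must be non-empty, all steps must cover the same events, and
    no two steps may be identical (iseq denotes an injective sequence).
    This is an illustrative check, not part of the specification.
    """
    if not history:
        return False
    events = set(history[0])
    same_events = all(set(step) == events for step in history)
    no_repeats = len({tuple(sorted(step.items())) for step in history}) == len(history)
    return same_events and no_repeats
```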

B.6.2 Failure Automaton State

SpareInUse == { siu : Event ⇸ Event | siu ∈ F(Event × Event) • siu }

We declare SpareInUse as a finite partial function from Event to Event. The domain represents a subset of the spare gates in the fault tree, and each spare gate in the domain is mapped to the spare it is currently using (if any).


FailureAutomatonState
  stateOfEvents : StateOfEvents
  history : History
  spareInUse : SpareInUse
  systemFailedUncovered : Boolean

  stateOfEvents = last history

The state of a fault tree consists of the state of the events, the history, the spare allocation, and the uncovered failure status. The stateOfEvents must be equal to the last state of events in the history.

B.6.3 Failure Automaton Transitions

FailureAutomatonTransition
  from : FailureAutomatonState
  to : FailureAutomatonState
  causalBasicEvent : Event

  to.history = from.history ⌢ ⟨to.stateOfEvents⟩
  causalBasicEvent ∈ dom from.stateOfEvents
  from.stateOfEvents causalBasicEvent < to.stateOfEvents causalBasicEvent

A transition between states consists of a “from” state, a “to” state, and an associated causal basic event. The destination state extends the history of the source state by one set of event states. The causal basic event must be one of the events in the event state, and it must be the case that additional replicate failures of the basic event occur in the transition between states.
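The following Python sketch checks these transition constraints under an assumed dictionary encoding of failure automaton states; the names are illustrative only.

```python
def is_valid_transition(from_state, to_state, causal_event):
    """Illustrative check of the FailureAutomatonTransition constraints.

    States are dicts with keys 'state_of_events' (event -> count of occurred
    replicates) and 'history' (list of such dicts); this encoding is an
    assumption of the sketch, not part of the Z specification.
    """
    extends_history = (
        to_state["history"] == from_state["history"] + [to_state["state_of_events"]]
    )
    known_event = causal_event in from_state["state_of_events"]
    more_replicates_failed = (
        known_event
        and from_state["state_of_events"][causal_event]
            < to_state["state_of_events"][causal_event]
    )
    return extends_history and more_replicates_failed
```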

B.6.4 Failure Automata

In this section we provide the complete specification of a failure automaton.

FailureAutomaton
  states : F FailureAutomatonState
  transitions : F FailureAutomatonTransition

  states = ⋃ { t : transitions • {t.from, t.to} }

A failure automaton consists of a finite set of states and a finite set of transitions. The predicate states that the set of states is exactly the set of endpoints of the transitions; in particular, every transition is between states of the failure automaton.

B.7 Semantics of Fault Trees in Terms of Failure Automata

In this section we specify the semantics of dynamic fault trees in terms of failure automata by establishing a correspondence between the two domains.

B.7.1 Number of Replicates, Event Occurrences

In this section we define functions for determining the number of replicates of an event that have occurred, the number of inputs to a gate that have occurred, and the notion of “in order” occurrence of inputs.

NumberOfOccurredReplicatesInInputs : InputSequence × StateOfEvents ⇸ N
dom NumberOfOccurredReplicatesInInputs =
  { is : InputSequence; soe : StateOfEvents | ran is ⊆ dom soe • (is, soe) }
∀ is : InputSequence; soe : StateOfEvents | (is, soe) ∈ dom NumberOfOccurredReplicatesInInputs •
  NumberOfOccurredReplicatesInInputs(is, soe) =
    if is = ⟨⟩ then 0
    else soe(head is) + NumberOfOccurredReplicatesInInputs((tail is), soe)

Given a sequence of inputs and a set of event states, this function determines the number of replicates that have occurred for all the input events.

OccursInTimeStep : P(History × Event × N1)
∀ h : History; e : Event; t : N1 | t ≤ #h ∧ e ∈ dom(h 1) •
  OccursInTimeStep(h, e, t) ⇐⇒
    (t = 1 ∧ h t e > 0) ∨ (t > 1 ∧ h t e > h (t − 1) e)

This function determines whether an event had a replicate that occurred in a given time step.


We allow events to occur in the initial state. (In fact, spare contention in the initial state can cause multiple nondeterministic initial states.)

FirstOccurrenceTime : History × Event ⇸ N
dom FirstOccurrenceTime = { h : History; e : Event | e ∈ dom(h 1) • (h, e) }
∀ h : History; e : Event; t : N1 | (h, e) ∈ dom FirstOccurrenceTime •
  FirstOccurrenceTime(h, e) =
    if OccursInTimeStep(h, e, t) ∧ (∀ t2 : N1 | t2 < t • ¬ OccursInTimeStep(h, e, t2))
    then t else 0

This function determines the time step in which the first replicate of an event occurs. A value of 0 indicates that no replicate has failed.

FirstFullOccurrenceTime : History × Event × N ⇸ N
dom FirstFullOccurrenceTime = { h : History; e : Event; r : N | e ∈ dom(h 1) • (h, e, r) }
∀ h : History; e : Event; r : N; t : N1 | t ∈ dom h ∧ (h, e, r) ∈ dom FirstFullOccurrenceTime •
  FirstFullOccurrenceTime(h, e, r) =
    if h t e = r ∧ OccursInTimeStep(h, e, t) then t else 0

This function computes the history position in which the final replicate of an event occurs. It returns 0 if the event has not fully occurred, i.e. the number of operational replicates remains greater than 0 throughout the history.

InputsOccurredAndInOrder : P(InputSequence × History × ReplicationMap)
∀ is : InputSequence; h : History; rs : ReplicationMap | ran is ⊆ dom(h 1) •
  InputsOccurredAndInOrder(is, h, rs) ⇐⇒
    NumberOfOccurredReplicatesInInputs(is, last h) =
      NumberOfReplicatesInInputs(is, rs)                                       (B.7.1)
    ∧ (∀ i, j : dom is | i < j •                                               (B.7.2)
        FirstFullOccurrenceTime(h, is i, rs(is i)) ≠ 0
        ∧ FirstOccurrenceTime(h, is j) ≠ 0
        ∧ FirstFullOccurrenceTime(h, is i, rs(is i)) ≤ FirstOccurrenceTime(h, is j))

Given a sequence of events, a history, and a replication map, the value of the InputsOccurredAndInOrder function is true if all the inputs have fully occurred and the replicates in each position fail before or at the same time as the replicates in later positions. Predicate B.7.1 states that all the inputs must be fully occurred, and predicate B.7.2 states that the event at position i must be fully occurred at or before the time at which the first replicate of any later event in the sequence occurs.
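The occurrence-time functions and the ordering check can be sketched in Python as follows, using the same assumed history encoding as above (a list of dictionaries mapping events to occurred-replicate counts); all names are illustrative.

```python
def first_occurrence_time(history, event):
    """Time step (1-based) at which the first replicate of `event` occurs; 0 if never."""
    for t, step in enumerate(history, start=1):
        prev = 0 if t == 1 else history[t - 2][event]
        if step[event] > prev:
            return t
    return 0

def first_full_occurrence_time(history, event, replication):
    """Time step at which the final replicate of `event` occurs; 0 if it never fully occurs."""
    for t, step in enumerate(history, start=1):
        prev = 0 if t == 1 else history[t - 2][event]
        if step[event] == replication and step[event] > prev:
            return t
    return 0

def inputs_occurred_and_in_order(inputs, history, replications):
    """Sketch of InputsOccurredAndInOrder for an input sequence over a history."""
    last = history[-1]
    all_occurred = sum(last[e] for e in inputs) == sum(replications[e] for e in inputs)
    ordered = all(
        first_full_occurrence_time(history, inputs[i], replications[inputs[i]]) != 0
        and first_occurrence_time(history, inputs[j]) != 0
        and first_full_occurrence_time(history, inputs[i], replications[inputs[i]])
            <= first_occurrence_time(history, inputs[j])
        for i in range(len(inputs)) for j in range(i + 1, len(inputs))
    )
    return all_occurred and ordered
```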

NewlyOccurredBasicEvents : FaultTree × History ⇸ F Event
dom NewlyOccurredBasicEvents = { ft : FaultTree; h : History | ft.events = dom(h 1) • (ft, h) }
∀ ft : FaultTree; h : History | (ft, h) ∈ dom NewlyOccurredBasicEvents •
  NewlyOccurredBasicEvents(ft, h) =
    { e : Event | e ∈ ft.basicEvents ∧ e ∈ dom(last h) ∧ OccursInTimeStep(h, e, #h) • e }

This helper function computes the set of basic events that occurred in the last history step.

B.7.2 Semantics of AND Gates

FaultTreeAndFailureAutomatonEventsMatch : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton • FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ⇐⇒ (∀ fas : FailureAutomatonState | fas ∈ fa.states • ft.events = dom fas.stateOfEvents)

In order to ensure the correct behavior of events in a fault tree with respect to the failure automaton, it must be the case that the events in the fault tree have corresponding states in the failure automaton. This function ensures that this condition is satisfied.

ANDSemantics : P(FaultTree × FailureAutomaton)
∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) •
  ANDSemantics(ft, fa) ⇐⇒
    (∀ ag : ft.andGates; fas : FailureAutomatonState | fas ∈ fa.states •
      NumberOfOccurredReplicatesInInputs(ft.inputs ag, fas.stateOfEvents) =
        NumberOfReplicatesInInputs(ft.inputs ag, ft.replications)
        =⇒ fas.stateOfEvents ag = 1
      ∧ NumberOfOccurredReplicatesInInputs(ft.inputs ag, fas.stateOfEvents) ≠
        NumberOfReplicatesInInputs(ft.inputs ag, ft.replications)
        =⇒ fas.stateOfEvents ag = 0)

Every AND gate in the fault tree is an event whose individual state in all of the failure automaton states is defined by this schema. The first predicate specifies that the event associated with the gate occurs if all the inputs have occurred (i.e. the number of occurred input replicates is equal to the total number of input replicates). The second predicate specifies that the event associated with the gate does not occur if it is not the case that all the input replicates have occurred.
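Evaluating an AND gate in a single state thus reduces to comparing occurred and total replicate counts, as in this minimal Python sketch (names assumed, not part of the specification):

```python
def and_gate_occurred(inputs, state_of_events, replications):
    """Evaluate an AND gate's event in a single failure automaton state.

    The gate occurs (returns 1) exactly when every replicate of every input
    has occurred; otherwise it returns 0.
    """
    occurred = sum(state_of_events[e] for e in inputs)
    total = sum(replications[e] for e in inputs)
    return 1 if occurred == total else 0
```

The OR and threshold gate semantics that follow differ only in the comparison used: any occurred replicate for the OR gate, and at least the threshold value k for the threshold gate.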

B.7.3 Semantics of OR Gates

ORSemantics : P(FaultTree × FailureAutomaton)
∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) •
  ORSemantics(ft, fa) ⇐⇒
    (∀ og : ft.orGates; fas : FailureAutomatonState | fas ∈ fa.states •
      NumberOfOccurredReplicatesInInputs(ft.inputs og, fas.stateOfEvents) ≠ 0
        =⇒ fas.stateOfEvents og = 1
      ∧ NumberOfOccurredReplicatesInInputs(ft.inputs og, fas.stateOfEvents) = 0
        =⇒ fas.stateOfEvents og = 0)

This schema defines the semantics of all of the OR gates in a fault tree. The OR gate’s associated event occurs if any of the input events have occurred. Otherwise, the associated event does not occur.

B.7.4 Semantics of Threshold Gates

ThresholdSemantics : P(FaultTree × FailureAutomaton)
∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) •
  ThresholdSemantics(ft, fa) ⇐⇒
    (∀ tg : ft.thresholdGates; fas : FailureAutomatonState | fas ∈ fa.states •
      NumberOfOccurredReplicatesInInputs(ft.inputs tg, fas.stateOfEvents) ≥ ft.thresholds tg
        =⇒ fas.stateOfEvents tg = 1
      ∧ NumberOfOccurredReplicatesInInputs(ft.inputs tg, fas.stateOfEvents) < ft.thresholds tg
        =⇒ fas.stateOfEvents tg = 0)

Like the specifications for the AND and OR gates, the threshold gate specification is based on the number of input replicates that have occurred. In this case, the event associated with the threshold gate occurs only if the number of occurred inputs is greater than or equal to the threshold gate's k value.

B.7.5 Semantics of PAND Gates

PANDSemantics : P(FaultTree × FailureAutomaton)
∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) •
  PANDSemantics(ft, fa) ⇐⇒
    (∀ pg : ft.pandGates; fas : FailureAutomatonState | fas ∈ fa.states •
      InputsOccurredAndInOrder(ft.inputs pg, fas.history, ft.replications)
        =⇒ fas.stateOfEvents pg = 1
      ∧ ¬ InputsOccurredAndInOrder(ft.inputs pg, fas.history, ft.replications)
        =⇒ fas.stateOfEvents pg = 0)

A PAND gate’s event occurs if all the inputs have occurred in order, and does not occur otherwise.

B.7.6 Semantics of Spare Gates

In this section we present the specification of spare gates. Spare gates are easily the most semantically rich construct in the DFT modeling language. To manage this complexity, we present the specification in two parts: the state semantics, which describe the allocation of spares to a spare gate, and the transition semantics, which describe the reallocation of spares resulting from the occurrence of a basic event.


NumberOfSpareGatesUsingSpare : SpareInUse × Event → N
∀ siu : SpareInUse; e : Event •
  NumberOfSpareGatesUsingSpare(siu, e) = #(siu ▷ {e})

NumberOfSpareGatesUsingSpare determines the number of spare gates that have a particular event allocated to them. The predicate states that the value of the function is defined as the size of the result of restricting the spares in use relation to only those spare gates that use event e.

FaultTreeAndFailureAutomatonSGsMatch : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton • FaultTreeAndFailureAutomatonSGsMatch(ft, fa) ⇐⇒ (∀ fas : FailureAutomatonState | fas ∈ fa.states • dom fas.spareInUse ⊆ ft.spareGates ∧ ran fas.spareInUse ⊆ ft.basicEvents)

In order to ensure the correct behavior of spare gates in a fault tree with respect to the failure automaton, it must be the case that the events referenced in the spare in use relation must correspond to events in the fault tree. This function ensures that this condition is satisfied.

SpareInUseIsAnInput : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) • SpareInUseIsAnInput(ft, fa) ⇐⇒ (∀ fas : FailureAutomatonState; sgus : Event | fas ∈ fa.states ∧ sgus ∈ dom fas.spareInUse • fas.spareInUse sgus ∈ ran(ft.inputs sgus))

This function ensures that any spare being used is an input to the spare gate using it.


ReplicatesNotOverused : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) • ReplicatesNotOverused (ft, fa) ⇐⇒ (∀ fas : FailureAutomatonState; sp : Event | fas ∈ fa.states ∧ sp ∈ ran fas.spareInUse • NumberOfSpareGatesUsingSpare(fas.spareInUse, sp) ≤ ft.replications sp − fas.stateOfEvents sp)

This function ensures that the number of spare gates using a replicate of a spare never exceeds the number of available replicates of the spare.

PreviousSparesUnavailable : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) • PreviousSparesUnavailable(ft, fa) ⇐⇒ (∀ fas : FailureAutomatonState; sgus : Event; i : N1 | fas ∈ fa.states ∧ sgus ∈ dom fas.spareInUse ∧ i ∈ dom(ft.inputs sgus) ∧ fas.spareInUse sgus = ft.inputs sgus i • (∀ j : 1 . . i − 1; be : Event | be = ft.inputs sgus j • NumberOfSpareGatesUsingSpare(fas.spareInUse, be) = ft.replications be − fas.stateOfEvents be))

This function ensures that if a spare is being used, all the previous spares in the spare gate’s sequence of inputs have no available spare replicates.


NoSparesInUseOnlyIfNoneAvailable : P(FaultTree × FailureAutomaton)
∀ ft : FaultTree; fa : FailureAutomaton |
    FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) •
  NoSparesInUseOnlyIfNoneAvailable(ft, fa) ⇐⇒
    (∀ fas : FailureAutomatonState; sgnus : Event |
        fas ∈ fa.states ∧ sgnus ∉ dom fas.spareInUse ∧ sgnus ∈ dom ft.inputs •
      (∀ be : Event | be ∈ ran(ft.inputs sgnus) •
        NumberOfSpareGatesUsingSpare(fas.spareInUse, be) = ft.replications be − fas.stateOfEvents be))

This function ensures that a spare gate uses a spare if one is available.
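The allocation constraints can be illustrated with a small Python sketch that checks two of them, ReplicatesNotOverused and NoSparesInUseOnlyIfNoneAvailable, for a single automaton state; the dictionary encoding and the names are assumptions of the sketch.

```python
def spares_available(event, spare_in_use, state_of_events, replications):
    """Operational replicates of `event` not already claimed by a spare gate."""
    in_use = sum(1 for spare in spare_in_use.values() if spare == event)
    return replications[event] - state_of_events[event] - in_use

def spare_allocation_ok(inputs, spare_gates, spare_in_use, state_of_events, replications):
    """Illustrative check of two spare-allocation constraints for one state:
    no spare replicate is over-used, and a spare gate goes without a spare
    only when none of its inputs has an available replicate."""
    not_overused = all(
        spares_available(sp, spare_in_use, state_of_events, replications) >= 0
        for sp in set(spare_in_use.values())
    )
    none_available_if_unused = all(
        all(spares_available(be, spare_in_use, state_of_events, replications) == 0
            for be in inputs[sg])
        for sg in spare_gates if sg not in spare_in_use
    )
    return not_overused and none_available_if_unused
```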

SpareGateStateSemantics : P(FaultTree × FailureAutomaton)
∀ ft : FaultTree; fa : FailureAutomaton |
    FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) •
  SpareGateStateSemantics(ft, fa) ⇐⇒
    SpareInUseIsAnInput(ft, fa) ∧ ReplicatesNotOverused(ft, fa) ∧
    PreviousSparesUnavailable(ft, fa) ∧ NoSparesInUseOnlyIfNoneAvailable(ft, fa) ∧
    (∀ sg : Event; fas : FailureAutomatonState | sg ∈ ft.spareGates ∧ fas ∈ fa.states •
      (sg ∈ dom fas.spareInUse =⇒ fas.stateOfEvents sg = 0) ∧
      (sg ∉ dom fas.spareInUse =⇒ fas.stateOfEvents sg = 1))


IsUsingSpare : P(Event × FailureAutomatonState) ∀ sg : Event; fas : FailureAutomatonState • IsUsingSpare(sg, fas) ⇐⇒ sg ∈ dom fas.spareInUse

This helper function determines whether a spare gate is using a spare.

SpareBeingUsed : Event × FailureAutomatonState ⇸ Event
dom SpareBeingUsed = { sg : Event; fas : FailureAutomatonState | IsUsingSpare(sg, fas) • (sg, fas) }
∀ sg : Event; fas : FailureAutomatonState | (sg, fas) ∈ dom SpareBeingUsed •
  SpareBeingUsed(sg, fas) = fas.spareInUse sg

This is a helper function that computes the spare that a spare gate is using in a given failure automaton state.

SpareGateStaysFailed : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) • SpareGateStaysFailed (ft, fa) ⇐⇒ (∀ sg : Event; fat : FailureAutomatonTransition | sg ∈ ft.spareGates ∧ fat ∈ fa.transitions • ¬ IsUsingSpare(sg, fat.from) =⇒ ¬ IsUsingSpare(sg, fat.to))

This function ensures that a spare gate that is not using a spare (is failed) remains so.


SpareGateContinuesUsingOperationalSpare : P(FaultTree × FailureAutomaton)
∀ ft : FaultTree; fa : FailureAutomaton |
    FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) •
  SpareGateContinuesUsingOperationalSpare(ft, fa) ⇐⇒
    (∀ sg : Event; fat : FailureAutomatonTransition | sg ∈ ft.spareGates ∧ fat ∈ fa.transitions •
      IsUsingSpare(sg, fat.from) ∧
      SpareBeingUsed(sg, fat.from) ∉ NewlyOccurredBasicEvents(ft, fat.to.history)
        =⇒ IsUsingSpare(sg, fat.to) ∧ SpareBeingUsed(sg, fat.to) = SpareBeingUsed(sg, fat.from))

This function ensures that a spare gate that is using a spare which has not failed continues to use the same spare.


SpareGateUsesAvailableSpare : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) • SpareGateUsesAvailableSpare(ft, fa) ⇐⇒ (∀ sg : Event; fat : FailureAutomatonTransition | sg ∈ ft.spareGates ∧ fat ∈ fa.transitions • IsUsingSpare(sg, fat.from) ∧ SpareBeingUsed (sg, fat.from) ∈ NewlyOccurredBasicEvents(ft, fat.to.history) =⇒ ¬ IsUsingSpare(sg, fat.to) ∨ IsUsingSpare(sg, fat.to) ∧ (∃ i , j : N1 | i ∈ dom(ft.inputs sg) ∧ j ∈ dom(ft.inputs sg) ∧ SpareBeingUsed (sg, fat.from) = ft.inputs sg i ∧ SpareBeingUsed (sg, fat.to) = ft.inputs sg j • i ≤ j ))

If a spare is being used in the source state of a transition and that spare fails in the transition, then in the destination state either no spare is used, or the same or a later spare in the spare gate's input sequence is used.
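Taken together, the three transition constraints can be sketched for a single spare gate as follows; the encoding and names are assumptions, not part of the specification.

```python
def reallocation_ok(inputs_of_sg, used_before, used_after, newly_failed):
    """Illustrative check of spare reallocation for one spare gate across one transition.

    `used_before` / `used_after` are the spares allocated in the source and
    destination states (None if no spare is in use), `inputs_of_sg` is the
    gate's ordered list of spares, and `newly_failed` is the set of basic
    events that occurred in the step.
    """
    if used_before is None:                 # a failed spare gate stays failed
        return used_after is None
    if used_before not in newly_failed:     # an operational spare stays allocated
        return used_after == used_before
    # the spare in use just failed: either give up, or switch to a spare that
    # is not earlier in the gate's input sequence
    return used_after is None or (
        used_after in inputs_of_sg
        and inputs_of_sg.index(used_after) >= inputs_of_sg.index(used_before)
    )
```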


SpareGateTransitionSemantics : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) • SpareGateTransitionSemantics(ft, fa) ⇐⇒ (∀ sg : ft.spareGates; fat : FailureAutomatonTransition | sg ∈ ft.spareGates ∧ fat ∈ fa.transitions • SpareGateStaysFailed (ft, fa) ∧ SpareGateContinuesUsingOperationalSpare(ft, fa) ∧ SpareGateUsesAvailableSpare(ft, fa))

This schema specifies the semantics of sparing as the state of the fault tree changes. It ensures that the spare allocation is consistent across a state transition from the fat.from state of a transition fat to the fat.to state.

SpareGateSemantics : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧ FaultTreeAndFailureAutomatonSGsMatch(ft, fa) • SpareGateSemantics(ft, fa) ⇐⇒ SpareGateStateSemantics(ft, fa) ∧ SpareGateTransitionSemantics(ft, fa)

The complete spare gate semantics is the conjunction of the state and transition semantics.

B.7.7 Semantics of Sequence Enforcing Constraints


SEQSemantics : P(FaultTree × FailureAutomaton) ∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) • SEQSemantics(ft, fa) ⇐⇒ (∀ seq : InputSequence; fas : FailureAutomatonState | seq ∈ ft.seqs ∧ fas ∈ fa.states • ∀ i : 1 . . #seq | fas.stateOfEvents(seq i ) > 0 • (∀ j : 1 . . i − 1 • fas.stateOfEvents(seq j ) = ft.replications(seq j ) ∧ FirstFullOccurrenceTime(fas.history, seq j , ft.replications(seq j )) ≤ FirstOccurrenceTime(fas.history, seq i )))

A sequence enforcing constraint disallows certain sequences of events from occurring. The SEQ constraint has the same ordering semantics as the PAND gate with respect to simultaneous occurrence and replicated inputs.

B.7.8 Semantics of Functional Dependency Constraints

FDEPSemantics : P(FaultTree × FailureAutomaton)
∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) •
  FDEPSemantics(ft, fa) ⇐⇒
    (∀ tr : Event; ds : InputSequence; fas : FailureAutomatonState; t : N1 |
        fas ∈ fa.states ∧ (tr, ds) ∈ ft.fdeps ∧ 1 ≤ t ≤ #(fas.history) •
      fas.stateOfEvents tr = 1 =⇒
        NumberOfOccurredReplicatesInInputs(ds, (fas.history t)) =
          NumberOfReplicatesInInputs(ds, ft.replications))


For a given history, if the trigger event occurs in a time step of the history, then all the replicates of all the dependent events (ds) also occur.
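A rough Python paraphrase of this behavior, checking one functional dependency against a history (the encoding and the step-by-step interpretation are assumptions of the sketch):

```python
def fdep_satisfied(trigger, dependents, history, replications):
    """Illustrative check of a functional dependency over a history.

    Once the trigger has occurred in some time step, every replicate of every
    dependent basic event must also have occurred by that step. This is a
    paraphrase of the description above, not the exact formal text.
    """
    triggered = False
    for step in history:
        triggered = triggered or step[trigger] >= 1
        if triggered and not all(step[d] == replications[d] for d in dependents):
            return False
    return True
```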

B.7.9 Uncovered Failure Semantics

UncoveredFailureSemantics : P(FailureAutomaton)
∀ fa : FailureAutomaton •
  UncoveredFailureSemantics(fa) ⇐⇒
    (∀ fat : FailureAutomatonTransition •
      fat.from.systemFailedUncovered = True =⇒ fat.to.systemFailedUncovered = True)

This function ensures that failure automaton states which are failed uncovered remain failed uncovered.

B.7.10 Causal Basic Event Semantics

In this section we specify the relationship between the causal basic event and the change in fault tree state.

CausalBasicEventSemantics : P(FaultTree × FailureAutomaton)
∀ ft : FaultTree; fa : FailureAutomaton | FaultTreeAndFailureAutomatonEventsMatch(ft, fa) •
  CausalBasicEventSemantics(ft, fa) ⇐⇒
    (∀ fat : fa.transitions • fat.causalBasicEvent ∈ ft.basicEvents)           (B.7.3)
    ∧ (∀ fas : FailureAutomatonState; be : Event | fas ∈ fa.states ∧ be ∈ ft.basicEvents •
        (fas.stateOfEvents be < ft.replications be =⇒
          (∃ fat : fa.transitions • fat.from = fas ∧ fat.causalBasicEvent = be)))

In this function, we ensure that the causal basic event of a failure automaton transition is a basic event of the fault tree, and that every basic event that can still fail in a given state is the causal basic event of some transition out of that state.

B.7.11 Complete Fault Tree Semantics in terms of Failure Automata

FaultTreeSemantics : FaultTree → FailureAutomaton
∀ ft : FaultTree; fa : FailureAutomaton •
  FaultTreeSemantics(ft) = fa ⇐⇒
    FaultTreeAndFailureAutomatonEventsMatch(ft, fa) ∧
    FaultTreeAndFailureAutomatonSGsMatch(ft, fa) ∧
    CausalBasicEventSemantics(ft, fa) ∧
    UncoveredFailureSemantics(fa) ∧
    ANDSemantics(ft, fa) ∧ ORSemantics(ft, fa) ∧ ThresholdSemantics(ft, fa) ∧
    PANDSemantics(ft, fa) ∧ SpareGateSemantics(ft, fa) ∧
    SEQSemantics(ft, fa) ∧ FDEPSemantics(ft, fa)

The complete fault tree semantics, expressed in terms of failure automata, is the conjunction of the gate, constraint, causal event, and system failure semantics.


B.8 Markov Models

In this section we provide an abstract definition of Markov models.

[MarkovModelStateID]

A Markov state identifier is used to distinguish Markov states.

MarkovModelState
  id : MarkovModelStateID
  initialProbability : Probability
  finalProbability : Probability

A Markov state has an associated initial and final state probability.

TransitionRateFunction == Time → Rate

A transition rate function is a function of time that computes a rate.

MarkovModelTransition
  from : MarkovModelState
  to : MarkovModelState
  transitionRateFunction : TransitionRateFunction

A Markov transition consists of a “from” and “to” state, and a transition rate function.


MarkovModel
  states : F MarkovModelState
  transitions : F MarkovModelTransition

  states = ⋃ { t : transitions • {t.from, t.to} }
  ∃ is : F MarkovModelState |
      is = { st : states | ¬ (∃ tr : MarkovModelTransition | tr ∈ transitions • st = tr.to) • st } ∧
      #is > 0 •
    ∀ st : states • st.initialProbability = if st ∈ is then 1R /R intToReal(#is) else 0R

A Markov model comprises a set of states and a set of transitions between states. The first predicate states that the set of states is exactly the set of states related by the transitions. The second predicate assigns an initial probability of 0 to each state that is not an initial state, and distributes the overall probability of 1 evenly across the initial states. The semantics of the Markov model are incomplete: we have not specified how the final state probabilities are computed from the transition rates and the initial state probabilities. However, Markov models are well enough understood that we are willing to elide the details.
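The initial probability assignment can be illustrated with a short Python sketch over an assumed list-of-pairs encoding of transitions; the names are illustrative.

```python
def assign_initial_probabilities(states, transitions):
    """Distribute the initial probability mass over states with no incoming transition.

    `states` is a list of state ids and `transitions` a list of (from, to) pairs.
    Mirrors the MarkovModel predicate: non-initial states get probability 0,
    and the initial states share the overall probability of 1 equally.
    """
    targets = {to for _, to in transitions}
    initial = [s for s in states if s not in targets]
    return {s: (1.0 / len(initial) if s in initial else 0.0) for s in states}
```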

B.9 Basic Event Models

In this section we define basic event models which model a basic event’s failure or occurrence characteristics.

CoverageModel
  singlePointFailureProbability : Probability
  coveredFailureProbability : Probability
  restorationProbability : Probability

  singlePointFailureProbability +R coveredFailureProbability +R restorationProbability = 1R

The restorationProbability is the probability that the component masks an internal failure. The coveredFailureProbability is the probability that a component fails in a way that can be detected by the system. The singlePointFailureProbability is the probability that the component fails and brings down the system. The sum of these three parameters must be one.
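A trivial Python sketch of this constraint, with a floating-point tolerance added as an implementation detail that is not part of the specification:

```python
def is_valid_coverage_model(restoration, covered, single_point, tol=1e-9):
    """Check the CoverageModel constraint: the three probabilities sum to one."""
    probs = (restoration, covered, single_point)
    return all(0.0 <= p <= 1.0 for p in probs) and abs(sum(probs) - 1.0) <= tol
```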

[Distribution]

We introduce the general Distribution type, from which we will define subtypes with associated failure parameters.

constantDistributions : F Distribution
probabilities : Distribution ⇸ Probability

dom probabilities = constantDistributions

The constantDistributions are a subset of Distribution. Each constant distribution has an associated probability. The definition of this and other distributions is axiomatic.


exponentialDistributions : F Distribution
exponentialRates : Distribution ⇸ Rate

dom exponentialRates = exponentialDistributions

The exponentialDistributions are a subset of Distribution. Each exponential distribution has an associated rate.

WeibullShape == { r : R | r ≥R 0R • r }

The Weibull shape is greater than or equal to 0.

weibullDistributions : F Distribution
weibullRates : Distribution ⇸ Rate
weibullShapes : Distribution ⇸ WeibullShape

dom weibullRates = weibullDistributions
dom weibullShapes = weibullDistributions

The weibullDistributions are a subset of Distribution. Each Weibull distribution has an associated rate and shape.

LogNormalMean == { r : R | r ≥R 0R • r } LogNormalStdDev == { r : R | r ≥R 0R • r }

The lognormal mean and standard deviation must be greater than or equal to 0.


logNormalDistributions : F Distribution
logNormalMeans : Distribution ⇸ LogNormalMean
logNormalStdDevs : Distribution ⇸ LogNormalStdDev

dom logNormalMeans = logNormalDistributions
dom logNormalStdDevs = logNormalDistributions

The logNormalDistributions are a subset of Distribution. Each lognormal distribution has an associated mean and standard deviation.

Dormancy == { r : R | 0R ≤R r ≤R 1R • r }

The dormancy must be between 0 and 1 inclusive.

disjoint ⟨constantDistributions, exponentialDistributions, weibullDistributions, logNormalDistributions⟩

The various types of distributions are disjoint.

BasicEventModel
  distribution : Distribution
  coverageModel : CoverageModel
  dormancy : Dormancy

A BasicEventModel describes the stochastic behavior of a basic event. The dormancy must be between 0 and 1 inclusive.

BasicEventModelFunction == Event ⇸ BasicEventModel

A basic event model function maps basic events to their associated models.

B.10 Semantics of Failure Automata in Terms of Markov Models

We now specify the semantics of failure automata in terms of Markov models.

B.10.1 Structural Correspondence Between Failure Automata and Markov Models

In this section we establish the structural correspondence between failure automata and Markov models.

StateCorrespondence == FailureAutomatonState ⤖ MarkovModelState

A state correspondence is a bijection between the states of the failure automaton and the states of the Markov model.

TransitionCorrespondence == FailureAutomatonTransition → MarkovModelTransition

A transition correspondence is a mapping from failure automaton transitions to Markov model transitions. It is not a bijection because multiple FA transitions can map to the same MM transition.

StructuralCorrespondenceExists : P(FailureAutomaton × MarkovModel ) ∀ fa : FailureAutomaton; mm : MarkovModel • StructuralCorrespondenceExists(fa, mm) ⇐⇒ (∃ fas2mms : StateCorrespondence; fat2mmt : TransitionCorrespondence • dom fas2mms = fa.states ∧ ran fas2mms = mm.states ∧ dom fat2mmt = fa.transitions ∧ ran fat2mmt = mm.transitions ∧ (∀ fat : FailureAutomatonTransition | fat ∈ fa.transitions • (fas2mms fat.from = (fat2mmt fat).from ∧ fas2mms fat.to = (fat2mmt fat).to)))

Corresponding transitions in the failure automaton and Markov model must map from and to corresponding states.

GetStructuralCorrespondence : FailureAutomaton × MarkovModel ⇸
  StateCorrespondence × TransitionCorrespondence
dom GetStructuralCorrespondence =
  { fa : FailureAutomaton; mm : MarkovModel |
    (∃ fas2mms : StateCorrespondence; fat2mmt : TransitionCorrespondence •
      StructuralCorrespondenceExists(fa, mm)) • (fa, mm) }
∀ fa : FailureAutomaton; mm : MarkovModel;
    fas2mms : StateCorrespondence; fat2mmt : TransitionCorrespondence |
    (fa, mm) ∈ dom GetStructuralCorrespondence ∧
    dom fas2mms = fa.states ∧ ran fas2mms = mm.states ∧
    dom fat2mmt = fa.transitions ∧ ran fat2mmt = mm.transitions ∧
    (∀ fat : FailureAutomatonTransition | fat ∈ fa.transitions •
      (fas2mms fat.from = (fat2mmt fat).from ∧ fas2mms fat.to = (fat2mmt fat).to)) •
  GetStructuralCorrespondence(fa, mm) = (fas2mms, fat2mmt)

This helper function computes the correspondence between a failure automaton and its associated Markov model.

B.10.2 Markov Model Transition Rate Functions

In this section we specify the transition rate functions for the Markov model transitions in terms of the transitions in the failure automaton. The basis for this transition rate function is the “hazard function” of the distribution associated with the causal basic event.


GetNonDeterministicTransitions : FailureAutomatonTransition × FailureAutomaton ⇸
  F FailureAutomatonTransition
dom GetNonDeterministicTransitions =
  { fat : FailureAutomatonTransition; fa : FailureAutomaton | fat ∈ fa.transitions • (fat, fa) }
∀ fat : FailureAutomatonTransition; fa : FailureAutomaton |
    (fat, fa) ∈ dom GetNonDeterministicTransitions •
  GetNonDeterministicTransitions(fat, fa) =
    { t : FailureAutomatonTransition |
      t ∈ fa.transitions ∧ t.causalBasicEvent = fat.causalBasicEvent • t }

GetNonDeterministicTransitions computes the set of transitions of the failure automaton that have the same causal basic event as the given transition.

SparingScaleFactor : FailureAutomatonTransition × FailureAutomaton ⇸ R
dom SparingScaleFactor =
  { fat : FailureAutomatonTransition; fa : FailureAutomaton | fat ∈ fa.transitions • (fat, fa) }
∀ fat : FailureAutomatonTransition; fa : FailureAutomaton | (fat, fa) ∈ dom SparingScaleFactor •
  SparingScaleFactor(fat, fa) =
    1R /R intToReal(#(GetNonDeterministicTransitions(fat, fa)))

The sparing scale factor is the inverse of the number of nondeterministic transitions of the failure automaton that share the same causal basic event as the given transition.


CoverageScaleFactor : FailureAutomatonTransition × BasicEventModelFunction ⇸ R
dom CoverageScaleFactor =
  { fat : FailureAutomatonTransition; bemf : BasicEventModelFunction |
    fat.causalBasicEvent ∈ dom bemf • (fat, bemf) }
∀ fat : FailureAutomatonTransition; bemf : BasicEventModelFunction |
    (fat, bemf) ∈ dom CoverageScaleFactor •
  CoverageScaleFactor(fat, bemf) =
    if fat.from.systemFailedUncovered = False ∧ fat.to.systemFailedUncovered = True
    then (bemf fat.causalBasicEvent).coverageModel.singlePointFailureProbability
    else (bemf fat.causalBasicEvent).coverageModel.coveredFailureProbability

This function determines the coverage factor that should be applied to the transition rate function, depending on whether the transition leads to an uncovered system failure.

NumberOfReplsInUse : Event × FailureAutomatonState → N
∀ e : Event; fas : FailureAutomatonState •
  NumberOfReplsInUse(e, fas) = #(fas.spareInUse ▷ {e})

The number of replicates of a basic event that are in use for a given FA state is the number of mappings from spare gates to spares such that the spares are restricted to be equal to the basic event.


NumberOfOperReplsNotInUse : Event × FailureAutomatonState × ReplicationMap ⇸ N
dom NumberOfOperReplsNotInUse =
  { e : Event; fas : FailureAutomatonState; rs : ReplicationMap |
    e ∈ dom fas.stateOfEvents • (e, fas, rs) }
∀ e : Event; fas : FailureAutomatonState; rs : ReplicationMap |
    (e, fas, rs) ∈ dom NumberOfOperReplsNotInUse •
  NumberOfOperReplsNotInUse(e, fas, rs) =
    rs e − fas.stateOfEvents e − NumberOfReplsInUse(e, fas)

The number of replicates of a basic event that are operational but not in use for a given FA state is the replication less the number of replicates that have already occurred less the number of replicates in use by spare gates.


DormancyReplicationScaleFactor : FailureAutomatonTransition × FailureAutomaton ×
  BasicEventModelFunction × ReplicationMap ⇸ R
dom DormancyReplicationScaleFactor =
  { fat : FailureAutomatonTransition; fa : FailureAutomaton;
    bemf : BasicEventModelFunction; rs : ReplicationMap |
    fat ∈ fa.transitions ∧ fat.causalBasicEvent ∈ dom bemf • (fat, fa, bemf, rs) }
∀ fat : FailureAutomatonTransition; fa : FailureAutomaton;
    bemf : BasicEventModelFunction; rs : ReplicationMap |
    (fat, fa, bemf, rs) ∈ dom DormancyReplicationScaleFactor •
  DormancyReplicationScaleFactor(fat, fa, bemf, rs) =
    intToReal(NumberOfReplsInUse(fat.causalBasicEvent, fat.from)) +R
    intToReal(NumberOfOperReplsNotInUse(fat.causalBasicEvent, fat.from, rs)) ∗R
      (bemf fat.causalBasicEvent).dormancy

The dormancy/replication scale factor is the number of replicates of the causal basic event that are in use by spare gates, plus the number of operational replicates that are not in use multiplied by the dormancy.

HazardFunction == Time → Rate

A hazard function is a function of time, and is based on the distribution. We do not specify the details of the hazard function.


ComputeHazardFunction : Distribution ⇸ HazardFunction
dom ComputeHazardFunction =
  { d : Distribution | d ∈ exponentialDistributions ∨ d ∈ weibullDistributions • d }
∀ d : Distribution; hf : HazardFunction | d ∈ dom ComputeHazardFunction •
  hf = ComputeHazardFunction d

ComputeHazardFunction computes the hazard function associated with a distribution. We abstract the details of this computation in this specification. However, an important constraint is that the hazard function can only be computed for exponential and Weibull distributions. This effectively constrains the fault trees whose corresponding failure automaton semantics can be expressed in terms of Markov models to the set of exponential or Weibull fault trees.

ScaledTransitionRateFunction : FailureAutomatonTransition × FailureAutomaton ×
  BasicEventModelFunction × ReplicationMap ⇸ TransitionRateFunction
dom ScaledTransitionRateFunction =
  { fat : FailureAutomatonTransition; fa : FailureAutomaton;
    bemf : BasicEventModelFunction; rs : ReplicationMap |
    fat ∈ fa.transitions ∧ fat.causalBasicEvent ∈ dom bemf • (fat, fa, bemf, rs) }
∀ fat : FailureAutomatonTransition; fa : FailureAutomaton; bemf : BasicEventModelFunction;
    d : Distribution; rs : ReplicationMap |
    (fat, fa, bemf, rs) ∈ dom ScaledTransitionRateFunction ∧
    d = (bemf fat.causalBasicEvent).distribution ∧ d ∈ dom ComputeHazardFunction •
  ScaledTransitionRateFunction(fat, fa, bemf, rs) =
    ComputeHazardFunction d ∗pf(R) CoverageScaleFactor(fat, bemf) ∗R
    DormancyReplicationScaleFactor(fat, fa, bemf, rs) ∗R SparingScaleFactor(fat, fa)

A scaled transition rate function is the hazard function scaled by the various scale factors.
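The composition of a transition rate function can be sketched in Python as follows; the function names and the constant-hazard example are assumptions, not part of the specification.

```python
def scaled_transition_rate(hazard, coverage_factor, dormancy_repl_factor, sparing_factor):
    """Compose a transition rate function from a hazard function and scale factors.

    `hazard` is a function of time returning a rate; the three factors are the
    coverage, dormancy/replication, and sparing scale factors described above.
    Returns a new function of time.
    """
    return lambda t: hazard(t) * coverage_factor * dormancy_repl_factor * sparing_factor

# With an exponential distribution the hazard function is a constant rate,
# so a usage example might look like this (all numbers are arbitrary):
rate_fn = scaled_transition_rate(lambda t: 0.001, 0.95, 2.0, 0.5)
```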


SumScaledTransitionRateFunctions : F FailureAutomatonTransition × FailureAutomaton ×
  BasicEventModelFunction × ReplicationMap → TransitionRateFunction
dom SumScaledTransitionRateFunctions =
  { fats : F FailureAutomatonTransition; fa : FailureAutomaton;
    bemf : BasicEventModelFunction; rs : ReplicationMap |
    fats ⊆ fa.transitions ∧
    (∀ fat : FailureAutomatonTransition | fat ∈ fats • fat.causalBasicEvent ∈ dom bemf) •
    (fats, fa, bemf, rs) }
∀ fat : FailureAutomatonTransition; fa : FailureAutomaton;
    fats : F FailureAutomatonTransition; bemf : Event ⇸ BasicEventModel; rs : ReplicationMap |
    (fats ∪ {fat}, fa, bemf, rs) ∈ dom SumScaledTransitionRateFunctions •
  SumScaledTransitionRateFunctions({fat}, fa, bemf, rs) =
    ScaledTransitionRateFunction(fat, fa, bemf, rs) ∧
  SumScaledTransitionRateFunctions({fat} ∪ fats, fa, bemf, rs) =
    ScaledTransitionRateFunction(fat, fa, bemf, rs) +pf(R)
    SumScaledTransitionRateFunctions(fats, fa, bemf, rs)

The sum of the scaled transition rate functions is computed from a set of failure automaton transitions recursively.


ComputeTransitionRateFunction : FailureAutomatonTransition × FailureAutomaton ×
  BasicEventModelFunction × ReplicationMap ⇸ TransitionRateFunction
dom ComputeTransitionRateFunction =
  { fat : FailureAutomatonTransition; fa : FailureAutomaton;
    bemf : BasicEventModelFunction; rs : ReplicationMap |
    fat ∈ fa.transitions ∧ fat.causalBasicEvent ∈ dom bemf ∧
    (∀ fat2 : FailureAutomatonTransition | fat2.from = fat.from ∧ fat2.to = fat.to •
      fat2.causalBasicEvent ∈ dom bemf) •
    (fat, fa, bemf, rs) }
∀ fat : FailureAutomatonTransition; fa : FailureAutomaton;
    fats : F FailureAutomatonTransition; bemf : BasicEventModelFunction; rs : ReplicationMap |
    (fat, fa, bemf, rs) ∈ dom ComputeTransitionRateFunction ∧
    fats = { t : fa.transitions | t.from = fat.from ∧ t.to = fat.to • t } •
  ComputeTransitionRateFunction(fat, fa, bemf, rs) =
    SumScaledTransitionRateFunctions(fats, fa, bemf, rs)

The transition rate function for a Markov model transition is the sum of the scaled transition rate functions of all failure automaton transitions between the same pair of states.

B.10.3 Complete Failure Automaton Semantics in terms of Markov Models


FailureAutomatonSemantics : FailureAutomaton × BasicEventModelFunction ×
  ReplicationMap ⇸ MarkovModel
dom FailureAutomatonSemantics =
  { fa : FailureAutomaton; bemf : BasicEventModelFunction; rs : ReplicationMap |
    { fat : FailureAutomatonTransition | fat ∈ fa.transitions • fat.causalBasicEvent } ⊆ dom bemf •
    (fa, bemf, rs) }
∀ fa : FailureAutomaton; bemf : BasicEventModelFunction; mm : MarkovModel; rs : ReplicationMap |
    (fa, bemf, rs) ∈ dom FailureAutomatonSemantics •
  FailureAutomatonSemantics(fa, bemf, rs) = mm ⇐⇒
    StructuralCorrespondenceExists(fa, mm) ∧
    (∃ fas2mms : StateCorrespondence; fat2mmt : TransitionCorrespondence |
        (fas2mms, fat2mmt) = GetStructuralCorrespondence(fa, mm) •
      (∀ fat : FailureAutomatonTransition | fat ∈ fa.transitions •
        (fat2mmt fat).transitionRateFunction = ComputeTransitionRateFunction(fat, fa, bemf, rs)))

The complete failure automaton semantics is specified as a correspondence between states and transitions, and the computation of the transition rate function from the basic event model functions, the replications, and the transitions in the models.

B.11 Analyses

A common analysis for fault trees is the determination of overall system unreliability. In this section we present a formal specification of that analysis.

IsSystemFailedState : P(MarkovModelState × Event × FailureAutomaton × StateCorrespondence)
∀ ms : MarkovModelState; se : Event; fa : FailureAutomaton;
    fas2mms : StateCorrespondence; fas : FailureAutomatonState |
    se ∈ dom fas.stateOfEvents ∧ ms = fas2mms fas •
  IsSystemFailedState(ms, se, fa, fas2mms) ⇐⇒
    fas.stateOfEvents se > 0 ∨ fas.systemFailedUncovered = True

This helper function determines whether a particular state in the Markov model is a system failed state.


SystemUnreliability : FaultTree × Event × BasicEventModelFunction × Time → Probability
∀ ft : FaultTree; se : Event; bemf : BasicEventModelFunction; t : Time •
  ∃ fa : FailureAutomaton; mm : MarkovModel |
      fa = FaultTreeSemantics(ft) ∧
      mm = FailureAutomatonSemantics(fa, bemf, ft.replications) ∧
      { fat : FailureAutomatonTransition | fat ∈ fa.transitions • fat.causalBasicEvent } ⊆ dom bemf •
    ∃ fas2mms : StateCorrespondence; fat2mmt : TransitionCorrespondence |
        (fas2mms, fat2mmt) = GetStructuralCorrespondence(fa, mm) •
      SystemUnreliability(ft, se, bemf, t) =
        +/R { ms : MarkovModelState |
              ms ∈ mm.states ∧ IsSystemFailedState(ms, se, fa, fas2mms) • ms.finalProbability }

The system unreliability can be computed from a given fault tree ft, an identified “system level event”, the basic event models, and a mission time. The overall system unreliability is the sum of the final probabilities of all Markov states that correspond to failure automaton states in which the system has failed. The failure automaton semantics will only be valid if the basic event models have distributions that have the Markov property.
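The final summation can be illustrated with a short Python sketch over an assumed encoding of Markov states as identifiers; the names are illustrative.

```python
def system_unreliability(final_probabilities, failed_states):
    """Sum the final probabilities of the Markov states that correspond to
    system-failed states.

    `final_probabilities` maps Markov state ids to their final probability and
    `failed_states` is the set of ids judged failed (the system-level event has
    occurred or an uncovered failure has happened).
    """
    return sum(p for state, p in final_probabilities.items() if state in failed_states)
```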

B.12 Fault Tree Subtypes

Particular sub-classes of fault trees correspond to particular sub-classes of failure automata via the FaultTreeToFailureAutomaton relation. We now formalize these sub-classes of fault trees.

StaticFaultTree
  FaultTree

  pandGates = ∅ ∧ fdeps = ∅ ∧ spareGates = ∅ ∧ seqs = ∅

Static fault trees have no order-dependent constructs.

DynamicFaultTree ≙ ¬ StaticFaultTree

Dynamic fault trees have one or more dynamic constructs.
