
Supporting systems of systems hazard analysis using multi-agent simulation

Rob Alexander, Tim Kelly
Department of Computer Science, University of York, York, United Kingdom


Article history: Received 12 July 2011; Received in revised form 15 June 2012; Accepted 29 July 2012; Available online 3 September 2012.

Keywords: Safety; Simulation; System of systems; Hazard analysis; Multi-agent

Safety Science 51 (2013) 302-318. http://dx.doi.org/10.1016/j.ssci.2012.07.006
0925-7535/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
Corresponding author: Department of Computer Science, University of York, Deramore Lane, York YO10 5GH, United Kingdom. Tel.: +44 1904 325 474, +44 7813 134 388.
E-mail addresses: [email protected] (R. Alexander), [email protected] (T. Kelly).

Abstract

When engineers create a safety-critical system, they need to perform an adequate hazard analysis. For Systems of Systems (SoSs), however, hazard analysis is difficult because of the complexity of SoS and the environments they inhabit. Traditional hazard analysis techniques often rely upon static models of component interaction and have difficulties exploring the effects of multiple coincident failures. They cannot be relied on, therefore, to provide adequate hazard analysis of SoS. This paper presents a hazard analysis technique (SimHAZAN) that uses multi-agent modelling and simulation to explore the effects of deviant node behaviour within a SoS. It defines a systematic process for developing multi-agent models of SoS, starting from existing models in the MODAF architecture framework and proceeding to implemented simulation models. It then describes a process for running these simulations in an exploratory way, bounded by estimated probability. This process generates extensive logs of simulated events; in order to extract the causes of accidents from these logs, this paper presents a tool-supported analysis technique that uses machine learning and agent behaviour tracing. The approach is evaluated by comparison to some explicit requirements for SoS hazard analysis, and by applying it to a case study. Based on the case study, it appears that SimHAZAN has the potential to reveal hazards that are difficult to discover when using traditional techniques.


1. Introduction

A growing challenge for safety engineers is maintaining the safety of large-scale military and transport Systems of Systems (SoSs), such as Air Traffic Control (ATC) networks and military units with Network Enabled Capability (NEC). The term "SoS" can be defined in terms of key characteristics (Alexander et al., 2004): SoS consist of multiple components that are systems in their own right, each having their own goals and some degree of autonomy but needing to communicate and collaborate in order to achieve overall SoS goals. SoS are typically distributed over large areas (such as regions, countries or entire continents), and their components frequently interact with each other in an ad-hoc fashion. It follows that military and transport SoS have the potential to cause large-scale destruction and injury. This is particularly true for SoS incorporating new kinds of autonomous component systems, such as Unmanned Aerial Vehicles (UAVs).

This paper is concerned with one aspect of the safety process for SoS, specifically hazard analysis: determining the distinct causal chains by which the behaviour of the SoS can lead to an accident.




Hazard analysis is a crucial part of any risk-based safety approach, but the defining characteristics of SoS make it very difficult.

Recent developments in SoS are likely to worsen the SoS safety problem. For example, there is a move towards dynamic reconfiguration, which greatly expands the number of system states that needs to be considered; any analysis may need to be carried out for all possible configurations. Similarly, SoS increasingly use ad hoc communications, meaning that information errors can propagate through the system by many, unpredictable, routes.

These factors overwhelm the ability of manual hazard analysis and therefore suggest a need for automated hazard analysis. There are a few automated approaches specifically designed for SoS safety, but what exists typically lacks any kind of systematic modelling process or has a very limited applicability in terms of the models it can analyse, and requires models that are built specifically for that analysis (for example, many approaches based on model-checking). Most of the extant SoS-specific methods are aimed at safety risk assessment (deriving quantitative values for the risk posed by the SoS); few of them are focussed specifically on hazard identification and hazard analysis (discovering the different hazards in the SoS and the distinct combinations of causes that can lead to them).

This paper presents SimHAZAN: a partly-automated hazard analysis method for SoS that avoids some of the problems associated with existing techniques. In particular, it has a systematic modelling process and a separate analysis approach that can be applied either to models developed through that process or to models developed by other means. The process provides specific support for hazard analysis – it leads directly to a qualitative understanding of the chains of causes by which hazards occur. It can fit into an existing risk-based safety process by providing a source of hypotheses about hazards that can then be tested and mitigated by safety engineers. This paper presents a case study that demonstrates the method's potential to reveal hazards, and causes of hazards, that other methods do not. Because it provides hypotheses (rather than confirmatory evidence of safety), it can be used despite concerns about the validity and fidelity of simulation models – it can be evaluated on a purely return-on-investment basis, without necessarily making claims about achieving coverage of all possible hazards and causes.

This paper is a summary of SimHAZAN, limited by the space available. For a fuller description (including a thorough literature review and many example artefacts from the case study) the reader is referred to (Alexander, 2007). That text does not use the term "SimHAZAN", but the approach presented here is a refinement of the approach described there.

The following section discusses the challenges of SoS hazard analysis and describes what is required of any method if it is to meet those challenges. Section 3 gives an overview of SimHAZAN, then Sections 4 and 5 give detailed accounts of the two major parts of SimHAZAN: modelling and analysis. Section 6 presents a case study, which also serves as an illustration of the method in practice. Section 7 briefly discusses the use of SimHAZAN in practical safety engineering, and Section 8 concludes the paper with a discussion of how well SimHAZAN meets the identified challenges; where there are shortfalls or opportunities, it outlines directions for future work.

2. The challenge of SoS hazard analysis

Aitken states that "An SoS Hazard is the combined behaviour of two or more distinct nodes within the SoS that could lead to an accident. An accident that can be described by behaviour confined to a single node (i.e. a single system hazard) is not a SoS accident, even if that node is acting as part of a SoS" (Aitken et al., 2011). A "node" here is a component of the SoS – something that is part of the SoS but has some degree of autonomy with respect to it. Examples could include an aircraft or a group of rescuers on the ground. SoS hazard analysis is thus the process of finding the conditions in which two or more distinct nodes can behave so as to give rise to an accident, and then finding the causal paths by which those conditions could be reached from a safe state. The specific objective of SimHAZAN is thus to associate the behaviours, states and interactions of SoS nodes with accidents.

The reader may ask, at this juncture, why there is such a concern with finding new hazards. After all, many hazard analysis techniques start with most hazards known, and concentrate on finding their causes. HAZOP is a typical example – although it works forwards from deviations in order to find their consequences, the set of dangerous consequences (system hazards) is mostly known at the start. This may not be typical for SoS. Although some hazards will be known at the start, many will only become apparent through exploratory analysis of the system. Hence, the current work is focussed on identifying possible behaviour variations ("deviations") of individual entities within the SoS ("nodes") and using simulation to project accidents that could occur because of those. The output is a set of causal chains, and it is then a task for engineers to turn those into a manageable set of hazards.

2.1. The problems of SoS hazard analysis

Perrow (1984) discusses what he calls 'normal accidents' in the context of complex systems. His 'Normal Accident Theory' holds that any complex, tightly-coupled system has the potential for catastrophic failure stemming from simultaneous minor failures. Similarly, Leveson (2002) notes that many accidents have multiple causes, which are all necessary and (only) collectively sufficient for the accident to occur. In such cases it follows that an investigation of any one cause prior to the accident (i.e. without the benefit of hindsight) might not have made the accident plausible to an analyst.

An SoS can certainly be a 'complex, tightly-coupled system', and as such is likely to experience such accidents. One strategy to improve SoS safety is to decouple the elements of the system, and Marais et al. note that this has worked well in the design of Air Traffic Control (ATC) SoS (Marais et al., 2009). This decoupling can have a cost in performance, however – for example, there are moves in ATC towards free flight models where aircraft interact via decentralised data exchange, which may increase airspace performance at the cost of increased coupling.

A 'normal accident' could also result from actions by each of two nodes that were safe in themselves (in their assumed context of use), but that are hazardous in combination with each other and the wider SoS context. Such emergent hazards are a major concern for SoS. These problems are also present in conventional systems – see, for example, Wilkinson and Kelly (1998) – but the characteristics of SoS exacerbate them.

Raheja and Moriarty (2006), when discussing SoS safety, comment that SoS can be tightly coupled at long distances and hence a change in one part of the system may have difficult-to-predict consequences in other parts. They also stress the contribution of system architecture to safety, noting however that in SoS the architecture may be dynamic. In decentralised systems with dynamic structure, predicting the long-range effects of local events is notoriously difficult.

The difficulty of detecting hazardous combinations of events is greater because many SoS will incorporate component systems drawn from multiple manufacturers, developed at different times, and operated by multiple organisations. The evolutionary and dynamic nature of SoS structures means that a component system designer may never understand the entire SoS context.

A further complication is that SoS elements, by definition, have some degree of operational autonomy – they have some goals of their own (such as self preservation) in addition to goals at a higher level (such as destroying priority targets). They are likely, indeed, to have goals at several levels – individual, local to the team or unit, and global to the whole SoS. The safety-critical behaviour of an SoS can thus only be understood by using models that can capture these goals, and analyses that can derive their (combined) consequences.

Discussion of military SoS inevitably involves reference to cutting-edge technologies, such as advanced unmanned vehicles. This creates an additional pressure: being novel, these technologies may not be well understood. Their developers often do not know how to make them safe, or how to assure others that they are safe. Unmanned vehicles are a particular concern in that they are likely to be very dumb responders to information shared over the SoS – they are particularly vulnerable to errors in network data or commands. This creates a need for modelling and analysis approaches that can capture some of their behaviour and help safety engineers determine the consequences in the SoS context.

Existing work on SoS dependability concentrates mostly on software and networks – there is little attention given to embodied SoS. An example of this is the DSoS project at the University of Newcastle (Gaudel et al., 2003), which almost exclusively studied enterprise networks. Safety requires more than this – engineers need to consider the physical nodes (e.g. aircraft and weapons systems) that are part of the SoS, along with the organisational structure of its human components (Rasmussen, 1997).



2.2. Requirements for SoS hazard analysis methods

Based on the general definition of hazard analysis as an activity, the key requirement for an SoS hazard analysis method can be expressed as follows:

1. Must provide qualitative descriptions of causal chains that relate node-level behaviours, states and interactions to SoS accidents.

In the requirement above, "SoS accidents" refers to accidents caused by SoS hazards as defined in Section 2. Such accidents must thus be caused by "the combined behaviour of two or more distinct nodes within the SoS". Accidents that are caused by the isolated behaviour of a single node can be found by standard hazard analysis applied to that node – it is therefore not a good use of effort to look for them at the SoS level.

Given the SoS-specific issues raised in Sections 1 and 2.1, any SoS hazard analysis technique should have the following capabilities:

2. Be able to model embodied SoS (SoS composed of physical, mobile entities).

3. Be able to express the behaviour of all SoS components, including the distinctive behaviour of novel technologies.

4. Be able to compose the behaviour of interacting autonomous nodes, including some emergent properties.

5. Be able to discover the effects of multiple simultaneous deviations.

6. Be able to propagate the effects of deviations through complex, decentralised and dynamically-reconfigured SoS.

Some general requirements common to all engineering methods research can be added to the above:

7. Provide a systematic, explicit method that can be applied by researchers and practitioners.

8. Be scalable to systems (here, SoS) of practical concern.

As noted earlier, a full SoS safety process should address the organisational issues that can prevent safe behaviour by operational personnel. This could be expressed as an additional requirement (that all hazards be traced to their possible organisational causes and mitigations), but the current work has not done this because it is focussed on addressing SoS hazards that occur on shorter timescales. It is possible that simulation techniques can identify management failings; indeed, this seems to be the ambition of (Mohaghegh et al., 2009), but SimHAZAN has not been developed with this in mind.

The requirements above define a basic standard for an adequate SoS hazard analysis technique. Given a number of rival techniques that all meet the requirements, they can be compared by means of quality attributes – attributes on which techniques can differ in their hazard analysis power. The most important here are:

1. Find hazards and causal chains that are linked to the maximum amount of risk in a given case study SoS.

2. Produce the minimum number of "false alarms" (hazard chains that do not exist in the real SoS) on a given case study SoS.

3. Minimise the rework needed to reuse an existing SoS model.
4. Maximise the ability of domain experts to find discrepancies between the model and the real world.

The use of "maximum" and "minimum" in the above allows comparison between techniques – for example, analysts can enumerate the unique hazards and chains found by each technique, and compare the two lists, e.g. as in Caseley et al. (2006). Obviously, not all hazards are equal in severity, so attribute 1 says "the maximum amount of risk" rather than "the maximum number of hazards and chains". In the early-stage work that SimHAZAN is intended for, however, it may be difficult to assess probability, so evaluators may want to fall back to a simple count.

2.3. Problems with current techniques

There are many manual hazard analysis techniques that use functional models of systems, with FFA and FMEA being perhaps the most common (Pumfrey, 1999). These work on the basis of guide words applied to a functional (or component-failure-mode) model of a system. In theory, engineers can use these techniques to explore combinations of deviations, but in practice most deviations are only considered individually.

HAZOP (CISHEC, 1977) has some advantages over FFA because it is based on a more explicit system model (nodes and flows). It has obvious application to SoS in terms of information flow between nodes (for example, as captured by the MODAF artefacts OV-2 and OV-3), but it shares FFA's limits on coincident deviations and on propagation. These techniques can therefore be questioned with respect to requirements 4, 5 and 6. Standard HAZOP also relies on static descriptions of system structure. Static descriptions can be imposed on SoS (cf. the MODAF just mentioned), but they will not capture any dynamic reconfiguration that can occur. This is a further shortcoming with respect to requirement 6.

Some hazard analysis techniques have been developed specifically for SoS. Examples are those of Redmond (2007), Redmond et al. (2008), and Stephenson et al. (2011). Both feature an approach to managing the combinatorial diversity of SoS – Redmond limits his exploration of failure chains by progressive estimation of their probability, and Stephenson adapts product line techniques to manage variation. Both have been applied to plausible SoS examples. Redmond's technique is clearly documented, but the published description of Stephenson's technique is rather confusing. These techniques are promising, but it is likely that they will struggle with requirements 5, 6 and 8 – they will be limited in their ability to consider multiple simultaneous deviations, they will be limited in their ability to propagate effects through a system, and thus they may not scale well.

There are a number of existing simulation-based approaches to hazard and safety analysis within SoS. Most of the techniques are designed to give quantitative assessments of the overall risk present in the system. For example, Blom et al. (2006) in airspace system safety, Brooks et al. (2004) in military operations and Mohaghegh et al. (2009) in socio-technical systems use Monte Carlo techniques to acquire quantitative statistical measures of the overall safety of a system under specified conditions. By contrast, the concern in this paper is hazard identification and analysis – deriving the complete set of qualitatively distinct hazards that the SoS can exhibit, and finding as many as possible of the causal chains that can lead to them. Monte Carlo can be used to explore qualitative possibilities, although it is not ideally suited to this – for example, if driven by parameter distributions based on operational realities it may repeatedly run common situations and never run some rare ones. Such approaches may therefore perform poorly with regard to discovery of novel qualitative behaviours (requirement 1).

3. A method for SoS hazard analysis

This section presents the SimHAZAN approach, which meets the requirements identified while avoiding many of the weaknesses of existing techniques. This section gives an overview of the approach and its rationale; Sections 4 and 5 then describe it in more detail.



SimHAZAN is a semi-automated method for conducting hazard analysis of systems of systems. It guides engineers to build a simulation model of the SoS in question, then guides them to analyse it using specific software tools. The output of this process is a set of hazards that may be present in the SoS. This section, along with Sections 4 and 5, describes the SimHAZAN process. Section 5.5 describes the authors' specific implementation of the SimHAZAN tools, and Section 6 presents a case study that the authors performed. Fig. 1 provides a graphical overview of the process.

The starting point for SimHAZAN is the knowledge that manual analysis can highlight expected risks – engineers will always start the analysis process with intuitions that certain deviations of node behaviour, certain environmental conditions and mission situations, will cause safety risks. Some of these intuitions will come from past experience with similar SoS. The engineers will not always be right, but some of their concerns are likely to be valid. Similarly, they can predict to a high degree that some accident situations are possible (e.g. "two rescue helicopters collide while trying to respond to the same distress call"). This is important knowledge; it may be supported by a variety of sources, and should not be discarded.

Unfortunately, as discussed in Section 2, it is difficult to work out exactly how and when those conditions will give rise to those accidents. It is hard to discover this manually because SoS have so many diverse autonomous parts and because they work in such complex environments.

Simulation can, however, explore the consequences of those deviations in models of the SoS. The simulator derives the consequences of diverse initial conditions (e.g. "X has failed" or "the radio is being jammed" or "a heavy fog has come down"), generating a huge range of possible outcomes. Large-scale computing power can be applied to do a great many simulations simultaneously, running them for days or weeks if desired. In practice, it will still not be practical to explore all combinations of deviations, so combinations will need to be prioritised somehow. Section 4 describes a suitable modelling approach, while Section 5.1 shows how runs are selected using a probabilistic approach to prioritisation of combinations.
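
As a rough illustration of the kind of batch execution this implies, the following Python sketch (assuming a hypothetical run_simulation entry point, not the authors' tooling) farms independent runs with different initial conditions out to worker processes:

    from multiprocessing import Pool

    def run_simulation(initial_conditions):
        # Placeholder for a call into a real simulation engine; it would return
        # the accidents (if any) observed under these initial conditions.
        return {"conditions": initial_conditions, "accidents": []}

    conditions = [{"heavy_fog": True}, {"radio_jammed": True}, {"node_x_failed": True}]

    if __name__ == "__main__":
        with Pool() as pool:                      # one worker per available core
            results = pool.map(run_simulation, conditions)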

A consequence of this approach is that it produces output data that is huge, far too vast to comprehend; users cannot simply read through all the simulation logs to find out what happened (there will be thousands of them, at the least). It is possible, however, to apply machine learning techniques to compress the mass of data into a concise set of rules that relates node deviations and environmental conditions to the accidents that they cause. This reduces the mass of runs to a small number of distinct variations that safety engineers can investigate individually. This compression is not ideal (some information will inevitably be lost), but it is a necessary concession to the complexity of SoS. Section 5.2 explains how the machine learning is performed, while Section 5.4 addresses the information loss in the context of a return-on-investment argument for using SimHAZAN.

Fig. 1. The SimHAZAN process.

Of course, the "epistemic opacity" of simulations is well-known: it is often hard to understand why a simulation run gave the results it did (Humphreys, 2004; Paolo et al., 2000). To help with this, agent tracing techniques (Lam and Barber, 2005) can be employed to explain events in the simulation in terms of their causation by other events. As with the simulation engine, the tracer can take local explanations of behaviour and turn them into SoS-level consequences (a causal chain from an accident event back to its causes).
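
A minimal sketch of the idea, assuming a simple event log in which each event records its direct causes (this is an illustration only, not the tracer of Lam and Barber (2005)):

    def trace_back(event, direct_causes):
        """Walk recorded cause links backwards from an accident event,
        returning a candidate causal chain for an analyst to review."""
        chain, frontier = [], [event]
        while frontier:
            e = frontier.pop()
            chain.append(e)
            frontier.extend(direct_causes.get(e, []))
        return chain

    direct_causes = {
        "collision":            ["uav_off_course"],
        "uav_off_course":       ["stale_shared_picture"],
        "stale_shared_picture": ["uav_loss_of_transmission"],
    }
    print(trace_back("collision", direct_causes))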

The overall SimHAZAN process is a hypothesis generator – the result of simulation, learning and tracing is a set of hypotheses about how the SoS can have accidents. As with all approaches to hazard analysis, the process has value if the effort involved in performing it is warranted by the set of hypothesised hazard chains that turn out to be valid. The overall role of the technique is to narrow down a huge analysis space into one that is manually tractable. Simulation validity remains a concern, but less than if it was relied on to give very trustworthy results. Section 5.4 explains how the method can have value even when there are validity problems.

The SimHAZAN approach has the following advantages over the existing techniques discussed in Section 2.3:

- It has an explicit modelling approach that leads to models with traceability to explicit, a priori safety concerns about the system.
- It guides users to perform systematic exploration of deviations of the model.
- The use of simulation allows multiple simultaneous deviations to be considered and long-range effects to be discovered.
- The machine learning helps make the mass of simulation output comprehensible.
- The tracing tool helps users to build explanations of accident occurrences.

The following two sections describe the two complementary parts of SimHAZAN. Section 4 describes an approach to modelling SoS – the aim of this approach is to produce implemented simulation models that can be run to reveal hazardous behaviour under certain conditions. Once there is such a model, users can apply the analysis techniques described in Section 5 to produce a small set of hypotheses about the hazardous behaviour of the SoS. Section 6 presents a case study using both methods.

4. Modelling SoS as multi-agent simulations

4.1. Overall process

There are two key goals for the modelling process: it must provide high confidence that the model is a useful representation of the real SoS, and it must support the mode of analysis described in Section 5. Specifically, it must support a mode of analysis that produces explicit rules which relate deviant behaviour to the hazards it causes. It must then help engineers produce explicit causal chains that explain those rules.

The modelling method works by starting from a credible source model and transforming it to an implemented simulation by a series of stages that maintain traceability with respect to the original model while allowing missing details to be added. By maintaining traceability between sequential stages, and by running checks against a set of pre-specified safety concerns, a degree of fidelity can be achieved.

In progressing from the source model to the implementation, the method follows an approach based on that of Drogoul et al. (2002). The modeller starts with a source model which contains a description of the system in whatever form is familiar to the domain experts they have available. In the military domain, this might be a paper description of a military unit with its basic goals, structure and common sequences of operations. The next step is to expand the source model into a domain model, which remains in terms of domain concepts but adds additional information that is needed for performing hazard analysis. For example, the source model may be vague on the precise performance and timing properties of the computer network used – if the modeller knew that the network had a safety-relevant role, they would acquire that information about the network.

Fig. 2. Overall modelling process (after Drogoul et al. (2002)): from the Source Model ("real" agents) and Domain Model ("real" agents), through the Design Model (conceptual agents) and Operational Model (computational agents), to the Implementation (simulation environment) and the Target System (real environment), with feedback between the agent modelling activities and the analysis.

Given a domain model, a modeller can transform that into a design model, which is in terms of the simulation approach being used. In this paper, this is multi-agent simulation, so the design model is expressed in terms of agents with specific beliefs, desires and intentions, which interact so as to produce overall system behaviour. In this work, the Prometheus method and notation (Padgham and Winnikoff, 2004) is used to generate the design model. The design model is still quite abstract, and at heart a paper model – it does not take account of the precise representation of agents in a computer. That representation in computational terms is the operational model, which covers the representation of agent concepts as actual computer software, including low-level technical issues such as event scheduling and inter-agent communication. In most cases, modellers will want to use an existing simulation engine, and it is the operational model that captures how the stipulations of the design model are reconciled with the limits of the engine. Finally, software engineers can transform the operational model into an implementation, which is the model rendered as an executable computer program.

The progression above is illustrated in Fig. 2 (the activities in the "analysis" box are discussed in Section 5).

This staged progression allows the modelling process to be treated as a series of transforms, and this in turn makes it possible to check at each stage that the model remains adequate: at each stage the model can be verified against previous stages and validated against what the modeller knows of the real world and what they want to discover. It gives them some protection against unnoticed assumptions (or missing information) as they move from real-world terms into multi-agent systems terms and then into computer program terms. Below and in (Alexander, 2007) a range of cross-checks are defined that are particularly appropriate to SoS hazard analysis models.

The modeller also needs to ensure that their model is able to answer questions about their initial concerns about how the SoS can cause harm – these may be concerns about accidents, root cause deviations or causal pathways between the two.




Based on the above, a specific process can be defined:

1. Acquire a MODAF source model for the system
2. Create a list of safety concerns for the system
3. Transform the source model into a suitably complete domain model
   a. Add information
   b. Cross-check domain model against source model and concerns
4. Transform into a multi-agent design model using the Prometheus agent modelling process
   a. Embed possible deviations in the model (as options)
   b. Cross-check design model against domain model and concerns
5. Transform into an implementable operational model
6. Implement in an existing modelling framework/engine
   a. Cross-check implementation against design model
   b. Cross-check implementation against domain model and concerns
7. Carry out additional cross-checks end-to-end

4.2. Moving from source to domain model

The language assumed here for the source model is MODAF (Ministry of Defence, 2012), which is now widely used within the UK Ministry of Defence (and beyond, in its DODAF, TOGAF and CINDAF forms). MODAF can be expressed in UML, using the OMG's Unified Profile for DODAF/MODAF (UPDM) specification (Object Management Group, 2009), and is supported by many UML and Enterprise Architecture tools. It may allow modellers to start with good initial detail of the SoS, and it provides a shared vocabulary that the simulation modeller can use to communicate with a wide range of domain experts.

MODAF does have problems. The quality of existing MODAF models varies widely – some are very well developed, but many have some information missing. Experience with DODAF has been similar (Zinn, 2004). Unless it was specifically developed for safety purposes (unlikely, since there is no established method of using MODAF for safety) it is not likely to have all the information needed for safety. In any case, MODAF was not designed for creating simulations and so lacks much of the information that is needed for them (Mittal, 2006; Zinn, 2004).

The problems with MODAF can, however, be resolved. In progressing from source to domain model, modellers can add extra information. There is an infinite amount they could add, and no shortage of sources – they need to add exactly that which will benefit them when doing hazard analysis. To this end, they should identify a number of concerns that particularly matter to them. Concerns may be about specific accidents (e.g. a type of collision) or about causal mechanisms (e.g. the consequences of radio failure). Once they have concerns, they can identify the information that they need in order to address them adequately in their simulation. They cannot reasonably aspire to completeness of concerns, but identifying key concerns allows them to check that they have covered known problem areas. Table 1 gives examples of some concerns for a hypothetical military scenario. (The "OV-x" codes in the rightmost column refer to specific diagram types within MODAF ("products" in MODAF terminology) – see (Ministry of Defence, 2012) for full descriptions.)

Table 1
Example concerns (ID; Type; Description; Model significance; MODAF representation).

1. Accident: Collision between a helicopter and a UAV. Model significance: helicopter spatial position at different times, UAV spatial position at different times. MODAF representation: helicopter movement indicated by 'Transport Special Forces' activity in OV-5 and OV-6c; UAV movement indicated by 'Patrol Area' activity in OV-5 and 'send patrol area' message in OV-6c.
2. Accident: Artillery hits special forces. Model significance: helicopter movement, landing, and disembarking of troops. MODAF representation: artillery fire, helicopter movement and disembarking are all nodes in OV-5.
3. Deviation: Unreliable sensing. Model significance: errors in sensing, actions explicitly based on sensor data. MODAF representation: act of sensing not represented, but use of sensor data shown in needlines 2-5 in OV-2 and in messages 'send enemy position' and 'area clear' in OV-6c.
4. Deviation: Artillery inaccurate. Model significance: position and time of artillery fire, variation in position actually hit. MODAF representation: not represented.
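
One convenient way to keep such concerns checkable through the later modelling stages is to record them as structured data; the sketch below is an assumption about representation rather than part of SimHAZAN itself, and uses Python records for two of the concerns in Table 1:

    from dataclasses import dataclass

    @dataclass
    class Concern:
        ident: int
        kind: str                 # "accident" or "deviation"
        description: str
        model_significance: str   # what the simulation must represent to address it

    concerns = [
        Concern(1, "accident", "Collision between a helicopter and a UAV",
                "Helicopter and UAV spatial positions at different times"),
        Concern(3, "deviation", "Unreliable sensing",
                "Errors in sensing; actions explicitly based on sensor data"),
    ]

    def uncovered(concerns, covered_ids):
        # Concerns with no traceable representation yet - candidates for added detail.
        return [c for c in concerns if c.ident not in covered_ids]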

Concerns can be identified based on previous experience with similar systems, or by other hazard analysis techniques. Redmond's technique (Redmond et al., 2008) may be particularly appropriate because of his emphasis on estimating the probability of hazards.

Where there is uncertainty about information, there is a risk that the wrong information may cause analysts to miss hazards or causes. One option is to use the worst-case value; if there are multiple bad cases then modellers can provide multiple bad values via deviations (see Section 4.3). This is another strength of automated methods; they are less constrained by human effort costs, so analysts can explore these plausible variations.

The open-ended scenarios that modellers are likely to start with will be too broad to be effectively simulated. They must therefore simplify them into simpler vignettes. A vignette is defined in (Defence Modelling and Simulation Office, 2010) as "A self contained portion of a scenario". Modellers particularly need vignettes that describe a simple sequence of events (under the normal course) that will then deviate when they introduce deviations to the model elements.

When considering what vignettes to model, modellers can look at different axes of variation. Table 2 gives some examples. Obviously, completeness is a problem – modellers cannot provide a complete set of vignettes (it would be infinite) so they have to prioritize based on the situations that projections suggest will be encountered most often.

Table 2
Example vignette aspects (Category: Aspects).

Mission objectives: measures of effectiveness, intelligence available.
Terrain: effect on movement, sensing, and communication.
Peer/neutral entities: allied forces, civilians (including settlements).
Threats: number, type, objectives, capabilities, and prior intelligence of SoS configuration and capabilities.

4.3. Turning the domain model into a design

At this stage, the model is still in terms of the source domain – in terms of the real world. The modeller now needs to transform that into agent terms. As noted earlier, MODAF models do not provide enough information for simulation modelling (Mittal, 2006) and even within the limits of the notation are often not fully populated; this has been the authors' experience, and that of Zinn (2004). Modellers will therefore need to add additional detail. To achieve this, they can use the Prometheus agent modelling method (Padgham and Winnikoff, 2004). Prometheus has a strong track record for general agent modelling, and has previously been used for simulation modelling (Ronald et al., 2007).

Prometheus is particularly appropriate because it uses a Belief–Desire–Intention (BDI) approach to modelling as proposed by Bratman (1987). BDI has two advantages. First, it has been claimed that BDI models, and the log output that they naturally produce, are easier for humans to comprehend than most rival models (McIlroy and Heinze, 1996). This may make it easier for human analysts to gain insight from the study of logs or log fragments. Second, and more specifically to the current purpose, BDI models support the agent tracing techniques discussed in Section 5. Prometheus has weaknesses; for example, it does not have a standard way to represent the fallible sensors and actuators of embodied agents. It does, however, provide sufficient expressive power to allow modellers to represent such failure behaviour by hand.

The design model is expressed in Prometheus's own modelling notation. To assess whether this is adequate for the expression of SoS models, the authors have assessed the Prometheus method and notation against the characteristics of SoS identified in (Alexander et al., 2004). Highlights from this are presented in Table 3 – for full details see (Alexander, 2007).

Table 3
Prometheus support for SoS characteristics (SoS characteristic; Can represent in Prometheus?; Explicit support?; Guides towards?).

System goals: system goals. Explicit support: yes. Guides towards: yes.
Multiple components: agent descriptors. Explicit support: yes. Guides towards: yes.
Component situational awareness/worldview: agent overview and capability overviews show belief sets held by individual agents. Explicit support: yes. Guides towards: partly – concerned with 'data', not with 'beliefs'.
Component goals: only implicitly, in the goal-seeking effects of plans, etc. Explicit support: no. Guides towards: no.
Component capability: action descriptors, percept descriptors, plan descriptors all capture aspects of what the agent is capable of. Explicit support: no – capabilities emerge when system is run. Guides towards: partly – have to describe actions, percepts and plans.
Geographical distribution: implicit in various places: agent, action, percept descriptors, etc. Explicit support: no. Guides towards: no.
(Further characteristics are covered in the full assessment in Alexander (2007).)

Fig. 3 shows part of the design model for the example SoS (specifically, the "System Overview Diagram") in the Prometheus notation. Fig. 4 is a key to the notation used. The element type "percept" is perhaps the only one that is not self-explanatory – it represents an instance of perception, either that some event has occurred or that some state now holds.

Fig. 3. System overview for example SoS.


Fig. 4. Prometheus notation key: goal, action, message, percept, role, agent, protocol, data store, capability.

Fig. 5. Agent parts and services (after Russell and Norvig (2002)): sensors, comms, actuators, and computation/thought (plans and situational awareness), situated in an environment alongside peer agents.


Analysts can be supported in their later exploration of deviations by modellers using a systematic method of deviating agents at the design model stage. Fig. 5 shows a conceptual model of agent parts and services, which identifies a wide range of agent parts that modellers can deviate. Using this, they can apply a method similar to Functional Failure Analysis (FFA), where they work over each component of each agent and apply a set of guide words to stimulate ideas of deviations. By doing this systematically for all agents, they can build up a large set of deviations for any scenario. The authors have also applied these generically to create a large checklist of generic deviations which will apply to many types of agent – examples of these can be seen in the "Gen. Deviation" column of Table 4.

Table 4
Example deviations (Agent; Part/service; Gen. deviation: Description).

UAV; Comms; Loss of transmission: loses ability to send out radio messages (hence, cannot update shared picture of the battle space).
UAV; Air obstacle sensor; Total loss of sensing: loses ability to detect other entities in the air (hence cannot adjust course to avoid collisions).
UAV; Situational awareness (SA)/worldview; SA coordinate system mismatched with peer entities: all entity updates to the shared picture appear to be 'moved' a fixed distance from their actual location in one of the cardinal compass directions.
Artillery; Artillery fire actuator; General loss of precision/control: effects of artillery fire are applied in a wider-than-usual area around the target location.
Artillery; Situational awareness (SA)/worldview; SA coordinate system mismatched with peer entities: all entities in the shared picture appear to be 'moved' a fixed distance from their actual location in one of the cardinal compass directions.
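
To give a flavour of how such a checklist can be generated systematically, the sketch below (illustrative only; the part and guide-word lists are assumptions rather than the authors' published sets) crosses agent parts/services with FFA-style guide words to produce deviation prompts for engineers to review:

    from itertools import product

    agent_parts = ["sensors", "actuators", "comms", "plans", "situational awareness"]
    guide_words = ["total loss", "intermittent", "delayed", "inaccurate",
                   "mismatched with peer entities"]

    def deviation_prompts(agents):
        # Each (agent, part, guide word) triple is a prompt, not a confirmed deviation;
        # engineers decide which prompts are credible for the SoS in question.
        for agent, part, word in product(agents, agent_parts, guide_words):
            yield f"{agent} / {part} / {word}"

    for prompt in deviation_prompts(["UAV", "Artillery", "Helicopter"]):
        print(prompt)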

One concern is deviation probabilities – as will be explained in Section 5, analysts need probabilities in order to prioritise simulation runs. There are a range of sources for these probabilities. They might be able to get them from equipment safety cases or safety assessments, although these may not be in suitable form. Other sources include component reliability models, human reliability modelling methods, and software integrity level assignments (although integrity levels, of course, represent ambitions rather than achievements).

The independence of deviations is a concern, although not as great as in predictive risk assessments – probabilities are used here purely to prioritise investigation, not to assess the final risk posed by the SoS. If there is a concern here, modellers can use beta factors (Smith, 2000) or similar (although it can be noted that beta factors were developed for failures, i.e. complete loss of function, and may not read across well to all kinds of deviations). A second approach is to model known common causes (e.g. radio jamming impairing communications) as "multideviations" that apply to all agents in an area.
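
A multideviation of this kind can be represented very simply; the following sketch (an assumed representation, not the authors' implementation) switches on the same deviation for every agent inside a jammed region:

    import math

    class Agent:
        def __init__(self, name, x, y):
            self.name, self.x, self.y = name, x, y
            self.deviations = set()

    def apply_multideviation(agents, deviation, centre, radius):
        cx, cy = centre
        for a in agents:
            if math.hypot(a.x - cx, a.y - cy) <= radius:   # inside the affected area
                a.deviations.add(deviation)

    agents = [Agent("UAV", 0, 0), Agent("Helicopter", 3, 4), Agent("Artillery", 40, 0)]
    apply_multideviation(agents, "loss_of_transmission", centre=(0, 0), radius=10)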

There is an assumption in the current work that all deviations are subtle; the agents within the scenario are not aware that the deviation is present. Future work (and more sophisticated models) could explore agent response to announced deviations.

As with the agents, modellers need to consider the possible deviations to vignettes. Some example sources of deviation are given in Table 5.

Modellers are often reluctant to assign probabilities to future scenarios, and there are good reasons to be wary of doing so (Ha-Duong, 2005). However, despite the increased exploratory power of simulation-based techniques compared to manual ones, total exploration of the space is unlikely to be possible, so some means of prioritisation is still necessary. Assigning probabilities to vignette deviations can help with this.

At this point, there are a range of cross-checks that modellers can carry out – these are shown in Table 6. A further significant cross-check at this point is to verify that the original concerns they identified back in Step 2 are still represented in the design model, either in the base form or in the deviations. See Table 7 for an example.

4.4. The operational model

Moving to the operational model, the cross-checks continue. This may be where the most compromises are made – the available simulation environments may not be a perfect fit for the specific design model. The modeller must judge whether the demanded compromises are too great. The authors have implemented SimHAZAN models using both the open-source simulation framework MASON (Luke et al., 2004) and the commercial simulation engine VR Forces (VT MAK, 2009). VR Forces notionally provides far greater support to the modeller, but many of its hard-coded assumptions (such as the homogeneity of behaviour of nodes of the same type) proved inconsistent with the design model, so workarounds were necessary.

Table 5
Example sources of vignette deviations (Aspect: Example).

Weather: heavy rain and clouds.
Terrain: more and higher peaks.
Peer/neutral agents: increased civilian traffic on roads.
Threat agents: man-portable anti-aircraft weapons supplemented with vehicle-mounted systems.
Mission parameters: maximum acceptable duration of mission shortened.


Table 6
Cross-checks between domain model and design model (Prometheus artefact; MODAF product: Check).

Prometheus architectural design artefacts:
  System overview; OV-1: all MODAF operational nodes represented by agents.
  System overview; OV-5: combination of protocols and individual messages could give rise to all interactions described in OV-5.
  System overview; OV-6c: combination of protocols and individual messages could give rise to all interactions described in OV-6c.
  Protocols; OV-2: protocol only exchanges data over connectivity identified in OV-2.
  Protocols; OV-3: protocol only exchanges data over needlines identified in OV-3.
  Message descriptors; OV-7: message descriptors consistent with message formats described in OV-7.
Prometheus detailed design artefact:
  Agent overview, functionality overview; OV-6b: can impose a set of states on the agent consistent with the operational node states described in OV-6b.

Table 7
Example cross-checks with concerns (Concern; Description: Representation in Prometheus model).

1. Collision between a helicopter and a UAV: helicopter has 'air move' action, and this is guided by a target location taken from the shared picture. UAV has 'air move' action.
2. Artillery hits special forces: artillery has 'long-range fire' action; helicopter has 'air move' and 'disembark infantry' actions. Artillery and helicopters both take their targets from the shared picture.
3. Unreliable sensing: helicopter and Artillery plans involve actions cued by the state of the shared picture, and the shared picture is built by the UAV based on the percepts that it receives.
4. Artillery inaccurate: artillery has 'long-range fire' action.
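
Some of these cross-checks lend themselves to simple automation. The sketch below (illustrative; the needlines and messages are invented examples) mechanises the Table 6 check that a protocol only exchanges data over connectivity identified in OV-2:

    # Connectivity (needlines) taken from the OV-2 product of the MODAF model.
    ov2_needlines = {("UAV", "Artillery"), ("UAV", "Helicopter")}

    # Messages exchanged by the protocols in the Prometheus design model.
    protocol_messages = [
        ("UAV", "Artillery", "send enemy position"),
        ("Helicopter", "Artillery", "request fire"),
    ]

    violations = [(s, r, m) for s, r, m in protocol_messages
                  if (s, r) not in ov2_needlines and (r, s) not in ov2_needlines]
    print(violations)   # messages with no supporting needline need review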


4.5. The implementation

Implementation should be the least demanding stage; although it may be labour-intensive (for larger models, it will require a team of experienced software engineers), it is standard software engineering. The earlier transformations should have led to a clear specification and design that is already expressed in the concepts that are used by the chosen simulation framework.

It is at this stage that the deviations previously described in plain text are turned into specific low-level behaviour, captured in the implementation as discrete agent behaviour variants that can be turned on or off by the simulation engine. The details of exactly how this is achieved will inevitably vary according to the decisions that were made at the operational model stage. If the preceding steps have been followed, this should be fairly straightforward, although review by domain experts may be particularly important here as there is likely to be less information in the source model to support these decisions. For the high-priority deviations that are identified in the list of concerns, the software engineers should check that the deviations as implemented capture all aspects of the deviation that are mentioned in the "Model Significance" column of the concerns table.
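
As an illustration of what such a behaviour variant might look like (a sketch under assumed names, not the authors' MASON or VR Forces code), each deviation can be a flag that changes one small piece of agent behaviour:

    class UAVAgent:
        def __init__(self, active_deviations=()):
            # Deviations are switched on per run by the simulation harness.
            self.active_deviations = set(active_deviations)

        def sense_air_obstacles(self, true_contacts):
            # "Total loss of sensing": the UAV perceives no other air entities.
            if "total_loss_of_air_sensing" in self.active_deviations:
                return []
            return list(true_contacts)

        def position_report(self, position):
            # "Loss of transmission": the shared picture is never updated.
            if "loss_of_transmission" in self.active_deviations:
                return None
            return ("position_update", position)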

With the implementation in place, it is possible to do a final cross-check: the simulation can be run to confirm that the events of the normal vignette (with no deviations) are consistent with those described in the domain model.

5. Finding hazards in SoS simulations

Although detailed and accurate SoS models are obviously valuable, it can be hard to make sense of the results when they are run. There are three key problems:

First, even with a modest number of parameters, the resulting parameter space is very large. Analysts will want to find 'all hazards and all causes', but they cannot exhaustively explore that space. The traditional way to deal with this problem is through experimental design techniques (Dewar et al., 1996). However, the techniques that Dewar identifies all make three assumptions: that the effect of each parameter on the safety-critical behaviour of the system is independent of the effect of the other parameters, that there is a monotonic relationship between the number of deviations imposed and the level of safety risk, and that small variations in parameters cause small changes in the output. None of these are necessarily true in the current case.

The second problem is that the output that results from the reasonable exploration of a realistically complex model will be huge – thousands or millions of run logs, each containing tens of thousands of entries. It is unrealistic to expect a human analyst to read such logs, let alone understand them.

The third problem is that, even when the model has been adequately explored and methods have been found to make the huge volume of output comprehensible to the analyst, it is not immediately obvious whether these results are valid, are artefacts of the analysis process, or are artefacts of the model itself.

To resolve these issues, analysts can start by exploring the simulation to find a functional mapping between deviations and accidents. They can then study those relationships and come to understand why they hold. The process is one of narrowing focus, following clues and leads, with the result that they focus their effort on the mechanisms that matter most. The proposed method is as follows:

1. Find accident situations
   a. Progressively search the powerset of deviations
      i. Limit search by probability estimates
   b. Log deviations and accidents for each run
2. Learn hazard rules
   a. Use machine learning to learn rules relating deviations to accidents
3. Uncover causal chains
   a. Use an agent tracer to explain an example run for each rule
4. Assess fidelity and validity
   a. Use conventional safety engineering techniques

The details of this method are described in the following sections.

5.1. Finding accident situations

Analysts can explore the space in a simple fashion via a breadth-first search of all known deviations (agent and vignette) for each vignette – they try all the single deviations, then all the pairs, and so forth. They can limit the breadth-first search by simple number of deviations (e.g. do not consider combinations of more than three deviations) or by estimates of probability (e.g. do not consider combinations less likely than 1 x 10^-11 per run). They can generate deviation combinations by deriving the powerset (the set of all subsets of the set of deviations) then selecting only those subsets that meet the criteria in the previous sentence. By using the lowest-order subsets first, this can provide a progressive exploration of the parameter space, in order of the combinations that the analyst estimates are most likely.

Even if they limit the search as discussed in the previous para-graph, they will be carrying out large numbers of runs and gener-ating large amounts of data. The resulting data needs to beprocessed in some way in order to make it intelligible.
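As an illustration of this kind of search, the sketch below enumerates deviation subsets in increasing order of size and discards any combination whose estimated per-run probability falls below a floor. It is a minimal sketch, not part of the SimHAZAN toolset: the deviation names, the flat probabilities, and the run_simulation placeholder are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical per-run probabilities for each deviation (flat values here;
# in practice these would come from the deviation analysis).
deviations = {
    "RADIO_CANT_SEND_inf1": 1e-3,
    "RADIO_SEND_DELAY_marv1": 1e-3,
    "COORD_SKEW_N_command": 1e-3,
}

PROBABILITY_FLOOR = 1e-11  # discard combinations less likely than this per run
MAX_ORDER = 3              # optional cap on simultaneous deviations

def probability(combo):
    """Joint probability of a combination, assuming independent deviations."""
    p = 1.0
    for name in combo:
        p *= deviations[name]
    return p

def candidate_combinations():
    """Yield deviation subsets in increasing order of size (lowest-order first),
    skipping any whose estimated probability falls below the floor."""
    for order in range(0, MAX_ORDER + 1):
        for combo in combinations(sorted(deviations), order):
            if probability(combo) >= PROBABILITY_FLOOR:
                yield combo

for combo in candidate_combinations():
    # run_simulation is a placeholder for invoking the simulation engine with
    # the given deviations switched on, and logging deviations and accidents.
    pass  # run_simulation(combo)
```

The order-zero combination (no deviations) is included, so the normal vignette is exercised alongside the deviant runs.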

5.2. Learning hazard rules

Detecting accidents and incidents is straightforward, because the simulation code can be instrumented so that such events generate distinctive entries in the logs. Similarly, the software can record what deviations were imposed on each run. Given this, the analyst needs to find general rules that map deviations to accidents.

In the current work, machine learning techniques are adopted for this purpose. The inspiration for this was the work of Platts et al., in which rules are learned which relate the behaviour of an unmanned aircraft to success in a particular mission (Platts et al., 2004). The approach described in this paper is similar in that it involves learning rules which relate node behaviour to unwanted hazardous consequences.

The task of machine learning can be viewed as one of function approximation from a set of training instances expressed as input-output pairs. Given a function specification – a set of named input parameters ('features') and a particular form of output value – the algorithm learns the relationship between the features and the output. The algorithm produces a model that is much simpler than the training data itself, but that (ideally) captures all the information that is contained in the data. The model can then be interrogated – if the algorithm has learned well, the model will be able to provide the appropriate output for the input supplied. A wide variety of machine learning algorithms have been proposed; a survey can be found in (Mitchell, 1997).

In SimHAZAN, the input to the learner is a set of descriptions of runs of the simulation model. The output from the learner is therefore a simplified proxy for the simulation model, a proxy with a much simpler internal mechanism.

For current purposes, the function parameters are the parameters of the simulation and the output values are the consequences within the simulation. All the parameters used in the current work are deviations that are applied to the model, and the target function is the set of accidents that occurs during the simulation run. The output of the learning algorithm is a set of rules that describes the relationship between deviations and accidents. For example, a rule might be "Aircraft 1 lost_radio_comms causes aircraft 1 to collide with aircraft 2".

SimHAZAN requires an algorithm that can turn discrete, noisy data into human-comprehensible rules. Given this, an appropriate choice is the C4.5 decision-tree learner. Mitchell (1997) notes that "decision tree learning is one of the most widely used and practical methods for inductive inference", and C4.5 is a widely-used example of a decision tree learning algorithm. C4.5 is a well-established algorithm, so this paper will treat it as a black box; full details (and source code for an implementation) can be found in (Quinlan, 1993).

For current purposes, it is useful to learn rules for each accident type separately, providing the set of runs in which that accident happened, and the set of runs in which it did not – this provides the contrast needed for the learner to discriminate between accident and non-accident cases. Each run is described by the set of deviations that were present. The algorithm will then produce a decision tree describing how deviations appear to cause accidents. That tree can be split into a set of rules, with one rule for each path from the root to a leaf.
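The sketch below illustrates this step in outline. The original work used C4.5 (via a thin wrapper around WEKA); here scikit-learn's DecisionTreeClassifier is assumed as a readily available stand-in, and the run descriptions are invented purely to show the shape of the data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Each run is described by which deviations were active (1) or not (0);
# the label records whether a particular accident type occurred in that run.
# Illustrative data only: feature names and runs are placeholders.
feature_names = ["radio_cant_send_inf3", "radio_send_delay_inf1", "coord_skew_n_command"]
runs = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
accident_occurred = [1, 1, 0, 0]  # one learner is trained per accident type

# C4.5 was used in the original work (via WEKA); the CART-style learner here
# is a stand-in that similarly produces a human-readable decision tree.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=1)
tree.fit(runs, accident_occurred)

# Each root-to-leaf path of the printed tree corresponds to one candidate
# rule relating deviations to the accident.
print(export_text(tree, feature_names=feature_names))
```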

Once a set of rules has been produced, they must be prioritised for investigation. The most common way to do this in safety engineering is by risk, a measure produced by combining probability and severity. The probability of a rule can be calculated in the same way as the probability of a run. Severity can be classified into categories based on a description of associated accident consequences – one option would be the MIL-STD-882D 'mishap severity categories' of 'catastrophic', 'critical', 'marginal' and 'negligible' (US Department of Defense, 2000). Given high uncertainty in the assignment and combination of probabilities, however, it could be suggested that all high-severity runs should be investigated.
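A minimal sketch of that prioritisation follows, assuming each learned rule has been annotated with an estimated probability and a MIL-STD-882D severity category; the rules, probabilities and severities shown are placeholders, not outputs of the case study.

```python
# Rank order for the MIL-STD-882D mishap severity categories.
SEVERITY_RANK = {"catastrophic": 4, "critical": 3, "marginal": 2, "negligible": 1}

rules = [
    {"rule": "failure_radio_cant_send_inf3 => friendly-fire accident",
     "probability": 1e-3, "severity": "catastrophic"},
    {"rule": "radio_send_delay_inf1 AND coord_skew_n_command => mid-air collision",
     "probability": 1e-6, "severity": "critical"},
]

# Order rules by severity first and probability second, so the analyst looks at
# the highest-risk hypotheses first. With very uncertain probabilities, one
# might instead simply investigate every high-severity rule.
prioritised = sorted(rules,
                     key=lambda r: (SEVERITY_RANK[r["severity"]], r["probability"]),
                     reverse=True)

for r in prioritised:
    print(f'{r["severity"]:>12}  p={r["probability"]:.0e}  {r["rule"]}')
```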

One complication is that this may result in one deviation rule that actually covers several causal paths – there is no guarantee that the simulation has single unique responses to its inputs. If the simulation is deterministic then this uniqueness will hold, but the problem can still arise as an artefact of the learning process – the learner may collapse several distinct relationships into one rule. This is an important area for future work (see Section 8.1).

5.3. Uncovering causal chains

Once the analyst has a prioritised list of deviation rules, they need to explain them. This could be attempted by watching animations of simulation runs, or by reading logs. These strategies are popular in the literature, but they are inadequate when models are complex. The complexity of a simulation can make it particularly hard to even understand why it gave the result that it did – Humphreys calls this property of complex simulations "epistemic opacity" (Humphreys, 2004). A better method is therefore needed.

Lam and Barber have developed a tool-supported approach to the comprehension of agent systems which they call 'agent tracing' (Lam and Barber, 2005). The core of the approach is that, given a log of the events that occurred in a single simulation run and an identified event of interest within that run, the tracer tool attempts to explain why that event happened in terms of its immediate causes. Those causes can each then be explained in the same way, and the process repeated until the final explanation is in terms of the initial state of the simulation run or some 'external' events that occurred. This explanation, complete or partial, can be expressed as a causal graph leading to the event to be explained.

A simple example of such an explanation is "UAV 1 received a percept indicating the location of an enemy unit. This caused it to form a goal of destroying that enemy unit, which it selected the 'air strike' plan to resolve, and as a consequence of that plan the UAV conducted the 'attack' action using a laser-guided bomb".

The tool produces explanations by using what Lam and Barber call 'background knowledge'. This is a set of possible causal relationships between events based on their types and properties. When the tool tries to explain an event, it reviews these rules to find which previous events could have caused it. For example, if the simulation model has the rule "if two entities move into the same space at the same time, they collide", then the tracer can have the rule "if two entities collided, then that can be explained by them moving into the same space at the same time". If the tracer also has rules that can explain entity movement in terms of their goals, beliefs and plans, then it may be able to construct a useful causal graph for the collision. For example, one of the entities might have had a plan to survey the location and a belief that the location was unoccupied, and in its behaviour rules this combination was sufficient to cause the movement.

The tracer, in effect, automates common search rules that would be applied by humans. In the authors' current implementation, the tracing rules have to be written by a programmer (although there is a useful class library of common constructs). An alternative implementation could express them in a domain-specific language, or even infer them automatically by comparing many logs (Lam and Barber, 2005).
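To make the idea concrete, the toy sketch below shows a much-simplified backward-chaining tracer of the kind described above. It is not the Tracer++ implementation or the Lam and Barber tool: the event fields, the background-knowledge table and the log contents are all assumed for illustration, and a real tracer would also match locations, message identifiers and so on.

```python
from dataclasses import dataclass

@dataclass
class Event:
    tick: int
    agent: str
    kind: str       # e.g. "collision", "move", "plan", "percept"
    detail: str

# Background knowledge: for each event kind, which kinds of earlier event can
# explain it, and whether the cause must come from the same agent.
BACKGROUND = {
    "collision": [("move", False)],   # explained by the colliding agents' moves
    "move": [("plan", True)],         # explained by the same agent's plan
    "plan": [("percept", True)],      # explained by the same agent's percept
}

def explain(event, log, depth=0):
    """Recursively print candidate causes of `event` as an indented causal tree."""
    print("  " * depth + f"[{event.tick}] {event.agent}: {event.kind} {event.detail}")
    for cause_kind, same_agent in BACKGROUND.get(event.kind, []):
        for candidate in log:
            if candidate.tick >= event.tick or candidate.kind != cause_kind:
                continue
            if same_agent and candidate.agent != event.agent:
                continue
            explain(candidate, log, depth + 1)

log = [
    Event(100, "UAV1", "percept", "enemy sighted at a location"),
    Event(101, "UAV1", "plan", "survey the location"),
    Event(120, "UAV1", "move", "into the location"),
    Event(120, "inf2", "move", "into the location"),
    Event(121, "UAV1", "collision", "with inf2"),
]
explain(log[-1], log)
```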

5.4. Assessing fidelity, validity and completeness

It is extremely difficult to argue the fidelity of a complex system simulation (see (Paolo et al., 2000) for a particularly aggressive statement of this idea). Similarly, it is clear that simulators used for training can lead users to develop dubious intuitions about the real system (Arthur et al., 1999), and it is possible that SimHAZAN simulations will lead engineers towards incorrect theories about how the SoS operates. Simulations are not, however, the only models with these kinds of problems. All forms of model-based analysis must face this challenge, and the more automated the technique the larger the problem looms (Lisagor et al., 2010).

The primary concern for the current work is whether the causal chains found are present in the real SoS. SimHAZAN performs what Dewar et al. describe as 'weak prediction' – for such prediction "subjective judgement is unavoidable in assessing credibility", and when such a simulation produces an unexpected result "it has created an interesting hypothesis that can (and must) be tested by other means" (Dewar et al., 1996). In other words, when a simulation reveals a plausible causal chain, other, more conventional analyses must be carried out to determine whether it is credible in the real SoS. The role of the simulation analysis is to narrow down a huge analysis space into one that is manually tractable.

It is well known that manual hazard analysis techniques do not find all hazards in conventional systems – see, for example (Suokas and Kakko, 1989). Given that, SimHAZAN's role is to provide plausible hypotheses about causal chains that might otherwise have been missed. It is a complement to traditional manual analyses, and requires engineering effort to investigate the hazard hypotheses it raises. Given that, many errors in the simulation model can be tolerated, provided that many of the causal hazard chains that are predicted by the model turn out to exist in reality. Ultimately, this is a return-on-investment question: if SimHAZAN leads engineers to find some accident chains, and does not waste too much of their time investigating chains that cannot occur, it has potential value. If those accident chains were not found by other means, then it has definite value.

It would be naive to suggest that any technique can explore the entire state space of the simulation model. Indeed, the approach proposed here does not try to do this. There are techniques for exhaustive analysis of computer models, such as model checking, but they place serious limitations on the models they can analyse. Similarly, it is unlikely that SimHAZAN will scale to arbitrarily complex models – it will always be possible to create simulations that cannot be adequately understood or analysed. It is reasonable to expect that SimHAZAN will find some hazards and causes that would be missed by manual techniques, and that it will be more practical, flexible and applicable to SoS than rival automated methods.

It is very difficult to argue the completeness or adequacy of hazard analysis; there is no standard way to determine when all causal chains have been found. The As Confident as Reasonably Practicable (ACARP) principle introduced in (Hawkins and Kelly, 2009) may be valuable here. As with software testing, simple coverage metrics may not be very useful (Yang et al., 2011) – what matters is how good the technique is at finding hazards and their causes. This is an empirical matter.

5.5. The authors’ implementation of the SimHAZAN tools

The authors have implemented a prototype version of the tools needed for SimHAZAN. This consists of a toolchain that combines a simulation engine ("Sim8"), a learning tool (a thin wrapper around WEKA), and a tracing tool ("Tracer++"), controlled by a set of scripts that minimise the need for manual intervention. The simulation and Tracer++ were implemented in Java, with the former built on the MASON simulation framework (Luke et al., 2004) and the latter using the Prefuse library to render the causal graphs. The control scripts were written in a mixture of Python and Jython, with the latter being used when access to the underlying Java objects was needed. The software was developed using a mostly ad-hoc process, although test-driven development (Beck, 2002) was used systematically for the simulation engine, the specific simulation models, and the Tracer++ tool.

6. Case study

The authors have conducted a case study using SimHAZAN, based on an example source model that was provided by industry. The SoS consists of a small military unit where a central command node manages a set of UAVs that provide surveillance and reconnaissance over a large area. Each UAV unit consists of a Mobile Air Recon Vehicle (MARV) and the associated ground control station (MARG). The information produced is used by ground-based infantry and special forces to locate and destroy enemy troops. The activity of the various elements in the case study SoS is coordinated via a Combined Operational Picture (COP), which is primarily created by fusing data from UAV sensors. The COP could also include data from other sources, such as a priori plans and conventional intelligence. The SoS of concern will be referred to as "blue force"; the enemy they are hunting is "red force".

A graphical representation of the case study SoS can be found in Fig. 6 in the form of a MODAF OV-1a (High Level Operational Graphic), although it can be noted that this includes some extra force elements, such as naval assets, that will not be considered here. The figure has been redrawn to meet this journal's artwork standards and to replace specific node names with generic ones, but is otherwise the same as the one received by the authors.

The case study is appropriate for proof-of-concept of this work. It is an SoS as defined in Section 1 – it consists of multiple complicated and geographically distributed components, each of which has its own goals (such as self-preservation) and a degree of autonomy, but which are tied together by a common goal (to provide surveillance and reconnaissance while maintaining the integrity of the unit). The overall system is quite complex – overall system behaviour will be the product of multiple agents each with several behaviour rules. With limited communication (which is assumed here), the behaviour of agents over time will depend on their local situational awareness and world view. The system exhibits several possible accidents (including mid-air collisions and friendly fire) as well as other threats (failure to defend from enemy action). The source model has previously been used in several other studies including the BAE Systems NECTISE research project. Although not a currently deployed military SoS, it can be considered representative of such.

Section 6.1 describes how the SoS was modelled, and Section 6.2 describes the analysis performed.

6.1. Modelling

Modelling started from the source model discussed above. Very soon a deficiency was identified – although there was an obvious effector in the SoS (the infantry) there was nothing in its behaviour description to indicate when it would actually attack something; the SoS was set up to provide information, but there were no actions that the information could lead to. There was therefore no way for the information provided by the SoS to lead to an accident (although there was the potential for one based on vehicle collisions that were not information-related).

[Fig. 6. Case study SoS OV-1a, after Adcock (2006).]

While creating the domain model, therefore, the source model was extended by placing it in a wider mission context. How necessary this will be in practice will depend on the quality of the MODAF models available. In the authors' experience, the quality and completeness of MODAF models has improved over the few years of its existence – recent models have been more complete.

An armed UAV (an Unmanned Combat Aerial Vehicle (UCAV)) was added, and protocols were defined for interactions between this and the rest of the SoS. The scenario was extended to cover infantry acting on the information provided. The "enemy" were diversified into enemy infantry (vulnerable to friendly infantry) and enemy tanks (a serious threat to friendly infantry, only vulnerable to the UCAV). The emphasis here was on keeping this part of the mission consistent with the information already in place – for real applications, there are a range of sources suggested in (Alexander, 2007) that could be used to validate these decisions.

Eight safety concerns were identified for the SoS and situation; some of these are shown in Table 8. Four of them were prioritised, and a single vignette was created that exhibited all four.

Table 8
Excerpt from the case study concerns table.

ID | Type | Description | Model significance | MODAF significance
1 | Accident | Enemy infantry destroys infantry | Enemy position at different times, infantry position and movement, relative effectiveness of infantry weapons, armour and tactics used by the enemy | Not represented
2 | Accident | Enemy tank destroys infantry | Tank position at different times, infantry position and movement, relative effectiveness of infantry weapons, armour and tactics vs. tanks | Not represented
...
5 | Deviation | Limited radio communications bandwidth | Availability of bandwidth, consumption of bandwidth per message, message sending by time | Needlines are in OV-2b, communications are implicit in OV-5 and OV-6. System-level communications are in SV-1 and SV-2
...

Using the (expanded) domain model as an input into the Prometheus process, a Prometheus model of the SoS was produced. This was a fairly straightforward and self-contained process, although there were a number of decisions to make. For example, the source model contained references to several support and control nodes but there was little information about the distinctions between these systems and roles. The design model therefore combined the several command and control nodes (including the remotely located 'home' command centre) depicted in the various MODAF artefacts into a single command node, and the various surveillance-coordination nodes into a single 'ISTAR' node.

These decisions were forced by lack of information – the remote command, for example, did not feature in OV-5 or any of the other behavioural descriptions. The decisions did have the advantage of focussing the model on those nodes that were physically mobile and active in the field.

The top-level overview diagram produced here is shown in Fig. 7 (see Fig. 4 for a key to the symbols). A full set of Prometheus artefacts for this model can be found in Appendix C of (Alexander, 2007).

[Fig. 7. System overview diagram for the case study SoS.]

The deviation approach discussed in Section 4.3 was applied to the design model, producing a set of 176 deviations in total. A simple approach to probability assignment was used, arbitrarily assigning a flat 1 × 10⁻³ probability of having each deviation in any given run.

Once the design model was complete, it was implemented manually in the Sim8 SoS simulation engine. Because of time limitations, only a limited set of deviations was implemented – these are listed in Table 9. This limited set also helped to constrain the computation required, although of course it limited the range of hazards that could be uncovered. There was no particular mechanism used in the choice, but the set covers all agents with several deviations for each.

Table 9
Implemented deviations.

ID | Code | Description
A1 | RADIO_CANT_SEND | The entity cannot send radio messages
A2 | RADIO_CANT_RECEIVE | The entity cannot receive radio messages
A6 | RADIO_EXTRA_BANDWIDTH | All radio messages sent by the entity consume extra bandwidth
A7 | RADIO_SEND_DELAY | All radio messages sent by the entity are delayed
A12 | MAX_SA_ENT_2 | The entity only remembers details of the last two entities it sensed or was told about
A14 | COORD_SKEW_N | When storing the location of another entity, the entity shifts its location 1 square to the north
IN16 | INF_DOUBLE_SPEED | The infantry unit moves at twice its standard speed
IN28 | INF_FIND_TARGET_DONT_WAIT | After detecting a potential target, the infantry unit continues with its mission without waiting for reconnaissance support
IN38 | INF_NO_CAS_LOCK | After calling for combat air support (CAS), the infantry unit moves to engage the target
MG6 | MARG_PLAN_RECON_TIME | The time taken for the MARG to plan a reconnaissance mission for the MARV is increased from 2 ticks to 16 ticks

Throughout this process, a range of crosschecks were applied as discussed in Section 4.5. One example here is the relationship of the concerns (from the early stage of the domain model) to the operational model and the implementation – see Tables 10 and 11.

One problem, throughout the operational model and implementation stage, was the assumption by the domain model that there was a single agent of each type. Although it is possible to capture this in MODAF (it provides UML-style cardinality indicators on class relationships) this does not propagate well into the Prometheus artefacts. The dividing line between static properties (node capabilities, procedures and training) and temporary role assignments is not clear. This may be a hangover from the software engineering origins of the diagram types used in MODAF and Prometheus – in conventional software engineering, the distinction just discussed does not apply. This is a route for further extension of SoS modelling languages.

Table 10
Concerns vs. operational model.

Concern | Support in Sim8
1 | Supports small arms fire
2 | Detects potential collisions
...
5 | Engine supports limited bandwidth. Messages are delayed if insufficient bandwidth. Bandwidth costs can be varied
...

Table 11
Concerns vs. implementation log output.

Concern | Log output
1 | All attacks, all agents destroyed
2 | All collisions
...
5 | Delayed messages logged (at point of both sending and delivery)
...

Table 12
Rules highlights.

Accident | Cases | Rule | Probability
cM1UC1 | 421 | deviation_radio_send_delay_inf1 AND failure_radio_cant_send_inf2 | 1 × 10⁻⁶
cM1UC1 | | !deviation_radio_send_delay_inf1 AND !failure_radio_cant_send_inf2 AND deviation_radio_send_delay_marv1 AND deviation_coord_skew_n1_command | 1 × 10⁻⁶
cM1UC1 | | !deviation_radio_send_delay_inf1 AND deviation_radio_send_delay_marv1 AND deviation_inf_dont_wait_for_cas_inf3 AND deviation_coord_skew_n1_command | 1 × 10⁻⁹
TI1 | 34648 | deviation_inf_find_target_dont_wait_inf1 | 1 × 10⁻³
TI1 | | !deviation_inf_find_target_dont_wait_inf1 AND deviation_inf_dont_wait_for_cas_inf1 | 1 × 10⁻³
TI1 | | !deviation_inf_find_target_dont_wait_inf1 AND !deviation_inf_dont_wait_for_cas_inf1 AND deviation_max_sa_ent_inf1 AND deviation_max_sa_ent_command AND deviation_coord_skew_n1_command | 1 × 10⁻⁹
UC1I2 | 2067 | failure_radio_cant_send_inf3 | 1 × 10⁻³
UC1I2 | | failure_radio_cant_send_inf3 AND deviation_radio_send_delay_inf1 | 1 × 10⁻⁶
UC1I2 | | failure_radio_cant_send_inf3 AND deviation_radio_send_delay_command | 1 × 10⁻⁶

6.2. Analysis

Given the implementation and the set of deviations that were actually implemented, the simulation was run for many combinations of deviations. In order to limit the computation involved, a maximum of three simultaneous deviations was allowed per run. This was based on the 1 × 10⁻³ probability assigned to each deviation, and the 1 × 10⁻¹¹ "improbability of failure" level used in the nuclear industry. Scenarios deemed to be less likely than the improbability level are not explored in detail; rather than dealing with their consequences, engineers rely on them never occurring (Ammirato et al., 2004). Performing the runs produced, as expected, a number of simulated accidents.
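As a concrete check on that cutoff, the short sketch below derives the maximum number of simultaneous deviations from a flat per-deviation probability and a per-run probability floor. The variable names are illustrative; the values are simply those quoted above.

```python
import math

# With a flat per-deviation probability and a per-run "improbability" floor,
# the largest number of simultaneous deviations worth exploring satisfies
# p_deviation ** n >= floor.
p_deviation = 1e-3
floor = 1e-11

max_order = math.floor(math.log(floor) / math.log(p_deviation))
print(max_order)  # -> 3: four deviations (1e-12 per run) would fall below the floor
```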

Applying the learner to the accident set produced a large number of candidate accident rules, including some rules for every accident that occurred. Table 12 gives the three most probable rules for a subset of the accidents.

The rules were explored and evaluated as described in Sections 5.3 and 5.4. For each, an example run that matched the rule was selected, and the tracer was applied to it. For the most interesting rules (from an illustrative perspective) detailed explanations were developed, which are discussed below.

6.2.1. Accident cM1UC1

Accident 'cM1UC1' describes a collision between the MARV and the UCAV. This involves material loss, so is safety-relevant by most definitions, but it is not as important as one involving human life. This accident was caused by the MARV not having a collision-avoidance system – if they are very close and the UCAV moves first, then the MARV may move into it and thereby collide. It is not tremendously interesting because it is possible that it could be found manually – it is not unreasonable to ask "shouldn't the MARV have collision avoidance?"

There is also a risk that this accident is an artefact – it depends on position and time being discrete in the simulation, which was a simplification made in the operational model for efficiency and ease of implementation.

6.2.2. Accident TI1

Accident 'TI1' describes a situation where an enemy tank destroys infantry unit 1. This is not a standard "safety" event – it is not entirely an accident – but it is unwanted from the perspective of blue force so it is of interest. The accident occurs because the infantry unit does not wait when it calls for reconnaissance. This is a simple linear interaction (the linear sequence of events is disrupted) and could reasonably have been found by manual analysis.

This accident is interesting in that the SimHAZAN toolset found it, but the tracing tool could not help to explain it. The accident was caused by an "unevent" in the terms used by Why-Because Analysis (Ladkin and Loer, 1998) – the infantry didn't wait when it needed to. The tracing tool cannot provide explanations in terms of unevents – it deals solely with events that did occur. Once a significant unevent is known, it is theoretically possible to go back and instrument the simulation model to monitor that unevent (e.g. detect a failure to wait, and log when this occurs) but the authors have not explored this.

6.2.3. Accident UC1I2

The final example, accident 'UC1I2', is where the UCAV destroys infantry unit 2 with an air strike. This is a classic friendly-fire accident, and the accident sequence here is quite involved.

Initially, the tracing tool displayed only the trace shown in Fig. 8. This explains that the UCAV performed the air strike because it had finished its previous objective (fly to a waypoint) and had an active plan to carry out an air strike at that location. It offers no explanation, however, for why the UCAV had that plan active, or why the infantry unit was located there when the air strike landed.

[Fig. 8. Original trace for accident UC1I2.]

Additional logging was added to the simulation model so that individual (single grid square) moves were tracked, and a tracing rule was defined that would relate those actions to the previous 'move to location' action. This made it possible to write a rule that could explain air strike casualties in terms of the movement of the recipient as well as the actions of the attacker:

An air strike death is caused by air strike X iff:
    Air strike X was at the location of the death
    AND air strike X was performed by the agent who caused the death
    AND air strike X is the most recent air strike at the location

An air strike death is caused by move action X iff:
    Move action X was the last one performed by the dead agent
    AND move action X took the agent into the location that it died at
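Expressed as code, the two rules above might look like the following sketch. The Event fields and the log structure are assumptions made for illustration; this is not the rule format actually used by the tracing tool.

```python
from dataclasses import dataclass

@dataclass
class Event:
    tick: int
    kind: str        # e.g. "air_strike", "move", "death"
    agent: str       # who performed (or, for a death, suffered) the event
    location: tuple  # grid square
    by: str = ""     # for a death: the attacking agent

def caused_by_air_strike(death, strike, log):
    """Rule 1: the death is explained by `strike` if it was the most recent
    air strike at the death location and was performed by the killer."""
    strikes_here = [e for e in log
                    if e.kind == "air_strike" and e.location == death.location]
    return (strike in strikes_here
            and strike.agent == death.by
            and strike.tick == max(e.tick for e in strikes_here))

def caused_by_move(death, move, log):
    """Rule 2: the death is explained by `move` if it was the dead agent's
    last move and took it into the square where it died."""
    moves = [e for e in log if e.kind == "move" and e.agent == death.agent]
    return (move in moves
            and move.tick == max(e.tick for e in moves)
            and move.location == death.location)
```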

The resulting trace, shown in Fig. 9, shows the infantry unit is moving to attack enemy3, who was located at (36,9), i.e. the location that was hit by the air strike. It would follow that the UCAV was attempting to destroy enemy3.

A manual inspection of the log reveals the precise timing of the accident sequence, with the order of air strikes and arrivals at the location. The timing is such that inf2 arrives near (36,9), calls down an air strike on it, and, after battle damage assessment (BDA) is performed by the MARV, it moves on to enter the location and destroy enemy3. While this is happening, inf1 also arrives near (36,9) and also orders an air strike. It is this second strike that hits inf2.

It can also be observed from the log that inf2 waits at (36,9) for an extended period of time prior to the air strike arriving. Inspection of the normal-case log reveals that inf2 normally moves on immediately after destroying enemy3. If that had occurred in this case, the accident would have been avoided.

[Fig. 9. Improved trace for accident UC1I2.]

The difficulty here for the tracing tool is that, by its nature, a tracing tool as described in Section 5.3 cannot have 'unevents' – a causal node cannot be of the form 'event X did not occur', which is what is needed to explain this hazard (e.g. 'inf2 does not move away'). However, a manual inspection of the accident log reveals that after arriving at the final location inf2 calls for MARV reconnaissance on inf3 – it is calling for airborne reconnaissance of a friendly entity.

An explanation for this can be found in that the infantry units always call reconnaissance for entities which they have not previously seen, either with their organic sensing or through the COP. Normally, all infantry units appear in the COP soon after the start of the run, because they call out the MARV to perform reconnaissance, and it observes them as it does so. However, because inf3 has here lost its ability to send radio messages, it has been unable to call for any reconnaissance and so has not been seen by the MARV. It is therefore not in the COP, and hence inf2 sees it as an unknown entity.

It can be noted that the location of inf2 is in the COP, so potentially the air strike could have been cancelled because it would likely lead to blue casualties. However, the behaviour of the UCAV defined in the model does not include any such check.

This is a plausible accident. It reveals a potentially dangerous assumption, and it might well have been missed by an analyst because it is caused by an apparently unrelated deviation – a radio failure in a different infantry unit. This illustrates a successful use of SimHAZAN – the simulation model produced an interesting run, the learner led analysts to it, and the tracer helped the analysts through part of the explanation.

7. Using SimHAZAN in practice

SimHAZAN can build on the results of manual techniques. For example, a HAZOP or Redmond analysis may identify high-level concerns about interactions between nodes, which can be fed into the process as explicit concerns (see Section 4.2). Similarly, techniques such as FFA and FMEA may lead to concerns about the effects of certain node failures. The cross-checks carried out during modelling will provide some confidence that such concerns are reflected in the final simulation model. Despotou et al. (2009) illustrates how SimHAZAN could be integrated with a specific manual technique (Dependability Deviation Analysis).

The results of SimHAZAN need to be fed back into a wider safety engineering process. One way to do this is to express the derived causal complexes as fault trees – Alexander (2007) provides some examples of this. As noted in Section 5.4, however, it may be that some causal complexes are difficult to explain in this form, and this may be more likely in larger, more complex SoS. In the light of other criticisms of fault trees, such as (Manion, 2007), this is an important area for future work.

8. Conclusions

This paper has presented a simulation-based method for hazard analysis of SoS. In the case study (Section 6), a model was built and analysed using SimHAZAN. In doing so, some plausible SoS hazards were found, one of which would only have been reliably discovered by a hazard analysis that propagated deviations between multiple component systems. This is promising. SimHAZAN can now be compared to the requirements established in Section 2.2.

The case study shows SimHAZAN supporting an exploratory analysis of embodied, node-unique behaviour (requirements 2 and 3), composed in the face of multiple failures across a complex SoS to show an accident sequence that was not previously predicted (requirements 4, 5 and 6). The model used was derived from a source model familiar to domain experts, and maintains strong links to that right up to the actual implementation. The machine learning and agent tracing tools work together to help extract explicit causal chains (requirement 1), and the whole analysis approach remains applicable to a wide range of simulation models – it uses a vocabulary that is not artificially restricted for the purpose of allowing analysis (this is part of requirement 8). Between this and (Alexander, 2007) there is a method that can be applied by a third party without the authors' help (requirement 7).

It is therefore reasonable to suggest that SimHAZAN meets the basic requirements identified in Section 2.2, and is thus a potential solution to the SoS hazard analysis problem. The evidence presented here is weakest for requirement 8 ("Must be scalable to systems of practical concern"). So far it has only been applied to a small example, but a reasonable ambition is to address much larger SoS. SimHAZAN may scale well to larger systems, in a way that manual techniques and more complete analyses (such as those based on model checking) will not, but further work will be required to test this.

Simulation fidelity remains a problem, but as long as the problem is framed as one of hazard identification and explanation the analyst does not need to be certain – they could content themselves with weak prediction and the experience that, in practice, many of the hypothetical hazards turn out to be real. Beyond that, it is difficult to argue that SimHAZAN analysis is adequately complete – there are no obvious "stopping rules" for modelling or analysis.

The analysis process described here is far from fully automated – although the learning and tracing tools can help, it still requires engineer time, intelligence and creativity. Similarly, creating the required simulation models may be very expensive. It may be possible to re-use models developed for other purposes; however, this may exacerbate the fidelity problems, as the appropriateness of a model depends heavily on the specific purpose to which it is to be put. For example, models that are adequate for performance analysis may be inappropriate for use with SimHAZAN.

8.1. Future work

The work described in this paper is promising, but there is considerable room for improvement. Perhaps the most important future work is that someone else should apply SimHAZAN to a case study without the authors' involvement or direct support; one of the most important attributes for method research is how well a third party can implement the method from its description (Rae et al., 2010). Sufficient detail has been supplied, between this paper and (Alexander, 2007), to make that possible.

Application to some larger case studies would be very valuable. Because the computer is performing the bulk of the combinatorial work (and then applying the machine learner to refine the results) SimHAZAN may scale with SoS size and complexity better than manual techniques do. Because of the emphasis on qualitative explanations, and the use of the tracer to create causal graphs, this approach may produce fewer false positives than existing automated techniques. Both of these claims should be empirically evaluated.

In other work, the authors are investigating the use of heuristic search techniques to explore the risk landscape created by a simulation. This will require improved risk measures within the simulation models so that they create searchable landscapes of risk. Adding such subtle risk measures would also indicate situations where the model is coming close to an accident, but cannot quite reach it; SimHAZAN as it stands does not help find such situations.

Acknowledgements

The work described in this paper was funded under the Defence and Aerospace Partnership in High Integrity Real Time Systems. The authors would like to thank many colleagues in the HISE Group at York for their invaluable comments on drafts of this paper, particularly Andrew Rae, Richard Hawkins, Zoe Stephenson, Katrina Attwood, Philippa Conmy, Oleg Lisagor, Kester Clegg and Ibrahim Habli. We'd also like to thank Richard Adcock of Cranfield University for providing the source model used for the case study.

References

Adcock, R.D., 2006. NECTISE Project Issue 1—Operational Architecture. Technical Report. Cranfield University.

Aitken, J.M., Alexander, R.D., Kelly, T.P., 2011. A risk modelling approach for a communicating system of systems. In: Proceedings of the IEEE International Systems Conference.

Alexander, R.D., 2007. Using Simulation for Systems of Systems Hazard Analysis. PhD Thesis. Department of Computer Science, University of York.

Alexander, R.D., Hall-May, M., Kelly, T.P., 2004. Characterisation of systems of systems failures. In: Proceedings of the 22nd International Systems Safety Conference (ISSC 2004), Providence, USA, pp. 499–508.

Ammirato, F., Bieth, M., Chapman, O.J.V., Davies, L.M., Engl, G., Faidy, C., Seldis, T., Szabo, D., Trampus, P., Kang, K.-S., Zdarek, J., 2004. Improvement of In-service Inspection in Nuclear Power Plants. Technical Report. International Atomic Energy Agency.

Arthur, J.G., McCarthy, A.D., Wynn, H.P., Harley, P.J., Baber, C., 1999. Weak at the knees? Arthroscopic surgery simulation user requirements, capturing the psychological impact of VR innovation through risk-based design. In: Sasse, M.A., Johnson, C. (Eds.), Human–Computer Interaction – INTERACT '99. IOS Press.

Beck, K., 2002. Test-Driven Development: By Example. Addison Wesley Professional, Boston, MA.

Blom, H.A.P., Stroeve, S.H., de Jong, H.H., 2006. Safety risk assessment by Monte Carlo simulation of complex safety critical operations. In: Redmill, F., Anderson, T. (Eds.), Proceedings of the Fourteenth Safety-critical Systems Symposium. Springer, Bristol, UK, pp. 47–67.

Bratman, M., 1987. Intention, Plans, and Practical Reason. Harvard University Press, Cambridge, MA.

Brooks, H., DeKeyser, T., Jaskot, D., Sibert, D., Sledd, R., Stilwell, W., Scherer, W., 2004. Using event-based simulation to reduce collateral damage during military operations. In: Jones, M.H., Patek, S.D., Tawney, B.E. (Eds.), Proceedings of the 2004 Systems and Information Engineering Design Symposium, pp. 71–78.

Caseley, P.R., Guerra, S., Froome, P., 2006. Measuring hazard identification. In: Proceedings of the 1st IET Conference on System Safety.

CISHEC, 1977. A Guide to Hazard and Operability Studies. The Chemical Industry Safety and Health Council of the Chemical Industries Association Ltd.

Defence Modelling and Simulation Office, 2010. DoD Modeling and Simulation (M&S) Glossary. Technical Report. US Department of Defence.

Despotou, G., Alexander, R., Kelly, T., 2009. Addressing challenges of hazard analysis in systems of systems. In: Proceedings of the 3rd IEEE Systems Conference. IEEE.

Dewar, J.A., Bankes, S.C., Hodges, J.S., Lucas, T., Saunders-Newton, D.K., Vye, P., 1996. Credible Uses of the Distributed Interactive Simulation (DIS) System. Technical Report. RAND.

Drogoul, A., Vanbergue, D., Meurisse, T., 2002. Multi-agent based simulation: where are the agents? In: Proceedings of the Third International Workshop on Multi-Agent-Based Simulation, Bologna, Italy.

Gaudel, M.-C., Issarny, V., Jones, C., Kopetz, H., Marsden, E., Moffat, N., Paulitsch, M., Powell, D., Randell, B., Romanovsky, A., Stroud, R., Taiani, F., 2003. Final Version of DSOS Conceptual Model (CSDA1). Technical Report. University of Newcastle upon Tyne.

Ha-Duong, M., 2005. Scenarios, Probability and Possible Futures. Centre international de recherche sur l'environnement et le développement (CIRED). <http://halshs.archives-ouvertes.fr/halshs-00003925/en/> (07.07.11).

Hawkins, R.D., Kelly, T.P., 2009. Software safety assurance – what is sufficient? In: Proceedings of the 4th IET System Safety Conference. Institute of Engineering and Technology, London, UK.

Humphreys, P., 2004. Extending Ourselves: Computational Science, Empiricism, and Scientific Method. Oxford University Press, New York.

Ladkin, P.B., Loer, K., 1998. Why-Because Analysis: Formal Reasoning about Incidents. Technical Report. Bielefeld University.

Lam, D.N., Barber, K.S., 2005. Comprehending agent software. In: Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-2005), Utrecht, Netherlands.

Leveson, N.G., 2002. A new accident model for engineering safer systems. In: Proceedings of the 20th International System Safety Society Conference (ISSC 2002). System Safety Society, Unionville, Virginia, pp. 476–486.

Lisagor, O., Sun, L., Kelly, T.P., 2010. The illusion of method: challenges of model-based safety assessment. In: The 28th International System Safety Conference (ISSC '10).

Luke, S., Cioffi-Revilla, C., Panait, L., Sullivan, K., 2004. MASON: a new multi-agent simulation toolkit. In: Proceedings of the 2004 SwarmFest Workshop.

Manion, M., 2007. The epistemology of fault tree analysis: an ethical critique. International Journal of Risk Assessment 7, 382–430.

Marais, K., Dulac, N., Leveson, N.G., 2009. Moving beyond normal accidents and high reliability organizations: a systems approach to safety in complex systems. Organization Studies 30, 227–249.

McIlroy, D., Heinze, C., 1996. Air combat tactics implementation in the Smart Whole AiR Mission Model (SWARMM). In: Proceedings of the First International SimTecT Conference, Melbourne, Australia.

Ministry of Defence, 2012. MOD Architecture Framework (MODAF). <http://www.mod.uk/DefenceInternet/AboutDefence/WhatWeDo/InformationManagement/MODAF/>.

Mitchell, T.M., 1997. Machine Learning. McGraw-Hill.

Mittal, S., 2006. Extending DoDAF to allow integrated DEVS-based modeling and simulation. Journal of Defense Modeling and Simulation 3, 95–123.

Mohaghegh, Z., Kazemia, R., Mosleha, A., 2009. Incorporating organizational factors into Probabilistic Risk Assessment (PRA) of complex socio-technical systems: a hybrid technique formalization. Reliability Engineering & System Safety 94, 1000–1018.

Object Management Group, 2009. Unified Profile for DoDAF and MODAF (UPDM) Version 1.0.

Padgham, L., Winnikoff, M., 2004. Developing Intelligent Agent Systems: A Practical Guide. John Wiley & Sons.

Paolo, E.A.D., Noble, J., Bullock, S., 2000. Simulation models as opaque thought experiments. In: Proceedings of the Seventh International Conference on Artificial Life. MIT Press, pp. 497–506.

Perrow, C., 1984. Normal Accidents: Living with High-Risk Technologies. Basic Books, New York.

Platts, J.T., Peeling, E., Thie, C., Lock, Z., Smith, P.R., Howell, S.E., 2004. Increasing UAV intelligence through learning. In: Proceedings of AIAA Unmanned Unlimited, Chicago, IL.

Pumfrey, D.J., 1999. The Principled Design of Computer System Safety Analyses. DPhil Thesis. Department of Computer Science, University of York.

Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

Rae, A.J., Alexander, R.D., Nicholson, M., 2010. The state of practice in system safety research evaluation. In: Proceedings of the 5th IET System Safety Conference, Manchester, UK.

Raheja, D., Moriarty, B., 2006. New paradigms in system safety. Journal of System Safety 42.

Rasmussen, J., 1997. Risk management in a dynamic society: a modelling problem. Safety Science 27, 183–213.

Redmond, P.J., 2007. A System of Systems Interface Hazard Analysis Technique. MS Thesis. Naval Postgraduate School, Monterey, CA.

Redmond, P.J., Michael, J.B., Shebalin, P.V., 2008. Interface hazard analysis for system of systems. In: IEEE International Conference on System of Systems Engineering.

Ronald, N., Sterling, L., Kirley, M., 2007. An agent-based approach to modelling pedestrian behaviour. International Journal of Simulation 8, 25–38.

Russell, S.J., Norvig, P., 2002. Artificial Intelligence: A Modern Approach. Pearson.

Smith, D.J., 2000. Developments in the Use of Failure Rate Data and Reliability Prediction Methods. Delft University of Technology.

Stephenson, Z., Fairburn, C., Despotou, G., Kelly, T., Herbert, N., Daughtrey, B., 2011. Distinguishing fact from fiction in a system of systems safety case. In: Proceedings of the Safety-critical Systems Symposium. Springer, Southampton, UK.

Suokas, J., Kakko, R., 1989. On the problems and future of safety and risk analysis. Journal of Hazardous Materials 21, 105–124.

US Department of Defense, 2000. MIL-STD-882D – System Safety Program Requirements. US Department of Defence.

VT MAK, 2009. VR-Forces Brochure. <http://www.mak.com/pdfs/br_vrforces.pdf> (07.07.11).

Wilkinson, P.J., Kelly, T.P., 1998. Functional hazard analysis for highly integrated aerospace systems. In: Proceedings of the 1998 IEE Seminar on Certification of Ground/Air Systems, London, UK.

Yang, X., Chen, Y., Eide, E., Regehr, J., 2011. Finding and understanding bugs in C compilers. In: Proceedings of the 2011 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), San Jose, CA.

Zinn, A.W., 2004. The Use of Integrated Architectures to Support Agent Based Simulation: An Initial Investigation. MS Thesis. Air Force Institute of Technology.