SIGNED DIRECTED GRAPH BASED MODELING AND ITS ... - Home...

Int. J. Appl. Math. Comput. Sci., 2012, Vol. 22, No. 1, 41–53DOI: 10.2478/v10006-012-0003-z

SIGNED DIRECTED GRAPH BASED MODELING AND ITS VALIDATION FROMPROCESS KNOWLEDGE AND PROCESS DATA

FAN YANG ∗,∗∗, SIRISH L. SHAH ∗, DEYUN XIAO ∗∗

∗ Department of Chemical and Materials EngineeringUniversity of Alberta, 7th Floor, ECERF, 9107 116 Street, Edmonton, AB T6G 2G6, Canada

e-mail: fyang.cme,[email protected]

∗∗Tsinghua National Laboratory for Information Science and Technology (TNList), Department of AutomationTsinghua University, 1 Tsinghuayuan, Haidian District, Beijing 100084, China

e-mail: yangfan,[email protected]

This paper is concerned with the fusion of information from process data and process connectivity and its subsequent usein fault diagnosis and process hazard assessment. The Signed Directed Graph (SDG), as a graphical model for capturingprocess topology and connectivity to show the causal relationships between process variables by material and informationpaths, has been widely used in root cause and hazard propagation analysis. An SDG is usually built based on processknowledge as described by piping and instrumentation diagrams. This is a complex and experience-dependent task, andtherefore the resulting SDG should be validated by process data before being used for analysis. This paper introducestwo validation methods. One is based on cross-correlation analysis of process data with assumed time delays, while theother is based on transfer entropy, where the correlation coefficient between two variables or the information transfer fromone variable to another can be computed to validate the corresponding paths in SDGs. In addition to this, the relationshipcaptured by data-based methods should also be validated by process knowledge to confirm its causality. This knowledgecan be realized by checking the reachability or the influence of one variable on another based on the corresponding SDGwhich is the basis of causality. A case study of an industrial process is presented to illustrate the application of the proposedmethods.

Keywords: signed directed graph, transfer entropy, process topology, fault diagnosis, process hazard assessment.

1. Introduction

The scale and complexity of process operations have beenincreasing to meet the business and regulatory complian-ce demands of industry. These industrial processes, ho-wever, are often subject to low productivity due to systemfaults and/or even hazardous operating conditions becauseof abnormal conditions such as faulty operations, equip-ment degradation, external disturbances, and control sys-tem failures. Compared with the traditional fault detectionin local systems, fault detection in large-scale complexsystems is particularly difficult because of the high degreeof interconnections in such systems. A simple fault in suchsystems may easily propagate throughout the plant. Sucha type of analysis is similar to hazard operability analysis,which is initially carried out in a qualitative way followedby quantitative analysis. It is important to remember thatsimple local faults can propagate and spread to a wider do-

main because of the connectivity between units and the to-pology of the system. Therefore, in root cause analysis ofprocess faults, it is important to capture connectivity andtopology to use this information to relate to the dynamicsof the systems as captured in process data. This fusionof information can be further utilized to find the probableroot causes and resulting consequences according to thecurrent symptom(s). Therefore, the problems of fault de-tection and propagation require analysis on two fronts: (i)capturing process connectivity and causality, and (ii) vali-dation of such models using both process knowledge andprocess data.

There are several models to describe process topo-logy and cause/effect relations. Causal graphs describethe causal relations between variables, and they can beextended to AND/OR causal graphs to describe logicalcausal relations (Ligeza and Koscielny, 2008). A poten-tial conflict structure, as a subgraph of a causal graph,

fyang.cme,[email protected]

yangfan,[email protected]

42 F. Yang et al.

describes specifically the computational causal relationsused for consistency-based diagnosis (Ligeza, 1996). The-re are also more complex models that describe essen-tially causal relations. Bond graphs and their extension,temporal causal graphs (Mosterman and Biswas, 1999),use different symbols to further describe dynamic cha-racteristics. More precisely, qualitative transfer functions(Leyval et al., 1994) and differential equations (Montmainand Gentil, 2000) have been integrated into causal gra-phs, and complex algorithms are introduced to improvetheir correctness (Cheng et al., 2011). Similar or impro-ved approaches were investigated by many researchers,such as Pastor et al. (2000), Alonso et al. (2003), Fa-garasan et al. (2004), and Jan et al. (2007). All of the-se improved methods depend on mathematical models. IfMatlab/Simulink models of dynamic systems are availa-ble, then the diagnostic models can be generated directly(Górny, 2001). The dependency on models, however, li-mits the utility in large-scale complex systems.

As a qualitative graphical model, the Signed Direc-ted Graph (SDG) technique has been used to describethe topology and causality of industrial processes, inclu-ding both material flow and information flow paths (Iriet al., 1979). Compared to classic causal graphs, SDGsadd signs to nodes and arcs to include more informationand yet remain simple and qualitative. SDGs can helpus understand the internal relationships between variablesand thus be implemented in hazard assessment (to find thepossible consequence of a fault) and also in root causediagnosis (Yang et al., 2010a).

Although the modeling of SDGs can be based on ma-thematical equations (Maurya et al., 2003a; 2003b; 2004),it can also be undertaken practically from process know-ledge that is readily available from Piping and Instru-mentation Diagrams (P&IDs). Although simple, SDGs areable to capture fairly precise process connectivity infor-mation from the direct relationship between variables, asbased on causality. Some techniques to capture connecti-vity from P&IDs have been developed by Fedai and Drath(2005), Yim et al. (2006), and Thambirajah et al. (2009)based on the standard data format of Computer Aided En-gineering Exchange (CAEX), though there may also be alot of irrelevant information in these files. The SDGs con-structed in this way should be validated by the processdata to confirm their correctness.

When dealing with process data, various pairwisecausality capture methods have been proposed (Baueret al., 2007; Bauer and Thornhill, 2008), whereas con-structing a causal network or determining the topologypurely from data is difficult, if not impossible, unless onealso resorts to process knowledge. In multivariate cases,knowledge discovery in databases has been studied andimplemented in the DiaSter system (Korbicz and Kosciel-ny, 2010), including the discovery of qualitative and quan-titative dependencies. However, when the scale of a pro-

cess is large, most of the data-based methods are essen-tially pattern recognition, which is difficult to be combi-ned with knowledge-based or model-based methods. Mo-reover, by these data-based methods, one is likely to runinto a lot of redundancy and errors.

In general, the combination of knowledge and datareinforces the trust in SDG models and leads to their usewith a high degree of confidence. In this paper, SDG mo-dels are used because they can be built by knowledge aswell as data and therefore provide a common descriptionthat facilitates their validation.

It should be noted that verification and validation aretwo important steps to be followed before the model canbe used. Palmer and Chung (1999) discussed verificationto ensure that the internal structure of the model is com-plete, correct, and consistent; this is based on the modelitself. However, the model should also be tested in theenvironment in which it is going to be used; in practice,historical data are employed for this test, which is the pur-pose of this paper.

This paper is organized as follows. Section 2 provi-des an overview of the SDG modeling and inference me-thod based on process knowledge. Section 3 introducestwo data-based methods, time delayed cross-correlationand transfer entropy, to capture causality between two va-riables. Section 4 describes mutual validation of knowled-ge description and a data-based causal network. In Sec-tion 5 the above methods are evaluated by application toa real industrial process where we build an SDG and va-lidate it by process data, followed by concluding remarksin Section 6.

2. SDG modeling by process knowledge

A process variable (or another continuous variable suchas a manipulated one) is represented by a node in a SDG,and the cause/effect relationship between two variables isrepresented by a directed arc between the correspondingnodes (Iri et al., 1979). The source node means the cau-se or parent, and the destination node means the effect orchild. For example, an arc from node A to node B impliesthat deviation or perturbation in A may cause deviation inB. A sign ‘+’, ‘−’, or ‘0’ is assigned to a node in compa-rison with thresholds to denote higher than, lower than, orwithin the normal operating region, respectively. Positiveor negative influence between nodes is distinguished bythe sign ‘+’ (promotion) or ‘−’ (suppression) of the arc,shown as a solid or dotted arc respectively.

The construction procedure of SDGs can be accom-plished based on mathematical models, i.e., Differentialand Algebraic Equations (DAEs). For example, the follo-wing Differential Equation (DE):

dxi

dt= fi(x1, . . . , xn) (1)

Signed directed graph based modeling and its validation from process knowledge and process data 43

can be transformed into a set of arcs with the sign

sgn(xj → xi) = sgn∂fi

∂xj

∣∣∣∣x01,...,x0

n

, (2)

where (x01, . . . , x

0n) denotes the steady state. Algebraic

Equations (AEs) can be transformed into arcs in much thesame way.

In most cases, the SDG is established by qualitativeprocess knowledge and experience because precise equ-ations are usually unavailable and unnecessary. Neverthe-less, process knowledge is available in P&IDs, and it issuggested that one should capture process topology fromthis information, where upstream variables influence do-wnstream variables, external environmental variables in-fluence internal state variables, and manipulated variablesinfluence controlled variables in a control system.

2.1. SDG modeling within a process unit. For a sin-gle unit, several physical quantities reflect the characte-ristics of the process. The following three types of rela-tionships between these variables can be described as DA-Es: DEs, which reflect dynamic causal relationships (Ty-pe I), AEs with causal meaning, which include driving for-ce equations, functional relationships, and other algebraicequalities (Type II), and AEs with no causal relationships(Type III) (Yang et al., 2010b). For the first two types,Maurya et al. (2003a) summarized the modeling methods.The third type can be used as redundant constraints to re-move irrelevant solutions (Oyeleye and Kramer, 1988).

Based on these methods, an SDG of a unit can be ob-tained from qualitative knowledge instead of DAEs. Forexample, consider a tank process as shown in Fig. 1, whe-re L is the level in the tank, K is the valve position inthe outlet pipe, F1 and F2 are inlet and outlet flowrates,respectively. The DAEs of this process are given below:

AdL

dt= F1 − F2, (3)

F2 = K√

L, (4)

K = αL, (5)

where A is the cross sectional area of the tank and α is theproportional coefficient of the control law. By these DA-Es, an SDG can be built as shown in Fig. 2, where 2(a)

K

F1

F2L

Fig. 1. Scheme of a tank process.

F1 L F2 L K F2 F1 L K F2

(a) (b) (c)

Fig. 2. SDGs of the tank process: SDG obtained from Eqn.(3) (a), SDG obtained from Eqns. (4) and (5) (b), com-bination of (a) and (b) (c).

is obtained from Eqn. (3) (Type I), 2(b) from Eqns. (4)and (5) (Type II), and 2(c) is a combination of them.This SDG can also be obtained from process knowledgethat the level is determined by the inlet and outlet and theoutlet flow rate is affected by the level and the valve po-sition. Irrespective of the type of control (P, PI, or PID),it is shown as an arc from L to K , and it is unnecessaryto write the corresponding DAEs for this case. Accurateparameters and functions are not important for this usage.

In the process industry, standard units, such as tanksand exchangers, can be modeled and saved as templates ormodules for reuse. One can build SDGs for special unitsby using the qualitative information obtained from pro-cess knowledge as well as first-principles or mathematicalmodels. When all the units are modeled, they should beconnected according to the linkage information.

2.2. SDG modeling between process units. Units areconnected because of material and signal flow paths. Thedirections of arcs are consistent with the transportation pa-ths. The arcs should be connected to corresponding varia-bles in different units.

An efficient way of showing these transportations isvia P&IDs in which both material flow and informationflow (control signal) are shown. SDGs can be built byunfolding the units in P&IDs as unit SDGs and connec-ting them according to flow directions. Thus SDGs bringout not only the connectivity and topology between theunits but also the causality between variables.

This plant topology information can be expressed bya text file that can be automatically generated by CADtools (such as the DXF format developed by AutoCADfor enabling data interoperability). The program shouldextract related information from it to identify process unitsand flowsheets (Palmer and Chung, 2000). For a stan-dard and uniform extraction of connectivity information,XML (eXtensible Markup Language) is one possible so-urce of information. Some CAD tools can export XMLtexts, and some techniques of computer science can beemployed to capture the connectivity information fromthem (Fedai and Drath, 2005; Yim et al., 2006; Thambira-jah et al., 2009). This format of connectivity informationforms the basis for causality and can be converted intomatrices or SDGs essentially.

44 F. Yang et al.

2.3. SDG-based inference and model improvement.In the safety area, fault diagnosis, especially root causediagnosis, and process hazard assessment, especially HA-Zard and OPerability analysis (HAZOP), are two differenttasks. The former is to correctly identify the root cause ofa fault. It is based on measurements and used in real-time.The latter is used off-line to find the possible consequen-ces due to every cause. By assuming various faults, onecan analyze possible consequences that are likely to betriggered. Both tasks need a model and mechanism to de-scribe how a fault propagates, and thus the SDG modelcan be employed.

Based on an SDG, fault propagation can be explainedby searching for consistent paths. An arc is called consi-stent if the product of the sign of its source, the sign of itsdestination, and the sign of the arc is equal to ‘+’. Take arcA → B with sign ‘−’ (suppression), for example. If A is‘+’, and B is ‘−’, then this arc is consistent, which me-ans the fault on A can be propagated to B along this arcbecause the signs show the suppression effect. A fault can-not be propagated along an inconsistent arc even thoughthere may be a connection. Consistent arcs are connectedsuccessively to compose a consistent path, i.e., a reasona-ble fault propagation path. In Fig. 3, a consistent path ishighlighted, while there is another node with a sign ‘−’enclosed in a circle that does not belong to any consistentpath, because the arc adjacent to it is inconsistent in spiteof the existing connection.

According to this rule of consistency, one can searchfor root causes and consequences by backtracking and for-ward tracking, respectively. A common search algorithmis the depth-first or width-first traversal on the graph (Iriet al., 1979), by which the root causes or affected variablesas well as the propagation paths can be obtained.

One disadvantage of the aforementioned unit-basedmethod is incorrect or ambiguous inference due to theexistence of divider/header combinations or recycle loops,because there may exist multiple paths with the same ori-gin and terminal. This is unavoidable because of the in-complete information of qualitative SDGs. This disadvan-tage, however, can be partially overcome through modelimprovement. Maurya et al. (2003b) analyzed model de-scription in detail by distinguishing an initial response anda steady-state response and modified the SDG by addingredundant equations. Palmer and Chung (2000) developed

Fig. 3. Example of a consistent path.

the modular approach to remove redundant paths. Neitherof these efforts can solve the problem perfectly or effi-ciently without investigating real data, especially for largeprocesses. This issue will be discussed in Section 4.1.

3. Causality capture based on process data

There are many methods to capture causality informationfrom process data. Two of them are given below. One isthe simplest and yet a very practical one, the other is ageneral but computationally complex one.

3.1. Causality capture based on cross-correlation. Ina process, variables are interacting. By analyzing the cor-relation between them, we can capture the causality. Al-though correlation does not imply causality, it can be usedas a validation. Moreover, when computing Pearson’s cor-relation, causes and effects can be recognized by introdu-cing lags in a time series to find the maximum correlation.

Assume that x and y are time series of n observationswith means μx and μy and standard deviations σx and σx,respectively. Then the Cross-Correlation Function (CCF)with an assumed lag k is

φxy =E [(xi − μx)(yi+k − μy)]

σxσy. (6)

The expectation can be estimated by the sample CCF as

φxy(k) =

⎧

⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

n−k∑

i=1

(xi − μx)(yi+k − μy)

(n − k)σxσyif k ≥ 0,

n∑

i=1−k

(xi − μx)(yi+k − μy)

(n + k)σxσyif k < 0.

(7)A value of the CCF is obtained by assuming a cer-

tain time delay for one of the time series. Thus the ab-solute maximum value can be regarded as the real cross-correlation and the corresponding lag as the estimated ti-me delay between these two variables. For mathemati-cal description, one can compute the maximum and mini-mum values φmax = maxkφxy(k), 0 ≥ 0 and φmin =minkφxy(k), 0 ≤ 0, and the corresponding argumentskmax and kmin. Then the time delay from x to y is

λ =

kmax if φmax ≥ −φmin,kmin if φmax < −φmin (8)

(corresponding to the maximum absolute value) and theactual time delayed cross-correlation is ρ = φxy(λ) (be-tween −1 and 1). If λ is less than zero, then it means thatthe actual delay is from y to x. Thus the sign of λ provi-des the directionality information between x and y. The


sign of ρ corresponds to the sign of the arc in the SDGindicating that the correlation is positive or negative.

By this definition, ρ is a statistical estimate inevita-bly prone to some uncertainty due to disturbances, noiseand the size of data windows. Therefore its value shouldbe judged with care. Even if the two time series are uncor-related random noises, ρ may still likely be different fromzero. Therefore the value of the CCF between two varia-bles should be checked against a threshold. If the correla-tion between the two series is very weak, then the effectof the noise will dominate the results. Therefore, in corre-lation analysis, only those values which are significantlylarger than a user-defined threshold (e.g., ±0.2) are consi-dered to be evidence of correlations.

To sum up, based on estimation, the maximum CCFis defined as the time delayed correlation coefficient (orcorrelation in brief), and the corresponding argument kis defined as an estimate of the time delay from x to y,from which we can find the cause and the effect (Bauerand Thornhill, 2008). Of course, correlations are based onstatistics under the assumption of linearity, and thus theyneed hypothesis tests to obtain the level of significance.Although the estimates of correlation coefficients are notaccurate, the directions are believable for most cases. Al-though this method is practical and easy for computation,it has many shortcomings, some of which are explainedbelow.

• The nonlinear causal relationship does not necessa-rily show up in correlation analysis. For example,if y equals the square of x with the time delay ofone sampling time, then, based on the time-delayedcross-correlation, this obvious causality cannot be fo-und because all the values are small relative to a thre-shold. This can be explained because the true corre-lation should be zero.

• Correlation simply gives us an estimate of the timedelay. The sign of the delay is an estimate of the di-rectionality of the signal flow path. The time delayobtained, however, is only an estimate. In addition,the trend in a time series is ignored, and values atdifferent time instances are regarded as samples ofthe same random event. Thus the causality obtainedby this measure is purely the time delay based on theestimate of the covariance.

Given all the correlations between any two varia-bles, a correlation color map can be constructed (Tangiralaet al., 2005), whose horizontal and vertical coordinates areboth variables in the same order and the color of each pi-xel shows the correlation between the two correspondingvariables according to a scaled color bar. This providesan intuitive way for human beings to observe correlation,especially for identification of similar groups of variables.

Based on these correlations with directions, one canconstruct a causal network. Because correlation has the

property of a transitive relation, some correlations can beexplained by sequential direct causal relations. For exam-ple, the causality from A to C can be a combined result ofcausal relations from A to B and from B to C. Pairwisedata analysis cannot recognize this difference without pro-cess knowledge. In addition, other related variables canaffect the correlation between two variables to lessen theintensity, which may interfere the analysis results. In anycase, a correlation high enough, whether direct or indirect,should be explained by an arc or a path in the correspon-ding SDG. This is a means of validating SDGs.

3.2. Causality capture based on transfer entropy.From another point of view, causal relation means in-formation transfer. Thus the measure of transfer entropy(Schreiber, 2000) in information theory can also be em-ployed. The transfer entropy from y to x is defined as

t(x|y) =∑

p(xi+h, xi, yi) · logp(xi+h|xi, yi)p(xi+h|xi)

, (9)

where p means the complete or conditional ProbabilityDensity Function (PDF), xi = [xi, xi−τ , . . . , xi−(k−1)τ ],yi = [yi, yi−, . . . , yi−(l−1)τ ], τ is the sampling period,and h is the prediction horizon. The transfer entropy is ameasure of information transfer from y to x done by me-asuring the reduction of uncertainty while assuming pre-dictability. It is defined as the difference between the in-formation about a future observation of x obtained fromthe simultaneous observation of past values of both x andy, and the information about the future of x obtainedfrom the past values of x alone. It gives us a good sen-se of the causality information without having to requirethe delay information. Experience suggests that one takesτ = h ≤ 4, k = 0, and l = 1 for the initial trial, while theusual way is to test and compare various results from dif-ferent values of τ and h. If the transfer entropies in two di-rections are considered, then t(x → y) = t(y|x) − t(x|y)is used as a causality measure to decide the quantity anddirection of information transfer (Bauer et al., 2007).

In Eqn. (9), the PDF can be estimated from a hi-stogram or kernel methods (Silverman, 1986), which arenonparametric methods, to fit any shape of the distribu-tions. Here the Gaussian kernel method is used becauseit is more robust than the naive histogram-based method.The Gaussian kernel function is defined as

K(v) =1√2π

e−12 v2

. (10)

Thus a univariate PDF can be estimated by

p(x) =1

Nh

N∑

i=1

K

(x − xi

h

)

, (11)

where N is the number of samples, and h is the ban-dwidth chosen to minimize the mean square error of the

46 F. Yang et al.

PDF estimation calculated by h = c · σ · N−1/5, wherec = (4/3)1/5 ≈ 1.06 according to a ‘normal referencerule-of-thumb’ (Li and Racine, 2007).

For the multivariate case (q dimensional), the estima-tion of PDF is

p(x1, x2, . . . , xq)

=1

Nh1 · · ·hq

·N∑

i=1

K

(x1 − xi1

h1

)

· · ·K(

xq − xiq

hq

)

,

(12)

wherehs = cσ(xis)N

i=1 · N−1/(4+q)

for s = 1, . . . , q, and the notation is the same as the uni-variate case.

Compared to the approach based on cross-correlation, this one can be applied to more generalconditions such as nonlinear relations. In the above(nonlinear) example, the causality cannot be validatedbased on the cross-correlation. Given a lag of 1, however,the transfer entropies from x to y and vice versa are 0.27and 0.01, respectively, and thus the causality from x toy can be confirmed, which is consistent with the actualsetting.

Transfer entropy shows the information transfer ineach direction. Thus this method provides more insightinto causality for complex systems especially for the casewith recycles. In Fig. 4(a), x and y are connected direc-tly via a forward path and a recycle path. In the forwardchannel from x to y, the relation is via an AR model, i.e.,y(i) + 0.5y(i− 1) = x(i− 5), whereas there also exists afeedback channel from y to x, i.e., x(i) = 1−0.5y(i−1).Thus the information transfer lies in the two channels. Ifone is only concerned with the causality measure, thent(x → y) (in Fig. 4(b)) indicates that the causality is fromx to y. However, if transfer entropies t(y|x) (in Fig. 4(c))and t(x|y) (in Fig. 4(d)) are studied, then both arcs can bevalidated.

To sum up, this method is more general than the pre-vious one and is able to reveal the essential relationshipeven when the lag is inexplicit. There are several parame-ters to be set by users, which provide some freedom. Itis, however, highly dependent on the estimation of PDFs(although it can have any non-Gaussian forms). Thus thecomputational burden is higher. However, when using thismethod, the time delay cannot be estimated, and the arcsigns in SDGs cannot be obtained. In real applications,one may mainly choose correlation analysis for validationbut sometimes the transfer entropy can bring in additionalinsights.

There are other alternative methods to capture thecausality between time series such as Granger causalityand predictability improvement (Lungarella et al., 2007).

x y

y(i)+0.5y(i-1)=x(i-5)

x(i)=1-0.5y(i-1) (a)

(b)

(c)

(d)

Fig. 4. Transfer entropy measures: two variables and their rela-tions (a), transfer entropy from x to y (b), transfer en-tropy from y to x (c), causality measure which is thedifference (d).

Each has its own advantages and limitations. They com-plement each other and no one method is powerful enoughto replace the others. Hence data analysis is a combinationof various methods.


4. Mutual validation of process description

Process knowledge and process data are two means to cap-ture causality information in a process. Neither, however,is sufficient for practical use because of redundancy anderrors in the resulting models. We should combine themby mutual validation.

4.1. Using process data to validate knowledge descrip-tion. A causal network constructed based on processknowledge is essentially closer to the real causality. Ide-ally, it covers all the possible paths. It also has the abilityto avoid indirect relations by avoiding parallel paths. Forexample, if A can reach B and B can reach C, then Acan reach C. However, if the reachable path from A to Cis just realized by the serial connection of the two pathsfrom A to B and from B to C, then the direct arc fromA to C can be deleted, otherwise there should exist so-und reasons to include different parallel paths to achievethe reachability from A to C, resulting in a necessary arc.The major problem of this causal network constructed byprocess knowledge is the lack of quantitative informationto confirm its reliability. In fact, there are usually redun-dant and irrelevant arcs that should be deleted.

The reasons for this redundancy include many caseswhereas the essential reason is that connectivity is merelya necessary but insufficient condition for causality. Whiledetailed information can also be obtained to exclude theinexistent arcs or choose the dominant path, the effort wo-uld be multiplied because quantitative process knowledgeis needed in addition to qualitative information. In the mo-dular approach (Palmer and Chung, 2000), this task is leftto the user, although semi-automated; it still highly relieson experience. Thus we limit the process knowledge ofconcern to P&IDs, that is to say, qualitative connectivi-ty information. Under this circumstance, complex structu-res, such as dividers, headers, or recycle loops, often ma-ke the qualitative algebra provide ambiguous results thatare difficult to improve according to qualitative informa-tion because the intensity of each arc is unknown. This isan intrinsic problem of the SDG model, although a fewresearchers have made some improvements (Palmer andChung, 2000; Maurya et al., 2003b).

In general, it is difficult to exclude physically broken(e.g., valve block), behaviourally nondescript loop (e.g.,control loop), or extremely weak (due to attenuation ofsignals) paths. To validate the causal network, one shouldresort to process data for quantitative evidence. If there isno data support for reachability, then the causality shouldbe excluded. For the tank system shown in Fig. 1, if thedata of F1, F2, L, and K (take controller output as analternative) are available and sufficiently excited, then thearcs in Fig. 2(c) can be validated except F2 → L becausethe control determines the value of L.

4.2. Using process knowledge to validate data-basedrelations. Although there are several data-based me-thods to capture causality, they are developed to find thecausality between two variables. Real systems, however,are multivariate, the causality within which is shown asa network with weights based on the causality measuresbetween every two variables. Thus a causality matrix isobtained to reflect the magnitude of the causality of eachpair of variables, and the direction is determined by thetime delay in correlation analysis or the sign of measurein information transfer computation. The topology of thecausal network, however, also relies on the propagationrelations by screening the indirect relations. According toBauer and Thornhill (2008), one of the two typical topo-logies (extreme cases) is generated according to the num-ber of nonzero entries in the first row and above the maindiagonal of the causality matrix, while the real topologyis the combination of these two topology forms. With theaid of a correlation test and a directionality test, one canselect the evident relations first and construct the network(Yang et al., 2010b). A method based on correlation isproposed, which introduces the resulting quantitative in-formation obtained by CCF computations.

• Step 1: In matrix P, select the maximum value in theelements that has not been used and tested.

• Step 2: Check the results of the correlation test anddirectionality test. If the correlation value fails topass the tests, then stop.

• Step 3: Check the result of the consistency test for allthe variables in the existing arcs. If it fails, then go toStep 5.

• Step 4: Add an arc corresponding to this element withan estimated time delay. The sign of the arc is deter-mined by the sign of the element.

• Step 5: Go to Step 1.

Nevertheless, the resulting network can be incorrect andmay be inconcise without validation by process knowled-ge.

One possible way to validate an acceptable causal ne-twork is to check the reachability between the two varia-bles for evident causality; this can be realized by searchingconsistent paths. This treatment, however, cannot excludeindirect relations; the arcs are selected according to theirmagnitudes of causality. For example, if causal relationsexist from A to B and from B to C, there may be strongcausality between A and C that can be confirmed by re-achability, yet it is actually redundant. The study to identi-fy whether the path is direct or indirect is underway in thetemporal as well as the spectral domain (Gigi and Tangi-rala, 2010). Although it is possible, it is not efficient wi-thout looking at process knowledge. Thus, in most cases,

48 F. Yang et al.

one should make good use of process knowledge in themodeling procedure.

For the tank system as shown in Fig. 1, based on thedata of F1, F2, L, and K , one can easily obtain the causalrelations from F1 to F2, from K to F2, and from F1 to Kwhen the level control is in effect. These arcs can be va-lidated by reachability check. If the data set also includesa transient response, then other causal relations, such asfrom F1 to L, can be detected. The SDG can be construc-ted accordingly and validated. The structure related to L isslightly different from Fig. 2(c) because of the level con-trol.

5. Industrial case study

Consider a real industrial system, the Final Tailings PumpHouse (FTPH) process at Suncor Energy Inc. in FortMcMurray, Alberta, Canada.

5.1. Process description. The flow sheet for this pro-cess is shown in Fig. 5, where some texts have been ma-de illegible and the control strategy is omitted for confi-dentiality reasons. The tailings from upstream are pum-ped into a distributor and then processed in parallel cyclo-packs and pump boxes, and finally discharged into theponds. There are five parallel lines from the cyclo-packdownwards, where Lines A/B/C/E are structurally identi-cal while Line D is distinct. Based on the pressure of thedistributor, a prioritization program is implemented on theparallel lines and Line A is therefore the most important.

The single-loop, cascade, and selective control stra-tegy are applied, including distributor pressure control,cyclo-pack pressure and underflow control by adjustingthe number of cyclones opened, gypsum addition flow ra-te control, pump box level control and discharge densi-ty control by adjusting Cold Process Water (CPW), pumpbox level control by adjusting pump speed, pump box di-scharge flow rate control by adjusting pump speed, andMature Final Tailing (MFT) addition flow rate control. Forthis process an SDG constructed from process knowledgeis shown in Fig. 6.

For simplicity, six key variables xi(i = 1, . . . , 6) inLine A are chosen, which are y10, y16, y18, y21, y28,and y30, respectively, as indicated in Table 1. The cor-responding subsection of the SDG (Fig. 7(a)) is extractedfrom the SDG in Fig. 6. This can be easily done by chec-king reachability in the SDG model.

5.2. Using process data to validate knowledge descrip-tion. Based on the process data of the variables in Li-ne A over a one-week long period (with one minute sam-pling frequency), the correlation color map obtained is di-splayed in Fig. 7(c). We use these data and correspondingtime delays for SDG validation. For example, the bottomleft corner is a cluster of correlated variables. By checking

the P&ID, they are found to be associated with the leveland density controls in the same pump box.

One can focus on the cross-correlations and time de-lays between the six variables for further analysis. Theyare shown by the following correlation matrix P (compri-sing all the correlation coefficients between two variables)and the causality matrix Λ (comprising all the estimatedtime delay λs from one variable to another).

Table 1. Key process variables in the FTPH process: Line A forcausality analysis.

Notation Tag name Description

x1 y10 distributor pressurex2 y16 gypsum addition flow ratex3 y18 gypsum densityx4 y21 sludge header pressurex5 y28 sump densityx6 y30 sump level

x4

x5x2

x6 x1

x3

(a) (b)

(c)

Fig. 7. SDG and its validation by cross-correlation: subsectionof the SDG concerning six important variables in whichthe thick lines are validated (a), correlation color map ofthe six important variables (b), correlation color map ofall the variables in Line A (c). Readers can contact theauthors ([email protected]) for a colo-red version of this figure for better visualization.

[email protected]


Fig. 5. Partial flowsheet of the FTPH process (texts made illegible for confidentiality reasons).

y12

y11

y11

y30

y31

y33

y27y26

y24

y25

y29

y28

y17 y16

y23 y22

1.1

1.2

4.1

4.2

3.1

3.2

2

5

y18/19/20

y10

4.1

4.2

3.1.

3.22

5

y1

y2

y3

y4

y5

y6

y7

y8

y9

0

Line D Line A/B/C/E

Line prioritization enabled

Line prioritization disabled

y21HS

PR

HS

HS

HS

HS

y32

y14x1

x4

x6

x2x5x3

Controls:0 distributor pressure control1.1 cyclo-pack pressure control (by adjusting no. of cyclones opened)1.2 cyclo-pack underflow control (by adjusting no. of cyclones opened)2 gypsum addition flowrate control3.1 pump box level control (by adjusting CPW)3.2 pump box discharge density control (by adjusting CPW)4.1 pump box level control (by adjusting pump speed)4.2 pump box discharge flowrate control (by adjusting pump speed)5 MFT addition flowrate control

Fig. 6. SDG of the FTPH process. Relationships between variables are shown as thick lines while control signal paths are shown asthin lines. Solid and dashed lines show positive (reinforcement) and negative (reduction) causal relations, respectively. A dottedline is a fault propagation path.

50 F. Yang et al.

P =

⎡

⎢⎢⎢⎢⎢⎢⎣

1 .28 −.28 −.18 .39 .411 .36 −.31 .74 .50

1 −.14 .10 −.111 −.25 −.24

1 .751

⎤

⎥⎥⎥⎥⎥⎥⎦

, (13)

Λ =

⎡

⎢⎢⎢⎢⎢⎢⎣

0 −271 220 32 49 50 −360 2 1 1

0 −357 359 −3600 20 60

0 −10

⎤

⎥⎥⎥⎥⎥⎥⎦

. (14)

For example, the (i, j)-th elements of P and Λ are thecorrelation and the estimated time delay between varia-bles i and j, respectively. Due to the symmetric and anti-symmetric properties of these two matrices, only the ele-ments above the diagonal are computed and shown here.Rather than looking at the correlation matrix in the nu-merical form, it is better to look at the color coded corre-lation matrix as shown in Fig. 7(b) which is a portion ofFig. 7(c).

In Eqn. (13), those values exceeding a pre-set thre-shold (such as 0.4) can be used to validate some of thearcs in the subsection of the SDG shown in Fig. 7(a) asthick lines. For example, correlation from x5 to x6 is 0.75and the time delay is −1, so the arc from x6 to x5 is vali-dated. Similarly, the arcs from x2 to x5, x2 to x6, x1 to x6,and x1 to x5 are also validated. Note that very large timedelays do not make sense because they show the ineffec-tiveness of this measure and the computed correlation isconsidered invalid.

There are still some arcs that have not been valida-ted. Thus we resort to a more general but more complexmeasure—transfer entropy. To obtain a rough insight, ifwe look at the time trends, then we observe that the trendsof x2 and x4 do not look like the other four variables.This shows that, during this period, the relations associa-ted with x2 and x4 are not strong enough to be validated.We have to examine more data with sufficient excitementto check if there is a discernible relation between thesevariables. However, the trends of the other four variableshave apparent similarities that should be definitely due tocausality. In order to reduce the computational load, weextract from the previous data set only 200 minute recordsof continuous data shown in Fig. 8(a).

The transfer entropy measure is used to compute theinformation transfer between these four variables where τis assumed to lie between 1 and 10. When τ is 9, the trans-fer entropies from x1 to x6, from x3 to x5, and from x5 tox6 all reach their individual maximum values: 2.01, 1.48,and 1.41, respectively. Thus the bidirectional relationship

between x5 and x6 is validated, and the reachability fromx3 to x5 is also validated. They are marked in Fig. 8(b) asthick lines.

By combing the above two methods, we found thatmost of the arcs have been validated except from x4 tox6 and from x1 to x5. The former can be explained byprocess knowledge because x6 is the sump level, affectedby quite a few variables due to various feeds. To validatethis, we have to extract the influence of each variable bysignal conditioning, which is an ongoing work in the fra-mework of multivariate systems. The latter is an indirectrelation, i.e., x1 and x5 are related with the relay of x6.Since quantitative information is missing in the SDG mo-del, the transitive property may be weakened due to the at-tenuation during the propagation. The time-delayed cross-correlations from x1 to x6 and from x6 to x5 are 0.41 and0.75, respectively, but from x1 to x5 it is 0.39, less than theabove two. Moreover, the time delays of the above two are5 and 1, respectively, while the indirect one is 49, makingthe validation result of this arc unacceptable. Therefore,data analysis is usually used to only validate direct arcs.

5.3. Using process knowledge to validate data-basedrelations. Starting from data-based methods, for exam-

1 200

1

2

3

4

Samples

Time Trends

(a)x4

x5x2

x6 x1

x3

(b)

Fig. 8. Time trends and validation results by transfer entropy:time trends of four important variables (1: x1; 2: x3; 3:x5; and 4: x6) in the process (a), SDG and validated arcs(thick lines) (b).


ple, correlation analysis, one can obtain the causal ne-twork. According to the procedure in Section 4.2 andEqn. (13), 0.74 is selected, so there is an arc between x5

and x6, and the sign is ‘+’. The corresponding element inΛ is −1 as shown in Eqn. (14). Thus the direction is fromx6 to x5. Similarly, the second arc is x2 → x5, and thethird one is x2 → x6. For the latter, because the two asso-ciated nodes have been used, the consistency test shouldbe undertaken. The time delays associated with the threearcs are 1s that are reliable. Other arcs are built one byone. Note that arc x1 → x5 is ignored because the timedelay is 49, which fails to pass the consistency test. Theobtained SDG is shown in Fig. 9.

However, from reachability analysis based on theSDG, arc x2 → x4 does not make sense and should bedeleted because there is no path connecting them. Fromprocess knowledge, they are parameters on different pi-pes, and in the P&ID of Fig. 5, they are independent. Oneexplanation for this link is that there is another cause orupstream unit resulting in the changes in both of them.This incorrect arc may bring out wrong results if the rootcause is x2 because x4 does not depend on it.

5.4. Application of SDGs in fault propagation ana-lysis. Given a validated SDG, fault propagation can beanalyzed qualitatively based on consistent paths. This isimportant in any conclusive significant HAZOP analysisand also in fault detection and isolation, especially rootcause analysis, where SDGs can help.

In this case, a fault propagation path is shown by thedotted lines in Fig. 6 meaning that the pressure changein the distributor (x1) can affect the parameters in cyclo-packs and pump boxes (x5 and x6) in turn. During the HA-ZOP study, all the consistent paths should be consideredand the corresponding consequences should be evaluated.If the domain of influence of one variable is large, the in-tensity of influence is strong, or the consequence is severe,then some appropriate measures should be taken. In thiscase study, x1 is important because it has wide influenceon almost all downstream variables. Thus the controller onit is well tuned and the line prioritization is implemented

x4 x5

x2

21

x6

x1

5

1

1

Fig. 9. Causal network obtained by correlation analysis. Solidand dashed lines are positive and negative correlations,respectively. Numbers are the estimated time delays.

to lessen the risk. On the other hand, root cause analysisis undertaken online. When the variables on this path areshowing disturbances, then one can trace immediately thestarting point and that may be identified as the root cau-se of fault propagation. The automation of this procedurewill help operators quickly identify the symptom of theabnormal situation.

6. Concluding remarks

In this paper, modeling and fault propagation analysis ba-sed on SDGs are proposed for application to large-scaleindustrial systems. The SDG has its advantage of simpli-city and good matching with process data. It is feasibleeven when no model is available. Thus it has great poten-tial in many process analysis applications. Although bothprocess knowledge and process data can be used to captu-re the causality between variables, neither of them can besolely used to find the true causality without validation.

For the validation of SDGs modeled by pro-cess knowledge, especially by P&IDs, two typicaltemporal-domain data-based methods are described.Cross-correlation is relatively easy to compute and thusvery practical for determining signs and estimating timedelays. As a more general but more costly computatio-nal measure, transfer entropy gives a complementary va-lidation from the information transfer perspective. In bothmethods, arcs can be validated one by one, as matrices Pand Λ are not necessary, and hence the total computatio-nal effort is proportional to the number of arcs. The abovetwo points of view are illustrated in this paper to showthe fusion of two data sources in process modeling. Basedon the validated model, fault propagation analysis can beperformed.

Sometimes, process data are easy to obtain while pro-cess knowledge is insufficient. For this case, data-basedmethods can be applied first to build a causal network,and then reachability check can be applied to validate thisnetwork by process knowledge. Moreover, if the causalnetwork is not important whereas only some causality re-lations are of concern, then this method is more useful.

Acknowledgment

The authors are grateful for the financial aid from NSFC(60736026 and 60904044), NSERC SPG and IRC at theUniversity of Alberta and the Tsinghua National Labo-ratory for Information Science and Technology (TNList)Cross-discipline Foundation. Data access from Dr. Ra-mesh Kadali at Suncor Energy Inc. is gratefully acknow-ledged.

ReferencesAlonso, C.J., Llamas, C., Maestro, J.A. and Pulido, B. (2003).

Diagnosis of dynamic systems: A knowledge model that

52 F. Yang et al.

allows tracking the system during the diagnosis process, inP.W.H. Chung, C. Hinde and M. Ali (Eds.), Developmentsin Applied Artificial Intelligence, Lecture Notes in Artifi-cial Intelligence, Vol. 2718, Springer, Berlin, pp. 208–218.

Bauer, M., Cox, J.W., Caveness, M.H., Downs, J.J. and Thorn-hill, N.F. (2007). Finding the direction of disturban-ce propagation in a chemical process using transfer en-tropy, IEEE Transactions on Control Systems Technology15(1): 12–21.

Bauer, M. and Thornhill, N.F. (2008). A practical method foridentifying the propagation path of plant-wide disturban-ces, Journal of Process Control 18(7–8): 707–719.

Cheng, H., Tikkala, V.-M., Zakharov, A., Myller, T. and Jamsa-Jounela, S.L. (2011). Application of the enhanced dy-namic causal digraph method on a three-layer board ma-chine, IEEE Transactions on Control Systems Technology19(3): 644–655.

Fagarasan, I., Ploix, S. and Gentil, S. (2004). Causal fault de-tection and isolation based on a set-membership approach,Automatica 40(12): 2099–2110.

Fedai, M. and Drath, R. (2005). CAEX—A neutral data exchan-ge format for engineering data, Automation Technology inPractice—ATP International 1(3): 43–51.

Gigi, S. and Tangirala, A.K. (2010). Quantitative analysis of di-rectional strengths in jointly stationary linear multivariateprocesses, Biological Cybernetics 103(2): 119–133.

Górny, B. (2001). Consistency-Based Reasoning in Model-Based Diagnosis, Ph.D. thesis, AGH University of Scienceand Technology, Cracow.

Iri, M., Aoki, K., O’shima, E. and Matsuyama, H. (1979). Analgorithm for diagnosis of system failures in the chemi-cal process, Computers and Chemical Engineering 3(1–4): 489–493.

Jan, A., Jonas, B., Erik, F., Krysander, M. and Lars, N. (2007).Safety analysis of autonomous systems by extended faulttree analysis, International Journal of Adaptive Controland Signal Processing 21(2–3): 287–298.

Ligeza, A. (1996). A note on systematic conflict generation inCA-EN-type causal structures, LAAS Report No. 96317,Laboratory for Analysis and Architecture of Systems, To-ulouse.

Ligeza, A. and Koscielny, J.M. (2008). A new approach tomultiple fault diagnosis: A combination of diagnostic ma-trices, graphs, algebraic and rule-based models. The ca-se of two-layer models, International Journal of AppliedMathematics and Computer Science 18(4): 465–476, DOI:10.2478/v10006-008-0041-8.

Korbicz, J. and Koscielny, J.M. (Eds.) (2010). Modeling, Dia-gnostics and Process Control: Implementation in the Dia-Ster System, Springer-Verlag, Berlin/Heidelberg.

Leyval, L., Gentil, S. and Feray-Beaumont, S. (1994). Model-based causal reasoning for process supervision, Automati-ca 30(8): 1295–1306.

Li, Q. and Racine, J.S. (2007). Nonparametric Econometrics:Theory and Practice, Princeton University Press, Prince-ton, NJ.

Lungarella, M., Ishiguro, K., Kuniyoshi, Y. and Otsu, N. (2007).Methods for quantifying the causal structure of bivariatetime series, International Journal of Bifurcation and Chaos17(3): 903–921.

Maurya, M.R., Rengaswamy, R. and Venkatasubramanian, V.(2003a). A systematic framework for the development andanalysis of signed digraphs for chemical processes: 1. Al-gorithms and analysis, Industrial and Engineering Chemi-stry Research 42(20): 4789–4810.

Maurya, M.R., Rengaswamy, R. and Venkatasubramanian, V.(2003b). A systematic framework for the development andanalysis of signed digraphs for chemical processes: 2. Con-trol loops and flowsheet analysis, Industrial and Engine-ering Chemistry Research 42(20): 4811–4827.

Maurya, M.R., Rengaswamy, R. and Venkatasubramanian, V.(2004). Application of signed digraphs-based analysisfor fault diagnosis of chemical process flowsheets, Engi-neering Applications of Artificial Intelligence 17(5): 501–518.

Montmain, J. and Gentil, S. (2000). Dynamic causal model dia-gnostic reasoning for online technical process supervision,Automatica 36(8): 1137–1152.

Mosterman, P.J. and Biswas, G. (1999). Diagnosis of continu-ous valued systems in transient operating regions, IEEETransactions on Systems, Man, and Cybernetics: Part A29(6): 554–565.

Oyeleye, O.O. and Kramer, M.A. (1988). Qualitative simulationof chemical process systems: Steady-state analysis, AIChEJournal 34(9): 1441–1454.

Palmer, C. and Chung, P.W.H. (1999). Verifying signed directedgraph models for process plants, Computers & ChemicalEngineering 23(Suppl. 1): S394–S391.

Palmer, C. and Chung, P.W.H. (2000). Creating signed direc-ted graph models for process plants, Industrial and Engi-neering Chemistry Research 39(20): 2548–2558.

Pastor, J., Lafon, M., Trave-Massuyes, L., Demonet, J.-F., Doy-on, B. and Celsis, P. (2000). Information processing inlarge-scale cerebral networks: The causal connectivity ap-proach, Biological Cybernetics 82(1): 49–59.

Schreiber, T. (2000). Measuring information transfer, PhysicalReview Letters 85(2): 461–464.

Silverman, B.W. (1986). Density Estimation for Statistics andData Analysis, Chapman and Hall, London/New York, NY.

Staroswiecki, M. (2000). Quantitative and qualitative modelsfor fault detection and isolation, Mechanical Systems andSignal Processing 14(3): 301–325.

Tangirala, A.K., Shah, S.L. and Thornhill, N.F. (2005). PSC-MAP: A new tool for plant-wide oscillation detection, Jo-urnal of Process Control 15(8): 931–941.

Thambirajah, J., Benabbas, L., Bauer, M. and Thornhill, N.F.(2009). Cause-and-effect analysis in chemical processesutilizing XML: Plant connectivity and quantitative pro-cess history, Computers and Chemical Engineering 33(2):503–512.


Yang, F., Shah, S.L. and Xiao, D. (2010a). SDG (Signed Direc-ted Graph) based process description and fault propagationanalysis for a tailings pumping process, Proceedings of the13th IFAC Symposium on Automation in Mining, Mineraland Metal Processing, Cape Town, South Africa, pp. 50–55.

Yang, F., Xiao, D. and Shah, S.L. (2010b). Qualitative fault de-tection and hazard analysis based on signed directed graphsfor large-scale complex systems, in W. Zhang (Ed.), FaultDetection, IN-TECH, Vukovar, pp. 15–50.

Yim, S.Y., Ananthakumar, H.G., Benabbas, L., Horch, A., Drath,R. and Thornhill, N.F. (2006). Using process topology inplant-wide control loop performance assessment, Compu-ters and Chemical Engineering 31(2): 86–99.

Fan Yang earned a B.Eng. degree in 2002 and aPh.D. degree in 2008, both from Tsinghua Uni-versity. From 2009 to 2011 he worked as a visi-ting professor and PDF at the University of Al-berta. Currently he is a lecturer at Tsinghua Uni-versity. He obtained the Young Research PaperAward from the IEEE CSS Beijing Chapter in2006. His research interests include topology mo-deling of large-scale processes, abnormal eventsmonitoring, process hazard analysis, and smart

alarm management.

Sirish L. Shah has been with the Universi-ty of Alberta since 1978, where he currentlyholds the NSERC-Matrikon-Suncor-iCORE Se-nior Industrial Research Chair in Computer Pro-cess Control. He was the recipient of the Albright& Wilson Americas Award of the Canadian So-ciety for Chemical Engineering (CSChE) in reco-gnition of distinguished contributions to chemi-cal engineering in 1989, the Killam Professor in2003, and the D.G. Fisher Award of the CSChE

for significant contributions in the field of systems and control. The mainarea of his current research is fault detection and diagnosis, alarm designand analysis, process and performance monitoring, system identificationas well as design and implementation of soft sensors.

Deyun Xiao is a professor in the Departmentof Automation at Tsinghua University. Since hisgraduation from Tsinghua University in 1970, hehas been with this department as a faculty mem-ber. His research interests cover a wide range oftopics related to the area of control science andengineering, including fault diagnosis and ha-zard assessment, system identification and para-meter estimation, multi-sensor data fusion, intel-ligent measurement and control, intelligent sen-

sing techniques, computer control systems, real-time database systems,industrial monitoring software, enterprise comprehensive automationsystems, and intelligent transportation systems.

Received: 18 January 2011Revised: 20 July 2011

SIGNED DIRECTED GRAPH BASED MODELING AND ITS ... - Home...

Documents

Transcript of SIGNED DIRECTED GRAPH BASED MODELING AND ITS ... - Home...