[Lecture Notes in Computer Science] On the Move to Meaningful Internet Systems: OTM 2014 Conferences...



R. Meersman et al. (Eds.): OTM 2014, LNCS 8841, pp. 202–219, 2014. © Springer-Verlag Berlin Heidelberg 2014

CFS: A Behavioral Similarity Algorithm for Process Models Based on Complete Firing Sequences

Zihe Dong, Lijie Wen, Haowei Huang, and Jianmin Wang

School of Software, Tsinghua University, Beijing 100084, P.R. China [email protected],

{wenlj,jimwang}@tsinghua.edu.cn [email protected]

Abstract. Similarity measurement of process models is an indispensable task in business process management and is widely used in scenarios such as organization merging, user requirement changes, and model repository management. This paper focuses on a behavioral process similarity algorithm named CFS, based on complete firing sequences, which are used to express model behavior. We propose a matching scheme for two sets of complete firing sequences using the A* algorithm with a pruning strategy, and define a new similarity measure. Experiments show that this method gives more reasonable results than existing behavior-based similarity algorithms.

Keywords: Petri net, Similarity measure, Coverability tree, Complete firing sequences, A* search algorithm.

1 Introduction

In recent years, with the development and application of workflow technology, many large enterprises and public management institutions have been using process models to formalize and operate their internal business processes. As business process management matures, a large number of models accumulate. For example, the total number of process models in the OA system of CMCC¹ exceeds 8,000. Business process management technology combines information technology with management science and applies them to operational business processes [1]. This is of great significance to decision change, reform and innovation, and performance improvement of these companies and organizations. Similarity measurement of process models is an indispensable task in business process management [2] and is widely used in scenarios such as organization merging, user requirement changes, and model repository management.

Nowadays, many process similarity algorithms have been proposed to measure the similarity of two process models. They are based on three different aspects: i) task labels, events and other modeling elements; ii) the topological structure of the models; iii) the execution semantics of the models (i.e., behavior) [3]. In this paper, we focus on the similarity motivated by principal transition sequences (PTS) [4] and propose an improved scheme that defines complete firing sequences to express model behavior and defines a new similarity measure. We name this new algorithm CFS.

1 http://www.10086.cn/

A behavioral algorithm named PTS, which measures the similarity of two labeled Petri nets, was proposed by Wang [4]. PTS divides a Petri net’s behavior into three kinds of principal transition sequences: those with limited loops, those with infinite loops, and those without loops. By calculating the similarity of each kind of transition sequence and adding them up by weight, we can get the similarity of two Petri nets. However, this algorithm considers the partial sub-structures with loops and those without loops separately, which breaks the completeness of the execution semantics. As a result, PTS cannot handle Petri nets with loops effectively. Furthermore, because some necessary factors, such as the set-size difference and the average similarity between a sequence and a set of sequences, are absent when computing the similarity of principal transition sequences, the result of PTS may be unconvincing.

Another method named TAR defines the transition adjacency relation (TAR for short) as the gene of a Petri net’s behavior and measures the similarity between two sets of TARs using the Jaccard coefficient [5]. TAR solves the problem of measuring the similarity of two Petri nets with loop structures. However, it cannot deal with non-free-choice constructs.

The BP method defines the strict order relation, the exclusiveness relation and the interleaving order relation among transitions as the behavioral profile of a model, and measures the similarity by adding up the similarities between these relations with weights [6]. BP takes into account multiple order relations between activities and measures the similarity of Petri nets within a time cost of $O(n^3)$. However, the BP method cannot deal with invisible tasks effectively and cannot distinguish whether an interleaving order relation is induced by a parallel structure or by a loop structure.

The SSDT method [7] computes the shortest succession distance between every two tasks in the model to generate a two-dimensional matrix. After dimension formatting, two such matrices can be used to calculate similarity. The method preserves the behavior characteristics of the model intact and runs fast. However, the matrix cannot be used as a kind of model index to accelerate process retrieval.

The CF algorithm [8] translates process models into causal footprints, which contain activity sets, look-back links and look-ahead links. The CF method can thus express models as points in a vector space, and it obtains the similarity by calculating the cosine of the angle between vectors. However, it cannot distinguish XOR, AND, and OR connectors.

In this paper we propose a new process similarity algorithm, namely the CFS algorithm, based on the complete firing sequences derived from the coverability tree of a labeled Petri net. In the experiments discussed later in this paper, this algorithm provides more convincing results than the other algorithms.

The major contributions of this paper are as follows:

1. We manage to express the behavior of a process model by defining complete firing sequences, and propose a method to generate complete firing sequences sets by traversing the coverability tree.


2. Taking several properties into account, the proposed algorithm describes the similarity between two sets of complete firing sequences in a more comprehensive way. The A* search algorithm with an effective pruning strategy is used to get the best match of complete firing sequences.

3. We compare this method with PTS and other popular similarity algorithms in detailed experiments, and propose a new property that a good process similarity method should satisfy.

The remainder of this paper is structured as follows. Section 2 presents related concepts. Section 3 introduces CFS method in detail. Section 4 evaluates CFS and compares it with other algorithms. Section 5 summarizes the work and sketches some future work.

2 Preliminaries

In this section we introduce the definitions about Petri net and coverability tree related to CFS which include Petri net system, labeled Petri net system, firing sequence, complete firing sequence and coverability tree.

2.1 Petri Net

Definition 1 (Petri net system [9]): A Petri net system is a quad-tuple $\Sigma = (P, T, F, M_0)$:

─ $P$ is a finite set of places;
─ $T$ is a finite set of transitions such that $P \cup T \neq \emptyset$ and $P \cap T = \emptyset$;
─ $F \subseteq (P \times T) \cup (T \times P)$ is a set of flow relations;
─ $M_0$ is the initial marking, $M_0 : P \to \mathbb{N}$, where $\mathbb{N}$ is the set of non-negative integers.

The algorithm proposed in this paper is based on labeled Petri net, and a labeling function is used to assign names to transitions, which maps transitions to real business activities associated with them.

Definition 2 (Labeled Petri net system [4]): A labeled Petri net system is a six-tuple $\Sigma = (P, T, F, M_0, A, L)$, where the quad-tuple $(P, T, F, M_0)$ represents the Petri net system. The others are defined as follows:

─ $A$ is a finite set of activity names;
─ $L$ is the mapping from the transition set to $A \cup \{\varepsilon\}$, where $\varepsilon$ is the empty label of an invisible activity or a silent activity.

In this paper we measure the similarity of Petri nets for real business processes, which always satisfy the definition of a workflow net.

Definition 3 (Workflow net [10]): A Petri net system $\Sigma = (P, T, F, M_0)$ is a workflow net (WF-net for short) with two special places $i$ and $o$ if it satisfies:

─ $i \in P$ is the unique source place such that $\bullet i = \emptyset$;
─ $o \in P$ is the unique sink place such that $o \bullet = \emptyset$;
─ If we add a transition $t^*$ satisfying $\bullet t^* = \{o\}$ and $t^* \bullet = \{i\}$, the Petri net we get is strongly connected.

Definition 4 (Firing sequence [11]): Consider a given workflow net $\Sigma = (P, T, F, M_0)$ with the source place $i$ and the sink place $o$. Let $t \in T$ and let $M$ be a marking of the workflow net. The transition $t$ is enabled at $M$ if for any $p \in \bullet t$, $M(p) \geq 1$:

─ $M [t\rangle M'$ indicates that $t$ is fired at $M$ and the marking is changed to $M'$. Namely, for any $p \in \bullet t \setminus t \bullet$, $M'(p) = M(p) - 1$; for any $p \in t \bullet \setminus \bullet t$, $M'(p) = M(p) + 1$; and for any other place, $M'(p) = M(p)$;
─ $\sigma = \langle t_1, t_2, \dots, t_n \rangle \in T^*$ is a firing sequence from state $M_1$ to $M_{n+1}$, indicated by $M_1 [\sigma\rangle M_{n+1}$, if there exist markings $M_2, \dots, M_n$ such that $M_1 [t_1\rangle M_2 [t_2\rangle \dots M_n [t_n\rangle M_{n+1}$.

Definition 5 (Complete firing sequence [11]): For a given workflow net $\Sigma = (P, T, F, M_0)$ with the source place $i$ and the sink place $o$, let $\sigma \in T^*$. Let $M_0$ be the initial marking, with $M_0(i) = 1$ and $M_0(p) = 0$ if $p \neq i$. Let $M_{end}$ be the ending (or sink) marking, with $M_{end}(o) = 1$ and $M_{end}(p) = 0$ if $p \neq o$. If $M_0 [\sigma\rangle M_{end}$, then $\sigma$ is defined as a complete firing sequence.
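The firing rule and the notion of a complete firing sequence can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the dictionary representation of markings, and the tiny two-transition net are assumptions made for the example, and every arc is assumed to carry weight 1.

```python
from typing import Dict, Iterable, List, Set, Tuple

Flow = Set[Tuple[str, str]]  # arcs (x, y) of the flow relation F


def preset(node: str, flow: Flow) -> Set[str]:
    """All x with (x, node) in F."""
    return {x for (x, y) in flow if y == node}


def postset(node: str, flow: Flow) -> Set[str]:
    """All y with (node, y) in F."""
    return {y for (x, y) in flow if x == node}


def enabled(marking: Dict[str, int], t: str, flow: Flow) -> bool:
    """t is enabled if every input place holds at least one token."""
    return all(marking.get(p, 0) >= 1 for p in preset(t, flow))


def fire(marking: Dict[str, int], t: str, flow: Flow) -> Dict[str, int]:
    """Return the marking reached by firing t (arc weights assumed to be 1)."""
    if not enabled(marking, t, flow):
        raise ValueError(f"transition {t!r} is not enabled")
    pre, post = preset(t, flow), postset(t, flow)
    new_marking = dict(marking)
    for p in pre - post:   # places that only feed t lose a token
        new_marking[p] = new_marking.get(p, 0) - 1
    for p in post - pre:   # places that are only fed by t gain a token
        new_marking[p] = new_marking.get(p, 0) + 1
    return new_marking


def is_complete_firing_sequence(sigma: Iterable[str], places: List[str],
                                flow: Flow, source: str = "i",
                                sink: str = "o") -> bool:
    """Check that sigma leads from one token in `source` to one token in `sink`."""
    marking = {p: int(p == source) for p in places}
    for t in sigma:
        if not enabled(marking, t, flow):
            return False
        marking = fire(marking, t, flow)
    return marking == {p: int(p == sink) for p in places}


# Hypothetical workflow net: i -> t1 -> p1 -> t2 -> o
PLACES = ["i", "p1", "o"]
FLOW = {("i", "t1"), ("t1", "p1"), ("p1", "t2"), ("t2", "o")}
```

In this toy net, `["t1", "t2"]` is a complete firing sequence, while `["t1"]` is only a firing sequence because it stops before the sink marking.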

2.2 Coverability Tree

This paper uses the coverability tree to analyze the behavior of a Petri net. Generating the coverability tree essentially iterates over all reachable markings; the result is a tree with markings as nodes and transition firings as edges. If the Petri net is unbounded, the set of reachable markings is infinite, which would make the coverability tree endless. To resolve this problem, a special symbol $\omega$ is introduced to represent the situation that a place may hold an unlimited number of tokens. For any positive integer $n$, it satisfies $\omega > n$, $\omega \pm n = \omega$ and $\omega \geq \omega$. The algorithm for coverability tree generation is given in [12].
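The behavior of the symbol ω can be mimicked with floating-point infinity. The sketch below assumes the usual coverability-tree rule, in which a new marking that strictly covers an ancestor marking gets ω in every place where it is larger; the paper defers the construction to [12], so the helper `accelerate` and the tuple representation of markings are illustrative assumptions.

```python
import math
from typing import Tuple, Union

Tokens = Union[int, float]
OMEGA = math.inf  # stands in for the symbol ω: OMEGA > n and OMEGA ± n == OMEGA


def accelerate(new: Tuple[Tokens, ...],
               ancestor: Tuple[Tokens, ...]) -> Tuple[Tokens, ...]:
    """Replace strictly growing components by ω if `new` strictly covers `ancestor`."""
    covers = all(a >= b for a, b in zip(new, ancestor))
    if covers and new != ancestor:
        return tuple(OMEGA if a > b else a for a, b in zip(new, ancestor))
    return new
```

For example, a marking `(1, 2, 0)` reached below an ancestor `(1, 1, 0)` becomes `(1, OMEGA, 0)`, which keeps the tree finite for an unbounded place.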

3 The CFS Algorithm

The procedure of the CFS algorithm is shown in Figure 1. First, we construct the set of complete firing sequences on the coverability tree, which involves loop identification, a counting method, and a construction method for complete firing sequences. Then, we calculate model similarity from the complete firing sequences, which involves applying the A* algorithm to construct an optimal match between two sets of complete firing sequences, and the similarity formula.


Fig. 1. The procedure of CFS algorithm (steps shown: loop identification and counting; construct complete firing sequences; construct mapping between sequences; computing model similarity)

3.1 Constructing the Set of Complete Firing Sequences

The set of complete firing sequences is infinite for a model with loop structures and cannot be used to calculate similarity directly. This section presents a method to construct a finite set of complete firing sequences by bounding the number of loop executions. First we define a loop in a Petri net and its corresponding entrance place.

Definition 6 (Loop): Let $\Sigma = (P, T, F, M_0)$ be a Petri net. A loop $NC = (x_1, \dots, x_n) \in (P \cup T)^*$ satisfies:

─ $n > 2$ and for any $i, j \in \{1, \dots, n-1\}$ with $i \neq j$, $x_i \neq x_j$;
─ $(x_i, x_{i+1}) \in F$ for any integer $i \in \{1, \dots, n-1\}$;
─ $x_1 = x_n$.

Definition 7 (Entrance places of a loop): Let the Petri net $\Sigma = (P, T, F, M_0)$ be a WF-net and one of its loops be $NC = (x_1, \dots, x_n)$. An entrance place $p \in P$ of $NC$ satisfies:

─ There is an integer $i$ ($1 \leq i \leq n$) such that $p = x_i$;
─ There is a transition $t \in T$ such that $p \in t \bullet$ and, for any integer $i$ where $1 \leq i \leq n$, we have $t \neq x_i$.

Entrance places are used to identify loop structures. We consider that several loop structures belong to one equivalent loop class if they have the same entrance place. We define equivalent loop class as follows:

Definition 8 (Equivalent loop class): Let the Petri net $\Sigma = (P, T, F, M_0)$ be a workflow net. If the loops $N_1, N_2, \dots, N_m$ have the same entrance place $p$, they belong to one equivalent loop class. If no other loop has this entrance place, the equivalent loop class can be expressed as $C_p = \{N_1, N_2, \dots, N_m\}$.

We will find equivalent loop class in the coverability tree to express loop unrolling in a model’s behavior.

There are three kinds of special nodes in a coverability tree, i.e., anchor, old, dead-end [12]. In the coverability tree, the loop structure is the path from the anchor node to the old node and we define this kind of path as follows:


Definition 9 (Loop subpath, Loop marking): Let $\Sigma = (P, T, F, M_0)$ be a Petri net and its corresponding coverability tree be $CT$. In $CT$, each old node has only one corresponding anchor node, and the path from the anchor node to the old node expresses the loop structure in the Petri net. This path is called a loop subpath, expressed by anchorMarking $\xrightarrow{t_1} \dots \xrightarrow{t_n}$ oldMarking (for any $1 \leq i \leq n$, $t_i$ is on the path from the anchor node to the old node), and oldMarking is called the loop marking, expressed by $m_l$.

According to the definition of the coverability tree, the equivalent loop class in the tree is the set of loop subpaths with the same loop marking. So in the coverability tree, we use $C_{m_l}$ to express an equivalent loop class. We propose a method to compute the execution times of an equivalent loop class in a coverability tree.

Let $p: m_l \xrightarrow{t_1} \dots \xrightarrow{t_i} m_l$ ($i \geq 1$) be a loop subpath in a coverability tree, and let the current firing sequence obtained by traversing the tree be $\sigma$. If $t_1 t_2 \dots t_i$ is a (not necessarily contiguous) subsequence of $\sigma$, we consider the traversal to have gone through the loop subpath in the generation of $\sigma$. To calculate the number of times the loop subpath $p$ is traversed, we use $t_{p\_last}$ to express the last transition of $p$. Each time the traversal goes through $t_{p\_last}$, it goes through $p$ once. Using $path\_count(p)$ to express the number of times $p$ is traversed and $edge\_count(t_{p\_last})$ to express the number of times $t_{p\_last}$ is traversed, we have

$path\_count(p) = edge\_count(t_{p\_last})$ (1)

Assume $\mathcal{P}$ is the set of loop subpaths in the coverability tree and $path\_count(p)$ is the number of times the loop subpath $p$ is traversed in the current firing sequence $\sigma$. Then the execution times of $C_{m_l}$ in $\sigma$, expressed by $cycle\_count(C_{m_l})$, are defined as the sum over all loop subpaths whose loop marking is $m_l$:

$cycle\_count(C_{m_l}) = \sum_{p \in \mathcal{P},\ p \text{ ends at } m_l} path\_count(p)$ (2)

For the Petri net shown in Figure 2(a), the equivalent loop class is C (P , t , P , t , P , t , P , t , P ), (P , t , P , t , P , t , P , t , P ) . Correspondingly, in the coverability tree of Figure 2(b), the equivalent loop class $C_{(0,1,0,0,0,0,0)}$ includes two loop subpaths $p_1$ and $p_2$, each leading from the marking (0,1,0,0,0,0,0) back to (0,1,0,0,0,0,0). If the current firing sequence is t t t t t t t t t t t t t t t , the execution times of the equivalent loop class are $cycle\_count(C_{(0,1,0,0,0,0,0)}) = path\_count(p_1) + path\_count(p_2) = edge\_count(t_{p_1\_last}) + edge\_count(t_{p_2\_last}) = 2 + 1 = 3$.

When constructing complete firing sequences by traversing a coverability tree, we use a user-defined parameter $k$ to limit the maximum execution times of an equivalent loop class. For each firing sequence, if the execution times of some equivalent loop class exceed $k$, the firing sequence is discarded; otherwise it continues to be expanded. We get a complete firing sequence when the execution times of each equivalent loop class are not bigger than $k$.
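The counting in Equations 1 and 2 and the bound k can be sketched as follows. The representation is an assumption made for illustration: a traversal is recorded as the list of coverability-tree edge identifiers it passed through, and each loop subpath is reduced to its loop marking and the identifier of its last edge. The names `LoopSubpath`, `cycle_counts`, `within_bound` and the edge labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, NamedTuple, Tuple


class LoopSubpath(NamedTuple):
    loop_marking: Tuple[int, ...]  # marking shared by the anchor and old node
    last_edge: str                 # identifier of the subpath's last tree edge


def cycle_counts(traversed_edges: List[str],
                 loop_subpaths: List[LoopSubpath]) -> Dict[Tuple[int, ...], int]:
    """Execution times of every equivalent loop class in one traversal."""
    edge_count = Counter(traversed_edges)
    totals: Dict[Tuple[int, ...], int] = defaultdict(int)
    for p in loop_subpaths:
        # Eq. 1: path_count(p) equals the count of p's last edge;
        # Eq. 2: sum over all subpaths sharing the same loop marking.
        totals[p.loop_marking] += edge_count[p.last_edge]
    return dict(totals)


def within_bound(traversed_edges: List[str],
                 loop_subpaths: List[LoopSubpath], k: int) -> bool:
    """True if no equivalent loop class is executed more than k times."""
    counts = cycle_counts(traversed_edges, loop_subpaths)
    return all(c <= k for c in counts.values())


# Hypothetical data: two loop subpaths of one class, traversed 2 and 1 times.
M_LOOP = (0, 1, 0, 0, 0, 0, 0)
SUBPATHS = [LoopSubpath(M_LOOP, "e_a"), LoopSubpath(M_LOOP, "e_b")]
TRAVERSAL = ["e_1", "e_a", "e_2", "e_a", "e_3", "e_b", "e_4"]
```

With these inputs the class count is 2 + 1 = 3, so the traversal would be discarded for k = 2 and kept for k = 3.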


(a) A sample Petri net

(b) The coverability tree of (a)

Fig. 2. A sample Petri net and its coverability tree

3.2 Similarity Algorithm Based on Two Sets of Complete Firing Sequences

According to the definition of a labeled Petri net, each transition has been assigned an activity name, namely a label. Let $L(\sigma)$ be the labeled sequence determined by $\sigma$; then the set of labeled sequences can be derived directly from the set of complete firing sequences. For instance, one complete firing sequence $\sigma$ of the Petri net in Figure 3 has the corresponding labeled sequence $L(\sigma) =$ XYZYYZ.

Fig. 3. A sample labeled Petri net LΣ


Firstly, we compare two complete firing sequences by the longest common subsequence method. Let $\sigma$ and $\sigma'$ be two firing sequences and $lcs(L(\sigma), L(\sigma'))$ be the longest common subsequence of their labeled sequences; then the similarity of the two firing sequences is:

$sims(\sigma, \sigma') = \frac{|lcs(L(\sigma), L(\sigma'))|}{\max(|L(\sigma)|, |L(\sigma')|)}$ (3)
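A sequence similarity of this kind can be sketched as below, assuming that Equation 3 normalizes the length of the longest common subsequence by the length of the longer labeled sequence. Labeled sequences are represented as strings of one-character labels, and returning 1.0 for two empty sequences is a convention chosen for this sketch only.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sims(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the length of the longer labeled sequence."""
    if not a and not b:
        return 1.0  # assumption: two empty sequences count as identical
    return lcs_length(a, b) / max(len(a), len(b))
```

For the labeled sequence XYZYYZ from the text and a hypothetical second sequence XYZ, the longest common subsequence is XYZ, so `sims("XYZYYZ", "XYZ")` evaluates to 3/6 = 0.5.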

The similarity between two sets of complete firing sequences is defined as follows. It takes both the similarity between elements and the radix distance (the difference in set sizes) into consideration, defines a mapping between the two sets, and applies the average similarity to the complete firing sequences that cannot be mapped to the other set.

Let $A$ and $B$ be two sets of complete firing sequences with $|A| \leq |B|$, and let $avg(A, \sigma')$ represent the similarity between the set of complete firing sequences $A$ and a complete firing sequence $\sigma'$; then we have

$avg(A, \sigma') = \frac{\sum_{\sigma \in A} sims(\sigma, \sigma')}{|A|}$ (4)

Let $dis(A, B)$ be the radix distance between two sets of complete firing sequences $A$ and $B$ where $|A| \leq |B|$; then we have

$dis(A, B) = 1 - \frac{|A|}{|B|}$ (5)

Let $M: A \to B$ be an injective mapping that maps the complete firing sequences in $A$ to the complete firing sequences in $B$, let $cod(M)$ be the codomain of $M$, and let $B_1 = B - cod(M)$ and $n \geq 1$; then the similarity between $A$ and $B$ is defined as follows. In the formula, the user-defined integer parameter $n$ decides the degree of influence of $dis(A, B)$ on $simc(A, B)$.

$simc(A, B) = \frac{\sum_{(\sigma, \sigma') \in M} sims(\sigma, \sigma') + \sum_{\sigma' \in B_1} avg(A, \sigma')}{|B|} \times \left(1 - \frac{dis(A, B)}{n}\right)$ (6)

CFS uses the set of complete firing sequences to describe the behavior of a labeled Petri net, so the similarity of two labeled Petri nets can be reduced to the similarity between two sets of complete firing sequences. Let $L\Sigma_A$ and $L\Sigma_B$ be the two labeled Petri nets and let the corresponding sets of complete firing sequences be $A$ and $B$. Assume $|A| \leq |B|$; then the similarity of the two Petri nets is:

$Sim(L\Sigma_A, L\Sigma_B) = simc(A, B)$ (7)
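The set-level similarity for a given mapping can be sketched as follows. The sketch rests on two assumptions: the pairwise similarities are supplied as a matrix `s` with `s[i][j]` the similarity between the i-th sequence of A and the j-th sequence of B, and Equation 6 has the form (mapped similarities plus average similarities of unmapped sequences) divided by |B|, multiplied by (1 - dis(A, B)/n). The function names and the sample matrix are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]  # s[i][j] = sims(A[i], B[j]), with |A| <= |B|


def avg(s: Matrix, j: int) -> float:
    """Average similarity between the whole set A and the j-th sequence of B (Eq. 4)."""
    return sum(row[j] for row in s) / len(s)


def dis(s: Matrix) -> float:
    """Radix distance 1 - |A|/|B| (Eq. 5)."""
    return 1 - len(s) / len(s[0])


def simc(s: Matrix, mapping: Dict[int, int], n: int = 1) -> float:
    """Similarity of two sequence sets under an injective mapping A -> B."""
    size_a, size_b = len(s), len(s[0])
    targets = set(mapping.values())
    if size_a > size_b or len(mapping) != size_a or len(targets) != size_a:
        raise ValueError("mapping must be a total injective mapping with |A| <= |B|")
    mapped = sum(s[i][j] for i, j in mapping.items())
    unmapped = sum(avg(s, j) for j in range(size_b) if j not in targets)
    return (mapped + unmapped) / size_b * (1 - dis(s) / n)


# Hypothetical similarities for |A| = 2 and |B| = 3.
S = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5]]
```

Here `simc(S, {0: 0, 1: 1}, n=10)` gives (1.0 + 1.0 + 0.25) / 3 × (1 - (1/3)/10) = 0.725; a larger n weakens the penalty for the size difference.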

Fig. 4. Another labeled Petri net LΣ


Fig. 5. Injective mapping between LΣ and LΣ

Take the labeled Petri nets in Figure 3 and Figure 4 as an example. Let the upper bound of the execution times of any equivalent loop class be 2; then we can get the sets of complete firing sequences of the two models respectively. We assume the injective mapping between the two sets is as Figure 5 indicates, so that $M$ contains two pairs of sequences and one sequence of the larger set remains unmapped. Let $n$ in Equation 6 be 10; then the similarity of the two labeled Petri nets evaluates to 0.426.

3.3 Optimal Mapping between Two Sets of Complete Firing Sequences

We can get different similarity values when applying different mapping plans between the two sets of complete firing sequences in the similarity calculation of two labeled Petri nets. We define the similarity of two labeled Petri nets to be the maximum value of all the possible similarities under different mapping plans, and the corresponding mapping is called the optimal mapping.

From Equation 6, we can infer that maximizing the sum $\sum_{(\sigma, \sigma') \in M} sims(\sigma, \sigma') + \sum_{\sigma' \in B_1} avg(A, \sigma')$ maximizes the similarity between two Petri nets, so we need to find the optimal mapping plan that maximizes this sum. The time complexity of the brute-force method is $O(n!)$, and the result calculated by a greedy algorithm is not guaranteed to be the global optimum. Therefore, we use the A* algorithm to construct the optimal mapping.

The construction of the optimal mapping can be treated as the construction of a mapping tree. As Figure 6 shows, each node represents an intermediate result $(M, A_1, B_1)$, where $M$ is the current partial mapping, $A_1$ is the set of unmapped sequences in the set of complete firing sequences $A$, and $B_1$ is the set of unmapped sequences in the set of complete firing sequences $B$. In the handling procedure of the algorithm, for each leaf node, the evaluation function $f(M)$ contains two parts:

$f(M) = g(M) + h(M)$ (8)


In Equation 8, $g(M)$ is the value of the current partial mapping $M$ and $h(M)$ is an upper bound on the value that can be gained from the unmapped sequences, so $f(M)$ is an upper bound on all the mapping solutions extended from $M$. The algorithm extends the leaf node with the maximum $f(M)$ each time, and it ends when the selected leaf node contains a complete mapping, namely the optimal solution. In this procedure only some of the nodes are extended, which prunes the search and improves efficiency. The pseudo code of the algorithm is as follows.

input: two sets of complete firing sequences A, B, with |A| ≤ |B|
output: f(M) corresponding to the optimal mapping M
AStarAlgorithmforMap(A, B)
  open := { {(σ, σ')} | σ ∈ A, σ' ∈ B }
  while open ≠ ∅ do
  begin
    M := arg max over M' ∈ open of g(M') + h(M')
    open := open \ {M}
    if dom(M) = A then
      return g(M) + h(M)
    else
    begin
      select σ ∈ A such that σ ∉ dom(M)
      foreach σ' ∈ B such that σ' ∉ cod(M)
      begin
        M' := M ∪ {(σ, σ')}
        open := open ∪ {M'}
      end
    end
  end

This algorithm first initializes the mapping collection open, which represents the collection of current leaf nodes. In each iteration of the loop, the algorithm selects the mapping $M$ with the maximum $g(M) + h(M)$ and removes $M$ from open. If the domain of $M$ equals $A$, then $M$ is the optimal mapping and the algorithm returns $g(M) + h(M)$. Otherwise, the algorithm chooses an unmapped sequence $\sigma$ from $A$ and, for each unmapped sequence $\sigma'$ in $B$, constructs a mapping pair $(\sigma, \sigma')$, adds it to a new mapping $M'$ derived from $M$, and adds $M'$ to open.


Fig. 6. The execution procedure of A* algorithm in tree structure

Here we give the definitions of $g(M)$ and $h(M)$, which are used for pruning. The A* algorithm calculates the optimal mapping and finally returns the maximum of $\sum_{(\sigma, \sigma') \in M} sims(\sigma, \sigma') + \sum_{\sigma' \in B_1} avg(A, \sigma')$ in Equation 6. The evaluation function $f(M) = g(M) + h(M)$ is an upper bound of $\sum_{(\sigma, \sigma') \in M'} sims(\sigma, \sigma') + \sum_{\sigma' \in B - cod(M')} avg(A, \sigma')$ over all possible complete mappings $M'$ extended from $M$. Here $g(M)$ is the current partial solution, and $h(M)$ is the upper bound of all solutions of the partial mappings that will be extended.

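Assume that the mapping of the current node is $M$; then the set of unmapped complete firing sequences in $A$ is defined as $A_1 = A - dom(M)$, and the set of unmapped complete firing sequences in $B$ is defined as $B_1 = B - cod(M)$. Assume $|A| \leq |B|$; then the definition of $g(M)$ is as follows.

$g(M) = \begin{cases} \sum_{(\sigma, \sigma') \in M} sims(\sigma, \sigma'), & A_1 \neq \emptyset \\ \sum_{(\sigma, \sigma') \in M} sims(\sigma, \sigma') + \sum_{\sigma' \in B_1} avg(A, \sigma'), & A_1 = \emptyset \end{cases}$ (9)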


The value of $g(M)$ is the similarity of the current partial mapping, which is the determined part. When $A_1 \neq \emptyset$ and there are still unmapped sequences in $A$, $g(M)$ is the sum of the similarities of the mapped sequence pairs. If $A_1 = \emptyset$ and all the sequences in $A$ have been mapped, $M$ is a complete solution with all values determined, and $g(M)$ is equal to $\sum_{(\sigma, \sigma') \in M} sims(\sigma, \sigma') + \sum_{\sigma' \in B_1} avg(A, \sigma')$ for this $M$.

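Assume that the mapping of the current node is $M$; then the set of unmapped complete firing sequences in $A$ is $A_1 = A - dom(M)$ and the set of unmapped complete firing sequences in $B$ is $B_1 = B - cod(M)$. If $|A| \leq |B|$, let $S = \{B_2 \mid B_2 \subseteq B_1 \text{ and } |B_2| = |B| - |A|\}$; then the definition of $h(M)$ is as follows.

$h(M) = \begin{cases} \sum_{\sigma \in A_1} \max_{\sigma' \in B_1} sims(\sigma, \sigma') + \max_{B_2 \in S} \sum_{\sigma' \in B_2} avg(A, \sigma'), & A_1 \neq \emptyset \\ 0, & A_1 = \emptyset \end{cases}$ (10)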

If $A_1 \neq \emptyset$, $h(M)$ consists of two parts. The first part is the sum, over each element of $A_1$, of its maximum similarity to some element of $B_1$. The second part is the sum of the similarities between each element of $B_2$ and the set $A$, where $B_2$ is a subset of $B_1$ of size $|B| - |A|$. If there are many possible $B_2$, the one with the maximum sum is chosen. If $A_1 = \emptyset$, all the sequences in $A$ have been mapped and no other sequence needs to be extended, so $M$ is a complete solution with no estimated part and $h(M) = 0$.
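The search can be sketched with a priority queue. This is an illustrative sketch under several assumptions, not the authors' code: similarities come as a matrix `s` with `s[i][j]` for the i-th sequence of A and the j-th sequence of B, a partial mapping is a tuple whose i-th entry is the index in B assigned to the i-th sequence of A, the sequences of A are mapped in index order, the search starts from the empty mapping rather than from single pairs, and the objective is the sum of mapped similarities plus the average similarities of unmapped sequences of B, as in the reconstruction of Equations 9 and 10.

```python
import heapq
from itertools import count
from typing import Dict, List, Tuple


def optimal_mapping(s: List[List[float]]) -> Tuple[float, Dict[int, int]]:
    """Best-first search for the mapping that maximizes g; requires 1 <= |A| <= |B|."""
    size_a, size_b = len(s), len(s[0])
    avg = [sum(row[j] for row in s) / size_a for j in range(size_b)]
    spare = size_b - size_a  # number of sequences of B left unmapped at the end

    def g(mapping: Tuple[int, ...]) -> float:
        total = sum(s[i][j] for i, j in enumerate(mapping))
        if len(mapping) == size_a:  # complete: add avg of the unmapped part of B
            total += sum(avg[j] for j in range(size_b) if j not in mapping)
        return total

    def h(mapping: Tuple[int, ...]) -> float:
        if len(mapping) == size_a:
            return 0.0
        free = [j for j in range(size_b) if j not in mapping]
        # Best conceivable partner for every sequence of A still unmapped.
        first = sum(max(s[i][j] for j in free) for i in range(len(mapping), size_a))
        # Best subset of size |B| - |A|: take the largest averages among free ones.
        second = sum(sorted((avg[j] for j in free), reverse=True)[:spare])
        return first + second

    tie = count()  # tie-breaker so the heap never compares tuples of indices
    heap = [(-h(()), next(tie), ())]
    while heap:
        neg_f, _, mapping = heapq.heappop(heap)
        if len(mapping) == size_a:  # first complete mapping popped is optimal
            return -neg_f, dict(enumerate(mapping))
        for j in range(size_b):
            if j not in mapping:
                child = mapping + (j,)
                heapq.heappush(heap, (-(g(child) + h(child)), next(tie), child))
    raise ValueError("no mapping found")


# Hypothetical similarities: a greedy choice of the pair 0.9 is not optimal here.
S = [[0.9, 0.2, 0.4],
     [0.8, 0.7, 0.1]]
```

For this matrix the search returns the mapping `{0: 2, 1: 1}` with value 0.4 + 0.7 + 0.85 = 1.95, whereas starting greedily from the largest entry 0.9 would lead to at most 1.85. Because `heapq` is a min-heap, the evaluation values are stored negated.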

The paper by Peter E. Hart [13] has proven the admissibility of the A* search algorithm, which ensures that the A* search algorithm gives the optimal solution. The admissibility in this problem is defined as follows.

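Theorem 1. For any partial mapping $M$ generated by the A* search algorithm, let $S_M$ be the collection of all the complete solutions extended from $M$, and let $M^* = \arg\max_{M' \in S_M} g(M')$. Then A* is admissible if $h(M) \geq g(M^*) - g(M)$, namely if $h(M)$ is an upper bound of the partial solution obtained from mapping the remaining sequences.

Proof: Let the complete solution of the A* algorithm be an injective mapping $A \to B$. For any partial mapping $M$, we have $A_1 = A - dom(M)$ and $B_1 = B - cod(M)$.

If $A_1 \neq \emptyset$, let $M^* = M \cup M_1$ and $B_3 = B - cod(M^*)$, which represents the set of firing sequences that eventually remain unmapped. Then we have $g(M^*) - g(M) = \sum_{(\sigma, \sigma') \in M_1} sims(\sigma, \sigma') + \sum_{\sigma' \in B_3} avg(A, \sigma')$. Since $dom(M_1) = A_1$ and $cod(M_1) \subseteq B_1$, we have $\sum_{\sigma \in A_1} \max_{\sigma' \in B_1} sims(\sigma, \sigma') \geq \sum_{(\sigma, \sigma') \in M_1} sims(\sigma, \sigma')$. Let $S = \{B_2 \mid B_2 \subseteq B_1 \text{ and } |B_2| = |B| - |A|\}$; then $B_3 \in S$, so we have $\max_{B_2 \in S} \sum_{\sigma' \in B_2} avg(A, \sigma') \geq \sum_{\sigma' \in B_3} avg(A, \sigma')$. Therefore, $h(M) = \sum_{\sigma \in A_1} \max_{\sigma' \in B_1} sims(\sigma, \sigma') + \max_{B_2 \in S} \sum_{\sigma' \in B_2} avg(A, \sigma') \geq \sum_{(\sigma, \sigma') \in M_1} sims(\sigma, \sigma') + \sum_{\sigma' \in B_3} avg(A, \sigma') = g(M^*) - g(M)$.

The partial mapping $M$ is the complete solution if $A_1 = \emptyset$, namely $M = M^*$. In this condition, $h(M) = 0$ and $g(M^*) = g(M)$, so $h(M) = g(M^*) - g(M) = 0$, which satisfies $h(M) \geq g(M^*) - g(M)$.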


4 Experimental Evaluation

By measuring the satisfaction rate of the triangle inequality on real process models, we found that setting the upper bound k to 2 and n in the similarity formula to 1000 gives the highest satisfaction rate of the triangle inequality. We use these recommended values to investigate the pruning strategy in A* and to compare CFS with popular similarity algorithms. Our model set contains 124 models from TC, 115 models from DG, 592 reference models from SAP and 200 models generated by BeehiveZ [14]. The experiments were run on Eclipse Platform 3.6.1 with JDK 1.7.0_25 and 4 GB of memory assigned. The machine runs Red Hat Enterprise Linux Server (Santiago) on an Intel(R) Xeon(R) E5-2640 CPU with a clock frequency of 2.50 GHz.

4.1 Experiments on Pruning Strategy of A* Algorithm

In terms of space, we examine the rate of model pairs that can be calculated before and after applying the pruning strategy, i.e., the ratio of calculated model pairs to total pairs. The result is shown in Figure 7. Without the pruning strategy, the rate reaches at most 69.26%, which results from the factorial space complexity. With the pruning strategy, the rate reaches 100%, because only the partial mappings that may lead to the optimal mapping are extended, which saves storage space.

Fig. 7. Model pair ratio before and after pruning in A* (bar chart; x-axis: model sets DG, TC, SAP, BeehiveZ; y-axis: model pair ratio, 0% to 120%; series: before pruning, after pruning)


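Fig. 8. Average time before and after pruning in A* (ms) (bar chart; x-axis: model sets DG, TC, SAP, BeehiveZ; y-axis: average time, 0 to 1800 ms; series: before pruning, after pruning)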

In terms of time, we measure the average calculation cost over the calculated model pairs before and after applying the pruning strategy. The result is shown in Figure 8, from which we can infer that applying the pruning strategy significantly reduces the execution time.

4.2 Comparison with Other Similarity Algorithms Based on Model Behavior

Wang et al. [7] give five properties that process similarity algorithms should satisfy. These properties agree with real business practice and human intuition, and can serve, to some degree, as criteria for assessing similarity algorithms. We have compared the CFS algorithm with other behavioral process similarity algorithms on these five properties, and the result is shown in Table 1.

Table 1. Evaluation result of similarity algorithms on the five properties

Property SSDT PTS TAR CF BP CFS
Property 1 √ √ √ √
Property 2 √ √ √
Property 3 √ √ √ √ √
Property 4 √ √ √
Property 5 √ √ √ √ √

[Fig. 8 bar chart; x-axis: model sets (DG, TC, SAP, BeehiveZ); y-axis: average time (ms), 0 to 1800; series: before pruning, after pruning]


From Table 1, we can conclude that PTS, TAR, CF, and BP each fail to satisfy at least one property, and that only SSDT and CFS satisfy all five. We therefore compare the algorithms further on the following new property.

We propose a new property based on the analysis of real business process models. Let Σ1 and Σ2 be two business process models whose complete firing sequences fall into two kinds, A and B. Assume that A and B correspond to two different kinds of business, so that the similarity between a sequence of kind A and a sequence of kind B is very low, whereas the similarity between sequences of the same kind is very high. In Σ1, sequences of kind A are the majority and those of kind B the minority; in Σ2 the proportions are reversed. After mapping the complete firing sequences of the two models, only a few mapped pairs have high similarity and most mapped pairs have low similarity, so we consider the similarity between the two models to be low. Because the two kinds of business occur in different proportions in the two models, we call this property the unbalance of business behavior distribution.
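The effect described by this property can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and not the CFS definition: the sequence similarity is taken as the length of the longest common subsequence divided by the longer sequence length, the optimal one-to-one mapping is found by brute force over permutations rather than by A* with pruning, and the model score is the mapped similarity sum divided by the size of the larger sequence set. The labels `a1`, `b1`, and so on are hypothetical.

```python
from itertools import permutations


def lcs_len(x, y):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def seq_sim(x, y):
    """Toy sequence similarity (assumption, not the CFS formula)."""
    return lcs_len(x, y) / max(len(x), len(y))


def mapping_score(seqs1, seqs2):
    """Best one-to-one mapping by brute force, normalised by the larger set.

    Assumes both sets are non-empty and small enough to enumerate.
    """
    small, large = sorted((seqs1, seqs2), key=len)
    if not small:
        raise ValueError("both sequence sets must be non-empty")
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(seq_sim(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(large)


# Model 1: kind A is the majority, kind B the minority.
model1 = [("a1", "a2", "a3"), ("a1", "a3", "a2"), ("a2", "a1", "a3"),
          ("a2", "a3", "a1"), ("b1", "b2", "b3")]
# Model 2: the proportions are reversed.
model2 = [("a1", "a2", "a3"), ("b1", "b2", "b3"), ("b1", "b3", "b2"),
          ("b2", "b1", "b3"), ("b2", "b3", "b1")]

score = mapping_score(model1, model2)
print(round(score, 3))  # 0.4: only two of the five mapped pairs are similar
```

Only one A-to-A pair and one B-to-B pair can be matched well; the remaining three pairs cross the two kinds of business and contribute nothing, which pulls the overall score down even though both models contain the same two kinds of behavior.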

Take the models Σ1 and Σ2 in Figure 9 as an example. Their two sets of complete firing sequences and the mapping between them are shown in Figure 10. The similarity of the two models is 0.286 by our algorithm; the similarities computed by the other algorithms are listed in Table 2. The results of TAR and CFS conform to our expectation, while the results of the other algorithms are far higher than expected.

(a) Model Σ1 (b) Model Σ2

Fig. 9. Two example models for the new property

Fig. 10. The optimal mapping between the two models in Figure 9


Table 2. The similarity of the two models in Figure 9 by different algorithms

Algorithm    Similarity between Σ1 and Σ2
SSDT         0.840
PTS          0.829
TAR          0.333
CF           0.858
BP           0.680
CFS          0.286

Here we discuss the satisfaction rate of the triangle inequality for PTS, TAR, CF, and CFS, with the threshold on similarity calculation time set to five minutes. Kunze and Weske [15] give the definition of the triangle inequality. Let S be the model collection with |S| = n. Taking any three of the models as a group yields $C_n^3$ groups. If m of these $C_n^3$ groups satisfy the triangle inequality, the satisfaction rate of the triangle inequality of set S is defined as follows.

$$Rate(S) = \frac{m}{C_n^3} \qquad (11)$$

Let $L\Sigma_A$ and $L\Sigma_B$ be two labeled Petri nets. The similarity is then converted to a distance using the following equation.

$$Dis(L\Sigma_A, L\Sigma_B) = 1 - Sim(L\Sigma_A, L\Sigma_B) \qquad (12)$$
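As a minimal sketch of how this rate could be computed, the function below enumerates all three-model groups, converts similarities to distances as in Eq. (12), and counts the groups that satisfy the triangle inequality. The names `triangle_rate` and `jaccard` are hypothetical, the Jaccard similarity on label sets merely stands in for a real process similarity algorithm, and the five-minute time threshold used in the experiments is not modelled.

```python
from itertools import combinations
from math import comb


def triangle_rate(models, sim, eps=1e-9):
    """Return m / C(n, 3): the share of three-model groups whose pairwise
    distances (1 - similarity) satisfy the triangle inequality."""
    n = len(models)
    total = comb(n, 3)
    if total == 0:
        raise ValueError("at least three models are required")

    # Distance of every unordered pair, following Eq. (12).
    dist = {
        (i, j): 1.0 - sim(models[i], models[j])
        for i, j in combinations(range(n), 2)
    }

    satisfied = 0
    for i, j, k in combinations(range(n), 3):
        a, b, c = dist[(i, j)], dist[(i, k)], dist[(j, k)]
        longest = max(a, b, c)
        # The longest side must not exceed the sum of the other two;
        # eps absorbs floating-point noise in the equality case.
        if longest <= (a + b + c - longest) + eps:
            satisfied += 1
    return satisfied / total


def jaccard(x, y):
    """Placeholder similarity on label sets (assumption for the demo)."""
    return len(x & y) / len(x | y)


models = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"}]
rate = triangle_rate(models, jaccard)
print(rate)
```

Caching the pairwise distances keeps the number of similarity evaluations at n(n-1)/2, which matters when each evaluation is itself an expensive search.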

The satisfaction rates of the triangle inequality for the different algorithms are shown in Figure 11. TAR and CFS satisfy the triangle inequality completely, PTS falls slightly below 100%, and CF does not satisfy the triangle inequality well enough.

Fig. 11. The rate of triangle inequality
[Bar chart: rate of triangle inequality of different algorithms; x-axis: process similarity algorithm (PTS, TAR, CF, CFS); y-axis: rate of triangle inequality, 80.0% to 100.0%; series: DG, TC, SAP]


In conclusion, CFS has advantages over the other algorithms on the six properties, while TAR, PTS, CF, BP, and SSDT each leave some property unsatisfied. Furthermore, CFS is more reasonable and intuitive. As to the triangle inequality, the satisfaction rate of CFS reaches 100%, which is better than that of PTS and CF.

5 Summary and Future Work

In this paper we propose a new process model similarity algorithm named CFS, based on labeled Petri nets and the coverability tree. It expresses the behavior of a process model by its set of complete firing sequences and deals efficiently with all kinds of structures in Petri nets, especially loops. We also improve the execution efficiency of the A* search algorithm with a pruning strategy. The experimental results indicate that CFS is more reasonable than the other behavior-based process similarity algorithms considered.

In future work, we will focus on improving the calculation efficiency of the CFS algorithm, and we are interested in finding a better pruning strategy. Furthermore, we intend to put forward more general criteria for evaluating process model similarity algorithms.

Acknowledgement. The research is supported by the MOE-CMCC research foundation project No. MCM20123011 and the special fund for innovation of Shandong, China No. 2013CXC30001.

References

1. van der Aalst, W.M.P.: Business process management demystified: A tutorial on models, systems and standards for workflow management. In: Desel, J., Reisig, W., Rozenberg, G. (eds.) Lectures on Concurrency and Petri Nets. LNCS, vol. 3098, pp. 1–65. Springer, Heidelberg (2004)

2. Becker, M., Laue, R.: A comparative survey of business process similarity measures. Computers in Industry 63(2), 148–167 (2012)

3. Dumas, M., García-Bañuelos, L., Dijkman, R.M.: Similarity Search of Business Process Models. IEEE Computer Society Technical Committee on Data Engineering 32(3), 23–28 (2009)

4. Wang, J., He, T., Wen, L., Wu, N., ter Hofstede, A.H.M., Su, J.: A Behavioral Similarity Measure between Labeled Petri Nets Based on Principal Transition Sequences. In: Meersman, R., Dillon, T.S., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6426, pp. 394–401. Springer, Heidelberg (2010)

5. Zha, H., Wang, J., Wen, L., et al.: A workflow net similarity measure based on transition adjacency relations. Computers in Industry 61(5), 463–471 (2010)

6. Kunze, M., Weidlich, M., Weske, M.: Behavioral similarity – A proper metric. In: Rinderle-Ma, S., Toumani, F., Wolf, K. (eds.) BPM 2011. LNCS, vol. 6896, pp. 166–181. Springer, Heidelberg (2011)

7. Wang, S., Wen, L., Wang, J., et al.: SSDT-matrix based behavioral similarity algorithm for process models. In: The Third CBPM (2013)


8. Dijkman, R., Dumas, M., van Dongen, B., et al.: Similarity of business process models: metrics and evaluation. Information Systems 36(2), 498–516 (2011)

9. Petri, C.A.: Kommunikation mit Automaten. PhD thesis, Fachbereich Informatik, Universität Hamburg (1962)

10. Van Der Aalst, W.M.P.: Verification of Workflow Nets. In: Azéma, P., Balbo, G. (eds.) ICATPN 1997. LNCS, vol. 1248, pp. 407–426. Springer, Heidelberg (1997)

11. Gerke, K., Cardoso, J., Claus, A.: Measuring the compliance of processes with reference models. In: Meersman, R., Dillon, T., Herrero, P. (eds.) OTM 2009, Part I. LNCS, vol. 5870, pp. 76–93. Springer, Heidelberg (2009)

12. Murata, T.: Petri nets: Properties, analysis and applications. Proceedings of the IEEE 77(4), 541–580 (1989)

13. Hart, P.E., Nilsson, N.J., Raphael, B.: A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics 4(2), 100–107 (1968)

14. Wu, N., Jin, T., Zha, H., et al.: BeehiveZ: an open framework for business process model management. Journal of Computer Research and Development 47(z1), 450–454 (2010)

15. Kunze, M., Weske, M.: Metric trees for efficient similarity search in large process model repositories. In: Muehlen, M.z., Su, J. (eds.) BPM 2010 Workshops. Lecture Notes in Business Information Processing, vol. 66, pp. 535–546. Springer, Heidelberg (2011)