Multiple Sequence Alignment Using Modified Dynamic Programming and Particle Swarm Optimization...


Journal of the Chinese Institute of Engineers, Vol. 31, No. 4, pp. 659-673 (2008)

MULTIPLE SEQUENCE ALIGNMENT USING MODIFIED DYNAMIC PROGRAMMING AND PARTICLE SWARM OPTIMIZATION

Wang-Sheng Juang and Shun-Feng Su*

ABSTRACT

Dynamic Programming (DP) is widely used in Multiple Sequence Alignment (MSA) problems. However, when the number of the considered sequences is more than two, multiple dimensional DP may suffer from large storage and computational complexities. Often, progressive pairwise DP is employed for MSA. However, such an approach also suffers from local optimum problems. In this paper, we present a hybrid algorithm for MSA. The algorithm combines the pairwise DP and the particle swarm optimization (PSO) techniques to overcome the above drawbacks. In the algorithm, pairwise DP is used to align sequences progressively and PSO is employed to avoid the result of alignment being trapped into local optima. Several existing MSA tools are employed for comparison. The experimental results show excellent performance of the proposed algorithm.

Key Words: dynamic programming, multiple sequence alignment, particle swarm optimization.

*Corresponding author. (Tel: 886-2-27376704; Fax: 886-2-27376699; Email: [email protected])

W. S. Juang and S. F. Su are with the Department of Electrical Engineering, National Taiwan University of Science and Technology, No. 43, Sec. 4, Keelung Rd., Taipei 106, Taiwan, R.O.C.

I. INTRODUCTION

Multiple sequence alignment (MSA) is an important problem in molecular biology. Biological sequences are aligned with each other vertically to show possible similarities or differences among these sequences. The similarities (or commonalities) may reveal evolutionary history and are clues about common biological functions of the sequences. The information of similarity and difference can then be used to predict the secondary or tertiary structure of new sequences and to find the relationships between sequences. MSA is also often used for constructing evolutionary trees from DNA sequences and for analyzing the structures to help in designing new proteins. Usually, to solve an MSA problem is to find an alignment of multiple sequences with the highest score based on a given scoring criterion among sequences. During recent decades, there has been increasing interest in the biosciences for methods that can efficiently solve this problem for sequences such as biological macromolecules, DNA and proteins. However, multiple sequence alignment is a combinatorial problem with exponential time complexity and there is no good approach that can solve it efficiently (Karadimitriou and Kraft, 1996).

It is well known that the multiple sequence alignment problem can be exactly solved by the dynamic programming (DP) algorithm (Needleman and Wunsch, 1970), which converts the original problem to a problem of searching for the shortest path in a weighted directed acyclic k-dimensional graph. Unfortunately, such an approach is notorious for its large consumption of processing time because MSA with the sum-of-pairs score has been proven to be an NP-complete problem (Smith and Waterman, 1981), for which algorithms that guarantee to find the true optimal solution have running time complexity growing exponentially with the size of the considered problem. As a consequence, most practical multiple sequence alignment algorithms are based on heuristics and usually produce quasi-optimal alignments.

There are several MSA algorithms reported in the literature (Chen, et al., 1992, McClure, et al., 1994, Thompson, et al., 1999). The great majority of multiple sequence alignment heuristics are now carried out using the “progressive approach” proposed in (Feng and Doolittle, 1987) or its variation (Thompson, et al., 1994). This approach has the great advantages of speed and simplicity. On the other hand, the main disadvantage is the “local alignment” problem, which stems from the greedy nature of the algorithm. This means that if mistakes are made in intermediate alignments, those mistakes cannot be corrected later as more sequences are added into the alignment process. Another kind of approach is to use an extension of DP for simultaneously aligning multiple sequences, such as the Carrillo-Lipman algorithm (Carrillo and Lipman, 1998), MSA (Lipman, et al., 1989), and DCA (Stoye, et al., 1997, Stoye, 1998). In general, these algorithms often yield higher quality solutions than progressive approaches. However, they have the drawbacks of high running time and large memory requirements. Thus, they can only be applied to problems with a limited number of sequences (probably fewer than 10).

There are also iterative, stochastic approaches for multiple sequence alignment. These approaches include simulated annealing (SA) (Myers and Miller, 1988), genetic algorithms (GA) (Isokawa, et al., 1996, Notredame and Higgins, 1996, Omar, et al., 2004, Yang, 2001, Zhang and Wong, 1997) and evolutionary programming (EP) (Chellapilla and Fogel, 1999, Liu, 2003, Thomsen, et al., 2002). Such approaches are also referred to as non-derivative optimization or population-based search algorithms. Iterative and stochastic methods such as SA, GA, and EP can be used to search a large solution space efficiently. Not surprisingly, the application of SA to multiple sequence alignment problems has a tendency to stagnate at local optima, whereas GA and EP tend to be more successful. However, the GA and EP methods introduced so far still suffer from long running time and may not have good search performance. This is because they all start from a random initialization of candidate alignments and therefore spend a lot of time gradually improving the solutions before reaching a near-optimal solution. In this study, we intend to combine the above two kinds of approaches to form a better approach for MSA.

In this paper, we propose a hybrid search algorithm combining DP and particle swarm optimization (PSO), which is also classified as an iterative and stochastic approach, for finding solutions for MSA. In the algorithm, DP is first employed to align multiple sequences according to the pair alignment order. Then, PSO is employed to improve the aligned results to avoid being trapped in local optima. We must point out that PSO is an improver for a progressive pairwise DP in our approach; DP is not an initialization mechanism for PSO. In our opinion, a search that employs DP as an initialization mechanism for PSO may easily be trapped in local optima. In fact, in (Liu, 2003), the author also used ClustalW as an initialization mechanism for evolutionary programming and the results are not good owing to the local optimum problem. Our approach basically is a pairwise DP based approach. It is known that DP is a good alignment approach except that the local optimum problem is usually serious in progressive pairwise DP cases. In our approach, we propose to employ PSO to resolve this problem. There are some approaches employed to modify the results obtained by DP, but those methods always use domain knowledge to make modifications. Such an approach can be viewed as adding a local search mechanism to DP and cannot be used to resolve the local optimum problem. There are also approaches using some traditional search algorithms as initialization mechanisms for a global search technique, but these kinds of approaches will have the local optimum problem. In our study, several data sets of Clusters of Orthologous Groups (COGs) of proteins are used as examples to demonstrate that our approach is superior to the most widely used multiple sequence alignment approach ClustalW (Thompson, et al., 1994) and other existing MSA tools, such as MUSCLE (Edgar, 2004), MAFFT (Katoh, et al., 2002; Katoh, et al., 2005), T-COFFEE (Notredame et al., 2000; Poirot et al., 2003), DIALIGN-T (Subramanian, et al., 2005) and MAP (Huang, 1994). Our approach is also compared to a recent approach (Liu, 2003), which adopts EP in the search. The simulation results show that our approach is indeed promising. Overall, our algorithm can find better MSA in most cases without using too much search time.

II. DYNAMIC PROGRAMMING

In this section, we shall introduce the idea of pairwise dynamic programming and its application to pairwise sequence alignment. We also modify the pairwise dynamic programming to suit multiple sequence alignment. In this study, we also propose to embed the PSO techniques into DP to avoid local optima. The related PSO issues will be introduced in the next section.

The first use of the Dynamic Programming approach for the alignment of biological sequences was reported in (Needleman and Wunsch, 1970). For a number of useful alignment-scoring schemes, this method is guaranteed to produce an alignment of two given sequences with the highest possible score. The procedure of DP for finding the maximum score is shown in Algorithm I. In addition to finding the maximum score, the algorithm also needs to find the corresponding alignment. Thus, there are four steps in a complete DP algorithm. They are the initialization step, the matrix filling step, the backtracking matrix constructing step, and the alignment obtaining step.

Initialization step: In this step, the first row and the first column are filled with successive multiples of the gap score.

Matrix filling step: The score matrix is filled starting from the upper left corner and then going right and down. For each cell (i, j) we need to compute the scores of three different paths, which are $F(i-1, j-1) + \sigma(s_a^i, s_b^j)$, $F(i-1, j) + \sigma(-, s_b^j)$ and $F(i, j-1) + \sigma(s_a^i, -)$, where $F(i, j)$ is the maximum score at position (i, j), $\sigma(s_a^i, -)$ denotes a score for a gap in sequence $s_a$ at position i, and $\sigma(-, s_b^j)$ denotes a score for a gap in sequence $s_b$ at position j. The maximum of the above three scores is filled into the cell. As mentioned earlier, the procedure for filling the matrix and finding the maximum is shown in Algorithm I. The final value in the lower right corner is the maximum global alignment score.
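The initialization and matrix filling steps can be collected into a single recurrence. The block below is only a compact restatement of the two steps above; it assumes a constant gap score, written here as $g$ (a symbol introduced for this restatement), so that $\sigma(s_a^i, -) = \sigma(-, s_b^j) = g$.

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,g, \qquad F(0,j) = j\,g, \\
F(i,j) &= \max \bigl\{\, F(i-1,j-1) + \sigma(s_a^i, s_b^j),\;
                        F(i-1,j) + \sigma(-, s_b^j),\;
                        F(i,j-1) + \sigma(s_a^i, -) \,\bigr\}.
\end{aligned}
```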

Backtracking matrix constructing step: The backtracking step is to determine the actual alignment that results in the maximum score. This step starts from F(1, 1) and ends at the lower right corner cell. For each cell, there is one of three possible symbols, “*”, “–”, or “#”. A “–” symbol in the cell indicates that there is a gap in the column sequence. A “#” symbol in the cell indicates that there is a gap in the row sequence. Otherwise, it is a “*” symbol. The procedure for constructing the backtracking matrix is described in Algorithm II.

Alignment obtaining step: This step is to obtain the best sequence alignment from the backtracking matrix. The process starts from the lower right corner cell and records sequences from right to left. Considering cell (i, j) and its next cells, there are three paths to be chosen from. They are (i – 1, j – 1), (i – 1, j) and (i, j – 1). Which path will be chosen depends upon the symbol in the cell (i, j). If the cell (i, j) is “*”, the next cell will be selected at (i – 1, j – 1). If the cell (i, j) is “–”, the next cell will be (i, j – 1). If the cell (i, j) is “#”, the next cell will be (i – 1, j). Fig. 1 illustrates the procedure of choosing cell paths. Fig. 2 demonstrates how to obtain the best sequence alignment from a backtracking matrix.
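The alignment obtaining step can be sketched as a short walk over the symbol matrix. This is an illustrative sketch rather than the authors' Algorithm II: the function name `trace_alignment`, the use of an ASCII "-" in place of the paper's “–”, and the handling of a leftover prefix once one index runs out are all assumptions. The sample matrix is the one shown in Fig. 2(a).

```python
def trace_alignment(symbols, row_seq, col_seq):
    """Recover an alignment from a backtracking matrix.

    symbols[r][c] is the symbol stored for col_seq[r] (matrix row)
    and row_seq[c] (matrix column). "*" moves diagonally, "-" moves
    left (gap in the column sequence), "#" moves up (gap in the row
    sequence).
    """
    r, c = len(col_seq) - 1, len(row_seq) - 1
    top, side = [], []  # aligned row sequence / aligned column sequence
    while r >= 0 and c >= 0:
        sym = symbols[r][c]
        if sym == "*":        # next cell is (i - 1, j - 1)
            top.append(row_seq[c])
            side.append(col_seq[r])
            r, c = r - 1, c - 1
        elif sym == "-":      # next cell is (i, j - 1)
            top.append(row_seq[c])
            side.append("-")
            c -= 1
        elif sym == "#":      # next cell is (i - 1, j)
            top.append("-")
            side.append(col_seq[r])
            r -= 1
        else:
            raise ValueError(f"unknown symbol {sym!r}")
    # Assumption: any unconsumed prefix is paired with gaps.
    while c >= 0:
        top.append(row_seq[c])
        side.append("-")
        c -= 1
    while r >= 0:
        top.append("-")
        side.append(col_seq[r])
        r -= 1
    return "".join(reversed(top)), "".join(reversed(side))


# Backtracking matrix of Fig. 2(a): row sequence ATCC, column sequence ACGT.
fig2 = ["*---", "#***", "###*", "#*-#"]
print(trace_alignment(fig2, "ATCC", "ACGT"))  # ('ATCC-', 'A-CGT')
```

The walk records symbols from right to left and reverses them at the end, which mirrors the description of recording sequences from the lower right corner.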

For illustration, the following example is considered. Let the sequences to be aligned be

Sa : A G C A C A C A

Sb : A C A C T A

The scoring scheme used is $\sigma(s_a^i, s_b^j) = 2$ if the character at position i of sequence $s_a$ is the same as the character at position j of sequence $s_b$ (match), $\sigma(s_a^i, s_b^j) = -1$ if the character at position i of sequence $s_a$ is different from the character at position j of sequence $s_b$ (mismatch), and if one of these characters is a gap, then $g = -2$ (gap penalty). The scoring matrix after initialization is shown in Fig. 3(a). By applying the matrix filling step, the rest of the score matrix can be filled accordingly. The completed score matrix is shown in Fig. 3(b). The maximum global alignment score is 5 in this example. Then the backtracking matrix can be constructed and is shown in Fig. 4. According to the backtracking matrix, the best sequence alignment for the example is given in Fig. 5.
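The initialization and matrix filling steps for this example can be reproduced with a few lines of code. The sketch below is not the paper's Algorithm I; the function name `nw_score_matrix` and its parameter names are illustrative, and the layout (row sequence across the top, column sequence down the side) follows Fig. 3.

```python
def nw_score_matrix(row_seq, col_seq, match=2, mismatch=-1, gap=-2):
    """Fill the global alignment score matrix F.

    F has len(col_seq) + 1 rows and len(row_seq) + 1 columns; the extra
    first row and first column hold the initialization values.
    """
    n_rows, n_cols = len(col_seq) + 1, len(row_seq) + 1
    F = [[0] * n_cols for _ in range(n_rows)]
    # Initialization step: multiples of the gap score.
    for j in range(1, n_cols):
        F[0][j] = j * gap
    for i in range(1, n_rows):
        F[i][0] = i * gap
    # Matrix filling step: best of the three incoming paths.
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            sub = match if col_seq[i - 1] == row_seq[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # diagonal
                          F[i - 1][j] + gap,       # from above
                          F[i][j - 1] + gap)       # from the left
    return F


F = nw_score_matrix("AGCACACA", "ACACTA")
print(F[0])       # [0, -2, -4, -6, -8, -10, -12, -14, -16]
print(F[-1][-1])  # 5
```

The lower right entry reproduces the score of 5 stated for the example.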

Fig. 1 Choosing the next cell on the path when obtaining an alignment. From cell (i, j), path 1 leads to (i, j – 1), path 2 to (i – 1, j – 1), and path 3 to (i – 1, j). If cell (i, j) is “*”, choose path 2. If cell (i, j) is “–”, choose path 1. If cell (i, j) is “#”, choose path 3.

(a) Sa : row sequence, Sb : column sequence

         A   T   C   C
     A   *   –   –   –
     C   #   *   *   *
     G   #   #   #   *
     T   #   *   –   #

(b)

     s′a : A T C C –
     s′b : A – C G T

Fig. 2 (a) Backtrack matrix. (b) The global alignment obtained from the backtrack matrix of Fig. 2(a), where s′a and s′b are the sequences of the global alignment result.

(a) Sa (row sequence): A G C A C A C A; Sb (column sequence): A C A C T A

     First row:     0  -2  -4  -6  -8 -10 -12 -14 -16
     First column:  0  -2  -4  -6  -8 -10 -12

(b)

               A    G    C    A    C    A    C    A
          0   -2   -4   -6   -8  -10  -12  -14  -16
     A   -2    2    0   -2   -4   -6   -8  -10  -12
     C   -4    0    1    2    0   -2   -4   -6   -8
     A   -6   -2   -1    0    4    2    0   -2   -4
     C   -8   -4   -3    1    2    6    4    2    0
     T  -10   -6   -5   -1    0    4    5    3    1
     A  -12   -8   -7   -3    1    2    6    4    5

Fig. 3 (a) Filling the first row and the first column with gap scores. (b) The completed score matrix.

The dynamic programming method has been accepted as an effective method for finding the optimal solution for pairwise alignment (Gotoh, 1982, Smith and Waterman, 1981). In aligning two sequences of lengths m1 and m2, a straightforward implementation of the method needs O(m1m2) time and O(m1m2) space. Myers and Miller (1988) developed a space-efficient version of dynamic programming based on the work in (Hirschberg, 1975), which needs O(m1m2) in time and O(m1) in space, where m1 is the length of the shorter sequence. There are also a number of authors devoting themselves to the study of how to construct a good scoring function for pairwise sequence alignment, such as (Karlin and Altschul, 1990) and (Altschul, 1993).
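The space saving comes from the fact that each row of the score matrix depends only on the row before it. The sketch below illustrates just that observation: it returns the optimal score while keeping two rows whose length is that of the shorter sequence. It is not the Myers and Miller or Hirschberg algorithm, which additionally recover the alignment itself by divide and conquer; the function name and the scoring defaults (taken from the example above) are assumptions.

```python
def nw_score_linear_space(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score using O(min(len(a), len(b))) memory."""
    if len(b) > len(a):
        a, b = b, a                      # make b the shorter sequence
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)   # first column entry, then fill
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # diagonal
                         prev[j] + gap,      # gap in b
                         cur[j - 1] + gap)   # gap in a
        prev = cur                       # only the latest row is kept
    return prev[-1]


print(nw_score_linear_space("AGCACACA", "ACACTA"))  # 5
```

The time cost is still proportional to m1m2, since every cell is visited once.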

Since the above DP algorithm is for pairwise alignment, when multiple sequence alignment is considered, the algorithm must be modified to include more sequences and to form a multi-dimensional DP algorithm, which is notorious for its large time consumption. The time usage for a k sequence alignment problem is $O\bigl((2^k - 1)\prod_{j=1}^{k} |s_j|\bigr) \approx O(2^k n^k)$ with $|s_1| = |s_2| = \dots = |s_k| = n$, and the space required is $O\bigl(\prod_{j=1}^{k} |s_j|\bigr) \approx O(n^k)$. Obviously, such computational costs are too high for practical use. It is an NP-complete problem (Smith and Waterman, 1981). As a result, most practical alignment algorithms are based on heuristics and only produce a quasi-optimal alignment.
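To get a feel for these growth rates, the two expressions can simply be evaluated. The helper below is a hypothetical illustration (its name and return convention are not from the paper): it counts matrix cells as the product of the sequence lengths and multiplies by the 2^k – 1 predecessor cells examined per cell.

```python
def multidim_dp_cost(lengths):
    """Return (cells, cell_updates) for k-dimensional DP.

    cells        = product of the sequence lengths   (space estimate)
    cell_updates = (2**k - 1) * cells                 (time estimate)
    """
    k = len(lengths)
    cells = 1
    for n in lengths:
        cells *= n
    return cells, (2 ** k - 1) * cells


print(multidim_dp_cost([8, 6]))       # (48, 144)
print(multidim_dp_cost([100] * 3))    # (1000000, 7000000)
print(multidim_dp_cost([100] * 10))   # cells alone reach 10**20
```

Three sequences of length 100 are still manageable, while ten such sequences already need on the order of 10^20 cells, which is why the simultaneous approach is limited to a handful of sequences.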

In practice, the pairwise DP algorithm is often used. The idea is to align sequences one by one; this is the so-called progressive alignment. For instance, after finding an alignment for a given pair of sequences (like the one in the above example), suppose that we have another sequence sc : AGTACA needing to be aligned. We can then construct the initialization score matrix as shown in Fig. 6. It is worth noticing that the first column is filled with multiples of –4 because each character of sc placed in this column is paired with a gap in both of the previously aligned sequences. For example, (A, –, –) is the character set for the first move downward and has a score of –4. Similar cases may occur in other cells because there are two sequences in the row sequence. As an example, in the first row, the score of the sixth column is –12. Then in the seventh column, the character pair at the sixth position of the row sequence is (A, T), and this pair is aligned with a gap (moving horizontally), so the score in this cell is –12 + σ(A, T, –) = –12 + (–1 – 2 – 2); that is, –17.
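The initialization values described here follow from scoring each column as a sum over all pairs of its characters. The sketch below reproduces the first row and first column for the example; the helper names are illustrative, "-" stands for the paper's “–”, and the gap-to-gap score of 0 is taken from the worked values (for example (G, –, –) contributing –4).

```python
GAP = "-"


def pair_score(x, y, match=2, mismatch=-1, gap=-2):
    """Score one pair of characters; two gaps together score 0."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return gap
    return match if x == y else mismatch


def column_score(column):
    """Sum-of-pairs score of one alignment column."""
    return sum(pair_score(column[p], column[q])
               for p in range(len(column))
               for q in range(p + 1, len(column)))


def initial_row(aligned_rows):
    """First matrix row: every aligned column is paired with a gap."""
    row = [0]
    for col in zip(*aligned_rows):
        row.append(row[-1] + column_score(col + (GAP,)))
    return row


def initial_column(new_seq, n_aligned):
    """First matrix column: each new character meets n_aligned gaps."""
    col = [0]
    for ch in new_seq:
        col.append(col[-1] + column_score((ch,) + (GAP,) * n_aligned))
    return col


print(initial_row(["AGCACACA", "A-CACT-A"]))
# [0, -2, -6, -8, -10, -12, -17, -21, -23]
print(initial_column("AGTACA", 2))
# [0, -4, -8, -12, -16, -20, -24]
```

The jump from –12 to –17 is the σ(A, T, –) = –5 step discussed above.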

After the initialization step, based on the similar calculation, the completed score matrix can be constructed accordingly and is shown in Fig. 7. Afterward, the completed backtracking matrix can be formed and is shown in Fig. 8. Using the backtracking matrix, a possible good alignment for these sequences is obtained and is shown in Fig. 9. It should be noted that the obtained alignment may not be the best alignment because this alignment is based on the original alignment between sa and sb, which is the best alignment between sa and sb, but may not be the best alignment for all sequences. It is called the local optimum problem. In our approach, we then employ the PSO technique to cope with this problem.
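The score attached to a finished multiple alignment can be checked independently of the DP matrices by summing the pair scores column by column. The function below is a minimal sketch under the example's scoring scheme (match 2, mismatch –1, gap –2, gap against gap 0); its name is illustrative, and the three rows are the alignment of Fig. 9 with "-" standing for “–”.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=2, mismatch=-1, gap=-2):
    """Sum-of-pairs score of equally long aligned rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("aligned rows must have equal length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue                 # gap against gap scores 0
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total


fig9 = ["AGCACACA",   # s'a
        "A-CACT-A",   # s'b
        "AGTAC--A"]   # s'c
print(sum_of_pairs(fig9))       # 13
print(sum_of_pairs(fig9[:2]))   # 5
```

Both values agree with the lower right entries reported for the three-sequence and two-sequence score matrices.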

III. PSO AS AN IMPROVER

Fig. 4 Constructing the backtrack matrix for sa (row sequence) and sb (column sequence). The arrows indicate the global sequence alignment path.

     s′a : A G C A C A C A
     s′b : A – C A C T – A

Fig. 5 The global sequence alignment obtained from Fig. 4, where s′a and s′b indicate the aligned sequences sa and sb, respectively.

     Row sequences   s′a : A G C A C A C A
                     s′b : A – C A C T – A
     Column sequence sc  : A G T A C A

     First row:     0  -2  -6  -8 -10 -12 -17 -21 -23
     First column:  0  -4  -8 -12 -16 -20 -24

Fig. 6 Initial score matrix for adding the third sequence. For the column sequence (sc), each character is paired with two gaps (“–” symbol), which gives the initial scores shown above. For the row sequences, each column has three elements, too. For instance, to compute the initial score in the first column of the row sequences, the three elements are (A, A, –), and the initial score is –2. To compute the initial score of the 6th column of the row sequences, the initial score becomes –12 + σ(A, T, –). Thus, the initial score is –17.

The particle swarm theory was first proposed in (Eberhart and Kennedy, 1995, Kennedy and Eberhart, 1995). Based on such an idea, many researchers modified the theory into particle swarm optimization (PSO) and then applied this technique in a wide range of areas (Abido, 2002, Kumar, et al., 2004, Robinson and Rahmat-Samii, 2004, Shi and Eberhart, 1998, Van der Merwe and Engelbrecht, 2003). PSO is a population based heuristic search technique in which each particle represents a potential solution within the search space and is characterized by its position, its velocity and a record of its past performance. In this study, the PSO technique will be employed to improve the alignment of sequences aligned by DP.

In our study, the modified particle swarm optimizer proposed in (Kennedy and Eberhart, 1995) is used and is written as

$$v_{id}(t+1) = w \, v_{id}(t) + c_1 \, \mathrm{rand}() \, \bigl(P_{id} - x_{id}(t)\bigr) + c_2 \, \mathrm{Rand}() \, \bigl(P_{gd} - x_{id}(t)\bigr) \tag{1}$$

$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1), \tag{2}$$

where $w$ is an inertia weight, $c_1$ and $c_2$ are two positive constants, rand( ) and Rand( ) are two independent random functions in the range [0, 1], $x_{id}$ and $v_{id}$ are the position and the velocity of particle $i$ in dimension $d$, respectively, $P_{id}$ is the best position found so far by the particle, and $P_{gd}$ is the globally best position found so far by all particles. The particle position will then be updated according to Eq. (2). In applying particle swarm algorithms, the inertia weight $w$ is set as a random value in the range [0, 1].

In our implementation of PSO for MSA, each particle in the problem space represents a string of gap positions $X = x_1^1 x_1^2 \dots x_1^{n_1} \, x_2^1 x_2^2 \dots x_2^{n_2} \, \dots \, x_m^1 x_m^2 \dots x_m^{n_m}$, where $x_i^j$ for $1 \le j \le n_i$ and $1 \le i \le m$ is the location of a gap existing in sequence $i$. Here, $m$ is the number of sequences and $n_i$ is the number of gaps for sequence $i$. $n_i$ is obtained as $n_i = L - l_i$, where $l_i$ is the length of the i-th original sequence and $L$ is the length of sequences used in the algorithm and is determined in the pairwise DP process. Different scoring matrices may have different alignment lengths. To illustrate the way of employing PSO for MSA, an example is given here. Assume that there are three sequences after progressive pairwise DP alignment and they are given as s1: AT–TC–G, s2: A––TCAG and s3: ATGTCA–. For sequence s1, there are two gaps in positions 3 and 6, respectively. For sequence s2, there are two gaps in positions 2 and 3. For sequence s3, there is only one gap in position 7. According to this gap position information, a particle can be coded into an integer vector [3 6 2 3 7]. For an N-particle optimizer, the initial values of the other N – 1 particles are randomly generated within the problem space. For this example, the length of sequences is 7. Each one of the other N – 1 particles will be formed like [a1 a2 b1 b2 c], where a1, a2, b1, b2, and c are randomly generated within the integer range [1, 7]. It is noted that the result of using Eq. (1) to update velocity $v_{id}$ is generally not an integer value. To cope with this problem, Eq. (1) is modified as:

$$v_{id}(t+1) = \mathrm{round}\bigl\{ w \, v_{id}(t) + c_1 \, \mathrm{rand}() \, \bigl(P_{id} - x_{id}(t)\bigr) + c_2 \, \mathrm{Rand}() \, \bigl(P_{gd} - x_{id}(t)\bigr) \bigr\} \tag{3}$$

where round{} is the round-off operation. Particle positions thereby can be updated by Eq. (2).
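The gap-position encoding and one integer update following Eqs. (3) and (2) can be sketched as below. This is a hedged illustration, not the authors' implementation: the function names, drawing w once per step, rounding halves away from zero, clamping the velocity to [–vmax, vmax], and clamping positions to [1, L] are all assumptions made so that the example runs. The defaults c1 = c2 = 2 and a velocity range of [–2, 2] are values the paper reports using elsewhere.

```python
import math
import random


def encode_gaps(aligned_rows, gap="-"):
    """Concatenate the 1-based gap positions of every aligned row."""
    return [pos for row in aligned_rows
            for pos, ch in enumerate(row, start=1) if ch == gap]


def round_half_away(value):
    """Round to the nearest integer, halves away from zero (assumed)."""
    magnitude = int(math.floor(abs(value) + 0.5))
    return magnitude if value >= 0 else -magnitude


def pso_step(x, v, pbest, gbest, length, c1=2.0, c2=2.0, vmax=2, rng=random):
    """One update of a gap-position particle by Eq. (3) then Eq. (2)."""
    w = rng.random()                       # inertia weight in [0, 1]
    new_x, new_v = [], []
    for xd, vd, pd, gd in zip(x, v, pbest, gbest):
        vel = round_half_away(w * vd
                              + c1 * rng.random() * (pd - xd)
                              + c2 * rng.random() * (gd - xd))
        vel = max(-vmax, min(vmax, vel))   # assumed velocity clamp
        new_v.append(vel)
        new_x.append(max(1, min(length, xd + vel)))  # stay inside [1, L]
    return new_x, new_v


particle = encode_gaps(["AT-TC-G", "A--TCAG", "ATGTCA-"])
print(particle)  # [3, 6, 2, 3, 7]
x1, v1 = pso_step(particle, [0] * 5, particle, particle, length=7)
print(x1, v1)    # unchanged: the particle already sits on both best positions
```

How repeated gap positions within one sequence are decoded back into an alignment is not specified in this section, so the sketch stops at the position update.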

               s′a :    A    G    C    A    C    A    C    A
               s′b :    A    –    C    A    C    T    –    A
                   0   -2   -6   -8  -10  -12  -17  -21  -23
     A            -4    6    2    0   -2   -4   -9  -13  -15
     G            -8    2    4    2    0   -2   -7  -11  -13
     T           -12   -2    0    4    2    0   -2   -6   -8
     A           -16   -6   -4    0   10    8    3   -1    0
     C           -20  -10   -8    2    6   16   11    7    5
     A           -24  -14  -12   -2    8   12   16   12   13

Fig. 7 Completed score matrix (column sequence sc: A G T A C A). The global sequence alignment score is 13.

Fig. 8 Backtrack matrix for three sequences alignment.

     s′a : A G C A C A C A
     s′b : A – C A C T – A
     s′c : A G T A C – – A

Fig. 9 Global alignment for three sequences.

Pairwise DP has a drawback of “once a gap always a gap.” If such a gap is improper for the global alignment, it is impossible to modify it in the later DP process. In that case, when more sequences are added into the process, the result obtained will be farther away from the optimal alignment. As mentioned previously, DP for simultaneously aligning multiple sequences has the advantage of yielding high quality solutions, but it suffers from large storage and computational complexities when the number of sequences is large. Thus, we propose to use pairwise DP and to incorporate PSO into the algorithm. The proposed algorithm is referred to as Modified Dynamic Programming and Particle Swarm Optimization (MDPPSO) in this paper.

The proposed MDPPSO algorithm is shown in Algorithm III. The algorithm consists of two phases. In the initialization phase, the algorithm calculates pairwise alignment scores for all possible pairs of sequences and sorts all pair alignment scores to determine the alignment order. Different aligning orders may have different aligned results. In the second phase, the sequence pair with the highest score is selected as the row sequence. Other sequences are added one by one into the DP algorithm according to the alignment order that is formed in the initialization phase. The particle swarm algorithm stated in Eqs. (3) and (2) is then used to improve the result of each DP process. It is noted that PSO is applied as an improver.
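One plausible reading of how the sorted pair scores turn into an alignment order is sketched below: walk the pairs from the highest score to the lowest and append every sequence not seen yet. This rule is an assumption inferred from the worked example (it is not quoted from Algorithm III), and the function name is illustrative. The scores used are the ones reported in Fig. 11.

```python
def alignment_order(pair_scores):
    """Derive a progressive alignment order from pairwise scores.

    pair_scores maps (name_a, name_b) to the pairwise alignment score.
    Ties keep their insertion order because Python's sort is stable.
    """
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)
    order = []
    for (a, b), _score in ranked:
        for name in (a, b):
            if name not in order:
                order.append(name)
    return order


# Pairwise scores as listed in Fig. 11.
fig11 = {
    ("s1", "s2"): 3, ("s1", "s3"): 3, ("s1", "s4"): -3,
    ("s2", "s3"): 8, ("s2", "s4"): 5, ("s3", "s4"): -1,
}
print(alignment_order(fig11))  # ['s2', 's3', 's4', 's1']
```

The first two names form the initial row sequence, and each later name is added as the column sequence of the next pairwise DP pass, giving (((s2, s3), s4), s1).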

It may be noticed that the algorithm needs to calculate the scores of all possible pairs of sequences and to form an order of those pairs based on such scores. Thus, if the number of sequences for alignment is large, the computational burden is also heavy. In fact, if many sequences are considered in MSA, we may skip the process of calculating the scores of all possible pairs. In that case, in the later stage, sequences are randomly selected for applying pairwise DP and for being added into the pairwise DP. Of course, such a modification will surely lead to local optima for progressive pairwise DP. Nevertheless, in our algorithm, PSO is employed to improve the alignment all the time. As a consequence, the obtained results are still satisfactory, except that more iterations may sometimes be needed in PSO.

An example is given in the following for illustration. Assume that we want to align four sequences (s1, s2, s3, s4) using MDPPSO. The considered sequences are

s1: M A I W Q G R S S G

s2: M V W Q G R S G K G

s3: M A I W Q S M G R A

s4: M I W G S R K P S G G

A simple scoring strategy is used: match = +2, mismatch = –1, space = –2, space to space = 0. The parameters required in the particle swarm optimization procedure are set as particle number = 5, iteration number = 500, w = [1, 0], velocity range = [–2, 2]. In the first stage, pairwise dynamic programming is used to align all sequence pairs. The results are shown in Fig. 10. The pairwise score matrix is shown in Fig. 11. Based on the scores of all sequence pairs, the order for pairwise DP alignment is s2, s3, s4 and then s1. First, the aligned sequence pair (s2, s3) is selected as the row sequence and s4 is put as the column sequence in pairwise DP. The result of the alignment is shown in Fig. 12. If PSO is not employed, the aligned sequence set (s2, s3, s4) is selected as the row sequence and s1 is the column sequence. After the DP procedure, the final result of the four sequences' alignment without PSO refinement is shown in Fig. 13. In this example, the sum of pairwise (SOP) score of the final global sequence alignment without PSO refinement is 6. Now, consider the incorporation of PSO. After the aligned sequence set (s2, s3, s4) is obtained, PSO is employed to modify the aligned sequences. According to Fig. 12(c), xid(0) indicating all gap positions is [10 12 13 3 12 6 7 10] and L, l1, l2, l3, n1, n2, and n3 are 13, 10, 10, 10, 3, 3, and 3, respectively. After the four sequences are all included, PSO is again employed to improve the alignment. According to Fig. 13(c), we have xid(0), L, l1, l2, l3, l4, n1, n2, n3, and n4 as [8 9 12 14 15 3 8 9 14 15 6 7 8 9 12 2 12 14 15], 15, 10, 10, 10, 11, 5, 5, 5, and 4, respectively. The results with PSO refinement are shown in Fig. 14. The SOP score of the final global sequence alignment with PSO refinement is 9. Obviously, the multiple alignment has been improved as expected.

     (s1, s2)   M – I W – G S R K P S G G
                M A I W Q G – R S – S – G

     (s2, s3)   M A I W Q G R S S – G
                M V – W Q G R S G K G

     (s1, s3)   M I W – G S R K P S G – G
                M V W Q G – R – – S G K G

     (s2, s4)   M A I W Q – – G R S S G
                M A I W Q S M G R A – –

     (s1, s4)   M – I W G S R K P S G G
                M A I W Q S M G R A – –

     (s3, s4)   M V – W Q G R S G K G
                M A I W Q S M – G R A

Fig. 10 The result of pairwise alignment of all sequence pairs.

            s1   s2   s3   s4
     s1      –    3    3   -3
     s2           –    8    5
     s3                –   -1
     s4                     –

Fig. 11 Pairwise scores of the four sequences. Each pair's score determines the alignment priority from the highest score pair to the lowest score pair. The sequence pairs' score order is (s2, s3) ≥ (s2, s4) ≥ (s1, s3) ≥ (s1, s2) ≥ (s3, s4) ≥ (s1, s4). The alignment order then is (((s2, s3), s4), s1).

IV. RESULTS AND DISCUSSIONS

In this section, the COG data sets are considered and the obtained results are reported. Table 1 shows all related information for those data sets. The simulation platform is implemented in MATLAB R11; the operating system is Windows XP with Norton SystemWorks 2003. The processor is an Intel Pentium 4 at 2.5 GHz and the main memory size is 256 MB. The scoring scheme used is the BLOSUM62 scoring matrix for protein sequences, as shown in Fig. 15. A gap paired with any residue scores –4, and a gap-to-gap pair scores 0. In the PSO process, the number of particles is 5 and the iteration number is 1000. The inertia weight w in the PSO algorithm is set as a random value in the range [0, 1]. Parameters C1 and C2 are both set to 2. Most of the PSO parameters are selected heuristically; in our study, we simply use commonly used values and the results are acceptable.
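To make the role of these parameters concrete, the sketch below shows one update of a single particle using the standard inertia-weight PSO rule (Shi and Eberhart, 1998) in Python. It is not the authors' MATLAB code. Several details are assumptions: the inertia weight is redrawn once per update, the velocity limit of 2 is borrowed from the worked example, positions are rounded to integers and clamped to [1, L], and nothing here keeps the gap positions of one sequence distinct.

```python
import random


def pso_step(x, v, pbest, gbest, length, c1=2.0, c2=2.0, v_max=2.0, rng=random):
    """One position/velocity update for a particle of integer gap positions.

    x, v   : current positions and velocities of the particle
    pbest  : best position found so far by this particle
    gbest  : best position found so far by the whole swarm
    length : alignment length L; positions are kept in 1..L (assumption)
    """
    w = rng.random()  # inertia weight drawn from [0, 1]
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        vel = (w * vi
               + c1 * rng.random() * (pi - xi)
               + c2 * rng.random() * (gi - xi))
        vel = max(-v_max, min(v_max, vel))  # clamp velocity to [-v_max, v_max]
        pos = int(round(xi + vel))          # round-off to an integer position
        pos = max(1, min(length, pos))      # stay inside the alignment
        new_v.append(vel)
        new_x.append(pos)
    return new_x, new_v


rng = random.Random(0)
x0 = [8, 9, 12, 14, 15]
x1, v1 = pso_step(x0, [0.0] * 5, pbest=x0, gbest=[7, 9, 11, 14, 15],
                  length=15, rng=rng)
print(x1, v1)
```

A full refinement loop would evaluate the alignment score of each new position and update `pbest` and `gbest` accordingly, for 5 particles over 1000 iterations under the settings above.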

There are many tools being used for MSA. The first category, such as ClustalW, finds alignments quickly, but the resultant alignments may not be the best solutions. The other category employs optimization search algorithms, such as genetic algorithms, to search for the best possible alignment. However, this kind of approach may suffer from

(a) Score matrix. Rows are the aligned columns of (s2, s3), listed from the last column down to the initial row; columns are the initial column followed by the residues of s4.

          (init)    M     A     I     W     Q     S     M     G     R     A
GG         -32    -24   -19   -17    -9    -1     7     9    17    13     9
–K         -30    -22   -17   -15    -7     1     9    11    10     9    11
SG         -26    -18   -13   -11    -3     5    13    15    14    13    15
SS         -21    -13    -8    -6     2    10    18    14    16    18    20
RR         -19    -11    -6    -4     4    12    14    16    12    20    16
GG         -17     -9    -4    -2     6    14    16    12    14    10     6
QQ         -15     -7    -2     0     8    16    12     8     4     0    -4
WW         -13     -5     0     2    10     6     2    -2    -6   -10   -14
I–         -11     -3     2     4     0    -4    -8   -12   -16   -20   -24
AV          -7      1     6     2    -2    -6   -10   -14   -18   -22   -26
MM          -2      6     2    -2    -6   -10   -14   -18   -22   -26   -30
(init)       0     -4    -8   -12   -16   -20   -24   -28   -32   -36   -40

[(b): backtracking matrix over the same rows and columns, with cells marked '*' or '#'; figure not reproduced.]

(c)
M A I W Q G R S S – G – –
M V – W Q G R S G K G – –
M A I W Q – – S M – G R A

Fig. 12 Alignment of three sequences. (a) The filled score matrix. (b) The obtained backtracking matrix and the backtracking path. (c) The obtained global sequence alignment. The results without PSO refinement and with PSO refinement are identical.
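The entries of the score matrix in Fig. 12(a) appear consistent with scoring each move as the sum-of-pairs score of the whole new column, including the pairs inside the already aligned (s2, s3) column. The Python sketch below illustrates that reading; it is an inference from the printed values, not a statement of the authors' implementation, and `column_score` is a hypothetical name.

```python
from itertools import combinations

GAP = "-"  # ASCII stand-in for the space symbol


def pair_score(x, y, match=2, mismatch=-1, space=-2, space_space=0):
    """Score of one aligned pair under the example's scoring scheme."""
    if x == GAP and y == GAP:
        return space_space
    if x == GAP or y == GAP:
        return space
    return match if x == y else mismatch


def column_score(column, symbol):
    """SOP score of an aligned column extended by one more symbol."""
    symbols = list(column) + [symbol]
    return sum(pair_score(x, y) for x, y in combinations(symbols, 2))


# First column of Fig. 12(a): each aligned column is paired with a space.
print(column_score("MM", GAP))   # -2
print(column_score("AV", GAP))   # -5, cumulative -7
print(column_score("I-", GAP))   # -4, cumulative -11
# Diagonal move for the column (M, M) against residue M of s4.
print(column_score("MM", "M"))   # 6
# Bottom row: a residue of s4 against two spaces.
print(column_score("--", "M"))   # -4
```

Under this reading the first column accumulates as -2, -7, -11 and the bottom row decreases by 4 per residue, as printed in the figure.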

[Fig. 13(a) and (b): score matrix and backtracking matrix for aligning the row sequence set (s2, s3, s4) against the column sequence s1 = M I W G S R K P S G G; matrix entries not reproduced.]

(c)
M A I W Q G R – – S S – G – –
M V – W Q G R – – S G K G – –
M A I W Q – – – – S M – G R A
M – I W G S R K P S G – G – –

Fig. 13 Alignment of four sequences. (a) The filled score matrix. (b) The obtained backtracking matrix and the backtracking path. (c) The obtained global sequence alignment without PSO refinement.
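The particle encoding used for PSO refinement can be read directly off the alignment in Fig. 13(c). The Python sketch below extracts the initial gap-position vector xid(0), the alignment length L, the original sequence lengths li and the gap counts ni; the function name `gap_vector` is illustrative and `-` again stands for a space.

```python
def gap_vector(rows, gap="-"):
    """Return (positions, L, lengths, counts) for an alignment.

    positions : 1-based gap positions, concatenated sequence by sequence
    L         : alignment length
    lengths   : ungapped length of each sequence (l_i)
    counts    : number of gaps in each sequence (n_i)
    """
    positions = [i for row in rows
                 for i, ch in enumerate(row, start=1) if ch == gap]
    L = len(rows[0])
    counts = [row.count(gap) for row in rows]
    lengths = [L - n for n in counts]
    return positions, L, lengths, counts


fig13c = ["MAIWQGR--SS-G--",
          "MV-WQGR--SGKG--",
          "MAIWQ----SM-GRA",
          "M-IWGSRKPSG-G--"]

positions, L, lengths, counts = gap_vector(fig13c)
print(positions)
# [8, 9, 12, 14, 15, 3, 8, 9, 14, 15, 6, 7, 8, 9, 12, 2, 12, 14, 15]
print(L, lengths, counts)  # 15 [10, 10, 10, 11] [5, 5, 5, 4]
```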


inefficiency. The proposed approach is compared with both kinds of approaches. The first is ClustalW, one of the most widely used multiple sequence alignment systems. As mentioned earlier, ClustalW is a progressive approach. The other is the method proposed in (Liu, 2003), a stochastic and iterative approach that has been shown to have good search performance. The performance comparisons are shown in Table 2. To smooth out the randomness of the algorithm, 10 independent alignment runs are conducted for each data set. The maximum score, the average score, and the standard deviation of the scores for MDPPSO are listed in Table 3. The running time of the run obtaining the best result and the average running time are listed in Table 4. Note that “M.C.” denotes the number of match columns. From those results, it can be seen that the proposed approach is in general better than the other two approaches: in Table 2 it scores higher than ClustalW on all ten data sets and higher than the best result in (Liu, 2003) on eight of them. The running time is much shorter than those shown in (Liu, 2003). For illustration, Fig. 16 shows the multiple sequence alignment result for data set COG1603 obtained using the MDPPSO method, and Fig. 17 shows the result for the same data set using ClustalW version 1.83. The ClustalW simulation platform is obtained from the web site http://www.ebi.ac.uk/Tools/clustalw/index.html#.
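The M.C. figure can be computed from an alignment as sketched below in Python. The sketch assumes that a match column is a column in which every sequence carries the same residue and no space, which is how the asterisk is used in ClustalW output; `count_match_columns` is a hypothetical helper, and the small four-sequence alignment of Fig. 14 serves as input.

```python
def count_match_columns(rows, gap="-"):
    """Count columns in which all sequences show the same residue (no gaps)."""
    return sum(1 for column in zip(*rows)
               if column[0] != gap and len(set(column)) == 1)


fig14 = ["MAIWQGR--SS-G--",
         "M-VWQGR--SGKG--",
         "MAIWQ----SM-GRA",
         "M-IWGSRKPSG-G--"]

print(count_match_columns(fig14))  # 4 (columns 1, 4, 10 and 13)
```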

The parameters used in alignment indicate what is considered in the alignment problem. We agree

M A I W Q G R – – S S – G – –
M – V W Q G R – – S G K G – –
M A I W Q – – – – S M – G R A
M – I W G S R K P S G – G – –

Fig. 14 Four sequences global alignment with PSO refinement

Table 1 List of datasets

ID         Number of sequences   Average length of sequences (min, max)
COG2178    3                     211 (196, 222)
COG1983    4                     118 (65, 158)
COG1603    4                     222 (199, 245)
COG2157    4                     72 (57, 78)
COG1476    5                     71 (66, 79)
COG2097    6                     96 (81, 113)
COG1510    6                     170 (152, 185)
COG1761    6                     105 (85, 142)
COG0219    9                     158 (151, 166)
COG2003    9                     206 (148, 243)

[Fig. 15: the 20 × 20 BLOSUM62 substitution matrix, with rows and columns indexed by the amino acids A R N D C Q E G H I L K M F P S T W Y V; matrix entries not reproduced.]

Fig. 15 The BLOSUM62 scoring matrix

Table 2 The comparison with ClustalW and the best result in (Liu, 2003) for COG datasets

ID         MDPPSO best result    ClustalW result    Best result in (Liu, 2003)
           Score    M.C.         Score    M.C.      Score    M.C.
COG2178      654     44            384     41         653     44
COG1983     -351     14           -659     11        -323     15
COG1603      639     16            149     10         624     16
COG2157      610     18            499     12         608     18
COG1476     1674     22           1657     22        1668     21
COG2097     1998     12           1781     12        1993     12
COG1510     2205      6           1650      4        2157      6
COG1761      423      9           -144      7          43      6
COG0219    10204     25           9734     24       10358     25
COG2003     6656     22           5152     18        6442     19


that those parameters will affect the alignment results. However, different results then correspond to different solutions to different problems, which has nothing to do with the effectiveness of the search algorithm used. To give more illustrative examples, we have used different sets of parameters in the experiments. The ClustalW parameters can be changed by users according to the alignment problem at hand. Fig. 18 shows the main control panel of ClustalW. In Table 2 and Fig.

Table 3 The maximum, the average and the standard deviation of scores for MDPPSO

ID         Max. score   Average score   Standard deviation
COG2178      654          647             4.8305
COG1983     -351         -361.3           3.9455
COG1603      639          630.1           6.5566
COG2157      610          609.5           0.527
COG1476     1674         1674             0
COG2097     1998         1976.0          15.5496
COG1510     2205         2168.1          16.1895
COG1761      423          418.5           2.8382
COG0219    10204        10073.7          75.4734
COG2003     6656         6586            53.5931

Table 4 Running time of the best result and the average running time for COG datasets

ID         Running time of the best result (sec.)   Average running time (sec.)
COG2178     2097                                      2206.2
COG1983     3386                                      3528.3
COG1603     5944                                      6092.0
COG2157     1470                                      1549.8
COG1476     3019                                      2969.4
COG2097     7792                                      7995.4
COG1510    15122                                     14451.1
COG1761    10171                                     10013.7
COG0219    38504                                     40392.7
COG2003    65651                                     64838.4

17, the ClustalW alignment result is obtained using the default parameters, which are listed in Table 5. Five main parameters can be changed, either by choosing a different scoring matrix or by setting a satisfactory value: the protein matrix, the gap open penalty, the end gap penalty, the gap distance, and the gap extension penalty. For protein scoring, there are three kinds of matrices in ClustalW, which are the Blosum, the Pam, and the Gonnet,

Fig. 16 Multiple sequence alignment result of COG1603 using MDPPSO. An asterisk is used to indicate a match column


respectively. Table 6 shows the satisfactory parameter set. Table 7 shows the alignment results obtained using those parameters with each protein matrix for the COG datasets. The best MDPPSO alignment scores are higher than those of ClustalW even when different scoring matrices and gap penalties are used. For a clearer illustration, we also employed several available MSA tools for comparison. Table 8 shows the results of MDPPSO compared with MUSCLE (Edgar, 2004), MAFFT (Katoh et al., 2002; Katoh et al., 2005), T-COFFEE (Notredame et al., 2000; Poirot et al., 2003), DIALIGN-T (Subramanian et al., 2005) and MAP (Huang, 1994). Table 9 lists the web servers of these tools. The experimental results show that MDPPSO still obtains better results than these MSA tools.

V. CONCLUSIONS

A multiple sequence alignment approach that combines modified dynamic programming and particle swarm optimization was proposed. In the approach, PSO is employed to improve the result of modified dynamic programming each time an alignment step has been finished. This procedure is very important, since it affects the next alignment step in dynamic programming. The proposed algorithm thus combines progressive and iterative alignment methods. The experimental results reveal that the proposed approach is indeed promising. Several existing MSA tools are used for comparison. Among them, ClustalW is a commonly used MSA tool, but it is not a global search algorithm. Thus, if a global search algorithm, such as a Genetic Algorithm or an Ant System, is used and enough time is given, the result will usually be better than that of ClustalW. Of course, if

Fig. 18 The main control panel of the ClustalW V1.83 simulation platform

Fig. 17 Multiple sequence alignment result of COG1603 using ClustalW (V1.83). An asterisk is used to indicate a match column


Table 7 The ClustalW alignment results with variant protein matrices for COG datasets

                                 ClustalW protein matrix (MATRIX)
ID         MDPPSO best result    Blosum            Pam               Gonnet
           Score    M.C.         Score    M.C.     Score    M.C.     Score    M.C.
COG2178      654     44            581     45        582     42        616     44
COG1983     -351     14           -513     15       -541     13       -468     12
COG1603      639     16            444     16        470     14        592     15
COG2157      610     18            525     14        532     14        564     15
COG1476     1674     22           1657     22       1671     22       1671     22
COG2097     1998     12           1876     11       1782     12       1974     12
COG1510     2205      6           1976      7       1910      6       2097      6
COG1761      423      9            290      9         44      8        234      9
COG0219    10204     25           9956     25      10089     24      10156     25
COG2003     6656     22           5903     20       5865     20       6066     24

Table 8 Comparison of MDPPSO with MUSCLE, MAFFT, T-COFFEE, DIALIGN-T and MAP for COG datasets

Settings: MAFFT (Version 6); T-COFFEE (Regular); DIALIGN-T (L = 4, T = 40); MAP (gap open penalty = 10, extension penalty = 2).

ID         MDPPSO best     MUSCLE          MAFFT           T-COFFEE        DIALIGN-T       MAP
           Score   M.C.    Score   M.C.    Score   M.C.    Score   M.C.    Score   M.C.    Score   M.C.
COG2178      654    44       482    40       498    40       487    38       327    39       425    41
COG1983     -351    14      -675     9      -655    12      -786     9      -963    10      -766    10
COG1603      639    16       226    10        26    11       107    10       -80    11       153    12
COG2157      610    18       487    12       529    12       476    12      -494    12       485    12
COG1476     1674    22      1660    22      1660    22      1661    22      1654    22      1660    22
COG2097     1998    12      1683    11      1765    11      1376    11      1750    11      1651    11
COG1510     2205     6      1818     4      1840     5      1077     5       276     4      1712     4
COG1761      423     9         0     7        29     8      -436     7      -449     7      -434     7
COG0219    10204    25     10067    24     10047    24      9999    24      9755    24     10004    24
COG2003     6656    22      5826    18      5759    19      5604    18      5172    20      5294    17

Table 9 The web servers of MSA tools

MSA tool     Web server
MUSCLE       http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
MAFFT        http://align.bmr.kyushu-u.ac.jp/mafft/online/server/
T-COFFEE     http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi?stage1=1&daction=TCOFFEE::Regular
DIALIGN-T    http://dialign-t.gobics.de/submission?type=protein
MAP          http://genome.cs.mtu.edu/map.html

Table 5 The default values of alignment parameters within ClustalW

Parameter                        Default value (def)
Protein gap open penalty         10.0
Protein gap extension penalty    0.2
Protein matrix                   Gonnet
Protein ENDGAP                   -1
Protein GAPDIST                  4

Table 6 Another set of alignment parameters within ClustalW

Parameter                        Satisfactory value
Protein gap open penalty         1
Protein gap extension penalty    0.05
Protein end gaps                 -1 (default)
Protein gap distances            1


those considered sequences are highly similar, such MSA problems are considered simple and ClustalW can effectively find solutions for them. In general, the proposed approach, which has a global search nature, can outperform ClustalW on non-simple problems. This can be seen in our experiments. The proposed algorithm for multiple sequence alignment can produce more match columns and higher scores.

ACKNOWLEDGEMENTS

This work was supported in part by the National Science Council of Taiwan under Grant NSC 93-2218-E-011-008.

NOMENCLATURE

c1, c2              positive constants
F(i, j)             maximum score at position (i, j)
g                   gap penalty
L                   length of sequences
l_i                 length of the i-th original sequence
M                   number of sequences
N                   particle number
n_i                 number of gaps for sequence i
O(•)                complexity
P_gd                globally best position at present for all particles
P_id                best position at present for a particle id
round{}             round-off operation
rand( ), Rand( )    independent random functions in the range [0, 1]
s_a, s_i            sequences
w                   inertia weight
x_j^i               for 1 ≤ j ≤ n_i and 1 ≤ i ≤ m; location of a gap existing in sequence i
x_id, v_id          position and velocity of a particle id, respectively
σ(s_a^i, –)         score for a gap in sequence s_a at position i
σ(–, s_b^j)         score for a gap in sequence s_b at position j

REFERENCES

Abido, M. A., 2002, “Optimal Design of Power System Stabilizers Using Particle Swarm Optimization,” IEEE Transactions on Energy Conversion, Vol. 17, No. 3, pp. 406-413.

Altschul, S. F., 1993, “A Protein Alignment Scoring System Sensitive at All Evolutionary Distances,” Journal of Molecular Evolution, Vol. 36, No. 3, pp. 290-300.

Carrillo, H., and Lipman, D. J., 1988, “The Multiple Sequence Alignment Problem in Biology,” SIAM Journal on Applied Mathematics, Vol. 48, No. 5, pp. 1073-1082.

Chellapilla, K., and Fogel, G. B., 1999, “Multiple Sequence Alignment Using Evolutionary Programming,” Proceedings of the 1999 IEEE Congress on Evolutionary Computation, Vol. 1, Washington D. C., pp. 445-452.

Chen, S. C., Wong, A. K. C., and Chiu, D. K. Y., 1992, “A Survey of Multiple Sequence Comparison Methods,” Bulletin of Mathematical Biology, Vol. 54, No. 4, pp. 563-598.

Eberhart, R., and Kennedy, J., 1995, “A New Optimizer Using Particle Swarm Theory,” Proceedings of the Sixth IEEE International Symposium on Micro Machine and Human Science, Nagoya, Japan, pp. 39-43.

Edgar, R. C., 2004, “MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput,” Nucleic Acids Research, Vol. 32, No. 5, pp. 1792-1797.

Feng, D. F., and Doolittle, R. F., 1987, “Progressive Sequence Alignment as a Prerequisite to Correct Phylogenetic Trees,” Journal of Molecular Evolution, Vol. 25, No. 4, pp. 351-360.

Gotoh, O., 1982, “An Improved Algorithm for Matching Biological Sequences,” Journal of Molecular Biology, Vol. 162, No. 3, pp. 705-708.

Hirschberg, D. S., 1975, “A Linear Space Algorithm for Computing Longest Common Subsequences,” Communications of the Association for Computing Machinery, Vol. 18, No. 6, pp. 341-343.

Huang, X., 1994, “On Global Sequence Alignment,” CABIOS, Vol. 10, No. 3, pp. 227-235.

Isokawa, M., Wayama, M., and Shimizu, T., 1996, “Multiple Sequence Alignment Using a Genetic Algorithm,” Genome Informatics, Vol. 7, pp. 176-177.

Jiang, T., and Wang, L., 1994, “On the Complexity of Multiple Sequence Alignment,” Journal of Computational Biology, Vol. 1, No. 4, pp. 337-378.

Karlin, S., and Altschul, S. F., 1990, “Methods for Assessing the Statistical Significance of Molecular Sequence Features by Using General Scoring Schemes,” Proceedings of the National Academy of Sciences USA, Vol. 87, No. 6, pp. 2264-2268.

Katoh, K., Misawa, K., Kuma, K., and Miyata, T., 2002, “MAFFT: A Novel Method for Rapid Multiple Sequence Alignment Based on Fast Fourier Transform,” Nucleic Acids Research, Vol. 30, No. 14, pp. 3059-3066.

Katoh, K., Kuma, K., Toh, H., and Miyata, T., 2005, “MAFFT Version 5: Improvement in Accuracy of Multiple Sequence Alignment,” Nucleic Acids Research, Vol. 33, No. 2, pp. 511-518.

Kennedy, J., and Eberhart, R., 1995, “Particle Swarm Optimization,” IEEE International Conference on Neural Networks, Vol. 4, pp. 1942-1948.

Karadimitriou, K., and Kraft, D. H., 1996, “Genetic Algorithms and the Multiple Sequence Alignment Problem in Biology,” Proceedings of the Second Annual Molecular Biology and Biotechnology Conference, Baton Rouge, LA, USA.

Kumar, A. I. S., Dhanushkodi, K., Kumar, J. J., and Paul, C. K. C., 2003, “Particle Swarm Optimization Solution to Emission and Economic Dispatch Problem,” TENCON 2003 Conference on Convergent Technologies for Asia-Pacific Region, Vol. 1, Taj Residency Bangalore, pp. 435-439.

Lipman, D. J., Altschul, S. F., and Kececioglu, J. D., 1989, “A Tool for Multiple Sequence Alignment,” Proceedings of the National Academy of Sciences USA, Vol. 86, No. 12, pp. 4412-4415.

Liu, K.-H., 2003, “Multiple Sequence Alignment: An Evolutionary Programming Based Algorithm with Local Search,” Master Thesis, National Taiwan University of Science and Technology, Department of Electrical Engineering.

McClure, M. A., Vasi, T. K., and Fitch, W. M., 1994, “Comparative Analysis of Multiple Protein Sequence Alignment Methods,” Molecular Biology and Evolution, Vol. 11, No. 4, pp. 571-592.

Myers, E. W., and Miller, W., 1988, “Multiple Sequence Alignment Using Simulated Annealing,” Computer Applications in the Biosciences, Vol. 4, No. 1, pp. 11-17.

Needleman, S. B., and Wunsch, C. D., 1970, “A General Method Applicable to the Search for Similarities in the Amino Acid Sequences of Two Proteins,” Journal of Molecular Biology, Vol. 48, No. 4, pp. 443-453.

Notredame, C., and Higgins, D. G., 1996, “SAGA: Sequence Alignment by Genetic Algorithm,” Nucleic Acids Research, Vol. 24, No. 8, pp. 1515-1524.

Notredame, C., Higgins, D. G., and Heringa, J., 2000, “T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment,” Journal of Molecular Biology, Vol. 302, No. 1, pp. 205-217.

Omar, M. F., Salam, R. A., Rashid, N. A., and Abdullah, R., 2004, “Multiple Sequence Alignment Using Genetic Algorithm and Simulated Annealing,” Proceedings of the 2004 IEEE 12th Conference on Signal Processing and Communications Application, Kusadasi, Turkey, pp. 28-30.

Poirot, O., O’Toole, E., and Notredame, C., 2003, “Tcoffee@igs: A Web Server for Computing, Evaluating and Combining Multiple Sequence Alignments,” Nucleic Acids Research, Vol. 31, No. 13, pp. 3503-3506.

Robinson, J., and Rahmat-Samii, Y., 2004, “Particle Swarm Optimization in Electromagnetics,” IEEE Transactions on Antennas and Propagation, Vol. 52, No. 2, pp. 397-407.

Smith, T. F., and Waterman, M. S., 1981, “Identification of Common Molecular Subsequences,” Journal of Molecular Biology, Vol. 147, No. 1, pp. 195-197.

Stoye, J., Moulton, V., and Dress, A. W., 1997, “DCA: An Efficient Implementation of the Divide-And-Conquer Approach to Simultaneous Multiple Sequence Alignment,” Computer Applications in the Biosciences, Vol. 13, No. 6, pp. 625-626.

Stoye, J., 1998, “Multiple Sequence Alignment with the Divide-And-Conquer Method,” Gene, Vol. 211, No. 2, pp. GC45-GC56.

Shi, Y., and Eberhart, R., 1998, “A Modified Particle Swarm Optimizer,” IEEE International Conference on Evolutionary Computation, Anchorage, AK, USA, pp. 69-73.

Subramanian, A. R., Weyer-Menkhoff, J., Kaufmann, M., and Morgenstern, B., 2005, “DIALIGN-T: An Improved Algorithm for Segment-Based Multiple Sequence Alignment,” BMC Bioinformatics, 6: 66.

Thompson, J. D., Plewniak, F., and Poch, O., 1999, “A Comprehensive Comparison of Multiple Sequence Alignment Programs,” Nucleic Acids Research, Vol. 27, No. 13, pp. 2682-2690.

Thompson, J. D., Higgins, D. G., and Gibson, T. J., 1994, “CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice,” Nucleic Acids Research, Vol. 22, No. 22, pp. 4673-4680.

Thomsen, R., Fogel, G. B., and Krink, T., 2002, “A Clustal Alignment Improver Using Evolutionary Algorithms,” Proceedings of the 2002 IEEE Congress on Evolutionary Computation, Honolulu, Hawaii, Vol. 1, pp. 121-126.

Van der Merwe, D. W., and Engelbrecht, A. P., 2003, “Data Clustering Using Particle Swarm Optimization,” The 2003 IEEE Congress on Evolutionary Computation, Canberra, Australia, Vol. 1, pp. 215-220.

Yang, B.-H., 2001, “An Approach to Multiple Protein Sequence Alignment Using A Genetic Algorithm,” Master Thesis, National Central University, Department of Computer Science and Information Engineering, Taiwan ROC.

Zhang, C., and Wong, A. K. C., 1997, “Toward Efficient Multiple Molecular Sequence Alignment: A System of Genetic Algorithm and Dynamic Programming,” IEEE Transactions on Systems, Man, and Cybernetics-part B, Vol. 27, No. 6, pp. 918-932.

Manuscript Received: Feb. 9, 2006
Revision Received: Jan. 21, 2008
and Accepted: Feb. 25, 2008


ALGORITHM I: DYNAMIC PROGRAMMING FOR GLOBAL ALIGNMENT

Aligning sequences sa and sb of length m and n, respectively, with linear gap penalty g.

begin
  initialization
    for i := 0 to m do
      F(i, 0) = –i·g
    end
    for j := 1 to n do
      F(0, j) = –j·g
    end
  matrix fill
    for i := 1 to m do
      for j := 1 to n do
        F(i, j) = max{F(i – 1, j – 1) + σ(s_a^i, s_b^j), F(i – 1, j) – g, F(i, j – 1) – g}
      end
    end
end
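Algorithm I can be sketched in Python as follows. This is an illustrative transcription rather than the authors' MATLAB code: it uses the simple match/mismatch scores of the worked example instead of BLOSUM62, treats g as a positive penalty that is subtracted, and the name `nw_score_matrix` is hypothetical.

```python
def nw_score_matrix(a, b, match=2, mismatch=-1, g=2):
    """Fill the global alignment score matrix F for sequences a and b.

    F[i][j] is the best score of aligning a[:i] with b[:j];
    g is a positive linear gap penalty.
    """
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: a prefix aligned entirely against spaces.
    for i in range(1, m + 1):
        F[i][0] = -i * g
    for j in range(1, n + 1):
        F[0][j] = -j * g
    # Matrix fill.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sigma = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sigma,  # align a[i-1] with b[j-1]
                          F[i - 1][j] - g,          # a[i-1] against a space
                          F[i][j - 1] - g)          # b[j-1] against a space
    return F


s2, s3 = "MAIWQGRSSG", "MVWQGRSGKG"
print(nw_score_matrix(s2, s3)[-1][-1])  # 8, the (s2, s3) entry of Fig. 11
```

For protein data, `sigma` would instead be looked up in a substitution matrix such as BLOSUM62.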

ALGORITHM II: BACKTRACK MATRIX CONSTRUCTION

Aligning sequences sa and sb of length m and n, respectively, with linear gap penalty.

// gap_penalty and Mismatch denote signed scores (negative values).
begin
  for i := 1 to m do
    for j := 1 to n do
      UP_Value = F(i – 1, j)
      Left_Value = F(i, j – 1)
      UP_Left_Value = F(i – 1, j – 1)
      if (s_a^i = s_b^j) do
        BM(i, j) = '*'
      else
        if (Left_Value >= UP_Value) do
          if (Left_Value + gap_penalty >= UP_Left_Value + Mismatch) do
            fill BM(i, j) with '–'
          else
            fill BM(i, j) with '*'
          end
        else
          if (UP_Value + gap_penalty >= UP_Left_Value + Mismatch) do
            fill BM(i, j) with '#'
          else
            fill BM(i, j) with '*'
          end
        end
      end
    end
  end
end
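The marker rules of Algorithm II, followed by a traceback that reads the markers, might look as follows in Python. This is a sketch under stated assumptions: `*` is a diagonal move, `#` an upward move and `-` (ASCII) a leftward move; the border cells, which the pseudocode does not cover, are filled so that the traceback can reach the origin; and all function names are illustrative.

```python
def score_matrix(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment scores; gap is a signed (negative) score here."""
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F


def backtrack_matrix(F, a, b, mismatch=-1, gap=-2):
    """Fill BM with '*', '#' or '-' following the rules of Algorithm II."""
    m, n = len(a), len(b)
    BM = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        BM[i][0] = "#"  # border: only upward moves possible (assumption)
    for j in range(1, n + 1):
        BM[0][j] = "-"  # border: only leftward moves possible (assumption)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            up, left, diag = F[i - 1][j], F[i][j - 1], F[i - 1][j - 1]
            if a[i - 1] == b[j - 1]:
                BM[i][j] = "*"
            elif left >= up:
                BM[i][j] = "-" if left + gap >= diag + mismatch else "*"
            else:
                BM[i][j] = "#" if up + gap >= diag + mismatch else "*"
    return BM


def traceback(BM, a, b):
    """Follow the markers from the bottom-right cell back to the origin."""
    i, j, row_a, row_b = len(a), len(b), [], []
    while i > 0 or j > 0:
        move = BM[i][j]
        if move == "*":
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif move == "#":
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


a, b = "AG", "A"
print(traceback(backtrack_matrix(score_matrix(a, b), a, b), a, b))  # ('AG', 'A-')
```

Because a cell is marked `*` whenever the two residues are equal, the path recovered this way is a heuristic and need not coincide with an optimal path of the score matrix.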


ALGORITHM III: MDPPSO

Aligning sequences s1, s2, ..., sn with the BLOSUM62 scoring matrix.

// Initialization: calculate pairwise alignment scores and determine the alignment order.
begin
  for i := 1 to n – 1 do
    for j := i + 1 to n do
      compute score(si, sj) using DP
    end
  end
  Determine the alignment order according to the pair scores score(si, sj).
end

// Combine modified dynamic programming and PSO to align multiple sequences.
// Suppose sa is the highest-scoring sequence pair and sb is the next sequence
// to be added to the alignment.
begin
  Pick the highest-scoring sequence pair sa as the row sequences.
  for i := 3 to n do
    According to the alignment order, pick the next sequence sb as the column sequence
    Align (sa, sb) using modified DP
    Improve the alignment of (sa, sb) using PSO
    Remove all columns consisting only of spaces and set the result as the row sequences
  end
  Output the resulting multiple alignment
end
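The clean-up step of Algorithm III, removing columns that contain only spaces after PSO has moved gaps around, is simple to state in code. A minimal Python sketch, with the hypothetical name `remove_all_gap_columns` and `-` as the space symbol:

```python
def remove_all_gap_columns(rows, gap="-"):
    """Drop every column in which all sequences carry a space."""
    kept = [column for column in zip(*rows)
            if any(symbol != gap for symbol in column)]
    if not kept:
        return ["" for _ in rows]
    # Transpose the kept columns back into rows.
    return ["".join(symbols) for symbols in zip(*kept)]


print(remove_all_gap_columns(["MA-I-", "M--V-"]))  # ['MAI', 'M-V']
```

Removing such columns does not change the SOP score when a space-to-space pair scores 0, as in both scoring schemes used in the paper, but it shortens the row sequences passed to the next DP step.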