A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin...
![Page 1: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/1.jpg)
A fast parallel algorithm for finding the longest common sequence of multiple
biosequences.
Yixin Chen, Andrew Wan, Wei Liu
Washington University / Yangzhou University
December 2006
![Page 2: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/2.jpg)
Biological Review
DNA can be represented as a sequence of four letters
(A, C, G, T).
New biosequences are produced every day.
Sequence comparison:
find similarities with other sequences that are already
known.
Longest Common Subsequence (LCS).
![Page 3: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/3.jpg)
Longest Common Subsequence LCS
Find the longest subsequence that is common to
two or more given strings. Unlike a substring, a subsequence need not be contiguous.
Searching for the LCS of multiple biosequences is a
fundamental task in Bioinformatics.
LCS is a special case of global sequence alignment.
All algorithms of global sequence alignment can
be used to solve LCS.
![Page 4: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/4.jpg)
Longest Common Subsequence (LCS): algorithms and complexity
Dynamic programming.
Smith-Waterman algorithm:
time complexity O(mn).
Myers and Miller:
space complexity O(m+n).
Parallel algorithms.
CREW-PRAM
![Page 5: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/5.jpg)
Longest Common Subsequence (LCS): algorithms and complexity for multiple sequences
What is the problem with those algorithms?
Smith-Waterman: the complexity becomes exponential in the number of sequences.
Carrillo and Lipman algorithm.
New divide-and-conquer algorithm, DCA (Stoye).
Clustal-W (Feng and Doolittle’s algorithm).
![Page 6: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/6.jpg)
FAST_LCS
Seeks the successors of the initial identical character
pairs, using successor tables.
Pruning operations remove redundant pairs.
FAST_LCS can be extended to find the LCS of multiple
biosequences.
The authors also implemented it on a parallel computing
model.
![Page 7: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/7.jpg)
Initial identical character pair and its successor table
Let X and Y be two biosequences,
where $X_i, Y_j \in \{A, C, G, T\}$
Define an Array CH of four CHaracters:
CH(1) = A
CH(2) = C
CH(3) = G
CH(4) = T
![Page 8: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/8.jpg)
Initial identical character pair and its successor table
Build the successor table of identical characters for each of the two strings.
Call the tables TX for sequence X and TY for sequence Y.
The tables are defined as follows:
$$TX(i,j) = \begin{cases} \min\{k \mid k \in SX(i,j)\} & \text{if } SX(i,j) \neq \emptyset \\ - & \text{otherwise} \end{cases}$$
where $SX(i,j) = \{k \mid X_k = CH(i),\ k > j\}$, $i = 1,2,3,4$ and $j = 0,1,\dots,n$.
TY is defined in the same way from Y.
For example, i = 1 refers to A.
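The table definition can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the name `successor_table`, the dictionary-of-lists layout and the use of `None` in place of “-” are assumptions made here.

```python
ALPHABET = "ACGT"  # CH(1)..CH(4) in the slides


def successor_table(seq, alphabet=ALPHABET):
    """Return table[c][j]: the smallest 1-based position k > j with seq_k == c.

    j runs from 0 to len(seq); None stands in for the "-" entry.
    Every character of seq is assumed to belong to the alphabet.
    """
    n = len(seq)
    table = {c: [None] * (n + 1) for c in alphabet}
    # Fill right to left: the entry for j is j + 1 when that position holds c,
    # otherwise it equals the entry for j + 1.
    for j in range(n - 1, -1, -1):
        for c in alphabet:
            table[c][j] = table[c][j + 1]
        table[seq[j]][j] = j + 1  # seq[j] is position j + 1 in 1-based terms
    return table


tx = successor_table("TGCATA")
print(tx["A"])  # [4, 4, 4, 4, 6, 6, None]
```

Filling the table from right to left touches each entry once, so the construction costs time proportional to the alphabet size times the sequence length.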
![Page 9: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/9.jpg)
Example
Let X = “TGCATA” and Y = “ATCTGAT”
If Xi = Yj = CH(k), they are called an identical pair of CH(k),
denoted (i,j).
Let (i,j) and (k,l) be two identical character pairs of X and Y.
If i < k and j < l, then (i,j) is called a “predecessor” of (k,l),
and (k,l) a “successor” of (i,j).
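A brute-force enumeration makes these definitions concrete. The helper names `identical_pairs` and `is_predecessor` are hypothetical, chosen here for illustration; positions are 1-based to match the slides.

```python
def identical_pairs(x, y):
    """All 1-based pairs (i, j) such that x_i == y_j."""
    return [
        (i, j)
        for i, a in enumerate(x, start=1)
        for j, b in enumerate(y, start=1)
        if a == b
    ]


def is_predecessor(p, q):
    """True when p = (i, j) precedes q = (k, l), i.e. i < k and j < l."""
    return p[0] < q[0] and p[1] < q[1]


pairs = identical_pairs("TGCATA", "ATCTGAT")
print(len(pairs))                      # 12
print(is_predecessor((2, 5), (4, 6)))  # True
```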
![Page 10: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/10.jpg)
Successor Table for example
Remember the sequence X = “TGCATA”
SX(1,0) = {k | Xk = CH(1) = A, k > 0} = {4, 6}
TX(1,0) = min {4, 6} = 4
...
SX(1,4) = {k | Xk = CH(1) = A, k > 4} = {6}
TX(1,4) = 6
![Page 11: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/11.jpg)
Example
The final table entries for A (i = 1) are:
TX(1,0) = 4 TX(1,5) = 6
TX(1,1) = 4 TX(1,6) = -
TX(1,2) = 4
TX(1,3) = 4
TX(1,4) = 6
X = “TGCATA”
![Page 12: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/12.jpg)
Some definitions
If Xi = Yj = CH(k), they are called an “identical pair of CH(k)”,
denoted (i,j).
The set of all identical pairs of X and Y is denoted
S(X,Y).
If $(i,j) \in S(X,Y)$ is an identical pair and there is no
$(k,l) \in S(X,Y)$ such that (k,l) < (i,j), that is, k < i and l < j, then (i,j) is called an
“initial identical pair”.
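Read literally, an initial identical pair is an identical pair with no predecessor. The following sketch applies that reading by brute force; the function name `initial_identical_pairs` is an assumption, and the quadratic scan is for illustration only.

```python
def initial_identical_pairs(x, y):
    """Identical pairs (i, j) of x and y that have no predecessor (k, l) < (i, j)."""
    pairs = [
        (i, j)
        for i, a in enumerate(x, start=1)
        for j, b in enumerate(y, start=1)
        if a == b
    ]
    # Keep q only if no other pair is strictly smaller in both coordinates.
    return [
        q for q in pairs
        if not any(p[0] < q[0] and p[1] < q[1] for p in pairs)
    ]


print(initial_identical_pairs("TGCATA", "ATCTGAT"))
# [(1, 2), (1, 4), (1, 7), (4, 1), (6, 1)]
```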
![Page 13: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/13.jpg)
Some definitions
The level of each pair is defined as follows:
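The formula itself survives only in the slide image. A reading consistent with the surrounding definitions is assumed here: an initial identical pair has level 1, and any other pair has level $1 + \max\{level(k,l) \mid (k,l) < (i,j)\}$. The sketch below computes levels under that assumption; the function name `levels` is hypothetical.

```python
def levels(x, y):
    """Map each identical pair (i, j) to its level under the assumed definition."""
    pairs = sorted(
        (i, j)
        for i, a in enumerate(x, start=1)
        for j, b in enumerate(y, start=1)
        if a == b
    )
    level = {}
    # Pairs are sorted by i, so every predecessor (strictly smaller i)
    # has already been assigned a level when q is processed.
    for q in pairs:
        preds = [level[p] for p in level if p[0] < q[0] and p[1] < q[1]]
        level[q] = 1 + max(preds, default=0)
    return level


lv = levels("TGCATA", "ATCTGAT")
print(lv[(1, 2)], lv[(2, 5)], max(lv.values()))  # 1 2 4
```

The maximum level, 4, equals the LCS length of the two example sequences (for instance “TGAT”).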
![Page 14: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/14.jpg)
Theorems
T1: If the length of the LCS of X and Y is denoted
|LCS(X,Y)|, then $|LCS(X,Y)| = \max\{level(i,j) \mid (i,j) \in S(X,Y)\}$
T2: For an identical character pair $(i,j) \in S(X,Y)$, the
operation that produces all its direct successors is as
follows:
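The operation in T2 is shown only in the slide image. The sketch below assumes it pairs the successor-table entries character by character, giving the candidates (TX(k,i), TY(k,j)) for k = 1,…,4 and discarding any candidate that contains “-”; this reading reproduces the worked example for the pair (2,5). The names `successor_table` and `direct_successors` are hypothetical, and `None` stands in for “-”.

```python
ALPHABET = "ACGT"


def successor_table(seq, alphabet=ALPHABET):
    """table[c][j] = smallest 1-based k > j with seq_k == c, or None for "-"."""
    n = len(seq)
    table = {c: [None] * (n + 1) for c in alphabet}
    for j in range(n - 1, -1, -1):
        for c in alphabet:
            table[c][j] = table[c][j + 1]
        table[seq[j]][j] = j + 1
    return table


def direct_successors(pair, tx, ty, alphabet=ALPHABET):
    """Candidates (TX(c, i), TY(c, j)) for each character c, skipping "-" entries."""
    i, j = pair
    result = []
    for c in alphabet:
        a, b = tx[c][i], ty[c][j]
        if a is not None and b is not None:
            result.append((a, b))
    return result


tx = successor_table("TGCATA")
ty = successor_table("ATCTGAT")
print(direct_successors((2, 5), tx, ty))  # [(4, 6), (5, 7)]
```

Each call is a constant number of table lookups, which is what makes the successor tables worth precomputing.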
![Page 15: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/15.jpg)
Back to the Example
The candidate successors of the pair (2,5), one for each of A, C, G and T, are:
(4,6), (3,-), (-,-), (5,7)
![Page 16: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/16.jpg)
Example
After discarding the candidates that contain “-”, the direct successors of the pair (2,5) are:
(4,6), (5,7)
![Page 17: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/17.jpg)
Pruning operations
Operation 1
If, on the same level, there are two identical character
pairs (i,j) and (k,l) with (k,l) > (i,j), then (k,l) can be
pruned without affecting the correctness of the
algorithm.
This operation removes all the redundant identical
pairs.
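Operation 1 can be sketched as a dominance filter over the pairs of one level. The function name `prune_dominated` is an assumption, and the quadratic scan is chosen for clarity rather than speed.

```python
def prune_dominated(pairs):
    """Drop every pair (k, l) for which some pair (i, j) in the same set has i < k and j < l."""
    pairs = set(pairs)
    return sorted(
        q for q in pairs
        if not any(p[0] < q[0] and p[1] < q[1] for p in pairs)
    )


print(prune_dominated([(4, 6), (5, 7)]))  # [(4, 6)]
```

A pruned pair is safe to drop because any common subsequence that continues from it can also continue from the smaller pair that dominates it.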
![Page 18: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/18.jpg)
Back to the Example
The successors of the pair (2,5) were
(4,6), (5,7)
Since they are on the same level and (5,7) > (4,6), the pair (5,7) is pruned.
The only remaining successor of the pair (2,5) is:
(4,6)
![Page 19: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/19.jpg)
Pruning operations
Operation 2
If, on the same level, there are two identical character
pairs (i1,j) and (i2,j) with i1 < i2, then (i2,j) can be pruned
without affecting the correctness of the algorithm.
![Page 20: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/20.jpg)
Pruning operations
Operation 3
If, on the same level, there are r identical character
pairs (i1,j), (i2,j), …, (ir,j) with i1 < i2 < … < ir, then
(i2,j), …, (ir,j) can be pruned.
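Operations 2 and 3 both amount to keeping, for each second coordinate j, only the pair with the smallest first coordinate. A minimal sketch, with the hypothetical name `prune_same_column`:

```python
def prune_same_column(pairs):
    """Among pairs (i1, j), (i2, j), ... that share the same j, keep only the smallest i."""
    best = {}
    for i, j in pairs:
        if j not in best or i < best[j]:
            best[j] = i
    return sorted((i, j) for j, i in best.items())


print(prune_same_column([(3, 5), (2, 5), (6, 5), (4, 1)]))  # [(2, 5), (4, 1)]
```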
![Page 21: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/21.jpg)
FAST_LCS complexity
The authors claim that the complexity of their algorithm is
O(L), where L is the number of identical character
pairs of X and Y.
When the algorithm is implemented using parallel
computing, the time complexity is O(|LCS(X,Y)|).
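Putting the pieces together, a level-by-level search can be sketched as follows. This is a simplified illustration, not the authors' code: it starts from a virtual pair (0,0), applies only pruning operation 1, and uses a naive quadratic pruning step, so it does not attain the claimed O(L) bound. All function names are assumptions.

```python
ALPHABET = "ACGT"


def successor_table(seq, alphabet=ALPHABET):
    """table[c][j] = smallest 1-based k > j with seq_k == c, or None for "-"."""
    n = len(seq)
    table = {c: [None] * (n + 1) for c in alphabet}
    for j in range(n - 1, -1, -1):
        for c in alphabet:
            table[c][j] = table[c][j + 1]
        table[seq[j]][j] = j + 1
    return table


def prune_dominated(pairs):
    """Pruning operation 1: drop pairs strictly larger than another pair in both coordinates."""
    return {
        q for q in pairs
        if not any(p[0] < q[0] and p[1] < q[1] for p in pairs)
    }


def fast_lcs_length(x, y, alphabet=ALPHABET):
    """Length of the LCS of x and y, counted as the number of non-empty levels."""
    tx, ty = successor_table(x, alphabet), successor_table(y, alphabet)
    frontier = {(0, 0)}  # virtual start pair placed before both sequences
    length = 0
    while True:
        candidates = set()
        for i, j in frontier:
            for c in alphabet:
                a, b = tx[c][i], ty[c][j]
                if a is not None and b is not None:
                    candidates.add((a, b))
        if not candidates:
            return length
        frontier = prune_dominated(candidates)
        length += 1


print(fast_lcs_length("TGCATA", "ATCTGAT"))  # 4
```

The loop runs once per level, which is why the number of rounds equals |LCS(X,Y)|; the parallel bound quoted above corresponds to expanding all pairs of a level at the same time.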
![Page 22: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/22.jpg)
FAST_LCS and multiple sequences
FAST_LCS can be easily extended to the LCS
problem of multiple sequences.
From a biological point of view, it is more important
to find the LCS of multiple sequences.
![Page 23: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/23.jpg)
FAST_LCS and multiple sequences
Suppose there are n sequences X1, X2, …, Xn, where Xi =
(Xi1, Xi2, …, Xi,ni), ni is the length of Xi, and $X_{ij} \in \{A, C, G, T\}$
for j = 1, 2, …, ni.
The successor tables are denoted
TX1, TX2, …, TXn, where TXs is a two-dimensional array
for the sequence Xs = (Xs1, Xs2, …, Xs,ns),
s = 1, 2, …, n
![Page 24: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/24.jpg)
FAST_LCS and multiple sequences
Following the same procedure that is used for two
sequences, start by building the successor
tables of all the sequences, as follows:
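Assuming each table TXs is built exactly as in the two-sequence case, the multi-sequence step is a loop over the inputs. The names below are hypothetical, and `None` again stands in for “-”.

```python
ALPHABET = "ACGT"


def successor_table(seq, alphabet=ALPHABET):
    """table[c][j] = smallest 1-based k > j with seq_k == c, or None for "-"."""
    n = len(seq)
    table = {c: [None] * (n + 1) for c in alphabet}
    for j in range(n - 1, -1, -1):
        for c in alphabet:
            table[c][j] = table[c][j + 1]
        table[seq[j]][j] = j + 1
    return table


def successor_tables(seqs, alphabet=ALPHABET):
    """One successor table per input sequence: TX1, TX2, ..., TXn."""
    return [successor_table(s, alphabet) for s in seqs]


tables = successor_tables(["TGCATA", "ATCTGAT", "CTGATTC"])
print(tables[2]["C"])  # [1, 7, 7, 7, 7, 7, 7, None]
```

The tables are independent of one another, so this step could also be distributed with one sequence per worker.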
![Page 25: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/25.jpg)
FAST_LCS and multiple sequences
Identical character tuple for LCS of multiple
sequences.
The level of each tuple is defined as follows:
![Page 26: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/26.jpg)
FAST_LCS and multiple sequences
For an identical character tuple, the operation is as follows:
The authors state two more theorems, which are essentially
extensions of T1 and T2 to multiple sequences.
![Page 27: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/27.jpg)
Example
Let n = 3, X1 = “TGCATA”, X2 = “ATCTGAT” and
X3 = “CTGATTC”.
The successor tables are:
![Page 28: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/28.jpg)
Example
The direct successors of the identical character triple
(1,2,2) can be obtained by :
![Page 29: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/29.jpg)
Example
The successors are: (4,6,4), (3,3,7), (2,5,3), (5,4,5).
Then, following the algorithm and applying the pruning
operations, we get (3,3,7), (2,5,3), (5,4,5), etc.
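The triple example can be checked with a short sketch. It assumes that the direct successors of a tuple are formed per character from the successor tables, as in the two-sequence case, and that a tuple is pruned when another tuple on the same level is strictly smaller in every coordinate. The function names are hypothetical.

```python
ALPHABET = "ACGT"


def successor_table(seq, alphabet=ALPHABET):
    """table[c][j] = smallest 1-based k > j with seq_k == c, or None for "-"."""
    n = len(seq)
    table = {c: [None] * (n + 1) for c in alphabet}
    for j in range(n - 1, -1, -1):
        for c in alphabet:
            table[c][j] = table[c][j + 1]
        table[seq[j]][j] = j + 1
    return table


def tuple_successors(point, tables, alphabet=ALPHABET):
    """For each character, look up the next position in every sequence; skip "-" entries."""
    result = []
    for c in alphabet:
        nxt = tuple(t[c][p] for t, p in zip(tables, point))
        if None not in nxt:
            result.append(nxt)
    return result


def prune_dominated(points):
    """Drop tuples that are strictly larger than another tuple in every coordinate."""
    pts = set(points)
    return sorted(
        q for q in pts
        if not any(all(a < b for a, b in zip(p, q)) for p in pts)
    )


tables = [successor_table(s) for s in ("TGCATA", "ATCTGAT", "CTGATTC")]
succ = tuple_successors((1, 2, 2), tables)
print(succ)                   # [(4, 6, 4), (3, 3, 7), (2, 5, 3), (5, 4, 5)]
print(prune_dominated(succ))  # [(2, 5, 3), (3, 3, 7), (5, 4, 5)]
```

Here (4,6,4) is pruned because (2,5,3) is smaller in all three coordinates.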
![Page 30: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/30.jpg)
FAST_LCS for multiple sequences: complexity
The time complexity of most algorithms for multiple
sequence LCS depends on the number of sequences.
They are not practical when the number of
sequences is large.
![Page 31: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/31.jpg)
FAST_LCS for multiple sequences: complexity
When the FAST_LCS algorithm is implemented using
parallel computing, the time complexity is
O(|LCS(X1, X2, …, Xn)|).
Its complexity is “independent of the number of
sequences n”
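The level-synchronous structure behind this bound can be illustrated with a thread pool that expands all tuples of one level concurrently. This is a sketch under stated assumptions, not the authors' parallel implementation: the names are hypothetical, only dominance pruning is applied, and Python threads illustrate the structure without giving a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

ALPHABET = "ACGT"


def successor_table(seq, alphabet=ALPHABET):
    """table[c][j] = smallest 1-based k > j with seq_k == c, or None for "-"."""
    n = len(seq)
    table = {c: [None] * (n + 1) for c in alphabet}
    for j in range(n - 1, -1, -1):
        for c in alphabet:
            table[c][j] = table[c][j + 1]
        table[seq[j]][j] = j + 1
    return table


def prune_dominated(points):
    """Drop tuples that are strictly larger than another tuple in every coordinate."""
    return [
        q for q in points
        if not any(all(a < b for a, b in zip(p, q)) for p in points)
    ]


def mlcs_length(seqs, alphabet=ALPHABET, workers=4):
    """LCS length of several sequences; each loop iteration handles one level."""
    if not seqs:
        return 0
    tables = [successor_table(s, alphabet) for s in seqs]

    def expand(point):
        # Direct successors of one tuple: one candidate per character.
        result = []
        for c in alphabet:
            nxt = tuple(t[c][p] for t, p in zip(tables, point))
            if None not in nxt:
                result.append(nxt)
        return result

    frontier = [tuple(0 for _ in seqs)]  # virtual start tuple
    rounds = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            candidates = set()
            # All tuples of the current level are expanded concurrently.
            for succ in pool.map(expand, frontier):
                candidates.update(succ)
            if not candidates:
                return rounds
            frontier = prune_dominated(candidates)
            rounds += 1


print(mlcs_length(["TGCATA", "ATCTGAT", "CTGATTC"]))  # 4
```

The number of loop iterations equals the LCS length regardless of how many sequences are given, which is the sense in which the parallel bound does not depend on n; the work inside each level still grows with n.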
![Page 32: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/32.jpg)
Sequential computation on two sequences
![Page 33: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/33.jpg)
Sequential computation on multiple sequences
![Page 34: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/34.jpg)
Computation using parallel computing
![Page 35: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. Yixin Chen, Andrew Wan, Wei Liu Washington University / Yangzhou.](https://reader036.fdocuments.net/reader036/viewer/2022062714/56649d4d5503460f94a2c35d/html5/thumbnails/35.jpg)
Conclusion
FAST_LCS is more precise than FASTA and faster
than the Smith-Waterman algorithm for computation on two sequences.
FAST_LCS is faster than other algorithms that compute the LCS of
multiple sequences, such as
CLUSTAL-W.
FAST_LCS can be implemented using parallel computation.