1
Multiple Genome Alignment:
Chaining Algorithms Revisited
Graduiertenkolleg Bioinformatik
Universität Bielefeld
Mohamed I. Abouelhoda, University of Bielefeld
Joint work with
Enno Ohlebusch, University of Ulm
Germany, 2003
2
Comparative Genomics
The practice of analysing the genomic material of a species by comparing it with
the genomic material of other species.
Why is this important?
The next logical step after the high-throughput sequencing projects.
Deducing the mechanism and history of genome evolution.
Discovering genes and regulatory elements.
Identifying exons in eukaryotic genes.
Revealing the role of non-coding conserved sequences.
3
The Strategy
Closely related organisms: no (or few) genome rearrangements.
Conserved regions are more significant in a multiple genome alignment.
Finding conserved regions (genes and regulatory elements).
Detecting mutations.
Detecting unique genes (e.g., pathogenic genes in bacterial genomes).
Strategy: Global Sequence Alignment
Figure labels: Mutation, Pathogenic Gene
4
Multiple Genome Alignment is Difficult
Given k genomes, each of average length N:
Standard dynamic programming takes $O(N^k)$ time, which is impractical even for k = 2 because N is very large (megabases).
Heuristic algorithms are therefore employed:

Genomes    Program    Year   Authors
Two        PipMaker   2000   Schwartz et al.
Two        DIALIGN    2000   Morgenstern
Two        MUMmer     2002   Delcher et al.
Two        CHAOS      2002   Brudno and Morgenstern
Two        OWEN       2002   Roytberg et al.
Two        AVID       2003   Bray et al.
Multiple   MGA        2002   Höhl et al.

These tools use an anchor-based alignment method.
7
MGA
MGA uses a strategy composed of three steps:
1. Computation of fragments (maximal multiple exact matches).
2. Computation of an optimal chain of colinear non-overlapping fragments (anchors).
3. Detailed alignment of the regions between the fragments of the optimal chain.
Figure: First Genome G1, Second Genome G2, Third Genome G3, with the anchors marked.
8
The Chaining Problem
Given n weighted fragments from k genomes, find the chain C of colinear, non-overlapping fragments
whose total score is maximal among all chains.
$\mathrm{score}(C) = \sum_i \left[ f_{i+1}.\mathrm{weight} - g(f_i, f_{i+1}) \right]$
where $g(f_i, f_{i+1})$ is the gap cost of connecting $f_i$ to $f_{i+1}$.
Figure: fragments in First Genome G1, Second Genome G2, Third Genome G3.
10
Previous Work
A graph-based solution takes $O(n^2)$ time.
A geometric-based algorithm is subquadratic (sparse dynamic programming):
1. Zhang et al. (1994) used space division with a kd-tree (no complexity analysis was given).
2. Myers and Miller (1995) used orthogonal range search with a range tree, yielding a complexity of
$O(n \log^{k} n)$ time and $O(n \log^{k-1} n)$ space.
"But the result is a time bound higher by a logarithmic factor than what one would expect." (David Eppstein)
Wilbur-Lipman, 1983; Sobel-Martinez, 1986; Eppstein-Giancarlo, 1992
11
Previous Work
For two genomes, the complexities of Myers and Miller's algorithm, $O(n \log^{2} n)$ time and $O(n \log n)$ space, are also higher than those of known 2-dim. chaining algorithms.
"We thought hard to reduce this discrepancy but have been unable to do so and the reasons appear to be fundamental! To improve upon our result appears to be a difficult open problem!" (Myers & Miller)
Here, it is improved by almost two log factors in time and one log factor in space:
For k = 2: $O(n \log n)$ time and $O(n)$ space.
For k > 2: $O(n \log^{k-2} n \log\log n)$ time and $O(n \log^{k-2} n)$ space.
12
The Problem Revisited
• Any kind of fragment can be used (fragments may also contain mismatches and insertions/deletions).
• A fragment $f_i$ is represented as a hyper-rectangle in a k-dimensional space.
• A fragment $f_i$ is identified with its start and end points: start($f_i$) and end($f_i$).
• We add two imaginary fragments O and t with weight zero.
• Any two fragments $f_i$ and $f_{i+1}$ in the chain must be colinear and non-overlapping:
  $f_i \ll f_{i+1}$: $\mathrm{end}(f_i).x_r < \mathrm{start}(f_{i+1}).x_r$ for all r, $0 < r \le k$
13
The Solution
The score of a chain C is
$\mathrm{score}(C) = \sum_i \left[ f_i.\mathrm{weight} - g(f_{i-1}, f_i) \right]$
An optimal chain is a chain of maximum score.
The maximum score can be computed by the recurrence
$f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{ f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j \}$
where
$f_i \ll f_j$: $\mathrm{end}(f_i).x_r < \mathrm{start}(f_j).x_r$ for all r, $0 < r \le k$
$g(f_i, f_j)$ is the gap cost of connecting $f_i$ to $f_j$
Sparse Dynamic Programming
A graph-based solution takes $O(n^2)$ time.
Figure: candidate predecessor fragments of $f_j$.
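The recurrence on this slide can be evaluated directly by comparing every pair of fragments, which gives the quadratic baseline. The following is a minimal Python sketch, not the authors' implementation. The `Fragment` record, the function names and the choice of the L1 distance as gap cost are assumptions made for illustration. As a simplification, a chain may start at any fragment at no cost, so the gap costs to the imaginary fragments O and t are ignored.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Fragment:
    start: Tuple[int, ...]  # start point, one coordinate per genome
    end: Tuple[int, ...]    # end point, one coordinate per genome
    weight: float


def precedes(a: Fragment, b: Fragment) -> bool:
    """a << b: end(a).x_r < start(b).x_r in every dimension r."""
    return all(e < s for e, s in zip(a.end, b.start))


def gap_l1(a: Fragment, b: Fragment) -> float:
    """L1 distance between end(a) and start(b); one possible gap cost g."""
    return sum(abs(s - e) for e, s in zip(a.end, b.start))


def chain_quadratic(
    frags: Sequence[Fragment],
    gap: Callable[[Fragment, Fragment], float] = gap_l1,
) -> Tuple[float, List[int]]:
    """Evaluate the recurrence pairwise; returns (best score, chain indices)."""
    # Assuming start <= end for each fragment, a predecessor always has a
    # smaller start.x_1, so processing in this order is safe.
    order = sorted(range(len(frags)), key=lambda i: frags[i].start[0])
    score, prev = {}, {}
    for pos, j in enumerate(order):
        best, arg = 0.0, None  # 0.0 = start a new chain at f_j (simplification)
        for i in order[:pos]:
            if precedes(frags[i], frags[j]):
                cand = score[i] - gap(frags[i], frags[j])
                if cand > best:
                    best, arg = cand, i
        score[j] = frags[j].weight + best
        prev[j] = arg
    if not score:
        return 0.0, []
    cur = max(score, key=score.get)
    best_score, chain = score[cur], []
    while cur is not None:  # follow the predecessor links back to the chain start
        chain.append(cur)
        cur = prev[cur]
    return best_score, chain[::-1]
```

Passing `gap=lambda a, b: 0` gives the variant without gap costs.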
14
Geometric-based Solution
The recurrence
$f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{ f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j \}$
can be written as
$f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$
RMQ (Range Maximum Query):
Retrieves the fragment $f_i$ whose end point lies in the hyper-rectangle bounded by O and start($f_j$) such that $f_i.\mathrm{score} - g(f_i, f_j)$ is maximum.
RMQ is applied using the start and end points.
15
Overview of the Algorithm
The recurrence is
$f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$
The algorithm uses techniques from computational geometry:
1. Line-sweep algorithm.
2. The algorithm works on the start and end points of the fragments.
3. RMQ using a semi-dynamic data structure: the range tree.
4. Proper inclusion of the gap costs into the fragment weight.
If the gap cost is zero, an RMQ returns the end point of the fragment $f_i$ such that
$f_i.\mathrm{score} = \sum_{0 \le r \le i} f_r.\mathrm{weight}$
is maximum.
16
The Algorithm without Gap Costs
The recurrence is
$f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$
Line-sweep algorithm
1. Sort the start and end points of the fragments w.r.t. $x_1$.
2. If a start point of a fragment, say $f_j$, is scanned,
apply the RMQ(O, (start($f_j$).$x_2$, …, start($f_j$).$x_k$)) to the set of active end points
and update the score of the end point of fragment $f_j$.
3. Otherwise, add the end point to the set of active end points (already scanned end points).
The first step reduces the dimension of the RMQ to k-1.
If the gap cost is zero, an RMQ returns the end point of the fragment $f_i$ such that
$f_i.\mathrm{score} = \sum_{0 \le r \le i} f_r.\mathrm{weight}$
is maximum.
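The three line-sweep steps above can be sketched for k = 2 and zero gap cost as follows. This is an illustration under stated assumptions, not the data structure from the slides: a Fenwick tree answering prefix-maximum queries stands in for the range tree with priority queues. That substitution works here because, after sorting by $x_1$, the RMQ is one-dimensional and always anchored at O. All names (`PrefixMax`, `chain_linesweep`, the tuple layout of a fragment) are hypothetical.

```python
import bisect
from typing import List, Optional, Sequence, Tuple

# A fragment is (start, end, weight) with 2-dimensional start and end points.
Frag = Tuple[Tuple[int, int], Tuple[int, int], float]


class PrefixMax:
    """Fenwick tree over ranks 0..m-1 holding (score, fragment index) maxima."""

    def __init__(self, m: int) -> None:
        self.tree = [(float("-inf"), -1)] * (m + 1)

    def activate(self, rank: int, item: Tuple[float, int]) -> None:
        i = rank + 1
        while i < len(self.tree):
            if item > self.tree[i]:
                self.tree[i] = item
            i += i & -i

    def query(self, rank: int) -> Tuple[float, int]:
        """Maximum over all activated ranks strictly below `rank`."""
        best = (float("-inf"), -1)
        i = rank
        while i > 0:
            if self.tree[i] > best:
                best = self.tree[i]
            i -= i & -i
        return best


def chain_linesweep(frags: Sequence[Frag]) -> Tuple[float, List[int]]:
    """Zero-gap-cost chaining for k = 2; returns (best score, chain indices)."""
    ys = sorted({end[1] for _, end, _ in frags})  # compress end.x_2 to ranks
    tree = PrefixMax(len(ys))
    # Kind 0 = start point, 1 = end point; at equal x_1 starts come first,
    # so an end point with the same x_1 is not yet active (strict "<").
    events = [(s[0], 0, j) for j, (s, _, _) in enumerate(frags)]
    events += [(e[0], 1, j) for j, (_, e, _) in enumerate(frags)]
    events.sort()
    score: List[float] = [0.0] * len(frags)
    prev: List[Optional[int]] = [None] * len(frags)
    for _, kind, j in events:
        start, end, weight = frags[j]
        if kind == 0:  # start point: RMQ over active end points below start.x_2
            best, i = tree.query(bisect.bisect_left(ys, start[1]))
            if best > 0:
                score[j], prev[j] = weight + best, i
            else:
                score[j] = weight
        else:  # end point: activate it with the score of its fragment
            tree.activate(bisect.bisect_left(ys, end[1]), (score[j], j))
    if not frags:
        return 0.0, []
    cur: Optional[int] = max(range(len(frags)), key=lambda j: score[j])
    best_score, chain = score[cur], []
    while cur is not None:
        chain.append(cur)
        cur = prev[cur]
    return best_score, chain[::-1]
```

Sorting dominates the running time, and each event costs one logarithmic tree operation, so the sketch runs in O(n log n) time and O(n) space, in line with the k = 2 bound quoted in the slides.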
17
The Complexity of the Algorithm
The complexity of the algorithm depends on the complexity of the RMQ in d = k-1 dimensions.
The required data structure D to manipulate the set of points
• is a semi-dynamic data structure over all end points in d = k-1 dimensions
• supports the operations:
  1. Activate an end point.
  2. Perform an RMQ.
D is implemented as a range tree
  1. supported by fractional cascading (Willard, 1985),
  2. enhanced with priority queues (van Emde Boas, 1977; Johnson, 1982).
For n fragments and dimension d, the RMQs and activations take
$O(n \log^{d-1} n \log\log n)$ time and $O(n \log^{d-1} n)$ space.
Since d = k-1 > 1, the complexity of the algorithm is
$O(n \log^{k-2} n \log\log n)$ time and $O(n \log^{k-2} n)$ space.
For k = 2, the total complexity is $O(n \log n)$ time and $O(n)$ space.
18
Including Gap Costs
Recall the recurrence
$f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{ f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j \}$
$f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$
The gap cost should be included in the RMQ, otherwise the algorithm would be quadratic.
How to define the gap costs?
How to include the gap costs without affecting the complexity?
Figure: dot plot of two sequences with two ACC fragments.
19
Types of Gap Costs
The gap costs g can be described geometrically, for $f_i \ll f_j$:
$g_1(f_i, f_j) = d_1(\mathrm{start}(f_j), \mathrm{end}(f_i)) = \sum_{r=1}^{k} |\mathrm{start}(f_j).x_r - \mathrm{end}(f_i).x_r|$
$g_\infty(f_i, f_j) = d_\infty(\mathrm{start}(f_j), \mathrm{end}(f_i)) = \max_{1 \le r \le k} |\mathrm{start}(f_j).x_r - \mathrm{end}(f_i).x_r|$

Two genomes:
  L1                      L∞
  ACC YYY _ _ ACC         ACC YYY ACC
  ACC _ _ _ XX ACC        ACC _ XX ACC
  $g_1 = 5$               $g_\infty = 3$

Three genomes:
  L1                          L∞
  ACC XX _ _ _ _ _ ACC        ACC _ XX ACC
  ACC _ _ YYY _ _ ACC         ACC YYY ACC
  ACC _ _ _ _ _ ZZ ACC        ACC _ ZZ ACC
  $g_1 = 7$                   $g_\infty = 3$

The L∞ and the sum-of-pairs gap cost follow the same idea as the L1.
Figure: dot plot of the two sequences (axes x and y) with the two ACC fragments.
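The example values on this slide can be checked with a few lines of Python. This is a sketch under an assumed coordinate convention that the slides do not state: positions are half-open, so `end` is the first position after a fragment and `start - end` counts the unaligned characters between two fragments. The function names are hypothetical.

```python
from typing import Sequence


def gap_l1(end_prev: Sequence[int], start_next: Sequence[int]) -> int:
    """L1 gap cost: sum of the per-genome distances between two fragments."""
    return sum(abs(s - e) for e, s in zip(end_prev, start_next))


def gap_linf(end_prev: Sequence[int], start_next: Sequence[int]) -> int:
    """L-infinity gap cost: largest per-genome distance between two fragments."""
    return max(abs(s - e) for e, s in zip(end_prev, start_next))


# Two genomes, ACC YYY ACC and ACC XX ACC: the first ACC ends at position 3
# in both, the second ACC starts at 6 and 5 respectively.
two_genomes = ((3, 3), (6, 5))

# Three genomes with XX, YYY and ZZ between the two ACC fragments.
three_genomes = ((3, 3, 3), (5, 6, 5))
```

With these inputs, `gap_l1(*two_genomes)` is 5 and `gap_linf(*two_genomes)` is 3; for three genomes the values are 7 and 3, matching the slide.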
21
Including Gap Costs in L1
We define the geometric cost of a fragment f as follows:
$gc(f) = d_1(t, \mathrm{end}(f))$
where $d_1(t, \mathrm{end}(f))$ is the distance in the L1 metric between t and end(f).
For two fragments $f_1$ and $f_2$ preceding f:
$f_1.\mathrm{score} - g(f_1, f) > f_2.\mathrm{score} - g(f_2, f)$
iff
$f_1.\mathrm{score} - gc(f_1) > f_2.\mathrm{score} - gc(f_2)$
gc(f) is a constant that can be precomputed and attached to the fragment's weight.
For k > 2, the complexity is $O(n \log^{k-2} n \log\log n)$ time and $O(n \log^{k-2} n)$ space.
For k = 2, the complexity is $O(n \log n)$ time and $O(n)$ space.
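The equivalence on this slide holds because, when end($f_i$) < start(f) ≤ t in every dimension, the L1 distances add up: $d_1(t, \mathrm{end}(f_i)) = d_1(\mathrm{start}(f), \mathrm{end}(f_i)) + d_1(t, \mathrm{start}(f))$. The last term is the same for every candidate $f_i$, so ranking by score minus gc gives the same winner as ranking by score minus gap cost. A small Python sketch that checks this on random data follows. The helper names and the random instance generator are hypothetical, and t is assumed to dominate every start point.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def d1(p: Sequence[int], q: Sequence[int]) -> int:
    """L1 distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def best_direct(cands: List[Tuple[int, Point]], start_f: Point) -> int:
    """Index maximising score - g_1, with g_1 = d1(start(f), end(f_i))."""
    return max(range(len(cands)),
               key=lambda i: cands[i][0] - d1(start_f, cands[i][1]))


def best_by_priority(cands: List[Tuple[int, Point]], t: Point) -> int:
    """Index maximising score - gc, with gc = d1(t, end(f_i)); f is not needed."""
    return max(range(len(cands)),
               key=lambda i: cands[i][0] - d1(t, cands[i][1]))


def random_instance(rng: random.Random, k: int = 3):
    """Hypothetical data with end(f_i) < start(f) <= t in every dimension."""
    t = tuple(rng.randint(50, 100) for _ in range(k))
    start_f = tuple(rng.randint(10, c) for c in t)
    cands = [(rng.randint(0, 50), tuple(rng.randint(0, s - 1) for s in start_f))
             for _ in range(6)]  # (score, end point) of each candidate f_i
    return t, start_f, cands
```

Because `best_by_priority` never looks at f, the value score minus gc can be stored with each end point when it is activated, which is what lets the RMQ absorb the gap cost.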
22
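Gap Costs in L∞
$g_\infty(f_i, f_j) = d_\infty(\mathrm{start}(f_j), \mathrm{end}(f_i)) = \max_{1 \le r \le k} |\mathrm{start}(f_j).x_r - \mathrm{end}(f_i).x_r|$
For k = 2: $g_\infty(f_i, f) = \max\{\Delta_x(f_i, f), \Delta_y(f_i, f)\}$, where $\Delta_x$ and $\Delta_y$ are the distances between end($f_i$) and start(f) in the x and y dimensions.
In the octant O1, $g_\infty(f_i, f) = \Delta_x(f_i, f)$.
In the octant O2, $g_\infty(f_i, f) = \Delta_y(f_i, f)$.
The geometric cost of a fragment f is then:
$gc(f) = d_\infty(t, \mathrm{end}(f))$
$f_1.\mathrm{score} - g(f_1, f) > f_3.\mathrm{score} - g(f_3, f)$
iff
$f_1.\mathrm{score} - gc(f_1) > f_3.\mathrm{score} - gc(f_3)$
gc(f) is a constant that can be precomputed and attached to the fragment's weight.
RMQ must be performed in every octant and the point of maximum score is chosen.
Figure: fragments $f_1$, $f_2$, $f_3$ and the octants O1 and O2.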
23
RMQ on Octants
Because the RMQ requires an orthogonal range, we use the octant-to-quadrant transformation:
For the octant O1: $T_1: (x_1, x_2) \mapsto (x_1 - x_2, x_2)$
For the octant O2: $T_2: (x_1, x_2) \mapsto (x_1, x_2 - x_1)$
The total complexity of the algorithm depends on the space division.
For k > 2, the complexity is $O(k!\, n \log^{k-2} n \log\log n)$ time and $O(k!\, n \log^{k-2} n)$ space.
For k = 2, the complexity is $O(2 n \log n)$ time and $O(2 n)$ space.
Figure: the octants O1 and O2 with the points p, q and s.
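The effect of the octant-to-quadrant transformation can be checked by brute force for k = 2. The sketch below assumes that O1 is the octant in which the x-distance dominates and O2 the one in which the y-distance dominates, and that the transformations are $T_1: (x_1, x_2) \mapsto (x_1 - x_2, x_2)$ and $T_2: (x_1, x_2) \mapsto (x_1, x_2 - x_1)$; the minus signs are reconstructed, since the transcript lost them. The function names are hypothetical.

```python
from typing import Tuple

Point = Tuple[int, int]


def t1(p: Point) -> Point:
    """T1: (x1, x2) -> (x1 - x2, x2), used for octant O1."""
    return (p[0] - p[1], p[1])


def t2(p: Point) -> Point:
    """T2: (x1, x2) -> (x1, x2 - x1), used for octant O2."""
    return (p[0], p[1] - p[0])


def in_o1(p: Point, q: Point) -> bool:
    """p is below-left of q and the x-distance dominates (L-inf gap = dx)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return dx > 0 and dy > 0 and dx >= dy


def in_o2(p: Point, q: Point) -> bool:
    """p is below-left of q and the y-distance dominates (L-inf gap = dy)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return dx > 0 and dy > 0 and dy >= dx


def in_o1_orthogonal(p: Point, q: Point) -> bool:
    """The same region as in_o1, written as an orthogonal range after T1."""
    a, b = t1(p), t1(q)
    return a[0] <= b[0] and a[1] < b[1]


def in_o2_orthogonal(p: Point, q: Point) -> bool:
    """The same region as in_o2, written as an orthogonal range after T2."""
    a, b = t2(p), t2(q)
    return a[0] < b[0] and a[1] <= b[1]
```

After the transformation each octant query is a plain dominance query, so the same RMQ structure can be reused once per octant.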
24
Example: 4 Strains Staphylococcus
1. Staphylococcus aureus N315 NC_002745 (2853924)
2. Staphylococcus aureus Mu50 NC_002758 (2919236)
3. Staphylococcus aureus MW2 NC_003923 (2860842)
4. Staphylococcus epidermidis NC_004461 (2535068)
Figure: fragments (min. len. 15) of genomes 1-2; 200,000 fragments.
25
Example: 4 Strains Staphylococcus
Figure panels (pairwise plots of the four genomes listed on the previous slide): 1-2, 1-3, 1-4, 3-4, 2-4, 2-3
26
Example: 3 Strains E. coli
1: E. coli O157:H7 EDL 993 (5608027 bp)
2: E. coli O157:H7 (5577057 bp)
3: E. coli K12 (4705567)
Figure panels: 1-2, 1-3, 2-3
Fragments (min. len. 30) of genomes 1-2; 60,000 fragments.
27
Conclusions
Our algorithm solves an open problem w.r.t. the chaining of n fragments of k genomes.
We reduced the time complexity by almost two log factors and the space complexity by one log factor.
Other data structures than the range tree can be used, e.g., the kd-tree. For k > 2 it takes
$O((k-1)\, n^{2 - \frac{1}{k-1}})$ time and $O(n)$ space.
The sum-of-pairs gap cost is addressed in the paper.
Other research topic: Comparison of distantly related organisms (Abouelhoda-Ohlebusch, WABI 2003).
Many genome rearrangements.
Local sequence alignment.
Detecting rearranged segments.
Revealing the mechanisms of rearrangements.
Better identification of the exons of eukaryotic genes.
28
Example: Local Chains
Fragments min. len. 12
Figure axes: M. genitalium, M. pneumoniae
Abouelhoda-Ohlebusch, WABI 2003
29
Example: Local Chains
Fragments min. len. 15
Figure axes: E. coli, V. cholerae
30
Acknowledgement
Enno Ohlebusch, Ulm University
Stefan Kurtz, Hamburg University
Robert Giegerich, Bielefeld University
Janina Scholz, Bielefeld University
Thanks for your attention
This work was funded by the Graduiertenkolleg Bioinformatik,
Universität Bielefeld, Germany.