Combinatorial and Statistical Approaches in Gene Rearrangement Analysis
Jijun Tang
Computer Science and Engineering, University of South Carolina
[email protected]
(803) 777-8923
Outline
Backgrounds
Branch-and-Bound Algorithms for the Median Problem
Maximum Likelihood Methods for Phylogenetic Reconstruction
Post-Analysis
Conclusions
Multichromosomal Reversal Median problem
Goal: find a median genome that minimizes the sum of the multichromosomal HP (Hannenhalli-Pevzner) distances on the three edges
Events considered: reversal, translocation, fusion, fission
Exact and heuristic solvers exist for the Unichromosomal Reversal Median Problem (reversals are the only events)
Capless Breakpoint Graph
Genome A → non-perfect matching M(A)
Let a, b be adjacent genes in A. Then (a^t, b^h) is an edge in M(A)
A genome is composed of a set of edges and ends.
Matchings naturally correspond to undirected genomes (flipping a chromosome does not alter the matching)
[Figure: genes a, b, c drawn with their extremities a^t, a^h, b^t, b^h, c^t, c^h; matchings M(A) for -5 1 6 3 2 4 and M(B) for 1 6 -5 -4 -3 -2; legend: A-end, B-end]
Example Genomes
A={‹ -5, 1, 6, 3 ›, ‹ 2, 4 ›} B={‹ 1, 6 ›, ‹ -5, -4, -3, -2 ›}
Adjacency Graph / Capless Breakpoint Graph G(A,B)
[Figure: G(A,B) drawn over -5 1 6 3 2 4, with A-ends and B-ends marked and the AB-paths of length 0 labeled]
• In G(A,B), denote by C(A,B) the number of cycles, AB the number of AB-paths, AA the number of AA-paths, BB the number of BB-paths, and n the number of genes
• n = 6, C(A,B) = 1, AB = 4
• d_HP ≥ 6 - 1 - 4/2 = 3
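The counts in this example can be checked with a short script. The following is only a sketch under stated assumptions: the function names are hypothetical, each gene is given a tail extremity "t" and a head extremity "h", and the graph is taken to have the adjacencies of A and of B as its edges. It evaluates only the n - C(A,B) - AB/2 expression used in the example, where AA = BB = 0.

```python
def left_right(gene):
    """Extremities of a signed gene in reading order: (left, right)."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def adjacency_map(genome):
    """Map every extremity that sits in an adjacency to its neighbour."""
    partner = {}
    for chrom in genome:
        for a, b in zip(chrom, chrom[1:]):
            u, v = left_right(a)[1], left_right(b)[0]
            partner[u], partner[v] = v, u
    return partner


def count_components(A, B):
    """Count cycles and AA-, AB-, BB-paths of the graph on extremities
    whose edges are the adjacencies of A and the adjacencies of B."""
    adj = {"A": adjacency_map(A), "B": adjacency_map(B)}
    vertices = [(abs(g), e) for chrom in A for g in chrom for e in "th"]
    counts = {"cycles": 0, "AA": 0, "AB": 0, "BB": 0}
    seen = set()

    # Paths: start at every extremity that is a chromosome end in A or in B.
    for v in vertices:
        ends = [k for k in "AB" if v not in adj[k]]
        if v in seen or not ends:
            continue
        seen.add(v)
        if len(ends) == 2:          # end in both genomes: AB-path of length 0
            counts["AB"] += 1
            continue
        colour = "B" if ends[0] == "A" else "A"   # the one edge v does have
        cur = v
        while cur in adj[colour]:   # edges alternate between A and B
            cur = adj[colour][cur]
            seen.add(cur)
            colour = "B" if colour == "A" else "A"
        # The walk stopped at an extremity that is an end in genome `colour`.
        counts["".join(sorted(ends[0] + colour))] += 1

    # Everything not reached from an end lies on a cycle.
    for v in vertices:
        if v in seen:
            continue
        counts["cycles"] += 1
        cur, colour = v, "A"
        while cur not in seen:
            seen.add(cur)
            cur = adj[colour][cur]
            colour = "B" if colour == "A" else "A"
    return counts


A = [[-5, 1, 6, 3], [2, 4]]
B = [[1, 6], [-5, -4, -3, -2]]
counts = count_components(A, B)
bound = 6 - counts["cycles"] - counts["AB"] / 2
print(counts, bound)
```

The single cycle comes from the adjacency between genes 1 and 6, which both genomes share; the two AB-paths of length 0 are the extremities that are chromosome ends in both genomes.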
A Lower Bound of the HP Distance
A simpler lower bound contains only #genes, #cycles, and #paths. Derived from Hannenhalli and Pevzner (1995):
$d_{HP}(A,B) \ge n - C(A,B) - AB/2 + AA - BB$
Pseudo-cycle of A and B:
$\tilde{c}(A,B) = C(A,B) + AB/2 - AA + BB$
Pseudo-cycle distance Median Problem
Pseudo-cycle distance: $n - \tilde{c}(A,B)$
Pseudo-cycle distance Median Problem (PMP): find a median genome that minimizes the sum of the pseudo-cycle distances on the three edges
We use the pseudo-cycle distance as a lower bound for the HP distance to derive an RMP solver
Branch-and-Bound Algorithm
Enumerate the solution genomes gene by gene (Genome Enumeration)
After enumerating a gene, compute an upper bound based on the partial solution genome
Bound: check whether the upper bound of the partial solution is less than a criterion
Branch:
If it is, the partial genome is discarded and another gene is enumerated
Otherwise, update the criterion and continue the enumeration
Genome Enumeration for Multichromosome Genomes
Genome Enumeration
For genomes on genes {1,2,3}
[Figure: enumeration tree that fixes signed genes (1, -1, 2, -2, 3, -3) and chromosome ends ($) level by level; the leaves shown include ‹ 1, 2, 3 ›, ‹ 1, 2, -3 ›, ‹ 1, 2 › ‹ 3 ›, ‹ 1, 2 › ‹ -3 ›]
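Because matchings correspond to undirected genomes, the search space for a small gene set can be listed by brute force. The sketch below is an illustration only, not the enumeration order of the tree on this slide: it assumes linear chromosomes only, uses hypothetical function names, generates every non-perfect matching on the 2n gene extremities, and keeps those that close no circular chromosome.

```python
def matchings(points):
    """Yield every matching (perfect or not) on a list of points."""
    if not points:
        yield frozenset()
        return
    first, rest = points[0], points[1:]
    yield from matchings(rest)                 # first stays unmatched: an end
    for i, other in enumerate(rest):           # first is joined to a later point
        for m in matchings(rest[:i] + rest[i + 1:]):
            yield m | {(first, other)}


def is_linear(n, matching):
    """True if every gene lies on a linear chromosome (no circular one)."""
    partner = {}
    for u, v in matching:
        partner[u], partner[v] = v, u
    reached = set()
    for g in range(1, n + 1):
        for e in "th":
            cur = (g, e)
            if cur in partner:
                continue                       # not a chromosome end
            while True:                        # walk the chromosome from this end
                gene, end = cur
                reached.add(gene)
                other = (gene, "h" if end == "t" else "t")
                if other not in partner:
                    break                      # reached the opposite end
                cur = partner[other]
    return len(reached) == n


def enumerate_genomes(n):
    """All undirected multichromosomal linear genomes on genes 1..n."""
    points = [(g, e) for g in range(1, n + 1) for e in "th"]
    return [m for m in matchings(points) if is_linear(n, m)]


for n in (1, 2, 3):
    print(n, len(enumerate_genomes(n)))
```

For genes {1,2,3} this lists 37 genomes: 1 with three chromosomes, 12 with two, and 24 with one. A branch-and-bound search visits the same space but fixes one edge or end at a time so that whole subtrees can be discarded.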
Features
Main Components:
Contraction Operation
Upper Bound on the number of pseudo-cycles
Genome enumeration
Extension of Caprara’s method for unichromosomal genomes (1999)
Contraction Operation
Contraction of e={a^t, b^h} on M(A): M(A)/e
[Figure: the three cases of the contraction, Case (1), Case (2), and Case (3)]
Upper Bound on the Number of Pseudo-cycles
Let S be a genome and Z={G1, G2, G3} a set of three input genomes. Define
$\gamma(S,Z) = \sum_{k=1}^{3} \tilde{c}(S,G_k)$
The maximal γ(S,Z) is denoted by γ*
Based on the triangle inequality, an upper bound on the number of pseudo-cycles can be derived:
$\gamma^* \le UB(Z) = \frac{3n}{2} + \frac{\tilde{c}(G_1,G_2)}{2} + \frac{\tilde{c}(G_1,G_3)}{2} + \frac{\tilde{c}(G_2,G_3)}{2}$
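The bound follows from applying the triangle inequality to the pseudo-cycle distance n - c~ for each of the three pairs of input genomes and adding the three inequalities. A minimal sketch of the arithmetic follows, with hypothetical function names and with the three pairwise pseudo-cycle counts assumed to be already computed:

```python
def pseudo_cycle_upper_bound(n, c12, c13, c23):
    """UB(Z) = 3n/2 + (c~(G1,G2) + c~(G1,G3) + c~(G2,G3)) / 2."""
    return 1.5 * n + (c12 + c13 + c23) / 2


def median_score_lower_bound(n, c12, c13, c23):
    """Sum of the three pseudo-cycle distances is at least 3n - UB(Z)."""
    return 3 * n - pseudo_cycle_upper_bound(n, c12, c13, c23)


# Illustrative numbers only: n = 6 genes, every pair has 3 pseudo-cycles,
# so every pairwise pseudo-cycle distance is 6 - 3 = 3.
print(pseudo_cycle_upper_bound(6, 3, 3, 3))   # 13.5
print(median_score_lower_bound(6, 3, 3, 3))   # 4.5, half the sum of the pairwise distances
```

The second function restates the same bound as a lower bound on the median score: half the sum of the three pairwise distances.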
Notes
3n - γ* is the lower bound of the sum of pseudo-cycle distances between any S and each genome in Z={G1, G2, G3}
Given an edge e, assume genome S contains e and maximizes γ(S,Z); let Z’={G1/e, G2/e, G3/e} and assume S’ maximizes γ(S’,Z’); then S = S’ ∪ {e}
Upper Bound Test
In a step of the algorithm, the current partial solution is Si={e1,e2,…,ei}
The upper bound of γ(S,Z) over the genomes containing Si is:
$UB_{S_i} = UB(Z/S_i) + \sum_{k=1}^{3} C(S_i, G_k)$
where $Z/S_i$ applies the contractions $((((H/e_1)/e_2)/\cdots)/e_{i-1})/e_i$ to each genome $H$ in $Z$
• Let UB be the current upper bound
• If $UB_{S_i} < UB$, then the best upper bound of the genomes containing Si is worse than UB
Branch-and-Bound Algorithm for Multichromosomal Genomes
Compute an initial Upper Bound (UB) from the input genomes.
In each step, either an end or an edge is fixed in the solution.
End Fixing: Mark a node as an end of a chromosome.
Edge Fixing: Fix an edge e to the current partial solution genome Si.
Genome Enumeration for Multichromosome Genomes
Genome Enumeration
For genomes on genes {1,2,3}
• Red line: end fixing
• Black line: edge fixing
[Figure: the enumeration tree for genes {1,2,3}, with end-fixing steps ($) drawn in red and edge-fixing steps in black; the leaves shown include ‹ 1, 2, 3 ›, ‹ 1, 2, -3 ›, ‹ 1, 2 › ‹ 3 ›, ‹ 1, 2 › ‹ -3 ›]
Properties
Can be extended to compute a given tree using iterative or progressive approaches
However, median computation is still difficult
Large nuclear genomes
Complex events
We also need to search for the best tree in the large tree space
N species: $(2N-5)(2N-7)\cdots 3$ trees
20 species: $35 \times 33 \times \cdots \times 3 \approx 2.2 \times 10^{20}$ trees
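The size of the tree space can be evaluated directly from the product (2N-5)(2N-7)⋯3, which counts unrooted binary trees on N labeled species. A minimal sketch, with a hypothetical function name:

```python
def num_unrooted_binary_trees(n_species):
    """Product 3 * 5 * ... * (2N - 5); equals 1 for N = 3."""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    count = 1
    for k in range(3, 2 * n_species - 4, 2):   # 3, 5, ..., 2N - 5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Python integers have arbitrary precision, so the count for 20 species is computed exactly rather than as a floating-point approximation.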
Statistical Approaches
Combinatorial approaches are the focus of genome rearrangement research
Only one MCMC method exists
Maximum Likelihood methods have been very popular in sequence phylogenetic analysis
Bootstrapping (data resampling) is a popular method to assess quality of obtained trees
Hard to directly apply ML and bootstrapping to gene order
Sequence ML Phylogeny
For each position, generate all possible tree structures
Based on the evolutionary model, calculate likelihood of these trees and sum them to get the column likelihood
Calculate tree likelihood by multiplying the likelihood for each position
Choose tree with the greatest likelihood
MLBE
Convert the gene-orders into binary sequences based on adjacencies
Convert the binary sequences into protein or DNA sequence
Use RAxML to compute a ML tree on the sequences
Binary encoding was used before for parsimony analysis, with reasonable results
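The first two MLBE steps can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: only adjacencies are encoded (chromosome ends are ignored), the column order is arbitrary but fixed, the 0→A and 1→C mapping to DNA is a hypothetical choice, and the function names are made up for the example. The resulting sequences would then be written to an alignment file and passed to RAxML, a step not shown here.

```python
def adjacencies(genome):
    """Set of adjacencies, each an unordered pair of gene extremities."""
    adjs = set()
    for chrom in genome:
        for a, b in zip(chrom, chrom[1:]):
            right_of_a = (abs(a), "h" if a > 0 else "t")
            left_of_b = (abs(b), "t" if b > 0 else "h")
            adjs.add(frozenset((right_of_a, left_of_b)))
    return adjs


def binary_encode(genomes):
    """One character per adjacency seen in any genome: 1 if present, else 0."""
    per_genome = {name: adjacencies(g) for name, g in genomes.items()}
    columns = sorted(set().union(*per_genome.values()), key=lambda adj: sorted(adj))
    return {
        name: "".join("1" if adj in adjs else "0" for adj in columns)
        for name, adjs in per_genome.items()
    }


def to_dna(binary_rows, zero="A", one="C"):
    """Assumed mapping of binary characters to nucleotides."""
    return {name: row.replace("0", zero).replace("1", one)
            for name, row in binary_rows.items()}


genomes = {
    "A": [[-5, 1, 6, 3], [2, 4]],
    "B": [[1, 6], [-5, -4, -3, -2]],
}
rows = binary_encode(genomes)
print(rows)
print(to_dna(rows))
```

Storing each adjacency as an unordered pair of extremities makes the encoding independent of the direction in which a chromosome is read.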
Experimental Setup
Generate random trees of N taxa
Each tree is equally likely
Birth-death model is preferred
Starting from the root, apply r events along each edge
r is the expected number of events
The actual number on each edge is sampled from 1…2r
Compare the inferred tree with the true tree using the RF (Robinson-Foulds) rate
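The comparison step can be sketched as follows, assuming the RF rate is the number of non-trivial bipartitions found in exactly one of the two trees, divided by 2(N-3) for N taxa. Trees are written as nested tuples purely for illustration, the function names are hypothetical, and at least four taxa are required.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of taxon names.
    Each bipartition is stored as the side that excludes a reference taxon."""
    def leaf_set(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    taxa = leaf_set(tree)
    ref = min(taxa)
    splits = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if ref in clade else clade
        if 1 < len(side) < len(taxa) - 1:      # both sides hold at least 2 taxa
            splits.add(side)
        return clade

    visit(tree)
    return taxa, splits


def rf_rate(tree1, tree2):
    """Symmetric difference of bipartitions, normalized by 2(N - 3)."""
    taxa1, splits1 = bipartitions(tree1)
    taxa2, splits2 = bipartitions(tree2)
    if taxa1 != taxa2:
        raise ValueError("trees must share the same taxa")
    return len(splits1 ^ splits2) / (2 * (len(taxa1) - 3))


true_tree = (("a", "b"), ("c", ("d", "e")))
inferred = (("a", "c"), ("b", ("d", "e")))
print(rf_rate(true_tree, true_tree))   # 0.0
print(rf_rate(true_tree, inferred))    # 0.5
```

In the second comparison both trees contain the split separating d and e from the rest, and each has one split the other lacks, giving 2 / (2 × 2) = 0.5.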
Post Analysis
Bootstrapping has been widely used to assess the quality of sequence phylogeny
The same procedure is impossible for gene order data since there is only one character
We tested jackknifing on simulated data to determine:
Is jackknifing useful?
What is the best jackknifing rate?
What is the threshold of the support values?
Jackknifing Procedure
Generate a new dataset by removing half of the genes from the original genomes (orders are preserved)
Compute a tree on the new dataset
Repeat K times and obtain K replicates
Obtain a consensus tree with support values
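The first step of the procedure can be sketched as below. The sketch assumes that the jackknifing rate means the fraction of genes removed, that the same genes are removed from every genome, and that a chromosome left empty is dropped; the function name and parameters are hypothetical. Tree inference and consensus building are left to whatever phylogeny method is applied to each replicate.

```python
import random


def jackknife_replicate(genomes, rate, rng):
    """Remove a random fraction `rate` of the genes from every genome,
    keeping the order and sign of the remaining genes."""
    genes = sorted({abs(g) for genome in genomes.values()
                    for chrom in genome for g in chrom})
    removed = set(rng.sample(genes, round(rate * len(genes))))
    replicate = {}
    for name, genome in genomes.items():
        kept = [[g for g in chrom if abs(g) not in removed] for chrom in genome]
        replicate[name] = [chrom for chrom in kept if chrom]   # drop empty chromosomes
    return replicate


genomes = {
    "A": [[-5, 1, 6, 3], [2, 4]],
    "B": [[1, 6], [-5, -4, -3, -2]],
}
rng = random.Random(0)            # fixed seed so the replicates are repeatable
replicates = [jackknife_replicate(genomes, 0.5, rng) for _ in range(3)]
for rep in replicates:
    print(rep)
```

Each replicate is a smaller gene-order dataset on the same taxa, so the tree computed from it can be compared split by split with the trees from the other replicates to obtain support values.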
Jackknife Properties
Jackknifing is necessary and useful for gene order phylogeny, and a large number of errors can be identified
A 40% jackknifing rate is reasonable
85% is a conservative threshold; 75% can also be used
Low-support branches should be examined in detail
Conclusions
Great progress has been made in genome rearrangement research
We are able to handle real-size data
Now the question is what data
Data quality and biological modeling
Ancestral genome reconstruction is still difficult
Putting everything together has just started