Combinatorial and Statistical Approaches in Gene Rearrangement Analysis

Jijun Tang

Computer Science and Engineering

University of South Carolina

[email protected]

(803) 777-8923

Outline

Background

Branch-and-Bound Algorithms for the Median Problem

Maximum Likelihood Methods for Phylogenetic Reconstruction

Post-Analysis

Conclusions

Simple Rearrangements

Phylogenetic Reconstruction

Rearrangement Phylogeny

Median Problem

Goal: find M so that $D_{AM} + D_{BM} + D_{CM}$ is minimized

NP-hard for most metric distances

Multichromosomal Reversal Median problem

Find a median genome that minimizes the sum of the multichromosomal Hannenhalli-Pevzner (HP) distances on the three edges

Events considered: reversal, translocation, fusion, fission

Exact and heuristic solvers exist for the Unichromosomal Reversal Median Problem (reversals are the only events)

Capless Breakpoint Graph

Genome A → non-perfect matching M(A)

Let a, b be adjacent genes in A. Then $(a_t, b_h)$ is an edge in M(A)

A genome is composed of a set of edges and ends.

Matchings naturally correspond to undirected genomes (flipping a chromosome does not alter the matching)

[Figure: genes a, b, c drawn as the node pairs $(a_t, a_h)$, $(b_t, b_h)$, $(c_t, c_h)$.]

Example

Example genomes: A={‹ -5, 1, 6, 3 ›, ‹ 2, 4 ›} B={‹ 1, 6 ›, ‹ -5, -4, -3, -2 ›}

• Matchings: M(A) and M(B)

[Figure: M(A) drawn over the gene order -5 1 6 3 2 4 and M(B) over 1 6 -5 -4 -3 -2; the legend marks A-ends and B-ends.]

Adjacency Graph

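[Figure: adjacency graph of A and B; the legend marks A-ends and B-ends.]

Capless Breakpoint Graph

[Figure: G(A,B) drawn over the gene order -5 1 6 3 2 4, with the AB-paths of length 0 marked.]

• Let C(A,B) denote the number of cycles, AB the number of AB-paths, AA the number of AA-paths and BB the number of BB-paths in G(A,B), and n the number of genes

• n = 6, C(A,B) = 1, AB = 4

• $d_{HP} \ge 6 - 1 - 4/2 = 3$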
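The counts in this example can be reproduced with a short script. This is a minimal sketch, not the authors' implementation: it assumes that an adjacency joins the right extremity of one gene to the left extremity of the next, that an A-end (B-end) is a gene extremity with no edge in M(A) (M(B)), and that a path is named after the ends it terminates in. The function names are hypothetical, and only the $n - C(A,B) - AB/2$ part of the bound is evaluated, since AA = BB = 0 here.

```python
def extremities(gene):
    """Extremities of a signed gene in reading order: tail/head, swapped if negative."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def matching(genome):
    """Adjacency edges of a genome given as a list of linear chromosomes."""
    edges = {}
    for chrom in genome:
        for left, right in zip(chrom, chrom[1:]):
            u = extremities(left)[1]   # right extremity of the left gene
            v = extremities(right)[0]  # left extremity of the right gene
            edges[u], edges[v] = v, u
    return edges


def count_components(a, b):
    """Count cycles and AB/AA/BB paths in the graph formed by M(A) and M(B)."""
    ma, mb = matching(a), matching(b)
    genes = sorted({abs(g) for chrom in a for g in chrom})
    counts = {"cycles": 0, "AB": 0, "AA": 0, "BB": 0}
    seen = set()
    for start in [(g, e) for g in genes for e in "th"]:
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:                    # collect one connected component
            v = stack.pop()
            component.append(v)
            for m in (ma, mb):
                w = m.get(v)
                if w is not None and w not in seen:
                    seen.add(w)
                    stack.append(w)
        a_ends = sum(v not in ma for v in component)
        b_ends = sum(v not in mb for v in component)
        if a_ends == 0 and b_ends == 0:
            counts["cycles"] += 1
        elif a_ends == 1 and b_ends == 1:
            counts["AB"] += 1           # includes isolated vertices (length 0)
        elif a_ends == 2:
            counts["AA"] += 1
        else:
            counts["BB"] += 1
    return len(genes), counts


A = [[-5, 1, 6, 3], [2, 4]]
B = [[1, 6], [-5, -4, -3, -2]]
n, counts = count_components(A, B)
lower_bound = n - counts["cycles"] - counts["AB"] / 2
```

On the example genomes this yields one cycle (the shared adjacency between genes 1 and 6) and four AB-paths, two of which are isolated extremities, i.e. paths of length 0.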

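A Lower Bound of the HP Distance

A simpler lower bound contains only the numbers of genes, cycles and paths. Derived from Hannenhalli and Pevzner (1995)

$d_{HP}(A,B) \ge n - C(A,B) - AB/2 + AA - BB$

Pseudo-cycle of A and B: $\tilde{c}(A,B)$ [defining equation garbled in the source; it involves $C(A,B)$, $AB/2$, $AA$ and $BB$]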

Pseudo-cycle distance Median Problem

Pseudo-cycle distance: $n - \tilde{c}(A,B)$

Pseudo-cycle distance Median Problem (PMP): find a median genome that minimizes the sum of the pseudo-cycle distances on the three edges

We use the pseudo-cycle distance as a lower bound for the HP distance to derive an RMP solver

Branch-and-Bound Algorithm

Enumerate the solution genomes gene by gene (genome enumeration)

After enumerating a gene, compute an upper bound based on the partial solution genome

Bound: check whether the upper bound of the partial solution is less than the current criterion

Branch:

If it is, the partial genome is discarded and another gene is enumerated

Otherwise, update the criterion and continue enumeration
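The enumerate, bound and branch loop can be sketched generically. The code below is only an illustration on a toy maximization problem, not the median solver: `branch_and_bound`, `score`, `upper_bound` and the `WEIGHTS` table are all hypothetical, and the toy bound (fixed part plus the best value of every undecided row) merely stands in for the pseudo-cycle bound used in the talk.

```python
def branch_and_bound(n_slots, choices, score, upper_bound):
    """Maximize score over complete assignments, pruning with upper_bound."""
    best_value, best_solution = float("-inf"), None
    stack = [()]                       # partial solutions, built slot by slot
    while stack:
        partial = stack.pop()
        # Bound: discard partial solutions that cannot beat the incumbent.
        if upper_bound(partial) <= best_value:
            continue
        if len(partial) == n_slots:    # complete solution: update the criterion
            best_value, best_solution = score(partial), partial
            continue
        # Branch: extend the partial solution by one more choice.
        for c in choices:
            stack.append(partial + (c,))
    return best_value, best_solution


# Toy instance: pick one column per row so that the total weight is maximal.
WEIGHTS = [[3, 1, 2], [2, 5, 1], [4, 4, 0]]


def score(partial):
    """Total weight of the rows decided so far."""
    return sum(WEIGHTS[i][c] for i, c in enumerate(partial))


def upper_bound(partial):
    """Fixed part plus the best possible value of every undecided row."""
    return score(partial) + sum(max(row) for row in WEIGHTS[len(partial):])


best_value, best_solution = branch_and_bound(3, range(3), score, upper_bound)
```

For a complete assignment the bound equals the score, so a solution that survives the bound test is always an improvement and becomes the new criterion.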

Genome Enumeration for Multichromosome Genomes

For genomes on genes {1,2,3}

[Figure: genome enumeration tree. Starting from $, each level appends a signed gene (1, -1, 2, -2, 3, -3) or the marker $ (chromosome end). Example leaves: ‹ 1, 2, 3 ›, ‹ 1, 2, -3 ›, ‹ 1, 2 › ‹ 3 ›, ‹ 1, 2 › ‹ -3 ›.]

Features

Main Components:

Contraction Operation

Upper Bound on the number of pseudo-cycles

Genome enumeration

Extension of Caprara’s method for unichromosomal genomes (1999)

Contraction Operation

Contraction of $e = \{a_t, b_h\}$ on M(A): M(A)/e

[Figure: three cases, Case (1), Case (2) and Case (3), showing the chromosomes around a and b before and after contracting e.]

Upper Bound on the Number of Pseudo-cycles

Let S be a genome and Z={G1, G2, G3} a set of three input genomes

$\gamma(S,Z) = \sum_{k=1}^{3} \tilde{c}(S,G_k)$

The maximal γ(S,Z) is denoted by γ*

Based on the triangle inequality, an upper bound on the number of pseudo-cycles can be derived:

$\gamma^* \le UB(Z) = \frac{3n}{2} + \frac{\tilde{c}(G_1,G_2)}{2} + \frac{\tilde{c}(G_1,G_3)}{2} + \frac{\tilde{c}(G_2,G_3)}{2}$
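The bound on this slide follows in three lines. The derivation below is a sketch that assumes, as the slide states, that the pseudo-cycle distance $n - \tilde{c}$ satisfies the triangle inequality; here γ(S,Z) is the sum of $\tilde{c}(S,G_k)$ over the three input genomes. The first line is the triangle inequality for a pair $G_i, G_j$ through S, the second rearranges it, and the third sums over the three pairs, where each $\tilde{c}(S,G_k)$ appears twice.

```latex
\begin{align*}
n - \tilde{c}(G_i,G_j) &\le \bigl(n - \tilde{c}(S,G_i)\bigr) + \bigl(n - \tilde{c}(S,G_j)\bigr) \\
\tilde{c}(S,G_i) + \tilde{c}(S,G_j) &\le n + \tilde{c}(G_i,G_j) \\
2\,\gamma(S,Z) &\le 3n + \tilde{c}(G_1,G_2) + \tilde{c}(G_1,G_3) + \tilde{c}(G_2,G_3)
\end{align*}
```

Dividing the last line by 2 gives a bound that holds for every S, and therefore also for the maximum γ*.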

Notes

qn − γ*, with q = 3 genomes in Z, is a lower bound on the sum of pseudo-cycle distances between any S and each genome in Z = {G1, G2, G3}

Given an edge e, assume genome S contains e and maximizes γ(S,Z); let Z’ = {G1/e, G2/e, G3/e}, and assume S’ maximizes γ(S’,Z’); then S = S’ ∪ {e}

Upper Bound Test

In a step of the algorithm, the current partial solution is $S_i = \{e_1, e_2, \ldots, e_i\}$

The upper bound of γ(S,Z) over genomes containing $S_i$ is the following:

$UB_{S_i} = UB(Z/S_i) + \sum_{k=1}^{3} C(S_i, G_k)$

where $Z/S_i$ contracts every $H \in Z$ to $H/S_i = ((((H/e_1)/e_2)/\cdots)/e_{i-1})/e_i$

• Let UB be the current upper bound

• If $UB_{S_i} < UB$, then the best upper bound of the genomes containing $S_i$ is worse than UB, and $S_i$ is discarded

Branch-and-Bound Algorithm for Multichromosomal Genomes

Compute an initial Upper Bound (UB) from the input genomes.

In each step, either an end or an edge is fixed in the solution.

End Fixing: Mark a node as an end of a chromosome.

Edge Fixing: Fix an edge e to the current partial solution genome Si.

Genome Enumeration for Multichromosome Genomes

For genomes on genes {1,2,3}

[Figure: the genome enumeration tree again, from $ through the signed genes 1, -1, 2, -2, 3, -3 to leaves such as ‹ 1, 2, 3 ›, ‹ 1, 2, -3 ›, ‹ 1, 2 › ‹ 3 ›, ‹ 1, 2 › ‹ -3 ›.]

• Red line: end fixing

• Black line: edge fixing

Properties

Can be extended to compute a given tree using iterative or progressive approaches

However, median computation is still difficult

Large nuclear genomes

Complex events

We also need to search for the best tree in the large tree space

N species: $(2N-5)(2N-7)\cdots 3$ trees

20 species: about $2.2 \times 10^{20}$ trees
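The size of the tree space can be checked directly. This is a small sketch that assumes the count on the slide is the number of unrooted binary trees on N labeled leaves, $(2N-5)(2N-7)\cdots 3$; the function name is hypothetical.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: (2n-5)(2n-7)...3."""
    if n < 3:
        raise ValueError("need at least 3 species")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


trees_for_20_species = num_unrooted_trees(20)
```

Python integers have arbitrary precision, so the product is exact even for large N.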

Statistical Approaches

Combinatorial approaches are the focus of genome rearrangement research

Only one MCMC method exists

Maximum Likelihood methods have been very popular in sequence phylogenetic analysis

Bootstrapping (data resampling) is a popular method to assess quality of obtained trees

Hard to directly apply ML and bootstrapping to gene order

Sequence ML Phylogeny

For each position, generate all possible tree structures

Based on the evolutionary model, calculate likelihood of these trees and sum them to get the column likelihood

Calculate tree likelihood by multiplying the likelihood for each position

Choose tree with the greatest likelihood

Example

A acgcaa

B acataa

C atgtca

D gcgtta

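[Figure: the three possible topologies for four taxa, pairing (A,B | C,D), (A,C | B,D) and (A,D | C,B).]

All Possible Evolutionary Paths (Column 1)

[Figure: tree with leaves a, a, a, g; each of the three internal nodes can be a, c, g or t.]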

Likelihood for One Path

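[Figure: one path, with leaves a, a, a, g, internal nodes a and g, and root t.]

$L_1 = P(t \to a)\,P(a \to a)\,P(a \to a)\,P(t \to g)\,P(g \to a)\,P(g \to g)$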

Sum of All Paths (Column 1)

[Figure: the same tree, leaves a, a, a, g, with every internal node ranging over a, c, g, t.]

$P_{\text{column 1}} = \sum_{i=1}^{64} L_i$

Whole Sequence

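[Figure: tree on A, B, C, D.]

$P_{\text{Tree 1}} = \prod_{i=1}^{6} P_{\text{column } i}$, one factor for each of the six columns of the example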
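The path, column and tree likelihoods of the last three slides can be computed by brute force for the four example sequences. The sketch below makes several assumptions that are not in the slides: the substitution probability `p` is a hypothetical symmetric model (0.7 to stay, 0.1 for each change), branch lengths are ignored, the tree groups (A,B) and (C,D) under a common root, and, as on the slide, no prior is placed on the root state.

```python
from itertools import product

BASES = "acgt"


def p(x, y, same=0.7):
    """Hypothetical substitution probability: `same` to stay, the rest split evenly."""
    return same if x == y else (1.0 - same) / 3.0


def path_likelihood(root, left, right, leaves):
    """Likelihood of one assignment of states to the root and two internal nodes."""
    a, b, c, d = leaves
    return (p(root, left) * p(left, a) * p(left, b)
            * p(root, right) * p(right, c) * p(right, d))


def column_likelihood(leaves):
    """Sum over all 4^3 = 64 assignments of the three internal nodes."""
    return sum(path_likelihood(r, x, y, leaves)
               for r, x, y in product(BASES, repeat=3))


def tree_likelihood(seqs):
    """Product of the column likelihoods over all alignment columns."""
    total = 1.0
    for column in zip(*seqs):
        total *= column_likelihood(column)
    return total


seqs = ["acgcaa", "acataa", "atgtca", "gcgtta"]   # A, B, C, D
one_path = path_likelihood("t", "a", "g", ("a", "a", "a", "g"))
column_1 = column_likelihood(("a", "a", "a", "g"))
whole_tree = tree_likelihood(seqs)
```

Here `one_path` corresponds to $L_1$ (root t, internal nodes a and g), `column_1` to the sum of all 64 paths, and `whole_tree` to the product over columns. Real ML software avoids this enumeration by using a pruning recursion, which the sketch deliberately leaves out.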

MLBE

Convert the gene-orders into binary sequences based on adjacencies

Convert the binary sequences into protein or DNA sequences

Use RAxML to compute an ML tree on the sequences

Binary encoding was used before for parsimony analysis, with reasonable results
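The first two MLBE steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' encoder: genomes are single linear chromosomes, the adjacency (a, b) is treated as identical to (-b, -a), chromosome ends are not encoded, and the 0/1 to A/C mapping in `to_dna` is a hypothetical choice. All function names are made up for the example.

```python
def adjacencies(genome):
    """Adjacencies of a signed linear gene order; (a, b) is the same as (-b, -a)."""
    result = set()
    for a, b in zip(genome, genome[1:]):
        result.add(max((a, b), (-b, -a)))   # canonical orientation
    return result


def binary_encode(genomes):
    """One column per adjacency seen in any genome: 1 if present, 0 if absent."""
    per_genome = [adjacencies(g) for g in genomes]
    columns = sorted(set().union(*per_genome))
    rows = ["".join("1" if adj in adjs else "0" for adj in columns)
            for adjs in per_genome]
    return columns, rows


def to_dna(row):
    """Map the binary characters to nucleotides (hypothetical mapping 0->A, 1->C)."""
    return row.translate(str.maketrans("01", "AC"))


columns, rows = binary_encode([[1, 2, 3, 4], [1, -3, -2, 4]])
dna_rows = [to_dna(r) for r in rows]
```

The two toy genomes differ by one inversion of genes 2 and 3, so they share the adjacency (2, 3) and differ in the two adjacencies at the inversion boundaries. The resulting strings are what would be written to an alignment file for the ML tree search.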

Binary Encoding

MLBE Sequences

Experimental Setup

Generate random trees of N taxa

Each tree is equally likely

Birth-death model is preferred

Starting from the root, apply r events along each edge

r is the expected number of events

The actual number is sampled from 1…2r

Compare the inferred tree with the true tree using the Robinson-Foulds (RF) rate
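Evolving a genome along one edge of the simulated tree can be sketched as below. The sketch assumes that the number of events is drawn uniformly from 1…2r and that every event is an inversion with uniformly chosen endpoints; the slides do not specify the sampling distribution, and the experiments also mix in other event types. The function names are hypothetical.

```python
import random


def inversion(genome, i, j):
    """Reverse the segment genome[i..j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j + 1])] + genome[j + 1:]


def evolve_edge(genome, r, rng):
    """Apply k random inversions along one edge, with k uniform in 1..2r."""
    k = rng.randint(1, 2 * r)
    for _ in range(k):
        i, j = sorted((rng.randrange(len(genome)), rng.randrange(len(genome))))
        genome = inversion(genome, i, j)
    return genome, k


rng = random.Random(0)                      # fixed seed for repeatability
child, events = evolve_edge(list(range(1, 11)), r=3, rng=rng)
```

Applying `evolve_edge` recursively from the root down every edge yields the leaf genomes that are then passed to the reconstruction methods.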

Experimental Results (Equal Content 1)

80% inversion, 20% transposition

Experimental Results (Equal Content 2)

80% inversion, 20% transposition

Experimental Results (Unequal 1)

90% inversion, 10% of del/ins/dup, 5-30 genes per segment

Experimental Results (Unequal 2)

90% inversion, 10% of del/ins/dup, 5-30 genes per segment

Multistate Encoding

MLME Results (200 genes 20 genomes)

MLME Results (1000 genes 20 genomes)

Post Analysis

Bootstrapping has been widely used to assess the quality of sequence phylogeny

The same procedure cannot be applied directly to gene-order data, since each genome is a single character

We tested jackknifing on simulated data to determine:

Whether jackknifing is useful

The best jackknifing rate

The threshold to use for the support values


DNA bootstrapping

Bootstrapping Results

Jackknifing Procedure

Generate a new dataset by removing half of the genes from the original genomes (orders are preserved)

Compute a tree on the new dataset

Repeat K times and obtain K replicates

Obtain a consensus tree with support values

An Example—New Genomes

1 2 3 4 5 6 7 8 9 10

1 -4 5 2 8 10 9 -7 -6 3

1 3 5 7 9

1 5 9 -7 3
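The construction of the new genomes in this example can be sketched in a few lines. The code assumes that the same genes are removed from every genome and that all genomes have the same gene content; `restrict` and `jackknife` are hypothetical names, and the `rate` parameter is the fraction of genes removed.

```python
import random


def restrict(genome, kept):
    """Keep only the genes in `kept`, preserving their order and signs."""
    return [g for g in genome if abs(g) in kept]


def jackknife(genomes, rate, rng):
    """Remove a `rate` fraction of the genes, the same ones from every genome."""
    genes = sorted({abs(g) for g in genomes[0]})
    dropped = set(rng.sample(genes, int(round(rate * len(genes)))))
    kept = set(genes) - dropped
    return [restrict(genome, kept) for genome in genomes]


g1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
g2 = [1, -4, 5, 2, 8, 10, 9, -7, -6, 3]

# The slide's example keeps genes 1, 3, 5, 7 and 9.
new_g1 = restrict(g1, {1, 3, 5, 7, 9})
new_g2 = restrict(g2, {1, 3, 5, 7, 9})

# One random replicate with half of the genes removed.
replicate = jackknife([g1, g2], rate=0.5, rng=random.Random(0))
```

Calling `jackknife` K times with different random states gives the K replicate datasets on which the trees for the consensus are computed.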

Jackknifing Rate

Support Value Threshold - FP

Up to 90% of false positives (FP) can be identified with 85% as the threshold

Trees with FP

Support Value Threshold - FN

Low Support Branches

Jackknife Properties

Jackknifing is necessary and useful for gene order phylogeny, and a large number of errors can be identified

A 40% jackknifing rate is reasonable

85% is a conservative threshold; 75% can also be used

Low support branches should be examined in detail

Conclusions

Great progress has been made in genome rearrangement research

We are able to handle real-size data

Now the question is what data

Data quality and biological modeling

Ancestral genome reconstruction is still difficult

Putting everything together has just started

Thank You!