Yun S. Song, Yufeng Wu and Dan Gusfield University of California, Davis

WABI 2005: Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event. Yun S. Song, Yufeng Wu and Dan Gusfield, University of California, Davis


Page 1

WABI 2005

Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event

Yun S. Song, Yufeng Wu and Dan Gusfield

University of California, Davis

Page 2

Haplotyping Problem

• Diploid organisms have two (not necessarily identical) copies of each chromosome.

• A single copy is a haplotype, a vector of Single Nucleotide Polymorphisms (SNPs).

• SNP: a site where two nucleotide types occur frequently, encoded as 0 or 1.

• The mixed description is the genotype, a vector over 0, 1, 2:
– If both haplotypes are 0, the genotype is 0
– If both haplotypes are 1, the genotype is 1
– If one is 0 and the other is 1, the genotype is 2

Page 3

Haplotypes and Genotypes

Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
Genotype:     2 1 2 1 0 0 1 2 0

Each individual has two haplotypes; merging them gives the genotype for the individual.

• Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes
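The merge rule on the previous slides can be sketched in a few lines of Python. This is only an illustration: the function names `merge_haplotypes` and `count_haplotype_pairs` are made up here, and haplotypes are assumed to be strings over 0/1. The second function counts how many unordered haplotype pairs are consistent with one genotype, which is why HI is ambiguous without a genetic model.

```python
def merge_haplotypes(h1, h2):
    """Merge two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their value; differing alleles become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def count_haplotype_pairs(genotype):
    """Number of unordered haplotype pairs that merge into this genotype."""
    k = genotype.count("2")  # heterozygous sites
    return 1 if k == 0 else 2 ** (k - 1)


genotype = merge_haplotypes("011100110", "110100100")
print(genotype)                         # 212100120
print(count_haplotype_pairs(genotype))  # 4
```

With three heterozygous sites, the example genotype already has four possible explanations.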

Page 4

Perfect Phylogeny Haplotyping (PPH)

• Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice of solution.

• Gusfield (2002) introduced the PPH problem.

• PPH: find HI solutions that fit into a perfect phylogeny.

• Nice results for PPH, including a linear-time algorithm.

Page 5

The Perfect Phylogeny Model for Haplotypes

[Figure: perfect phylogeny tree over sites 1 to 5. The ancestral sequence 00000 is at the root, site mutations 1 to 5 label the edges, and the extant sequences are at the leaves.]

The tree derives the set M:

10100
10000
01011
01010
00010

Assume at most 1 mutation at each site.
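A small check can make the model concrete. The sketch below assumes the all-zero ancestral sequence shown on the slide and uses the standard fact that binary data with a known all-zero root fit a perfect phylogeny exactly when, for every pair of sites, the sets of rows carrying a 1 are disjoint or nested. The function name is illustrative, and this quadratic pairwise test is not the linear-time algorithm mentioned earlier.

```python
def admits_rooted_perfect_phylogeny(rows):
    """Check 0/1 rows against a perfect phylogeny rooted at the all-zero sequence.

    For every pair of sites, the sets of rows with a 1 must be
    disjoint or nested (one contained in the other).
    """
    num_sites = len(rows[0])
    ones = [{i for i, row in enumerate(rows) if row[j] == "1"}
            for j in range(num_sites)]
    for a in range(num_sites):
        for b in range(a + 1, num_sites):
            overlap = ones[a] & ones[b]
            if overlap and not (ones[a] <= ones[b] or ones[b] <= ones[a]):
                return False
    return True


M = ["10100", "10000", "01011", "01010", "00010"]
print(admits_rooted_perfect_phylogeny(M))  # True

# The four sequences from the back/recurrent mutation slide do not fit.
print(admits_rooted_perfect_phylogeny(["000", "010", "101", "110"]))  # False
```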

Page 6

PPH Example

[Figure: genotypes, the inferred haplotypes, and the resulting perfect phylogeny.]

Page 7

Imperfect Phylogeny Haplotyping (IPPH): Extending PPH

• Often, real biological data do not have PPH solutions.

• Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic).

• Our approach: IPPH with an explicit genetic model, allowing a small amount of
– Homoplasy, i.e. back or recurrent mutation
– Recombination

• Goal: extend the usage of PPH
– Real data may be a small perturbation from PPH
– Haplotype blocks: low recombination or homoplasy

Page 8

Back/Recurrent Mutation for Haplotypes

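Data: 000, 010, 101, 110

[Figure: tree rooted at 000 that derives the data, with internal sequences 010 and 100 and edges labeled by the mutating sites 2, 1, 3, 1. Site 1 appears on two edges.]

More than one mutation at a site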

Page 9

Recombinations: Single Crossover

• Recombination is one of the principal genetic forces shaping genetic variation.

• Two equal-length sequences generate a third sequence of the same length.

Parents:      110001111111001 and 000110000001111
Recombinant:  11000 | 0000001111 (prefix of the first parent, breakpoint, suffix of the second parent)
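The single-crossover operation can be written as a one-line string splice. The sketch below assumes sequences are 0/1 strings and that `breakpoint` counts the number of sites taken from the first parent; the function name is illustrative.

```python
def single_crossover(parent1, parent2, breakpoint):
    """Return the prefix of parent1 joined with the suffix of parent2."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(parent1):
        raise ValueError("breakpoint out of range")
    return parent1[:breakpoint] + parent2[breakpoint:]


child = single_crossover("110001111111001", "000110000001111", 5)
print(child)  # 110000000001111
```

The result has the same length as both parents, matching the slide's example with a breakpoint after site 5.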

Page 10

IPPH (Imperfect Phylogeny Haplotyping) Problems

• Small deviation from PPH

• H1-IPPH problem
– Find a tree that allows exactly one site to mutate twice
– All other sites mutate at most once
– Derive haplotypes for the given genotypes

• R1-IPPH problem
– Find a network that has exactly one recombination event
– Each site mutates at most once
– Derive haplotypes for the given genotypes

Page 11

Minimum Number of Recombinations for Haplotypes

Rmin   Rho=1   Rho=3   Rho=5
0      60.8%   23.6%    8.4%
1      31.8%   35.2%   27.6%
2       6.8%   24.8%   27.8%
3              11.6%   21.6%
4               3.8%    9.0%
5               0.8%    3.6%
6               0.2%    1.4%

Frequency of the minimum number of recombinations (Rmin) for small rho (scaled recombination rate): 20 sequences, 30 sites, 500 simulations.

Page 12

Haplotyping with One Homoplasy

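Genotypes:

     s1 s2 s3
a     0  2  0
b     1  2  2

Haplotypes:

     s1 s2 s3
a1    0  0  0
a2    0  1  0
b1    1  0  1
b2    1  1  0

[Figure: 1-homoplasy tree rooted at 000 carrying haplotypes a1, a2, b1, b2, with internal sequences 010 and 100 and edges labeled by the mutating sites 2, 1, 3, 1.]

More than one mutation at site 1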

Page 13

Algorithm for H1-IPPH

• For each site s in the input genotype data M
– Test whether M-{s} has PPH solutions
– If not, move to the next site
– Otherwise, check whether 1 homoplasy at site s can lead to HI solutions
– If yes, stop and report the result

• Assume only one PPH solution for M-{s}

• But how to find solutions with 1 homoplasy at s efficiently?
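The first step of the loop (which sites s leave a PPH-solvable M-{s}?) can be sketched by brute force on tiny inputs. This is not the talk's method: it enumerates every phasing, which is exponential, and it assumes an unknown root so that a perfect phylogeny exists exactly when no pair of sites shows all four gametes. All function names are illustrative, and genotypes are assumed to be tuples over 0, 1, 2.

```python
from itertools import combinations, product


def phasings(genotype):
    """Yield every unordered haplotype pair consistent with one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    base = [0 if g == 2 else g for g in genotype]
    if not het:
        yield tuple(base), tuple(base)
        return
    # Fix the first heterozygous site to 0 in h1 to skip mirrored pairs.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)


def passes_four_gamete_test(haplotypes):
    """True if no pair of sites shows all of 00, 01, 10, 11."""
    num_sites = len(haplotypes[0])
    return all(len({(h[i], h[j]) for h in haplotypes}) < 4
               for i, j in combinations(range(num_sites), 2))


def has_pph_solution(genotypes):
    """Brute force: try every combination of phasings."""
    for choice in product(*(list(phasings(g)) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if passes_four_gamete_test(haplotypes):
            return True
    return False


def candidate_homoplasy_sites(genotypes):
    """0-based sites s for which M-{s} has a PPH solution."""
    num_sites = len(genotypes[0])
    return [s for s in range(num_sites)
            if has_pph_solution([g[:s] + g[s + 1:] for g in genotypes])]


M = [(0, 2, 0), (1, 2, 2)]            # genotypes a and b from the earlier slide
print(has_pph_solution(M))            # False
print(candidate_homoplasy_sites(M))   # [0, 1]
```

For the two genotypes a and b, the full data have no PPH solution, and deleting site 1 or site 2 (0-based 0 or 1) restores one. The second step, testing whether one homoplasy at a candidate site yields an HI solution, is the hard part addressed on the following slides.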

Page 14

Example

[Figure: genotype matrix M with site i3 highlighted, split into M-{i3} and the single column {i3}.]

Page 15

PPH

[Figure: M-{i3} and {i3} with the corresponding haplotype matrices Mh-{i3} and h{i3}; rows r2, r2’, s2, s2’ are marked.]

Combine Mh-{i3} with h{i3}:

• Assume Mh-{i3} is fixed.

• Haplotypes for the same genotype must pair up.

• Two ways to pair.

Page 16

• 4 ways to try pairing i3.

• Exponential number in general, even for one PPH solution.

• Need a polynomial-time method to avoid trying all the pairings.

[Figure: Mh-{i3} combined with h{i3} in different ways, giving candidates such as Mh1 and Mh2.]
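The growth in the number of pairings is easy to see by enumeration. The sketch below assumes the haplotype pairs for M-{s} are fixed and given as 0/1 strings, and that `column` holds the genotype values at the deleted site; each genotype that is 2 at that site doubles the number of ways to extend the matrix. The function name and the example data are hypothetical.

```python
from itertools import product


def column_pairings(hap_pairs, column):
    """Yield every way to append the site-s column to fixed haplotype pairs.

    hap_pairs: list of (h1, h2) strings, one pair per genotype, for M-{s}.
    column:    genotype value (0, 1 or 2) of each genotype at site s.
    """
    het = [i for i, g in enumerate(column) if g == 2]
    for bits in product((0, 1), repeat=len(het)):
        choice = dict(zip(het, bits))
        extended = []
        for i, ((h1, h2), g) in enumerate(zip(hap_pairs, column)):
            if g == 2:
                b = choice[i]  # which haplotype of the pair receives the 1
                extended.append((h1 + str(1 - b), h2 + str(b)))
            else:
                extended.append((h1 + str(g), h2 + str(g)))
        yield extended


pairs = [("00", "10"), ("01", "11")]   # hypothetical fixed Mh-{s}
print(len(list(column_pairings(pairs, [2, 2]))))  # 4
```

With k genotypes heterozygous at the site there are 2^k extensions, which is why trying all of them is not an option in general.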

Page 17

Move to Trees

[Figure: Mh-{i3} and h{i3}.]

Convert the perfect phylogeny tree from the PPH solution to an unrooted tree.

Page 18

1 Homoplasy: from T to Tr, Ts

[Figure: tree T with leaf groups L1, L2, O1, O2 and a recurrent mutation at site s on two edges; split Ts separating L1, L2 from O1, O2 at site s; tree Tr.]

• s induces a split Ts

• Deleting s induces tree Tr

Page 19

From Tr, Ts to T

• Find two subtrees Ts1, Ts2 in Tr such that Ts1 and Ts2 correspond to one side of Ts

[Figure: tree Tr and split Ts separating L from O at site s; the resulting tree T has site s on two edges, with leaf groups L1, L - L1, O1, O2.]

Page 20
Page 21

1. Pick one side of the partition from Ts

2. Pick the leaves from Tr corresponding to the chosen partition side

3. Check whether the selected leaves fit into two subtrees
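Step 3 can be illustrated with a brute-force check on a small unrooted tree. The sketch assumes Tr is given as an edge list, haplotypes sit only at leaves, and no vertex needs refining; a "subtree" here means the leaves on one side of a single edge cut. The function names and the example tree are hypothetical, and this pairwise search is a stand-in for the efficient selection method, not a reproduction of it.

```python
from collections import defaultdict
from itertools import combinations


def edge_cut_leaf_sets(edges, leaves):
    """Leaf sets on either side of every edge of an unrooted tree."""
    leaves = set(leaves)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    sides = set()
    for u, v in edges:
        # Collect everything reachable from u without crossing edge (u, v).
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y in seen or {x, y} == {u, v}:
                    continue
                seen.add(y)
                stack.append(y)
        side = frozenset(seen & leaves)
        sides.add(side)
        sides.add(frozenset(leaves - side))
    return sides


def fits_two_subtrees(edges, leaves, target):
    """True if target is the union of two disjoint edge-cut leaf sets."""
    target = frozenset(target)
    parts = [s for s in edge_cut_leaf_sets(edges, leaves) if s and s < target]
    return any(a | b == target and not (a & b)
               for a, b in combinations(parts, 2))


# Hypothetical Tr: three cherries (A,B), (C,D), (E,F) joined at a center c.
edges = [("A", "u"), ("B", "u"), ("u", "c"),
         ("C", "v"), ("D", "v"), ("v", "c"),
         ("E", "w"), ("F", "w"), ("w", "c")]
leaves = set("ABCDEF")
print(fits_two_subtrees(edges, leaves, {"A", "C"}))        # True
print(fits_two_subtrees(edges, leaves, {"A", "C", "E"}))   # False
```

In the example, the side {A, C} is covered by the two single-leaf subtrees, while {A, C, E} would need three.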

Page 22

1. May need to refine a non-binary vertex before picking a subtree

s2 can pair with r2’

Page 23

Solution

Page 24

Algorithms and Results

• Efficient graph-coloring-based method to select two subtrees (skipped)

• Implemented in C++

• Simulation with data generated by the program ms

• Compared to PHASE (a haplotyping program)
– Accuracy: comparable
– Speed: at least 10x faster
– 100x100 data: about 3 seconds

• Can identify the homoplasy site with high accuracy: >95% in simulation

Page 25

Algorithm for R1-IPPH

[Figure: genotype matrix M split into a left part ML and a right part MR.]

Split M by cutting between two sites

Page 26

PPH Solutions

Build perfect phylogeny for two partitions

Page 27

1-SPR operation

SPR: subtree-prune-regraft operation

The 1-recombination condition is equivalent to distance-SPR(TL, TR) = 1
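The SPR condition can be tested directly on small trees by generating every tree that is one prune-and-regraft move away. The sketch assumes rooted binary trees written as nested 2-tuples of leaf names; the function names are illustrative. It is the brute-force idea only, and it does not handle non-binary trees.

```python
def canon(tree):
    """Canonical form: order the two children of every node consistently."""
    if isinstance(tree, str):
        return tree
    a, b = canon(tree[0]), canon(tree[1])
    return (a, b) if repr(a) <= repr(b) else (b, a)


def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if not isinstance(tree, str):
        yield from subtrees(tree[0], path + (0,))
        yield from subtrees(tree[1], path + (1,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    kids = list(tree)
    kids[path[0]] = replace(kids[path[0]], path[1:], new)
    return tuple(kids)


def spr_neighbors(tree):
    """All trees reachable by one subtree-prune-regraft move."""
    result = set()
    for path, sub in subtrees(tree):
        if not path:
            continue  # the whole tree cannot be pruned
        sibling = get(tree, path[:-1])[1 - path[-1]]
        pruned = replace(tree, path[:-1], sibling)  # suppress the old parent
        for target_path, target in subtrees(pruned):
            result.add(canon(replace(pruned, target_path, (target, sub))))
    result.discard(canon(tree))
    return result


def spr_distance_is_one(t1, t2):
    return canon(t2) in spr_neighbors(t1)


TL = ((("a", "b"), "c"), "d")
TR = ((("a", "c"), "b"), "d")
print(spr_distance_is_one(TL, TR))  # True: move leaf b next to (a, c)
print(spr_distance_is_one((("a", "b"), ("c", "d")),
                          (("a", "c"), ("b", "d"))))  # False
```

A binary tree with n leaves has on the order of n^2 neighbors, so this search is cheap for binary TL and TR.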

Page 28

Algorithm for R1-IPPH

• The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary.

• Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time. (not in paper)

Page 29

Conclusions

• Contributions
– Assuming a bounded number of PPH solutions:
1. Polynomial-time algorithm for the H1-IPPH problem
2. Polynomial-time algorithm for the R1-IPPH problem
3. Possible extension to more than 1 homoplasy event

• Open problems
– Haplotyping with more than 1 recombination efficiently
– Remove the assumption that the number of PPH solutions for M-{s} is bounded

Page 30

Thank you

• Questions?