recomb05 poster ent phasing
description
Transcript of recomb05 poster ent phasing
Poster Session A, Bay 30
Haplotype Inference by Entropy MinimizationIon Mandoiu and Bogdan Pasaniuc, CSE Department, University of Connecticut
• A Single Nucleotide Polymorphism (SNP) is a position in the genome at which exactly two of the possible four nucleotides occur in a large percentage of the population. SNPs account for most of the genetic variability between individuals, and mapping SNPs in human population has become the next high-priority in genomics after the completion of the Human Genome project.• In diploid organisms such as humans, there are two non-identical copies of each chromosome. A description of the SNPs in each chromosome is called a haplotype, which can be viewed as a 0/1 vector, e.g., by representing the most frequent (dominant) SNP allele as a 0 and the alternate (minor) allele as a 1.
Introduction
gcc{AT}ac{TG}
gccTacG
gccAacT
gccTacT
gccAacG
• At present, it is prohibitively expensive to directly determine the haplotypes of an individual, but it is possible to obtain rather easily the conflated SNP information in the so called genotype. A genotype can be conveniently represented as a 0/1/2 vector, where 0 (1) means that both chromosomes contain the dominant (respectively minor) allele, and 2 means that the two chromosomes contain different alleles.
?
M i n i m u m E n t r o p y P o p u l a t i o n P h a s i n g : G i v e n a s e t o f g e n o t y p e s , f i n d a p h a s i n g w i t h m i n i m u m e n t r o p y
P r o b l e m D e f i n i t i o n A p a i r o f h a p l o t y p e s ( h , h ’ ) e x p l a i n s g i f h ( i ) = h ’ ( i ) = g ( i ) w h e n e v e r g ( i ) i s 0 o r 1 ,
a n d h ( i ) ? h ’ ( i ) w h e n e v e r g ( i ) = 2
A p h a s i n g o f a s e t o f g e n o t y p e s G { 0 , 1 , 2 } k i s a f u n c t i o n f : G { 0 , 1 } k x { 0 , 1 } k
s u c h t h a t , f o r e v e r y g , f ( g ) i s a p a i r o f h a p l o t y p e s t h a t e x p l a i n g
E n t r o p y o f a p h a s i n g
w h e r e c o v ( h , f ) , i s t h e n u m b e r o f g e n o t y p e s g f r o m G s u c h t h a t f ( g ) = ( h , h ' ) o r f ( g ) = ( h ' , h ) p l u s t w i c e t h e n u m b e r o f o f g e n o t y p e s g s u c h t h a t f ( g ) = ( h , h )
)||2
),cov(log(||2
),cov()(0),co v (: G
fhG
fhfEntropyfhh
Approaches to Phasing
• Maximum Likelihood• PHASE [Stevens et al. 01] - repeatedly chooses a genotype at
random, and estimates that individual’s haplotypes under the assumption that all other haplotypes are correctly reconstructed
• GERBIL [Kimmel&Shamir 05] - expectation maximization for genotype resolution and block partitioning
• Perfect Phylogeny• Set of haplotypes used in the phasing must be consistent with a
perfect phylogeny [Gusfield 02]• Pure Parsimony
• Minimizing the number of distinct haplotypes• Integer Linear Program formulations: exponential size [Gusfield 04],
polynomial size [Brown&Harrower 05]• Entropy
• Minimize the entropy of the phasing• [Halperin&Karp 04] - simple greedy approximation algorithm
Previous Approaches
Local optimization algorithm for entropy minimization
1. Create a random phasing f 2. repeat forever
Find the pair (g ,(h ,h’)) that minimizes entropy(f’), where f’ is obtained from f by re-explaining g with (h,h’)If entropy(f’) < entropy(f) update f (change the current explanation for g to (h,h’))Else
exit loop3. Output cover
Phasing Short Genotypes
Switching Error (%)
--4.11.72.33.05.516.148.8800
04.122.33.25.615.949600
04.12.22.63.35.615.848.9400
0.24.53.13.146.115.848.2200
0.64.44.74.34.36.515.847.8100
1.74.36.565.96.816.347.750
5q31-euro(99 snp)
--11.33.04.07.510.824.848.5800
011.42.74.37.211.224.648.5600
011.53.65.28.111.724.748.5400
0.111.55.36.69.212.124.748.1200
0.69.99.1111212.924.948.2100
2.81217.215.616.115.725.847.550
5q31-wafr(89 snp)
9.811.118.417.617.917.325.135.429Gabriel2.62.74.64.23.75.115.943.5129Daly
k=9k=7k=5k=3k=1PHASEGERBIL
ENTROPY_PHASERAND#Gen
D atasets• [D aly 2001] 129 fam ily trios over a reg ion of 103 SN Ps• [G abrie l 2002] 60 b locks w ith an average of 50 S N P s genotyped for 29 ind iv iduals• [Forton et a l 2004] S im ulated popula tions generated as fo llows
-32 E uropean and 32 W est A frican fam ily trios were genotyped a t t he IL8 and 5q31 reg ions [H u ll e t a l. 2000]-P opulation hap lo types and the ir frequencies were in ferred using P ham ily and P H A S E-B ased on these haplo types frequencies, 100,000 random genotypes are generated, from which we se lec ted populations of s ize between 50 and 800
Experim ental Setup
5q 31w afr - s w itch e rro r
0
5
10
15
20
25
30
50 100 200 400 600 800
#ge n
erro
r ra
te
win= 1
win= 3
win= 5
win= 7
win= 9
G E RB IL
P HA S E
Sw itch error rateG iven the true hap lo types(t,t’) and the in ferred ones(h ,h ’), sw itch error ra te is the num ber of tim es we have to sw itch from reading h to h ’ to obta in t, d iv ided by the num ber of am biguous positions.
IL8-datasets
--8.45.26.68.310.716.347.1800
0.18.55.66.98.810.916.247600
0.28.76.47.5910.916.546.9400
0.78.87.68.29.511.416.646.5200
2.59.710.19.71112.216.545.9100
4.99.610.911.611.612.517.245.150
IL8-wafr(52 snp)
--2.62.12.12.734.847.8800
0.42.622.12.73.14.747.6600
0.42.72.22.32.73.24.847.6400
0.52.62.32.22.83.14.747200
1.22.832.73.23.8546.2100
1.72.932.83.13.74.944.950
IL8-euro(55 snp)
k=9k=7k=5k=3k=1PHASEGERBIL
ENTROPY_PHASERAND#Gen
• Entropy minimization gives a unified framework for various phasing problem variants, including phasing genotypes with missing data and pedigree constrained phasing
• Preliminary results show that entropy minimization is competitive with existing methods in haplotype reconstruction accuracy, particularly for large populations
• Currently, we are implementing trio-based entropy phasing and are exploring other strategies for phasing long genotypes
References• V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. Haplotyping as perfect phylogeny: a direct approach
Technical Report UCDavis 2002• D Brown and I Harrower, A new integer programming formulation for the pure parsimony problem in
haplotype analysis, Proceedings of WABI 2004, 254-265• Daly et al., High resolution haplotype structure in the human genome, Nature Genetics, 29:229–232,
2001• Gabriel et al., The structure of haplotype blocks in the human genome, Science, 296:2225—2229, 2002.• E. Halperin and R. Karp. The Minimun-Entropy Set Cover Problem. International Colloquium on
Automata Languages and Programming 2004• J. Hull et al., Association of respiratory syncytial virus bronchiolitis with the interleukin 8 gene region in
UK families. Thorax 55:1023-1027, 2000• M. Stephens, N. Smith, and P. Donnelly. A new statistical method for haplotype reconstruction from
population data. American Journal of Human Genetics 68:978-989, 2001
Conclusions
Long Genotypes• Divide the genotypes into windows of size k• Run the previous algorithm for windows of size 2*k by fixing the first k snips.
k
Handling Missing Data• Any value is correct for a snip with missing data; the result is more pairs of haplotypes that can explain a genotype.• The local improvement algorithm remains the same
Trios• A family trio: two parents and a child• One haplotype from mother, one from father• At each step we re-explain a whole family
Extensions