Post on 26-Jan-2016
A New Nonparametric Bayesian Model for Genetic Recombination in Open Ancestral Space
Presented by Chunping Wang
Machine Learning Group, Duke University
February 26, 2007
Paper by E. P. Xing and K-A. Sohn
Outline
• Terminology and Introduction
• DP Mixtures for Non-recombination Inheritance
• HMDP for Recombination
• Results
• Conclusions
Terminology and Introduction (1)
• Allele: a viable DNA coding on a chromosome – observation
• Locus: the location of an allele – index of an observation
• Haplotype: a sequence of alleles – data sequence
• Recombination: exchange of pieces between paired chromosomes – state transition
• Mutation: any change to a haplotype during inheritance – emission
Terminology and Introduction (3)
Problems:
1. Ancestral inference: recovering ancestral haplotypes;
2. Recombination analysis: inferring the recombination hotspots;
3. Ancestral mapping: inferring the ancestral origin of each allele in each modern haplotype.
DP Mixtures for Non-recombination Inheritance (1)
Non-recombination:
• Only mutation may occur during inheritance;
• Each modern haplotype originates from a single ancestor.
This holds only for haplotypes spanning a short region of a chromosome.
DP Mixtures for Non-recombination Inheritance (2)
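[Figure: graphical model with base measure $Q_0$, random measure $Q$, parameters $\phi_i$, and haplotypes $h_i$, $i = 1, \ldots, n$]
$Q \mid \alpha, Q_0 \sim DP(\alpha, Q_0)$
$\phi_i \mid Q \sim Q$
$h_i \mid \phi_i \sim P_h(\cdot \mid \phi_i)$
where $\phi_k^* = (a_k, \theta_k)$, $k = 1, \ldots, K$, the distinct values of $\{\phi_i\}_{i=1}^{n}$, denote the kth ancestor jointly with the mutation parameter corresponding to the kth ancestor.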
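To make the clustering behaviour of the DP mixture concrete, here is a minimal Python sketch that draws ancestor assignments from a Chinese restaurant process, the standard marginal representation of a DP when $Q$ is integrated out. The function name `sample_crp_assignments` and the parameter `alpha` (the DP concentration) are illustrative assumptions, not notation from the paper.

```python
import random

def sample_crp_assignments(n, alpha, seed=0):
    """Draw ancestor indices for n haplotypes from a Chinese restaurant process.

    alpha is the (assumed) DP concentration parameter and must be positive.
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of haplotypes already assigned to ancestor k
    for _ in range(n):
        # An existing ancestor k is chosen with weight counts[k];
        # a brand-new ancestor is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new ancestor
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts
```

Calling `sample_crp_assignments(200, 1.0)` returns one ancestor index per haplotype; the number of distinct ancestors is not fixed in advance, which is the property the DP mixture relies on.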
HMDP for Recombination (1)
For long haplotypes that may descend from multiple ancestors, we consider recombinations (state transitions across discrete space intervals).
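[Figure: graphical model of the HMDP: a master DP $Q_0$ with base measure $F$ above the lower-level DPs $Q_1, Q_2, \ldots, Q_j$, each with parameters $\phi_{ji}$ and haplotypes $h_{ji}$, $i = 1, \ldots, m_j$]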
HMDP for Recombination (2)
Each row of the HMM transition matrix is a DP. These DPs are linked by the top-level master DP and share the same set of target states.
The mixing proportions of the jth lower-level DP are denoted $\pi_j = [\pi_{j,1}, \pi_{j,2}, \ldots]$; $\pi_j$ is then the jth row of the transition matrix.
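The idea that every row of the transition matrix is a DP sharing one set of target states can be sketched with a finite truncation. The snippet below is an assumption-laden illustration, not the authors' construction: it truncates the master DP to `K` atoms by stick-breaking and draws each row as a Dirichlet centred on the master weights, which is a common finite approximation of the HDP. The names `stick_breaking`, `sample_transition_rows`, `gamma`, and `alpha` are hypothetical.

```python
import random

def stick_breaking(gamma, K, rng):
    """Truncated stick-breaking weights of the master DP (K atoms)."""
    beta, remaining = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, gamma)
        beta.append(remaining * v)
        remaining *= 1.0 - v
    beta.append(remaining)  # the last atom absorbs the leftover mass
    return beta

def sample_transition_rows(beta, alpha, rng):
    """Draw one row pi_j ~ Dirichlet(alpha * beta) per state.

    All rows are defined over the same atoms, mirroring the shared
    target states of the lower-level DPs.
    """
    rows = []
    for _ in beta:
        g = [rng.gammavariate(alpha * b, 1.0) for b in beta]
        total = sum(g)
        rows.append([x / total for x in g])
    return rows
```

For example, `beta = stick_breaking(1.0, 5, random.Random(0))` followed by `sample_transition_rows(beta, 5.0, random.Random(1))` yields a 5 x 5 row-stochastic matrix whose rows all place most mass on the atoms favoured by the master weights.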
HMDP for Recombination (3)
[Figure labels: modern haplotype, ancestor haplotype]
The indicators $C_i = [C_{i,1}, \ldots, C_{i,T}]$ of the ith modern haplotype specify, for every locus, the corresponding ancestral haplotype:
• $C_{i,t} = k$ for all $t$ when no recombination takes place during the inheritance process producing haplotype $H_i$;
• $C_{i,t} \neq C_{i,t+1}$ when a recombination occurs between loci t and t+1.
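Reading recombination events off an indicator sequence is mechanical: a recombination lies between two adjacent loci exactly when the ancestor indicator changes. A minimal sketch, assuming the indicators of one haplotype are stored as a Python list with 0-based locus positions (the function name is hypothetical):

```python
def recombination_breakpoints(c):
    """Return every position t where the ancestor indicator changes
    between locus t and locus t+1 (0-based)."""
    return [t for t in range(len(c) - 1) if c[t] != c[t + 1]]
```

An indicator sequence that is constant over all loci gives an empty list, which corresponds to the non-recombination case.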
HMDP for Recombination (4)
Introduce a Poisson point process to control the duration of non-recombinant inheritance (space-inhomogeneous):
$p(x \mid \lambda) = \frac{1}{x!} \lambda^{x} e^{-\lambda}$
Denote
x: the number of recombinations between loci t and t+1;
d: the physical distance between loci t and t+1;
r: the recombination rate per unit distance.
Then, with $\lambda = dr$,
$p(x = 0 \mid dr) = e^{-dr}$
$p(x > 0 \mid dr) = 1 - e^{-dr}$
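The two probabilities follow directly from the Poisson mass function with rate $dr$. The short sketch below checks this numerically; the function names are illustrative, and the units of `d` and `r` are whatever the data set uses, as long as their product is dimensionless.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability of exactly x events at rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def prob_no_recombination(d, r):
    """p(x = 0 | dr): no recombination over physical distance d at rate r."""
    return math.exp(-d * r)

def prob_recombination(d, r):
    """p(x > 0 | dr): at least one recombination over distance d at rate r."""
    return 1.0 - math.exp(-d * r)
```

For closely spaced loci (small `d * r`) the probability of no recombination is close to 1, so adjacent alleles tend to share an ancestor.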
HMDP for Recombination (5)
Combining this with the standard stationary HMDP gives the non-stationary state-transition probability:
$p(C_{i,t+1} = k' \mid C_{i,t} = k) = e^{-dr}\,\delta(k, k') + (1 - e^{-dr})\,\pi_{k,k'}$
As d or r goes to infinity, $e^{-dr} \to 0$ and $1 - e^{-dr} \to 1$, so the inhomogeneous HMDP model reduces to a standard HMDP.
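The non-stationary transition mixes "stay with the current ancestor" (weight $e^{-dr}$) with a draw from the stationary row $\pi_k$ (weight $1 - e^{-dr}$). A minimal sketch under the assumption that $\pi_k$ is available as a finite list of probabilities; `nonstationary_row` is a hypothetical helper name:

```python
import math

def nonstationary_row(pi_k, k, d, r):
    """Transition probabilities out of ancestor k between two loci.

    pi_k : stationary row of the transition matrix for state k (sums to 1)
    d, r : physical distance between the loci and recombination rate
    """
    stay = math.exp(-d * r)                     # probability of no recombination
    row = [(1.0 - stay) * p for p in pi_k]      # recombine, then pick by pi_k
    row[k] += stay                              # or keep the current ancestor
    return row
```

With `d = 0` the row is one-hot on `k`, and for very large `d * r` it approaches `pi_k`, matching the limiting behaviour of the model.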
HMDP for Recombination (6)
Inference:
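The emission function: $p(h \mid a_c, \theta)$
The prior base: $F(A, \theta) = p(A)\,p(\theta)$, where $p(A)$ is uniform and $p(\theta)$ is given by $\theta \sim \mathrm{Beta}(\alpha_h, \beta_h)$.
Integrating over $\theta$ gives the marginal likelihood.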
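The closed form of the marginal likelihood is not reproduced in this text. As an illustration only, the sketch below assumes a single-parameter mutation model in which an allele matches its ancestor with probability $\theta$ and otherwise takes one of the remaining `n_alleles - 1` values uniformly; under that assumption, integrating $\theta$ against a Beta prior gives a ratio of Beta functions. The function names and the `n_alleles` parameter are hypothetical.

```python
import math

def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_likelihood(n_match, n_mismatch, alpha_h, beta_h, n_alleles=2):
    """Log marginal likelihood of alleles given their ancestor, theta integrated out.

    Assumed emission: p(h | a, theta) = theta if h == a,
    else (1 - theta) / (n_alleles - 1).
    Prior: theta ~ Beta(alpha_h, beta_h).
    """
    return (log_beta_fn(alpha_h + n_match, beta_h + n_mismatch)
            - log_beta_fn(alpha_h, beta_h)
            - n_mismatch * math.log(n_alleles - 1))
```

Working in log space keeps the computation stable when the match and mismatch counts are in the hundreds, as they would be for long haplotypes.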
HMDP for Recombination (7)
Inference:
Two sampling stages:
1. Sample $\{C_{i,t}\}$ given all haplotypes h and the most recently sampled ancestor pool a;
2. Sample every ancestor $A_k$ given all haplotypes h and the current $\{C_{i,t}\}$.
Combining the HDP prior with the marginal likelihood, we can infer the posterior for $\{C_{i,t}\}$ and $\{A_{k,t}\}$, which are the variables of interest.
Results (1)
Simulated data:
30 populations, each consisting of 200 haplotypes derived from K=5 ancestral haplotypes, with T=100 loci.
Compare: HMDP, HMMs with K=3,5 and 10
The average ancestor reconstruction errors for the five ancestors
Even the HMM with K=5 cannot beat the HMDP
Results (2)
Box plot of the empirical recombination rates
The vertical gray lines - the pre-specified recombination hotspots
Threshold 1
Threshold 2
Results (3)
Population maps: 1. true map; 2. HMDP; 3-5. HMMs with K=3,5,10
Each vertical thin line – one modern haplotype;
Each color – one ancestral haplotype.
Measure for accuracy: the mean squared distance to the true map
Results (4)
Real haplotype data set 1: Daly data – single population
512 haplotypes, T=103 loci
Bottom: empirical recombination rates
Upper vertical lines: recombination hotspots.
Red dotted lines: HMM; blue dashed lines: MDL; black solid lines: HMDP
Results (6)
Estimated population map
Each vertical thin line – one modern haplotype;
Each color – one ancestral haplotype.
Conclusions
• The HMDP model is an application and extension of the HDP to population genetics;
• The HDP allows the state space of the HMM to be infinite, which makes it suitable for inferring an unknown number of ancestral haplotypes;
• The HMDP model also allows the recombination rates to be non-stationary;
• The HMDP model can jointly infer a number of important genetic variables.