Background The demographic events experienced by populations influence their genealogical history...



Background

The demographic events experienced by populations influence their genealogical history and therefore the pattern of neutral polymorphism observable within and between extant populations (see Figure 1). For example, the number of alleles shared between very closely related species depends on the time at which the species split and whether gene flow occurred since the split (see Figure 2). Thus, polymorphism data can be used to estimate the demographic parameters describing the history of two incipient species (see Figure 3).

Here, we consider a simple model in which two populations split T generations ago and exchange M migrants per generation. Na, N1 and N2 are the effective population sizes of the ancestral, first and second descendant populations, respectively. We denote the set of parameters by Θ. Our goal is to estimate the posterior distribution of Θ given the data D. Rather than using all the data to estimate these parameters, we summarize the data for each locus by four statistics known to be sensitive to the parameters of interest (see Figure 1 for details). Given a genealogy G, the probability of obtaining these statistics can be calculated explicitly. We therefore take the following approach to estimate the posterior distribution of the parameters: we draw a set of parameters independently from the prior distributions, simulate a genealogical history for each locus, and calculate u = p(D | G, Θ). We then weight the parameter values by u to obtain an estimate of their posterior probability.
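The weighting scheme described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the implementation behind the poster: the two-population split model is replaced by a single constant-size population, the only parameter is the population mutation rate theta (uniform prior), each locus is summarized by its number of segregating sites alone, and there is no recombination within a locus. All function names and the example counts are hypothetical.

```python
import math
import random


def simulate_total_branch_length(n, rng):
    """Total branch length of a standard coalescent tree for n sequences.

    Time is measured in units of 2N generations, so k lineages coalesce
    at rate k(k-1)/2 (toy single-population model, an assumption here).
    """
    total = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0
        total += k * rng.expovariate(rate)  # k lineages persist for this interval
    return total


def poisson_pmf(s, mean):
    """Probability of s mutations when the expected number is `mean`."""
    return math.exp(-mean + s * math.log(mean) - math.lgamma(s + 1))


def weighted_posterior_mean(s_obs_per_locus, n, prior_low, prior_high,
                            n_draws, seed=0):
    """Draw theta from the prior, simulate one genealogy per locus,
    and weight each draw by u = p(D | G, theta)."""
    rng = random.Random(seed)
    thetas, weights = [], []
    for _ in range(n_draws):
        theta = rng.uniform(prior_low, prior_high)  # draw from the prior
        u = 1.0
        for s_obs in s_obs_per_locus:
            length = simulate_total_branch_length(n, rng)
            # Given the genealogy, mutations are Poisson with mean theta/2 * length.
            u *= poisson_pmf(s_obs, theta / 2.0 * length)
        thetas.append(theta)
        weights.append(u)
    total_weight = sum(weights)
    post_mean = sum(t * w for t, w in zip(thetas, weights)) / total_weight
    return post_mean, thetas, weights


# Hypothetical data: segregating-site counts at three loci, 10 sequences each.
post_mean, thetas, weights = weighted_posterior_mean(
    [5, 7, 4], n=10, prior_low=0.1, prior_high=10.0, n_draws=2000)
print(post_mean)
```

The returned weights can also be used to approximate other posterior summaries, such as a weighted median.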

Keystone Symposia. Genome Sequence Variation. Jan 08 – Jan 13, 2006

Estimating divergence times and testing for migration using multi-locus polymorphism data

Céline Becquet (a), Andrea S. Putnam (b), Peter Andolfatto (b), and Molly Przeworski (a)

(a) Dept. of Human Genetics, Chicago, IL, USA, 60637; (b) University of California at San Diego, La Jolla, CA, 92093

Table 1 – Comparison between methods

| Paper | Speed | Advantages | Drawbacks |
| Wakeley & Hey 1997 | Fast | Method of moments estimator; allows for recombination; multiple loci; can use genotype data | Summary of data; low accuracy; Ss & Sf > 0 required; model with no migration |
| Nielsen & Wakeley 2001 | Slow | Uses all the data; allows for uncertainty in nuisance parameters; more general model (allows for migration) | Uses one locus; recombination not allowed; haplotype data required |
| Hey & Nielsen 2004 | Slow | Same as Nielsen & Wakeley 2001; multiple loci | Recombination not allowed; haplotype data required |
| Leman et al. 2005 | Fast | Approximate Bayesian Computation approach; no need for Ss & Sf > 0; can use genotype data | Summary of data; uses one locus; recombination not allowed; model with no migration |
| Our method | Slow | Same as Leman et al. 2005; more general model; allows for recombination; multiple loci | Summary of data |

An example of polymorphism data at a locus in three sequences sampled from each of two populations. The horizontal lines represent aligned sequences; the colored squares, discs and ovals stand for segregating sites. We use the following summaries of the polymorphism data at each locus: the number of segregating sites specific to sample one (S1), specific to sample two (S2), shared between samples from both populations (Sshared) and fixed in either population sample (Sfixed).

[Figure: sequences 1, 2, 3 and a, b, c. In this example, S1 = 1, S2 = 2, Sshared = 1, Sfixed = 1.]
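The four summaries can be counted directly from aligned sequences. The sketch below is illustrative only: it assumes biallelic sites, and the function name and the six short sequences are hypothetical, chosen so that the counts match those shown in the figure.

```python
def summary_statistics(sample1, sample2):
    """Return (S1, S2, S_shared, S_fixed) for two lists of aligned sequences.

    Assumes every site carries at most two alleles (biallelic sites).
    """
    s1 = s2 = shared = fixed = 0
    n1 = len(sample1)
    for column in zip(*(sample1 + sample2)):
        alleles1 = set(column[:n1])   # alleles seen in sample one
        alleles2 = set(column[n1:])   # alleles seen in sample two
        poly1 = len(alleles1) > 1
        poly2 = len(alleles2) > 1
        if poly1 and poly2:
            shared += 1               # segregating in both samples
        elif poly1:
            s1 += 1                   # segregating only in sample one
        elif poly2:
            s2 += 1                   # segregating only in sample two
        elif alleles1 != alleles2:
            fixed += 1                # each sample fixed for a different allele
    return s1, s2, shared, fixed


# Hypothetical sequences 1, 2, 3 (population one) and a, b, c (population two).
pop1 = ["ACGTAC", "ACGCAC", "TCGTAC"]
pop2 = ["ACGTGC", "AGGTGC", "ACACGC"]
print(summary_statistics(pop1, pop2))  # (1, 2, 1, 1)
```

Because the function only compares the sets of alleles in each sample, it works on unphased genotype data as well as on haplotypes.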

Fig-2. Effects of divergence time and migration on polymorphism data

Examples of genealogical histories for three sequences sampled from each of two closely related populations, under different models. The patterns of polymorphism and divergence expected under each model are indicated below. For simplicity, we present a single genealogy, but for recombining loci, there may be many histories within a single region (i.e., there is an ancestral recombination graph, rather than a tree). The vertical branches represent ancestral lineages for the six sequences; they are colored according to whether a mutation would lead to a fixed, shared or unique polymorphism in the sample (see Figure 1). In c, gene flow occurred (yellow line); thus, sequence 3 was sampled in population one but its ancestor came from population two.

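$$ f(\Theta \mid D) \propto \int_G p(D \mid G, \Theta)\, p(G \mid \Theta)\, \pi(\Theta)\, dG $$

Here Θ is the set of parameters, f(Θ | D) is the posterior distribution, p(D | G, Θ) is calculated explicitly, p(G | Θ) is estimated from coalescent simulations, and π(Θ) denotes the prior distributions on the parameters.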

Future directions

Our current method is relatively slow when using data from multiple loci because it searches a huge space of possible histories and parameters. We would like to speed up the method and extend it to more complex models. To do so, we will need to account for two sources of variance: in the genealogies and in the parameters. We therefore plan to generate many genealogies for the same set of parameters, in order to improve the accuracy of our estimate of p(D | Θ), where Θ is the set of parameters, and to use Markov Chain Monte Carlo in order to better explore the parameter space.
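Averaging over several genealogies per parameter set corresponds to a standard Monte Carlo estimate of the likelihood. As a sketch, with K the number of simulated genealogies (a symbol introduced here for illustration) and Θ the set of parameters:

```latex
\hat{p}(D \mid \Theta) = \frac{1}{K} \sum_{k=1}^{K} p(D \mid G_k, \Theta),
\qquad G_k \sim p(G \mid \Theta)
```

For independent draws, the variance of this estimate shrinks in proportion to 1/K, so more genealogies per parameter set give a more stable weight at a higher computational cost.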

Abstract

Population divergence times are of interest in many contexts, from human genetics to conservation biology. These times can be estimated from polymorphism data. However, existing approaches make a number of assumptions (e.g., no recombination within loci or no migration since the split) that limit their applicability. To overcome these limitations, we developed an Approximate Bayesian Computation approach to estimate population parameters for a simple split model, allowing for migration as well as intralocus recombination. Application to simulated data suggests that the approach provides fairly accurate estimates of population sizes and divergence times and has high power to detect migration since the split. We illustrate the potential of the method by applying it to polymorphism data from five highly recombining loci surveyed in two closely related species of Lepidoptera (Papilio glaucus and P. canadensis).

Fig-1. Summary statistics used for estimation

a. A gene genealogy for a recent divergence time without migration

Excess of shared polymorphisms (occurring along the red branch) and few fixed sites (purple branch).

b. A gene genealogy for an old divergence time without migration

Few shared polymorphisms (none here) and an excess of fixed sites.

c. A gene genealogy for an old divergence time with migration

Excess of shared polymorphisms and few fixed sites.

[Panel labels: sequences 1, 2, 3 and a, b, c; divergence time T; population sizes N1, N2 and Na (panel a); migration M (panel c).]

Application to two Papilio species

Fig-3. Performance on a small simulated data set

Mean of the divergence time (a) and of the ratio of ancestral to current population size (b). The estimates are based on polymorphism data from ten simulated loci of 1 kb, generated with a sample size of 20 individuals from each population, population mutation rates θ1 = θ2 = θa = 0.001, T = 5 × 10^4 generations and M = 5. Each vertical line refers to a data set (Y-axis); the red line indicates the true value, and the X-axis range corresponds to the range of the prior distribution. As can be seen, the divergence times tend to be over-estimated, while the ancestral population size estimates are more accurate.

Fig-4. Ranges of P. glaucus and P. canadensis. A narrow hybrid zone forms where the ranges meet. Female mimetic morph of P. glaucus is shown with yellow morphs.

We applied our method to data from five highly recombining loci sampled in two species of Lepidoptera (Papilio glaucus and P. canadensis). These two species are known to exchange migrants and experience high levels of recombination. In order to examine the sensitivity to assumptions about migration, we compared the parameter estimates obtained in models with and without gene flow: the time of divergence appears to be under-estimated and ancestral population size over-estimated when migration is ignored (see Table 2).

Table 2 - Effect of model on estimation

| Estimator | Model | N1 | N2 | Na | T (in generations) | Migration rate* |
| Mean | Migration | 2.65E+05 | 1.80E+05 | 1.72E+05 | 4.38E+05 | 0.154 |
| Mean | No migration | 2.59E+05 | 1.75E+05 | 6.25E+05 | 1.46E+05 | - |
| Median | Migration | 2.55E+05 | 1.66E+05 | 1.51E+05 | 4.45E+05 | 0.148 |
| Median | No migration | 2.52E+05 | 1.62E+05 | 5.84E+05 | 1.43E+05 | - |

* The posterior probability of migration is > .999, while the prior probability is only .5. Thus, there is strong support for gene flow, as expected.
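One way to read the footnote is through the Bayes factor in favor of migration, that is, the ratio of posterior odds to prior odds. The following is a worked illustration that uses only the two probabilities quoted in the footnote; the poster itself does not report a Bayes factor.

```latex
\mathrm{BF}
= \frac{P(\text{migration} \mid D) \,/\, \bigl(1 - P(\text{migration} \mid D)\bigr)}
       {P(\text{migration}) \,/\, \bigl(1 - P(\text{migration})\bigr)}
> \frac{0.999 / 0.001}{0.5 / 0.5} = 999
```

With a prior probability of .5 the prior odds equal 1, so the Bayes factor coincides with the posterior odds.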

[Fig-3 plots: panel a, X-axis T from 0 to 6.E+05; panel b, X-axis Na/N1 from 0 to 6; Y-axis in both panels, data set index.]