Lecture 3: Allele Frequencies and Hardy-Weinberg sdifazio/popgen_12/lectures/aug27_HW......

  • date post

    01-Feb-2020

  • Lecture 3: Allele Frequencies and Hardy-Weinberg

    Equilibrium August 27, 2012

  • Last Time  Review of genetic variation and

    Mendelian Genetics

     Methods for detecting variation  Morphology  Allozymes  DNA Markers

     Anonymous  Sequence-tagged

  • Today  Sequence probability calculation

     Molecular markers: DNA sequencing

     Introduction to statistical distributions

     Estimating allele frequencies

     Introduction to Hardy-Weinberg Equilibrium

     Using Hardy-Weinberg: Estimating allele frequencies for dominant loci

  • If nucleotides occur randomly in a genome, which sequence should occur more

    frequently? AGTTCAGAGT

    AGTTCAGAGTAACTGATGCT

    What is the expected probability of each sequence to occur once?

    How many times would each sequence be expected to occur by chance in a 100 Mb

    genome?

  • AGTTCAGAGT

    What is the expected probability of each sequence to occur once?

    What is the sample space for the first position? {A, T, G, C}

    Probability of “A” at that position? $\frac{1}{4}$

    Probability of “A” at position 1, “G” at position 2, “T” at position 3, etc.?

    $\underbrace{\tfrac{1}{4} \times \tfrac{1}{4} \times \cdots \times \tfrac{1}{4}}_{10\ \text{positions}} = 0.25^{10} = 9.54 \times 10^{-7}$

    AGTTCAGAGTAACTGATGCT: $0.25^{20} = 9.09 \times 10^{-13}$

  • AGTTCAGAGT

    How many times would each sequence be expected to occur in a 100 Mb genome?

    $(9.54 \times 10^{-7})(10^{8}) = 95.4$

    AGTTCAGAGTAACTGATGCT

    $(9.09 \times 10^{-13})(10^{8}) = 9.1 \times 10^{-5}$

    Why is this calculation wrong?
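
    The arithmetic on the two preceding slides can be reproduced with a minimal Python sketch. It assumes, as the slides do, that all four bases are equally likely and independent, and it approximates the number of possible start positions by the genome length. The function names are illustrative only.

```python
def sequence_probability(seq: str, base_prob: float = 0.25) -> float:
    """Probability of one specific sequence, assuming every base is
    independent and occurs with probability base_prob."""
    return base_prob ** len(seq)


def expected_occurrences(seq: str, genome_size: int) -> float:
    """Expected count by chance: probability times genome length
    (the same approximation used on the slide)."""
    return sequence_probability(seq) * genome_size


GENOME_SIZE = 100_000_000  # 100 Mb

for seq in ("AGTTCAGAGT", "AGTTCAGAGTAACTGATGCT"):
    prob = sequence_probability(seq)
    count = expected_occurrences(seq, GENOME_SIZE)
    print(f"{seq}: probability = {prob:.3g}, expected count = {count:.3g}")
```

    The 10-mer is expected roughly 95 times, while the 20-mer is expected far less than once, so the shorter sequence should occur more frequently.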

  • $P(A \cap B) = P(A \mid B)\,P(B)$, $\quad P(A \cup B) = P(A) + P(B) - P(A \cap B)$

    [Venn diagram: events A and B]

    AGTTCAGAGTAACTGATGCT → AGT TCA GAG TAA CTG ATG CT

    UCA AGU CUC AUU GAC UAC GA → Ser Ser Leu Ile Asp Tyr

    UGA AGU CUC AUU GAC UAG GA → Stop Ser Leu Ile Asp Stop

  • DNA Sequencing   Direct determination of

    sequence of bases at a location in the genome

      Shotgun versus PCR sequencing

      Dye terminators (Sanger) and capillaries revolutionized DNA sequencing

      Modern sequencing methods (sequencing by synthesis, pyrosequencing) have catapulted sequencing into realm of population genetics

      Human genome took 10 years to sequence originally, and hundreds of millions of dollars

      Now we can do it in a week for

  • SNPs   A Single Nucleotide Polymorphism

    (SNP) is a single base mutation in DNA.

      The most common source of genetic polymorphism (e.g., 90% of all human DNA polymorphisms).

      Identify SNPs by screening a sample of individuals from the study population: usually 16 to 48

      Once identified, SNPs are assayed in populations using high-throughput methods

  • Genotyping by Sequencing   New sequencing methods generate tens of millions of short sequences per run

      Combine restriction digests with sequencing and pooling to genotype thousands of markers covering genome at very high density

    http://www.maizegenetics.net/images/stories/GBS_CSSA_101102sem.pdf

    Generate 10’s of thousands of markers for

  • Genotyping by Sequencing Cost Example

    http://www.maizegenetics.net/gbs-overview

  • Statistical Distributions: Normal Distribution

      Many types of estimates follow normal distribution

      Can be visualized as a frequency distribution (histogram)   Can interpret as a probability density function

    Variance ($V_x$): A measure of the dispersion around the mean:

    $V_x = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

    Expected Value (Mean): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

    where n is the number of samples

    Standard Deviation (sd): A measure of dispersion around the mean that is on the same scale as the mean

    $sd = \sqrt{V_x}$

    [Figure: normal distribution with 1 sd and 2 sd marked]
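
    The mean, variance, and standard deviation defined on this slide can be computed directly. The sketch below is illustrative only: the data values are hypothetical and the function names are not from the lecture. It uses the n − 1 denominator for the variance and checks the result against Python's standard `statistics` module.

```python
import math
import statistics


def sample_mean(xs):
    """Mean: sum of the observations divided by n."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Variance: squared deviations from the mean, divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def sample_sd(xs):
    """Standard deviation: square root of the variance."""
    return math.sqrt(sample_variance(xs))


# Hypothetical measurements, chosen only for illustration
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("mean:", sample_mean(data))
print("variance:", sample_variance(data))
print("sd:", sample_sd(data))

# The standard library uses the same n - 1 definition
assert math.isclose(sample_variance(data), statistics.variance(data))
```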

  • Standard Error of Mean

      Standard Deviation is a measure of how individual points differ from the mean estimates in a single sample

      Standard Error is a measure of how much the estimate differs from the true parameter value (in the case of means, µ)

      If you repeated the experiment, how close would you expect the mean estimate to be to your previous estimate?

    Standard Error of the Mean (se): $se = \sqrt{\frac{V_x}{n}}$

    95% Confidence Interval: $\bar{x} \pm 1.96\,(se)$
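
    As a rough illustration of the standard error and the 95% confidence interval, the sketch below applies both formulas to a small hypothetical sample. The data and function names are assumptions made for the example, and the 1.96 multiplier is the normal approximation used on the slide.

```python
import math
import statistics


def standard_error(xs):
    """Standard error of the mean: sqrt(V_x / n), with V_x the sample variance."""
    return math.sqrt(statistics.variance(xs) / len(xs))


def confidence_interval_95(xs):
    """Approximate 95% confidence interval: mean +/- 1.96 * se."""
    mean = statistics.mean(xs)
    se = standard_error(xs)
    return mean - 1.96 * se, mean + 1.96 * se


# Hypothetical measurements, chosen only for illustration
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

low, high = confidence_interval_95(data)
print(f"se = {standard_error(data):.4f}")
print(f"95% CI = ({low:.3f}, {high:.3f})")
```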

  • Estimating Allele Frequencies, Codominant Loci   Measured allele frequency is the maximum likelihood estimator

    of the true frequency of the allele in the population (see Hedrick, pp 82-83 for derivation)

    $p = \frac{N_{11} + \frac{1}{2} N_{12}}{N}$

      Expected number of observations of allele A1: E(Y)=np

      Where n is number of samples   For diploid organisms, n = 2N , where N is number of

    individuals sampled

      Expected number of observations of allele A1 is analogous to the mean of a sample from a normal distribution

      Allele frequency can also be interpreted as an estimate of the mean

  • Allele Frequency Example

      Assume a population of Mountain Laurel (Kalmia latifolia) at Cooper’s Rock, WV

    Red buds (A1A1): 5000   Pink buds (A1A2): 3000   White buds (A2A2): 2000

      Phenotype is determined by a single, codominant locus: Anthocyanin

      What is the frequency of “red” alleles (A1), and “white” alleles (A2)?

    Frequency of A1 = p

    $p = \frac{N_{11} + \frac{1}{2} N_{12}}{N} = \frac{2N_{11} + N_{12}}{2N}$

    Frequency of A2 = q

    $q = \frac{N_{22} + \frac{1}{2} N_{12}}{N} = \frac{2N_{22} + N_{12}}{2N}$
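
    The allele-counting formulas can be applied to the Cooper's Rock genotype counts with a short sketch. The function name is illustrative; the counts are the ones given on the slide (5000 red, 3000 pink, 2000 white).

```python
def allele_frequencies(n11: int, n12: int, n22: int):
    """Return (p, q) for a two-allele codominant locus.

    Each homozygote carries two copies of its allele and each
    heterozygote carries one copy of each, out of 2N allele copies.
    """
    n = n11 + n12 + n22  # number of individuals, N
    p = (2 * n11 + n12) / (2 * n)
    q = (2 * n22 + n12) / (2 * n)
    return p, q


# Red (A1A1), pink (A1A2), and white (A2A2) counts from the slide
p, q = allele_frequencies(5000, 3000, 2000)
print("p =", p)  # 0.65
print("q =", q)  # 0.35
```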

  • Allele Frequencies are Distributed as Binomials

      Binomials are variables that can be interpreted as the number of successes and failures in a series of trials

      Based on samples from a population

      For two-allele system, each sample is like a “trial”   Does the individual contain Allele A1?   Remember, q=1-p, so only one parameter is estimated

    $P(Y = y) = \binom{n}{y}\, s^{y} f^{\,n-y}$,

    where s is the probability of a success, and f is the probability of a failure

    $\binom{n}{y}$: Number of ways of observing y positive results in n trials

    $s^{y} f^{\,n-y}$: Probability of observing y positive results in n trials once

    $\binom{n}{y} = C^{n}_{y} = \frac{n!}{y!\,(n-y)!}$

  • Given the allele frequencies that you calculated earlier for Cooper’s Rock

    Kalmia latifolia, what is the probability of observing two “white” alleles in a

    sample of two plants?
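
    One way to approach this question is to evaluate the binomial formula directly. The sketch below is a hedged illustration, not the lecture's worked answer: it assumes the frequency of the “white” allele is q = 0.35 (the value that follows from the bud counts given earlier), and it shows two possible readings of the question, since the slide does not say whether the n trials are two allele draws or the four allele copies carried by two diploid plants.

```python
from math import comb


def binomial_pmf(y: int, n: int, s: float) -> float:
    """P(Y = y): probability of exactly y successes in n trials,
    with success probability s and failure probability f = 1 - s."""
    f = 1.0 - s
    return comb(n, y) * s ** y * f ** (n - y)


q = 0.35  # assumed frequency of the "white" allele (A2)

# Reading 1 (assumption): two allele draws, both white
print("n = 2, y = 2:", binomial_pmf(2, 2, q))

# Reading 2 (assumption): two diploid plants = four allele copies,
# exactly two of them white
print("n = 4, y = 2:", binomial_pmf(2, 4, q))
```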

  • Variation in Allele Frequencies, Codominant Loci   Binomial variance is pq or p(1-p)

      Variance in number of observations of A1: V(Y) = np(1-p)

      Variance in allele frequency estimates (codominant, diploid):

    $V_p = \frac{p(1-p)}{2N}$

      Standard Error of allele frequency estimates:

    $SE_p = \sqrt{\frac{p(1-p)}{2N}}$

      Notice that estimates get better as sample size increases

      Notice also that variance is maximum at intermediate allele frequencies
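
    Both observations can be checked numerically. The sketch below is illustrative: the function names are not from the lecture, and the example values (p = 0.65, N = 10000) are taken from the Cooper's Rock bud counts.

```python
import math


def allele_freq_variance(p: float, n_individuals: int) -> float:
    """Variance of an allele frequency estimate at a codominant,
    diploid locus: p(1 - p) / (2N)."""
    return p * (1.0 - p) / (2 * n_individuals)


def allele_freq_se(p: float, n_individuals: int) -> float:
    """Standard error: square root of the variance."""
    return math.sqrt(allele_freq_variance(p, n_individuals))


# Cooper's Rock example: p = 0.65 estimated from N = 10000 plants
print("SE with N = 10000:", allele_freq_se(0.65, 10_000))

# A smaller sample gives a larger standard error
print("SE with N = 100:  ", allele_freq_se(0.65, 100))

# For a fixed N, the variance is largest at p = 0.5
for p in (0.125, 0.5, 0.875):
    print(f"p = {p}: variance = {allele_freq_variance(p, 100):.6f}")
```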

  • Maximum variance as a function of allele frequency for a codominant locus

    [Figure: p(1 - p) on the y-axis (0 to 0.3) plotted against p on the x-axis (0 to 1)]

  • Why is variance highest at intermediate allele frequencies?

    p = 0.5

    If this were a target, how variable would your outcome be in each case (red versus white hits)?

    Variance is constrained when value approaches limits (0 or 1)

    p = 0.125
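
    A short worked calculation, added here as an illustration rather than taken from the slides, shows where the per-allele variance p(1 - p) peaks and compares the two values of p on this slide:

```latex
\begin{align*}
\frac{d}{dp}\bigl[p(1-p)\bigr] &= 1 - 2p = 0 \;\Rightarrow\; p = \tfrac{1}{2},
  \qquad \frac{d^{2}}{dp^{2}}\bigl[p(1-p)\bigr] = -2 < 0 \\
p = 0.5 &: \quad p(1-p) = 0.5 \times 0.5 = 0.25 \\
p = 0.125 &: \quad p(1-p) = 0.125 \times 0.875 = 0.109375
\end{align*}
```

    The second derivative is negative, so p = 0.5 is a maximum, and the variance at p = 0.125 is less than half of that maximum.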

  • What if there are more than 2 alleles?   General formula for calculating allele frequencies in a multiallelic system with codominant alleles:

    $p_i = \frac{N_{ii} + \frac{1}{2} \sum_{j=1}^{n} N_{ij}}{N}, \quad i \neq j$

      Variance and Standard Error of allele frequency estimates remain:

    $V_{p_i} = \frac{p_i (1 - p_i)}{2N}$

    $SE_{p_i} = \sqrt{\frac{p_i (1 - p_i)}{2N}}$
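
    The multiallelic formula amounts to counting allele copies: each homozygote contributes two copies of its allele and each heterozygote contributes one copy of each. A minimal sketch follows, with hypothetical genotype counts for three alleles and an illustrative function name.

```python
from collections import defaultdict


def multiallelic_frequencies(genotype_counts):
    """Allele frequencies from genotype counts at a codominant locus.

    genotype_counts maps a pair of allele names, e.g. ("A1", "A2"),
    to the number of individuals with that genotype.
    """
    n_individuals = sum(genotype_counts.values())
    allele_copies = defaultdict(int)
    for (allele_a, allele_b), count in genotype_counts.items():
        # A homozygote adds its count twice to the same allele
        allele_copies[allele_a] += count
        allele_copies[allele_b] += count
    return {
        allele: copies / (2 * n_individuals)
        for allele, copies in allele_copies.items()
    }


# Hypothetical counts for a three-allele locus (not from the lecture)
counts = {
    ("A1", "A1"): 10, ("A1", "A2"): 20, ("A1", "A3"): 10,
    ("A2", "A2"): 30, ("A2", "A3"): 20, ("A3", "A3"): 10,
}

freqs = multiallelic_frequencies(counts)
print(freqs)
```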

  • How do we estimate allele frequencies for dominant loci?

    [Figure: gel diagrams (- to +) comparing banding patterns of genotypes A1A1, A1A2, and A2A2 at a codominant locus and at a dominant locus]

  • Hardy-Weinberg Law  After one generation of random mating,

    single-locus genotype frequencies can be represented by a binomial (with 2 alleles) or a multinomial function of allele frequencies

    $(p + q)^2 = p^2 + 2pq + q^2$

    Frequency of A1A1 (P) = $p^2$   Frequency of A1A2 (H) = $2pq$   Frequency of A2A2 (Q) = $q^2$
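
    The expected genotype frequencies follow directly from the binomial expansion. The sketch below is illustrative only; the function name is an assumption, and p = 0.65 is the Mountain Laurel allele frequency used in this lecture.

```python
def hardy_weinberg(p: float):
    """Expected genotype frequencies (P, H, Q) for a two-allele locus
    after one generation of random mating."""
    q = 1.0 - p
    freq_a1a1 = p * p      # P = p^2
    freq_a1a2 = 2 * p * q  # H = 2pq
    freq_a2a2 = q * q      # Q = q^2
    return freq_a1a1, freq_a1a2, freq_a2a2


P, H, Q = hardy_weinberg(0.65)
print(f"P = {P:.4f}, H = {H:.4f}, Q = {Q:.4f}")

# The three genotype frequencies always sum to 1
assert abs(P + H + Q - 1.0) < 1e-12
```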

  • How does Hardy-Weinberg Work?   Reproduction is a sampling process

      Example: Mountain Laurel at Cooper’s Rock

    Red Flowers (A1A1): 5000   Pink Flowers (A1A2): 3000   White Flowers (A2A2): 2000

    Frequency of A1 = p = 0.65   Frequency of A2 = q = 0.35

    : A2=14 : A1