Modeling Molecular Substitution

Modeling Molecular Substitution Von Bing Yap Statistics Department, UC Berkeley [email protected]


Page 1: Modeling Molecular  Substitution

Modeling Molecular Substitution

Von Bing Yap

Statistics Department, UC Berkeley

[email protected]

Page 2: Modeling Molecular  Substitution

Variations on a theme

Human genome project furnishes sequences from a few individuals, but is relevant to all of us: we are very similar.

We are also slightly different: polymorphisms.

Species variations.

Page 3: Modeling Molecular  Substitution

How did variations arise?

Mutation: (a) Inherent: DNA replication errors are not always corrected. (b) External: exposure to chemicals and radiation.

Selection: Deleterious mutations are removed quickly. Neutral and, rarely, advantageous mutations are tolerated and persist …

Fixation: It takes time for a new variant to be established (having a stable frequency) in a population.

Page 4: Modeling Molecular  Substitution

Why study molecular substitution?

(Relatively) early days … to test hypotheses from population genetics. Do rodents evolve faster than primates (the generation-time hypothesis)? Which sites in a protein are under strong selective pressure?

More recently (from the 1970s) … to derive better substitution matrices, like the PAM and BLOSUM series, for sequence alignment and for building phylogenetic trees.

Page 5: Modeling Molecular  Substitution

Modeling DNA base substitution

Strictly speaking, only applicable to regions undergoing little selection.

Assumptions

1. Site independence.
2. Site homogeneity.
3. Markovian: given the current base, future substitutions are independent of the past.
4. Temporal homogeneity: stationary Markov chain.

Page 6: Modeling Molecular  Substitution

Markov chain on {A,C,G,T}

A stationary Markov chain is a family of transition probability matrices $P(t)$, $t \ge 0$, satisfying

$P(t)P(s) = P(t+s)$, $t, s \ge 0$.

$P(t,a,b) = \Pr(b \text{ at time } s+t \mid a \text{ at time } s)$.

Let $Q$ be a rate matrix, i.e., it has positive off-diagonal entries and each row sum is 0. E.g., $Q(1,2)$ is the instantaneous rate of A going to C. $Q$ defines an MC by

$P(t) = \exp(Qt) = I + Qt + Q^2 t^2/2! + \dots$
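As a rough illustration of the series definition, the sketch below sums the first terms of exp(Qt) in plain Python and checks the semigroup property P(t)P(s) = P(t+s). The rate matrix, the function names, and the number of terms are assumptions made for the example, not values from the talk.

```python
# Illustrative sketch: P(t) = exp(Qt) by a truncated power series.

def mat_mul(X, Y):
    """Plain matrix product of two square lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(Q, t, terms=60):
    """Approximate exp(Qt) = I + Qt + (Qt)^2/2! + ... using `terms` terms."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]            # holds (Qt)^k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, Qt)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result

# Hypothetical rate matrix on (A, C, G, T): positive off-diagonals, rows sum to 0.
Q = [[-0.9, 0.2, 0.5, 0.2],
     [0.3, -1.0, 0.1, 0.6],
     [0.6, 0.1, -1.0, 0.3],
     [0.2, 0.5, 0.2, -0.9]]

P_03 = transition_matrix(Q, 0.3)
P_05 = transition_matrix(Q, 0.5)
P_08 = transition_matrix(Q, 0.8)

# Semigroup check: P(0.3) P(0.5) should agree with P(0.8).
product = mat_mul(P_03, P_05)
max_gap = max(abs(product[i][j] - P_08[i][j]) for i in range(4) for j in range(4))
row_sums = [sum(row) for row in P_08]            # each row is a probability vector
```

A truncated series is adequate only when the entries of Qt are small, as here; for long times a library matrix exponential would be the safer choice.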

Page 7: Modeling Molecular  Substitution

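Jukes-Cantor model (1969)

Let

$$Q = \begin{pmatrix} -3 & 1 & 1 & 1 \\ 1 & -3 & 1 & 1 \\ 1 & 1 & -3 & 1 \\ 1 & 1 & 1 & -3 \end{pmatrix}.$$

Then

$$P(t) = \begin{pmatrix} r & s & s & s \\ s & r & s & s \\ s & s & r & s \\ s & s & s & r \end{pmatrix},$$

where $r = (1 + 3e^{-4t})/4$ and $s = (1 - e^{-4t})/4$.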
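The closed form can be checked numerically. The sketch below (function and variable names are illustrative) builds P(t) from r and s, recovers the rate matrix from a finite difference at a small time step, and looks at the long-time limit.

```python
import math

def jc_transition(t):
    """Closed-form P(t) for the rate matrix with -3 on the diagonal and 1 elsewhere."""
    decay = math.exp(-4 * t)
    r = (1 + 3 * decay) / 4      # probability of seeing the same base after time t
    s = (1 - decay) / 4          # probability of seeing one specific other base
    return [[r if i == j else s for j in range(4)] for i in range(4)]

# Finite-difference estimate of the rate matrix: (P(h) - I) / h for small h.
h = 1e-6
P_h = jc_transition(h)
Q_estimate = [[(P_h[i][j] - (1.0 if i == j else 0.0)) / h for j in range(4)]
              for i in range(4)]

# After a long time every entry approaches 1/4.
P_long = jc_transition(10.0)
row_sum = sum(jc_transition(0.2)[0])
```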

Page 8: Modeling Molecular  Substitution

The stationary distribution

A probability distribution $\pi$ on {A,C,G,T} is a stationary distribution if

$\sum_a \pi(a) P(t,a,b) = \pi(b)$, for all $b$, $t \ge 0$,

or $\pi P(t) = \pi$, $t \ge 0$, or $\pi Q = 0$ (global balance).

Facts: The MC built from our $Q$ has a unique stationary distribution. Let $X(t)$ be the base at time $t$. If $X(0) \sim \pi$, then $X(t) \sim \pi$. Given any initial distribution, the distribution of $X(t)$ converges to $\pi$ as $t \to \infty$.

For the JC model, $\pi$ is the uniform distribution.
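One way to see these facts numerically is to iterate a discrete-time chain that shares the stationary distribution of Q. The sketch below uses the step matrix I + Q/λ, which is a swapped-in technique (uniformization with power iteration), not something stated on the slides; the non-JC rate matrix is hypothetical.

```python
def stationary_distribution(Q, start, iterations=2000):
    """Iterate dist <- dist (I + Q/lam); the fixed point satisfies dist Q = 0."""
    n = len(Q)
    lam = 2 * max(-Q[i][i] for i in range(n))    # keeps every step entry positive
    step = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
            for i in range(n)]
    dist = list(start)
    for _ in range(iterations):
        dist = [sum(dist[i] * step[i][j] for i in range(n)) for j in range(n)]
    return dist

# Hypothetical rate matrix on (A, C, G, T).
Q = [[-0.9, 0.2, 0.5, 0.2],
     [0.3, -1.0, 0.1, 0.6],
     [0.6, 0.1, -1.0, 0.3],
     [0.2, 0.5, 0.2, -0.9]]

# Two different starting bases lead to the same limit.
pi_from_A = stationary_distribution(Q, [1.0, 0.0, 0.0, 0.0])
pi_from_G = stationary_distribution(Q, [0.0, 0.0, 1.0, 0.0])

# Global balance: every component of pi Q should vanish.
residual = max(abs(sum(pi_from_A[a] * Q[a][b] for a in range(4)))
               for b in range(4))

# Jukes-Cantor rate matrix: the limit is uniform.
jc = [[-3.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
pi_jc = stationary_distribution(jc, [1.0, 0.0, 0.0, 0.0])
```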

Page 9: Modeling Molecular  Substitution

A pair of homologous bases

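Typically, the ancestor is unknown.

[Figure: a tree with the unknown ancestor at the root and the observed bases A and C at the leaves. The branch to A evolves under rate matrix $Q_h$, the branch to C under $Q_m$; each branch spans T years.]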

Page 10: Modeling Molecular  Substitution

More assumptions

5. $Q_h = s_h Q$ and $Q_m = s_m Q$, for some positive $s_h$ and $s_m$, and some rate matrix $Q$.

6. The ancestor is sampled from the stationary distribution $\pi$ of $Q$.

7. $Q$ is reversible:

$\pi(a) P(t,a,b) = \pi(b) P(t,b,a)$, for all $a, b$, $t \ge 0$ (detailed balance).
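Detailed balance can be tested directly on the rate matrix, since π(a)Q(a,b) = π(b)Q(b,a) for all a, b is equivalent to the condition on P(t). The sketch below contrasts the Jukes-Cantor matrix with a hypothetical cyclic matrix that has the same uniform stationary distribution but is not reversible; all names are illustrative.

```python
def satisfies_global_balance(Q, pi, tol=1e-12):
    """pi Q = 0: pi is stationary."""
    n = len(Q)
    return all(abs(sum(pi[a] * Q[a][b] for a in range(n))) <= tol
               for b in range(n))

def satisfies_detailed_balance(Q, pi, tol=1e-12):
    """pi(a) Q(a,b) = pi(b) Q(b,a) for every pair: the chain is reversible."""
    n = len(Q)
    return all(abs(pi[a] * Q[a][b] - pi[b] * Q[b][a]) <= tol
               for a in range(n) for b in range(n))

uniform = [0.25] * 4

# Jukes-Cantor rate matrix.
jc = [[-3.0 if i == j else 1.0 for j in range(4)] for i in range(4)]

# Hypothetical matrix favouring the cycle A -> C -> G -> T -> A.
cyclic = [[-3.0, 2.0, 0.5, 0.5],
          [0.5, -3.0, 2.0, 0.5],
          [0.5, 0.5, -3.0, 2.0],
          [2.0, 0.5, 0.5, -3.0]]

jc_checks = (satisfies_global_balance(jc, uniform),
             satisfies_detailed_balance(jc, uniform))
cyclic_checks = (satisfies_global_balance(cyclic, uniform),
                 satisfies_detailed_balance(cyclic, uniform))
```

The cyclic example shows that stationarity alone does not imply reversibility, which is why assumption 7 is listed separately from assumption 6.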

Page 11: Modeling Molecular  Substitution

New picture

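[Figure: the same tree, now with the ancestor drawn from $\pi$ and both branches governed by the same $Q$. The branch to A has length $s_h T$ PAMs and the branch to C has length $s_m T$ PAMs.]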

Page 12: Modeling Molecular  Substitution

Joint probability of A and C

Under the model, the joint probability is

$$\sum_a \pi(a) P(s_hT,a,A) P(s_mT,a,C) = \sum_a \pi(A) P(s_hT,A,a) P(s_mT,a,C) = \pi(A) P(s_hT + s_mT, A, C) = F(t,A,C).$$

The second step uses detailed balance, and the third uses $P(t)P(s) = P(t+s)$.

The matrix $F(t)$ is symmetric. It is equally valid to view A as the ancestor of C or vice versa.

$t = s_hT + s_mT$ is the “distance” between A and C. Note: $Q$ and $t$ are identifiable from $F(t)$, but $s_h$, $s_m$ and $T$ are not.
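The identity can be checked numerically for the Jukes-Cantor model, where P(t) has a closed form and π is uniform. In the sketch below the branch lengths are made up; two different splits of the same total distance give the same joint matrix, which illustrates why only the sum is identifiable.

```python
import math

PI = [0.25] * 4   # uniform stationary distribution of the JC model

def jc_transition(t):
    decay = math.exp(-4 * t)
    r, s = (1 + 3 * decay) / 4, (1 - decay) / 4
    return [[r if i == j else s for j in range(4)] for i in range(4)]

def joint_via_ancestor(t_h, t_m):
    """Sum over the unknown ancestor a of pi(a) P(t_h,a,x) P(t_m,a,y)."""
    P_h, P_m = jc_transition(t_h), jc_transition(t_m)
    return [[sum(PI[a] * P_h[a][x] * P_m[a][y] for a in range(4))
             for y in range(4)] for x in range(4)]

def joint_direct(t):
    """F(t,x,y) = pi(x) P(t,x,y) with t the total distance."""
    P = jc_transition(t)
    return [[PI[x] * P[x][y] for y in range(4)] for x in range(4)]

def max_gap(X, Y):
    return max(abs(X[i][j] - Y[i][j]) for i in range(4) for j in range(4))

F = joint_direct(0.35)
gap_split_1 = max_gap(joint_via_ancestor(0.10, 0.25), F)   # hypothetical split
gap_split_2 = max_gap(joint_via_ancestor(0.30, 0.05), F)   # a different split
total = sum(sum(row) for row in F)                          # F is a joint distribution
```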

Page 13: Modeling Molecular  Substitution

Joint probability of a pair of homologous sequences

$$\Pr(a_1 \dots a_n, b_1 \dots b_n) = \prod_k F(t,a_k,b_k) = \prod_{a,b} F(t,a,b)^{c(a,b)},$$

where $c(a,b) = \#\{k : a_k = a,\ b_k = b\}$.
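The factorization means the likelihood depends on the alignment only through the 4 × 4 count matrix. A minimal sketch, assuming the Jukes-Cantor form of F(t) and two short made-up sequences:

```python
import math
from collections import Counter

BASES = "ACGT"

def jc_joint(t):
    """F(t,a,b) = pi(a) P(t,a,b) under Jukes-Cantor, as a dict keyed by (a, b)."""
    decay = math.exp(-4 * t)
    r, s = (1 + 3 * decay) / 4, (1 - decay) / 4
    return {(a, b): 0.25 * (r if a == b else s) for a in BASES for b in BASES}

def pair_counts(seq_1, seq_2):
    """c(a,b): number of aligned positions showing a in seq_1 and b in seq_2."""
    return Counter(zip(seq_1, seq_2))

def log_likelihood(counts, F):
    """Sum over (a,b) of c(a,b) log F(t,a,b)."""
    return sum(n * math.log(F[pair]) for pair, n in counts.items())

# Hypothetical gap-free alignment of two homologous sequences.
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTTCGAAC"

F = jc_joint(0.1)
counts = pair_counts(seq_1, seq_2)
ll_from_counts = log_likelihood(counts, F)
ll_site_by_site = sum(math.log(F[(a, b)]) for a, b in zip(seq_1, seq_2))
```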

Page 14: Modeling Molecular  Substitution

The choice of Q

By convention (Dayhoff 1978), $Q$ is chosen to satisfy

$$-\sum_a \pi(a) Q(a,a) = 0.01,$$

i.e., the expected number of substitutions in 1 time unit is 0.01 per site. This new time scale is called evolutionary time, measured in PAM (Point Accepted Mutation).
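The normalization amounts to rescaling Q by a constant. A small sketch, using the Jukes-Cantor matrix as the input (function and variable names are illustrative):

```python
def substitution_rate(Q, pi):
    """Expected substitutions per site per unit time: -sum_a pi(a) Q(a,a)."""
    return -sum(pi[a] * Q[a][a] for a in range(len(Q)))

def pam_normalize(Q, pi, target=0.01):
    """Rescale Q so that one time unit carries `target` expected substitutions per site."""
    scale = target / substitution_rate(Q, pi)
    return [[q * scale for q in row] for row in Q]

jc = [[-3.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
uniform = [0.25] * 4

rate_before = substitution_rate(jc, uniform)      # 3 substitutions per unit time
jc_pam = pam_normalize(jc, uniform)
rate_after = substitution_rate(jc_pam, uniform)
```

Rescaling Q leaves the stationary distribution unchanged, so the same π can be used before and after.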

Page 15: Modeling Molecular  Substitution

M pairs of homologous sequences

The $k$th pair of sequences, separated by $t_k$ PAMs, gives the count matrix $c(k)$ with entries $c(k,a,b)$. Assume that the pairs are independent and evolved under the same MC defined by $Q$, over pair-specific distances $t_k$. Then

$$\Pr(\text{all pairs}) = \prod_k \prod_{a,b} F(t_k,a,b)^{c(k,a,b)}.$$

The pair-specific distances allow for different rates of substitution across pairs.

Parameters: $Q, t_1, t_2, \dots, t_M$. Maximum likelihood estimation by numerical methods.
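A much simplified sketch of the numerical step: here Q is held fixed at the Jukes-Cantor matrix (an assumption; the talk estimates Q as well), so each t_k can be maximized on its own. A golden-section search, chosen here only for illustration, is compared with the known closed form for this special case. The counts are made up, and t is in the time units of that Q, not in PAMs.

```python
import math

def jc_log_likelihood(t, n_same, n_diff):
    """Log-likelihood of one pair under JC, given matching and mismatching site counts."""
    decay = math.exp(-4 * t)
    r, s = (1 + 3 * decay) / 4, (1 - decay) / 4
    # F(t,a,b) = pi(a) P(t,a,b) with pi uniform.
    return n_same * math.log(r / 4) + n_diff * math.log(s / 4)

def golden_section_max(f, lo, hi, tol=1e-9):
    """Maximize a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

def closed_form_distance(n_same, n_diff):
    """JC maximizer: the mismatch probability 3s(t) equals the observed fraction."""
    p = n_diff / (n_same + n_diff)
    return -math.log(1 - 4 * p / 3) / 4

# Hypothetical (n_same, n_diff) counts for three sequence pairs.
pairs = [(90, 10), (70, 30), (50, 50)]
estimates = [golden_section_max(lambda t, c=c: jc_log_likelihood(t, *c), 1e-9, 5.0)
             for c in pairs]
exact = [closed_form_distance(*c) for c in pairs]
```

With a general Q no closed form is available, and the rate parameters and distances would have to be optimized jointly.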

Page 16: Modeling Molecular  Substitution

Previous work

This model is general enough to include almost all models in use as special cases.

For DNA bases, there are the JC, Kimura and Hasegawa-Kishino-Yano models.

For amino acids, the PAM (Dayhoff 1978) and BLOSUM (Henikoff 1993) substitution matrices are derived based on the model. Müller and Vingron (2000) present an interesting method of estimation.

For codons, Z. Yang and collaborators use constrained versions to estimate the ratio of synonymous to nonsynonymous substitutions, among other things.

Page 17: Modeling Molecular  Substitution

A formula for a reversible Q

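$$Q = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & \cdot & \alpha\,\pi(2) & \beta\,\pi(3) & \gamma\,\pi(4) \\ C & \alpha\,\pi(1) & \cdot & \delta\,\pi(3) & \epsilon\,\pi(4) \\ G & \beta\,\pi(1) & \delta\,\pi(2) & \cdot & \zeta\,\pi(4) \\ T & \gamma\,\pi(1) & \epsilon\,\pi(2) & \zeta\,\pi(3) & \cdot \end{array}$$

$\pi$: stationary distribution of $Q$; $\alpha, \dots, \zeta$: positive constants, one for each unordered pair of bases.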
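A reversible Q has off-diagonal entries of the form (constant) × π(b), with the same constant for a → b and b → a. The sketch below builds such a matrix from a stationary distribution and six pair constants, then confirms detailed and global balance. All numerical values and names are hypothetical.

```python
BASES = "ACGT"

def reversible_rate_matrix(pi, coeff):
    """Q(a,b) = coeff[{a,b}] * pi(b) for a != b; the diagonal makes rows sum to 0."""
    Q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                key = "".join(sorted(a + b))     # same key for (a,b) and (b,a)
                Q[a][b] = coeff[key] * pi[b]
        Q[a][a] = -sum(Q[a].values())            # only off-diagonals are present so far
    return Q

# Hypothetical stationary distribution and pair constants.
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
coeff = {"AC": 1.0, "AG": 3.0, "AT": 0.8, "CG": 0.9, "CT": 3.2, "GT": 1.1}

Q = reversible_rate_matrix(pi, coeff)

detailed_gap = max(abs(pi[a] * Q[a][b] - pi[b] * Q[b][a])
                   for a in BASES for b in BASES)
global_gap = max(abs(sum(pi[a] * Q[a][b] for a in BASES)) for b in BASES)
row_gap = max(abs(sum(Q[a].values())) for a in BASES)
```

Counting free parameters: three for π (it sums to 1) and six pair constants, with one degree of freedom absorbed by the choice of time scale.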

Page 18: Modeling Molecular  Substitution

Strand symmetry

It seems plausible that the substitution process in noncoding regions is strand-symmetric, i.e., Q(A,C) = Q(T,G), Q(C,C) = Q(G,G), etc.

From the point of view of sequence alignment, SS implies that the same substitution matrix can be used for aligning either strand.

Strand symmetry (or the lack of it) is well studied in bacteria (Francino, Ochman).

Page 19: Modeling Molecular  Substitution

A formula for a reversible SS Q

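$$Q = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & \cdot & \alpha\,\pi(2) & \beta\,\pi(2) & \gamma\,\pi(1) \\ C & \alpha\,\pi(1) & \cdot & \delta\,\pi(2) & \beta\,\pi(1) \\ G & \beta\,\pi(1) & \delta\,\pi(2) & \cdot & \alpha\,\pi(1) \\ T & \gamma\,\pi(1) & \beta\,\pi(2) & \alpha\,\pi(2) & \cdot \end{array}$$

$\pi$: stationary distribution, with $\pi(A) = \pi(T) = \pi(1)$, $\pi(C) = \pi(G) = \pi(2)$ and $2(\pi(1) + \pi(2)) = 1$. $\alpha, \dots, \delta$: positive constants.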
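Under strand symmetry the stationary distribution has only two distinct values and the pair constants collapse from six to four. A sketch under those assumptions, with made-up numbers and illustrative names, checking both strand symmetry and reversibility:

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def strand_symmetric_reversible_q(pi_at, pi_cg, c_ac, c_ag, c_at, c_cg):
    """Reversible Q with pi(A)=pi(T), pi(C)=pi(G) and complement-matched constants."""
    pi = {"A": pi_at, "T": pi_at, "C": pi_cg, "G": pi_cg}
    # AC pairs with GT, and AG pairs with CT, under A<->T, C<->G.
    coeff = {"AC": c_ac, "GT": c_ac, "AG": c_ag, "CT": c_ag,
             "AT": c_at, "CG": c_cg}
    Q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                Q[a][b] = coeff["".join(sorted(a + b))] * pi[b]
        Q[a][a] = -sum(Q[a].values())
    return Q, pi

# Hypothetical values; note 2 * (0.3 + 0.2) = 1.
Q, pi = strand_symmetric_reversible_q(0.3, 0.2, c_ac=1.0, c_ag=3.0, c_at=0.8, c_cg=0.9)

strand_gap = max(abs(Q[a][b] - Q[COMPLEMENT[a]][COMPLEMENT[b]])
                 for a in BASES for b in BASES)
detailed_gap = max(abs(pi[a] * Q[a][b] - pi[b] * Q[b][a])
                   for a in BASES for b in BASES)
```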

Page 20: Modeling Molecular  Substitution

Chromosome bands and human-mouse alignment

Staining chromosomes with the Giemsa dye produces characteristic banding patterns. G-bands: G1, G2, G3 or G4 (Francke, 1994). The unstained regions are H3- or H3+.

Data: 22,490 local alignments (4.4 Mbp) of human chromosome 22 with homologous mouse reads, produced by blastz, and approximate boundary coordinates of the bands on chr 22, obtained from the UCSC Human Genome Browser.

Page 21: Modeling Molecular  Substitution

Data analysis

Only alignments in noncoding regions are used. Alignments that are fewer than 10 bases apart are glued together. Of the resulting alignments, those longer than 100 bases are used to fit band-specific substitution models.
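The gluing and filtering steps can be sketched as interval merging. This is a simplification with several assumptions: alignments are represented only by half-open (start, end) coordinates on the human sequence, "closer than 10 bases" is read as a gap of fewer than 10, and "longer than 100 bases" as a length above 100. The actual pipeline would also have to respect the mouse coordinates.

```python
def glue_and_filter(intervals, max_gap=10, min_length=100):
    """Merge intervals separated by fewer than max_gap bases, then keep the long ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s > min_length]

# Hypothetical alignment coordinates.
alignments = [(0, 60), (65, 130), (300, 350), (1000, 1150)]
kept = glue_and_filter(alignments)
```

In this example the first two intervals are glued into one of length 130, the 50-base interval is dropped, and the last is kept unchanged.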

Page 22: Modeling Molecular  Substitution

Results

Model R,SS (reversible, strand-symmetric):

          Band G3               Bands G4, H3-, H3+
       A    C    G    T         A    C    G    T
  A    .   18   62   15         .   21   67   15
  C   21    .   16   71        19    .   16   63
  G   71   16    .   21        63   16    .   19
  T   15   62   18    .        15   67   21    .

Model R (reversible):

          Band G3               Bands G4, H3-, H3+
       A    C    G    T         A    C    G    T
  A    .   18   61   15         .   21   67   15
  C   21    .   16   70        19    .   16   63
  G   71   16    .   20        63   16    .   20
  T   16   62   18    .        15   68   21    .

  GC      46.7                      51.7
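Strand symmetry of a fitted matrix can be checked by comparing each entry with its complement-strand counterpart. The sketch below does this for the two band G3 matrices, reading the first 4 × 4 block as the R,SS fit and the second as the R fit (an assumption based on the order of the labels on the slide).

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def as_table(rows):
    """Turn four rows of four entries (None on the diagonal) into a dict of dicts."""
    return {a: {b: rows[i][j] for j, b in enumerate(BASES) if i != j}
            for i, a in enumerate(BASES)}

def max_strand_asymmetry(Q):
    """Largest |Q(a,b) - Q(comp(a), comp(b))| over off-diagonal entries."""
    return max(abs(Q[a][b] - Q[COMPLEMENT[a]][COMPLEMENT[b]])
               for a in BASES for b in BASES if a != b)

# Band G3 entries as printed on the slide.
g3_r_ss = as_table([[None, 18, 62, 15],
                    [21, None, 16, 71],
                    [71, 16, None, 21],
                    [15, 62, 18, None]])
g3_r = as_table([[None, 18, 61, 15],
                 [21, None, 16, 70],
                 [71, 16, None, 20],
                 [16, 62, 18, None]])

asymmetry_r_ss = max_strand_asymmetry(g3_r_ss)
asymmetry_r = max_strand_asymmetry(g3_r)
```

The first block is exactly strand-symmetric, while the unconstrained reversible fit departs from symmetry by at most one unit in any entry.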

Page 23: Modeling Molecular  Substitution

Amino acid substitution

Exactly the same formulation applies to modeling amino acid substitution.

Fast-evolving and slow-evolving proteins are known to have different substitution rates. How can we summarize such observations?

Buried amino acids evolve differently from exposed residues. An HMM with 2 hidden states?

Page 24: Modeling Molecular  Substitution

D. melanogaster vs. D. virilis

Bergman and Kreitman (2001) showed that the substitution patterns in intergenic and intronic regions are similar. The combined estimated rate matrix is

       A    C    G    T
  A    .   21   42   20
  C   31    .   32   62
  G   62   32    .   31
  T   20   42   21    .
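The stationary base composition implied by this matrix can be computed without knowing its units, since rescaling Q does not change π. The sketch below assumes rows are the "from" base and columns the "to" base, and reuses power iteration on a uniformized chain, which is a choice made here rather than the method of the cited paper.

```python
BASES = "ACGT"

# Off-diagonal entries as printed on the slide (assumed: row = from, column = to).
RATES = {
    "A": {"C": 21, "G": 42, "T": 20},
    "C": {"A": 31, "G": 32, "T": 62},
    "G": {"A": 62, "C": 32, "T": 31},
    "T": {"A": 20, "C": 42, "G": 21},
}

def stationary_distribution(rates, iterations=2000):
    """Iterate the chain I + Q/lam until the base frequencies settle."""
    lam = 2 * max(sum(row.values()) for row in rates.values())
    dist = {a: 0.25 for a in BASES}
    for _ in range(iterations):
        new = {}
        for b in BASES:
            stay = dist[b] * (1 - sum(rates[b].values()) / lam)
            arrive = sum(dist[a] * rates[a][b] / lam for a in BASES if a != b)
            new[b] = stay + arrive
        dist = new
    return dist

pi = stationary_distribution(RATES)
gc_content = pi["C"] + pi["G"]
```

Under these assumptions the implied composition is AT-rich, with a stationary GC fraction of about 0.40.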

Page 25: Modeling Molecular  Substitution

Future work

Study other human chromosomes, and isochore-specific substitution patterns.

Multiple species. Holmes (2002) has a neat EM algorithm for estimating substitution rates if the process is reversible. What to do if the process is irreversible?

Even for 2 species, irreversible substitution models are not fully explored.

Page 26: Modeling Molecular  Substitution

Acknowledgements

Terry Speed

Webb Miller

David Haussler

Terry Furey

Casey Bergman

Anne Yap