Modeling Molecular Substitution

Von Bing Yap

Statistics Department, UC Berkeley

[email protected]

Variations on a theme

The Human Genome Project furnishes sequences from a few individuals, but it is relevant to all of us: we are very similar.

We are also slightly different: polymorphisms.

Species variations.

How did variations arise?

Mutation: (a) Inherent: DNA replication errors are not always corrected. (b) External: exposure to chemicals and radiation.

Selection: Deleterious mutations are removed quickly. Neutral and, rarely, advantageous mutations are tolerated and persist.

Fixation: It takes time for a new variant to be established (having a stable frequency) in a population.

Why study molecular substitution?

(Relatively) early days: to test hypotheses from population genetics. Do rodents evolve faster than primates (the generation time hypothesis)? Which sites in a protein are under strong selective pressure?

More recently (from the 1970s): to derive better substitution matrices, like the PAM and BLOSUM series, for sequence alignment and for building phylogenetic trees.

Modeling DNA base substitution

Strictly speaking, only applicable to regions undergoing little selection.

Assumptions

1. Site independence.

2. Site homogeneity.

3. Markovian: given the current base, future substitutions are independent of the past.

4. Temporal homogeneity: stationary Markov chain.

Markov chain on {A,C,G,T}

A stationary Markov chain is a family of transition probability matrices $P(t)$, $t \ge 0$, satisfying

$$P(t)P(s) = P(t+s), \quad t, s \ge 0.$$

$P(t,a,b) = \Pr(b \text{ at time } s+t \mid a \text{ at time } s)$.

Let $Q$ be a rate matrix, i.e., it has positive off-diagonal entries and each row sum is 0. E.g., $Q(1,2)$ is the instantaneous rate of A going to C. $Q$ defines an MC by

$$P(t) = \exp(Qt) = I + Qt + Q^2 t^2/2! + \cdots$$
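The series for exp(Qt) can be evaluated numerically. The following is a minimal sketch in plain Python, assuming a truncated Taylor series combined with scaling and squaring; the rate matrix `Q` and the function names are hypothetical and chosen only for illustration. It checks that the rows of P(t) sum to 1 and that P(t)P(s) agrees with P(t+s).

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(Q, t, terms=20, squarings=10):
    """Approximate P(t) = exp(Qt): Taylor series on Qt / 2**squarings, then repeated squaring."""
    n = len(Q)
    scale = t / 2 ** squarings
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term becomes M^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical rate matrix, base order A, C, G, T; each row sums to 0.
Q = [[-0.9, 0.2, 0.5, 0.2],
     [0.3, -1.0, 0.1, 0.6],
     [0.6, 0.1, -0.9, 0.2],
     [0.2, 0.5, 0.2, -0.9]]

P_a = transition_matrix(Q, 0.3)
P_b = transition_matrix(Q, 0.4)
P_sum = transition_matrix(Q, 0.7)
product = mat_mul(P_a, P_b)
max_gap = max(abs(product[i][j] - P_sum[i][j]) for i in range(4) for j in range(4))
print("row sums of P(0.7):", [round(sum(row), 12) for row in P_sum])
print("max |P(0.3)P(0.4) - P(0.7)|:", max_gap)
```

In practice a library routine for the matrix exponential would normally replace the hand-written series; the version here only keeps the example free of dependencies.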

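Jukes-Cantor model (1969)

Let

$$Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}.$$

Then

$$P(t) = \begin{pmatrix} r & s & s & s \\ s & r & s & s \\ s & s & r & s \\ s & s & s & r \end{pmatrix},$$

where $r = (1 + 3\exp(-4\alpha t))/4$ and $s = (1 - \exp(-4\alpha t))/4$.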
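The closed form for the Jukes-Cantor model is easy to check numerically. The sketch below assumes a single common off-diagonal rate, named `alpha` here (an assumed name; with `alpha = 1` the exponent is simply -4t), and verifies that the closed-form matrices satisfy P(t)P(s) = P(t+s).

```python
import math


def jc_transition(t, alpha=1.0):
    """Closed-form JC transition matrix; alpha is the assumed common off-diagonal rate."""
    decay = math.exp(-4.0 * alpha * t)
    r = (1.0 + 3.0 * decay) / 4.0  # probability that the base is unchanged
    s = (1.0 - decay) / 4.0        # probability of each specific different base
    return [[r if a == b else s for b in range(4)] for a in range(4)]


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


P1 = jc_transition(0.2)
P2 = jc_transition(0.5)
P12 = jc_transition(0.7)
composed = mat_mul(P1, P2)
jc_gap = max(abs(composed[i][j] - P12[i][j]) for i in range(4) for j in range(4))
print("max |P(0.2)P(0.5) - P(0.7)|:", jc_gap)
```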

The stationary distribution

A probability distribution $\pi$ on {A,C,G,T} is a stationary distribution if

$$\sum_a \pi(a) P(t,a,b) = \pi(b), \quad \forall b,\ t \ge 0,$$

or $\pi P(t) = \pi$, $t \ge 0$, or $\pi Q = 0$ (global balance).

Facts: The MC built from our $Q$ has a unique stationary distribution. Let $X(t)$ be the base at time $t$. If $X(0) \sim \pi$, then $X(t) \sim \pi$. Given any initial distribution, the distribution of $X(t)$ converges to $\pi$ as $t \to \infty$.

For the JC model, $\pi$ is the uniform distribution.
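Both facts can be illustrated for the JC model. The sketch below, which assumes a common off-diagonal rate `alpha` (an assumed name), checks global balance for the uniform distribution and shows the distribution of X(t), started from the base A, approaching the uniform distribution as t grows.

```python
import math


def jc_rate_matrix(alpha=1.0):
    """JC rate matrix: alpha off the diagonal, -3*alpha on it."""
    return [[-3.0 * alpha if a == b else alpha for b in range(4)] for a in range(4)]


def jc_distribution_from(start, t, alpha=1.0):
    """Distribution of X(t) given X(0) = start, from the JC closed form."""
    decay = math.exp(-4.0 * alpha * t)
    return [(1.0 + 3.0 * decay) / 4.0 if b == start else (1.0 - decay) / 4.0
            for b in range(4)]


def left_multiply(v, M):
    """Row vector times matrix."""
    return [sum(v[a] * M[a][b] for a in range(len(v))) for b in range(len(M[0]))]


uniform = [0.25] * 4
balance = left_multiply(uniform, jc_rate_matrix())  # global balance: all entries should be 0

# Largest deviation from the uniform distribution at increasing times, starting from A.
gaps = []
for t in (0.0, 0.5, 2.0, 10.0):
    dist = jc_distribution_from(0, t)
    gaps.append(max(abs(p - 0.25) for p in dist))
print("pi Q:", balance)
print("deviation from uniform:", gaps)
```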

A pair of homologous bases

Typically, ancestor is unknown.

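[Figure: an unknown ancestor evolves for $T$ years along two lineages, under rate matrix $Q_h$ to the observed base A and under rate matrix $Q_m$ to the observed base C.]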

More assumptions

5. $Q_h = s_h Q$ and $Q_m = s_m Q$, for some positive $s_h$ and $s_m$, and some rate matrix $Q$.

6. The ancestor is sampled from the stationary distribution $\pi$ of $Q$.

7. $Q$ is reversible: $\pi(a) P(t,a,b) = P(t,b,a) \pi(b)$, $\forall a, b$, $t \ge 0$ (detailed balance).

New picture

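[Figure: the ancestor is drawn from $\pi$; both lineages evolve under the same rate matrix $Q$, for $s_h T$ PAMs to reach A and for $s_m T$ PAMs to reach C.]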

Joint probability of A and C

Under the model, the joint probability is

$$\begin{aligned} \sum_a \pi(a) P(s_h T, a, A) P(s_m T, a, C) &= \sum_a \pi(A) P(s_h T, A, a) P(s_m T, a, C) \\ &= \pi(A) P(s_h T + s_m T, A, C) = F(t, A, C). \end{aligned}$$

The matrix $F(t)$ is symmetric. It is equally valid to view A as the ancestor of C or vice versa.

$t = s_h T + s_m T$ is the "distance" between A and C. Note: $Q$ and $t$ are identifiable from $F(t)$, but $s_h$, $s_m$ and $T$ are not.
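The identity behind the joint probability can be checked numerically. The sketch below uses the JC model (uniform stationary distribution, assumed common rate `alpha`) as a stand-in for a general reversible Q; `t_h` and `t_m` play the roles of the two branch lengths. It sums over the unknown ancestor and compares the result with the direct formula, for two different splits of the same total distance.

```python
import math

BASES = "ACGT"
PI = 0.25  # uniform stationary probability of each base under JC


def jc_p(t, a, b, alpha=1.0):
    """JC transition probability P(t, a, b); alpha is an assumed common rate."""
    decay = math.exp(-4.0 * alpha * t)
    return (1.0 + 3.0 * decay) / 4.0 if a == b else (1.0 - decay) / 4.0


def joint_via_ancestor(t_h, t_m, x, y):
    """Sum over the unknown ancestor a of pi(a) P(t_h, a, x) P(t_m, a, y)."""
    return sum(PI * jc_p(t_h, a, x) * jc_p(t_m, a, y) for a in BASES)


def joint_direct(t, x, y):
    """F(t, x, y) = pi(x) P(t, x, y)."""
    return PI * jc_p(t, x, y)


split_1 = joint_via_ancestor(0.3, 0.5, "A", "C")
split_2 = joint_via_ancestor(0.1, 0.7, "A", "C")
direct = joint_direct(0.8, "A", "C")
total = sum(joint_direct(0.8, x, y) for x in BASES for y in BASES)
print(split_1, split_2, direct, total)
```

The two splits give the same joint probability, which is the numerical face of the identifiability remark: only the total distance enters F(t).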

Joint probability of a pair of homologous sequences

$$\Pr(a_1 \ldots a_n, b_1 \ldots b_n) = \prod_k F(t, a_k, b_k) = \prod_{a,b} F(t,a,b)^{c(a,b)},$$

where $c(a,b) = \#\{k : a_k = a,\ b_k = b\}$.
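The count matrix and the resulting log-likelihood can be sketched as follows. The example assumes two ungapped aligned sequences (the strings are hypothetical) and uses the JC model with an assumed rate `alpha` for F(t); it confirms that the count-based form agrees with the site-by-site product.

```python
import math
from collections import Counter


def pair_counts(seq_a, seq_b):
    """Count matrix c(a, b): number of aligned sites with base a above base b."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return Counter(zip(seq_a, seq_b))


def jc_joint(t, a, b, alpha=1.0):
    """F(t, a, b) = pi(a) P(t, a, b) under JC with uniform pi."""
    decay = math.exp(-4.0 * alpha * t)
    p = (1.0 + 3.0 * decay) / 4.0 if a == b else (1.0 - decay) / 4.0
    return 0.25 * p


def log_likelihood(counts, t):
    """Sum over (a, b) of c(a, b) * log F(t, a, b)."""
    return sum(n * math.log(jc_joint(t, a, b)) for (a, b), n in counts.items())


seq_1 = "ACGTACGTAC"  # hypothetical aligned sequences
seq_2 = "ACGTTCGTAA"
counts = pair_counts(seq_1, seq_2)
by_counts = log_likelihood(counts, 0.1)
by_sites = sum(math.log(jc_joint(0.1, a, b)) for a, b in zip(seq_1, seq_2))
print(dict(counts))
print(by_counts, by_sites)
```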

The choice of Q

By convention (Dayhoff 1978), $Q$ is chosen to satisfy

$$-\sum_a \pi(a) Q(a,a) = 0.01,$$

or the expected number of substitutions in 1 time unit is 0.01 per site. This new time scale is called evolutionary time, measured in PAMs (Point Accepted Mutations).
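The normalization amounts to rescaling Q by a constant. A minimal sketch, with a hypothetical function name and the JC matrix as the example input:

```python
def normalize_to_pam(Q, pi, target=0.01):
    """Rescale Q so that -sum_a pi(a) Q(a, a) equals target substitutions per site per time unit."""
    rate = -sum(pi[a] * Q[a][a] for a in range(len(pi)))
    factor = target / rate
    return [[q * factor for q in row] for row in Q]


# JC rate matrix with off-diagonal rate 1 and uniform stationary distribution.
Q_jc = [[-3.0 if a == b else 1.0 for b in range(4)] for a in range(4)]
pi_jc = [0.25] * 4

Q_pam = normalize_to_pam(Q_jc, pi_jc)
pam_rate = -sum(pi_jc[a] * Q_pam[a][a] for a in range(4))
print("off-diagonal rate per PAM:", Q_pam[0][1])
print("expected substitutions per site per PAM:", pam_rate)
```

For JC the unnormalized rate is 3 per time unit, so each off-diagonal entry becomes 0.01/3 after rescaling.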

M pairs of homologous sequences

The $k$th pair of sequences, separated by $t_k$ PAMs, gives the count matrix $c(k)$. Assuming that the pairs are independent and underwent the same MC defined by $Q$ over pair-specific distances $t_k$,

$$\Pr(\text{all pairs}) = \prod_k \prod_{a,b} F(t_k,a,b)^{c(k,a,b)}.$$

The pair-specific distances allow for different rates of substitution across pairs.

Parameters: $Q, t_1, t_2, \ldots, t_M$. Maximum likelihood estimation by numerical methods.
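As a much reduced illustration of the numerical step, the sketch below holds Q fixed at the PAM-normalized JC matrix, so that the likelihood factorizes over pairs and each distance can be estimated separately. This is an assumption made to keep the example short; it is not the joint estimation of Q and all distances described above. The counts are hypothetical, and golden-section search is just one possible one-dimensional optimizer. For JC the maximizer has a closed form, which serves as a check.

```python
import math

ALPHA = 0.01 / 3.0  # JC off-diagonal rate per PAM, so that 3 * ALPHA = 0.01


def log_likelihood(t, n_same, n_diff):
    """JC log-likelihood of one pair with n_same identical and n_diff differing sites."""
    decay = math.exp(-4.0 * ALPHA * t)
    f_same = 0.25 * (1.0 + 3.0 * decay) / 4.0
    f_diff = 0.25 * (1.0 - decay) / 4.0
    return n_same * math.log(f_same) + n_diff * math.log(f_diff)


def golden_section_max(f, lo, hi, iterations=100):
    """Maximize a unimodal function on [lo, hi] by golden-section search."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - g * (hi - lo)
    d = lo + g * (hi - lo)
    fc, fd = f(c), f(d)
    for _ in range(iterations):
        if fc > fd:  # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = f(c)
        else:        # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2.0


def estimate_distance(n_same, n_diff, lo=1e-6, hi=500.0):
    """Numerical ML estimate of the distance in PAMs for one pair."""
    return golden_section_max(lambda t: log_likelihood(t, n_same, n_diff), lo, hi)


def closed_form_distance(n_same, n_diff):
    """JC maximum likelihood distance in PAMs, from the observed mismatch fraction."""
    p = n_diff / (n_same + n_diff)
    return -math.log(1.0 - 4.0 * p / 3.0) / (4.0 * ALPHA)


pairs = [(900, 100), (700, 300)]  # hypothetical (identical, differing) site counts
estimates = [estimate_distance(s, d) for s, d in pairs]
exact = [closed_form_distance(s, d) for s, d in pairs]
print(estimates)
print(exact)
```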

Previous work

This model is general enough to include almost all models in use as special cases.

For DNA bases, there are the JC, Kimura and Hasegawa-Kishino-Yano models.

For amino acids, the PAM (Dayhoff 1978) and BLOSUM (Henikoff 1993) substitution matrices are derived based on the model. Müller and Vingron (2000) present an interesting method of estimation.

For codons, Z. Yang and collaborators use constrained versions to estimate the ratio of synonymous to nonsynonymous substitutions, among other things.

A formula for a reversible Q

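$$
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & \cdot & \alpha\,\pi(2) & \beta\,\pi(3) & \gamma\,\pi(4) \\
C & \alpha\,\pi(1) & \cdot & \delta\,\pi(3) & \epsilon\,\pi(4) \\
G & \beta\,\pi(1) & \delta\,\pi(2) & \cdot & \zeta\,\pi(4) \\
T & \gamma\,\pi(1) & \epsilon\,\pi(2) & \zeta\,\pi(3) & \cdot
\end{array}
$$

$\pi$: stationary distribution of $Q$; $\alpha, \ldots, \zeta$: positive constants.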
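The table can be read as Q(a,b) = (a positive constant shared by the pair {a,b}) times the stationary probability of b. The sketch below builds such a matrix under that reading and checks detailed and global balance. The stationary distribution, the constants, and the names (`exch`, keyed by base pair) are hypothetical values chosen for illustration.

```python
BASES = "ACGT"


def reversible_rate_matrix(pi, exch):
    """Q(a, b) = exch[{a, b}] * pi(b) for a != b; the diagonal makes each row sum to 0."""
    Q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                key = "".join(sorted(a + b))  # e.g. "AC" for both A->C and C->A
                Q[i][j] = exch[key] * pi[j]
        Q[i][i] = -sum(Q[i])  # diagonal is still 0.0 here, so this is minus the off-diagonal sum
    return Q


pi = [0.3, 0.2, 0.2, 0.3]  # hypothetical stationary distribution (A, C, G, T)
exch = {"AC": 1.0, "AG": 3.0, "AT": 0.8, "CG": 0.9, "CT": 3.5, "GT": 1.1}  # hypothetical
Q = reversible_rate_matrix(pi, exch)

# Detailed balance: pi(a) Q(a, b) = pi(b) Q(b, a).
detailed_gap = max(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i])
                   for i in range(4) for j in range(4))
# Global balance: pi Q = 0.
global_gap = max(abs(sum(pi[i] * Q[i][j] for i in range(4))) for j in range(4))
print(detailed_gap, global_gap)
```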

Strand symmetry

It seems plausible that the substitution process in noncoding regions is strand-symmetric, i.e., Q(A,C) = Q(T,G), Q(C,C) = Q(G,G), etc.

From the point of view of sequence alignment, SS implies that the same substitution matrix can be used for aligning either strand.

Strand symmetry (or the lack of it) is well studied in bacteria (Francino, Ochman).

A formula for a reversible SS Q

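$$
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & \cdot & \alpha\,\pi(2) & \beta\,\pi(2) & \gamma\,\pi(1) \\
C & \alpha\,\pi(1) & \cdot & \delta\,\pi(2) & \beta\,\pi(1) \\
G & \beta\,\pi(1) & \delta\,\pi(2) & \cdot & \alpha\,\pi(1) \\
T & \gamma\,\pi(1) & \beta\,\pi(2) & \alpha\,\pi(2) & \cdot
\end{array}
$$

$\pi$: stationary distribution; $2(\pi(1) + \pi(2)) = 1$. $\alpha, \ldots, \delta$: positive constants.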
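Strand symmetry can be tested entry by entry: Q(a,b) must equal Q(a',b'), where a' and b' are the complementary bases. The sketch below builds a reversible, strand-symmetric matrix with four exchange constants and two stationary probabilities, then applies that test. All numerical values and names are hypothetical.

```python
BASES = "ACGT"  # with this order, the complement of index i is 3 - i (A<->T, C<->G)


def ss_reversible_rate_matrix(pi_at, pi_cg, ac, ag, at, cg):
    """Reversible, strand-symmetric Q: pi(A) = pi(T) = pi_at, pi(C) = pi(G) = pi_cg."""
    if abs(2.0 * (pi_at + pi_cg) - 1.0) > 1e-12:
        raise ValueError("need 2 * (pi_at + pi_cg) = 1")
    pi = [pi_at, pi_cg, pi_cg, pi_at]
    # Complementary substitutions share a constant: A<->C with G<->T, A<->G with C<->T.
    exch = {"AC": ac, "GT": ac, "AG": ag, "CT": ag, "AT": at, "CG": cg}
    Q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                Q[i][j] = exch["".join(sorted(a + b))] * pi[j]
        Q[i][i] = -sum(Q[i])
    return Q, pi


def is_strand_symmetric(Q, tol=1e-12):
    """True if Q(a, b) equals Q(complement(a), complement(b)) for all a, b."""
    return all(abs(Q[i][j] - Q[3 - i][3 - j]) <= tol
               for i in range(4) for j in range(4))


Q_ss, pi_ss = ss_reversible_rate_matrix(0.3, 0.2, ac=1.0, ag=3.0, at=0.8, cg=0.9)
symmetric = is_strand_symmetric(Q_ss)

# Perturbing a single entry (and its diagonal) breaks the symmetry.
Q_broken = [row[:] for row in Q_ss]
Q_broken[0][1] += 0.05
Q_broken[0][0] -= 0.05
broken = is_strand_symmetric(Q_broken)
print(symmetric, broken)
```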

Chromosome bands and human-mouse alignment

Staining chromosomes with the Giemsa dye produces characteristic banding patterns. G-bands: G1, G2, G3 or G4 (Francke, 1994). The unstained regions are H3- or H3+.

Data: 22,490 local alignments (4.4 Mbp) of human chromosome 22 with homologous mouse reads, produced by blastz, and approximate boundary coordinates of the bands on chr 22, obtained from the UCSC Human genome browser.

Data analysis

Only alignments in noncoding regions are used. Alignments that are less than 10 bases apart are glued together. Of the resultant alignments, those longer than 100 bases are used to fit band-specific substitution models.
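The gluing and filtering steps can be sketched as interval merging. The representation below is an assumption: each alignment is reduced to a half-open (start, end) interval on one reference coordinate, the gap is measured between the end of one interval and the start of the next, and the function name is hypothetical. The actual pipeline may have defined these details differently.

```python
def glue_and_filter(intervals, max_gap=10, min_length=100):
    """Merge intervals separated by fewer than max_gap bases, then keep those longer than min_length."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < max_gap:
            # Close enough to the previous alignment: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s > min_length]


# Hypothetical alignment coordinates.
alignments = [(0, 60), (65, 120), (200, 250), (400, 520)]
kept = glue_and_filter(alignments)
print(kept)
```

In this example the first two alignments are 5 bases apart and are glued into one of length 120, the third is dropped for being too short, and the fourth is kept unchanged.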

Results

R: reversible; SS: strand-symmetric.

                  Band G3             Bands G4, H3-, H3+
Model           A    C    G    T        A    C    G    T
R, SS   A       .   18   62   15        .   21   67   15
        C      21    .   16   71       19    .   16   63
        G      71   16    .   21       63   16    .   19
        T      15   62   18    .       15   67   21    .
R       A       .   18   61   15        .   21   67   15
        C      21    .   16   70       19    .   16   63
        G      71   16    .   20       63   16    .   20
        T      16   62   18    .       15   68   21    .
GC                  46.7                    51.7

Amino acid substitution

Exactly the same formulation applies to modeling amino acid substitution.

Fast-evolving and slow-evolving proteins are known to have different substitution rates. How can we summarize such observations?

Buried amino acids evolve differently from exposed residues. An HMM with 2 hidden states?

D. melanogaster vs D. virilis

Bergman and Kreitman (2001) showed that the substitution patterns between intergenic and intronic regions are similar. The combined estimated rate matrix is

         A    C    G    T
    A    .   21   42   20
    C   31    .   32   62
    G   62   32    .   31
    T   20   42   21    .

Future work

Study other human chromosomes, and isochore-specific substitution patterns.

Multiple species. Holmes (2002) has a neat EM algorithm for estimating substitution rates if the process is reversible. What to do if irreversible?

Even for 2 species, irreversible substitution models are not fully explored.

Acknowledgements

Terry Speed

Webb Miller

David Haussler

Terry Furey

Casey Bergman

Anne Yap
