Transition Bias and Substitution Models

Xuhua Xia

xxia@uottawa.ca

http://dambe.bio.uottawa.ca

Transitions and Transversions

Transition: the substitution of a purine for a purine or a pyrimidine for a pyrimidine. Symbolized by s.

Transversion: the substitution of a purine for a pyrimidine or vice versa. Symbolized by v.

[Figure: substitution diagram of the purines (A, G) and the pyrimidines (C, T)]

What is transition bias?

Transition bias refers to the degree to which the s/v ratio deviates from the expected 1/2. The expectation is 1/2 because each nucleotide can change into one other nucleotide by a transition but into two by a transversion. The observed s/v ratio is almost always much larger than 1/2.
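The two definitions translate directly into a counting routine. The following is a minimal sketch, not part of the original slides: the function names `classify` and `sv_counts` and the toy sequences are illustrative assumptions, and the code assumes aligned sequences containing only A, C, G and T/U (no gaps or ambiguity codes).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(x, y):
    """Return 's' (transition), 'v' (transversion), or None if the bases are identical."""
    # Treat RNA's U as T so that DNA and RNA sequences are handled alike
    x, y = (b.upper().replace("U", "T") for b in (x, y))
    if x == y:
        return None
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "s" if same_class else "v"

def sv_counts(seq1, seq2):
    """Count transitional and transversional differences between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"s": 0, "v": 0}
    for x, y in zip(seq1, seq2):
        kind = classify(x, y)
        if kind is not None:
            counts[kind] += 1
    return counts["s"], counts["v"]

# Of the 12 possible changes among A, C, G, T, 4 are transitions and 8 are
# transversions, which is where the expected s/v ratio of 1/2 comes from.
possible = [classify(x, y) for x in "ACGT" for y in "ACGT" if x != y]
print(possible.count("s"), possible.count("v"))   # 4 8

# Hypothetical toy sequences: A/G is a transition, G/T and T/A are transversions
print(sv_counts("AACGT", "GACTA"))                # (1, 2)
```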

Transition Bias is Ubiquitous. Why?

• For both invertebrate and vertebrate genes:

$$\frac{s_{obs}}{v_{obs}} \gg \frac{1}{2}$$

• What causes transition bias?

– Mutation bias

– Selection bias in fixation probability (protein-coding genes, RNA genes)

$$\frac{s_{obs}}{v_{obs}} = \frac{\mu_s P_s}{\mu_v P_v}$$

Here $\mu_s/\mu_v$ is the mutation bias and $P_s/P_v$ is the selection bias in fixation probability.

Mitochondrial Genetic Code

Codon  Amino acid | Codon  Amino acid | Codon  Amino acid | Codon  Amino acid
UUU    Phe        | UCU    Ser        | UAU    Tyr        | UGU    Cys
UUC    Phe        | UCC    Ser        | UAC    Tyr        | UGC    Cys
UUA    Leu        | UCA    Ser        | UAA    Stop       | UGA    Trp
UUG    Leu        | UCG    Ser        | UAG    Stop       | UGG    Trp

CUU    Leu        | CCU    Pro        | CAU    His        | CGU    Arg
CUC    Leu        | CCC    Pro        | CAC    His        | CGC    Arg
CUA    Leu        | CCA    Pro        | CAA    Gln        | CGA    Arg
CUG    Leu        | CCG    Pro        | CAG    Gln        | CGG    Arg

AUU    Ile        | ACU    Thr        | AAU    Asn        | AGU    Ser
AUC    Ile        | ACC    Thr        | AAC    Asn        | AGC    Ser
AUA    Met        | ACA    Thr        | AAA    Lys        | AGA    Stop
AUG    Met        | ACG    Thr        | AAG    Lys        | AGG    Stop

GUU    Val        | GCU    Ala        | GAU    Asp        | GGU    Gly
GUC    Val        | GCC    Ala        | GAC    Asp        | GGC    Gly
GUA    Val        | GCA    Ala        | GAA    Glu        | GGA    Gly
GUG    Val        | GCG    Ala        | GAG    Glu        | GGG    Gly

• Synonymous and nonsynonymous

• Degeneracy:

– Non-degenerate

– Two-fold degenerate

– Four-fold degenerate

• Transitions are synonymous and transversions are nonsynonymous at two-fold degenerate sites.
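The degeneracy classes and the rule for two-fold degenerate sites can be checked mechanically against the code table. The sketch below is an illustration under stated assumptions: the names `degeneracy` and `twofold_rule_holds` are invented here, the 64-letter string is a transcription of the mitochondrial table shown on this slide, and stop codons are treated as one more "amino acid" class (`*`).

```python
from itertools import product

BASES = "UCAG"
# Amino acids for the 64 codons in UCAG order (first, second, third position),
# transcribed from the mitochondrial code table on this slide; '*' = stop.
AMINO = ("FFLLSSSSYY**CCWW"
         "LLLLPPPPHHQQRRRR"
         "IIMMTTTTNNKKSS**"
         "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
TRANSITION = {"A": "G", "G": "A", "C": "U", "U": "C"}

def degeneracy(codon, pos):
    """Number of bases at 0-based position pos (original included) that keep the amino acid."""
    codon = codon.upper().replace("T", "U")
    aa = CODE[codon]
    return sum(CODE[codon[:pos] + b + codon[pos + 1:]] == aa for b in BASES)

def twofold_rule_holds():
    """At every two-fold degenerate third position, check that a change is
    synonymous exactly when it is a transition."""
    for codon in CODE:
        if degeneracy(codon, 2) != 2:
            continue
        for b in BASES:
            if b == codon[2]:
                continue
            synonymous = CODE[codon[:2] + b] == CODE[codon]
            is_transition = TRANSITION[codon[2]] == b
            if synonymous != is_transition:
                return False
    return True

print(degeneracy("GGA", 2))                   # 4: third position of a glycine codon
print(degeneracy("GGA", 1))                   # 1: second position is nondegenerate
print(degeneracy("GAA", 2))                   # 2: GAA and GAG both encode Glu
print(twofold_rule_holds())                   # True
print({degeneracy(c, 2) for c in CODE})       # third positions are either 2- or 4-fold
```

With this particular code table every third codon position is either two-fold or four-fold degenerate, which is why the slides that follow treat those classes plus the nondegenerate sites.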

RNA secondary structure

A C-to-U transition in the upper strand of a stem:

Seq1: CACGA
      |||||
      GUGCU

Seq2: CAUGA
      |||||
      GUGCU

An A-to-G transition in the upper strand of the same stem:

Seq1: CACGA
      |||||
      GUGCU

Seq2: CGCGA
      |||||
      GUGCU

A G/U pair, although not as strong as an A/U or C/G pair, generally does not disrupt RNA secondary structure (and occurs frequently in it).
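The stem on this slide can be used to count how many single-base changes keep every position paired once G/U pairs are allowed. This is a rough sketch with invented helper names (`stem_intact`, `tolerated_changes`); it pairs the two strands position by position exactly as they are drawn above and ignores pairing strength.

```python
# Watson-Crick pairs plus the G/U wobble pair
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def stem_intact(upper, lower):
    """True if every base in `upper` can pair with the base drawn beneath it in `lower`."""
    return len(upper) == len(lower) and all((x, y) in PAIRS for x, y in zip(upper, lower))

def tolerated_changes(upper, lower):
    """For all single-base changes in `upper`, return {kind: [kept, total]} for
    transitions ('s') and transversions ('v')."""
    counts = {"s": [0, 0], "v": [0, 0]}
    for i, old in enumerate(upper):
        for new in "ACGU":
            if new == old:
                continue
            kind = "s" if {old, new} in ({"A", "G"}, {"C", "U"}) else "v"
            mutant = upper[:i] + new + upper[i + 1:]
            counts[kind][1] += 1
            counts[kind][0] += stem_intact(mutant, lower)
    return counts

print(stem_intact("CAUGA", "GUGCU"))        # True: the C-to-U transition leaves a G/U pair
print(stem_intact("CGCGA", "GUGCU"))        # True: the A-to-G transition leaves a G/U pair
print(tolerated_changes("CACGA", "GUGCU"))  # {'s': [4, 5], 'v': [0, 10]}
```

For this stem, four of the five possible transitions keep the structure intact, while none of the ten transversions do.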

Causes of transition bias

"I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be." Lord Kelvin, "Electrical Units of Measurement" (1883-05-03), Popular Lectures and Addresses, vol. 1

$$\frac{s_{obs}}{v_{obs}} = \frac{\mu_s P_s}{\mu_v P_v}$$

At Four-fold Degenerate Sites

At four-fold degenerate sites, all nucleotide substitutions are synonymous and subject to roughly the same selection pressure (similar fixation probabilities), so $P_s \approx P_v$:

$$\frac{s_{obs}}{v_{obs}} = \frac{\mu_s P_s}{\mu_v P_v} \approx \frac{\mu_s}{\mu_v} \approx 2$$

Glycine codons: GGA, GGC, GGG, GGT (the third position is the four-fold degenerate site)

      Gly Asn Lys Gly Asp Lys Ala Ala Pro Ala Cys ...
Fold  4   2   2       2   2   4   4   4       2
S1    GGA AAU AAA GGA GAC AAA GCC GCC CCU GCG UGU ...
S2    GGG AAC AAA GAA GAU AAG GCC GCU CCA GGG UGG ...
      s                           s   v
                  Glu                     Gly Trp

The s and v marks below S2 flag the substitutions at four-fold degenerate sites; Glu, Gly and Trp are the amino acids in S2 that differ from S1.

At Nondegenerate Sites

At nondegenerate sites, all nucleotide substitutions are nonsynonymous and subject to roughly the same selection pressure (similar fixation probabilities), so $P_s \approx P_v$:

$$\frac{s_{obs}}{v_{obs}} = \frac{\mu_s P_s}{\mu_v P_v} \approx \frac{\mu_s}{\mu_v} \approx 2$$

Glycine codons: GGA, GGC, GGG, GGT (nondegenerate site marked in the figure)

      Gly Asn Lys Gly Asp Lys Ala Ala Pro Ala Cys ...
S1    GGA AAU AAA GGA GAC AAA GCC GCC CCU GCG UGU ...
S2    GGG AAC AAA GAA GAU AAG GCC GCU CCA GGG UGG ...
                  s                       v
                  Glu                     Gly Trp

The s and v marks below S2 flag the substitutions at nondegenerate sites; Glu, Gly and Trp are the amino acids in S2 that differ from S1.

At Two-fold Degenerate Sites

At two-fold degenerate sites, all transitional substitutions are synonymous, and all transversional substitutions are nonsynonymous.

$$\frac{s_{obs}}{v_{obs}} = \frac{\mu_s P_s}{\mu_v P_v} \approx \frac{2 P_s}{P_v} \approx 80$$

It follows that $P_s/P_v \approx 40$: a transition is about 40 times as likely to become fixed as a transversion.

Codons with a two-fold degenerate third position: GAA Glu, GAG Glu, GAC Asp, GAT Asp

      Gly Asn Lys Gly Asp Lys Ala Ala Pro Ala Cys ...
Fold  4   2   2       2   2   4   4   4       2
S1    GGA AAU AAA GGA GAC AAA GCC GCC CCU GCG UGU ...
S2    GGG AAC AAA GAA GAU AAG GCC GCU CCA GGG UGG ...
          s           s   s                   v
                  Glu                     Gly Trp

The s and v marks below S2 flag the substitutions at two-fold degenerate sites; Glu, Gly and Trp are the amino acids in S2 that differ from S1.
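The s and v marks on the three preceding slides can be reproduced by comparing S1 and S2 codon by codon and sorting each difference by the degeneracy of the site where it falls. The sketch below is illustrative only: `tally` and `degeneracy` are invented names, degeneracy is evaluated in the S1 codon (an assumption; the slides do not say which sequence defines it), and the code table is a transcription of the mitochondrial code shown earlier in the deck.

```python
from itertools import product

BASES = "UCAG"
# Mitochondrial code from the earlier slide, codons in UCAG order; '*' = stop
AMINO = ("FFLLSSSSYY**CCWW"
         "LLLLPPPPHHQQRRRR"
         "IIMMTTTTNNKKSS**"
         "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}

def degeneracy(codon, pos):
    """Number of bases at 0-based position pos (original included) that keep the amino acid."""
    aa = CODE[codon]
    return sum(CODE[codon[:pos] + b + codon[pos + 1:]] == aa for b in BASES)

def tally(s1, s2):
    """Count transitions ('s') and transversions ('v') between two codon
    sequences, grouped by the degeneracy of the differing site in s1."""
    counts = {}
    for c1, c2 in zip(s1.split(), s2.split()):
        for pos, (x, y) in enumerate(zip(c1, c2)):
            if x == y:
                continue
            kind = "s" if (x in PURINES) == (y in PURINES) else "v"
            fold = degeneracy(c1, pos)
            counts.setdefault(fold, {"s": 0, "v": 0})[kind] += 1
    return counts

S1 = "GGA AAU AAA GGA GAC AAA GCC GCC CCU GCG UGU"
S2 = "GGG AAC AAA GAA GAU AAG GCC GCU CCA GGG UGG"
counts = tally(S1, S2)
for fold in sorted(counts):
    print(fold, counts[fold])
# 1 {'s': 1, 'v': 1}
# 2 {'s': 3, 'v': 1}
# 4 {'s': 2, 'v': 1}
```

These tiny counts only show how differences are sorted into the three site classes; the ratios of about 2 and about 80 quoted on the slides come from real data, not from this 11-codon example.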

Methylation and deamination

[Figure: a methyltransferase transfers a methyl group (H3C-) from a donor to an acceptor]

Methylation and DNA Repair in E. coli

• DNA alphabet: ACGT

• RNA alphabet: ACGU

• DNA duplication and the Watson-Crick pairing rule: A-T, C-G

3’--CTAG----CTAGGTAT----C-----C--CTAG-----------5’
    ||||    ||||||||    ?     ?  ||||
5’--GATC----GATCCATA----U-----T--GATC-----... 3’

[Figure labels: H3C (methyl groups), mutS, mutL, mutH]

Spacing of GATC: consequences of being too far apart.

Methylation-Modification System

[Figure: a dsDNA phage and Brevibacterium albidum. Labels: bacterial membrane, bacterial genome, transcription and translation, restriction enzyme, methylase. Sequences shown: TGGC*CA / AC*CGGT, and the cleavage site ----TGG|CCA---- / ----ACC|GGT----]

CpG-Specific DNA Methylation

• Mammalian DNA methyltransferase 1 (DNMT1)

– NLS-containing domain (NlsD)

– Replication foci-directing domain (RFDD)

– ZnD, Zn-binding domain

– Polybromo domain (PBD)

– CatD, the catalytic domain

[Figure: domain map of DNMT1 showing NlsD, RFDD, ZnD, PBD and CatD, with positions 350, 609, 613, 746, 748, 1110, 1124, 1343 and 1620 marked; CpG and mCpG labels]

Fatemi, M., A. Hermann, S. Pradhan and A. Jeltsch, 2001 J Mol Biol 309: 1189-99.

CpG-Specific DNA Methylation

5’ATGCGA-------CCGA--------ACGGC--TAA 3’
  ||||||       ||||        |||||
3’TACGCT-------GGCT--------TGCCG--ATT 5’

The three CpG sites, from left to right: fully methylated (H3C on both strands), hemi-methylated (H3C on one strand), unmethylated.

Note: 5’CG3’ = CpG

Methylation and Gene Regulation

• Proteins with a methyl-CpG binding domain (MBD)

– MBD1, MBD2, and MBD3

– MeCP2

• Deacetylase: an enzyme that removes an acetyl group

• Histone deacetylases: deacetylate lysyl residues in histones (the half-life of an acetyl group is ~10 min). Acetylation removes a positive charge on the lysine ε-amino group and promotes nucleosome melting (and gene expression). Deacetylation tends to decrease or turn off gene expression.

[Figure labels: ---mCpG---, MBD, histone deacetylase, lysine demethylation, condensed DNA with repressed transcription]

Wade, P. A., and A. P. Wolffe, 2001 Nat Struct Biol 8: 575-7.

Methylation and Mutation

Cytosine is converted to thymine.

[Figure: chemical structures showing cytosine methylated (H3C added) to methylcytosine, which is then converted to thymine by spontaneous deamination]

Vertebrate mitochondrion

[Figure: replication of the vertebrate mitochondrial genome. Labels: parental H strand, parental L strand, daughter H strand, daughter L strand, OH, OL]

Spontaneous deamination

Base            Deamination product  Product pairs with
Adenine         Hypoxanthine         C
Guanine         Xanthine             C
Cytosine        Uracil               A
Methylcytosine  Thymine              A

Each deamination consumes H2O and releases NH3.
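The pairing partners listed for the deamination products determine which substitution results. The sketch below works this out under a simplifying assumption that is not stated on the slide: the lesion escapes repair, so the partner incorporated opposite the deaminated base templates the replacement base in the next round of replication. The names `PAIRS_WITH_AFTER_DEAMINATION` and `mutation_after_replication` are invented for illustration.

```python
PURINES = {"A", "G"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Original base -> base that its deamination product pairs with (from the slide)
PAIRS_WITH_AFTER_DEAMINATION = {
    "A": "C",     # hypoxanthine pairs with C
    "G": "C",     # xanthine pairs with C
    "C": "A",     # uracil pairs with A
    "5mC": "A",   # thymine pairs with A
}

def mutation_after_replication(base):
    """Return (old, new) for the substitution fixed at the damaged site if the
    lesion is never repaired, or None if the sequence is unchanged."""
    old = "C" if base == "5mC" else base
    partner = PAIRS_WITH_AFTER_DEAMINATION[base]
    new = COMPLEMENT[partner]  # the partner templates this base in the next round
    return None if new == old else (old, new)

for base in ("A", "G", "C", "5mC"):
    change = mutation_after_replication(base)
    if change is None:
        print(base, "no change")
    else:
        old, new = change
        kind = "transition" if (old in PURINES) == (new in PURINES) else "transversion"
        print(base, f"{old}->{new}", kind)
# A A->G transition
# G no change
# C C->T transition
# 5mC C->T transition
```

Under these assumptions every substitution caused by spontaneous deamination is a transition, in line with the mutation-bias argument of the deck.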

Transversion can erase transitions

Transitions can erase transitions, and transversions can erase transversions.

However, a transversion can erase many transitions occurring before it, and subsequent transitions cannot erase the transversion:

AACGCTTGACG
AACGCTTAACG  (G to A, transition)
AACGCTTGACG  (A to G, transition)
AACGCTTCACG  (G to C, transversion)
AACGCTTTACG  (C to T, transition)

Although a transition could also erase 2n transversions occurring before it, this is rare because transversions are in general much rarer than transitions:

AACGCTTGACG
AACGCTTTACG  (G to T, transversion)
AACGCTTAACG  (T to A, transversion)
AACGCTTGACG  (A to G, transition)

Transitions tend to be missed in counting much more frequently than transversions.
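The erasing effect can be made concrete by following the history of a single site and comparing what actually happened with what is observed between the first and last state. A minimal sketch (the names `kind` and `actual_and_observed` are illustrative): the observed difference is a transversion exactly when the history contains an odd number of transversions, regardless of how many transitions occurred.

```python
PURINES = {"A", "G"}

def kind(x, y):
    """'s' for a transition, 'v' for a transversion, None if x and y are identical."""
    if x == y:
        return None
    return "s" if (x in PURINES) == (y in PURINES) else "v"

def actual_and_observed(path):
    """path: successive states of one site, e.g. 'GAGCT'.
    Returns (list of actual changes, change observed between first and last state)."""
    actual = [kind(a, b) for a, b in zip(path, path[1:])]
    observed = kind(path[0], path[-1])
    return actual, observed

# Site 8 of the first example on this slide: G -> A -> G -> C -> T
print(actual_and_observed("GAGCT"))   # (['s', 's', 'v', 's'], 'v')

# Site 8 of the second example: G -> T -> A -> G
print(actual_and_observed("GTAG"))    # (['v', 'v', 's'], None)
```

In the first history three transitions and one transversion are recorded as a single transversion; in the second, two transversions and a transition leave no observable difference at all.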

Summary

• Selection: Transitions are tolerated more than transversions by natural selection because

– they are more likely to be synonymous in protein-coding sequences than transversions

– they are less likely to disrupt RNA secondary structure than transversions.

• Mutation: Transitional mutations occur more frequently than transversional mutations because

– Misincorporation during DNA replication occurs more frequently between two purines or between two pyrimidines than between a purine and a pyrimidine

– A purine is more likely to mutate chemically to another purine than to a pyrimidine (e.g., through spontaneous deamination). The same holds for pyrimidines.

• Bias in counting: Transitions tend to be missed in counting much more frequently than transversions (which necessitates substitution models)

Nucleotide Substitutions

[Figure: a sequence evolving along two lineages into two observed sequences, illustrating single, multiple, coincidental, parallel, convergent and back substitutions. Sequences shown in the figure: ACACTCGGATTAGGCT (twice) and AGACTCGGATTAGGCT]

Observed sequences: ATACTCAGGTTAAGCT and ACAATCCGGTTAAGCT

Actual number of changes during the evolution of the two daughter sequences: 12

Observed number of differences between the two daughter sequences: 3.

The aim is to correct for multiple substitutions so as to estimate the true number of changes, i.e., 12.

From WHL
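As an illustration of what "correcting for multiple substitutions" means in practice, the sketch below applies the JC69 distance formula, d = -(3/4) ln(1 - 4p/3), to the two observed sequences. JC69 is chosen here only because it is the simplest model named later in the deck; the slide itself does not prescribe a model, and the function names are invented for this example.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)

def jc69_distance(p):
    """JC69 estimate of substitutions per site from the observed proportion p."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)

s1 = "ATACTCAGGTTAAGCT"
s2 = "ACAATCCGGTTAAGCT"
p = p_distance(s1, s2)          # 3 differences over 16 sites = 0.1875
d = jc69_distance(p)
print(round(p, 4), round(d, 4), round(d * len(s1), 2))   # 0.1875 0.2158 3.45
```

The corrected estimate (about 3.5 substitutions over 16 sites) exceeds the 3 observed differences but stays far below the 12 changes of the constructed example; with a sequence this short and this heavily overwritten, no correction can be expected to recover the true history.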

Substitution models and phylogenetics

• A substitution model describes the evolutionary process so that multiple hits can be corrected for.

• A phylogenetic reconstruction method implicitly or explicitly assumes a substitution model.

• A phylogenetic method that assumes a wrong substitution model will typically produce wrong trees.

• An alignment made with an inappropriate substitution score matrix will typically be inaccurate (e.g., strong transition bias among the sequences, but a score matrix without a strong penalty against transversions).

[Figure: substitution diagram of A, G, C, T]

Unrestricted model (rows: original nucleotide; columns: replacement nucleotide):

      A     G     C     T
A     .     a1    a2    a3
G     a7    .     a4    a5
C     a8    a9    .     a6
T     a10   a11   a12   .

GTR:

      A       G       C       T
A     .       a1πG    a2πC    a3πT
G     a1πA    .       a4πC    a5πT
C     a2πA    a4πG    .       a6πT
T     a3πA    a5πG    a6πC    .

The diagonal of a transition probability matrix is subject to the constraint that each row sums up to 1.

Model         Frequencies       Rates
JC69          πi = 0.25         ai = c
F81/TN84      πA, πC, πG, πT    ai = c
K80           πi = 0.25         a1 = a6 = a7 = a12 = α; a2 = a3 = a4 = a5 = a8 = a9 = a10 = a11 = β
HKY85         πA, πC, πG, πT    a1 = a6 = a7 = a12 = α; a2 = a3 = a4 = a5 = a8 = a9 = a10 = a11 = β
TN93          πA, πC, πG, πT    a1 = a7 = α1; a6 = a12 = α2; a2 = a3 = a4 = a5 = a8 = a9 = a10 = a11 = β
GTR           πA, πC, πG, πT    a1 to a6 as in the GTR matrix
Unrestricted  no equilibrium πi  a1 to a12 as in the unrestricted matrix
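To show how a model that separates transitions from transversions changes the correction, the sketch below implements the standard K80 (Kimura two-parameter) distance, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitional and transversional differences. The function name and the example values of P and Q are hypothetical.

```python
import math

def k80_distance(P, Q):
    """K80 distance from the proportions of transitional (P) and
    transversional (Q) differences between two aligned sequences."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K80 correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

# Hypothetical example: 10% transitional and 5% transversional differences.
# The uncorrected distance is P + Q = 0.15.
print(round(k80_distance(0.10, 0.05), 3))   # 0.17

# When transitions make up exactly one third of the differences, K80 gives the
# same value as the JC69 formula -(3/4) ln(1 - 4p/3).
p = 0.1875
print(math.isclose(k80_distance(p / 3, 2 * p / 3),
                   -0.75 * math.log(1 - 4 * p / 3)))   # True
```

The second check reflects the nesting in the table above: JC69 is the special case of K80 with α = β.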

The TN93 model as an example

Rows and columns in the order T, C, A, G:

$$Q = \begin{pmatrix} \cdot & \kappa_1 \pi_C & \pi_A & \pi_G \\ \kappa_1 \pi_T & \cdot & \pi_A & \pi_G \\ \pi_T & \pi_C & \cdot & \kappa_2 \pi_G \\ \pi_T & \pi_C & \kappa_2 \pi_A & \cdot \end{pmatrix}$$

π - frequency parameters

κ - rate ratio parameters

In addition to the illustrated assumptions, it also assumes that the frequency and rate ratio parameters do not change over time, i.e., the substitution process is stationary.

[Figure: substitution diagram of A, G, C, T]
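A TN93 rate matrix can be assembled from its frequency and rate ratio parameters in a few lines. The sketch below is an illustration, not the author's implementation: it assumes the common convention that κ1 scales the pyrimidine transitions (T↔C) and κ2 the purine transitions (A↔G), leaves the matrix unscaled, and uses hypothetical parameter values. The diagonal is filled so that each row of the rate matrix sums to zero.

```python
ORDER = "TCAG"

def tn93_rate_matrix(pi, kappa1, kappa2):
    """Unscaled TN93 rate matrix as a nested dict q[i][j].
    pi: equilibrium frequencies keyed by base.
    kappa1: rate ratio for T<->C transitions (assumed convention).
    kappa2: rate ratio for A<->G transitions (assumed convention)."""
    q = {}
    for i in ORDER:
        row = {}
        for j in ORDER:
            if i == j:
                continue
            if {i, j} == {"T", "C"}:
                ratio = kappa1
            elif {i, j} == {"A", "G"}:
                ratio = kappa2
            else:
                ratio = 1.0          # transversions
            row[j] = ratio * pi[j]
        row[i] = -sum(row.values())  # diagonal: each row sums to zero
        q[i] = row
    return q

# Hypothetical parameter values
pi = {"T": 0.3, "C": 0.2, "A": 0.4, "G": 0.1}
q = tn93_rate_matrix(pi, kappa1=4.0, kappa2=2.0)
for i in ORDER:
    print(i, [round(q[i][j], 2) for j in ORDER])

# Time reversibility: pi_i * q_ij equals pi_j * q_ji for every pair
reversible = all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) < 1e-12
                 for i in ORDER for j in ORDER if i != j)
print(reversible)   # True
```

Setting `kappa1 == kappa2` gives the HKY85 form, adding equal frequencies gives K80, and setting both ratios to 1 with equal frequencies gives JC69, which mirrors the nesting of the models in this deck.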

Substitution Models

• There are three types of substitution models in molecular evolution:

– Nucleotide-based

– Amino acid-based

– Codon-based

• Substitution models are characterized by two categories of parameters: the frequency parameters and the rate ratio parameters. Different models differ in their assumptions about these two categories of parameters.

• Substitution models, substitution score matrices, and sequence alignment.