Post on 08-Jan-2018
Burkhard Morgenstern, Institute of Microbiology and Genetics
Molecular evolution and reconstruction of phylogenetic trees
Winter semester 2006/2007
Goal:
Phylogeny reconstruction based on molecular sequence data (DNA, RNA, protein sequences)
Multiple sequence alignment
Molecular phylogeny reconstruction relies on comparative nucleic acid and protein sequence analysis
Alignment most important tool for sequence comparison
Multiple alignment contains more information than pair-wise alignment
Tools for multiple sequence alignment
Y I M Q E V Q Q E R
Sequence duplicates in history (e.g. speciation event)
Y I M Q E V Q Q E R
Y I M Q E V Q Q E R
Tools for multiple sequence alignment
Y I M Q E A Q Q E R
Y L M Q E V Q Q E R
Substitutions occur
Y A I M Q E A Q Q E R
Y L M - - V Q Q E R V
Insertions/deletions (indels) occur
Y A I M Q E A Q Q E R
Y L M V Q Q E R V
because of insertions/deletions: sequence similarity no longer immediately visible!
Y A I M Q E A Q Q E R -
Y - L M V - - Q Q E R V
Alignment brings together related parts of the sequences by inserting gaps into sequences
Mismatches correspond to substitutions
Gaps correspond to indels
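A minimal Python sketch of this reading of an alignment, using the two aligned rows shown above. The function name `classify_columns` is a hypothetical helper introduced here for illustration only.

```python
def classify_columns(row1, row2):
    """Count matches, mismatches and gap columns in a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gap_columns = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_columns += 1      # column explained by an indel
        elif a == b:
            matches += 1          # residue unchanged
        else:
            mismatches += 1       # column explained by a substitution
    return matches, mismatches, gap_columns


# The aligned pair from the slides, written without spaces.
row1 = "YAIMQEAQQER-"
row2 = "Y-LMV--QQERV"
print(classify_columns(row1, row2))  # (matches, mismatches, gap columns)
```

Note that adjacent gap columns may stem from one indel event or from several; the sketch counts gap columns only, not events.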
Pairwise alignment: alignment of two sequences
Multiple alignment: alignment of N > 2 sequences
s1 R Y I M R E A Q Y E S A Q
s2 R C I V M R E A Y E
s3 Y I M Q E V Q Q E R
s4 W R Y I A M R E Q Y E
Assumption: sequence family related by common ancestry; similarity due to common history
Sequence similarity not obvious (insertions and deletions may have happened)
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E - - -
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E - - -
Multiple alignment = arrangement of sequences by introducing gaps
Alignment reveals sequence similarities
General information in multiple alignment:
Functionally important regions more conserved than non-functional regions
Local sequence conservation indicates functionality!
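As a rough illustration of local conservation, the following Python sketch scores each column of the four-sequence alignment from the slides. The scoring rule (share of rows carrying the most common residue, gaps not counted as residues) is an assumption chosen for simplicity, not a measure taken from the lecture.

```python
from collections import Counter


def column_conservation(rows):
    """Per column: fraction of rows that carry the most common residue."""
    scores = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)    # column consists of gaps only
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(rows))
    return scores


# Alignment of s1..s4 from the slides, written without spaces.
alignment = [
    "-RYI-MREAQYESAQ",
    "-RCIVMREA-YE---",
    "--YI-MQEVQQER--",
    "WRYIAMRE-QYE---",
]
scores = column_conservation(alignment)
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(fully_conserved)  # 0-based indices of columns identical in all rows
```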
Phylogeny reconstruction based on multiple alignment:
Estimate pairwise distances between sequences (distance-based methods for tree reconstruction)
Estimate evolutionary events (parsimony and maximum likelihood methods)
Task in bioinformatics: Find best multiple alignment for given sequence set
Astronomical number of possible alignments!
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - - - Y E -
s3 Y I - - - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E - - -
Computer has to decide: which one is best??
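To give a feeling for how quickly the number of possible alignments grows, here is a small Python sketch for the pairwise case only. It assumes that every column is either residue/residue, residue/gap or gap/residue, and that two alignments differing only in the order of adjacent gap columns count as different; under these assumptions the count follows a simple three-term recurrence. The function name is hypothetical.

```python
def count_pairwise_alignments(m, n):
    """Number of alignments of two sequences with lengths m and n."""
    # table[i][j] = number of alignments of the first i and first j residues
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = (
                table[i - 1][j - 1]   # last column: residue / residue
                + table[i - 1][j]     # last column: residue / gap
                + table[i][j - 1]     # last column: gap / residue
            )
    return table[m][n]


for length in (3, 10, 50, 100):
    print(length, count_pairwise_alignments(length, length))
```

For multiple alignments the number of candidates grows even faster, which is why exhaustive enumeration is not an option.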
Questions in development of alignment programs:
(1) What is a good alignment? → objective function ('score')
(2) How to find a good alignment? → optimization algorithm
First question far more important!
Before defining an objective function (scoring scheme):
What is a biologically good alignment?
Criteria for alignment quality:
1. 3D-Structure: align residues at corresponding positions in 3D structure of protein!
Species related by common history
Genes / proteins related by common history
2. Evolution: align residues with common ancestors!
Alignment: hypothesis about sequence evolution
Mismatches correspond to substitutions
Gaps correspond to insertions/deletions
Search for most plausible scenario!
Estimate probabilities for individual evolutionary events: insertions/deletions, substitutions
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E - - -
s3 - Y - I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E - - -
Compute score s(a,b) for degree of similarity between amino acids a and b based on probability p_{a,b} of substitution a → b (or b → a)
(Extremely simplified!)
Reason for different substitution probabilities p_{a,b}:
Different physical and chemical properties of amino acids
Amino acids with similar properties are more likely to be substituted for each other
Use penalty for gaps introduced into alignment
Simplest approach: linear gap costs: penalty proportional to gap length
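Non-linear gap penalties more realistic: a long gap can be caused by a single insertion/deletion event
Most frequently used: affine gap penalties (cost for opening a gap plus a smaller cost for each further gap position): more realistic than linear costs, and still efficient to calculate!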
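The difference between the two penalty schemes can be sketched in a few lines of Python. The parameter values and the convention `gap_open + gap_extend * (length - 1)` are assumptions made for this example; programs differ in how they define the opening and extension costs.

```python
def linear_gap_penalty(length, g=2.0):
    """Penalty proportional to gap length: every gap position costs g."""
    return g * length


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Opening a gap is expensive, extending it is cheap (assumed convention)."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


for length in (1, 2, 5, 10):
    print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

With the affine scheme, one gap of length 10 is much cheaper than ten separate gaps of length 1, which matches the idea that a long gap may result from a single event.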
Traditional Objective functions:
Define score of an alignment as:
Sum of individual similarity scores s(a,b)
Minus gap penalties
Needleman-Wunsch scoring system for pairwise alignment (1970)
Pair-wise sequence alignment
T Y W I V
T - - L V
Example:
Score = s(T,T) + s(I,L) + s(V,V) - 2g
Assumption: linear gap penalty, g per gap position!
Dynamic-programming algorithm finds alignment with best score.
(Needleman and Wunsch, 1970)
Running time proportional to product of sequence lengths
Time complexity O(l1 * l2)
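The dynamic-programming idea can be sketched as follows, assuming a linear gap penalty as in the example above. The similarity function `toy_score` and the gap value are placeholders invented for this illustration; a real application would use a substitution matrix. Ties between equally good alignments are broken arbitrarily by the traceback.

```python
def needleman_wunsch(x, y, score, gap):
    """Global pairwise alignment with linear gap penalty `gap` per position."""
    n, m = len(x), len(y)
    # best[i][j] = best score for aligning x[:i] with y[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -gap * i
    for j in range(1, m + 1):
        best[0][j] = -gap * j
    # Fill the table: (n+1)*(m+1) cells, hence time O(l1 * l2).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align residues
                best[i - 1][j] - gap,                            # gap in y
                best[i][j - 1] - gap,                            # gap in x
            )
    # Traceback from the bottom-right cell to recover one optimal alignment.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and best[i][j] == best[i - 1][j - 1] + score(x[i - 1], y[j - 1])):
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or best[i][j] == best[i - 1][j] - gap):
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return best[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


def toy_score(a, b):
    """Placeholder similarity: +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


total, aligned_x, aligned_y = needleman_wunsch("TYWIV", "TLV", toy_score, gap=1)
print(total)
print(aligned_x)
print(aligned_y)
```

The sketch uses integer scores so that the exact comparisons in the traceback are safe; with floating-point scores it would be more robust to store back-pointers while filling the table.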
Algorithm for pairwise alignment can be generalized to multiple alignment of N sequences
Time-complexity O(l1 * l2 * … * lN)
Not feasible in practice (running time too long!)
Heuristic necessary, i.e. fast algorithm that does not necessarily produce mathematically best alignment
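A back-of-the-envelope calculation shows why the exact approach breaks down. The sketch below only counts the cells of the N-dimensional dynamic-programming table; the function name and the example lengths are chosen for illustration.

```python
from math import prod


def dp_table_cells(lengths):
    """Number of cells in the DP table for sequences of the given lengths."""
    return prod(length + 1 for length in lengths)


print(dp_table_cells([100, 100]))     # two sequences of length 100
print(dp_table_cells([100] * 5))      # five such sequences
print(dp_table_cells([100] * 10))     # ten such sequences
```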
'Progressive' Alignment
Most popular approach to (global) multiple sequence alignment:
Progressive Alignment
Since mid-Eighties: Feng/Doolittle, Higgins/Sharp, Taylor, …
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP
Guide tree
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASFQPVAALERIN
WLNYNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN------
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”
WCEAQTKNGQGWVPSNYITPVN-
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN------
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”
WCEAQTKNGQGWVPSNYITPVN--------
WW--RLNDKEGYVPRNLLGLYP--------
AVVIQDNSDIKVVP--KAKIIRD-------
YAVESEA---SVQ--PVAALERIN------
WLN-YNE---ERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”
Most important implementation: CLUSTAL W
CLUSTAL W; Thompson et al., 1994 (~17,000 citations)
Pairwise distances as 1 - percentage of identity
Calculate un-rooted tree with Neighbor Joining
Define root as central position in tree
Define sequence weights based on tree
Gap penalties calculated based on various parameters
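The first of these steps, a distance derived from sequence identity, can be sketched as below. This is a simplified illustration and not CLUSTAL W's actual code: the identity is expressed as a fraction rather than a percentage, and ignoring columns that contain a gap is an assumption made here.

```python
def identity_distance(row1, row2):
    """Distance = 1 - fraction of identical residues in gap-free columns."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue              # assumption: gap columns are not compared
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 1.0                # nothing comparable: maximal distance
    return 1.0 - identical / compared


print(identity_distance("YAIMQEAQQER-", "Y-LMV--QQERV"))
```

Computing this value for every pair of sequences yields the distance matrix from which a guide tree can be built.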
Tools for multiple sequence alignment
Problems with traditional approach:
Results depend on gap penalty
Heuristic guide tree determines alignment; alignment used for phylogeny reconstruction
Algorithm produces global alignments.
But:
Many sequence families share only local similarity
E.g. sequences share one conserved motif
The DIALIGN approach
Morgenstern, Dress, Werner (1996), PNAS 93, 12098-12103
Combination of global and local methods
Assemble multiple alignment from gap-free local pair-wise alignments ("fragments")
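To make the notion of a gap-free local pairwise alignment concrete, the following Python sketch lists exact-match segments on the diagonals of two sequences. This is a strong simplification for illustration only: DIALIGN's fragments may contain mismatches and are weighted statistically, none of which is modelled here. The function name, the `min_length` parameter and the example strings are hypothetical.

```python
def exact_fragments(x, y, min_length=4):
    """Return (start_x, start_y, length) for maximal runs of identical
    characters along each diagonal of the comparison of x and y."""
    fragments = []
    # offset = (position in x) - (position in y) identifies a diagonal
    for offset in range(-(len(y) - 1), len(x)):
        i = max(offset, 0)
        j = max(-offset, 0)
        run = 0
        while i < len(x) and j < len(y):
            if x[i] == y[j]:
                run += 1
            else:
                if run >= min_length:
                    fragments.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_length:     # run that reaches the end of the diagonal
            fragments.append((i - run, j - run, run))
    return fragments


print(exact_fragments("aaGATTACAcc", "ttGATTACAgg", min_length=5))
```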
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
The DIALIGN approach
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac----------gg-ttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
Consistency!
The DIALIGN approach
atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
caaa--GAGTATCAcc----------CCTGaaTTGAATaa
More methods for multiple alignment:
T-Coffee, PIMA, MUSCLE, PRRP, MAFFT, ProbCons
Substitution matrices
Similarity score s(a,b) for amino acids a and b based on probability p_{a,b} of substitution a → b
Idea: it is more reasonable to align amino acids that are often replaced by each other!
Assumptions:
p_{a,b} does not depend on sequence position
Sequence positions independent of each other
p_{a,b} = p_{b,a} (symmetry!)
Compute score s(a,b) for degree of similarity between amino acids a and b:
Probability p_{a,b} of substitution a → b (or b → a), frequency q_a of a
Define s(a,b) = log( p_{a,b} / (q_a q_b) )
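The log-odds definition translates directly into code. The slides do not state the base of the logarithm, so the natural logarithm used below is an assumption; published matrices are usually scaled and rounded as well. The numbers in the example are made up.

```python
import math


def log_odds_score(p_ab, q_a, q_b):
    """s(a,b) = log( p_{a,b} / (q_a * q_b) ), natural logarithm assumed."""
    return math.log(p_ab / (q_a * q_b))


# Pair seen more often than expected by chance: positive score.
print(log_odds_score(0.02, 0.1, 0.1))
# Pair seen less often than expected by chance: negative score.
print(log_odds_score(0.005, 0.1, 0.1))
```

A score of zero means that a and b are aligned exactly as often as expected for unrelated sequences.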
To calculate p_{a,b}:
Consider alignments of related proteins and count substitutions a → b (or b → a)
ESWTS-RQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMS-QRREALYWIALY
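Counting aligned residue pairs in the example alignment above could look like the following sketch. Pairs are stored in sorted order so that a → b and b → a are counted together, in line with the symmetry assumption; skipping gap columns is an assumption of this example. Real estimates need many alignments, not a single pair of sequences.

```python
from collections import Counter


def count_aligned_pairs(row1, row2):
    """Count unordered residue pairs in the gap-free columns."""
    counts = Counter()
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue
        counts[tuple(sorted((a, b)))] += 1
    return counts


row1 = "ESWTS-RQWERYTIALMSDQRREVLYWIALY"
row2 = "ERWTSERQWERYTLALMS-QRREALYWIALY"
pair_counts = count_aligned_pairs(row1, row2)
substitutions = {pair: n for pair, n in pair_counts.items() if pair[0] != pair[1]}
print(substitutions)
```

Dividing each count by the total number of counted columns gives a crude estimate of p_{a,b} for this single alignment.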
Problems involved:
1. Probability p_{a,b} depends on time t since sequences separated in evolution: p_{a,b} = p_{a,b}(t)
2. Protein families contain multiple sequences: phylogenetic tree must be known!
3. Alignment of protein families must be known!
4. Multiple mutations at one sequence position
M. Dayhoff et al., Atlas of Protein Sequence and Structure, 1978
PAM matrices
Calculation of p_{a,b}(t):
Consider multiple alignments of closely related protein families
Count occurrence of a and b at corresponding positions in alignments using phylogenetic tree
Estimate p_{a,b}(t) for small times t
Calculate conditional probabilities p(a|b,t) for small t
Normalize to distance 1 PAM (= percentage of accepted mutations)
Calculate p(a|b,t) for larger evolutionary distances by matrix multiplication
Calculate p_{a,b}(t) for larger evolutionary distances
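The extrapolation step by matrix multiplication can be illustrated with a toy example. The 2-letter alphabet and the numbers in `pam1` are invented for this sketch; a real PAM matrix has 20 x 20 entries estimated from data. Entry [i][j] is read here as the probability that residue i is replaced by residue j over one unit of distance, which is an assumed convention.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, t):
    """m multiplied with itself t times (t >= 0), starting from the identity."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, m)
    return result


# Toy matrix: each residue stays unchanged with probability 0.99 per unit.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_power(pam1, 250)
print(pam250)
```

After many steps the rows approach the background frequencies, which reflects that distant sequences retain little information about the ancestral residue.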
Alternative: BLOSUM matrices
S. Henikoff and J.G. Henikoff, PNAS, 1992
Basis: BLOCKS database, gap-free regions of multiple alignments.
Cluster sequences if percentage of identity > L
Estimate p_{a,b}(t) directly.
Commonly used values: L = 62 (BLOSUM62), L = 50 (BLOSUM50)