Multiple Sequence Alignment (MSA)
Ecole Phylogénomique, Carry le Rouet 2006
Plan
Introduction to sequence alignments
Multiple alignment construction
Traditional approaches
Alignment parameters
Alternative approaches
Multiple alignment main applications
MACSIMS : Multiple Alignment of Complete Sequences Information Management System
Local alignment / Global alignment
Sequence A
Sequence B
Global alignment
Alignment of the sequences over their whole length
G G C T G A C C A C C - T T
|     |   | |     | |   |
G A - T C A C T T C C A T G
Local alignment
Alignment of the high-similarity regions only
G A C C A C C T T
| |   | | |   | |
G A T C A C - T T
Optimal local pairwise alignment: Smith and Waterman, 1981
Optimal global pairwise alignment: Needleman and Wunsch, 1970
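The difference between the two optimal pairwise methods can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the match, mismatch and gap values are arbitrary, the gap cost is linear, and only the optimal score is returned (no traceback). The function names are hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score (Needleman-Wunsch, linear gap cost)."""
    # First row: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal local alignment score (Smith-Waterman, linear gap cost)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets a local alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


if __name__ == "__main__":
    print(nw_score("GGCTGACCACCTT", "GATCACTTCCATG"))
    print(sw_score("GGCTGACCACCTT", "GATCACTTCCATG"))
```

The only structural differences are the initialisation (gap costs versus zeros) and the floor at zero, which is what allows the local method to ignore poorly matching flanking regions.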
Pairwise alignment / Multiple alignment
Query: 177 EMGDTGPCGPCSEIHYDRIGGRDAAHLVNQDDPNVLEIWNLVFIQYNR---EADG----I 229
 G G GP E+ Y LE+ LVF+QY + AD I
Sbjct: 193 AGG--GNAGPAFEVLYKG-----------------LEVATLVFMQYKKAPANADPSQVVI 233

Query: 230 LK-----PLPKKSIDTGMGLERLVSVLQNKMSNYDTDLFVPYFEAIQKGTGARPYTGKVG 284
 +K P+ K +DTG GLERLV + Q + YD L E +++ G ++
Sbjct: 234 IKGEKYVPMETKVVDTGYGLERLVWMSQGTPTAYDAVLGY-VIEPLKRMAGVEKIDERIL 292

Query: 285 AEDA---------DGIDMAYR--------------------------VLADHARTITVAL 309
 E++ D D+ Y +ADH + +T L
Sbjct: 293 MENSRLAGMFDIEDMGDLRYLREQVAKRVGISVEELERLIRPYELIYAIADHTKALTFML 352
[Figure labels: conservation profile, secondary structure]
What is a multiple alignment?
A representation of a set of sequences, in which equivalent residues (e.g. functional or structural) are aligned in columns
Conserved residues
MACS
• Schematic overview of complete alignment, e.g. domain organisation (Interpro)
[Figure key, Interpro domains: SH3, SH2, PI-PLC-X, PI-PLC-Y, PH, C2, CH, rhoGEF, DAG_PE-bind]
Why multiple alignments?
Applications : phylogeny, domain organisation, functional residue identification, 2D/3D structure prediction, transmembrane prediction, …
Integration of a sequence in the context of the protein family
MSA Construction
Multiple alignment construction
Traditional approaches
Optimal multiple alignment
Progressive multiple alignment
Alignment parameters
Residue similarity matrices
Gap penalties
Alternative approaches
Iterative alignment methods
Combinatorial algorithms
PipeAlign : a protein family analysis tool
Traditional Approaches
Optimal multiple alignment
Direct extension of pairwise dynamic programming to N dimensions (Sankoff, 1975).
Examines all possible alignments to find the optimal alignment.
Problems :
- The mathematically optimal alignment is not necessarily the biologically optimal alignment.
- CPU time and memory requirements are prohibitive for practical purposes (the required time is proportional to N^k for k sequences of length N) : limited to <10 sequences.
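To make the N^k growth concrete, the short Python sketch below counts the cells of the dynamic programming lattice for k sequences of equal length. It is a back-of-the-envelope illustration only; the function names are hypothetical and the length of 100 residues is an arbitrary choice.

```python
def dp_cells(length, k):
    """Number of cells in the k-dimensional DP lattice for k sequences of equal length."""
    return (length + 1) ** k


def moves_per_cell(k):
    """Predecessor cells examined per cell: every non-empty subset of sequences may advance."""
    return 2 ** k - 1


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        print(k, dp_cells(100, k), moves_per_cell(k))
```

Going from 2 to 3 sequences of 100 residues already multiplies the number of cells by about 100, which is why the exact method is restricted to a handful of sequences.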
Example : alignment of 3 sequences
Progressive multiple alignment
Heuristic algorithm which avoids calculating all possible alignments, but does not guarantee an ‘optimal’ alignment
Principle : progressively align the sequences (or sequence groups) in pairs
Problem :
Which sequences should be aligned first ? In which order ?
=> first align the closest sequences
How to estimate the distance between the sequences ?
- align all pairs of sequences
- calculate a distance matrix from the pairwise alignments
- construct a guide tree from this distance matrix
- progressive multiple alignment following the branching order in the tree
Step 1 : Pairwise alignment of all sequences
Example : Alignment of 7 globins (Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla)
Ex : pairwise alignment of 2 globin sequences

Hbb_human 1 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
 |.| :|. | | |||| . | | ||| |: . :| |. :| | |||
Hba_human 3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...

Hbb_human 1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
 | |. |||.|| ||| ||| :|||||||||||||||||||||:||||||
Hbb_horse 2 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Hba_human 3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
 || :| | | | || | | ||| |: . :| |. :| | |||.
Hbb_horse 2 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

The alignment can be obtained with :
- a global or local method
- dynamic programming or heuristic methods
Example : in ClustalX => global alignments => choice between
- heuristic method (used in the Fasta program) => faster
- dynamic programming (Smith & Waterman) => better
Step 2 : Distance matrix construction
In ClustalX :
distance between 2 sequences = 1 - (nb of identical residues / nb of compared residues)

               1    2    3    4    5    6    7
1 Hbb_human    -
2 Hbb_horse   .17   -
3 Hba_human   .59  .60   -
4 Hba_horse   .59  .59  .13   -
5 Myg_phyca   .77  .77  .75  .75   -
6 Glb5_petma  .81  .82  .73  .74  .80   -
7 Lgb2_lupla  .87  .86  .86  .88  .93  .90   -

Ex : Hbb_human vs Hbb_horse = 83% identity = 17% distance
Step 3 : Sequential branching / Guide tree construction
- Join the 2 closest sequences
- Recalculate distances and join the 2 closest sequences or nodes
- Step 3 is repeated until all sequences are joined
[Figure: sequential branching order and resulting guide tree for Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla]
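The join-and-recalculate loop of Step 3 can be sketched in Python using the globin distances from Step 2. Note the assumption: this sketch uses UPGMA-style average linkage to recalculate distances, chosen for brevity; it is not the neighbor-joining procedure used by ClustalX. The function name `upgma` and the Newick-like string output are illustrative choices.

```python
from itertools import combinations


def upgma(names, dist):
    """Join the two closest clusters until one remains; return a nested Newick-like string.

    dist maps (name_a, name_b) tuples to distances; average linkage is assumed.
    """
    sizes = {name: 1 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        # Recalculate distances from the new node to every other cluster.
        for c in sizes:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


NAMES = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse",
         "Myg_phyca", "Glb5_petma", "Lgb2_lupla"]
# Lower triangle of the distance matrix shown in Step 2.
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
    [0.81, 0.82, 0.73, 0.74, 0.80],
    [0.87, 0.86, 0.86, 0.88, 0.93, 0.90],
]
DIST = {(NAMES[i], NAMES[j]): LOWER[i][j]
        for i in range(len(NAMES)) for j in range(i)}

tree = upgma(NAMES, DIST)

if __name__ == "__main__":
    print(tree)
```

The two alpha-globins and the two beta-globins are joined first, and the most distant sequence (Lgb2_lupla) is joined last, which is the order in which the progressive alignment would then proceed.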
Step 4 : Progressive alignment
The progressive multiple alignment follows the branching order in the guide tree
[Figure: sequences and groups of already aligned sequences merged pairwise along the guide tree (Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma, Lgb2_lupla)]
[Figure: progressive multiple alignment with helices H1 to H7 labelled]
Progressive multiple alignment methods
[Figure: classification of progressive methods by alignment type (local / global) and by guide tree construction method (SB, ML, UPGMA, NJ); programs shown: SBpima, MLpima, multal, multalign, pileup, clustalx]
SB - Sequential Branching
UPGMA - Unweighted Pair Grouping Method
ML - Maximum Likelihood
NJ - Neighbor-Joining
Alignment Parameters
Residue similarity matrices
Dynamic programming methods score an alignment using residue similarity matrices, containing a score for matching all pairs of residues
For proteins, a wide variety of matrices exist: Identity, PAM, Blosum, Gonnet etc.
[Figure: PAM 250 matrix]
Matrices range from strict ones for comparing closely related sequences to soft ones for very divergent sequences.
Matrices are generally constructed by observing the mutations in large sets of alignments, either sequence-based or structure-based
ClustalW automatically selects a suitable matrix depending on the observed pairwise % identity.
A single best matrix does not exist!!
Gap penalties
A gap penalty is a cost for introducing gaps into the alignment, corresponding to insertions or deletions in the sequences
• Linear penalty (fixed cost per gap position) : P = a L, with L the length of the gap
• Affine penalty : P = x + y L
x : gap opening penalty (gop)
y : gap extension penalty (gep)
• Position-specific and residue-specific penalties :
ex : in ClustalW, gap penalties are :
- lowered at existing gaps
- increased close to (less than 8 residues from) existing gaps
- lowered in hydrophilic stretches (loops)
otherwise : gap opening penalties are modified according to the observed relative frequencies of residues adjacent to gaps (Pascarella & Argos, 1992)
SFGDLSNPGAVMG
HF-DLS-----HG
The goal is to introduce gaps in sequence segments corresponding to flexible regions of the protein structure
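The two length-dependent penalty formulas can be compared on the gapped sequence shown above. This is a minimal Python sketch; the parameter values (a = 2, gop = 10, gep = 1) are arbitrary illustrations, not the defaults of any program, and the function names are hypothetical.

```python
import re


def gap_lengths(aligned_seq):
    """Lengths of the consecutive gap runs in one aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]


def linear_penalty(length, a):
    """P = a * L: every gap position costs the same."""
    return a * length


def affine_penalty(length, gop, gep):
    """P = x + y * L: opening a gap costs gop, each gap position costs gep."""
    return gop + gep * length


if __name__ == "__main__":
    runs = gap_lengths("HF-DLS-----HG")
    print(runs)
    print(sum(linear_penalty(n, a=2) for n in runs))
    print(sum(affine_penalty(n, gop=10, gep=1) for n in runs))
```

With an affine penalty, one long gap is cheaper than several short gaps covering the same number of positions, which favours grouping insertions and deletions in a few places such as loops.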
Alternative Approaches
Iterative alignment methods
Iterative alignment, e.g. PRRP (Gotoh, 1993)
- refines an initial progressive multiple alignment by iteratively dividing the alignment into 2 profiles and realigning them
Genetic algorithms, e.g. SAGA (Notredame et al, 1996)
- iteratively refine an alignment using genetic algorithms (a population of alignments evolves in a quasi-evolutionary manner)
Segment-to-segment alignment: DIALIGN (Morgenstern et al. 1999)
- searches for locally conserved motifs in all sequences and compares segments of sequences instead of single residues
Hidden Markov Models, e.g. HMMER (Eddy, 1998), SAM (Karplus et al, 2001)
- iteratively refine an alignment using HMMs
Multiple alignment methods
[Figure: classification of methods as progressive or iterative, local or global, and by tree-building or optimisation strategy (SB, ML, UPGMA, NJ, genetic algorithm, HMM); programs shown: SBpima, MLpima, multal, multalign, pileup, clustalx, dialign, saga, hmmt, prrp]
BAliBASE: objective evaluation of MACS programs
• High-quality alignments based on 3D structural superpositions and manually verified
• Alignments compared only in reliable ‘core blocks’, excluding non-superposable regions
• Separate reference sets specifically designed to address distinct alignment problems

Reference set   Description
1               small number of sequences: divergence, length
2               a family with 1 to 3 orphans
3               several sub-families
4               long N/C terminal extensions
5               long insertions
6               repeats
7               transmembrane regions
8               circular permutations

BAliBASE1 : Thompson et al. 1999 Bioinformatics
BAliBASE2 : Bahr et al, 2001 Nucl Acids Res.
Comparison of multiple alignment methods
=> Reference alignments are needed to evaluate alignment programs
BAliBASE (Thompson et al. Bioinformatics. 1999) : benchmark database
• Alignments based on 3D structure superposition
• Alignments must be compared for the superposable regions
• Alignments take into account :
- the effect of the number of sequences
- the effect of the sequence length
- the effect of the sequence similarity
- alignment of an orphan sequence with a sequence family
- sub-family alignments
- alignments of sequences with different lengths (insertions, extensions)
> 35% identity : any method
Local / global methods
- Colinear sequences => global methods
- N/C-terminal extensions or insertions => local methods
Progressive / iterative methods
Iterative algorithms usually improve alignment quality. Problems :
- Can give a bad alignment in the case of orphan sequences
- The iterative process can be very long !
Example : alignment of 89 histone sequences (66-92 residues) :
ClustalW : 2 mins 41 secs
PRRP : 3 hours 40 mins
Dialign : 3 hours 48 mins
To increase the alignment quality, as many sequences as possible have to be integrated !
DbClustal: local and global algorithm coupling
[Figure: DbClustal workflow. A Blast database search with the query sequence (domains A, B, C) returns database hits; Ballast derives anchors from the hits; DbClustal uses the anchors to build the alignment]
[Figure: ClustalW / DbClustal comparison]
Combinatorial algorithms
• T-Coffee (Notredame et al. 2000) http://igs-server.cnrs-mrs.fr/Tcoffee/
performs local and global alignments for all pairs of sequences, then combines them in a progressive multiple alignment, similar to ClustalW.
• DbClustal (Thompson et al. 2000) http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustal+noid
designed to align the sequences detected by a database search. Locally conserved motifs are detected using the Ballast program (Plewniak et al. 1999) and are used in the global multiple alignment as anchor points.
• MAFFT (Katoh et al. 2002) http://timpani.genome.ad.jp/%7Emafft/server
detects locally conserved segments using a Fast Fourier Transform, then uses a restricted global DP and a progressive algorithm
• MUSCLE (Edgar, 2004) http://www.drive5.com/muscle
kmer distances and log-expectation scores, progressive and iterative refinement
• PROBCONS (Do et al, 2005) http://probcons.stanford.edu
pairwise consistency based on an objective function
Multiple Alignment Quality
Truncated Alignments

Program        Ref1 V1   Ref1 V2    Ref2      Ref3        Ref4         Ref5         Time
               (<20%)    (20-40%)   orphans   subgroups   extensions   insertions   (sec)
ClustalW1.83   0.42      0.78       0.42      0.52        0.41         0.38         902
Dialign2.2.1   0.31      0.71       0.37      0.39        0.45         0.43         5993
Mafft5.32      0.44      0.78       0.49      0.53        0.47         0.48         96
Maffti5.32     0.54      0.83       0.56      0.60        0.49         0.57         327
Muscle3.51     0.52      0.82       0.50      0.58        0.46         0.54         523
Muscle_fast    0.40      0.77       0.43      0.44        0.35         0.49         34
Muscle_med     0.45      0.80       0.50      0.59        0.44         0.51         219
Tcoffee2.66    0.47      0.84       0.50      0.64        0.54         0.58         216133
Probcons1.1    0.63      0.87       0.60      0.65        0.54         0.63         19035

muscle_fast : muscle -maxiters=1 -diags1 -sv -distance1 kbit20_3
muscle_medium : muscle -maxiters=2

1. Significant improvement in accuracy/efficiency since 2000
2. Twilight zone still exists
3. Probcons scores best in all tests, but is MUCH slower than MAFFT or MUSCLE
4. MAFFTI scores slightly better than MUSCLE in all tests, and is more efficient
Multiple Alignment Quality
Comparison: truncated (T) versus full-length (FL) sequences

Program        Ref1 V1 (<20%)   Ref1 V2 (20-40%)   Ref2: orphans   Ref3: subgroups   Time (sec) for all refs
               T      FL        T      FL          T      FL       T      FL         T        FL
ClustalW1.83   0.42   0.24      0.78   0.72        0.42   0.20     0.52   0.27       902      2227
Dialign2.2.1   0.31   0.26      0.71   0.70        0.37   0.29     0.39   0.31       5993     12595
Mafft5.32      0.44   0.25      0.78   0.75        0.49   0.35     0.53   0.38       96       312
Maffti5.32     0.54   0.35      0.83   0.80        0.56   0.40     0.60   0.50       327      1409
Muscle3.51     0.52   0.34      0.82   0.79        0.50   0.36     0.58   0.39       523      3608
Muscle_fast    0.40   0.28      0.77   0.72        0.43   0.29     0.44   0.33       34       132
Muscle_med     0.45   0.29      0.80   0.74        0.50   0.34     0.59   0.38       219      1601
Tcoffee2.66    0.47   0.35      0.84   0.82        0.50   0.40     0.64   0.49       216133   341578
Probcons1.1    0.63   0.43      0.87   0.86        0.60   0.41     0.65   0.54       19035    58488

1. Loss of accuracy is greater in the twilight zone (Ref1 V1, orphans, and subgroups)
2. Probcons still scores best in all tests
3. MAFFT still scores better than MUSCLE in all tests
Multiple alignment quality
Development of objective functions to estimate multiple alignment quality
• Sum-of-pairs (Carrillo, Lipman, 1988) : sums the scores of all pairs of sequences (based on a similarity matrix and gap penalty)
• Relative Entropy : uses a normalized log-likelihood ratio to measure the degree of conservation for each column (identical residues only)
• MD (column scores used in ClustalX) : uses a comparison matrix (Gonnet) to take into account similar residues
• norMD (Thompson et al, 2001)
- scores by column using a substitution matrix and gap penalties
- normalisation according to the sequences to align (their number, length and the similarity between them)
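As an illustration of the simplest of these objective functions, the following Python sketch computes a sum-of-pairs score column by column. It rests on simplifying assumptions that are not in the original definition: a toy match/mismatch scheme replaces a real similarity matrix, each residue-gap pair receives a fixed cost instead of an affine gap penalty, and gap-gap pairs are ignored. The function name is hypothetical.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores of all sequence pairs over all alignment columns."""
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # two gaps facing each other carry no information
            if x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


if __name__ == "__main__":
    print(sum_of_pairs(["ACDE", "ACDE", "AC-E"]))
```

Because every pair contributes equally, the score is dominated by the largest sub-family, which is one reason normalised measures such as norMD were developed.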
Evaluation of Objective Functions using BAliBase
Multiple sequence alignment editors
SeqLab (GCG Wisconsin Package)
SeaView (Galtier et al, 1996) http://pbil.univ-lyon1.fr/software/seaview.html
WEB servers :
GeneAlign (Kurukawa) http://www.gen-info.osaka-u.ac.jp/geneweb2/genealign/
Jalview (Clamp, 1998) http://www.ebi.ac.uk/~michele/jalview/
CINEMA (Lord et al, 2002) http://www.bioinf.man.ac.uk/dbbrowser/cinema-mx
No automatic method is 100% reliable.
Manual verification and refinement are essential!
FASTA format
>O88763 Phosphatidylinositol 3-kinase.
------MGEAEKFHYIYSCDLDINVQLKIGSLEGKREQKSYKAVLEDPMLKFSGLYQETCSDLYVTCQVFAEGKPLALPVRTSYKPFSTRWN-WNEWLKLPVKYPDLPRNAQVALTIWD------VYGPG-RAVPVGGTTVSLFGKYGMFRQGMHDLKVWPNVEADGSEPTRTPGRTSSTLSEDQMSRLAKLTKAHRQGHMVKVLDRLTFREIEMINESEKRSS--NFMYLMVEFRCVKCDDKE-YGIVYYE----
>Q9W1M7 CG5373-PA (GH13170p).
-----MDQPDDHFRYIHSSSLHERVQIKVGTLEGKKRQPDYEKLLEDPILRFSGLYSEEHPSFQVRLQVFNQGRPYCLPVTSSYKAFGKRWS-WNEWVTLPLQFSDLPRSAMLVLTILD------CSGAG-QTTVIGGTSISMFGKDGMFRQGMYDLRVWLGVEGDGNFPSRTPGK-GKESSKSQMQRLGKLAKKHRNGQVQKVLDRLTFREIEVINEREKRMS--DYMFLMIEFPAIVVDDMYNYAVVYFE----
>Q7PMF0 ENSANGP00000002906 (Fragment).
------------LRYIGSSSLLQKISIKIGTLEGENVGYSYEKLIEQPLLKFSGMYTEKTPPLKVKLQIFDNGEPVGLPVCTSHKHFTTRWS-WNEWVTLPLRFTDISRTAVLGLTIYD------CAGGREQLTVVGGTSISFFSTNGLFRQGLYDLKVWPQMEPDGACNSITPGK-AITTGVHQMQRLSKLAKKHRNGQMEKILDRLTFRELEVINEMEKRNS--QFLYLMVEFPQVYIHEKL-YSVIHLE----
>Q9TXI7 Related to yeast vacuolar protein sorting factor protein 34
MIPGMRATPTESFSFVYSCDLQTNVQVKVAEFEG-----IFRDVLN-PVRRLNQLFAEITVYCNNQQIGYPVCTSFHTPPDSSQLARQKLIQKWNEWLTLPIRYSDLSRDAFLHITIWEHEDDEIVNNSTFSRRLVAQSKLSMFSKRGILKSGVIDVQMNVSTTPDPFVKQPETWKYSDAWG-DEIDLLFKQVTRQSRGLVEDVLDPFASRRIEMIRAKYKYSSPDRHVFLVLEMAAIRLGPTF-YKVVYYEDETK
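An aligned FASTA file like the one above can be read with a few lines of Python. This sketch assumes the usual layout (a header line starting with ">" whose first word is the identifier, followed by one or more sequence lines); the function name and the toy records are hypothetical.

```python
def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is a description.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


EXAMPLE = """>seq1 toy description
--MGEAEKF
HYIYS
>seq2 another toy record
MIPGMRATPTESFS
"""

alignment = parse_fasta(EXAMPLE)

if __name__ == "__main__":
    for key, seq in alignment.items():
        print(key, len(seq), seq)
```

In an aligned FASTA file all sequences have the same length once gap characters are included, which is a quick sanity check after parsing.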
MSF format (Multiple Sequence File)
toto.msf MSF: 256 Type: P May 24, 2005 19:34 Check: 3415 ..

Name: O88763 Len: 256 Check: 9443 Weight: 1.00
Name: Q9W1M7 Len: 256 Check: 1161 Weight: 1.00
Name: Q7PMF0 Len: 256 Check: 8095 Weight: 1.00
Name: Q9TXI7 Len: 256 Check: 4716 Weight: 1.00
//

1 50
O88763 ......MGEA EKFHYIYSCD LDINVQLKIG SLEGKREQKS YKAVLEDPML
Q9W1M7 .....MDQPD DHFRYIHSSS LHERVQIKVG TLEGKKRQPD YEKLLEDPIL
Q7PMF0 .......... ..LRYIGSSS LLQKISIKIG TLEGENVGYS YEKLIEQPLL
Q9TXI7 MIPGMRATPT ESFSFVYSCD LQTNVQVKVA EFEG.....I FRDVLN.PVR

51 100
O88763 KFSGLYQETC SDLYVTCQVF AEGKPLALPV RTSYKPFSTR WN.WNEWLKL
Q9W1M7 RFSGLYSEEH PSFQVRLQVF NQGRPYCLPV TSSYKAFGKR WS.WNEWVTL
Q7PMF0 KFSGMYTEKT PPLKVKLQIF DNGEPVGLPV CTSHKHFTTR WS.WNEWVTL
Q9TXI7 RLNQLFAEIT VYCNNQQIGY PVCTSFHTPP DSSQLARQKL IQKWNEWLTL

101 150
O88763 PVKYPDLPRN AQVALTIWD. .....VYGPG .RAVPVGGTT VSLFGKYGMF
Q9W1M7 PLQFSDLPRS AMLVLTILD. .....CSGAG .QTTVIGGTS ISMFGKDGMF
Q7PMF0 PLRFTDISRT AVLGLTIYD. .....CAGGR EQLTVVGGTS ISFFSTNGLF
Q9TXI7 PIRYSDLSRD AFLHITIWEH EDDEIVNNST FSRRLVAQSK LSMFSKRGIL

151 200
O88763 RQGMHDLKVW PNVEADGSEP TRTPGRTSST LSEDQMSRLA KLTKAHRQGH
Q9W1M7 RQGMYDLRVW LGVEGDGNFP SRTPGK.GKE SSKSQMQRLG KLAKKHRNGQ
Q7PMF0 RQGLYDLKVW PQMEPDGACN SITPGK.AIT TGVHQMQRLS KLAKKHRNGQ
Q9TXI7 KSGVIDVQMN VSTTPDPFVK QPETWKYSDA WG.DEIDLLF KQVTRQSRGL

201 250
O88763 MVKVLDRLTF REIEMINESE KRSS..NFMY LMVEFRCVKC DDKE.YGIVY
Q9W1M7 VQKVLDRLTF REIEVINERE KRMS..DYMF LMIEFPAIVV DDMYNYAVVY
Q7PMF0 MEKILDRLTF RELEVINEME KRNS..QFLY LMVEFPQVYI HEKL.YSVIH
Q9TXI7 VEDVLDPFAS RRIEMIRAKY KYSSPDRHVF LVLEMAAIRL GPTF.YKVVY

251
O88763 YE....
Q9W1M7 FE....
Q7PMF0 LE....
Q9TXI7 YEDETK
With an editor …
PipeAlign : protein family analysis tool
Plewniak et al, 2003
http://bips.u-strasbg.fr/PipeAlign/
INPUT: single sequence OR set of unaligned sequences
Pipeline steps :
• BlastP search
• Identify motifs
• Build multiple alignment
• Refine alignment
• Correct alignment errors
• Remove unrelated seq.
• Cluster sequences
• Validate alignment
Intermediate and final outputs : list of homologs, conservation profile, MACS of user-specified homologs, refined MACS, MACS of validated homologs, validated MACS
Integrated family and sub-family analysis
Identification of key residues, domain organisation, mean predictions of cellular location, transmembrane regions, 2D/3D structures, phylogeny studies, etc.
MSA Main Applications
MSA : central role in biology
[Figure: MACS at the centre of the following application areas]
- Structure comparison, modelling
- Interaction networks
- Hierarchical function annotation: homologs, domains, motifs
- Phylogenetic studies
- Human genetics, SNPs
- Therapeutics, drug discovery
- Therapeutics, drug design
- Gene identification, validation
- RNA sequence, structure, function
- Comparative genomics
[Figure labels: DBD, LBD, insertion domain, binding sites / mutations]
MACS : new landscape
High volume & heterogeneity of sequence data
• Length : from tens of amino acids or nucleotides to thousands or millions (genomes)
• Number : from tens up to thousands of sequences
• Variability : from small percent identity to almost identical
• Complexity of the sequences to be aligned :
- Family with linear or highly irregular distribution of sequence variability
- Heterogeneity of length, structure or composition (large insertions or extensions, repeats, circular permutations, transmembrane regions…)
• Fidelity : from 15-30% errors (sequence, eukaryotic gene prediction, annotation…)
MACS : new concepts
Distinct objectives imply distinct needs & strategies
• Overview of one sequence family to quickly infer and integrate information from a limited number of closely related, well annotated sequences (reliable and efficient)
• Exhaustive analysis of one sequence family (very high quality) for
- homology modeling
- phylogenetic studies
- subfamily-specific features (differentially conserved domains, regions or residues)
• Massive analysis of sets of sequences (reliable/high quality and efficient)
- phylogenetic distribution, co-presence and co-absence and structural complex
- genome annotation
- target characterisation for functional genomics studies (transcriptomics…)
Residue conservation identification
- residues conserved in all sequences in the family : structural or functional importance, characteristic motifs
- residues conserved within a sub-group of sequences : discriminant residues
Ordered Alignment analysis of TyrRS
[Figure: ordered alignment of TyrRS sequences (Euc, Arc, Bac) showing Motif I, Motif II, the N-terminal extension, the C-terminal extension, the EMAP domain and the S4 domain; scale bar 10 aa]
Phylogenetic studies
Multiple alignments = basis for calculating the levels of similarity between sequences
Multiple alignments = basis for calculating evolutionary distances between sequences
Multiple alignments = basis for computing phylogenetic trees
Building a high-quality phylogenetic tree requires a high-quality multiple sequence alignment
[Figure: phylogenetic tree labelled "Whole alignment"; groups labelled Archaea, Eucarya, and Bacteria + Mitochondria]
[Figure: phylogenetic tree (scale bar 0.1) labelled "N terminus" and "global gap removal"; groups labelled Bacteria, Archaea, Mito. and Eukarya]
Schematic alignment of Aspartyl-tRNA synthetases
[Figure: schematic alignment of Euc, Arc and Eub sequences along alignment positions 180 to 930; annotated regions: anticodon binding domain, catalytic core I, catalytic core II, motif I, flipping loop, motif II, insertion domain, motif III; conserved residues are marked by their one-letter codes]
Protein sequence validation
Sequencing / frameshift error detection
Example: transcription TFIIH complex protein
Estimation: 44% of predicted proteins from genome sequencing projects and 31% of high-throughput cDNA (HTC) contain errors in their intron/exon structure.
Bianchetti et al, 2005
Multiple alignment of complete sequences
Determination of sequence groups
Hierarchical clustering of positions based on insertion/deletion
Definition of blocks
N-terminal region analysis :
• Reference position
• Proposed N-terminus : potential start codon closest to the reference position
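The rule "potential start codon closest to the reference position" can be illustrated with a toy Python function. This is only a sketch under strong assumptions: it works on an aligned protein sequence, treats every methionine (M) as a potential start, measures distance in alignment columns, and breaks ties in favour of the upstream candidate. The function name `propose_start` is hypothetical and this is not the published implementation.

```python
def propose_start(aligned_seq, reference_col):
    """Return the column of the M closest to reference_col, or None if there is no M."""
    candidates = [i for i, residue in enumerate(aligned_seq) if residue == "M"]
    if not candidates:
        return None
    # Closest column first; on a tie, prefer the smaller (upstream) column.
    return min(candidates, key=lambda i: (abs(i - reference_col), i))


if __name__ == "__main__":
    # Toy sequence with an N-terminal extension relative to the reference position.
    print(propose_start("MXXXXXXMXXXMXXXXX", reference_col=8))
```

In practice the reference position would come from the sequence group in the clustered alignment, so the same rule can shorten an over-extended prediction or extend a truncated one.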
[Figure: schematic clustered alignment (X = residue, M = potential start codon) showing the reference position and an N-terminal extension]
Clustered MACS : Starter
° 3000 proteins from B. subtilis with wrong, randomly generated N-termini : 82% predicted
° For the 3828 proteins from the Vibrio cholerae proteome : 817 specific / 1722 valid start codons / 236 “wrong” (from 1 up to 56 aas)
Bianchetti et al. (2005) JBCB
http://igbmc.u-strasbg.fr/vALId/
Clustered MACS : vAlid
Clustered MACS : DbW
DBWatcher [Plewniak, IGBMC]
[Figure: DbWatcher workflow: user sequence, daily Blastp against the databases (proteins, structures), filter, clustering, characterization of the specificity of the homologous sequences, integration of the sub-family members, automatic daily update]
• Automatic update of more than 300 different protein families => 24 AaRS (aminoacyl-tRNA synthetases), nuclear receptors, ribosomal proteins, transcription factors…
Prigent et al. (2005) Bioinformatics
Clustered MACS : GOAnno
GoAnno : finds a pertinent level automatically and propagates Gene Ontology terms to an unannotated target protein according to clustered MACS
989 target proteins from retinal transcriptome analysis
795 proteins with GO terms (increase of 47 %)
3085 GO terms (increase of 92 %)
[Figure: Gene Ontology hierarchy (levels 0 to 6) for the subfamily of the query, with per-node term counts; nodes shown: Gene_Ontology, biological_process, physiological processes, cellular process, metabolism, cell communication, nucleobase, nucleoside, nucleotide and nucleic acid metabolism, transcription, regulation of transcription, signal transduction; thresholds shown: minV(Horiz) = 21 * F, minV(Verti) = MaxBranch * P]
Chalmel et al. (2005) Bioinformatics
Protein 3D structure prediction
Proteins with similar sequences tend to fold into similar structures
Applicable to ~60% of proteins from fully sequenced genomes
Basic steps for comparative (homology) modelling :
1. Identify a template structure
2. Align the target sequence to the template sequence
3. Copy the backbone coordinates from template to the matching residues in the target sequence
4. Build the side-chains (copied for identical residues, predicted for non-identical)
5. Model the loop regions
6. Optimise (energy refinement)
Above 50% identity, a pairwise alignment is enough for an accurate model
Below 50% identity, a multiple alignment is better
Protein functional characterisation
By homology : similar sequences generally share similar structures and often have similar functions
Propagation of information from a known sequence to an unknown one, e.g. domains, active sites, cellular localisation, post-translational modifications, …
1. Database search for homologues, e.g. BlastP, PSI-Blast
2. Domain databases, e.g. Interpro (EBI), CDD (NCBI)
3. Multiple alignment construction and analysis, e.g. PipeAlign
MSA applications : Summary
[Figure: annotated alignment of two families (Bacteria, Archaea, Eucarya) showing an additional domain, intra-group conservation, universal conservation, differential conservation between the two families, a transmembrane region, an NLS, a phosphorylation site and an error in ORF definition]
Information extracted : domain organization, structural motifs, key functional residues, ORF definition, localization signals, conservation pattern...
Uses : functional genomics, evolutionary studies, structure modeling, drug design, mutagenesis experiments
Lecompte et al Gene. 2001
MACSIMS
MAO : Multiple Alignment Ontology
http://www-igbmc.u-strasbg.fr/BioInfo/MAO/mao.html
Also available from OBO web site: http://obo.sourceforge.net
MAO consortium:
- RNA analysis (Steve HOLBROOK, Berkeley)
- MACS algorithm (Kazutake KATOH, Kyoto)
- Protein 3D analysis (Patrice KOEHL, Davis)
- Protein 3D structure (Dino MORAS, Strasbourg)
- 3D RNA structure (Eric WESTHOF, Strasbourg)
Thompson et al. (2005) Nucleic Acids Res.
MACSIMS
Multiple Alignment of Complete Sequences Information Management System
Structural and functional information is mined automatically from the public databases
Homologous regions are identified in the MACS
Mined data is evaluated and cross-validated
Mined data is propagated from known to unknown sequences with the homologous regions
MACSIMS provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist
Thompson et al BMC Bioinformatics 2006
MACSIMS : http://bips.u-strasbg.fr/MACSIMS/
• Schematic overview of complete alignment, e.g. domain organisation (Interpro)
[Figure key, Interpro domains: SH3, SH2, PI-PLC-X, PI-PLC-Y, PH, C2, CH, rhoGEF, DAG_PE-bind]
MACSIMS visualisation
JalView II, Coll. G. Barton
BAliBASE reference 3: aldehyde dehydrogenase-like
[Figure: MACSIMS view with Uniprot annotations (NAD binding, active site) marked on the alignment; conserved E and C residues and the aligned segments GSVPTG, GSTKVG, GETRTG, GSTEVG, GSVSAG, GSRDVG, GSTNVF and GSTAVF are highlighted]
Summary
Choice of multiple alignment method
traditional progressive method (e.g. clustalw / clustalx)
combined local and global method (e.g. mafft, muscle, dbclustal)
knowledge-based method (e.g. PipeAlign)
Web Server versus Local Installation ?
WARNING: Automatic alignment methods can make mistakes.
Verify alignment quality by automatic methods (e.g. norMD) and visual inspection !
Multiple alignment applications
Traditional applications:
phylogeny
conserved residue / motif identification
Information in multiple alignments also improves accuracy in:
sequence error detection
structure prediction
functional annotation
Laboratory of Integrative Genomics and Bioinformatics, IGBMC, Strasbourg
Iterative Refinement
PRRP (Gotoh, 1993) refines an initial progressive multiple alignment by iteratively dividing the alignment into 2 profiles and realigning them.
[Flowchart: initial alignment -> divide sequences into 2 groups -> profile 1 and profile 2 -> pairwise profile alignment -> refined alignment -> converged? (no: repeat)]
Genetic Algorithms
SAGA (Notredame, Higgins, 1996) evolves a population of alignments in a quasi-evolutionary manner, iteratively improving the fitness of the population
[Flowchart: population n -> select a number of individuals to be parents -> modify the parents by shuffling gaps, merging 2 alignments etc. -> evaluation of the fitness using OF (sum-of-pairs or COFFEE) -> population n+1 -> END]
HMM
• Probabilistic model for sequence profiles, visualized as a finite state machine
• For each column of the alignment, a match state models the distribution of residues allowed
• Insert and delete states at each column allow for insertion or deletion of one or more residues
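The idea that a match state models the distribution of residues allowed in a column can be sketched by estimating emission probabilities from one alignment column. This Python example is a simplified illustration: it assumes the 20 standard amino acids, ignores gaps, and uses a uniform pseudocount of 1, whereas real profile HMM packages use more elaborate priors. The function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_emissions(alignment, column, pseudocount=1.0):
    """Estimate match-state emission probabilities for one alignment column."""
    counts = Counter(seq[column] for seq in alignment if seq[column] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    # A missing residue has count 0, so it still receives the pseudocount mass.
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    toy_alignment = ["AKYL", "AKWL", "A-WL"]
    probs = match_emissions(toy_alignment, column=2)
    print(round(probs["W"], 3), round(probs["Y"], 3), round(probs["A"], 3))
```

The pseudocount keeps unseen residues from receiving zero probability, so a sequence carrying a new residue at that column can still be aligned to the model.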
Original profile HMM (Krogh et al, 1994)
[Figure: profile HMM with match states, insert states and delete / begin / end states, shown with a small example alignment]
Multiple Alignment using HMM
HMMER (Eddy, unpublished)
SAM-T98 (Hughey, 1996)
[Flowchart labels: generate initial alignment (Baum-Welch expectation maximization); produce a model; generate new alignment (Viterbi algorithm or posterior decoding); evaluate alignment (expectation maximization); END]
Segment-to-segment Alignment
Dialign (Morgenstern et al. 1996) compares segments of sequences instead of single residues
1. construct dot-plots of all possible pairs of sequences
2. find a maximal set of consistent diagonals in all the sequences
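The first step, finding diagonals in the dot-plot of two sequences, can be sketched in Python as below. The sketch makes simplifying assumptions: a diagonal is a gap-free run of identical residues of a minimum length, whereas Dialign weights segments by similarity, and the consistency step across all sequences is not shown. The function name `diagonals` is hypothetical.

```python
def diagonals(seq_i, seq_j, min_len=3):
    """Return (start_i, start_j, length) for each run of identical residues on a dot-plot diagonal."""
    found = []
    # Each diagonal of the dot-plot is defined by offset = i - j.
    for offset in range(-(len(seq_j) - 1), len(seq_i)):
        i = max(offset, 0)
        j = i - offset
        run_start = None
        while i < len(seq_i) and j < len(seq_j):
            if seq_i[i] == seq_j[j]:
                if run_start is None:
                    run_start = (i, j)
            else:
                if run_start is not None and i - run_start[0] >= min_len:
                    found.append((run_start[0], run_start[1], i - run_start[0]))
                run_start = None
            i += 1
            j += 1
        # Close a run that reaches the end of the diagonal.
        if run_start is not None and i - run_start[0] >= min_len:
            found.append((run_start[0], run_start[1], i - run_start[0]))
    return found


if __name__ == "__main__":
    print(diagonals("xxALFDFyy", "zALFDFq"))
```

Residues that fall between the selected diagonals are simply left unaligned, which is what makes the approach local.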
[Figure: dot-plot of sequence i against sequence j]
.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..
Local alignment - residues between the diagonals are not aligned