
Multiple Sequence Alignment (MSA)

Ecole Phylogénomique, Carry le Rouet 2006

Plan

Introduction to sequence alignments

Multiple alignment construction

Traditional approaches

Alignment parameters

Alternative approaches

Multiple alignment main applications

MACSIMS : Multiple Alignment of Complete Sequences Information Management System


Local alignment / Global alignment

Sequence A

Sequence B

Global alignment

The sequences are aligned over their whole length

G G C T G A C C A C C - T T

| | | | | | |

G A - T C A C T T C C A T G

Local alignment

Alignment of the high similarity regions

G A C C A C C T T

| | | | | | |

G A T C A C - T T

Optimal local pairwise alignment : Smith and Waterman, 1981

Optimal global pairwise alignment : Needleman and Wunsch, 1970
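
The difference between the two optimal pairwise methods can be sketched with a few lines of dynamic programming. This is a minimal illustration only: the match, mismatch and gap values are assumptions chosen for readability (they are not from the slides), and a real program would use a residue similarity matrix and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning a and b over their whole length."""
    prev = [j * gap for j in range(len(b) + 1)]      # first row: b aligned to gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                             # first column: a aligned to gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,         # align a[i-1] with b[j-1]
                            prev[j] + gap,           # gap in b
                            curr[j - 1] + gap))      # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of sub-sequences of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # a local alignment may restart anywhere, hence the floor at 0
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

With these toy scores, two sequences sharing only a short conserved region get a positive local score even when their global score is negative.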


Pairwise alignment / Multiple alignment

Query: 177 EMGDTGPCGPCSEIHYDRIGGRDAAHLVNQDDPNVLEIWNLVFIQYNR---EADG----I 229
G G GP E+ Y LE+ LVF+QY + AD I
Sbjct: 193 AGG--GNAGPAFEVLYKG-----------------LEVATLVFMQYKKAPANADPSQVVI 233

Query: 230 LK-----PLPKKSIDTGMGLERLVSVLQNKMSNYDTDLFVPYFEAIQKGTGARPYTGKVG 284
+K P+ K +DTG GLERLV + Q + YD L E +++ G ++
Sbjct: 234 IIKGEKYVPMETKVVDTGYGLERLVWMSQGTPTAYDAVLGY-VIEPLKRMAGVEKIDERIL 292

Query: 285 AEDA---------DGIDMAYR--------------------------VLADHARTITVAL 309
E++ D D+ Y +ADH + +T L
Sbjct: 293 MENSRLAGMFDIEDMGDLRYLREQVAKRVGISVEELERLIRPYELIYAIADHTKALTFML 352

Conservation profile

Secondary structure

What is a multiple alignment?

A representation of a set of sequences, in which equivalent residues (e.g. functional or structural) are aligned in columns

Conserved residues


MACS

• Schematic overview of complete alignment, e.g. domain organisation (Interpro)

SH3

SH2

PI-PLC-X

PI-PLC-Y

PH C2

CH

rhoGEF

DAG_PE-bind

Key:


Why multiple alignments?

Applications : phylogeny, domain organisation, functional residue identification, 2D/3D structure prediction, transmembrane prediction, …

Integration of a sequence in the context of the protein family


MSA Construction


Multiple alignment construction

Traditional approaches

Optimal multiple alignment

Progressive multiple alignment

Alignment parameters

Residue similarity matrices

Gap penalties

Alternative approaches

Iterative alignment methods

Combinatorial algorithms

PipeAlign : a protein family analysis tool


Traditional Approaches


Optimal multiple alignment

Direct extension of pairwise dynamic programming to N dimensions (Sankoff, 1975).

All possible alignments are examined to find the optimal alignment.

Problems :

- The mathematically optimal alignment is not necessarily the biologically optimal alignment

- CPU time and memory required are prohibitive for practical purposes (the required time is proportional to N^k for k sequences of length N) : limited to <10 sequences

Example : alignment of 3 sequences
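
The N^k growth can be made concrete by counting the cells of the dynamic-programming lattice. This is only a back-of-the-envelope sketch: the function names are hypothetical, and the count ignores the pruning heuristics that real implementations use.

```python
def dp_cells(seq_len, n_seqs):
    """Cells in the full DP lattice for n_seqs sequences of length seq_len."""
    return (seq_len + 1) ** n_seqs


def predecessors_per_cell(n_seqs):
    """Each cell looks at every non-empty subset of sequences that advances."""
    return 2 ** n_seqs - 1


# Sequences of 100 residues: the lattice grows by a factor of ~100 per added sequence.
for k in (2, 3, 5, 10):
    print(k, dp_cells(100, k), predecessors_per_cell(k))
```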


Principle : Progressively align the sequences (or sequence groups) by pair

Problem :

Which sequences to begin with ? In which order ?

first align closest sequences

How to estimate the distance between the sequences ?

align all pairs of sequences

calculate a distance matrix from the pairwise alignments

construct a guide tree from this distance matrix

progressive multiple alignment following branching order in tree


Heuristic algorithm which avoids calculating all possible alignments, but does not guarantee an ‘optimal’ alignment



Step 1 : Pairwise alignment of all sequences

Hbb_human 1 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
|.| :|. | | |||| . | | ||| |: . :| |. :| | |||
Hba_human 3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...

Hbb_human 1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
| |. |||.|| ||| ||| :|||||||||||||||||||||:||||||
Hbb_horse 2 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Hba_human 3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
|| :| | | | || | | ||| |: . :| |. :| | |||.
Hbb_horse 2 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Ex : pairwise alignment of 2 globin sequences

The alignment can be obtained with :

- global or local method

- dynamic programming or heuristic methods

Example : in Clustalx

=> global alignments

=> choice between

- heuristic method (used in Fasta program) => faster

- dynamic programming (Smith & Waterman) => better

Example : Alignment of 7 globins (Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla)


Step 2 : Distance matrix construction

                  1     2     3     4     5     6     7
1 Hbb_human       -
2 Hbb_horse     .17     -
3 Hba_human     .59   .60     -
4 Hba_horse     .59   .59   .13     -
5 Myg_phyca     .77   .77   .75   .75     -
6 Glb5_petma    .81   .82   .73   .74   .80     -
7 Lgb2_lupla    .87   .86   .86   .88   .93   .90     -

In Clustalx :

distance between 2 sequences = 1 - (nb of identical residues / nb of compared residues)

Ex : Hbb_human vs Hbb_horse = 83% identity = 17% distance
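
The identity-based distance can be sketched as follows. One assumption is made explicit here because the slides do not state it: positions where either sequence has a gap are not counted as compared residues. The function name and the accepted gap characters are illustrative.

```python
def identity_distance(aligned_a, aligned_b, gap_chars="-."):
    """1 - (identical residues / compared residues) for one pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x in gap_chars or y in gap_chars:
            continue                      # assumption: gapped columns are skipped
        compared += 1
        identical += (x == y)
    if compared == 0:
        raise ValueError("no residue pair to compare")
    return 1 - identical / compared
```

For example, 83 identical residues out of 100 compared gives a distance of 0.17, as for Hbb_human vs Hbb_horse.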



Step 3 : Sequential branching / Guide tree construction

- Join the 2 closest sequences

- Recalculate distances and join the 2 closest sequences or nodes

- Step 3 is repeated until all sequences are joined

Sequential branching (figure), leaves in order : Hba_human, Hba_horse, Hbb_horse, Hbb_human, Myg_phyca, Glb5_petma, Lgb2_lupla

Guide tree (figure), leaves in order : Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma, Lgb2_lupla
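
The join-and-recalculate loop can be sketched on the 7-globin distance matrix. This sketch uses UPGMA (average linkage), one of the clustering methods named in these slides; it is an illustration of the principle, not the exact procedure of any particular program. The matrix values are those of Step 2; the function and variable names are assumptions.

```python
NAMES = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse",
         "Myg_phyca", "Glb5_petma", "Lgb2_lupla"]

# Lower triangle of the distance matrix from Step 2 (row i, columns 0..i-1).
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
    [0.81, 0.82, 0.73, 0.74, 0.80],
    [0.87, 0.86, 0.86, 0.88, 0.93, 0.90],
]


def upgma(names, lower):
    """Return the guide tree as nested tuples, joining the closest pair each round."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}   # id -> (tree, size)
    dist = {frozenset((i, j)): d
            for i, row in enumerate(lower) for j, d in enumerate(row)}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)                # the 2 closest sequences or nodes
        a, b = sorted(pair)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:                            # recalculate distances
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            dist[frozenset((next_id, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b))
        del dist[pair]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


GUIDE_TREE = upgma(NAMES, LOWER)
```

On this matrix the two alpha chains are joined first, then the two beta chains, then the two pairs, followed by Myg_phyca, Glb5_petma and Lgb2_lupla.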



Step 4 : Progressive alignment

The progressive multiple alignment follows the branching order in the guide tree (leaves : Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma, Lgb2_lupla).

Figure : sequences and groups of sequences are aligned pair by pair along the tree; labels on the final alignment : H1 H2 H3 H4 H5 H6 H7



Progressive multiple alignment methods

Figure : progressive programs classified as local or global and by guide tree method (SB, ML, UPGMA, NJ). Programs shown : SBpima, multal, multalign, pileup, clustalx, MLpima

SB - Sequential Branching

UPGMA - Unweighted Pair Grouping Method

ML - Maximum Likelihood

NJ - Neighbor-Joining


Alignment Parameters



Dynamic programming methods score an alignment using residue similarity matrices, containing a score for matching all pairs of residues

For proteins, a wide variety of matrices exist: Identity, PAM, Blosum, Gonnet etc.

PAM 250




Matrices range from strict ones for comparing closely related sequences to soft ones for very divergent sequences.

Matrices are generally constructed by observing the mutations in large sets of alignments, either sequence-based or structure-based

ClustalW automatically selects a suitable matrix depending on the observed pairwise % identity.

A single best matrix does not exist!!

Gap penalties

A gap penalty is a cost for introducing gaps into the alignment, corresponding to insertions or deletions in the sequences

• Fixed penalty per gapped position : P = a L, with L the length of the gap

• Linear (or affine) penalty : P = x + y L

x : gap opening penalty (gop)

y : gap extension penalty (gep)

• Position specific and residue specific penalties :

ex : in ClustalW, gap penalties are :

- lowered at existing gaps

- increased close to (less than 8 residues from) existing gaps

- lowered in hydrophilic stretches (loops)

- otherwise, gap opening penalties are modified according to the observed relative frequencies of residues adjacent to gaps (Pascarella & Argos, 1992)

The goal is to introduce gaps in sequence segments corresponding to flexible regions of the protein structure :

SFGDLSNPGAVMG

HF-DLS-----HG
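
The two penalty formulas can be compared on the example alignment above. This is a small sketch with hypothetical helper names; the numeric values of a, x (gop) and y (gep) used in the usage note are arbitrary and not taken from any program's defaults.

```python
import re


def gap_lengths(aligned_seq, gap_char="-"):
    """Lengths L of the maximal runs of gap characters in one aligned sequence."""
    return [len(run) for run in re.findall(re.escape(gap_char) + "+", aligned_seq)]


def fixed_penalty(length, a):
    """P = a L"""
    return a * length


def affine_penalty(length, gop, gep):
    """P = x + y L, with x the gap opening and y the gap extension penalty."""
    return gop + gep * length


def total_affine_penalty(aligned_seq, gop, gep):
    """Sum of the affine penalties of all gaps in one aligned sequence."""
    return sum(affine_penalty(n, gop, gep) for n in gap_lengths(aligned_seq))
```

For `HF-DLS-----HG` the gap lengths are 1 and 5. With gop = 10 and gep = 1 the two gaps cost 11 + 15 = 26, whereas a single gap of length 6 would cost 16: the opening penalty is what favours a few long gaps over many short ones.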


Alternative Approaches


Iterative alignment methods

Iterative Alignment e.g. PRRP (Gotoh, 1993)

- refine an initial progressive multiple alignment by iteratively dividing the alignment into 2 profiles and realigning them.

Genetic Algorithms e.g. SAGA (Notredame et al, 1996)

- iteratively refine an alignment using genetic algorithms (evolves a population of alignments in a quasi evolutionary manner)

Segment-to-segment alignment : DIALIGN (Morgenstern et al. 1999)

- search for locally conserved motifs in all sequences and compare segments of sequences instead of single residues

Hidden Markov Models e.g. HMMER (Eddy, 1998), SAM (Karplus et al, 2001)

- iteratively refine an alignment using HMMs


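Multiple alignment methods

Figure : programs classified as progressive or iterative, and as local or global.

Progressive, by guide tree method (SB, ML, UPGMA, NJ) : SBpima, multal, multalign, pileup, clustalx, MLpima

Iterative : dialign, prrp, saga (Genetic Algo.), hmmt (HMM)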


BAliBASE : objective evaluation of MACS programs

• High-quality alignments based on 3D structural superpositions and manually verified

• Alignments compared only in reliable ‘core blocks’, excluding non-superposable regions

• Separate reference sets specifically designed to address distinct alignment problems

reference set   description
1               small number of sequences: divergence, length
2               a family with one to 3 orphans
3               several sub-families
4               long N/C terminal extensions
5               long insertions
6               repeats
7               transmembrane regions
8               circular permutations

BAliBASE1 : Thompson et al. 1999 Bioinformatics

BAliBASE2 : Bahr et al, 2001 Nucl Acids Res.


=> Reference alignments are needed to evaluate the alignment programs

BAliBASE (Thompson et al. Bioinformatics. 1999) : benchmark database

• Alignments based on 3D structure superposition

• Alignments must be compared for the superposable regions

• Alignments take into account :

- the effect of the number of sequences

- the effect of the sequence length

- the effect of the sequence similarity

- alignment of an orphan sequence with a sequence family

- sub-family alignments

- alignments of sequences with different length (insertions,extensions)

Comparison of multiple alignment methods

> 35% Id : any method

Local / global methods

Colinear sequences => global methods

N/C-ter extensions or insertions => local methods

Progressive / iterative methods

Iterative algorithms usually improve alignment quality

Problems :

- Can give a bad alignment in the case of orphan sequences

- The iterative process can be very long !

Example : alignment of 89 histone sequences (66-92 residues) :

ClustalW   2 mins 41 secs
PRRP       3 hours 40 mins
Dialign    3 hours 48 mins

To increase the alignment quality, as many sequences as possible have to be integrated !

DbClustal: local and global algorithm coupling

Figure : a query sequence with Domain A, Domain B and Domain C. Blast database search (query sequence, database hits) -> Ballast anchors (query sequence, anchors) -> DbClustal alignment

ClustalW / DbClustal comparison (figure)


Combinatorial algorithms

• T-Coffee (Notredame et al. 2000) http://igs-server.cnrs-mrs.fr/Tcoffee/

performs local and global alignments for all pairs of sequences, then combines them in a progressive multiple alignment, similar to ClustalW.

• DbClustal (Thompson et al. 2000) http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustal+noid

designed to align the sequences detected by a database search. Locally conserved motifs are detected using the Ballast program (Plewniak et al. 1999) and are used in the global multiple alignment as anchor points.

• MAFFT (Katoh et al. 2002) http://timpani.genome.ad.jp/%7Emafft/server

detects locally conserved segments using a Fast Fourier Transform, then uses a restricted global DP and a progressive algorithm

• MUSCLE (Edgar, 2004) http://www.drive5.com/muscle

kmer distances and log-expectation scores, progressive and iterative refinement

• PROBCONS (Do et al, 2005) http://probcons.stanford.edu

pairwise consistency based on an objective function

Multiple Alignment Quality : Truncated Alignments

               Ref1 V1   Ref1 V2    Ref2      Ref3        Ref4         Ref5         Time
               (<20%)    (20-40%)   orphans   subgroups   extensions   insertions   (sec)
ClustalW1.83   0.42      0.78       0.42      0.52        0.41         0.38         902
Dialign2.2.1   0.31      0.71       0.37      0.39        0.45         0.43         5993
Mafft5.32      0.44      0.78       0.49      0.53        0.47         0.48         96
Maffti5.32     0.54      0.83       0.56      0.60        0.49         0.57         327
Muscle3.51     0.52      0.82       0.50      0.58        0.46         0.54         523
Muscle_fast    0.40      0.77       0.43      0.44        0.35         0.49         34
Muscle_med     0.45      0.80       0.50      0.59        0.44         0.51         219
Tcoffee2.66    0.47      0.84       0.50      0.64        0.54         0.58         216133
Probcons1.1    0.63      0.87       0.60      0.65        0.54         0.63         19035

muscle_fast : muscle -maxiters 1 -diags1 -sv -distance1 kbit20_3

muscle_medium : muscle -maxiters 2

1. Significant improvement in accuracy/efficiency since 2000

2. Twilight zone still exists

3. Probcons scores best in all tests, but is MUCH slower than MAFFT or MUSCLE

4. MAFFTI scores slightly better than MUSCLE in all tests, and is more efficient

Multiple Alignment Quality

Comparison : truncated (T) versus full-length (FL) sequences

               Ref1 V1 (<20%)   Ref1 V2 (20-40%)   Ref2: orphans   Ref3: subgroups   Time (sec) for all refs
               T      FL        T      FL          T      FL       T      FL         T        FL
ClustalW1.83   0.42   0.24      0.78   0.72        0.42   0.20     0.52   0.27       902      2227
Dialign2.2.1   0.31   0.26      0.71   0.70        0.37   0.29     0.39   0.31       5993     12595
Mafft5.32      0.44   0.25      0.78   0.75        0.49   0.35     0.53   0.38       96       312
Maffti5.32     0.54   0.35      0.83   0.80        0.56   0.40     0.60   0.50       327      1409
Muscle3.51     0.52   0.34      0.82   0.79        0.50   0.36     0.58   0.39       523      3608
Muscle_fast    0.40   0.28      0.77   0.72        0.43   0.29     0.44   0.33       34       132
Muscle_med     0.45   0.29      0.80   0.74        0.50   0.34     0.59   0.38       219      1601
Tcoffee2.66    0.47   0.35      0.84   0.82        0.50   0.40     0.64   0.49       216133   341578
Probcons1.1    0.63   0.43      0.87   0.86        0.60   0.41     0.65   0.54       19035    58488

1. Loss of accuracy is greater in the twilight zone (Ref1 V1, orphans, and subgroups)

2. Probcons still scores best in all tests

3. MAFFTI still scores better than MUSCLE in all tests


Multiple alignment quality

Development of objective functions to estimate multiple alignment quality

• Sum-of-pairs (Carrillo, Lipman, 1988) : sum the scores of all the pairs of sequences (based on a similarity matrix and gap penalty)

• Relative Entropy : uses a normalized log-likelihood ratio to measure the degree of conservation for each column (identical residues only).

• MD (column scores used in ClustalX) : uses a comparison matrix (Gonnet) to take into account similar residues

• norMD (Thompson et al, 2001)

- scores by column using a substitution matrix and gap penalties

- normalisation according to the sequences to align (their number, length and the similarity between them)
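
As an illustration of the simplest of these objective functions, the sum-of-pairs score can be sketched as below. The scoring scheme is a deliberately simplified assumption (identity scores and a per-position gap cost) standing in for the similarity matrix and gap penalties a real implementation would use; the function names are hypothetical.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=0, gap=-1):
    """Toy score for one pair of aligned symbols."""
    if x == "-" and y == "-":
        return 0                      # a shared gap is not penalised
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment, **scores):
    """Sum, over all columns, of the scores of all pairs of sequences."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            total += pair_score(x, y, **scores)
    return total
```

For the three-sequence alignment `ACGT`, `ACGT`, `AC-T`, the three fully conserved columns score 3 each and the gapped column scores 1 - 1 - 1 = -1, giving 8 in total.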


Evaluation of Objective Functions using BAliBase


Multiple sequence alignment editors

SeqLab, GCG Wisconsin Package

SeaView (Galtier et al, 1996) http://pbil.univ-lyon1.fr/software/seaview.html

WEB servers :

GeneAlign (Kurukawa) http://www.gen-info.osaka-u.ac.jp/geneweb2/genealign/

Jalview (Clamp, 1998) http://www.ebi.ac.uk/~michele/jalview/

CINEMA (Lord et al, 2002) http://www.bioinf.man.ac.uk/dbbrowser/cinema-mx

No automatic method is 100% reliable.

Manual verification and refinement is essential!


FASTA format

>O88763 Phosphatidylinositol 3-kinase.
------MGEAEKFHYIYSCDLDINVQLKIGSLEGKREQKSYKAVLEDPMLKFSGLYQETCSDLYVTCQVFAEGKPLALPVRTSYKPFSTRWN-WNEWLKLPVKYPDLPRNAQVALTIWD------VYGPG-RAVPVGGTTVSLFGKYGMFRQGMHDLKVWPNVEADGSEPTRTPGRTSSTLSEDQMSRLAKLTKAHRQGHMVKVLDRLTFREIEMINESEKRSS--NFMYLMVEFRCVKCDDKE-YGIVYYE----
>Q9W1M7 CG5373-PA (GH13170p).
-----MDQPDDHFRYIHSSSLHERVQIKVGTLEGKKRQPDYEKLLEDPILRFSGLYSEEHPSFQVRLQVFNQGRPYCLPVTSSYKAFGKRWS-WNEWVTLPLQFSDLPRSAMLVLTILD------CSGAG-QTTVIGGTSISMFGKDGMFRQGMYDLRVWLGVEGDGNFPSRTPGK-GKESSKSQMQRLGKLAKKHRNGQVQKVLDRLTFREIEVINEREKRMS--DYMFLMIEFPAIVVDDMYNYAVVYFE----
>Q7PMF0 ENSANGP00000002906 (Fragment).
------------LRYIGSSSLLQKISIKIGTLEGENVGYSYEKLIEQPLLKFSGMYTEKTPPLKVKLQIFDNGEPVGLPVCTSHKHFTTRWS-WNEWVTLPLRFTDISRTAVLGLTIYD------CAGGREQLTVVGGTSISFFSTNGLFRQGLYDLKVWPQMEPDGACNSITPGK-AITTGVHQMQRLSKLAKKHRNGQMEKILDRLTFRELEVINEMEKRNS--QFLYLMVEFPQVYIHEKL-YSVIHLE----
>Q9TXI7 Related to yeast vacuolar protein sorting factor protein 34
MIPGMRATPTESFSFVYSCDLQTNVQVKVAEFEG-----IFRDVLN-PVRRLNQLFAEITVYCNNQQIGYPVCTSFHTPPDSSQLARQKLIQKWNEWLTLPIRYSDLSRDAFLHITIWEHEDDEIVNNSTFSRRLVAQSKLSMFSKRGILKSGVIDVQMNVSTTPDPFVKQPETWKYSDAWG-DEIDLLFKQVTRQSRGLVEDVLDPFASRRIEMIRAKYKYSSPDRHVFLVLEMAAIRLGPTF-YKVVYYEDETK
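
An aligned FASTA file of this kind is straightforward to read programmatically. The following is a minimal sketch with a hypothetical function name: each `>` line starts a record, and the following lines, possibly wrapped, are concatenated into the gapped sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)       # sequence lines may be wrapped
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def is_alignment(records):
    """In an aligned FASTA file all gapped sequences have the same length."""
    return len({len(seq) for _, seq in records}) <= 1
```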

MSF format

toto.msf MSF: 256 Type: P May 24, 2005 19:34 Check: 3415 ..

Name: O88763 Len: 256 Check: 9443 Weight: 1.00
Name: Q9W1M7 Len: 256 Check: 1161 Weight: 1.00
Name: Q7PMF0 Len: 256 Check: 8095 Weight: 1.00
Name: Q9TXI7 Len: 256 Check: 4716 Weight: 1.00

//

         1                                                   50
O88763   ......MGEA EKFHYIYSCD LDINVQLKIG SLEGKREQKS YKAVLEDPML
Q9W1M7   .....MDQPD DHFRYIHSSS LHERVQIKVG TLEGKKRQPD YEKLLEDPIL
Q7PMF0   .......... ..LRYIGSSS LLQKISIKIG TLEGENVGYS YEKLIEQPLL
Q9TXI7   MIPGMRATPT ESFSFVYSCD LQTNVQVKVA EFEG.....I FRDVLN.PVR

         51                                                 100
O88763   KFSGLYQETC SDLYVTCQVF AEGKPLALPV RTSYKPFSTR WN.WNEWLKL
Q9W1M7   RFSGLYSEEH PSFQVRLQVF NQGRPYCLPV TSSYKAFGKR WS.WNEWVTL
Q7PMF0   KFSGMYTEKT PPLKVKLQIF DNGEPVGLPV CTSHKHFTTR WS.WNEWVTL
Q9TXI7   RLNQLFAEIT VYCNNQQIGY PVCTSFHTPP DSSQLARQKL IQKWNEWLTL

         101                                                150
O88763   PVKYPDLPRN AQVALTIWD. .....VYGPG .RAVPVGGTT VSLFGKYGMF
Q9W1M7   PLQFSDLPRS AMLVLTILD. .....CSGAG .QTTVIGGTS ISMFGKDGMF
Q7PMF0   PLRFTDISRT AVLGLTIYD. .....CAGGR EQLTVVGGTS ISFFSTNGLF
Q9TXI7   PIRYSDLSRD AFLHITIWEH EDDEIVNNST FSRRLVAQSK LSMFSKRGIL

         151                                                200
O88763   RQGMHDLKVW PNVEADGSEP TRTPGRTSST LSEDQMSRLA KLTKAHRQGH
Q9W1M7   RQGMYDLRVW LGVEGDGNFP SRTPGK.GKE SSKSQMQRLG KLAKKHRNGQ
Q7PMF0   RQGLYDLKVW PQMEPDGACN SITPGK.AIT TGVHQMQRLS KLAKKHRNGQ
Q9TXI7   KSGVIDVQMN VSTTPDPFVK QPETWKYSDA WG.DEIDLLF KQVTRQSRGL

         201                                                250
O88763   MVKVLDRLTF REIEMINESE KRSS..NFMY LMVEFRCVKC DDKE.YGIVY
Q9W1M7   VQKVLDRLTF REIEVINERE KRMS..DYMF LMIEFPAIVV DDMYNYAVVY
Q7PMF0   MEKILDRLTF RELEVINEME KRNS..QFLY LMVEFPQVYI HEKL.YSVIH
Q9TXI7   VEDVLDPFAS RRIEMIRAKY KYSSPDRHVF LVLEMAAIRL GPTF.YKVVY

         251
O88763   YE....
Q9W1M7   FE....
Q7PMF0   LE....
Q9TXI7   YEDETK

Multiple Sequence File


With an editor …

PipeAlign : protein family analysis tool

Plewniak et al, 2003

http://bips.u-strasbg.fr/PipeAlign/

INPUT : single sequence OR set of unaligned sequences

Processing steps (pipeline figure) :

• BlastP search

• Identify motifs

• Build multiple alignment

• Refine alignment

• Correct alignment errors

• Remove unrelated seq.

• Cluster sequences

• Validate alignment

Outputs shown along the pipeline : list of homologs, conservation profile, MACS of user-specified homologs, refined MACS, MACS of validated homologs, validated MACS

PipeAlign

Integrated family and sub-family analysis

Identification of key residues, domain organisation, mean predictions of cellular location, transmembrane regions, 2D/3D structures, phylogeny studies, etc.


MSA Main Applications

Structure comparison, modelling

Interaction networks

Hierarchical function annotation: homologs, domains, motifs

Phylogenetic studies

Human genetics, SNPs

Therapeutics, drug discovery

Therapeutics, drug design

DBD

LBD

insertion domain

binding sites / mutations

Gene identification, validation

RNA sequence, structure, function

Comparative genomics

MACS

MSA : central role in biology

MACS : new landscape

• Length : from tens of amino acids or nucleotides to thousands or millions (genomes)

• Number : from tens up to thousands of sequences

• Variability : from small percent identity to almost identical

• Complexity of the sequences to be aligned :

- Family with linear or highly irregular distribution of sequence variability

- Heterogeneity of length, structure or composition (large insertions or extensions, repeats, circular permutations, transmembrane regions…)

• Fidelity : from 15-30% errors (sequence, eucaryotic gene prediction, annotation…)

High volume & heterogeneity of sequence data

MACS : new concepts

Distinct objectives imply distinct needs & strategies

• Overview of one sequence family to quickly infer and integrate information from a limited number of closely related, well annotated sequences (reliable and efficient)

• Exhaustive analysis of one sequence family (very high quality) for

- homology modeling

- phylogenetic studies

- subfamily-specific features (differentially conserved domains, regions or residues)

• Massive analysis of sets of sequences (reliable/high quality and efficient)

- phylogenetic distribution, co-presence and co-absence and structural complex

- genome annotation

- target characterisation for functional genomics studies (transcriptomics…)


Residue conservation identification

residues conserved in all sequences in family : structural or functional importance, characteristic motifs

residues conserved within a sub-group of sequences : discriminant residues

Ordered Alignment analysis of TyrRS

Figure : Motif I and Motif II compared between Euc, Arc and Bac sequence groups; labelled regions : N-terminal extension, C-terminal extension, EMAP domain, S4 domain; scale bar : 10 aa



Multiple alignments = basis for calculation of the levels of similarity between sequences

Multiple alignments = basis for calculation of evolutionary distances between sequences

Multiple alignments = basis for the computation of phylogenetic trees

Creating a high quality phylogenetic tree requires high quality multiple sequence alignments


Whole alignment

Figure : phylogenetic tree for the whole alignment. Leaves carry abbreviated species names (e.g. ESCHE COLI, BACIL SUBT, HOMO SAPIE, SACCH CERE, PYROC HORI, CAEN EL MT) and are grouped into Archaea, Eucarya, and Bacteria + Mitochondria.


N terminus, global gap removal

Figure : phylogenetic tree (scale bar 0.1) for the N terminus with global gap removal, with the same species labels, grouped into Bacteria, Archaea, Mito. and Eukarya.


Schematic alignment of Aspartyl-tRNA synthetases

Figure : three panels comparing Euc, Arc and Eub sequences over alignment positions 180 to 930. Labelled regions : Motif I, Flipping loop, Motif II, Insertion domain, Motif III, Anticodon binding domain, Catalytic core I, Catalytic core II. Labelled residues : P, L, Q, PQ, KQ, R, G, H.


Protein sequence validation

Sequencing / frameshift error detection

Example: transcription TFIIH complex protein

Estimation: 44% of predicted proteins from genome sequencing projects and 31% of high-throughput cDNA (HTC) contain errors in their intron/exon structure.

Bianchetti et al, 2005

Multiple alignment of complete sequences

Determination of sequence groups

Hierarchical clustering of positions based on insertion/deletion

Definition of blocks

N-terminal region analysis :

• Reference position

• Proposed N-terminus : potential start codon closest to the reference position

Figure : schematic alignment of N-terminal regions in which candidate start positions are marked M, with the reference position and an N-terminal extension indicated.
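
The selection rule for the proposed N-terminus can be sketched at the protein level, where a potential start codon appears as a methionine (M) in the aligned sequence. This is only an illustration of "closest to the reference position": the function name, the use of alignment columns as the distance measure, and the tie-breaking towards the upstream candidate are all assumptions, not details given in the slides.

```python
def propose_start(aligned_seq, reference_col, start_residue="M"):
    """Column of the candidate start residue closest to the reference column.

    Returns None when the sequence contains no candidate start residue.
    """
    candidates = [i for i, residue in enumerate(aligned_seq)
                  if residue == start_residue]
    if not candidates:
        return None
    # assumption: ties are broken in favour of the upstream (leftmost) candidate
    return min(candidates, key=lambda i: (abs(i - reference_col), i))
```

For an aligned sequence `--MXXXMXXX` and a reference position at column 5, the candidates are columns 2 and 6, and column 6 is proposed.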

Clustered MACS : Starter

° 3000 proteins from B. subtilis with wrong randomly generated N-ter. : 82% predicted

° For the 3828 proteins from the Vibrio cholerae proteome : 817 specific / 1722 valid start codons / 236 “wrong” (from 1 up to 56 aas)

Bianchetti et al. (2005) JBCB

http://igbmc.u-strasbg.fr/vALId/

Clustered MACS : vALId

Clustering

Characterization of the

specificity of the

homologous sequences

-> Filter

Filter

User sequence

DBWatcher [Plewniak, IGBMC]

Daily Blastp

Automatic Daily Update

Integration of the sub-family members

Clustered MACS : DbW

• Automatic update of more than 300 different protein families => 24 AaRS (amino-acid tRNA synthetases), nuclear receptors, ribosomal proteins, transcription factors…

Databases :

- Proteins

- Structures

Prigent et al. (2005) BioInformatics

F

minV(Horiz) = 21 * F

p

p p

minV(Verti) = MaxBranch * P

GoAnno : find a pertinent level automatically and propagate Gene Ontology to an unannotated target protein according to clustered MACS

989 target proteins from

retinal transcriptome analysis

795 proteins with GO terms (increase of 47 %)

3085 GO terms (increase of 92 %)

Figure : propagation of GO terms from the subfamily of the query through Level 0 to Level 6 of the hierarchy, with term counts at each node. GO terms shown : Gene_Ontology; biological_process; physiological processes; cellular process; metabolism; cell communication; nucleobase, nucleoside, nucleotide and nucleic acid metabolism; signal transduction; transcription; regulation of transcription.

Clustered MACS : GOAnno

Chalmel et al. (2005) Bioinformatics


Basic steps for comparative (homology) modelling :

1. Identify a template structure

2. Align the target sequence to the template sequence

3. Copy the backbone coordinates from template to the matching residues in the target sequence

4. Build the side-chains (copied for identical residues, predicted for non-identical)

5. Model the loop regions

6. Optimise (energy refinement)

Protein 3D structure prediction

Applicable to ~60% of proteins from fully sequenced genomes

Proteins with similar sequences tend to fold into similar structure

Above 50% identity, pairwise alignment is enough for an accurate model

Below 50% identity, multiple alignment is better


Propagation of information from a known sequence to an unknown one, e.g. domains, active sites, cellular localisation, post-translational modifications, …

1. Database search for homologues e.g. BlastP, PSI-Blast

2. Domain databases : e.g. Interpro (EBI), CDD (NCBI)

3. Multiple alignment construction and analysis e.g. PipeAlign

Protein functional characterisation

By homology : Similar sequences generally share similar structures and often have similar functions

Functional genomics

Evolutionary studies

Structure modeling

Drug design

Mutagenesis experiments

domain organization, structural motifs, key functional residues, ORF definition, localization signals, conservation pattern...

Figure : schematic alignment of two families (1st FAMILY, 2nd FAMILY) with sequences from Bacteria, Archaea and Eucarya, annotated with : additional domain, intra-group conservation, universal conservation, differential conservation between the two families, transmembrane region, NLS, phosphorylation site, error in ORF definition

Lecompte et al Gene. 2001

MSA applications : Summary


MACSIMS


MAO : Multiple Alignment Ontology

http://www-igbmc.u-strasbg.fr/BioInfo/MAO/mao.html

Also available from OBO web site: http://obo.sourceforge.net

MAO consortium :

- RNA analysis (Steve HOLBROOK, Berkeley)

- MACS algorithm (Kazutake KATOH, Kyoto)

- Protein 3D analysis (Patrice KOEHL, Davis)

- Protein 3D structure (Dino MORAS, Strasbourg)

- 3D RNA structure (Eric WESTHOF, Strasbourg)

Thompson et al. (2005) Nucleic Acids Res.

MACSIMS

Multiple Alignment of Complete Sequences Information Management System

Structural and functional information is mined automatically from the public databases

Homologous regions are identified in the MACS

Mined data is evaluated and cross-validated

Mined data is propagated from known to unknown sequences with the homologous regions

MACSIMS provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist

Thompson et al BMC Bioinformatics 2006

MACSIMS http://bips.u-strasbg.fr/MACSIMS/

MACSIMS

• Schematic overview of complete alignment, e.g. domain organisation (Interpro)

SH3

SH2

PI-PLC-X

PI-PLC-Y

PH C2

CH

rhoGEF

DAG_PE-bind

Key:

MACSIMS visualisation

JalView II, Coll. G. Barton

MACSIMS

BAliBASE reference 3 : aldehyde dehydrogenase-like

Figure : aligned motif blocks (e.g. GSVPTG, GSTKVG, GETRTG, GSTEVG, GSVSAG, GSRDVG, GSTNVF, GSTAVF) with conserved E and C residues marked by asterisks; Uniprot annotation : NAD binding, Active site


Summary

Choice of multiple alignment method

traditional progressive method (e.g. clustalw / clustalx)

combined local and global method (e.g. mafft, muscle, dbclustal)

knowledge-based method (e.g. PipeAlign)

Web Server versus Local Installation ?

WARNING : Automatic alignment methods can make mistakes. Verify alignment quality by automatic methods (e.g. norMD) and visual inspection !

Multiple alignment applications

Traditional applications:

phylogeny

conserved residue / motif identification

Information in multiple alignments also improves accuracy in:

sequence error detection

structure prediction

functional annotation


Laboratory of Integrative Genomics and Bioinformatics, IGBMC, Strasbourg

Iterative Refinement

PRRP (Gotoh, 1993) refines an initial progressive multiple alignment by iteratively dividing the alignment into 2 profiles and realigning them.

Refinement loop (flowchart) : initial alignment -> divide sequences into 2 groups -> profile 1 and profile 2 -> pairwise profile alignment -> refined alignment -> converged ? If no, divide the sequences again.

alternative algorithms

Genetic Algorithms

SAGA (Notredame, Higgins, 1996) evolves a population of alignments in a quasi evolutionary manner, iteratively improving the fitness of the population

Cycle from population n to population n+1 (flowchart) :

- select a number of individuals to be parents

- modify the parents by shuffling gaps, merging 2 alignments etc.

- evaluate the fitness using an objective function (sum-of-pairs or COFFEE)

END


HMM

• Probabilistic model for sequence profiles, visualized as a finite state machine

• For each column of the alignment a match state models the distribution of residues allowed

• Insert and delete states at each column allow for insertion or deletion of one or more residues

Original profile HMM (Krogh et al, 1994) (figure) : match state; insert state; delete, begin, end state

AKY-L-D--WVLED


Multiple Alignment using HMM

HMMER (Eddy, unpublished)

SAM-T98 (Hughey, 1996)

Flowchart steps :

produce a model

generate new alignment (Viterbi algorithm or posterior decoding)

END

evaluate alignment (expectation maximization)

generate initial alignment (Baum-Welch expectation maximization)

Segment-to-segment Alignment

Dialign (Morgenstern et al. 1996) compares segments of sequences instead of single residues

1. construct dot-plots of all possible pairs of sequences

2. find a maximal set of consistent diagonals in all the sequences
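
Step 1 can be sketched as follows. This is a strongly simplified illustration: here a diagonal is a gap-free run of identical residues of at least `min_len` positions, whereas Dialign scores segment pairs with weights and tolerates mismatches. Step 2 (choosing a consistent set of diagonals) is not implemented, and the function names and `min_len` parameter are assumptions.

```python
def dot_plot(a, b):
    """Boolean matrix with True where a[i] and b[j] are identical residues."""
    return [[x == y for y in b] for x in a]


def diagonals(a, b, min_len=3):
    """Gap-free runs of identities as (start_in_a, start_in_b, length) tuples."""
    found = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # a run is only started at its first cell, not in its middle
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            n = 0
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            if n >= min_len:
                found.append((i, j, n))
    return found
```

For example, `diagonals("xxVRALFDFyy", "VQALFDFzz")` finds the shared segment ALFDF as one diagonal of length 5; the residues outside the diagonal are left unaligned.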

Sequence i

Sequence j

.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..

Local alignment - residues between the diagonals are not aligned