
Getting the best out of multiple sequence alignment methods in the genomic era

Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program


Which Tool for Which Sequence ?


In-SilVo Biology

In Silico Biology
– Making Sense of digital data

In Vivo Biology
– Recording data in a living Cell

In SilVo Biology
– Connect In-Vitro and In-Vivo

In-Vivo: High-throughput recording
In-Silico: High-Throughput analysis


Is it Possible to Compare all Types of Sequences?

Non Transcribed World
– Genes/Full Genomes: Lagan, TBA
– Promoter Regions: Meta-Aligner, Motif Finders
– Nucleosome: ???

Multiple Genome Aligners
– Not Very Accurate
– Very Fast
– Deal with rearrangements


Multiple Genome Alignments and re-sequencing

Before
– Re-sequence Human Genomes
– Map the Reads onto the reference genome

Now
– Re-sequence
– Assemble
– Align
– Non trivial with very large datasets


Is it Possible to Compare all Types of Sequences?

RNA Comparison
– Less Accurate than Proteins
– Secondary Structures

ncRNA World
– Sankoff: Time O(L^(3n)), Space O(L^(2n)) for n sequences of length L
– Consan
– R-Coffee
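To get a feel for why Sankoff's simultaneous folding and alignment becomes impractical beyond a handful of sequences, the short sketch below evaluates the commonly quoted growth terms, L^(3n) for time and L^(2n) for memory, with n sequences of length L. The function name and the notion of abstract "operations" and "cells" are illustrative assumptions, and constant factors are ignored.

```python
def sankoff_cost(length, n_seqs):
    """Return the (time, space) growth terms L**(3n) and L**(2n), constants ignored."""
    return length ** (3 * n_seqs), length ** (2 * n_seqs)


# Hypothetical example: sequences of length 100, aligned 2, 3 or 4 at a time.
for n in (2, 3, 4):
    time_ops, cells = sankoff_cost(100, n)
    print(f"n={n}: about {time_ops:.0e} operations, {cells:.0e} memory cells")
```

Even the pairwise case already grows with the sixth power of the sequence length, which is one reason restricted variants such as Consan and heuristic approaches such as R-Coffee are listed alongside it.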


Is it Possible to Compare all Types of Sequences?

Protein Comparisons
– Very Accurate
– 3D-Structure Improves it

Protein Aligners
– ClustalW
– T-Coffee
– 3D-Coffee


What Changes with 1000 Genomes?


Phylogeny Vs Function

Function
– Low level => Biochemistry => Protein Domains
– High Level => Metabolic Pathway => Orthology

Orthology
– Phylogenetic Analysis
– Phylogenetic Analysis => Accurate Alignments


Using The tree

Correct Tree

Correct Orthologous Assignment

Correct Functional Prediction


The Alignment that Hides The Forest…


Why So Much interest for MSA methods???


Phylogenetic Trees and Multiple Sequence Alignments


Positive Selection and MSAs


Multiple Genome Alignments


Genomic Era: The Goal

10,000 Sequences: interspecies
1 Billion: Re-sequencing

Incorporation of ALL experimental Data
– Structure, Genomic, ChIP-chip, ChIP-Seq…

Alignments suitable for all applications of comparative genomics
– Homology Modeling (function)
– Functional Analysis
– Phylogenetic Reconstruction
– 3D-Modelling

Accurate Alignments for ALL kinds of data
– Non Transcribed DNA
– Transcribed DNA
– Translated DNA


Genomic Era Challenges

Accuracy
– Proteins: 30% is the limit
– DNA/RNA: 70% is the limit

Scale
– With too many sequences, algorithms lose accuracy

Data Integration
– Structure
– Homology
– Genomic Structure
– Function
– Proteomics

Methods
– Wealth of alternative methods
– Poorly Characterized


Consistency and Data Integration

Most methods rely on the progressive algorithm

Consistency based methods have been designed as an extension of it

Consistency based alignment methods have been designed to:
– Better extract the signal contained in the data
– Integrate/Confront existing methods
– Integrate/Confront heterogeneous types of Information


The Progressive Alignment Algorithm


T-Coffee and Consistency…

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT


SeqA GARFIELD THE LAST FAT CAT     Prim. Weight = 88
SeqB GARFIELD THE FAST CAT ---

SeqA GARFIELD THE LAST FA-T CAT    Prim. Weight = 77
SeqC GARFIELD THE VERY FAST CAT

SeqA GARFIELD THE LAST FAT CAT     Prim. Weight = 100
SeqD -------- THE ---- FAT CAT

SeqB GARFIELD THE ---- FAST CAT    Prim. Weight = 100
SeqC GARFIELD THE VERY FAST CAT

SeqC GARFIELD THE VERY FAST CAT    Prim. Weight = 100
SeqD -------- THE ---- FA-T CAT


SeqA GARFIELD THE LAST FAT CAT     Weight = 88
SeqB GARFIELD THE FAST CAT ---

SeqA GARFIELD THE LAST FA-T CAT    Weight = 77
SeqC GARFIELD THE VERY FAST CAT
SeqB GARFIELD THE ---- FAST CAT

SeqA GARFIELD THE LAST FA-T CAT    Weight = 100
SeqD -------- THE ---- FA-T CAT
SeqB GARFIELD THE ---- FAST CAT
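The triplets above can be read as a recipe: the support for aligning a residue of SeqA with a residue of SeqB is the direct primary weight plus, for each third sequence that links the two residues, the smaller of the two weights along that path. The following is a minimal sketch of that library extension on the GARFIELD example, not the T-Coffee source code. The function names and the data layout are assumptions made for illustration, and the SeqB/SeqD pairwise alignment is not listed on the slide: it is inferred here from the SeqA-SeqD-SeqB triplet.

```python
from collections import defaultdict

# (name, aligned row, name, aligned row, primary weight), as on the slides.
PAIRWISE = [
    ("A", "GARFIELD THE LAST FAT CAT", "B", "GARFIELD THE FAST CAT ---", 88),
    ("A", "GARFIELD THE LAST FA-T CAT", "C", "GARFIELD THE VERY FAST CAT", 77),
    ("A", "GARFIELD THE LAST FAT CAT", "D", "-------- THE ---- FAT CAT", 100),
    ("B", "GARFIELD THE ---- FAST CAT", "C", "GARFIELD THE VERY FAST CAT", 100),
    ("C", "GARFIELD THE VERY FAST CAT", "D", "-------- THE ---- FA-T CAT", 100),
    # Assumption: this pair is inferred, it does not appear on the slide.
    ("B", "GARFIELD THE ---- FAST CAT", "D", "-------- THE ---- FA-T CAT", 100),
]


def primary_library(pairwise):
    """Map every aligned residue pair to the weight of its pairwise alignment."""
    lib = defaultdict(dict)  # lib[(seq, index)][(other_seq, index)] = weight
    for name_x, row_x, name_y, row_y, weight in pairwise:
        i = j = 0  # residue indices; gaps and display spaces are not residues
        for cx, cy in zip(row_x, row_y):
            x_is_res, y_is_res = cx not in "- ", cy not in "- "
            if x_is_res and y_is_res:
                lib[(name_x, i)][(name_y, j)] = weight
                lib[(name_y, j)][(name_x, i)] = weight
            i += x_is_res
            j += y_is_res
    return lib


def extended_weight(lib, x, y):
    """Direct weight of (x, y) plus min(W(x,z), W(z,y)) over third-sequence residues z."""
    total = lib[x].get(y, 0)
    for z, w_xz in list(lib[x].items()):
        if z[0] in (x[0], y[0]):
            continue  # z must belong to a third sequence
        w_zy = lib[z].get(y, 0)
        if w_zy:
            total += min(w_xz, w_zy)
    return total


lib = primary_library(PAIRWISE)
fat_vs_fast = extended_weight(lib, ("A", 15), ("B", 11))   # F of FAT  vs F of FAST
last_vs_fast = extended_weight(lib, ("A", 11), ("B", 11))  # L of LAST vs F of FAST
print("FAT/FAST:", fat_vs_fast, " LAST/FAST:", last_vs_fast)
```

Under these assumptions the FAT/FAST pairing collects support through SeqC and SeqD even though the direct SeqA/SeqB alignment misses it, while LAST/FAST keeps only its direct weight.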


Methods

Data

Scalability


A Brief History of Consistency

A Long Chain of Small Contributions…


Consistency Based Algorithms

Gotoh (1990)
– Iterative strategy using consistency

Martin Vingron (1991)
– Dot Matrices Multiplications
– Accurate but too stringent

Dialign (1996, Morgenstern)
– Consistency
– Agglomerative Assembly

T-Coffee (2000, Notredame)
– Consistency
– Progressive algorithm

ProbCons (2004, Do)
– T-Coffee with a Bayesian Treatment
– Biphasic Gap Penalty

AMAP (2007, Schwartz)
– ProbCons Consistency
– Replace Progressive alignment with simulated Annealing
– Hard to distinguish from ProbCons

FSA (2009, Pachter)
– AMAP with automated parameter estimation
– Hard to distinguish from ProbCons


Choosing the right modeling method: M-Coffee


Combining Many MSAs into ONE

MUSCLE

MAFFT

ClustalW

???????

T-Coffee
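One way to picture how the outputs of several aligners can be merged is to count, for every pair of residues, how many of the input alignments put them in the same column. The sketch below illustrates that idea only and is not the M-Coffee implementation: the function names, the toy sequences s1 and s2, and the three small alignments standing in for the outputs of different aligners are all assumptions.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(msa):
    """Yield residue pairs ((name, index), (name, index)) that share a column."""
    names = sorted(msa)             # fixed order so pair keys are comparable
    seen = dict.fromkeys(names, 0)  # residues consumed so far in each sequence
    for column in zip(*(msa[name] for name in names)):
        residues = []
        for name, symbol in zip(names, column):
            if symbol != "-":
                residues.append((name, seen[name]))
                seen[name] += 1
        yield from combinations(residues, 2)


def pair_support(msas):
    """Count in how many alignments each residue pair is aligned."""
    support = Counter()
    for msa in msas:
        support.update(aligned_pairs(msa))
    return support


# Three toy alignments of the same two sequences (ACGT and AGT).
toy_msas = [
    {"s1": "ACGT", "s2": "A-GT"},
    {"s1": "ACGT", "s2": "A-GT"},
    {"s1": "ACGT", "s2": "AG-T"},
]
support = pair_support(toy_msas)
for pair, count in sorted(support.items()):
    print(pair, f"supported by {count}/{len(toy_msas)} alignments")
```

Pairs supported by every input are the ones a combined alignment can trust most; pairs supported by a single input are where the methods disagree.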


Consistency and Accuracy


Integrating New Types of Data: Template Based Sequence Alignments


[Figure: template based alignment workflow. Labels: Experimental Data, TARGET, Template, Aligner, Template-Sequence Alignment, Template Alignment, Template based Alignment of the Sequences, Primary Library]


Exploring The Template World

Template            Generator     Alignment Method   Mode
RNA Structure       Prediction    RNA Aligner        R-Coffee
Protein Structure   BLAST/PDB     3D Aligner         3D-Coffee
Profile             BLAST/NR      Profile/Profile    PSI-Coffee
Gene Structure      ENSEMBL       Genome Aligner     Exoset
Promoter            Transfac      Meta-Aligner       Meta-Coffee


3D-Coffee/Expresso: Incorporating Structural Information


Expresso: Finding the Right Structure

[Figure: Expresso pipeline. Labels: Sources, BLAST, Templates, SAP, Template Alignment, Source Template Alignment, Remove Templates, Library]


PSI-Coffee: Homology Extension


What is Homology Extension?

[Figure: a residue L that could be aligned with either of two L residues, marked "?"]

– Simple scoring schemes result in alignment ambiguities


What is Homology Extension?

[Figure: the same L residues shown with the profile columns LLLLLL, LLIVIL and LLLLLL. Labels: Profile 1, Profile 2]


PSI-Coffee: Homology Extension

[Figure: PSI-Coffee pipeline. Labels: Sources, BLAST, Templates, Profile Aligner, Template Alignment, Source Template Alignment, Remove Templates, Library]


Benchmarks


Method       Algorithm     Template   Score   Comment
ClustalW-2   Progressive   NO         22.74
PRANK        Gap           NO         26.18   Science 2008
MAFFT        Iterative     NO         26.18
Muscle       Iterative     NO         31.37
ProbCons     Consistency   NO         40.80
ProbCons     MonoPhasic    NO         37.53
T-Coffee     Consistency   NO         42.30
M-Coffee4    Consistency   NO         43.60
PSI-Coffee   Consistency   Profile    53.71
PROMALS      Consistency   Profile    55.08
PROMALS-3D   Consistency   PDB        57.60
3D-Coffee    Consistency   PDB        61.00   Expresso

Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
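The score used here, the fraction of reference columns that an aligner reproduces exactly, can be sketched in a few lines. This is a simplified illustration and not the official BaliBase scoring program, which among other things can restrict the comparison to selected regions of the reference; the function names and the two toy alignments are assumptions.

```python
def columns(msa):
    """Describe each column by the residue index it holds in every row (-1 for a gap)."""
    counters = [0] * len(msa)
    cols = set()
    for column in zip(*msa):
        ids = []
        for row, symbol in enumerate(column):
            if symbol == "-":
                ids.append(-1)
            else:
                ids.append(counters[row])
                counters[row] += 1
        cols.add(tuple(ids))
    return cols


def column_score(test_msa, reference_msa):
    """Fraction of reference columns found, residue for residue, in the test alignment."""
    ref_cols = columns(reference_msa)
    return len(ref_cols & columns(test_msa)) / len(ref_cols)


# Toy example: same two sequences, rows in the same order in both alignments.
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
score = column_score(test, reference)
print(f"column score: {100 * score:.2f}")
```

A column counts only when every sequence contributes the same residue (or a gap) as in the reference, which is why this measure is much harsher than a pairwise score on divergent datasets.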


Consistency


Homology Extension


Structural Extension


T-Coffee and The World

BLAST/SOAP

– Some Templates are obtained with a BLAST
– Queries can be sent to the EBI or the NCBI
– No Need for a Local BLAST installation

Users sequences


Incorporating RNA Information Within the T-Coffee Algorithm


ncRNAs Can Evolve Rapidly

CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

[Figure: the two sequences drawn as hairpins with base-paired stems]


ncRNAs Can Evolve Rapidly

CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

CC--AGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAAC--GGAGG
**   *   *** * *   ***       **

Sequence Alignment (Maximizing Identity)
– Incorrect
– Not Predictive
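The point of this example can be checked directly on the two sequences: they share few identical positions, yet both can close the same stem because substitutions on one strand are compensated on the other. The sketch below is an illustration under stated assumptions: it uses the sequences as written on the slide (with T rather than U), counts only Watson-Crick pairs, and assumes an 8-base stem formed by the two ends of each sequence; the function names are hypothetical.

```python
SEQ_1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
SEQ_2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

# Watson-Crick pairs only; G-U wobble pairs are deliberately left out.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
STEM_LEN = 8  # assumed stem length for this illustration


def identical_positions(a, b):
    """Number of positions carrying the same base in an ungapped comparison."""
    return sum(x == y for x, y in zip(a, b))


def stem_pairs(seq, stem_len):
    """Number of complementary pairs between the first and last stem_len bases."""
    return sum((seq[i], seq[-1 - i]) in WATSON_CRICK for i in range(stem_len))


shared = identical_positions(SEQ_1, SEQ_2)
print(f"identical positions: {shared}/{len(SEQ_1)}")
print(f"stem pairs in sequence 1: {stem_pairs(SEQ_1, STEM_LEN)}/{STEM_LEN}")
print(f"stem pairs in sequence 2: {stem_pairs(SEQ_2, STEM_LEN)}/{STEM_LEN}")
```

An aligner that maximizes identity has no reason to keep the two stems in register, which is why it can produce the shifted alignment shown above.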


The Holy Grail of RNA Comparison: Sankoff’s Algorithm


R-Coffee Extension

[Figure: base-paired positions (C-C, G-G) of two RNA structures matched against the TC Library]

TC Library
G G Score X
C C Score Y

Goal: Embedding RNA Structures Within The T-Coffee Libraries
The R-extension can be added on top of any existing method.


R-Coffee + Structural Aligners

               Avg Braliscore         Net Improv.
Method         direct   +T     +R     +T     +R
---------------------------------------------------
Stemloc        0.62     0.75   0.76   104    113
Mlocarna       0.66     0.69   0.71   101    133
Murlet         0.73     0.70   0.72   -132   -73
Pmcomp         0.73     0.73   0.73   142    145
T-Lara         0.74     0.74   0.69   -36    -8
Foldalign      0.75     0.77   0.77   72     73
---------------------------------------------------
Dynalign       ---      0.63   0.62   ---    ---
Consan         ---      0.79   0.79   ---    ---
---------------------------------------------------

Improvement = # R-Coffee wins - # R-Coffee losses over 170 test sets


R-Coffee + Regular Aligners

               Avg Braliscore         Net Improv.
Method         direct   +T     +R     +T     +R
---------------------------------------------------
Poa            0.62     0.65   0.70   48     154
Pcma           0.62     0.64   0.67   34     120
Prrn           0.64     0.61   0.66   -63    45
ClustalW       0.65     0.65   0.69   -7     83
Mafft_fftnts   0.68     0.68   0.72   17     68
ProbConsRNA    0.69     0.67   0.71   -49    39
Muscle         0.69     0.69   0.73   -17    42
Mafft_ginsi    0.70     0.68   0.72   -49    39
---------------------------------------------------

Improvement = # R-Coffee wins - # R-Coffee losses over 388 test sets
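The "Net Improv." columns are a win/loss tally rather than a difference of averages, which is why a method can show the same average score with and without the extension and still have a clearly positive or negative net improvement. A minimal sketch of that tally follows; the function name and the per-dataset scores are made up for illustration and are not taken from the benchmark.

```python
def net_improvement(direct_scores, combined_scores):
    """Datasets where the combined method scores higher, minus those where it scores lower."""
    wins = sum(c > d for d, c in zip(direct_scores, combined_scores))
    losses = sum(c < d for d, c in zip(direct_scores, combined_scores))
    return wins - losses


# Hypothetical per-dataset scores for one aligner, alone and with the extension.
direct = [0.60, 0.70, 0.80, 0.50]
combined = [0.65, 0.70, 0.75, 0.60]
print("net improvement:", net_improvement(direct, combined))
```

Ties contribute nothing to the tally, so the value is bounded by the number of test sets (170 or 388 in the tables above).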


Genomic Era Challenges: Conclusion

Template Based Alignments

Meta-Methods: M-Coffee

Homology Extension (Proteins)

R-Coffee

Scaled Consistency


Open Questions

– Accurately Aligning non transcribed DNA
– Coping with One Billion Human Genomes


Comparative Bioinformatics

University College Dublin
– Des Higgins
– Orla O’Sullivan
– Iain Wallace (UCD, IE)

Berlin Free University
– Knut Reinert
– Tobias Rausch

Swiss Institute of Bioinformatics
– Ioannis Xenarios
– Sebastien Morreti

Comparative Bioinformatics
– Merixell Oliva
– Giovanni Bussoti
– Carsten Kemena
– Emanuele Rainieri
– Ionas Erb
– Jia Ming Chang
– Matthias Zytneki


Why So Much Interest For Multiple Alignments ?

Extrapolation

Motifs/Patterns

Phylogeny

Profiles

Structure Prediction

SNP Analysis

Reactivity Analysis

Regulatory Elements


Phylogeny Vs Function: Applications

Comparative Genomics => New Medium
New Medium => New Clinical Test


Detecting ncRNAs in silico: a long way to go…

RNAse P


Obtaining the Structure of a ncRNA is difficult

Hard to Align The Sequences Without the Structure

Hard to Predict the Structures Without an Alignment


R-Coffee: Modifying T-Coffee at the Right Place

Incorporation of Secondary Structure information within the Library

Two Extra Components for the T-Coffee Scoring Scheme

– A new Library
– A new Scoring Scheme


G-INS-i, H-INS-i and F-INS-i use pairwise alignment information when constructing a multiple alignment. The two options ([HF]-INS-i) incorporate local alignment information and do NOT USE FFT.


Molecular Biology Within the System Biology Era

Protein A
– Interacts with
– Regulates
– Inhibits
Protein B


Molecular Biology In the 1000 Genomes Era

Protein A
– Interacts with
– Regulates
– Inhibits
Protein B

Variation Within Species: CNVs of A and SNPs

Conservation Across Species


System Biology vs Comparative Genomics

Systems Biology
– Systems can be Understood

Comparative Genomics
– Systems can Evolve through Selection


Phylogeny Vs Function: Applications

– Important: Application
– Possible: Many New Genomes
– Challenging: Too Many New Genomes


3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments


Comparing Methods

MAFFT


Some Benchmark: BB11 BaliBase

BB11: 38 highly divergent (less than 25% id) datasets from BaliBase
BB11: predicts 78% of the results measured on other datasets

Blackshields, Higgins


PhD Fellowships www.crg.es


What’s in a Multiple Sequence Alignment?

Evolution Inertia
– Common Ancestry Shows up In the sequences

Selection
– Important Features Are Preserved

Functional Constraint
– Same Function, Same Sequence

Convergence

Phylogenetic Footprint, Evolutionary Trace …


What’s in a Multiple Alignment?

The MSA contains what you put inside:
– Structural Similarity
– Evolutionary Similarity
– Sequence Similarity

You can view your MSA as:
– A record of evolution
– A summary of a protein family
– A collection of experiments made for you by Nature…


Producing The Right Alignment

Multiple Sequence Alignments Influence Phylogenetic Trees

Choice of Method is not Neutral
– Different Methods
– Different Alignments
– Different Trees

Using The Right Models ensures Producing the right Tree


Model Based Alignments vs Naïve Alignments

Naïve Alignment
– Lexicographic Alignment
– Maximizing the number of identities
– At best using a substitution matrix

Model Based Alignments
– Using a model
– Protein structure information
– RNA Structure information
– Combining/Confronting Modeling methods

Template based Alignments
– Model based Alignments through the use of Templates


T-Coffee and Model Based Alignments

T-Coffee Algorithm

Expresso: Aligning Protein Structures

R-Coffee: Aligning RNA structures

M-Coffee: Combining methods

Page 95: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

T-Coffee and Consistency…

Page 96: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

When Sequences Are not Enough

3D-Coffee and Expresso

Page 97: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Page 98: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Page 99: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Where to Trust Your Alignments

Most Methods Agree

Most Methods Disagree

Page 100: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Conclusion

Model Based Alignments Give the best Accuracy

Template based alignment is a very efficient way to turn Naïve aligners into model based aligners

Sequence Alignments are not necessarily reliable over their entire lengths

Page 101: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Mangel M., Samaniego F.J., Abraham Wald’s Work on Aircraft Survivability, J. American Statistical Association. 79, 259-270, (1984)

Page 102: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Building and Using Models

35.67 Angstrom

Page 103: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Computing the Correct Alignment is a Complicated Problem

Page 104: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Stochastic Optimization

Page 105: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Stochastic Optimization

Exploration of Complex Optimization Problems With Multiple Constraints

– Genomic Alignments
– RNA Alignments

Generation of Population of Suboptimal Solutions

– Quality=f( optimality )

Specification of the Consistency Objective Function of T-Coffee

Page 106: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Three Types of Algorithms

Progressive: ClustalW

Iterative: Muscle

Consistency Based: T-Coffee and ProbCons

Page 107: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

T-Coffee and Consistency…

Each Library Line is a Soft Constraint (a wish)

You can’t satisfy them all

You must satisfy as many as possible (The easy ones)
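
One way to picture how these soft constraints are weighted is the library extension step described in the T-Coffee paper: the weight of a residue pair is its direct weight plus, for each residue of a third sequence linked to both, the smaller of the two linking weights. The Python sketch below is a toy illustration only; the dictionary layout and the names `library` and `extend` are assumptions made here, not T-Coffee's own code.

```python
from collections import defaultdict

def extend(library):
    """Extend a toy primary library by triplet consistency.

    `library` maps frozenset({(seq, pos), (seq, pos)}) -> weight.
    This layout is hypothetical and chosen only for illustration.
    """
    # Index the partners of every residue so triplets are easy to find.
    neighbours = defaultdict(dict)
    for pair, weight in library.items():
        a, b = tuple(pair)
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    # Start from the direct weights.
    extended = defaultdict(float)
    for pair, weight in library.items():
        extended[pair] += weight

    # Add the contribution of every intermediate residue k linking a and b.
    for k, links in neighbours.items():
        residues = sorted(links)
        for idx, a in enumerate(residues):
            for b in residues[idx + 1:]:
                if a[0] == b[0]:
                    continue  # a and b must come from different sequences
                extended[frozenset({a, b})] += min(links[a], links[b])
    return dict(extended)

# Toy library: one residue from each of three sequences A, B and C.
library = {
    frozenset({("A", 1), ("B", 1)}): 88,
    frozenset({("A", 1), ("C", 1)}): 77,
    frozenset({("C", 1), ("B", 1)}): 100,
}
extended = extend(library)
```

Pairs supported by many intermediate residues end up with high weights, which is what makes them "the easy ones" to satisfy during progressive alignment.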

Page 108: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Consistency Based Algorithms: T-Coffee

Gotoh (1990)
– Iterative strategy using consistency

Martin Vingron (1991)
– Dot Matrices Multiplications
– Accurate but too stringent

Dialign (1996, Morgenstern)
– Consistency
– Agglomerative Assembly

T-Coffee (2000, Notredame)
– Consistency
– Progressive algorithm

ProbCons (2004, Do)
– T-Coffee with a Bayesian Treatment

Page 109: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

How Good Is My Method ?

Page 110: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Structures Vs Sequences

Page 111: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

T-Coffee Results

Validation Using BaliBase

Page 112: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Too Many Methods for ONE Alignment: M-Coffee

Page 113: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.
Page 114: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Estimating the Accuracy of your MSA

Page 115: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

What To Do Without Structures

Page 116: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Page 117: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Expresso: Finding the Right Structure

Why Not Use Structure Based Alignments?

Page 118: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Template Based Multiple Sequence Alignments

Page 119: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Template Based Multiple Sequence Alignments

[Diagram labels: Sources; Templates (Structure, Profile, …); Template Aligner; Template Alignment; Source Template Alignment; Remove Templates; Library]

Page 120: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Method      Score        Templates   Prefab  Homstrad
-------------------------------------------------------
ClustalW    Matrix       ----        61.80   ----
Kalign      Matrix       ----        63.00   ----
MUSCLE      Matrix       ----        68.00   45.0
-------------------------------------------------------
T-Coffee    Consistency  ----        69.97   44.0
ProbCons    Consistency  ----        70.54   ----
Mafft       Consistency  ----        72.20   ----
M-Coffee    Consistency  ----        72.91   ----
MUMMALS     Consistency  ----        73.10   ----
-------------------------------------------------------
Clustal-db  Matrix       Profiles    ----    ----
PRALINE     Matrix       Profiles    ----    50.2
PROMALS     Consistency  Profiles    79.00   ----
SPEM        Matrix       Profiles    77.00   ----
-------------------------------------------------------
EXPRESSO    Consistency  Structures  ----    71.9 *
T-Lara      Consistency  Structures  ----    ----
-------------------------------------------------------

Table 1. Summary of all the methods described in the review. Validation figures were compiled from several sources and selected for their compatibility. Prefab refers to some validation made on Prefab Version 3. The HOMSTRAD validation was made on datasets having less than 30% identity. The source of each figure is indicated by a reference.
* The EXPRESSO figure comes from a slightly more demanding subset of HOMSTRAD (HOM39) made of sequences less than 25% identical.

Page 121: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Improving The Evaluation

Page 122: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

How Do We Perform In The Twilight Zone?

– Consistency Based Methods Have an Edge
– Hard to tell Methods Apart
– Sequence Alignment is NOT solved

Page 123: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

More Than Structure based Alignments

Structural Correctness Is Only the Easy Side of the Coin.

In practice, MSAs are intermediate models used to generate other models:

Data       Model Type     Benchmark
------------------------------------
Homology   Profile        Yes
Evolution  Trees          No
Structure  3D-Structure   CASP
Function   Annotation     No

Page 124: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Conclusion

Template based Multiple Sequence Alignments
– Projecting any relevant information onto the sequences
– Using this Information

Need for new evaluation procedures
– Functional Analysis
– Phylogenetic Analysis
– Homology Search (Profiles)
– Homology Modelling

Integrating data
– Making sure your bits of data can fight with one another

Page 125: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Turning Data into Models

Data
– Columbus considered that the landmass occupied 225°, leaving only 135° of water (Marinus of Tyre, 70 AD).
– Columbus believed that 1° represented only 56 miles (Alfraganus, IXth century).
– He knew there was an island named Japan off the coast of China…

Model
– Circumference of the Earth: 25,255 km at most
– Canary Islands to Japan: 3,700 km (Reality: 12,000 km)

Page 126: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

The More Structures The Merrier

[Plot: axis labels "Average Improvement over T-Coffee" and "Struc/Seq Ratio"]

Page 127: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

The Right Mix of Methods

Page 128: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Page 129: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Applications

Page 130: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Looking-Up The DNA Behind The Sequences: PROTOGENE

Page 131: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

SAR Analysis

Correlate Alignment Variations with Reactivity

Application to the Human Kinome
Collaboration with Sanofi-Aventis

Main Issue:
– Training problem
– Proper Benchmarking

Page 132: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

ncRNA Multiple Alignments with R-Coffee

Laundering the Genome Dark Matter

Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program

Page 133: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

No Plane Today…

Page 134: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

ncRNAs Comparison

And ENCODE said…“nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”

Who Are They?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)

How Many of them?
– Open question
– 30,000 is a common guess
– Harder to detect than proteins

Page 135: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

ncRNAs can have different sequences and Similar Structures

Page 136: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

ncRNAs are Difficult to Align

Same Structure Low Sequence Identity

Small Alphabet, Short Sequences Alignments often Non-Significant

Page 137: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Obtaining the Structure of a ncRNA is difficult

Hard to Align The Sequences Without the Structure

Hard to Predict the Structures Without an Alignment

Page 138: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

The Holy Grail of RNA Comparison: Sankoff’s Algorithm

Page 139: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

The Holy Grail of RNA Comparison: Sankoff’s Algorithm

Simultaneous Folding and Alignment

– Time Complexity: O(L^(2n))
– Space Complexity: O(L^(3n))

In Practice, for Two Sequences:

Length            Time      Memory
50 nucleotides    1 min.    6 M.
100 nucleotides   16 min.   256 M.
200 nucleotides   4 hours   4 G.
400 nucleotides   3 days    3 T.

Forget about
– Multiple sequence alignments
– Database searches
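
As a quick check on the timings above: reading the stated time complexity as O(L^(2n)), two sequences give a cost growing as L^4, so each doubling of the length multiplies the running time by 16. The short Python sketch below extrapolates from the 1-minute figure at 50 nucleotides. The function name `scale_time` is hypothetical, and the exponent is taken from the slide as written, not from an independent analysis.

```python
def scale_time(minutes_at_base, base_len, new_len, n_seqs=2):
    """Extrapolate running time assuming cost grows as L^(2n)."""
    exponent = 2 * n_seqs  # time exponent as stated on the slide
    return minutes_at_base * (new_len / base_len) ** exponent

# Extrapolate from 1 minute at 50 nucleotides.
for length in (50, 100, 200, 400):
    minutes = scale_time(1.0, 50, length)
    print(f"{length} nt: {minutes:.0f} min ({minutes / 60:.1f} h)")
```

The extrapolated values (16, 256 and 4096 minutes) are close to the 16 minutes, 4 hours and 3 days listed on the slide.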

Page 140: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

The next best Thing: Consan

Consan = Sankoff + a few constraints

Use of Stochastic Context Free Grammars

– Tree-shaped HMMs
– Made sparse with constraints

The constraints are derived from the most confident positions of the alignment

Equivalent of Banded DP

Page 141: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Going Multiple….

Structural Aligners

Page 142: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Game Rules

Using Structural Predictions
– Produces better alignments
– Is computationally expensive

Use as much structural information as possible while doing as little computation as possible…

Page 143: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Adapting T-Coffee To RNA Alignments

Page 144: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

T-Coffee and Consistency…

Page 145: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

T-Coffee and Consistency…

Page 146: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

T-Coffee and Consistency…

Page 147: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

T-Coffee and Consistency…

Page 148: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Consistency: Conflicts and Information

[Figure: residues W, X, Y and Z linked by pairwise alignments]

– Fully Consistent: More Reliable
– Partly Consistent: Less Reliable (Y is unhappy, X is unhappy)

Page 149: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

[Figure: R-Coffee pipeline]

– RNA Sequences
– Secondary Structures (RNAplfold)
– Primary Library (Consan, or Mafft / Muscle / ProbCons)
– R-Coffee Extension
– R-Coffee Extended Primary Library
– R-Score
– Progressive Alignment Using the R-Score

Page 150: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

R-Coffee Scoring Scheme

[Figure: an aligned pair of residues (CC) whose base-pairing partners (GG) are also aligned]

R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
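
The scoring rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the TC-scores are held in a plain dictionary keyed by residue positions, the predicted base pairs in one dictionary per sequence, and a residue with no predicted partner simply keeps its own TC-score. None of these names or conventions come from the R-Coffee source code.

```python
def r_score(tc_score, partner_a, partner_b, i, j):
    """R-Score for aligning residue i of sequence A with residue j of B.

    tc_score  : dict mapping (pos_in_A, pos_in_B) -> TC-Score (hypothetical layout)
    partner_a : dict mapping a position in A to its predicted base-pair partner
    partner_b : same for sequence B
    """
    direct = tc_score.get((i, j), 0)
    pi, pj = partner_a.get(i), partner_b.get(j)
    if pi is None or pj is None:
        # Assumption: unpaired residues fall back to the plain TC-Score.
        return direct
    # Take the better of the pair itself and its base-pairing partners.
    return max(direct, tc_score.get((pi, pj), 0))

# Toy example: positions 3/4 (the "CC" pair) pair with 10/12 (the "GG" pair).
tc = {(3, 4): 20, (10, 12): 80}
pairs_a = {3: 10, 10: 3}
pairs_b = {4: 12, 12: 4}
score_cc = r_score(tc, pairs_a, pairs_b, 3, 4)
```

In this toy case the weakly supported CC pair inherits the stronger score of its GG partners, which is how a compensated mutation can still be aligned correctly.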

Page 151: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Validating R-Coffee

Page 152: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

RNA Alignments are harder to validate than Protein Alignments

Protein Alignments Use of Structure based Reference Alignments

RNA Alignments: No Real structure based reference alignments
– The structures are mostly predicted from sequences
– Circularity

Page 153: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

BraliBase and the BraliScore

Database of Reference Alignments

388 multiple sequence alignments.

Evenly distributed between 35 and 95 percent average sequence identity

Contain 5 sequences selected from the RNA family database Rfam

The reference alignment is based on a SCFG model based on the full Rfam seed dataset (~100 sequences).

Page 154: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

BraliBase SPS Score

Reference: Rfam MSA

SPS = (Number of Identically Aligned Pairs) / (Number of Aligned Pairs)
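
A minimal Python sketch of this ratio follows, assuming alignments are given as lists of equal-length gapped strings with sequences in the same order, and assuming the denominator counts the aligned residue pairs of the reference alignment. The function names are hypothetical; BRAliBase's own scripts may differ in detail.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs placed in the same column.

    Each pair is ((seq_index, residue_index), (seq_index, residue_index)).
    """
    counters = [0] * len(alignment)  # residues seen so far in each sequence
    pairs = set()
    for column in zip(*alignment):
        present = []
        for seq_index, char in enumerate(column):
            if char != "-":
                present.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sps(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

reference = ["AC-G", "ACUG"]
test = ["ACG-", "ACUG"]
score = sps(test, reference)
```

Here the test alignment recovers two of the three reference pairs, so the score is 2/3.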

Page 155: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

BraliBase: SCI Score

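[Figure: each sequence (Seq1 to Seq6) is folded on its own (RNApfold), giving a structure such as (((…)))…((..)) and an energy G per sequence; the alignment is folded as a whole (RNAalifold), giving a consensus structure and an energy G ALN. The SCI is built from G ALN, the average G Seq and the covariance term (Cov).]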

Page 156: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

BRaliScore

Braliscore= SCI*SPS

Page 157: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

RM-Coffee + Regular Aligners

Method         Avg Braliscore          Net Improv.
               direct   +T     +R      +T     +R
----------------------------------------------------
Poa            0.62     0.65   0.70    48     154
Pcma           0.62     0.64   0.67    34     120
Prrn           0.64     0.61   0.66    -63    45
ClustalW       0.65     0.65   0.69    -7     83
Mafft_fftns    0.68     0.68   0.72    17     68
ProbConsRNA    0.69     0.67   0.71    -49    39
Muscle         0.69     0.69   0.73    -17    42
Mafft_ginsi    0.70     0.68   0.72    -49    39
----------------------------------------------------
RM-Coffee4     0.71     /      0.74    /      84

Page 158: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

How Best is the Best….

Method        vs. R-Coffee-Consan   vs. RM-Coffee4
---------------------------------------------------
M-Locarna     234 ***               183 **
Stral         169 ***               62
FoldalignM    146                   61
Murlet        130 *                 -12
Rnasampler    129 *                 -27
T-Lara        125 *                 -30
---------------------------------------------------
Poa           241 ***               217 ***
T-Coffee      241 ***               199 ***
Prrn          232 ***               198 ***
Pcma          218 ***               151 ***
Proalign      216 ***               150 **
Mafft fftns   206 ***               148 *
ClustalW      203 ***               136 ***
Probcons      192 ***               128 *
Mafft ginsi   170 ***               115
Muscle        169 ***               111

Page 159: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Range of Performances

Effect of Compensated Mutations

Page 160: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Conclusion/Future Directions

T-Coffee/Consan is currently the best MSA protocol for ncRNAs

Testing how important is the accuracy of the secondary structure prediction

Going deeper into Sankoff’s territory: predicting and aligning simultaneously

Page 161: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

[email protected]

Credits and Web Servers

Andreas Wilm, Des Higgins, Sebastien Moretti, Ioannis Xenarios, Cedric Notredame

CGR, SIB, UCD

Page 162: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Prank Vs Prank

Page 163: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Prank Vs Prank

Gop=0, Gep=0

Page 164: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Prank Vs Prank

The reconstruction of evolutionary homology -- including the correct placement of insertion and deletion events -- is only feasible for rather closely-related sequences. PRANK is not meant for the alignment of very diverged protein sequences. If sequences are very different, the correct homology cannot be reconstructed with confidence and

http://www.ebi.ac.uk/goldman-srv/prank/

Page 165: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Do Benchmarks All Tell the same story?

Based on

Page 166: Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics.

Probcons: Different Primary Library

Score = MIN(xz, zk) / MAX(xz, zk)

Score(xi ~ yj | x, y, z) = ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)
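
The sum over k in the second expression is a matrix product of two posterior probability tables. The pure-Python sketch below computes it for a single intermediate sequence z. The function name and the nested-list layout are assumptions made for illustration; the full ProbCons transformation also combines the contributions of all sequences, which is not shown here.

```python
def consistency_score(p_xz, p_zy):
    """Score(xi ~ yj | x, y, z) for every (i, j) via one intermediate z.

    p_xz[i][k] = P(xi ~ zk | x, z)
    p_zy[k][j] = P(zk ~ yj | z, y)
    """
    n_k = len(p_zy)
    n_j = len(p_zy[0])
    return [
        [sum(row[k] * p_zy[k][j] for k in range(n_k)) for j in range(n_j)]
        for row in p_xz
    ]

# Toy posterior tables for sequences of length 2.
p_xz = [[0.9, 0.1],
        [0.0, 0.8]]
p_zy = [[1.0, 0.0],
        [0.5, 0.5]]
scores = consistency_score(p_xz, p_zy)
```

A pair (xi, yj) scores highly only when some residue zk is confidently aligned to both, which is the probabilistic counterpart of the T-Coffee library extension.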