Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega,...

39
Clustal-Omega Tandy Warnow

Transcript of Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega,...

Page 1: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Clustal-Omega

TandyWarnow

Page 2: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Today

•  Discussionof“Fastscalablegenera<onofhigh-qualityproteinmul<plesequencealignmentsusingClustal-Omega”bySieversetal.,publishedinMolecularSystemsBiology7:539(2011)– DesignofClustal-Omega–  Performanceondata– Maininnova<on:guidetreeconstruc<on–  Reviewsin“transparentprocess”–  Comparisontootherstudies(<mepermiSng)

Page 3: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Clustal-Omegastudy

•  Clustal-OmegaisthelatestintheClustalfamilyofMSAmethods

•  Clustal-Omegaisdesignedprimarilyforaminoacidalignment,butcanbeusedonnucleo<dedatasets

Page 4: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Clustal-Omega

Givenunalignedsequences:•  Buildaguidetree•  Performprogressivealignmentontheguidetree,using“consistency”toinformthealignmentmergers

Ifdesired,youcan:•  Iterate(constructprofileHMMbasedonalignment,andthenre-align)

•  Useexternalprofilealignment(EPA)

Page 5: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Evalua<on

•  DefaultClustal-Omega(noitera<on,noEPA)comparedtostandardmethodsonproteinbenchmarks

•  VariantsofClustal-OmegacomparedtodefaultClustal-Omega

Criteria:•  TC(columnscore)on“core”columns•  Running<me

Page 6: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Datasets

•  BAliBASE(“small”)•  PreFab(50sequencesineachset,butevaluatedbasedonpairwisealignments)

•  HomFam(muchlarger!)

Page 7: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

about 10 000 sequences is MAFFT/PartTree (Katoh and Toh,2007). It is very fast but leads to a loss in accuracy, which hasto be compensated for by iteration and other heuristics. WithClustal Omega, we use a modified version of mBed (Black-shields et al, 2010), which has complexity of O(N log N), andwhich produces guide trees that are just as accurate as thosefrom conventional methods. mBed works by ‘emBedding’ eachsequence in a space of n dimensions where n is proportional tolog N. Each sequence is then replaced by an n element vector,where each element is simply the distance to one ofn ‘referencesequences.’ These vectors can then be clustered extremelyquickly by standard methods such as K-means or UPGMA.In Clustal Omega, the alignments are then computed using thevery accurate HHalign package (Soding, 2005), which alignstwo profile hidden Markov models (Eddy, 1998).Clustal Omega has a number of features for adding

sequences to existing alignments or for using existingalignments to help align new sequences. One innovation isto allow users to specify a profile HMM that is derived from analignment of sequences that are homologous to the input set.The sequences are then aligned to these ‘external profiles’ tohelp align them to the rest of the input set. There are alreadywidely available collections of HMMs frommany sources suchas Pfam (Finn et al, 2009) and these can now be used to helpusers to align their sequences.

Results

Alignment accuracy

The standard method for measuring the accuracy of multiplealignment algorithms is to use benchmark test sets of referencealignments, generated with reference to three-dimensionalstructures. Here, we present results from a range of packagestested on three benchmarks: BAliBASE (Thompson et al,2005), Prefab (Edgar, 2004) and an extended version ofHomFam (Blackshields et al, 2010). For these tests, we justreport results using the default settings for all programs butwith two exceptions, which were needed to allow MUSCLE(Edgar, 2004) and MAFFT to align the biggest test cases in

HomFam. For test cases with 43000 sequences, we runMUSCLEwith the –maxiter parameter set to 2, in order to finishthe alignments in reasonable times. Second, we have run severaldifferent programs from the MAFFT package. MAFFT (Katohet al, 2002) consists of a series of programs that can be runseparately or called automatically from a script with the –autoflag set. This flag chooses to run a slow, consistency-basedprogram (L-INS-i) when the number and lengths of sequences issmall. When the numbers exceed inbuilt thresholds, a conven-tional progressive aligner is used (FFT-NS-2). The latter is also theprogram that is run by default if MAFFT is called with no flagsset. For very large data sets, the –parttree flag must be set on thecommand line and a very fast guide tree calculation is then used.The results for the BAliBASE benchmark tests are shown in

Table I. BAliBASE is divided into six ‘references.’Average scoresare given for each reference, along with total run times andaverage total column (TC) scores, which give the proportion ofthe total alignment columns that is recovered. A score of 1.0indicates perfect agreementwith the benchmark. There are tworows for the MAFFT package: MAFFT (auto) and MAFFTdefault. In most (203 out of 218) BAliBASE test cases, thenumber of sequences is small and the script runs L-INS-i, whichis the slow accurate program that uses the consistency heuristic(Notredame et al, 2000) that is also used by MSAprobs(Liu et al, 2010), Probalign, Probcons (Do et al, 2005) andT-Coffee. These programs are all restricted to small numbers ofsequences but tend to give accurate alignments. This is clearlyreflected in the times and average scores in Table I. The timesrange from 25min up to 22h for these packages and theaccuracies range from 55 to 61% of columns correct. ClustalOmega only takes 9min for the same runs but has an accuracylevel that is similar to that of Probcons and T-Coffee.The rest of the table is mainly taken by the programs that use

progressive alignment. Some of these are very fast but thisspeed ismatched by a considerable drop in accuracy comparedwith the consistency-based programs and Clustal Omega. Theweakest program here, is Clustal W (Larkin et al, 2007)followed by PRANK (Loytynoja and Goldman, 2008). PRANKis not designed for aligning distantly related sequences but atgiving good alignments for phylogenetic work with special

Table I BAliBASE results

Aligner Av score(218 families)

BB11(38 families)

BB12(44 families)

BB2(41 families)

BB3(30 families)

BB4(49 families)

BB5(16 families)

Tot time (s) Consistency

MSAprobs 0.607 0.441 0.865 0.464 0.607 0.622 0.608 12 382.00 YesProbalign 0.589 0.453 0.862 0.439 0.566 0.603 0.549 10 095.20 YesMAFFT (auto) 0.588 0.439 0.831 0.450 0.581 0.605 0.591 1475.40 Mostly

(203/218)Probcons 0.558 0.417 0.855 0.406 0.544 0.532 0.573 13086.30 YesClustal O 0.554 0.358 0.789 0.450 0.575 0.579 0.533 539.91 NoT-Coffee 0.551 0.410 0.848 0.402 0.491 0.545 0.587 81041.50 YesKalign 0.501 0.365 0.790 0.360 0.476 0.504 0.435 21.88 NoMUSCLE 0.475 0.318 0.804 0.350 0.409 0.450 0.460 789.57 NoMAFFT (default) 0.458 0.258 0.749 0.316 0.425 0.480 0.496 68.24 NoFSA 0.419 0.270 0.818 0.187 0.259 0.474 0.398 53 648.10 NoDialign 0.415 0.265 0.696 0.292 0.312 0.441 0.425 3977.44 NoPRANK 0.376 0.223 0.680 0.257 0.321 0.360 0.356 128 355.00 NoClustalW 0.374 0.227 0.712 0.220 0.272 0.396 0.308 766.47 No

The figures are total column scores produced using bali score on core columns only. The average score over all families is given in the second column. The results forBAliBASE subgroupings are in columns 3–8. The total run time for all 218 families is given in the second last column. The last column indicates whether the method isconsistency based.

High-quality protein MSAs using Clustal OmegaF Sievers et al

2 Molecular Systems Biology 2011 & 2011 EMBO and Macmillan Publishers Limited

FromSeiversetal.,MolecularSystemsBiology2011

TCscoresonBAliBASEsequences

Page 8: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

0.35

0.4

0.45

0.5

0.55

0.6

0.65

10 100 1000 10000 100000 1e+06

Tota

l Col

umn

Scor

e

Time [s]

BAliBASE

MSAprobsProbalignMafft-auto

ProbconsClustalOmega TCoffeeSATeOpal

Kalign

MuscleMafft

FsaDialign

PrankClustalW2

Page 9: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

attention

togap

s.These

gappositio

nsare

not

inclu

dedin

these

testsas

they

tend

not

tobestru

cturally

conserved.

Dialign

(Morgen

sternet

al,1998)

does

not

use

consisten

cyor

progressivealign

men

tbutis

based

onfindingbestlocalm

ultip

lealign

men

ts.FSA(Brad

leyeta

l,2009)uses

samplin

gof

pairwise

alignmen

tsan

d‘seq

uen

cean

nealin

g’an

dhas

been

show

nto

delivergo

odnucleotide

sequen

cealign

men

tsin

thepast.

ThePrefab

ben

chmark

testresu

ltsare

shownin

Table

II.Here,

theresu

ltsare

divided

into

fivegrou

psacco

rdingto

the

percen

tiden

tityoftheseq

uen

ces.Theoverall

scores

range

from53

to73%

ofcolumnscorrect.

Theconsisten

cy-based

program

sMSA

probs,M

AFFTL-IN

S-i,Probalign

,Probconsan

dT-C

offee,

areagain

themost

accurate

butwith

longruntim

es.ClustalO

mega

isclo

seto

theconsisten

cyprogram

sin

accuracy

butis

much

faster.There

isthen

agap

tothefaster

progressive

based

program

sofMUSC

LE,MAFFT,

Kalign

(Lassm

annan

dSo

nnham

mer,

2005)an

dClustal

W.

Resu

ltsfrom

testing

largealign

men

tswith

up

to50

000sequ

ences

aregiven

inTable

IIIusin

gHom

Fam.Here,

eachalign

men

tismade

upof

acore

ofaHom

strad(M

izugu

chietal,

1998)stru

cture-based

alignmen

tofatleastfive

sequen

ces.These

sequen

cesare

then

inserted

into

atest

setof

sequen

cesfrom

the

correspondin

g,hom

ologous,P

famdom

ain.T

hisgives

verylarge

setsof

sequen

cesto

bealign

edbu

tthetestin

gison

lycarried

out

onthesequ

ences

with

know

nstru

ctures.

Only

someprogram

sare

ableto

deliveralign

men

tsatall,w

ithdata

setsofth

issize.W

erestricted

thecom

parisonsto

Clustal

Omega,

MAFFT,

MUSC

LEan

dKalign

.MAFFT

with

defaultsettin

gs,has

alim

itof

20000

sequen

cesan

dweon

lyuse

MAFFT

with

–parttreefor

thelast

sectionof

TableIII.

MUSC

LEbecom

esincreasin

glyslow

when

youget

over3000

sequen

ces.Therefore,

for43000

sequen

cesweused

MUSC

LEwith

thefaster

butless

accurate

settingof

–maxiters

2,which

restrictsthenumber

ofiteration

sto

two.

Overall,

Clustal

Omega

iseasily

themost

accurate

program

inTab

leIII.

Theruntim

esshow

MAFFTdefau

ltan

dKalign

tobeexcep

tionally

fastonthesm

allertest

casesan

dMAFFT

–parttree

tobe

very

faston

the

biggest

families.

Clustal

Omega

does

scalewell,

however,

with

increasin

gnumbers

of

sequen

ces.This

scalingis

describ

edin

more

detail

inthe

Supplem

entary

Inform

ation.W

edohave

twofurth

ertestcases

with

450

000seq

uen

ces,butitwas

notpossib

leto

getresu

ltsforthese

fromMUSC

LEorKalign

.These

aredescrib

edin

the

Supplem

entary

Inform

ationas

well.

Table

IIIgiv

esoverall

run

times

forthe

fourprogram

sevalu

atedwith

HomFam

.Figu

re1reso

lvesthese

runtim

escase

bycase.K

alignisvery

fastforsm

allfamilies

butdoes

not

scaleas

well.O

verall,M

AFFTisfaster

than

theother

program

sover

alltest

casesizes

butClustal

Omega

scalessim

ilarly.Points

inFigu

re1rep

resentdifferen

tfam

ilieswith

differen

tav

erageseq

uen

celen

gthsan

dpairw

iseiden

tities.Therefo

re,the

scalability

trend

isfuzzy,

with

largerdots

occu

rring

generally

abov

esm

allerdots.Su

pplem

entary

Figu

reS3

shows

scalability

data,

where

subsets

ofincreasin

gsize

aresam

pled

fromonelarge

family

only.T

hisred

uces

variab

ilityin

pairw

iseiden

tityan

dseq

uen

celen

gth.

Externalprofile

alig

nment

ClustalO

mega

canread

extrainform

ationfro

maprofileHMM

derived

from

preexistin

galign

men

ts.Fo

rexam

ple,

ifauser

Table II Prefab results

Aligner 0o%IDp100 (1682families)

0p%IDp20 (912families)

20p%IDp40 (563families)

40p%IDp70 (117families)

70p%IDp100 (90families)

Total time (s) (1682families)

Consistency

MSAprobs 0.737 0.591 0.889 0.965 0.971 51 286.00 YesMAFFT(auto)

0.721 0.569 0.876 0.961 0.979 4544.45 Yes

Probalign 0.719 0.563 0.881 0.961 0.977 35117.30 YesProbcons 0.717 0.562 0.876 0.955 0.972 46 908.30 YesT-Coffee 0.710 0.558 0.865 0.950 0.972 175 789.00 YesClustal O 0.700 0.535 0.866 0.967 0.980 1698.06 NoMUSCLE 0.677 0.507 0.850 0.946 0.976 2068.56 NoMAFFT 0.677 0.513 0.836 0.961 0.979 225.56 NoKalign 0.649 0.474 0.817 0.957 0.979 80.81 NoClustalW2 0.617 0.430 0.797 0.933 0.975 3433.53 NoDialign 0.595 0.398 0.783 0.940 0.974 18 909.70 NoPRANK 0.586 0.390 0.767 0.951 0.978 351 498.00 NoFSA 0.534 0.277 0.791 0.965 0.976 229 391.00 No

Total column scores (TC) are shown for different percent identity ranges; the second column is the average score over all test cases. The total run time in seconds is shown in the second last column. The last column indicatesif the method is consistency based.

High-quality

proteinMSAsusing

ClustalO

mega

FSievers

etal

&2011

EMBO

andMacm

illanPublishers

Limited

MolecularSystem

sBiology

20113

FromSeiversetal.,MolecularSystemsBiology2011

Notethatbestperformingmethoddependsonthe“%ID”(measureofsimilarity)

TCscoresonPreFabsequences

Page 10: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

wishes to align a set of globin sequences and has an existingglobin alignment, this alignment can be converted to a profileHMM and used as well as the sequence input file. This HMM ishere referred to as an ‘external profile’ and its use in this wayas ‘external profile alignment’ (EPA). During EPA, eachsequence in the input set is aligned to the external profile.Pseudocount information from the external profile is thentransferred, position by position, to the input sequence.Ideally, this would be used with large curated alignments ofparticular proteins or domains of interest such as are used inmetagenomics projects. Rather than taking the input se-quences and aligning them from scratch, every time newsequences are found, the alignment should be carefullymaintained and used as an external profile for EPA. ClustalOmega also can align sequences to existing alignments usingconventional alignment methods. Users can add sequences toan alignment, one by one or align a set of aligned sequences tothe alignment.In this paper, we demonstrate the EPA approach with two

examples. First, we take the 94 HomFam test cases from theprevious section and use the corresponding Pfam HMM forEPA. Before EPA, the average accuracy for the test cases was0.627 of correctly aligned Homstrad positions but after EPA itrises to 0.653. This is plotted, test case for test case inFigure 2A. Each dot is one test case with the TC score forClustal Omega plotted against the score using EPA. The secondexample is illustrated in Figure 2B. Here, we take all theBAliBASE reference sets and align them as normal usingClustal Omega and obtain the benchmark result of 0.554 ofcolumns correctly aligned, as already reported in Table I. ForEPA, we use the benchmark reference alignments themselvesas external profiles. The results now jump to 0.857 of columnscorrect. This is a jump of over 30% and while it is not a validmeasure of Clustal Omega accuracy for comparison with otherprograms, it does illustrate the potential power of EPA to useinformation in external alignments.

Iteration

EPA can also be used in a simple iteration scheme. Once a MSAhas been made from a set of input sequences, it can beconverted into a HMM and used for EPA to help realign theinput sequences. This can also be combined with a fullrecalculation of the guide tree. In Figure 3, we show the resultsof one and two iterations on every test case from HomFam.The graph is plotted as a running average TC score for all testcases with N or fewer test cases where N is plotted on the

horizontal axis using a log scale. With some smaller test cases,iteration actually has a detrimental effect. Once you get near1000 or more sequences, however, a clear trend emerges.Themore sequences you have, themore beneficial the effect ofiteration is. With bigger test cases, it becomes more and morebeneficial to apply two iterations. This result confirmsthe usefulness of EPA as a general strategy. It also confirmsthe difficulty in aligning extremely large numbers of sequencesbut gives one partial solution. It also gives a very simple buteffective iteration scheme, not just for guide tree iteration, asused inmany packages, but for iteration of the alignment itself.

Discussion

The main breakthroughs since the mid 1980s in MSA methodshave been progressive alignment and the use of consistency.Otherwise, most recent work has concerned refinements forspeed or accuracy on benchmark test sets. The speed increaseshave been dramatic but, with just two major exceptions, themethods are still basically O(N2) and incapable of beingextended to data sets of 410 000 sequences. The twoexceptions aremBed, used here, andMAFFT PartTree. PartTreeis faster but at the expense of accuracy, at least as judged by thebenchmarking here. The second group of recent developments

Table III HomFam benchmarking results

93pNp2957 (41 families) 3127pNp9105 (33 families) 10 099pNp50157 (18 families)

Aligner TC/t (s) TC/t (s) TC/t (s)Clustal O 0.708/2114.0 0.639/11 719.5 0.464/27 328.9Kalign 0.569/324.9 0.563/6752.0 0.420/286 711.0MAFFT default 0.550/238.9 0.462/3115.4 !/!MAFFT –parttree !/! !/! 0.253/6119.4MUSCLE default 0.533/104 587.0 !/! !/!MUSCLE –maxiters 2 !/! 0.416/8239.2 0.216/110 292.0

The columns show total column score (TC) and total run time in seconds for groupings of small (o3000 sequences), medium (3000–10 000 sequences) and large(410 000 sequences) HomFam test cases.

0.001

0.01

0.1

1

10

100

1000

10 000

100 000

100 3000 10 000 100 000

Tim

e (s

)

#Sequences

HomFamClustalΩ

MafftMuscleKalign

Avr length:1−50

50 −100100 −150150 −200200 −250250 −300300 −350350 −400400 −450

Figure 1 Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE(green) and Kalign (purple) against the number of sequences of HomFam testsets. Average sequence length is rendered by point size. Both axes havelogarithmic scales. Clustal Omega and Kalign were run with default flags over theentire range. MUSCLE was run with –maxiters 2 for N43000 sequences.MAFFT was run with –parttree for N410 000 sequences.

High-quality protein MSAs using Clustal OmegaF Sievers et al

4 Molecular Systems Biology 2011 & 2011 EMBO and Macmillan Publishers Limited

FromSeiversetal.,MolecularSystemsBiology2011

Notethereducedsetofmethodsduetodatasetsize.

TCscoresonHomFamsequences

Page 11: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Observa<ons(sofar)

•  Rela<veandabsoluteaccuracy(wrtTCscore)impactedbydegreeofheterogeneityanddatasetsize

•  Somemethodscannotrunonlargedatasets•  Onsmalldatasets,Clustal-OmeganotasaccurateasProbalign,MAFFTv6.857,andMSAprobs.

•  Onlargedatasets,Clustal-Omegamoreaccuratethanothermethodsthatitwascomparedto.

Page 12: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Clustal-Omegavariants

•  Itera<ngusingprofileHMMs•  UsingEPA(profilesbasedonexternalstructuralalignments)

Page 13: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

EPA for HomFam and BAliBASE.

Fabian Sievers et al. Mol Syst Biol 2011;7:539

© as stated in the article, figure or figure legend

Page 14: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Iteration of HomFam alignments.

Fabian Sievers et al. Mol Syst Biol 2011;7:539

© as stated in the article, figure or figure legend

Page 15: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE (green) and Kalign (purple) against the number of sequences of HomFam test sets.

Fabian Sievers et al. Mol Syst Biol 2011;7:539

© as stated in the article, figure or figure legend

Page 16: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

(a) (b)

(c) (d)

20 200 2,000 20,000 200,000

0.01

0.10

1.00

10.00

100.00

1,000.00

10,000.00

100,000.00

Run time (single threaded)

ClustalΩ

MAFFT-PartTree

sequences

se

co

nd

s

20 200 2,000 20,000 200,000

1

10

100

1,000

10,000

100,000

Peak memory usage

ClustalΩ

MAFFT-PartTree

sequences

me

ga

byte

s

20 200 2,000 20,000 200,000

1

10

100

1,000

10,000

100,000

Average memory usage

ClustalΩ

MAFFT-PartTree

sequences

me

ga

byte

s

20 200 2,000 20,000 200,000

0

0.1

0.2

0.3

0.4

0.5

Alignment accuracy

ClustalΩ

MAFFT-PartTree

sequences

TC

sco

re

Page 17: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Addi<onalObserva<ons

•  Itera<ngusingprofileHMMsimprovesTCscoresforClustal-Omega.

•  UsingEPA(profilesbasedonexternalstructuralalignments)improvesTCscoresforClustal-Omega.

•  MAFFT-pardreeisfasterthanClustal-Omega.•  ThekeytoClustal-Omega’sspeedisitsguidetree.

Page 18: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Someconcerns

•  Othercriteria(1-SPFN,1-SPFP,impactontreeerror,etc.)

•  Otherdatasets•  Newermethods(orupdatedmethods)•  Restric<ontomethodsthatcompletequickly

Page 19: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

wishes to align a set of globin sequences and has an existingglobin alignment, this alignment can be converted to a profileHMM and used as well as the sequence input file. This HMM ishere referred to as an ‘external profile’ and its use in this wayas ‘external profile alignment’ (EPA). During EPA, eachsequence in the input set is aligned to the external profile.Pseudocount information from the external profile is thentransferred, position by position, to the input sequence.Ideally, this would be used with large curated alignments ofparticular proteins or domains of interest such as are used inmetagenomics projects. Rather than taking the input se-quences and aligning them from scratch, every time newsequences are found, the alignment should be carefullymaintained and used as an external profile for EPA. ClustalOmega also can align sequences to existing alignments usingconventional alignment methods. Users can add sequences toan alignment, one by one or align a set of aligned sequences tothe alignment.In this paper, we demonstrate the EPA approach with two

examples. First, we take the 94 HomFam test cases from theprevious section and use the corresponding Pfam HMM forEPA. Before EPA, the average accuracy for the test cases was0.627 of correctly aligned Homstrad positions but after EPA itrises to 0.653. This is plotted, test case for test case inFigure 2A. Each dot is one test case with the TC score forClustal Omega plotted against the score using EPA. The secondexample is illustrated in Figure 2B. Here, we take all theBAliBASE reference sets and align them as normal usingClustal Omega and obtain the benchmark result of 0.554 ofcolumns correctly aligned, as already reported in Table I. ForEPA, we use the benchmark reference alignments themselvesas external profiles. The results now jump to 0.857 of columnscorrect. This is a jump of over 30% and while it is not a validmeasure of Clustal Omega accuracy for comparison with otherprograms, it does illustrate the potential power of EPA to useinformation in external alignments.

Iteration

EPA can also be used in a simple iteration scheme. Once a MSAhas been made from a set of input sequences, it can beconverted into a HMM and used for EPA to help realign theinput sequences. This can also be combined with a fullrecalculation of the guide tree. In Figure 3, we show the resultsof one and two iterations on every test case from HomFam.The graph is plotted as a running average TC score for all testcases with N or fewer test cases where N is plotted on the

horizontal axis using a log scale. With some smaller test cases,iteration actually has a detrimental effect. Once you get near1000 or more sequences, however, a clear trend emerges.Themore sequences you have, themore beneficial the effect ofiteration is. With bigger test cases, it becomes more and morebeneficial to apply two iterations. This result confirmsthe usefulness of EPA as a general strategy. It also confirmsthe difficulty in aligning extremely large numbers of sequencesbut gives one partial solution. It also gives a very simple buteffective iteration scheme, not just for guide tree iteration, asused inmany packages, but for iteration of the alignment itself.

Discussion

The main breakthroughs since the mid 1980s in MSA methodshave been progressive alignment and the use of consistency.Otherwise, most recent work has concerned refinements forspeed or accuracy on benchmark test sets. The speed increaseshave been dramatic but, with just two major exceptions, themethods are still basically O(N2) and incapable of beingextended to data sets of 410 000 sequences. The twoexceptions aremBed, used here, andMAFFT PartTree. PartTreeis faster but at the expense of accuracy, at least as judged by thebenchmarking here. The second group of recent developments

Table III HomFam benchmarking results

93pNp2957 (41 families) 3127pNp9105 (33 families) 10 099pNp50157 (18 families)

Aligner TC/t (s) TC/t (s) TC/t (s)Clustal O 0.708/2114.0 0.639/11 719.5 0.464/27 328.9Kalign 0.569/324.9 0.563/6752.0 0.420/286 711.0MAFFT default 0.550/238.9 0.462/3115.4 !/!MAFFT –parttree !/! !/! 0.253/6119.4MUSCLE default 0.533/104 587.0 !/! !/!MUSCLE –maxiters 2 !/! 0.416/8239.2 0.216/110 292.0

The columns show total column score (TC) and total run time in seconds for groupings of small (o3000 sequences), medium (3000–10 000 sequences) and large(410 000 sequences) HomFam test cases.

0.001

0.01

0.1

1

10

100

1000

10 000

100 000

100 3000 10 000 100 000

Tim

e (s

)

#Sequences

HomFamClustalΩ

MafftMuscleKalign

Avr length:1−50

50 −100100 −150150 −200200 −250250 −300300 −350350 −400400 −450

Figure 1 Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE(green) and Kalign (purple) against the number of sequences of HomFam testsets. Average sequence length is rendered by point size. Both axes havelogarithmic scales. Clustal Omega and Kalign were run with default flags over theentire range. MUSCLE was run with –maxiters 2 for N43000 sequences.MAFFT was run with –parttree for N410 000 sequences.

High-quality protein MSAs using Clustal OmegaF Sievers et al

4 Molecular Systems Biology 2011 & 2011 EMBO and Macmillan Publishers Limited

FromSeiversetal.,MolecularSystemsBiology2011

Notethereducedsetofmethodsduetodatasetsize.

TCscoresonHomFamsequences

Page 20: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Methodstheydidn’texplore

•  PASTAandUPP:designedexplicitlyforlargedatasets.(TheytestedSATe-1,theprecursortoSATe-IIandPASTA,butnotonthelargedatasetsbecauseitcouldnotberunefficiently.)

•  ImprovedversionsofMAFFTsincetheirstudy•  Andwhoknowswhatelse…

Page 21: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

SATeFamily•  SATe-I(2009):

–  Uptoabout10,000sequences–  Goodaccuracyandreasonablespeed–  “Center-tree”decomposi<on

•  SATe-II(2012)–  Uptoabout50,000sequences–  Improvedaccuracyandspeed–  Centroid-edgerecursivedecomposi<on

•  PASTA(2014)–  Upto1,000,000sequences–  Improvedaccuracyandspeed–  Combinescentroid-edgedecomposi<onwithtransi<vitymerge

Page 22: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

SATé/PASTAAlgorithmDesign

Estimate ML tree on new alignment

Tree

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Alignment

Page 23: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

A

B D

C

Merge subproblems

Estimate ML tree on merged alignment

Decompose based on input tree

A B

C D

Align subproblems

A B

C D

ABCD

SATéitera<on(actualdecomposi<onproduces32subproblems)

e

Page 24: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

run two iterations of SATe-II on a separate machine with no running time limits (12 Quad-Core AMDOpteron processors, 256GB of RAM memory). Given 12 CPUs, two iterations of SATe-II took 137 hr,compared to 10 hr for PASTA. However, the resulting SATe-II alignment recovered only 30 columnsentirely correctly while PASTA recovered 311 columns. The pairs score of SATe-II was extremely poor(38.2%), while PASTA was quite accurate (81.0%). The tree produced by SATe-II had higher error thanPASTA (12.6% versus 8.2% FN rate).Impact of varying algorithmic parameters. We compared results obtained using four different startingtrees: a random tree, the ML tree on the Mafft-PartTree alignment, PASTA’s default starting tree, and thetrue (model) tree (see Table 3). After one iteration, PASTA alignments and trees based on our starting treeor true tree had roughly the same accuracy, and the starting tree based on Mafft-PartTree resulted in only aslightly worse tree (1% higher FN rate). However, using a random tree resulted in much higher tree errorrates (52.3% error), and alignments that were also less accurate. Interestingly, after three iterations ofPASTA, no noticeable difference could be detected between results from various starting trees. Thus,PASTA is robust to the choice of the starting tree.

We also evaluated the impact of changing the alignment subset size (Table 4); these analyses showedthat using alignment subsets of only 50 sequences improved the TC score and running time substantially,and only slightly changed the pairs score or tree error score. Although these analyses were performed onlyfor two datasets, they suggest the possibility that improved results might be obtained through smalleralignment subsets.

Running Time. Figure 3 compares the running time (in hours) of different alignment methods. Note thatPASTA was faster than SATe-II in all cases and could analyze datasets that SATe-II could not (i.e., the

Table 2. Alignment Accuracy on AA Datasets

Column (TC) score Pairs score

Method AA-10 HomFam(17) HomFam(2) AA-10 HomFam(17) HomFam(2)

Clustal-O 78 88 29 0.76 0.72 0.71Muscle 48 51 X 0.70 0.52 XMafft 81 103 32 0.76 0.75 0.79Initial 54 95 16 0.75 0.71 0.81SATe-II 83 73 X 0.75 0.64 XPASTA 80 102 36 0.76 0.78 0.83

We show TC (the number of correctly aligned sites, left) and the pairs score (the average of the SP-score and modeler

score, right). X indicates that a method failed to run on a particular dataset given the computational constraints.‘‘Initial’’ corresponds to the alignment approach used to obtain the starting tree of PASTA (HMMER failed to align one

sequence in the 16S.T dataset). All values shown are averages over all datasets in each category.

Boldface indicates the best values for each model condition.

Table 3. Effect of the Starting Tree on Final PASTA Alignment and Tree

Initial tree Alignment accuracy

method Error (FN) Pairs score TCTree error

FN

One iterationRandom 100.0% 79.9% 2 52.3%Mafft-parttree 28.7% 87.0% 126 11.7%Starting tree 12.4% 86.8% 138 10.5%True tree 0% 86.1% 133 10.5%

Three iterationsRandom 100.0% 90.4% 138 11.0%Mafft-parttree 28.7% 83.9% 144 10.7%Starting tree 12.4% 88.8% 145 10.7%True tree 0% 90.8% 150 10.5%

Alignment accuracy and tree error is shown for PASTA with various starting trees, after one iteration (top) and

three iterations (bottom) on one replicate of the 10k RNASim dataset.

PASTA: ULTRA-LARGE MULTIPLE SEQUENCE ALIGNMENT 7

ComparisonofPASTAtoSATe-IIandotheralignmentsonAAdatasets.FromMirarabetal.,J.Computa<onalBiology2014

Malversion7.143brunindefaultmode(evenonHomFam(17)datasets).

Page 25: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

MAFFTvs.Clustal-Omega

•  IntheClustal-Omegapaper,Clustal-OmegahadbederTCscoresthanMAFFT.

•  InthePASTApaper,MAFFThadbederTCscoresthanClustal-Omega.

Page 26: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Whythedifference?

•  TwoversionsofMAFFT(Clustal-Omegapaperusedversion6.857,PASTApaperusedversion7.143b)

•  Twodifferentcommandspossibly(MAFFT-pardreevs.MAFFT-default)

Lesson:Versionnumberandcommandimpactsabsoluteandrela<veperformance!

Page 27: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

to each other. Even if a more computationally expensive (andusually more accurate) method, L-INS-i, is applied (CPU time-= 98 h), the alignment is still obviously incorrect (fig. 2B).

Two-step strategies can solve this type of problem. That is,a set of full-length sequences taken from databases are firstaligned to build a backbone MSA, and then the new ITS1 andITS2 sequences are added into this backbone MSA, using the––addfragments option.

Step 1: mafft--autofull_length_sequences>\

backbone_msa

Step 2: mafft --addfragments \ new_sequences

backbone_msa > output

The second command is equivalent to

mafft --multipair --addfragments \new_sequences backbone_msa > output

in which Dynamic Programming (DP) is used to compare thedistances between every new sequence and every sequence inthe backbone MSA (––multipair is selected by default).

mafft --6merpair --addfragments \new_sequences backbone_msa > output

where distances are rapidly estimated using the number ofshared 6mers, instead of DP.

The result of the latter option (––6merpair––addfragments) is shown in fig. 2D and E. The differencebetween D and E is just in the order of sequences; the se-quences were reordered according to similarity using the––reorder option in E. In this alignment, ITS1 and ITS2are clearly separated and aligned to appropriate positions inthe full-length alignment. Moreover, this strategy is compu-tationally much less expensive (CPU time = 15 min [firststep] + 1.5 min [second step]) than the full applicationof L-INS-i (CPU time = 98 h). The former option(––multipair ––addfragments) also returns a similarresult to the latter (––6merpair) but is slower (CPUtime = 48.6 min [second step]).

This case suggests that it is crucial to select a strategyappropriate to the problem of interest. The most time-consuming method, L-INS-i, is not always the most accurate

one. The difficulty of this problem for standard approachescomes from the fact that ITS1 sequences and ITS2 sequencesare not homologous to each other and most pairwise align-ments are impossible. Because of these nonhomologous pairs,the distance matrix used for the guide tree calculation isnot additive; the distances between ITS1 and full-lengthsequences and those between ITS2 and full-length sequencesare close to zero, whereas the distances between ITS1 andITS2 are quite large. In this situation, it is difficult for normaldistance-based tree-building methods to give a reasonabletree. Moreover, in the alignment step, the objective functionof the L-INS-i is affected by inappropriate pairwise alignmentscores between ITS1 and ITS2. Such problems can be avoidedby just ignoring the relationship between ITS1 and ITS2,as done in the ––addfragments option.

In addition, a result of the second type of misuse ofmafft-profile (discussed earlier) is shown in figure 2C.Some new sequences are correctly aligned but others areobviously incorrectly aligned (note that the order of se-quences in fig. 2C is identical that in fig. 2D). These misalign-ments are due to an incorrect assumption on phylogeneticplacement of new sequences shown in figure 1C.

Test Case 2: Bacterial SSU rRNAAnother case is the 16S.B.ALL data set by Mirarab et al. (2012).It consists of an MSA of 13,822 bacterial SSU rRNA sequences,taken from the Gutell Comparative RNA Website (CRW)(Cannone et al. 2002) and 138,210 fragmentary sequences,which are originally included in the CRW alignmentbut ungapped and artificially truncated. In Katoh andStandley (2013), we used a subset (13,821 fragmentarysequences) prepared by Mirarab et al. (2012). In addition tothis subset, here we use the full data set (138,210 fragmentarysequences), to examine the scalability. Suppose a situationwhere we already have a manually curated (or backbone)MSA and a newly determined set of many fragmentarysequences in a metagenomics project, and we need anentire MSA of them.

The first four lines in table 2 (case 1) show the perfor-mances of various options for such an analysis, with a rela-tively small data set (13,822 sequences in the existing

Table 2. Comparison of Different Options Using the 16S.B.ALL Data Set (Mirarab et al. 2012).

Data Method Accuracy CPU Time Actual Timea

Case 1 mafft ––multipair ––addfragments frags existingmsa 0.9969 6.67 days 18.3 hmafft ––6merpair ––addfragments frags existingmsa 0.9949 3.76 h 36.2 minmafft ––localpair ––add frags existingmsa 0.9707 23.4 daysb 2.43 daysb

mafft ––6merpair ––add frags existingmsa 0.9604 1.32 h 1.44 hprofile alignment 0.2779 15.5 h 1.60 h

Case 2 mafft ––6merpair ––addfragments frags existingmsa 0.9969 4.54 h 33.8 min

Case 3 mafft ––6merpair ––addfragments frags existingmsa 0.9949 1.79 days 5.91 h

NOTE.—The estimated alignments were compared with the CRW alignment to measure the accuracy (the number of correctly aligned letters/the number of aligned letters in theCRW alignment). Calculations were performed on a Linux PC with 2.67 GHz Intel Xeon E7-8837/256 GB RAM (for the case marked with superscript alphabet “b”), or on a LinuxPC with 3.47 GHz Intel Xeon X5690/48 GB RAM (for the other cases).Case 1: 13,822 sequences in the existing alignment! 13,821 fragments;Case 2: 1,000 sequences in the existing alignment! 138,210 fragments;Case 3: 13,822 sequences in the existing alignment! 138,210 fragments.aWall-clock time with 10 cores. Command-line argument for parallel processing is ––thread 10.bFull command-line options are as follows: mafft ––localpair ––weighti 0 ––add frags existingmsa.

776

Katoh and Standley . doi:10.1093/molbev/mst010 MBE

by guest on February 22, 2015http://m

be.oxfordjournals.org/D

ownloaded from

FromKatohandStandley,2013(dealingwithfragmentarysequences)Mol.Biol.Evol.30(4):772–780doi:10.1093/molbev/mst010

Page 28: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Reviewer1•  “TheoverallmessageappearstobethatClustalOmega

appearstobethefastestandmostaccuratealgorithmforverylargealignments,whileitiss<llsubstan<allyoutperformedbyothermethodsforsmalleralignments.”

•  “Lookingattheresultspresentedhere(again,fromthestandpointofa"consumer"),Iwouldnotpersonallyexpecttomakemuchuseofthisalgorithmandwillprobablycon<nuetouseMAFFTorMUSCLE,mainlysincethey'res<llperformingsimilarlywellorbederforsmalleralignments,whicharethemainuseformyself.Iexpectthistobetrueforthemajorityofusers.However,thismaychangeinthefutureandespeciallyforthosewhoworkonmajorsequencing/datarepositorycenters(suchastheEBI),assequencingofmorespeciesrampsup.”

Page 29: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Reviewer2•  “ThemainnoveltyofthisprogramcomparedtopreviousversionsofClustalisitsabilitytodealwithhugedatasets.Thisismainlyachievedbyusinganefficientclusteringalgorithmtoefficientlycalculateaguidetreeforprogressivealignment…construc<ngtheguidetreeforprogressivealignmentisthebodleneckofmostMSAprograms:calcula<ngatreefromtheinputsequenceswithtradi<onalmethodstakes(atleast)O(n2)<mewhileprogressivelyaligningthesequencesaccordingtothistreecanbedoneinlinear<me.”

Page 30: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

CommentsaboutEPA•  Reviewer1:“ItalsoofferstheProfileAlignmentuse-case,whichislikelytobemoreimportantinthefuture”

•  Reviewer2:“Asecondinteres<ngnoveltyofClustalOmegaistheop<ontousepreviouslycalculated"ExternalProfileAlignments"(EPA)toimprovethealignmentprocedure.ThisisdoneusinganHMMapproachdevelopedbyoneoftheco-authors.Sinceformanyproteinfamilies,therearenowprofileHMMsavailable,itisaverysensibleideatousethisinforma<onforimprovedMSA,ratherthanrelyingonsequencesimilarityalone,asdotradi<onalalignmentmethods.”

Page 31: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Overall•  MainfeaturesofClustal-Omegaarethefastguidetreecalcula<on,theuseofitera<on,andEPA.(BothEPAanditera<onuseprofileHMMs.)

•  Butimprovementsinaccuracycomparedtoothermethodsonlyseemstobeapparentonlargedatasetswhenrestric<ngthesetofmethodstothosethatcompletequickly.Newermethods(andevennewversionsofoldmethodslikeMAFFT)seemtooutperformClustal-Omegaforaccuracy,eveniftheyarenotasfast.

•  AccuracywasonlyassessedusingTCscore,whichisnotnecessarilythewholestory.

Page 32: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

ChoiceofbestMSAmethod

•  Doesitdependontypeofdata(DNAoraminoacids?)

•  Doesitdependonrateofevolu<on?•  Doesitdependongaplengthdistribu<on?•  Doesitdependonexistenceoffragments?

Page 33: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Howdoestheguidetreeimpactaccuracy?

•  Doesimprovingtheaccuracyoftheguidetreehelp?

•  Doallalignmentmethodsrespondiden<cally?(Isthesameguidetreegoodforallmethods?)

•  DothedefaultseSngsfortheguidetreeworkwell?

Page 34: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Anotherstudy…

•  Prank(LoytynojaandGoldman,Science2008)isa“phylogenyaware”progressivealignmentstrategy.

•  Theirstudyfocusedonevalua<ngMSAswithrespecttoTCscore,butalsoatypicalcriteria,suchas:–  Genetreebranchlengthes<ma<on–  Alignmentlengthes<ma<on(compressionissue)–  Inser<on/dele<onra<o–  Numberofinser<ons/dele<ons

•  Theyexploredverysmallsimulateddatasets,evolvingsequencesdowntrees.

Page 35: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

have a flaw that has gone unnoticed as long asdifferent methods have been consistent in theerror they create.

That such a significant error has passedundetected may be explained by the align-ment field's historical focus on proteins, wherethese biases tend to be manifested in less-constrained regions such as loops (compareFig. 1). Alignments with insertions and de-letions squeezed compactly between con-served blocks may suffice for, and even bepreferred by, some molecular biologists work-ing with proteins. We have shown, however,

that these patterns are, in fact, imposed bysystematic biases in alignment algorithms,even in cases where they are incorrect and,indeed, phylogenetically unreasonable. Wecontend that algorithms that impose gap pat-terns like those found in structural align-ments of proteins are inappropriate for theincreasingly widespread analysis of genomicDNA and are likely to cause error when theresulting alignments are used for evolutionaryinferences.

We believe that alignment methods spe-cifically designed for evolutionary analyses

will give a very different picture of the mech-anisms of sequence evolution and showsequence turnover through short insertionsand deletions as a more frequent and im-portant phenomenon. This raises interestingquestions of the true evolution of variablesequences such as promoter regions, non-coding DNA, and exposed coil regions inproteins: Do they predominantly evolvethrough point substitutions, or are those dis-similar regions just incorrectly aligned non-homologous sequences? To resolve that, weneed more sequence data and alignmentmethods that can really benefit from theadditional information. The resulting align-ments may be fragmented by many gapsand may not be as visually beautiful as thetraditional alignments, but if they representcorrect homology, we have to get used tothem.

References and Notes1. R. A. Gibbs et al., Nature 428, 493 (2004).2. Rhesus Macaque Genome Sequencing and Analysis

Consortium, Science 316, 222 (2007).3. The ENCODE Project Consortium, Nature 447, 799

(2007).4. A. Stark et al., Nature 450, 219 (2007).5. Materials and methods are available as supporting

material on Science Online.6. A. Rambaut, D. Posada, K. Crandall, E. Holmes, Nat. Rev.

Genet. 5, 52 (2004).7. N. Sullivan, M. Thali, C. Furman, D. Ho, J. Sodroski, J. Virol.

67, 3674 (1993).8. R. Wyatt et al., J. Virol. 69, 5723 (1995).9. M. Jansson et al., AIDS Res. Hum. Retroviruses 17, 1405

(2001).10. S. D. Frost et al., Proc. Natl. Acad. Sci. U.S.A. 102,

18514 (2005).11. M. Sagar, X. Wu, S. Lee, J. Overbaugh, J. Virol. 80, 9586

(2006).12. J. D. Thompson, D. G. Higgins, T. J. Gibson, Nucleic Acids

Res. 22, 4673 (1994).13. C. Notredame, D. G. Higgins, J. Heringa, J. Mol. Biol.

302, 205 (2000).14. R. C. Edgar, BMC Bioinformat. 5, 113 (2004).15. K. Katoh, K. Kuma, H. Toh, T. Miyata, Nucleic Acids Res.

33, 511 (2005).16. A. Löytynoja, N. Goldman, Proc. Natl. Acad. Sci. U.S.A.

102, 10557 (2005).17. D. D. Pollock, D. J. Zwickl, J. A. McGuire, D. M. Hillis,

Syst. Biol. 51, 664 (2002).18. M. S. Rosenberg, S. Kumar, Syst. Biol. 52, 119

(2003).19. M. S. Rosenberg, BMC Bioinformat. 6, 278

(2005).20. K. M. Wong, M. A. Suchard, J. P. Huelsenbeck, Science

319, 473 (2008).21. This work was funded in part by a Wellcome

Trust Programme Grant (GR078968). We thankN. Luscombe for many suggestions that improvedthe manuscript.

Supporting Online Materialwww.sciencemag.org/cgi/content/full/320/5883/1632/DC1Materials and MethodsSOM TextFig. S1Table S1References

28 March 2008; accepted 15 May 200810.1126/science.1158395

Fig. 3. Alignment accuracy errors are reduced by a phylogeny-aware algorithm. (A to F) Errorsfrom traditional alignment methods grow with increasing evolutionary distances and moredifficult alignments (close-intermediate-distant, white to blue gradient) and also with a densersequence sampling and increasingly similar sequences (close-2X-4X, white to red gradient).The phylogeny-aware method PRANK+F is less biased by greater distances and, in contrast toother methods, improves in accuracy with additional sequence data. Alignment statistics forthe five multiple sequence alignment methods are number of (A) insertions and (B) deletions,(C) insertion/deletion ratio, (D) gap overlap, (E) total length of the alignment, and (F)proportion of columns correctly recovered. (G) Inferred branch lengths at different depths inthe tree (t1 – t4) for the intermediate sets indicate that alignment errors lead to overestimatedbranch lengths, with PRANK+F giving the most accurate estimates across the whole range ofdepths. All measures in (A) to (G) are shown relative to those inferred from the true align-ment, and values closer to 1 are more correct. Vertical bars show means and 95% confidenceintervals.

www.sciencemag.org SCIENCE VOL 320 20 JUNE 2008 1635

REPORTS

FromLoytyjojaandGoldman,Science2008:

Page 36: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

have a flaw that has gone unnoticed as long asdifferent methods have been consistent in theerror they create.

That such a significant error has passedundetected may be explained by the align-ment field's historical focus on proteins, wherethese biases tend to be manifested in less-constrained regions such as loops (compareFig. 1). Alignments with insertions and de-letions squeezed compactly between con-served blocks may suffice for, and even bepreferred by, some molecular biologists work-ing with proteins. We have shown, however,

that these patterns are, in fact, imposed bysystematic biases in alignment algorithms,even in cases where they are incorrect and,indeed, phylogenetically unreasonable. Wecontend that algorithms that impose gap pat-terns like those found in structural align-ments of proteins are inappropriate for theincreasingly widespread analysis of genomicDNA and are likely to cause error when theresulting alignments are used for evolutionaryinferences.

We believe that alignment methods spe-cifically designed for evolutionary analyses

will give a very different picture of the mech-anisms of sequence evolution and showsequence turnover through short insertionsand deletions as a more frequent and im-portant phenomenon. This raises interestingquestions of the true evolution of variablesequences such as promoter regions, non-coding DNA, and exposed coil regions inproteins: Do they predominantly evolvethrough point substitutions, or are those dis-similar regions just incorrectly aligned non-homologous sequences? To resolve that, weneed more sequence data and alignmentmethods that can really benefit from theadditional information. The resulting align-ments may be fragmented by many gapsand may not be as visually beautiful as thetraditional alignments, but if they representcorrect homology, we have to get used tothem.

References and Notes1. R. A. Gibbs et al., Nature 428, 493 (2004).2. Rhesus Macaque Genome Sequencing and Analysis

Consortium, Science 316, 222 (2007).3. The ENCODE Project Consortium, Nature 447, 799

(2007).4. A. Stark et al., Nature 450, 219 (2007).5. Materials and methods are available as supporting

material on Science Online.6. A. Rambaut, D. Posada, K. Crandall, E. Holmes, Nat. Rev.

Genet. 5, 52 (2004).7. N. Sullivan, M. Thali, C. Furman, D. Ho, J. Sodroski, J. Virol.

67, 3674 (1993).8. R. Wyatt et al., J. Virol. 69, 5723 (1995).9. M. Jansson et al., AIDS Res. Hum. Retroviruses 17, 1405

(2001).10. S. D. Frost et al., Proc. Natl. Acad. Sci. U.S.A. 102,

18514 (2005).11. M. Sagar, X. Wu, S. Lee, J. Overbaugh, J. Virol. 80, 9586

(2006).12. J. D. Thompson, D. G. Higgins, T. J. Gibson, Nucleic Acids

Res. 22, 4673 (1994).13. C. Notredame, D. G. Higgins, J. Heringa, J. Mol. Biol.

302, 205 (2000).14. R. C. Edgar, BMC Bioinformat. 5, 113 (2004).15. K. Katoh, K. Kuma, H. Toh, T. Miyata, Nucleic Acids Res.

33, 511 (2005).16. A. Löytynoja, N. Goldman, Proc. Natl. Acad. Sci. U.S.A.

102, 10557 (2005).17. D. D. Pollock, D. J. Zwickl, J. A. McGuire, D. M. Hillis,

Syst. Biol. 51, 664 (2002).18. M. S. Rosenberg, S. Kumar, Syst. Biol. 52, 119

(2003).19. M. S. Rosenberg, BMC Bioinformat. 6, 278

(2005).20. K. M. Wong, M. A. Suchard, J. P. Huelsenbeck, Science

319, 473 (2008).21. This work was funded in part by a Wellcome

Trust Programme Grant (GR078968). We thankN. Luscombe for many suggestions that improvedthe manuscript.

Supporting Online Materialwww.sciencemag.org/cgi/content/full/320/5883/1632/DC1Materials and MethodsSOM TextFig. S1Table S1References

28 March 2008; accepted 15 May 200810.1126/science.1158395

Fig. 3. Alignment accuracy errors are reduced by a phylogeny-aware algorithm. (A to F) Errorsfrom traditional alignment methods grow with increasing evolutionary distances and moredifficult alignments (close-intermediate-distant, white to blue gradient) and also with a densersequence sampling and increasingly similar sequences (close-2X-4X, white to red gradient).The phylogeny-aware method PRANK+F is less biased by greater distances and, in contrast toother methods, improves in accuracy with additional sequence data. Alignment statistics forthe five multiple sequence alignment methods are number of (A) insertions and (B) deletions,(C) insertion/deletion ratio, (D) gap overlap, (E) total length of the alignment, and (F)proportion of columns correctly recovered. (G) Inferred branch lengths at different depths inthe tree (t1 – t4) for the intermediate sets indicate that alignment errors lead to overestimatedbranch lengths, with PRANK+F giving the most accurate estimates across the whole range ofdepths. All measures in (A) to (G) are shown relative to those inferred from the true align-ment, and values closer to 1 are more correct. Vertical bars show means and 95% confidenceintervals.

www.sciencemag.org SCIENCE VOL 320 20 JUNE 2008 1635

REPORTS

FromLoytynojaandGoldman,Science2008

Page 37: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Observa<ons

•  Mostalignmentmethods“over-align”(producecompressedalignments)

•  Prankavoidsthisthroughits“phylogeny-aware”strategy

•  Compressionresultsin– Over-es<ma<onsofbranchlengths– Under-es<ma<onofinser<ons

•  Clustalisleastaccurate,othermethodsinbetween

Page 38: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

Researchques<ons

•  Clustal-Omega’smainadvantageseemstobeitsfastguidetree.Buttheguidetreehasalargeimpactonaccuracy–canweimprovetheaccuracyofClustal-Omegabygivingitabederguidetree?

•  Itera<onusingprofileHMMshelpsaccuracy;canweusethistechniqueinothermethods?

•  EPAsareuseful.Whatarethebestwaystouseexternalstructurally-definedalignments?

Page 39: Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega, the alignments are then computed using the very accurate HHalign package (So¨ding,

ResearchProjects•  DesignyourownMSAmethod,orjustmodifyanexis<ngoneinsomesimpleway(e.g.,differentguidetree)

•  Testexis<ngMSAmethodswithrespecttodifferentcriteria(e.g.,extendPrankstudytomoremethodsanddatasets)

•  DevelopdifferentMSAcriteriathataremoreappropriatethanTC,SPFN,SPFP

•  ComparedifferentMSAmethodsonsomebiologicaldataset

•  ParallelizesomeMSAmethod•  ConsiderhowtocombineMSAsonthesameinput