Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega,...
Transcript of Clustal-Omega - Tandy Warnowtandy.cs.illinois.edu/Clustal-Omega-discussion.pdf · In Clustal Omega,...
Clustal-Omega
TandyWarnow
Today
• Discussionof“Fastscalablegenera<onofhigh-qualityproteinmul<plesequencealignmentsusingClustal-Omega”bySieversetal.,publishedinMolecularSystemsBiology7:539(2011)– DesignofClustal-Omega– Performanceondata– Maininnova<on:guidetreeconstruc<on– Reviewsin“transparentprocess”– Comparisontootherstudies(<mepermiSng)
Clustal-Omegastudy
• Clustal-OmegaisthelatestintheClustalfamilyofMSAmethods
• Clustal-Omegaisdesignedprimarilyforaminoacidalignment,butcanbeusedonnucleo<dedatasets
Clustal-Omega
Givenunalignedsequences:• Buildaguidetree• Performprogressivealignmentontheguidetree,using“consistency”toinformthealignmentmergers
Ifdesired,youcan:• Iterate(constructprofileHMMbasedonalignment,andthenre-align)
• Useexternalprofilealignment(EPA)
Evalua<on
• DefaultClustal-Omega(noitera<on,noEPA)comparedtostandardmethodsonproteinbenchmarks
• VariantsofClustal-OmegacomparedtodefaultClustal-Omega
Criteria:• TC(columnscore)on“core”columns• Running<me
Datasets
• BAliBASE(“small”)• PreFab(50sequencesineachset,butevaluatedbasedonpairwisealignments)
• HomFam(muchlarger!)
about 10 000 sequences is MAFFT/PartTree (Katoh and Toh,2007). It is very fast but leads to a loss in accuracy, which hasto be compensated for by iteration and other heuristics. WithClustal Omega, we use a modified version of mBed (Black-shields et al, 2010), which has complexity of O(N log N), andwhich produces guide trees that are just as accurate as thosefrom conventional methods. mBed works by ‘emBedding’ eachsequence in a space of n dimensions where n is proportional tolog N. Each sequence is then replaced by an n element vector,where each element is simply the distance to one ofn ‘referencesequences.’ These vectors can then be clustered extremelyquickly by standard methods such as K-means or UPGMA.In Clustal Omega, the alignments are then computed using thevery accurate HHalign package (Soding, 2005), which alignstwo profile hidden Markov models (Eddy, 1998).Clustal Omega has a number of features for adding
sequences to existing alignments or for using existingalignments to help align new sequences. One innovation isto allow users to specify a profile HMM that is derived from analignment of sequences that are homologous to the input set.The sequences are then aligned to these ‘external profiles’ tohelp align them to the rest of the input set. There are alreadywidely available collections of HMMs frommany sources suchas Pfam (Finn et al, 2009) and these can now be used to helpusers to align their sequences.
Results
Alignment accuracy
The standard method for measuring the accuracy of multiplealignment algorithms is to use benchmark test sets of referencealignments, generated with reference to three-dimensionalstructures. Here, we present results from a range of packagestested on three benchmarks: BAliBASE (Thompson et al,2005), Prefab (Edgar, 2004) and an extended version ofHomFam (Blackshields et al, 2010). For these tests, we justreport results using the default settings for all programs butwith two exceptions, which were needed to allow MUSCLE(Edgar, 2004) and MAFFT to align the biggest test cases in
HomFam. For test cases with 43000 sequences, we runMUSCLEwith the –maxiter parameter set to 2, in order to finishthe alignments in reasonable times. Second, we have run severaldifferent programs from the MAFFT package. MAFFT (Katohet al, 2002) consists of a series of programs that can be runseparately or called automatically from a script with the –autoflag set. This flag chooses to run a slow, consistency-basedprogram (L-INS-i) when the number and lengths of sequences issmall. When the numbers exceed inbuilt thresholds, a conven-tional progressive aligner is used (FFT-NS-2). The latter is also theprogram that is run by default if MAFFT is called with no flagsset. For very large data sets, the –parttree flag must be set on thecommand line and a very fast guide tree calculation is then used.The results for the BAliBASE benchmark tests are shown in
Table I. BAliBASE is divided into six ‘references.’Average scoresare given for each reference, along with total run times andaverage total column (TC) scores, which give the proportion ofthe total alignment columns that is recovered. A score of 1.0indicates perfect agreementwith the benchmark. There are tworows for the MAFFT package: MAFFT (auto) and MAFFTdefault. In most (203 out of 218) BAliBASE test cases, thenumber of sequences is small and the script runs L-INS-i, whichis the slow accurate program that uses the consistency heuristic(Notredame et al, 2000) that is also used by MSAprobs(Liu et al, 2010), Probalign, Probcons (Do et al, 2005) andT-Coffee. These programs are all restricted to small numbers ofsequences but tend to give accurate alignments. This is clearlyreflected in the times and average scores in Table I. The timesrange from 25min up to 22h for these packages and theaccuracies range from 55 to 61% of columns correct. ClustalOmega only takes 9min for the same runs but has an accuracylevel that is similar to that of Probcons and T-Coffee.The rest of the table is mainly taken by the programs that use
progressive alignment. Some of these are very fast but thisspeed ismatched by a considerable drop in accuracy comparedwith the consistency-based programs and Clustal Omega. Theweakest program here, is Clustal W (Larkin et al, 2007)followed by PRANK (Loytynoja and Goldman, 2008). PRANKis not designed for aligning distantly related sequences but atgiving good alignments for phylogenetic work with special
Table I BAliBASE results
Aligner Av score(218 families)
BB11(38 families)
BB12(44 families)
BB2(41 families)
BB3(30 families)
BB4(49 families)
BB5(16 families)
Tot time (s) Consistency
MSAprobs 0.607 0.441 0.865 0.464 0.607 0.622 0.608 12 382.00 YesProbalign 0.589 0.453 0.862 0.439 0.566 0.603 0.549 10 095.20 YesMAFFT (auto) 0.588 0.439 0.831 0.450 0.581 0.605 0.591 1475.40 Mostly
(203/218)Probcons 0.558 0.417 0.855 0.406 0.544 0.532 0.573 13086.30 YesClustal O 0.554 0.358 0.789 0.450 0.575 0.579 0.533 539.91 NoT-Coffee 0.551 0.410 0.848 0.402 0.491 0.545 0.587 81041.50 YesKalign 0.501 0.365 0.790 0.360 0.476 0.504 0.435 21.88 NoMUSCLE 0.475 0.318 0.804 0.350 0.409 0.450 0.460 789.57 NoMAFFT (default) 0.458 0.258 0.749 0.316 0.425 0.480 0.496 68.24 NoFSA 0.419 0.270 0.818 0.187 0.259 0.474 0.398 53 648.10 NoDialign 0.415 0.265 0.696 0.292 0.312 0.441 0.425 3977.44 NoPRANK 0.376 0.223 0.680 0.257 0.321 0.360 0.356 128 355.00 NoClustalW 0.374 0.227 0.712 0.220 0.272 0.396 0.308 766.47 No
The figures are total column scores produced using bali score on core columns only. The average score over all families is given in the second column. The results forBAliBASE subgroupings are in columns 3–8. The total run time for all 218 families is given in the second last column. The last column indicates whether the method isconsistency based.
High-quality protein MSAs using Clustal OmegaF Sievers et al
2 Molecular Systems Biology 2011 & 2011 EMBO and Macmillan Publishers Limited
FromSeiversetal.,MolecularSystemsBiology2011
TCscoresonBAliBASEsequences
0.35
0.4
0.45
0.5
0.55
0.6
0.65
10 100 1000 10000 100000 1e+06
Tota
l Col
umn
Scor
e
Time [s]
BAliBASE
MSAprobsProbalignMafft-auto
ProbconsClustalOmega TCoffeeSATeOpal
Kalign
MuscleMafft
FsaDialign
PrankClustalW2
attention
togap
s.These
gappositio
nsare
not
inclu
dedin
these
testsas
they
tend
not
tobestru
cturally
conserved.
Dialign
(Morgen
sternet
al,1998)
does
not
use
consisten
cyor
progressivealign
men
tbutis
based
onfindingbestlocalm
ultip
lealign
men
ts.FSA(Brad
leyeta
l,2009)uses
samplin
gof
pairwise
alignmen
tsan
d‘seq
uen
cean
nealin
g’an
dhas
been
show
nto
delivergo
odnucleotide
sequen
cealign
men
tsin
thepast.
ThePrefab
ben
chmark
testresu
ltsare
shownin
Table
II.Here,
theresu
ltsare
divided
into
fivegrou
psacco
rdingto
the
percen
tiden
tityoftheseq
uen
ces.Theoverall
scores
range
from53
to73%
ofcolumnscorrect.
Theconsisten
cy-based
program
sMSA
probs,M
AFFTL-IN
S-i,Probalign
,Probconsan
dT-C
offee,
areagain
themost
accurate
butwith
longruntim
es.ClustalO
mega
isclo
seto
theconsisten
cyprogram
sin
accuracy
butis
much
faster.There
isthen
agap
tothefaster
progressive
based
program
sofMUSC
LE,MAFFT,
Kalign
(Lassm
annan
dSo
nnham
mer,
2005)an
dClustal
W.
Resu
ltsfrom
testing
largealign
men
tswith
up
to50
000sequ
ences
aregiven
inTable
IIIusin
gHom
Fam.Here,
eachalign
men
tismade
upof
acore
ofaHom
strad(M
izugu
chietal,
1998)stru
cture-based
alignmen
tofatleastfive
sequen
ces.These
sequen
cesare
then
inserted
into
atest
setof
sequen
cesfrom
the
correspondin
g,hom
ologous,P
famdom
ain.T
hisgives
verylarge
setsof
sequen
cesto
bealign
edbu
tthetestin
gison
lycarried
out
onthesequ
ences
with
know
nstru
ctures.
Only
someprogram
sare
ableto
deliveralign
men
tsatall,w
ithdata
setsofth
issize.W
erestricted
thecom
parisonsto
Clustal
Omega,
MAFFT,
MUSC
LEan
dKalign
.MAFFT
with
defaultsettin
gs,has
alim
itof
20000
sequen
cesan
dweon
lyuse
MAFFT
with
–parttreefor
thelast
sectionof
TableIII.
MUSC
LEbecom
esincreasin
glyslow
when
youget
over3000
sequen
ces.Therefore,
for43000
sequen
cesweused
MUSC
LEwith
thefaster
butless
accurate
settingof
–maxiters
2,which
restrictsthenumber
ofiteration
sto
two.
Overall,
Clustal
Omega
iseasily
themost
accurate
program
inTab
leIII.
Theruntim
esshow
MAFFTdefau
ltan
dKalign
tobeexcep
tionally
fastonthesm
allertest
casesan
dMAFFT
–parttree
tobe
very
faston
the
biggest
families.
Clustal
Omega
does
scalewell,
however,
with
increasin
gnumbers
of
sequen
ces.This
scalingis
describ
edin
more
detail
inthe
Supplem
entary
Inform
ation.W
edohave
twofurth
ertestcases
with
450
000seq
uen
ces,butitwas
notpossib
leto
getresu
ltsforthese
fromMUSC
LEorKalign
.These
aredescrib
edin
the
Supplem
entary
Inform
ationas
well.
Table
IIIgiv
esoverall
run
times
forthe
fourprogram
sevalu
atedwith
HomFam
.Figu
re1reso
lvesthese
runtim
escase
bycase.K
alignisvery
fastforsm
allfamilies
butdoes
not
scaleas
well.O
verall,M
AFFTisfaster
than
theother
program
sover
alltest
casesizes
butClustal
Omega
scalessim
ilarly.Points
inFigu
re1rep
resentdifferen
tfam
ilieswith
differen
tav
erageseq
uen
celen
gthsan
dpairw
iseiden
tities.Therefo
re,the
scalability
trend
isfuzzy,
with
largerdots
occu
rring
generally
abov
esm
allerdots.Su
pplem
entary
Figu
reS3
shows
scalability
data,
where
subsets
ofincreasin
gsize
aresam
pled
fromonelarge
family
only.T
hisred
uces
variab
ilityin
pairw
iseiden
tityan
dseq
uen
celen
gth.
Externalprofile
alig
nment
ClustalO
mega
canread
extrainform
ationfro
maprofileHMM
derived
from
preexistin
galign
men
ts.Fo
rexam
ple,
ifauser
Table II Prefab results
Aligner 0o%IDp100 (1682families)
0p%IDp20 (912families)
20p%IDp40 (563families)
40p%IDp70 (117families)
70p%IDp100 (90families)
Total time (s) (1682families)
Consistency
MSAprobs 0.737 0.591 0.889 0.965 0.971 51 286.00 YesMAFFT(auto)
0.721 0.569 0.876 0.961 0.979 4544.45 Yes
Probalign 0.719 0.563 0.881 0.961 0.977 35117.30 YesProbcons 0.717 0.562 0.876 0.955 0.972 46 908.30 YesT-Coffee 0.710 0.558 0.865 0.950 0.972 175 789.00 YesClustal O 0.700 0.535 0.866 0.967 0.980 1698.06 NoMUSCLE 0.677 0.507 0.850 0.946 0.976 2068.56 NoMAFFT 0.677 0.513 0.836 0.961 0.979 225.56 NoKalign 0.649 0.474 0.817 0.957 0.979 80.81 NoClustalW2 0.617 0.430 0.797 0.933 0.975 3433.53 NoDialign 0.595 0.398 0.783 0.940 0.974 18 909.70 NoPRANK 0.586 0.390 0.767 0.951 0.978 351 498.00 NoFSA 0.534 0.277 0.791 0.965 0.976 229 391.00 No
Total column scores (TC) are shown for different percent identity ranges; the second column is the average score over all test cases. The total run time in seconds is shown in the second last column. The last column indicatesif the method is consistency based.
High-quality
proteinMSAsusing
ClustalO
mega
FSievers
etal
&2011
EMBO
andMacm
illanPublishers
Limited
MolecularSystem
sBiology
20113
FromSeiversetal.,MolecularSystemsBiology2011
Notethatbestperformingmethoddependsonthe“%ID”(measureofsimilarity)
TCscoresonPreFabsequences
wishes to align a set of globin sequences and has an existingglobin alignment, this alignment can be converted to a profileHMM and used as well as the sequence input file. This HMM ishere referred to as an ‘external profile’ and its use in this wayas ‘external profile alignment’ (EPA). During EPA, eachsequence in the input set is aligned to the external profile.Pseudocount information from the external profile is thentransferred, position by position, to the input sequence.Ideally, this would be used with large curated alignments ofparticular proteins or domains of interest such as are used inmetagenomics projects. Rather than taking the input se-quences and aligning them from scratch, every time newsequences are found, the alignment should be carefullymaintained and used as an external profile for EPA. ClustalOmega also can align sequences to existing alignments usingconventional alignment methods. Users can add sequences toan alignment, one by one or align a set of aligned sequences tothe alignment.In this paper, we demonstrate the EPA approach with two
examples. First, we take the 94 HomFam test cases from theprevious section and use the corresponding Pfam HMM forEPA. Before EPA, the average accuracy for the test cases was0.627 of correctly aligned Homstrad positions but after EPA itrises to 0.653. This is plotted, test case for test case inFigure 2A. Each dot is one test case with the TC score forClustal Omega plotted against the score using EPA. The secondexample is illustrated in Figure 2B. Here, we take all theBAliBASE reference sets and align them as normal usingClustal Omega and obtain the benchmark result of 0.554 ofcolumns correctly aligned, as already reported in Table I. ForEPA, we use the benchmark reference alignments themselvesas external profiles. The results now jump to 0.857 of columnscorrect. This is a jump of over 30% and while it is not a validmeasure of Clustal Omega accuracy for comparison with otherprograms, it does illustrate the potential power of EPA to useinformation in external alignments.
Iteration
EPA can also be used in a simple iteration scheme. Once a MSAhas been made from a set of input sequences, it can beconverted into a HMM and used for EPA to help realign theinput sequences. This can also be combined with a fullrecalculation of the guide tree. In Figure 3, we show the resultsof one and two iterations on every test case from HomFam.The graph is plotted as a running average TC score for all testcases with N or fewer test cases where N is plotted on the
horizontal axis using a log scale. With some smaller test cases,iteration actually has a detrimental effect. Once you get near1000 or more sequences, however, a clear trend emerges.Themore sequences you have, themore beneficial the effect ofiteration is. With bigger test cases, it becomes more and morebeneficial to apply two iterations. This result confirmsthe usefulness of EPA as a general strategy. It also confirmsthe difficulty in aligning extremely large numbers of sequencesbut gives one partial solution. It also gives a very simple buteffective iteration scheme, not just for guide tree iteration, asused inmany packages, but for iteration of the alignment itself.
Discussion
The main breakthroughs since the mid 1980s in MSA methodshave been progressive alignment and the use of consistency.Otherwise, most recent work has concerned refinements forspeed or accuracy on benchmark test sets. The speed increaseshave been dramatic but, with just two major exceptions, themethods are still basically O(N2) and incapable of beingextended to data sets of 410 000 sequences. The twoexceptions aremBed, used here, andMAFFT PartTree. PartTreeis faster but at the expense of accuracy, at least as judged by thebenchmarking here. The second group of recent developments
Table III HomFam benchmarking results
93pNp2957 (41 families) 3127pNp9105 (33 families) 10 099pNp50157 (18 families)
Aligner TC/t (s) TC/t (s) TC/t (s)Clustal O 0.708/2114.0 0.639/11 719.5 0.464/27 328.9Kalign 0.569/324.9 0.563/6752.0 0.420/286 711.0MAFFT default 0.550/238.9 0.462/3115.4 !/!MAFFT –parttree !/! !/! 0.253/6119.4MUSCLE default 0.533/104 587.0 !/! !/!MUSCLE –maxiters 2 !/! 0.416/8239.2 0.216/110 292.0
The columns show total column score (TC) and total run time in seconds for groupings of small (o3000 sequences), medium (3000–10 000 sequences) and large(410 000 sequences) HomFam test cases.
0.001
0.01
0.1
1
10
100
1000
10 000
100 000
100 3000 10 000 100 000
Tim
e (s
)
#Sequences
HomFamClustalΩ
MafftMuscleKalign
Avr length:1−50
50 −100100 −150150 −200200 −250250 −300300 −350350 −400400 −450
Figure 1 Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE(green) and Kalign (purple) against the number of sequences of HomFam testsets. Average sequence length is rendered by point size. Both axes havelogarithmic scales. Clustal Omega and Kalign were run with default flags over theentire range. MUSCLE was run with –maxiters 2 for N43000 sequences.MAFFT was run with –parttree for N410 000 sequences.
High-quality protein MSAs using Clustal OmegaF Sievers et al
4 Molecular Systems Biology 2011 & 2011 EMBO and Macmillan Publishers Limited
FromSeiversetal.,MolecularSystemsBiology2011
Notethereducedsetofmethodsduetodatasetsize.
TCscoresonHomFamsequences
Observa<ons(sofar)
• Rela<veandabsoluteaccuracy(wrtTCscore)impactedbydegreeofheterogeneityanddatasetsize
• Somemethodscannotrunonlargedatasets• Onsmalldatasets,Clustal-OmeganotasaccurateasProbalign,MAFFTv6.857,andMSAprobs.
• Onlargedatasets,Clustal-Omegamoreaccuratethanothermethodsthatitwascomparedto.
Clustal-Omegavariants
• Itera<ngusingprofileHMMs• UsingEPA(profilesbasedonexternalstructuralalignments)
EPA for HomFam and BAliBASE.
Fabian Sievers et al. Mol Syst Biol 2011;7:539
© as stated in the article, figure or figure legend
Iteration of HomFam alignments.
Fabian Sievers et al. Mol Syst Biol 2011;7:539
© as stated in the article, figure or figure legend
Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE (green) and Kalign (purple) against the number of sequences of HomFam test sets.
Fabian Sievers et al. Mol Syst Biol 2011;7:539
© as stated in the article, figure or figure legend
(a) (b)
(c) (d)
20 200 2,000 20,000 200,000
0.01
0.10
1.00
10.00
100.00
1,000.00
10,000.00
100,000.00
Run time (single threaded)
ClustalΩ
MAFFT-PartTree
sequences
se
co
nd
s
20 200 2,000 20,000 200,000
1
10
100
1,000
10,000
100,000
Peak memory usage
ClustalΩ
MAFFT-PartTree
sequences
me
ga
byte
s
20 200 2,000 20,000 200,000
1
10
100
1,000
10,000
100,000
Average memory usage
ClustalΩ
MAFFT-PartTree
sequences
me
ga
byte
s
20 200 2,000 20,000 200,000
0
0.1
0.2
0.3
0.4
0.5
Alignment accuracy
ClustalΩ
MAFFT-PartTree
sequences
TC
sco
re
Addi<onalObserva<ons
• Itera<ngusingprofileHMMsimprovesTCscoresforClustal-Omega.
• UsingEPA(profilesbasedonexternalstructuralalignments)improvesTCscoresforClustal-Omega.
• MAFFT-pardreeisfasterthanClustal-Omega.• ThekeytoClustal-Omega’sspeedisitsguidetree.
Someconcerns
• Othercriteria(1-SPFN,1-SPFP,impactontreeerror,etc.)
• Otherdatasets• Newermethods(orupdatedmethods)• Restric<ontomethodsthatcompletequickly
wishes to align a set of globin sequences and has an existingglobin alignment, this alignment can be converted to a profileHMM and used as well as the sequence input file. This HMM ishere referred to as an ‘external profile’ and its use in this wayas ‘external profile alignment’ (EPA). During EPA, eachsequence in the input set is aligned to the external profile.Pseudocount information from the external profile is thentransferred, position by position, to the input sequence.Ideally, this would be used with large curated alignments ofparticular proteins or domains of interest such as are used inmetagenomics projects. Rather than taking the input se-quences and aligning them from scratch, every time newsequences are found, the alignment should be carefullymaintained and used as an external profile for EPA. ClustalOmega also can align sequences to existing alignments usingconventional alignment methods. Users can add sequences toan alignment, one by one or align a set of aligned sequences tothe alignment.In this paper, we demonstrate the EPA approach with two
examples. First, we take the 94 HomFam test cases from theprevious section and use the corresponding Pfam HMM forEPA. Before EPA, the average accuracy for the test cases was0.627 of correctly aligned Homstrad positions but after EPA itrises to 0.653. This is plotted, test case for test case inFigure 2A. Each dot is one test case with the TC score forClustal Omega plotted against the score using EPA. The secondexample is illustrated in Figure 2B. Here, we take all theBAliBASE reference sets and align them as normal usingClustal Omega and obtain the benchmark result of 0.554 ofcolumns correctly aligned, as already reported in Table I. ForEPA, we use the benchmark reference alignments themselvesas external profiles. The results now jump to 0.857 of columnscorrect. This is a jump of over 30% and while it is not a validmeasure of Clustal Omega accuracy for comparison with otherprograms, it does illustrate the potential power of EPA to useinformation in external alignments.
Iteration
EPA can also be used in a simple iteration scheme. Once a MSAhas been made from a set of input sequences, it can beconverted into a HMM and used for EPA to help realign theinput sequences. This can also be combined with a fullrecalculation of the guide tree. In Figure 3, we show the resultsof one and two iterations on every test case from HomFam.The graph is plotted as a running average TC score for all testcases with N or fewer test cases where N is plotted on the
horizontal axis using a log scale. With some smaller test cases,iteration actually has a detrimental effect. Once you get near1000 or more sequences, however, a clear trend emerges.Themore sequences you have, themore beneficial the effect ofiteration is. With bigger test cases, it becomes more and morebeneficial to apply two iterations. This result confirmsthe usefulness of EPA as a general strategy. It also confirmsthe difficulty in aligning extremely large numbers of sequencesbut gives one partial solution. It also gives a very simple buteffective iteration scheme, not just for guide tree iteration, asused inmany packages, but for iteration of the alignment itself.
Discussion
The main breakthroughs since the mid 1980s in MSA methodshave been progressive alignment and the use of consistency.Otherwise, most recent work has concerned refinements forspeed or accuracy on benchmark test sets. The speed increaseshave been dramatic but, with just two major exceptions, themethods are still basically O(N2) and incapable of beingextended to data sets of 410 000 sequences. The twoexceptions aremBed, used here, andMAFFT PartTree. PartTreeis faster but at the expense of accuracy, at least as judged by thebenchmarking here. The second group of recent developments
Table III HomFam benchmarking results
93pNp2957 (41 families) 3127pNp9105 (33 families) 10 099pNp50157 (18 families)
Aligner TC/t (s) TC/t (s) TC/t (s)Clustal O 0.708/2114.0 0.639/11 719.5 0.464/27 328.9Kalign 0.569/324.9 0.563/6752.0 0.420/286 711.0MAFFT default 0.550/238.9 0.462/3115.4 !/!MAFFT –parttree !/! !/! 0.253/6119.4MUSCLE default 0.533/104 587.0 !/! !/!MUSCLE –maxiters 2 !/! 0.416/8239.2 0.216/110 292.0
The columns show total column score (TC) and total run time in seconds for groupings of small (o3000 sequences), medium (3000–10 000 sequences) and large(410 000 sequences) HomFam test cases.
0.001
0.01
0.1
1
10
100
1000
10 000
100 000
100 3000 10 000 100 000
Tim
e (s
)
#Sequences
HomFamClustalΩ
MafftMuscleKalign
Avr length:1−50
50 −100100 −150150 −200200 −250250 −300300 −350350 −400400 −450
Figure 1 Alignment time for Clustal Omega (red), MAFFT (blue), MUSCLE(green) and Kalign (purple) against the number of sequences of HomFam testsets. Average sequence length is rendered by point size. Both axes havelogarithmic scales. Clustal Omega and Kalign were run with default flags over theentire range. MUSCLE was run with –maxiters 2 for N43000 sequences.MAFFT was run with –parttree for N410 000 sequences.
High-quality protein MSAs using Clustal OmegaF Sievers et al
4 Molecular Systems Biology 2011 & 2011 EMBO and Macmillan Publishers Limited
FromSeiversetal.,MolecularSystemsBiology2011
Notethereducedsetofmethodsduetodatasetsize.
TCscoresonHomFamsequences
Methodstheydidn’texplore
• PASTAandUPP:designedexplicitlyforlargedatasets.(TheytestedSATe-1,theprecursortoSATe-IIandPASTA,butnotonthelargedatasetsbecauseitcouldnotberunefficiently.)
• ImprovedversionsofMAFFTsincetheirstudy• Andwhoknowswhatelse…
SATeFamily• SATe-I(2009):
– Uptoabout10,000sequences– Goodaccuracyandreasonablespeed– “Center-tree”decomposi<on
• SATe-II(2012)– Uptoabout50,000sequences– Improvedaccuracyandspeed– Centroid-edgerecursivedecomposi<on
• PASTA(2014)– Upto1,000,000sequences– Improvedaccuracyandspeed– Combinescentroid-edgedecomposi<onwithtransi<vitymerge
SATé/PASTAAlgorithmDesign
Estimate ML tree on new alignment
Tree
Obtain initial alignment and estimated ML tree
Use tree to compute new alignment
Alignment
A
B D
C
Merge subproblems
Estimate ML tree on merged alignment
Decompose based on input tree
A B
C D
Align subproblems
A B
C D
ABCD
SATéitera<on(actualdecomposi<onproduces32subproblems)
e
run two iterations of SATe-II on a separate machine with no running time limits (12 Quad-Core AMDOpteron processors, 256GB of RAM memory). Given 12 CPUs, two iterations of SATe-II took 137 hr,compared to 10 hr for PASTA. However, the resulting SATe-II alignment recovered only 30 columnsentirely correctly while PASTA recovered 311 columns. The pairs score of SATe-II was extremely poor(38.2%), while PASTA was quite accurate (81.0%). The tree produced by SATe-II had higher error thanPASTA (12.6% versus 8.2% FN rate).Impact of varying algorithmic parameters. We compared results obtained using four different startingtrees: a random tree, the ML tree on the Mafft-PartTree alignment, PASTA’s default starting tree, and thetrue (model) tree (see Table 3). After one iteration, PASTA alignments and trees based on our starting treeor true tree had roughly the same accuracy, and the starting tree based on Mafft-PartTree resulted in only aslightly worse tree (1% higher FN rate). However, using a random tree resulted in much higher tree errorrates (52.3% error), and alignments that were also less accurate. Interestingly, after three iterations ofPASTA, no noticeable difference could be detected between results from various starting trees. Thus,PASTA is robust to the choice of the starting tree.
We also evaluated the impact of changing the alignment subset size (Table 4); these analyses showedthat using alignment subsets of only 50 sequences improved the TC score and running time substantially,and only slightly changed the pairs score or tree error score. Although these analyses were performed onlyfor two datasets, they suggest the possibility that improved results might be obtained through smalleralignment subsets.
Running Time. Figure 3 compares the running time (in hours) of different alignment methods. Note thatPASTA was faster than SATe-II in all cases and could analyze datasets that SATe-II could not (i.e., the
Table 2. Alignment Accuracy on AA Datasets
Column (TC) score Pairs score
Method AA-10 HomFam(17) HomFam(2) AA-10 HomFam(17) HomFam(2)
Clustal-O 78 88 29 0.76 0.72 0.71Muscle 48 51 X 0.70 0.52 XMafft 81 103 32 0.76 0.75 0.79Initial 54 95 16 0.75 0.71 0.81SATe-II 83 73 X 0.75 0.64 XPASTA 80 102 36 0.76 0.78 0.83
We show TC (the number of correctly aligned sites, left) and the pairs score (the average of the SP-score and modeler
score, right). X indicates that a method failed to run on a particular dataset given the computational constraints.‘‘Initial’’ corresponds to the alignment approach used to obtain the starting tree of PASTA (HMMER failed to align one
sequence in the 16S.T dataset). All values shown are averages over all datasets in each category.
Boldface indicates the best values for each model condition.
Table 3. Effect of the Starting Tree on Final PASTA Alignment and Tree
Initial tree Alignment accuracy
method Error (FN) Pairs score TCTree error
FN
One iterationRandom 100.0% 79.9% 2 52.3%Mafft-parttree 28.7% 87.0% 126 11.7%Starting tree 12.4% 86.8% 138 10.5%True tree 0% 86.1% 133 10.5%
Three iterationsRandom 100.0% 90.4% 138 11.0%Mafft-parttree 28.7% 83.9% 144 10.7%Starting tree 12.4% 88.8% 145 10.7%True tree 0% 90.8% 150 10.5%
Alignment accuracy and tree error is shown for PASTA with various starting trees, after one iteration (top) and
three iterations (bottom) on one replicate of the 10k RNASim dataset.
PASTA: ULTRA-LARGE MULTIPLE SEQUENCE ALIGNMENT 7
ComparisonofPASTAtoSATe-IIandotheralignmentsonAAdatasets.FromMirarabetal.,J.Computa<onalBiology2014
Malversion7.143brunindefaultmode(evenonHomFam(17)datasets).
MAFFTvs.Clustal-Omega
• IntheClustal-Omegapaper,Clustal-OmegahadbederTCscoresthanMAFFT.
• InthePASTApaper,MAFFThadbederTCscoresthanClustal-Omega.
Whythedifference?
• TwoversionsofMAFFT(Clustal-Omegapaperusedversion6.857,PASTApaperusedversion7.143b)
• Twodifferentcommandspossibly(MAFFT-pardreevs.MAFFT-default)
Lesson:Versionnumberandcommandimpactsabsoluteandrela<veperformance!
to each other. Even if a more computationally expensive (andusually more accurate) method, L-INS-i, is applied (CPU time-= 98 h), the alignment is still obviously incorrect (fig. 2B).
Two-step strategies can solve this type of problem. That is,a set of full-length sequences taken from databases are firstaligned to build a backbone MSA, and then the new ITS1 andITS2 sequences are added into this backbone MSA, using the––addfragments option.
Step 1: mafft--autofull_length_sequences>\
backbone_msa
Step 2: mafft --addfragments \ new_sequences
backbone_msa > output
The second command is equivalent to
mafft --multipair --addfragments \new_sequences backbone_msa > output
in which Dynamic Programming (DP) is used to compare thedistances between every new sequence and every sequence inthe backbone MSA (––multipair is selected by default).
mafft --6merpair --addfragments \new_sequences backbone_msa > output
where distances are rapidly estimated using the number ofshared 6mers, instead of DP.
The result of the latter option (––6merpair––addfragments) is shown in fig. 2D and E. The differencebetween D and E is just in the order of sequences; the se-quences were reordered according to similarity using the––reorder option in E. In this alignment, ITS1 and ITS2are clearly separated and aligned to appropriate positions inthe full-length alignment. Moreover, this strategy is compu-tationally much less expensive (CPU time = 15 min [firststep] + 1.5 min [second step]) than the full applicationof L-INS-i (CPU time = 98 h). The former option(––multipair ––addfragments) also returns a similarresult to the latter (––6merpair) but is slower (CPUtime = 48.6 min [second step]).
This case suggests that it is crucial to select a strategyappropriate to the problem of interest. The most time-consuming method, L-INS-i, is not always the most accurate
one. The difficulty of this problem for standard approachescomes from the fact that ITS1 sequences and ITS2 sequencesare not homologous to each other and most pairwise align-ments are impossible. Because of these nonhomologous pairs,the distance matrix used for the guide tree calculation isnot additive; the distances between ITS1 and full-lengthsequences and those between ITS2 and full-length sequencesare close to zero, whereas the distances between ITS1 andITS2 are quite large. In this situation, it is difficult for normaldistance-based tree-building methods to give a reasonabletree. Moreover, in the alignment step, the objective functionof the L-INS-i is affected by inappropriate pairwise alignmentscores between ITS1 and ITS2. Such problems can be avoidedby just ignoring the relationship between ITS1 and ITS2,as done in the ––addfragments option.
In addition, a result of the second type of misuse ofmafft-profile (discussed earlier) is shown in figure 2C.Some new sequences are correctly aligned but others areobviously incorrectly aligned (note that the order of se-quences in fig. 2C is identical that in fig. 2D). These misalign-ments are due to an incorrect assumption on phylogeneticplacement of new sequences shown in figure 1C.
Test Case 2: Bacterial SSU rRNAAnother case is the 16S.B.ALL data set by Mirarab et al. (2012).It consists of an MSA of 13,822 bacterial SSU rRNA sequences,taken from the Gutell Comparative RNA Website (CRW)(Cannone et al. 2002) and 138,210 fragmentary sequences,which are originally included in the CRW alignmentbut ungapped and artificially truncated. In Katoh andStandley (2013), we used a subset (13,821 fragmentarysequences) prepared by Mirarab et al. (2012). In addition tothis subset, here we use the full data set (138,210 fragmentarysequences), to examine the scalability. Suppose a situationwhere we already have a manually curated (or backbone)MSA and a newly determined set of many fragmentarysequences in a metagenomics project, and we need anentire MSA of them.
The first four lines in table 2 (case 1) show the perfor-mances of various options for such an analysis, with a rela-tively small data set (13,822 sequences in the existing
Table 2. Comparison of Different Options Using the 16S.B.ALL Data Set (Mirarab et al. 2012).
Data Method Accuracy CPU Time Actual Timea
Case 1 mafft ––multipair ––addfragments frags existingmsa 0.9969 6.67 days 18.3 hmafft ––6merpair ––addfragments frags existingmsa 0.9949 3.76 h 36.2 minmafft ––localpair ––add frags existingmsa 0.9707 23.4 daysb 2.43 daysb
mafft ––6merpair ––add frags existingmsa 0.9604 1.32 h 1.44 hprofile alignment 0.2779 15.5 h 1.60 h
Case 2 mafft ––6merpair ––addfragments frags existingmsa 0.9969 4.54 h 33.8 min
Case 3 mafft ––6merpair ––addfragments frags existingmsa 0.9949 1.79 days 5.91 h
NOTE.—The estimated alignments were compared with the CRW alignment to measure the accuracy (the number of correctly aligned letters/the number of aligned letters in theCRW alignment). Calculations were performed on a Linux PC with 2.67 GHz Intel Xeon E7-8837/256 GB RAM (for the case marked with superscript alphabet “b”), or on a LinuxPC with 3.47 GHz Intel Xeon X5690/48 GB RAM (for the other cases).Case 1: 13,822 sequences in the existing alignment! 13,821 fragments;Case 2: 1,000 sequences in the existing alignment! 138,210 fragments;Case 3: 13,822 sequences in the existing alignment! 138,210 fragments.aWall-clock time with 10 cores. Command-line argument for parallel processing is ––thread 10.bFull command-line options are as follows: mafft ––localpair ––weighti 0 ––add frags existingmsa.
776
Katoh and Standley . doi:10.1093/molbev/mst010 MBE
by guest on February 22, 2015http://m
be.oxfordjournals.org/D
ownloaded from
FromKatohandStandley,2013(dealingwithfragmentarysequences)Mol.Biol.Evol.30(4):772–780doi:10.1093/molbev/mst010
Reviewer1• “TheoverallmessageappearstobethatClustalOmega
appearstobethefastestandmostaccuratealgorithmforverylargealignments,whileitiss<llsubstan<allyoutperformedbyothermethodsforsmalleralignments.”
• “Lookingattheresultspresentedhere(again,fromthestandpointofa"consumer"),Iwouldnotpersonallyexpecttomakemuchuseofthisalgorithmandwillprobablycon<nuetouseMAFFTorMUSCLE,mainlysincethey'res<llperformingsimilarlywellorbederforsmalleralignments,whicharethemainuseformyself.Iexpectthistobetrueforthemajorityofusers.However,thismaychangeinthefutureandespeciallyforthosewhoworkonmajorsequencing/datarepositorycenters(suchastheEBI),assequencingofmorespeciesrampsup.”
Reviewer2• “ThemainnoveltyofthisprogramcomparedtopreviousversionsofClustalisitsabilitytodealwithhugedatasets.Thisismainlyachievedbyusinganefficientclusteringalgorithmtoefficientlycalculateaguidetreeforprogressivealignment…construc<ngtheguidetreeforprogressivealignmentisthebodleneckofmostMSAprograms:calcula<ngatreefromtheinputsequenceswithtradi<onalmethodstakes(atleast)O(n2)<mewhileprogressivelyaligningthesequencesaccordingtothistreecanbedoneinlinear<me.”
CommentsaboutEPA• Reviewer1:“ItalsoofferstheProfileAlignmentuse-case,whichislikelytobemoreimportantinthefuture”
• Reviewer2:“Asecondinteres<ngnoveltyofClustalOmegaistheop<ontousepreviouslycalculated"ExternalProfileAlignments"(EPA)toimprovethealignmentprocedure.ThisisdoneusinganHMMapproachdevelopedbyoneoftheco-authors.Sinceformanyproteinfamilies,therearenowprofileHMMsavailable,itisaverysensibleideatousethisinforma<onforimprovedMSA,ratherthanrelyingonsequencesimilarityalone,asdotradi<onalalignmentmethods.”
Overall• MainfeaturesofClustal-Omegaarethefastguidetreecalcula<on,theuseofitera<on,andEPA.(BothEPAanditera<onuseprofileHMMs.)
• Butimprovementsinaccuracycomparedtoothermethodsonlyseemstobeapparentonlargedatasetswhenrestric<ngthesetofmethodstothosethatcompletequickly.Newermethods(andevennewversionsofoldmethodslikeMAFFT)seemtooutperformClustal-Omegaforaccuracy,eveniftheyarenotasfast.
• AccuracywasonlyassessedusingTCscore,whichisnotnecessarilythewholestory.
ChoiceofbestMSAmethod
• Doesitdependontypeofdata(DNAoraminoacids?)
• Doesitdependonrateofevolu<on?• Doesitdependongaplengthdistribu<on?• Doesitdependonexistenceoffragments?
Howdoestheguidetreeimpactaccuracy?
• Doesimprovingtheaccuracyoftheguidetreehelp?
• Doallalignmentmethodsrespondiden<cally?(Isthesameguidetreegoodforallmethods?)
• DothedefaultseSngsfortheguidetreeworkwell?
Anotherstudy…
• Prank(LoytynojaandGoldman,Science2008)isa“phylogenyaware”progressivealignmentstrategy.
• Theirstudyfocusedonevalua<ngMSAswithrespecttoTCscore,butalsoatypicalcriteria,suchas:– Genetreebranchlengthes<ma<on– Alignmentlengthes<ma<on(compressionissue)– Inser<on/dele<onra<o– Numberofinser<ons/dele<ons
• Theyexploredverysmallsimulateddatasets,evolvingsequencesdowntrees.
have a flaw that has gone unnoticed as long asdifferent methods have been consistent in theerror they create.
That such a significant error has passedundetected may be explained by the align-ment field's historical focus on proteins, wherethese biases tend to be manifested in less-constrained regions such as loops (compareFig. 1). Alignments with insertions and de-letions squeezed compactly between con-served blocks may suffice for, and even bepreferred by, some molecular biologists work-ing with proteins. We have shown, however,
that these patterns are, in fact, imposed bysystematic biases in alignment algorithms,even in cases where they are incorrect and,indeed, phylogenetically unreasonable. Wecontend that algorithms that impose gap pat-terns like those found in structural align-ments of proteins are inappropriate for theincreasingly widespread analysis of genomicDNA and are likely to cause error when theresulting alignments are used for evolutionaryinferences.
We believe that alignment methods spe-cifically designed for evolutionary analyses
will give a very different picture of the mech-anisms of sequence evolution and showsequence turnover through short insertionsand deletions as a more frequent and im-portant phenomenon. This raises interestingquestions of the true evolution of variablesequences such as promoter regions, non-coding DNA, and exposed coil regions inproteins: Do they predominantly evolvethrough point substitutions, or are those dis-similar regions just incorrectly aligned non-homologous sequences? To resolve that, weneed more sequence data and alignmentmethods that can really benefit from theadditional information. The resulting align-ments may be fragmented by many gapsand may not be as visually beautiful as thetraditional alignments, but if they representcorrect homology, we have to get used tothem.
References and Notes1. R. A. Gibbs et al., Nature 428, 493 (2004).2. Rhesus Macaque Genome Sequencing and Analysis
Consortium, Science 316, 222 (2007).3. The ENCODE Project Consortium, Nature 447, 799
(2007).4. A. Stark et al., Nature 450, 219 (2007).5. Materials and methods are available as supporting
material on Science Online.6. A. Rambaut, D. Posada, K. Crandall, E. Holmes, Nat. Rev.
Genet. 5, 52 (2004).7. N. Sullivan, M. Thali, C. Furman, D. Ho, J. Sodroski, J. Virol.
67, 3674 (1993).8. R. Wyatt et al., J. Virol. 69, 5723 (1995).9. M. Jansson et al., AIDS Res. Hum. Retroviruses 17, 1405
(2001).10. S. D. Frost et al., Proc. Natl. Acad. Sci. U.S.A. 102,
18514 (2005).11. M. Sagar, X. Wu, S. Lee, J. Overbaugh, J. Virol. 80, 9586
(2006).12. J. D. Thompson, D. G. Higgins, T. J. Gibson, Nucleic Acids
Res. 22, 4673 (1994).13. C. Notredame, D. G. Higgins, J. Heringa, J. Mol. Biol.
302, 205 (2000).14. R. C. Edgar, BMC Bioinformat. 5, 113 (2004).15. K. Katoh, K. Kuma, H. Toh, T. Miyata, Nucleic Acids Res.
33, 511 (2005).16. A. Löytynoja, N. Goldman, Proc. Natl. Acad. Sci. U.S.A.
102, 10557 (2005).17. D. D. Pollock, D. J. Zwickl, J. A. McGuire, D. M. Hillis,
Syst. Biol. 51, 664 (2002).18. M. S. Rosenberg, S. Kumar, Syst. Biol. 52, 119
(2003).19. M. S. Rosenberg, BMC Bioinformat. 6, 278
(2005).20. K. M. Wong, M. A. Suchard, J. P. Huelsenbeck, Science
319, 473 (2008).21. This work was funded in part by a Wellcome
Trust Programme Grant (GR078968). We thankN. Luscombe for many suggestions that improvedthe manuscript.
Supporting Online Materialwww.sciencemag.org/cgi/content/full/320/5883/1632/DC1Materials and MethodsSOM TextFig. S1Table S1References
28 March 2008; accepted 15 May 200810.1126/science.1158395
Fig. 3. Alignment accuracy errors are reduced by a phylogeny-aware algorithm. (A to F) Errorsfrom traditional alignment methods grow with increasing evolutionary distances and moredifficult alignments (close-intermediate-distant, white to blue gradient) and also with a densersequence sampling and increasingly similar sequences (close-2X-4X, white to red gradient).The phylogeny-aware method PRANK+F is less biased by greater distances and, in contrast toother methods, improves in accuracy with additional sequence data. Alignment statistics forthe five multiple sequence alignment methods are number of (A) insertions and (B) deletions,(C) insertion/deletion ratio, (D) gap overlap, (E) total length of the alignment, and (F)proportion of columns correctly recovered. (G) Inferred branch lengths at different depths inthe tree (t1 – t4) for the intermediate sets indicate that alignment errors lead to overestimatedbranch lengths, with PRANK+F giving the most accurate estimates across the whole range ofdepths. All measures in (A) to (G) are shown relative to those inferred from the true align-ment, and values closer to 1 are more correct. Vertical bars show means and 95% confidenceintervals.
www.sciencemag.org SCIENCE VOL 320 20 JUNE 2008 1635
REPORTS
FromLoytyjojaandGoldman,Science2008:
have a flaw that has gone unnoticed as long asdifferent methods have been consistent in theerror they create.
That such a significant error has passedundetected may be explained by the align-ment field's historical focus on proteins, wherethese biases tend to be manifested in less-constrained regions such as loops (compareFig. 1). Alignments with insertions and de-letions squeezed compactly between con-served blocks may suffice for, and even bepreferred by, some molecular biologists work-ing with proteins. We have shown, however,
that these patterns are, in fact, imposed bysystematic biases in alignment algorithms,even in cases where they are incorrect and,indeed, phylogenetically unreasonable. Wecontend that algorithms that impose gap pat-terns like those found in structural align-ments of proteins are inappropriate for theincreasingly widespread analysis of genomicDNA and are likely to cause error when theresulting alignments are used for evolutionaryinferences.
We believe that alignment methods spe-cifically designed for evolutionary analyses
will give a very different picture of the mech-anisms of sequence evolution and showsequence turnover through short insertionsand deletions as a more frequent and im-portant phenomenon. This raises interestingquestions of the true evolution of variablesequences such as promoter regions, non-coding DNA, and exposed coil regions inproteins: Do they predominantly evolvethrough point substitutions, or are those dis-similar regions just incorrectly aligned non-homologous sequences? To resolve that, weneed more sequence data and alignmentmethods that can really benefit from theadditional information. The resulting align-ments may be fragmented by many gapsand may not be as visually beautiful as thetraditional alignments, but if they representcorrect homology, we have to get used tothem.
References and Notes1. R. A. Gibbs et al., Nature 428, 493 (2004).2. Rhesus Macaque Genome Sequencing and Analysis
Consortium, Science 316, 222 (2007).3. The ENCODE Project Consortium, Nature 447, 799
(2007).4. A. Stark et al., Nature 450, 219 (2007).5. Materials and methods are available as supporting
material on Science Online.6. A. Rambaut, D. Posada, K. Crandall, E. Holmes, Nat. Rev.
Genet. 5, 52 (2004).7. N. Sullivan, M. Thali, C. Furman, D. Ho, J. Sodroski, J. Virol.
67, 3674 (1993).8. R. Wyatt et al., J. Virol. 69, 5723 (1995).9. M. Jansson et al., AIDS Res. Hum. Retroviruses 17, 1405
(2001).10. S. D. Frost et al., Proc. Natl. Acad. Sci. U.S.A. 102,
18514 (2005).11. M. Sagar, X. Wu, S. Lee, J. Overbaugh, J. Virol. 80, 9586
(2006).12. J. D. Thompson, D. G. Higgins, T. J. Gibson, Nucleic Acids
Res. 22, 4673 (1994).13. C. Notredame, D. G. Higgins, J. Heringa, J. Mol. Biol.
302, 205 (2000).14. R. C. Edgar, BMC Bioinformat. 5, 113 (2004).15. K. Katoh, K. Kuma, H. Toh, T. Miyata, Nucleic Acids Res.
33, 511 (2005).16. A. Löytynoja, N. Goldman, Proc. Natl. Acad. Sci. U.S.A.
102, 10557 (2005).17. D. D. Pollock, D. J. Zwickl, J. A. McGuire, D. M. Hillis,
Syst. Biol. 51, 664 (2002).18. M. S. Rosenberg, S. Kumar, Syst. Biol. 52, 119
(2003).19. M. S. Rosenberg, BMC Bioinformat. 6, 278
(2005).20. K. M. Wong, M. A. Suchard, J. P. Huelsenbeck, Science
319, 473 (2008).21. This work was funded in part by a Wellcome
Trust Programme Grant (GR078968). We thankN. Luscombe for many suggestions that improvedthe manuscript.
Supporting Online Materialwww.sciencemag.org/cgi/content/full/320/5883/1632/DC1Materials and MethodsSOM TextFig. S1Table S1References
28 March 2008; accepted 15 May 200810.1126/science.1158395
Fig. 3. Alignment accuracy errors are reduced by a phylogeny-aware algorithm. (A to F) Errorsfrom traditional alignment methods grow with increasing evolutionary distances and moredifficult alignments (close-intermediate-distant, white to blue gradient) and also with a densersequence sampling and increasingly similar sequences (close-2X-4X, white to red gradient).The phylogeny-aware method PRANK+F is less biased by greater distances and, in contrast toother methods, improves in accuracy with additional sequence data. Alignment statistics forthe five multiple sequence alignment methods are number of (A) insertions and (B) deletions,(C) insertion/deletion ratio, (D) gap overlap, (E) total length of the alignment, and (F)proportion of columns correctly recovered. (G) Inferred branch lengths at different depths inthe tree (t1 – t4) for the intermediate sets indicate that alignment errors lead to overestimatedbranch lengths, with PRANK+F giving the most accurate estimates across the whole range ofdepths. All measures in (A) to (G) are shown relative to those inferred from the true align-ment, and values closer to 1 are more correct. Vertical bars show means and 95% confidenceintervals.
www.sciencemag.org SCIENCE VOL 320 20 JUNE 2008 1635
REPORTS
FromLoytynojaandGoldman,Science2008
Observa<ons
• Mostalignmentmethods“over-align”(producecompressedalignments)
• Prankavoidsthisthroughits“phylogeny-aware”strategy
• Compressionresultsin– Over-es<ma<onsofbranchlengths– Under-es<ma<onofinser<ons
• Clustalisleastaccurate,othermethodsinbetween
Researchques<ons
• Clustal-Omega’smainadvantageseemstobeitsfastguidetree.Buttheguidetreehasalargeimpactonaccuracy–canweimprovetheaccuracyofClustal-Omegabygivingitabederguidetree?
• Itera<onusingprofileHMMshelpsaccuracy;canweusethistechniqueinothermethods?
• EPAsareuseful.Whatarethebestwaystouseexternalstructurally-definedalignments?
ResearchProjects• DesignyourownMSAmethod,orjustmodifyanexis<ngoneinsomesimpleway(e.g.,differentguidetree)
• Testexis<ngMSAmethodswithrespecttodifferentcriteria(e.g.,extendPrankstudytomoremethodsanddatasets)
• DevelopdifferentMSAcriteriathataremoreappropriatethanTC,SPFN,SPFP
• ComparedifferentMSAmethodsonsomebiologicaldataset
• ParallelizesomeMSAmethod• ConsiderhowtocombineMSAsonthesameinput