Introduction to characters and parsimony analysis · Unique and unreversed characters • Given a...
Transcript of Introduction to characters and parsimony analysis · Unique and unreversed characters • Given a...
Introduction to charactersand parsimony analysis
Genetic Relationships
• Genetic relationships exist between individuals withinpopulations
• These include ancestor-descendent relationships and moreindirect relationships based on common ancestry
• Within sexually reducing populations there is a network ofrelationships
• Genetic relations within populations can be measured witha coefficient of genetic relatedness
Phylogenetic Relationships• Phylogenetic relationships exist between lineages (e.g.
species, genes)
• These include ancestor-descendent relationships and moreindirect relationships based on common ancestry
• Phylogenetic relationships between species or lineages are(expected to be) tree-like
• Phylogenetic relationships are not measured with a simplecoefficient
Phylogenetic Relationships• Traditionally phylogeny reconstruction was dominated by
the search for ancestors, and ancestor-descendantrelationships
• In modern phylogenetics there is an emphasis on indirectrelationships
• Given that all lineages are related, closeness ofphylogenetic relationships is a relative concept.
Phylogenetic relationships• Two lineages are more closely related to each other than to
some other lineage if they share a more recent commonancestor - this is the cladistic concept of relationships
• Phylogenetic hypotheses are hypotheses of commonancestry
Frog
Toad
Oak
(Frog,Toad)OakHypothetical ancestral lineage
Phylogenetic Trees
A B C D E F G H I J
ROOT
polytomy
terminal branches
interiorbranches
node 1 node 2
LEAVES
A CLADOGRAM
CLADOGRAMS ANDPHYLOGRAMS
ABSOLUTE TIME or DIVERGENCE
RELATIVE TIME
A B
C DE
FG
HI
J
A B C D E F GH I J
Trees - Rooted and Unrooted
ROOTA
B
C
D E
F
GH
I
J
A B C D E F GH I J
ROOT
A B C D E F G H I J
ROOT
Characters and CharacterStates
• Organisms comprise sets of features
• When organisms/taxa differ with respect toa feature (e.g. its presence or absence ordifferent nucleotide bases at specific sites ina sequence) the different conditions arecalled character states
• The collection of character states withrespect to a feature constitute a character
Character evolution• Heritable changes (in morphology, gene
sequences, etc.) produce different characterstates
• Similarities and differences in character statesprovide the basis for inferring phylogeny (i.e.provide evidence of relationships)
• The utility of this evidence depends on how oftenthe evolutionary changes that produce thedifferent character states occur independently
Unique and unreversed characters• Given a heritable evolutionary change that is unique
and unreversed (e.g. the origin of hair) in an ancestralspecies, the presence of the novel character state in anytaxa must be due to inheritance from the ancestor
• Similarly, absence in any taxa must be because thetaxa are not descendants of that ancestor
• The novelty is a homology acting as badge or markerfor the descendants of the ancestor
• The taxa with the novelty are a clade (e.g. Mammalia)
Unique and unreversed characters• Because hair evolved only once and is unreversed
(not subsequently lost) it is homologous and providesunambiguous evidence for of relationships
Lizard
Frog
Human
Dog
HAIR
absentpresent
change or step
• Homoplasy is similarity that is not homologous(not due to common ancestry)
• It is the result of independent evolution(convergence, parallelism, reversal)
• Homoplasy can provide misleading evidence ofphylogenetic relationships (if mistakenlyinterpreted as homology)
Homoplasy - Independent evolution
Homoplasy - independent evolution
HumanLizard
Frog Dog
TAIL (adult)
absentpresent
• Loss of tails evolved independently inhumans and frogs - there are two steps onthe true tree
Homoplasy - misleading evidence ofphylogeny
• If misinterpreted as homology, the absence of tailswould be evidence for a wrong tree: groupinghumans with frogs and lizards with dogs
Human
Frog
Lizard
Dog
TAIL
absentpresent
Homoplasy - reversal• Reversals are evolutionary changes back to an
ancestral condition
• As with any homoplasy, reversals can providemisleading evidence of relationships
True tree Wrong tree101 2 3 4 5 67 8 91 2 3 4 5 6 7 8 9 10
Homoplasy - a fundamentalproblem of phylogenetic inference
• If there were no homoplastic similaritiesinferring phylogeny would be easy - all thepieces of the jig-saw would fit together neatly
• Distinguishing the misleading evidence ofhomoplasy from the reliable evidence ofhomology is a fundamental problem ofphylogenetic inference
Homoplasy and Incongruence• If we assume that there is a single correct
phylogenetic tree then:
• When characters support conflicting phylogenetictrees we know that there must be some misleadingevidence of relationships among the incongruent orincompatible characters
• Incongruence between two characters implies that atleast one of the characters is homoplastic and that atleast one of the trees the character supports is wrong
Incongruence or Incompatibility
• These trees and characters are incongruent - both treescannot be correct, at least one is wrong and at least onecharacter must be homoplastic
Lizard
Frog
Human
Dog
HAIR
absentpresent
Human
Frog
Lizard
Dog
TAIL
absentpresent
Distinguishing homology andhomoplasy
• Morphologists use a variety of techniques todistinguish homoplasy and homology
• Homologous features are expected to display detailedsimilarity (in position, structure, development)whereas homoplastic similarities are more likely to besuperficial
• As recognised by Charles Darwin congruence withother characters provides the most compellingevidence for homology
The importance of congruence
• “The importance, for classification, of triflingcharacters, mainly depends on their beingcorrelated with several other characters ofmore or less importance. The value indeed ofan aggregate of characters is very evident........ a classification founded on any singlecharacter, however important that may be,has always failed.”
• Charles Darwin: Origin of Species, Ch. 13
Congruence
• We prefer the ‘true’ tree because it is supportedby multiple congruent characters
Lizard
Frog
Human
Dog
MAMMALIAHairSingle bone in lower jawLactationetc.
Homoplasy in molecular data
Incongruence and therefore homoplasy can becommon in molecular sequence data– There are a limited number of alternative character
states ( e.g. Only A, G, C and T in DNA)
– Rates of evolution are sometimes high
Character states are chemically identical– homology and homoplasy are equally similar
– cannot be distinguished by detailed study ofsimilarity and differences
Parsimony analysis
• Parsimony methods provide one way ofchoosing among alternative phylogenetichypotheses
• The parsimony criterion favours hypothesesthat maximise congruence and minimisehomoplasy
• It depends on the idea of the fit of a character toa tree
Character Fit• Initially, we can define the fit of a character to
a tree as the minimum number of stepsrequired to explain the observed distribution ofcharacter states among taxa
• This is determined by parsimonious characteroptimization
• Characters differ in their fit to different trees
Character FitF
rog
Coc
odile
Bird
Kan
gero
o
Bat
Hum
anHairabsentpresent
Fro
g
Kan
gero
o
Coc
odile
Hum
an
Bat
Bird
Tree A1 step
Tree B2 steps
Parsimony Analysis• Given a set of characters, such as aligned
sequences, parsimony analysis works bydetermining the fit (number of steps) of eachcharacter on a given tree
• The sum over all characters is called TreeLength
• Most parsimonious trees (MPTs) have theminimum tree length needed to explain theobserved distributions of all the characters
Parsimony in practice
Frog
Bird
Crocodile
Kangeroo
Bat
Human
amni
on
hair
win
gs
anto
rbita
l fe
nest
ra
plac
enta
lact
atio
n
Tree 1
Tree 2
T A
X A
FIT
-
-
-
-
--
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+ -
+
-
CHARACTERS
1 2 3 4 5 6
+
+
+
+
1
1
TREE LENGTH
1 1 1 1 2 7
2 2 2 2 1 10
Frog
Coc
odile
Kan
gero
o
Bat Bird
Hum
an
1
23
6
44
5
52
3
Tree 2
Coc
odile
Kan
gero
o
Frog
Bird Bat Hum
an
1
Tree 1
23
4
66
5
Of these two trees, Tree 1 has the shortest lengthand is the most parsimoniousBoth trees require some homoplasy (extra steps)
Results of parsimony analysis• One or more most parsimonious trees• Hypotheses of character evolution associated with
each tree (where and how changes have occurred)
• Branch lengths (amounts of change associated withbranches)
• Various tree and character statistics describing the fitbetween tree and data
• Suboptimal trees - optional
Character types
• Characters may differ in the costs(contribution to tree length) made by differentkinds of changes
• Wagner (ordered, additive) 0 1 2 (morphology, unequal costs)• Fitch (unordered, non-additive) A G (morphology, molecules)
T C (equal costs for all changes)
one step
two steps
Character types• Sankoff (generalised) A G (morphology, molecules)
T C (user specified costs)• For example, differential weighting of transitions and
transversions
• Costs are specified in a stepmatrix
• Costs are usually symmetric but can be asymmetricalso (e.g. costs more to gain than to loose a restrictionsite)
one step
five steps
Stepmatrices• Stepmatrices specify the costs of changes within a
character
A C G TA 0 5 1 5C 5 0 5 1G 1 5 0 5T 5 1 5 0
To
From
A G
CT
PURINES (Pu)
PYRIMIDINES (Py)
transitions Py Py Pu Pu
tran
sver
sion
s
Py
Pu
Different characters (e.g 1st, 2nd and 3rd)codon positions can also have differentweights
Weighted parsimony• If all kinds of steps of all characters have equal
weight then parsimony:– Minimises homoplasy (extra steps)
– Maximises the amount of similarity due tocommon ancestry
– Minimises tree length
• If steps are weighted unequally parsimonyminimises tree length - a weighted sum of thecost of each character
Why weight characters?• Many systematists consider weighting unacceptable, but weighting is
unavoidable (unweighted = equal weights)• Transitions may be more common than transversions• Different kinds of transitions and transversions may be more or less
common• Rates of change may vary with codon positions• The fit of different characters on trees may indicate differences in their
reliabilities
• However, equal weighting is the commonest procedure and is thesimplest (but probably not the best) approach
Ciliate SSUrDNA data
Num
ber
of C
hara
cter
s
0
5 0
100
150
200
250
1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9 2 0 2 10
Number of steps
Different kinds of changesdiffer in their frequencies
ToA C G T
From
A
C
G
T
Transitions
Transversions
Unambiguous changeson most parsimonious tree of Ciliate SSUrDNA
Parsimony - advantages
• is a simple method - easily understood operation• does not seem to depend on an explicit model of
evolution
• gives both trees and associated hypotheses ofcharacter evolution
• should give reliable results if the data is wellstructured and homoplasy is either rare or widely(randomly) distributed on the tree
Parsimony - disadvantages• May give misleading results if homoplasy is common or
concentrated in particular parts of the tree, e.g:- thermophilic convergence- base composition biases- long branch attraction
• Underestimates branch lengths• Model of evolution is implicit - behaviour of method not well
understood• Parsimony often justified on purely philosophical grounds - we
must prefer simplest hypotheses - particularly bymorphologists
• For most molecular systematists this is uncompelling
Parsimony can be inconsistent• Felsenstein (1978) developed a simple model phylogeny including four
taxa and a mixture of short and long branches• Under this model parsimony will give the wrong tree
A B
C D
Model tree
p pq
q q
Rates or Branch lengths
p >> q
A
B
C
D
Parsimony tree
Wrong
• With more data the certainty that parsimony will give the wrong treeincreases - so that parsimony is statistically inconsistent
• Advocates of parsimony initially responded by claiming thatFelsenstein’s result showed only that his model was unrealistic
• It is now recognised that the long-branch attraction (in the FelsensteinZone) is one of the most serious problems in phylogenetic inference
Long branches are attracted but the similarity is homoplastic
Finding optimal trees - exactsolutions
• Exact solutions can only be used for smallnumbers of taxa
• Exhaustive search examines all possibletrees
• Typically used for problems with lessthan 10 taxa
Finding optimal trees - exhaustive search
A
B C
1
2a
Starting tree, any 3 taxa
A
B D
C
A
BD C
A
B C
D2b 2c
E
E
EE
E
Add fourth taxon (D) in each of three possible positions -> three trees
Add fifth taxon (E) in each of the five possible positions on each of thethree trees -> 15 trees, and so on ....
Finding optimal trees - exactsolutions
• Branch and bound saves time by discarding familiesof trees during tree construction that cannot beshorter than the shortest tree found so far
• Can be enhanced by specifying an initial upperbound for tree length
• Typically used only for problems with less than 18taxa
Finding optimal trees - branch and bound
A
B C
B1
A
B D
C
A
B C
D
B3
A1
A
B E
D
CC1.1
A
B D
E
CC1.3
A
B D
C
EC1.2
A
B
CC1.4
E D
A
B C
C1.5
ED
A
BD C
B2
C2.1
C2.2
C2.3
C2.4
C2.5
C3.1
C3.2
C3.3
C3.4
C3.5
Finding optimal trees - heuristics
• The number of possible trees increases exponentially withthe number of taxa making exhaustive searchesimpractical for many data sets (an NP complete problem)
• Heuristic methods are used to search tree space for mostparsimonious trees by building or selecting an initial treeand swapping branches to search for better ones
• The trees found are not guaranteed to be the mostparsimonious - they are best guesses
Finding optimal trees - heuristics• Stepwise addition Asis - the order in the data matrix Closest -starts with shortest 3-taxon tree adds taxa in order
that produces the least increase in tree length (greedyheuristic)
Simple - the first taxon in the matrix is a taken as areference - taxa are added to it in the order of theirdecreasing similarity to the reference
Random - taxa are added in a random sequence, manydifferent sequences can be used
• Recommend random with as many (e.g. 10-100) additionsequences as practical
Finding most parsimonious trees -heuristics
• Branch Swapping:
Nearest neighbor interchange (NNI)
Subtree pruning and regrafting (SPR) Tree bisection and reconnection (TBR) Other methods ....
Finding optimal trees - heuristics
• Nearest neighbor interchange (NNI)
A
B
C DE
F
G
A
B
D C E
F
G
A
B
C D
E
F
G
Finding optimal trees - heuristics
• Subtree pruning and regrafting (SPR)
A
B
C DE
F
G
A
B
C D E
F
G
C
D
G
B
A
E F
Finding optimal trees - heuristics
• Tree bisection and reconnection (TBR)
A
B
C DE
F
G
A
B
C D
E
F
G
A
C
F
D
E
B G
Finding optimal trees - heuristics
• Branch Swapping Nearest neighbor interchange (NNI) Subtree pruning and regrafting (SPR) Tree bisection and reconnection (TBR)
• The nature of heuristic searches means we cannotknow which method will find the mostparsimonious trees or all such trees
• However, TBR is the most extensive swappingroutine and its use with multiple random additionsequences should work well
Tree space may be populated by local minimaand islands of optimal trees
GLOBAL MINIMUM
LocalMinimum
LocalMinima
TreeLength
RANDOM ADDITION SEQUENCE REPLICATES
SUCCESSFAILURE FAILURE
Branch SwappingBranch Swapping
Branch Swapping
Searching with topological constraints
• Topological constraints are user-definedphylogenetic hypotheses
• Can be used to find optimal trees that either:
1. include a specified clade or set ofrelationships
2. exclude a specified clade or set ofrelationships (reverse constraint)
Searching with topological constraints
A B C D E F G
ABCDEFG
((A,B,C,D)(E,F,G))
A B C D E F G
ABCDEFG
A B C E D F G
Compatible with constraint tree
CONSTRAINT TREE
Incompatible with reverse constraint tree
Compatible with reverse constraint treeIncompatible with constraint tree
Searching with topological constraintsbackbone constraints
• Backbone constraints specify relationshipsamong a subset of the taxa
A B D E
A B D E
A D B E
possible positions of taxon CCompatible with backbone constraintIncompatible with reverse constraint
Incompatible with backbone constraintCompatible with reverse constraint
BACKBONE CONSTRAINT((A,B)(D,E))
relationships of taxon C are not specified
Parsimonious Character Optimization
A B C D E
*
*0 => 1
==
OR parallelism 2 separate origins 0 => 1 (DELTRAN)
riginndeversalACCTRAN)
0 0 1 1 0
1 => 0
Homoplastic characters often have alternative equally parsimonious optimizationsCommonly used varieties are:ACCTRAN - accelerated transformationDELTRAN - delayed transformation
Consequently, branch lengths are not always fully determined
PAUP reports minimum and maximum branch lengths
Missing data• Missing data is ignored in tree building but can lead to alternative
equally parsimonious optimizations in the absence of homoplasy
A B C D E
**
singleorigin0 => 1on any one of 3branches
1 ? ? 0 0
*Abundant missing data can leadto multiple equally parsimonioustrees.
This can be a serious problemwith morphological data but isunlikely to arise with moleculardata unless analyses are ofincomplete data
Multiple optimal trees
• Many methods can yield multiple equallyoptimal trees
• We can further select among these trees withadditional criteria, but
• Typically, relationships common to all theoptimal trees are summarised with consensustrees
Consensus methods
• A consensus tree is a summary of the agreementamong a set of fundamental trees
• There are many consensus methods that differ in:
1. the kind of agreement
2. the level of agreement• Consensus methods can be used with multiple trees
from a single analysis or from multiple analyses
Strict consensus methods• Strict consensus methods require agreement across all the
fundamental trees
• They show only those relationships that are unambiguouslysupported by the parsimonious interpretation of the data
• The commonest method (strict component consensus)focuses on clades/components/full splits
• This method produces a consensus tree that includes all andonly those full splits found in all the fundamental trees
• Other relationships (those in which the fundamental treesdisagree) are shown as unresolved polytomies
• Implemented in PAUP
Strict consensus methods
A B C D E F G A B C E D F G
TWO FUNDAMENTAL TREES
A B C D E F G
STRICT COMPONENT CONSENSUS TREE
Majority-rule consensus methods• Majority-rule consensus methods require agreement across
a majority of the fundamental trees• May include relationships that are not supported by the
most parsimonious interpretation of the data• The commonest method focuses on clades/components/full
splits• This method produces a consensus tree that includes all and
only those full splits found in a majority (>50%) of thefundamental trees
• Other relationships are shown as unresolved polytomies• Of particular use in bootstrapping
• Implemented in PAUP
Majority rule consensus
A B C D E F G A B C E D F G
A B C E D F G
MAJORITY-RULE COMPONENT CONSENSUS TREE
A B C E F D G
100
66
66
66
66
THREE FUNDAMENTAL TREES
Numbers indicate frequency ofclades in the fundamental trees
Reduced consensus methods• Focuses upon any relationships (not just full splits)• Reduced consensus methods occur in strict and
majority-rule varieties
• Other relationships are shown as unresolvedpolytomies
• May be more sensitive than methods focusing onlyon clades/components/full splits
• Strict reduced consensus methods are implementedin RadCon
Types of Cladistic Relationships
A B C D E
(a)
FIVE LEAF TREEC D E
D E
A B C
(b)
COMPONENTS / CLADES
5-TAXON STATEMENTS
A B D
A C D B C D
D E A
D E B
A B C
A B E D E C
A C E B C E
(c)
ROOTED TRIPLETS
3-TAXON STATEMENTS
A B D E
A B D E
D E A B
FOUR LEAF SUBTREE
4-TAXON STATEMENTS
(d)
D E
Z
A B C
Y
A B
X
X
YZ
Reduced consensus methods
A B C D E F G
TWO FUNDAMENTAL TREES
STRICT REDUCED CONSENSUS TREE Taxon G is excluded
A G B C D E F
A B C D E F
A B C D E F G
Strict component consensuscompletely unresolved
Consensus methods
Spirostomumum
OchromonasSymbiodiniumProrocentrumLoxodesTetrahymena
TracheloraphisEuplotesGruberiaOchromonasSymbiodiniumProrocentrumLoxodesTetrahymenaSpirostomumumEuplotesTracheloraphisGruberia
OchromonasSymbiodiniumProrocentrumLoxodesTetrahymenaEuplotesSpirostomumumTracheloraphisGruberia
OchromonasSymbiodiniumProrocentrumLoxodesTetrahymenaTracheloraphisSpirostomumEuplotesGruberia
OchromonasSymbiodiniumProrocentrumLoxodesTetrahymenaSpirostomumEuplotesTracheloraphisGruberia
Ochromonas
SymbiodiniumProrocentrumLoxodesTetrahymenaSpirostomumTracheloraphisGruberia
Three fundamental trees
majority-rule
strict (component) strict reduced cladisticEuplotes excluded
100
100100
100
6666
Consensus methodsUse strict methods to identify those relationshipsunambiguously supported by parsimoniousinterpretation of the dataUse reduced methods where consensus trees arepoorly resolvedUse majority-rule methods in bootstrappingAvoid other methods which have ambiguousinterpretations