Bioinformatics t6-phylogenetics v2014

FBW

4-11-2014

Wim Van Criekinge

Class WILL be held on 4 November; NO class on 18 November

Phylogenetics

Introduction

Definitions

Species concept

Examples

The Tree-of-life

Phylogenetics Methodologies

Algorithms

Distance Methods

Maximum Likelihood

Maximum Parsimony

Rooting

Statistical Validation

Conclusions

Orthologous genes

Horizontal Gene Transfer

Phylogenomics

Practical Approach: PHYLIP

Weblems

Phylogeny (phylo = tribe + genesis)

Phylogenetic trees are about visualising evolutionary relationships. They reconstruct the pattern of events that have led to the distribution and diversity of life.

The purpose of a phylogenetic tree is to illustrate how a group of objects (usually genes or organisms) are related to one another

Nothing in Biology Makes Sense Except in the Light of Evolution. Theodosius Dobzhansky (1900-1975)

What is phylogenetics?

Trees

• Diagram consisting of branches and nodes

• Species tree (how are my species related?)

– contains only one representative from each species

– all nodes indicate speciation events

• Gene tree (how are my genes related?)

– normally contains a number of genes from a single species

– nodes relate either to speciation or gene duplication events

Clade: A set of species which includes all of the species derived from a single common ancestor

Species Concepts from Various Authors

D.A. Baum and K.L. Shaw - Exclusive groups of organisms, where an exclusive group is one whose members are all more closely related to each other than to any organisms outside the group.

J. Cracraft - An irreducible cluster of organisms, diagnosably distinct from other such clusters, and within which there is a parental pattern of ancestry and descent.

Charles Darwin - "From these remarks it will be seen that I look at the term species, as one arbitrarily given for the sake of convenience to a set of individuals closely resembling each other, and that it does not essentially differ from the term variety, which is given to less distinct and more fluctuating forms. The term variety, again, in comparison with mere individual differences, is also applied arbitrarily, and for mere convenience sake" (Origin of Species, 1st ed., p. 108).

T. Dobzhansky - The largest and most inclusive reproductive community of sexual and cross-fertilizing individuals which share a common gene pool. And later... Systems of populations, the gene exchange between which is limited or prevented by reproductive isolating mechanisms.

M. Ghiselin - The most extensive units in the natural economy, such that reproductive competition occurs among their parts.

D.M. Lambert - Groups of individuals that define themselves by a specific mate recognition system.

J. Mallet - Identifiable genotypic clusters recognized by a deficit of intermediates, both at single loci and at multiple loci.

E. Mayr - Groups of actually or potentially interbreeding natural populations which are reproductively isolated from other such groups.

C.D. Michener - A group of organisms not itself divisible by phenetic gaps resulting from concordant differences in character states (except for morphs - such as sex, age, or caste), but separated by such phenetic gaps from other such units.

H.E.H. Paterson - That most inclusive population of individual biparental organisms which share a common fertilization system.

G.G. Simpson - A lineage of populations evolving with time, separately from others, with its own unique evolutionary role and tendencies.

P.H.A. Sneath and R.R. Sokal - The smallest (most homogeneous) cluster that can be recognized upon some given criterion as being distinct from other clusters.

A.R. Templeton - The most inclusive population of individuals having the potential for phenotypic cohesion through intrinsic cohesion mechanisms (genetic and/or demographic - i.e. ecological - exchangeability).

E.O. Wiley - A single lineage of ancestor-descendant populations which maintains its identity from other such lineages and which has its own evolutionary tendencies and historical fate.

S. Wright - A species in time and space is composed of numerous local populations, each one intercommunicating and intergrading with others.

Species

I. Definitions:

Species = the basic unit of classification

> Three different ways to recognize species:

1) Morphological species = the smallest group that is consistently and persistently distinct (clusters in morphospace)

Species are recognized initially on the basis of appearance; the individuals of one species look different from the individuals of another

2) Biological species = a set of interbreeding or potentially interbreeding individuals that are separated from other species by reproductive barriers

Different species are unable to interbreed

3) Phylogenetic species = the boundary between reticulate (among interbreeding individuals) and divergent relationships (between lineages with no gene exchange)

[Figure: reticulate and divergent relationships, with the species boundary between them]

Phylogenetic species are recognized by the pattern of ancestor-descendant relationships

4) Phylogenomics species = ability to transmit (and maintain) a (stable) gene pool

Addresses the Anopheles genome topology variations

• In the tree to the left, A and B share the most recent common ancestry. Thus, of the species in the tree, A and B are the most closely related.

• The next most recent common ancestry is C with the group composed of A and B. Notice that the relationship of C is with the group containing A and B. In particular, C is not more closely related to B than to A. This can be emphasized by the following two trees, which are equivalent to each other:

Branching Order in a Phylogenetic Tree

• A common simplifying assumption is that the tree is bifurcating, meaning that each branch node has exactly two descendants.

• The edges, taken together, are sometimes said to define the topology of the tree

More definitions …

Branch node = internal node

Edge = branch

Leaves = tips = external nodes

Outgroups, rooted versus unrooted

An unrooted reptilian phylogeny with an avian outgroup and the corresponding rooted phylogeny. The Ri represent modern reptiles; the Ai, inferred ancestors; and B, a bird.

Some definitions …

Phylogenetic methods may be used to solve crimes, test purity of products, and determine whether endangered species have been smuggled or mislabeled:

– Vogel, G. 1998. HIV strain analysis debuts in murder trial. Science 282(5390): 851-853.

– Lau, D. T.-W., et al. 2001. Authentication of medicinal Dendrobium species by the internal transcribed spacer of ribosomal DNA. Planta Med 67:456-460.

Examples

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=9841423&dopt=Abstract


– Epidemiologists use phylogenetic methods to understand the development of pandemics, patterns of disease transmission, and development of antimicrobial resistance or pathogenicity:

• Basler, C.F., et al. 2001. Sequence of the 1918 pandemic influenza virus nonstructural gene (NS) segment and characterization of recombinant viruses bearing the 1918 NS genes. PNAS, 98(5):2746-2751.

• Ou, C.-Y., et al. 1992. Molecular epidemiology of HIV transmission in a dental practice. Science 256(5060):1165-1171.

• Bacillus anthracis:

Examples



http://www.sciencemag.org/content/vol296/issue5575/images/large/se2220570002.jpeg


• Conservation biologists may use these techniques to determine which populations are in greatest need of protection, and other questions of population structure:

– Trepanier, T.L., and R.W. Murphy. 2001. The Coachella Valley fringe-toed lizard (Uma inornata): genetic diversity and phylogenetic relationships of an endangered species. Mol Phylogenet Evol 18(3):327-334.

– Alves, M.J., et al. 2001. Mitochondrial DNA variation in the highly endangered cyprinid fish Anaecypris hispanica: importance for conservation. Heredity 87(Pt 4):463-473.

• Pharmaceutical researchers may use phylogenetic methods to determine which species are most closely related to other medicinal species, thus perhaps sharing their medicinal qualities:

– Komatsu, K., et al. 2001. Phylogenetic analysis based on 18S rRNA gene and matK gene sequences of Panax vietnamensis and five related species. Planta Med 67:461-465.

Examples




Tree-of-life

Origin of the Universe                   15 billion yrs
Formation of the Solar System            4.6 billion yrs
First Self-replicating System            3.5 billion yrs
Prokaryotic-Eukaryotic Divergence        2.0 billion yrs
Plant-Animal Divergence                  1.0 billion yrs
Invertebrate-Vertebrate Divergence       0.5 billion yrs
Mammalian Radiation Beginning            0.1 billion yrs

Some Important Dates in History

Tree Of Life


http://tolweb.org/tree?group=life


What Sequence to Use?

• To infer relationships that span the diversity of known life, it is necessary to look at genes conserved through the billions of years of evolutionary divergence.

• The gene must display an appropriate level of sequence conservation for the divergences of interest.

• If there is too much change, then the sequences become randomized, and there is a limit to the depth of the divergences that can be accurately inferred.

• If there is too little change (if the gene is too conserved), then there may be little or no change between the evolutionary branchings of interest, and it will not be possible to infer close (genus or species level) relationships.


Carl Woese recognized the full potential of rRNA sequences as a measure of phylogenetic relatedness. He initially used an RNA sequencing method that determined about 1/4 of the nucleotides in the 16S rRNA (the best technology available at the time). This amount of data greatly exceeded anything else then available. Using newer methods, it is now routine to determine the sequence of the entire 16S rRNA molecule. Today, the accumulated 16S rRNA sequences (about 10,000) constitute the largest body of data available for inferring relationships among organisms.

Ribosomal RNA Genes and Their Sequences

Examples of genes in this category are those that encode the ribosomal RNAs (rRNAs). Most prokaryotes have three rRNAs, called the 5S, 16S and 23S rRNA.


Name (a)   Size (nucleotides)   Location
5S         120                  Large subunit of ribosome
16S        1500                 Small subunit of ribosome
23S        2900                 Large subunit of ribosome

(a) The name is based on the rate at which the molecule sediments (sinks) in water. Bigger molecules sediment faster than small ones.

The extraordinary conservation of rRNA genes can be seen in these fragments of the small subunit rRNA gene sequences from organisms spanning the known diversity of life:

human ...GTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGCTGCAGTTAAAAAG...

yeast ...GTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAG...

Corn ...GTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAG...

Escherichia coli ...GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCG...

Anacystis nidulans ...GTGCCAGCAGCCGCGGTAATACGGGAGAGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCG...

Thermotoga maritima ...GTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTACCCGGATTTACTGGGCGTAAAGGG...

Methanococcus vannielii ...GTGCCAGCAGCCGCGGTAATACCGACGGCCCGAGTGGTAGCCACTCTTATTGGGCCTAAAGCG...

Thermococcus celer ...GTGGCAGCCGCCGCGGTAATACCGGCGGCCCGAGTGGTGGCCGCTATTATTGGGCCTAAAGCG...

Sulfolobus solfataricus ...GTGTCAGCCGCCGCGGTAATACCAGCTCCGCGAGTGGTCGGGGTGATTACTGGGCCTAAAGCG...


Other genes …

• Rate of evolution = rate of mutation

• The rate of evolution for any macromolecule is approximately constant over time (Neutral Theory of evolution)

• For a given protein the rate of sequence evolution is approximately constant across lineages. Zuckerkandl and Pauling (1965)

• This would allow speciation and duplication events to be dated accurately based on molecular data

Molecular Clock (MC)

Novel trees using Hox genes

• (a) A traditional phylogenetic tree and

• (b) the new phylogenetic tree, each showing the positions of selected phyla. B, bilateria; AC, acoelomates; PC, pseudocoelomates; C, coelomates; P, protostomes; L, lophotrochozoa; E, ecdysozoa; D, deuterostomes.

• Local and approximate molecular clocks are more reasonable

– one amino acid substitution per 14.5 My

– 1.3 × 10^-9 substitutions/nucleotide site/year

– Relative rate test (see further)

• For a tree ((A,B),C), measure the distances (A,C) and (B,C); under a clock the two should be equal
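As a rough illustration of how a clock rate turns a distance into a date, and of what the relative rate test compares, consider the sketch below. The rate is the figure quoted on the slide; the function names and the example distances are invented for illustration, and a strict clock is assumed.

```python
# Illustrative sketch only: function names and example values are assumptions.

RATE = 1.3e-9  # substitutions per nucleotide site per year (figure quoted above)


def divergence_time(distance, rate=RATE):
    """Years since two lineages split, assuming a strict molecular clock.

    The pairwise distance accumulates along both lineages, hence the factor 2.
    """
    return distance / (2.0 * rate)


def relative_rate_difference(d_ac, d_bc):
    """Relative rate test for ((A,B),C).

    Under a clock, A and B are equally far from the outgroup C, so the
    difference d(A,C) - d(B,C) should be close to zero.
    """
    return d_ac - d_bc


# A distance of 0.026 substitutions/site corresponds to about 10 million years.
years = divergence_time(0.026)
```

A clearly non-zero difference from `relative_rate_difference` suggests that one of the two lineages evolved faster, in which case dates from a single global rate should be treated with caution.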

Molecular Clock (MC)

Protein                           Rate of Change       Theoretical Lookback Time
                                  (PAMs / 100 myrs)    (myrs)
Pseudogenes                       400                  45
Fibrinopeptides                   90                   200
Lactalbumins                      27                   670
Lysozymes                         24                   850
Ribonucleases                     21                   850
Haemoglobins                      12                   1500
Acid proteases                    8                    2300
Cytochrome c                      4                    5000
Glyceraldehyde-P dehydrogenase    2                    9000
Glutamate dehydrogenase           1                    18000

PAM = number of Accepted Point Mutations per 100 amino acids.

Proteins evolve at highly different rates


Multiple Alignment Method

• align

• select method (evolutionary model)

– Distance

– ML

– MP

• generate tree

• validate tree

4-steps

Some definitions …

• Convert sequence data into a set of discrete pairwise distance values (n*(n-1)/2 for n sequences), arranged into a matrix. Distance methods fit a tree to this matrix.

• The tree topology is constructed by using a cluster analysis method (like UPGMA or NJ).

Distance matrix methods (upgma, nj, Fitch,...)
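The first step, turning an alignment into n*(n-1)/2 pairwise distances, can be sketched as follows. The distance used here is the simple proportion of differing sites (the p-distance), which is an assumption made for the example; the function names and the toy alignment are likewise invented.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ (columns with a gap are skipped)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return {(name_i, name_j): distance} for all n*(n-1)/2 pairs."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Toy alignment, invented for the example.
toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA", "D": "TCGAACTA"}
matrix = distance_matrix(toy)
```

With four sequences the matrix holds six distances; A and B differ at one site out of eight, so their distance is 0.125.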


[Figure: a site that starts as A and may change to C, G or T]

Since we start with A, p(A) = 1

D = evolutionary distance ~ time

F = dissimilarity ~ (1 – PX(t))

F ~ 1 – d
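The slide does not name a substitution model. One standard choice, assumed here purely for illustration, is the Jukes-Cantor model, under which the observed dissimilarity F saturates at 3/4 and the evolutionary distance is recovered as d = -(3/4) ln(1 - 4F/3). The function name is invented.

```python
import math


def jukes_cantor_distance(dissimilarity):
    """Correct an observed proportion of differing sites for multiple hits.

    Assumes the Jukes-Cantor model (equal base frequencies, equal rates
    between all pairs of bases); this model is an assumption of the sketch.
    """
    if not 0.0 <= dissimilarity < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for F >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * dissimilarity / 3.0)


d = jukes_cantor_distance(0.3)
```

The corrected distance is always at least as large as the observed dissimilarity, and the gap widens as sequences diverge, which is why uncorrected distances underestimate deep divergences.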


http://nekrut.uchicago.edu/pics/fig3_3.gif


Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
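A minimal sketch of UPGMA clustering is given below, assuming the distances are supplied as a dictionary keyed by pairs of names. The function name, the data layout and the toy distances are invented for the example. Note that UPGMA implicitly assumes a molecular clock, since it places every tip at the same distance from the root.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels (strings).
    dist:  dict mapping a pair (a, b) to the distance between a and b.
    Returns a nested tuple (left, right, height) describing the rooted tree.
    """
    size = {n: 1 for n in names}   # cluster label -> number of taxa in it
    tree = {n: n for n in names}   # cluster label -> subtree built so far
    d = {frozenset(k): v for k, v in dist.items()}
    while len(size) > 1:
        pair = min(d, key=d.get)   # closest pair of current clusters
        a, b = sorted(pair)
        new = "(" + a + "," + b + ")"
        merged = (tree[a], tree[b], d[pair] / 2.0)
        for other in size:
            if other in (a, b):
                continue
            # Size-weighted average = arithmetic mean over all member pairs.
            d[frozenset((new, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / (size[a] + size[b])
        # Forget every distance that involves one of the merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[new] = size[a] + size[b]
        tree[new] = merged
        for old in (a, b):
            del size[old]
            del tree[old]
    return next(iter(tree.values()))


# Toy distances, invented for the example.
toy = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
result = upgma(["A", "B", "C", "D"], toy)
```

For these toy distances A and B are joined first at height 1, C joins them at height 3, and D joins last at height 5.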

Distance matrix methods: Summary

http://www.bioportal.bic.nus.edu.sg/phylip/neighbor.html

• The phylogeny makes an estimation of the distance for each pair as the sum of branch lengths in the path from one sequence to another through the tree.

• advantages :

easy to perform ;

quick calculation ;

fit for sequences having high similarity scores ;

• drawbacks :

the sequences are not considered as such (loss of information) ;

all sites are generally treated equally (differences in substitution rates are not taken into account) ;

not applicable to distantly divergent sequences.


• In this method, the bases (nucleotides or amino acids) of all sequences at each site are considered separately (as independent), and the log-likelihood of having these bases is computed for a given topology by using a particular probability model.

• This log-likelihood is added for all sites, and the sum of the log-likelihoods is maximized to estimate the branch lengths of the tree.

Maximum likelihood
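The "add the per-site log-likelihoods, then maximize over branch length" step can be illustrated in the simplest possible case: two sequences joined by a single branch. The sketch assumes the Jukes-Cantor model and a plain grid search; both are choices made for the example, as are the function names and the toy sequences. A real ML program repeats this over whole trees and many branches.

```python
import math


def jc_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences separated by branch length d.

    Assumes the Jukes-Cantor model. Sites are treated as independent, so the
    per-site log-likelihoods are simply added.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # same base at both ends of the branch
    p_diff = 0.25 * (0.25 - 0.25 * e)  # one specific pair of different bases
    return sum(
        math.log(p_same if a == b else p_diff) for a, b in zip(seq_a, seq_b)
    )


def ml_branch_length(seq_a, seq_b, step=0.001, upper=3.0):
    """Grid search for the branch length with the highest log-likelihood."""
    grid = [step * i for i in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(seq_a, seq_b, d))


# Toy sequences, invented for the example: 2 differences in 10 sites.
best = ml_branch_length("ACGTACGTAC", "ACGTACGTTG")
```

For two sequences the maximum lies at the Jukes-Cantor distance for the observed proportion of differences (about 0.23 here), which gives a quick check that the likelihood function is coded correctly.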


• This procedure is repeated for all possible topologies, and the topology that shows the highest likelihood is chosen as the final tree.

• Notes :

ML estimates the branch lengths of the final tree ;

ML methods are usually consistent ;

ML can be extended to allow different rates for transitions and transversions.

• Drawbacks

needs a long computation time to construct a tree.

Maximum likelihood


Parsimony criterion

• It consists of determining the minimum number of changes (substitutions) required to transform a sequence to its nearest neighbor.

Maximum Parsimony

• The maximum parsimony algorithm searches for the minimum number of genetic events (nucleotide substitutions or amino-acid changes) to infer the most parsimonious tree from a set of sequences.

Maximum Parsimony


Occam’s Razor

Entia non sunt multiplicanda praeter necessitatem. William of Occam (1300-1349)

The best tree is the one which requires the fewest substitutions

• The best tree is the one which needs the fewest changes.

– If the evolutionary clock is not constant, the procedure generates results which can be misleading ;

– within practical computational limits, this often leads to the generation of tens or more "equally most parsimonious trees" which make it difficult to justify the choice of a particular tree ;

– long computation time to construct a tree.

Maximum Parsimony

Maximum Parsimony: Branch Node A or B ?

Maximum Parsimony: A requires 5 mutations

Maximum Parsimony: B (and propagating A->B) requires only 4 mutations
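Counting the minimum number of changes a given tree requires can be sketched with Fitch's algorithm. The slides do not name a counting method, so this is an assumed, standard choice; the function names, the nested-tuple tree encoding and the toy alignment are invented for the example.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one site on a rooted binary tree.

    tree:   a leaf name, or a tuple (left_subtree, right_subtree).
    states: dict mapping leaf name -> observed base at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Toy alignment and two candidate topologies, invented for the example.
toy = {"A": "AG", "B": "AG", "C": "CT", "D": "CT"}
score_1 = parsimony_score((("A", "B"), ("C", "D")), toy)  # groups A with B
score_2 = parsimony_score((("A", "C"), ("B", "D")), toy)  # groups A with C
```

The first topology needs 2 changes and the second 4, so parsimony prefers the first. A full search would score every candidate topology in this way.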



There are at present no statistical methods which allow comparisons of trees obtained from different phylogenetic methods; nevertheless, many studies have been made to compare the relative consistency of the existing methods.

Comparative evaluation of different methods

The consistency depends on many factors, among these the topology and branch lengths of the real tree, the transition/transversion rate and the variability of the substitution rates.

One expects that if sequences have a strong phylogenetic relationship, different methods will show the same phylogenetic tree.

Comparative evaluation of different methods

Comparison of methods

• Inconsistency

• Neighbour Joining (NJ) is very fast but depends on accurate estimates of distance. This is more difficult with very divergent data

• Parsimony suffers from Long Branch Attraction. This may be a particular problem for very divergent data

• NJ can suffer from Long Branch Attraction

• Parsimony is also computationally intensive

• Codon usage bias can be a problem for MP and NJ

• Maximum Likelihood is the most reliable but depends on the choice of model and is very slow

• Methods may be combined

Rooting the Tree

• In an unrooted tree the direction of evolution is unknown

• The root is the hypothesized ancestor of the sequences in the tree

• The root can either be placed on a branch or at a node

• You should start by viewing an unrooted tree

Automatic rooting

• Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot)

• Sometimes two trees may look very different but, in fact, differ only in the position of the root

• This normally involves assumptions… BEWARE!

Rooting Using an Outgroup

1. The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other

2. It should ideally be as closely related as possible to the rest of the sequences while still satisfying condition 1

The root must be somewhere between the outgroup and the rest (either on the node or in a branch)

How confident am I that my tree is correct?

Bootstrap values

Bootstrapping is a statistical technique that can use random resampling of data to determine sampling error for tree topologies

Bootstrapping phylogenies

• Characters are resampled with replacement to create many bootstrap replicate data sets

• Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML etc.)

• Agreement among the resulting trees is summarized with a majority-rule consensus tree

• Frequencies of occurrence of groups, bootstrap proportions (BPs), are a measure of support for those groups
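The last two steps, counting how often each group occurs among the replicate trees, can be sketched as below. Trees are encoded as nested tuples of leaf names; that encoding, the function names and the four replicate trees are invented for the example, and the replicate trees would in practice come from analysing each resampled data set.

```python
from collections import Counter


def clades(tree):
    """Return (leaf_set, set_of_clades) for a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the group of all leaves below this node
    return leaves, found


def bootstrap_proportions(replicate_trees):
    """Percentage of replicate trees in which each group appears."""
    counts = Counter()
    for tree in replicate_trees:
        counts.update(clades(tree)[1])
    n = len(replicate_trees)
    return {clade: 100.0 * c / n for clade, c in counts.items()}


# Four replicate trees, invented for the example.
replicates = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]
support = bootstrap_proportions(replicates)
```

Here the group {A, B} occurs in three of four replicates (75%) and would appear in a majority-rule consensus, while {C, D} occurs in exactly half and would not. The group containing all taxa trivially scores 100%.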

Bootstrapping - an example

[Figure: Ciliate SSU rDNA, parsimony bootstrap, majority-rule consensus tree. Taxa: Ochromonas (1), Symbiodinium (2), Prorocentrum (3), Euplotes (8), Tetrahymena (9), Loxodes (4), Tracheloraphis (5), Spirostomum (6), Gruberia (7). Bootstrap proportions shown on the branches: 100, 96, 84, 100, 100, 100.]

• Bootstrapping is a very valuable and widely used technique (it is demanded by some journals)

• BPs give an idea of how likely a given branch would be to be unaffected if additional data, with the same distribution, became available

• BPs are not the same as confidence intervals. There is no simple mapping between bootstrap values and confidence intervals. There is no agreement about what constitutes a ‘good’ bootstrap value (> 70%, > 80%, > 85% ????)

• Some theoretical work indicates that BPs can be a conservative estimate of confidence intervals

• If the estimated tree is inconsistent all the bootstraps in the world won’t help you…..

Bootstrap - interpretation

Jack-knifing

• Jack-knifing is very similar to bootstrapping and differs only in the character resampling strategy

• Jack-knifing is not as widely available or widely used as bootstrapping

• Tends to produce broadly similar results

At present only sampling techniques allow testing the topology of a phylogenetic tree

Bootstrapping

» It consists of drawing columns from a sample of aligned sequences, with replacement, until one gets a data set of the same size as the original one (usually some columns are sampled several times and others are left out).

Half-jackknife

» This technique resamples half of the sequence sites considered and eliminates the rest. The final sample has half the initial number of sites, without duplication.
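The two column-resampling schemes described above can be sketched as follows. The function names and the toy alignment are invented for the example; a random generator is passed in explicitly so that runs can be reproduced.

```python
import random


def bootstrap_columns(alignment, rng):
    """Draw columns with replacement until the original length is reached."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in picks) for name, seq in alignment.items()
    }


def half_jackknife_columns(alignment, rng):
    """Keep a random half of the columns, each at most once."""
    length = len(next(iter(alignment.values())))
    picks = sorted(rng.sample(range(length), length // 2))
    return {
        name: "".join(seq[i] for i in picks) for name, seq in alignment.items()
    }


# Toy alignment, invented for the example.
toy = {"A": "ACGTACGT", "B": "ACGTTCGA"}
rng = random.Random(0)
boot = bootstrap_columns(toy, rng)
jack = half_jackknife_columns(toy, rng)
```

Both functions pick whole columns, so the same sites are taken from every sequence and the alignment stays intact. Each resampled data set would then be analysed with the chosen tree-building method.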

Statistical evaluation of the obtained phylogenetic trees

Weblems

W6.1: The growth hormones in most mammals have very similar amino acid sequences. (The growth hormones of the Alpaca, Dog, Cat, Horse, Rabbit, and Elephant each differ from that of the Pig at no more than 3 positions out of 191.) Human growth hormone is very different, differing at 62 positions. The evolution of growth hormone accelerated sharply in the line leading to humans. By retrieving and aligning growth hormone sequences from species closely related to humans and our ancestors, determine where in the evolutionary tree leading to humans the accelerated evolution of growth hormone took place.

W6.2: Humans are primates, an order that we, apes and monkeys share with lemurs and tarsiers. On the basis of the Beta-globin gene cluster of human, a chimpanzee, an old-world monkey, a new-world monkey, a lemur, and a tarsier, derive a phylogenetic tree of these groups.

W6.3: Primates are mammals, a class we share with marsupials and monotremes. Extant marsupials live primarily in Australia, except for the opossum, found also in North and South America. Extant monotremes are limited to two animals from Australia: the platypus and echidna. Using the complete mitochondrial genome from human, horse (Equus caballus), wallaroo (Macropus robustus), American opossum (Didelphis virginiana), and platypus (Ornithorhynchus anatinus), draw an evolutionary tree, indicating branch lengths. Are monotremes more closely related to placental mammals or to marsupials?

W6.4: Mammals are vertebrates, a subphylum that we share with fishes, sharks, birds and reptiles, amphibia, and primitive jawless fishes (example: lampreys). For the coelacanth (Latimeria chalumnae), the great white shark (Carcharodon carcharias), skipjack tuna (Katsuwonus pelamis), sea lamprey (Petromyzon marinus), frog (Rana pipiens), and Nile crocodile (Crocodylus niloticus), using sequences of cytochromes c and pancreatic ribonucleases, derive evolutionary trees of these species.
