Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance...

71
Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter

Transcript of Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance...

Page 1: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1 Biology, Sequences, Phylogenetics

Part 4

Sepp Hochreiter

Page 2: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Klausur

Mo. 30.01.2011

Zeit: 15:30 – 17:00

Raum: HS14

Anmeldung Kusss

Page 3: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Contents

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular Phylogenies

5.1.3 Methods

5.2 Maximum Parsimony Methods

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted Parsimony and Bootstrapping

5.2.4 Inconsistency of Maximum Parsimony

5.3 Distance-based Methods

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum Evolution

5.3.4 Neighbor Joining

5.3.5 Distance Measures

5.4 Maximum Likelihood Methods

5.5 Examples

Page 4: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Motivation

central field in biology: relation between species in form of a tree

Root: beginning of life

Leaves: taxa (current species)

Branches: relationship “is ancestor of'' between nodes

Node: split of a species into two

phylogeny (phylo = tribe and genesis)

Cladistic trees: conserved characters

Phenetic trees: measure of distance between the leaves of the tree

(distance as a whole and not based on single features)

Phenetic problems: simultaneous development of features and

different evolution rates

Convergent evolution e.g. finding the best form in water

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 5: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Motivation

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 6: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Motivation

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 7: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Motivation

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 8: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Molecular Phylogenies

Not characteristics like wings, feathers, (morphological)

differences between organisms: RNA, DNA, and proteins

α-hemoglobin

Molecular phylogenetics

is more precise can also distinguish bacteria or viruses

connects all species by DNA

uses mathematical and statistical methods

is model-based as mutations can be modeled

detects remote homologies

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 9: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Molecular Phylogenies

Difficulties in constructing a phylogenetic tree:

different mutation rates

horizontal transfer of genetic material (Horizontal Gene

Transfer, Glycosyl Hydrolase from E.coli to B.subtilis)

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Species tree Gene tree

Page 10: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Molecular Phylogenies

Branches: time in number of mutations

molecular clock: same evolution / mutation rate

number of substitution: Poisson distribution

mutation rate: equally distributed over the sequence

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 11: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Methods

choose the sequences e.g. rRNA (RNA of ribosomes) and

mitochondrial genes (in most organism, enough mutations)

pairwise and multiple sequence alignments

method for constructing a phylogenetic tree

distance-based

maximum parsimony

maximum likelihood

• Maximum parsimony: strong sequence similarities

(few trees), few sequences

• Distance based (CLUSTALW): less similarity, many

sequences

• Maximum likelihood: very variable sequences, high

computational costs

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 12: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Methods

Software

Name Author URL

PHYLIP Felsenstein 89,96 http://evolution.genetics.

washington.edu/phylip.html

PAUP Sinauer Asssociates http://www.lms.si.edu/PAUP

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 13: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Maximum Parsimony

minimize number of mutations

mutations are branches in the tree

tree explains the evolution of the sequences

surviving mutations are rare tree with minimal mutations is

most likely explanation

maximum parsimony PHYLIP programs:

DNAPARS, DNAPENNY, DNACOMP, DNAMOVE, and

PROTPARS

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 14: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Tree Length

maximum parsimony tree: tree with smallest tree length

tree length: number of substitutions in the tree

Example protein triosephosphate isomerase for the taxa

“Human”, “Pig”, “Rye”, “Rice”, and “Chicken”: Human ISPGMI

Pig IGPGMI

Rye ISAEQL

Rice VSAEML

Chicken ISPAMI

If we focus on column 4: Human G

Pig G

Rye E

Rice E

Chicken A

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 15: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Fitch (71) alg. for computing the tree length with taxa at leaves:

1. Root node added to an arbitrary branch

2. bottom-up pass: sets of symbols (amino acids) for a

hypothetical sequence at this node.

Minimize the number mutations by maximal agreement of the

subtrees avoid a mutation at the actual node

In the first case is leave, the second case enforces a

mutation, and the third case avoids a mutation

Tree Length

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

$m_

{12

}$

Page 16: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

3. top town pass: hypothetical sequences at the interior nodes;

counts the number of mutations

means that the formula holds for and for

Tree Length

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 17: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Non-informative columns: not used

Columns with one symbol occur multiple an others only single

(number of mutations independent of topology)

Columns with only one symbol

minimal number of substitutions:

maximal subs., star tree (center: most frequent symbol):

number of substitutions for the topology:

consistency index :

High values support the according tree as being plausible

retention index

rescaled consistency index

Tree Length

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 18: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

few sequences: all trees and their length can be constructed

for a larger number of sequences heuristics are used

Branch and Bound (Hendy & Penny, 82) for 20 and more taxa:

1. This step determines the addition order of the taxa as follows.

First, compute the core tree of three taxa with maximal length of all three taxa trees. Next the taxa is added to one of

the three branches which leads to maximal tree length. For

the tree with four branches we determine the next taxa which

leads to maximal tree length. …

2. This step determines an upper bound for the tree length by

either distance based methods (neighbor joining) or heuristic

search (stepwise addition algorithm).

Tree Search

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 19: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Branch and Bound (continued):

3. Start with the core tree of three taxa.

4. Construct new tree topologies by stepwise adding new taxa to

the trees which do not possess a STOP mark. The next taxa is

chosen according to the list in step 1 and added to each tree

at all of its branches. Tree lengths are computed.

5. Assign STOP marks if upper bound is exceeded. Terminate if

all trees possess a STOP mark. List of step 1 leads to many

early STOP signals.

Tree Search

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 20: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Tree Search

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 21: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Heuristics for Tree Search (Step 2 in previous algorithm)

Stepwise Addition Algorithm: extend only tree with shortest

length. If all taxa are inserted then perform branch swapping

Branch Swapping:

(1) neighbor interchange: two taxa connected to the same

node,

(2) subtree pruning and regrafting: remove small tree and

connect it with the root to a branch,

(3) bisection-reconnection: remove branch to obtain two trees

and reconnect them by inserting a new branch where each

branch of the subtrees can be connected in contrast to (2)

where the root is connected

Tree Search

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 22: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Heuristics for Tree Search

Branch and Bound Like:

Use step 1. of the branch-and-bound algorithm to obtain the

minimal tree (not maximal!).

Upper local bounds Un for n taxa are constructed in this way.

These upper bounds server for stopping signals.

Tree Search

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 23: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Type of substitution is weighted according to PAM and

BLOSUM matrices to address survival of substituition

Bootstrapping - accesses the variability of the tree with respect to the data

(“variance”) and identifies stable substructures

- is possible because the temporal order of the alignment

columns does not matter

- cannot access the quality of a method but only its robustness

Weighted Parsimony / Bootstrapping

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 24: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Inconsistency of Maximum Parsimony

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

True Tree

Maximum Parsimony

Page 25: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

a and b are similar to each other and match to 99% whereas c and

d are not similar to any other sequence and only match 5% by

chance (1 out of 20).

Informative columns: only two symbols and each appears twice.

Probabilities of informative columns and their rate:

90% of the cases of informative columns we observe ci = di.

Maximum parsimony will judge c and d as similar as a and b.

Inconsistency of Maximum Parsimony

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 26: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

matrix of pairwise distances between the sequences

A distance D is produced by a metric d (function) on objects x

indexed by i,j,k:

metric d must fulfill

not all scoring schemes are

a metric, e.g. the e-value

Distance-based Methods

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 27: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Unweighted Pair Group Method using arithmetic Averages

(constructive clustering method based on joining pairs of clusters)

It works as follows:

1. each sequence i is a cluster ci with one element ni =1 and

height li = 0. Put all i into a list.

2. Select cluster pair (i,j) from the list with minimal Dij and create

a new cluster ck by joining ci and cj with height lk = Dij / 2 and

number of elements nk = ni + nj.

3. Compute the distances for ck:

4. Remove i and j from the list and add k to the list. If the list

contains only one element then terminate else go to step 2.

UPGMA

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 28: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

assumption of constant rate of evolution in different lineages

bootstrap can evaluate the reliability to data variation

Positive interior branches contribute to the quality of the tree

UPGMA

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 29: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

minimize (Dij – Eij), where Eij is the sum of distances in the tree on

the path from taxa i to taxa j (the path metric)

The objective is

Fitch and Margoliash, 1967, introduced weighted least squares:

optimized under the constraint of nonnegative branch length

Least Squares

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 30: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

If matrix A is the binary topology matrix with N(N-1)/2 rows, one for

each Dij, and v columns for the v branches of the topology. In

each row (i,j) all branches contained in the path from i to j are

marked by 1 and all other branches are 0

l is the v -dimensional vector of branch weights, then

least squares assumption: Dij deviates from Eij according

to a Gaussian distribution εij with mean 0 and variance D2ij:

maximum likelihood estimator (least squares) is

Gaussianity assumption: justified by sufficient large sequences

li are Gaussian and, therefore, also Dij

Least Squares

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 31: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

The objective is the sum of branch length' l:

Given an unbiased branch length estimator l˄

The expected value of L is smallest for the true topology

independent of the number of sequences

(Rzhetsky and Nei, 1993)

Minimum evolution is computational expensive

Minimum Evolution

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 32: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

The neighbor joining (Saitou and Nei, 1987) simplifies the

minimum evolution method (for fewer than six taxa both

methods give the same result)

Neighbors: taxa that are connected by a single node

Additive metric d: any four elements fulfill

path metric (counting the branch weights) is an additive metric

Neighbor Joining

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

i

k

m

j

Page 33: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

An additive metric can be represented by an unique additive tree

construction of an additive tree where a node v is inserted:

Path metric holds:

Neighbor Joining

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 34: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

objective of the neighbor joining algorithm is

S the sum of all branch length' lij

starts with a star tree

We assume N taxa with initial (star tree)

objective S0:

where the comes from the fact that , therefore

is part of distances

Neighbor Joining

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 35: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

In the next step taxa 1 and 2 are joined and a new internal node Y

is introduced and computed as

set all paths from i to j containing equal to

and solve for . These are all path'

from nodes 1 and 2 to . Therefore

paths start from nodes 1 and 2, each, giving

paths. is in all node 1 paths and

in all node 2 paths. The tail is in one

node 1 and one node 2 path.

Above equation is obtained by averaging

over these equations for .

Neighbor Joining

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 36: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

similar as for the initial star tree

inserted into the last equation:

Neighbor Joining

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 37: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

generalized from joining (1,2) to (k,l):

net divergences rk are accumulated distances of k to all other taxa:

giving

is constant for all objectives , therefore an

equivalent objective is

If k and l are evolutionary neighbors but is large due to

fast evolution of k and/or l, then rk and/or rl are large and Qkl small

Neighbor Joining

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 38: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

The algorithm:

1. Given start with a star tree where the taxa are the leaves.

Put all taxa in a set of objects.

2. For each leave i compute

3. For each pair (i,j) compute

4. Determine the minimal Qij. Join these (i,j) to new leave u.

Compute new branch length’ and new distances of u:

Delete i and j from the set of objects and add.

Stop if the set of objects contains only u otherwise go to Step 1

Neighbor Joining

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 39: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Neighbor joining: algorithm; for larger data sets

The formula for Qij accounts for differences in evolution rates

The objective S is only minimized approximatively

CLUSTALW uses neighbor-joining for multiple alignments

Neighbor Joining

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 40: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

we focus on nucleotides!

substitution rates:

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 41: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Jukes Cantor

Mutation probability:

Identical positions of 2 seq. remain identical:

different nucleotides will be identical:

One changes and the other not:

Two of these events:

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 42: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Jukes Cantor

difference equation:

continuous model:

The solution for :

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 43: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Jukes Cantor

The substitutions per position d for two sequences is 2r t

Estimating q and inserting in above equation: estimate d.

The variance of the estimate for d:

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 44: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Kimura

group nucleotide pairs:

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 45: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Kimura

transitional substitutions:

transversional substitutions:

equilibrium frequency of each nucleotide: 0.25

However occurrence of GC Drosophila mitochondrial DNA is 0.1

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 46: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Felsenstein / Tajima-Nei

xij: relative frequency of nucleotide pair (i,j)

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 47: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Tamura

extends Kimura's model for GC content different from 0.5

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 48: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Hasegawa (HKY)

hybrid of Kimuras and Felsenstein / Tajima-Nei: GC content and

transition / transversion

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 49: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Tamura-Nei

includes Hasegawa's model

P1: proportion of transitional differences between A and G

P2: proportion of transitional differences between T and C

Q: proportion of transversional differences

Distance Measures

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 50: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Tree probability: product of the mutation rates in each branch

Mutation rate: product between substitution rate and branch length

D: data, multiple alignment of N sequences (taxa)

Dk: N-dimensional vector at position k of the multiple alignment

A: tree topology (see least squares)

l: vector of branch length

H: number of hidden nodes of the topology A

: model for nucleotide substitution

: set of letters (e.g. the amino acids)

Hidden nodes are indexed from 1 to H

taxa are indexed from H+1 to H+N

Root node has index 1

Maximum Likelihood Methods

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 51: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

The likelihood of the tree at the k-th position:

: prior probability of the root node assigned with

: indicates an existing branch

: probability of branch length between and

hidden states are summed out

Prior is estimated or given

Maximum Likelihood Methods

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 52: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

If is the Felsenstein / Tajima-Nei equal-input model, the

branch length probabilities are

For and we obtain Jukes-Cantor:

reversible models:

choice of the root does not matter because branch lengths

count independent of their substitution direction

Maximum Likelihood Methods

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 53: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

(independent positions)

Felsenstein's (81) pruning algorithm to compute

: probability of a letter a at node i

recursive formula:

computed by dynamic programming from leaves to root

Likelihood:

Maximum Likelihood Methods

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 54: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

best tree: both the branch length' and the topology optimized

branch length‘: gradient based / EM (expectation-maximization)

tree topology: Felsenstein (81) growing (constructive) algorithm

start with 3 taxa, at k-th taxa test all (2k-5) branches for insertion

further optimized by local changing the topology

tree topology: small N all topologies can be tested, then local

changes similar to parsimony tree

ML estimator is computationally expensive but unbiased

(sequence length) and asymptotically efficient (minimal variance)

fast heuristics: Strimmer and v. Haeseler (96): all topologies of 4

taxa then build the final tree (software: http://www.tree-puzzle.de/)

Maximum Likelihood Methods

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 55: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

From triosephosphat isomerase of different species trees are

constructed with PHYLIP (Phylogeny Inference Package) 3.5c

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 56: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Fitch-Margoliash

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 57: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Fitch-Margoliash with assumption of a molecular clock (“kitsch”)

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 58: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

neighbor joining

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 59: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

UPGMA

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 60: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

UPGMA

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Fitch-Margoliash

neighbor joining

Kitsch

Page 61: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Phylogenetic tree based on triosephosphat isomerase for:

Human

Monkey

Mouse

Rat

Cow

Pig

Goose

Chicken

Zebrafish

Fruit FLY

Rye

Rice

Corn

Soybean

Bacterium

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 62: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Fitch-Margoliash

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 63: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Fitch-Margoliash with assumption of a molecular clock (“kitsch”)

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 64: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

neighbor joining

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 65: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

UPGMA

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 66: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

UPGMA

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Fitch-Margoliash

neighbor joining

Kitsch

Page 67: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

relationship between humans and apes

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 68: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

relationship between humans and apes

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 69: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Examples

5 Phylogenetics

5.1 Motivation

5.1.1 Tree of Life

5.1.2 Molecular

Phylogenies

5.1.3 Methods

5.2 Maximum

Parsimony

5.2.1 Tree Length

5.2.2 Tree Search

5.2.3 Weighted

Parsimony

5.2.4 Inconsistency

5.3 Distance-based

5.3.1 UPGMA

5.3.2 Least Squares

5.3.3 Minimum

Evolution

5.3.4 Neighbor

Joining

5.3.5 Distance

Measures

5.4 Maximum

Likelihood

5.5 Examples

Page 70: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

Three Anwers

From where came

the first human?

Africa!

Is Anna Anderson the tsar‘s daughter Anastasia?

No!

Are the neanderthals the

ancestors of the humans?

No! Separate Species

Asiats Europeans Australians Africans

Ancestor

Modern Humans Neanderthals

T

T

T

337

C

C

C

C

C

C

A

A

A

T

T

T

G

G

G

A

A

A

A

A

A

T

T

T

A

A

A

T

T

T

T

T

T

G

G

G

106

T

C

C

C

T

T

324

A

A

A

G

G

G

T

T

T

C

C

C

A

A

A

A

A

A

A

A

A

T

T

T

C

C

C

C

C

C

C

C

C

A

A

A

T

C

C

91

C C Prince Philip (Grand nephew zar)

T C Carl Maucher (Grand nephew

F. Schanzkowska)

T C Anna Anderson

Page 71: Bioinformatics 1 · Cladistic trees: conserved characters Phenetic trees: measure of distance between the leaves of the tree (distance as a whole and not based on single features)

Bioinformatics 1: Biology, Sequences, Phylogenetics

END