
ON ISSUES OF SINGULARITY FOR CONFIDENCE REGIONS

AND HYPOTHESIS TESTS FOR TOPOLOGIES USING

GENERALIZED LEAST SQUARES

by

Paul Sheridan

SUBMITTED IN PARTIAL FULFILLMENT OF THE

REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

AT

DALHOUSIE UNIVERSITY

HALIFAX, NOVA SCOTIA

OCTOBER 2006

© Copyright by Paul Sheridan, 2006


DEPARTMENT OF MATHEMATICS AND STATISTICS

The undersigned hereby certify that they have read and recommend to

the Faculty of Graduate Studies for acceptance a thesis entitled “ON ISSUES

OF SINGULARITY FOR CONFIDENCE REGIONS AND HYPOTHESIS

TESTS FOR TOPOLOGIES USING GENERALIZED LEAST SQUARES”

by Paul Sheridan in partial fulfillment of the requirements for the degree of

Master of Science.

Dated: October 26, 2006

Supervisor: Dr. Edward Susko

Readers: Dr. Chris Field

Dr. Hong Gu



DATE: October 26, 2006

AUTHOR: Paul Sheridan

TITLE: On Issues of Singularity for Confidence Regions and Hypothesis Tests

for Topologies Using Generalized Least Squares

DEPARTMENT OR SCHOOL: Mathematics and Statistics

DEGREE: M.Sc. CONVOCATION: May YEAR: 2007

Permission is herewith granted to Dalhousie University to circulate and to have

copied for non-commercial purposes, at its discretion, the above title upon the request of

individuals or institutions.

Signature of Author

The author reserves other publication rights, and neither the thesis nor extensive extracts from it may be printed or otherwise reproduced without the author’s written permission.

The author attests that permission has been obtained for the use of any copyrighted material appearing in the thesis (other than brief excerpts requiring only proper acknowledgement in scholarly writing) and that all such use is clearly acknowledged.


Days in summer, Basil, are apt to linger.

– Oscar Wilde


Table of Contents

List of Tables viii

List of Figures ix

List of Abbreviations and Symbols Used xii

Abstract xiv

Acknowledgements xvi

Chapter 1 Overview 1

Chapter 2 Molecular Phylogenetics 6

Chapter 3 Models of DNA and Amino Acid Evolution 10

3.1 Phylogenetic Trees and Tree Metrics . . . . . . . . . . . . . . . . . . 11

3.1.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.1.2 Topologies and Phylogenetic Trees . . . . . . . . . . . . . . . 12

3.1.3 Newick Representation of a Phylogenetic Tree . . . . . . . . . 14

3.1.4 Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . 15

3.1.5 Matrix Encoding of a Phylogenetic Tree . . . . . . . . . . . . 16

3.1.6 Hamming Distance . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2 Probabilistic Models of Site Substitution . . . . . . . . . . . . . . . . 17

3.2.1 Some Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2.2 Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.3 Branch Lengths . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2.4 Markov Models of Nucleotide Substitution . . . . . . . . . . . 22

3.3 Amino Acid Substitution . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.4 Limitations of Markov Models . . . . . . . . . . . . . . . . . . . . . . 25


Chapter 4 Reconstruction Methods 27

4.1 The Inferential Problem . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.2 Parsimony Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3 Maximum Likelihood Methods . . . . . . . . . . . . . . . . . . . . . . 28

4.4 Distance Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.4.1 ML Distances . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.4.2 Branch Length and Topological Estimation using Least Squares 33

4.4.3 GLS and its Relationship to WLS . . . . . . . . . . . . . . . . 35

Chapter 5 Confidence Regions and Hypothesis Tests for Topologies

Using Generalized Least Squares 37

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.2 GLS Test Statistic and Its Chi-Square Distribution . . . . . . . . . . 39

5.3 Estimation of the Covariances Matrix V . . . . . . . . . . . . . . . . 41

5.4 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 42

Chapter 6 Generalized Least Squares Small Sample Simulation 44

6.1 Motivation and Methods for Small Sample Simulation . . . . . . . . . 44

6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6.2.1 Maximum Likelihood Distances and The Multivariate Normal

Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.2.2 The Relationship Between Sample Average Variance and Distance 51

6.2.3 Two Asymptotic Variance Formulae . . . . . . . . . . . . . . . 52

6.2.4 GLS Test Statistic for the True Topology . . . . . . . . . . . . 55

6.2.5 Coverage and Confidence Region Size . . . . . . . . . . . . . . 60

Chapter 7 Singularity of the Covariance Matrix 63

7.1 Singularity of the True Covariance Matrix V . . . . . . . . . . . . . . 64

7.2 Singularity of the Estimated Covariance matrix V . . . . . . . . . . . 68

7.3 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


Chapter 8 Circumventing Limitations of the Generalized Least Squares

Method 73

8.1 Distance Cutoff Approach . . . . . . . . . . . . . . . . . . . . . . . . 73

8.2 Eigenvalue Cutoff Approach . . . . . . . . . . . . . . . . . . . . . . . 74

8.3 LTD Counting Approach . . . . . . . . . . . . . . . . . . . . . . . . . 75

8.4 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 77

8.5 Relationships Between Cutoffs . . . . . . . . . . . . . . . . . . . . . . 80

8.6 Motivation and Methods for Comparing the Cutoffs . . . . . . . . . . 85

8.6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

Chapter 9 Conclusions 90

Bibliography 92


List of Tables

Table 2.1 Character data for 4 taxa . . . . . . . . . . . . . . . . . . . . . 7

Table 3.1 The number of unrooted topologies for given numbers of taxa . 14

Table 7.1 Nucleotide patterns over sites. . . . . . . . . . . . . . . . . . . 71

Table 8.1 95% CR using cutoff ce0 with sequence length 250 (18 of 21

eigenvalues used). . . . . . . . . . . . . . . . . . . . . . . . . . 83



Table 8.3 95% CR using cutoff cd with sequence length 250. . . . . . . . . 83

Table 8.4 95% CR summary using cutoffs ce0, ce1, and cd for sequence

length 250. The gT and p-values are for the true topology. . . . 83





Table 8.7 95% CR using cutoff cd with sequence length 25000. . . . . . . 84

Table 8.8 95% CR summary using cutoffs ce0, ce1, and cd for sequence

length 25000. The gT and p-values are for the true topology. . . 84


List of Figures

Figure 2.1 A phylogenetic tree . . . . . . . . . . . . . . . . . . . . . . . . 6

Figure 2.2 A hidden substitution at site 3. . . . . . . . . . . . . . . . . . 8

Figure 3.1 A graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Figure 3.2 A rooted binary tree . . . . . . . . . . . . . . . . . . . . . . . 13

Figure 3.3 An unrooted binary tree . . . . . . . . . . . . . . . . . . . . . 13

Figure 3.4 A phylogenetic tree. . . . . . . . . . . . . . . . . . . . . . . . . 15

Figure 3.5 Conditional independence at a site for 3 taxa. . . . . . . . . . 18

Figure 4.1 A parsimony example. . . . . . . . . . . . . . . . . . . . . . . 28

Figure 4.2 Possible topologies for 4 taxa. . . . . . . . . . . . . . . . . . . 29

Figure 4.3 Maximum likelihood at site 4. . . . . . . . . . . . . . . . . . . 30

Figure 6.1 Histograms of the ML estimated distance errors yij −dij for the

4 taxa case with Tree 2. The sequence length is 50. . . . . . . 48

Figure 6.2 Histograms of the ML estimated distance errors yij −dij for the

first 7 taxa case with Tree 2. The sequence length is 50. . . . . 49

Figure 6.3 Left: Histogram for y23 − d23 for increasing sequence length.

Right: Histogram for y13 − d13 for increasing sequence length. 50

Figure 6.4 Exponential relationship between distance and sample average

variance. Left: A plot of d12 vs the median variance. Right: A

plot of d12 vs the log median variance. . . . . . . . . . . . . . 51

Figure 6.5 Summary plots for confidence regions generated using sample

average variance and NND asymptotic equivalent formula. . . 53

Figure 6.6 GLS test statistic Q-Q plot. Sim I: Q-Q plots of the GLS test

statistic for the true topology using the estimated covari-

ances. Plots for sequence length 50, 250, 1000, and 10000 are

shown for Trees 1, 2, and 3. . . . . . . . . . . . . . . . . . . . 56


Figure 6.7 GLS test statistic Q-Q plot. Sim I: Q-Q plots of the GLS test


ances. Plots for sequence length 250 and 10000 are shown for

Trees 1, 2, and 3. . . . . . . . . . . . . . . . . . . . . . . . . . 56

Figure 6.8 GLS test statistic Q-Q plot. Sim I: Q-Q plots of the GLS

test statistic for the true topology using the true covariances.

Plots for sequence length 50, 250, 1000, and 10000 are shown

for Trees 1, 2, and 3. . . . . . . . . . . . . . . . . . . . . . . . 57

Figure 6.9 GLS test statistic Q-Q plot. Sim I: Another look at Q-Q plots of

the GLS test statistic for the true topology using the true co-

variances. Plots for sequence length 250 and 10000 are shown

for Trees 1, 2, and 3. . . . . . . . . . . . . . . . . . . . . . . . 57

Figure 6.10 GLS test statistic Q-Q plot. Sim II: Q-Q plots of the GLS test


ances. Plots for sequence length 50 are shown for Trees 1, 2,

and 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

Figure 6.11 GLS test statistic Q-Q plot. Sim II: Another look at Q-Q plots

of the GLS test statistic for the true topology using the esti-

mated covariances. Plots for sequence length 250, 1000, and

10000 are shown for Trees 1, 2, and 3. . . . . . . . . . . . . . . 58

Figure 6.12 GLS test statistic Q-Q plot. Sim II: Q-Q plots of the GLS


Plots for sequence length 50 are shown for Trees 1, 2, and 3. . 59

Figure 6.13 GLS test statistic Q-Q plot. Sim II: Q-Q plots of the GLS


Plots for sequence length 250, 1000, and 10000 are shown for

Trees 1, 2, and 3. . . . . . . . . . . . . . . . . . . . . . . . . . 59

Figure 6.14 Coverage and CR size plots. Sim I: Plots of coverage probability

of the true topology and average confidence region size using

both estimated covariances (left) and the true covariances (right). 61


Figure 6.15 Coverage and CR size plots. Sim II: Plots of coverage probabil-

ity of the true topology and average confidence region size using

both estimated covariances (left) and the true covariances (right). 61

Figure 6.16 Coverage and CR size plots. Sim III: Plots of coverage prob-

ability of the true topology and average confidence region size

using both estimated covariances (left) and the true covariances

(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

Figure 7.1 The 4 taxa tree. . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Figure 7.2 A linear dependence leading to singularity in the estimated co-

variance matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . 72

Figure 8.1 A 7 taxa tree with ε small so that y12, y56, y57, y67 ≤ cd. . . . . 74

Figure 8.2 Here the two maximal groupings have been formed: G1 = {1, 2} and G2 = {5, 6, 7}. . . . . . . . . . . . . . . . . . . . . . . . . 74

Figure 8.3 A 7 taxa tree with y67 ≃ 0. . . . . . . . . . . . . . . . . . . . . 76

Figure 8.4 Taxa 6 and 7 have been grouped together. . . . . . . . . . . . 76

Figure 8.5 7 taxon topologies in ω1 . . . . . . . . . . . . . . . . . . . . . 78

Figure 8.6 4 taxon topology associated with ω1 . . . . . . . . . . . . . . . 78





Figure 8.11 Plots of coverage probability of the true topology, the number

of confidence regions made, and average confidence region size

for Sim I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

Figure 8.12 Plots of coverage probability of the true topology, the number

of confidence regions made, and average confidence region size

for Sim II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89


List of Abbreviations and Symbols Used

~α ....... branch lengths

cd ....... distance cutoff

ce0 ....... eigenvalue cutoff

ce1 ....... eigenvalue cutoff

CR ....... confidence region

δ ....... distance function

gT ....... generalized least squares test statistic

G ....... taxa grouping

GLS ....... generalized least squares

GTR ....... general time reversible model

Hij ....... Hamming distance for a pair of taxa

iid ....... independently and identically distributed

ℓ ....... log likelihood function

L ....... likelihood function

LT ....... length of a phylogenetic tree

LTD ....... linear transformed distance

ML ....... maximum likelihood

NND ....... non-negative definite

OLS ....... ordinary least squares


Ω ....... set of topologies

Π ....... stationary probability matrix

Q ....... rate matrix

T ....... phylogenetic tree

T ....... topology

Tr ....... tree

WLS ....... weighted least squares

V ....... covariance matrix

V̂ ....... estimated covariance matrix

X ....... data matrix

~y ....... maximum likelihood distances


Abstract

Recently, Susko [31] described a computationally inexpensive way to construct confidence regions (CR) for topologies using a generalized least squares (GLS) test statistic, with chi square distribution, which applies to maximum likelihood (ML) distances. Software implementations for nucleotide and protein data, called glsdna and glsprot respectively, were also provided by Susko [32]. The accuracy of both the GLS test statistic and the sample average approximations used for the variances and covariances of the ML distances is asymptotic in the number of sites; however, in practice usable sequences may be only hundreds of characters long. It is untested just how GLS will perform under these conditions.

In this thesis, a simulation study is undertaken to gauge the consequences of these asymptotic limitations. To this end, 4 and 7 taxon trees were used to simulate nucleotide sequence data for each of the lengths 50, 100, 250, 500, 1000, 5000, and 10000. For each tree used, and each sequence length, on the order of 10000 CRs were generated, and the coverage probability of the true tree, size of each CR, estimated ML distances, and estimated sample average variances-covariances were recorded. It was found that the coverage probabilities agreed with what is expected asymptotically for sequence lengths 1000 and higher. For smaller sample sizes the coverage probabilities were generally found to be higher than the 0.95 value. It was anticipated that, for small sample sizes, the coverage probabilities would attain the expected 0.95 value if the true covariances were used to compute the GLS test statistic. Surprisingly, the coverage probabilities instead fell drastically below 0.95. The underlying cause can be attributed to a tendency for the ML distances to be overestimated for small sequence lengths together with what we found to be an exponential increase in variance with distance between taxa.

The second part of this thesis is directed toward fixing a serious limitation of the GLS software. Namely, computation of the GLS test statistic requires the estimated covariance matrix of the ML distances to be invertible. If singularity does occur, then the test statistic cannot be computed and the programs will crash. In molecular evolution models, the covariance matrix is a function of the substitution model and the underlying tree, but it is not generally known what types of trees and models cause singular covariance matrices. In this thesis, we show that singular covariance matrices arise if and only if some distance is exactly 0 or, equivalently, when a pair of taxa have identical sequences with probability 1. However, in practice the covariance matrix must be estimated and the underlying causes of singularity are more complex. A necessary condition for singularity in the estimated covariance matrix is given, as well as two sufficient conditions which are: 1) The number of distinct nucleotide patterns at a site is less than the number of pairs of taxa, and 2) A special type of linear dependence is constructed in the rows of the estimated covariance matrix.

Finally, two alternatives to using the glsdna and glsprot routines are introduced which allow for the construction of a CR even when the covariance matrix is singular. First, the routines glsdna_eig and glsprot_eig, as described in [32], use an eigenvalue cutoff approach. The causes of singularity described in this thesis led to an alternative approach which uses a distance cutoff: groups of taxa which are closely related are combined together before computing a CR. This approach is implemented as glsdna_dist and glsprot_dist. These different approaches were compared via a simulation on two 8 taxon trees using nucleotide sequence data. Briefly, the results show that for small samples the glsdna_dist routine gives better coverage probabilities and far smaller CR sizes than those obtained by using glsdna_eig, while for longer sequence lengths the routines exhibit similar performance.


Acknowledgements

First and foremost, I would like to thank my supervisor, Ed Susko, for his guidance

as well as for his patience during my time here. I also owe a special thanks to both

of my readers Hong Gu and Chris Field. To Hong for convincing me to undertake

a master’s degree in statistics at Dalhousie, and to Chris for making it possible.

To George Gabor for teaching me that probability theory is nothing but common

sense reduced to calculations. To Balagopal Pillai and Nick Pilon for their technical

assistance. And finally, a collective thanks to all of the graduate students in the

department.


Chapter 1

Overview

This thesis is organized as follows. Chapters 2, 3 and 4 serve both to introduce background material and to codify notation which will be used throughout this thesis. Specifically, in Chapter 2 a general introduction to the aims of molecular phylogenetics is provided; and, most importantly, the main problem of inferring a phylogenetic tree from molecular sequence data, such as nucleotide or amino acid sequences, for a given set of taxa is described. In Chapter 3 this inferential problem is formulated in a probabilistic framework culminating with the introduction of the general time reversible (GTR) model of substitution. Then in Chapter 4 various ways to reconstruct phylogenetic trees based on these models are introduced, including maximum likelihood (ML) and distance based methods. Distance methods, among which the generalized least squares (GLS) approach falls, operate by first reducing sequence data to more manageable pairwise distances between the taxa, then work to infer a phylogenetic tree from these. Thus, much attention is given to describing the GLS method of estimation as a linear model on the pairwise distances between taxa, and it is explained how the GLS model can be transformed into an equivalent weighted least squares (WLS) model. Under the WLS formulation the pairwise distances are linearly independent transformations of the original pairwise distances and the eigenvalues of the covariance matrix V, from the GLS formulation, are the variances of these transformed distances.
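The link between the GLS and WLS formulations can be made explicit with a short derivation. The following is only a sketch, under the assumption that V is symmetric and positive definite; the symbols P, Λ, λ_k, m and z are introduced here for illustration and are not notation taken from the thesis, while d and δ denote the vectors of pairwise distances and of the corresponding path lengths.

```latex
\begin{align*}
V &= P \Lambda P^{\top}, \qquad P^{\top} P = I, \qquad
     \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_m), \\
\vec{z} &= P^{\top} \vec{d}
     \;\Longrightarrow\; \operatorname{Cov}(\vec{z}) = P^{\top} V P = \Lambda, \\
(\vec{d} - \vec{\delta})^{\top} V^{-1} (\vec{d} - \vec{\delta})
  &= \sum_{k=1}^{m}
     \frac{\bigl(z_k - (P^{\top} \vec{\delta})_k\bigr)^{2}}{\lambda_k}.
\end{align*}
```

Here m is the number of pairs of taxa. The transformed distances z_k are uncorrelated with variances λ_k, so the GLS criterion becomes a weighted sum of squares with weights 1/λ_k, which is the sense in which the eigenvalues of V act as variances.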

In addition to phylogenetic estimation, the GLS approach provides a natural framework for hypothesis testing and, in turn, the construction of confidence regions (CR) for topologies, the branching patterns underlying phylogenetic trees. Chapter 5 is devoted to detailing GLS applied to topologies as in Susko [31]. In brief, given a set of distances $d_{ij}$, the GLS test statistic for a particular topology T is given by

$$g_T = \sum_{i<j,\; k<l} w_{ij,kl}\,(d_{ij} - \delta_{ij})(d_{kl} - \delta_{kl})$$

where the $w_{ij,kl}$ are entries of the inverse covariance matrix $V^{-1}$ and $\delta_{ij}$ is the sum of the branch lengths along the path from i to j particular to the topology T. This can be used to test the hypotheses

H0 : T is the true topology.

HA : A different topology is the true topology.

at significance level α, because under the true topology the GLS test statistic has a known chi square distribution. To implement this method, however, it is necessary to know or estimate the covariance matrix V of the pairwise distances. Susko uses a sample average to approximate theoretical formulae for the variances and covariances which are specific to ML distances. Consequently, if the model used to construct the ML distances is correct, then coverage probabilities are asymptotically correct as the number of sites grows.
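To make the double sum in the test statistic concrete, the following minimal sketch evaluates it directly for a toy case. It is not the glsdna implementation: the function names, the dictionary layout, and all numbers are hypothetical, and the inverse covariance entries are simply supplied rather than estimated.

```python
from itertools import combinations


def taxon_pairs(num_taxa):
    """Pairs (i, j) with i < j, in the order 12, 13, ..., 1T, 23, ..."""
    return list(combinations(range(1, num_taxa + 1), 2))


def gls_statistic(d, delta, w):
    """Sum of w[ij,kl] * (d_ij - delta_ij) * (d_kl - delta_kl) over all pairs.

    d, delta: dicts keyed by a pair (i, j)
    w: dict keyed by (pair, pair) holding entries of the inverse covariance
    """
    pairs = sorted(d)
    resid = {p: d[p] - delta[p] for p in pairs}
    return sum(w[(p, q)] * resid[p] * resid[q] for p in pairs for q in pairs)


# Hypothetical 3 taxon example; the numbers are illustrative only.
pairs = taxon_pairs(3)
d = dict(zip(pairs, [0.30, 0.50, 0.40]))      # observed distances
delta = dict(zip(pairs, [0.25, 0.45, 0.50]))  # path lengths on the topology
# Assumed inverse covariance: 100 times the identity (variance 0.01 each).
w = {(p, q): (100.0 if p == q else 0.0) for p in pairs for q in pairs}

g = gls_statistic(d, delta, w)
print(round(g, 6))
```

Keying the weights by pairs of pairs mirrors the subscripts in the formula; a matrix representation would be the natural choice for anything beyond a toy example.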

Susko provides software, based on the PHYLIP package [8], which computes the GLS test statistics for a file of input trees and outputs the trees sorted according to their P values. The two main routines are

1. glsprot: for amino acid data

2. glsdna: for DNA data

In fact, two asymptotically equivalent versions of the sample approximation for the entries of V are used in these programs: the first is a first order approximation, while the second is a second order approximation which is non-negative definite (NND). Both of these approximations are described in Chapter 5 and are of particular interest because, due to numerical errors, V may be singular, causing the programs to crash. To help avoid this source of failure, the sample average approximations are applied first. If the resulting covariance matrix is singular, then the NND sample average approximations are used instead. The consequences of this approach are explored in more detail in the coming chapters.


Both the chi square distribution of the GLS test statistic and the sample average approximation for the variance-covariance matrix are large sample results, both depending on ML distance estimates. Though GLS rests upon a firm theoretical foundation for large sample sizes, in general it is unknown how it will perform when confronted with small samples. And since a large number of genomic data sets have sequence length 1000 or less, understanding the performance of GLS under such asymptotically hostile conditions is an avenue of investigation of real practical interest to the working phylogeneticist. In Chapter 6 a simulation study is undertaken to gauge the consequences of these asymptotic limitations. The results will help to clarify which real life data sets will prove problematic for the GLS software and, as well, enable the practitioner to extract the most information possible from the computed confidence regions.

To this end, confidence regions are computed from simulated nucleotide sequence data. The trees which were used to simulate the data break down into three cases. For each of the topologies:

[Figure: three topologies, one with taxa labeled 1 to 4 and two with taxa labeled 1 to 7]

three trees, say Tree 1, 2, and 3, are chosen to provide increasing problems with estimation. In each case, Tree 1 represents a form which GLS should handle with ease. Trees 2 and 3 are made to induce long branch attraction in the first two cases, and a star topology in the latter case (bottom topology). Each tree is used as a true tree from which nucleotide data is simulated and 95% confidence regions computed according to the following specifications:

1. 10000 nucleotide data sets are generated for each of the sequence lengths 50, 100, 250, 500, 1000, 5000, and 10000.

2. 95% confidence regions are constructed with the glsdna program using both the sample average covariances and the true covariances.

For each simulation, the coverage probability of the true tree, size of each CR, estimated ML distances, and sample average variances-covariances were recorded.
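The two summary quantities recorded above, coverage probability and CR size, can be sketched as follows. This is a hedged illustration, not the thesis code: it assumes that a confidence region consists of the topologies whose P value is at least the significance level, and the topology labels and P values are made up.

```python
def confidence_region(p_values, alpha=0.05):
    """Topologies that are not rejected at significance level alpha."""
    return {topology for topology, p in p_values.items() if p >= alpha}


def summarize(replicates, true_topology, alpha=0.05):
    """Return (coverage probability, average CR size) over all replicates."""
    regions = [confidence_region(pv, alpha) for pv in replicates]
    coverage = sum(true_topology in r for r in regions) / len(regions)
    mean_size = sum(len(r) for r in regions) / len(regions)
    return coverage, mean_size


# Hypothetical P values for the three 4 taxon topologies in four replicates.
replicates = [
    {"12|34": 0.60, "13|24": 0.02, "14|23": 0.01},
    {"12|34": 0.30, "13|24": 0.20, "14|23": 0.03},
    {"12|34": 0.04, "13|24": 0.50, "14|23": 0.01},
    {"12|34": 0.90, "13|24": 0.10, "14|23": 0.07},
]

coverage, mean_size = summarize(replicates, true_topology="12|34")
print(coverage, mean_size)
```

In the study itself each combination of tree and sequence length would contribute on the order of 10000 replicates rather than four.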

It was found that the coverage probabilities agreed with what is expected asymptotically for sequence lengths 1000 and higher. For smaller sample sizes the coverage probabilities were generally found to be higher than the 0.95 value, though when the NND sample average formula was used extensively by glsdna, the coverage probabilities tended to be lower than 0.95. Moreover, it was anticipated that, for small sample sizes, the coverage probabilities would attain the expected 0.95 value if the true covariances were used to compute the GLS test statistic. Surprisingly, the coverage probabilities instead fell drastically below 0.95.

The underlying cause of the high coverage probabilities can be attributed to a tendency for the ML distances to be overestimated for small sequence lengths together with what we found to be an exponential increase in variance with distance between taxa. The underestimation of coverage probability when the NND sample average approximation is frequently used is explained as follows: the NND formula will often produce very small (positive) variances where, by contrast, the sample average formula would give a negative result, causing the programs to crash. This, as we will see, inflates the test statistic, which will be shown to reduce the size of the P value for the true tree. When the true covariances are used, overestimation in the ML distances is responsible for the reduced coverage probabilities.

As mentioned above, computation of the GLS test statistic requires the estimated covariance matrix of the pairwise distances V to be invertible. If singularity does occur, then the test statistic cannot be computed and the programs will crash. This limitation of the GLS approach was given some attention in [29]. In Chapter 7, we show that singular covariance matrices arise if and only if some distance is exactly 0 or, equivalently, when a pair of taxa have identical sequences with probability 1. However, in practice the covariance matrix must be estimated and the underlying causes of singularity are more complex. A necessary condition for singularity in the estimated covariance matrix is given, as well as two sufficient conditions which are: 1) The number of distinct nucleotide patterns at a site is less than the number of pairs of taxa, and 2) A special type of linear dependence is constructed in the rows of the estimated covariance matrix. The last section of Chapter 7 provides an example to illustrate the above claims.
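The first sufficient condition lends itself to a quick check on an alignment. The sketch below assumes one reading of that condition, namely that the number of distinct site patterns (alignment columns) observed is compared with the number of pairs of taxa, T(T − 1)/2; the function names and the toy sequences are hypothetical.

```python
def site_patterns(sequences):
    """Columns of the alignment, one tuple of character states per site."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to a common length")
    return list(zip(*sequences))


def fewer_patterns_than_pairs(sequences):
    """Return (condition holds, number of distinct patterns, number of pairs)."""
    num_taxa = len(sequences)
    num_pairs = num_taxa * (num_taxa - 1) // 2
    num_distinct = len(set(site_patterns(sequences)))
    return num_distinct < num_pairs, num_distinct, num_pairs


# Hypothetical alignment of 4 taxa over 5 sites.
toy = ["ACAAC", "ACAAC", "ACGAC", "GCAAC"]
print(fewer_patterns_than_pairs(toy))
```

With 4 taxa there are 6 pairs, so a very short alignment such as this one cannot supply enough distinct patterns, which is the kind of situation the condition describes.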

In Chapter 8 three approaches which allow for the construction of a CR even when the covariance matrix is singular are explored. First, the routines glsdna_eig and glsprot_eig, as described in [32], use an eigenvalue cutoff approach. Briefly, this works by finding the eigendecomposition of V; any eigenvalues below a specified cutoff value (along with the transformed distances for which they act as variances) are then omitted and an adjustment to the degrees of freedom of the test statistic is made. The causes of singularity described in this thesis led to an alternate approach which uses a distance cutoff: groups of taxa which are closely related are combined together before computing a CR. This approach is implemented as glsdna_dist and glsprot_dist, and is described in some detail in that chapter. Lastly, an approach which links these two approaches together by inferring the number of near zero eigenvalues from the closely related taxa is discussed. This approach uses a particular eigenvalue cutoff, and thus relies upon the software in [32] as well. A simulation using 8 taxon trees is performed to compare these three methods. Briefly, the results show that for small samples the glsdna_dist routine gives better coverage probabilities and far smaller CR sizes than those obtained by using glsdna_eig, while for longer sequence lengths the routines exhibit similar performance. The last approach shows poor performance in comparison to the others.
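The grouping step of the distance cutoff approach can be sketched as below. This is an assumption-laden illustration rather than the glsdna_dist algorithm: it treats a grouping as a set of taxa connected through pairwise distances at or below the cutoff, and the distance values and cutoff are invented. The example is modeled on the captions of Figures 8.1 and 8.2, where y12, y56, y57 and y67 fall below the cutoff.

```python
from itertools import combinations


def group_taxa(num_taxa, distances, cutoff):
    """Group taxa linked by pairwise distances at or below the cutoff."""
    parent = list(range(num_taxa + 1))  # index 0 unused; taxa are 1..num_taxa

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), y in distances.items():
        if y <= cutoff:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for taxon in range(1, num_taxa + 1):
        groups.setdefault(find(taxon), []).append(taxon)
    # Report only genuine groupings, i.e. those with more than one taxon.
    return sorted(g for g in groups.values() if len(g) > 1)


# Hypothetical distances for 7 taxa: every pair is far apart except four.
close = {(1, 2), (5, 6), (5, 7), (6, 7)}
distances = {p: (0.001 if p in close else 0.5)
             for p in combinations(range(1, 8), 2)}

groupings = group_taxa(7, distances, cutoff=0.01)
print(groupings)
```

A union-find structure is used here only because it makes the merging of overlapping close pairs straightforward; any connected-components routine would serve equally well.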

Chapter 2

Molecular Phylogenetics

The fundamental inferential problem of phylogenetics is to reconstruct evolutionary relationships among homologous taxa, that is, taxa united by a common ancestry. Such relationships are typically rendered in the form of a phylogenetic tree: a bifurcating leaf-labeled tree where external nodes represent extant taxa while internal nodes represent ancestral taxa or, equivalently, speciation events. The branches determine lines of common ancestry in the tree and, when endowed with lengths, will be used to represent evolutionary distances between taxa, as will be defined with rigour in due course.

[Figure: a tree with leaves labeled 1 to 6 and branch lengths α1, . . . , α10]

Figure 2.1: A phylogenetic tree

Since the conception of phylogenetics as a scientific discipline, various, often incompatible, approaches based primarily on morphological aspects of species have been used to infer phylogenies. However, over the last half-century insights into the role of DNA in evolution have permitted molecular phylogenetics to blossom, owing to the newly available wealth of genomic data in combination with a tremendous increase in raw computing power. Using genomic data such as DNA or amino acid sequences has indeed led to the resolution of many open problems in biology, along with extensive revision of what had previously been accomplished.


Indeed, it is thought that evolutionary change is a consequence of changes to a species' genome.¹ Mutations in the genome are known to occur as substitutions, insertions, deletions, and inversions. In molecular phylogenetics, genomic sequence data for T homologous taxa, henceforth referred to via a labeling set S = {1, 2, . . . , T},² together with our understanding of changes at the genomic level, are employed to infer phylogeny.

In this thesis such genomic information will be represented either by sequences of nucleotides N = {A, C, G, T} or amino acids A = {Ala, Cys, . . . , Tyr}, collectively referred to as molecular sequence data. For simplicity, however, nucleotide sequences will be used exclusively, at no cost to the reader. Both N and A are examples of a character set, the elements of which are called character states or just states. In general, a character or site is an inherited trait possessed by a taxon; characters are described in terms of their states. For example, tentacled versus non-tentacled would serve as a two-valued morphological character set used to distinguish between mammalian-like life forms and more dreadful kraken-like beasts.

Character data for a set of taxa S will be organized in a T by N data matrix X, with one row per taxon and one column per character, where the entry $x_{ij}$ corresponds to the character state of taxon i for character j. Sometimes it will be convenient to consider i and j as the k'th pair of taxa. In this case the notation $x^{(k)}$ will be used to mean the 2 by N matrix whose rows correspond to rows i and j of the original data matrix, while $x^{k}_{i}$ will denote the data at site i for the pair of taxa k. When the pair is obvious the superscripted k may be dropped. A simple example should alleviate any source of confusion.

Chimp      ACACGGGTAA GGCGTAAGGC ... CCCTTTTTGC
Gorilla    ACAACGTGTT GGTTTGAGAG ... TACTGGACTT
Orangutan  ACGTCAAGTT TGTGTTAGAG ... AGCGGGACCT
Lemur      GCATACGCTC AGACTCAGTT ... ATGCGTAATG

Table 2.1: Character data for 4 taxa

Consider the data matrix of fictitious nucleotide sequence data above for some of our mammalian friends. We say the character state C occurs at site 2 for the chimp. The term character state may, however, be replaced by nucleotide, amino acid, or residue depending on the context. Furthermore, $x^{(1)}$ would be the first two rows of character data, corresponding to the pair of taxa chimp and gorilla.³ Lastly, taking k to be the pair chimp and gorilla, $x_4$ corresponds to the data (C, A) at site 4.

¹ Species is used in the plural sense here by no accident. Of course, each member of a species will have its own genomic variations. In this context, the parts of the genome which remain fixed in the population are intended.

² The point here is to simplify the exposition by referring to taxa by number instead of by name.
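The indexing conventions just described can be checked with a few lines of code. The sketch below uses only the first ten sites of each row of Table 2.1, since the table elides the middle of the sequences; the variable and function names are hypothetical and the pair ordering 12, 13, . . . , 1T, 23, . . . is the one assumed throughout the thesis.

```python
from itertools import combinations

# First ten sites of each row of Table 2.1.
DATA = [
    ("Chimp", "ACACGGGTAA"),
    ("Gorilla", "ACAACGTGTT"),
    ("Orangutan", "ACGTCAAGTT"),
    ("Lemur", "GCATACGCTC"),
]
NAMES = [name for name, _ in DATA]
X = [seq for _, seq in DATA]  # data matrix: one row (string) per taxon

# Pairs of 1-based taxon labels in the order 12, 13, 14, 23, 24, 34.
PAIRS = list(combinations(range(1, len(X) + 1), 2))


def pair_rows(k):
    """Rows of X for the k'th pair of taxa (k is 1-based)."""
    i, j = PAIRS[k - 1]
    return X[i - 1], X[j - 1]


def pair_site(k, site):
    """Character states at a site (1-based) for the k'th pair of taxa."""
    row_i, row_j = pair_rows(k)
    return row_i[site - 1], row_j[site - 1]


print(PAIRS)            # the pair ordering
print(pair_rows(1))     # rows for Chimp and Gorilla
print(pair_site(1, 4))  # states at site 4 for that pair
```

Storing each row as a string keeps the example close to the table; a two-dimensional array would be the more usual choice for real data.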

As aforementioned, character data carries with it the premise that the characters

describe inherited traits. Character sequence data for T taxa is said to be site homol-

ogous or aligned in this case. For molecular sequence data the nucleotides at a site

must trace their origin to a site from a common ancestor. Together insertions, dele-

tions, and inversions act to destroy site homology over evolutionary time in groups

of taxa.

Alignment of molecular sequence data is an aspect of inferring phylogenies which

will be side stepped in this thesis as it would take us far afield from the main purpose

here. That is all molecular sequence data is presumed to be aligned.

Aligned sequences that differ at only a small number of sites, a count called the Hamming distance, should intuitively be closely related. However, this approach ignores hidden site substitutions which are not deterministically recoverable from the sequences. For example,

[Figure 2.2: A hidden substitution at site 3. Tree diagram with node states C, T, A, A; the tips Chimp and Gorilla both carry A.]

To take this into account probabilistic models of substitution have been devel-

oped. A probabilistic model of evolution quantifies, in a mathematically appealing

way, uncertainty in the evolutionary process under consideration. Together, models

3 The pair ordering for taxa labeled 1, . . . , T will always be assumed to follow the pattern 12, 13, . . . , 1T, 23, . . . , (T − 1)T.


of evolution for aligned molecular sequence data on phylogenetic trees form the cor-

nerstone of a reductionist framework from which to reason about the evolutionary

process. In Chapter 3, we will see that modeling substitutions can fit nicely in a

Markov framework.

With both data and model under our command we can feel free to fall back on the classic statistical paradigm of estimation and evaluation. Typically, estimating a phylogenetic tree consists of choosing an optimality criterion, estimating a tree topology, and finally estimating the branch lengths. A brief overview of estimation methods such as parsimony, maximum likelihood, and least squares is given in Chapter 4. Special attention is given to the least squares approach to estimation, and in particular to generalized least squares (GLS). Evaluation of the estimate usually consists of both constructing confidence regions and conducting hypothesis tests. In this thesis, we are interested in constructing confidence regions on topologies (as opposed to phylogenetic trees) using GLS.

Finally, it would be misleading not to make reference to some inherent limitations of the designs outlined above. Indeed, lateral gene transfer as observed in bacterial populations, for instance, wildly contradicts the phylogenetic tree model of evolution as introduced here. Nevertheless, in an abundance of settings this basis has proven to be of prodigious utility by enlightening our understanding of the histories of species.

Also it can be difficult to identify the location of a common ancestor to all species

under consideration from molecular sequence data alone. That is, it is difficult to

root a phylogenetic tree based solely on molecular sequences. As such, we will usu-

ally be interested in inferring only unrooted trees representing relative evolutionary

relationships among the species under consideration.

Chapter 3

Models of DNA and Amino Acid Evolution

For taxa labeled by S whose evolutionary history conforms to the structural dictates

of some phylogenetic tree, corresponding molecular sequence data provides merely an

instantaneous frame of the evolutionary processes over time. Complete knowledge in

this setting - amounting to a site by site account of all nucleotide mutations since

divergence from a common ancestor - would permit a deterministic reconstruction of

phylogeny. However, armed only with sequence data for extant taxa, the best one

can hope for is a probabilistic account of past mutations in molecular sequences from

which to proceed with phylogenetic reconstruction.

Incorporating all knowledge of the evolutionary processes at hand into the model

is so delicate as to be marred by impracticality. Conversely, when too much infor-

mation is ignored a serious and unnecessary inaccuracy in subsequent estimation can

result. For example, most models encountered here will assume sites in a sequence

are independently and identically distributed (iid). However, in general this tends to be an oversimplification of the actual state of understanding, where rates of evolution vary not only over sites but also among lineages. In practice, a balance between

using all information available and having a reasonable model from which to make

inferences is sought.

In this chapter we begin with a formal treatment of phylogenetic trees. We then proceed to construct some probabilistic models of DNA and amino acid evolution of varying complexity. Finally, we state some limitations of what has been proposed.


3.1 Phylogenetic Trees and Tree Metrics

This section serves both to introduce phylogenetic trees and to codify notation which will be used throughout this thesis.

3.1.1 Graphs

Graphs are the most general construct used for representing phylogenetic relationships

among species. We proceed to formally introduce some graph theoretic notation used throughout this thesis, the purpose being to fix notation rather than to bore the reader with familiar concepts.

In general, take V to be a finite non-empty set and E ⊆ {{vi, vj} | vi, vj ∈ V and i ≠ j} a set of unordered pairs from V. We define a graph G as the ordered pair (V(G), E(G)) where V(G) is the set of nodes (vertices) and E(G) is the set of branches (edges). A subgraph of G is a graph G′ = (V′(G), E′(G)) where V′(G) ⊆ V(G) and E′(G) ⊆ E(G).

[Figure 3.1: A graph. The diagram shows the nodes v1, . . . , v7 and the subgraph G′, joined by the branches α1, . . . , α9.]

In Figure 3.1 the nodes are written as V = {G′, v1, v2, . . . , v7} while branches are written as E = {α1 = (v1, G′), α2 = (v1, v2), . . . , α9 = (v4, v5)}. It should be noted that a subgraph G′ will itself be considered as a node throughout.

A path in a graph G is a sequence of distinct nodes v1, v2, . . . , vN such that the branches {vi, vi+1} lie in E(G) for all 1 ≤ i ≤ N − 1. Two nodes vi and vj are connected if G contains a path from vi to vj. Otherwise, they are called disconnected. A graph G itself is connected when each pair of its nodes is connected. Otherwise it too is disconnected. A cycle is defined like a path except that v1 = vN; its remaining nodes and its edges are distinct. Finally, the degree of vi, denoted by deg(vi), is the number of branches which are incident with vi.

Again the graph of Figure 3.1 is used to illustrate these terms. The nodes v5 and v6 are connected, being joined by the path v5, v4, v3, v6. But the graph is disconnected since v7 is disconnected from all of the other nodes. And it can be seen that the graph contains many cycles, for example, v1, v2, v4, v3, v1.
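These definitions translate directly into code. The sketch below is an illustration under an assumption: the edge list is hypothetical and contains only the seven branches that the text names explicitly (through α1, α2, α9, the quoted path, and the quoted cycle), since the remaining two branches of Figure 3.1 cannot be read off the text.

```python
from collections import deque

# Hypothetical edge set: only the branches named in the text are included.
nodes = ["G'", "v1", "v2", "v3", "v4", "v5", "v6", "v7"]
edges = [("v1", "G'"), ("v1", "v2"), ("v2", "v4"), ("v4", "v3"),
         ("v3", "v1"), ("v4", "v5"), ("v3", "v6")]

# Adjacency sets for an undirected graph.
adj = {v: set() for v in nodes}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def degree(v):
    """Number of branches incident with v."""
    return len(adj[v])

def connected(u, v):
    """Breadth-first search: True when some path joins u and v."""
    seen, queue = {u}, deque([u])
    while queue:
        w = queue.popleft()
        if w == v:
            return True
        for x in adj[w] - seen:
            seen.add(x)
            queue.append(x)
    return False

def is_connected_graph():
    """A graph is connected when every node is reachable from the first."""
    return all(connected(nodes[0], v) for v in nodes)

print(connected("v5", "v6"), connected("v7", "v1"), is_connected_graph())
```

Breadth-first search is used here only because it is short; any graph traversal would answer the same connectivity questions.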

3.1.2 Topologies and Phylogenetic Trees

A tree Tr = (V (Tr), E(Tr)) is a connected graph with no cycles. A node v of degree 1

is called a terminal node. A node which is not a terminal node is called simply an

internal node. Together the set of terminal nodes of a tree is denoted by L(Tr) or just L when Tr is clear from context. The set of internal nodes V(Tr)\L(Tr) is denoted by V̊(Tr), or similarly by V̊. All trees considered here are binary trees, which are trees having all internal nodes of degree 3. Typically in phylogenetics, terminal nodes

represent taxa for which molecular sequence data abounds; whereas internal nodes

represent hypothesized extinct common ancestors.


A rooted tree is a tree with exactly one distinguished node called the root, denoted by ρ, which by definition has degree 2. For {u, v} ∈ E we say u is a parent of v, and v is a child of u, when the path from ρ to v includes u. Furthermore if u, v ∈ V and the path from ρ to v includes u, then u is an ancestor of v and v is a descendant of u. A pair of terminal nodes are called neighbors if they have the same parent.

[Figure 3.2: A rooted binary tree, with root ρ and terminal nodes 1, 2, 3, 4, 5.]

[Figure 3.3: An unrooted binary tree, with terminal nodes 1, 2, 3, 4, 5.]

Figure 3.2 shows a rooted binary tree labeled by S = {1, 2, 3, 4, 5}, while Figure 3.3 shows its unrooted counterpart. In general, a labeling of a tree Tr is a bijective mapping φ : L → S. A topology T is a pair (Tr, φ) where Tr is an unrooted binary tree labeled by φ. If the underlying binary tree is rooted, then the resulting pair is called a rooted topology. The set of all topologies on T taxa is denoted by Top(T) and Ω will be used to mean a subset of interest. Similarly, for rooted topologies the notation Topr(T) and Ωr is employed. The number of topologies increases quite rapidly in the number of taxa.

Proposition 3.1 A result on the size of the topology spaces:

1. The number of topologies in Top(T) is

$$1 \times 3 \times 5 \times \cdots \times (2T-5) = \frac{(2T-4)!}{(T-2)!\,2^{T-2}} \tag{3.1}$$

for T ≥ 3.

2. The number of topologies in Topr(T) is

$$1 \times 3 \times 5 \times \cdots \times (2T-3) = \frac{(2T-2)!}{(T-1)!\,2^{T-1}} \tag{3.2}$$

for T ≥ 2.


T    Unrooted topologies      T    Unrooted topologies
4    3                        8    10395
5    15                       9    135135
6    105                      13   ≈ 1.4 × 10^10
7    945                      20   ≈ 2.2 × 10^20

Table 3.1: The number of unrooted topologies for given numbers of taxa

This result is well known. However, for detailed proofs one may refer to [28]. The

number of topologies (both unrooted and rooted) is super-exponential in the number

of taxa. This means that techniques for inference like exhaustive search are feasible

for only a small number of taxa. In general, the space of topologies must be carefully

attended to before inference is performed.
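The counts in Proposition 3.1 are easy to reproduce. The following sketch evaluates both the product form and the closed form of (3.1); the function names are illustrative only.

```python
from math import factorial

def num_unrooted(T):
    """1 x 3 x ... x (2T - 5): unrooted binary topologies on T >= 3 taxa."""
    count = 1
    for odd in range(1, 2 * T - 4, 2):
        count *= odd
    return count

def num_rooted(T):
    """1 x 3 x ... x (2T - 3): rooted binary topologies on T >= 2 taxa."""
    count = 1
    for odd in range(1, 2 * T - 2, 2):
        count *= odd
    return count

def num_unrooted_closed(T):
    """Closed form (2T - 4)! / ((T - 2)! 2^(T - 2)) from (3.1)."""
    return factorial(2 * T - 4) // (factorial(T - 2) * 2 ** (T - 2))

for T in (4, 5, 6, 7, 8, 9):
    print(T, num_unrooted(T), num_rooted(T))
```

Python integers have arbitrary precision, so the same functions can be used for larger T without overflow.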

Each topology on T > 2 taxa has exactly 2T − 3 branches, T − 3 of which are internal. We use the notation np = T(T − 1)/2 to denote the number of pairs of taxa. In addition, branches can be assigned lengths to represent divergence between species. For a topology T a weighting function w : E(T) → W, where W ⊆ R, assigns a real number called a branch length to each branch in E(T). In this thesis branch lengths will be written in vector form as $\vec{\alpha} = [\alpha_1, \ldots, \alpha_{2T-3}]^{\top}$. For trees inferred from molecular sequence data, branch lengths can usually be interpreted as expected numbers of substitutions.

Finally, a phylogenetic tree (evolutionary tree) $\mathbf{T}$ is a pair $(\mathcal{T}, \vec{\alpha})$ where the topology $\mathcal{T}$ is labeled by S and $\vec{\alpha}$ is a vector of branch lengths. A rooted phylogenetic tree is defined similarly. The length $L_{\mathbf{T}}$ of a phylogenetic tree is defined simply as the sum total over all branches

$$L_{\mathbf{T}} = \sum_{i=1}^{2T-3} \alpha_i. \tag{3.3}$$

3.1.3 Newick Representation of a Phylogenetic Tree

The Newick representation of a phylogenetic tree is a computer-readable encoding

which works by exploiting the relationship between trees and nested parentheses.

Take, for example, the phylogenetic tree


[Figure 3.4: A phylogenetic tree. The diagram shows the terminal nodes 1, 2, 3, 4, 5 and the seven branch lengths 0.8, 0.4, 0.4, 0.9, 0.8, 0.5, 0.9.]

In Newick format it is the following string of characters:

((1:0.4,2:0.9):0.8,3:0.8,(4:0.5,5:0.9):0.4);

Internal nodes are represented by matching pairs of parentheses. Every node except the outermost is followed by a colon and the length of the branch leading to it. The commas separate three different subtrees and the string is always ended with a semicolon. The representation of a phylogenetic tree in Newick format is not unique, but this is of little concern here. A more detailed explanation of Newick format can be found in [8].
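To make the correspondence between nested parentheses and trees concrete, here is a minimal recursive-descent sketch. It handles only the subset of Newick used in this section (no quoted labels or comments), and the names `parse_newick`, `tree_length`, and `leaf_labels` are hypothetical. The example string assigns a length to every branch, including 0.4 for the branch above the clade of taxa 4 and 5; that value is an assumption taken from the seven lengths shown in Figure 3.4.

```python
def parse_newick(text):
    """Parse a Newick string into nested (children, label, length) tuples."""
    text = "".join(text.split())          # drop all whitespace
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                      # consume ")"
        start = pos
        while text[pos] not in ":,();":
            pos += 1
        label = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return children, label, length

    tree = node()
    if text[pos] != ";":
        raise ValueError("Newick string must end with ';'")
    return tree

def tree_length(tree):
    """Sum of all branch lengths, as in (3.3)."""
    children, _, length = tree
    return (length or 0.0) + sum(tree_length(c) for c in children)

def leaf_labels(tree):
    """Labels of the terminal nodes, read left to right."""
    children, label, _ = tree
    if not children:
        return [label]
    return [name for c in children for name in leaf_labels(c)]

tree = parse_newick("((1:0.4,2:0.9):0.8,3:0.8,(4:0.5,5:0.9):0.4);")
print(leaf_labels(tree), tree_length(tree))
```

Keeping the parsed tree as plain nested tuples avoids any dependence on a phylogenetics library; a production parser would also need to handle quoting and malformed input.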

3.1.4 Distance Measures

Sequence data for taxa labeled by S is often reduced to single numeric quantities

between pairs of species. Formally, a distance is a function δ : S × S → R≥0 which, for

all i, j, k in S, satisfies

i. δij = 0 ⇐⇒ i = j.

ii. δij = δji.

iii. δik ≤ δij + δjk.

For a pair of taxa i, j in S we call δij a pairwise distance (evolutionary distance) or just a distance between i and j. Also δj is sometimes used to represent a pairwise distance between the j’th pair of taxa when the context is clear. In vector notation $\vec{\delta} = [\delta_1, \ldots, \delta_{n_p}]^{\top}$.

An additive distance δ(T,w) is a distance δ on T with associated non-negative valued weighting w such that for each i, j ∈ L(T) we have

$$\delta_{ij} = \sum_{\alpha_l \in \mathrm{Path}(i,j)} w(\alpha_l) \tag{3.4}$$

This says that a distance is additive when the branch lengths sum up exactly to the

pairwise distances.

A tree additive distance, as above, satisfies the four point condition of Buneman [28]: namely, for any taxa i, j, k, l in a phylogenetic tree

$$\delta_{ij} + \delta_{kl} \le \max(\delta_{ik} + \delta_{jl},\ \delta_{jk} + \delta_{il}) \tag{3.5}$$

This serves as a criterion for gauging the additivity of distances computed from se-

quence data.
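As a rough illustration of how (3.5) can be used as a check, the sketch below tests the condition for every relabeling of four taxa. The distances are hypothetical: the first set is additive on the quartet ((1,2),(3,4)) with pendant branch lengths 0.1, 0.2, 0.3, 0.4 and internal branch length 0.5, and the second set perturbs a single entry.

```python
from itertools import permutations

def symmetric(pairwise):
    """Expand {(i, j): value} into a symmetric nested dictionary."""
    d = {}
    for (i, j), value in pairwise.items():
        d.setdefault(i, {})[j] = value
        d.setdefault(j, {})[i] = value
    return d

def satisfies_four_point(d, i, j, k, l, tol=1e-9):
    """Check (3.5) for every relabeling of the four taxa i, j, k, l."""
    for a, b, c, e in permutations((i, j, k, l)):
        lhs = d[a][b] + d[c][e]
        rhs = max(d[a][c] + d[b][e], d[b][c] + d[a][e])
        if lhs > rhs + tol:
            return False
    return True

additive = symmetric({(1, 2): 0.3, (1, 3): 0.9, (1, 4): 1.0,
                      (2, 3): 1.0, (2, 4): 1.1, (3, 4): 0.7})
perturbed = symmetric({(1, 2): 0.3, (1, 3): 1.5, (1, 4): 1.0,
                       (2, 3): 1.0, (2, 4): 1.1, (3, 4): 0.7})

print(satisfies_four_point(additive, 1, 2, 3, 4))
print(satisfies_four_point(perturbed, 1, 2, 3, 4))
```

The tolerance `tol` is an arbitrary choice to absorb floating-point rounding; distances estimated from real data would call for a statistical rather than an exact comparison.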

3.1.5 Matrix Encoding of a Phylogenetic Tree

Suppose T is a phylogenetic tree with branch lengths $\vec{\alpha}$. The pairwise distance δij, corresponding to the k’th pair, can be expressed as the sum

$$\delta_{ij} = \sum_{l} x_{kl}\,\alpha_l$$

of the branch lengths along the path from i to j, where xkl is 1 if the l’th branch is in the path from i to j and 0 otherwise. The matrix encoding of a phylogenetic tree T is the matrix X with kl’th entry equal to xkl. Note that a reordering of the rows of X simply translates to reordering the entries of $\vec{\delta}$.
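A small worked instance may help. The sketch below builds the matrix encoding for the quartet ((1,2),(3,4)), under the assumed branch ordering in which branches 1 to 4 are the pendant branches of taxa 1 to 4 and branch 5 is the internal branch; the branch lengths are hypothetical.

```python
# Quartet ((1,2),(3,4)): branches 1-4 are pendant, branch 5 is internal.
pairs = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
side = {1: "left", 2: "left", 3: "right", 4: "right"}

def encoding_row(i, j):
    """Row of X for the pair (i, j): 1 marks a branch on the path from i to j."""
    row = [0] * 5
    row[i - 1] = 1                 # pendant branch of taxon i
    row[j - 1] = 1                 # pendant branch of taxon j
    if side[i] != side[j]:
        row[4] = 1                 # the path crosses the internal branch
    return row

X = [encoding_row(i, j) for i, j in pairs]
alpha = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical branch lengths

# Pairwise distances as the matrix-vector product of X and alpha.
delta = [sum(x * a for x, a in zip(row, alpha)) for row in X]

for pair, row, dist in zip(pairs, X, delta):
    print(pair, row, round(dist, 3))
```

The rows follow the pair ordering 12, 13, 14, 23, 24, 34, so the entries of the resulting distance vector appear in that same order.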

3.1.6 Hamming Distance

An intuitively appealing means to reduce two nucleotide sequences xi and xj for taxa i, j to a pairwise distance is by taking the number of differences, called the Hamming distance

$$H_{ij} = \sum_{k=1}^{N} \bigl(1 - \delta(x_{ik}, x_{jk})\bigr) \tag{3.6}$$

where δ in this context is the Kronecker delta function, so that each site at which the two sequences differ contributes 1. However, as a simple model of

molecular sequence evolution this is a reckless approach because past site substitutions

are not taken into account; and hence, the Hamming distance acts merely as a lower

bound for the actual number of substitutions that have occurred. Any inferentially

sound method should account for past substitution events. If a complete account was

known, then the evolutionary relationships among the taxa could be reconstructed

exactly. But observed molecular sequences are not apt to reveal their secrets so easily.

Indeed, a probabilistic model will prove absolutely essential for site substitution; and

in this manner a more accurate summary of the number of substitutions since a

speciation can be had.
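For concreteness, the Hamming distance can be computed as below. The sequences are the first ten sites of the fictitious chimp and gorilla data in Table 2.1, and the function name is illustrative.

```python
def hamming(seq_i, seq_j):
    """Number of sites at which two aligned sequences differ."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_i, seq_j))

chimp = "ACACGGGTAA"
gorilla = "ACAACGTGTT"
print(hamming(chimp, gorilla))
```

Because a site that has changed twice, or changed and reverted, still counts at most once, this count can only understate the number of substitutions.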

3.2 Probabilistic Models of Site Substitution

Scholarly endeavours to understand site substitutions have produced a vast wealth of

theory on the subject. As such, any probabilistic model of evolution 1 should take this

into account; however, more intricate knowledge leads to more complicated models.

The approach taken here is to begin with some rather serious approximations to what

is actually known and use that as a starting point for inference.

3.2.1 Some Assumptions

Substitution over sites is assumed to be independent and identically distributed (iid),

implying that it suffices to consider models of substitution for a single site. Moreover,

given ancestral character states, substitution along a branch (at a site) is taken to

be conditionally independent; so it is the model for substitution along a branch that

needs to be described. For instance, in Figure 3.5 the conditional independence statement is: P(CGT|A) = P(C|A)P(G|A)P(T|A). So basically, it is the terms in this product which will be described by a probabilistic model.

[Figure 3.5: Conditional independence at a site for 3 taxa. The ancestral state A has descendant states C, G, and T.]

1 In this context evolution is used synonymously with site substitution. The distinction is immaterial as other sources of evolution under consideration have been dispensed with by considering only aligned sequences.

3.2.2 Markov Models

Our point of departure is to consider substitution at a site as a stochastic process {X(t) | t ∈ R≥0} with state space N so that X(t) is the nucleotide at a site at time t.2 And now, gradually, all of the elements necessary to construct meaningful probabilistic models of evolution have been introduced.

A stochastic process X(t) that satisfies the Markov property

$$P(X(t)=j \mid X(t_m)=i, X(t_{m-1})=i_{m-1}, \ldots, X(t_0)=i_0) = P(X(t)=j \mid X(t_m)=i) \quad \text{for all } t_0 < \cdots < t_m < t \tag{3.7}$$

is said to be a (continuous time) Markov process. The transition probability Pij(tm, t) =

P (X(t) = j|X(tm) = i) is the probability of changing from state i to state j over

the time from tm to t. In an evolutionary context this means, for example, if the

nucleotide at a site at time tm is A, then the probability of a substitution to, say, C

after a time t depends only on A; knowledge that G occurred prior to A would not

contain any information to alter the probability in question. When all Pij(tm, t) are

nonzero for t > tm the Markov process is irreducible and says there is always a chance

to go from any nucleotide to any other nucleotide. Furthermore when the transition

probabilities do not depend on tm the Markov process is said to be homogeneous

P (X(tm + t) = j|X(tm) = i) = P (X(t) = j|X(0) = i) (3.8)

and the transition probabilities are simply denoted by Pij(t) with tm taken to be

zero. This ignores the complexities of having transition probabilities changing in

2 Similarly with an amino acid state space A.

different parts of the tree. The Chapman-Kolmogorov equation provides an easy way

to compute Pij(t + s):

$$P_{ij}(t+s) = \sum_{k} P_{ik}(t)\,P_{kj}(s) \tag{3.9}$$

where the sum is over all possible states. The transition probabilities can be repre-

sented as a matrix P(t), called the transition matrix, with the (i, j)’th entry equal to

Pij(t). The Chapman-Kolmogorov equation has the particularly pleasing form

P(t + s) = P(t)P(s). (3.10)

A Markov process is stationary when

P (X(t) = j) = P (X(0) = j) for all t (3.11)

and the stationary probability P (X(t) = j) is denoted by πj or in vector form ~π or as

the diagonal entries of the matrix Π. 3 In other words stationarity restricts attention

to situations where nucleotide frequencies have reached some long run equilibrium

status.

By casting evolution as an irreducible, homogeneous, and stationary Markov process the resulting probabilistic models simplify vastly, at the expense of ignoring information from all but the previous substitution event, ignoring the location of substitutions in the tree, and presuming that nucleotide frequencies have attained some constancy. The cost of ignoring this information will depend on the particular problem

under consideration. However, if the loss is not fatal, then a tremendous benefit will

be gained. With just a few more simplifying approximations we shall gain the ability

to give an explicit formulaic representation of transition probabilities of nucleotide

substitution, and thus have summed up the whole of evolution in a formula.

We shall subscribe to the standard approach for deriving explicit transition proba-

bilities, a rather indirect approach to be sure. In practice, Markov models of evolution

tend to be defined in terms of rates of evolution from which transition probabilities

are subsequently derived. It can be established that when the probability of multiple state changes over time t is small (in o(t)),

$$\lim_{t \to 0} \frac{1 - P_{ii}(t)}{t} = \nu_i, \qquad \lim_{t \to 0} \frac{P_{ij}(t)}{t} = q_{ij} \quad \text{for } i \neq j.$$

Intuitively qij in Pij(t) = qij t + o(t) can be thought of as an instantaneous rate of substitution from i to j. The corresponding rate matrix Q has off-diagonal entry qij and diagonal entry qii = −νi = −∑j≠i qij. That is, the diagonal entries are defined so that the rows sum to zero.

3 It follows from irreducibility that the πj are greater than zero.

The transition and rate matrices are related by the Kolmogorov backward and

forward equations:

P′(t) = QP(t) (3.12)

P′(t) = P(t)Q (3.13)

The benefit of introducing a rate matrix is that it leads to a method to estimate the

transition probabilities from the forward-backward equations using matrix exponen-

tials. The solution to the forward-backward equations is

$$\mathbf{P}(t) = \exp(\mathbf{Q}t) = \sum_{i=0}^{\infty} \frac{\mathbf{Q}^i t^i}{i!}. \tag{3.14}$$

So given the rates the transition probabilities can be approximated by approximating

the infinite sum. A Markov process is time-reversible when the direction of transitions

in time is irrelevant. Formally,

P(X(t + t0) = j, X(t0) = i) = P(X(t + t0) = i, X(t0) = j) for all t, t0, i, j. (3.15)

In this special case the infinite sum above can be evaluated exactly thus yielding exact

transition probabilities. The reason lies in that the rate matrix for a time-reversible

Markov process satisfies the following property.

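Proposition 3.2 Q is the rate matrix of a time-reversible Markov process with stationary probabilities $\vec{\pi}$ if and only if Q = RΠ where

i. R is a symmetric matrix.

ii. $\mathbf{R}\vec{\pi} = \vec{0}$.

iii. πj > 0 for all j and ∑ πj = 1.

Since R is symmetric, so is $\Pi^{1/2}\mathbf{R}\,\Pi^{1/2}$, which is therefore diagonalizable by an orthogonal matrix W:

$$\Pi^{1/2}\,\mathbf{R}\,\Pi^{1/2} = \mathbf{W}\Lambda\mathbf{W}^{-1} \tag{3.16}$$

Because $\mathbf{Q} = \mathbf{R}\Pi = \Pi^{-1/2}\bigl(\Pi^{1/2}\mathbf{R}\,\Pi^{1/2}\bigr)\Pi^{1/2}$, the rate matrix can be written as

$$\mathbf{Q} = \mathbf{U}\Lambda\mathbf{U}^{-1} \tag{3.17}$$

where $\mathbf{U} = \Pi^{-1/2}\mathbf{W}$. Then finally, the transition probabilities can be computed exactly by

$$\mathbf{P}(t) = e^{\mathbf{Q}t} = \mathbf{U}e^{\Lambda t}\mathbf{U}^{-1}. \tag{3.18}$$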

From this point onward any reference to a Markov process should be understood to satisfy the litany of constraints outlined above. Such a Markov model is called a general time-reversible (GTR) Markov model. And under such conditions the transition probabilities are easily obtained given the instantaneous rates of evolution.

3.2.3 Branch Lengths

Even with probabilistic models, if rates of substitution vary in different lineages, we can only infer expected numbers of substitutions, not times since divergence. That is, a branch could be long because much time has passed, or because substitutions have occurred at a high rate. In conventional Markov models t represents time. Here it will represent evolutionary distance or expected numbers of substitutions. To ensure that branch lengths correspond to expected numbers of substitutions, the rate matrix is rescaled according to the condition

$$-\sum_{i} \pi_i q_{ii} = 1 \tag{3.19}$$

It can be shown that the desired interpretation of branch length is obtained in this manner.


3.2.4 Markov Models of Nucleotide Substitution

This section will serve as an exhibition of Markov processes for nucleotide evolution

at a site. In principle, the semantic details are as unanimated as they are unsen-

sational; however, at present a remarkable amount of freedom remains concealed

behind Markovian generalities. Special attention will be paid to embroidering the

GTR model with biological meaning and form; each particularization endowed with

its own character and nuances. A more detailed exposition of the models described

here can be found in [14].

The general theory described in the previous section can be applied, quite naturally, to nucleotide evolution at a site. With the nucleotide stationary probabilities πA, πC, πG, πT the rate matrix Q can be expressed as a 4 × 4 matrix where the rows and columns correspond to the nucleotides A, C, G, and T

$$\mathbf{Q} = \mu \begin{pmatrix} a & r_{AC}\pi_C & r_{AG}\pi_G & r_{AT}\pi_T \\ r_{AC}\pi_A & b & r_{CG}\pi_G & r_{CT}\pi_T \\ r_{AG}\pi_A & r_{CG}\pi_C & c & r_{GT}\pi_T \\ r_{AT}\pi_A & r_{CT}\pi_C & r_{GT}\pi_G & d \end{pmatrix} \tag{3.20}$$

where the diagonals are a = −(rACπC + rAGπG + rATπT), b = −(rACπA + rCGπG + rCTπT), c = −(rAGπA + rCGπC + rGTπT), and d = −(rATπA + rCTπC + rGTπG). µ represents the mean instantaneous substitution rate and the alphabetic parameters correspond to additional rate parameters specific to the nucleotide pairs.

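The equivalent representation of the rate matrix as the product of R and Π then becomes

$$\mathbf{R} = \begin{pmatrix} - & r_{AC} & r_{AG} & r_{AT} \\ r_{AC} & - & r_{CG} & r_{CT} \\ r_{AG} & r_{CG} & - & r_{GT} \\ r_{AT} & r_{CT} & r_{GT} & - \end{pmatrix} \tag{3.21}$$

and

$$\Pi = \begin{pmatrix} \pi_A & 0 & 0 & 0 \\ 0 & \pi_C & 0 & 0 \\ 0 & 0 & \pi_G & 0 \\ 0 & 0 & 0 & \pi_T \end{pmatrix} \tag{3.22}$$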

All subsequent models detailed in this section follow as special cases of this GTR model by introducing biologically meaningful restrictions on the rate parameters and nucleotide stationary probabilities.
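The structure of (3.20) and the rescaling condition (3.19) can be checked numerically. The sketch below is illustrative only: the function names and the particular rates and frequencies are hypothetical, chosen just to exercise the construction.

```python
NUCS = "ACGT"

def gtr_rate_matrix(r, pi, mu=1.0):
    """Build Q with q_ij = mu * r_ij * pi_j (i != j) and rows summing to zero.

    r  : dict mapping unordered pairs such as "AC" to the rates r_ij
    pi : dict mapping each nucleotide to its stationary probability
    """
    Q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(NUCS):
        for j, y in enumerate(NUCS):
            if i != j:
                key = "".join(sorted(x + y))
                Q[i][j] = mu * r[key] * pi[y]
        Q[i][i] = -sum(Q[i])       # diagonal is still 0.0 when summed
    return Q

def rescale(Q, pi):
    """Divide Q by -sum_i pi_i q_ii so that condition (3.19) holds."""
    rate = -sum(pi[x] * Q[i][i] for i, x in enumerate(NUCS))
    return [[q / rate for q in row] for row in Q]

# Hypothetical parameter values for illustration.
pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
r = {"AC": 1.0, "AG": 2.0, "AT": 1.0, "CG": 1.0, "CT": 2.0, "GT": 1.0}

Q = rescale(gtr_rate_matrix(r, pi), pi)
for row in Q:
    print([round(q, 4) for q in row])
```

Time-reversibility shows up as the detailed balance relation πi qij = πj qji, which holds here because rij is symmetric in its two indices.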

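The Jukes-Cantor model is the most constricting GTR model which retains some biological reasonability. The constraints are equal stationary probabilities (πA = πC = πG = πT = 1/4) and equal rates (rAC = rAG = rAT = rCG = rCT = rGT = 1). This gives rate matrix

$$\mathbf{Q} = \begin{pmatrix} -\tfrac{3}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu \\ \tfrac{1}{4}\mu & -\tfrac{3}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu \\ \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & -\tfrac{3}{4}\mu & \tfrac{1}{4}\mu \\ \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & -\tfrac{3}{4}\mu \end{pmatrix} \tag{3.23}$$

from which the transition probabilities can be computed by applying the main result of the previous section as

$$P_{ij}(t) = \begin{cases} \tfrac{1}{4} + \tfrac{3}{4}e^{-\mu t}, & i = j; \\ \tfrac{1}{4} - \tfrac{1}{4}e^{-\mu t}, & i \neq j. \end{cases} \tag{3.24}$$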

Notice that the transition probabilities depend on the product of the mean substitu-

tion rate µ and time duration t. Without loss of generality µ is by convention taken

to be 1 and the rij scaled appropriately. In this way branch lengths as computed later

will correspond to the expected number of substitutions along that branch, without

reference to the amount of time which had transpired.
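As a sanity check on (3.23) and (3.24), the closed form can be compared with a truncated version of the series (3.14), and the Chapman-Kolmogorov relation (3.10) can be verified directly. This is a sketch in plain Python; the helper names, the number of series terms, and the chosen values of t are arbitrary.

```python
from math import exp

def jc_rate_matrix(mu=1.0):
    """Jukes-Cantor rate matrix (3.23)."""
    return [[-0.75 * mu if i == j else 0.25 * mu for j in range(4)]
            for i in range(4)]

def jc_transition(t, mu=1.0):
    """Closed-form Jukes-Cantor transition matrix (3.24)."""
    same = 0.25 + 0.75 * exp(-mu * t)
    diff = 0.25 - 0.25 * exp(-mu * t)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(Q, t, terms=40):
    """Truncate the series (3.14): sum over i of Q^i t^i / i!."""
    n = len(Q)
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # i = 0 term
    total = [row[:] for row in term]
    for i in range(1, terms):
        term = [[x * t / i for x in row] for row in matmul(term, Q)]
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, term)]
    return total

def max_abs_diff(A, B):
    """Largest entrywise difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

series = expm_series(jc_rate_matrix(), 0.5)
print(max_abs_diff(series, jc_transition(0.5)))
print(max_abs_diff(matmul(jc_transition(0.3), jc_transition(0.4)),
                   jc_transition(0.7)))
```

Truncating the series works well for short branches such as these; for long branches or stiff rate matrices the eigendecomposition route is the more reliable one.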

Jukes-Cantor fails to take into account some rather well known features of nucleotide substitution. Substitutions are classified into two types: transitions, A ↔ G or C ↔ T, and transversions, meaning any other substitution. It is known in many settings that transitions occur more frequently than transversions. A variety of more elaborate GTR models have been derived which incorporate a transition-transversion ratio κ along with other nuances of nucleotide substitution.

One such model is F84, which will be used throughout this thesis, and as such explicit mention is warranted. The rate matrix is obtained from the GTR rate matrix by setting the transversion rates rAC, rAT, rCG, rGT to 1 and the transition rates to rAG = 1 + κ/πR and rCT = 1 + κ/πY:

$$\mathbf{Q} = \mu \begin{pmatrix} - & \pi_C & \pi_G(1 + \kappa/\pi_R) & \pi_T \\ \pi_A & - & \pi_G & \pi_T(1 + \kappa/\pi_Y) \\ \pi_A(1 + \kappa/\pi_R) & \pi_C & - & \pi_T \\ \pi_A & \pi_C(1 + \kappa/\pi_Y) & \pi_G & - \end{pmatrix} \tag{3.25}$$

where πR = πA + πG and πY = πC + πT. This approach groups nucleotides according to their molecular structure as either purines A, G or pyrimidines C, T. And as usual the diagonal entries are chosen to make the rows sum to zero. The corresponding transition probabilities are given by

$$P_{ij}(t) = \begin{cases} \pi_j + \pi_j\left(\dfrac{1}{\Pi_j} - 1\right)e^{-\mu t} + \left(\dfrac{\Pi_j - \pi_j}{\Pi_j}\right)e^{-\mu t(\kappa+1)}, & i = j; \\[2ex] \pi_j + \pi_j\left(\dfrac{1}{\Pi_j} - 1\right)e^{-\mu t} - \left(\dfrac{\pi_j}{\Pi_j}\right)e^{-\mu t(\kappa+1)}, & i \neq j, \text{ transition}; \\[2ex] \pi_j\left(1 - e^{-\mu t}\right), & i \neq j, \text{ transversion}. \end{cases} \tag{3.26}$$

where Πj = πA + πG if j is a purine and Πj = πC + πT when j is a pyrimidine.
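The relationship between the F84 rates (3.25) and transition probabilities (3.26) can be spot-checked numerically: each row of P(t) should sum to one, and Pij(t)/t should approach the corresponding rate as t tends to zero. The sketch below assumes hypothetical frequencies and a hypothetical value of κ; the function names are illustrative.

```python
from math import exp

NUCS = "ACGT"
PURINES = "AG"

def group_freq(pi, x):
    """Pi_j: total frequency of the purines or of the pyrimidines."""
    group = PURINES if x in PURINES else "CT"
    return sum(pi[y] for y in group)

def f84_transition(pi, kappa, t, mu=1.0):
    """Transition matrix from (3.26); rows and columns ordered A, C, G, T."""
    e1, e2 = exp(-mu * t), exp(-mu * t * (kappa + 1.0))
    P = []
    for x in NUCS:
        row = []
        for y in NUCS:
            big = group_freq(pi, y)
            base = pi[y] + pi[y] * (1.0 / big - 1.0) * e1
            if x == y:
                row.append(base + (big - pi[y]) / big * e2)
            elif (x in PURINES) == (y in PURINES):   # transition
                row.append(base - pi[y] / big * e2)
            else:                                    # transversion
                row.append(pi[y] * (1.0 - e1))
        P.append(row)
    return P

def f84_rate(pi, kappa, x, y, mu=1.0):
    """Off-diagonal entry of (3.25) for x != y."""
    if (x in PURINES) == (y in PURINES):
        return mu * pi[y] * (1.0 + kappa / group_freq(pi, y))
    return mu * pi[y]

# Hypothetical parameter values for illustration.
pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
kappa = 2.0

P = f84_transition(pi, kappa, 0.5)
print([round(sum(row), 12) for row in P])

h = 1e-6
P_small = f84_transition(pi, kappa, h)
print(P_small[0][2] / h, f84_rate(pi, kappa, "A", "G"))
```

The finite-difference comparison is only approximate, with an error of order h, which is why it is treated as a spot check rather than a proof.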

3.3 Amino Acid Substitution

GTR models of amino acid evolution can be defined analogously to the nucleotide

models introduced above. However, the resulting lofty number of parameters is prob-

lematic for inference. Not to mention that the subtleties of modeling protein evolu-

tion are more complicated than with nucleotides where much can be exploited from

consideration of transition-transversion ratio. Empirically rooted approaches circum-

vent these complications to some extent by estimating the frequencies of amino acid

substitutions from large amounts of aligned data, then using these as the transition

probabilities. In this manner the only parameters remaining are the 20 stationary

probabilities. The best known classes of empirically derived substitution matrices are the PAM [4], JTT [15], and WAG [34] models. All three of these approaches work by estimating the R matrix in the decomposition Q = RΠ of Proposition 3.2, instead of the transition probabilities directly.

The PAM (point accepted mutation) substitution matrices, introduced by Dayhoff, were the first empirical attempt at assigning transition probabilities to amino acid substitutions and date back to the late 1960’s. JTT (Jones, Taylor, and Thornton) dates from 1992 and is in essence an expanded version of PAM. Both

methods derive rates from protein sequences that are at least 85% identical by using

rather involved counting techniques. A detailed description can be found in [17].

However, using only closely related protein sequences ensures that a lot of information

is thrown away. WAG (Whelan and Goldman) avoids this shortcoming by using a

maximum likelihood approach to estimation.

3.4 Limitations of Markov Models

The simplicity of the Markov models introduced here can sometimes lead to the exclusion of vast amounts of relevant information about substitutions. The most serious problems tend to arise, first, from imposing rate homogeneity, that is, constant rates of substitution both at a site and over all sites, and secondly, from considering evolution to be stationary.

Rate heterogeneity over sites is evidenced by the fact that substitutions depend on sequence positioning in a very fundamental way. Substitutions at some sites result in drastic ramifications for a specimen unlucky enough to host them, while substitutions at other sites result in little to no change to the specimen. Secondly, rates are also known to vary at a fixed site due to selection pressures and other constraints.

An extreme but useful approach to accounting for rate heterogeneity is to hold

biologically critical sites invariable while leaving a constant rate at all other sites.

Though better than nothing, this approach relies on having some biological intuition

as to which sites are least able to change without horrible consequences, and it only

allows for a constant rate at other sites.

A more sophisticated treatment by Yang [14] assigns sites different rates according

to a gamma distribution Γ with shape parameter α. When α is small, the majority

of sites evolve slowly, leaving relatively few sites changing at higher rates. As α tends

to ∞ an equal rates model is obtained. In addition, this approach permits the rate at

a particular site to vary over time by changing α according to whatever constraints


are known. Needless to say, the computational cost of this additional accuracy can

be significant.

Stationarity of base frequencies also fails to hold in many situations. This is particularly evident when significantly different species are examined. A study by Lockhart et al. [21] established that violation of stationarity of base frequencies can result in serious errors in estimation. The preferred approach has been to transform resulting distance estimates by means of logdet or paralinear corrections as described in [14].

Chapter 4

Reconstruction Methods

In this chapter an overview of phylogenetic reconstruction methods is provided. The

most widely used approaches are parsimony, maximum likelihood, and distance based

estimation procedures.

4.1 The Inferential Problem

Let Ω be a subset of the topologies on T taxa with sequence data X. The inferential problem is to infer a topology T ∈ Ω together with branch lengths α, that is, a phylogenetic tree.

Computational complexity issues play a major role in determining the usefulness of a particular scheme. Indeed, by Proposition 3.1 there are 1 × 3 × · · · × (2T − 5) distinct topologies for a set of T taxa. Clearly as the set of taxa becomes large, searching over the entire tree-space becomes infeasible. The branch and bound algorithm is one approach which is particularly well suited for phylogenetic reconstruction [7].

4.2 Parsimony Methods

The principle of maximum parsimony is that the preferable tree is the one which

minimizes the number of substitutions needed to explain the data. It is a deterministic

model of evolution which was among the first used for phylogenetic reconstruction.

A simple example can be found in Figure 4.1. It shows a phylogenetic tree inferred from the nucleotide sequences CCT, CCC, TTC, CTC. With the ancestral sequences as labeled in the figure, four nucleotide substitutions are used in the reconstruction. No reconstruction can use fewer than three, since each of the three sites varies among the taxa; three suffice if, for instance, the common ancestor of TTC and CTC is labeled CTC rather than TTC.

[Figure 4.1: A parsimony example. The terminal nodes are labeled CCT, CCC, TTC, CTC; the internal nodes are labeled CCC and TTC, and the root CCC.]

In general, using parsimony requires a way to compute the minimum number of substitutions for any given topology and, in theory, applying this over the entire topology space. A nice overview of the nuances of various algorithms can be found in Hillis [14]. In practice, the tree space can be reduced by heuristics and by other standard approaches like the branch and bound algorithm.
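One standard way to compute the minimum number of substitutions on a fixed topology is Fitch's algorithm. The text does not say which algorithm is intended, so the sketch below is an assumption, shown only to make the computation concrete. Trees are written as nested pairs of hypothetical taxon names, and the sequences are those of the parsimony example.

```python
def fitch_site(node, states):
    """Return (candidate state set, substitutions) for one site below node."""
    if isinstance(node, str):
        return {states[node]}, 0
    (left, n_left), (right, n_right) = (fitch_site(c, states) for c in node)
    common = left & right
    if common:
        return common, n_left + n_right
    return left | right, n_left + n_right + 1

def fitch_score(tree, sequences):
    """Minimum number of substitutions over all sites on a rooted binary tree.

    tree      : nested 2-tuples whose leaves are taxon names
    sequences : dict mapping taxon name to an aligned sequence
    """
    n_sites = len(next(iter(sequences.values())))
    return sum(
        fitch_site(tree, {name: seq[s] for name, seq in sequences.items()})[1]
        for s in range(n_sites)
    )

sequences = {"a": "CCT", "b": "CCC", "c": "TTC", "d": "CTC"}
grouped_ab = (("a", "b"), ("c", "d"))
grouped_ac = (("a", "c"), ("b", "d"))

print(fitch_score(grouped_ab, sequences))
print(fitch_score(grouped_ac, sequences))
```

The score returned for a given topology can be compared with the number of substitutions read off a particular labeling of the internal nodes; the position of the root does not affect the score.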

Parsimony methods are generally considered to be of inferior quality when com-

pared to modern day maximum likelihood and distance methods. In its simplest

form the parsimony approach explicitly avoids incorporating hidden substitutions

into the reconstruction process. Since the premise is to minimize the net amount

of evolutionary change that has occurred, parsimony will tend to underestimate the

actual number of changes. In fact, parsimony has been shown to be inconsistent by

Felsenstein in very simple cases.

4.3 Maximum Likelihood Methods

Maximum likelihood reconstruction from sequence data is largely regarded as the

preferable means of estimation. In a phylogenetic context, this amounts to fixing

a model of evolution and then computing the likelihood of each tree topology for

the given sequence data; and, according to the likelihood principle, subsequently

choosing the tree with largest likelihood as the estimate. Unlike parsimony, likelihood

estimation takes full advantage of the Markov model groundwork laid in Chapter 3.

This standard statistical procedure is best outlined via a simple example.

Figure 4.2: Possible topologies for 4 taxa. (Three panels, T1, T2 and T3, each with tips Chimp, Gor, Oran and Lemur.)

Consider the aligned nucleotide data 2.1 along with the three possible topologies

T1, T2, T3 on 4 taxa as seen in Figure 4.2. Suppose a model of evolution is selected

with transition probabilities Pij(t). Then the likelihood function for the i’th topology

is

L (Ti, t) = P (X; Ti, t)

where P denotes the joint distribution of the data X = [x1 . . . xN ] and xj is the data at site j.

As noted earlier, the benefit of restricting attention to time-reversible models is that the location of the root has no effect on the following likelihood computation; thus we merely consider the unrooted topologies as above. Moreover, since evolution over sites is taken to be iid, the likelihood function simplifies considerably to

$$L(T_i, t) = \prod_{j=1}^{N} P(x_j; T_i, t)$$

and is usually evaluated as a sum in the form of the log-likelihood

$$\ell(T_i, t) = \sum_{j=1}^{N} \log P(x_j; T_i, t).$$

In words this amounts to evaluating the probability of the data over all possible

internal character states for each topology.

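Figure 4.3: Maximum likelihood at site 4. (Topology T1 with tip states C, A, T, T and internal nodes y1 and y2.)

For example, Figure 4.3 shows the data at site 4 applied to T1 with internal nodes y1 and y2. The log-likelihood function at j = 4 is then

$$\log \sum_{y_1, y_2 \in \mathcal{N}} P(C, A, T, T, y_1, y_2),$$

where the sum runs over the possible nucleotide states of the two internal nodes.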

The log-likelihood of the full tree is then the sum of the above formula taken over

all sites. In this case the likelihood computation would be done for each of the three

topologies and the topology which maximizes the likelihood function would be taken

as the estimate.
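The sum over internal states can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the Jukes-Cantor model with branch lengths in expected substitutions per site, assumes the first two tips attach to y1 and the last two to y2, and uses hypothetical branch lengths. None of these choices is taken from the data set discussed above.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def jc_prob(a: str, b: str, t: float) -> float:
    """Jukes-Cantor transition probability over a branch of length t
    (t in expected substitutions per site; an assumed parameterization)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_likelihood(tips, branch_lengths):
    """Likelihood of one site on the unrooted tree ((x1,x2),(x3,x4)).

    tips = (x1, x2, x3, x4); branch_lengths = (t1, t2, t3, t4, t_internal).
    The unobserved states y1, y2 of the internal nodes are summed out.
    """
    x1, x2, x3, x4 = tips
    t1, t2, t3, t4, t_int = branch_lengths
    total = 0.0
    for y1, y2 in product(NUCLEOTIDES, repeat=2):
        total += (0.25 * jc_prob(y1, x1, t1) * jc_prob(y1, x2, t2)
                  * jc_prob(y1, y2, t_int)
                  * jc_prob(y2, x3, t3) * jc_prob(y2, x4, t4))
    return total


lengths = (0.1, 0.1, 0.1, 0.1, 0.05)  # hypothetical branch lengths
lik = site_likelihood(("C", "A", "T", "T"), lengths)
print(math.log(lik))  # log-likelihood contribution of this site
```

Summing such log terms over all sites gives the log-likelihood of the tree; repeating the calculation for each candidate topology, with branch lengths optimized, gives the comparison described above.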

Maximum likelihood estimation is consistent and tends to give lower variances and

is more robust than both parsimony and distance based methods, though the esti-

mate need not be unique. But maximum likelihood methods can be computationally

expensive, in particular when conditions such as time-reversibility or iid site evolution

are relaxed. Indeed, even in the presence of these simplifications ML estimation has

been shown to be NP complete (in the number of taxa) in the context of phylogenetic

estimation. To reduce this burden various methods are employed to shrink the search

space including branch and bound. The branch and bound algorithm [13] provides a

globally convergent search strategy for criterion-based approaches. It is often still too

computationally intensive and heuristic search strategies are much more frequently

used with large numbers of taxa.

4.4 Distance Methods

Parsimony will tend to underestimate the number of substitutions along a branch. ML estimation does not suffer from this shortcoming; however, it can be computationally infeasible in many practical situations. Distance methods fall somewhere in between: they are a feasible alternative to maximum likelihood that does not have the same inherent failings as parsimony approaches.

All distance based reconstruction methods simplify the problem by reducing sequence data to more manageable pairwise distances $\vec{y}$ between the taxa under consideration. These distances represent the expected number of substitutions that have occurred between two sequences under some specified model. They then typically call upon either a hierarchical clustering algorithm or a more statistical method such as least squares to estimate a phylogenetic tree T.

Needless to say, transforming sequence data to a pairwise distance summary statistic loses information. In effect, relationships among multiple taxa at a site are forsaken for multiple relations between pairs of taxa. Consequently distance methods cannot hope to outperform likelihood estimation in general.

For distance based methods an often desired criterion is that the distances be tree additive as defined in Chapter 3. In this way, with infinite data, the estimated pairwise distances should coincide with the sum of the edge lengths along the corresponding paths in the estimated tree. A distance measure which satisfies this criterion is said to be tree additive.

4.4.1 ML Distances

A distance measure d : S × S → R≥0 leaves a substantial amount of freedom as to how it can be computed. Using maximum likelihood to compute pairwise distances will be of particular interest, and is outlined in this section. In a phrase: an ML distance is the estimate obtained by applying the maximum likelihood approach to two-taxon data.

For T taxa, ML distances will be denoted by $\vec{y}_{ML} = [d_{12}, \ldots, d_{1T}, d_{23}, \ldots, d_{T-1\,T}]^T$.¹ In general, they are computed by first specifying a model of substitution defining transition probabilities $P_{ij}(t)$. Then, for each pair of taxa i, j with corresponding sequence data $x_1, x_2, \ldots, x_N$, the maximum likelihood estimate $\hat{t}$ for $L(x_1, x_2, \ldots, x_N; t)$ acts as the pairwise distance $d_{ij}$. The resulting economy in computation stems from evaluating the maximum likelihood over a single branch for each pair of taxa instead of doing the computation over the whole tree. As a result, however, the tree must still be estimated from the reduced data.

¹ When convenient the distances may also be written as $\vec{y}_{ML} = [d_1, d_2, \ldots, d_{n_p}]^T$, and when clear from context the ML subscript may be dropped entirely.

The Jukes-Cantor model of substitution defined by 3.24 was the most basic Markov model considered earlier. The ML distance $d_{ij}$ can easily be computed for two taxa i and j. Consider the log likelihood function

$$\ell(x_1, \ldots, x_N; t) = \sum_{i=1}^{N} \log P_i(x_i; t)$$

$$= \sum_{i=1}^{N} \log P_i(x_{i1}, x_{i2}; t)$$

$$= \sum_{i=1}^{N} \log \left[ P_i(x_{i2} \mid x_{i1}; t)\, P(x_{i1}) \right]$$

$$= \sum_{i=1}^{N} \log \left[ P_{x_{i1} x_{i2}}(t)\, \pi_{x_{i1}} \right].$$

For the Jukes-Cantor model the transition probabilities depend only on whether $x_{i1}$ and $x_{i2}$ are the same or different. With this in mind, and dropping the constant contributed by $\pi_{x_{i1}} = 1/4$, the above formula simplifies to

$$\ell(\vec{x}_i; t) = n_s \log\left(\frac{1}{4} + \frac{3}{4} e^{-t}\right) + n_d \log\left(\frac{1}{4} - \frac{1}{4} e^{-t}\right)$$

where $n_s$ and $n_d$ are the number of sites which are the same and different respectively. The maximizer satisfies $e^{-\hat{t}} = 1 - \frac{4}{3}p$. Under this parameterization the expected number of substitutions per site is $\frac{3}{4}t$, so the ML distance is

$$d_{ij} = \frac{3}{4}\hat{t} = -\frac{3}{4} \log\left(1 - \frac{4}{3} p\right)$$

where p, called the p-distance, is the proportion of sites where the character state is different for the two taxa. Notice that since two independent sequences are expected to agree at 1/4 of sites, $d_{ij}$ tends to infinity as p approaches 3/4, indicating completely unrelated sequences.
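As a numerical check of the closed form, the sketch below maximizes the two-taxon Jukes-Cantor log-likelihood directly and compares the result with the formula. It assumes the common parameterization in which the distance d is measured in expected substitutions per site, so that two sites agree with probability 1/4 + (3/4)exp(−4d/3); with the exp(−t) form written above this corresponds to d = 3t/4. The site counts and helper names are hypothetical.

```python
import math


def jc_distance(p: float) -> float:
    """Closed-form Jukes-Cantor distance from the p-distance (needs p < 3/4)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p-distance must lie in [0, 3/4)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_loglik(d: float, n_same: int, n_diff: int) -> float:
    """Two-taxon log-likelihood up to an additive constant, with d in
    expected substitutions per site (assumed parameterization)."""
    decay = math.exp(-4.0 * d / 3.0)
    return (n_same * math.log(0.25 + 0.75 * decay)
            + n_diff * math.log(0.25 - 0.25 * decay))


def maximise(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximiser of a unimodal function."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) > f(d):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return 0.5 * (a + b)


n_same, n_diff = 80, 20  # hypothetical site counts, so p = 0.2
p = n_diff / (n_same + n_diff)
numeric = maximise(lambda d: jc_loglik(d, n_same, n_diff), 1e-6, 5.0)
print(jc_distance(p), numeric)
```

The two printed values should agree to several decimal places, which illustrates that the ML distance depends on the data only through the p-distance.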

As Markov models become more complicated in form, so too do the associated ML distance formulae. The ML distance for the F84 model, though, does have a reasonably concise formulation:

$$d_{ij} = -2A \log\left(1 - \frac{P}{2A} - \frac{(A - B)Q}{2AC}\right) + 2(A - B - C) \log\left(1 - \frac{Q}{2C}\right)$$

where $A = \pi_C \pi_T / \pi_Y + \pi_A \pi_G / \pi_R$, $B = \pi_C \pi_T + \pi_A \pi_G$, and $C = \pi_R \pi_Y$. Meanwhile P and Q are the fractions of sites that differ by transitions and transversions respectively.

In general the ML distances for a GTR model are given by

$$d_{ij} = -\operatorname{trace}\left[\Pi \log\left(\Pi^{-1} F_{xy}\right)\right] \qquad (4.1)$$

where $F_{xy}$ is a matrix, called the divergence matrix for taxa i and j, whose (a, b)th entry is the fraction $n_{ab}/N$ of sites showing nucleotide a in one sequence and nucleotide b in the other, with the usual ordering A, C, G, T.
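The F84 formula can be transcribed directly. The sketch below assumes that $\pi_R$ and $\pi_Y$ denote the purine and pyrimidine frequencies ($\pi_A + \pi_G$ and $\pi_C + \pi_T$), which is the usual convention but is not restated in this section. As a sanity check it compares the result with the Kimura two-parameter distance, to which the formula reduces when all four base frequencies equal 1/4. Function names are hypothetical.

```python
import math


def f84_distance(P, Q, pi_a, pi_c, pi_g, pi_t):
    """F84 distance from transition (P) and transversion (Q) proportions."""
    pi_r = pi_a + pi_g  # purine frequency (assumed definition)
    pi_y = pi_c + pi_t  # pyrimidine frequency (assumed definition)
    A = pi_c * pi_t / pi_y + pi_a * pi_g / pi_r
    B = pi_c * pi_t + pi_a * pi_g
    C = pi_r * pi_y
    first = 1.0 - P / (2.0 * A) - (A - B) * Q / (2.0 * A * C)
    second = 1.0 - Q / (2.0 * C)
    if first <= 0.0 or second <= 0.0:
        raise ValueError("distance undefined: sequences too divergent")
    return -2.0 * A * math.log(first) + 2.0 * (A - B - C) * math.log(second)


def k2p_distance(P, Q):
    """Kimura two-parameter distance, used here only as a cross-check."""
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


P, Q = 0.10, 0.05  # hypothetical observed proportions
print(f84_distance(P, Q, 0.25, 0.25, 0.25, 0.25), k2p_distance(P, Q))
print(f84_distance(P, Q, 0.30, 0.20, 0.20, 0.30))  # unequal frequencies
```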

4.4.2 Branch Length and Topological Estimation using Least Squares

Both branch length and topological estimation from pairwise distances can be treated as a linear modeling problem. The least squares optimality criterion will be used.

Suppose $\vec{\delta} = [\delta_{12}, \ldots, \delta_{1T}, \delta_{23}, \ldots, \delta_{T-1\,T}]^T$ are true pairwise distances for T taxa. For a topology T, the branch lengths $\vec{\alpha} = [\alpha_1, \ldots, \alpha_{2T-3}]^T$ are related to the distances through the linear model

$$\vec{\delta} = X\vec{\alpha}, \qquad (4.2)$$

where X is the matrix encoding of T (see Section 3.1.5). If $\vec{y} = [d_{12}, \ldots, d_{T-1\,T}]^T$ are estimated pairwise distances, then the linear model becomes

$$\vec{y} = X\vec{\alpha} + \vec{\epsilon}, \qquad (4.3)$$

with $\vec{\epsilon} = [\epsilon_{12}, \ldots, \epsilon_{T-1\,T}]^T$ a vector of measurement errors.
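For concreteness, here is a small sketch of Equation 4.2 for the 4-taxon topology ((1,2),(3,4)). It assumes the path-incidence encoding, in which an entry of X is 1 when the branch lies on the path joining the pair and 0 otherwise; the branch lengths are hypothetical.

```python
# Rows: pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
# Columns: pendant branches leading to taxa 1..4, then the internal branch.
X = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]

alpha = [0.02, 0.5, 0.03, 0.9, 0.05]  # hypothetical branch lengths

# delta = X alpha: each pairwise distance is the sum of the branch
# lengths along the path between the two taxa.
delta = [sum(x * a for x, a in zip(row, alpha)) for row in X]
print(delta)
```

With T = 4 there are 6 pairwise distances and 2T − 3 = 5 branch lengths, so X has 6 rows and 5 columns.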

The linear model described above is specific to the topology T and estimates branch lengths specific to that topology. Typically, an estimate $\hat{\alpha}$ of the branch lengths $\vec{\alpha}$ is obtained by using one of three variants of the least squares optimality criterion, an approach first proposed in the foundational work by Cavalli-Sforza and Edwards [3] published in 1967. In addition to branch length estimation, the least squares estimated phylogenetic tree T, from some set of phylogenetic trees of interest, is the one whose least squares criterion is computed to be a minimum.

The method of ordinary least squares (OLS) is the most basic and estimates branch lengths for a topology T by minimizing the unweighted sum of squared errors

$$\sum_{i<j} (d_{ij} - \delta_{ij})^2. \qquad (4.4)$$

OLS is sub-optimal when the variances are unequal or the covariances are nonzero.

Least squares estimation can be made more precise if additional information from the variances of the pairwise distances is used. When the variances $\mathrm{var}(d_{ij}) = v_{ij}$ are known, statistical theory ensures that the reciprocals $1/v_{ij}$ are optimal weights for the squared errors, so that the weighted least squares (WLS) optimality criterion is

$$\sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{v_{ij}}. \qquad (4.5)$$

An implementation of WLS by Fitch and Margoliash [9] uses weights $1/d_{ij}^2$, that is $v_{ij} = d_{ij}^2$, which presumes the variances are proportional to the squares of the pairwise distances. Neither OLS nor WLS takes full account of what is known about variances and covariances of pairwise distances. For instance, OLS does not take into account that larger distances have more variance than smaller ones, while WLS omits the correlation which arises between two distances $d_{ij}$ and $d_{kl}$ whose paths partially overlap on the topology under consideration. Moreover, the weighting used by Fitch and Margoliash is questionable because the variance more likely grows exponentially with distance [23].

The method of generalized least squares (GLS) takes covariances into account by minimizing the weighted sum of squares and cross products of residuals:

$$g_T = \sum_{i<j,\; k<l} w_{ij,kl}\,(d_{ij} - \delta_{ij})(d_{kl} - \delta_{kl}) \qquad (4.6)$$

$$= (\vec{y} - X\vec{\alpha})^T V^{-1} (\vec{y} - X\vec{\alpha}). \qquad (4.7)$$

Here V is the covariance matrix of $\vec{y}$, so that $v_{kl}$ is the covariance between the distances for the k'th and l'th pairs. Likewise we will write $w_{kl}$ for the (k, l)'th entry of the inverse covariance matrix $V^{-1}$.
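The GLS criterion can be minimized in closed form through the normal equations $X^T V^{-1} X \hat{\alpha} = X^T V^{-1} \vec{y}$. The following sketch does this with a small Gaussian elimination helper so that it needs only the standard library. It is an unconstrained fit under assumed inputs: the design matrix is the path-incidence encoding of the 4-taxon topology ((1,2),(3,4)), and the branch lengths and covariance matrix are hypothetical. Setting V to the identity gives OLS, and a diagonal V gives WLS.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def gls_fit(X, y, V):
    """Return (alpha_hat, g) minimising (y - X a)^T V^{-1} (y - X a)."""
    n, p = len(X), len(X[0])
    # Columns of V^{-1} X and the vector V^{-1} y, obtained by linear solves.
    vinv_x_cols = [solve(V, [X[i][j] for i in range(n)]) for j in range(p)]
    vinv_y = solve(V, y)
    xtvx = [[sum(X[i][a] * vinv_x_cols[b][i] for i in range(n))
             for b in range(p)] for a in range(p)]
    xtvy = [sum(X[i][a] * vinv_y[i] for i in range(n)) for a in range(p)]
    alpha = solve(xtvx, xtvy)
    resid = [y[i] - sum(X[i][j] * alpha[j] for j in range(p)) for i in range(n)]
    g = sum(r * w for r, w in zip(resid, solve(V, resid)))
    return alpha, g


# Path-incidence design matrix for ((1,2),(3,4)); rows are the pairs
# (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) (assumed encoding).
X = [[1, 1, 0, 0, 0], [1, 0, 1, 0, 1], [1, 0, 0, 1, 1],
     [0, 1, 1, 0, 1], [0, 1, 0, 1, 1], [0, 0, 1, 1, 0]]
alpha_true = [0.1, 0.2, 0.15, 0.3, 0.05]            # hypothetical
y_exact = [sum(x * a for x, a in zip(row, alpha_true)) for row in X]
V = [[0.01 if i == j else 0.002 for j in range(6)] for i in range(6)]

noise = [0.01, -0.02, 0.0, 0.015, 0.0, -0.01]       # hypothetical errors
y_noisy = [v + e for v, e in zip(y_exact, noise)]

alpha_hat, g_exact = gls_fit(X, y_exact, V)
alpha_noisy, g_noisy = gls_fit(X, y_noisy, V)
print(alpha_hat, g_exact)
print(alpha_noisy, g_noisy)
```

With exact distances the fit recovers the branch lengths and the statistic is zero; with perturbed distances the statistic is positive. Note that nothing here prevents negative branch length estimates.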

As with WLS, the optimal weights are the elements of the inverse covariance matrix. GLS (and WLS) were first proposed by Cavalli-Sforza and Edwards [3], but no method of actually computing the variances and covariances was established at that time.

GLS is well esteemed on several accounts: it takes into account covariances which are known to be relevant in a phylogenetic context; it is consistent and efficient; and, unlike OLS and WLS, the GLS test statistic $g_T$ has a $\chi^2$ distribution under the true topology. This will be of high importance in the subsequent work as it will enable the construction of confidence regions for topologies. But for now we stay on course with studying estimation proper.

Bulmer [2] provided computationally practical estimates of the variances and covariances so that GLS estimation can be realized. These estimates were specific to the models of Tajima and Nei [33]. Susko [31] later extended and refined the work of Bulmer to work for maximum likelihood distances and furthermore provided a computationally feasible means for hypothesis testing and confidence region construction.

4.4.3 GLS and its Relationship to WLS

The generalized least squares formulation of phylogenetic tree estimation can be transformed into an equivalent weighted least squares problem. The algebraic manipulations are straightforward. If the true covariances are known, then the covariance matrix V has eigen-decomposition

$$V = U \Lambda U^T \qquad (4.8)$$

where Λ is a diagonal matrix of nonnegative eigenvalues $\lambda_1, \ldots, \lambda_{n_p}$ and U is the matrix of corresponding eigenvectors with (i, j)'th entry $u_{ij}$. It follows that the GLS test statistic can be written in the form:

$$g_T = (\vec{y} - X\vec{\alpha})^T V^{-1} (\vec{y} - X\vec{\alpha}) \qquad (4.9)$$

$$= (\vec{y} - X\vec{\alpha})^T U \Lambda^{-1} U^T (\vec{y} - X\vec{\alpha}) \qquad (4.10)$$

$$= (U^T\vec{y} - U^T X\vec{\alpha})^T \Lambda^{-1} (U^T\vec{y} - U^T X\vec{\alpha}) \qquad (4.11)$$

$$= (\vec{y}^* - X^*\vec{\alpha})^T \Lambda^{-1} (\vec{y}^* - X^*\vec{\alpha}) \qquad (4.12)$$

where the entries of $\vec{y}^* = U^T \vec{y}$ are independent linear transformations of the distances in $\vec{y}$, and $X^* = U^T X$. Written in terms of the entries we have

$$g_T = \sum_{i} \frac{\left(y_i^* - (X^*\vec{\alpha})_i\right)^2}{\lambda_i} \qquad (4.13)$$

where $(X^*\vec{\alpha})_i$ is the i'th entry of $X^*\vec{\alpha}$.

In terms of a linear model the relationship between GLS and WLS is, for a given topology, to fit

$$\vec{y}^* = X^*\vec{\alpha} + \vec{\epsilon} \qquad (4.14)$$

as a WLS problem. Since the linearly transformed distances are independent, we need only be concerned with variances, and the $\lambda_i$ act as the variances of the $y_i^*$.
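The equivalence can be checked by hand on a tiny example. The sketch below uses a hypothetical 2 × 2 covariance matrix of the form [[a, b], [b, a]], whose eigenvectors and eigenvalues are known in closed form, and a hypothetical residual vector. It evaluates the statistic once in the GLS form of Equation 4.9 and once in the rotated WLS form of Equation 4.13.

```python
import math

a, b = 0.04, 0.01  # hypothetical variance and covariance
V = [[a, b], [b, a]]

# Orthonormal eigenvectors of [[a, b], [b, a]] are (1, 1)/sqrt(2) and
# (1, -1)/sqrt(2), with eigenvalues a + b and a - b.
s = 1.0 / math.sqrt(2.0)
U = [[s, s], [s, -s]]  # columns are the eigenvectors
lam = [a + b, a - b]

r = [0.03, -0.02]  # hypothetical residual vector y - X alpha

# GLS form: r^T V^{-1} r, using the closed-form inverse of a 2 x 2 matrix.
det = a * a - b * b
V_inv = [[a / det, -b / det], [-b / det, a / det]]
g_gls = sum(r[i] * V_inv[i][j] * r[j] for i in range(2) for j in range(2))

# WLS form: rotate the residual by U^T, then weight each squared entry
# by the reciprocal of the matching eigenvalue.
r_star = [sum(U[k][i] * r[k] for k in range(2)) for i in range(2)]
g_wls = sum(r_star[i] ** 2 / lam[i] for i in range(2))

print(g_gls, g_wls)
```

Both expressions give the same value, which is the content of Equations 4.9 to 4.13.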

Chapter 5

Confidence Regions and Hypothesis Tests for Topologies

Using Generalized Least Squares

The estimation procedures of the previous chapter produce a phylogenetic tree deemed to be optimal under some criterion, based implicitly or explicitly on a probabilistic model of substitution and aligned molecular sequence data. In this chapter the emphasis is placed on the underlying topology alone. In the ideal setting consisting of

sequence data of infinite length and the correct model of substitution, the estimated

topology will reflect the true history of substitution. In reality, sequence length will

be finite and the underlying substitution process is never known with complete cer-

tainty. Inevitably, sampling variation and model misspecification guarantee that the

estimated topology will not always coincide with the true branching pattern. More-

over, different estimation procedures often give rise to different estimated topologies.

In general, parsimony, maximum likelihood, and distance methods are by no means predestined to agree upon the estimated topology, and are indeed known to differ in even very basic examples. Thus some quantification of the uncertainty in an estimated topology is critical.

The classical approach to quantifying uncertainty in an estimated parameter is through hypothesis testing and the construction of confidence intervals. Unlike typical parameter spaces, the space of all topologies is discrete. So in a phylogenetic context the aim is to expand upon the estimated topology by constructing a set of topologies which contains the true topology with a specified level of probability, that is, a confidence region.

In Chapter 4 GLS was specifically tailored to address the problem of phylogenetic estimation from pairwise distances. Fortuitously, the GLS approach provides a natural framework for hypothesis testing and, in turn, the construction of confidence regions for topologies. The idea of using GLS for hypothesis testing is not new; indeed, its conception can be traced back to the foundational work of Cavalli-Sforza and Edwards [3]. But to implement this method it was necessary to know or estimate variances and covariances for the pairwise distances. Bulmer [2], much later, provided computationally practical estimates for distances specific to Tajima and Nei [33]. Susko [31] then extended and refined the work of Bulmer by providing a means to estimate variances and covariances for ML distances, the standard among contemporary distances. This chapter explores Susko's approach to constructing confidence regions for topologies based on the GLS test statistic along with the nuances of an accompanying software package.

5.1 Introduction

In this section an overview of hypothesis testing and the construction of confidence

regions for topologies using the GLS test statistic is presented. The material is a

natural continuation of the ideas found in Section 4.4.2.

Suppose Ω is a set of topologies of interest for T taxa. For T ∈ Ω, GLS can be used to test the hypotheses

H0 : T is the true topology,

HA : a different topology is the true topology,

at significance level α, because under the true topology the GLS test statistic $g_T$ has a $\chi^2$ distribution with k = T(T − 1)/2 − (2T − 3) degrees of freedom when the distribution of $\vec{y}$ is multivariate normal and the underlying model of substitution is correct. Here T(T − 1)/2 is the number of pairwise distances and 2T − 3 is the number of branch lengths estimated. A P value for the test is easily computed as the upper tail area of the $\chi^2(k)$ distribution beyond the observed test statistic. The GLS test statistic is better suited for testing this hypothesis than the corresponding OLS (4.4) or WLS (4.5) test statistics, which would have more complicated distributions that depend on the covariances between the pairwise distances.

Equivalently, a (1 − α) × 100% confidence region for the true topology can be constructed by computing the GLS test statistic for each topology in Ω. The resulting confidence region is the set of topologies with P value greater than or equal to α.
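The mechanics of turning GLS statistics into a confidence region can be sketched as follows. The chi-square tail area is computed with the standard finite series for integer degrees of freedom, so only the standard library is needed. The topology labels and the test statistics are hypothetical, and the function names are chosen for illustration only.

```python
import math


def chi2_sf(x: float, k: int) -> float:
    """Upper tail area P(chi2_k > x) for integer degrees of freedom k."""
    if x <= 0.0:
        return 1.0
    if k % 2 == 0:
        # exp(-x/2) * sum_{r=0}^{k/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, k // 2):
            term *= x / (2.0 * r)
            total += term
        return math.exp(-x / 2.0) * total
    # Odd k: erfc(sqrt(x/2)) plus a finite correction series.
    term, total = 1.0, 0.0
    for r in range(1, (k - 1) // 2 + 1):
        if r > 1:
            term *= x / (2.0 * r - 1.0)
        total += term
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * total)


def degrees_of_freedom(n_taxa: int) -> int:
    """k = T(T - 1)/2 - (2T - 3): distances minus branch lengths."""
    return n_taxa * (n_taxa - 1) // 2 - (2 * n_taxa - 3)


def confidence_region(g_stats, n_taxa, alpha=0.05):
    """Topologies whose P value is at least alpha, with their P values."""
    k = degrees_of_freedom(n_taxa)
    p_values = {name: chi2_sf(g, k) for name, g in g_stats.items()}
    return {name: p for name, p in p_values.items() if p >= alpha}


g_stats = {"T1": 0.8, "T2": 5.2, "T3": 12.4}  # hypothetical GLS statistics
region = confidence_region(g_stats, n_taxa=4)
print(degrees_of_freedom(4), degrees_of_freedom(7), region)
```

For 4 taxa there is a single degree of freedom, and for 7 taxa there are 10. In this made-up example only T1 is retained in the 95% region.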

5.2 GLS Test Statistic and Its Chi-Square Distribution

The $\chi^2$ distribution of $g_T$, and the subsequent hypothesis testing and confidence region mechanics, rely on the assumption that the distribution of the observed pairwise distances $\vec{y}$ is multivariate normal. Susko [31] showed, by an approach borrowed from Lehmann [20], that for ML estimated pairwise distances this is indeed the case asymptotically.

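Theorem 5.1 Suppose that $\vec{y} = [y_1, \ldots, y_{n_p}]^T$ is a vector of ML distance estimates for true distances $\vec{d}$. Then $\vec{y}$ has asymptotic multivariate normal distribution $N_{n_p}(\vec{d}, V)$ in the number of sites N. The entries of the covariance matrix are given by

$$V_{jj} = \mathrm{Var}(y_j) = \left( E\left[ -\frac{\partial^2}{\partial d_j^2} \log P_j(x^{(j)}; d_j) \right] \right)^{-1} \div N, \qquad (5.1)$$

$$V_{jk} = N V_{jj} V_{kk}\, E\left[ \frac{\partial}{\partial d_j} \log P_j(x^{(j)}; d_j)\, \frac{\partial}{\partial d_k} \log P_k(x^{(k)}; d_k) \right]. \qquad (5.2)$$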

Proof: (Sketch) Suppose $x_1, x_2, \ldots, x_N$ is sequence data for the j'th pair of taxa. Let $P_j(x_i; d_j)$ denote the probability of the data $x_i$ at a site i, where $d_j$ is the true pairwise distance. Suppose $y_j$ is the corresponding ML distance estimate for the pair. The log likelihood function for the j'th pair is given by

$$u_j(d_j) = \ell(x_1 x_2 \ldots x_N; d_j) = \sum_{i=1}^{N} \log P_j(x_i; d_j) \qquad (5.3)$$

with score function $u'_j(d_j)$ and second derivative $u''_j(d_j)$.

Consider now the first order Taylor expansion of the score function about the true value $d_j$ evaluated at $y_j$:

$$u'_j(y_j) = u'_j(d_j) + u''_j(d_j)(y_j - d_j) + o_p(1). \qquad (5.4)$$

Since $y_j$ is the MLE of $d_j$ it follows that $u'_j(y_j) = 0$. Hence the above equation can be rewritten as

$$\sqrt{N}(y_j - d_j) = \frac{1}{\sqrt{N}} I(d_j)^{-1} u'_j(d_j) + o_p(1), \qquad (5.5)$$

where $I(d_j) = -u''_j(d_j)/N$ is the observed Fisher information per site.

Now, define $\vec{u}(\vec{d})$ to be the vector with j'th entry

$$u_j = \frac{1}{\sqrt{N}} I(d_j)^{-1} u'_j(d_j), \qquad (5.6)$$

so that in vector form Equation 5.5 becomes

$$\sqrt{N}(\vec{y} - \vec{d}) = \vec{u} + o_p(1). \qquad (5.7)$$

Each $u_j$ is a sum (over the number of sites) of independent random variables with mean 0. Hence by the multivariate central limit theorem

$$\vec{u} \sim N_{n_p}(0, N V), \qquad (5.8)$$

where the entries in V are given by Equations 5.1 and 5.2, which are simply the inverse expected Fisher information written in expanded form.

Finally, since

$$\vec{y} \sim \frac{1}{\sqrt{N}} \vec{u} + \vec{d}, \qquad (5.9)$$

it follows that

$$\vec{y} \sim N_{n_p}(\vec{d}, V). \qquad (5.10)$$

Hence the distribution of $\vec{y}$ is asymptotically multivariate normal.

The variances (5.1) and covariances (5.2) are first order approximations. There are also second order approximations which are asymptotically equivalent to those derived above, but are strictly nonnegative. These will be referred to as the nonnegative definite (NND) variances.

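Corollary 5.2 The variance $V_{jj}$ and covariance $V_{jk}$ are asymptotically equivalent to the NND variances

$$V_{jj} \sim V'_{jj} = \left( E\left[ \left( \frac{\partial}{\partial d_j} \log P_j(x^{(j)}; d_j) \right)^2 \right] \right)^{-1} \div N, \qquad (5.11)$$

and

$$V_{jk} \sim V'_{jk} = N V'_{jj} V'_{kk}\, E\left[ \frac{\partial}{\partial d_j} \log P_j(x^{(j)}; d_j)\, \frac{\partial}{\partial d_k} \log P_k(x^{(k)}; d_k) \right]. \qquad (5.12)$$

Proof: The result follows directly by substituting the asymptotically equivalent form

$$-\frac{\partial^2}{\partial d_j^2} \log P_j(x^{(j)}; d_j) = \left( \frac{\partial}{\partial d_j} \log P_j(x^{(j)}; d_j) \right)^2 + o_p(1), \qquad (5.13)$$

in the variance (5.1) and covariance (5.2) formulas.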

These alternative formulas for the variances and covariances were not mentioned

in the original paper, but are used in the implementation which will be discussed

below.

5.3 Estimation of the Covariance Matrix V

In practice, the covariances will depend upon the unknown relationships between the

pairs of taxa and therefore need to be estimated. The chi-square limiting distribution

of the GLS test statistic will apply so long as the method of estimation is consistent.

Susko provides two such estimation methods.

The first method, the sample average method, uses a sample average approximation to estimate the expected values inside Equations 5.1 and 5.2. The sample average approximations are

$$\hat{V}_{jj} = \widehat{\mathrm{Var}}(y_j) = \left( N^{-1} \sum_{i=1}^{N} -\frac{\partial^2}{\partial d_j^2} \log P_j(x_i; d_j) \right)^{-1} \div N, \qquad (5.14)$$

$$\hat{V}_{jk} = N \hat{V}_{jj} \hat{V}_{kk}\, N^{-1} \sum_{i=1}^{N} \frac{\partial}{\partial d_j} \log P_j(x_i; d_j)\, \frac{\partial}{\partial d_k} \log P_k(x_i; d_k). \qquad (5.15)$$

The sample average approximations to the NND equivalents are obtained in an analogous manner. The NND approximations are themselves guaranteed to be nonnegative definite, as will be shown in Theorem 6.4.
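To illustrate the sample average variance, the sketch below evaluates it for a single pair of taxa under the Jukes-Cantor model rather than F84, which keeps the derivatives simple. It assumes distances measured in expected substitutions per site, evaluates the second derivative numerically at the ML distance, and compares the result with the well-known Jukes-Cantor variance p(1 − p)/(N(1 − 4p/3)²). The site counts and function names are hypothetical.

```python
import math


def jc_site_logprob(same: bool, d: float) -> float:
    """Log probability of one site pattern for two taxa under Jukes-Cantor."""
    decay = math.exp(-4.0 * d / 3.0)
    prob = 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay
    return math.log(0.25 * prob)  # joint probability of the two states


def second_derivative(f, d, h=1e-4):
    """Central finite-difference approximation to f''(d)."""
    return (f(d + h) - 2.0 * f(d) + f(d - h)) / (h * h)


def sample_average_variance(n_same: int, n_diff: int, d_hat: float) -> float:
    """Sample average estimate of Var(y_j), following the form of Equation 5.14."""
    n = n_same + n_diff
    info_same = -second_derivative(lambda d: jc_site_logprob(True, d), d_hat)
    info_diff = -second_derivative(lambda d: jc_site_logprob(False, d), d_hat)
    avg_info = (n_same * info_same + n_diff * info_diff) / n
    return 1.0 / avg_info / n


n_same, n_diff = 80, 20  # hypothetical site counts
n = n_same + n_diff
p = n_diff / n
d_hat = -0.75 * math.log(1.0 - 4.0 * p / 3.0)

estimate = sample_average_variance(n_same, n_diff, d_hat)
closed_form = p * (1.0 - p) / (n * (1.0 - 4.0 * p / 3.0) ** 2)
print(estimate, closed_form)
```

The covariance in Equation 5.15 would be handled in the same way, by averaging over sites the product of the two first derivatives for pairs j and k.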

The second method uses a nonparametric bootstrap. Details are provided in [31], but the resulting estimates for the variances and covariances are

$$\hat{V}_{jj} = B^{-1} \sum_{b=1}^{B} \left(y_j^{(b)} - y_j\right)^2, \qquad (5.16)$$

and

$$\hat{V}_{jk} = B^{-1} \sum_{b=1}^{B} \left(y_j^{(b)} - y_j\right)\left(y_k^{(b)} - y_k\right), \qquad (5.17)$$

where $y_j^{(b)}$ is the b'th bootstrap estimate for ML distance $y_j$, with B the total number of bootstrap estimates.

Both the sample average method and the nonparametric bootstrap are consistent estimators for the covariance matrix V as the number of sites becomes large, and hence both are valid to embed in the GLS methodology for constructing confidence regions.
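A minimal sketch of the bootstrap estimates in Equations 5.16 and 5.17 follows. It resamples alignment columns with replacement, recomputes the pairwise distances for each replicate, and accumulates squared deviations and cross products about the original estimates. For brevity it uses Jukes-Cantor distances rather than F84, and the three toy sequences, the number of replicates and the seed are all hypothetical.

```python
import math
import random


def jc_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance between two aligned sequences."""
    diff = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    p = diff / len(seq_a)
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor formula")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def pairwise_distances(seqs):
    """Distances in the order (1,2), (1,3), ..., (T-1,T)."""
    n = len(seqs)
    return [jc_distance(seqs[i], seqs[j])
            for i in range(n) for j in range(i + 1, n)]


def bootstrap_covariance(seqs, n_boot=200, seed=1):
    """Bootstrap estimate of the covariance matrix of the distances."""
    rng = random.Random(seed)
    n_sites = len(seqs[0])
    y = pairwise_distances(seqs)
    n_pairs = len(y)
    V = [[0.0] * n_pairs for _ in range(n_pairs)]
    for _ in range(n_boot):
        # Resample alignment columns (sites) with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = ["".join(s[c] for c in cols) for s in seqs]
        y_b = pairwise_distances(resampled)
        for j in range(n_pairs):
            for k in range(n_pairs):
                V[j][k] += (y_b[j] - y[j]) * (y_b[k] - y[k]) / n_boot
    return y, V


seqs = [  # hypothetical aligned sequences of length 40
    "ACGT" * 10,
    "ACGTACGAACGTACGTACCTACGTACGTACGTAGGTACGT",
    "ACGTTCGTACGTACGTACGTACATACGTACGTACGTACGA",
]
y, V = bootstrap_covariance(seqs)
print(y)
for row in V:
    print(row)
```

Because every replicate requires all pairwise distances to be recomputed, the cost grows with both the number of replicates and the number of pairs.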

5.4 Software Implementation

Susko provides software, based on the PHYLIP package [8], which computes the GLS test statistics for a file of input trees and outputs the trees sorted according to their P values. The two main routines

1. glsprot: for amino acid data

2. glsdna: for DNA data

together with a detailed usage file are available for download at [32]. Briefly, the glsprot routine has the same options as the protdist program of the PHYLIP distribution version 3.6a3. This includes the ability to use the PAM substitution model (3.3) to estimate transition probabilities. The glsdna routine allows the user to specify a transition/transversion ratio along with rates for the rate distribution accompanied by their probabilities. Transition probabilities are then derived from the F84 model.

The GLS test statistics are evaluated using a variation of the NNLS routine [19]

which forces a non-negativity constraint on the estimated branch lengths. Susko

conjectures, rather convincingly, that branch lengths, and consequently topologies,

should be estimated more accurately with the restriction based on the findings of

Kuhner and Felsenstein [18]. Moreover, negative branch lengths are not biologically

meaningful anyway. Reasonable estimates for topologies can be expected to have

positive or near zero branch lengths. But permitting negative branch length estimates

allows GLS the freedom to fit the data to poor topologies. As an estimation procedure

this should not make a big difference as the true tree is not expected to have strange

forms in general. However, the issue becomes more of a concern when computing

confidence regions because poor trees devoid of biological meaning can sometimes

fit the data well when negative branch lengths are permitted. The most striking

consequence of the non-negativity constraint is that some topologies can be given

identical GLS test statistics because branches were estimated as 0, making them

equivalent.

The glsprot and glsdna routines implement the sample average method for computing the covariance matrix V rather than using the bootstrap. There are two reasons: first, results presented by Susko suggest both methods yield similar estimates; secondly, the bootstrap approach requires much longer computational times. An additional detail of the software implementation not discussed in the original paper is how the two asymptotically equivalent variance formulae are used. Due to numerical errors, the estimate of V may be singular, causing the programs to crash. To help avoid this source of failure, the sample average approximations are applied first. If the resulting covariance matrix is singular, then the NND sample average approximations are used instead. The consequences of this approach are explored in more detail in the coming chapters.

Chapter 6

Generalized Least Squares Small Sample Simulation

Though GLS rests upon a firm theoretical foundation for large sample sizes, in general it is unknown how it will perform when confronted with small samples. And since a large number of genomic data sets have sequence length 1000 or less, understanding the performance of GLS under such asymptotically hostile conditions is an avenue of investigation of real practical interest to the working phylogeneticist.

Indeed, the main objectives of this chapter are to a) distinguish what constitutes a

large sample, b) characterize features of small sample data sets which lead to poor

estimation by the GLS software, and c) exhibit the underlying causes of the poor

performance. These inquiries are addressed through simulation for both 4 and 7

taxon examples. These results will help to clarify which real life data sets will prove

problematic for the GLS software and, as well, enable the practitioner to extract the

most information possible from the computed confidence regions.

6.1 Motivation and Methods for Small Sample Simulation

The theory underlying the methods presented here relies on the number of sites being large. In practice, the behaviour of GLS will differ from the theoretical ideals for three reasons:

1. The number of sites is always finite.

2. In theory, V and $\vec{\alpha}$, along with all other quantities, are calculated exactly. In practice they contain estimation errors.

3. The substitution model used in constructing the distances is always an approximation to the actual substitution process.

The remainder of this chapter is devoted to addressing the first two of these complications, while sidestepping issues of numerical error and robustness.

Both the chi-square distribution of the GLS test statistic and the sample average approximation for the variance-covariance matrix are large sample results, and both depend on ML distance estimates from the F84 substitution model. Exactly what constitutes a large number of sites is unclear, but one would like to think the asymptotics carry over to smaller samples, so long as the number of taxa is not too large and the estimated ML distances do not lie too close to their boundary values. However, this is merely intuition and must be confirmed. Simulation is used here to gauge the consequences of these limitations for smaller sample sizes.

The idea is to compute confidence regions from simulated nucleotide sequence

data. The trees which were used to simulate the data break down into three cases

as outlined below. For each simulation, the three trees used are chosen to provide

increasing problems with estimation. In each case, Tree 1 represents a form which

GLS should handle with ease. Trees 2 and 3 are made to induce long branch attraction

in Sim I and II, and a star topology in Sim III.

Sim I Three trees based on a 4 taxa topology with tips labelled 1 to 4.

1. ((1:0.3,2:0.3):0.3,3:0.3,4:0.3);

2. ((1:0.02,2:0.5):0.05,3:0.03,4:0.9);

3. ((1:0.01,2:0.5):0.01,3:0.01,4:0.9);

Sim II Three trees based on a 7 taxa topology with tips labelled 1 to 7.

1. (3:0.3,((1:0.3,2:0.3):0.3,(5:0.3,(6:0.3,7:0.3):0.3):0.3):0.3,4:0.3);

2. (3:0.03,((1:0.02,2:0.5):0.05,(5:0.3,(6:0.6,7:0.6):0.3):0.05):0.3,4:0.9);

3. (3:0.01,((1:0.01,2:0.5):0.01,(5:0.3,(6:1.1,7:1):0.3):0.01):0.3,4:0.9);

Sim III Three trees based on a second 7 taxa topology with tips labelled 1 to 7.

1. (1:0.3,2:0.3,(3:0.3,(4:0.3,(5:0.3,(6:0.3,7:0.3):0.3):0.3):0.3):0.3);

2. (1:0.02,2:0.7,(3:0.1,(4:0.3,(5:0.7,(6:0.6,7:0.9):0.05):0.04):0.03):0.05);

3. (1:0.01,2:0.9,(3:1,(4:0.3,(5:1.3,(6:1.1,7:1.2):0.01):0.01):0.01):0.01);

Each tree is used as a true tree from which nucleotide data are simulated and 95% confidence regions computed according to the following specifications:

1. 10000 nucleotide data sets are generated for each of the sequence lengths 50, 100, 250, 500, 1000, 5000, and 10000 using the PHYLIP program SEQ-GEN with the F84 substitution model and a transition/transversion ratio of 2.0.

2. 95% confidence regions are constructed using the glsdna program using both the estimated covariances and the true covariances.¹

We examine two summaries of the results: coverage probability and confidence region size. The numbers of taxa used here were small enough that all topologies could be fitted.
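The two summaries can be computed from the simulated regions as in the sketch below. It assumes each confidence region has already been reduced to a set of topology labels; the labels and the four example regions are hypothetical. A binomial standard error is included to indicate how precisely a coverage probability is pinned down by a given number of simulated data sets.

```python
import math


def summarise_regions(regions, true_topology):
    """Coverage probability and mean size of a list of confidence regions.

    regions: list of sets of topology labels, one set per simulated data set.
    """
    n = len(regions)
    coverage = sum(1 for region in regions if true_topology in region) / n
    mean_size = sum(len(region) for region in regions) / n
    return coverage, mean_size


def coverage_standard_error(coverage: float, n_data_sets: int) -> float:
    """Binomial standard error of an estimated coverage probability."""
    return math.sqrt(coverage * (1.0 - coverage) / n_data_sets)


regions = [{"T1"}, {"T1", "T2"}, {"T2"}, {"T1", "T3"}]  # hypothetical
coverage, mean_size = summarise_regions(regions, "T1")
print(coverage, mean_size)
print(coverage_standard_error(0.95, 10000))
```

With 10000 data sets and a nominal level of 95%, the standard error is roughly 0.002, so departures from nominal coverage of a percentage point or more would be clearly detectable.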

6.2 Results

Confidence regions were generated for the Sim I, II, and III trees according to the above specifications. It should be noted that of the 10000 data sets generated for a given sequence length, not all could be used to compute a confidence region. Indeed, this is the main topic of the coming chapters. But for now, we focus on the cases when a confidence region is made.

¹ The "true" covariances were obtained by computing the sample average estimates for sequence length 25000.

6.2.1 Maximum Likelihood Distances and The Multivariate Normal Distribution

Theorem 5.1 establishes that ML estimated distances are multivariate normal asymp-

totically in the number of sites. The purpose of this section is to investigate the

behaviour of the ML distances, computed from the simulated data, for small sample

sizes. For each of the 10000 simulated data sets, the ML estimated distances were

computed and frequency plots were constructed for the quantities yij − dij; that is to

say the differences between the estimated and true distances. According to the large

sample theory the histograms should look normal with mean located at 0.

A complete summary of the results would prove as long as it would be repetitive.

In light of that, frequency plots are provided merely for Sim I and II for Tree 2 in

both cases.

The histograms for Sim I are found in Figure 6.1 and Figure 6.3 (left) and show some potentially serious deviation from multivariate normality for the smaller sequence lengths examined. Most noticeable is the skewed nature of the plots for sequence length 50 and 250, with long right tails and some rather large outliers. These outliers correspond to distances which have been seriously overestimated.

A few of the histograms for Sim II are shown in Figure 6.2 and Figure 6.3 (right). They exhibit features similar to the Sim I histograms, such as the tendency to overestimate distances for the smaller sequence lengths. However, in this case, many of the histograms are shifted to the left of 0, an indication that for small samples there is some systematic underestimation happening. Indeed, this feature is also apparent in some of the Sim I histograms for sequence length 50, in particular for the histogram associated with y23 − d23.

So it seems, based on these results, that there are two underlying concerns with

the ML estimation for small samples; first, the tendency to drastically overestimate

distances a small amount of the time, and secondly, some more systematic underes-

timation, which is more serious as the number of taxa gets large.

Though corresponding results for Trees 1 and 3 are not shown, the features described above are, as would be expected, less evident for Tree 1 and more evident for Tree 3 in all three cases. Also it should be noted that the

major deviations from normality occur only for sequence lengths 500 or less, and are

essentially removed for sequence length 1000 and higher. Of course, it goes without

saying that the results found here, being based on only a handful of trees, do not

entitle us to make any general conclusions. But the main point here is simply to have

some sense of what might go wrong in practice.

[Figure 6.1 appears here: six histogram panels, one each for y12 − d12, y13 − d13, y14 − d14, y23 − d23, y24 − d24, and y34 − d34, with frequency on the vertical axis.]

Figure 6.1: Histograms of the ML estimated distance errors yij − dij for the 4 taxa case with Tree 2. The sequence length is 50.

[Figure 6.2 appears here: eight histogram panels, one each for y12 − d12, y13 − d13, y14 − d14, y15 − d15, y16 − d16, y17 − d17, y23 − d23, and y24 − d24, with frequency on the vertical axis.]

Figure 6.2: Histograms of the ML estimated distance errors yij − dij for the first 7 taxa case with Tree 2. The sequence length is 50.

[Figure 6.3 appears here: two columns of histogram panels. Left column, "4 Taxa Case: Tree 2", shows y23 − d23; right column, "7 Taxa Case: Tree 2", shows y13 − d13. The rows correspond to sequence lengths 10000, 1000, 250, and 50.]

Figure 6.3: Left: Histogram for y23 − d23 for increasing sequence length. Right: Histogram for y13 − d13 for increasing sequence length.

6.2.2 The Relationship Between Sample Average Variance and Distance

Now for a minor digression. It is generally thought that variance increases expo-

nentially with pairwise distance [10]. This section contains the methods and results

of a simple simulation undertaken to check this supposition for the sample average

method as applied in the GLS software.

To gauge how the variance changes with distance, a sequence of seven 4 taxa trees is considered with the same topology as the 4 taxa topology above. The pairwise distance for taxa 1 and 2 is assigned the values d12 = 0.125, 0.25, 0.375, 0.5, 1, 2, 3, while all other branch lengths are held constant. Then 10000 nucleotide data sets are generated for each of the sequence lengths 50, 250, 1000, and 10000 using the PHYLIP program SEQ-GEN with the F84 substitution model and a transition/transversion ratio of 2.0. For each data set, the variances are computed using sample average approximations 5.1 and 5.2.

[Figure 6.4 appears here: two panels with Branch Length on the horizontal axis. The left panel shows Median Variance and the right panel shows Median Variance (log). Separate curves are drawn for sequence lengths 50, 250, 1000, and 10000.]

Figure 6.4: Exponential relationship between distance and sample average variance. Left: A plot of d12 vs the median variance. Right: A plot of d12 vs the log median variance.

In Figure 6.4, the relationships between distance and the median variance are

shown. In each case, an exponential relationship seems reasonable. This is especially

clear for smaller sequence lengths, as the variances are particularly misbehaved for

the larger pairwise distances.
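As a rough analytic cross-check of the exponential shape, the sketch below evaluates the standard large-sample (delta method) variance of the distance estimate under the Jukes-Cantor (JC69) model. This is an assumption made for illustration: the simulation above uses F84 and the sample average formula, so only the qualitative shape is comparable. The function name is hypothetical.

```python
import numpy as np

def jc69_variance(d, n_sites):
    """Large-sample variance of the JC69 distance estimate at true distance d."""
    # Probability of a difference at a site; note 1 - 4p/3 = exp(-4d/3).
    p = 0.75 * (1.0 - np.exp(-4.0 * d / 3.0))
    return p * (1.0 - p) / (n_sites * (1.0 - 4.0 * p / 3.0) ** 2)

d12 = np.array([0.125, 0.25, 0.375, 0.5, 1.0, 2.0, 3.0])
for n in (50, 250, 1000, 10000):
    var = jc69_variance(d12, n)
    # Slope of log-variance between the two largest distances.
    slope = (np.log(var[-1]) - np.log(var[-2])) / (d12[-1] - d12[-2])
    print(n, var, round(float(slope), 3))
```

Under this model the variance equals p(1 − p) exp(8d/3) / N, so the log-variance is close to linear in d for large distances, with slope approaching 8/3.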

6.2.3 Two Asymptotic Variance Formulae

In theory, $V^{-1}$ is calculated exactly, but in practice it will contain numerical errors. Numerical errors can cause negative variances, making V singular and rendering the programs useless. Though not mentioned in the paper, Susko uses a combination of both asymptotically equivalent variance formulas in the implementation as a first line of defence against singularity in the covariance matrix. First the sample average approximation 5.14 and 5.15 is tried. If the resulting covariance matrix is singular, then the NND approximation is invoked to recompute the covariance matrix. If this too fails then the alternative approaches discussed in Chapter 8 must be resorted to.
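The two-stage strategy described above can be sketched as follows. This is not the GLS software's code: `sample_average` and `nnd` are hypothetical caller-supplied functions standing in for approximations 5.14/5.15 and the NND formula, and the positive definiteness check with tolerance `tol` is an assumed stand-in for whatever singularity test the implementation uses.

```python
import numpy as np

def is_usable(V, tol=1e-12):
    """Return True if V is symmetric positive definite to within tol."""
    eigenvalues = np.linalg.eigvalsh((V + V.T) / 2.0)
    return bool(eigenvalues.min() > tol)

def choose_covariance(sample_average, nnd, data):
    """Try the sample average estimate first, then fall back to the NND estimate.

    Both estimators are hypothetical functions mapping data to a covariance matrix.
    """
    for name, estimator in (("sample average", sample_average), ("NND", nnd)):
        V = estimator(data)
        if is_usable(V):
            return name, V
    # Neither estimate can be inverted; other approaches are then needed.
    raise np.linalg.LinAlgError("both covariance estimates are singular")

# Toy stand-ins: the first estimate has a negative variance, the second is fine.
bad = lambda data: np.array([[1.0, 0.0], [0.0, -0.1]])
good = lambda data: np.array([[1.0, 0.2], [0.2, 0.5]])
print(choose_covariance(bad, good, data=None)[0])
```

Treating a negative variance as a failure, rather than only an exactly zero determinant, mirrors the remark that negative variances render the programs useless.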

The NND formula is less likely to produce negative variances; however, the dis-

advantage is that an extra approximation is made so that the variances will be less

accurate than ones obtained using the original sample average formula. When the

number of sites is large both formulas should produce similar estimates. But the

trade off of using the NND formula when the sample size is small could compromise

the accuracy of the confidence regions produced. The consequences of using each approach for small samples are addressed through simulation in this section.

The simulation uses the same 4 taxa trees as defined above and follows the same

specifications. That is 10000 confidence regions are generated for sequence length 50,

100, 250, 500, 1000, 5000, and 10000 for each tree.

Figure 6.5 contains a summary of the results. The first column shows three plots

relating to the sample average formula: the coverage probability of the true tree,

the number of confidence regions made, and the average confidence region size. The

second and third column show analogous plots for the NND formula and mixture of

the two respectively.

[Figure 6.5 appears here: a grid of plots against Sequence Length with columns titled "Sample average variance", "NND sample average variance", and "Mixture". The rows show Coverage, CR Made, and Average CR Size, with separate curves for Tree 1, Tree 2, and Tree 3.]

Figure 6.5: Summary plots for confidence regions generated using sample average variance and NND asymptotic equivalent formula.

First, the difference between the number of confidence regions made using the sample average formula and the number made using the NND formula is significant. This is especially evident when the sequence length is small and long branch attraction is a factor. For instance, with sequence length 50 roughly 2000 CRs are made using the sample average formula while 7500 are made using the NND formula.

The average confidence region size shows little variation over the three cases.

The benefit of the original sample average formula for computing the variances is

seen from the coverage plots. Though V will be singular more often, the resulting

coverage probabilities are much better behaved than in both the NND and mixed

cases.

The underestimation of coverage probability in the NND and mixed cases is a direct consequence of the approximation made in going from the first order formula to the NND formula. The advantage of that approximation is that it helps avoid negative variances, thus reducing the chance of V being singular.

Take for example Tree 3. For sequence length 50 the sample average formula

produces about 2000 CRs while under the NND formula about 7500 CRs are made.

For sequence length 100 the numbers are about 3000 and 9250. The point is that

using the NND formula here enables the production of many more CRs. However, a

potentially serious consequence can be seen in the coverage plots. The NND formula will often produce very small (positive) variances where the sample average formula would give a negative result. This, in turn, inflates the size of the test statistic, which finally acts to reduce the size of the P value for the true tree.

Sequence length 250 can be viewed as an extreme case where the NND formula pro-

duces a confidence region every time. The result is a significantly reduced coverage

probability for the true tree.
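The effect of a very small positive variance on the test statistic can be seen in a toy calculation. The sketch below uses a generic quadratic form of GLS type, r^T V^{-1} r, with made-up residuals and variances; it is an illustration of the mechanism only and not the statistic as computed by the GLS software.

```python
import numpy as np

def quadratic_form(residual, V):
    """Statistic of GLS type: r^T V^{-1} r."""
    return float(residual @ np.linalg.solve(V, residual))

residual = np.array([0.05, -0.02, 0.03])   # hypothetical distance residuals
V_ref = np.diag([4e-3, 4e-3, 4e-3])        # hypothetical reference variances
V_small = V_ref.copy()
V_small[0, 0] = 4e-5                       # one variance is very small but positive

q_ref = quadratic_form(residual, V_ref)
q_small = quadratic_form(residual, V_small)
print(q_ref, q_small)
```

Shrinking a single variance by a factor of 100 multiplies that component's contribution by 100, which is why a tiny positive variance can push the statistic for the true tree into the rejection region.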

In conclusion, it seems reasonable to use a mixture of both variance approxima-

tions as currently implemented in the GLS software. However, it is important to

understand the consequences of this approach: coverage probabilities will be underestimated for small samples, especially when long branch attraction becomes significant.

6.2.4 GLS Test Statistic for the True Topology

In this section Q-Q plots of the GLS test statistics for the true topology in each case

are examined, and some unexpected features of the plots are explained.

First we focus on Sim I. Figures 6.6 and 6.7 show Q-Q plots using the estimated covariances. It can be seen that the test statistics are underestimated significantly for sequence length 50, though this tends to go away for longer sequence lengths, even ones as small as 250. For Tree 3, however, the underestimation can still be seen. In fact, this is a perfect example to use for a general explanation of the plots. As previously explained, the ML estimated pairwise distances for small samples tend to be overestimated enough to skew the frequency plots. Moreover, simulation confirmed that the variance increases dramatically with distance, especially for small samples. These two features taken together work to explain what is evidenced in the plots. Namely, the GLS test statistic is inversely proportional to variance; hence any overestimation in the pairwise distances will result in a huge deflation of the test statistic.

It was anticipated that the coverage values would be accurate, even for small samples, when the estimated covariances were replaced by the true ones. But as evidenced in Figures 6.8 and 6.9 the overestimation of the test statistics is severe. This can, serendipitously, be explained by the same argument. When the true covariances are employed, the only remaining source of error is the pairwise distances themselves. Their overestimation results in overestimation of the test statistics.

The results for Sim II deserve some additional explanation. This time in Figures 6.10 and 6.11 it seems the underestimation is less than in Sim I. It was initially thought the underestimation would be more extreme, but the underlying reasons are found in the previous two sections. First, in addition to being prone to overestimation, it was found (especially in the 7 taxa cases) that the ML estimated distances are systematically underestimated, resulting in a mean left of 0. This acts to reduce the effects of overestimation to some extent. But more importantly, it was explained how using the NND formula to compute the variances reduces the coverage probability in comparison to using the original sample average approximation. This is equivalent to increasing the size of the test statistic. Sim II uses 7 taxa, as opposed to 4, and it can be checked that the increase in taxa results in an increase in usage of the NND formula and, in turn, an inflation of the test statistic.

[Figure 6.6 appears here: a grid of Q-Q panels for Trees 1, 2, and 3 at sequence lengths 50, 250, 1000, and 10000, with Chi Square Quantiles on the horizontal axis and GLS Test Statistic Quantiles on the vertical axis.]

Figure 6.6: GLS test statistic Q-Q plot. Sim I: Q-Q plots of the GLS test statistic for the true topology using the estimated covariances. Plots for sequence length 50, 250, 1000, and 10000 are shown for Trees 1, 2, and 3.

[Figure 6.7 appears here: Q-Q panels for Trees 1, 2, and 3 at sequence lengths 250 and 10000, with GLS Test Statistic Quantiles on the vertical axis.]

Figure 6.7: GLS test statistic Q-Q plot. Sim I: Q-Q plots of the GLS test statistic for the true topology using the estimated covariances. Plots for sequence length 250 and 10000 are shown for Trees 1, 2, and 3.


[Figure 6.8 appears here: a grid of Q-Q panels for Trees 1, 2, and 3 at sequence lengths 50, 250, 1000, and 10000, with GLS Test Statistic Quantiles on the vertical axis.]

Figure 6.8: GLS test statistic Q-Q plot. Sim I: Q-Q plots of the GLS test statistic for the true topology using the true covariances. Plots for sequence length 50, 250, 1000, and 10000 are shown for Trees 1, 2, and 3.

[Figure 6.9 appears here: Q-Q panels for Trees 1, 2, and 3 at sequence lengths 250 and 10000, with GLS Test Statistic Quantiles on the vertical axis.]

Figure 6.9: GLS test statistic Q-Q plot. Sim I: Another look at Q-Q plots of the GLS test statistic for the true topology using the true covariances. Plots for sequence length 250 and 10000 are shown for Trees 1, 2, and 3.


[Figure 6.10 appears here: Q-Q panels for Trees 1, 2, and 3 at sequence length 50, with GLS Test Statistic Quantiles on the vertical axis.]

Figure 6.10: GLS test statistic Q-Q plot. Sim II: Q-Q plots of the GLS test statistic for the true topology using the estimated covariances. Plots for sequence length 50 are shown for Trees 1, 2, and 3.

[Figure 6.11 appears here: a grid of Q-Q panels for Trees 1, 2, and 3 at sequence lengths 250, 1000, and 10000, with GLS Test Statistic Quantiles on the vertical axis.]

Figure 6.11: GLS test statistic Q-Q plot. Sim II: Another look at Q-Q plots of the GLS test statistic for the true topology using the estimated covariances. Plots for sequence length 250, 1000, and 10000 are shown for Trees 1, 2, and 3.


[Figure 6.12 appears here: Q-Q panels for Trees 1, 2, and 3 at sequence length 50, with GLS Test Statistic Quantiles on the vertical axis.]

Figure 6.12: GLS test statistic Q-Q plot. Sim II: Q-Q plots of the GLS test statistic for the true topology using the true covariances. Plots for sequence length 50 are shown for Trees 1, 2, and 3.

[Figure 6.13 appears here: a grid of Q-Q panels for Trees 1, 2, and 3 at sequence lengths 250, 1000, and 10000, with GLS Test Statistic Quantiles on the vertical axis.]

Figure 6.13: GLS test statistic Q-Q plot. Sim II: Q-Q plots of the GLS test statistic for the true topology using the true covariances. Plots for sequence length 250, 1000, and 10000 are shown for Trees 1, 2, and 3.

6.2.5 Coverage and Confidence Region Size

Figures 6.14, 6.15, and 6.16 show plots of the coverage probabilities and average CR

size for each of the three simulations using both the estimated and true covariances.

From the coverage plots (made using the estimated covariances) we can see that

the performance deviates from asymptotic theory even for sample sizes as large as

1000 nucleotides (Sim II and III), though in Sim I the coverages for the more well

behaved trees settle almost immediately, after sequence length 50. The inflation of the coverages is a natural consequence of the underestimation of the test statistics as discussed in the previous section, as is the drastic underestimation of the coverages when using the true covariances. One feature of the plots which remains unexplained is why coverages are most inflated for the simplest trees.

The average CR size plots are essentially in line with what would be expected;

that is to say, they are larger when using the estimated covariances for small samples

and approach common values in size as the sequence length becomes large. The only unusual case is Sim III (Tree 3) where, on average, 600 topologies remain in the CR even for sequence length 10000. But this is just a consequence of the estimation process.

In the examples considered here the coverage probabilities agreed with what is

expected asymptotically for sequence lengths of 1000 and above. But it should be

noted that in practice the number of taxa considered may be substantially higher

than those examined here, and it remains unknown to what extent a large increase in

taxa will inflate the coverage probabilities. It does seem, based on the simulation results, that the coverage probabilities remain reasonably constant over different tree structures.

[Figure 6.14 appears here: four panels plotted against Sequence Length. The top row shows Coverage and the bottom row shows Average CR Size, with the left column titled "Estimated Covariances" and the right column "True Covariances". Separate curves are drawn for Tree 1, Tree 2, and Tree 3.]

Figure 6.14: Coverage and CR size plots. Sim I: Plots of coverage probability of the true topology and average confidence region size using both estimated covariances (left) and the true covariances (right).

[Figure 6.15 appears here: four panels plotted against Sequence Length. The top row shows Coverage and the bottom row shows Average CR Size, for estimated covariances (left) and true covariances (right). Separate curves are drawn for Tree 1, Tree 2, and Tree 3.]

Figure 6.15: Coverage and CR size plots. Sim II: Plots of coverage probability of the true topology and average confidence region size using both estimated covariances (left) and the true covariances (right).

[Figure 6.16 appears here: four panels plotted against Sequence Length. The top row shows Coverage and the bottom row shows Average CR Size, for estimated covariances (left) and true covariances (right). Separate curves are drawn for Tree 1, Tree 2, and Tree 3.]

Figure 6.16: Coverage and CR size plots. Sim III: Plots of coverage probability of the true topology and average confidence region size using both estimated covariances (left) and the true covariances (right).

Chapter 7

Singularity of the Covariance Matrix

Computation of the GLS test statistic requires the estimated covariance matrix of

the pairwise distances V to be invertible. If singularity does occur, then the test

statistic cannot be computed and the programs will crash. Singularity can arise in three ways: from a linear dependence amongst the pairwise distances, from sampling variation, from numerical errors, or from some combination of these factors. In this chapter the sources of singularity are characterized for both large and small sample sizes.

Hypothetically, if the true covariance matrix V is known, singularity occurs exactly

when there is a linear dependence among the pairwise distances. In Section 7.1, it

will be shown that such a linear dependence is characterized by the three equivalent

conditions: (i) some subset of linearly transformed distances have a variance of 0,

(ii) there is a pairwise distance dij = 0, (iii) a pair of nucleotide sequences xi and xj

coincide with probability 1.

Knowing the causes of singularity for the true covariance matrix automatically provides some intuition for what to expect for the estimated covariance matrix. And as the practitioner is often interested in making confidence regions based on relatively small sequence lengths, it is critical to have a thorough understanding of how singularity can arise in V. In Section 7.2, a more thorough understanding of just how V can be singular is developed by explicitly computing how a linear dependence can arise in the rows of the sample average approximation matrix. An illustrative 4 taxa example with sequence length 500 is provided in Section 7.3.


7.1 Singularity of the True Covariance Matrix V

The covariance matrix V using the true pairwise distances is inherently a large sample

object, as true distances correspond to nucleotide data of infinite sequence length. In

this section, a characterization of when V is singular is given. Though in practice,

sequence length is never infinite, knowing the conditions of singularity here will give

some insight as to what to expect for smaller sequence lengths.

As a matter of notation, for T taxa with nucleotide data over N sites, as before, $\vec{x} = [x_1, \ldots, x_T]^T$ will represent data at a site. But in addition we use $\vec{y} = [y_1, \ldots, y_{2T-3}]^T$ for internal nucleotides at a site. The nucleotides for a taxon i will be written as $x_i = [x_{i1}, \ldots, x_{iN}]^T$. Also, for branch lengths $\vec{\alpha}$, $[\alpha_1, \ldots, \alpha_T]$ are the terminal branch lengths and $[\alpha_{T+1}, \ldots, \alpha_{2T-3}]$ the internal ones. Lastly $\vec{s} = [s_1, \ldots, s_{n_p}]^T$ represents the Hamming distance for each pair of taxa over N sites.

Theorem 7.1 For any time reversible Markov model of substitution, the covariance

matrix V is singular if and only if one of the following holds

1. At least one linearly transformed distance $d^*_i$ has variance $\lambda_i = 0$.

2. At least one pair of taxa i, j satisfies $d_{ij} = 0$.

3. At least one pair of taxa i, j has nucleotide sequences $x_i$ and $x_j$ coinciding at each site with probability 1, that is to say $P(x_i = x_j) = 1$.

Proof:

V singular ⇔ 1. The covariance matrix has eigen-decomposition $V = U \Lambda U^T$ where $\Lambda$ is a diagonal matrix of nonnegative eigenvalues $\lambda_1, \ldots, \lambda_{n_p}$ and $U$ is the matrix of corresponding eigenvectors. If some $\lambda_i$ is zero, then

\[ \det(V) = \det(U)\det(\Lambda)\det(U^T) = 0. \]

On the other hand, if V is singular and no $\lambda_i$ is zero, then it must be the case that $\det(U)$ is zero. But this is not possible since the eigenvectors are guaranteed to be linearly independent.

1 ⇔ 2 First we show 2 implies 1. WLOG suppose that taxa 1 and 2 have pairwise distance $d_{12} = 0$. This means that for all taxa k = 3, . . . , T we have $d_{1k} = d_{2k}$. But this means there is a linear dependence in the pairwise distances, so V is singular. Hence it follows that at least one $\lambda_i = 0$.

For 1 implies 2 apply Lemma 7.2 followed by Lemma 7.3.

2 ⇔ 3 First we show 2 implies 3. Suppose there is a pair of taxa i, j with pairwise distance $d_{ij} = 0$. By independence, to show $P(x_i = x_j) = 1$ it will suffice to show that $P(x_{ik} = x_{jk}) = 1$ at a single site k. To do so we write out the sum

\[ P(x_{ik} = x_{jk}) = \sum_{x_{ik} = x_{jk}} P_{x_{ik} x_{jk}}(0)\, \pi_{x_{jk}} \]

and note that $P_{x_{ik} x_{jk}}(0) = 1$ for each possible nucleotide. Hence

\[ P(x_{ik} = x_{jk}) = \sum_{i \in \{A, C, G, T\}} \pi_i = 1 \]

and the result follows.

To show 3 implies 2 suppose that condition 2 does not hold. That is to say suppose $d_{ij} > 0$ for all pairs i and j. It follows for all i, j that $P_{kl}(d_{ij}) > 0$ for any choice of nucleotides k and l. This is because for a continuous time Markov chain where all states communicate, $P_{kl}(t) > 0$ for all k, l and t > 0. In particular when $k \neq l$ the transition probability is nonzero for all taxa pairs. Hence it is possible to have a site containing both nucleotides k and l. Therefore $P(x_i = x_j) < 1$ follows, and so does the result.
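The direction "2 implies V singular" can be illustrated numerically. The sketch below does not use any substitution model: it simply draws hypothetical estimated distances for 4 taxa in which taxa 1 and 2 coincide, so that the estimate for pair (1,2) is always 0 and the estimates for (2,k) equal those for (1,k). The means, spread, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_reps = 5000
# Hypothetical estimated distances for the pairs (1,3), (1,4), and (3,4).
y13 = 0.30 + 0.02 * rng.standard_normal(n_reps)
y14 = 0.40 + 0.02 * rng.standard_normal(n_reps)
y34 = 0.25 + 0.02 * rng.standard_normal(n_reps)
# Taxa 1 and 2 coincide: d12 = 0 and d2k = d1k for every other taxon k.
y12 = np.zeros(n_reps)
y23, y24 = y13.copy(), y14.copy()

# Columns ordered as pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
Y = np.column_stack([y12, y13, y14, y23, y24, y34])
V = np.cov(Y, rowvar=False)
eigenvalues = np.linalg.eigvalsh(V)        # ascending order
rank = int(np.sum(eigenvalues > 1e-10 * eigenvalues.max()))
print(eigenvalues, rank)
```

Three eigenvalues vanish, one for each zero-variance linear combination: y12 itself, y13 − y23, and y14 − y24.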

Lemma 7.2 If at least one linearly transformed distance $d^*_i$ has variance $\lambda_i = 0$, then there exists a nucleotide pattern at a site $\vec{x}$ which has probability 0.

Proof: The idea here is to force singularity in V to get the result of the lemma. To begin, from the eigen-decomposition of V, as described above, it is clear that singularity is equivalent to $\lambda_i = 0$ for at least one set of linear transformations of the pairwise distances. In other words there are weights $a_j$, not all 0, such that

\[ \sum_{j=1}^{n_p} a_j d_j = 0 \tag{7.1} \]

for all possible nucleotide patterns $x_1, \ldots, x_T$ over all sites.

The $d_j$ are, in fact, functions of the Hamming distance $s_j$ so we can rewrite the pairwise distances as $d_j = d(s_j)$. Then we have

\[ \sum_{j=1}^{n_p} a_j d(s_j) = 0. \tag{7.2} \]

Since d is a monotone increasing function of the $s_j$ we can write

\[ s_{n_p} = d^{-1}\left( -\sum_{j=1}^{n_p - 1} a_j d_j / a_{n_p} \right) \tag{7.3} \]

where WLOG we have taken $a_{n_p} \neq 0$. Hence $s_{n_p}$ is uniquely determined by $s_1, \ldots, s_{n_p - 1}$, so that $P(s_{n_p} \mid s_1, \ldots, s_{n_p - 1})$ is a point mass distribution.

Now define the site pattern $\vec{x} = [A, A, \ldots, A, C, y]^T$ where y = C or G, and consider the data set obtained by repeating this N times. In this case the $s_{ij}$ satisfy

\[ s_{ij} = 0 \quad \text{if } i, j < T - 1, \]
\[ s_{ij} = N \quad \text{if } i < T - 1,\ j = T, T - 1. \]

Hence $s_1, s_2, \ldots, s_{n_p - 1}$ are identical for both choices of y. But Equation 7.1 implies that $s_{n_p}$ is determined by $s_1, s_2, \ldots, s_{n_p - 1}$. This means that either

1. The value of $s_{n_p}$ is the same for both choices of y.

2. One of the site patterns $[A, A, \ldots, A, C, C]^T$ or $[A, A, \ldots, A, C, G]^T$ has probability 0 of occurring.

But notice that $s_{n_p} = 0$ if y = C and $s_{n_p} = N$ if y = G. Therefore the first possibility is false, so the second must hold.
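The counting step of the proof can be checked directly. The sketch below builds the two repeated-pattern data sets and compares their pairwise Hamming distances; the values T = 5 and N = 100 and the function name are illustrative only.

```python
from itertools import combinations

def hamming_counts(pattern, n_sites):
    """Pairwise Hamming distances for a data set made of one repeated site pattern."""
    counts = {}
    for i, j in combinations(range(len(pattern)), 2):
        # Taxa are numbered from 1; a pair differs at every site or at none.
        counts[(i + 1, j + 1)] = n_sites if pattern[i] != pattern[j] else 0
    return counts

T, N = 5, 100                                           # illustrative values
s_C = hamming_counts(["A"] * (T - 2) + ["C", "C"], N)   # y = C
s_G = hamming_counts(["A"] * (T - 2) + ["C", "G"], N)   # y = G

# Pairs whose Hamming distance depends on the choice of y.
differing = [pair for pair in s_C if s_C[pair] != s_G[pair]]
print(differing)
```

The only pair whose Hamming distance changes with y is (T − 1, T), which is the pair playing the role of $s_{n_p}$ in the argument.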

Lemma 7.3 If $P(\vec{x}) = 0$ for some site pattern $\vec{x}$ then $d_{ij} = 0$ for some pair of taxa i and j.

Proof: The proof is by induction on the contrapositive statement over T .

For the base case, consider T = 3 taxa. To show for any nucleotide pattern

x1, x2, x3 that P (x1, x2, x3) > 0 it will suffice to show there is an internal nucleotide

y4 such that P (x1, x2, x3, y4) > 0.

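Consider the tree

[Figure: star tree in which the internal node $y_4$ is joined to the leaves $x_1$, $x_2$, and $x_3$ by branches of lengths $\alpha_1$, $\alpha_2$, and $\alpha_3$.]

By assumption at most one of $\alpha_1, \alpha_2, \alpha_3$ is 0, since otherwise some pairwise distance would be 0. Consequently there are two cases to be considered.

Case I Exactly one $\alpha_i = 0$. So WLOG suppose $\alpha_1 = 0$ and $\alpha_2, \alpha_3 > 0$. Choose $y_4 = x_1$. From the product expansion

\[ P(x_1, x_2, x_3, y_4) = P_{x_1 y_4}(\alpha_1)\, P_{x_2 y_4}(\alpha_2)\, P_{x_3 y_4}(\alpha_3)\, \pi_{y_4} \tag{7.4} \]

it follows that each term in the product is strictly positive: the first factor equals 1 because $y_4 = x_1$ and $\alpha_1 = 0$, and the remaining factors are positive by irreducibility. Hence $P(x_1, x_2, x_3) > 0$.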

Case II All αi > 0. Since all states can communicate over nonzero time, it follows

that P (x1, x2, x3, y4) > 0 for any choice of y4. Therefore the overall probability

P (x1, x2, x3), being a sum of positive quantities, is itself greater than zero.

Now suppose the claim holds for any tree with 4, 5, . . . , T − 1 taxa, and consider a T taxa tree with nucleotides $x_1, \ldots, x_T$, internal nucleotides $y_{T+1}, \ldots, y_{2T-3}$, and terminal branch lengths $\alpha_1, \ldots, \alpha_T$. Again it will suffice to show that there exist internal nucleotides $\vec{y}$ such that $P(\vec{x}, \vec{y}) > 0$. There are two cases:

Case I This case can be handled without using the inductive hypothesis. Suppose $\alpha_i > 0$ for all i = 1, . . . , T. Choose all internal nucleotides to be the same. Consider the product expansion

\[ P(\vec{x}, \vec{y}) = P(x_1, x_2, \ldots, x_{T-1}, y_1, y_2, \ldots, y_{2T-4}) = \prod_{i=1}^{T-1} P_{x_i y_i}(\alpha_i) \cdot \prod_{\text{internal}} P_{y_i y_j}(\alpha_i) \cdot \pi_m \]

about some internal state m. Each $\alpha_i$ for a terminal branch is greater than zero, so $P_{x_i y_i}(\alpha_i) > 0$ for each pair $x_i, y_i$. Moreover since $y_i = y_j$ for all internal states, each term $P_{y_i y_j}(\alpha_i)$ is positive regardless of $\alpha_i$. And since $\pi_m$ is positive by irreducibility, the overall product is positive.

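Case II Suppose k of the $\alpha_i$ are zero, where $1 \le k \le \lceil T/2 \rceil$ (see footnote 1). Since no $d_{ij} = 0$ it follows, as seen in the diagrams below, that either the terminal branch with $\alpha_i = 0$ has no adjacent terminal branch, or the adjacent $\alpha_j$ is greater than 0.

[Diagrams: on the left, a leaf $x_i$ joined to the internal node $y_{T+i}$ by a branch with $\alpha_i = 0$; on the right, leaves $x_i$ and $x_j$ joined to the same internal node $y_{T+i}$ by branches with $\alpha_i = 0$ and $\alpha_j > 0$.]

WLOG take i to be 1. Whether or not $\alpha_1$ has an adjacent branch length, choose $y_{T+1} = x_1$. The product expansion becomes

\[ P(\vec{x}, \vec{y}) = P(x_1, x_2, \ldots, x_T, y_{T+1}, y_{T+2}, \ldots, y_{2T-2}) \]
\[ = P_{x_1 y_{T+1}}(\alpha_1) \cdot \prod_{i=2}^{T} P_{x_i y_{T+i}}(\alpha_i) \cdot \prod_{\text{internal}} P_{y_i y_j}(\alpha_i) \cdot \pi_m \]
\[ = P_{x_1 y_{T+1}}(\alpha_1) \cdot P(\vec{x}_0, \vec{y}_0) \]

where m is some internal nucleotide. The final product consists of the positive quantity $P_{x_1 y_{T+1}}(\alpha_1)$, and $P(\vec{x}_0, \vec{y}_0)$ which corresponds to the probability of nucleotides for a T − 1 taxon tree in which no $d_{ij}$ is zero. By the induction hypothesis this too is positive. Hence the overall product is positive and we are done.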
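The base case of Lemma 7.3 can be checked numerically on the 3 taxa star tree. The sketch below assumes the Jukes-Cantor (JC69) model, which is one particular time reversible model chosen here because its transition probabilities have a simple closed form; the function names and branch lengths are illustrative.

```python
import numpy as np
from itertools import product

STATES = "ACGT"

def jc69_transition(t):
    """4x4 JC69 transition matrix for branch length t (t = 0 gives the identity)."""
    e = np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), 0.25 * (1.0 - e)) + e * np.eye(4)

def star_pattern_probability(pattern, alphas):
    """P(x1, x2, x3) on the star tree, summing over the internal state y4."""
    P = [jc69_transition(a) for a in alphas]
    idx = [STATES.index(x) for x in pattern]
    # The stationary probability is 1/4; leaves are independent given y4.
    return sum(0.25 * np.prod([P[b][y, idx[b]] for b in range(3)]) for y in range(4))

def min_pattern_probability(alphas):
    """Smallest site pattern probability over all 64 patterns."""
    return min(star_pattern_probability(p, alphas) for p in product(STATES, repeat=3))

print(min_pattern_probability([0.0, 0.1, 0.2]))   # one zero branch, no d_ij = 0
print(min_pattern_probability([0.0, 0.0, 0.2]))   # two zero branches, so d_12 = 0
```

With a single zero branch every pattern keeps positive probability, whereas with two zero branches any pattern having x1 different from x2 has probability exactly 0, in line with the contrapositive being proved.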

Footnote 1: If the value of k exceeded ⌈T/2⌉, then some dij must be zero.

7.2 Singularity of the Estimated Covariance Matrix V

The purpose of this section is to characterize when the estimated covariance matrix V is singular. Due to sampling variation and limitations on computational accuracy the source of singularity is more complex than in the case of the true covariances. The strategy here is to take advantage of the fact that we have a formula for computing V, namely the sample average approximation 5.14 and 5.15 of Chapter 5, to find necessary and sufficient conditions for singularity. This is, of course, still subject to sampling variation and numerical errors. If we had data of infinite length and hence the true covariance matrix, then a linear dependence would be the only source of singularity; however, we do not. The variances must be estimated and are subject to errors in estimation. The results stem from writing V as follows:

Theorem 7.4 For T taxa, let $n_d$ be the number of distinct nucleotide patterns over N sites. Also define $p_i$ to be the proportion of the i'th distinct nucleotide pattern. The estimated covariance matrix can be written as

\[ V = \frac{1}{N} \sum_{i=1}^{n_d} p_i\, a^{(i)} [a^{(i)}]^T \tag{7.5} \]

where $a^{(i)} = [a^{(i)}_1, \ldots, a^{(i)}_{n_p}]^T$ is a vector with

\[ a^{(i)}_j = N V_{jj} \frac{\partial}{\partial d_j} \log P_j(x_i, d_j) \tag{7.6} \]

for j a taxa pair and $x_i$ the data at site i (see footnote 2).

Footnote 2: The variance in Formula (7.6) would be computed using the sample average estimate.

Proof: We will show the result for the individual entries of V. First consider the sample average approximation for the variance

\[ V_{jj} = \left( N^{-1} \sum_{i=1}^{N} -\left( \frac{\partial}{\partial d_j} \log P_j(x; d_j) \right)^2 \right)^{-1} / N. \]

Some algebraic manipulations yield

\[ V_{jj} = \frac{N^3 V_{jj}^2}{N} \left( \sum_{i=1}^{N} [a^{(i)}_j]^2 \right)^{-1} \]

and

\[ N V_{jj} = \frac{1}{N} \sum_{i=1}^{N} [a^{(i)}_j]^2. \]

Finally

\[ N V_{jj} = \sum_{i=1}^{n_d} p_i [a^{(i)}_j]^2 \tag{7.7} \]

is obtained by collapsing the sum over all distinct site patterns.

Similarly

\[ V_{jk} = N V_{jj} V_{kk} \sum_{i=1}^{N} \frac{\partial}{\partial d_j} \log P_j(x; d_j)\, \frac{\partial}{\partial d_k} \log P_k(x; d_k) \]

can be rewritten as

\[ V_{jk} = \sum_{i=1}^{N} \frac{a^{(i)}_j a^{(i)}_k}{N^2} \]

and collapsing the sum as above gives

\[ V_{jk} = \frac{1}{N} \sum_{i=1}^{n_d} p_i\, a^{(i)}_j a^{(i)}_k. \tag{7.8} \]
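The collapsing step, from a sum over the N sites to a weighted sum over the distinct site patterns, can be verified numerically. In the sketch below the vectors $a^{(i)}$ are random placeholders rather than derivatives of a real likelihood, and the number of distinct patterns (8) is arbitrary; only the algebraic identity is being checked.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_pairs, n_patterns = 500, 6, 8          # sites, taxa pairs (4 taxa), patterns
# Hypothetical vectors a^(i), one row per distinct site pattern.
a_distinct = rng.standard_normal((n_patterns, n_pairs))
pattern_of_site = rng.integers(0, n_patterns, size=N)   # pattern shown at each site
a_sites = a_distinct[pattern_of_site]                    # one row per site

# Site-by-site form: V = N^{-2} * sum over sites of a a^T.
V_sites = a_sites.T @ a_sites / N**2

# Collapsed form (7.5): V = N^{-1} * sum over distinct patterns of p_i a a^T.
p = np.bincount(pattern_of_site, minlength=n_patterns) / N
V_collapsed = (a_distinct.T * p) @ a_distinct / N

print(np.allclose(V_sites, V_collapsed))
```

Written this way, V equals A^T diag(p) A / N with the rows of A being the vectors $a^{(i)}$, so its rank can never exceed the number of distinct patterns.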

The following corollaries give one necessary and two sufficient conditions for singularity in the estimated covariance matrix V.

Corollary 7.5 Let A be the matrix whose i'th row is $a^{(i)}$. If V is singular, then $\mathrm{Null}(A) \neq \{\vec{0}\}$.

Proof: If V is singular, then there is a nonzero vector $\vec{b}$ such that $V\vec{b} = \vec{0}$. Using the form of V in 7.5, this is equivalent to saying there is a nonzero $\vec{b}$ such that $[a^{(i)}]^T \vec{b} = 0$ for all i; or, in other words, $\mathrm{Null}(A) \neq \{\vec{0}\}$.

Corollary 7.6 If the number of distinct nucleotide patterns $n_d$ is less than the number of pairs of taxa $n_p$, then V is singular.

Proof: This is a simple consequence of the matrix A having $n_d$ rows and $n_p$ columns: its rank is at most $n_d < n_p$, so there is a nonzero $\vec{b}$ with $A\vec{b} = \vec{0}$, and hence $V\vec{b} = \vec{0}$.

Corollary 7.7 If there is a constant c and taxa pairs j and k such that $a^{(i)}_j = c\, a^{(i)}_k$ for all distinct site patterns i = 1, . . . , $n_d$, then V is singular.

Proof: The corollary follows directly from writing the estimated covariance matrix in the form of 7.5.
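Both sufficient conditions can be illustrated with placeholder data. The sketch below builds V from form (7.5) using random vectors in place of the $a^{(i)}$; the helper names, the constant 1.7, and the equal pattern proportions are assumptions made purely for the illustration.

```python
import numpy as np

def covariance_from_patterns(A, p, n_sites):
    """V = A^T diag(p) A / N, where the rows of A are the vectors a^(i)."""
    return (A.T * p) @ A / n_sites

def numerical_rank(V, rel_tol=1e-10):
    """Number of eigenvalues that are not negligible relative to the largest."""
    eigenvalues = np.linalg.eigvalsh(V)
    return int(np.sum(eigenvalues > rel_tol * eigenvalues.max()))

rng = np.random.default_rng(2)
n_pairs, N = 6, 500

# Corollary 7.6: fewer distinct patterns (4) than taxa pairs (6).
A_few = rng.standard_normal((4, n_pairs))
rank_few = numerical_rank(covariance_from_patterns(A_few, np.full(4, 0.25), N))

# Corollary 7.7: one column is a constant multiple of another for every pattern.
A_prop = rng.standard_normal((38, n_pairs))
A_prop[:, 2] = 1.7 * A_prop[:, 1]
rank_prop = numerical_rank(covariance_from_patterns(A_prop, np.full(38, 1 / 38), N))

print(rank_few, rank_prop)
```

In both cases the rank falls below the number of taxa pairs, so V cannot be inverted.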


7.3 An Example

In this section a specific example is introduced to illustrate the claims of the previous

section. That is to say: a small sample based example is given which causes V to be

singular, and hence the GLS software to crash.

Sequence data of length N = 500 was generated under the F84 model of substitution, though the underlying tree has been lost. The estimated distances are shown in Figure 7.1. The nucleotide data as displayed in Table 7.1 shows each distinct site pattern generated along with a count of how many times the pattern occurred. There were nd = 38 distinct site patterns generated. This illustrates a connection with Corollary 7.6: though nd, the number of distinct nucleotide patterns, is greater than np, it is interesting to note that only 4 patterns occur more than 10 times. The remaining patterns will make a small contribution to V by comparison.

Figure 7.1: The 4 taxa tree. (Taxa 1 to 4; branch length labels 0.01, 0.11, 0.026, 0.014, 0.027.)

Pattern  Count    Pattern  Count    Pattern  Count    Pattern  Count
CCCC        99    ACCC         2    TTTT       104    TTTC         6
AGGG         9    CTTT         6    AAAA       179    GGAA         2
GAAA         6    GTTT         1    CTAA         1    CGGG         1
GGGG        34    AAGG         1    TCCC         9    GAGG         2
ATTT         7    AAAG         1    CCAC         2    TGTT         2
TAAA         2    AAAC         1    TTTA         1    CCTT         1
TCTT         1    GGGC         1    TGGG         3    CCTC         1
CTCC         3    TTGT         1    CAAA         1    AACA         1
TATT         2    AGAA         2    TTTG         1    TTCT         1
CCCT         2    AAGA         1

Table 7.1: Nucleotide patterns over sites.

This also shows a connection to the large sample results of Section 7.1 because the vast majority of sites had identical nucleotides, while most of the other patterns which arose occurred only once or twice.

The plots shown in Figure 7.2 illustrate the relationship with Corollary 7.7 in this case. On the left is a plot of the pairs $(a^{(1)}_2, a^{(1)}_3), \ldots, (a^{(38)}_2, a^{(38)}_3)$, while on the right is a plot of the pairs $(a^{(1)}_4, a^{(1)}_5), \ldots, (a^{(38)}_4, a^{(38)}_5)$ (footnote 3). A clear linear dependence as in Corollary 7.7 can be seen. Thus V will be singular. The reason so few points are observed in the plots (i.e. not 38 distinct points) is that under the F84 model there are only six possible values for $a^{(i)}_j$ given a pairwise distance dj.

The example considered in this section is, indeed, illustrative of how the GLS

software can fail when faced with real small sample data. The following chapter describes methods for constructing a confidence region in situations such as this one, where V is singular.

Figure 7.2: A linear dependence leading to singularity in the estimated covariance matrix. (Left panel: taxa 1 and 3 against taxa 1 and 4; right panel: taxa 2 and 3 against taxa 2 and 4.)

Footnote 3: To be clear, the taxa pairs 1, 2, 3, 4, 5, 6 correspond to the pairs of taxa (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4).

Chapter 8

Circumventing Limitations of the Generalized Least Squares Method

In the previous chapters it was seen that near singularity in V can be highly prob-

lematic for software implementation of GLS. In this chapter three strategies which

can be applied when the software fails are introduced: the first works by combining

groups of taxa which are closely related according to a distance based cutoff value;

the second, an eigenvalue cutoff approach which, briefly, works by omitting near zero

eigenvalues (which are variances of the LTD’s) from the computation and making an

adjustment to the degrees of freedom of the test statistic; and lastly, an approach

which links these two approaches together by inferring the number of near zero eigen-

values from the closely related taxa. The eigenvalue method was included by Susko as

an appendage to the original paper, complete with a software implementation. The

last approach uses a particular eigenvalue cutoff, and thus relies upon that software

as well. The distance cutoff method was developed (again with software implemen-

tation) as part of this thesis. The remainder of this chapter is devoted to comparing

the three different approaches through simulation on 8-taxon trees.

8.1 Distance Cutoff Approach

Some notation is introduced here to facilitate matters later. For T taxa with pairwise

distances d, a taxa grouping G (for a given cutoff cd) is a subset of taxa such that

dij ≤ cd for every taxa pair i, j in G. A taxa grouping G is called maximal when none

of the taxa i lying outside G satisfy dij ≤ cd for any j in G.

Figure 8.1: A 7 taxa tree with ε small so that y12, y56, y57, y67 ≤ cd.

Figure 8.2: Here the two maximal groupings have been formed: G1 = {1, 2} and G2 = {5, 6, 7}.

For example, consider the tree shown in Figure 8.1. Under cd there would be two maximal taxa groupings: G1 = {1, 2} and G2 = {5, 6, 7}, as shown in Figure 8.2.

When convenient we represent a maximal taxa grouping by its smallest element as

seen in the figure.
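The definition of a maximal taxa grouping can be sketched in Python as follows. The function name `maximal_groupings`, the dictionary layout for the distances, and the numerical distances are hypothetical; the distances are chosen only so that the pairs (1, 2), (5, 6), (5, 7), and (6, 7) fall below the cutoff, as in Figure 8.1. A set of taxa linked through close pairs is reported only when every pair inside it is within the cutoff, which is one reading of the definition above.

```python
from itertools import combinations


def maximal_groupings(d, cd):
    """Return the maximal taxa groupings for cutoff cd.

    d maps frozenset({i, j}) to the pairwise distance d_ij.
    """
    taxa = sorted({t for pair in d for t in pair})
    close = {t: {u for u in taxa if u != t and d[frozenset((t, u))] <= cd}
             for t in taxa}
    groupings, seen = [], set()
    for t in taxa:
        if t in seen or not close[t]:
            continue
        # Collect every taxon reachable from t through close pairs.
        group, stack = {t}, [t]
        while stack:
            for u in close[stack.pop()]:
                if u not in group:
                    group.add(u)
                    stack.append(u)
        seen |= group
        # Keep the set only if every pair inside it is within the cutoff.
        if all(group - {u} <= close[u] for u in group):
            groupings.append(sorted(group))
    return groupings


# Hypothetical distances mimicking Figure 8.1.
near = {frozenset(pair) for pair in [(1, 2), (5, 6), (5, 7), (6, 7)]}
d = {frozenset(pair): (0.005 if frozenset(pair) in near else 0.2)
     for pair in combinations(range(1, 8), 2)}

groups = maximal_groupings(d, cd=0.01)
representatives = [g[0] for g in groups]          # smallest element of each grouping
Tr = 7 - sum(len(g) - 1 for g in groups)          # number of effective taxa
print(groups, representatives, Tr)
```

For these distances the groupings are {1, 2} and {5, 6, 7}, represented by taxa 1 and 5, leaving four effective taxa as in Figure 8.2.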

The first approach, as mentioned in the introduction, is to naively group closely

related taxa, based on a distance cutoff cd, and make the appropriate adjustment to

the degrees of freedom of the chi-square distribution.

Now, this approach works simply by specifying a distance cutoff cd. Any maximal taxa groupings are identified and subsequently considered to act as a single taxon. If the number of effective taxa after the groupings is Tr, then the resulting chi-square random variable has $\binom{T_r}{2} - (2T_r - 3)$ degrees of freedom, that is, the number of pairs of effective taxa less the number of branches in a Tr taxa tree.
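As a small worked check, the sketch below computes the number of effective taxa and the degrees of freedom. It assumes that the degrees of freedom equal the number of pairwise distances among the Tr effective taxa minus the 2Tr − 3 branch lengths, which is the reading that reproduces the 6 degrees of freedom reported for the cd cutoff in Table 8.4. The function names are hypothetical.

```python
from math import comb


def effective_taxa(T, groupings):
    """Each maximal grouping of size m removes m - 1 taxa."""
    return T - sum(len(g) - 1 for g in groupings)


def df_distance_cutoff(Tr):
    """Pairs of effective taxa minus branches of an unrooted binary Tr taxa tree."""
    return comb(Tr, 2) - (2 * Tr - 3)


# Seven taxa with taxa 2 and 3 grouped, as in the example of Section 8.5.
Tr = effective_taxa(7, [[2, 3]])
print(Tr, df_distance_cutoff(Tr))
```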

8.2 Eigenvalue Cutoff Approach

As described in Section 4.4.3, GLS can be altered into WLS by computing the eigenvector/eigenvalue decomposition of the estimated covariance matrix $V = U \Lambda U^T$. In this case, the approximate linearly transformed distances (LTD's) are given by $\vec{y}^* = U^T \vec{y}$. To clarify notation, the ij'th LTD will be written as

$$y^*_{ij} = u_{ij,12}\, y_{12} + u_{ij,13}\, y_{13} + \ldots + u_{ij,T-1\,T}\, y_{T-1\,T} \qquad (8.1)$$

where the rows and columns of $U^T$ are both indexed by 12, 13, . . . , T − 1T. However, it should be pointed out that the indices of the LTD's do not correspond to pairs of taxa as they do for the original distances yij.

The eigenvalues λij act as estimated variances of the y∗ij. V is singular when

some of the eigenvalues are approximately 0. The idea here is to ignore these near

0 variances and make an adjustment to the degrees of freedom of the test statistic;

thus permitting the production of a confidence region.

This can be accomplished by specifying an eigenvalue cutoff ce. Any terms in the GLS test statistic

$$g_T = \sum \frac{(y^*_{ij} - (X\alpha)_{ij})^2}{\lambda_{ij}}$$

with λij ≤ ce are to be dropped from the summation. In this way any near 0 variances, which have a drastic effect on the magnitude of gT, are omitted from the computation. With np contributions remaining in the sum, the chi-square distribution used in computing a P value for a given topology has np − (2T − 3) degrees of freedom.
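The effect of the cutoff on the statistic can be sketched as below. The function names and the toy LTD's, fitted values, and eigenvalues are hypothetical; the fitted values stand in for the entries of Xα expressed in the transformed coordinates. The chi-square tail probability is written out for integer degrees of freedom using only the standard library, and it can be checked against the gT values, degrees of freedom, and P values reported in Table 8.4.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)                  # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))       # Q(1/2, z)
    while a + 1 <= df / 2:
        # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


def gls_statistic_with_cutoff(y_star, fitted, eigenvalues, ce, T):
    """Drop terms with eigenvalue <= ce; return the statistic and its degrees of freedom."""
    kept = [(y, f, lam) for y, f, lam in zip(y_star, fitted, eigenvalues) if lam > ce]
    g = sum((y - f) ** 2 / lam for y, f, lam in kept)
    return g, len(kept) - (2 * T - 3)


# Hypothetical 5 taxa example: ten LTD's, one with a near 0 variance.
T = 5
eigenvalues = [0.01] * 9 + [1e-9]
y_star, fitted = [0.6] * 10, [0.5] * 10

g_all, df_all = gls_statistic_with_cutoff(y_star, fitted, eigenvalues, ce=0.0, T=T)
g_cut, df_cut = gls_statistic_with_cutoff(y_star, fitted, eigenvalues, ce=1e-6, T=T)
print(g_all, df_all, chi2_sf(g_all, df_all))
print(g_cut, df_cut, chi2_sf(g_cut, df_cut))
```

Without the cutoff the single near 0 variance dominates the statistic and drives the P value to essentially 0; with it, the term is omitted and one degree of freedom is given up.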

Larger eigenvalue cutoffs will tend to result in confidence regions including more

topologies with the advantage that singularity problems in V go away. For best results

Susko recommends trying various cutoff values and comparing the resulting confidence

regions produced. A natural approach is to use a small (near 0) cutoff value, denoted

by ce0, to avoid any very small variances which would come to dominate the test

statistic. In this thesis ce0 was chosen to be a small constant divided by the sequence

length N .

8.3 LTD Counting Approach

An alternate approach involves inferring the number of LTD's with λij ≈ 0 based on the estimated distances that are close to 0. In that sense it acts as a combination of both the distance and eigenvalue cutoff approaches. Consider the following example: Figure 8.3 shows a 7 taxa tree with y67 ≈ 0. Consequently yx6 ≈ yx7 for x from 1 to 5. Simulation results (though not included here) confirm that these relationships induce the near linear dependencies

$$\begin{aligned}
y^*_{45} &= u_{45,16}\, y_{16} + u_{45,17}\, y_{17} \approx 0 \\
y^*_{46} &= u_{46,26}\, y_{26} + u_{46,27}\, y_{27} \approx 0 \\
y^*_{47} &= u_{47,36}\, y_{36} + u_{47,37}\, y_{37} \approx 0 \\
y^*_{56} &= u_{56,46}\, y_{46} + u_{56,47}\, y_{47} \approx 0 \\
y^*_{57} &= u_{57,56}\, y_{56} + u_{57,57}\, y_{57} \approx 0 \\
y^*_{67} &= u_{67,67}\, y_{67} \approx 0
\end{aligned}$$

in the LTD's. In this case the number of λij ≈ 0 is six, leaving np = 15. This is the same np that would be obtained by using all the λij if taxa 6 and 7 had been combined in a maximal taxa grouping as seen in Figure 8.4. In the case that there are multiple yij ≈ 0, as seen in Figure 8.1, the linear dependencies will involve more complicated relationships between the near 0 pairwise distances. However, the end result will be the same, namely, the np value is obtained by taking the number of pairs of taxa in the reduced topology. The cutoff, denoted by ce1, simply accounts for the largest np eigenvalues. So in summary, this approach would suggest grouping taxa according to which dij are near 0, then keeping the number of λij corresponding to the reduced tree.

Figure 8.3: A 7 taxa tree with y67 ≈ 0.

Figure 8.4: Taxa 6 and 7 have been grouped together.
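The counting rule can be written down directly. In the sketch below the function names are hypothetical, and `ce1_from_eigenvalues` assumes there are no ties among the eigenvalues at the boundary; it simply returns the largest eigenvalue that is to be dropped, so that the rule "drop λij ≤ ce1" keeps the required number.

```python
from math import comb


def ltd_count(T, groupings):
    """Return (Tr, eigenvalues kept, eigenvalues dropped) for the given taxa groupings."""
    Tr = T - sum(len(g) - 1 for g in groupings)
    kept = comb(Tr, 2)                   # pairs of taxa in the reduced topology
    return Tr, kept, comb(T, 2) - kept


def ce1_from_eigenvalues(eigenvalues, kept):
    """Cutoff such that only the `kept` largest eigenvalues exceed it."""
    ordered = sorted(eigenvalues, reverse=True)
    return ordered[kept] if kept < len(ordered) else 0.0


print(ltd_count(7, [[6, 7]]))            # the example of Figure 8.3
print(ltd_count(8, [[7, 8]]))
print(ltd_count(8, [[5, 6, 7, 8]]))
```

For the 7 taxa example this gives 15 eigenvalues kept and 6 dropped, in agreement with the count above.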


8.4 Software Implementation

The routines glsdna eig and glsprot eig implement the eigenvalue cutoff method for

dna and protein data respectively. glsdna dist and glsprot dist are analogous routines

implementing the distance cutoff approach. Each of the routines requires a cutoff

parameter in addition to the parameters required by glsdna and glsprot.

The glsdna dist and glsprot dist Routines

As the glsdna dist and glsprot dist routines are work done for this thesis, it is of

interest to describe the implementation in special detail. First, for a given distance

cutoff cd, the command line arguments are

glsdna dist treefile paramfile infile cd

and

glsprot dist ntree treefile cd

respectively. Of course, the resulting confidence region will depend directly upon

how large a cd is specified by the user. For reasons explained below, too large a

cd will result in confidence regions including more topologies than necessary; while

a cd value which is very small may not eliminate the singularity problem in V . In

practice, it is advisable to first identify groups of problem taxa, then choose a cutoff

accordingly. This may be tried several times and the results compared.

As for the more mechanical aspects of the implementation, we focus on glsdna dist.

glsprot dist differs not in methodology, but merely in details. So suppose the topologies in Ω, a subset of the topology space Top(T) on T taxa, are the topologies of interest in constructing a confidence region. The set of topologies can be partitioned as Ω = ΩM ∪ Ω̄M, where ΩM contains each topology in which every maximal taxa grouping G, as defined by cd, corresponds to a subtree, and Ω̄M is the complement of ΩM in Ω. Furthermore, the set ΩM itself can be partitioned as ω1 ∪ ω2 ∪ . . . ∪ ωr, where each ωi is a set of topologies corresponding to a unique topology among the reduced Tr taxa topologies. The algorithm works by first identifying which reduced topology each topology in ΩM corresponds to. A P value is obtained for each such topology by computing it for the reduced topology and then assigning it to the original T taxa topology. Any topology in the complement Ω̄M is assigned a P value of 0.

To illustrate this consider, again, the tree in Figure 8.1 with distance cutoff cd

chosen such that there are two maximal taxa groupings as shown in Figure 8.2. Take

Ω to be the full set of 945 topologies on 7 taxa. The topologies in ΩM are shown in

Figures 8.5, 8.7, and 8.9 where the topologies in each figure make up the elements

of ω1, ω2, and ω3 respectively. The adjacent figures show the corresponding 4 taxon

topology in Tr to each of ω1, ω2, and ω3. In this example, a single P value will be computed for the topologies in each ωi while the remaining 936 topologies, those in the complement Ω̄M, are assigned a P value of 0.
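The counts in this example can be reproduced by brute force. The Python sketch below is not the routine used by the software: it enumerates all 7 taxa topologies as nested tuples, tests whether each grouping is cut off by a single branch, and reduces each qualifying topology by keeping the smallest taxon of every grouping. All function names are hypothetical, and the P value computation itself is left out.

```python
def insert_leaf(tree, leaf):
    """Yield every rooted binary tree made by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)                            # attach above the current root
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)


def all_topologies(n_taxa):
    """Unrooted binary topologies on taxa 1..n_taxa, written as (1, rooted tree on the rest)."""
    trees = [(2, 3)]
    for leaf in range(4, n_taxa + 1):
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return [(1, t) for t in trees]


def leaf_sets(tree, found):
    """Return the leaves below `tree`, recording the leaf set of every node in `found`."""
    if isinstance(tree, tuple):
        leaves = frozenset().union(*(leaf_sets(child, found) for child in tree))
    else:
        leaves = frozenset([tree])
    found.add(leaves)
    return leaves


def is_subtree(tree, grouping, taxa):
    """True when a single branch of `tree` separates `grouping` from the other taxa."""
    found = set()
    leaf_sets(tree, found)
    return frozenset(grouping) in found or (taxa - frozenset(grouping)) in found


def prune(tree, drop):
    """Remove the taxa in `drop`, suppressing nodes left with a single child."""
    if not isinstance(tree, tuple):
        return None if tree in drop else tree
    kept = [k for k in (prune(child, drop) for child in tree) if k is not None]
    if len(kept) == 1:
        return kept[0]
    return tuple(kept) if kept else None


def smallest(tree):
    return min(map(smallest, tree)) if isinstance(tree, tuple) else tree


def canonical(tree):
    """Order children by their smallest taxon so that equal topologies compare equal."""
    if not isinstance(tree, tuple):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=smallest))


taxa = frozenset(range(1, 8))
groupings = [{1, 2}, {5, 6, 7}]
omega = all_topologies(7)
omega_M = [t for t in omega if all(is_subtree(t, g, taxa) for g in groupings)]

# Keep only the smallest taxon of each grouping as its representative.
drop = set().union(*(g - {min(g)} for g in groupings))
reduced = {}
for t in omega_M:
    reduced.setdefault(canonical(prune(t, drop)), []).append(t)

# Topologies outside omega_M keep a P value of 0; one P value per reduced
# topology would then be computed and copied to its members (omitted here).
p_values = {t: 0.0 for t in omega}
print(len(omega), len(omega_M), len(reduced))
```

The enumeration gives 945 topologies in total, 9 in ΩM, and 3 reduced 4 taxon topologies, each shared by 3 of the original topologies.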

Figure 8.5: 7 taxon topologies in ω1

Figure 8.6: 4 taxon topology associated with ω1

The code implements this in a simple, albeit mildly inefficient manner. First, each T taxa topology is associated with its Tr taxa counterpart as follows. Given a T taxa topology, pairs of taxa are combined iteratively. For each pair of taxa i and j satisfying dij ≤ cd (estimated ML distance), the taxa are grouped if they are found to be neighbors in the topology. The resulting distance is taken to be that of the smallest taxon (footnote 1). This process is repeated on the new topology obtained until all maximal taxa groupings have been made. The result is either a Tr taxa topology, or a topology with a larger number of taxa. In the former case, this is a topology lying in ΩM for which a P value can be computed. In the latter case, the topology lies in the complement Ω̄M and is assigned a P value of 0. It should be noted that extra computation is done in two places: first, in reducing the topologies; second, in computing P values for each topology in ΩM rather than once for each topology in the reduced set. Neither of these should prove costly in practice.

Footnote 1: Under the ordering 1, 2, . . . , T.

Figure 8.7: 7 taxon topologies in ω2

Figure 8.8: 4 taxon topology associated with ω2

Figure 8.9: 7 taxon topologies in ω3

Figure 8.10: 4 taxon topology associated with ω3

8.5 Relationships Between Cutoffs

Recall, to this point three different approaches to making confidence regions when V

is nearly singular have been introduced:

1. Use an eigenvalue cutoff ce0 with degrees of freedom np − (2T − 3).

2. Use an eigenvalue cutoff ce1 suggested by the near zero yij with degrees of

freedom np − (2T − 3).

3. Use a distance cutoff cd with degrees of freedom $\binom{T_r}{2} - (2T_r - 3)$ (footnote 2).

Naturally, the question arises of which one of the three cutoff approaches works best
in practice. For the two eigenvalue cutoff approaches: The near 0 variances may

have strong discriminatory power, but they also can cause singularity problems in V .

Alternately, it seems possible that the near 0 variances correspond to no additional

information and merely complicate the computation. Theoretical motivation for the

distance cutoff lies in Theorem 7.1 which attributes the cause of singularity in V to

exactly dij ≃ 0, at least for large samples. But it is unclear how this approach relates

to the alternate approaches. It will be shown that, indeed, the three approaches

differ— especially for smaller samples. And since it is the case they are different, the

remainder of this chapter is devoted to addressing, through simulation, the question

of which method gives better results.

Footnote 2: Note that the degrees of freedom will usually differ in all three cases.


The main points of this section are illustrated by means of a specific example

making use of the 7 taxa tree

(1:0.1,((2:0.001,3:0.001):0.1,(4:0.1,(5:0.1,6:0.1):0.1):0.1):0.1,7:0.1);    (8.2)

written here in Newick format. As usual, sequence data is generated using the seq-

gen routine of the Phylip package [8] with transition/transversion ratio 2. This tree

structure was chosen so that taxa 2 and 3 tend to be grouped using the distance

cutoff.
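To see why this tree tends to group taxa 2 and 3, the path lengths between taxa can be read off the Newick string. The Python sketch below uses a small hand-written parser (the names `parse_newick` and `pairwise_distances` are hypothetical) and works with the true branch lengths of (8.2), not with ML distances estimated from simulated sequences, so it indicates only which pairs are expected to fall below a cutoff such as cd = 0.01.

```python
def parse_newick(text):
    """Parse a Newick string into nested (children, name, branch_length) tuples."""
    text = "".join(text.split())                  # drop whitespace
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                              # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                              # consume ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = 0.0
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return children, name, length

    return node()


def pairwise_distances(node, dist):
    """Fill `dist` with path lengths between leaves; return {leaf: length up to the parent}."""
    children, name, length = node
    if not children:
        return {name: length}
    below = {}
    for child in children:
        sub = pairwise_distances(child, dist)
        for a, da in below.items():
            for b, db in sub.items():
                dist[frozenset((a, b))] = da + db
        below.update(sub)
    return {leaf: d + length for leaf, d in below.items()}


tree_82 = ("(1 : 0.1, ((2 : 0.001, 3 : 0.001) : 0.1, "
           "(4 : 0.1, (5 : 0.1, 6 : 0.1) : 0.1) : 0.1) : 0.1, 7 : 0.1);")
dist = {}
pairwise_distances(parse_newick(tree_82), dist)

cd = 0.01
close_pairs = sorted(tuple(sorted(pair)) for pair, d in dist.items() if d <= cd)
print(close_pairs)
```

Only the pair of taxa 2 and 3, at path length 0.002, falls below the cutoff; every other pair is separated by at least 0.2.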

First consider when the sequence length is small. Due to the inherent small sample

variability we can expect that the near 0 LTD’s have little discriminatory power.

Consequently, the two eigenvalue based cutoffs ce0 and ce1 should lead to similar

confidence regions. However, employing the distance cutoff cd effectively removes the

uncertainty due to the near 0 LTD’s a priori. So the resulting confidence regions

should be smaller, but this may result in grouping taxa erroneously and caution

should be exercised.

For large samples the situation is markedly different. In this case, the LTD’s can

be expected to have high discriminatory power and hence it would be logical to use

as many eigenvalues as possible in the computation of a confidence region. So ce0

should outperform ce1 under such circumstances. Again the cd cutoff acts to omit the

source of uncertainty to begin with, and as such, should produce a confidence region

comparable with the ce0 approach.

In both cases, it is clear the three cutoffs will yield test statistics which are approximately equal. The resulting confidence regions will nevertheless differ because, in general, the degrees of freedom accompanying ce0, ce1, and cd differ. Those for ce0 and ce1 differ because a larger cutoff reduces the number of eigenvalues used in constructing the test statistic, while those for cd differ from both because the number of effective taxa is reduced. Moreover, in the case of cd the topology space may be significantly reduced in size because of taxa groupings. This will work to reduce the resulting size of the confidence region produced.

As an example, sequence data of length 250 was generated under the tree (8.2) and confidence regions were computed using each cutoff. The particular values used were cd = 0.01 so as to group taxa 2 and 3; ce0 = 0.1/250 = 0.0004, that is to say a small constant divided by the sequence length; and ce1 was chosen so that the largest 15 of the 21 eigenvalues were used. Tables 8.1, 8.2, and 8.3 show confidence regions generated using cutoffs ce0, ce1, and cd respectively. The topologies in Newick form have their branch lengths suppressed for clarity of presentation and the true tree is in boldface.

It can be seen in Table 8.4 that the test statistics are nearly equal while the

corresponding P values differ due to the degrees of freedom differing. The most

striking feature is the size of the CRs produced. While, as expected, the size under

the eigenvalue cutoffs is roughly the same; the CR size using the distance cutoff is

dramatically smaller. The primary cause is that the majority of topologies appearing

in the first two CRs are omitted automatically when using the cd cutoff because only

topologies in which 2 and 3 are neighbours are used.

Tables 8.5, 8.6, and 8.7 show CRs generated from sequence data of length 25000. The cutoff values were chosen in a similar manner as before, this time with ce0 = 0.1/25000 = 0.000004. The results again are consistent with expectation. This time, by using the ce1 cutoff, highly relevant information about the relationships between taxa 2 and 3 and the remaining taxa is lost. Consequently, the size of the CR produced using ce1 dwarfs the size of the CR corresponding to ce0, as can be seen in Table 8.8. More interestingly, even for such a large sample the ce0 cutoff could not distinguish between the highly related topologies seen in Table 8.6. However, the distance cutoff cd alleviates this concern by grouping taxa 2 and 3 together.

gT         P-value    Topology
4.332437   0.740789   (1,((2,3),(4,(5,6))),7);
4.346498   0.739114   (1,(2,(3,(4,(5,6)))),7);
4.346498   0.739114   (1,((2,(4,(5,6))),3),7);
9.426283   0.223482   (1,(2,((3,4),(5,6))),7);
9.426283   0.223482   (1,(((2,4),(5,6)),3),7);
9.477332   0.220177   (1,(((2,3),4),(5,6)),7);
9.522713   0.217272   (1,(((2,4),3),(5,6)),7);
9.522713   0.217272   (1,((2,(3,4)),(5,6)),7);
9.523981   0.217191   (1,(2,((3,(5,6)),4)),7);
9.523981   0.217191   (1,(((2,(5,6)),4),3),7);
9.524434   0.217162   (1,((2,4),(3,(5,6))),7);
9.524434   0.217162   (1,((2,(5,6)),(3,4)),7);
9.568989   0.214342   (1,(((2,3),(5,6)),4),7);
9.614511   0.211491   (1,(((2,(5,6)),3),4),7);
9.614511   0.211491   (1,((2,(3,(5,6))),4),7);

Table 8.1: 95% CR using cutoff ce0 with sequence length 250 (18 of 21 eigenvalues used).

gT         P-value    Topology
3.229072   0.520248   (1,(((2,4),(5,6)),3),7);
3.229072   0.520248   (1,(2,((3,4),(5,6))),7);
3.728931   0.443932   (1,((2,4),(5,6)),(3,7));
3.728931   0.443932   (1,(((2,4),(5,6)),7),3);
3.728931   0.443932   (1,2,(((3,4),(5,6)),7));
3.728931   0.443932   (1,(2,7),((3,4),(5,6)));
4.332437   0.362881   (1,((2,3),(4,(5,6))),7);
4.346498   0.361139   (1,((2,(4,(5,6))),3),7);
4.346498   0.361139   (1,(2,(3,(4,(5,6)))),7);
4.352395   0.360410   (1,(2,((3,(5,6)),4)),7);
4.352395   0.360410   (1,(((2,(5,6)),4),3),7);
5.937947   0.203831   (1,2,((3,(4,(5,6))),7));
5.937947   0.203831   (1,(2,7),(3,(4,(5,6))));
5.937947   0.203831   (1,(2,7),((3,(5,6)),4));
5.937947   0.203831   (1,2,(((3,(5,6)),4),7));
5.937947   0.203831   (1,((2,(4,(5,6))),7),3);
5.937947   0.203831   (1,(((2,(5,6)),4),7),3);
5.937947   0.203831   (1,((2,(5,6)),4),(3,7));
5.937947   0.203831   (1,(2,(4,(5,6))),(3,7));

Table 8.2: 95% CR using cutoff ce1 with sequence length 250 (15 of 21 eigenvalues used).



gT         P-value    Topology
4.284726   0.638206   (1,((2,3),(4,(5,6))),7);
9.364156   0.154110   (1,(2,(3,(4,(5,6)))),7);
9.442960   0.150156   (1,((2,(4,(5,6))),3),7);

Table 8.3: 95% CR using cutoff cd with sequence length 250.

cutoff   gT         P-value    d.f. used   CR size
ce0      4.332437   0.740789   7           15
ce1      4.332437   0.362881   4           19
cd       4.284726   0.638206   6           3

Table 8.4: 95% CR summary using cutoffs ce0, ce1, and cd for sequence length 250. The gT and P values are for the true topology.

gT         P-value    Topology
3.951879   0.412557   (1,2,((3,(4,(5,6))),7));
3.951879   0.412557   (1,((2,(4,(5,6))),7),3);
3.978746   0.408890   (1,((2,3),(4,(5,6))),7);
3.979137   0.408837   (1,((2,(4,(5,6))),3),7);
3.979137   0.408837   (1,(2,(3,(4,(5,6)))),7);
3.979518   0.408785   (1,(2,7),(3,(4,(5,6))));
3.979518   0.408785   (1,(2,(4,(5,6))),(3,7));
4.063861   0.397432   (1,2,(((3,4),(5,6)),7));
4.063861   0.397432   (1,(((2,4),(5,6)),7),3);
4.189017   0.381029   (1,2,(((3,(5,6)),4),7));
4.189017   0.381029   (1,(((2,(5,6)),4),7),3);
4.226704   0.376193   (1,(2,7),((3,4),(5,6)));
4.226704   0.376193   (1,((2,4),(5,6)),(3,7));
4.234302   0.375224   (1,(2,((3,4),(5,6))),7);
4.234302   0.375224   (1,(((2,4),(5,6)),3),7);
4.319641   0.364472   (1,(2,7),((3,(5,6)),4));
4.319641   0.364472   (1,((2,(5,6)),4),(3,7));
4.321732   0.364212   (1,(2,((3,(5,6)),4)),7);
4.321732   0.364212   (1,(((2,(5,6)),4),3),7);

Table 8.5: 95% CR using cutoff ce1 with sequence length 25000 (15 of 21 eigenvalues used).

gT         P-value    Topology
3.978746   0.912806   (1,((2,3),(4,(5,6))),7);
3.979137   0.912780   (1,((2,(4,(5,6))),3),7);
3.979137   0.912780   (1,(2,(3,(4,(5,6)))),7);

Table 8.6: 95% CR using cutoff ce0 with sequence length 25000.



gT         P-value    Topology
3.976323   0.679881   (1,((2,3),(4,(5,6))),7);

Table 8.7: 95% CR using cutoff cd with sequence length 25000.

cutoff   gT         P-value    d.f. used   CR size
ce0      3.978746   0.912806   9           3
ce1      3.978746   0.408890   4           19
cd       3.976323   0.679881   6           1

Table 8.8: 95% CR summary using cutoffs ce0, ce1, and cd for sequence length 25000. The gT and P values are for the true topology.


8.6 Motivation and Methods for Comparing the Cutoffs

In this section the arguments of the previous section are tested by means of, once

again, a simulation. The main purpose is to gauge which cutoff should be used under

what circumstances. The simulation details are quite similar to those of Section 6.1.

Two 8 taxa trees based on the topology (1,2,((3,4),((5,6),(7,8)))); are considered.

The two trees are:

Sim I The tree defined by branch lengths:

(1:0.8,2:0.2,((3:0.9,4:0.3):0.1,((5:0.2,6:0.3):0.1,(7:0.003,8:0.007):0.1):0.3):0.1);

Sim II The tree defined by branch lengths:

(1:0.8,2:0.2,((3:0.9,4:0.3):0.1,((5:0.02,6:0.003):0.01,(7:0.003,8:0.007):0.1):0.3):0.1);

As before each tree is used as a true tree from which nucleotide data is simulated

and 95% confidence regions computed according to the same specifications as in Sec-

tion 6.1. But this time in addition to the glsdna routine, glsdna eig (using cutoffs ce0

and ce1) and glsdna dist are used to compute confidence regions.

The branch lengths for the tree in Sim I are chosen so that taxa 7 and 8 are very closely related. The three cutoffs are therefore set as follows: a) ce0 is 0.1 divided by the sequence length, so that near 0 eigenvalues are omitted; b) ce1 is chosen so that the largest 21 of 28 possible eigenvalues are used; and c) cd is chosen to be 0.01, so that taxa 7 and 8 are grouped and confidence regions are constructed on the reduced 7 taxa topologies.

The tree in Sim II, on the other hand, is selected such that taxa 5, 6, 7, and 8 will tend to act as a single taxon. Hence the ce1 cutoff will use only 10 of the 28 eigenvalues. The distance cutoff cd was chosen to be 0.1 so that all four taxa are grouped together, thus reducing the computation to the 5 taxa topologies.

We look at the results of coverage probability and confidence region size. Again,

the numbers of taxa used here were small enough so that all topologies could be

fitted. Also it should be noted that unlike with the previous simulation, the form of

the topology is not necessarily of interest here, but merely how the different cutoffs

compare in a realistic situation.
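The two summaries can be computed from the simulated replicates along the following lines. This is only a sketch under stated assumptions: a topology is taken to be in the 95% confidence region when its P value exceeds 0.05, both summaries are averaged over the replicates for which a region could be made, and the function name, data layout, and toy P values are hypothetical.

```python
def summarise(replicates, true_topology, alpha=0.05):
    """Return (regions made, coverage, average region size).

    replicates: one dict {topology: P value} per simulated data set, or None
    when no confidence region could be produced (for example, singular V).
    """
    made = [r for r in replicates if r is not None]
    regions = [{t for t, p in r.items() if p > alpha} for r in made]
    coverage = sum(true_topology in region for region in regions) / len(regions)
    average_size = sum(len(region) for region in regions) / len(regions)
    return len(made), coverage, average_size


# Hypothetical replicates on three candidate topologies A, B, C with A true.
toy = [
    {"A": 0.60, "B": 0.20, "C": 0.01},
    None,                                   # software failed for this replicate
    {"A": 0.03, "B": 0.40},
    {"A": 0.90, "B": 0.70, "C": 0.30},
    {"A": 0.50},
]
print(summarise(toy, "A"))
```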

8.6.1 Results

A graphical summary of the simulation is given in Figures 8.11 and 8.12. We proceed to interpret the plots.

Confidence regions made:

The number of confidence regions made (10000 attempted for each sequence

length) is essentially as would be expected. Using the GLS software without in-

corporating a cutoff value exacts a serious toll on the number of confidence regions

produced. In both Sim I and Sim II, sequence length 1000 is the limit beyond which

singularity issues cease to be a problem. This will increase with the number of taxa.

When one of the three cutoff approaches is employed, singularity in the covariance

matrix is eliminated, and a confidence region is produced nearly without exception.

The exception occurs in Sim I when using the eigenvalue cutoff ce1. This feature arises simply because for very small sequence lengths there is a tremendous amount of variability in the λij values. Consequently, it is sometimes the case that more eigenvalues are near zero than ce1 omits, and singularity results.

Coverage:

The plots of confidence regions made make it clear that the GLS software alone will fail, in smaller sample cases, more often than is acceptable. The purpose of the remaining plots is to gauge which cutoff is preferable to use in a given situation.

The coverage probabilities using no cutoff are consistent with those observed in

the simulations of Section 6.1. Using this as a baseline, we can compare it with the

coverage probabilities produced by cd, ce0, and ce1. First of all, the cd cutoff gives

coverage probabilities closer to the theoretical value of 0.95— particularly for small

sequence lengths. By contrast, the ce0 cutoff coverage probabilities are essentially

the same as those obtained using no cutoff at all. The most surprising result is the

coverage probabilities arising from using the ce1 cutoff. In both Sim I and Sim II

they differ dramatically from the predicted 0.95 value. In Sim I there is serious

underestimation, 0.90 even for sequence length 10000. Exactly the opposite is found

in Sim II where the coverage probabilities hang quite close to 1. The deviation

observed in Sim I remains unexplained, while in Sim II the probable cause is that

too much information is lost by using merely 10 of 28 possible eigenvalues and the

underlying algorithms break down.

Average confidence region size:

Again using the no cutoff method as a baseline we can see that in both Sim I and

Sim II the average confidence region size falls somewhat below with cd and somewhat

above with ce0. The average confidence region size for the ce1 cutoff is consistent with

the coverage probabilities it produced.

Summary:

Based on the results of this somewhat limited but instructive simulation, it is clear that the cd cutoff gives more desirable coverage probabilities and significantly smaller confidence regions for smaller sample sizes than either ce0 or ce1. In fact, the wild behaviour found when using the ce1 cutoff leads us to recommend against using it. The ce0 cutoff, however, does seem useful, especially for larger samples.

Figure 8.11: Plots of coverage probability of the true topology, the number of confidence regions made, and average confidence region size for Sim I. (Panels: Coverage, CR Made, and Average Number of Trees in CR, each against Sequence Length from 50 to 10000, with a second Average Number of Trees in CR panel for sequence lengths 500 to 10000; curves for no cutoff, cd cutoff, ce1 cutoff, and ce0 cutoff.)

Figure 8.12: Plots of coverage probability of the true topology, the number of confidence regions made, and average confidence region size for Sim II. (Panels: Coverage, CR Made, and Average Number of Trees in CR, each against Sequence Length from 50 to 10000, with a second Average Number of Trees in CR panel for sequence lengths 500 to 10000; curves for no cutoff, cd cutoff, ce1 cutoff, and ce0 cutoff.)

Chapter 9

Conclusions

The main contributions of this thesis were

• A simulation study to gauge the performance of the GLS method for construct-

ing confidence regions for topologies when the sample size is small.

• The source of singularity in the true covariance matrix of the ML distances

was characterized in terms of both the ML distances and the underlying se-

quence data. As well, a necessary and sufficient condition for singularity of the

estimated covariance matrix was found in terms of the underlying sequences.

• A new method was introduced for constructing confidence regions for topologies

using GLS, based on combining closely related taxa, which can be used even

when the covariance matrix is singular. This approach was used because of

the nature of the singularity in the covariance matrix. An implementation is

provided. And this method was compared with other approaches via simulation.

Ideas for improvements and future work include

• Use topologies with more taxa (say 20) in the small sample simulation.

• The programs glsdna dist and glsprot dist work by combining groups of taxa

which are closely related by their ML distances. An alternate approach that

could be applied is to reduce the number of taxa by removing closely related

nucleotide/protein sequences.

• Optimize the code for glsdna dist and glsprot dist. Currently, the code relies on a package due to Gascuel et al. [10] to combine taxa iteratively. In the end this dependency was found to be unnecessary and existing routines in the Phylip package can be applied.


• It should be noted there are various possibilities for assigning a new branch length to a maximal taxa grouping. The naive approach taken here does not take into account the nature of the distances at all. Basically, if the distances under consideration are small, then the choice of how to assign a new length should not affect the result to any great extent. However, other approaches such as taking the average distance over all pairs in a maximal taxa grouping (as in UPGMA) could be tried. Another possibility would be to use the NJ [22] algorithm distances between groups.

• Safia [26] effectively removes singularity from the estimated covariance matrix

by adding a small constant to the diagonal values of the matrix. This approach

should be compared with those suggested here.

Bibliography

[1] W.J. Bruno, N.D. Socci, and A.L. Halpern, Weighted Neighbor Joining: A likeli-hood based approach to distance-based phylogeny reconstruction, Mol. Biol. Evol.16 (2000) 189-197.

[2] M. Bullmer, Use of the method of generalized least squares in reconstructionphylonegies from sequence data, Mol. Biol. Evol. 8 (1991) 868-883.

[3] L. L. Cavalli-Sforza, and A. W. F. Edwards, Phylogenetic Analysis: models and estimation procedures, Am. J. Hum. Genet. 19 (1967) 233-257.

[4] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, A model of evolutionary change in proteins, Atlas of protein sequence and structure. Vol. 5, suppl 3. (1978) 345-352.

[5] B. Efron, E. Halloran, S. Holmes, Bootstrap confidence levels for phylogenetic trees, Proc. Natl. Acad. Sci. (1993) 13429-13434.

[6] J. Felsenstein, Confidence limits on phylogenies: an approach using the bootstrap, Evolution 39 (1985) 783-791.

[7] J. Felsenstein, Inferring Phylogenies, Sinauer Associates Inc, Sunderland, Mass. 2004.

[8] J. Felsenstein, PHYLIP (phylogeny inference package), Version 3.6a2. Distributed by author, Department of Genetics, University of Washington, Seattle.

[9] W.M. Fitch, and E. Margoliash, Construction of Phylogenetic Trees, Science. 155 (1967) 279-284.

[10] O. Gascuel, and R. Desper, Theoretical Foundation of the Balanced Minimum Evolution Method of Phylogenetic Inference and Its Relation to Weighted Least-Squares Tree Fitting, Mol. Biol. Evol. 23(4) (2004) 587-598.

[11] O. Gascuel, BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data, Mol. Biol. Evol. 14 (1997) 685-695.

[12] N. Goldman, J. P. Anderson, and A. G. Rodrigo, Likelihood-based tests of topologies in phylogenetics, Syst. Biol. 49 (2000) 652-670.

[13] M. D. Hendy, and D. Penny, Branch and Bound Algorithms to Determine Minimal Evolutionary Trees, Mathematical Biosciences, 59 (1982) 277-290.


[14] D. Hillis, C. Moritz, and B. Mable, Molecular Systematics, Sinauer Associates Inc, Sunderland, Mass. 1996.

[15] D. T. Jones, W.R. Taylor, and J. M. Thompson, The rapid generation of mutation data matrices from protein sequences, CABIOS 8 (1992) 275-282.

[16] B. W. Kernighan, and D. M. Ritchie, The C programming language, Prentice Hall, New Jersey, 1988.

[17] C. Kosiol, and N. Goldman, Different versions of the Dayhoff rate matrix, Mol. Biol. Evol. 22(2) (2005) 193-199.

[18] M. K. Kuhner, and J. Felsenstein, A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates, Mol. Biol. Evol. 11 (1994) 459-468.

[19] C. L. Lawson, and R. J. Hanson, Solving least squares problems, Prentice Hall, Englewood Cliffs, N.J. 1974.

[20] E. L. Lehmann, Theory of point estimation, Wiley, New York, 1983.

[21] P.J. Lockhart, M. Steel, M. Hendy, and D. Penny, Recovering Evolutionary Trees under a More Realistic Model of Sequence Evolution, Mol. Biol. Evol. 11(4) (1994) 605-612.

[22] M. Nei, Molecular evolutionary genetics, Columbia University Press, New York, 1987.

[23] Y. Pauplin, Direct Calculation of a Tree Length Using a Distance Matrix, J. Mol. Evol. 51 41-47.

[24] S. Ross, Stochastic Processes, John Wiley and Sons, New York, 1983.

[25] A. Rzhetsky, and M. Nei, METREE: a program package for inferring and testing minimum-evolution trees, CABIOS 10 (1994) 409-412.

[26] A. Safia, Phylogenetic Inference by Generalized Least Squares: Computational Aspects, Unpublished Master of Science, McGill, Canada (2006).

[27] N. Saitou, and M. Nei, The neighbor-joining method: A new method for reconstructing phylogenetic trees, Mol. Biol. Evol. 4 (1987) 406-425.

[28] C. Semple, M. Steel, Phylogenetics, Oxford University Press, Oxford, 2003.

[29] X. Shi, H. Gu, E. Susko, C. Field, The comparison of confidence regions in phylogeny, Mol. Biol. Evol. 22 (2005) 2285-2296.

[30] H. Shimodaira, An approximately unbiased test of phylogenetic tree selection, Syst. Biol. 16 (2002) 492-508.


[31] E. Susko, Confidence regions and hypothesis tests for topologies using generalized least squares, Mol. Biol. Evol. 20 (2003) 862-868.

[32] E. Susko, Software for confidence regions and hypothesis tests for topologies using generalized least squares, Available www.mathstat.dal.ca/ tsusko/doc/gls soft.pdf, July 17, 2006.

[33] F. Tajima, and M. Nei, Estimation of evolutionary distance between nucleotide sequences, Mol. Biol. Evol. 1 (1984) 269-285.

[34] S. Whelan, and N. Goldman, A General Empirical model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach, Mol. Biol. Evol. 18(5) (2001) 691-699.
