Carl H. Oliveros

PHYLOGENOMICS OF RAPID AVIAN RADIATIONS

By

Carl H. Oliveros

Submitted to the graduate degree program in Ecology and Evolutionary Biology and the

Graduate Faculty of the University of Kansas in partial fulfillment of the requirements for the

degree of Doctor of Philosophy.

________________________________

Chairperson: Robert G. Moyle

________________________________

Rafe M. Brown

________________________________

John K. Kelly

________________________________

Xingong Li

________________________________

A. Townsend Peterson

TITLE PAGE

Date Defended: 17 July 2015

The Dissertation Committee for Carl H. Oliveros

certifies that this is the approved version of the following dissertation:

Phylogenomics of Rapid Avian Radiations

________________________________

Chairperson: Robert G. Moyle

ACCEPTANCE PAGE

Date approved: 20 July 2015

ABSTRACT

I use data from sequence capture of ultraconserved elements to resolve three rapid

radiations in the avian tree of life and in the process gain insights on applying analytical

strategies with gene tree-based coalescent methods (GCM). In Chapter 1, I explore analytical

strategies that can be employed with GCMs to increase phylogenetic resolution and minimize

highly supported conflicting results, including subsampling taxa to increase the number of gene

trees analyzed, trimming sequences to eliminate sequence length heterogeneity, and filtering loci

based on information content. These strategies are used to reconstruct a highly resolved and

consistent phylogenetic hypothesis for the relatively young avian family, Zosteropidae. I show

how conflicting results from different GCMs can arise from biases introduced by sequence

length heterogeneity and uninformative loci that can lead to strongly supported incorrect

estimates of phylogeny. In Chapter 2, I examine higher-level relationships in the enigmatic core

Corvoidea group of Oscine passerines. A highly resolved phylogeny of core Corvoidea is

recovered, with a majority of nodes receiving high support from both ML and coalescent

analyses. I show that short sequence lengths do not bias species tree estimates of GCMs if

informative sites are present in these sequences. In contrast, samples whose sequences are longer

than those of most taxa but shorter than those of other taxa in their own clade can

still bias species tree estimates of GCMs. In Chapter 3, I develop a hypothesis on the

origins of the trogons (Trogonidae) based on a robust dated phylogeny estimated from thousands

of genome-wide loci. I recover the first well-supported hypothesis of relationships among trogon

genera. This topology, combined with the trogon fossil record, geologic, and climatic data,

suggests an Old World origin in the Late Oligocene/Early Miocene for the crown group. In this

chapter, I show that in some datasets in which loci have high information content, exclusion of

less informative loci in analysis can lead to lower bootstrap support of species tree estimates of

GCMs.

ACKNOWLEDGEMENTS

I am grateful to the following people and their institutions for providing tissue and toe

pad loans: Nate Rice, Academy of Natural Sciences of Drexel University; Paul Sweet, Tom

Trombone, and Peter Capainolo, American Museum of Natural History; Herman Mays,

Cincinnati Museum Center; Ben Marks, Field Museum of Natural History; Donna Dittmann,

Louisiana State University Museum of Natural Science; Helen James and Chris Milensky,

National Museum of Natural History; Mark Robbins, University of Kansas Natural History

Museum; Chris Witt and Andrew Johnson, University of New Mexico Museum of Southwestern

Biology; Sharon Birks, University of Washington Burke Museum of Natural History and

Culture; Ron Johnstone, Western Australia Museum; Kristof Zyskowski, Yale Peabody

Museum.

Brant Faircloth, Noor White, Brian Smith, and Michael Harvey gave advice on UCE

laboratory protocols; Brant Faircloth offered guidance on UCE data processing.

I thank Mike Andersen and Rob Moyle for providing edits and suggestions on Chapters 1

and 2. Rob Moyle, Mike Andersen, Pete Hosner, Fred Sheldon, and Joel Cracraft provided

comments and edits on an earlier version of Chapter 3.

I am grateful to the following funding sources for supporting my research: National

Science Foundation (NSF) Doctoral Dissertation Improvement Grant, American Museum of

Natural History Frank M. Chapman Fund, KU Biodiversity Institute Panorama Grants, KU

Graduate Student Research Fund. The KU Graduate Fellowships, KU EEB Summer

Fellowships, and NSF grants to Rob Moyle and Rafe Brown also supported me as a Graduate

Research Assistant.

My graduate career was enriched by unforgettable (and some forgettable) experiences in

the field and I thank Rob Moyle, Town Peterson, and Mark Robbins for giving me the

opportunity to work in places like Cambodia, the Democratic Republic of Congo, Indonesia,

Palau, and the Philippines. I also express my deep appreciation for collaborators and field

workers in those countries, who are too many to name, for all the memories.

My graduate career and life in Lawrence would not have been as enjoyable without

friends in Lawrence including EEB graduate students and faculty, their significant others and

family, and the Filipino jayhawks. I especially thank members of the Moyle lab—Michael

Andersen, Luke Campillo, Pete Hosner, Muhammad Janra, Robin Jones, Luke Klicka, and Joe

Manthey—for being sources of advice, support, fun, and merriment.

Members of my Ph.D. committee, Rafe Brown, John Kelly, Xingong Li, Rob Moyle, and

Town Peterson, have been very supportive throughout all these years. Mark Holder, whom I

recently had to replace in my committee with John Kelly because of a scheduling conflict, has

been very helpful as well. I am deeply grateful to my main advisor, Rob Moyle, who has been a

terrific mentor and friend.

My family in the Philippines has been a great source of energy, encouragement, and love

even if they are on the other side of the globe. Most especially, I am privileged to have the love

and support of my wife, Cynthia.

TABLE OF CONTENTS

Title Page

Acceptance Page

Abstract

Acknowledgements

Table of Contents

List of Figures

List of Tables

List of Appendices

Introduction

Chapter 1

    Introduction

    Methods

    Results

    Discussion

Chapter 2

    Introduction

    Methods

    Results

    Discussion

Chapter 3

    Introduction

    Methods

    Results

    Discussion

Literature Cited

Appendices

LIST OF FIGURES

Figure 1.1. Missing data in sequence capture

Figure 1.2. Estimate of phylogenetic relationships in Zosteropidae from the study of Moyle et al. (2009)

Figure 1.3. Analysis approach with gene tree-based coalescent methods

Figure 1.4. Maximum likelihood estimate of phylogenetic relationships in Zosteropidae

Figure 1.5. Species tree estimates obtained in Iteration 0

Figure 1.6. Resolution in final species tree estimates of Zosteropidae and the Pacific clade

Figure 1.7. Excerpts of phylogenies containing the Pacific clade estimated with MP-EST

Figure 1.8. Phylogenetic position of samples with short sequences estimated with MP-EST

Figure 1.9. Final species tree estimates from GCMs

Figure 2.1. Estimate of interfamilial relationships in core Corvoidea from previous studies

Figure 2.2. Maximum likelihood estimate of phylogenetic relationships in core Corvoidea

Figure 2.3. Estimate of interfamilial relationships in core Corvoidea based on maximum likelihood and ASTRAL analyses

Figure 3.1. Trogonidae generic level phylogeny

Figure 3.2. Effect of the number of loci analyzed on bootstrap support values

LIST OF TABLES

Table 1.1. Sampling information

Table 1.2. Dataset characteristics used in GCM analyses

Table 2.1. Sampling information

Table 2.2. Characteristics of datasets used in coalescent analyses

Table 3.1. Sampling information

Table 3.2. Distribution of unique gene tree topologies

LIST OF APPENDICES

Appendix 1.1. Species tree estimates of STAR for Iterations 1, 2a, 2b, 3a, 3b

Appendix 1.2. Species tree estimates of STEAC for Iterations 1, 2a, 2b, 3a, 3b

Appendix 1.3. Species tree estimates of MP-EST for Iterations 1, 2a, 2b, 3a, 3b

Appendix 1.4. Species tree estimates of ASTRAL for Iterations 1, 2a, 2b, 3a, 3b

Appendix 1.5. Species tree estimates of STAR for Iterations 1i, 2ai, 2bi, 3ai, 3bi

Appendix 1.6. Species tree estimates of STEAC for Iterations 1i, 2ai, 2bi, 3ai, 3bi

Appendix 1.7. Species tree estimates of MP-EST for Iterations 1i, 2ai, 2bi, 3ai, 3bi

Appendix 1.8. Species tree estimates of ASTRAL for Iterations 1i, 2ai, 2bi, 3ai, 3bi

Appendix 1.9. Phylogenetic position of Dasycrotapha pygmaea

Appendix 1.10. Phylogenetic position of Zosterops oleagineus

Appendix 1.11. Phylogenetic position of Zosterops lateralis lateralis

Appendix 1.12. Phylogenetic position of Zosterops melanocephalus

Appendix 1.13. Phylogenetic position of Megazosterops palauensis

Appendix 2.1. Species tree estimates of STAR, STEAC, MP-EST, and ASTRAL among basal families and major superfamilies in core Corvoidea, Mohouidae, and Orioloidea

Appendix 2.2. Species tree estimates of STAR, STEAC, MP-EST, and ASTRAL among Malaconotoidea and Orioloidea

Appendix 3.1. Phylogenetic placement of Trogoniformes

INTRODUCTION

High-throughput sequencing is revolutionizing the field of systematics. Techniques such

as whole-genome sequencing and re-sequencing, reduced representation strategies, sequence

capture, and RNA sequencing are now employed to examine population structure, patterns of

diversification, and phylogenetic relationships in birds (Toews et al., In review). The advances

in sequence data acquisition have brought an unprecedented amount of data to ornithologists,

exemplified in a recent study of higher-level phylogenetic relationships of birds based on 48

complete genomes (Jarvis et al. 2014). Sequence capture techniques (Bi et al. 2012; Faircloth et

al. 2012; Lemmon et al. 2012) have been favored in deep level phylogenetic studies involving

several taxa because they provide a large number of orthologous sequences for comparisons.

The massive amount of sequence data from new sequencing techniques also poses

challenges for systematic ornithologists. First, bioinformatics skills are required to handle and

examine the data. Second, phylogenetic inference methods that process data at this scale are still

in their infancy. For example, a debate on the use of traditional concatenation approaches vs.

gene tree-based coalescent methods (GCMs) in species tree inference is ongoing (Song et al.

2012; Gatesy and Springer 2014; Springer and Gatesy 2014; Zhong et al. 2014; Liu et al. 2015;

Tonini et al. 2015). At the same time, the large number of loci in phylogenomic datasets provides

opportunities to resolve nodes in the Tree of Life that are otherwise difficult to untangle with

small datasets, especially rapid radiations, which cause high levels of incomplete lineage sorting

(Whitfield and Lockhart 2007). A high number of loci allows us to better sample gene tree

distributions generated by the species trees.

In this dissertation, I use data from sequence capture of ultraconserved elements to

resolve three rapid radiations in the avian tree of life and in the process gain insights on

analytical approaches and strategies for dealing with phylogenomic datasets.

In Chapter 1, I explore analytical strategies that can be employed with GCMs to increase

phylogenetic resolution and minimize highly supported conflicting results. I show how

conflicting results from different GCMs can arise from biases introduced by sequence length

heterogeneity and uninformative loci that can lead to strongly supported incorrect estimates of

phylogeny. Strategies for eliminating these biases and increasing resolution of species tree

estimates of GCMs are proposed and tested, including subsampling taxa to increase the number

of gene trees analyzed, trimming sequences to eliminate sequence length heterogeneity, and

filtering loci based on information content. These strategies are used to reconstruct phylogenetic

relationships in the relatively young avian family, Zosteropidae, using four different GCMs from

sequence capture data. The resulting estimates are highly resolved, consistent with each other,

and comparable to the maximum likelihood estimate of the concatenated data. The analytical

strategies learned in this Chapter are applied and explored further in the subsequent chapters.

In Chapter 2, I examine higher-level relationships in the enigmatic core Corvoidea group

of Oscine passerines, a clade comprising 773 species that occur worldwide. Phylogenetic

relationships among the 29 families in this large clade have been largely unresolved owing to a

combination of inadequate character and taxon sampling in previous studies and short time

intervals between lineage splitting events. I use sequence data from thousands of ultraconserved

element loci obtained from 86 species to estimate higher level phylogenetic relationships in the

group using maximum likelihood (ML) and coalescent methods. A highly resolved phylogeny of

core Corvoidea is recovered, with a majority of nodes receiving high support from both ML and

coalescent analyses. Four families restricted to or thought to have originated from the Australia-

New Guinea region and New Zealand form the earliest branching lineages in the group. The

other families are divided into three major clades each consisting of 8–9 families. In this

chapter, I show that short sequence lengths do not bias species tree estimates of GCMs if

informative sites are present in these sequences. In contrast, samples whose sequences are longer

than those of most taxa but shorter than those of other taxa in their own clade can

still bias species tree estimates of GCMs.

In Chapter 3, I develop a hypothesis on the origins of the trogons (Trogonidae) based on

a robust dated phylogeny estimated from thousands of genome-wide loci. Phylogenetic

reconstruction in the trogons has been problematic in previous studies for two reasons: (1)

successive short internodes at the base of this radiation cause high gene tree discordance, and (2)

a long branch leading to the family makes root placement problematic. I recover the first well-

supported hypothesis of relationships among trogon genera. Trogons comprise three clades, each

confined to one of three biogeographic regions: Africa, Asia, and the Neotropics, with the

African clade sister to the rest of trogons. This topology, combined with the trogon fossil record,

geologic, and climatic data, suggests an Old World origin for the crown group. Continental

connections during the warm Late Oligocene/Early Miocene facilitated dispersion between

Eurasia, North America, and Africa, and subsequent global cooling plausibly caused divergence

between main trogon lineages. In this chapter, I show that in some datasets in which loci have

high information content, exclusion of less informative loci can lead to lower bootstrap support

of species tree estimates of GCMs.

CHAPTER 1

Strategies for Consistent Species Tree Inference from Sequence Capture Data:

a Case Study with the White-eyes (Aves: Zosteropidae)

ABSTRACT

Gene tree-based coalescent methods (GCM) for estimating species trees from

phylogenomic datasets have been criticized for, among other things, producing conflicting results

when applied to the same data and yielding lower resolution compared to results from maximum

likelihood inference on concatenated data. In this paper we show how conflicting results from

different GCMs can arise from biases introduced by sequence length heterogeneity and

uninformative loci that can lead to strongly supported incorrect estimates of phylogeny.

Strategies for eliminating these biases and increasing resolution of species tree estimates of

GCMs are proposed and tested, including subsampling taxa to increase the number of gene trees

analyzed, trimming sequences to eliminate sequence length heterogeneity, and filtering loci

based on information content. These strategies are used to reconstruct phylogenetic relationships

in the avian family Zosteropidae using four different GCMs from sequence capture of

ultraconserved elements. The resulting estimates are highly resolved, consistent with each other,

and comparable to the maximum likelihood estimate of the concatenated data.

INTRODUCTION

Collection of DNA sequence data from hundreds to thousands of genome-wide loci is

now in wide practice for phylogenetic studies. Sequence capture techniques (Bi et al. 2012;

Faircloth et al. 2012; Lemmon et al. 2012) provide large numbers of orthologous sequences that

have been used to resolve deep-level phylogenetic relationships in various groups of organisms

(Crawford et al. 2012; McCormack et al. 2012, 2013; Faircloth et al. 2013; Brandley et al. 2015).

Two major approaches are currently employed to estimate phylogenies with datasets of this

scale. In the traditional approach, all loci are concatenated into a large supermatrix on which

maximum likelihood (ML) or Bayesian tree inference is performed. This approach is statistically

inconsistent and can yield highly supported incorrect topologies if species trees are in the anomaly

zone (Kubatko and Degnan 2007; Roch and Steel 2015), but it is otherwise expected to perform

well if species tree branch lengths are long enough such that individual gene trees are congruent

(Kubatko and Degnan 2007; Mirarab et al. 2014b; Liu et al. 2015). A second approach

incorporates gene tree discordance in species tree inference. An entire class of methods that use

information from inferred gene trees to estimate the species tree has gained wide use owing to

their theoretical strengths and computational tractability. These methods follow a two-step

process. First, gene trees are estimated for each locus (usually using a maximum likelihood

approach), then the species tree is estimated based on information from these gene trees. The

weaknesses of these gene tree-based methods include: (1) different gene tree-based methods

produce highly supported conflicting results from the same data (Gatesy and Springer 2014;

Springer and Gatesy 2014), (2) branch support values are lower compared to those obtained from

concatenation techniques (Edwards 2009; Liu et al. 2009b), (3) some uncontroversial clades are

not recovered by these methods, whereas concatenation methods recover them with high support

(Gatesy and Springer 2014), (4) species tree estimates are less accurate compared to those

obtained by concatenation when the inference is based on gene trees estimated from sequence

data (Gatesy and Springer 2014), and (5) potential recombination within individual loci violates

one of the main assumptions of the multispecies coalescent model (Gatesy and Springer 2014).

In this paper, we propose and test analytical strategies to address the first three

weaknesses of gene tree-based methods in an empirical context. Specifically, we demonstrate

that analyzing subsamples of taxa, trimming sequence data from some loci, and filtering loci

based on information content can increase resolution and minimize highly supported conflicting

results in gene tree-based species tree methods. We illustrate these techniques using sequence

capture of ultraconserved elements (UCEs) to estimate the phylogeny of the diverse avian family

Zosteropidae, which contains one of the most rapid radiations known among vertebrates (Moyle

et al. 2009).

Gene Tree-Based Coalescent Methods

Several gene tree-based techniques have been developed to estimate species trees: STEM

(Kubatko et al. 2009), STAR (Liu et al. 2009b), STEAC (Liu et al. 2009b), MP-EST (Liu et al.

2010), GLASS (Mossel and Roch 2010), NJ-ST (Liu and Yu 2011), STELLS (Wu 2012), MulRF

(Chaudhary et al. 2014), and ASTRAL (Mirarab et al. 2014c). In this paper, we compare the

performance of four methods, STAR, STEAC, MP-EST, and ASTRAL, which are not truly

coalescent based but are statistically consistent under the multispecies coalescent model (Liu et

al. 2009b, 2010; Mirarab et al. 2014c). We henceforth refer to these methods as gene tree-based

coalescent methods, or GCMs (Liu et al. 2015).

These four GCMs take different approaches to species tree estimation. STAR and

STEAC are most similar in that both use distance-based tree reconstruction based on coalescent

information from gene trees; the former uses a matrix of average ranks of coalescences between

taxa across all gene trees, whereas the latter uses a matrix of average coalescence times. In

contrast, MP-EST maximizes a pseudo-likelihood function that is based on the frequency of

rooted triples in gene trees. Lastly, ASTRAL finds the species tree that maximizes the number

of compatible quartets in the gene trees. The first three methods require rooted gene trees as

input, whereas the last accepts unrooted gene trees. In addition, STEAC is the sole method that

requires branch length information in the gene trees.
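
As an illustration only, the quartet criterion described above can be sketched in a few lines of Python. This is not ASTRAL's algorithm or its code: it merely scores a candidate species tree by counting, over all gene trees, the four-taxon subsets whose induced topology matches that of the candidate. The nested-tuple tree representation, the function names, and the five-taxon toy trees are assumptions made for this sketch, and every gene tree is assumed to be fully resolved and to contain every taxon.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set, list of clade leaf sets) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def quartet_split(clade_sets, quartet):
    """Return the 2|2 split a tree induces on four taxa, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_sets:
        inside = clade & q
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None


def quartet_score(species_tree, gene_trees):
    """Count gene-tree quartets that agree with the candidate species tree."""
    taxa, species_clades = clades(species_tree)
    gene_clade_sets = [clades(g)[1] for g in gene_trees]
    score = 0
    for quartet in combinations(sorted(taxa), 4):
        target = quartet_split(species_clades, quartet)
        if target is None:
            continue
        score += sum(quartet_split(gc, quartet) == target for gc in gene_clade_sets)
    return score


# Hypothetical toy data: two gene trees match tree_1, one matches tree_2.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
gene_trees = [tree_1, tree_1, tree_2]

print(quartet_score(tree_1, gene_trees), quartet_score(tree_2, gene_trees))
```

Because the induced quartet topology does not depend on where a tree is rooted, the sketch treats its input as unrooted, which mirrors the point above that ASTRAL accepts unrooted gene trees.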

Simulation and empirical studies provide insights into the performance of GCMs. For

instance, these methods can be implemented with some species missing for some loci, but

including such loci is not advisable for several reasons. First, missing species in some loci can

introduce bias to species tree estimation results in some GCMs. A large number of missing species

in some loci can bias the ranks in that particular locus in a STAR analysis (Zhong et al. 2014).

The results of MP-EST may not be biased by missing species as long as missing species occur

randomly across loci (Zhong et al. 2014), but this may not be the case with empirical data (see

below). ASTRAL can also conceivably be biased by non-random missing lineages in cases

where quartets that represent the correct topology are underrepresented in gene trees due to

missing lineages from these quartets. Second, a large proportion of missing species can lead to

low bootstrap support in MP-EST analysis, but this could be remedied by sampling a large

number of loci (Liu et al. 2010; Zhong et al. 2014). Third, the effects of missing species in some

loci have not been thoroughly studied for the GCMs used in this study, but see Hovmöller et al.

(2013) for results on STEM. Fourth, because STAR, STEAC, and MP-EST require rooted gene

trees, these require the same outgroup species to be present in all loci. Lastly, in order for results

of all four GCMs to be comparable in this study, species tree inferences should be based on the

same data, requiring the use of datasets with no missing species in each locus.
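
A minimal sketch of how a dataset with no missing species per locus could be assembled is given below. It is an illustration under stated assumptions, not the pipeline used in this study: the function name, the dictionary layout, and the locus and taxon labels are all hypothetical.

```python
def complete_loci(taxa_by_locus, required_taxa):
    """Return the sorted names of loci that contain every required taxon."""
    required = set(required_taxa)
    return sorted(
        locus for locus, taxa in taxa_by_locus.items() if required <= set(taxa)
    )


# Hypothetical toy data: which taxa were recovered for each locus.
taxa_by_locus = {
    "uce-1": {"outgroup", "sp1", "sp2", "sp3"},
    "uce-2": {"outgroup", "sp1", "sp2"},
    "uce-3": {"sp1", "sp2", "sp3"},
    "uce-4": {"outgroup", "sp1", "sp3"},
}

all_taxa = ["outgroup", "sp1", "sp2", "sp3"]
subset = ["outgroup", "sp1", "sp2"]

print(complete_loci(taxa_by_locus, all_taxa))  # loci complete for all taxa
print(complete_loci(taxa_by_locus, subset))    # loci complete for the subset
```

Keeping the outgroup in every required set reflects the rooting requirement of STAR, STEAC, and MP-EST noted above.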

Using a large number of loci benefits GCM analyses because they are statistically

consistent; i.e., the probability of the estimated species tree being congruent to the true species

tree increases to 1.0 as the number of gene trees analyzed increases (Liu et al. 2009b, 2010;

Mirarab et al. 2014c). Apart from increasing accuracy, analyzing a large number of gene trees

has also been shown to increase nodal support in species tree estimation (Song et al. 2012).

Therefore, maximizing the number of loci should be a goal of data collection and assembly.

Poor phylogenetic signal in individual loci, which results in high gene tree estimation

error, has been shown to result in poor accuracy of GCMs (Bayzid and Warnow 2013; Gatesy

and Springer 2014) or low bootstrap support values for the estimated species trees (Zhong et al.

2013; Mirarab et al. 2014a; Liu et al. 2015). One approach proposed to overcome this issue is to

combine alignments from multiple loci to increase phylogenetic signal in gene tree estimation

(Bayzid and Warnow 2013; Mirarab et al. 2014a); but these “binning” techniques have been

criticized as possibly biasing gene tree distributions and resulting in incorrect species tree

topologies (Liu et al. 2015). Another strategy is to exclude weakly supported gene trees (e.g.,

omitting loci with < 50% average bootstrap support) from species tree analysis, which has been

shown to improve bootstrap support values in MP-EST analyses (Zhong et al. 2013; Liu et al.

2015). Yet another solution is to augment phylogenetic signal by increasing phylogenomic data

sets (i.e., increasing the number of loci or alignment lengths; Liu et al. 2015).
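
The second strategy, excluding weakly supported gene trees, can be sketched as follows. The 50% threshold follows the example cited above; the function name, the input layout (one list of nodal bootstrap values per locus), and the locus labels are assumptions made for illustration, and the sketch does not reproduce the procedure of any cited study.

```python
from statistics import mean


def filter_loci_by_support(bootstrap_by_locus, min_mean_support=50.0):
    """Keep loci whose gene tree has mean nodal bootstrap >= the threshold."""
    kept = {}
    for locus, supports in bootstrap_by_locus.items():
        # Loci with no support values (e.g., unresolved gene trees) are dropped.
        if supports and mean(supports) >= min_mean_support:
            kept[locus] = supports
    return kept


# Hypothetical nodal bootstrap values (percent) for three gene trees.
bootstrap_by_locus = {
    "uce-1": [98, 75, 60],
    "uce-2": [40, 22, 35],
    "uce-3": [50, 50, 50],
}

print(sorted(filter_loci_by_support(bootstrap_by_locus)))
```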

Missing Data in Sequence Capture Data Sets

It is important to understand the nature of missing data in empirical datasets. Sequence

capture techniques result in missing data in phylogenomic datasets on two levels. First, for each

individual sample, not all loci targeted are enriched and some samples can have a significantly

lower number of loci enriched compared with others (Fig. 1.1a). These samples are not missing

data randomly across loci; they are therefore expected to lack sequence data in any given locus

more often than other samples. Additionally, enrichment success rates are often significantly less

than 100% (~80% in our case). Combined with the stochastic nature of locus enrichment, this

means that if 4000 out of 5000 loci are enriched for each of two samples, the number of enriched

loci common to both samples will be < 4000. If enrichment were completely random, this number

would be 0.80 × 4000 = 3200 loci. As the number of samples increases, therefore, the number

of enriched loci in common to all samples decreases to a small fraction of loci enriched per

individual (Fig. 1.1b). If GCM analyses are performed on complete matrices consisting of loci

for which all samples have sequence data, only a small fraction of the data available is used.

However, if datasets with fewer taxa are assembled for GCM analyses, the number of loci

common to all samples can be significantly higher (Fig. 1.1b).
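
The arithmetic above generalizes to any number of samples if one assumes, purely as an idealization, that each locus is enriched independently in each sample with the same probability. The expected number of loci common to k samples is then the number of targeted loci multiplied by the success rate raised to the power k. The function name and this independence assumption are mine; as noted above, missing data in real samples are not random, so the sketch conveys only the direction and rough speed of the decline.

```python
def expected_shared_loci(n_targeted, success_rate, n_samples):
    """Expected loci enriched in all samples under independent enrichment."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return n_targeted * success_rate ** n_samples


# 5000 targeted loci and an ~80% success rate, as in the example in the text.
for k in (1, 2, 5, 10, 20):
    print(k, round(expected_shared_loci(5000, 0.80, k), 1))
```

With two samples the function returns the 3200 loci given in the text, and the expected count shrinks geometrically as samples are added, which is why subsets with fewer taxa retain many more complete loci.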

Secondly, sequence capture methods result in alignments with heterogeneous sequence

lengths (i.e., some samples have missing nucleotides at their flanks caused by varying lengths

and alignment position of assembled contigs). Retaining flanking sites with missing data in each

alignment has the advantage of retaining informative sites for samples that have data at these

sites, especially in some sequence capture techniques that yield loci in which informative sites

are located towards the flanks (Lemmon et al. 2012; McCormack et al. 2012). However, missing

data in alignment flanks can mislead gene tree estimation if the short sequences are missing a

sufficient number of informative sites (Fig. 1.1c). Some samples have consistently short

sequences across many loci because of the suboptimal quality of DNA extracts (e.g., DNA

extracted from museum skins or from poorly preserved tissue). For these samples, if their

placement in gene trees is misestimated in a significant number of loci, then their placement in

species tree analyses based on gene trees is expected to be misestimated as well.

Figure 1.1. Missing data in sequence capture. a) Number of ultraconserved element (UCE) loci obtained for 71 samples of Zosteropidae using a sequence capture technique, with a few samples yielding substantially fewer loci. Dashed line indicates number of loci common to all samples. b) The number of loci common to a subset of samples decreases as the size of the subset increases. The order samples were added to the subset follows that in (a) from left to right. c) Missing data at flanks of alignment can mislead gene tree estimation. Top panel shows alignment with full data and correct gene tree estimate. Informative sites occur towards the flanks of the alignment as in UCE loci obtained from sequence capture. Bottom panel shows same alignment with one sequence truncated causing it to lose informative sites and resulting in the incorrect gene tree.

The Family Zosteropidae

The avian family Zosteropidae (white-eyes) consists of 121 recognized species in 12

genera (Dickinson and Christidis 2014; Alstrom et al. 2015). Most species are endemic to

islands or archipelagos of Southeast Asia and Oceania, but the entire distribution spans a vast

area in the Old World, from the eastern Atlantic to the western Pacific. A series of molecular

studies clarified the species composition of the family (Cibois 2003; Gelang et al. 2009; Moyle et

al. 2009, 2012; Alstrom et al. 2015) and it is now understood that members of Yuhina, which

occur in mainland Asia and Borneo, form the earliest branches in Zosteropidae (Fig. 1.2). The

oldest divergence in the family was estimated to have occurred in the late Miocene (Moyle et al.

2009). Next to branch off are “old” zosteropids of the genera Zosterornis, Cleptornis,

Dasycrotapha, Sterrhoptilus, and Lophozosterops, a group whose distribution is concentrated in

insular Southeast Asia. Finally, the largest clade in the family comprises the rapid, widespread,

and species-rich radiation of white-eyes that consists of the genus Zosterops, but also includes

the monotypic genus Chlorocharis. This enormous radiation involves ~80 species that began

diversifying only very recently in the early Pleistocene, yielding one of the highest speciation

rates known among terrestrial vertebrates (Moyle et al. 2009; Jetz et al. 2012). Phylogenetic

relationships in the family remain largely unresolved owing to short intervals between speciation

events in multiple sections of the phylogeny, especially within the rapid radiation of Zosterops.

A high level of gene tree discordance is therefore expected in this radiation. Thus, the family

provides an ideal system to evaluate GCM analytical strategies with empirical data.

METHODS

Sampling

We sampled 70 individuals from 65 of 121 recognized species belonging to 9 of 12

genera in the family Zosteropidae (Dickinson and Christidis 2014; Alstrom et al. 2015): one

individual from 61 species and 2–3 individuals from 4 species (Table 1.1). Three samples of Z.

lateralis and two samples of Z. luteus were included because of paraphyly in their mitochondrial

ND2 gene trees (Nyári and Joseph 2013). Two subspecies of Z. palpebrosus were sampled

because we had access to material from two very disparate geographic regions (the Lesser

Sundas and Vietnam). Two individuals of Z. kulambangrae were used as control samples that

should be sister to each other in analyses. The monotypic genera Tephrozosterops and

Apalopteron, and the Caroline Islands-endemic genus Rukia were not sampled. The genus

Madanga, which has long been classified in Zosteropidae, was recently found to belong to the

family Motacillidae (Alstrom et al. 2015). A sample of Timalia pileata (Timaliidae), a member

of the sister clade to Zosteropidae (Moyle et al. 2012), was used as an outgroup.

Table 1.1. Sampling information

Taxon | Accession Number | Locality | Number of UCE loci enriched | Mean contig length (bp) | Mean coverage (X)
Chlorocharis emiliae | KU 17802 | Borneo, Malaysia | 3981 | 765.6 | 26.3
Cleptornis marchei | KU 22576 | Saipan, Micronesia | 4280 | 895.2 | 39.6
Dasycrotapha plateni | KU 19056 | Mindanao, Philippines | 4132 | 701.6 | 33.0
Dasycrotapha pygmaea | AMNH 708397 | Samar, Philippines | 3999 | 306.9 | 54.6
Lophozosterops goodfellowi | KU 28427 | Mindanao, Philippines | 4270 | 847.2 | 37.6
Lophozosterops squamiceps | AMNH DOT12549 | Sulawesi, Indonesia | 4205 | 934.9 | 34.3
Lophozosterops squamifrons | LSUMNS B51197 | Borneo | 3110 | 700.4 | 28.3
Lophozosterops superciliaris | WAM 23291 | Flores, Indonesia | 3997 | 834.0 | 33.1
Lophozosterops wallacei | WAM 22903 | Sumba, Indonesia | 4186 | 930.4 | 34.5
Megazosterops palauensis | KU 23671 | Palau | 3417 | 662.7 | 17.2
Sterrhoptilus capitalis | KU 28326 | Mindanao, Philippines | 4247 | 929.7 | 38.1
Sterrhoptilus dennistouni | KU 20225 | Luzon, Philippines | 4317 | 929.4 | 43.1
Sterrhoptilus nigrocapitatus | KU 18034 | Luzon, Philippines | 4253 | 900.4 | 38.2
Timalia pileata | KU 23375 | Vietnam | 4268 | |
Yuhina brunneiceps | AMNH DOT5230 | Taiwan | 3904 | 772.3 | 29.1
Yuhina castaniceps | KU 13784 | China | 4208 | 892.9 | 34.9
Yuhina diademata | KU 11118 | China | 4263 | 776.4 | 38.9
Yuhina everetti | KU 17756 | Borneo, Malaysia | 4312 | 890.6 | 39.1
Yuhina flavicollis | KU 15170 | Myanmar | 4227 | 782.0 | 36.1
Yuhina gularis | KU 15173 | Myanmar | 4253 | 887.1 | 37.7
Yuhina nigrimenta | KU 9997 | China | 4271 | 958.0 | 35.0
Yuhina occipitalis | KU 15177 | Myanmar | 4225 | 825.5 | 33.4
Zosterops atricapilla | KU 17735 | Borneo, Malaysia | 3901 | 641.6 | 26.8
Zosterops atrifrons | AMNH DOT12620 | Sulawesi, Indonesia | 4243 | 866.5 | 39.1
Zosterops capensis | FMNH 390165 | South Africa | 4230 | 929.2 | 34.8
Zosterops chloris | AMNH DOT12558 | Sulawesi, Indonesia | 3636 | 700.9 | 23.4
Zosterops cinereus | KU 23678 | Palau | 3756 | 745.1 | 21.9
Zosterops citrinella | WAM 23542 | Roti, Indonesia | 4172 | 749.8 | 33.8
Zosterops conspicillatus | KU 22583 | Saipan, Micronesia | 3864 | 655.3 | 24.0
Zosterops erythropleurus | KU 28088 | Vietnam | 4593 | 847.8 | 107.9
Zosterops everetti | KU 13949 | Camiguin Sur, Philippines | 4326 | 908.8 | 40.1
Zosterops explorator | KU 24401 | Fiji | 3893 | 803.8 | 29.0
Zosterops flavifrons | LSUMNS B45805 | Vanuatu | 3171 | 707.9 | 30.8
Zosterops fuscicapilla | NMNH 2003-062 | Louisiade Island, PNG | 4350 | 1040.5 | 56.7
Zosterops griseotinctus | NMNH 2003-067 | Louisiade Island, PNG | 4401 | 991.9 | 47.4
Zosterops hypoxanthus | KU 27718 | New Ireland, PNG | 4338 | 937.1 | 42.4
Zosterops japonicus | KU 28142 | Vietnam | 3668 | |
Zosterops kulambangrae 1 | UWBM 76278 | Kohingo, Solomon Islands | 4314 | 967.1 | 45.2
Zosterops kulambangrae 2 | KU 15931 | Kohingo, Solomon Islands | 3935 | 689.3 | 26.4
Zosterops lacertosus | KU 19413 | Santa Cruz, Solomon Islands | 3759 | 697.7 | 23.2
Zosterops lateralis chloronotus | KU 6094 | SW Australia | 4265 | 918.0 | 40.9
Zosterops lateralis flaviceps | KU 22568 | Fiji | 3988 | 777.1 | 26.1
Zosterops lateralis lateralis | KU 14863 | New Zealand | 3949 | 476.8 | 34.0
Zosterops luteirostris | AMNH DOT113 | Ghizo, Solomon Islands | 4288 | 902.3 | 46.9
Zosterops luteus balstoni | KU 8904 | Western Australia | 4332 | 970.3 | 47.9
Zosterops luteus luteus | KU 22720 | Northern Territory, Australia | 3704 | 686.4 | 23.4
Zosterops maderaspatanus | FMNH 345980 | Madagascar | 4287 | 927.5 | 39.3
Zosterops melanocephalus | AMNH 461540 | Cameroon | 4012 | 300.5 | 76.1
Zosterops metcalfii | UWBM 63177 | Choiseul, Solomon Islands | 3988 | 785.0 | 29.1
Zosterops meyeni | KU 17853 | Batan, Philippines | 3671 | 664.7 | 21.2
Zosterops montanus | KU 20891 | Negros, Philippines | 4364 | 870.6 | 43.0
Zosterops murphyi | AMNH DOT193 | Kolombangara, Solomon Islands | 3407 | 627.6 | 20.7
Zosterops nigrorum | KU 12519 | Camiguin Norte, Philippines | 3920 | 721.5 | 22.5
Zosterops novaeguineae | KU 12068 | Papua New Guinea | 3973 | 798.7 | 23.9
Zosterops oleagineus | LSUMNS B48626 | Yap, Micronesia | 4367 | 414.3 | 93.4
Zosterops palpebrosus siamensis | KU 23522 | Vietnam | 4022 | 771.8 | 25.9
Zosterops palpebrosus unicus | WAM 23218 | Flores, Indonesia | 4210 | 886.7 | 38.8
Zosterops rendovae | UWBM 76356 | Tetepare, Solomon Islands | 4269 | 965.6 | 38.7
Zosterops rennellianus | UWBM 69808 | Rennel, Solomon Islands | 4132 | 834.3 | 30.7
Zosterops santaecrucis | KU 19412 | Santa Cruz, Solomon Islands | 4355 | 940.0 | 43.8
Zosterops semperi | KU 23684 | Palau | 3906 | 778.3 | 25.1
Zosterops senegalensis | KU 19969 | Sierra Leone | 3724 | 687.0 | 23.4
Zosterops splendidus | AMNH DOT171 | Rannonga, Solomon Islands | 4264 | 903.9 | 35.6
Zosterops stresemanni | KU 19428 | Malaita, Solomon Islands | 3738 | 701.0 | 23.4
Zosterops superciliosus | UWBM 58818 | Rennel, Solomon Islands | 3945 | 724.3 | 25.8
Zosterops ugiensis | KU 12803 | Makira, Solomon Islands | 4285 | 912.7 | 38.2
Zosterops vellalavella | AMNH DOT166 | Vella Lavella, Solomon Islands | 4066 | 911.7 | 35.3
Zosterornis hypogrammicus | CMC 37765 | Palawan, Philippines | 4005 | 903.3 | 31.2
Zosterornis latistriatus | CMC 34221 | Panay, Philippines | 4057 | 922.5 | 34.6
Zosterornis nigrorum | KU 27209 | Negros, Philippines | 4329 | 898.5 | 44.4
Zosterornis whiteheadi | KU 18001 | Luzon, Philippines | 4256 | 919.9 | 36.9

Note: Abbreviations: AMNH (American Museum of Natural History), CMC (Cincinnati Museum Center), FMNH (Field Museum of Natural History), KU (University of Kansas Natural History Museum), LSUMNS (Louisiana State University Museum of Natural Science), NMNH (National Museum of Natural History), UWBM (University of Washington Burke Museum), WAM (Western Australian Museum).

Figure 1.2. Estimate of phylogenetic relationships in Zosteropidae from the study of Moyle et al. (2009). Only well-supported nodes are shown. (*) Zosterops radiation estimated to have initiated diversification 1.65 Ma.

Laboratory Techniques

We extracted and purified DNA from fresh muscle or liver tissue or toe pad clips from

museum specimens using the Qiagen DNeasy Blood and Tissue Kit following the manufacturer’s

protocol. DNA extracts were quantified using a Qubit 2.0 Fluorometer, and 500 ng of DNA of

each sample was sheared in 50 μl volume using a Covaris S220 sonicator at 175 W peak incident

power, 2% duty factor, and 200 cycles per burst for 45 seconds. We performed end repair, A-

tailing, adapter ligation, and amplification on sheared DNA using Kapa Biosystems Library Prep

kits following the procedure of Faircloth et al. (2012) and described in detail at

http://ultraconserved.org. We deviated from the above protocol in three ways: we used ¼

volume of reagents called for in the library prep kit (N. White, in litt.), we ligated universal iTru

stubs instead of sample-specific adapters to allow for dual indexing, and we added a second

AMPure XP bead clean up at 1.0x volume after stub ligation. Dual-indexed iTru adapters were

added to DNA fragments through a 17-cycle PCR using NEB Phusion High-Fidelity PCR Master

Mix. iTru stubs and adapters were developed and designed by T. Glenn et al. (unpublished data).

Libraries were quantified using a Qubit 2.0 Fluorometer and subsequently combined in

pools of 8 equimolar samples for enrichment. We performed sequence capture and post-

enrichment amplification following standard protocols (Faircloth et al. 2012) using the

Mycroarray MYbaits kit for Tetrapods UCE 5K version 1, which targets 5,060 UCE loci.

Briefly, biotinylated RNA probes were hybridized with pooled libraries for 24 hrs. Targeted

DNA fragments were then recovered with the use of streptavidin-coated beads. These fragments

were then amplified in a 17-cycle PCR amplification step using NEB Phusion High-Fidelity PCR

Master Mix. After post-enrichment amplification, libraries were quantified using an Illumina

Eco qPCR System, then sequenced in a high output, paired-end run of 100 cycles on an Illumina

HiSeq 2500 System at the University of Kansas Genome Sequencing Core. Including samples

for other projects, 96 individuals were multiplexed in a single Illumina lane.

Data Assembly

Raw reads were de-multiplexed using CASAVA ver. 1.8.2. Low-quality bases and

adapter sequences were trimmed from reads using illumiprocessor ver. 1

(https://github.com/faircloth-lab/illumiprocessor). Subsequent data processing was performed

using the python package phyluce (Faircloth 2014) as outlined below. Cleaned reads were

assembled into contigs using the program Trinity (Grabherr et al. 2011). Contigs matching UCE

loci were extracted for each taxon. An incomplete dataset containing UCE loci that were present

in at least 75% of all 71 taxa was assembled for maximum likelihood (ML) analysis. For species

tree analyses, complete datasets consisting of loci common to all taxa being analyzed were

assembled (see below). For each dataset, each locus was aligned using MAFFT (Katoh and

Standley 2013), allowing missing nucleotides at the flanks of the alignment only if at least 65%

of taxa contained data, which is the default option in phyluce. These alignments were trimmed

using Gblocks (Castresana 2000) with default parameters except for the minimum number of

sequences for a flank position in Gblocks, which we set at 65% of taxa. The alignments were

formatted to phylip files for phylogenetic analysis.
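The two kinds of datasets described above (an incomplete matrix with a minimum taxon occupancy, and complete matrices for subsets of taxa) can be illustrated with a short Python sketch. This is not the phyluce implementation; the function `select_loci`, the locus names, and the toy presence table are hypothetical and serve only to show the selection rule.

```python
def select_loci(presence, taxa, min_fraction):
    """Return loci present in at least min_fraction of the given taxa.

    presence: dict mapping locus name -> set of taxa with a contig for it.
    taxa: iterable of taxa included in the dataset being assembled.
    min_fraction: 0.75 for a 75%-complete matrix, 1.0 for a complete matrix.
    """
    taxa = set(taxa)
    needed = min_fraction * len(taxa)
    return sorted(
        locus for locus, have in presence.items() if len(have & taxa) >= needed
    )


# Hypothetical presence table for four taxa and three loci.
presence = {
    "uce-1": {"A", "B", "C", "D"},
    "uce-2": {"A", "B", "C"},
    "uce-3": {"A"},
}
all_taxa = ["A", "B", "C", "D"]

incomplete = select_loci(presence, all_taxa, 0.75)        # incomplete matrix
complete = select_loci(presence, all_taxa, 1.0)           # complete, all taxa
complete_subset = select_loci(presence, ["A", "B", "C"], 1.0)  # complete, fewer taxa
```

In the toy table, dropping taxon D from the dataset raises the number of loci in the complete matrix, which is the motivation for subsampling taxa.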

Phylogenetic Analyses

We performed maximum likelihood (ML) inference on the concatenated loci of the

incomplete dataset using RAxML ver. 8.1.3 (Stamatakis 2014) assuming a general time

reversible model of nucleotide substitution and gamma-distributed rates among sites. Node support

was evaluated using 500 rapid bootstraps.

GCM Analysis Strategies.—Three general strategies were employed to increase

resolution and minimize highly supported conflicting results in GCM analyses. First, datasets

with subsets of taxa were compiled to increase the number of loci available for GCM analysis.

Subsampling taxa has been shown to yield consistent results with GCM analysis (Song et al.

2012). Second, we used only the most informative loci to increase bootstrap support values in

GCM species tree estimates. Lastly, we trimmed alignments in order to eliminate biases

introduced by short sequence lengths of some samples. An outline of the series of analyses

performed is presented in Figure 1.3 and described below.

An initial GCM analysis was performed with all 71 samples (Iteration 0). We then

removed the 4 samples with the lowest average contig length (Dasycrotapha pygmaea, Zosterops

oleagineus, Z. lateralis lateralis, and Z. melanocephalus) to assess their potential to bias results and

performed another set of species tree analyses with the remaining 67 samples (Iteration 1). From

the results of Iteration 1, we looked for well-supported clades or groups across the four summary

methods that contained many unresolved nodes and performed another set of species tree

analyses on these groups. A well-supported clade containing all 43 samples from the genera

Chlorocharis and Zosterops was recovered in Iteration 1, so the 67 samples were divided into

two groups, the Chlorocharis + Zosterops clade (Iteration 2b) and all other zosteropid genera

(Iteration 2a). Each group was analyzed with a few samples from the other group either as

outgroup or representatives for clades. The results from Iteration 2a yielded highly resolved

trees and thus this group was not analyzed further. The analyses from Iteration 2b yielded a

substantial number of polytomies, so the Chlorocharis + Zosterops clade was further subdivided

in two and analyzed separately (Iterations 3a and 3b).

Figure 1.3. Analysis approach with gene tree-based coalescent methods (GCM). An initial analysis with all samples was performed after which four samples with short sequence lengths were analyzed separately. The remaining 67 samples were subsampled five times. Trimming was applied to datasets that contained the samples with short sequence lengths. Datasets were analyzed with all loci common to all samples present in the dataset, and with only the most informative loci. Dark gray boxes denote subsets of samples, light gray boxes indicate datasets and their treatments, if any. All datasets were analyzed using four GCMs.

In order to determine the effect of removing uninformative loci on species tree analysis,

we repeated all analyses in Iterations 1–3 but discarded uninformative loci. Liu et al. (2015)

suggested excluding weak loci (e.g., < 50% average bootstrap support) from species tree

analysis. The maximum average bootstrap support was ~50% in some of our datasets, thus we

discarded loci whose number of parsimony-informative sites or average bootstrap support was

below their respective means for all the loci in the dataset, in effect, using only the “most

informative” loci.
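The locus filter described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the scripts used in the study: the helper names, the locus names, and the toy values are hypothetical, and a site is counted as parsimony-informative when at least two different unambiguous bases each occur in at least two sequences.

```python
from statistics import mean


def count_parsimony_informative(seqs):
    """Count alignment columns where >= 2 bases each occur in >= 2 sequences."""
    valid = set("ACGT")
    n_informative = 0
    for column in zip(*seqs):
        counts = {}
        for base in column:
            base = base.upper()
            if base in valid:  # gaps and ambiguity codes are ignored
                counts[base] = counts.get(base, 0) + 1
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            n_informative += 1
    return n_informative


def most_informative(loci):
    """Keep loci at or above the dataset mean for BOTH criteria.

    loci: dict mapping locus name -> (parsimony-informative sites,
    average bootstrap support of the gene tree).
    """
    mean_pis = mean(pis for pis, _ in loci.values())
    mean_bs = mean(bs for _, bs in loci.values())
    return sorted(
        name for name, (pis, bs) in loci.items()
        if pis >= mean_pis and bs >= mean_bs
    )


# Hypothetical per-locus summaries: (informative sites, mean bootstrap %).
loci = {
    "uce-1": (10, 60.0),
    "uce-2": (2, 80.0),
    "uce-3": (12, 20.0),
    "uce-4": (8, 40.0),
}
kept = most_informative(loci)
```

Because a locus is dropped when either statistic falls below its mean, the filter is stricter than a threshold on bootstrap support alone, which is consistent with fewer than half of the loci being retained.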

The phylogenetic placement of the four samples that had the shortest average contig

lengths was estimated by performing separate species tree analyses on smaller datasets consisting

of the sample of interest along with other samples close to its expected position. This expected

position was based on results of a previous study (Moyle et al. 2009), the species tree estimates

in Iteration 0, and from the ML-estimated tree. A series of analyses was performed on these

smaller datasets. First, species tree analyses were carried out using all loci common to the taxa

in the dataset. Second, each locus was trimmed using Gblocks (Castresana 2000) so that no

missing nucleotides were permitted at the flanks of each alignment (subsequently referred to as

“100% trimming”). Lastly, uninformative loci were discarded similarly as described above.

Species trees were estimated on the trimmed dataset before and after uninformative loci were

discarded. Because the placement of Z. melanocephalus was still uncertain after performing the

steps above, the dataset for this species was alternatively trimmed using a custom python script

by removing from each alignment only flanking sites where the sequence of Z. melanocephalus

had missing data. This alternative trimming truncated each locus to the alignment length of the Z.

melanocephalus sequence but allowed other samples to have missing data at their flanks.
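The two flank-trimming schemes can be contrasted with a small Python sketch. It does not reproduce Gblocks or the custom script mentioned above; the function names and the toy alignment are hypothetical, and the characters `-`, `?`, and `N` are assumed to mark missing data.

```python
MISSING = set("-?Nn")


def data_span(seq):
    """Half-open interval from the first to the last non-missing character."""
    idx = [i for i, ch in enumerate(seq) if ch not in MISSING]
    return (idx[0], idx[-1] + 1) if idx else (0, 0)


def trim_flanks(alignment, focal=None):
    """Trim alignment flanks.

    focal=None: keep only columns flanked by data in every sequence
    (no missing nucleotides at the flanks, i.e. "100% trimming").
    focal=name: trim to the span of that one sequence, letting the other
    sequences keep missing data at their flanks.
    """
    names = [focal] if focal is not None else list(alignment)
    spans = [data_span(alignment[name]) for name in names]
    start = max(s for s, _ in spans)
    end = max(min(e for _, e in spans), start)
    return {name: seq[start:end] for name, seq in alignment.items()}


# Hypothetical alignment with ragged flanks.
aln = {
    "a": "ACGTACGT",
    "b": "--GTACG-",
    "c": "-CGTAC--",
}
full_trim = trim_flanks(aln)              # shortest common span
focal_trim = trim_flanks(aln, focal="b")  # span of sample "b" only
```

In the toy example the focal scheme retains one more column than full trimming, at the cost of leaving a missing character at the flank of sample "c".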

The final species tree estimate for each GCM was summarized by including only well-

supported nodes (bootstrap proportion > 70%) that appear in at least one of the results of

Iterations 1–3 that used only the most informative loci. The four samples with short sequence

lengths were placed based on results of the trimmed datasets after uninformative loci were

discarded.

For GCM analysis, gene tree inference and bootstrapping were performed with RAxML

ver. 8.1.3 (Stamatakis 2014) using the python package phyluce (Faircloth 2014). We modified

the phyluce scripts to implement multi-locus bootstrapping (i.e., sampling with replacement of

loci and sites, Seo 2008) and generated 500 multi-locus bootstrap replicate sets of gene trees for

each dataset. On each replicate set of gene trees, we ran four GCMs: STAR and STEAC as

implemented in the R package phybase ver. 1.3 (Liu et al. 2009b), MP-EST ver. 1.4 (Liu et al.

2010), and ASTRAL ver. 4.7.7 (Mirarab et al. 2014c). All programs were run with the default

options. Multi-locus bootstrapping may underestimate support for some correct nodes in point

estimates of GCMs (Mirarab et al. 2014b). Because we were interested only in nodes with high

support, for each method species trees inferred from the bootstrap replicates were summarized

with 70% consensus trees using the sumtrees.py program in Dendropy (Sukumaran and Holder

2010). Command line, python, and R scripts used to process the data and run species tree

analyses are available at https://github.com/carloliveros/uce-scripts.
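Multi-locus bootstrapping resamples at two levels: loci first, then sites within each drawn locus. The Python sketch below illustrates the idea on a toy dataset. It is not the modified phyluce code; the function name, locus names, and sequences are hypothetical, and in practice a gene tree would be estimated from each resampled alignment before running the GCMs.

```python
import random


def multilocus_bootstrap(alignments, rng):
    """Return one multi-locus bootstrap replicate.

    alignments: dict mapping locus name -> {taxon: aligned sequence}.
    Loci are resampled with replacement, then the columns (sites) of
    each drawn locus are resampled with replacement.
    """
    locus_names = list(alignments)
    replicate = []
    for _ in locus_names:
        name = rng.choice(locus_names)  # level 1: resample loci
        aln = alignments[name]
        n_sites = len(next(iter(aln.values())))
        # level 2: resample sites within the drawn locus
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {
            taxon: "".join(seq[i] for i in cols) for taxon, seq in aln.items()
        }
        replicate.append((name, resampled))
    return replicate


# Hypothetical two-locus dataset; 500 replicates as in the analyses above.
toy = {
    "uce-1": {"A": "ACGTAC", "B": "ACGTTC", "C": "ACATTC"},
    "uce-2": {"A": "GGCA", "B": "GGTA", "C": "GCTA"},
}
replicates = [multilocus_bootstrap(toy, random.Random(seed)) for seed in range(500)]
```

Each replicate is returned as a list rather than a dictionary because the same locus can be drawn more than once.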

RESULTS

We enriched an average of 4069.8 UCE loci per sample with an average contig length of

789.9 bp and average coverage of 35.7 X (Table 1.1). The incomplete dataset used for ML

analysis included data from 3934 loci with a mean locus length of 765.9 bp. Nucleotide data

were present in 88.7% of the data matrix. Alignment statistics for the various GCM analyses are

presented in Table 1.2. Only 605 UCE loci were common to all 71 samples but in a subset of 33

samples, this more than doubled to 1297 loci and in subsets of 10–12 samples, this number

increased 5-fold to ~3000 loci. After applying our filter for informative loci, datasets retained

32–45% of their loci in Iterations 1–3 but only 14–26% of their loci in analyses placing samples

with short sequence lengths. Average locus lengths ranged from 873 to 1024 bp and the percentage

of missing nucleotides in alignments was within 5–10% with no trimming applied. Trimming

missing data in alignment flanks considerably reduced the average locus length to 35–51% of

their original length as well as the average number of parsimony-informative sites to 12–31% of

their original number.

Table 1.2. Dataset characteristics used in GCM analyses

Dataset Description | Number of Taxa | Number of Loci | Mean locus length (bp) | % Missing Nucleotides | Average Number of Parsimony-Informative Sites/Locus | Average Bootstrap Support/Locus
Iteration 0, all loci | 71 | 605 | 890.7 | 8% | 42.7 | 26.2%
Iteration 1, all loci | 67 | 654 | 902.9 | 6% | 42.8 | 28.1%
Iteration 1i, most informative loci | 67 | 273 | 986.4 | 5% | 64.2 | 35.1%
Iteration 2a, all loci | 47 | 1042 | 873.4 | 6% | 16.8 | 21.1%
Iteration 2ai, most informative loci | 47 | 377 | 964.2 | 5% | 27.9 | 26.7%
Iteration 2b, all loci | 27 | 1703 | 942.5 | 5% | 35.0 | 45.7%
Iteration 2bi, most informative loci | 27 | 762 | 1023.7 | 5% | 53.7 | 56.6%
Iteration 3a, all loci | 22 | 2086 | 867.0 | 7% | 12.0 | 30.2%
Iteration 3ai, most informative loci | 22 | 699 | 953.8 | 6% | 21.4 | 39.3%
Iteration 3b, all loci | 33 | 1297 | 884.8 | 7% | 9.0 | 20.6%
Iteration 3bi, most informative loci | 33 | 419 | 968.6 | 6% | 16.1 | 26.1%
Placing Zosterops melanocephalus
All loci | 11 | 2941 | 895.3 | 10% | 6.5 | 38.1%
100% trimming, all loci | 11 | 2941 | 316.8 | 0% | 0.8 | 16.0%
100% trimming, most informative loci | 11 | 564 | 337.8 | 0% | 3.1 | 33.2%
Trimming to Z. melanocephalus sequence, all loci | 11 | 2941 | 327.6 | 0% | 0.9 | 18.2%
Trimming to Z. melanocephalus sequence, most informative loci | 11 | 487 | 341.6 | 0% | 3.4 | 37.1%
Placing Dasycrotapha pygmaea
All loci | 12 | 3083 | 919.4 | 9% | 13.7 | 49.3%
100% trimming, all loci | 12 | 3083 | 325.7 | 0% | 2.1 | 22.6%
100% trimming, most informative loci | 12 | 797 | 350.2 | 0% | 5.4 | 42.2%
Placing Z. oleagineus
All loci | 10 | 2986 | 898.8 | 9% | 4.3 | 40.0%
100% trimming, all loci | 10 | 2986 | 434.1 | 0% | 1.2 | 22.1%
100% trimming, most informative loci | 10 | 429 | 489.8 | 0% | 4.9 | 46.9%

Maximum Likelihood

ML analysis yielded a well-supported phylogeny of Zosteropidae with 65 out of 70

(93%) internal nodes garnering >70% bootstrap support (Fig. 1.4). Of the 65 well-supported

nodes, 57 had 100% bootstrap support. We recovered seven major clades that formed a

completely asymmetric topology with 100% support in all nodes (Fig. 1.4). Each clade is

described in increasing ladderized order. The first clade, which was sister to the rest of

Zosteropidae, contained a single species, Yuhina diademata. The second major clade included

two other species of Yuhina: Y. castaniceps and Y. everetti. The third major clade comprised five

other species of Yuhina, which was sister to a clade with all other zosteropid genera. Within this

third clade, a subclade consisting of Y. nigrimenta and Y. brunneiceps was recovered sister to a

subclade containing Y. flavicollis, Y. occipitalis, and Y. gularis. The fourth major clade in

Zosteropidae included the genera Cleptornis, Dasycrotapha, and Sterrhoptilus. The monotypic

genus Cleptornis is endemic to Saipan, Northern Mariana Islands, whereas Dasycrotapha and

Sterrhoptilus, which were recovered as reciprocally monophyletic sister genera, are Philippine

endemics. Another genus endemic to the Philippines, Zosterornis, comprised the fifth major

clade in the family. The sixth major clade included the monotypic genus Megazosterops, which

is endemic to Palau, embedded within the genus Lophozosterops, whose members are distributed

in the Philippines, Wallacea, and the Greater Sundas. This Megazosterops + Lophozosterops

clade was recovered sister to the last and largest major clade, the Zosterops radiation.

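Table 1.2 (continued).

Dataset Description | Number of Taxa | Number of Loci | Mean locus length (bp) | % Missing Nucleotides | Average Number of Parsimony-Informative Sites/Locus | Average Bootstrap Support/Locus
Placing Z. lateralis lateralis
All loci | 13 | 2742 | 964.2 | 8% | 5.5 | 35.0%
100% trimming, all loci | 13 | 2741 | 487.3 | 0% | 1.7 | 17.9%
100% trimming, most informative loci | 13 | 578 | 592.8 | 0% | 5.3 | 35.8%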

Figure 1.4. Maximum likelihood estimate of phylogenetic relationships in Zosteropidae based on 3934 concatenated UCE loci from 1000 bootstrap replicates. Numbers below branches indicate bootstrap support (BS) values. Nodes with < 70% BS are collapsed whereas nodes with no labels indicate 100% BS.

The Zosterops radiation included only one other genus, the monotypic Chlorocharis.

Within this radiation, a clade consisting of five taxa from continental Asia, Africa, and

Madagascar (Z. palpebrosus siamensis, Z. senegalensis, Z. melanocephalus, Z. maderaspatanus,

and Z. virens) was recovered sister to a large clade with all other taxa in the radiation. This large

clade was in turn subdivided into two subclades. One subclade contained seven species (Z.

semperi, Z. everetti, Z. nigrorum, Z. erythropleurus, Z. japonicus, Z. montanus, and Z. meyeni)

that occur in East Asia, continental Southeast Asia, the Philippines, and Palau. The other

subclade was marginally well-supported (BS = 72%) and included three lineages in a polytomy:

the Sumatra-Borneo endemic Z. atricapilla, the Bornean endemic Chlorocharis emiliae, and a

large “Pacific” clade of 31 taxa distributed throughout Wallacea, the Australia-New Guinea

region, and the Southwest Pacific. Within this Pacific clade, a pair of Wallacean species (Z.

chloris and Z. atrifrons) was sister to a clade containing all other members of the Pacific clade,

which in turn consisted of two subclades. One subclade included three species from Northern

Australia (Z. luteus) and the Lesser Sundas (Z. palpebrosus unicus and Z. citrinella) and the

other subclade contained six lineages from the Australia-New Guinea region and the Southwest

Pacific that formed a polytomy. The six lineages consist of: (a) a pair of species from New

Guinea and the adjacent Louisiade archipelago (Z. novaeguineae and Z. fuscicapilla), (b) Z.

hypoxanthus from the Bismarck archipelago, (c) seven species that occur in the Australia-New

Guinea region and the Southwest Pacific (Z. lateralis, Z. rennellianus, Z. explorator, Z.

superciliosus, Z. flavifrons, Z. santaecrucis, and Z. lacertosus), (d) a pair of species from the

New Georgia and Louisiade archipelagos (Z. murphyi and Z. griseotinctus), (e) three species

from the West Pacific Islands (Z. conspicillatus, Z. cinereus, and Z. oleagineus), and (f) eight

species endemic to the Solomon Islands (Z. ugiensis, Z. metcalfii, Z. stresemanni, Z. vellalavella,

Z. luteirostris, Z. rendovae, Z. splendidus, and Z. kulambangrae).

ML analysis recovered three species that were sampled with more than one individual as

monophyletic: Z. luteus, Z. lateralis, and Z. kulambangrae. However, two subspecies of Z.

palpebrosus were placed in different clades. The subspecies siamensis, represented by a sample

from Vietnam, was found sister to a clade of African and Malagasy species, whereas the

subspecies unicus, represented by a sample from the Lesser Sundas, was recovered sister to

another Lesser Sundas endemic, Z. citrinella.

Gene-tree-based Coalescent Methods

Iteration 0.—Results from analyzing all 605 UCE loci common to all 71 samples

exemplify some of the criticisms leveled against GCMs: poor resolution and highly supported

conflicting results. Analyses in Iteration 0 yielded species tree estimates with varying levels of

resolution across the four GCMs (Figs. 1.5, 1.6a). MP-EST produced the least resolved tree with

only 49% of nodes being well-supported; whereas STEAC resulted in a species tree with the

highest resolution with 89% of nodes being well-supported. The resolution of the STAR and

ASTRAL species trees was comparable (76% and 79% of nodes being well-supported,

respectively). As expected, samples with consistently short sequences across loci (Dasycrotapha

pygmaea, Z. melanocephalus, Z. oleagineus, and Z. lateralis lateralis) were placed by the

different GCMs in different clades, in many cases with high support (Fig. 1.5). For instance,

STAR placed D. pygmaea sister to the Zosterops radiation with high support. STEAC and

ASTRAL, on the other hand, both recovered it sister to the clade with the genera Cleptornis,

Sterrhoptilus, and the other Dasycrotapha species, both with high support. MP-EST, however,

was uncertain of its placement among old zosteropid lineages. Another well-supported

relationship that conflicted between GCMs was the placement of Y. diademata. STAR’s

placement of this species as sister to the pair consisting of Y. castaniceps and Y. everetti

conflicted with ASTRAL’s placement of this species as sister to all other members of the family.

The last well-supported conflicting result between GCMs involved the paraphyly of Z.

kulambangrae in STAR, MP-EST, and ASTRAL, whereas STEAC recovered the species as

monophyletic. With the exception of the placement of these six taxa, no other well-supported

conflicts were recovered in the estimated species trees of the four GCMs in Iteration 0.

Figure 1.5. Species tree estimates obtained from analysis of 605 UCE loci common to all 71 samples (Iteration 0) using (a) STAR, (b) STEAC, (c) MP-EST, and (d) ASTRAL. Samples with short sequences are highlighted in gray. Numbers below branches indicate bootstrap support (BS) values from 500 multi-locus bootstrap replicates. Nodes with < 70% BS are collapsed whereas nodes with no labels indicate 100% BS.

Figure 1.6. a) Resolution in final species tree estimates of Zosteropidae (Fig. 1.7) obtained using four GCMs and ML. b) Resolution in phylogenetic reconstruction of the Pacific clade using four GCMs across six iterations. Bars indicate percentage of well-supported nodes (BS ≥ 70%). Diamonds indicate number of loci analyzed.

Iterations 1, 2, and 3.—Full results of GCM analyses in Iterations 1–3 are provided in the

Supplementary Material (App. 1.1–1.8). Figure 1.6b illustrates the effect of the number and

informativeness of loci on the resolution of a clade of 29 taxa (30 samples) from the Australia-

New Guinea region and Southwest Pacific, which were analyzed in Iterations 1, 2b, 3b, 1i, 2bi,

and 3bi. Across all GCMs, the increase in the number of loci from Iteration 1 to 2b resulted in

an increase in the proportion of well-supported nodes. The most notable improvement was

observed in MP-EST analysis wherein the proportion of well-supported nodes jumped from 38%

in Iteration 1 to 66% in Iteration 2b (Figs. 1.6b, 1.7). The further increase in the number of loci

from Iteration 2b to 3b brought further increases in the proportion of well-supported nodes in

STEAC and MP-EST, but not in STAR and ASTRAL. Analyzing far fewer but

more informative loci (Iterations 1i, 2bi, and 3bi) yielded species trees with slightly less or

comparable resolution to those obtained using all loci for STAR, STEAC, and ASTRAL; but a

significant improvement was observed for MP-EST (Figs. 1.6b, 1.7). STEAC produced the most

highly resolved species tree in 5 out of the 6 iterations presented in this section (Figs. 1.6b, S2,

S6).

Across all GCMs and all Iterations 1–3, only three well-supported conflicts were

recovered. The first involves the placement of Y. diademata as sister to Y. everetti and Y.

castaniceps by STAR in Iterations 1 and 1i, as it was similarly placed in Iteration 0. When it

was placed with high support, Y. diademata came out sister to the rest of the family in analyses

using the other three GCMs (App. 1.1–1.8). The unique placement of Y. diademata with STAR

appears to be an artefact of the method (see below) and was therefore discounted. The second

conflict concerns the position of M. palauensis. When its position was estimated with high

support, this species was recovered sister to the pair of L. wallacei and L. superciliaris in all

Figure 1.7. Excerpts of phylogenies containing the Pacific clade estimated with MP-EST across six iterations. a) Iteration 1: 67 taxa, 654 loci. b) Iteration 2b: 47 taxa, 1042 loci. c) Iteration 3b: 33 taxa, 1297 loci. d) Iteration 1i: 67 taxa, 273 loci. e) Iteration 2bi: 47 taxa, 377 loci. f) Iteration 3bi: 33 taxa, 419 loci. Full trees are shown in Appendices 1.3 and 1.7. Numbers below branches indicate bootstrap support (BS) values from 500 multi-locus bootstrap replicates. Nodes with < 70% BS are collapsed whereas nodes with no labels indicate 100% BS.

GCM analyses, with the exception of STEAC analysis in Iterations 1i and 2ai and ASTRAL

analysis in Iteration 1i, when it was recovered sister to L. wallacei (App. 1.1–1.8). Further

examination of sequence lengths in the data suggests that this conflict may have arisen from bias

introduced by short sequence lengths of M. palauensis. Analyses that focused on the

Megazosterops + Lophozosterops clade and entailed 100% trimming confirmed this bias and M.

palauensis was recovered sister to L. wallacei across all GCMs in this round of analyses (App.

1.13). The third conflict involves the relationships of the two samples of Z. kulambangrae. In

GCM analyses using all loci (Iterations 1, 2b, 3b) and with the exception of STEAC, Z.

kulambangrae was found to be paraphyletic (Figs. 1.7a–c, App. 1.1, 1.3–1.4). However, STEAC,

as well as all three other GCMs when only the most informative loci were analyzed (Iterations

1i, 2bi, 3bi), recovered a monophyletic Z. kulambangrae (Figs. 1.7d–f, App. 1.2, 1.5–1.8).

STEAC uses branch length information from gene trees and is thus resistant to biases introduced

by uninformative loci. The paraphyly of Z. kulambangrae obtained in the other GCMs appears

to be a bias caused by the presence of uninformative loci and was thus ignored.

Placement of Samples with Short Sequences.—The strategy of performing GCM analysis

separately for the four samples with short sequences was successful in estimating their

phylogenetic position with high support, with the exception of Z. melanocephalus. Full results

are available in the Supplementary Material (App. 1.9–1.12) but only results of MP-EST are

discussed in detail (Fig. 1.8). In analyzing all 3083 loci common to 12 taxa, D. pygmaea was

placed sister to a clade that includes D. plateni and the genus Sterrhoptilus with high support

(Fig. 1.8a). However, with 100% trimming of the 3083 loci, D. pygmaea was recovered sister to

D. plateni with high support, the expected result because these two Philippine endemics are quite

similar and have often been considered a single species. Using the 797 most informative 100%-

trimmed loci yielded the same species tree topology as that obtained using all 3083 trimmed loci.

The result obtained from analysis of all 3083 loci without trimming is likely due to short sequence

length bias, as discussed above, so the result of this analysis was discounted. Analyses using the

three sets of loci all yielded a fully resolved species tree.

The phylogenetic position of Z. oleagineus, an endemic to the Caroline Islands in the

West Pacific, was consistent across analyses, whether all loci were used, all loci were trimmed,

or only the most informative trimmed loci were analyzed (Fig. 1.8b). The species was recovered

sister to the Palau endemic, Z. finschii; Z. conspicillatus, a Mariana Island endemic, was found

sister to the clade containing Z. oleagineus and Z. finschii. The species tree resulting from

analyzing all 2986 loci had the highest resolution (89% of nodes well-supported), followed by

that from using only the 429 most informative trimmed loci (78%), and lastly, that from using all 2986

trimmed loci (67%).

The subspecies Z. lateralis lateralis was placed in a polytomy with the other two Z.

lateralis samples when all 2742 loci were analyzed (Fig. 1.8c). However, it was recovered sister

to Z. lateralis chloronotus both when all 2741 trimmed loci were analyzed and when only the 578

most informative trimmed loci were used. The resolution of the species tree using all 2741

trimmed loci was higher (67% of nodes well-supported) than those obtained from the analysis of

all 2742 loci or 578 of the most informative trimmed loci, which had the same level of resolution

(58%).

Figure 1.8. Phylogenetic position of samples with short sequences (highlighted in gray) as estimated with MP-EST. a) Dasycrotapha pygmaea. b) Zosterops oleagineus. c) Zosterops lateralis lateralis. d) Zosterops melanocephalus. Numbers below branches indicate bootstrap support (BS) values from 500 multi-locus bootstrap replicates. Nodes with < 70% BS are collapsed whereas nodes with no labels indicate 100% BS.

Analysis of all 2941 loci placed the Cameroon endemic Z. melanocephalus sister to a

clade that includes Z. palpebrosus siamensis, Z. senegalensis, Z. virens, and Z. maderaspatanus

with high support (Fig. 1.8d). Analysis of the same loci when trimmed 100% recovered Z.

melanocephalus sister to a clade with Z. virens, Z. senegalensis, and Z. maderaspatanus; but

when only 564 of the most informative 100%-trimmed loci were used, Z. melanocephalus was

placed in a four-way polytomy with these three species. Z. virens and Z. senegalensis are

African endemics, whereas Z. maderaspatanus occurs only on Madagascar. Results from 100%

trimming confirm that the result obtained by using all 2941 untrimmed loci was erroneous,

caused by short sequence length bias, and should therefore be discounted. Trimming the 2941

loci to the sequence length of Z. melanocephalus allowed a gain of 243 (10%) parsimony-

informative sites across all loci (vs. 100% trimming) and resulted in the placement of the species

as sister to the pair of Z. maderaspatanus and Z. virens. However, using only 487 of the most

informative sequence-length-trimmed loci recovered Z. melanocephalus in a three-way polytomy

with Z. senegalensis and the sister pair of Z. maderaspatanus + Z. virens. Results of sequence

trimming also suggest that either of the results found with 2941 100%-trimmed loci or 2941

sequence-length-trimmed loci could have been biased by uninformative loci, as observed in the

placement of Z. kulambangrae in Iterations 1, 2b, and 3b above. Therefore, placing Z.

melanocephalus in a three-way polytomy such as that recovered using 487 of the most

informative sequence-length-trimmed loci would be the best estimate of its position. Resolution

of the species tree was highest when all 2941 untrimmed loci were used (90%), followed by the

two analyses that used sequence-length-trimmed loci, both of which had 70% well-supported

nodes; resolution was lowest with the two analyses that used 100% trimming, both of which had

60% well-supported nodes. STAR and ASTRAL recovered the same position for Z.

melanocephalus as MP-EST but STEAC found it sister to Z. senegalensis (App. 1.12).
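
Counts of parsimony-informative sites, such as the gain of 243 sites reported above, follow the standard definition: a site is parsimony-informative when at least two character states each occur in at least two sequences. The following Python sketch applies that definition to a single alignment. The function name and the toy alignment are hypothetical; treating gaps, "?", and "N" as missing data and treating any other symbol (including IUPAC ambiguity codes) as an ordinary state are simplifying assumptions rather than a description of the software used in this study.

```python
from collections import Counter

MISSING = set("-?N")


def count_parsimony_informative(alignment):
    """Count parsimony-informative columns in a dict of sample -> aligned sequence."""
    seqs = [seq.upper() for seq in alignment.values()]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    informative = 0
    for column in zip(*seqs):
        # Tally observed states, ignoring missing data.
        counts = Counter(base for base in column if base not in MISSING)
        # Informative: at least two states, each present in at least two samples.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


# Toy alignment (illustration only).
toy = {"s1": "ACGT-", "s2": "ACGTA", "s3": "ATGAA", "s4": "ATG-A"}
print(count_parsimony_informative(toy))
```

Summing this count over loci before and after trimming gives the kind of comparison made between 100% trimming and sequence-length trimming.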

Highly supported conflicting placements of taxa with short sequences were also observed

for STAR and ASTRAL, but not STEAC (App. 1.9–1.12). However, these conflicts were caused

by biases introduced by either short sequence length or uninformative loci, and thus, the biased

topologies were ignored. The phylogenetic positions of taxa with short sequences were taken

from results of GCM analysis using the most informative trimmed loci.

GCM Species Trees.—Combining results from analyses using only the most informative

loci (i.e., Iterations 1i, 2ai, 2bi, 3ai, and 3bi) and from the analyses placing samples with short

sequences, well-supported estimates of species trees showing relationships among all 71 samples

were produced for each of the four GCMs (Fig. 1.9). The resolution of these species trees was

comparable to that obtained from ML analysis on the concatenated data (Fig. 1.6b). The four

GCM species trees were compatible with each other with the exception of one well-supported

conflict: STAR’s placement of Y. diademata as described above.

Comparing the GCM species trees with the ML species tree, only two well-supported

conflicts were found. The first involves STAR’s erroneous placement of Y. diademata as

described above. In the other conflict, STEAC recovered Z. melanocephalus, one of the samples

with short sequences, sister to Z. senegalensis; whereas ML placed it sister to the pair of Z.

virens and Z. maderaspatanus. The other GCMs placed this species in a three-way polytomy

with Z. senegalensis and the pair of Z. virens and Z. maderaspatanus. Outside this clade, the

species tree topologies from ML and GCM analyses were compatible.

Figure 1.9. Final species tree estimates from (a) STAR, (b) STEAC, (c) MP-EST, and (d) ASTRAL constructed from the combined results of datasets that used only the most informative loci. Nodes with < 70% BS are collapsed. BS values for each node vary across datasets and are provided in App. 1.1–1.13.

Four well-supported nodes found using GCM analyses were not recovered using ML

analysis. In the first, ML results placed the Sumatra-Borneo endemic Z. atricapilla, the Bornean

endemic C. emiliae, and a Pacific clade of 31 taxa in a three-way polytomy; however, all GCMs

recovered C. emiliae sister to the Pacific clade (Fig. 1.9). Second, a clade of three West Pacific

taxa (Z. conspicillatus, Z. cinereus, and Z. oleagineus) was situated in a six-way polytomy in the

ML topology; but STAR, MP-EST, and ASTRAL increased resolution by placing it sister to a

clade that contains the five other lineages in this polytomy. Two different further resolutions of

this polytomy account for the third and fourth nodes recovered by GCMs but not ML. MP-EST

placed one member of this polytomy, the pair of Z. fuscicapilla and Z. novaeguinea from

the Louisiade archipelago and New Guinea, respectively, sister to a clade that consists of the four

other lineages in the polytomy. On the other hand, STAR found the Bismarck Archipelago

endemic, Z. hypoxanthus, sister to the Z. fuscicapilla and Z. novaeguinea clade.

DISCUSSION

Bias from Uninformative Loci

Uninformative loci have been shown to reduce bootstrap support values in GCM analysis

(Liu et al. 2015). In this study, we have shown that they can also result in incorrect GCM

species tree estimates with high support. However, this bias was observed in the placement of

only two samples belonging to one species, Z. kulambangrae (Fig. 1.7) and only with STAR,

MP-EST, and ASTRAL, but not with STEAC (App. 1.1–1.8). The strategy of analyzing only the

most informative loci removed this bias but had varying effects on resolution. Using only the

most informative loci in GCM analysis significantly increased the proportion of well-supported

nodes in MP-EST, but slightly decreased or had little effect on the proportion of well-supported

nodes in STAR, STEAC, and ASTRAL (Fig. 1.6b). The limited improvement in resolution in

some methods after applying this strategy may have been caused by filtering loci too stringently.

In removing loci whose number of parsimony-informative sites and average bootstrap support

values were less than their respective means, we discarded 55–86% of loci in each dataset, many

of which contained some informative sites. Assembling datasets with fewer taxa increases the

number of both informative and uninformative loci common to them. This large proportion of

discarded loci is not too surprising considering that this study focuses on a fairly recently

diverged group and uses relatively slow evolving UCE markers. We expect that in studies of

more deeply diverged lineages a smaller proportion of loci will need to be discarded. However,

it is remarkable that comparable resolution can be achieved with GCM analysis using only a

small fraction of loci that are most informative. Further studies on selecting an optimal filtering

scheme to maximize resolution in GCM species tree estimates and minimize bias from

uninformative loci are needed.
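
The filtering rule described above can be sketched in a few lines of Python. The sketch assumes that each locus has already been summarized by its number of parsimony-informative sites and the average bootstrap support of its gene tree, and that a locus is retained only when both values are at or above their respective means across loci; the wording of the rule would also admit a reading in which a locus is discarded only when both values fall below the means. The function name, locus names, and values are hypothetical.

```python
from statistics import mean


def select_informative_loci(loci):
    """Keep loci whose informative-site count and mean gene tree bootstrap
    are both at or above the across-locus means.

    loci: dict mapping locus name -> (parsimony-informative sites, mean bootstrap)
    """
    mean_sites = mean(sites for sites, _ in loci.values())
    mean_support = mean(support for _, support in loci.values())
    return sorted(
        name
        for name, (sites, support) in loci.items()
        if sites >= mean_sites and support >= mean_support
    )


# Hypothetical per-locus summaries (illustration only).
summaries = {
    "uce-1": (2, 30.0),
    "uce-2": (10, 60.0),
    "uce-3": (6, 80.0),
    "uce-4": (14, 70.0),
}
kept = select_informative_loci(summaries)
print(kept, f"discarded {1 - len(kept) / len(summaries):.0%}")
```

Because the thresholds are means of skewed distributions, a rule of this kind can discard well over half of the loci, which is consistent with the 55–86% reported here; replacing the means with lower quantiles would be one way to explore less stringent schemes.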

GCMs achieve greater accuracy and resolution (higher bootstrap support) as the number

of loci analyzed increases (Liu et al. 2009b, 2010; Song et al. 2012; Mirarab et al. 2014c) but this

and another study (Liu et al. 2015) suggest that these advantages are achieved only with highly

informative loci. The number of informative loci available for GCM analysis can be maximized

in empirical datasets by choosing phylogenomic markers that evolve fast enough for the level of

divergences in the group of interest. In sequence capture techniques, it may be worth targeting a

high number of loci in hopes of obtaining more informative ones or increasing sequence capture

success rates for each sample. Increasing the information content of each locus can also be achieved

by shearing larger fragment sizes in library preparation and using longer sequencing reads

(Faircloth et al. 2012).

Bias from Sequence Length Heterogeneity

The results from Iteration 0 demonstrate that species trees estimated by some GCMs can

show high support for the wrong topology when using samples with sequence lengths that are

consistently shorter than other samples. This bias is a result of gene tree estimation bias rather

than a defect of the GCMs themselves. Our analyses in Iterations 1–3 show that GCMs can

perform well with some sequence length heterogeneity (see Missing Data column in Table 1.2).

Short sequence lengths should not bias GCM species tree estimates as long as: (1) enough

phylogenetic signal is present in samples with short sequences so as not to bias gene tree

estimation; or (2) samples have short sequences only in a small fraction of loci. In UCE loci,

where informative sites occur towards the flanks of loci (Faircloth et al. 2012), the severity of the

bias will depend on the depth of divergences between the taxa of interest, the average length of

loci, the average length of the short sequences, and the fraction of loci in which consistently

short sequences occur. In exon capture techniques where informative sites are present

throughout the length of the locus, short sequences are expected to bias GCM species tree

estimates to a lesser degree than they do with ultraconserved elements.

Analyzing samples with short sequence lengths separately with a subset of taxa requires

prior information about their possible phylogenetic position. As was done in this study, this

information can be taken from results of previous studies, the ML tree, and GCM species trees

estimated with full sampling. Using a subset of taxa also considerably increases the number of

loci available for GCM analysis (by up to 400%). In the case of Z. oleagineus, merely increasing

the number of loci analyzed masked any bias introduced by short sequence lengths, even without

trimming sequences or removing uninformative loci (Fig. 1.8b, App. 1.10). However, for the

other three samples, sequence trimming was necessary to converge on consistent relationships.

Trimming of UCE loci eliminates many informative sites (Table 1.2), but nonetheless resulted in

well-supported relationships in the clades of interest (Fig. 1.8, App. 1.9–1.12). We expected to

see biases introduced by uninformative loci in species tree estimates from trimmed datasets, but

were surprised to see no well-supported conflicts between results obtained using all trimmed loci

and those obtained using only the most informative loci for each of the four species with short

sequences across all GCMs.
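
The two trimming schemes compared in this chapter can be illustrated with a short Python sketch. It assumes that 100% trimming means retaining only alignment columns in which every sample has data, and that sequence-length trimming means retaining only columns in which the focal short sample has data; both function names and the toy alignment are hypothetical, and gaps, "?", and "N" are treated as missing data.

```python
MISSING = set("-?Nn")


def trim_complete_columns(alignment):
    """Keep only columns with data in every sample (assumed meaning of 100% trimming)."""
    names = list(alignment)
    columns = [
        column
        for column in zip(*(alignment[name] for name in names))
        if not any(base in MISSING for base in column)
    ]
    return {name: "".join(column[i] for column in columns) for i, name in enumerate(names)}


def trim_to_sample(alignment, focal):
    """Keep only columns in which the focal (short) sample has data."""
    keep = [i for i, base in enumerate(alignment[focal]) if base not in MISSING]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy locus in which one sample covers only the conserved core (illustration only).
locus = {"long1": "ACGTACGT", "long2": "AC-TTCGA", "short": "--GTAC--"}
print(trim_complete_columns(locus))
print(trim_to_sample(locus, "short"))
```

In the toy locus, trimming to the short sample retains one more column than complete-column trimming, which mirrors the gain in informative sites reported for Z. melanocephalus under sequence-length trimming.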

Bias in STAR analysis

The placement of Y. diademata as sister to the clade formed by Y. everetti and Y.

castaniceps with high support by STAR in Iterations 0, 1, and 1i is quite unusual. This position

is contradicted by results from the other three GCMs and ML, which recovered Y. diademata

sister to the rest of the family. Interestingly, when more loci were analyzed in Iteration 2a,

support for the sister relationship between Y. diademata and Y. everetti + Y. castaniceps in STAR

analysis dropped from 99% in Iteration 1 to 68% in Iteration 2a (App. 1.1). Using only the most

informative loci, support for this same relationship was only 82% in Iteration 1i; and when more

loci were analyzed in Iteration 2ai, the alternative topology recovered by the other methods was

also recovered by STAR, albeit with only 66% bootstrap support (App. 1.5). The drastic decline

in bootstrap support and change in topology between analyses strongly suggest that the highly

supported placement of Y. diademata as sister to Y. everetti + Y. castaniceps is erroneous and is

a bias unique to STAR. We have observed that STAR places an early diverging lineage sister to

the next diverging clade in another dataset with a large number of taxa. The possible causes of

this bias, which may include the high number of taxa in analysis, the proximity of the taxon to

the root of the species tree, the number of loci analyzed, or the presence of uninformative loci,

should be further investigated.

Improving Resolution by Subsampling Taxa

This study shows that analyzing subsamples of taxa with GCMs improves resolution in

their species tree estimates, but the improvement varies across methods. This strategy succeeds

by increasing the number of loci analyzed by these statistically consistent methods. As has been

found by Song et al. (2012), subsampling taxa does not seem to have an effect on the topology of

species tree estimates, as shown by the absence of conflicts between results that were not

attributed to known biases. This divide-and-conquer strategy will provide the largest benefits in

datasets with high levels of missing data at the taxon level such as those collected with

inefficient sequence capture techniques; however, this strategy will be of little value with

datasets collected from complete genomes or from using highly efficient sequence capture

techniques (e.g., Bi et al. 2012).
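
The reason that subsampling taxa increases the number of loci available for analysis without missing lineages can be shown with a minimal Python sketch. It assumes a table recording, for each locus, the set of samples in which the locus was recovered; the function name, locus names, and sample labels are hypothetical.

```python
def loci_shared_by(locus_taxa, taxa):
    """Return the loci that were recovered in every taxon of the given subset."""
    wanted = set(taxa)
    return sorted(
        locus for locus, present in locus_taxa.items() if wanted <= set(present)
    )


# Hypothetical capture results: locus name -> samples with data (illustration only).
capture = {
    "uce-10": {"A", "B", "C", "D"},
    "uce-11": {"A", "B", "C"},
    "uce-12": {"A", "B", "D"},
    "uce-13": {"A", "B"},
}
print(len(loci_shared_by(capture, ["A", "B", "C", "D"])))
print(len(loci_shared_by(capture, ["A", "B"])))
```

Every sample removed from the subset can only leave the set of shared loci unchanged or enlarge it, so the gain is largest when capture success varies widely among samples.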

A divide-and-conquer approach has been developed for MP-EST that improves its

efficiency and performance (Bayzid et al. 2014). This method implements an iterative process of

estimating species trees from subsets of taxa, combining these estimates using a supertree

method, and scoring the combined tree. This method does not take into account the biases that

can be introduced by missing data presented here, but it can be modified to implement the

analytical strategies proposed in this paper.

Comparison of GCMs

Among the four GCMs used in this study, STEAC performed better than the others in

three ways. First, the resolution of its initial species tree estimate in Iteration 0 was significantly

higher than that obtained using the other GCMs and, in fact, only slightly lower than that of its

final species tree estimate or the ML result (Fig. 1.6a). Second, STEAC was least affected by

bias introduced by samples with short sequences. In Iteration 0 STEAC placed three out of the

four samples with short sequences close to their final estimated position (Fig. 1.5). Third, bias

introduced by uninformative loci did not affect STEAC species tree estimates. STEAC

recovered a monophyletic Z. kulambangrae in all analyses. STEAC’s use of branch length

information from gene trees, in addition to topology, makes it more robust to these biases from

short sequence lengths and uninformative loci. This GCM assumes a uniform substitution rate

across lineages (i.e., the molecular clock; Liu et al. 2009b) but is robust to deviations from the

molecular clock when the number of loci is large (Liu et al. 2009a). These advantages make

STEAC a good GCM to employ in analyzing phylogenomic datasets.

Of the GCMs that make use only of topology information from gene trees, ASTRAL

achieved the most resolved species trees across the different analyses, slightly better than STAR

(Fig. 1.6b). MP-EST, one of the most popular GCMs used, appears to require more loci to

achieve the same level of resolution as other GCMs (Fig. 1.6b). These three GCMs were

sensitive to biases from both samples with short sequence lengths and from uninformative loci.
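
To make concrete what using only topology information means, the following Python sketch computes a rank-based summary in the spirit of STAR as we understand the procedure: in each rooted gene tree the root receives a rank equal to the number of taxa, the rank decreases by one at each node toward the tips, and each pair of taxa is scored by the rank of its most recent common ancestor, averaged across gene trees. Gene trees are written as nested tuples, all trees are assumed to contain the same taxa, and the final step of building a tree from the averaged values with a distance method is omitted. All names and the example trees are hypothetical.

```python
from itertools import combinations, product


def leaves(tree):
    """Return the tip labels of a rooted tree written as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    return [tip for child in tree for tip in leaves(child)]


def mrca_ranks(tree):
    """Map each pair of tips to the rank of its most recent common ancestor."""
    ranks = {}

    def visit(node, rank):
        if isinstance(node, str):
            return
        tip_sets = [leaves(child) for child in node]
        # Tips in different child subtrees have this node as their MRCA.
        for left, right in combinations(tip_sets, 2):
            for a, b in product(left, right):
                ranks[frozenset((a, b))] = rank
        for child in node:
            visit(child, rank - 1)

    visit(tree, len(leaves(tree)))  # the root rank equals the number of taxa
    return ranks


def average_ranks(gene_trees):
    """Average the pairwise MRCA ranks across gene trees with identical taxa."""
    totals = {}
    for tree in gene_trees:
        for pair, rank in mrca_ranks(tree).items():
            totals[pair] = totals.get(pair, 0) + rank
    return {pair: total / len(gene_trees) for pair, total in totals.items()}


# Two hypothetical four-taxon gene trees (illustration only).
trees = [(("A", "B"), ("C", "D")), ((("A", "B"), "C"), "D")]
avg = average_ranks(trees)
print(avg[frozenset(("A", "B"))], avg[frozenset(("C", "D"))])
```

Because a poorly resolved gene tree from an uninformative locus still contributes a full set of ranks, each such locus carries the same weight as an informative one in a summary of this kind, which may help explain the sensitivity of topology-based methods noted above.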

Missing Data and GCM Analysis

The bias introduced by samples with short sequence lengths shows the sensitivity of

GCMs to missing data at the alignment level, a bias that does not influence concatenated

analyses. These results highlight another case of coalescent methods being more sensitive to

missing data than concatenation methods (Thomson et al. 2008). We have avoided any possible

biases to GCM results caused by missing lineages by performing analysis on datasets with no

missing lineages. A number of gene-tree based species-tree methods such as STEM, MP-EST,

ASTRAL, and NJst can accommodate missing lineages in input gene trees (Liu et al. 2010; Liu

and Yu 2011; Hovmöller et al. 2013; Mirarab et al. 2014c). However, STEM requires that

lineages are missing at random across loci (Hovmöller et al. 2013), and MP-EST and NJst can

perform poorly with non-random missing lineages or many missing lineages, respectively (Liu et

al. 2010; Liu and Yu 2011; Zhong et al. 2014). Species tree methods that are robust to non-

random missing lineages, as is expected in empirical data, would be ideal to take full advantage

of the size of phylogenomic datasets. Nonetheless, the potential biases that can be introduced by

missing data to results of current GCMs emphasize the importance of exploring the properties

of phylogenomic datasets, including randomness of missing lineages, sequence length

heterogeneity, and information content of loci, to inform the choice of methods and analytical

strategies.

Zosteropidae Relationships

The resolution of the ML and GCM topologies represents a vast improvement over that

obtained by Moyle et al. (2009) using three markers, underscoring the power of genome-wide

sequence data to resolve difficult rapid radiations. The clades recovered in this study were

generally circumscribed by known biogeographic regions in varying scales (from archipelagoes

to continents). In addition, the species trees estimated from this study are generally compatible

with the results obtained by Moyle et al. (2009), although a few well-supported conflicts between

them exist and are discussed below.

Three instances of conflict between the results of this study and those of Moyle et al.

(2009) involve samples of three species (Z. palpebrosus, Z. montanus, and Z. erythropleurus)

that were sourced from different localities. In the first species, Moyle et al. (2009) recovered one

sample of Z. palpebrosus palpebrosus from Nepal as sister to Z. japonicus, which occurs in East

Asia. On the other hand, this study placed one Z. palpebrosus siamensis sample from Vietnam

sister to a clade of Zosterops species from Africa and Madagascar, and another Z. palpebrosus

unicus sample from the Lesser Sundas sister to the Lesser Sundas endemic Z. citrinella. In the

second and third species, a sample of Z. montanus from Sulawesi and a captive individual of Z.

erythropleurus were placed by Moyle et al. (2009) in a clade with a Bornean endemic (C.

emiliae), a Sumatra-Borneo endemic (Z. atricapilla), and an E. African/Arabian peninsula

endemic (Z. abyssinica). In this study, a Z. montanus sample from the Philippines was found to

be sister to a Philippine endemic Z. meyeni, whereas a Z. erythropleurus sample from Vietnam

was recovered sister to a clade containing Z. japonicus, and the two species from the Philippines

(Z. montanus and Z. meyeni). Paraphyly in Zosterops species is evident in several species in this

study and others (Warren et al. 2006; Moyle et al. 2009; Cox et al. 2014). It is likely that these

conflicts reflect the true paraphyly in these three species that have fairly wide geographical

distributions, but further molecular work with more intensive sampling of these widespread

species is needed to confirm this and clarify species limits. One sample of Z. ugiensis from the

same individual was used by both studies but placed in different clades. Moyle et al. (2009)

placed this species endemic to the Solomon Islands in a clade with two species from the West

Pacific, whereas this study recovered it in a clade with seven other species from the Solomon

Islands. This conflict could possibly have arisen if the few genes used by Moyle et al. (2009)

had a gene tree topology that was different from the species tree.

GCMs and Concatenation

One weakness of GCMs is their supposed difficulty in recovering clades that are easily

recovered by concatenation methods. In this study, however, after using our three analytical

strategies, all clades recovered by ML were recovered by at least one GCM (Figs. 1.4,

1.9). Contrary to expectations, four well-supported clades found in GCM results were not found

in ML results.

The congruence in species tree estimates of Zosteropidae obtained by GCM and

concatenation approaches is remarkable. One of the primary reasons for using GCMs in a group

like Zosteropidae is the potential for the ML species-tree estimate to be highly supported but

incorrect if a section of the species tree is in the anomaly zone (Kubatko and Degnan 2007).

However, despite the high number of short internodes in the family, only two well-supported

conflicting results were recovered between the GCM results and the ML result, and one involves

a biased result by STAR. The internodes are extremely short in the African/Malagasy clade to

which Z. melanocephalus belongs (Fig. 1.4), so this example is the only candidate case of a

section of the Zosteropidae phylogeny falling within the anomaly zone. If there are others, then

we have not detected them because either the GCMs or ML have not resolved them with high

support. The accuracy of phylogenetic reconstructions of Zosteropidae derived here from GCMs

and ML cannot be evaluated, but the consistency between topologies and comparable resolutions

indicate that our estimates from concatenation and coalescent approaches are roughly as accurate

as each other.
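
For readers who wish to check whether a pair of successive internodes could fall in the anomaly zone, the four-taxon boundary from the anomaly zone literature (Degnan and Rosenberg 2006) can be evaluated directly. The sketch below is recalled from that literature rather than taken from analyses in this study: for an asymmetric four-taxon species tree with a parent internal branch of length x and a descendant internal branch of length y, both in coalescent units, the tree is in the anomaly zone when y < a(x), where a(x) = log[2/3 + (3e^(2x) − 2) / (18(e^(3x) − e^(2x)))]. Function names are hypothetical, and branch lengths would first have to be estimated in coalescent units.

```python
import math


def anomaly_zone_limit(x):
    """Four-taxon anomaly zone boundary a(x) for a parent branch of length x > 0."""
    if x <= 0:
        raise ValueError("x must be positive (coalescent units)")
    numerator = 3 * math.exp(2 * x) - 2
    denominator = 18 * (math.exp(3 * x) - math.exp(2 * x))
    return math.log(2 / 3 + numerator / denominator)


def in_anomaly_zone(x, y):
    """True when the descendant branch y is shorter than the boundary a(x)."""
    return y < anomaly_zone_limit(x)


# Hypothetical branch lengths in coalescent units (illustration only).
for x, y in [(0.1, 0.1), (0.1, 0.5), (0.5, 0.01)]:
    print(x, y, in_anomaly_zone(x, y))
```

The boundary becomes negative once x exceeds roughly 0.27 coalescent units, so only pairs of consecutive, very short internodes, such as those in the African/Malagasy clade, are candidates under this criterion.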

CHAPTER 2

Ultraconserved Elements Resolve Phylogenetic Relationships in Core Corvoidea Passerines

ABSTRACT

Oscine passerines in the core Corvoidea group, comprising 773 species that occur

worldwide, represent one of the major radiations of perching birds (Passeriformes).

Phylogenetic relationships among the 29 families in this enigmatic clade have been largely

unresolved owing to a combination of inadequate character and taxon sampling in previous

studies and short time intervals between lineage splitting events. We use sequence data from

thousands of ultraconserved element loci obtained from 86 species to estimate higher level

phylogenetic relationships in the group using maximum likelihood (ML) and coalescent

methods. A highly resolved phylogeny of core Corvoidea is recovered, with a majority of nodes

receiving high support from both ML and coalescent analyses. Four families restricted to or

thought to have originated from the Australia-New Guinea region and New Zealand form the

earliest branching lineages in the group. The other families are divided into three major clades

each consisting of 8–9 families. Our results are vital to understanding the biogeographic history

of the core Corvoidea, which has been considered an example of a major avian radiation that

originated from an island setting and colonized almost the entire world.

INTRODUCTION

Passerine birds (order Passeriformes) are the largest avian radiation, representing almost

60% of avian diversity (Cracraft 2014). Our knowledge of higher level phylogenetic

relationships in this ubiquitous order consisting of 6063 species of perching birds (Dickinson and

Christidis 2014) has grown tremendously over the past decade (summarized in Cracraft 2014). It

is now well known that Passeriformes consists of three main clades (Barker et al. 2002, 2004;

Ericson et al. 2002). The New Zealand wrens (Acanthisittidae), which are represented by only 2

extant species, are sister to all other passerines, which themselves are subdivided into two large

clades: the suboscines and the oscines. Suboscines, which comprise 21% of passerine diversity,

occur in the tropical regions of the New World and Old World but a vast majority of species are

restricted to the Neotropics. The oscines, or songbirds, represent a higher share of passerine

diversity (79%) and have a cosmopolitan distribution.

One of the most enigmatic groups of passerines is a clade of oscines referred to as

the “core Corvoidea” (Barker et al. 2002) or infraorder Corvides (Cracraft 2014). The core

Corvoidea comprises 773 species belonging to 29 families (Cracraft 2014; Dickinson and

Christidis 2014), the largest of which contains 126 species with an almost cosmopolitan

distribution (Corvidae). In contrast, four families are each represented by a single species with

restricted ranges: Eulacestomatidae, Rhagologidae, and Ifritidae are endemic to the highlands of

New Guinea, whereas Pityriasidae occurs only on Borneo. Recent molecular phylogenetic work

(Barker et al. 2004; Norman et al. 2009; Jønsson et al. 2011; Aggerbeck et al. 2014; Selvatti et

al. 2015) has had limited success in resolving interfamilial relationships in this clade (Fig. 2.1)

owing to a combination of inadequate data (Cracraft 2014) and short intervals between lineage

splitting events. The study that obtained the highest phylogenetic resolution recovered four

major clades: one containing only the New Zealand endemic Mohouidae, two clades each

comprising 10 families, and one clade consisting of 9 families (Aggerbeck et al. 2014, Fig. 2.1c).

However, support for interfamilial relationships in this study was attained mostly from Bayesian

inference; the maximum likelihood topology was poorly resolved. In addition, it is unclear

whether the high support obtained from Bayesian analysis of 22 concatenated loci could have

resulted from concatenation approaches being misled by gene tree discordance in an anomaly

zone, a scenario that can occur when sufficiently short successive internodes are present in the

species tree (Degnan and Rosenberg 2006; Kubatko and Degnan 2007).

The core Corvoidea has been hypothesized to have originated in proto-Papuan islands in

the late Eocene/Oligocene and has been held as an example of a major global radiation that was

derived from an island setting and subsequently colonized and diversified across the globe

(Jønsson et al. 2011; Aggerbeck et al. 2014). These conclusions, however, were based in part on

phylogenies with limited resolution. A well-resolved phylogenetic estimate of the group is

required for a more robust hypothesis of the tempo and mode of diversification in this major

oscine radiation.

Here we use thousands of genome-wide loci to clarify phylogenetic relationships among

families in the core Corvoidea. Concatenation and coalescent approaches to phylogenetic

inference are used to take into account gene tree discordance among loci and to take advantage

of the large size of phylogenomic data.

METHODS

Sampling

We sampled 76 genera belonging to 28 out of the 29 recognized families in core

Corvoidea (Dickinson and Christidis 2014). Only one family, the Bornean endemic Pityriasidae,

was unsampled. All genera were represented by one species, with the exception of Dicrurus,

which was sampled with 2 species (Table 2.1). Species from 9 families (Meliphagidae,

Orthonychidae, Pomatostomidae, Cnemophilidae, Petroicidae, Cisticolidae, Stenostiridae,

Passeridae, and Chloropseidae) were chosen as outgroups; thus, the total number of species in

the study was 86.

Figure 2.1. Estimate of interfamilial relationships in core Corvoidea from Bayesian inference in previous studies: (a) Barker et al. (2004), (b) Jønsson et al. (2011), and (c) Aggerbeck et al. (2014). The number of markers used in each study is indicated. Nodes with < 95% Bayesian posterior probability are collapsed.

Table 2.1. Sampling information.

Species | Family | Accession Number | Locality | Number of UCE loci enriched
Aegithina lafresnayei | Aegithinidae | KU 23213 | Vietnam | 4247
Aleadryas rufinucha | Oreoicidae | KU 16569 | Papua New Guinea | 3674
Arses telescophthalmus | Monarchidae | KU 12218 | Papua New Guinea | 4230
Artamus cinereus | Artamidae | KU 6183 | Australia | 4340
Batis senegalensis | Platysteiridae | KU 15403 | Ghana | 4161
Chaetorhynchus papuensis | Rhipiduridae | KU 16414 | Papua New Guinea | 4219
Chlorophoneus sulfureopectus | Malaconotidae | KU 15498 | Ghana | 3956
Cinclosoma punctatum | Cinclosomatidae | UWBM 57790 | Australia | 4344
Cissa hypoleuca | Corvidae | KU 23241 | Vietnam | 4300
Colluricincla harmonica | Pachycephalidae | KU 8866 | Australia | 4212
Coracina papuensis | Campephagidae | KU 27758 | Papua New Guinea | 4213
Corcorax melanoramphos | Corcoracidae | KU 10720 | Australia | 4082
Corvus corax | Corvidae | KU 30042 | Alaska, USA | 4579
Cyanocitta cristata | Corvidae | KU 2271 | Kansas, USA | 4088
Cyanocorax violaceus | Corvidae | KU 21655 | Peru | 4129
Cyanolyca viridicyanus | Corvidae | KU 17037 | Peru | 4113
Daphoenositta chrysoptera | Neosittidae | KU 23086 | Australia | 4547
Dendrocitta cinerascens | Corvidae | KU 17738 | Borneo, Malaysia | 4572
Dicrurus aeneus | Dicruridae | KU 23352 | Vietnam | 4381
Dicrurus paradiseus | Dicruridae | KU 23288 | Vietnam | 4290
Dryoscopus cubla | Malaconotidae | KU 26685 | Botswana | 4009
Edolisoma tenuirostre | Campephagidae | KU 23644 | Palau | 4250
Epimachus meyeri | Paradisaeidae | KU 16421 | Papua New Guinea | 4099
Erpornis zantholeuca | Vireonidae | KU 27942 | Vietnam | 4209
Eulacestoma nigropectus | Eulacestomatidae | KU 16397 | Papua New Guinea | 3661
Falcunculus frontatus | Falcunculidae | UWBM 57627 | Australia | 4585
Grallina cyanoleuca | Monarchidae | KU 22946 | Australia | 4324
Gymnorhinus cyanocephalus | Corvidae | UNM NK165437 | New Mexico, USA | 4135
Hemipus picatus | Vangidae | KU 23303 | Vietnam | 4177
Hypothymis azurea | Monarchidae | KU 20189 | Philippines | 4657
Ifrita kowaldi | Ifritidae | KU 12106 | Papua New Guinea | 4389
Lalage maculosa | Campephagidae | KU 26428 | Fiji | 4302
Lamprolia victoriae | Rhipiduridae | KU 24329 | Fiji | 4265
Laniarius barbarus | Malaconotidae | KU 15451 | Ghana | 4009
Lanius excubitor | Laniidae | KU 28984 | Mongolia | 4585
Lophorina superba | Paradisaeidae | KU 16456 | Papua New Guinea | 4297
Machaerirhynchus nigripectus | Machaerirhynchidae | KU 4734 | Papua New Guinea | 4429
Melampitta lugubris | Melampittidae | KU 16552 | Papua New Guinea | 4477
Melanorectes nigrescens | Pachycephalidae | KU 18387 | Papua New Guinea | 4532
Melloria quoyi | Artamidae | KU 4831 | Papua New Guinea | 4008
Metabolus takatsukasae | Monarchidae | KU 22596 | Northern Marianas | 4522
Mohoua ochrocephala | Mohouidae | YPM 96717 | New Zealand | 3096
Myiagra azureocapilla | Monarchidae | KU 24306 | Fiji | 4402
Mystacornis crossleyi | Vangidae | FMNH 345860 | Madagascar | 4246
Oreocharis arfaki | Paramythiidae | KU 16440 | Papua New Guinea | 4445
Oreoica gutturalis | Oreoicidae | KU 8884 | Australia | 4179
Oriolus chinensis | Oriolidae | KU 10450 | China | 4587
Ornorectes cristatus | Oreoicidae | KU 4697 | Papua New Guinea | 3987
Pachycephala vitiensis | Pachycephalidae | KU 19410 | Solomon Islands | 4284
Paradisaea minor | Paradisaeidae | KU 16148 | Papua New Guinea | 4195
Parotia lawesii | Paradisaeidae | KU 16428 | Papua New Guinea | 3703
Pericrocotus divaricatus | Campephagidae | KU 10261 | China | 3766
Perisoreus canadensis | Corvidae | AMNH DOT15892 | New York, USA | 4304
Philentoma pyrhoptera | Vangidae | KU 12323 | Borneo, Malaysia | 4126
Phonygammus keraudrenii | Paradisaeidae | KU 12256 | Papua New Guinea | 4207
Pitohui dichrous | Oriolidae | KU 12200 | Papua New Guinea | 4432
Platylophus galericulatus | Corvidae | KU 24459 | Borneo, Malaysia | 4332
Platysteira cyanea | Platysteiridae | KU 29120 | DR Congo | 4255
Prionops plumatus | Vangidae | KU 26690 | Botswana | 4572
Pseudorectes ferrugineus | Pachycephalidae | KU 9610 | Papua New Guinea | 4134
Psophodes occidentalis | Psophodidae | KU 6204 | Australia | 4053
Pteruthius flaviscapis | Vireonidae | KU 17806 | Borneo, Malaysia | 4124
Ptilorrhoa leucosticta | Cinclosomatidae | KU 16514 | Papua New Guinea | 4458
Ptilostomus afer | Corvidae | LSU B39279 | Ghana | 4208
Pyrrhocorax pyrrhocorax | Corvidae | KU 28865 | Mongolia | 4309
Rhagologus leucostigma | Rhagologidae | KU 18382 | Papua New Guinea | 4546
Rhipidura javanica | Rhipiduridae | KU 17717 | Borneo, Malaysia | 4559
Sphecotheres viridis | Oriolidae | KU 10752 | Australia | 3990
Strepera graculina | Artamidae | KU 9660 | Australia | 4190
Struthidea cinerea | Corcoracidae | KU 10728 | Australia | 4313
Symposiachrus verticalis | Monarchidae | KU 27668 | Papua New Guinea | 4347
Tchagra senegalus | Malaconotidae | KU 15518 | Ghana | 4099
Tephrodornis virgatus | Vangidae | KU 30886 | Vietnam | 4206
Terpsiphone paradisi | Monarchidae | KU 23463 | Vietnam | 4342
Trochocercus nitens | Monarchidae | KU 15689 | Ghana | 4268
Urolestes melanoleuca | Laniidae | KU 26670 | Botswana | 4136
Vireo solitarius | Vireonidae | KU 25186 | Kansas, USA | 4189
Outgroup:
Chelidorhynx hypoxanthus | Stenostiridae | KU 28022 | Vietnam | 4278
Chloropsis sonnerati | Irenidae | KU 24451 | Borneo, Malaysia | 4342
Cisticola anonymus | Cisticolidae | KU 29165 | DR Congo | 4317
Cnemophilus loriae | Cnemophilidae | KU 16529 | Papua New Guinea | 4644
Foulehaio carunculatus | Meliphagidae | KU 26344 | Fiji | 4549
Orthonyx temminckii | Orthonychidae | UWBM 76694 | Australia | 4503
Passer domesticus | Passeridae | KU 28860 | Mongolia | 4615
Petroica multicolor | Petroicidae | KU 24355 | Fiji | 4324
Pomatostomus superciliosus | Pomatostomidae | KU 8792 | Australia | 4388

Data Collection and Assembly

We extracted and purified DNA from fresh muscle or liver tissue (n = 85) or toe pad clips

from museum specimens (n = 1) using the Qiagen DNeasy Blood and Tissue Kit following the

manufacturer’s protocol. Sequence capture of ultraconserved element (UCE) loci was performed

targeting 5,060 UCE loci following the same process of library preparation, target capture, post-

enrichment amplification, and sequencing described in Chapter 1. We followed the procedures

from Chapter 1 in cleaning reads, assembling reads into contigs, compiling contigs into complete

and incomplete datasets, aligning loci, trimming loci, and reformatting data to phylip format.

Phylogenetic Analyses

Maximum likelihood (ML) inference was performed on the concatenated loci of the

incomplete dataset using RAxML ver. 8.1.3 (Stamatakis 2014) assuming a general time-

reversible (GTR) model of nucleotide substitution and gamma-distributed rates among sites. Node support

was evaluated using 1000 rapid bootstraps.
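For readers who wish to set up a comparable concatenated analysis, the following is a minimal sketch that assembles a RAxML command line for a GTR + gamma search with rapid bootstrapping. It is an assumption about how such a run could be configured, not the exact command used in this study. The binary name, thread count, random seeds, and file names are hypothetical placeholders.

```python
import shlex


def raxml_rapid_bootstrap_command(alignment, run_name, n_boot=1000,
                                  seed=12345, threads=4,
                                  binary="raxmlHPC-PTHREADS"):
    """Build an argument list for a RAxML rapid bootstrap + ML search.

    The binary name, seed, and thread count are placeholders; adjust them
    to the local installation.
    """
    return [
        binary,
        "-T", str(threads),   # number of threads (PTHREADS builds)
        "-f", "a",            # rapid bootstrapping plus best-scoring ML tree
        "-m", "GTRGAMMA",     # GTR substitution model with gamma rate variation
        "-p", str(seed),      # seed for parsimony starting trees
        "-x", str(seed),      # seed for rapid bootstrapping
        "-#", str(n_boot),    # number of bootstrap replicates
        "-s", alignment,      # concatenated alignment in phylip format
        "-n", run_name,       # suffix for output files
    ]


if __name__ == "__main__":
    cmd = raxml_rapid_bootstrap_command("concatenated.phylip", "core_corvoidea")
    # Print the command; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```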

Gene tree-based coalescent methods (GCM) were used to estimate relationships in core

Corvoidea by applying strategies of subsampling taxa, filtering loci by information content, and

trimming alignments that include short sequences to increase resolution and consistency among

estimates (Chapter 1). We performed coalescent analysis on a total of five subsets of the 86

species. The first subset contained 21 species and focused on the basal families and major

superfamilies in core Corvoidea, excluding Mohoua, the only sample with a

substantially lower average contig length than the rest of the samples. The second subset

consisted of the same species but this time included Mohoua. With this subset, alignments were

trimmed so that no missing data was present at their flanks. The remaining three subsets focused

on the three multi-family clades recovered by Aggerbeck et al. (2014; Fig. 2.1c). In this text we

refer to these groups as “Orioloidea”, “Malaconotoidea”, and “Corvoidea,” corresponding to

clades X, Y, and Z in that study, respectively. Thus the last three subsets comprised 25 species

with Orioloidea as ingroup, 25 species with Malaconotoidea as ingroup, and 43 species with

Corvoidea as ingroup. Loci in which the number of parsimony-informative sites or average

bootstrap support in gene tree inference were lower than their respective means were excluded

from coalescent analysis.
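The locus-filtering rule described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the scripts used in this study: a site is counted as parsimony-informative when at least two nucleotide states each occur in at least two sequences, ambiguous and missing characters are ignored, and a locus is retained only when both of its statistics are at or above the respective means. Function names and locus names are hypothetical.

```python
from statistics import mean

UNAMBIGUOUS = set("ACGT")


def count_parsimony_informative_sites(alignment):
    """Count parsimony-informative columns in a list of aligned sequences."""
    n_informative = 0
    for column in zip(*alignment):
        counts = {}
        for base in column:
            base = base.upper()
            if base in UNAMBIGUOUS:
                counts[base] = counts.get(base, 0) + 1
        # Informative: at least two states, each present in >= 2 sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            n_informative += 1
    return n_informative


def filter_loci(stats):
    """Keep loci whose statistics are not below either mean.

    stats maps locus name -> (parsimony-informative sites,
    average gene tree bootstrap support).
    """
    mean_pi = mean(pi for pi, _ in stats.values())
    mean_bs = mean(bs for _, bs in stats.values())
    return sorted(name for name, (pi, bs) in stats.items()
                  if pi >= mean_pi and bs >= mean_bs)


if __name__ == "__main__":
    toy_alignment = ["ACGTA", "ACGTA", "ATGCA", "ATG-A"]
    print(count_parsimony_informative_sites(toy_alignment))
    toy_stats = {"uce-1": (60, 40.0), "uce-2": (10, 50.0),
                 "uce-3": (80, 20.0), "uce-4": (70, 45.0)}
    print(filter_loci(toy_stats))
```

Because a locus must pass both thresholds, the retained set is typically well under half of the loci shared by all taxa.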

For coalescent analyses, gene tree inference and bootstrapping were performed with

RAxML ver. 8.1.3 (Stamatakis 2014) using the python package phyluce (Faircloth 2014). We

modified the phyluce scripts to implement multi-locus bootstrapping (Seo 2008), and generated

500 multi-locus bootstrap replicate sets of gene trees for each dataset. On each replicate set of

gene trees, we performed coalescent analyses using four GCMs: STAR and STEAC as

implemented in the R package phybase ver. 1.3 (Liu et al. 2009b), MP-EST ver. 1.4 (Liu et al.

2010), and ASTRAL ver. 4.7.7 (Mirarab et al. 2014c). All programs were run with default

options. Species trees inferred from the bootstrap replicates were summarized with 70%

consensus trees using the sumtrees.py program in Dendropy (Sukumaran and Holder 2010).

Command line, python, and R scripts used to process the data and run the species tree analyses

are available at https://github.com/carloliveros/uce-scripts.
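The two-stage resampling behind multi-locus bootstrapping can be sketched as below. This is a simplified illustration and not the modified phyluce code: it assumes that loci are first resampled with replacement and that sites are then resampled with replacement within each sampled locus, after which a gene tree would be estimated from every resampled alignment. All function names, locus names, and the toy data are hypothetical.

```python
import random


def bootstrap_sites(alignment, rng):
    """Resample alignment columns with replacement."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in columns) for seq in alignment]


def multilocus_bootstrap_replicate(loci, rng):
    """One replicate: resample loci, then sites within each sampled locus."""
    names = sorted(loci)
    sampled = [rng.choice(names) for _ in names]
    return [(name, bootstrap_sites(loci[name], rng)) for name in sampled]


def multilocus_bootstrap(loci, n_replicates=500, seed=0):
    """Generate replicate sets of resampled alignments."""
    rng = random.Random(seed)
    return [multilocus_bootstrap_replicate(loci, rng)
            for _ in range(n_replicates)]


if __name__ == "__main__":
    toy_loci = {
        "uce-1": ["ACGTAC", "ACGTTC", "ACCTAC"],
        "uce-2": ["GGTA", "GGTT", "GCTA"],
        "uce-3": ["TTACGGA", "TTACGCA", "TAACGGA"],
    }
    replicates = multilocus_bootstrap(toy_loci, n_replicates=3, seed=1)
    for replicate in replicates:
        print([name for name, _ in replicate])
```

Each replicate set would then be passed to a species tree method, and the resulting species trees summarized in a consensus tree.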

RESULTS

Dataset Attributes

We enriched an average of 4,257.5 UCE loci per sample (Table 2.1). The incomplete

dataset used for ML analysis included 4121 loci with a mean locus length of 603.6 bp and total

alignment length of 2,487,457 bp. Nucleotide data were present in 89.6% of the data matrix.

GCM analyses were performed on 602–775 of the most informative loci in each dataset from an

original 1,979–2,342 loci in common to all taxa in each dataset (Table 2.2). The average locus

length ranged from 744.5 to 817.5 bp in the datasets that were not trimmed, but was less than half

these lengths (291.9 bp) in the trimmed dataset. Similarly, the average number of parsimony-

informative sites and average gene tree bootstrap support value both had considerably lower

values in the trimmed dataset compared to the untrimmed datasets.
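The reduction in locus length after trimming follows from how flank trimming works: each alignment is cut back to the block of columns in which every sample has sequence at both ends. A minimal sketch is given below. It is an assumption about the general approach rather than the procedure from Chapter 1, and it treats "-", "?", and "N" as missing data. Function names are hypothetical.

```python
MISSING = set("-?Nn")


def shared_core(alignment):
    """Return (start, end) of the columns bounded by data in every sequence."""
    start, end = 0, len(alignment[0])
    for seq in alignment:
        left = 0
        while left < len(seq) and seq[left] in MISSING:
            left += 1
        right = len(seq)
        while right > left and seq[right - 1] in MISSING:
            right -= 1
        start = max(start, left)
        end = min(end, right)
    # If sequences do not overlap, return an empty interval.
    return start, max(start, end)


def trim_flanks(alignment):
    """Remove flanking columns in which any sequence is missing."""
    start, end = shared_core(alignment)
    return [seq[start:end] for seq in alignment]


if __name__ == "__main__":
    toy = ["--ACGTAC--", "ACACGTACGT", "---CGTACG-"]
    for seq in trim_flanks(toy):
        print(seq)
```

A single sample with short contigs therefore limits the length of every trimmed locus, which in turn lowers the number of informative sites per locus.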

Table 2.2. Characteristics of datasets used in coalescent analyses.

Dataset description | Number of taxa | Number of loci in common | Number of loci analyzed | Average locus length (bp) | Average number of parsimony-informative sites | Average gene tree bootstrap support value
Higher level core Corvoidea, excluding Mohoua | 21 | 2342 | 775 | 766.6 | 52.6 | 34.8
Higher level core Corvoidea, including Mohoua, trimmed | 22 | 2342 | 602 | 291.9 | 6.3 | 17.5
Orioloidea | 25 | 2084 | 679 | 744.5 | 55.6 | 35.1
Malaconotoidea | 25 | 2201 | 775 | 817.5 | 65.5 | 42.8
Corvoidea | 43 | 1979 | 764 | 782.3 | 80.5 | 40.4

Phylogenetic Relationships

Both ML and coalescent analyses yielded well-resolved estimates of core Corvoidea

phylogeny. Results obtained in both approaches largely agree so only the ML estimate is shown

(Fig. 2.2). Results of coalescent analyses were consistent with each other (App. 2.1–2.2) and the

few differences are discussed in the text. In ML results, 72 out of 76 nodes (95%) in core

Corvoidea were well-supported (i.e., received > 70% bootstrap support; BS), 68 of these with

98–100% BS (Fig. 2.2). Among GCM results, resolution was comparable between methods with

65–67 well-supported nodes out of 76, 60–62 of these with 98–100% BS (App. 2.1–2.2).

Relationships within Families.—All families that were represented by more than one

genus were recovered monophyletic with 100% BS across ML and four coalescent analyses, with

the exception of Corvidae (see below). Estimates of intergeneric relationships within families

that were represented by more than two genera other than Corvidae were all well-supported

across ML and four coalescent analyses. All but one node received 98–100% BS; the

exception was a clade of 5 genera in Monarchidae (Symposiachrus, Metabolus, Grallina, Myiagra,

and Arses), which received 74% BS in ML and 85–93% in coalescent analyses. Only one case of

well-supported conflicting relationship was recovered across all analyses: within Paradisaeidae,

ML placed Lophorina sister to a clade containing Paradisaea and Epimachus, whereas all GCMs

recovered Paradisaea sister to the pair of Lophorina and Epimachus.

Interfamilial Relationships.—For a clearer view of phylogenetic relationships among

families in core Corvoidea, branches from ML and coalescent analyses were collapsed so that

each tip represented a monophyletic family (Fig. 2.3). At the family level, the ML analysis

achieved higher resolution (24 out of 28 nodes were well-supported) than coalescent analyses

(17–19 well-supported nodes). ML inference recovered six major clades in core Corvoidea that

branched sequentially from the base of the clade. The first clade, which was sister to the

rest of core Corvoidea, consisted of a single family endemic to the Australia-New Guinea region,

Cinclosomatidae. The second clade included only Campephagidae, which has a broad Old

World distribution with members occurring from Africa to the Southwest Pacific. The third

clade comprised two small families: Mohouidae, a New Zealand endemic, and Neosittidae, an

Australia-New Guinea-region endemic. The fourth clade was composed of 8 families:

Eulacestomatidae, Psophodidae, Falcunculidae, Oreoicidae, Paramythiidae, Vireonidae, Oriolidae,

and Pachycephalidae. The fifth and sixth clades were sister to each other, and each contained

several families. The fifth clade included seven families: Machaerirhynchidae, Artamidae,

Rhagologidae, Aegithinidae, Malaconotidae, Platysteiridae, and Vangidae, whereas the sixth clade

comprised nine families: Rhipiduridae, Dicruridae, Monarchidae, Corcoracidae, Ifritidae,

Paradisaeidae, Melampittidae, Corvidae, and Laniidae.

Figure 2.2. Maximum likelihood estimate of phylogenetic relationships in core Corvoidea inferred from 4121 UCE loci using RAxML. Numbers below branches indicate bootstrap support (BS) values from 1000 bootstrap replicates. Nodes with < 70% BS are collapsed whereas nodes with no labels indicate 100% BS.

Figure 2.3. Estimate of interfamilial relationships in core Corvoidea based on maximum likelihood and ASTRAL analyses. Results of STAR, STEAC, and MP-EST were similar to those of ASTRAL; topological differences are indicated by symbols (*, #) and are discussed in the text. Nodes with < 70% BS are collapsed. BS values for each node vary across GCMs and subsampled datasets and are thus not shown, but they are provided in Appendices 2.1–2.2.

On the other hand, coalescent analyses recovered seven main clades in core Corvoidea

among which relationships were largely not well-supported (Fig. 2.3). Similar to ML results,

Cinclosomatidae was placed sister to the rest of core Corvoidea in coalescent analyses.

However, the six other main clades formed a six-way polytomy with the following members:

Campephagidae, Mohouidae, Neosittidae, and the same last three clades recovered in ML

analysis (Orioloidea, Malaconotoidea, and Corvoidea). STAR and MP-EST recovered

Orioloidea sister to Malaconotoidea in one of 5 subsets of taxa (App. 2.2a), albeit with only 76%

and 74% BS, respectively.

Coalescent and ML analyses recovered identical but poorly resolved relationships within

Orioloidea. This clade comprised five lineages forming a polytomy: Eulacestomatidae,

Psophodidae, the pair of Falcunculidae and Oreoicidae, the pair of Paramythiidae and

Vireonidae, and the pair of Oriolidae and Pachycephalidae. Identical, highly resolved topologies

were inferred by ML and coalescent analyses within Malaconotoidea. Machaerirhynchidae was

recovered sister to the other members of Malaconotoidea. Artamidae and then Rhagologidae

branched off next in succession. The final clade within Malaconotoidea comprised the

pair of Aegithinidae and Malaconotidae sister to the pair of Platysteiridae and Vangidae.

Within Corvoidea, ML and coalescent analyses resulted in slightly different, but

compatible topologies (Fig. 2.3). In the ML result, Rhipiduridae, Dicruridae, and a clade

containing the seven other families of Corvoidea formed a three-way polytomy. Coalescent

analyses, however, recovered Rhipiduridae sister to all other Corvoidea families. ML fully

resolved the clade of seven families with Monarchidae sister to a clade that in turn contained two

sister clades, each consisting of three families. In one of these triplets, Corcoracidae was sister

to Ifritidae and Paradisaeidae; in the other, Melampittidae was recovered sister to Corvidae and

Laniidae. Within the Corvidae and Laniidae group, the genus Platylophus, traditionally a

member of Corvidae, was sister to Laniidae, and this clade was in turn sister to the rest of

Corvidae. Coalescent analyses achieved much less resolution in this clade of 7 Corvoidea

families, yielding a five-way polytomy with the following lineages: Monarchidae, Corcoracidae,

Ifritidae, Paradisaeidae, and a clade containing the other three families. In the last clade,

Melampittidae was recovered sister to a three-way polytomy of Corvidae, Laniidae, and

Platylophus. STEAC’s estimate of Corvoidea topology differed from that obtained by the other

GCMs in two ways. First, it recovered Corcoracidae sister to Monarchidae, with 78% BS, a

result compatible with other GCM results, but conflicting with ML results. Second, it placed

Platylophus sister to the other members of Laniidae with 76% BS, marginally supporting

Corvidae paraphyly. This finding is also compatible with other GCM results and congruent with

the ML topology.

DISCUSSION

Core Corvoidea Relationships

This study provides a phylogenetic hypothesis for core Corvoidea with a vast

improvement in resolution over previous work (Barker et al. 2004; Norman et al. 2009; Jønsson

et al. 2011; Aggerbeck et al. 2014). The fact that most relationships were supported by both ML

and coalescent analyses indicates that our results are robust and that the high support for most

nodes in our ML estimate is not an artefact of the anomaly zone (Kubatko and Degnan 2007).

Our estimate of relationships among families in core Corvoidea is compatible with the

well-supported relationships in Barker et al. (2004) and Jønsson et al. (2011; Figs. 2.1, 2.3).

These two studies recovered the clades corresponding to Malaconotoidea and Corvoidea, with

Barker et al. (2004) finding a sister relationship between these two clades. However, our results

contradict these two studies in a few sister relationships among families that were found with

high support in their Bayesian analyses. Barker et al. (2004) recovered Monarchidae sister to

Melampittidae, whereas Jønsson et al. (2011) found sister relationships in the following pairs of

families: Vireonidae and Oriolidae, Falcunculidae and Pachycephalidae, and Rhagologidae and

Aegithinidae. In this study, the members of each of these pairs were recovered in different clades with high

support from both ML and coalescent methods.

Results of this study contradict many of the relationships recovered by Aggerbeck et al.

(2014), the study that until this work included the densest character sampling and attained the

highest resolution among core Corvoidea families. First, we find different basal lineages in the

group. Aggerbeck et al. placed the New Zealand endemic Mohouidae sister to the rest of core

Corvoidea, whereas our data find Cinclosomatidae, an endemic of the Australia-New Guinea

region, in this position. Furthermore, ML results from our study recover Campephagidae, a

widespread family thought to have originated in the Australia-New Guinea region (Jønsson et al.

2010), and the families Mohouidae and Neosittidae among the early lineages to break off from

the group. Aggerbeck et al. (2014) placed Campephagidae within our Malaconotoidea and

Neosittidae within our Orioloidea. Second, we find conflicting relationships within the

Orioloidea clade. Our data recovered only three sets of well-supported sister families in this

clade, but each one contradicts well-supported relationships in Aggerbeck et al. (2014). For

example, we found Falcunculidae sister to Oreoicidae but Aggerbeck et al. recovered

Falcunculidae sister to Cinclosomatidae. Third, well-supported conflicts also occur within the

Corvoidea clade. ML and coalescent analyses in this study recover Melampittidae, Corvidae,

Platylophus, and Laniidae forming a clade, whereas Aggerbeck et al. group the last three

lineages in a clade with Monarchidae. The high number of conflicting results between this study

and that of Aggerbeck et al. (2014) may have been caused by the sparse taxon sampling in the

latter. Aggerbeck et al. sampled only one individual in each family whereas this study used

multiple genera in each family where possible. These conflicts, especially those that involve

lineages at the base of core Corvoidea and the major clades, may have important implications in

ancestral range reconstruction in the group.

Concatenation and coalescent methods

Three of the major clades in core Corvoidea illustrate the varying degrees to which

concatenation and coalescent approaches can resolve clades with many short internodes. Within

Orioloidea, both approaches yield identical results but with poor resolution. The low support

values in ML analyses can be expected in cases of high gene tree heterogeneity, a low number of

informative sites for this section of core Corvoidea phylogeny, or both. On the other hand,

within Malaconotoidea, both approaches produce the same topology with high resolution,

indicating little gene tree discordance in this group. Within Corvoidea, two contrasting patterns

are observed. In the first pattern, coalescent approaches resolve a node that concatenation does

not. Specifically, all GCMs found Dicruridae sister to the eight other families of Corvoidea with

91–96% BS (App. 2.2), whereas ML recovered the same relationship with only 64% BS. In this

case, slight to moderate levels of gene tree discordance, likely due to incomplete lineage sorting,

could have pulled down support values for this node in ML analysis, but not in GCM analyses,

which are expected to perform well in this type of situation. In the second pattern, nodes are

resolved with moderate to high support by the concatenation approach, but poorly resolved by

coalescent methods, such as in the clade containing Monarchidae, Corcoracidae, Ifritidae,

Paradisaeidae, and the clade containing Melampittidae, Corvidae, Laniidae, and the genus

Platylophus. In this situation, overall signal from nucleotides yields support in concatenated

analysis, but that signal is not apparent in many genes, unlike the case in Malaconotoidea.

Moreover, unlike in the Dicruridae pattern, GCM support values were low. This pattern is

indicative of high gene tree estimation error in GCM analysis.

One justification for using coalescent-based approaches in phylogenetic analysis is to

detect highly-supported, incorrect topologies from concatenation approaches that are expected in

species trees that fall within the anomaly zone (Kubatko and Degnan 2007). In this study, we

have shown that not only do coalescent methods increase our confidence for highly supported

nodes in concatenation results when they agree, as with the vast majority of nodes in our species

tree estimate, but they can also provide high support for a node that receives only weak or

moderate support in concatenation results. This result was demonstrated in the

placement of Dicruridae sister to the rest of Corvoidea.

The conflicting results between ML and GCMs in estimating relationships among the

genera Lophorina, Paradisaea, and Epimachus in the family Paradisaeidae are an apparent

artefact of bias in GCMs caused by consistently short sequence lengths for the Paradisaea

sample (Chapter 1). Further GCM analysis on a dataset containing only members of

Paradisaeidae and 3 outgroups, with alignments trimmed to eliminate flanking missing data, and

using only the most informative loci recovers the same result as in ML, with Lophorina sister to

Paradisaea and Epimachus (not shown). Several other samples have shorter average sequence

lengths than the Paradisaea sample, including Pericrocotus, Eulacestoma, Psophodes,

Aleadryas, Corcorax, and Parotia. The congruence of ML and GCM results for the phylogenetic

position of these taxa indicates that missing sequence data in these samples did not bias GCM

results. Either sufficient parsimony-informative sites were present in these short sequences or

only a small proportion of loci lacked informative sites for these samples. In addition, genetic

divergences between Lophorina, Paradisaea, and Epimachus were shallow compared to those

between the other samples with short sequences and their closely related taxa. Thus, missing

sequences in Paradisaea likely resulted in more severe gene tree estimation error in its clade.

The sister relationship between Orioloidea and Malaconotoidea recovered by STAR and

MP-EST in the analysis of one subset of taxa should be interpreted with caution for two reasons.

First, bootstrap support values for this node (74% and 76%) were on the low end of our

definition of well-supported. Second, a support value > 70% for this node was found only in 2

out of 20 GCM analyses that contained both clades. In these two instances, Orioloidea formed

part of the outgroup and had sparse taxon sampling. This relationship could be an artefact of

sparse taxon sampling in this subset of the data. STEAC was the only GCM that recovered

Platylophus sister to Laniidae, agreeing with ML results, and placed Corcoracidae sister to

Monarchidae, contradicting ML results. The positions of these taxa were ambiguous in the

results of other GCMs. Considering that the support values for these nodes were only 76% and

78% respectively, these results should also be considered with caution.

STEAC, the only GCM among the four used in this study that uses branch length

information from gene trees, assumes a molecular clock across lineages (Liu et al. 2009b). The

high congruence of the STEAC results with the other three GCMs indicates the robustness of this

method to deviations from this assumption in this relatively deeply diverging and diverse group

(Liu et al. 2009a).

Taxonomy

We propose two updates to the higher-level classification of Infraorder Corvides of

Cracraft (2014) based on relationships recovered with high support from ML and coalescent

approaches in this study. The superfamily Orioloidea can be erected to group Eulacestomatidae,

Psophodidae, Falcunculidae, Oreoicidae, Paramythiidae, Vireonidae, Oriolidae, and

Pachycephalidae. The best estimate of the position of the genus Platylophus is the one provided

by ML and STEAC, which places it sister to Laniidae, rendering Corvidae paraphyletic. Thus,

either Platylophus can be moved to Laniidae, or a new monotypic family can be erected.

CHAPTER 3

Phylogenomic Comparisons Clarify Relationships and Biogeographic History of a

Rapid Pantropical Bird Radiation: the Trogons

ABSTRACT

Historical biogeographic inference is only as accurate and reliable as the phylogenetic

hypotheses on which it is based. In the avian family of trogons (Trogonidae), multiple,

conflicting biogeographic histories have been proposed to explain their pan-tropical distribution

because of differing estimates of inter-generic relationships. Phylogenetic reconstruction in the

trogons has been problematic because of successive short internodes at the base of this radiation

that cause gene tree discordance and a long branch leading to the family that makes root

placement tricky. We re-estimate inter-generic relationships in this rapid bird radiation using

data from thousands of ultraconserved element (UCE) loci and employ analytical methods that

consider gene tree discordance. We then examine the origin, tempo, and mode of diversification

of the group. We recover the first well-supported hypothesis of relationships among trogon

genera. Trogons comprise three clades, each confined to one of three biogeographic regions:

Africa, Asia, and the Neotropics, with the African clade sister to the rest of trogons. This

topology, combined with the trogon fossil record and geologic and climatic data, suggests an Old

World origin for the group. Continental connections during the warm Late Oligocene/Early

Miocene facilitated dispersal between Eurasia, North America, and Africa, and subsequent

global cooling plausibly caused divergence between main trogon lineages.

INTRODUCTION

Two main hypotheses attempt to explain the origins of bird lineages with disjunct

distributions in portions of the Old World and New World. One hypothesis attributes these

distributions to vicariance caused by the Cretaceous break-up of Gondwana (Cracraft 1973,

2001). The other hypothesis ascribes these disjunct distributions to a Laurasian origin in the

Early Paleogene, during which bird lineages assumed a broad distribution under favorable

Laurasian climatic conditions and suitable land connections before subsequently becoming

isolated with the deterioration of climatic conditions and the severing of land bridges (Cracraft

1973; Olson 1989). Evaluating these hypotheses for a specific lineage requires a robust estimate

of phylogenetic relationships and divergence times within the group and its close relatives, as

well as a careful consideration of the fossil record.

The avian family Trogonidae (trogons) represents a challenge for reconstructing

biogeographic history. The family, which is placed in its own order Trogoniformes, consists of

43 species in 7 genera distributed throughout much of the Old World and New World tropics.

Multiple, conflicting biogeographic histories have been published for the group because of

differences in the estimates of the family’s phylogeny (Espinosa de los Monteros 1998;

Johansson and Ericson 2005; Moyle 2005; Ornelas et al. 2009; Hosner et al. 2010). For instance,

some studies recovered a single clade each for African, Asian, and Neotropical taxa (Espinosa

de los Monteros 1998; Johansson and Ericson 2005; Ornelas et al. 2009), whereas others

suggested the Neotropical taxa are paraphyletic (Moyle 2005; Hosner et al. 2010). These

conflicting findings likely resulted from the use of different sets of only a few molecular

markers, each supporting different topologies or providing no support at all. In all of these

studies support for inter-generic relationships was low, with the exception of one (Ornelas et al.

2009) in which Bayesian posterior probabilities were high (100%) but maximum likelihood

bootstrap support values were < 50%. Moreover, with the exception of one study (Hosner et al.

2010), previous work did not include the highly divergent Asian genus Apalharpactes, which

may not be closely related to other Asian trogons.

Despite conflicting phylogenetic estimates for trogons, previous molecular studies have

rejected a Gondwanan origin for this group based on estimated dates of divergence among crown

trogons that are much younger than the sundering of Gondwana (Espinosa de los Monteros 1998;

Moyle 2005). A Laurasian derivation is evident but these two studies offer two conflicting

hypotheses of trogon origins. Espinosa de los Monteros (1998) concluded an African origin

based on his results. He further noted that his hypothesis is supported by the fact that stem

trogon fossils from the Paleocene have been found in Europe, whereas the oldest known fossil

trogons from the New World are extant taxa from Pleistocene deposits (Espinosa de los

Monteros 1998). On the other hand, Moyle's (2005) results supported a derivation of trogons in

the New World, which had been previously proposed because of the higher trogon diversity in

the region (Mayr 1946). Since the publication of these two studies, more information on the

fossil record of trogons and close relatives has been revealed (Kristoffersen 2002; Mayr 2003,

2005, 2009; Weidig 2006; Ksepka and Clarke 2009) that has implications for understanding

trogon origins. Nonetheless, a critical step to understanding trogon origins is clarifying

phylogenetic relationships in the group.

The difficulty in resolving trogon phylogeny likely stems from two main factors. First,

the base of the trogon phylogeny is characterized by short successive internodes, indicating a

rapid initial burst of lineage splitting. Phylogenies with such structure are difficult to resolve

with limited data because of the low probability of informative substitutions occurring along

these short internodes (Lanyon 1988) and the high probability of gene tree discordance (Degnan

and Rosenberg 2006). Second, trogons have no close living sister-group such that a very long

branch leads to this rapid radiation. This long branch, combined with subsequent short

internodes at the base of the trogon clade, makes rooting the trogon phylogeny difficult

(Johansson and Ericson 2005; Moyle 2005), which in turn has a large impact on the possible

biogeographic conclusions. An additional factor that may be confounding phylogeny estimation

in trogons is that the presence of successive short internodes may place the species tree within

the anomaly zone, a situation in which the most common gene trees inferred from sampled loci

are expected to be incongruent with the species tree, and as such, methods based on the multi-

species coalescent may be necessary to resolve the species tree (Degnan and Rosenberg 2006;

Kubatko and Degnan 2007).

Advances in sequencing technology have made collection of hundreds to thousands of

genome-wide sequences feasible for phylogenetic studies (Faircloth et al. 2012; Lemmon et al.

2012). Recent studies using sequence capture of ultraconserved elements and their flanking

regions (UCE loci) have clarified phylogenetic relationships in regions of the avian tree of life

with successive short internodes (McCormack et al. 2013; Sun et al. 2014). Additionally, several

analytical methods that are statistically consistent under the multi-species coalescent can be

harnessed to estimate species trees from hundreds to thousands of loci (Liu et al. 2009b, 2010;

Mirarab et al. 2014c). In this paper, we leverage these advantages to re-examine phylogenetic

relationships in the rapid pan-tropical radiation of trogons using thousands of UCE loci. We then

estimate the timing of diversification and evaluate hypotheses of the origin, tempo, and mode of

diversification in this pan-tropical group using recent advances in our knowledge of the trogon

fossil record.

METHODS

Sampling

Monophyly of all seven trogon genera has been established in previous molecular studies

(DaCosta and Klicka 2008; Ornelas et al. 2009; Hosner et al. 2010). Because the focus of this

study is on inter-generic relationships and broad-scale biogeographic patterns, we selected 11

trogon species that included at least one representative of each genus and one or two additional

species from species-rich genera as ingroup taxa (Table 3.1). Samples from five closely related

orders were used as outgroups.

Laboratory techniques

We extracted and purified DNA from fresh muscle or liver tissue using the Qiagen

DNeasy Blood and Tissue Kit following the manufacturer’s protocol. Sequence capture of

ultraconserved element (UCE) loci was performed targeting 5,060 UCE loci following the same

process of library preparation, target capture, post-enrichment amplification, and sequencing

described in Chapter 1.

Table 3.1. Sampling, read, and contig characteristics.

Species | Accession Number | Locality | Total number of reads | Percentage of ≥Q30 bases | Mean read quality score | Total number of contigs | Number of UCE contigs | Mean UCE contig length | Mean UCE coverage
Ingroup
Apalharpactes mackloti | LSUMNS B49104 | Indonesia | 4,262,870 | 88.88 | 35.29 | 7310 | 4192 | 957.9 | 38.6
Apaloderma aequatoriale | KUNHM 8461 | Equatorial Guinea | 3,239,034 | 90.63 | 35.73 | 7222 | 4082 | 790.6 | 35.5
Euptilotis neoxenus | AMNH DOT11081 | USA | 4,834,786 | 89.94 | 35.53 | 8793 | 4078 | 881.3 | 42.3
Harpactes ardens | KUNHM 26958 | Philippines | 4,498,186 | 89.49 | 35.44 | 8860 | 4288 | 890.8 | 37.4
Harpactes erythrocephalus | KUNHM 9970 | China | 3,270,822 | 89.75 | 35.48 | 7294 | 3882 | 792.6 | 32.1
Harpactes oreskios | KUNHM 23185 | Vietnam | 2,996,054 | 87.9 | 35.04 | 7031 | 3695 | 799.3 | 26.2
Pharomachrus antisianus | LSUMNS B22870 | Bolivia | 4,868,516 | 88.64 | 35.26 | 7331 | 4201 | 1017.5 | 42.5
Priotelus roseigaster | KUNHM 6363 | Dominican Republic | 3,937,650 | 89.56 | 35.43 | 7362 | 4105 | 684.8 | 41.6
Priotelus temnurus | ANSP 5565 | Cuba | 5,680,700 | 88.34 | 35.16 | 8164 | 4264 | 1003 | 47.7
Trogon personatus | AMNH DOT4266 | Venezuela | 3,742,850 | 88.41 | 35.16 | 7938 | 3935 | 911.7 | 34.1
Trogon violaceus | AMNH DOT11951 | Venezuela | 3,491,602 | 85.66 | 34.46 | 9095 | 3195 | 748.7 | 31.1
Outgroup
Berenicornis comatus | AMNH DOT14737 | Captive | 5,483,676 | 91.63 | 35.95 | 10962 | 4342 | 644.0 | 51.1
Ceyx argentatus | KUNHM 19269 | Philippines | 6,084,638 | 89.97 | 35.58 | 9033 | 4278 | 951.8 | 55.6
Leptosomus discolor | FMNH 449184 | Madagascar | 5,031,310 | 89.37 | 35.46 | 9321 | 4306 | 945.8 | 37.2
Otus elegans | KUNHM 10975 | Philippines | 5,732,988 | 89.5 | 35.45 | 9332 | 4225 | 955.2 | 46.8
Colius striatus | Downloaded from gigadb.org (Jarvis et al. 2014) | Captive | - | - | - | - | 4872 | 2113.8 | -

Acronyms: AMNH (American Museum of Natural History), ANSP (Academy of Natural Sciences of Drexel University), KUNHM (University of Kansas Natural History Museum), LSUMNS (Louisiana State University Museum of Natural Science).

Data assembly

Raw reads were de-multiplexed using CASAVA ver. 1.8.2. Low-quality bases and

adapter sequences were trimmed from reads using illumiprocessor ver. 1

(https://github.com/faircloth-lab/illumiprocessor). Subsequent data processing was performed

using the Python package phyluce (Faircloth 2014) and is outlined below. Cleaned reads were

assembled into contigs using the program Trinity (Grabherr et al. 2011). Contigs matching UCE

loci were extracted for each taxon. For Colius striatus, UCE loci were obtained by in silico

alignment of UCE probes with the full genome sequence (Jarvis et al. 2014) and taking the

matched region along with 1000 bp of flanking nucleotides on each side of the matched region.

Two datasets were then assembled: one “incomplete dataset” containing UCE loci that were

present in at least 75% of taxa and a “complete dataset” containing UCE loci that were present in

all taxa. Each locus was aligned using MAFFT (Katoh and Standley 2013) and trimmed using

Gblocks (Castresana 2000) using default parameters with the exception of the minimum number

of sequences for a flank position in Gblocks, which we set at 65% of taxa. The alignments were

formatted to phylip and nexus files for phylogenetic analysis.
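
The two occupancy thresholds described above can be illustrated with a minimal Python sketch. This is not the phyluce implementation: the function name `assemble_datasets`, the toy locus names, and the taxon labels are hypothetical, and the only values taken from the text are the 75% and 100% cutoffs.

```python
def assemble_datasets(locus_taxa, all_taxa, min_fraction=0.75):
    """Return (incomplete, complete) lists of locus names.

    locus_taxa maps a locus name to the taxa in which it was recovered.
    'incomplete' keeps loci present in at least min_fraction of taxa;
    'complete' keeps loci present in every taxon.
    """
    wanted = set(all_taxa)
    incomplete, complete = [], []
    for locus in sorted(locus_taxa):
        n_present = len(wanted & set(locus_taxa[locus]))
        if n_present >= min_fraction * len(wanted):
            incomplete.append(locus)
        if n_present == len(wanted):
            complete.append(locus)
    return incomplete, complete


# Hypothetical toy input: four taxa, three loci.
taxa = ["taxon_a", "taxon_b", "taxon_c", "taxon_d"]
recovered = {
    "uce-1": ["taxon_a", "taxon_b", "taxon_c", "taxon_d"],  # 100% of taxa
    "uce-2": ["taxon_a", "taxon_b", "taxon_c"],             # 75% of taxa
    "uce-3": ["taxon_a", "taxon_d"],                        # 50% of taxa
}
incomplete, complete = assemble_datasets(recovered, taxa)
print(incomplete)
print(complete)
```

Under this definition every locus in the complete dataset also belongs to the incomplete dataset, which is why the former is the smaller of the two.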

Phylogenetic analysis

Maximum likelihood (ML) and Bayesian inference were performed on the concatenated

loci of the incomplete dataset. ML tree searches were carried out using RAxML ver. 8.1.3

(Stamatakis 2014) on the unpartitioned dataset assuming a general time reversible model of

nucleotide substitution and gamma-distributed rates among sites. Node support was evaluated using 500

rapid bootstraps. For Bayesian analysis, we first estimated substitution models for each locus

using Cloudforest (Crawford and Faircloth 2014) and grouped loci with the same substitution

model into separate partitions. We performed two sets of Bayesian analysis using ExaBayes ver.

1.3 (Aberer et al. 2014): one with 4 independent runs each with two coupled chains, and another

with 8 independent runs with no coupled chains. Both MCMCs were run for 10⁷ generations

sampling every 5 × 10³ generations. Convergence of likelihood and parameter estimates was

assessed using the program Tracer ver. 1.6.0 (Rambaut and Drummond). Effective sample sizes

of all parameters in all runs were > 200. Convergence of the independent runs was evaluated

using the sdsf program of ExaBayes.
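
As a rough illustration of what a split-frequency comparison between independent runs measures, the sketch below computes an average standard deviation of split frequencies in Python. It is not the ExaBayes sdsf code: the function name, the representation of each sampled tree as a set of splits, the use of the population standard deviation, and the toy taxon labels are all assumptions made for illustration.

```python
from statistics import pstdev


def average_sd_split_freqs(runs):
    """Average, over all observed splits, of the SD of split frequencies across runs.

    runs: list of runs; each run is a list of sampled trees;
    each tree is a set of splits (any hashable objects).
    """
    all_splits = set().union(*(tree for run in runs for tree in run))
    sds = []
    for split in all_splits:
        # Frequency of this split among the trees sampled in each run.
        freqs = [sum(split in tree for tree in run) / len(run) for run in runs]
        sds.append(pstdev(freqs))  # assumption: population SD across runs
    return sum(sds) / len(sds)


# Hypothetical toy samples: two runs, two sampled trees each.
ab = frozenset({"A", "B"})
ac = frozenset({"A", "C"})
run_1 = [{ab}, {ab}]
run_2 = [{ab}, {ac}]
asdsf = average_sd_split_freqs([run_1, run_2])
print(asdsf)
```

Values near zero indicate that the runs sampled the same splits at similar frequencies; identical runs give exactly zero.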

We used gene tree-based coalescent methods (GCM) to estimate the species tree from the

complete dataset. Gene tree inference and bootstrapping were performed with RAxML ver.

8.1.3 (Stamatakis 2014) using the python package phyluce (Faircloth 2014). We modified the

phyluce scripts to implement multi-locus bootstrapping (Seo 2008), i.e., sampling with

replacement of loci and sites, and generated 500 multi-locus bootstrap replicate sets of gene trees

for each dataset. On each replicate set of gene trees, we ran four GCMs: species tree estimation

using average ranks of coalescences (STAR) and species tree estimation using average

coalescence times (STEAC) (Liu et al. 2009b) as implemented in the R package phybase ver.

1.3, maximum pseudo-likelihood for estimating species trees (MP-EST) ver. 1.4 (Liu et al.

2010), and accurate species tree algorithm (ASTRAL) ver. 4.7.7 (Mirarab et al. 2014c). All

programs were run with the default options. Consensus trees were generated for each method

using the sumtrees.py program in Dendropy (Sukumaran and Holder 2010). Command line and

R scripts used to process the data and run the species tree analyses are available at

https://github.com/carloliveros/uce-scripts.
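
The two-level resampling behind multi-locus bootstrapping can be sketched in a few lines of Python. This is only an illustration of the idea (loci drawn with replacement, then sites drawn with replacement within each drawn locus), not the modified phyluce scripts; the function name, the dictionary representation of an alignment, and the toy sequences are hypothetical.

```python
import random


def multilocus_bootstrap(loci, rng):
    """Build one bootstrap replicate from a list of alignments.

    loci: list of dicts mapping taxon name -> aligned sequence
          (all sequences within a locus have equal length).
    rng:  a random.Random instance, so that replicates are reproducible.
    """
    replicate = []
    for _ in range(len(loci)):
        locus = rng.choice(loci)  # resample loci with replacement
        n_sites = len(next(iter(locus.values())))
        # Resample alignment columns with replacement within the chosen locus.
        columns = [rng.randrange(n_sites) for _ in range(n_sites)]
        replicate.append(
            {taxon: "".join(seq[i] for i in columns) for taxon, seq in locus.items()}
        )
    return replicate


# Hypothetical toy alignments.
toy_loci = [
    {"taxon_a": "ACGT", "taxon_b": "ACGA", "taxon_c": "TCGA"},
    {"taxon_a": "GGCCAT", "taxon_b": "GGTCAT", "taxon_c": "GATCAT"},
]
replicate = multilocus_bootstrap(toy_loci, random.Random(2015))
print(len(replicate))
```

In a full analysis, a gene tree would be estimated from each resampled alignment, and the resulting set of gene trees would be passed to a summary method; repeating the procedure yields the replicate sets of gene trees.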

Loci with weak or no phylogenetic signal can lower bootstrap support values in species

tree estimation (Liu et al. 2015) and can lead to incorrect GCM-estimated species trees with high

support (Chapter 1). Thus, in addition to performing GCM analyses with all 1828 loci common

to all taxa in the dataset, we also performed three other sets of GCM analyses on subsets of loci

consisting of 1469, 1095, and 741 loci with the highest number of parsimony-informative sites.
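
Ranking loci by information content can be sketched as follows in Python. A site is counted as parsimony-informative when at least two character states each occur in at least two taxa. Treating gaps and ambiguity codes as missing data is an assumption of this sketch, as are the function names and toy alignments; the scripts actually used for the analyses may differ.

```python
from collections import Counter


def count_informative_sites(seqs):
    """Count parsimony-informative columns in a list of aligned sequences."""
    n_informative = 0
    for column in zip(*seqs):
        # Assumption: anything other than A, C, G, T is treated as missing.
        states = Counter(b for b in "".join(column).upper() if b in "ACGT")
        if sum(1 for count in states.values() if count >= 2) >= 2:
            n_informative += 1
    return n_informative


def most_informative_loci(alignments, n_keep):
    """Return the names of the n_keep loci with the most informative sites."""
    ranked = sorted(
        alignments,
        key=lambda name: count_informative_sites(alignments[name]),
        reverse=True,
    )
    return ranked[:n_keep]


# Hypothetical toy alignments (four taxa each).
toy = {
    "uce-1": ["AAGT", "AAGA", "ACTA", "ACTT"],
    "uce-2": ["AC", "AC", "AT", "A-"],
    "uce-3": ["AG", "AG", "CT", "CT"],
}
print(most_informative_loci(toy, 2))
```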

In order to assess the robustness of rooting the trogon phylogeny, we carried out

additional ML tree searches using RAxML ver. 8.1.3 (Stamatakis 2014) evaluated with 500 rapid

bootstraps on the incomplete dataset with four alternative sets of outgroups: (a) Ceyx argentatus,

a member of the sister clade of trogons, as a single outgroup; (b) Otus elegans, the most distant

taxon to trogons in our sampling, as a single outgroup; (c) Ceyx argentatus and Berenicornis

comatus, two members of the sister clade of trogons, as the outgroup; and (d) Ceyx argentatus

and Otus elegans as the outgroup.

We used the program MCMCtree in the PAML ver. 4.8 package (dos Reis and Yang

2011) to estimate absolute divergence times between trogon genera. The complete dataset was

concatenated and treated as a single locus in the analysis. The tree topology was fixed to the

well-supported topology inferred from both concatenated and species tree analyses. No fossils

are known from the trogon crown clade (Mayr 1999, 2005; Kristoffersen 2002; Mayr and Smith

2013); thus two secondary time calibrations from a recent chronogram of avian orders based

on genome-wide data and multiple fossil calibrations (Jarvis et al. 2014) were utilized. Normally

distributed priors were assigned to both calibrations: one with mean of 59.942 Mya with s.d.

1.675 My at the split of Coliiformes and its sister clade; and the other with mean 51.985 Mya and

s.d. 1.55 My at the split of Bucerotiformes and its sister clade. We used means and standard

deviations in our calibration priors that reflected their posterior distributions in Jarvis et al.

(2014). A model of independent rates between lineages drawn from a lognormal distribution was

used with gamma hyper priors for the mean and variance of rates. Hyper prior parameters

(rgene_gamma and sigma2_gamma) were selected using estimates of the overall rate on the tree

obtained by the PAML program baseml. A birth-death-sampling model of lineage

diversification was used with sampling rate of 0.25. Date estimates were robust to changing

initial values for the hyper prior parameters and the birth and death rates. An HKY85

substitution model with gamma-distributed rates in 5 categories was chosen to account for

substitution rate variation between sites. Two independent MCMC chains were run with

parameters sampled every 5000 generations 10⁴ times after discarding 2 × 10⁵ generations as

burn-in. Convergence of likelihood and parameters was assessed by examining trace plots using

the program Tracer ver. 1.6.0 (Rambaut and Drummond) and comparing results between

independent runs.
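
For orientation, the Python sketch below works out what the two normal calibration priors imply as central 95% intervals and how many generations each chain covers. It uses only the means, standard deviations, sampling interval, number of samples, and burn-in stated above; the dictionary labels are hypothetical, and the sketch assumes the burn-in precedes the sampled generations, as the wording suggests. It says nothing about how MCMCtree represents these priors internally.

```python
from statistics import NormalDist

# Calibration priors as described in the text (ages in Mya).
calibrations = {
    "Coliiformes vs. sister clade": NormalDist(mu=59.942, sigma=1.675),
    "Bucerotiformes vs. sister clade": NormalDist(mu=51.985, sigma=1.55),
}

# Central 95% interval implied by each normal prior.
intervals = {
    name: (prior.inv_cdf(0.025), prior.inv_cdf(0.975))
    for name, prior in calibrations.items()
}
for name, (lower, upper) in intervals.items():
    print(f"{name}: {lower:.2f} to {upper:.2f} Mya")

# Chain length: burn-in plus (sampling interval x number of samples).
sample_every = 5000
n_samples = 10**4
burn_in = 2 * 10**5
total_generations = burn_in + sample_every * n_samples
print(total_generations)
```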

RESULTS

Sequence attributes

We obtained a total of 6.71 × 10⁷ raw Illumina reads with each individual yielding an

average of 4.47 × 10⁶ reads (Table 3.1). For each individual, an average of 8337 contigs were

assembled, of which roughly half corresponded to UCE loci. The average UCE contig length

was 865 bp with an average coverage of 40×. The incomplete dataset included data from 4,011

loci with a total alignment length of 3,339,138 bp, yielding a mean locus length of 832.50 bp.

Nucleotide data were present for 88% of the data matrix. On the other hand, the complete dataset

included data from 1,828 loci with a total alignment length of 1,653,366 bp, resulting in an

average locus length of 904.47 bp. In this data matrix, 92% represented nucleotide data. UCE

sequence data are available at GenBank (accession numbers to be provided upon acceptance of

manuscript).
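
The summary figures in this paragraph can be cross-checked with simple arithmetic. The Python sketch below uses only numbers stated in the text; the variable names are arbitrary.

```python
# Locus counts and total alignment lengths reported for the two datasets.
datasets = {
    "incomplete": {"loci": 4011, "alignment_bp": 3_339_138},
    "complete": {"loci": 1828, "alignment_bp": 1_653_366},
}

# Mean locus length = total alignment length / number of loci.
mean_length = {
    name: d["alignment_bp"] / d["loci"] for name, d in datasets.items()
}
for name, value in mean_length.items():
    print(f"{name}: {value:.2f} bp")

# Total reads divided by mean reads per individual gives the number of
# individuals that were sequenced.
n_individuals = 6.71e7 / 4.47e6
print(round(n_individuals))
```

The read counts imply 15 sequenced individuals, one fewer than the 16 taxa analyzed, which is consistent with the Colius striatus loci having been taken from a published genome rather than sequenced here.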

Phylogenetic relationships

Bayesian, ML, and GCM analyses yielded the same estimate of topology with high support at

most nodes (Fig. 3.1). The earliest split in crown-clade trogons involves the African genus

Apaloderma, which is sister to two radiations, one in Asia and the other in the New World

tropics, each of which is monophyletic. Within the Asian clade Apalharpactes is sister to

Harpactes. In the New World, the quetzal genera Euptilotis and Pharomachrus form a clade

sister to Trogon and Priotelus. With two exceptions, all nodes in the ingroup had 100% bootstrap

and Bayesian posterior-probability support across the different analytical approaches, and across

the subsets of loci. The node that unites the Asian and New World clades (Node 1, Fig. 3.1)

received 100% Bayesian posterior-probability and ML bootstrap support, and 91–97% bootstrap

support across GCM results based on all 1828 loci. In contrast, the node that unites the Asian

genera Apalharpactes and Harpactes (Node 2, Fig. 3.1) received 100% Bayesian posterior-

probability, 61% ML bootstrap support, and 60–83% bootstrap support across GCM analyses on

all loci. GCM analyses performed on subsets of informative loci yielded the same ingroup

topology (not shown) but with lower support values for Nodes 1 and 2 (Fig. 3.2). Bootstrap

support for Node 2 dropped below 50% for STEAC, MP-EST, and ASTRAL when the 741

most informative loci were analyzed. Support values for the other nodes in the ingroup remained

constant at 100% in these sets of GCM analyses.

Figure 3.1. Trogonidae generic level phylogeny. Estimate of phylogenetic relationships of trogons based on concatenated and coalescent approaches. Nodes with dots correspond to 100% Bayesian posterior probability and bootstrap support from all methods of analysis. Numbers next to nodes indicate support from Bayesian, ML, STAR, STEAC, MP-EST, and ASTRAL analyses, respectively. Chronogram based on divergence time estimation with MCMCTree. Color highlights on species names correspond to different geographic regions: red = Africa, yellow = Asia, light blue = continental Neotropics, dark blue = Caribbean. Light gray shading shows short time window of diversification among continents. Abbreviations in geologic time scale are Plio = Pliocene, Q = Quaternary.

The trogon family was found to be sister to a clade consisting of Bucerotiformes

(hornbills) and Coraciiformes (rollers, bee-eaters and allies; Fig. 3.1) in all analyses except in

STAR, which placed the trogons in a three-way polytomy with Bucerotiformes and

Coraciiformes. These clades were sister to Leptosomiformes (cuckoo-roller) and in turn sister to

Coliiformes (mousebirds) based on most phylogenetic analyses. STEAC, however, recovered a

topology in which the placement of Leptosomiformes and Coliiformes was switched with high

support.

Figure 3.2. Effect of the number of loci analyzed on bootstrap support values. a) Support values for Node 1 in Figure 3.1. b) Support values for Node 2 in Figure 3.1.

Almost all analyses using smaller sets of outgroup taxa produced the same rooting of

trogon phylogeny with high support, that is, along the edge that leads to the African lineage

Apaloderma (not shown), similar to the result when all five outgroup taxa were used. The only

exception, the case in which Otus elegans was used as the single outgroup, rooted the tree along

the edge that connected a clade containing Apaloderma and Apalharpactes to another

clade consisting of all the other trogon genera (not shown). This alternative root presented a

different ingroup topology as well, but the alternative clades received low bootstrap support

(< 51%).

DISCUSSION

Genomic data and phylogenetic inference

This study joins a growing number that use hundreds to thousands of loci to clarify

phylogenetic relationships that have previously been difficult to resolve with fewer data. The

congruence of results among various concatenation and coalescent approaches indicates that our

estimate of relationships among trogon genera is the most robust to date. The short internodes in

the topology represent a soft polytomy, which was resolved with the use of genome-scale data.

Although the internodes at the base of the trogon phylogeny are extremely short, they do not

appear to be in the anomaly zone. If the outgroup taxa are excluded, the most common gene tree

topology inferred from the complete dataset (n = 19) is congruent with the species tree (Table

3.2).

Table 3.2. Distribution of unique gene tree topologies from the complete dataset if outgroup taxa are excluded.

Frequency of gene tree topology | Number of gene tree topologies that have given frequency
1 | 1267
2 | 77
3 | 27
4 | 11
5 | 11
6 | 6
7 | 2
8 | 2
9 | 4
10 | 3
11 | 3
12 | 1
14 | 1
17 | 1
19 | 1*

* Gene tree topology congruent to the inferred species tree topology
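
The counts in Table 3.2 can be checked for internal consistency with a short Python sketch. The dictionary below simply transcribes the table (frequency of a topology mapped to the number of topologies with that frequency); the variable names are arbitrary.

```python
# Table 3.2: frequency of a gene tree topology -> number of topologies with that frequency.
topology_counts = {
    1: 1267, 2: 77, 3: 27, 4: 11, 5: 11, 6: 6, 7: 2, 8: 2,
    9: 4, 10: 3, 11: 3, 12: 1, 14: 1, 17: 1, 19: 1,
}

# Number of distinct topologies observed.
n_unique_topologies = sum(topology_counts.values())

# Total number of gene trees: each topology contributes its frequency.
n_gene_trees = sum(freq * n for freq, n in topology_counts.items())

# Highest frequency and the share of gene trees showing that topology.
top_frequency = max(topology_counts)
top_share = top_frequency / n_gene_trees

print(n_unique_topologies, n_gene_trees, top_frequency)
print(f"{100 * top_share:.1f}% of gene trees share the most common topology")
```

The total of 1828 gene trees matches the number of loci in the complete dataset, and the single most frequent topology (n = 19) is the one marked as congruent with the species tree.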

Support for most of the nodes in the ingroup was high (>90% bootstrap support or 100%

Bayesian posterior probability) across both concatenation and coalescent approaches with one

exception. Support for the node uniting the Asian genera Apalharpactes and Harpactes was low

in ML (61%), MP-EST (60%), and ASTRAL (61%), moderate in STEAC (77%) and STAR

(83%), and strong in Bayesian inference (100%). The high support value from Bayesian

inference compared with those from other approaches is not surprising considering that Bayesian

posterior probabilities are generally higher than ML bootstrap support values (Alfaro et al. 2003)

and can be excessively liberal (Suzuki et al. 2002). The higher support value in STEAC relative

to MP-EST and ASTRAL is notable. This GCM tends to be least affected by biases from short

sequence lengths and uninformative loci because of its use of branch length information from

gene trees (Chapter 1). Despite some low support values, this study provides some support

(>70% from both STAR and STEAC, and 100% from Bayesian analysis) for the position of

Apalharpactes from both concatenation and coalescent approaches, whereas the only other study

that included this genus was unable to estimate its position with any confidence (Hosner et al.

2010).

The root of the trogon phylogeny along the edge leading to Apaloderma appears to be

appropriate because this positioning is supported in reconstructions with different numbers and

combinations of outgroups. The only alternative root placement occurred when a single distant

outgroup, Otus elegans, was used, albeit with weak support. It is well known that the use of only

distant outgroups can cause incorrect rooting (Wheeler 1990).

The utility of GCMs at deep phylogenetic time scales has recently been questioned based

in part on a re-analysis of UCE loci in estimating relationships within placental mammals

(Gatesy and Springer 2014). In our study, however, GCMs were effective, probably for a variety

of reasons. First, the average locus length in this study was longer (twice that of the mammal

dataset in Gatesy and Springer 2014). Longer alignments generally improve the accuracy of

GCMs (Mirarab et al. 2014c), especially for UCE loci because variability in UCE flanking

regions increases with increasing distance from the center of the ultraconserved element

(Faircloth et al. 2012). Second, more loci were used for GCM analyses in this study than in the

mammal study (10 × as many). The GCMs used here are statistically consistent, i.e., they are

expected to be highly accurate as the number of gene trees analyzed goes to infinity (Liu et al.

2009b, 2010; Mirarab et al. 2014c). Lastly, this study involves relatively few lineages

separated by short internodes. As the number of short internodes in a phylogeny increases, the

likelihood and complexity of deep coalescences between lineages also increases.

Results in previous studies indicate that filtering out uninformative loci in GCM analyses

can increase bootstrap support values of species tree estimates (Liu et al. 2015; Chapter 1).

However, the lower bootstrap values obtained when fewer loci were used in GCM analyses in

this study (although only for Nodes 1 and 2 in Fig. 3.1) indicate the opposite result. These

conflicting results can partly be explained by the large discrepancy in information content in loci

between the studies. In Chapter 1, in which the oldest divergence between taxa occurred

relatively recently (~8 Ma; Moyle et al. 2009), the average number of parsimony-informative

characters in each locus was 42.7. In contrast, trogons are an older clade, and a locus in the

present study contained an average of 73.5 informative characters. In our study the exclusion of

hundreds of loci in GCM analysis had the effect of removing informative gene trees, leading to

lower support values, whereas in Chapter 1 the exclusion of many loci resulted in minimizing

spurious gene trees, leading to higher support values. Filtering fewer uninformative loci than

was done in our study could potentially have produced higher support values for Nodes 1 and 2

but this possibility was not further explored. GCMs are statistically consistent under the multi-

species coalescent (Liu et al. 2009b, 2010; Mirarab et al. 2014c) and as such perform better with

more loci. Thus the criteria for filtering uninformative loci should consider levels of divergence

in the taxa of interest.

Trogon relationships and biogeography

This study is the first to estimate trogon phylogeny based on genome-wide data and the

first to produce a well-resolved tree of trogon genera. The main results of our study,

demonstrating the monophyly of trogons from each geographic region—Africa, Asia, and the

Neotropics—and establishing hierarchical relationships among them, are similar to results

obtained in earlier studies (Espinosa de los Monteros 1998; Ornelas et al. 2009), but now with

strong nodal support throughout the tree. Our results contradict the outcome of earlier studies

that found Asian (Hosner et al. 2010) and New World trogons (Moyle 2005; Hosner et al. 2010)

to be paraphyletic. These earlier problems were probably caused by weak conflicting

phylogenetic signal from markers used in these studies. For example, analysis of mitochondrial

Cyt-b gene sequences showed the African genus Apaloderma to be the sister of all other trogons,

with low support (Espinosa de los Monteros 1998; Johansson and Ericson 2005). Another

mitochondrial gene, ND2, and the nuclear RAG-1 gene each indicated Priotelus and the quetzal

genera Euptilotis and Pharomachrus as the first lineages to sequentially diverge from the base of

the tree, again with weak support (Moyle 2005). Interestingly, the study that combined all these

markers arrived at a topology similar to ours (Ornelas et al. 2009), suggesting that more data are

better than fewer.

Our dating analysis estimated that the trogon lineages from Africa, Asia, and the New World

diverged rapidly within a few million years in the Early Miocene (Fig. 3.1). The divergence of

the Caribbean lineage from the continental lineage in the New World is estimated to have taken

place 1–2 million years later. These age estimates are within the confidence intervals of previous

estimates (Espinosa de los Monteros 1998; Moyle 2005) but are on the younger end of these

ranges. The younger date estimates in this study are not surprising considering that our

calibrations are based on a study that, in general, placed the diversification of modern birds at

younger dates than previously thought (Jarvis et al. 2014). As found in two previous trogon

studies (Espinosa de los Monteros 1998; Moyle 2005), a Gondwanan origin for the crown group

can be rejected based on our date-estimates and a Laurasian origin appears to be a more plausible

scenario.

With respect to the placement of trogons among related bird orders, our results are

similar to those of recent phylogenomic studies of birds (Hackett et al. 2008; Jarvis et al. 2014).

The trogons are members of the clade that includes Bucerotiformes, Coraciiformes, and

Piciformes (App. 3.1). Bucerotiformes presently occurs in the Old World tropics, whereas

Coraciiformes and Piciformes each have a contemporary pan-tropical distribution. However,

Bucerotiform fossils from the Miocene and Eocene are known from Europe; Eocene/Oligocene

Coraciiform fossils are also known from Europe; and Piciforms are known from the Eocene of

North America and the Miocene of Europe (Brodkorb 1971). The groups closely related to the

clade comprising Trogoniformes, Bucerotiformes, Coraciiformes, and Piciformes include

Leptosomiformes, a Malagasy endemic order, and Coliiformes, an African endemic order.

However, the fossil record indicates that these two orders had much wider distributions in North

America and Europe during the Eocene (Weidig 2006; Ksepka and Clarke 2009). All

unambiguously identified trogon fossils from the Paleogene are from Europe (e.g., Kristoffersen

2002; Mayr 2005, 2009) and these fossil taxa are viewed as successive sister taxa of crown

trogons (Mayr 2009). Thus Mayr (2009) believed that the fossil record is consistent with an Old

World origin of crown trogons. Our phylogenetic estimate of trogon relationships is also

consistent with this view. However, Mayr (2009) also cautioned that an unpublished record of a

putative stem trogon from Early Eocene North America exists (Weidig 2003), albeit from a

partial skeleton, which, if confirmed, could erode support for an Old World origin

hypothesis for this group.

The timing of trogon diversification is much younger than the Early Paleogene opening

of the North Atlantic Ocean (Courtillot et al. 1999), precluding the North Atlantic land bridge

playing a role in crown trogon diversification. However, Western North America and Eastern

Asia were connected by the Beringian land bridge from the mid-Cretaceous until the Pliocene

(see summary in Sanmartín et al. 2001). In the Eocene, when climates were much warmer than

the present, a belt of boreotropical forest stretched over this land bridge permitting the exchange

of terrestrial fauna and flora. By the Oligocene, global climates began cooling and the

boreotropical forest across Beringia was replaced by boreal forest vegetation. Evidence exists

for a short bout of Late Oligocene warming, followed by a period of cooling in the Early

Miocene, before another climatic optimum in the mid-Miocene (Zachos et al. 2001). Our

estimates of crown trogon rapid diversification fall within these last two climatic optima.

Coincidentally, the African continent established a definitive connection with Eurasia through

the Middle East in the Early Miocene (Gheerbrant and Rage 2006). Thus geologic and climate

data, along with our divergence date estimates, support the view that crown trogons assumed a

wide distribution during the Early Miocene, likely crossing through the Beringian land bridge

and the African-Eurasian collision zone. The aridification of North Africa and the Middle East,

along with a continued increase of the latitudinal temperature gradient in the Northern Hemisphere,

would have isolated trogons in Africa, SE Asia, and the New World. This hypothesis assumes

that suitable vegetation permitted the dispersal of trogons between Eurasia, North America, and

Africa. The presence of Early Miocene trogon fossils in Europe (Brodkorb 1971) indicates that

early trogons persisted during this period in higher latitudes, possibly including the Beringian

land bridge.

A review of the origins of Caribbean vertebrates suggests colonization occurred via

overwater dispersal for most lineages (Hedges 1996). A similar study with a focus on mammals

indicates that the GAARlandia land connection to South America in the Late Eocene/Early

Oligocene played a role in the establishment of some mammal lineages in the Caribbean

(Dávalos 2004). No land bridges are known to have connected the Caribbean islands with North

America in the Miocene (Iturralde-Vinent 2006). Thus long distance dispersal may have played

a role in the colonization of the Caribbean by ancestors of the genus Priotelus. This finding

seems counterintuitive given that trogons are considered poor dispersers (Moyle 2005).

However, trogons have colonized islands in the Philippines that have no known land connections

to the Asian continent (Hall 2002), so trogon dispersal ability may be underestimated.

LITERATURE CITED

Aberer A.J., Kobert K., Stamatakis A. 2014. ExaBayes: massively parallel bayesian tree inference for the whole-genome era. Mol. Biol. Evol. 31:2553–6.

Aggerbeck M., Fjeldså J., Christidis L., Fabre P.H., Jønsson K.A. 2014. Resolving deep lineage divergences in core corvoid passerine birds supports a proto-Papuan island origin. Mol. Phylogenet. Evol. 70:272–285.

Alfaro M.E., Zoller S., Lutzoni F. 2003. Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Mol. Biol. Evol. 20:255–266.

Alström P., Jønsson K.A., Fjeldså J., Ödeen A., Ericson P.G.P., Irestedt M. 2015. Dramatic niche shifts and morphological change in two insular bird species. R. Soc. Open Sci. 2:140364.

Barker F.K., Barrowclough G.F., Groth J.G. 2002. A phylogenetic hypothesis for passerine birds: taxonomic and biogeographic implications of an analysis of nuclear DNA sequence data. Proc. R. Soc. B Biol. Sci. 269:295–308.

Barker F.K., Cibois A., Schikler P., Feinstein J., Cracraft J. 2004. Phylogeny and diversification of the largest avian radiation. Proc. Natl. Acad. Sci. U. S. A. 101:11040–5.

Bayzid M.S., Hunt T., Warnow T. 2014. Disk covering methods improve phylogenomic analyses. BMC Genomics. 15:S7.

Bayzid M.S., Warnow T. 2013. Naive binning improves phylogenomic analyses. Bioinformatics. 29:2277–2284.

Bi K., Vanderpool D., Singhal S., Linderoth T., Moritz C., Good J.M. 2012. Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales. BMC Genomics. 13:403.

Brandley M.C., Bragg J.G., Singhal S., Chapple D.G., Jennings C.K., Lemmon A.R., Lemmon E.M., Thompson M.B., Moritz C. 2015. Evaluating the performance of anchored hybrid enrichment at the tips of the tree of life: a phylogenetic analysis of Australian Eugongylus group scincid lizards. BMC Evol. Biol. 15:62.

Brodkorb P. 1971. Catalogue of fossil birds: part 4 (Columbiformes through Piciformes). Bull. Florida State Museum, Biol. Sci. 15:163–266.

Castresana J. 2000. Selection of Conserved Blocks from Multiple Alignments for Their Use in Phylogenetic Analysis. Mol. Biol. Evol. 17:540–552.

Chaudhary R., Fernandez-Baca D., Burleigh J.G. 2014. MulRF: a software package for phylogenetic analysis using multi-copy gene trees. Bioinformatics. 31:432–433.

Cibois A. 2003. Sylvia is a babbler: taxonomic implications for the families Sylviidae and Timaliidae. Bull. Br. Ornithol. Club. 123:257–260.

Courtillot V., Jaupart C., Manighetti I., Tapponnier P., Besse J. 1999. On causal links between flood basalts and continental breakup. Earth Planet. Sci. Lett. 166:177–195.

Cox S.C., Prys-Jones R.P., Habel J.C., Amakobe B.A., Day J.J. 2014. Niche divergence promotes rapid diversification of East African sky island white-eyes (Aves: Zosteropidae). Mol. Ecol. 23:4103–4118.

Cracraft J. 1973. Continental drift, paleoclimatology, and the evolution and biogeography of birds. J. Zool. 169:455–545.

Cracraft J. 2001. Avian evolution, Gondwana biogeography and the Cretaceous-Tertiary mass extinction event. Proc. Biol. Sci. 268:459–469.

Cracraft J. 2014. Avian higher-level relationships and classification: Passeriformes. In: Dickinson E., Christidis L., editors. The Howard & Moore Complete Checklist of the Birds of the World. 4th Edition. Vol. 2. Eastbourne, U.K.: Aves Press. p. xvii–xlv.

Crawford N.G., Faircloth B.C., McCormack J.E., Brumfield R.T., Winker K., Glenn T.C. 2012. More than 1000 ultraconserved elements provide evidence that turtles are the sister group of archosaurs. Biol. Lett. 8:783–6.

Crawford N.G., Faircloth B.C. 2014. CloudForest: Code to calculate species trees from large genomic datasets. DOI:10.5281/zenodo.12259.

DaCosta J.M., Klicka J. 2008. The Great American Interchange in birds: a phylogenetic perspective with the genus Trogon. Mol. Ecol. 17:1328–43.

Dávalos L.M. 2004. Phylogeny and biogeography of Caribbean mammals. Biol. J. Linn. Soc. 81:373–394.

Degnan J.H., Rosenberg N.A. 2006. Discordance of Species Trees with Their Most Likely Gene Trees. PLoS Genet. 2:e68.

Dickinson E., Christidis L. 2014. The Howard & Moore complete checklist of the birds of the world. 4th Edition, Vol. 2. Eastbourne, U.K.: Aves Press.

Edwards S. V. 2009. Is a new and general theory of molecular systematics emerging? Evolution. 63:1–19.

Ericson P.G.P., Christidis L., Cooper A., Irestedt M., Jackson J., Johansson U.S., Norman J.A. 2002. A Gondwanan Origin of Passerine Birds Supported by DNA Sequences of the Endemic New Zealand Wrens. Proc. R. Soc. B Biol. Sci. 269:235–241.

Espinosa de los Monteros A. 1998. Phylogenetic relationships among the trogons. Auk. 115:937–954.

Faircloth B.C., McCormack J.E., Crawford N.G., Harvey M.G., Brumfield R.T., Glenn T.C. 2012. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Syst. Biol. 61:717–26.

Faircloth B.C., Sorenson L., Santini F., Alfaro M.E. 2013. A Phylogenomic Perspective on the Radiation of Ray-Finned Fishes Based upon Targeted Sequencing of Ultraconserved Elements (UCEs). PLoS One. 8:e65923.

Faircloth B.C. 2014. phyluce: phylogenetic estimation from ultraconserved elements. DOI:10.6079/J9PHYL.

Gatesy J., Springer M.S. 2014. Phylogenetic Analysis at Deep Timescales: Unreliable Gene Trees, Bypassed Hidden Support, and the Coalescence/Concatalescence Conundrum. Mol. Phylogenet. Evol. 80:231–266.

Gelang M., Cibois A., Pasquet E., Olsson U., Alström P., Ericson P.G.P. 2009. Phylogeny of babblers (Aves, Passeriformes): major lineages, family limits and classification. Zool. Scr. 38:225–236.

Gheerbrant E., Rage J.C. 2006. Paleobiogeography of Africa: How distinct from Gondwana and Laurasia? Palaeogeogr. Palaeoclimatol. Palaeoecol. 241:224–246.

Grabherr M.G., Haas B.J., Yassour M., Levin J.Z., Thompson D.A., Amit I., Adiconis X., Fan L., Raychowdhury R., Zeng Q., Chen Z., Mauceli E., Hacohen N., Gnirke A., Rhind N., di Palma F., Birren B.W., Nusbaum C., Lindblad-Toh K., Friedman N., Regev A. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29:644–652.

Hackett S.J., Kimball R.T., Reddy S., Bowie R.C.K., Braun E.L., Braun M.J., Chojnowski J.L., Cox W.A., Han K.-L., Harshman J., Huddleston C.J., Marks B.D., Miglia K.J., Moore W.S., Sheldon F.H., Steadman D.W., Witt C.C., Yuri T. 2008. A phylogenomic study of birds reveals their evolutionary history. Science. 320:1763–8.

Hall R. 2002. Cenozoic geological and plate tectonic evolution of SE Asia and the SW Pacific: computer-based reconstructions, model and animations. J. Asian Earth Sci. 20:353–431.

Hedges S.B. 1996. Historical Biogeography of West Indian Vertebrates. Annu. Rev. Ecol. Syst. 27:163–196.

Hosner P.A., Sheldon F.H., Lim H.C., Moyle R.G. 2010. Phylogeny and biogeography of the Asian trogons (Aves: Trogoniformes) inferred from nuclear and mitochondrial DNA sequences. Mol. Phylogenet. Evol. 57:1219–25.

Hovmöller R., Lacey Knowles L., Kubatko L.S. 2013. Effects of missing data on species tree estimation under the coalescent. Mol. Phylogenet. Evol. 69:1057–1062.

Iturralde-Vinent M.A. 2006. Meso-Cenozoic Caribbean paleogeography: implications for the historical biogeography of the region. Int. Geol. Rev. 48:791–827.

Jarvis E.D., Mirarab S., Aberer A.J., Li B., Houde P., Li C., Ho S.Y.W., Faircloth B.C., Nabholz B., Howard J.T., Suh A., Weber C.C., da Fonseca R.R., Li J., Zhang F., Li H., Zhou L., Narula N., Liu L., Ganapathy G., Boussau B., Bayzid M.S., Zavidovych V., Subramanian S., Gabaldon T., Capella-Gutierrez S., Huerta-Cepas J., Rekepalli B., Munch K., Schierup M., Lindow B., Warren W.C., Ray D., Green R.E., Bruford M.W., Zhan X., Dixon A., Li S., Li N., Huang Y., Derryberry E.P., Bertelsen M.F., Sheldon F.H., Brumfield R.T., Mello C. V., Lovell P. V., Wirthlin M., Schneider M.P.C., Prosdocimi F., Samaniego J.A., Velazquez A.M. V., Alfaro-Nunez A., Campos P.F., Petersen B., Sicheritz-Ponten T., Pas A., Bailey T., Scofield P., Bunce M., Lambert D.M., Zhou Q., Perelman P., Driskell A.C., Shapiro B., Xiong Z., Zeng Y., Liu S., Li Z., Liu B., Wu K., Xiao J., Yinqi X., Zheng Q., Zhang Y., Yang H., Wang J., Smeds L., Rheindt F.E., Braun M., Fjeldsa J., Orlando L., Barker F.K., Jonsson K.A., Johnson W., Koepfli K.-P., O’Brien S., Haussler D., Ryder O.A., Rahbek C., Willerslev E., Graves G.R., Glenn T.C., McCormack J., Burt D., Ellegren H., Alstrom P., Edwards S. V., Stamatakis A., Mindell D.P., Cracraft J., Braun E.L., Warnow T., Jun W., Gilbert M.T.P., Zhang G. 2014. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 346:1320–1331.

Jetz W., Thomas G.H., Joy J.B., Hartmann K., Mooers A.O. 2012. The global diversity of birds in space and time. Nature. 491:444–448.

Johansson U.S., Ericson P.G.P. 2005. A re-evaluation of basal phylogenetic relationships within trogons (Aves: Trogonidae) based on nuclear DNA sequences. J. Zool. Syst. Evol. Res. 43:166–173.

Jønsson K.A., Fabre P.-H., Ricklefs R.E., Fjeldså J. 2011. Major global radiation of corvoid birds originated in the proto-Papuan archipelago. Proc. Natl. Acad. Sci. U. S. A. 108:2328–2333.

Jønsson K.A., Bowie R.C.K., Nylander J.A.A., Christidis L., Norman J.A., Fjeldså J. 2010. Biogeographical history of cuckoo-shrikes (Aves: Passeriformes): Transoceanic colonization of Africa from Australo-Papua. J. Biogeogr. 37:1767–1781.

Katoh K., Standley D.M. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30:772–80.

Kristoffersen A. V. 2002. An early Paleogene trogon (Aves: Trogoniformes) from the Fur Formation, Denmark. J. Vertebr. Paleontol. 22:661–666.

Ksepka D.T., Clarke J.A. 2009. Affinities of Palaeospiza bella and the phylogeny and biogeography of mousebirds (Coliiformes). Auk. 126:245–259.

Kubatko L.S., Carstens B.C., Knowles L.L. 2009. STEM: Species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics. 25:971–973.

Kubatko L.S., Degnan J.H. 2007. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst. Biol. 56:17–24.

Lanyon S.M. 1988. The stochastic mode of molecular evolution: what consequences for systematic investigations? Auk. 105:565–573.

Lemmon A.R., Emme S.A., Lemmon E.M. 2012. Anchored hybrid enrichment for massively high-throughput phylogenomics. Syst. Biol. 61:727–44.

Liu L., Xi Z., Wu S., Davis C., Edwards S. V. 2015. Estimating phylogenetic trees from genome-scale data. Ann. N. Y. Acad. Sci. DOI:10.1111/nyas.12747.

Liu L., Yu L., Edwards S. V. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol. Biol. 10:302.

Liu L., Yu L., Kubatko L., Pearl D.K., Edwards S. V. 2009a. Coalescent methods for estimating phylogenetic trees. Mol. Phylogenet. Evol. 53:320–328.

Liu L., Yu L., Pearl D.K., Edwards S. V. 2009b. Estimating species phylogenies using coalescence times among sequences. Syst. Biol. 58:468–77.

Liu L., Yu L. 2011. Estimating species trees from unrooted gene trees. Syst. Biol. 60:661–667.

Mayr E. 1946. History of the North American bird fauna. Wilson Bull. 58:3–41.

Mayr G., Smith T. 2013. Galliformes, Upupiformes, Trogoniformes, and other avian remains (?Phaethontiformes and ?Threskiornithidae) from the Rupelian stratotype in Belgium, with comments on the identity of “Anas” benedeni Sharpe, 1899. Proc. 8th Int. Meet. Soc. Avian Paleontol. Evol.:23–36.

Mayr G. 1999. A new trogon from the Middle Oligocene of Céreste, France. Auk. 116:427–434.

Mayr G. 2003. On the phylogenetic relationships of trogons (Aves, Trogonidae). J. Avian Biol. 34:81–88.

Mayr G. 2005. New trogons from the early Tertiary of Germany. Ibis. 147:512–518.

Mayr G. 2009. A well-preserved second trogon skeleton (Aves, Trogonidae) from the Middle Eocene of Messel, Germany. Palaeobiodiversity and Palaeoenvironments. 89:1–6.

McCormack J.E., Faircloth B.C., Crawford N.G., Gowaty P.A., Brumfield R.T., Glenn T.C. 2012. Ultraconserved elements are novel phylogenomic markers that resolve placental mammal phylogeny when combined with species-tree analysis. Genome Res. 22:746–54.

McCormack J.E., Harvey M.G., Faircloth B.C., Crawford N.G., Glenn T.C., Brumfield R.T. 2013. A phylogeny of birds based on over 1,500 Loci collected by target enrichment and high-throughput sequencing. PLoS One. 8:e54848.

Mirarab S., Bayzid M.S., Boussau B., Warnow T. 2014a. Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science. 346:1250463.

Mirarab S., Bayzid M.S., Warnow T. 2014b. Evaluating Summary Methods for Multilocus Species Tree Estimation in the Presence of Incomplete Lineage Sorting. Syst. Biol. 0:1–15.

Mirarab S., Reaz R., Bayzid M.S., Zimmermann T., Swenson M.S., Warnow T. 2014c. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 30:i541–8.

Mossel E., Roch S. 2010. Incomplete lineage sorting: Consistent phylogeny estimation from multiple loci. IEEE/ACM Trans. Comput. Biol. Bioinforma. 7:166–171.

Moyle R.G., Andersen M.J., Oliveros C.H., Steinheimer F.D., Reddy S. 2012. Phylogeny and biogeography of the core babblers (Aves: Timaliidae). Syst. Biol. 61:631–51.

Moyle R.G., Filardi C.E., Smith C.E., Diamond J.M. 2009. Explosive Pleistocene diversification and hemispheric expansion of a “great speciator.” Proc. Natl. Acad. Sci. U. S. A. 106:1863–1868.

Moyle R.G. 2005. Phylogeny and biogeographical history of Trogoniformes, a pantropical bird order. Biol. J. Linn. Soc. 84:725–738.

Norman J.A., Ericson P.G.P., Jønsson K.A., Fjeldså J., Christidis L. 2009. A multi-gene phylogeny reveals novel relationships for aberrant genera of Australo-Papuan core Corvoidea and polyphyly of the Pachycephalidae and Psophodidae (Aves: Passeriformes). Mol. Phylogenet. Evol. 52:488–97.

Nyári Á.S., Joseph L. 2013. Comparative phylogeography of Australo-Papuan mangrove-restricted and mangrove-associated avifaunas. Biol. J. Linn. Soc. 109:574–598.

Olson S.L. 1989. Aspects of global avifaunal dynamics during the Cenozoic. Acta XIX Congr. Int. Ornithol. 2:2023–2029.

Ornelas J.F., González C., Espinosa de los Monteros A. 2009. Uncorrelated evolution between vocal and plumage coloration traits in the trogons: a comparative study. J. Evol. Biol. 22:471–84.

Rambaut A., Drummond A.J. 2007. Tracer v.1.5. http://tree.bio.ed.ac.uk/software/tracer/.

Dos Reis M., Yang Z. 2011. Approximate likelihood calculation on a phylogeny for Bayesian estimation of divergence times. Mol. Biol. Evol. 28:2161–72.

Roch S., Steel M. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor. Popul. Biol. 100:56–62.

Sanmartín I., Enghoff H., Ronquist F. 2001. Patterns of animal dispersal, vicariance and diversification in the Holarctic. Biol. J. Linn. Soc. 73:345–390.

Selvatti A.P., Gonzaga L.P., Russo C.A.D.M. 2015. A Paleogene origin for crown passerines and the diversification of the Oscines in the New World. Mol. Phylogenet. Evol. 88:1–15.

Seo T.K. 2008. Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol. Biol. Evol. 25:960–971.

Song S., Liu L., Edwards S. V., Wu S. 2012. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc. Natl. Acad. Sci. 109:14942–14947.

Springer M.S., Gatesy J. 2014. Land plant origins and coalescence confusion. Trends Plant Sci. 19:267–269.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 30:1312–3.

Sukumaran J., Holder M.T. 2010. DendroPy: a Python library for phylogenetic computing. Bioinformatics. 26:1569–71.

Sun K., Meiklejohn K.A., Faircloth B.C., Glenn T.C., Braun E.L., Kimball R.T. 2014. The evolution of peafowl and other taxa with ocelli (eyespots): a phylogenomic approach. Proc. Biol. Sci. 281.

Suzuki Y., Glazko G. V., Nei M. 2002. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc. Natl. Acad. Sci. 99:16138–16143.

Thomson R.C., Shedlock A.M., Edwards S. V., Shaffer H.B. 2008. Developing markers for multilocus phylogenetics in non-model organisms: A test case with turtles. Mol. Phylogenet. Evol. 49:514–25.

Toews D.P.L., Campagna L., Taylor S.A., Balakrishnan C.N., Baldassere D.T., Deane-Coe P.E., Harvey M.G., Hooper D.M., Irwin D.E., Judy C.D., Mason N.A., McCormack J.E., McCracken K.G., Oliveros C.H., Safran R.J., Scordato E.S.C., Stryjewski K.F., Tigano A., Uy J.A.C., Winger B. In review. Genomic approaches to understanding the early stages of population divergence and speciation in birds. Auk.

Tonini J., Moore A., Stern D., Shcheglovitova M., Ortí G. 2015. Concatenation and species tree methods exhibit statistically indistinguishable accuracy under a range of simulated conditions. PLOS Curr. Tree Life.:1–14.

Warren B.H., Bermingham E., Prŷs-Jones R., Thébaud C. 2006. Immigration, species radiation and extinction in a highly diverse songbird lineage: white-eyes on Indian Ocean islands. Mol. Ecol. 15:3769–86.

Weidig I. 2003. Fossil birds from the Lower Eocene Green River Formation (USA). Ph.D. Thesis. Johann-Wolfgang-Goethe-Universität.

Weidig I. 2006. The first New World occurrence of the Eocene bird Plesiocathartes (Aves: ?Leptosomidae). Paläontologische Zeitschrift. 80:230–237.

Wheeler W.C. 1990. Nucleic acid sequence phylogeny and random outgroups. Cladistics. 6:363–367.

Whitfield J.B., Lockhart P.J. 2007. Deciphering ancient rapid radiations. Trends Ecol. Evol. 22:258–65.

Wu Y. 2012. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution. 66:763–775.

Zachos J., Pagani M., Sloan L., Thomas E., Billups K. 2001. Trends, rhythms, and aberrations in global climate 65 Ma to present. Science. 292:686–694.

Zhong B., Liu L., Penny D. 2014. The multispecies coalescent model and land plant origins: A reply to Springer and Gatesy. Trends Plant Sci. 19:270–272.

Zhong B., Liu L., Yan Z., Penny D. 2013. Origin of land plants using the multispecies coalescent model. Trends Plant Sci. 18:492–495.

APPENDICES

(left intentionally blank)

Appendix 2.2. Species tree estimates of STAR, STEAC, MP-EST, and ASTRAL among (a) Malaconotoidea and (b) Orioloidea. Numbers below branches indicate bootstrap support (BS) values from 500 multi-locus bootstrap replicates. Nodes with < 70% BS are collapsed whereas nodes with no labels indicate 100% BS.

Appendix 3.1. Phylogenetic placement of Trogoniformes. Estimate of phylogenetic relationships of trogons and closely related orders based on concatenated and coalescent species tree approaches. Nodes with dots correspond to 100% Bayesian posterior probability and bootstrap support from all analysis methods. Numbers next to nodes indicate support from Bayesian, ML, STAR, STEAC, MP-EST, and ASTRAL analyses, respectively. Chronogram based on divergence time estimation with MCMCTree.
