ComplimentaryContributorCopy - ATSPACEacacia.atspace.eu/papers/Networks3.pdf2. Phylogenetics 2.1....

Complimentary Contributor Copy


SYSTEMS BIOLOGY - THEORY, TECHNIQUES AND APPLICATION

NETWORK BIOLOGY

THEORIES, METHODS AND APPLICATIONS

No part of this digital document may be reproduced, stored in a retrieval system or transmitted in any form or by any means. The publisher has taken reasonable care in the preparation of this digital document, but makes no expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of information contained herein. This digital document is sold with the clear understanding that the publisher is not engaged in rendering legal, medical or any other professional services.


SYSTEMS BIOLOGY - THEORY, TECHNIQUES AND APPLICATION

Additional books in this series can be found on Nova's website under the Series tab.

Additional e-books in this series can be found on Nova's website under the e-book tab.


WENJUN ZHANG EDITOR

New York


Copyright © 2013 by Nova Science Publishers, Inc. All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical photocopying, recording or otherwise without the written permission of the Publisher. For permission to use material from this book please contact us: Telephone 631-231-7269; Fax 631-231-8175 Web Site: http://www.novapublishers.com

NOTICE TO THE READER The Publisher has taken reasonable care in the preparation of this book, but makes no expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of information contained in this book. The Publisher shall not be liable for any special, consequential, or exemplary damages resulting, in whole or in part, from the readers' use of, or reliance upon, this material. Any parts of this book based on government reports are so indicated and copyright is claimed for those parts to the extent applicable to compilations of such works. Independent verification should be sought for any data, advice or recommendations contained in this book. In addition, no responsibility is assumed by the publisher for any injury and/or damage to persons or property arising from any methods, products, instructions, ideas or otherwise contained in this publication. This publication is designed to provide accurate and authoritative information with regard to the subject matter covered herein. It is sold with the clear understanding that the Publisher is not engaged in rendering legal or any other professional services. If legal or any other expert assistance is required, the services of a competent person should be sought. FROM A DECLARATION OF PARTICIPANTS JOINTLY ADOPTED BY A COMMITTEE OF THE AMERICAN BAR ASSOCIATION AND A COMMITTEE OF PUBLISHERS. Additional color graphics may be available in the e-book version of this book. Library of Congress Cataloging-in-Publication Data

Published by Nova Science Publishers, Inc., New York


Contents

Preface vii

Chapter I Analytical Frameworks of Social Network Analyses 1 Swarna Weerasinghe

Chapter II Phylogenetic Networks Are Fundamentally Different from Other Kinds of Biological Networks 23 David A. Morrison

Chapter III Construction of the Statistic Network from Field Sampling 69 WenJun Zhang

Chapter IV Systems Biology and Environmental Exposures 81 Julia E. Rager and Rebecca C. Fry

Chapter V A Method for Creating a Real Network with Expected Mean and Variance of Degree Distribution 133 WenJun Zhang and GuangHua Liu

Chapter VI Testing a Tree Productivity: Climate Model with Dendrochronological Data 141 Yueh-Hsin Lo, Juan A. Blanco, Brad Seely, J. P. (Hamish) Kimmins and Clive Welham

Chapter VII Modelling Stochasticity in Multi-stable and Oscillatory Biological Networks Far from Equilibrium 163 Thusangi Wannige, Don Kulasiri and Sandhya Samarasinghe

Index 193


In: Network Biology ISBN: 978-1-62618-942-3 Editor: WenJun Zhang © 2013 Nova Science Publishers, Inc.

Chapter II

Phylogenetic Networks Are Fundamentally Different from Other Kinds of Biological Networks

David A. Morrison* Department of Biomedical Sciences and Veterinary Public Health,

Swedish University of Agricultural Sciences, Uppsala, Sweden

Abstract

Complex networks are found in all parts of biology, but there are at least two distinct types of biological networks. In the most common type, the nodes and edges are empirically observed, and the network analysis involves summarizing the characteristics of the network. In the second type, only the leaf nodes are observed, and the internal nodes and all of the edges must be inferred from information available about the leaf nodes. Perhaps the most widespread of this inferred type of network is the phylogenetic network, which illustrates the genealogical history and the connection of all life. Evolution involves a series of unobservable historical events, each of which is unique, and we can neither make direct observations of them nor perform experiments to investigate them. This makes a phylogenetic study one of the hardest forms of data analysis known, as there is no mathematical algorithm for discovering unique historical accidents. This chapter summarizes the essential differences of this network type and discusses the consequences of these differences.

Due to the complexity of evolutionary history, two types of phylogenetic networks have been developed, which have been actively used in parallel by biologists for 150 years: (1) rooted evolutionary networks, in which the internal nodes represent ancestors of the leaf nodes, and the directed edges represent historical pathways of transfer of genetic information between ancestors and their descendants; and (2) unrooted data-display networks, in which the internal nodes do not represent ancestors, and the undirected edges represent affinity (e.g. similarity) relationships among the leaf nodes. The latter type of network is the most commonly encountered in phylogenetics, because there is a wide range of available mathematical techniques that work well. They have been put to a number of uses by phylogeneticists, including exploratory data analysis, displaying similarity patterns, displaying data conflicts, summarizing analysis results, and testing phylogenetic hypotheses; and I illustrate each of these with an empirical example. There are, as yet, few mathematical techniques available for evolutionary networks, and recent focus has therefore been on the development of practical and effective methods. There is, however, a wide range of methodological questions that need to be answered before this can happen; and I raise a number of these here, along with a preliminary discussion of them. There are also issues related to the realism of the common mathematical constraints, the evolutionary units in a network, and the concept of a most recent common ancestor.

* [email protected].

Keywords: Phylogenetics; Evolution; Genealogical relationships; Phylogenetic networks; Evolutionary networks; Exploratory data analysis

1. Introduction

The Online Etymology Dictionary (2010) has this to say about the history of the word network: "net-like arrangement of threads, wires, etc." 1560, from net (n.) + work (n.). Extended sense of "any complex, interlocking system" is from 1839 (originally in reference to transport by rivers, canals, and railways). Meaning "broadcasting system of multiple transmitters" is from 1914; sense of "interconnected group of people" is from 1947.

I guess that biological networks fit into the second category, as referring to complex, interlocking systems. However, in this chapter I wish to point out that biological networks themselves are of two quite different types, both in the way that they are constructed and the way they are interpreted, and that this has quite profound consequences for both the biology and the mathematics of networks.

There are many different complex networks embedded within biological systems, with nodes (representing the biological objects) connected by edges (representing some form of interaction). The nodes can represent units at all levels of the biological hierarchy, from elements, through organic and inorganic compounds, to tissues, organs, individuals, populations, species, communities and ecosystems. The edges (or arcs, if they have a specific direction) represent all sorts of interactions between the nodes, including transcriptional control and other biochemical processes, energy and nutrient flow, behavioral interactions, and genetic or genealogical relationships.

The distinction I wish to make among these different types of networks is whether the network is observed or inferred. An observed network is one in which the connections between the nodes are recorded as part of the empirical data (such as food webs or biochemical interactions), while an inferred network is one in which the connections are not (or cannot be) observed directly, and must instead be deduced from the data available. This distinction is often ignored by practitioners, and indeed the latter group is often ignored entirely (e.g. Proulx et al., 2005), but there is at least one area of biology in which such networks are widespread.


That area is evolutionary biology, and in particular the discipline of phylogenetics. Phylogenetics is the study of the historical (evolutionary) relationships among organisms, whether these are populations, species, or larger taxonomic groups. As such, it involves reconstructing the genealogical relationships of the organisms, which are expressed as a network called a phylogeny. Since these relationships are historical, often arising millions of years in the past, they cannot be directly observed by studying contemporary organisms. Nevertheless those organisms contain information about their historical relationships, stored in their genes, and it is this information that is used to infer the relationships. Genetic information is actually about how an organism functions, not about its history. However, the functions arose in particular ways and at particular times in the past, and it is the pattern of sharing of these functions among organisms that can be used to reveal the past. Basically, if two organisms share part of their genome then it is likely to be because they inherited that piece of genome from a shared common ancestor, and so we can infer the existence of that ancestor and its relationships from the patterns of sharing.

There are other parts of biology that involve inferring networks rather than observing them, but they are not covered in this chapter. For example, gene regulatory networks are sometimes inferred from studies of genome data rather than from genetics experiments (e.g. Margolin et al., 2006; Sarder et al., 2010).

2. Phylogenetics

2.1. Phylogenetic Analysis

The study of evolutionary processes has often been considered to be unscientific because it deals with historically unique events. Hypotheses concerning these events are thus not universal (in either space or time) and they are therefore considered to be untestable in the contemporary world. The development of phylogenetic analysis has consequently been based largely on the erection of what have been called "evolutionary scenarios" describing the presumed genealogical history of the organisms under study. The number of such scenarios that may be created is, of course, limited solely by the imagination of the researcher, and none of the scenarios are likely to be open to falsification. Recent developments can thus be seen as attempting to base phylogenetic analysis on a more objective footing, where the phylogenetic hypotheses are explicitly stated, along with the evidence supporting (and contradicting) them, and are then subjected to quantitative testing.

Reconstructing a tree-like phylogenetic history is conceptually straightforward, although it took a long time for someone (Hennig, 1966) to explain the most appropriate approach. The objective is to infer the ancestors of the contemporary organisms, and the ancestors of those ancestors, etc., all the way back to the most recent common ancestor (MRCA) of the group of organisms being studied. Ancestors can be inferred because the organisms share unique characteristics. That is, they have features that they hold in common and that are not possessed by any other organisms. The simplest explanation for this observation is that the features are shared because they were inherited from an ancestor. The ancestor acquired a set of heritable (i.e. genetically controlled) characteristics, and passed those characteristics on to its offspring. We observe the offspring, note their shared characteristics, and thus infer the existence of the unobserved ancestor(s). If we collect a number of such observations, what we often find is that they form a set of nested groupings of the organisms. This can be represented as a network, with the external (leaf) nodes representing the contemporary organisms, the internal nodes representing the ancestors, and the edges representing the lines of descent.
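The idea of nested groupings can be made concrete with a small sketch. The function name and the example organisms below are illustrative assumptions, not taken from the chapter; the check simply asks whether every pair of character-defined groups is either disjoint or has one group contained in the other, which is the condition under which the groups fit on a single tree.

```python
from itertools import combinations


def groups_are_nested(groups):
    """Return True if every pair of groups is disjoint or one contains the other."""
    sets = [frozenset(g) for g in groups]
    for a, b in combinations(sets, 2):
        overlap = a & b
        if overlap and not (a <= b or b <= a):
            # The two groups partly overlap, so no single tree can show both.
            return False
    return True


# Hypothetical groupings, each defined by one shared characteristic.
nested = [
    {"human", "chimp"},
    {"human", "chimp", "gorilla"},
    {"human", "chimp", "gorilla", "gibbon"},
]
conflicting = [{"human", "chimp"}, {"chimp", "gorilla"}]

print(groups_are_nested(nested))
print(groups_are_nested(conflicting))
```

Groupings that fail this check are the kind of conflicting signal that a tree cannot display but a network can.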

Phylogenetic analysis thus attempts to group organisms on the basis of their common ancestry. Many different mathematical methods have been developed, based on different criteria: minimum distance, maximum parsimony, maximum likelihood, and Bayesian probability (Felsenstein, 2004). None of these has any strictly biological motivation, but they each have a strong philosophical and/or mathematical basis. Minimum distance methods try to match the pathlengths of the network to the observed genetic distances between the organisms, as measured in one of many different ways. Maximum parsimony is an attempt to minimize the number of unfounded assumptions in the analysis. This usually involves minimizing a count of something (e.g. finding the minimum number of events needed to produce a specified result). Maximum likelihood is a method used for fitting a statistical model to data, maximizing the probability of obtaining the data given the specified model. In our case, the model will be some simplified analogy to evolutionary processes (e.g. nucleotide substitutions). Bayesian probability takes the data and the statistical model and combines them to update a specified prior probability. The posterior probability produced is proportional to the likelihood of the observed data multiplied by the prior probability of the hypothesis (here, the phylogeny and its model parameters).
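To illustrate what "minimizing a count of something" means for parsimony, here is a minimal sketch that counts the fewest state changes needed to explain one character on a given rooted binary tree. It uses Fitch's counting procedure, which is my choice for the illustration and is not named in the chapter; the tree encoding (nested 2-tuples), the function name, and the example data are all assumptions.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree:   nested 2-tuples whose leaves are name strings, e.g. (("A", "B"), "C")
    states: dict mapping each leaf name to its observed character state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        # No state fits both subtrees, so at least one change is required here.
        changes += 1
        return left | right

    visit(tree)
    return changes


# Hypothetical nucleotide at one site in four organisms.
site = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(fitch_changes(tree_1, site))  # 1
print(fitch_changes(tree_2, site))  # 2
```

Under the parsimony criterion the first tree would be preferred for this character, because it requires fewer changes. A real analysis sums such counts over many characters and searches over many candidate trees.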

Phylogenetics is founded on a widely held view of the mode of the evolutionary process: species are lineages undergoing divergent evolution with modification of their intrinsic attributes, the attributes being transformed through time from ancestral to derived states. The attributes can be phenotypic or genotypic. That is, biology recognizes a distinction between genotype, which is the collection of genes and other associated material in an organism, and phenotype, which is the product of interactions between genes and also between genes and their environment. The DNA, RNA and proteins in an organism are usually taken to represent the genotype, whereas the cells, tissues and organs constitute the phenotype of an individual. To quote Richard Lewontin (2011): "the actual correspondence between genotype and phenotype is a many–many relation in which any given genotype corresponds to many different phenotypes and there are different genotypes corresponding to a given phenotype." Up until the 1990s phenotypes were the basic source of data for phylogenetics, but since then biologists have switched wholesale to genotypes for constructing phylogenies.

An important part of the use of genotypes for constructing phylogenies is the distinction between vertical and horizontal flow of genetic information. The vertical components of descent are those from parent to offspring, while all other components are referred to as horizontal. The horizontal components arise due to phenomena such as hybridization and introgression, recombination, lateral gene transfer, and genome fusion. If there is no horizontal gene flow, then a phylogenetic history would be perfectly represented as a tree. However, horizontal patterns need to be represented by reticulations in a network. Mathematically, trees form a subset of networks. Therefore, we do not need to choose between the two as the most appropriate model for phylogenetics: we can always choose a network model, and the resulting network will be more or less tree-like depending on the data.
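The statement that trees form a subset of networks can be sketched in a few lines. In this illustration (the data structure and names are assumptions of mine, not the chapter's), a rooted network is stored as a mapping from each node to its list of parents; a reticulation is any node with more than one parent, and a tree is simply a network with no reticulations.

```python
def reticulation_nodes(parents):
    """Return the nodes that have more than one parent, in sorted order.

    parents: dict mapping each node to a list of its parent nodes
             (an empty list for the root).
    """
    return sorted(node for node, ps in parents.items() if len(ps) > 1)


def is_tree(parents):
    """A rooted network with no reticulation nodes is a tree."""
    return not reticulation_nodes(parents)


# Hypothetical histories: purely vertical descent, and one with a hybrid "H".
vertical_only = {"root": [], "x": ["root"], "A": ["x"], "B": ["x"], "C": ["root"]}
with_hybrid = {
    "root": [], "x": ["root"], "y": ["root"],
    "A": ["x"], "H": ["x", "y"], "C": ["y"],
}

print(is_tree(vertical_only), reticulation_nodes(vertical_only))
print(is_tree(with_hybrid), reticulation_nodes(with_hybrid))
```

The number of reticulation nodes gives one rough indication of how far a network departs from being tree-like.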


In this regard, it is important to note that populations are inherently more likely to form networks than trees. The inter-breeding among organisms within closely related populations means that there will be a lot of horizontal flow of genetic material, in addition to the vertical flow from parent to offspring. For species, on the other hand, even between closely related species there are barriers to gene flow, which are effective to a greater or lesser extent, and so trees may be more likely than networks. For this reason, many of the methods for constructing phylogenetic networks have been developed specifically with populations in mind (Morrison 2005, 2011).

Traditionally, phylogenetic trees are interpreted in terms of what are called monophyletic groups (also called clades), each of which consists of an ancestor and all of its descendants. These are natural groups in terms of their evolutionary history, whereas other types of groups (e.g. paraphyletic, polyphyletic) are not. So, a phylogenetic tree consists of a set of nested clades, which are the groups that are represented and given names in formal taxonomic schemes. A network requires a more complex interpretation.
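The definition of a clade translates directly into a test on a rooted tree: a set of leaves is monophyletic if some node has exactly that set as its descendant leaves. The following is a minimal sketch under that reading; the child-list encoding, the function names, and the small example tree are illustrative assumptions.

```python
def leaves_below(children, node):
    """Return the set of leaf names descended from node (or node itself if a leaf)."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    result = set()
    for kid in kids:
        result |= leaves_below(children, kid)
    return result


def is_monophyletic(children, root, group):
    """True if some node's descendant leaves are exactly the given group."""
    target = set(group)
    stack = [root]
    while stack:
        node = stack.pop()
        if leaves_below(children, node) == target:
            return True
        stack.extend(children.get(node, []))
    return False


# Illustrative tree: (mammal, (lizard, (crocodile, bird)))
children = {
    "root": ["mammal", "sauropsid"],
    "sauropsid": ["lizard", "archosaur"],
    "archosaur": ["crocodile", "bird"],
}

print(is_monophyletic(children, "root", {"crocodile", "bird"}))
print(is_monophyletic(children, "root", {"lizard", "crocodile"}))
```

In this example the second group omits one descendant (the bird) of the common ancestor it shares with the crocodile, so it is not a clade on this tree.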

For phylogenetic trees, there is thus a rationale for treating a tree diagram as a representation of evolutionary history. For example, in a study of a set of gene sequences, first we produce a mathematical summary of the data based on a quantitative model. We then infer that this summary represents the gene history, based on the Hennigian logic that the patterns are formed from a nested series of shared derived character states (this is a logical inference about the biology being represented by the mathematical summary). We then infer that this gene history represents the organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (i.e. a pragmatic inference). However, no such rationale exists for most of the current network methods. The network still represents a mathematical summary of the data, of course, but there is no logic for direct inference about biology. It is almost certain that the mathematical summary represents real biological patterns, but there is no necessity that those patterns are evolutionary ones.

Fossils can be used to help infer the historical network but they cannot be used to observe it. There is nothing inscribed on the fossils themselves that tells us their relationships; the relationships have to be inferred, just as for contemporary organisms. A classic example concerns the human-like fossil fragments (a finger bone and some teeth) from the Denisova Cave, in Siberia. We know almost nothing about their phenotype, because we do not have an entire skeleton, but we know a lot about their genotype, because the fossils have yielded nucleotide sequences, allowing us to study their genome. As discussed below (section 4.1), this genome is strikingly different from those of both Neanderthals and modern humans.

2.2. Purposes of Phylogenetic Networks

To date, published phylogenetic networks have been of two distinct types: (i) non-directional (unrooted) networks displaying observed affinity relationships among organisms, and (ii) directed (rooted) networks displaying inferred genealogical relationships among organisms. These two "relationship" threads can be seen running through the history of biology, so that there have been two types of networks used by biologists in parallel. It is worth considering these two historical threads, because it makes clear there are actually two distinct things that biologists are trying to achieve when they use a phylogenetic network.


2.2.1. Affinity and Unrooted Phylogenetic Networks

Carl von Linné's Aphorism 77 (Linnaeus, 1751) specified that: "Plantae omnes utrinque affinitatem monstrant, uti Territorium in Mappa geographica" [All plants show affinities on either side, like territories in a geographical map]; and such a map was published in Linné (1792). This type of map can also be represented as a network, as previously suggested by Vitaliano Donati (1750), although no diagram was drawn at that time.

There is little point in trying to formally define what biologists have meant by "affinity", since it seems to vary greatly. However, it has usually referred to some sort of attraction or connection between organisms (and their characteristics), sometimes described as similar to the laws regulating the combinations of elements that form compounds in chemistry. In modern terminology, affinity originally included evaluation of both homology and analogy, and therefore affinity was a more general organismal relationship than was evolutionary history, and it does not necessarily correlate well with it.

After Linné, taxonomists consistently invoked the web or network, in which affinity relationships were perceived as horizontal, multiple and undirected. Character analysis seemed to show multiple parallelisms between intuitively recognized taxonomic groups. These groups seemed to be polythetic, that is, definable only by an inconsistent set of overlapping morphological characters. This was particularly so for studies of plants, but was also prevalent in studies of algae and protozoa.

Affinity was usually imagined as being multi-faceted, so that any diagram of affinities showed multiple connections among the organisms: relationships between groups were very definitely reticulating, and it was considered impossible to form a linear series, because emphasizing a relationship in one direction necessarily entailed simultaneously breaking relationships in another (Stevens, 1994). Hence, the diagrams were networks rather than trees (Stevens, 1994; Ragan, 2009; Tassy, 2011; Pietsch, 2012). These network diagrams lacked any directionality, as affinities between taxonomic groups were considered to be symmetrical (unlike genealogical relationships). Notably, when interpreting their diagrams the authors either failed to mention genealogy, or they implicitly or explicitly excluded it.

This tradition continues to this day. The pheneticists, starting in the late 1950s, quantified affinity objectively as overall similarity. They initially used a tree as their visual metaphor, in the form of an unrooted phenogram, but these days it is more likely to be an unrooted network. Almost all of the published phylogenetic networks of the past 20 years have been unrooted, and therefore logically cannot represent genealogical relationships (see below).

The importance that I see in these historical networks is that they match closely the modern idea of unrooted data-display networks. They are not a form of exploratory data analysis, because they were intended to express the author's ideas about biological relationships, rather than to reveal previously unquantified patterns in the data; but they certainly are not rooted evolutionary networks.

2.2.2. Genealogy and Rooted Phylogenetic Networks

The first diagram representing what we would now call phylogenetic history was the "genealogical tree" of dog breeds produced by Georges-Louis Leclerc, comte de Buffon (1755), and the second was the "genealogical tree" of strawberry cultivars produced by Antoine Duchesne (1766). Both of these showed reticulating relationships, and were thus networks rather than trees.


This lead was then mostly ignored, except by Jean-Baptiste de Monet, chevalier de Lamarck (1809), and almost all of the relationship diagrams drawn afterwards were either (i) networks showing non-genealogical affinity, or (ii) trees showing non-genealogical affinity (e.g. those of Agassiz, Augier, Bronn, Eichwald, Hitchcock, Seringe; see Pietsch, 2012).

This lasted until Charles Darwin (1859) wrote: "The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth". Darwin thus introduced the 'Tree of Life' as a simile, which has since become very popular as a metaphor for phylogenetic relationships, especially among the general public. However, this simile was quite independent of Darwin's diagrams, because he always referred to his theory as "descent with modification" (Penny, 2011). Darwin referred to the Tree of Life at the end of the chapter containing his bush-like phylogenetic figure (from which the quote above is taken), and later in the book he referred to relationships as being "somewhat like the branches of a tree", but neither of these was a direct reference to any diagram.

After Darwin, trees showing genealogy were immediately produced by a number of people, including Gaudry, Haeckel, Hilgendorf, and Mivart (see Stevens, 1994; Ragan, 2009; Bigoni and Barsanti, 2011; Gontier, 2011; Tassy, 2011; Pietsch, 2012). Networks showing genealogy at the species level or above were notably absent, until the work of Ferdinand Pax (1888), who illustrated hybridization, followed by Constantin Mereschkowsky (1910), who studied genome fusion.

This tradition continues to this day, with most genealogical diagrams being trees rather than networks. In recent times, it was Sneath (1975) and Bremer and Wanntorp (1979) who started asking about whether a tree is the best model for displaying the genealogy of, respectively, bacteria (where lateral gene transfer is widespread) and plants (where hybridization is common). Subsequently, during the 1990s molecular biologists started asking the same sort of question, often in the context of lateral gene transfer. Consequently, more and more networks are appearing in the literature that are effectively genealogical trees with reticulations.

The importance that I see in these historical networks is that they match closely the modern idea of rooted evolutionary networks, which thus have a history just as long as that of the affinity (data-display) networks, but were effectively side-tracked by the tree metaphor.

2.3. Types of Phylogenetic Networks

Above, I have recognized two fundamentally different types of phylogenetic network, which I will call (a) the "data-display network" and (b) the "evolutionary network". The data-display network merely displays character-variation patterns, whatever their cause. That is, we don't have to hypothesize particular evolutionary events as the cause(s) of the patterns. The patterns can be displayed directly from the data or by first creating trees and then summarizing them. The evolutionary network, on the other hand, tries to display evolutionary events, where at least one feature of the ancestors is modified in a descendant, so that there is transfer of genetic information.

Formally, these networks can be characterized as: (1) unrooted data-display networks, in which the internal nodes do not represent ancestors, and the undirected edges represent affinity (e.g. similarity) relationships among the leaf nodes; and (2) rooted evolutionary networks, in which the internal nodes represent ancestors of the leaf nodes, and the directed edges represent historical pathways of transfer of genetic information between ancestors and their descendants.

Evolutionary networks must be rooted in order to form a hypothesized evolutionary history. The root indicates the time direction of the genealogy, away from ancestors and towards their descendants, and time goes in only one direction (so that we have Time’s Arrow not Time’s Boomerang). The root gives each network edge a direction, so that it forms a directed network. Data-display networks can be either rooted or unrooted, but they are almost always unrooted, so that the edges are undirected.

The use of a tree / network metaphor in biology provides a quantitative connection with mathematics. Graph theory is a well-developed branch of mathematics, and the tree / network metaphor connects very well with acyclic line graphs. This has allowed phylogeneticists to develop objective and repeatable methods for producing trees from phenotypic and genotypic data, thus linking phylogenetics to mainstream scientific activity.

Huson and Scornavacca (2011) provide this mathematical definition: “a phylogenetic network is any graph used to represent evolutionary relationships (either ‘abstractly’ or ‘explicitly’) between a set of taxa that label some of its nodes (usually the leaves)”; and they provide a taxonomy of the various mathematical methods for producing both rooted and unrooted networks.

This means that a data-display network can be seen as a cyclic undirected line graph, whereas an evolutionary network must be a directed acyclic line graph (DAG). The idea of trees as line graphs is usually credited to Cayley (1857), who called the nodes “knots” and the edges “branches”; biologically, ‘nodes’ is more accurate than ‘knots’, although ‘branches’ is more accurate than ‘edges’.

The relationship between direction and cycles is illustrated in Figure 1, which shows why any evolutionary diagram must involve a directed acyclic graph. A directed cyclic graph cannot represent a realistic evolutionary history, because at one of the nodes in the cycle an inferred ancestor is also its own descendant (or one of the inferred descendants is also its own ancestor).

An undirected cyclic graph can be turned into either a directed cyclic graph or a directed acyclic graph. It is therefore important to note that in an evolutionary analysis, the goal is always to produce a directed acyclic graph.
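The distinction between the two orientations can be checked mechanically. The following is a minimal sketch, not part of the original analysis: the four-node cycle and its two orientations are hypothetical, and the test for directed cycles uses Kahn's topological-sorting algorithm as one convenient choice among several.

```python
from collections import deque


def is_acyclic(nodes, arcs):
    """Return True if the directed graph contains no directed cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in arcs:
        children[parent].append(child)
        indegree[child] += 1
    # Repeatedly remove nodes that have no remaining incoming arcs.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Nodes on a directed cycle can never be removed.
    return removed == len(nodes)


# Hypothetical undirected cycle a-b-c-d-a, oriented in two different ways.
nodes = ["a", "b", "c", "d"]
cyclic_orientation = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
acyclic_orientation = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c")]

print(is_acyclic(nodes, cyclic_orientation))   # every node is its own ancestor
print(is_acyclic(nodes, acyclic_orientation))  # "a" is the root, "c" a reticulation
```

In the acyclic orientation, node "c" receives arcs from both "b" and "d", which is how a reticulation appears in a rooted network.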

As an aside, in biology the word ‘network’ has also been used to refer to an unrooted tree. This usage arose in the early days of cladistics, from the idea that an unrooted tree represents a set of rooted trees (one potential root per edge in the tree). This usage is usually credited to Farris (1970), who wrote:

“Trees are directed entities in which the root is presumed to represent a point chronologically prior to any descendent point ... If the root is not specified, we have an “undirected tree,” or a network. A network with a certain set of nodes may correspond to a wide class of trees with the same nodes, each tree differing from the others in the class only in the position of its root.”
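The idea that an unrooted tree stands for a set of rooted trees, with one potential root per edge, can be made concrete. This is an illustrative sketch only: the four-leaf tree, the node names and the label "root" are assumptions made for the example, and the function assumes that no existing node is already called "root".

```python
def rootings(edges):
    """Yield one rooted version of an unrooted tree for each of its edges.

    Each rooting places a new node, "root", in the middle of one edge and
    directs every edge away from it.
    """
    for u, v in edges:
        # Adjacency of the tree with edge (u, v) subdivided by the new root.
        adjacency = {"root": [u, v]}
        for a, b in edges:
            if (a, b) == (u, v):
                adjacency.setdefault(a, []).append("root")
                adjacency.setdefault(b, []).append("root")
            else:
                adjacency.setdefault(a, []).append(b)
                adjacency.setdefault(b, []).append(a)
        # A traversal from the root gives each edge its direction.
        arcs, stack, seen = [], ["root"], {"root"}
        while stack:
            parent = stack.pop()
            for child in adjacency[parent]:
                if child not in seen:
                    seen.add(child)
                    arcs.append((parent, child))
                    stack.append(child)
        yield arcs


# Hypothetical unrooted tree with leaves A-D and internal nodes x and y.
quartet = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
trees = list(rootings(quartet))
print(len(trees))  # one rooted tree per edge of the unrooted tree
```

All of the rooted trees produced this way share the same undirected structure and differ only in the position of the root.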


Figure 1. An explanation of the relationship between direction and cycles in a line graph. In all of the graphs there are four (unlabelled) leaves, but the number of internal nodes and edges varies depending on whether there are cycles (4 nodes, 4 edges) or not (2 nodes, 1 edge; or 4 nodes, 4 edges).

2.4. The Many Names of Phylogenetic Networks

So, there are basically two types of phylogenetic network, although there can be gradations between the two extremes. Unfortunately, these go by many different names, which inevitably leads to some confusion on the part of users. To help deal with this, some of the names are listed here, along with an explanation of what the terminology is intended to convey. The terms are arranged in pairs, indicating the two different types of network. The ‘network’ part of the name is assumed in each case unless indicated otherwise.

   Type 1          Type 2
1. Affinity        Genealogical
2. Data-display    Reticulogeny
3. Implicit        Explicit
4. Undirected      Directed
5. Unrooted        Rooted
6. Splits graph    Augmented tree, Reconciliation, Recombination, Hybridization

1. This reflects the biologists’ perspective, describing the different purposes for which networks have been used. Affinity networks display overall similarity relationships among the organisms, whereas genealogical networks display only historical relationships of ancestry.

2. This reflects the assumptions used for the data analysis. Data-display networks are interpreted solely as visualizations of the patterns of variation in the data, while the reticulogenies are based on some inferences about those data patterns (such as their possible cause). Some network types, such as Reduced Median Networks and Median-joining Networks, are based on algorithms that make partial inferences from the data. Data-display networks have mainly been used as affinity networks and reticulogenies as genealogical networks.

3. This reflects the computational perspective, describing the goal of the algorithm used to analyze the data. Explicit networks are intended to provide a phylogeny in the traditional sense used for phylogenetic trees, displaying both vertical and horizontal patterns of descent with modification. Implicit networks provide information that can be used to explore phylogenetic patterns in a dataset without any direct interpretation as necessarily showing a phylogeny. Implicit networks have mainly been used as data-display networks and explicit networks as reticulogenies.

4. This reflects the mathematical interpretation of networks as line graphs. In a directed graph the edges have a direction, usually indicated by an arrow, in which case the edges are more correctly referred to as arcs. Undirected graphs do not have directed edges. Explicit networks are directed while implicit ones are usually undirected.

5. This reflects the tree-thinking view of phylogenetic networks, in which directed graphs are called rooted trees and undirected graphs are called unrooted trees. Rooted networks are usually treated as explicit networks and are thus used as genealogical networks, although there is no reason why they could not be used simply as a convenient form of data display.

6. This reflects the modelling approach to network analysis based on mathematical structures. Splits graphs model phylogenetic patterns as bipartitions of the data, and build the network from those partitions (the result will be a tree if there are no incompatible bipartitions). Augmented trees are essentially trees with a few added reticulation edges / arcs, while reconciliation networks are based on reconciling the differences between trees. Recombination networks are based on analyzing data patterns in terms of a simple model of genetic cross-over, while Hybridization networks model the data in terms of patterns in conflicting trees.
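The remark in item 6 that a splits graph reduces to a tree when no bipartitions are incompatible rests on a simple pairwise test: two splits A|B and C|D are compatible if at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. A minimal sketch of that test follows; the four taxa and the three splits are hypothetical.

```python
def compatible(split1, split2):
    """Return True if two bipartitions of the same taxa can sit on one tree."""
    a, b = split1
    c, d = split2
    # Compatible when at least one of the four pairwise intersections is empty.
    return any(not (x & y) for x in (a, b) for y in (c, d))


def is_treelike(splits):
    """Return True if every pair of splits is compatible."""
    return all(
        compatible(s, t)
        for i, s in enumerate(splits)
        for t in splits[i + 1:]
    )


# Hypothetical splits of four taxa.
ab_cd = ({"A", "B"}, {"C", "D"})
ac_bd = ({"A", "C"}, {"B", "D"})
a_bcd = ({"A"}, {"B", "C", "D"})

print(is_treelike([ab_cd, a_bcd]))         # these two fit on a single tree
print(is_treelike([ab_cd, ac_bd, a_bcd]))  # AB|CD and AC|BD conflict
```

When a conflict such as AB|CD versus AC|BD is present, a splits graph represents both splits with sets of parallel edges rather than discarding one of them.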

So, there are reasons why so many different terms have appeared in the literature. Unfortunately, they are not always used consistently with the meaning that was originally intended.

3. Phylogenetic Versus Other Biological Networks

I am arguing in this chapter that inferred networks are fundamentally different from observed networks. In the latter, we observe the nodes and edges within the network, which we can then display and mathematically summarize. Constructing the network is thus an empirical process, and the network analysis consists of studying the characteristics of that network by summarizing its important features. For an inferred network, on the other hand, we observe only some of the nodes, and we need to deduce the existence of the other nodes and all of the edges of the network. So, the network analysis consists of trying to reconstruct the network from the indirect information available. The network itself is the summary.

I will argue in this section that these differences mean that inferred phylogenetic networks do not match other types of biological network.

Almost all types of biological networks are built by starting with a labelled set of nodes and then directly linking those nodes with edges; phylogenetic networks seem to be the only major class of biological networks in which some or many extra nodes are inferred by the network-building process, along with all of the edges. That is, almost all other networks are built empirically, by using a collection of observed nodes and connecting them via observed edges (‘observed’ indicating that there are experimental data). Phylogenetic networks, on the other hand, attempt to reconstruct unobserved (and unobservable) historical relationships using data, a model and a mathematical optimization procedure.

3.1. Network Characteristics

First, here is a list of some of the important characteristics of phylogenetic networks if they are to represent evolutionary history:

1. fully connected
2. directed
3. single root
4. each edge (arc) has a single direction
5. no directed cycles
6. in species networks the internal nodes are usually unlabelled, although in population networks some (or many) of them may be labelled.
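Two of the characteristics listed above, the single root and full connection, can be tested together by asking whether exactly one node lacks an incoming arc and whether every other node can be reached from it. The sketch below is illustrative only: both example graphs are hypothetical, and the absence of directed cycles is a separate requirement that is deliberately not checked here.

```python
def find_roots(nodes, arcs):
    """Return the nodes that have no incoming arc."""
    has_parent = {child for _, child in arcs}
    return [n for n in nodes if n not in has_parent]


def reachable_from(start, arcs):
    """Return the set of nodes reachable from start along directed arcs."""
    children = {}
    for parent, child in arcs:
        children.setdefault(parent, []).append(child)
    seen, stack = {start}, [start]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def has_single_root(nodes, arcs):
    """True if there is exactly one root and it is an ancestor of every node."""
    roots = find_roots(nodes, arcs)
    return len(roots) == 1 and reachable_from(roots[0], arcs) == set(nodes)


# Hypothetical rooted network with one reticulation node, "h".
net_nodes = ["root", "u", "v", "h", "A", "B", "C"]
net_arcs = [("root", "u"), ("root", "v"), ("u", "A"), ("u", "h"),
            ("v", "h"), ("v", "C"), ("h", "B")]

# Hypothetical food web with two sources, using 'is eaten by' arcs.
web_nodes = ["grass", "algae", "rabbit", "fish", "fox"]
web_arcs = [("grass", "rabbit"), ("algae", "fish"),
            ("rabbit", "fox"), ("fish", "fox")]

print(has_single_root(net_nodes, net_arcs))
print(has_single_root(web_nodes, web_arcs))
```

The food web fails the test because it has two sources, which is the situation described for food pyramids later in this section.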

If we take these individually, we can see the difference from other biological networks.

Most other biological networks can be disconnected, at least potentially, because the definition of the nodes to be included in the network is often independent of the network itself, so that there is no necessary connection between nodes. For example, the species within a local community may not all be connected to each other with respect to the characteristic being studied (e.g. genetic relatedness). Indeed, finding this out may be a primary goal of any particular study. Similarly, molecular compounds usually form at least semi-independent sets of pathways, so that the study of any one organ can produce disconnected networks. With evolutionary history, on the other hand, all conceivable nodes are connected to each other by definition (unless there are multiple origins of life in the Universe).

In order to represent history, which has a single time direction, a phylogenetic network must have directed edges (arcs) to represent the time course. Many other biological networks have no explicit direction, even if there is an implied one. For example, in protein–protein interaction networks the edges represent the presence of physical interactions between proteins (with either no implied direction or a symmetric interaction), and in genetic-relationship networks the edges simply represent the degree of genetic relatedness of individuals (e.g. the link between siblings has no explicit direction, although there is an implied directional link to their parents).

In a phylogeny there is usually a single root, because phylogeneticists try to work on monophyletic groups (clades); and if they really do want to study the so-called Tree of Life then there is assumed to be a single origin of life in the Universe. Once again, for other networks the definition of the included nodes is often independent of the network or its shape, so that a single root is not necessary. For example, networks of regulatory interactions among genes are often represented with the nodes around the perimeter of a circle, with the edges being chords of that circle. Furthermore, in food webs the arcs represent who eats whom, and these networks are called ‘webs’ for a good reason: there is usually no obvious root position. Indeed, the usual representation of a food pyramid starts with multiple sources (at the bottom) and a single sink (at the top), with the arc directions indicating ‘is eaten by’.

Also, many biological networks have directed cycles. For example, the feedback loops in biochemical pathways are usually important (as also are feedforward loops). Indeed, the discovery of feedback has been considered to be a major contribution to our understanding of why biological systems are different from non-biological ones. The recycling of nutrients in ecosystem nutrient pathways is another prominent example, although no feedback is involved in this case. Once again, the recognition that the Earth is effectively a closed system with finite resources that must be reused is considered to be a major contribution by biology.

Moving on, many networks have bidirectional arcs, indicating direct interactions between nodes. Indeed, many behavioral systems show this feature, including intra- and inter-competition networks in ecology, as well as sexual-contact social networks (which, incidentally, have two distinct types of nodes). Immunological networks often have this characteristic, as well, with the arcs pointing in one direction or the other at different time points during a cell’s immunological reaction to a stimulus. (These networks also can have nodes with arcs that point directly back to themselves, indicating that a molecule regulates itself.) Host-parasite systems can also be considered to have bidirectional arcs, although in this case the paired arcs represent different processes (the effect of the parasite on the host and the host on the parasite operate via different mechanisms). In this case, two separate arcs are usually used, rather than a single bidirectional one, thus representing a directed cycle.

Predator–prey systems may, on occasion, match phylogenetic networks. If we isolate the predator–prey relationships from all of the others in a food web then a single tree-like structure sometimes emerges, with a single ‘key’ predator at the root and a series of non-predators at the leaves. However, more often there are several ‘root’ predators within any one community predator–prey network. Similarly, disease-transmission networks can be tree-like if there is a single identifiable origin to an epidemic, for example, but not otherwise. Note that the internal nodes are all labelled in both of these types of network, so that they will match a population network rather than a species network.

So, phylogenetic networks match other biological networks only partially for each of these characteristics, and thereby form a unique class of network. This suggests that much of the theoretical work being directed towards the study of those networks (‘network science’; e.g. Newman, 2010) may turn out not to be particularly relevant for phylogenetic networks, at least from the biological perspective.


3.2. Network Summary Measures

One major part of the study of biological networks has been the development of descriptive summaries of the network characteristics, using one or more mathematical measurements. This does not necessarily mean that biologists have seen any close relationship between these mathematical measures and biologically relevant quantities. However, the expectation is that they will reveal the overall patterns (and processes) underlying the network complexity, so that we do not get tangled in the fine structure of the details.

We therefore need to consider whether any of these network measures have yet played a role in phylogenetic networks. My conclusion is that most of them have not, because it is the network itself that is the outcome of the network analysis, rather than any mathematical summary of that network.

3.2.1. Properties of Individual Nodes

Node degree: the number of edges incident to a node

For a dichotomous phylogenetic tree this is pre-defined (indegree 1, outdegree 2), and many network models have similar restrictions (e.g. indegree 2, outdegree 1 for reticulation nodes). However, applying the coalescent to a population network suggests that the node with the largest degree is the most probable common ancestor, so this summary measure is potentially of interest here.

Degree distribution: the frequency distribution of the degree for all nodes

Not used so far in phylogenetics, presumably because it would be uninteresting in light of the previous comment about pre-defined degrees.

Structural equivalence: the extent to which nodes are exactly substitutable

If the network represents ancestor–descendant relationships then no two nodes are exactly substitutable.
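The pre-defined degrees mentioned under node degree mean that, in a rooted network, the role of each node can be read directly from its indegree and outdegree. The following sketch shows this on a hypothetical network; the category names and the rule that any non-leaf node with indegree of two or more counts as a reticulation are assumptions made for illustration.

```python
from collections import Counter


def node_degrees(arcs):
    """Return {node: (indegree, outdegree)} for a directed network."""
    indegree, outdegree = Counter(), Counter()
    for parent, child in arcs:
        outdegree[parent] += 1
        indegree[child] += 1
    return {n: (indegree[n], outdegree[n]) for n in set(indegree) | set(outdegree)}


def classify_nodes(arcs):
    """Label each node as root, leaf, reticulation or tree node."""
    kinds = {}
    for node, (n_in, n_out) in node_degrees(arcs).items():
        if n_in == 0:
            kinds[node] = "root"
        elif n_out == 0:
            kinds[node] = "leaf"
        elif n_in >= 2:
            kinds[node] = "reticulation"
        else:
            kinds[node] = "tree node"
    return kinds


# Hypothetical rooted network with one reticulation node, "h".
arcs = [("root", "u"), ("root", "v"), ("u", "A"), ("u", "h"),
        ("v", "h"), ("v", "C"), ("h", "B")]

kinds = classify_nodes(arcs)
# The degree distribution: how many nodes have each (indegree, outdegree) pair.
degree_distribution = Counter(node_degrees(arcs).values())
print(kinds["h"], degree_distribution[(1, 2)])
```

Because the model fixes which degree pairs are allowed, the resulting distribution has only a handful of possible values, which is consistent with the comment above that it is unlikely to be informative.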

3.2.2. Properties Affected by Local Subgraphs of the Network

Clustering coefficient: the degree to which nodes cluster together, measured as the density of triangles in the network (can also be a global measure)

Not used so far in phylogenetics.

Distribution of network motifs: motifs are connectivity patterns that occur more often than expected, usually expressed as a frequency distribution

Not used so far in phylogenetics.

3.2.3. Properties Affected by the Whole Network

Connectivity / cohesion: the minimum number of nodes or edges needing to be removed to disconnect the remaining nodes from each other

By definition, an evolutionary history is fully connected, and if it is a tree then removing 1 edge or node serves to disconnect it. However, for a reticulating network connectivity is an important concept, since it expresses how tree-like the network is. To date, two measures of reticulation have been proposed, the δ-score (Holland et al., 2002) and the Q-residual (Gray et al., 2010), with apparent preference for the δ-score (Wichman et al., 2011).

Closeness: the inverse of the summed shortest pathlengths to all other nodes, often averaged across all nodes

Not used so far in phylogenetics.

Betweenness: the number of inter-node shortest paths on which a node lies, often averaged across all nodes

Not used so far in phylogenetics.

Node density: the number of nodes per unit pathlength

Not used formally, as far as I know, but phylogeneticists have consistently (and perhaps inappropriately) distinguished highly branched (speciose) parts of a tree from unbranched parts, which can be seen as a use of this summary concept.

Centrality: can be measured with respect to degree, closeness or betweenness

Not used so far in phylogenetics. Nodes in a phylogenetic network represent organisms and their ancestors, and so they do not really have a characteristic related to centrality; they are connected to their immediate ancestors and descendants only. However, centrality is a key concept for most biological networks, and so this is an important distinction of phylogenetic networks.

Network diameter: either the average minimum distance between pairs of nodes, or the longest pathlength between any pair of nodes (relative to the number of nodes)

This has sometimes made its appearance as a statistic in the phylogenetic literature. For example, it has been used as an optimality criterion for distance-based tree-building methods. If nothing else, the maximum diameter is used as the mid-point rooting criterion for a phylogenetic tree.

Nestedness: quantifies whether the structure of small assemblages is a proper subset of the structure of large assemblages

A dichotomous tree is fully nested, and so nestedness has had a leading role in phylogenetics. Indeed, nestedness is the Hennigian justification for representing evolutionary history as a tree in the first place. Nestedness can thus be used to measure the tree-likeness of a network.

Fractal structure: quantifies the similarity of network structure at different scales

Not used so far in phylogenetics, although tree-imbalance (inversely related to fractal structure) has been an important summary measurement for phylogenetic trees.

Network resolution: the amount of information contained in the network (i.e. how much of the variation in node and edge behaviour is retained in the network representation)

In phylogenetics, network resolution has this rank order: unrooted < rooted < rooted with variable edgelengths. This concept is thus of interest in phylogenetics, but it is usually not quantified. For example, an unrooted tree or network cannot represent evolutionary history, and so some extra network resolution is implied when there is a root. Also, the use of variable edgelengths is common for rooted trees but not often for rooted networks, implying that resolution is lost in the networks. Variable edgelengths, however, have been used in unrooted networks (e.g. Winkworth et al., 2005; Holland et al., 2006).
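The mid-point rooting criterion mentioned under network diameter can be sketched as follows: find the longest leaf-to-leaf path in an unrooted tree with edge lengths, and place the root half-way along it. This is a minimal illustration rather than the procedure of any particular program; the tree and its edge lengths are hypothetical, and the exhaustive search over leaf pairs is chosen for clarity, not efficiency.

```python
def path_between(adjacency, start, end):
    """Return the unique path between two nodes of a tree as a list of nodes."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            return path
        for neighbour in adjacency[node]:
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))
    return None


def midpoint_root(edge_lengths):
    """Return (diameter, edge, offset) locating the mid-point root.

    edge_lengths maps each undirected edge (u, v) to its length. The root
    lies on the returned edge (u, v), at distance offset from u.
    """
    adjacency, length = {}, {}
    for (u, v), w in edge_lengths.items():
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
        length[(u, v)] = length[(v, u)] = w
    leaves = [n for n, neighbours in adjacency.items() if len(neighbours) == 1]
    # Longest leaf-to-leaf path: the diameter of the tree.
    diameter, longest = max(
        (sum(length[e] for e in zip(p, p[1:])), p)
        for i, a in enumerate(leaves)
        for b in leaves[i + 1:]
        for p in [path_between(adjacency, a, b)]
    )
    # Walk along that path until half of the diameter has been covered.
    travelled = 0.0
    for u, v in zip(longest, longest[1:]):
        if travelled + length[(u, v)] >= diameter / 2:
            return diameter, (u, v), diameter / 2 - travelled
        travelled += length[(u, v)]


# Hypothetical unrooted tree with four leaves and variable edge lengths.
tree = {("A", "x"): 1, ("B", "x"): 2, ("x", "y"): 1, ("C", "y"): 5, ("D", "y"): 1}
print(midpoint_root(tree))  # (8, ('y', 'C'), 1.0)
```

Here the longest path runs from B to C with length 8, so the root falls on the long edge leading to C, one unit away from the internal node y.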

3.2.4. Conclusion

Most of these summary measures have not yet played a significant part in the development of phylogenetics. Instead, phylogeneticists have concentrated on quantifying the fit of their data to phylogenetic trees, using measures such as the consistency index, retention index or permutation tests (for maximum parsimony methods), likelihood scores (for maximum likelihood methods) and posterior probabilities (for Bayesian analysis), or they have considered ‘support’ for individual edges, via procedures such as the bootstrap, various parametric statistical measurements, and the posterior probability of clades (Felsenstein, 2004).

This distinction between phylogenetics and biological networks seems to come from the different way that the networks are constructed. The other networks are usually constructed directly from observed objects and interactions, so that interest focuses on a description of the resulting network. Phylogenetic networks, on the other hand, are inferred via optimization of the data and a model, so that interest focuses on the quality of the inference rather than on a description of the network.

It seems likely, therefore, that this situation will continue, as most of these summary measures are specifically designed for describing empirically observed networks. However, the somewhat more nebulous concept of ‘network robustness’ (the degree to which a network structure is affected by removal or alteration of nodes) has been seen as an important characteristic in the study of all biological networks.

4. Data-Display Networks

We now have a good array of mathematical techniques for data-display networks (Huson and Scornavacca 2011; Morrison 2011). These methods do the sorts of things that biologists have wanted for affinity networks, and they do them well. It is for this reason that data-display networks are far and away the most common form of phylogenetic network in the biological literature; and they are becoming increasingly common in the social sciences, as well, notably in anthropology (particularly the study of historical linguistics).

In this section I will discuss some of the uses to which these networks have been put, and provide an original data analysis as an example of each one.

Many of the techniques are based on the concept of a splits graph, which can be expressed as a Median Network. This is illustrated in Figure 2, which shows the genetic relationships between mammoths (Mammuthus), African elephants (Loxodonta) and Asian elephants (Elephas). Each set of parallel edges in the Median Network represents a set of character differences between the species; for example, there are 181 DNA differences between Loxodonta and both Elephas and Mammuthus. A Median Network is thus a direct representation of the character data. It is interpreted in terms of bipartitions or splits, which divide the network into two groups; each split is represented by an edge or a set of parallel edges, so that removing the edges disconnects the graph into the two groups. For example, the set of four parallel edges labelled “181” in Figure 2 represents the split {Loxodonta}{Elephas,Mammuthus}.
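The way in which character data give rise to weighted splits can be sketched in a few lines: every aligned site with exactly two states divides the taxa into two groups, and the weight of a split is the number of sites supporting it. The sequences below are short, invented examples and are not the data of Orlando et al. (2007); ignoring constant and multi-state sites is a simplifying assumption of this sketch.

```python
from collections import Counter


def split_weights(sequences):
    """Count the aligned sites that support each bipartition of the taxa."""
    taxa = sorted(sequences)
    weights = Counter()
    for column in zip(*(sequences[t] for t in taxa)):
        if len(set(column)) != 2:
            continue  # constant or multi-state site: defines no single split
        side = frozenset(t for t, state in zip(taxa, column) if state == column[0])
        other = frozenset(taxa) - side
        # A split is an unordered pair of complementary taxon sets.
        weights[frozenset((side, other))] += 1
    return weights


# Hypothetical five-site alignment for the three genera.
toy_alignment = {
    "Elephas":   "AACGT",
    "Loxodonta": "AGTGT",
    "Mammuthus": "AACGA",
}

weights = split_weights(toy_alignment)
for split, count in weights.items():
    print(sorted(sorted(group) for group in split), count)
```

In this toy alignment two sites separate Loxodonta from the other two genera, which is the same kind of split as the one labelled “181” in Figure 2, only with a much smaller weight.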


However, as data complexity increases a Median Network becomes decreasingly viable as a display, because the number of required dimensions rapidly increases. Therefore, various simplifications have been developed that try to display as much as possible of the data in 2–3 dimensions. For example, the network in Figure 2 can actually be displayed in 2 dimensions, making it planar. In this case, there is only one edge needed to represent the split {Loxodonta}{Elephas,Mammuthus}, for example. This reduction approach is adopted by network methods such as the Reduced Median Network, the Median-joining Network, NeighborNet, and Consensus or Super Networks, which are used in the following sections.

Figure 2. A Median Network and its reduction to a planar view. The data are from Orlando et al. (2007). The numbers represent a count of the character differences between the species.

4.1. Exploratory Data Analysis

Exploratory data analysis (EDA) is a frequently undervalued part of data analysis in biology (Ellison, 2001). It involves evaluating the characteristics of the data before proceeding to the definitive analysis in relation to the scientific question at hand, because it is not prudent to rely on mathematical analysis without a detailed exploration of the data first (Bandelt, 2005). EDA traditionally involves both graphical displays and numerical summaries of the data (Tukey, 1977), and for phylogenetic analyses a useful tool for EDA is a data-display network (Wägele and Wetzel, 2007; Wägele et al., 2009; Morrison, 2010). This type of network is designed to display any character (or tree) conflict that might exist in a data set, without prior assumptions about the causes of those conflicts. The conflicts might be caused, for example, by methodological issues in data collection or analysis, homoplasy, or horizontal gene flow of some sort.

As an example of the use of a data-display network for exploratory analysis I will look at the published mitochondrial genome data for hominins, including:

- contemporary humans, including the revised Cambridge Reference Sequence (rCRS) (Andrews et al., 1999) and 53 sequences sampled world-wide (Ingman et al., 2000);

- some early modern humans known from northern-hemisphere fossils, including palaeolithic hunter-gatherers and neolithic farmers (Ermini et al., 2008; Gilbert et al., 2008; Krause et al., 2010b; Sánchez-Quinto et al., 2012; Skoglund et al., 2012);

- Neanderthals, a non-human group who ranged right across Europe into western Siberia, but whose fossil record stops about 30,000 years ago (Green et al., 2008; Briggs et al., 2009); and

- the fossil Denisovan individual referred to above (Krause et al., 2010a).

Figure 3. A NeighborNet graph of the relationships among the mitochondrial genomes of various hominins. The three main groups of sequences are named, as are some of the individual genomes, but most of the human genomes are unlabelled (except the rCRS). The source of some of the human sequences in Africa is also indicated.

The objective is to assess the relationship between Neanderthals and humans, which seems to be the most common question addressed in the literature about ancient DNA.

The NeighborNet analysis of these data (based on uncorrected distances) is presented in Figure 3. The fit of the data to the graph is 98%, so that almost all of the data patterns are actually displayed in the network. The network is very tree-like as far as the main clusters are concerned, but there is one large undifferentiated cluster that contains most of the human sequences and all of the early modern humans.
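The uncorrected distances on which this analysis is based are simply the proportion of aligned sites at which two sequences differ. A minimal sketch of that calculation is given below; the sequences are invented, and skipping any site that contains a gap in either sequence is an assumption of this sketch, since programs differ in how they treat gaps and ambiguous bases.

```python
def p_distance(seq1, seq2, gap="-"):
    """Return the proportion of compared sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: sites with a gap in either sequence are not compared.
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return {(name1, name2): p-distance} for every pair of sequences."""
    names = sorted(alignment)
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical aligned fragments, not taken from the hominin data set.
toy = {"s1": "ACGTACGT", "s2": "ACGTTCGA", "s3": "AC-TACGT"}
print(distance_matrix(toy))
```

No substitution model is applied, so multiple changes at the same site are counted at most once; this is what distinguishes an uncorrected distance from a model-corrected one.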

Note that the genetic variation of the Neanderthal mtDNA is much less than that in the human mtDNA, and probably less than can be accounted for solely by the smaller sample size (6 genomes versus 54). Also, the genomes from the early modern humans all cluster with the non-African genomes, indicating a closer relationship of these fossil sequences with modern Northern Hemisphere and Australasian humans than with Africans. This is consistent with the Out-of-Africa scenario, in which hominins arose in Africa and subsequently migrated elsewhere.

Importantly for the EDA, there is also clearly detectable non-tree structure to the data, notably in the relationship of the Denisovan genome to the other genomes, but also in the relationship between the Neanderthals and the humans. It is this non-tree structure that complicates any attempt to reconstruct the evolutionary relationship of the Neanderthals to humans.

To make this point clear, Table 1 lists seven of the bipartitions that contradict the tree-like relationship between the Denisovan, the Neanderthals and some of the humans. They are listed in decreasing order of data support for the bipartitions (i.e. split weight). Note that all of these non-tree bipartitions involve human samples from Africa, indicating that there are complex genetic relationships between ancient hominins and modern Africans. This may result, at least partly, from homoplasy caused by saturation of nucleotide substitutions, or from some form of ancestral polymorphism.

4.2. Displaying Data Patterns

It is also possible to use data-display networks to examine the patterns of similarity among a set of samples. That is, we wish to quantify and illustrate the complexity of the similarity relationships. For example, we might be interested in showing that the samples form clusters even if those clusters have considerable within-cluster variation and have complex inter-cluster relationships. If the between-cluster relationships are not complex, then the network will be very tree-like.

An example of the use of a data-display network to display similarity relationships is shown in Figure 4. The data concern amino acid sequences from 215 type-I interferon genes of various mammal species, downloaded from their respective online genome databases. These genes code for at least nine different classes within the type-I interferon family. The NeighborNet network analysis of the data (based on uncorrected distances), as shown in the figure, indicates that some of these sub-types have considerable within-class variation in their protein sequences, but that the sub-types nevertheless do form distinct clusters. The relationships between the clusters are indistinct in most cases, but there is nevertheless a well-supported bipartition associated with each cluster.

Of particular interest in this dataset is the relationship of the class labelled as µ. This has previously been recognized as an indeterminate sub-type called (Krause and Pestka, 2005), but the network analysis shows that it is quite distinct from both sub-type and sub-type , and thus is worthy of recognition in its own right. Equally interesting is the relationship of the two sequences that are individually labelled in the figure. These do not cluster with any of the currently named sub-types, and so might constitute a new, as yet unnamed, class.

Table 1. Some of the bipartitions (splits) that contradict the tree-like relationship between the Denisovan, the Neanderthals and the human genomes

Group  Split weight  Non-humans                          Humans
                                                        Accession  Origin
1      0.00031073    Denisovan                           AF347008   African (San)
                                                        AF347009   African (San)
2      0.00014065    Denisovan                           Group 1
                     all Neanderthal                     AF346985   African (Hausa)
3      0.00008465    Denisovan                           Group 5
                     Mezmaiskaya                         AF346986   African (Ibo)
4      0.00007869    Denisovan                           Group 7
                     all Neanderthal                     AF346994   African (Lisongo)
                                                        AF347014   African (Yoruba)
                                                        AF347015   African (Yoruba)
5      0.00006484    Denisovan                           Group 7
                                                        AF346968   African (Biaka)
                                                        AF346969   African (Biaka)
                                                        AF346987   African (Ibo)
                                                        AF346992   African (Kikuyu)
                                                        AF346996   African (Mbenzele)
                                                        AF346997   African (Mbenzele)
6      0.00004584    all Neanderthal except Mezmaiskaya  AF347014   African (Yoruba)
7      0.00004407    Denisovan                           Group 2
                     all Neanderthal except Feldhofer_2  AF346998   African (Mbuti)
                                                        AF346999   African (Mbuti)

4.3. Displaying Data Conflicts

It is also possible to use data-display networks to examine the patterns of disagreement among the samples. That is, we wish to quantify and illustrate the complexity of the differences among the samples. For example, we might be interested in showing the ways in which different sets of data conflict with respect to the relationships among the organisms, with respect to both the location and the quantity of the conflicts. The conflicts might be localized in the network, for instance, or they might be widespread.


Figure 4. A NeighborNet graph of the similarity relationships among the type-I interferon of various mammals. The Greek letters indicate the different sub-types within the gene family. Only two of the individual gene sequences are labelled.

4.3.1. Displaying the Data Conflicts

As an example of the use of a data-display network to display conflicting data patterns I will use conflict between trees derived from different datasets for the same species.

Phylogenetic datasets often produce different phylogenetic trees. Indeed, Darwin himself (1859) noted that different organ systems suggest different relationships, and it was St George Mivart who first had to deal with this situation empirically. (Mivart was also the first person to publish a phylogenetic tree after the Origin of Species was published.) His early work was principally on the comparative anatomy of primates, for which he provided very detailed comparisons of the skeletons of a large number of species, notably in Mivart (1865), based on the axial skeleton (or spinal column), and Mivart (1867), based on the appendicular skeleton (or limbs).

In the 1865 paper Mivart noted that the data for the spinal column "lead to an arrangement of groups and an interpretation of affinities somewhat differing from, yet in part agreeing with, the classification founded on cranial and dental characters". Moreover, the 1865 and 1867 studies did not produce the same phylogenetic tree, either.

Figure 5 shows a Super Network based on Mivart's 1865 and 1867 trees. There are 24 types of Primates, although the Gorilla was not in the 1865 dataset, and Inuus, Cynocephalus, Chrysothrix were not in the 1867 dataset. The main conflict between the two trees is in the relationship between the Nycticebinae and Cebidae, shown as the centre reticulated area in the network. Basically, the root of Mivart's trees is between these two groups, and they swap sides of the root between 1865 and 1867. Another cause of this netted area is whether Homo is within the Simiinae (as in the 1865 tree) or is sister to the Apes (as in the 1867 tree).

The bottom netted area of the network is caused by conflicts about relationships within the Lemuroidea, which are of less biological importance. The top netted area refers to whether Simia is sister to either Hylobates (in 1867) or Troglodytes (in 1865), which is a relatively minor point.

4.3.2. Comparison of Networks and Bootstrapped Trees

It is also important to note that other methods that have been used in phylogenetics for assessing data conflict, such as the bootstrap, do not always agree with the equivalent network assessment of data patterns (Wägele and Mayer, 2007; Wägele et al., 2009; Morrison, 2010). The naïve percentile bootstrap, as used for phylogenetic trees, can be expected a priori to provide biased and skewed estimates of confidence intervals, because the frequency distribution associated with tree edges will not be symmetrical (Morrison, 2006), but in addition to this it can also be expected to disagree with a network analysis. So, a tree with bootstrap values does not show the same thing as a data-display network.

The basic distinction between networks and bootstrapped trees is this: use of a data-display network, such as a splits graph, evaluates the character (or distance) data independently of any tree, whereas a bootstrap analysis evaluates the data solely in terms of a tree. For example, a bootstrap analysis records the trees at each iteration (or replicate) rather than recording the bootstrapped character set itself, and many different character sets can produce the same tree. Therefore, a bootstrap analysis does not directly assess the character support for a tree. Neither does a posterior probability from a bayesian analysis.
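The point that a bootstrap tallies trees rather than character sets can be illustrated with a toy four-sample alignment. The sketch below is an illustration under stated assumptions, not the procedure used for any analysis reported here: the alignment is invented, and `quartet_tree` is a hypothetical stand-in for a real tree-building method (it simply picks the pairing of samples with the smallest summed uncorrected distance).

```python
import random
from collections import Counter


def p_distance(s, t):
    """Uncorrected distance: proportion of sites at which two sequences differ."""
    return sum(x != y for x, y in zip(s, t)) / len(s)


def quartet_tree(seqs):
    """Choose the four-taxon tree whose two sister pairs have the smallest summed distance."""
    a, b, c, d = sorted(seqs)
    pair_sums = {
        f"{a}{b}|{c}{d}": p_distance(seqs[a], seqs[b]) + p_distance(seqs[c], seqs[d]),
        f"{a}{c}|{b}{d}": p_distance(seqs[a], seqs[c]) + p_distance(seqs[b], seqs[d]),
        f"{a}{d}|{b}{c}": p_distance(seqs[a], seqs[d]) + p_distance(seqs[b], seqs[c]),
    }
    return min(pair_sums, key=pair_sums.get)


def bootstrap(seqs, replicates, seed=1):
    """Return (tree counts, number of distinct resampled character sets)."""
    rng = random.Random(seed)
    n_sites = len(next(iter(seqs.values())))
    tree_counts = Counter()
    character_sets = set()
    for _ in range(replicates):
        # Resample alignment columns with replacement.
        cols = sorted(rng.randrange(n_sites) for _ in range(n_sites))
        resampled = {name: "".join(seq[i] for i in cols) for name, seq in seqs.items()}
        character_sets.add(tuple(cols))
        # Only the resulting tree is tallied; the character set itself is discarded.
        tree_counts[quartet_tree(resampled)] += 1
    return tree_counts, len(character_sets)


# Invented alignment, for illustration only.
alignment = {
    "A": "AAGGCTTACG",
    "B": "AAGGCTTACA",
    "C": "CAGTCTGACG",
    "D": "CAGTCTGATG",
}
counts, n_character_sets = bootstrap(alignment, replicates=200)
print(counts)
print(n_character_sets, "distinct character sets produced", len(counts), "distinct trees")
```

With four samples there are only three possible unrooted trees, so the many different resampled character sets are necessarily collapsed onto at most three outcomes. The bootstrap percentages summarize those outcomes, not the character patterns that produced them.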

The importance of this distinction for phylogenetics is that a tree analysis forces the data into a tree irrespective of how well the data fit that tree. All that is required is that the tree be the optimal one based on a particular criterion (distance, parsimony, likelihood, etc.), while the degree of fit of the data and tree is effectively treated as immaterial to the analysis; the tree-likeness of the data is never evaluated. This is true at each bootstrap iteration, as well, so that all we learn from a bootstrap analysis is which tree edges are the best supported; we do not learn anything directly about the support of the data for a tree in the first place.

This problem becomes particularly acute as the number of terminal nodes in the network increases. The potential complexity of a dataset increases combinatorially with the number of organisms (each one added can potentially have a reticulation with every one of the existing organisms). A dataset thus needs a very strong tree signal when there are hundreds of species, if the network is to show anything more than a disorganized blob. This seems to be an unlikely scenario for most taxonomic groups, especially when using genetic data.

Figure 6 shows an empirical example of this phenomenon. The data are from Soltis et al. (2011), who analyzed 25,260 aligned nucleotides of 17 genes from 640 species of flowering plants. They concluded from their bootstrapped tree that "Many important questions of deep-level relationships in the non-monocot angiosperms have now been resolved with strong support", where "strong support" meant that most of the branches on their phylogenetic tree had >85% bootstrap support. This conclusion does not accord with the NeighborNet graph (based on uncorrected distances) shown in the figure, which indicates that there is not much clear tree-like structure in this data set.

So, bootstrapped trees and networks will often be in agreement, especially if the data are very tree-like. However, they can differ in two ways: (i) the network may show well-supported character patterns that are not included in the tree, and (ii) the tree may show well-supported branches that are not accommodated by the network-building algorithm. Well-supported branches on a tree are not necessarily well-supported by the character data, and absence of a branch from a tree does not necessarily mean that it has little character support.


Figure 5. A Super Network illustrating the areas of disagreement between the two primate trees published by Mivart (1865) and Mivart (1867).

Figure 6. A NeighborNet graph of the genetic relationships among the 640 species of flowering plants from Soltis et al. (2011).


Figure 7. Scatter plot of the bipartition support (split weight) from the NeighborNet analysis and the bootstrap percentages from the Neighbor-joining analysis of the bird genome data.

This can be illustrated with an empirical example. The data are from Wang et al. (2012), being 25,700 aligned nucleotide positions from 28 species of bird. I calculated (a) a Neighbor-joining tree with 1,000 bootstrap pseudoreplicates, and (b) a NeighborNet graph (both based on uncorrected distances).

Figure 7 shows the bipartition support (split weights) for the 58 bipartitions that were included in the NeighborNet graph. These form a collection of what are called circular splits, and it is important to note that this collection does not include all of the bipartitions supported by the data. Those bipartitions not in the NeighborNet graph are shown in filled circles along the vertical axis of the graph with a split weight of 0.00001 (rather than zero, to accommodate the log scale).

The graph also shows the bootstrap percentages for all of the bipartitions in the NeighborNet graph plus all of those branches with a bootstrap frequency greater than 1%. Bipartitions that did not appear in any of the bootstrap pseudoreplicates are shown in filled circles along the horizontal axis of the graph.

Those bipartitions / edges where there is an approximate agreement between the tree and the network are shown in open circles. There is a roughly s-shaped relationship between the split weights and the bootstrap percentages, so that an increase in one is associated with an increase in the other.

Note that in the range of split weights for which the matching bipartitions have 100% bootstrap support, there are 8 bipartitions with a large split weight but 0% bootstrap support. These are bipartitions that contradict at least one better-supported bipartition. These bipartitions thus get 0% bootstrap support even though there is considerable character support for them, as evaluated by the network.


Equally importantly, there is one edge that appears in the bootstrap assessment with high support (87%) but which does not appear in the NeighborNet graph at all. The first step of the NeighborNet algorithm decides on a set of circular splits, and only these bipartitions will appear in the splits graph, no matter how well-supported other bipartitions might be.
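The restriction to circular splits can be illustrated directly. Given a circular ordering of the samples, a bipartition is circular only if one of its sides occupies a contiguous arc of that ordering, and there are at most n(n-1)/2 such bipartitions for n samples. The following is a minimal sketch that assumes the ordering is already known; how NeighborNet chooses the ordering and estimates the split weights is not reproduced here, and the sample names are hypothetical.

```python
def is_circular(side, ordering):
    """True if `side` (a proper, non-empty subset) forms one contiguous arc of `ordering`."""
    inside = [taxon in side for taxon in ordering]
    n = len(ordering)
    # Walking once around the circle, a contiguous arc is entered once and left once.
    changes = sum(inside[i] != inside[(i + 1) % n] for i in range(n))
    return changes == 2


def circular_splits(ordering):
    """All bipartitions (trivial ones included) that are circular for this ordering."""
    n = len(ordering)
    everything = frozenset(ordering)
    splits = set()
    for start in range(n):
        for length in range(1, n):
            arc = frozenset(ordering[(start + k) % n] for k in range(length))
            # Store each split as an unordered pair of sides, so an arc and
            # its complementary arc are counted once.
            splits.add(frozenset({arc, everything - arc}))
    return splits


ordering = ["a", "b", "c", "d", "e"]          # hypothetical circular ordering
print(is_circular({"a", "b"}, ordering))      # True: neighbours on the circle
print(is_circular({"a", "c"}, ordering))      # False: "b" lies between them
print(len(circular_splits(ordering)))         # 10, i.e. 5 * 4 / 2
```

A bipartition such as {a, c} versus the rest cannot be drawn for this ordering however strong its support in the data, which is how a well-supported tree edge can be absent from the splits graph.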

4.4. Summarizing Analysis Results

It is also possible to use data-display networks as a summary of the results from some other form of phylogenetic analysis, particularly if those results are expressed as a set of trees. Under these circumstances, it is traditional to use a consensus tree to summarize the patterns in the tree set (a meta-analysis), but it would be more informative to use a Consensus Network instead. Consensus networks could be used to summarize the set of Markov Chain Monte Carlo (MCMC) trees from a bayesian analysis (Holland et al., 2005), for example, or to present the results of bootstrap analyses (Holland et al., 2006), or a set of equally parsimonious trees or near-optimal maximum-likelihood trees (Holland et al., 2006).

In the case of a bayesian analysis, use of a Consensus Network would be considerably more logical than a consensus tree. Bayesian methods differ from other forms of probabilistic analysis in that they are concerned with estimating a probability distribution, rather than a single estimate of the maximum probability. That is, bayesian analysis is not about identifying the most likely outcome, it is about estimating the probability of all possible outcomes. In this sense, it is quite distinct from other probabilistic methods, such as those based on estimating the optimal outcome under criteria such as maximum likelihood or maximum parsimony.

In a phylogenetic analysis this creates a potentially confusing situation, as the result of most bayesian analyses is presented as a single tree (the so-called MAP tree), rather than showing the probability distribution of all trees. Certainly, some of the information from the probability distribution is used in the tree, usually the posterior probabilities that are attached to each of the tree edges, but this is a poor visual summary of the available information. A better approach would be to use a network to display the probability distribution.

The tree produced by a bayesian phylogenetic analysis is built from the best-supported edges of the set of trees sampled by the MCMC procedure, but only a set of compatible edges can be included in the consensus tree. Any well-supported but incompatible edges will not be shown, and it is the absence of these edges that causes the phylogenetic tree to deviate from the standard Bayesian philosophy of presenting a probability distribution. A Consensus Network solves this problem because it is specifically designed to present a specified percentage of the incompatible edges as well (Holland et al., 2004).

As an example of the use of a Consensus Network to summarize the posterior probabilities from a bayesian analysis, I will use the data of Schnittger et al. (2012), being sequence data for the -tubulin gene of 17 samples of the parasite genus Babesia. The bayesian analysis yielded 100,000 trees in the final MCMC sample. The rooted MAP tree is shown in Figure 8, with the posterior probabilities (PP) indicated on each branch.


Figure 8. The maximum a posteriori probability (MAP) tree from the bayesian analysis of the -tubulin data of Schnittger et al. (2012). The numbers on the edges are the posterior probabilities.

Figure 9. The 5% Consensus Network from the bayesian analysis of the -tubulin data of Schnittger et al. (2012), which represents all of the bipartitions occurring in at least 5% of the MCMC trees. The lengths of the edges represent the proportion of MCMC trees containing that edge.


There are several ways that the MCMC trees could be summarized, the standard way being to choose some percentage of the MCMC trees and to show the network of those trees. The MAP tree, for example, includes all of the bipartitions that occur in at least 50% of the trees, plus all of the other bipartitions that are compatible with those bipartitions. So, Figure 9 shows the unrooted Consensus Network with all of the bipartitions that occur in at least 5% of the trees. Note that this network approximates the 95% credible set of trees, as it excludes those bipartitions that occur in fewer than 5% of the trees.
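The thresholding step behind a Consensus Network can be sketched as follows. This is an illustration only, assuming each sampled tree has already been reduced to its set of non-trivial bipartitions, with each bipartition written as the side that excludes a fixed reference sample; the five samples A to E and the tree counts are invented.

```python
from collections import Counter


def split_frequencies(trees):
    """Proportion of sampled trees that contain each bipartition."""
    counts = Counter(split for tree in trees for split in tree)
    return {split: n / len(trees) for split, n in counts.items()}


def consensus_splits(trees, threshold):
    """Bipartitions occurring in at least `threshold` (a proportion) of the trees."""
    return {s: f for s, f in split_frequencies(trees).items() if f >= threshold}


# Each bipartition is written as the side that does not contain sample "A".
BC, DE = frozenset("BC"), frozenset("DE")
CD, BCD = frozenset("CD"), frozenset("BCD")

# Invented sample of 20 trees on the samples A to E.
trees = [{BC, DE}] * 12 + [{BC, BCD}] * 7 + [{CD, BCD}] * 1

print(consensus_splits(trees, 0.5))  # BC (0.95) and DE (0.6): mutually compatible
print(consensus_splits(trees, 0.1))  # adds BCD (0.4), which conflicts with DE
```

At the 50% threshold the retained bipartitions always fit on a single tree. Lowering the threshold admits the bipartition BCD, which is incompatible with DE, and it is such incompatible pairs that appear as boxes in the network.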

It is important to note that the edge lengths in the network represent the bipartition support, not the estimated number of nucleotide substitutions (as in the MAP tree). For example, the relationships among the Babesia microti sequences have very short branches in the MAP tree (i.e. few substitutions) but long tree edges in the consensus networks (i.e. good support).

This network makes visually clear where the major incompatibilities among the MCMC trees are, which the MAP consensus tree does not unless one checks the PP values. The major set of boxes in the network involves the branches with PP values of c. 0.4.

4.5. Testing Phylogenetic Hypotheses

Finally, it is also possible to use data-display networks to test any phylogenetic hypothesis that can be stated a priori. Science is usually characterized as being an hypothesis-testing exercise, and phylogenetic networks have a role to play within this logical context. The simplest type of hypothesis to test is of this form:

I predict that [my expected reticulation event] will create [this particular character pattern] in the data.

For example, a specific bipartition of the samples may have strong character support in contradiction to the usually accepted phylogenetic tree. This would not necessarily be a very strong test (Morrison, 2011), because many other possible causes of reticulation could predict exactly the same character pattern, but it may nevertheless be an important component of a phylogenetic study.

As an example of the use of a data-display network to test an explicit phylogenetic hypothesis I will consider the phylogenetic position of turtles. This has been difficult to determine (Hedges, 2012). Historically, turtles were thought to be early diverging reptiles (called anapsids), but recent morphological studies have allied turtles with lizards and snakes (together called the squamates) plus tuataras (which, together with the squamates, make up the lepidosaurs). However, most molecular phylogenetic studies support neither of these hypotheses, instead finding a relationship of turtles with birds and crocodiles (together called the archosaurs).

Crawford et al. (2012) collected genome data for a range of vertebrates, wishing to test whether the turtles are genealogically most closely related to the lepidosaurs or to the archosaurs. Thus there are two explicit alternative topologies predicted for the network, one with a well-supported bipartition linking the turtles to the lepidosaurs and one with a well-supported bipartition linking the turtles to the archosaurs.


Figure 10. A NeighborNet graph of the relationships among the genomes of various vertebrate species. The two main groups of interest are named: the archosaurs (birds + crocodiles) and the lepidosaurs (snakes + lizards + tuatara). The nine best-supported bipartitions (i.e. splits) are numbered, and their support values (i.e. weights) are listed in the table.

The NeighborNet analysis of these data (based on uncorrected distances) is presented in Figure 10. The fit of the data to the graph is 99%, so that almost all of the data patterns are actually displayed in the network. The network has some well-supported clusters, but the relationship of these clusters to each other is not clear. I have numbered the nine best-supported bipartitions in the data, and shown their location in the graph as well as their relative weights.

The bipartition of interest for the hypothesis testing is the one numbered 6, which unites the turtles and the archosaurs, and thus apparently supports one of the two a priori hypotheses. However, the bipartition support is rather small and, what is worse, there are several bipartitions with non-trivial support that contradict bipartition 6, notably number 7. This is thus a very equivocal hypothesis test.

Of particular note, the complexity created by bipartition 7 involves the relationship between the turtles and the tuatara, while bipartition 8 involves the relationship between the turtles and the crocodilians. This emphasizes just why there are so many different hypotheses about turtle relationships: many contradictory relationships are supported by at least some of the genetic data. It is the tuatara relationship that appears to be one of the keys to understanding the complexity of turtle relationships. It is therefore unfortunate that there are no other available datasets to test this relationship further. Those studies with genomic data available do not include the tuatara; and those genomic studies that do include the tuatara apparently do not have their aligned molecular data freely available online (and sometimes both issues apply).

5. Evolutionary Networks

The basic difference between an evolutionary network and an evolutionary tree is that the former tries to identify reticulation events in the phylogenetic history. These events can arise from the processes of recombination, hybridization and introgression, lateral gene transfer, or genome fusion.

Unfortunately, at the moment we do not have many suitable mathematical methods for evolutionary networks (Huson et al., 2011; Morrison, 2011). The future of phylogenetic networks should therefore involve the development of networks that explicitly model evolution. In this section I will discuss some of the issues that need to be dealt with in order to achieve this goal.

The basic problem is that evolution involves a series of unobservable historical events, each of which is unique, and we can neither make direct observations of them nor perform experiments to investigate them. This makes a phylogenetic study one of the hardest forms of data analysis known, as there is no mathematical algorithm for discovering unique historical accidents.

There are two basic approaches to evolutionary networks at the moment: (a) a genealogy derived from characters, and (b) a genealogy derived from multiple phylogenetic trees. As examples of the former we have Recombination networks and Haplotype networks. As examples of the latter we have Hybridization networks and Lateral Gene Transfer networks, which can be derived directly from the trees themselves, or via subsets of the trees such as clusters or triplets.

5.1. Open Questions about Evolutionary Networks

There are a number of issues that have been of interest to the phylogenetics community with regard to the construction of evolutionary trees that have not yet been addressed for evolutionary networks. These can be considered to be 'open questions', ones that need widespread discussion at some stage, either by biologists or by computational scientists (or both). I will discuss some of these in the following sections.

5.1.1. Optimization Criteria

Phylogenetic analysis has been treated mainly as a mathematical optimization problem. There seem to be two types of data that can be optimized:

(i) character-state changes (e.g. nucleotide substitution, nucleotide insertion / deletion, or their amino acid equivalents), and

(ii) character-block events (e.g. inversion, duplication / loss, transposition, recombination, hybridization, horizontal gene transfer).

To date, phylogenetic tree-building has concentrated on data-type (i), and methods have been developed using optimization criteria such as minimum distance, maximum parsimony, maximum likelihood, and bayesian analysis (which, strictly speaking, does not involve an optimization criterion, as noted in section 4.4).

Most of the data-display network methods have also been based on optimizing data-type (i), notably the splits-graph methods discussed above, which conceptually can be seen as based on either maximum parsimony or minimum distance. Moreover, it is possible to optimize the character data directly onto a network using either the parsimony scores (e.g. Hein, 1990, 1993; Dickerman, 1998; Nakhleh et al., 2005; Jin et al., 2006a, 2007a, 2007b) or the likelihood scores (e.g. von Haeseler and Churchill, 1993; Strimmer and Moulton, 2000; Strimmer et al., 2001; Jin et al., 2006b; Snir and Tuller, 2009; Bloomquist and Suchard, 2010). The likelihood scores can also be evaluated in a bayesian context (Radice, 2011).

However, evolutionary networks can differ from evolutionary trees by explicitly taking into account data-type (ii), either instead of or in addition to type (i). So far, maximum parsimony has been the criterion of choice for doing this, in the sense that the available methods minimize the postulated number of evolutionary events. For example, a large amount of work has been done to minimize the number of reticulation nodes when reconciling a set of incompatible phylogenetic trees, or alternatively to minimize the level (see Morrison, 2011).

However, this means that there are currently few available likelihood-based methods that will allow us to build networks directly from quantitative evolutionary models of how non-tree events occur. The most obvious exception here is the recent development of Admixture graphs, at least some of which are based on an approximate maximum-likelihood model (e.g. Pickrell and Pritchard, 2012).

This seems to be a serious omission, given that model-based methods are among the most widely used of those available for phylogenetic trees, at least among those users who want a robust analysis (Kelchner and Thomas, 2006). Likelihood has effectively replaced maximum parsimony as an optimization criterion for tree building. The quick-and-dirty distance-based methods, however, will probably always out-rank the other methods, because they can be useful as a 'first approximation'.

It may not be easy to create likelihood models for non-tree events, perhaps even more so given the number of different types of events that need to be modelled. Nevertheless, the lack of such models seems to be a handicap for the widespread acceptance of network-based methods in phylogenetics.

5.1.2. Partitioned Models for Likelihood Analyses

This topic is a direct extension of the previous one. Current likelihood models for tree-building analyses can be applied independently to different partitions of the type-(i) character data, and this partitioning is considered to be a valuable part of any likelihood analysis (e.g. Blair and Murphy, 2011). Indeed, it is the desirability of model partitioning that seems to be a major component of the increasing move from maximum likelihood to bayesian analysis in phylogenetics, as well as the ease of implementing models that deal with heterogeneity among and within lineages (especially relaxed molecular clocks).

Partitioned models allow us to add complexity that can deal with heterogeneity within a dataset (Endicott et al., 2009), by a priori or a posteriori choice of partitions with greater inter- than intra-partition variability in substitution rates. For example, there is substitution-rate heterogeneity within genes (e.g. different codon positions in protein-coding genes, paired versus unpaired positions in RNA-coding genes), as well as between genes (e.g. house-keeping genes versus rRNA-coding genes), between coding and non-coding regions (e.g. introns versus exons, as well as transcribed spacers and the mitochondrial control region), and between genomes (e.g. nuclear versus mitochondrial). Failure to correctly account for this heterogeneity can seriously mislead phylogenetic analyses; and automated procedures for devising partition schemes have now been developed for trees (e.g. Lanfear et al., 2012).
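As a concrete illustration of an a priori partition, the sketch below splits a protein-coding alignment by codon position and reports the proportion of variable sites in each partition. It is a toy example under stated assumptions: the alignment is invented, it is assumed to start at the first position of a codon, and the function names are hypothetical rather than part of any published partitioning tool.

```python
def codon_position_partitions(n_sites):
    """Map codon position (1, 2, 3) to the 0-based alignment columns it contains."""
    return {pos: list(range(pos - 1, n_sites, 3)) for pos in (1, 2, 3)}


def extract(alignment, columns):
    """Sub-alignment restricted to the given columns."""
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def variable_fraction(alignment):
    """Proportion of columns in which the sequences are not all identical."""
    columns = list(zip(*alignment.values()))
    return sum(len(set(col)) > 1 for col in columns) / len(columns)


# Invented in-frame alignment of three codons, for illustration only.
alignment = {
    "sp1": "ATGGCCAAA",
    "sp2": "ATGGCTAAG",
    "sp3": "ATGGCGAAA",
}

partitions = codon_position_partitions(9)
for pos, cols in partitions.items():
    print(pos, cols, variable_fraction(extract(alignment, cols)))
```

In this toy alignment all of the variation falls in the third codon positions, which is the kind of within-gene heterogeneity that a separate substitution model per partition is intended to accommodate.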


Partitioning is not a panacea for heterogeneity, of course, and there are potential problems that need to be addressed concerning partition choice and its consequences (see Brown et al., 2010; Marshall, 2010; Fan et al., 2011). None of these issues has yet been addressed in the context of evolutionary networks, although there seems to be no barrier to the use of partitioning for network likelihood models. On the other hand, dealing with evolutionary heterogeneity among and within lineages may actually be a bigger problem than it is for trees, given the increased complexity of the lineages in a network.

5.1.3. Mixture Models for Likelihood Analyses

An alternative approach to dealing with heterogeneity is through the use of mixture models. Here, the likelihood of each character is calculated under more than one model, and these likelihoods are then combined. For example, the parameters of several substitution models, as well as the probability with which each model applies to each alignment position, can be determined directly from the data. Such models have been developed for nucleotide (Pagel and Meade, 2004) and amino-acid (Le et al., 2008) sequences, but this is otherwise a very under-explored part of phylogenetic analysis. Nevertheless, computer programs are becoming more readily available for trees (e.g. Stamatakis, 2006).
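The way in which the per-model likelihoods are combined can be written out explicitly: the likelihood of each character is the weighted sum of its likelihoods under the component models, and the log-likelihood of the dataset is the sum of the logarithms of these per-character values. The following is a minimal sketch of that calculation only. The per-character likelihoods and the weights are invented numbers, and in a real analysis they would be computed on a tree and estimated from the data.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of a dataset under a mixture of models.

    site_likelihoods[i][k] is the likelihood of character i under model k;
    weights[k] is the probability with which model k applies to a character.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    total = 0.0
    for per_model in site_likelihoods:
        # Weighted sum over models for this character, then take the log.
        total += math.log(sum(w * lik for w, lik in zip(weights, per_model)))
    return total


# Invented likelihoods for two characters under two models.
site_likelihoods = [
    [0.02, 0.001],    # character 1 is better explained by model 1
    [0.0005, 0.01],   # character 2 is better explained by model 2
]

print(mixture_log_likelihood(site_likelihoods, [0.5, 0.5]))
print(mixture_log_likelihood(site_likelihoods, [1.0, 0.0]))  # model 1 only
```

Unlike a partitioned analysis, no character is assigned to a model in advance; each character contributes under every model, in proportion to the weights.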

It would presumably be possible to combine data-types (i) and (ii) using this approach. Indeed, this has obvious theoretical advantages for networks, although the resulting models may be overly complex. It seems likely that the ability to model, say, hybridization versus recombination as alternative causes of reticulations in a phylogeny will be a part of any successful attempt to produce a widely used method of phylogenetic analysis.

5.1.4. Separating Randomness and Rooting from Reticulation

There are at least three quite distinct causes of incompatible patterns in a phylogenetic dataset:

(a) Randomness, which is expected to create stochastic variation (such as homoplasy), but which may also be due to bias (e.g. selection);

(b) Rooting, with different 'gene trees' being rooted in different places; and

(c) Reticulation, which can have any one of several causes (e.g. hybridization, lateral gene transfer, recombination, etc.).

If we want an evolutionary network to display only Reticulation then we need to deal with the other two issues, either beforehand or at the same time.

Morrison (2011) discusses published examples in which several trees have been presented, from different gene segments, that differ from each other in the location of their outgroup root (e.g. Figures 4.7 and 4.27 of that book). In at least one of these cases, there are no reticulate evolutionary events at all, merely an uncertain root. That is, a network was constructed showing putative hybridizations and yet the only evolutionary pattern in the data was that the single unrooted species tree had different roots in the different gene trees.

In all of these cases, it is difficult to present an evolutionary network, because many of the resulting reticulations reflect the differences in the outgroup roots rather than true evolutionary reticulation events. Clearly, we cannot accept a situation where incompatibility among the trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes. This issue is further discussed in the next section (5.1.5).

Randomness refers to uncertainty in any of the relationships depicted by the tree. Stochastic variation has long been recognized in phylogenetics, and it is the principal issue that most tree-building methods try to address in their algorithms. Biologically, stochastic variation usually arises from short evolutionary intervals (represented as short edge lengths in the tree), but may also arise from inadequate tree-building models, etc. It is the problem that branch-support estimates are designed to quantify, such as bootstrap values or posterior probabilities.

In the ‘normal’ statistical world, random data variation is assumed to be associated with estimation errors. For phylogenetic data, these might include incorrect data (e.g. contamination), inappropriate sampling, and model mis-specification. Alternatively, these errors might lead to bias rather than to random variation. If so, then the sources of bias should be dealt with via exploratory data analysis, and the offending information can then be corrected or deleted.

However, when we are specifically trying to study reticulate evolution, there will also be many possible biological causes of data conflicts, which are not the result of either reticulation or estimation errors, such as homoplasy (parallelism, convergence, reversal), duplication / loss, and various complex molecular activities (such as sequence inversion, duplication, and transposition). All of these issues need to be dealt with under the concept of ‘non-reticulation variation’.

Separating reticulation-caused data conflict from non-reticulation data conflict requires a null model for reticulation. This is also discussed below (section 5.1.6).

5.1.5. Standardizing the Root

Rooting is a problem that has not yet been dealt with properly in phylogenetics. The differences among a set of gene trees are often little more than the relative location of the root (the common ancestor). That is, the unrooted gene trees are (almost) identical, but they have been rooted in somewhat different places (often not too far from each other). In one sense, this might be simply Randomness occurring with respect to the root. However, its effect can be great, because it can potentially affect all of the rest of the network, whereas Randomness in most other locations will have only local effects on the topology.

Situations where incompatibility among trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes, can be dealt with by pre-processing of the data (prior to network analysis). Here, I will make a few suggestions about possible ways to do this.

If we have a set of ‘gene trees’, then problems with incompatible rooting might be dealt with using polychotomies. That is, we could try to create a set of rooted gene trees with the ‘same’ root by deleting conflicting basal edges from the tree. For example, an algorithm might look like this:

1. unroot all of the gene trees;

2. find the most-common root: the root location in an unrooted tree defines a split (bipartition), so the most-common root will be the root-split that occurs in the largest number of trees (unless there are multiple outgroup species that make the ingroup non-monophyletic);

3. any rooted gene tree consistent with that root (i.e. one that displays that split) can be used unmodified;

4. any gene tree with a topologically close root could then be modified so that some of its edges are contracted into a polychotomy until the unrooted tree is consistent with the common root, and the resulting less-refined tree would then be used as the rooted tree (obviously, it would be necessary to define ‘topologically close’ explicitly); and

5. the remaining gene trees would then be set aside and not used in the network analysis.
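The first three steps of this procedure can be sketched as follows. This is only an illustration under stated assumptions: trees are represented as nested tuples with taxon names as strings, every root is binary, and all trees share the same taxon set. The function names and the three example trees are hypothetical. Steps 4 and 5 are not attempted, because they depend on a definition of ‘topologically close’ that is not yet available.

```python
from collections import Counter

def leaves(tree):
    """Set of taxon names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def root_split(tree):
    """Bipartition of the taxa induced by the root (assumes a binary root)."""
    left, right = tree
    return frozenset([leaves(left), leaves(right)])

def most_common_root(trees):
    """Return the most frequent root split and the number of trees showing it.

    Ties are broken arbitrarily (first encountered), which is an assumption
    of this sketch rather than a recommendation.
    """
    counts = Counter(root_split(t) for t in trees)
    return counts.most_common(1)[0]

# Hypothetical gene trees on taxa A-E.
gene_trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("B", "A"), (("C", "D"), "E")),
    ("A", ("B", ("C", ("D", "E")))),
]
common_split, n_trees = most_common_root(gene_trees)

# Step 3: trees whose own root matches the common root are used unmodified.
unmodified = [t for t in gene_trees if root_split(t) == common_split]
```

Storing each root as an unordered pair of taxon sets means that two trees drawn with their subtrees in a different left-to-right order are still recognized as having the same root.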

This algorithm might work more often than not in practice, although it is easy to think of situations where it will be a very uncertain procedure.

If the ingroup is not monophyletic, then the biologist should fix this before proceeding with the network analysis. This is a ‘biological problem’ of sampling, not a mathematical one; perhaps the problem arises from deep coalescence, for example. If there is no clear ‘most-common root’ among the trees, then perhaps we could define an ‘average’ or centroid root of some sort. We would then proceed with the rest of the method.

An alternative to this ‘polychotomy method’ might be to use the coalescent to construct an approximate species tree from the multiple gene trees; there are now several methods to do this (see Knowles and Kubatko, 2010; Anderson et al., 2012; Sánchez-Gracia and Castresana, 2012). Then, in the network analysis we could allow the gene trees to differ from the species tree only with respect to the poorly supported branches in the species tree. That is, we would use the well-supported parts of the coalescent tree as a backbone common to all of the gene trees, and for the uncertain parts we would use each of the gene trees. However, the applicability of the coalescent to higher taxonomic groups (as opposed to closely related species) is uncertain.

Duplication-loss is another potential cause of problems with the root. It is not immediately obvious how to approach this, but some suggestions have been made by Burleigh et al. (2011).

A different strategy would be to try all possible roots and see which one(s) minimize the network complexity. This might be computationally intensive, depending on the size of the dataset and the network method used. It might be necessary to restrict the roots tested to those observed among the input trees.

5.1.6. Null Models for Reticulation

Once we have the root standardized for the dataset, we are then set the task of separating reticulation-caused data conflict from non-reticulation data conflict. This requires a null model for data conflict: any data conflict that cannot be accommodated by the null model is a candidate for explanation as the result of a reticulation event.

Looking at the literature, the most commonly accepted null model is currently deep coalescence (incomplete lineage sorting) (Meng and Kubatko, 2009; Kubatko and Meng, 2010). For example, a maximum-likelihood method has been developed that models hybridization in the presence of deep coalescence (Kubatko, 2009). One can also use the coalescent as an optimality criterion to choose among alternative networks, with lineage sorting under the coalescent as the null hypothesis (Huson et al., 2005; Buckley et al., 2006; Than et al., 2007; Lyngsø et al., 2008; Joly et al., 2009).
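To give a sense of how much conflict deep coalescence alone can generate, the following sketch uses the standard three-species result from coalescent theory, which is not taken from this chapter: with one lineage sampled per species, the gene tree matches the species tree with probability 1 - (2/3)exp(-T), where T is the length of the internal species-tree branch in coalescent units. The function name is hypothetical.

```python
import math

def p_gene_tree_matches(internal_branch):
    """Probability that a three-taxon gene tree has the species-tree topology.

    internal_branch is the internal branch length in coalescent units
    (an assumption of this sketch: one sampled lineage per species,
    constant population size, no gene flow).
    """
    return 1.0 - (2.0 / 3.0) * math.exp(-internal_branch)

# Expected proportion of conflicting gene trees for a few branch lengths.
for t in (0.1, 0.5, 1.0, 3.0):
    print(f"T = {t}: {1.0 - p_gene_tree_matches(t):.3f} of gene trees conflict")
```

Under this null model a substantial fraction of gene trees can disagree with the species tree when internal branches are short, without any reticulation having occurred.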

However, the sole use of deep coalescence effectively ignores the other non-reticulation causes of data conflict, as listed above (section 5.1.4). It is possible that this approach will work in practice, but it seems unlikely that this will be so. Effectively, this approach assumes that the gene trees correspond to the true underlying coalescent trees. This is unlikely because the gene trees are inferred and therefore can be incorrect, due to the other (listed) non-reticulation causes of data conflict. Moreover, if there are multiple types of reticulation event occurring then the approach might fail. For example, if one wishes to study hybridization, then the coalescence methods assume that recombination occurs only between and not within the regions used to infer the gene trees, which is also unlikely.

So, a more comprehensive null model seems to be needed, one that includes more than simply traditional statistical randomness plus deep coalescence. The default expectation at the moment seems to be that deep coalescence occurs above the species level, so that all data sets should be tree-like, whereas the objective here is to detect the non-tree-like parts of evolutionary history.

5.1.7. Dealing with Stochastic Error and Bias

In addition to null models, we may also need pre-processing of the data to deal with stochastic error and bias. There is a limit to what can be done with a single null model, and phylogenetic data are rarely simple. Here, I make a few suggestions for alternative strategies.

If we have a set of ‘gene trees’, then perhaps the most obvious approach is to delete uncertain edges. That is, they would appear as polychotomies in the gene trees. This allows refined versions of these trees to be represented in the network, rather than requiring extra edges in a network to accommodate all of them. An alternative is to weight all of the edges with respect to their data ‘support’, with the expectation that poorly supported edges would only appear in the network if they are consistently supported across a number of the gene trees.
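Deleting uncertain edges can be sketched as a simple tree contraction. The representation here is an assumption made for illustration: an internal node is a pair (support, children), a leaf is a string, and the threshold of 70 is an arbitrary example value rather than a recommended cut-off.

```python
def collapse(tree, threshold):
    """Contract every internal edge whose support is below the threshold.

    tree is either a leaf name (str) or a pair (support, children).
    The support stored at the root is ignored, since the root has no
    incoming edge to delete.
    """
    if isinstance(tree, str):
        return tree
    support, children = tree
    new_children = []
    for child in children:
        child = collapse(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            # Splice the grandchildren into this node: a polychotomy.
            new_children.extend(child[1])
        else:
            new_children.append(child)
    return (support, new_children)

# Hypothetical gene tree with bootstrap-style support values.
gene_tree = (100, [(95, ["A", "B"]), (40, ["C", (80, ["D", "E"])])])
collapsed = collapse(gene_tree, 70)
```

In this example the weakly supported group containing C, D and E is dissolved, while the well-supported pair D and E is retained below the resulting polychotomy.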

I think that there are two types of support that could be relevant to uncertainty: (1) classic branch support, such as bootstrap values; and (2) the set of multiple equally optimal or nearly optimal trees. These two types coincide in Bayesian analysis, as it is currently implemented in phylogenetics, because in Bayesian analysis the branch support is derived from the set of nearly optimal trees. I suspect that support-type (2) may be a better idea than (1), because it expresses something about the tree itself rather than each edge alone. It is used in the SpNet method (Nakhleh et al., 2005), for example, where each gene tree is a consensus tree of several nearly-optimal trees. The appeal of using polychotomies is that it is simple. The main arguments against it may be the work required for the calculations in methods such as maximum likelihood (both parsimony and Bayesian analyses do the necessary calculations anyway), and the fact that it may create non-dense sets of triplets (Jansson and Sung, 2006), for example.

Another idea might be to delete organisms that have no consistent position among the input trees. The idea here is that biologically we are looking for things like hybridization and lateral gene transfer, and we are not expecting this to involve any one organism in combination with many other organisms. Therefore, extremely uncertain positions are unlikely to reflect Reticulation but rather Randomness (or lack of information). Creating polychotomies would lose a lot of information in this situation, and so it would be better to flag these organisms as problematic, and then leave them out of the network analysis. This is basically the concept used for largest common pruned trees (or agreement subtrees), except that here we don't prune the data all the way down to a tree (see Abby et al., 2010). This also seems to be the idea behind the Dendroscope (Huson and Scornavacca, 2012) computer program’s option to deal only with clusters that appear in a certain percentage of the trees. The problem with the Dendroscope approach, however, is that a cluster generated by lateral gene transfer, say, that appears in only one tree will be ignored. It would thus be better to use the variation in position of individual nodes, rather than presence/absence of clusters.

5.1.8. Robustness of Branch / Reticulation Estimates

It is de rigueur in the world of phylogenetic tree building to pepper the tree edges with bootstrap values or posterior probabilities, or frequently both, especially if these estimates are >50%. On the other hand, these values are almost never seen in the world of phylogenetic networks.

If there is a direct link between the network and some character-state data, then bootstrap values can be calculated for a network in the same manner as for a tree: one simply builds many networks from the re-sampled character data. However, this procedure may not be computationally feasible if the network method does not have a practical running time.
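The re-sampling step itself is independent of how the network is built, as the following sketch tries to show. The alignment is a dictionary of equal-length sequences, and the network-building method is passed in as a function that returns a set of hashable groups. The function closest_pairs used here is only a toy stand-in for such a method (it is not a network method from the literature), and all names and data are hypothetical.

```python
import random
from collections import Counter

def bootstrap_support(alignment, build_groups, replicates=100, seed=0):
    """Proportion of bootstrap replicates in which each group is recovered.

    alignment: dict mapping taxon name to sequence (equal lengths).
    build_groups: function mapping an alignment to a set of hashable groups;
    this is where a real tree or network method would be called.
    """
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    counts = Counter()
    for _ in range(replicates):
        # Sample alignment columns with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {t: "".join(s[c] for c in cols) for t, s in alignment.items()}
        counts.update(build_groups(resampled))
    return {group: n / replicates for group, n in counts.items()}

def closest_pairs(alignment):
    """Toy stand-in for a network method: the pair(s) of most similar taxa."""
    taxa = sorted(alignment)
    dist = {
        frozenset((a, b)): sum(x != y for x, y in zip(alignment[a], alignment[b]))
        for i, a in enumerate(taxa) for b in taxa[i + 1:]
    }
    best = min(dist.values())
    return {pair for pair, d in dist.items() if d == best}

# Hypothetical data in which A and B are identical at every site.
alignment = {"A": "AAAAAAAA", "B": "AAAAAAAA", "C": "CCCCCCCC", "D": "GGGGGGGG"}
support = bootstrap_support(alignment, closest_pairs)
```

The cost is the number of replicates multiplied by the running time of the method supplied as build_groups, which is why an impractical network method makes the whole procedure impractical.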

Moreover, this procedure is not necessarily straightforward for other types of data from which we might build a network. For example, if we are building a network by minimizing the number of reticulations needed to reconcile a set of conflicting trees, the application of the bootstrap has not yet been evaluated. The computational focus to date has been on the optimization problem, not on the re-sampling problem. And, of course, in the absence of a likelihood model for reticulation events, posterior probabilities cannot be calculated at all.

So, this is another area where the lack of methods commonly associated with tree building seems to be a handicap for the widespread acceptance of network-based methodology.

5.2. Are the Mathematical Constraints Biologically Realistic?

Mathematicians and other computational scientists have produced their own definitions of phylogenetic networks, independently of biologists, for the purpose of making the calculations tractable. We therefore need to consider whether the constraints put in place by these mathematical definitions are likely to produce biologically realistic networks. How do these definitions connect to what biologists have in mind when they use the term ‘phylogenetic network’?

The extent to which we can use these topologically restricted families of mathematical networks as a basis for reconstructing biological histories is still an open question. Clearly, much more work is needed to understand the connections between the mathematical restrictions and the requirements of biological modelling.

5.2.1. The Network Definition

For the evolutionary type of network, the mathematical definition usually looks something like this:

A phylogenetic network is a rooted, directed graph (consisting of nodes, plus edges that connect each parent node to its child nodes) such that:

(1) There is exactly one node having indegree 0, the root
    – all other nodes have indegree 1 or 2
(2) All nodes with indegree 1 have outdegree 2 or 0
    – nodes with outdegree 2 are tree nodes
    – nodes with outdegree 0 are leaves, distinctly labelled
(3) The root has outdegree 2, and
(4) Nodes with indegree 2 have outdegree 1, called reticulation nodes.
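The four conditions above can be checked mechanically for any candidate graph. The sketch below assumes that the network is supplied as a list of (parent, child) edges with string node names; the function name and the example network are hypothetical. It checks only the degree conditions, and does not test for directed cycles or for distinct leaf labels.

```python
from collections import defaultdict

def check_network(edges):
    """Return a list of violations of the degree conditions (empty if none)."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    nodes = set()
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
        nodes.update((parent, child))

    problems = []
    roots = [n for n in nodes if indeg[n] == 0]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    for n in sorted(nodes):
        i, o = indeg[n], outdeg[n]
        if i == 0 and o != 2:
            problems.append(f"{n}: root must have outdegree 2, has {o}")
        elif i == 1 and o not in (0, 2):
            problems.append(f"{n}: indegree 1 requires outdegree 0 or 2, has {o}")
        elif i == 2 and o != 1:
            problems.append(f"{n}: reticulation node must have outdegree 1, has {o}")
        elif i > 2:
            problems.append(f"{n}: indegree {i} exceeds 2")
    return problems

# Hypothetical network with one reticulation node (h) and leaves A, C, D, E.
network = [("r", "u"), ("r", "v"), ("u", "A"), ("u", "h"), ("v", "h"),
           ("v", "w"), ("h", "C"), ("w", "D"), ("w", "E")]
violations = check_network(network)
```

An empty list means that the graph passes the degree conditions; a three-way split at a tree node, for example, is reported as a violation.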

Clearly, this definition places considerable restrictions on the networks that will be inferred by any mathematical algorithm, which in turn affects their use as models for biological inference.

The first thing to note is that unrooted networks are excluded, because the graph is directed. Furthermore, a tree is considered to have all internal nodes with indegree 1 and outdegree 2 (i.e. no reticulation nodes), and we know this to be biologically unrealistic, in general.

Biologically, the other parts of the definition imply:

One node of indegree 0
    – the network has no previous ancestry that is to be inferred
Nodes with outdegree 0 are labelled
    – observed (contemporary) organisms occur only at the leaves
All nodes with indegree 2 have outdegree 1
    – reticulation and speciation cannot occur simultaneously
No nodes with indegree >2
    – reticulation events cannot involve input from more than 2 parents simultaneously
No nodes with outdegree >2
    – speciation involves only two children at a time.

These do not appear to be onerous biological restrictions. Indeed, the first two have been standard characteristics of phylogenetic tree-building for several decades. The other three are also logical extensions of the restrictions that have previously been placed on trees. However, phylogenetic history is unlikely to have been as simple as implied by these features. Thus, biologists will need to keep a careful eye on whether the simplifications are affecting the networks inferred for their particular group of organisms.

5.2.2. Other Restrictions

In addition to the restrictions created by the definition, other topological restrictions have been used to make the mathematical inference algorithms computationally tractable. Thus, only certain sub-families of possible networks are considered by most of the computer programs. These include:

– tree-child network, tree-sibling network
– level-k network, galled tree
– binary input trees for Hybridization and Lateral Gene Transfer networks
– binary characters for Recombination networks.

These restrictions may be unrelated to each other; so, we can evaluate them separately.

Tree-Child, Tree-Sibling

In a tree-child network, every internal node has at least one child node that is a tree node; that is, a reticulation event cannot be followed immediately by another reticulation event. In a tree-sibling network, every reticulation node has at least one sibling node that is a tree node; that is, a parent cannot be directly involved in two separate reticulation events. Note that every tree-child network will also be a tree-sibling network, but not vice versa.
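Both properties can be tested directly from the edge list of a network, as in the sketch below. One assumption needs to be stated loudly: here any node with at most one parent, including a leaf, counts as a ‘tree node’ for the purpose of these two tests, since otherwise a reticulation node whose only child is a leaf would fail the tree-child condition. The function names and both example networks are hypothetical.

```python
from collections import defaultdict

def _tables(edges):
    parents, children = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)
    return parents, children

def is_tree_child(edges):
    """Every internal node has at least one child that is not a reticulation."""
    parents, children = _tables(edges)
    return all(
        any(len(parents.get(c, ())) <= 1 for c in kids)
        for kids in children.values()
    )

def is_tree_sibling(edges):
    """Every reticulation node has at least one sibling that is not a reticulation."""
    parents, children = _tables(edges)
    for node, pars in parents.items():
        if len(pars) < 2:
            continue  # not a reticulation node
        siblings = set().union(*(children[p] for p in pars)) - {node}
        if not any(len(parents.get(s, ())) <= 1 for s in siblings):
            return False
    return True

# Hypothetical network with a single reticulation node (h).
simple = [("r", "u"), ("r", "v"), ("u", "A"), ("u", "h"), ("v", "h"),
          ("v", "w"), ("h", "C"), ("w", "D"), ("w", "E")]

# Hypothetical network in which node p has two reticulation children (h1, h2).
stacked = [("r", "a"), ("r", "b"), ("a", "p"), ("a", "q1"), ("b", "q2"),
           ("b", "L"), ("p", "h1"), ("p", "h2"), ("q1", "h1"), ("q1", "x"),
           ("q2", "h2"), ("q2", "y"), ("h1", "c1"), ("h2", "c2")]
```

The second network is tree-sibling but not tree-child, which illustrates why the implication between the two classes runs in one direction only.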

Algorithmically, these two restrictions may involve the addition of extra tree nodes to an inferred network, in order to satisfy the restrictions. Biologically, the question is whether real networks are this simple. Arenas et al. (2008) simulated data under the coalescent with recombination, and found that even at small recombination rates most of the networks produced were already more complex than a tree-sibling network. On the other hand, Arenas et al. (2010) analyzed real population-level data from the PopSet and Polymorphix databases using the TCS computer program, and found that >98% of the resulting networks could be characterized as tree-sibling. So, there is cause for optimism, in the sense that the ‘optimum’ networks algorithmically are not necessarily complex, at least for closely related organisms (i.e. within species).

Level-k Network, Galled Tree

A network has level k if each tangled part of the network (i.e. each biconnected component) contains at most k reticulation nodes. This is a generalization of the older notion of galled trees, in which reticulation cycles do not overlap (i.e. do not share edges or nodes), as galled trees are level-1 networks. Level-k networks can also be seen as a generalization of networks with k reticulation nodes, although there may be a difference between a network with minimum level and one with a minimum number of reticulations.
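The level of a given network can be computed by finding the biconnected components of its underlying undirected graph and counting the reticulation nodes in each. The sketch below uses a standard depth-first search for this; it is an illustration rather than the procedure of any particular program. It assumes a binary network given as (parent, child) edges with no parallel edges, and it uses recursion, so it is suitable only for small examples. The function name and example networks are hypothetical.

```python
from collections import defaultdict

def network_level(edges):
    """Largest number of reticulation nodes in any biconnected component."""
    indeg = defaultdict(int)
    adj = defaultdict(set)  # underlying undirected graph
    for parent, child in edges:
        indeg[child] += 1
        adj[parent].add(child)
        adj[child].add(parent)

    disc, low, stack, blocks = {}, {}, [], []

    def dfs(u, parent):
        disc[u] = low[u] = len(disc)
        for v in adj[u]:
            if v == parent:
                continue
            if v not in disc:
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:  # u separates this block from the rest
                    block = set()
                    while True:
                        a, b = stack.pop()
                        block.update((a, b))
                        if (a, b) == (u, v):
                            break
                    blocks.append(block)
            elif disc[v] < disc[u]:  # back edge to an ancestor in the search
                stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for node in list(adj):
        if node not in disc:
            dfs(node, None)

    # A block with only two nodes is a single cut-edge and contains no cycle.
    return max((sum(indeg[n] == 2 for n in block)
                for block in blocks if len(block) > 2), default=0)

# Hypothetical networks: one reticulation cycle, and two overlapping cycles.
one_cycle = [("r", "u"), ("r", "v"), ("u", "A"), ("u", "h"), ("v", "h"),
             ("v", "w"), ("h", "C"), ("w", "D"), ("w", "E")]
overlapping = [("r", "a"), ("r", "b"), ("a", "c"), ("a", "h1"), ("b", "h1"),
               ("b", "d"), ("c", "x"), ("c", "h2"), ("d", "h2"), ("d", "y"),
               ("h1", "z1"), ("h2", "z2")]
```

A network with several reticulations can still have level 1, provided that its cycles lie in separate biconnected components; this is the distinction between minimizing the level and minimizing the number of reticulations.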

Algorithmically, these restrictions have been used to guide the search for (or choice of) the ‘optimal’ inferred network. Biologically, these notions do not seem to have been investigated, but basically they restrict how complex inferred reticulation histories can be. In particular, they restrict the complexity of any given subset of each network. It has been noted that optimizing k can easily lead to networks that look biologically unrealistic (Huson et al., 2011).

Binary Input

The requirements for binary input trees and binary characters are restrictions that have been applied in the past, because they greatly reduce the complexity of the input to the network algorithms, but they are now being relaxed. Effectively, the restrictions are to fully dichotomous trees and characters like single-nucleotide polymorphisms (SNP). These are not unusual restrictions in evolutionary analysis, but they are obviously unrealistic.

As I noted in section 5.1.7, non-binary data often reflect uncertainty in the input, rather than a strictly bifurcating history, and this is not taken into account in the network inference if the input is restricted to a binary state. In particular, it may be unnecessarily hard to construct a network (because not all of the data signals relate to reticulation), and the resulting networks may have far too many reticulation nodes.

5.3. What Are the Evolutionary Units and Pathways?

If we use an evolutionary network (rather than a tree) to represent the genealogy of a group of organisms then we are explicitly claiming that these organisms have a reticulate evolutionary history. For example, humans have long been considered to have a reticulate evolutionary history, both genetically and culturally (Moore, 1994), and anthropologists have therefore used networks as one of their representations of that intra-species history (Brace, 1981), although the consequences of reticulation for anthropological studies form an ongoing debate (Holliday, 2003; Arnold, 2009).

One of those consequences is that the organisms form what might be called fuzzy clusters. Gene exchange is predominantly within the clusters but there is still gene exchange between the clusters. Indeed, we might question whether distinct evolutionary lineages are worth recognizing when there is extensive reticulation in a network.

For example, genomically humans form fuzzy clusters, rather than discrete groups with sharp boundaries (Novembre et al., 2008; Lao et al., 2008). Inter-breeding is predominantly within the clusters, due to geographical and social isolation, with relatively little inter-breeding between the clusters. This creates a situation where gene-based distinctions between ‘races’ seem to be obvious to casual observers but where more detailed analysis reveals considerable complexity.

From the analysis point of view, the recognition of races is a model, and all models are wrong because they are simplifications of the real world. However, we are told that some models are more useful than others. So, the key question is this: Is the recognition of distinct evolutionary lineages a worthwhile model for interpreting a reticulated network? After all, the lineages may not form nested phylogenetic clusters, which is historically the basic criterion for recognizing them. The answer seems to be that, in practice, people have usually thought so.

For example, domesticated organisms provide other classic examples of genealogical reticulation. People recognize dog breeds, for instance, and we even have an official register of breeds at the Fédération Cynologique Internationale. However, dog breeds form fuzzy clusters rather than discrete groups, with many individual dogs being cross-breeds. In spite of this, a model of fuzzy clusters formed by a reticulate evolutionary history is still considered to be useful by dog breeders and owners. A similar thing can be said about the breeds of horses, cats and cows; and, indeed, also for almost all human-associated species (see Arnold, 2009).

In the non-domesticated part of biodiversity, taxonomists recognize subspecies, which often refer to morphologically distinguishable populations occupying geographically separated areas, but which are not otherwise genetically isolated. These subspecies can also form fuzzy clusters as a result of a reticulate evolutionary history, especially for plants. Once again, this is apparently a useful model, although there is no universal criterion for how much morphological difference it takes to delimit a subspecies.

However, the question here actually goes further than this, and asks about what should be the units of analysis in the first place. For example, if it is the dog breeds, then we are effectively excluding cross-bred dogs from the evolutionary history, unless they themselves form a new breed that is subsequently recognized. This issue has not yet been addressed in the literature.

5.4. Most Recent Common Ancestors

As noted in the previous section, the interpretation of an evolutionary network is confounded by the fact that descendants of reticulation nodes have complex ancestry. Therefore, the concept of a Most Recent Common Ancestor (MRCA) of a group of organisms is not as straightforward as it is for a tree, as there may be multiple paths from any one descendant back to its ancestors. This creates several possible interpretations of what we might mean by a MRCA, which need to be addressed by phylogeneticists when they are using an evolutionary network.

Figure 11a illustrates the calculation of the MRCA in a tree of five species (A-E), showing the MRCA of species C and D. To locate the MRCA, we simply trace each of the descendant species backward along the edges towards the root, and the ancestral node where all of these traces first intersect is the MRCA of those species.

Figure 11b illustrates a more complex history, involving two hybridization events. The incoming edges to the two reticulation nodes have arrows, to indicate their direction. The figure also recognizes several possible interpretations of the MRCA of species C and D (see Huson and Rupp, 2008; Fischer and Huson, 2010).

A conservative definition of the MRCA (or a stable MRCA) is the intersection of all paths from the descendants to the root, so that any reticulation pushes the MRCA back towards the root. In this example it pushes the MRCA all the way to the root. Alternatively, we could define the Lowest Common Ancestor (sometimes called the minimal common ancestor) as the shared ancestor that is furthest from the root along any path. That is, the LCA is not an ancestor of any other common ancestor of the species concerned.

In the mathematical terminology of lattices, which can have an algebraic or order theoretic definition, the Conservative MRCA is called the Least Lower Bound (LLB) and the LCA is called the Greatest Lower Bound (GLB).

Figure 11. Various concepts for the Most Recent Common Ancestor of a group of five species (A–E).

We could also have a biological compromise between these two mathematical concepts and recognize a Fuzzy MRCA, in which only a specified proportion of the paths (representing some proportion of the genomes) needs to be accommodated by the MRCA, thus keeping the MRCA close to the main collection of descendants (Fischer and Huson, 2010). In this example, the Fuzzy MRCA represents 75% of the genome of species C and 100% of the genome of species D. (The Conservative MRCA represents 100% for both species, by definition; and in this example the LCA represents 50% of the genome of species C and 100% of the genome of species D.)

However, neither the Fuzzy MRCA nor the LCA is necessarily unique, although the Conservative MRCA will always be unique. Figure 11c shows an example where there are two independent LCAs of species C and D. Neither of these LCAs is an ancestor of the other, as required by the definition, and so they are both equal candidates as LCA. Each one represents 50% of the genome for both species C and D.
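The difference between the two mathematical concepts, and the non-uniqueness of the LCA, can be made concrete with a short sketch. The two networks below are hypothetical examples constructed for illustration; they are not the networks drawn in Figure 11. The network is assumed to be acyclic and supplied as (parent, child) edges, and the function names are likewise assumptions.

```python
from collections import defaultdict

def ancestors(parents, node):
    """All proper ancestors of a node, following every incoming path."""
    seen, stack = set(), [node]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def lowest_common_ancestors(edges, taxa):
    """Common ancestors that are not ancestors of any other common ancestor."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    common = set.intersection(*(ancestors(parents, t) for t in taxa))
    return {c for c in common
            if not any(c in ancestors(parents, d) for d in common)}

def conservative_mrca(edges, root, taxa):
    """Lowest node lying on every path from the root to every listed taxon."""
    parents, children = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)

    def reachable_without(blocked):
        seen = set()
        stack = [] if blocked == root else [root]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(c for c in children.get(n, ()) if c != blocked)
        return seen

    common = set.intersection(*(ancestors(parents, t) for t in taxa))
    # A node is stable if deleting it cuts every taxon off from the root.
    stable = [v for v in common if not (set(taxa) & reachable_without(v))]
    # Stable nodes form a chain, so the lowest one has the most ancestors.
    return max(stable, key=lambda v: len(ancestors(parents, v)))

# Hypothetical network 1: C descends from a single hybridization (node h).
net1 = [("r", "u"), ("r", "v"), ("u", "A"), ("u", "h"), ("v", "h"),
        ("v", "w"), ("h", "C"), ("w", "D"), ("w", "E")]

# Hypothetical network 2: lineages p and q hybridize twice (h1 and h2).
net2 = [("r", "p"), ("r", "q"), ("p", "h1"), ("p", "h2"),
        ("q", "h1"), ("q", "h2"), ("h1", "C"), ("h2", "D")]
```

In the first network the LCA of C and D is the single node v while the Conservative MRCA is the root; in the second network both p and q qualify as LCAs, and only the Conservative MRCA (again the root) is unique.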

In terms of a lattice, Figure 11b is called a lower semi-lattice (or meet semi-lattice), because every pair of nodes has only one GLB, whereas Figure 11c is not a semi-lattice, because at least one node pair has more than one GLB.

This leads to the biological question of how we are best to interpret the MRCA in situations such as that represented by Figure 11c. This is a question that does not yet seem to have been addressed by biologists. Figure 11c does not represent an impossible evolutionary history, although it may be an unusual one because one lineage hybridizes with another lineage twice, presumably at different times.

The lack of a unique LCA is clearly problematic, as it almost defeats the purpose of the concept of a MRCA. It would certainly make life easier if we could restrict evolutionary networks to the class of lower semi-lattices.

An alternative is to restrict the MRCA concept to the Conservative MRCA. However, it is easy to imagine situations where this pushes the MRCA so far towards the root of the network as to be uninformative, especially in cases involving horizontal gene transfer, which can occur between widely separated evolutionary groups. If we insist that a eukaryote MRCA represent 100% of the genome, and we include non-nuclear genomes in the calculation, then the Conservative MRCA creates an extreme theoretical problem.

A Fuzzy MRCA may be the best compromise between these two extremes, although there are obvious practical issues for obtaining agreement on how much of the genome history is to be discounted from the MRCA.

Conclusion

The study of phylogenetic networks is still a relatively new field, in spite of the fact that biological interest in such networks goes back more than 150 years, and explicit mathematical interest has existed for 20 years or more. Bringing these two interests together has been a major challenge for bioinformatics.

In this chapter I have shown that phylogenetic networks are fundamentally different from other types of biological network, as a consequence of the fact that they are inferred networks rather than observed ones. I have also illustrated the ways in which they are currently used as data-display networks, and discussed the various issues that have arisen regarding their future use as evolutionary networks.

This raises the question of the extent to which phylogeneticists have adopted both types of network. It has been noted that thinking in terms of phylogenetic trees is often poorly done by biologists, and Baum and Smith (2012) have noted the following:

“We do not know why it should be so, but we have learned from working with thousands of students that, without contrary training, people tend to have a one-dimensional and progressive view of evolution. We tend to tell evolution as a story with a beginning, a middle, and an end. Against that backdrop, phylogenetic trees are challenging; they are not linear but branching and fractal, with one beginning and many equally valid ends. Tree thinking is, in short, counterintuitive.”

This leads me to an obvious question: If people have so much trouble going from a linear view of evolution to a tree-based view, are they having even more trouble going to a network-based view?

I cannot yet answer this question. At one extreme, maybe the big conceptual leap is going from a chain to a tree, and a network is just a complicated tree, so that the conceptual leap is not great. At the other extreme, maybe a tree is difficult because it is a set of linked and overlapping chains, and therefore a network is very difficult because it is a set of linked and overlapping trees. Maybe reality will turn out to be somewhere in between these two extremes. We will presumably find out how difficult things are after we have developed a set of widely used methods for constructing evolutionary networks.

References

Abby S.S., Tannier E., Gouy M., Daubin V. 2010. Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics 11: 324.

Anderson C.N., Liu L., Pearl D., Edwards S.V. 2012. Tangled trees: the challenge of inferring species trees from coalescent and noncoalescent genes. Methods in Molecular Biology 856: 3-28.

Andrews R.M., Kubacka I., Chinnery P.F., Lightowlers R.N., Turnbull D.M., Howell N. 1999. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nature Genetics 23: 147.

Arenas M., Patricio M., Posada D., Valiente G. 2010. Characterization of phylogenetic networks with NetTest. BMC Bioinformatics 11: 268.

Arenas M., Valiente G., Posada D. 2008. Characterization of reticulate networks based on the coalescent with recombination. Molecular Biology and Evolution 25: 2517-2520.

Arnold M.L. 2009. Reticulate Evolution and Humans: Origins and Ecology. Oxford University Press, New York.

Bandelt H-J. 2005. Exploring reticulate patterns in DNA sequence data. In: Bakker F.T., Chatrou L.W., Gravendeel B., Pelser P.B., eds. Plant Species-Level Systematics: New Perspectives on Pattern and Process, pp. 245-269. Koeltz, Königstein.

Baum D.A., Smith S.D. 2012. Tree Thinking: An Introduction to Phylogenetic Biology. Roberts & Company, Greenwood Village CO.

Bigoni F., Barsanti G. 2011. Evolutionary trees and the rise of modern primatology: the forgotten contribution of St. George Mivart. Journal of Anthropological Sciences 89: 93-107.

Blair C., Murphy R.W. 2011. Recent trends in molecular phylogenetic analysis: where to next? Journal of Heredity 102: 130-138.

Bloomquist E.W., Suchard M.A. 2010. Unifying vertical and nonvertical evolution: a stochastic ARG-based framework. Systematic Biology 59: 27-41.

Brace C.L. 1981. Tales of the phylogenetic woods: the evolution and significance of evolutionary trees. American Journal of Physical Anthropology 56: 411-429.

Bremer K., Wanntorp H.-E. 1979. Hierarchy and reticulation in systematics. Systematic Zoology 28: 624-627.

Briggs A.W., Good J.M., Green R.E., Krause J., Maricic T., Stenzel U., Lalueza-Fox C., Rudan P., Brajkovic D., Kucan Z., Gusic I., Schmitz R., Doronichev V.B., Golovanova L.V., de la Rasilla M., Fortea J., Rosas A., Pääbo S. 2009. Targeted retrieval and analysis of five Neanderthal mtDNA genomes. Science 325: 318-321.

Brown J.M., Hedtke S.M., Lemmon A.R., Moriarty Lemmon E. 2010. When trees grow too long: investigating the causes of highly inaccurate bayesian branch-length estimates. Systematic Biology 59: 145-161.

Buckley T., Cordeiro M., Marshall D., Simon C. 2006. Differentiating between hypotheses of lineage sorting and introgression in New Zealand Alpine cicadas Maoricicada dugdale. Systematic Biology 55: 411-425.

Buffon, comte de 1755. Histoire Naturelle Générale et Particulière, tome V. Imprimerie Royale, Paris.

Burleigh J.G., Bansal M.S., Eulenstein O., Hartmann S., Wehe A., Vision T.J. 2011. Genome-scale phylogenetics: inferring the plant tree of life from 18,896 gene trees. Systematic Biology 60: 117-125.

Cayley A. 1857. On the theory of the analytical forms called trees. Philosophical Magazine 13: 172-176.

Crawford N.G., Faircloth B.C., McCormack J.E., Brumfield R.T., Winker K., Glenn T.C. 2012. More than 1000 ultraconserved elements provide evidence that turtles are the sister group of archosaurs. Biology Letters 8: 783-786.

Darwin C. 1859. On the Origin of Species by Means of Natural Selection. John Murray, London.

Dickerman A.W. 1998. Generalizing phylogenetic parsimony from the tree to the forest. Systematic Biology 47: 414-426.

Donati V. 1750. Della Storia Naturale Marina dell’Adriatico. Francesco Storti, Venezia.

Duchesne A.N. 1766. Histoire Naturelle des Fraisiers. Didot le jeune & C.J. Panckoucke, Paris.

Ellison A.M. 2001. Exploratory data analysis and graphic display. In: Scheiner S.M., Gurevitch J., eds. Design and Analysis of Ecological Experiments, 2nd ed., pp. 37-62. Oxford University Press, Oxford.

Endicott P., Ho S.Y.W., Metspalu M., Stringer C. 2009. Evaluating the mitochondrial timescale of human evolution. Trends in Ecology and Evolution 24: 515-521.

Ermini L., Olivieri C., Rizzi E., Corti G., Bonnal R., Soares P., Luciani S., Marota I., De Bellis G., Richards M.B., Rollo F. 2008. Complete mitochondrial genome sequence of the Tyrolean Iceman. Current Biology 18: 1687-1693.

Fan Y., Wu R., Chen M.-H., Kuo L., Lewis P.O. 2011. Choosing among partition models in bayesian phylogenetics. Molecular Biology and Evolution 28: 523-532.

Farris J.S. 1970. Methods for computing Wagner trees. Systematic Zoology 19: 83-92.

Felsenstein J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland MA.

Fischer J., Huson D.H. 2010. New common ancestor problems in trees and directed acyclic graphs. Information Processing Letters 110: 331-335.

Gilbert M.T., Kivisild T., Grønnow B., Andersen P.K., Metspalu E., Reidla M., Tamm E., Axelsson E., Götherström A., Campos P.F., Rasmussen M., Metspalu M., Higham T.F., Schwenninger J.L., Nathan R., De Hoog C.J., Koch A., Møller L.N., Andreasen C., Meldgaard M., Villems R., Bendixen C., Willerslev E. 2008. Paleo-Eskimo mtDNA genome reveals matrilineal discontinuity in Greenland. Science 320: 1787-1789.

Gray R.D., Bryant D., Greenhill S. 2010. On the shape and fabric of human history. Philosophical Transactions of the Royal Society B 365: 3923-3933.

Gontier N. 2011. Depicting the Tree of Life: the philosophical and historical roots of evolutionary tree diagrams. Evolution: Education and Outreach 4: 515-538.

Green R.E., Malaspinas A.S., Krause J., Briggs A.W., Johnson P.L., Uhler C., Meyer M., Good J.M., Maricic T., Stenzel U., Prüfer K., Siebauer M., Burbano H.A., Ronan M., Rothberg J.M., Egholm M., Rudan P., Brajković D., Kućan Z., Gušić I., Wikström M., Laakkonen L., Kelso J., Slatkin M., Pääbo S. 2008. A complete Neanderthal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134: 416-426.

Hedges S.B. 2012. Amniote phylogeny and the position of turtles. BMC Biology 10: 64.

Hein J. 1990. Reconstructing evolution of sequences subject to recombination using parsimony. Mathematical Biosciences 98: 185-200.

Hein J. 1993. A heuristic method to reconstruct the history of sequences subject to recombination. Journal of Molecular Evolution 36: 396-405.

Hennig W. 1966. Phylogenetic Systematics. University of Illinois Press, Urbana IL. [Translated by D.D. Davis and R. Zangerl from W. Hennig 1950. Grundzüge einer Theorie der Phylogenetischen Systematik. Deutscher Zentralverlag, Berlin.]

Holland B.R., Delsuc F., Moulton V. 2005. Visualizing conflicting evolutionary hypotheses in large collections of trees: using consensus networks to study the origins of placentals and hexapods. Systematic Biology 54: 66-76.

Holland B.R., Huber K.T., Dress A., Moulton V. 2002. δ plots: a tool for analyzing phylogenetic distance data. Molecular Biology and Evolution 19: 2051-2059.

Holland B.R., Huber K.T., Moulton V., Lockhart P.J. 2004. Using consensus networks to visualize contradictory evidence for species phylogeny. Molecular Biology and Evolution 21: 1459-1461.

Holland B.R., Jermiin L.S., Moulton V. 2006. Improved consensus network techniques for genome-scale phylogeny. Molecular Biology and Evolution 23: 848-855.

Holliday T.W. 2003. Species concepts, reticulation, and human evolution [with discussion]. Current Anthropology 44: 653-673.

Huson D.H., Klöpper T., Lockhart P.J., Steel M.A. 2005. Reconstruction of reticulate networks from gene trees. Lecture Notes in Bioinformatics 3500: 233-249.

Huson D.H., Rupp R. 2008. Summarizing multiple gene trees using cluster networks. Lecture Notes in Bioinformatics 5251: 296-305.

Huson D.H., Rupp R., Scornavacca C. 2011. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press, Cambridge.

Huson D.H., Scornavacca C. 2011. A survey of combinatorial methods for phylogenetic networks. Genome Biology and Evolution 3: 23-35.

Huson D.H., Scornavacca C. 2012. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Systematic Biology 61: 1061-1067.

Ingman M., Gyllensten U. 2006. mtDB: Human Mitochondrial Genome Database, a resource for population genetics and medical sciences. Nucleic Acids Research 34: D749-751.

Jansson J., Sung W.-K. 2006. Inferring a level-1 phylogenetic network from a dense set of rooted triplets. Theoretical Computer Science 363: 60-68.

Jin G., Nakhleh L., Snir S., Tuller T. 2006a. Efficient parsimony-based methods for phylogenetic network reconstruction. Bioinformatics 23: e123-e128.

Jin G., Nakhleh L., Snir S., Tuller T. 2006b. Maximum likelihood of phylogenetic networks. Bioinformatics 22: 2604-2611.

Jin G., Nakhleh L., Snir S., Tuller T. 2007a. Inferring phylogenetic networks by the maximum parsimony criterion: a case study. Molecular Biology and Evolution 24: 324-337.

Jin G., Nakhleh L., Snir S., Tuller T. 2007b. A new linear-time heuristic algorithm for computing the parsimony score of phylogenetic networks: theoretical bounds and empirical performance. Lecture Notes in Bioinformatics 4463: 61-72.

Joly S., McLenachan P.A., Lockhart P.J. 2009. A statistical approach for distinguishing hybridization and incomplete lineage sorting. American Naturalist 174: E54-E70.

Kelchner S.A., Thomas M.A. 2006. Model use in phylogenetics: nine key questions. Trends in Ecology and Evolution 22: 87-94.

Knowles L.L., Kubatko L.S. (editors) 2010. Estimating Species Trees: Practical and Theoretical Aspects. Wiley-Blackwell, Hoboken NJ.

Krause C.D., Pestka S. 2005. Evolution of the Class 2 cytokines and receptors, and discovery of new friends and relatives. Pharmacology and Therapeutics 106: 299-346.

Krause J., Briggs A.W., Kircher M., Maricic T., Zwyns N., Derevianko A., Pääbo S. 2010b. A complete mtDNA genome of an early modern human from Kostenki, Russia. Current Biology 20: 231-236.

Krause J., Fu Q., Good J.M., Viola B., Shunkov M.V., Derevianko A.P., Pääbo S. 2010a. The complete mitochondrial DNA genome of an unknown hominin from southern Siberia. Nature 464: 894-897.

Kubatko L.S. 2009. Identifying hybridization events in the presence of coalescence via model selection. Systematic Biology 58: 478-488.

Kubatko L.S., Meng C. 2010. Accommodating hybridization in a multilocus phylogenetic network. In: Knowles L.L., Kubatko L.S. eds. Estimating Species Trees: Practical and Theoretical Aspects, pp. 99-113. Wiley-Blackwell, Hoboken NJ.

Lamarck, chevalier de 1809. Philosophie Zoologique. Dentu et l’Auteur, Paris.

Lanfear R., Calcott B., Ho S.Y.W., Guindon S. 2012. PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution 29: 1695-1701.

Lao O., Lu T.T., Nothnagel M., Junge O., Freitag-Wolf S., Caliebe A., Balascakova M., Bertranpetit J., Bindoff L.A., Comas D., Holmlund G., Kouvatsi A., Macek M., Mollet I., Parson W., Palo J., Ploski R., Sajantila A., Tagliabracci A., Gether U., Werge T., Rivadeneira F., Hofman A., Uitterlinden A.G., Gieger C., Wichmann H.-E., Rüther A., Schreiber S., Becker C., Nürnberg P., Nelson M.R., Krawczak M., Kayser M. 2008. Correlation between genetic and geographic structure in Europe. Current Biology 18: 1241-1248.

Le S.Q., Lartillot N., Gascuel O. 2008. Phylogenetic mixture models for proteins. Philosophical Transactions of the Royal Society of London, B: Biological Sciences 363: 3965-3976.

Lewontin R. 2011. The genotype/phenotype distinction. In: Stanford Encyclopedia of Philosophy. http://plato.stanford.edu/entries/genotype-phenotype/

Linnaeus C. 1751. Philosophia Botanica. G. Kiesewetter, Stockholm, & Z. Chatelain, Amsterdam.

Linné C. von 1792. Praelectiones in Ordines Naturales Plantarum. B.G. Hoffmann, Hamburg. [Transcriptions by J.C. Fabricius and P.D. Giseke of Linné’s posthumous collection of lectures about plant families.]

Lyngsø R.B., Song Y.S., Hein J. 2008. Accurate computation of likelihoods in the coalescent with recombination via parsimony. Lecture Notes in Computer Science 4955: 463-477.

Margolin A.A., Nemenman I., Basso K., Wiggins C., Stolovitzky G., Dalla Favera R., Califano A. 2006. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7 Suppl 1: S7.

Marshall D.C. 2010. Cryptic failure of partitioned bayesian phylogenetic analyses: lost in the land of long trees. Systematic Biology 59: 108-117.

Meng C., Kubatko L.S. 2009. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theoretical Population Biology 75: 35-45.

Mereschkowsky C. 1910. Theorie der zwei Plasmaarten als Grundlage der Symbiogenese, einer neuen Lehre von der Entstehung der Organismen. Biologisches Centralblatt 30: 278-303, 321-347, 353-367.

Mivart St.G. 1865. Contributions towards a more complete knowledge of the axial skeleton in the primates. Proceedings of the Zoological Society of London 33: 545-592.

Mivart St.G. 1867. On the appendicular skeleton of the primates. Philosophical Transactions of the Royal Society of London 157: 299-429.

Moore J.H. 1994. Putting anthropology back together again: the ethnogenetic critique of cladistic theory. American Anthropologist 96: 925-948.

Morrison D.A. 2005. Networks in phylogenetic analysis: new tools for population biology. International Journal for Parasitology 35: 567-582.

Morrison D.A. 2006. Phylogenetic analyses of parasites in the new millennium. Advances in Parasitology 63: 1-124.

Morrison D.A. 2010. Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology and Evolution 27: 1044-1057.

Morrison D.A. 2011. Introduction to Phylogenetic Networks. RJR Productions, Uppsala.

Nakhleh L., Jin G., Zhao F., Mellor-Crummey J. 2005. Reconstructing phylogenetic networks using maximum parsimony. In: Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, pp. 93-102. IEEE Computer Society, Washington DC.

Nakhleh L., Warnow T., Linder C.R., St John K. 2005. Reconstructing reticulate evolution in species: theory and practice. Journal of Computational Biology 12: 796-811.

Newman M.E.J. 2010. Networks: An Introduction. Oxford University Press, Oxford.

Novembre J., Johnson T., Bryc K., Kutalik Z., Boyko A.R., Auton A., Indap A., King K.S., Bergmann S., Nelson M.R., Stephens M., Bustamante C.D. 2008. Genes mirror geography within Europe. Nature 456: 98-101.

Online Etymology Dictionary. 2010. http://www.etymonline.com/index.php?term=network

Orlando L., Hänni C., Douady C.J. 2007. Mammoth and elephant phylogenetic relationships: Mammut americanum, the missing outgroup. Evolutionary Bioinformatics 3: 45-51.

Pagel M., Meade A. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Systematic Biology 53: 571-581.

Pax F.A. 1888. Monographische Übersicht über die Arten der Gattung Primula. Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 10: 75-241.

Penny D. 2011. Darwin’s theory of descent with modification, versus the biblical Tree of Life. PLoS Biology 9: e1001096.

Pickrell J.K., Pritchard J.K. 2012. Inference of population splits and mixtures from genome-wide allele frequency data. Unpublished ms: http://arxiv.org/abs/1206.2332

Pietsch T.W. 2012. Trees of Life: A Visual History of Evolution. Johns Hopkins University Press, Baltimore MD.

Proulx S.R., Promislow D.E.L., Phillips P.C. 2005. Network thinking in ecology and evolution. Trends in Ecology and Evolution 20: 345-353.

Radice R. 2011. A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.

Ragan M. 2009. Trees and networks before and after Darwin. Biology Direct 4: 43.

Sánchez-Gracia A., Castresana J. 2012. Impact of deep coalescence on the reliability of species tree inference from different types of DNA markers in mammals. PLoS ONE 7: e30239.

Sánchez-Quinto F., Schroeder H., Ramirez O., Avila-Arcos M.C., Pybus M., Olalde I., Velazquez A.M., Marcos M.E., Encinas J.M., Bertranpetit J., Orlando L., Gilbert M.T., Lalueza-Fox C. 2012. Genomic affinities of two 7,000-year-old Iberian hunter-gatherers. Current Biology 22: 1494-1499.

Sarder P., Schierding W., Cobb J.P., Nehorai A. 2010. Estimating sparse gene regulatory networks using a bayesian linear regression. IEEE Transactions on Nanobioscience 9: 121-131.

Schnittger L., Rodriguez A.E., Florin-Christensen M., Morrison D.A. 2012. Babesia: a world emerging. Infection, Genetics and Evolution 12: 1788-1809.

Skoglund P., Malmström H., Raghavan M., Storå J., Hall P., Willerslev E., Gilbert M.T., Götherström A., Jakobsson M. 2012. Origins and genetic legacy of Neolithic farmers and hunter-gatherers in Europe. Science 336: 466-469.

Sneath P.H.A. 1975. Cladistic representation of reticulate evolution. Systematic Zoology 24: 360-368.

Snir S., Tuller T. 2009. The NET-HMM approach: phylogenetic network inference by combining maximum likelihood and hidden markov models. Journal of Bioinformatics and Computational Biology 7: 625-644.

Soltis D.E., Smith S.A., Cellinese N., Wurdack K.J., Tank D.C., Brockington S.F., Refulio-Rodriguez N.F., Walker J.B., Moore M.J., Carlsward B.S., Bell C.D., Latvis M., Crawley S., Black C., Diouf D., Xi Z., Rushworth C.A., Gitzendanner M.A., Sytsma K.J., Qiu Y.L., Hilu K.W., Davis C.C., Sanderson M.J., Beaman R.S., Olmstead R.G., Judd W.S., Donoghue M.J., Soltis P.S. 2011. Angiosperm phylogeny: 17 genes, 640 taxa. American Journal of Botany 98: 704-730.

Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690.

Stevens P.F. 1994. The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. Columbia University Press, New York.

Strimmer K., Moulton V. 2000. Likelihood analysis of phylogenetic networks using directed graphical methods. Molecular Biology and Evolution 17: 875-881.

Strimmer K., Wiuf C., Moulton V. 2001. Recombination analysis using directed graphical models. Molecular Biology and Evolution 18: 97-99.

Tassy P. 2011. Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Than C., Ruths D., Innan H., Nakhleh L. 2007. Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14: 517-535.

Tukey J.W. 1977. Exploratory Data Analysis. Addison-Wesley, Reading MA.

von Haeseler A., Churchill G.A. 1993. Network models for sequence evolution. Journal of Molecular Evolution 37: 77-85.

Wägele J.W., Letsch H., Klussmann-Kolb A., Mayer C., Misof B., Wägele H. 2009. Phylogenetic support values are not necessarily informative: the case of the Serialia hypothesis (a mollusk phylogeny). Frontiers in Zoology 6: 12.

Wägele J.W., Mayer C. 2007. Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology 7: 147.

Wang N., Braun E.L., Kimball R.T. 2012. Testing hypotheses about the sister group of the Passeriformes using an independent 30-locus data set. Molecular Biology and Evolution 29: 737-750.

Wichmann S., Holman E.W., Rama T., Walker R.S. 2011. Correlates of reticulation in linguistic phylogenies. Language Dynamics and Change 1: 205-240.

Winkworth R.C., Bryant D., Lockhart P.J., Havell D., Moulton V. 2005. Biogeographic interpretation of split graphs: least squares optimization of branch lengths. Systematic Biology 54: 56-65.
