Data for Phylogenetics

Data for phylogenetic analysis

The data that are used to estimate the phylogeny of a set of tips are the characteristics of those tips. Therefore, the success of phylogenetic inference depends in large measure on the choice of trait data, the accuracy of those data, and the quantity of data obtained. Whether or not you plan to engage in phylogenetic analysis, it is important to know the kinds of data that are typically used and to understand how data are organized to permit phylogenetic analysis. In this chapter I introduce the concept of a data matrix, and focus initially on the two kinds of data that are most important to know about: DNA sequence data and morphological data. I end by surveying the full range of data types that have at one time or another been used for phylogenetic analysis.

A character-state data matrix

The first step in a phylogenetic analysis is to decide on the organisms or species that will serve as tips. The tips scored for traits in a phylogenetic analysis are usually called taxa (singular: taxon). [This usage of the term taxon/taxa is similar but not identical to the “formally named groups or clades” used in the context of biological classification.] Enough taxa need to be included in a phylogenetic analysis so that the tree will resolve the phylogenetic questions that motivate the study. This will almost always mean including multiple taxa from within the group of interest (the ingroup) and at least one, but perhaps many, outgroups: taxa that are known not to be within the ingroup. The reason for including such outgroups was discussed briefly in chapter 6, and will be covered in more detail in chapter 9. The choice of taxa is also heavily influenced by practical issues, such as the availability of material for study.

Imagine, for example, that you are studying the phylogeny of the Carnivora, a group of mammals that share a number of dental and skeletal traits that support their monophyly.
Your ingroup would likely include at least one representative of each of the previously identified clades from within Carnivora: dogs (Canidae), cats (Felidae), hyaenas (Hyaenidae), otters (Mustelidae), raccoons (Procyonidae), bears (Ursidae), civets (Viverridae), and seals (Pinnipedia). Additionally, individual taxa whose relationships are uncertain, in this case the giant and red pandas, might be included in the study. As outgroups, you could theoretically include any taxon that is not a member of Carnivora. However, the best outgroups are usually ones that are reasonably closely related to the ingroup so that they can be compared directly for many traits. For example, you might use non-Carnivora mammals that have previously been found to be closely related to Carnivora, such as members of the horse/rhino/tapir clade.

Having selected the set of taxa to include in a phylogenetic study, the next step is to collect information on the traits of those taxa. In order to keep track of the similarities and differences among the taxa, the data are organized into a data matrix. Two kinds of data matrix are important to know about.

A character-state matrix is a list of taxa and the state that they manifest for each of a set of characters. A character-state matrix has one entry for each taxon for each character scored. Thus, for T taxa and C characters, the total number of data points is T × C.

A distance matrix is a listing of the overall dissimilarity (or, more rarely, similarity) of each pair of taxa. A distance matrix has T × (T - 1) / 2 entries. Given that we typically score many more characters than taxa, distance matrices usually include less total information than character-state matrices.

In the case of morphological data, a character is a measurable attribute of an organism that has the potential to be present in distinct states. Some examples of characters and character states are given in Table 1. Note that the characters include hard and soft body parts and behavioral traits. Further, some characters are presence/absence, some show discrete variation (e.g., number of teeth), and others show truly continuous variation (e.g., color, size) that may be broken up into discrete states based on patterns of variation seen among the study taxa. Character scoring issues will be discussed further below.
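To make the two kinds of matrix concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a standard tool: the function and variable names are invented for this example, the three rows of 0/1 scores are taken from the example morphological matrix given later in this chapter, and the distance used is the simplest possible one, the proportion of characters at which two taxa differ. Unknown (‘?’) and polymorphic scores are not handled.

```python
from itertools import combinations

# Character-state matrix: one row per taxon, one column per character.
# The scores are three rows of this chapter's example morphological matrix.
matrix = {
    "Horse": [0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
    "Cat":   [0, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    "Dog":   [0, 0, 1, 1, 0, 1, 0, 1, 1, 1],
}


def proportion_different(row_a, row_b):
    """Proportion of characters at which two taxa differ in state."""
    differences = sum(1 for a, b in zip(row_a, row_b) if a != b)
    return differences / len(row_a)


def distance_matrix(char_matrix):
    """Return {(taxon1, taxon2): distance} for every pair of taxa."""
    return {
        (t1, t2): proportion_different(char_matrix[t1], char_matrix[t2])
        for t1, t2 in combinations(char_matrix, 2)
    }


T = len(matrix)                       # number of taxa
C = len(next(iter(matrix.values())))  # number of characters
distances = distance_matrix(matrix)

print(T * C)           # entries in the character-state matrix
print(len(distances))  # pairwise distances, T * (T - 1) / 2
print(distances[("Cat", "Dog")])
```

With three taxa and ten characters the character-state matrix holds 30 entries, whereas the distance matrix holds only three.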

Table 1. Examples of characters and character states.

Character                 States: actual                        States: numerical
Hair color                White, brown, black                   0, 1, 2
Number of lower molars    1, 2, 3                               0, 1, 2
External ear (pinna)      Present, absent                       0, 1
Life habit                Terrestrial, amphibious, marine       0, 1, 2
Adult body mass           < 50g, 50-500g, 500-2000g, >2000g     0, 1, 2, 3

Having decided on a list of characters and character states, you can then score them for each taxon. The scoring is usually conducted on one or a limited number of individual organisms that are considered to represent the taxon of interest (e.g., a species). If all the individuals examined for a taxon have the same state, that state is scored unambiguously. If there is variation among the individuals of one taxon for a trait, the trait is scored so as to keep track of the fact that there is variation, or polymorphism, within this taxon.

Normally scoring is facilitated by assigning numerical values to each state of each character, conventionally starting with ‘0’. The numerical values assigned are arbitrary, except that for characters where there is an intrinsic ordering among the states (for example, molar number and body mass in the above example) one preserves the ordering of states, although it does not matter whether they are ordered from high to low or low to high. In the Hennigian Method, ancestral states or states occurring in the outgroup would usually be assigned state ‘0’, but this convention is no longer followed.

An example of part of a typical morphological data matrix is given below. The taxa are listed on the left. Following each taxon are the values for each of the ten characters in this small data matrix. Notice that some taxa may be scored as unknown for certain characters (conventionally represented with ‘?’), either because we are ignorant as to the proper scoring or because the character is impossible to score (e.g., toe number in snakes). Also, a taxon can be scored as polymorphic by listing multiple states within a cell, as in the cells ‘12’ and ‘01’ below.

Character number
                    1   2   3   4   5   6   7   8   9   10
Horse (outgroup)    0   1   0   0   1   1   0   0   0   0
Cat                 0   1   1   1   1   1   0   1   1   1
Dog                 0   0   1   1   0   1   0   1   1   1
Bear                1   0   1   12  1   2   0   2   1   1
Otter               0   0   2   2   1   2   0   1   1   01
Seal                1   ?   2   2   1   0   0   2   1   ?

In the case of DNA sequence data, a character-state matrix looks very similar. The taxa correspond to individual organisms from the taxon whose DNA was extracted and sequenced. The characters in this case correspond to particular positions in a gene’s sequence. Therefore, unlike morphological data, the character number is not arbitrary but indicates the relative position of the nucleotide within the region that has been sequenced. The four states correspond to the four different bases that can occupy each position. As with morphological data, you can be ignorant as to the identity of a base in a particular position (‘?’) or you can find multiple alternative states within the taxon. Additionally, you might find that some taxa have a portion of the sequence missing entirely due to insertions or deletions of DNA in the course of evolution. These are generally indicated with a gap code, ‘-’, but are most commonly treated as equivalent to missing data. The phylogenetic analysis of gaps and the determination of where insertion/deletion events have occurred raise a number of complex issues that will be touched on when we cover sequence alignment.

Sequence position
                    1   2   3   4   5   6   7   8   9   10
Horse (outgroup)    A   T   G   G   A   T   C   A   A   C
Cat                 A   T   ?   G   A   C   C   A   A   C
Dog                 A   T   G   G   A   C   C   A   T   CT
Bear                A   T   G   A   A   C   C   A   T   C
Otter               A   T   G   G   A   C   C   A   T   C
Seal                A   T   G   A   A   C   -   -   -   C

A distance matrix

A distance matrix differs from a character-state data matrix in that, instead of including entries for T taxa and C characters, it includes a single value that summarizes the evolutionary difference (dissimilarity), or its converse (similarity), of each pair of taxa. For T taxa, there are T × (T - 1) / 2 pairwise distances. An example is shown:


          sealion  walrus   seal     bear     raccoon  weasel   dog      civet    hyaena   cat
horse     0.290    0.289    0.297    0.270    0.293    0.299    0.288    0.250    0.274    0.250
sealion   -        0.028    0.058    0.134    0.156    0.148    0.187    0.196    0.207    0.202
walrus             -        0.055    0.135    0.155    0.147    0.188    0.194    0.209    0.197
seal                        -        0.134    0.161    0.154    0.181    0.198    0.205    0.199
bear                                 -        0.139    0.156    0.185    0.179    0.193    0.181
raccoon                                       -        0.130    0.205    0.214    0.221    0.214
weasel                                                 -        0.205    0.208    0.219    0.213
dog                                                             -        0.210    0.217    0.202
civet                                                                    -        0.092    0.081
hyaena                                                                            -        0.092

In this case the matrix lists the difference (dissimilarity) of each pair of taxa, meaning that the higher the number the greater the distance between two taxa. Sometimes a matrix will show pairwise similarity, in which case the higher the number the shorter the distance between the two taxa. Immunological cross-reactivity and DNA-DNA hybridization data, which are rarely collected nowadays, are two kinds of data that are obtained directly in the form of pairwise similarity or difference.

More commonly a distance matrix is derived from a character-state matrix. The simplest way to convert a character-state matrix into a distance matrix is to calculate the proportion of characters for which a pair of taxa differ in state (and repeat this for all pairs of taxa). For example, as shown below, the dog and cat in the example matrix differ at two out of 10 morphological characters, representing a dissimilarity of 0.2 (or a similarity of 0.8).

Cat    0 1 1 1 1 1 0 1 1 1
Dog    0 0 1 1 0 1 0 1 1 1

There are many complications in calculating distances, a few of which will be discussed in Chapter 11. It is worth noting that, because many different character-state matrices can yield the same distance matrix, you cannot infer a character-state matrix from a distance matrix.

Generating an aligned DNA sequence data matrix

Most published phylogenetic results are derived from analysis of DNA sequence data. To provide some context, it will be useful to briefly review the experimental methods that are used to collect DNA sequence data. Sequence databases such as GenBank contain a phenomenal amount of sequence data from a very large number of different species. As a result, it is sometimes possible to conduct a phylogenetic study without collecting any new sequence data. All that is required is to download a particular annotated gene from a set of taxa of interest.
However, because the best phylogenetic research utilizes large quantities of data (many genes) for a representative set of taxa, such work still usually requires the generation of new DNA sequence data for the specific purpose of the phylogenetic study.


Usually phylogenetic research begins with the choice of a gene to study. The aim is to pick a gene with a suitable level of variation: with too little variation the research is not deploying resources efficiently, whereas with too much variation the data are likely to be messy (for reasons that will be discussed). Additionally, it is desirable to select a gene that is present as a single copy in all the organisms being studied, because extra copies can confound interpretation of the resulting trees (Chapter 13). Lastly, practical issues, such as the amount of preexisting data and the ease with which the gene can be isolated, always influence the choice.

Appendix 7.1.i provides a brief summary of the molecular methods used to obtain DNA sequences for each taxon in a data matrix. The sequences are composed of the letters A, C, G, and T, corresponding to the four bases adenine, cytosine, guanine, and thymine. There may also be additional letters such as R, Y, S, W, M, and K, which are used to indicate uncertainty in the identity of a base at a certain position (for example, R means that either an A or G is present, Y refers to C or T, etc.). ‘N’ indicates that an unknown nucleotide is present, whereas ‘?’ is used to indicate a position where the editor was uncertain whether there was a base at all, for example if the data are ambiguous as to whether there are 9 or 10 A’s in a row in the sequence. Here are two sequences as they might be obtained.
CGTTTATGGTGACGGAGCCGGGGGAGGTAGCACGTGGCAAAAAGAACGGCCTCGATTATCTCTTCCATCTTTACGAACAGTGCCGGGAGTTCTTGATTCAAGTCCAAAACATCGCCAAGGAGCGCGGCGAAAAATGCCCCACCAAGGTAACAATAGAAACAAATCTATTTTTAATGTTTCTTAAGTAAAATTTTGAATTCAAGCTCCGTAAATGAATGAAAATATGAGAAATATCCTGTTTTTGATCCGATTCTCATGGAAAAATATGAAACTAGGATAGTTTTTGCATGGTGCACGAGGTTTGACACGGGACTAGCTGTAAAAACAAGGCTGTCTCTGTTAGAATCTTAGAACTGGACCAGCCCTCCCATTAAAGCTAGGGTTTCTAGCCCATGAAAATGTGACAACTCAGGTACGGGGAGGAATGGAGTCTGAAAACTTGGGACATGTATGTCTAAATTTTTGCAGAGTAAGGTCCCCTCCGCCCCAAAAGGTTGTAC?TTTTGTCTTTAAASACTTTACTGTCTTCCTTTCTGAAGCCTCGTTTTCCCTGTCCGGTTTAGCTGAGGTGGCGTGACCCTAATACGACAGCTCCACCAYTTTTGGATCCTAATCTTATTGCTTATACAGGTGACCAACCAAGTTTTCAGATATGCTAAGAAGGCTGGGGCGAGCTACATTAACAARCCCAAAATGMGCCATTACGTCGGCAGGA

TCACCCACGACCGTTCATGGTGACGGAGCCGGGGGAGGTAGCAAGTGGCAAAAAGAACGGCCTCGATTATCTCTTCCATCTTTACGAGCAGTGCAGGGAGTTCTTGATTCAAGTCCAAAACATCGCCAAGGAACGCGGCGAAAAATGCCCCACGAAGGTAACAATAGAAACAAATCTATTTTTAATGATTCTTAAGTAAAATTTTGAATTCAAGCGTAAATGAATGAAATATGAGAAATATCCTGTTTTTGATCCGATTCTCATGGAAAAATATGAAACTAGGATAGTTTTTGCATGGTGCACGAGGTTTGACACGTGACTAGCTGTAAAAACAAGGCTGTCTCTGTTAGAATCTTAGAACTGGACCAACCCTCCCATTAAAGCTAGGGTTTCTAGCCCATGAAAATGTGACAACTCAGGTACGGGGAGGAATGGAGTCTGAAAACTTGGGACATGTATGTCTAAATTTTTGCAGAGTAAGGTCCCCTCCGCCCCAAAAGGTTGTACTTTTTGTCTTTAAACACTTTACTGTCCTCCTTTCTGAAGCCTCGTTTTCCCCTGTCCGGTTGAGCTGAGGTGGCGTGACCCTAATACGACAGCTCCATTGGATCCTAACCTTGTTACTTATACAGGTGACCAACCAAGTCCTCAGATATGCTAAGAAGGCTGGGGCGAGCTACATTAACAAACCCAAAATGCGCCACTATGTC

You will see that they differ slightly in length. In order to use these sequences for phylogenetic analysis, the first step is to align them to one another so that a nucleotide position in one is matched up with the homologous position in all the other sequences.

A data matrix is composed of characters that are shared by the taxa but which potentially differ in state: for example, hair (the character) may be white, brown, or black (the character states). For DNA sequences, the character is the nucleotide position (numbered 1, 2, 3, etc.) and the states are the nucleotides (A, C, G, and T). It is critical, therefore, that nucleotides in each taxon be assigned to the correct positions. This process, called sequence alignment, involves sliding the sequences over one another and inserting gaps, guided by the sequences themselves. Sequence alignment, done properly, poses severe computational challenges and has become a very technical subject. Here, I will just summarize the underlying issues and point to some additional resources.

A DNA strand is a physical structure with nucleotides in a specific linear order. In the simple case where the only kind of mutation is base substitution, each nucleotide position in one taxon would be homologous to a nucleotide position at the same place in the sequence of another taxon: position 1 in taxon A will be homologous to position 1 in taxon B, position 2 to position 2, and so on. If we write out the two sequences, the homologous positions are aligned above one another.

Parent:     G T A T T G A C C A C T G A C T A G C A T
            | | | | | | | | | | | | | | | | | | | | |
Offspring:  G C A T T A A C C A T T G T C T A G C A A

If the only kind of mutation were base substitution, then having found the homologous genes you would merely need to line up one homologous position, and the rest of the alignment would be trivial. However, sequences are subject to additional kinds of mutation: deletions, insertions, inversions, and translocations.

A deletion involves the removal of one or more contiguous bases. Deletions may be due to errors during DNA replication, but can also happen during the “life” of a DNA molecule due to imperfect DNA repair following chemical or radiation-induced damage, unequal crossing-over during recombination, or the actions of mobile genetic elements. Deletions can be very short (even one base pair) or very long (entire genes). When deletions happen, nucleotide positions in the parent strand lack homologs in the daughter strand: they have gone extinct.

Parent:     G T A T T G A C C A C T G A C T A G C A T
            | | | | | | | | | | | | | | | | | | | | |
Offspring:  G C A T T - - - - - T T G T C T A G C A A

The same mechanisms that cause deletions (errors during replication and recombination, DNA damage, and mobile genetic elements) can cause the insertion of DNA sequences into a strand. In some cases inserted sequences are duplicates of nucleotide positions in the parent strand. In that case, bases in the insertion have homologous positions in the parent strand (but one parental position may be homologous to two daughter positions). In other cases the inserted sequence will be novel and will lack identifiable homologous positions: in effect, a new nucleotide position has been created from scratch.

Parent:     G T A T T G A C C - - - A C T G A C T A G C A T
            | | | | | | | | | | | | | | | | | | | | | | | |
Offspring:  G C A T T A A C C A C C A T T G T C T A G C A A

Inversion events occur when a piece of DNA is effectively cut out and then replaced in the opposite orientation.

Parent:     G T A T T G A C C A C T G A C T A G C A T
Offspring:  G C A T T A G T C A C C A T C T A G C A A

A translocation involves a piece of sequence being moved to a different position within the sequence.

Parent:     G T A T T G A C C A C T G A C T A G C A T
Offspring:  G C A T C A T T T A A C G T C T A G C A A

Inversions and translocations both yield cases in which there is a single nucleotide position in a daughter strand that is homologous to each position in a parent strand, but the sequences are not in the same linear order. While inversions and translocations are commonly identified, the major focus of most alignment programs is on inserting gaps in sequences to capture the history of insertions and deletions, collectively indels.

The process of sequence alignment aims to align homologous positions based on the true history of sequence evolution. Alignment is, thus, properly viewed as a problem of historical inference. Furthermore, because base substitutions, indels, and other structural mutations occurred along the branches of the true gene tree, sequence alignment and tree inference are really two aspects of the same problem. Therefore, in the ideal world, we would have computer programs that could take raw, unaligned sequences and search for trees that could simultaneously account for the bases in the sequences and their structural evolution. A few programs do conduct combined alignment and phylogenetic inference (e.g., POY; http://research.amnh.org/scicomp/projects/poy.php). However, the problem is so computationally challenging that it is necessary to make a number of unrealistic assumptions to make the programs work. Therefore, the vast majority of phylogenetic analyses separate the two problems: first generating an alignment, and then provisionally accepting that alignment as the basis for phylogenetic inference.

To get a feel for how sequence alignment can be conducted free of a phylogeny, see if you can align the following pair of sequences.

A T G A C C A G T A C G G C T T T A
A T G A T C G A T A T G G C A T T A

You might conclude that these sequences are already well aligned and that the two sequences differ by five base substitution events.
While this might be the best alignment, it is worth considering alternatives that can also explain these data through the addition of insertion/deletion, or indel, events. For example, you could align these same two sequences by invoking eight indels and no substitutions, or three base substitutions and two indels. These three alternatives are shown.

Five substitutions:
A T G A C C A G T A C G G C T T T A
A T G A T C G A T A T G G C A T T A

Eight indels:
A T G A C - C A G - - T A - C G G C - T T T A
A T G A - T C - - G A T A T - G G C A - T T A

Three substitutions and two indels:
A T G A C C A G - - T A C G G C T T T A
A T G A T C - - G A T A T G G C A T T A

To choose between these we need to ask ourselves whether it is more likely that there were five substitutions, eight indels, or three substitutions and two indels. Assuming that the general rate of evolution is low, it is probably reasonable to favor five events over eight, thereby rejecting the second alignment. Furthermore, most data from molecular biology would say that base substitutions are more frequent than indels (especially in coding genes), which means that we would normally favor the first alignment.

This example allows us to state a rule that is applied in almost all sequence alignment programs: only invoke an indel if at least one base substitution is avoided. Indeed, it is normal to set the gap penalty, the threshold for the number of base substitutions avoided before an indel is inferred, higher still: gap penalties of three to twenty are common. Additionally, most computer programs impose an extra cost for longer gaps or gaps at the ends to avoid alignments such as the following, which avoids all base substitutions at the “cost” of two indels.

- - - - - - - - - - - - - - - - - - A T G A C C A G T A C G G C T T T A
A T G A T C G A T A T G G C A T T A - - - - - - - - - - - - - - - - - -

That being said, a smart alignment program will permit gaps to be inserted at the end in cases such as the following.

A T G A C C A G T A C G G C T T T A
A G C A T G G C T A T A G A T A C C

A T G A C C A G T A C G G C T T T A - - - - - -
- - - - - - A G C A T G G C T A T A G A T A C C

Here the “gaps” probably do not represent indel events but, rather, sequencing reactions that started and/or ended at different positions in the sequence. A purist might therefore change the gaps to uncertainty codes (below), although this will not affect phylogeny reconstruction programs.

A T G A C C A G T A C G G C T T T A ? ? ? ? ? ?
? ? ? ? ? ? A G C A T G G C T A T A G A T A C C

It is probably clear that alignment is easiest and most certain when both base substitutions and indels are rare. This is because matched parts of the sequence provide a framework for identifying the position and size of indel events. For example, below are two true alignments. Which do you think would yield data that were easier to align?

A T G A - - - T G C A G C T T T A G G T A
? C A A C A G T A C G A - - C T A C - C A

A T G A C C A G T A C A G - T T T A ? ? ?
A C G T C C - - T A C G G C T T C A G T A

The answer is the second one. While the number of indels is similar, the many extra base substitutions in the top case would make it very hard to identify the true alignment with confidence.

Computer programs can often squeeze more information out of sequences by taking into account not just the number of base substitutions but the kind of substitution. For good molecular reasons (associated with repair mechanisms), purine-to-purine (A ⇔ G) and pyrimidine-to-pyrimidine (C ⇔ T) substitutions, collectively called transitions, are more frequent than changes from a purine to a pyrimidine or vice versa (A ⇔ C, A ⇔ T, G ⇔ C, G ⇔ T), collectively called transversions. Computer programs can use differential penalties on transitions and transversions so as to pick alignments that more closely match the underlying molecular processes. Likewise, functional constraints on proteins mean that certain amino acid substitutions, ones entailing amino acids with similar chemical properties, are more frequent than others. These too can be taken into account by computational algorithms.

Pairwise alignment considers just two sequences at a time, whereas multiple alignments include sequences from many taxa so as to obtain an entire aligned data matrix.
A pairwise alignment is relatively simple for a computer to determine, even when a complex set of penalties is implemented. Multiple alignments are, however, disproportionately more difficult. As the number of sequences being aligned increases, the number of possible alignments goes up exponentially. Multiple alignment algorithms should allow the placement of gaps in one sequence to influence the placement of gaps in other sequences. This is because, when gaps in two species are in the same position, they can be attributed to a single indel occurring somewhere on the gene tree. Nonetheless, most multiple alignment programs start by making a pairwise alignment and then gradually align additional sequences to the earlier alignment. In this procedure, a gap introduced early is retained even if the addition of later sequences might suggest an alternative position for the gap.

Given the difficulties faced by alignment programs, humans can often visually identify the more egregious mistakes. Thus, while computer multiple alignment programs (e.g., CLUSTAL) provide a good starting point, their output usually needs to be examined and adjusted by eye. To illustrate this, here is a problematic portion of an alignment that was actually returned by CLUSTAL and an eyeball-edited version of the same. You might notice that not only does the second alignment imply a simpler mutational history, but the human editor could also take account of the codon structure of this gene (something that few computer programs keep track of) so as to place both gaps in the same reading frame.

Given that sequence alignments will vary depending on how they were generated (what algorithm, what penalties, and whether they were manually edited), you might worry that phylogenetic analysis of aligned sequences is invalid. Actually, the problems are less than they may seem. Usually, even if some part of a sequence is hard to align unambiguously, many regions can be aligned confidently. It is common practice to either exclude regions of ambiguous alignment from the phylogenetic analysis or to repeat the analysis using a range of alternative alignments to see if the inferred trees vary significantly. And even in the worst-case scenario where the entire gene is hard to align, the data matrix is likely to show a lack of clear phylogenetic signal rather than a strong, misleading signal. Thus, even if a suspect alignment is used, statistical analysis of the phylogenetic conclusions (Chaps X-XX) will probably show that they are weak.
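The gap-penalty rule described earlier in this section can be illustrated with a minimal sketch of pairwise global alignment by dynamic programming. The approach is the one usually called Needleman-Wunsch; it is swapped in here for illustration, since the text does not say which algorithm any particular program uses. Each substitution costs `mismatch`, each gapped column costs `gap`, and the alignment with the lowest total cost is returned. The function names and default costs are hypothetical, end gaps are penalized like internal gaps, and no extra cost for long gaps is modeled, so this is considerably simpler than real alignment programs. A second helper counts transitions and transversions in the result, assuming the sequences contain only A, C, G, and T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def align(a, b, mismatch=1, gap=3):
    """Lowest-cost global alignment of sequences a and b (illustrative)."""
    n, m = len(a), len(b)
    # cost[i][j] = lowest cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                cost[i - 1][j] + gap,      # gap in b
                cost[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = mismatch
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            sub = 0
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def count_substitutions(top, bottom):
    """Count (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for x, y in zip(top, bottom):
        if x == y or "-" in (x, y):
            continue  # identical bases and gapped columns are skipped
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# The pair of sequences used in the worked example in this section.
seq1 = "ATGACCAGTACGGCTTTA"
seq2 = "ATGATCGATATGGCATTA"
cost, top, bottom = align(seq1, seq2)
print(cost)
print(top)
print(bottom)
print(count_substitutions(top, bottom))
```

With a gap cost of 3, any alignment of these two equal-length sequences that contains gaps must have at least two gapped columns and therefore costs at least 6, so the sketch returns the ungapped alignment with five substitutions (four transitions and one transversion). Setting `gap` low enough, for instance to 0, would instead favor alignments that replace substitutions with indels.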

Generating a morphological data matrix

Although the field of phylogenetics was founded on morphological data, most modern research uses molecular data, especially DNA sequences, instead. Phylogenetic analysis of morphological data turns out to be more challenging than the analysis of DNA sequences, and it is more difficult to obtain statistically significant support for phylogenetic conclusions. However, fossils can only be analyzed through the use of morphological data (scored for them and related living species). Additionally, even when trees come from other kinds of data, it often becomes necessary to build a morphological data matrix as a means to use the tree to reconstruct the evolution of morphological traits. Therefore, despite the preeminence of molecular phylogenetics, it is important to know how morphological data are scored and assembled into data matrices.

Two steps can be recognized in the building of a morphological matrix, which I will call character encoding and character scoring. Character encoding involves deciding on the limits of characters and on the alternative states that are recognized for each character. For example, when you decide to score fur color using two states, brown/black and white, then you have encoded one character (fur color). Character scoring involves looking at each taxon and assigning it a state for each encoded character. For example, you could score an otter as having brown/black fur. In practice, observations made while scoring taxa often result in changes being made to character encoding, but it is still useful to distinguish these two steps.

Character encoding involves defining the characters and character states. Character encoding is trivial for DNA sequence data because they are inherently divisible into characters (positions), each of which can adopt one of four states (A, C, G, or T).
In contrast, for morphological data, the recognition of characters and character-states needs to be determined by consideration of the patterns of variation among taxa. Once a set of taxa has been selected for study a systematist generally starts by looking for characteristics that appear to vary among them. Notice that whereas many characters in a DNA sequence data matrix may be invariant, the way that morphological characters are selected means that constant characters are usually not included. Once some variation has been noted the next challenge is to clearly encode the characters. This is not straightforward. For example, imagine that you observed that leaf shape and size differed among a set of eight plant species, as shown in the figure. How would you capture this variation?

Consider two of the numerous possible ways to encode this variation. (1) You recognize two basic leaf shapes, cordate (heart-shaped) and obcordate (with the widest point near the top) and two size classes. (2) You encode leaf length, leaf width, and the height of the widest point. The scoring that might result from these two encoding schemes is shown in the two matrices below.

Taxon   Leaf shape                      Leaf size
        (0 = cordate; 1 = obcordate)    (0 = small; 1 = large)
A       0                               1
B       0                               0
C       0                               1
D       0                               1
E       1                               0
F       1                               1
G       1                               1
H       1                               01

Taxon   Leaf length              Leaf width               Height of widest point
        (0 = short; 1 = long)    (0 = narrow; 1 = wide)   (0 = below middle; 1 = above middle)
A       1                        0                        0
B       0                        0                        0
C       1                        0                        0
D       1                        0                        0
E       0                        1                        1
F       1                        0                        1
G       1                        0                        1
H       1                        1                        1

The decision among alternative encoding schemes is guided by a few, potentially conflicting considerations. You want to capture as much of the variation as possible without “double counting.” Scoring the same basic variation multiple times results in overweighting that variation to the point where it will dominate the phylogenetic results. For example, you might be concerned that by measuring both length and width of these leaves you might score one basic trait, leaf size, twice.

[Figure: leaf shape and size in the eight plant species, labeled A-H.]

Another important consideration is that the character states recognized should really be versions of the same character. This is not always easy to decide. Suppose that close relatives of these plants have compound leaves. Should their “leaf shape” be encoded based on the individual leaflets or the outline of the whole compound leaf?

Once the characters are defined, the next question is how to delimit character states. If the variation is rather discrete between taxa and there is little variation within taxa, as illustrated by leaf shape in the example above, then it may be easy to encode the categories. However, most morphological traits are inherently continuous and variation within taxa is usual. Thus, it can be difficult to divide continuous variation into the discrete states needed for phylogenetic analysis. The graph below shows hypothetical data on leaf length in ten species.

[Figure: hypothetical leaf length (cm) in the ten species, labeled A-J.]

You might see this as three “clusters” of taxa corresponding to three states: small (A-E), medium (D-F-H), and large (B-G-I). In that case, you might score C as polymorphic for small and medium and J as polymorphic for medium and large. Or you could recognize two classes (small and large) or even five (A-E; C; D-F-H; J; B-G-I). And, sadly, there is no well-grounded theory to tell you which of these encoding schemes will yield the best estimates of the phylogeny of these species.

Taken together, you can probably see that there are many somewhat subjective decisions that must be made in encoding morphological data and that these are likely to be adjusted by observations made while scoring individual taxa. Nonetheless, the aim is to eventually define the encoding scheme so clearly that any researcher would score the same taxa and arrive at the same morphological data matrix. However, the fact that scoring can be objective once encoding is completed does not change the fact that morphological data matrices are always somewhat subjective, because data encoding cannot be rendered fully objective.

While subjectivity is something that makes scientists uncomfortable, the fact that there is some subjectivity does not invalidate morphology as a source of phylogenetic data. So long as different encoding schemes capture the actual variation among taxa, they should yield similar estimates of the tree. However, since different encoding schemes can result in different trees being chosen, it is considered good practice, as with sequence alignment, to try a few different schemes and see if the phylogenetic conclusions remain the same.

Classes of data used for phylogenetic analysis

Information for phylogenetic inference can theoretically come from any traits that we believe evolved within the constraints of the underlying tree. A trait that varies among tips and shows some degree of heritability (ancestors having the trait tend to give descendants that have it too) thus has the potential to provide phylogenetic information. Prior to the advent of modern molecular methods, phylogenetic analysis was conducted primarily on morphological variation. Nowadays, the great majority of phylogenetic analyses are conducted based on DNA sequences obtained from representative individuals. In the following chapters we will examine methods that have been developed for phylogenetic inference from DNA sequence data and morphological data.

Although I will focus on DNA sequence and morphological data, there are many other kinds of data that can be used for phylogenetic analysis. In most cases these other kinds of data can be analyzed using methods developed for DNA sequences or morphology, but in some cases specialized methods need to be used. It is beyond the scope of this book to explore all these methods. However, below I review most of the kinds of data that are used for phylogenetic inference and provide some brief discussion as to how they are typically analyzed.

i. Molecular Sequences

Sequences of peptides or small proteins were the first kind of molecular sequences to be used for phylogenetics. Originally amino acid sequences were determined chemically. Now it is much more common to infer an amino acid sequence from either the mRNA or DNA sequence that encodes the protein. Protein sequences are still used for phylogeny reconstruction, especially to study ancient relationships. This is because protein sequences are often subject to functional constraints that slow down the rate of evolution, reducing the frequency with which a single position undergoes multiple evolutionary changes of state (multiple hits). Nonetheless, the general principles of phylogenetic analysis of protein sequences are very similar to those applied to DNA sequences.

The second kind of molecular sequences to be collected widely were ribosomal RNA (rRNA) sequences. This was because ribosomes could be purified and their RNA could be sequenced with reverse transcriptase. This technology has now been superseded, and rRNA or messenger RNA (mRNA) sequences are nowadays obtained by sequencing the encoding DNA (or in the case of mRNA, by copying it back into DNA). Because there is usually a one-to-one correspondence between RNA and the encoding DNA, phylogenetic analysis of RNA sequences is basically identical to that of DNA sequences. The only complication that sometimes arises is that some RNA molecules adopt folded structures due to bonding between bases at different positions in the molecule. When this happens, the interacting positions in the sequence do not evolve independently, potentially complicating phylogenetic analysis.

ii. Molecular presence/absence data

Over the years a number of different molecular techniques have been developed that allow researchers to score organisms for the presence/absence of particular molecular markers. Appendix 7.1 provides a brief description of some of these methods. The first two methods developed, RFLPs (Appendix 7.1.ii) and isozymes (Appendix 7.1.iii), date back to a time when molecular sequencing was not feasible. Isozyme data are optimized for population genetic studies and are not well suited to phylogenetic analysis. RFLP data, in contrast, provided a lot of robust phylogenetic information and played a major role in the development of molecular phylogenetics. However, for all their historical importance, RFLPs have been superseded in phylogenetics by inexpensive and efficient DNA sequencing methods.

RFLP data can be analyzed by treating them as presence/absence data using methods appropriate to morphological characters. However, in so doing efforts should be made to take account of an inherent asymmetry in RFLP data: shared sites are more likely to be lost in parallel than gained in parallel. This is because an entire restriction site (typically six base pairs long) must match for an enzyme to cut, whereas any one of many possible mismatches will result in a failure to cut. Parsimony can be modified using character-state weighting (Chap X) to partially account for this phenomenon. More specialized methods have been developed that use explicit models of RFLP evolution to obtain better phylogenetic estimates.

RAPDs (7.1.iv), ISSRs (7.1.v), AFLPs (7.1.vi), and microsatellites (7.1.vii), collectively called distributed molecular markers, were originally developed for studying variation within populations. Because the members of a population are closely related, their DNA sequences are usually very similar. Distributed molecular marker methods quickly scan a set of taxa for molecular variation distributed anywhere in the whole genome. For this reason these markers (of which AFLP is currently the most widely used) have been particularly popular for phylogenetic studies of closely related organisms, where it can be difficult to find sufficiently variable gene regions to sequence.

In addition to certain technical problems that are mentioned in the appendix, distributed molecular markers have one significant drawback for studying phylogenies among closely related species. The available methods of analysis estimate a common phylogenetic tree under the assumption that all parts of the genome have tracked the same history. This is fine if all the genes have indeed tracked the same history. But phenomena such as incomplete lineage sorting mean that different parts of the genome can have different phylogenetic trees (chap. 5), and these phenomena will be particularly prevalent at the low phylogenetic scales that are generally studied with distributed molecular markers. You might hope that computer programs for phylogenetic analysis would have a safeguard built in to detect that the assumptions of the method have been violated. But unfortunately it is not easy to tell from distributed marker data whether there is a single shared history. You will usually obtain a well-resolved tree whether there is a single history of the whole genome or not. The tree may not correspond to the history of the whole genome but may be some kind of average across the conflicting histories from different parts of the genome. Therefore, users of distributed marker data need to be extra vigilant to avoid reading too much significance into the trees they obtain in cases where discordance among gene trees is suspected.
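Returning to the RFLP asymmetry described earlier in this section, the imbalance between losing and gaining a restriction site can be made concrete by counting single-base substitutions. The sketch below is only a counting illustration under strong simplifying assumptions: it considers substitutions only (no insertions or deletions), treats all substitutions as equally likely, and uses the EcoRI recognition sequence GAATTC as an example site, which is my choice and not something the chapter specifies.

```python
SITE = "GAATTC"   # EcoRI recognition sequence, used here only as an example
BASES = "ACGT"


def single_substitutions(seq):
    """Return every sequence that differs from seq by exactly one base."""
    neighbours = []
    for i, original in enumerate(seq):
        for base in BASES:
            if base != original:
                neighbours.append(seq[:i] + base + seq[i + 1:])
    return neighbours


# Starting from an intact site: how many one-base changes destroy it?
changes_from_site = single_substitutions(SITE)
losses = sum(1 for s in changes_from_site if s != SITE)

# Starting one mismatch away from the site: how many one-base changes create it?
near_site = changes_from_site[0]
changes_from_near = single_substitutions(near_site)
gains = sum(1 for s in changes_from_near if s == SITE)

print(f"loss: {losses} of {len(changes_from_site)} possible substitutions")
print(f"gain: {gains} of {len(changes_from_near)} possible substitutions")
```

Under these assumptions every substitution inside an intact site removes it, whereas a sequence that is already one mismatch away regains the site through only one specific substitution. Sequences that are further from the site cannot gain it in a single step at all, which makes parallel gains rarer still.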

iii. Molecular structural data

Whereas molecular marker methods have generally been motivated by the search for variation at low taxonomic scales, structural molecular characters have mainly been of interest because they tend to evolve very slowly. A rare structural molecular feature shared by a group of taxa can provide compelling evidence that the taxa form a clade. Structural molecular characters include insertions and deletions, inversions, and duplicative or non-duplicative translocations. For each kind of structural mutation, there is a range of scales from very local mutations (e.g., single base-pair deletions; six base-pair inversions) to large-scale ones (e.g., deletion of a whole gene; insertion of an intron; translocation of an entire chromosome). As a general rule, finding that two taxa share a small/local structural mutation is considered weaker evidence of a close relationship than sharing a larger structural characteristic. This is because the probability of homoplasy is higher in the former case. It is more likely that two taxa independently underwent a deletion of the same AT than that they independently experienced an insertion of the same 500 base-pair sequence at the same point in the genome.

Structural characters are usually identified by gene or genome sequencing, restriction mapping, or by microscopic observation of chromosomes. In some cases the kind of structural mutation involved can be unambiguously inferred. For example, when a large region of otherwise quite similar sequence is inverted in some taxa relative to others, a molecular inversion is clearly implied. In other cases a diversity of different mutational processes could have contributed to an observed pattern. For example, if tips differ in their gene order along a chromosome, different combinations of duplications, translocations, inversions, and deletions might provide competing explanations for the same basic data.

The use of molecular structural characters for phylogenetic inference generally follows one of two approaches. The first involves scoring tips for the presence or absence of a number of structural characters and then using parsimony or other standard phylogenetic methods to infer the tree that best explains the full set of characters. This approach allows that any of the characters could have been subject to homoplasy. The alternative approach is invoked when a particular structural character is considered to have resulted from such an improbable event that homoplasy is ruled out. In that case the Hennigian logic is invoked. This means that once the structural character is polarized (e.g., by looking in outgroups), a clade is inferred.

A simple example of the latter approach is provided by a study of land plants conducted in 1992 by Linda Raubeson and Robert Jansen, then at the University of Connecticut. They used RFLP mapping approaches to show that the clubmosses and their allies (the lycophytes) have a major inversion in their plastid genome relative to all other vascular land plants. Furthermore, the condition found in lycophytes resembled the outgroups (mosses and liverworts). From this they concluded that the lycophytes must be outside a clade that includes all other living vascular plants.
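The Hennigian reasoning in this example can be written out as a short procedure: take the outgroup condition as ancestral, and group the ingroup taxa that share the other (derived) condition. The Python sketch below is a minimal illustration that assumes a single binary character with no homoplasy. The function name `hennigian_clade`, the state labels, and the use of "ferns" and "seed plants" as stand-ins for "all other vascular plants" are my assumptions, not details taken from the study.

```python
def hennigian_clade(states, outgroups):
    """Infer a clade from one character, assuming it changed exactly once.

    states:    dict mapping taxon name to its character state
    outgroups: set of taxon names used to polarize the character
    Returns the set of ingroup taxa sharing the derived state.
    """
    outgroup_states = {states[t] for t in outgroups}
    if len(outgroup_states) != 1:
        raise ValueError("outgroups disagree, so the character cannot be polarized")
    ancestral = outgroup_states.pop()
    return {t for t, s in states.items()
            if t not in outgroups and s != ancestral}


# Plastid gene order, with hypothetical state labels.
PLASTID_ORDER = {
    "mosses": "uninverted",       # outgroup
    "liverworts": "uninverted",   # outgroup
    "lycophytes": "uninverted",   # same condition as the outgroups
    "ferns": "inverted",          # stand-in for other vascular plants
    "seed plants": "inverted",    # stand-in for other vascular plants
}

clade = hennigian_clade(PLASTID_ORDER, outgroups={"mosses", "liverworts"})
print(sorted(clade))
```

The lycophytes fall outside the inferred clade because they retain the outgroup condition. The conclusion is only as strong as the assumption that the inversion arose once and was never reversed.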

[Figure: tree of the outgroup, lycophytes, and other vascular plants, showing plastid gene orders A and B at the tips and the position of the inversion.]

Abundant subsequent research on land plant phylogeny has confirmed the conclusion reached based on this inversion. It seems that this inversion really did occur just once in a common ancestor of all living vascular plants except lycophytes.

Another classic example that fits this basic principle is provided by the chromosomal inversion phylogenies that were reconstructed from a logical analysis of a series of nested chromosomal inversions in Hawaiian fruit flies. Fruit flies have polytene chromosomes in the salivary glands that can be stained to reveal banding patterns, which may be observed under the microscope. In 1982, Carson summarized more than a decade of such data and was able to provide a detailed phylogenetic tree for approximately 103 fly species, a tree that has largely been validated by DNA sequence data (O’Grady et al. 2001; BMC Evo Bio).

Nonetheless, while structural molecular characters have provided definitive data in many cases, this does not mean that the Hennigian logic will always succeed. Our knowledge of molecular processes is not good enough to definitively rule out independent origins of the same structural mutations. Therefore, even when clades are supported by supposedly rare structural mutations, biologists still hope to corroborate those clades through the use of other kinds of data.

iv. Morphology and other morphotypic data

As discussed earlier in this chapter, morphological features that reflect some underlying aspects of the genotype, making them heritable, can be used for inferring phylogenetic relationships. The basic approach with such data is to examine the variation among the taxa and develop an encoding scheme for summarizing that variation with a number of discrete states. As discussed earlier, the delimitation of characters and character states can be tricky, depending on a series of somewhat ambiguous judgment calls.

There are a number of kinds of data that do not involve morphology in the strict sense (gross physical features of organisms), but nonetheless contend with similar problems of character encoding. These are sometimes called “phenotypic data,” but I don’t find that term appropriate given that all features of organisms we observe are technically phenotypic. Rather, I will use the term morphotypic to refer to those kinds of data for which character encoding is not defined a priori, but is guided by the observed variation among the taxa combined with insights into the homology and independence of the traits. Below is a list of some kinds of morphotypic data that have been used for phylogenetic analysis. It is common for a data matrix to include more than one kind of morphotypic data.

a) Morphology (sensu stricto): Gross physical features of organisms.

b) Anatomy: Microscopic features of organisms.

c) Molecular morphology: The shape of particular molecules or molecular complexes, for example ribosomes. Sometimes this variation can be captured effectively using sequence data, but occasionally the shape of subcellular structures has been treated as morphotypic data.

d) Development: How traits change during the lifetime of an organism.

e) Behavior: How the organisms tend to behave. Mating behaviors have been particularly widely used.

f) Secondary biochemistry: Compounds that organisms accumulate and chemical reactions that their cells can conduct. The production of defensive secondary chemicals has proved important in plants, whereas fungi are often scored for the ability to cause a color reaction given a particular substrate. In some cases the biochemical trait corresponds to the presence or absence of a particular gene, in which case this kind of data blends into structural molecular or molecular presence/absence data.

g) Biogeography: The geographic distribution of organisms can theoretically be scored like other morphotypic traits. More commonly geographic history is mapped onto phylogenies once they are determined.

h) Ecology: The preferred habit or way of life of the organisms, which is to say aspects of their ecological niche. Examples include pollinator identity, prey type or preferred habitat type.

v. Frequency data

If different individual organisms scored from a given taxon vary in a trait, one strategy (the usual one nowadays) is to score that taxon as polymorphic, i.e., containing multiple states, but not to worry about the frequency of each variant within the taxon. However, in earlier times the frequency of a polymorphic character, usually an isozyme or RFLP allele, was considered to provide evidence of relatedness. Therefore, a data matrix could be constructed in which the entries are not the presence or absence of a state, but the frequencies of different alleles in different taxa, as illustrated below. You will see that, for each taxon, the allele frequencies at each locus add up to 1.0.
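As a minimal sketch of how such a frequency matrix might be handled, the Python code below stores the allele frequencies from this section's illustrative table, checks that every locus sums to 1.0 within each taxon, and computes one simple pairwise distance. The distance used here (half the summed absolute allele-frequency differences at each locus, averaged over loci) is an assumption chosen for simplicity; the chapter does not say which distance measure was used in practice, and the function names are hypothetical.

```python
# Allele frequencies per taxon, grouped by locus (locus 1, locus 2, locus 3).
FREQS = {
    "A": [[0.0, 0.9, 0.1], [0.0, 1.0], [1.0, 0.0, 0.0, 0.0, 0.0]],
    "B": [[0.0, 0.3, 0.7], [0.2, 0.8], [0.4, 0.6, 0.0, 0.0, 0.0]],
    "C": [[0.1, 0.4, 0.5], [0.1, 0.9], [0.1, 0.8, 0.1, 0.0, 0.0]],
    "D": [[0.9, 0.1, 0.0], [0.8, 0.2], [0.0, 0.1, 0.0, 0.9, 0.0]],
    "E": [[0.7, 0.3, 0.0], [1.0, 0.0], [0.0, 0.0, 0.0, 0.7, 0.3]],
    "F": [[1.0, 0.0, 0.0], [0.9, 0.1], [0.0, 0.1, 0.1, 0.4, 0.4]],
}


def check_sums(freqs, tol=1e-9):
    """Raise ValueError if any locus does not sum to 1.0 for some taxon."""
    for taxon, loci in freqs.items():
        for index, locus in enumerate(loci, start=1):
            if abs(sum(locus) - 1.0) > tol:
                raise ValueError(f"taxon {taxon}, locus {index} does not sum to 1")


def frequency_distance(p, q):
    """Mean over loci of half the summed absolute frequency differences.

    The result is 0.0 for identical frequencies and 1.0 when the two taxa
    share no alleles at any locus.
    """
    per_locus = [0.5 * sum(abs(a - b) for a, b in zip(lp, lq))
                 for lp, lq in zip(p, q)]
    return sum(per_locus) / len(per_locus)


check_sums(FREQS)
taxa = sorted(FREQS)
for i, s in enumerate(taxa):
    for t in taxa[i + 1:]:
        print(s, t, round(frequency_distance(FREQS[s], FREQS[t]), 3))
```

The pairwise values produced this way form a distance matrix that can then be passed to a distance-based tree-building method.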

                     Alleles
          Locus 1         Locus 2    Locus 3
Taxon   1a   1b   1c     2a   2b     3a   3b   3c   3d   3e
A       0.0  0.9  0.1    0.0  1.0    1.0  0.0  0.0  0.0  0.0
B       0.0  0.3  0.7    0.2  0.8    0.4  0.6  0.0  0.0  0.0
C       0.1  0.4  0.5    0.1  0.9    0.1  0.8  0.1  0.0  0.0
D       0.9  0.1  0.0    0.8  0.2    0.0  0.1  0.0  0.9  0.0
E       0.7  0.3  0.0    1.0  0.0    0.0  0.0  0.0  0.7  0.3
F       1.0  0.0  0.0    0.9  0.1    0.0  0.1  0.1  0.4  0.4

Methods have been developed for analyzing frequency data, most commonly involving first converting the frequency matrix into a distance matrix. However, nowadays frequency data are rarely if ever used for phylogeny reconstruction. When taxa share polymorphisms (e.g., each contains both alleles a and b of the same locus), it is questionable whether the taxa are monophyletic entities that can meaningfully be assigned to a single tip of the tree of life. This is, I think, why frequency data have so completely fallen out of favor.

vi. Distance data

All the other kinds of data discussed are initially scored in the form of a character-state data matrix. Although they can be converted into a distance matrix, they start out as a list of discrete states scored for each taxon. However, two kinds of data are collected in the form of a distance between a pair of taxa: immunological distances and DNA-DNA similarity.

Immunological methods were popular in animals in the 1970s and 1980s. The intensity of the reaction between the immune serum of one animal and antigenic proteins from another animal was scored quantitatively in the laboratory. The underlying logic was that the greater the time since common ancestry the greater the protein differences, which should in turn lead to a more intense immune reaction.

Similarly, mainly in the 1980s, many systematists put great stock in DNA-DNA hybridization data as a measure of the overall sequence similarity of a pair of genomes. A typical experiment would involve single-stranded DNA from one taxon being attached to a column and then being allowed to hybridize with single-stranded (and radioactively labeled) DNA from another species. The greater the sequence similarity of the two genomes, the more tightly the complementary sequences would bind. Overall genomic similarity could thus be measured by looking at the release of radioactivity when the column was gradually heated up to melt apart the two strands.

Immunological and DNA-DNA distance measures can only be analyzed using distance methods, which is somewhat limiting. Also, whole genome distance data do not allow you to detect the existence of different gene trees for different parts of the genome. However, the main reason that these data are no longer used for phylogenetic research is that, compared to DNA sequencing, the data are harder to generate and less readily repeatable. Nonetheless, a number of conclusions reached using immunological or DNA-DNA hybridization data have since been validated using DNA sequence data.

Major Points

Phylogenetic inference is based on variable traits that have been scored for a set of taxa and have been entered into a data matrix. While there are many kinds of data that can be used, the most important kinds to know about are DNA sequences and morphology.

DNA sequence data are the most widely used tool for studying phylogeny, being easy to collect in large quantities and relatively simple to analyze using statistical methods. The most common approaches involve first aligning the DNA sequences to one another by adding gaps that represent insertion/deletion (indel) events.

Morphological data are needed to add fossils to the tree of life, and morphology is often scored as a first step in using phylogenies to study evolution (Chap. XX). Morphological data do not need alignment, but the delimitation of characters and character states can be quite difficult and can introduce an element of subjectivity.

Learning objectives

• Understand the difference between a character-state matrix and a distance matrix.
  o Be able to identify the state of a taxon for a particular character based on a data matrix
  o Be able to determine the pairwise distance or similarity of a pair of taxa given a distance matrix
  o Be able to convert a pairwise distance into a pairwise similarity, and vice versa
  o Be able to convert a simple character-state matrix into a distance matrix based on simple proportional similarity
  o Be able to explain why a character-state matrix can be converted into a distance matrix, but the reverse is not possible

• Understand the nature and challenge of DNA alignment.
  o Be able to identify sequence alignment as a problem of homology assessment of the nucleotide positions, analogous to homology assessment in morphology
  o Be able to explain why alignment and phylogeny reconstruction are two parts of the same historical inference problem
  o Be able to align similar sequences without invoking indels
  o Be able to determine the number of indels and the number of substitutions implied by a pairwise alignment
  o Be able to use an argument based on probabilities of indels versus substitutions to favor one pairwise alignment over another
  o Be able to explain why alignment becomes more difficult in more divergent sequences

• Understand the issues that arise in building a morphological data matrix
  o Be able to generate two or more alternative encoding strategies given information on trait variation
  o Be able to give examples of cases where different views on homology alter the way that characters are encoded
  o Be able to recognize as misguided any character-state encoding that obviously over-weights some characters (e.g., by repeating the same character)
  o Be able to correctly score taxa for some non-technical traits given a character encoding scheme
  o Be able to distinguish the objectivity of character-state scoring from the subjectivity of character encoding

• Understand that many kinds of data can be used for phylogenetic inference
  o Be able to explain why molecular data are currently the most widely used tool for inferring phylogenies (abundant, easy to collect, easy to analyze because there are good models of sequence evolution)
  o Be able to defend the use of morphological data for phylogenetic analysis
  o Be able to list at least one kind of molecular data besides DNA sequences that has been used for phylogenetic analysis
  o Be able to explain why DNA-DNA hybridization or immunological distance data cannot be represented in a character-state matrix
  o Be able to articulate the difference between character-state scoring and frequency scoring
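Several of the objectives above concern the relationship between a character-state matrix and a distance matrix. The Python sketch below illustrates it with a small invented binary matrix: similarity is taken as the proportion of characters in which two taxa share the same state, and distance as one minus that similarity. The taxon names, the states, and the function names are all hypothetical.

```python
# A small invented character-state matrix: one string of states per taxon.
MATRIX = {
    "taxon1": "00110",
    "taxon2": "00111",
    "taxon3": "11000",
}


def similarity(a, b):
    """Proportion of characters in which two taxa share the same state."""
    if len(a) != len(b):
        raise ValueError("taxa must be scored for the same characters")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def distance_matrix(matrix):
    """Pairwise distances, defined here as 1 minus proportional similarity."""
    taxa = sorted(matrix)
    return {(s, t): 1.0 - similarity(matrix[s], matrix[t])
            for s in taxa for t in taxa}


distances = distance_matrix(MATRIX)
print(round(distances[("taxon1", "taxon2")], 2))
print(round(distances[("taxon1", "taxon3")], 2))

# Swapping 0 and 1 in every character gives a different character-state
# matrix but exactly the same distances, so the states cannot be recovered
# from the distance matrix alone.
flipped = {taxon: states.translate(str.maketrans("01", "10"))
           for taxon, states in MATRIX.items()}
print(distance_matrix(flipped) == distances)
```

The final comparison shows why the conversion works in one direction only: different character-state matrices can collapse to the same set of pairwise distances.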
