Phylogenetic Analysis -A bioinformatics Tool

Review work on Phylogenetic Analysis - a bioinformatics tool. M.Sc. 2nd Semester, Roll: VU/PG/MIC II-S, No: 03, Reg. No: 161515, Year: 2003-04. Department of Microbiology, Vidyasagar University, Midnapore.

Acknowledgement

This review paper would not have been possible without the help and encouragement of many individuals. It gives me tremendous pleasure to acknowledge the valuable assistance extended to me by various people in the successful completion of this program.

First and foremost, I wish to express my profound thanks to our respected teachers, Prof. Bikash Ranjan Pati, Head of the Department of Microbiology, Vidyasagar University, and Dr. Keshab Ch. Mondal and Dr. Debdulal Banerjee, Lecturers in the Department of Microbiology, Vidyasagar University, for their constant encouragement and valuable guidance and suggestions.

I also express my thanks to Mr. Pradeep DasMahapatra and Mr. Jhereswer Chalak, Laboratory Assistant of the Department of Microbiology, Vidyasagar University.

I would like to express my special thanks to Mr. Paltu Dhal, Scholar, IIT Kharagpur, for his valuable advice and support.

I must express my deepest regard to my parents for their continuous blessing and support.

Last but not least, I would like to thank all my friends and my department, whose timely support allowed this seminar report to be completed successfully.

..............................................................

Uttam Kumar Patra

M.Sc. 2nd Sem. Dept. of Microbiology.

Vidyasagar University.

CONTENT

Abstract
Introduction
Basic Terminology in Phylogenetic Tree
The Concept of Evolutionary Trees
Fundamental Elements of Phylogenetic Models
Tree Interpretation - the Importance of Identifying Paralogs and Orthologs
Genome Complexity and Phylogenetic Analysis
Phylogenetic Data Analysis
Relationship of Phylogenetic Analysis to Sequence Alignment
Tree-Building Methods
Outline of Tree-Building Method
Character-Based Methods
    Maximum Parsimony Method
    Maximum Likelihood Method
Distance-Based Methods
    Unweighted Pair Group Method With Arithmetic Mean (UPGMA)
    Neighbor Joining (NJ)
    Fitch-Margoliash (FM)
    Minimum Evolution (ME)
    Which Distance-Based Tree-Building Procedure Is Best?
The Difference between Distance, Parsimony, and Maximum Likelihood Methods
Pitfalls of Phylogenetic Analysis
Tree Evaluation
Reliability of Phylogenetic Predictions
Complications from Phylogenetic Analysis
Reference

Phylogenetic analysis- a bioinformatics tool

Abstract: Phylogenetics is the study of evolutionary relationships. Phylogenetic analysis is the means of inferring or estimating these relationships. The evolutionary history inferred from phylogenetic analysis is usually depicted as branching, treelike diagrams that represent an estimated pedigree of the inherited relationships among molecules, organisms, or both. The phylogenetic relationship between two or more sets of sequences is often extremely important information for bioinformatics analyses such as the construction of sequence alignments. It is not unreasonable to think of molecular data as something of a historical document that contains within it evidence of the important steps in the evolution of a gene. The very same evolutionary events (substitutions, insertions, deletions, and rearrangements) that are important to the history of a gene can also be used to resolve questions about the evolutionary history of, and relationships between, entire species. In fact, the phylogenetic relationships among many kinds of organisms are difficult to determine in any other way. The true relationship between homologous sequences is hardly ever known outside computer simulation experiments. A variety of approaches are available for inferring the most likely phylogenetic relationship between genes and species using nucleotide and protein sequence information. Bioinformatics thus provides commonly used, statistically based methods for inferring evolutionary relationships.

Introduction:

Phylogenetic analysis of nucleic acid and protein sequences is presently and will continue to be an important area of sequence analysis. In addition to analyzing changes that have occurred in the evolution of different organisms, the evolution of a family of sequences may be studied. On the basis of the analysis, sequences that are the most closely related can be identified by their occupying neighboring branches on a tree. When a gene family is found in an organism or group of organisms, phylogenetic relationships among the genes can help to predict which ones might have an equivalent function. These functional predictions can then be tested by genetic experiments. Phylogenetic analysis may also be used to follow the changes occurring in a rapidly changing species, such as a virus. Analysis of the types of changes within a population can reveal, for example, whether or not a particular gene is under selection (McDonald and Kreitman 1991; Comeron and Kreitman 1998; Nielsen and Yang 1998), an important source of information in applications like epidemiology.

Molecular phylogenetics predates DNA sequencing by several decades. It is derived from the traditional method for classifying organisms according to their similarities and differences, as first practiced in a comprehensive fashion by Linnaeus in the 18th century. Linnaeus was a systematist, not an evolutionist; his objective was to place all known organisms into a logical classification which he believed would reveal the great plan used by the Creator, the Systema Naturae (first published in 1735). His scheme divides organisms into a series of taxonomic categories, starting with kingdom and progressing down through phylum, class, order, family and genus to species. The naturalists of the 18th and early 19th centuries likened this hierarchy to a ‘tree of life’, an analogy that was adopted by Darwin (1859) in The Origin of Species as a means of describing the interconnected evolutionary histories of living organisms. The classificatory scheme devised by Linnaeus therefore became reinterpreted as a phylogeny indicating not just the similarity between species but also their evolutionary relationships.

Whether the objective is to construct a classification or to infer a phylogeny, the relevant data are obtained by examining variable characters in the organisms being compared. Originally, these characters were morphological features, but molecular data were introduced at a surprisingly early stage. In 1904 Nuttall used immunological tests to deduce relationships between a variety of animals, one of his objectives being to place humans in their correct evolutionary position relative to other primates. Nuttall’s work showed that molecular data can be used in phylogenetics, but the approach was not widely adopted until the late 1950s. The delay was due largely to technical limitations, but also partly to the fact that classification and phylogenetics had to undergo their own evolutionary changes before the value of molecular data could be fully appreciated. These changes came about with the introduction of phenetics (Michener and Sokal, 1957) and cladistics (Hennig, 1966), two novel phylogenetic methods which, although quite different in their approach, both place emphasis on large datasets that can be analyzed by rigorous mathematical procedures. The difficulty of obtaining large datasets when morphological characters are used was one of the main driving forces behind the gradual shift towards molecular data.

The sequences of protein and DNA molecules provide the most detailed and unambiguous data for molecular phylogenetics, but techniques for protein sequencing did not become routine until the late 1960s, and rapid DNA sequencing was not developed until 10 years after that.

By the end of the 1960s, indirect methods such as immunological tests had been supplemented with an increasing number of protein sequence studies (Fitch and Margoliash, 1967), and during the 1980s DNA-based phylogenetics began to be carried out on a large scale. Protein sequences are still used today in some contexts, but DNA has now become by far the predominant molecule. This is mainly because DNA yields more phylogenetic information than protein: the nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino acid sequence. Entirely novel information can also be obtained by DNA sequence analysis because variability in both coding and non-coding regions of the genome can be examined. The ease with which DNA samples for sequencing can be prepared by PCR is another key reason behind the predominance of DNA in modern molecular phylogenetics.
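The point about synonymous changes can be illustrated with a minimal Python sketch. The two gene fragments below are hypothetical, invented only for this illustration, and the codon table is a small excerpt of the standard genetic code covering just the codons used here.

```python
# Partial codon table: only the codons used in this example
# (taken from the standard genetic code).
CODON_TABLE = {
    "GCT": "A", "GCC": "A",   # alanine
    "AAA": "K", "AAG": "K",   # lysine
    "GAT": "D", "GAC": "D",   # aspartate
    "CTG": "L", "TTG": "L",   # leucine
}

def translate(dna):
    """Translate a DNA string codon by codon using the partial table above."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def count_differences(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical homologous gene fragments that differ only by synonymous changes.
gene_1 = "GCTAAAGATCTG"
gene_2 = "GCCAAGGACTTG"

dna_diffs = count_differences(gene_1, gene_2)
protein_diffs = count_differences(translate(gene_1), translate(gene_2))

print("DNA differences:    ", dna_diffs)
print("Protein differences:", protein_diffs)
```

In this toy case the two DNA fragments differ at four positions while the translated proteins are identical, so only the nucleotide comparison carries any signal about their divergence.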

Procedures for phylogenetic analysis are strongly linked to those for sequence alignment and similar difficulties are encountered. Just as two very similar sequences can be easily aligned even by eye, a group of sequences that are very similar but with a small level of variation throughout can easily be organized into a tree. Conversely, as sequences become more and more different through evolutionary change, they can be much more difficult to align. A phylogenetic analysis of very different sequences is also difficult to do because there are so many possible evolutionary paths that could have been followed to produce the observed sequence variation. Because of the complexity of this problem, considerable expertise is required for difficult situations.

Macromolecules, especially sequences, have surpassed morphological and other organismal characters as the most popular form of data for phylogenetic or cladistic analysis. It is unrealistic to believe that an all-purpose phylogenetic analysis recipe can be delineated (Hillis et al., 1993). Although numerous phylogenetic algorithms, procedures, and computer programs have been devised, their reliability and practicality are, in all cases, dependent on the structure and size of the data. The merits and pitfalls of various methods are the subject of often acrimonious debates in taxonomic and phylogenetic journals. Some of these debates are summarized in a series of useful

reviews of phylogenetics (Saitou, 1996; Li, 1997; Swofford et al., 1996). An especially concise introduction to molecular phylogenetics is provided by Hillis et al. (1993).

The danger of generating incorrect results is inherently greater in computational phylogenetics than in many other fields of science. The events yielding a phylogeny happened in the past and can only be inferred or estimated (with a few exceptions, Hillis et al., 1994). Despite the well-documented limitations of available phylogenetic procedures, current biological literature is replete with examples of conclusions derived from the results of analyses in which data had been simply run through one or another phylogeny program. Occasionally, the limiting factor in phylogenetic analysis is not so much the computational method used; more often than not, the limiting factor is the users’ understanding of what the method is actually doing with the data.

This brief guide to phylogenetic analysis has several objectives. First, a conceptual approach that describes some of the most important principles underlying the most widely and easily applied methods of phylogenetic analyses of biological sequences and their interpretation will be introduced. The aim is to show that practical phylogenetic analysis should be conceived as a search for a correct model, as much as a search for the correct tree. In this context, some of the particular models assumed by various popular methods and how these models might affect analysis of particular data sets will be discussed. Finally, some examples of the application of particular methods to the inferences of evolutionary history are provided.

Phylogenetic analysis programs are widely available at little or no cost. A comprehensive list will not be given here since one has been published previously (Swofford et al. 1996). The main ones in use are PHYLIP (phylogenetic inference package) (Felsenstein 1989, 1996), available from Dr. J. Felsenstein at www.evolution.genetics.washington.edu/phylip, and PAUP (phylogenetic analysis using parsimony), available from Sinauer Associates, Sunderland, Massachusetts, www.lms.si.edu/PAUP/. Current versions of these programs provide the three main methods for phylogenetic analysis (parsimony, distance, and maximum likelihood) and also include many types of evolutionary models for sequence variation. Each program requires a particular type of input sequence format. Another program, MacClade, is useful for detailed analysis of the predictions made by PHYLIP, PAUP, and other phylogenetic programs and is also available from Sinauer (www.phylogeny.arizona.edu/macclade/macclade). MacClade, as the name suggests, runs on a Macintosh computer. PHYLIP and PAUP run on practically any machine, but the user interface for PAUP has been most developed for use on the Macintosh computer. There are also several Web sites that provide information on phylogenetic relationships among organisms, and several excellent descriptions of phylogenetic analysis in which the methods are covered in considerable depth (Li and Graur 1991; Miyamoto and Cracraft 1991; Felsenstein 1996; Li and Gu 1996; Saitou 1996; Swofford et al. 1996; Li 1997).

Basic terminology in phylogenetic tree:

Root: the common ancestor of all taxa.

Branch: reflects the relationship between taxa in terms of descent and ancestry.

Branch length: indicates the number of changes that have occurred along the branch.

Node: a taxonomic unit representing either an existing or an extinct species; an internal node is a bifurcating branch point.

Clade: a group of two or more taxa or DNA/protein sequences that includes their most recent common ancestor and all of the descendants of that ancestor. The term is derived from the Greek word ‘klados’, meaning branch or twig.

Distance scale: a scale that represents the number of differences between organisms or sequences.

Topology: the branching pattern of the tree.

Taxon: any named group of organisms, not necessarily a clade.
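As a rough illustration of the clade definition, the sketch below encodes a small rooted tree as nested Python tuples and lists the tip set under every internal node. The tuple encoding and the four-taxon tree are assumptions made for this example, not a format used by any particular phylogenetics program.

```python
def leaves(tree):
    """Return the set of tip labels under a node; a tip is a plain string."""
    if isinstance(tree, str):
        return {tree}
    tips = set()
    for child in tree:
        tips |= leaves(child)
    return tips

def clades(tree):
    """List the tip sets of every internal node; each set is one clade."""
    if isinstance(tree, str):
        return []
    found = [frozenset(leaves(tree))]
    for child in tree:
        found.extend(clades(child))
    return found

# Hypothetical rooted tree: A and B are sister taxa, as are C and D.
rooted_tree = (("A", "B"), ("C", "D"))
all_clades = clades(rooted_tree)

for clade in all_clades:
    print(sorted(clade))
```

Here {A, B} and {C, D} are clades, whereas a grouping such as {A, C} is not, because no single ancestor in this tree has exactly A and C as its descendants.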

The Concept of Evolutionary Trees:

An evolutionary tree is a two-dimensional graph showing evolutionary relationships among organisms or, in the case of sequences, among certain genes from separate organisms. The separate sequences are referred to as taxa (singular taxon), defined as phylogenetically distinct units on the tree. There are two kinds of tree, rooted and unrooted. A rooted tree corresponds, more or less, to everyone’s idea of a tree. Typically, the ancestral state of the organisms, or genes, being studied is shown at the bottom of the tree, and the tree branches, or bifurcates, until it reaches the terminal branches (or tips, or leaves) at the top of the tree. There is nothing sacred about this convention, however, and trees can also be drawn with the tips at the bottom, at the left, at the right or even at a 45° angle. An unrooted tree is a less intuitive, more abstract concept. Unrooted trees represent the branching order, but do not indicate the root, or location, of the last common ancestor. Ideally, rooted trees are preferable, but, in

practice, virtually every phylogenetic reconstruction algorithm provides an unrooted tree; thus, one needs to become familiar with them.

Figure 1: Structure of evolutionary trees.

The tree is composed of outer branches (or leaves) representing the taxa, and nodes and branches representing relationships among the taxa, illustrated as sequences A–D in Figure 1. Thus, sequences A and B are derived from a common ancestor sequence represented by the node below them, and C and D are similarly related. The A/B and C/D common ancestors also share a common ancestor represented by a node at the lowest level of the tree. It is important to recognize that each node in the tree represents a splitting of the evolutionary path of the gene into two different species that are isolated reproductively. Beyond that point, any further evolutionary changes in each new branch are independent of those in the other new branch. The length of each branch to the next node represents the number of sequence changes that occurred prior to the next level of separation. Note that, in this example, the branch length between the A/B node and A is approximately equal to that between the A/B node and B, indicating the species are evolving at the same rate.

The amount of evolutionary time that has transpired since the separation of A and B is usually not known. What is estimated by phylogenetic analysis is the amount of sequence change between the A/B node and A, and also between the A/B node and B. Hence, judging by the branch lengths from this node to A and B, the same number of sequence changes has occurred. However, it is also likely that for some biological or environmental reason unique to each species, one taxon may have undergone more mutations since diverging from the ancestor than the other. In this case, different branch lengths would be shown on the tree. Some types of phylogenetic analyses assume that the rates of evolution in the tree branches are the same, whereas others assume that they vary, as discussed below. The assumption of a uniform rate of mutation in the tree branches is known as the molecular clock hypothesis and is usually most suitable for closely related species (Li and Graur 1991; Li 1997). Tests for this hypothesis have been

devised as described below. Even if there is a common rate of evolutionary change, statistical variations from one branch to another can influence the analysis. The number of substitutions in each branch is generally assumed to vary according to the Poisson distribution, and the rate of change is assumed to be equal across all sequence positions (Swofford et al. 1996).
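To make the Poisson assumption concrete, the following minimal sketch computes the probability of observing k substitutions on a branch when the expected number is known. The expected value of 2.0 is a hypothetical figure chosen for illustration; the formula used is the standard Poisson probability, P(k) = e^(-m) m^k / k!.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k substitutions when `mean` are expected."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

# Hypothetical branch on which 2 substitutions are expected.
expected_substitutions = 2.0

for k in range(6):
    print(k, round(poisson_pmf(k, expected_substitutions), 4))
```

Even with an identical underlying rate, two sister branches can therefore show different observed counts, which is the statistical variation the text warns about.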

An alternative representation of the relationships among sequences A–D in Figure A is shown in Figure B. The difference between the tree in A and that in B is that the tree in B is unrooted. The unrooted tree also shows the evolutionary relationships among sequences A–D, but it does not reveal the location of the oldest ancestry. B could be converted into A by placing another node and adjoining root to the black line. A root could also be placed anywhere else in the tree. Hence, there are a great many more possibilities for rooted than for unrooted trees for a given number of taxa or sequences.

The tree shown is only one of many, each predicting a different evolutionary relationship among the sequences or taxa. The number of possible rooted trees increases very rapidly with the number of sequences or taxa. A root has been placed at this position indicating that in this evolutionary model of the sequences this basal node is the common ancestor of all of the other sequences. A unique path leads from the root node to any other node, and the direction of the path indicates the passage of evolutionary time. The root is defined by including a taxon that we are reasonably sure branched off earlier than the other taxa under study but should be related to the remaining taxa. It is also possible to predict a root, assuming that the molecular clock hypothesis holds.
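How quickly the number of possible trees grows can be sketched with the standard counting formulas for strictly bifurcating trees with n labeled tips, which are not given in the text above and are added here as an aid: (2n-5)!! unrooted trees and (2n-3)!! rooted trees, where !! denotes the product of odd integers. The function names are chosen for this example only.

```python
def odd_double_factorial(m):
    """Product 1 * 3 * 5 * ... * m for odd m; returns 1 when m is 1 or less."""
    product = 1
    for k in range(3, m + 1, 2):
        product *= k
    return product

def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3 labeled tips."""
    return odd_double_factorial(2 * n_taxa - 5)

def count_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees for n_taxa >= 2 labeled tips."""
    return odd_double_factorial(2 * n_taxa - 3)

for n in (3, 4, 5, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For four sequences such as A–D this gives 3 unrooted and 15 rooted trees. The factor between the two counts is 2n-3, the number of branches of an unrooted tree on which a root could be placed.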

The sum of all the branch lengths in a tree is referred to as the tree length. The tree is also a bifurcating or binary tree, in that only two branches emanate from each node. This situation is what one would expect during evolution: only one splitting away of a new species at a time. Trees can have more than two branches emanating from a node if the events separating taxa are so close that they cannot be resolved, or if the tree has been simplified.
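Tree length and the bifurcating property can both be checked with a few lines of Python. The branch list below is a hypothetical tree for sequences A–D with invented branch lengths; the (parent, child, length) encoding is an assumption made for this sketch.

```python
from collections import Counter

# Hypothetical rooted tree stored as (parent, child, branch_length) records.
branches = [
    ("root", "AB", 1.0),
    ("root", "CD", 1.5),
    ("AB", "A", 2.0),
    ("AB", "B", 2.0),
    ("CD", "C", 1.0),
    ("CD", "D", 3.0),
]

# Tree length: the sum of all branch lengths.
tree_length = sum(length for _, _, length in branches)

# A tree is bifurcating when every internal node has exactly two descendants.
children_per_node = Counter(parent for parent, _, _ in branches)
is_bifurcating = all(count == 2 for count in children_per_node.values())

print("Tree length:", tree_length)
print("Bifurcating:", is_bifurcating)
```

A node with three or more entries in `children_per_node` would correspond to an unresolved branching point of the kind described above.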

Figure 2: Possible kinds of phylogenetic tree.

Fundamental Elements of Phylogenetic Models:

Phylogenetic tree-building methods presume particular evolutionary models. For a given data set, these models can be violated because of occurrences such as the transfer of genetic material between organisms. Thus, when interpreting a given analysis, one should always consider the model used and its assumptions and entertain other possible explanations for the observed results.

Models inherent in phylogenetics methods make additional ‘‘default’’ assumptions:

1. The sequence is correct and originates from the specified source.

2. The sequences are homologous (i.e., are all descended in some way from a shared ancestral sequence).

3. Each position in a sequence alignment is homologous with every other in that alignment.

4. Each of the multiple sequences included in a common analysis has a common phylogenetic history with the others (e.g., there are no mixtures of nuclear and organellar sequences).

5. The sampling of taxa is adequate to resolve the problem of interest.

6. Sequence variation among the samples is representative of the broader group of interest.

7. The sequence variability in the sample contains phylogenetic signal adequate to resolve the problem of interest.

There are additional assumptions that are defaults in some methods but can be at least partially corrected for in others:

1. The sequences in the sample evolved according to a single stochastic process.

2. All positions in the sequence evolved according to the same stochastic process.

3. Each position in the sequence evolved independently.

Errors in published phylogenetic analyses can often be attributed to violations of one or more of the foregoing assumptions. Every sequence data set must be evaluated against these assumptions, with other possible explanations for the observed results considered.

Tree Interpretation - The Importance of Identifying Paralogs and Orthologs:

As more genomes are sequenced, we are becoming more interested in learning about protein or gene evolution (i.e., investigating gene phylogeny, rather than organismal phylogeny). This can aid our understanding of the function of proteins and genes.

Studies of protein and gene evolution involve the comparison of homologs— sequences that have common origins but may or may not have common activity. Sequences that share an arbitrary, threshold level of similarity determined by alignment of matching bases are termed homologous. They are inherited from a common ancestor that possessed similar structure, although the structure of the ancestor may be difficult to determine because it has been modified through descent.

Homologs are most commonly either orthologs, paralogs, or xenologs.

Figure 3: The evolution of orthologs and paralogs.

• Orthologs are homologs produced by speciation. They represent genes derived from a common ancestor that diverged due to divergence of the organisms they are associated with. They tend to have similar function.

• Paralogs are homologs produced by gene duplication. They represent genes derived from a common ancestral gene that duplicated within an organism and then subsequently diverged. They tend to have different functions.

• Xenologs are homologs resulting from horizontal gene transfer between two organisms. Determining whether a gene of interest was recently transferred into the current host by horizontal gene transfer is often difficult. Occasionally, the %(G + C) content may be so vastly different from that of the average gene in the current host that a conclusion of external origin is nearly inescapable; more often, however, it is unclear whether a gene has a horizontal origin. The function of xenologs can vary depending on how significant the change in context was for the horizontally moving gene; in general, however, the function tends to be similar.

Genome Complexity and Phylogenetic Analysis:

When performing a phylogenetic analysis, it is important to keep in mind that the genomes of most organisms have a complex origin. Some parts of the genome are passed on by vertical descent through the normal reproductive cycle. Other parts may have arisen by horizontal transfer of genetic material between species through a virus, DNA transformation, symbiosis, or some other horizontal transfer mechanism. Accordingly, when a particular gene is being subjected to phylogenetic analysis, the evolutionary history of that gene may not coincide with the evolutionary history of another.

One of the most significant uses of phylogenetic analysis of sequences is to make predictions concerning the tree of life. For this purpose, a gene should be selected that is universally present in all organisms and easily recognizable by the conservation of sequence in many species. At the same time, there should be enough sequence variation to determine which groups of organisms share the same phylogenetic origin. Ideally, the gene should also not be under selection, meaning that as variation occurs in populations of organisms, certain sequences are not favored with a loss of the more primitive variation. Two molecules of this type that carry a great deal of evolutionary history in inter-species sequence variations are the small-subunit rRNA and mitochondrial sequences. A large number of rRNA sequences from a variety of organisms were aligned, and phylogenetic predictions were then made using the distance method described below (Woese 1987). On the basis of rRNA sequence signatures, that is, regions within the molecule that are conserved in one group of organisms but different in another, Woese (1987) predicted that early life diverged into three main kingdoms (Archaea, Bacteria, and Eukarya), a view that has been challenged (Mayr 1998). Evidence for the presence of additional organisms in these groups has since been found by PCR amplification of environmental samples of RNA (Barns et al. 1996). A more detailed analysis was used to find relationships among individual species within each group. The types of relationships found among the prokaryotic organisms are illustrated in Figure 4. The use of mitochondrial sequences for analysis of primate evolution is given below in the description of the parsimony method of phylogenetic analysis.

Figure 4. Rooted tree of life showing principal relationships among prokaryotic domains Bacteria and Archaea (Woese 1987; Barns et al. 1996; Brown and Doolittle 1997). Branch lengths are approximate only. Species that have been sequenced or are being sequenced are shown.

Although these studies of rRNA sequences suggest a quite clear-cut model for the evolution of life, phylogenetic analysis of other genes and gene families has revealed that the situation is probably more complex and that a more appropriate model might be the one shown in Figure 5. There are now many examples of horizontal or lateral transfer of genes between species that introduce new genes and sequences into an organism (Brown and Doolittle 1997; Doolittle 1999). These types of transfers are inferred from the finding that the phylogenetic histories of different genes in an organism, such as genes for metabolic functions, are not the same or that codon use in different genes varies. Another type of phylogenetic analysis is based on the number of genes shared between genomes and produces a tree that is similar to the rRNA tree (Snel et al. 1999).
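The observation that codon use can differ between genes can be made tangible with a small sketch that tabulates relative codon frequencies. The two coding fragments are hypothetical and far too short for a real analysis; they only show the kind of comparison involved, not a method for detecting transfer.

```python
from collections import Counter

def codon_usage(cds):
    """Return the relative frequency of each codon in a coding sequence."""
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}

# Hypothetical fragments: both are rich in leucine codons but favor different ones.
gene_a = "CTGCTGCTGGCC"
gene_b = "TTATTAGCTTTA"

usage_a = codon_usage(gene_a)
usage_b = codon_usage(gene_b)

print("gene_a:", usage_a)
print("gene_b:", usage_b)
```

In this toy example gene_a uses CTG for leucine while gene_b uses TTA. A consistent difference of this kind between one gene and the rest of a genome is the sort of signal that prompts a closer look at that gene's history.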

Figure 5: The reticulated or net-like form of the tree of life. Analysis of rRNA sequences originally suggested three main branches in the tree of life, Archaea, Bacteria, and Eukarya. Subsequent phylogenetic analysis of genes for some metabolic enzymes is not congruent with the rRNA tree. Hence, for these metabolic genes, the tree has a reticulated form due to horizontal transfer of these genes between species. (Martin 1999)

To track the evolutionary history of genes, more attention has also been paid to the methodology of phylogenetic analysis and to the inherent errors in many of the assumptions (Doolittle 1999). Problems associated with variations between rates of change in different sites and of analyzing more distantly related sequences are discussed below. Moreover, there is evidence that genomes undergo extensive rearrangements, placing sequences of different evolutionary origin next to each other and even causing rearrangements within protein-encoding genes (Henikoff et al. 1997).

The different regions of independent evolutionary origin in a sequence therefore need to be identified. Proteins are modular with functional domains, sometimes repeated within a protein and sometimes shared within a protein family. These regions are identified by their sharing of significant sequence similarity. The remainder of the aligned regions in the group may have variable levels of similarity. In nucleic acid sequences, a given sequence pattern may provide a binding site for a regulatory molecule, leading to promoter function, RNA splicing, or some other function. It may be difficult to decide the extent of these patterns for phylogenetic analysis.

Figure 6: DNA Sequence Evolution

Another feature of genome evolution that should be considered in phylogenetic

analysis is the occurrence of gene duplication events that create tandem copies of a gene. These two copies may then evolve along separate pathways leading to different functions. However, these copies maintain a certain level of similarity and undergo concerted evolution, a process of acquiring mutations in a coordinated way, probably through gene conversion or recombination events. Speciation events following gene duplications will give rise to two independent sets of genes and sequences, one set for each gene copy. Two genes in the same lineage can have different relationships.

Phylogenetic Data Analysis:

A straightforward phylogenetic analysis consists of three major steps:

1. Alignment (both building the data model and extracting a phylogenetic dataset)

2. Tree construction

3. Tree evaluation

Each step is critical for the analysis and should be handled accordingly. For example, trees are only as good as the alignment they are based on. When performing a phylogenetic analysis, it is often insightful to build trees based on different modifications of the alignment to see how the proposed alignment influences the resulting tree.

Relationship of Phylogenetic Analysis to Sequence Alignment:

When the sequences of two nucleic acid or protein molecules found in two different organisms are similar, they are likely to have been derived from a common ancestor sequence. Sequence alignment methods are used to determine sequence similarity; multiple sequence alignment methods need to be applied to a set of related sequences before a phylogenetic analysis can be performed; and database search methods locate sequences that are similar to a query sequence. A sequence alignment reveals which positions in the sequences were conserved and which diverged from a common ancestor sequence. When one is quite certain that two sequences share an evolutionary relationship, the sequences are referred to as being homologous.

The commonest method of multiple sequence alignment first aligns the most closely related pair of sequences and then sequentially adds more distantly related sequences or sets of sequences to this initial alignment. The alignment so obtained is influenced by the most similar sequences in the group and thus may not represent a reliable history of the evolutionary changes that have occurred. Other methods of multiple sequence alignment attempt to circumvent the influence of similar sequences. Once a multiple sequence alignment has been obtained, each column is assumed to correspond to an individual site that has been evolving according to the observed sequence variation in the column. Most methods of phylogenetic analysis assume that each position in the protein or nucleic acid sequence changes independently of the others (analysis of RNA sequence evolution is an exception).

As indicated above, the analysis of sequences that are strongly similar along their entire lengths is quite straightforward. However, to align most sequences requires the positioning of gaps in the alignment. Gaps represent an insertion or deletion of one or more sequence characters during evolution. Proteins that align well are likely to have the same three-dimensional structure. In general, sequences that lie in the core structure of such proteins are not subject to insertions or deletions because any amino acid substitutions must fit into the packed hydrophobic environment of the core. Gaps should therefore be rare in regions of multiple sequence alignments that represent these core sequences. In contrast, more variation, including insertions and deletions, may be found in the loop regions on the outside of the three-dimensional structure because these regions do not influence the core structure as much. Loop regions interact with the environment of small molecules, membranes, and other proteins.

Figure 7: Outline of multiple sequence alignment (MSA) [Mount].

Gaps in alignments can be thought of as representing mutational changes in sequences, including insertions, deletions, or rearrangements of genetic material. The expectation that a gap of virtually any length can occur as a single event introduces the problem of judging how many individual changes have occurred and in what order. Gaps are treated in various ways by phylogenetic programs, but no clear-cut model as to how they should be treated has been devised. Many methods ignore gaps or focus on regions in an alignment that do not have any gaps. Nevertheless, gaps can be useful as phylogenetic markers in some situations.

Another approach for handling gaps is to avoid analysis of individual sites in the sequence alignment and instead to use sequence similarity scores as a basis for phylogenetic analysis. Rather than trying to decide what has happened at each sequence position in an alignment, a similarity score based on a scoring matrix with penalties for gaps is often used. These scores may be converted to distance scores that are suitable for phylogenetic analysis (Feng and Doolittle 1996) by distance methods.

Tree-Building Methods:

Tree-building methods implemented in available software are discussed in detail
in the literature (Saitou, 1996; Swofford et al., 1996; Li, 1997) and described on the Internet. This section briefly describes some of the most popular methods. Tree building methods can be sorted into distance-based vs. character-based methods. Much of the discussion in molecular phylogenetics dwells on the utility of distance and character-based methods (e.g., Saitou, 1996; Li, 1997). Distance methods compute pairwise distances according to some measure and then discard the actual data, using only the fixed distances to derive trees. Character-based methods derive trees that optimize the distribution of the actual data patterns for each character. Pairwise distances are, therefore, not fixed, as they are determined by the tree topology. The most commonly applied distance-based methods include neighbor-joining and the Fitch-Margoliash method, and the most common character-based methods include maximum parsimony and maximum likelihood.

Outline of Tree-Building Method:

Figure 8: basic scheme of phylogenetic analysis methods [Mount].

1. The sequences chosen can be either DNA or protein sequences. Different programs and program options are used for each type. RNA sequences are analyzed by covariation methods and by analyzing changes in secondary structure. The selected sequences should align with each other along their entire lengths, or else each should have a common set of patterns or domains that provides a strong indication of evolutionary relatedness.

2. The alignment of the sequence pairs should not have a large number of gaps that are obviously necessary to align identical or related characters. A phylogenetic analysis should only be performed on parts of sequences that can be reasonably aligned. In general, phylogenetic methods analyze conserved regions that are represented in all the sequences. The more similar the sequences are to each other, the better. The simplest evolutionary models assume that the variation in each column of the multiple sequence alignment represents single-step changes and that no reversals (A → T → A) have occurred. As the observed variation increases, more multiple-step changes (A → T → G) and reversions are likely to be present. Corrections may be applied for such variation, thereby increasing the observed amount of change to a more reasonable value. These corrections assume a uniform rate of change at all sequence positions over time. Gaps in the multiple sequence alignment are usually not scored because there is no suitable model for the evolutionary mechanisms that produce them.

3. This question is designed to select sequences suitable for maximum parsimony analysis. Other methods may also be used with these same sequences. For parsimony analysis, the best results are obtained when the amount of variation among all pairs of sequences is similar (no very different sequences are present) and when the amount of variation is small. Some columns in the multiple sequence alignment will have the same residue in all sequences; other columns will include both conserved and nonconserved residues. There should be a clear-cut majority of certain residues in some columns of the alignment but also some variation. These more common residues are taken to represent an earlier group of sequences from which others were derived. If there is too much variation, there will be too many possible ancestral relationships. Because the maximum parsimony method has to attempt to fit all possible trees to the data, the method is not suitable for more than 11 or 12 sequences because there are too many trees to test. More than one tree may be found to be equally parsimonious. A consensus tree representing the conserved features of the different trees may then be produced.

4. The purpose of this question is to select sequences for phylogenetic analysis by distance methods. Distance methods are able to predict an evolutionary tree when variation among the sequences is present (some sequences are more alike than others) and when the amount of variation is intermediate. The number of changed positions in an alignment between two sequences divided by the total number of matched positions is the distance between the sequences. As distances increase, corrections are necessary for deviations from single-step changes between sequences. Of course, as distances increase, the uncertainty of alignments also increases, and a reassessment of the suitability of the multiple sequence alignment method may be necessary. Sequences with this type of variation may also be suitable for phylogenetic analysis by maximum likelihood methods. Distance methods may be used with a large number of sequences. The program CLUSTALW produces a distance-based tree at the same time as a multiple sequence alignment (Higgins et al. 1996).

5. Maximum likelihood methods may be used for any set of related sequences, but they are particularly useful when the sequences are more variable. These methods are computationally intense, and computational complexity increases with the number of sequences since the probability of every possible tree must be calculated as described in the text. An advantage of these methods is that they provide evolutionary models to account for the variation in the sequences.

6. The data in the multiple sequence alignment columns are resampled to test how well the branches on the evolutionary tree are supported (bootstrapping).

Character-Based Methods:

The character-based methods have little in common with each other, besides the use of the character data at all steps in the analysis. This allows the assessment of the reliability of each base position in an alignment on the basis of all other base positions.

Maximum Parsimony Method (MP):

This method predicts the evolutionary tree (or a tree) that minimizes the number
of steps required to generate the observed variation in the sequences. For this reason, the method is also sometimes referred to as the minimum evolution method. A multiple sequence alignment is required to predict which sequence positions are likely to correspond. These positions will appear in vertical columns in the multiple sequence alignment. For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified. This analysis is continued for every position in the sequence alignment. Finally, those trees that produce the smallest number of changes overall for all sequence positions are identified. This method is used for sequences that are quite similar and for small numbers of sequences, for which it is best suited. The algorithm followed is not particularly complicated, but it is guaranteed to find the best tree, because all possible trees relating a group of sequences are examined. For this reason, the method is quite time-consuming and is not useful for data that include a large number of sequences or sequences with a large amount of variation. One or more unrooted trees are predicted and other assumptions must be made to root the predicted tree.
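The cost of examining all possible trees can be made concrete with a short calculation. The sketch below assumes the standard count of unrooted, fully bifurcating trees for n labelled sequences, (2n - 5)!! = 3 × 5 × ... × (2n - 5); the function name is illustrative and is not taken from any phylogenetics package.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, fully bifurcating trees for n_taxa labelled tips.

    Uses the product 3 * 5 * ... * (2n - 5), which is valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Odd factors 3, 5, ..., 2n - 5 (the range is empty when n == 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 8, 12):
    print(n, count_unrooted_trees(n))
```

Four sequences give 3 trees, while 12 sequences already give 654,729,075, which suggests why an exhaustive search is practical only for small data sets.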

Maximum parsimony is an optimization criterion that adheres to the principle that the best explanation of the data is the simplest, which in turn is the one requiring the fewest ad hoc assumptions. In practical terms, the MP tree is the shortest—the one with the fewest changes—which, by definition, is also the one with the fewest parallel changes. There are several variants of MP that differ with regard to the permitted directionality of character state change (Swofford et al., 1996).

To accommodate substitution bias, MP is amenable to weighting; for example, a transversion can be weighted relative to a transition. The easiest way to do this is to create a weighting step matrix in which the weights are the reciprocal of the rates estimated using ML. However, step-matrix weighting can greatly slow MP computation. The MP method performs poorly when there is substantial among-site rate heterogeneity (Huelsenbeck, 1995). There are few good fixes for this problem. One approach is to modify the data set to include only sites that exhibit little or no heterogeneity as determined by likelihood estimation. Another approach is to recursively reweight positions according to their propensity to change as observed in preliminary trees. This “successive approximations” approach is automatically facilitated in PAUP, but it is prone to error to the degree that the preliminary trees are incorrect. PAUP offers a number of options and parameter settings for a parsimony analysis in the Macintosh environment. The main programs for maximum parsimony analysis in the PHYLIP package (Felsenstein 1996) include DNAPARS and DNAPENNY for DNA sequences and PROTPARS for protein sequences.

MP analyses tend to yield numerous (and sometimes many thousands of) trees that have the same score. Because each is held to be as optimal as any other, only groupings present in the strict consensus of all trees are considered to be supported by the data. The reason that distance and ML tree methods tend to arrive at a single best tree is that their calculations involve division and decimals, whereas MP merely counts discrete steps. For a given data set, a strict consensus of all ME or ML trees that are not significantly worse than optimal probably would yield resolution more or less comparable to the MP consensus. Unfortunately, whereas MP users conventionally present strict consensus (and sometimes consensus of trees one or two steps worse), ME and ML users typically do not.

Simulation studies have shown that MP performs no better than ME and worse
than ML when the amount of sequence evolution since lineages diverged is much greater than the amount of divergence that occurred between lineage splits (i.e., in a tree with very long terminal branches and short internodes) (Huelsenbeck, 1995). This condition produces “long branch attraction”: the long branches become artificially connected because the number of nonhomologous similarities the sequences have accumulated exceeds the number of homologous similarities they have retained with their true closest relatives (Swofford et al., 1996). Character weighting improves the performance of MP under these conditions (Huelsenbeck, 1995).

Example: Maximum Parsimony Analysis of Sequences:

Taxa    Sequence position (sites) and character
        1  2  3  4  5  6  7  8  9
1       A  A  G  G  G  T  G  C  A
2       A  G  C  C  G  T  G  C  G
3       A  G  A  T  A  T  C  C  A
4       A  G  A  G  A  T  C  C  G

Table 1: Example of phylogenetic analysis to find the correct unrooted tree from four aligned sequences by the maximum parsimony method (Li and Graur 1991).

Rules for analysis by maximum parsimony in this example are:

1. In the analysis, all of the possible unrooted trees (three trees for four sequences) are considered. The sequence variations at each site in the alignment are placed at the tips of the trees, and the tree that requires the smallest number of changes to produce this variation is determined. This analysis is repeated for each informative site, and the tree (or trees) that supports the smallest number of changes overall is found. The length of the tree, defined as the sum of the number of steps in each branch of the tree, will be a minimum.

2. Some sites are informative, i.e., they favor one tree over another (site 5 is informative but sites 1, 6, and 8 are not).

3. To be informative, a site must have at least two different sequence characters, each of which is present in at least two taxa (sites 1, 2, 3, 4, 6, and 8 are not informative; sites 5, 7, and 9 are informative).

4. Only the informative sites need to be analyzed.

The three possible trees are shown in Figure 9. The optimal tree is obtained by adding the number of changes at each informative site for each tree, and picking the tree requiring the least number of changes. A scoring matrix may be used instead of scoring a change as 1. Tree 1 is the correct one and the tree length will be 4 (one change at each of positions 5 and 7 and two changes at position 9).
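The counting in this example can be reproduced with a short script. The following is a minimal sketch that applies Fitch's set-based counting to the four sequences of Table 1; it assumes that Tree 1 groups taxa 1 and 2 against taxa 3 and 4 (as the stated changes imply), while the labels "Tree 2" and "Tree 3" and all function names are illustrative assumptions.

```python
from collections import Counter

# Aligned sequences from Table 1 (taxa 1-4, sites 1-9).
SEQS = {1: "AAGGGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}

# The three unrooted trees for four taxa, each written as the two pairs
# separated by the internal branch. Labels for trees 2 and 3 are assumed.
TREES = {
    "Tree 1": ((1, 2), (3, 4)),
    "Tree 2": ((1, 3), (2, 4)),
    "Tree 3": ((1, 4), (2, 3)),
}


def is_informative(column):
    """True if at least two characters each occur in two or more taxa."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def fitch_changes(tree, site):
    """Minimum number of changes at one site (0-based) on a four-taxon tree."""
    changes = 0
    node_sets = []
    for a, b in tree:
        set_a, set_b = {SEQS[a][site]}, {SEQS[b][site]}
        if set_a & set_b:
            node_sets.append(set_a & set_b)   # pair agrees: no change needed
        else:
            node_sets.append(set_a | set_b)   # pair differs: one change
            changes += 1
    if not (node_sets[0] & node_sets[1]):
        changes += 1                          # change on the internal branch
    return changes


n_sites = len(SEQS[1])
informative = [i for i in range(n_sites)
               if is_informative([SEQS[t][i] for t in SEQS])]
lengths = {name: sum(fitch_changes(tree, i) for i in informative)
           for name, tree in TREES.items()}

print([i + 1 for i in informative])  # [5, 7, 9]
print(lengths)                       # {'Tree 1': 4, 'Tree 2': 5, 'Tree 3': 6}
```

Only the informative sites are scored here, in line with rule 4, so Tree 1 comes out shortest with a length of 4.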

Figure 9: Example of phylogenetic analysis using the maximum parsimony method (Li and Graur 1991). This method finds the tree that changes any sequence into all of the others by the least number of steps.

Features:

- Multiple sequence alignment is needed.
- Used for rather similar sequences analyzed in small numbers.
- Time-consuming and computationally costly.
- Widely used software in the field that implements maximum parsimony and other methods is PHYLIP: http://evolution.genetics.washington.edu/phylip.html
- PAUP* is also used for this purpose: http://paup.csit.fsu.edu/index.html

Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.

Advantages:

- Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
- Can be used on molecular and non-molecular (e.g., morphological) data.
- Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
- Can be used for character (can infer the exact substitutions) and rate analysis.
- Can be used to infer the sequences of the extinct (hypothetical) ancestors.

Disadvantages:

- Can be fooled by high levels of homoplasy (‘same’ events).
- Can become positively misleading in the “Felsenstein Zone”.

The Maximum Likelihood Method (ML):

This method uses probability calculations to find a tree that best accounts for the
variation in a set of sequences. The method is similar to the maximum parsimony method in that the analysis is performed on each column of a multiple sequence alignment. All possible trees are considered. Hence, the method is only feasible for a small number of sequences. For each tree, the number of sequence changes or mutations that may have occurred to give the sequence variation is considered. Because the rate of appearance of new mutations is very small, the more mutations needed to fit a tree to the data, the less likely that tree (Felsenstein 1981). The maximum likelihood method resembles the maximum parsimony method in that trees with the least number of changes will be the most likely. However, the maximum likelihood method presents an additional opportunity to evaluate trees with variations in mutation rates in different lineages, and to use explicit evolutionary models such as the Jukes-Cantor and Kimura models. Thus, the method can be used to explore relationships among more diverse sequences, conditions that are not well handled by maximum parsimony methods. However, with faster computers, the maximum likelihood method is seeing wider use and is being used for more complex models of evolution (Schadt et al. 1998). Maximum likelihood has also been used for an analysis of mutations in overlapping reading frames in viruses (Hein and Støvlbæk 1996). PAUP version 4 can be used to perform a maximum likelihood analysis on DNA sequences. The method has also been applied for changes from one amino acid to another in protein sequences.

In practice, ML is derived for each base position in an alignment. The likelihood is calculated in terms of the probability that the pattern of variation at a site would be produced by a particular substitution process, given a particular tree and the overall observed base frequencies. The likelihood becomes the sum of the probabilities of each possible reconstruction of substitutions under a particular substitution process. The likelihoods for all the sites are multiplied to give an overall ‘‘likelihood of the tree’’ (i.e., the probability of the data given the tree and the substitution process). As one can imagine, for one particular tree, the likelihood of the data is low at some sites and high at others. For a ‘‘good’’ tree, many sites will have higher likelihood, so the product of likelihoods is high. For a ‘‘poor’’ tree, the reverse will be true.
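The per-site calculation described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the Jukes-Cantor model with equal base frequencies, sets every branch length to an arbitrary value of 0.1 substitutions per site instead of optimizing it, and reuses the four sequences of Table 1; the tree encoding and all function names are hypothetical. The summation over reconstructions is done with Felsenstein's pruning recursion.

```python
import math

BASES = "ACGT"


def jc69_prob(i, j, t):
    """Jukes-Cantor probability that base i is replaced by base j along a
    branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def conditional(node, site, seqs):
    """Return {base: P(data below node | node carries base)}.

    A node is either a taxon name (leaf) or a tuple of (child, branch_length)
    pairs (internal node).
    """
    if isinstance(node, str):
        return {b: 1.0 if seqs[node][site] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = conditional(child, site, seqs)
        for b in BASES:
            result[b] *= sum(jc69_prob(b, c, t) * below[c] for c in BASES)
    return result


def site_likelihood(tree, site, seqs):
    """Sum over all root states, each weighted by the base frequency 1/4."""
    cond = conditional(tree, site, seqs)
    return sum(0.25 * cond[b] for b in BASES)


def log_likelihood(tree, seqs):
    """Log of the product of the site likelihoods."""
    n_sites = len(next(iter(seqs.values())))
    return sum(math.log(site_likelihood(tree, s, seqs)) for s in range(n_sites))


def quartet(a, b, c, d, t=0.1):
    """Tree ((a,b),(c,d)) with every branch length set to t (arbitrary value)."""
    return ((((a, t), (b, t)), t), (((c, t), (d, t)), t))


seqs = {"t1": "AAGGGTGCA", "t2": "AGCCGTGCG", "t3": "AGATATCCA", "t4": "AGAGATCCG"}
trees = {
    "Tree 1": quartet("t1", "t2", "t3", "t4"),
    "Tree 2": quartet("t1", "t3", "t2", "t4"),
    "Tree 3": quartet("t1", "t4", "t2", "t3"),
}
scores = {name: log_likelihood(tree, seqs) for name, tree in trees.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Log-likelihoods are summed instead of multiplying raw likelihoods, since the product over many sites quickly becomes too small to represent. A real analysis would also optimize the branch lengths and the substitution parameters for each tree.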

The substitution model should be optimized to fit the observed data. For example, if there is a transition bias, evidenced by an inordinate number of sites that include only purines or only pyrimidines, the likelihood of the data under a model that assumes no bias will never be as good as under one that does. Likewise, if a substantial proportion of the sites are occupied by a single base and another substantial proportion have equal base frequencies, the likelihood of the data under a model that assumes that all sites evolve equally will be less than that of a model that allows rate heterogeneity. Modifying the substitution parameters, however, modifies the likelihood of the data associated with particular trees. Thus, the tree yielding the highest likelihood under one substitution model might yield much lower likelihood under another.

Because ML uses great amounts of computational time, it is usually impractical to perform a complete search that simultaneously optimizes the substitution model and the tree for a given data set. An economical, heuristic approach is recommended (Adachi and Hasegawa, 1996; Swofford et al., 1996). Perhaps the best time saver in this regard is preliminary ML estimation of the substitution model (as can be performed using PAUP). This procedure can be applied iteratively: searching for better ML trees, then re-estimating the parameters, and then searching for better trees again.

As algorithms, computers, and phylogenetic understanding have improved, the ML criterion has become more popular for molecular phylogenetic analysis. In simulation studies, ML has consistently outperformed ME and MP when the data analysis proceeds according to the same model that generates the data (Huelsenbeck, 1995). ML will always be the most computationally intensive method of all, however, so there will always be situations in which it is not practical.

Features:

- Uses probability calculations to find a tree that best accounts for the observed sequence variations.
- All possible trees are considered (time-consuming).
- Few sequences can be analyzed.
- It is possible to evaluate trees with variations in mutation rates in different lineages.
- Uses explicit evolutionary models of sequence change (e.g., Jukes-Cantor, Kimura).

Optimality criterion: ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree would give rise to the observed data. The tree found to have the highest ML value is considered to be the preferred tree.

Advantages:

- Are inherently statistical and evolutionary model-based.
- Usually the most ‘consistent’ of the methods available.
- Can be used for character (can infer the exact substitutions) and rate analysis.
- Can be used to infer the sequences of the extinct (hypothetical) ancestors.
- Can help account for branch-length effects in unbalanced trees.
- Can be applied to nucleotide or amino acid sequences, and other types of data.

Disadvantages:

- Are not as simple and intuitive as many other methods.
- Are computationally very intense (limits number of taxa and length of sequence).
- Like parsimony, can be fooled by high levels of homoplasy.
- Violations of the assumed model can lead to incorrect trees.

Distance-Based Methods:

The distance method employs the number of changes between each pair in a group of sequences to produce a phylogenetic tree of the group. The sequence pairs that have the smallest number of sequence changes between them are termed “neighbors.” On a tree, these sequences share a node or common ancestor position and are each joined to that node by a branch. The goal of distance methods is to identify a tree that positions the neighbors correctly and that also has branch lengths which reproduce the original data as closely as possible. Finding the closest neighbors among a group of sequences by the distance method is often the first step in producing a multiple sequence alignment.

The distance method was pioneered by Feng and Doolittle, and a collection of programs by these authors will produce both an alignment and a tree of a set of protein sequences (Feng and Doolittle 1996). The program CLUSTALW uses the neighbor-joining distance method as a guide to multiple sequence alignments. PAUP version 4 has options for performing a phylogenetic analysis by distance methods. The programs of the PHYLIP package that perform a distance analysis automatically read in sequences in the PHYLIP infile format and automatically produce a file called outfile containing a distance table.

Distance-based methods use the amount of dissimilarity (the distance) between two aligned sequences to derive trees. A distance method would reconstruct the true tree if all genetic divergence events were accurately recorded in the sequence (Swofford et al., 1996). However, divergence encounters an upper limit as sequences become mutationally saturated. After one sequence of a diverging pair has mutated at a particular site, subsequent mutations in either sequence cannot render the sites any more “different.” In fact, subsequent mutations can make them equal again (for example, if a valine mutates to an isoleucine, which mutates back to a valine). Therefore, most distance-based methods correct for such “unseen” substitutions. In practice, application of the rate matrix effectively presumes that some proportion of observed pairwise base identities actually represents multiple mutations and that this proportion increases with increasing overall sequence divergence. Some programs implement, at least optionally, calculation of uncorrected distances, whereas, for example, the MEGA program (Kumar et al., 1994) implements only uncorrected distances for codon and amino acid data. Unless overall divergences are very low, the latter approach is virtually guaranteed to give inaccurate results.
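The difference between an uncorrected and a corrected distance can be illustrated as follows. This is a minimal sketch assuming nucleotide data and the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), which is only one of several possible corrections; skipping columns that contain a gap character "-" is also an assumption, as are the function names and the two short example sequences.

```python
import math


def p_distance(seq_a, seq_b):
    """Uncorrected distance: proportion of differing sites among the aligned
    positions where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("sequences are saturated; the correction is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Two short aligned sequences, used purely for illustration.
p = p_distance("AAGGGTGCA", "AGCCGTGCG")
d = jukes_cantor(p)
print(round(p, 3), round(d, 3))
```

The corrected value is always at least as large as the observed proportion, and the gap between the two widens as the sequences approach saturation at p = 0.75.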

Pairwise distance is calculated using maximum-likelihood estimators of substitution rates. The most popular distance tree-building programs have a limited number of substitution models, but PAUP 4.0 implements a number of models, including the actual model estimated from the data using maximum likelihood, as well as the logdet distance method.

Distance methods are much less computationally intensive than maximum likelihood but can employ the same models of sequence evolution. This is their biggest advantage. The disadvantage is that the actual character data are discarded. The most commonly applied distance-based methods are the unweighted pair group method with arithmetic mean (UPGMA), neighbor joining (NJ), and methods that optimize the additivity of a distance tree, including the minimum evolution (ME) method. Several methods are available in more than one phylogenetics software package but not all implementations allow the same parameter specifications and/or tree optimization features (e.g., branch swapping).

Unweighted Pair Group Method with Arithmetic Mean (UPGMA):

UPGMA is a clustering or phenetic algorithm—it joins tree branches based on the criterion of greatest similarity among pairs and averages of joined pairs. It is not strictly an evolutionary distance method (Li, 1997). UPGMA is expected to generate an accurate topology with true branch lengths only when the divergence is according to a molecular clock (ultrametric; Swofford et al., 1996) or approximately equal to raw sequence dissimilarity. As mentioned earlier, these conditions are rarely met in practice.
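A compact sketch of the clustering procedure is given below. The distance matrix is hypothetical and deliberately ultrametric (consistent with a molecular clock), which is the situation in which UPGMA is expected to behave well; the function name and the nested-tuple output format are assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, matrix):
    """UPGMA clustering.

    names: taxon labels; matrix[i][j]: distance between names[i] and names[j].
    Returns a nested tuple (left, right, height); leaves are plain labels.
    """
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                        # cluster id -> leaf count
    dist = {(i, j): matrix[i][j]
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(trees) > 1:
        # Join the two clusters separated by the smallest distance.
        i, j = min(dist, key=dist.get)
        height = dist[(i, j)] / 2.0
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean of the two old distances.
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            dist[(k, next_id)] = ((sizes[i] * d_ik + sizes[j] * d_jk)
                                  / (sizes[i] + sizes[j]))
        trees[next_id] = (trees.pop(i), trees.pop(j), height)
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        # Drop all distances that involve the two clusters just merged.
        dist = {pair: v for pair, v in dist.items()
                if i not in pair and j not in pair}
        next_id += 1
    return next(iter(trees.values()))


names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 4, 6],
    [2, 0, 4, 6],
    [4, 4, 0, 6],
    [6, 6, 6, 0],
]
print(upgma(names, matrix))  # ('D', ('C', ('A', 'B', 1.0), 2.0), 3.0)
```

Each height is half the distance between the two clusters being joined, so all tips end up equidistant from the root. With data that depart from a molecular clock, this built-in assumption is what can mislead the method.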

Neighbor Joining (NJ):

The neighbor-joining algorithm is commonly applied with distance tree building, regardless of the optimization criterion. The fully resolved tree is ‘‘decomposed’’ from a fully unresolved ‘‘star’’ tree by successively inserting branches between a pair of closest (actually, most isolated) neighbors and the remaining terminals in the tree (Fig. 10). The closest neighbor pair is then consolidated, effectively reforming a star tree, and the process is repeated. The method is comparatively rapid.
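The repeated "join a pair, then re-form the star" cycle can be sketched as follows. This sketch assumes the usual Q criterion for choosing the pair to join, Q(a, b) = (n - 2) d(a, b) - r(a) - r(b), where r is the sum of distances from a node to all others; the five-taxon distance matrix is a small hypothetical example, and the function name, the internal node labels, and the edge-list output format are illustrative choices.

```python
def neighbor_joining(names, matrix):
    """Return the unrooted NJ tree as a list of (node, neighbour, length) edges."""
    d = {(a, b): matrix[i][j]
         for i, a in enumerate(names)
         for j, b in enumerate(names) if i != j}
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r(a): sum of distances from a to every other current node.
        total = {a: sum(d[(a, b)] for b in nodes if b != a) for a in nodes}
        # Choose the pair that minimises the Q criterion.
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p] - total[p[0]] - total[p[1]],
        )
        new = f"node{counter}"
        counter += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[(a, b)] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[(a, b)] - len_a
        edges += [(new, a, len_a), (new, b, len_b)]
        # Distances from the new node to the remaining nodes (the new star).
        for c in nodes:
            if c not in (a, b):
                d[(new, c)] = d[(c, new)] = 0.5 * (d[(a, c)] + d[(b, c)] - d[(a, b)])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    # The last two nodes are connected by the final branch.
    edges.append((nodes[0], nodes[1], d[(nodes[0], nodes[1])]))
    return edges


names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for edge in neighbor_joining(names, matrix):
    print(edge)
```

For this matrix the sketch joins a with b first, then c with that pair, and finally d with e. Unlike UPGMA, the branch lengths leading to the two members of a pair may differ, so no molecular clock is assumed.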

Figure 10: Star decomposition. This is how tree-building algorithms such as neighbor joining work. The most similar terminals are joined, and a branch is inserted between them and the remainder of the star. Subsequently, the new branch is consolidated so that its value is a mean of the two original values, yielding a star tree with n-1 terminals. The process is repeated until only one terminal remains.

Fitch-Margoliash (FM):

The Fitch-Margoliash (FM) method seeks to maximize the fit of the observed pairwise distances to a tree by minimizing the squared deviation of all possible observed distances relative to all possible path lengths on the tree (Felsenstein, 1997). There are several variations that differ in how the error is weighted. The variance estimates are not completely independent because errors in all the internal tree branches are counted at least twice (Rzhetsky and Nei, 1992).

Minimum Evolution (ME):

Minimum evolution seeks to find the shortest tree that is consistent with the path lengths measured in a manner similar to FM; that is, ME works by minimizing the squared deviation of observed to tree-based distances (Rzhetsky and Nei, 1992; Swofford et al., 1996; Felsenstein, 1997). Unlike FM, ME does not use all possible pairwise distances and all possible associated tree path lengths. Rather, it fixes the location of internal tree nodes based on the distance to external nodes and then optimizes the internal branch length according to the minimum measured error between these “observed” points. It thus purports to eliminate the nonindependence of FM measurements.

Which Distance-Based Tree-Building Procedure Is Best?

ME and FM appear to be the best procedures, and they perform nearly identically in simulation studies (Huelsenbeck, 1995). ME is becoming more widely implemented in computer programs, including METREE (Rzhetsky and Nei, 1994) and PAUP. For protein data, the FM procedure in PHYLIP offers the greatest range of substitution models but no correction for among-site rate heterogeneity. The MEGA (Kumar et al., 1994) and METREE packages include a gamma correction for proteins, but only in conjunction with a raw (“p-distance”) divergence model (no distance or bias correction), which is unreliable except for small divergences (Rzhetsky and Nei, 1994). MEGA also computes separate distances for synonymous and nonsynonymous sites, but this method is valid only in the absence of substitution or base frequency bias and when there is no correction for among-site rate heterogeneity. Thus, for most data sets, using the nucleotide data under a more realistic model might be preferable to MEGA’s methods.

Simulation studies indicate that UPGMA performs poorly over a broad range of tree shape space (Huelsenbeck, 1995). The use of this method is not recommended; it is mentioned here only because its application seems to persist, as evidenced by UPGMA gene trees appearing in publications (Huelsenbeck, 1995).

NJ is clearly the fastest procedure and generally yields a tree close to the ME tree (Rzhetsky and Nei, 1992; Li, 1997). However, it yields only one tree. Depending on the structure of the data, numerous different trees might be as good as or significantly better than the NJ tree (Swofford et al., 1996).

The Difference between Distance, Parsimony, and Maximum Likelihood Methods:

Distance matrix methods count the number of differences between each pair of sequences. This number is referred to as the evolutionary distance, and its exact size depends on the evolutionary model used. The actual tree is then computed from the matrix of distance values, either by running a clustering algorithm that starts with the most similar sequences (i.e., those that have the shortest distance between them) or by trying to minimize the total branch length of the tree.

The principle of maximum parsimony searches for a tree that requires the smallest number of changes to explain the differences observed among the taxa under study.
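The pairwise distance calculation used by distance matrix methods can be sketched in Python as follows. The sketch computes the raw proportion of differing sites (the p-distance) and then applies the Jukes-Cantor correction as one example of how the evolutionary model changes the size of the distance. The choice of Jukes-Cantor, the function names and the example sequences are assumptions made for illustration; the text above does not prescribe a particular model.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ.

    Positions with a gap ("-") in either sequence are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance d = -(3/4) ln(1 - 4p/3).

    The correction is undefined when p is 0.75 or larger.
    """
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Two hypothetical aligned sequences that differ at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor(p)
print(p, d)
```

The corrected distance is always at least as large as the raw p-distance, because the correction accounts for multiple substitutions at the same site that a simple count cannot see.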

A maximum-likelihood approach to phylogenetic inference evaluates the probability that the chosen evolutionary model has generated the observed data. The evolutionary model could simply mean that one assumes that changes between all nucleotides (or amino acids) are equally probable. The program will then assign all possible nucleotides to the internal nodes of the tree in turn and calculate the probability that each such sequence would have generated the data (if two sister taxa have the nucleotide "A," a reconstruction that assumes derivation from a "C" would be assigned a low probability compared with a derivation that assumes there already was an "A"). The probabilities for all possible reconstructions (not just the most probable one) are summed up to yield the likelihood for one particular site. The likelihood for the tree is the product of the likelihoods for all alignment positions in the data set.
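The likelihood calculation described above can be illustrated for the simplest possible case, two sister taxa joined at a single ancestral node. The Python sketch below assumes the Jukes-Cantor model (all changes equally probable) with branch lengths in expected substitutions per site; the function names and the branch length used in the example are hypothetical. A real program would apply the same summation recursively over every internal node of a larger tree.

```python
import math

NUCLEOTIDES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of observing b at the end of a branch
    of length t that starts with nucleotide a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(x, y, t1, t2):
    """Likelihood of one alignment column for two sister taxa.

    Sums over all four possible ancestral nucleotides, each with
    prior probability 1/4, not just the most probable one.
    """
    return sum(0.25 * jc_prob(anc, x, t1) * jc_prob(anc, y, t2)
               for anc in NUCLEOTIDES)

def log_likelihood(seq1, seq2, t1, t2):
    """Log of the product of the site likelihoods over all columns."""
    return sum(math.log(site_likelihood(x, y, t1, t2))
               for x, y in zip(seq1, seq2))

# Both taxa show "A": compare the reconstructions with ancestor "A" and "C".
t = 0.1
from_a = 0.25 * jc_prob("A", "A", t) * jc_prob("A", "A", t)
from_c = 0.25 * jc_prob("C", "A", t) * jc_prob("C", "A", t)
print(from_a > from_c)
print(log_likelihood("ACGTAC", "ACGTTC", t, t))
```

Logarithms are summed rather than multiplying the raw site likelihoods, which is a common way to avoid numerical underflow on long alignments.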

Pitfalls of phylogenetic analysis:

It is not generally appreciated that molecular sequence analysis is a field in its infancy. It is an inexact science in which few analytical tools are truly based on general mathematical and statistical principles. Consequently, many, perhaps most, phylogenetic trees reconstructed from molecular sequences are incorrect and frequently conflict with common sense. This is mainly caused by one or more of the three pitfalls of sequence analysis: (1) incorrect sequence alignments, caused by inadequate mathematical models and often related specifically to biases created by progressive alignment algorithms when they are used to align more than three taxa; (2) the failure to account properly for site-to-site variation (sites within sequences can evolve at different rates); and (3) unequal rate effects (the inability of most tree-building algorithms to produce good phylogenetic trees when genes from different taxa in the tree evolve at different rates). All three pitfalls can produce the same artefact: long branch attraction. In these artefactually produced trees, rapidly evolving sequences (represented by long branches on phylogenetic trees) are placed with other rapidly evolving sequences, even if the sequences are only distantly related. Compared with most problems in molecular biology, which can be solved by acquiring more data, long branch attraction is much more complex. If longer sequences are used when long branch attraction is present, the incorrect solution will be even more strongly supported. Of the three pitfalls, alignment artefacts are potentially the most serious, because even if the second and third problems are solved, misalignments can still produce incorrect trees. A new algorithm, paralinear (logdet) distances (Lockhart et al., 1994; Lake, 1994), provides a simple but rigorous mathematical solution for the third pitfall. This algorithm is now available in some phylogenetic packages. For a discussion of many other useful algorithms, including maximum parsimony, maximum likelihood and other distance methods, see Stewart (1993).

Tree Evaluation:

Several procedures are available that evaluate the phylogenetic signal in the data and the robustness of trees (Swofford et al., 1996; Li, 1997). The most popular of the former class are tests of data signal versus randomized data (skewness and permutation tests). The latter class includes tests of tree support from resampling of observed data (nonparametric bootstrap). The likelihood ratio test provides a means of evaluating both the substitution model and the tree.

Reliability of Phylogenetic Predictions:

As discussed earlier, phylogenetic analysis of a set of sequences that aligns very well is straightforward because the positions that correspond in the sequences can be readily identified in a multiple sequence alignment of the sequences. The types of changes in the aligned positions or the numbers of changes in the alignments between pairs of sequences then provide a basis for a determination of phylogenetic relationships among the sequences by the above methods of phylogenetic analysis. For sequences that have diverged considerably, a phylogenetic analysis is more challenging. A determination of the sequence changes that have occurred is more difficult because the multiple sequence alignment may not be optimal and because multiple changes may have occurred in the aligned sequence positions. The choice of a suitable multiple sequence alignment method depends on the degree of variation among the sequences. Once a suitable alignment has been found, one may also ask how well the predicted phylogenetic relationships are supported by the data in the multiple sequence alignment.

In the bootstrap method, the data are resampled by randomly choosing vertical columns from the aligned sequences to produce, in effect, a new sequence alignment of the same length. Each column of data may be used more than once and some columns may not be used at all in the new alignment. Trees are then predicted from many of these alignments of resampled sequences (Felsenstein 1988). For branches in the predicted tree topology to be significant, the resampled data sets should frequently (for example, >70%) predict the same branches. Bootstrap analysis is supported by most of the commonly used phylogenetic inference software packages and is commonly used to test tree branch reliability. Another method of testing the reliability of one part of the tree is to collapse two branches into a common node (Maddison and Maddison 1992). The tree length is again evaluated and compared to the original length, and any increase is the decay value. The greater the decay value, the more significant the original branches. In addition to these methods, there are some additional recommendations that increase confidence in a phylogenetic prediction.
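The resampling step of the bootstrap can be sketched in Python as follows. Columns of the alignment are drawn at random with replacement until a new alignment of the same length is obtained, and the support for a branch is the fraction of resampled alignments whose tree contains it. The tree-building step is left to a caller-supplied function: `build_tree` is a hypothetical placeholder, not part of any particular package, and the function names and the toy alignment are illustrative assumptions.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length.

    alignment -- dict mapping taxon name to aligned sequence (equal lengths)
    rng       -- a random.Random instance, passed in so runs are reproducible
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n = lengths.pop()
    # Some columns are drawn more than once, others not at all.
    columns = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

def bootstrap_support(alignment, build_tree, clade, replicates=100, seed=0):
    """Fraction of bootstrap replicates whose tree contains `clade`.

    build_tree is a caller-supplied (hypothetical) function that takes an
    alignment and returns a collection of clades, each a frozenset of taxa.
    """
    rng = random.Random(seed)
    hits = sum(clade in build_tree(bootstrap_alignment(alignment, rng))
               for _ in range(replicates))
    return hits / replicates

# Toy alignment, for illustration only.
alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACGA"}
replicate = bootstrap_alignment(alignment, random.Random(1))
print(replicate)
```

Because whole columns are resampled, every column of a replicate is a column of the original alignment, so the relationship between the taxa at each site is preserved.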

One further recommendation is to use at least two of the above methods (maximum parsimony, distance, or maximum likelihood) for the analysis. If two of these methods provide the same prediction, confidence in the prediction is much higher. Another recommendation is to pay careful attention to the evolutionary assumptions and models that are used for both sequence alignment and tree construction (Li and Graur 1991; Swofford et al. 1996; Li 1997).

Complications from Phylogenetic Analysis:

The above methods provide a further level of sequence analysis by predicting possible evolutionary relationships among a group of related sequences. The methods predict a tree that shows possible ancestral relationships among the sequences. A phylogenetic analysis can be performed on proteins or nucleic acid sequences using any one of the three methods described above, each of which utilizes a different type of algorithm. The reliability of the prediction can also be evaluated.

The traditional use of phylogenetic analysis is to discover evolutionary relationships among species. In such cases, a suitable gene or DNA sequence that shows just enough, but not too much, variation among a group of organisms is selected for phylogenetic analysis. For example, analysis of mitochondrial sequences is used to discover evolutionary relationships among mammals. Two more recent uses of phylogenetic analysis are to analyze gene families and to trace the evolutionary history of specific genes. For example, database similarity searches may identify several proteins in a plant genome that are similar to a yeast query protein. From a phylogenetic analysis of the protein family, the plant gene most closely related to the yeast gene and therefore most likely to have the same function can be determined. The prediction can then be evaluated in the laboratory. Tracking the evolutionary history of individual genes in a group of species can reveal which genes have remained in a genome for a long time and which genes have been horizontally transferred between species. Thus, phylogenetic analysis can also contribute to an understanding of genome evolution.

Reference:

Book reference:

Baxevanis A.D. and Ouellette B.F.F. 2001. Bioinformatics: A practical guide to the analysis of genes and proteins. Wiley.
Brown T.A. 2002. Genomes 2. Wiley-Liss.
Mount D.W. Bioinformatics: Sequence and genome analysis. Cold Spring Harbor Laboratory Press.

Journal reference:

Adachi J. and Hasegawa M. 1996. MOLPHY version 2.3: Programs for molecular phylogenetics based on maximum likelihood. Institute of Statistical Mathematics, Tokyo.
Barns S.M., Delwiche C.F., Palmer J.D., and Pace N.R. 1996. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl. Acad. Sci. 93: 9188–9193.
Brown J.R. and Doolittle W.F. 1997. Archaea and the prokaryote-to-eukaryote transition. Microbiol. Mol. Biol. Rev. 61: 456–502.
Comeron J.M. and Kreitman M. 1998. The correlation between synonymous and nonsynonymous substitutions in Drosophila: Mutation, selection or relaxed constraints? Genetics 150: 767–775.
Darwin C. 1859. The origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. Penguin Books, London.
Doolittle W.F. 1999. Phylogenetic classification and the universal tree. Science 284: 2124–2128.
Eernisse D.J. 1998. A brief guide to phylogenetic software. Trends Genet. 14: 473–475.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17: 368–376.
Felsenstein J. 1988. Phylogenies from molecular sequences: Inferences and reliability. Annu. Rev. Genet. 22: 521–565.
Felsenstein J. 1989. PHYLIP: Phylogeny inference package (version 3.2). Cladistics 5: 164–166.
Felsenstein J. 1996. Inferring phylogenies from protein sequences by parsimony, distance and likelihood methods. Methods Enzymol. 266: 418–427.
Feng D.F. and Doolittle R.F. 1996. Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. Methods Enzymol. 266: 368–382.
Fitch W.M. 1981. A non-sequential method for constructing trees and hierarchical classifications. J. Mol. Evol. 18: 30–37.
Fitch W.M. and Margoliash E. 1967. Construction of phylogenetic trees. Science 155: 279–284.
Hein J. and Støvlbæk J. 1996. Combined DNA and protein alignment. Methods Enzymol. 266: 402–418.
Henikoff S., Greene E.A., Pietrokovski S., Bork P., Attwood T.K., and Hood L. 1997. Gene families: The taxonomy of protein paralogs and chimeras. Science 278: 609–614.
Hennig W. 1966. Phylogenetic systematics. University of Illinois Press, Urbana, IL.
Hershkovitz M.A. and Lewis L.A. 1996. Deep-level diagnostic value of the rDNA ITS region. Mol. Biol. Evol. 13: 1276–1295.
Higgins D.G., Thompson J.D., and Gibson T.J. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266: 383–402.
Hillis D.M. 1997. Biology recapitulates phylogeny. Science 276: 218–219.
Hillis D.M., Allard M.W., and Miyamoto M.M. 1993. Analysis of DNA sequence data: Phylogenetic inference. Methods Enzymol. 224: 456–487.
Hillis D.M. and Bull J.J. 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42: 182–192.
Hillis D.M., Huelsenbeck J.P., and Cunningham C.W. 1994. Application and accuracy of molecular phylogenies. Science 264: 671–677.
Huelsenbeck J.P. 1995. Performance of phylogenetic methods in simulation. Syst. Biol. 44: 17–48.
Huelsenbeck J.P., Hillis D.M., and Jones R. 1996. Parametric bootstrapping in molecular phylogenetics. In Molecular zoology: Advances, strategies, and protocols (ed. J.D. Ferraris and S.R. Palumbi), pp. 19–45. Wiley-Liss, New York.
Kumar S., Tamura K., and Nei M. 1994. MEGA: Molecular Evolutionary Genetics Analysis software for microcomputers. Comput. Appl. Biosci. 10: 189–191.
Lake J.A. 1994. Proc. Natl. Acad. Sci. U.S.A. 91: 1455–1459.
Li W.-H. 1993. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol. 36: 96–99.
Li W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Massachusetts.
Li W.-H. and Graur D. 1991. Fundamentals of molecular evolution, pp. 106–111. Sinauer Associates, Sunderland, Massachusetts.
Li W.-H. and Gu X. 1996. Estimating evolutionary distances between DNA sequences. Methods Enzymol. 266: 449–459.
Li W.-H., Wu C.I., and Luo C.C. 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2: 150–174.
Lockhart P.J. et al. 1994. Mol. Biol. Evol. 11: 605–612.
Maddison W.P. and Maddison D.R. 1992. MacClade: Analysis of phylogeny and character evolution (version 3). Sinauer Associates, Sunderland, Massachusetts.
Martin W. 1999. Mosaic bacterial chromosomes: A challenge en route to a tree of genomes. Bioessays 21: 99–104.
Mayr E. 1998. Two empires or three? Proc. Natl. Acad. Sci. 95: 9720–9723.
McDonald J.H. and Kreitman M. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: 652–654.
Miyamoto M.M. and Cracraft J. 1991. Phylogenetic analysis of DNA sequences. Oxford University Press, New York.
Nielsen R. and Yang Z. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148: 929–936.
Nuttall G.H.F. 1904. Blood immunity and blood relationship. Cambridge University Press, Cambridge.
Rzhetsky A. and Nei M. 1992. A simple method for estimating and testing minimum evolution trees. Mol. Biol. Evol. 9: 945–967.
Rzhetsky A. and Nei M. 1994. METREE: A program package for inferring and testing minimum-evolution trees. Comput. Appl. Biosci. 10: 409–412.
Saitou N. 1996. Reconstruction of gene trees from sequence data. Methods Enzymol. 266: 427–449.
Saitou N. and Nei M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406–425.
Schadt E.E., Sinsheimer J.S., and Lange K. 1998. Computational advances in maximum likelihood methods for molecular phylogeny. Genome Res. 8: 222–233.
Snel B., Bork P., and Huynen M.A. 1999. Genome phylogeny based on gene content. Nat. Genet. 21: 108–110.
Stewart C.-B. 1993. Nature 361: 603–607.
Swofford D.L., Olsen G.J., Waddell P.J., and Hillis D.M. 1996. Phylogenetic inference. In Molecular systematics, 2nd edition (ed. D.M. Hillis et al.), chap. 5, pp. 407–514. Sinauer Associates, Sunderland, Massachusetts.
Woese C.R. 1987. Bacterial evolution. Microbiol. Rev. 51: 221–271.
