M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

21
Reprinted from ADVANCES IN MATHEMATICS AU Rights Reserved by Academic Preas, New York and London Vol. 20, No. 3 June 1976 Rintcci tn Belgium Some Biological Sequence Metrics* M. S. WATERMAN Idaho State University,Pocatello, Idaho 83209 T. F. SMITH Northern Michigan University,Marquctte, Mu&an 49855 AND W. A. BEYER Los Alamos ScientificLaboratory, Los Alamos, New Mexico 87545t Some new metria are introduced to measure the distance between biological sequences, such as amino acid sequences or nucleotide sequences. These metrics generalize a metric of Sellers, who considered only single deletions, mutations, and insertions. The present metria allow, for example, multiple deletions and insertions and single mutations. They also allow computation of the distance among more than two sequences. Algorithms for computing the values of the metrics are given which also compute best alignments. The con- nection with the information theory approach of Reichert, Cohen, and Wong is discussed. 1. INTRODUCTION A protein is specified by the sequence of amino acids composing the protein. Since twenty kinds of amino acids are used in proteins, a protein is a finite word over an alphabet of twenty letters. Biology seeks to discover the evolutionary relations among the set of proteins. It postulates that proteins have evolved in time past, and this evolution has produced a tree (or trees) whose terminal nodes are the extant *Work performed under an NSF Faculty Research Participation Program at Loe + Theoretical Division. Alamos Scientific Laboratory and also supported by the Atomic Energy Commission. 367 Copyright 0 1976 by Academic Preu, Inc. AU rightl of repduction in any form ruervcd.

Transcript of M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

Page 1: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

Reprinted from ADVANCES IN MATHEMATICS AU Rights Reserved by Academic Preas, New York and London

Vol. 20, No. 3 June 1976 Rintcci tn Belgium

Some Biological Sequence Metrics*

M. S. WATERMAN

Idaho State University, Pocatello, Idaho 83209

T. F. SMITH

Northern Michigan University, Marquctte, Mu&an 49855

AND

W. A. BEYER

Los Alamos Scientific Laboratory, Los Alamos, New Mexico 87545t

Some new metria are introduced to measure the distance between biological sequences, such as amino acid sequences or nucleotide sequences. These metrics generalize a metric of Sellers, who considered only single deletions, mutations, and insertions. The present metria allow, for example, multiple deletions and insertions and single mutations. They also allow computation of the distance among more than two sequences. Algorithms for computing the values of the metrics are given which also compute best alignments. The con- nection with the information theory approach of Reichert, Cohen, and Wong is discussed.

1. INTRODUCTION

A protein is specified by the sequence of amino acids composing the protein. Since twenty kinds of amino acids are used in proteins, a protein is a finite word over an alphabet of twenty letters. Biology seeks to discover the evolutionary relations among the set of proteins. It postulates that proteins have evolved in time past, and this evolution has produced a tree (or trees) whose terminal nodes are the extant

*Work performed under an NSF Faculty Research Participation Program at Loe

+ Theoretical Division. Alamos Scientific Laboratory and also supported by the Atomic Energy Commission.

367 Copyright 0 1976 by Academic Preu, Inc. AU rightl of repduction in any form ruervcd.

Page 2: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

368 WATERMAN, SMITH AND BEYER

proteins. I t is a task of biology to decide what the mechanisms and probabilities of the various protein evolutionary processes are. It is a mathematical task then to produce from existing protein sequences and the postulated mechanisms and probabilities of evolutionary processes a probable tree of the extant proteins. These matters are discussed in the references, but there is no adequate survey. The closest reference to

survey is Dayhoff [3]. These tree construction methods can also be regarded as part of the theory of clusters or of numerical taxonomy. See Jardine and Sibson [4].

One of the major techniques used in taxonomic tree construction depends on the introduction of a measure of dissimilarity of sequences (see [l, 2, or 31). Beginning with Fitch [lo], there is a growing literature concerned with the distance between two protein sequences [5-171. The work on distances or metrics on protein sequences has grown more sophisticated and is no longer entirely concerned with producing measures of dissimilarity for the construction of evolutionary trees. One of the main concerns is to discover what genetic mutations are required to change one sequence into another. Perhaps the most mathe- matically satisfactory treatment to date is by Sellers [6]. The present paper presents some new metrics on sequences and some of their mathematical and algorithmic properties.

A metric p(sl , s,) is a nonnegative function on the set S x S of pairs (sl, s2) of finite sequences si over a fixed alphabet. I t is to be thought of as a measure of the amount of evolutionary change from the sequence s1 to s 2 . (1) It is zero if and only if s1 = s, . (2) It is assumed that evolutionary changes are reversible1 and therefore that p(sl, s,) = p ( s 2 , sl) for all st E S.2 (3) It is further assumed to measure the fewest number of evolutionary changes (weighted according to probability) from s1 to s, . Hence p should satisfy the triangle inequality:

P(S1 9

for all s1 , s 2 , and ss in S. Sequence metrics have been introduced in the past in references [l],

[2], [6], and [7]. The metric in [2] was suggested by T. Smith. While it seems to yield satisfactory results, its interpretation in terms of

G P(S1 9 s2) + P(S2 9 Sa)

I

By reversible is meant that evolutionary changes and their inverses are equally

2 However, nonsymmetric p’s can be handled within the theory of this paper and are probable.

discussed in Section 7.

Page 3: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 369

evolutionary changes is not clear. The interpretation of the metric due to Sellers [6] in terms of evolutionary changes is more clear. This metric is referred to as the s-metric. In this paper the s-metric is generalized to allow for more general evolutionary changes.

2. THE T-METRIC

Let A be a finite set such that d E A. d is called the neutral element. As in [6], the following definitions are made.

nating, only finitely many of the a, are # d, and a( E A.

nonneutral terms are identical.

sequences equivalent to a = ala2 *... Every evolutionary sequence ala2 has a member such that if a, # A , then ak # d for all K < n. Let S be the set of A-sequences such that this property holds. S includes dd . e - .

A finite set T of weighted transformations T: 8+ S such that each T E T has as its domain .9( T) and range a( T) nonempty subsets of S is considered which satisfies the following two conditions.

The identity transformation I is in T.

Each T E T has an associated nonnegative number w( T) called

(i)

(ii)

(iii)

a = ala2 --. is an A-sequence if the sequence is nontermi-

Two A-sequences are equivalent if the subsequences of

An evolutionary sequence I = ala2 is the set of all A-

(i) (ii)

the weight of T which is zero if and only if T = I. Since T is finite, (ii) implies

min w ( T ) > 0. TEr TZI

Fix a j 2 1. Suppose ala2 is an A-sequence and ajai+, E .9(T). aj-, T(ajaj+, - . e ) where, if

Suppose a finite set T of transformations satisfying (i) and (ii) is given.

Then Tj is defined by Tj(ala2 e . . ) = a1 j = 1, a,

For a, b E S, define

uj-, is omitted. If T E T, then define w( Tj) = w( T ) .

{a + b}, = { T t Tt I Tt Tta = b},

where Ti, ET.

Page 4: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

370 WATERMAN, SMITH AND BEYER

Thus {a -+ b}, is the set of all finite transformation sequences from T

For a, b E S, define which map a into b. The set (a -+ b}, may be the empty set 0 .

If {a ---f b}7 # 0, then, by the finiteness of 7, the above minima exist. If {a -+ b}, = 0, define

1

It is obvious that the relation p(a, b) < co is reflexive, symmetric, and transitive and therefore this relation partitions S into equivalence classes {Si>. In each Si the value of p between any two elements of Si is finite.

THEOREM 1. Each equivalence class S, of S together with the function p (called a 7-metric) is a metric space.

Proof. It is obvious that the function p is symmetric and nonnegative. Because of ( l ) , p(a, b) = 0 if and only if a = b. One has

Suppose the first min on the right of (2) is attained by Ti;; Ti; ; i.e.

and

Likewise, suppose

Then

This completes the proof of the theorem.

Page 5: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 37 1

3. THE RELATIONSHIP BETWEEN THE T-METRIC AND THE S-METRIC

Sellers [6] has extended a metric d on A to a metric d on A-sequences and a metric iE on the set of evolutionary sequences by the formulae:

m

d(a, b) = d(ai , h),

I i=l

and d(ii,6) = min d(a, b).

a C b ~ i 6

The d metric is referred to as the s-metric. If one relates the weights of mutations and deletions to the metric d,

then a gives the smallest total of the weights of the sets of mutations and deletions which make H and 6 identical.

A set T of transformations will be defined for which

+, 6) = p(a, b),

where a@) is the member of S which is in the evolutionary sequence H(6), thus showing .that the 7-metric is at least as general as ;E. Define To- by

9(TQ-) = {UU2% . * * I upu3 ... E S}

TQ-(aa2U3 . * e ) = u2u3 *.* . and

Page 6: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

372 WATERMAN, SMITH AND BEYER

and

W ( a , d = 4% b).

The notion of the “path” of a site through a sequence of transforma- tions is now introduced for use in the proof of Theorems 2 and 3. Suppose

b = T?& ... *’, T::(a)

For an a i E a , let the sequence p , , p , , . . . , p , record the sequence of subscripts of the successors of a, and let c, , c, ,..., c, record the sequence of successors of ai under the 1 transformations Tj; . One puts p , = i and c, = a i . If ci = A , then pi is designated by pi = -. If pi = -, then pk = - and ck = d for all k 2 j. For n < 1 suppose c, and p , are known. If Ti;;: is such that j,,, < p, , let c,+, = c, and

if in+l = (a, b), P n + l = Pn + 1 if in+, = a+, jp. pn - 1 if i,+, = a-.

pnfl = p , and En+, = a if in+, = (Cn , a),

P n + l = - and c,+~ = A if in+, = c, -, pn+, = p n + 1 and c,+~ = C, if in+, = a+.

If jn+l > P a , then p,+1 = P , and Cn+l = cn - Let N = max{ j: bj # A ) . Let P be the set of all p , associated with

ai # d which are members of a. If j E {I, 2 ,..., N } - P, we create a p and c sequence associated with b, as follows. Let p , = j and cl = bj . In an obvious way, we can trace p , and ck backwards. If p , = -, then pk = - and c, = d for all k < n and every such sequence hasp, = - and c, = d.

This scheme gives one the “path” of a site through the sequence of transformations. Each a, becomes a b, or is deleted. Each bi comes from an ai or is inserted.

THEOREM 2. If T and w( T ) are the above class of transformations and weights, then p(a, b) = a(&&).

Page 7: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 373

Proof. Recall that

We first show that for any a E H and b E 6, there is a TI E {a --+ blT and T, E {b + a}, such that:

(a) The sum of weights C w ( T ) associated with TI is d(a, b). (/3) The sum of weights C w ( T ) associated with T, is d(b, a).

(Recall that T$ may denote compositions of transformations from 7.)

Since d(a, b) = d(b, a) and a and b are arbitrary, we need only show case a.

There is an N such that ai = bi = d for i 2 N . For the largest j < N such that a* # bj , define Tkj by:

k = a, - (a deletion) if bi = A,

k = bj + (an insertion) if bj # A and ai = A,

and

k = (aj , bi) (a mutation) if bj # A and a, # A.

Then go to the next smallest j < j such that u, # b, . A finite number of applications of this rule yields a sequence of transformations in {a --.+ b}T . The corresponding sum of weights is

m 2 w(T) = d(Uj , bj) = d(a, , bi). 5 P J l i-1

I This result implies I

!

p(a, b) < d(a, b)

and therefore i p(a, b) < a(& 6).

We next prove that for

Tt .-. T:;a = b,

Page 8: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

374 WATERMAN, SMITH AND BEYER

there are sequences a‘ E H and b’ E 6 such that 1

1 w ( T k ) z +’, b’), k=1

which implies that

p(a, b) 2 J(% 6)

(3)

and will complete the proof. As observed in the discusion preceeding Theorem 2, each a, either

becomes a bj or is deleted. In the alignment visualized by writing the a sequence above the b sequence, put bj under a, in the first case and d under ai in the second. In case any bj are not listed, insert them in their proper positions in the b sequence with a d above them. The new a sequence is called a’ and the new b sequence is called b‘. One must show that (3) holds.

The application of the transformation corresponds to changes in the sequence co c1 ... c l . If cc1 = c1 = = c, , then no transformations were applied which affected the position corresponding to that c- sequence. Each transformation causes exactly one c-sequence to change. The triangle inequality for d ( . , a ) implies the sum of weights of the transformation associated with a sequence is at least d(c, , c l ) . Therefore one has (3), which completes the proof.

Theorem 2 shows that the 7-metric is at least as general as the s-metric. That the 7-metric is more general than the s-metric can be seen from the following example. Let A = (A , a>, w(T,+) = w(T,-) = 1, and w( Tclcl-) = w( Tact-) = 1.1 where

and

Then p(a, ) = 1. So an equivalent s-metric must have Z(a, d) = d(a, A ) = 1. But then d(aa, d) = 2 which does not agree with p(aa, d) = W(T(, , , - ) = 1.1.

The weights used with the 7-metric need not be related to the s- metric on the alphabet A. For example, let A = {d, a, b}. Let w(T,-) =

w( T,T) = 3, w( Tb-) = w( Tb-) = 1, and w( T(a,b)) = w( T(*,,)) = 1. These w’s do not provide a metric on A if we choose d(a, d) = w( T,-),

Page 9: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 375

d(b, d) = W(Tb-), and d(a, b) = w(T(Q,b)), since d(a, b) + d(b, A) = 2, but d(a, d) = 3. One can also allow w( T(Q,b)) Z w( T(bSQ)) and w( T,+) # w( TQ-), which is not possible with the s-metric.

4. DELETIONS AND INSERTIONS OF MORE THAN ONE LETTER

A motivation for considering 7-metrics is to allow a more general class of transformations than the s-metric. Specifically one wants to allow deletions and insertions of letter sequences of length greater than 1. This can be accomplished in the s-metric formalism by successive deletions. However, this may be a greater distance than one might want to permit.

A major result of [6] is an efficient algorithm to compute distance. Below we define a 7-metric which includes longer deletions and insertions and give a Sellers-like algorithm for this 7-metric under certain conditions.

Let a be a fixed positive integer. For 1 < m < a, define the domain of Tbl. .+,- to be those members of s whose first m letters are b, e - . b, . Put

Tbl...),-(bl...b,~+l...) = "'

Define the domain of Tb, . .++ as s and

Tb l...b,+(a1a2 "') = b1 bma1U2 "'

Define T(a,b) as before. We say that {a --+ b}, satisfies condition M if there is a minimum

sequence of transformations in {a ---+ b}, (i.e., T, ... Tk E {a ---f b}., such that

w(Ti ) t-1

is not greater than any other k'

1 I

for which TI' .-. Tk' E {a -+ b},) such that Tl e - . Tk contains no multiple hits. That is, if an dement is deleted, inserted, or undergoes mutation, it is not changed again. If one considers the associated c, c1 cs el for each position in a and in b, then there are no multiple hits if {co , cl , ..., c l } has at most two distinct elements.

Page 10: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

376 WATERMAN, SMITH AND BEYER

For brevity let

AK = ala2 aB A A * * -

and Bj = 6,b2 *.. bj d d . . I

where A, = d d... and Bo = d d... . Let

P z b , b) = P l ( h a).

The length of a E s is the number of ai # d. The following theorem is analogous to Theorem 2 of [6]. (Put 3 = A - d.)

THEOREM 3. Suppose a and b in S and the length of a(b) is M ( N ) . Let T be the set of transformation consisting of the following types: ( 1 ) T(a,b) for every a, b E 3, (2) Tbl...b,+, Tbl...b,- for all sequences b, ... b,, 1 < m < a. Suppose {A, -+ B,), (0 < k f M ; 0 < j < N ) satisjies condition M. Obviously pl(Ao, Bo) = 0. Then p1(A2 , B,)(O < i < M , 0 < j < N , i + j # 0) is the minimum of the following values:

Pl(Ai-k 9 4 ) + w o ( - b + l . . . L ? - ) ( l d k < min(% 0 Pi(&l 9 B+l) + w ( T ( a i . b j ) ) ,

fl(Ai 9 Bj-k ) + w(Tb3-t+l***b,+)(1 < < min(or,j)),

where pl(Ap , B,) is ignored if p or q is negative and an index set is empty the corresponding quantities are not computed.

T t E {A, 3 BJ, has no multiple hits, then the T’s may be reordered in any order and thej’s appropriately changed so that the resulting transformation sequence is also in {Ai -+ Bj}, .

There are three cases to be considered with regard to the fate of a, under a minimum mapping in {A,-+ I?,>, . (i) a, is deleted. (ii) a, is unchanged or undergoes a mutation to bi . (iii) b, is inserted at the end of the sequence.

We place the T that carries out the operation in (i), (ii), or (iii) in the first (right-most) position.

Proof. Fix i and j. It is evident that if Ti;

Page 11: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 377

In case (i), T f ; (a, Ti; in {Ai -+ Bj}T , then Tf;

ai) = a, ai+, 1 < 12 < min(a, i ) . Thus T:; E {Ai-k + Bj}T and, if T = Ti; e - . Ti; has minimum weight

Tf; has minimum weight in {Ai--k + Bj}, and

9 B,) = P~(AI--K v Bj) + W(Ta,-r+l*. .a,-)*

Case (iii) is handled in a similar fashion.

one sees that Ti; e - . Ti; E {Ai-l + Bi-,}, , so that In case (ii), let Til = I if ai is left unchanged. Reasoning as in case (i),

p l ( 4 9 Bj) = ~l(Ai-1, Bj-1) + W(T(a,.b,)),

which completes the proof. Next consider the number of steps required in the algorithm of

Theorem 3. If a < min{M, N}, there are essentially 2a + 1 values to find the minimum of for each i and j. Thus the algorithm runs in ap- proximately (2a + 1) MN steps. If a = max{M, N), the number of steps is approximately

or of the order MN max{M, N} steps. Theorem 3 can be used, in an obvious way, to construct an induction

which will calculate pl(a, b). The value of p2(a, b) is then calculated from p2(a, b) = pl(b, a) and

Aa, b) = max(pl(a, b), pp(a, b))

COROLLARY. If W( T(a,b)) = W( T(b,a)) a d W ( Tb, ... b,-) = w(Tbl...bk+), (1 < k < a), then

pda, b) = p2(a, b) = p(a, b).

The transformations Tca,b) and T(b,a) are inverses of each other, as are Tb1...bk- and Tb, ...?,+ . Thus if a sequence of T’s is in {a -+ b}, , there is a corresponding sequence of inverse mappings in {b + a}, and the weights, by the symmetry assumptions, are equal.

Proof.

THEOREM 4. The hypotheses of Theorem 3. are satisfied if we def ie w(T(,,b)) = A,, , T.U(T~~. . .~~*) = Akf (1 < k < a) and require

A,, < A,, d A,, < ... < A,, ,

Page 12: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

378 WATERMAN, SMITH AND BEYER

and

Proof. We show that there is some minimum sequence in {a --+ b}7 which has no multiple hits. It is sufficient to consider pairs of transforma- tions acting on the same site. We shall show they can, in each case, be replaced by equivalent transformations, which do not act on the same site and whose weight is less than or equal to the sum of the weights of the pairs of transformations.

The three types of transformations (insertion, deletion, and mutation) allows nine categories of pairs of transformations. However any trans- formation followed by an insertion cannot act on the same site and three of the nine categories are eliminated.

A mutation followed by a mutation, both on the same site, must be of the form T(b,c)T(a,b) which has the weight 2)b and this pair can be replaced by T(,,J which has weight A,, .

A deletion followed by a mutation or deletion cannot act at the same site. An insertion followed by a mutation must be of the form T(,,,b)

(where I < i < k ) which has the weight A,, + A,+ . This pair can be replaced by Tal...a,-lba,+l...ak+ which as weight A,+ .

A mutation followed by a deletion has the form Ta:,..ak-T(b,a~) (where 1 < i < k ) and weight A,- + A,. We replace this pair by T%...at--lbOl+,....Qk- which has weight Ak- .

The remaining category is an insertion followed by a deletion. Suppose this pair is of the form T, ,... ,,-Tu ,... ,,+ where i < j < 1 < k, or i 5 j < k < I , or j < i < l < k , or j < i < k < I . This pair has weight Ar-.j+i- + Ax-$+,+ . Suppose i < j < 1 < k . Replace the previous pair by T, ,... a,-la,+l...ak- , which has weight A ( k - l ) + ~ - i ) + = &-a-(r-j)+ < A,-,+,+ . Suppose i < j < k < I . Replace the original pair by Tak+l...u,- T,,...,,-,+ with weight A1-,- + A.j-t+ < A1-,+,- + The remain- ing two subcases are handled in the same way.

5. AN EXAMPLE

Suppose A = {A , a, b, c}. Suppose all single mutations have weight I , all single deletions and insertions have weight 1, and all double deletions and insertions have weight 1.1. Theorem 3, the Corollary, and Theorem 4 all apply. Suppose a = abccaaa and b = abaaa. (We omit writing the

Page 13: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 379

terminal A’s.) Table 1 is a matrix of values from Theorem 3. The first row has the values

TABLE 1

Double Deletion and Insertion Distance Calculation

A U b C C a a a

0 1 1.1 2.1 2.2 3.2 3.3 4.3 1 0 1 1.1 2.1 2.2 3.2 3.3 1.1 1 0 1 1.1 2.1 2.2 3.2 2.1 1.1 1 1 2 1.1 2.1 2.2 2.2 2.1 1.1 2 2 2 1.1 2.1 3.2 2.2 2.1 2.1 3 2 2 1.1

TABLE 2

Single Deletion and Insertion Distance Calculation

A a b C C a a a

0 1 2 3 4 5 6 7 1 0 1 2 3 4 5 6 2 1 0 1 2 3 4 ’ 5 3 2 1 1 2 2 3 4 4 3 2 2 3 2 2 3 5 4 3 3 3 2 2 2

Page 14: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

380 WATERMAN, SMITH AND BEYER

For comparison, Table 2 is a matrix for the distance if double deletions and insertions are not allowed. The double deletion distance is p(a, b) -= 1.1. (the value in the lower right-hand corner). The single deletion distance is J(a, b) = 2.

Finally, we remark that the matrix of Table 1 gives an alignment or best matching between sequences in the same manner that Sellers obtains best matching in the single insertion and deletion case. Our alignment is:

a b c c a a a

a b A u u u

6. CONNECTION WITH INFORMATION THEORY

Reichert, Cohen, and Wong [5, 8, 91 have studied an application of information theory for determining the quality of alignment of two biosequences. Their papers [5, 81 are related to this work, although they do not obtain any mathematical results for their model. The transformations they consider are insertion, deletion, substitution, and unequal replacement.

Suppose one associates a probability pi with each transformation T4 and with any sequence of transformations Tj; T i ; , the probability ni=, pi, . Reichert et al. choose the element of {a -P b}, such that the associated information,

1

1a-b = - logpi, k-1

is minimized, which is equivalent to maximizing nL=, p i , . T o associate this with what we have considered, take eo, > 0 (identity transformation is not allowed) and put

wi = --logpi.

The distance between two sequences and the resulting alignment can then be computed as in Sections 3 and 4. The model of Reichert et al. includes a location cost which is omitted

im the above correspondence. It should be possible to change the weights e€ the transformations to include the consideration of location costs, but this has not been carried out. The model of Reichert et al. also

Page 15: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 38 1

includes the concept of a “mutation machine” whose motion and action is specified by the sequence Ti; - a - Tj; .

On page 45 of [8] multiple hits are excluded so that the set of possible alignments is finite. In our treatment, the concept of the path of a site through a sequence of transformations makes possible a mathematically precise definition of multiple hits. Then, in Theorem 4, these concepts are used to exclude such occurrences in our algorithm.

Finally, the efficiency of the algorithm in [8] is that, for insertions and deletions of any length, it runs in approximately M2N2i4 steps. Recall that our algorithm runs in approximately M N max(M, N} steps. If insertions and deletions of length a < min(M, N} are allowed, our algorithm runs in approximately (2a + 1) MN steps. Reichert et al., do not consider limiting the length of insertions and deletions, but, for that case their algorithm should run in approximately a2 MN steps.

7. QUASIMETRICS

In the introduction, metric measures of dissimilarity were presented. Of the three requirements for a metric measure of dissimilarity, perhaps the requirement of symmetry is the least realistic. The assumption that p(sl , s2) = p(s2 , sl) for all s1 , s2 E S implies that evolutionary changes are reversible. Here it will be shown that the assumption of symmetry can be dropped in the previous work.

A quasimetric space p ( s I , s2) is a nonnegative function on S x S which satisfies

Then it is easily seen that Theorem 1 becomes: Each equivalence class Si of S together with p1 is a quasimetric space.

The computation of pl(a, b) has already been handled by Theorem 3 and Theorem 4, which gives sufficient conditions for the hypotheses of Theorem 3 to be satisfied, does not require symmetry.

Page 16: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

382 WATERMAN, SMITH AND BEYER

8. THE +DISTANCE

This section extends the notion of distance between two sequences to a distance among n sequences: the n-distance. An algorithm which computes this distance is given. The algorithm also gives the alignment of the n sequences which has least weight. These alignments can then be used in ancestral reconstruction in which an alignment is interpreted as a common ancestor of the n sequences. Evolutionary tree construction can be based on the method.

Suppose pn maps A" to the nonnegative real numbers and p , (A , A,..., A ) = 0. Then extend p" to n A-sequences by the formula:

m

f -1

Then pn is extended to n evolutionary sequences by

where the minimum is taken over all a"), a(*), ..., a'") in the respective equivalence classes. Due to the requirement of an A-sequence to have only a finite number of terms not equal to A and p ( A , A,,, . , A ) = 0, we have the existence of the minimum. Also

0 \< ijn(B"', z(e) ,..., P) < 00, ijn(if, if ,..., if) = 0.

For 1 < i < n, consider

where a t ) # A and a'i) E S. Each such sequence in Swill be represented by

If a'i) = if, we represent a(i) by A .

- -

Page 17: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 383

Then in can be defined on such sequences by

pn(&) ... ... a,, (2) ,..., U P ) -.. a,(=))

The next theorem yields an algorithm to compute this number.

THEOREM 5 . The quantity (1) ... a(') (2) ... (2) (n) ...

i n ( % z1 9 41 az, ,'**, a1

where not all arguments are A , can be found by taking the minimum of the following 2" - 1 quantities:

pn(ajz) ..* a,,-,, , a?) * * . (2) ... , ... n

( *) (1) .(2) .(VI) -i- Pn(aZ,.€, 9 ly"2 ).*') ,".efl)

where ci = 0 or 1, A . If a?) *.. U T ) =: A , then omit computations of (*) which involve ei = 1.

Since all sequences in S are A after a certain point, there is an I > 0 such that, for some by'

= E~ = = E , = 0 is excluded, and ay) ~

Proof. by' E a?) a($'

1, '

= pn(p ... p , b?) ... b1(2) ,..., bi"' ... p ) I-1

(1) (2) (1) (2) = pn(& 9 bi ,*.., P) + p n p , , b, ,.**, bin)) (-1

= p , ( p . e * p - f a ,..., b?' .*. @\)

(1) bk) + pn@, 2 2 , . * a , P). The last equality is true because if the sequences of length 1 - 1 were not minimal, then the sequences of length I would not be minimal.

Take I to be the smallest integer such that not all b'j'), ..., by' are equal to A . Therefore the possibilities in our last equation are identical with the 2" - 1 possibilities described in Theorem 5. Moreover, each of 2" - 1 possible numbers in Theorem 5 is greater or equal to the minimum value. This completes the proof of the theorem.

Page 18: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

384 WATERMAN, SMITH AND BEYER

b A 0 1 2 3 4 b 1 2 2 3 4 c 2 3 3 3 4 4 c 3 4 4 4 5

For n = 2, there are 22 - 1 = 3 possibilities:

p&l * - * %, 9 b, --. b2J

= min&(a, .** uz1 9 bl -.. &*-A + p z ( 4 bl,),

p z ( 4 .*. a,,-, 9 bl *.. bZ*-l) + pz(a2, 9 bzJ,

h(a1 - * * %,-l 9 bl -.. at,) + Pz(a2, 9 4. This is the algorithm given by Sellers.

In general we can use Theorem 5 to compute &(a:') -.. a") 1, ,..-,

in finding the minimum of (at most) 2" - 1 numbers. For implementa- tion on computers we note that it is not necessary to store an n-dimen- sional array with dimensions ZI + I , la + I , ..., 1, + 1. Entries can be discarded after they will not be needed in computing future minima.

Before discussing the computation in more detail, consider the

.:"' ... a'n' ) in (Il + 1)(Z2 + 1) (I, + 1) steps where each step consists

problem of defining p3 on AS. Let A = {a, b, c, d, A} , and define p3 on A3 by

0 if a = p = y

2 if none of a, 8, y are equal. ps(a, 8, 7) = 1 if exactly 2 of a, 8, y are equal I

This definition is motivated by

where

Page 19: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 385

(iii) (iv) (b) A a b c b (c) A a b c b

b 2

c 4 4 2

Therefore jj3(a, b, c) = 3 which results from the tri-alignment

a b c b d b c c a b c A

Of course, for this alignment, 4 c p3(ai(l), u p , a!)) = 1 + 0 + 0 + 2 = 3. (-1

The following scheme is used in reconstructing ancestral sequences. For each minimal alignment dl), d2), a@) consider ai’), ai2), If at least two of a:”, ala), ul3) are equal, then that is the ith element in the ancestral sequence. Otherwise all three are distinct and we use {all), a:’), a:”’> as the ith element in the ancestral sequence.

For our alignment, then, the ancestral sequence is

abc{b, c, A} .

9. CONCLUSIONS

These new metrics should help in the investigation of the evolutionary relationship between two proteins by allowing more realistic evolutionary steps. It would be of interest to compare the new metrics for various lengths of insertions and deletions with existing metrics for a set of proteins. The problem of which metric and tree construction technique to use in the construction of evolutionary trees is a very difficult problem that may never be satisfactorily solved.

In connection with the construction of evolutionary trees it is possible that the method of reconstructing ancestral sequences with the 3-metric will be of value. We feel such an investigation should be carried out.

In conclusion, it should be remarked that one of the authors (M.S.W.) has used the multiple insertion and deletion metric as a tool to solve

Page 20: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

386 WATERMAN, SMITH AND BEYER

the problem of prediction of RNA secondary structure. The secondary structure problem is quite distinct from the problems handled in this paper, but these metrics are fundamental in the solution. The work on secondary structures will appear elsewhere.

Note Added in Proof. Section 4 discusses multiple insertions and deletions. The notion of the path of site through a sequence of the more general transformations has been omitted by oversight. If multiple insertions or deletions overlap a site previously altered, then the transformations cannot be reordered and the argument given in Theorem 3 fails. To correct this alter the definition on p. 6, so that every A has a position number. Then there are no multiple hits if each position has been acted on or overlapped by at most one transformation.

REFERENCES

1. w. A. BEYER, T. F. SMITH, M. L. STEIN, AND s. M. ULAM, Metrics in biology, an introduction, Los Alamos Scientific Laboratory report, LA-4973, 1972.

2. W. A. BEYER, T. F. SMITH, M. L. STEIN, AND S. M. ULAM, A molecular sequence metric and evolutionary trees, Math. Biosci. 19 (1974), 9-25.

3. M. 0. DAYHOFF et al., “Atlas of Protein Sequence and Structure,” National Biomedical Research Foundation, Silver Spring, Maryland.

4. N. JARDINE AND R. SIBSON, “Mathematical Taxonomy,” John Wiley and Sons, New York, 1971.

5. T. A. REICHERT, D. N. COHEN, AND A. K. C. WONG, An application of information theory to genetic mutations and the matching of polypeptide sequences, J. Theor. Biol. 42 (1973), 245-261.

6. P. H. SELLERS, On the theory and computation of evolutionary distances, SIAM J. Appl. Math. 26 (1974), 781-793.

7. S. M. ULAM, Some ideas and prospects in biomathematics, Ann. Rev. Biophys. Bioeng. 1 (1972), 271-292.

8. A. K. C. WONG, T. A. REICHERT, D. N. COHEN, AND B. 0. AYGUN, A generalized method for matching informational macromolecular code sequences, Comput. Biol. Med. 4 (1974), 43-57.

9. D. N. COHEN, T. A. REICHERT, AND A. K. C. WONG, Matching code sequences utilizing context free quality measures, Math. Biosci. 24 (1975), 25-30.

10. W. M. FITCH, An improved method of testing for evolutionary homology, J. Mol. Biol. 16, 9, (1966).

11. S. B. NEEDLEMAN AND C. D. WUNSCH, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48 (1970), 443-453.

12. D. SANKOFF, Matching sequences under deIetion/insertion constraints, Proc. Nut. Acad. Sci. 69(1), (1972), 4-6.

13. J. E. HABER AND D. E. KOSHLAND, An evaluation of the relatedness of proteins based on comparison of amino acid sequences. J. Mol. Biol. 50 (1970), 617-639.

14. M. J. SACKIN, Crossassociation: a method of comparing protein sequences, Biochem. Genet. 5 (1970), 287-313.

Page 21: M. S. WATERMAN T. F. SMITH W. A. BEYER 1. INTRODUCTION

SOME BIOLOGICAL SEQUENCE METRICS 387

15. D. SANKOFF AND R. J. CEDERGREN, A test for nucleotide sequence homology, Technical Report 122, Centre de Recherches MathCmatiques, UniversitC de Montreal, Montreal, Canada.

16. A. J. GIBES AND G. A. MCINTYRE, The diagram, a method for comparing sequences,

17. W. M. FITCH, Locating gaps in amino acid sequences to optimize the homology EUY. J. Biochm. 16 (1970), 1-11.

between two proteins, Biochem. Genet. 3 (1969), 99.

Printed by the St Catherine Press Ltd., Tempelhof 37, Bruges, Belgium.

_- I