Inexact Matching

1 Inexact Matching Charles Yan 2008


Page 1: Inexact Matching

1

Inexact Matching

Charles Yan, 2008

Page 2: Inexact Matching

2

Longest Common Subsequence

Given two strings, find a longest subsequence that they share

Substring vs. subsequence of a string:

Substring: the characters in a substring of S must occur contiguously in S.

Subsequence: the characters can be interspersed with gaps.

Consider ababc and abdcb.

Alignment 1:
ababc.
abd.cb

The longest common subsequence is ab..c, with length 3.

Alignment 2:
aba.bc
abdcb.

The longest common subsequence is ab..b, with length 3.

Page 3: Inexact Matching

3

Longest Common Subsequence

Let us assign a score M to an alignment as follows:

$M = \sum_i s(x_i, y_i)$, where $x_i$ is the i-th character in the first aligned sequence and $y_i$ is the i-th character in the second aligned sequence.

$s(x_i, y_i) = 1$ if $x_i = y_i$

$s(x_i, y_i) = 0$ if $x_i \neq y_i$ or either of them is a gap

The score for the alignment

ababc.
abd.cb

is M = s(a,a) + s(b,b) + s(a,d) + s(b,.) + s(c,c) + s(.,b) = 3.

Finding the longest common subsequence of sequences S1 and S2 amounts to finding the alignment that maximizes the score M.
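The scoring rule above can be sketched in a few lines of Python. The function name `alignment_score` and the choice of "." as the gap symbol (taken from the example alignments) are assumptions made for illustration.

```python
def alignment_score(row1: str, row2: str, gap: str = ".") -> int:
    """Score M: count the columns where both rows hold the same non-gap character."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for x, y in zip(row1, row2) if x == y and x != gap)


# The two alignments of ababc and abdcb shown earlier both score 3.
print(alignment_score("ababc.", "abd.cb"))  # 3
print(alignment_score("aba.bc", "abdcb."))  # 3
```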

Page 4: Inexact Matching

4

Longest Common Subsequence

Subproblem optimality

Consider two sequences:

S1: a1a2a3…ai

S2: b1b2b3…bj

Let the optimal alignment be

x1x2x3…xn-1xn
y1y2y3…yn-1yn

There are three possible cases for the last pair (xn, yn): (ai, bj), (gap, bj), or (ai, gap).

Page 5: Inexact Matching

5

Longest Common Subsequence

S1: a1a2a3…ai

S2: b1b2b3…bj

There are three cases for the (xn, yn) pair:

x1x2x3…xn-1xn
y1y2y3…yn-1yn

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch)} \\ M_{i,j-1} + 0 & \text{(gap in sequence \#1)} \\ M_{i-1,j} + 0 & \text{(gap in sequence \#2)} \end{cases}$$

$M_{i,j}$ is the score of the optimal alignment between the strings a[1…i] (the substring of a from index 1 to i) and b[1…j].

Page 6: Inexact Matching

6

Longest Common Subsequence

Example:

G A A T T C A G T T A (sequence #1)
G G A T C G A (sequence #2)

$s(a_i, b_j) = 1$ if $a_i = b_j$

$s(a_i, b_j) = 0$ if $a_i \neq b_j$ or either of them is a gap

$$M_{i,j} = \max\{\, M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0 \,\}$$

Page 7: Inexact Matching

7

Longest Common Subsequence

Fill the score matrix M and the trace back table B, for example:

$M_{1,1} = \max[M_{0,0} + 1,\ M_{1,0} + 0,\ M_{0,1} + 0] = \max[1, 0, 0] = 1$

(Figure: score matrix M and trace back table B.)

Page 8: Inexact Matching

8

Longest Common Subsequence

(Figure: score matrix M and trace back table B.)

$M_{7,11} = 6$ (lower right corner of the score matrix). This tells us that the best alignment has a score of 6.

What is the best alignment?

Page 9: Inexact Matching

9

Longest Common Subsequence

We need to use trace back table to find out the best alignment, which has a score of 6

(1) Find the path from lower right corner to upper left corner

Page 10: Inexact Matching

10

Longest Common Subsequence

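(2) At the same time, write down the alignment backward:

Diagonal arrow: take one character from each sequence.

Horizontal arrow: take one character from sequence S1 (columns).

Vertical arrow: take one character from sequence S2 (rows).

(Figure: trace back table with S1 along the columns and S2 along the rows.)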

Page 11: Inexact Matching

11

Longest Common Subsequence

Diagonal arrow: take one character from each sequence.

Horizontal arrow: take one character from sequence S1 (columns).

Vertical arrow: take one character from sequence S2 (rows).

Page 12: Inexact Matching

12

Longest Common Subsequence

Thus, the optimal alignment is as shown.

(Figure: the optimal alignment.)

The longest common subsequence is G.A.T.C.G..A

There may be multiple longest common subsequences (LCSs) between two given sequences.

These LCSs all have the same number of characters (not counting gaps).

Page 13: Inexact Matching

13

Longest Common Subsequence

Algorithm LCS (string A, string B) {
    Input: strings A and B (characters indexed from 1)
    Output: the longest common subsequence of A and B

    M: score matrix
    TB: trace back table (called B on the earlier slides; use letters a, b, c for the three trace back directions)
    n = A.length()
    m = B.length()
    // fill in M and TB
    for (i = 0; i < n+1; i++)
        for (j = 0; j < m+1; j++)
            if (i == 0) || (j == 0) then
                M[i,j] = 0
            else if (A[i] == B[j]) then
                M[i,j] = max {M[i-1,j-1] + 1, M[i-1,j], M[i,j-1]}
                {update the entry in trace back table TB}
            else
                M[i,j] = max {M[i-1,j-1], M[i-1,j], M[i,j-1]}
                {update the entry in trace back table TB}

    then use trace back table TB to print out the optimal alignment
    …
}
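As a runnable counterpart to the pseudocode, here is a minimal Python sketch. It is one possible implementation, not necessarily the one intended on the slide: instead of storing a separate trace back table, it re-derives each trace back step from the score matrix, and the function name `lcs_alignment` and the "." gap symbol are illustrative assumptions.

```python
def lcs_alignment(s1: str, s2: str, gap: str = "."):
    """Return (score, aligned_s1, aligned_s2) for an LCS-style alignment.

    Rows of M follow s2 and columns follow s1, as in the slides' example.
    """
    n, m = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 1 if s2[i - 1] == s1[j - 1] else 0
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1], M[i - 1][j])

    # Trace back from the lower right corner, writing the alignment backward.
    row1, row2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = 1 if i > 0 and j > 0 and s2[i - 1] == s1[j - 1] else 0
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            row1.append(s1[j - 1]); row2.append(s2[i - 1])  # diagonal step
            i -= 1; j -= 1
        elif j > 0 and M[i][j] == M[i][j - 1]:
            row1.append(s1[j - 1]); row2.append(gap)        # take from s1 only
            j -= 1
        else:
            row1.append(gap); row2.append(s2[i - 1])        # take from s2 only
            i -= 1
    return M[m][n], "".join(reversed(row1)), "".join(reversed(row2))


score, a1, a2 = lcs_alignment("GAATTCAGTTA", "GGATCGA")
print(score)  # 6
print(a1)
print(a2)
```

Because several optimal alignments can exist, the printed rows may differ from the alignment on the slide while still scoring 6.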

Page 14: Inexact Matching

14

Global Alignment

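Global Alignment: find the overall similarity between two sequences.

$$M_{i,j} = \max\{\, M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0 \,\}$$

For the longest common subsequence:

$s(a_i, b_j) = 1$ if $a_i = b_j$

$s(a_i, b_j) = 0$ if $a_i \neq b_j$ or either of them is a gap

For global alignment, $s(a_i, b_j)$ can be replaced by the similarity score between $a_i$ and $b_j$.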

Page 15: Inexact Matching

15

Sequence Alignment

Page 16: Inexact Matching

16

Local Alignment

Local Alignment: find all pairs of substrings whose similarity scores are higher than a given threshold.

(Figure: global vs. local alignment.)

Page 17: Inexact Matching

17

Global Alignment

>seq1
ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWLDVDAITRVLLISSVMLVMIVEILNSAIEAVVDRIGSEYHELSGRAKDMGSAAVLIAIIVAVITWCILLWSHFG
>seq2
MINPNPKRSDEPVFWGLFGAGGMWSAIIAPVMILLVGILLPLGLFPGDALSYERVLAFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWVFYGLAAILTVVTLIGVVTIIKAGYSAWKG

Global alignment score: -2

10 20 30 40 50
Seq1 ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWL---DVDAITRVL
: . : . . .. : . : .:. . .:..... : :. . :::
Seq2 MINPNP-KRSDEPVFWGLFGAGGMW---SAIIAPVMILLVGILLPLGLFPGDALSYERVL
10 20 30 40 50

60 70 80 90 100
Seq1 LISSVML--VMIVEILNSAIEAVVDRIGSEYHEL-----SGRAKDMGSAAVLIAI-IVAV
... .. :.. .. . . :. .:.: .:. .: ::.: .. ...:
Seq2 AFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWVFYGLAAILTVVTLIGV
60 70 80 90 100 110

110 120
Seq1 ITWCILLWSHF-G
.: .: . :
Seq2 VTIIKAGYSAWKG
120

Page 18: Inexact Matching

18

Local Alignment

>seq1
ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWLDVDAITRVLLISSVMLVMIVEILNSAIEAVVDRIGSEYHELSGRAKDMGSAAVLIAIIVAVITWCILLWSHFG
>seq2
MINPNPKRSDEPVFWGLFGAGGMWSAIIAPVMILLVGILLPLGLFPGDALSYERVLAFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWVFYGLAAILTVVTLIGVVTIIKAGYSAWKG

Score = 19.6 bits
Seq 1:   6 GFTRIIKAAGYSWKG 20
           G   IIKA   +WKG
Seq 2: 115 GVVTIIKAGYSAWKG 129

Score = 13.1 bits
Seq 1: 103 IAIIVAVIT 111
           +A I+ V+T
Seq 2: 104 LAAILTVVT 112

(Figure with axes labeled Seq1 and Seq2.)

Page 19: Inexact Matching

19

Local Alignment

Global:

$$M_{i,j} = \max\{\, M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0 \,\}$$

Local:

$$M_{i,j} = \max\{\, 0,\ M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0 \,\}$$
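The effect of the extra 0 in the local recurrence is easiest to see with scores that can go negative. The sketch below assumes illustrative values (match +1, mismatch -1, gap -1) that are not given on the slide; with the slide's 0/1 scores and zero-cost gaps the floor at 0 would never be reached. A recurrence of this form is commonly known as the Smith-Waterman algorithm.

```python
def local_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Return (best_score, (i, j)): the best local score and the cell where it ends.

    i and j are 1-based end positions in a and b. The score values are
    assumptions chosen for illustration.
    """
    M = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0,                    # restart: drop a poor prefix
                          M[i - 1][j - 1] + s,  # match/mismatch
                          M[i][j - 1] + gap,    # gap in a
                          M[i - 1][j] + gap)    # gap in b
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)
    return best, best_pos


# Only the shared substring ACGT contributes to the local score.
print(local_alignment_score("xxxACGTyyy", "zzACGTww"))  # (4, (7, 6))
```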

Page 20: Inexact Matching

20

Gap Penalty

S1: a1a2a3…ai

S2: b1b2b3…bj

There are three cases for the (xn, yn) pair:

x1x2x3…xn-1xn
y1y2y3…yn-1yn

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch)} \\ M_{i,j-1} + 0 & \text{(gap in sequence \#1)} \\ M_{i-1,j} + 0 & \text{(gap in sequence \#2)} \end{cases}$$

$M_{i,j}$ is the score of the optimal alignment between the strings a[1…i] (the substring of a from index 1 to i) and b[1…j].

Page 21: Inexact Matching

21

Gap Penalty

S1: a1a2a3…ai

S2: b1b2b3…bj

There are three cases for the (xn, yn) pair:

x1x2x3…xn-1xn
y1y2y3…yn-1yn

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch)} \\ M_{i,j-1} + G(-, b_j) & \text{(gap penalty)} \\ M_{i-1,j} + G(a_i, -) & \text{(gap penalty)} \end{cases}$$

$M_{i,j}$ is the score of the optimal alignment between the strings a[1…i] (the substring of a from index 1 to i) and b[1…j].
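A minimal sketch of the recurrence with gap penalties follows. It assumes a single constant penalty per gap character (so G(-, b_j) = G(a_i, -) = `gap`) and the 1/0 match score from the earlier slides; both the function names and the default value -1 are illustrative choices. With a constant per-character penalty this is the scheme usually called Needleman-Wunsch.

```python
def identity_score(x: str, y: str) -> int:
    """s(a_i, b_j) from the LCS slides: 1 for a match, 0 otherwise."""
    return 1 if x == y else 0


def global_alignment_score(a: str, b: str, s=identity_score, gap: int = -1) -> int:
    """Fill M with a constant per-character gap penalty and return M[n][m]."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # a[1..i] aligned against nothing
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, m + 1):          # b[1..j] aligned against nothing
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          M[i][j - 1] + gap,
                          M[i - 1][j] + gap)
    return M[n][m]


print(global_alignment_score("ACGT", "AGT"))                   # 2 (3 matches, 1 gap)
print(global_alignment_score("GAATTCAGTTA", "GGATCGA", gap=0))  # 6, the LCS length
```

Setting `gap=0` recovers the longest common subsequence score, which shows that the LCS recurrence is the special case with free gaps.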

Page 22: Inexact Matching

22

Gap Penalty

1. Constant gap weight: equal penalty for each gap, regardless of the gap length. A gap of length q has a penalty of W, and so does a gap of length 1.

2. Affine gap weight: a constant weight for each additional space. Gap opening penalty: Wg (-10); gap extension penalty: Ws (-1). A gap of length q has a penalty of Wg + (q-1)*Ws.

How should the recursion function be modified?

3. Convex gap weight: each additional space contributes less than the preceding space, e.g. Wg + log(q).

4. Alphabet-weighted gap penalty: the penalty also depends on the letter.

Seq1 NSAIEAVVDRIGSEYHEL-----SGRWVFYGLAA
Seq2 VLPLWCGLHRMHHAMHDLKLHSPAGKWVFYGLAA

Seq1 NSAIEAVVDRIGSEYHE--L-S--GRWVFYGLAA
Seq2 VLPLWCGLHRMHHAMHDLKLHSPAGKWVFYGLAA
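To make the difference between the first two schemes concrete, the sketch below totals the gap penalties of the two example alignments. The affine values Wg = -10 and Ws = -1 come from the slide; the constant weight W = -10 and the helper names are assumptions for illustration.

```python
import re


def gap_lengths(row: str, gap: str = "-"):
    """Lengths of the maximal runs of the (single-character) gap symbol in a row."""
    return [len(m.group()) for m in re.finditer(re.escape(gap) + "+", row)]


def constant_penalty(lengths, w):
    """Constant gap weight: every gap costs w, whatever its length."""
    return w * len(lengths)


def affine_penalty(lengths, wg, ws):
    """Affine gap weight: a gap of length q costs wg + (q - 1) * ws."""
    return sum(wg + (q - 1) * ws for q in lengths)


one_long_gap = "NSAIEAVVDRIGSEYHEL-----SGRWVFYGLAA"
three_short_gaps = "NSAIEAVVDRIGSEYHE--L-S--GRWVFYGLAA"

for row in (one_long_gap, three_short_gaps):
    q = gap_lengths(row)
    print(q, constant_penalty(q, -10), affine_penalty(q, -10, -1))
# [5] -10 -14
# [2, 1, 2] -30 -32
```

Both schemes favor the first alignment, which places the same five spaces in a single gap.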

Page 23: Inexact Matching

23

BLAST

http://www.ncbi.nlm.nih.gov/BLAST/

Local alignment

Database search

Efficiency is vital


Page 26: Inexact Matching

26

Page 27: Inexact Matching

27

Page 28: Inexact Matching

28

Raw Score S

The raw score S for an alignment is calculated by summing the scores for each aligned position and the scores for gaps

Page 29: Inexact Matching

29

Bit Score S'

Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and lambda. Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years. By normalizing a raw score using the formula

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

one attains a "bit score" S', which has a standard set of units.

Page 30: Inexact Matching

30

Bit Score S'

The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.

Page 31: Inexact Matching

31

Significance

The significance of each alignment is computed as a P value or an E value

E value: Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

P value :The probability of an alignment occurring with the score in question or better. The p value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the alignment.

Page 32: Inexact Matching

32

E-value

In the limit of sufficiently large sequence lengths m and n, the statistics of HSP scores are characterized by two parameters, K and lambda. Most simply, the expected number of HSPs with score at least S is given by the formula

$$E = K m n e^{-\lambda S} \qquad (1)$$

We call this the E-value for the score S. This formula makes eminently intuitive sense. Doubling the length of either sequence should double the number of HSPs attaining a given score. Also, for an HSP to attain the score 2x it must attain the score x twice in a row, so one expects E to decrease exponentially with score. The parameters K and lambda can be thought of simply as natural scales for the search space size and the scoring system respectively.

Page 33: Inexact Matching

33

P-value

The number of random HSPs with score >= S is described by a Poisson distribution. This means that the probability of finding exactly a HSPs with score >= S is given by

$$e^{-E} \frac{E^a}{a!}$$

where E is the E-value of S given by equation (1) above. Specifically, the chance of finding zero HSPs with score >= S is $e^{-E}$, so the probability of finding at least one such HSP is

$$P = 1 - e^{-E}$$

This is the P-value associated with the score S. For example, if one expects to find three HSPs with score >= S, the probability of finding at least one is 0.95. The BLAST programs report E-values rather than P-values because it is easier to understand the difference between, for example, E-values of 5 and 10 than P-values of 0.993 and 0.99995.
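The quantities on the last few slides can be checked numerically. The sketch below uses the standard Karlin-Altschul forms: S' = (lambda*S - ln K)/ln 2, E = K*m*n*exp(-lambda*S), a Poisson count of HSPs, and P = 1 - exp(-E). The function names and the parameter values lambda = 0.3 and K = 0.1 are made up for illustration; real values depend on the scoring system.

```python
import math


def bit_score(raw: float, lam: float, K: float) -> float:
    """Normalized bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)


def e_value(raw: float, m: int, n: int, lam: float, K: float) -> float:
    """Expected number of HSPs with score >= S: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw)


def poisson_prob(a: int, E: float) -> float:
    """Probability of finding exactly a HSPs with score >= S."""
    return math.exp(-E) * E ** a / math.factorial(a)


def p_value(E: float) -> float:
    """Probability of finding at least one HSP with score >= S: 1 - exp(-E)."""
    return 1.0 - poisson_prob(0, E)


# The figures quoted in the text.
print(round(p_value(3), 2))    # 0.95
print(round(p_value(5), 3))    # 0.993
print(round(p_value(10), 5))   # 0.99995

# Hypothetical parameters: in bit-score units the E-value is m * n * 2**(-S').
lam, K, m, n, S = 0.3, 0.1, 100, 1000, 50
print(math.isclose(e_value(S, m, n, lam, K), m * n * 2 ** (-bit_score(S, lam, K))))  # True
```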

Page 34: Inexact Matching

34

BLAST

The BLAST programs (Basic Local Alignment Search Tools) are a set of sequence comparison algorithms introduced in 1990 that are used to search sequence databases for optimal local alignments to a query.

Break the query and database sequences into fragments ("words"), and initially seek matches between fragments. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a given substitution matrix.

Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search.

Page 35: Inexact Matching

35

Page 36: Inexact Matching

36

BLAST

Use keyword trees to find HSPs in a subject sequence.

For each HSP, extend the local alignment at both ends as long as the alignment score is higher than the threshold T.

If a local alignment has a score higher than C, then there is “significant” similarity between the query and subject sequences. Report a “hit”.

Page 37: Inexact Matching

37

Page 38: Inexact Matching

38

Page 39: Inexact Matching

39

Inexact matching

Alignment

Motif/profile searching

Page 40: Inexact Matching

40

PROSITE

A profile or weight matrix is a table of position-specific alphabet weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence.

Page 41: Inexact Matching

41

Motifs and Matching

Motif Finding: Given a set of protein sequences, find the motif(s) that are shared by these proteins.

Motif Scanning: Given a motif and a protein sequence, find the occurrences (not necessarily identical) of the motif in the protein sequence.

Page 42: Inexact Matching

42

Motifs

How significant is a motif? How different is it from a uniform distribution? Does it represent a biologically significant region?

Information Content (IC):

$$IC = \sum_i \sum_j P_i(j) \log(P_i(j))$$

where $P_i(j)$ is the probability of observing character j at position i.

Page 43: Inexact Matching

43

Motifs

Usually, the background is not a uniform distribution. Thus, it is more useful to take into account the background distribution.

KL divergence score:

$$\sum_i \sum_j P_i(j) \log\left(\frac{P_i(j)}{B_i(j)}\right)$$

$P_i(j)$: probability of observing character j at position i.

$B_i(j)$: probability that character j occurs at position i based on the background distribution.

The higher the score, the less likely it is to obtain the motif by chance, i.e., the more significant the motif is.
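Both motif scores can be computed directly from a table of per-position counts. The sketch below makes several assumptions that the slides leave open: logarithms are base 2, zero-probability terms are skipped (treating 0 log 0 as 0), and the background is one distribution shared by all positions rather than a separate B_i per position. The function names are hypothetical.

```python
import math


def column_frequencies(counts):
    """Turn per-position counts {char: N} into frequencies P_i(j)."""
    freqs = []
    for col in counts:
        total = sum(col.values())
        freqs.append({ch: n / total for ch, n in col.items()})
    return freqs


def information_content(P):
    """Sum over positions i and characters j of P_i(j) * log2(P_i(j))."""
    return sum(p * math.log2(p) for col in P for p in col.values() if p > 0)


def kl_score(P, background):
    """Sum over i, j of P_i(j) * log2(P_i(j) / B(j)) for a shared background B."""
    return sum(p * math.log2(p / background[ch])
               for col in P for ch, p in col.items() if p > 0)


# Position 1 is always A (fully conserved); position 2 is uniform (uninformative).
counts = [{"A": 8, "C": 0, "G": 0, "T": 0}, {"A": 2, "C": 2, "G": 2, "T": 2}]
P = column_frequencies(counts)
uniform = {ch: 0.25 for ch in "ACGT"}
print(information_content(P))  # -2.0
print(kl_score(P, uniform))    # 2.0
```

In this example the conserved position contributes the whole KL score of 2 bits, while the uniform position contributes nothing against a uniform background.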

Page 44: Inexact Matching

44

Motifs

How to find potential occurrences of a motif in a given string?

Motif/profile:

• Counts: the entry in the (j,i)-th cell is the number of times character j occurs in the i-th column: $N_{i,j}$

• Frequency (likelihood): $P_i(j) = \dfrac{N_{i,j}}{N_{i,*}}$

• Log-likelihood: $\log P_i(j)$

• Log-odds: $\log \dfrac{P_i(j)}{B_i(j)}$

Page 45: Inexact Matching

45

Motifs

For a given motif of length k, scan an input sequence to find the occurrences of the motif.

Assume that each entry in the motif is $P_i(j)$.

(Figure: motif matrix with rows A, C, T, G, applied as a sliding window over the sequence.)

ATCGCTAGCTAAGTAGTGGGCTAAGCTAAGCTAAGTGTGTAGCGTA

$$P(X \mid \text{Motif}) = p_1(x_1) \, p_2(x_2) \, p_3(x_3) \cdots p_k(x_k)$$

where X = x1x2x3…xk is the substring within the sliding window.

Page 46: Inexact Matching

46

Motifs

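Background model, $P(X \mid \text{Background})$:

Independent identical distribution (iid):

$$P(X \mid \text{Background}) = p(x_1) \, p(x_2) \, p(x_3) \cdots p(x_k) = 0.25^k$$

Markov chain model (1st order or higher order)

$$\frac{P(X \mid \text{Motif})}{P(X \mid \text{Background})} > 1 \ ?$$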

Page 47: Inexact Matching

47

Motifs

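$$\log P(X \mid \text{Motif}) = \log p_1(x_1) + \log p_2(x_2) + \log p_3(x_3) + \cdots + \log p_k(x_k)$$

$$\log P(X \mid \text{Background}) = \log p(x_1) + \log p(x_2) + \log p(x_3) + \cdots + \log p(x_k)$$

$$\log P(X \mid \text{Motif}) - \log P(X \mid \text{Background}) > 0 \ ?$$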
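The sliding-window test can be sketched as follows. The code assumes an iid background with probability 0.25 per nucleotide, as on the earlier slide, and reports every window whose log-odds score is positive; the function name `scan_motif` and the three-column example motif are invented for illustration.

```python
import math


def scan_motif(sequence: str, motif, background: float = 0.25):
    """Return (start, log_odds) for each window with log P(X|Motif) - log P(X|Background) > 0.

    motif is a list of dicts, one per position i, mapping character j to P_i(j).
    start is a 0-based index into sequence.
    """
    k = len(motif)
    hits = []
    for start in range(len(sequence) - k + 1):
        score = 0.0
        for i, ch in enumerate(sequence[start:start + k]):
            p = motif[i].get(ch, 0.0)
            if p == 0.0:                 # impossible under the motif model
                score = -math.inf
                break
            score += math.log(p) - math.log(background)
        if score > 0:
            hits.append((start, score))
    return hits


# Hypothetical motif that prefers T, then A, then G.
motif = [
    {"T": 0.7, "A": 0.1, "C": 0.1, "G": 0.1},
    {"A": 0.7, "T": 0.1, "C": 0.1, "G": 0.1},
    {"G": 0.7, "A": 0.1, "C": 0.1, "T": 0.1},
]
print([start for start, _ in scan_motif("ACTAGC", motif)])  # [2]
```

Working in log space turns the product of probabilities into a sum, which avoids underflow for long motifs.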

Page 48: Inexact Matching

48

MotifScanner

http://homes.esat.kuleuven.be/~thijs/Work/MotifScanner.html

Page 49: Inexact Matching

49

JASPAR Motifs

JASPAR MA0112

1 1 7 2 0 0 0 6 1 2 3 1 1 5 0 0 2 3
5 5 1 0 0 0 7 0 7 5 2 1 0 1 8 9 4 4
1 1 1 7 9 0 2 2 1 1 4 1 8 3 0 0 0 1
2 2 0 0 0 9 0 1 0 1 0 6 0 0 1 0 3 1

Page 50: Inexact Matching

50

Suffix Trees for Inexact Matching

In a rooted tree T, a node u is an ancestor of a node v if u is on the unique path from the root to v.

A proper ancestor of v is an ancestor that is not v.

In a rooted tree T, the lowest common ancestor (lca) of two nodes x and y is the deepest node in T that is an ancestor of both x and y.

lca retrieval problem: Given two nodes, x and y, of a rooted tree, find their lca.

Let n be the number of nodes in a rooted tree T. After O(n)-time preprocessing, the lca retrieval problem can be solved in constant time (independent of n)!

Page 51: Inexact Matching

51

lca Retrieval

Page 52: Inexact Matching

52

Longest Common Extension

Longest common extension (lce) problem: Given strings S1 and S2 with a total length of n, for each specified index pair (i, j), find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j. That is, find the length of the longest prefix of suffix i of S1 that matches a prefix of suffix j of S2.

For a given index pair (i, j), it is easy to solve the lce problem in linear time.

But we are challenged to solve it in constant time for each given index pair (after linear-time preprocessing)!

(Figure labels: i, j, x, y.)

Page 53: Inexact Matching

53

Longest Common Extension

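• Build a generalized suffix tree for S1 and S2. O(n)

• Compute the string-depth of each node. O(n)

• For a given pair (i, j), the lce problem is reduced to the lca problem, which can be solved in constant time: the lce is the string-depth of the lca of the leaves for suffix i of S1 and suffix j of S2.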

Page 54: Inexact Matching

54

K-Mismatch Problem

Given a pattern P, a text T, and a fixed number k that is independent of the lengths of P and T, a k-mismatch of P is a |P|-length substring of T that matches at least |P|-k characters of P. That is, it matches P with at most k mismatches.

Insertions and deletions are not allowed, only matches or mismatches.

P=bend, T=abentbananaend, k=2

Page 55: Inexact Matching

55

K-Mismatch Problem

|P| = n, |T| = m; O(km) vs. O(nm), with k << n

Algorithm
INPUT: (T, P, i, k)
OUTPUT: whether a k-mismatch of P occurs in T starting at i

j = 1; i’ = i; count = 0;
While (count ≤ k)
    l = lce(j, i’)
    if j + l = n + 1, then a k-mismatch of P occurs in T starting at i; stop
    count++;
    if count > k, then a k-mismatch of P does NOT occur in T starting at i; stop
    j = j + l + 1; i’ = i’ + l + 1
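The loop above can be sketched in Python as follows. One substitution should be noted plainly: `lce` here is a naive character-by-character comparison standing in for the constant-time, suffix-tree-based lce query, so this sketch shows the control flow but does not achieve the O(km) bound. Indices are 0-based, unlike the 1-based pseudocode, and the function names are illustrative.

```python
def lce(s1: str, i: int, s2: str, j: int) -> int:
    """Naive longest common extension of s1[i:] and s2[j:] (stand-in for the O(1) query)."""
    n = 0
    while i + n < len(s1) and j + n < len(s2) and s1[i + n] == s2[j + n]:
        n += 1
    return n


def k_mismatch_at(text: str, pattern: str, start: int, k: int) -> bool:
    """True if pattern matches text[start:start+len(pattern)] with at most k mismatches."""
    n = len(pattern)
    if start + n > len(text):
        return False                 # the window does not fit in the text
    j, t, count = 0, start, 0
    while True:
        l = lce(pattern, j, text, t)
        if j + l >= n:               # reached the end of the pattern
            return True
        count += 1                   # position j + l is a mismatch
        if count > k:
            return False
        j += l + 1                   # jump past the mismatch
        t += l + 1


T, P = "abentbananaend", "bend"
print([i for i in range(len(T)) if k_mismatch_at(T, P, i, 2)])  # [1, 5, 10]
```

The three reported windows are bent, bana, and aend, with 1, 2, and 1 mismatches respectively.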

Page 56: Inexact Matching

56

K-Mismatch Problem

When the alphabet size is small, e.g. |∑| = 4, and k is small, e.g. k = 2, a practical approach in biological database search is:

Build a suffix tree of T; O(m)

Enumerate all k-mismatches (p’) of P; ($2^4 = 16$)

Find the occurrences of every p’; (16*n)

Total time O(m + 16*n) << O(km), since n << m

Page 57: Inexact Matching

57

Maximal Palindromes

An even-length substring S’ of S is a maximal palindrome of radius k if, starting in the middle of S’, S’ reads the same in both directions for k characters but not for any k’ > k characters.

aabactgaaccaat

An odd-length maximal palindrome S’ is similarly defined after excluding the middle character of S’.

aabactgaaccaat

Page 58: Inexact Matching

58

Palindrome Problem

Given a string S of length n, the palindrome problem is to locate all maximal palindromes in S.

Let Sr be the reverse of S. O(n)

For q from 1 to n-1:
    Find the lce for the index pair (q+1, n-q+1) in S and Sr, respectively. Let k be the length of lce(q+1, n-q+1).
    If k > 0, then report a maximal palindrome of radius k centered at q.

All the maximal even-length palindromes in a string can be identified in linear time.

How about odd-length palindromes?
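The even-length procedure can be sketched as below, again with a naive `lce` standing in for the constant-time query, so the running time is quadratic in the worst case rather than linear. The center q keeps its meaning from the text (the palindrome is centered between positions q and q+1, counted from 1), while string indexing inside the code is 0-based.

```python
def lce(s1: str, i: int, s2: str, j: int) -> int:
    """Naive longest common extension of s1[i:] and s2[j:] (stand-in for the O(1) query)."""
    n = 0
    while i + n < len(s1) and j + n < len(s2) and s1[i + n] == s2[j + n]:
        n += 1
    return n


def even_maximal_palindromes(s: str):
    """Return (q, k) pairs: a maximal even-length palindrome of radius k centered after position q."""
    n = len(s)
    sr = s[::-1]
    found = []
    for q in range(1, n):
        # 1-based pair (q+1, n-q+1) in S and Sr is the 0-based pair (q, n-q).
        k = lce(s, q, sr, n - q)
        if k > 0:
            found.append((q, k))
    return found


print(even_maximal_palindromes("aabactgaaccaat"))
# [(1, 1), (8, 1), (10, 3), (12, 1)]
```

The entry (10, 3) is the palindrome aaccaa in the example string from the previous slide.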

Page 59: Inexact Matching

59

Complemented Palindromes

In DNA, the two halves of a substring form a palindrome only if the characters in one half are converted to their complement characters. (A and T are complements, and C and G are complements.)

ATTAGCTAATTAATCGATTA

Finding all complemented palindromes in a string S can be solved in linear time.

Let Sr be the reverse of S with each character replaced by its complement.

Page 60: Inexact Matching

60

K-mismatch Palindromes

A k-mismatch palindrome is a substring that becomes a palindrome after k or fewer mutations.

axabbcca is a 2-mismatch palindrome.

Finding all k-mismatch palindromes in a string S can be solved in O(kn) time.

Page 61: Inexact Matching

61

Tandem Repeats

A tandem repeat is a string that can be written as αα, where α is a substring.

It does not need to be maximal.

xababababy has two tandem repeats starting at 2.

Find all tandem repeats in string S in $O(n^2)$ time:

For every pair of indices (i, j) (i < j):
    Find lce(i, j).
    If lce(i, j) >= j-i, then there is a tandem repeat starting at i with length 2*(j-i).
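A direct transcription of this pairwise test is sketched below. As in the other sketches, `lce` is a naive stand-in for the constant-time query, so this version can take cubic time in the worst case; the O(n^2) bound in the text relies on O(1) lce queries. Positions in the output are 0-based.

```python
def lce(s: str, i: int, j: int) -> int:
    """Naive longest common extension of s[i:] and s[j:] (stand-in for the O(1) query)."""
    n = 0
    while i + n < len(s) and j + n < len(s) and s[i + n] == s[j + n]:
        n += 1
    return n


def tandem_repeats(s: str):
    """Return (start, length) for every tandem repeat in s, with 0-based start."""
    found = []
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            if lce(s, i, j) >= j - i:        # s[i:j] is immediately repeated at j
                found.append((i, 2 * (j - i)))
    return found


repeats = tandem_repeats("xababababy")
# Index 1 is position 2 in the slide's 1-based numbering.
print([r for r in repeats if r[0] == 1])  # [(1, 4), (1, 8)]
```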

Page 62: Inexact Matching

62

K-mismatch Tandem Repeats

A substring that becomes a tandem repeat after k or fewer mutations.

axabaybb is a 2-mismatch tandem repeat.

Find all k-mismatch tandem repeats in string S in $O(kn^2)$ time.

Tandem repeats (or k-mismatch tandem repeats) can be found faster.

Page 63: Inexact Matching

63

Repetitive Structures

A maximal pair in a string S is a pair of identical substrings α and β in S such that the character to the immediate left (right) of α is different from the character to the immediate left (right) of β.

It is represented by (p1, p2, n’), where p1 and p2 are the starting positions of α and β, and n’ is the length.

xabcyiiizabcqabcyrxar

(2,10,3) and (10,14,3) are maximal pairs. But (2,14,3) is not a maximal pair. Instead, (2,14,4) is a maximal pair.

For an input string S, R(S) is the set of maximal pairs.

R(S) is too large to be useful.

(Figure: two occurrences at positions p and q, with left characters b and x and right characters c and y.)

Page 64: Inexact Matching

64

Repetitive Structures

A maximal repeat is a substring of S that occurs in a maximal pair of S.

xabcyiiizabcqabcyrxar

(2,10,3) and (10,14,3): α = abc

(2,14,4): α = abcy

For an input string S, R’(S) is the set of maximal repeats.

| R’(S) | ≤ | R(S) |

A supermaximal repeat is a maximal repeat that never occurs as a substring of any other maximal repeat.

abcy is a supermaximal repeat. abc is not.
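These definitions can be checked on the example string by brute force. The sketch below is not the linear-time suffix-tree method developed on the following slides; it simply enumerates all position pairs. It assumes that a missing neighbor at either end of the string counts as a different character, and it reports positions counted from 1 to match the slides. All function names are illustrative.

```python
def lce(s: str, i: int, j: int) -> int:
    """Naive longest common extension of s[i:] and s[j:] (0-based)."""
    n = 0
    while i + n < len(s) and j + n < len(s) and s[i + n] == s[j + n]:
        n += 1
    return n


def maximal_pairs(s: str):
    """All maximal pairs (p1, p2, length) with 1-based p1 < p2."""
    pairs = []
    for p1 in range(len(s)):
        for p2 in range(p1 + 1, len(s)):
            left1 = s[p1 - 1] if p1 > 0 else None   # no left neighbor at the start
            if left1 == s[p2 - 1]:
                continue                            # extendable to the left
            length = lce(s, p1, p2)                 # longest match, so not extendable to the right
            if length > 0:
                pairs.append((p1 + 1, p2 + 1, length))
    return pairs


def maximal_repeats(s: str):
    """Set of substrings that occur in some maximal pair."""
    return {s[p1 - 1:p1 - 1 + n] for p1, _, n in maximal_pairs(s)}


def supermaximal_repeats(s: str):
    """Maximal repeats that are not a substring of any other maximal repeat."""
    reps = maximal_repeats(s)
    return {r for r in reps if not any(r != o and r in o for o in reps)}


S = "xabcyiiizabcqabcyrxar"
pairs = maximal_pairs(S)
print((2, 10, 3) in pairs, (10, 14, 3) in pairs, (2, 14, 4) in pairs)  # True True True
print((2, 14, 3) in pairs)                                             # False
print("abcy" in supermaximal_repeats(S), "abc" in supermaximal_repeats(S))  # True False
```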

Page 65: Inexact Matching

65

Maximal Repeats

Goal: To find all maximal repeats in linear time.

Lemma 7.12.1 Let T be the suffix tree for string S. If a string α is a maximal repeat in S, then α must be the path-label of an internal node v in T.

Theorem 7.12.1 There can be at most n maximal repeats in any string of length n.

Is it true that the path-label of any internal node must be a maximal repeat?

(Figure: two occurrences at positions p and q, with left characters b and x and right characters c and y, and the corresponding suffix-tree node with branches c and y leading to leaves p and q.)

Page 66: Inexact Matching

66

Maximal Repeats

For each position i in string S, character S(i-1) is called the left character of i.

Let T be a suffix tree. The left character of a leaf (i) in T is the left character of position i.

A node v is left diverse if at least two leaves in v’s subtree have different left characters.

The path-label of a node v in T is a maximal repeat if and only if v is left diverse.

The forward direction (maximal repeat implies left diverse) is easy; the converse is argued next.

Page 67: Inexact Matching

67

Maximal Repeats

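Let node v, with path-label α, be left diverse. Let p and q be two leaves below v that have different left characters, b ≠ x.

If c ≠ y (leaves p and q diverge at v), then α is a maximal repeat, because (p, q, |α|) is a maximal pair.

(Figure: occurrences at p and q with left characters b and x and right characters c and y; node v with branches c and y leading to leaves p and q.)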

Page 68: Inexact Matching

68

Maximal Repeats

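Else if c = y (leaves p and q diverge at a point below v), then there is another branch at v (because it is an internal node). Let k be a leaf in that branch, with left character m and right character n.

Then n ≠ c (= y).

If b ≠ m, then (p, k, |α|) is a maximal pair, so α is a maximal repeat.

Else if b = m, then m ≠ x (because b ≠ x), so (q, k, |α|) is a maximal pair, and α is a maximal repeat.

(Figure: occurrences at p, q, and k with left characters b, x, m and right characters c, y, n; node v with one branch c/y leading to leaves p and q and another branch n leading to leaf k.)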

Page 69: Inexact Matching

69

Maximal Repeats

The property of being left diverse propagates upward in T. If a node is left diverse, then its parent node is also left diverse.

A node is a frontier node if it is left diverse and none of its children are left diverse.

Trimming off all leaves and nodes under frontier nodes, we obtain a subtree (the frontier tree) in which every path from the root to a node is a maximal repeat.

The frontier tree is a compact representation of all maximal repeats.

Page 70: Inexact Matching

70

Maximal Repeats

Page 71: Inexact Matching

71

Maximal Repeats

Finding left diverse nodes

Algorithm
Start from the leaves.
At a node v:
    if some children are left diverse, then label v as left diverse
    else if all of v’s children have the same left character, or v is a leaf, then record the left character
    else label v as a frontier node.

Depth-first. At each node, the processing time is proportional to the number of children.

Return whether the node is left diverse, or the left character of the node. O(n) in total.

Maximal repeats in S can be found in O(n) time. To be precise, a compact representation of the maximal repeats in S can be found in O(n) time.

Page 72: Inexact Matching

72

Supermaximal Repeats

A supermaximal repeat is a maximal repeat that never occurs as a substring of any other maximal repeat.

A supermaximal repeat must be the path-label from the root to a frontier node.

Page 73: Inexact Matching

73

Supermaximal Repeats

Is it true that every path from the root to a frontier node is a supermaximal repeat?

(Figure: suffix-tree fragment with nodes u, v, s and leaves i, j, p, q; occurrences with left characters x, d, e and right characters c, y.)

d ≠ e, c ≠ y

The parent of leaves i and j (node u) is left diverse. The path-label of u is therefore a maximal repeat.

Node s is NOT left diverse.

Page 74: Inexact Matching

74

Supermaximal Repeats

If a frontier node v has a child that is an internal node, then the path-label of v is NOT a supermaximal repeat.

Page 75: Inexact Matching

75

Supermaximal Repeats

Is it true that every path from the root to a frontier node that has no internal-node children is a supermaximal repeat?

(Figure: suffix-tree fragment with node u and leaves i, j, p, q; occurrences with left characters x, d, e and right characters c, y.)

d ≠ e, c ≠ y

The parent of leaves i and j (node u) is left diverse. The path-label of u is therefore a maximal repeat.

Page 76: Inexact Matching

76

Supermaximal Repeats

If a frontier node v has a child that is an internal node, then the path-label of v is NOT a supermaximal repeat.

If a frontier node v has two children that have the same left character, then the path-label of v is NOT a supermaximal repeat.

A frontier node v represents a supermaximal repeat if and only if all of its children are leaves and each leaf has a distinct left character.

To check whether all children are leaves: O(k)

To check whether all children have distinct left characters: O(k) (assuming constant alphabet size)

Supermaximal repeats can be found in linear time. The nodes corresponding to supermaximal repeats can be found in linear time.

Page 77: Inexact Matching

77

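Maximal Pairs

How to output maximal pairs?

At a left diverse node v with path-label α, if leaves i and j are rooted at different children of v and have different left characters, then (i, j, |α|) defines a maximal pair.

(Figure: node v with leaves i, j, and q below it; occurrences with left characters b and x and right characters c and y.)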

Page 78: Inexact Matching

78

Maximal Pairs

Start from the leaves.

For each node v, we maintain |∑| linked lists, one list per character. The list for character x consists of the leaves below v that have x as their left character.

Example lists: x: j, q; b: i; …

At each left diverse node, output the maximal pairs using the linked lists of its children.

Link (not copy) the children’s linked lists to get v’s linked lists: O(c) (assuming constant alphabet size, where c is the number of children).

O(n+k) in total, where k is the number of maximal pairs.