Page 1:

Ravello, 19-21 September 2003

Indexing Structures for Approximate String Matching

Alessandra Gabriele

Filippo Mignosi

Antonio Restivo

Marinella Sciortino

Page 2:

Approximate string matching is the problem of finding patterns in texts in the presence of “mismatches” or “errors”. It has several applications in data analysis and data retrieval, such as:

• searching text in the presence of typing or spelling errors;
• retrieving musical passages;
• finding biological sequences in the presence of possible mutations or misreads.

The nature of the mismatches depends on the problem or application considered and can be captured formally by introducing distances among strings.

Page 3:

Let $d: \Sigma^* \times \Sigma^* \to \mathbb{R}^+$ be a distance function.

The distance d(x,y) between two strings x and y is the minimal cost of a sequence of operations that transforms x into y (and $\infty$ if no such sequence exists).

The different possible operations are:

1) Insertion, 2) Deletion, 3) Substitution, 4) Transposition.

We consider one of the most commonly used distance functions, the Hamming distance, which allows only substitutions, each of cost 1 in the simplified definition. It is finite whenever |x|=|y|, and it satisfies $0 \le d(x,y) \le |x|$.

Ex.: x=acgtatct, y=aggttact

d(x,y)=3 (in the simplified definition)
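As a small illustration of the simplified Hamming distance on this slide, the following Python sketch counts mismatching positions and returns infinity for strings of different lengths. The function name `hamming` is an assumption made for this example; it does not come from the talk.

```python
import math

def hamming(x: str, y: str) -> float:
    """Simplified Hamming distance: each substitution costs 1.

    Returns math.inf when the lengths differ, since no sequence of
    substitutions alone can transform x into y in that case.
    """
    if len(x) != len(y):
        return math.inf
    # Count the positions where the two strings disagree.
    return sum(1 for a, b in zip(x, y) if a != b)

# The example from the slide.
print(hamming("acgtatct", "aggttact"))
```

On the slide's example the mismatches are at the second, fifth and sixth letters, which gives the stated value of 3.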

Page 4:

Typical approaches in this field consist in considering a percentage D of errors or in fixing their number k.

The new idea in our approach is to introduce a new parameter r and to allow at most k errors in any substring of length r, where r is not necessarily constant.

Let S be a string over the alphabet $\Sigma$ and let k, r be non-negative integers such that $k \le r$. A string v occurs in S at position l, up to k errors in a window of size r, if:

1) $|v| < r \Rightarrow d(v, S(l, l+|v|-1)) \le k$;
2) $|v| \ge r \Rightarrow \forall i,\ 1 \le i \le |v|-r+1,\ d(v(i, i+r-1), S(l+i, l+i+r-1)) \le k$.

L(S,k,r) is the set of the words that satisfy the previous definition for some l, $1 \le l \le |S|-|v|+1$.
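The notion of an occurrence "up to k errors in a window of size r" can be checked directly from its definition. The sketch below is a naive checker, not the indexing method of the talk. It assumes 0-based positions (the slide uses 1-based ones) and assumes that each window of v is compared with the text segment at the same offset from l. The names `occurs_at` and `occurrences` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions (equal lengths assumed)."""
    return sum(1 for c1, c2 in zip(a, b) if c1 != c2)

def occurs_at(v: str, s: str, l: int, k: int, r: int) -> bool:
    """True if v occurs in s at 0-based position l with at most k
    mismatches in every window of r consecutive symbols."""
    if l < 0 or l + len(v) > len(s):
        return False
    segment = s[l:l + len(v)]
    if len(v) < r:
        # Short word: a single global bound of k mismatches.
        return hamming(v, segment) <= k
    # Long word: every window of size r must respect the bound.
    return all(
        hamming(v[i:i + r], segment[i:i + r]) <= k
        for i in range(len(v) - r + 1)
    )

def occurrences(v: str, s: str, k: int, r: int) -> list[int]:
    """All 0-based positions of s where v occurs in the above sense."""
    return [l for l in range(len(s) - len(v) + 1) if occurs_at(v, s, l, k, r)]

print(occurrences("abab", "abaaabab", k=1, r=2))
print(occurrences("abab", "abaaabab", k=0, r=2))
```

With k=0 the checker reduces to exact matching, which is a convenient sanity check on the window logic.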

Page 5:

An Index over a fixed text S is an abstract data type based on the set of all factors of S, denoted by Fact(S). This data type is equipped with operations that allow it to answer the following query:

given $x \in \mathrm{Fact}(S)$, find the list of all its occurrences in S.

This operation can easily be extended to the case of approximate string matching.
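To make the query concrete, here is a deliberately naive index for the exact case: a dictionary from every factor of S to the list of its starting positions. It is only a toy sketch with a quadratic number of entries, whereas practical indexes such as suffix trees use linear space. The name `build_factor_index` is an assumption for this example.

```python
from collections import defaultdict

def build_factor_index(s: str) -> dict[str, list[int]]:
    """Map every non-empty factor of s to its 0-based start positions."""
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            index[s[i:j]].append(i)
    return dict(index)

index = build_factor_index("abab")
print(index["ab"])    # start positions of the factor "ab"
print("bb" in index)  # "bb" is not a factor of "abab"
```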

Page 6:

Statement of the problem:

Given a “text” S, a “pattern” x and two integers k and r, return all the text positions l such that x occurs in S at position l, up to k errors for r symbols.

Natural solution:

Build an automaton recognizing the language L(S,k,r), then apply determinization and minimization.

Exponential size!!

Page 7:

Let u be a string over the alphabet $\Sigma$. The neighbourhood of u is the set of all words that have at most k errors in every window of size r with respect to u, i.e.:

$V(u,k,r) = L(u,k,r) \cap \Sigma^{|u|}$.

Different bounds from the classical exponential ones have been obtained by using a new parameter R, called the Repetition Index.

The Repetition Index R(S,k,r) of S is the smallest integer h such that every string of this length occurs in at most one position of the text, up to k errors for r symbols:

$R(S,k,r) = \min\{\, h \ge 1 \ \text{s.t.}\ \forall i, j,\ 1 \le i, j \le |S|-h+1,\ V(S(i,i+h-1),k,r) \cap V(S(j,j+h-1),k,r) \ne \emptyset \Rightarrow i=j \,\}$.

• R(S,k,r) is always defined, because h=|S| is an element of the set described above;
• If $k/r \ge 1/2$ then R(S,k,r)=|S|.
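For very small inputs the Repetition Index can be computed by brute force straight from the definition: for each length h, enumerate every word of length h and check whether it lies in the neighbourhoods of two different factors. The sketch below does exactly that. It is exponential in h and is not the algorithm the authors mention later in the talk. The function names, and the choice to default the alphabet to the letters that appear in S, are assumptions of this example.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)

def in_neighbourhood(w: str, u: str, k: int, r: int) -> bool:
    """True if w (same length as u) has at most k mismatches with u
    in every window of size r (or globally, when u is shorter than r)."""
    if len(u) < r:
        return hamming(w, u) <= k
    return all(hamming(w[i:i + r], u[i:i + r]) <= k
               for i in range(len(u) - r + 1))

def repetition_index(s: str, k: int, r: int, alphabet=None) -> int:
    """Smallest h such that no word of length h is in the neighbourhood
    of factors of s starting at two different positions (brute force)."""
    letters = sorted(set(s)) if alphabet is None else list(alphabet)
    n = len(s)
    for h in range(1, n + 1):
        factors = [s[i:i + h] for i in range(n - h + 1)]
        clash = any(
            sum(1 for u in factors if in_neighbourhood(w, u, k, r)) > 1
            for w in map("".join, product(letters, repeat=h))
        )
        if not clash:
            return h
    return n

print(repetition_index("abab", k=0, r=1))
print(repetition_index("abba", k=1, r=2))
```

The second call has k/r = 1/2, so by the remark on the slide the result should be the full length of the text.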

Page 8:

Let S be an infinite sequence generated by a memoryless source and let $S_n$ be the sequence of prefixes of S of length n.

• For fixed k and r a.s.

• For fixed k and r(n) (in particular for $r(n)=R(S_n,k,r(n))$)

$H(D,p) = (1-D)\log\frac{1-D}{p} + D\log\frac{D}{1-p}$, where p is the probability that the letters in two distinct positions are equal and $0 \le D \le 1-p$.
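The function H(D,p) is easy to evaluate numerically. The sketch below assumes natural logarithms (the slide does not state the base) and the usual convention that 0·log 0 = 0. The helper name `H` simply mirrors the slide's notation.

```python
import math

def H(D: float, p: float) -> float:
    """H(D,p) = (1-D) log((1-D)/p) + D log(D/(1-p)), natural log assumed.

    Defined for 0 < p < 1 and 0 <= D <= 1 - p; a term whose weight is 0
    is taken to be 0.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0.0 <= D <= 1.0 - p:
        raise ValueError("D must satisfy 0 <= D <= 1 - p")

    def term(weight: float, ratio: float) -> float:
        return 0.0 if weight == 0.0 else weight * math.log(ratio)

    return term(1.0 - D, (1.0 - D) / p) + term(D, D / (1.0 - p))

# Uniform 4-letter alphabet: two random letters agree with p = 1/4.
print(H(0.0, 0.25))   # equals log(1/p)
print(H(0.75, 0.25))  # vanishes at D = 1 - p
```

At D=0 the expression reduces to log(1/p), and at D=1-p both logarithms are log 1, so the value is 0.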

Page 9:

Some Average Estimations for Random Texts

Alphabet $\Sigma$, $|\Sigma|=4$, r = R(S,k,r), k=2 fixed

• |S| = 64, R(S,k,R) ~ 13
• |S| = 80, R(S,k,R) ~ 14
• |S| = 128, R(S,k,R) ~ 15
• |S| = 256, R(S,k,R) ~ 16
• |S| = 1024, R(S,k,R) ~ 19
• |S| = 16384, R(S,k,R) ~ 25
• |S| ~ 300,000, R(S,k,R) ~ 30
• |S| ~ 5,000,000, R(S,k,R) ~ 35
• |S| ~ 3,000,000,000, R(S,k,R) ~ 47

Page 10:

Using the Repetition Index we give a method to construct the automaton that recognizes the language L(S,k,r). Its size is a function of |S|, R(S,k,r) and the number of errors t in a window of size R(S,k,r). More precisely, the size is

$O(|S| \cdot R^t)$.

Worst case: R, t = O(|S|). The size of the automaton is exponential again!!

Average case: R = O(log |S|). If t is constant, then for fixed k the size of the automaton is linear times a polylog of the size of the text S!!
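The step from the general size bound to the two cases on this slide can be written out as follows. The constants c and c' are hypothetical names introduced only for this derivation. The first line assumes R = O(log |S|) with t constant, and the second assumes R, t = O(|S|).

```latex
\[
R \le c \log |S|,\ t \text{ constant}
\;\Longrightarrow\;
|S| \cdot R^{t} \le c^{t}\, |S| \,(\log |S|)^{t} = O\!\left(|S| \log^{t} |S|\right)
\]
\[
R \le c\,|S|,\ t \le c'\,|S|
\;\Longrightarrow\;
|S| \cdot R^{t} \le |S| \cdot (c\,|S|)^{c'|S|} = 2^{O(|S| \log |S|)}
\]
```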

Page 11:

Indexing

• $|x| \ge R(S,k,r)$: case of long patterns
• $|x| < R(S,k,r)$: case of short patterns

Page 12:

$|x| \ge R(S,k,r)$

In this case, if x appears, it appears just once.

• Build the deterministic automaton A recognizing the language L(S,k,r).
• Label each state with an integer representing the length of the shortest path from that state to a state without outgoing edges.
• “Read” the string x as far as possible; if the end of x is reached, the output is |S| minus the number associated with the arrival state minus the length of the pattern x.
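The labelling trick in these steps can be tried out in the exact case (k=0). In the sketch below a plain suffix trie stands in for the automaton A; this is a substitution made for illustration, since the construction of the automaton for L(S,k,r) is not described on the slides. The class and function names are hypothetical, and the returned position is 0-based.

```python
class State:
    def __init__(self):
        self.next = {}    # outgoing edges: letter -> State
        self.to_sink = 0  # shortest path length to a state with no outgoing edges

def label(state: State) -> int:
    """Fill in to_sink for the whole trie, bottom-up."""
    if state.next:
        state.to_sink = 1 + min(label(t) for t in state.next.values())
    else:
        state.to_sink = 0
    return state.to_sink

def build_suffix_trie(s: str) -> State:
    """Suffix trie of s: recognizes exactly the factors of s."""
    root = State()
    for i in range(len(s)):
        state = root
        for c in s[i:]:
            state = state.next.setdefault(c, State())
    label(root)
    return root

def locate(root: State, text_len: int, x: str):
    """Read x; return |S| - label - |x|, or None if x cannot be read.

    The result is the position of x when x occurs exactly once.
    """
    state = root
    for c in x:
        state = state.next.get(c)
        if state is None:
            return None
    return text_len - state.to_sink - len(x)

s = "acgtatct"
root = build_suffix_trie(s)
print(locate(root, len(s), "tat"))  # 0-based position of "tat" in s
```

When x occurs only once, the path below its state is a single chain to the end of the text, so the label equals the number of symbols that follow x, and the subtraction recovers the starting position.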

Page 13:

$|x| < R = R(S,k,r)$

This procedure concerns the case of short patterns and includes:

• a non-trivial reduction to the Document Listing Problem;
• an algorithm for finding the Repetition Index;
• standard filters for approximate string matching.

Page 14:

The average searching time of a pattern in our data structures turns out to be linear under a hypothesis on the distribution of R(S,k,r).

More precisely, we require that there exists a real number $\beta > 1$ such that, if $\mu$ is the expected value of R(S,k,r) for a text S of length n, then the probability that $R(S,k,r) > \beta\mu$ goes to zero faster than 1/n.

Under this condition, the average running time spent by our algorithm for finding the list occ(x) of all occurrences of a pattern x in a text, up to k errors in every window of size r, is proportional to

|x|+|occ(x)|.

Page 15:

[Figure: Distribution of R(S,k,r). Axes: Repetition Index, Number of strings.]

Page 16:

An Application: the Longest Common Substring Problem with k Mismatches

Solution: We build two automata recognizing the languages $L(S_1,k_1,r)$ and $L(S_2,k_2,r)$, with $k_1+k_2=k$. With a DFS we find the longest label of the paths common to the two automata, starting from the two initial states.

The average time spent by this algorithm is $O(\max\{|S_1| \log^{k_1}|S_1|,\ |S_2| \log^{k_2}|S_2|\})$.
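For comparison, the classical formulation of the problem (a single global bound of k mismatches, without windows) has a simple quadratic baseline: slide one string against the other and, on each diagonal, keep a window that contains at most k mismatches. The sketch below implements that baseline. It is not the automaton-based method of the slide, and the function name is an assumption.

```python
def lcs_k_mismatches(s1: str, s2: str, k: int) -> tuple[int, int, int]:
    """Return (length, i, j) such that s1[i:i+length] and s2[j:j+length]
    differ in at most k positions and length is as large as possible."""
    best = (0, 0, 0)
    n, m = len(s1), len(s2)
    # Each shift fixes one diagonal of the alignment grid.
    for shift in range(-(m - 1), n):
        i, j = (shift, 0) if shift >= 0 else (0, -shift)
        span = min(n - i, m - j)
        left = 0
        mismatches = 0
        for right in range(span):
            if s1[i + right] != s2[j + right]:
                mismatches += 1
            # Shrink from the left until the bound holds again.
            while mismatches > k:
                if s1[i + left] != s2[j + left]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best[0]:
                best = (right - left + 1, i + left, j + left)
    return best

length, i, j = lcs_k_mismatches("acgtatct", "aggttact", 3)
print(length, i, j)
```

With k=3 the two example strings from the Hamming-distance slide match over their full length, since their distance is exactly 3.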

Page 17:

Work in progress...

• To generalize the results proved for the Hamming distance
– to the Edit (or Weighted Levenshtein) distance, which allows insertions, deletions and substitutions;
– to score functions, which are linked to the Levenshtein distance and are much more widely used in Computational Biology;

• To prove the hypothesis on the distribution of R(S,k,r) (in accordance with the experimental results obtained by A. Langiu);

• To find other applications of our data structures.