  • 7/30/2019 KMP Algorithm 1

    TECH Computer Science

    String Matching

    Detecting the occurrence of a particular substring (pattern) in another string (text)

    A straightforward solution

    The Knuth-Morris-Pratt Algorithm

    The Boyer-Moore Algorithm

    Straightforward solution

    Algorithm: Simple string matching

    Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty.

    Output: The return value is the index in T where a copy of P begins, or -1 if no match for P is found.

    int simpleScan(char[] P, char[] T, int m) {
        // Indexes are 1-based, as in the rest of the slides.
        int match; // value to return
        int i, j, k; // i: guessed start of P in T; j: index in T; k: index in P
        match = -1;
        j = 1; k = 1; i = j;
        while (endText(T, j) == false) {
            if (k > m) {
                match = i; // match found
                break;
            }
            if (T[j] == P[k]) {
                j++; k++;
            } else {
                // Back up over matched characters.
                int backup = k - 1;
                j = j - backup;
                k = k - backup;
                // Slide pattern forward, start over.
                j++;
                i = j;
            }
        }
        return match;
    }
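
    The scan above can be sketched as runnable Java. This is a minimal sketch, not the slides' exact listing: it assumes 0-based String indexing in place of the 1-based arrays and the endText helper, and the class name SimpleScanSketch is made up for illustration. It also tests for a completed match before testing for the end of the text, so a match that ends on the last text character is reported.

```java
// Minimal sketch of the straightforward scan (assumption: 0-based indexing).
// The class name is hypothetical; it does not come from the slides.
final class SimpleScanSketch {
    /** Returns the 0-based index in text where pattern first occurs, or -1. */
    static int simpleScan(String pattern, String text) {
        int m = pattern.length();
        int n = text.length();
        int i = 0; // current guess at where the pattern begins in the text
        int j = 0; // index of the current text character
        int k = 0; // index of the current pattern character
        while (true) {
            if (k == m) {
                return i;      // all m pattern characters matched
            }
            if (j == n) {
                return -1;     // text exhausted before a full match
            }
            if (text.charAt(j) == pattern.charAt(k)) {
                j++;
                k++;
            } else {
                j = j - k + 1; // back up over matched characters, then slide by one
                k = 0;
                i = j;
            }
        }
    }
}
```

    For example, SimpleScanSketch.simpleScan("ABABCB", "ABABABCB") backs up once after the mismatch at the fifth text character and then finds the pattern at index 2.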

    Analysis

    Worst-case complexity is in Θ(mn), where n is the length of T.

    The scan needs to back up in the text after a mismatch.

    Works quite well on average for natural language.
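
    One input family that forces the quadratic behavior, given here as an illustration rather than taken from the slides, is a text of n copies of a and a pattern of m-1 copies of a followed by b. Every alignment that fits in the text matches m-1 characters and fails on the last one:

```latex
\begin{align*}
T &= \underbrace{aa\cdots a}_{n}, \qquad P = \underbrace{aa\cdots a}_{m-1}\,b \\
\text{comparisons} &\ge m\,(n-m+1) \\
m \le n/2 \;&\Rightarrow\; m\,(n-m+1) \ge \tfrac{1}{2}\,mn
\end{align*}
```

    Natural-language text rarely has such long repeated runs, which is consistent with the good average behavior noted above.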

    Finite Automata

    Terminology

    Σ: the alphabet.

    Σ*: the set of all finite-length strings formed using characters from Σ.

    xy: the concatenation of two strings x and y.

    Prefix: a string w is a prefix of a string x if x = wy for some string y ∈ Σ*.

    Suffix: a string w is a suffix of a string x if x = yw for some string y ∈ Σ*.
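
    The prefix and suffix definitions map directly onto Java's String.startsWith and String.endsWith. The sketch below is only an illustration; the class and method names are hypothetical.

```java
// Illustrative sketch of the prefix and suffix definitions.
// Class and method names are hypothetical, not from the slides.
final class PrefixSuffixSketch {
    /** True if x = w + y for some (possibly empty) string y. */
    static boolean isPrefix(String w, String x) {
        return x.startsWith(w);
    }

    /** True if x = y + w for some (possibly empty) string y. */
    static boolean isSuffix(String w, String x) {
        return x.endsWith(w);
    }
}
```

    With x = "ABABCB", the string "AB" is a prefix and "CB" is a suffix. The empty string is both a prefix and a suffix of every string, because y may be x itself.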

    Finite Automata (cont'd)

    Finite Automata (example)

    Algorithm

    The Knuth-Morris-Pratt algorithm

    1. Skip outer iteration i = 3

    2. Skip the first inner iteration, testing n vs. n, at outer iteration i = 4

    Strategy

    In general, if there is a partial match of j chars starting at i, then we know what is in positions T[i], ..., T[i+j-1]. So we can save work in two ways:

    1. Skip outer iterations (for which no match is possible).

    2. Skip inner iterations (when there is no need to test known matches).

    When a mismatch occurs, we want to slide P forward, but maintain the longest overlap of a prefix of P with a suffix of the part of the text that has matched the pattern so far.

    The KMP algorithm achieves linear-time performance by capitalizing on the observation above, building a simplified finite automaton: each node has only two links, success and fail.
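
    The "longest overlap" idea can be made concrete with a brute-force sketch. Because the matched part of the text equals a prefix of P, the overlap depends on the pattern alone. The code below is an illustration under that observation, with hypothetical names (OverlapSketch, longestOverlap, slide). It is deliberately naive and is not how KMP computes the value.

```java
// Brute-force illustration of the overlap that KMP preserves when sliding.
// All names here are hypothetical; KMP itself computes this more cleverly.
final class OverlapSketch {
    /**
     * Length of the longest proper prefix of matched that is also a suffix
     * of matched, where matched is the part of P that has matched so far.
     */
    static int longestOverlap(String matched) {
        for (int len = matched.length() - 1; len > 0; len--) {
            String suffix = matched.substring(matched.length() - len);
            if (matched.startsWith(suffix)) {
                return len;
            }
        }
        return 0;
    }

    /** How far the pattern can slide forward while keeping that overlap. */
    static int slide(String matched) {
        return matched.length() - longestOverlap(matched);
    }
}
```

    For P = ABABCB, a mismatch after matching "ABAB" leaves an overlap of 2 ("AB"), so the pattern slides forward by 2 and the two overlapping characters need not be tested again.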

    Sliding the pattern for the KMP algorithm

    The Knuth-Morris-Pratt Flowchart

    Character labels are inside the nodes

    Each node has two arrows out to other nodes: a success link and a fail link

    The next character is read only after a success link

    A special node, node 0, called "get next char", reads in the next text character

    e.g. P = ABABCB

    Construction of the KMP Flowchart

    Definition: Fail links

    We define fail[k] as the largest r (with r

    Algorithm: KMP flowchart construction

    Input: P, a string of characters; m, the length of P.

    Output: fail, the array of failure links, defined for indexes 1,...,m. The array is passed in and the algorithm fills it.

    Step:

    void kmpSetup(char[] P, int m, int[] fail) {
        int k, s;
        fail[1] = 0;
        for (k = 2; k <= m; k++) {
            s = fail[k-1];
            while (s >= 1) {
                if (P[s] == P[k-1])
                    break;
                s = fail[s];
            }
            fail[k] = s + 1;
        }
    }
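
    The construction can be tried out with the following Java sketch. It keeps the slides' 1-based convention by leaving fail[0] unused and reading the k-th pattern character as pattern.charAt(k - 1). The class name and the choice to return the array instead of filling a passed-in one are assumptions made for this example.

```java
// Sketch of the fail-link construction, keeping 1-based indexes.
// Assumptions: the class name is hypothetical, and the array is returned
// rather than passed in; fail[0] is unused padding.
final class KmpSetupSketch {
    static int[] kmpSetup(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m + 1];
        if (m == 0) {
            return fail; // the slides assume a nonempty pattern
        }
        fail[1] = 0;
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1) {
                // Compare the s-th and (k-1)-th pattern characters (1-based).
                if (pattern.charAt(s - 1) == pattern.charAt(k - 2)) {
                    break;
                }
                s = fail[s];
            }
            fail[k] = s + 1;
        }
        return fail;
    }
}
```

    For the example pattern P = ABABCB this yields fail[1..6] = 0, 1, 1, 2, 3, 1.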

    The Knuth-Morris-Pratt Scan Algorithm

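    int kmpScan(char[] P, char[] T, int m, int[] fail) {
        int match, j, k; // j: index in T; k: index in P (both 1-based)
        match = -1;
        j = 1; k = 1;
        while (endText(T, j) == false) {
            if (k > m) {
                match = j - m; // match found
                break;
            }
            if (k == 0) {
                j++; k = 1; // "get next char" node
            } else if (T[j] == P[k]) {
                j++; k++; // follow success arrow
            } else {
                // Follow fail arrow.
                k = fail[k];
            }
            // Continue loop.
        }
        return match;
    }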
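
    A self-contained Java sketch of the scan follows. It is an illustration under stated assumptions: the names are hypothetical, the pattern is nonempty as the slides assume, 1-based j and k are kept internally while the result is converted to a 0-based index, and a completed match is tested before the end of the text so that a match ending on the last text character is found.

```java
// Sketch of the KMP scan with its fail-link setup (hypothetical class name).
// j and k are 1-based as in the slides; the returned index is 0-based.
final class KmpScanSketch {
    static int[] kmpSetup(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m + 1]; // fail[0] unused; fail[1] is already 0
        for (int k = 2; k <= m; k++) {
            int s = fail[k - 1];
            while (s >= 1 && pattern.charAt(s - 1) != pattern.charAt(k - 2)) {
                s = fail[s];
            }
            fail[k] = s + 1;
        }
        return fail;
    }

    /** Returns the 0-based index of the first occurrence, or -1 if none. */
    static int kmpScan(String pattern, String text) {
        int m = pattern.length();
        int n = text.length();
        int[] fail = kmpSetup(pattern);
        int j = 1; // 1-based index of the current text character
        int k = 1; // 1-based index of the current pattern character
        while (true) {
            if (k > m) {
                return j - m - 1;          // match found; convert to 0-based
            }
            if (j > n) {
                return -1;                 // end of text
            }
            if (k == 0) {
                j++;                       // "get next char" node
                k = 1;
            } else if (text.charAt(j - 1) == pattern.charAt(k - 1)) {
                j++;                       // success link
                k++;
            } else {
                k = fail[k];               // fail link; j never moves backward
            }
        }
    }
}
```

    Unlike the straightforward scan, j only ever increases, which is why the text can be read as a stream.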

    Analysis

    KMP flowchart construction requires 2m - 3 character comparisons in the worst case

    The scan algorithm requires 2n character comparisons in the worst case

    Overall: worst-case complexity is Θ(n+m)
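
    One way to see the 2n bound for the scan, sketched here as a supporting argument rather than quoted from the slides, is to track the quantity 2j - k. It relies on fail[k] < k, which follows from the construction (fail[k] = s + 1 with s at most fail[k-1]).

```latex
\begin{align*}
\Phi &= 2j - k, \qquad \Phi_{\text{start}} = 2\cdot 1 - 1 = 1 \\
\text{success } (j \leftarrow j+1,\ k \leftarrow k+1) &:\quad \Delta\Phi = +1 \\
\text{failure } (k \leftarrow \mathit{fail}[k] < k) &:\quad \Delta\Phi \ge +1 \\
\text{restart at } k = 0\ (j \leftarrow j+1,\ k \leftarrow 1) &:\quad \Delta\Phi = +1 \\
\text{a comparison requires } j \le n,\ k \ge 1 &\ \Rightarrow\ \Phi \le 2n - 1 \\
\#\text{comparisons} &\le 2n - 1 \le 2n
\end{align*}
```

    Every comparison strictly increases the quantity, and it can take at most 2n - 1 distinct values while comparisons are still possible, so the scan is linear in n.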

    The Boyer-Moore Algorithm

    Algorithm: Computing Jumps for the Boyer-Moore Algorithm

    Input: Pattern string P; m, the length of P; alphabet size alpha = |Σ|.

    Output: Array charJump, defined on indexes 0,...,alpha-1. The array is passed in and the algorithm fills it.

    void computeJumps(char[] P, int m, int alpha, int[] charJump)

    char ch; int k;

    for (ch=0;ch
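
    The listing above is cut off, so the following Java sketch does not attempt to restore it. It assumes the commonly described last-occurrence rule for Boyer-Moore character jumps: a character that does not occur in P gets a jump of m, and a character that does occur gets the distance from its last occurrence to the end of P. The class name, the 0-based indexing, and returning the array are all assumptions of this example.

```java
// Sketch of a charJump table under the assumed last-occurrence rule.
// Not a reconstruction of the truncated slide listing; names are hypothetical.
final class CharJumpSketch {
    /** Assumes every pattern character has a code below alpha. */
    static int[] computeJumps(String pattern, int alpha) {
        int m = pattern.length();
        int[] charJump = new int[alpha];
        // Characters absent from the pattern allow a jump of the full length m.
        for (int ch = 0; ch < alpha; ch++) {
            charJump[ch] = m;
        }
        // Later occurrences overwrite earlier ones, so the last one wins.
        for (int k = 0; k < m; k++) {
            charJump[pattern.charAt(k)] = m - 1 - k;
        }
        return charJump;
    }
}
```

    For P = ABABCB and a 128-character alphabet, this gives a jump of 3 for A, 0 for B, 1 for C, and 6 for every other character.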

    Computing matchJump

    Computing matchJump (example)

    BoyerMooreScan Algorithm

    Summary

    Straightforward algorithm: O(nm)

    Finite-automata algorithm: O(n)

    KMP algorithm: O(n+m)

      Relatively easy to implement

      Does not require random access to the text

    BM algorithm: O(n+m) worst case, sublinear on average

      Fewer character comparisons

      The algorithm of choice in practice for string matching