Algorithms and Data Structures

Transcript of Algorithms and Data Structures

Page 1: Algorithms and Data Structures

Algorithms and Data Structures

Page 2: Algorithms and Data Structures

/course/eleg67701-f/Topic-1b 2

Outline

Data Structures
Space Complexity
Case Study: string matching
  Array implementation (e.g. KMP algorithm)
  Tree implementation (e.g. suffix tree)

Page 3: Algorithms and Data Structures

Algorithm in action: data structure transformation

Input data structure → Algorithm (with an intermediate data structure) → Output data structure

Page 4: Algorithms and Data Structures

Basic Data Structures

Scalar or “atomic” data structures
  Building blocks for other data structures
  Cannot be divided into sub-elements
  Integer, floating-point, character, access (pointer) types

Composite data structures
  Arrays, records

Data abstraction
  Abstract Data Types: a collection of data values together with a set of well-specified operations on that data, e.g. list, stack, queue, tree.

Page 5: Algorithms and Data Structures

Scalar Data Structure

Conceptual View: a variable name (var1) bound to a value.

Physical Layout in the Computer Memory: consecutive memory addresses (0238 to 0245), one of which holds the value of var1.

Assignment operation:

var1 := value;

var2 := var1;

var1 := var3;

Page 6: Algorithms and Data Structures

Composite Data Structure: Array

Conceptual View

Variable name: A, array A[0..4]

Index:  0   1   2   3   4
A:      v1  v2  v3  v4  v5

Accessing array elements:

A[0] := 5

k := 1

A[k] := 11

A[k+1] := A[k] + 3

Physical Layout in the Computer Memory: the elements v1 to v5 occupy consecutive memory addresses (within the range 0238 to 0245).
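The access operations on this slide can be tried directly. The following is a minimal sketch in Python, assuming a 0-indexed list of five slots stands in for the array A; the placeholder value `None` for unassigned slots is an assumption of the sketch, not part of the slide.

```python
# A five-element array; None marks slots that have not been assigned yet
# (an assumption of this sketch).
A = [None] * 5

A[0] = 5             # A[0] := 5
k = 1                # k := 1
A[k] = 11            # A[k] := 11
A[k + 1] = A[k] + 3  # A[k+1] := A[k] + 3

print(A)
```

Because the elements sit at consecutive addresses, each access A[k] costs the same regardless of k.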

Page 7: Algorithms and Data Structures

Data Abstraction: Tree

Conceptual View: a tree T whose nodes hold the values v1 (root), v2, v3 and v4.

Accessing the elements:

T.value := 12

T.left := new(T)

T.right := new(T)

Physical Layout in the Computer Memory: T refers to the root node at address 0238; each node stores a value (v1, v2, v3, ...) and two pointer fields that hold either the address of a child node (e.g. 0241, 0244) or nil.
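The three access operations can be sketched as follows. This is a minimal illustration in Python, assuming a hypothetical `Node` class with `value`, `left` and `right` fields; `None` plays the role of nil, and object references play the role of the memory addresses in the physical layout.

```python
class Node:
    """A binary tree node: one value plus two child references (None = nil)."""

    def __init__(self, value=None):
        self.value = value
        self.left = None
        self.right = None


T = Node()
T.value = 12      # T.value := 12
T.left = Node()   # T.left := new(T)
T.right = Node()  # T.right := new(T)

print(T.value, T.left.left, T.right.right)
```

Unlike the array, the nodes need not be contiguous in memory: each one is reached by following a reference from its parent.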

Page 8: Algorithms and Data Structures

Space Analysis

Storage space, like time, is another limited resource that is important to programmers

Space requirements are also expressed as a function of the input size

Space functions are classified in the same manner as running times

Page 9: Algorithms and Data Structures

Complexity Analysis: Sorting

Algorithm        Time complexity             Space complexity
Insertion sort   O(n^2)                      O(n)
Quicksort        O(n log n) (average case)   O(n)
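To make the O(n^2) entry concrete, here is a small sketch of insertion sort in Python that also counts element comparisons. The function name `insertion_sort` and the comparison counter are assumptions added for illustration.

```python
def insertion_sort(values):
    """Return a sorted copy of values and the number of comparisons made."""
    a = list(values)          # O(n) space: the copy being sorted
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift larger elements one slot to the right until key fits.
        while j >= 0:
            comparisons += 1
            if a[j] <= key:
                break
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a, comparisons


print(insertion_sort([5, 4, 3, 2, 1]))  # reversed input: worst case
print(insertion_sort([1, 2, 3, 4, 5]))  # sorted input: best case
```

On reversed input of length n the count is n(n-1)/2, which is where the quadratic bound comes from; on already sorted input it is only n-1.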

Page 10: Algorithms and Data Structures

Space-Time Tradeoff

Reductions in running time are often possible if we increase storage requirements.

Decreasing the amount of storage used by an algorithm usually results in longer running times.

Using an array to look up previously computed values can drastically increase the speed of a function.
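As an illustration of the last point, the sketch below computes Fibonacci numbers twice: once by plain recursion and once with an array of previously computed values. Fibonacci is only a convenient stand-in chosen for this sketch; the function names are hypothetical.

```python
def fib_slow(n):
    """Plain recursion: no extra table, but exponentially many calls."""
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)


def fib_fast(n):
    """Uses an O(n) table of earlier results to run in O(n) time."""
    table = [0] * (n + 2)   # table[k] will hold the k-th Fibonacci number
    table[1] = 1
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]  # look up, do not recompute
    return table[n]


print(fib_slow(10), fib_fast(10), fib_fast(30))
```

The table costs O(n) extra storage, and in exchange every value is computed once instead of many times.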

Page 11: Algorithms and Data Structures

Case Study: Searching for Patterns

Problem: find the first occurrence of pattern P of length m inside the text S of length n.

String matching problem

Page 12: Algorithms and Data Structures

String Matching - Applications

Text editing
Term rewriting
Lexical analysis
Information retrieval
And bioinformatics

Page 13: Algorithms and Data Structures

Model for Pattern-Matching Problem

Pattern P → Pattern matcher generator → Pattern matcher

Input string S → Pattern matcher → Yes / No

Page 14: Algorithms and Data Structures

Array Implementation

Text S represented as an array of characters: S [1..n]

Pattern P represented as an array of characters: P [1..m]

S: a g c a g a a g a g t a

[Figure: the pattern P is aligned under S and slid to the right one position at a time.]

Time complexity = O(m·n)

Space complexity = O(m + n)
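The O(m·n) bound corresponds to trying every alignment of P under S and comparing up to m characters at each one. A minimal sketch in Python follows; the function name `naive_find` is hypothetical, and because the slide's pattern did not survive extraction, the pattern "gag" is an assumption chosen only for illustration. Indices here are 0-based, unlike the S[1..n] convention above.

```python
def naive_find(pattern, text):
    """Return the 0-based index of the first occurrence of pattern, or -1."""
    m, n = len(pattern), len(text)
    for start in range(n - m + 1):       # every alignment of P under S
        k = 0
        while k < m and text[start + k] == pattern[k]:
            k += 1                       # up to m comparisons per alignment
        if k == m:
            return start
    return -1


S = "agcagaagagta"  # the text shown on the slide
P = "gag"           # hypothetical pattern (assumption of this sketch)
print(naive_find(P, S))
```

After each mismatch the comparison restarts from scratch one position further along, which is exactly the work the next slides try to avoid.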

Page 15: Algorithms and Data Structures

Can we be more clever?

When a mismatch is detected, say at position k in the pattern string, we have already successfully matched k-1 characters.

We try to take advantage of this to decide where to restart matching.

S: a g c a g a a g a g t a

[Figure: the pattern P aligned under S at successive restart positions.]

Page 16: Algorithms and Data Structures

Problem of Matching Keyword

PROBLEM. Given a pattern p consisting of a single keyword and an input string s, answer “yes” if p occurs as a substring of s, that is, if s = xpy for some x and y; “no” otherwise.

For convenience, we will assume p = p_1 p_2 … p_m and s = s_1 s_2 … s_n, where p_i represents the ith character of the pattern and s_j the jth character of the input string.

Page 17: Algorithms and Data Structures

The Knuth-Morris-Pratt Algorithm

Observation: when a mismatch occurs, we may not need to restart the comparison all the way back (from the next input position).

What to do:

Construct a table h, called the next function, that determines how many characters to slide the pattern to the right in case of a mismatch during the pattern-matching process.

Knuth, D. E., Morris, J. H. and Pratt, V. R., Fast Pattern Matching in Strings, SIAM J. Comput., 6(2), 1977, 323-350

Page 18: Algorithms and Data Structures

The key idea is that if we have successfully matched the prefix p_1 p_2 … p_{i-1} of the keyword with the substring s_{j-i+1} s_{j-i+2} … s_{j-1} of the input string and p_i ≠ s_j, then we do not need to reprocess any of the suffix s_{j-i+1} s_{j-i+2} … s_{j-1}, since we know this portion of the text string is the prefix of the keyword that we have just matched.

Page 19: Algorithms and Data Structures

Note that the inner while loop iterates as long as p_i and s_j do not match each other. Once they match (or i drops to 0), the inner while loop terminates, both i and j advance by one, and the loop repeats ...
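The code this note refers to did not survive extraction. The following Python sketch is one way to write a matching loop that fits the description: an inner while loop that follows the next function h while p_i ≠ s_j, after which i and j both advance. The function name `kmp_contains` is hypothetical, the strings are padded so that indices are 1-based as in the slides, and the table h is passed in as a 1-based list; here it is simply copied from the example slide for the pattern abaababaabaab.

```python
def kmp_contains(pattern, text, h):
    """Answer whether pattern occurs in text, given its next function h.

    h is 1-based: h[i] is the next-function value for position i; h[0] is unused.
    """
    m, n = len(pattern), len(text)
    p = " " + pattern   # pad so that p[1] is the first character
    s = " " + text
    i = j = 1
    while i <= m and j <= n:
        # Inner loop: slide the pattern while p_i and s_j disagree.
        while i > 0 and p[i] != s[j]:
            i = h[i]
        i += 1          # both indices advance; j never moves backwards
        j += 1
    return i > m        # the whole pattern was matched


# Next function copied from the example slide (index 0 is padding).
h = [0, 0, 1, 0, 2, 1, 0, 4, 0, 2, 1, 0, 7, 1]
print(kmp_contains("abaababaabaab", "abaababaabacabaababaabaab", h))
```

On this input the first mismatch occurs at i = 12, j = 12, after which i takes the values 7, 4, 2, 1, 0 while j stays at 12.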

Page 20: Algorithms and Data Structures

An Important Property of the Next Function in KMP Algorithm

h_i is the largest k less than i such that p_1 p_2 … p_{k-1} is a suffix of p_1 p_2 … p_{i-1} (i.e., p_1 … p_{k-1} = p_{i-k+1} … p_{i-1}) and p_i ≠ p_k. If there is no such k, then h_i = 0.
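The definition can be checked by brute force. The sketch below, in Python, evaluates it literally for every position i by trying k = i-1, i-2, ..., 1. It is deliberately slow and is not the efficient construction; the function name `next_from_definition` is hypothetical, and the pattern is the one used in the example slides.

```python
def next_from_definition(pattern):
    """Compute the next function directly from its definition (1-based list)."""
    m = len(pattern)
    p = " " + pattern                  # pad so that p[1] is the first character
    h = [0] * (m + 1)                  # h[0] is unused
    for i in range(1, m + 1):
        for k in range(i - 1, 0, -1):  # largest k first
            prefix = p[1:k]            # p_1 ... p_{k-1}
            suffix = p[i - k + 1:i]    # p_{i-k+1} ... p_{i-1}
            if prefix == suffix and p[i] != p[k]:
                h[i] = k
                break                  # if no k qualifies, h[i] stays 0
    return h


print(next_from_definition("abaababaabaab")[1:])
```

For example, h_12 = 7 because p_1 … p_6 = abaaba is a suffix of p_1 … p_11 and p_12 = a differs from p_7 = b.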

Page 21: Algorithms and Data Structures

Backtrack or Not Backtrack ?

Assume P(i) ≠ S(j) for some i and j. What should we do?

The KMP algorithm chooses not to backtrack on the text S (index j), for a good reason.

The choice is how to shift the pattern P (index i), i.e. by how much.

If for each j the shift of P is a small constant, then the total time complexity is clearly linear in n.

Page 22: Algorithms and Data Structures

An Example

Given:

Position:       1 2 3 4 5 6 7 8 9 10 11 12 13
Pattern:        a b a a b a b a a b  a  a  b
Next function:  0 1 0 2 1 0 4 0 2 1  0  7  1

Input string: abaababaabacabaababaabaab

Scenario 1: i = 12, j = 12 (mismatch: p_12 = a, s_12 = c)

Scenario 2: what is h_i = h_12? h_12 = 7, so i = 7 (j stays at 12)

Page 23: Algorithms and Data Structures

An Example (cont'd)

Scenario 3: h_7 = 4, so i = 4 (j is unchanged)

Subsequently i = 2, 1, 0

Finally, a match is found: the pattern abaababaabaab occurs at positions 13 to 25 of the input string abaababaabacabaababaabaab.

Page 24: Algorithms and Data Structures

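Question: when P(i) ≠ S(j), how much should we shift?

Observations:
We should shift P to the right.
But by how much?
One answer is: do not backtrack on S (i.e. keep j where it is).

[Figure: pattern P (index i, starting at i = 1, current character P_i) aligned against input S (index j, starting at j = 1, current character S_j).]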

Page 25: Algorithms and Data Structures

Observation: Never backtrack on the input string S.

Page 26: Algorithms and Data Structures

How to Compute the Next Function?

[Code figure not preserved; the assignments it contains are h_i := h_j, h_i := j and j := h_j.]

Page 27: Algorithms and Data Structures

How to Compute the Next Function?

Note: once p_i does not match p_j, j is the index we are looking for: the prefix p_1 … p_{j-1} matches the suffix of p_1 … p_{i-1} that ends just before position i, so h_i := j.
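Only the three assignments h_i := h_j, h_i := j and j := h_j survive from the slide's code. The Python sketch below is a reconstruction consistent with those fragments and with the note above; it should not be read as the slide's exact code. The function name `compute_next` is hypothetical, and the pattern is padded so that indices are 1-based.

```python
def compute_next(pattern):
    """Build the next function h for pattern as a 1-based list (h[0] unused)."""
    m = len(pattern)
    p = " " + pattern          # pad so that p[1] is the first character
    h = [0] * (m + 1)          # h_1 = 0
    i, j = 1, 0
    while i < m:
        # Match the pattern against itself, sliding with j := h_j on mismatch.
        while j > 0 and p[i] != p[j]:
            j = h[j]
        i += 1
        j += 1
        if p[i] == p[j]:
            h[i] = h[j]        # a shift to j would fail again, so reuse h_j
        else:
            h[i] = j           # p_i != p_j: j is the index we are looking for
    return h


print(compute_next("abaababaabaab")[1:])
```

Each pass of the outer loop raises j by one, and every execution of j = h[j] lowers it, so the inner assignment runs at most m times in total and the construction takes O(m) time.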

Page 28: Algorithms and Data Structures

Interpretation of the Next Function

Interpretation

Question: how to compute the next function?

Position:       1 2 3 4 5 6 7 8 9
Pattern:        a b a a b a b a a
Next function:  0 1 0 2 1 0 4 0 2

Note: P2 = P5, P4 = P9

Page 29: Algorithms and Data Structures

Interpretation of the Next Function

Interpretation

Question: how to compute the next function?

Position:       1 2 3 4 5 6 7 8 9
Pattern:        a b a a b a b a a
Next function:  0 1 0 2 1 0 4 0 2

Note: P1 ≠ P5, P4 = P9


Page 31: Algorithms and Data Structures

KMP - Analysis

The KMP algorithm never needs to backtrack on the text string.

Time complexity = O(m + n): O(m) for preprocessing plus O(n) for searching

Space complexity = O(m + n)

Page 32: Algorithms and Data Structures

KMP Algorithm Complexity Analysis Hints

What is the cost of building the next function? (hint: in the code for the next function, the operation j := h_j in the inner loop is never executed more often than the statement i := i+1 in the outer loop)

What is the cost of the matching itself? (hint: similar to the above)

Page 33: Algorithms and Data Structures

Other String Matching Algorithms

The Boyer-Moore Algorithm [Boyer, R. S. and Moore, J. S., A Fast String Searching Algorithm, CACM, 20(10), 1977, 762-772]

The Karp-Rabin Algorithm [Karp, R. M. and Rabin, M. O., Efficient Randomized Pattern-Matching Algorithms, IBM J. of Res. and Develop., 31(2), 1987, 249-260].

Page 34: Algorithms and Data Structures

Matching of a Set of Keywords?

Given a pattern consisting of a set of keywords and an input string S, answer “yes” if some keyword occurs as a substring of S, and “no” otherwise.

How to solve this?

Page 35: Algorithms and Data Structures

How about repeatedly applying KMP?

What time complexity will the KMP algorithm have when matching k patterns?

- Preprocessing each of the k patterns: assuming each pattern has length O(m), this takes O(km) time.
- Searching takes O(n) time per pattern.

So, total time = k · O(m + n) = O(k(m + n)).

Page 36: Algorithms and Data Structures

Question: Can we improve the time complexity when k is large?

Answer:

Yes, preprocessing the input string – tree implementation.

Page 37: Algorithms and Data Structures

Model for Pattern-Matching Problem

Pattern P → Pattern matcher generator → Pattern matcher

Input string S → Preprocessing → Pattern matcher → Yes / No

Page 38: Algorithms and Data Structures

Tree Implementation -- suffix tree

Instead of preprocessing the pattern P, preprocess the text T!

Use a tree structure where all suffixes of the text are represented;

Search for the pattern by looking for substrings of the text;

You can easily test whether P is a substring of T because any substring of T is the prefix of some suffix.
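The last observation can be demonstrated with a much simpler structure than a real suffix tree. The Python sketch below builds an uncompressed suffix trie out of nested dictionaries, which takes quadratic time and space in the length of the text; it is not one of the linear-time suffix tree constructions, only a stand-in that shows the lookup idea. The function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (quadratic size)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})  # follow or create the edge for ch
    return root


def occurs_in(trie, pattern):
    """P is a substring of T iff P is a prefix of some suffix of T."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("xabxac")          # preprocess the text once
for keyword in ["bxa", "xac", "xb"]:        # then query any number of keywords
    print(keyword, occurs_in(trie, keyword))
```

Once the text has been preprocessed, each keyword of length m is answered in O(m) steps, independent of the text length, which is why this approach pays off when the number of keywords is large.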

Page 39: Algorithms and Data Structures

Suffix Tree (cont'd)

[Figure: suffix tree for string xabxac, with numbered leaves. The node labels u and w on the two interior nodes will be used.]

A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i…m].

Page 40: Algorithms and Data Structures

Note on Suffix Tree

Not all strings are guaranteed to have a corresponding suffix tree.

For example:

Consider xabxa: it does not have a suffix tree, because the suffix xa is also a prefix of the string (i.e. the path spelling out xa does not necessarily end at a leaf).

How to fix the problem: add $, a special “termination” character, to the alphabet and append it to the string.
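The condition behind this note can be tested mechanically: a string has no suffix tree in the sense defined earlier when one of its suffixes is a prefix of another suffix. The Python sketch below checks exactly that, by brute force; the function name is hypothetical.

```python
def some_suffix_is_prefix_of_another(s):
    """True if a suffix of s is a proper prefix of another suffix of s."""
    suffixes = [s[i:] for i in range(len(s))]
    return any(a != b and b.startswith(a) for a in suffixes for b in suffixes)


for s in ["xabxac", "xabxa", "xabxa$"]:
    print(s, some_suffix_is_prefix_of_another(s))
```

For xabxa the check succeeds (xa is a prefix of xabxa), so the path for xa would end inside the tree. With the terminator appended, no suffix of xabxa$ is a prefix of another, so every suffix ends at its own leaf.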

Page 41: Algorithms and Data Structures

Algorithm for Constructing a Suffix Tree

A suffix tree can be constructed in linear time

[Weiner73, McCreight76, Ukkonen95]

Page 42: Algorithms and Data Structures

Suffix Tree

Time complexity = O(n + m): O(n) for preprocessing plus O(m) for searching

Space complexity = O(m + n)

Page 43: Algorithms and Data Structures

Question

How can a suffix tree help solve the string matching problem?

Page 44: Algorithms and Data Structures

Other Tree based Methods

The suffix tree is not the only one ...