[PPT]Analysis of Algorithms - Utah State...


Pattern Matching

[Figure: the pattern abacab is compared with the text abacaab at successive shifts; the numbers mark the order of the character comparisons.]

Strings


A string is a sequence of characters

Examples of strings: a Java program, an HTML document, a DNA sequence, a digitized image

An alphabet is the set of possible characters for a family of strings

Examples of alphabets: ASCII, Unicode, {0, 1}, {A, C, G, T}

Let P be a string of size m. A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j.

A prefix of P is a substring of the form P[0 .. i].

A suffix of P is a substring of the form P[i .. m-1].

Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P

Applications: text editors, search engines, biological research
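As a quick, hedged illustration of these definitions (the string and variable names here are mine, not from the slides), Python slicing maps onto them directly; note that P[i .. j] above includes position j, so it corresponds to P[i:j+1] in Python:

    P = "abacab"        # a string of size m = 6
    m = len(P)

    substring = P[1:4]  # P[1 .. 3] = "bac": characters with ranks 1 through 3
    prefix = P[:3]      # P[0 .. 2] = "aba": a prefix of P
    suffix = P[2:m]     # P[2 .. m-1] = "acab": a suffix of P

    print(substring, prefix, suffix)   # bac aba acab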

Brute-Force Algorithm


The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either a match is found or all placements of the pattern have been tried.

Brute-force pattern matching runs in time O(nm).

Example of a worst case: T = aaa ... ah, P = aaah. This may occur in images and DNA sequences, but is unlikely in English text.

Algorithm BruteForceMatch(T, P)
    Input: text T of size n and pattern P of size m
    Output: starting index of a substring of T equal to P, or -1 if no such substring exists
    for i ← 0 to n - m
        { test shift i of the pattern }
        j ← 0
        while j < m and T[i + j] = P[j]
            j ← j + 1
        if j = m
            return i    { match at i }
    return -1    { no match anywhere }
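A direct Python transcription of the pseudocode above might look like the following sketch (the function and variable names are mine, not from the slides):

    def brute_force_match(T, P):
        """Return the starting index of the first substring of T equal to P, or -1."""
        n, m = len(T), len(P)
        for i in range(n - m + 1):          # test shift i of the pattern
            j = 0
            while j < m and T[i + j] == P[j]:
                j += 1
            if j == m:
                return i                    # match at i
        return -1                           # no match anywhere

    print(brute_force_match("abacaab", "abacab"))            # -1
    print(brute_force_match("abacaabaccabacab", "abacab"))   # 10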

Boyer-Moore Heuristics


The Boyer-Moore pattern matching algorithm is based on two heuristics. Start at the end: compare P with a subsequence of T moving backwards. Character-jump heuristic: when a mismatch occurs at T[i] = c, if P contains c, shift P to align the last occurrence of c in P with T[i]; else, shift P to align P[0] with T[i + 1].

Example

[Figure: Boyer-Moore matching of P = rithm against T = "a pattern matching algorithm"; the numbered comparisons 1-11 show the backward scans and the character jumps.]

Last-Occurrence Function


Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L, mapping Σ to integers, where L(c) is defined as the largest index i such that P[i] = c, or -1 if no such index exists.

Example: P = abacab, Σ = {a, b, c, d}

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters

The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ.

c       a    b    c    d
L(c)    4    5    3   -1
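A small Python sketch of the last-occurrence function; a dictionary stands in for the array indexed by the numeric character codes (names are mine). It runs in O(m + s), as stated above:

    def last_occurrence(P, alphabet):
        """L(c) = largest index i with P[i] == c, or -1 if c does not occur in P."""
        L = {c: -1 for c in alphabet}
        for i, c in enumerate(P):     # later occurrences overwrite earlier ones
            L[c] = i
        return L

    print(last_occurrence("abacab", "abcd"))   # {'a': 4, 'b': 5, 'c': 3, 'd': -1}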

The Boyer-Moore Algorithm


[Figure, Case 1 (j < 1 + l): the last occurrence of the mismatched character c = T[i] lies at or to the right of j, so aligning it would move the pattern backwards; instead i advances by m - j, shifting the pattern right by one position.]

Algorithm BoyerMooreMatch(T, P, Σ)
    L ← lastOccurrenceFunction(P, Σ)
    i ← m - 1
    j ← m - 1
    repeat
        if T[i] = P[j]
            if j = 0
                return i    { match at i }
            else
                i ← i - 1
                j ← j - 1
        else
            { character-jump }
            last ← L[T[i]]
            i ← i + m - min(j, 1 + last)
            j ← m - 1
    until i > n - 1    { beyond text length }
    return -1    { no match }

[Figure, Case 2 (1 + l ≤ j): the last occurrence of c = T[i] lies to the left of j, so i advances by m - (1 + l), aligning P[l] with T[i].]

Last is abbreviated “l” in figs

Update function?


i ← i + m - min(j, last + 1)

Why the min? If the last occurrence of the mismatched character is to the left of where you are looking, you just shift so that the characters align. That amount is m - (last + 1).

If the last occurrence is to the right of where you are looking, aligning them would require a NEGATIVE shift. That is not good. In that case, the whole pattern just shifts by 1. HOWEVER, the code to do that is NOT obvious. Recall that j starts out at m - 1 and then decreases; i also starts out at the end of the pattern and decreases. Thus, if you add m - j to i, you are really moving the starting point to just one higher than the starting value of i for the current pass. Try it with some numbers to see.
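Putting the pseudocode and the shift rule together, here is a hedged Python sketch of the start-at-the-end plus character-jump version described above (names are mine; the last-occurrence table is built over the characters of both strings so every T[i] has an entry):

    def boyer_moore_match(T, P):
        """Boyer-Moore with the character-jump heuristic, as in the slides."""
        n, m = len(T), len(P)
        if m == 0 or m > n:
            return 0 if m == 0 else -1
        L = {c: -1 for c in set(T) | set(P)}     # last-occurrence function
        for k, c in enumerate(P):
            L[c] = k
        i = j = m - 1
        while i <= n - 1:
            if T[i] == P[j]:
                if j == 0:
                    return i                     # match at i
                i -= 1
                j -= 1
            else:                                # character jump
                last = L[T[i]]
                i += m - min(j, last + 1)
                j = m - 1
        return -1                                # no match

    # The step from the next slide: i = 4, j = 3, last('a') = 4, m = 6 gives i = 4 + 6 - 3 = 7.
    print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))   # 10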

Example


[Figure: Boyer-Moore matching of P = abacab against T = abacaabadcabacabaabb; the numbered comparisons 1-13 show the backward scans and the character jumps, ending with a match.]

When the last occurrence of a is to the right of where we are currently looking, we just shift the pattern right by 1.

In the figure: i = 4, j = 3, last(a) = 4, so j < last(a) + 1 and min(j, last + 1) = j = 3; with m = 6, i becomes 4 + 6 - 3 = 7.

Analysis


Boyer-Moore's algorithm runs in time O(nm + |Σ|).

The |Σ| term comes from initializing the last-occurrence function. We expect |Σ| to be less than nm, but if it weren't, we add it to be safe.

Example of the worst case: T = aaa ... a, P = baaa.

The worst case may occur in images and DNA sequences but is unlikely in English text

Boyer-Moore’s algorithm is significantly faster than the brute-force algorithm on English text

[Figure: the worst case with the character-jump heuristic alone: matching P = baaaaa against T = aaaaaaaaa takes the numbered comparisons 1-24, i.e., about nm comparisons.]

The KMP Algorithm - Motivation


Knuth-Morris-Pratt's algorithm compares the pattern to the text left to right, but shifts the pattern more intelligently than the brute-force algorithm.

It takes advantage of the fact that we KNOW what we have already seen in matching the pattern.

When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?

Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]

[Figure: a mismatch at text character x while matching P = abaaba; the pattern is shifted so that its prefix ab lines up with the already-matched suffix ab, and comparison resumes by setting j to 2. There is no need to repeat the earlier comparisons: at this point the prefix of the pattern matches the suffix of the PARTIAL match.]

KMP Failure Function


Knuth-Morris-Pratt’s algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself

The failure function F(j) is defined as the size of the largest prefix of the pattern that is also a suffix of the partial pattern P[1..j]

Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i], we set j ← F(j - 1), which resets where we are in the pattern.

Notice we don’t reset i, but just continue from this point on.

j       0   1   2   3   4   5
P[j]    a   b   a   a   b   a
F(j)    0   0   1   1   2   3

[Figure: the same mismatch as on the previous slide; the amount of work saved is given by F(j - 1), the failure function value of the position before the mismatch.]

The KMP Algorithm


At each iteration of the while-loop, either i increases by one, or it doesn't and the shift amount i - j increases by at least one (observe that F(j - 1) < j). One might worry about being stuck in the branch where i doesn't increase.

Amortized analysis: while sometimes we just shift the pattern without moving i, we can't do that forever, since we must have moved forward in i and j before we can shift the pattern again.

Hence, there are no more than 2n iterations of the while-loop.

Thus, KMP's algorithm runs in optimal time O(n), plus the cost of computing the failure function.

Algorithm KMPMatch(T, P)
    F ← failureFunction(P)
    i ← 0
    j ← 0
    while i < n
        if T[i] = P[j]
            if j = m - 1
                return i - j    { match }
            else    { keep going }
                i ← i + 1
                j ← j + 1
        else if j > 0
            j ← F[j - 1]
        else
            { at the first position we can't use the failure function }
            i ← i + 1
    return -1    { no match }
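A Python sketch of KMPMatch (names are mine); it relies on a failure_function helper, sketched after the pseudocode on the next slide:

    def kmp_match(T, P):
        """Return the starting index of the first occurrence of P in T, or -1."""
        n, m = len(T), len(P)
        F = failure_function(P)      # see the sketch on the next slide
        i = j = 0
        while i < n:
            if T[i] == P[j]:
                if j == m - 1:
                    return i - j     # match
                i += 1               # keep going
                j += 1
            elif j > 0:
                j = F[j - 1]         # reuse what we already matched
            else:
                i += 1               # at the first position: can't use F
        return -1                    # no match

    print(kmp_match("abacaabaccabacab", "abacab"))   # 10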

Computing the Failure Function


The failure function can be represented by an array and can be computed in O(m) time

The construction is similar to the KMP algorithm itself

At each iteration of the while-loop, either i increases by one, or the shift amount i - j increases by at least one (observe that F(j - 1) < j).

Hence, there are no more than 2m iterations of the while-loop.

So the total complexity of KMP is O(m+n)

Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
        if P[i] = P[j]
            { we have matched j + 1 characters }
            F[i] ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0 then
            { use the failure function to shift P }
            j ← F[j - 1]
        else
            F[i] ← 0    { no match }
            i ← i + 1
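And the matching Python sketch of failureFunction (this is the failure_function that the KMP sketch above calls); together the two run in O(m + n), matching the bound above:

    def failure_function(P):
        """F[i] = size of the longest prefix of P that is also a suffix of P[1..i]."""
        m = len(P)
        F = [0] * m
        i, j = 1, 0
        while i < m:
            if P[i] == P[j]:
                F[i] = j + 1         # we have matched j + 1 characters
                i += 1
                j += 1
            elif j > 0:
                j = F[j - 1]         # use the failure function to shift P
            else:
                F[i] = 0             # no match
                i += 1
        return F

    print(failure_function("abaaba"))   # [0, 0, 1, 1, 2, 3]
    print(failure_function("abacab"))   # [0, 0, 1, 0, 1, 2]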

Example


[Figure: a KMP run of P = abacab over a sample text; the numbered comparisons 1-19 show that after each mismatch only j is reset (using F of the previous location) while i keeps moving forward.]

j       0   1   2   3   4   5
P[j]    a   b   a   c   a   b
F(j)    0   0   1   0   1   2

Note, we start comparing from the left.

Fail at 6, reset j to 1 (F of prev loc)

Fail at 7, reset j to 0 (F of prev loc)

Fail at 12, reset j to 0 (F of prev loc)

Fail at 13, reset j to 0

Binary Failure Function

Your programming assignment (due 10/29) extends the idea of KMP string matching.

If the input is binary in nature (only two symbols are used, such as x/y or 0/1), then when you fail to match an x, you know you are looking at a y.

Normally, when you fail, you ask, "How much of the PREVIOUS pattern matches?" and then check the current location again.

With binary input, you can ask, "How much of the pattern can be shifted on top of the string seen so far, including the real value of the character I was trying to match?"


Example


Text String:  x x y x x y x x y x x y y y y x x x x x x
Pattern:      x x y x x y x y    (aligned at text position 0)
Compare:      0 0 0 0 0 0 0 ^    (mismatch at pattern position 7)
Shift:        x x y x x y x y    (now aligned at text position 3; resume at pattern position 5)
Compare:      0 0 ^
Shift:        x x y x x y x y    (now aligned at text position 6; resume at pattern position 5)
Compare:      0 ^
Shift:        x x y x x y x y    (now aligned at text position 13; resume at pattern position 0)
Compare:      ^

Can you find the binary failure function (given the regular failure function) for the TWO pattern strings below?


i    P[i]  Bfail  F
0    x            0
1    x            1
2    y            0
3    x            1
4    x            2
5    y            3
6    x            4
7    y            0
8    x            1
9    y            0
10   x            1
11   y            0

i    P[i]  Bfail  F
0    a            0
1    a            1
2    b            0
3    a            1
4    a            2
5    b            3
6    a            4
7    a            5
8    a            2
9    b            3

Can you find the binary failure function (given the regular failure function)?


i    P[i]  Bfail  F
0    x     0      0
1    x     0      1
2    y     2      0
3    x     0      1
4    x     0      2
5    y     2      3
6    x     0      4
7    y     5      0
8    x     0      1
9    y     2      0
10   x     0      1
11   y     2      0

i    P[i]  Bfail  F
0    a     0      0
1    a     0      1
2    b     2      0
3    a     0      1
4    a     0      2
5    b     2      3
6    a     0      4
7    a     0      5
8    a     6      2
9    b     2      3
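The slides do not spell out a formula for Bfail, but one reading that reproduces both answer tables above is this (an assumption on my part): since the alphabet is binary, a mismatch at position i tells you the text character was the complement of P[i], so Bfail(i) can be taken as the length of the longest prefix of P that is a suffix of P[0..i-1] followed by that complemented character. A brute-force Python sketch of that reading (names are mine):

    def binary_failure(P):
        """Bfail[i] under the assumed reading: longest prefix of P that is a suffix
        of P[0..i-1] + flip(P[i]), where flip swaps the two symbols of the alphabet."""
        a, b = sorted(set(P))                      # the two symbols, e.g. 'x' and 'y'
        flip = {a: b, b: a}
        bfail = []
        for i in range(len(P)):
            seen = P[:i] + flip[P[i]]              # what the text must have looked like
            k = len(seen)
            while k > 0 and seen[-k:] != P[:k]:    # brute-force longest-border check
                k -= 1
            bfail.append(k)
        return bfail

    print(binary_failure("xxyxxyxyxyxy"))   # [0, 0, 2, 0, 0, 2, 0, 5, 0, 2, 0, 2]
    print(binary_failure("aabaabaaab"))     # [0, 0, 2, 0, 0, 2, 0, 0, 6, 2]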

Tries


Preprocessing Strings


Preprocessing the pattern speeds up pattern matching queries: after preprocessing the pattern, KMP's algorithm performs pattern matching in time proportional to the text size.

If the text is large, immutable, and searched often (e.g., the works of Shakespeare), we may want to preprocess the text instead of the pattern.

A trie (pronounced TRY) is a compact data structure for representing a set of strings, such as all the words in a text. A trie supports pattern matching queries in time proportional to the pattern size.

Standard Trie


The standard trie for a set of strings S is an ordered tree such that: each node but the root is labeled with a character; the children of a node are alphabetically ordered; the paths from the root to the external nodes yield the strings of S.

Example: standard trie for the set of stringsS = { bear, bell, bid, bull, buy, sell, stock, stop }

[Figure: the standard trie for S; for example, the path b-e-a-r spells bear and the path s-t-o-c-k spells stock.]
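A minimal Python sketch of a standard trie supporting insert and word lookup (class and method names are mine); a dictionary per node plays the role of the alphabetically ordered children, making each per-character step expected O(1), while an array indexed by the d alphabet characters gives the O(dm) bound quoted on the next slide:

    class TrieNode:
        def __init__(self):
            self.children = {}           # character -> TrieNode
            self.is_word = False         # marks the end of an inserted string

    class Trie:
        def __init__(self, words=()):
            self.root = TrieNode()
            for w in words:
                self.insert(w)

        def insert(self, word):          # one node per character of the word
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

        def contains(self, word):        # search along a single root-to-node path
            node = self.root
            for ch in word:
                if ch not in node.children:
                    return False
                node = node.children[ch]
            return node.is_word

    S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
    t = Trie(S)
    print(t.contains("stock"), t.contains("sto"))   # True False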

Standard Trie (cont)


A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where: n = total size of the strings in S; m = size of the string parameter of the operation; d = size of the alphabet.


Word Matching with a Trie


We insert the words of the text into a trie

Each leaf stores the occurrences of the associated word in the text

[Figure: the words of the text "see a bear? sell stock! see a bull? buy stock! bid stock! bid stock! hear the bell? stop!" inserted into a trie; each leaf stores the starting indices of the occurrences of its word in the text (e.g., the leaf for see stores 0 and 24, and the leaf for stock stores 17, 40, 51, 62).]

Compressed Trie


A compressed trie has internal nodes of degree at least two

It is obtained from the standard trie by compressing chains of "redundant" nodes.

[Figure: the standard trie for S and the compressed trie obtained from it; chains of redundant nodes are merged into single edges with labels such as ar, ll, id, ell, to, ck.]

Compact Representation


Compact representation of a compressed trie for an array of strings: stores at the nodes ranges of indices instead of substrings, in order to make nodes a fixed size; uses O(s) space, where s is the number of strings in the array; serves as an auxiliary index structure.

[Figure: the array of strings S[0] = see, S[1] = bear, S[2] = sell, S[3] = stock, S[4] = bull, S[5] = buy, S[6] = bid, S[7] = hear, S[8] = bell, S[9] = stop; each node of the compact trie stores a triple (i, j, k) denoting the substring S[i][j..k], e.g., (7, 0, 3) for hear and (0, 1, 1) for the e of see.]

Suffix Trie


The compressed trie doesn't work if you don't start at the beginning of the word. Suppose you were allowed to start ANYWHERE in the word.

The suffix trie of a string X is the compressed trie of all the suffixes of X

[Figure: the suffix trie of the string minimize (indices 0-7), with edge labels such as mi, nimize, ze, e, i.]

Suffix Trie (cont) – showing as numbers


Compact representation of the suffix trie for a string X of size n from an alphabet of size d: uses O(n) space and supports arbitrary pattern matching queries in X in O(dm) time, where m is the size of the pattern.

[Figure: the same suffix trie of minimize with each edge label replaced by a pair of indices into the string, e.g., (1, 1) for i, (2, 7) for nimize, (6, 7) for ze, (0, 1) for mi.]

Encoding Trie


A code is a mapping of each character of an alphabet to a binary code-word

A prefix code is a binary code such that no code-word is the prefix of another code-word

An encoding trie represents a prefix code: each leaf stores a character, and the code word of a character is given by the path from the root to the leaf storing the character (0 for a left child and 1 for a right child).

[Figure: an encoding trie for the prefix code a = 00, b = 010, c = 011, d = 10, e = 11.]

Encoding Trie (cont)


Given a text string X, we want to find a prefix code for the characters of X that yields a small encoding for X: frequent characters should have short code-words and rare characters should have long code-words.

Example: X = abracadabra; T1 encodes X into 29 bits, T2 encodes X into 24 bits.

[Figure: two encoding tries, T1 and T2, for the characters a, b, c, d, r of abracadabra.]

Huffman’s Algorithm


Given a string X, Huffman’s algorithm constructs a prefix code that minimizes the size of the encoding of the string.

It runs in time O(n + d log d), where n is the size of the string and d is the number of distinct characters of the string.

A heap-based priority queue is used as an auxiliary structure

Algorithm HuffmanEncoding(X)
    Input: string X of size n
    Output: optimal encoding trie for X
    C ← distinctCharacters(X)
    computeFrequencies(C, X)
    Q ← new empty heap
    for all c ∈ C
        T ← new single-node tree storing c
        Q.insert(getFrequency(c), T)
    while Q.size() > 1
        f1 ← Q.minKey()
        T1 ← Q.removeMin()
        f2 ← Q.minKey()
        T2 ← Q.removeMin()
        T ← join(T1, T2)
        Q.insert(f1 + f2, T)
    return Q.removeMin()
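A Python sketch of HuffmanEncoding using the standard heapq module as the heap-based priority queue (names are mine; a counter breaks ties so partial trees are never compared directly):

    import heapq
    from collections import Counter
    from itertools import count

    def huffman_code(X):
        """Return a dict mapping each character of X to its prefix code-word."""
        freq = Counter(X)                    # distinctCharacters + computeFrequencies
        tie = count()
        # heap entries: (frequency, tie-breaker, tree); a tree is a char or a (left, right) pair
        Q = [(f, next(tie), c) for c, f in freq.items()]
        heapq.heapify(Q)
        while len(Q) > 1:
            f1, _, T1 = heapq.heappop(Q)     # the two trees of smallest frequency
            f2, _, T2 = heapq.heappop(Q)
            heapq.heappush(Q, (f1 + f2, next(tie), (T1, T2)))   # join(T1, T2)
        _, _, tree = Q[0]
        code = {}
        def walk(node, word):
            if isinstance(node, tuple):
                walk(node[0], word + "0")    # 0 for a left child
                walk(node[1], word + "1")    # 1 for a right child
            else:
                code[node] = word or "0"     # degenerate single-character case
        walk(tree, "")
        return code

    code = huffman_code("abracadabra")
    print(code)
    print(sum(len(code[c]) for c in "abracadabra"), "bits")   # 23 bits for this input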

Example


X = abracadabra
Frequencies:
a    b    c    d    r
5    2    1    1    2

[Figure: step-by-step construction of the Huffman tree for abracadabra; at each step the two trees with the smallest total frequencies are removed from the priority queue and joined, until a single tree of weight 11 remains. The choice of which two to join is not unique.]

At Seats


X = catinthehat
Frequencies in English text:
i    n    b    a    c    t    h    e
8    8    2    8    3    9    6    12

Create the tree. Use it to decode.


Text Similarity
Detect similarity to focus on, or ignore, slight differences:
a. DNA analysis
b. Web crawlers omit duplicate pages, distinguish between similar ones
c. Updated files, archiving, delta files, and editing distance


Longest Common Subsequence
One measure of similarity is the length of the longest common subsequence between two texts. This is NOT a contiguous substring, so it loses a great deal of structure. I doubt that it is an effective metric for all types of similarity, unless the subsequence is a substantial part of the whole text.


The LCS algorithm uses the dynamic programming approach.

Recall: the first step is to find the recursion. How do we write LCS in terms of other LCS problems? The parameters for the smaller problems being composed to solve a larger problem are the lengths of a prefix of X and a prefix of Y.


Find recursion:

Let L(i, j) be the length of the LCS between the two prefixes X(0..i) and Y(0..j).

Suppose we know L(i, j), L(i+1, j) and L(i, j+1) and want to know L(i+1, j+1):
a. If X[i+1] = Y[j+1], then the best we can do is an LCS of length L(i, j) + 1.
b. If X[i+1] ≠ Y[j+1], then it is max(L(i, j+1), L(i+1, j)).


     *  a  b  c  d  g  h  t  h  m  s
  *  0  0  0  0  0  0  0  0  0  0  0
  a  0  1  1  1  1  1  1  1  1  1  1
  e  0  1  1  1  1  1  1  1  1  1  1
  d  0  1  1  1  2  2  2  2  2  2  2
  f  0  1  1  1  2  2  2  2  2  2  2
  h  0  1  1  1  2  2  3  3  3  3  3
  h  0  1  1  1  2  2  3  3  4  4  4

Longest Common Subsequence: one measure of similarity is the length of the longest common subsequence between two texts.


This algorithm initializes the array (table) for L by putting 0's along the borders, then fills in the values row by row with a simple nested loop. Thus it runs in O(nm).

While the algorithm only tells the length of the LCS, the actual string can easily be found by working backward through the table (and strings), noting points at which the two characters are equal
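A Python sketch of the LCS dynamic program just described, including the backward walk through the table that recovers an actual subsequence (names are mine):

    def lcs(X, Y):
        """Return (length, one longest common subsequence) of X and Y."""
        n, m = len(X), len(Y)
        # L[i][j] = length of the LCS of X[:i] and Y[:j]; row 0 and column 0 are the 0 borders
        L = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if X[i - 1] == Y[j - 1]:
                    L[i][j] = L[i - 1][j - 1] + 1
                else:
                    L[i][j] = max(L[i - 1][j], L[i][j - 1])
        # work backward through the table, noting the points where the characters are equal
        out, i, j = [], n, m
        while i > 0 and j > 0:
            if X[i - 1] == Y[j - 1]:
                out.append(X[i - 1]); i -= 1; j -= 1
            elif L[i - 1][j] >= L[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return L[n][m], "".join(reversed(out))

    print(lcs("aedfhh", "abcdghthms"))   # (4, 'adhh') -- the table shown above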


     *  a  b  c  d  g  h  t  h  m  s
  *  0  0  0  0  0  0  0  0  0  0  0
  a  0  1  1  1  1  1  1  1  1  1  1
  e  0  1  1  1  1  1  1  1  1  1  1
  d  0  1  1  1  2  2  2  2  2  2  2
  f  0  1  1  1  2  2  2  2  2  2  2
  h  0  1  1  1  2  2  3  3  3  3  3
  h  0  1  1  1  2  2  3  3  4  4  4

Longest Common Subsequence: mark each cell with the information needed to generate the string; every diagonal step taken where the characters match shows what is part of the LCS.


     *  i  d  o  n  o  t  l  i  k  e
  *  0  0  0  0  0  0  0  0  0  0  0
  n  0
  o  0
  t  0
  i  0
  c  0
  e  0

Try this one…


The rest of the material in these notes is not in your text (except as exercises)

Sequence Comparisons
Problems in molecular biology involve finding the minimum number of edit steps required to change one string into another.
Three types of edit steps: insert, delete, replace (replace may cost extra, as it is like a delete plus an insert). The non-edit step is "match", costing zero.
Example: abbc → babb: either abbc → bbc → babc → babb (3 steps) or abbc → babbc → babb (2 steps). We are trying to minimize the number of steps.


Idea: look at making just one position right. Find all the ways you could do it.

Count how long each would take (using recursion) and figure out the best cost.

Then use dynamic programming: an orderly way of limiting the exponential number of combinations to think about.

For ease in coding, we make the last character correct (rather than any other).

First steps to dynamic programming

Think of the problem recursively. Find your prototype: what comes in and what comes out.

int C(n, m) returns the cost of turning the first n characters of the source string (A) into the first m characters of the destination string (B).

Now find the recursion. You have a helper who will do ANY smaller subproblem of the same variety. What will you have them do? Be lazy: let the helper do MOST of the work.


Types of edit steps: insert, delete, replace, match. Consider match to be "free" but the others to cost 1. There are four possibilities (pick the cheapest). Recall that C(n, m) is the cost of changing the first n characters of A into the first m characters of B.

1. If we delete a_n, we still need to change A(0..n-1) to B(0..m). The cost is C(n, m) = C(n-1, m) + 1.

2. If we insert a new value at the end of A to match b_m, we still have to change A(0..n) to B(0..m-1). The cost is C(n, m) = C(n, m-1) + 1.

3. If we replace a_n with b_m, we still have to change A(0..n-1) to B(0..m-1). The cost is C(n, m) = C(n-1, m-1) + 1.

4. If we match a_n with b_m, we still have to change A(0..n-1) to B(0..m-1). The cost is C(n, m) = C(n-1, m-1).


We have turned one problem into three problems, each just slightly smaller.

That is a bad situation, unless we can reuse results: dynamic programming.

We store the results of C(i, j) for i = 1..n and j = 1..m.

If we need to reconstruct how we would achieve the change, we store both the cost and an indication of which of the subproblems was used.


M(i, j) indicates which of the four decisions leads to the best result.

Complexity: O(mn), but it needs O(mn) space as well.
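A Python sketch of the edit-distance table described above, storing both the cost C(i, j) and the decision M(i, j) (I = insert, D = delete, R = replace, M = match) so the sequence of edits can be reconstructed; names are mine:

    def edit_distance(A, B):
        """Return (cost, M) where M[i][j] records which decision gave the best result."""
        n, m = len(A), len(B)
        C = [[0] * (m + 1) for _ in range(n + 1)]
        M = [[""] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            C[i][0], M[i][0] = i, "D"            # delete the first i characters of A
        for j in range(1, m + 1):
            C[0][j], M[0][j] = j, "I"            # insert the first j characters of B
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if A[i - 1] == B[j - 1]:         # match: free
                    C[i][j], M[i][j] = C[i - 1][j - 1], "M"
                else:                            # cheapest of replace, insert, delete
                    C[i][j], M[i][j] = min((C[i - 1][j - 1] + 1, "R"),
                                           (C[i][j - 1] + 1, "I"),
                                           (C[i - 1][j] + 1, "D"))
        return C[n][m], M

    print(edit_distance("do", "redo")[0])     # 2
    print(edit_distance("mane", "mean")[0])   # 2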

Consider changing do to redo: Consider changing mane to mean:


At your seats, try changing "mane" to "mean":

     *  m  e  a  n
  *  0
  m
  a
  n
  e


Changing "mane" to "mean":

     *    m    e    a    n
  *  0    I-1  I-2  I-3  I-4
  m  D-1  M-0  I-1  I-2  I-3
  a  D-2  D-1  R-1  M-1  I-2
  n  D-3  D-2  R-2  D-2  M-1
  e  D-4  D-3  M-2  D-3  D-2


Changing "do" to "redo". Assume match is free; the others cost 1. I show the choices as I-, D-, R-, or M-, but they could have been shown with arrows as well.

     *    r    e    d    o
  *  I-0  I-1  I-2  I-3  I-4
  d  D-1  R-1  R-2  M-2  I-3
  o  D-2  R-2  R-2  R-3  M-2


Another problem: Longest Increasing Subsequence of a single list.

Find the longest increasing subsequence in a sequence of distinct integers.

Example: 5 1 10 2 20 30 40 4 5 6 7 8 9 10 11

Why do we care? It is a classic problem:
1. computational biology: related to the MUMmer system for aligning genomes
2. card games
3. the airline boarding problem
4. maximization problems in a random environment

How do we solve?


Longest Increasing Subsequence of single list

Find the longest increasing subsequence in a sequence of distinct integers.

Idea 1. Given a sequence of size less than m, we can find the longest increasing subsequence of it (recursion). What is the problem? Can we use a subproblem to solve the larger problem? Will the solution to the smaller problem be part of the solution for the larger problem?

Case 1: the new element either can be added to the longest subsequence or not.
Case 2: it is possible that it can be added to a non-selected subsequence (creating a sequence of equal length, but having a smaller ending value).
Case 3: it can be added to a non-selected subsequence, creating a sequence of smaller length, but successors make it a good choice.

Example: 5 1 10 2 20 30 40 4 5 6 7 8 9 10 11. The longest increasing subsequence of the underlined part is not part of the complete solution.


Idea 2. Given a sequence of size less than m, we know how to find the longest increasing subsequence for EVERY smaller problem.

We don't know which subproblem we want to add to. What is the complexity? For each n, we call n - 1 subproblems which are one smaller. That looks exponential.

We would need to store the results of the subproblems in some way.


BIS: an array of the best (least value) ending point for a subsequence of each length.

For s = 1 to n (or recursively the other way):
    for k = s down to 1, until you find the correct spot:
        if BIS(k) > A_s and BIS(k-1) < A_s then BIS(k) ← A_s

Idea: if you have two subsequences of length x, the one with the smaller end value is preferable.


Actually, we don't need the sequential search, as we can do a binary search.

Sequence: 5 1 10 2 12 8 15 18 45 6 7 3 8 9

Length    BIS values over time
1         5, then 1
2         10, then 2
3         12, then 8, 6, 3
4         15, then 7
5         18, then 8
6         45, then 9

To output the sequence itself would be difficult, as you don't know where the sequence is; you would have to reconstruct it. You only know the length of the longest increasing subsequence.
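A Python sketch of the binary-search version using the bisect module; like the discussion above, it returns only the length (names are mine):

    from bisect import bisect_left

    def lis_length(seq):
        """Length of the longest increasing subsequence of a list of distinct integers."""
        bis = []                         # bis[k] = least possible ending value of an
        for x in seq:                    #   increasing subsequence of length k + 1
            k = bisect_left(bis, x)      # first slot whose ending value is >= x
            if k == len(bis):
                bis.append(x)            # x extends the longest subsequence found so far
            else:
                bis[k] = x               # x is a better (smaller) ending value
        return len(bis)

    print(lis_length([5, 1, 10, 2, 12, 8, 15, 18, 45, 6, 7, 3, 8, 9]))   # 6
    print(lis_length([8, 1, 4, 2, 9, 10, 3, 5, 14, 11, 12, 7]))          # 6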


Try: 8 1 4 2 9 10 3 5 14 11 12 7

Length    End Pos    1st Replacement    2nd Replacement
1         8          1
2         4          2
3         9          3
4         10         5
5         14         11                 7
6         12


Probabilistic Algorithms

Suppose we have a collection of items and wanted to find a number that is greater than the median (the number for which half are bigger).

How would you solve it?


Probabilistic Algorithms

Suppose we have a collection of items and wanted to find a number that is greater than the median (the number for which half are bigger).

We could sort them, O(n log n), and then select one in the last half.

We could find the biggest, but stop looking halfway through: O(n/2).

We cannot guarantee finding one in the upper half in fewer than n/2 comparisons.

What if you just wanted good odds? Pick two numbers and keep the larger one. What is the probability it is in the lower half?


There are four possibilities: both are lower than the median; the first is lower, the other higher; the first is higher, the other lower; both are higher.

If we pick the larger of the two numbers, we will be right 75% of the time! We only lose if both are in the lowest half.

Pick two numbers


Select k elements and pick the biggest; the probability of being correct is 1 - 1/2^k. Good odds, and controlled odds.

This is termed a Monte Carlo algorithm: it may give the wrong result, but only with very small probability. The method is named after the city in the principality of Monaco, because of roulette, a simple random number generator. The name and the systematic development of Monte Carlo methods date from about 1944.

Another type of probabilistic algorithm is one that never gives a wrong result, but its running time is not guaranteed.

This is termed a Las Vegas algorithm, as you are guaranteed success if you try long enough and don't care how much you spend.
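A tiny Monte Carlo sketch of the sampling idea above: take k elements and return the largest; it fails (returns something at or below the median) with probability about 1/2^k, but does only O(k) work (names are mine):

    import random

    def probably_above_median(items, k=10):
        """Monte Carlo: wrong with probability about 1/2**k."""
        sample = random.sample(items, min(k, len(items)))
        return max(sample)

    data = list(range(1_000_000))
    print(probably_above_median(data))   # almost certainly in the upper half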


A Coloring Problem: Las Vegas Style

Let S be a set with n elements. (n only affects the complexity, not the algorithm.)

Let S1, S2, ..., Sk be a collection of distinct (in some way different) subsets of S, each containing exactly r elements, such that k ≤ 2^(r-2). (We will use this fact to bound the time.)

GOAL: color each element of S with one of two colors (red or blue) such that each subset Si contains at least one red and one blue element.


Idea

Try coloring the elements randomly and then just checking to see whether you happen to win. Checking is fast: you can quit checking each subset when you see one of each color, and you can quit checking the collection (and announce failure) as soon as any single-color subset is found.

What is the probability that all items in a set of r elements are red? 1/2^r, since each of the two colors is assigned with equal probability and there are r items in the set.


What is the probability that any one of the subsets is all red?

At most k/2^r = 1/2^r + 1/2^r + ... + 1/2^r. Since we are looking for the OR of a set of events, we add the probabilities. k is bounded by 2^(r-2), so k · 1/2^r ≤ 1/4.

The probability that some set is all blue or all red is then at most one half (double the probability of all red).

If our random coloring fails, we simply try again until success. Our expected number of attempts is at most 2.
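A Las Vegas sketch of the coloring idea: color randomly, check, and retry until every subset has both colors; under the k ≤ 2^(r-2) condition the expected number of attempts is at most 2. The sets here are randomly generated purely for illustration, and all names are mine:

    import random

    def two_color(S, subsets):
        """Las Vegas: always returns a valid coloring; expected number of attempts <= 2."""
        attempts = 0
        while True:
            attempts += 1
            color = {x: random.choice(("red", "blue")) for x in S}
            # the coloring wins if every subset contains both colors
            if all(len({color[x] for x in sub}) == 2 for sub in subsets):
                return color, attempts

    S = range(32)
    r = 6
    subsets = [random.sample(list(S), r) for _ in range(2 ** (r - 2))]   # k = 2^(r-2) sets of size r
    coloring, tries = two_color(S, subsets)
    print("succeeded after", tries, "attempt(s)")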


Finding a Majority

Let E be a sequence of integers x1, x2, x3, ..., xn. The multiplicity of x in E is the number of times x appears in E. A number z is a majority in E if its multiplicity is greater than n/2.

Problem: given a sequence of numbers, find the majority in the sequence or determine that none exists.

NOTE: we don’t want to merely find who has the most votes, but determine who has more than half of the votes.


For example, suppose there is an election. Candidates are represented as integers. Votes are represented as a list of candidate numbers.

We are assuming no limit of the number of possible candidates.


Ideas

1. Sort the list: O(n log n).

2. Go through the list, incrementing the count of each candidate. If I had to look up the candidate, I would need to store the candidates somewhere; with a balanced tree of candidate names the complexity would be O(n log c), where c is the number of candidates. Note that if we don't know how many candidates there are, we can't give them indices.

3. Quick-select: see if the median (the kth largest item, k = n/2) occurs more than n/2 times: O(n). (Find the median, and then make a pass through, counting how many times it occurs.)

4. Take a small sample. Find the majority of the sample, then count how many times it occurs in the whole list.

5. Make one pass, discarding elements that won't affect the majority.


Our algorithm will find a possible majority.

Algorithm: find two unequal elements and delete them; find the majority in the smaller list; then see if it is a majority in the original list.

How do we remove elements? It is easy: we scan the list in order, looking for a pair to eliminate.

Let i be the current position. All the items before x_i which have not been eliminated have the same value, so all you really need to keep is that candidate value C and the number of times it occurs (and has not been deleted).


Note: if there is a majority, and x_i ≠ x_j and we remove both of them, then the majority in the old list is still the majority in the new list.

Reasoning: if x_i is the majority, it occurs more than half the time, so throwing out a pair in which it appears EXACTLY half the time won't affect its majority.

If x_i is not the majority, throwing it out won't matter. If x_i is the majority, there are m copies of x_i out of n, where m > n/2. Notice that if we subtract one from both sides, we get m - 1 > n/2 - 1 = (n - 2)/2. So if we remove two elements, one of them x_i, we still have m - 1 > (n - 2)/2.

The converse is not true: if there is no majority, removing two elements may make something a majority in the smaller list, e.g., 1, 2, 4, 5, 5.


For example:

List:       1 4 6 3 4 4 4 2 9 0 2 4 1 4 2 2 3 2 4 2
Occurs:     X X 1 X 1 2 3 2 1 X 1 X 1 X 1 2 1 2 1 2
Candidate:  1 6 4 4 4 4 4 ? 2 ? 1 ? 2 2 2 2 2 2

2 is a candidate, but it is not a majority in the whole list.

Complexity: n - 1 compares to find a candidate, and n - 1 compares to test whether it is a majority.

So why do this instead of the other ways? It is simple to code. It is no different in terms of complexity, but it is interesting to think about.
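A Python sketch of the one-pass candidate-elimination algorithm traced above, with the verification pass at the end (names are mine):

    def majority(E):
        """Return the majority element of E, or None if nothing occurs more than n/2 times."""
        candidate, occurs = None, 0
        for x in E:                      # candidate pass: pairs of unequal elements cancel
            if occurs == 0:
                candidate, occurs = x, 1
            elif x == candidate:
                occurs += 1              # another undeleted copy of the current candidate
            else:
                occurs -= 1              # delete the pair (candidate, x)
        if candidate is None:
            return None
        count = sum(1 for x in E if x == candidate)    # verification pass
        return candidate if count > len(E) // 2 else None

    votes = [1, 4, 6, 3, 4, 4, 4, 2, 9, 0, 2, 4, 1, 4, 2, 2, 3, 2, 4, 2]
    print(majority(votes))                         # None: 2 is only a candidate
    print(majority([3, 3, 4, 2, 3, 3, 5, 3, 3]))   # 3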