Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although...
Transcript of Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although...
![Page 1: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/1.jpg)
Text SearchSteven R. Bagley
![Page 2: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/2.jpg)
Introduction
• Last lecture, looked at Text Search algorithms
• Saw how by careful constructing our algorithm
• We can massively reduce the time taken to search
![Page 3: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/3.jpg)
String Search
• Definitions
• The pattern — string to search for
• The text — the text we want to search
• Problem
• Does pattern occurs inside text
Will use these throughout the lecture
![Page 4: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/4.jpg)
Text Search
• Naive Search — Test pattern in every possible alignment with text
• Boyer-Moore — Uses information about the mismatches to skip big chunks of text
![Page 5: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/5.jpg)
Boyer-Moore
• If mismatching char not in pattern slide length of pattern
• Slide so the rightmost occurrence of char in pattern is aligned with char in text
• Slide so that a matched subpattern is aligned with the rightmost occurrence of the subpattern not preceded by this mismatched char
(assuming this is not backwards)
![Page 6: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/6.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
![Page 7: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/7.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
![Page 8: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/8.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
![Page 9: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/9.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
![Page 10: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/10.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Okay, found a character that matches, step back to test the previous character
![Page 11: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/11.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Okay, found a character that matches, step back to test the previous character
![Page 12: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/12.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Okay, found a character that matches, step back to test the previous character
![Page 13: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/13.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Okay, found a character that matches, step back to test the previous character
![Page 14: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/14.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Okay, found a character that matches, step back to test the previous character
![Page 15: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/15.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Okay, found a character that matches, step back to test the previous character
![Page 16: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/16.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Okay, found a character that matches, step back to test the previous character
![Page 17: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/17.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Okay, found a character that matches, step back to test the previous character
![Page 18: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/18.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Okay, found a character that matches, step back to test the previous character
![Page 19: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/19.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Look we’ve matched the string
![Page 20: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/20.jpg)
W H I C H F I N A L L Y H A L T S , A T T H A T P O I N T
A T T H A T
Look we’ve matched the string
![Page 21: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/21.jpg)
Boyer-Moore
• Nice, fast algorithm
• Easy to implement, although requires some largish tables
• The ‘gold standard’, other algorithms compared to this
• Longer the pattern the faster it tends…
![Page 22: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/22.jpg)
Boyer-Moore
• But can break down if the alphabet is smalle.g. DNA sequences – only four letters (TACG)
• Much more likely that the character will be found close to the right edge
• So the pattern will move by a much smaller amount each time
![Page 23: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/23.jpg)
Search Algorithms
• However, other algorithms don’t suffer from this problem
• One such algorithm makes use of the Burrows-Wheeler Transform…
![Page 24: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/24.jpg)
Burrows-Wheeler Transform
• Developed in 1994 by Mike Burrows and David Wheeler
• Takes a string and transforms it into another string
• However, this transformed string is much more compressible
• What’s compression got to do with search?
David Wheeler one of the original computer people at CambridgeMike Burrows went on to design Altavista and now works at Google (via MS)
![Page 25: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/25.jpg)
Suffix Arrays
• Helps to understand BWT search if we first look at a simpler related search algorithm
• The Suffix Array is an array of all possible suffixes of text
• Where a suffix is a substring that ends with the last character of text
Gives some examples with Mississippi, e.g.i, pi, ippi, ssippi, Mississippi are all suffixes But Miss, and ssiss are not (because they don’t run to the end of the string)
![Page 26: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/26.jpg)
0 MISSISSIPPI
1 ISSISSIPPI
2 SSISSIPPI
3 SISSIPPI
4 ISSIPPI
5 SSIPPI
6 SIPPI
7 IPPI
8 PPI
9 PI
10 I
All possible suffixes of Mississippi
![Page 27: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/27.jpg)
Searching Suffix Arrays
• If pattern exists within the text then it will be a prefix of one the suffix array entries
• Therefore, if we compare the pattern with the beginning of each suffix
• We can find our string
• It’s exactly the same as our original naive search…
prefix, one array will start with it
![Page 28: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/28.jpg)
0 MISSISSIPPI
1 ISSISSIPPI
2 SSISSIPPI
3 SISSIPPI
4 ISSIPPI
5 SSIPPI
6 SIPPI
7 IPPI
8 PPI
9 PI
10 I
Suppose we are search for the pattern ‘SIP’
![Page 29: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/29.jpg)
0 MISSISSIPPI
1 ISSISSIPPI
2 SSISSIPPI
3 SISSIPPI
4 ISSIPPI
5 SSIPPI
6 SIPPI
7 IPPI
8 PPI
9 PI
10 I
SIP
Suppose we are search for the pattern ‘SIP’
![Page 30: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/30.jpg)
0 MISSISSIPPI
1 ISSISSIPPI
2 SSISSIPPI
3 SISSIPPI
4 ISSIPPI
5 SSIPPI
6 SIPPI
7 IPPI
8 PPI
9 PI
10 I
SIP
Suppose we are search for the pattern ‘SIP’
![Page 31: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/31.jpg)
0 MISSISSIPPI
1 ISSISSIPPI
2 SSISSIPPI
3 SISSIPPI
4 ISSIPPI
5 SSIPPI
6 SIPPI
7 IPPI
8 PPI
9 PI
10 I
SIP
Suppose we are search for the pattern ‘SIP’
![Page 32: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/32.jpg)
0 MISSISSIPPI
1 ISSISSIPPI
2 SSISSIPPI
3 SISSIPPI
4 ISSIPPI
5 SSIPPI
6 SIPPI
7 IPPI
8 PPI
9 PI
10 I
SIP
Suppose we are search for the pattern ‘SIP’
![Page 33: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/33.jpg)
0 MISSISSIPPI
1 ISSISSIPPI
2 SSISSIPPI
3 SISSIPPI
4 ISSIPPI
5 SSIPPI
6 SIPPI
7 IPPI
8 PPI
9 PI
10 I
SIP
Suppose we are search for the pattern ‘SIP’
![Page 34: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/34.jpg)
0 MISSISSIPPI
1 ISSISSIPPI
2 SSISSIPPI
3 SISSIPPI
4 ISSIPPI
5 SSIPPI
6 SIPPI
7 IPPI
8 PPI
9 PI
10 I
SIP
Suppose we are search for the pattern ‘SIP’
![Page 35: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/35.jpg)
0 MISSISSIPPI
1 ISSISSIPPI
2 SSISSIPPI
3 SISSIPPI
4 ISSIPPI
5 SSIPPI
6 SIPPI
7 IPPI
8 PPI
9 PI
10 I
SIP
Suppose we are search for the pattern ‘SIP’
![Page 36: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/36.jpg)
Suffix Array Search
• Just as slow as our original naive algorithm
• Also need to build up array (more RAM)
• That’s true — but if you sort the suffix array, you can use binary search
• Finds pattern in O(m log n) time
• Also, finds all occurrences
more RAM than the naive approach, but probably about the same as Boyer-Moore
![Page 37: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/37.jpg)
10 I
7 IPPI
4 ISSIPPI
1 ISSISSIPPI
0 MISSISSIPPI
9 PI
8 PPI
6 SIPPI
3 SISSIPPI
5 SSIPPI
2 SSISSIPPI
Suppose we are search for the pattern ‘SIP’Oh look found it in two
![Page 38: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/38.jpg)
10 I
7 IPPI
4 ISSIPPI
1 ISSISSIPPI
0 MISSISSIPPI
9 PI
8 PPI
6 SIPPI
3 SISSIPPI
5 SSIPPI
2 SSISSIPPI
SIP
Suppose we are search for the pattern ‘SIP’Oh look found it in two
![Page 39: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/39.jpg)
10 I
7 IPPI
4 ISSIPPI
1 ISSISSIPPI
0 MISSISSIPPI
9 PI
8 PPI
6 SIPPI
3 SISSIPPI
5 SSIPPI
2 SSISSIPPI
SIP
Suppose we are search for the pattern ‘SIP’Oh look found it in two
![Page 40: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/40.jpg)
Burrows-Wheeler
• The BWT is incredible simple to implement
• Although understand how it works is not so simple
• BWT takes a string, and produces an NxN matrix (where N is the length of the string)
• First row of the matrix is just the string
BWT == Burrows-Wheeler Transform
![Page 41: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/41.jpg)
M I S S I S S I P P I
![Page 42: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/42.jpg)
Burrows-Wheeler
• For every other row take the previous row and rotate it to the left by one…
• Putting the character shifted out the left back round into the right
• The resulting matrix is then sorted lexicographcially
![Page 43: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/43.jpg)
M I S S I S S I P P I
I S S I S S I P P I M
S S I S S I P P I M I
S I S S I P P I M I S
I S S I P P I M I S S
S S I P P I M I S S I
S I P P I M I S S I S
I P P I M I S S I S S
P P I M I S S I S S I
P I M I S S I S S I P
I M I S S I S S I P P
![Page 44: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/44.jpg)
M I S S I S S I P P I
I S S I S S I P P I M
S S I S S I P P I M I
S I S S I P P I M I S
I S S I P P I M I S S
S S I P P I M I S S I
S I P P I M I S S I S
I P P I M I S S I S S
P P I M I S S I S S I
P I M I S S I S S I P
I M I S S I S S I P P
![Page 45: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/45.jpg)
M I S S I S S I P P I
I S S I S S I P P I M
S S I S S I P P I M I
S I S S I P P I M I S
I S S I P P I M I S S
S S I P P I M I S S I
S I P P I M I S S I S
I P P I M I S S I S S
P P I M I S S I S S I
P I M I S S I S S I P
I M I S S I S S I P P
![Page 46: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/46.jpg)
M I S S I S S I P P I
I S S I S S I P P I M
S S I S S I P P I M I
S I S S I P P I M I S
I S S I P P I M I S S
S S I P P I M I S S I
S I P P I M I S S I S
I P P I M I S S I S S
P P I M I S S I S S I
P I M I S S I S S I P
I M I S S I S S I P P
![Page 47: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/47.jpg)
I M I S S I S S I P P
I P P I M I S S I S S
I S S I P P I M I S S
I S S I S S I P P I M
M I S S I S S I P P I
P I M I S S I S S I P
P P I M I S S I S S I
S I P P I M I S S I S
S I S S I P P I M I S
S S I P P I M I S S I
S S I S S I P P I M I
Lexicogrpahically sorted version!
![Page 48: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/48.jpg)
BWT
• The last column of this is then extracted as the product of the BWT
• This transformed string made from the original string is highly compressible
Move-To-Front Encoding, then huffman coding
![Page 49: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/49.jpg)
I M I S S I S S I P P
I P P I M I S S I S S
I S S I P P I M I S S
I S S I S S I P P I M
M I S S I S S I P P I
P I M I S S I S S I P
P P I M I S S I S S I
S I P P I M I S S I S
S I S S I P P I M I S
S S I P P I M I S S I
S S I S S I P P I M I
Last column taken and used as basis for compression
![Page 50: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/50.jpg)
t,<s]]meem =========-+-<<nhiiiiix]mhiiiinx]]]te]eskt]]tiejm]iiijscwwewewes*e m mte=e wwt= =+ +- ;t;cdm==mnnnt ;;;;;tttt tt;;;;;m wwt;;tt=,on;t;======= m=e t= tn==,,<<<<<==t y)n};t;= (r" ((nnnrtthkhhereydyk mrrrrrrrrreme ntf+)f+e++tt+-+++s(tt)(stee)("](ketx"1/ }../ ++++++++iniiixih"et]d -jlommta]mm]trtrssssa]t**! = = [222555))))))))))k)]])])ste]eentste]*0]1])x]])]0]])*]0600000eee6e100)0e = > ! ! dd t (! TTZZZ(eee OOFO"ed (aaa(dadds IIE ("tttttttt! WWsssBBBBBtcpetkcccccccpcpekkeeekkkkktyeet]iij]ii[[hiiiiie[[eeei]h[6h]iiii !
Extract from a compressed Java program — lots of repeated characters
![Page 51: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/51.jpg)
BWT
• Reversible transform – given the last column, the entire matrix can be regenerated
• Store index of the original string row
• Exploits the fact that the first column can be made by sorting the last column
• And that every character in the last column precedes the character in the first column
![Page 52: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/52.jpg)
BWT
• The first characters of each entry of the BWT matrix are identical to the strings in the suffix array
• Followed by the rest of the string
• Therefore, we can do a binary search in the BWT matrix to find the pattern
![Page 53: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/53.jpg)
I M I S S I S S I P P
I P P I M I S S I S S
I S S I P P I M I S S
I S S I S S I P P I M
M I S S I S S I P P I
P I M I S S I S S I P
P P I M I S S I S S I
S I P P I M I S S I S
S I S S I P P I M I S
S S I P P I M I S S I
S S I S S I P P I M I
Last column taken and used as basis for compression
![Page 54: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/54.jpg)
10 I
7 IPPI
4 ISSIPPI
1 ISSISSIPPI
0 MISSISSIPPI
9 PI
8 PPI
6 SIPPI
3 SISSIPPI
5 SSIPPI
2 SSISSIPPI
Suppose we are search for the pattern ‘SIP’Oh look found it in two
![Page 55: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/55.jpg)
BWT search
• The BWT matrix will be bigger than the Suffix array
• But fortunately, there is a way to build up any line of the sorted matrix in linear time
• Using just the last line (and a couple of easily precomputed tables)
![Page 56: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/56.jpg)
int sum = 0, c[] = new int[256], p[] = new int[size];byte out = new byte[size];
for(int i = 0; i < 256; i++) c[i] = 0; for(int i = 0; i < size; i++){ p[i] = c[block[i]]; c[block[i]] = c[block[i]] + 1; } for(int ch = 0; ch < 256; ch++){ sum = sum + c[ch]; c[ch] = sum - c[ch];}
int i = x;for(int j = size - 1; j >=0; j--){ out[j] = block[i]; i = p[i] + c[block[i]]; }
Java code to find row x based on the last column block[]out[] contains row xp[] and c[] are two tables used to help rebuild the row
![Page 57: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/57.jpg)
BWT search
• Therefore, using the last line of the matrix
• Which could well all ready exist if we’ve compressed text
• The reconstruction algorithm and a binary search
• We can find all occurrences of a pattern very quickly
e.g. on disk
![Page 58: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/58.jpg)
Page Description Langauges
![Page 59: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/59.jpg)
INFO
RM
ATIO
N
RASTERPAGE
DESCRIPTION
A B C D
LOGICALSTRUCTURE
A B
C D
LAYOUT
![Page 60: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/60.jpg)
Page Description Language
• Holds a description of a ‘page’ in a form that can be converted to a viewable image
• At this level everything is in a fixed position
• Dominated now by the Adobe Graphics Model
• Used by PDF, PostScript and SVG
As well, as iOS and OS X
![Page 61: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/61.jpg)
Early PDLs
• Before PostScript, PDLs were very basic
• Weren’t necessarily driving a raster device
• Were also tied to a specific machine
• And its specific features
• Such as units used for measurements
Technically, Warnock and Geschke had done very similar work at Xerox (Interpress) and at Evans and Sutherland to what would become PostScriptSimilar to the operations you get in a typical grpahics library on a computerPrograms driving a Linotype 202 would be incompatible with Monotype machines
![Page 62: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/62.jpg)
Early PDLs
• Aimed at typesetters
• Mainly limited to text – set font, point size, position, print text
• Perhaps some very basic straight line drawing operations
Technically, Warnock and Geschke had done very similar work at Xerox (Interpress) and at Evans and Sutherland to what would become PostScriptSimilar to the operations you get in a typical grpahics library on a computer
![Page 63: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/63.jpg)
Early Software Driving
• Software would often want to drive multiple devices
• Produced an intermediate code, which would be converted to the relevant format
• Often by a separate backend program
• But the intermediate codes would be incompatible across software packages
Meaning you had lots of different backends
![Page 64: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/64.jpg)
PostScript
• Developed by John Warnock andChuck Geschke in early 1980s
• PostScript took a different approach
• Not aimed at driving typesetters, abstracted away from a specific device
• Text was just another mark on the page
• Based around the Adobe Graphics Model
![Page 65: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/65.jpg)
Adobe Graphics Model
• At the heart of PDF, Postscript and SVG
• Separates the Graphics from the Rendering
• Application composes graphics on the page
• Once completed the entire page is rendered for a specific device
• Crucially, abstracts the application from device
Might be useful to know if its colour/black and white, what fonts are installed…But that’s general metadata
![Page 66: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/66.jpg)
Adobe Graphics Model
• Based around the idea of painting marks on a blank page
• Any mark made obscures any previous mark underneath it
• Although modern versions allow for transparency
![Page 67: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/67.jpg)
Adobe Graphics Model
• Painted Marks may be:
• Character shapes (glyphs)
• Geometric shapes or lines (paths)
• Sampled images (bitmaps)
![Page 68: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/68.jpg)
Paths
• Constructed from a sequence of lines, curved and points
• Then filled or stroked
![Page 69: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/69.jpg)
Adobe Graphics Model
• Painted Marks may be in
• Colour
• Greyscale
• A repeating Pattern
• Smooth transition between colours (gradient)
![Page 70: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/70.jpg)
Adobe Graphics Model
• Painted marks may also be clipped to some other shape
• Clipping defined by a path
• As they are placed on the page
• Once its on the page it can’t be removed
• Only covered up by new marks
![Page 71: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/71.jpg)
User Space
• AGM defines User Space which is what the application draws in
• Infinite, but clipped to the display…
• XY-coordinate system, y +ve upwards and x +ve to the right
• Origin is (0,0) — i.e. the bottom-left
• Although the user can change this
AGM defines lots of interelated spaces…
![Page 72: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/72.jpg)
User Space
• Also defines a fixed unit for User Space
• The point — 1/72 inch
• Unrelated to the capabilities of the device
• So is a floating point value
• Before being rendered, user space coordinates are transformed into Device Space
![Page 73: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/73.jpg)
Device Space
• Device Space is specific to the device
• Varies from device to device
• Different origin, axes, units for measurement
• Specifying graphics in User Space and transforming to Device Space gives device independence
Think about it — if you ask for a box to be 72 points high on a 72dpi machine its an inch,On a 300dpi machine it’s ~0.25 inch high
![Page 74: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/74.jpg)
GraphicsCHAPTER 4138
If coordinates in a PDF file were specified in device space, the file would bedevice-dependent and would appear differently on different devices. For exam-ple, images specified in the typical device spaces of a 72-pixel-per-inch displayand a 600-dot-per-inch printer would differ in size by more than a factor of 8; an8-inch line segment on the display would appear less than 1 inch long on theprinter. Figure 4.2 shows how the same graphics object, specified in device space,can appear drastically different when rendered on different output devices.
FIGURE 4.2 Device space
User Space
To avoid the device-dependent effects of specifying objects in device space, PDFdefines a device-independent coordinate system that always bears the same rela-tionship to the current page, regardless of the output device on which printing ordisplaying will occur. This device-independent coordinate system is called userspace.
The user space coordinate system is initialized to a default state for each page of adocument. The CropBox entry in the page dictionary specifies the rectangle ofuser space corresponding to the visible area of the intended output medium (dis-play window or printed page). The positive x axis extends horizontally to theright and the positive y axis vertically upward, as in standard mathematical prac-tice (subject to alteration by the Rotate entry in the page dictionary). The lengthof a unit along both the x and y axes is 1 ⁄72 inch. This coordinate system is calleddefault user space.
Device space for72-dpi screen
Device space for 300-dpi printer
Example of difference in apparent size of object with same dimensions on different devices
![Page 75: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/75.jpg)
Current Transformation Matrix
• Current Transformation Matrix (CTM) specifies translation
• Allows the coordinates to be scaled, translated, rotated and sheared
• Starts with a default CTM (maps User Space to a specific Device Space)
• Could be more than just scaling
May have a different origin, axis direction etc as well
![Page 76: Text Search · Burrows-Wheeler • The BWT is incredible simple to implement • Although understand how it works is not so simple • BWT takes a string, and produces an NxN matrix](https://reader031.fdocuments.net/reader031/viewer/2022011915/5fc7ac2fc2fcb936a02df839/html5/thumbnails/76.jpg)
GraphicsCHAPTER 4140
FIGURE 4.3 User space
Other Coordinate Spaces
In addition to device space and user space, PDF uses a variety of other coordinatespaces for specialized purposes:
• The coordinates of text are specified in text space. The transformation from textspace to user space is defined by a text matrix in combination with several text-related parameters in the graphics state (see Section 5.3.1, “Text-PositioningOperators”).
• Character glyphs in a font are defined in glyph space (see Section 5.1.3, “GlyphPositioning and Metrics”). The transformation from glyph space to text spaceis defined by the font matrix. For most types of font, this matrix is predefinedto map 1000 units of glyph space to 1 unit of text space; for Type 3 fonts, thefont matrix is given explicitly in the font dictionary (see Section 5.5.4, “Type 3Fonts”).
User space
Device space for72-dpi screen
Device space for 300-dpi printer
CTM