Suffix Sorting
Michael Liut
University of Toronto
Overview
1. Introduction/Background of the Audience
2. Suffix Sorting
3. Pedagogical Approach Taken
4. Question & Answer
Introduction/Background
Why Suffix Sorting?
• It’s a topic that is more commonly taught in Europe and Japan
  • I want our students to be exposed to this important topic too!
• It was related to my thesis
  • I am well versed to teach cutting-edge content in this space
• It’s an introductory topic which makes for a good segue into the importance of the Suffix Array!
  • Searching large corpora (e.g. Google), data compression, finding all occurrences of a particular substring, computational biology, etc.
Who is the talk intended for?
• Students enrolled in CSC-373 (Algorithm Design and Analysis)
  ➢ This can be the introduction to divide-and-conquer algorithms.
Background
• Have taken programming classes (e.g. CSC-209, CSC-148)
• Have taken data structures classes (e.g. CSC-263, CSC-265)
Layout of Lesson
• ~5 minutes at the end of the previous class
• Assign the reading for next class
• ~20 minutes for this class
§ Recap of reading
§ Active Learning Exercise
Intended Learning Objectives
End of Class 1 + Homework
1. Students should be able to construct a Trie and a Radix Tree.
2. Students should understand the difference (spatially) between a Trie and a Radix Tree.
Beginning of Class 2
1. Students should be able to construct a Suffix Tree and a Suffix Array.
2. Students should understand the difference (spatially) between a Suffix Tree and a Suffix Array.
Trie (pronounced ‘try’)
➢ a dictionary tree (prefix tree)
• Composed of Nodes and Links
• Stores a set of words, each Node representing a character
[Figure: A Trie on X = { ab, abc, bc }]
*Note: the sentinel symbol $ is used to terminate the string; it is lexicographically smallest.
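A minimal sketch of building the Trie on X = { ab, abc, bc } in Python. The nested-dictionary representation and the names `build_trie`, `count_nodes`, and `words_in` are assumptions made for illustration: each dictionary is a Node and each key is the character on a Link, which is one of several possible ways to store a Trie.

```python
SENTINEL = "$"  # terminates every word; sorts before the lowercase letters

def build_trie(words):
    """Build a trie as nested dicts: one dict per node, one key per link."""
    root = {}
    for word in words:
        node = root
        for ch in word + SENTINEL:
            node = node.setdefault(ch, {})  # reuse the link if it exists
    return root

def count_nodes(node):
    """Count all nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in node.values())

def words_in(node, prefix=""):
    """List the stored words in lexicographic order."""
    found = []
    for ch in sorted(node):
        if ch == SENTINEL:
            found.append(prefix)
        else:
            found.extend(words_in(node[ch], prefix + ch))
    return found

trie = build_trie(["ab", "abc", "bc"])
print(words_in(trie))     # ['ab', 'abc', 'bc']
print(count_nodes(trie))  # 9
```

The words ab and abc share the nodes for their common prefix, which is where the structure saves work over storing each word separately.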
Radix Tree
➢ a Trie with each chain of single-child nodes compressed into one link
• Each internal node having at least 2 children
• AKA: Patricia Trie, Compacted Trie, and Radix Trie
[Figure: A Radix Tree on X = { ab, abc, bc }]
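One way to see the spatial difference between a Trie and a Radix Tree is to build the Trie first and then merge every chain of single-child nodes into one link. The sketch below assumes a nested-dictionary representation (keys are link labels); the names `build_trie`, `compress`, and `count_nodes` are illustrative, not taken from the reading.

```python
def build_trie(words, sentinel="$"):
    """Plain trie as nested dicts, one character per link."""
    root = {}
    for word in words:
        node = root
        for ch in word + sentinel:
            node = node.setdefault(ch, {})
    return root

def compress(node):
    """Return a radix tree: each chain of single-child nodes becomes one link."""
    compact = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing link.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compact[label] = compress(child)
    return compact

def count_nodes(node):
    """Count all nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in node.values())

trie = build_trie(["ab", "abc", "bc"])
radix = compress(trie)
print(radix)               # {'ab': {'$': {}, 'c$': {}}, 'bc$': {}}
print(count_nodes(trie))   # 9
print(count_nodes(radix))  # 5
```

For this small set the Radix Tree needs 5 nodes where the Trie needs 9; the saving grows with the length of the unbranched chains.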
Homework
• Algorithms, 4th edition by Sedgewick and Wayne.
• Read Chapters:
  • 5.1 (String Sorting)
  • 5.2 (Tries)
  • 6, pages 875-878 (Sorting Suffixes and Suffix Arrays)
• After reading, check to ensure you’ve met today’s intended learning objectives!
Suffix Sorting
What is the goal?
• To identify all occurrences of a substring quickly and efficiently.
  • Think of trying to catalogue all the substrings of your favourite CS textbook!
• Instead of re-scanning the string every time we are looking for a pattern, we “prepare” a data structure to do the search easily.
  • The idea is that any substring is a prefix of a suffix!

index:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
symbol:  a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b
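The observation that any substring is a prefix of a suffix can be checked directly on the example string. The sketch below is a deliberately naive illustration (the function name `occurrences` is an assumption): it tests every suffix in turn, which is exactly the re-scanning that a prepared data structure is meant to avoid.

```python
def occurrences(text, pattern):
    """Return the 1-based start positions of pattern in text.

    Each occurrence of pattern is a prefix of exactly one suffix of text,
    so it is enough to test every suffix.
    """
    return [start + 1
            for start in range(len(text))
            if text[start:].startswith(pattern)]

text = "abcaabcabaccabaacb"
print(occurrences(text, "ab"))   # [1, 5, 8, 13]
print(occurrences(text, "cab"))  # [7, 12]
```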
Suffix Tree
• Suffix Tree: a Suffix Radix Tree
[Figure: Suffix Tree for the string below, terminated by $; each leaf is labelled with the starting position of its suffix]

index:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
symbol:  a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b
Worksheet: Task 1!
• Let’s construct a Suffix Tree for the word “Mississippi”

index:   1  2  3  4  5  6  7  8  9 10 11
symbol:  m  i  s  s  i  s  s  i  p  p  i
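To check a hand-drawn answer, a Suffix Tree can be sketched naively as "the Radix Tree of all suffixes": insert every suffix of the string (terminated by $) into a Trie, then compress the chains. This is a quadratic-time illustration under assumed names (`build_suffix_tree`, `compact`, `leaves`, `count_internal`), not one of the linear-time constructions; it also assumes the input does not already contain $.

```python
def compact(node):
    """Merge chains of single-child nodes; leaves (ints) are kept as they are."""
    if not isinstance(node, dict):
        return node
    result = {}
    for label, child in node.items():
        while isinstance(child, dict) and len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        result[label] = compact(child)
    return result

def build_suffix_tree(text, sentinel="$"):
    """Naive suffix tree: trie of all suffixes of text + sentinel, then compacted."""
    text += sentinel
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:-1]:
            node = node.setdefault(ch, {})
        node[sentinel] = start + 1  # leaf stores the 1-based start position
    return compact(root)

def leaves(node):
    """Leaf labels from left to right, visiting links in sorted order."""
    if not isinstance(node, dict):
        return [node]
    found = []
    for label in sorted(node):
        found.extend(leaves(node[label]))
    return found

def count_internal(node):
    """Number of internal nodes, including the root."""
    if not isinstance(node, dict):
        return 0
    return 1 + sum(count_internal(child) for child in node.values())

tree = build_suffix_tree("mississippi")
print(sorted(tree))          # ['$', 'i', 'mississippi$', 'p', 's']
print(leaves(tree))          # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(count_internal(tree))  # 7
```

Reading the leaves from left to right lists the suffixes in lexicographic order, which is the link between the Suffix Tree and the Suffix Array.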
Issues with Suffix Trees
• Require a lot of space! Typically 10-20x more space than the original string!
• Even using some compression techniques, it’s still ~5x bigger than the original string!
Suffix Array

index:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
symbol:  a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b
SA:      4 15 13  8  1  5 16 10 18 14  9  2  6  3 12  7 17 11

• Introduced by Manber & Myers (1990).
• Array of the starting positions of all suffixes of a string, listed in lexicographically sorted order.
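A minimal sketch of the Suffix Array for the example string, built by sorting the suffixes directly, followed by a pattern search that uses two binary searches over the array. The names `suffix_array` and `find_all` are assumptions for illustration, and the naive sort is far from the best known construction; it is only meant to make the definition concrete.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text, in sorted order.

    Naive construction: comparing whole suffixes makes this roughly
    O(n^2 log n) in the worst case.
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def find_all(text, sa, pattern):
    """All 1-based occurrences of pattern, found by binary search on sa."""
    m = len(pattern)

    def prefix(rank):
        start = sa[rank] - 1
        return text[start:start + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first rank whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "abcaabcabaccabaacb"
sa = suffix_array(text)
print(sa)  # [4, 15, 13, 8, 1, 5, 16, 10, 18, 14, 9, 2, 6, 3, 12, 7, 17, 11]
print(find_all(text, sa, "ab"))   # [1, 5, 8, 13]
print(find_all(text, sa, "cab"))  # [7, 12]
```

Because the suffixes are sorted, all suffixes that start with the pattern occupy one contiguous block of the array, so the search never has to scan the whole string.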
Suffix Array
• Until 2003, the best algorithms were O(n log n)
• In 2003, several researchers emulated Farach's approach to provide a recursive linear algorithm for suffix sorting.
• In 2015 Baier introduced a non-recursive linear suffix sorting algorithm.
Worksheet: Task 2!
• Let’s construct a Suffix Array from “Mississippi”

index:   1  2  3  4  5  6  7  8  9 10 11
symbol:  m  i  s  s  i  s  s  i  p  p  i
SA:     11  8  5  2  1 10  9  7  4  6  3
SA⁻¹:    5  4 11  9  3 10  8  2  7  6  1

Question: if we included $, where would it go?
Answer: at the beginning, it’s the smallest!
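The worksheet answer can be checked with a short sketch. It sorts the suffixes naively, then inverts the result so that entry i of the inverse gives the rank of the suffix starting at position i. The function names are assumptions for illustration, and positions are 1-based to match the slides.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text, in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def inverse_suffix_array(sa):
    """inverse[i - 1] is the 1-based rank of the suffix starting at position i."""
    inverse = [0] * len(sa)
    for rank, start in enumerate(sa, start=1):
        inverse[start - 1] = rank
    return inverse

sa = suffix_array("mississippi")
print(sa)                        # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(inverse_suffix_array(sa))  # [5, 4, 11, 9, 3, 10, 8, 2, 7, 6, 1]

# With the sentinel appended, the suffix "$" (position 12) sorts first.
print(suffix_array("mississippi$"))
# [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```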
The Agenda for Next Week
• Suffix Array + Longest Common Prefix (LCP)
• Suffix Tree Algorithms
  1. Weiner (1973), then McCreight (1976)
  2. Ukkonen (1995)
  3. Farach (1997)
• Recursive vs. Iterative implementation
Pedagogical Approach
• The 3 “P”s: Prepare, Practice, and Perform
  • Similar to that in CSC-108 and CSC-148
• Actively learning in the classroom, but also applying these experientially through homework assignments and weekly labs.
• Breaking up topics into foundational building blocks for them to tackle one step at a time (divide and conquer :) ).
Q&A
Thanks for listening! :)
Does anyone have any questions?
References
Baier, U. Linear-time Suffix Sorting. Ulm University, Germany. November 2015.
Franek, F. Suffix-based text indices, construction algorithms, and applications. 2nd CanaDAM Conference, Centre de recherches mathématiques, Montréal. May 2009.
Liut, M. Computing Lyndon Arrays. McMaster University, Canada. September 2019.
Sedgewick, R. and Wayne, K. Algorithms (4th ed.). Addison-Wesley. March 2011.
Yang, J. Algorithm of Suffix Tree. Osaka University, Japan. November 2011.