Repeats and Palindromes: an Overview · 2015-02-17 · Outline Introductory Concepts Tandem Repeats...

Outline Introductory Concepts Tandem Repeats Palindromes Research Directions Repeats and Palindromes: an Overview Sara H. Geizhals Advisor: Dina Sokol The Graduate Center 30 July 2014


Repeats and Palindromes: an Overview

Sara H. Geizhals

Advisor: Dina Sokol

The Graduate Center

30 July 2014


Outline

- Introductory Concepts
  - KMP
  - Suffix Tree
- Tandem Repeats
  - Exact Tandem Repeat
  - Approximate Tandem Repeat
    - Group 1: neighboring - k-repetition
    - Group 1: neighboring - k-run
    - Group 1: neighboring - k-edit repeat
    - Group 2: consensus - consensus model
    - Group 2: consensus - k-MAR
  - Related Research
- Palindromes
  - Linear-time Algorithm #1 for (Exact) Palindromes
  - Linear-time Algorithm #2 for (Exact) Palindromes
  - Variations of Palindromes
  - Approximate Palindromes in RLE-compressed Data
- Research Directions


KMP

Pattern matching problem:

- Let Σ be an alphabet.
- INPUT:
  - string text, of length n: T = t_1 t_2 ... t_n
  - string pattern, of length m: P = p_1 p_2 ... p_m
  - t_i, p_i ∈ Σ
- OUTPUT: all positions i in T where there is an occurrence of P, i.e., T[i + k − 1] = P[k] for 1 ≤ k ≤ m
- Example: P = ababc in T = cababccdddddababccc
- Example: P = aba in T = cabaabadddababac


KMP

- Naive solution - O(nm): since each text position is a possible pattern start, compare each text position with the pattern.
- Better solution: KMP (Knuth, Morris, and Pratt, 1977)
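The naive solution can be sketched in a few lines of Python. This is only an illustration: the function name `naive_find_all` is an assumption, and positions are reported 1-indexed to match the notation of the slides.

```python
def naive_find_all(text: str, pattern: str) -> list[int]:
    """Return all 1-indexed start positions of pattern in text, in O(n*m) time."""
    n, m = len(text), len(pattern)
    positions = []
    for start in range(n - m + 1):
        # Every text position is a candidate start: compare up to m characters.
        if text[start:start + m] == pattern:
            positions.append(start + 1)  # convert to 1-indexed
    return positions


print(naive_find_all("cababccdddddababccc", "ababc"))  # [2, 13]
print(naive_find_all("cabaabadddababac", "aba"))       # [2, 5, 11, 13]
```

The second example shows that overlapping occurrences (positions 11 and 13) are all reported.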


KMP

Stage 1: O(m) to construct KMP automaton for P

Figure: KMP automaton for P of ababc

- Each of the m + 1 nodes has a success arc (to the character on the right) and a failure arc (to the border)
- Border of X = longest string that is both a proper prefix and a proper suffix of X
- Example: the border of abab is ab, the border of ababa is aba, and the border of ababc is the empty string
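A minimal sketch of Stage 1, assuming the automaton is stored as an array of border lengths rather than as explicit nodes and arcs. The name `border_lengths` is an assumption; entry `j` plays the role of the failure arc of node `j`.

```python
def border_lengths(pattern: str) -> list[int]:
    """border[j] = length of the border of pattern[:j], for j = 0..m."""
    m = len(pattern)
    border = [0] * (m + 1)
    k = 0  # border length of the prefix processed so far
    for j in range(1, m):
        # Fall back to shorter and shorter borders until one can be extended.
        while k > 0 and pattern[j] != pattern[k]:
            k = border[k]
        if pattern[j] == pattern[k]:
            k += 1
        border[j + 1] = k
    return border


print(border_lengths("ababc"))  # [0, 0, 0, 1, 2, 0]
print(border_lengths("ababa"))  # [0, 0, 0, 1, 2, 3]
```

In the first output, entry 4 is 2 (the border of abab is ab) and entry 5 is 0 (the border of ababc is empty), which agrees with the examples above.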


KMP

Stage 2: O(n) to go through characters of T to look for P

- If a character in T matches the first character of P, the algorithm starts at the first node of the automaton
- For each character match, the automaton follows the success arc
- For each character mismatch, the automaton follows the failure arc
- If the automaton reaches its last node, then P is in T
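Both stages together can be sketched as follows. This is an illustrative implementation, not the code behind the slides: the name `kmp_find_all` and the array `fail` (border lengths standing in for failure arcs) are assumptions, and positions are reported 1-indexed.

```python
def kmp_find_all(text: str, pattern: str) -> list[int]:
    """Return all 1-indexed start positions of pattern in text in O(n + m) time."""
    m = len(pattern)
    if m == 0:
        return []
    # Stage 1: fail[j] = border length of pattern[:j] (failure arc of node j).
    fail = [0] * (m + 1)
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j + 1] = k
    # Stage 2: scan the text once; `state` is the current node of the automaton.
    positions = []
    state = 0
    for i, ch in enumerate(text):
        while state > 0 and ch != pattern[state]:
            state = fail[state]          # mismatch: follow the failure arc
        if ch == pattern[state]:
            state += 1                   # match: follow the success arc
        if state == m:                   # last node reached: occurrence found
            positions.append(i - m + 2)  # 1-indexed start of the occurrence
            state = fail[state]          # keep going to find later occurrences
    return positions


print(kmp_find_all("cababccdddddababccc", "ababc"))  # [2, 13]
print(kmp_find_all("cabaabadddababac", "aba"))       # [2, 5, 11, 13]
```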


KMP

Example: P = ababc, T = cabab . . .

If T's next character is:

- c - success arc to node 5, and P is in T
- a - failure arc to node 2 (border of abab)
- anything else


Suffix Tree

Indexing problem:

- Same as the pattern matching problem, except that it searches for any number of pattern strings in one text string
- Therefore, we want queries to be answered in time proportional to the (short) patterns P, rather than the (long) text T


Suffix Tree

Figure: ST for T of xabxac

Stage 1 of solution for indexing problem: construct suffix tree (ST), which is a compacted trie of all suffixes of T

- O(n) construction, when the alphabet size is constant, as per Weiner (1973), McCreight (1976), and Ukkonen (1995)


Suffix Tree

Stage 2 of solution for indexing problem: O(m) to decide whether P is in T:

- Start at the root
- Use the characters of P to follow a path in the ST
- If the walk gets stuck, then P is not in T; if it does not get stuck, then P is in T


Suffix Tree

T = xabxac

- P = xab is in T

- P = xaa is not in T

Figure: ST for T of xabxac
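The query walk can be tried on a simplified stand-in for the suffix tree. The sketch below builds an uncompacted suffix trie out of nested dictionaries, which takes O(n^2) time and space rather than the O(n) of the cited constructions; only the O(m) query behaves like the real structure. The names `build_suffix_trie` and `occurs` are assumptions.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a trie of nested dicts (quadratic size)."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs(trie: dict, pattern: str) -> bool:
    """Walk from the root along the characters of pattern; O(m) dict lookups."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False  # the walk got stuck: pattern is not in the text
        node = node[ch]
    return True           # the walk never got stuck: pattern is in the text


trie = build_suffix_trie("xabxac")
print(occurs(trie, "xab"))  # True
print(occurs(trie, "xaa"))  # False
```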


Suffix Tree

- Use a Generalized Suffix Tree (GST) to index several texts
- Numbers at a leaf = text, starting position of the suffix in that text

Figure: GST for T1 = xabxa$ and T2 = babxba$


Suffix Tree

- LCP(v, w) = length of the longest common prefix of strings v and w
- LCA (lowest common ancestor) of tree nodes v and w = lowest node with both v and w as descendants
- Harel and Tarjan (1984): preprocess a tree with n nodes in O(n) time; following that, LCA queries can be answered in constant time
- The solution to the LCP problem on strings is the same as the solution to the LCA problem on the ST (to be shown)
- Therefore, the ST solves LCP with the same runtime


Suffix Tree

The solution to the LCP problem on strings is the same as the solution to the LCA problem on the ST

- LCA(6, 9) = 10; LCP(babba$, bba$) = b
- LCA(1, 2) = 3; LCP(ababba$, abba$) = ab
- LCA(5, 9) = root; LCP(a$, bba$) = empty string

Figure: ST for T of ababba$
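The three LCP values can be checked by direct character comparison. Note that this sketch does not use the constant-time LCA technique; it only computes the string that the path label of the LCA node would spell out. The function name `lcp` is an assumption, and it returns the common prefix itself, as the examples do.

```python
def lcp(v: str, w: str) -> str:
    """Longest common prefix of v and w, found by comparing characters in order."""
    length = 0
    while length < len(v) and length < len(w) and v[length] == w[length]:
        length += 1
    return v[:length]


print(repr(lcp("babba$", "bba$")))    # 'b'
print(repr(lcp("ababba$", "abba$")))  # 'ab'
print(repr(lcp("a$", "bba$")))        # ''
```

Each call costs time proportional to the LCP length, whereas the preprocessed suffix tree answers the same query in constant time.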


Tandem Repeats

- DNA contains consecutive copies of a pattern, e.g. cagcagcag
- Known as a tandem repeat (TR)
- Identification of TRs is useful
  - Linked to over thirty hereditary disorders in humans
  - Used as genetic markers in DNA fingerprinting, mapping genes, comparative genomics, and evolution studies
  - Used in population studies and conservation biology


Exact Tandem Repeat

- A string r = a_1 ... a_n is periodic if there exists some p such that a_i = a_{i+p} for all i with 1 ≤ i and i + p ≤ n
- r = u^h u′, where u′ is a (possibly empty) prefix of u
- u or |u| = period of r
- abc abc abc abc vs. abcabc abcabc
  - h = 4, u = abc, |u| = 3 vs. h = 2, u = abcabc, |u| = 6
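The decomposition r = u^h u′ can be computed by brute force. This is a quadratic sketch that takes the smallest valid p, so it returns h = 4 and u = abc for the example rather than h = 2 and u = abcabc. The names `smallest_period` and `decompose` are assumptions.

```python
def smallest_period(r: str) -> int:
    """Smallest p >= 1 with r[i] == r[i + p] for every valid i (p = len(r) always works)."""
    n = len(r)
    for p in range(1, n + 1):
        if all(r[i] == r[i + p] for i in range(n - p)):
            return p
    raise ValueError("r must be non-empty")


def decompose(r: str) -> tuple[str, int, str]:
    """Return (u, h, u_prime) such that r == u * h + u_prime and u_prime is a prefix of u."""
    p = smallest_period(r)
    h, rest = divmod(len(r), p)
    return r[:p], h, r[len(r) - rest:]


print(decompose("abcabcabcabc"))  # ('abc', 4, '')
print(decompose("abcdabcdab"))    # ('abcd', 2, 'ab')
print(decompose("abcd"))          # ('abcd', 1, '')
```

A result with h = 1, as in the last call, means the string has no period of at most half its length.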


Exact Tandem Repeat

- Primitive
  - If not periodic: e.g., abcd
  - If periodic: last period is partial - e.g., abcdabcdab
- The period = the primitive period
  - abc abc abc abc - not abcabc abcabc
  - h = 4, u = abc, |u| = 3 - not h = 2, u = abcabc, |u| = 6
- A periodic string is also called a repetition or exact tandem repeat (ETR) - a repeat, for short.


Exact Tandem Repeat

- r = caaaad or cababababad
  - Periodic in all characters but the first and last
- Search r for repeats - in particular, for maximal repeats = periodicity cannot be extended right or left by even one character
- r = caaaad:
  - repeat aaaa is maximal
  - repeats aa and aaa are not maximal
- The repeat = the maximal repeat
- Algorithms to find repeats in a string r of length n:
  - O(n log n) - Crochemore (1981), Apostolico (1983), Main and Lorentz (1984)
  - O(n) - KK = Kolpakov and Kucherov (1999)
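Maximal repeats can be listed by brute force on short strings. The sketch below is a naive polynomial-time scan, not any of the cited O(n log n) or O(n) algorithms. It assumes a repeat must contain at least two full periods, and the names `has_period` and `maximal_repeats` are illustrative.

```python
def has_period(s: str, p: int) -> bool:
    """True if s[i] == s[i + p] for every valid i."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))


def maximal_repeats(s: str) -> list[tuple[int, int, int]]:
    """Return (start, end, period) triples, 0-indexed with end exclusive."""
    n = len(s)
    found = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if s[i] != s[i + p]:
                i += 1
                continue
            start = i
            # Extend to the right while the periodicity continues.
            while i + p < n and s[i] == s[i + p]:
                i += 1
            end = i + p
            block = s[start:end]
            # Keep the block only if p is its smallest (primitive) period.
            is_primitive_period = not any(has_period(block, q) for q in range(1, p))
            if end - start >= 2 * p and is_primitive_period:
                found.append((start, end, p))
    return found


print(maximal_repeats("caaaad"))       # [(1, 5, 1)]
print(maximal_repeats("cababababad"))  # [(1, 10, 2)]
```

For caaaad only aaaa is reported; aa and aaa are not maximal, and aaaa viewed with period 2 is rejected because its primitive period is 1.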


Approximate Tandem Repeat

- DNA has mutations, translocations, reversals, etc.
- Approximate tandem repeat or ATR
- Metrics include:
  - Hamming distance - mismatch
    - one character substituted by another character
    - 1 mismatch between abc and abd
  - Edit distance - error
    - one character substituted by another character, or one character inserted or deleted
    - 2 errors between abc and abdd
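Both metrics are short to write down. The sketch below uses the standard dynamic program for edit distance with unit costs; the function names are assumptions.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


print(hamming_distance("abc", "abd"))  # 1
print(edit_distance("abc", "abdd"))    # 2
```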


Approximate Tandem Repeat

Search for approximate tandem repeat

Types:

- Group 1: neighboring - k-repetition, k-run, k-edit repeat
  - Each occurrence is similar to adjacent (or neighboring) occurrences
- Group 2: consensus - consensus model, k-MAR
  - Each occurrence is similar to a consensus string


Group 1: neighboring - k-repetition

Definition (KK, 2003): A string r[1 ... n] is called a k-repetition of period p, p ≤ n/2, iff the Hamming distance between r[1 ... n − p] and r[p + 1 ... n] is ≤ k.

- ataa atta ctta ct is a two-repetition of period four (a total of two mismatches, when comparing each occurrence to the next occurrence)
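The definition translates directly into a check: count the positions i where r[i] differs from r[i + p]. The sketch below only tests a given string against a given period; it does not search a longer text for k-repetitions, and the function names are assumptions.

```python
def mismatches_at_shift(r: str, p: int) -> int:
    """Hamming distance between r[1 .. n-p] and r[p+1 .. n] (1-indexed notation)."""
    return sum(r[i] != r[i + p] for i in range(len(r) - p))


def is_k_repetition(r: str, p: int, k: int) -> bool:
    """True if r is a k-repetition of period p under the definition above."""
    return 1 <= p <= len(r) / 2 and mismatches_at_shift(r, p) <= k


r = "ataa" "atta" "ctta" "ct"
print(mismatches_at_shift(r, 4))  # 2
print(is_k_repetition(r, 4, 2))   # True
print(is_k_repetition(r, 4, 1))   # False
```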


Group 1: neighboring - k-run

Definition (KK, 2003): A string r[1 ... n] is called a k-run of period p, p ≤ n/2, iff for every i ∈ [1 ... n − 2p + 1], the Hamming distance between r[i ... i + p − 1] and r[i + p ... i + 2p − 1] is ≤ k.

One-run, of period four (up to one mismatch, when comparing each occurrence to the next occurrence):

- ataa atta ctta ggta ag - four mismatches
- ataa atta atta atta at - one mismatch
- Total number of mismatches unbounded
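A k-run bounds the mismatches inside every window of p consecutive positions rather than over the whole string. The sketch below applies the definition literally with a sliding window. The name `is_k_run` is an assumption, and the second test string is a made-up example, not taken from the slides.

```python
def is_k_run(r: str, p: int, k: int) -> bool:
    """True if every length-p window of r has at most k mismatches against its shift by p."""
    n = len(r)
    if p < 1 or 2 * p > n:
        return False
    # mism[i] is True when position i disagrees with position i + p.
    mism = [r[i] != r[i + p] for i in range(n - p)]
    window = sum(mism[:p])  # mismatches in the first window
    if window > k:
        return False
    for i in range(p, n - p):
        window += mism[i] - mism[i - p]  # slide the window one position right
        if window > k:
            return False
    return True


print(is_k_run("ataaattaattaattaat", 4, 1))  # True (one mismatch overall)
# Hypothetical example: three mismatches in total, never two in one window.
print(is_k_run("aaaabaaabbaabbba", 4, 1))    # True
print(is_k_run("aaaabaaabbaabbba", 4, 0))    # False
```

The hypothetical string is a one-run but not a one-repetition, since its total mismatch count is three.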


Group 1: neighboring - k-edit repeat

Definition (Sokol 2007, Sokol 2013): A string r is called a k-edit repeat if it can be partitioned into consecutive substrings, r = u′′ u_1 u_2 ... u_l u′, l ≥ 2, such that the sum of edit distance errors of each occurrence with its neighbor does not exceed k.

- k-edit repeat = k-repetition, except that it uses edit distance
- caagct ccagct ccgc is a two-edit repeat
  - ccgc is a partial period, so it is not counted as a deletion of t
  - with edit distance, the period size is not fixed and so it is not given
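The error count for the example can be reproduced under some simplifying assumptions that are not taken from the cited papers: the partition into occurrences is given as input, the leading partial period u′′ is empty, and the trailing partial period u′ is charged its edit distance to the closest prefix of the last full occurrence. All names are illustrative.

```python
def last_dp_row(a: str, b: str) -> list[int]:
    """row[j] = edit distance between a and the prefix b[:j]."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev


def k_edit_errors(units: list[str], partial: str = "") -> int:
    """Sum of edit distances between neighbouring occurrences (assumed scoring)."""
    total = sum(last_dp_row(u, v)[-1] for u, v in zip(units, units[1:]))
    if partial:
        # Assumption: compare the partial period with the best-matching prefix
        # of the last full occurrence, so missing trailing characters are free.
        total += min(last_dp_row(partial, units[-1]))
    return total


print(k_edit_errors(["caagct", "ccagct"], "ccgc"))  # 2
```

One error comes from caagct vs. ccagct (a substituted by c) and one from ccgc vs. the prefix ccagc (a deleted); the missing final t is not charged.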


Group 2: consensus - consensus model

Definition (Sagot and Myers, 1998): Each occurrence is called a wagon, and the sequence of wagons is a train. A prefix model is when the total number of edit distance errors, from each wagon to the consensus string, does not exceed k. Jumps of a certain size are allowed (a jump is from the left of one wagon to the left of the next wagon). A consensus model is a prefix model, with an additional allowance: gaps of a certain size are allowed (a gap is from the right of one wagon to the left of the next wagon).


Group 2: consensus - consensus model

- Search for trains in satellite DNA (a type of DNA that is located near the region of the centromere and is involved in chromosome segregation during mitosis and meiosis)
- Algorithm outputs any number of trains, where all of the following hold true:
  - there are at least min_repeat wagons in a train, and each wagon's length is between min_range and max_range (where min_repeat, min_range, and max_range are inputs)
  - there are at most k = 15-20% edit distance errors from each wagon to the consensus string (where k is an input)
  - jumps and gaps are of allowed sizes


Group 2: consensus - consensus model

Stage one: filtering = which substrings likely contain repeats

- Sum the values in the upper right region of the local alignment dynamic programming matrix (Smith-Waterman, 1981)
- ETRs tend to have a high sum - e.g., gacgac - sum of 20
- non-ETRs tend not to - e.g., gtcagt - sum of 9

      g  a  c  g  a  c
   0  0  0  0  0  0  0
g  0  1  0  0  1  0  0
a  0  0  2  1  0  2  1
c  0  0  1  3  2  1  3
g  0  1  0  2  4  3  2
a  0  0  2  1  3  5  4
c  0  0  1  3  2  4  6

      g  t  c  a  g  t
   0  0  0  0  0  0  0
g  0  1  0  0  0  1  0
t  0  0  2  0  0  0  2
c  0  0  0  3  1  0  0
a  0  0  0  1  4  2  0
g  0  1  0  0  2  5  3
t  0  0  2  0  0  3  6
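The filter score can be recomputed by aligning a string against itself and summing the cells strictly above the main diagonal. The scoring values are not stated on the slide; the ones below are inferred from the two matrices, which appear to use match +1 and mismatch −1, with a gap penalty of 1 in the first matrix and 2 in the second. Function and parameter names are assumptions.

```python
def self_alignment_matrix(s: str, match: int = 1, mismatch: int = 1,
                          gap: int = 1) -> list[list[int]]:
    """Smith-Waterman local alignment matrix of s against itself."""
    n = len(s)
    h = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            diag = h[i - 1][j - 1] + (match if s[i - 1] == s[j - 1] else -mismatch)
            # Local alignment: scores never drop below zero.
            h[i][j] = max(0, diag, h[i - 1][j] - gap, h[i][j - 1] - gap)
    return h


def upper_right_sum(s: str, match: int = 1, mismatch: int = 1, gap: int = 1) -> int:
    """Sum of the matrix cells strictly above the main diagonal."""
    h = self_alignment_matrix(s, match, mismatch, gap)
    n = len(s)
    return sum(h[i][j] for i in range(1, n + 1) for j in range(i + 1, n + 1))


print(upper_right_sum("gacgac", gap=1))  # 20
print(upper_right_sum("gtcagt", gap=2))  # 9
```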

Stage two: brute-force search of substrings that were not filtered out

- Set L_m stores left-end positions of wagons known to be valid for consensus string m
- Attempt to find another valid consensus string by taking this previously good consensus string m and inserting to left, substituting on left, or deleting on left

Group 2: consensus - k-MAR

Definition (Amit, 2013): A k-maximal approximate run (or k-MAR) is a non-empty string r, such that the modification of at most k characters in r generates a maximal (exact) run. A modification is a substitution of a single character with a different single character.

- aba abc aba - one-MAR of period three and three occurrences
- ab ab ab ac ac ab - two-MAR of period two and six occurrences
- k-MAR is consensus, since the string will be modified in order to match the consensus string
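
The two examples can be checked with a small sketch. It computes the fewest substitutions that make a string periodic with a given period, by taking the most frequent character in each residue class. It deliberately ignores maximality and does not search over periods, so it is a partial check under those assumptions rather than the k-MAR algorithm itself; the function names are hypothetical.

```python
from collections import Counter


def substitutions_to_period(r, p):
    """Fewest single-character substitutions that make r periodic with period p."""
    total = 0
    for offset in range(p):
        column = r[offset::p]  # characters that must all agree
        if column:
            total += len(column) - max(Counter(column).values())
    return total


def within_k_of_run(r, p, k):
    """True if r covers at least two periods and needs at most k substitutions."""
    return len(r) >= 2 * p and substitutions_to_period(r, p) <= k


print(substitutions_to_period("abaabcaba", 3))     # 1
print(substitutions_to_period("abababacacab", 2))  # 2
```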

Approximate Tandem Repeat

Figure: Alternative Grouping of ATRs

1. Hamming distance mismatches vs. edit distance errors

2. Overall similarity vs. evolutive: either all occurrences must be similar to all others, or the string is evolutive, meaning that only adjacent occurrences must be similar (occurrences may evolve so much that distant ones bear little resemblance)

- Consensus model - occurrences must be somewhat similar to one another, since all occurrences are similar to the consensus string
- k-MARs are unusual - even though grouped as consensus in the original grouping, they do not require similar occurrences (if all k mismatches fall in a few occurrences, those few occurrences may be dissimilar to the others)

3. Counting variations: total count (count over all occurrences) or individual count (counter is reset for each occurrence)

4. Σv = total number of variations: bounded (Σv does not exceed k, a fixed number known in advance) or unbounded (Σv depends on the number of occurrences, so Σv has no bound known in advance)

- The above two seem similar, but are different
  - k-repetition, k-edit repeat, and k-MAR - total count, Σv bounded by k
  - k-run - individual count, Σv unbounded (k mismatches allowed for each occurrence, so Σv unbounded)
  - Consensus model is unusual - individual counter (counter is reset), yet Σv bounded (algorithm works when ≤ 15-20% edit distance errors between input string and consensus string)
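
The distinctions in criteria 1, 3, and 4 can be made concrete with a short sketch. It contrasts Hamming distance with edit distance, and a total count with an individual (per-occurrence) count. Comparing each occurrence against one fixed reference string is a simplifying assumption for illustration (neighboring models compare adjacent occurrences instead), and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def passes_total_bound(counts, k):
    """Total count: variations summed over all occurrences must not exceed k."""
    return sum(counts) <= k


def passes_individual_bound(counts, k):
    """Individual count: the counter is reset, so each occurrence may use up to k."""
    return all(c <= k for c in counts)


occurrences = ["ab", "ab", "ab", "ac", "ac", "ab"]
counts = [hamming(o, "ab") for o in occurrences]
print(counts, sum(counts))                                    # [0, 0, 0, 1, 1, 0] 2
print(passes_total_bound(counts, 1))                          # False
print(passes_individual_bound(counts, 1))                     # True
print(hamming("acat", "cata"), edit_distance("acat", "cata"))  # 4 2
```

With an individual count, the total grows with the number of occurrences, which is why Σv has no bound known in advance in that case.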

Related Research

Nested Tandem Repeat (NTR) = two distinct TRs, where one is nested within the other - e.g., NTR of the form XxxxXxxxxXxxXxxXxXxxx, such as:

CTTGAGATT ACAT ACAT ACAT
CTTGAGATT ACAT ACAT ACAT ACAT
CTTGAGATT ACAT ACAT
CTTGAGATT ACAT ACAT
CTTGAGATT ACAT
CTTGAGATT ACAT ACAT ACAT
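
As a rough illustration of the XxxxXxxxx... form, the sketch below decomposes the example into outer units and counts the inner copies that follow each one. It assumes the outer motif X and inner motif x are already known and exact, which a real NTR search would not assume; the function name is hypothetical.

```python
import re


def inner_copy_counts(s, outer, inner):
    """For each occurrence of outer, count the copies of inner that follow it."""
    pattern = re.compile(re.escape(outer) + "((?:" + re.escape(inner) + ")*)")
    return [len(m.group(1)) // len(inner) for m in pattern.finditer(s)]


example = (
    "CTTGAGATT ACAT ACAT ACAT "
    "CTTGAGATT ACAT ACAT ACAT ACAT "
    "CTTGAGATT ACAT ACAT "
    "CTTGAGATT ACAT ACAT "
    "CTTGAGATT ACAT "
    "CTTGAGATT ACAT ACAT ACAT"
).replace(" ", "")

print(inner_copy_counts(example, "CTTGAGATT", "ACAT"))  # [3, 4, 2, 2, 1, 3]
```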

(Amir, 2010 and Eisenberg, 2012) Period Recovery Problem:

- String r is periodic, but unknown.
- INPUT: string r' = r corrupted with up to one edit distance error per occurrence, on average.
- OUTPUT: What is r's period? Identify the candidate set.
- Example: r' = abcabdabeabcabcabe:
  - abcabdabe (abcabdabe abcabcabe)
  - abcabcabe (abcabdabe abcabcabe)
  - abc (abc abd abe abc abc abe)
- Example: r' = abbabdacc:
  a. abc (abb abd acc)
  b. abd (abb abd acc)
  c. abb (abb abd acc)
  d. not acc (abb abd acc)
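
The candidate test in these examples can be sketched under a strong simplification: only substitutions (Hamming mismatches) are considered, whereas the problem statement allows general edit distance errors. A period is accepted when the total number of mismatches does not exceed the number of occurrences, that is, at most one error per occurrence on average. The function names are hypothetical.

```python
def total_mismatches(r_prime, period):
    """Mismatches between r_prime and the period repeated along its length."""
    return sum(ch != period[i % len(period)] for i, ch in enumerate(r_prime))


def is_candidate(r_prime, period):
    """Substitution-only check: at most one mismatch per occurrence on average."""
    occurrences = len(r_prime) / len(period)
    return total_mismatches(r_prime, period) <= occurrences


for p in ["abc", "abd", "abb", "acc"]:
    print(p, total_mismatches("abbabdacc", p), is_candidate("abbabdacc", p))
# abc 3 True
# abd 3 True
# abb 3 True
# acc 4 False
```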

Palindromes

String r = u a u^R is a palindrome (u^R is the reverse of u)

- radius of palindrome = length of one arm
- diameter of palindrome = sum of lengths of arms

- "The palindrome" refers to the maximal one
  - r = abababa
  - the palindrome is abababa
  - not maximal: aba
- Search string for substrings that are palindromes
- One common approach is to find the longest palindrome centered at each position

Linear-time Algorithm #1 for (Exact) Palindromes

- Manacher (1975): first linear-time algorithm to find (exact) palindromes

position        1  2  3  4  5  6  7  8  9 10 11 12
character       b  a  b  c  b  a  b  c  b  a  c  c
algorithm    0  1  3  1  7  1  9  1  5  1  1  1  1  0

Table: Longest palindrome centered at each position

- Intuition is that if r = u a u^R, then the lengths of palindromes centered at positions of the left arm, u, will mostly be a mirror image of the lengths of palindromes centered at positions of the right arm, u^R. Therefore, calculations of values in the right arm are based on previously calculated values of the left arm.
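
The mirror-image idea can be sketched as follows. This is a common textbook formulation of Manacher's algorithm restricted to palindromes centered on a character (odd length), which is what the twelve interior values of the table correspond to; it is not a transcription of the 1975 paper, and the function name is hypothetical.

```python
def odd_palindrome_lengths(s):
    """Length of the longest odd-length palindrome centered at each character."""
    n = len(s)
    radius = [0] * n        # characters matched on each side of the center
    left, right = 0, -1     # bounds of the rightmost palindrome found so far
    for i in range(n):
        if i <= right:
            mirror = left + (right - i)
            # Reuse the value at the mirrored center, clipped to what is known.
            k = min(radius[mirror], right - i)
        else:
            k = 0
        # Extend by direct comparison beyond the known region.
        while i - k - 1 >= 0 and i + k + 1 < n and s[i - k - 1] == s[i + k + 1]:
            k += 1
        radius[i] = k
        if i + k > right:
            left, right = i - k, i + k
    return [2 * r + 1 for r in radius]


print(odd_palindrome_lengths("babcbabcbacc"))
# [1, 3, 1, 7, 1, 9, 1, 5, 1, 1, 1, 1]
```

The right bound only ever moves forward, and every successful comparison in the extension loop moves it, which is the usual argument for linear running time.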

Linear-time Algorithm #2 for (Exact) Palindromes

- Create, in linear time, the generalized suffix tree (GST) of r and of r^R. Preprocess in order to be able to answer an LCA query (which, for strings, is an LCP query) in constant time.
- For each position, perform a type of LCP query called a longest common extension (LCE) query (see table below).
- LCE query for position x = how many positions on the right of x match positions on the left of x.
- If any LCP query results in a non-zero value, then there is a palindrome, with radius equal to that value, centered there.

LCP queries:

- abccba - palindrome with radius three (diameter six)
- abbaba - palindrome with radius two (diameter four)

r     abccba   bccba  ccba  cba  ba    a
r^R   abccba   a      ba    cba  ccba  bccba
LCP            0      0     3    0     0

r     abbaba   bbaba  baba  aba  ba    a
r^R   ababba   a      ba    bba  abba  babba
LCP            0      2     0    0     0

- cabbad - palindrome with radius two (diameter four)
- abcdba - no palindrome, since every LCP query returns zero

r     cabbad   abbad  bbad  bad  ad    d
r^R   dabbac   c      ac    bac  bbac  abbac
LCP            0      0     2    0     0

r     abcdba   bcdba  cdba  dba  ba    a
r^R   abdcba   a      ba    cba  dcba  bdcba
LCP            0      0     0    0     0
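
The LCP rows in these tables can be reproduced with a naive sketch. Each query here is answered by direct character comparison, so a query costs linear time; the linear-time algorithm instead answers each query in constant time using the suffix tree with LCA preprocessing. Like the tables, the sketch covers only centers that fall between two characters. The function names are hypothetical.

```python
def lce(a, b):
    """Length of the longest common prefix of a and b, by direct comparison."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def even_palindrome_radii(r):
    """For each gap between characters, match the right side against the reversed left side."""
    return [lce(r[x:], r[:x][::-1]) for x in range(1, len(r))]


for text in ["abccba", "abbaba", "cabbad", "abcdba"]:
    print(text, even_palindrome_radii(text))
# abccba [0, 0, 3, 0, 0]
# abbaba [0, 2, 0, 0, 0]
# cabbad [0, 0, 2, 0, 0]
# abcdba [0, 0, 0, 0, 0]
```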

Palstar

Palstars = concatenation of palindromes

- KMP (1977): concatenation of even-length palindromes: a string r is a palstar iff r = a_1, ..., a_n, where each a_i, 1 ≤ i ≤ n, is a palindrome of even length.
  - Example: string of length zero, abba, abbacc, not abaabba
- Galil (1978): a string r is a palstar iff r = a_1, ..., a_n, where each a_i, 1 ≤ i ≤ n, is a palindrome, of any length (even or odd), and |r| ≥ 2.
  - Example: not string of length zero, not abba, abbacc, abaabba
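
The KMP (1977) definition can be checked with a simple dynamic program. This is a quadratic-time illustration, not the linear-time recognition algorithm from the literature, and it assumes each factor a_i is non-empty, with the empty string counting as a concatenation of zero factors. The function names are hypothetical.

```python
def is_even_palindrome(s):
    """True for a non-empty palindrome of even length."""
    return len(s) > 0 and len(s) % 2 == 0 and s == s[::-1]


def is_even_palstar(r):
    """True if r is a concatenation of even-length palindromes."""
    n = len(r)
    ok = [False] * (n + 1)  # ok[i]: the prefix r[:i] is an even palstar
    ok[0] = True
    for end in range(2, n + 1, 2):
        for start in range(0, end, 2):
            if ok[start] and is_even_palindrome(r[start:end]):
                ok[end] = True
                break
    return ok[n]


for text in ["", "abba", "abbacc", "abaabba"]:
    print(repr(text), is_even_palstar(text))
# '' True
# 'abba' True
# 'abbacc' True
# 'abaabba' False
```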

Gapped Palindrome

(KK, 2008) Gapped palindrome = u a u^R, |a| ≥ 2

Types:

1. long-armed: |u| > |a| - e.g., abbbcddbbba
2. length-constrained: |u| and |a| constrained by input sizes
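
The long-armed example can be decomposed with a small sketch. It takes a whole string as the gapped palindrome and greedily extends the arms from both ends, which yields the longest arms and the shortest gap; searching a longer text for such substrings is a separate problem that this does not address. The function names are hypothetical.

```python
def split_gapped_palindrome(s):
    """Split s into (u, a, reversed u) with the arms as long as possible."""
    n = len(s)
    m = 0
    while 2 * (m + 1) <= n and s[m] == s[n - 1 - m]:
        m += 1
    return s[:m], s[m:n - m], s[n - m:]


def is_long_armed(s):
    """Gapped palindrome (|a| >= 2) whose arm is longer than its gap."""
    u, a, _ = split_gapped_palindrome(s)
    return len(a) >= 2 and len(u) > len(a)


print(split_gapped_palindrome("abbbcddbbba"))  # ('abbb', 'cdd', 'bbba')
print(is_long_armed("abbbcddbbba"))            # True
```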

Approximate Palindromes in RLE-compressed Data

Run-length encoding (RLE): encode string as runs of consecutive and identical symbols

- bb cccc e dd aaaaa =
  - b2c2c2e1d2a2a2a1 (not optimal RLE)
  - b2c4e1d2a5 (optimal RLE)

Algorithm from Chen (2012):

- INPUT: string r of length n in optimal RLE form, specified center position c, allowed number of Hamming distance mismatches k
- OUTPUT: radius of approximate (≤ k Hamming distance mismatches) palindrome in r centered at c
- Without decompressing
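
The optimal encoding in the b2c4e1d2a5 example merges every maximal run into a single pair. A minimal sketch using the standard library follows; the function names and the textual output format are illustrative choices.

```python
from itertools import groupby


def rle_encode(text):
    """Optimal RLE: one (symbol, count) pair per maximal run."""
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(text)]


def rle_format(runs):
    """Render runs in the b2c4... notation used on the slides."""
    return "".join(f"{ch}{count}" for ch, count in runs)


print(rle_format(rle_encode("bbcccceddaaaaa")))  # b2c4e1d2a5
```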

Preprocessing: construct array A[1..m], where m = number of runs in the compressed string. Array elements are the start positions of the runs, when decompressed.

position     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
character    a  a  a  b  a  a  c  c  a  a  a  b  c  c  a  a  b  b
run         X1       X2 X3    X4    X5       X6 X7    X8    X9
RLE(r)      a3       b1 a2    c2    a3       b1 c2    a2    b2

Table: A for r = aaabaaccaaabccaabb. There are m = 9 runs.

LCP(RLE(r), 5, 6) = 2, since X4 = X7 and X3 = X8 (same character and same number of that character), but X2 ≠ X9
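
The array A and the run-level LCP query can be sketched as follows. The LCP here is computed by walking outward run by run, which is only meant to show what the query returns; how the query is answered efficiently is not covered by this sketch. Run indices are 1-based to match the table, and the function names are hypothetical.

```python
from itertools import groupby


def runs_of(text):
    """Optimal RLE of text as a list of (symbol, count) pairs."""
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(text)]


def start_positions(runs):
    """Array A: 1-based start position of each run in the decompressed string."""
    starts, position = [], 1
    for _, length in runs:
        starts.append(position)
        position += length
    return starts


def run_lcp(runs, i, j):
    """Count matching run pairs X(i-1)=X(j+1), X(i-2)=X(j+2), ... (1-based i, j)."""
    count = 0
    left, right = i - 2, j  # 0-based indices of X(i-1) and X(j+1)
    while left >= 0 and right < len(runs) and runs[left] == runs[right]:
        count += 1
        left -= 1
        right += 1
    return count


runs = runs_of("aaabaaccaaabccaabb")
print(start_positions(runs))  # [1, 4, 5, 7, 9, 12, 13, 15, 17]
print(run_lcp(runs, 5, 6))    # 2
```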

- Stage one: binary search on RLE(r), in order to find which run c is in.
- Stage two: iterate, in order to extend the palindrome outward as much as possible.
  - An iteration ends when it hits a run boundary on the left or on the right.
  - If it hits the left and right boundaries simultaneously, then an additional LCP query is performed, so that the algorithm can "jump" over other runs, in order to extend even more.
- End iterating when k mismatches are used or when a string boundary (position 1 or n) is hit.
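
Stage one can be sketched with a standard binary search over the array of run start positions. The array literal below is the one for r = aaabaaccaaabccaabb; treating a center that lies exactly between two runs as belonging to the left run is an assumption of this sketch, and the function name is hypothetical.

```python
from bisect import bisect_right


def run_containing(starts, c):
    """1-based index of the run containing position c (c may be a half-integer)."""
    return bisect_right(starts, c)


A = [1, 4, 5, 7, 9, 12, 13, 15, 17]  # run start positions for aaabaaccaaabccaabb
print(run_containing(A, 10.5))  # 5
```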

INPUT:

- RLE(r) for aaabaaccaaabccaabb = a3ba2c2a3bc2a2b2
- c = 10.5
- k = 1

position     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
character    a  a  a  b  a  a  c  c  a  a  a  b  c  c  a  a  b  b
run         X1       X2 X3    X4    X5       X6 X7    X8    X9

Stage one - binary search: c is in X5

Stage two - iterations:

1. Extend to left and right of position 10.5, to positions 10 and 11, until right hits boundary.
   radius = 1, mismatches = 0

2. Extend to positions 9 and 12, which do not match, so 1 mismatch is found. The left side hits the boundary of X5 at the same time that the right side hits the boundary of X6, so there is a query of LCP(RLE(r), 5, 6), which returns two (as explained above), meaning that two runs will be jumped over, and the next iteration looks at X2 and X9.
   radius = 6, mismatches = 1

3. X2 and X9 contain b. Left hits the boundary of X2 before right hits a boundary.
   radius = 7, mismatches = 1

4. X1 and X9 contain different characters. No more mismatches are allowed, so the algorithm stops.
   radius = 7, mismatches = 1
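
The result of this walkthrough can be cross-checked with a brute-force reference that works on the decompressed string, one character pair at a time. This is exactly what the RLE-based algorithm avoids doing, so the sketch is only a way to validate the expected output (radius 7 with 1 mismatch), not an implementation of that algorithm. Positions are 1-based as on the slides, and the function name is hypothetical.

```python
import math


def approx_palindrome_radius(text, c, k):
    """Radius and mismatches of the widest palindrome at center c with <= k mismatches.

    c is a 1-based position; a half-integer such as 10.5 lies between two characters.
    """
    left = math.ceil(c) - 1    # position just left of the center
    right = math.floor(c) + 1  # position just right of the center
    radius = mismatches = 0
    while left >= 1 and right <= len(text):
        if text[left - 1] != text[right - 1]:
            if mismatches == k:
                break          # no more mismatches allowed
            mismatches += 1
        radius += 1
        left -= 1
        right += 1
    return radius, mismatches


print(approx_palindrome_radius("aaabaaccaaabccaabb", 10.5, 1))  # (7, 1)
```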

Research Directions

Warburton (2004):

- Palindrome = inverted repeat (IR)
  - IR = substring followed by gap and then by substring's reverse = gapped palindrome
  - (and when the gap is of length zero, then IR = palindrome)
- Repeats and palindromes really are similar problems
- We want to apply research from repeats towards palindromes

2D palindrome = square block of text that, when rotated around its center, results in the same square block of text

- Discussed how to find palindromes in RLE-compressed data, without decompressing
- Want to find palindromes (or 2D palindromes) in data that was compressed using other compression schemes, without decompressing