Preobrayensky ergonomía y postura compressed ilovepdf compressed
Finding Characteristic Substrings from Compressed Texts
description
Transcript of Finding Characteristic Substrings from Compressed Texts
![Page 1: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/1.jpg)
Finding Characteristic Substringsfrom Compressed Texts
Shunsuke InenagaKyushu University, Japan
Hideo BannaiKyushu University, Japan
![Page 2: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/2.jpg)
Text Mining and Text Compression
Text mining is a task of finding some rule and/or knowledge from given textual data.
Text compression is to reduce a space to store given textual data by removing redundancy.
compress
decompress
![Page 3: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/3.jpg)
Our Contribution
We present efficient algorithms to find characteristic substrings (patterns) from given compressed strings directly (i.e., without decompression).
Longest repeating substring (LRS)Longest non-overlapping repeating substring (LNRS)Most frequent substring (MFS)Most frequent non-overlapping substring (MFNS)Left and right contexts of given pattern
![Page 4: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/4.jpg)
X1 = aX2 = bX3 = X1X2 X4 = X3X1
X5 = X3X4
X6 = X5X5
X7 = X4X6
X8 = X7X5 T =
SLP T
Text Compression by Straight Line Program
SLP T is a CFG in the Chomsky normal form which generates language {T}.
![Page 5: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/5.jpg)
X1 = aX2 = bX3 = X1X2 X4 = X3X1
X5 = X3X4
X6 = X5X5
X7 = X4X6
X8 = X7X5 T =
SLP T
Text Compression by Straight Line Program
Encodings of the LZ-family, run-length, Sequitur, etc. can quickly be transformed into SLP.
![Page 6: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/6.jpg)
Exponential Compression by SLP
Highly repetitive texts can be exponentially large w.r.t. the corresponding SLP-compressed texts.Text T = ababab…ab (T is an N repetition of ab)
SLP T : X1 = a, X2 = b, X3 = X1X2, X4 = X3X3, X5 = X4X4, ... , Xn = Xn-1Xn-1
N = O(2n)Any algorithms that decompress given SLP-compressed texts can take exponential time!We present efficient (i.e., polynomial-time) algorithms without decompression.
![Page 7: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/7.jpg)
Finding Longest Repeating Substring
Input: SLP T which generates text TOutput: A longest repeating substring (LRS) of T
T≠
≠
T = aabaabcabaabb
Example
![Page 8: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/8.jpg)
Key Observation – 6 Cases of Occurrences of LRS
Xi
Xl Xr
Xi
Xl Xr
Xi
Xl Xr
Xi
Xl Xr
Xi
Xl Xr
Case 1
Xi
Xl Xr
Case 2 Case 3
Case 6Case 5Case 4
![Page 9: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/9.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 10: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/10.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 1
![Page 11: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/11.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 1
![Page 12: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/12.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 13: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/13.jpg)
Xi
Xl Xr
Case 2
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 14: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/14.jpg)
Xi
Xl Xr
Case 2
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 15: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/15.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 16: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/16.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 3
LRS of Xi of Case 3 is
the longest common substring of Xl and
Xr.
![Page 17: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/17.jpg)
Longest Common Substring of Two SLPs
Theorem 1 [Matsubara et al. 2009]
For every pair of variables Xl and Xr, we can compute a longest common substring of Xl and Xr in total of O(n4logn) time.
n is num. of variables in SLP T
![Page 18: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/18.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 4
![Page 19: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/19.jpg)
Xi
Case 4
Xl XrXj
![Page 20: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/20.jpg)
Xj
Xi
Case 4-1
Xl Xr
Xj
XtXs
![Page 21: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/21.jpg)
Case 4-1
Xl
Xt
Overlap of Xl and
Xt
![Page 22: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/22.jpg)
Xj
Xi
Case 4-1
Xl Xr
Xj
XtXs
Overlap of Xl and
Xt
Expand overlap
![Page 23: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/23.jpg)
Xj
Xi
Case 4-2
Xl Xr
Xj
XtXs
Overlap of Xr and
Xs
Expand overlap
![Page 24: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/24.jpg)
Set of Overlaps
X
Y
Set of length of overlaps of X and Y
![Page 25: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/25.jpg)
a a b a a b aa b a a b a b b
a b a a b a b b
a b a a b a b b
Set of Overlaps
OL(aabaaba, abaababb) = {1, 3, 6}
X
Y
Y
Y
![Page 26: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/26.jpg)
Set of Overlaps
Lemma 1 [Kaprinski et al. 1997]
For every pair of variables Xi and Yj,OL(Xi, Yj) forms O(n) arithmetic progressions.
Lemma 2 [Kaprinski et al. 1997]
For every pair of variables Xi and Yj,OL(Xi, Yj) can be computed in total of O(n4logn)
time.
n is num. of variables in SLP T
![Page 27: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/27.jpg)
Case 4
Lemma 3
For every variable Xi, a longest repeating substring
in Case 4 can be computed in O(n3logn) time.[Sketch of proof]•We can expand all elements of each arithmetic progression of OL(Xi, Xj) in O(nlogn) time.•The size of OL(Xi, Xj) is O(n) by Lemma 1.•There are at most n-1 descendants Xj of Xi.
![Page 28: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/28.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 5
Symmetric to Case 4
![Page 29: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/29.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Similarly to Case 4
Xl Xr
Case 6
Xi
![Page 30: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/30.jpg)
Finding Longest Repeating Substring
Theorem 2
For any SLP T which generates text T, we can compute an LRS of T in O(n4logn) time.
n is num. of variables in SLP T
![Page 31: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/31.jpg)
Finding Longest Non-Overlapping Repeating Substring
Input: SLP T which generates text TOutput: A longest non-overlapping repeating substring (LNRS) of T
T = ababababab
Example
LRS of T is ababababLRNS of T is abab
![Page 32: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/32.jpg)
Finding Longest Non-Overlapping Repeating Substring
Theorem 3
For any SLP T which generates text T, we can compute an LNRS of T in O(n6logn) time.
n is num. of variables in SLP T
![Page 33: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/33.jpg)
Finding Most Frequent Substring
Input: SLP T which generates text TOutput: A most frequent substring (MFS) of T
The solution is always the empty
string e.
![Page 34: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/34.jpg)
Finding Most Frequent Substring
Input: SLP T which generates text TOutput: A most frequent substring (MFS) of T of length 2
T
![Page 35: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/35.jpg)
Algorithm to Compute MFS
Input: SLP TOutput: MFS of text Tforeach substring P of T of length 2 do
construct an SLP P which generates substring P;compute num. of occurrences of P in T;
return substring of maximum num. of occurrences;
|S|2 substrings of length 2
Y3
Y1 Y2
a b
Lemma 4For every pair of variables Xi and Yj, the number of
occurrences of Yj in Xi can be computed in total of O(n2) time.
![Page 36: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/36.jpg)
Finding Most Frequent Substring
Theorem 4
For any SLP T which generates text T, we can compute an MFS of T of length 2 in O(|S|2n2) time.
n is num. of variables in SLP T
![Page 37: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/37.jpg)
T = aaaaababab
Example
Finding Most Frequent Non-Overlapping Substring
Input: SLP T which generates text TOutput: A most frequent non-overlapping substring (MFNS) of T of length 2
MFS of T of length 2 is aaMFNS of T of length 2 is ab
![Page 38: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/38.jpg)
Finding Most Frequent Non-Overlapping Substring
Theorem 5
For any SLP T which generates text T, we can compute an MFNS of T of length 2 in O(n4logn) time.
n is num. of variables in SLP T
![Page 39: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/39.jpg)
T = bbaabaabbaabbExample
Computing Left and Right Contexts of Given Pattern
Input: Two SLPs T and P which generate text T and pattern P, respectivelyOutput: Substring aPb of T such that
a (resp. b) always precedes (resp. follows) P in Ta and b are as long as possible
P = aba = bab = e
![Page 40: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/40.jpg)
Computing Left and Right Contexts of Given Pattern
Examples of applications of computing left and right contexts of patterns are:
Blog spam detection [Narisawa et al. 2007]Compute maximal extension of most frequent substrings (MFS)
T
![Page 41: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/41.jpg)
Boundary Lemma [1/2]
Lemma 5 [Miyazaki et al. 1997]For any SLP variables X = XlXr and Y, the
occurrences of Y that touch or cover the boundary of X form a single arithmetic progression.
X
Xl Xr
YY
Y
![Page 42: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/42.jpg)
u u v
u u v
u u v
Boundary Lemma [2/2]
Lemma 5 [Miyazaki et al. 1997](Cont.) If the number of elements in the
progression is more than 2, then the step of the progression is the smallest period of Y.
X
Xl Xr
YY
Y
![Page 43: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/43.jpg)
a
a
a
u u v
u u v
u u v
Left and Right Contexts
X
Xl Xr
Yb
b
b
The left context a of Y in X is a suffix of u.
The right context b of Y in X is a prefix of uv[|v|: |uv|].
![Page 44: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/44.jpg)
Computing Left and Right Contexts of Given Pattern
Theorem 6For any SLPs T and P which generate
text T and pattern P, respectively, we can compute the left and right contexts of P in T in O(n4logn) time.
n is num. of variables in SLP T
![Page 45: Finding Characteristic Substrings from Compressed Texts](https://reader035.fdocuments.net/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/45.jpg)
Conclusions and Future Work
We presented polynomial time algorithms to find characteristic substrings of given SLP-compressed texts.
Our algorithms are more efficient than any algorithms that work on uncompressed strings.
Would it be possible to efficiently find other types of substrings from SLP-compressed texts?
Squares (substrings of form xx)Cubes (substrings of form xxx)Runs (maximal substrings of form xk with k ≥ 2)Gapped palindromes (substrings of form xyxR with |y| ≥ 1)