Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string...

50
Fast q-gram Mining on SLP Compressed Strings Kyushu University ◯Keisuke Goto,Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda 1 Sponsored by Hara Research Foundation 20111014日金曜日

Transcript of Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string...

Page 1: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

Fast q-gram Mining on SLP Compressed Strings

Kyushu University◯Keisuke Goto,Hideo Bannai,Shunsuke Inenaga, Masayuki Takeda

1 Sponsored by Hara Research Foundation

2011年10月14日金曜日

Page 2: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Large ScaleString

Background✦ Large scale string data is usually stored in compressed form

22011年10月14日金曜日

Page 3: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Background✦ Large scale string data is usually stored in compressed form

compressedstring

22011年10月14日金曜日

Page 4: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Background✦ Large scale string data is usually stored in compressed form

compressedstring

3

✦ In order to process such data, we usually need to decompress, which requires a lot of space and time

2011年10月14日金曜日

Page 5: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Large ScaleString

Background✦ Large scale string data is usually stored in compressed form

compressedstring

decompress

3

✦ In order to process such data, we usually need to decompress, which requires a lot of space and time

string processing,e.g. pattern matching

2011年10月14日金曜日

Page 6: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Large ScaleString

Background✦ Large scale string data is usually stored in compressed form

compressedstring

decompress

3

      ____     /      \   /  _ノ  ヽ、_  \  / o゚((●)) ((●))゚o \    ¦     (__人__)'    ¦  \     `⌒´     /

✦ In order to process such data, we usually need to decompress, which requires a lot of space and time

string processing,e.g. pattern matching

2011年10月14日金曜日

Page 7: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Background✦ Large scale string data is usually stored in compressed form

compressedstring

4

✦ One solution is to process compressed strings without explicit decompression

✦ In order to process such data, we usually need to decompress, which requires a lot of space and time

2011年10月14日金曜日

Page 8: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Background✦ Large scale string data is usually stored in compressed form

string processing,e.g. pattern matching

compressedstring

without decompress

4

✦ One solution is to process compressed strings without explicit decompression

✦ In order to process such data, we usually need to decompress, which requires a lot of space and time

2011年10月14日金曜日

Page 9: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Background✦ Large scale string data is usually stored in compressed form

string processing,e.g. pattern matching

compressedstring

without decompress

4

✦ One solution is to process compressed strings without explicit decompression

✦ In order to process such data, we usually need to decompress, which requires a lot of space and time

      ____     /⌒  ⌒\   /( ●)  (●)\  /::::::⌒(__人__)⌒::::: \   ¦     ¦r┬-¦     ¦  \      `ー'´     /

2011年10月14日金曜日

Page 10: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Background✦ Large scale string data is usually stored in compressed form

string processing,e.g. pattern matching

compressedstring

without decompress

4

✦ One solution is to process compressed strings without explicit decompression

this work: algorithm for computing q-gram frequencies on compressed strings

✦ In order to process such data, we usually need to decompress, which requires a lot of space and time

      ____     /⌒  ⌒\   /( ●)  (●)\  /::::::⌒(__人__)⌒::::: \   ¦     ¦r┬-¦     ¦  \      `ー'´     /

2011年10月14日金曜日

Page 11: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

q-gram Frequencies Problem

T = q-gram Frequencyaab 2aba 3baa 2bab 1

q = 3

5

Input:string T,integer qOutput:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0

abaababaab

q-gram Frequencies Problem

Example

ababaaaababa

babababaaaab

✦ q-gram frequencies are important features of strings, widely used in data mining and machine learning

i.e. frequencies of all length-q strings (q-grams) occuring in T

where Occ(T, P)={ i | T[i..(i+|P|-1)] = P, 0 ≦ i < |T| - |P| }

2011年10月14日金曜日

Page 12: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Straight Line Program (SLP)

6

✦ SLP can represent the output of well-known compression schemes✦ e.g. RE-PAIR SEQUITUR, LZ78, LZW

✦ LZ77, LZSS can be quickly transformed to SLP

Definition (Straight Line Program)

Straight Line Program is a context free grammar in the Chomsky normal form that derives a single string

X1 = expr1, X2 = expr2, ...., Xn = exprn

expri ∈ Σ orexpri = Xl・Xr (l, r < i)

[Rytter, 2003]

2011年10月14日金曜日

Page 13: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Example SLP

D = X1 = aX2 = bX3 = X1 X2

X4 = X1 X3

X5 = X3 X4

X6 = X4 X5

X7 = X6 X5

7

n = |D| = 7

Derivation Tree

Length of decompressed string can be Θ(2n)

X1 X2

a ba a ab a b a b a a b

X1 X3X1 X2X3

X1 X2X3

X4X1

X5X4X6

X1 X2X3

X1 X2X3

X4X1

X5

X7

2 31 4 65 7 8 9 10 11 12 13

vOcc(Xi): the number of occurrence Xi in the derivation tree of Xn

T =

2011年10月14日金曜日

Page 14: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Example SLP

D = X1 = aX2 = bX3 = X1 X2

X4 = X1 X3

X5 = X3 X4

X6 = X4 X5

X7 = X6 X5

7

n = |D| = 7

Derivation Tree

Length of decompressed string can be Θ(2n)

X1 X2

a ba a ab a b a b a a b

X1 X3X1 X2X3

X1 X2X3

X4X1

X5X4X6

X1 X2X3

X1 X2X3

X4X1

X5

X7

2 31 4 65 7 8 9 10 11 12 13

vOcc(Xi): the number of occurrence Xi in the derivation tree of Xn

T =

Example vOcc(X4) = 3

2011年10月14日金曜日

Page 15: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Computing q-gram frequencies on uncompressed strings

82011年10月14日金曜日

Page 16: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

O(q|T | log |Σ|)-time Algorithm

✦ Store frequency of q-grams in associative array✦ Simply scan string T and increment frequency of

each q-gram T [i .. i+q-1]

9

T =q = 3

abaababaab

Example

ababaaaababa

babababaaaab

q-gram Frequencyaab 2

aba 3

baa 2

bab 1

✦ Each access to associative array: O(q log |Σ|) time✦ Total: O(q|T | log |Σ|) time,O(q|T |) space

2011年10月14日金曜日

Page 17: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

i SA[i] LCP[i] T[SA[i] .. |T|] 0 7 0 aab 1 2 3 aababaab 2 8 1 ab 3 5 2 abaab 4 0 5 abaababaab5 3 3 ababaab 6 9 0 b 7 6 1 baab 8 1 4 baababaab 9 4 2 babaab

O(|T |)-time Algorithm✦ Construct and simply scan the Suffix Array, LCP Array of T for

intervals where LCP ≧ q

Example

10

[Kasai et al, 2001]e.g. [Kärkkäinen & Sanders, 2003]

Intervals where LCP ≧ q,represent all occurrences of a specific q-gram which occur more than once

✦ Suffix Array, LCP Array construction: O(|T |) time & space✦ Total: O(|T |) time & space

T = abaababaabq = 3

Output position i and frequency of T[i .. i+q-1]

2011年10月14日金曜日

Page 18: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3511

Computing q-gram frequencies on SLPs

2011年10月14日金曜日

Page 19: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Occ*: Crossing Occurrences

12

PP

P

Xi

Xl Xr

✦ For Xi=Xl Xr , Occurrences of P is crossing occurrences of Xiif occurrences of P start in Xl and end in Xr

✦ We denote such occurrence by Occ*(Xi, P)

2011年10月14日金曜日

Page 20: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3513

LemmaLemma

|Occ(T,P)| =n

∑i

|Occ∗(Xi,P)| · vOcc(Xi)

T P PP

Xn

2011年10月14日金曜日

Page 21: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3514

Lemma

T P PP

Lemma

Xj2Xj1Xj1

Xn

|Occ(T,P)| =n

∑i

|Occ∗(Xi,P)| · vOcc(Xi)

2011年10月14日金曜日

Page 22: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3514

Lemma

T

For any substring P of T,there exists Xi =Xl Xr such that P occurscrossing the boundary of Xl and Xr

P PP

Lemma

Xj2Xj1Xj1

Xn

|Occ(T,P)| =n

∑i

|Occ∗(Xi,P)| · vOcc(Xi)

2011年10月14日金曜日

Page 23: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3514

Lemma

T

For any substring P of T,there exists Xi =Xl Xr such that P occurscrossing the boundary of Xl and Xr

P PPThe number of P crossing the boundary of Xl and Xr is vOcc(Xi), thus lemma holds.

Lemma

Xj2Xj1Xj1

Xn

|Occ(T,P)| =n

∑i

|Occ∗(Xi,P)| · vOcc(Xi)

2011年10月14日金曜日

Page 24: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

multiple weighted q-gram frequencies problem

We reduction q-gram frequencies problem to multiple weighted q-gram frequencies problem

15

for all Xi =Xl, Xr, let ti = slpr

pr: length-(q-1) suffix of Xr

sl: length-(q-1) suffix of Xlwhere

・・・・

Xi

Xl Xr

q-1 q-1

sl pr

Occurrences of q-gram in ti correspond tocrossing occurrences of Xi

Input:string set S ={t1, t2, ..., tn},integer set w = {w1, w2, ..., wn}

Output: for ∀ x ∈ Σq

multiple weighted q-gram frequencies problem

n

∑i=1

|Occ(ti,x)| · wi

ti =

2011年10月14日金曜日

Page 25: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Our Strategy

✦ STEP 1. for all Xi = Xl Xr , compute vOcc(Xi)

16

Xi

Xl Xr

sl prti =

2011年10月14日金曜日

Page 26: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Our Strategy

✦ STEP 1. for all Xi = Xl Xr , compute vOcc(Xi)

16

Xi

Xl Xr

q-1 q-1

✦ STEP 2. for all Xi = Xl Xr , compute decompressed prefix pi and suffix si of length at most q-1, and ti = slpr

sl prti =

2011年10月14日金曜日

Page 27: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Our Strategy

✦ STEP 1. for all Xi = Xl Xr , compute vOcc(Xi)

16

・・・・

Xi

Xl Xr

q-1 q-1

✦ STEP 2. for all Xi = Xl Xr , compute decompressed prefix pi and suffix si of length at most q-1, and ti = slpr

✦ STEP 3. solve the multi weighted q-gram frequencies for all ti and vOcc(Xi)

sl prti =

2011年10月14日金曜日

Page 28: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Our Strategy

✦ STEP 1. for all Xi = Xl Xr , compute vOcc(Xi)

16

・・・・

Xi

Xl Xr

q-1 q-1

From now, we describes STEP 3

STEP 1 and 2 can be computed in O(qn) time and space by a simple dynamic programming

✦ STEP 2. for all Xi = Xl Xr , compute decompressed prefix pi and suffix si of length at most q-1, and ti = slpr

✦ STEP 3. solve the multi weighted q-gram frequencies for all ti and vOcc(Xi)

sl prti =

2011年10月14日金曜日

Page 29: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3517

O(q2n log |Σ|)-time Algorithm

・・・・

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

aa ba ba ab

q-gram Frequency

aab 0aba 0baa 0

・・・

・・・

・・・

・・・

ti ti+1 ・・・・

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

2011年10月14日金曜日

Page 30: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3517

O(q2n log |Σ|)-time Algorithm

・・・・

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

aa ba ba ab

q-gram Frequency

aab 0aba 0baa 0

・・・

・・・

・・・

・・・

ti ti+1 ・・・・

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

2011年10月14日金曜日

Page 31: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3517

O(q2n log |Σ|)-time Algorithm

・・・・

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

aa ba ba ab

q-gram Frequency

aab 0aba 0baa 0

・・・

・・・

・・・

・・・

4ti ti+1

・・・・

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

2011年10月14日金曜日

Page 32: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3517

O(q2n log |Σ|)-time Algorithm

・・・・

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

aa ba ba ab

q-gram Frequency

aab 0aba 0baa 0

・・・

・・・

・・・

・・・

4ti ti+1

・・・・

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

2011年10月14日金曜日

Page 33: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3517

O(q2n log |Σ|)-time Algorithm

・・・・

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

aa ba ba ab

q-gram Frequency

aab 0aba 0baa 0

・・・

・・・

・・・

・・・

44

ti ti+1 ・・・・

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

2011年10月14日金曜日

Page 34: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3518

O(q2n log |Σ|)-time Algorithm

ba ab

q-gram Frequency

aab 4aba 4baa 0

・・・

・・・

・・・

・・・

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

・・・・ti ti+1

・・・・aa ba

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

✦ Each access to associative array: O(q log |Σ|) time✦ Total: O(q2n log |Σ|) time, O(q2n) space

2011年10月14日金曜日

Page 35: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3518

O(q2n log |Σ|)-time Algorithm

ba ab

q-gram Frequency

aab 4aba 4baa 0

・・・

・・・

・・・

・・・

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

・・・・ti ti+1

・・・・aa ba

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

✦ Each access to associative array: O(q log |Σ|) time✦ Total: O(q2n log |Σ|) time, O(q2n) space

2011年10月14日金曜日

Page 36: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3518

O(q2n log |Σ|)-time Algorithm

ba ab

q-gram Frequency

aab 4aba 4baa 0

・・・

・・・

・・・

・・・

3

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

・・・・ti ti+1

・・・・aa ba

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

✦ Each access to associative array: O(q log |Σ|) time✦ Total: O(q2n log |Σ|) time, O(q2n) space

2011年10月14日金曜日

Page 37: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3518

O(q2n log |Σ|)-time Algorithm

ba ab

q-gram Frequency

aab 4aba 4baa 0

・・・

・・・

・・・

・・・

3

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

・・・・ti ti+1

・・・・aa ba

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

✦ Each access to associative array: O(q log |Σ|) time✦ Total: O(q2n log |Σ|) time, O(q2n) space

2011年10月14日金曜日

Page 38: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3518

O(q2n log |Σ|)-time Algorithm

ba ab

q-gram Frequency

aab 4aba 4baa 0

・・・

・・・

・・・

・・・

7

3

✦ Store frequency of q-grams in associative array✦ For each q-gram which starts in Xl and ends in Xr, we increase its

frequency by vOcc(Xi)

・・・・ti ti+1

・・・・aa ba

vOcc(Xi) = 4 vOcc(Xi) = 3

Example

✦ Each access to associative array: O(q log |Σ|) time✦ Total: O(q2n log |Σ|) time, O(q2n) space

2011年10月14日金曜日

Page 39: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

O(qn)-time Algorithm

19

Construct string z and weight array w, as follows

z a a b a b a a bw 4 4 0 0 3 3 0 0・・・・・・

・・・・・・

✦ For each q-gram z[ j .. j + q-1], we count by w[ j ]

where |wi| = |ti|,

wi[ j] =�

vOcc(Xi) if j ≤ |sl |0 if j > |sl |

・・・・

ti ti+1

・・・・

vOcc(Xi) = 4 vOcc(Xi) = 3

a a b a4 4 0 0

b a a b3 3 0 0

z = t1 t2 t3 .... tn , w = w1 w2 w3 ... wn

2011年10月14日金曜日

Page 40: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

✦ Construct Suffix Array and LCP Array of z✦ Compute sum of w[SA[i]] in the interval where LCP[i] ≧ q

20

i SA LCP w[SA[i]] z[SA[i]] 0 16 0 1 aab

1 2 3 0 aababbaabbaabbaab

2 12 3 1 aabbaab

3 8 7 1 aabbaabbaab

4 17 1 0 ab

5 0 2 4 abaababbaabbaabbaab

6 3 3 3 ababbaabbaabbaab

7 13 2 0 abbaab

|Occ(T, aab)| = 3

O(qn) time and space

O(qn)-time Algorithm

|Occ(T, aba)| = 7

・・・

2011年10月14日金曜日

Page 41: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Experiments

212011年10月14日金曜日

Page 42: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Computational Experiments

✦ Compared running time of 4 algorithms, after reading input into memory

✦ Uncompressed String: O(q|T | log |Σ|)✦ Uncompressed String: O(|T |)✦ SLP: O(q2n log |Σ|)✦ SLP: O(qn)

✦ Implementation✦ language: C++✦ suffix array construction: libdivsufsort[1]

✦ associative array: std::map

[1] http://code.google.com/p/libdivsufsort/22

2011年10月14日金曜日

Page 43: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Experiment (Fibonacci Strings)q = 50

23

O(q|T | log |Σ|)

O(|T |)

SLP: O(q2n log |Σ|)

SLP: O(qn)

2011年10月14日金曜日

Page 44: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Pizza & Chili Corpus

✦ Data: XML, DNA, ENGLISH, and PROTEINS with size 200MB

✦ Input SLPs were generated by RE-PAIR✦ q is varied from 2 to 10

24

[Larsson & Moffat, 1999]

2011年10月14日金曜日

Page 45: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Experiment (ENGLISH 200MB)

25

0

100

200

300

400

500

600

2 3 4 5 6 7 8 9 10

tota

l tim

e (s

econ

ds)

q

NMPNSASMPSSA

q

O(q|T | log |Σ|)

SLP: O(q2n log |Σ|)

SLP: O(qn)

O(|T |)

| z | > | T || z | < | T |

2011年10月14日金曜日

Page 46: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Experiment (PROTEINS 200MB)

26

0

100

200

300

400

500

600

700

800

900

2 3 4 5 6 7 8 9 10

tota

l tim

e (s

econ

ds)

q

NMPNSASMPSSA

q

O(|T |)

O(q|T | log |Σ|)

SLP: O(q2n log |Σ|)

SLP: O(qn)

| z | > | T || z | < | T |

2011年10月14日金曜日

Page 47: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

0

50

100

150

200

250

2 3 4 5 6 7 8 9 10

tota

l tim

e (s

econ

ds)

q

NMPNSASMPSSA

Experiment (DNA 200MB)

27

O(|T |)

O(q|T | log |Σ|)SLP: O(qn)

q

SLP: O(q2n log |Σ|)

| z | > | T || z | < | T |

2011年10月14日金曜日

Page 48: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3528q

Experiment (DBLP 200MB)

O(|T |)

O(q|T | log |Σ|)

SLP: O(q2n log |Σ|)

SLP: O(qn)

| z | < | T |

2011年10月14日金曜日

Page 49: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 35

Summary

29

uncompressed string

SLP(previous work)

q-gram Frequencies O(|T |) O(|Σ |q qn2)

q-gram Kernel O(|T1| + |T2|) None

Optimal q-gramPattern Discovery O(M) None

[1]

SLP(Our Work)

O(q2n log |Σ|) O(qn)

O(q(n1 + n2))

O(qN)

[Inenaga and Bannai, 2009][2]

[2]

N: sum of compressed data size[3]

M: sum of data size[1]

✦ We proposed new algorithms for computing q-gram frequencies on compressed string

✦ Proposed algorithms are simple, and practically fast when q is small

[3]

2011年10月14日金曜日

Page 50: Fast q-gram Mining on SLP Compressed Strings · 2020. 1. 2. · bab 1 q = 3 5 Input:string T,integer q Output:|Occ(T, x)| for ∀ x ∈ Σq s.t. |Occ(T, x)| > 0 abaababaab q-gram

/ 3530

Thank you for listening

2011年10月14日金曜日