Introduction GCIS Results Final Considerations
A Grammar Compression Algorithm based on Induced Suffix Sorting
D. S. N. Nunes (1,2), F. Louza (3), S. Gog (4), M. Ayala-Rincón (2), G. Navarro (5)
(1) Federal Institute of Education, Science and Technology of Brasília, Brazil
(2) Department of Computer Science, University of Brasília, Brazil
(3) Department of Computing and Mathematics, University of São Paulo, Brazil
(4) Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Germany
(5) Department of Computer Science, University of Chile, Chile
28th March 2018
Data Compression Conference
Snowbird, Utah, U.S.
Nunes, D. S. N. et al. A Grammar Compression Algorithm based on Induced Suffix Sorting DCC 2018 1/33
Summary
1 Introduction
2 GCIS
3 Results
4 Final Considerations
Introduction
Lossless data compression reduces space requirements by identifying and eliminating redundancy.
Useful in practice: it reduces the resources required to store and transmit data.
Space/time trade-off.
Introduction
The suffix array data structure is used extensively in stringology.
It is key to solving several text-related problems in efficient or optimal time, with a small memory footprint compared to other data structures:
  - Exact pattern matching;
  - Approximate pattern matching;
  - LZ factorization;
  - Finding repeats.
Suffix Array
i A[i] T_{A[i]} (suffix of T starting at position A[i])
0 22 0
1 21 a0
2 18 aana0
3 13 aananaana0
4 8 aananaananaana0
5 3 aananaananaananaana0
6 19 ana0
7 16 anaana0
8 11 anaananaana0
9 6 anaananaananaana0
10 1 anaananaananaananaana0
11 14 ananaana0
12 9 ananaananaana0
13 4 ananaananaananaana0
14 0 banaananaananaananaana0
15 20 na0
16 17 naana0
17 12 naananaana0
18 7 naananaananaana0
19 2 naananaananaananaana0
20 15 nanaana0
21 10 nanaananaana0
22 5 nanaananaananaana0
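The table above can be reproduced with a deliberately naive construction. This is only an illustrative sketch: it sorts the suffixes by direct comparison, which is far slower than the linear-time approach discussed in these slides, and it assumes the sentinel is written as the character `0`, which sorts before the letters in Python. The function name `naive_suffix_array` is hypothetical.

```python
def naive_suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive approach: sorts suffix copies directly, so it is only suitable
    for small examples.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


T = "banaananaananaananaana0"  # '0' plays the role of the sentinel
A = naive_suffix_array(T)

# Print the same three columns as the table: i, A[i], and the suffix itself.
for i, start in enumerate(A):
    print(i, start, T[start:])
```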
Nong’s Algorithm
The algorithm of Nong et al. sorts the suffixes of a text in linear time, which is optimal.
Very fast in practice.
Induces the order of suffixes based on the already computed order of other suffixes.
Our Contribution
Develop a novel grammar compression algorithm, based on the induced suffix sorting algorithm of Nong et al., to compress the original string.
  - Faster in compression than 7-zip and Re-Pair.
  - Lower peak memory during compression than Re-Pair.
GCIS
We adapt the SAIS framework of Nong et al. to build our grammar:
1 Classify suffixes into “S”, “L”, or “LMS” types.
2 Sort all LMS-suffixes by their first symbol.
3 Induce:
  3.1 L-type suffixes from LMS-type suffixes;
  3.2 S-type suffixes from L-type suffixes.
4 Rename the LMS-substrings to obtain T′ and create grammar rules.
5 If there are equal renamed factors, solve the problem recursively for T′. Otherwise, store T explicitly.
6 LMS-suffixes are now sorted regarding all symbols. Repeat 3.1 and 3.2 to obtain the suffix array.
GCIS
Example
i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
T   b  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  0
GCIS
Classification
i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
T   b  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  0
B   L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  L  *
(B holds the type of each suffix: L, S, or * for LMS.)
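The classification shown above can be sketched with a single right-to-left scan. This is a minimal illustration, assuming the usual definitions: suffix i is L-type if it is larger than suffix i+1 and S-type otherwise, and an LMS position is an S-type position whose left neighbour is L-type. The function name `classify` is hypothetical, and the text is assumed to end with a unique smallest sentinel.

```python
def classify(text):
    """Return one mark per position: 'L', 'S', or '*' for LMS positions."""
    n = len(text)
    types = ["S"] * n  # the final sentinel is S-type by definition
    # Scan right to left: the type of i depends only on text[i], text[i+1]
    # and the type already computed for i+1.
    for i in range(n - 2, -1, -1):
        if text[i] > text[i + 1] or (text[i] == text[i + 1] and types[i + 1] == "L"):
            types[i] = "L"
    # An S-type position preceded by an L-type position is LMS.
    return [
        "*" if i > 0 and types[i] == "S" and types[i - 1] == "L" else types[i]
        for i in range(n)
    ]


T = "banaananaananaananaana0"
B = classify(T)
print("".join(B))
```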
GCIS
Radixsort
i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
T   b  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  0
B   L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  L  *
A  22  -  -  -  -  -  1  3  6  8 11 13 16 18  -  -  -  -  -  -  -  -  -
(- marks a slot of A that is still empty.)
GCIS
Inducing L-Type Suffixes
i           0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
T           b  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  0
B           L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  L  *
A (before) 22  -  -  -  -  -  1  3  6  8 11 13 16 18  -  -  -  -  -  -  -  -  -
A (after)  22 21  -  -  -  -  1  3  6  8 11 13 16 18  0 20  2  5  7 10 12 15 17
(- marks a slot of A that is still empty.)
GCIS
Inducing S-Type Suffixes
i           0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
T           b  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  0
B           L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  L  *
A (before) 22 21  -  -  -  -  1  3  6  8 11 13 16 18  0 20  2  5  7 10 12 15 17
A (after)  22 21 18  3  8 13 19  1  4  6  9 11 14 16  0 20  2  5  7 10 12 15 17
(- marks a slot of A that is still empty.)
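The bucket placement and the two inducing scans can be sketched in a few lines. This is a simplified illustration of one induced-sorting pass in the style of Nong et al., not the authors' implementation: it uses plain Python lists and dictionaries, assumes the text ends with a unique smallest sentinel, and the name `sort_lms_substrings` is hypothetical.

```python
def sort_lms_substrings(text):
    """One induced-sorting pass; returns A with the LMS-substrings in sorted order."""
    n = len(text)
    # is_s[i] is True when suffix i is S-type; the sentinel is S-type by definition.
    is_s = [True] * n
    for i in range(n - 2, -1, -1):
        is_s[i] = text[i] < text[i + 1] or (text[i] == text[i + 1] and is_s[i + 1])
    is_lms = [i > 0 and is_s[i] and not is_s[i - 1] for i in range(n)]

    # First and last slot of each symbol's bucket in A.
    head, tail, pos = {}, {}, 0
    for c in sorted(set(text)):
        head[c] = pos
        pos += text.count(c)
        tail[c] = pos - 1

    A = [None] * n

    # Place LMS-suffixes at the end of their buckets, looking at the first symbol only.
    end = dict(tail)
    for i in range(n - 1, -1, -1):
        if is_lms[i]:
            A[end[text[i]]] = i
            end[text[i]] -= 1

    # Left-to-right scan: induce L-type suffixes at the front of their buckets.
    front = dict(head)
    for k in range(n):
        j = A[k]
        if j is not None and j > 0 and not is_s[j - 1]:
            A[front[text[j - 1]]] = j - 1
            front[text[j - 1]] += 1

    # Right-to-left scan: induce S-type suffixes at the end of their buckets.
    end = dict(tail)
    for k in range(n - 1, -1, -1):
        j = A[k]
        if j is not None and j > 0 and is_s[j - 1]:
            A[end[text[j - 1]]] = j - 1
            end[text[j - 1]] -= 1
    return A


T = "banaananaananaananaana0"
A = sort_lms_substrings(T)
print(A)
```

On the running example this yields the final row of the slide. Note that the result is not yet the suffix array of T: after this first pass only the LMS-substrings are guaranteed to be in order.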
GCIS
Definition (LMS-Substring)
An LMS-substring is either the sentinel symbol 0 or a substring T[i, j − 1] such that both T[i] and T[j] are of type LMS, i ≠ j, and there is no other LMS symbol between them. We denote this substring by sub(i).
After inducing the L- and S-type suffixes, all LMS-substrings are sorted.
All LMS-substrings are renamed according to their order.
A pairwise comparison of LMS-substrings that are adjacent in this order is needed to check whether they are equal.
T is replaced by the renamed factors, giving T′.
GCIS
Renaming
i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
T     b  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  n  a  a  n  a  0
B     L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  *  L  *  S  L  L  *
A    22 21 18  3  8 13 19  1  4  6  9 11 14 16  0 20  2  5  7 10 12 15 17
LMS   *  -  *  *  *  *  -  *  -  *  -  *  -  *  -  -  -  -  -  -  -  -  -
(In the last row, * marks the entries of A that are LMS-suffixes.)
sub(22) = 0    ↦ 0      sub(1)  = an ↦ 3
sub(18) = aana ↦ 1      sub(6)  = an ↦ 3
sub(3)  = aan  ↦ 2      sub(11) = an ↦ 3
sub(8)  = aan  ↦ 2      sub(16) = an ↦ 3
sub(13) = aan  ↦ 2
T′ = 323232310
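The renaming step can be sketched as follows. This is an illustrative sketch only: it takes as input the LMS positions in the order in which they appear in A on this slide (22, 18, 3, 8, 13, 1, 6, 11, 16) instead of computing them, and the names `rename` and `lms_sorted` are hypothetical.

```python
def rename(text, lms_sorted):
    """Name LMS-substrings by rank and build the reduced string.

    `lms_sorted` lists the LMS positions in sorted order of their substrings.
    Returns (rules, reduced): rules[k] is the LMS-substring named k, and
    reduced is the sequence of names in text order.
    """
    in_text_order = sorted(lms_sorted)
    # Each LMS-substring ends right before the next LMS position.
    next_lms = dict(zip(in_text_order, in_text_order[1:]))

    def sub(i):
        # The last LMS position is the sentinel, which stands alone.
        return text[i:next_lms[i]] if i in next_lms else text[i:]

    rules, name_of = [], {}
    for p in lms_sorted:
        s = sub(p)
        # Equal substrings are adjacent in sorted order, so comparing
        # with the previous distinct substring is enough.
        if not rules or rules[-1] != s:
            rules.append(s)
        name_of[p] = len(rules) - 1
    reduced = [name_of[p] for p in in_text_order]
    return rules, reduced


T = "banaananaananaananaana0"
lms_sorted = [22, 18, 3, 8, 13, 1, 6, 11, 16]
rules, reduced = rename(T, lms_sorted)
print(rules, "".join(map(str, reduced)))
```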
Creating Rules
For every unique LMS-substring S, a rule of the form X → S is created.
To obtain a good compression ratio, certain properties shared by the LMS-substrings are exploited.
The first property is that the LMS-substrings are sorted!
We can represent each of them by a pair (ℓ, s):
  - ℓ: the length of the common prefix shared with the previous LMS-substring;
  - s: the remaining suffix not covered by ℓ.
A special rule is created to store the first symbols of T not covered by an LMS-substring.
Encoding the Factors
For T = banaananaananaananaana0
0 → (0, '0');    // '0'
1 → (0, 'aana'); // 'aana'
2 → (3, '');     // 'aan'
3 → (1, 'n');    // 'an'
Tail → b
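The (ℓ, s) pairs above amount to front coding of the sorted rule list. A minimal sketch, with hypothetical function names `front_encode` and `front_decode`, could look like this:

```python
def front_encode(rules):
    """Encode each rule as (length of prefix shared with the previous rule, rest)."""
    pairs, prev = [], ""
    for rule in rules:
        l = 0
        while l < min(len(prev), len(rule)) and prev[l] == rule[l]:
            l += 1
        pairs.append((l, rule[l:]))
        prev = rule
    return pairs


def front_decode(pairs):
    """Rebuild the rules: keep the first l symbols of the previous rule, append s."""
    rules, prev = [], ""
    for l, s in pairs:
        prev = prev[:l] + s
        rules.append(prev)
    return rules


rules = ["0", "aana", "aan", "an"]  # right-hand sides in sorted order
pairs = front_encode(rules)
print(pairs)
```

Because each rule depends on the previous one, the rules of a level have to be decoded in order.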
Encoding the Factors
The ℓ values can be encoded succinctly using Simple-8b.
It tries to fit as many fixed-width integers as possible in a 64-bit word.
With a single memory access, many values are retrieved.
Table: Simple-8b possible arrangements.
Selector value    0    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Item width        0    0   1   2   3   4   5   6   7   8  10  12  15  20  30  60
Group size      240  120  60  30  20  15  12  10   8   7   6   5   4   3   2   1
Wasted bits      60   60   0   0   0   0   0   0   4   4   0   0   0   0   0   0
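To make the table concrete, here is a simplified packing sketch. Several choices are assumptions made for illustration rather than details taken from the slides: the selector is stored in the top 4 bits of the word, items are packed starting from the least significant bit, a greedy strategy picks the selector, and the two zero-width run selectors (0 and 1) are left out. The names `simple8b_encode` and `simple8b_decode` are hypothetical.

```python
# selector -> (item width in bits, items per word), selectors 2..15 of the table
LAYOUT = {
    2: (1, 60), 3: (2, 30), 4: (3, 20), 5: (4, 15), 6: (5, 12), 7: (6, 10),
    8: (7, 8), 9: (8, 7), 10: (10, 6), 11: (12, 5), 12: (15, 4), 13: (20, 3),
    14: (30, 2), 15: (60, 1),
}


def simple8b_encode(values):
    """Greedily pack non-negative integers into 64-bit words."""
    words, i = [], 0
    while i < len(values):
        for selector in range(2, 16):
            width, count = LAYOUT[selector]
            group = values[i:i + count]
            # Use the first layout whose group is full and whose items all fit.
            if len(group) == count and all(0 <= v < (1 << width) for v in group):
                word = selector << 60
                for k, v in enumerate(group):
                    word |= v << (k * width)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value does not fit in 60 bits")
    return words


def simple8b_decode(words):
    """Unpack every word according to its selector."""
    out = []
    for word in words:
        width, count = LAYOUT[word >> 60]
        mask = (1 << width) - 1
        out.extend((word >> (k * width)) & mask for k in range(count))
    return out


lengths = [0, 0, 3, 1]  # the ℓ values of the running example
packed = simple8b_encode(lengths)
print(len(packed), simple8b_decode(packed))
```

With only four values, the greedy choice falls on the layout with four 15-bit items, so the whole list occupies one word.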
Encoding the Factors
The s parts are concatenated in a single array of integers with a fixed width of $\lceil \lg |\Sigma| \rceil$ bits.
V = 0aanan
A support bitmap encoding the length of each s part is created.
BV = 0100001101
Length of suffix i = $\mathrm{select}_1(BV, i + 1) - \mathrm{select}_1(BV, i) - 1$
Start of suffix i in V = $\mathrm{select}_1(BV, i) - i + 1$
To improve space usage, the bitmap is encoded using Elias-Fano.
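The two formulas can be checked on the running example. The conventions below are inferred from the example rather than stated on the slide: each length is written in unary as that many 0s followed by a 1, rules are numbered from 0, positions in BV and V are counted from 1, and select1(BV, 0) is taken to be 0. The linear-scan `select1` stands in for a real succinct structure, and all function names are hypothetical.

```python
def build(s_parts):
    """Concatenate the s parts into V and encode their lengths in unary in BV."""
    V = "".join(s_parts)
    BV = "".join("0" * len(s) + "1" for s in s_parts)
    return V, BV


def select1(BV, k):
    """1-based position of the k-th 1 in BV, with select1(BV, 0) = 0."""
    if k == 0:
        return 0
    seen = 0
    for pos, bit in enumerate(BV, start=1):
        if bit == "1":
            seen += 1
            if seen == k:
                return pos
    raise ValueError("BV has fewer than k ones")


def s_part(V, BV, i):
    """Recover the i-th s part (i counted from 0) using the two formulas."""
    length = select1(BV, i + 1) - select1(BV, i) - 1
    start = select1(BV, i) - i + 1  # 1-based position in V
    return V[start - 1:start - 1 + length]


V, BV = build(["0", "aana", "", "n"])
print(V, BV, [s_part(V, BV, i) for i in range(4)])
```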
Decoding
We begin with the reduced string.
At each recursion level, we read T[i] = X and replace it by the corresponding LMS-substring S, given by the rule X → S.
To make decoding faster, all rules of the current level are decompressed beforehand.
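Decoding can be sketched as repeated expansion, one recursion level at a time. The data layout below (a list of `(rules, tail)` pairs ordered from the deepest level to the original text, with rules already decompressed) is a hypothetical choice for illustration, as is the name `expand`.

```python
def expand(reduced, levels):
    """Expand a reduced string through the given levels.

    `levels` is a list of (rules, tail) pairs, deepest level first.
    rules[x] is the sequence that symbol x stands for, and tail holds the
    leading symbols of that level not covered by any LMS-substring.
    """
    seq = list(reduced)
    for rules, tail in levels:
        out = list(tail)
        for x in seq:
            out.extend(rules[x])  # replace each name by its LMS-substring
        seq = out
    return seq


# Running example: a single level, with T' = 323232310 and Tail -> b.
rules = ["0", "aana", "aan", "an"]
decoded = "".join(expand([3, 2, 3, 2, 3, 2, 3, 1, 0], [(rules, "b")]))
print(decoded)
```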
Results
To assess the potential of the proposed grammar compressor (GCIS), experiments were run on the repetitive Pizza-Chili corpus, which contains texts of different natures.
GCIS was compared against popular compressors suited to repetitive sequences: Re-Pair and 7-zip.
Three measures were evaluated:
  - Compression ratio: compressedSize/Size.
  - Compression time (s).
  - Decompression time (s).
Compression Time
Compression Ratio
Decompression Time
Peak Memory during Compression
Peak Memory during Decompression
Final Considerations
GCIS proved to be a practical alternative to popular compressors due to its good compression ratio and time.
Competitive in compression with 7-zip and Re-Pair, and much faster than both.
Slower decoding times.
Worse compression ratio, but comparable to Re-Pair.
Lower peak memory than Re-Pair.
Work in Progress
It is possible to support extraction of any substring without much extra space by storing the rule lengths succinctly.
Re-Pair also supports extraction, but in a less space-efficient version (which requires 2.x to 3 times more space).
7-zip does not support extraction.
Develop extraction and compare with the less space-efficient version of Re-Pair.
We hope that GCIS will be more space-efficient than Re-Pair when supporting the extract operation.
Thank you!