Satoshi Yoshida* and Takuya Kida*
Transcript of Satoshi Yoshida* and Takuya Kida*
Satoshi Yoshida* and Takuya Kida**Hokkaido University Graduate school of Information Science and
Technology Division of Computer Science
1
※ESKM2012(9月20⽇〜9月22⽇)にて発表予定
2
Input Text
Fixed Length Variable Length
CompressedText
FixedLength
FF Code(Fixed length to Fixed
length code)
VF Code(Variable length to Fixed
length code)
VariableLength
FV Code(Fixed length to
Variable length code)
VV Code(Variable length to Variable
length code)
Huffman Huffman Code
TunstallTunstallCode
Program Searching on Compressed Data
011101011101110100101001011101011101110100101001・・・
Compressed DataSearch Directly
3
Tunstall code [Tunstall, 1967]
AIVF code using multiple parse trees[Yamamoto and Yokoo, 2001]
AIVF code [Yamamoto and Yokoo, 2001]
utilizing unused codewords
utilizing context between blocks
YY codeYY code
Improves
considerably
Improves compression ratio
considerably
4
Multiple parse trees
001011010010010001…
VMA tree
Huge time Huge time and memory
Proposing methodProposing method
relatively to the number of kind of characters (k) in the input text
k-1 parse trees
� We can reduce the total number of nodes in comparison to YY code by using a VMA tree.
� We can reduce the total number of nodes and compression time considerably also in experiments.
� We found an upper bound and a lower bound of the number of nodes in a VMA tree.
5
Number of nodes in VMA tree
Number of nodes reduced byintegration
Upper bound �� �1
2�� �
1
2� � � � 1
1
2� � 1 � � 2
� �� � � � 1
� � 1� � � 2
Lower bound1
� � 1� �� � �
1
2�� �
1
2� � 3
6
k : alphabet sizeM : number of codewords
� Let Σ be a finite alphabet.
� Elements in alphabet are sorted in descending
order of their probabilities.
� We assume information source is memoryless.
� We simply say “encoding” encoding to the sequence
on {0, 1}.
7
aabbbbc Σ={a, b, c}
8
a
a
b
b
c
c a001
010
000
100
101
110011
111
a
a
Incomplete internal nodeAn internal node which doesn’thave k children.
input text:aabbbbc
compressed data sequence:001 101 101101 101 111
Each branch is labeled by a symbol
in Σ={a, b, c}.
Each leaf node and incomplete internal node has a codeword.
to the block.
Input text is parsed into blocks and we output
codeword corresponding to the block.
9
aa
b
b
c
ca
001 010000
100
101
111
011
a110
b c
a
a
b
b
c
c a001
010
000
100
101
110011
111
a
a
We switch multiple parse trees according to the context.
� Problem: time and space consuming◦ We construct k – 1 parse trees.
� We can reduce total number of nodes by sharing nodes.◦ We need to mark each node n in a VMA tree in order to tell which trees the node belongs to.◦ We can realize that by holding the least i such that nbelongs to Ti.
10
…0T1T
2−kT2T
VMA treemultiple parse trees
� Theorem
11
…
…
)(
1
i
iS +
)(
2
i
kS −
)(
1
i
kS −)(i
kS
)1(
2
+
+
i
iS)1(
2
+
−
i
kS)1(
1
+
−
i
kS)1( +i
kS
)(
2
i
iS +
Ti Ti +1
Let be the subtree of that consists of all the nodes under the node corresponding with , which is direct child of the root. Then completely covers . We denote this relation by .
)(i
jS iT
ja)1( +
+
i
jiS)(i
jiS +
)1()( +
++
i
ji
i
ji SS p
From this theorem, we can tell which trees a node belongs to easily.
12
…
…
…
…
…
…
…
0T 1T 2−kT
VT
)0(
1S)0(
2−kS)0(
1−kS
)0(
kS
)1(
2S)1(
2−kS)1(
1−kS)1(
kS
)2(
1
−
−
k
kS
)2( −k
kS
kS1−kS1S 2−kS
.SSS
,SSS
,SSS
,SSSS
,SSS
,SS
k
k
kk
k
k
kk
k
k
kk
=
=
=
=
=
=
−
−
−
−−
−
−
−−
)2()0(
1
)2(
1
)0(
1
2
)3(
2
)0(
2
3
)2(
3
)1(
3
)0(
3
2
)1(
2
)0(
2
1
)0(
1
pLp
pLp
pLp
M
pp
p
From the theorem, we have:
VMA tree
multiple parse trees
We call the integrated parse tree a VMA tree.
�Theorem
The number of reduced nodes is not less
than .
13
1
2�� �
1
2� � 3
1
2�� �
1
2� � 3
14
…
…
…
……
…
…
0T1T
2−kT
VT
)0(
1S
)0(
2−kS
)0(
1−kS
)0(
kS
)1(
2S
)1(
2−kS)1(
1−kS)1(
kS
)2(
1
−
−
k
kS
)2( −k
kS
)2( −k
kS)2(
1
−
−
k
kS)0(
1S)3(
2
−
−
k
kS
3−kT
)3(
2
−
−
k
kS
)3( −k
kS)3(
1
−
−
k
kS
)1(
3S)0(
2S
)1(
2S
Each reduced subtree has at least one node.
1−k 2−k 2
Summation:
�Theorem
The number of nodes in a VMA tree is
not less than .
15
1
� � 1� �� � �
16
…
…
��
�� �� �� �
� ����� ����
� ���
���� ���� ���� � �� �
Largest subtree and it remains in VMA tree.
Pr �� � Pr �� � ⋯ � Pr ��
The VMA tree is smallest when Pr �� � Pr �� � ⋯ � Pr �� .
17
…
…
VT
)2( −k
kS)2(
1
−
−
k
kS)0(
1S)3(
2
−
−
k
kS)1(
2S
#of codewords:
#of nodes:
�
�
�
� � 1
�
3…
�
2
�
2
���
��
�� 1
� � 1
���
����
�� 1
� � 1
���
��
�� 1
� � 1
���
��
�� 1
� � 1
���
��
�� 1
� � 1
Summation is not less than �
���� �� � �.
� Comparison◦ Total number of nodes (Exp1)◦ Compression times (Exp2)◦ Total number of nodes and upper bound and lower bound of nodes in VMA tree on random sequence (Exp3)
� Algorithms◦ YY coding (YY)◦ Encoding using a VMA tree (VMA)
� Environments◦ CPU: Intel Pentium® 4 Processor 3.0GHz Hyper Threading◦ Memory: 2GB◦ OS: Debian GNU/Linux 5.0◦ Language: C++◦ Compiler: g++4.3
� Codeword length: 12 bits
18
� The Canterbury Corpus†
19
file size (in bytes) k content
alice29.txt 152,089 74 English text
asyoulik.txt 125,179 68 Shakespeare
cp.html 24,603 86 HTML source
fields.c 11,150 90 C source
grammar.lsp 3,721 76 LISP source
kennedy.xls 1,029,744 256 Excel Spreadsheet
lcet10.txt 424,754 84 Technical writing
plrabn12.txt 481,861 81 Poetry
ptt5 513,216 159 CCITT test set
sum 38,240 255 SPARC Executable
xargs.1 4,227 74 GNU manual page†http://corpus.canterbury.ac.nz/descriptions
20
0
200
400
600
800
1000
1200
tota
l am
ou
nt o
f no
des
tota
l am
ou
nt o
f no
des
tota
l am
ou
nt o
f no
des
tota
l am
ou
nt o
f no
des
VMA
YY
103
k : 74 68 86 90 76 256 84 81 159 255 74
7 times
32 times
21
0
10
20
30
40
50
60
70
co
mp
ressio
n tim
e (s
ec)
co
mp
ressio
n tim
e (s
ec)
co
mp
ressio
n tim
e (s
ec)
co
mp
ressio
n tim
e (s
ec)
VMA
YY
k : 74 68 86 90 76 256 84 81 159 255 74
3 times
18 times
22
0
200
400
600
800
1000
1200
0 32 64 96 128 160 192 224 256
tota
l n
um
be
r o
f n
od
es
alphabet size
the number of nodes
on uniform distribution
VMA
YY
upper bound
lower bound
0
200
400
600
800
1000
1200
0 32 64 96 128 160 192 224 256
tota
l n
um
be
r o
f n
od
es
alphabet size
the number of nodes
on Zipf distribution
VMA
YY
upper bound
lower bound
� We showed that we can reduce the total numberof nodes in comparison to YY code by using a VMAtree.
� We also showed that we can reduce the totalnumber of nodes and times considerably inexperiments.
� We found an upper bound and a lower bound of thenumber of nodes in a VMA tree.
23
� Applying this idea to STVF coding [Kida, DCC2009]
� Finding tighter bounds