5th July 2004 CPM 2004 1
A Simple Optimal Representation for Balanced Parentheses
Richard Geary, Naila Rahman, Rajeev Raman
(University of Leicester, UK)
and
Venkatesh Raman
(Institute for Mathematical Sciences, Chennai, India)
A Parentheses Data Structure
• Given: Balanced string of 2n parentheses.
( ( ( ( ) ) ) ( ) ( ) )
• Support operations:
– ENCLOSE(i)
– FINDCLOSE(i), FINDOPEN(i)
– EXCESS(i)
• Applications to suffix trees, ordinal trees and stack-sortable permutations.
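The four operations can be illustrated with a deliberately naive sketch that scans the string, so each call costs O(n) time rather than O(1). Positions are 0-based here, and the semantics chosen for EXCESS (opening minus closing parentheses up to and including position i) and ENCLOSE (the opening parenthesis of the tightest enclosing pair) are assumptions made for this illustration.

```python
def match_table(s):
    """m[i] is the index of the parenthesis matching s[i]."""
    m, stack = [None] * len(s), []
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        else:
            j = stack.pop()
            m[i], m[j] = j, i
    return m

def findclose(s, i):
    """Index of the ')' matching the '(' at index i."""
    return match_table(s)[i]

def findopen(s, i):
    """Index of the '(' matching the ')' at index i."""
    return match_table(s)[i]

def excess(s, i):
    """Number of '(' minus number of ')' in s[0..i], inclusive."""
    return s[: i + 1].count("(") - s[: i + 1].count(")")

def enclose(s, i):
    """Index of the '(' of the tightest pair enclosing the pair opened at i, or None."""
    depth = 0
    for j in range(i - 1, -1, -1):
        if s[j] == ")":
            depth += 1          # a complete pair lies to the left: skip it
        elif depth == 0:
            return j            # first unmatched '(' to the left
        else:
            depth -= 1
    return None

S = "(((()))()())"              # the example string from the slide
```

For example, `findclose(S, 0)` is 11 and `enclose(S, 7)` is 0, since the pair opened at index 7 sits directly inside the outermost pair.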
Parentheses Representation
• 2n bits, O(n) time.
• Θ(n lg n) bits, O(1) time.
• O(n) bits, O(1) time. [Jacobson, `89]
• 2n+o(n) bits, O(1) time. [Munro, Raman, `01]
• 2n+o(n) bits, O(1) time. New data structure.
• Our new DS
– is simpler (no perfect hash tables),
– smaller o(n) term,
– uniform o(n) time and space construction algorithm.
• Implemented and shown to be quite practical
– far more compact than D/S using naïve representation,
– speed comparable to D/S using naïve representation.
XML
• XML: eXtensible Markup Language
– de facto standard for electronic data interchange.
• Document Object Model (DOM): standard API for manipulating XML documents
– holds all data in memory,
– large memory usage.
Example XML document
<person>
  <name>
    <first>Bill</first>
    <surname>Bloggs</surname>
  </name>
  <dob>
    <day>1</day>
    <month>April</month>
    <year>1961</year>
  </dob>
</person>
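• DOM NODE interface has methods PARENT(x), NEXTSIB(x), PREVSIB(x), LASTCHILD(x), FIRSTCHILD(x)
[Figure: DOM tree with root person, whose children are name and dob; name has children firstname and surname; dob has children day, month and year.]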
Using parentheses
<person>
  <name>
    <first>Bill</first>
    <surname>Bloggs</surname>
  </name>
  <dob>
    <day>1</day>
    <month>April</month>
    <year>1961</year>
  </dob>
</person>
Parentheses representation: ( ( ( ) ( ) ) ( ( ) ( ) ( ) ) )
Nodes are numbered 1 to 8 in the order of their opening parentheses.
2n + o(n) bits for tree structure.
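[Figure: the same tree with nodes numbered 1 to 8; node 1 has children 2 and 5; node 2 has children 3 and 4; node 5 has children 6, 7 and 8.]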
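The mapping from the example document to its parenthesis string can be reproduced with a short sketch. It uses Python's standard `xml.etree.ElementTree` parser purely for illustration (the slides do not say how the string was produced), and the helper names `to_parens` and `preorder_tags` are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<person><name><first>Bill</first><surname>Bloggs</surname></name>"
    "<dob><day>1</day><month>April</month><year>1961</year></dob></person>"
)

def to_parens(elem):
    """One '(' when a node is entered and one ')' when it is left."""
    return "(" + "".join(to_parens(child) for child in elem) + ")"

def preorder_tags(elem):
    """Tags in the order of their opening parentheses."""
    tags = [elem.tag]
    for child in elem:
        tags.extend(preorder_tags(child))
    return tags

root = ET.fromstring(DOC)
parens = to_parens(root)
tags = preorder_tags(root)
```

Each of the 8 element nodes contributes exactly two parentheses, so `parens` has 16 characters; text content such as "Bill" is not part of the tree structure and would be stored separately.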
Node interface ops using Parentheses DS
Node interface    Parentheses DS
PARENT            ENCLOSE
NEXTSIB           FINDCLOSE
PREVSIB           FINDOPEN
LASTCHILD         FINDCLOSE, FINDOPEN
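A minimal sketch of how the node operations might reduce to parenthesis operations, assuming a node is identified by the 0-based index of its opening parenthesis. A precomputed match list `m` stands in for FINDCLOSE and FINDOPEN, and PARENT is emulated by hopping over preceding siblings rather than by a true O(1) ENCLOSE. FIRSTCHILD does not appear in the table above; treating it as "the next character, if it opens" is an assumption of this sketch.

```python
def match_table(s):
    """m[i] is the index of the parenthesis matching s[i]."""
    m, stack = [None] * len(s), []
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        else:
            j = stack.pop()
            m[i], m[j] = j, i
    return m

def parent(s, m, x):
    """ENCLOSE: skip complete pairs to the left until an unmatched '(' appears."""
    j = x - 1
    while j >= 0 and s[j] == ")":
        j = m[j] - 1
    return j if j >= 0 else None

def next_sibling(s, m, x):
    """FINDCLOSE, then look one position to the right."""
    j = m[x] + 1
    return j if j < len(s) and s[j] == "(" else None

def prev_sibling(s, m, x):
    """FINDOPEN of the ')' immediately to the left, if there is one."""
    return m[x - 1] if x > 0 and s[x - 1] == ")" else None

def last_child(s, m, x):
    """FINDCLOSE of x, then FINDOPEN of the ')' just before it."""
    j = m[x] - 1
    return m[j] if s[j] == ")" else None

def first_child(s, m, x):
    """Assumed rule: the next character is the first child if it opens."""
    return x + 1 if s[x + 1] == "(" else None

TREE = "((()())(()()()))"       # the person document
M = match_table(TREE)
```

In this string index 0 is person, 1 is name and 7 is dob, so `next_sibling(TREE, M, 1)` is 7 and `last_child(TREE, M, 0)` is also 7.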
Succinct DOM
• Succinct DOM:
– uses far less space than standard DOM,
– performance competitive with DOM.
• Node interface implemented by natural parentheses ops.
• Operations supported by parentheses data structures
– Jacobson `89,
– Munro and Raman `01,
– Our new data structure.
Our new D/S
Input: balanced string of 2n parentheses.
Assume recursive data structure to store balanced string of 2N ≤ 2n parentheses.
If N is O(n / lg² n), store answers explicitly for every pair of parentheses.
Otherwise, divide into blocks of size B = (lg N) / 2.
Number of blocks: β = 4N / lg N.
FINDCLOSE(x)
( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
1 2 3 4 5 6 7 8 9 10
• FINDCLOSE(3)?
• Matching parenthesis inside block – near parenthesis.
• Pre-computed table stores position of matching parentheses for all near parentheses.
– O(1) time if near parenthesis.
– Table size is O(√N (lg N)²).
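One way to picture the pre-computed table is as a map from every possible block content to the in-block match of each position. The sketch below builds such a map for a small, hypothetical block size of 4 (the slides use B = (lg N)/2); a block of B parentheses has 2^B possible contents, which is why the table stays small. The dictionary layout and function names are assumptions for illustration, not the authors' table format.

```python
from itertools import product

def build_near_table(B):
    """Map each length-B pattern to its in-block matches (None marks a far parenthesis)."""
    table = {}
    for pattern in product("()", repeat=B):
        match, stack = [None] * B, []
        for i, ch in enumerate(pattern):
            if ch == "(":
                stack.append(i)
            elif stack:                      # a ')' with no partner in the block stays None
                j = stack.pop()
                match[i], match[j] = j, i
        table["".join(pattern)] = match
    return table

def near_findclose(s, i, B, table):
    """Match of the '(' at index i if it lies in the same block, else None."""
    start = (i // B) * B
    offset = table[s[start:start + B]][i - start]
    return None if offset is None else start + offset

S = "((()((()))()(())()))"                   # the 20-parenthesis example, 0-based
TABLE = build_near_table(4)
```

With 0-based indices, `near_findclose(S, 2, 4, TABLE)` returns 3 (slide position 3 is near), while index 4 (slide position 5) yields `None` because its match lies in another block.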
Pioneer Parentheses
FINDCLOSE(5)?
Matching parenthesis outside block – far parenthesis.
b(p) = block number of the parenthesis at position p
μ(p) = position of the match of p
q is the first far parenthesis before p
p is a pioneer if b(μ(p)) ≠ b(μ(q))
At most 2β-3 open pioneers.
Similarly at most 2β-3 close pioneers.
( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
1 2 3 4 5 6 7 8 9 10
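The pioneer rule can be tried out on the example string with a small sketch. Several points are assumptions made here: positions are 0-based, the block size is 4, q is taken to be the nearest preceding far opening parenthesis (the first far parenthesis has no q and counts as a pioneer), and closing pioneers are defined as the mirror image, scanning from the right.

```python
def match_table(s):
    """m[i] is the index of the parenthesis matching s[i]."""
    m, stack = [None] * len(s), []
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        else:
            j = stack.pop()
            m[i], m[j] = j, i
    return m

def pioneers(s, B):
    """Return (opening pioneers, closing pioneers) for block size B."""
    m = match_table(s)
    n = len(s)
    result = []
    for paren, order in (("(", range(n)), (")", range(n - 1, -1, -1))):
        found, last = [], None           # last = block of the previous far match
        for p in order:
            if s[p] == paren and p // B != m[p] // B:     # p is far
                if m[p] // B != last:                     # match lands in a new block
                    found.append(p)
                last = m[p] // B
        result.append(sorted(found))
    return result[0], result[1]

S = "((()((()))()(())()))"
OPENING, CLOSING = pioneers(S, 4)
```

Under these assumptions the far opening parentheses are at indices 0, 1, 4 and 5, of which only 0 and 4 are pioneers: index 1 matches into the same block as index 0, and index 5 into the same block as index 4.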
Pioneer Family
• Pioneer family: set of all opening and closing pioneers along with their matching parentheses.
• Balanced string of size at most 4β-6.
( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
1 2 3 4 5 6 7 8 9 10
( ( ) )
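The pioneer family of the example can be extracted with the following sketch, again assuming 0-based positions, a block size of 4 (the slides do not state one) and a mirrored definition for closing pioneers. With that block size the family is "( ( ) )", and the marker bit vector coincides with the one printed on the NND slide.

```python
def match_table(s):
    """m[i] is the index of the parenthesis matching s[i]."""
    m, stack = [None] * len(s), []
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        else:
            j = stack.pop()
            m[i], m[j] = j, i
    return m

def pioneer_family(s, B):
    """Sorted positions of all pioneers together with their matches."""
    m = match_table(s)
    n = len(s)
    family = set()
    for paren, order in (("(", range(n)), (")", range(n - 1, -1, -1))):
        last = None
        for p in order:
            if s[p] == paren and p // B != m[p] // B:     # far parenthesis
                if m[p] // B != last:                     # pioneer
                    family.update((p, m[p]))
                last = m[p] // B
    return sorted(family)

S = "((()((()))()(())()))"
FAMILY = pioneer_family(S, 4)
FAMILY_STRING = "".join(S[p] for p in FAMILY)
MARKERS = [1 if p in FAMILY else 0 for p in range(len(S))]
```

Because the family consists of whole matched pairs taken from a balanced string, it is itself balanced, which is what allows the same structure to be applied to it recursively.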
Our D/S
( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
( ( ) )
[Figure labels: string of 2N parentheses; NND; pioneer family of O(N / lg N) parentheses.]
Two levels of recursion. When the pioneer family is O(N / lg² N) we store explicit answers.
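The following sketch shows one way a FINDCLOSE query could combine the near and far cases. It is not the authors' implementation and it is not succinct: the excess values and the pioneers' matches are stored in plain Python lists, in-block scans stand in for table lookups, and the preceding pioneer is found by binary search over opening pioneers instead of through the NND. The class name `ParenSketch` and its fields are hypothetical. The far case relies on the observation that a far parenthesis matches into the same block as the nearest opening pioneer at or before it.

```python
from bisect import bisect_right

def match_table(s):
    """m[i] is the index of the parenthesis matching s[i]."""
    m, stack = [None] * len(s), []
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        else:
            j = stack.pop()
            m[i], m[j] = j, i
    return m

class ParenSketch:
    def __init__(self, s, B):
        self.s, self.B = s, B
        m = match_table(s)                     # construction-time helper only
        self.exc, e = [], 0                    # exc[i] = excess of s[0..i]
        for ch in s:
            e += 1 if ch == "(" else -1
            self.exc.append(e)
        self.pioneers, self.pmatch = [], {}    # opening pioneers and their matches
        last = None
        for p, ch in enumerate(s):
            if ch == "(" and p // B != m[p] // B:
                if m[p] // B != last:
                    self.pioneers.append(p)
                    self.pmatch[p] = m[p]      # the real structure recurses here
                last = m[p] // B

    def findclose(self, p):
        """Index of the ')' matching the '(' at index p."""
        B, exc = self.B, self.exc
        target = exc[p] - 1                    # excess right after the match
        end = min((p // B + 1) * B, len(self.s))
        for i in range(p + 1, end):            # near case: stay inside p's block
            if exc[i] == target:
                return i
        pstar = self.pioneers[bisect_right(self.pioneers, p) - 1]
        q = self.pmatch[pstar]                 # match of the preceding pioneer
        for i in range((q // B) * B, q + 1):   # far case: scan the block holding q
            if exc[i] == target:
                return i
        raise ValueError("input is not a balanced string")

D = ParenSketch("((()((()))()(())()))", 4)
```

On the example, `D.findclose(5)` first fails the in-block scan, then locates pioneer 4 with match 9, and finds the answer 8 in the block that contains index 9.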
Space usage
NND uses O(N lg lg N / lg N) bits.
Tables use O( N lg lg N / lg N) bits.
\[
S(N) =
\begin{cases}
O(N \lg N) & \text{if } N \text{ is } O(n / \lg^2 n),\\
2N + O(N \lg\lg N / \lg N) + S(16N / \lg N) & \text{otherwise.}
\end{cases}
\]
S(n) = 2n + O(n lg lg n / lg n) = 2n + o(n) bits.
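As a hedged worked step, suppose the recurrence reads S(N) = 2N + O(N lg lg N / lg N) + S(16N / lg N), with explicit storage costing O(N lg N) bits once N is O(n / lg² n). Unrolling the two levels of recursion then gives the stated bound. The names N_1 and N_2 are introduced here for illustration only.

```latex
\begin{align*}
N_1 &= 16n/\lg n, \qquad N_2 = 16N_1/\lg N_1 = O(n/\lg^2 n),\\
S(n) &= 2n + O(n \lg\lg n/\lg n) + S(N_1),\\
S(N_1) &= 2N_1 + O(N_1 \lg\lg N_1/\lg N_1) + S(N_2) = O(n/\lg n) + S(N_2),\\
S(N_2) &= O(N_2 \lg N_2) = O(n/\lg n),\\
S(n) &= 2n + O(n \lg\lg n/\lg n).
\end{align*}
```

The second line uses lg N_1 = Θ(lg n), so every term below the top level is O(n / lg n) and is absorbed by the O(n lg lg n / lg n) term.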
Pseudo-pioneers
• Near blocks: blocks which have no pioneers.
• Insert pseudo-pioneers at start and end of every near block.
– Pseudo-pioneers do not affect FINDOPEN(x), FINDCLOSE(x), ENCLOSE(x).
• Gap between pioneers now at most 2B = O(lg N).
NND
• 2n-bit vector used to find the pioneer for a far parenthesis.
• If pioneer at pos i in parentheses string then 1 at i in NND.
• Operations we need:
– Find address of most recent 1 at or before position i: r = Rank(i), p = Select(r)
– Find ith 1 in bit vector: p = Select(i)
• We want succinct representation.
• D/S should be simple and fast.
( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
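The "most recent 1" query is just Select applied to the result of Rank. A naive sketch over a plain Python list makes the composition concrete; it scans linearly, so it only illustrates the semantics, and it uses 0-based positions with 1-based ranks as an assumed convention.

```python
def rank(bits, i):
    """Number of 1s in bits[0..i], inclusive."""
    return sum(bits[: i + 1])

def select(bits, r):
    """Position of the r-th 1, counting from r = 1."""
    seen = 0
    for pos, b in enumerate(bits):
        seen += b
        if b and seen == r:
            return pos
    raise ValueError("fewer than r ones in the vector")

def most_recent_one(bits, i):
    """Position of the last 1 at or before position i, or None if there is none."""
    r = rank(bits, i)
    return select(bits, r) if r else None

# The bit vector from the slide above.
NND = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

For instance, `most_recent_one(NND, 5)` is 4: two 1s occur up to index 5, and the second of them sits at index 4.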
NND
Bit vector of length M with N 1s.
Gap between 1s at most (lg M)^c.
t = lg M / (2c lg lg M).
Select(i)
• Find ith 1 in bit vector.
• Array A1 stores position of every tth 1
– Space is (N / t) lg M = O(M lg lg M / lg M) bits.
• Array A2 stores gaps between consecutive 1s
– Space is O(N lg lg M) or O(M lg lg M / lg M) bits.
• Table T1 allows us to look up the sum of up to t gaps.
– Space is O(√M lg lg M) bits.
SELECT(i)
  i' = ⌊(i+1)/t⌋
  i'' = (i+1) mod t
  y = concat of A2[i'·t + 1], .., A2[i'·t + i'']
  return A1[i'] + T1[y]
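The sampled-position idea can be sketched as follows, under conventions chosen for this illustration only: positions and ranks are 0-based, `A1[k]` holds the position of the 1 with rank k·t, and `A2[j]` holds the gap between the 1s of rank j and j+1. The table T1 is replaced by a direct sum over at most t-1 gaps, so the sketch shows the arithmetic but not the constant-time lookup. The class name `SparseSelect` is hypothetical.

```python
class SparseSelect:
    def __init__(self, bits, t):
        ones = [p for p, b in enumerate(bits) if b]
        self.t = t
        self.A1 = ones[::t]                                  # position of every t-th 1
        self.A2 = [b - a for a, b in zip(ones, ones[1:])]    # gaps between consecutive 1s

    def select(self, i):
        """Position of the 1 with 0-based rank i."""
        k, r = divmod(i, self.t)        # sampled 1 to start from, and gaps still to add
        start = k * self.t
        # Summing the r gaps directly stands in for the T1 table lookup.
        return self.A1[k] + sum(self.A2[start:start + r])

NND = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
SEL = SparseSelect(NND, 2)
```

With t = 2 on this vector, `A1` is `[0, 9]` and `A2` is `[4, 5, 10]`, so `SEL.select(3)` starts at position 9 and adds one gap of 10 to reach 19.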
Rank(i)
• Prefix sum at position i.
• Need two more arrays and tables of size at most O(M lg lg M / lg M) bits.
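The slides do not spell out the Rank arrays, so the following is only a generic sketch of the sampled prefix-sum idea: counts are stored at every k-th position, and the remainder is counted inside one chunk. The in-chunk scan stands in for a table lookup, and the class name `SampledRank` and the chunk length `k` are assumptions of this illustration.

```python
class SampledRank:
    def __init__(self, bits, k):
        self.bits, self.k = bits, k
        self.samples = [0]              # samples[j] = number of 1s in bits[0 : j*k]
        for start in range(0, len(bits), k):
            self.samples.append(self.samples[-1] + sum(bits[start:start + k]))

    def rank(self, i):
        """Number of 1s in bits[0..i], inclusive."""
        j = i // self.k
        # Stored count up to the chunk boundary, plus a scan of at most k bits.
        return self.samples[j] + sum(self.bits[j * self.k : i + 1])

NND = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
RNK = SampledRank(NND, 4)
```

For this vector and k = 4 the stored samples are `[0, 1, 2, 3, 3, 4]`, and `RNK.rank(5)` adds the one 1 found in `bits[4:6]` to the sample 1, giving 2.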
Implementation Details
• C++ on Sun UltraSparc-III and Pentium 4.
• Implemented new and optimised Jacobson D/S.
• CenterPoint XML for DOM.
• Sample of 12 XML documents of varying sizes and node counts.
• Blocksizes 32, 64, 128 and 256.
• Test was depth first tree walk, counting nodes of a given XML type.
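A depth-first walk of this kind might look as follows when driven purely by the parenthesis string. This is a sketch, not the benchmark code: the match list stands in for FINDCLOSE, node tags are assumed to be stored in a separate list in preorder, and the function name `count_tag` is hypothetical.

```python
def match_table(s):
    """m[i] is the index of the parenthesis matching s[i]."""
    m, stack = [None] * len(s), []
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        else:
            j = stack.pop()
            m[i], m[j] = j, i
    return m

def count_tag(parens, tags, wanted):
    """Count nodes labelled `wanted` by walking first-child / next-sibling links."""
    m = match_table(parens)
    node_id = {}                          # index of '(' -> preorder number of the node
    for p, ch in enumerate(parens):
        if ch == "(":
            node_id[p] = len(node_id)

    def walk(x):
        total = 1 if tags[node_id[x]] == wanted else 0
        child = x + 1 if parens[x + 1] == "(" else None      # first child
        while child is not None:
            total += walk(child)
            nxt = m[child] + 1                               # next sibling via FINDCLOSE
            child = nxt if parens[nxt] == "(" else None
        return total

    return walk(0)

PARENS = "((()())(()()()))"
TAGS = ["person", "name", "first", "surname", "dob", "day", "month", "year"]
```

Calling `count_tag(PARENS, TAGS, "day")` visits all 8 nodes of the person document and reports a single match.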
Space usage and performance
• Space usage for tree structure
– Std DOM: 96 bits per node.
– Jacobson: 3.3 to 16 bits per node.
– New D/S: 2.9 to 12.8 bits per node.
• Avg performance for succinct D/S relative to std DOM
– UltraSparc: 1 to 2.5 times slower.
– Pentium 4: 1.7 to 4 times slower.
Conclusions and Future work
• Conceptually simple succinct representation for balanced parentheses with O(1) time ops.
• o(n) time and space construction algorithm.
• Improved lower-order term in the space bound.
• Relative performance very good on UltraSparc but poorer on Pentium 4, which has a small cache
– Cache optimisation is an interesting problem.
• Complete set of D/S for succinct DOM.