5th July 2004CPM 20041 A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila...

23
5th July 2004 CPM 2004 1 A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila Rahman, Rajeev Raman (University of Leicester, UK) and Venkatesh Raman (Institute for Mathematical Sciences, Chennai, India)

Transcript of 5th July 2004CPM 20041 A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila...

5th July 2004 CPM 2004 1

A Simple Optimal Representation for Balanced Parentheses

Richard Geary, Naila Rahman, Rajeev Raman

(University of Leicester, UK)

and

Venkatesh Raman

(Institute for Mathematical Sciences, Chennai, India)

5th July 2004 CPM 2004 2

A Parentheses Data Structure

• Given: Balanced string of 2n parentheses.

( ( ( ( ) ) ) ( ) ( ) )

• Support operations:– ENCLOSE ( i )– FINDCLOSE ( i ), FINDOPEN( i )– EXCESS ( i )

• Applications to suffix tree, ordinal trees and stack-sortable permutations.

5th July 2004 CPM 2004 3

Parentheses Representation

2n bits, O(n) time. Θ(n lg n ) bits, O(1) time. O(n) bits, O(1) time. [Jacobson, `89]2n+o(n) bits, O(1) time. [Munro, Raman, `01]2n+o(n) bits, O(1) time. New data structure.

• Our new DS

– is simpler (no perfect hash tables),

– smaller o(n) term,

– uniform o(n) time and space construction algorithm.• Implemented and shown to be quite practical

– far more compact than D/S using naïve representation,

– speed comparable to D/S using naïve representation.

5th July 2004 CPM 2004 4

XML

• XML: eXtensible Markup Language– de facto standard for electronic data interchange.

• Document Object Model (DOM) standard API for manipulating XML documents – holds all data in memory,

– large memory usage.

5th July 2004 CPM 2004 5

Example XML document

<person> <name> <first>Bill</first> <surname>Bloggs</surname> </name> <dob> <day>1</day> <month>April</month> <year>1961</year> </dob></person>

• DOM NODE interface has methods PARENT(x), NEXTSIB(x), PREVSIB(x), LASTCHILD(x),FIRSTCHILD(x)

person

name

firstname surname day month year

dob

5th July 2004 CPM 2004 6

Obvious representation

• 2n pointers– DOM: 3n.

• Ω(n log n) bits.

5th July 2004 CPM 2004 7

Using parentheses

<person> <name> <first>Bill</first> <surname>Bloggs</surname> </name> <dob> <day>1</day> <month>April</month> <year>1961</year> </dob></person>

parentheses representation: ( ( ( ) ( ) ) ( ( ) ( ) ( ) ) )1 2 3 4 5 6 7 8

2n + o(n) bits for tree structure.

1

2

3 4 6 7 8

5

5th July 2004 CPM 2004 8

Node interface ops using Parentheses DS

Node interface Parentheses DS

PARENT ENCLOSE

NEXTSIB FINDCLOSE

PREVSIB FINDOPEN

LASTCHILD FINDCLOSE, FINDOPEN

5th July 2004 CPM 2004 9

Succinct DOM

• Succinct DOM:

– uses far less space than standard DOM,

– performance competitive with DOM.

• Node interface implemented by natural parentheses ops.

• Operations supported by parentheses data structures

– Jacobson `89,

– Munro and Raman `01,

– Our new data structure.

5th July 2004 CPM 2004 10

Our new D/S

Input: balanced string of 2n parentheses.

Assume recursive data structure to store balanced string of 2N 2n parentheses.

If N is O(n / lg2 n) store answers explicitly for every pair of parentheses.

OtherwiseDivide into blocks of size Number of blocks

2/)(lg NB NN lg/4

5th July 2004 CPM 2004 11

FINDCLOSE(x)

( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )

1 2 3 4 5 6 7 8 9 10

• FINDCLOSE(3)?• Matching parenthesis inside block – near parenthesis.

• Pre-computed table stores position of matching parentheses for all near parentheses.– O(1) time if near parenthesis.

– Table size is ))(lg( 2NNO

5th July 2004 CPM 2004 12

Pioneer Parentheses

FINDCLOSE(5)?

Matching parenthesis outside block – far parenthesis.

b(p) = block# of parenthesis at position p

= position of match of p

q is 1st far parenthesis before p

p is pioneer if

At most 2β-3 open pioneers.

Similarly at most 2β-3 close pioneers.

( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )

1 2 3 4 5 6 7 8 9 10

)( p

))(())!(( qbpb

5th July 2004 CPM 2004 13

Pioneer Family

• Pioneer family: set of all opening and closing pioneers along with their matching parentheses.

• Balanced string of size at most 4β-6.

( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )

1 2 3 4 5 6 7 8 9 10

( ( ) )

5th July 2004 CPM 2004 14

Our D/S

( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )

( ( ) )

NND

2N

O(N / lg N)

Two levels of recursion. When pioneer family is O(N/lg2N) we store explicit answers.

5th July 2004 CPM 2004 15

Space usage

NND uses O(N lg lg N / lg N) bits.

Tables use O( N lg lg N / lg N) bits.

otherwiseN) N/O(NN)N/S(N

n)n /if N is O( N)O(NNS

lglglglg162

lg lg)(

2

S(n) = 2n+ O(n lg lg n / lg n) = 2n +o(n) bits.

5th July 2004 CPM 2004 16

Pseudo-pioneers

• Near blocks: blocks which have no pioneers.• Insert pseudo-pioneers at start and end of every

near block.– Pseudo-pioneers do not effect FINDOPEN(x), FINDCLOSE(x), ENCLOSE(x)

• Gap between pioneers now at most 2B = O(lg N).

5th July 2004 CPM 2004 17

NND

• 2n-bit vector used to find the pioneer for a far parenthesis.

• If pioneer at pos i in parentheses string then 1 at i in NND.

• Operations we need:– Find address of most recent 1 at position i

r = Rank(i)p = Select(r)

– Find ith 1in bit vector p = Select(i)

• We want succinct representation.• D/S should be simple and fast.

( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )

1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1

5th July 2004 CPM 2004 18

NND

Bit vector of length M with N 1s.

Gap between 1s at most (lg M)c.

t = lg M / 2 c lg lg M.

5th July 2004 CPM 2004 19

Select(i)• Find ith 1 in bit vector.• Array A1 stores position of every tth 1

– Space is

• Array A2 stores gaps between consecutive 1s– Space is O( N lg lg M ) or O( M lg lg M / lg M ) bits.

• Table T1 allows us to lookup sum of upto t gap.– Space is

SELECT(i)i’ = i’’ = (i+1)/mod ty = concat of A2[i’+1],..,A2[i’+i’’]return A1[x] + T1[y]

bits )lg/lglg(/lg MMMOtMN

bits. )lglg( MMO

1)/t(i

5th July 2004 CPM 2004 20

Rank(i)

• Prefix sum at position i.• Need two more arrays and tables of size at most

O(M lg lg M / lg M) bits.

5th July 2004 CPM 2004 21

Implementation Details

• C++ on Sun UltraSparc-III and Pentium 4.• Implemented new and optimised Jacobson D/S.• CenterPoint XML for DOM.• Sample of 12 XML documents of varying sizes

and node counts.• Blocksizes 32, 64, 128 and 256.• Test was depth first tree walk, counting nodes of a

given XML type.

5th July 2004 CPM 2004 22

Space usage and performance

• Space usage for tree structure– Std DOM: 96 bits per node.

– Jacobson: 3.3 – 16 bits per node.

– New D/S: 2.9 – 12.8 bits per node.

• Avg performance for succinct D/S relative to std DOM– UltraSparc: 1 to 2.5 times slower.

– Pentium 4: 1.7 to 4 times slower.

5th July 2004 CPM 2004 23

Conclusions and Future work

• Conceptually simple succinct representation for balanced parentheses with O(1) time ops.

• o(n) time and space construction algorithm.• Improved lower bound term for space bound.• Relative performance very good on UltraSparc but

poorer on Pentium 4, which has small cache– Cache optimisation is an interesting problem.

• Complete set of D/S for succinct DOM.