Buffering in Query Evaluation over XML Streams


Ziv Bar-Yossef
Technion

Marcus Fontoura, Vanja Josifovski
IBM Almaden Research Center

2

XML Document

<paper>
  <section id="1">
    <title>Intro</title>
    <content>bla bla bla</content>
  </section>
  <section id="2">
    <title>Results</title>
    <content>yada yada yada</content>
  </section>
  <section id="3">
    <title>Conclusions</title>
    <content>etc etc etc</content>
  </section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>

3

XML Document Tree

[Figure: tree view of the XML document. The root has a single paper child; paper has three section children (each with id, title, and content), followed by a title child and two author children.]

4

XPath Queries


[Figure: the XML document tree.]

/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content

5

XPath Queries


[Figure: the XML document tree.]

/paper[title != section/title]/author

6

XPath query = path pattern + predicates
XPath 2.0, forward axes only
Eval(Q,D): the nodes in D that match Q

Two modes of XPath evaluation:
  Full-fledged evaluation: given Q and D, output Eval(Q,D)
  Filtering: given Q and D, determine whether Eval(Q,D) is nonempty
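The difference between the two modes can be illustrated on the sample document and the first sample query, /paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content. The sketch below is a non-streaming, hand-coded evaluation of that one query with Python's standard ElementTree module; the function names eval_query and filter_query are hypothetical, and the predicates are written out in Python rather than parsed from the XPath string.

```python
import xml.etree.ElementTree as ET

# The sample document from the slides (attribute values quoted for well-formedness).
DOC = """<paper>
  <section id="1"><title>Intro</title><content>bla bla bla</content></section>
  <section id="2"><title>Results</title><content>yada yada yada</content></section>
  <section id="3"><title>Conclusions</title><content>etc etc etc</content></section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>"""


def eval_query(paper):
    """Full-fledged evaluation: return the content nodes selected by the query."""
    if paper.tag != "paper":
        return []
    # Predicate on paper: author = "Papadimitriou"
    if not any(a.text == "Papadimitriou" for a in paper.findall("author")):
        return []
    result = []
    for section in paper.findall("section"):
        # Predicate on section: @id = "2" or title = "Intro"
        has_intro_title = any(t.text == "Intro" for t in section.findall("title"))
        if section.get("id") == "2" or has_intro_title:
            result.extend(section.findall("content"))
    return result


def filter_query(paper):
    """Filtering: only report whether Eval(Q, D) is nonempty."""
    return bool(eval_query(paper))


root = ET.fromstring(DOC)
selected = [node.text for node in eval_query(root)]
print(selected)            # ['bla bla bla', 'yada yada yada']
print(filter_query(root))  # True
```

Filtering needs only one bit of answer, which is why it can sometimes avoid the buffering that full-fledged evaluation requires.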

7

XML Streams

XML stream: a sequence of SAX events
  startDocument(), endDocument(), startElement(name), endElement(name), text(str)

Why XML streams?
  For transferring XML between systems
  For efficient access to large XML documents

Critical resources
  Memory
  Processing time
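As a concrete illustration of the event model, the following sketch turns a small document into a list of SAX events with Python's standard xml.sax module. The class name EventRecorder and the tuple encoding of events are illustrative choices, and Python's characters() callback plays the role of text(str) in the list above.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records each SAX callback as a tuple (hypothetical encoding, for illustration)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # Whitespace-only text is skipped to keep the trace short.
        if content.strip():
            self.events.append(("text", content.strip()))


def sax_events(xml_bytes):
    """Parse xml_bytes and return the recorded event sequence."""
    handler = EventRecorder()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


events = sax_events(b"<paper><title>Intro</title></paper>")
for event in events:
    print(event)
```

A streaming algorithm sees these events one at a time, in order, and anything it wants to consult later has to be kept in its own memory.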

8

Streaming XML Algorithms

  XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
  X-scan [Ives, Levy, and Weld 00]
  XMLTK [Avila-Campillo et al 02]
  XTrie [Chan et al 02]
  SPEX [Olteanu, Kiesling, and Bry 03]
  Lazy DFAs [Green et al 03]
  The XPush Machine [Gupta and Suciu 03]
  XSQ [Peng and Chawathe 03]
  FluX [Koch et al 04]
  TurboXPath [Josifovski, Fontoura, and Barta 05]
  …

All of them use large amounts of memory on certain queries and documents

9

Memory Bottleneck I: Storage of Large Transition Tables

Framework of most algorithms:
  Q → NFA
  Simulate the NFA by a DFA

Caveat: exponential blowup
However: the exponential blowup is not necessary [Bar-Yossef, Fontoura, Josifovski 04]
  Algorithm for filtering XML streams whose space is linear in the query size
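One way to see why a precomputed DFA table can be avoided is to simulate the NFA directly, keeping only the set of currently active states. The sketch below does this for linear path queries without predicates. It is an illustration under stated assumptions, not the algorithm of [Bar-Yossef, Fontoura, Josifovski 04]: the function name matches_path, the (axis, name) encoding of query steps, and the ("start", name) / ("end", name) event encoding are all hypothetical.

```python
def matches_path(steps, events):
    """Filtering for a linear path query by direct NFA simulation.

    steps:  list of (axis, name) pairs, axis being "/" or "//"
    events: list of ("start", name) / ("end", name) pairs
    NFA state k means "the first k steps have been matched".
    """
    n = len(steps)
    stack = [{0}]  # active state sets, one per open element (plus the root level)
    for kind, name in events:
        if kind == "start":
            active = set()
            for k in stack[-1]:
                if k == n:
                    continue
                axis, step_name = steps[k]
                if step_name == name or step_name == "*":
                    active.add(k + 1)  # this element matches step k
                if axis == "//":
                    active.add(k)      # descendant axis: keep waiting deeper down
            if n in active:
                return True            # some element matches the whole path
            stack.append(active)
        else:
            stack.pop()
    return False


# <r><a><b/></a></r>
doc1 = [("start", "r"), ("start", "a"), ("start", "b"),
        ("end", "b"), ("end", "a"), ("end", "r")]
# <r><a><c><b/></c></a></r>
doc2 = [("start", "r"), ("start", "a"), ("start", "c"), ("start", "b"),
        ("end", "b"), ("end", "c"), ("end", "a"), ("end", "r")]
query = [("//", "a"), ("/", "b")]  # the path //a/b

print(matches_path(query, doc1))  # True
print(matches_path(query, doc2))  # False
```

Each state set has at most one entry per query step, so no table exponential in the query size is ever built; the price in this naive version is one state set per level of document depth.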

10

Memory Bottleneck II: Buffering of Document Fragments

Scenario 1: buffering nodes that may or may not be part of the output.

[Figure: the XML document tree.]
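Scenario 1 can be made concrete with a hand-coded streaming evaluator. The slide does not state a query, so the sketch below assumes the earlier sample query /paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content and the sample document, in which the author elements arrive after all sections. The class name Scenario1Evaluator and its fields are hypothetical, and the code handles only this one query on non-recursive documents of this shape.

```python
import xml.sax

DOC = b"""<paper>
  <section id="1"><title>Intro</title><content>bla bla bla</content></section>
  <section id="2"><title>Results</title><content>yada yada yada</content></section>
  <section id="3"><title>Conclusions</title><content>etc etc etc</content></section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>"""


class Scenario1Evaluator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []              # names of the currently open elements
        self.text = []              # text chunks of the current element
        self.author_seen = False    # paper-level predicate resolved to true?
        self.section_ok = False     # current section satisfies its predicate?
        self.section_contents = []  # content values of the current section
        self.buffer = []            # candidates waiting for the author predicate
        self.output = []
        self.max_buffered = 0       # peak size of self.buffer

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self.path == ["paper", "section"]:
            self.section_ok = attrs.get("id") == "2"
            self.section_contents = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["paper", "section", "title"] and value == "Intro":
            self.section_ok = True
        elif self.path == ["paper", "section", "content"]:
            self.section_contents.append(value)
        elif self.path == ["paper", "section"] and self.section_ok:
            if self.author_seen:
                self.output.extend(self.section_contents)
            else:
                # Not yet known whether these nodes belong to the output.
                self.buffer.extend(self.section_contents)
                self.max_buffered = max(self.max_buffered, len(self.buffer))
        elif self.path == ["paper", "author"] and value == "Papadimitriou":
            self.author_seen = True
            self.output.extend(self.buffer)  # predicate true: flush candidates
            self.buffer = []
        elif self.path == ["paper"] and not self.author_seen:
            self.buffer = []                 # predicate false: drop candidates
        self.path.pop()
        self.text = []


evaluator = Scenario1Evaluator()
xml.sax.parseString(DOC, evaluator)
print(evaluator.output)        # ['bla bla bla', 'yada yada yada']
print(evaluator.max_buffered)  # 2
```

The two content nodes sit in the buffer until the first author element arrives; had the authors preceded the sections, nothing would have needed buffering across sections.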


11

Memory Bottleneck II: Buffering of Document Fragments

Scenario 2: buffering nodes needed for evaluating pending predicates.

[Figure: the XML document tree.]

/paper[title != section/title]/author
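The predicate title != section/title compares two node sets, and under XPath's existential semantics it holds if some paper title differs from some section title. A streaming evaluator that sees the section titles first has to remember them until a paper title arrives. The sketch below is a naive hand-coded illustration for this one query on the sample document; the class name Scenario2Evaluator and its fields are hypothetical, and no claim is made that this buffering strategy is optimal.

```python
import xml.sax

DOC = b"""<paper>
  <section id="1"><title>Intro</title><content>bla bla bla</content></section>
  <section id="2"><title>Results</title><content>yada yada yada</content></section>
  <section id="3"><title>Conclusions</title><content>etc etc etc</content></section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>"""


class Scenario2Evaluator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []
        self.text = []
        self.paper_titles = []       # buffered values of paper/title
        self.section_titles = []     # buffered values of paper/section/title
        self.predicate_true = False
        self.pending_authors = []    # authors seen before the predicate resolved
        self.output = []
        self.max_buffered_titles = 0

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["paper", "title"]:
            self._record_title(self.paper_titles, value)
        elif self.path == ["paper", "section", "title"]:
            self._record_title(self.section_titles, value)
        elif self.path == ["paper", "author"]:
            if self.predicate_true:
                self.output.append(value)
            else:
                self.pending_authors.append(value)
        self.path.pop()
        self.text = []

    def _record_title(self, bucket, value):
        if self.predicate_true:
            return  # predicate already resolved: nothing more to remember
        bucket.append(value)
        if any(t != s for t in self.paper_titles for s in self.section_titles):
            # Some pair differs: the predicate is true, buffers can be released.
            self.predicate_true = True
            self.paper_titles.clear()
            self.section_titles.clear()
            self.output.extend(self.pending_authors)
            self.pending_authors.clear()
        else:
            buffered = len(self.paper_titles) + len(self.section_titles)
            self.max_buffered_titles = max(self.max_buffered_titles, buffered)


evaluator = Scenario2Evaluator()
xml.sax.parseString(DOC, evaluator)
print(evaluator.output)               # ['Papadimitriou', 'Yannakakis']
print(evaluator.max_buffered_titles)  # 3
```

Here the buffered nodes are never part of the output; they are held only because the predicate that mentions them is still pending.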

12

Memory Bottleneck II: Buffering of Document Fragments

Scenario 3: buffering multiple candidate matches that are nested within each other.

[Figure: a document tree in which a nodes are nested within each other, with b and c children.]

//a[b and c]

Relevant only when the document is “recursive”
Space required: Ω(doc-recursion-depth) [Bar-Yossef, Fontoura, Josifovski 04]
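For //a[b and c], every open a element is a separate candidate whose predicate is still undecided, so a streaming evaluator keeps one record per nesting level. The sketch below is a hand-coded illustration for this query only; the class name NestedCandidates, the frame layout, and the test document are assumptions, and matching a elements are reported by their position among the a start tags.

```python
import xml.sax


class NestedCandidates(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []      # one frame per currently open element
        self.a_count = 0     # number of a start tags seen so far
        self.open_a = 0      # number of a elements currently open
        self.max_open_a = 0  # peak number of simultaneous candidates
        self.matches = []    # indices of a elements satisfying [b and c]

    def startElement(self, name, attrs):
        # A b or c start tag satisfies one conjunct of its parent's predicate.
        if self.stack and self.stack[-1]["name"] == "a" and name in ("b", "c"):
            self.stack[-1]["has_" + name] = True
        frame = {"name": name, "has_b": False, "has_c": False, "index": None}
        if name == "a":
            self.a_count += 1
            frame["index"] = self.a_count
            self.open_a += 1
            self.max_open_a = max(self.max_open_a, self.open_a)
        self.stack.append(frame)

    def endElement(self, name):
        frame = self.stack.pop()
        if name == "a":
            self.open_a -= 1
            if frame["has_b"] and frame["has_c"]:
                self.matches.append(frame["index"])


# Three nested a elements; only the innermost has both a b child and a c child.
DOC = b"<root><a><c/><a><b/><a><c/><b/></a></a></a></root>"

handler = NestedCandidates()
xml.sax.parseString(DOC, handler)
print(handler.matches)     # [3]
print(handler.max_open_a)  # 3
```

The number of simultaneously open candidates equals the nesting depth of a elements, which is the recursion depth the slide refers to.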

13

Our Results

Quantitative space lower bounds for:
  Full-fledged evaluation of queries with predicates (Scenario 1)
  Filtering/full-fledged evaluation of queries with “multi-variate” predicates (Scenario 2)

Matching upper bound
  Eager evaluation of predicates

In all other scenarios: no buffering required
  Filtering of queries with “univariate” predicates over non-recursive documents is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]

14

Related Work

Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]

Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]

Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

15

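Document Concurrency

Q: query
$D = \sigma_1,\ldots,\sigma_n$: document
  Each $\sigma_i$ is a SAX event
  $D_t = (\sigma_1,\ldots,\sigma_t)$

Definition: $x \in D$ is alive at step $t$ if $x \in D_t$ and $\exists\, \alpha, \beta$ s.t.
  $x \in \mathrm{Eval}(Q, D_t \circ \alpha)$
  $x \notin \mathrm{Eval}(Q, D_t \circ \beta)$

t-concurrency(D,Q): number of nodes that are alive at step $t$

concurrency(D,Q): $\max_t$ t-concurrency(D,Q)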

16

Concurrency: Example

 1: <paper>
 2: <section id="1">
 3: <title>
 4: Intro
 5: </title>
 6: <content>
 7: bla bla bla
 8: </content>
 9: </section>
10: <section id="2">
11: <title>
12: Results
13: </title>
14: <content>
15: yada yada yada
16: </content>
17: </section>
18: <section id="3">
19: <title>
20: Conclusions
21: </title>
22: <content>
23: etc etc etc
24: </content>
25: </section>
26: <title>
27: On the Complexity of Database Queries
28: </title>
29: <author>
30: Papadimitriou
31: </author>
32: <author>
33: Yannakakis
34: </author>
35: </paper>

[Figure labels: alive, alive, dead]


17

Lower Bound Notions

A “normal” lower bound:
  For every algorithm A, there exist Q and D s.t. A uses Ω(concurrency(D,Q)) bits of space on Q and D.
  Q and D may be “pathological”
  Doesn’t say much about real-world queries/documents

An “ideal” lower bound:
  For every A, every Q, and every D, A uses Ω(concurrency(D,Q)) bits of space on Q and D.
  Too good to be true
    A can have D and Q “hard-coded”, and then know the result a priori
    Space of A on D and Q = minimum description length of Q and D

18

Our Lower Bound

Theorem: For every A, every Q, and every D, there exists an almost isomorphic document D’ s.t. A uses Ω(concurrency(D,Q)) bits of space on Q and D’.
  D’ is the same as D, except for a few extra empty nodes with auxiliary names.

The theorem holds only if:
  Q is “star-free”
  D is non-recursive

19

Why isn’t this Obvious?

Reason 1: we want the theorem to work for every Q and D, not only ones with high MDL.

Reason 2:
  Obvious: if x is alive at step t ⇒ A has to remember x
    Because: A may or may not need to output x
  Not obvious: if x and y are alive at step t ⇒ A has to remember both
    If x and y are not “independent”, maybe it’s enough to remember just x (or just y)

20

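Proof of Lower Bound

$C$ = t-concurrency(D,Q)
$x_1,\ldots,x_C$ = nodes that are alive at step $t$

Recall: for every $x_i$ there exist $\alpha_i$ and $\beta_i$ s.t.
  $x_i \in \mathrm{Eval}(Q, D_t \circ \alpha_i)$
  $x_i \notin \mathrm{Eval}(Q, D_t \circ \beta_i)$

Lemma: there exist a single $\alpha$ and a single $\beta$ s.t. for all $i$,
  $x_i \in \mathrm{Eval}(Q, D_t \circ \alpha)$
  $x_i \notin \mathrm{Eval}(Q, D_t \circ \beta)$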

21

Proof of Lower Bound (cont.)

For every $S \subseteq \{1,\ldots,C\}$ define a document $D_S$:
  $D_S$ is the same as $D$, except that for every $i \in S$ we “mark” $x_i$
  Marking: an extra empty child with an auxiliary name
  Note: $D_S$ is almost isomorphic to $D$

A = any algorithm
Note: from the output of A on $D_S$, one can “reconstruct” the set $S$.
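The marking construction can be tried out on a toy document. In the sketch below the candidate nodes are the content elements of a two-section paper, a mark is an empty child with the auxiliary name mark, and the "output" is simply the list of candidate nodes with their subtrees, as if the query selected all of them. The helper names mark_document and reconstruct, the auxiliary element name, and the use of 0-based indices in place of 1,…,C are all assumptions made for illustration.

```python
import copy
import itertools
import xml.etree.ElementTree as ET

DOC = (
    "<paper>"
    "<section id='1'><content>bla bla bla</content></section>"
    "<section id='2'><content>yada yada yada</content></section>"
    "</paper>"
)


def mark_document(root, subset, aux_name="mark"):
    """Build D_S: a copy of D in which candidate i gets an extra empty child, for i in S."""
    marked = copy.deepcopy(root)
    candidates = marked.findall("section/content")
    for i in subset:
        ET.SubElement(candidates[i], aux_name)
    return marked


def reconstruct(output_nodes, aux_name="mark"):
    """Recover S from the output: the positions of the nodes that carry a mark."""
    return frozenset(
        i for i, node in enumerate(output_nodes) if node.find(aux_name) is not None
    )


root = ET.fromstring(DOC)
C = len(root.findall("section/content"))

# Enumerate every subset S of {0, ..., C-1}.
subsets = [
    frozenset(combo)
    for size in range(C + 1)
    for combo in itertools.combinations(range(C), size)
]

# Each D_S yields a different output, from which S is recovered exactly.
recovered = {
    reconstruct(mark_document(root, S).findall("section/content")) for S in subsets
}
print(len(recovered))  # 4, i.e. 2**C
```

Since all 2^C documents lead to distinguishable outputs, an algorithm that has consumed the marks must retain enough state to tell them apart.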

22

Proof of Lower Bound (cont.)

Consider the state of A at step $t$ when running on $D_S$

If suffix = $\beta$, none of the $x_i$’s should be output
  ⇒ A could not have output any $x_i$ by step $t$

If suffix = $\alpha$, there is no information in the suffix about $S$, but $S$ can be reconstructed from the output
  ⇒ the state of A at step $t$ must have all the information about $S$

Conclusion: space ≥ Ω(C)

Actual proof: by one-way communication complexity
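The counting step behind the conclusion can be written out as follows. This is an informal sketch of the pigeonhole reasoning on this slide for a deterministic algorithm A, not the communication-complexity proof itself. The documents built from different sets S differ only before step t and are all completed by the same suffix; since distinct sets S force distinct outputs, they must leave A in distinct states after step t.

```latex
\begin{align*}
\#\{\text{states of } A \text{ after step } t\}
  &\ge \#\{S : S \subseteq \{1,\ldots,C\}\} = 2^{C}, \\
\text{space of } A \text{ at step } t
  &\ge \log_2 2^{C} = C \text{ bits}.
\end{align*}
```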

23

Conclusions

Our contributions:
  Quantitative space lower bounds
    Full-fledged evaluation of queries with predicates
    Filtering/full-fledged evaluation of queries with “multi-variate” predicates
  Matching upper bound

Open problems:
  Quantitative lower bounds for XQuery evaluation over streams
  Address larger fragments of XPath
