Buffering in Query Evaluation over XML Streams

Ziv Bar-Yossef (Technion), Marcus Fontoura and Vanja Josifovski (IBM Almaden Research Center)


Page 1: Buffering in Query Evaluation over XML Streams

Buffering in Query Evaluation over XML Streams

Ziv Bar-Yossef, Technion
Marcus Fontoura and Vanja Josifovski, IBM Almaden Research Center

Page 2: Buffering in Query Evaluation over XML Streams

XML Document

<paper>
  <section id="1">
    <title>Intro</title>
    <content>bla bla bla</content>
  </section>
  <section id="2">
    <title>Results</title>
    <content>yada yada yada</content>
  </section>
  <section id="3">
    <title>Conclusions</title>
    <content>etc etc etc</content>
  </section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>

Page 3: Buffering in Query Evaluation over XML Streams

XML Document Tree

root
  paper
    section
      id: 1
      title: Intro
      content: bla bla bla
    section
      id: 2
      title: Results
      content: yada yada yada
    section
      id: 3
      title: Conclusions
      content: etc etc etc
    title: On the Complexity of Database Queries
    author: Papadimitriou
    author: Yannakakis

Page 4: Buffering in Query Evaluation over XML Streams

XPath Queries

[Figure: the XML document tree of the example paper.]

/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content

Page 5: Buffering in Query Evaluation over XML Streams

XPath Queries

[Figure: the XML document tree of the example paper.]

/paper[title != section/title]/author

Page 6: Buffering in Query Evaluation over XML Streams

XPath query = path pattern + predicates
  XPath 2.0, forward axes only
  Eval(Q,D): the nodes in D that match Q
  Two modes of XPath evaluation:
    Full-fledged evaluation: given Q and D, output Eval(Q,D)
    Filtering: given Q and D, determine whether Eval(Q,D) is nonempty

Page 7: Buffering in Query Evaluation over XML Streams

XML Streams
  XML stream: a sequence of SAX events
    startDocument(), endDocument(), startElement(name), endElement(name), text(str)
  Why XML streams?
    For transferring XML between systems
    For efficient access to large XML documents
  Critical resources
    Memory
    Processing time
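As a small illustration of the event sequence listed above, the sketch below records the SAX events that Python's standard xml.sax module emits for a fragment of the example document. The EventRecorder class and the event strings are names made up for this illustration; Python calls the text event characters() rather than text(str).

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Hypothetical helper: records each SAX callback as a readable string."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument()")

    def endDocument(self):
        self.events.append("endDocument()")

    def startElement(self, name, attrs):
        self.events.append(f"startElement({name})")

    def endElement(self, name):
        self.events.append(f"endElement({name})")

    def characters(self, content):
        # Python's name for the slide's text(str) event; skip pure whitespace.
        if content.strip():
            self.events.append(f"text({content.strip()})")


recorder = EventRecorder()
xml.sax.parseString(b'<section id="1"><title>Intro</title></section>', recorder)
events = recorder.events
for event in events:
    print(event)
```

The handler sees one event at a time and never holds the whole document, which is why memory becomes the resource to watch.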

Page 8: Buffering in Query Evaluation over XML Streams

Streaming XML Algorithms
  XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
  X-scan [Ives, Levy, and Weld 00]
  XMLTK [Avila-Campillo et al 02]
  XTrie [Chan et al 02]
  SPEX [Olteanu, Kiesling, and Bry 03]
  Lazy DFAs [Green et al 03]
  The XPush Machine [Gupta and Suciu 03]
  XSQ [Peng and Chawathe 03]
  FluX [Koch et al 04]
  TurboXPath [Josifovski, Fontoura, and Barta 05]
  …

All of them use lots of memory on certain queries and documents

Page 9: Buffering in Query Evaluation over XML Streams

Memory Bottleneck I: Storage of Large Transition Tables
  Framework of most algorithms:
    Q → NFA
    Simulate the NFA by a DFA
  Caveat: exponential blowup
  However: the exponential blowup is not necessary
    [Bar-Yossef, Fontoura, Josifovski 04]: an algorithm for filtering XML streams whose space is linear in the query size
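The blowup can be seen on a textbook automaton (this is a generic illustration, not the construction used by any of the cited systems). Assume a path query of the form //a/*/…/* with k steps in total, read against the sequence of element names on a root-to-node path: the query matches when the k-th name from the end is a. The NFA for this has k+1 states, while the subset construction reaches 2^k DFA states. The function name count_dfa_states and the two-letter alphabet are assumptions of the sketch.

```python
def count_dfa_states(k, alphabet=("a", "b")):
    """Count DFA states reachable by subset construction from the
    (k+1)-state NFA for 'the k-th symbol from the end is a'.

    NFA: state 0 loops on every symbol and also moves to 1 on 'a';
    states 1..k-1 move to the next state on every symbol; k accepts.
    """

    def step(subset, symbol):
        nxt = {0}  # state 0 is always kept because of its self-loop
        for q in subset:
            if q == 0 and symbol == "a":
                nxt.add(1)
            elif 1 <= q < k:
                nxt.add(q + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:
        subset = frontier.pop()
        for symbol in alphabet:
            target = step(subset, symbol)
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return len(seen)


for k in (1, 2, 3, 10):
    print(k, k + 1, count_dfa_states(k))  # k, NFA states, DFA states
```

Each DFA state has to remember which of the last k names were a, which is where the factor 2^k comes from.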

Page 10: Buffering in Query Evaluation over XML Streams

Memory Bottleneck II: Buffering of Document Fragments
  Scenario 1: buffering nodes, which may or may not be part of the output.

[Figure: the XML document tree of the example paper.]

/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content
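Scenario 1 can be made concrete with a hand-written evaluator for this one query over the example document. It is only a sketch under stated assumptions, not the algorithm from the talk: the class name BufferingEvaluator and its fields are invented here, the query is hard-coded, and section predicates are resolved lazily when the section closes.

```python
import xml.sax

DOC = b"""<paper>
<section id="1"><title>Intro</title><content>bla bla bla</content></section>
<section id="2"><title>Results</title><content>yada yada yada</content></section>
<section id="3"><title>Conclusions</title><content>etc etc etc</content></section>
<title>On the Complexity of Database Queries</title>
<author>Papadimitriou</author>
<author>Yannakakis</author>
</paper>"""


class BufferingEvaluator(xml.sax.ContentHandler):
    """Hypothetical evaluator, hard-coded for the query above."""

    def __init__(self):
        super().__init__()
        self.path = []           # names of the currently open elements
        self.text = []           # character data of the innermost element
        self.paper_ok = False    # author = "Papadimitriou" seen so far?
        self.section_ok = False  # section predicate true so far?
        self.section_buf = []    # contents of the currently open section
        self.buffer = []         # contents waiting only on the author test
        self.output = []
        self.peak = 0            # largest number of contents held at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self.path == ["paper", "section"]:
            self.section_ok = attrs.get("id") == "2"
            self.section_buf = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["paper", "section", "title"] and value == "Intro":
            self.section_ok = True
        elif self.path == ["paper", "section", "content"]:
            self.section_buf.append(value)  # may or may not be output
        elif self.path == ["paper", "author"] and value == "Papadimitriou":
            self.paper_ok = True
            self.output.extend(self.buffer)  # candidates become results
            self.buffer = []
        elif self.path == ["paper", "section"]:
            if self.section_ok:
                target = self.output if self.paper_ok else self.buffer
                target.extend(self.section_buf)
            self.section_buf = []            # failed candidates are dropped
        elif self.path == ["paper"] and not self.paper_ok:
            self.buffer = []                 # predicate never became true
        self.path.pop()
        self.text = []
        held = len(self.buffer) + len(self.section_buf)
        self.peak = max(self.peak, held)


evaluator = BufferingEvaluator()
xml.sax.parseString(DOC, evaluator)
print(evaluator.output, evaluator.peak)
```

Because the author elements arrive after all sections, every content node that passes its section predicate has to wait in the buffer; the third content is held only until its section closes.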

Page 11: Buffering in Query Evaluation over XML Streams

Memory Bottleneck II: Buffering of Document Fragments
  Scenario 2: buffering nodes needed for evaluating pending predicates.

[Figure: the XML document tree of the example paper.]

/paper[title != section/title]/author

Page 12: Buffering in Query Evaluation over XML Streams

Memory Bottleneck II: Buffering of Document Fragments
  Scenario 3: buffering multiple candidate matches that are nested within each other.

[Figure: a recursive document tree in which a elements, each with b and c children, are nested inside one another.]

//a[b and c]

  Relevant only when the document is “recursive”
  Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]
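A minimal sketch of why nesting costs space, assuming a filter hard-coded for //a[b and c]: every open a element whose predicate is still undecided needs its own pair of flags. The class name NestedFilter and the test document below are made up for the illustration (the document only loosely follows the figure).

```python
import xml.sax


class NestedFilter(xml.sax.ContentHandler):
    """Hypothetical filter for //a[b and c]: one stack frame per open element."""

    def __init__(self):
        super().__init__()
        self.stack = []        # frames of the form [name, has_b, has_c]
        self.matched = False   # some a with both a b child and a c child?
        self.max_pending = 0   # most undecided a elements open at once

    def startElement(self, name, attrs):
        if self.stack and self.stack[-1][0] == "a":
            parent = self.stack[-1]
            if name == "b":
                parent[1] = True
            if name == "c":
                parent[2] = True
            if parent[1] and parent[2]:
                self.matched = True
        self.stack.append([name, False, False])
        pending = sum(
            1 for n, has_b, has_c in self.stack
            if n == "a" and not (has_b and has_c)
        )
        self.max_pending = max(self.max_pending, pending)

    def endElement(self, name):
        self.stack.pop()


nested = NestedFilter()
xml.sax.parseString(b"<a><c/><a><b/><a><c/><b/></a></a></a>", nested)
print(nested.matched, nested.max_pending)
```

On this input three a elements are open and undecided at the same time, one per level of recursion.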

Page 13: Buffering in Query Evaluation over XML Streams

Our Results
  Quantitative space lower bounds for:
    Full-fledged evaluation of queries with predicates (Scenario 1)
    Filtering/full-fledged evaluation of queries with “multi-variate” predicates (Scenario 2)
  Matching upper bound
    Eager evaluation of predicates
  In all other scenarios: no buffering required
    Filtering of queries with “univariate” predicates over non-recursive documents is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]

Page 14: Buffering in Query Evaluation over XML Streams

Related Work
  Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
  Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
  Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

Page 15: Buffering in Query Evaluation over XML Streams

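Document Concurrency
  Q: query
  $D = \sigma_1, \ldots, \sigma_n$: document
    Each $\sigma_i$ is a SAX event
    $\sigma^t = (\sigma_1, \ldots, \sigma_t)$
  Definition: $x \in D$ is alive at step t if $x \in \sigma^t$ and $\exists\, \alpha, \beta$ s.t.
    $x \in \mathrm{Eval}(Q, \sigma^t \alpha)$ and $x \notin \mathrm{Eval}(Q, \sigma^t \beta)$
  t-concurrency(D,Q): the number of nodes that are alive at step t
  concurrency(D,Q): $\max_t$ t-concurrency(D,Q)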

Page 16: Buffering in Query Evaluation over XML Streams

Concurrency: Example

 1: <paper>
 2: <section id="1">
 3: <title>
 4: Intro
 5: </title>
 6: <content>        [alive]
 7: bla bla bla
 8: </content>
 9: </section>
10: <section id="2">
11: <title>
12: Results
13: </title>
14: <content>        [alive]
15: yada yada yada
16: </content>
17: </section>
18: <section id="3">
19: <title>
20: Conclusions
21: </title>
22: <content>        [dead]
23: etc etc etc
24: </content>
25: </section>
26: <title>
27: On the Complexity of Database Queries
28: </title>
29: <author>
30: Papadimitriou
31: </author>
32: <author>
33: Yannakakis
34: </author>
35: </paper>

/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content

Page 17: Buffering in Query Evaluation over XML Streams

Lower Bound Notions
  A “normal” lower bound: for every algorithm A, there exist Q and D s.t. A uses $\Omega(\text{concurrency}(D,Q))$ bits of space on Q and D.
    Q and D may be “pathological”
    Doesn’t say much about real-world queries/documents
  An “ideal” lower bound: for every A, every Q, and every D, A uses $\Omega(\text{concurrency}(D,Q))$ bits of space on Q and D.
    Too good to be true
    A can have D and Q “hard-coded”, and then know the result a priori
    Space of A on D and Q = minimum description length of Q and D

Page 18: Buffering in Query Evaluation over XML Streams

Our Lower Bound
  Theorem: for every A, every Q, and every D, there exists an almost isomorphic document $D'$ s.t. A uses $\Omega(\text{concurrency}(D,Q))$ bits of space on Q and $D'$.
    $D'$ is the same as D, except for a few extra empty nodes with auxiliary names.
  The theorem holds only if:
    Q is “star-free”
    D is non-recursive

Page 19: Buffering in Query Evaluation over XML Streams

Why isn’t this Obvious?
  Reason 1: we want the theorem to work for every Q and D, not only ones with high minimum description length (MDL).
  Reason 2:
    Obvious: if x is alive at step t, then A has to remember x
      Because: A may or may not need to output x
    Not obvious: if x and y are alive at step t, then A has to remember both
      If x and y are not “independent”, maybe it’s enough to remember just x (or just y)

Page 20: Buffering in Query Evaluation over XML Streams

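Proof of Lower Bound
  C = t-concurrency(D,Q)
  $x_1, \ldots, x_C$ = the nodes that are alive at step t
  Recall: for every $x_i$ there exist $\alpha_i$ and $\beta_i$ s.t.
    $x_i \in \mathrm{Eval}(Q, \sigma^t \alpha_i)$ and $x_i \notin \mathrm{Eval}(Q, \sigma^t \beta_i)$
  Lemma: there exist a single $\alpha$ and a single $\beta$ s.t. for all i,
    $x_i \in \mathrm{Eval}(Q, \sigma^t \alpha)$ and $x_i \notin \mathrm{Eval}(Q, \sigma^t \beta)$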

Page 21: Buffering in Query Evaluation over XML Streams

Proof of Lower Bound (cont.)
  For every $S \subseteq \{1, \ldots, C\}$, define a document $D_S$:
    $D_S$ is the same as D, except that for every $i \in S$ we “mark” $x_i$
    Marking: an extra empty child with an auxiliary name
    Note: $D_S$ is almost isomorphic to D
  A = any algorithm
  Note: from the output of A on $D_S$, one can “reconstruct” the set S.

Page 22: Buffering in Query Evaluation over XML Streams

Proof of Lower Bound (cont.)
  Consider the state of A at step t when running on $D_S$
  If suffix = $\beta$, none of the $x_i$’s should be output
    ⇒ A could not have output any $x_i$ by step t
  If suffix = $\alpha$, there is no information in the suffix about S
    but S can be reconstructed from the output
    ⇒ the state of A at step t must have all the information about S
  Conclusion: space ≥ $\Omega(C)$
  Actual proof: by one-way communication complexity
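The counting step behind the conclusion can be written out informally as follows. This is a reading of the sketch above, not the communication-complexity proof itself: if two different sets S led A to the same state at step t, A would produce the same output on the common suffix, and S could not be reconstructed from it. The state at step t must therefore take a distinct value for each of the $2^C$ subsets.

```latex
\[
\bigl|\{\text{states of } A \text{ at step } t\}\bigr|
\;\ge\; \bigl|\{\, S : S \subseteq \{1,\ldots,C\} \,\}\bigr|
\;=\; 2^{C}
\quad\Longrightarrow\quad
\text{space of } A \;\ge\; \log_2 2^{C} \;=\; C \text{ bits}.
\]
```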

Page 23: Buffering in Query Evaluation over XML Streams

Conclusions
  Our contributions:
    Quantitative space lower bounds
      Full-fledged evaluation of queries with predicates
      Filtering/full-fledged evaluation of queries with “multi-variate” predicates
    Matching upper bound
  Open problems:
    Quantitative lower bounds for XQuery evaluation over streams
    Address larger fragments of XPath