1

A Unified Model for XQuery Evaluation over XML Data Streams

Jinhui Jian

Hong Su

Elke A. Rundensteiner

Worcester Polytechnic Institute

ER 2003

2

Need for Stream Processing

New environment
• Data sources are everywhere
• Data requests are everywhere

New applications
• Sensor networks
• Analysis of XML web logs
• Selective dissemination of XML information (e.g., news)

New features
• On-line arriving data
• Potentially unstable data
• Real-time response requirement
• Scalability requirement

3

Specific Challenges for XML Streams

Pattern retrieval on nested data

+ filtering/restructuring

FOR $b in doc(bib.xml)//book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN <expensive> $t </expensive>

<bib>

<book year="1994">

<title>TCP/IP Illustrated</title>

<author><last>Stevens</last><first>W.</first></author>

<publisher>Addison-Wesley</publisher>

<price> 65.95</price>

</book>

Token-by-Token access manner

timeline

<bib> <book> <title> TCP/IP Illustrated </title> …

A token can be an open tag, a close tag, or PCDATA; it is not a direct counterpart of a tuple
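To make the token-by-token access manner concrete, here is a minimal sketch of a tokenizer that turns an XML string into open-tag, close-tag and PCDATA tokens. The function name `tokenize` and the `(kind, value)` token shape are assumptions made for illustration, and the regular expression ignores comments, CDATA sections and self-closing tags.

```python
import re

# Matches an open tag, a close tag, or a run of character data (PCDATA).
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*>|([^<]+)")

def tokenize(xml):
    """Yield (kind, value) tokens in document order (hypothetical helper)."""
    for m in TOKEN.finditer(xml):
        slash, name, text = m.groups()
        if name is not None:
            yield ("close" if slash else "open", name)
        elif text.strip():  # skip whitespace-only PCDATA
            yield ("text", text.strip())

stream = '<bib><book year="1994"><title>TCP/IP Illustrated</title></book></bib>'
tokens = list(tokenize(stream))
for token in tokens:
    print(token)
```

Each token carries only a tag name or a piece of text, so a single token never holds a whole element the way a tuple would.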

4

Observations and Questions

Observations
• Pattern retrieval -> the automata model has long been studied for pattern retrieval on tokens
• Filtering/restructuring -> the algebraic model has long been studied for optimizing query plans on tuples

Questions
• How to integrate the two models?
• How to optimize a query within the integrated query model?

5

Uniform Modeling in an Algebraic Framework

6

A Running Example

Give me the titles of books whose price is greater than 50:

FOR $b in doc(bib.xml)//book
WHERE $b/price > 50
RETURN <expensive> $b/title </expensive>

<bib> <book year="1994"> <title>TCP/IP Illustrated</title> <author><last>Stevens</last><first>W.</first></author> <publisher>Addison-Wesley</publisher> <price> 65.95</price> </book> <book year="2000"> <title>Languages and Machines</title> <author><last>Sudkamp</last><first>T.</first></author> <publisher> Addison-Wesley </publisher> <price>39.95</price> </book> …

</bib>

<expensive> <title>TCP/IP Illustrated</title> </expensive> …

timeline

<bib> <book> <title> TCP/IP Illustrated </title> <author> <last> Stevens</last> …</book>…

Input XML stream

7

Automata Computation: NFAs + Buffers

FOR $b in doc(bib.xml)//book
WHERE $b/price > 50
RETURN $b/title

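[Figure: NFA in which state 1 carries a * self-loop, "book" leads from state 1 to state 2, "price" leads from state 2 to state 3, and "title" leads from state 2 to state 4]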

Buffer for title: <title>TCP/IP Illustrated</title>

Buffer for price: <price>65.95</price>

Input (t0 to t7): <bib> <book> <title> TCP/IP Illustrated </title> <price> 65.95 </price> …

Active states: +0 | +1 | +1,2 | +1,4 | -1,4 | +1,3 | …

Stack snapshots, each listed from the first entry pushed to the last:
[0]
[0] [1]
[0] [1] [1,2]
[0] [1] [1,2] [1,4]
[0] [1] [1,2]
[0] [1] [1,2] [1,3]
…

• No materialization needed

• Multiple patterns resolved in one pass
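The stack-based NFA run on this slide can be sketched as follows. This is an illustration under stated assumptions, not the Raindrop implementation: the transition function `step` is reconstructed from the trace (state 1 has the * self-loop, book leads to 2, price to 3, title to 4), the first open tag is assumed to move state 0 to state 1, and the buffers hold only text where the slide buffers whole elements.

```python
def step(active, tag):
    """Return the states reached from `active` on an open tag (assumed NFA)."""
    nxt = set()
    for s in active:
        if s == 0:
            nxt.add(1)              # assumption: first open tag enters state 1
        elif s == 1:
            nxt.add(1)              # '*' self-loop realizes the '//' axis
            if tag == "book":
                nxt.add(2)
        elif s == 2:
            if tag == "title":
                nxt.add(4)
            elif tag == "price":
                nxt.add(3)
    return nxt

def run(tokens):
    """Push a state set on every open tag, pop on every close tag."""
    stack = [{0}]
    snapshots = [[sorted(s) for s in stack]]
    buffers = {"title": [], "price": []}
    for kind, value in tokens:
        if kind == "open":
            stack.append(step(stack[-1], value))
        elif kind == "close":
            stack.pop()
        else:                       # text: fill buffers, stack is unchanged
            if 4 in stack[-1]:
                buffers["title"].append(value)
            if 3 in stack[-1]:
                buffers["price"].append(value)
            continue
        snapshots.append([sorted(s) for s in stack])
    return snapshots, buffers

tokens = [("open", "bib"), ("open", "book"),
          ("open", "title"), ("text", "TCP/IP Illustrated"), ("close", "title"),
          ("open", "price"), ("text", "65.95"), ("close", "price"),
          ("close", "book"), ("close", "bib")]
snapshots, buffers = run(tokens)
for snap in snapshots[:6]:
    print(snap)
print(buffers)
```

Both the title and the price pattern are resolved in the same pass over the tokens, and only the matched text is buffered.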

8

Algebraic Computation

FOR $b in doc(bib.xml)//book
WHERE $b/price > 50
RETURN $b/title

Extract //book

Navigate //book, price

Select price > 50

Tagger

Navigate //book, title

[Figure: three book elements drawn as trees; a book has the children title, author (with last and first), publisher and price, each ending in Text leaves]

• Selection push-down enabled
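The tuple-based plan on this slide can be sketched as a pipeline of generators over materialized book elements. The operator functions below (`extract`, `navigate`, `select`, `tagger`) and the variable names in the tuples are hypothetical stand-ins for the slide's operators, built on Python's standard `xml.etree.ElementTree`. The selection sits before the second Navigate, which is the push-down the slide refers to.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
<book year="1994"><title>TCP/IP Illustrated</title><price>65.95</price></book>
<book year="2000"><title>Languages and Machines</title><price>39.95</price></book>
</bib>"""

def extract(root, tag):
    """Extract //tag: one tuple per matching element, bound to variable 'b'."""
    return ({"b": b} for b in root.iter(tag))

def navigate(tuples, src, child, dst):
    """Navigate: bind `dst` to every `child` element under tuple[src]."""
    for t in tuples:
        for e in t[src].findall(child):
            yield {**t, dst: e}

def select(tuples, pred):
    """Select: keep only the tuples that satisfy the predicate."""
    return (t for t in tuples if pred(t))

def tagger(tuples, var, wrapper):
    """Tagger: wrap the element bound to `var` in a new tag."""
    for t in tuples:
        e = t[var]
        yield f"<{wrapper}><{e.tag}>{e.text}</{e.tag}></{wrapper}>"

root = ET.fromstring(BIB)
plan = extract(root, "book")
plan = navigate(plan, "b", "price", "p")
plan = select(plan, lambda t: float(t["p"].text) > 50)  # pushed below the title Navigate
plan = navigate(plan, "b", "title", "t")
answer = list(tagger(plan, "t", "expensive"))
print(answer)
```

Because the selection runs first, the title of the cheaper book is never navigated to.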

9

The Raindrop Approach

Uniform: automata computation is modeled in an algebraic manner

Tight coupling: automata computation and regular tuple-based computation are interchangeable

10

Path Bindings in XQuery

FLWR expression: FOR … LET … WHERE … RETURN …

FOR $b in doc(bib.xml)//book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t

• FOR and LET: path bindings
• WHERE and RETURN: filtering and restructuring

“The purpose of path bindings is to produce a tuple stream in which each tuple consists of one or more bound variables” [W3C]

11

Data Flow

Automata plan

Regular algebraic plan

Tuple stream

XML data stream

Query answer

12

Modeling the Automata Plan: Black Box [xscan] vs. White Box

Black box: a single AutomataPlan operator
Q1 := //book
Q2 := //book/price
Q3 := //book/title

White box: SJoin //book over Extract //book/price and Extract //book/title

13

A Unified Process at the Logical View

Select //book/price > 50

Navigate //book, //book/title

SJoin //book

Extract //book/price

Extract //book/title

14

The Algebra Core

Op               Subscript   Semantics
Selection        pred        Filter tuples based on the predicate pred
Projection       v           Filter columns in the input tuples based on the variable list v
Join             pred        Join input tuples based on the predicate pred
Aggregate        f           Aggregate over input tuples with the aggregate function f, e.g., sum and average
Tagger           pt          Format outputs based on the pattern pt, i.e., reconstruct XML tags (symbol T)
Navigate         p1,p2       Take input elements of path p1 and output descendant elements of path p2
Extract          p           Identify elements of path p from the input stream
Structural Join  p           Join input tuples on their structural relationship, e.g., the common parent relationship p

The operators fall into two groups: relational-like and XML-specific.

15

The Extract Operator

Extract //book/title

[Figure: NFA for //book/title, with a * self-loop and transitions labeled book and title]

Input: <bib> <book> <title> TCP/IP Illustrated </title> … </book> …

Output:

<title> TCP/IP Illustrated </title>

<title> Data on the Web </title>

<title>Advanced Programming in the Unix environment</title>
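A minimal sketch of what Extract //book/title does on a token stream: it tracks the open tags and buffers every element whose trailing tag path is book/title. The function name, the `(kind, value)` token shape and the list-based path are assumptions for illustration. Attributes are dropped, and a match nested inside another match is not emitted separately.

```python
def extract(tokens, path):
    """Yield each element whose trailing tag path equals `path`, as a string.

    `path` is a list such as ["book", "title"], read here as //book/title.
    """
    open_tags = []        # names of the currently open elements
    start_depth = None    # depth at which the buffered element was opened
    pieces = []
    for kind, value in tokens:
        if kind == "open":
            open_tags.append(value)
            if start_depth is None and open_tags[-len(path):] == path:
                start_depth, pieces = len(open_tags), []
            if start_depth is not None:
                pieces.append(f"<{value}>")
        elif kind == "text":
            if start_depth is not None:
                pieces.append(value)
        else:  # close tag
            if start_depth is not None:
                pieces.append(f"</{value}>")
                if len(open_tags) == start_depth:   # the matched element ends
                    yield "".join(pieces)
                    start_depth = None
            open_tags.pop()

tokens = [("open", "bib"),
          ("open", "book"),
          ("open", "title"), ("text", "TCP/IP Illustrated"), ("close", "title"),
          ("open", "price"), ("text", "65.95"), ("close", "price"),
          ("close", "book"),
          ("open", "book"),
          ("open", "title"), ("text", "Data on the Web"), ("close", "title"),
          ("close", "book"),
          ("close", "bib")]
titles = list(extract(tokens, ["book", "title"]))
print(titles)
```

The operator consumes tokens and produces whole elements, which is where the stream changes from tokens to tuple-like units.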

16

The Structural Join Operator

FOR $b in doc(bib.xml)//book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t

[Figure: NFA with states 1 to 4, a * self-loop, and transitions labeled book, title and price]

Input: <bib> <book> <title> TCP/IP Illustrated </title> … </book> … <book> … </book>

Extract //book/title: <title>…</title>, <title>…</title>

Extract //book/price: <price>…</price>, <price>…</price>

SJoin //book (tight coupling):
<title>…</title> <price>…</price>
<title>…</title> <price>…</price>
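One way to picture the structural join is to let the close tag of each book trigger the join of whatever the two Extract branches have buffered since that book opened. The sketch below follows that reading. The trigger-on-close behavior, the function name `sjoin_book` and the use of plain text values in place of whole elements are assumptions, and books are assumed not to nest.

```python
def sjoin_book(tokens):
    """Yield (title, price) pairs, joined once per closed <book> element."""
    open_tags = []
    titles, prices = [], []   # buffers filled by the two Extract branches
    for kind, value in tokens:
        if kind == "open":
            open_tags.append(value)
        elif kind == "text":
            if open_tags[-2:] == ["book", "title"]:
                titles.append(value)
            elif open_tags[-2:] == ["book", "price"]:
                prices.append(value)
        else:  # close tag
            open_tags.pop()
            if value == "book":
                # Everything buffered since <book> shares this book as its
                # parent, so a cross product of the buffers is the join.
                for t in titles:
                    for p in prices:
                        yield (t, p)
                titles.clear()
                prices.clear()

tokens = [("open", "bib"),
          ("open", "book"),
          ("open", "title"), ("text", "TCP/IP Illustrated"), ("close", "title"),
          ("open", "price"), ("text", "65.95"), ("close", "price"),
          ("close", "book"),
          ("open", "book"),
          ("open", "title"), ("text", "Languages and Machines"), ("close", "title"),
          ("open", "price"), ("text", "39.95"), ("close", "price"),
          ("close", "book"),
          ("close", "bib")]
joined = list(sjoin_book(tokens))
expensive = [t for t, p in joined if float(p) > 50]   # WHERE $p > 50
print(joined)
print(expensive)
```

The join needs no value comparison: the position of the close tag in the stream already tells it which titles and prices belong together.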

17

The Navigate Operator

<book year="1994"> <title>TCP/IP Illustrated</title> <author> <last> Stevens </last> <first> W. </first> </author> <publisher> Addison-Wesley </publisher> <price> 65.95 </price> </book>

<book>… … </book>

<book>… … </book>

<book>… … </book> <title>…</title>

<book>… … </book> <title>…</title>

<book>… … </book> <title>…</title>

Navigate //book, title

18

Optimization

19

In or Out?

Automata plan

Regular algebraic plan

Tuple stream

XML data stream

Query answer

Pattern retrieval

20

Pattern Retrieval Alternatives

<title>…</title> <price>…</price>

<title>…</title> <price>…</price>

<price>…</price>

<price>…</price>

<title>…</title>

<title>…</title>

<book year="1994"> <title>TCP/IP Illustrated</title> <author> <last> Stevens </last> <first> W. </first> </author> <publisher> Addison-Wesley </publisher> <price> 65.95 </price> </book>

<book>… … </book>

<book>… … </book>

<book>… … </book> <title>…</title>

<book>… … </book> <title>…</title>

<book>… … </book> <title>…</title> <price>…</price>

<book>… … </book> <title>…</title> <price>…</price>

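In Automata (/title, /price)

[Figure: NFA with a * self-loop, book into state 2, price into state 3 and title into state 4]

Out of Automata (/title, /price)

[Figure: NFA with a * self-loop and book into state 2 only]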

21

Plan Alternatives

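The pull-out plan, from the stream upward:
Extract //book -> Navigate //book, price -> Select price > 50 -> Navigate //book, title -> Tagger
[Automaton: * self-loop and a book transition into state 2]

The push-in plan, from the stream upward:
Extract //book/price and Extract //book/title -> SJoin //book -> Select //book/price > 50 -> Tagger
[Automaton: * self-loop, a book transition into state 2, and title and price transitions below it]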

22

Experiment 1:

23

Experiment 2

24

Camp 1: Complete Automata Model [XSQ, XSM, XPush]

• All details on the same level
  Hard to understand
  Not suitable for optimizing at different levels
• Using automata as a query processing paradigm is little studied

for $x in $R/a return
  for $y in $x/b return
    <res> $y, $x </res>

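[Figure: transducer for this query. States: (0,0,0), (1,0,0), (2,1,0), (2,2,1), (2,2,2), (2,1,3), (1,1,3), (1,2,2), (1,2,1), (1,1,0). Transition labels follow.]

*r=er|r++

*r=sr|r++

*r!=<a>|r++

*r=<a>|w(x,sx),w(x,<a>),r++,x”++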

*r=</a>|w(x,</a>),w(x,ex),r++,xs=x

*r!=</a>&*r!=</b>|w(x,*r),r++,x”++

*r=<b>|w(x,<b>),r++

*true|xm=x’, w(o,<res>),w(o,<b>),x’++

*r!=</a>&*r!=</b>|w(x,*r),w(o,*r),x”++,r++

*r=</b>|w(x,</b>),w(o,</b>),r++,x”++

!AE(x’)&*x’!=ex|w(o,*x’),x’++

AE(x’)&*r!=</a>|w(x,*r),w(o,*r),r++,x”++

AE(x’)&*r=</a>|w(x,</a>),w(o,</a>),w(x,ex),r++,x’++

!AE(x’)&x’!=ex|w(o,*x’),x’++

!AE(x”)&x”=</b>|w(o,</b>),x”++

!AE(x”)&*x”!=</b>|w(o,*x”),x”++

True|xm=x’,w(o,<res>),w(o,<b>),x’++

!AE(x”)&*x”=<b>|x”++

!AE(x”)&*x”!=<b>&*x”!=ex|x”++

!AE(x”)&*x”=ex|xs=x”

25

Camp 2: Automata-Algebra Loosely Coupled Model [Tukwila, YFilter]

• Fixed interface for automata computation (all pattern retrieval pushed down)
  No opportunity for pushing computation into or pulling it out of the automata
• Bloated, black-box operator
  Algebraic rewriting impossible for internal optimization

AutomataPlan

$b := //book
$p := //book/price
$t := //book/title

$b $p $t

26

Contribution

Automata and algebra modeled into one framework allowing a uniform logical view

Opportunity for push-into-automata and pull-out-of-automata provided via query rewriting

Optimization necessity verified by experiments

27

http://davis.wpi.edu/dsrg/raindrop/