Efficient Evaluation of XQuery over Streaming Data
![Page 1: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/1.jpg)
Efficient Evaluation of XQuery over Streaming Data
Xiaogang Li Gagan Agrawal
The Ohio State University
![Page 2: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/2.jpg)
Motivation
Why stream data needs to be analyzed in real time
- Stock market, climate, network traffic
Rapid improvements in networking technologies
- 101.13 Gbps at SC2004 Bandwidth Challenge
- Retrieval from local disk may be much slower than from a remote site
Lack of disk space
![Page 3: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/3.jpg)
Motivation
Why XML
- Standard data exchange format for the Internet
- Widely adopted in web-based, distributed and grid computing
- Virtual XML is becoming popular
Why XQuery
- Widely accepted language for querying XML
- Declarative: easy to use
- Powerful: types, user-defined functions, binary expressions, ...
![Page 4: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/4.jpg)
Current Work: XQuery Over Streaming Data
XPath over Streaming Data
- XPath is relatively simple
XQuery over Streaming Data
- Limited features handled
- Focus on queries that are written for single-pass evaluation
![Page 5: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/5.jpg)
Contributions
Can the given query be evaluated correctly on streaming data?
- Only a single pass is allowed
- Decision made by the compiler, not the user
If not, can it be correctly transformed?
How to generate efficient code for XQuery?
- Computations involved in streaming applications are non-trivial
- Recursive functions are frequently used
- Efficient memory usage is important
![Page 6: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/6.jpg)
Our Approach
For an arbitrary query, can it be evaluated correctly on streaming data?
- Construct a data-flow graph for the query
- Static analysis based on the data-flow graph
If not, can it be transformed to do so?
- Query transformation techniques based on static analysis
How to generate efficient code for XQuery?
- Techniques based on static analysis to minimize memory usage and optimize code
- Generating imperative code
-- Recursive analysis and aggregation rewrite
![Page 7: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/7.jpg)
Query Evaluation Model
Single input stream
Internal computations
- Limited memory
- Linked operators
Pipeline operator and blocking operator
[Figure: linked operators Op1, Op2, Op3, Op4]
![Page 8: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/8.jpg)
Pipeline and Blocking Operators
Pipeline Operator:
- Each input tuple produces an output tuple independently
- Selection, Increment, etc.
Blocking Operator:
- Can only compute output after receiving all input tuples
- Sort, Join, etc.
Progressive Blocking Operator:
(1) |output| << |input|: we can buffer the output
(2) Associative and commutative operation: discard input
- count(), sum()
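The three operator classes above can be illustrated with a minimal Python sketch. The function names (`select_positive`, `running_sum`, `sort_all`) are hypothetical and are not taken from the authors' system; the point is only how much state each kind of operator has to keep.

```python
from typing import Iterable, Iterator, List


def select_positive(xs: Iterable[float]) -> Iterator[float]:
    """Pipeline operator: each input is handled on its own, nothing is stored."""
    for x in xs:
        if x > 0:
            yield x


def running_sum(xs: Iterable[float]) -> float:
    """Progressive blocking operator: the result is only known at the end,
    but each input can be folded into one accumulator and then discarded
    because addition is associative and commutative."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sort_all(xs: Iterable[float]) -> List[float]:
    """Blocking operator: every input must be buffered before any output."""
    return sorted(xs)
```

For example, `running_sum(select_positive(iter([3.0, -1.0, 2.0])))` consumes the stream once with constant memory, whereas `sort_all` needs memory proportional to the input length.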
![Page 9: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/9.jpg)
Single Pass?
Pixels with x and y
Q1: let $i := …/pixel sortby (x)
Q2: let $i := for $p in /pixel where $p/x > .. x = count(/pixel)
A query cannot be evaluated in a single pass if:
(1) A blocking operator exists
(2) A progressive blocking operator is referred to by another pipeline operator or progressive blocking operator
Check condition 2 in a query
![Page 10: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/10.jpg)
Single-Pass? Challenges
Must analyze data dependence at the expression level
A query may be complex
Need a simplified view of the query to make the decision
![Page 11: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/11.jpg)
Overall Framework
Data Flow Graph Construction
Horizontal Fusion
High level Transformation
Vertical Fusion
Single-Pass Analysis
Low level Transformation
GNL Generation
Recursion Analysis
Aggregation Rewrite
Stream Code Generation
![Page 12: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/12.jpg)
Roadmap
Stream Data Flow Graph
High-Level Transformations
- Horizontal Fusion
- Vertical Fusion
Single Pass Analysis
Low Level Optimization
Experiment and Conclusion
![Page 13: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/13.jpg)
Stream Data Flow Graph (DFG)
Node represents a variable
- Explicit and implicit
- Sequence and atomic
Edge: dependence relation
- v1 -> v2 if v2 uses v1
- Aggregate dependence and flow dependence
A DFG is acyclic
Cardinality inference is required to construct the DFG
[Figure: example DFG with nodes S1, S2, v1, i, b]
S1: stream/pixel[x>0]
S2: stream/pixel
V1: count()
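One possible in-memory representation of such a graph is sketched below in Python. The class name `DFG` and its methods are assumptions made for illustration; the slides do not show the authors' actual data structure, and the single edge built in the usage note is a guess at one edge of the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DFG:
    """Hypothetical stream data flow graph: typed nodes and typed edges."""
    kinds: Dict[str, str] = field(default_factory=dict)               # node -> "sequence" | "atomic"
    edges: List[Tuple[str, str, str]] = field(default_factory=list)   # (src, dst, "flow" | "aggregate")

    def add_node(self, name: str, kind: str) -> None:
        if kind not in ("sequence", "atomic"):
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind

    def add_edge(self, src: str, dst: str, kind: str) -> None:
        """Record src -> dst, meaning dst uses src."""
        if kind not in ("flow", "aggregate"):
            raise ValueError(f"unknown edge kind: {kind}")
        if src not in self.kinds or dst not in self.kinds:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((src, dst, kind))

    def used_by(self, name: str) -> List[str]:
        """Nodes that directly depend on `name`."""
        return [dst for src, dst, _ in self.edges if src == name]


def example_graph() -> DFG:
    # S1: stream/pixel[x>0] is a sequence; v1 = count(S1) is atomic.
    g = DFG()
    g.add_node("S1", "sequence")
    g.add_node("v1", "atomic")
    g.add_edge("S1", "v1", "aggregate")
    return g
```

Here `example_graph().used_by("S1")` lists `v1`, since `count()` consumes the whole sequence `S1`.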
![Page 14: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/14.jpg)
High-level Transformation
Goals
1: Enable single-pass evaluation
2: Simplify the SDFG and single-pass analysis
Horizontal Fusion and Vertical Fusion
- Based on SDFG
![Page 15: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/15.jpg)
Horizontal Fusion
Enable single-pass evaluation
- Merge sequence nodes with a common prefix
[Figure: DFG before fusion with nodes S1, S2, v1, v2, b]
S1: stream/pixel[x>0]
S2: stream/pixel/y
V1: count()
V2: sum()
[Figure: DFG after fusion with nodes S0, S1, S2, v1, v2, b]
S0: /stream/pixel
S1: [x>0]
S2: /y
V1: count()
V2: sum()
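The prefix merge in this example can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: each path is given as a list of steps, with a predicate such as `[x>0]` treated as its own step. The function names are hypothetical.

```python
from typing import List, Tuple


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Longest shared leading run of path steps."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def horizontal_fuse(p1: List[str], p2: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split two paths into (shared prefix S0, remainder of S1, remainder of S2)."""
    prefix = common_prefix(p1, p2)
    n = len(prefix)
    return prefix, p1[n:], p2[n:]
```

Calling `horizontal_fuse(["stream", "pixel", "[x>0]"], ["stream", "pixel", "y"])` yields the shared prefix `stream/pixel` plus the residual steps `[x>0]` and `y`, matching S0, S1 and S2 after fusion.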
![Page 16: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/16.jpg)
Vertical Fusion
Simplify DFG and single-pass analysis
- Merge a cluster of nodes linked by flow dependence edges
[Figure: DFG with nodes S1, S2, v, i, j, b before fusion, and the fused graph with nodes S and v]
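A minimal Python sketch of this clustering step follows, using a union-find structure. It is an assumption about how such a merge could be implemented, not the authors' algorithm, and the edge set in the usage note is hypothetical because the figure's edges are not recoverable from the transcript.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (src, dst, "flow" | "aggregate")


def vertical_fuse(nodes: Iterable[str], edges: List[Edge]):
    """Merge nodes joined by flow edges; keep aggregate edges between clusters."""
    nodes = list(nodes)
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(n: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for src, dst, kind in edges:
        if kind == "flow":
            parent[find(dst)] = find(src)

    clusters: Dict[str, Set[str]] = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)

    # An aggregate edge inside one cluster becomes a self-loop, i.e. a cycle.
    fused_edges = {(find(s), find(d)) for s, d, k in edges if k == "aggregate"}
    return clusters, fused_edges
```

With the hypothetical edges `S1 -> i -> j -> S2` (flow), `S2 -> v` (aggregate) and `v -> b` (flow), the result is two clusters, `{S1, S2, i, j}` and `{v, b}`, joined by a single aggregate edge.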
![Page 17: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/17.jpg)
Roadmap
Stream Data Flow Graph
High-Level Transformations
- Horizontal Fusion
- Vertical Fusion
Single Pass Analysis
Low Level Optimization
Experiment and Conclusion
![Page 18: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/18.jpg)
Single-pass Analysis
Can a query be evaluated on the fly?
THEOREM 1. If a query with dependence graph G=(V,E) contains more than one sequence node after vertical fusion, it cannot be evaluated correctly in a single pass.
Reason: A sequence node with infinite length cannot be buffered
![Page 19: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/19.jpg)
Single-pass Analysis (continued)
THEOREM 2. Let S be the set of atomic nodes that are aggregate dependent on any sequence node in a stream data flow graph. For any two elements s1 and s2 of S, if there is a path between s1 and s2, the query may not be evaluated correctly in a single pass.
Reason: A progressive blocking operator is referred to by another progressive blocking operator
Example: count(pixel) where /x > 0.005*sum(/pixel/x)
![Page 20: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/20.jpg)
Single-pass Analysis (continued)
THEOREM 3. If there is a cycle in a stream data flow graph G, the corresponding query may not be evaluated correctly using a single pass.
Reason: A progressive blocking operator is referred to by a pipeline operator
[Figure: DFG with nodes S1, S2, v, i, j, b, and the fused graph with nodes S2 and v]
![Page 21: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/21.jpg)
Single-pass Analysis
Check the conditions corresponding to Theorems 1, 2 and 3
- Stop further processing if any condition is true
Completeness of the analysis
- If a query without blocking operators passes the test, it can be evaluated in a single pass
THEOREM 4. If the results of a progressive blocking operator with an unbounded input are referred to by a pipeline operator or a progressive blocking operator with unbounded input, then for the stream data flow graph, at least one of the three conditions holds true.
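The three checks can be sketched in Python as follows. This is a simplified reading of Theorems 1 to 3, assuming the graph has already been fused and is given as a node-kind dictionary plus typed edges; the function names and graph encoding are hypothetical.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (src, dst, "flow" | "aggregate")


def reachable(adj, src: str, dst: str) -> bool:
    """True if dst can be reached from src along one or more edges."""
    stack, seen = [src], set()
    while stack:
        for m in adj[stack.pop()]:
            if m == dst:
                return True
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return False


def has_cycle(nodes, adj) -> bool:
    """Depth-first search with three colours to detect a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def visit(n: str) -> bool:
        color[n] = GREY
        for m in adj[n]:
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)


def single_pass_violations(nodes: Dict[str, str], edges: List[Edge]) -> List[int]:
    """Return the numbers of the theorems (1, 2, 3) whose condition holds."""
    adj = defaultdict(list)
    for src, dst, _ in edges:
        adj[src].append(dst)
    violated = []
    # Theorem 1: more than one sequence node remains after fusion.
    if sum(1 for kind in nodes.values() if kind == "sequence") > 1:
        violated.append(1)
    # Theorem 2: a path links two atomic nodes that aggregate over a sequence.
    agg = sorted({dst for src, dst, kind in edges
                  if kind == "aggregate"
                  and nodes[src] == "sequence" and nodes[dst] == "atomic"})
    if any(reachable(adj, a, b) for a, b in permutations(agg, 2)):
        violated.append(2)
    # Theorem 3: the graph contains a cycle.
    if has_cycle(nodes, adj):
        violated.append(3)
    return violated
```

An empty result means none of the three conditions fired. For a graph with one sequence node `S` feeding independent aggregates `v1` and `v2` that both flow into `b`, the function returns an empty list; if one aggregate feeds the other, it reports condition 2.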
![Page 22: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/22.jpg)
Conservative analysis
Our analysis is conservative
- A valid query may be labeled as “cannot be evaluated in a single pass”
Example:
![Page 23: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/23.jpg)
A review of the procedure
Can not be evaluated in a single pass!!
[Figure: successive DFGs for an example query, starting from nodes S1, S2, v1, i, b and ending with fused nodes S and v]
![Page 24: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/24.jpg)
Roadmap
Stream Data Flow Graph
High-Level Transformations
- Horizontal Fusion
- Vertical Fusion
Single Pass Analysis
Low Level Optimization
Experiment and Conclusion
![Page 25: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/25.jpg)
Low-level Transformations
Use GNL as intermediate representation
- GNL is similar to nested loops in Java
- Enables efficient code generation for reductions
- Enables transformation of recursive functions into iterative operations
From SDFG to GNL
- Generate a GNL for each sequence node associated with an XPath expression
- Move aggregation into GNL using aggregation rewrite and recursion analysis
![Page 26: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/26.jpg)
GNL Example
Facilitate code generation for any desired platform
[Figure: GNL example for the DFG with nodes S0, S1, S2, v1, v2, b]
![Page 27: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/27.jpg)
Low-Level Transformations
Recursive Analysis
- Extract commutative and associative operations from recursive functions
Aggregation Rewrite
- Perform function inlining
- Transform built-in and user-defined aggregates into iterative operations
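The effect of these two steps can be illustrated with a small Python sketch: a recursively defined sum, of the kind a user might write as an XQuery function, and an equivalent single-pass loop. Both functions are hypothetical examples, not output of the authors' compiler.

```python
from typing import Iterable, Sequence


def total_recursive(seq: Sequence[float]) -> float:
    """User-style recursive aggregate: needs the whole sequence in memory."""
    if not seq:
        return 0.0
    return seq[0] + total_recursive(seq[1:])


def total_iterative(stream: Iterable[float]) -> float:
    """Rewritten form: because + is associative and commutative, each element
    can be folded into an accumulator as it arrives and then dropped."""
    acc = 0.0
    for x in stream:
        acc += x
    return acc
```

The iterative form accepts any iterator, so it can run directly over a stream with constant memory, whereas the recursive form needs an indexable sequence that is fully materialized.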
![Page 28: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/28.jpg)
Code Generation
Using SAX XML stream parser
- XML document is parsed as a stream of events
- `<x> 5 </x>`: startElement <x>, content 5, endElement <x>
- Event-driven: need to generate code to handle each event
Using Java JDK
- Our compiler generates Java source code
![Page 29: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/29.jpg)
Code Generation: Example
startElement: Insert each referred element into buffer
endElement: Process each element in the buffer, dispatch the buffer
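The event-handling pattern can be sketched with Python's standard `xml.sax` module. The slides state that the compiler emits Java, so this is a stand-in illustration and not generated code. The query is assumed to be the earlier example (count of pixels with x > 0, sum of all y), and the names `PixelHandler` and `run_query` are hypothetical.

```python
import xml.sax


class PixelHandler(xml.sax.ContentHandler):
    """Counts pixels with x > 0 and sums y, keeping one pixel in memory."""

    def __init__(self):
        super().__init__()
        self.count = 0        # running count(stream/pixel[x>0])
        self.sum_y = 0.0      # running sum(stream/pixel/y)
        self.buffer = None    # fields of the pixel currently open
        self.field = None     # name of the child element being read
        self.text = []

    def startElement(self, name, attrs):
        if name == "pixel":
            self.buffer = {}                    # start buffering a new pixel
        elif self.buffer is not None and name in ("x", "y"):
            self.field, self.text = name, []    # a referred element begins

    def characters(self, content):
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == self.field:
            self.buffer[name] = float("".join(self.text))
            self.field = None
        elif name == "pixel":
            # Process the buffered pixel, then discard the buffer.
            if self.buffer.get("x", 0.0) > 0:
                self.count += 1
            self.sum_y += self.buffer.get("y", 0.0)
            self.buffer = None


def run_query(document: bytes):
    """Parse the document once and return (count, sum_y)."""
    handler = PixelHandler()
    xml.sax.parseString(document, handler)
    return handler.count, handler.sum_y
```

Only the aggregates and the fields of the current pixel are held in memory, so the footprint does not grow with the length of the stream.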
![Page 30: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/30.jpg)
Roadmap
Stream Data Flow Graph
High-Level Transformations
- Horizontal Fusion
- Vertical Fusion
Single Pass Analysis
Low Level Optimization
Experimental Results
Conclusions
![Page 31: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/31.jpg)
Experiment
Query Benchmark
- Selected benchmarks from XMARK
- Satellite, Virtual Microscope, Frequent Item
Systems compared with
- Galax
- Saxon
- Qizx/Open
![Page 32: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/32.jpg)
Performance: XMARK Benchmark
More than 25% faster on small datasets
Scales well on very large datasets
![Page 33: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/33.jpg)
Performance: Real Applications
More than one order of magnitude faster on small datasets
Works well for very large datasets
10-20% performance gain with control-flow optimization
![Page 34: Efficient Evaluation of XQuery over Streaming Data](https://reader030.fdocuments.net/reader030/viewer/2022032709/56812e51550346895d93f231/html5/thumbnails/34.jpg)
Conclusions
Provide a formal approach for query evaluation on streaming XML
- Query transformation to enable correct execution on streams
- Formal methods for single-pass analysis
- Strategies for efficient low-level code generation
- Experimental results show an advantage over other well-known systems