Pattern tree algebras: sets or sequences?
Stelios Paparizos, H. V. Jagadish
University of Michigan
Ann Arbor, MI USA
Outline
XML and XQuery Order and Duplicates
  Document Order
  ORDER BY Clause
  Binding Order
  Duplicates and XQuery
Hybrid Collections
Correct Output Order
Thinking Efficiently
Experimental Evaluation
Final Words
Document Order Usage
Provides the capability to re-establish the original document information
Example: Return authors of book with title = “Grilling…”
    FOR $b IN document(t)//book
    WHERE $b/title = "Grilling for amateurs"
    RETURN $b/author
Result:
    <author> Mario </author>
    <author> Stelios </author>
    <author> Alton </author>
Document Order
Implicit, derived from the XML data model
The order in which data is represented in a document is important information
Requires the original XML order representation within a single document
Requires an order amongst documents during a single execution of a query
Enforced on every XPath expression and every sequence operation, e.g. Union
ORDER BY Clause Order
Explicit specification with the ORDER BY clause
Results are sorted using each item's value
Example: Return all books sorted by year of publication
XQuery:
    FOR $b IN document(t)//book
    ORDER BY $b/year
    RETURN $b
SQL:
    SELECT book FROM t ORDER BY year
Binding Order Usage
Provides a mechanism to produce results in multiple document orders
Example: Return books and articles with the same author, ordered by document order of (book, article) or of (article, book)

Query ordered by (book, article):
    FOR $b IN document(t)//book
    FOR $a IN document(t)//article
    WHERE $b/author = $a/author
    RETURN ($b, $a)
Results:
    book1 – article1
    book1 – article2
    book2 – article1
    book2 – article2
    book2 – article3

Query ordered by (article, book):
    FOR $a IN document(t)//article
    FOR $b IN document(t)//book
    WHERE $b/author = $a/author
    RETURN ($b, $a)
Results:
    book1 – article1
    book2 – article1
    book1 – article2
    book2 – article2
    book2 – article3
Binding Order
Implicit, derived from the way the query is typed by the user
Results are sorted based on the order in which variables are bound
Uses multiple document orders
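A minimal Python sketch of how binding order follows from loop nesting. The in-memory lists and author labels are hypothetical stand-ins for document(t), chosen so that the matches line up with the book/article results listed for the two queries; the set intersection mimics the existential meaning of `$b/author = $a/author`.

```python
# Hypothetical stand-in for document(t): (name, authors) pairs in document order.
BOOKS = [("book1", {"A"}), ("book2", {"A", "B"})]
ARTICLES = [("article1", {"A"}), ("article2", {"A"}), ("article3", {"B"})]


def book_then_article():
    """FOR $b ... FOR $a ...: books vary slowest, articles fastest."""
    return [
        (b, a)
        for b, b_authors in BOOKS
        for a, a_authors in ARTICLES
        if b_authors & a_authors  # at least one author in common
    ]


def article_then_book():
    """FOR $a ... FOR $b ...: articles vary slowest, books fastest."""
    return [
        (b, a)
        for a, a_authors in ARTICLES
        for b, b_authors in BOOKS
        if b_authors & a_authors
    ]
```

Both functions return the same set of pairs; only the sequence differs, which is the information a pure set-based algebra would lose.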
XQuery and Duplicates
XQuery operates on duplicate-free sequences
  LET clause creates a binding to the sequence of matching elements
  FOR clause creates a binding to each element of the sequence of matching elements
Hence, XQuery requires all duplicates to be removed at variable binding
Dilemma: Use Sequences or Sets (or Bags or …)
Sets lose all ordering information
  Order can be important in intermediate steps
Sequences are expensive to manipulate
  Optimization possibilities can be restricted
Both sets and sequences are duplicate-free
  Duplicate elimination can be a costly procedure that should be avoided when possible
Solution: Use Hybrid Collections
A Hybrid Collection can have duplicate semantics that vary between a bag and a set, and order semantics that vary between a set and a sequence
  Duplicate Specification
  Ordering Specification
Duplicate Specification (D-Spec)
Given a collection of trees CT, D-Spec describes how duplicates were removed from the collection
Possible parameter values:
  “empty”: Duplicates can be present
  “tree”: Duplicates were removed using deep-tree comparison amongst trees in CT
  List of nodes u: Duplicates were removed using a comparison of the nodes referred to by “u” in each tree in CT
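The three D-Spec values can be illustrated with a small Python sketch. It assumes, purely for illustration, that each tree is flattened to a dict from node label to value; the function name `eliminate_duplicates` and the sample data are hypothetical and not part of the algebra itself.

```python
def eliminate_duplicates(trees, d_spec):
    """Return a collection consistent with the given D-Spec.

    d_spec is None for "empty", the string "tree" for deep-tree
    comparison, or a list of node labels u to compare on.
    """
    if d_spec is None:
        return list(trees)  # "empty": duplicates may remain
    if d_spec == "tree":
        key = lambda t: tuple(sorted(t.items()))  # whole tree
    else:
        key = lambda t: tuple(t.get(label) for label in d_spec)
    seen, kept = set(), []
    for tree in trees:
        k = key(tree)
        if k not in seen:  # keep the first tree seen for each key
            seen.add(k)
            kept.append(tree)
    return kept


# Hypothetical collection: the last tree repeats the third one.
CT = [
    {"B": "B1", "E": "E2", "A": "A1"},
    {"B": "B1", "E": "E1", "A": "A2"},
    {"B": "B1", "E": "E2", "A": "A2"},
    {"B": "B1", "E": "E1", "A": "A1"},
    {"B": "B1", "E": "E2", "A": "A2"},
]
```

With this data, `eliminate_duplicates(CT, None)` keeps all five trees, `"tree"` keeps four, and `["B", "E"]` keeps two, showing how a partial specification sits between bag and set semantics.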
Duplicate Specification Example
[Figure: three versions of a collection of trees, each rooted at a B node with E and A children: (1) D-Spec(empty), five trees with duplicates present; (2) D-Spec(tree), four trees; (3) D-Spec({B, E}), two trees.]
Ordering Item (O-Item)
Minimum unit used when sorting a collection CT
Parameters:
  Reference to the node to sort by
  Ascending (‘asc’) or descending (‘desc’)
  Empty greater (‘g’) or empty least (‘l’) for trees without a matching node
Example: O-Item (B, asc, l)
Ordering Specification (O-Spec)
Given a collection CT, O-Spec describes how the trees are sorted in the collection
It accepts as parameter an ordered list of Ordering-Items
  Sorting took place in the order the O-Items are specified
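A sketch of sorting a collection according to an O-Spec, again assuming flat dict trees with string values (an illustrative simplification). Each O-Item is a `(node, direction, empty)` tuple; the function name `sort_by_o_spec` is hypothetical. The sketch applies stable sorts from the last O-Item to the first, so earlier O-Items take priority and trees that tie on every O-Item keep their input order.

```python
def sort_by_o_spec(trees, o_spec):
    """Sort trees by a list of O-Items (node, 'asc'|'desc', 'g'|'l')."""
    result = list(trees)
    for node, direction, empty in reversed(o_spec):
        def key(tree, node=node, empty=empty):
            if node not in tree:
                # 'l': missing node ranks below every value; 'g': above.
                return (0 if empty == "l" else 2, "")
            return (1, tree[node])
        # list.sort is stable, also when reverse=True.
        result.sort(key=key, reverse=(direction == "desc"))
    return result


# Hypothetical collection; the third tree has no E node.
T1 = {"B": "B1", "E": "E2", "A": "A1"}
T2 = {"B": "B1", "E": "E1", "A": "A2"}
T3 = {"B": "B1", "A": "A1"}
T4 = {"B": "B1", "E": "E1", "A": "A1"}
CT = [T1, T2, T3, T4]

FULL = [("B", "asc", "l"), ("E", "asc", "l"), ("A", "asc", "l")]
PARTIAL = [("B", "asc", "l"), ("A", "asc", "l")]
```

Under `PARTIAL`, trees that agree on B and A may appear in any relative order; the sketch happens to keep their input order, which is only one of the orders the specification allows.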
Ordering Specification Example
[Figure: four orderings of the same collection of trees, each rooted at a B node with E and A children:
  (1) O-Spec{(B, asc, l), (E, asc, l), (A, asc, l)}: “Fully-ordered”
  (2.a), (2.b) O-Spec{(B, asc, l), (A, asc, l)}: “Partially-ordered”
  (3) O-Spec{}: “any order”]
TLC-C Basic Principles
Duplicate behavior is correct with sets
Document order is modeled by our node identifiers
  Pattern tree matches return information in document order
ORDER BY clause is mapped to a list of ordering items and a sort operation
Binding order is determined during parsing by tracking how the query was typed
  A sort operation is used at the end of each single-block FLWOR statement to capture the binding order
Binding Order Example
    FOR $b IN document("lib.xml")//book
    FOR $a IN $b/author
    FOR $e IN $b/editor
    FOR $h IN $e/hobby
    FOR $i IN $a/interest
    RETURN $b
Orderlist: 2, 3, 5, 6, 4

Pattern tree used by Select (both plans):
    doc_root (LC=1)
      book (LC=2)
        author (LC=3)
          interest (LC=4)
        editor (LC=5)
          hobby (LC=6)

Algebraic plan (TLC):
  1. Select (pattern tree above)
  2. Project: Keep (2)

Algebraic plan with correct output order (TLC-C):
  1. Select (pattern tree above)
  2. Project: Keep (2), (3), (4), (5), (6)
  3. Sort: ID(2), ID(3), ID(5), ID(6), ID(4)
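To see why the sort uses the orderlist 2, 3, 5, 6, 4 instead of plain LC order, here is a hedged Python sketch. Each pattern tree match is modeled as a dict from LC label to a node identifier, with integer identifiers standing in for document order; the identifiers and the helper name `sort_by_binding_order` are made up for illustration.

```python
from itertools import product

ORDERLIST = [2, 3, 5, 6, 4]  # $b, $a, $e, $h, $i in the order they are bound


def sort_by_binding_order(matches, orderlist=ORDERLIST):
    """Sort matches by the node IDs of the listed LCs, first LC most significant."""
    return sorted(matches, key=lambda m: tuple(m[lc] for lc in orderlist))


# Hypothetical matches: one book (10), one author (11) with interests 12 and 13,
# one editor (17) with hobbies 18 and 19.
MATCHES = [
    {2: 10, 3: 11, 4: interest, 5: 17, 6: hobby}
    for interest, hobby in product([12, 13], [18, 19])
]

ordered = sort_by_binding_order(MATCHES)
hobby_interest = [(m[6], m[4]) for m in ordered]
```

Because `$h` is bound before `$i`, the hobby varies more slowly than the interest in the sorted result, which differs from the order obtained by sorting on LCs 2, 3, 4, 5, 6.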
Outline
XML and XQuery Order and Duplicates
Hybrid Collections
Correct Output Order
Thinking Efficiently
  Enhancing an algebra with Hybrid Collections
  Minimizing Duplicate Elimination procedures
  Selections and Ordering
  Nested Queries and Ordering
Experimental Evaluation
Final Words
Operators with Ordering (example)
Select S[apt, ord](CT): produces the matches of the annotated pattern tree (apt) on the input collection CT
New parameter ord is used for ordering:
  ‘empty’: unspecified order
  ‘maintain’: preserve the order of input CT
  ‘list-resort u’: destroy the order of CT and re-sort using the input list of node references u
  ‘list-add u’: preserve the order of input CT and sort ties using the input list of node references u
Algebraic Identities (example)
Select S and Sort O can be merged:
    O[ol](S[any, any](…)) ↔ S[any, ol](…)
Select S and Sort O can be swapped:
    O[ol](S[any, maintain](…)) ↔ S[any, maintain](O[ol](…))
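Both identities can be checked on a toy model. In this sketch the annotated pattern tree is reduced to a plain predicate over flat dict trees, and only the ‘empty’, ‘maintain’ and ‘list-resort’ modes are modeled; `select_op`, `sort_op` and the data are hypothetical names, not the operators of any real system.

```python
def sort_op(ol, trees):
    """O[ol]: stable sort of the trees by the node references in ol."""
    return sorted(trees, key=lambda t: tuple(t[n] for n in ol))


def select_op(pred, ord_mode, trees, u=()):
    """S[apt, ord]: keep trees matching pred, ordered as ord_mode requests."""
    matches = [t for t in trees if pred(t)]  # input order is kept here
    if ord_mode in ("empty", "maintain"):
        # 'empty' allows any order; input order is one valid choice.
        return matches
    if ord_mode == "list-resort":
        return sort_op(u, matches)
    raise ValueError(f"unsupported ord mode in this sketch: {ord_mode}")


TREES = [{"B": 2, "A": 1}, {"B": 1, "A": 3}, {"B": 1, "A": 2}, {"B": 3, "A": 0}]
OL = ["B", "A"]


def keep(tree):
    return tree["B"] < 3


# Merge: a Sort on top of a Select equals a Select that sorts its own output.
merged_lhs = sort_op(OL, select_op(keep, "empty", TREES))
merged_rhs = select_op(keep, "list-resort", TREES, OL)

# Swap: an order-preserving Select commutes with Sort.
swapped_lhs = sort_op(OL, select_op(keep, "maintain", TREES))
swapped_rhs = select_op(keep, "maintain", sort_op(OL, TREES))
```

The merged form lets the operator choose how to produce ordered output, for example by reading an already ordered input, instead of paying for a separate sort afterwards.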
Minimize Duplicate Eliminations
Step 1: Remove redundant duplicate elimination procedures
Step 2: Explore partial duplicate specifications to further minimize duplicate elimination procedures
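Step 1 can be pictured as a single pass over the plan that drops every duplicate elimination whose input is already known to be duplicate-free. The sketch below is only an illustration under stated assumptions: the plan is a flat bottom-up list of operator names, and the `EFFECT` table (which operators produce distinct output, preserve distinctness, or may introduce duplicates) is a guess made for this example, not a rule taken from the algebra.

```python
# Assumed effect of each operator on duplicate-freeness (illustrative only).
EFFECT = {
    "source_select": "distinct",   # matches against the stored document
    "aggregate": "preserve",
    "filter": "preserve",
    "project": "may_duplicate",    # dropping nodes can make trees equal
    "select": "preserve",
    "construct": "preserve",
}


def remove_redundant_de(plan):
    """Drop 'de' operators whose input is already duplicate-free."""
    duplicate_free = False
    rewritten = []
    for op in plan:
        if op == "de":
            if duplicate_free:
                continue  # redundant: nothing to eliminate
            duplicate_free = True
            rewritten.append(op)
            continue
        rewritten.append(op)
        effect = EFFECT[op]
        if effect == "distinct":
            duplicate_free = True
        elif effect == "may_duplicate":
            duplicate_free = False
    return rewritten


# A plan with a duplicate elimination after every operator.
PLAN = [
    "source_select", "de", "aggregate", "de", "filter", "de",
    "project", "de", "select", "de", "construct", "de",
]
OPTIMIZED = remove_redundant_de(PLAN)
```

Under these assumptions only the duplicate elimination that follows the projection survives.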
Minimize DEs Step 1 Example
    FOR $o IN document("auction.xml")//open_auction
    WHERE count($o/bidder) > 5
    RETURN <result> {$o/quantity} {$o/type} </result>

Plan before Step 1 (Duplicate Elimination: ID(tree) appears six times in the plan):
  1. Select: doc_root (LC=1), open_auction (LC=2), bidder (LC=3)
  2. Aggregate (count, (3), newLC=4)
  3. Filter: (4) > 5
  4. Project: Keep (2)
  5. Select on (2): quantity (LC=5), type* (LC=6)
  6. Construct: <result> (LC=7) with (5), (6)

Plan after Step 1 (a single Duplicate Elimination remains):
  1. Select: doc_root (LC=1), open_auction (LC=2), bidder (LC=3)
  2. Aggregate (count, (3), newLC=4)
  3. Filter: (4) > 5
  4. Project: Keep (2)
  5. Duplicate Elimination: ID(tree)
  6. Select on (2): quantity (LC=5), type* (LC=6)
  7. Construct: <result> (LC=7) with (5), (6)

From 6 DE procedures to 1
Minimize DEs Step 2 Example
    FOR $o IN document("auction.xml")//open_auction
    WHERE count($o/bidder) > 5
    RETURN <result> {$o/quantity} {$o/type} </result>

Plan before Step 2:
  1. Select: doc_root (LC=1), open_auction (LC=2), bidder (LC=3)
  2. Aggregate (count, (3), newLC=4)
  3. Filter: (4) > 5
  4. Project: Keep (2)
  5. Duplicate Elimination: ID(tree)
  6. Select on (2): quantity (LC=5), type* (LC=6)
  7. Construct: <result> (LC=7) with (5), (6)

Plan after Step 2:
  1. Select: doc_root (LC=1), open_auction (LC=2), bidder (LC=3)
  2. Aggregate (count, (3), newLC=4)
  3. Filter: (4) > 5
  4. Select on (2): quantity (LC=5), type* (LC=6)
  5. Construct: <result> (LC=7) with (5), (6)

The DE procedure is first modified to DE: ID(2).
It is then eliminated completely using algebraic rewrites.
Selections and Ordering
For “selection” type queries, use algebraic rewrites and push the sort down to the select operator.
Selections and Ordering Example
    FOR $b IN document("lib.xml")//book
    FOR $a IN $b/author
    FOR $e IN $b/editor
    FOR $h IN $e/hobby
    FOR $i IN $a/interest
    RETURN $b

Plan before the rewrite:
  1. Select, ord=empty: doc_root (LC=1), book (LC=2), author (LC=3), interest (LC=4), editor (LC=5), hobby (LC=6)
  2. Project: Keep (2), (3), (4), (5), (6)
  3. Sort: ID(2), ID(3), ID(5), ID(6), ID(4)
  4. Construct: (2)

Plan after the rewrite:
  1. Select, ord=ID(2), ID(3), ID(5), ID(6), ID(4): same pattern tree
  2. Project: Keep (2)
  3. Construct: (2)

Push Sort into Select using algebraic identities.
The optimizer can plan the Select operator without a forced blocking sort at the end.
Joins and Ordering Example
    FOR $a IN document(t)//article
    FOR $b IN document(t)//book
    WHERE $b/author = $a/author
    RETURN ($b, $a)

Algebraic plan with correct output order (TLC-C):
  1. Select, ord = empty: doc_root (LC=1), book (LC=2), author (LC=5), year = 1999 (LC=7)
  2. Select, ord = empty: doc_root (LC=3), article (LC=4), author (LC=6), conf = VLDB (LC=8)
  3. Join, ord = empty: (5) = (6), join_root (LC=9) over (2), (4)
  4. Project: Keep (9), (2), (4)
  5. Duplicate Elimination: ID(2), ID(4)
  6. Sort: ID(2), ID(4)
  7. Construct: <result> (LC=10) with (2), (4)
Joins and Ordering Example
Plan before the rewrite: Join has ord = empty and is followed by a separate Sort: ID(2), ID(4).
Plan after the rewrite: Join has ord = ID(2), ID(4) and the separate Sort operator is removed. Both Selects keep ord = empty; Project: Keep (9), (2), (4), Duplicate Elimination: ID(2), ID(4) and Construct are unchanged.
Push Sort into Join using algebraic identities.
Joins and Ordering Example
Plan before the rewrite: both Selects have ord = empty and Join has ord = ID(2), ID(4).
Plan after the rewrite: the Select on book has ord = ID(2), the Select on article has ord = ID(4), and Join has ord = maintain.
Push Sort further down into the Selects using algebraic identities.
Nested Queries and Ordering
    FOR $b IN document("lib.xml")/book
    LET $k := FOR $a IN document("lib.xml")/article
              WHERE $b/author = $a/author AND $a/conf = "VLDB"
              RETURN $a
    WHERE $b/year = 1999
    RETURN <result> {$b} {$k} </result>

[Figure: algebraic plan with correct output order (TLC-C). Operators shown: Select on doc_root (LC=1), book (LC=2), author (LC=5), year = 1999 (LC=7); Select on doc_root (LC=3), article (LC=4), author (LC=6), conf = VLDB (LC=8); Join (5) = (6) with join_root (LC=9) over (2), (4); Project: Keep (4), (6); DE: ID(4), ID(6); Sort: ID(4); Construct (4), (6); Project: Keep (9), (2), (4); Duplicate Elimination: ID(2), ID(4); Sort: ID(2); Construct <result> (LC=10) with (2), (4). Ordering annotations shown: ord = empty and ord = maintain(left, right).]
Nested Queries and Reorder
[Figure: the nested-query plan before and after the rewrite. Before: the plan contains Sort: ID(4) and the annotation ord = maintain(left, right). After: Sort: ID(4) is removed, ord = maintain(left, right) becomes ord = empty, and a new operator Reorder: (9), (4), ID(4) is added. Sort: ID(2) and the remaining operators are unchanged.]
Rewrite Sort and blocking Join to Reorder operation.
Experimental Setup
Timber System
  128MB buffer pool
  Value index when necessary (not for all queries)
Intel Pentium III-M 866 MHz
  Windows 2000 Professional
  IDE hard drive
  512MB RAM
XMark dataset factor 1
  707MB total space (472MB data + 241MB index)
Minimizing Duplicate Eliminations
[Chart: TLC-C vs. TLC-D on queries x17 (more selective), x19 (less selective) and q2 (value join); vertical axis from 0 to 16.]
Selections and Ordering
[Chart: TLC-C vs. TLC-O on queries x13 (simple output), x17 (more selective) and x19 (less selective); vertical axis from 0 to 6.]
Join and Ordering
[Chart: TLC-C vs. TLC-O on queries q1 (less selective), q2 (more selective) and x3 (less selective); vertical axis from 0 to 16.]
Ordering and Duplicate Optimizations
[Chart: TLC-C, TLC-D, TLC-O and TLC-OD on queries x19 (selection), q2 (value join) and x8 (nested query); vertical axis from 0% to 100%.]
Related Work
Relational systems recognize smart sort placement as a problem
  D. Simmen, E. Shekita, and T. Malkemus. Fundamental techniques for order optimization. In Proc. SIGMOD Conf., 1996.
XML navigational-based approach has a study of ordering requirements in:
  J. Hidders and P. Michiels. Avoiding unnecessary ordering operations in XPath. In Proc. DBPL Conf., 2003.
XML algebraic-based approaches use sets or sequences. Aside from the performance limitations, it is unknown whether they fully address the XQuery binding order to produce correct results.
Final Words
Ordering in XQuery is a complex procedure with significant performance ramifications
Introduced Hybrid Collections with Ordering Specification as a means to a correct and flexible solution
  Similar path for Duplicates
Showed algebraic optimizations that take advantage of the provided flexibility
Demonstrated experimentally the performance increase