Post on 06-Jan-2016
Approximate XML Query Answers
Alkis Polyzotis (UC Santa Cruz)
Minos Garofalakis (Bell Labs)
Yannis Ioannidis (U. of Athens, Hellas)
Motivation
XML: de facto standard for data exchange
Development of the “XML Warehouse”
Conflict between “on-line” and query execution cost
  Increased query response times
  Users might wait for uninteresting results
[Figure: a query Q is evaluated over the XML data warehouse and returns the XML result R]
Approximate Query Answers
Evaluate the query over a concise data synopsis and obtain an approximation R’ of the true result
Use the approximate result as timely feedback
  User can assess the “value” of the query
Goal: reduce the number of evaluated queries
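[Figure: the query Q is evaluated over a synopsis of the XML data warehouse and returns the approximate XML result R’ in place of the true XML result R]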
Contributions
TreeSketch Synopses
  Structural summaries for XML data
  Approximate answers for complex twig queries
  Summarization model
  Structural clustering of elements
  Efficient processing and construction
Element Simulation Distance
  Novel distance metric for XML data
  Captures “approximate” similarity between two XML trees
Experimental Results
  Accurate approximate answers for low space budgets
  Low-error selectivity estimates
  Efficient construction algorithm
Outline
Preliminaries
TreeSketches
  Synopsis model
  Computing approximate answers
  Summary construction
Element Simulation Distance
Experimental Study
Conclusions
Data and Query Model
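[Figure: XML document tree with elements r, p1, s2, s3, f4, f5, f6, f7, e8, e10, c9, c11, c12, c13]
[Figure: twig query with nodes q0, q1, q2, q3 and edge labels //section, .//equation, ./figure]
[Figure: nesting tree containing r, s2, e11, e13, f5, f7]
Binding Tuples
q0   q1   q2   q3
r    s2   f4   e8
r    s2   f4   e10
r    s2   f5   e8
r    s2   f5   e10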
Problem Definition
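Process a twig query over a synopsis
Compute an approximation of the nesting tree
[Figure: the twig query (nodes q0, q1, q2, q3; edge labels //section, .//equation, ./figure) evaluated over the XML data yields the true nesting tree (r, s2, e11, e13, f5, f7); evaluated over the synopsis it yields the approximate nesting tree (r, s, e, e, f)]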
Graph Synopsis
Synopsis node: set of elements of the same tag
Synopsis edge: document edge(s)
[Figure: XML document (elements r, p1, s2, s3, f4, f5, f6, f7, e8, e10, c9, c11, c12, c13) next to its graph synopsis with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1)]
TreeSketch Synopsis
Augment the graph synopsis with edge counts
  count[u,v]: mean number of children in v per element in u
[Figure: XML document and its TreeSketch with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1), edge counts of 1 or 2, and annotations labelled #F]
[Figure: the same XML document with a coarser TreeSketch, with nodes P(1), S(2), C(4), F(4), E(2), R(1) and edge counts that include the fractional value 0.5]
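The edge-count definition can be made concrete with a small sketch. The function name `edge_counts`, the dictionary layout, and the toy document below are illustrative assumptions, not part of the talk; the only thing taken from the slide is the definition of count[u,v] as the mean number of children in v per element in u.

```python
from collections import defaultdict

def edge_counts(children, cluster_of):
    """Return (sizes, count) for a partition of elements into synopsis nodes.

    children:   dict element -> list of child elements (the document tree)
    cluster_of: dict element -> synopsis node id
    count[(u, v)] is the average number of children in v per element of u.
    """
    sizes = defaultdict(int)
    totals = defaultdict(int)
    for elem, node in cluster_of.items():
        sizes[node] += 1
        for child in children.get(elem, []):
            totals[(node, cluster_of[child])] += 1
    # Divide the total number of u->v document edges by the size of u.
    count = {edge: n / sizes[edge[0]] for edge, n in totals.items()}
    return dict(sizes), count

# Hypothetical toy document: root r with two sections; s1 has 3 figures, s2 has 1.
children = {"r": ["s1", "s2"], "s1": ["f1", "f2", "f3"], "s2": ["f4"]}
cluster_of = {"r": "R", "s1": "S", "s2": "S",
              "f1": "F", "f2": "F", "f3": "F", "f4": "F"}
sizes, count = edge_counts(children, cluster_of)
print(sizes)
print(count)
```

In this toy case the two sections have 3 and 1 figures, so the S-to-F edge count is their mean, 2.0, which hides the difference between the two sections.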
TreeSketches and Clustering
TreeSketch: clustering based on structure
  All elements in a node are mapped to a “centroid”
  Tight clusters give an accurate synopsis
  The perfect synopsis corresponds to a perfect clustering
Synopsis quality quantified by clustering error
  Options: Manhattan distance, squared error, …
  Quality can be measured independently of a workload
  Key for effective construction
\[ \mathrm{Error} = \sum_{u} \sum_{e \in u} \lVert c_e - c_u \rVert^2 \]
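A minimal sketch of the squared-error option follows, under the assumption that c_e is the vector of per-cluster child counts of element e and c_u is the mean of those vectors over the elements of u (the “centroid”). The function name and the toy data are hypothetical.

```python
from collections import defaultdict

def squared_error(children, cluster_of):
    """Sum over clusters u and elements e in u of ||c_e - c_u||^2."""
    members = defaultdict(list)
    for elem, node in cluster_of.items():
        members[node].append(elem)

    def child_vector(elem):
        # c_e: number of children of elem in each cluster.
        vec = defaultdict(int)
        for child in children.get(elem, []):
            vec[cluster_of[child]] += 1
        return vec

    error = 0.0
    for elems in members.values():
        vectors = [child_vector(e) for e in elems]
        dims = {v for vec in vectors for v in vec}
        for v in dims:
            mean = sum(vec[v] for vec in vectors) / len(vectors)  # centroid coordinate
            error += sum((vec[v] - mean) ** 2 for vec in vectors)
    return error

# Hypothetical toy document: s1 has 3 figures, s2 has 1.
children = {"r": ["s1", "s2"], "s1": ["f1", "f2", "f3"], "s2": ["f4"]}
figures = {"f1": "F", "f2": "F", "f3": "F", "f4": "F"}
coarse = {"r": "R", "s1": "S", "s2": "S", **figures}
fine = {"r": "R", "s1": "S1", "s2": "S2", **figures}
print(squared_error(children, coarse))  # the two sections differ from their centroid
print(squared_error(children, fine))    # every cluster is structurally uniform
```

The error depends only on the document and the clustering, which is why it can be evaluated without a query workload.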
Computing Approximate Answers
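Compute the TreeSketch of the approximate answer
Accuracy depends on the quality of the clustering
[Figure: the TreeSketch (nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1), edge counts of 1 or 2), the twig query (nodes q0, q1, q2, q3; edge labels //section, .//equation, .//caption), and the resulting approximate nesting tree over R, S, E, C, where one edge count is derived as 1+1=2]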
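To illustrate how edge counts drive an estimate, here is a simplified sketch that estimates the number of binding tuples of a twig query over a TreeSketch-like summary. It is not the evaluation algorithm from the talk: it handles only child-axis steps (no // descendant steps), multiplies averages as if branches were independent, and uses a hypothetical synopsis and hypothetical function names.

```python
def bindings_per_element(node, query, label, count):
    """Expected number of binding tuples rooted at one element of `node`.

    query is a pair (tag, [subqueries]); every step is a child-axis step.
    """
    _, subqueries = query
    result = 1.0
    for sub in subqueries:
        total = 0.0
        for (u, v), c in count.items():
            if u == node and label[v] == sub[0]:
                total += c * bindings_per_element(v, sub, label, count)
        result *= total  # branches of the twig combine as a product
    return result

def estimate_selectivity(query, sizes, label, count):
    """Sum the per-element estimate over all synopsis nodes matching the query root."""
    return sum(sizes[n] * bindings_per_element(n, query, label, count)
               for n in sizes if label[n] == query[0])

# Hypothetical synopsis: R(1) -> S(2) with count 2; S -> F with count 2; S -> E with count 1.
sizes = {"R": 1, "S": 2, "F": 4, "E": 2}
label = {"R": "r", "S": "section", "F": "figure", "E": "equation"}
count = {("R", "S"): 2.0, ("S", "F"): 2.0, ("S", "E"): 1.0}
query = ("r", [("section", [("figure", []), ("equation", [])])])
print(estimate_selectivity(query, sizes, label, count))
```

Because the estimate is built from per-cluster averages, it is exact only when the elements in each synopsis node really have the same structure, which matches the slide's remark that accuracy depends on the quality of the clustering.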
TreeSketch Construction
Given an XML tree T, build a TreeSketch of size B
Difficult clustering problem
  Space dimensionality depends on the clustering itself
Construction based on bottom-up clustering
  Compress the perfect synopsis by merging clusters
  Best merge determined by marginal gains
[Figure: successive merges shrink the perfect synopsis until it fits the space budget]
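The bottom-up idea can be sketched as a greedy loop. This is a simplified stand-in rather than the construction algorithm from the talk: it starts from one cluster per element instead of the perfect synopsis, measures the budget as a number of clusters, and picks the same-tag merge with the smallest resulting squared error instead of a marginal gain. All names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def squared_error(children, cluster_of):
    """Sum over clusters of squared distances between child-count vectors and their mean."""
    members = defaultdict(list)
    for elem, node in cluster_of.items():
        members[node].append(elem)
    error = 0.0
    for elems in members.values():
        vectors = []
        for e in elems:
            vec = defaultdict(int)
            for child in children.get(e, []):
                vec[cluster_of[child]] += 1
            vectors.append(vec)
        for v in {v for vec in vectors for v in vec}:
            mean = sum(vec[v] for vec in vectors) / len(vectors)
            error += sum((vec[v] - mean) ** 2 for vec in vectors)
    return error

def greedy_compress(children, label, budget):
    """Merge same-tag clusters until at most `budget` clusters remain."""
    cluster_of = {e: e for e in label}  # each cluster is named after one of its elements
    while len(set(cluster_of.values())) > budget:
        best = None
        for a, b in combinations(sorted(set(cluster_of.values())), 2):
            if label[a] != label[b]:
                continue  # synopsis nodes only group elements of the same tag
            trial = {e: (a if c == b else c) for e, c in cluster_of.items()}
            err = squared_error(children, trial)
            if best is None or err < best[0]:
                best = (err, trial)
        if best is None:  # no same-tag pair is left to merge
            break
        cluster_of = best[1]
    return cluster_of

# Hypothetical toy document: sections s1 and s2 have 2 figures each, s3 has 1.
children = {"r": ["s1", "s2", "s3"], "s1": ["f1", "f2"], "s2": ["f3", "f4"], "s3": ["f5"]}
label = {"r": "R", "s1": "S", "s2": "S", "s3": "S",
         "f1": "F", "f2": "F", "f3": "F", "f4": "F", "f5": "F"}
result = greedy_compress(children, label, budget=4)
print(result)
```

In the toy run the figures are merged first at zero error, after which s1 and s2 become structurally identical and are merged ahead of s3. Recomputing the error for every candidate pair is quadratic per step, so this loop is only meant to show the shape of the procedure.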
Depth-Guided Merging
Key observation: two elements have similar structure if their children have similar structure
  Children clusters should be merged first
Bottom-up merging, based on depth
  Depth: distance from the leaves of the tree
  Build a pool of candidate merges by increasing depth
  Replenish the pool when it falls below a given threshold
Improved construction time, good performance
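The depth ordering can be sketched as follows. Two readings are assumed here and are not confirmed by the slides: depth is taken as the distance to the farthest leaf below an element, and candidate merges pair elements of the same tag and the same depth. The names `element_depths` and `candidate_pool` are hypothetical, and the replenishing step is left out.

```python
from itertools import combinations

def element_depths(children, root):
    """Depth of each element, read as the distance to its farthest leaf."""
    depth = {}

    def visit(elem):
        kids = children.get(elem, [])
        depth[elem] = 0 if not kids else 1 + max(visit(k) for k in kids)
        return depth[elem]

    visit(root)
    return depth

def candidate_pool(children, root, label, pool_size):
    """Collect up to pool_size same-tag, same-depth pairs, shallowest first."""
    depth = element_depths(children, root)
    ordered = sorted(depth, key=lambda e: (depth[e], e))
    pool = []
    for a, b in combinations(ordered, 2):
        if label[a] == label[b] and depth[a] == depth[b]:
            pool.append((depth[a], a, b))
            if len(pool) == pool_size:
                break
    return pool

# Hypothetical toy document: figures are leaves, sections sit one level above them.
children = {"r": ["s1", "s2", "s3"], "s1": ["f1", "f2"], "s2": ["f3", "f4"], "s3": ["f5"]}
label = {"r": "R", "s1": "S", "s2": "S", "s3": "S",
         "f1": "F", "f2": "F", "f3": "F", "f4": "F", "f5": "F"}
depth = element_depths(children, "r")
pool = candidate_pool(children, "r", label, pool_size=3)
print(pool)
```

Leaf-level pairs fill the pool before any section pair appears, which reflects the observation that children clusters should be merged before their parents.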
Error of Approximation
Error: distance between R’ and R
Popular metric: tree-edit distance
  Min-cost sequence of operations that transforms R’ into R
  Measures syntactic differences between R and R’
Not intuitive for approximate answers!
[Figure: the true result T and two approximations T1 and T2, each a tree over r, s, e, f with edge counts; one approximation has different counts but a similar trait, the other has the same counts but the opposite trait]
Element Simulation Distance
Capture approximate similarity between R and R’
u simulates v: u and v have identical structure
ESD(u,v): “degree” of simulation between u and v
  How well the structure of u matches the structure of v
Modeled as the distance between multi-sets
Efficient computation using perfect summaries
[Figure: the trees T1, T, and T2, each over r, s, e, f with edge counts]
Experimental Methodology
Data sets: XMark, DBLP, IMDB, SwissProt
Workload: 1000 random twig queries
Evaluation metrics:
  Average ESD for approximate answers
  Mean absolute relative error for selectivity estimation
\[ \frac{1}{|W|} \times \sum_{q \in W} \frac{|\mathrm{estim}(q) - \mathrm{count}(q)|}{\mathrm{count}(q)} \]
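The selectivity metric above is straightforward to compute. The sketch below assumes every query in the workload has a non-zero true count; the function name and the sample numbers are made up for illustration.

```python
def mean_abs_relative_error(estimates, counts):
    """Average of |estim(q) - count(q)| / count(q) over a workload W."""
    if len(estimates) != len(counts) or not counts:
        raise ValueError("need one estimate per query and a non-empty workload")
    return sum(abs(e - c) / c for e, c in zip(estimates, counts)) / len(counts)

# Hypothetical workload of three queries.
estimates = [90, 220, 50]
counts = [100, 200, 50]
error = mean_abs_relative_error(estimates, counts)
print(f"{100 * error:.2f}%")
```

Two of the three sample estimates are off by 10% and one is exact, so the average relative error is about 6.67%.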
Approximate Answers
[Figure: mean ESD (axis from 0 to 20,000) vs. summary size (10 to 50 KB) for TreeSketch and XSketch. IMDB (~102K elements), avg. result size: 3,477 tuples]
Selectivity Estimation - SwissProt
[Figure: estimation error in % (axis from 0 to 160) vs. summary size (10 to 50 KB) for TreeSketch and XSketch. SwissProt (~182K elements), avg. result size: 104,592 tuples]
Selectivity Estimation
[Figure: estimation error in % (axis from 0 to 30) vs. summary storage (10 to 50 KB) for DBLP, IMDB, SwissProt, and XMark]

Data Set   #Elements (x 10^3)   #Tuples (x 10^3)
DBLP       1,500                78
IMDB       236                  13
S-Prot     473                  365
XMark      2,000                145

Data Set   Construction Time (min)
DBLP       11
IMDB       2.5
S-Prot     38
XMark      240
Conclusions
Approximate query answering for XML databases
TreeSketch Synopses
  Structural summaries for tree-structured XML
  Approximate answers for twig queries
  Model: graph synopsis + edge counts
  Efficient processing and construction
Element Simulation Distance
  Captures approximate similarity between XML trees
Experimental Results
  High accuracy for low space budgets
  Efficient construction
Questions?
[Figure: XML document tree (elements r, p1, s2, s3, f5, f7, f9, e11, e13, c12, c14, c17) and its TreeSketch with nodes P(1), S(2), F(2), C(4), F(2), E(2), R and edge counts of 1 or 2]
TreeSketch Model (2/2)
Average number of children <--> Edge count
[Figure: XML document tree (elements r, p1, s2, s3, f5, f7, f9, e11, e13, c12, c14, c17) annotated with per-element child counts #E and #C]
Legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation