Approximate XML Query Answers

Alkis Polyzotis (UC Santa Cruz), Minos Garofalakis (Bell Labs), Yannis Ioannidis (U. of Athens, Hellas)

Motivation

XML: de-facto standard for data exchange
Development of the “XML Warehouse”
Conflict between “on-line” and query execution cost
  Increased query response times
  Users might wait for un-interesting results

[Figure: XML data warehouse]

Approximate Query Answers

Evaluate query over a concise data synopsis and obtain an approximation R’ of the true result

Use approximate result as timely feedback
  User can assess the “value” of the query

Goal: reduce number of evaluated queries

[Figure: XML data warehouse, its synopsis, and the approximate XML result R’]

Contributions

TreeSketch Synopses
  Structural summaries for XML data
  Approximate answers for complex twig queries
  Summarization model: structural clustering of elements
  Efficient processing and construction

Element Simulation Distance
  Novel distance metric for XML data
  Captures “approximate” similarity between two XML trees

Experimental Results
  Accurate approximate answers for low space budgets
  Low-error selectivity estimates
  Efficient construction algorithm

Outline

Preliminaries
TreeSketches
  Synopsis model
  Computing approximate answers
  Summary construction
Element Simulation Distance
Experimental Study
Conclusions

Data and Query Model

[Figure: an example XML document, a twig query over it, the resulting nesting tree, and the binding tuples]

Twig Query: //section, with branches .//equation and ./figure

Binding Tuples

q3    q2    q1    q0
e10   f5    s2    r
e8    f5    s2    r
e10   f4    s2    r
e8    f4    s2    r

Problem Definition

Process twig query over a synopsis
Compute approximation of nesting tree

[Figure: the twig query (//section, with branches .//equation and ./figure) yields the true nesting tree over the XML data and an approximate nesting tree over the synopsis]

Graph Synopsis

[Figure: an XML document and its graph synopsis]

Synopsis node: set of elements with the same tag
Synopsis edge: represents document edge(s)

TreeSketch Synopsis

[Figure: an XML document and its TreeSketch]

Augment graph-synopsis with edge counts
  count[u,v]: mean number of children in v per element in u
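The edge-count definition can be illustrated with a minimal sketch. The toy document, the element ids, and the function name `edge_counts` are hypothetical and not taken from the slides; the partition of elements into synopsis nodes is assumed to be given.

```python
from collections import defaultdict

def edge_counts(parent, cluster):
    """Compute count[(u, v)]: mean number of children in v per element in u.

    parent:  dict mapping each non-root element id to its parent element id
    cluster: dict mapping each element id to its synopsis node id
    """
    # Number of elements grouped into each synopsis node.
    size = defaultdict(int)
    for node in cluster.values():
        size[node] += 1
    # Total number of document edges between each pair of synopsis nodes.
    totals = defaultdict(int)
    for child, par in parent.items():
        totals[(cluster[par], cluster[child])] += 1
    # Average the totals over the elements of the parent node.
    return {edge: n / size[edge[0]] for edge, n in totals.items()}

# Hypothetical toy document: root r, sections s1 and s2,
# with one equation under s1 and three under s2.
parent = {"s1": "r", "s2": "r", "e1": "s1", "e2": "s2", "e3": "s2", "e4": "s2"}
cluster = {"r": "R", "s1": "S", "s2": "S",
           "e1": "E", "e2": "E", "e3": "E", "e4": "E"}
counts = edge_counts(parent, cluster)
print(counts)  # {('R', 'S'): 2.0, ('S', 'E'): 2.0}
```

Here the two sections have 1 and 3 equation children, so the synopsis stores the mean 2.0 on the edge (S, E); the difference between the mean and the actual counts is what the clustering error measures.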


TreeSketches and Clustering

TreeSketch: clustering based on structure
  All elements in a node are mapped to a “centroid”
  Tight clusters yield an accurate synopsis
  The perfect synopsis corresponds to a perfect clustering

Synopsis quality is quantified by clustering error
  Options: Manhattan distance, squared error, …
  Quality can be measured independently of a workload
  Key for effective construction

\mathrm{Error} = \sum_{e} (c_e - c_u)^2

(c_e: child count of element e; c_u: centroid count of its cluster u)
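A minimal sketch of the squared-error option, assuming each element is described by a vector of child counts (one entry per child cluster) and the centroid is the per-dimension mean. The function name and the dictionary representation are illustrative assumptions, not the authors' implementation.

```python
def squared_error(vectors):
    """Return (centroid, error) for one non-empty cluster.

    vectors: list of dicts, one per element, mapping a child-cluster id
             to the number of children the element has in that cluster.
    """
    # All child clusters that occur for at least one element.
    keys = set().union(*vectors)
    # Centroid: mean count per child cluster (missing entries count as 0).
    centroid = {k: sum(v.get(k, 0) for v in vectors) / len(vectors)
                for k in keys}
    # Squared distance of every element from the centroid, summed.
    error = sum((v.get(k, 0) - centroid[k]) ** 2
                for v in vectors for k in keys)
    return centroid, error

# Two sections with 1 and 3 equation children: loose cluster.
centroid, err = squared_error([{"E": 1}, {"E": 3}])
print(centroid, err)  # {'E': 2.0} 2.0

# Two structurally identical elements: perfect cluster, zero error.
_, perfect_err = squared_error([{"E": 2, "F": 1}, {"E": 2, "F": 1}])
print(perfect_err)  # 0.0
```

Because the error depends only on the data and the clustering, it can be evaluated without any query workload.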

Computing Approximate Answers

[Figure: a TreeSketch, the twig query (//section, with branches .//equation and .//caption), and the resulting approximate nesting tree]

Compute TreeSketch of approximate answer
Accuracy depends on quality of clustering
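The slides do not spell out the evaluation algorithm, so the following is only an illustration of how edge counts can drive an estimate: for a path of child steps through the synopsis, the expected number of bindings is the size of the first node multiplied by the edge counts along the path. The names `sizes`, `counts`, and `estimate_path_count` are hypothetical, and descendant steps and branching are deliberately left out.

```python
def estimate_path_count(sizes, counts, path):
    """Estimate the number of bindings for a chain of child steps.

    sizes:  dict synopsis node id -> number of elements in the node
    counts: dict (u, v) -> mean number of children in v per element in u
    path:   list of synopsis node ids, each a child step from the previous
    """
    total = float(sizes[path[0]])
    for u, v in zip(path, path[1:]):
        # A missing edge means no element of u has a child in v.
        total *= counts.get((u, v), 0.0)
    return total

# Hypothetical synopsis: 1 root, 2 sections, 4 equations.
sizes = {"R": 1, "S": 2, "E": 4}
counts = {("R", "S"): 2.0, ("S", "E"): 2.0}

print(estimate_path_count(sizes, counts, ["R", "S", "E"]))  # 4.0
print(estimate_path_count(sizes, counts, ["R", "E"]))       # 0.0
```

The estimate is exact when every element of a node has the same number of children per edge, which is the sense in which accuracy depends on the quality of the clustering.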

TreeSketch Construction

Given an XML tree T, build a TreeSketch of size B
Difficult clustering problem
  Space dimensionality depends on the clustering itself

Construction based on bottom-up clustering
  Compress perfect synopsis by merging clusters
  Best merge determined by marginal gains

[Figure: clusters are merged starting from the perfect synopsis until the space budget is met]
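A simplified sketch of bottom-up construction by greedy merging. Several assumptions are made that the slides do not confirm: each element is reduced to a single child count, the budget is expressed as a number of clusters (at least 1) rather than bytes, the marginal gain of a merge is taken to be the increase in squared error, and the same-tag constraint on merged clusters is ignored. The function names are hypothetical.

```python
from itertools import combinations

def sq_error(values):
    """Squared error of a cluster of per-element child counts."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def greedy_merge(clusters, budget):
    """Merge clusters pairwise until at most `budget` clusters remain.

    clusters: list of non-empty lists of per-element child counts;
              the starting point plays the role of the perfect synopsis.
    budget:   maximum number of clusters allowed (assumed >= 1).
    """
    clusters = [list(c) for c in clusters]
    while len(clusters) > budget:
        best = None
        # Try every pair and keep the merge with the smallest error increase.
        for i, j in combinations(range(len(clusters)), 2):
            increase = (sq_error(clusters[i] + clusters[j])
                        - sq_error(clusters[i]) - sq_error(clusters[j]))
            if best is None or increase < best[0]:
                best = (increase, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Three perfect clusters (zero error each), compressed to two.
result = greedy_merge([[1, 1], [2], [9]], budget=2)
print(result)  # [[9], [1, 1, 2]]
```

Trying every pair at every step is quadratic in the number of clusters, which is the cost that a restricted candidate pool is meant to avoid.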

Depth-Guided Merging

Key observation: two elements have similar structure if their children have similar structure
  Children clusters should be merged first

Bottom-up merging, based on depth
  Depth: distance from the leaves of the tree
  Build a pool of candidate merges by increasing depth
  Replenish the pool when it falls below a given threshold

Improved construction time - good performance
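A small sketch of the depth ordering used to guide merging, assuming "distance from the leaves" means the longest downward path to a leaf (the slides do not say how nodes with subtrees of unequal height are handled). The tree, node ids, and function names are hypothetical.

```python
from collections import defaultdict

def depths(children, root):
    """Return depth per node, where leaves have depth 0.

    children: dict node id -> list of child node ids (leaves may be absent)
    """
    depth = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if expanded or not kids:
            # All children are done at this point (iterative post-order).
            depth[node] = 1 + max(depth[k] for k in kids) if kids else 0
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return depth

def by_depth(depth):
    """Group nodes by increasing depth: the order for candidate merges."""
    levels = defaultdict(list)
    for node, d in depth.items():
        levels[d].append(node)
    return [sorted(levels[d]) for d in sorted(levels)]

# Hypothetical tree: r -> s1, s2; s1 -> e1; s2 -> f1; f1 -> c1
children = {"r": ["s1", "s2"], "s1": ["e1"], "s2": ["f1"], "f1": ["c1"]}
levels = by_depth(depths(children, "r"))
print(levels)  # [['c1', 'e1'], ['f1', 's1'], ['s2'], ['r']]
```

Candidate merges would then be drawn level by level from this list, so children clusters are considered before their parents.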


Error of Approximation

Error: distance between R’ and R
Popular metric: tree-edit distance
  Min-cost sequence of operations that transform R’ to R
  Measures syntactic differences between R and R’

Not intuitive for approximate answers!

[Figure: two example comparisons, labeled "Different counts, similar trait" and "Same counts, opposite trait"]

Element Simulation Distance

Capture approximate similarity between R and R’
u simulates v: u and v have identical structure
ESD(u,v): “degree” of simulation between u and v
  How well the structure of u matches the structure of v

Modeled as the distance between multi-sets
Efficient computation using perfect summaries


Experimental Methodology

Data sets: XMark, DBLP, IMDB, SwissProt
Workload: 1000 random twig queries
Evaluation metrics:
  Average ESD for approximate answers
  Mean absolute relative error for selectivity estimation

\frac{1}{|W|} \times \sum_{q \in W} \frac{|\mathrm{estim}(q) - \mathrm{count}(q)|}{\mathrm{count}(q)}
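The selectivity metric, mean absolute relative error over a workload W, can be computed as in this small sketch. The function name and the sample numbers are made up for illustration, and every true count is assumed to be positive.

```python
def mean_abs_relative_error(workload):
    """Mean absolute relative error over a workload.

    workload: non-empty list of (estimate, true_count) pairs,
              one per query, with true_count > 0.
    """
    return sum(abs(est - cnt) / cnt for est, cnt in workload) / len(workload)

# Hypothetical workload of two queries: 10% and 20% relative error.
error = mean_abs_relative_error([(90, 100), (12, 10)])
print(round(error, 4))  # 0.15
```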

Approximate Answers

[Plot: mean ESD vs. summary size (10 to 50 KB), TreeSketch vs. XSketch]

IMDB (~102K elements), avg. result size: 3,477 tuples

Selectivity Estimation - SwissProt

[Plot: estimation error (%) vs. summary size (10 to 50 KB), TreeSketch vs. XSketch]

SwissProt (~182K elements), avg. result size: 104,592 tuples

Selectivity Estimation

[Plot: error (%) vs. summary storage (10 to 50 KB) for DBLP, IMDB, SwissProt, and XMark]

Data Set   #Elements (x 10^3)   #Tuples (x 10^3)
DBLP       1,500                78
IMDB       236                  13
S-Prot     473                  365
XMark      2,000                145

Data Set   Construction Time (min)
DBLP       11
IMDB       2.5
S-Prot     38
XMark      240

Conclusions

Approximate query answering for XML databases

TreeSketch Synopses
  Structural summaries for tree-structured XML
  Approximate answers for twig queries
  Model: graph synopsis + edge counts
  Efficient processing and construction

Element Simulation Distance
  Captures approximate similarity between XML trees

Experimental Results
  High accuracy for low space budgets
  Efficient construction


Questions?


TreeSketch Model (2/2)

Average number of children <--> Edge count

[Figure: an XML document and its TreeSketch. Legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation]
