
FIX: Feature-based Indexing Technique for XML Documents

Ning Zhang

University of Waterloo
http://www.cs.uwaterloo.ca/~nzhang

Joint work with M. Tamer Özsu, Ihab F. Ilyas, and Ashraf Aboulnaga

Motivating Example

• Twig query (the root axis could be //, the others are /):
  Q1: Find phone numbers (P) of all authors (A) who also have email (E) and school (S).

  //A[./E][./S]/P

• Find all subtrees satisfying a pattern tree:

  [Figure: pattern tree for Q1. Node A is reached through a // axis and has children E, S, and P, each through a / axis.]

• A general path containing // in the middle can be decomposed into interconnected twig queries.

  [Figure: a query tree over nodes A, B, C, L, M, and N connected by / and // axes.]
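To make Q1 concrete, here is a minimal sketch that evaluates the twig query with Python's standard `xml.etree.ElementTree`, whose limited XPath dialect accepts chained child predicates. The sample document and its values are invented for illustration and do not come from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: three authors (A), only the first has E, S, and P.
doc = ET.fromstring("""
<bib>
  <A><E>a@example.org</E><S>Waterloo</S><P>555-0100</P></A>
  <A><E>b@example.org</E><P>555-0101</P></A>
  <A><S>Toronto</S><P>555-0102</P></A>
</bib>
""")

# //A[./E][./S]/P written in ElementTree's XPath subset.
# ".//A" searches the descendants of the root element, not the root itself.
phones = [p.text for p in doc.findall(".//A[E][S]/P")]
print(phones)  # ['555-0100']
```

Only the first author satisfies both branch predicates, so only its phone number is returned.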

Approaches to Evaluating Twig Queries

Navigational Approach

Traverse the XML tree and perform a Tree Pattern Matching (TPM) operation on every tree node.

[Figure: XML Tree]

• Analogous to a sequential scan, very expensive:
  • 4,000,000+ TPM operations on DBLP.
  • Many unnecessary operations for highly selective queries.

Approaches to Evaluating Twig Queries

Join-based Approach

XML elements are clustered by tag name; elements matching the tag names in the query are structurally joined.

[Figure: the pattern tree A with children E, S, and P, next to the tag-name clusters A A A ..., E E E ..., S S S ..., P P P ...]

• Analogous to an index-based join.
• Tag-name indexes are not discriminative enough:
  • Elements are selected for the join solely by their tag names, without considering their descendants.
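The difference in starting points between the two approaches can be sketched as follows. This is an illustration under simplifying assumptions, not the algorithms evaluated in the talk: the `tpm` function hard-codes the pattern A[./E][./S]/P, the tag-name index is a plain dictionary, and the second variant runs TPM on every A element instead of performing a real structural join. The sample document is made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document: four A elements, two of which match the pattern.
DOC = ET.fromstring(
    "<bib>"
    "<A><E/><S/><P>555-0100</P></A>"
    "<A><E/><P>555-0101</P></A>"
    "<A><S/></A>"
    "<B><A><E/><S/><P>555-0102</P></A></B>"
    "</bib>"
)

def tpm(node):
    """Match A[./E][./S]/P rooted at node and return the matching P children."""
    if node.tag != "A":
        return []
    child_tags = {child.tag for child in node}
    if not {"E", "S", "P"} <= child_tags:
        return []
    return [child for child in node if child.tag == "P"]

def navigational(root):
    """Every node of the tree is a TPM starting point."""
    ops, hits = 0, []
    for node in root.iter():
        ops += 1
        hits.extend(tpm(node))
    return ops, hits

def build_tag_index(root):
    """Cluster elements by tag name."""
    index = defaultdict(list)
    for node in root.iter():
        index[node.tag].append(node)
    return index

def with_tag_index(index):
    """Only elements tagged A are starting points, whatever their subtrees look like."""
    ops, hits = 0, []
    for node in index["A"]:
        ops += 1
        hits.extend(tpm(node))
    return ops, hits

nav_ops, nav_hits = navigational(DOC)
idx_ops, idx_hits = with_tag_index(build_tag_index(DOC))
print(nav_ops, idx_ops, [p.text for p in idx_hits])
```

On this toy input the navigational scan tries 15 starting points and the tag-name index tries 4, yet only 2 of them lead to answers. The remaining gap is what an index that also looks at the subtree below each candidate aims to close.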

Objectives

• Build an index that considers not only root tags but also the whole subtree.

  [Figure: TPM starting points in the string abz)e)cf)g))cf)g))cf)g))c)c))cz)e)df)g))i)c))), shown for (1) a sequential scan, (2) a tag-name index, and (3) an index considering the whole subtree.]

• Exploit existing indexes (e.g., B+ tree) to build the new index.

• Incorporate both structures and values in the index.

Related Work — Structural Indexes

Cluster XML tree nodes having similar structures in terms of:

• Tag name: tag-name indexes use tag names as keys for a B+ tree (the pruning power comes from the root of the query tree).
• Rooted path: DataGuide, 1-index, A(k)-index (simple path expressions only).
• Subtree: bisimulation graph.
• Rooted path & subtree: F&B bisimulation graph.

[Figure: Bisimulation Graph, with vertices labeled bib, article, book, inproceedings, www, author, title, affiliation, email, address, and phone.]

The bisimulation and F&B bisimulation graphs could be very large ($3 \times 10^5$ vertices and $2 \times 10^6$ edges for Treebank).

Our Approach: Feature-based Index (FIX)

Key Idea:

1. Data and query trees are all converted to bisimulation graphs.
   1.1 The bisimulation graph is much smaller than the XML tree.
   1.2 The bisimulation graph preserves all structural information.
2. Enumerate all subgraphs of depth k (indexable units) in the data bisimulation graph.
3. Insert indexable units based on their distinctive features.
4. Calculate the features of the query bisimulation graph and use them to filter out indexed units by comparing their features.

What are the features for labeled trees?
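Step 1 can be illustrated with a short bottom-up construction. The slides do not define which bisimulation variant is used, so this sketch assumes the common downward form: two tree nodes fall into the same class when they carry the same label and their children cover the same set of classes. The `(label, children)` tree encoding and the sample data are hypothetical.

```python
def bisimulation_graph(tree):
    """Collapse a (label, children) tree into its bisimulation graph.

    Returns (root_id, vertices, edges): vertices maps class id -> label,
    edges is a set of (parent_class, child_class) pairs.
    """
    class_of = {}   # signature -> class id
    vertices = {}
    edges = set()

    def visit(node):
        label, children = node
        child_ids = frozenset(visit(child) for child in children)
        signature = (label, child_ids)
        if signature not in class_of:
            cid = len(class_of)
            class_of[signature] = cid
            vertices[cid] = label
            edges.update((cid, child) for child in child_ids)
        return class_of[signature]

    root_id = visit(tree)
    return root_id, vertices, edges

# Three structurally identical articles collapse into a single vertex.
article = ("article", [("author", [("email", [])]), ("title", [])])
bib = ("bib", [article, article, article])

root_id, vertices, edges = bisimulation_graph(bib)
print(len(vertices), len(edges))  # 5 4
```

The input tree has 13 nodes, while the resulting graph has 5 vertices and 4 edges, which shows on a small scale why the graph can be much smaller than the tree without losing the parent-child structure.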

Features of XML Tree

Three features of indexable units (after conversion to a special matrix):

1. minimum eigenvalue $\lambda_{\min}$
2. maximum eigenvalue $\lambda_{\max}$
3. root label $r$

Theorem: Given two graphs $G$ and $H$, if $H$ is an induced subgraph of $G$, then

$$\lambda_{\min}(G) \le \lambda_{\min}(H) \le \lambda_{\max}(H) \le \lambda_{\max}(G).$$

Necessary conditions for a query $Q$ to have positive answers in data tree $D$:

$$\lambda_{\min}(D) \le \lambda_{\min}(Q) \le \lambda_{\max}(Q) \le \lambda_{\max}(D) \;\land\; r(Q) = r(D)$$

Returned results may contain false positives, so a refinement step is needed.
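A small worked instance may make the theorem concrete. The slides do not state how the eigenvalues of an anti-symmetric matrix are ordered; the sketch below assumes that the real eigenvalues of the Hermitian matrix $iM$ are used. The two graphs and their weights are made up for illustration.

```latex
G = \begin{bmatrix} 0 & 3 & 0 \\ -3 & 0 & 4 \\ 0 & -4 & 0 \end{bmatrix},
\qquad
H = \begin{bmatrix} 0 & 3 \\ -3 & 0 \end{bmatrix}
\quad \text{($H$ is induced by the first two vertices of $G$)}

\det(\lambda I - iG) = \lambda^3 - 25\lambda
\;\Rightarrow\; \lambda \in \{-5,\ 0,\ 5\},
\qquad
\det(\lambda I - iH) = \lambda^2 - 9
\;\Rightarrow\; \lambda \in \{-3,\ 3\}

\lambda_{\min}(G) = -5 \;\le\; \lambda_{\min}(H) = -3 \;\le\; \lambda_{\max}(H) = 3 \;\le\; \lambda_{\max}(G) = 5
```

The interval of the subgraph sits inside the interval of the larger graph, which is exactly the containment test used for pruning.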

Calculating Features

1. Compute the bisimulation graph for an XML tree.

2. Convert the labeled directed graph into a weighted directed graph (encode each labeled edge as an edge weight), e.g.,

   (article, title) → 3      (article, author) → 5
   (author, address) → 4     (author, email) → 7

3. Convert the weighted directed graph into an anti-symmetric matrix.

   [Figure: the weighted directed graph over article, author, title, address, and email that yields M.]

$$M = \begin{bmatrix}
0 & 5 & 3 & 0 & 0 \\
-5 & 0 & 0 & 4 & 7 \\
-3 & 0 & 0 & 0 & 0 \\
0 & -4 & 0 & 0 & 0 \\
0 & -7 & 0 & 0 & 0
\end{bmatrix}$$

4. Calculate the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of the matrix. $\lambda_{\min}$, $\lambda_{\max}$, and the root node label $r$ are the three features.
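Steps 3 and 4 can be sketched in plain Python. The edge weights are the ones on the slide; the vertex order is an assumption chosen so that the result reproduces M. The slides do not say how the eigenvalues are computed or ordered, so the sketch assumes the real eigenvalues of the Hermitian matrix iM. Because eigenvalues of a real anti-symmetric matrix are purely imaginary and come in plus/minus pairs, the smallest one is the negative of the largest, and the largest can be found by power iteration on M^T M. Function names are hypothetical.

```python
import math

# Edge weights from the slide; vertex order is an assumption matching M.
VERTICES = ["article", "author", "title", "address", "email"]
WEIGHTED_EDGES = {
    ("article", "author"): 5,
    ("article", "title"): 3,
    ("author", "address"): 4,
    ("author", "email"): 7,
}

def antisymmetric_matrix(vertices, weighted_edges):
    """Store +w at (parent, child) and -w at (child, parent)."""
    pos = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    m = [[0.0] * n for _ in range(n)]
    for (parent, child), w in weighted_edges.items():
        m[pos[parent]][pos[child]] = float(w)
        m[pos[child]][pos[parent]] = -float(w)
    return m

def spectral_bounds(m, iterations=200):
    """Return (lambda_min, lambda_max) of i*M for an anti-symmetric M.

    Power iteration on S = M^T M converges to lambda_max squared.
    """
    n = len(m)
    s = [[sum(m[k][i] * m[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    x = [1.0] * n
    top = 0.0
    for _ in range(iterations):
        y = [sum(s[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in y))
        if norm == 0.0:          # M is the zero matrix
            return 0.0, 0.0
        x = [c / norm for c in y]
        # Rayleigh quotient x^T S x for the unit vector x
        top = sum(x[i] * sum(s[i][j] * x[j] for j in range(n)) for i in range(n))
    lam = math.sqrt(top)
    return -lam, lam

M = antisymmetric_matrix(VERTICES, WEIGHTED_EDGES)
lam_min, lam_max = spectral_bounds(M)
print(M[1], round(lam_max, 3))
```

A numerical library with a Hermitian eigenvalue solver would be the usual choice in practice; the hand-rolled iteration only keeps the example free of dependencies.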

Building Index

• The bisimulation graph could be too large, so restrict the depth of XML trees to a small k.

• For each document in a collection:
  • if the document's maximum depth is less than k, convert the whole tree into a bisimulation graph;
  • otherwise, enumerate all subtrees having depth under the limit k and convert them into bisimulation graphs.

• Both clustered and unclustered indexes can be built for a collection.

[Figure: Unclustered Index over the Primary XML Data Storage; Clustered Index over a Copy of Primary XML Data Storage with Redundancy.]
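The enumeration step can be sketched as follows. The slide leaves the details open, so this is one possible reading: a single node counts as depth 1, and for a deep document one unit is produced per node by cutting the subtree rooted there after k levels. The `(label, children)` encoding and the function names are hypothetical.

```python
def depth(node):
    """Number of levels in the subtree; a leaf has depth 1 (an assumption)."""
    label, children = node
    return 1 + max((depth(child) for child in children), default=0)

def truncate(node, k):
    """Copy of the subtree rooted at node, keeping at most k levels."""
    label, children = node
    if k <= 1:
        return (label, ())
    return (label, tuple(truncate(child, k - 1) for child in children))

def indexable_units(root, k):
    """Whole document if it is shallow, else one depth-limited unit per node."""
    if depth(root) < k:
        return [root]
    units = []
    stack = [root]
    while stack:
        node = stack.pop()
        units.append(truncate(node, k))
        stack.extend(node[1])
    return units

DOC = ("bib", (
    ("article", (
        ("author", (("email", ()),)),
        ("title", ()),
    )),
))

print(len(indexable_units(DOC, 2)), len(indexable_units(DOC, 5)))  # 5 1
```

Each unit would then be converted to a bisimulation graph and inserted into the index under its features.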

Query Processing

• Convert the query tree into a bisimulation graph.

• Convert the bisimulation graph into an anti-symmetric matrix.

• Calculate $\lambda_{\min}(Q)$ and $\lambda_{\max}(Q)$.

• Reduce to a range query:
  • find all $[\lambda_{\min}, \lambda_{\max}]$ in the index that contain $[\lambda_{\min}(Q), \lambda_{\max}(Q)]$ and have r = Q.root.

• Refinement: evaluate Tree Pattern Matching on the returned candidate results.
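The filtering step amounts to an interval-containment test plus a label check. The sketch below stands in for the B+ tree with a list sorted on the minimum eigenvalue and a binary search; the `Entry` layout, the `unit_id` pointer, and all numbers are hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple

class Entry(NamedTuple):
    lam_min: float
    lam_max: float
    root: str
    unit_id: int   # hypothetical pointer to the indexed unit

def build_index(entries):
    """Sort entries by lam_min and keep the sort keys for binary search."""
    ordered = sorted(entries)
    keys = [e.lam_min for e in ordered]
    return ordered, keys

def candidates(index, q_min, q_max, q_root):
    """Units whose [lam_min, lam_max] contains [q_min, q_max] and whose root matches."""
    ordered, keys = index
    # Entries with lam_min <= q_min form a prefix of the sorted list.
    hi = bisect_right(keys, q_min)
    return [e.unit_id for e in ordered[:hi]
            if e.lam_max >= q_max and e.root == q_root]

index = build_index([
    Entry(-9.63, 9.63, "article", 1),
    Entry(-5.0, 5.0, "article", 2),
    Entry(-9.63, 9.63, "book", 3),
    Entry(-12.0, 12.0, "article", 4),
])

result = candidates(index, -7.0, 7.0, "article")
print(sorted(result))  # [1, 4]
```

Unit 2 is pruned because its interval is too narrow and unit 3 because its root label differs. Units 1 and 4 are only candidates; they still need the Tree Pattern Matching refinement.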

Incorporating Values

• Treat values as special tag names:
  • Values are hashed into a small domain $D_v$ outside of the tag-name encodings $D_t$, i.e., $D_v \cap D_t = \emptyset$.
  • (tag, value) edges are also mapped to a distinct integer.

• Can answer equality-constrained queries, e.g.,

//book[title="TCP/IP Illustrated"]/price
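One way to keep the two code domains disjoint is to number the tags first and place the hash buckets right after them. The talk does not specify the hash function, the bucket count, or the edge encoding, so everything named below is an assumption made for illustration.

```python
import hashlib

TAGS = ["book", "title", "price"]                 # hypothetical tag vocabulary
TAG_CODE = {tag: i for i, tag in enumerate(TAGS)}  # D_t = {0, ..., len(TAGS) - 1}
VALUE_BUCKETS = 16                                 # size of D_v, chosen arbitrarily

def value_code(value: str) -> int:
    """Hash a text value into D_v = {len(TAGS), ..., len(TAGS) + VALUE_BUCKETS - 1}."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return len(TAGS) + int.from_bytes(digest[:8], "big") % VALUE_BUCKETS

def edge_weight(parent_code: int, child_code: int) -> int:
    """Map each (parent, child) code pair to its own positive integer."""
    size = len(TAGS) + VALUE_BUCKETS
    return parent_code * size + child_code + 1

title_edge = edge_weight(TAG_CODE["book"], TAG_CODE["title"])
value_edge = edge_weight(TAG_CODE["title"], value_code("TCP/IP Illustrated"))
print(title_edge, value_edge)
```

Different values can land in the same bucket, so a value edge only narrows the candidate set; the refinement step still has to compare the actual text.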

Performance Evaluation: Implementation-independent Metrics

[Figure: average values for different metrics (avg. sel, avg. pp, avg. fpr), on a scale from 0 to 1, for the data sets TCMD, DBLP, XMark, and Treebank.]

Implementation-independent metrics:

$$\mathit{sel} = 1 - \mathit{rst}/\mathit{ent}, \qquad \mathit{pp} = 1 - \mathit{cdt}/\mathit{ent}, \qquad \mathit{fpr} = 1 - \mathit{rst}/\mathit{cdt}$$
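A small numeric example shows how the three metrics relate. The slide does not spell out the abbreviations; this sketch reads `ent` as the number of index entries, `cdt` as the number of candidates returned by the index, and `rst` as the number of true results, which is an assumption. Returning 0.0 for `fpr` when there are no candidates is also a choice made here, and the counts are invented.

```python
def fix_metrics(ent: int, cdt: int, rst: int) -> dict:
    """Selectivity, pruning power, and false-positive ratio from three counts."""
    return {
        "sel": 1 - rst / ent,
        "pp": 1 - cdt / ent,
        # No candidates means nothing can be a false positive (assumed convention).
        "fpr": 1 - rst / cdt if cdt else 0.0,
    }

metrics = fix_metrics(ent=1000, cdt=50, rst=40)
print(metrics)
```

With these counts the index prunes 95% of the entries, the query itself excludes 96% of them, and one in five candidates is a false positive. Under this reading `pp` can never exceed `sel`, because every true result must also be a candidate.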

Performance Evaluation: Runtime

[Figure: Runtime Performance on XMark. Runtime in log scale (msec.), from 1 to 10000, for the test queries Xmark_hi_sp, Xmark_lo_sp, Xmark_hi_bp, and Xmark_lo_bp; series: NoK, FIX unclustered, F&B, FIX clustered.]

FIX improves performance significantly on structure-rich data sets.

Performance Evaluation: Runtime (cont.)

[Figure: Runtime Performance on DBLP. Runtime in log scale (msec.), from 0.1 to 100000, for the test queries DBLP_hi_sp, DBLP_lo_sp, DBLP_hi_bp, and DBLP_lo_bp; series: NoK, FIX unclustered, F&B, FIX clustered.]

FIX does not perform as well on simply structured data sets.

Performance Evaluation: Runtime (cont.)

[Figure: Runtime Performance on DBLP with values. Runtime in log scale (msec.), from 1 to 10000, for the queries DBLP_hi_bp and DBLP_lo_bp; series: F&B, FIX clustered.]

But when values are considered, FIX performs better.

Conclusion and Future Work

Summary:

• We identify three features for pruning subtrees during query processing.
  • Easy to calculate.
  • Pruning uses simple numeric comparisons.

• A unified structure and value index (FIX) can be built based on these features to improve query performance significantly.

• Query evaluation based on FIX is simple and relies on well-studied techniques.

Future Work:

• Try an R-tree or another high-dimensional index instead of a B+ tree.

• Support a wider range of queries.

• Find more features!

Thank you!
