From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching

Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen

National University of Singapore

2

Outline
Background
  Define our problem: XML twig pattern matching
  Previous work and problems
Our new twig matching algorithms
  A new labeling scheme: extended Dewey
  A new holistic algorithm: TJFast
Experimental results
Conclusion

3

XML basics
XML is short for Extensible Markup Language.
It is a language for defining the syntax and semantics of structured data.
An XML document is commonly modeled as a rooted, ordered and tagged tree.

[Figure: an example XML tree rooted at book, with preface, chapter, section, title and paragraph elements and text values such as "XML", "Data" and "Intro".]

4

Querying XML Data
Major standards for querying XML data: XPath and XQuery.
XML twig pattern matching is a core operation in XPath and XQuery.
Definition of XML twig pattern: an XML twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either parent-child edges or ancestor-descendant edges.

5

An XML twig pattern example
Create a flat list of all the title-author pairs for every book in the bibliography.

XQuery:

<results>
{
  for $b in doc("bib.xml")/bib//book,
      $t in $b/title,
      $a in $b/author
  return
    <result> { $t } { $a } </result>
}
</results>

To answer the XQuery, we first need to match the following XML twig pattern:

bib
  // $b: book          (ancestor-descendant relationship)
       / $t: title     (parent-child relationship)
       / $a: author    (parent-child relationship)

6

Our research problem

Problem statement: Given an XML twig pattern Q and an XML database D, we need to find ALL the matches of Q on D.

E.g., consider the following twig pattern and document:

[Figure: an XML tree with elements s1, s2, p1, t1, t2 and f1.]
Twig pattern: Section with two branches, Title and Figure.
Query answers: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
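To make the problem statement concrete, here is a brute-force sketch in Python. It is not one of the algorithms discussed in this talk. The tree shape, the tag of p1 and the choice of ancestor-descendant edges for both branches are assumptions: the figure is not reproduced here, so the code uses one shape that is consistent with the three answers listed above.

```python
# Hypothetical document tree (parent -> children). The exact shape of the
# slide's figure is assumed; this one is consistent with the listed answers.
CHILDREN = {
    "s1": ["t1", "s2"],
    "s2": ["t2", "p1"],
    "p1": ["f1"],
}
# Tags per element; "Paragraph" for p1 is an assumption.
TAG = {"s1": "Section", "s2": "Section", "p1": "Paragraph",
       "t1": "Title", "t2": "Title", "f1": "Figure"}


def descendants(node):
    """Yield every proper descendant of node in document order."""
    for child in CHILDREN.get(node, []):
        yield child
        yield from descendants(child)


def match_twig():
    """Brute-force matches of Section with a Title and a Figure below it
    (both edges assumed to be ancestor-descendant)."""
    answers = []
    for s in TAG:
        if TAG[s] != "Section":
            continue
        below = list(descendants(s))
        titles = [n for n in below if TAG[n] == "Title"]
        figures = [n for n in below if TAG[n] == "Figure"]
        answers += [(s, t, f) for t in titles for f in figures]
    return answers


print(match_twig())
```

Enumerating every combination like this touches all elements of the document, which is exactly the cost that the join algorithms in the rest of the talk try to avoid.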

10

Related work
TreeMerge and Stack-tree [Al-Khalifa ICDE 2002]
  A stack-based binary join algorithm
  But it produces large intermediate results

TwigStack [Bruno SIGMOD 2002]
  A holistic twig join algorithm
  Sub-optimal for queries with parent-child relationships

TwigStackList [Lu CIKM 2004]
  A new holistic twig join algorithm, which produces fewer useless intermediate results than TwigStack does for queries with parent-child relationships

11

Our research goal
In this research, we want to design a new holistic twig join algorithm which is more efficient than previous work.

Two aspects to achieve this goal:
(1) Input: reduce the input I/O cost
(2) Output: reduce the size of intermediate results

13

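Original Dewey Labeling Scheme
In the Dewey labeling scheme, each element is represented by a vector:
(i) the root is labeled by the empty string ε;
(ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.

For example:
[Figure: a tree with elements s1, s2, t1, t2, f1 and f2. The root is labeled ε, its three children are labeled 1, 2 and 3, and the two children of element 2 are labeled 2.1 and 2.2.]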
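The two labeling rules can be sketched in a few lines of Python. The function names and the assignment of element names to tree positions are assumptions made for illustration; only the label values (ε, 1, 2, 3, 2.1, 2.2) come from the slide.

```python
def dewey_labels(tree, label=()):
    """Map each element name to its Dewey label.

    tree is a (name, children) tuple; the root gets the empty label and the
    x-th child of s gets label(s) + (x,).
    """
    name, children = tree
    labels = {name: label}
    for position, child in enumerate(children, start=1):
        labels.update(dewey_labels(child, label + (position,)))
    return labels


def is_ancestor(a, b):
    """a is an ancestor of b iff a's label is a proper prefix of b's label."""
    return len(a) < len(b) and b[:len(a)] == a


def fmt(label):
    """Render a label in dotted form, with the empty label shown as ε."""
    return ".".join(map(str, label)) or "ε"


# Hypothetical tree: a root with three children, the second having two children.
doc = ("s1", [("t1", []), ("s2", [("t2", []), ("f1", [])]), ("f2", [])])
labels = dewey_labels(doc)
for name, label in labels.items():
    print(name, fmt(label))
```

Ancestor-descendant tests reduce to a prefix check on the labels, but the label by itself says nothing about the tag names along the path.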

14

Main problem of the original Dewey
If we use the original Dewey labeling scheme to answer a twig query, we need to read labels for all query nodes. Thus, we have no performance benefit compared to previous methods.

Our idea: Extend the original Dewey labeling scheme so that given the label of any element e, we can know the path of e from this label alone.

15

Modulo function
We need to know some schema information: a DTD (Document Type Definition) or an XML Schema.
Given the DTD rule: book → author, title, chapter*
Our solution: using a modulo function, we create a mapping between an element tag and an integer. We define

$X_{author} \bmod 3 = 0$, $X_{title} \bmod 3 = 1$, $X_{chapter} \bmod 3 = 2$,

where $X_t$ is the last component of the label of an element with tag t.

[Figure: book (ε) with children author (0), title (1), chapter (2) and chapter (5).]

Why is the second chapter labeled 5, and not 3 as in the original Dewey scheme?
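A minimal sketch of how such components could be assigned. It assumes the rule "take the smallest integer that is larger than the left sibling's component and has the required remainder"; the slide only states the remainders, so this rule and the names `CHILD_TAGS` and `assign_components` are illustrative assumptions.

```python
# Child-tag order taken from the DTD rule on the slide:
# book -> author, title, chapter*
CHILD_TAGS = {"book": ["author", "title", "chapter"]}


def assign_components(parent_tag, child_tags_in_order):
    """Return the last label component for each child, left to right.

    Each child gets the smallest integer x greater than its left sibling's
    component such that x mod n == k, where n is the number of child tags of
    the parent and k is the index of the child's tag (assumed rule).
    """
    tags = CHILD_TAGS[parent_tag]
    n = len(tags)
    components, prev = [], -1
    for tag in child_tags_in_order:
        k = tags.index(tag)
        x = prev + 1 + (k - (prev + 1)) % n
        components.append(x)
        prev = x
    return components


print(assign_components("book", ["author", "title", "chapter", "chapter"]))
```

Under this rule the second chapter receives 5 rather than 3, because 3 mod 3 = 0 would decode as author.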

16

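Derive element tag
From a label, we can derive its tag name.
DTD rule: book → author, title, chapter*
Recall that we define $X_{author} \bmod 3 = 0$, $X_{title} \bmod 3 = 1$ and $X_{chapter} \bmod 3 = 2$.

[Figure: book (ε) with four children labeled 0, 1, 2 and 5, whose tags are to be derived.]
0 → author, 1 → title, 2 → chapter, 5 → chapter (since 5 mod 3 = 2)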

17

Derive the path from a label
By following a finite state transducer (FST), we can recursively derive the whole path from any extended Dewey label. For example:

DTD:
book → author, title, chapter*
chapter → (paragraph | section)*
section → (paragraph | section)*

FST (state, condition on the label component → next state):
book, mod 3 = 0 → author
book, mod 3 = 1 → title
book, mod 3 = 2 → chapter
chapter, mod 2 = 0 → paragraph
chapter, mod 2 = 1 → section
section, mod 2 = 0 → paragraph
section, mod 2 = 1 → section

[Figure: a document with elements book, author, title, chapter, section and paragraph.]

Question: Given a label 5.1.0 for an element, what is the corresponding path?

18

Derive the path from a label (answer)
Following the FST from the state book, one label component at a time:
5 mod 3 = 2, so book → chapter
1 mod 2 = 1, so chapter → section
0 mod 2 = 0, so section → paragraph

Hence 5.1.0 denotes: book/chapter/section/paragraph
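The walk through the transducer can be sketched as a table lookup. The table below is read off the DTD of this example, with the position of a tag inside its rule taken as its remainder; the names `CHILD_TAGS` and `derive_path` are hypothetical.

```python
# Transitions read off the DTD on the slide; the tag order inside each rule
# fixes which remainder maps to which child tag.
CHILD_TAGS = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def derive_path(label, root_tag="book"):
    """Walk the transducer one label component at a time and return the path."""
    path = [root_tag]
    if label:  # the root itself has the empty label
        for component in label.split("."):
            tags = CHILD_TAGS[path[-1]]
            path.append(tags[int(component) % len(tags)])
    return "/".join(path)


print(derive_path("5.1.0"))  # book/chapter/section/paragraph
```

Each component costs one modulo operation and one lookup, so decoding a label takes time linear in its length.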

19

Two properties of extended Dewey
Find Ancestor Label: from the label of any element, we can derive the labels of all its ancestors.
Find Ancestor Name: from the label of any element, we can derive the tag names of all its ancestors.

These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
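The first property follows directly from the dotted form of the label: every proper prefix is the label of an ancestor. A small sketch, with the hypothetical helper name `ancestor_labels` and the root written as the empty string:

```python
def ancestor_labels(label):
    """Return the labels of all ancestors, root first.

    The root carries the empty label, written here as the empty string.
    """
    components = label.split(".") if label else []
    return [".".join(components[:i]) for i in range(len(components))]


print(ancestor_labels("5.1.0"))  # ['', '5', '5.1']
```

The second property is obtained by decoding each of these prefixes with the finite state transducer.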

20

Outline
Background
Our new twig matching algorithms
  A new labeling scheme: extended Dewey
  A new holistic algorithm: TJFast (a Fast Twig Join algorithm)
Experiments
Conclusion

21

A new algorithm: TJFast

For each leaf node n in the query, there is a corresponding input stream Tn.

Tn contains the extended Dewey labels of the elements with tag n. The labels are arranged in document order.

For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements that may be involved in query answers. (What is the difference compared to TwigStack?)

At any point during the computation, the size of set Sb is bounded by the depth of the XML document.

22

A new algorithm: TJFast
Two-phase algorithm:

Phase 1: parts of the intermediate root-leaf paths are output.
  Insert elements that may be involved in query answers into the sets.
  Output intermediate paths according to the elements in the sets.

Phase 2: the intermediate paths are merge-joined to get the final results.

23

An example for TJFast algorithm

Query: A with two branches, A//D and A/B//C.

DTD:
a → a*, d*, b*
b → d*, c*
d → c*

Document (element: extended Dewey label):
Root: ε
a1: 0
a2: 0.0
d1: 0.0.1
a3: 0.3
d2: 0.3.1
b1: 0.3.2
c1: 0.3.2.1
b2: 0.5
d3: 0.5.0
c2: 0.5.0.0

Input streams:
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0

Set for the branching node A: { }

Why do we not need TA, TB streams?

24


Derive the paths of the first labels in TD and TC, using the finite state transducer of the extended Dewey labeling scheme:
0.0.1 derives a1/a2/d1
0.3.2.1 derives a1/a3/b1/c1
Set for A: { }

25


Both a1 and a3 may be involved in query answers. (Why not a2?)
Set for A: { }

26


Then we insert a1 into the set, since a1 is an ancestor of a3.

27


Move the cursor of TD from d1 to d2 and output one path solution <a1, d1>.
Set for A: {a1}

28


We insert a3 into the set, since a3 is definitely involved in query answers.
0.3.1 derives a1/a3/d2
Set for A: {a1, a3}

29


Move the cursor of stream TD from d2 to d3 and output <a1, d2> and <a3, d2>.
Set for A: {a1, a3}

30


Move the cursor of stream TC from c1 to c2 and output the path <a3, b1, c1>.
Set for A: {a1, a3}

31


Move the cursor of TD to the end and output the path solution <a1, d3>.
Set for A: {a1, a3}

32


Move the cursor of TC to the end and output <a1, b2, c2>.
Set for A: {a1, a3}

33


Now all five elements have been scanned. In the second phase, we merge-join all output path solutions.

34

An example for TJFast algorithm

Query: A with two branches, A//D and A/B//C.
Document elements: a1, a2, a3, b1, b2, c1, c2, d1, d2, d3

Phase 1. Intermediate paths
A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
A/B//C: <a1, b2, c2>, <a3, b1, c1>

Phase 2. Final solutions (join of the intermediate paths), as <A, D, B, C>:
<a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
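The two phases of this example can be imitated with a much simplified Python sketch. It is not TJFast itself: it keeps no set for the branching node A, so Phase 1 also emits <a2, d1>, which TJFast avoids and which the join then discards. The virtual `root` state, the `NAMES` table (the element name behind each label, inferred from the DTD and the derived paths) and all function names are assumptions for illustration.

```python
from collections import defaultdict

# Tag order assumed from the example DTD: a -> a*, d*, b*; b -> d*, c*; d -> c*.
# "root" is a hypothetical virtual parent of the document element a1 (label 0).
CHILD_TAGS = {"root": ["a"], "a": ["a", "d", "b"], "b": ["d", "c"], "d": ["c"]}

# Element name assumed for each label, used for readable output only.
NAMES = {"0": "a1", "0.0": "a2", "0.3": "a3", "0.3.2": "b1", "0.5": "b2",
         "0.0.1": "d1", "0.3.1": "d2", "0.5.0": "d3",
         "0.3.2.1": "c1", "0.5.0.0": "c2"}


def derive_path(label):
    """Decode a label into [(tag, label_of_that_element), ...], root first."""
    comps = label.split(".")
    tag, path = "root", []
    for i, comp in enumerate(comps):
        tags = CHILD_TAGS[tag]
        tag = tags[int(comp) % len(tags)]
        path.append((tag, ".".join(comps[:i + 1])))
    return path


def embeddings(query_path, elem_path):
    """All mappings of query_path onto elem_path with the leaf on the last element.

    query_path is [(tag, axis), ...]; axis is the edge to the parent query node.
    """
    found = []

    def extend(qi, prev, chosen):
        if qi == len(query_path):
            if prev == len(elem_path) - 1:
                found.append(tuple(chosen))
            return
        tag, axis = query_path[qi]
        # A parent-child edge only allows the position right after the parent.
        stop = prev + 2 if (axis == "/" and qi > 0) else len(elem_path)
        for pos in range(prev + 1, min(stop, len(elem_path))):
            if elem_path[pos][0] == tag:
                extend(qi + 1, pos, chosen + [elem_path[pos][1]])

    extend(0, -1, [])
    return found


# Phase 1: scan only the leaf streams TD and TC and emit path solutions.
T_D = ["0.0.1", "0.3.1", "0.5.0"]
T_C = ["0.3.2.1", "0.5.0.0"]
paths_ad = [m for lab in T_D
            for m in embeddings([("a", "//"), ("d", "//")], derive_path(lab))]
paths_abc = [m for lab in T_C
             for m in embeddings([("a", "//"), ("b", "/"), ("c", "//")],
                                 derive_path(lab))]

# Phase 2: join the two path lists on the branching node A.
by_a = defaultdict(list)
for a, b, c in paths_abc:
    by_a[a].append((b, c))
final = [(a, d, b, c) for a, d in paths_ad for b, c in by_a.get(a, [])]

for row in final:
    print("<" + ", ".join(NAMES[x] for x in row) + ">")
```

Only the D and C streams are read; the A and B elements are recovered from the leaf labels, which is the point of the question about the TA and TB streams.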

36

Experiments

Benchmarks
XMark: synthetic data
DBLP: real data from the DBLP database
Treebank: real data from the Wall Street Journal

                  XMark   DBLP    Treebank
Data size (MB)    582     130     82
Nodes (million)   8       3.3     2.4
Max/Avg depth     12/5    6/2.9   36/7.8

37

Path query

We compared PathStack [1] and TJFast on the following four path queries on XMark data.

       Path Queries
PQ1    /site/closed_auctions/closed_auction/price
PQ2    /site/regions//item/location
PQ3    /site/people/person/gender
PQ4    /site/open_auctions/open_auction/reserve

38

Experiments: Number of elements read and input file size for path queries

[Figure: two bar charts comparing PathStack and TJFast on Q1 to Q4: disk files read (KBytes) and number of elements read.]

Observation: TJFast scans fewer elements than PathStack does.

Explanation: TJFast only scans labels for the leaf nodes of a query, whereas PathStack scans labels for all nodes in the query.

39

Experiments: Execution time for path queries

[Figure: bar chart of execution time (seconds) for Q1 to Q4, PathStack vs. TJFast.]

Observation: TJFast performs better than PathStack on all four path queries.

Explanation: TJFast reduces I/O cost by reading fewer elements.

40

Twig queries

We compared TwigStack, TwigStackList and TJFast on the following five twig queries on DBLP and Treebank data.

       Source      Twig Queries
TQ1    DBLP        //proceedings//title[.//i]//sup
TQ2    DBLP        //article[.//sup]//title//sub
TQ3    Treebank    /S[.//VP/IN]//NP
TQ4    Treebank    /S/VP/PP[IN]/NP/VBN
TQ5    Treebank    //VP[DT]//PRP_DOLLAR_

41

Experiments: Number of elements read and input file size for twig queries

[Figure: two bar charts comparing TwigStack, TwigStackList and TJFast on Q1 and Q2: disk files size (KBytes) and number of elements read.]

Observation: TJFast scans far fewer elements than TwigStack and TwigStackList do in the two twig queries.

Explanation: TJFast only scans elements for the leaf nodes of a query, whereas TwigStack and TwigStackList need to scan elements for all nodes. The number of elements for non-leaf nodes is much larger than that for leaf nodes.

42

Experiments: Execution time for twig queries

[Figure: bar chart of execution time (seconds) for Q1 and Q2, with bars for TW-SS, TwigStack, TwigStackList, TJ-SS and TJFast.]

TW-SS and TJ-SS denote the sequential scan time of the input data for TwigStack/TwigStackList and TJFast, respectively.

Observation: For DBLP data, TJFast performs much better than TwigStack and TwigStackList.

Explanation: TJFast reduces I/O cost by reading fewer elements.

44

Conclusions
Efficient processing of twig queries is a core operation in XPath and XQuery.
We have proposed a new labeling scheme, extended Dewey, and a new holistic twig pattern matching algorithm, TJFast.
Compared to previous work:
  TJFast reduces the input I/O cost.
  TJFast reduces the output I/O cost for intermediate results.

45

Reference
[1] S. Al-Khalifa, H. V. Jagadish, J. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In ICDE, pages 141-152, 2002. Proposes the StackTree algorithm.
[2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002. Proposes the TwigStack algorithm.
[3] T. Chen, J. Lu, and T. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In SIGMOD, 2005. Proposes two new data streaming techniques.
[4] Y. Chen, S. B. Davidson, and Y. Zheng. BLAS: An efficient XPath processing system. In Proc. of SIGMOD, pages 47-58, 2004. Proposes a new algorithm for XPath queries.

46

Reference
[5] H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed XML documents. In VLDB, 2003. Proposes the TSGeneric algorithm.
[6] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004. Proposes the TwigStackList algorithm.
[7] P. Rao and B. Moon. PRIX: Indexing and querying XML using Prüfer sequences. In ICDE, pages 288-300, 2004. Proposes the PRIX system.
[8] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: A dynamic index method for querying XML data by tree structures. In SIGMOD, 2003. Proposes the ViST system.
[9] B. Yang, M. Fontoura, E. J. Shekita, S. Rajagopalan, and K. S. Beyer. Virtual cursors for XML joins. In CIKM, pages 523-532, 2004. Proposes the virtual cursor algorithm.

47

END

Thank you!

Q & A

48

Related work
Comparison between Virtual Cursor (VC) [Yang CIKM 2004] and our work:
The two were developed independently.
TJFast uses a finite state transducer (FST); VC uses a path table.
The size of the path table depends on the number of distinct paths, whereas that of the FST depends on the number of distinct element types.
TJFast reduces the number of useless intermediate paths for queries with parent-child edges, but VC does not have this property.

49

Backup

[Figure: a query with nodes a, b, c, d and e, and a document with elements a1, a2, b1, c1, c2, d1, e1, f1 and f2.]

TwigStackList outputs <a1, b1>, but TJFast does not output this path solution.

50

Labels size
                        XMark   DBLP   TreeBank
Region encoding (MB)    71.9    21.6   23.3
Original Dewey (MB)     56.2    18.1   22.8
Extended Dewey (MB)     72.6    19.5   28.7

51

Optimal query classes
If an algorithm does not output any useless intermediate results for a query Q on any given document, we say that the algorithm is optimal for Q.

If an algorithm has a larger optimal query class, it is better able to control the size of intermediate results.

52

Optimal class of TJFast and TwigStack

Optimal query class of TwigStack: all edges are ancestor-descendant relationships.
Optimal query class of TJFast: all edges connecting branching nodes and their children are ancestor-descendant relationships.

[Figure: three example twig queries over nodes a, b, c and d.]

Even for non-optimal queries, TJFast usually outputs fewer useless intermediate paths than TwigStack does.

53

Update of XML documents
To support updates of XML documents, we need to slightly modify the extended Dewey labeling scheme.
Our idea comes from ORDPATH*.
We can avoid relabeling the document under any update.

* P. O'Neil, E. O'Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATHs: Insert-friendly XML node labels. In SIGMOD, pages 903-908, 2004.

54

More examples for assigning labels
Let us consider a more complicated DTD rule: a → (b | c)*, d?, c+
We define $X_b \bmod 3 = 0$, $X_c \bmod 3 = 1$ and $X_d \bmod 3 = 2$.
(Why do we use mod 3 instead of 4?)

[Figure: a (ε) with children b (0), d (2), c (4) and c (7).]

55

Computing cost of FST
The CPU time of the FST is linear in the length of an extended Dewey label, and independent of the complexity of the schema definition.

The main memory size of the FST is quadratic in the number of distinct element names in the XML documents, since the number of transitions in the FST is quadratic in the worst case.
