Covering Indexes for Branching Path Queries


Page 1: Covering Indexes for Branching Path Queries

Covering Indexes for Branching Path Queries

Raghav Kaushik, Philip Bohannon, Jeffrey F Naughton and Henry F Korth

Abdullah Mueen

Page 2: Covering Indexes for Branching Path Queries

XML as Graph Data

[Figure: an example XML document drawn as a graph, with annotations pointing at a node's oid and at label(3)]

• Non-tree edges model IDREF relationships in the document.

• Leaf nodes with attributes are suppressed.
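As an illustration only, the graph view of an XML document could be held in memory roughly as follows. The class name `DataGraph`, its fields, and the toy document are assumptions made for this sketch and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DataGraph:
    """Hypothetical container for an XML document viewed as a graph."""
    label: dict = field(default_factory=dict)      # oid -> tag name
    tree_edges: set = field(default_factory=set)   # (parent oid, child oid)
    idref_edges: set = field(default_factory=set)  # (source oid, target oid)

    def add_node(self, oid, tag, parent=None):
        self.label[oid] = tag
        if parent is not None:
            self.tree_edges.add((parent, oid))

    def add_idref(self, src, dst):
        # Non-tree edge standing in for an IDREF relationship.
        self.idref_edges.add((src, dst))

    def children(self, oid, follow_idref=False):
        edges = self.tree_edges | self.idref_edges if follow_idref else self.tree_edges
        return sorted(v for (u, v) in edges if u == oid)


# Toy document (assumed, not the one in the figure).
g = DataGraph()
g.add_node(0, "ROOT")
g.add_node(1, "metro", parent=0)
g.add_node(2, "neighborhood", parent=1)
g.add_node(3, "museum", parent=1)
g.add_idref(2, 3)  # neighborhood references museum
```

Keeping tree edges and IDREF edges in separate sets makes it easy to decide, per traversal, whether the references should be followed.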

Page 3: Covering Indexes for Branching Path Queries

Branching Path Expression

ROOT/metro/neighborhoods/neighborhood[/business=>cinema-hall]/cultural=>museum

Page 4: Covering Indexes for Branching Path Queries

Example (1)

//hotel[/star][<=business\neighborhood[/cultural=>museum[\art]]]

Page 5: Covering Indexes for Branching Path Queries

Covering Index

• A covering index can answer any query from a given set of queries without consulting the original document.

• The goal of this paper is to find a covering index for branching path queries.

Page 6: Covering Indexes for Branching Path Queries

k-bisimilarity

[Figure: data graph G with root 0 (label R) and nodes 1 to 9 carrying the labels A, B, C and D]

Two nodes u and v are called k-bisimilar (u ≈k v) if

1. label(u) = label(v), and

2. every incoming label path of length ≤ k to u matches at least one incoming label path of length ≤ k to v, and vice versa.

In the example graph, 2 and 4 are 0-bisimilar, 5 and 7 are 1-bisimilar, 6 and 8 are 1-bisimilar, and 8 and 9 are 2-bisimilar.

≈k is an equivalence relation on the set of nodes in G, so its equivalence classes partition the nodes.

The algorithm for computing k-bisimulation will be shown later
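One common way to compute such a partition is iterative refinement: start by grouping nodes by label, then for k rounds split every group whose members see different sets of parent groups. The sketch below follows that idea; the function name, the input format, and the toy graph are assumptions for illustration and are not the graph from the figure.

```python
def k_bisimulation(labels, parents, k):
    """Return a dict node -> block id after at most k refinement rounds.

    labels:  dict node -> label
    parents: dict node -> list of parent nodes (incoming edges)
    """
    # Round 0: nodes with the same label share a block.
    ids = {}
    block = {u: ids.setdefault(lab, len(ids)) for u, lab in labels.items()}
    for _ in range(k):
        ids, refined = {}, {}
        for u in labels:
            # Signature: own block plus the set of blocks of the parents.
            sig = (block[u], frozenset(block[p] for p in parents.get(u, ())))
            refined[u] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            break  # no block was split, so the partition is already stable
        block = refined
    return block


# Toy graph (assumed): R -> A -> C -> D and R -> B -> C -> D.
labels = {0: "R", 1: "A", 2: "B", 3: "C", 4: "C", 5: "D", 6: "D"}
parents = {1: [0], 2: [0], 3: [1], 4: [2], 5: [3], 6: [4]}

b0 = k_bisimulation(labels, parents, 0)
b1 = k_bisimulation(labels, parents, 1)
b2 = k_bisimulation(labels, parents, 2)
```

In this toy graph the two D nodes stay together for k = 1, because both have a C parent, and separate at k = 2, where the incoming paths A/C/D and B/C/D differ.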

Page 7: Covering Indexes for Branching Path Queries

1-index: Covering Index for Simple Path Expressions

[Figure: the data graph G and the index graphs A(0), A(1), A(2) and A(3) obtained by successive refinement; each refinement step is labeled SuccStable. Extents of the index nodes, root R omitted:]

• A(0): {1}, {2,4}, {3,5,7}, {6,8,9}

• A(1): {1}, {2}, {3}, {4}, {5,7}, {6,8,9}

• A(2): {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8,9}

• A(3) = 1-index: {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}

Page 8: Covering Indexes for Branching Path Queries

Inverse edges

[Figure: two copies of the data graph G, the first captioned "5,7 are not 1-bisimilar" and the second "5,7 are 1-bisimilar"]

Page 9: Covering Indexes for Branching Path Queries

The F&B index

• Repeat until the partition stops changing:

– Reverse all edges

– Compute the forward bisimilarity partition

– Reverse all edges again

– Compute the backward bisimilarity partition
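The loop above, read as "repeat until the partition stops changing", can be sketched as follows. Instead of physically reversing the edges, this version refines along children for the forward step and along parents for the backward step; that reading of "forward" and "backward", the function names, and the toy graph are assumptions of this sketch.

```python
def refine_until_stable(block, nodes, neighbours):
    """Split blocks until all nodes in a block see the same set of neighbour blocks."""
    while True:
        ids, refined = {}, {}
        for u in nodes:
            sig = (block[u], frozenset(block[v] for v in neighbours.get(u, ())))
            refined[u] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            return block  # nothing was split in this round
        block = refined


def fb_partition(labels, edges):
    """Alternate forward and backward refinement until neither changes the partition."""
    nodes = list(labels)
    parents, children = {}, {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
        parents.setdefault(v, []).append(u)
    ids = {}
    block = {u: ids.setdefault(labels[u], len(ids)) for u in nodes}
    while True:
        before = len(set(block.values()))
        block = refine_until_stable(block, nodes, children)  # forward: outgoing paths
        block = refine_until_stable(block, nodes, parents)   # backward: incoming paths
        if len(set(block.values())) == before:
            return block


# Toy graph (assumed): three B nodes under the root; two have a C child, one a D child.
labels = {0: "R", 1: "B", 2: "B", 3: "B", 4: "C", 5: "C", 6: "D"}
edges = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 5), (3, 6)]
fb = fb_partition(labels, edges)
```

All three B nodes have the same incoming paths, so a backward-only index would merge them; the forward step separates node 3 because its child is a D rather than a C.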

Page 10: Covering Indexes for Branching Path Queries

Forward Bisimulation

[Figure: four snapshots of the data graph G illustrating the forward bisimulation computation]

Page 11: Covering Indexes for Branching Path Queries

Backward Bisimulation

[Figure: three snapshots of the data graph G illustrating the backward bisimulation computation]

Page 12: Covering Indexes for Branching Path Queries

Properties of F&B index

• The F&B index over a data graph G covers all branching path expressions.

• The F&B index is the smallest index that covers all branching path queries.

• In practice, the F&B index is large for most real documents.

Page 13: Covering Indexes for Branching Path Queries

1. Tags to be indexed

• Some tags are never used in queries, for example bold and emph.

• We specify the set of tags to be indexed.

• In a 100MB document, the F&B index on all tags has 436,000 nodes, while the index that ignores formatting tags has 18,000 nodes.

Page 14: Covering Indexes for Branching Path Queries

2. IDREF edges to be indexed

• The // operation does not follow IDREF edges.

• IDREF edges are written explicitly in the path expression with the => operator.

• We specify the set of IDREF edges to be indexed.

• For the 100MB document, the index has 1.3 million nodes with all IDREF edges, while it has 18,000 nodes without any IDREF edges and formatting tags.

Page 15: Covering Indexes for Branching Path Queries

3. Exploiting Local Similarity

• Long queries are infrequent and rarely of interest.

• If we restrict the length of the possible queries, we can get a much smaller index than the F&B index.

• We specify the length of the local path by using k-bisimilarity instead of bisimilarity while computing the F&B index.

Page 16: Covering Indexes for Branching Path Queries

4. Restricting Tree Depth

• Long nested conditions are less likely to occur.

• We specify the maximum depth of the conditional path expression by tree-depth (defined next).

Page 17: Covering Indexes for Branching Path Queries

tree depth

//museums/history/museum[/featured and <=cultural\neighborhood[/cultural=>museum[\art]]]

Page 18: Covering Indexes for Branching Path Queries

Definition of an Index

• A set of tags T

• Sets of IDREF edges to be followed in each direction, reffwd and refbwd

• Two parameters, kbwd and kfwd, to restrict the length of the path queries

• One parameter, td, to restrict the depth of nested conditional expressions

Page 19: Covering Indexes for Branching Path Queries

The BPCI index

• Remove all nodes whose tags are not in T, provided the removal does not cut off a node whose tag is in T.

• Start with the label grouping as the current partition P

• Set i = 0 and repeat while i ≤ td:

– Reverse all edges in G, retaining only the IDREF edges in reffwd.

– P ← forward kfwd-bisimilar partition of P; increment i.

– Reverse all edges in G again, retaining only the IDREF edges in refbwd.

– P ← backward kbwd-bisimilar partition of P; increment i.
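A rough sketch of this construction is given below, under several assumptions that the slide does not settle: the graph passed in has already been pruned to the tags in T, reffwd and refbwd are given as sets of (source, target) edges, a "k-bisimilar partition of P" means at most k refinement rounds starting from P, and the loop counter is checked once per forward/backward pair. All function and parameter names are hypothetical.

```python
def refine_k(block, nodes, neighbours, k):
    """Apply at most k refinement rounds along the given neighbour relation."""
    for _ in range(k):
        ids, refined = {}, {}
        for u in nodes:
            sig = (block[u], frozenset(block[v] for v in neighbours.get(u, ())))
            refined[u] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            break  # already stable
        block = refined
    return block


def adjacency(edges, reverse=False):
    adj = {}
    for u, v in edges:
        if reverse:
            u, v = v, u
        adj.setdefault(u, []).append(v)
    return adj


def bpci_partition(labels, tree_edges, ref_fwd, ref_bwd, k_fwd, k_bwd, td):
    # Assumption: nodes with tags outside T were removed before this call.
    nodes = list(labels)
    children = adjacency(set(tree_edges) | set(ref_fwd))               # used by forward steps
    parents = adjacency(set(tree_edges) | set(ref_bwd), reverse=True)  # used by backward steps
    ids = {}
    block = {u: ids.setdefault(labels[u], len(ids)) for u in nodes}    # label grouping
    i = 0
    while i <= td:
        block = refine_k(block, nodes, children, k_fwd)
        i += 1
        block = refine_k(block, nodes, parents, k_bwd)
        i += 1
    return block


# Toy graph (assumed): three B nodes under the root; two have a C child, one a D child.
labels = {0: "R", 1: "B", 2: "B", 3: "B", 4: "C", 5: "C", 6: "D"}
tree = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 5), (3, 6)]

coarse = bpci_partition(labels, tree, set(), set(), k_fwd=0, k_bwd=1, td=0)
finer = bpci_partition(labels, tree, set(), set(), k_fwd=1, k_bwd=1, td=0)
with_ref = bpci_partition(labels, tree, {(1, 6)}, set(), k_fwd=1, k_bwd=1, td=0)
```

With k_fwd = 0 the three B nodes stay in one block; with k_fwd = 1 the node whose child is a D splits off; adding the hypothetical IDREF edge (1, 6) to ref_fwd also separates nodes 1 and 2. This is the size-versus-coverage trade-off the parameters control.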

Page 20: Covering Indexes for Branching Path Queries

Variations of BPCI

Page 21: Covering Indexes for Branching Path Queries

Testing if an Index covers a Query

• Build the query graph

• Check if all tags and IDREF edges in the query are in T and in (refbwd ∪ reffwd)

• Check if the tree depth of the query is less than td of the index

• Check if all paths in the query with even tree depth have length < kbwd

• Check if all paths in the query with odd tree depth have length < kfwd
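The checks above translate fairly directly into code once a query has been summarized. The sketch below assumes the query graph has already been reduced to its tags, its IDREF edges, its tree depth, and a list of (tree depth, length) pairs for its paths; the `IndexDef` and `Query` classes are hypothetical, and the strict comparisons follow the slide as written.

```python
from dataclasses import dataclass


@dataclass
class IndexDef:
    tags: frozenset     # T
    ref_fwd: frozenset  # IDREF edges indexed in the forward direction
    ref_bwd: frozenset  # IDREF edges indexed in the backward direction
    k_fwd: int
    k_bwd: int
    td: int


@dataclass
class Query:
    tags: frozenset
    idrefs: frozenset
    tree_depth: int
    paths: tuple        # (tree depth of the path, path length) pairs


def covers(index, query):
    """Return True if the index definition passes every check for the query."""
    if not query.tags <= index.tags:
        return False
    if not query.idrefs <= (index.ref_fwd | index.ref_bwd):
        return False
    if not query.tree_depth < index.td:
        return False
    for depth, length in query.paths:
        # Even tree depth is bounded by k_bwd, odd tree depth by k_fwd.
        limit = index.k_bwd if depth % 2 == 0 else index.k_fwd
        if not length < limit:
            return False
    return True


idx = IndexDef(frozenset({"a", "b", "c"}), frozenset(), frozenset(), k_fwd=2, k_bwd=3, td=2)
ok = Query(frozenset({"a", "b"}), frozenset(), 1, ((0, 2), (1, 1)))
bad_tag = Query(frozenset({"a", "d"}), frozenset(), 1, ((0, 2),))
too_long = Query(frozenset({"a"}), frozenset(), 1, ((1, 2),))
```

Here `ok` passes every check, `bad_tag` fails because tag d is not indexed, and `too_long` fails because an odd-depth path of length 2 is not shorter than k_fwd = 2.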

Page 22: Covering Indexes for Branching Path Queries

Results on the XMark benchmark

1. Iall is the F&B index

2. Ialmost-all is the F&B index with kfwd = 1

3. Ispecific is built specifically for the query

Page 23: Covering Indexes for Branching Path Queries

Result

Page 24: Covering Indexes for Branching Path Queries

Conclusion

• BPCI is a covering index for branching path queries.

• By setting appropriate parameters, we can get a wide range of indexes suitable for various applications.

• Extensions

– Updating and bulk loading

– Integration with value indexes