Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz
Date posted: 19-Dec-2015

Graph-Based Synopses for Relational Selectivity Estimation

Joshua Spiegel and Neoklis Polyzotis
University of California, Santa Cruz

Motivation

Problem: determine the result cardinality of a complex relational query
- Query optimization: cost factors of candidate plans depend on query selectivity
- Data exploration: query selectivity provides timely feedback

Solution: approximate selectivity over data synopses

[Figure: evaluating Count(Q) over the relational database yields the exact selectivity but is expensive; evaluating Count(Q) over a database synopsis yields a selectivity estimate efficiently.]

Previous Work

Table-level synopses
- Examples: Histograms [Poosala+96], Sketches [Dobra+02], Wavelets [Chakrabarti+00], Table samples [Lipton+93]
- Weakness: do not summarize key values well

Schema-level synopses
- Examples: Join Synopses [Acharya+99], PRMs [Getoor+01]
- Weakness: restricted to certain types of schemata

[Figure: example schema diagram with node labels R, T, Z, W, SRZT, SRZW, SR, SZ, ST, SW.]

Synopsis Desiderata

- Schema level
- Capture key/foreign-key joins
- Applicable to general schemata and queries

Contributions

Tuple Graph Synopsis (TuG)
- Model: Semi-structured view of relational data
- Schema level summary
  - Schemata with many-to-many relationships
  - Complex join queries

TuG construction algorithm
- Basis: Tuple clustering
- Novel heuristics
- Builds on existing clustering techniques

Experimental study
- TuGs are effective synopses for small space budgets
- Better accuracy compared to previous techniques

Outline

- The TuG Synopsis
  - Synopsis model
  - Estimation framework
- TuG Construction
  - TuG Compression
  - Construction Algorithm
- Experimental Study
- Conclusions

TuG Model: Intuition #1

Relational database ↔ Data graph

Movies
  mid  year  genre
  1    2005  Action
  2    2004  Action
  3    2000  Drama

Cast
  mid  aid
  1    1
  1    2
  2    3
  3    3
  3    4

Actors
  aid  sex
  1    Male
  2    Female
  3    Male
  4    Male

[Figure: the corresponding data graph, with tuple nodes m1-m3 (Movies), c1-c5 (Cast), a1-a4 (Actors) and value nodes Action, Drama, 2005, 2004, 2000, Female, Male.]
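The mapping from tables to a data graph can be sketched in a few lines of Python. This is only an illustration under assumptions: the node names (`m1`, `c4`, ...) follow the slide, Cast tuples are numbered in row order, and the adjacency-map representation and the function name `build_data_graph` are hypothetical choices, not taken from the talk.

```python
from collections import defaultdict

# Example tables from the slide.
MOVIES = [(1, 2005, "Action"), (2, 2004, "Action"), (3, 2000, "Drama")]
ACTORS = [(1, "Male"), (2, "Female"), (3, "Male"), (4, "Male")]
CAST = [(1, 1), (1, 2), (2, 3), (3, 3), (3, 4)]


def build_data_graph(movies, cast, actors):
    """Return an undirected adjacency map: node -> set of neighbouring nodes."""
    graph = defaultdict(set)

    def link(u, v):
        graph[u].add(v)
        graph[v].add(u)

    # Tuple nodes are linked to value nodes for their non-key attributes.
    for mid, year, genre in movies:
        link(f"m{mid}", ("year", year))
        link(f"m{mid}", ("genre", genre))
    for aid, sex in actors:
        link(f"a{aid}", ("sex", sex))
    # Each Cast tuple is linked to the Movies and Actors tuples it references
    # (assumption: Cast tuples are named c1..c5 in row order).
    for i, (mid, aid) in enumerate(cast, start=1):
        link(f"c{i}", f"m{mid}")
        link(f"c{i}", f"a{aid}")
    return graph


graph = build_data_graph(MOVIES, CAST, ACTORS)
print(sorted(graph["c4"]))  # ['a3', 'm3']
```

Key/foreign-key references become edges between tuple nodes, and attribute values become shared value nodes, which is what makes the relational data look semi-structured.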

TuG Model: Intuition #2

- Join query ↔ Sub-graph matching
- Selectivity ↔ Count of matching sub-graphs

  SELECT *
  FROM M, C, A
  WHERE M.mid = C.mid
    AND C.aid = A.aid
    AND A.sex = 'Male'
    AND M.genre = 'Drama'

[Figure: the data graph from the previous slide, together with the query graph M - C - A carrying the value predicates Drama (on M) and Male (on A).]
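To make the "selectivity = count of matching sub-graphs" idea concrete, the following sketch counts the matches of the query above directly on the example tables from the previous slide. The function name `count_matches` is hypothetical. Because `mid` and `aid` are keys, each Cast tuple can extend to at most one match, so counting qualifying Cast tuples counts the matching sub-graphs.

```python
# Example tables from the previous slide.
MOVIES = [(1, 2005, "Action"), (2, 2004, "Action"), (3, 2000, "Drama")]
ACTORS = [(1, "Male"), (2, "Female"), (3, "Male"), (4, "Male")]
CAST = [(1, 1), (1, 2), (2, 3), (3, 3), (3, 4)]


def count_matches(movies, cast, actors, genre, sex):
    """Count M-C-A sub-graphs whose movie has `genre` and whose actor has `sex`."""
    movie_ids = {mid for mid, _year, g in movies if g == genre}
    actor_ids = {aid for aid, s in actors if s == sex}
    # One match per Cast tuple that links a qualifying movie to a qualifying actor.
    return sum(1 for mid, aid in cast if mid in movie_ids and aid in actor_ids)


print(count_matches(MOVIES, CAST, ACTORS, genre="Drama", sex="Male"))  # 2
```

The two matches are (m3, c4, a3) and (m3, c5, a4), which is the exact result size of the SQL query on this data.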

TuG Synopsis Model

- Tuple node: set of tuples from the same relation
  - Node count: number of tuples
- Edge: join between tuple sets
  - Edge count: result size of join

[Figure: a data graph (tuple nodes m1-m3, c1-c5, a1-a3; value nodes Action, Drama, 2005, 2004, Female, Male) next to its TuG, whose tuple nodes are mα(2), mβ(1), cα(3), cβ(2), aα(3), with a count on each edge.]

TuG Synopsis Model

- Value node: a distinct value in the database
- Value edge: appearance of value in the tuple set
  - Edge count: frequency of value in the tuple set
- In practice, distributions are compressed with summaries

[Figure: a data graph (tuple nodes m1-m3, c1-c5, a1-a3; value nodes Action, Drama, 2005, 2004, Female, Male) next to its TuG with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3), now including value edges and their counts.]
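The node and edge counts defined on these two slides can be derived mechanically once tuples are assigned to tuple nodes. The sketch below does this for the Movies/Cast/Actors example from the earlier slide. The partition shown is a hypothetical assignment chosen for illustration (it is not claimed to be the one in the figure), and `build_tug` is an assumed helper name.

```python
from collections import Counter

MOVIES = {"m1": (2005, "Action"), "m2": (2004, "Action"), "m3": (2000, "Drama")}
ACTORS = {"a1": "Male", "a2": "Female", "a3": "Male", "a4": "Male"}
CAST = {"c1": ("m1", "a1"), "c2": ("m1", "a2"), "c3": ("m2", "a3"),
        "c4": ("m3", "a3"), "c5": ("m3", "a4")}

# Hypothetical partition of tuples into TuG nodes (an assumption for illustration).
PARTITION = {"m1": "mα", "m2": "mα", "m3": "mβ",
             "c1": "cα", "c2": "cα", "c3": "cα", "c4": "cβ", "c5": "cβ",
             "a1": "aα", "a2": "aα", "a3": "aα", "a4": "aα"}


def build_tug(movies, cast, actors, partition):
    """Return (node counts, join-edge counts, value-edge counts)."""
    node_count = Counter(partition.values())
    join_edges = Counter()   # (tuple node, tuple node) -> size of the join
    value_edges = Counter()  # (tuple node, value) -> frequency of the value
    for c, (m, a) in cast.items():
        join_edges[(partition[m], partition[c])] += 1
        join_edges[(partition[c], partition[a])] += 1
    for m, (year, genre) in movies.items():
        value_edges[(partition[m], year)] += 1
        value_edges[(partition[m], genre)] += 1
    for a, sex in actors.items():
        value_edges[(partition[a], sex)] += 1
    return node_count, join_edges, value_edges


node_count, join_edges, value_edges = build_tug(MOVIES, CAST, ACTORS, PARTITION)
print(node_count["mα"], join_edges[("mα", "cα")], value_edges[("aα", "Male")])  # 2 3 3
```

The synopsis keeps only these counts; the individual tuples and their exact connections are discarded.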

TuG Semantics

- Assumption 1: Independence across edges
- Assumption 2: Uniformity along each edge

For each actor:
- prob[cα-aα] = 1/3
- prob[cβ-aα] = 1/3
- prob[sex=Female] = 1/3
- prob[sex=Male] = 2/3

[Figure: TuG with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3), value nodes Action, Drama, 2005, 2004, Female, Male, and a count on each edge.]
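The probabilities listed for aα can be reproduced from the counts in the synopsis. The sketch below assumes that, under the uniformity assumption, a join probability is the edge count divided by the number of tuple pairs, and a value probability is the value-edge count divided by the node count. The edge counts used (3 for cα-aα, 2 for cβ-aα, 1 for Female, 2 for Male) are read from the figure as far as they can be recovered, and the helper names are hypothetical.

```python
from fractions import Fraction


def join_prob(edge_count, count_u, count_v):
    """Probability that a given tuple of u joins a given tuple of v (uniformity)."""
    return Fraction(edge_count, count_u * count_v)


def value_prob(edge_count, node_count):
    """Probability that a tuple of the node carries the value (uniformity)."""
    return Fraction(edge_count, node_count)


node_count = {"cα": 3, "cβ": 2, "aα": 3}

print(join_prob(3, node_count["cα"], node_count["aα"]))  # 1/3
print(join_prob(2, node_count["cβ"], node_count["aα"]))  # 1/3
print(value_prob(1, node_count["aα"]))                   # 1/3  (sex=Female)
print(value_prob(2, node_count["aα"]))                   # 2/3  (sex=Male)
```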

Tuple Clustering

- Tuple node ↔ Cluster
- Join and value probabilities ↔ Centroid
- Validity of assumptions ↔ Error of clustering
- Tight clusters → Valid assumptions → Accurate synopsis

The centroid of aα:

  cα    cβ    Female  Male
  1/3   1/3   1/3     2/3

For each actor:
- prob[cα-aα] = 1/3
- prob[cβ-aα] = 1/3
- prob[sex=Female] = 1/3
- prob[sex=Male] = 2/3

[Figure: TuG with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3), value nodes Action, Drama, 2005, 2004, Female, Male, and a count on each edge.]

Selectivity Estimation

- Single pass estimation algorithm
- Accuracy depends on the validity of our assumptions
- Tight clustering → Accurate estimates

[Figure: query graph M - C - A with value predicates 2005 (on M) and Male (on A), matched against the TuG nodes mα(2), cα(3), aα(3).]

$$\text{Sub-graph selectivity} = (2 \cdot 3 \cdot 3) \cdot \mathrm{Prob}[\,2005 \wedge m_\alpha c_\alpha \wedge c_\alpha a_\alpha \wedge \text{Male}\,]$$

$$= (2 \cdot 3 \cdot 3) \cdot \mathrm{Prob}[\text{Male}] \cdot \mathrm{Prob}[m_\alpha c_\alpha] \cdot \mathrm{Prob}[c_\alpha a_\alpha] \cdot \mathrm{Prob}[2005]$$

$$= 18 \cdot \tfrac{2}{3} \cdot \tfrac{1}{2} \cdot \tfrac{1}{3} \cdot \tfrac{1}{2} = 1 \text{ tuple}$$
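The arithmetic of this estimate can be checked with a short sketch. It assumes the independence assumption lets the joint probability factor into one term per edge, and it uses the node counts and probabilities shown on the slide; the variable names are hypothetical.

```python
from fractions import Fraction
from math import prod

# Sizes of the TuG nodes matched by the query graph M - C - A.
node_counts = {"mα": 2, "cα": 3, "aα": 3}

# One probability per edge of the matched sub-graph (values from the slide).
edge_probs = {
    "year=2005": Fraction(1, 2),
    "mα-cα": Fraction(1, 2),
    "cα-aα": Fraction(1, 3),
    "sex=Male": Fraction(2, 3),
}

# Number of candidate tuple combinations times the probability that one
# combination satisfies every edge (independence across edges).
estimate = prod(node_counts.values()) * prod(edge_probs.values())
print(estimate)  # 1
```

Each edge is visited once, which is consistent with the single-pass character of the estimation described on the slide.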

Outline

- The TuG Synopsis
  - Synopsis model
  - Estimation framework
- TuG Construction
  - TuG Compression
  - Construction Algorithm
- Experimental Study
- Conclusions

The Node-Merge Operation

- Collapse a set of nodes into one new node
- New node acquires aggregate characteristics
- New centroid represents the union of the tuple sets

Example: merge aα and aβ into aγ

  Node    Edge count to cα(6)   Edge count to Male   Centroid (cα, Male)
  aα(2)   4                     2                    (1/3, 1)
  aβ(2)   2                     2                    (1/6, 1)
  aγ(4)   6                     4                    (1/4, 1)

- When is a merge lossless?
- How do we quantify lossy merges?
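A minimal sketch of the merge operation follows, assuming that merging adds node counts and edge counts and that a centroid entry is the edge count divided by the number of tuple pairs (value nodes are treated as having size 1). The `Node` class and the `merge` and `centroid` helpers are hypothetical names; the numbers are those of the example on this slide.

```python
from collections import Counter
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Node:
    count: int                                      # number of tuples in the node
    edges: Counter = field(default_factory=Counter)  # neighbour -> edge count


def merge(u, v):
    """Collapse two nodes: tuple counts and edge counts are summed."""
    return Node(u.count + v.count, u.edges + v.edges)


def centroid(node, neighbour_count):
    """Per-neighbour probabilities; neighbours missing from
    `neighbour_count` are value nodes and are treated as size 1."""
    return {n: Fraction(c, node.count * neighbour_count.get(n, 1))
            for n, c in node.edges.items()}


sizes = {"cα": 6}
a_alpha = Node(2, Counter({"cα": 4, "Male": 2}))
a_beta = Node(2, Counter({"cα": 2, "Male": 2}))
a_gamma = merge(a_alpha, a_beta)

print(centroid(a_alpha, sizes)["cα"], centroid(a_beta, sizes)["cα"],
      centroid(a_gamma, sizes)["cα"])  # 1/3 1/6 1/4
```

The merged centroid (1/4) differs from both original centroids (1/3 and 1/6), so after this merge the synopsis can no longer distinguish the two groups of actors: the merge is lossy.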

Full Similarity Merge

Nodes are fully similar if they have the same centroids

Example: merge aα and aβ into aγ

  Node    Edge count to cα(12)   Edge count to Male   Centroid (cα, Male)
  aα(2)   4                      2                    (1/6, 1)
  aβ(4)   8                      4                    (1/6, 1)
  aγ(6)   12                     6                    (1/6, 1)

All-but-one Similarity

- Nodes are all-but-one similar if they have the same centroids with respect to all schema neighbors but one
- Theorem: a merge of all-but-one similar nodes is lossless
- Order of merging can affect final compression
- Potential application in other domains (e.g., XML summarization)

Example: merge aα and aβ into aγ (lossless)

  Node    Edge counts to cα(12), cβ(8), Male, Female   Centroid (cα, cβ, Male, Female)
  aα(2)   4, 2, 2, 0                                   (1/6, 1/8, 1, 0)
  aβ(4)   8, 4, 1, 3                                   (1/6, 1/8, 1/4, 3/4)
  aγ(6)   12, 6, 3, 3                                  (1/6, 1/8, 1/2, 1/2)
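The similarity test on this slide can be sketched as a comparison of centroids grouped by schema neighbor. The sketch assumes that each centroid entry is keyed by a (schema neighbor, node) pair and that the two schema neighbors in the example are the Cast relation and the sex attribute; these labels and the function names are hypothetical.

```python
from fractions import Fraction as F


def differing_schema_neighbours(c1, c2):
    """Schema neighbours on which two centroids disagree (missing entries count as 0)."""
    keys = set(c1) | set(c2)
    return {schema for (schema, node) in keys
            if c1.get((schema, node), 0) != c2.get((schema, node), 0)}


def fully_similar(c1, c2):
    return not differing_schema_neighbours(c1, c2)


def all_but_one_similar(c1, c2):
    # Fully similar nodes trivially pass this test as well.
    return len(differing_schema_neighbours(c1, c2)) <= 1


a_alpha = {("Cast", "cα"): F(1, 6), ("Cast", "cβ"): F(1, 8),
           ("sex", "Male"): F(1), ("sex", "Female"): F(0)}
a_beta = {("Cast", "cα"): F(1, 6), ("Cast", "cβ"): F(1, 8),
          ("sex", "Male"): F(1, 4), ("sex", "Female"): F(3, 4)}

print(differing_schema_neighbours(a_alpha, a_beta))  # {'sex'}
print(all_but_one_similar(a_alpha, a_beta))          # True
```

Here aα and aβ agree on their join probabilities to Cast and differ only in the sex distribution, so they qualify for the lossless merge even though they are not fully similar.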

Effect of All-but-one Similarity

Number of nodes in synopsis graph:

  Data set   Data graph    Full-Similarity Synopsis   Ab1-Similarity Synopsis
  TPCH       8 million     4.4 million                33K
  IMDB       4.7 million   4.5 million                65K

Lossy Merges

- Question: When is a lossy merge good?
- Intuition: Similar centroids → Good merge
- Measure merge quality by error of centroid clustering
  - Radius, Diameter, Manhattan Distance, …

[Figure: TuG fragment with tuple nodes bα(3), bβ(2), bγ(5) joined to aα(10) and cα(8), and a plot of the b-node centroids with axes "Join prob. to aα" and "Join prob. to cα".]

Centroids:

  Node   Join prob. to aα   Join prob. to cα
  bα     0.4                0.3
  bβ     0.3                0.4
  bγ     0.1                0.1
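As an illustration of "similar centroids → good merge", the sketch below ranks candidate pairs by the Manhattan distance between their centroids, one of the measures named on the slide. Scoring pairs this way is an assumption made for the example; it is not claimed to be the clustering procedure used in the actual construction algorithm.

```python
from fractions import Fraction as F
from itertools import combinations

# Centroids from the slide: (join prob. to aα, join prob. to cα).
centroids = {
    "bα": (F(4, 10), F(3, 10)),
    "bβ": (F(3, 10), F(4, 10)),
    "bγ": (F(1, 10), F(1, 10)),
}


def manhattan(p, q):
    """Sum of absolute coordinate differences between two centroids."""
    return sum(abs(x - y) for x, y in zip(p, q))


distances = {pair: manhattan(centroids[pair[0]], centroids[pair[1]])
             for pair in combinations(centroids, 2)}
best = min(distances, key=distances.get)
print(best, distances[best])  # ('bα', 'bβ') 1/5
```

Under this measure bα and bβ are the closest pair (distance 0.2, versus 0.5 for either pairing with bγ), so merging them would distort the centroids least.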

Construction Algorithm

Database → Reference synopsis
- All-but-one similar (lossless) merges
- Adaptive selection of merge operations

Reference Synopsis → Join Compressed TuG
- Lossy merges
- Good merges are identified by adaptive clustering
- Clustering algorithm: BIRCH [Zhang+96] + CM-Sketches [CM04]

Join Compressed TuG → Value Compressed TuG
- Value distributions → Histograms
- Histograms are shared among nodes with similar distributions

[Figure: pipeline Database → Lossless Merge Compression → Reference Synopsis → Lossy Merge Compression → TuG → Value Compression → TuG.]

Outline

- The TuG Synopsis
  - Synopsis model
  - Estimation framework
- TuG Construction
  - TuG Compression
  - Construction Algorithm
- Experimental Study
- Conclusions

Techniques

- TuGs
- Join Synopses [Acharya+99]
- Multidimensional wavelets [Chakrabarti+00]
- Single dimensional histograms [Poosala+96]
  - Generated by commercial database System X

Data and Queries

Data sets:

  Dataset   Size     Data Graph Nodes   Space Budget
  TPCH      1 GB     8 million          30 KB
  IMDB      139 MB   4.7 million        20 KB

Workload:
- ~200 randomly generated positive queries
- 4-8 join predicates
- 1-7 value predicates

Evaluation Metric

- Absolute relative error (ARE)
- Sanity bound = 10th percentile of the true selectivities of the workload

$$\mathrm{ARE} = \frac{|\,\text{selectivity} - \text{estimate}\,|}{\max(\text{selectivity},\ \text{sanity bound})}$$
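The metric is straightforward to compute; a small sketch follows. The slide does not say how the 10th percentile is defined, so the nearest-rank definition used in `sanity_bound` is an assumption, as are the function names and the example workload.

```python
import math


def sanity_bound(true_selectivities, percentile=10):
    """Nearest-rank percentile of the workload's true selectivities
    (the percentile definition is an assumption)."""
    ordered = sorted(true_selectivities)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def absolute_relative_error(selectivity, estimate, bound):
    """ARE = |selectivity - estimate| / max(selectivity, sanity bound)."""
    return abs(selectivity - estimate) / max(selectivity, bound)


# Hypothetical workload of true selectivities, for illustration only.
workload = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bound = sanity_bound(workload)

print(bound)                                        # 10
print(absolute_relative_error(5, 8, bound))         # 0.3
print(absolute_relative_error(200, 150, bound))     # 0.25
```

The sanity bound keeps queries with very small true selectivities from dominating the error: for the query with true selectivity 5, the denominator is the bound (10) rather than 5.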

Estimation Error - TPCH

[Figure: Queries in Workload (%) plotted against Absolute Relative Error (%) for TuG, Histograms, and Join Synopses.]

- TuG error is less than 30% for 56% of the queries in the workload
- Join Synopsis error is less than 30% for 40% of the queries in the workload
- Histogram error is less than 30% for 25% of the queries in the workload

Estimation Error - IMDB

[Figure: Queries in Workload (%) plotted against Absolute Relative Error (%) for TuG and Wavelet.]

- TuGs have significantly less estimation error for most queries in the workload
- Join Synopses are not applicable for this schema

Conclusions

TuG Synopses
- Schema-level relational summaries
- Model: Semi-structured view of the relational data set
- Selectivity estimates for complex join queries
- Support for a large class of practical schemata
- Effective construction algorithm

Experimental Results
- Accurate selectivity estimates given a small budget
- Benefits over existing techniques


Questions?

Construction Times

- TuG construction times
  - TPCH: 55 minutes
  - IMDB: 85 minutes
- Histograms and Join Synopses can be constructed relatively quickly (e.g., < 10 minutes for our datasets)
- Multidimensional wavelets are prohibitively expensive to construct over key values

[Figure: pipeline Database → Lossless Merge Compression → Reference Synopsis → Lossy Merge Compression → TuG → Value Compression → TuG.]

Estimation Error - IMDB

[Figure: Queries in Workload (%) plotted against Absolute Relative Error for TuG and Histograms.]

A synopsis should be:

- Accurate
- Much smaller than the database
- Efficient to construct
- Applicable for any schema and query
  - Many-to-many relationships
  - Join graphs with cycles

[Figure: two example schemata, one with Movies, Cast, Actors and one with Customer, Region, Orders.]