Deriving Predicate Statistics (SDP) in Datalog
Principles and Practice of Declarative Programming, 12th International ACM SIGPLAN Symposium, July 26, 2010, Hagenberg, Austria
Senlin Liang and Michael Kifer, Stony Brook University


Page 1: Senlin  Liang and Michael  Kifer Stony Brook University

Deriving Predicate Statistics (SDP) in Datalog

Principles and Practice of Declarative Programming
12th International ACM SIGPLAN Symposium
July 26, 2010, Hagenberg, Austria

Senlin Liang and Michael Kifer
Stony Brook University

Page 2: Senlin  Liang and Michael  Kifer Stony Brook University

Summary of Our Approach

- Motivation
  - Take advantage of cost-based optimizations in deductive database systems
  - Compute cost information (predicate statistics)
  - Store and retrieve cost information efficiently
  - Apply optimization techniques
- Advantages of our approach
  - Keeps argument dependencies
  - Handles recursion
  - Handles negation

PPDP July 26, 2010

“Deriving Predicate Statistics in Datalog" by Senlin Liang and Michael Kifer 2

Page 3: Senlin  Liang and Michael  Kifer Stony Brook University

Outline

- Introduction
  - Traditional approach: histograms + argument independence assumption
  - Error grows exponentially
- SDP
  - Dependency matrix stores predicate statistics
  - Abstract interpretation of Datalog rules, which are evaluated over dependency matrices
- Experimental studies
- Future work

Page 4: Senlin  Liang and Michael  Kifer Stony Brook University

Histograms

- Data distribution: T = ((v1, f1), …, (vn, fn))
  - E.g. ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
- Histograms
  - Partition data distribution into groups
  - Summarize each group as a bucket: (floor, ceiling, size, count)
  - Compute the values and frequencies in each bucket efficiently
- MaxDiff histograms with β buckets
  - Partition T using the β-1 largest frequency differences

Page 5: Senlin  Liang and Michael  Kifer Stony Brook University

Example: MaxDiff Histograms (3 buckets)

1. Partition T using the 2 largest frequency differences
2. Summarize each group as (floor, ceiling, size, count)
3. Value-frequency approximation: vals(bucket) = [floor, ceiling]; f(val) = count/size, e.g. f(7) = 5/3

T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
Differences between adjacent frequencies: 1 1 2 2 1 0
Step 1: ((2,1), (3,2), (4,1)) | ((5,3)) | ((6,1), (7,2), (8,2))
Step 2: (2,4,3,4) (5,5,1,3) (6,8,3,5)
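The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `maxdiff_histogram` and `frequency` are made up here, and ties between equal frequency differences are broken in favor of the leftmost gap, which the slides do not specify.

```python
def maxdiff_histogram(dist, num_buckets):
    """Build MaxDiff buckets (floor, ceiling, size, count) from a sorted
    value-frequency list [(v1, f1), ..., (vn, fn)]."""
    # Absolute frequency difference between each pair of adjacent values.
    diffs = [abs(dist[i + 1][1] - dist[i][1]) for i in range(len(dist) - 1)]
    # Cut at the (num_buckets - 1) largest differences.
    # Assumption: ties are broken in favor of the leftmost gap.
    order = sorted(range(len(diffs)), key=lambda i: (-diffs[i], i))
    cuts = sorted(order[:num_buckets - 1])
    buckets, start = [], 0
    for cut in cuts + [len(dist) - 1]:
        group = dist[start:cut + 1]
        buckets.append((group[0][0], group[-1][0], len(group),
                        sum(f for _, f in group)))
        start = cut + 1
    return buckets


def frequency(buckets, val):
    """Approximate frequency of val: count/size of the bucket covering it."""
    for floor, ceiling, size, count in buckets:
        if floor <= val <= ceiling:
            return count / size
    return 0.0


T = [(2, 1), (3, 2), (4, 1), (5, 3), (6, 1), (7, 2), (8, 2)]
buckets = maxdiff_histogram(T, 3)
print(buckets)
print(frequency(buckets, 7))
```

On the distribution T from the slide, the cuts fall after values 4 and 5, which yields the same three buckets as in step 2.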

Page 6: Senlin  Liang and Michael  Kifer Stony Brook University

Argument Independence Assumption

- Common in database size estimates
- Data distributions of different arguments are independent of each other
  - For example, in predicate p(X,Y), the data distributions of X and Y are independent
- Joint data distribution can be easily computed from individual distributions
  - E.g., p(X=a, Y=b) = p(X=a) × p(Y=b)
- Unfortunately, the independence assumption is almost always wrong in real datasets
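A small, made-up dataset shows how far the product formula can drift from the real joint distribution. The facts below are hypothetical and chosen so that Y is fully determined by X; the helper names are illustrative only.

```python
from collections import Counter

# Hypothetical facts p(X, Y) in which Y is fully determined by X.
facts = [("a", "b"), ("a", "b"), ("c", "d"), ("c", "d")]
n = len(facts)

px = Counter(x for x, _ in facts)   # marginal counts of X
py = Counter(y for _, y in facts)   # marginal counts of Y
pxy = Counter(facts)                # joint counts of (X, Y)


def joint_actual(x, y):
    """Real joint probability, read off the facts."""
    return pxy[(x, y)] / n


def joint_independent(x, y):
    """Joint probability under the independence assumption."""
    return (px[x] / n) * (py[y] / n)


print(joint_actual("a", "b"), joint_independent("a", "b"))
print(joint_actual("a", "d"), joint_independent("a", "d"))
```

The pair (a, d) never occurs, yet the independence assumption assigns it the same probability as the pair (a, b), which accounts for half of the facts.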

Page 7: Senlin  Liang and Michael  Kifer Stony Brook University

Example: Histogram + Independence = Poor Estimate

answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.

- Facts: e(2,2), … as in Example 1 of the paper
- Histogram buckets of e
  - X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
  - Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
- Size estimate
  - Answer size estimate for each X bucket: |[floor, ceiling] ∩ [5,7]| / |[floor, ceiling]| × count
  - Summed over the X buckets: size(answer) = 0 + 3 + 3.33 = 6.33

Page 8: Senlin  Liang and Michael  Kifer Stony Brook University

Example: Histogram + Independence = Poor Estimate

- Histogram buckets of e
  - X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
  - Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
- Histogram buckets of answer
  - X: (5,5,1,3) (6,7,2,3.33)
  - Y: (1,1,1,0.53) (2,4,3,1.58) (5,8,4,4.22)
  - answer.count = e.count × size(answer)/size(e)
- Real results for answer.Y: (1,1,1,0) (2,4,3,0) (5,8,4,6)
- Independence causes information loss
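The numbers in this example can be reproduced with a short sketch. It assumes integer values spread uniformly inside each bucket, so that |[floor, ceiling]| is ceiling - floor + 1; the helper names `select_range` and `scale_counts` are hypothetical.

```python
def select_range(buckets, lo, hi):
    """Restrict (floor, ceiling, size, count) buckets to lo <= value <= hi,
    assuming integer values spread uniformly inside each bucket."""
    out = []
    for floor, ceiling, size, count in buckets:
        new_floor, new_ceiling = max(floor, lo), min(ceiling, hi)
        if new_floor > new_ceiling:
            continue  # bucket lies entirely outside the selection
        frac = (new_ceiling - new_floor + 1) / (ceiling - floor + 1)
        out.append((new_floor, new_ceiling, size * frac, count * frac))
    return out


def scale_counts(buckets, factor):
    """Independence assumption: shrink every count by the same factor."""
    return [(f, c, s, n * factor) for f, c, s, n in buckets]


e_x = [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
e_y = [(1, 1, 1, 1), (2, 4, 3, 3), (5, 8, 4, 8)]

answer_x = select_range(e_x, 5, 7)
size_e = sum(b[3] for b in e_x)
size_answer = sum(b[3] for b in answer_x)
answer_y = scale_counts(e_y, size_answer / size_e)

print(round(size_answer, 2))
print([round(b[3], 2) for b in answer_y])
```

Every Y bucket is scaled by the same factor size(answer)/size(e), so the estimate cannot discover that the selected facts all have Y in [5,8].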

Page 9: Senlin  Liang and Michael  Kifer Stony Brook University

Our Approach: Dependency Matrices

- Only considers dependency matrices (DM) for binary predicates
- Partitions facts into local groups
- Sums up the groups into DM values
- Sums up each row/column into (floor, ceiling, size)

Page 10: Senlin  Liang and Michael  Kifer Stony Brook University

Example: DM

- Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
- Partition fact matrix using MaxDiff
- Sum up partitions into matrix values

[Figure: fact matrix F, rows 2..8, columns 1..8, with 1, 2, 1, 3, 1, 2 and 2 facts in rows 2 through 8]
[Figure: F partitioned into row groups 2-4, 5 and 6-8, with partition sums 2, 2 (rows 2-4), 3 (row 5) and 1, 1, 3 (rows 6-8)]

Page 11: Senlin  Liang and Michael  Kifer Stony Brook University

Example: DM

- Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
- Partition fact matrix using MaxDiff
- Sum up partitions into matrix values
- Sum up each row/column into (floor, ceiling, size)

[Figure: F partitioned into row groups 2-4, 5 and 6-8, with partition sums 2, 2 (rows 2-4), 3 (row 5) and 1, 1, 3 (rows 6-8)]

Dependency matrix M (rows: X buckets, columns: Y buckets; empty cells hold no facts)

M            (1,1,1)  (2,4,3)  (5,8,4)
(2,4,3)                  2        2
(5,5,1)                           3
(6,8,3)         1        1        3
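Given the bucket boundaries, summing a fact matrix into a dependency matrix amounts to counting facts per (X bucket, Y bucket) cell. The sketch below takes the buckets as input rather than computing them. The fact list is hypothetical: only e(2,2) appears in the slides, and the remaining facts were chosen so that the result matches the matrix M of this example. The function names are illustrative.

```python
def bucket_index(buckets, val):
    """Index of the (floor, ceiling, size) bucket that covers val."""
    for i, (floor, ceiling, _size) in enumerate(buckets):
        if floor <= val <= ceiling:
            return i
    raise ValueError(f"value {val} is not covered by any bucket")


def dependency_matrix(facts, x_buckets, y_buckets):
    """Count the facts that fall into each (X bucket, Y bucket) cell."""
    m = [[0] * len(y_buckets) for _ in x_buckets]
    for x, y in facts:
        m[bucket_index(x_buckets, x)][bucket_index(y_buckets, y)] += 1
    return m


# Hypothetical facts, chosen only to be consistent with the slide's matrices.
facts = [(2, 2), (3, 3), (3, 5), (4, 6), (5, 5), (5, 6), (5, 7),
         (6, 5), (7, 6), (7, 8), (8, 1), (8, 4)]
x_buckets = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]
y_buckets = [(1, 1, 1), (2, 4, 3), (5, 8, 4)]

M = dependency_matrix(facts, x_buckets, y_buckets)
for row in M:
    print(row)
```

Unlike two separate histograms, the matrix records which X buckets co-occur with which Y buckets, so the dependency between the two arguments is kept.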

Page 12: Senlin  Liang and Michael  Kifer Stony Brook University

SDP for Selection by Example

answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.

[Figure: fact matrix F, rows 2..8, columns 1..8, with 1, 2, 1, 3, 1, 2 and 2 facts in rows 2 through 8]

From the fact matrix, we know that size(answer) = Σ F(i,j) for 5 ≤ i ≤ 7 = 6

Page 13: Senlin  Liang and Michael  Kifer Stony Brook University

SDP for Selection by Example

answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.

- Extract the portions covered by the selection
- Recompute matrix values
- Sum them up as size(answer) = 3 + .67 + .67 + 2 = 6.34
- For each row, recompute (floor, ceiling, size)

Dependency matrix M of e

M            (1,1,1)  (2,4,3)  (5,8,4)
(2,4,3)                  2        2
(5,5,1)                           3
(6,8,3)         1        1        3

Dependency matrix d of answer

d            (1,1,1)  (2,4,3)  (5,8,4)
(5,5,1)                           3
(6,7,2)        .67      .67       2
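The selection step can be sketched as row-wise scaling of the dependency matrix. As before, this assumes integer values spread uniformly inside each X bucket; `select_on_x` is a hypothetical helper name, and empty cells of M are written as 0.

```python
def select_on_x(m, x_buckets, lo, hi):
    """Keep the part of each row whose X bucket overlaps [lo, hi] and
    scale the row by the covered fraction of the bucket."""
    new_rows, new_x = [], []
    for row, (floor, ceiling, size) in zip(m, x_buckets):
        new_floor, new_ceiling = max(floor, lo), min(ceiling, hi)
        if new_floor > new_ceiling:
            continue  # row lies entirely outside the selection
        frac = (new_ceiling - new_floor + 1) / (ceiling - floor + 1)
        new_rows.append([v * frac for v in row])
        new_x.append((new_floor, new_ceiling, size * frac))
    return new_rows, new_x


M = [[0, 2, 2], [0, 0, 3], [1, 1, 3]]
x_buckets = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]

d, d_x = select_on_x(M, x_buckets, 5, 7)
size_answer = sum(sum(row) for row in d)
y_counts = [sum(col) for col in zip(*d)]

print([[round(v, 2) for v in row] for row in d])
print(d_x)
print(round(size_answer, 2), [round(c, 2) for c in y_counts])
```

The column sums of d come out as roughly 0.67, 0.67 and 5, which is closer to the real answer.Y counts (0, 0, 6) than the 0.53, 1.58 and 4.22 obtained under the independence assumption. The unrounded total is 6.33; the slide's 6.34 comes from adding the rounded cell values.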

Page 14: Senlin  Liang and Michael  Kifer Stony Brook University

Example: Sort-Merge-Join

answer(X,Z) :- a(X,Y), b(Y,Z)

- middle(X,Y,Z) is for the ease of explanation

a: … a(4,3) a(4,4) …
b: … b(3,1) b(3,5) b(4,5) …
middle: (4,3,1) (4,3,5) (4,4,5) …
answer: (4,1) (4,5) (4,5) … Duplicates!

Page 15: Senlin  Liang and Michael  Kifer Stony Brook University

SDP for Join by Example

answer(X,Z) :- a(X,Y), b(Y,Z).

- Simulate Sort-Merge-Join

Dependency matrix A of a (rows: X buckets, columns: Y buckets)

A            (1,1,1)  (2,4,2)
(2,4,3)         2        4
(5,5,1)

Dependency matrix B of b (rows: Y buckets, columns: Z buckets)

B            (6,8,2)  (9,9,1)
(1,1,1)         1
(2,4,2)         3

Align the Y buckets of A and B:

A.X, A.Y, A.Val
(2,4,3), (1,1,1), 2
(2,4,3), (2,4,2), 4

B.Y, B.Z, B.Val
(1,1,1), (6,8,2), 1
(2,4,2), (6,8,2), 3

Page 16: Senlin  Liang and Michael  Kifer Stony Brook University

SDP for Join by Example

answer(X,Z) :- a(X,Y), b(Y,Z).

A.X, A.Y, A.Val
(2,4,3), (1,1,1), 2
(2,4,3), (2,4,2), 4

B.Y, B.Z, B.Val
(1,1,1), (6,8,2), 1
(2,4,2), (6,8,2), 3

- Result size of middle(X,Y,Z) can be estimated as
  min(A.Y.size, B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size)
- Examples:
  - size(middle((2,4,3),(1,1,1),(6,8,2))) ~ min(1,1) × (2/1) × (1/1) = 2
  - size(middle((2,4,3),(2,4,2),(6,8,2))) ~ min(2,2) × (4/2) × (3/2) = 6
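The estimate for a pair of aligned cells is a direct transcription of the formula. The sketch below assumes the Y buckets of A and B have already been aligned pairwise, as in the two tables; the tuple layout and the name `estimate_middle` are illustrative choices.

```python
def estimate_middle(a_entry, b_entry):
    """Estimated size of middle for one aligned pair of cells.

    a_entry = (X bucket, Y bucket, A.Val), b_entry = (Y bucket, Z bucket, B.Val);
    each bucket is (floor, ceiling, size).
    """
    (_a_x, a_y, a_val), (b_y, _b_z, b_val) = a_entry, b_entry
    a_y_size, b_y_size = a_y[2], b_y[2]
    return min(a_y_size, b_y_size) * (a_val / a_y_size) * (b_val / b_y_size)


a_entries = [((2, 4, 3), (1, 1, 1), 2), ((2, 4, 3), (2, 4, 2), 4)]
b_entries = [((1, 1, 1), (6, 8, 2), 1), ((2, 4, 2), (6, 8, 2), 3)]

# Assumption: entry i of A is aligned with entry i of B on the Y bucket.
estimates = [estimate_middle(a, b) for a, b in zip(a_entries, b_entries)]
print(estimates)
```

One way to read the formula: A.Val/A.Y.size and B.Val/B.Y.size are the average numbers of facts per distinct Y value on each side, and min(A.Y.size, B.Y.size) bounds the number of Y values that can match.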

Page 17: Senlin  Liang and Michael  Kifer Stony Brook University

SDP for Join by Example

answer(X,Z) :- a(X,Y), b(Y,Z).

- Examples:
  - middle((2,4,3),(1,1,1),(6,8,2)) → answer((2,4,3),(6,8,2))
  - middle((2,4,3),(2,4,2),(6,8,2)) → answer((2,4,3),(6,8,2))
  - Duplicates!
- Three duplicate handling approaches
  - Sum: no duplicate removal
  - Max: most aggressive removal
  - Expected sum: remove "expected" number of duplicates
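The first two strategies can be illustrated on the two middle cells of this example, whose estimated sizes are 2 and 6 according to the join formula. The sketch groups the estimates by answer cell and combines them with sum and with max. The expected-sum strategy is left out because the slides do not give its formula.

```python
from collections import defaultdict

# (X bucket, Y bucket, Z bucket) -> estimated size of that middle cell.
middle_estimates = [
    (((2, 4, 3), (1, 1, 1), (6, 8, 2)), 2.0),
    (((2, 4, 3), (2, 4, 2), (6, 8, 2)), 6.0),
]

# Projecting out Y maps both middle cells to the same answer cell.
groups = defaultdict(list)
for (x, _y, z), est in middle_estimates:
    groups[(x, z)].append(est)

answer_sum = {cell: sum(vals) for cell, vals in groups.items()}  # no removal
answer_max = {cell: max(vals) for cell, vals in groups.items()}  # most aggressive

cell = ((2, 4, 3), (6, 8, 2))
print(answer_sum[cell], answer_max[cell])
```

Sum gives an upper estimate (8 here) and max a lower one (6 here); an estimate that removes only some duplicates would fall between the two.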

Page 18: Senlin  Liang and Michael  Kifer Stony Brook University

SDP for Recursive Predicates

- Recursive predicates are computed incrementally until they reach approximate fixed points
- Size reaches an α-approximate fixed point if Δ(size)/size ≤ α, where
  - Δ(…) is the difference between two consecutive iterations in the fixed point computation
  - 0 ≤ α ≤ 1
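The stopping test can be written as a small helper. Two points are assumptions made for this sketch: the ratio is taken relative to the size of the current iteration (the slide does not say which of the two sizes is the denominator), and the `step` function below is a toy stand-in for one round of size estimation.

```python
def reached_alpha_fixed_point(prev_size, curr_size, alpha):
    """True if the relative change between two iterations is at most alpha.
    Assumption: the change is measured relative to the current size."""
    if curr_size == 0:
        return prev_size == 0
    return abs(curr_size - prev_size) / curr_size <= alpha


def iterate_sizes(step, initial_size, alpha, max_iters=1000):
    """Apply step repeatedly until an alpha-approximate fixed point."""
    sizes = [initial_size]
    for _ in range(max_iters):
        sizes.append(step(sizes[-1]))
        if reached_alpha_fixed_point(sizes[-2], sizes[-1], alpha):
            break
    return sizes


# Toy step: each iteration closes half of the gap to a limit of 100.
sizes = iterate_sizes(lambda s: s + (100 - s) / 2, 20, alpha=0.05)
print(sizes)
```

A larger α stops earlier at the price of a coarser estimate; α = 0 asks for an exact fixed point of the size.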

Page 19: Senlin  Liang and Michael  Kifer Stony Brook University

Example: Recursive Predicates

Transitive closure

path(X,Y) :- edge(X,Y).             (base)
path(X,Y) :- edge(X,Z), path(Z,Y).  (rec)

Computation of the estimate:

1. Compute size(path) and DM(path) using rule base
2. Compute size(path) and DM(path) using rule rec, as in the case of a join
3. If size(path) reaches an approximate fixed point, stop; otherwise, go to step 2
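The control flow of these three steps can be illustrated on exact relations. Note the substitution: the sketch below evaluates the rules over sets of facts, whereas SDP would run the same loop over dependency matrices and use the join estimate in step 2. The edge relation is a hypothetical five-node chain, and the relative change is measured against the current size, which is an assumption.

```python
def path_sizes(edges, alpha):
    """Sizes of path after each iteration, stopping at an
    alpha-approximate fixed point (exact evaluation, not an estimate)."""
    edge = set(edges)
    path = set(edge)                      # step 1: rule (base)
    sizes = [len(path)]
    while True:
        # step 2: rule (rec), a join of edge and path on Z
        joined = {(x, y) for (x, z) in edge for (z2, y) in path if z == z2}
        new_path = path | joined
        delta = len(new_path) - len(path)
        path = new_path
        sizes.append(len(path))
        # step 3: stop once the relative growth is at most alpha
        if not path or delta / len(path) <= alpha:
            return sizes


chain = [(1, 2), (2, 3), (3, 4), (4, 5)]  # hypothetical edge facts
print(path_sizes(chain, alpha=0.15))
print(path_sizes(chain, alpha=0))
```

With α = 0.15 the loop stops one iteration earlier than with α = 0, because the last exact iteration adds nothing new.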

Page 20: Senlin  Liang and Michael  Kifer Stony Brook University

Experimental Studies

- Test programs:
  - Transitive closure
  - General same generation
- Datasets: generated with Thomas Process and Matern Cluster Process
- Results
  - SDP estimates converge to real sizes for recursive predicates
  - Expected sum is good for duplicate removal
  - Details in the paper

Page 21: Senlin  Liang and Michael  Kifer Stony Brook University

Experimental Studies

- SDP estimates converge to real sizes for recursive predicates

[Figure: Transitive Closure]

Page 22: Senlin  Liang and Michael  Kifer Stony Brook University

Experimental Studies

- Expected sum is good for duplicate removal

[Figure: Transitive Closure]

Page 23: Senlin  Liang and Michael  Kifer Stony Brook University

Conclusion

- Dependency matrix for binary predicates
  - Overcomes problems with argument independence assumption
- SDP for selection, join, and recursion
- Experimental validations

Page 24: Senlin  Liang and Michael  Kifer Stony Brook University

Future Work

- More complex recursions
- Negation
- Extending SDP to n-ary predicates
- Apply cost-based optimization in deductive systems, such as XSB