Deriving Predicate Statistics (SDP) in Datalog

Senlin Liang and Michael Kifer
Stony Brook University

Principles and Practice of Declarative Programming, 12th International ACM SIGPLAN Symposium
July 26, 2010, Hagenberg, Austria

Summary of Our Approach

Motivation
  Take advantage of cost-based optimizations in deductive database systems
    Compute cost information (predicate statistics)
    Store and retrieve cost information efficiently
    Apply optimization techniques
Advantages of our approach
  Keeps argument dependencies
  Handles recursion
  Handles negation

PPDP July 26, 2010

“Deriving Predicate Statistics in Datalog" by Senlin Liang and Michael Kifer 2

Outline

Introduction
  Traditional approach: histograms + argument independence assumption
  Error grows exponentially
SDP
  Dependency matrix stores predicate statistics
  Abstract interpretation of Datalog rules, which are evaluated over dependency matrices
Experimental studies
Future work


Histograms

Data distribution: T = ((v1, f1), ..., (vn, fn))
  E.g. ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
Histograms
  Partition data distribution into groups
  Summarize each group as a bucket: (floor, ceiling, size, count)
  Compute the values and frequencies in each bucket efficiently
MaxDiff histograms with β buckets
  Partition T using β-1 largest frequency differences



Example: MaxDiff Histograms (3 buckets)

1. Partition T using 2 largest frequency differences
2. Summarize as (floor, ceiling, size, count)
3. Value-frequency approximation
   vals(bucket) = [floor, ceiling]; f(val) = count/size, e.g. f(7) = 5/3

T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
Differences between adjacent frequencies: 1 1 2 2 1 0
Buckets: (2,4,3,4) (5,5,1,3) (6,8,3,5)
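The three steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names `maxdiff_histogram` and `approx_frequency` are made up here, `size` is taken to be the number of distinct values in a bucket, and ties between equal frequency differences are broken by leftmost position (an assumption the slides do not state).

```python
def maxdiff_histogram(dist, beta):
    """Partition a (value, frequency) distribution into beta MaxDiff buckets."""
    dist = sorted(dist)
    # Absolute frequency difference between each pair of adjacent values.
    diffs = [abs(dist[i + 1][1] - dist[i][1]) for i in range(len(dist) - 1)]
    # Cut after the positions holding the beta-1 largest differences.
    # Assumption of this sketch: ties are broken by leftmost position.
    by_size = sorted(range(len(diffs)), key=lambda i: (-diffs[i], i))
    cuts = sorted(by_size[:beta - 1])

    buckets, start = [], 0
    for end in cuts + [len(dist) - 1]:
        group = dist[start:end + 1]
        floor, ceiling = group[0][0], group[-1][0]
        size = len(group)                      # number of distinct values
        count = sum(freq for _, freq in group) # total frequency in the bucket
        buckets.append((floor, ceiling, size, count))
        start = end + 1
    return buckets


def approx_frequency(buckets, val):
    """Approximate f(val) as count/size of the bucket that covers val."""
    for floor, ceiling, size, count in buckets:
        if floor <= val <= ceiling:
            return count / size
    return 0.0


T = [(2, 1), (3, 2), (4, 1), (5, 3), (6, 1), (7, 2), (8, 2)]
buckets = maxdiff_histogram(T, 3)
print(buckets)                       # [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
print(approx_frequency(buckets, 7))  # 5/3
```

On the example distribution T the sketch reproduces the three buckets of the slide and the approximation f(7) = 5/3.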


Argument Independence Assumption

Common in database size estimates
Data distributions of different arguments are independent of each other
  For example, in predicate p(X,Y), the data distributions of X and Y are independent
Joint data distribution can be easily computed from individual distributions
  E.g., p(X=a, Y=b) = p(X=a) × p(Y=b)
Unfortunately, the independence assumption is almost always wrong in real datasets



Example: Histogram+Independence = Poor Estimate

answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
Facts: e(2,2), … as in Example 1 of the paper.
Histogram buckets of e
  X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
  Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
Size estimate
  Answer size estimate for each bucket
    size(answer) = |[floor, ceiling] ∩ [5,7]| / |[floor, ceiling]| × count
  size(answer) = 6.33
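The per-bucket estimate can be checked with a few lines of Python. This is only a sketch under the assumption that values are integers, so that |[floor, ceiling]| is ceiling - floor + 1; the helper names are hypothetical.

```python
def overlap_fraction(floor, ceiling, lo, hi):
    """Fraction of the integer interval [floor, ceiling] that lies in [lo, hi]."""
    covered = min(ceiling, hi) - max(floor, lo) + 1
    return max(covered, 0) / (ceiling - floor + 1)


def selection_size(buckets, lo, hi):
    """Sum, over all buckets, of the covered fraction times the bucket count."""
    return sum(overlap_fraction(floor, ceiling, lo, hi) * count
               for floor, ceiling, _size, count in buckets)


x_buckets = [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
estimate = selection_size(x_buckets, 5, 7)
print(round(estimate, 2))  # 6.33
```

Bucket (2,4,3,4) contributes 0, bucket (5,5,1,3) contributes 3, and bucket (6,8,3,5) contributes 2/3 × 5 ≈ 3.33.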


Example: Histogram+Independence = Poor Estimate

Histogram buckets of e
  X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
  Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
Histogram buckets of answer
  X: (5,5,1,3) (6,7,2,3.33)
  Y: (1,1,1,0.53) (2,4,3,1.58) (5,8,4,4.22)
  answer.count = e.count × size(answer)/size(e)
Real results for answer.Y
  (1,1,1,0) (2,4,3,0) (5,8,4,6)
Independence causes information loss
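To see where the Y estimates come from, the scaling rule answer.count = e.count × size(answer)/size(e) can be replayed in Python. The sketch below assumes size(e) is the total count of the Y buckets (12) and reuses the selection estimate 3 + 2/3 × 5 from the X histogram; `scale_buckets` is a hypothetical helper name.

```python
def scale_buckets(buckets, selectivity):
    """Scale every bucket count by the same factor (independence assumption)."""
    return [(floor, ceiling, size, count * selectivity)
            for floor, ceiling, size, count in buckets]


y_buckets = [(1, 1, 1, 1), (2, 4, 3, 3), (5, 8, 4, 8)]
size_e = sum(bucket[3] for bucket in y_buckets)  # 12 facts in e
size_answer = 3 + (2 / 3) * 5                    # estimate from the X histogram

estimated = scale_buckets(y_buckets, size_answer / size_e)
estimated_counts = [round(bucket[3], 2) for bucket in estimated]
real_counts = [0, 0, 6]                          # real results from the slide

print(estimated_counts)  # [0.53, 1.58, 4.22]
print(real_counts)
```

Every Y bucket is shrunk by the same factor, so the estimate cannot express that all selected facts fall into the bucket (5,8,4).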



Our Approach: Dependency Matrices

Only considers dependency matrices (DM) for binary predicates
Partitions facts into local groups
Sum up the groups into DM values
Sum up each row/column into (floor, ceiling, size)



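Example: DM

Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
Partition fact matrix using MaxDiff
Sum up partitions into matrix values

[Figure: fact matrix F, rows i = 2..8, columns j = 1..8; number of 1-entries per row: 1, 2, 1, 3, 1, 2, 2]
[Figure: F partitioned into row groups 2-4, 5, 6-8, with the 1-entries of each partition summed: 2 2 / 3 / 1 1 3]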


Example: DM

Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
Partition fact matrix using MaxDiff
Sum up partitions into matrix values
Sum up each row/column into (floor, ceiling, size)

[Figure: F partitioned into row groups 2-4, 5, 6-8, with the 1-entries of each partition summed]

Dependency matrix M with row (X) and column (Y) summaries:

  M            Y (1,1,1)   Y (2,4,3)   Y (5,8,4)
  X (2,4,3)        0           2           2
  X (5,5,1)        0           0           3
  X (6,8,3)        1           1           3
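Building a dependency matrix from facts can be sketched as follows. The paper's actual facts are not reproduced on the slides, so the fact list below is hypothetical: it is chosen only so that the row sums (4, 3, 5) and column sums (1, 3, 8) agree with the histogram counts of e shown earlier. The bucket boundaries are passed in explicitly rather than recomputed with MaxDiff, and the function names are made up for this illustration.

```python
def bucket_index(buckets, value):
    """Index of the (floor, ceiling, size) bucket that covers value."""
    for i, (floor, ceiling, _size) in enumerate(buckets):
        if floor <= value <= ceiling:
            return i
    raise ValueError(f"value {value} is not covered by any bucket")


def build_dm(facts, x_buckets, y_buckets):
    """Count the facts falling into each (X bucket, Y bucket) partition."""
    dm = [[0] * len(y_buckets) for _ in x_buckets]
    for x, y in facts:
        dm[bucket_index(x_buckets, x)][bucket_index(y_buckets, y)] += 1
    return dm


# Hypothetical facts e(X, Y), consistent with the counts on the slides.
facts = [(2, 2), (3, 3), (3, 5), (4, 6),
         (5, 5), (5, 6), (5, 7),
         (6, 8), (7, 5), (7, 7), (8, 1), (8, 4)]
x_buckets = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]
y_buckets = [(1, 1, 1), (2, 4, 3), (5, 8, 4)]

M = build_dm(facts, x_buckets, y_buckets)
row_sums = [sum(row) for row in M]
col_sums = [sum(col) for col in zip(*M)]
print(M)         # [[0, 2, 2], [0, 0, 3], [1, 1, 3]]
print(row_sums)  # [4, 3, 5]
print(col_sums)  # [1, 3, 8]
```

Unlike two separate histograms, the matrix keeps which X groups co-occur with which Y groups, while its row and column sums still give back the per-argument bucket counts.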



SDP for Selection by Example

answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.

[Figure: fact matrix F, rows i = 2..8, columns j = 1..8; number of 1-entries per row: 1, 2, 1, 3, 1, 2, 2]

From the fact matrix, we know that size(answer) = Σ F(i,j) for 5 ≤ i ≤ 7 = 6



SDP for Selection by Example

answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.

Extract the portions covered by the selection
Recompute matrix values
Sum them up as size(answer) = 3 + .67 + .67 + 2 = 6.34
For each row, recompute (floor, ceiling, size)

Dependency matrix M of e:

  M            Y (1,1,1)   Y (2,4,3)   Y (5,8,4)
  X (2,4,3)        0           2           2
  X (5,5,1)        0           0           3
  X (6,8,3)        1           1           3

Dependency matrix d of answer:

  d            Y (1,1,1)   Y (2,4,3)   Y (5,8,4)
  X (5,5,1)        0           0           3
  X (6,7,2)       .67         .67          2
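The selection step over a dependency matrix can be sketched in Python. This is an illustration under two assumptions that the slides only imply: values are integers, and each row is scaled by the fraction of its [floor, ceiling] range covered by the selection. The name `select_rows` is hypothetical.

```python
def select_rows(dm, x_buckets, lo, hi):
    """Keep the part of each row bucket inside [lo, hi] and rescale its values."""
    new_dm, new_buckets = [], []
    for row, (floor, ceiling, size) in zip(dm, x_buckets):
        new_floor, new_ceiling = max(floor, lo), min(ceiling, hi)
        if new_floor > new_ceiling:
            continue  # row bucket lies entirely outside the selection
        fraction = (new_ceiling - new_floor + 1) / (ceiling - floor + 1)
        new_dm.append([value * fraction for value in row])
        new_buckets.append((new_floor, new_ceiling, size * fraction))
    return new_dm, new_buckets


M = [[0, 2, 2], [0, 0, 3], [1, 1, 3]]
x_buckets = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]

d, d_rows = select_rows(M, x_buckets, 5, 7)
size_answer = sum(sum(row) for row in d)
print([[round(v, 2) for v in row] for row in d])  # [[0.0, 0.0, 3.0], [0.67, 0.67, 2.0]]
print(round(size_answer, 2))                      # 6.33
```

The unrounded total is 19/3 ≈ 6.33; the slide's 6.34 results from adding the already rounded entries. In contrast to the histogram-only estimate, the Y columns of d still show that most selected facts fall into the bucket (5,8,4).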



Example: Sort-Merge-Join

answer(X,Z) :- a(X,Y), b(Y,Z)
middle(X,Y,Z) is for the ease of explanation

  a         b
  ......    ......
  a(4,3)    b(3,1)
  a(4,4)    b(3,5)
            b(4,5)
  ......    ......

middle: (4,3,1) (4,3,5) (4,4,5) ......
answer: (4,1) (4,5) (4,5) ......   Duplicates!



SDP for Join by Example

answer(X,Z) :- a(X,Y), b(Y,Z).
Simulate Sort-Merge-Join

Dependency matrix A (rows X, columns Y):

  A            Y (1,1,1)   Y (2,4,2)
  X (2,4,3)        2           4
  X (5,5,1)

Dependency matrix B (rows Y, columns Z):

  B            Z (6,8,2)   Z (9,9,1)
  Y (1,1,1)        1
  Y (2,4,2)        3

Align A and B on Y:

  A.X       A.Y       A.Val     B.Y       B.Z       B.Val
  (2,4,3)   (1,1,1)   2         (1,1,1)   (6,8,2)   1
  (2,4,3)   (2,4,2)   4         (2,4,2)   (6,8,2)   3



answer(X,Z) :- a(X,Y), b(Y,Z).

  A.X       A.Y       A.Val     B.Y       B.Z       B.Val
  (2,4,3)   (1,1,1)   2         (1,1,1)   (6,8,2)   1
  (2,4,3)   (2,4,2)   4         (2,4,2)   (6,8,2)   3

Result size of middle(X,Y,Z) can be estimated as
  min(A.Y.size, B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size)

Examples:
  size(middle((2,4,3),(1,1,1),(6,8,2))) ~ min(1,1) × (2/1) × (1/1)
  size(middle((2,4,3),(2,4,2),(6,8,2))) ~ min(2,2) × (4/2) × (3/2)
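The estimate for one aligned pair of matrix cells can be evaluated directly. The sketch below only restates the formula in Python; `middle_size` is a hypothetical name, and the reading of the terms (shared Y values times the average number of a-facts and b-facts per Y value) is an interpretation rather than something the slide states.

```python
def middle_size(a_y_size, a_val, b_y_size, b_val):
    """min(A.Y.size, B.Y.size) * (A.Val / A.Y.size) * (B.Val / B.Y.size)."""
    return min(a_y_size, b_y_size) * (a_val / a_y_size) * (b_val / b_y_size)


# Aligned cells from the example: (A.Y.size, A.Val, B.Y.size, B.Val).
aligned = [(1, 2, 1, 1),   # A.Y = B.Y = (1,1,1)
           (2, 4, 2, 3)]   # A.Y = B.Y = (2,4,2)

estimates = [middle_size(*cells) for cells in aligned]
print(estimates)  # [2.0, 6.0]
```

The two examples therefore evaluate to 2 and 6 tuples of middle, both of which project onto the same answer cell ((2,4,3),(6,8,2)).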



Duplicates!

answer(X,Z) :- a(X,Y), b(Y,Z).
Examples:
  middle((2,4,3),(1,1,1),(6,8,2)) → answer((2,4,3),(6,8,2))
  middle((2,4,3),(2,4,2),(6,8,2)) → answer((2,4,3),(6,8,2))
Three duplicate handling approaches
  Sum: no duplicate removal
  Max: most aggressive removal
  Expected sum: remove “expected” number of duplicates


SDP for Recursive Predicates

Recursive predicates are computed incrementally until they reach approximate fixed points
Size reaches α-approximate fixed point if
  Δ(size)/size ≤ α
where
  Δ(…) is the difference between two consecutive iterations in fixed point computation
  0 ≤ α ≤ 1


Example: Recursive Predicates

Transitive closure
  path(X,Y) :- edge(X,Y).              (base)
  path(X,Y) :- edge(X,Z), path(Z,Y).   (rec)
Computation of the estimate:
  1. Compute size(path) and DM(path) using rule base
  2. Compute size(path) and DM(path) using rule rec as in the case of a join
  3. If size(path) reaches approximate fixed points, stop; otherwise, go to step 2
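The control flow of steps 1 to 3 can be sketched as a small loop. This is not the authors' implementation: the join-based update of size(path) and DM(path) is abstracted into a caller-supplied `step` function, the relative change is measured against the newer size (the slides do not say which of the two sizes is the denominator), and the toy update rule used in the demonstration is purely hypothetical.

```python
def estimate_recursive_size(base_size, step, alpha, max_iters=100):
    """Iterate step() until the relative change in size is at most alpha.

    base_size: size estimate obtained from the base rule (step 1).
    step:      function mapping the current size estimate to the next one,
               standing in for the join-based evaluation of rule rec (step 2).
    Returns (size estimate, number of rec iterations performed).
    """
    size = base_size
    for iteration in range(1, max_iters + 1):
        new_size = step(size)
        delta = abs(new_size - size)
        # Step 3: stop at an alpha-approximate fixed point.
        if new_size == 0 or delta / new_size <= alpha:
            return new_size, iteration
        size = new_size
    return size, max_iters


# Hypothetical update rule for illustration only; it converges towards 20.
toy_step = lambda size: 10 + 0.5 * size

final_size, iterations = estimate_recursive_size(10, toy_step, alpha=0.05)
print(final_size, iterations)  # 19.375 4
```

A smaller α means more iterations and a tighter estimate; α = 0 asks for an exact fixed point of the size, which the `max_iters` guard protects against when it is never reached.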



Experimental Studies

Test programs:
  Transitive closure
  General same generation
Datasets: generated with Thomas Process and Matern Cluster Process
Results
  SDP estimates converge to real sizes for recursive predicates
  Expected sum is good for duplicate removal
  Details in the paper



Experimental Studies

SDP estimates converge to real sizes for recursive predicates
[Figure: Transitive Closure]

Experimental Studies

Expected sum is good for duplicate removal
[Figure: Transitive Closure]

Conclusion

Dependency matrix for binary predicates
  Overcomes problems with argument independence assumption
SDP for selection, join, and recursion
Experimental validations



Future Work

More complex recursions
Negation
Extending SDP to n-ary predicates
Apply cost-based optimization in deductive systems, such as XSB


