Senlin Liang and Michael Kifer Stony Brook University
Deriving Predicate Statistics
(SDP) in Datalog
12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming (PPDP)
July 26, 2010, Hagenberg, Austria
Senlin Liang and Michael Kifer
Stony Brook University
Summary of Our Approach
- Motivation
  - Take advantage of cost-based optimizations in deductive database systems
  - Compute cost information (predicate statistics)
  - Store and retrieve cost information efficiently
  - Apply optimization techniques
- Advantages of our approach
  - Keeps argument dependencies
  - Handles recursion
  - Handles negation
PPDP July 26, 2010
“Deriving Predicate Statistics in Datalog" by Senlin Liang and Michael Kifer 2
Outline
- Introduction
  - Traditional approach: histograms + argument independence assumption
  - Error grows exponentially
- SDP
  - Dependency matrix stores predicate statistics
  - Abstract interpretation of Datalog rules, which are evaluated over dependency matrices
- Experimental studies
- Future work
Histograms
- Data distribution: T = ((v1, f1), ..., (vn, fn)).
  - E.g. ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
- Histograms
  - Partition the data distribution into groups
  - Summarize each group as a bucket: (floor, ceiling, size, count)
  - Compute the values and frequencies in each bucket efficiently
- MaxDiff histograms with β buckets
  - Partition T using the β-1 largest frequency differences
Example: MaxDiff Histograms (3 buckets)
1. Partition T using the 2 largest frequency differences
2. Summarize each group as (floor, ceiling, size, count)
3. Value-frequency approximation: vals(bucket) = [floor, ceiling]; f(val) = count/size, e.g. f(7) = 5/3

T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
Frequency differences between adjacent values: 1, 1, 2, 2, 1, 0
Partition: ((2,1), (3,2), (4,1)) | ((5,3)) | ((6,1), (7,2), (8,2))
Buckets: (2,4,3,4) (5,5,1,3) (6,8,3,5)
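The three steps of the MaxDiff example can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `maxdiff_histogram` and `approx_frequency` are made up for this sketch, and breaking ties in favor of the leftmost gap is an assumption, since the slides do not say how ties are handled.

```python
from typing import List, Tuple

Bucket = Tuple[int, int, int, int]  # (floor, ceiling, size, count)


def maxdiff_histogram(dist: List[Tuple[int, int]], beta: int) -> List[Bucket]:
    """Build a MaxDiff histogram with `beta` buckets from (value, frequency) pairs."""
    dist = sorted(dist)
    # Absolute frequency difference between each pair of adjacent values.
    diffs = [abs(dist[i + 1][1] - dist[i][1]) for i in range(len(dist) - 1)]
    # Positions of the beta-1 largest differences (assumption: ties go left).
    order = sorted(range(len(diffs)), key=lambda i: (-diffs[i], i))
    cuts = sorted(order[: beta - 1])

    buckets = []
    start = 0
    for cut in cuts + [len(dist) - 1]:
        group = dist[start : cut + 1]
        floor, ceiling = group[0][0], group[-1][0]
        count = sum(freq for _, freq in group)
        buckets.append((floor, ceiling, len(group), count))
        start = cut + 1
    return buckets


def approx_frequency(buckets: List[Bucket], value: int) -> float:
    """Approximate f(value) as count/size of the bucket that covers it."""
    for floor, ceiling, size, count in buckets:
        if floor <= value <= ceiling:
            return count / size
    return 0.0


T = [(2, 1), (3, 2), (4, 1), (5, 3), (6, 1), (7, 2), (8, 2)]
buckets = maxdiff_histogram(T, beta=3)
print(buckets)                       # the three buckets of the example
print(approx_frequency(buckets, 7))  # 5/3
```

With β = 3 the two cuts fall after the values 4 and 5, which reproduces the buckets listed in the example.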
Argument Independence Assumption
- Common in database size estimates
- Data distributions of different arguments are independent of each other
  - For example, in predicate p(X,Y), the data distributions of X and Y are independent
- Joint data distribution can be easily computed from individual distributions
  - E.g., p(X=a, Y=b) = p(X=a) × p(Y=b)
- Unfortunately, the independence assumption is almost always wrong in real datasets
Example: Histogram + Independence = Poor Estimate
answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
- Facts: e(2,2), … as in Example 1 of the paper.
- Histogram buckets of e
    X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
    Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
- Size estimate
  - Answer size estimate for each X bucket: |[floor, ceiling] ∩ [5,7]| / |[floor, ceiling]| × count
  - Summed over the X buckets: size(answer) = 0 + 3 + 3.33 = 6.33
Example: Histogram + Independence = Poor Estimate
- Histogram buckets of e
    X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
    Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
- Histogram buckets of answer, with answer.count = e.count × size(answer)/size(e)
    X: (5,5,1,3) (6,7,2,3.33)
    Y: (1,1,1,0.53) (2,4,3,1.58) (5,8,4,4.22)
- Real results for answer.Y: (1,1,1,0) (2,4,3,0) (5,8,4,6)
- Independence causes information loss
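The numbers in this example can be reproduced with a few lines of Python. This is only a sketch of the traditional estimate described on the slides: the helper names are hypothetical, and |[floor, ceiling]| is taken to be the number of integers in the range, which is an assumption that happens to coincide with the bucket sizes here.

```python
def overlap_fraction(floor, ceiling, lo, hi):
    """Fraction of the integer range [floor, ceiling] that lies inside [lo, hi]."""
    covered = max(0, min(ceiling, hi) - max(floor, lo) + 1)
    return covered / (ceiling - floor + 1)


def select_on_x(x_buckets, y_buckets, lo, hi):
    """Estimate size(answer) and the Y histogram of answer for lo <= X <= hi."""
    size_answer = sum(
        overlap_fraction(floor, ceiling, lo, hi) * count
        for floor, ceiling, _size, count in x_buckets
    )
    size_e = sum(count for *_, count in x_buckets)
    # Independence assumption: every Y bucket shrinks by the same factor.
    scale = size_answer / size_e
    y_estimate = [(f, c, s, count * scale) for f, c, s, count in y_buckets]
    return size_answer, y_estimate


x_buckets = [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
y_buckets = [(1, 1, 1, 1), (2, 4, 3, 3), (5, 8, 4, 8)]

size_answer, y_estimate = select_on_x(x_buckets, y_buckets, 5, 7)
print(round(size_answer, 2))
print([round(count, 2) for *_, count in y_estimate])
```

The uniform scaling of the Y buckets is exactly where the dependency between X and Y is lost: the estimate spreads answers over all three Y buckets, while the real answers all fall into the last one.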
Our Approach: Dependency Matrices
- Only considers dependency matrices (DM) for binary predicates
- Partitions facts into local groups
- Sums up the groups into DM values
- Sums up each row/column into (floor, ceiling, size)
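Example: DM
- Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
- Partition the fact matrix using MaxDiff
- Sum up partitions into matrix values

[Fact matrix F: rows 2..8, columns 1..8; number of 1-entries per row: 1, 2, 1, 3, 1, 2, 2]

Partitioned fact matrix, with each partition summed up:
              cols 1   cols 2-4   cols 5-8
  rows 2-4      0         2          2
  row  5        0         0          3
  rows 6-8      1         1          3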
Example: DM
- Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
- Partition the fact matrix using MaxDiff
- Sum up partitions into matrix values
- Sum up each row/column into (floor, ceiling, size)

Dependency matrix M, with row and column summaries:
  M             1: (1,1,1)   2: (2,4,3)   3: (5,8,4)
  1: (2,4,3)        0            2            2
  2: (5,5,1)        0            0            3
  3: (6,8,3)        1            1            3
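A dependency matrix of this shape can be built by counting facts per pair of buckets. The sketch below is illustrative only. The full fact set of the running example is not given on the slides, so `facts` is a hypothetical set chosen solely to be consistent with the bucket counts shown, and the bucket ranges are passed in directly instead of being derived with MaxDiff.

```python
def bucket_index(ranges, value):
    """Index of the (floor, ceiling) range that contains `value`."""
    for k, (floor, ceiling) in enumerate(ranges):
        if floor <= value <= ceiling:
            return k
    raise ValueError(f"{value} is not covered by any bucket")


def dependency_matrix(facts, x_ranges, y_ranges):
    """M[r][c] = number of facts whose X falls in row bucket r and Y in column bucket c."""
    matrix = [[0] * len(y_ranges) for _ in x_ranges]
    for x, y in facts:
        matrix[bucket_index(x_ranges, x)][bucket_index(y_ranges, y)] += 1
    return matrix


def summarize(values, ranges):
    """(floor, ceiling, size) per range, where size counts the distinct values present."""
    distinct = set(values)
    return [(lo, hi, sum(1 for v in distinct if lo <= v <= hi)) for lo, hi in ranges]


# Hypothetical facts: only the per-bucket counts match the slides.
facts = [(2, 2), (3, 3), (3, 5), (4, 6), (5, 5), (5, 6), (5, 7),
         (6, 8), (7, 7), (7, 8), (8, 1), (8, 4)]
x_ranges = [(2, 4), (5, 5), (6, 8)]
y_ranges = [(1, 1), (2, 4), (5, 8)]

M = dependency_matrix(facts, x_ranges, y_ranges)
row_labels = summarize([x for x, _ in facts], x_ranges)
col_labels = summarize([y for _, y in facts], y_ranges)
print(M, row_labels, col_labels)
```

Unlike two separate histograms, the matrix records which X buckets co-occur with which Y buckets, so the dependency between the two arguments is kept.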
SDP for Selection by Example
answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.

[Fact matrix F: rows 2..8, columns 1..8; number of 1-entries per row: 1, 2, 1, 3, 1, 2, 2]

From the fact matrix, we know that size(answer) = Σ F(i,j) for 5 ≤ i ≤ 7 = 6
SDP for Selection by Example
answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
- Extract the portions covered by the selection
- Recompute matrix values
- Sum them up as size(answer) = 3 + .67 + .67 + 2 = 6.34
- For each row, recompute (floor, ceiling, size)

Dependency matrix M of e:
  M             1: (1,1,1)   2: (2,4,3)   3: (5,8,4)
  1: (2,4,3)        0            2            2
  2: (5,5,1)        0            0            3
  3: (6,8,3)        1            1            3

Dependency matrix d of answer:
  d             1: (1,1,1)   2: (2,4,3)   3: (5,8,4)
  1: (5,5,1)        0            0            3
  2: (6,7,2)       .67          .67           2
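The selection step over a dependency matrix can be sketched as below. This is a reading of the example, not the authors' code: `select_rows` is a hypothetical name, and scaling both the matrix values and the row size by the covered fraction of the row's range is an assumption that reproduces the numbers of the example.

```python
def select_rows(matrix, row_labels, lo, hi):
    """Restrict a dependency matrix to rows overlapping lo <= X <= hi.

    Each overlapping row is scaled by the fraction of its integer range
    [floor, ceiling] that the selection covers (assumed uniform spread).
    """
    new_matrix, new_labels = [], []
    for row, (floor, ceiling, size) in zip(matrix, row_labels):
        new_floor, new_ceiling = max(floor, lo), min(ceiling, hi)
        if new_floor > new_ceiling:
            continue  # row lies entirely outside the selection
        fraction = (new_ceiling - new_floor + 1) / (ceiling - floor + 1)
        new_matrix.append([value * fraction for value in row])
        new_labels.append((new_floor, new_ceiling, size * fraction))
    return new_matrix, new_labels


M = [[0, 2, 2], [0, 0, 3], [1, 1, 3]]
row_labels = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]

d, d_labels = select_rows(M, row_labels, 5, 7)
size_answer = sum(sum(row) for row in d)
print(d, d_labels, round(size_answer, 2))
```

The unrounded total is 19/3 ≈ 6.33; the slide's 6.34 comes from adding the entries after rounding each to two decimals. Because the rows keep their column structure, the estimate still places most of the answers in the last Y bucket.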
Example: Sort-Merge-Join
answer(X,Z) :- a(X,Y), b(Y,Z)
- middle(X,Y,Z) is for the ease of explanation

  a           b
  ……          ……
  a(4,3)      b(3,1)
  a(4,4)      b(3,5)
              b(4,5)
  ……          ……

middle: (4,3,1) (4,3,5) (4,4,5) ……
answer: (4,1) (4,5) (4,5) ……   Duplicates!
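The join on the facts shown above can be replayed with a small sort-merge join in Python. The function name `sort_merge_join` is chosen for this sketch, and only the five facts visible on the slide are used; the elided facts are left out.

```python
def sort_merge_join(a, b):
    """Join a(X,Y) with b(Y,Z) on Y and return the middle(X,Y,Z) tuples."""
    a_sorted = sorted(a, key=lambda t: t[1])  # sort a on its second argument
    b_sorted = sorted(b, key=lambda t: t[0])  # sort b on its first argument
    middle = []
    i = j = 0
    while i < len(a_sorted) and j < len(b_sorted):
        ya, yb = a_sorted[i][1], b_sorted[j][0]
        if ya < yb:
            i += 1
        elif ya > yb:
            j += 1
        else:
            # Find the runs with the same Y on both sides and pair them up.
            i_end = i
            while i_end < len(a_sorted) and a_sorted[i_end][1] == ya:
                i_end += 1
            j_end = j
            while j_end < len(b_sorted) and b_sorted[j_end][0] == ya:
                j_end += 1
            for x, y in a_sorted[i:i_end]:
                for _, z in b_sorted[j:j_end]:
                    middle.append((x, y, z))
            i, j = i_end, j_end
    return middle


a = [(4, 3), (4, 4)]
b = [(3, 1), (3, 5), (4, 5)]

middle = sort_merge_join(a, b)
answer = [(x, z) for x, _, z in middle]  # project out Y, keeping duplicates
print(middle)
print(answer, "distinct:", sorted(set(answer)))
```

Projecting Y away maps two different middle tuples to the same answer tuple (4,5), which is the duplicate that a size estimate has to account for.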
SDP for Join by Example
answer(X,Z) :- a(X,Y), b(Y,Z).
- Simulate Sort-Merge-Join

  A (rows: X buckets, columns: Y buckets)
                1: (1,1,1)   2: (2,4,2)
  1: (2,4,3)        2            4
  2: (5,5,1)

  B (rows: Y buckets, columns: Z buckets)
                1: (6,8,2)   2: (9,9,1)
  1: (1,1,1)        1
  2: (2,4,2)        3

Align A and B on the Y buckets:
  A.X, A.Y, A.Val              B.Y, B.Z, B.Val
  (2,4,3), (1,1,1), 2          (1,1,1), (6,8,2), 1
  (2,4,3), (2,4,2), 4          (2,4,2), (6,8,2), 3
SDP for Join by Example
answer(X,Z) :- a(X,Y), b(Y,Z).

  A.X, A.Y, A.Val              B.Y, B.Z, B.Val
  (2,4,3), (1,1,1), 2          (1,1,1), (6,8,2), 1
  (2,4,3), (2,4,2), 4          (2,4,2), (6,8,2), 3

- Result size of middle(X,Y,Z) can be estimated as
    min(A.Y.size, B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size)
- Examples:
    size(middle((2,4,3),(1,1,1),(6,8,2))) ~ min(1,1) × (2/1) × (1/1) = 2
    size(middle((2,4,3),(2,4,2),(6,8,2))) ~ min(2,2) × (4/2) × (3/2) = 6
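The estimate formula translates directly into code. In this sketch the function name `middle_size` and the tuple layout `(x_bucket, y_bucket, val)` are assumptions made for illustration; each bucket is a (floor, ceiling, size) triple as on the slides.

```python
def middle_size(a_entry, b_entry):
    """Estimate size(middle) for one aligned pair of matrix entries.

    a_entry = (A.X bucket, A.Y bucket, A.Val)
    b_entry = (B.Y bucket, B.Z bucket, B.Val)
    """
    _a_x, a_y, a_val = a_entry
    b_y, _b_z, b_val = b_entry
    a_y_size, b_y_size = a_y[2], b_y[2]
    # Matching Y values, times the average number of A and B facts per Y value.
    return min(a_y_size, b_y_size) * (a_val / a_y_size) * (b_val / b_y_size)


aligned = [
    (((2, 4, 3), (1, 1, 1), 2), ((1, 1, 1), (6, 8, 2), 1)),
    (((2, 4, 3), (2, 4, 2), 4), ((2, 4, 2), (6, 8, 2), 3)),
]
estimates = [middle_size(a_entry, b_entry) for a_entry, b_entry in aligned]
print(estimates)
```

For the two aligned pairs this gives 2 and 6, the values of the two example products.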
SDP for Join by Example
answer(X,Z) :- a(X,Y), b(Y,Z).
- Examples:
    middle((2,4,3),(1,1,1),(6,8,2)) → answer((2,4,3),(6,8,2))
    middle((2,4,3),(2,4,2),(6,8,2)) → answer((2,4,3),(6,8,2))
  Duplicates!
- Three duplicate handling approaches
  - Sum: no duplicate removal
  - Max: most aggressive removal
  - Expected sum: remove "expected" number of duplicates
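The first two duplicate handling approaches can be sketched as follows. The sizes 2 and 6 are the middle estimates obtained from the formula min(A.Y.size, B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size) for the two example entries. The function name `combine` is hypothetical, and the expected-sum variant is deliberately left out because the slides do not give its formula.

```python
from collections import defaultdict


def combine(middle_estimates, how):
    """Merge middle(X,Y,Z) estimates that project onto the same answer(X,Z) cell.

    how = "sum": add the estimates, i.e. no duplicate removal
    how = "max": keep only the largest estimate, i.e. most aggressive removal
    """
    reducer = {"sum": sum, "max": max}[how]
    grouped = defaultdict(list)
    for x_bucket, _y_bucket, z_bucket, size in middle_estimates:
        grouped[(x_bucket, z_bucket)].append(size)
    return {cell: reducer(sizes) for cell, sizes in grouped.items()}


middle_estimates = [
    ((2, 4, 3), (1, 1, 1), (6, 8, 2), 2.0),
    ((2, 4, 3), (2, 4, 2), (6, 8, 2), 6.0),
]
cell = ((2, 4, 3), (6, 8, 2))
print(combine(middle_estimates, "sum")[cell])  # upper end: every tuple is distinct
print(combine(middle_estimates, "max")[cell])  # lower end: smaller groups are all duplicates
```

Sum and max bracket the estimate for the answer cell from above and below; the expected-sum approach falls between the two.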
SDP for Recursive Predicates
- Recursive predicates are computed incrementally until they reach approximate fixed points
- Size reaches an α-approximate fixed point if
    Δ(size)/size ≤ α
  where
  - Δ(…) is the difference between two consecutive iterations in the fixed point computation
  - 0 ≤ α ≤ 1
Example: Recursive Predicates
- Transitive closure
    path(X,Y) :- edge(X,Y).             (base)
    path(X,Y) :- edge(X,Z), path(Z,Y).  (rec)
- Computation of the estimate:
  1. Compute size(path) and DM(path) using rule base
  2. Compute size(path) and DM(path) using rule rec, as in the case of a join
  3. If size(path) reaches an approximate fixed point, stop; otherwise, go to step 2
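The control flow of these three steps, with the stopping test Δ(size)/size ≤ α, can be sketched as below. Note the substitution: this sketch computes exact path sets instead of the dependency-matrix estimates that SDP uses in steps 1 and 2, so it only illustrates the iteration and the stopping criterion. Dividing by the size after the iteration is also an assumption, since the slides do not say which of the two consecutive sizes is meant.

```python
def transitive_closure_size(edges, alpha):
    """Iterate rule rec until size(path) reaches an alpha-approximate fixed point.

    Returns (size(path), number of iterations of rule rec).
    """
    path = set(edges)  # step 1: rule base
    if not path:
        return 0, 0
    iterations = 0
    while True:
        iterations += 1
        # Step 2: path(X,Y) :- edge(X,Z), path(Z,Y), computed exactly here.
        derived = {(x, y) for (x, z) in edges for (z2, y) in path if z == z2}
        new_path = path | derived
        delta = len(new_path) - len(path)
        path = new_path
        # Step 3: stop once the relative growth is at most alpha.
        if delta / len(path) <= alpha:
            return len(path), iterations


chain = [(1, 2), (2, 3), (3, 4), (4, 5)]
print(transitive_closure_size(chain, alpha=0.0))   # exact fixed point
print(transitive_closure_size(chain, alpha=0.25))  # stops earlier, slightly smaller size
```

With α = 0 the loop runs to the exact fixed point; a larger α trades accuracy for fewer iterations.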
Experimental Studies
- Test programs:
  - Transitive closure
  - General same generation
- Datasets: generated with Thomas Process and Matérn Cluster Process
- Results
  - SDP estimates converge to real sizes for recursive predicates
  - Expected sum is good for duplicate removal
  - Details in the paper
Experimental Studies
- SDP estimates converge to real sizes for recursive predicates
[Figure: Transitive Closure]
Experimental Studies
- Expected sum is good for duplicate removal
[Figure: Transitive Closure]
Conclusion
- Dependency matrix for binary predicates
- Overcomes problems with the argument independence assumption
- SDP for selection, join, and recursion
- Experimental validations
Future Work
- More complex recursions
- Negation
- Extending SDP to n-ary predicates
- Applying cost-based optimization in deductive systems, such as XSB