General Database Statistics Using Maximum Entropy
Raghav Kaushik (Microsoft Research), Christopher Ré (University of Wisconsin-Madison), and Dan Suciu (University of Washington)
Study Cardinality Estimation
1. Model: the information that the optimizer knows
2. Prediction: use the model to estimate the cardinality of future queries
Contribution: A principled, declarative approach to cardinality estimation based on Entropy Maximization.
Propose a declarative language with statistical assertions
“We estimate that the number of distinct Employees is 10”
Motivating Applications
1. Incorporate query feedback records
Underutilized: No general purpose mechanism
2. Optimizers for new domains (DB Kit 2.0)
Cloud Computing, Information Extraction
3. Data generation and description
Outline
• Statistical programs and desiderata
• Semantics of Statistical Programs
• Two examples
• Conclusions
Statistical Assertions
An assertion is a conjunctive query (CQ) view plus a sharp (#) statement:
V1(x) :- R(x,-)
“The number of values in the output of V1 is 20”
#V1 = 20
V2(y) :- R(-,y),S(y)
“The number of values in the output of V2 is 50”
#V2 = 50
A program is a set of assertions
V(x) :- R(x,y), …. #V= 106
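To make the two views concrete, here is a minimal sketch that evaluates V1 and V2 on a single, made-up instance. The relation contents and the helper names `v1` and `v2` are assumptions for illustration only; on one fixed instance, # is simply the number of distinct output values.

```python
# One concrete instance (a "world") of the schema R(x, y), S(y).
# The tuples below are invented for illustration.
R = {("a", 1), ("a", 2), ("b", 2), ("c", 3)}
S = {2, 3, 4}

def v1(R):
    """V1(x) :- R(x,-): project R onto its first column (set semantics)."""
    return {x for (x, _) in R}

def v2(R, S):
    """V2(y) :- R(-,y), S(y): second column of R, restricted to values in S."""
    return {y for (_, y) in R if y in S}

count_v1 = len(v1(R))     # number of distinct x values in R
count_v2 = len(v2(R, S))  # number of distinct y values of R that also occur in S
print(count_v1, count_v2)
```

A statistical assertion such as #V1 = 20 constrains this count, not the contents of R itself.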
Model as a Probabilistic Database
Intuitively, # is “Expected Value”
V1(x) :- R(x,-)
#V1 = 20
“The number of values in the output of V1 is 20”
A model is a probabilistic database such that the expected number of tuples in V1 is 20.
OK, but which probabilistic database?
Desiderata for our solution
• Two desiderata for the distribution:
(D1): Should agree with provided statistics
(D2): Should assume nothing else
Approach: maximize entropy subject to D1
Challenge: Compute params of MaxEnt Distribution
Technical Desideratum: want params analytically
Notation for Probabilistic Databases
• Consider a domain D of size n.
• Fix a schema R = R1, R2, …
• Let Inst(n) = all instances over R on D
• An element I of Inst(n) is called a world
A probabilistic database is a pair (Inst(n), p), where
$$p : \mathrm{Inst}(n) \to [0,1] \quad \text{s.t.} \quad \sum_{I \in \mathrm{Inst}(n)} p(I) = 1$$
Essentially, any discrete probability distribution on relations
The semantics of #
Achieving (D1): Stats must agree
V1(x) :- R(x,-)
#V1 = 20
“The number of values in the output of V1 is 20”
# means “expected value”:
$$\sum_{I \in \mathrm{Inst}(n)} p(I)\,|V_1^I| = 20$$
NB: In truth, we let n tend to infinity, and settle for asymptotic equality.
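The expected-value reading of # can be checked by brute force on a tiny domain. The sketch below enumerates Inst(n) for a single binary relation R with n = 2 and computes E[|V1|] under a uniform distribution. The uniform choice and the helper names are assumptions for illustration, not the distribution the slides eventually select.

```python
from itertools import combinations, product

def all_instances(n):
    """Enumerate Inst(n) for one binary relation R over the domain {0..n-1}."""
    tuples = list(product(range(n), repeat=2))
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

def v1(instance):
    """V1(x) :- R(x,-)."""
    return {x for (x, _) in instance}

def expected_size(p, view):
    """E[|view|] = sum over worlds I of p(I) * |view(I)|."""
    return sum(prob * len(view(world)) for world, prob in p.items())

n = 2
worlds = list(all_instances(n))                   # 2^(n*n) worlds
p = {world: 1 / len(worlds) for world in worlds}  # uniform p, assumed here
total = sum(p.values())                           # must be 1 for a valid p
e_v1 = expected_size(p, v1)
print(len(worlds), total, e_v1)
```

Under the uniform distribution each possible tuple is present with probability 1/2, so each domain value appears in V1 with probability 3/4 and the expectation is 2 × 3/4.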
Multiple Views
Achieving (D1): Stats must agree
• Given V1, V2, … with #Vi = di for i=1,…,t:
$$\sum_{I \in \mathrm{Inst}(n)} p(I)\,|V_i^I| = d_i \quad \text{for } i = 1,\dots,t$$
If p satisfies these equations, we’ve achieved (D1): Should agree with provided statistics
Many such distributions exist. How do we pick one?
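A small illustration of why (D1) alone does not pin down a distribution: for a unary relation R over a two-element domain and the single statistic #V = 1 with V(x) :- R(x), the two distributions below both satisfy the constraint but differ in entropy. The domain, the statistic, and both distributions are made up for this sketch.

```python
from math import log

# Worlds for a unary relation R over the domain {a, b}.
worlds = [frozenset(), frozenset("a"), frozenset("b"), frozenset("ab")]

def expected_count(p):
    """E[|V|] for V(x) :- R(x), with p listed in the same order as worlds."""
    return sum(prob * len(world) for world, prob in zip(worlds, p))

def entropy(p):
    """Shannon entropy in nats; zero-probability worlds contribute nothing."""
    return -sum(q * log(q) for q in p if q > 0)

point_mass = [0.0, 1.0, 0.0, 0.0]        # always the world {a}
independent = [0.25, 0.25, 0.25, 0.25]   # each element present with prob. 1/2

for p in (point_mass, independent):
    print(expected_count(p), entropy(p))
```

The point mass commits to a specific world, an assumption the statistic never stated; the tuple-independent distribution spreads probability as evenly as the constraint allows.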
Selecting the best one
Achieving (D2): No ad-hoc assumptions
• Maximize Entropy subject to constraints:
$$\#V_i = d_i \quad \text{for } i = 1,\dots,t$$
One can show that p has the following form:
$$p(I) = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i^I|}$$
Z is a normalizing constant and $\alpha_i$ is a positive parameter for i=1,…,t
NB: p is only a function of the stats, and so we have achieved (D2)
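For readers who want the missing step, the product form follows from the standard Lagrange-multiplier argument for entropy maximization. The sketch below is not taken from the slides; the multipliers $\lambda_0, \lambda_i$ are notation introduced here, and the parameters are recovered as $\alpha_i = e^{-\lambda_i}$.

```latex
\begin{align*}
\text{maximize } & H(p) = -\sum_{I \in \mathrm{Inst}(n)} p(I)\,\log p(I) \\
\text{subject to } & \sum_{I} p(I) = 1, \qquad
                     \sum_{I} p(I)\,|V_i^I| = d_i \quad (i = 1,\dots,t) \\[4pt]
L(p,\lambda) &= H(p) - \lambda_0\Big(\sum_I p(I) - 1\Big)
               - \sum_{i=1}^{t} \lambda_i \Big(\sum_I p(I)\,|V_i^I| - d_i\Big) \\
\frac{\partial L}{\partial p(I)} &= -\log p(I) - 1 - \lambda_0
               - \sum_{i=1}^{t} \lambda_i\,|V_i^I| = 0 \\
p(I) &= e^{-1-\lambda_0} \prod_{i=1}^{t} \big(e^{-\lambda_i}\big)^{|V_i^I|}
      = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i^I|},
      \qquad \alpha_i = e^{-\lambda_i},\quad Z = e^{1+\lambda_0}
\end{align*}
```

The constraints then determine the multipliers, which is why p depends only on the provided statistics.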
Benefits of MaxEnt
A statistical program:
$$\#V_i = d_i \quad \text{for } i = 1,\dots,t$$
$$p(I) = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i^I|}$$
• Every (consistent) statistical program induces a well-defined distribution
– Every query has a well-defined cardinality estimate
• Statistics are treated as a whole, not as individual stats.
• Can add new statistics to our heart’s content
Technical Challenge: find the $\alpha_i$ analytically
Two quick Examples
• I: A material random Graph
– Even simple EM solutions have interesting theory
• II: Intersection Models
– Generating function, and
– Different, analytic technique
Example I: Random Graphs are EM
V(x,y) :- R(x,y)   #V = d
Random Graph: Add edges independently at random
Let $v = |V^I|$ and $x = d/n^2$:
$$p(I) = x^{v} (1-x)^{n^2 - v}$$
By Linearity, $E[V] = x n^2 = d$
This is MaxEnt…write:
$$p(I) = \frac{1}{Z}\,\alpha^{v}, \qquad \alpha = \frac{x}{1-x}, \qquad Z = (1-x)^{-n^2}$$
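The equivalence between the independent-edge form and the MaxEnt form can be verified by brute force on a tiny domain. This is a sketch with illustrative values n = 2 and d = 1 (not from the slides); it enumerates all 2^(n²) worlds and checks that the two formulas agree, that the probabilities sum to 1, and that the expected number of edges is d.

```python
from itertools import product

n, d = 2, 1.0          # illustrative values, not from the slides
m = n * n              # number of possible edges (x, y)
x = d / m              # per-edge probability, x = d / n^2
alpha = x / (1 - x)    # MaxEnt parameter
Z = (1 - x) ** (-m)    # normalizing constant

total_prob = 0.0
expected_edges = 0.0
max_gap = 0.0
# A world is a 0/1 vector saying which of the m possible edges are present.
for world in product([0, 1], repeat=m):
    v = sum(world)                           # v = |V^I|
    p_edges = x ** v * (1 - x) ** (m - v)    # independent-edge form
    p_maxent = alpha ** v / Z                # MaxEnt form
    max_gap = max(max_gap, abs(p_edges - p_maxent))
    total_prob += p_maxent
    expected_edges += p_maxent * v

print(total_prob, expected_edges, max_gap)
```

Brute force is only feasible for very small n, which is part of why the slides ask for the parameters analytically.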
Example II: an intersection model
V(x) :- R1(x), R2(x)   #R1 = d1, #R2 = d2, #V = d3
$$Z_n = (1 + x_1 + x_2 + x_1 x_2 x_3)^n$$
Read: Each element is in neither relation, in R1 only, in R2 only, or in all three of R1, R2, and V
e.g., a term with $x_1^k$ is an instance with k distinct values in R1
Key identity:
$$x \frac{d}{dx} x^k = k x^k$$
Applying it to each parameter:
$$d_3 = \frac{1}{Z_n}\, x_3 \frac{\partial Z_n}{\partial x_3} = \frac{n\, x_1 x_2 x_3}{1 + x_1 + x_2 + x_1 x_2 x_3}$$
$$d_i = \frac{1}{Z_n}\, x_i \frac{\partial Z_n}{\partial x_i} = \frac{n\,(x_i + x_1 x_2 x_3)}{1 + x_1 + x_2 + x_1 x_2 x_3} \quad \text{for } i = 1, 2$$
Solving, asymptotically in n:
$$x_i \approx \frac{d_i - d_3}{n} \quad \text{for } i = 1, 2, \qquad x_3 \approx \frac{d_3}{n\, x_1 x_2}$$
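As a numerical sanity check, the three expectation equations can also be solved exactly for finite n. The closed form in `solve_intersection` is derived here by elementary algebra from the generating function above, not quoted from the slides, and it assumes d3 < min(d1, d2) and d1 + d2 - d3 < n. The function names and sample values are illustrative.

```python
def solve_intersection(n, d1, d2, d3):
    """Exact parameters for Z = (1 + x1 + x2 + x1*x2*x3)^n.

    Derived for this sketch by solving the three expectation equations;
    assumes d3 < min(d1, d2) and d1 + d2 - d3 < n so all parameters are positive.
    """
    m = n - d1 - d2 + d3            # expected number of elements in neither relation
    x1 = (d1 - d3) / m
    x2 = (d2 - d3) / m
    x3 = d3 * m / ((d1 - d3) * (d2 - d3))
    return x1, x2, x3

def expected_counts(n, x1, x2, x3):
    """E[|R1|], E[|R2|], E[|V|] when each element independently picks a state."""
    both = x1 * x2 * x3             # weight of "in R1 and R2", hence in V
    z1 = 1 + x1 + x2 + both         # per-element normalizer
    return n * (x1 + both) / z1, n * (x2 + both) / z1, n * both / z1

n, d1, d2, d3 = 1000, 20, 50, 5     # illustrative values
x1, x2, x3 = solve_intersection(n, d1, d2, d3)
e1, e2, e3 = expected_counts(n, x1, x2, x3)
print(e1, e2, e3)
print(x1, (d1 - d3) / n)            # exact value vs. the large-n approximation
```

For these values the exact x1 and the approximation (d1 - d3)/n differ only by the factor n/(n - d1 - d2 + d3), which tends to 1 as n grows.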
Results in the paper
• Normal Form for statistical programs
• Syntactic classes that we can solve analytically
– “Project-Semijoin” queries (previous slide)
• A general technique, conditioning:
– Start with tuple independent prior, and condition
– Introduces inclusion constraints
• Extensions to handle histograms
Conclusion
• Showed a principled, general model for database statistics based on MaxEnt
• Analytically solved syntactic classes of statistics
• Applications: Query Feedback and the Cloud