PGM: Tirgul 10: Parameter Learning and Priors
2
Why learning?
- Knowledge acquisition bottleneck: knowledge acquisition is an expensive process, and often we do not have an expert
- Data is cheap: vast amounts of data become available to us
- Learning allows us to build systems based on the data
3
Learning Bayesian networks
[Figure: an Inducer takes Data + Prior information and outputs a Bayesian network over E, R, B, A, C together with its CPTs, e.g. P(A | E, B).]
4
Known Structure -- Complete Data
Example data over (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>
[Figure: given the fixed structure E -> A <- B and the data, the Inducer fills in the unknown ("?") CPT entries, producing the estimated P(A | E, B).]
- Network structure is specified; the inducer only needs to estimate parameters
- Data does not contain missing values
5
Unknown Structure -- Complete Data
Example data over (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>
[Figure: the Inducer must both select the arcs and estimate the CPTs, e.g. P(A | E, B), starting from unknown structure and unknown parameters.]
- Network structure is not specified; the inducer needs to select arcs and estimate parameters
- Data does not contain missing values
6
Known Structure -- Incomplete Data
[Figure: given the fixed structure E -> A <- B and data with missing values, the Inducer fills in the unknown ("?") CPT entries of P(A | E, B).]
- Network structure is specified
- Data contains missing values; we consider assignments to the missing values
Example data over (E, B, A): <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y>
7
Known Structure / Complete Data
Given a network structure G and a choice of parametric family for P(X_i | Pa_i), learn the parameters of the network.
Goal: construct a network that is "closest" to the probability distribution that generated the data.
8
Example: Binomial Experiment (Statistics 101)
When tossed, a thumbtack can land in one of two positions: Head or Tail.
We denote by θ the (unknown) probability P(H).
Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ.
[Figure: a thumbtack; the two outcomes are Head and Tail.]
9
Statistical Parameter Fitting
Consider instances x[1], x[2], …, x[M] (i.i.d. samples) such that:
- The set of values that x can take is known
- Each is sampled from the same distribution
- Each is sampled independently of the rest
The task is to find a parameter θ so that the data can be summarized by a probability P(x[j] | θ).
This depends on the given family of probability distributions: multinomial, Gaussian, Poisson, etc. For now, we focus on multinomial distributions.
10
The Likelihood Function
How good is a particular θ? It depends on how likely it is to generate the observed data:
$$L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)$$
The likelihood for the sequence H, T, T, H, H is
$$L(\theta : D) = \theta \cdot (1 - \theta) \cdot (1 - \theta) \cdot \theta \cdot \theta$$
[Plot: L(θ : D) as a function of θ over [0, 1].]
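A minimal Python sketch of this computation (only the toss sequence H, T, T, H, H from the slide is assumed; the grid resolution is arbitrary):

```python
import numpy as np

def likelihood(theta, tosses):
    """L(theta : D) = prod_m P(x[m] | theta) for an i.i.d. Bernoulli sample.
    tosses is a sequence of 'H'/'T'; theta is P(H)."""
    n_heads = sum(1 for t in tosses if t == "H")
    n_tails = len(tosses) - n_heads
    return theta**n_heads * (1.0 - theta)**n_tails

data = ["H", "T", "T", "H", "H"]
thetas = np.linspace(0.0, 1.0, 101)                 # grid over [0, 1]
values = [likelihood(t, data) for t in thetas]
print("theta maximizing L on the grid:", thetas[int(np.argmax(values))])  # ~0.6
```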
11
Sufficient Statistics
To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails).
N_H and N_T are sufficient statistics for the binomial distribution:
$$L(\theta : D) = \theta^{N_H} (1 - \theta)^{N_T}$$
12
Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
Formally, s(D) is a sufficient statistic if for any two datasets D and D':
$$s(D) = s(D') \;\Rightarrow\; L(\theta : D) = L(\theta : D')$$
[Figure: many different datasets map to the same, much smaller, set of sufficient statistics.]
13
Maximum Likelihood Estimation
MLE Principle: choose parameters that maximize the likelihood function.
This is one of the most commonly used estimators in statistics, and it is intuitively appealing.
14
Example: MLE in Binomial Data
Applying the MLE principle we get
$$\hat{\theta} = \frac{N_H}{N_H + N_T}$$
(which coincides with what one would expect).
Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6.
[Plot: L(θ : D) over θ ∈ [0, 1], peaking at the MLE.]
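The same estimate computed directly from the sufficient statistics, as a trivial sketch of the formula above:

```python
def mle_binomial(n_heads, n_tails):
    """MLE of theta = P(H) from the sufficient statistics (N_H, N_T)."""
    return n_heads / (n_heads + n_tails)

print(mle_binomial(3, 2))  # 0.6, matching the slide's example
```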
15
Learning Parameters for a Bayesian Network
[Figure: example network with edges E -> A, B -> A, A -> C.]
Training data has the form:
$$D = \{\langle E[1], B[1], A[1], C[1] \rangle, \ldots, \langle E[M], B[M], A[M], C[M] \rangle\}$$
16
Learning Parameters for a Bayesian Network
Since we assume i.i.d. samples, the likelihood function is
$$L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)$$
17
Learning Parameters for a Bayesian Network
By the definition of the network, we get
$$L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta) = \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)$$
18
Learning Parameters for a Bayesian Network
Rewriting terms, we get
$$L(\Theta : D) = \left[ \prod_m P(E[m] : \Theta) \right] \left[ \prod_m P(B[m] : \Theta) \right] \left[ \prod_m P(A[m] \mid B[m], E[m] : \Theta) \right] \left[ \prod_m P(C[m] \mid A[m] : \Theta) \right]$$
19
General Bayesian Networks
Generalizing for any Bayesian network:
$$L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta) \qquad \text{(i.i.d. samples)}$$
$$= \prod_m \prod_i P(x_i[m] \mid Pa_i[m] : \Theta_i) \qquad \text{(network factorization)}$$
$$= \prod_i \prod_m P(x_i[m] \mid Pa_i[m] : \Theta_i) = \prod_i L_i(\Theta_i : D)$$
The likelihood decomposes according to the structure of the network.
20
General Bayesian Networks (Cont.)
Decomposition implies independent estimation problems: if the parameters for each family are not related, then they can be estimated independently of each other (see the sketch below).
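A small sketch illustrating the decomposition on the E, B, A, C example; the CPT numbers and data records below are hypothetical, chosen only to show that the joint log-likelihood equals the sum of per-family terms:

```python
import math

# Toy complete-data records for the example network E -> A <- B, A -> C (binary values).
data = [
    {"E": 1, "B": 0, "A": 0, "C": 0},
    {"E": 1, "B": 1, "A": 1, "C": 1},
    {"E": 0, "B": 0, "A": 1, "C": 1},
]

# Hypothetical parameters (illustrative numbers only): probability of "1" per family.
theta_E = 0.1                                                       # P(E=1)
theta_B = 0.2                                                       # P(B=1)
theta_A = {(0, 0): 0.01, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.99}    # P(A=1 | E, B)
theta_C = {0: 0.05, 1: 0.9}                                         # P(C=1 | A)

def lp(p_one, value):
    """Log-probability of a binary value under a Bernoulli with P(1) = p_one."""
    return math.log(p_one if value == 1 else 1.0 - p_one)

# Log-likelihood accumulated record by record over the joint distribution ...
joint = sum(
    lp(theta_E, r["E"]) + lp(theta_B, r["B"])
    + lp(theta_A[(r["E"], r["B"])], r["A"]) + lp(theta_C[r["A"]], r["C"])
    for r in data
)

# ... equals a sum of independent per-family terms L_i(theta_i : D).
per_family = (
    sum(lp(theta_E, r["E"]) for r in data)
    + sum(lp(theta_B, r["B"]) for r in data)
    + sum(lp(theta_A[(r["E"], r["B"])], r["A"]) for r in data)
    + sum(lp(theta_C[r["A"]], r["C"]) for r in data)
)
print(abs(joint - per_family) < 1e-12)  # True: the likelihood decomposes by family
```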
21
From Binomial to Multinomial
For example, suppose X can take the values 1, 2, …, K.
We want to learn the parameters θ_1, θ_2, …, θ_K.
Sufficient statistics: N_1, N_2, …, N_K, the number of times each outcome is observed.
Likelihood function:
$$L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$$
MLE:
$$\hat{\theta}_k = \frac{N_k}{N}, \qquad \text{where } N = \sum_{\ell} N_{\ell}$$
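A minimal sketch of the multinomial MLE from counts (the sample values below are arbitrary):

```python
from collections import Counter

def multinomial_mle(samples):
    """MLE theta_k = N_k / N for a categorical/multinomial variable."""
    counts = Counter(samples)
    total = len(samples)
    return {k: n / total for k, n in counts.items()}

print(multinomial_mle([1, 3, 2, 1, 1, 2]))  # e.g. {1: 0.5, 3: 0.166..., 2: 0.333...}
```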
22
Likelihood for Multinomial Networks
When we assume that P(X_i | Pa_i) is multinomial, we get a further decomposition:
$$L(\Theta_{X_i \mid Pa_i} : D) = \prod_m P(x_i[m] \mid pa_i[m] : \Theta_i) = \prod_{pa_i} \prod_{m : Pa_i[m] = pa_i} P(x_i[m] \mid pa_i : \Theta_i) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}$$
where N(x_i, pa_i) is the number of instances in which X_i = x_i and Pa_i = pa_i.
23
Likelihood for Multinomial Networks
When we assume that P(X_i | Pa_i) is multinomial, we get a further decomposition:
$$L(\Theta_{X_i \mid Pa_i} : D) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}$$
For each value pa_i of the parents of X_i we get an independent multinomial problem.
The MLE is
$$\hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}$$
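A sketch of the per-family MLE on complete data; the records and variable names are hypothetical, and only parent configurations that appear in the data get estimates:

```python
from collections import Counter

def family_mle(records, child, parents):
    """MLE of P(child | parents) from complete data:
    theta_{x|pa} = N(x, pa) / N(pa)."""
    n_joint = Counter((tuple(r[p] for p in parents), r[child]) for r in records)
    n_parent = Counter(tuple(r[p] for p in parents) for r in records)
    return {(x, pa): n / n_parent[pa] for (pa, x), n in n_joint.items()}

data = [
    {"E": 1, "B": 0, "A": 0},
    {"E": 1, "B": 0, "A": 1},
    {"E": 1, "B": 0, "A": 1},
    {"E": 0, "B": 1, "A": 1},
]
print(family_mle(data, child="A", parents=["E", "B"]))
# e.g. the entry (1, (1, 0)) is P(A=1 | E=1, B=0) = 2/3
```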
24
Maximum Likelihood Estimation
Consistency: the estimate converges to the best possible value as the number of examples grows.
To make this formal, we need to introduce some definitions.
25
KL-Divergence
Let P and Q be two distributions over X. A measure of distance between P and Q is the Kullback-Leibler divergence:
$$KL(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
If KL(P||Q) = 1 (when logs are in base 2), then the probability Q assigns to an instance is, on average, half the probability P assigns to it.
KL(P||Q) ≥ 0, and KL(P||Q) = 0 iff P and Q are equal.
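A minimal sketch of the KL computation for distributions given as dictionaries (base-2 logs, as on the slide; terms with P(x) = 0 are skipped by convention):

```python
import math

def kl_divergence(p, q, base=2):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); p and q map outcomes to probabilities."""
    return sum(px * math.log(px / q[x], base) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q))   # > 0
print(kl_divergence(p, p))   # 0.0: KL(P || P) = 0
```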
26
Consistency
Let P(X | θ) be a parametric family (we need various regularity conditions that we won't go into now).
Let P*(X) be the distribution that generates the data, and let θ̂_D be the MLE estimate given a dataset D.
Theorem: as N → ∞, θ̂_D → θ* with probability 1, where
$$\theta^* = \arg\min_{\theta} KL\big(P^*(X) \,\big\|\, P(X : \theta)\big)$$
27
Consistency -- Geometric Interpretation
[Figure: in the space of probability distributions, P* lies outside the set of distributions representable as P(X | θ); the member of that set closest to P* is P(X | θ*).]
28
Is MLE all we need?
Suppose that after 10 observations, the MLE is P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss?
Suppose now that after 10 observations, the MLE is P(H) = 0.7 for a coin. Would you place the same bet?
29
Bayesian Inference
Frequentist approach:
- Assumes there is an unknown but fixed parameter θ
- Estimates θ with some confidence
- Predicts using the estimated parameter value
Bayesian approach:
- Represents uncertainty about the unknown parameter θ
- Uses probability to quantify this uncertainty: unknown parameters are treated as random variables
- Prediction follows from the rules of probability: expectation over the unknown parameters
30
Bayesian Inference (cont.)
We can represent our uncertainty about the sampling process using a Bayesian network:
- The values of X are independent given θ
- The conditional probabilities P(x[m] | θ) are the parameters in the model
- Prediction is now inference in this network
[Figure: a node θ with children X[1], X[2], …, X[M] (the observed data) and X[M+1] (the query).]
31
Bayesian Inference (cont.)
Prediction as inference in this network:
$$P(x[M+1] \mid x[1], \ldots, x[M]) = \int P(x[M+1] \mid \theta, x[1], \ldots, x[M])\, P(\theta \mid x[1], \ldots, x[M])\, d\theta = \int P(x[M+1] \mid \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta$$
where, by Bayes' rule,
$$P(\theta \mid x[1], \ldots, x[M]) = \frac{P(x[1], \ldots, x[M] \mid \theta)\, P(\theta)}{P(x[1], \ldots, x[M])}$$
(posterior = likelihood × prior / probability of data).
32
Example: Binomial Data Revisited
Prior: uniform for θ in [0, 1], i.e. P(θ) = 1.
Then P(θ | D) is proportional to the likelihood L(θ : D):
$$P(\theta \mid x[1], \ldots, x[M]) \propto P(x[1], \ldots, x[M] \mid \theta)\, P(\theta)$$
With (N_H, N_T) = (4, 1):
- The MLE for P(X = H) is 4/5 = 0.8
- The Bayesian prediction is
$$P(x[M+1] = H \mid D) = \int \theta\, P(\theta \mid D)\, d\theta = \frac{5}{7} \approx 0.7142$$
[Plot: the posterior P(θ | D) over θ ∈ [0, 1].]
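A sketch contrasting the two predictions for this example; with the uniform prior the posterior is Beta(N_H + 1, N_T + 1), whose mean gives the Bayesian prediction:

```python
def mle_prediction(n_heads, n_tails):
    """MLE-based prediction of P(next toss = H)."""
    return n_heads / (n_heads + n_tails)

def bayes_prediction_uniform(n_heads, n_tails):
    """Predictive P(x[M+1] = H | D) under a uniform (Beta(1,1)) prior:
    the posterior is Beta(N_H + 1, N_T + 1) with mean (N_H + 1) / (N_H + N_T + 2)."""
    return (n_heads + 1) / (n_heads + n_tails + 2)

print(mle_prediction(4, 1))             # 0.8
print(bayes_prediction_uniform(4, 1))   # 0.7142... = 5/7
```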
33
Bayesian Inference and MLE
In our example, MLE and Bayesian prediction differ.
But if the prior is well-behaved (does not assign zero density to any "feasible" parameter value), then both MLE and Bayesian prediction converge to the same value: both are consistent.
34
Dirichlet Priors
Recall that the likelihood function is
$$L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$$
A Dirichlet prior with hyperparameters α_1, …, α_K is defined as
$$P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \qquad \text{for legal } \theta_1, \ldots, \theta_K$$
Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K:
$$P(\Theta \mid D) \propto P(\Theta)\, P(D \mid \Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\, \prod_{k=1}^{K} \theta_k^{N_k} = \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}$$
35
Dirichlet Priors (cont.)
We can compute the prediction on a new event in closed form: if P(Θ) is Dirichlet with hyperparameters α_1, …, α_K, then
$$P(X[1] = k) = \int \theta_k\, P(\Theta)\, d\Theta = \frac{\alpha_k}{\sum_{\ell} \alpha_{\ell}}$$
Since the posterior is also Dirichlet, we get
$$P(X[M+1] = k \mid D) = \int \theta_k\, P(\Theta \mid D)\, d\Theta = \frac{\alpha_k + N_k}{\sum_{\ell} (\alpha_{\ell} + N_{\ell})}$$
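A sketch of the closed-form update and prediction; the outcome labels and counts are the thumbtack example, but any categorical outcome space works:

```python
def dirichlet_posterior(alphas, counts):
    """Posterior hyperparameters: alpha_k + N_k."""
    return {k: alphas[k] + counts.get(k, 0) for k in alphas}

def dirichlet_prediction(alphas, counts):
    """Closed-form predictive P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    post = dirichlet_posterior(alphas, counts)
    total = sum(post.values())
    return {k: a / total for k, a in post.items()}

alphas = {"H": 1, "T": 1}              # uniform prior over a binary outcome
counts = {"H": 4, "T": 1}              # observed sufficient statistics
print(dirichlet_posterior(alphas, counts))   # {'H': 5, 'T': 2}
print(dirichlet_prediction(alphas, counts))  # {'H': 0.714..., 'T': 0.285...}
```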
36
Dirichlet Priors -- Example
[Plot: Dirichlet densities over [0, 1] for Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5).]
37
Prior Knowledge
The hyperparameters α_1, …, α_K can be thought of as "imaginary" counts from our prior experience.
Equivalent sample size = α_1 + … + α_K.
The larger the equivalent sample size, the more confident we are in our prior.
38
Effect of Priors
Prediction of P(X = H) after seeing data with N_H = 0.25 · N_T, for different sample sizes.
[Plots: left panel: different prior strength α_H + α_T with a fixed ratio α_H / α_T; right panel: fixed prior strength α_H + α_T with different ratios α_H / α_T. Both show the prediction as a function of the number of samples.]
39
Effect of Priors (cont.)
In real data, Bayesian estimates are less sensitive to noise in the data.
[Plot: P(X = 1 | D) as a function of N for the MLE and for Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(5, 5), and Dirichlet(10, 10) priors, shown together with the underlying toss results.]
40
Conjugate Families
The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy.
The Dirichlet prior is a conjugate family for the multinomial likelihood.
Conjugate families are useful since:
- For many distributions we can represent them with hyperparameters
- They allow sequential updates within the same representation
- In many cases we have a closed-form solution for prediction
41
Bayesian Networks and Bayesian Prediction
- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters
[Figure: plate notation with parameter nodes θ_X and θ_Y|X and a plate over m containing X[m] and Y[m]; unrolled, the observed data are X[1..M], Y[1..M] and the query is X[M+1], Y[M+1].]
42
Bayesian Networks and Bayesian Prediction (Cont.)
We can also "read" from the network: given complete data, the posteriors on the parameters are independent.
[Figure: the same plate/unrolled network as above; once all X[m] and Y[m] are observed, θ_X and θ_Y|X are independent.]
43
Bayesian Prediction (cont.)
Since the posteriors on the parameters of each family are independent, we can compute them separately.
Posteriors for parameters within a family are also independent: given complete data, the posteriors on θ_Y|X=0 and θ_Y|X=1 are independent.
[Figure: refined plate model in which θ_Y|X is split into separate nodes θ_Y|X=0 and θ_Y|X=1.]
44
Bayesian Prediction (cont.)
Given these observations, we can compute the posterior for each multinomial θ_{X_i | pa_i} independently.
The posterior is Dirichlet with parameters α(X_i = 1 | pa_i) + N(X_i = 1 | pa_i), …, α(X_i = k | pa_i) + N(X_i = k | pa_i).
The predictive distribution is then represented by the parameters
$$\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$$
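A sketch of this predictive estimate for a single family; the `alpha` table of imaginary counts and the data records below are hypothetical:

```python
from collections import Counter

def family_bayes(records, child, parents, alpha):
    """Predictive parameters for one family:
    theta~_{x|pa} = (alpha(x, pa) + N(x, pa)) / (alpha(pa) + N(pa)).
    `alpha` maps (x, pa) -> imaginary count; its keys define the CPT entries."""
    n_joint = Counter((r[child], tuple(r[p] for p in parents)) for r in records)
    n_pa = Counter(tuple(r[p] for p in parents) for r in records)
    a_pa = Counter()
    for (x, pa), a in alpha.items():
        a_pa[pa] += a
    return {
        (x, pa): (a + n_joint.get((x, pa), 0)) / (a_pa[pa] + n_pa.get(pa, 0))
        for (x, pa), a in alpha.items()
    }

alpha = {(x, (e,)): 1.0 for x in (0, 1) for e in (0, 1)}   # one imaginary count per entry
data = [{"E": 1, "A": 1}, {"E": 1, "A": 1}, {"E": 1, "A": 0}, {"E": 0, "A": 0}]
print(family_bayes(data, child="A", parents=["E"], alpha=alpha))
# e.g. theta~_{A=1|E=1} = (1 + 2) / (2 + 3) = 0.6
```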
45
Assessing Priors for Bayesian Networks
We need α(x_i, pa_i) for each node X_i.
We can use initial parameters Θ_0 as prior information, together with an equivalent sample size parameter M_0.
Then we let α(x_i, pa_i) = M_0 · P(x_i, pa_i | Θ_0).
This allows us to update a network using new data.
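A sketch of this construction for a single edge E -> A; the prior joint P(A, E | Θ_0) below is hypothetical:

```python
def assess_priors(m0, joint_prior):
    """Imaginary counts alpha(x_i, pa_i) = M0 * P(x_i, pa_i | Theta_0).
    `joint_prior` maps (x_i, pa_i) -> prior joint probability under Theta_0."""
    return {key: m0 * p for key, p in joint_prior.items()}

# Hypothetical prior joint P(A, E | Theta_0): keys are (a, (e,)), values sum to 1.
joint_prior = {(1, (1,)): 0.08, (0, (1,)): 0.02, (1, (0,)): 0.10, (0, (0,)): 0.80}
print(assess_priors(m0=10, joint_prior=joint_prior))
# {(1, (1,)): 0.8, (0, (1,)): 0.2, (1, (0,)): 1.0, (0, (0,)): 8.0}
```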
46
Learning Parameters: Case Study (cont.)
Experiment:
- Sample a stream of instances from the Alarm network
- Learn parameters using the MLE estimator and a Bayesian estimator with uniform priors of different strengths
47
Learning Parameters: Case Study (cont.)
[Plot: KL divergence between the learned and the true distributions as a function of the number of samples M, for the MLE and for Bayesian estimation with a uniform prior of strength M' = 5, 10, 20, and 50.]
48
Learning Parameters: Summary
- Estimation relies on sufficient statistics; for multinomials these are of the form N(x_i, pa_i)
- Parameter estimation:
$$\text{MLE:}\;\; \hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)} \qquad\qquad \text{Bayesian (Dirichlet):}\;\; \tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$$
- Bayesian methods also require a choice of priors
- Both MLE and Bayesian estimation are asymptotically equivalent and consistent
- Both can be implemented in an on-line manner by accumulating sufficient statistics