Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009.
Reviewggordon/10601/slides/Lec13_MCMC.pdf · • After 1000 samples, minus burn-in of 100: final...
Review
• Multiclass logistic regression
• Priors, conditional MAP logistic regression
• Bayesian logistic regression
‣ MAP is not always typical of posterior
‣ posterior predictive can avoid overfitting
1
Review
• Finding posterior predictive distribution often requires numerical integration
‣ uniform sampling
‣ importance sampling
‣ parallel importance sampling
• These are all Monte-Carlo algorithms
‣ another well-known MC algorithm coming up
2
Application: SLAM
Eliazar and Parr, IJCAI-03
3
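Parallel IS
• Pick N samples Xi from proposal Q(X)
• If we knew Wi = P(Xi)/Q(Xi), could do IS
• Instead, set $\hat{W}_i = Z\,P(X_i)/Q(X_i)$ (computable without knowing Z) and $\bar{W} = \frac{1}{N}\sum_i \hat{W}_i$
• Then: $E(\bar{W}) = Z$
• Final estimate: $E_P(g(X)) \approx \frac{1}{N}\sum_i \frac{\hat{W}_i}{\bar{W}}\, g(X_i)$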
4
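The parallel IS recipe can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the target (a Gaussian with its normalizer dropped), the proposal, and the names `self_normalized_is`, `unnorm_target`, and `normal_pdf` are all assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def self_normalized_is(g, unnorm_target, sample_q, q_pdf, n, rng):
    """Estimate E_P[g(X)] when P is known only up to its normalizer Z."""
    xs = [sample_q(rng) for _ in range(n)]
    # Unnormalized weights: Z * P(x) / Q(x), computable without knowing Z.
    w_hat = [unnorm_target(x) / q_pdf(x) for x in xs]
    w_bar = sum(w_hat) / n  # mean weight, an estimate of Z
    estimate = sum(w * g(x) for w, x in zip(w_hat, xs)) / (n * w_bar)
    return estimate, w_bar


# Hypothetical target: N(1, 0.5^2) without its normalizing constant.
def unnorm_target(x):
    return math.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)


rng = random.Random(0)
est, z_hat = self_normalized_is(
    g=lambda x: x,
    unnorm_target=unnorm_target,
    sample_q=lambda r: r.gauss(0.0, 2.0),     # proposal Q = N(0, 2^2)
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    n=20000,
    rng=rng,
)
```

Here `est` should land near the true mean 1, and `z_hat` near the dropped normalizer 0.5·sqrt(2π).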
Parallel IS is biased
[Figure: 1 / mean(weights) plotted against mean(weights), with E(mean(weights)) marked]
E(W̄) = Z, but E(1/W̄) ≠ 1/Z in general
5
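One way to see the direction of the bias, assuming the weights are positive and the mean weight is not constant: the function 1/w is strictly convex for w > 0, so Jensen's inequality gives

```latex
E\!\left(\frac{1}{\bar{W}}\right) \;>\; \frac{1}{E(\bar{W})} \;=\; \frac{1}{Z}
```

Dividing by the mean weight therefore overestimates 1/Z on average. The gap shrinks as N grows and the mean weight concentrates around Z.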
[Figure: plot with both axes running from −2 to 2]
Q: (X, Y) ~ N(1, 1), θ ~ U(−π, π)
f(x, y, θ) = Q(x, y, θ) P(o = 0.8 | x, y, θ) / Z
6
[Figure: plot with both axes running from −2 to 2]
Posterior E(X, Y, θ) = (0.496, 0.350, 0.084)
7
SLAM revisited
• Uses a recursive version of parallel importance sampling: particle filter
‣ each sample (particle) = trajectory over time
‣ sampling extends trajectory by one step
‣ recursively update importance weights and renormalize
‣ resampling trick to avoid keeping lots of particles with low weights
8
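The four steps above can be sketched as a bootstrap particle filter. This is a toy 1-D version under assumed models, not the SLAM system from the slides: the random-walk motion model, the Gaussian observation model, the resampling threshold, and the name `particle_filter_step` are all hypothetical.

```python
import math
import random


def particle_filter_step(particles, weights, observation, rng,
                         step_sigma=1.0, obs_sigma=0.5):
    """Extend each trajectory one step, reweight, renormalize, maybe resample."""
    n = len(particles)
    # Sampling: extend each trajectory (assumed random-walk motion model).
    extended = [traj + [traj[-1] + rng.gauss(0.0, step_sigma)]
                for traj in particles]
    # Recursive weight update (assumed Gaussian observation likelihood).
    new_weights = [
        w * math.exp(-0.5 * ((observation - traj[-1]) / obs_sigma) ** 2)
        for w, traj in zip(weights, extended)
    ]
    total = sum(new_weights)
    if total == 0.0:
        new_weights = [1.0 / n] * n  # all weights underflowed: fall back to uniform
    else:
        new_weights = [w / total for w in new_weights]
    # Resampling trick: when the effective sample size is low, draw particles
    # in proportion to their weights so low-weight ones tend to be dropped.
    ess = 1.0 / sum(w * w for w in new_weights)
    if ess < n / 2:
        extended = [list(t) for t in
                    rng.choices(extended, weights=new_weights, k=n)]
        new_weights = [1.0 / n] * n
    return extended, new_weights


rng = random.Random(1)
n_particles = 500
true_x = 0.0
particles = [[0.0] for _ in range(n_particles)]
weights = [1.0 / n_particles] * n_particles
for _ in range(20):
    true_x += rng.gauss(0.0, 1.0)          # simulate the hidden state
    obs = true_x + rng.gauss(0.0, 0.5)     # simulate a noisy observation
    particles, weights = particle_filter_step(particles, weights, obs, rng)

posterior_mean = sum(w * traj[-1] for w, traj in zip(weights, particles))
```

Storing whole trajectories keeps the sketch close to the "particle = trajectory" description. A filter that only needs the current state could store just the last position.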
Particle filter example
9
Monte-Carlo revisited
• Recall: wanted $E_P(g(X)) = \int g(x)\,P(x)\,dx = \int f(x)\,dx$
• Would like to search for areas of high P(x)
• But searching could bias our estimates
10
Markov-Chain Monte Carlo
• Randomized search procedure
• Produces sequence of RVs X1, X2, …
‣ Markov chain: satisfies Markov property
• If P(Xt) small, P(Xt+1) tends to be larger
• As t → ∞, Xt ~ P(X)
• As Δ → ∞, Xt+Δ ⊥ Xt
11
Markov chain
12
Stationary distribution
13
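As a small illustration of a stationary distribution, the sketch below repeatedly pushes a distribution through a transition matrix until it stops changing. The 3-state matrix `P` and the function name `stationary_distribution` are made up for the example; the method assumes the chain is irreducible and aperiodic.

```python
def stationary_distribution(P, n_iters=1000):
    """Power iteration: apply pi <- pi P until it (approximately) stops changing."""
    n = len(P)
    pi = [1.0 / n] * n  # any starting distribution works for an ergodic chain
    for _ in range(n_iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical transition matrix: P[i][j] = probability of moving from i to j.
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]
pi = stationary_distribution(P)  # approximately [0.6, 0.3, 0.1]
```

Starting from a different initial distribution gives the same limit, which is the "can't tell where we started" property that MCMC relies on.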
Markov-Chain Monte Carlo
• As t → ∞, Xt ~ P(X); as Δ → ∞, Xt+Δ ⊥ Xt
• For big enough t and Δ, an approximately i.i.d. sample from P(X) is
‣ { Xt, Xt+Δ, Xt+2Δ, Xt+3Δ, … }
• Can use i.i.d. sample to estimate EP(g(X))
• Actually, don’t need independence:
14
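A standard way to state why independence is not needed, assuming the chain is ergodic with stationary distribution P. The symbols N (number of retained samples) and t0 (burn-in length) are introduced here for illustration:

```latex
E_P(g(X)) \;\approx\; \frac{1}{N} \sum_{t = t_0 + 1}^{t_0 + N} g(X_t)
```

The average over all post-burn-in states converges to the expectation as N grows, even though consecutive states are correlated. Correlation raises the variance of the estimate for a given N but does not prevent convergence.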
Metropolis-Hastings
• Way to design chain w/ stationary dist’n P(X)
• Basic strategy: start from arbitrary X
• Repeatedly tweak X to get X’
‣ If P(X’) ≥ P(X), move to X’
‣ If P(X’) ≪ P(X), stay at X
‣ In intermediate cases, randomize
15
Proposal distribution
• Left open: what does “tweak” mean?
• Parameter of MH: Q(X’ | X)
• Good proposals explore quickly, but remain in regions of high P(X)
• Optimal proposal?
16
MH algorithm
• Initialize X1 arbitrarily
• For t = 1, 2, …:
‣ Sample X’ ~ Q(X’ | Xt)
‣ Compute p = P(X’) Q(Xt | X’) / (P(Xt) Q(X’ | Xt))
‣ With probability min(1, p), set Xt+1 := X’
‣ else Xt+1 := Xt
• Note: sequence X1, X2, … will usually contain duplicates
17
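The loop above can be sketched in Python using the standard Metropolis-Hastings ratio p = P(X’) Q(Xt | X’) / (P(Xt) Q(X’ | Xt)), computed in log space for numerical stability. The function name, its arguments, and the standard-normal target in the usage example are assumptions for illustration, not part of the lecture.

```python
import math
import random


def metropolis_hastings(log_p, propose, log_q, x0, n_steps, rng):
    """Run MH from x0, where log_p(x0) must be finite.

    log_p(x): log of the (possibly unnormalized) target density.
    propose(x, rng): draws X' ~ Q(. | x).
    log_q(x_new, x_old): log Q(x_new | x_old), up to an additive constant.
    """
    xs = [x0]
    x = x0
    accepted = 0
    for _ in range(n_steps):
        x_new = propose(x, rng)
        # log of p = P(X') Q(X | X') / (P(X) Q(X' | X))
        log_ratio = (log_p(x_new) + log_q(x, x_new)) - (log_p(x) + log_q(x_new, x))
        # Accept with probability min(1, p); otherwise repeat the current state.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = x_new
            accepted += 1
        xs.append(x)
    return xs, accepted / n_steps


# Usage with an assumed target N(0, 1) and Gaussian random-walk proposal.
rng = random.Random(0)
step = 1.0
samples, acc_rate = metropolis_hastings(
    log_p=lambda x: -0.5 * x * x,
    propose=lambda x, r: r.gauss(x, step),
    log_q=lambda x_new, x_old: -0.5 * ((x_new - x_old) / step) ** 2,
    x0=0.0,
    n_steps=20000,
    rng=rng,
)
kept = samples[1000:]  # discard burn-in
second_moment = sum(x * x for x in kept) / len(kept)  # estimates E(X^2) = 1
```

Because rejected proposals repeat the current state, `samples` contains runs of duplicates, as the note on the slide says.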
Acceptance rate
• Want acceptance rate (avg p) to be large, so we don’t get big runs of the same X
• Want Q(X’ | X) to move long distances (to explore quickly)
• Tension between long moves and P(accept):
18
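The tension can be seen numerically in a toy experiment: random-walk Metropolis on an assumed N(0, 1) target with three proposal widths. The target, the step sizes, and the name `random_walk_metropolis_stats` are illustrative choices, not taken from the slides.

```python
import math
import random


def random_walk_metropolis_stats(step, n_steps, rng):
    """Return (acceptance rate, mean squared jump per step) for target N(0, 1)."""
    x = 0.0
    accepted = 0
    sq_jump = 0.0
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step)
        # Symmetric proposal, so the ratio reduces to P(x_new) / P(x).
        log_ratio = 0.5 * (x * x - x_new * x_new)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            sq_jump += (x_new - x) ** 2
            x = x_new
            accepted += 1
    return accepted / n_steps, sq_jump / n_steps


rng = random.Random(0)
results = {
    step: random_walk_metropolis_stats(step, 20000, rng)
    for step in (0.1, 1.0, 10.0)
}
for step, (acc, jump) in results.items():
    print(f"step={step:5.1f}  acceptance={acc:.2f}  mean squared jump={jump:.3f}")
```

Small steps are almost always accepted but barely move, while very large steps are mostly rejected. The mean squared jump per step is one rough way to measure how quickly the chain explores.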
Mixing rate, mixing time
• If we pick a good proposal, we will move rapidly around domain of P(X)
• After a short time, won’t be able to tell where we started
• This is a short mixing time; mixing time = # steps until we can’t tell which starting point we used
• Mixing rate = 1 / (mixing time)
19
MH example
[Figure: surface plot of f(X, Y) for X and Y between −1 and 1]
20
MH example
[Figure: plot with both axes running from −1 to 1]
21
In example
• g(x) = x²
• True E(g(X)) = 0.28…
• Proposal: Q(x’ | x) = N(x’ | x, 0.25² I)
• Acceptance rate 55–60%
• After 1000 samples, minus burn-in of 100:
‣ final estimate 0.282361
‣ final estimate 0.271167
‣ final estimate 0.322270
‣ final estimate 0.306541
‣ final estimate 0.308716
22
Gibbs sampler
• Special case of MH
• Divide X into blocks of r.v.s B(1), B(2), …
• Proposal Q:
‣ pick a block i uniformly
‣ sample XB(i) ~ P(XB(i) | X¬B(i))
• Useful property: acceptance rate p = 1
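A short derivation of why the acceptance rate is 1, written with the standard MH ratio and assuming k blocks, one of which (call it B) is picked uniformly. Only block B changes, so the other coordinates are the same in X and X’:

```latex
p \;=\; \frac{P(X')\,Q(X \mid X')}{P(X)\,Q(X' \mid X)}
  \;=\; \frac{P(X'_B \mid X_{\neg B})\,P(X_{\neg B}) \cdot \tfrac{1}{k}\,P(X_B \mid X_{\neg B})}
             {P(X_B \mid X_{\neg B})\,P(X_{\neg B}) \cdot \tfrac{1}{k}\,P(X'_B \mid X_{\neg B})}
  \;=\; 1
```

Every factor in the numerator also appears in the denominator, so each proposal is accepted.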
23
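A minimal Gibbs sampler, sketched for an assumed target that is not from the slides: a bivariate Gaussian with unit variances and correlation rho, whose conditionals are X | Y = y ~ N(rho·y, 1 − rho²) and symmetrically for Y. Each coordinate is its own block, and a block is picked uniformly at every step.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_steps, rng, x0=(0.0, 0.0)):
    """Random-scan Gibbs sampling for a standard bivariate normal with correlation rho."""
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = x0
    samples = []
    for _ in range(n_steps):
        if rng.random() < 0.5:
            # Resample block X from P(X | Y = y).
            x = rng.gauss(rho * y, cond_sd)
        else:
            # Resample block Y from P(Y | X = x).
            y = rng.gauss(rho * x, cond_sd)
        samples.append((x, y))  # every move is accepted
    return samples


rng = random.Random(0)
samples = gibbs_bivariate_normal(rho=0.8, n_steps=20000, rng=rng)[1000:]
n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n
var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
corr = cov_xy / math.sqrt(var_x * var_y)  # should be close to rho
```

As rho approaches 1 the conditional standard deviation shrinks, so each update moves only a short distance and the chain mixes slowly.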
Gibbs example
24
Gibbs example
25
Gibbs failure example
26
Relational learning
• Linear regression, logistic regression: attribute-value learning
‣ set of i.i.d. samples from P(X, Y)
• Not all data is like this
‣ an attribute is a property of a single entity
‣ what about properties of sets of entities?
27
Application: document clustering
28
Application: recommendations
29
Latent-variable models
30
Best-known LVM: PCA
• Suppose Xij, Uik, Vjk all ~ Gaussian
‣ yields principal components analysis
‣ or probabilistic PCA
‣ or Bayesian PCA
31
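One common way to write such a model is sketched below. It is an assumed reading of the slide, and the variance symbols are introduced here for illustration:

```latex
X_{ij} \mid U, V \;\sim\; N\!\Big(\sum_k U_{ik} V_{jk},\; \sigma^2\Big), \qquad
U_{ik} \sim N(0, \sigma_U^2), \qquad
V_{jk} \sim N(0, \sigma_V^2)
```

Under this reading, maximizing the likelihood over U and V is a least-squares low-rank factorization of X, which is the problem PCA solves. Keeping the priors on U and V and integrating over them rather than maximizing leads toward the probabilistic and Bayesian variants.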