Transcript of ggordon/10601/slides/Lec13_MCMC.pdf

Page 1:

Review

• Multiclass logistic regression

• Priors, conditional MAP logistic regression

• Bayesian logistic regression

‣ MAP is not always typical of posterior

‣ posterior predictive can avoid overfitting

!"

!#

!$ " $ #"!%

!&

!'

"

'

&

!!" !#" " #" !"

"

"$!

"$%

"$&

"$'

#

1

Page 2:

Review

• Finding posterior predictive distribution often requires numerical integration

‣ uniform sampling

‣ importance sampling

‣ parallel importance sampling

• These are all Monte-Carlo algorithms

‣ another well-known MC algorithm coming up

2
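As a concrete (and entirely illustrative) sketch of the first item, plain Monte-Carlo integration with uniform sampling: the unnormalized density p_tilde below is a made-up stand-in, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up unnormalized posterior over a 1-D parameter w (stand-in for a real model).
def p_tilde(w):
    return np.exp(-0.5 * (w - 1.0) ** 2) / (1.0 + np.exp(-w))

# Uniform sampling over the box [-5, 5]:
#   integral of h(w) dw  ~=  volume * average of h at uniform samples.
N = 100_000
w = rng.uniform(-5.0, 5.0, size=N)
vol = 10.0

Z_hat = vol * np.mean(p_tilde(w))                 # normalizer Z = int p_tilde(w) dw
g = 1.0 / (1.0 + np.exp(-w))                      # some predictive quantity g(w)
Eg_hat = vol * np.mean(g * p_tilde(w)) / Z_hat    # posterior expectation of g

print("Z ~=", Z_hat, "  E[g(w) | data] ~=", Eg_hat)
```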

Page 3:

Application: SLAM

Eliazar and Parr, IJCAI-03

3

Page 4:

Parallel IS

• Pick N samples Xi from proposal Q(X)

• If we knew Wi = P(Xi)/Q(Xi), could do IS

• Instead, set wi = P̃(Xi)/Q(Xi), using the unnormalized density P̃ = Z·P that we can actually evaluate, and let W̄ = (1/N) Σi wi

• Then: final estimate is EP(g(X)) ≈ Σi wi g(Xi) / Σi wi  (equivalently, (1/W̄) · (1/N) Σi wi g(Xi))

4
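A minimal numpy sketch of the estimator above; the names (p_tilde, parallel_is) and the 1-D target and proposal are illustrative stand-ins, not the slides' example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative unnormalized target P~(x) = Z * P(x) (a bumpy 1-D density).
def p_tilde(x):
    return np.exp(-0.5 * x**2) * (2.0 + np.sin(3.0 * x))

# Proposal Q = N(0, sigma^2): easy to sample from and to evaluate.
sigma = 2.0
def q_pdf(x):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def parallel_is(g, N=10_000):
    x = rng.normal(0.0, sigma, size=N)        # N samples Xi ~ Q
    w = p_tilde(x) / q_pdf(x)                 # wi = P~(Xi) / Q(Xi)
    W_bar = w.mean()                          # E[W_bar] = Z (the normalizer)
    est = np.sum(w * g(x)) / np.sum(w)        # self-normalized estimate of E_P[g]
    return est, W_bar

est, Z_hat = parallel_is(lambda x: x**2)
print("estimate of E_P[X^2]:", est, "  estimate of Z:", Z_hat)
```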

Page 5:

Parallel IS is biased

[Plot omitted: mean(weights), 1/mean(weights), and E(mean(weights))]

E(W̄) = Z, but E(1/W̄) ≠ 1/Z in general

5
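A quick simulation (illustrative, not from the slides) of this bias: pick an unnormalized target with known Z, repeat parallel IS many times with a small N, and compare the averages of W̄ and 1/W̄ to Z and 1/Z.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target P~(x) = Z * N(x | 0, 1) with known Z = 3.
Z = 3.0
def p_tilde(x):
    return Z * np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Proposal Q = N(0, 2^2).
sigma = 2.0
def q_pdf(x):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

N, runs = 10, 20_000            # a small N per run makes the bias easy to see
W_bar = np.empty(runs)
for r in range(runs):
    x = rng.normal(0.0, sigma, size=N)
    W_bar[r] = np.mean(p_tilde(x) / q_pdf(x))

print("average W_bar   :", W_bar.mean(), "   (Z =", Z, ")")            # unbiased
print("average 1/W_bar :", (1.0 / W_bar).mean(), "  vs 1/Z =", 1.0 / Z)  # biased upward
```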

Page 6:


Q: (X, Y) ∼ N(1, 1), θ ∼ U(−π, π)

f(x, y, θ) = Q(x, y, θ) P(o = 0.8 | x, y, θ) / Z

6

Page 7:


Posterior E(X, Y, θ) = (0.496, 0.350, 0.084)

7

Page 8:

SLAM revisited

• Uses a recursive version of parallel importance sampling: particle filter

‣ each sample (particle) = trajectory over time

‣ sampling extends trajectory by one step

‣ recursively update importance weights and renormalize

‣ resampling trick to avoid keeping lots of particles with low weights

8
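A minimal sketch of that recursion for a toy 1-D tracking problem (nothing SLAM-specific): the motion and observation models are made-up Gaussians, and only the current state of each particle is kept rather than the full trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(observations, N=500, motion_std=0.5, obs_std=1.0):
    """Bootstrap particle filter for a toy 1-D state (stand-in models)."""
    particles = rng.normal(0.0, 1.0, size=N)   # current state of each trajectory
    weights = np.full(N, 1.0 / N)
    means = []
    for z in observations:
        # 1. Extend each trajectory by one step: sample from the motion model.
        particles = particles + rng.normal(0.0, motion_std, size=N)
        # 2. Recursively update the importance weights with the observation
        #    likelihood, then renormalize.
        weights = weights * np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        weights = weights / weights.sum()
        means.append(np.sum(weights * particles))   # filtered estimate of the state
        # 3. Resample so we don't keep lots of particles with tiny weights.
        idx = rng.choice(N, size=N, p=weights)
        particles = particles[idx]
        weights = np.full(N, 1.0 / N)
    return np.array(means)

# Toy run: noisy observations of a slowly drifting 1-D state.
true_x = np.cumsum(rng.normal(0.0, 0.5, size=20))
obs = true_x + rng.normal(0.0, 1.0, size=20)
print(particle_filter(obs))
```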

Page 9:

Particle filter example

9

Page 10:

Monte-Carlo revisited

• Recall: wanted

• Would like to search for areas of high P(x)

• But searching could bias our estimates

EP(g(X)) = ∫ g(x) P(x) dx = ∫ f(x) dx

10

Page 11:

Markov-Chain Monte Carlo

• Randomized search procedure

• Produces sequence of RVs X1, X2, …

‣ Markov chain: satisfies Markov property

• If P(Xt) small, P(Xt+1) tends to be larger

• As t → ∞, Xt ~ P(X)

• As Δ → ∞, Xt+Δ ⊥ Xt

11

Page 12:

Markov chain

12

Page 13:

Stationary distribution

13
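The body of this slide isn't in the transcript; as a generic illustration of the idea, the sketch below iterates a toy 3-state transition matrix until the distribution over states stops changing. That fixed point π (with π = πT) is the stationary distribution.

```python
import numpy as np

# Toy 3-state transition matrix: T[i, j] = P(X_{t+1} = j | X_t = i).
T = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])

# Push an arbitrary starting distribution through the chain until it stops
# changing; the fixed point satisfies pi = pi @ T, i.e. it is stationary.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ T

print("stationary distribution:", pi)
print("change after one more step:", pi @ T - pi)   # ~ 0
```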

Page 14:

Markov-Chain Monte Carlo

• As t → ∞, Xt ~ P(X); as Δ → ∞, Xt+Δ ⊥ Xt

• For big enough t and Δ, an approximately i.i.d. sample from P(X) is

‣ { Xt, Xt+Δ, Xt+2Δ, Xt+3Δ, … }

• Can use i.i.d. sample to estimate EP(g(X))

• Actually, don’t need independence: averaging g(Xt) over all post-burn-in samples still converges to EP(g(X)); the correlated samples just carry less information per step

14

Page 15:

Metropolis-Hastings

• Way to design chain w/ stationary dist’n P(X)

• Basic strategy: start from arbitrary X

• Repeatedly tweak X to get X’

‣ If P(X’) ≥ P(X), move to X’

‣ If P(X’) ≪ P(X), stay at X

‣ In intermediate cases, randomize

15

Page 16:

Proposal distribution

• Left open: what does “tweak” mean?

• Parameter of MH: Q(X’ | X)

• Good proposals explore quickly, but remain in regions of high P(X)

• Optimal proposal?

16

Page 17:

MH algorithm

• Initialize X1 arbitrarily

• For t = 1, 2, …:

‣ Sample X’ ~ Q(X’ | Xt)

‣ Compute p = P(X’) Q(Xt | X’) / [P(Xt) Q(X’ | Xt)]  (an unnormalized P is fine, since the constant cancels)

‣ With probability min(1, p), set Xt+1 := X’

‣ else Xt+1 := Xt

• Note: sequence X1, X2, … will usually contain duplicates

17
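A compact sketch of this loop; the 2-D unnormalized target and the Gaussian random-walk proposal are illustrative stand-ins. With a symmetric proposal the Q terms in p cancel, so only the ratio of unnormalized target values is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative unnormalized target density P~(x) over x in R^2.
def p_tilde(x):
    return np.exp(-0.5 * (x[0] ** 2 + (x[1] - x[0] ** 2) ** 2))

def metropolis_hastings(p_tilde, x0, n_steps, prop_std=0.25):
    """Random-walk MH with Q(x' | x) = N(x' | x, prop_std^2 I) (symmetric)."""
    X = np.empty((n_steps, len(x0)))
    x = np.asarray(x0, dtype=float)
    px = p_tilde(x)
    for t in range(n_steps):
        x_prop = x + prop_std * rng.standard_normal(x.shape)   # X' ~ Q(X' | Xt)
        p_prop = p_tilde(x_prop)
        # Symmetric Q: the proposal terms cancel, so p = P~(X') / P~(Xt).
        if rng.random() < min(1.0, p_prop / px):
            x, px = x_prop, p_prop          # accept: Xt+1 := X'
        X[t] = x                            # reject: Xt+1 := Xt (duplicate kept)
    return X

chain = metropolis_hastings(p_tilde, x0=[0.0, 0.0], n_steps=1000)
print(chain[-5:])
```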

Page 18:

Acceptance rate

• Want acceptance rate (avg p) to be large, so we don’t get big runs of the same X

• Want Q(X’ | X) to move long distances (to explore quickly)

• Tension between long moves and P(accept):

18

Page 19:

Mixing rate, mixing time

• If we pick a good proposal, we will move rapidly around domain of P(X)

• After a short time, won’t be able to tell where we started

• This is a short mixing time (mixing time = # steps until we can’t tell which starting point we used)

• Mixing rate = 1 / (mixing time)

19

Page 20:

MH example

[3-D plot omitted: target density f(X, Y) over X, Y ∈ [−1, 1]]

20

Page 21:

MH example


21

Page 22:

In example

• g(x) = x²

• True E(g(X)) = 0.28…

• Proposal: Q(x′ | x) = N(x′ | x, 0.25² I)

• Acceptance rate 55–60%

• After 1000 samples, minus burn-in of 100:

final estimate 0.282361
final estimate 0.271167
final estimate 0.322270
final estimate 0.306541
final estimate 0.308716

22
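The example's target f(X, Y) isn't recoverable from the transcript, so the sketch below reruns the same recipe (proposal N(x′ | x, 0.25²I), 1000 steps, burn-in of 100, g(x) = x²) on a stand-in 2-D Gaussian target; the acceptance rate and estimates will therefore differ from the numbers above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in 2-D unnormalized target (not the slides' f): an isotropic Gaussian
# with per-coordinate variance 0.28, so the true E[X^2] is 0.28.
def f(x):
    return np.exp(-0.5 * np.sum(x ** 2) / 0.28)

def run_chain(n_steps=1000, prop_std=0.25):
    x = np.zeros(2)
    fx = f(x)
    xs, accepts = [], 0
    for _ in range(n_steps):
        x_prop = x + prop_std * rng.standard_normal(2)   # Q(x' | x) = N(x' | x, 0.25^2 I)
        f_prop = f(x_prop)
        if rng.random() < min(1.0, f_prop / fx):
            x, fx = x_prop, f_prop
            accepts += 1
        xs.append(x.copy())
    return np.array(xs), accepts / n_steps

samples, acc_rate = run_chain()
kept = samples[100:]                                     # drop burn-in of 100
print("acceptance rate:", acc_rate)
print("estimate of E[X^2]:", np.mean(kept[:, 0] ** 2))   # g(x) = x^2
```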

Page 23:

Gibbs sampler

• Special case of MH

• Divide X into blocks of r.v.s B(1), B(2), …

• Proposal Q:

‣ pick a block i uniformly

‣ sample XB(i) ~ P(XB(i) | X¬B(i))

• Useful property: acceptance rate p = 1

23
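A minimal Gibbs-sampling sketch for a case with easy conditionals: a 2-D Gaussian with correlation ρ, with each block a single coordinate. It sweeps the two blocks in order rather than picking one uniformly (a common variant), and every update is accepted, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_gaussian(rho, n_steps=5000):
    """Gibbs sampling for (X, Y) ~ N(0, [[1, rho], [rho, 1]]).

    Each conditional is Gaussian: X | Y=y ~ N(rho*y, 1 - rho^2), and
    symmetrically for Y | X."""
    x, y = 0.0, 0.0
    samples = np.empty((n_steps, 2))
    cond_std = np.sqrt(1.0 - rho ** 2)
    for t in range(n_steps):
        x = rng.normal(rho * y, cond_std)   # block 1: resample X from P(X | Y = y)
        y = rng.normal(rho * x, cond_std)   # block 2: resample Y from P(Y | X = x)
        samples[t] = (x, y)
    return samples

s = gibbs_bivariate_gaussian(rho=0.9)
print(np.corrcoef(s[1000:].T))   # after burn-in, correlation should be near 0.9
```

When ρ is close to ±1, each conditional barely moves its coordinate and the chain mixes slowly; this is one way Gibbs sampling can fail.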

Page 24:

Gibbs example


24

Page 25:

Gibbs example


25

Page 26:

Gibbs failure example


26

Page 27:

Relational learning

• Linear regression, logistic regression: attribute-value learning

‣ set of i.i.d. samples from P(X, Y)

• Not all data is like this

‣ an attribute is a property of a single entity

‣ what about properties of sets of entities?

27

Page 28:

Application: document clustering

28

Page 29:

Application: recommendations

29

Page 30:

Latent-variable models

30

Page 31:

Best-known LVM: PCA

• Suppose Xij, Uik, Vjk all ~ Gaussian

‣ yields principal components analysis

‣ or probabilistic PCA

‣ or Bayesian PCA

31
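As a concrete illustration (not from the slides): the index pattern above corresponds to the low-rank model Xij ≈ Σk Uik Vjk. With Gaussian noise and no priors, the maximum-likelihood rank-k fit is the truncated SVD of the centered data, i.e. ordinary PCA; Gaussian priors on U and V give the probabilistic and Bayesian variants mentioned in the bullets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n items by d attributes, generated from a rank-k model plus noise,
# i.e. Xij = sum_k Uik Vjk + Gaussian noise.
n, d, k = 200, 30, 3
U_true = rng.normal(size=(n, k))
V_true = rng.normal(size=(d, k))
X = U_true @ V_true.T + 0.1 * rng.normal(size=(n, d))

# With Gaussian noise, the maximum-likelihood rank-k fit of X ~ U V^T is the
# truncated SVD of the (centered) data matrix -- i.e. PCA.
Xc = X - X.mean(axis=0)
W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
U_hat = W[:, :k] * s[:k]        # latent coordinates ("scores") for each item
V_hat = Vt[:k].T                # principal directions ("loadings")

X_hat = U_hat @ V_hat.T + X.mean(axis=0)
print("rank-%d reconstruction RMSE: %.4f" % (k, np.sqrt(np.mean((X - X_hat) ** 2))))
```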