RANDOM TOPICS - University of Maryland, CMSC 764 (tomg/course/cmsc764/L12_stochastic.pdf)
RANDOM TOPICS
stochastic gradient descent & Monte Carlo
MASSIVE MODEL FITTING

least squares: minimize $\frac{1}{2}\|Ax-b\|^2 = \sum_i \frac{1}{2}(a_i x - b_i)^2$

SVM: minimize $\frac{1}{2}\|w\|^2 + h(LDw) = \frac{1}{2}\|w\|^2 + \sum_i h(l_i d_i w)$

low-rank factorization: minimize $\frac{1}{2}\|D - XY\|^2 = \sum_{ij} \frac{1}{2}(d_{ij} - x_i y_j)^2$

All are instances of minimize $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ with $n$ big (over 100K).
THE BIG IDEA

For minimize $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, the true gradient is $\nabla f = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x)$.

Idea: choose a subset $\Omega \subset [1, N]$ of the data (usually just one sample) and approximate

$\nabla f \approx g_\Omega = \frac{1}{|\Omega|}\sum_{i \in \Omega} \nabla f_i(x)$
INFINITE SAMPLE VS FINITE SAMPLE

infinite sample: minimize $f(x) = E_s[f_s(x)] = \int_s f_s(x)\,p(s)\,ds$

finite sample: minimize $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$

We can solve the finite sample problem to high accuracy…
…but true accuracy is limited by sample size.
SGD

select data, compute gradient, update:

$g_k = \frac{1}{M}\sum_{i=1}^M \nabla f(x, d_i)$, approximated by e.g. $g_k \approx \nabla f(x, d_8)$ or $g_k \approx \nabla f(x, d_{12})$

$x_{k+1} = x_k - \tau_k g_k$

Far from the solution: big error, but the solution improves. Near the solution: small error, but the solution gets worse.
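The loop above is short enough to sketch in full. Here is a minimal SGD solver for least squares, sampling one row per step with a decreasing stepsize; the function and parameter names are illustrative, not from the slides:

```python
import numpy as np

def sgd_least_squares(A, b, n_iters=20000, step_a=1.0, step_b=10.0, seed=0):
    """Minimize (1/2)||Ax - b||^2 by SGD: sample one row, step, repeat."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(n_iters):
        i = rng.integers(n)                 # select data: one random row
        g = (A[i] @ x - b[i]) * A[i]        # gradient of (1/2)(a_i x - b_i)^2
        x -= step_a / (step_b + k) * g      # update with decreasing stepsize
    return x
```

With a consistent system the iterates drift toward the least squares solution, but only slowly once the stepsize has decayed.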
SGD

select data, compute gradient ($g_k \approx \nabla f(x, d_8)$), update ($x_{k+1} = x_k - \tau_k g_k$)

Error must decrease as we approach the solution.

Classical solution: shrink the stepsize, $\lim_{k\to\infty} \tau_k = 0$, giving slow $O(1/\sqrt{k})$ convergence.

Variance reduction: correct the error in the gradient approximations.
EXAMPLE: WITHOUT DECREASING STEPSIZE

minimize $\frac{1}{2}\|Ax-b\|^2$ with an inexact gradient

What's happening? Why does this happen?
DECREASING STEP SIZE

big stepsize, then small stepsize:

$x_{k+1} = x_k - \tau_k \nabla f_k(x), \qquad \tau_k = \frac{a}{b+k}$

used for strongly convex problems; (almost) equivalent to choosing a larger sample. Why?
AVERAGING

$x_{k+1} = x_k - \tau_k \nabla f_k(x), \qquad \tau_k = \frac{a}{\sqrt{k+b}}$

“ergodic” averaging: $\bar{x}_{k+1} = \frac{1}{k+1}\sum x_i$

used for weakly convex problems. Does this limit the convergence rate? Why is this bad?
AVERAGING

$\bar{x}_{k+1} = \frac{1}{k+1}\sum x_i$ (ergodic averaging)

compute without storage: $\bar{x}_{k+1} = \frac{k}{k+1}\bar{x}_k + \frac{1}{k+1} x_{k+1}$

short memory version: $\bar{x}_{k+1} = \frac{k}{k+\eta}\bar{x}_k + \frac{\eta}{k+\eta} x_{k+1}$ with $\eta \ge 1$

tradeoff: variance for bias
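Both recursions can be computed without storing past iterates; a small sketch of the update (the name `update_average` is illustrative):

```python
def update_average(xbar, x_new, k, eta=1.0):
    """Short-memory ergodic average:
    xbar_{k+1} = k/(k+eta) * xbar_k + eta/(k+eta) * x_{k+1}.
    eta = 1 recovers the plain running mean; larger eta weights
    recent iterates more heavily (trading variance for bias)."""
    return (k / (k + eta)) * xbar + (eta / (k + eta)) * x_new
```

For example, folding the iterates 1, 2, 3, 4 through this update with `eta=1` reproduces their mean, 2.5.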
KNOWN RATES: WEAKLY CONVEX

Theorem. Suppose $f$ is convex, $\|\nabla f(x)\| \le G$, and the diameter of $\mathrm{dom}(f)$ is less than $D$. If we use the stepsize $\tau_k = \frac{c}{\sqrt{k}}$, then

$E[f(x_k) - f^\star] \le \left(\frac{D^2}{c} + cG^2\right)\frac{2 + \log(k)}{\sqrt{k}}$

See Shamir and Zhang, ICML '13; Rakhlin, Shamir, Sridharan, CoRR '11.
KNOWN RATES: STRONGLY CONVEX

Theorem. Suppose $f$ is strongly convex with parameter $m$, and that $\|\nabla f(x)\| \le G$. If you use stepsize $\tau_k = 1/(mk)$ and the limited memory averaging

$\bar{x}_{k+1} = \frac{k}{k+\eta}\bar{x}_k + \frac{\eta}{k+\eta} x_{k+1}$

with $\eta \ge 1$, then

$E[f(\bar{x}_k) - f^\star] \le 58(1 + \eta/k)\left(\eta(\eta+1) + \frac{(\eta + .5)^3(1 + \log k)}{k}\right)\frac{G^2}{mk}$

Shamir and Zhang, ICML '13
EXAMPLE: SVM

PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM

minimize $\frac{1}{2}\|w\|^2 + C h(Aw)$

$\nabla h(x) = \begin{cases} -1, & x < 1 \\ 0, & \text{otherwise} \end{cases}$

note: this is a “subgradient” descent method
PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM

minimize $\sum_i \frac{\lambda}{2}\|w\|^2 + h(a_i w)$

While “not converged”:
  If $a_k^T w_k < 1$: $w_{k+1} = w_k - \tau_k(\lambda w_k - a_k^T)$
  If $a_k^T w_k \ge 1$: $w_{k+1} = w_k - \tau_k \lambda w_k$
with stepsize $\tau_k = \frac{1}{\lambda k}$.

Averaging, $\bar{w}_{k+1} = \frac{1}{k+1}\sum_{i=1}^{k+1} w_i$, is not used in practice.
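The loop above translates almost line for line into code. A runnable sketch, with the labels folded into the rows $a_i$ so the hinge reads $h(a_i w) = \max(0, 1 - a_i w)$; all names are illustrative:

```python
import numpy as np

def pegasos(A, y, lam=0.1, n_iters=5000, seed=0):
    """Pegasos: stochastic subgradient descent on
    sum_i lam/2 ||w||^2 + h(a_i w), hinge h(z) = max(0, 1 - z),
    with labels folded into the rows (a_i <- y_i * a_i)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    Ay = A * y[:, None]
    w = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(n)
        tau = 1.0 / (lam * k)              # stepsize 1/(lambda k)
        if Ay[i] @ w < 1:                  # margin violated: hinge term active
            w -= tau * (lam * w - Ay[i])
        else:
            w -= tau * lam * w
    return w
```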
PEGASOS

[plot: gradient method vs. stochastic methods on the finite sample problem]
PEGASOS

[plot: classification error (CV) for the gradient method vs. stochastic methods]

You hope to converge before SGD gets slow!
ALLREDUCE

…but if you don't: use SGD as a warm start for an iterative method.

Agarwal. “Effective Terascale Linear Learning.” '12
SGD

select data, compute gradient ($g_k \approx \nabla f(x, d_8)$), update ($x_{k+1} = x_k - \tau_k g_k$)

Error must decrease as we approach the solution.

Variance reduction solution: make the gradient more accurate and preserve fast convergence.
SGD + VARIANCE REDUCTION

select data, compute gradient ($g_k \approx \nabla f(x, d_8)$, minus the stored error for sample 8), update ($x_{k+1} = x_k - \tau_k g_k$)

Error must decrease as we approach the solution.

Variance reduction solution: make the gradient more accurate and preserve fast convergence.
VR APPROACHES

SVRG (Johnson, Zhang, 2013): the original; requires full gradient computations
SAG (Le Roux, Schmidt, Bach, 2013) and SAGA (Defazio, Bach, Lacoste-Julien, 2014): avoid full gradient computations
Central VR: a VR approach targeting distributed ML, “Efficient Distributed SGD with Variance Reduction,” ICDM 2016 (shameless self-promotion)
many more…
SVRG

First epoch: build a gradient tableau $\nabla f_1(x^1_m), \nabla f_2(x^2_m), \nabla f_3(x^3_m), \ldots, \nabla f_{n-1}(x^{n-1}_m), \nabla f_n(x^n_m)$.
SVRG

Average gradient over the last epoch of the tableau:

$g_m = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x^i_m)$
SVRG

From the tableau, subtract the error of the old gradient from the new one:

new gradient: $\nabla f_3(x^3_{m+1})$; error: $\nabla f_3(x^3_m) - g_m$; corrected gradient: $\nabla f_3(x^3_{m+1}) - (\nabla f_3(x^3_m) - g_m)$
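The same correction is easier to see in the snapshot form of Johnson and Zhang, where the tableau is replaced by one full gradient computed at the start of each epoch. A sketch on least squares; function and parameter names are illustrative:

```python
import numpy as np

def svrg_least_squares(A, b, n_epochs=50, inner=None, tau=0.01, seed=0):
    """SVRG sketch for (1/2n) sum_i (a_i x - b_i)^2: each epoch stores a
    full gradient at a snapshot point, then takes corrected stochastic
    steps g = grad_i(x) - grad_i(x_snap) + full_grad.  The correction's
    error vanishes at the solution, so no decreasing stepsize is needed."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    inner = inner or n

    def grad_i(z, i):
        return (A[i] @ z - b[i]) * A[i]   # gradient of (1/2)(a_i z - b_i)^2

    for _ in range(n_epochs):
        x_snap = x.copy()
        full_grad = A.T @ (A @ x_snap - b) / n   # full gradient at snapshot
        for _ in range(inner):
            i = rng.integers(n)
            g = grad_i(x, i) - grad_i(x_snap, i) + full_grad
            x -= tau * g                         # constant stepsize
    return x
```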
SVRG GETS BACK DETERMINISTIC RATES

Theorem. Suppose the objective terms are $m$-strongly convex with $L$-Lipschitz gradients. If the learning rate is small enough, then

$E f(w_k) - f^\star \le c^k\,[f(w_0) - f^\star]$

for some $c < 1$.

Johnson and Zhang, 2013
MONTE CARLO METHODS

Methods that involve randomly sampling a distribution.

[photo: the Monte Carlo casino. Monaco?]
BAYESIAN LEARNING

Goal: estimate parameters in a model, data = M(parameters): measure the data, estimate the parameters.

ingredients…
prior: probability distribution of the unknown parameters
likelihood: probability of the data given the (unknown) parameters

Thomas Bayes
BAYES RULE

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

$P(\text{parameters}|\text{data}) = \frac{P(\text{data}|\text{parameters})\,P(\text{parameters})}{P(\text{data})}$

$P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters})\,P(\text{parameters})$ (likelihood times prior)

We really care about the left-hand side: the “posterior distribution”.
EXAMPLE: LOGISTIC REGRESSION

$P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters})\,P(\text{parameters})$

likelihood: $P(y_i = 1|x_i) = \frac{\exp(x_i^T w)}{1 + \exp(x_i^T w)}$, so

$P(y, X|w) = \prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)}$
EXAMPLE: LOGISTIC REGRESSION

$P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters})\,P(\text{parameters})$

Example prior: $\log P(w) = -|w|$

$P(w|y, X) \propto \left(\prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)}\right) P(w)$

$\log P(w|y, X) = \log P(w) + \sum_i -\log(1 + \exp(-y_i \cdot x_i^T w))$ (up to an additive constant)
BAYESIANS OFFER NO SOLUTIONS

The posterior $P(w|D, l)$ defines a whole solution space, not a single optimization solution.

Monte Carlo methods randomly sample this solution space.

[portraits: Bayes, Lagrange]
WHY MONTE CARLO? You need to know more than just a max/minimizer: “error bars” around the optimization solution.

$E(w) = \int w\,P(w)\,dw$

$\mathrm{Var}(w) = E[(w - \mu)(w - \mu)^T] = \int w w^T P(w)\,dw - \mu\mu^T$
EXAMPLE: POKER BOTS. You can't solve your problem by other (faster) methods: for maximize $\log P(w)$ we don't know the derivative.

poker bots: model parameters describe player behavior; 2.5M different community cards; 10 players = 59K raise/fold/call permutations per round
WHY MONTE CARLO? You're incredibly lazy: “I could differentiate this, I just don't wanna.”

maximize $\log P(w)$: $\arg\max \log P(w) \approx \max_k \{P(w_k)\}$
MARKOV CHAINS

What is an MC?

Irreducible: can visit any state starting at any state
Aperiodic: does not get trapped in deterministic cycles

An MC has a steady state if it is irreducible and aperiodic.
METROPOLIS HASTINGS

ingredients:
$q(y|x)$: proposal distribution
$p(x)$: posterior distribution

MH Algorithm
Start with $x_0$.
For $k = 1, 2, 3, \ldots$:
  Choose a candidate $y$ from $q(y|x_k)$.
  Compute the acceptance probability
  $\alpha = \min\left\{1, \frac{p(y)q(x_k|y)}{p(x_k)q(y|x_k)}\right\}$
  Set $x_{k+1} = y$ with probability $\alpha$; otherwise $x_{k+1} = x_k$.
CONVERGENCE

Theorem. Suppose the support of $q$ contains the support of $p$. Then the MH sampler has a stationary distribution, and the distribution is equal to $p$.

Irreducible: the support of $q$ must contain the support of $p$
Aperiodic: there must be states with positive rejection probability
METROPOLIS ALGORITHM

ingredients:
$q(y|x)$: proposal distribution
$p(x)$: posterior distribution

Assume the proposal is symmetric: $q(x_k|y) = q(y|x_k)$. Then the Metropolis–Hastings ratio

$\alpha = \min\left\{1, \frac{p(y)q(x_k|y)}{p(x_k)q(y|x_k)}\right\}$

simplifies to the Metropolis ratio

$\alpha = \min\left\{1, \frac{p(y)}{p(x_k)}\right\}$
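With a symmetric Gaussian random walk proposal, the simplified ratio is all that's needed. A sketch using log-densities for numerical stability; names and defaults are illustrative:

```python
import numpy as np

def metropolis(log_p, x0, n_samples=10000, step=1.0, seed=0):
    """Random-walk Metropolis.  The Gaussian proposal is symmetric, so q
    cancels and alpha = min(1, p(y)/p(x_k)).  log_p may be unnormalized:
    no normalization constant is needed."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    lp_x = log_p(x)
    samples = []
    for _ in range(n_samples):
        y = x + step * rng.normal(size=x.shape)   # propose from q(y|x)
        lp_y = log_p(y)
        if np.log(rng.random()) < lp_y - lp_x:    # accept w.p. min(1, p(y)/p(x))
            x, lp_x = y, lp_y
        samples.append(x.copy())
    return np.array(samples)
```

For example, passing `log_p = lambda x: -0.5 * float(x @ x)` (an unnormalized standard normal) yields samples whose mean and variance approach 0 and 1.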
EXAMPLE: GMM

[figure: histogram of Metropolis iterates]

Andrieu, Freitas, Doucet, Jordan '03
PROPERTIES OF MH

Pro's:
• We don't need a normalization constant for the posterior: instead of $P(\text{parameters}|\text{data}) = \frac{P(\text{data}|\text{parameters})P(\text{parameters})}{P(\text{data})}$, MH only needs $P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters})P(\text{parameters})$
• We can run many chains in parallel (cluster/GPU)
• We don't need any derivatives
PROPERTIES OF MH

Con's:
“Mixing time” depends on the proposal distribution
• too wide = constant rejections = slow mixing
• too narrow = short movements = slow mixing

Samples are only meaningful at the stationary distribution
• “Burn in” samples must be discarded
• Many samples are needed because of correlations

[figure: a narrow proposal centered at $x_k$ under a wide posterior. This is bad…]
SIMULATED ANNEALING

maximize $p(x)$

Easy choice: run MCMC, then take $\max_k p(x_k)$.

Better choice: sample the distribution $p^{\frac{1}{T_k}}(x_k)$, where $T_k$ is a “temperature”.

Why is this better? Where does the name come from?
COOLING SCHEDULE

[figure: the tempered distribution at four temperatures: hot, warm, cool, cold]

Andrieu, Freitas, Doucet, Jordan '03
CONVERGENCE

Theorem. Suppose the MCMC mixes fast enough that epsilon-dense sampling occurs in finite time starting at every temperature. For an annealing schedule with temperature

$T_k = \frac{1}{C \log(k + T_0)}$

simulated annealing converges to a global optimum with probability 1.

Granville, "Simulated annealing: A proof of convergence"

SA solves non-convex problems, even NP-complete problems, as time goes to infinity.
WHEN TO USE SIMULATED ANNEALING

flow chart: Should I use simulated annealing? No.

There is no practical way to choose the temperature schedule:
Too fast = stuck in a local minimum (risky)
Too slow = no different from MCMC
An act of desperation!
GIBBS SAMPLER

Want to sample $P(x_1, x_2, x_3, \ldots)$.

On stage $k$, pick some coordinate $j$ and propose

$q(y|x^k) = \begin{cases} p(y_j | x^k_{j^c}), & \text{if } y_{j^c} = x^k_{j^c} \\ 0, & \text{otherwise} \end{cases}$

The acceptance probability is always 1: using $p(y) = p(y_j \text{ and } y_{j^c})$ and $P(B) = \frac{P(A \text{ and } B)}{P(A|B)}$,

$\alpha = \frac{p(y)q(x^k|y)}{p(x^k)q(y|x^k)} = \frac{p(y)\,p(x^k_j|x^k_{j^c})}{p(x^k)\,p(y_j|y_{j^c})} = \frac{p(y_{j^c})}{p(x^k_{j^c})} = 1$
GIBBS SAMPLER

Want to sample $P(x_1, x_2, x_3, \ldots)$.

iterates (each step resamples one coordinate, conditioned on the rest):

$x^2 \sim P(x_1 | x^1_2, x^1_3, x^1_4, \ldots, x^1_n)$
$x^3 \sim P(x_2 | x^2_1, x^2_3, x^2_4, \ldots, x^2_n)$
$x^4 \sim P(x_3 | x^3_1, x^3_2, x^3_4, \ldots, x^3_n)$
$\cdots$
APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM): binary (0/1) random variables $v, h$ joined by weights $W$

$E(v, h) = -a^T v - b^T h - v^T W h$

$P(v, h) = \frac{1}{Z} e^{-E(v,h)}$, where $Z$ is the “partition function”
APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM)

E(v, h) = -a^T v - b^T h - v^T W h, \quad P(v, h) = \frac{1}{Z} e^{-E(v,h)}

Remove the normalization, aggregate the terms not involving v_i into a constant C, and cancel the shared constants to obtain a sigmoid function:

P(v_i = 1 \mid h) = \frac{P(v_i = 1 \mid h)}{P(v_i = 0 \mid h) + P(v_i = 1 \mid h)}
= \frac{\exp(a_i + \sum_j w_{ij} h_j + C)}{\exp(C) + \exp(a_i + \sum_j w_{ij} h_j + C)}
= \frac{\exp(a_i + \sum_j w_{ij} h_j)}{1 + \exp(a_i + \sum_j w_{ij} h_j)}
= \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)
BLOCK GIBBS FOR RBM

stage 1: freeze the hidden units, randomly sample the visible units:

P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)

stage 2: freeze the visible units, randomly sample the hidden units:

P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i w_{ij} v_i\Big)
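The two stages above can be sketched in a few lines of NumPy (the RBM sizes and weights below are illustrative, not from the slides). Because there are no visible-visible or hidden-hidden edges, each whole block of units can be resampled in parallel:

```python
import numpy as np

# One block-Gibbs sweep for a small, made-up RBM.
rng = np.random.default_rng(0)
nv, nh = 6, 4
a, b = np.zeros(nv), np.zeros(nh)
W = 0.1 * rng.normal(size=(nv, nh))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sweep(v):
    # stage 2: sample all hidden units given the visibles, P(h_j = 1 | v)
    h = (rng.random(nh) < sigmoid(b + v @ W)).astype(int)
    # stage 1: sample all visible units given the hiddens, P(v_i = 1 | h)
    v = (rng.random(nv) < sigmoid(a + W @ h)).astype(int)
    return v, h

v = rng.integers(0, 2, nv)          # random binary initialization
for _ in range(100):
    v, h = gibbs_sweep(v)
```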
DEEP BELIEF NETS

DBN = layered RBM: a visible layer plus a stack of hidden layers. Each layer depends only on the layer beneath it (feed forward). The probability for each hidden node is a sigmoid function.
EXAMPLE: MNIST

Train a 3-layer DBN with 200 hidden units
pre-training: layer-by-layer training
training: train all weights simultaneously
trained on 60K MNIST digits
training done using a Gibbs sampler
Gibbs sampler used to explore the final solution

Gan, Henao, Carlson, Carin ‘15
EXAMPLE: MNIST

[figure: training data vs. learned features]
observations sampled from the deep belief network using a Gibbs sampler
WHAT’S WRONG WITH GIBBS?

Susceptible to strong correlations between variables (the analogous failure mode for MH is a bad proposal distribution).
SLICE SAMPLER

Problems with MH:
• hard to choose a proposal distribution
• does not exploit analytical information about the problem
• long mixing time

Slice sampler: lift p(x) into a joint density over (x, u):

q(x, u) =
\begin{cases}
1, & \text{if } 0 \le u \le p(x) \\
0, & \text{otherwise}
\end{cases}

[figure: the region under the curve p(x), drawn in the (x, u) plane]
“LIFTED INTEGRAL”

p(x) = \int q(x, u)\, du

\int f(x)\, p(x)\, dx = \int\!\!\int f(x)\, q(x, u)\, du\, dx

Expectations under p can therefore be computed by sampling (x, u) from q instead.
SLICE SAMPLER

\int f(x)\, p(x)\, dx = \int\!\!\int f(x)\, q(x, u)\, du\, dx

Do Gibbs on this joint:

Freeze x: choose u uniformly from [0, p(x)]
Freeze u: choose x uniformly from \{x \mid p(x) \ge u\} (need an analytical formula for this super-level set)
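As a minimal sketch of these two Gibbs stages, consider a standard normal target, chosen here (not in the slides) because its super-level set has a closed form: p(x) ≥ u iff |x| ≤ sqrt(-2 log(u √(2π))):

```python
import numpy as np

# Slice sampler for a 1D standard normal, where the slice
# {x : p(x) >= u} is the interval [-s, s] with s as computed below.
rng = np.random.default_rng(0)

def p(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

x = 0.0
samples = []
for _ in range(20000):
    u = max(rng.uniform(0, p(x)), 1e-300)    # freeze x: vertical level
    s = np.sqrt(-2 * np.log(u * np.sqrt(2 * np.pi)))
    x = rng.uniform(-s, s)                   # freeze u: uniform on the slice
    samples.append(x)
samples = np.asarray(samples)
```

The chain's empirical mean and standard deviation converge to 0 and 1, the moments of the target.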
MODEL FITTING

p(x) = \prod_{i=1}^n p_i(x), \qquad \log p(x) = \sum_{i=1}^n \log p_i(x)

Lift each factor with its own auxiliary variable:

q(x, u_1, u_2, \ldots, u_n) =
\begin{cases}
1, & \text{if } u_i \le p_i(x)\ \forall i \\
0, & \text{otherwise}
\end{cases}

For fixed x the support is a box, so

\int q(x, u_1, u_2, \ldots, u_n)\, du = \int_{u_1=0}^{p_1(x)} \int_{u_2=0}^{p_2(x)} \cdots \int_{u_n=0}^{p_n(x)} du = \prod_{i=1}^{n} p_i(x)

\int f(x)\, p(x)\, dx = \int_x \int_u f(x)\, q(x, u)\, dx\, du
MODEL FITTING

\int f(x)\, p(x)\, dx = \int_x \int_u f(x)\, q(x, u)\, dx\, du

q(x, u_1, u_2, \ldots, u_n) =
\begin{cases}
1, & \text{if } u_i \le p_i(x)\ \forall i \\
0, & \text{otherwise}
\end{cases}

Freeze x: for all i, choose u_i uniformly from [0, p_i(x)]
Freeze u: choose x uniformly from \bigcap_i \{x \mid p_i(x) \ge u_i\}
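A small sketch of this multi-factor scheme, under an illustrative assumption not in the slides: each factor is an unnormalized Gaussian likelihood p_i(x) = exp(-(x - d_i)^2 / 2), so each super-level set is the interval [d_i - √(-2 log u_i), d_i + √(-2 log u_i)] and the intersection of intervals is itself an interval:

```python
import numpy as np

# Model-fitting slice sampler with one auxiliary variable per factor.
# The data points d_i are made up for the example.
rng = np.random.default_rng(0)
d = np.array([-0.5, 0.0, 0.8])        # hypothetical data

def factors(x):
    return np.exp(-(x - d)**2 / 2)    # p_i(x) for all i at once

x = 0.0
samples = []
for _ in range(10000):
    u = np.maximum(rng.uniform(0, factors(x)), 1e-300)  # u_i ~ U[0, p_i(x)]
    s = np.sqrt(-2 * np.log(u))            # half-widths of the level sets
    lo, hi = np.max(d - s), np.min(d + s)  # intersect the n intervals
    x = rng.uniform(lo, hi)                # uniform draw from the box slice
    samples.append(x)
samples = np.asarray(samples)
```

The product of these factors is an unnormalized Gaussian centered at the mean of the d_i, which the chain's sample mean approaches.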
SLICE SAMPLER PROPERTIES

pros
• does not require a proposal distribution
• mixes super fast
• simple implementation

cons
• requires an analytical formula for super-level sets
• hard in multiple dimensions

fixes / generalizations
• step-out methods
• random hyper-rectangles
STEP-OUT METHODS

• pick a small interval around the current point
• double the width until the slice is contained in the interval
• sample uniformly from the slice using rejection sampling: throw away selections outside of the slice
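The procedure above can be sketched as follows. This version uses Neal's linear stepping-out with shrinkage rather than the doubling variant on the slide, and it only assumes we can evaluate an unnormalized density p; the bimodal target and the initial width w are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                              # unnormalized bimodal density
    return np.exp(-(x - 2)**2) + np.exp(-(x + 2)**2)

def slice_step(x, w=1.0):
    u = rng.uniform(0, p(x))           # vertical slice level
    # step out: grow the interval until both ends leave the slice
    L = x - w * rng.random()
    R = L + w
    while p(L) > u:
        L -= w
    while p(R) > u:
        R += w
    # rejection with shrinkage: propose until a point lands inside the slice
    while True:
        x1 = rng.uniform(L, R)
        if p(x1) > u:
            return x1
        if x1 < x:                     # shrink the interval toward x
            L = x1
        else:
            R = x1

x = 0.0
samples = []
for _ in range(5000):
    x = slice_step(x)
    samples.append(x)
samples = np.asarray(samples)
```

Shrinking on rejection keeps the expected number of proposals small while leaving the stationary distribution unchanged.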
HIGH DIMS: COORDINATE METHOD

\int f(x)\, p(x)\, dx = \int\!\!\int f(x)\, q(x, u)\, du\, dx

Do coordinate Gibbs on this:

Freeze x: choose u uniformly from [0, p(x)]
Freeze u: for i = 1, \ldots, n, choose x_i uniformly from \{x_i \mid p(x_1, \cdots, x_i, \cdots, x_n) \ge u\}

The stepping-out method can be used for each coordinate.

Radford Neal: “Slice Sampling” ‘03
HIGH DIMS: HYPER-RECTANGLE

• choose a “large” rectangle
• randomly/uniformly place the rectangle around the current location
• draw a random proposal point from the rectangle
• if the proposal lies outside the slice, shrink the box so the proposal is on its boundary

Performs much better for poorly-conditioned objectives.

Radford Neal: “Slice Sampling” ‘03
COMPARISON

method             | rate       | when to use
-------------------|------------|------------------------------------------------------------
Gradient/Splitting | e^{-ck}    | you value reliability and precision (moderate speed, high accuracy)
SGD                | 1/k        | you value speed over accuracy (high speed, moderate accuracy)
MCMC               | 1/\sqrt{k} | you value simplicity (no gradient) or need statistical inference (slow and inaccurate)
DO MATLAB EXERCISE: MCMC