Herding Dynamical Weights Max Welling Bren School of Information and Computer Science UC Irvine.
Herding Dynamical Weights
Max Welling, Bren School of Information and Computer Science, UC Irvine
Motivation
• $X_i = 1$ means that pin i will fall during a bowling round; $X_i = 0$ means that pin i will still stand.
• You are given pairwise probabilities $P(X_i, X_j)$.
• Task: predict the distribution $Q(n)$, $n = 0, \ldots, 10$, of the total number of pins that will fall.

Stock market: $X_i = 1$ means that company i defaults. You are interested in the probability of n companies defaulting in your portfolio.
Sneak Preview
Newsgroups-small (collected by S. Roweis): 100 binary features, 16,242 instances (300 shown)
(Note: herding is a deterministic algorithm, no noise was added)
Herding is a deterministic dynamical system that turns “moments” (average feature statistics) into “samples” which share the same moments.
Quiz: which is which [top/bottom]?
– data in random order
– herding sequence in the order received
Traditional Approach: Hopfield Nets & Boltzmann Machines
$w_{ij}$ is a weight; $s_i$ is a state value (say 0/1).

Energy:
$E(s, w) = -\sum_{ij} w_{ij}\, s_i s_j$

Probability of a joint state:
$P(s \mid w) = \frac{1}{Z} \exp\big(-E(s, w)\big)$

Coordinate descent on the energy:
$S_i \leftarrow \mathbb{I}\Big(\sum_j W_{ij} S_j > 0\Big)$
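A minimal sketch of this coordinate-descent update in Python (the couplings `W` below are made-up illustrations, not values from the talk):

```python
import numpy as np

def coord_descent(W, s, sweeps=100):
    """Coordinate descent on the energy E(s, w) = -sum_ij W_ij s_i s_j:
    repeatedly set each unit to the value that lowers the energy,
    i.e. S_i <- I(sum_j W_ij S_j > 0), until no unit changes."""
    s = np.array(s, dtype=float)
    for _ in range(sweeps):
        changed = False
        for i in range(len(s)):
            h = W[i] @ s - W[i, i] * s[i]   # local field at unit i
            new = 1.0 if h > 0 else 0.0
            if new != s[i]:
                s[i], changed = new, True
        if not changed:                      # reached a local energy minimum
            break
    return s

# made-up symmetric couplings: units 0 and 1 excite each other,
# unit 2 inhibits both
W = np.array([[0., 2., -1.],
              [2., 0., -1.],
              [-1., -1., 0.]])
s = coord_descent(W, [0, 1, 1])              # settles at [1, 1, 0]
```

Each flip strictly lowers the energy, so the sweep terminates at a local minimum of $E$.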
Traditional Learning Approach
$P(X) = \frac{1}{Z} \exp\Big(\sum_{ij} W_{ij} X_i X_j + \sum_i \alpha_i X_i\Big)$

Gradient ascent on the log-likelihood:
$W_{ij} \leftarrow W_{ij} + \langle X_i X_j \rangle_{data} - \langle X_i X_j \rangle_P$
$\alpha_i \leftarrow \alpha_i + \langle X_i \rangle_{data} - \langle X_i \rangle_P$

Then answer the query through a model expectation:
$Q(n) = \Big\langle\, \mathbb{I}\Big(\sum_{i=1}^{10} S_i = n\Big) \Big\rangle_P, \quad n = 0, \ldots, 10$

The expectations under $P$ are intractable: use CD instead!
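For context, a hedged sketch of what “use CD instead” amounts to: CD-1 replaces the intractable model expectations with averages over states obtained from one Gibbs sweep started at the data. All names, values, and the learning rate below are illustrative, not from the talk:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def cd1_step(W, alpha, X, rng, eta=0.05):
    """One CD-1 update for a fully visible Boltzmann machine
    P(x) ~ exp(x'Wx/2 + alpha'x), W symmetric with zero diagonal.
    The intractable expectations <x_i x_j>_P and <x_i>_P are
    approximated after a single Gibbs sweep started at the data."""
    S = X.astype(float).copy()
    for i in range(X.shape[1]):              # one sequential Gibbs sweep
        p = sigmoid(alpha[i] + S @ W[i])     # P(X_i = 1 | rest)
        S[:, i] = (rng.random(len(S)) < p).astype(float)
    m = len(X)
    W += eta * (X.T @ X - S.T @ S) / m       # data minus reconstruction
    np.fill_diagonal(W, 0.0)                 # keep no self-couplings
    alpha += eta * (X.mean(0) - S.mean(0))
    return W, alpha

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5)).astype(float)  # toy binary data
W, alpha = np.zeros((5, 5)), np.zeros(5)
W, alpha = cd1_step(W, alpha, X, rng)
```

Note that CD still needs a learning rate `eta` and random numbers for the Gibbs sweep, exactly the ingredients herding removes.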
What’s Wrong With This?
• E[Xi] and E[XiXj] are intractable to compute (and you need them at every iteration of gradient descent).
• Slow convergence & local minima (only w/ hidden vars)
• Sampling can get stuck in local modes (slow mixing).
Solution in a Nutshell
The data moments $\langle X_i \rangle_{data}$ and $\langle X_i X_j \rangle_{data}$ are fed into a Nonlinear Dynamical System, which emits a sequence of states $S$ whose trajectory averages reproduce the moments:

$\langle S_i \rangle_S = \langle X_i \rangle_{data}$
$\langle S_i S_j \rangle_S = \langle X_i X_j \rangle_{data}$

and the query is answered by a trajectory average:
$Q(n) = \Big\langle\, \mathbb{I}\Big(\sum_{i=1}^{10} S_i = n\Big) \Big\rangle_S, \quad n = 0, \ldots, 10$

(sidestep learning + sampling)
Herding Dynamics
$S_i \leftarrow \mathbb{I}\Big(\alpha_i + \sum_j W_{ij} S_j > 0\Big)$
$W_{ij} \leftarrow W_{ij} + \langle X_i X_j \rangle_{data} - S_i S_j$
$\alpha_i \leftarrow \alpha_i + \langle X_i \rangle_{data} - S_i$

• no stepsize
• no random numbers
• no exponentiation
• no point estimates
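Putting the update equations together, here is a self-contained sketch of pairwise herding (function and variable names are mine, and the tiny dataset is made up for illustration):

```python
import numpy as np

def herd_pairwise(mu, M, T, max_sweeps=50):
    """Herding with pairwise binary features (an illustrative sketch).
    mu[i] = <X_i>_data, M[i, j] = <X_i X_j>_data. Each step runs
    coordinate descent on the energy to get a state S, then applies
    the herding updates -- no stepsize, no random numbers."""
    d = len(mu)
    alpha, W = np.zeros(d), np.zeros((d, d))
    S = np.zeros(d)
    out = np.empty((T, d))
    for t in range(T):
        for _ in range(max_sweeps):   # S_i <- I(alpha_i + sum_j W_ij S_j > 0)
            changed = False
            for i in range(d):
                h = alpha[i] + W[i] @ S - W[i, i] * S[i]
                new = 1.0 if h > 0 else 0.0
                if new != S[i]:
                    S[i], changed = new, True
            if not changed:
                break
        out[t] = S
        W += M - np.outer(S, S)       # W_ij += <XiXj>_data - S_i S_j
        alpha += mu - S               # alpha_i += <Xi>_data - S_i
    return out

# toy moments from four made-up data vectors
X = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0]], dtype=float)
mu, M = X.mean(0), X.T @ X / len(X)
samples = herd_pairwise(mu, M, T=2000)
```

The running averages of the emitted states track the data moments, so for example `samples.mean(0)` comes out close to `mu`.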
Piston Analogy: weights = pistons

Pistons move up at a constant rate (proportional to the observed correlations).
When a piston gets too high, the “fuel” combusts and the piston is pushed down (depression).

“An engine driven by observed correlations.”
Herding Dynamics with General Features
$S^t = \arg\max_S \sum_k W_k\, f_k(S)$
$W_k \leftarrow W_k + \langle f_k(X) \rangle_{data} - f_k(S^t)$
• no stepsize
• no random numbers
• no exponentiation
• no point estimates
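A sketch of this general-feature loop, using a brute-force argmax over an enumerated state space (feasible only for tiny problems; the feature choice and target moments below are made up):

```python
import numpy as np
from itertools import product

def herd(F, mu, T):
    """General-feature herding. F[s] holds the feature vector f(S_s)
    of the s-th candidate state; mu = <f(X)>_data. Uses exact
    maximization over the enumerated states."""
    w = np.zeros(F.shape[1])
    idx = np.empty(T, dtype=int)
    for t in range(T):
        idx[t] = np.argmax(F @ w)   # S^t = argmax_S sum_k W_k f_k(S)
        w += mu - F[idx[t]]         # W_k += <f_k>_data - f_k(S^t)
    return idx

# all 8 binary states of length 3; take f(S) = S itself
F = np.array(list(product([0, 1], repeat=3)), dtype=float)
mu = np.array([0.7, 0.2, 0.5])      # made-up target feature averages
idx = herd(F, mu, T=1000)
avg = F[idx].mean(0)                # trajectory feature averages, close to mu
```

With exact maximization the weights stay bounded, so the trajectory averages approach `mu` at rate $O(1/T)$.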
Features as New Coordinates

The feature vectors $f(S_1), f(S_2), \ldots, f(S_5)$ of the states are the vertices of a polytope in feature space; the weight trajectory $w_1, w_2, \ldots, w_t, w_{t+1}$ evolves inside it.

If $\langle f(X) \rangle_{data} = \frac{1}{N} \sum_{b=1}^{B} n_b f_b$ (a convex combination of the feature vertices, with counts $(n_1, \ldots, n_B)$), then the period is infinite.

(thanks to Romain Thibaux)
Example

$X$ on a grid of values in $[-1:1]$, with features
$f_1(X) = X, \qquad f_2(X) = \tfrac{1}{10}\sin(X^2)$

Weights were initialized in a grid; the red ball tracks one weight. The weights converge onto a fractal attractor set with Hausdorff dimension 1.5.
The Tipi Function
Herding is gradient ascent on $G(w)$ with stepsize 1:

$G(w) = \sum_k w_k \langle f_k \rangle_{data} - \max_S \sum_k w_k f_k(S)$

This function is:
• Concave
• Piecewise linear
• Non-positive
• Scale free

The herding updates
$S^t = \arg\max_S \sum_k W_k f_k(S)$
$W_k \leftarrow W_k + \langle f_k \rangle_{data} - f_k(S^t)$
are exactly these gradient steps: coordinate ascent is replaced with full maximization.

The scale-free property implies that the stepsize does not affect the state sequence $S$.
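The scale-free property is a one-line check: $G$ is positively homogeneous of degree one, which is why the stepsize (or an overall rescaling of $w$) cannot change the state sequence.

```latex
G(\lambda w) = \sum_k \lambda w_k \langle f_k\rangle_{data}
             - \max_S \sum_k \lambda w_k f_k(S)
             = \lambda\, G(w), \qquad \lambda > 0,
\qquad\text{and}\qquad
\arg\max_S \sum_k (\lambda w_k)\, f_k(S) = \arg\max_S \sum_k w_k f_k(S).
```

So scaling the initial weights and the stepsize by the same $\lambda$ scales the whole weight trajectory by $\lambda$ while leaving every $S^t$ unchanged.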
Recurrence

Thm: If we can find the optimal state $S$, then the weights will stay within a compact region.

Empirical evidence: coordinate ascent is sufficient to guarantee recurrence.
Ergodicity
(figure: a walk over six states $s = 1, \ldots, 6$; the trajectory visits $s = [1, 1, 2, 5, 2, \ldots]$)

$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} f_k(s_t) = \langle f_k \rangle_{data}$

Thm: If the 2-norm of the weights grows slower than linearly, then feature averages over trajectories converge to data averages.
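The theorem follows by telescoping the weight update $w_t = w_{t-1} + \bar f - f(s_t)$ (writing $\bar f = \langle f \rangle_{data}$):

```latex
w_T = w_0 + T\,\bar f - \sum_{t=1}^{T} f(s_t)
\quad\Longrightarrow\quad
\frac{1}{T}\sum_{t=1}^{T} f(s_t) = \bar f + \frac{w_0 - w_T}{T}.
```

If $\|w_T\|_2$ grows slower than linearly, the correction term vanishes; when the weights stay in a compact region, the moment error even decays as $O(1/T)$, faster than the $O(1/\sqrt{T})$ of i.i.d. sampling.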
Relation to Maximum Entropy
Primal:
Maximize $H[P]$
subject to $\langle f \rangle_P = \langle f \rangle_{data}$

Dual:
Maximize $L(W) = \sum_k W_k \langle f_k \rangle_{data} - \log \sum_x \exp\Big(\sum_k W_k f_k(x)\Big)$

Tipi function:
$G(W) = \lim_{T \to 0} T\, L(W / T)$

Herding dynamics satisfies the constraints but does not attain maximal entropy.
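The connection is the standard zero-temperature limit of log-sum-exp:

```latex
\lim_{T \to 0}\; T \log \sum_x \exp\!\Big(\tfrac{1}{T}\sum_k W_k f_k(x)\Big)
 = \max_x \sum_k W_k f_k(x),
```

so $\lim_{T \to 0} T\, L(W/T) = \sum_k W_k \langle f_k \rangle_{data} - \max_x \sum_k W_k f_k(x) = G(W)$, recovering the Tipi function.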
Advantages / Disadvantages
• Learning & inference have merged into one dynamical system.
• Fully tractable, although one should monitor whether local maximization is enough to keep the weights finite.
• Very fast: no exponentiation, no random number generation.
• No fudge factors (learning rates, momentum, weight decay, ...).
• Very efficient mixing over all “modes” (the attractor set).

• Moments are preserved, but what is our “inductive bias”? (I.e., what happens to the remaining degrees of freedom?)
Back to Bowling

Data collected by P. Cotton: 10 pins, 298 bowling runs. X=1 means a pin has fallen in two subsequent bowls.
H.XX uses all pairwise probabilities; H.XXX uses all triplet probabilities.
P(total nr. pins falling)
More Results

Datasets:
• Bowling (n=298, d=10, k=2, Ntrain=150, Ntest=148)
• Abalone (n=4177, d=8, k=2, Ntrain=2000, Ntest=2177)
• Newsgroup-small (n=16,242, d=100, k=2, Ntrain=10,000, Ntest=6,242)
• 8x8 Digits (n=2200 [3’s and 5’s], d=64, k=2, Ntrain=1600, Ntest=600)
Task: given only pairwise probabilities, compute the probability of the total nr. of 1’s in a data-vector, $Q(n)$.
Solution: apply herding and compute $Q(n)$ through sample averages.
Error: KL[P_data || P_est]
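Estimating $Q(n)$ from a herding trajectory is a plain histogram of the per-state sums; a minimal sketch (the random `samples` array below is only a stand-in for the actual herding output):

```python
import numpy as np

def estimate_Q(samples, d):
    """Q(n) = P(total nr. of 1's equals n), estimated as the sample
    average of I(sum_i S_i = n) over a (T, d) array of binary states."""
    counts = samples.sum(axis=1).astype(int)   # nr. of 1's in each state
    return np.bincount(counts, minlength=d + 1) / len(samples)

rng = np.random.default_rng(0)
samples = (rng.random((5000, 10)) < 0.4).astype(int)  # stand-in trajectory
Q = estimate_Q(samples, d=10)                  # Q[n] for n = 0, ..., 10
```

`Q` is a proper distribution over n = 0, ..., d, so it can be compared to the empirical one via KL divergence.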
Task: given only pairwise probabilities, compute the classifier $P(Y \mid X)$.
Solution: train a logistic regression (LR) classifier on the herding sequence.
Error: fraction of misclassified test cases.
LR is too simple; pseudo-likelihood (PL) on the herding sequence also gives 0.04. In higher dimensions herding loses its advantage in accuracy.
Conclusions
• Herding replaces point estimates with trajectories over attractor sets (which are not the Bayesian posterior) in a tractable manner.
• A model for “neural computation”:
  – similar to dynamical synapses
  – quasi-random sampling of the state space (chaotic?)
  – local updates
  – efficient (no random numbers, no exponentiation)