SGD cont'd; AdaGrad
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, April 2nd, 2015
©Emily Fox 2015
Case Study 1: Estimating Click Probabilities

Learning Problem for Click Prediction
• Prediction task:
• Features:
• Data:
  – Batch:
  – Online:
• Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, …)
  – Focus on logistic regression; captures main concepts, ideas generalize to other approaches
Standard v. Regularized Updates
• Maximum conditional likelihood estimate
• Regularized maximum conditional likelihood estimate
\[ w^{*} = \arg\max_{w} \ln\left[ \prod_{j} P(y^{j} \mid x^{j}, w) \right] - \lambda \sum_{i>0} \frac{w_i^2}{2} \]
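The regularized objective above can be evaluated directly. A minimal NumPy sketch, assuming a design matrix whose first column is the constant intercept feature, so that the sum over i > 0 leaves w_0 unpenalized; the function name and conventions are illustrative:

```python
import numpy as np

def regularized_cond_log_likelihood(w, X, y, lam):
    """Regularized conditional log-likelihood for logistic regression.

    X: (N, d) feature matrix (first column assumed to be the constant
    intercept feature), y: (N,) labels in {0, 1}, lam: regularization
    strength lambda. The intercept w[0] is not penalized, matching the
    sum over i > 0 in the objective.
    """
    z = X @ w
    # ln P(y | x, w) for a logistic model, written stably as
    # y*z - log(1 + exp(z)) via logaddexp
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    penalty = (lam / 2.0) * np.sum(w[1:] ** 2)
    return log_lik - penalty
```

Gradient ascent on this function recovers the regularized maximum conditional likelihood estimate.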
Challenge 1: Complexity of Computing Gradients
• What's the cost of a gradient update step for LR???
Challenge 2: Data is Streaming
• Assumption thus far: Batch data
• But, click prediction is a streaming data task:
  – User enters query, and ad must be selected:
    • Observe x_j, and must predict y_j
  – User either clicks or doesn't click on ad:
    • Label y_j is revealed afterwards
  – Google gets a reward if user clicks on ad
  – Weights must be updated for next time:
SGD: Stochastic Gradient Ascent (or Descent)
• "True" gradient:
• Sample-based approximation:
• What if we estimate gradient with just one sample???
  – Unbiased estimate of gradient
  – Very noisy!
  – Called stochastic gradient ascent (or descent)
    • Among many other names
  – VERY useful in practice!!!
\[ \nabla \ell(w) = E_x\left[ \nabla \ell(w, x) \right] \]
Stochastic Gradient Ascent: General Case
• Given a stochastic function of parameters:
  – Want to find maximum
• Start from w^{(0)}
• Repeat until convergence:
  – Get a sample data point x_t
  – Update parameters:
• Works in the online learning setting!
• Complexity of each gradient step is constant in number of examples!
• In general, step size changes with iterations
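The loop above can be sketched as follows. The step-size schedule η_t = η_0/√t is one common choice satisfying the usual decay conditions, an assumption here rather than something the slide fixes:

```python
import numpy as np

def sgd_ascent(grad_sample, w0, data, eta0=0.5):
    """Generic stochastic gradient ascent (sketch).

    grad_sample(w, x) returns an unbiased estimate of the gradient
    from a single sample x. Step size eta_t = eta0 / sqrt(t) decays
    with the iteration t (an assumed schedule for illustration).
    """
    w = w0.copy()
    for t, x in enumerate(data, start=1):
        eta_t = eta0 / np.sqrt(t)
        # one update: constant cost in the number of examples
        w = w + eta_t * grad_sample(w, x)
    return w
```

For example, with `grad_sample = lambda w, x: x - w` (gradient of -(w-x)^2/2), the iterates drift toward the sample mean, illustrating the noisy but unbiased updates.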
Stochastic Gradient Ascent for Logistic Regression
• Logistic loss as a stochastic function:
• Batch gradient ascent updates:
• Stochastic gradient ascent updates:
  – Online setting:
Logistic loss as a stochastic function:
\[ E_x[\ell(w, x)] = E_x\left[ \ln P(y \mid x, w) - \frac{\lambda}{2} \|w\|_2^2 \right] \]
Batch gradient ascent update:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \left\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^{N} x_i^{(j)} \left[ y^{(j)} - P(Y=1 \mid x^{(j)}, w^{(t)}) \right] \right\} \]
Stochastic gradient ascent update:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) \right] \right\} \]
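The stochastic (online) update can be written out for one sample. A sketch assuming labels in {0, 1} and the logistic model P(Y = 1 | x, w) = 1/(1 + e^{-<x,w>}); the helper names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_logistic_step(w, x_t, y_t, lam, eta_t):
    """One stochastic gradient ascent step for regularized logistic
    regression, mirroring the slide's update:
      w_i <- w_i + eta_t * ( -lam * w_i + x_i * [ y - P(Y=1|x, w) ] )
    x_t: (d,) feature vector, y_t in {0, 1}.
    """
    p = sigmoid(x_t @ w)                   # P(Y = 1 | x^{(t)}, w^{(t)})
    return w + eta_t * (-lam * w + x_t * (y_t - p))
```

Each step touches one example, so the cost per update is O(number of nonzero features), independent of N.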
Convergence Rate of SGD
• Theorem:
  – (see Nemirovski et al. '09 from readings)
  – Let ℓ be a strongly convex stochastic function
  – Assume gradient of ℓ is Lipschitz continuous and bounded
  – Then, for step sizes:
  – The expected loss decreases as O(1/t):
Convergence Rates for Gradient Descent/Ascent vs. SGD
• Number of iterations to get to accuracy:
• Gradient descent:
  – If func is strongly convex: O(ln(1/ϵ)) iterations
• Stochastic gradient descent:
  – If func is strongly convex: O(1/ϵ) iterations
• Seems exponentially worse, but much more subtle:
  – Total running time, e.g., for logistic regression:
    • Gradient descent:
    • SGD:
  – SGD can win when we have a lot of data
  – See readings for more details
\[ \ell(w^{*}) - \ell(w) \le \epsilon \]
Constrained SGD: Projected Gradient
• Consider an arbitrary restricted feature space \( w \in \mathcal{W} \)
• Optimization objective:
• If \( w \in \mathcal{W} \), can use projected gradient for (sub)gradient descent:
\[ w^{(t+1)} = \]
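One concrete instance of a projected gradient step, taking W to be an L2 ball. The choice of W is an assumption for illustration; any convex set with a computable Euclidean projection works the same way:

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto W = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def projected_gradient_step(w, g, eta, radius=1.0):
    """One projected (sub)gradient descent step: take the plain
    gradient step, then project back onto the feasible set W."""
    return project_l2_ball(w - eta * g, radius)
```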
Motivating AdaGrad (Duchi, Hazan, Singer 2011)
• Assuming \( w \in \mathbb{R}^d \), standard stochastic (sub)gradient descent updates are of the form:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_t\, g_{t,i} \]
• Should all features share the same learning rate?
• Often have high-dimensional feature spaces
  – Many features are irrelevant
  – Rare features are often very informative
• AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations
Why Adapt to Geometry?
Examples from Duchi et al. ISMP 2012 slides (Adaptive Subgradient Methods):

 y_t | φ_{t,1} | φ_{t,2} | φ_{t,3}
  1  |   1     |   0     |   0
 -1  |   .5    |   0     |   1
  1  |  -.5    |   1     |   0
 -1  |   0     |   0     |   0
  1  |   .5    |   0     |   0
 -1  |   1     |   0     |   0
  1  |  -1     |   1     |   0
 -1  |  -.5    |   0     |   1

Feature 1: frequent, irrelevant. Features 2 and 3: infrequent, predictive.
Not All Features are Created Equal
• Examples (from Duchi et al. ISMP 2012 slides):
  – Text data: "The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier." (The Atlantic, July/August 2010)
  – High-dimensional image features
  – Other motivation: selecting advertisements in online advertising, document ranking, problems with parameterizations of many magnitudes…
Visualizing Effect
Credit: http://imgur.com/a/Hqolp
Regret Minimization
• How do we assess the performance of an online algorithm?
• Algorithm iteratively predicts \( w^{(t)} \)
• Incur loss \( \ell_t(w^{(t)}) \)
• Regret: what is the total incurred loss of the algorithm relative to the best choice of \( w \) that could have been made retrospectively?
\[ R(T) = \sum_{t=1}^{T} \ell_t(w^{(t)}) - \inf_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(w) \]
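Regret as defined above can be estimated empirically. A sketch in which a finite comparator grid stands in for the infimum over W; since the grid may miss the best fixed w, this under-estimates the true regret:

```python
def regret(losses, iterates, comparators):
    """Empirical regret of an online algorithm (sketch):
      R(T) = sum_t l_t(w^{(t)}) - min over comparators of sum_t l_t(w)
    losses: sequence of loss functions l_t, iterates: the algorithm's
    predictions w^{(t)}, comparators: a finite grid standing in for W
    (the true definition takes an infimum over all of W).
    """
    # total loss actually incurred by the algorithm
    alg_loss = sum(l(w) for l, w in zip(losses, iterates))
    # best single fixed choice in hindsight, over the grid
    best_fixed = min(sum(l(w) for l in losses) for w in comparators)
    return alg_loss - best_fixed
```

With squared losses centered at +1 and -1, for instance, the constant prediction 0 is the hindsight-optimal fixed point and achieves zero empirical regret.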
Regret Bounds for Standard SGD
• Standard projected gradient stochastic updates:
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta g_t)\|_2^2 \]
• Standard regret bound:
\[ \sum_{t=1}^{T} \ell_t(w^{(t)}) - \ell_t(w^{*}) \le \frac{1}{2\eta} \|w^{(1)} - w^{*}\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|g_t\|_2^2 \]
Projected Gradient using Mahalanobis
• Standard projected gradient stochastic updates:
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta g_t)\|_2^2 \]
• What if instead of an L2 metric for projection, we considered the Mahalanobis norm?
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta A^{-1} g_t)\|_A^2 \]
Mahalanobis Regret Bounds
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta A^{-1} g_t)\|_A^2 \]
• What A to choose?
• Regret bound now:
\[ \sum_{t=1}^{T} \ell_t(w^{(t)}) - \ell_t(w^{*}) \le \frac{1}{2\eta} \|w^{(1)} - w^{*}\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|g_t\|_{A^{-1}}^2 \]
• What if we minimize upper bound on regret w.r.t. A in hindsight?
\[ \min_A \sum_{t=1}^{T} g_t^{T} A^{-1} g_t \]
Mahalanobis Regret Minimization
• Objective:
\[ \min_A \sum_{t=1}^{T} g_t^{T} A^{-1} g_t \quad \text{subject to } A \succeq 0, \; \mathrm{tr}(A) \le C \]
• Solution:
\[ A = c \left( \sum_{t=1}^{T} g_t g_t^{T} \right)^{1/2} \]
For proof, see Appendix E, Lemma 15 of Duchi et al. 2011. Uses "trace trick" and Lagrangian.
• A defines the norm of the metric space we should be operating in
AdaGrad Algorithm
• At time t, estimate optimal (sub)gradient modification A by
\[ A_t = \left( \sum_{\tau=1}^{t} g_\tau g_\tau^{T} \right)^{1/2} \]
• For d large, A_t is computationally intensive to compute. Instead, use \( \mathrm{diag}(A_t) \)
• Then, algorithm is a simple modification of normal updates:
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta\, \mathrm{diag}(A_t)^{-1} g_t)\|_{\mathrm{diag}(A_t)}^2 \]
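For W = R^d the diagonal update reduces to a simple per-coordinate loop. A sketch; the small `delta` guard against division by zero before a feature is first seen is an implementation detail, not part of the slide's derivation:

```python
import numpy as np

def adagrad(grad, w0, data, eta=0.1, delta=1e-8):
    """Diagonal AdaGrad for W = R^d (sketch):
      G_{t,i} = sum_{tau <= t} g_{tau,i}^2
      w_i <- w_i - eta / sqrt(G_{t,i}) * g_{t,i}
    grad(w, x) returns the (sub)gradient at sample x; delta guards
    against dividing by zero before a coordinate is first seen.
    """
    w = w0.copy()
    G = np.zeros_like(w0)          # running sum of squared gradients
    for x in data:
        g = grad(w, x)
        G += g ** 2
        # per-coordinate step: rarely-seen features keep large steps
        w -= eta * g / (np.sqrt(G) + delta)
    return w
```

Coordinates with large accumulated squared gradients get small steps, while rare but informative features retain large ones, which is exactly the geometry-adaptation argument above.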
AdaGrad in Euclidean Space
• For \( \mathcal{W} = \mathbb{R}^d \),
• For each feature dimension,
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_{t,i}\, g_{t,i} \]
where
\[ \eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \]
• That is,
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i} \]
• Each feature dimension has its own learning rate!
  – Adapts with t
  – Takes geometry of the past observations into account
  – Primary role of η is determining rate the first time a feature is encountered
AdaGrad Theoretical Guarantees
• AdaGrad regret bound:
\[ \sum_{t=1}^{T} \ell_t(w^{(t)}) - \ell_t(w^{*}) \le 2 R_\infty \sum_{i=1}^{d} \|g_{1:T,i}\|_2, \qquad R_\infty := \max_t \|w^{(t)} - w^{*}\|_\infty \]
  – In stochastic setting:
\[ E\left[ \ell\left( \frac{1}{T} \sum_{t=1}^{T} w^{(t)} \right) \right] - \ell(w^{*}) \le \frac{2 R_\infty}{T} \sum_{i=1}^{d} E\left[ \|g_{1:T,i}\|_2 \right] \]
• This really is used in practice!
• Many cool examples. Let's just examine one…
AdaGrad Theoretical Example
• Expect to out-perform when gradient vectors are sparse
• SVM hinge loss example:
\[ \ell_t(w) = \left[ 1 - y_t \langle x_t, w \rangle \right]_+, \qquad x_t \in \{-1, 0, 1\}^d \]
• If \( x_{t,j} \ne 0 \) with probability \( \propto j^{-\alpha}, \; \alpha > 1 \):
\[ E\left[ \ell\left( \frac{1}{T} \sum_{t=1}^{T} w^{(t)} \right) \right] - \ell(w^{*}) = O\left( \frac{\|w^{*}\|_\infty}{\sqrt{T}} \cdot \max\{\log d,\, d^{1-\alpha/2}\} \right) \]
• Previously best known method:
\[ E\left[ \ell\left( \frac{1}{T} \sum_{t=1}^{T} w^{(t)} \right) \right] - \ell(w^{*}) = O\left( \frac{\|w^{*}\|_\infty}{\sqrt{T}} \cdot \sqrt{d} \right) \]
Neural Network Learning
• Very non-convex problem, but use SGD methods anyway

Wildly non-convex problem (from Duchi et al. ISMP 2012 slides):
\[ \ell(w, x) = \log\left( 1 + \exp\left( \langle [p(\langle w_1, x_1 \rangle) \cdots p(\langle w_k, x_k \rangle)],\, x_0 \rangle \right) \right), \qquad p(\alpha) = \frac{1}{1 + \exp(\alpha)} \]
Idea: Use stochastic gradient methods to solve it anyway.

[Figure: "Accuracy on Test Set" — Average Frame Accuracy (%) vs. Time (hours) for SGD GPU, Downpour SGD, Downpour SGD w/AdaGrad, and Sandblaster L-BFGS (Dean et al. 2012). Distributed, d = 1.7·10^9 parameters; SGD and AdaGrad use 80 machines (1000 cores), L-BFGS uses 800 (10,000 cores).]
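The loss on this slide can be evaluated directly for a one-hidden-layer network. A sketch that keeps the slide's form of p(α) as printed; the simplification that every hidden unit sees the same input vector x, and all names, are illustrative assumptions:

```python
import numpy as np

def p(alpha):
    # the slide's unit nonlinearity, as printed
    return 1.0 / (1.0 + np.exp(alpha))

def nn_loss(W, x0, x):
    """Loss from the slide for a one-hidden-layer network:
      l(w, x) = log(1 + exp(< [p(<w_1,x>) ... p(<w_k,x>)], x_0 >))
    W: (k, d) hidden-layer weights (rows w_1..w_k), x0: (k,) output
    weights, x: (d,) input vector.
    """
    hidden = p(W @ x)                   # p(<w_j, x>) for each hidden unit
    return np.log1p(np.exp(hidden @ x0))
```

Composing the nonlinearity inside the log-loss is what makes the objective non-convex in W, which is why only the stochastic gradient machinery (not the convex regret theory) carries over.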
What you should know about Logistic Regression (LR) and Click Prediction
• Click prediction problem:
  – Estimate probability of clicking
  – Can be modeled as logistic regression
• Logistic regression model: Linear model
• Gradient ascent to optimize conditional likelihood
• Overfitting + regularization
• Regularized optimization
  – Convergence rates and stopping criterion
• Stochastic gradient ascent for large/streaming data
  – Convergence rates of SGD
• AdaGrad motivation, derivation, and algorithm