Offline and Online Optimization in Machine Learning
Peter Sunehag
ANU
September 2010
Section
Introduction: Optimization with gradient methods
Introduction: Machine Learning and Optimization
Convex Analysis
(Sub)Gradient Methods
Online (sub)Gradient Methods
Steepest descent
- We want to find the minimum of a real-valued function f : R^n → R.
- We will primarily discuss optimization using gradients.
- Gradient descent is also called the steepest descent method (a minimal sketch follows below).
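A minimal sketch of plain gradient descent with a fixed step size, applied to a least-squares objective; the matrix, step size and iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

def gradient_descent(grad, w0, step=0.1, iters=100):
    """Plain gradient descent: w_{t+1} = w_t - step * grad(w_t)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - step * grad(w)
    return w

# Example: minimize f(w) = 0.5 * ||A w - b||^2, whose gradient is A^T (A w - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(gradient_descent(lambda w: A.T @ (A @ w - b), w0=np.zeros(2)))  # ≈ [0.6, -0.8]
```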
Problem: Curvature
- If the level sets are very curved (far from round), gradient descent might take a long route to the optimum.
- Newton's method takes curvature into account and "corrects" for it. It relies on calculating the inverse Hessian (second derivatives) of the objective function (a one-step sketch follows below).
- Quasi-Newton methods estimate the "correction" by looking at the path so far. (In the figure: red is Newton, green is gradient descent.)
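A small sketch of a single Newton step, assuming the Hessian is available and invertible; the elongated quadratic is an illustrative choice on which one Newton step reaches the minimizer exactly.

```python
import numpy as np

def newton_step(grad, hess, w):
    """Newton update w - H(w)^{-1} grad(w); solve a linear system rather than inverting."""
    return w - np.linalg.solve(hess(w), grad(w))

# Quadratic f(w) = 0.5 w^T Q w - b^T w with very elongated level sets.
Q = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
w = newton_step(lambda w: Q @ w - b, lambda w: Q, np.zeros(2))
print(w, np.linalg.solve(Q, b))  # one step lands on the minimizer [0.1, 1.0]
```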
Problem: Zig-Zag
- Gradient descent has a tendency to zig-zag.
- This is particularly problematic in valleys.
- One solution: introduce momentum (see the sketch below).
- Another solution: conjugate gradients.
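A minimal sketch of one common momentum variant (the heavy-ball form); the step size, momentum coefficient and valley-shaped quadratic are illustrative assumptions.

```python
import numpy as np

def momentum_descent(grad, w0, step=0.005, beta=0.9, iters=200):
    """Heavy-ball updates: v_{t+1} = beta * v_t - step * grad(w_t); w_{t+1} = w_t + v_{t+1}."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(iters):
        v = beta * v - step * grad(w)
        w = w + v
    return w

# Narrow valley f(w) = 0.5 * (100 * w1^2 + w2^2); momentum damps the zig-zag across the valley.
grad = lambda w: np.array([100.0 * w[0], w[1]])
print(momentum_descent(grad, w0=np.array([1.0, 1.0])))
```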
Section
Introduction: Optimization with gradient methods
Introduction: Machine Learning and Optimization
Convex Analysis
(Sub)Gradient Methods
Online (sub)Gradient Methods
Optimization in Machine Learning
- Data is becoming "unlimited" in many areas.
- Computational resources are not.
- Taking advantage of the data requires efficient optimization.
- Machine Learning researchers have realized that we are in a different situation compared to the contexts in which the optimization (say convex optimization) field developed.
- Some of those methods and theories are still useful.
- Different ultimate goal.
- We optimize surrogates for what we really want.
- We also often have much higher dimensionality than other applications.
Optimization in Machine Learning
- Known training data A, unknown test data B.
- We want optimal performance on the test data.
- Alternatively we have streaming data (or pretend that we do).
- Given a loss function L(z, w) (parameters w ∈ W, data sample(s) z), we want as low a loss ∑_{z∈B} L(z, w) as possible on the test set.
- Since we do not have access to the test set, we minimize the regularized empirical loss on the training set: ∑_{z∈A} L(z, w) + λR(w).
- Getting extremely close to the minimum of this objective might not increase performance on the test set.
- In machine learning, the optimization methods of interest are those that quickly get a good estimate w_k of the optimum w*.
Optimization in Machine Learning
- How large must k be to get ε_k = f(w_k) − f(w*) ≤ ε for a given ε > 0?
- Is this number like 1/ε, 1/ε², 1/√ε, log(1/ε)?
- Is ε_k like 1/k, 1/k², etc.?
- How long does an iteration take?
- Quasi-Newton methods (BFGS, L-BFGS) are rapid when we are already close to the optimum, but otherwise not better than methods (gradient descent) with much cheaper iterations.
- We care the most about the early phase, when we are not close to the optimum.
Examples
- Linear classifier (classes ±1): f(x) = sgn(w^T x + b).
- Training data {(x_i, y_i)}.
- SVMs (hinge loss): min_w ∑_i max(0, 1 − y_i(w^T x_i + b)) + λ‖w‖₂².
- Logistic multi-class: min_w ∑_i −log( exp(w_{y_i}^T x_i) / ∑_l exp(w_l^T x_i) ) + λ‖w‖₂². Decision: l* = argmax_l w_l^T x.
- Different regularizers, e.g. ‖w‖₁ encourages sparsity.
- Hinge loss and ℓ₁ regularization cause non-differentiability on null sets (a sketch of these objectives follows below).
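A small sketch of evaluating the two regularized objectives above; the data shapes, helper names and demo values are illustrative assumptions.

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    """Regularized hinge loss: mean_i max(0, 1 - y_i (w^T x_i + b)) + lam * ||w||_2^2."""
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + lam * (w @ w)

def multiclass_logistic_objective(W, X, y, lam):
    """Regularized multi-class logistic loss: mean_i -log softmax(W x_i)[y_i] + lam * ||W||^2."""
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)                     # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(y)), y]) + lam * np.sum(W * W)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
print(hinge_objective(np.zeros(4), 0.0, X, rng.choice([-1.0, 1.0], size=50), lam=0.1))            # 1.0 at w = 0
print(multiclass_logistic_objective(np.zeros((3, 4)), X, rng.integers(0, 3, size=50), lam=0.1))   # log 3
```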
Section
Introduction: Optimization with gradient methods
Introduction: Machine Learning and Optimization
Convex Analysis
(Sub)Gradient Methods
Online (sub)Gradient Methods
Gradients
- We will make use of gradients with respect to a given coordinate system and its associated Euclidean geometry.
- If f(x) is a real-valued function that is differentiable at x = (x_1, ..., x_d), then ∇_x f = (∂f/∂x_1, ..., ∂f/∂x_d).
- The gradient points in the direction of steepest ascent (a finite-difference check is sketched below).
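A quick sketch of checking a gradient numerically with central finite differences; the test function and step size are illustrative.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]       # gradient is (2*x0 + 3*x1, 3*x0)
print(numerical_gradient(f, [1.0, 2.0]))         # ≈ [8., 3.]
```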
Subgradients
- If f(x) is a convex real-valued function, and if g ∈ R^d is such that f(y) − f(x) ≥ g^T(y − x) for all y in some open ball around x, then g is in the subgradient set of f at x.
- If f is differentiable at x, then ∇f(x) is the only element of the subgradient set.
- A line with slope g can be drawn through (x, f(x)) that stays on or below the graph of f (a tiny subgradient oracle is sketched below).
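A tiny sketch of a subgradient oracle for the ℓ₁ norm; any value in [−1, 1] is valid at a kink, and returning 0 there is just one conventional choice.

```python
import numpy as np

def l1_subgradient(w):
    """One valid subgradient of ||w||_1: sign(w), with 0 chosen at the kinks w_i = 0."""
    return np.sign(w)

print(l1_subgradient(np.array([-2.0, 0.0, 3.0])))   # [-1.  0.  1.]
```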
Convex set
- A set X ⊂ R^n is convex if for all x, y ∈ X and 0 ≤ λ ≤ 1, (1 − λ)x + λy ∈ X.
Convex function
- A function f : R^n → R is convex if for all λ ∈ [0, 1] and x, y ∈ R^n, f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y).
- The set above the function's curve is called its epigraph.
- A function is convex iff its epigraph is a convex set.
- A convex function is called closed if its epigraph is closed.
Strong Convexity
- A function f : R^n → R is ρ-strongly convex (ρ > 0) if for all λ ∈ [0, 1] and x, y ∈ R^n, f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y) − (ρ/2)λ(1 − λ)‖x − y‖₂².
- Or equivalently, if f(x) − (ρ/2)‖x‖₂² is convex.
- Or equivalently, f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (ρ/2)‖x − y‖₂².
- Or equivalently (if twice differentiable), ∇²f(x) ⪰ ρ·I (identity matrix).
- In the figure: green (y = x⁴) is too flat near the minimum; blue (y = x²) is not too flat.
Lipschitz Continuity
- f is L-Lipschitz if |f(x) − f(y)| ≤ L‖x − y‖₂ for all x, y.
- Lipschitz continuity implies continuity but not differentiability.
- If f is differentiable, then bounded (in norm) gradients imply Lipschitz continuity.
- The secant slopes are bounded by L.
Lipschitz Continuous Gradients
- If f is differentiable, then having L-Lipschitz gradients (L-LipG) means that ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, which is equivalent to f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖₂², which, if f is twice differentiable, is equivalent to ∇²f(x) ⪯ L·I.
Legendre transform
- A transformation between points and lines (a function is represented by its set of tangents).
- Stated here for one-dimensional functions (it generalizes to the Fenchel transform).
- f*(p) = max_x (px − f(x)).
- If f is differentiable, the maximizing x has the property that f′(x) = p.
Fenchel duality
- The Fenchel dual of a function f is f*(s) = sup_x ⟨s, x⟩ − f(x).
- If f is a closed convex function, then f** = f.
- f*(0) = max_x (−f(x)) = −min_x f(x).
- ∇f*(s) = argmax_x ⟨s, x⟩ − f(x), and ∇f(x) = argmax_s ⟨s, x⟩ − f*(s).
- ∇f*(∇f(x)) = x and ∇f(∇f*(s)) = s (a crude numerical conjugate is sketched below).
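A crude numerical sketch of the Fenchel conjugate for a one-dimensional function, maximizing over a finite grid; the grid and the test function f(x) = x²/2, whose conjugate is s²/2, are illustrative assumptions.

```python
import numpy as np

def fenchel_conjugate(f, xs):
    """Return s -> max over the grid xs of (s * x - f(x)); a rough 1-D approximation of f*."""
    fx = f(xs)
    return lambda s: np.max(s * xs - fx)

xs = np.linspace(-10.0, 10.0, 10001)
f = lambda x: 0.5 * x ** 2
f_star = fenchel_conjugate(f, xs)
print(f_star(3.0))   # ≈ 4.5, since the conjugate of x^2/2 is s^2/2
```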
Duality between strong convexity and Lipschitz gradients
- f is ρ-strongly convex iff f* has (1/ρ)-Lipschitz gradients.
- f has L-Lipschitz gradients iff f* is (1/L)-strongly convex.
- If a non-differentiable function f can be identified as g*, we can "smooth it out" by considering (g + δ‖·‖₂²)*, which is differentiable and has (1/δ)-Lipschitz gradients.
- (In the figure: δ = µ/2.)
Section
Introduction: Optimization with gradient methods
Introduction: Machine Learning and Optimization
Convex Analysis
(Sub)Gradient Methods
Online (sub)Gradient Methods
(Sub)Gradient Methods
- Gradient descent: w_{t+1} = w_t − a_t g_t, where g_t is a (sub)gradient of the objective f at w_t.
- a_t can be decided by line search or by a predetermined scheme.
- If we are constrained to a convex set X, then use the projected update w_{t+1} = Π_X(w_t − a_t g_t) = argmin_{w∈X} ‖w − (w_t − a_t g_t)‖₂ (see the sketch below).
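A sketch of projected subgradient descent where the constraint set is a Euclidean ball, so the projection is a simple rescaling; the objective, radius and step-size rule are illustrative assumptions.

```python
import numpy as np

def project_onto_ball(w, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_subgradient_descent(subgrad, w0, radius, steps, iters=500):
    """w_{t+1} = Pi_X(w_t - a_t g_t), with X a Euclidean ball."""
    w = np.asarray(w0, dtype=float)
    for t in range(iters):
        w = project_onto_ball(w - steps(t) * subgrad(w), radius)
    return w

# Minimize ||w - c||_1 over the unit ball; sign(w - c) is a subgradient of the objective.
c = np.array([2.0, -0.5])
w = projected_subgradient_descent(lambda w: np.sign(w - c), np.zeros(2),
                                  radius=1.0, steps=lambda t: 1.0 / (t + 1))
print(w)
```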
First and Second order information
- First order information: ∇f(w_t), t = 0, 1, 2, ..., n.
- Second order information is about ∇²f and is expensive to compute and store for high-dimensional problems.
- Newton's method is a second order method where w_{t+1} = w_t − H_t⁻¹ g_t.
- Quasi-Newton methods use first order information to try to approximate Newton's method.
- Conjugate gradient also uses only first order information.
Lower Bounds
- There are lower bounds for how well a first order method, i.e. a method for which w_{k+1} − w_k ∈ span(∇f(w_0), ..., ∇f(w_k)), can do. They depend on the function class!
- Given w_0 (and L, ρ), such a bound says that for any ε > 0 there exists an f in the class for which any first order method needs at least n_ε steps, together with a statement about how n_ε grows with 1/ε.
- Differentiable functions with Lipschitz continuous gradients: the lower bound is of order √(1/ε).
- Lipschitz continuous gradients and strong convexity: the lower bound is of order log(1/ε).
Gradient Descent with Lipschitz gradients
- If we have L-Lipschitz continuous gradients, then define gradient descent by w_{k+1} = w_k − (1/L)∇f(w_k).
- Monotone: f(w_{k+1}) ≤ f(w_k).
- If f is also ρ-strongly convex (ρ > 0), then f(w_k) − f(w*) ≤ (1 − ρ/L)^k (f(w_0) − f(w*)).
- This result translates into the optimal O(log(1/ε)) rate for first order methods (a numerical check is sketched below).
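A quick numerical check of the 1/L step size and the (1 − ρ/L)^k bound on a strongly convex quadratic; the matrix is an illustrative assumption, with ρ and L its extreme eigenvalues.

```python
import numpy as np

# f(w) = 0.5 w^T Q w, so rho and L are the smallest and largest eigenvalues of Q.
Q = np.diag([1.0, 10.0])
rho, L = 1.0, 10.0
f = lambda w: 0.5 * w @ Q @ w

w, w_star = np.array([1.0, 1.0]), np.zeros(2)
gap0 = f(w) - f(w_star)
for k in range(1, 51):
    w = w - (1.0 / L) * (Q @ w)                 # w_{k+1} = w_k - (1/L) grad f(w_k)
    bound = (1 - rho / L) ** k * gap0           # theoretical (1 - rho/L)^k (f(w_0) - f(w*)) bound
    assert f(w) - f(w_star) <= bound + 1e-12
print(f(w) - f(w_star))
```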
Gradient Descent without strong convexity
- Suppose that we do not have strong convexity (ρ = 0) but we have L-LipG.
- Then the gradient descent procedure w_{k+1} = w_k − (1/L)∇f(w_k) has rate O(1/ε): f(w_k) − f(w*) ≤ (L/k)‖w* − w_0‖₂².
- The lower bound for this class is of order √(1/ε).
- There is a gap!
- The question arose whether the bound was loose or whether there is a better first order method.
Nesterov's methods
- Nesterov answered the question in 1983 by constructing an "Optimal Gradient Method" with the rate O(√(1/ε)) (O(√(L/ε)) if we have not fixed L).
- Later variations (1988, 2005) widened the applicability.
- Introduces a momentum term (see the sketch below).
- w_{k+1} = y_k − (1/L)∇f(y_k), y_{k+1} = w_{k+1} + α_k(w_{k+1} − w_k)
- α_k = (β_k − 1)/β_{k+1}, β_{k+1} = (1 + √(4β_k² + 1))/2.
- Iteration cost similar to gradient descent.
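A sketch of the accelerated scheme in the form written above; the choice β₀ = 1 and the badly conditioned quadratic are illustrative assumptions.

```python
import numpy as np

def nesterov(grad, w0, L, iters=200):
    """Accelerated gradient: w_{k+1} = y_k - (1/L) grad(y_k); y_{k+1} = w_{k+1} + a_k (w_{k+1} - w_k)."""
    w = np.asarray(w0, dtype=float)
    y, beta = w.copy(), 1.0
    for _ in range(iters):
        w_next = y - grad(y) / L
        beta_next = (1.0 + np.sqrt(4.0 * beta ** 2 + 1.0)) / 2.0
        alpha = (beta - 1.0) / beta_next
        y = w_next + alpha * (w_next - w)
        w, beta = w_next, beta_next
    return w

# Badly conditioned quadratic f(w) = 0.5 w^T Q w - b^T w, with L the largest eigenvalue of Q.
Q = np.diag([1.0, 100.0])
b = np.array([1.0, 1.0])
print(nesterov(lambda w: Q @ w - b, np.zeros(2), L=100.0))   # ≈ [1.0, 0.01]
```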
Non-differentiable objectives: Subgradient descent
- Consider continuous but not necessarily differentiable objectives.
- Given a subgradient g_k at w_k we can perform an update w_{k+1} = w_k − η_k g_k.
- In the typical machine learning situation we only have non-differentiability on a set of Lebesgue measure 0, so g_k will actually often be a gradient.
- However, the gradients will "jump" at the non-differentiability, so we cannot have Lipschitz continuity of a (sub)gradient field in this situation.
- Subgradient descent has the bad rate O(1/ε²).
Nesterov’s smoothing trick
- If we have a continuous non-differentiable function f that can be identified with g*, then we can use the following trick.
- Given a desired accuracy ε > 0, apply Nesterov's method to h = (g + δ‖·‖₂²)* where δ = ε/2, but go to precision ε/2.
- h has L-Lipschitz gradients where L = 1/δ.
- f(w) − h(w) ≤ δ = ε/2.
- Combined error at most ε.
- Since h has Lipschitz gradients, this takes O(√(L/ε)) = O(1/ε) iterations (a concrete instance is sketched below).
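A concrete instance of the smoothing, assuming f(w) = |w|, which is the conjugate of the indicator g of [−1, 1]; conjugating g + δx² gives the Huber-like function below (quadratic for |w| ≤ 2δ, linear beyond). The closed form is my own calculation for illustration, not taken from the slides.

```python
import numpy as np

def smoothed_abs(w, delta):
    """Huber-like smoothing of |w|, i.e. (g + delta * x^2)^* with g the indicator of [-1, 1]."""
    w = np.asarray(w, dtype=float)
    quadratic = w ** 2 / (4.0 * delta)       # region |w| <= 2*delta
    linear = np.abs(w) - delta               # region |w| >  2*delta
    return np.where(np.abs(w) <= 2.0 * delta, quadratic, linear)

delta = 0.05
w = np.linspace(-1.0, 1.0, 5)
print(np.abs(w) - smoothed_abs(w, delta))    # the gap is at most delta everywhere
```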
Summary of offline gradient methods
- Normal gradient descent has the optimal rate, O(log(1/ε)), for strongly convex objectives with Lipschitz gradients.
- Normal gradient descent is not optimal (it is O(1/ε)) for convex but not strongly convex objectives. Nesterov's gradient method is optimal (O(√(1/ε))).
- For non-differentiable objectives, subgradient descent is in general O(1/ε²).
- By utilizing special properties this can be improved. If the objective can be identified with the dual of another function, we can use a smoothing trick and get O(1/ε).
Section
Introduction: Optimization with gradient methods
Introduction: Machine Learning and Optimization
Convex Analysis
(Sub)Gradient Methods
Online (sub)Gradient Methods
Online Optimization
- I will consider optimization problems of the form C(w) = E_z L(z, w).
- If the underlying distribution of z is unknown, then we do not know C either.
- We observe a sequence z_1, z_2, ... and use that information to update the parameters w_t.
Performance Measures
- There are different ways of measuring performance.
- C(w_t), or ‖w_t − w*‖ (if there exists a unique minimum w*).
- ∑_{i=1}^t L(z_i, w_i), or the regret ∑_{i=1}^t L(z_i, w_i) − inf_w ∑_{i=1}^t L(z_i, w).
- Interesting function classes (for C and L).
- E.g. L(z, w) = ‖w − z‖₂².
Stochastic Gradient Descent
- When an empirical average C_n(w) = (1/n) ∑_{i=1}^n L(z_i, w) is minimized, "batch" optimizers have to calculate the entire sum for each evaluation of C_n(w) and ∇_w C_n(w). (Gradient with respect to a chosen metric, e.g. standard Euclidean w.r.t. a given coordinate system.)
- Stochastic gradient methods work with gradient estimates obtained from subsets of the training data.
- On large redundant data sets, simple stochastic gradient descent (SGD) typically outperforms second order "batch" methods by orders of magnitude.
- SGD is defined by w_{t+1} = w_t − a_t Y_t, where E(Y_t) = ∇_w C(w_t) and a_t > 0 (a minimal sketch follows below).
- There have been many different attempts at using second order information:
- w_{t+1} = w_t − a_t B_t Y_t, where B_t is a positive definite scaling matrix.
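A minimal SGD sketch for the squared loss L(z, w) = ½(w^T x − y)², using a Robbins-Monro style step size a_t = 1/(t + t₀); the synthetic data and schedule constants are illustrative assumptions.

```python
import numpy as np

def sgd(X, y, w0, t0=10.0, iters=20000, seed=0):
    """SGD on the squared loss: each step uses one sample's gradient (w^T x_i - y_i) x_i."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for t in range(1, iters + 1):
        i = rng.integers(len(y))
        grad_i = (X[i] @ w - y[i]) * X[i]
        w -= grad_i / (t + t0)                 # a_t = 1 / (t + t0)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)
print(sgd(X, y, np.zeros(3)))                  # ≈ w_true
```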
Some Historical landmarks
- Robbins and Monro 1951 introduced one-dimensional SGD and proved a convergence theorem.
- Blum 1954 generalized this to the multivariate case, but with restrictive assumptions.
- Robbins and Siegmund 1971 achieved a more general result for the multivariate case using super-martingale theory.
- Lai and Robbins 1979 investigated one-dimensional "adaptive procedures" (step sizes not predetermined) by introducing estimated curvature.
- Wei 1987 did the multivariate case (proved asymptotic efficiency).
- More on second order methods: Amari 1998, Murata 1998, Bottou and LeCun 2005.
- Second order information only changes the convergence by a constant factor; the rate remains the same.
- E. Hazan et al. 2006 showed logarithmic (instead of square root) regret for general online convex optimization with Newton steps.
Conditions for SGD convergence
- The cost function C(w) = E_z L(z, w) is three times continuously differentiable and bounded from below (can be replaced by other assumptions, e.g. a bound on the subgradient sets).
- ∑_t a_t² < ∞ and ∑_t a_t = ∞.
- E_z(L(z, w)²) ≤ A + B‖w‖₂².
- inf_{‖w‖>D} w · ∇_w C(w) > 0.
- sup_{‖w‖<E} ‖L(z, w)‖ ≤ K.
- Conclusion: ‖w_t − w*‖₂² = O(1/t) a.s.
- Addition: updates with scaling matrices also converge a.s. if the eigenvalues are uniformly bounded from below by a strictly positive constant and from above by a finite constant (Sunehag et al. 2009).
Step sizes
- The Robbins-Monro condition ∑_t a_t² < ∞, ∑_t a_t = ∞ is e.g. satisfied by a_t = 1/t.
- b/(t + b) (or 1/(t + b)) also works; the step size is halved after b steps, giving an initial search phase with a rather flat schedule.
- Step sizes that do not satisfy the Robbins-Monro conditions are also of interest.
- Constant step sizes get us to a ball around the optimum whose radius depends on the constant. This is also good for tracking a changing environment.
- Intermediate alternatives like 1/√t.
- Various adaptive (data-dependent) schemes have been tried, e.g. Stochastic Meta-Descent by Schraudolph. Unclear verdict: it shows improvement in special cases but causes extra computation.
Regret bounds
- Finite time analysis.
- The regret is ∑_{i=1}^t L(z_i, w_i) − inf_w ∑_{i=1}^t L(z_i, w).
- Under a strong convexity assumption one can prove that SGD with the right step sizes, a_t = 1/(ρt), has logarithmic regret, which is the best possible. This depends on knowing ρ (a small sketch follows below).
- Without strong convexity it is O(√t).
- Using Newton steps, one can still have logarithmic regret.
- See the later online learning lecture.
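A small sketch that measures regret for online gradient descent with a_t = 1/(ρt) on the 1-strongly convex losses L(z, w) = ½(w − z)²; the data stream is illustrative, and for this loss the best fixed comparator is just the mean of the stream.

```python
import numpy as np

rng = np.random.default_rng(0)
z_stream = rng.normal(loc=1.5, scale=1.0, size=2000)

rho, w = 1.0, 0.0
losses = []
for t, z in enumerate(z_stream, start=1):
    losses.append(0.5 * (w - z) ** 2)          # suffer L(z_t, w_t) before updating
    w -= (1.0 / (rho * t)) * (w - z)           # online gradient step with a_t = 1/(rho * t)

best_w = z_stream.mean()                        # minimizer of sum_t 0.5 * (w - z_t)^2
regret = np.sum(losses) - np.sum(0.5 * (best_w - z_stream) ** 2)
print(regret, np.log(len(z_stream)))            # regret grows roughly logarithmically here
```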
Second order methods
- w_t − w_{t−1} ≈ H_{w_t}⁻¹ (∇C(w_t) − ∇C(w_{t−1})) (the secant relation used by Quasi-Newton methods).
- Estimating full Hessians (various methods) gives a cost of order d² per step instead of d. Prohibitive for high-dimensional problems.
- Low rank approximations (online L-BFGS, Schraudolph et al. 2007) can have cost of order kd where k is the memory length.
- Convergence speed improves by at most a constant factor, while the cost increases by a constant factor.
Second order on the cheap
- Online L-BFGS with memory 10 multiplies the cost per step by at least 11. Only worth it for really ill-conditioned problems.
- Becker and LeCun estimated a diagonal scaling matrix for neural nets in 1989.
- SGD-QN by Bordes, Bottou and Gallinari won the wild track of the PASCAL Large Scale Learning Challenge 2008 by estimating a diagonal scaling matrix (for SVMs) with an L-BFGS-inspired method.
Support Vector Machines with SGD
- SVMs (hinge loss): min_w (1/m) ∑_{i=1}^m max(0, 1 − y_i w^T x_i) + λ‖w‖₂².
- Instantaneous objective: max(0, 1 − y_i w_t^T x_i) + λ‖w_t‖₂², if we have drawn sample i at the current time t.
- (Sub)Gradient: if correctly classified with margin at least 1, then g_t = 2λw_t. If not, then g_t = 2λw_t − y_i x_i.
- λ-strong convexity suggests step size 1/(λt) or 1/(λ(t + t_0)).
Support Vector Machines with Pegasos
- Note: plugging in w = 0 makes the objective equal to 1. If we have w with ‖w‖₂ ≥ 1/√λ, then the objective is at least 1.
- Conclusion: we only have to consider ‖w‖₂ ≤ 1/√λ.
- This is utilized by the Pegasos algorithm, which alternates the gradient step above with a projection (scaling) onto this ball (see the sketch below).
- Proven rate: O(1/ε).
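A rough Pegasos-style sketch combining the subgradient from the previous slide, the 1/(λt) step size, and the projection onto the ball ‖w‖₂ ≤ 1/√λ; the toy data, λ and iteration count are illustrative, and refinements from the actual Pegasos paper (e.g. mini-batches) are omitted.

```python
import numpy as np

def pegasos_like(X, y, lam, iters=20000, seed=0):
    """SGD for (1/m) sum_i max(0, 1 - y_i w^T x_i) + lam ||w||_2^2 with projection onto the ball."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    radius = 1.0 / np.sqrt(lam)
    for t in range(1, iters + 1):
        i = rng.integers(len(y))
        step = 1.0 / (lam * t)
        g = 2.0 * lam * w
        if y[i] * (X[i] @ w) < 1.0:              # margin violated: hinge term contributes -y_i x_i
            g = g - y[i] * X[i]
        w = w - step * g
        norm = np.linalg.norm(w)
        if norm > radius:                        # projection: scale back onto ||w|| <= 1/sqrt(lam)
            w *= radius / norm
    return w

# Toy linearly separable data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 0.5, size=(100, 2)), rng.normal(-1.0, 0.5, size=(100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])
w = pegasos_like(X, y, lam=0.1)
print(np.mean(np.sign(X @ w) == y))              # training accuracy of the linear classifier
```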
Summary
- Simple methods with cheap updates can be the best choice for parameter estimation in machine learning.
- This is particularly true in large scale learning.
- With SGD, the time per iteration does not depend on the data set size. Having more data just means that we do not rerun on old data.
- Faster progress is made when using "fresh" data.
- More data therefore means LESS runtime (for SGD) until we reach the desired true accuracy.
Literature
- Numerical Optimization, Jorge Nocedal and Stephen J. Wright, Springer, August 2000
- Introductory Lectures on Convex Optimization: A Basic Course, Yurii Nesterov, Springer 2003
- Convex Analysis, R. Tyrrell Rockafellar, Princeton University Press 1996
- Convex Optimization, Stephen Boyd and Lieven Vandenberghe, Cambridge University Press 2004
- L. Bottou's SGD page: http://leon.bottou.org/research/stochastic