First-order methods / Convexity
10-725 Optimization
Geoff Gordon, Ryan Tibshirani
Administrivia
• HW1 out, due 9/20
‣ in class, at the beginning of class; no skipping lecture to keep working on it ;-)
‣ if you use late days, hand in to course assistant before 1:30PM on due date + k days
• Reminder: think about project teams and project proposals (due 9/25)
Review
• Convex sets
‣ primal (convex hull) vs. dual (intersection of hyperplanes)
‣ supporting, separating hyperplanes
‣ operations that preserve convexity
‣ affine fn; perspective
‣ open/closed/compact
Review
• Convex functions
‣ epigraph
‣ domain
‣ sublevel sets; quasiconvexity
‣ first-order, second-order conditions
‣ operations that preserve convexity
‣ perspective; minimization over one argument
Ex: structured classifier
[Figure: chain-structured example with inputs xi, labels yi, and feature functions φj(xi), ψijk(xi, yi), χikl(yi, yi+1)]
Classifier: L(x, y; v, w) =
Learning structured classifier
• Get it right if:
• So, want:
• Where Δ(y, y′) = {
• RHS:
• RHS – LHS:
• Train: lots of pairs (xt,yt)
Strict convexity; strong convexity
• Strictly convex: ‣
• k-strongly convex:‣
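For reference, standard statements of both (the exact form and choice of norm here are the usual ones, not taken from the slide):
‣ f strictly convex: f(tx+(1-t)y) < tf(x) + (1-t)f(y) for all x ≠ y and t ∈ (0, 1)
‣ f k-strongly convex (w.r.t. ||·||₂): f(tx+(1-t)y) ≤ tf(x) + (1-t)f(y) - (k/2)t(1-t)||x - y||² for all x, y and t ∈ [0, 1]; equivalently, f(x) - (k/2)||x||² is convex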
Extended reals
• Suppose dom f ⊂ Rn
• Define g(x) = f(x) for x ∈ dom f, and g(x) = +∞ otherwise
• f convex ⟺ g convex
‣ f(tx+(1-t)y) ≤ tf(x) + (1-t)f(y)
‣ g(tx+(1-t)y) ≤ tg(x) + (1-t)g(y)
‣ cases:
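Spelling out the cases (using the definition of g above and the standing assumption that dom f is convex):
‣ x, y ∈ dom f: then tx+(1-t)y ∈ dom f, g agrees with f at all three points, and the inequality is exactly f's convexity inequality
‣ x ∉ dom f or y ∉ dom f: the right-hand side is +∞, so the inequality holds trivially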
Lipschitz
• Function f(x) is Lipschitz (in norm ||x||) if:‣
[Plots of x^(2/3), |x|, and sgn(x)]
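For reference, the standard definition, and how it applies to the three plotted functions (this reading of the plots is my own):
‣ f is L-Lipschitz in ||·|| if |f(x) - f(y)| ≤ L||x - y|| for all x, y
‣ |x| is Lipschitz with L = 1; x^(2/3) is not Lipschitz on any neighborhood of 0 (its slope blows up); sgn(x) has a jump at 0, so it is not Lipschitz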
Back to gradient descent
• Suppose f(x) is convex, ∇f(x) exists
• Iterations to get to accuracy ϵ(f(x0) − f(x*)), if:
‣ f Lipschitz: O(1/ϵ²)
‣ ∇f Lipschitz: O(1/ϵ)
‣ f strongly convex: O(ln(1/ϵ))
• Constant in O(…): conditioning of f
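A minimal gradient-descent sketch in Python (my own illustration, not from the slides; the quadratic example, the step size 1/L, and the stopping rule are assumptions):

import numpy as np

def grad_descent(grad_f, x0, step, tol=1e-6, max_iters=10000):
    """Plain gradient descent with a fixed step size.

    grad_f(x): returns the gradient of f at x
    step:      fixed step size; if the gradient is L-Lipschitz, step = 1/L is a standard choice
    Stops when the gradient norm drops below tol.
    """
    x = x0
    k = 0
    for k in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - step * g
    return x, k

# Illustrative quadratic f(x) = 0.5 * x'Ax, whose gradient Ax is Lipschitz with
# constant L = largest eigenvalue of A (here 10, so condition number 10).
A = np.diag([1.0, 10.0])
x_hat, iters = grad_descent(lambda x: A @ x, x0=np.array([1.0, 1.0]), step=1.0 / 10.0)
print(iters, x_hat)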
Conditioning
[Figures from Boyd & Vandenberghe §9.4 (steepest descent method):
Figure 9.12: Steepest descent method with quadratic norm ||·||_P2.
Figure 9.13: Error f(x^(k)) − p* versus iteration k for steepest descent with the quadratic norms ||·||_P1 and ||·||_P2; convergence is rapid for ||·||_P1 and very slow for ||·||_P2.
Figure 9.14: Iterates of steepest descent with norm ||·||_P1, after a change of coordinates; the change reduces the condition number of the sublevel sets, and so speeds up convergence.
Figure 9.15: Iterates of steepest descent with norm ||·||_P2, after a change of coordinates; the change increases the condition number of the sublevel sets, and so slows down convergence.]
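A small, self-contained Python experiment (again my own illustration, not from the slides) showing how the condition number of a quadratic drives the iteration count of plain gradient descent:

import numpy as np

def gd_iters(kappa, tol=1e-6, max_iters=100000):
    """Iterations of gradient descent on f(x) = 0.5*(x1^2 + kappa*x2^2)
    needed to reach f(x) < tol, starting from (1, 1), with step 1/L = 1/kappa.
    kappa is the condition number of the Hessian diag(1, kappa)."""
    a = np.array([1.0, kappa])
    x = np.array([1.0, 1.0])
    step = 1.0 / kappa
    for k in range(max_iters):
        if 0.5 * np.dot(a, x * x) < tol:      # f(x)
            return k
        x = x - step * (a * x)                # gradient is (x1, kappa*x2)
    return max_iters

for kappa in [2.0, 10.0, 100.0]:
    print(kappa, gd_iters(kappa))             # iteration count grows roughly linearly in kappa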
Extensions
• Subgradient descent
• Prox operator (e.g., )
• FISTA, mirror descent, conjugate gradient
• Nesterov’s smoothing
• Line search (BV sec 9.2)
• Stochastic GD (when f(x) = E(fi(x) | i ~ P))
‣ sample one i on each iter, use ∇fi(x)
‣ or, minibatches: sample a few i’s, use mean ∇fi(x)
gt = argmin_g ||g||_p² + ∇f·g
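As one concrete prox-operator example (my choice of example, not necessarily the one intended on the slide): the prox of λ||·||₁ is elementwise soft-thresholding, which gives the proximal-gradient (ISTA) step sketched below in Python.

import numpy as np

def soft_threshold(v, t):
    """prox of t*||.||_1 at v: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_step(x, grad_f, step, lam):
    """One proximal-gradient (ISTA) step for f(x) + lam*||x||_1:
    a gradient step on the smooth part f, then the prox of the l1 part."""
    return soft_threshold(x - step * grad_f(x), step * lam)

# Tiny usage example with f(x) = 0.5*||x - b||^2, so grad_f(x) = x - b.
b = np.array([3.0, -0.2, 0.0])
x = np.zeros(3)
for _ in range(100):
    x = prox_grad_step(x, lambda z: z - b, step=1.0, lam=0.5)
print(x)   # entries of b with |b_i| <= lam end up exactly 0; the rest are shrunk by lam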
Comparison: stochastic GD
• Iteration bounds for stochastic GD
‣ f Lipschitz: O(1/ϵ²) (worse const, same O() as GD)
‣ f strongly convex: O(ln(1/ϵ)/ϵ) (much worse)
• f Lipschitz: stochastic GD often wins big
• Even if f strongly convex:
‣ plain GD: each iter O(N), #iters O(ln(1/ϵ))
‣ stochastic: each iter O(1), #iters O(ln(1/ϵ)/ϵ)
‣ stochastic can win if lots of data, loose tolerance
‣ could make sense to throw data away, use full GD
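A minimal minibatch stochastic gradient descent sketch in Python, along the lines of the "sample a few i’s, use mean ∇fi(x)" recipe above; the least-squares objective, decaying step size, and batch size are illustrative assumptions:

import numpy as np

def sgd(grad_fi, n, x0, batch=10, iters=1000, step0=0.1):
    """Minibatch stochastic gradient descent for f(x) = mean_i f_i(x).

    grad_fi(x, i): gradient of the i-th term at x
    Each iteration samples `batch` indices, averages their gradients,
    and takes a step with decaying step size step0/sqrt(t).
    """
    rng = np.random.default_rng(0)
    x = x0
    for t in range(1, iters + 1):
        idx = rng.integers(0, n, size=batch)
        g = np.mean([grad_fi(x, i) for i in idx], axis=0)
        x = x - (step0 / np.sqrt(t)) * g
    return x

# Illustrative least-squares problem: f_i(x) = 0.5*(a_i . x - b_i)^2.
rng = np.random.default_rng(1)
A = rng.normal(size=(1000, 5))
b = A @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]
print(sgd(grad_fi, n=1000, x0=np.zeros(5)))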