Forward-Backward limited memory quasi-Newton methods for large scale nonsmooth convex optimization

Transcript of "Forward-Backward limited memory quasi-Newton methods for large scale nonsmooth convex optimization"

Page 1:

Forward-Backward limited memory quasi-Newton methods for large scale nonsmooth convex optimization

Panos Patrinos, IMT Lucca (KU Leuven starting Sept. 1st, 2015)

joint work with Lorenzo Stella, PhD student, IMT Lucca

Télécom ParisTech, June 25, 2015

Page 2: Nonsmooth convex optimization

composite form:                      separable form:

  minimize f(x) + g(x)                 minimize f(x) + g(z)
                                       subject to Ax + Bz = b

- f, g convex functions (possibly nonsmooth)

- norm-regularized problems: f is a loss function, g is a regularizer

- constrained problems: g(x) = δ_C(x) = 0 if x ∈ C, +∞ otherwise

- statistics, machine learning, signal processing, communications, control, system identification, image analysis, ...

Page 3: Operator splitting methods

  minimize f(x) + g(x)                 minimize f(x) + g(z)
                                       subject to Ax + Bz = b

- solve the problem by handling each term separately (or its conjugate):
  - perform either a gradient step (explicit or forward)

        z = x − γ∇h(x)

  - or a proximal step (implicit or backward)

        z = prox_{γh}(x) = argmin_{z ∈ IR^n} { h(z) + (1/(2γ))‖z − x‖² }

- forward-backward splitting (FBS), Douglas-Rachford splitting (DRS)

- AMM, ADMM (dual versions of FBS and DRS)

- simple, exploit structure; distributed implementations

- but: very prone to ill-conditioning, as any first-order method

Page 4: Gradient and variable-metric methods

  minimize_{x ∈ IR^n} f(x),   f smooth

[figure: contour plot of f(x) over (x1, x2) with current iterate x^k and minimizer x*]

Page 5: Gradient and variable-metric methods

  x^{k+1} = argmin_x { f(x^k) + ⟨∇f(x^k), x − x^k⟩ + (1/(2γ_k))‖x − x^k‖² }
          = x^k − γ_k ∇f(x^k)

  (the bracketed quadratic model is Q^f_γ(x; x^k))

[figure: contour plot over (x1, x2) with x^k, x^{k+1}, x*, and the quadratic model Q^f_γ(x; x^k) around f(x)]

Page 6: Gradient and variable-metric methods

  x^{k+1} = argmin_x { f(x^k) + ⟨∇f(x^k), x − x^k⟩ + (1/(2γ_k))⟨x − x^k, H_k^{-1}(x − x^k)⟩ }
          = x^k − γ_k H_k ∇f(x^k)

  (the bracketed quadratic model is Q^f_{γH_k}(x; x^k))

[figure: contour plot over (x1, x2) with x^k, x^{k+1}, x*, and the variable-metric model Q^f_{γH_k}(x; x^k) around f(x)]

Page 7: Large-scale smooth convex optimization

  x^{k+1} = x^k − γ_k H_k ∇f(x^k)

- Newton's method: H_k = ∇²f(x^k)^{-1}
  - for n = 10^6: 8 TB of storage, 1/3 billion GFLOP to factor the Hessian
  - 4 months per Newton step on my MacBook Pro (30 GFLOPS)

- quasi-Newton methods (e.g. BFGS)

      H_{k+1} = (I − ρ_k y_k s_k^T)^T H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T

      s_k = x^{k+1} − x^k,   y_k = ∇f(x^{k+1}) − ∇f(x^k),   ρ_k = 1/⟨y_k, s_k⟩

  - maintain and update an approximation of the inverse Hessian
  - still require storage of an n × n matrix

- limited-memory BFGS (L-BFGS)
  - H_k is stored implicitly by saving the m most recent pairs {s_i, y_i}, i = k−m, ..., k−1
  - H_k ∇f(x^k) is computed on the fly in O(n) operations

Can we extend L-BFGS to nonsmooth problems while maintaining favorable structural properties of splitting methods?

Page 8: Goals

- interpret forward-backward splitting as a gradient method...

- ...on a smooth function we call the forward-backward envelope (FBE)

- the FBE generalizes the notion of Moreau envelope

- can apply any method for unconstrained smooth optimization...

- ...to solve nonsmooth problems...

- ...while retaining favorable features of splitting methods
  - forward-backward L-BFGS
  - alternating minimization L-BFGS

- requires exactly the same oracle as FBS

- optimal global complexity estimates, but much faster in practice

Page 9: Outline

- FBS and forward-backward envelope (FBE)
- Forward-Backward L-BFGS and acceleration
- AMM and alternating minimization L-BFGS
- ForBES (Forward Backward Envelope Solver)
- conclusions

Page 10: Outline (next: FBS and forward-backward envelope (FBE))

Page 11: Problem Definition

  minimize ϕ(x) = f(x) + g(x)

- f, g closed proper convex

- f : IR^n → IR is L_f-strongly smooth:

      f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L_f/2)‖y − x‖²   for all x, y

  (quadratic upper bound: if ∇f is Lipschitz continuous with parameter L, then
  f(y) ≤ f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖² for all x, y ∈ dom f;
  if moreover f is convex with dom f = IR^n and has a minimizer x*, this implies
  (1/(2L))‖∇f(x)‖² ≤ f(x) − f(x*) ≤ (L/2)‖x − x*‖² for all x)

- g : IR^n → IR ∪ {+∞} has an easily computable proximal mapping (see the sketch after this slide)

      prox_{γg}(x) = argmin_{z ∈ IR^n} { g(z) + (1/(2γ))‖z − x‖² },   γ > 0
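As a concrete illustration of an "easily computable proximal mapping" (a sketch added here, not part of the slides): for g = λ‖·‖₁ the prox above reduces to soft thresholding, which in MATLAB could look as follows (prox_l1 is a hypothetical helper name).

% Sketch (not from the slides): prox_{gamma*g} for g = lam*||.||_1,
% i.e. soft thresholding.
function z = prox_l1(x, gamma, lam)
    % argmin_z { lam*||z||_1 + 1/(2*gamma)*||z - x||^2 }
    z = sign(x) .* max(abs(x) - gamma*lam, 0);
end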

Page 12: Forward-Backward Splitting (FBS)

  x^{k+1} = prox_{γg}(x^k − γ∇f(x^k)),   γ ∈ (0, 2/L_f)

- fixed-point iteration for the optimality condition

      x = prox_{γg}(x − γ∇f(x)),   γ > 0

- generalizes classical methods:

      x^{k+1} = x^k − γ∇f(x^k)             g = 0     gradient method
      x^{k+1} = Π_C(x^k − γ∇f(x^k))        g = δ_C   gradient projection
      x^{k+1} = prox_{γg}(x^k)             f = 0     proximal point algorithm

- accelerated versions: FISTA (Nesterov, Beck, Teboulle)

- for γ ∈ (0, 1/L_f] it can be seen as a majorization-minimization method (a minimal code sketch follows this slide)

      x^{k+1} = argmin_{z ∈ IR^n} { f(x^k) + ⟨∇f(x^k), z − x^k⟩ + (1/(2γ))‖z − x^k‖² + g(z) }

  (the first three terms form the quadratic model Q^f_γ(z; x^k))
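To make the iteration concrete, here is a minimal MATLAB sketch of FBS applied to the lasso instance f(x) = (1/2)‖Ax − b‖², g = λ‖·‖₁ (an illustration of the update above, not the ForBES implementation; the data below is synthetic).

% Sketch of x+ = prox_{gamma*g}(x - gamma*grad f(x)) for the lasso
% f(x) = 0.5*||A*x - b||^2, g = lam*||.||_1 (illustration, not ForBES).
A = randn(50, 200); b = randn(50, 1); lam = 0.1;
Lf = norm(A)^2;                     % Lipschitz constant of grad f
gamma = 1/Lf;                       % step size in (0, 2/Lf)
x = zeros(200, 1);
for k = 1:1000
    w = x - gamma * (A'*(A*x - b));              % forward (gradient) step
    x = sign(w) .* max(abs(w) - gamma*lam, 0);   % backward (proximal) step
end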

Pages 13-18: Forward-Backward Splitting (FBS)

  minimize ϕ(x) = f(x) + g(x)

- forward-backward splitting (aka proximal gradient), γ ≤ 1/L_f:

      x^{k+1} = argmin_{z ∈ IR^n} { f(x^k) + ⟨∇f(x^k), z − x^k⟩ + (1/(2γ))‖z − x^k‖² + g(z) }

  (the first three terms form the quadratic model Q^f_γ(z; x^k))

[figure, shown frame by frame on the original slides 13-18: the models Q^f_γ(z; x^k) + g(z) majorize ϕ = f + g and are minimized at the successive iterates x^0, x^1, x^2, x^3]


Page 19: Accelerating FBS

- FBS can be slow (like the gradient method)
- remedy: a better quadratic model for f (as in quasi-Newton methods)

      x^{k+1} = argmin_x { f(x^k) + ⟨∇f(x^k), x − x^k⟩ + (1/2)‖x − x^k‖²_{B_k} + g(x) }

  where B_k is (an approximation of) ∇²f(x^k)
  [Becker, Fadili (2012); Lee et al. (2012); Tran-Dinh et al. (2013)]

- bad idea: the inner subproblem has no closed-form solution

- another iterative procedure is needed to compute x^{k+1}

Forward-Backward Envelope (FBE)

- limited-memory quasi-Newton methods for nonsmooth problems

- no need for an inner iterative procedure

Pages 20-22: Forward-Backward Envelope (FBE)

value function of the FBS subproblem (γ ≤ 1/L_f):

      ϕ_γ(x) = min_z { f(x) + ⟨∇f(x), z − x⟩ + g(z) + (1/(2γ))‖z − x‖² }

[figure, shown frame by frame on the original slides 20-22: ϕ_γ(x) ≤ ϕ(x) at each x; the envelope ϕ_γ is a smooth lower approximation of ϕ]


Pages 23-25: Key property of FBE

upper bound:

      ϕ_γ(x) ≤ ϕ(x) − (1/(2γ))‖R_γ(x)‖²

lower bound:

      ϕ_γ(x) ≥ ϕ(T_γ(x)) + ((1 − γL_f)/(2γ))‖R_γ(x)‖²

where

      T_γ(x) = prox_{γg}(x − γ∇f(x)),   R_γ(x) = x − T_γ(x)

[figure, shown frame by frame on the original slides 23-25: ϕ and ϕ_γ plotted together; ϕ(T_γ(x)) ≤ ϕ_γ(x) ≤ ϕ(x), and at a minimizer x* = T_γ(x*) one has ϕ(x*) = ϕ_γ(x*)]

- minimizing the FBE is equivalent to minimizing ϕ if γ ∈ (0, 1/L_f]:

      inf ϕ = inf ϕ_γ,   argmin ϕ = argmin ϕ_γ


Page 26: Differentiability of FBE

The FBE can be expressed as

      ϕ_γ(x) = f(x) − (γ/2)‖∇f(x)‖² + g^γ(x − γ∇f(x))

where g^γ is the Moreau envelope of g,

      g^γ(x) = inf_{z ∈ IR^n} { g(z) + (1/(2γ))‖z − x‖² }

a smooth approximation of g with

      ∇g^γ(x) = γ^{-1}(x − prox_{γg}(x))

[figure: g(x) = |x| together with its Moreau envelopes g^{0.1} and g^{10}]

- if f is C² then the FBE is C¹ with (a sketch evaluating both quantities follows this slide)

      ∇ϕ_γ(x) = γ^{-1}(I − γ∇²f(x))(x − prox_{γg}(x − γ∇f(x)))
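The two formulas above translate directly into code; the following MATLAB sketch (my own, assuming handles f, gradf, hessf, g and proxg for f, ∇f, ∇²f, g and prox_{γg} are available) evaluates the FBE and its gradient at a point x.

% Sketch: evaluate phi_gamma(x) and grad phi_gamma(x); f, gradf, hessf,
% g, proxg are assumed function handles (not part of the slides).
function [val, grad] = fbe(x, gamma, f, gradf, hessf, g, proxg)
    gf = gradf(x);
    z  = proxg(x - gamma*gf, gamma);    % T_gamma(x)
    % value of the FBS subproblem at its minimizer z
    val  = f(x) + gf'*(z - x) + g(z) + norm(z - x)^2/(2*gamma);
    % (1/gamma)*(I - gamma*hess f(x))*(x - z)
    grad = (x - z)/gamma - hessf(x)*(x - z);
end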

Page 27: Problem equivalence

if f is C² then the FBE is C¹ with

      ∇ϕ_γ(x) = γ^{-1}(I − γ∇²f(x))(x − prox_{γg}(x − γ∇f(x)))

- for γ ∈ (0, 1/L_f) the stationary points of ϕ_γ are the minimizers of ϕ:

      argmin ϕ = argmin ϕ_γ = zer ∇ϕ_γ

- FBS is a variable-metric gradient method on the FBE:

      x^{k+1} = x^k − γ(I − γ∇²f(x^k))^{-1} ∇ϕ_γ(x^k)

- extends a classical result of Rockafellar (take f = 0):

      x^{k+1} = prox_{γg}(x^k) = x^k − γ∇g^γ(x^k)

- paves the way for applying L-BFGS to nonsmooth composite problems

Page 28: Outline (next: Forward-Backward L-BFGS and acceleration)

Page 29: Algorithm 1 (Forward-Backward L-BFGS)

1: choose γ ∈ (0, 1/L_f), starting point x^0 ∈ IR^n, integer m > 0
2: compute d^k = −H_k ∇ϕ_γ(x^k)
3: update x^{k+1} = x^k + τ_k d^k, where τ_k satisfies the Wolfe conditions
4: if k > m then discard {s_{k−m}, y_{k−m}}
5: store s_k = x^{k+1} − x^k,  y_k = ∇ϕ_γ(x^{k+1}) − ∇ϕ_γ(x^k),  ρ_k = 1/⟨y_k, s_k⟩
6: k ← k + 1 and go to 2

- H_k can be seen as an approximation of the inverse Hessian (when it exists)

- no need to form H_k

- H_k ∇ϕ_γ(x^k) ← r, where r comes from the L-BFGS two-loop recursion [1] (a MATLAB transcription follows this slide):

      q ← ∇ϕ_γ(x^k)
      for i = k−1, ..., k−m:   α_i ← ρ_i⟨s_i, q⟩,  q ← q − α_i y_i
      r ← H_k^0 q
      for i = k−m, ..., k−1:   r ← r + s_i(α_i − ρ_i⟨y_i, r⟩)

- requires 4mn multiplications (m usually between 3 and 20)

[1] Liu and Nocedal, "On the limited memory BFGS method for large scale optimization", Mathematical Programming 45.1-3 (1989): 503-528.
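A minimal MATLAB transcription of the two-loop recursion above (my own sketch; S and Y are assumed to hold the m stored pairs s_i, y_i as columns, oldest first, and H0 is the initial scaling):

% Sketch of the L-BFGS two-loop recursion: returns r ~ H_k*q from the
% stored pairs (columns of S, Y) and an initial scaling H0.
function r = lbfgs_two_loop(q, S, Y, H0)
    m = size(S, 2);
    alpha = zeros(m, 1);
    rho = 1 ./ sum(Y .* S, 1)';           % rho_i = 1/<y_i, s_i>
    for i = m:-1:1                        % first loop: newest to oldest
        alpha(i) = rho(i) * (S(:,i)' * q);
        q = q - alpha(i) * Y(:,i);
    end
    r = H0 * q;
    for i = 1:m                           % second loop: oldest to newest
        beta = rho(i) * (Y(:,i)' * r);
        r = r + S(:,i) * (alpha(i) - beta);
    end
end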

Page 30: Black-box Oracle

- evaluating

      ϕ_γ(x) = min_{z ∈ IR^n} { f(x) + ⟨∇f(x), z − x⟩ + g(z) + (1/(2γ))‖z − x‖² }

  requires one call to f, ∇f, g, prox_{γg} (same as FBS)

- evaluating

      ∇ϕ_γ(x) = γ^{-1}(I − γ∇²f(x))(x − prox_{γg}(x − γ∇f(x)))

  requires an extra Hessian-vector product

- Hessian-free evaluation: only one extra call to ∇f (sketch below)

  forward differences:

      ∇²f(x)d ≈ (∇f(x + εd) − ∇f(x))/ε        (prone to cancellation errors)

  complex-step differentiation:

      ∇²f(x)d ≈ imag(∇f(x + iεd))/ε           (requires ∇f to be analytic;
                                               ε = 10^{-8} gives machine-precision accuracy)

- similar complexity per iteration as FBS
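The complex-step trick above is a one-liner; a minimal MATLAB sketch (my own, assuming gradf is a handle for an analytic ∇f that accepts complex input):

% Sketch: Hessian-free directional derivative hess f(x)*d via one extra
% complex-step gradient call (gradf is an assumed handle).
function Hd = hessvec_complexstep(gradf, x, d)
    eps_cs = 1e-8;                               % gives machine-precision accuracy
    Hd = imag(gradf(x + 1i*eps_cs*d)) / eps_cs;
end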

Page 31: (Accelerated) Forward-Backward L-BFGS

- L-BFGS: convergence only under restrictive conditions (but widely used for non-convex functions)

- "What we would like to have, is a method with rate of convergence which is guaranteed never to be worse, and 'typically' is much better than the 'optimal' rate." [2]

- can we have an algorithm with the benefits of L-BFGS and strong convergence properties?

- yes: combine L-BFGS with forward-backward steps

[2] Ben-Tal, Nemirovski, "Non-euclidean restricted memory level method for large-scale convex optimization", Math. Program., Ser. A 102: 407-456 (2005)

Page 32: Algorithm 2 (Global Forward-Backward L-BFGS)

1: choose γ ∈ (0, 1/L_f), x^0 ∈ dom g
2: choose w^k such that ϕ_γ(w^k) ≤ ϕ_γ(x^k)
3: x^{k+1} = prox_{γg}(w^k − γ∇f(w^k))
4: k ← k + 1 and go to 2

- in step 2 one can use any descent step (L-BFGS, nonlinear CG, ...); a rough sketch combining these steps follows this slide

- minor extra computation: evaluating ϕ_γ(w^k) already entails computing x^{k+1}

- γ can be adjusted dynamically when L_f is unknown

- if ϕ has bounded level sets: {ϕ(x^k)} converges to ϕ* with rate O(1/k)

- linear convergence for strongly convex ϕ
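Putting the pieces together, here is a rough MATLAB sketch of Algorithm 2 (my own illustration, not the ForBES code) that reuses the fbe and lbfgs_two_loop sketches above; x0, Lf, m, maxit and the handles f, gradf, hessf, g, proxg are assumed to be given.

% Rough sketch of Algorithm 2, reusing fbe() and lbfgs_two_loop() above.
gamma = 0.95/Lf;
x = x0; S = []; Y = [];
[phix, gx] = fbe(x, gamma, f, gradf, hessf, g, proxg);
for k = 1:maxit
    % step 2: pick w with phi_gamma(w) <= phi_gamma(x), backtracking along
    % an L-BFGS direction and falling back to w = x if no decrease is found
    if isempty(S), d = -gx; else, d = -lbfgs_two_loop(gx, S, Y, 1); end
    tau = 1; w = x + tau*d;
    phiw = fbe(w, gamma, f, gradf, hessf, g, proxg);
    while phiw > phix && tau > 1e-10
        tau = tau/2; w = x + tau*d;
        phiw = fbe(w, gamma, f, gradf, hessf, g, proxg);
    end
    if phiw > phix, w = x; end
    % step 3: forward-backward step at w
    xnew = proxg(w - gamma*gradf(w), gamma);
    [phinew, gnew] = fbe(xnew, gamma, f, gradf, hessf, g, proxg);
    % update the L-BFGS memory with FBE gradients (skip if curvature <= 0)
    s = xnew - x; y = gnew - gx;
    if y'*s > 0, S = [S, s]; Y = [Y, y]; end
    if size(S, 2) > m, S(:,1) = []; Y(:,1) = []; end
    x = xnew; phix = phinew; gx = gnew;
end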

Page 33: Algorithm 3 (Accelerated Forward-Backward L-BFGS)

1: choose γ ∈ (0, 1/L_f), θ_0 = 1, θ_{k+1} = (√(θ_k⁴ + 4θ_k²) − θ_k²)/2, x^0 = v^0 ∈ dom g
2: u^k = θ_k v^k + (1 − θ_k) x^k
3: y^k = u^k if ϕ_γ(u^k) ≤ ϕ_γ(w^{k−1}), otherwise y^k = x^k
4: choose w^k such that ϕ_γ(w^k) ≤ ϕ_γ(y^k)
5: z^k = prox_{γg}(u^k − γ∇f(u^k))
6: x^{k+1} = prox_{γg}(w^k − γ∇f(w^k))
7: v^{k+1} = x^k + θ_k^{-1}(z^k − x^k)
8: k ← k + 1 and go to 2

- step 3 enforces monotonicity on the FBE

- in step 4 one can use any descent step (L-BFGS, nonlinear CG)

- with u^k = y^k = w^k the algorithm becomes FISTA [3]

- {ϕ(x^k)} converges to ϕ* with the optimal rate O(1/k²)

[3] Beck and Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems", SIAM Journal on Imaging Sciences 2.1 (2009): 183-202.

Page 34: Outline (next: AMM and alternating minimization L-BFGS)

Page 35: Separable convex problems

  minimize f(x) + g(z)
  subject to Ax + Bz = b

- f is strongly convex with convexity parameter μ_f

- f, g are nice, e.g. separable; coupling is introduced through the constraints

Alternating Minimization Method (AMM): FBS applied to the dual (a small worked instance follows this slide)

      x^{k+1} = argmin_x L_0(x, z^k, y^k)
      z^{k+1} = argmin_z L_γ(x^{k+1}, z, y^k),     γ ∈ (0, 2μ_f/‖A‖²)
      y^{k+1} = y^k + γ(Ax^{k+1} + Bz^{k+1} − b)

Augmented Lagrangian:

      L_γ(x, z, y) = f(x) + g(z) + ⟨y, Ax + Bz − b⟩ + (γ/2)‖Ax + Bz − b‖²
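To illustrate the AMM iteration above on a small concrete instance (my own choice, not from the slides): take f(x) = (μ/2)‖x‖² (strongly convex with μ_f = μ), g = λ‖·‖₁, B = −I and b = 0, so the x-update has a closed form and the z-update is a soft-thresholding prox.

% Sketch of AMM for f = mu/2*||x||^2, g = lam*||.||_1, constraint A*x - z = 0
% (illustration only; the data is synthetic).
A = randn(30, 80); mu = 1; lam = 0.1;
gamma = 0.9 * mu / norm(A)^2;            % gamma in (0, 2*mu_f/||A||^2)
z = zeros(30, 1); y = zeros(30, 1);
for k = 1:500
    x = -(A'*y) / mu;                    % x-update: argmin_x f(x) + <y, A*x>
    v = A*x + y/gamma;                   % z-update: prox_{g/gamma} at this point
    z = sign(v) .* max(abs(v) - lam/gamma, 0);
    y = y + gamma*(A*x - z);             % dual update
end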

Page 36: Distributed optimization

Row splitting (splitting across examples):

  minimize f(x) + Σ_{i=1}^M g_i(A_i x)

write the problem as

  minimize f(x) + Σ_{i=1}^M g_i(z_i)
  subject to A_i x − z_i = 0,   i = 1, ..., M

AMM updates:

      x^{k+1}   = argmin_x { f(x) + ⟨Σ_{i=1}^M A_i^T y_i^k, x⟩ }
      z_i^{k+1} = prox_{γ^{-1} g_i}(A_i x^{k+1} + γ^{-1} y_i^k),   i = 1, ..., M
      y_i^{k+1} = y_i^k + γ(A_i x^{k+1} − z_i^{k+1}),              i = 1, ..., M

- the z and y updates are executed locally, in parallel
- a central node collects the A_i^T y_i, adds them, updates x, then scatters
- if f is block separable, the x update is also parallel

Page 37: Alternating Minimization Envelope (AME)

- dual problem

      maximize ψ(y) = −( f*(−A^T y) + b^T y + g*(−B^T y) )

  where f*(v) = sup_{x ∈ IR^n} { ⟨v, x⟩ − f(x) } is the conjugate of f

- AME: the FBE of the dual problem is the augmented Lagrangian

      ψ_γ(y) = L_γ(x(y), z(y), y)

  where x(y), z(y) are the AMM updates

      x(y) = argmin_x { f(x) + ⟨y, Ax⟩ }
      z(y) = argmin_z { g(z) + ⟨y, Bz⟩ + (γ/2)‖Ax(y) + Bz − b‖² }

Page 38: Connection between AMM and AME

the dual problem is equivalent to

      maximize_y ψ_γ(y) = L_γ(x(y), z(y), y),   γ ∈ (0, μ_f/‖A‖²)

- if f is C² on the interior of its domain, then ψ_γ is C¹ with

      ∇ψ_γ(y) = D(y) (Ax(y) + Bz(y) − b),   D(y) = I − γA ∇²f(x(y))^{-1} A^T

- AMM is a variable-metric gradient method on the AME:

      y^{k+1} = y^k + γ D(y^k)^{-1} ∇ψ_γ(y^k)

- Alternating minimization L-BFGS: Forward-Backward L-BFGS applied to the dual

Page 39: Outline (next: ForBES (Forward Backward Envelope Solver))

Page 40: ForBES

- Forward-Backward Envelope Solver

- MATLAB solver for convex composite and separable problems

- generic, efficient and robust

- implements several algorithms for minimizing the FBE

- contains a library of commonly used functions together with their gradient, conjugate and prox mapping
  - loss functions: quadLoss, logLoss, huberLoss, hingeLoss, ...
  - regularizers: l1Norm, l2Norm, elasticNet, l2NormSum, ...

- on GitHub: http://lostella.github.io/ForBES/

Page 41: ForBES

composite problems

  minimize f(Cx + d) + g(x)

- f convex, strongly smooth and C²
- g proper, closed, convex function with computable prox

separable problems

  minimize f(x) + g(z)
  subject to Ax + Bz = b

- f strongly convex and C² in the interior of its domain
- g any proper, closed, convex function with computable prox
- B is a tight frame: B^T B = αI, α > 0

Page 42: ForBES

composite problems

  minimize f(Cx + d) + g(x)

- out = forbes(f, g, init, aff);
- can have the sum of multiple f's and g's
- aff = {C, d};

separable problems

  minimize f(x) + g(z)
  subject to Ax + Bz = b

- out = forbes(f, g, init, [], constr);
- can have the sum of multiple f's and g's
- constr = {A, B, b};

Page 43: Example: Lasso

  minimize (1/2)‖Ax − b‖² + λ‖x‖₁     (f(Cx + d) = (1/2)‖Ax − b‖², g(x) = λ‖x‖₁)

- the regularization term enforces sparsity in the solution x*
- used for regression and feature selection
- denser solution, harder problem as λ → 0

proximal mapping of the ℓ1-norm: soft thresholding

      prox_{γ‖·‖₁}(x) = sign(x) · max{0, |x| − γ}

out = forbes(quadLoss(), l1Norm(lam), x0, {A, -b});

out =

        message: 'reached optimum (up to tolerance)'
           flag: 0
              x: [5000x1 double]
     iterations: 352
           [..]

Page 44: Example: Lasso

performance profile on HDR problems taken from the SPEAR dataset

- HDR: high dynamic range, i.e. large ratio max x*/min x*
- 274 problems, size ranging from 512 × 1024 to 8192 × 49152
- the point (x, y) of the graph for solver s gives the percentage y of problems on which the solver's running time was within 2^x times the best running time

dataset at wwwopt.mathematik.tu-darmstadt.de/spear

Page 45: Example: Lasso

Compute the regularization path, warm-starting each problem:

x0 = zeros(n, 1);
lambdas = lambda_max*logspace(-2, 0, 10);
f = quadLoss();
for i = length(lambdas):-1:1          % start from the largest lambda
    out = forbes(f, l1Norm(lambdas(i)), x0, {A, -b});
    x0 = out.x;                       % warm start the next problem
end

- random example: 3000 observations, 500K features, 7.5M nonzeros

                   iter.   matvecs   backsolves   time
  Fast FBS         22406   44812     -            1054s
  ADMM              2214    4428     2214          261s
  ForBES (LBFGS)    1050    6260     -             196s

Page 46: Example: Sparse robust regression

  minimize Σ_{i=1}^m h_δ(⟨a_i, x⟩ − b_i) + λ‖x‖₁,
  h_δ(r) = r²/(2δ)  if |r| ≤ δ,   |r| − δ/2  otherwise

- the Huber loss h_δ is the Moreau envelope (with parameter δ) of |·|; it is not C²
- quadratic penalty for small residuals, linear penalty for large ones
- useful when the data contains outliers

  A = [a_1, ..., a_m]^T,   b = [b_1, ..., b_m]^T

f = huberLoss(del);
g = l1Norm(lam);
out = forbes(f, g, zeros(n,1), {A, -b});

Page 47: Example: Sparse robust regression

- random problem: 5000 samples, 200K features, 1M nonzeros
- artificial outliers in b = (b_in, b_out) (large noise added to b_out)

  nnz(x*) = 92,   ‖(Ax* − b)_in‖/‖b_in‖ = 0.16

  with the Lasso: nnz(x*) = 93,   ‖(Ax* − b)_in‖/‖b_in‖ = 0.58

Page 48: Example: Sparse logistic regression

  minimize (1/m) Σ_{i=1}^m log(1 + exp(−y_i⟨a_i, x⟩)) + λ‖x‖₁
  (the first term is f(Cx + d), the second is g(x))

- labels y_i ∈ {−1, 1}, i = 1, ..., m
- used for classification and feature selection
- again, as λ → 0: denser solution, harder problem

f = logLoss(1/m);
C = diag(sparse(y))*A;
g = l1Norm(lam);
out = forbes(f, g, zeros(n, 1), C);

Page 49: Example: Sparse logistic regression

  Dataset    Features   nnz    Fast FBS   ForBES    Speedup
  w4a        299        86K    3.54s      1.01s     x3.5
  w5a        299        115K   3.73s      1.08s     x3.5
  w6a        300        200K   5.73s      1.99s     x2.9
  w7a        300        288K   7.97s      2.15s     x3.7
  w8a        300        580K   15.85s     4.17s     x3.8
  rcv1       45K        910K   15.94s     1.63s     x9.8
  real-sim   21K        1.5M   24.39s     3.16s     x7.7
  news20     1.4M       3.7M   427.77s    53.55s    x8.0

Datasets taken from LIBSVM: www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Page 50: Example: Matrix completion

  minimize (1/2)‖A(X) − b‖² + λ‖X‖_*

- ‖X‖_* = Σ σ(X) (nuclear norm) forces X* to have low rank
- A : IR^{m×n} → IR^k is linear; example: A selects k entries of X
- the proximal mapping of ‖·‖_* is singular value soft thresholding (sketch below)

      prox_{γ‖·‖_*}(X) = U diag(max{0, σ_i − γ}, i = 1, ..., r) V^T

  where σ_1, ..., σ_r are the singular values of X
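The singular value soft thresholding above is a few lines of MATLAB; a minimal sketch (my own illustration, not the ForBES nuclearNorm implementation):

% Sketch: prox of gamma*||.||_* at a matrix X via an SVD.
function Z = prox_nuclear(X, gamma)
    [U, S, V] = svd(X, 'econ');
    s = max(diag(S) - gamma, 0);        % soft-threshold the singular values
    Z = U * diag(s) * V';
end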

Page 51: Example: Matrix completion

% P binary matrix, B contains the observations
f = quadLoss(P(:), B(:));
g = nuclearNorm(m, n, lam);
out = forbes(f, g, zeros(m*n, 1));

MovieLens 100k dataset: 943 × 1682, 1.5M unknowns

  λ = 50: rank(X*) = 3     λ = 30: rank(X*) = 8     λ = 20: rank(X*) = 38

Page 52: Frequency Domain Subspace Identification [4]

  minimize (1/2)‖x − b‖² + λ‖Z‖_*,   subject to H(x) = Z

- H is a linear map returning the (block) Hankel matrix

      H(x_1, x_2, ..., x_n) = [ x_1   x_2       x_3   ...   x_q
                                x_2   x_3       ...         x_{q+1}
                                ...   ...       ...         ...
                                x_p   x_{p+1}   ...         x_{p+q−1} ]  ∈ IR^{p·ny × q·nu}

- b_i is the IDFT of noisy frequency samples of a MIMO, LTI system

      b_i = (1/(2M)) Σ_{k=0}^{2M−1} G_k e^{j2πik/2M},   i = 0, ..., 2M − 1
      G_k = C(e^{jω_k} I − A)^{-1} B + D,   ω_k = kπ/M,   k = 0, ..., M

f = quadLoss(1, b);
g = nuclearNorm(lam);
A = blockHankel(p, q, ny, nu);
b = zeros(p*q, 1);
out = forbes(f, g, zeros((p*ny)*(q*nu), 1), [], {A, -1, 0});

[4] Graf Plessen, Semeraro, Wood and Smith, "Optimization Algorithms for Nuclear Norm Based Subspace Identification with Uniformly Spaced Frequency Domain Data", ACC, 2015

Page 53: Frequency Domain Subspace Identification

More efficient use of operations:

                          iterations   Hankel   SVD
  Fast FBS                850          1700     850
  ForBES (LBFGS)          39           154      43
  ForBES (LBFGS) fast     30           234      95
  ForBES (LBFGS) global   30           176      66

Page 54: Example: Support vector machines

  minimize (λ/2)‖x‖² + Σ_{i=1}^m max{0, 1 − b_i z_i},   subject to Ax = z
  (f(x) = (λ/2)‖x‖²,  g(z) = Σ_i max{0, 1 − b_i z_i})

- used for classification; the nonsmooth term is the hinge loss
- A = (a_1, ..., a_m)^T matrix of features, b ∈ {−1, 1}^m vector of labels
- g(z) = Σ_{i=1}^m g_i(z_i) with (sketch below)

      prox_{γ g_i}(z_i) = b_i min{1, b_i z_i + γ}   if b_i z_i < 1
                          z_i                       otherwise

f = quadLoss(lam);
g = hingeLoss(1, b);
constr = {A, -1, zeros(m,1)};
out = forbes(f, g, zeros(m, 1), [], constr);
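The elementwise prox formula above vectorizes directly; a minimal MATLAB sketch (my own transcription, prox_hinge is a hypothetical helper name):

% Sketch: prox of gamma*g for the hinge loss, applied elementwise to z
% (b is the vector of +/-1 labels).
function p = prox_hinge(z, b, gamma)
    p = z;
    idx = b .* z < 1;                   % entries where the hinge is active
    p(idx) = b(idx) .* min(1, b(idx).*z(idx) + gamma);
end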

Page 55: Example: Support vector machines

  Dataset    Features   nnz    Fast AMM   ForBES    Speedup
  w4a        299        86K    16.51s     1.44s     x11.5
  w5a        299        115K   21.88s     2.69s     x8.1
  w6a        300        200K   55.52s     4.61s     x12.0
  w7a        300        288K   96.68s     7.29s     x13.3
  w8a        300        580K   273.89s    20.61s    x13.3
  rcv1       45K        910K   617.71s    96.47s    x6.4
  real-sim   21K        1.5M   39.73s     14.38s    x2.8
  news20     1.4M       3.7M   148.01s    18.73s    x7.9

Datasets taken from LIBSVM: www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Page 56: Summary

  minimize_{x ∈ IR^n} g(x)

- Moreau envelope, Moreau (1965):

      g^γ(x) = inf_z { g(z) + (1/(2γ))‖z − x‖² }

  proximal point method = gradient method on the Moreau envelope

  minimize_{x ∈ IR^n} f(x) + g(x)

- forward-backward envelope (FBE), Patrinos, Bemporad (2013):

      ϕ^FB_γ(x) = f(x) − (γ/2)‖∇f(x)‖² + g^γ(x − γ∇f(x))

  forward-backward splitting = gradient method on the FBE

Page 57: Summary

  minimize_{x ∈ IR^n} f(x) + g(x)

Douglas-Rachford splitting:

      y^k = prox_{γf}(x^k)
      z^k = prox_{γg}(2y^k − x^k)
      x^{k+1} = x^k + λ_k(z^k − y^k)

- Douglas-Rachford envelope (DRE), Patrinos, Stella, Bemporad (2014):

      ϕ^DR_γ(x) = f^γ(x) − γ‖∇f^γ(x)‖² + g^γ(x − 2γ∇f^γ(x))
                = ϕ^FB_γ(prox_{γf}(x))

  Douglas-Rachford splitting = gradient method on the DRE

Page 58: Summary

- Forward-Backward L-BFGS for nonsmooth problems
- optimal complexity estimates, much faster convergence in practice
- ForBES: Forward-Backward Envelope solver, freely available at http://lostella.github.io/ForBES/

Page 59: Sources

1. Patrinos and Bemporad (2013), Proximal Newton methods for convex composite optimization, 52nd IEEE CDC, Florence, Italy
2. Patrinos, Stella and Bemporad (2014), Forward-backward truncated Newton methods for convex composite optimization, arXiv:1402.6655
3. Patrinos, Stella and Bemporad (2014), Douglas-Rachford splitting: complexity estimates and accelerated variants, 53rd IEEE CDC, Los Angeles, CA, arXiv:1407.6723
4. Patrinos and Stella (2015), Forward-backward L-BFGS for large-scale nonsmooth convex optimization, submitted