©Emily Fox 2015
Machine Learning for Big Data CSE547/STAT548, University of Washington
Emily Fox April 30th, 2015
“Scalable” LASSO Solvers: Parallel SCD (Shotgun), Parallel SGD, Averaging Solutions, ADMM
Case Study 3: fMRI Prediction

Scaling Up LASSO Solvers
• A simple SCD for LASSO (Shooting)
– Your HW, a more efficient implementation!
– Analysis of SCD
• Parallel SCD (Shotgun)
• Other parallel learning approaches for linear models
– Parallel stochastic gradient descent (SGD)
– Parallel independent solutions, then averaging
• ADMM
Stochastic Coordinate Descent for LASSO (aka Shooting Algorithm)
• Repeat until convergence:
– Pick a coordinate j at random
• Set:
\[
\hat{\beta}_j = \begin{cases}
(c_j + \lambda)/a_j & c_j < -\lambda \\
0 & c_j \in [-\lambda, \lambda] \\
(c_j - \lambda)/a_j & c_j > \lambda
\end{cases}
\]
• Where:
\[
c_j = 2\sum_{i=1}^{N} x_{ij}\left(y^i - \beta_0 - \beta_{-j}^T x^i_{-j}\right), \qquad
a_j = 2\sum_{i=1}^{N} (x_{ij})^2
\]
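The shooting update above can be sketched in numpy. This is a minimal sketch, assuming centered data with no intercept term \(\beta_0\) and nonzero feature columns; `shooting` and `soft_threshold` are hypothetical helper names, not from the lecture.

```python
import numpy as np

def soft_threshold(c, lam):
    """Closed-form coordinate minimizer: the three-case rule from the slide."""
    if c < -lam:
        return c + lam
    elif c > lam:
        return c - lam
    return 0.0

def shooting(X, y, lam, n_iters=1000, rng=None):
    """Stochastic coordinate descent for min ||Xb - y||_2^2 + lam * ||b||_1."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    beta = np.zeros(d)
    a = 2.0 * (X ** 2).sum(axis=0)            # a_j = 2 * sum_i x_ij^2
    for _ in range(n_iters):
        j = rng.integers(d)                   # pick a coordinate at random
        r = y - X @ beta + X[:, j] * beta[j]  # residual excluding feature j
        c = 2.0 * X[:, j] @ r                 # c_j from the slide
        beta[j] = soft_threshold(c, lam) / a[j]
    return beta
```

With an orthonormal design each coordinate update is exact, so a few random passes recover the soft-thresholded least-squares solution.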
Shotgun: Parallel SCD [Bradley et al ‘11]
Shotgun (Parallel SCD)
While not converged:
  On each of P processors:
    Choose random coordinate j
    Update βj (same as for Shooting)
Lasso: \(\min_\beta F(\beta)\) where \(F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1\)
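One Shotgun round can be sketched as follows. This sketch simulates the P simultaneous coordinate updates sequentially from one shared snapshot of β, then applies them collectively; on real hardware the inner loop runs on P processors. `shotgun_step` is a hypothetical helper name.

```python
import numpy as np

def shotgun_step(X, y, beta, lam, P, rng):
    """One Shotgun round: P coordinates updated 'in parallel' from the same beta."""
    N, d = X.shape
    a = 2.0 * (X ** 2).sum(axis=0)                 # a_j = 2 * sum_i x_ij^2
    coords = rng.choice(d, size=P, replace=False)  # one coordinate per processor
    r = y - X @ beta                               # shared residual snapshot
    delta = np.zeros(d)
    for j in coords:                               # conceptually simultaneous
        c = 2.0 * X[:, j] @ (r + X[:, j] * beta[j])
        new_bj = np.sign(c) * max(abs(c) - lam, 0.0) / a[j]
        delta[j] = new_bj - beta[j]
    return beta + delta                            # collective update
```

Because all P updates are computed from the same stale β, correlated features can make the collective step overshoot; the convergence theorem below bounds how large P may be.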
4/30/15
3
Is SCD inherently sequential?
Coordinate update: \(\beta_j \leftarrow \beta_j + \delta\beta_j\) (closed-form minimization)
Collective update:
\[
\Delta\beta = \begin{pmatrix} 0 \\ \vdots \\ \delta\beta_i \\ 0 \\ \vdots \\ \delta\beta_j \\ \vdots \\ 0 \end{pmatrix}
\]
Lasso: \(\min_\beta F(\beta)\) where \(F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1\)
Convergence Analysis
Theorem: Shotgun Convergence
Assume \(P < d/\rho + 1\), where \(\rho\) = spectral radius of \(X^T X\). Then
\[
\mathbb{E}\left[F(\beta^{(T)})\right] - F(\beta^*) \;\le\; \frac{d\left(\tfrac{1}{2}\|\beta^*\|_2^2 + F(\beta^{(0)})\right)}{TP}
\]
Nice case, uncorrelated features: \(\rho = 1 \Rightarrow P_{\max} \approx d\)
Bad case, correlated features: \(\rho = d \Rightarrow P_{\max} = 1\) (at worst)
Lasso: \(\min_\beta F(\beta)\) where \(F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1\)
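The two extreme cases of \(\rho\) can be checked numerically. A sketch, assuming unit-norm feature columns (as in the normalized-design setting of the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 4

# Nice case: independent (near-orthogonal) features, unit-norm columns.
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=0)
rho_nice = np.max(np.linalg.eigvalsh(X.T @ X))   # close to 1 => P_max ~ d

# Bad case: d exact copies of the same feature.
x = rng.standard_normal((N, 1))
X_bad = np.tile(x, (1, d))
X_bad /= np.linalg.norm(X_bad, axis=0)
rho_bad = np.max(np.linalg.eigvalsh(X_bad.T @ X_bad))  # equals d => P_max = 1
```

For the duplicated design the Gram matrix is the all-ones matrix, whose largest eigenvalue is exactly d, matching the worst case above.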
Stepping Back…
• Stochastic coordinate descent
– Optimization:
– Parallel SCD:
– Issue:
– Solution:
• Natural counterpart: stochastic gradient descent
– Optimization:
– Parallel SGD:
– Issue:
– Solution:
Parallel SGD with No Locks [e.g., Hogwild!, Niu et al. ‘11]
• Each processor in parallel:
– Pick data point i at random
– For j = 1…p:
• Assume atomicity of individual coordinate updates
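A minimal Hogwild!-style sketch for least squares: every thread reads and writes a shared weight vector with no locks, touching only the coordinates where its data point is nonzero. Note this is illustrative only; CPython's GIL serializes most of the work, so it shows the algorithm's structure rather than real speedups, and `hogwild_sgd` is a hypothetical name.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, n_threads=4, n_steps=2000, lr=0.01, seed=0):
    """Lock-free parallel SGD for min ||Xw - y||^2 (Hogwild!-style sketch)."""
    N, d = X.shape
    w = np.zeros(d)                          # shared, updated without locking

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(n_steps):
            i = rng.integers(N)              # pick a data point at random
            g = 2.0 * (X[i] @ w - y[i]) * X[i]
            for j in np.nonzero(X[i])[0]:    # update only the sparse support
                w[j] -= lr * g[j]            # racy write, assumed atomic

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

The races mean threads sometimes compute gradients from slightly stale weights, which is exactly the interference the next slide discusses.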
Addressing Interference in Parallel SGD
• Key issues:
– Stale gradients
– Processors overwrite each other’s work
• Nonetheless:
– Can achieve convergence and some parallel speedups
– Proof relies on weak interactions, obtained through sparsity of data points
Problem with Parallel SCD and SGD
• Both parallel SCD & SGD assume access to the current estimate of the weight vector
• Works well on shared-memory machines
• Very difficult to implement efficiently in distributed memory
• Open problem: good parallel SGD and SCD for the distributed setting…
– Let’s look at a trivial approach
Simplest Distributed Optimization Algorithm Ever Made
• Given N data points & P machines
• Stochastic optimization problem:
• Distribute data:
• Solve problems independently
• Merge solutions
• Why should this work at all????
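The distribute-then-average recipe above can be sketched in a few lines. Assumptions of the sketch: an exact least-squares solve stands in for the per-machine learner, the "machines" are just shards of one array, and `distribute_then_average` / `solve_local` are hypothetical names.

```python
import numpy as np

def solve_local(X, y):
    """Stand-in for one machine's independent solve (exact least squares)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def distribute_then_average(X, y, P):
    """Split N data points across P 'machines', solve independently, average."""
    shards = np.array_split(np.arange(len(y)), P)          # distribute data
    solutions = [solve_local(X[idx], y[idx]) for idx in shards]  # solve independently
    return np.mean(solutions, axis=0)                      # merge by averaging
```

With noiseless data every shard recovers the same solution, so the average is exact; with finite noisy data each local solution is biased differently, which is where the tradeoffs below come from.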
For Convex Functions…
• Convexity (Jensen’s inequality): \(F\big(\tfrac{1}{P}\sum_{p} \hat\beta^{(p)}\big) \le \tfrac{1}{P}\sum_{p} F(\hat\beta^{(p)})\)
• Thus: the averaged solution is no worse than the average quality of the independent solutions
The average mixture algorithm
Two pictures: [figure: independent estimates \(\tilde\theta_1, \tilde\theta_2, \tilde\theta_3, \tilde\theta_4\) and their average; from Duchi, “Communication Efficient Optimization,” CDC 2012]
Hopefully…
• Convexity only guarantees:
• But, the estimates come from independent data!
Figure from John Duchi
Analysis of Distribute-then-Average [Zhang et al. ‘12]
• Under some conditions, including strong convexity, lots of smoothness, etc.
• If all data were on one machine, converge at rate:
• With P machines, converge at rate:
Tradeoffs, tradeoffs, tradeoffs,…
• Distribute-then-Average:
– “Minimum possible” communication
– Bias term can be a killer with finite data
• Issue definitely observed in practice
– Significant issues for L1 problems: averaging destroys sparsity
• Parallel SCD or SGD
– Can have much better convergence in practice in the multicore setting
– Preserves sparsity (especially SCD)
– But, hard to implement in the distributed setting
Alternating Directions Method of Multipliers
• A tool for solving convex problems with separable objectives: \(\min_\beta f(\beta) + g(\beta)\)
• LASSO example: \(f(\beta) = \|X\beta - y\|_2^2\), \(g(\beta) = \lambda\|\beta\|_1\)
• Know how to minimize f(β) or g(β) separately
ADMM Insight
• Try this instead: \(\min_{x,z} f(x) + g(z)\) subject to \(x = z\)
• Solve using the method of multipliers
• Define the augmented Lagrangian:
– Issue: the L2 penalty destroys separability of the Lagrangian
– Solution: replace joint minimization over (x, z) with alternating minimization
ADMM Algorithm
• Augmented Lagrangian:
\[
L_\rho(x, z, y) = f(x) + g(z) + y^T(x - z) + \frac{\rho}{2}\|x - z\|_2^2
\]
• Alternate between:
1. \(x \leftarrow \arg\min_x L_\rho(x, z, y)\)
2. \(z \leftarrow \arg\min_z L_\rho(x, z, y)\)
3. \(y \leftarrow y + \rho(x - z)\)
ADMM for LASSO
• Objective: \(\min_{\beta, z} \|X\beta - y\|_2^2 + \lambda\|z\|_1\) subject to \(\beta = z\)
• Augmented Lagrangian:
\[
L_\rho(\beta, z, a) = \|X\beta - y\|_2^2 + \lambda\|z\|_1 + a^T(\beta - z) + \frac{\rho}{2}\|\beta - z\|_2^2
\]
• Alternate between:
1. \(\beta \leftarrow \arg\min_\beta L_\rho(\beta, z, a)\)
2. \(z \leftarrow \arg\min_z L_\rho(\beta, z, a)\)
3. \(a \leftarrow a + \rho(\beta - z)\)
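The three updates for the LASSO can be sketched directly from the augmented Lagrangian: the β-step is a ridge-like linear solve, the z-step is a soft-threshold, and the a-step is dual ascent. A minimal sketch (`admm_lasso` is a hypothetical name; no convergence check, fixed iteration count):

```python
import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_iters=200):
    """ADMM for min ||Xb - y||_2^2 + lam * ||z||_1  s.t.  b = z (sketch)."""
    N, d = X.shape
    beta = np.zeros(d)
    z = np.zeros(d)
    a = np.zeros(d)                                   # dual variable
    M = 2.0 * X.T @ X + rho * np.eye(d)               # beta-step system matrix
    Xty2 = 2.0 * X.T @ y
    for _ in range(n_iters):
        # 1. beta-step: solve (2 X^T X + rho I) beta = 2 X^T y + rho z - a
        beta = np.linalg.solve(M, Xty2 + rho * z - a)
        # 2. z-step: soft-threshold beta + a/rho at level lam/rho
        v = beta + a / rho
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # 3. dual ascent on the constraint beta = z
        a = a + rho * (beta - z)
    return z
```

Returning z (rather than β) preserves exact zeros, which is why ADMM keeps LASSO solutions sparse.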
ADMM Wrap-Up
• When does ADMM converge?
– Under very mild conditions
– Basically, f and g must be convex
• ADMM is useful in cases where
– f(x) + g(x) is challenging to solve due to coupling
– We can minimize
• f(x) + (x − a)²
• g(x) + (x − a)²
• Reference
– Boyd, Parikh, Chu, Peleato, Eckstein (2011). “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning, 3(1):1–122.
What you need to know
• A simple SCD for LASSO (Shooting)
– Your HW, a more efficient implementation!
– Analysis of SCD
• Parallel SCD (Shotgun)
• Other parallel learning approaches for linear models
– Parallel stochastic gradient descent (SGD)
– Parallel independent solutions, then averaging
• ADMM
– General idea
– Application to LASSO