
Machine Learning for Big Data, CSE547/STAT548, University of Washington

Emily Fox, April 30th, 2015

"Scalable" LASSO Solvers: Parallel SCD (Shotgun), Parallel SGD, Averaging Solutions, ADMM

Case Study 3: fMRI Prediction

Scaling Up LASSO Solvers

• A simple SCD for LASSO (Shooting)
  – Your HW, a more efficient implementation!
  – Analysis of SCD
• Parallel SCD (Shotgun)
• Other parallel learning approaches for linear models
  – Parallel stochastic gradient descent (SGD)
  – Parallel independent solutions then averaging
• ADMM


Stochastic Coordinate Descent for LASSO (aka Shooting Algorithm)

• Repeat until convergence:
  – Pick a coordinate j at random
  – Set (soft-thresholding update below; a code sketch follows the equations):
  – Where (c_j and a_j are defined below):


\hat{\beta}_j =
\begin{cases}
(c_j + \lambda)/a_j & \text{if } c_j < -\lambda \\
0 & \text{if } c_j \in [-\lambda, \lambda] \\
(c_j - \lambda)/a_j & \text{if } c_j > \lambda
\end{cases}

where

c_j = 2\sum_{i=1}^{N} x_{ij}\,\big(y_i - \beta_0 - \boldsymbol{\beta}_{-j}^\top \mathbf{x}_{i,-j}\big),
\qquad
a_j = 2\sum_{i=1}^{N} x_{ij}^2
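A minimal NumPy sketch of the Shooting updates above (a sketch, not the reference implementation from the homework; it assumes X and y are centered so the intercept β₀ can be dropped, and the function names are illustrative):

import numpy as np

def soft_threshold_update(c_j, a_j, lam):
    """Closed-form coordinate minimizer: the soft-thresholding cases above."""
    if c_j < -lam:
        return (c_j + lam) / a_j
    if c_j > lam:
        return (c_j - lam) / a_j
    return 0.0

def shooting_lasso(X, y, lam, n_iters=5000, seed=0):
    """Stochastic coordinate descent (Shooting) for min_b ||X b - y||_2^2 + lam ||b||_1.
    Assumes X and y are centered, so the intercept beta_0 is omitted."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    beta = np.zeros(d)
    a = 2.0 * np.sum(X ** 2, axis=0)              # a_j = 2 sum_i x_ij^2, precomputed
    residual = y - X @ beta                       # maintained so each update is O(N)
    for _ in range(n_iters):
        j = rng.integers(d)                       # pick a coordinate j at random
        if a[j] == 0.0:
            continue                              # skip all-zero feature columns
        # c_j = 2 sum_i x_ij (y_i - beta_{-j}^T x_{i,-j})
        c_j = 2.0 * X[:, j] @ (residual + X[:, j] * beta[j])
        new_bj = soft_threshold_update(c_j, a[j], lam)
        residual += X[:, j] * (beta[j] - new_bj)  # keep residual = y - X beta
        beta[j] = new_bj
    return beta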

Shotgun: Parallel SCD [Bradley et al. '11]


Shotgun (Parallel SCD)

While not converged:
  On each of P processors:
    Choose a random coordinate j
    Update β_j (same as for Shooting)

Lasso: \min_\beta F(\beta), \quad \text{where } F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1


Is SCD inherently sequential?


Coordinate update (closed-form minimization):
\beta_j \leftarrow \beta_j + \delta\beta_j

Collective update:
\Delta\beta = \begin{pmatrix} 0 \\ \vdots \\ \delta\beta_i \\ 0 \\ \vdots \\ \delta\beta_j \\ \vdots \\ 0 \end{pmatrix}
(nonzero only in the coordinates updated in that round; see the sketch below)

Lasso: \min_\beta F(\beta), \quad \text{where } F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1
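A minimal sketch of one Shotgun round, simulating the collective update Δβ above: P coordinate updates are computed from the same old β and then applied together (illustrative only; a real implementation runs the P updates on separate cores against shared memory, and feature columns are assumed nonzero):

import numpy as np

def shotgun_round(X, y, beta, residual, lam, P, rng):
    """One Shotgun round: P coordinates updated from the SAME old beta
    (as if on P processors), then applied as one collective update."""
    d = X.shape[1]
    a = 2.0 * np.sum(X ** 2, axis=0)                    # a_j = 2 sum_i x_ij^2
    coords = rng.choice(d, size=min(P, d), replace=False)
    delta = np.zeros(d)                                 # the collective update Delta beta
    for j in coords:                                    # each "processor" sees the same beta
        c_j = 2.0 * X[:, j] @ (residual + X[:, j] * beta[j])
        new_bj = np.sign(c_j) * max(abs(c_j) - lam, 0.0) / a[j]   # same soft-threshold as Shooting
        delta[j] = new_bj - beta[j]
    beta = beta + delta                                 # apply all delta beta_j at once
    residual = residual - X @ delta                     # keep residual = y - X beta
    return beta, residual

# usage sketch:
# rng = np.random.default_rng(0); beta = np.zeros(X.shape[1]); residual = y.copy()
# for _ in range(1000):
#     beta, residual = shotgun_round(X, y, beta, residual, lam=0.1, P=8, rng=rng)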

Convergence Analysis


Theorem (Shotgun convergence): Assume P < d/ρ + 1, where ρ is the spectral radius of X^\top X. Then

E\big[F(\beta^{(T)})\big] - F(\beta^*) \;\le\; \frac{d\left(\tfrac{1}{2}\|\beta^*\|_2^2 + F(\beta^{(0)})\right)}{T\,P}

Nice case (uncorrelated features): ρ = __ ⇒ P_max = __
Bad case (correlated features): ρ = __ ⇒ P_max = __ (at worst)

(A small numerical check of the two cases follows.)

Lasso: \min_\beta F(\beta), \quad \text{where } F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1
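The ρ and P_max blanks above are filled in during lecture; the sketch below just checks the two regimes numerically (illustrative; shotgun_parallelism_bound is a made-up helper name, and features are assumed column-normalized):

import numpy as np

def shotgun_parallelism_bound(X):
    """Spectral radius rho of X^T X and the implied limit P < d/rho + 1."""
    d = X.shape[1]
    rho = np.linalg.eigvalsh(X.T @ X).max()   # X^T X is symmetric PSD: spectral radius = top eigenvalue
    return rho, d / rho + 1.0

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))
X /= np.linalg.norm(X, axis=0)                # unit-norm columns
print(shotgun_parallelism_bound(X))           # nearly uncorrelated features: small rho, P can grow with d

X_bad = np.tile(X[:, :1], (1, 50))            # perfectly correlated: every column identical
print(shotgun_parallelism_bound(X_bad))       # rho = d, so the bound collapses to essentially P = 1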


Stepping Back…


• Stochastic coordinate descent
  – Optimization:
  – Parallel SCD:
  – Issue:
  – Solution:
• Natural counterpart:
  – Optimization:
  – Parallel:
  – Issue:
  – Solution:

Parallel SGD with No Locks [e.g., Hogwild!, Niu et al. '11]

• Each processor, in parallel:
  – Pick a data point i at random
  – For j = 1…p:
• Assume atomicity of:
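A minimal, illustrative Hogwild!-style sketch using Python threads: every thread reads and writes the shared weight vector with no locks, relying on (assumed) atomicity of the per-coordinate writes. The update shown is plain SGD on the squared loss plus an L1 subgradient term, which is an assumption for illustration rather than the exact update from lecture; note also that CPython's GIL serializes the Python-level work, so this shows the lock-free access pattern rather than a real speedup:

import numpy as np
import threading

def hogwild_sgd(X, y, lam, n_threads=4, n_updates=20000, eta=1e-3, seed=0):
    """Lock-free parallel SGD sketch: all threads update the shared w in place."""
    N, d = X.shape
    w = np.zeros(d)                                   # shared weights, no locks taken

    def worker(thread_id):
        rng = np.random.default_rng(seed + thread_id)
        for _ in range(n_updates):
            i = rng.integers(N)                       # pick data point i at random
            x_i = X[i]
            grad = 2.0 * (x_i @ w - y[i]) * x_i + lam * np.sign(w)
            for j in np.nonzero(x_i)[0]:              # touch only coordinates where x_ij != 0
                w[j] -= eta * grad[j]                 # assumed-atomic per-coordinate write

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w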



Addressing Interference in Parallel SGD
• Key issues:
  – Old gradients
  – Processors overwrite each other's work
• Nonetheless:
  – Can achieve convergence and some parallel speedups
  – Proof uses weak interactions, but through sparsity of data points


Problem with Parallel SCD and SGD
• Both parallel SCD & SGD assume access to the current estimate of the weight vector
• Works well on shared-memory machines
• Very difficult to implement efficiently in distributed memory
• Open problem: good parallel SGD and SCD for the distributed setting…
  – Let's look at a trivial approach



Simplest Distributed Optimization Algorithm Ever Made
• Given N data points & P machines
• Stochastic optimization problem:
• Distribute data:
• Solve problems independently
• Merge solutions (by averaging; see the sketch below)
• Why should this work at all????
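A minimal sketch of the distribute/solve/merge recipe above (names are illustrative; any LASSO solver, e.g. the shooting_lasso sketch earlier, can play the role of the per-machine solver, and a real system would run the P solves on separate machines rather than in a list comprehension):

import numpy as np

def distribute_then_average(X, y, lam, P, solver, seed=0):
    """Split N data points across P 'machines', solve each LASSO independently
    with solver(X_part, y_part, lam), then merge by simple averaging."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X.shape[0]), P)      # distribute the data
    solutions = [solver(X[idx], y[idx], lam) for idx in parts]  # solve independently
    return np.mean(solutions, axis=0)                           # merge solutions

# e.g., using the earlier sketch as the per-machine solver:
# beta_avg = distribute_then_average(X, y, lam=0.1, P=4, solver=shooting_lasso)

One practical note, foreshadowing the tradeoffs slide below: even if every per-machine solution is sparse, their average generally is not.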


For Convex Functions…
• Convexity:
• Thus:
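The two blanks above are filled in on the board; the standard step is Jensen's inequality applied to the average of the P independent solutions (a sketch of the argument, not necessarily the exact derivation from lecture):

F\!\left(\frac{1}{P}\sum_{p=1}^{P}\hat{\beta}^{(p)}\right) \;\le\; \frac{1}{P}\sum_{p=1}^{P} F\!\left(\hat{\beta}^{(p)}\right)

so the averaged solution is no worse, in objective value, than the P solutions are on average; convexity alone says nothing about it being close to β*.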



[Figure: "The average mixture algorithm" — two pictures of the independent estimates θ̂₁, θ̂₂, θ̂₃, θ̂₄ being combined; from John Duchi, "Communication Efficient Optimization," CDC 2012.]

Hopefully…
• Convexity only guarantees:
• But, estimates from independent data!


Figure from John Duchi

Analysis of Distribute-then-Average [Zhang et al. '12]
• Under some conditions, including strong convexity, lots of smoothness, etc.
• If all data were on one machine, converge at rate:
• With P machines, converge at a rate:
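The rates are filled in during lecture; roughly, and under the conditions of Zhang et al. (strong convexity, smoothness, etc.), the comparison looks like the following sketch, with orders only and constants and exact assumptions omitted:

\text{All } N \text{ samples on one machine:}\quad \text{error} = O\!\left(\frac{1}{N}\right)

\text{P machines with } n = N/P \text{ samples each, averaged:}\quad \text{error} = O\!\left(\frac{1}{N} + \frac{1}{n^2}\right)

The variance-like term still decays like 1/N, but simple averaging leaves a bias-like term of order 1/n² that can dominate when P is large relative to N.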



Tradeoffs, tradeoffs, tradeoffs,…
• Distribute-then-Average:
  – "Minimum possible" communication
  – Bias term can be a killer with finite data
    • Issue definitely observed in practice
  – Significant issues for L1 problems:
• Parallel SCD or SGD
  – Can have much better convergence in practice in the multicore setting
  – Preserves sparsity (especially SCD)
  – But, hard to implement in the distributed setting


Alternating Direction Method of Multipliers (ADMM)
• A tool for solving convex problems with separable objectives:
• LASSO example:
• Know how to minimize f(β) or g(β) separately
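The two blanks above are filled in lecture; a sketch of the standard setup, consistent with the f and g used in the augmented Lagrangian on the later slides:

\min_{\beta} \; f(\beta) + g(\beta), \qquad \text{LASSO: } f(\beta) = \|X\beta - y\|_2^2, \quad g(\beta) = \lambda\|\beta\|_1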



ADMM Insight
• Try this instead:
• Solve using the method of multipliers
• Define the augmented Lagrangian:
  – Issue: the L2 penalty destroys separability of the Lagrangian
  – Solution: replace minimization over (x, z) by alternating minimization
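The "try this instead" blank is the usual variable-splitting reformulation (a sketch consistent with the augmented Lagrangian defined on the next slide):

\min_{x,\,z} \; f(x) + g(z) \quad \text{subject to} \quad x = z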


ADMM Algorithm
• Augmented Lagrangian:
• Alternate between:
  1. x ←
  2. z ←
  3. y ←


L_\rho(x, z, y) = f(x) + g(z) + y^\top (x - z) + \frac{\rho}{2}\|x - z\|_2^2
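The three blanks are filled in lecture; for this augmented Lagrangian, the standard ADMM iterations (as in the Boyd et al. reference cited at the end) are:

x \leftarrow \arg\min_x \; L_\rho(x, z, y), \qquad
z \leftarrow \arg\min_z \; L_\rho(x, z, y), \qquad
y \leftarrow y + \rho\,(x - z)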


ADMM for LASSO
• Objective:
• Augmented Lagrangian:
• Alternate between:
  1. β ←
  2. z ←
  3. a ←


L_\rho(x, z, y) = f(x) + g(z) + y^\top (x - z) + \frac{\rho}{2}\|x - z\|_2^2

L_\rho(\beta, z, a) =
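A minimal NumPy sketch of ADMM for the LASSO under the usual splitting f(β) = ||Xβ − y||₂², g(z) = λ||z||₁ with constraint β = z. The closed-form β-step and soft-threshold z-step are the standard ones; the names and the use of a scaled dual variable u = a/ρ are illustrative assumptions, not necessarily the exact parameterization written out in lecture:

import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft thresholding: the proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(X, y, lam, rho=1.0, n_iters=200):
    """ADMM for min_b ||X b - y||_2^2 + lam ||b||_1 via the split b = z."""
    N, d = X.shape
    beta, z, u = np.zeros(d), np.zeros(d), np.zeros(d)   # u is the scaled dual, u = a / rho
    A = 2.0 * X.T @ X + rho * np.eye(d)                  # beta-step is a ridge-like solve
    Xty2 = 2.0 * X.T @ y
    for _ in range(n_iters):
        beta = np.linalg.solve(A, Xty2 + rho * (z - u))  # beta <- argmin_beta L_rho
        z = soft_threshold(beta + u, lam / rho)          # z    <- argmin_z    L_rho
        u = u + beta - z                                 # a    <- dual update (scaled form)
    return z                                             # z is exactly sparse; beta ~ z at convergence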

ADMM Wrap-Up
• When does ADMM converge?
  – Under very mild conditions
  – Basically, f and g must be convex
• ADMM is useful in cases where
  – f(x) + g(x) is challenging to solve due to coupling
  – We can minimize
    • f(x) + (x − a)²
    • g(x) + (x − a)²
• Reference
  – Boyd, Parikh, Chu, Peleato, Eckstein (2011). "Distributed optimization and statistical learning via the alternating direction method of multipliers." Foundations and Trends in Machine Learning, 3(1):1–122.



What you need to know
• A simple SCD for LASSO (Shooting)
  – Your HW, a more efficient implementation!
  – Analysis of SCD
• Parallel SCD (Shotgun)
• Other parallel learning approaches for linear models
  – Parallel stochastic gradient descent (SGD)
  – Parallel independent solutions then averaging
• ADMM
  – General idea
  – Application to LASSO
