©Emily Fox 2015
Machine Learning for Big Data CSE547/STAT548, University of Washington
Emily Fox April 30th, 2015
“Scalable” LASSO Solvers: Parallel SCD (Shotgun), Parallel SGD, Averaging Solutions, ADMM
Case Study 3: fMRI Prediction

Scaling Up LASSO Solvers
• A simple SCD for LASSO (Shooting)
– Your HW, a more efficient implementation!
– Analysis of SCD
• Parallel SCD (Shotgun)
• Other parallel learning approaches for linear models
– Parallel stochastic gradient descent (SGD)
– Parallel independent solutions, then averaging
• ADMM
Stochastic Coordinate Descent for LASSO (aka Shooting Algorithm)
• Repeat until convergence:
– Pick a coordinate j at random
• Set:
\[
\hat{\beta}_j = \begin{cases}
(c_j + \lambda)/a_j & c_j < -\lambda \\
0 & c_j \in [-\lambda, \lambda] \\
(c_j - \lambda)/a_j & c_j > \lambda
\end{cases}
\]
• Where:
\[
c_j = 2\sum_{i=1}^{N} x_{ij}\left(y^i - \beta_0 - \beta_{-j}^T x^i_{-j}\right), \qquad
a_j = 2\sum_{i=1}^{N} (x_{ij})^2
\]
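The shooting update above can be sketched in numpy. This is a minimal sketch, assuming centered data with no intercept term \(\beta_0\) and nonzero feature columns; `shooting` and `soft_threshold` are hypothetical helper names, not from the lecture.

```python
import numpy as np

def soft_threshold(c, lam):
    """Closed-form coordinate minimizer: the three-case rule from the slide."""
    if c < -lam:
        return c + lam
    elif c > lam:
        return c - lam
    return 0.0

def shooting(X, y, lam, n_iters=1000, rng=None):
    """Stochastic coordinate descent for min ||Xb - y||_2^2 + lam * ||b||_1."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    beta = np.zeros(d)
    a = 2.0 * (X ** 2).sum(axis=0)            # a_j = 2 * sum_i x_ij^2
    for _ in range(n_iters):
        j = rng.integers(d)                   # pick a coordinate at random
        r = y - X @ beta + X[:, j] * beta[j]  # residual excluding feature j
        c = 2.0 * X[:, j] @ r                 # c_j from the slide
        beta[j] = soft_threshold(c, lam) / a[j]
    return beta
```

With an orthonormal design each coordinate update is exact, so a few random passes recover the soft-thresholded least-squares solution.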
Shotgun: Parallel SCD [Bradley et al ‘11]
Shotgun (Parallel SCD)
While not converged:
  On each of P processors:
    Choose random coordinate j
    Update βj (same as for Shooting)
Lasso: \(\min_\beta F(\beta)\) where \(F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1\)
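One Shotgun round can be sketched as follows. This sketch simulates the P simultaneous coordinate updates sequentially from one shared snapshot of β, then applies them collectively; on real hardware the inner loop runs on P processors. `shotgun_step` is a hypothetical helper name.

```python
import numpy as np

def shotgun_step(X, y, beta, lam, P, rng):
    """One Shotgun round: P coordinates updated 'in parallel' from the same beta."""
    N, d = X.shape
    a = 2.0 * (X ** 2).sum(axis=0)                 # a_j = 2 * sum_i x_ij^2
    coords = rng.choice(d, size=P, replace=False)  # one coordinate per processor
    r = y - X @ beta                               # shared residual snapshot
    delta = np.zeros(d)
    for j in coords:                               # conceptually simultaneous
        c = 2.0 * X[:, j] @ (r + X[:, j] * beta[j])
        new_bj = np.sign(c) * max(abs(c) - lam, 0.0) / a[j]
        delta[j] = new_bj - beta[j]
    return beta + delta                            # collective update
```

Because all P updates are computed from the same stale β, correlated features can make the collective step overshoot; the convergence theorem below bounds how large P may be.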
4/30/15
3
Is SCD inherently sequential?
Coordinate update: \(\beta_j \leftarrow \beta_j + \delta\beta_j\) (closed-form minimization)
Collective update:
\[
\Delta\beta = \begin{pmatrix} 0 \\ \vdots \\ \delta\beta_i \\ 0 \\ \vdots \\ \delta\beta_j \\ \vdots \\ 0 \end{pmatrix}
\]
Lasso: \(\min_\beta F(\beta)\) where \(F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1\)
Convergence Analysis
Theorem: Shotgun Convergence
Assume \(P < d/\rho + 1\), where \(\rho\) = spectral radius of \(X^T X\). Then
\[
\mathbb{E}\left[F(\beta^{(T)})\right] - F(\beta^*) \;\le\; \frac{d\left(\tfrac{1}{2}\|\beta^*\|_2^2 + F(\beta^{(0)})\right)}{TP}
\]
Nice case, uncorrelated features: \(\rho = 1 \Rightarrow P_{\max} \approx d\)
Bad case, correlated features: \(\rho = d \Rightarrow P_{\max} = 1\) (at worst)
Lasso: \(\min_\beta F(\beta)\) where \(F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1\)
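The two extreme cases of \(\rho\) can be checked numerically. A sketch, assuming unit-norm feature columns (as in the normalized-design setting of the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 4

# Nice case: independent (near-orthogonal) features, unit-norm columns.
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=0)
rho_nice = np.max(np.linalg.eigvalsh(X.T @ X))   # close to 1 => P_max ~ d

# Bad case: d exact copies of the same feature.
x = rng.standard_normal((N, 1))
X_bad = np.tile(x, (1, d))
X_bad /= np.linalg.norm(X_bad, axis=0)
rho_bad = np.max(np.linalg.eigvalsh(X_bad.T @ X_bad))  # equals d => P_max = 1
```

For the duplicated design the Gram matrix is the all-ones matrix, whose largest eigenvalue is exactly d, matching the worst case above.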
Stepping Back…
• Stochastic coordinate descent
– Optimization:
– Parallel SCD:
– Issue:
– Solution:
• Natural counterpart: stochastic gradient descent
– Optimization:
– Parallel SGD:
– Issue:
– Solution:
Parallel SGD with No Locks [e.g., Hogwild!, Niu et al. ‘11]
• Each processor in parallel:
– Pick data point i at random
– For j = 1…p:
• Assume atomicity of individual coordinate updates
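A minimal Hogwild!-style sketch for least squares: every thread reads and writes a shared weight vector with no locks, touching only the coordinates where its data point is nonzero. Note this is illustrative only; CPython's GIL serializes most of the work, so it shows the algorithm's structure rather than real speedups, and `hogwild_sgd` is a hypothetical name.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, n_threads=4, n_steps=2000, lr=0.01, seed=0):
    """Lock-free parallel SGD for min ||Xw - y||^2 (Hogwild!-style sketch)."""
    N, d = X.shape
    w = np.zeros(d)                          # shared, updated without locking

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(n_steps):
            i = rng.integers(N)              # pick a data point at random
            g = 2.0 * (X[i] @ w - y[i]) * X[i]
            for j in np.nonzero(X[i])[0]:    # update only the sparse support
                w[j] -= lr * g[j]            # racy write, assumed atomic

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

The races mean threads sometimes compute gradients from slightly stale weights, which is exactly the interference the next slide discusses.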
Addressing Interference in Parallel SGD
• Key issues:
– Stale gradients
– Processors overwrite each other’s work
• Nonetheless:
– Can achieve convergence and some parallel speedups
– Proof relies on weak interactions, obtained through sparsity of data points
Problem with Parallel SCD and SGD
• Both parallel SCD & SGD assume access to the current estimate of the weight vector
• Works well on shared-memory machines
• Very difficult to implement efficiently in distributed memory
• Open problem: good parallel SGD and SCD for the distributed setting…
– Let’s look at a trivial approach
Simplest Distributed Optimization Algorithm Ever Made
• Given N data points & P machines
• Stochastic optimization problem:
• Distribute data:
• Solve problems independently
• Merge solutions
• Why should this work at all????
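The distribute-then-average recipe above can be sketched in a few lines. Assumptions of the sketch: an exact least-squares solve stands in for the per-machine learner, the "machines" are just shards of one array, and `distribute_then_average` / `solve_local` are hypothetical names.

```python
import numpy as np

def solve_local(X, y):
    """Stand-in for one machine's independent solve (exact least squares)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def distribute_then_average(X, y, P):
    """Split N data points across P 'machines', solve independently, average."""
    shards = np.array_split(np.arange(len(y)), P)          # distribute data
    solutions = [solve_local(X[idx], y[idx]) for idx in shards]  # solve independently
    return np.mean(solutions, axis=0)                      # merge by averaging
```

With noiseless data every shard recovers the same solution, so the average is exact; with finite noisy data each local solution is biased differently, which is where the tradeoffs below come from.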
For Convex Functions…
• Convexity (Jensen’s inequality): \(F\big(\tfrac{1}{P}\sum_{p} \hat\beta^{(p)}\big) \le \tfrac{1}{P}\sum_{p} F(\hat\beta^{(p)})\)
• Thus: the averaged solution is no worse than the average quality of the independent solutions
The average mixture algorithm
Two pictures: [figure: independent estimates \(\tilde\theta_1, \tilde\theta_2, \tilde\theta_3, \tilde\theta_4\) and their average; from Duchi, “Communication Efficient Optimization,” CDC 2012]
Hopefully…
• Convexity only guarantees:
• But, the estimates come from independent data!
Figure from John Duchi
Analysis of Distribute-then-Average [Zhang et al. ‘12]
• Under some conditions, including strong convexity, lots of smoothness, etc.
• If all data were on one machine, converge at rate:
• With P machines, converge at rate:
Tradeoffs, tradeoffs, tradeoffs,…
• Distribute-then-Average:
– “Minimum possible” communication
– Bias term can be a killer with finite data
• Issue definitely observed in practice
– Significant issues for L1 problems: averaging destroys sparsity
• Parallel SCD or SGD
– Can have much better convergence in practice in the multicore setting
– Preserves sparsity (especially SCD)
– But, hard to implement in the distributed setting
Alternating Directions Method of Multipliers
• A tool for solving convex problems with separable objectives: \(\min_\beta f(\beta) + g(\beta)\)
• LASSO example: \(f(\beta) = \|X\beta - y\|_2^2\), \(g(\beta) = \lambda\|\beta\|_1\)
• Know how to minimize f(β) or g(β) separately
ADMM Insight
• Try this instead: \(\min_{x,z} f(x) + g(z)\) subject to \(x = z\)
• Solve using the method of multipliers
• Define the augmented Lagrangian:
– Issue: the L2 penalty destroys separability of the Lagrangian
– Solution: replace joint minimization over (x, z) with alternating minimization
ADMM Algorithm
• Augmented Lagrangian:
\[
L_\rho(x, z, y) = f(x) + g(z) + y^T(x - z) + \frac{\rho}{2}\|x - z\|_2^2
\]
• Alternate between:
1. \(x \leftarrow \arg\min_x L_\rho(x, z, y)\)
2. \(z \leftarrow \arg\min_z L_\rho(x, z, y)\)
3. \(y \leftarrow y + \rho(x - z)\)
ADMM for LASSO
• Objective: \(\min_{\beta, z} \|X\beta - y\|_2^2 + \lambda\|z\|_1\) subject to \(\beta = z\)
• Augmented Lagrangian:
\[
L_\rho(\beta, z, a) = \|X\beta - y\|_2^2 + \lambda\|z\|_1 + a^T(\beta - z) + \frac{\rho}{2}\|\beta - z\|_2^2
\]
• Alternate between:
1. \(\beta \leftarrow \arg\min_\beta L_\rho(\beta, z, a)\)
2. \(z \leftarrow \arg\min_z L_\rho(\beta, z, a)\)
3. \(a \leftarrow a + \rho(\beta - z)\)
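The three updates for the LASSO can be sketched directly from the augmented Lagrangian: the β-step is a ridge-like linear solve, the z-step is a soft-threshold, and the a-step is dual ascent. A minimal sketch (`admm_lasso` is a hypothetical name; no convergence check, fixed iteration count):

```python
import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_iters=200):
    """ADMM for min ||Xb - y||_2^2 + lam * ||z||_1  s.t.  b = z (sketch)."""
    N, d = X.shape
    beta = np.zeros(d)
    z = np.zeros(d)
    a = np.zeros(d)                                   # dual variable
    M = 2.0 * X.T @ X + rho * np.eye(d)               # beta-step system matrix
    Xty2 = 2.0 * X.T @ y
    for _ in range(n_iters):
        # 1. beta-step: solve (2 X^T X + rho I) beta = 2 X^T y + rho z - a
        beta = np.linalg.solve(M, Xty2 + rho * z - a)
        # 2. z-step: soft-threshold beta + a/rho at level lam/rho
        v = beta + a / rho
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # 3. dual ascent on the constraint beta = z
        a = a + rho * (beta - z)
    return z
```

Returning z (rather than β) preserves exact zeros, which is why ADMM keeps LASSO solutions sparse.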
ADMM Wrap-Up
• When does ADMM converge?
– Under very mild conditions
– Basically, f and g must be convex
• ADMM is useful in cases where
– f(x) + g(x) is challenging to solve due to coupling
– We can minimize
• f(x) + (x − a)²
• g(x) + (x − a)²
• Reference
– Boyd, Parikh, Chu, Peleato, Eckstein (2011). “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning, 3(1):1–122.
What you need to know
• A simple SCD for LASSO (Shooting)
– Your HW, a more efficient implementation!
– Analysis of SCD
• Parallel SCD (Shotgun)
• Other parallel learning approaches for linear models
– Parallel stochastic gradient descent (SGD)
– Parallel independent solutions, then averaging
• ADMM
– General idea
– Application to LASSO