Peter Richtárik (joint work with Martin Takáč)
Distributed Coordinate Descent Method
AmpLab All Hands Meeting - Berkeley - October 29, 2013
Randomized Coordinate Descent in 2D

Goal: find the minimizer of a function of two variables.
[Figure: contour plot of a 2D objective function.]
[Figure: seven iterations of randomized coordinate descent on the 2D contours. At each step one of the two coordinate directions (shown as compass directions N/S/E/W) is picked at random and the objective is minimized exactly along it; after 7 steps the iterate reaches the minimizer. SOLVED!]
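A minimal Python sketch of this procedure, assuming an illustrative 2D quadratic objective and exact minimization along each chosen coordinate (the specific matrix and starting point are my assumptions, not from the slides):

import numpy as np

# Illustrative 2D quadratic f(x) = 0.5 * x^T A x - b^T x (assumed example).
# A is symmetric positive definite, so f has a unique minimizer.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

rng = np.random.default_rng(0)
x = np.array([2.0, 2.0])  # starting point

for k in range(20):
    i = rng.integers(2)   # pick one of the two coordinates uniformly at random
    # Exact line search along coordinate i: solve d/dt f(x + t*e_i) = 0,
    # which for a quadratic gives t = (b_i - A_i x) / A_ii.
    t = (b[i] - A[i] @ x) / A[i, i]
    x[i] += t

print(x, np.linalg.solve(A, b))  # final iterate vs. true minimizer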
Convergence of Randomized Coordinate Descent

The rate depends on the problem class: strongly convex f; smooth or 'simple' nonsmooth f; 'difficult' nonsmooth f. The focus is on the dependence on d (big data = big d).
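For reference, serial randomized coordinate descent is known to have iteration complexities of the following shape in these three regimes (standard results from the Nesterov and Richtárik-Takáč analyses; the exact constants on the original slide are not recoverable from this transcript):

% Iteration counts to reach accuracy \epsilon (leading terms only; illustrative):
\begin{align*}
\text{strongly convex } f: &\quad O\big(d \log(1/\epsilon)\big) \\
\text{smooth or ``simple'' nonsmooth } f: &\quad O\big(d/\epsilon\big) \\
\text{``difficult'' nonsmooth } f: &\quad O\big(d/\epsilon^2\big)
\end{align*}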
Parallelization Dream

Serial: update one coordinate per iteration. The parallel dream: update many (or all) coordinates at once and converge proportionally faster. In reality we get something in between.
How (Not) to Parallelize Coordinate Descent

"Naive" parallelization: do the same thing as before, but with more (or all) coordinates, and add up the updates.
Failure of naive parallelization

[Figure: from the point labeled 0, updates 1a and 1b are computed in parallel; adding them up overshoots the minimizer, landing at point 1. The next parallel updates 2a and 2b overshoot again to point 2, even further from the minimizer. OOPS!]
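A small Python sketch of this failure mode, assuming a strongly coupled quadratic in d = 4 dimensions (the matrix is an illustrative choice, not from the talk): each coordinate step is individually optimal, yet adding them all up makes the objective blow up.

import numpy as np

# Assumed example: f(x) = 0.5 * x^T A x with unit diagonal and strong
# coupling (off-diagonal entries 0.9).
d = 4
A = 0.9 * np.ones((d, d)) + 0.1 * np.eye(d)

def f(x):
    return 0.5 * x @ A @ x

x = np.ones(d)
for k in range(5):
    # Optimal step for each coordinate, computed INDEPENDENTLY at x:
    steps = np.array([-(A[i] @ x) / A[i, i] for i in range(d)])
    x = x + steps          # naive: add up all the updates
    print(k, f(x))         # the objective grows without bound -- OOPS!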
Idea: averaging the updates (instead of adding them up) may help.

[Figure: the same 2D example with averaged updates now converges. SOLVED!]
Averaging can be too conservative

[Figure: from point 0, the averaged updates 1a and 1b produce only a tiny step to point 1; updates 2a and 2b produce another tiny step to point 2, and so on. We WANT large steps toward the minimizer, but progress is far slower than necessary. BAD!!!]
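Continuing the same assumed setup, the sketch below shows the conservative side: with weak coupling, averaging still forces an effective step size of 1/d, so progress per iteration is much smaller than the full steps would safely allow.

import numpy as np

d = 4
# Weakly coupled case (assumed example): small off-diagonals, so full
# summed steps would actually be nearly optimal here.
A = 0.1 * np.ones((d, d)) + 0.9 * np.eye(d)

def f(x):
    return 0.5 * x @ A @ x

x = np.ones(d)
for k in range(10):
    steps = np.array([-(A[i] @ x) / A[i, i] for i in range(d)])
    x = x + steps / d      # averaging: always safe, often too conservative
    print(k, f(x))         # decreases, but only by a small factor per step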
Minimizing Regularized Loss

min_x F(x) = f(x) + Ω(x)

Loss f: convex (smooth). Regularizer Ω: convex (smooth or nonsmooth), separable, and allowed to take the value +∞ (so constraints can be encoded).
Regularizer: examples

- No regularizer
- Weighted L1 norm (e.g., LASSO)
- Weighted L2 norm
- Box constraints (e.g., SVM dual)
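As a concrete instance, with a quadratic loss and a weighted L1 regularizer (the LASSO case above), one exact coordinate-descent update reduces to soft-thresholding. A minimal Python sketch, assuming the objective 0.5*||Mx - y||^2 + lam*||x||_1 (the names M, y, lam are illustrative, not the paper's notation):

import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * |.| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd_step(M, y, x, i, lam):
    """Exact minimization of 0.5*||Mx - y||^2 + lam*||x||_1 in coordinate i,
    holding all other coordinates fixed. Returns the new value of x[i]."""
    r = y - M @ x + M[:, i] * x[i]   # residual with coordinate i removed
    rho = M[:, i] @ r                # correlation of column i with residual
    return soft_threshold(rho, lam) / (M[:, i] @ M[:, i])

# Usage: x[i] = lasso_cd_step(M, y, x, i, lam)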
Structure of f

The structural assumptions on f were considered in [BKBG, ICML 2011].

Loss: examples

- Quadratic loss
- L-infinity
- L1 regression
- Exponential loss
- Logistic loss
- Square hinge loss

(Related work: BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13.)
Distributed Coordinate Descent Method

I. Distribution of Data

d = # features / variables / coordinates. The data matrix is partitioned by coordinates (columns) across the nodes.
II. Choice of Coordinates

At each iteration, a random set of coordinates (a 'sampling') is selected: each node picks a random subset of the coordinates it owns.
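A minimal Python sketch of this sampling step, assuming a simple contiguous partition of the d coordinates over c nodes with τ samples per node (layout and parameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)

d, c, tau = 12, 3, 2          # coordinates, nodes, updates per node (assumed)
partition = np.array_split(np.arange(d), c)   # node j owns partition[j]

# Each node independently samples tau of its own coordinates, without
# replacement, at every iteration.
sampled = [rng.choice(part, size=tau, replace=False) for part in partition]
print(sampled)   # e.g., one small index array per node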
III. Computing Updates to Selected Coordinates

Each sampled coordinate i of the current iterate x_k receives an update, producing the new iterate x_{k+1}. All nodes need to be able to compute this update, which requires communication.
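The formula itself did not survive transcription; as a sketch, in the Richtárik-Takáč analysis the i-th coordinate step takes the following proximal closed form, where β and the weights w_i come from their separability/partitioning assumptions:

% Sketch of the update to the i-th coordinate (proximal closed form):
h_k^{(i)} = \arg\min_{h \in \mathbb{R}} \Big\{ \nabla_i f(x_k)\, h
  + \tfrac{\beta w_i}{2} h^2 + \Omega_i\big(x_k^{(i)} + h\big) \Big\},
\qquad x_{k+1}^{(i)} = x_k^{(i)} + h_k^{(i)}.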
Iteration Complexity

Theorem [RT'13]. The iteration bound depends on: d (# coordinates), c (# nodes), τ (# coordinates updated per node), the strong convexity constants of the loss f and of the regularizer, and the spectral norm of the "partitioning". Bad partitioning at most doubles the # of iterations.
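The precise bound is garbled in this transcript. Purely as an illustration of its shape (not the theorem's exact statement), an ESO-based strongly convex rate of this family reads, with μ_f and μ_Ω the two strong convexity constants and β encoding the partitioning:

% Illustrative shape only (not the exact theorem):
k \;\ge\; \frac{d}{c\,\tau} \cdot \frac{\beta + \mu_\Omega}{\mu_f + \mu_\Omega}
  \cdot \log\!\Big(\frac{1}{\epsilon\,\rho}\Big)
\quad \text{implies} \quad
\mathbf{P}\big(F(x_k) - F^* \le \epsilon\big) \ge 1 - \rho.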
Experiment 1

1 node (c = 1). LASSO problem with n = 2 billion, d = 1 billion.
[Plots: progress measured in coordinate updates, iterations, and wall time.]
Experiment 2

128 nodes (c = 512, 4096 cores). LASSO problem with n = 1 billion, d = 0.5 billion; data size = 3 TB.
[Plot: LASSO with 3 TB of data on 128 nodes.]