Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013

Page 1:

Peter Richtárik (joint work with Martin Takáč)

Distributed Coordinate Descent Method

AmpLab All Hands Meeting - Berkeley - October 29, 2013

Page 2:

Randomized Coordinate Descent in 2D

Page 3:

2D Optimization

Contours of a function

Goal: Find the minimizer

Page 4:

Randomized Coordinate Descent in 2D

[Figure: contour plot; compass directions N, S, E, W]

Page 5:

Randomized Coordinate Descent in 2D

[Figure: contour plot; compass directions N, S, E, W; iterate 1]

Page 6:

Randomized Coordinate Descent in 2D

[Figure: contour plot; compass directions N, S, E, W; iterates 1-2]

Page 7:

Randomized Coordinate Descent in 2D

[Figure: contour plot; compass directions N, S, E, W; iterates 1-3]

Page 8:

Randomized Coordinate Descent in 2D

[Figure: contour plot; compass directions N, S, E, W; iterates 1-4]

Page 9:

Randomized Coordinate Descent in 2D

[Figure: contour plot; compass directions N, S, E, W; iterates 1-5]

Page 10:

Randomized Coordinate Descent in 2D

[Figure: contour plot; compass directions N, S, E, W; iterates 1-6]

Page 11:

Randomized Coordinate Descent in 2D

[Figure: contour plot; compass directions N, S, E, W; iterates 1-7]

SOLVED!

Page 12:

Convergence of Randomized Coordinate Descent

Strongly convex f

Smooth or ‘simple’ nonsmooth f

‘difficult’ nonsmooth f

Focus on d

(big data = big d)

Page 13:

Parallelization Dream

Serial Parallel

In reality we get something in between

Page 14:

How (not) to Parallelize Coordinate Descent

Page 15:

“Naive” parallelization

Do the same thing as before, but with more or all coordinates, and add up the updates.
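
A minimal sketch of the naive scheme on a toy problem. The objective f(x) = (x1 + x2 - 1)^2, the starting point and all function names are assumptions chosen for illustration; they are not taken from the slides. Both coordinate updates are computed from the same point and then summed, and the method oscillates without ever reducing f.

```python
def f(x):
    # Illustrative objective (an assumption, not the function on the slides).
    return (x[0] + x[1] - 1.0) ** 2

def coordinate_minimizer(x, i):
    # Exact minimizer of f along coordinate i, other coordinate held fixed.
    return 1.0 - x[1 - i]

def naive_parallel_step(x):
    # Compute both coordinate updates from the SAME point, then add them up.
    h = [coordinate_minimizer(x, i) - x[i] for i in range(2)]
    return [x[i] + h[i] for i in range(2)]

x = [0.0, 0.0]
values = [f(x)]
for _ in range(4):
    x = naive_parallel_step(x)
    values.append(f(x))

print(values)  # the objective value never improves
```

On this toy function the iterates jump between (0, 0) and (1, 1), because each update is only correct if the other coordinate stays put.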

Page 16:

Failure of naive parallelization

[Figure: points 0, 1a, 1b]

Page 17:

Failure of naive parallelization

[Figure: points 0, 1a, 1b, 1]

Page 18:

Failure of naive parallelization

[Figure: points 1, 2a, 2b]

Page 19:

Failure of naive parallelization

[Figure: points 1, 2a, 2b, 2]

Page 20:

Failure of naive parallelization

[Figure: point 2]

OOPS!

Page 21:

Idea: averaging updates may help

[Figure: points 0, 1a, 1b, 1]

SOLVED!
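
A small sketch of the averaging idea, again on a toy objective f(x) = (x1 + x2 - 1)^2 that is assumed here purely for illustration (the slides do not give the function). The two coordinate updates are computed from the same point and averaged instead of summed, which on this example lands on a minimizer in one step.

```python
def f(x):
    # Illustrative objective (an assumption, not the function on the slides).
    return (x[0] + x[1] - 1.0) ** 2

def coordinate_minimizer(x, i):
    # Exact minimizer of f along coordinate i, other coordinate held fixed.
    return 1.0 - x[1 - i]

def averaged_parallel_step(x):
    # Compute both coordinate updates from the same point, then average them.
    h = [coordinate_minimizer(x, i) - x[i] for i in range(2)]
    return [x[i] + h[i] / 2.0 for i in range(2)]

x = averaged_parallel_step([0.0, 0.0])
print(x, f(x))
```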

Page 22:

Averaging can be too conservative

[Figure: points 0, 1a, 1b, 1, 2a, 2b, 2]

and so on...
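
To see why averaging can be too conservative, consider a separable toy objective f(x) = (x1 - 1)^2 + (x2 - 1)^2 (an assumption for illustration, not the function on the slides). Here the coordinates do not interact, so summing the updates solves the problem in one step, while averaging only halves the remaining distance each time.

```python
def summed_step(x):
    # For this separable f, the minimizer along each coordinate is always 1.
    h = [1.0 - xi for xi in x]
    return [xi + hi for xi, hi in zip(x, h)]

def averaged_step(x):
    # Same updates, but divided by the number of coordinates updated.
    h = [1.0 - xi for xi in x]
    return [xi + hi / len(x) for xi, hi in zip(x, h)]

summed = summed_step([0.0, 0.0])

x = [0.0, 0.0]
trajectory = []
for _ in range(3):
    x = averaged_step(x)
    trajectory.append(x[0])

print(summed)      # reaches the minimizer (1, 1) immediately
print(trajectory)  # first coordinate creeps toward 1
```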

Page 23:

Averaging may be too conservative

WANT

BAD!!!

Page 24:

Minimizing Regularized Loss

Page 25:

Minimizing Regularized Loss

Loss: convex (smooth)

Regularizer: convex (smooth or nonsmooth)
- separable
- allow
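
The problem described on this slide can be written roughly as follows. The symbols F, R and R_i are notation assumed here for illustration; the transcript only names f, the loss, and says that the regularizer is separable.

```latex
\min_{x \in \mathbb{R}^d} \; F(x) = f(x) + R(x),
\qquad
R(x) = \sum_{i=1}^{d} R_i(x_i),
```

where f is the smooth convex loss and each R_i is a convex (possibly nonsmooth) function of the single coordinate x_i.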

Page 26:

Regularizer: examples

- No regularizer
- Weighted L1 norm (e.g., LASSO)
- Weighted L2 norm
- Box constraints (e.g., SVM dual)
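
Separability is what makes these regularizers convenient for coordinate descent: each one has a closed-form one-dimensional proximal step. The sketch below is illustrative only. The function names and the parameter t (weight times step size) are hypothetical, and the L2 case assumes the squared form (t/2)·u², which the slide does not spell out.

```python
def prox_none(v):
    # No regularizer: the proximal step is the identity.
    return v

def prox_l1(v, t):
    # argmin_u 0.5*(u - v)^2 + t*|u|  (soft-thresholding, used for LASSO).
    return max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0)

def prox_l2_squared(v, t):
    # argmin_u 0.5*(u - v)^2 + (t/2)*u^2  (assumed squared-L2 form).
    return v / (1.0 + t)

def prox_box(v, lo, hi):
    # Projection onto the interval [lo, hi] (box constraints, as in the SVM dual).
    return min(max(v, lo), hi)

print(prox_l1(3.0, 1.0), prox_l2_squared(3.0, 0.5), prox_box(1.7, 0.0, 1.0))
```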

Page 27:

Structure of f

Considered in [BKBG, ICML 2011]

Page 28:

Loss: examples

- Quadratic loss
- L-infinity
- L1 regression
- Exponential loss
- Logistic loss
- Square hinge loss

References: BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13

Page 29:

Distributed Coordinate Descent Method

Page 30:

I. Distribution of Data

d = # features / variables / coordinates

Data matrix

Page 31:

II. Choice of Coordinates

Page 32:

II. Choice of Coordinates

Random set of coordinates (‘sampling’)
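
Steps I and II can be sketched as below. This is an illustration under assumptions that the transcript does not state explicitly: the d coordinates are split into contiguous, near-equal blocks (one per node), and each node draws tau distinct coordinates uniformly at random from its own block. The names `partition_coordinates`, `sample_coordinates`, `c` and `tau` are hypothetical.

```python
import random

def partition_coordinates(d, c):
    # Split coordinates 0..d-1 into c contiguous blocks whose sizes differ by at most one.
    base, extra = divmod(d, c)
    blocks, start = [], 0
    for node in range(c):
        size = base + (1 if node < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

def sample_coordinates(blocks, tau, rng):
    # Each node independently picks tau distinct coordinates from its own block.
    return [sorted(rng.sample(block, tau)) for block in blocks]

rng = random.Random(0)
blocks = partition_coordinates(d=10, c=3)
chosen = sample_coordinates(blocks, tau=2, rng=rng)
print(blocks)
print(chosen)
```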

Page 33:

III. Computing Updates to Selected Coordinates

Random set of coordinates (‘sampling’)

Current iterate New iterate

Update to i-th coordinate

All nodes need to be able to compute this (communication)
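
The three steps can be put together in a single-process simulation for a small LASSO problem, 0.5·||Ax - b||² + lam·||x||₁. This is a sketch under stated assumptions, not the authors' implementation: the step-size parameter `beta` is set to the conservative value c·tau (the number of coordinates updated at once), which always guarantees descent by convexity; the talk's theorem [RT’13] uses its own parameter, which the transcript does not preserve. The shared residual `r` stands in for the information that all nodes must be able to compute through communication. All names are hypothetical.

```python
import random

def soft_threshold(v, t):
    # Closed-form minimizer of 0.5*(u - v)^2 + t*|u|.
    return max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0)

def lasso_objective(A, b, x, lam):
    n, d = len(A), len(x)
    r = [sum(A[j][i] * x[i] for i in range(d)) - b[j] for j in range(n)]
    return 0.5 * sum(v * v for v in r) + lam * sum(abs(v) for v in x)

def distributed_cd(A, b, lam, blocks, tau, iters, seed=0):
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    # Coordinate-wise Lipschitz constants: squared column norms of A.
    L = [sum(A[j][i] ** 2 for j in range(n)) for i in range(d)]
    beta = len(blocks) * tau  # conservative stand-in, NOT the value from [RT'13]
    x = [0.0] * d
    r = [-bj for bj in b]     # residual Ax - b at x = 0
    history = [lasso_objective(A, b, x, lam)]
    for _ in range(iters):
        # II. Each node samples tau coordinates from its own block.
        chosen = [i for block in blocks for i in rng.sample(block, tau)]
        # III. All updates are computed from the same current iterate.
        h = {}
        for i in chosen:
            g = sum(A[j][i] * r[j] for j in range(n))  # partial derivative of the loss
            step = beta * L[i]
            h[i] = soft_threshold(x[i] - g / step, lam / step) - x[i]
        # Apply the updates and refresh the shared residual.
        for i, hi in h.items():
            x[i] += hi
            for j in range(n):
                r[j] += hi * A[j][i]
        history.append(lasso_objective(A, b, x, lam))
    return x, history

data_rng = random.Random(1)
A = [[data_rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(8)]
b = [data_rng.gauss(0.0, 1.0) for _ in range(8)]
x, history = distributed_cd(A, b, lam=0.1, blocks=[[0, 1, 2], [3, 4, 5]], tau=2, iters=50)
print(history[0], history[-1])
```

With this conservative `beta` the objective never increases, at the cost of shorter steps; a smaller, data-dependent value would take longer steps when the coordinates interact weakly.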

Page 34:

Iteration Complexity

Theorem [RT’13]

Quantities in the bound:
- # coordinates
- # nodes
- # coordinates updated / node
- Strong convexity constant of the loss f
- Strong convexity constant of the regularizer

implies

Page 35:

Bad partitioning at most doubles # of iterations

spectral norm of the “partitioning”

Theorem [RT’13]

Page 36:

Experiment 1

1 node (c = 1)

LASSO problem

n = 2 billion, d = 1 billion

Page 37:

Coordinate Updates

Page 38:

Iterations

Page 39:

Wall Time

Page 40:

Experiment 2

128 nodes (c = 512, 4096 cores)

LASSO problem

n = 1 billion, d = 0.5 billion

data size = 3 TB

Page 41:

LASSO: 3TB data + 128 nodes