descent methods

Peter Richtárik

Parallel coordinateSimons Institute for the Theory of Computing, BerkeleyParallel and Distributed Algorithms for Inference and Optimization, October 23, 2013

descent methods

Randomized Coordinate Descent

Find the minimizer of

2D OptimizationContours of a function

a2 =b2

Randomized Coordinate Descent in 2D

a2 =b2

67SOLVED!

Convergence of Randomized Coordinate Descent

Strongly convex F

Smooth or ‘simple’ nonsmooth F‘difficult’ nonsmooth F

Focus on n

(big data = big n)

Parallelization Dream

Depends on to what extent we can add up individual updates, which depends on the properties of F and the

way coordinates are chosen at each iteration

Serial Parallel

What do we actually get?WANT

How (not) to ParallelizeCoordinate Descent

“Naive” parallelization

Do the same thing as before, but

for MORE or ALL coordinates &

ADD UP the updates

Failure of naive parallelization

Idea: averaging updates may help

SOLVED!

Averaging can be too conservative

and so on...

Averaging may be too conservative 2

BAD!!!But we wanted:

What to do?

Averaging:

Summation:

Update to coordinate i

i-th unit coordinate vector

Figure out when one can safely use:

Optimization Problems

Problem

Convex (smooth or nonsmooth)

Convex (smooth or nonsmooth)- separable- allow

Loss Regularizer

Regularizer: examples

No regularizer Weighted L1 norm

Weighted L2 normBox constraints

e.g., SVM dual

e.g., LASSO

Loss: examples

Quadratic loss

L-infinity

L1 regression

Exponential loss

Logistic loss

Square hinge loss

BKBG’11RT’11bTBRS’13RT ’13a

FR’13

3 models for f with small

Smooth partially separable f [RT’11b ]

Nonsmooth max-type f [FR’13]

f with ‘bounded Hessian’ [BKBG’11, RT’13a ]

General Theory

Randomized Parallel Coordinate Descent Method

Random set of coordinates (sampling)

Current iterate New iterate i-th unit coordinate vector

Update to i-th coordinate

ESO: Expected Separable Overapproximation

Definition [RT’11b]

1. Separable in h2. Can minimize in parallel3. Can compute updates for only

Shorthand:

Minimize in h

Convergence rate: convex f

average # updated coordinates per iteration

# coordinatesstepsize parameter

error tolerance# iterations

implies

Theorem [RT’11b]

Convergence rate: strongly convex f

implies

Strong convexity constant of the regularizer

Strong convexity constant of the loss f

Theorem [RT’11b]

Partial Separability and

Doubly Uniform Samplings

Serial uniform samplingProbability law:

-nice samplingProbability law:

Good for shared memory systems

Doubly uniform sampling

Probability law:

Can model unreliable processors / machines

ESO for partially separable functionsand doubly uniform samplings

Theorem [RT’11b]

1 Smooth partially separable f [RT’11b ]

Theoretical speedup

Much of Big Data is here!

degree of partial separability

# coordinates

# coordinate updates / iter

WEAK OR NO SPEEDUP: Non-separable (dense) problems

LINEAR OR GOOD SPEEDUP: Nearly separable (sparse) problems

n = 1000(# coordinates)

Theory

Practice

n = 1000(# coordinates)

Experiment with a 1 billion-by-2 billion

LASSO problem

Optimization with Big Data

* in a billion dimensional space on a foggy day

Extreme* Mountain Climbing=

Coordinate Updates

Iterations

Wall Time

Distributed-Memory Coordinate Descent

Distributed -nice sampling

Probability law:

Machine 2Machine 1 Machine 3

Good for a distributed version of coordinate descent

ESO: Distributed setting

Theorem [RT’13b]

3 f with ‘bounded Hessian’ [BKBG’11, RT’13a ]

spectral norm of the data

Bad partitioning at most doubles # of iterations

spectral norm of the partitioning

Theorem [RT’13b]

# nodes

# iterations = implies

# updates/node

LASSO with a 3TB data matrix

128 Cray XE6 nodes with 4 MPI processes (c = 512) Each node: 2 x 16-cores with 32GB RAM

= # coordinates

Conclusions• Coordinate descent methods scale very well

to big data problems of special structure– Partial separability (sparsity)– Small spectral norm of the data– Nesterov separability, ...

• Care is needed when combining updates (add them up? average?)

• Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for L1-regularized loss minimization. JMLR 2011.

• Yurii Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

• [RT’11b] P.R. and Martin Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Prog., 2012.

• Rachael Tappenden, P.R. and Jacek Gondzio, Inexact coordinate descent: complexity and preconditioning, arXiv: 1304.5530, 2013.

• Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.

• Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.

References: serial coordinate descent

• [BKBG’11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin, Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML 2011

• [RT’12] P.R. and Martin Takáč, Parallel coordinate descen methods for big data optimization. arXiv:1212.0873, 2012

• Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013

• [FR’12] Olivier Fercoq and P.R., Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013

• [RT’13a] P.R. and Martin Takáč, Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013

• [RT’13b] P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013

References: parallel coordinate descent

Good entry point to the topic (4p paper)

• P.R. and Martin Takáč, Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings 2012.

• Rachael Tappenden, P.R. and Burak Buke, Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.

• Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.

• Shai Shalev-Shwartz and Tong Zhang, Accelerated mini-batch stochastic dual coordinate ascent. NIPS 2013.

References: parallel coordinate descent

TALK TOMORROW

descent methods

Documents

Transcript of descent methods

12. Coordinate descent methods · 12. Coordinate descent methods • theoretical justiﬁcations • randomized coordinate descent method • minimizing composite objectives • accelerated

On the complexity of steepest descent, Newton’s and ... · On the complexity of steepest descent, ... regularized Newton’s methods for nonconvex unconstrained optimization ...

Gradient descent methods in Deep Learning

E–ciency of coordinate descent methods on huge-scale ...sqma/SEEM5121_Spring2015/... · E–ciency of coordinate descent methods on huge-scale optimization problems ... Computational

Gradient Methods May 2005. Preview Background Steepest Descent Conjugate Gradient.

Lecture 10: descent methods - Homepage - University of Surrey

Block coordinate descent methods for semideﬁnite programmingbicmr.pku.edu.cn/~wenzw/paper/RBRBookCh.pdf · 2019. 1. 14. · Block coordinate descent methods for semideﬁnite programming

STOCHASTIC BLOCK MIRROR DESCENT METHODS … · STOCHASTIC BLOCK MIRROR DESCENT METHODS FOR NONSMOOTH AND STOCHASTIC OPTIMIZATION CONG D. DANG yAND GUANGHUI LAN z Abstract. In this

Semi-Stochastic Gradient Descent Methods

Optimal Epoch Stochastic Gradient Descent Ascent Methods ...

Semi-Stochastic Gradient Descent Methods Jakub Konečný University of Edinburgh Lehigh University December 3, 2014.

Convergence Analysis of Perturbed Feasible Descent Methods

Numerical stability of descent methods for solving linear ... · Numer. Math. 43, 361-377 (1984) Numerische Mathematik 9 Springer-Verlag 1984 Numerical Stability of Descent Methods

Semi-Stochastic Gradient Descent Methodsrichtarik/papers/s2gd.pdf · Semi-Stochastic Gradient Descent Methods ... y review two basic approaches to solving problem (1). 1. ... the

Gradient Methods April 2004. Preview Background Steepest Descent Conjugate Gradient.

Convergence of descent methods for semi-algebraic …plc/attouch.pdf · Convergence of descent methods for semi-algebraic and ... (GREMAQ, Toulouse I): Math. Programming, Ser. B,

E–ciency of coordinate descent ... - Optimization Online · E–ciency of coordinate descent methods on huge-scale optimization problems Yu. Nesterov ⁄ January 2010 Abstract In

Dual Coordinate Descent Methods for Logistic Regression ...cjlin/papers/maxent_dual.pdf · Dual Coordinate Descent Methods for Logistic Regression ... Newton, and truncated Newton

Stochastic Gradient Descent Methods

Numerical stability of descent methods for solving linear ... fileNumer. Math. 43, 361-377 (1984) Numerische Mathematik 9 Springer-Verlag 1984 Numerical Stability of Descent Methods