Optimization Methods for Large-Scale Machine Learning


Transcript of Optimization Methods for Large-Scale Machine Learning

Page 1: Optimization Methods for Large-Scale Machine Learning


Optimization Methods for Large-Scale Machine Learning

Frank E. Curtis, Lehigh University

presented at

East Coast Optimization Meeting
George Mason University

Fairfax, Virginia

April 2, 2021

Page 2: Optimization Methods for Large-Scale Machine Learning

References

- Leon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2):223–311, 2018.

- Frank E. Curtis and Katya Scheinberg. Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning. In INFORMS Tutorials in Operations Research, chapter 5, pages 89–114. Institute for Operations Research and the Management Sciences (INFORMS), 2017.

Page 3: Optimization Methods for Large-Scale Machine Learning

Motivating questions

- How do optimization problems arise in machine learning applications, and what makes them challenging?

- What have been the most successful optimization methods for large-scale machine learning, and why?

- What recent advances have been made in the design of algorithms, and what are open questions in this research area?

Page 4: Optimization Methods for Large-Scale Machine Learning

Outline

GD and SG

GD vs. SG

Beyond SG

Noise Reduction Methods

Second-Order Methods

Conclusion

Page 6: Optimization Methods for Large-Scale Machine Learning

Learning problems and (surrogate) optimization problems

Learn a prediction function h : X → Y to solve

max_{h ∈ H} ∫_{X×Y} 1[h(x) ≈ y] dP(x, y)

Various meanings for h(x) ≈ y depending on the goal:

- Binary classification, with y ∈ {−1, +1}: y · h(x) > 0.

- Regression, with y ∈ R^{n_y}: ‖h(x) − y‖ ≤ δ.

Parameterizing h by w ∈ R^d, we aim to solve

max_{w ∈ R^d} ∫_{X×Y} 1[h(w;x) ≈ y] dP(x, y)

Now, common practice is to replace the indicator with a smooth loss. . .
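For example (a standard surrogate choice, not taken from these slides), in the binary classification setting the 0–1 indicator can be replaced by the smooth logistic loss, turning the maximization above into the minimization

\min_{w \in \mathbb{R}^d} \; \int_{X \times Y} \log\!\left(1 + \exp\!\big(-y\, h(w;x)\big)\right) dP(x,y),

and since log(1 + e^{−z})/log 2 ≥ 1[z ≤ 0] for every z, a small value of this surrogate also bounds the expected misclassification rate from above.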

Page 7: Optimization Methods for Large-Scale Machine Learning

Stochastic optimization

Over a parameter vector w ∈ R^d and given

ℓ(·; y) (loss w.r.t. the “true label” y) and h(w;x) (prediction w.r.t. the “features” x),

consider the unconstrained optimization problem

min_{w ∈ R^d} f(w), where f(w) = E_{(x,y)}[ℓ(h(w;x), y)].

Given training set {(x_i, y_i)}_{i=1}^{n}, approximate problem given by

min_{w ∈ R^d} f_n(w), where f_n(w) = (1/n) Σ_{i=1}^{n} ℓ(h(w;x_i), y_i).
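A minimal Python sketch of the empirical risk f_n as a sample average (the per-sample loss callback and the data arrays are assumptions for illustration, not from the slides):

import numpy as np

def empirical_risk(w, X, Y, loss):
    """f_n(w): average of the per-sample losses loss(w, x_i, y_i) over the training set."""
    n = len(Y)
    return sum(loss(w, X[i], Y[i]) for i in range(n)) / n

# Example with a squared-error loss for a linear model h(w; x) = w^T x:
squared_loss = lambda w, x, y: 0.5 * (np.dot(w, x) - y) ** 2
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([1.0, 2.0, 3.0])
print(empirical_risk(np.zeros(2), X, Y, squared_loss))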

Page 8: Optimization Methods for Large-Scale Machine Learning

Text classification

[Slide shows the first page of Bottou, Curtis, and Nocedal, “Optimization Methods for Large-Scale Machine Learning,” SIAM Review, 60(2):223–311, 2018.]

Regularized logistic regression; the two classes shown on the slide are “math” and “poetry”:

min_{w ∈ R^d} (1/n) Σ_{i=1}^{n} log(1 + exp(−(w^T x_i) y_i)) + (λ/2) ‖w‖_2^2
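A minimal Python sketch of this objective and its gradient (the array names X, y and the regularization weight lam are assumptions for illustration, not from the slides):

import numpy as np

def logistic_objective(w, X, y, lam):
    """(1/n) * sum_i log(1 + exp(-(w^T x_i) y_i)) + (lam/2) * ||w||_2^2."""
    margins = y * (X @ w)                      # y_i * w^T x_i for each sample
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * np.dot(w, w)

def logistic_gradient(w, X, y, lam):
    """Gradient of logistic_objective with respect to w."""
    margins = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))      # = exp(-margin) / (1 + exp(-margin))
    return -(X.T @ (sigma * y)) / X.shape[0] + lam * w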

Page 9: Optimization Methods for Large-Scale Machine Learning

Image / speech recognition

What pixel combinations represent the number 4?

What sounds are these? (“Here comes the sun” – The Beatles)

Page 10: Optimization Methods for Large-Scale Machine Learning

Deep neural networks

h(w;x) = a_l(W_l(· · · a_2(W_2(a_1(W_1 x + ω_1)) + ω_2) · · · ))


Figure: Illustration of a DNN
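A minimal Python sketch of this feed-forward computation (the layer widths 5 → 4 → 4 → 3 follow the illustration; the tanh activations and the bias on the output layer are assumptions for illustration):

import numpy as np

def dnn_forward(x, weights, biases, activations):
    """h(w; x): alternate affine maps W_j h + omega_j with elementwise activations a_j."""
    h = np.asarray(x, dtype=float)
    for W, b, a in zip(weights, biases, activations):
        h = a(W @ h + b)
    return h

# Layer widths 5 -> 4 -> 4 -> 3, as in the figure above
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 5)), rng.standard_normal((4, 4)), rng.standard_normal((3, 4))]
biases = [np.zeros(4), np.zeros(4), np.zeros(3)]
activations = [np.tanh, np.tanh, lambda z: z]   # linear output layer (an assumption)
print(dnn_forward(rng.standard_normal(5), weights, biases, activations))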

Page 11: Optimization Methods for Large-Scale Machine Learning

Tradeoffs of large-scale learning

Bottou, Bousquet (2008) and Bottou (2010)

Notice that we went from our true problem

max_{h ∈ H} ∫_{X×Y} 1[h(x) ≈ y] dP(x, y)

to say that we’ll find our solution h ≡ h(w; ·) by (approximately) solving

min_{w ∈ R^d} (1/n) Σ_{i=1}^{n} ℓ(h(w;x_i), y_i).

Three sources of error:

- approximation

- estimation

- optimization

Page 12: Optimization Methods for Large-Scale Machine Learning

Approximation error

Choice of prediction function family H has important implications; e.g.,

H_C := {h ∈ H : Ω(h) ≤ C}.


Figure: Illustration of C and training time vs. misclassification rate

Page 13: Optimization Methods for Large-Scale Machine Learning

Problems of interest

Let’s focus on the expected loss/risk problem

min_{w ∈ R^d} f(w), where f(w) = E_{(x,y)}[ℓ(h(w;x), y)]

and the empirical loss/risk problem

min_{w ∈ R^d} f_n(w), where f_n(w) = (1/n) Σ_{i=1}^{n} ℓ(h(w;x_i), y_i).

For this talk, let’s assume

- f is continuously differentiable, bounded below, and potentially nonconvex;

- ∇f is L-Lipschitz continuous, i.e., ‖∇f(w) − ∇f(w̄)‖_2 ≤ L‖w − w̄‖_2 for all w, w̄ (its key consequence is sketched below).
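The standard consequence of the Lipschitz assumption used throughout the remaining slides (not displayed on this slide, but it is the first inequality in each proof below) is the quadratic upper bound

f(\bar{w}) \;\le\; f(w) + \nabla f(w)^T(\bar{w} - w) + \tfrac{L}{2}\,\|\bar{w} - w\|_2^2
\qquad \text{for all } w, \bar{w} \in \mathbb{R}^d .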

Page 14: Optimization Methods for Large-Scale Machine Learning

Gradient descent

Aim: Find a stationary point, i.e., w with ∇f(w) = 0.

Algorithm GD: Gradient Descent

1: choose an initial point w_0 ∈ R^d and stepsize α > 0
2: for k ∈ {0, 1, 2, . . .} do
3:   set w_{k+1} ← w_k − α ∇f(w_k)
4: end for

[Figure: f near the iterate w_k, shown between the quadratic models f(w_k) + ∇f(w_k)^T(w − w_k) + (L/2)‖w − w_k‖_2^2 (upper bound) and f(w_k) + ∇f(w_k)^T(w − w_k) + (c/2)‖w − w_k‖_2^2 (lower bound).]
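A minimal Python sketch of Algorithm GD with a fixed stepsize (the quadratic test function and the choice α = 1/L are illustrative assumptions, not from the slides):

import numpy as np

def gradient_descent(grad_f, w0, alpha, num_iters):
    """Algorithm GD: repeat w_{k+1} = w_k - alpha * grad f(w_k)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)
    return w

# Example: f(w) = 0.5 * w^T A w with A positive definite, so L is the largest eigenvalue of A.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
L = np.max(np.linalg.eigvalsh(A))
w_final = gradient_descent(lambda w: A @ w, w0=[5.0, -2.0], alpha=1.0 / L, num_iters=200)
print(w_final)   # close to the minimizer (0, 0)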

Page 18: Optimization Methods for Large-Scale Machine Learning

GD theory

Theorem GD

If α ∈ (0, 1/L], then Σ_{k=0}^{∞} ‖∇f(w_k)‖_2^2 < ∞, which implies ∇f(w_k) → 0.

If, in addition, f is c-strongly convex, then for all k ≥ 1:

f(w_k) − f_* ≤ (1 − αc)^k (f(w_0) − f_*).

Proof.

f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} − w_k) + (L/2)‖w_{k+1} − w_k‖_2^2

          = f(w_k) − α‖∇f(w_k)‖_2^2 + (α^2 L/2)‖∇f(w_k)‖_2^2     (substituting w_{k+1} − w_k = −α∇f(w_k))

          ≤ f(w_k) − (α/2)‖∇f(w_k)‖_2^2                          (since αL ≤ 1 by the stepsize choice)

          ≤ f(w_k) − αc(f(w_k) − f_*)                            (strong convexity gives ‖∇f(w_k)‖_2^2 ≥ 2c(f(w_k) − f_*))

=⇒ f(w_{k+1}) − f_* ≤ (1 − αc)(f(w_k) − f_*).

Page 19: Optimization Methods for Large-Scale Machine Learning

GD illustration

Figure: GD with fixed stepsize

Page 20: Optimization Methods for Large-Scale Machine Learning

Stochastic gradient method (SG)

Invented by Herbert Robbins and Sutton Monro in 1951.

Sutton Monro, former Lehigh faculty member

Page 21: Optimization Methods for Large-Scale Machine Learning

Stochastic gradient descent

Approximate gradient only; e.g., a random index i_k so that E[∇_w ℓ(h(w; x_{i_k}), y_{i_k}) | w] = ∇f(w).

Algorithm SG: Stochastic Gradient

1: choose an initial point w_0 ∈ R^d and stepsizes α_k > 0
2: for k ∈ {0, 1, 2, . . .} do
3:   set w_{k+1} ← w_k − α_k g_k, where g_k ≈ ∇f(w_k)
4: end for

Not a descent method! . . . but can guarantee eventual descent in expectation (with E_k[g_k] = ∇f(w_k)):

f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} − w_k) + (L/2)‖w_{k+1} − w_k‖_2^2

          = f(w_k) − α_k ∇f(w_k)^T g_k + (α_k^2 L/2)‖g_k‖_2^2

=⇒ E_k[f(w_{k+1})] ≤ f(w_k) − α_k ‖∇f(w_k)‖_2^2 + (α_k^2 L/2) E_k[‖g_k‖_2^2].

Markov process: w_{k+1} depends only on w_k and the random choice at iteration k.
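A minimal Python sketch of Algorithm SG applied to the empirical risk, drawing one sample index per iteration (the per-sample gradient callback, stepsize schedule, and data sizes are assumptions for illustration, not from the slides):

import numpy as np

def stochastic_gradient(grad_sample, n, w0, stepsize, num_iters, seed=0):
    """Algorithm SG: w_{k+1} = w_k - alpha_k * g_k, with g_k the gradient of one
    uniformly sampled per-sample loss, so that E_k[g_k] = grad f_n(w_k)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for k in range(num_iters):
        i = rng.integers(n)          # pick a training example uniformly at random
        w = w - stepsize(k) * grad_sample(w, i)
    return w

# Example usage with the logistic-regression gradient sketched earlier (one sample per step):
#   grad_sample = lambda w, i: logistic_gradient(w, X[i:i+1], y[i:i+1], lam)
#   w = stochastic_gradient(grad_sample, n=len(y), w0=np.zeros(d),
#                           stepsize=lambda k: 1.0 / (10 + k), num_iters=10_000)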

Page 22: Optimization Methods for Large-Scale Machine Learning

SG theory

Theorem SG

If E_k[‖g_k‖_2^2] ≤ M + ‖∇f(w_k)‖_2^2, then:

α_k = 1/L     =⇒  E[ (1/k) Σ_{j=1}^{k} ‖∇f(w_j)‖_2^2 ] ≤ M

α_k = O(1/k)  =⇒  E[ Σ_{j=1}^{k} α_j ‖∇f(w_j)‖_2^2 ] < ∞.

If, in addition, f is c-strongly convex, then:

α_k = 1/L     =⇒  E[f(w_k) − f_*] ≤ O( (αL)(M/c) / 2 )

α_k = O(1/k)  =⇒  E[f(w_k) − f_*] = O( (L/c)(M/c) / k ).

(*Assumed unbiased gradient estimates; see paper for more generality.)

Page 23: Optimization Methods for Large-Scale Machine Learning

Why O(1/k)?

Mathematically: Σ_{k=1}^{∞} α_k = ∞ while Σ_{k=1}^{∞} α_k^2 < ∞
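For example (a standard schedule, not taken from these slides), α_k = a/(b + k) with constants a, b > 0 satisfies both conditions:

\sum_{k=1}^{\infty} \frac{a}{b+k} = \infty
\qquad\text{while}\qquad
\sum_{k=1}^{\infty} \frac{a^2}{(b+k)^2} \;\le\; a^2 \sum_{k=1}^{\infty} \frac{1}{k^2} \;<\; \infty .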

Graphically (sequential version of constant stepsize result):

Page 24: Optimization Methods for Large-Scale Machine Learning

SG illustration

Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right)

Page 25: Optimization Methods for Large-Scale Machine Learning

Outline

GD and SG

GD vs. SG

Beyond SG

Noise Reduction Methods

Second-Order Methods

Conclusion

Page 26: Optimization Methods for Large-Scale Machine Learning

Why SG over GD for large-scale machine learning?

GD: E[f_n(w_k) − f_{n,*}] = O(ρ^k)   (linear convergence)

SG: E[f_n(w_k) − f_{n,*}] = O(1/k)   (sublinear convergence)

So why SG?

Motivation     Explanation
Intuitive      data “redundancy”
Empirical      SG vs. L-BFGS with batch gradient (see figure below)
Theoretical    E[f_n(w_k) − f_{n,*}] = O(1/k) and E[f(w_k) − f_*] = O(1/k)

[Figure: empirical risk vs. number of accessed data points for SGD and L-BFGS.]

Page 27: Optimization Methods for Large-Scale Machine Learning

Work complexity

Time, not data, as limiting factor; Bottou, Bousquet (2008) and Bottou (2010).

      Convergence rate                     Time per iteration    Time for ε-optimality
GD:   E[f_n(w_k) − f_{n,*}] = O(ρ^k)       O(n)                  n log(1/ε)
SG:   E[f_n(w_k) − f_{n,*}] = O(1/k)       O(1)                  1/ε

Considering total (estimation + optimization) error as

E = E[f(w_n) − f(w_*)] + E[f(w̃_n) − f(w_n)] ∼ 1/n + ε

(where w_n minimizes the empirical risk over n samples and w̃_n is the approximate solution actually computed) and a time budget T, one finds:

- SG: Process as many samples as possible (n ∼ T), leading to E ∼ 1/T.

- GD: With n ∼ T / log(1/ε), minimizing E yields ε ∼ 1/T and E ∼ log(T)/T + 1/T.

Page 28: Optimization Methods for Large-Scale Machine Learning

Outline

GD and SG

GD vs. SG

Beyond SG

Noise Reduction Methods

Second-Order Methods

Conclusion

Page 29: Optimization Methods for Large-Scale Machine Learning

End of the story?

SG is great! Let’s keep proving how great it is!

- SG is “stable with respect to inputs”

- SG avoids “steep minima”

- SG avoids “saddle points”

- . . . (many more)

No, we should want more. . .

- SG requires a lot of “hyperparameter” tuning

- Sublinear convergence is not satisfactory

- . . . “linearly” convergent method eventually wins

- . . . with higher budget, faster computation, parallel?, distributed?

Also, any “gradient”-based method is not scale invariant.
