
Page 1: The Power of Stagewise Learning - University of Iowa (homepage.cs.uiowa.edu/~tyng/stagewise-learning-talk.pdf)

The Power of Stagewise Learning
From Support Vector Machine to Generative Adversarial Nets

Tianbao Yang

Department of Computer Science
The University of Iowa

Yang (CS@Uiowa) The Power of Stagewise Learning 1 / 44

Page 2:

Introduction

Outline

1 Introduction

2 Stagewise Learning for Convex Problems

3 Stagewise Learning for Non-Convex Problems

Page 3:

Introduction

Gap between Practice and Theory

[Figure: error vs. number of epochs for SGD with step sizes 0.1, 0.01, 0.001, and 1/√t]

Page 4:

Introduction

Gap between Practice and Theory

[Figure: two panels of error vs. number of epochs for SGD with step sizes 0.1, 0.01, 0.001, and 1/√t]

Q1: Why does Stagewise Learning (SL) converge faster?

Q2: How can we design better SL algorithms for DNNs and other problems?

Page 7:

Introduction

The Evolution of Learning Methods

Time

Complexity of Learning

SVM, logistic regression, ...

Cortes & Vapnik (1995)

Convex Problems

Deep Neural Networks (DNN)

Krizhevsky et al. (2012)

Generative Adversarial Nets (GAN)

Goodfellow et al. (2014)

Non-Convex Problems

Page 8:

Introduction

Big Data: Challenges and Opportunities

AlexNet for Image Classification (Krizhevsky et al., 2012): 1.2 million images

BigGAN for Image Generation (Brock et al., 2019): Google's JFT, 300 million images

Training on huge datasets becomes a bottleneck!

Page 9:

Introduction

Learning a Predictive Model

feature x ∈ R^d, label y

(x, y) is generated i.i.d.

predictive model: f(x) → y

Page 10:

Introduction

Risk Minimization

f* = argmin_{f ∈ F} R(f) := E_{x,y}[ℓ(f(x), y)]   (risk of model f)

F is a hypothesis class

loss function ℓ(z, y): measures the prediction error

Page 12:

Introduction

Empirical Risk Minimization

Empirical Risk Minimization (Offline Learning)

Collect {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}
Find an approximate solution f to solve

f*_n = argmin_{f ∈ F} R_n(f) := (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)   (empirical risk)
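The empirical risk is just an average of per-example losses; as a minimal sketch (the linear predictor, the squared loss, and the tiny dataset below are illustrative choices, not from the talk):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    # R_n(f) = (1/n) * sum_i loss(f(x_i), y_i)
    return float(np.mean([loss(f(xi), yi) for xi, yi in zip(X, y)]))

# Illustrative setup: a linear predictor under the squared loss
squared_loss = lambda z, y: (z - y) ** 2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 0.0])
w = np.array([1.0, -1.0])
risk = empirical_risk(lambda x: w @ x, X, y, squared_loss)  # → 0.0, since w fits exactly
```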

Page 13:

Introduction

Research Questions in Machine Learning

Iterative Algorithms:

f_{t+1} ← f_t + A(available information at iteration t)

T: total time for training (e.g., # of iterations)

1. How fast is learning? Faster Training: Training Error

2. How accurate is the learned model? Better Generalization: Testing Error

Page 14:

Introduction

Shallow Model: Convex Methods

x → f_w(x) = wᵀφ(x)

min_{w ∈ R^D} (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)

Page 15:

Introduction

Deep Neural Networks: Non-Convex Methods

x → f_w(x) = w_L ∘ σ(· · · σ(w_3 ∘ σ(w_2 ∘ σ(w_1 ∘ x))))

min_{w ∈ R^D} (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)

Page 16:

Stagewise Learning for Convex Problems

Outline

1 Introduction

2 Stagewise Learning for Convex Problems

3 Stagewise Learning for Non-Convex Problems

Page 17:

Stagewise Learning for Convex Problems

Convex Methods

Consider

min_{w ∈ R^d} F(w) ≜ E_{x,y∼P}[ℓ(w; x, y)]

F(w) is a convex function

SVM, Logistic regression, Least-squares, LASSO, etc.

Goal: For a sufficiently small ε > 0, find a solution w such that

F(w) − min_{w ∈ R^d} F(w) ≤ ε

Page 18:

Stagewise Learning for Convex Problems

Stochastic Gradient Descent (Robbins & Monro, 1951)

min_{w ∈ R^d} F(w) ≜ E_{x,y∼P}[ℓ(w; x, y)]

Stochastic Gradient Descent (SGD) Method
Sample (x_t, y_t) ∼ P
w_{t+1} = w_t − η_t ∂ℓ(w_t; x_t, y_t)   (η_t: step size)
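In code, the SGD update is a few lines; a minimal NumPy sketch on a toy streaming least-squares problem (the data model, the ground-truth vector, and the η_t ∝ 1/√t schedule are illustrative assumptions, not the talk's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])  # illustrative ground truth

def sample():
    # Draw (x_t, y_t) ~ P: here a noisy linear model
    x = rng.normal(size=2)
    return x, w_true @ x + 0.01 * rng.normal()

def sgd(num_iters=5000):
    w = np.zeros(2)
    for t in range(1, num_iters + 1):
        x, y = sample()
        eta = 0.1 / np.sqrt(t)        # decreasing step size, eta_t ∝ 1/sqrt(t)
        grad = (w @ x - y) * x        # stochastic gradient of 0.5*(w·x − y)^2
        w = w - eta * grad
    return w
```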

Page 20:

Stagewise Learning for Convex Problems

Slow Convergence of SGD

1 variance of stochastic gradient

2 decreasing step size: standard theory η_t ∝ 1/√t

3 O(1/ε²) iteration complexity (Nemirovski et al., 2009)

Page 22:

Stagewise Learning for Convex Problems

How to improve the Convergence speed?

Previous approaches

Mini-batch SGD: sample multiple examples each iteration
• Pros: allows parallel speed-up
• Cons: cannot reduce the total time complexity

Making stronger assumptions, e.g., strong convexity, smoothness, using full gradients
• Pros: speed-up for some families of problems
• Cons: the assumptions may not hold

Can we do better without imposing these strong assumptions? (ICML 2017)

Page 23:

Stagewise Learning for Convex Problems

Stagewise Stochastic Gradient (ICML 2017)

One-Stage SGD(w_1, η, D, T)

for τ = 1, . . . , T
  w_{τ+1} = Proj_{‖w − w_1‖₂ ≤ D}[w_τ − η ∂ℓ(w_τ; x_{i_τ}, y_{i_τ})]   (projection onto a ball around w_1)

Output: w̄ = Σ_{τ=1}^T w_τ / T

Page 25:

Stagewise Learning for Convex Problems

Stagewise Stochastic Gradient (ICML 2017)

SSGD: Set η_1, D_1 and T
for k = 0, . . . , K − 1 do
  w_{k+1} = One-Stage SGD(w_k, η_k, D_k, T)
  Set η_{k+1} = η_k/2, D_{k+1} = D_k/2
end for
Output: w_K

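A runnable sketch of SSGD on a one-dimensional "sharp" objective F(w) = E|w − x| (the toy distribution x ∼ N(1, 0.1), the stage parameters, and the function names are illustrative; the talk states the algorithm for general convex losses):

```python
import numpy as np

rng = np.random.default_rng(0)

def stoch_subgrad(w):
    # Stochastic subgradient of F(w) = E|w - x| with x ~ N(1, 0.1); minimizer w* = 1
    return np.sign(w - rng.normal(1.0, 0.1))

def one_stage_sgd(w1, eta, D, T):
    # Projected SGD around w1; returns the stage average
    w, total = w1, 0.0
    for _ in range(T):
        w = w - eta * stoch_subgrad(w)
        w = np.clip(w, w1 - D, w1 + D)  # projection onto {|w - w1| <= D}
        total += w
    return total / T

def ssgd(w0, eta1=0.5, D1=1.0, T=500, K=8):
    w, eta, D = w0, eta1, D1
    for _ in range(K):
        w = one_stage_sgd(w, eta, D, T)
        eta, D = eta / 2, D / 2  # halve step size and radius each stage
    return w
```

Restarting each stage from the previous stage's average while halving both η and D is the mechanism behind the improved complexity discussed on the next slide.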

Page 26:

Stagewise Learning for Convex Problems

Theoretical Result: Faster Convergence

Growth/Sharpness Condition: ∃ c < ∞ and θ ∈ (0, 1] such that:

‖w − w*‖₂ ≤ c (F(w) − F(w*))^θ

[Figure: example objectives F(x) = |x| (θ = 1), |x|^1.5 (θ = 2/3), and |x|² (θ = 1/2)]

SSGD: O(1/ε^{2(1−θ)} · log(1/ε)) vs. SGD: O(1/ε²)

Page 28:

Stagewise Learning for Convex Problems

Machine Learning Problems Satisfying the Growth Condition (GC)

SVM for high-dimensional data: θ = 1

min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖w‖₁

O(log(1/ε)) vs O(1/ε²)

LASSO: θ = 1/2

min_w (1/n) Σ_{i=1}^n (y_i − wᵀx_i)² + λ‖w‖₁

O(1/ε) vs O(1/ε²)

and many, many more

Page 29:

Stagewise Learning for Convex Problems

Empirical Results: SSGD vs SGD

[Figure: log10(objective gap) vs. number of iterations for SGD and SSGD on four problems: robust loss + ℓ₁ norm on million songs (n = 463,715) and E2006-tfidf (n = 16,087); hinge loss + ℓ₁ norm on covtype (n = 581,012) and real-sim (n = 72,309)]

Page 30:

Stagewise Learning for Convex Problems

From Balanced Data to Imbalanced Data

Minimizing the Error Rate is not a Good Idea!

Page 31:

Stagewise Learning for Convex Problems

Optimization of Suitable Measures for Imbalanced Data

maximize AUC (area under ROC curve):

AUC = Prob.(score of + > score of −)

maximize F-measure:

F = (2 · Precision · Recall) / (Precision + Recall)

etc.

These measures are non-decomposable over individual examples.
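Both measures couple predictions across examples, which is what makes them hard to optimize stochastically; as a reference point, here is a direct (non-stochastic) computation of each definition (function names and tie handling are our choices, not from the talk):

```python
import numpy as np

def auc(scores_pos, scores_neg):
    # AUC = Prob.(score of a random positive > score of a random negative); ties count 1/2
    sp = np.asarray(scores_pos)[:, None]
    sn = np.asarray(scores_neg)[None, :]
    return float(np.mean((sp > sn) + 0.5 * (sp == sn)))

def f_measure(y_true, y_pred):
    # F = 2 * Precision * Recall / (Precision + Recall), with labels in {1, 0/-1}
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    if tp == 0:
        return 0.0
    precision = tp / np.sum(y_pred == 1)
    recall = tp / np.sum(y_true == 1)
    return float(2 * precision * recall / (precision + recall))
```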

Page 32:

Stagewise Learning for Convex Problems

Our Contributions

Our Approaches: low memory, low computation, low iteration complexity.

1 Fast Stochastic/Online AUC Maximization (ICML 2018)
• based on a zero-sum convex-concave game formulation
• stagewise learning for solving a convex-concave game
• improves complexity from O(1/ε²) to O(1/ε)

2 Fast Stochastic/Online F-measure Maximization (NeurIPS 2018)
• decomposes into two tasks: learning a posterior probability and learning a threshold
• stagewise learning for optimizing the threshold faster
• improves complexity from O(1/ε²) to O(1/ε)

Page 33:

Stagewise Learning for Convex Problems

Experiments: Stochastic AUC Maximization

[Figure: AUC vs. number of iterations for FSAUC, SOLAM, SOLAM_l1, and OPAUC on three datasets: w8a (p = 2.97%), ijcnn1 (p = 9.49%), and real-sim (p = 30.68%)]

Page 34:

Stagewise Learning for Convex Problems

Experiments: Stochastic F-measure Optimization

Page 35:

Stagewise Learning for Non-Convex Problems

Outline

1 Introduction

2 Stagewise Learning for Convex Problems

3 Stagewise Learning for Non-Convex Problems

Page 36:

Stagewise Learning for Non-Convex Problems

Generative Adversarial Nets (GAN)

Generative Modeling (Density Estimation)

Sample Generation

slides courtesy of Ian Goodfellow NIPS 2016 tutorial

Page 37:

Stagewise Learning for Non-Convex Problems

Generative Adversarial Nets

Page 38:

Stagewise Learning for Non-Convex Problems

Formulation of Generative Adversarial Nets

Zero-Sum Game Formulation (Goodfellow et al. (2014))

min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p_random}[log(1 − D(G(z)))]

z: random noise; x: real data

G generator: G(z) is a fake image

D discriminator: D(x) is the probability of x being a real image

Ideally, at a Nash equilibrium: p(G(z)) = p(x)

Page 40:

Stagewise Learning for Non-Convex Problems

Non-Convex Non-Concave Games

min_w max_u F(w, u) := E_{x∼p_data, z∼p_random}[ℓ(w, u; x, z)]

Non-convex w.r.t. w, non-concave w.r.t. u

Finding a Nash equilibrium is NP-hard

Existing studies are mostly heuristics carried over from convex-concave games

No convergence guarantee (could be divergent)

First Convergence Theory for Finding Nearly Stationary Points
Stationary point: ∇F(w*, u*) = 0

(Lin-Liu-Rafique-Y., arXiv 2018, presented at NeurIPS 2018 SGO&ML)

Page 41:

Stagewise Learning for Non-Convex Problems

A Stagewise Learning Algorithm for GAN

for k = 0, . . . , K − 1 do
  F_k(w, u) = F(w, u) + (λ/2)‖w − w_k‖² − (λ/2)‖u − u_k‖²
  (w_{k+1}, u_{k+1}) = A(F_k, w_k, u_k, η_k, T_k)
end for

Each stage approximately solves the proximally regularized game; at a fixed point (w*, u*):

(w*, u*) = argmin_w max_u F(w, u) + (λ/2)‖w − w*‖² − (λ/2)‖u − u*‖²

1 λ > 0 is an algorithmic regularization parameter

2 Different choices of A (e.g., Primal-Dual SGD, Adam)

3 Analysis uses variational inequalities
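To see why the proximal terms matter, here is a minimal sketch of the stagewise scheme with plain gradient descent-ascent as the inner solver A, run on the classic bilinear toy game F(w, u) = w·u, where unregularized simultaneous updates spiral outward but the stagewise (inexact proximal point) version contracts to the stationary point (0, 0). All names and constants are illustrative, not from the paper:

```python
def gda(grad_w, grad_u, w, u, eta, T):
    # Inner solver A: T steps of simultaneous gradient descent-ascent on F_k
    for _ in range(T):
        gw, gu = grad_w(w, u), grad_u(w, u)
        w, u = w - eta * gw, u + eta * gu
    return w, u

def stagewise_minmax(Fw, Fu, w, u, lam=1.0, K=30, T=200, eta=0.05):
    for _ in range(K):
        wk, uk = w, u
        # Gradients of F_k(w, u) = F(w, u) + (lam/2)(w - wk)^2 - (lam/2)(u - uk)^2
        gw = lambda w_, u_, wk=wk: Fw(w_, u_) + lam * (w_ - wk)
        gu = lambda w_, u_, uk=uk: Fu(w_, u_) - lam * (u_ - uk)
        w, u = gda(gw, gu, w, u, eta, T)
    return w, u

# Bilinear game F(w, u) = w*u: grad_w F = u, grad_u F = w
w_star, u_star = stagewise_minmax(lambda w, u: u, lambda w, u: w, 1.0, 1.0)
```

Each inner problem is strongly-convex-strongly-concave thanks to the λ terms, so the inner GDA converges, and the outer sequence contracts toward the stationary point.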

Page 42:

Stagewise Learning for Non-Convex Problems

Experiments: Image Generation

[Figure: Inception Score vs. number of epochs for WGAN training on CIFAR10 and LSUN Bedroom data, comparing Adam(5d), Adam, and IPP-Adam (generalization performance)]

Page 44:

Stagewise Learning for Non-Convex Problems

Why Does Stagewise Learning Improve Testing Error?

Optimizing Deep Neural Networks

min_{w ∈ R^d} F(w) ≜ (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)   (non-convex)

[Figure: error vs. number of epochs for SGD with step sizes 0.1, 0.01, 0.001, and 1/√t]

Page 45:

Stagewise Learning for Non-Convex Problems

A Stagewise Learning Algorithm for DNN

for k = 0, . . . , K − 1 do
  F_k(w) = F(w) + (λ/2)‖w − w_k‖²
  w_{k+1} = SGD(F_k, w_k, η_k, T_k)
  η_{k+1} = η_k/2
end for

1 λ ≥ 0: algorithmic regularization

2 step size decreases geometrically across stages

3 a more general framework

4 convergence to a stationary point established in ICLR 2019
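A runnable one-dimensional sketch of this loop (full-batch gradient descent stands in for the inner SGD call, and the objective F(w) = (w² − 1)² is an illustrative non-convex stand-in for a DNN loss; all names and constants are ours):

```python
def grad_F(w):
    # Gradient of the toy non-convex objective F(w) = (w^2 - 1)^2
    return 4.0 * w * (w * w - 1.0)

def inner_sgd(grad_Fk, w, eta, T):
    # Inner solver (deterministic gradient descent as a stand-in for SGD)
    for _ in range(T):
        w = w - eta * grad_Fk(w)
    return w

def stagewise_sgd(w0, lam=0.1, eta0=0.1, T=200, K=6):
    w, eta = w0, eta0
    for _ in range(K):
        wk = w
        # F_k(w) = F(w) + (lam/2) * (w - wk)^2
        grad_Fk = lambda w_, wk=wk: grad_F(w_) + lam * (w_ - wk)
        w = inner_sgd(grad_Fk, w, eta, T)
        eta = eta / 2  # geometric step-size decrease across stages
    return w
```

At convergence w ≈ w_k, so the proximal term vanishes and the output is a near-stationary point of F itself.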

Page 46:

Stagewise Learning for Non-Convex Problems

Why Does Stagewise Learning Improve Testing Error?

1 Exploit the Growth Condition:

μ‖w − w*‖₂² ≤ F(w) − F(w*)

2 Exploit an Almost-Convexity Condition:

(−∇F(w)ᵀ(w* − w)) / (F(w) − F(w*)) ≥ γ > 0

3 Use stability for the generalization analysis

Testing Error = Training Error + Generalization Error

[Figure: flat minimum; training curve over 200 epochs with estimated μ = 0.00027]

Page 47:

Stagewise Learning for Non-Convex Problems

Why Does Stagewise Learning Improve Testing Error?

Faster Convergence of Training Error: O(1/(με)) vs O(1/(μ²ε))

With the same number of iterations T:

Testing Error Comparison

O(1/(√n μ^{1/2})) (stagewise step size) vs. O(1/(√n μ^{3/2})) (decreasing step size)

First Theory Explaining Stagewise Learning for DNNs

(Y.-Yan-Yuan-Jin, arXiv 2018)

Page 49:

Stagewise Learning for Non-Convex Problems

Better Stagewise Learning Algorithms

[Figure: test error vs. number of epochs for stagewise variants V1, V2, and V3 on two datasets]

V1: standard, no alg. regularization, restart at last solution

V2: algorithmic regularization, restart at last solution

V3: algorithmic regularization, restart at averaged solution

Page 50:

Stagewise Learning for Non-Convex Problems

Conclusions

Stagewise Learning is Powerful

Theory: faster convergence for both convex and non-convex methods

Practice: SVM, AUC, F-measure, DNN, GAN

Open problems: e.g., Generalization of SL for GAN

Page 51:

Stagewise Learning for Non-Convex Problems

Acknowledgements

Students: Yi Xu, Mingrui Liu, Xiaoxuan Zhang, Zhuoning Yuan, Yan Yan, Hassan Rafique

Collaborators: Qihang Lin (UIowa), Rong Jin (Alibaba Group)

Funding Agency: NSF (CRII, Big Data)

Page 52:

Stagewise Learning for Non-Convex Problems

Thank You!

Questions?

Page 53:

Stagewise Learning for Non-Convex Problems

References I

Brock, Andrew, Donahue, Jeff, and Simonyan, Karen. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273–297, September 1995. ISSN 0885-6125.

Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp. 2672–2680, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1106–1114, 2012.

Page 54:

Stagewise Learning for Non-Convex Problems

References II

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
