
Page 1: The Power of Stagewise Learning - University of Iowa (homepage.cs.uiowa.edu/~tyng/stagewise-learning-talk.pdf)

The Power of Stagewise Learning
From Support Vector Machine to Generative Adversarial Nets

Tianbao Yang

Department of Computer Science
The University of Iowa

Yang (CS@Uiowa) The Power of Stagewise Learning 1 / 44

Page 2:

Introduction

Outline

1 Introduction

2 Stagewise Learning for Convex Problems

3 Stagewise Learning for Non-Convex Problems

Page 3:

Introduction

Gap between Practice and Theory

[Figure: error vs. number of epochs for SGD with step sizes 0.1, 0.01, 0.001, and 1/√t]

Page 4:

Introduction

Gap between Practice and Theory

[Figure: two panels of error vs. number of epochs for SGD with step sizes 0.1, 0.01, 0.001, and 1/√t]

Q1: Why does Stagewise Learning (SL) converge faster?

Q2: How can we design better SL algorithms for DNNs and other problems?

Page 7:

Introduction

The Evolution of Learning Methods

Time

Complexity of Learning

SVM, logistic regression, ...

Cortes & Vapnik (1995)

Convex Problems

Deep Neural Networks (DNN)

Krizhevsky et al. (2012)

Generative Adversarial Nets (GAN)

Goodfellow et al. (2014)

Non-Convex Problems

Page 8:

Introduction

Big Data: Challenges and Opportunities

AlexNet for Image Classification (Krizhevsky et al., 2012): 1.2 million images

BigGAN for Image Generation (Brock et al., 2019): Google's JFT, 300 million images

Training on huge datasets becomes a bottleneck!

Page 9:

Introduction

Learning a Predictive Model

feature x ∈ R^d, label y

(x, y) is generated i.i.d.

predictive model: f(x) → y

Page 10:

Introduction

Risk Minimization

f* = argmin_{f ∈ F} R(f) := E_{x,y}[ℓ(f(x), y)]   (risk of model f)

F is a hypothesis class

loss function ℓ(z, y): measures the prediction error

Page 12:

Introduction

Empirical Risk Minimization

Empirical Risk Minimization (Offline Learning)

Collect {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}
Find an approximate solution f to solve

f*_n = argmin_{f ∈ F} R_n(f) := (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)   (empirical risk)
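The empirical risk is just an average of per-example losses; as a minimal sketch (the linear predictor, the squared loss, and the tiny dataset below are illustrative choices, not from the talk):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    # R_n(f) = (1/n) * sum_i loss(f(x_i), y_i)
    return float(np.mean([loss(f(xi), yi) for xi, yi in zip(X, y)]))

# Illustrative setup: a linear predictor under the squared loss
squared_loss = lambda z, y: (z - y) ** 2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 0.0])
w = np.array([1.0, -1.0])
risk = empirical_risk(lambda x: w @ x, X, y, squared_loss)  # → 0.0, since w fits exactly
```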

Page 13:

Introduction

Research Questions in Machine Learning

Iterative Algorithms:

f_{t+1} ← f_t + A(available information at iteration t)

T: total time for training (e.g., # of iterations)

1. How fast is learning? Faster Training: Training Error

2. How accurate is the learned model? Better Generalization: Testing Error

Page 14:

Introduction

Shallow Model: Convex Methods

x → f_w(x) = wᵀφ(x)

min_{w ∈ R^D} (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)

Page 15:

Introduction

Deep Neural Networks: Non-Convex Methods

x → f_w(x) = w_L ∘ σ(· · · σ(w_3 ∘ σ(w_2 ∘ σ(w_1 ∘ x))))

min_{w ∈ R^D} (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)

Page 16:

Stagewise Learning for Convex Problems

Outline

1 Introduction

2 Stagewise Learning for Convex Problems

3 Stagewise Learning for Non-Convex Problems

Page 17:

Stagewise Learning for Convex Problems

Convex Methods

Consider

min_{w ∈ R^d} F(w) ≜ E_{x,y∼P}[ℓ(w; x, y)]

F(w) is a convex function

SVM, Logistic regression, Least-squares, LASSO, etc.

Goal: For a sufficiently small ε > 0, find a solution w such that

F(w) − min_{w ∈ R^d} F(w) ≤ ε

Page 18:

Stagewise Learning for Convex Problems

Stochastic Gradient Descent (Robbins & Monro, 1951)

min_{w ∈ R^d} F(w) ≜ E_{x,y∼P}[ℓ(w; x, y)]

Stochastic Gradient Descent (SGD) Method
Sample (x_t, y_t) ∼ P
w_{t+1} = w_t − η_t ∂ℓ(w_t; x_t, y_t)   (η_t: step size)
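In code, the SGD update is a few lines; a minimal NumPy sketch on a toy streaming least-squares problem (the data model, the ground-truth vector, and the η_t ∝ 1/√t schedule are illustrative assumptions, not the talk's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])  # illustrative ground truth

def sample():
    # Draw (x_t, y_t) ~ P: here a noisy linear model
    x = rng.normal(size=2)
    return x, w_true @ x + 0.01 * rng.normal()

def sgd(num_iters=5000):
    w = np.zeros(2)
    for t in range(1, num_iters + 1):
        x, y = sample()
        eta = 0.1 / np.sqrt(t)        # decreasing step size, eta_t ∝ 1/sqrt(t)
        grad = (w @ x - y) * x        # stochastic gradient of 0.5*(w·x − y)^2
        w = w - eta * grad
    return w
```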

Page 20:

Stagewise Learning for Convex Problems

Slow Convergence of SGD

1 variance of stochastic gradient

2 decreasing step size: standard theory η_t ∝ 1/√t

3 O(1/ε²) iteration complexity (Nemirovski et al., 2009)

Page 22:

Stagewise Learning for Convex Problems

How to improve the Convergence speed?

Previous approaches

Mini-batch SGD: sample multiple examples each iteration
• Pros: allows parallel speed-up
• Cons: cannot reduce the total time complexity

Making stronger assumptions, e.g., strong convexity, smoothness, using full gradients
• Pros: speed-up for some families of problems
• Cons: the assumptions may not hold

Can we do better without imposing these strong assumptions? (ICML 2017)

Page 23:

Stagewise Learning for Convex Problems

Stagewise Stochastic Gradient (ICML 2017)

One-Stage SGD(w_1, η, D, T)

for τ = 1, . . . , T
  w_{τ+1} = Proj_{‖w − w_1‖₂ ≤ D}[w_τ − η ∂ℓ(w_τ; x_{i_τ}, y_{i_τ})]   (projection onto a ball around w_1)

Output: w̄ = Σ_{τ=1}^T w_τ / T

Page 25:

Stagewise Learning for Convex Problems

Stagewise Stochastic Gradient (ICML 2017)

SSGD: Set η_1, D_1 and T
for k = 0, . . . , K − 1 do
  w_{k+1} = One-Stage SGD(w_k, η_k, D_k, T)
  Set η_{k+1} = η_k/2, D_{k+1} = D_k/2
end for
Output: w_K

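A runnable sketch of SSGD on a one-dimensional "sharp" objective F(w) = E|w − x| (the toy distribution x ∼ N(1, 0.1), the stage parameters, and the function names are illustrative; the talk states the algorithm for general convex losses):

```python
import numpy as np

rng = np.random.default_rng(0)

def stoch_subgrad(w):
    # Stochastic subgradient of F(w) = E|w - x| with x ~ N(1, 0.1); minimizer w* = 1
    return np.sign(w - rng.normal(1.0, 0.1))

def one_stage_sgd(w1, eta, D, T):
    # Projected SGD around w1; returns the stage average
    w, total = w1, 0.0
    for _ in range(T):
        w = w - eta * stoch_subgrad(w)
        w = np.clip(w, w1 - D, w1 + D)  # projection onto {|w - w1| <= D}
        total += w
    return total / T

def ssgd(w0, eta1=0.5, D1=1.0, T=500, K=8):
    w, eta, D = w0, eta1, D1
    for _ in range(K):
        w = one_stage_sgd(w, eta, D, T)
        eta, D = eta / 2, D / 2  # halve step size and radius each stage
    return w
```

Restarting each stage from the previous stage's average while halving both η and D is the mechanism behind the improved complexity discussed on the next slide.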

Page 26:

Stagewise Learning for Convex Problems

Theoretical Result: Faster Convergence

Growth/Sharpness Condition: ∃ c < ∞ and θ ∈ (0, 1] such that:

‖w − w*‖₂ ≤ c (F(w) − F(w*))^θ

[Figure: example objectives F(x) = |x| (θ = 1), |x|^1.5 (θ = 2/3), and |x|² (θ = 1/2)]

SSGD: O(1/ε^{2(1−θ)} · log(1/ε)) vs. SGD: O(1/ε²)

Page 28:

Stagewise Learning for Convex Problems

Machine Learning Problems Satisfying the Growth Condition (GC)

SVM for high-dimensional data: θ = 1

min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖w‖₁

O(log(1/ε)) vs O(1/ε²)

LASSO: θ = 1/2

min_w (1/n) Σ_{i=1}^n (y_i − wᵀx_i)² + λ‖w‖₁

O(1/ε) vs O(1/ε²)

and many, many more

Page 29:

Stagewise Learning for Convex Problems

Empirical Results: SSGD vs SGD

[Figure: log10(objective gap) vs. number of iterations for SGD and SSGD on four problems: robust loss + ℓ₁ norm on million songs (n = 463,715) and E2006-tfidf (n = 16,087); hinge loss + ℓ₁ norm on covtype (n = 581,012) and real-sim (n = 72,309)]

Page 30:

Stagewise Learning for Convex Problems

From Balanced Data to Imbalanced Data

Minimizing the Error Rate is not a Good Idea!

Page 31:

Stagewise Learning for Convex Problems

Optimization of Suitable Measures for Imbalanced Data

maximize AUC (area under ROC curve):

AUC = Prob.(score of + > score of −)

maximize F-measure:

F = (2 · Precision · Recall) / (Precision + Recall)

etc.

These measures are non-decomposable over individual examples.
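Both measures couple predictions across examples, which is what makes them hard to optimize stochastically; as a reference point, here is a direct (non-stochastic) computation of each definition (function names and tie handling are our choices, not from the talk):

```python
import numpy as np

def auc(scores_pos, scores_neg):
    # AUC = Prob.(score of a random positive > score of a random negative); ties count 1/2
    sp = np.asarray(scores_pos)[:, None]
    sn = np.asarray(scores_neg)[None, :]
    return float(np.mean((sp > sn) + 0.5 * (sp == sn)))

def f_measure(y_true, y_pred):
    # F = 2 * Precision * Recall / (Precision + Recall), with labels in {1, 0/-1}
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    if tp == 0:
        return 0.0
    precision = tp / np.sum(y_pred == 1)
    recall = tp / np.sum(y_true == 1)
    return float(2 * precision * recall / (precision + recall))
```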

Page 32:

Stagewise Learning for Convex Problems

Our Contributions

Our Approaches: low memory, low computation, low iteration complexity.

1 Fast Stochastic/Online AUC Maximization (ICML 2018)
• based on a zero-sum convex-concave game formulation
• stagewise learning for solving a convex-concave game
• improves complexity from O(1/ε²) to O(1/ε)

2 Fast Stochastic/Online F-measure Maximization (NeurIPS 2018)
• decomposes into two tasks: learning a posterior probability and learning a threshold
• stagewise learning for optimizing the threshold faster
• improves complexity from O(1/ε²) to O(1/ε)

Page 33:

Stagewise Learning for Convex Problems

Experiments: Stochastic AUC Maximization

[Figure: AUC vs. number of iterations for FSAUC, SOLAM, SOLAM_l1, and OPAUC on three datasets: w8a (p = 2.97%), ijcnn1 (p = 9.49%), and real-sim (p = 30.68%)]

Page 34:

Stagewise Learning for Convex Problems

Experiments: Stochastic F-measure Optimization

Page 35:

Stagewise Learning for Non-Convex Problems

Outline

1 Introduction

2 Stagewise Learning for Convex Problems

3 Stagewise Learning for Non-Convex Problems

Page 36:

Stagewise Learning for Non-Convex Problems

Generative Adversarial Nets (GAN)

Generative Modeling (Density Estimation)

Sample Generation

slides courtesy of Ian Goodfellow NIPS 2016 tutorial

Page 37:

Stagewise Learning for Non-Convex Problems

Generative Adversarial Nets

Page 38:

Stagewise Learning for Non-Convex Problems

Formulation of Generative Adversarial Nets

Zero-Sum Game Formulation (Goodfellow et al. (2014))

min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p_random}[log(1 − D(G(z)))]

z: random noise; x: real data

G generator: G(z) is a fake image

D discriminator: D(x) is the probability of x being a real image

Ideally, at a Nash equilibrium: p(G(z)) = p(x)

Page 40:

Stagewise Learning for Non-Convex Problems

Non-Convex Non-Concave Games

min_w max_u F(w, u) := E_{x∼p_data, z∼p_random}[ℓ(w, u; x, z)]

Non-convex w.r.t. w, non-concave w.r.t. u

Finding a Nash equilibrium is NP-hard

Existing studies are mostly heuristics carried over from convex-concave games

No convergence guarantee (could be divergent)

First Convergence Theory for Finding Nearly Stationary Points
Stationary point: ∇F(w*, u*) = 0

(Lin-Liu-Rafique-Y., arXiv 2018, presented at NeurIPS 2018 SGO&ML)

Page 41:

Stagewise Learning for Non-Convex Problems

A Stagewise Learning Algorithm for GAN

for k = 0, . . . , K − 1 do
  F_k(w, u) = F(w, u) + (λ/2)‖w − w_k‖² − (λ/2)‖u − u_k‖²
  (w_{k+1}, u_{k+1}) = A(F_k, w_k, u_k, η_k, T_k)
end for

Each stage approximately solves the proximally regularized game; at a fixed point (w*, u*):

(w*, u*) = argmin_w max_u F(w, u) + (λ/2)‖w − w*‖² − (λ/2)‖u − u*‖²

1 λ > 0 is an algorithmic regularization parameter

2 Different choices of A (e.g., Primal-Dual SGD, Adam)

3 Analysis uses variational inequalities
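To see why the proximal terms matter, here is a minimal sketch of the stagewise scheme with plain gradient descent-ascent as the inner solver A, run on the classic bilinear toy game F(w, u) = w·u, where unregularized simultaneous updates spiral outward but the stagewise (inexact proximal point) version contracts to the stationary point (0, 0). All names and constants are illustrative, not from the paper:

```python
def gda(grad_w, grad_u, w, u, eta, T):
    # Inner solver A: T steps of simultaneous gradient descent-ascent on F_k
    for _ in range(T):
        gw, gu = grad_w(w, u), grad_u(w, u)
        w, u = w - eta * gw, u + eta * gu
    return w, u

def stagewise_minmax(Fw, Fu, w, u, lam=1.0, K=30, T=200, eta=0.05):
    for _ in range(K):
        wk, uk = w, u
        # Gradients of F_k(w, u) = F(w, u) + (lam/2)(w - wk)^2 - (lam/2)(u - uk)^2
        gw = lambda w_, u_, wk=wk: Fw(w_, u_) + lam * (w_ - wk)
        gu = lambda w_, u_, uk=uk: Fu(w_, u_) - lam * (u_ - uk)
        w, u = gda(gw, gu, w, u, eta, T)
    return w, u

# Bilinear game F(w, u) = w*u: grad_w F = u, grad_u F = w
w_star, u_star = stagewise_minmax(lambda w, u: u, lambda w, u: w, 1.0, 1.0)
```

Each inner problem is strongly-convex-strongly-concave thanks to the λ terms, so the inner GDA converges, and the outer sequence contracts toward the stationary point.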

Page 42:

Stagewise Learning for Non-Convex Problems

Experiments: Image Generation

[Figure: Inception Score vs. number of epochs for WGAN training on CIFAR10 and LSUN Bedroom data, comparing Adam(5d), Adam, and IPP-Adam (generalization performance)]

Page 44:

Stagewise Learning for Non-Convex Problems

Why Does Stagewise Learning Improve Testing Error?

Optimizing Deep Neural Networks

min_{w ∈ R^d} F(w) ≜ (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i)   (non-convex)

[Figure: error vs. number of epochs for SGD with step sizes 0.1, 0.01, 0.001, and 1/√t]

Page 45:

Stagewise Learning for Non-Convex Problems

A Stagewise Learning Algorithm for DNN

for k = 0, . . . , K − 1 do
  F_k(w) = F(w) + (λ/2)‖w − w_k‖²
  w_{k+1} = SGD(F_k, w_k, η_k, T_k)
  η_{k+1} = η_k/2
end for

1 λ ≥ 0: algorithmic regularization

2 step size decreases geometrically across stages

3 a more general framework

4 convergence to a stationary point established in ICLR 2019
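A runnable one-dimensional sketch of this loop (full-batch gradient descent stands in for the inner SGD call, and the objective F(w) = (w² − 1)² is an illustrative non-convex stand-in for a DNN loss; all names and constants are ours):

```python
def grad_F(w):
    # Gradient of the toy non-convex objective F(w) = (w^2 - 1)^2
    return 4.0 * w * (w * w - 1.0)

def inner_sgd(grad_Fk, w, eta, T):
    # Inner solver (deterministic gradient descent as a stand-in for SGD)
    for _ in range(T):
        w = w - eta * grad_Fk(w)
    return w

def stagewise_sgd(w0, lam=0.1, eta0=0.1, T=200, K=6):
    w, eta = w0, eta0
    for _ in range(K):
        wk = w
        # F_k(w) = F(w) + (lam/2) * (w - wk)^2
        grad_Fk = lambda w_, wk=wk: grad_F(w_) + lam * (w_ - wk)
        w = inner_sgd(grad_Fk, w, eta, T)
        eta = eta / 2  # geometric step-size decrease across stages
    return w
```

At convergence w ≈ w_k, so the proximal term vanishes and the output is a near-stationary point of F itself.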

Page 46:

Stagewise Learning for Non-Convex Problems

Why Does Stagewise Learning Improve Testing Error?

1 Exploit the Growth Condition:

μ‖w − w*‖₂² ≤ F(w) − F(w*)

2 Exploit an Almost-Convexity Condition:

(−∇F(w)ᵀ(w* − w)) / (F(w) − F(w*)) ≥ γ > 0

3 Use stability for the generalization analysis

Testing Error = Training Error + Generalization Error

[Figure: flat minimum; training curve over 200 epochs with estimated μ = 0.00027]

Page 47:

Stagewise Learning for Non-Convex Problems

Why Does Stagewise Learning Improve Testing Error?

Faster Convergence of Training Error: O(1/(με)) vs O(1/(μ²ε))

With the same number of iterations T:

Testing Error Comparison

O(1/(√n μ^{1/2})) (stagewise step size) vs. O(1/(√n μ^{3/2})) (decreasing step size)

First Theory Explaining Stagewise Learning for DNNs

(Y.-Yan-Yuan-Jin, arXiv 2018)

Page 49:

Stagewise Learning for Non-Convex Problems

Better Stagewise Learning Algorithms

[Figure: test error vs. number of epochs for stagewise variants V1, V2, and V3 on two datasets]

V1: standard, no alg. regularization, restart at last solution

V2: algorithmic regularization, restart at last solution

V3: algorithmic regularization, restart at averaged solution

Page 50:

Stagewise Learning for Non-Convex Problems

Conclusions

Stagewise Learning is Powerful

Theory: faster convergence for both convex and non-convex methods

Practice: SVM, AUC, F-measure, DNN, GAN

Open problems: e.g., Generalization of SL for GAN

Page 51:

Stagewise Learning for Non-Convex Problems

Acknowledgements

Students: Yi Xu, Mingrui Liu, Xiaoxuan Zhang, Zhuoning Yuan, Yan Yan, Hassan Rafique

Collaborators: Qihang Lin (UIowa), Rong Jin (Alibaba Group)

Funding Agency: NSF (CRII, Big Data)

Page 52:

Stagewise Learning for Non-Convex Problems

Thank You!

Questions?

Page 53:

Stagewise Learning for Non-Convex Problems

References I

Brock, Andrew, Donahue, Jeff, and Simonyan, Karen. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273–297, September 1995. ISSN 0885-6125.

Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp. 2672–2680, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1106–1114, 2012.

Page 54:

Stagewise Learning for Non-Convex Problems

References II

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
