Optimization for Machine Learning
home.cse.ust.hk/~qyaoaa/papers/talk4thpara.pdf


Transcript of "Optimization for Machine Learning"

Page 1:

Optimization for Machine Learning
with a focus on the proximal gradient descent algorithm

Quanming Yao

Department of Computer Science and Engineering

Page 2:

Outline

1 History & Trends

2 Proximal Gradient Descent

3 Three Applications

Page 3:

A Brief History

A. Convex optimization: before 1980

Develop more efficient algorithms for convex problems

B. Convex optimization for machine learning: 1980-2000

Apply mature convex optimization algorithms to machine learning models

C. Large scale machine learning: 2000-2010

Convexity may no longer be required

Algorithms become more specific to machine learning models

D. Huge scale machine learning: 2010-now

Convexity is no longer a problem for optimization

Parallel & distributed optimization

Page 4:

Convex Optimization: Before 1980

Linear least squares regression

\min_x \frac{1}{2N} \sum_{i=1}^N \left( y_i - a_i^\top x \right)^2 + \frac{\lambda}{2} \|x\|_2^2

Gradient descent

Conjugate gradient descent

Quasi-Newton method: L-BFGS
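For illustration, this ridge-regression objective can be handed directly to a mature convex solver; a minimal sketch using SciPy's L-BFGS implementation on synthetic data (names, sizes, and data are illustrative, not from the slides):

import numpy as np
from scipy.optimize import minimize

np.random.seed(0)
N, d, lam = 100, 20, 0.1
A = np.random.randn(N, d)                       # rows a_i
y = A @ np.random.randn(d) + 0.1 * np.random.randn(N)

def obj(x):
    r = A @ x - y
    # (1/2N) sum_i (y_i - a_i^T x)^2 + (lam/2) ||x||_2^2
    return (r @ r) / (2 * N) + 0.5 * lam * (x @ x)

def grad(x):
    return A.T @ (A @ x - y) / N + lam * x

res = minimize(obj, np.zeros(d), jac=grad, method="L-BFGS-B")
print(obj(res.x))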

Page 5:

Convex Optimization for Machine Learning: 1980-2000

Support vector machine

\min_{x, b, \varepsilon} \frac{1}{N} \sum_{i=1}^N \varepsilon_i + \frac{\lambda}{2} \|x\|_2^2, \quad \text{s.t. } y_i \left( a_i^\top x + b \right) \ge 1 - \varepsilon_i, \ \varepsilon_i \ge 0

Sequential minimal optimization

Page 6:

Large Scale Machine Learning

Support vector machine

Core-set selection: CVM

Coordinate descent: LibLinear

Stochastic gradient descent (SGD): Pegasos

Sparse coding and matrix completion

Proximal gradient descent (PG)

Frank-Wolfe algorithm (FW)

Alternating Direction Method of Multipliers (ADMM)

Neural networks

Forward and Backward-propagation

Page 7:

Huge Scale Machine Learning: 2010-now

Deep Neural networks

Tailor-made SGD (RMSProp, Adam)

Making algorithms (PG, FW, ADMM) more efficient & applicable

Parallel and distributed version

Stochastic version

Convergence analysis without convexity

Page 8:

Three Interesting Trends

Trends

General-purpose to tailor-made for the model, even for the data

Convex to nonconvex

Easy implementation to systematic approach

Requirements

Deep understanding of optimization algorithms

Deep understanding of machine learning models

Experience with coding & computer systems

Domain knowledge from specific areas

Page 9:

Why?

Big model (nonconvex) and big data (distributed)

Smaller model error, but increased difficulty of optimization, which leads to large optimization error

Good algorithms

Keep small optimization error

Page 10:

Gradient Descent

Minimization problem: \min_x F(x)

F is a smooth function

we can take the gradient ∇F

Gradient descent

x_{t+1} = x_t - \frac{1}{L} \nabla F(x_t)

Example: logistic regression

\min_x \frac{1}{N} \sum_{i=1}^N \log\left( 1 + \exp(-y_i \, a_i^\top x) \right) + \frac{\lambda}{2} \|x\|_2^2
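A minimal gradient-descent sketch for this ℓ2-regularized logistic regression on synthetic data, assuming the standard Lipschitz bound L = ‖A‖₂²/(4N) + λ for the step size (an illustrative choice, not from the slides):

import numpy as np

np.random.seed(0)
N, d, lam = 200, 50, 0.01
A = np.random.randn(N, d)                       # rows a_i
y = np.sign(A @ np.random.randn(d))             # labels y_i in {-1, +1}

def grad_F(x):
    # gradient of (1/N) sum_i log(1 + exp(-y_i a_i^T x)) + (lam/2) ||x||_2^2
    p = 1.0 / (1.0 + np.exp(y * (A @ x)))       # sigmoid(-y_i a_i^T x)
    return -(A.T @ (y * p)) / N + lam * x

L = np.linalg.norm(A, 2) ** 2 / (4 * N) + lam   # Lipschitz constant of grad F (standard bound)
x = np.zeros(d)
for t in range(500):
    x = x - grad_F(x) / L                       # x_{t+1} = x_t - (1/L) grad F(x_t)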

Page 11:

Proximal Gradient Descent

In machine learning, F is often not smooth.

\min_x F(x) = \underbrace{f(x)}_{\text{loss function}} + \underbrace{g(x)}_{\text{regularizer}}

f: the loss function, usually smooth (e.g. square loss, logistic loss)

g: the regularizer, usually nonsmooth (e.g. sparse/low-rank regularizers)

Example: logistic regression + feature selection

\min_x \frac{1}{N} \sum_{i=1}^N \log\left( 1 + \exp(-y_i \, a_i^\top x) \right) + \lambda \|x\|_1
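For the ℓ1 regularizer the proximal step is the familiar soft-thresholding operator, so proximal gradient descent only adds one line to the gradient-descent sketch above; again a minimal, illustrative version on synthetic data:

import numpy as np

def soft_threshold(z, tau):
    # prox of tau * ||.||_1: shrink each coordinate toward zero by tau
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

np.random.seed(0)
N, d, lam = 200, 100, 0.05
A = np.random.randn(N, d)
y = np.sign(A @ np.random.randn(d))

def grad_f(x):                                  # smooth logistic loss only
    p = 1.0 / (1.0 + np.exp(y * (A @ x)))
    return -(A.T @ (y * p)) / N

L = np.linalg.norm(A, 2) ** 2 / (4 * N)         # Lipschitz constant of grad f
x = np.zeros(d)
for t in range(500):
    z = x - grad_f(x) / L                       # gradient step on f
    x = soft_threshold(z, lam / L)              # proximal step on lam * ||x||_1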

Page 12:

Matrix Completion: Recommender system

Page 13:

Matrix Completion: Recommender system

Birds of a feather flock together ⇐⇒ Rating matrix is low-rank

U: users × latent factors

V : items × latent factors

Page 14:

Matrix Completion: Recommender system

Optimization problem:

\min_X \underbrace{\frac{1}{2} \|P_\Omega(X - O)\|_F^2}_{f:\ \text{loss}} + \underbrace{\lambda \|X\|_*}_{g:\ \text{regularizer}}

where observed positions are indicated by Ω, and ratings are in O.

f: square loss is used, smooth

g: the nuclear norm regularizer, \|X\|_* = \sum_i \sigma_i(X), the sum of singular values, nonsmooth

Page 15:

Topic Model: Tree-structured lasso

Page 16:

Topic Model: Tree-structured lasso

Topics naturally have a hierarchical structure

Page 17:

Topic Model: Tree-structured lasso

Bag of words (BoW) representation of a document

Page 18:

Topic Model: Tree-structured lasso

Optimization problem

\min_x \frac{1}{2} \|y - Dx\|_2^2 + \lambda \sum_{g} \|x_{I_g}\|_2

where y is the BoW representation of a document, and the different groups are indicated by I_g

f : square loss is used - smooth

g : tree-structured lasso regularizer - nonsmooth

Page 19:

PGD Algorithm: Overview

Gradient Descent

x_{t+1} = x_t - \frac{1}{L} \nabla f(x_t)
        = \arg\min_x \frac{1}{2} \left\| x - \left( x_t - \frac{1}{L} \nabla f(x_t) \right) \right\|_2^2
        = \arg\min_x f(x_t) + (x - x_t)^\top \nabla f(x_t) + \frac{L}{2} \|x - x_t\|_2^2

Proximal Gradient Descent

x_{t+1} = \text{prox}_{\frac{\lambda}{L} g}\left( x_t - \frac{1}{L} \nabla f(x_t) \right) \quad \text{(proximal step)}
        = \arg\min_x \frac{1}{2} \left\| x - \left( x_t - \frac{1}{L} \nabla f(x_t) \right) \right\|_2^2 + \frac{\lambda}{L} g(x)
        = \arg\min_x f(x_t) + (x - x_t)^\top \nabla f(x_t) + \frac{L}{2} \|x - x_t\|_2^2 + \underbrace{\lambda g(x)}_{\text{nonsmooth term}}
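The two derivations differ only in the extra nonsmooth term inside the argmin, so PGD is just a gradient step followed by a proximal map; a minimal generic sketch (the function names and the 1-D example are illustrative, not from the slides):

import numpy as np

def proximal_gradient(grad_f, prox_g, L, x0, iters=100):
    # Generic PGD: gradient step on the smooth f, then the proximal map of the
    # (already scaled) nonsmooth term. With prox_g = identity this is plain GD.
    x = x0
    for _ in range(iters):
        x = prox_g(x - grad_f(x) / L)
    return x

# Example: minimize 0.5*(x - 3)^2 + |x| in 1-D (prox of tau*|.| is soft-thresholding).
x_star = proximal_gradient(
    grad_f=lambda x: x - 3.0,                                  # gradient of 0.5*(x - 3)^2
    prox_g=lambda z: np.sign(z) * max(abs(z) - 1.0, 0.0),      # prox of |.| with L = 1
    L=1.0, x0=0.0)
print(x_star)   # converges to 2.0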

Page 20:

PGD Algorithm: Overview

Proximal step \text{prox}_{\frac{\lambda}{L} g}(\cdot)

Cheap closed-form solution

Fast O(1/T^2) convergence on convex problems (with the accelerated variant below)

y_t = x_t + \frac{t-1}{t+2} (x_t - x_{t-1}),
x_{t+1} = \text{prox}_{\frac{\lambda}{L} g}\left( y_t - \frac{1}{L} \nabla f(y_t) \right).
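A minimal sketch of this accelerated (momentum) variant, using a lasso problem (square loss + ℓ1) as a stand-in for g since its proximal step is a one-liner; data and sizes are synthetic and illustrative:

import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

np.random.seed(0)
N, d, lam = 100, 200, 0.1
D = np.random.randn(N, d)
y = D @ (np.random.randn(d) * (np.random.rand(d) < 0.1))   # sparse ground truth

L = np.linalg.norm(D, 2) ** 2                               # Lipschitz constant of grad f
x_prev = x = np.zeros(d)
for t in range(1, 301):
    y_t = x + (t - 1) / (t + 2) * (x - x_prev)              # extrapolation (momentum) step
    x_prev = x
    grad = D.T @ (D @ y_t - y)                              # grad of (1/2) ||y - D y_t||_2^2
    x = soft_threshold(y_t - grad / L, lam / L)             # proximal step on lam * ||x||_1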

Sound convergence even when f and g are not convex

Not for ADMM, SGD and FW algorithms

Page 21:

PGD Algorithm: Matrix completion

Optimization problem: \min_X \frac{1}{2} \|P_\Omega(X - O)\|_F^2 + \lambda \|X\|_*

Proximal step

X_{t+1} = \text{prox}_{\frac{\lambda}{L} \|\cdot\|_*}(Z_t), \quad Z_t = X_t - P_\Omega(X_t - O)

SVD is needed, expensive for large matrices

O(m^2 n) time for X ∈ R^{m×n} (m ≤ n)

How to decrease the per-iteration time complexity while guaranteeing convergence?
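The nuclear-norm proximal step has a well-known closed form: soft-threshold the singular values. A minimal sketch with a dense SVD on a small synthetic problem (illustrative only; this is exactly the expensive step the following slides set out to avoid):

import numpy as np

def prox_nuclear(Z, tau):
    # prox of tau * ||.||_* : soft-threshold the singular values of Z
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

np.random.seed(0)
m, n, lam = 50, 80, 1.0
O = np.random.randn(m, 5) @ np.random.randn(5, n)   # low-rank "ratings"
mask = np.random.rand(m, n) < 0.2                   # observed set Omega

X = np.zeros((m, n))
for t in range(100):
    Z = X - mask * (X - O)                          # Z_t = X_t - P_Omega(X_t - O), step size 1
    X = prox_nuclear(Z, lam)                        # X_{t+1} = prox_{lam ||.||_*}(Z_t)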

Page 22:

PGD Algorithm: Matrix completion

Key observation: Zt is sparse + low-rank

Z_t = \underbrace{P_\Omega(O - X_t)}_{\text{sparse}} + \underbrace{X_t}_{\text{low-rank}}, \quad X_{t+1} = \text{prox}_{\lambda \|\cdot\|_*}(Z_t)

Let X_t = U_t \Sigma_t V_t^\top. For any u ∈ R^n,

Z_t u = \underbrace{P_\Omega(O - X_t)\, u}_{\text{sparse: } O(\|\Omega\|_1)} + \underbrace{U_t \Sigma_t (V_t^\top u)}_{\text{low-rank: } O((m+n)k)}

Rank-k SVD takes O(‖Ω‖_1 k + (m + n)k^2) time instead of O(mnk) (similarly for Z_t^\top v)

k is much smaller than m and n

‖Ω‖_1 is much smaller than mn

Use approximate SVD (power method) instead of exact SVD
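A minimal sketch of the key idea: Z_t is never formed explicitly, and matrix-vector products combine a sparse product with a low-rank product (SciPy sparse assumed available; the sizes and the random sparse matrix standing in for P_Ω(O − X_t) are illustrative):

import numpy as np
from scipy import sparse

np.random.seed(0)
m, n, k = 1000, 1200, 5
U = np.random.randn(m, k)
S = np.diag(np.abs(np.random.randn(k)))
V = np.random.randn(n, k)
sp_part = sparse.random(m, n, density=0.01, format="csr")   # stands in for P_Omega(O - X_t)

def Z_matvec(u):
    # Z_t u = (sparse part) u + U_t Sigma_t (V_t^T u):
    # O(||Omega||_1 + (m + n) k) work instead of O(m n) for a dense Z_t.
    return sp_part @ u + U @ (S @ (V.T @ u))

u = np.random.randn(n)
print(Z_matvec(u).shape)   # (1000,)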

Page 23:

PGD Algorithm: Matrix completion

Proposed AIS-Impute [IJCAI-2015] is in black

(a) MovieLens-100K. (b) MovieLens-10M.

Small datasets

Page 24:

PGD Algorithm: Parallelization

Intel Xeon, 12 cores. FaNCL [ICDM-2015, TKDE-2017] is an advanced version of AIS-Impute.

(a) Yahoo: 12 threads. (b) Speedup v.s. threads.

Large datasets

Page 25:

PGD Algorithm: Tree-structured lasso

Optimization problem: \min_x \frac{1}{2} \|y - Dx\|_2^2 + \lambda \sum_{g} \|x_{I_g}\|_2

Proximal step

x_{t+1} = \text{prox}_{\frac{\lambda}{L} \sum_{g} \|(\cdot)_{I_g}\|_2}(z_t), \quad z_t = x_t - \frac{1}{L} D^\top (D x_t - y)

A cheap closed-form solution exists

Complex, but convex

One pass over all groups, O(d log d) time for x ∈ R^d

Performance of convex regularizers is not good enough.
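For intuition, the group-lasso proximal step is block-wise soft-thresholding; a minimal sketch for the simpler case of disjoint groups (tree-structured groups overlap, so the real prox applies this kind of shrinkage group-by-group along the tree, which this sketch does not do):

import numpy as np

def prox_group_l2(z, groups, tau):
    # Block soft-thresholding: prox of tau * sum_g ||x_{I_g}||_2 for DISJOINT groups I_g.
    x = z.copy()
    for idx in groups:
        nrm = np.linalg.norm(z[idx])
        x[idx] = 0.0 if nrm <= tau else (1 - tau / nrm) * z[idx]
    return x

np.random.seed(0)
d = 12
z = np.random.randn(d)
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
print(prox_group_l2(z, groups, tau=0.5))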

Page 26:

PGD Algorithm: Tree-structured lasso

Convex vs. nonconvex regularizers

regularizer      convex    nonconvex
optimization     √         ×
performance      ×         √

Example regularizers:

GP: Geman penalty

Laplace penalty

LSP: log-sum penalty

MCP: minimax concave penalty

SCAD: smoothly clipped absolute deviation penalty

Page 27:

PGD Algorithm: Tree-structured lasso

Optimization problem

\min_x \frac{1}{2} \|y - Dx\|_2^2 + \lambda \sum_{g} \kappa\left( \|x_{I_g}\|_2 \right)

where κ is a nonconvex penalty function

Proximal step

x_{t+1} = \text{prox}_{\frac{\lambda}{L} \sum_{g} \kappa(\|(\cdot)_{I_g}\|_2)}(z_t), \quad z_t = x_t - \frac{1}{L} D^\top (D x_t - y)

No closed-form solution

Needs to be solved iteratively with other algorithms

O(T_p d) time is required, where T_p is the number of inner iterations: expensive

Better performance, much less efficient optimization

Page 28:

PGD Algorithm: Tree-structured lasso

Problem transformation [ICML-2016]:

\underbrace{\frac{1}{2} \|y - Dx\|_2^2 + \lambda \sum_{g} \left[ \kappa(\|x_{I_g}\|_2) - \kappa_0 \|x_{I_g}\|_2 \right]}_{f:\ \text{augmented loss}} + \underbrace{\lambda \kappa_0 \sum_{g} \|x_{I_g}\|_2}_{g:\ \text{convex regularizer}}

Move the nonconvexity from the regularizer to the loss

augmented loss: f is still smooth

convex regularizer: g is standard tree-structured regularizer
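As a concrete illustration (an assumed instantiation for this sketch, not necessarily the paper's exact choice), take κ to be the log-sum penalty κ(α) = log(1 + α/θ) and set κ₀ = κ'(0) = 1/θ; the transformation then reads:

\frac{1}{2}\|y - Dx\|_2^2
+ \lambda \sum_{g} \Big[ \log\big(1 + \|x_{I_g}\|_2 / \theta\big) - \tfrac{1}{\theta} \|x_{I_g}\|_2 \Big]
+ \frac{\lambda}{\theta} \sum_{g} \|x_{I_g}\|_2

The bracketed terms are smooth, with gradient vanishing as ‖x_{I_g}‖_2 → 0, so they can be absorbed into the augmented loss f, while the last term is a standard tree-structured lasso regularizer with weight λκ₀ = λ/θ.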

Page 29:

PGD Algorithm: Tree-structured lasso

Optimizing the transformed problem: \min_x f(x) + \lambda \kappa_0 \sum_{g} \|x_{I_g}\|_2

Proximal step

x_{t+1} = \text{prox}_{\frac{\lambda \kappa_0}{L} \sum_{g} \|(\cdot)_{I_g}\|_2}(z_t), \quad z_t = x_t - \frac{1}{L} \nabla f(x_t)

Cheap closed-form solution

Same convergence guarantee on the original/transformed problems

Better performance, efficient optimization

Page 30:

PGD Algorithm: Tree-structured lasso

Table: Results on tree-structured group lasso.

                 non-accelerated                     accelerated            convex
                 SCP        GIST       GD-PAN       nmAPG      N2C         FISTA
accuracy (%)     99.6±0.9   99.6±0.9   99.6±0.9     99.6±0.9   99.6±0.9    97.2±1.8
sparsity (%)     5.5±0.4    5.7±0.4    6.9±0.4      5.4±0.3    5.1±0.2     9.2±0.2
CPU time (sec)   7.1±1.6    50.0±8.1   14.2±2.6     3.8±0.4    1.9±0.3     1.0±0.4

(a) Objective v.s. CPU time. (b) Objective v.s. iterations.

N2C is the PG algorithm on the transformed problem

Page 31:

PGD Algorithm: Binary nets

Example: small AlexNet

convolution layers: 2M parameters

fully connected layers: 60M parameters

Much larger (10x and 100x) networks are always preferred for big data

Not deployable on mobile devices

Page 32:

PGD Algorithm: Binary nets

Binary nets: \min_x f(x), \quad \text{s.t. } x \in \{\pm 1\}^n.

Neural network with binary weights

Float (32 bits) to binary (1 bit) - 32x compression

Multiplication to addition - 4 times faster

The binary constraint acts as regularization - prevents overfitting

Page 33:

PGD Algorithm: Binary nets

During training

weights are first binarized: x_t ← sign(x_t);

then the binarized x_t propagates to the next layer

Taking the constraint x ∈ {±1}^n as the regularizer g and using PG:

x_{t+1} = \text{prox}_{x \in \{\pm 1\}^n}\left( x_t - \nabla f(x_t) \right) = \text{sign}\big( x_t - \underbrace{\nabla f(x_t)}_{\text{loss-aware}} \big)

loss is considered in binarization [ICLR-2016]
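A minimal sketch of this proximal view of binarization (a toy example only; a real binary-net training pipeline typically keeps full-precision weights and uses the binarized copies in the forward and backward passes):

import numpy as np

def prox_binary(z):
    # Projection onto {-1, +1}^n: the closest binary vector is sign(z)
    return np.where(z >= 0, 1.0, -1.0)

np.random.seed(0)
n = 8
x = np.random.randn(n)            # full-precision weights
grad_f = np.random.randn(n)       # placeholder for a back-propagated gradient

x_bin = prox_binary(x - grad_f)   # x_{t+1} = sign(x_t - grad f(x_t)): loss-aware binarization
print(x_bin)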

Page 34:

PGD Algorithm: Binary nets

BPN is the proposed method; "loss-aware" is used with Adam

Page 35:

Thanks & QA
