Optimization for Machine Learning
with a focus on the proximal gradient descent algorithm
Quanming Yao
Department of Computer Science and Engineering
Quanming Yao Optimization for Machine Learning
Outline
1 History & Trends
2 Proximal Gradient Descent
3 Three Applications
A Brief History
A. Convex optimization: before 1980
Develop more efficient algorithms for convex problems
B. Convex optimization for machine learning: 1980-2000
Apply mature convex optimization algorithms to machine learning models
C. Large scale machine learning: 2000-2010
Convexity may no longer be required
Algorithms become more specific to machine learning models
D. Huge scale machine learning: 2010-now
Convexity is no longer a problem for optimization
Parallel & distributed optimization
Convex Optimization: Before 1980
Linear least squares regression
$\min_x \frac{1}{2N} \sum_{n=1}^{N} \left(y_n - a_n^\top x\right)^2 + \frac{\lambda}{2} \|x\|_2^2$
Gradient descent
Conjugate gradient descent
Quasi-Newton method: L-BFGS
Convex Optimization for Machine Learning: 1980-2000
Support vector machine
$\min_{x,b,\varepsilon} \frac{1}{N} \sum_{n=1}^{N} \varepsilon_n + \frac{\lambda}{2} \|x\|_2^2, \quad \text{s.t. } y_n \left(a_n^\top x + b\right) \ge 1 - \varepsilon_n,\; \varepsilon_n \ge 0$
Sequential minimal optimization
Large Scale Machine Learning
Support vector machine
Core-set selection: CVM
Coordinate descent: LibLinear
Stochastic gradient descent (SGD): Pegasos
Sparse coding and matrix completion
Proximal gradient descent (PG)
Frank-Wolfe algorithm (FW)
Alternating Direction Method of Multipliers (ADMM)
Neural networks
Forward and backward propagation
Huge Scale Machine Learning: 2010-now
Deep neural networks
Tailor-made SGD variants (RMSprop, Adam)
Making algorithms (PG, FW, ADMM) more efficient & applicable
Parallel and distributed versions
Stochastic versions
Convergence analysis without convexity
Three Interesting Trends
Trends
From general-purpose to tailor-made for the model, even for the data
From convex to nonconvex
From easy implementation to systematic approaches
Requirements
Deep understanding of optimization algorithms
Deep understanding of machine learning models
Experienced with coding & computer systems
Domain knowledge from specific areas
Why?
Big model (nonconvex) and big data (distributed)
Smaller model error, but greater difficulty in optimization, which leads to a larger optimization error
Good algorithms
Keep the optimization error small
Gradient Descent
Minimization problem: $\min_x F(x)$
$F$ is a smooth function
so we can take its gradient $\nabla F$
Gradient descent
$x_{t+1} = x_t - \frac{1}{L} \nabla F(x_t)$
Example: logistic regression
$\min_x \frac{1}{N} \sum_{n=1}^{N} \log\left(1 + \exp\left(-y_n\, a_n^\top x\right)\right) + \frac{\lambda}{2} \|x\|_2^2$
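The gradient descent update above can be sketched in NumPy for the logistic-regression example. A minimal sketch, assuming synthetic data; the step size $1/L$ with $L = \|A\|_2^2/(4N) + \lambda$ is the standard Lipschitz constant of this gradient, and all sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 10
A = rng.standard_normal((N, d))           # rows are the a_n
y = np.sign(rng.standard_normal(N))       # labels in {-1, +1}
lam = 0.1

def obj(x):
    # (1/N) sum_n log(1 + exp(-y_n a_n^T x)) + (lam/2) ||x||_2^2
    return np.mean(np.log1p(np.exp(-y * (A @ x)))) + 0.5 * lam * x @ x

def grad(x):
    # derivative of each logistic term, then average over the data
    s = -y / (1.0 + np.exp(y * (A @ x)))
    return A.T @ s / N + lam * x

# Lipschitz constant of grad F: ||A||_2^2 / (4N) + lam
L = np.linalg.norm(A, 2) ** 2 / (4 * N) + lam

x = np.zeros(d)
for t in range(200):
    x = x - grad(x) / L                   # x_{t+1} = x_t - (1/L) grad F(x_t)
```

With the $1/L$ step size, the objective decreases monotonically on this smooth convex problem.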
Proximal Gradient Descent
In machine learning, $F$ is often nonsmooth:
$\min_x F(x) = \underbrace{f(x)}_{\text{loss function}} + \underbrace{g(x)}_{\text{regularizer}}$
$f$: the loss function, usually smooth (e.g., square loss, logistic loss)
$g$: the regularizer, usually nonsmooth (e.g., sparse/low-rank regularizers)
Example: logistic regression + feature selection
$\min_x \frac{1}{N} \sum_{n=1}^{N} \log\left(1 + \exp\left(-y_n\, a_n^\top x\right)\right) + \lambda \|x\|_1$
Matrix Completion: Recommender system
Birds of a feather flock together ⇐⇒ Rating matrix is low-rank
U: users × latent factors
V : items × latent factors
Matrix Completion: Recommender system
Optimization problem:
$\min_X \underbrace{\frac{1}{2} \left\|P_\Omega(X - O)\right\|_F^2}_{f:\,\text{loss}} + \underbrace{\lambda \|X\|_*}_{g:\,\text{regularizer}}$
where the observed positions are indicated by $\Omega$, and the ratings are in $O$.
$f$: the square loss, smooth
$g$: the nuclear norm regularizer $\|X\|_* = \sum_i \sigma_i(X)$, the sum of singular values, nonsmooth
Topic Model: Tree-structured lasso
Topics naturally have a hierarchical structure
Topic Model: Tree-structured lasso
Bag of words (BoW) representation of a document
Topic Model: Tree-structured lasso
Optimization problem:
$\min_x \frac{1}{2} \|y - Dx\|_2^2 + \lambda \sum_{g} \|x_{I_g}\|_2$
where $y$ is the BoW representation of a document, and the different groups are indicated by $I_g$.
$f$: the square loss, smooth
$g$: the tree-structured lasso regularizer, nonsmooth
PGD Algorithm: Overview
Gradient descent
$x_{t+1} = x_t - \frac{1}{L} \nabla f(x_t)$
$\quad = \arg\min_x \frac{1}{2} \left\|x - \left(x_t - \frac{1}{L} \nabla f(x_t)\right)\right\|_2^2$
$\quad = \arg\min_x f(x_t) + (x - x_t)^\top \nabla f(x_t) + \frac{L}{2} \|x - x_t\|_2^2$
Proximal gradient descent
$x_{t+1} = \mathrm{prox}_{\frac{\lambda}{L} g}\left(x_t - \frac{1}{L} \nabla f(x_t)\right)$: the proximal step
$\quad = \arg\min_x \frac{1}{2} \left\|x - \left(x_t - \frac{1}{L} \nabla f(x_t)\right)\right\|_2^2 + \frac{\lambda}{L} g(x)$
$\quad = \arg\min_x f(x_t) + (x - x_t)^\top \nabla f(x_t) + \frac{L}{2} \|x - x_t\|_2^2 + \underbrace{\lambda g(x)}_{\text{nonsmooth term}}$
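For $g(x) = \|x\|_1$, the proximal step is the elementwise soft-thresholding operator, which gives the classic ISTA iteration. A minimal sketch on a made-up lasso problem (all data and parameter choices are assumptions):

```python
import numpy as np

def soft_threshold(z, tau):
    # prox of tau * ||.||_1: shrink each coordinate toward zero by tau
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(1)
n, d = 50, 100
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:5] = 1.0                            # sparse ground truth
b = A @ x_true
lam = 0.1

L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of grad f
x = np.zeros(d)
for t in range(300):
    grad = A.T @ (A @ x - b)                # gradient of (1/2)||Ax - b||_2^2
    x = soft_threshold(x - grad / L, lam / L)   # proximal (ISTA) step

obj_val = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
```

The proximal step is as cheap as the gradient step here: one pass over the coordinates.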
PGD Algorithm: Overview
Proximal step $\mathrm{prox}_{\frac{\lambda}{L} g}(\cdot)$:
Cheap closed-form solution
Fast $O(1/T^2)$ convergence in convex problems, using acceleration:
$y_t = x_t + \frac{t-1}{t+2} (x_t - x_{t-1}),$
$x_{t+1} = \mathrm{prox}_{\frac{\lambda}{L} g}\left(y_t - \frac{1}{L} \nabla f(y_t)\right).$
Sound convergence even when $f$ and $g$ are not convex
Not the case for the ADMM, SGD and FW algorithms
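The accelerated iteration above can be sketched by adding the momentum step to the same kind of lasso example (the data and parameters are made-up assumptions):

```python
import numpy as np

def soft_threshold(z, tau):
    # prox of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(2)
n, d = 50, 100
A = rng.standard_normal((n, d))
b = A @ (np.arange(d) < 5).astype(float)    # sparse ground truth
lam = 0.1
L = np.linalg.norm(A, 2) ** 2

def F(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()

x_prev = x = np.zeros(d)
for t in range(1, 101):
    y = x + (t - 1) / (t + 2) * (x - x_prev)            # momentum step
    x_prev, x = x, soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
```

The only change from the plain proximal iteration is the extrapolation point $y_t$; each step still costs one gradient and one prox.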
PGD Algorithm: Matrix completion
Optimization problem: $\min_X \frac{1}{2} \|P_\Omega(X - O)\|_F^2 + \lambda \|X\|_*$
Proximal step
$X_{t+1} = \mathrm{prox}_{\lambda \|\cdot\|_*}(Z_t), \quad Z_t = X_t - P_\Omega(X_t - O)$
An SVD is needed, which is expensive for large matrices:
$O(m^2 n)$ for $X \in \mathbb{R}^{m \times n}$ ($m \le n$)
How to decrease the per-iteration time complexity while guaranteeing convergence?
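The proximal operator of the nuclear norm soft-thresholds the singular values, which is why a full SVD is needed at each step. A minimal sketch of this matrix-completion iteration; the synthetic ratings matrix, mask, and $\lambda$ are assumptions, and the step size is 1 since the gradient of this loss is 1-Lipschitz:

```python
import numpy as np

def prox_nuclear(Z, tau):
    # prox of tau * ||.||_*: soft-threshold the singular values
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # O(m^2 n) for m <= n
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

rng = np.random.default_rng(3)
m, n, lam = 30, 40, 1.0
O = rng.standard_normal((m, 5)) @ rng.standard_normal((5, n))  # low-rank ratings
mask = rng.random((m, n)) < 0.3                                 # observed set Omega

def mc_obj(X):
    return 0.5 * np.linalg.norm(mask * (X - O)) ** 2 \
        + lam * np.linalg.svd(X, compute_uv=False).sum()

X = np.zeros((m, n))
for t in range(50):
    Z = X - mask * (X - O)            # gradient step on the observed entries
    X = prox_nuclear(Z, lam)          # expensive full SVD each iteration
```

The full SVD inside `prox_nuclear` is exactly the bottleneck the next slide attacks.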
PGD Algorithm: Matrix completion
Key observation: $Z_t$ is sparse + low-rank:
$Z_t = \underbrace{P_\Omega(O - X_t)}_{\text{sparse}} + \underbrace{X_t}_{\text{low-rank}}, \quad X_{t+1} = \mathrm{prox}_{\lambda \|\cdot\|_*}(Z_t)$
Let $X_t = U_t \Sigma_t V_t^\top$. For any $u \in \mathbb{R}^n$,
$Z_t u = \underbrace{P_\Omega(O - X_t) u}_{\text{sparse: } O(\|\Omega\|_1)} + \underbrace{U_t \Sigma_t (V_t^\top u)}_{\text{low-rank: } O((m+n)k)}$
A rank-$k$ SVD takes $O(\|\Omega\|_1 k + (m + n) k^2)$ time instead of $O(mnk)$ (similarly for $Z_t^\top v$):
$k$ is much smaller than $m$ and $n$; $\|\Omega\|_1$ is much smaller than $mn$
Use an approximate SVD (power method) instead of an exact SVD
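The sparse-plus-low-rank matrix-vector product can be sketched as follows; the triplet storage is a stand-in for a real sparse format, and all sizes here are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 1000, 1200, 5
nnz = 12000                                   # ~1% observed entries

# low-rank part X_t = U Sigma V^T, kept in factored form
U = rng.standard_normal((m, k))
sigma = rng.random(k) + 1.0
V = rng.standard_normal((n, k))

# sparse part S = P_Omega(O - X_t), stored as (row, col, value) triplets
rows = rng.integers(0, m, nnz)
cols = rng.integers(0, n, nnz)
vals = rng.standard_normal(nnz)

def Zt_matvec(u):
    # Z_t u = S u + U Sigma (V^T u)
    # cost: O(nnz) for the sparse part + O((m + n) k) for the low-rank part
    out = np.zeros(m)
    np.add.at(out, rows, vals * u[cols])      # sparse matrix-vector product
    return out + U @ (sigma * (V.T @ u))

u = rng.standard_normal(n)
fast = Zt_matvec(u)

# dense check: forming Z_t explicitly costs O(mn) memory and time
S = np.zeros((m, n))
np.add.at(S, (rows, cols), vals)
slow = (S + U @ np.diag(sigma) @ V.T) @ u
```

A power-method SVD only ever touches $Z_t$ through such matvecs, which is what makes the rank-$k$ complexity possible.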
PGD Algorithm: Matrix completion
The proposed AIS-Impute [IJCAI-2015] is shown in black.
(a) MovieLens-100K. (b) MovieLens-10M.
Small datasets.
PGD Algorithm: Parallelization
Intel Xeon, 12 cores. FaNCL [ICDM-2015, TKDE-2017] is an advanced version of AIS-Impute.
(a) Yahoo: 12 threads. (b) Speedup vs. number of threads.
Large datasets.
PGD Algorithm: Tree-structured lasso
Optimization problem: $\min_x \frac{1}{2} \|y - Dx\|_2^2 + \lambda \sum_{g} \|x_{I_g}\|_2$
Proximal step
$x_{t+1} = \mathrm{prox}_{\frac{\lambda}{L} \sum_{g} \|(\cdot)_{I_g}\|_2}(z_t), \quad z_t = x_t - \frac{1}{L} D^\top (D x_t - y)$
A cheap closed-form solution exists:
complex, but convex; one pass over all groups, $O(d \log d)$ time for $x \in \mathbb{R}^d$
However, the performance of convex regularizers is not good enough.
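For the simpler case of disjoint groups, the proximal step reduces to a per-group soft-thresholding; the tree-structured case applies such shrinkage group-by-group along the hierarchy in one pass. A sketch of the disjoint-group case, with made-up groups and inputs:

```python
import numpy as np

def group_soft_threshold(z, groups, tau):
    # prox of tau * sum_g ||x_{I_g}||_2 for DISJOINT groups:
    # scale each group toward zero, zeroing it if its norm is <= tau
    x = z.copy()
    for idx in groups:
        nrm = np.linalg.norm(z[idx])
        x[idx] = 0.0 if nrm <= tau else (1.0 - tau / nrm) * z[idx]
    return x

z = np.array([3.0, 4.0, 0.1, -0.1, 1.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
x = group_soft_threshold(z, groups, tau=0.5)
# group [0, 1] has norm 5, so it is scaled by 0.9;
# group [2, 3] has norm ~0.14 <= 0.5, so it is zeroed out entirely
```

Zeroing whole groups at once is what produces the structured sparsity patterns the regularizer is designed for.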
PGD Algorithm: Tree-structured lasso
Convex vs. nonconvex regularizers

                convex    nonconvex
optimization    √         ×
performance     ×         √

Example nonconvex regularizers:
GP: Geman penalty
Laplace penalty
LSP: log-sum penalty
MCP: minimax concave penalty
SCAD: smoothly clipped absolute deviation penalty
PGD Algorithm: Tree-structured lasso
Optimization problem:
$\min_x \frac{1}{2} \|y - Dx\|_2^2 + \lambda \sum_{g} \kappa\left(\|x_{I_g}\|_2\right)$
where $\kappa$ is the nonconvex penalty function.
Proximal step
$x_{t+1} = \mathrm{prox}_{\frac{\lambda}{L} \sum_{g} \kappa(\|(\cdot)_{I_g}\|_2)}(z_t), \quad z_t = x_t - \frac{1}{L} D^\top (D x_t - y)$
No closed-form solution:
needs to be solved iteratively with other algorithms; $O(T_p d)$ time is required, where $T_p$ is the number of inner iterations: expensive
Better performance, but much less efficient optimization
PGD Algorithm: Tree-structured lasso
Problem transformation [ICML-2016]:
$\underbrace{\frac{1}{2} \|y - Dx\|_2^2 + \lambda \sum_{g} \left[\kappa\left(\|x_{I_g}\|_2\right) - \kappa_0 \|x_{I_g}\|_2\right]}_{f:\,\text{augmented loss}} + \underbrace{\lambda \kappa_0 \sum_{g} \|x_{I_g}\|_2}_{g:\,\text{convex regularizer}}$
Moves the nonconvexity from the regularizer to the loss:
augmented loss: $f$ is still smooth
convex regularizer: $g$ is the standard tree-structured regularizer
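The split can be checked numerically. As an example, take the log-sum penalty $\kappa(\alpha) = \log(1 + \alpha/\theta)$ with $\kappa_0 = \kappa'(0) = 1/\theta$ (this penalty choice, the data, and the groups are all assumptions for illustration); the augmented loss and the convex regularizer recombine to the original objective:

```python
import numpy as np

theta = 0.5
kappa = lambda a: np.log(1.0 + a / theta)   # LSP penalty, nonconvex, kappa(0) = 0
kappa0 = 1.0 / theta                         # kappa'(0): slope of kappa at zero

rng = np.random.default_rng(5)
D = rng.standard_normal((20, 12))
y = rng.standard_normal(20)
lam = 0.3
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

def original(x):
    norms = np.array([np.linalg.norm(x[g]) for g in groups])
    return 0.5 * np.linalg.norm(y - D @ x) ** 2 + lam * kappa(norms).sum()

def transformed(x):
    norms = np.array([np.linalg.norm(x[g]) for g in groups])
    f_aug = 0.5 * np.linalg.norm(y - D @ x) ** 2 \
        + lam * (kappa(norms) - kappa0 * norms).sum()
    g_cvx = lam * kappa0 * norms.sum()       # standard group-lasso regularizer
    return f_aug + g_cvx

x = rng.standard_normal(12)
```

Subtracting the linear term $\kappa_0 \|x_{I_g}\|_2$ is what removes the nonsmooth kink of $\kappa(\|x_{I_g}\|_2)$ at zero from the loss part.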
PGD Algorithm: Tree-structured lasso
Optimizing the transformed problem: $\min_x f(x) + \lambda \kappa_0 \sum_{g} \|x_{I_g}\|_2$
Proximal step
$x_{t+1} = \mathrm{prox}_{\frac{\lambda \kappa_0}{L} \sum_{g} \|(\cdot)_{I_g}\|_2}(z_t), \quad z_t = x_t - \frac{1}{L} \nabla f(x_t)$
Cheap closed-form solution
Same convergence guarantee on the original/transformed problems
Better performance, efficient optimization
PGD Algorithm: Tree-structured lasso
Table: Results on tree-structured group lasso.

                  non-accelerated                 accelerated           convex
                  SCP       GIST      GD-PAN      nmAPG     N2C         FISTA
accuracy (%)      99.6±0.9  99.6±0.9  99.6±0.9    99.6±0.9  99.6±0.9    97.2±1.8
sparsity (%)      5.5±0.4   5.7±0.4   6.9±0.4     5.4±0.3   5.1±0.2     9.2±0.2
CPU time (sec)    7.1±1.6   50.0±8.1  14.2±2.6    3.8±0.4   1.9±0.3     1.0±0.4

(a) Objective vs. CPU time. (b) Objective vs. iterations.
N2C is the PG algorithm on the transformed problem.
PGD Algorithm: Binary nets
Example: a small AlexNet
convolution layers: 2M parameters
fully connected layers: 60M parameters
Much larger (10x and 100x) networks are often preferred for big data
Not feasible on mobile devices
PGD Algorithm: Binary nets
Binary nets: $\min_x f(x)$, s.t. $x \in \{\pm 1\}^n$
A neural network with binary weights:
Float (32 bits) to binary (1 bit): 32x compression
Multiplication to addition: 4 times faster
The binary constraint acts as regularization: prevents overfitting
PGD Algorithm: Binary nets
During training:
the weights are first binarized, $\hat{x}_t = \mathrm{sign}(x_t)$;
then $\hat{x}_t$ propagates to the next layer
Taking $x \in \{\pm 1\}^n$ as the regularizer $g$ and using PG:
$x_{t+1} = \mathrm{prox}_{x \in \{\pm 1\}^n}\left(x_t - \nabla f(x_t)\right) = \mathrm{sign}(\underbrace{x_t - \nabla f(x_t)}_{\text{loss-aware}})$
The loss is considered in the binarization [ICLR-2016]
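The proximal step onto $\{\pm 1\}^n$ is just an elementwise sign, since the closest binary vector to any point flips each coordinate to its sign. A tiny sketch with a hypothetical weight vector and gradient:

```python
import numpy as np

def prox_binary(z):
    # projection onto {-1, +1}^n: the closest binary vector is sign(z)
    # (ties at 0 are broken toward +1 here)
    return np.where(z >= 0.0, 1.0, -1.0)

x = np.array([0.3, -1.2, 2.0, -0.1])        # hypothetical real-valued weights
grad = np.array([1.0, -0.5, 0.2, -0.4])     # hypothetical gradient of the loss
x_next = prox_binary(x - grad)              # loss-aware binarization step
```

Binarizing `x - grad` rather than `x` alone is what makes the step loss-aware: the gradient can flip a weight's sign.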
PGD Algorithm: Binary nets
BPN is the proposed method; the loss-aware binarization is used together with Adam.
Thanks & QA