Convex Optimization: Old Tricks for New Problems
Transcript of slides (ryotat/teaching/dtuphd11/dtuphd11.pdf) · Ryota Tomioka
Convex Optimization: Old Tricks for New Problems
Ryota Tomioka
The University of Tokyo
2011-08-26 @ DTU PhD Summer Course
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 1 / 72
Why care about convex optimization (and sparsity)?
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 2 / 72
A typical machine learning problem (1/2)
minimize over w ∈ Rⁿ:   (1/2)∥Xw − y∥₂²   [data-fit]   +   φλ(w)   [regularization]

Ridge penalty:   φλ(w) = (λ/2) Σ_{j=1}^n w_j².
L1 penalty:      φλ(w) = λ Σ_{j=1}^n |w_j|.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 3 / 72
A typical machine learning problem (2/2)
Logistic regression for binary (y_i ∈ {−1, +1}) classification:

minimize over w ∈ Rⁿ:   Σ_{i=1}^m log(1 + exp(−y_i⟨x_i, w⟩))   [data-fit]   +   φλ(w)   [regularization]

The logistic loss function
  log(1 + e^{−yz}) = − log P(Y = y | z)   (negative log-likelihood)
where
  P(Y = +1 | z) = 1 / (1 + e^{−z})   (logistic function)
[Figure: logistic loss f(x) = log(1 + exp(−x)) as a function of y⟨x, w⟩, and the logistic function σ(z)]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 4 / 72
Bayesian inference as a convex optimization

minimize over q:   E_q[f(w)]   [average energy]   +   E_q[log q(w)]   [negative entropy]
s.t. q(w) ≥ 0,  ∫ q(w) dw = 1

where  f(w) = − log P(D|w)  [neg. log likelihood]  − log P(w)  [neg. log prior]

⇒ q(w) = (1/Z) e^{−f(w)}   (Bayesian posterior)

Inner approximations: Variational Bayes, Empirical Bayes.
Outer approximations: Belief propagation.
See Wainwright & Jordan 08.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 5 / 72
Convex optimization = standard forms (boring?)
Example: Linear Programming (LP)

Primal problem:
  (P)  min c⊤x,   s.t. Ax = b, x ≥ 0.

Dual problem:
  (D)  max b⊤y,   s.t. A⊤y ≤ c.

Quadratic Programming (QP), Second Order Cone Programming (SOCP), Semidefinite Programming (SDP), etc.
Pro: "Efficient" (but complicated) solvers are already available.
Con: Have to rewrite your problem into one of them.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 6 / 72
Easy problems (that we don't discuss)
Objective f is differentiable & no constraint
  - L-BFGS quasi-Newton method
    - requires only the gradient.
    - scales well.
  - Newton's method
    - requires also the Hessian.
    - very accurate.
    - for medium sized problems.
Differentiable f & simple box constraint
  - L-BFGS-B quasi-Newton method
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 7 / 72
Non-differentiability is everywhere
Support Vector Machine:
  minimize over w:   C Σ_{i=1}^m ℓ_H(y_i⟨x_i, w⟩) + (1/2)∥w∥²
  with the hinge loss ℓ_H(z) = max(0, 1 − z).
Lasso (least absolute shrinkage and selection operator):
  minimize over w:   L(w) + λ Σ_{j=1}^n |w_j|
⇒ Leads to sparse solutions (most of the w_j will be zero)
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 8 / 72
Why we need sparsity
Genome-wide association studies
  - Hundreds of thousands of genetic variations (SNPs), small number of participants (samples).
  - Number of genes responsible for the disease is small.
  - Solve the classification problem (disease/healthy) with a sparsity constraint.
EEG/MEG source localization
  - Number of possible sources ≫ number of sensors
  - Needs sparsity at a group level:
      φλ(w) = λ Σ_{g∈G} ∥w_g∥₂   (w_g ∈ R³)
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 9 / 72
L1-regularization and sparsity
Best convex approximation of ∥w∥₀.
Thresholding occurs for finite λ.
Non-convex cases (p < 1) can be solved by re-weighted L1 minimization.
[Figure: |x|^p for p = 0.01, 0.5, 1 (|x|), and 2 (x²)]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 10 / 72
Multiple kernels & multiple tasks
Multiple kernel learning [Lanckriet et al., 04; Bach et al., 04; ...]
  - Given: kernel functions k₁(x, x′), ..., k_M(x, x′)
  - How do we optimally select and combine "good" kernels?
      minimize over f₁∈H₁, ..., f_M∈H_M:
        C Σ_{i=1}^N ℓ( y_i Σ_{m=1}^M f_m(x_i) ) + λ Σ_{m=1}^M ∥f_m∥_{H_m}
Multiple task learning [Evgeniou et al. 05]
  - Given: two learning tasks.
  - Can we do better than solving them individually?
      minimize over w₁, w₂, w₁₂:
        L₁(w₁ + w₁₂)  [Task 1 loss]  +  L₂(w₂ + w₁₂)  [Task 2 loss]  +  λ(∥w₁∥ + ∥w₂∥ + ∥w₁₂∥)
  w₁₂: shared component, w₁: Task 1 only component, w₂: Task 2 only component.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 11 / 72
Estimation of low-rank matrices (1/2)
Completion of a partially observed low-rank matrix:
  minimize over X:   (1/2)∥Ω(X − Y)∥² + λ∥X∥_{S1}
  where ∥X∥_{S1} := Σ_{j=1}^r σ_j(X)   (Schatten 1-norm)
Linear sum of singular values ⇒ sparsity in the singular values.
  - Collaborative filtering (Netflix)
  - Sensor network localization
[Figure: partially observed users × movies rating matrix]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 12 / 72
Estimation of low-rank matrices (2/2)
Classification of matrix-shaped data X:
  f(X) = ⟨W, X⟩ + b
  e.g., multivariate time series (sensors × time), or their second order statistics (sensors × sensors).
Classification of a binary relationship between two objects (e.g., protein and drug):
  f(x, y) = x⊤Wy + b
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 13 / 72
Agenda
Convex optimization basics
  - Convex sets
  - Convex functions
  - Conditions that guarantee convexity
  - Convex optimization problem
Looking into more details
  - Proximity operators and IST methods
  - Conjugate duality and dual ascent
  - Augmented Lagrangian and ADMM
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 14 / 72
Convexity
Learning objectives
  - Convex sets
  - Convex functions
  - Conditions that guarantee convexity
  - Convex optimization problem
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 15 / 72
Convex set
A subset V ⊆ Rⁿ is a convex set
⇔ the line segment between two arbitrary points x, y ∈ V is included in V; that is,
  ∀x, y ∈ V, ∀λ ∈ [0, 1]:  λx + (1 − λ)y ∈ V.
[Figure: a convex set containing the segment between x and y]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 16 / 72
Convex function
A function f : Rⁿ → R ∪ {+∞} is a convex function
⇔ the function f lies below any line segment between two points on f; that is,
  ∀x, y ∈ Rⁿ, ∀λ ∈ [0, 1]:  f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y)
(Jensen's inequality)
[Figure: convex vs. non-convex function; portrait of Johan Jensen, 1859–1925]
NB: when the strict inequality < holds, f is called strictly convex.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 17 / 72
Convex function
A function f : Rⁿ → R ∪ {+∞} is a convex function
⇔ the epigraph of f is a convex set; that is,
  V_f := {(t, x) ∈ R^{n+1} : t ≥ f(x)} is convex.
[Figure: epigraph of f with the points (x, f(x)) and (y, f(y))]
NB: when the strict inequality < holds, f is called strictly convex.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 18 / 72
Exercise
Show that the indicator function δ_C(x) of a convex set C is a convex function. Here
  δ_C(x) = 0 if x ∈ C,  +∞ otherwise.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 19 / 72
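One possible solution sketch for this exercise (my working, not part of the slides): check Jensen's inequality case by case.

```latex
% Claim: \delta_C is convex when C is a convex set.
% Take x, y \in \mathbb{R}^n and \lambda \in [0,1].
% Case 1: x, y \in C. Convexity of C gives \lambda x + (1-\lambda)y \in C, hence
%   \delta_C(\lambda x + (1-\lambda)y) = 0 = \lambda\,\delta_C(x) + (1-\lambda)\,\delta_C(y).
% Case 2: x \notin C or y \notin C. Then
%   \lambda\,\delta_C(x) + (1-\lambda)\,\delta_C(y) = +\infty,
% so Jensen's inequality holds trivially.
\forall x,y,\ \forall\lambda\in[0,1]:\quad
\delta_C\bigl(\lambda x + (1-\lambda)y\bigr) \;\le\; \lambda\,\delta_C(x) + (1-\lambda)\,\delta_C(y).
```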
Conditions that guarantee convexity (1/3)
Hessian ∇²f(x) is positive semidefinite (if f is twice differentiable).
Examples
  - (Negative) entropy is a convex function:
      f(p) = Σ_{i=1}^n p_i log p_i,    ∇²f(p) = diag(1/p₁, ..., 1/pₙ) ≽ 0.
  - log determinant is a concave (−f is convex) function:
      f(X) = log|X|  (X ≻ 0),    ∇²f(X) = −X^{−⊤} ⊗ X^{−1} ≼ 0.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 20 / 72
Conditions that guarantee convexity (2/3)
Maximum over convex functions {f_j(x)}_{j=1}^∞:
  f(x) := max_j f_j(x)   (f_j(x) is convex for all j)
The same as saying "the intersection of convex sets is a convex set"
(the epigraph of f is the intersection of the epigraphs of the f_j).
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 21 / 72
Conditions that guarantee convexity (2/3)
Maximum over convex functions {f(x; α) : α ∈ Rⁿ}:
  f(x) := max_{α∈Rⁿ} f(x; α)
Example
Quadratic over linear is a convex function:
  f(y, Σ) = max_α [ −(1/2)α⊤Σα + α⊤y ]   (Σ ≻ 0)
          = (1/2) y⊤Σ⁻¹y
[Figure: surface plot of y⊤Σ⁻¹y as a function of y and Σ]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 22 / 72
Conditions that guarantee convexity (3/3)
Minimum of a jointly convex function f(x, y):
  f(x) := min_{y∈Rⁿ} f(x, y) is convex.
Examples
  - Hierarchical prior minimization:
      f(x) = min_{d₁,...,dₙ≥0} (1/2) Σ_{j=1}^n ( x_j²/d_j + d_j^p/p )   (p ≥ 1)
           = (1/q) Σ_{j=1}^n |x_j|^q   (q = 2p/(1 + p))
  - Schatten 1-norm (sum of singular values):
      f(X) = min_{Σ≽0} (1/2)( Tr(XΣ⁻¹X⊤) + Tr(Σ) )
           = Tr((X⊤X)^{1/2}) = Σ_{j=1}^r σ_j(X).
[Figure: |x|^q for q = 1, 1.5, 2]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 23 / 72
Convex optimization problem
f: convex function, g: concave function (−g is convex), C: convex set.
  minimize_x f(x), s.t. x ∈ C.        maximize_y g(y), s.t. y ∈ C.
Why?
  - local optimum ⇒ global optimum
  - duality (later) can be used to check convergence
⇒ We can be sure that we are doing the right thing!
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 24 / 72
Proximity operators and IST methods
Learning objectives
  - (Projected) gradient method
  - Iterative shrinkage/thresholding (IST) method
  - Acceleration
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 25 / 72
Proximity view on gradient descent
"Linearize and Prox"
  x^{t+1} = argmin_x ( ⟨∇f(x^t), x − x^t⟩ + (1/(2η_t))∥x − x^t∥² )
          = x^t − η_t ∇f(x^t)
Step-size should satisfy η_t ≤ 1/L(f).
L(f): the Lipschitz constant of the gradient,
  ∥∇f(y) − ∇f(x)∥ ≤ L(f)∥y − x∥.
L(f) = an upper bound on the maximum eigenvalue of the Hessian.
[Figure: one gradient step from x^t to x^{t+1} toward x*]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 26 / 72
Constrained minimization problem
What do we do if we have a constraint?
  minimize_{x∈Rⁿ} f(x),  s.t. x ∈ C
can be equivalently written as
  minimize_{x∈Rⁿ} f(x) + δ_C(x),
where δ_C(x) is the indicator function of the set C.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 27 / 72
Projected gradient method (Bertsekas 99; Nesterov 03)
Linearize the objective f; δ_C is the indicator of the constraint set C:
  x^{t+1} = argmin_x ( ⟨∇f(x^t), x − x^t⟩ + δ_C(x) + (1/(2η_t))∥x − x^t∥₂² )
          = argmin_x ( δ_C(x) + (1/(2η_t))∥x − (x^t − η_t∇f(x^t))∥₂² )
          = proj_C(x^t − η_t∇f(x^t)).
Requires η_t ≤ 1/L(f).
Convergence rate:
  f(x^k) − f(x*) ≤ L(f)∥x⁰ − x*∥₂² / (2k)
Need the projection proj_C to be easy to compute.
[Figure: projected gradient iterates on a 2-D constraint set]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 28 / 72
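As a concrete illustration (a sketch of mine, not from the slides): for nonnegative least squares, C = {w : w ≥ 0}, so proj_C is simply elementwise clipping at zero and the whole method fits in a few lines.

```python
import numpy as np

def projected_gradient_nnls(X, y, n_iter=500):
    """Sketch: projected gradient for min_{w >= 0} 0.5 * ||Xw - y||^2.

    proj_C is clipping at zero; the step size is 1/L(f), where
    L(f) = ||X||_2^2 is the Lipschitz constant of the gradient.
    """
    eta = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)             # gradient of the smooth objective
        w = np.maximum(w - eta * grad, 0.0)  # gradient step, then project onto C
    return w

# toy usage
rng = np.random.RandomState(0)
X = rng.randn(20, 5)
y = X @ np.abs(rng.randn(5))
print(projected_gradient_nnls(X, y))
```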
Ideas for regularized minimization
Constrained minimization problem:
  minimize_{x∈Rⁿ} f(x) + δ_C(x)
⇒ need to compute the projection
  x^{t+1} = argmin_x ( δ_C(x) + (1/(2η_t))∥x − y∥₂² )
Regularized minimization problem:
  minimize_{x∈Rⁿ} f(x) + φλ(x)
⇒ need to compute the proximity operator
  x^{t+1} = argmin_x ( φλ(x) + (1/(2η_t))∥x − y∥₂² )
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 29 / 72
Proximal operator: generalization of projection
  prox_{φλ}(z) = argmin_x ( φλ(x) + (1/2)∥x − z∥₂² )
φλ = δ_C: projection onto a convex set,
  prox_{δC}(z) = proj_C(z).
φλ(x) = λ∥x∥₁: soft-threshold,
  prox_λ(z) = argmin_x ( λ∥x∥₁ + (1/2)∥x − z∥₂² )
  (prox_λ(z))_j = z_j + λ   (z_j < −λ),
                = 0         (−λ ≤ z_j ≤ λ),
                = z_j − λ   (z_j > λ).
[Figure: soft-thresholding function ST(z) with threshold λ]
The prox can be computed easily for a separable φλ.
Non-differentiability is OK.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 30 / 72
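As a concrete illustration (a sketch, not part of the slides), the soft-threshold prox above is one line of vectorized code:

```python
import numpy as np

def prox_l1(z, lam):
    """Soft-thresholding: prox of lam * ||x||_1, applied elementwise to z."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# prox_l1(np.array([-2.0, -0.3, 0.1, 1.5]), lam=0.5)  ->  [-1.5, 0., 0., 1.]
```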
Iterative Shrinkage Thresholding (IST)
  x^{t+1} = argmin_x ( ⟨∇f(x^t), x − x^t⟩ + φλ(x) + (1/(2η_t))∥x − x^t∥₂² )
          = argmin_x ( φλ(x) + (1/(2η_t))∥x − (x^t − η_t∇f(x^t))∥₂² )
          = prox_{λη_t}(x^t − η_t∇f(x^t)).
The same condition for η_t, the same O(1/k) convergence (Beck & Teboulle 09):
  f(x^k) − f(x*) ≤ L(f)∥x⁰ − x*∥² / (2k)
If the prox operator prox_λ is easy, it is simple to implement.
AKA Forward-Backward Splitting (Lions & Mercier 76)
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 31 / 72
IST summary
Solve the minimization problem
  minimize_{w∈Rⁿ} f(w) + φλ(w)
by iteratively computing
  w^{t+1} = prox_{λη_t}(w^t − η_t∇f(w^t)).
Exercise: Derive the prox operator for
  - Ridge regularization:       φλ(w) = λ Σ_{j=1}^n w_j²
  - Elastic-net regularization: φλ(w) = λ Σ_{j=1}^n ( (1 − θ)|w_j| + θ w_j² ).
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 32 / 72
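For reference, one possible solution sketch for this exercise (my working, obtained by setting the derivative/subgradient of the prox objective to zero; with step size η the iteration uses the prox of ηφλ, and the soft-threshold notation ST is mine):

```latex
% Ridge: minimize over x_j:  \eta\lambda x_j^2 + \tfrac12 (x_j - z_j)^2  \Rightarrow
\operatorname{prox}_{\eta\varphi_\lambda}(z)_j = \frac{z_j}{1 + 2\eta\lambda}.
% Elastic net: soft-threshold at \eta\lambda(1-\theta), then shrink by the quadratic part:
\operatorname{prox}_{\eta\varphi_\lambda}(z)_j = \frac{\operatorname{ST}_{\eta\lambda(1-\theta)}(z_j)}{1 + 2\eta\lambda\theta},
\qquad \operatorname{ST}_{\kappa}(z) = \operatorname{sign}(z)\,\max(|z|-\kappa, 0).
```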
Exercise 1: implement an L1 regularized logistic regression via IST
  minimize_{w∈Rⁿ}   Σ_{i=1}^m log(1 + exp(−y_i⟨x_i, w⟩))   [data-fit]   +   λ Σ_{j=1}^n |w_j|   [regularization]
Hint: define
  f_ℓ(z) = Σ_{i=1}^m log(1 + exp(−z_i)).
Then the problem is
  minimize   f_ℓ(Aw) + λ Σ_{j=1}^n |w_j|    where A has rows y₁x₁⊤, y₂x₂⊤, ..., y_m x_m⊤.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 33 / 72
Some hints
1. Compute the gradient of the loss term (with z = Aw):
     ∇_w f_ℓ(Aw) = −A⊤ ( exp(−z_i) / (1 + exp(−z_i)) )_{i=1}^m
2. The gradient step becomes
     w^{t+1/2} = w^t + η_t A⊤ ( exp(−z_i) / (1 + exp(−z_i)) )_{i=1}^m
3. Then compute the proximity operator
     w^{t+1} = prox_{λη_t}(w^{t+1/2})
     (w^{t+1})_j = w_j^{t+1/2} + λη_t   (w_j^{t+1/2} < −λη_t),
                 = 0                    (−λη_t ≤ w_j^{t+1/2} ≤ λη_t),
                 = w_j^{t+1/2} − λη_t   (w_j^{t+1/2} > λη_t).
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 34 / 72
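A possible implementation sketch of Exercise 1 (my code, not from the slides; the step size comes from the bound L ≤ ∥A∥₂²/4, which follows from the logistic curvature being at most 1/4):

```python
import numpy as np

def prox_l1(z, kappa):
    """Soft-thresholding operator (prox of kappa * ||.||_1)."""
    return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

def l1_logistic_ist(X, y, lam, n_iter=1000):
    """L1-regularized logistic regression via IST (Exercise 1 sketch).

    X: (m, n) design matrix, y: (m,) labels in {-1, +1}, lam: regularization.
    """
    A = y[:, None] * X                          # rows y_i * x_i^T
    eta = 4.0 / np.linalg.norm(A, 2) ** 2       # 1/L, with L <= ||A||_2^2 / 4
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = A @ w
        grad = -A.T @ (1.0 / (1.0 + np.exp(z)))   # = -A^T [exp(-z)/(1+exp(-z))]
        w = prox_l1(w - eta * grad, lam * eta)    # gradient step + soft-threshold
    return w

# toy usage
rng = np.random.RandomState(0)
X = rng.randn(100, 20)
w_true = np.zeros(20); w_true[:3] = [2.0, -1.5, 1.0]
y = np.sign(X @ w_true + 0.1 * rng.randn(100))
print(np.nonzero(l1_logistic_ist(X, y, lam=5.0))[0])   # mostly the first few coordinates
```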
Matrix completion via IST (Mazumder et al. 10)
Loss function:
  L(X) = (1/2)∥Ω(X − Y)∥².
Gradient:
  ∇L(X) = Ω⊤(Ω(X − Y)).
Regularization:
  φλ(X) = λ Σ_{j=1}^r σ_j(X)   (S1-norm).
Prox operator (singular value thresholding):
  prox_λ(Z) = U max(S − λI, 0) V⊤.
Iteration:
  X^{t+1} = prox_{λη_t}( (I − η_tΩ⊤Ω)(X^t)  [fill in missing]  +  η_tΩ⊤Ω(Y)  [observed] )
When η_t = 1: fill the missing entries with the predicted values X^t, overwrite the observed entries with the observed values, then soft-threshold the singular values.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 35 / 72
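A minimal sketch of this iteration for the case η_t = 1 (my code, not from the slides; Y is stored as a dense matrix and Ω as a boolean mask of observed entries):

```python
import numpy as np

def svt(Z, lam):
    """Singular value thresholding: prox of lam * (Schatten 1-norm)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def matrix_completion_ist(Y, mask, lam, n_iter=200):
    """IST for matrix completion with eta_t = 1, as described on the slide."""
    X = np.zeros_like(Y)
    for _ in range(n_iter):
        Z = np.where(mask, Y, X)   # observed entries from Y, missing entries from X^t
        X = svt(Z, lam)            # soft-threshold the singular values
    return X

# toy usage: recover a rank-1 matrix from half of its entries
rng = np.random.RandomState(0)
M = np.outer(rng.randn(30), rng.randn(20))
mask = rng.rand(30, 20) < 0.5
X_hat = matrix_completion_ist(np.where(mask, M, 0.0), mask, lam=0.1)
```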
FISTA: accelerated version of IST (Beck & Teboulle 09; Nesterov 07)
1. Initialize x⁰ appropriately, y¹ = x⁰, s₁ = 1.
2. Update x^t:
     x^t = prox_{λη_t}( y^t − η_t∇L(y^t) ).
3. Update y^t:
     y^{t+1} = x^t + ((s_t − 1)/s_{t+1}) (x^t − x^{t−1}),
     where s_{t+1} = ( 1 + √(1 + 4s_t²) ) / 2.
The same per-iteration complexity. Converges as O(1/k²).
Roughly speaking, y^t predicts where the IST step should be computed.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 36 / 72
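A generic FISTA sketch (my code, not from the slides) that only needs the gradient of the smooth part and the prox of the regularizer; plugging in the soft-threshold prox from earlier gives an accelerated solver for Exercise 1 or for the Lasso.

```python
import numpy as np

def fista(grad_f, prox, x0, eta, lam, n_iter=500):
    """Minimal FISTA: minimize f(x) + phi_lam(x), with eta <= 1/L(f)."""
    x_prev = x0.copy()
    y = x0.copy()
    s = 1.0
    for _ in range(n_iter):
        x = prox(y - eta * grad_f(y), lam * eta)           # IST step at the look-ahead point y^t
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0
        y = x + ((s - 1.0) / s_next) * (x - x_prev)        # extrapolation (momentum) step
        x_prev, s = x, s_next
    return x_prev

# e.g. Lasso: fista(lambda w: X.T @ (X @ w - y_obs), prox_l1, np.zeros(n), eta, lam)
```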
Effect of acceleration
[Figure: objective error (10⁻⁸ to 10²) vs. number of iterations (0 to 10000) for ISTA, MTWIST, and FISTA; without acceleration vs. with acceleration]
From Beck & Teboulle 2009, SIAM J. Imaging Sciences, Vol. 2, No. 1, pp. 183–202.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 37 / 72
Conjugate duality and dual ascent
Learning objectives
  - Convex conjugate function
  - Lagrangian relaxation and dual problem
  - Dual ascent
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 38 / 72
Conjugate duality
The convex conjugate f* of a function f:
  f*(y) = sup_{x∈Rⁿ} ( ⟨x, y⟩ − f(x) )
[Figure: f(x) and its conjugate f*(y)]
Since the maximum over linear functions is always convex, f* is convex; f itself need not be convex.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 39 / 72
Conjugate duality (dual view)
Convex conjugate function:
−f*(y) is the minimum y-intercept of the hyperplanes that have slope y and intersect the graph of f(x).
  f*(y) = sup_x ( ⟨x, y⟩ − f(x) )
  ⇔ −f*(y) = inf_x ( f(x) − ⟨x, y⟩ )
            = inf_{x,b} b,   s.t. f(x) = ⟨x, y⟩ + b.
[Figure: supporting line of slope y with intercept −f*(y)]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 40 / 72
Demo
http://www.ibis.t.u-tokyo.ac.jp/ryotat/applets/pld/
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 41 / 72
Example of conjugate duality: f*(y) = sup_{x∈Rⁿ} (⟨x, y⟩ − f(x))
Quadratic function:
  f(x) = x²/(2σ²),    f*(y) = σ²y²/2
[Figure: f(x) and f*(y)]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 42 / 72
Example of conjugate duality: f*(y) = sup_{x∈Rⁿ} (⟨x, y⟩ − f(x))
Logistic loss function:
  f(x) = log(1 + exp(−x)),    f*(−y) = y log(y) + (1 − y) log(1 − y)   (0 ≤ y ≤ 1)
[Figure: logistic loss and its conjugate]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 43 / 72
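A short derivation sketch of this conjugate (my working, not on the slide), valid for 0 < y < 1:

```latex
% Maximize  -xy - \log(1+e^{-x})  over x. Stationarity:
%   -y + \frac{e^{-x}}{1+e^{-x}} = 0 \iff \frac{1}{1+e^{x}} = y \iff x = \log\frac{1-y}{y}.
\begin{aligned}
f^*(-y) &= \Bigl[-xy - \log(1+e^{-x})\Bigr]_{x=\log\frac{1-y}{y}}
         = y\log\frac{y}{1-y} - \log\frac{1}{1-y}\\
        &= y\log y + (1-y)\log(1-y).
\end{aligned}
```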
Example of conjugate duality: f*(y) = sup_{x∈Rⁿ} (⟨x, y⟩ − f(x))
L1 regularizer:
  f(x) = |x|,    f*(y) = 0 (−1 ≤ y ≤ 1),  +∞ (otherwise)
[Figure: |x| and its conjugate, the indicator of the interval [−1, 1]]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 44 / 72
Bi-conjugate f** may be different from f
For non-convex f, the bi-conjugate f** is the convex envelope of f (the largest convex function lying below f), so in general f** ≤ f, with equality when f is closed and convex.
[Figure: a non-convex f(x) and its conjugate f*(y)]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 45 / 72
Lagrangian relaxation
Our optimization problem:
  minimize_{w∈Rⁿ}   f(Aw) + g(w)
For example f(z) = (1/2)∥z − y∥₂²   (squared loss).
Equivalently written as
  minimize_{z∈R^m, w∈Rⁿ}   f(z) + g(w),   s.t. z = Aw   (equality constraint)
Lagrangian relaxation:
  minimize_{z,w}   L(z, w, α) = f(z) + g(w) + α⊤(z − Aw)
As long as z = Aw, the relaxation is exact.
The minimum of L is no greater than the minimum of the original.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 46 / 72
Weak duality
  inf_{z,w} L(z, w, α) ≤ p*   (primal optimal)
Proof:
  inf_{z,w} L(z, w, α) = inf( inf_{z=Aw} L(z, w, α),  inf_{z≠Aw} L(z, w, α) )
                       = inf( p*,  inf_{z≠Aw} L(z, w, α) )
                       ≤ p*
(on the set z = Aw the term α⊤(z − Aw) vanishes, so the first inner infimum is exactly p*).
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 47 / 72
Dual problem
From the above argument,
  d(α) := inf_{z,w} L(z, w, α)
is a lower bound on p* for any α. Why don't we maximize it over α?
Dual problem:
  maximize_{α∈R^m}   d(α)
Note
  sup_α inf_{z,w} L(z, w, α) = d* ≤ p* = inf_{z,w} sup_α L(z, w, α)
If d* = p*, strong duality holds. This is the case if f and g are both closed and convex.
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 48 / 72
Dual problem
  d(α) = inf_{z,w} L(z, w, α)   (≤ p*)
       = inf_{z,w} ( f(z) + g(w) + α⊤(z − Aw) )
       = inf_z ( f(z) + ⟨α, z⟩ ) + inf_w ( g(w) − ⟨A⊤α, w⟩ )
       = − sup_z ( ⟨−α, z⟩ − f(z) ) − sup_w ( ⟨A⊤α, w⟩ − g(w) )
       = −f*(−α) − g*(A⊤α)
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 49 / 72
Fenchel's duality
  inf_{w∈Rⁿ} ( f(Aw) + g(w) ) = sup_{α∈R^m} ( −f*(−α) − g*(A⊤α) )
[Portrait: M. W. Fenchel]
Examples
  - Logistic regression with L1 regularization:
      f(z) = Σ_{i=1}^m log(1 + exp(−z_i)),    g(w) = λ∥w∥₁.
  - Support vector machine (SVM):
      f(z) = C Σ_{i=1}^m max(0, 1 − z_i),    g(w) = (1/2)∥w∥₂².
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 50 / 72
Example 1: Logistic regression with L1 regularization
Primal:
  min_w   f(y ∘ Xw) + φλ(w),
  f(z) = Σ_{i=1}^m log(1 + exp(−z_i)),    φλ(w) = λ∥w∥₁.
Dual:
  max_α   −f*(−α) − φ*λ(X⊤(α ∘ y)),
  f*(−α) = Σ_{i=1}^m [ α_i log(α_i) + (1 − α_i) log(1 − α_i) ],
  φ*λ(v) = 0 (∥v∥∞ ≤ λ),  +∞ (otherwise).
(∘ denotes the elementwise product.)
[Figure: (a) primal losses, (b) dual losses — hinge loss vs. logistic loss]
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 51 / 72
Example 2: Support vector machine
Primal:
  min_w   f(y ∘ Xw) + φλ(w),
  f(z) = C Σ_{i=1}^m max(0, 1 − z_i),    φλ(w) = (1/2)∥w∥².
Dual:
  max_α   −f*(−α) − φ*λ(X⊤(α ∘ y)),
  f*(−α) = Σ_{i=1}^m (−α_i)   (0 ≤ α ≤ C),  +∞ (otherwise),
  φ*λ(v) = (1/2)∥v∥².
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 52 / 72
Dual ascent
Assume for a moment that the dual d(α) is differentiable.
For a given α^t,
  d(α^t) = inf_{z,w} ( f(z) + g(w) + ⟨α^t, z − Aw⟩ )
and one can show that (Chapter 6, Bertsekas 99)
  ∇_α d(α^t) = z^{t+1} − Aw^{t+1}
where
  z^{t+1} = argmin_z ( f(z) + ⟨α^t, z⟩ ),
  w^{t+1} = argmin_w ( g(w) − ⟨A⊤α^t, w⟩ ).
Ryota Tomioka (Univ Tokyo) Optimization 2011-08-26 53 / 72
Dual ascent (Uzawa’s method)
Minimize the Lagrangian wrt z and w:
  z^{t+1} = argmin_z ( f(z) + ⟨α^t, z⟩ ),
  w^{t+1} = argmin_w ( g(w) − ⟨A⊤α^t, w⟩ ).
Update the Lagrange multiplier α^t:
  α^{t+1} = α^t + η_t ( z^{t+1} − A w^{t+1} ).

Pro: Very simple.
Con: When f* or g* is non-differentiable, this becomes a dual subgradient method (convergence is more delicate).
NB: f* is differentiable ⇔ f is strictly convex.

[H. Uzawa]
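As a concrete illustration of the iteration above, here is a toy example of my own (not from the slides) in which both f and g are strictly convex, so the dual is differentiable: f(z) = (1/2)∥z − y∥², g(w) = (λ/2)∥w∥², with the constraint z = Aw. The iterates should approach the ridge-regression solution w* = (A⊤A + λI)⁻¹A⊤y:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 30, 10, 0.5
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

alpha = np.zeros(m)
# safe fixed step size: the dual gradient is Lipschitz with constant 1 + ||A||_2^2 / lam
eta = 1.0 / (1.0 + np.linalg.norm(A, 2) ** 2 / lam)

for t in range(5000):
    z = y - alpha                        # argmin_z f(z) + <alpha, z>
    w = (A.T @ alpha) / lam              # argmin_w g(w) - <A^T alpha, w>
    alpha = alpha + eta * (z - A @ w)    # gradient ascent on the dual

w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)
print(np.max(np.abs(w - w_ridge)))       # should be close to zero
```

If g were replaced by the (non-strictly convex) L1 norm, g* would be an indicator function and the w-update would no longer be well defined, which is exactly the situation the "Con" above warns about.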
Exercise 2: Matrix completion via dual ascent (Cai et al. 08)
minimize_{X, z}  (1/(2λ)) ∥z − y∥²  +  ( τ∥X∥_tr + (1/2)∥X∥² ),
s.t.  Ω(X) = z,

where the first term is strictly convex in z and the bracketed term is strictly convex in X.

⇓

Lagrangian:
  L(X, z, α) = (1/(2λ)) ∥z − y∥²  [ = f(z) ]  +  ( τ∥X∥_tr + (1/2)∥X∥² )  [ = g(X) ]  +  α⊤( z − Ω(X) ).

Dual ascent:
  X^{t+1} = prox_τ( Ω⊤(α^t) )      (singular-value thresholding)
  z^{t+1} = y − λ α^t
  α^{t+1} = α^t + η_t ( z^{t+1} − Ω(X^{t+1}) )
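A rough numpy sketch of these three updates follows. Only the update formulas come from the slide; the observation operator Ω (sampling a fixed index set), the synthetic data, and the parameter values τ, λ, η are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 30, 30, 2
X_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))   # low-rank ground truth
obs = np.flatnonzero(rng.random(p * q) < 0.5)                         # indices of observed entries
y = X_true.ravel()[obs]                                               # y = Omega(X_true)

def Omega(X):                     # Omega: R^{p x q} -> R^{#observed}
    return X.ravel()[obs]

def Omega_T(alpha):               # adjoint of Omega: scatter alpha onto the observed entries
    M = np.zeros(p * q)
    M[obs] = alpha
    return M.reshape(p, q)

def prox_tau(M, tau):             # singular-value thresholding
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

tau, lam, eta = 5.0, 1e-3, 0.9
alpha = np.zeros(obs.size)
for t in range(500):
    X = prox_tau(Omega_T(alpha), tau)        # X^{t+1} = prox_tau(Omega^T(alpha^t))
    z = y - lam * alpha                      # z^{t+1} = y - lam * alpha^t
    alpha = alpha + eta * (z - Omega(X))     # alpha^{t+1} = alpha^t + eta_t (z^{t+1} - Omega(X^{t+1}))

print(np.linalg.norm(Omega(X) - y) / np.linalg.norm(y))   # relative residual on the observed entries
```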
Augmented Lagrangian and ADMM
Learning objectives:
  - Structured sparse estimation
  - Augmented Lagrangian
  - Alternating direction method of multipliers
Total Variation based image denoising [Rudin, Osher, Fatemi 92]
minimize_X  (1/2)∥X − Y∥²  +  λ Σ_{i,j} ∥ ( ∂_x X_ij , ∂_y X_ij )⊤ ∥_2

(Figure: the original image X0 and the noisy observation Y.)
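For concreteness, the objective can be written in a few lines of numpy. This is a sketch of my own; it uses forward differences with zeros at the image boundary, which is one common convention for ∂x and ∂y:

```python
import numpy as np

def tv_denoising_objective(X, Y, lam):
    # forward differences; the last column/row of the gradient is set to zero
    dx = np.zeros_like(X); dx[:, :-1] = np.diff(X, axis=1)   # horizontal differences
    dy = np.zeros_like(X); dy[:-1, :] = np.diff(X, axis=0)   # vertical differences
    # 0.5 * ||X - Y||^2 + lam * sum_{i,j} ||(dx_ij, dy_ij)||_2   (isotropic TV)
    return 0.5 * np.sum((X - Y) ** 2) + lam * np.sum(np.sqrt(dx ** 2 + dy ** 2))

Y = np.random.default_rng(0).standard_normal((8, 8))
print(tv_denoising_objective(Y, Y, lam=0.1))   # the data-fit term vanishes at X = Y
```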
In one dimension: Fused lasso [Tibshirani et al. 05]

minimize_x  (1/2)∥x − y∥²  +  λ Σ_{j=1}^{n−1} | x_{j+1} − x_j |

(Figure: the true signal and its noisy observation.)
Structured sparsity estimation
TV denoising:
  minimize_X  (1/2)∥X − Y∥²  +  λ Σ_{i,j} ∥ ( ∂_x X_ij , ∂_y X_ij )⊤ ∥_2

Fused lasso:
  minimize_x  (1/2)∥x − y∥²  +  λ Σ_{j=1}^{n−1} | x_{j+1} − x_j |

Structured sparse estimation problem:
  minimize_{x∈Rⁿ}  f(x) + φ_λ(Ax),
  where f is the data-fit term and φ_λ(Ax) is the regularization term.
Structured sparse estimation problem
minimize_{x∈Rⁿ}  f(x) + φ_λ(Ax)        (data-fit + regularization)
- The prox operator of the regularizer φ_λ(A·) is not easy to compute (it is non-separable) ⇒ difficult to apply IST-type methods.
- The dual is not necessarily differentiable ⇒ difficult to apply dual ascent.
Forming the augmented Lagrangian

Structured sparsity problem:

  minimize_{x∈Rⁿ}  f(x) + φ_λ(Ax)        (data-fit + regularization)

Equivalently written as

  minimize_{x∈Rⁿ, z∈Rᵐ}  f(x) + φ_λ(z)    (separable!)
  s.t.  z = Ax                            (equality constraint)

Augmented Lagrangian function:

  L_η(x, z, α) = f(x) + φ_λ(z) + α⊤(z − Ax) + (η/2)∥z − Ax∥²
Augmented Lagrangian Method
Augmented Lagrangian function:

  L_η(x, z, α) = f(x) + φ_λ(z) + α⊤(z − Ax) + (η/2)∥z − Ax∥².

Augmented Lagrangian method (Hestenes 69, Powell 69):

  Minimize the AL function wrt x and z:
    (x^{t+1}, z^{t+1}) = argmin_{x∈Rⁿ, z∈Rᵐ} L_η(x, z, α^t).
  Update the Lagrange multiplier:
    α^{t+1} = α^t + η ( z^{t+1} − A x^{t+1} ).

Pro: The dual is always differentiable, thanks to the penalty term.
Con: x and z cannot be minimized over independently.
Alternating Direction Method of Multipliers (ADMM; Gabay & Mercier 76)
Minimize the AL function L_η(x, z^t, α^t) wrt x:

  x^{t+1} = argmin_{x∈Rⁿ} ( f(x) − α^t⊤Ax + (η/2)∥z^t − Ax∥² ).

Minimize the AL function L_η(x^{t+1}, z, α^t) wrt z:

  z^{t+1} = argmin_{z∈Rᵐ} ( φ_λ(z) + α^t⊤z + (η/2)∥z − Ax^{t+1}∥² ).

Update the Lagrange multiplier:

  α^{t+1} = α^t + η ( z^{t+1} − A x^{t+1} ).

- Looks ad hoc, but convergence can be shown rigorously.
- Stability does not rely on the choice of the step size η.
- The newly updated x^{t+1} enters the computation of z^{t+1}.
Exercise: implement an ADMM for fused lasso
Fused lasso:
  minimize_x  (1/2)∥x − y∥² + λ∥Ax∥_1
- What is the loss function f?
- What is the regularizer g?
- What is the matrix A for fused lasso?
- What is the prox operator for the regularizer g?
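One possible answer sketch is given below (my own implementation; the data, η, and λ are arbitrary choices): take f(x) = (1/2)∥x − y∥², g = λ∥·∥_1 with prox given by soft-thresholding, and let A be the (n−1)×n first-difference matrix. The three ADMM updates then become a linear solve, a soft-threshold, and the multiplier step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, eta = 100, 2.0, 1.0
y = np.concatenate([np.zeros(50), 3.0 * np.ones(50)]) + 0.3 * rng.standard_normal(n)

A = np.diff(np.eye(n), axis=0)            # (n-1) x n first-difference matrix: (Ax)_j = x_{j+1} - x_j

def soft(v, kappa):                       # prox of kappa * ||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

x, z, alpha = np.zeros(n), np.zeros(n - 1), np.zeros(n - 1)
M = np.eye(n) + eta * A.T @ A             # system matrix of the x-update

for t in range(300):
    # x-update: argmin_x 0.5*||x - y||^2 - <alpha, A x> + (eta/2)*||z - A x||^2
    x = np.linalg.solve(M, y + A.T @ alpha + eta * A.T @ z)
    # z-update: argmin_z lam*||z||_1 + <alpha, z> + (eta/2)*||z - A x||^2
    z = soft(A @ x - alpha / eta, lam / eta)
    # multiplier update
    alpha = alpha + eta * (z - A @ x)

print(np.round(x[:3], 2), np.round(x[-3:], 2))   # roughly piecewise constant near 0 and 3
```

The x-update stays cheap here because A⊤A is tridiagonal for the first-difference matrix; for large n a banded or sparse solver would be the natural choice.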
Conclusion
Three approaches for various sparse estimation problems:
  - Iterative shrinkage/thresholding – proximity operator
  - Uzawa's method – convex conjugate function
  - ADMM – combination of the above two

These methods go beyond black-box models (e.g., gradient descent or Newton's method): they take better care of the problem structure.
They are simple enough to be implemented rapidly, but should not be considered a silver bullet.
⇒ Trade-off between:
  - Quick implementation – test new ideas rapidly
  - Efficient optimization – more inspection / trial-and-error / cross-validation
Topics we did not cover

Stopping criterion
  - Care must be taken when making a comparison.
Beyond polynomial convergence O(1/k²)
  - The Dual Augmented Lagrangian (DAL) converges super-linearly, o(exp(−k)). Software: http://mloss.org/software/view/183/ (limited to non-structured sparse estimation).
Beyond convexity
  - The dual problem is always convex and provides a lower bound on the original problem. If p* = d*, you are done!
  - Dual ascent (or dual decomposition) for sequence labeling in natural language processing; see [Wainwright, Jaakkola, Willsky 05; Koo et al. 10].
  - Difference-of-convex (DC) programming.
  - Eigenvalue problems.
Stochastic optimization
  - Good tutorial by Nathan Srebro (ICML 2010).
A new book, "Optimization for Machine Learning," is coming out from MIT Press.
Contributing authors include A. Nemirovski, D. Bertsekas, L. Vandenberghe, and more.
Possible projects
1. Compare the three approaches, namely IST, dual ascent, and ADMM, and discuss their pros and cons empirically (and theoretically).
2. Apply one of the methods discussed in the lecture to model some real problem with (structured) sparsity or a low-rank matrix.
References
Recent surveys
Tomioka, Suzuki, & Sugiyama (2011) Augmented Lagrangian Methods for Learning, Selecting, and Combining Features. In Sra, Nowozin, Wright, editors, Optimization for Machine Learning, MIT Press.
Combettes & Pesquet (2010) Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag.
Boyd, Parikh, Peleato, & Eckstein (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers.
Textbooks
Rockafellar (1970) Convex Analysis. Princeton University Press.
Bertsekas (1999) Nonlinear Programming. Athena Scientific.
Nesterov (2003) Introductory Lectures on Convex Optimization: A Basic Course. Springer.
Boyd & Vandenberghe (2004) Convex Optimization. Cambridge University Press.
References

IST/FISTA
Moreau (1965) Proximité et dualité dans un espace Hilbertien. Bulletin de la S. M. F.
Nesterov (2007) Gradient Methods for Minimizing Composite Objective Function.
Beck & Teboulle (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J Imag Sci 2, 183–202.

Dual ascent
Arrow, Hurwicz, & Uzawa (1958) Studies in Linear and Non-Linear Programming. Stanford University Press.
Chapter 6 in Bertsekas (1999).
Wainwright, Jaakkola, & Willsky (2005) MAP estimation via agreement on trees: message-passing and linear programming. IEEE Trans IT, 51(11).

Augmented Lagrangian
Rockafellar (1976) Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. of Oper. Res. 1.
Bertsekas (1982) Constrained Optimization and Lagrange Multiplier Methods. Academic Press.
Tomioka, Suzuki, & Sugiyama (2011) Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparse Learning. JMLR 12.
References
ADMM
Gabay & Mercier (1976) A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput Math Appl 2, 17–40.
Lions & Mercier (1979) Splitting Algorithms for the Sum of Two Nonlinear Operators. SIAM J Numer Anal 16, 964–979.
Eckstein & Bertsekas (1992) On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators.

Matrices
Srebro, Rennie, & Jaakkola (2005) Maximum-Margin Matrix Factorization. Advances in NIPS 17, 1329–1336.
Cai, Candès, & Shen (2008) A singular value thresholding algorithm for matrix completion.
Tomioka, Suzuki, Sugiyama, & Kashima (2010) A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices. In ICML 2010.
Mazumder, Hastie, & Tibshirani (2010) Spectral Regularization Algorithms for Learning Large Incomplete Matrices. JMLR 11, 2287–2322.
References

Multi-task / Multiple kernel learning
Evgeniou, Micchelli, & Pontil (2005) Learning Multiple Tasks with Kernel Methods. JMLR 6, 615–637.
Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan (2004) Learning the Kernel Matrix with Semidefinite Programming.
Bach, Thibaux, & Jordan (2005) Computing regularization paths for learning multiple kernels. Advances in NIPS, 73–80.

Structured sparsity
Tibshirani, Saunders, Rosset, Zhu, & Knight (2005) Sparsity and smoothness via the fused lasso. J. Roy. Stat. Soc. B, 67.
Rudin, Osher, & Fatemi (1992) Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60.
Goldstein & Osher (2009) Split Bregman method for L1 regularization problems. SIAM J. Imag. Sci. 2.
Mairal, Jenatton, Obozinski, & Bach (2011) Convex and network flow optimization for structured sparsity.

Bayes & Probabilistic Inference
Wainwright & Jordan (2008) Graphical Models, Exponential Families, and Variational Inference.