Transcript of Vishy Lec2
Optimization for Machine Learning
Lecture 2: Support Vector Machine Training
S.V.N. (vishy) Vishwanathan
Purdue University, [email protected]
July 11, 2012
S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 1 / 41
Linear Support Vector Machines
Outline
1 Linear Support Vector Machines
2 Stochastic Optimization
3 Implicit Updates
4 Dual Problem
Linear Support Vector Machines
Binary Classification
[Figure: points labeled yi = −1 and yi = +1, separated by the hyperplanes {x | 〈w, x〉 + b = 0}, {x | 〈w, x〉 + b = −1}, and {x | 〈w, x〉 + b = +1}, with x1 and x2 lying on the margins.]
〈w, x1〉 + b = +1
〈w, x2〉 + b = −1
〈w, x1 − x2〉 = 2
〈w/‖w‖, x1 − x2〉 = 2/‖w‖
Linear Support Vector Machines
Linear Support Vector Machines
Optimization Problem
min_{w,b,ξ}  (λ/2)‖w‖² + (1/m) ∑_{i=1}^{m} ξi
s.t. yi (〈w, xi〉 + b) ≥ 1 − ξi for all i
     ξi ≥ 0
Linear Support Vector Machines
Linear Support Vector Machines
Optimization Problem (eliminating the slack variables)

min_{w,b}  (λ/2)‖w‖² + (1/m) ∑_{i=1}^{m} max(0, 1 − yi (〈w, xi〉 + b))

Here (λ/2)‖w‖² = λΩ(w) is the regularizer and the average hinge loss is the empirical risk Remp(w).
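The unconstrained objective above is straightforward to evaluate directly; a minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """lam/2 * ||w||^2 + mean hinge loss over the sample (X, y)."""
    margins = y * (X @ w + b)               # y_i * (<w, x_i> + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # max(0, 1 - y_i(<w, x_i> + b))
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```

At w = 0, b = 0 every hinge term equals 1, so the objective is exactly 1 regardless of λ.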
Stochastic Optimization
Outline
1 Linear Support Vector Machines
2 Stochastic Optimization
3 Implicit Updates
4 Dual Problem
Stochastic Optimization
Stochastic Optimization Algorithms
Optimization Problem (with no bias)
min_w  (λ/2)‖w‖² + (1/m) ∑_{i=1}^{m} max(0, 1 − yi 〈w, xi〉)

(regularizer Ω(w) plus empirical risk Remp(w))
Unconstrained
Nonsmooth
Convex
Stochastic Optimization
Pegasos: Stochastic Gradient Descent
Require: T
1: w1 ← 0
2: for t = 1, …, T do
3:   ηt ← 1/(λt)
4:   if yt 〈wt, xt〉 < 1 then
5:     w′t ← (1 − ηtλ)wt + ηtytxt
6:   else
7:     w′t ← (1 − ηtλ)wt
8:   end if
9:   wt+1 ← min(1, (1/√λ)/‖w′t‖) · w′t
10: end for
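In code, one run of this procedure looks roughly as follows (a sketch under our own naming; random sampling with replacement stands in for the choice of example at step t):

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Pegasos sketch: subgradient step on J_t, then projection onto
    the ball B = {w : ||w|| <= 1/sqrt(lam)}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        i = rng.integers(len(y))          # pick a random training example
        eta = 1.0 / (lam * t)             # step size eta_t = 1/(lam * t)
        if y[i] * (X[i] @ w) < 1:         # hinge loss is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
        norm = np.linalg.norm(w)          # project back onto the ball B
        if norm > 0:
            w *= min(1.0, (1.0 / np.sqrt(lam)) / norm)
    return w
```

On linearly separable data the returned w separates the classes after a modest number of steps.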
Stochastic Optimization
Understanding Pegasos

Objective Function Revisited
J(w) = (λ/2)‖w‖² + (1/m) ∑_{i=1}^{m} max(0, 1 − yi 〈w, xi〉)
At step t, approximate J(w) by the single-example objective
J(w) ≈ Jt(w) = (λ/2)‖w‖² + max(0, 1 − yt 〈w, xt〉)

Subgradient
If yt 〈w, xt〉 < 1 then ∂wJt(w) = λw − ytxt
else ∂wJt(w) = λw
Stochastic Optimization
Understanding Pegasos
Explicit Update
If yt 〈wt, xt〉 < 1 then
w′t = wt − ηt∂wJt(wt) = (1 − ληt)wt + ηtytxt
else
w′t = wt − ηt∂wJt(wt) = (1 − ληt)wt

Projection
Project w′t onto the set B = {w : ‖w‖ ≤ 1/√λ}
Stochastic Optimization
Motivating Stochastic Gradient Descent

How are the Updates Derived?
Minimize the following objective function:
wt+1 = argmin_w (1/2)‖w − wt‖² + ηt Jt(w)
Setting the subgradient with respect to w to zero gives the implicit relation
wt+1 = wt − ηt ∂wJt(wt+1)
which stochastic gradient descent approximates by evaluating the subgradient at wt:
wt+1 ≈ wt − ηt ∂wJt(wt)
Implicit Updates
Outline
1 Linear Support Vector Machines
2 Stochastic Optimization
3 Implicit Updates
4 Dual Problem
Implicit Updates
Implicit Updates

What if we did not approximate ∂wJt(wt+1)?
wt+1 = wt − ηt∂wJt(wt+1)

Subgradient
∂wJt(w) = λw − γytxt, where
If yt 〈w, xt〉 < 1 then γ = 1
If yt 〈w, xt〉 = 1 then γ ∈ [0, 1]
If yt 〈w, xt〉 > 1 then γ = 0

Substituting the subgradient and solving for wt+1:
wt+1 = wt − ηtλwt+1 + γηtytxt
(1 + ηtλ)wt+1 = wt + γηtytxt
wt+1 = (1/(1 + ηtλ)) [wt + γηtytxt]
Implicit Updates
Implicit Updates: Case 1
The Implicit Update Condition
wt+1 = (1/(1 + ηtλ)) [wt + γηtytxt]
Case 1
Suppose 1 + ηtλ < yt 〈wt, xt〉. Set
wt+1 = (1/(1 + ηtλ)) wt
Verify yt 〈wt+1, xt〉 > 1, which implies that γ = 0 and the implicit update condition is satisfied.
Implicit Updates
Implicit Updates: Case 2
The Implicit Update Condition
wt+1 = (1/(1 + ηtλ)) [wt + γηtytxt]
Case 2
Suppose yt 〈wt, xt〉 < 1 + ηtλ − ηt 〈xt, xt〉. Set
wt+1 = (1/(1 + ηtλ)) [wt + ηtytxt]
Verify yt 〈wt+1, xt〉 < 1, which implies that γ = 1 and the implicit update condition is satisfied.
Implicit Updates
Implicit Updates: Case 3
The Implicit Update Condition
wt+1 = (1/(1 + ηtλ)) [wt + γηtytxt]
Case 3
Suppose 1 + ηtλ − ηt 〈xt, xt〉 ≤ yt 〈wt, xt〉 ≤ 1 + ηtλ. Set
γ = (1 + ηtλ − yt 〈wt, xt〉) / (ηt 〈xt, xt〉)
wt+1 = (1/(1 + ηtλ)) [wt + γηtytxt]
Verify γ ∈ [0, 1] and yt 〈wt+1, xt〉 = 1.
Implicit Updates
Implicit Updates: Summary

Summary
wt+1 = (1/(1 + ηtλ)) [wt + γηtytxt]
If 1 + ηtλ < yt 〈wt, xt〉 then γ = 0
If 1 + ηtλ − ηt 〈xt, xt〉 ≤ yt 〈wt, xt〉 ≤ 1 + ηtλ then γ = (1 + ηtλ − yt 〈wt, xt〉) / (ηt 〈xt, xt〉)
If yt 〈wt, xt〉 < 1 + ηtλ − ηt 〈xt, xt〉 then γ = 1
Compactly:
γ = min(1, max(0, (1 + ηtλ − yt 〈wt, xt〉) / (ηt 〈xt, xt〉)))
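The summary collapses to a few lines of code; a NumPy sketch (names ours):

```python
import numpy as np

def implicit_update(w, x, y, eta, lam):
    """One implicit (proximal) step: clamp gamma to [0, 1], then rescale."""
    gamma = min(1.0, max(0.0, (1 + eta * lam - y * (w @ x)) / (eta * (x @ x))))
    return (w + gamma * eta * y * x) / (1 + eta * lam)
```

The three cases can be checked numerically: when 1 + ηtλ < yt〈wt, xt〉 the step is a pure shrink by 1/(1 + ηtλ), and in the intermediate case the new margin comes out exactly 1.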
Dual Problem
Outline
1 Linear Support Vector Machines
2 Stochastic Optimization
3 Implicit Updates
4 Dual Problem
Dual Problem
Deriving the Dual

Lagrangian
Recall the primal problem without bias:
min_{w,ξ}  (λ/2)‖w‖² + (1/m) ∑_{i=1}^{m} ξi
s.t. yi 〈w, xi〉 ≥ 1 − ξi for all i
     ξi ≥ 0
Introduce non-negative dual variables α and β:
L(w, ξ, α, β) = (λ/2)‖w‖² + (1/m) ∑_{i=1}^{m} ξi − ∑_i αi (yi 〈w, xi〉 − 1 + ξi) − ∑_i βi ξi
Dual Problem
Deriving the Dual
Take Gradients and Set to Zero
Write the gradients:
∇w L(w, ξ, α, β) = λw − ∑_i αi yi xi = 0
∇ξi L(w, ξ, α, β) = 1/m − βi − αi = 0
Conclude that
w = (1/λ) ∑_i αi yi xi
0 ≤ αi ≤ 1/m
Dual Problem
Deriving the Dual

Plug back into Lagrangian
Plug w = (1/λ) ∑_i αi yi xi and βi + αi = 1/m into the Lagrangian:
max_α  −D(α) := −(1/2λ) ∑_{i,j} αi αj yi yj 〈xi, xj〉 + ∑_i αi
s.t. 0 ≤ αi ≤ 1/m
Equivalently, as a minimization:
min_α  D(α) := (1/2λ) ∑_{i,j} αi αj yi yj 〈xi, xj〉 − ∑_i αi
s.t. 0 ≤ αi ≤ 1/m
Quadratic objective
Linear (box) constraints
Dual Problem
Coordinate Descent in the Dual

One dimensional function
Fix every coordinate except αt:
D(αt) = (αt²/2λ) 〈xt, xt〉 + (1/λ) ∑_i αt αi yi yt 〈xi, xt〉 − αt + const.
s.t. 0 ≤ αt ≤ 1/m

Take Gradients and Set to Zero
∇D(αt) = (αt/λ) 〈xt, xt〉 + (1/λ) yt 〈wt, xt〉 − 1 = 0, where wt := ∑_i yi αi xi
Solving and clipping to the box gives
αt = min(max(0, (λ − yt 〈wt, xt〉) / 〈xt, xt〉), 1/m)
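A minimal coordinate-descent loop built from this clipped update (a sketch with our own names; we maintain w = (1/λ) ∑_i αi yi xi and remove example t's own contribution before applying the closed form):

```python
import numpy as np

def dual_cd(X, y, lam, epochs=50, seed=0):
    """Dual coordinate descent for the box-constrained dual (0 <= alpha_i <= 1/m)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha = np.zeros(m)
    w = np.zeros(X.shape[1])                     # w = (1/lam) sum_i alpha_i y_i x_i
    for _ in range(epochs):
        for t in rng.permutation(m):
            w -= (alpha[t] * y[t] / lam) * X[t]  # drop t's contribution from w
            a = lam * (1.0 - y[t] * (w @ X[t])) / (X[t] @ X[t])
            alpha[t] = min(max(0.0, a), 1.0 / m) # clip to the box [0, 1/m]
            w += (alpha[t] * y[t] / lam) * X[t]  # restore t's contribution
    return w, alpha
```

Each inner step solves the one-dimensional problem exactly, so no step size is needed.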
Dual Problem
Contrast with Implicit Updates
Coordinate Descent in the Dual
wt = (1/λ) ∑_i αi yi xi
αt = min(1/m, max(0, (λ − yt 〈wt, xt〉) / 〈xt, xt〉))

Implicit Updates
wt+1 = (1/(1 + ηtλ)) [wt + γηtytxt]
γ = min(1, max(0, (1 + ηtλ − yt 〈wt, xt〉) / (ηt 〈xt, xt〉)))
Scaling Things Up
Outline
1 Linear Support Vector Machines
2 Stochastic Optimization
3 Implicit Updates
4 Dual Problem
Scaling Things Up
What if Data Does not Fit in Memory?
Idea 1: Block Minimization [Yu et al., KDD 2010]
Split data into blocks B1, B2 . . . such that Bj fits in memory
Compress and store each block separately
Load one block of data at a time and optimize only those αi ’s
Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]
Split data into blocks B1, B2 . . . such that Bj fits in memory
Compress and store each block separately
Load one block of data at a time and optimize only those αi ’s
Retain informative samples from each block in main memory
Scaling Things Up
What are Informative Samples?
Scaling Things Up
Some Observations
SBM and BM are wasteful
Both split data into blocks and compress the blocks
This requires reading the entire data at least once (expensive)
Both pause optimization while a block is loaded into memory
Hardware 101
Disk I/O is slower than CPU (sometimes by a factor of 100)
Random access on HDD is terrible; sequential access is reasonably fast (within a factor of 10)
Multi-core processors are becoming commonplace
How can we exploit this?
Scaling Things Up
Dual Cached Loops [Matsushima, Vishwanathan, Smola]
[Diagram: a Reader thread streams data from the HDD into a working set held in RAM, while a Trainer thread reads the working set and updates the weight vector, also in RAM.]
Scaling Things Up
Underlying Philosophy
Iterate over the data in main memory while streaming data from disk. Evict primarily examples from main memory that are "uninformative".
Scaling Things Up
Reader
for k = 1, …, max iter do
  for i = 1, …, n do
    if |A| = Ω then
      randomly select i′ ∈ A
      A = A \ {i′}
      delete yi′, Qi′i′, xi′ from RAM
    end if
    read yi, xi from disk
    calculate Qii = 〈xi, xi〉
    store yi, Qii, xi in RAM
    A = A ∪ {i}
  end for
  if stopping criterion is met then
    exit
  end if
end for
Scaling Things Up
Trainer
α1 = 0, w1 = 0, ε = 9, εnew = 0, β = 0.9
while stopping criterion is not met do
  for t = 1, …, n do
    if |A| > 0.9 × Ω then ε = βε
    randomly select i ∈ A and read yi, Qii, xi from RAM
    compute ∇iD := yi 〈wt, xi〉 − 1
    if (αt_i = 0 and ∇iD > ε) or (αt_i = C and ∇iD < −ε) then
      A = A \ {i} and delete yi, Qii, xi from RAM
      continue
    end if
    αt+1_i = median(0, C, αt_i − ∇iD / Qii),  wt+1 = wt + (αt+1_i − αt_i) yi xi
    εnew = max(εnew, |∇iD|)
  end for
  update stopping criterion
  ε = εnew
end while
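Stripped of the I/O machinery, the Trainer's inner loop is dual coordinate descent plus a KKT-based eviction test; a simplified sketch (names ours; the working set is a plain Python set, and the dual is the C-parametrized form used on the slide):

```python
import numpy as np

def trainer_pass(X, y, alpha, w, active, C, eps):
    """One sweep over the in-RAM working set; evicts examples whose
    gradient marks them as (currently) uninformative. Returns eps_new."""
    eps_new = 0.0
    for i in sorted(active):
        grad = y[i] * (w @ X[i]) - 1.0             # nabla_i D
        if (alpha[i] == 0.0 and grad > eps) or (alpha[i] == C and grad < -eps):
            active.discard(i)                      # evict from working set
            continue
        a = float(np.clip(alpha[i] - grad / (X[i] @ X[i]), 0.0, C))
        w += (a - alpha[i]) * y[i] * X[i]          # keep w in sync with alpha
        alpha[i] = a
        eps_new = max(eps_new, abs(grad))
    return eps_new
```

Note that median(0, C, z) on the slide is exactly a clip of z to the interval [0, C].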
Scaling Things Up
Experiments
| dataset | n | d | s (%) | n+ : n− | Data size |
| --- | --- | --- | --- | --- | --- |
| ocr | 3.5 M | 1156 | 100 | 0.96 | 45.28 GB |
| dna | 50 M | 800 | 25 | 3e−3 | 63.04 GB |
| webspam-t | 0.35 M | 16.61 M | 0.022 | 1.54 | 20.03 GB |
| kddb | 20.01 M | 29.89 M | 1e−4 | 6.18 | 4.75 GB |
Scaling Things Up
Does Active Eviction Work?
[Plot: relative function value difference (log scale) vs. wall clock time (sec) for a linear SVM on webspam-t with C = 1.0, comparing Random and Active eviction.]
Scaling Things Up
Comparison with Block Minimization
[Plots: relative (objective) function value difference, log scale, vs. wall clock time (sec) with C = 1.0 on ocr, webspam-t, kddb, and dna, comparing StreamSVM, SBM, and BM.]
Scaling Things Up
Expanding Features
[Plot: relative function value difference vs. wall clock time (sec) on dna with expanded features, C = 1.0, comparing 16 GB and 32 GB of RAM.]
Bringing in the Bias
Outline
1 Linear Support Vector Machines
2 Stochastic Optimization
3 Implicit Updates
4 Dual Problem
Bringing in the Bias
Let us Bring Back the Bias

Lagrangian
Recall the primal problem:
min_{w,b,ξ}  (λ/2)‖w‖² + (1/m) ∑_{i=1}^{m} ξi
s.t. yi (〈w, xi〉 + b) ≥ 1 − ξi for all i
     ξi ≥ 0
Introduce non-negative dual variables α and β:
L(w, b, ξ, α, β) = (λ/2)‖w‖² + (1/m) ∑_{i=1}^{m} ξi − ∑_i αi (yi (〈w, xi〉 + b) − 1 + ξi) − ∑_i βi ξi
Bringing in the Bias
Let us Bring Back the Bias
Take Gradients and Set to Zero
Write the gradients:
∇w L(w, b, ξ, α, β) = λw − ∑_i αi yi xi = 0
∇b L(w, b, ξ, α, β) = ∑_i αi yi = 0
∇ξi L(w, b, ξ, α, β) = 1/m − βi − αi = 0
Conclude that
w = (1/λ) ∑_i αi yi xi
∑_i αi yi = 0 and 0 ≤ αi ≤ 1/m
Bringing in the Bias
Let us Bring Back the Bias

Plug back into Lagrangian
Plug w = (1/λ) ∑_i αi yi xi and βi + αi = 1/m into the Lagrangian:
max_α  −D(α) := −(1/2λ) ∑_{i,j} αi αj yi yj 〈xi, xj〉 + ∑_i αi
s.t. ∑_i αi yi = 0
     0 ≤ αi ≤ 1/m
Equivalently, as a minimization:
min_α  D(α) := (1/2λ) ∑_{i,j} αi αj yi yj 〈xi, xj〉 − ∑_i αi
s.t. ∑_i αi yi = 0
     0 ≤ αi ≤ 1/m
Bringing in the Bias
Coordinate Descent in the Dual

One Dimensional Function
Because of the constraint ∑_i αi yi = 0 we cannot pick one coordinate, so pick two! Call the two coordinates t1 and t2, and let ηt1, ηt2 be the changes to αt1, αt2:

D(ηt1, ηt2) = (ηt1²/2λ) 〈xt1, xt1〉 + (ηt2²/2λ) 〈xt2, xt2〉 + (ηt1/λ) ∑_i αi 〈xi, xt1〉 + (ηt2/λ) ∑_i αi 〈xi, xt2〉 + (ηt1ηt2/λ) 〈xt1, xt2〉 − ηt1 − ηt2 + const.
s.t. yt1 ηt1 + yt2 ηt2 = 0
     0 ≤ αt1 + ηt1 ≤ 1/m
     0 ≤ αt2 + ηt2 ≤ 1/m

The equality constraint eliminates one variable: set ηt1 = η and ηt2 = −(yt1/yt2) η. This leaves a one-dimensional problem:

D(η) = (η²/2λ) 〈xt1, xt1〉 + (η²/2λ) 〈xt2, xt2〉 + (η/λ) ∑_i αi 〈xi, xt1〉 − (η yt1 / λ yt2) ∑_i αi 〈xi, xt2〉 − (η² yt1 / λ yt2) 〈xt1, xt2〉 − η + (yt1/yt2) η + const.
s.t. 0 ≤ αt1 + η ≤ 1/m
     0 ≤ αt2 − (yt1/yt2) η ≤ 1/m
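Numerically this is an SMO-style pair update: minimize the one-dimensional quadratic in η, then clip η so both coordinates stay in the box. A sketch under our own naming and the η parametrization above (labels assumed to be ±1, so yt1/yt2 = yt1·yt2; w is maintained as (1/λ) ∑_i αi yi xi):

```python
import numpy as np

def pair_update(X, y, alpha, w, t1, t2, lam, m):
    """One two-coordinate step preserving sum_i alpha_i y_i = 0."""
    d = X[t1] - X[t2]
    denom = d @ d
    if denom == 0.0:
        return                               # degenerate pair, nothing to do
    g1 = y[t1] * (w @ X[t1]) - 1.0           # dD/d(alpha_t1)
    g2 = y[t2] * (w @ X[t2]) - 1.0           # dD/d(alpha_t2)
    s = y[t1] * y[t2]                        # = y_t1 / y_t2 for labels in {-1, +1}
    eta = -lam * (g1 - s * g2) / denom       # unconstrained 1-D minimizer
    # clip eta so both alpha_t1 + eta and alpha_t2 - s*eta stay in [0, 1/m]
    lo, hi = -alpha[t1], 1.0 / m - alpha[t1]
    if s > 0:
        lo, hi = max(lo, alpha[t2] - 1.0 / m), min(hi, alpha[t2])
    else:
        lo, hi = max(lo, -alpha[t2]), min(hi, 1.0 / m - alpha[t2])
    eta = min(max(eta, lo), hi)
    alpha[t1] += eta
    alpha[t2] -= s * eta
    w += (y[t1] * eta / lam) * d             # incremental update of w
```

On a tiny balanced pair this single step already reaches the optimum while keeping the equality constraint exact.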
Bringing in the Bias
Software
LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LibLinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Bringing in the Bias
References (Incomplete)

Implicit Updates
Kivinen and Warmuth, Exponentiated Gradient Versus Gradient Descent for Linear Predictors, Information and Computation, 1997.
Kivinen, Warmuth, and Hassibi, The p-norm generalization of the LMS algorithm for adaptive filtering, IEEE Transactions on Signal Processing, 2006.
Cheng, Vishwanathan, Schuurmans, Wang, and Caelli, Implicit Online Learning With Kernels, NIPS 2006.
Hsieh, Chang, Lin, Keerthi, and Sundararajan, A Dual Coordinate Descent Method for Large-scale Linear SVM, ICML 2008.

SMO
Platt, Fast Training of Support Vector Machines using Sequential Minimal Optimization, Advances in Kernel Methods: Support Vector Learning, 1999.

Dual Cached Loops
Matsushima, Vishwanathan, and Smola, Linear Support Vector Machines via Dual Cached Loops, KDD 2012.
Bringing in the Bias
References (Incomplete)
Slides are loosely based on lecture notes from
http://learning.stat.purdue.edu/wiki/courses/sp2011/598a/lectures
http://www.ee.ucla.edu/~vandenbe/shortcourses.html