Machine Learning: Chenhao Tan, University of Colorado Boulder, Lecture 10
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen
Machine Learning: Chenhao Tan | Boulder | 1 of 52
Roadmap
• Last time: linear SVM formulation when data is linearly separable
• This time:
◦ Introduce duality
◦ Make linear SVM work when data is not linearly separable
◦ Introduce an efficient algorithm for finding weights
• Next time: Kernel trick
Overview
Duality
Slack variables
Sequential Minimal Optimization
Recap
Duality
Outline
Duality
Slack variables
Sequential Minimal Optimization
Recap
Binary classification

Given: training examples S_train = {(x_i, y_i)}_{i=1}^m, with x_i ∈ R^d and y_i ∈ {−1, 1}
Goal: find a hypothesis function h : X → Y
Linear SVM: learn a linear decision rule of the form w · x + b
Optimizing the objective function

min_{w,b} (1/2)||w||²    (1)
subject to y_i(w · x_i + b) ≥ 1, i ∈ [1, m]
Optimizing Constrained Functions: the Method of Lagrange Multipliers

Constrained problem (primal problem):

min_x f(x)
s.t. g_i(x) ≥ 0, i ∈ [1, n]

Lagrangian:

L(x, α) = f(x) − Σ_{i=1}^n α_i g_i(x),  α_i ≥ 0, i ∈ [1, n]
Lagrange Multiplier

Let p* be the optimal value of the primal problem. We claim that

p* = min_x max_α L(x, α) = min_x max_α [ f(x) − Σ_{i=1}^n α_i g_i(x) ]

This is because

max_{α≥0} (−αy) = { 0 if y ≥ 0; +∞ otherwise }

so the inner max equals f(x) when x is feasible and +∞ when it is not.
Lagrange Multiplier

What happens if we reverse min and max?

max_α min_x L(x, α) ≤ min_x max_α L(x, α)

The left-hand side leads to the dual problem.
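The max–min ≤ min–max inequality above can be checked numerically. A minimal sketch (assuming numpy is available) on a finite grid, where the payoff matrix M stands in for L(x, α):

```python
import numpy as np

# Weak duality checked on a finite grid: rows play the role of x,
# columns the role of alpha, and M[x, a] stands in for L(x, alpha).
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 4))

max_min = M.min(axis=0).max()   # max_alpha min_x L(x, alpha)
min_max = M.max(axis=1).min()   # min_x max_alpha L(x, alpha)
print(max_min <= min_max)       # True, for any matrix M
```

The inequality holds for any M: whoever moves second can always react to the first player's choice, so the "max last" order is at least as large.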
Primal vs. Dual

Primal problem:

min_{w,b} (1/2)||w||²
s.t. y_i(w · x_i + b) ≥ 1, i ∈ [1, m]

To derive the dual, replace w and b in the Lagrangian using the stationarity conditions. This yields the dual problem:

max_α Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i · x_j)
s.t. α_i ≥ 0, i ∈ [1, m],  Σ_i α_i y_i = 0
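The dual is a quadratic program. As a sketch (not the solver the lecture uses), it can be handed to a general-purpose optimizer such as scipy's SLSQP on a toy separable dataset, with w then recovered from the stationarity condition; the data and tolerances here are my own choices:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data: the max-margin separator is w = (0, 1), b = 0.
X = np.array([[0.0, 1.0], [1.0, 1.0], [0.0, -1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = (y[:, None] * y[None, :]) * (X @ X.T)   # y_i y_j (x_i . x_j)

def neg_dual(a):                            # minimize the negated dual objective
    return -(a.sum() - 0.5 * a @ K @ a)

m = len(y)
res = minimize(neg_dual, np.full(m, 0.1), method="SLSQP",
               bounds=[(0.0, None)] * m,    # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                         # stationarity: w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                  # a support vector has alpha_i > 0
b = y[sv] - w @ X[sv]                       # on the margin: y_i (w . x_i + b) = 1
print(np.round(w, 3), round(float(b), 3))
```

Dedicated QP solvers (or SMO, below) are the practical choice; SLSQP just makes the structure of the problem explicit.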
Karush-Kuhn-Tucker (KKT) conditions

Primal and dual feasibility:
y_i(w · x_i + b) ≥ 1,  α_i ≥ 0    (2)

Stationarity:
w = Σ_{i=1}^m α_i y_i x_i,  Σ_{i=1}^m α_i y_i = 0    (3)

Complementary slackness:
α_i = 0 ∨ y_i(w · x_i + b) = 1    (4)
Slack variables
Outline
Duality
Slack variables
Sequential Minimal Optimization
Recap
Old objective function

min_{w,b} (1/2)||w||²    (5)
subject to y_i(w · x_i + b) ≥ 1, i ∈ [1, m]
Can SVMs Work Here?

When the data is not linearly separable, no (w, b) can satisfy

y_i(w · x_i + b) ≥ 1    (6)

for every i.
Trick: Allow for a few bad apples
Relaxing the constraint

y_i(w · x_i + b) ≥ 1 − ξ_i

• ξ_i = 0: at least one margin on the correct side of the decision boundary
• ξ_i = 1/2: at least one-half margin on the correct side of the decision boundary
• ξ_i = 2: at least one margin on the wrong side of the decision boundary
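Given a candidate (w, b), each point's slack is just how far it falls short of the margin. A small sketch (the w, b, and points are made up to match the three bullets above):

```python
import numpy as np

def slack(w, b, X, y):
    """xi_i = max(0, 1 - y_i (w . x_i + b)): distance each point falls short of the margin."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

w, b = np.array([0.0, 1.0]), 0.0
X = np.array([[0.0, 2.0],    # well past the margin
              [0.0, 0.5],    # correct side, but inside the margin
              [0.0, -1.0]])  # wrong side of the decision boundary
y = np.array([1.0, 1.0, 1.0])
print(slack(w, b, X, y))     # [0.  0.5 2. ]
```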
New objective function

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξ_i^p    (8)
subject to y_i(w · x_i + b) ≥ 1 − ξ_i ∧ ξ_i ≥ 0, i ∈ [1, m]

• (1/2)||w||²: standard margin
• ξ_i: how wrong a point is (slack variables)
• C: tradeoff between margin and slack variables
• p: how bad wrongness scales
Aside: Loss Functions

• Losses measure how bad a mistake is
• Important for slack as well

[figure: 0/1 loss, linear hinge, and quadratic hinge plotted against the margin]

We'll focus on the linear hinge loss, i.e., set p = 1.
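The three losses in the figure, written out for a vector of margins m_i = y_i(w · x_i + b) (a sketch; the margin values are made up):

```python
import numpy as np

def zero_one(m):       # 1 if misclassified, 0 otherwise
    return (m < 0).astype(float)

def hinge(m):          # linear hinge, p = 1
    return np.maximum(0.0, 1.0 - m)

def sq_hinge(m):       # quadratic hinge, p = 2
    return np.maximum(0.0, 1.0 - m) ** 2

margins = np.array([2.0, 0.5, -1.0])
print(zero_one(margins))   # [0. 0. 1.]
print(hinge(margins))      # [0.  0.5 2. ]
print(sq_hinge(margins))   # [0.   0.25 4.  ]
```

Note the hinge losses penalize correctly classified points that sit inside the margin (m = 0.5), which the 0/1 loss ignores.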
What is the role of C?

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξ_i    (9)
subject to y_i(w · x_i + b) ≥ 1 − ξ_i ∧ ξ_i ≥ 0, i ∈ [1, m]

A. C ↑ ⇒ low bias, low variance
B. C ↑ ⇒ low bias, high variance
C. C ↑ ⇒ high bias, low variance
D. C ↑ ⇒ high bias, high variance
New Lagrangian

L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^m ξ_i    (10)
                   − Σ_{i=1}^m α_i [y_i(w · x_i + b) − 1 + ξ_i]    (11)
                   − Σ_{i=1}^m β_i ξ_i    (12)

Taking the gradients (∇_w L, ∇_b L, ∇_{ξ_i} L) and setting them to zero gives us

w = Σ_{i=1}^m α_i y_i x_i    (13)
Σ_{i=1}^m α_i y_i = 0    (14)
α_i + β_i = C    (15)
Simplifying the dual objective

Substitute the stationarity conditions

w = Σ_{i=1}^m α_i y_i x_i,  Σ_{i=1}^m α_i y_i = 0,  α_i + β_i = C

into

L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i [y_i(w · x_i + b) − 1 + ξ_i] − Σ_{i=1}^m β_i ξ_i

The terms involving b and ξ cancel, leaving an objective in α alone.
Dual Problem

max_α Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i · x_j)
s.t. 0 ≤ α_i ≤ C, i ∈ [1, m],  Σ_i α_i y_i = 0
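The only change from the hard-margin dual is the box constraint 0 ≤ α_i ≤ C. The same SLSQP sketch as before works, now with bounds and a deliberately mislabeled point (data and C are my own toy choices):

```python
import numpy as np
from scipy.optimize import minimize

C = 1.0
X = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, -1.0], [0.0, 3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])   # (0, 3) is labeled "wrong" on purpose
K = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    return -(a.sum() - 0.5 * a @ K @ a)

m = len(y)
res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=[(0.0, C)] * m,  # box constraint from alpha_i + beta_i = C
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
print(np.round(alpha, 3))              # every alpha_i lands in [0, C]
```

The dual variable of the outlier gets pushed toward the upper bound C, which is exactly the complementary-slackness picture on the next slides.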
Karush-Kuhn-Tucker (KKT) conditions

Primal and dual feasibility:
y_i(w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  0 ≤ α_i ≤ C,  β_i ≥ 0    (16)

Stationarity:
w = Σ_{i=1}^m α_i y_i x_i,  Σ_{i=1}^m α_i y_i = 0,  α_i + β_i = C    (17)

Complementary slackness:
α_i [y_i(w · x_i + b) − 1 + ξ_i] = 0,  β_i ξ_i = 0    (18)
More on Complementary Slackness

α_i [y_i(w · x_i + b) − 1 + ξ_i] = 0,  β_i ξ_i = 0    (19)

• x_i satisfies the margin, y_i(w · x_i + b) > 1 ⇒ α_i = 0
• x_i does not satisfy the margin, y_i(w · x_i + b) < 1 ⇒ α_i = C
• x_i is on the margin, y_i(w · x_i + b) = 1 ⇒ 0 ≤ α_i ≤ C
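These three cases are exactly how one reads off support vectors from a trained model. A tiny helper (the function name and category labels are mine, not the lecture's):

```python
def sv_type(alpha, C, tol=1e-8):
    """Categorize a training point by its dual variable alpha_i (soft-margin SVM)."""
    if alpha < tol:
        return "non-support vector"    # margin strictly satisfied, alpha_i = 0
    if alpha > C - tol:
        return "bound support vector"  # margin violated or exactly met, alpha_i = C
    return "free support vector"       # exactly on the margin, 0 < alpha_i < C

print(sv_type(0.0, 1.0))   # non-support vector
print(sv_type(0.3, 1.0))   # free support vector
print(sv_type(1.0, 1.0))   # bound support vector
```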
Sequential Minimal Optimization

Outline

Duality
Slack variables
Sequential Minimal Optimization
Recap
Sequential Minimal Optimization

Trivia:
• Invented by John Platt in 1998 at Microsoft Research
• Called "minimal" because it solves the smallest possible sub-problems
Brief Interlude: Coordinate Ascent

max_α Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i · x_j)
s.t. 0 ≤ α_i ≤ C, i ∈ [1, m],  Σ_i α_i y_i = 0

Loop over each training example and change α_i to maximize the above function.

Although coordinate ascent works OK for lots of problems, here we have the constraint Σ_i α_i y_i = 0, so no single α_i can change while the others stay fixed.
Outline for SVM Optimization (SMO)

1. Select two examples i, j
2. Update α_i, α_j to maximize the dual objective
Outline for SVM Optimization (SMO)

The equality constraint Σ_i α_i y_i = 0 ties the two updates together:

y_i α_i + y_j α_j = y_i α_i^{old} + y_j α_j^{old} = γ
Step 2: Optimize α_j

1. Compute lower (L) and upper (H) bounds that ensure 0 ≤ α_j ≤ C.

If y_i ≠ y_j:
L = max(0, α_j − α_i)    (23)
H = min(C, C + α_j − α_i)    (24)

If y_i = y_j:
L = max(0, α_i + α_j − C)    (25)
H = min(C, α_j + α_i)    (26)

This is because the update for α_i depends on y_i y_j (the sign matters).
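Eqs. (23)–(26) as a function (a sketch; the function name and example values are mine):

```python
def alpha_bounds(y_i, y_j, a_i, a_j, C):
    """Interval [L, H] keeping alpha_j in [0, C] while y_i a_i + y_j a_j stays fixed."""
    if y_i != y_j:   # opposite labels: the difference a_j - a_i is conserved
        return max(0.0, a_j - a_i), min(C, C + a_j - a_i)
    # same labels: the sum a_i + a_j is conserved
    return max(0.0, a_i + a_j - C), min(C, a_i + a_j)

print(alpha_bounds(1, -1, 0.0, 0.0, 2.0))    # (0.0, 2.0)
print(alpha_bounds(1, 1, 0.5, 0.75, 1.0))    # (0.25, 1.0)
```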
Step 2: Optimize α_j

2. Compute the errors for i and j:
E_k ≡ f(x_k) − y_k    (27)

3. Compute
η = 2 x_i · x_j − x_i · x_i − x_j · x_j    (28)

4. Compute the new value for α_j (and clip it to [L, H]):
α_j^* = α_j^{(old)} − y_j(E_i − E_j) / η    (29)
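Steps (27)–(29), with the clip to [L, H] made explicit. A sketch; the numbers in the usage lines are the lecture's i = 0, j = 4 worked example from later slides:

```python
import numpy as np

def update_alpha_j(a_j, y_j, E_i, E_j, x_i, x_j, L, H):
    eta = 2.0 * x_i @ x_j - x_i @ x_i - x_j @ x_j  # eq. (28); negative except in degenerate cases
    a_new = a_j - y_j * (E_i - E_j) / eta          # eq. (29)
    return float(np.clip(a_new, L, H))             # keep the box constraint

# Lecture example: x_0 = (-2, 2), x_4 = (0, -1), E_0 = -1, E_4 = +1, C = pi
a4 = update_alpha_j(0.0, -1.0, -1.0, 1.0,
                    np.array([-2.0, 2.0]), np.array([0.0, -1.0]), 0.0, np.pi)
print(a4)   # 2/13 ~ 0.1538
```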
Step 3: Optimize α_i

Set α_i:
α_i^* = α_i^{(old)} + y_i y_j (α_j^{(old)} − α_j^*)    (30)

This balances out the move that we made for α_j.
Overall algorithm

Repeat until the KKT conditions are met:
  Iterate over i = 1, …, m:
    Choose j randomly from the m − 1 other options
    Update α_i, α_j
Find w, b based on the stationarity conditions
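Putting the pieces together: a self-contained sketch of the simplified SMO variant described above (random j, no selection heuristics; this is my implementation under those assumptions, not Platt's full algorithm), run on the lecture's six points:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-4, max_passes=10, seed=0):
    """Simplified SMO for a linear SVM: find i violating KKT, pick random j, update pair."""
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    K = X @ X.T

    def f(k):                                    # decision value on training point k
        return (alpha * y) @ K[:, k] + b

    passes, sweeps = 0, 0
    while passes < max_passes and sweeps < 500:  # hard cap guarantees termination
        sweeps, changed = sweeps + 1, 0
        for i in range(m):
            E_i = f(i) - y[i]
            if not ((y[i] * E_i < -tol and alpha[i] < C) or
                    (y[i] * E_i > tol and alpha[i] > 0)):
                continue                         # i does not violate KKT: skip it
            j = int(rng.choice([k for k in range(m) if k != i]))
            E_j = f(j) - y[j]
            ai, aj = alpha[i], alpha[j]
            if y[i] != y[j]:
                L, H = max(0.0, aj - ai), min(C, C + aj - ai)
            else:
                L, H = max(0.0, ai + aj - C), min(C, ai + aj)
            eta = 2 * K[i, j] - K[i, i] - K[j, j]
            if L == H or eta >= 0:
                continue                         # degenerate pair: skip it
            aj_new = float(np.clip(aj - y[j] * (E_i - E_j) / eta, L, H))
            if abs(aj_new - aj) < 1e-7:
                continue                         # negligible move
            alpha[j] = aj_new
            alpha[i] = ai + y[i] * y[j] * (aj - aj_new)
            # bias update: prefer an alpha strictly inside (0, C)
            b1 = b - E_i - y[i]*(alpha[i]-ai)*K[i, i] - y[j]*(alpha[j]-aj)*K[i, j]
            b2 = b - E_j - y[i]*(alpha[i]-ai)*K[i, j] - y[j]*(alpha[j]-aj)*K[j, j]
            if 0 < alpha[i] < C:
                b = b1
            elif 0 < alpha[j] < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                          # stationarity condition
    return w, b, alpha

# The lecture's toy data with C = pi
X = np.array([[-2., 2.], [0., 4.], [2., 1.], [-2., -3.], [0., -1.], [2., -3.]])
y = np.array([1., 1., 1., -1., -1., -1.])
w, b, alpha = smo_train(X, y, C=np.pi)
print((np.sign(X @ w + b) == y).all())
```

Each pair update preserves Σ_i α_i y_i exactly, which is why the two variables must move together.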
Iterations / Details

• What if i doesn't violate the KKT conditions? Skip it!
• What if η ≥ 0? Skip it! (should not happen except for numerical instability)
• When do we stop? When we go through the α's without changing anything
SMO Algorithm

[figure: six labeled training points, indices 0–5]
Positive: (−2, 2), (0, 4), (2, 1)
Negative: (−2, −3), (0, −1), (2, −3)

• Initially, all alphas are zero: α = ⟨0, 0, 0, 0, 0, 0⟩
• Intercept b is also zero
• Capacity C = π
SMO Optimization for i = 0, j = 4: Predictions and Step

• Prediction: f(x_0) = 0
• Prediction: f(x_4) = 0
• Error: E_0 = −1
• Error: E_4 = +1

η = 2⟨x_0, x_4⟩ − ⟨x_0, x_0⟩ − ⟨x_4, x_4⟩ = 2 · (−2) − 8 − 1 = −13
SMO Optimization for i = 0, j = 4: Bounds

• Lower and upper bounds for α_j (here y_0 ≠ y_4):
L = max(0, α_j − α_i) = 0    (31)
H = min(C, C + α_j − α_i) = π    (32)
SMO Optimization for i = 0, j = 4: α update

New value for α_j:
α_j^* = α_j − y_j(E_i − E_j) / η = −2/η = 2/13    (33)

New value for α_i:
α_i^* = α_i + y_i y_j (α_j^{(old)} − α_j^*) = α_j^* = 2/13    (34)
Margin

[figure: decision boundary and margin after this update]
Find weight vector and bias

• Weight vector:
w = Σ_i^m α_i y_i x_i = (2/13) [−2, 2]^T − (2/13) [0, −1]^T = [−4/13, 6/13]^T    (35)

• Bias:
b = b^{(old)} − E_i − y_i (α_i^* − α_i^{(old)}) x_i · x_i − y_j (α_j^* − α_j^{(old)}) x_i · x_j    (36)
  = 1 − (2/13) · 8 + (2/13) · (−2) = −0.54    (37)
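The arithmetic in (35)–(37) is easy to double-check numerically (a sketch reproducing the lecture's numbers):

```python
import numpy as np

x0, y0 = np.array([-2.0, 2.0]), 1.0   # point 0 (positive)
x4, y4 = np.array([0.0, -1.0]), -1.0  # point 4 (negative)
a = 2.0 / 13.0                        # alpha_0 = alpha_4 after the first update

w = a * y0 * x0 + a * y4 * x4         # eq. (35)
print(w)                              # [-4/13, 6/13] ~ [-0.308, 0.462]

E0 = -1.0                             # f(x_0) - y_0 with all alphas still zero
b = 0.0 - E0 - y0 * a * (x0 @ x0) - y4 * a * (x0 @ x4)   # eq. (36)
print(round(b, 2))                    # -0.54
```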
SMO Optimization for i = 2, j = 4

Let's skip the boring stuff:
• E_2 = −1.69
• E_4 = 0.00
• η = −8
• α_4 = α_j^{(old)} − y_j(E_i − E_j)/η = 0.15 + (−1.69)/(−8) ≈ 0.37
• α_2 = α_i^{(old)} + y_i y_j (α_j^{(old)} − α_j^*) = 0 − (0.15 − 0.37) ≈ 0.21
Margin

[figure: decision boundary and margin after the second update]
Weight vector and bias

• Bias b = −0.12
• Weight vector:
w = Σ_i^m α_i y_i x_i = [0.12, 0.88]^T    (38)
Another Iteration (i = 0, j = 2)
SMO Algorithm

• Convenient approach for solving the vanilla, slack, and kernel formulations
• Convex problem
• Scalable to large datasets (implemented in scikit-learn)
• What we didn't do:
◦ Check KKT conditions
◦ Randomly choose indices
Recap

Outline

Duality
Slack variables
Sequential Minimal Optimization
Recap
Recap

• Duality
• Slack variables
• SMO: optimize the objective function for two data points at a time
• Convex problem: will converge
• Relatively fast
• Gives good performance
Wrapup

• Adding slack variables doesn't break the SVM problem
• Very popular algorithm:
◦ SVMLight (many options)
◦ Libsvm / Liblinear (very fast)
◦ Weka (friendly)
◦ PyML (Python focused, from Colorado)
• Next up: the kernel trick