Machine Learning: Chenhao Tan
University of Colorado Boulder
LECTURE 10

Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

Machine Learning: Chenhao Tan | Boulder | 1 of 52

Roadmap

• Last time: linear SVM formulation when data is linearly separable
• This time:
  ◦ Introduce duality
  ◦ Make linear SVM work when data is not linearly separable
  ◦ Introduce an efficient algorithm for finding weights

• Next time: Kernel trick


Overview

Duality

Slack variables

Sequential Minimal Optimization

Recap


Duality

Outline

Duality

Slack variables

Sequential Minimal Optimization

Recap


Binary classification

Given: S_train = {(x_i, y_i)}_{i=1}^m training examples, x_i ∈ R^d, y_i ∈ {−1, 1}

Goal: Find hypothesis function h : X → Y
Linear SVM: learn a linear decision rule of the form w · x + b


Optimizing the objective function

min_{w,b} (1/2)||w||²    (1)

subject to y_i(w · x_i + b) ≥ 1, i ∈ [1, m]
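As a quick sanity check on this formulation, here is a small illustrative sketch (pure Python; the candidate (w, b) is made up, and the data borrows the six toy points used later in the deck) that evaluates ½||w||² and tests the margin constraints:

```python
def primal_objective(w):
    # (1/2) * ||w||^2
    return 0.5 * sum(wj * wj for wj in w)

def satisfies_margins(w, b, X, y):
    # Check y_i (w . x_i + b) >= 1 for every training example
    return all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 1
        for xi, yi in zip(X, y)
    )

X = [(-2, 2), (0, 4), (2, 1), (-2, -3), (0, -1), (2, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = (0.0, 1.0), 0.0  # one feasible separator for this toy set
print(primal_objective(w), satisfies_margins(w, b, X, y))  # 0.5 True
```

Note that (0, 1) is merely feasible here, not necessarily the minimizer; the optimization below is what finds the minimum-norm w.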


Optimizing Constrained Functions

The Method of Lagrange Multipliers

Constrained problem (Primal problem):

min_x f(x)
s.t. g_i(x) ≥ 0, i ∈ [1, n]

Lagrangian:

L(x, α) = f(x) − Σ_{i=1}^n α_i g_i(x),  α_i ≥ 0, i ∈ [1, n]


Lagrange Multiplier

p*: the optimal value of the primal problem. We claim that

p* = min_x max_{α≥0} L(x, α) = min_x max_{α≥0} [ f(x) − Σ_{i=1}^n α_i g_i(x) ]

This is because

max_{α≥0} (−αy) = 0 if y ≥ 0, and +∞ otherwise


Lagrange Multiplier

What happens if we reverse min and max?

max_α min_x L(x, α) ≤ min_x max_α L(x, α)

The left-hand side leads to the dual problem.


Primal vs. Dual

Primal problem:

min_{w,b} (1/2)||w||²
s.t. y_i(w · x_i + b) ≥ 1, i ∈ [1, m]

Derive the dual problem by replacing w and b using the stationarity conditions.


Primal vs. Dual

Primal problem:

min_{w,b} (1/2)||w||²
s.t. y_i(w · x_i + b) ≥ 1, i ∈ [1, m]

Dual problem:

max_α Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i · x_j)
s.t. α_i ≥ 0, i ∈ [1, m]
     Σ_i α_i y_i = 0
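To make the dual concrete, a minimal sketch (pure Python; the two-point toy data is made up for illustration) that evaluates the dual objective for a given α:

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_objective(alpha, X, y):
    # sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j y_i y_j (x_i . x_j)
    m = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(m) for j in range(m)
    )
    return sum(alpha) - 0.5 * quad

X = [(0.0, 1.0), (0.0, -1.0)]
y = [1, -1]
alpha = [0.5, 0.5]  # satisfies the constraint sum_i alpha_i y_i = 0
print(dual_objective(alpha, X, y))  # 0.5
```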


Karush-Kuhn-Tucker (KKT) conditions

Primal and dual feasibility:

y_i(w · x_i + b) ≥ 1,  α_i ≥ 0    (2)

Stationarity:

w = Σ_{i=1}^m α_i y_i x_i,  Σ_{i=1}^m α_i y_i = 0    (3)

Complementary slackness:

α_i = 0 ∨ y_i(w · x_i + b) = 1    (4)
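The stationarity condition gives w directly from the dual solution; a small illustrative sketch (pure Python, with made-up values for α and the data):

```python
def recover_w(alpha, X, y):
    # Stationarity: w = sum_i alpha_i y_i x_i
    d = len(X[0])
    w = [0.0] * d
    for ai, xi, yi in zip(alpha, X, y):
        for k in range(d):
            w[k] += ai * yi * xi[k]
    return w

X = [(0.0, 1.0), (0.0, -1.0)]
y = [1, -1]
alpha = [0.5, 0.5]  # both points act as support vectors
print(recover_w(alpha, X, y))  # [0.0, 1.0]
```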


Slack variables

Outline

Duality

Slack variables

Sequential Minimal Optimization

Recap


Old objective function

min_{w,b} (1/2)||w||²    (5)

subject to y_i(w · x_i + b) ≥ 1, i ∈ [1, m]


Can SVMs Work Here?

y_i(w · x_i + b) ≥ 1    (6)


Trick: Allow for a few bad apples



Relaxing the constraint

y_i(w · x_i + b) ≥ 1 − ξ_i

• ξ_i = 0 means at least one margin on correct side of decision boundary
• ξ_i = 1/2 means at least one-half margin on correct side of decision boundary
• ξ_i = 2 means at least one margin on wrong side of decision boundary
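The minimal slack each point needs under a given (w, b) is ξ_i = max(0, 1 − y_i(w · x_i + b)); a small illustrative sketch (pure Python, data made up so one point sits inside the margin):

```python
def slacks(w, b, X, y):
    # Minimal xi_i with y_i (w . x_i + b) >= 1 - xi_i and xi_i >= 0
    out = []
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        out.append(max(0.0, 1.0 - margin))
    return out

X = [(0.0, 2.0), (0.0, 0.5), (0.0, -1.0)]
y = [1, 1, -1]
print(slacks((0.0, 1.0), 0.0, X, y))  # [0.0, 0.5, 0.0] -- middle point is inside the margin
```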


New objective function

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξ_i^p    (8)

subject to y_i(w · x_i + b) ≥ 1 − ξ_i ∧ ξ_i ≥ 0, i ∈ [1, m]

• Standard margin
• How wrong a point is (slack variables)
• Tradeoff between margin and slack variables
• How bad wrongness scales


Aside: Loss Functions

• Losses measure how bad a mistake is
• Important for slack as well

[Figure: 0/1 loss, linear hinge loss, and quadratic hinge loss as functions of the margin]

We’ll focus on the linear hinge loss, i.e., set p = 1.
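These three losses can be written down directly as functions of the signed margin z = y · f(x); a small illustrative sketch (pure Python; counting z = 0 as correct for the 0/1 loss is a convention assumed here):

```python
def zero_one(z):
    # 0/1 loss: 1 if misclassified (z < 0), else 0
    return 1.0 if z < 0 else 0.0

def hinge(z):
    # linear hinge loss: max(0, 1 - z)
    return max(0.0, 1.0 - z)

def quad_hinge(z):
    # quadratic hinge loss: max(0, 1 - z)^2
    return max(0.0, 1.0 - z) ** 2

for z in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(z, zero_one(z), hinge(z), quad_hinge(z))
```

Note how both hinge losses still penalize correctly classified points with z < 1, which is what drives the margin.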


What is the role of C?

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξ_i    (9)

subject to y_i(w · x_i + b) ≥ 1 − ξ_i ∧ ξ_i ≥ 0, i ∈ [1, m]

A. C ↑ ⇒ low bias, low variance
B. C ↑ ⇒ low bias, high variance
C. C ↑ ⇒ high bias, low variance
D. C ↑ ⇒ high bias, high variance


New Lagrangian

L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^m ξ_i    (10)
                   − Σ_{i=1}^m α_i [y_i(w · x_i + b) − 1 + ξ_i]    (11)
                   − Σ_{i=1}^m β_i ξ_i    (12)

Taking the gradients (∇_w L, ∇_b L, ∇_{ξ_i} L) and solving for zero gives us

w = Σ_{i=1}^m α_i y_i x_i    (13)
Σ_{i=1}^m α_i y_i = 0    (14)
α_i + β_i = C    (15)


Simplifying dual objective

w = Σ_{i=1}^m α_i y_i x_i,  Σ_{i=1}^m α_i y_i = 0,  α_i + β_i = C

L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^m ξ_i
                   − Σ_{i=1}^m α_i [y_i(w · x_i + b) − 1 + ξ_i]
                   − Σ_{i=1}^m β_i ξ_i


Dual Problem

max_α Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i · x_j)
s.t. C ≥ α_i ≥ 0, i ∈ [1, m]
     Σ_i α_i y_i = 0


Karush-Kuhn-Tucker (KKT) conditions

Primal and dual feasibility:

y_i(w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  C ≥ α_i ≥ 0,  β_i ≥ 0    (16)

Stationarity:

w = Σ_{i=1}^m α_i y_i x_i,  Σ_{i=1}^m α_i y_i = 0,  α_i + β_i = C    (17)

Complementary slackness:

α_i [y_i(w · x_i + b) − 1 + ξ_i] = 0,  β_i ξ_i = 0    (18)


More on Complementary Slackness

α_i [y_i(w · x_i + b) − 1 + ξ_i] = 0,  β_i ξ_i = 0    (19)

• x_i satisfies the margin, y_i(w · x_i + b) > 1 ⇒ α_i = 0
• x_i does not satisfy the margin, y_i(w · x_i + b) < 1 ⇒ α_i = C
• x_i is on the margin, y_i(w · x_i + b) = 1 ⇒ 0 ≤ α_i ≤ C
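These three cases can be read off from α alone; a small illustrative sketch (pure Python; the numerical tolerance and label strings are assumptions for illustration):

```python
def categorize(alpha, C, tol=1e-8):
    # Bucket each training point by its dual variable alpha_i
    labels = []
    for a in alpha:
        if a < tol:
            labels.append("non-SV")      # margin strictly satisfied
        elif a > C - tol:
            labels.append("bound SV")    # inside the margin or misclassified
        else:
            labels.append("margin SV")   # exactly on the margin
    return labels

print(categorize([0.0, 0.3, 1.0], C=1.0))  # ['non-SV', 'margin SV', 'bound SV']
```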


Sequential Minimal Optimization

Outline

Duality

Slack variables

Sequential Minimal Optimization

Recap


Sequential Minimal Optimization

Trivia:
• Invented by John Platt in 1998 at Microsoft Research
• Called Minimal due to solving small sub-problems


Dual problem

max_α Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i · x_j)
s.t. C ≥ α_i ≥ 0, i ∈ [1, m]
     Σ_i α_i y_i = 0


Brief Interlude: Coordinate Ascent

max_α Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i · x_j)
s.t. C ≥ α_i ≥ 0, i ∈ [1, m]
     Σ_i α_i y_i = 0

Loop over each training example, change α_i to maximize the above function.

Although coordinate ascent works OK for lots of problems, we have the constraint Σ_i α_i y_i = 0.
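The equality constraint is exactly what blocks one-at-a-time updates: holding every other α fixed, α_i is completely determined, so there is no room to ascend in a single coordinate. A small sketch of that observation (pure Python, values made up so the constraint holds):

```python
def forced_value(alpha, y, i):
    # With sum_j alpha_j y_j = 0 and all j != i held fixed,
    # alpha_i is forced to a single value -- no room to ascend.
    rest = sum(a * yj for j, (a, yj) in enumerate(zip(alpha, y)) if j != i)
    return -rest / y[i]

alpha = [0.5, 0.2, 0.3, 0.4]
y = [1, 1, -1, -1]                 # satisfies sum_i alpha_i y_i = 0
print(forced_value(alpha, y, 0))   # must equal its current value, 0.5
```

This is why SMO updates two α's at a time: a pair can move jointly along the constraint.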


Outline for SVM Optimization (SMO)

1. Select two examples i, j

2. Update αj, αi to maximize the above function


Outline for SVM Optimization (SMO)

y_i α_i + y_j α_j = y_i α_i^old + y_j α_j^old = γ


Step 2: Optimize αj

1. Compute upper (H) and lower (L) bounds that ensure 0 ≤ α_j ≤ C.

If y_i ≠ y_j:
    L = max(0, α_j − α_i)    (23)
    H = min(C, C + α_j − α_i)    (24)

If y_i = y_j:
    L = max(0, α_i + α_j − C)    (25)
    H = min(C, α_j + α_i)    (26)

This is because the update for α_i is based on y_i y_j (sign matters).


Step 2: Optimize αj

Compute errors for i and j:

E_k ≡ f(x_k) − y_k    (27)

η = 2 x_i · x_j − x_i · x_i − x_j · x_j    (28)

for the new value of α_j:

α_j* = α_j^(old) − y_j (E_i − E_j) / η    (29)
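Combining (27)–(29) with the clipping interval gives the whole α_j update; a small illustrative sketch (pure Python; the numbers plug in the i = 0, j = 4 pair from the toy example later in the deck, with C = 1 assumed so L = 0 and H = 1):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def update_alpha_j(alpha_j, y_j, E_i, E_j, x_i, x_j, L, H):
    # eta = 2 x_i.x_j - x_i.x_i - x_j.x_j  (negative for distinct points)
    eta = 2 * dot(x_i, x_j) - dot(x_i, x_i) - dot(x_j, x_j)
    if eta >= 0:
        return alpha_j  # skip: should only happen numerically
    new = alpha_j - y_j * (E_i - E_j) / eta
    return min(H, max(L, new))  # clip to [L, H]

# x_0 = (-2, 2) with E_0 = -1; x_4 = (0, -1) with y_4 = -1, E_4 = +1
aj_new = update_alpha_j(0.0, -1, -1.0, 1.0, (-2.0, 2.0), (0.0, -1.0), 0.0, 1.0)
print(aj_new)  # 2/13 ≈ 0.1538, since eta = -13 here
```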


Step 3: Optimize αi

Set α_i:

α_i* = α_i^(old) + y_i y_j (α_j^(old) − α_j)    (30)

This balances out the move that we made for α_j.


Overall algorithm

Repeat until the KKT conditions are met:
    Iterate over i = {1, . . . , m}:
        Choose j randomly from the m − 1 other options
        Update α_i, α_j
Find w, b based on stationarity conditions
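The loop above can be sketched end to end. This is a simplified SMO in pure Python: linear kernel, random j, and an iteration cap in place of Platt's full selection heuristics; the intercept update follows the standard simplified-SMO recipe, which the slides do not spell out, and C = 1 is assumed rather than the slides' C = π. It is run on the deck's six-point toy set:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smo_train(X, y, C=1.0, tol=1e-4, max_passes=20, seed=0):
    """Simplified SMO with a linear kernel and random j selection."""
    rng = random.Random(seed)
    m, d = len(X), len(X[0])
    alpha, b = [0.0] * m, 0.0

    def f(x):  # current decision function f(x) = sum_k alpha_k y_k (x_k . x) + b
        return sum(alpha[k] * y[k] * dot(X[k], x) for k in range(m)) + b

    passes, sweeps = 0, 0
    while passes < max_passes and sweeps < 1000:
        sweeps += 1
        changed = 0
        for i in range(m):
            E_i = f(X[i]) - y[i]
            # only touch alpha_i if it violates the KKT conditions
            if not ((y[i] * E_i < -tol and alpha[i] < C) or
                    (y[i] * E_i > tol and alpha[i] > 0)):
                continue
            j = rng.choice([k for k in range(m) if k != i])
            E_j = f(X[j]) - y[j]
            ai, aj = alpha[i], alpha[j]
            if y[i] != y[j]:
                L, H = max(0.0, aj - ai), min(C, C + aj - ai)
            else:
                L, H = max(0.0, ai + aj - C), min(C, ai + aj)
            eta = 2 * dot(X[i], X[j]) - dot(X[i], X[i]) - dot(X[j], X[j])
            if L == H or eta >= 0:
                continue
            alpha[j] = min(H, max(L, aj - y[j] * (E_i - E_j) / eta))
            if abs(alpha[j] - aj) < 1e-7:
                continue
            alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])  # balance the move
            # update the intercept from the two candidate values
            b1 = (b - E_i - y[i] * (alpha[i] - ai) * dot(X[i], X[i])
                  - y[j] * (alpha[j] - aj) * dot(X[i], X[j]))
            b2 = (b - E_j - y[i] * (alpha[i] - ai) * dot(X[i], X[j])
                  - y[j] * (alpha[j] - aj) * dot(X[j], X[j]))
            if 0 < alpha[i] < C:
                b = b1
            elif 0 < alpha[j] < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            changed += 1
        passes = 0 if changed else passes + 1
    # recover w from stationarity (valid for the linear kernel only)
    w = [sum(alpha[k] * y[k] * X[k][t] for k in range(m)) for t in range(d)]
    return w, b

X = [(-2, 2), (0, 4), (2, 1), (-2, -3), (0, -1), (2, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = smo_train(X, y)
print([1 if dot(w, x) + b >= 0 else -1 for x in X])
```

A production implementation would instead pick j by the second-choice heuristic (maximizing |E_i − E_j|) and cache the errors, as Platt's original algorithm does.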


Iterations / Details

• What if i doesn’t violate the KKT conditions? Skip it!
• What if η ≥ 0? Skip it! (should not happen except for numerical instability)
• When do we stop? When we go through the α’s without changing anything


SMO Algorithm

Positive examples: (−2, 2), (0, 4), (2, 1)
Negative examples: (−2, −3), (0, −1), (2, −3)

[Figure: the six points, indexed 0–5, plotted in the positive and negative regions]

• Initially, all alphas are zero: α = ⟨0, 0, 0, 0, 0, 0⟩
• Intercept b is also zero
• Capacity C = π


SMO Optimization for i = 0, j = 4: Predictions and Step

0

4

1

2

3 5

positive

negative

• Prediction: f (x0)

• Prediction: f (x4)

• Error: E0

• Error: E4

Machine Learning: Chenhao Tan | Boulder | 40 of 52

Page 73: Machine Learning: Chenhao Tan University of Colorado ... · Slides adapted from Jordan Boyd-Graber, Chris Ketelsen Machine Learning: Chenhao Tan j Boulder j 1 of 52. Roadmap Last

Sequential Mimimal Optimization

SMO Optimization for i = 0, j = 4: Predictions and Step

0

4

1

2

3 5

positive

negative

• Prediction: f (x0) = 0• Prediction: f (x4)

• Error: E0

• Error: E4

Machine Learning: Chenhao Tan | Boulder | 40 of 52

Page 74: Machine Learning: Chenhao Tan University of Colorado ... · Slides adapted from Jordan Boyd-Graber, Chris Ketelsen Machine Learning: Chenhao Tan j Boulder j 1 of 52. Roadmap Last

Sequential Mimimal Optimization

SMO Optimization for i = 0, j = 4: Predictions and Step

0

4

1

2

3 5

positive

negative

• Prediction: f (x0) = 0• Prediction: f (x4) = 0• Error: E0

• Error: E4

Machine Learning: Chenhao Tan | Boulder | 40 of 52

Page 75: Machine Learning: Chenhao Tan University of Colorado ... · Slides adapted from Jordan Boyd-Graber, Chris Ketelsen Machine Learning: Chenhao Tan j Boulder j 1 of 52. Roadmap Last

Sequential Mimimal Optimization

SMO Optimization for i = 0, j = 4: Predictions and Step

0

4

1

2

3 5

positive

negative

• Prediction: f (x0) = 0• Prediction: f (x4) = 0• Error: E0 = −1• Error: E4 = +1

Machine Learning: Chenhao Tan | Boulder | 40 of 52

Page 76: Machine Learning: Chenhao Tan University of Colorado ... · Slides adapted from Jordan Boyd-Graber, Chris Ketelsen Machine Learning: Chenhao Tan j Boulder j 1 of 52. Roadmap Last

Sequential Mimimal Optimization

SMO Optimization for i = 0, j = 4: Predictions and Step

0

4

1

2

3 5

positive

negative

• Prediction: f (x0) = 0• Prediction: f (x4) = 0• Error: E0 = −1• Error: E4 = +1

η = 2〈x0, x4〉 − 〈x0, x0〉 − 〈x4, x4〉

Machine Learning: Chenhao Tan | Boulder | 40 of 52

Page 77:

SMO Optimization for i = 0, j = 4: Predictions and Step

[Figure: the six training points, labeled 0–5]

• Prediction: f(x0) = 0
• Prediction: f(x4) = 0
• Error: E0 = −1
• Error: E4 = +1

η = 2〈x0, x4〉 − 〈x0, x0〉 − 〈x4, x4〉 = 2·(−2) − 8 − 1 = −13
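Plugging in x0 = (−2, 2) and x4 = (0, −1) reproduces these numbers (with all alphas zero, f(x) = 0, so each error is just −y):

```python
import numpy as np

x0, x4 = np.array([-2.0, 2.0]), np.array([0.0, -1.0])
y0, y4 = 1.0, -1.0

E0 = 0.0 - y0    # f(x0) = 0, so E0 = -1
E4 = 0.0 - y4    # f(x4) = 0, so E4 = +1
eta = 2 * x0 @ x4 - x0 @ x0 - x4 @ x4
print(E0, E4, eta)   # -1.0 1.0 -13.0
```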

Machine Learning: Chenhao Tan | Boulder | 40 of 52


Page 80:

SMO Optimization for i = 0, j = 4: Bounds

• Lower and upper bounds for αj (case yi ≠ yj):

L = max(0, αj − αi) = 0    (31)
H = min(C, C + αj − αi) = C = π    (32)
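With both alphas starting at zero, Eqs. (31)–(32) transcribe directly:

```python
import numpy as np

alpha_i, alpha_j, C = 0.0, 0.0, np.pi   # current alphas and capacity
L = max(0.0, alpha_j - alpha_i)          # Eq. (31)
H = min(C, C + alpha_j - alpha_i)        # Eq. (32)
print(L, H)   # 0.0 and pi: alpha_j may move anywhere in [0, C]
```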

Machine Learning: Chenhao Tan | Boulder | 41 of 52


Page 85:

SMO Optimization for i = 0, j = 4: α update

New value for αj:

α∗j = αj − yj(Ei − Ej)/η = −2/η = 2/13    (33)

New value for αi:

α∗i = αi + yi yj (α^(old)j − α∗j) = α∗j = 2/13    (34)
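With η = −13, E0 = −1, E4 = +1 and both alphas starting at zero, the update gives 2/13 for each; since 2/13 already lies in [L, H] = [0, π], no clipping is needed. A direct check (hypothetical variable names):

```python
eta, E_i, E_j = -13.0, -1.0, 1.0
y_i, y_j = 1.0, -1.0            # labels of points 0 and 4
ai_old = aj_old = 0.0

aj_new = aj_old - y_j * (E_i - E_j) / eta        # Eq. (33): 2/13
ai_new = ai_old + y_i * y_j * (aj_old - aj_new)  # Eq. (34): also 2/13
print(aj_new, ai_new)   # 0.1538... 0.1538...
```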

Machine Learning: Chenhao Tan | Boulder | 42 of 52

Page 86:

Margin

[Figure not included in transcript]

Machine Learning: Chenhao Tan | Boulder | 43 of 52


Page 90:

Find weight vector and bias

• Weight vector

w = Σᵢ αᵢ yᵢ xᵢ = (2/13)·(−2, 2) − (2/13)·(0, −1) = (−4/13, 6/13)    (35)

• Bias

b = b^(old) − Ei − yi(α∗i − α^(old)i) xi·xi − yj(α∗j − α^(old)j) xi·xj    (36)
  = 1 − (2/13)·8 + (2/13)·(−2) = −0.54    (37)

Machine Learning: Chenhao Tan | Boulder | 44 of 52


Page 94:

SMO Optimization for i = 2, j = 4

[Figure: the six training points, labeled 0–5]

Let's skip the boring stuff:
• E2 = −1.69
• E4 = 0.00
• η = −8
• α4 = α^(old)4 − y4(E2 − E4)/η = 0.15 + (−1.69)/(−8) ≈ 0.37
• α2 = α^(old)2 + y2 y4 (α^(old)4 − α4) = 0 − (0.15 − 0.37) ≈ 0.21
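Plugging the slide's (rounded) errors into the same formulas reproduces the new alphas:

```python
eta = -8.0
E2, E4 = -1.69, 0.0                 # errors as rounded on the slide
y2, y4 = 1.0, -1.0
a2_old, a4_old = 0.0, 2.0 / 13.0    # 2/13 is about 0.15

a4_new = a4_old - y4 * (E2 - E4) / eta           # about 0.37
a2_new = a2_old + y2 * y4 * (a4_old - a4_new)    # about 0.21
print(a4_new, a2_new)
```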

Machine Learning: Chenhao Tan | Boulder | 45 of 52

Page 95:

Margin

[Figure not included in transcript]

Machine Learning: Chenhao Tan | Boulder | 46 of 52


Page 97:

Weight vector and bias

• Bias b = −0.12
• Weight vector

w = Σᵢ αᵢ yᵢ xᵢ = (0.12, 0.88)    (38)

Machine Learning: Chenhao Tan | Boulder | 47 of 52

Page 98:

Another Iteration (i = 0, j = 2)

[Figure not included in transcript]

Machine Learning: Chenhao Tan | Boulder | 48 of 52

Page 99:

SMO Algorithm

• Convenient approach for all the variants we've seen: vanilla, slack, and kernel SVMs
• Convex problem
• Scales to large datasets (implemented in scikit-learn)
• What we didn't do:
◦ Check the KKT conditions
◦ Randomly choose indices
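Putting the pieces from this lecture together, a minimal "simplified SMO" can be sketched as below. It sweeps all index pairs deterministically instead of using Platt's KKT bookkeeping and random second-index choice, so it is a teaching sketch rather than the production algorithm, but it already separates the toy data:

```python
import numpy as np

# Toy data from the lecture.
X = np.array([[-2.0, 2.0], [0.0, 4.0], [2.0, 1.0],
              [-2.0, -3.0], [0.0, -1.0], [2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

def simplified_smo(X, y, C=np.pi, max_sweeps=50):
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    K = X @ X.T                               # Gram matrix of inner products
    for _ in range(max_sweeps):
        changed = 0
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                Ei = (alpha * y) @ K[:, i] + b - y[i]
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:                  # skip degenerate pairs
                    continue
                if y[i] != y[j]:              # bounds on the new alpha_j
                    L = max(0.0, alpha[j] - alpha[i])
                    H = min(C, C + alpha[j] - alpha[i])
                else:
                    L = max(0.0, alpha[i] + alpha[j] - C)
                    H = min(C, alpha[i] + alpha[j])
                if L >= H:
                    continue
                ai_old, aj_old = alpha[i], alpha[j]
                aj = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(aj - aj_old) < 1e-8:   # no meaningful progress
                    continue
                ai = ai_old + y[i] * y[j] * (aj_old - aj)
                alpha[i], alpha[j] = ai, aj
                # Bias update from the lecture's Eq. (36).
                b = b - Ei - y[i] * (ai - ai_old) * K[i, i] \
                          - y[j] * (aj - aj_old) * K[i, j]
                changed += 1
        if changed == 0:                      # a clean sweep: stop
            break
    w = (alpha * y) @ X
    return w, b

w, b = simplified_smo(X, y)
print(np.sign(X @ w + b))   # should match y: all six points classified correctly
```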

Machine Learning: Chenhao Tan | Boulder | 49 of 52

Page 100:

Recap

Outline

Duality

Slack variables

Sequential Minimal Optimization

Recap

Machine Learning: Chenhao Tan | Boulder | 50 of 52

Page 101:

Recap

• Duality
• Slack variables
• SMO: optimize the objective function over two data points at a time
• Convex problem: will converge
• Relatively fast
• Gives good performance

Machine Learning: Chenhao Tan | Boulder | 51 of 52

Page 103:

Wrapup

• Adding slack variables doesn't break the SVM problem
• Very popular algorithm:
◦ SVMLight (many options)
◦ LIBSVM / LIBLINEAR (very fast)
◦ Weka (friendly)
◦ PyML (Python focused, from Colorado)
• Next up: the kernel trick

Machine Learning: Chenhao Tan | Boulder | 52 of 52