Page 1:

Optimization for Machine Learning
Lecture 4: SMO-MKL

S.V.N. (vishy) Vishwanathan

Purdue University
[email protected]

July 11, 2012

S.V.N. Vishwanathan (Purdue University) Optimization for Machine Learning 1 / 22

Page 2:

Motivation

Binary Classification

[Figure: linearly separable training points labelled y_i = −1 and y_i = +1, with the separating hyperplane {x | ⟨w, x⟩ + b = 0} and the margin hyperplanes {x | ⟨w, x⟩ + b = −1} and {x | ⟨w, x⟩ + b = 1}; the points x1 and x2 lie on the two margin hyperplanes.]

⟨w, x1⟩ + b = +1
⟨w, x2⟩ + b = −1
⟨w, x1 − x2⟩ = 2
⟨w/‖w‖, x1 − x2⟩ = 2/‖w‖

Page 6:

Motivation

Linear Support Vector Machines

Optimization Problem

  min_{w,b,ξ}  (1/2) ‖w‖² + C ∑_{i=1}^m ξ_i
  s.t.  y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0
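As a side illustration (not the solver discussed in these slides), the soft-margin objective above can be minimized directly by subgradient descent on its hinge-loss form; the data, step size, and iteration count below are invented for the sketch:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i(<w, x_i> + b))."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # points with nonzero hinge loss (xi_i > 0)
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # should match y on this toy set
```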

Page 7:

Motivation

The Kernel Trick

[Figure: training data plotted in the original (x, y) coordinates; the two classes are not linearly separable.]


Page 9:

Motivation

The Kernel Trick

[Figure: the same data replotted in the (x, x² + y²) coordinates; the two classes become linearly separable.]
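To make the picture concrete, here is a toy numerical version of the mapping on the slide (the ring data and radii are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 50)

# Hypothetical data: an inner class surrounded by an outer ring (not linearly separable).
inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]
outer = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]

def lift(p):
    """The slide's map: (x, y) -> (x, x^2 + y^2)."""
    return np.c_[p[:, 0], p[:, 0] ** 2 + p[:, 1] ** 2]

# After lifting, a horizontal line separates the classes:
print(lift(inner)[:, 1].max() < lift(outer)[:, 1].min())  # → True
```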

Page 10:

Motivation

Kernel Trick

Optimization Problem

  min_{w,b,ξ}  (1/2) ‖w‖² + C ∑_{i=1}^m ξ_i
  s.t.  y_i (⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0

Page 11:

Motivation

Kernel Trick

Optimization Problem

  max_α  −(1/2) α^T H α + 1^T α
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0

  H_ij = y_i y_j ⟨φ(x_i), φ(x_j)⟩

Page 12:

Motivation

Kernel Trick

Optimization Problem

  max_α  −(1/2) α^T H α + 1^T α
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0

  H_ij = y_i y_j k(x_i, x_j)
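For concreteness, the matrix H in the dual can be formed from any kernel function. This sketch (RBF kernel, toy XOR-style data, all values invented) builds H and evaluates the dual objective at a feasible α:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def dual_objective(alpha, H):
    """-1/2 alpha^T H alpha + 1^T alpha."""
    return -0.5 * alpha @ H @ alpha + alpha.sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])      # XOR-style labels
H = np.outer(y, y) * rbf_kernel(X)        # H_ij = y_i y_j k(x_i, x_j)

alpha = np.full(4, 0.5)                   # feasible: 0 <= alpha_i <= C, sum_i alpha_i y_i = 0
print(dual_objective(alpha, H))
```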

Page 13:

Motivation

Key Question

Which kernel should I use?

The Multiple Kernel Learning Answer

Cook up as many (base) kernels as you can

Compute a data-dependent kernel function as a linear combination of base kernels:

  k(x, x′) = ∑_k d_k k_k(x, x′)  s.t.  d_k ≥ 0

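A minimal sketch of the combination above, assuming precomputed Gram matrices for each base kernel (the kernels and weights here are invented):

```python
import numpy as np

def combined_kernel(grams, d):
    """k(x, x') = sum_k d_k k_k(x, x'), with d_k >= 0, on precomputed Gram matrices."""
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0), "kernel weights must be nonnegative"
    return sum(dk * G for dk, G in zip(d, grams))

X = np.random.default_rng(1).normal(size=(5, 3))
lin = X @ X.T                    # linear kernel
quad = (1 + X @ X.T) ** 2        # polynomial kernel (degree 2)
K = combined_kernel([lin, quad], [0.7, 0.3])

# A nonnegative combination of PSD Gram matrices stays PSD:
print(np.linalg.eigvalsh(K).min() >= -1e-9)  # → True
```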

Page 15:

Motivation

Object Detection

Localize a specified object of interest if it exists in a given image

Page 16:

Motivation

Some Examples of MKL Detection

Page 17:

Motivation

Summary of Our Results

Sonar Dataset with 800 kernels

  p      Training Time (s)       # Kernels Selected
         SMO-MKL    Shogun       SMO-MKL    Shogun
  1.1      4.71      47.43         91.20    258.00
  1.33     3.21      19.94        248.20    374.20
  2.0      3.39      34.67        661.20    664.80

Web dataset: ≈ 50,000 points and 50 kernels trains in ≈ 30 minutes

Sonar with a hundred thousand kernels:
  Precomputed kernels: ≈ 8 minutes
  Kernels computed on-the-fly: ≈ 30 minutes

Page 18:

Motivation

Setting up the Optimization Problem - I

The Setup

We are given n kernel functions k_1, …, k_n with corresponding feature maps φ_1(·), …, φ_n(·).
We are interested in deriving the feature map

  φ(x) = (√d_1 φ_1(x); … ; √d_n φ_n(x))   (stacked)

Page 19:

Motivation

Setting up the Optimization Problem - I

The Setup

We are given n kernel functions k_1, …, k_n with corresponding feature maps φ_1(·), …, φ_n(·).
We are interested in deriving the feature map

  φ(x) = (√d_1 φ_1(x); … ; √d_n φ_n(x))   ⇒   w = (w_1; … ; w_n)   (stacked)
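The stacked map can be checked numerically: the inner product of the concatenated features equals the weighted sum of base kernels. The two explicit feature maps and weights below are invented for the check:

```python
import numpy as np

rng = np.random.default_rng(2)
x, xp = rng.normal(size=3), rng.normal(size=3)
d = np.array([0.4, 1.6])                  # hypothetical kernel weights, d_k >= 0

phi1 = lambda v: v                        # feature map of k1(x, x') = <x, x'>
phi2 = lambda v: np.outer(v, v).ravel()   # feature map of k2(x, x') = <x, x'>^2

# Stacked map from the slide: phi(x) = (sqrt(d1) phi1(x); sqrt(d2) phi2(x))
phi = lambda v: np.concatenate([np.sqrt(d[0]) * phi1(v), np.sqrt(d[1]) * phi2(v)])

lhs = phi(x) @ phi(xp)
rhs = d[0] * (x @ xp) + d[1] * (x @ xp) ** 2   # sum_k d_k k_k(x, x')
print(np.isclose(lhs, rhs))  # → True
```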

Page 20:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_{w,b,ξ}  (1/2) ‖w‖² + C ∑_{i=1}^m ξ_i
  s.t.  y_i (⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0

Page 21:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖² + C ∑_{i=1}^m ξ_i
  s.t.  y_i (∑_k √d_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0
        d_k ≥ 0  for all k

Page 22:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖² + C ∑_{i=1}^m ξ_i + (ρ/2) (∑_k d_k^p)^{2/p}
  s.t.  y_i (∑_k √d_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0
        d_k ≥ 0  for all k

Page 23:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖²/d_k + C ∑_{i=1}^m ξ_i + (ρ/2) (∑_k d_k^p)^{2/p}
  s.t.  y_i (∑_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0
        d_k ≥ 0  for all k

Page 24:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_d max_α  −(1/2) ∑_k d_k α^T H_k α + 1^T α + (ρ/2) (∑_k d_k^p)^{2/p}
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0
        d_k ≥ 0

Page 25:

Motivation

Saddle Point Problem

[Figure: 3-D surface of the objective as a function of d and α, illustrating its saddle-point structure.]

Page 26:

Motivation

Solving the Saddle Point

Saddle Point Problem

  min_d max_α  −(1/2) ∑_k d_k α^T H_k α + 1^T α + (ρ/2) (∑_k d_k^p)^{2/p}
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0
        d_k ≥ 0

Page 27:

Our Approach

The Key Insight

Eliminate d:

  D(α) := max_α  −(1/(8ρ)) (∑_k (α^T H_k α)^q)^{2/q} + 1^T α
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0

  where 1/p + 1/q = 1

Not a QP but very close to one!
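The eliminated form can be sanity-checked numerically. For p = q = 2 the inner minimization over d has the closed-form solution d_k = (α^T H_k α)/(2ρ), which yields the −(1/(8ρ))(·) constant. This toy check (random a_k values standing in for α^T H_k α) is an illustration under those assumptions, not the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(3)
K, rho = 4, 0.5
a = rng.uniform(0.1, 2.0, size=K)   # stands in for a_k = alpha^T H_k alpha >= 0

def inner(d):
    """Inner objective in d (alpha fixed), p = 2: -1/2 sum_k d_k a_k + rho/2 sum_k d_k^2."""
    return -0.5 * d @ a + 0.5 * rho * (d @ d)

d_star = a / (2 * rho)               # closed-form minimizer for p = q = 2
val_closed = -(a @ a) / (8 * rho)    # -(1/(8 rho)) (sum_k a_k^2)^{2/2}

assert np.isclose(inner(d_star), val_closed)
# Random feasible perturbations never do better than the closed form:
for _ in range(100):
    d = np.abs(d_star + rng.normal(scale=0.3, size=K))
    assert inner(d) >= val_closed - 1e-12
print("closed form verified")
```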

Page 28:

Our Approach

SMO-MKL: High Level Overview

  D(α) := max_α  −(1/(8ρ)) (∑_k (α^T H_k α)^q)^{2/q} + 1^T α
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0

Algorithm

Choose two variables α_i and α_j to optimize

Solve the one-dimensional reduced optimization problem

Repeat until convergence
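To illustrate the two-variable pattern, here is a toy SMO loop for the plain SVM dual (no MKL term); pairs are picked at random rather than by a greedy rule, and all data are invented — a sketch of the algorithmic idea, not the authors' SMO-MKL solver:

```python
import numpy as np

def smo_solve(H, y, C, iters=2000, seed=0):
    """Toy SMO for min_a 1/2 a^T H a - 1^T a, s.t. 0 <= a_i <= C, sum_i a_i y_i = 0."""
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha = np.zeros(m)

    def bounds(val, s):  # range of t keeping val + s*t inside [0, C], s in {+1, -1}
        return (-val, C - val) if s > 0 else (val - C, val)

    for _ in range(iters):
        i, j = rng.choice(m, size=2, replace=False)
        d = np.zeros(m)
        d[i], d[j] = y[i], -y[j]        # moving along d preserves sum_l alpha_l y_l = 0
        g = (H @ alpha - 1.0) @ d       # directional derivative
        h = d @ H @ d                   # directional curvature
        if h <= 1e-12:
            continue
        t = -g / h                      # unconstrained 1-D Newton step
        lo_i, hi_i = bounds(alpha[i], y[i])
        lo_j, hi_j = bounds(alpha[j], -y[j])
        t = float(np.clip(t, max(lo_i, lo_j), min(hi_i, hi_j)))  # box clipping
        alpha += t * d
    return alpha

# Toy check on separable data with a linear kernel (all values hypothetical):
X = np.array([[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
H = np.outer(y, y) * (X @ X.T)
alpha = smo_solve(H, y, C=10.0)
w = (alpha * y) @ X                     # primal weights recovered from the dual
print(np.sign(X @ w))                   # should match y on this toy set
```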

Page 29:

Our Approach

SMO-MKL: High Level Overview

Selecting the Working Set

Compute directional derivative and directional Hessian

Greedily select the variables

Solving the Reduced Problem

Analytic solution for p = q = 2 (a one-dimensional quartic)

For other values of p, use Newton-Raphson
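For the Newton-Raphson case, the reduced problem is a one-dimensional minimization solved via its stationarity condition. The quartic objective below is a hypothetical stand-in, not the actual reduced objective from the slides:

```python
def newton_1d(df, d2f, t0=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson on the stationarity condition df(t) = 0 of a 1-D problem."""
    t = t0
    for _ in range(max_iter):
        step = df(t) / d2f(t)
        t -= step
        if abs(step) < tol:
            break
    return t

# Hypothetical stand-in with the quartic structure mentioned on the slide:
f   = lambda t: 0.25 * t**4 + 0.5 * t**2 - t   # convex, so the stationary point is the minimum
df  = lambda t: t**3 + t - 1
d2f = lambda t: 3 * t**2 + 1

t_star = newton_1d(df, d2f)
print(abs(df(t_star)) < 1e-8)  # stationary point found
```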


Page 31:

Experiments

Generalization Performance

[Figure: bar chart of test accuracy (%) versus p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0} on the Australian dataset, comparing SMO-MKL and Shogun; accuracies fall between 80% and 90%.]

Page 32:

Experiments

Generalization Performance

[Figure: bar chart of test accuracy (%) versus p on the Ionosphere dataset, comparing SMO-MKL and Shogun; accuracies fall between 88% and 94%.]

Page 33:

Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: log-log plot of CPU time (seconds) versus number of training examples (roughly 10^3.5 to 10^4.5), comparing SMO-MKL and Shogun.]

Page 34:

Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: log-log plot of observed SMO-MKL CPU time (seconds) versus number of training examples, with an O(n^1.9) reference fit.]

Page 35:

Experiments

On Another Dataset

Web: 300 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: log-log plot of CPU time (seconds) versus number of training examples on the Web dataset.]

Page 36:

Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1

[Figure: log-log plot of CPU time (seconds) versus number of kernels (10^4 to 10^5) on Sonar.]

Page 37:

Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1

[Figure: log-log plot of SMO-MKL CPU time (seconds) versus number of kernels, with an O(n) reference line.]

Page 38:

Experiments

On Another Dataset

Real-sim: 72,309 examples, 20,958 dimensions, p = 1.33, C = 1

[Figure: log-log plot of SMO-MKL CPU time (seconds) versus number of training examples on Real-sim, with an O(n) reference line.]

Page 39:

References

References

S. V. N. Vishwanathan, Zhaonan Sun, Theera-Ampornpunt, and Manik Varma. Multiple Kernel Learning and the SMO Algorithm. NIPS 2010, pages 2361-2369.

Code available for download from http://research.microsoft.com/en-us/um/people/manik/code/SMO-MKL/download.html