Page 1:

Optimization for Machine Learning
Lecture 4: SMO-MKL

S.V.N. (vishy) Vishwanathan

Purdue University
[email protected]

July 11, 2012

S.V.N. Vishwanathan (Purdue University) Optimization for Machine Learning 1 / 22

Page 2:

Motivation

Binary Classification

[Figure: linearly separable training points labelled y_i = −1 and y_i = +1, with the separating hyperplane {x | ⟨w, x⟩ + b = 0} and the margin hyperplanes {x | ⟨w, x⟩ + b = −1} and {x | ⟨w, x⟩ + b = 1}; the points x1 and x2 lie on the two margin hyperplanes.]

⟨w, x1⟩ + b = +1
⟨w, x2⟩ + b = −1
⟨w, x1 − x2⟩ = 2
⟨w/‖w‖, x1 − x2⟩ = 2/‖w‖

Page 6:

Motivation

Linear Support Vector Machines

Optimization Problem

  min_{w,b,ξ}  (1/2) ‖w‖² + C ∑_{i=1}^m ξ_i
  s.t.  y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0
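As a side illustration (not the solver discussed in these slides), the soft-margin objective above can be minimized directly by subgradient descent on its hinge-loss form; the data, step size, and iteration count below are invented for the sketch:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i(<w, x_i> + b))."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # points with nonzero hinge loss (xi_i > 0)
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # should match y on this toy set
```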

Page 7:

Motivation

The Kernel Trick

[Figure: training data plotted in the original (x, y) coordinates; the two classes are not linearly separable.]


Page 9:

Motivation

The Kernel Trick

[Figure: the same data replotted in the (x, x² + y²) coordinates; the two classes become linearly separable.]
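To make the picture concrete, here is a toy numerical version of the mapping on the slide (the ring data and radii are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 50)

# Hypothetical data: an inner class surrounded by an outer ring (not linearly separable).
inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]
outer = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]

def lift(p):
    """The slide's map: (x, y) -> (x, x^2 + y^2)."""
    return np.c_[p[:, 0], p[:, 0] ** 2 + p[:, 1] ** 2]

# After lifting, a horizontal line separates the classes:
print(lift(inner)[:, 1].max() < lift(outer)[:, 1].min())  # → True
```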

Page 10:

Motivation

Kernel Trick

Optimization Problem

  min_{w,b,ξ}  (1/2) ‖w‖² + C ∑_{i=1}^m ξ_i
  s.t.  y_i (⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0

Page 11:

Motivation

Kernel Trick

Optimization Problem

  max_α  −(1/2) α^T H α + 1^T α
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0

  H_ij = y_i y_j ⟨φ(x_i), φ(x_j)⟩

Page 12:

Motivation

Kernel Trick

Optimization Problem

  max_α  −(1/2) α^T H α + 1^T α
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0

  H_ij = y_i y_j k(x_i, x_j)
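For concreteness, the matrix H in the dual can be formed from any kernel function. This sketch (RBF kernel, toy XOR-style data, all values invented) builds H and evaluates the dual objective at a feasible α:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def dual_objective(alpha, H):
    """-1/2 alpha^T H alpha + 1^T alpha."""
    return -0.5 * alpha @ H @ alpha + alpha.sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])      # XOR-style labels
H = np.outer(y, y) * rbf_kernel(X)        # H_ij = y_i y_j k(x_i, x_j)

alpha = np.full(4, 0.5)                   # feasible: 0 <= alpha_i <= C, sum_i alpha_i y_i = 0
print(dual_objective(alpha, H))
```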

Page 13:

Motivation

Key Question

Which kernel should I use?

The Multiple Kernel Learning Answer

Cook up as many (base) kernels as you can

Compute a data-dependent kernel function as a linear combination of base kernels:

  k(x, x′) = ∑_k d_k k_k(x, x′)  s.t.  d_k ≥ 0

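A minimal sketch of the combination above, assuming precomputed Gram matrices for each base kernel (the kernels and weights here are invented):

```python
import numpy as np

def combined_kernel(grams, d):
    """k(x, x') = sum_k d_k k_k(x, x'), with d_k >= 0, on precomputed Gram matrices."""
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0), "kernel weights must be nonnegative"
    return sum(dk * G for dk, G in zip(d, grams))

X = np.random.default_rng(1).normal(size=(5, 3))
lin = X @ X.T                    # linear kernel
quad = (1 + X @ X.T) ** 2        # polynomial kernel (degree 2)
K = combined_kernel([lin, quad], [0.7, 0.3])

# A nonnegative combination of PSD Gram matrices stays PSD:
print(np.linalg.eigvalsh(K).min() >= -1e-9)  # → True
```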

Page 15:

Motivation

Object Detection

Localize a specified object of interest if it exists in a given image

Page 16:

Motivation

Some Examples of MKL Detection

Page 17:

Motivation

Summary of Our Results

Sonar Dataset with 800 kernels

  p      Training Time (s)       # Kernels Selected
         SMO-MKL    Shogun       SMO-MKL    Shogun
  1.1      4.71      47.43         91.20    258.00
  1.33     3.21      19.94        248.20    374.20
  2.0      3.39      34.67        661.20    664.80

Web dataset: ≈ 50,000 points and 50 kernels trains in ≈ 30 minutes

Sonar with a hundred thousand kernels:
  Precomputed kernels: ≈ 8 minutes
  Kernels computed on-the-fly: ≈ 30 minutes

Page 18:

Motivation

Setting up the Optimization Problem - I

The Setup

We are given n kernel functions k_1, …, k_n with corresponding feature maps φ_1(·), …, φ_n(·).
We are interested in deriving the feature map

  φ(x) = (√d_1 φ_1(x); … ; √d_n φ_n(x))   (stacked)

Page 19:

Motivation

Setting up the Optimization Problem - I

The Setup

We are given n kernel functions k_1, …, k_n with corresponding feature maps φ_1(·), …, φ_n(·).
We are interested in deriving the feature map

  φ(x) = (√d_1 φ_1(x); … ; √d_n φ_n(x))   ⇒   w = (w_1; … ; w_n)   (stacked)
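The stacked map can be checked numerically: the inner product of the concatenated features equals the weighted sum of base kernels. The two explicit feature maps and weights below are invented for the check:

```python
import numpy as np

rng = np.random.default_rng(2)
x, xp = rng.normal(size=3), rng.normal(size=3)
d = np.array([0.4, 1.6])                  # hypothetical kernel weights, d_k >= 0

phi1 = lambda v: v                        # feature map of k1(x, x') = <x, x'>
phi2 = lambda v: np.outer(v, v).ravel()   # feature map of k2(x, x') = <x, x'>^2

# Stacked map from the slide: phi(x) = (sqrt(d1) phi1(x); sqrt(d2) phi2(x))
phi = lambda v: np.concatenate([np.sqrt(d[0]) * phi1(v), np.sqrt(d[1]) * phi2(v)])

lhs = phi(x) @ phi(xp)
rhs = d[0] * (x @ xp) + d[1] * (x @ xp) ** 2   # sum_k d_k k_k(x, x')
print(np.isclose(lhs, rhs))  # → True
```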

Page 20:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_{w,b,ξ}  (1/2) ‖w‖² + C ∑_{i=1}^m ξ_i
  s.t.  y_i (⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0

Page 21:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖² + C ∑_{i=1}^m ξ_i
  s.t.  y_i (∑_k √d_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0
        d_k ≥ 0  for all k

Page 22:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖² + C ∑_{i=1}^m ξ_i + (ρ/2) (∑_k d_k^p)^{2/p}
  s.t.  y_i (∑_k √d_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0
        d_k ≥ 0  for all k

Page 23:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖²/d_k + C ∑_{i=1}^m ξ_i + (ρ/2) (∑_k d_k^p)^{2/p}
  s.t.  y_i (∑_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
        ξ_i ≥ 0
        d_k ≥ 0  for all k

Page 24:

Motivation

Setting up the Optimization Problem

Optimization Problem

  min_d max_α  −(1/2) ∑_k d_k α^T H_k α + 1^T α + (ρ/2) (∑_k d_k^p)^{2/p}
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0
        d_k ≥ 0

Page 25:

Motivation

Saddle Point Problem

[Figure: 3-D surface of the objective as a function of d and α, illustrating its saddle-point structure.]

Page 26:

Motivation

Solving the Saddle Point

Saddle Point Problem

  min_d max_α  −(1/2) ∑_k d_k α^T H_k α + 1^T α + (ρ/2) (∑_k d_k^p)^{2/p}
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0
        d_k ≥ 0

Page 27:

Our Approach

The Key Insight

Eliminate d:

  D(α) := max_α  −(1/(8ρ)) (∑_k (α^T H_k α)^q)^{2/q} + 1^T α
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0

  where 1/p + 1/q = 1

Not a QP but very close to one!
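The eliminated form can be sanity-checked numerically. For p = q = 2 the inner minimization over d has the closed-form solution d_k = (α^T H_k α)/(2ρ), which yields the −(1/(8ρ))(·) constant. This toy check (random a_k values standing in for α^T H_k α) is an illustration under those assumptions, not the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(3)
K, rho = 4, 0.5
a = rng.uniform(0.1, 2.0, size=K)   # stands in for a_k = alpha^T H_k alpha >= 0

def inner(d):
    """Inner objective in d (alpha fixed), p = 2: -1/2 sum_k d_k a_k + rho/2 sum_k d_k^2."""
    return -0.5 * d @ a + 0.5 * rho * (d @ d)

d_star = a / (2 * rho)               # closed-form minimizer for p = q = 2
val_closed = -(a @ a) / (8 * rho)    # -(1/(8 rho)) (sum_k a_k^2)^{2/2}

assert np.isclose(inner(d_star), val_closed)
# Random feasible perturbations never do better than the closed form:
for _ in range(100):
    d = np.abs(d_star + rng.normal(scale=0.3, size=K))
    assert inner(d) >= val_closed - 1e-12
print("closed form verified")
```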

Page 28:

Our Approach

SMO-MKL: High Level Overview

  D(α) := max_α  −(1/(8ρ)) (∑_k (α^T H_k α)^q)^{2/q} + 1^T α
  s.t.  0 ≤ α_i ≤ C
        ∑_i α_i y_i = 0

Algorithm

Choose two variables α_i and α_j to optimize

Solve the one-dimensional reduced optimization problem

Repeat until convergence
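To illustrate the two-variable pattern, here is a toy SMO loop for the plain SVM dual (no MKL term); pairs are picked at random rather than by a greedy rule, and all data are invented — a sketch of the algorithmic idea, not the authors' SMO-MKL solver:

```python
import numpy as np

def smo_solve(H, y, C, iters=2000, seed=0):
    """Toy SMO for min_a 1/2 a^T H a - 1^T a, s.t. 0 <= a_i <= C, sum_i a_i y_i = 0."""
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha = np.zeros(m)

    def bounds(val, s):  # range of t keeping val + s*t inside [0, C], s in {+1, -1}
        return (-val, C - val) if s > 0 else (val - C, val)

    for _ in range(iters):
        i, j = rng.choice(m, size=2, replace=False)
        d = np.zeros(m)
        d[i], d[j] = y[i], -y[j]        # moving along d preserves sum_l alpha_l y_l = 0
        g = (H @ alpha - 1.0) @ d       # directional derivative
        h = d @ H @ d                   # directional curvature
        if h <= 1e-12:
            continue
        t = -g / h                      # unconstrained 1-D Newton step
        lo_i, hi_i = bounds(alpha[i], y[i])
        lo_j, hi_j = bounds(alpha[j], -y[j])
        t = float(np.clip(t, max(lo_i, lo_j), min(hi_i, hi_j)))  # box clipping
        alpha += t * d
    return alpha

# Toy check on separable data with a linear kernel (all values hypothetical):
X = np.array([[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
H = np.outer(y, y) * (X @ X.T)
alpha = smo_solve(H, y, C=10.0)
w = (alpha * y) @ X                     # primal weights recovered from the dual
print(np.sign(X @ w))                   # should match y on this toy set
```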

Page 29:

Our Approach

SMO-MKL: High Level Overview

Selecting the Working Set

Compute directional derivative and directional Hessian

Greedily select the variables

Solving the Reduced Problem

Analytic solution for p = q = 2 (a one-dimensional quartic)

For other values of p, use Newton-Raphson
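For the Newton-Raphson case, the reduced problem is a one-dimensional minimization solved via its stationarity condition. The quartic objective below is a hypothetical stand-in, not the actual reduced objective from the slides:

```python
def newton_1d(df, d2f, t0=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson on the stationarity condition df(t) = 0 of a 1-D problem."""
    t = t0
    for _ in range(max_iter):
        step = df(t) / d2f(t)
        t -= step
        if abs(step) < tol:
            break
    return t

# Hypothetical stand-in with the quartic structure mentioned on the slide:
f   = lambda t: 0.25 * t**4 + 0.5 * t**2 - t   # convex, so the stationary point is the minimum
df  = lambda t: t**3 + t - 1
d2f = lambda t: 3 * t**2 + 1

t_star = newton_1d(df, d2f)
print(abs(df(t_star)) < 1e-8)  # stationary point found
```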


Page 31:

Experiments

Generalization Performance

[Figure: bar chart of test accuracy (%) versus p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0} on the Australian dataset, comparing SMO-MKL and Shogun; accuracies fall between 80% and 90%.]

Page 32:

Experiments

Generalization Performance

[Figure: bar chart of test accuracy (%) versus p on the Ionosphere dataset, comparing SMO-MKL and Shogun; accuracies fall between 88% and 94%.]

Page 33:

Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: log-log plot of CPU time (seconds) versus number of training examples (roughly 10^3.5 to 10^4.5), comparing SMO-MKL and Shogun.]

Page 34:

Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: log-log plot of observed SMO-MKL CPU time (seconds) versus number of training examples, with an O(n^1.9) reference fit.]

Page 35:

Experiments

On Another Dataset

Web: 300 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: log-log plot of CPU time (seconds) versus number of training examples on the Web dataset.]

Page 36:

Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1

[Figure: log-log plot of CPU time (seconds) versus number of kernels (10^4 to 10^5) on Sonar.]

Page 37:

Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1

[Figure: log-log plot of SMO-MKL CPU time (seconds) versus number of kernels, with an O(n) reference line.]

Page 38:

Experiments

On Another Dataset

Real-sim: 72,309 examples, 20,958 dimensions, p = 1.33, C = 1

[Figure: log-log plot of SMO-MKL CPU time (seconds) versus number of training examples on Real-sim, with an O(n) reference line.]

Page 39:

References

References

S. V. N. Vishwanathan, Zhaonan Sun, Theera-Ampornpunt, and Manik Varma. Multiple Kernel Learning and the SMO Algorithm. NIPS 2010, pages 2361-2369.

Code available for download from http://research.microsoft.com/en-us/um/people/manik/code/SMO-MKL/download.html