Transcript of AI Lecture 15: Support Vector Machines

    Artificial Intelligence

    Support Vector Machines

    Stephan Dreiseitl

    FH Hagenberg

    Software Engineering

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 1 / 48

    Overview

    Motivation

    Statistical learning theory

    Optimal separating hyperplanes

    Support vector classification and regression

    Kernel functions

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 2 / 48


    Motivation

    Given data $D = \{x_i, t_i\}$ distributed according to $P(x, t)$, which model is a better representation of the data?

    [Two plots of the same data with fitted models: left, high bias and low variance; right, low bias and high variance.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 3 / 48

    Motivation (cont.)

    Neural networks model $p(t \mid x)$; complexity is controlled by

    topology restriction

    early stopping

    weight decay

    Bayesian approach

    In support vector machines, this is replaced by capacity control

    SVM concepts based on statistical learning theory

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 4 / 48


    Statistical learning theory

    Given: data set $\{x_i, t_i\}$, class labels $t_i \in \{-1, +1\}$, classifier output $y(\alpha, x_i) \in \{-1, +1\}$. Find: parameter $\alpha$ such that $y(\alpha, x_i) = t_i$

    Important questions: Is learning consistent (does performance improve with the size of the training set)?

    How to handle limited data (small training sets)?

    Can performance on the test set (generalization error) be inferred from performance on the training set?

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 5 / 48

    Statistical learning theory (cont.)

    Empirical error on a data set $\{x_i, t_i\}$ with distribution $P(x, t)$ for a classifier with parameter $\alpha$:

    $$R_{\mathrm{emp}}(\alpha) = \frac{1}{2n} \sum_{i=1}^{n} |y(\alpha, x_i) - t_i|$$

    Expected error of the same classifier on unseen data with the same distribution:

    $$R(\alpha) = \int \frac{1}{2} |y(\alpha, x) - t| \, dP(x, t)$$
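    For illustration (not from the slides), the empirical error is just the average 0-1 loss; a minimal numpy sketch with made-up labels:

    import numpy as np

    t      = np.array([+1, -1, +1, +1, -1])      # true labels (invented)
    y_pred = np.array([+1, +1, +1, -1, -1])      # classifier outputs (invented)
    R_emp = np.mean(np.abs(y_pred - t) / 2)      # = 1/(2n) * sum |y - t|
    print(R_emp)                                 # 0.4: two of five points misclassified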

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 6 / 48


    Statistical learning theory (cont.)

    Fundamental question of statistical learning theory: How can we relate $R_{\mathrm{emp}}$ and $R$?

    Key result: generalization error $R$ depends on both $R_{\mathrm{emp}}$ and the capacity $h$ of the classifier

    The following holds with probability $1 - \delta$:

    $$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\delta/4)}{n}},$$

    with $h$ the Vapnik-Chervonenkis (VC) dimension of the classifier, and $n$ the size of the training set
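    To get a feel for the square-root term (the "VC confidence"), here is a minimal numpy sketch of it; the function name and the chosen values of h, n and delta are mine, not from the slides:

    import numpy as np

    def vc_confidence(h, n, delta=0.05):
        # Square-root term of the bound above.
        return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)

    # For fixed capacity h, the term shrinks as the training set grows.
    for n in (100, 1000, 10000):
        print(n, round(vc_confidence(h=3, n=n), 3))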

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 7 / 48

    Shattering

    A classifier shatters data points if, for any labeling, the points can be correctly classified

    The capacity of a classifier depends on the number of points that can be shattered by the classifier

    VC dimension is the largest number of data points for which there exists an arrangement that can be shattered

    Not the same as the number of parameters in the classifier!

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 8 / 48


    Shattering examples

    Straight lines can shatter 3 points in 2-space

    Classifier: $\mathrm{sign}(w \cdot x + w_0)$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 9 / 48

    Shattering examples (cont.)

    Other classifier: $\mathrm{sign}(x \cdot x)$

    [Four plots over the unit square showing different labelings of a small point set; the classifier for the last case is $\mathrm{sign}(x \cdot x)$.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 10 / 48


    Shattering examples (cont.)

    Extreme example: one parameter, but infinite VC dimension

    Consider the classifier $y(\alpha, x) = \mathrm{sign}(\sin(\alpha x))$

    Surprising fact: for any $n$ there exists an arrangement of data points $\{x_i\} \subset \mathbb{R}$ that can be shattered by $y(\alpha, x)$

    Choose the data points as $x_i = 10^{-i}$, $i = 1, \ldots, n$

    There is a clever way of encoding the labeling information in a single number

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 11 / 48

    Shattering examples (cont.)

    For any labeling $t_i \in \{-1, +1\}$, construct $\alpha$ as

    $$\alpha = \pi \left( 1 + \sum_{i=1}^{n} \frac{1 - t_i}{2} \, 10^{i} \right)$$

    For $n = 5$ and $t_i = (+1, -1, -1, +1, -1)$, $\alpha = 101101\,\pi$

    [Plot: $\mathrm{sign}(\sin(\alpha x))$ at the points $x_i = 10^{-i}$, $i = 1, \ldots, 5$, reproducing the labels above.]
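    The construction can be checked numerically; a minimal sketch for the labeling above, following the formulas on this slide:

    import numpy as np

    t = np.array([+1, -1, -1, +1, -1])                        # labels t_i
    i = np.arange(1, len(t) + 1)
    x = 10.0 ** (-i)                                          # x_i = 10^(-i)
    alpha = np.pi * (1 + np.sum((1 - t) / 2 * 10.0 ** i))     # = 101101 * pi here
    print(np.sign(np.sin(alpha * x)))                         # reproduces t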

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 12 / 48


    VC dimension

    VC dimension is a capacity measure for classifiers

    VC dimension is the largest number of data points for which there exists an arrangement that can be shattered

    For straight lines in 2-space, the VC dimension is 3

    For hyperplanes in n-space, the VC dimension is $n + 1$

    It may be difficult to calculate the VC dimension of a given classifier

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 13 / 48

    Structural risk minimization

    Recall that, with probability $1 - \delta$,

    $$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\delta/4)}{n}}$$

    Induction principle for finding the best classifier:

    fix the data set and order classifiers according to their VC dimension

    for each classifier, train and calculate the right-hand side of the inequality

    the best classifier is the one that minimizes the right-hand side

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 14 / 48


    Structural risk minimization (cont.)

    $$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\delta/4)}{n}}$$

    Model              $R_{\mathrm{emp}}$    VC conf.    Upper bound
    $y_1(\alpha, x)$
    $y_2(\alpha, x)$
    $y_3(\alpha, x)$
    $y_4(\alpha, x)$
    $y_5(\alpha, x)$
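    To make the selection step concrete, a toy sketch (all numbers are invented, and vc_confidence is the helper sketched earlier): compute the right-hand side for each model and keep the minimizer.

    # Hypothetical (R_emp, VC dimension) pairs for five models on n = 200 points.
    models = {"y1": (0.20, 3), "y2": (0.12, 8), "y3": (0.05, 25),
              "y4": (0.04, 60), "y5": (0.03, 120)}
    n = 200
    bounds = {name: r_emp + vc_confidence(h, n) for name, (r_emp, h) in models.items()}
    best = min(bounds, key=bounds.get)           # model with the smallest upper bound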

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 15 / 48

    Support vector machines

    Algorithmic representation of concepts from statistical learning theory

    Implement hyperplanes, so the VC dimension is known

    SVMs calculate optimal hyperplanes: hyperplanes that maximize the margin between classes

    Decision function: $\mathrm{sign}(w \cdot x + w_0)$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 16 / 48


    Geometry of hyperplanes

    [Figure: a point $z$ at distance $|w \cdot z + w_0| / \lVert w \rVert$ from the hyperplane $\{x \mid w \cdot x + w_0 = 0\}$, which itself lies at distance $|w_0| / \lVert w \rVert$ from the origin.]
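    As a small numerical illustration (the particular w, w0, z values are mine; w and w0 match the line used on the next slide):

    import numpy as np

    w, w0 = np.array([3.0, -1.0]), -4.0                 # hyperplane 3x - y - 4 = 0
    z = np.array([2.0, 5.0])                            # an arbitrary point
    dist_z = abs(w @ z + w0) / np.linalg.norm(w)        # |w.z + w0| / ||w||
    dist_origin = abs(w0) / np.linalg.norm(w)           # |w0| / ||w||
    print(dist_z, dist_origin)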

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 17 / 48

    Geometry of hyperplanes (cont.)

    Hyperplanes invariant to scaling of parameters:

    $$\{x \mid w \cdot x + w_0 = 0\} = \{x \mid c\,w \cdot x + c\,w_0 = 0\}$$

    [Two identical plots of the same line, labeled $\{x \mid 3x - y - 4 = 0\}$ and $\{x \mid 6x - 2y - 8 = 0\}$.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 18 / 48


    Optimal separating hyperplanes

    We want

    $w \cdot x_i + w_0 \ge +1$ for all $x_i$ with $t_i = +1$
    $w \cdot x_i + w_0 \le -1$ for all $x_i$ with $t_i = -1$

    [Figure: separating hyperplane with the level lines $w \cdot x + w_0 = -1, 0, +1$.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 19 / 48

    Optimal separating hyperplanes (cont.)

    Points x and o on the dashed lines satisfy $w \cdot x + w_0 = +1$ and $w \cdot o + w_0 = -1$, resp.

    Distance between the dashed lines is

    $$\frac{|w \cdot x + w_0|}{\lVert w \rVert} + \frac{|w \cdot o + w_0|}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}$$

    Find the largest (optimal) margin by maximizing $2 / \lVert w \rVert$

    This is equivalent to

    $$\text{minimize} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \; t_i (w \cdot x_i + w_0) \ge 1$$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 20 / 48


    Algorithmic aspects

    Constrained optimization problem transformed to the Lagrangian

    $$\frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i \bigl( t_i (w \cdot x_i + w_0) - 1 \bigr)$$

    Find saddle point (minimize w.r.t. $w$, $w_0$; maximize w.r.t. $\alpha_i$)

    Leads to the criteria

    $$\sum_{i=1}^{n} \alpha_i t_i = 0 \quad \text{and} \quad w = \sum_{i=1}^{n} \alpha_i t_i x_i$$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 21 / 48

    Algorithmic aspects (cont.)

    Substituting the constraints into the Lagrangian results in the dual problem

    $$\text{maximize} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j t_i t_j \, x_i \cdot x_j$$

    $$\text{subject to} \; \alpha_i \ge 0 \; \text{and} \; \sum_{i=1}^{n} \alpha_i t_i = 0$$

    With the expansion $w = \sum_{i=1}^{n} \alpha_i t_i x_i$, the decision function $\mathrm{sign}(w \cdot x + w_0)$ becomes

    $$f(x) = \mathrm{sign}\left( \sum_{i=1}^{n} t_i \alpha_i \, x_i \cdot x + w_0 \right)$$
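    A minimal numpy sketch of this decision function; the arrays X (training points), t (labels), alpha (dual solution) and the offset w0 are assumed to come from some QP solver and are not computed here:

    import numpy as np

    def svm_decision(x, X, t, alpha, w0):
        # f(x) = sign( sum_i alpha_i t_i (x_i . x) + w0 ); only support vectors
        # (alpha_i > 0) contribute to the sum.
        return np.sign(np.sum(alpha * t * (X @ x)) + w0)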

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 22 / 48


    Summary algorithmic aspects

    Optimal separating hyperplane has the largest margin (SVMs are large margin classifiers)

    Unique solution to the convex constrained optimization problem is $w = \sum_i \alpha_i t_i x_i$, summed over all points $x_i$ with $\alpha_i \neq 0$

    Points $x_i$ with $\alpha_i \neq 0$ lie on the margin (support vectors); all other points are irrelevant for the solution!

    Observe that data points enter the calculation only via dot products

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 23 / 48

    Large margin classifiers

    Arguments for the importance of large margins:

    [Two scatter plots of separable data illustrating the argument for preferring a large-margin separating hyperplane.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 24 / 48


    Soft margin classifiers

    What happens when the data set is not linearly separable?

    Introduce slack variables $\xi_i \ge 0$

    [Plot: a data set that is not linearly separable, with the margin and a slack variable indicated.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 25 / 48

    Soft margin classifiers (cont.)

    Constraints are then

    $w \cdot x_i + w_0 \ge +1 - \xi_i$ for all $x_i$ with $t_i = +1$
    $w \cdot x_i + w_0 \le -1 + \xi_i$ for all $x_i$ with $t_i = -1$

    Want slack variables as small as possible; include this in the objective function

    Soft margin classifier minimizes

    $$\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i$$

    A large value of $C$ gives a large penalty to data on the wrong side
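    As an illustration (not from the slides), scikit-learn's SVC exposes exactly this C parameter; a minimal sketch on made-up 2-D data:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])  # toy data
    t = np.array([-1] * 20 + [+1] * 20)

    for C in (0.1, 100.0):                       # small C: wide margin, more slack allowed
        clf = SVC(kernel="linear", C=C).fit(X, t)
        print(C, clf.n_support_)                 # number of support vectors per class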

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 26 / 48


    Soft margin classifiers (cont.)

    Little difference in dual formulation:

    $$\text{maximize} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j t_i t_j \, x_i \cdot x_j$$

    $$\text{subject to} \; 0 \le \alpha_i \le C \; \text{and} \; \sum_{i=1}^{n} \alpha_i t_i = 0$$

    Again, data points appear only via dot products

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 27 / 48

    Support vector regression

    Difference to classification: targets $t_i$ are real-valued

    Prediction function for linear regression is

    $$f(x) = w \cdot x + w_0$$

    Recall the 0-1 loss in classification: $\frac{1}{2} |f(x_i) - t_i|$

    Need a different loss for regression ($\varepsilon$-insensitive loss):

    $$|f(x_i) - t_i|_{\varepsilon} := \max\bigl\{ 0, \; |f(x_i) - t_i| - \varepsilon \bigr\}$$
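    A minimal numpy sketch of this loss (the epsilon value and the example numbers are arbitrary):

    import numpy as np

    def eps_insensitive(f_x, t, eps=0.5):
        # max(0, |f(x_i) - t_i| - eps): zero inside the eps-tube, linear outside.
        return np.maximum(0.0, np.abs(f_x - t) - eps)

    print(eps_insensitive(np.array([1.0, 2.0, 4.0]), np.array([1.2, 3.0, 4.1])))
    # -> [0.  0.5 0. ]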

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 28 / 48


    Support vector regression (cont.)

    $\varepsilon$-insensitive loss results in a tube around the regression function

    [Two plots: a regression function with the $\varepsilon$-tube around it.]

    Minimize regularization term and error contribution

    $$\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} |f(x_i) - t_i|_{\varepsilon}$$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 29 / 48

    Support vector regression (cont.)

    Need slack variables for points outside the tube

    [Two plots: the $\varepsilon$-tube around the regression function, with slack variables for points lying outside the tube.]

    minimize $\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*})$

    subject to $f(x_i) - t_i \le \varepsilon + \xi_i$, $\xi_i \ge 0$
    and $t_i - f(x_i) \le \varepsilon + \xi_i^{*}$, $\xi_i^{*} \ge 0$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 30 / 48


    Support vector regression (cont.)

    Convert to dual problem statement (omitting details)

    Regression estimate is

    $$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) \, x_i \cdot x + w_0$$

    Again, data points enter calculation only via dot products
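    For completeness, an illustrative sketch with scikit-learn's SVR, which implements this kind of formulation (the data and parameter values are made up):

    import numpy as np
    from sklearn.svm import SVR

    X = np.linspace(0, 10, 50).reshape(-1, 1)                 # toy 1-D inputs
    t = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).normal(size=50)

    reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, t)
    print(reg.predict([[5.0]]))                               # should be close to sin(5)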

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 31 / 48

    Probabilities for SVM outputs

    In many applications, want the output to be $P(t_i = 1 \mid x_i)$

    SVMs provide only $\pm 1$ classifications

    Probabilities can be obtained by fitting a sigmoid to the raw SVM output $w \cdot x + w_0$

    Functional form of the sigmoid can be motivated theoretically

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 32 / 48


    Probabilities for SVM outputs (cont.)

    Details: with 0/1 target encoding $t_i$, SVM output $f_i$, and sigmoid $p_i = 1/(1 + \exp(A f_i + B))$, minimize

    $$-\sum_{i=1}^{n} \bigl( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \bigr)$$

    [Two plots of the fitted sigmoid $p = 1/(1 + \exp(A f + B))$ as a function of the raw SVM output $f$.]
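    A minimal sketch of this fit (not from the slides): the raw outputs f and 0/1 targets t are assumed to come from a trained SVM, and scipy's general-purpose minimizer stands in for Platt's specialized procedure.

    import numpy as np
    from scipy.optimize import minimize

    def platt_nll(params, f, t):
        # Negative log-likelihood of the sigmoid fit above.
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        eps = 1e-12                              # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    # Example with made-up raw outputs and targets:
    f = np.array([-2.1, -0.7, -0.2, 0.4, 1.3, 2.5])
    t = np.array([0, 0, 1, 0, 1, 1])
    A, B = minimize(platt_nll, x0=np.array([-1.0, 0.0]), args=(f, t)).x

    scikit-learn's SVC(probability=True) applies a sigmoid calibration of this kind internally.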

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 33 / 48

    Nonlinear SVM

    Idea: do a nonlinear projection $\Phi(x) : \mathbb{R}^m \to H$ of the original data points $x$ into some higher-dimensional space $H$

    Then, apply the optimal margin hyperplane algorithm in $H$

    [Two plots: the original data points on the left and their images $\Phi(\cdot)$ on the right.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 34 / 48


    Nonlinear SVM example

    [Scatter plot of a two-class data set in the plane.]

    Idea: project $\mathbb{R}^2 \to \mathbb{R}^3$ by

    $$\Phi(x_1, x_2) = \bigl( x_1^2, \; \sqrt{2}\,x_1 x_2, \; x_2^2 \bigr)$$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 35 / 48

    Nonlinear SVM example (cont.)

    Do the math (dot product in $H$):

    $$\Phi(x_1, x_2) \cdot \Phi(y_1, y_2) = (x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2) \cdot (y_1^2,\; \sqrt{2}\,y_1 y_2,\; y_2^2) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = \bigl( (x_1, x_2) \cdot (y_1, y_2) \bigr)^2$$

    This means that the dot product in $H$ can be represented by a function in the original space $\mathbb{R}^2$
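    A quick numerical check of this identity (the example vectors are arbitrary):

    import numpy as np

    def phi(v):
        # Projection R^2 -> R^3 from the example above.
        return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(x) @ phi(y), (x @ y) ** 2)   # both give 1.0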

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 36 / 48


    The kernel trick

    Recall that data enters the maximum margin calculation only via dot products $x_i \cdot x_j$ or $\Phi(x_i) \cdot \Phi(x_j)$

    Instead of calculating $\Phi(x_i) \cdot \Phi(x_j)$, use a kernel function in the original space:

    $$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$$

    Advantage: no need to calculate $\Phi$

    Advantage: no need to know $H$

    Raises question: what are admissible kernel functions?

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 37 / 48

    Kernel functions

    Admissible kernel functions: the Gram matrix $\bigl(K(x_i, x_j)\bigr)_{i,j}$ is positive definite

    Most widely used kernel functions and their parameters:

    polynomials (degree)

    Gaussians (variance)

    Practical importance of kernels: similarity measures on data sets without dot products!

    Great for text analysis, bioinformatics, . . .

    Kernel-ization of other algorithms (Kernel PCA, LDA, . . . )

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 38 / 48
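    A small sketch of the two kernel families named above, plus the Gram-matrix check (the example points are arbitrary, and the polynomial kernel is used here without an offset term):

    import numpy as np

    def poly_kernel(x, y, degree=2):
        return (x @ y) ** degree                                  # polynomial kernel

    def gauss_kernel(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # Gaussian kernel

    X = np.array([[0.0, 0.0], [1.0, 0.5], [-1.0, 2.0]])
    G = np.array([[gauss_kernel(a, b) for b in X] for a in X])    # Gram matrix
    print(np.all(np.linalg.eigvalsh(G) > 0))                      # positive definite here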


    Kernel function example

    The dot product $\Phi(x_1, x_2) \cdot \Phi(y_1, y_2)$ after the nonlinear projection $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ can be achieved by the kernel function $K(x, y) = (x \cdot y)^2$

    [Two plots: the two-class data set in $\mathbb{R}^2$ and its projection into $\mathbb{R}^3$.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 39 / 48

    SVM examples

    [Two plots: linearly separable data; C = 100.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 40 / 48


    SVM examples (cont.)

    [Two plots: C = 100; C = 1.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 41 / 48

    SVM examples (cont.)

    [Two plots: linear function; quadratic polynomial, C = 10.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 42 / 48


    SVM examples (cont.)

    [Two plots: quadratic polynomial, C = 10; quadratic polynomial, C = 100.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 43 / 48

    SVM examples (cont.)

    [Two plots: cubic polynomial; Gaussian, $\sigma$ = 1.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 44 / 48


    SVM examples (cont.)

    [Two plots: quadratic polynomial; cubic polynomial, C = 10.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 45 / 48

    SVM examples (cont.)

    [Two plots: cubic polynomial, C = 10; degree-4 polynomial, C = 10.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 46 / 48


    SVM examples (cont.)

    [Two plots: Gaussian, $\sigma$ = 1; Gaussian, $\sigma$ = 3.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 47 / 48
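    Comparisons like the ones above can be reproduced with scikit-learn; a minimal sketch on invented data (gamma here plays the role of the Gaussian kernel's width parameter):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 2))
    t = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)            # circular class boundary

    for clf in (SVC(kernel="poly", degree=2, C=10.0),
                SVC(kernel="poly", degree=3, C=10.0),
                SVC(kernel="rbf", gamma=1.0, C=10.0)):
        print(clf.kernel, clf.fit(X, t).score(X, t))          # training accuracy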

    Summary

    SVMs are based on statistical learning theory

    Allows calculation of bounds on generalization performance

    Optimal separating hyperplanes

    Kernel trick (projection)

    Kernel functions are similarity measures

    SVMs perform comparably to neural networks