Transcript of Ai Lecture15 Svm 2up
Artificial Intelligence
Support Vector Machines
Stephan Dreiseitl
FH Hagenberg
Software Engineering
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 1 / 48
Overview
Motivation
Statistical learning theory
Optimal separating hyperplanes
Support vector classification and regression
Kernel functions
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 2 / 48
Motivation
Given data D = {xi, ti} distributed according to P(x, t), which model is the better representation of the data?
[Figure: two fits to the same data; left panel: high bias, low variance; right panel: low bias, high variance.]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 3 / 48
Motivation (cont.)
Neural networks model p(t | x); their complexity is controlled by topology restriction, early stopping, weight decay, or a Bayesian approach
In support vector machines, this is replaced by capacity control
SVM concepts are based on statistical learning theory
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 4 / 48
Statistical learning theory
Given: data set {xi, ti}, class labels ti ∈ {−1, +1}, classifier output y(α, xi) ∈ {−1, +1}. Find: parameter α such that y(α, xi) = ti
Important questions: Is learning consistent (does performance improve with the size of the training set)?
How do we handle limited data (small training sets)?
Can performance on the test set (generalization error) be inferred from performance on the training set?
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 5 / 48
Statistical learning theory (cont.)
Empirical error on a data set {xi, ti} with distribution P(x, t) for a classifier with parameter α:

$R_{emp}(\alpha) = \frac{1}{2n} \sum_{i=1}^{n} |y(\alpha, x_i) - t_i|$

Expected error of the same classifier on unseen data with the same distribution:

$R(\alpha) = \int \frac{1}{2} |y(\alpha, x) - t| \, dP(x, t)$
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 6 / 48
Statistical learning theory (cont.)
Fundamental question of statistical learning theory: How can we relate Remp and R?

Key result: the generalization error R depends on both Remp and the capacity h of the classifier

The following holds with probability 1 − η:

$R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\eta/4)}{n}}$

with h the Vapnik-Chervonenkis (VC) dimension of the classifier, and n the size of the training set (the confidence term is evaluated numerically in the sketch below)
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 7 / 48
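The confidence term of the bound is easy to evaluate numerically. Below is a minimal sketch (not from the slides; the helper name is ours) showing how the term shrinks with the training set size n and grows with the VC dimension h:

```python
import math

def vc_confidence(h, n, eta=0.05):
    """Confidence term of the VC bound for VC dimension h,
    training set size n, and failure probability eta."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# the bound tightens as n grows (and would loosen as h grows)
for n in (100, 1000, 10000):
    print(n, round(vc_confidence(h=3, n=n), 3))
```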
Shattering
A classifier shatters data points if, for any labeling, the points can be correctly classified
The capacity of a classifier depends on the number of points it can shatter
The VC dimension is the largest number of data points for which there exists an arrangement that can be shattered
This is not the same as the number of parameters of the classifier!
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 8 / 48
Shattering examples
Straight lines can shatter 3 points in 2-space
Classifier: sign(α · x)
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 9 / 48
Shattering examples (cont.)
Other classifier: sign(x … x)

[Figure: four panels of points in the unit square (axes 0.2 to 1), each labeling correctly classified; the classifier for the last case is again of the form sign(x … x).]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 10 / 48
Shattering examples (cont.)
Extreme example: one parameter, but infinite VC dimension. Consider the classifier y(α, x) = sign(sin(αx))
Surprising fact: for any n there exists an arrangement of data points {xi} ⊂ R that can be shattered by y(α, x). Choose the data points as xi = 10⁻ⁱ, i = 1, . . . , n
There is a clever way of encoding the labeling information in the single number α
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 11 / 48
Shattering examples (cont.)
For any labeling ti ∈ {−1, +1}, construct α as

$\alpha = \pi \left( 1 + \sum_{i=1}^{n} \frac{(1 - t_i)\, 10^i}{2} \right)$

For n = 5 and ti = (+1, −1, −1, +1, −1), this gives α = 101101 π (verified numerically in the sketch below)

[Figure: sign(sin(αx)) evaluated at the points xi = 10⁻ⁱ, i = 1, . . . , 5; x-axis from 10⁻⁵ to 10⁻¹, y-axis from −1 to 1.]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 12 / 48
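As a quick numerical check of this construction (a sketch using NumPy; the variable names are ours):

```python
import numpy as np

n = 5
t = np.array([+1, -1, -1, +1, -1])               # labeling from the slide
i = np.arange(1, n + 1)
x = 10.0 ** -i                                   # data points x_i = 10^-i
alpha = np.pi * (1 + ((1 - t) / 2 * 10.0 ** i).sum())

print(alpha / np.pi)                             # 101101.0, as on the slide
print(np.sign(np.sin(alpha * x)))                # reproduces the labeling t
```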
VC dimension
The VC dimension is a capacity measure for classifiers
The VC dimension is the largest number of data points for which there exists an arrangement that can be shattered
For straight lines in 2-space, the VC dimension is 3
For hyperplanes in n-space, the VC dimension is n + 1
It may be difficult to calculate the VC dimension of a classifier
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 13 / 48
Structural risk minimization
Recall that, with probability 1 − η,

$R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\eta/4)}{n}}$

Induction principle for finding the best classifier:
fix the data set and order the classifiers according to their VC dimension
for each classifier, train and calculate the right-hand side of the inequality
the best classifier is the one that minimizes the right-hand side
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 14 / 48
Structural risk minimization (cont.)
$R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\eta/4)}{n}}$

Model      Remp      VC conf.      upper bound
y1(α, x)
y2(α, x)
y3(α, x)
y4(α, x)
y5(α, x)

[Table entries shown graphically on the slide: moving down the list toward higher capacity, Remp shrinks while the VC confidence grows; the best model minimizes their sum.]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 15 / 48
Support vector machines
Algorithmic representation of concepts from statistical learning theory
SVMs implement hyperplanes, so the VC dimension is known
SVMs calculate optimal hyperplanes: hyperplanes that maximize the margin between the classes
Decision function: sign(w · x + w₀)
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 16 / 48
Geometry of hyperplanes
[Figure: the hyperplane {x | w · x + w₀ = 0}; a point z lies at distance |w · z + w₀| / ‖w‖ from it, and the hyperplane itself lies at distance |w₀| / ‖w‖ from the origin; see the sketch below.]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 17 / 48
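A small sketch of these distance formulas (our own helper, assuming NumPy; the sample numbers are made up):

```python
import numpy as np

def distance(w, w0, z):
    """Distance of point z from the hyperplane {x | w.x + w0 = 0}."""
    return abs(w @ z + w0) / np.linalg.norm(w)

w, w0, z = np.array([3.0, -1.0]), -4.0, np.array([2.0, 5.0])
print(distance(w, w0, z))            # |w.z + w0| / ||w||
print(distance(2 * w, 2 * w0, z))    # unchanged under scaling (next slide)
```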
Geometry of hyperplanes (cont.)
Hyperplanes are invariant to scaling of their parameters:

{x | w · x + w₀ = 0} = {x | cw · x + cw₀ = 0}

[Figure: the same line plotted twice, as {x | 3x − y − 4 = 0} and as {x | 6x − 2y − 8 = 0}.]

Lecture 15: Support Vector Machines Artificial Intelligence SS2009 18 / 48
Optimal separating hyperplanes
We want
w · xi + w₀ ≥ +1 for all xi with ti = +1
w · xi + w₀ ≤ −1 for all xi with ti = −1

[Figure: separating hyperplane with the margin lines w · x + w₀ = −1, 0, +1.]

Lecture 15: Support Vector Machines Artificial Intelligence SS2009 19 / 48
Optimal separating hyperplanes (cont.)

Points x and o on the dashed lines satisfy w · x + w₀ = +1 and w · o + w₀ = −1, resp. The distance between the dashed lines is

$\frac{|w \cdot x + w_0|}{\|w\|} + \frac{|w \cdot o + w_0|}{\|w\|} = \frac{2}{\|w\|}$

Find the largest (optimal) margin by maximizing 2/‖w‖. This is equivalent to

minimize $\frac{1}{2}\|w\|^2$ subject to $t_i (w \cdot x_i + w_0) \geq 1$

(see the toy solver sketch below)

Lecture 15: Support Vector Machines Artificial Intelligence SS2009 20 / 48
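For a toy dataset, this primal problem can be handed to a general-purpose solver. A minimal sketch, assuming SciPy's SLSQP method (the data and variable names are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# toy separable data, two points per class
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
t = np.array([+1, +1, -1, -1])

def objective(p):                      # p = (w1, w2, w0)
    return 0.5 * (p[0] ** 2 + p[1] ** 2)

# one margin constraint t_i (w.x_i + w0) >= 1 per data point
cons = [{"type": "ineq", "fun": lambda p, i=i: t[i] * (X[i] @ p[:2] + p[2]) - 1}
        for i in range(len(t))]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
               method="SLSQP", constraints=cons)
w, w0 = res.x[:2], res.x[2]
print("w =", w.round(3), "w0 =", round(w0, 3),
      "margin =", 2 / np.linalg.norm(w))
```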
Algorithmic aspects
The constrained optimization problem is transformed into the Lagrangian

$\frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ t_i (w \cdot x_i + w_0) - 1 \right]$

Find a saddle point (minimize w.r.t. w, w₀; maximize w.r.t. αi)

This leads to the criteria

$\sum_{i=1}^{n} \alpha_i t_i = 0 \quad \text{and} \quad w = \sum_{i=1}^{n} \alpha_i t_i x_i$
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 21 / 48
Algorithmic aspects (cont.)
Substituting the constraints into the Lagrangian results in the dual problem

maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j t_i t_j \, x_i \cdot x_j$

subject to $\alpha_i \geq 0$ and $\sum_{i=1}^{n} \alpha_i t_i = 0$

With the expansion $w = \sum_{i=1}^{n} \alpha_i t_i x_i$, the decision function sign(w · x + w₀) becomes

$f(x) = \mathrm{sign}\left( \sum_{i=1}^{n} t_i \alpha_i \, x_i \cdot x + w_0 \right)$

(a toy numerical solution of the dual follows below)
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 22 / 48
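The dual can be solved the same way as the primal; a sketch on the same toy data as before (again SLSQP, names ours):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
t = np.array([+1.0, +1.0, -1.0, -1.0])
G = (t[:, None] * t[None, :]) * (X @ X.T)        # t_i t_j x_i.x_j

def neg_dual(a):                 # maximize the dual = minimize its negation
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(t)), method="SLSQP",
               bounds=[(0, None)] * len(t),      # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ t}])
a = res.x
w = (a * t) @ X                                  # expansion of w
sv = int(np.argmax(a))                           # a point with alpha_i > 0
w0 = t[sv] - w @ X[sv]                           # from t_i (w.x_i + w0) = 1
f = lambda x: np.sign((a * t) @ (X @ x) + w0)    # decision function
print(a.round(3), f(np.array([0.5, 0.0])))
```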
Summary algorithmic aspects
The optimal separating hyperplane has the largest margin (SVMs are large margin classifiers)
The unique solution to the convex constrained optimization problem is w = Σᵢ αᵢ tᵢ xᵢ, taken over all points xᵢ with αᵢ ≠ 0
Points xᵢ with αᵢ ≠ 0 lie on the margin (support vectors); all other points are irrelevant for the solution!
Observe that the data points enter the calculation only via dot products
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 23 / 48
Large margin classifiers
Arguments for the importance of large margins:
[Figure: two 2-D point sets with maximum margin separating lines.]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 24 / 48
Soft margin classifiers
What happens when the data set is not linearly separable?
Introduce slack variables ξi ≥ 0

[Figure: a non-separable data set; a point on the wrong side of the margin with its slack ξ₁ marked.]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 25 / 48
Soft margin classifiers (cont.)
The constraints are then

w · xi + w₀ ≥ +1 − ξi for all xi with ti = +1
w · xi + w₀ ≤ −1 + ξi for all xi with ti = −1

We want the slack variables to be as small as possible, and include this in the objective function

The soft margin classifier minimizes $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$

A large value of C gives a large penalty to data on the wrong side (see the sketch below)
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 26 / 48
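One way to see the role of C is with scikit-learn's SVC, used here as one readily available soft margin implementation (the data set is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
t = np.array([-1] * 20 + [+1] * 20)

# small C tolerates slack (wide margin), large C penalizes it heavily
for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(f"C = {C}: {len(clf.support_)} support vectors")
```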
Soft margin classifiers (cont.)
There is little difference in the dual formulation:

maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j t_i t_j \, x_i \cdot x_j$

subject to $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{n} \alpha_i t_i = 0$
Again, data points appear only via dot products
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 27 / 48
Support vector regression
Difference to classification: the targets ti are real-valued
The prediction function for linear regression is f(x) = w · x + w₀
Recall the 0-1 loss in classification: ½ |f(xi) − ti|
We need a different loss for regression (the ε-insensitive loss, sketched below):

$|f(x_i) - t_i|_\varepsilon := \max\{0,\ |f(x_i) - t_i| - \varepsilon\}$
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 28 / 48
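The ε-insensitive loss is a one-liner; a small sketch (names ours):

```python
def eps_insensitive(f_x, t, eps=0.5):
    """|f(x) - t|_eps = max(0, |f(x) - t| - eps): free inside the eps-tube."""
    return max(0.0, abs(f_x - t) - eps)

print(eps_insensitive(3.3, 3.0))   # 0.0  (inside the tube)
print(eps_insensitive(4.2, 3.0))   # 0.7  (0.7 beyond the tube)
```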
Support vector regression (cont.)
The ε-insensitive loss results in a tube around the regression function

[Figure: regression function with an ε-tube; points inside the tube incur no loss.]

Minimize the regularization term plus the error contribution:

$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} |f(x_i) - t_i|_\varepsilon$
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 29 / 48
Support vector regression (cont.)
We need slack variables for points outside the tube

[Figure: regression function with ε-tube; points above and below the tube carry slacks ξi and ξi*.]

minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$

subject to $f(x_i) - t_i \leq \varepsilon + \xi_i, \quad \xi_i \geq 0$
$t_i - f(x_i) \leq \varepsilon + \xi_i^*, \quad \xi_i^* \geq 0$
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 30 / 48
Support vector regression (cont.)
Convert to the dual problem statement (omitting details)
The regression estimate is

$f(x) = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) \, x_i \cdot x + w_0$

Again, the data points enter the calculation only via dot products
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 31 / 48
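For completeness, a support vector regression fit via scikit-learn's SVR, one possible implementation (the data are synthetic):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)[:, None]
t = 0.5 * x.ravel() + 1.0 + rng.normal(0.0, 0.2, 50)   # noisy line

reg = SVR(kernel="linear", C=10.0, epsilon=0.3).fit(x, t)
print(reg.predict([[5.0]]))    # close to the true value 3.5
```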
Probabilities for SVM outputs
In many applications, we want the output to be P(ti = 1 | xi)
SVMs provide only ±1 classifications
Probabilities can be obtained by fitting a sigmoid to the raw SVM output w · x + w₀
The functional form of the sigmoid can be motivated theoretically
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 32 / 48
Probabilities for SVM outputs (cont.)
Details: with 0/1 target encoding ti, SVM output fi, and sigmoid pi = 1/(1 + exp(A fi + B)), minimize the cross-entropy

$-\sum_{i=1}^{n} \left[ t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \right]$

[Figure: two fitted sigmoids over raw SVM outputs in the range −4 to 2; a sketch of the fit follows below.]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 33 / 48
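A minimal sketch of this sigmoid fit, minimizing the cross-entropy with SciPy (the raw outputs fi below are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize

f = np.array([-2.1, -1.3, -0.4, 0.2, 0.9, 1.8])   # hypothetical raw SVM outputs
t = np.array([0, 0, 1, 0, 1, 1])                  # 0/1 targets (one overlap)

def nll(params):
    A, B = params
    p = 1.0 / (1.0 + np.exp(A * f + B))
    p = np.clip(p, 1e-12, 1 - 1e-12)              # guard the logarithms
    return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

A, B = minimize(nll, x0=[-1.0, 0.0]).x
print(1.0 / (1.0 + np.exp(A * f + B)))            # fitted probabilities
```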
Nonlinear SVM
Idea: apply a nonlinear projection Φ(x): Rᵐ → H of the original data points x into some higher-dimensional space H
Then, apply the optimal margin hyperplane algorithm in H

[Figure: points in R² mapped by Φ into a space where they are linearly separable.]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 34 / 48
Nonlinear SVM example
[Figure: a 2-D data set that is not linearly separable.]

Idea: Project R² → R³ by

$\Phi(x_1, x_2) = \left( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \right)$
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 35 / 48
Nonlinear SVM example (cont.)
Do the math (dot product in H):

$\Phi(x) \cdot \Phi(y) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2) \cdot (y_1^2,\ \sqrt{2}\, y_1 y_2,\ y_2^2)$
$= x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$
$= (x_1 y_1 + x_2 y_2)^2$
$= (x \cdot y)^2$

This means that the dot product in H can be represented by a function in the original space R² (verified numerically below)
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 36 / 48
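This identity is easy to verify numerically (a sketch; the helper name is ours):

```python
import numpy as np

def phi(x):
    """Projection R^2 -> R^3 from the example."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y))    # dot product computed in H
print((x @ y) ** 2)       # kernel evaluated in R^2: the same value
```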
The kernel trick
Recall that the data enter the maximum margin calculation only via dot products xi · xj or Φ(xi) · Φ(xj)
Instead of calculating Φ(xi) · Φ(xj), use a kernel function in the original space:

K(xi, xj) = Φ(xi) · Φ(xj)

Advantage: no need to calculate Φ
Advantage: no need to know H
This raises the question: what are admissible kernel functions?
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 37 / 48
Kernel functions
Admissible kernel functions: those whose Gram matrix (K(xi, xj))ᵢ,ⱼ is positive definite
The most widely used kernel functions and their parameters: polynomials (degree) and Gaussians (variance); both are sketched below
Practical importance of kernels: similarity measures on data sets without dot products!
Great for text analysis, bioinformatics, . . .
Kernel-ization of other algorithms (Kernel PCA, LDA, . . . )
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 38 / 48
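A sketch of the two kernel families and the Gram matrix admissibility check (function names ours; positive definiteness is tested here as no negative eigenvalues, up to rounding):

```python
import numpy as np

def poly_kernel(X, degree=2):
    """Polynomial kernel K(x, y) = (x.y)^degree on all pairs."""
    return (X @ X.T) ** degree

def gaussian_kernel(X, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(10, 2))
for K in (poly_kernel(X), gaussian_kernel(X)):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # Gram matrix is PSD
```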
Kernel function example
The dot product Φ(x1, x2) · Φ(y1, y2) after the nonlinear projection Φ(x1, x2) = (x1², √2 x1x2, x2²) can be achieved by the kernel function K(x, y) = (x · y)²

[Figure: a 2-D data set and its image under Φ in 3-space.]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 39 / 48
SVM examples
[Figure: linearly separable; C = 100]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 40 / 48
SVM examples (cont.)
[Figure: C = 100; C = 1]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 41 / 48
SVM examples (cont.)
[Figure: linear function; quadratic polynomial, C = 10]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 42 / 48
SVM examples (cont.)
[Figure: quadratic polynomial, C = 10; quadratic polynomial, C = 100]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 43 / 48
SVM examples (cont.)
[Figure: cubic polynomial; Gaussian, σ = 1]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 44 / 48
SVM examples (cont.)
[Figure: quadratic polynomial; cubic polynomial, C = 10]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 45 / 48
SVM examples (cont.)
[Figure: cubic polynomial, C = 10; degree-4 polynomial, C = 10]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 46 / 48
SVM examples (cont.)
[Figure: Gaussian, σ = 1; Gaussian, σ = 3]
Lecture 15: Support Vector Machines Artificial Intelligence SS2009 47 / 48
Summary
SVMs are based on statistical learning theory
This allows the calculation of bounds on generalization performance
Optimal separating hyperplanes
Kernel trick (projection)
Kernel functions are similarity measures
SVMs perform comparably to neural networks