Transcript of AI Lecture 15: Support Vector Machines

    Artificial Intelligence

    Support Vector Machines

    Stephan Dreiseitl

    FH Hagenberg

    Software Engineering

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 1 / 48

    Overview

    Motivation

    Statistical learning theory

    Optimal separating hyperplanes

    Support vector classification and regression

    Kernel functions

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 2 / 48


    Motivation

    Given data $D = \{x_i, t_i\}$ distributed according to $P(x, t)$, which model is a better representation of the data?

    [Two plots of the same data with fitted models: left, high bias and low variance; right, low bias and high variance.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 3 / 48

    Motivation (cont.)

    Neural networks model $p(t \mid x)$; complexity is controlled by

    topology restriction

    early stopping

    weight decay

    Bayesian approach

    In support vector machines, this is replaced by capacity control

    SVM concepts based on statistical learning theory

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 4 / 48


    Statistical learning theory

    Given: data set $\{x_i, t_i\}$, class labels $t_i \in \{-1, +1\}$, classifier output $y(\alpha, x_i) \in \{-1, +1\}$. Find: parameter $\alpha$ such that $y(\alpha, x_i) = t_i$

    Important questions: Is learning consistent (does performance improve with the size of the training set)?

    How to handle limited data (small training sets)?

    Can performance on the test set (generalization error) be inferred from performance on the training set?

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 5 / 48

    Statistical learning theory (cont.)

    Empirical error on a data set $\{x_i, t_i\}$ with distribution $P(x, t)$ for a classifier with parameter $\alpha$:

    $$R_{\mathrm{emp}}(\alpha) = \frac{1}{2n} \sum_{i=1}^{n} |y(\alpha, x_i) - t_i|$$

    Expected error of the same classifier on unseen data with the same distribution:

    $$R(\alpha) = \int \frac{1}{2} |y(\alpha, x) - t| \, dP(x, t)$$
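    For illustration (not from the slides), the empirical error is just the average 0-1 loss; a minimal numpy sketch with made-up labels:

    import numpy as np

    t      = np.array([+1, -1, +1, +1, -1])      # true labels (invented)
    y_pred = np.array([+1, +1, +1, -1, -1])      # classifier outputs (invented)
    R_emp = np.mean(np.abs(y_pred - t) / 2)      # = 1/(2n) * sum |y - t|
    print(R_emp)                                 # 0.4: two of five points misclassified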

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 6 / 48


    Statistical learning theory (cont.)

    Fundamental question of statistical learning theory: How can we relate $R_{\mathrm{emp}}$ and $R$?

    Key result: generalization error $R$ depends on both $R_{\mathrm{emp}}$ and the capacity $h$ of the classifier

    The following holds with probability $1 - \delta$:

    $$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\delta/4)}{n}},$$

    with $h$ the Vapnik-Chervonenkis (VC) dimension of the classifier, and $n$ the size of the training set
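    To get a feel for the square-root term (the "VC confidence"), here is a minimal numpy sketch of it; the function name and the chosen values of h, n and delta are mine, not from the slides:

    import numpy as np

    def vc_confidence(h, n, delta=0.05):
        # Square-root term of the bound above.
        return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)

    # For fixed capacity h, the term shrinks as the training set grows.
    for n in (100, 1000, 10000):
        print(n, round(vc_confidence(h=3, n=n), 3))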

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 7 / 48

    Shattering

    A classifier shatters data points if, for any labeling, the points can be correctly classified

    The capacity of a classifier depends on the number of points that can be shattered by the classifier

    VC dimension is the largest number of data points for which there exists an arrangement that can be shattered

    Not the same as the number of parameters in the classifier!

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 8 / 48


    Shattering examples

    Straight lines can shatter 3 points in 2-space

    Classifier: $\mathrm{sign}(w \cdot x + w_0)$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 9 / 48

    Shattering examples (cont.)

    Other classifier: $\mathrm{sign}(x \cdot x)$

    [Four plots over the unit square showing different labelings of a small point set; the classifier for the last case is $\mathrm{sign}(x \cdot x)$.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 10 / 48


    Shattering examples (cont.)

    Extreme example: one parameter, but infinite VC dimension

    Consider the classifier $y(\alpha, x) = \mathrm{sign}(\sin(\alpha x))$

    Surprising fact: for any $n$ there exists an arrangement of data points $\{x_i\} \subset \mathbb{R}$ that can be shattered by $y(\alpha, x)$

    Choose the data points as $x_i = 10^{-i}$, $i = 1, \ldots, n$

    There is a clever way of encoding the labeling information in a single number

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 11 / 48

    Shattering examples (cont.)

    For any labeling $t_i \in \{-1, +1\}$, construct $\alpha$ as

    $$\alpha = \pi \left( 1 + \sum_{i=1}^{n} \frac{1 - t_i}{2} \, 10^{i} \right)$$

    For $n = 5$ and $t_i = (+1, -1, -1, +1, -1)$, $\alpha = 101101\,\pi$

    [Plot: $\mathrm{sign}(\sin(\alpha x))$ at the points $x_i = 10^{-i}$, $i = 1, \ldots, 5$, reproducing the labels above.]
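    The construction can be checked numerically; a minimal sketch for the labeling above, following the formulas on this slide:

    import numpy as np

    t = np.array([+1, -1, -1, +1, -1])                        # labels t_i
    i = np.arange(1, len(t) + 1)
    x = 10.0 ** (-i)                                          # x_i = 10^(-i)
    alpha = np.pi * (1 + np.sum((1 - t) / 2 * 10.0 ** i))     # = 101101 * pi here
    print(np.sign(np.sin(alpha * x)))                         # reproduces t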

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 12 / 48


    VC dimension

    VC dimension is a capacity measure for classifiers

    VC dimension is the largest number of data points for which there exists an arrangement that can be shattered

    For straight lines in 2-space, the VC dimension is 3

    For hyperplanes in n-space, the VC dimension is $n + 1$

    It may be difficult to calculate the VC dimension of a given classifier

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 13 / 48

    Structural risk minimization

    Recall that, with probability $1 - \delta$,

    $$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\delta/4)}{n}}$$

    Induction principle for finding the best classifier:

    fix the data set and order classifiers according to their VC dimension

    for each classifier, train and calculate the right-hand side of the inequality

    the best classifier is the one that minimizes the right-hand side

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 14 / 48


    Structural risk minimization (cont.)

    $$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\delta/4)}{n}}$$

    Model              $R_{\mathrm{emp}}$    VC conf.    Upper bound
    $y_1(\alpha, x)$
    $y_2(\alpha, x)$
    $y_3(\alpha, x)$
    $y_4(\alpha, x)$
    $y_5(\alpha, x)$
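    To make the selection step concrete, a toy sketch (all numbers are invented, and vc_confidence is the helper sketched earlier): compute the right-hand side for each model and keep the minimizer.

    # Hypothetical (R_emp, VC dimension) pairs for five models on n = 200 points.
    models = {"y1": (0.20, 3), "y2": (0.12, 8), "y3": (0.05, 25),
              "y4": (0.04, 60), "y5": (0.03, 120)}
    n = 200
    bounds = {name: r_emp + vc_confidence(h, n) for name, (r_emp, h) in models.items()}
    best = min(bounds, key=bounds.get)           # model with the smallest upper bound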

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 15 / 48

    Support vector machines

    Algorithmic representation of concepts from statistical learning theory

    Implement hyperplanes, so the VC dimension is known

    SVMs calculate optimal hyperplanes: hyperplanes that maximize the margin between classes

    Decision function: $\mathrm{sign}(w \cdot x + w_0)$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 16 / 48


    Geometry of hyperplanes

    [Figure: a point $z$ at distance $|w \cdot z + w_0| / \lVert w \rVert$ from the hyperplane $\{x \mid w \cdot x + w_0 = 0\}$, which itself lies at distance $|w_0| / \lVert w \rVert$ from the origin.]
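    As a small numerical illustration (the particular w, w0, z values are mine; w and w0 match the line used on the next slide):

    import numpy as np

    w, w0 = np.array([3.0, -1.0]), -4.0                 # hyperplane 3x - y - 4 = 0
    z = np.array([2.0, 5.0])                            # an arbitrary point
    dist_z = abs(w @ z + w0) / np.linalg.norm(w)        # |w.z + w0| / ||w||
    dist_origin = abs(w0) / np.linalg.norm(w)           # |w0| / ||w||
    print(dist_z, dist_origin)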

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 17 / 48

    Geometry of hyperplanes (cont.)

    Hyperplanes invariant to scaling of parameters:

    $$\{x \mid w \cdot x + w_0 = 0\} = \{x \mid c\,w \cdot x + c\,w_0 = 0\}$$

    [Two identical plots of the same line, labeled $\{x \mid 3x - y - 4 = 0\}$ and $\{x \mid 6x - 2y - 8 = 0\}$.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 18 / 48


    Optimal separating hyperplanes

    We want

    $w \cdot x_i + w_0 \ge +1$ for all $x_i$ with $t_i = +1$
    $w \cdot x_i + w_0 \le -1$ for all $x_i$ with $t_i = -1$

    [Figure: separating hyperplane with the level lines $w \cdot x + w_0 = -1, 0, +1$.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 19 / 48

    Optimal separating hyperplanes (cont.)

    Points x and o on the dashed lines satisfy $w \cdot x + w_0 = +1$ and $w \cdot o + w_0 = -1$, resp.

    Distance between the dashed lines is

    $$\frac{|w \cdot x + w_0|}{\lVert w \rVert} + \frac{|w \cdot o + w_0|}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}$$

    Find the largest (optimal) margin by maximizing $2 / \lVert w \rVert$

    This is equivalent to

    $$\text{minimize} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \; t_i (w \cdot x_i + w_0) \ge 1$$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 20 / 48


    Algorithmic aspects

    Constrained optimization problem transformed to the Lagrangian

    $$\frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i \bigl( t_i (w \cdot x_i + w_0) - 1 \bigr)$$

    Find saddle point (minimize w.r.t. $w$, $w_0$; maximize w.r.t. $\alpha_i$)

    Leads to the criteria

    $$\sum_{i=1}^{n} \alpha_i t_i = 0 \quad \text{and} \quad w = \sum_{i=1}^{n} \alpha_i t_i x_i$$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 21 / 48

    Algorithmic aspects (cont.)

    Substituting the constraints into the Lagrangian results in the dual problem

    $$\text{maximize} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j t_i t_j \, x_i \cdot x_j$$

    $$\text{subject to} \; \alpha_i \ge 0 \; \text{and} \; \sum_{i=1}^{n} \alpha_i t_i = 0$$

    With the expansion $w = \sum_{i=1}^{n} \alpha_i t_i x_i$, the decision function $\mathrm{sign}(w \cdot x + w_0)$ becomes

    $$f(x) = \mathrm{sign}\left( \sum_{i=1}^{n} t_i \alpha_i \, x_i \cdot x + w_0 \right)$$
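    A minimal numpy sketch of this decision function; the arrays X (training points), t (labels), alpha (dual solution) and the offset w0 are assumed to come from some QP solver and are not computed here:

    import numpy as np

    def svm_decision(x, X, t, alpha, w0):
        # f(x) = sign( sum_i alpha_i t_i (x_i . x) + w0 ); only support vectors
        # (alpha_i > 0) contribute to the sum.
        return np.sign(np.sum(alpha * t * (X @ x)) + w0)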

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 22 / 48


    Summary algorithmic aspects

    Optimal separating hyperplane has the largest margin (SVMs are large margin classifiers)

    Unique solution to the convex constrained optimization problem is $w = \sum_i \alpha_i t_i x_i$, summed over all points $x_i$ with $\alpha_i \neq 0$

    Points $x_i$ with $\alpha_i \neq 0$ lie on the margin (support vectors); all other points are irrelevant for the solution!

    Observe that data points enter the calculation only via dot products

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 23 / 48

    Large margin classifiers

    Arguments for the importance of large margins:

    [Two scatter plots of separable data illustrating the argument for preferring a large-margin separating hyperplane.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 24 / 48


    Soft margin classifiers

    What happens when the data set is not linearly separable?

    Introduce slack variables $\xi_i \ge 0$

    [Plot: a data set that is not linearly separable, with the margin and a slack variable indicated.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 25 / 48

    Soft margin classifiers (cont.)

    Constraints are then

    $w \cdot x_i + w_0 \ge +1 - \xi_i$ for all $x_i$ with $t_i = +1$
    $w \cdot x_i + w_0 \le -1 + \xi_i$ for all $x_i$ with $t_i = -1$

    Want slack variables as small as possible; include this in the objective function

    Soft margin classifier minimizes

    $$\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i$$

    A large value of $C$ gives a large penalty to data on the wrong side
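    As an illustration (not from the slides), scikit-learn's SVC exposes exactly this C parameter; a minimal sketch on made-up 2-D data:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])  # toy data
    t = np.array([-1] * 20 + [+1] * 20)

    for C in (0.1, 100.0):                       # small C: wide margin, more slack allowed
        clf = SVC(kernel="linear", C=C).fit(X, t)
        print(C, clf.n_support_)                 # number of support vectors per class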

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 26 / 48


    Soft margin classifiers (cont.)

    Little difference in dual formulation:

    $$\text{maximize} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j t_i t_j \, x_i \cdot x_j$$

    $$\text{subject to} \; 0 \le \alpha_i \le C \; \text{and} \; \sum_{i=1}^{n} \alpha_i t_i = 0$$

    Again, data points appear only via dot products

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 27 / 48

    Support vector regression

    Difference to classification: targets $t_i$ are real-valued

    Prediction function for linear regression is

    $$f(x) = w \cdot x + w_0$$

    Recall the 0-1 loss in classification: $\frac{1}{2} |f(x_i) - t_i|$

    Need a different loss for regression ($\varepsilon$-insensitive loss):

    $$|f(x_i) - t_i|_{\varepsilon} := \max\bigl\{ 0, \; |f(x_i) - t_i| - \varepsilon \bigr\}$$
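    A minimal numpy sketch of this loss (the epsilon value and the example numbers are arbitrary):

    import numpy as np

    def eps_insensitive(f_x, t, eps=0.5):
        # max(0, |f(x_i) - t_i| - eps): zero inside the eps-tube, linear outside.
        return np.maximum(0.0, np.abs(f_x - t) - eps)

    print(eps_insensitive(np.array([1.0, 2.0, 4.0]), np.array([1.2, 3.0, 4.1])))
    # -> [0.  0.5 0. ]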

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 28 / 48


    Support vector regression (cont.)

    $\varepsilon$-insensitive loss results in a tube around the regression function

    [Two plots: a regression function with the $\varepsilon$-tube around it.]

    Minimize regularization term and error contribution

    $$\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} |f(x_i) - t_i|_{\varepsilon}$$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 29 / 48

    Support vector regression (cont.)

    Need slack variables for points outside the tube

    [Two plots: the $\varepsilon$-tube around the regression function, with slack variables for points lying outside the tube.]

    minimize $\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*})$

    subject to $f(x_i) - t_i \le \varepsilon + \xi_i$, $\xi_i \ge 0$
    and $t_i - f(x_i) \le \varepsilon + \xi_i^{*}$, $\xi_i^{*} \ge 0$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 30 / 48


    Support vector regression (cont.)

    Convert to dual problem statement (omitting details)

    Regression estimate is

    $$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) \, x_i \cdot x + w_0$$

    Again, data points enter calculation only via dot products
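    For completeness, an illustrative sketch with scikit-learn's SVR, which implements this kind of formulation (the data and parameter values are made up):

    import numpy as np
    from sklearn.svm import SVR

    X = np.linspace(0, 10, 50).reshape(-1, 1)                 # toy 1-D inputs
    t = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).normal(size=50)

    reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, t)
    print(reg.predict([[5.0]]))                               # should be close to sin(5)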

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 31 / 48

    Probabilities for SVM outputs

    In many applications, want the output to be $P(t_i = 1 \mid x_i)$

    SVMs provide only $\pm 1$ classifications

    Probabilities can be obtained by fitting a sigmoid to the raw SVM output $w \cdot x + w_0$

    Functional form of the sigmoid can be motivated theoretically

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 32 / 48


    Probabilities for SVM outputs (cont.)

    Details: with 0/1 target encoding $t_i$, SVM output $f_i$, and sigmoid $p_i = 1/(1 + \exp(A f_i + B))$, minimize

    $$-\sum_{i=1}^{n} \bigl( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \bigr)$$

    [Two plots of the fitted sigmoid $p = 1/(1 + \exp(A f + B))$ as a function of the raw SVM output $f$.]
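    A minimal sketch of this fit (not from the slides): the raw outputs f and 0/1 targets t are assumed to come from a trained SVM, and scipy's general-purpose minimizer stands in for Platt's specialized procedure.

    import numpy as np
    from scipy.optimize import minimize

    def platt_nll(params, f, t):
        # Negative log-likelihood of the sigmoid fit above.
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        eps = 1e-12                              # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    # Example with made-up raw outputs and targets:
    f = np.array([-2.1, -0.7, -0.2, 0.4, 1.3, 2.5])
    t = np.array([0, 0, 1, 0, 1, 1])
    A, B = minimize(platt_nll, x0=np.array([-1.0, 0.0]), args=(f, t)).x

    scikit-learn's SVC(probability=True) applies a sigmoid calibration of this kind internally.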

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 33 / 48

    Nonlinear SVM

    Idea: do a nonlinear projection $\Phi(x) : \mathbb{R}^m \to H$ of the original data points $x$ into some higher-dimensional space $H$

    Then, apply the optimal margin hyperplane algorithm in $H$

    [Two plots: the original data points on the left and their images $\Phi(\cdot)$ on the right.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 34 / 48


    Nonlinear SVM example

    [Scatter plot of a two-class data set in the plane.]

    Idea: project $\mathbb{R}^2 \to \mathbb{R}^3$ by

    $$\Phi(x_1, x_2) = \bigl( x_1^2, \; \sqrt{2}\,x_1 x_2, \; x_2^2 \bigr)$$

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 35 / 48

    Nonlinear SVM example (cont.)

    Do the math (dot product in $H$):

    $$\Phi(x_1, x_2) \cdot \Phi(y_1, y_2) = (x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2) \cdot (y_1^2,\; \sqrt{2}\,y_1 y_2,\; y_2^2) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = \bigl( (x_1, x_2) \cdot (y_1, y_2) \bigr)^2$$

    This means that the dot product in $H$ can be represented by a function in the original space $\mathbb{R}^2$
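    A quick numerical check of this identity (the example vectors are arbitrary):

    import numpy as np

    def phi(v):
        # Projection R^2 -> R^3 from the example above.
        return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(x) @ phi(y), (x @ y) ** 2)   # both give 1.0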

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 36 / 48


    The kernel trick

    Recall that data enters the maximum margin calculation only via dot products $x_i \cdot x_j$ or $\Phi(x_i) \cdot \Phi(x_j)$

    Instead of calculating $\Phi(x_i) \cdot \Phi(x_j)$, use a kernel function in the original space:

    $$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$$

    Advantage: no need to calculate $\Phi$

    Advantage: no need to know $H$

    Raises question: what are admissible kernel functions?

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 37 / 48

    Kernel functions

    Admissible kernel functions: the Gram matrix $\bigl(K(x_i, x_j)\bigr)_{i,j}$ is positive definite

    Most widely used kernel functions and their parameters:

    polynomials (degree)

    Gaussians (variance)

    Practical importance of kernels: similarity measures on data sets without dot products!

    Great for text analysis, bioinformatics, . . .

    Kernel-ization of other algorithms (Kernel PCA, LDA, . . . )

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 38 / 48
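    A small sketch of the two kernel families named above, plus the Gram-matrix check (the example points are arbitrary, and the polynomial kernel is used here without an offset term):

    import numpy as np

    def poly_kernel(x, y, degree=2):
        return (x @ y) ** degree                                  # polynomial kernel

    def gauss_kernel(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # Gaussian kernel

    X = np.array([[0.0, 0.0], [1.0, 0.5], [-1.0, 2.0]])
    G = np.array([[gauss_kernel(a, b) for b in X] for a in X])    # Gram matrix
    print(np.all(np.linalg.eigvalsh(G) > 0))                      # positive definite here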


    Kernel function example

    The dot product $\Phi(x_1, x_2) \cdot \Phi(y_1, y_2)$ after the nonlinear projection $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ can be achieved by the kernel function $K(x, y) = (x \cdot y)^2$

    [Two plots: the two-class data set in $\mathbb{R}^2$ and its projection into $\mathbb{R}^3$.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 39 / 48

    SVM examples

    [Two plots: linearly separable data; C = 100.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 40 / 48


    SVM examples (cont.)

    [Two plots: C = 100; C = 1.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 41 / 48

    SVM examples (cont.)

    [Two plots: linear function; quadratic polynomial, C = 10.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 42 / 48


    SVM examples (cont.)

    [Two plots: quadratic polynomial, C = 10; quadratic polynomial, C = 100.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 43 / 48

    SVM examples (cont.)

    [Two plots: cubic polynomial; Gaussian, $\sigma$ = 1.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 44 / 48


    SVM examples (cont.)

    [Two plots: quadratic polynomial; cubic polynomial, C = 10.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 45 / 48

    SVM examples (cont.)

    [Two plots: cubic polynomial, C = 10; degree-4 polynomial, C = 10.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 46 / 48


    SVM examples (cont.)

    [Two plots: Gaussian, $\sigma$ = 1; Gaussian, $\sigma$ = 3.]

    Lecture 15: Support Vector Machines Artificial Intelligence SS2009 47 / 48
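    Comparisons like the ones above can be reproduced with scikit-learn; a minimal sketch on invented data (gamma here plays the role of the Gaussian kernel's width parameter):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 2))
    t = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)            # circular class boundary

    for clf in (SVC(kernel="poly", degree=2, C=10.0),
                SVC(kernel="poly", degree=3, C=10.0),
                SVC(kernel="rbf", gamma=1.0, C=10.0)):
        print(clf.kernel, clf.fit(X, t).score(X, t))          # training accuracy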

    Summary

    SVMs are based on statistical learning theory

    Allows calculation of bounds on generalization performance

    Optimal separating hyperplanes

    Kernel trick (projection)

    Kernel functions are similarity measures

    SVMs perform comparably to neural networks