Kernel Methods & SVM (Support Vector Machines) · people.cs.pitt.edu/~pakdaman/tutorials/kernel.pdf


Kernel Methods & SVM: Support Vector Machines

Mahdi Pakdaman Naeini

PhD Candidate, University of Tehran
Senior Researcher, TOSAN Intelligent Data Miners

Outline

Motivation

Introduction to pattern recognition and machine learning

Introduction to Kernels

Sparse kernel methods (SVM)

Anomaly detection using kernel methods

Automated Fraud Detection: Kernel Methods & SVM

Motivation

Fraud detection's perspectives:
Fast recall time of the learner
Binary-class classification
One-class classification

Generalization performance of kernel methods:
Different kinds of information can be used
Good performance in high-dimensional feature spaces
Using linear learning typically has nice properties:
Unique optimal solutions
Fast learning algorithms
Better statistical analysis


Introduction

Data can exhibit regularities that may or may not be immediately apparent:
exact patterns – e.g. motions of planets

complex patterns – e.g. genes in DNA

probabilistic patterns – e.g. market research

Detecting patterns makes it possible to understand and/or exploit the regularities to make predictions

Machine learning and pattern recognition study the automatic detection of patterns in data


Pattern Defining

Generative models
Parametric models: maximum entropy model, GMM, …
Non-parametric models: histogram-based methods, KNN, Parzen estimate, …

Discriminative models
Linear and non-linear discriminant models: linear regression, neural networks, SVM, …


Historical perspective

Minsky and Papert highlighted the weakness of linear learning in their book Perceptrons

Neural networks overcame the problem by gluing together many linear units with non-linear activation functions
This solved the problem of capacity and led to a very impressive extension of the applicability of learning
But it ran into training problems of speed and multiple local minima


Kernel methods approach

• The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space
• The expectation is that the feature space has a much higher dimension than the input space


Form of the functions

• So kernel methods use linear functions in a feature space:

For regression, the prediction is the function value itself
For classification, the function value must be thresholded
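The equations on this slide did not survive extraction; a standard reconstruction of the linear form in feature space (the symbols f, w, b, Φ are assumed notation, not taken from the slide) is:

```latex
f(\mathbf{x}) \;=\; \langle \mathbf{w}, \Phi(\mathbf{x}) \rangle + b
          \;=\; \sum_{i=1}^{N} w_i\,\phi_i(\mathbf{x}) + b,
\qquad
h(\mathbf{x}) \;=\; \operatorname{sgn}\bigl(f(\mathbf{x})\bigr)
```

Regression uses f directly; classification uses the thresholded h.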


Example

Consider the mapping

Φ : R² → R³
(x₁, x₂) ↦ (z₁, z₂, z₃) = (x₁², √2·x₁x₂, x₂²)

If we consider a linear equation in this feature space:

We actually have an ellipse – i.e. a non-linear shape in the input space.


Examples of Kernels (III)

Polynomial kernel (n=2)

RBF kernel
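The kernel formulas themselves were images in the original; the standard definitions of the two kernels the slide names (σ, the RBF bandwidth, is an assumed symbol) are:

```latex
k_{\mathrm{poly}}(\mathbf{x}, \mathbf{y}) \;=\; \langle \mathbf{x}, \mathbf{y} \rangle^{2},
\qquad
k_{\mathrm{RBF}}(\mathbf{x}, \mathbf{y}) \;=\; \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{y} \rVert^{2}}{2\sigma^{2}} \right)
```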


Capacity of feature spaces

The capacity is proportional to the dimension of the feature space, for example in the 2-dimensional case


Problems of high dimensions

• Computational costs involved in dealing with large vectors: addressed by the kernel function and the kernel trick
• Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means it is unlikely to generalise well; addressed by the large margin trick


Different Perspective: Learning & Similarity

• Input / output sets X, Y
• Training set (x₁, y₁), …, (xₘ, yₘ) ∈ X × Y
• Generalization: given a previously unseen x ∈ X, find a suitable y ∈ Y
• (x, y) should be "similar" to (x₁, y₁), …, (xₘ, yₘ)
• How to measure similarity?
For outputs: a loss function (e.g. for y ∈ {-1, +1}, the zero-one loss)
For inputs: a kernel function


Similarity of Inputs

A symmetric function
k : X × X → R, (x, x') ↦ k(x, x')

For example: if X = Rⁿ, the canonical dot product

• If X is not a dot product space: assume that k has a representation as a dot product in a linear space H, i.e. there is a mapping Φ : X → H such that k(x, x') = ⟨Φ(x), Φ(x')⟩
• In that case, we can think of the patterns as Φ(x), Φ(x'), and also carry out geometric algorithms in the dot product space (feature space) H


An Example of Kernel Method

Idea: classify points in feature space according to which of the two class means is closer:

Compute the sign of the dot product between w := c₊ − c₋ and x − c


An Example: Cont'd

• Provides a geometric interpretation of Parzen windows
• The decision function is a hyperplane
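The picture on this slide is lost; as a minimal numpy sketch of the class-means rule above (the toy data, blob locations, and function names are my own illustrative choices, not from the slides), written purely in terms of kernel evaluations:

```python
import numpy as np

# Toy data: two well-separated Gaussian blobs (illustrative only).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=0.5, size=(20, 2))   # class +1
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))   # class -1

def linear_kernel(a, b):
    """Canonical dot product k(x, x') = <x, x'>."""
    return a @ b

def class_means_classify(x, X_pos, X_neg, kernel=linear_kernel):
    """Assign x to the class whose mean is closer in feature space.
    Decision rule: sign(<w, x - c>) with w = c+ - c-, expressed
    entirely through kernel evaluations."""
    k_pos = np.mean([kernel(x, xp) for xp in X_pos])     # <x, c+>
    k_neg = np.mean([kernel(x, xn) for xn in X_neg])     # <x, c->
    # Offset b = (||c-||^2 - ||c+||^2) / 2, again via kernels only.
    b = 0.5 * (np.mean([kernel(a, c) for a in X_neg for c in X_neg])
               - np.mean([kernel(a, c) for a in X_pos for c in X_pos]))
    return np.sign(k_pos - k_neg + b)

pred_pos = class_means_classify(np.array([2.0, 2.0]), X_pos, X_neg)
pred_neg = class_means_classify(np.array([-2.0, -2.0]), X_pos, X_neg)
```

Replacing `linear_kernel` with any valid kernel turns this into the Parzen-window-style estimator the slide alludes to.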


Example: All Degree 2 Monomials

Φ : R² → R³
(x₁, x₂) ↦ (z₁, z₂, z₃) = (x₁², √2·x₁x₂, x₂²)


The Kernel Trick

• The dot product in H can be computed in R²: ⟨Φ(x), Φ(y)⟩ = ⟨x, y⟩²
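This identity for the degree-2 monomial map can be checked numerically; a small numpy sketch (the test points are arbitrary choices of mine):

```python
import numpy as np

# Degree-2 monomial map from the slides:
# (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = phi(x) @ phi(y)   # dot product in feature space H = R^3
rhs = (x @ y) ** 2      # kernel evaluated directly in input space R^2
```

The two quantities agree, so the feature-space dot product never has to be formed explicitly.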


Kernel Trick : Cnt’d

• More generally: k(x, y) = ⟨x, y⟩ᵈ
• where Φ maps into the space spanned by all ordered products of d input directions


Kernel : more formal definition

A kernel k(x, y) is a similarity measure defined by an implicit mapping f from the original space to a vector space (feature space) such that k(x, y) = f(x)·f(y)

This similarity measure and the mapping can include:
Invariance or other a priori knowledge
Simpler structure (a linear representation of the data)
Possibly infinite dimension (hypothesis space for learning), but still computational efficiency when computing k(x, y)
Different kinds of data, e.g. strings, sets, graphs, trees, text, …


Valid Kernels

The function k(x, y) is a valid kernel if there exists a mapping f into a vector space (with a dot product) such that k can be expressed as k(x, y) = f(x)·f(y)

Theorem: k(x, y) is a valid kernel if k is positive definite and symmetric (a Mercer kernel)

A function is P.D. if

∫∫ K(x, y) f(x) f(y) dx dy ≥ 0  for all f ∈ L₂

In other words, the Gram matrix K (whose elements are k(xᵢ, xⱼ)) must be positive definite for all choices of points xᵢ, xⱼ from the input space
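The Gram-matrix condition is easy to check numerically; a minimal numpy sketch (random data and the σ = 1 RBF kernel are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))          # 10 arbitrary input points in R^3

def rbf(a, b, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

# Gram matrix G_ij = k(x_i, x_j)
G = np.array([[rbf(xi, xj) for xj in X] for xi in X])

is_symmetric = np.allclose(G, G.T)
min_eig = np.linalg.eigvalsh(G).min()  # PSD <=> all eigenvalues >= 0
```

A symmetric Gram matrix with no negative eigenvalues (up to numerical noise) is exactly what Mercer's condition demands on any finite sample.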


How to build new kernels

Kernel combinations, preserving validity:

K(x, y) = λK₁(x, y) + (1 − λ)K₂(x, y),  0 ≤ λ ≤ 1
K(x, y) = a·K₁(x, y),  a > 0
K(x, y) = K₁(x, y) + K₂(x, y)
K(x, y) = K₁(x, y)·K₂(x, y)
K(x, y) = f(x)·f(y),  f a real-valued function
K(x, y) = K₃(φ(x), φ(y))
K(x, y) = x′Py,  P a positive definite symmetric matrix
K(x, y) = K₁(x, y) / √(K₁(x, x)·K₁(y, y))


Introduction to Primal and Dual Forms of the Learning Function


Linear Regression

Given training data (points and labels):
S = {(x₁, y₁), (x₂, y₂), …, (xₗ, yₗ)},  xᵢ ∈ Rⁿ,  yᵢ ∈ R

Construct a linear function that best interpolates the given training set:
g(x) = ⟨w, x⟩ = w′x = Σᵢ₌₁ⁿ wᵢ xᵢ


Linear Regression example

(Figure: training points with the fitted line ⟨w, x⟩ plotted against y)


Ridge Regression

The inverse typically does not exist. Use the least-norm solution for fixed λ > 0 (regularized problem):

min L(w, S) = λ‖w‖² + ‖y − Xw‖²

Optimality condition:
∂L(w, S)/∂w = 2λw − 2X′(y − Xw) = 0

(X′X + λIₙ) w = X′y,  which requires O(n³) operations
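The normal equation above can be solved directly; a minimal numpy sketch (the synthetic data, `w_true`, and λ = 0.1 are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, l = 3, 50                        # n features, l training points
X = rng.normal(size=(l, n))
w_true = np.array([1.0, -2.0, 0.5]) # ground-truth weights (illustrative)
y = X @ w_true + 0.01 * rng.normal(size=l)

lam = 0.1
# Primal ridge solution: solve (X'X + lam*I_n) w = X'y, an n x n system
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```

With mild regularization and low noise, `w` recovers `w_true` closely; the cost is dominated by the n × n solve, matching the O(n³) note on the slide.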


Ridge Regression (cont.)

The inverse always exists for any λ > 0:
w = (X′X + λIₙ)⁻¹ X′y

Alternative (dual) representation:
(X′X + λI) w = X′y  ⇒  λw = X′(y − Xw)  ⇒  w = X′α,  where α = λ⁻¹(y − Xw)

λα = y − Xw = y − XX′α  ⇒  (XX′ + λI) α = y

α = (G + λIₗ)⁻¹ y,  where G = XX′

Solving this equation is O(l³)


Dual Ridge Regression

To predict a new point x:
g(x) = ⟨w, x⟩ = Σᵢ αᵢ⟨xᵢ, x⟩ = y′(G + λI)⁻¹ z,  where zᵢ = ⟨xᵢ, x⟩

Note: we need only compute G, the Gram matrix:
G = XX′,  Gᵢⱼ = ⟨xᵢ, xⱼ⟩

Ridge regression requires only inner products between data points
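The primal and dual solutions above are algebraically identical, which a short numpy sketch can confirm (random data and λ = 0.5 are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, l = 4, 30
X = rng.normal(size=(l, n))
y = rng.normal(size=l)
lam = 0.5

# Primal: w = (X'X + lam*I_n)^{-1} X'y   (n x n system)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Dual: alpha = (G + lam*I_l)^{-1} y with Gram matrix G = XX'  (l x l system)
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(l), y)
w_dual = X.T @ alpha                 # recover w = X' alpha

# Prediction at a new point uses only inner products:
# g(x) = sum_i alpha_i <x_i, x>
x_new = rng.normal(size=n)
g_dual = alpha @ (X @ x_new)
g_primal = w_primal @ x_new
```

Both routes give the same weight vector and the same prediction; only the size of the linear system solved differs (n × n versus l × l).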


Efficiency

To compute w in primal ridge regression is O(n³); to compute α in dual ridge regression is O(l³)

To predict a new point x:
primal, O(n):  g(x) = ⟨w, x⟩ = Σᵢ₌₁ⁿ wᵢ xᵢ
dual, O(nl):   g(x) = Σᵢ₌₁ˡ αᵢ⟨xᵢ, x⟩ = Σᵢ₌₁ˡ αᵢ Σⱼ₌₁ⁿ (xᵢ)ⱼ xⱼ

The dual is better if n ≫ l


Key ingredient of the dual solution

• Step 1: α = (G + λI)⁻¹ y, where G = XX′, Gᵢⱼ = ⟨xᵢ, xⱼ⟩
• Step 2: g(x) = Σᵢ αᵢ⟨xᵢ, x⟩
• Important observation: both steps only involve inner products between input data points


Kernel Trick: Summary

• Any learning algorithm that only depends on dot products can benefit from the kernel trick
• This way, we can apply linear methods to vectorial as well as non-vectorial data
• Think of kernels as non-linear similarity measures
• Examples of common kernels: polynomial, RBF (Gaussian), …
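Putting the pieces together, dual ridge regression plus the kernel trick gives kernel ridge regression; a minimal numpy sketch (the sine target, σ = 1, and λ = 10⁻³ are my own illustrative choices):

```python
import numpy as np

# Kernel ridge regression: replace the Gram matrix G = XX' of dual ridge
# regression with a kernel Gram matrix K, K_ij = k(x_i, x_j).
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = np.sin(X).ravel()                # non-linear target function

def rbf_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian RBF kernel matrix between rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

lam = 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

def predict(X_new):
    # g(x) = sum_i alpha_i k(x_i, x): inner products only, via the kernel
    return rbf_kernel(X_new, X) @ alpha

pred = predict(np.array([[0.5]]))[0]   # should be close to sin(0.5)
```

The linear dual algorithm is untouched; swapping the inner product for a kernel is what lets it fit a non-linear function.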


Kernel Method Application I: Sparse Kernels (SVM)


Sparse Kernel Methods: Support Vector Classifier


Separating Hyperplane


Optimal Separating Hyperplane


Eliminating the scaling freedom


Canonical Optimal Hyperplane


Formulation as an Optimization Problem
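The slide body was an image in the original; the canonical hard-margin formulation this title refers to (standard notation, not taken from the slide) is:

```latex
\min_{\mathbf{w}, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \bigl( \langle \mathbf{w}, \mathbf{x}_i \rangle + b \bigr) \ge 1,
\qquad i = 1, \dots, m
```

Minimizing ‖w‖ under the canonical constraints maximizes the margin 1/‖w‖ on each side of the hyperplane.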


Lagrange Function


Derivation of Dual Problem


The Support Vector Expansion


Dual problem
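The dual itself was an image; the standard SVM dual obtained from the Lagrangian, already written with a kernel (standard notation, not from the slide), is:

```latex
\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0
```

Only the training points with αᵢ > 0 (the support vectors) appear in the resulting decision function, which is what makes the method sparse.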


Example: Gaussian kernel


Non-Separable data



Soft Margin SVMs


The ν-property


Duals Using Kernels


SVM Training


Kernel Method Application II

Anomaly Detection using Kernel Methods


One-Class SVM

• Finding the smallest hypersphere containing most of the training data (Y. Chen et al. 2001)
• Mapping the data into feature space and finding the maximum-margin hyperplane separating the origin from the mapped data (Schölkopf et al. 1999)
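The formulations were images in the original; the standard statement of Schölkopf's origin-separation program (standard notation, with ν the fraction parameter and ξᵢ slack variables, not taken from the slide) is:

```latex
\min_{\mathbf{w}, \boldsymbol{\xi}, \rho} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  + \frac{1}{\nu l} \sum_{i=1}^{l} \xi_i - \rho
\quad \text{subject to} \quad
\langle \mathbf{w}, \Phi(\mathbf{x}_i) \rangle \ge \rho - \xi_i,
\qquad \xi_i \ge 0
```

The decision function sgn(⟨w, Φ(x)⟩ − ρ) then flags points on the origin side of the hyperplane as anomalies.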


Chen's Method


Schölkopf's Method: Separating unlabeled data from the origin


ν-Soft Margin Separation


Dual problem


Remark

These two methods are equivalent if the kernel K(x, y) depends only on x − y

Schölkopf et al. 2001


Conclusion

Kernel methods provide a general-purpose toolkit for pattern analysis

Advantages:
Kernels define a flexible interface to the data, enabling the user to encode prior knowledge into a measure of similarity between two items, with the proviso that it must satisfy the PSD property
Algorithms well-founded in statistical learning theory enable efficient and effective exploitation of the high-dimensional representations, giving good off-training performance
Subspace methods can often be implemented in kernel-defined feature spaces using dual representations
Overall, this gives a generic plug-and-play framework for analysing data, combining different data types, models, tasks, and preprocessing
We can accommodate different kinds of data in learning problems using kernel methods
Convex optimization problem: a local minimum is the global optimum

Disadvantages:
The choice of kernel is somewhat heuristic and depends on the application at hand
Risk of encountering overfitting


Never shall he die whose heart has come alive with love; our permanence is recorded in the ledger of the world.

Thank you!
