Kernel Methods & SVM (Support Vector Machines) · people.cs.pitt.edu/~pakdaman/tutorials/kernel.pdf


Kernel Methods & SVM: Support Vector Machines

Mahdi Pakdaman Naeini

PhD Candidate, University of Tehran
Senior Researcher, TOSAN Intelligent Data Miners

Outline

Motivation

Introduction to pattern recognition and machine learning

Introduction to Kernels

Sparse kernel methods (SVM)

Anomaly detection using kernel methods

Automated Fraud Detection: Kernel Methods & SVM

Motivation

Fraud detection's perspectives:
Fast recall time of the learner
Binary-class classification
One-class classification

Generalization performance of kernel methods:
Different kinds of information can be used
Good performance in high-dimensional feature spaces
Using linear learning typically has nice properties:
Unique optimal solutions
Fast learning algorithms
Better statistical analysis


Introduction

Data can exhibit regularities that may or may not be immediately apparent:
exact patterns – e.g. motions of planets

complex patterns – e.g. genes in DNA

probabilistic patterns – e.g. market research

Detecting patterns makes it possible to understand and/or exploit the regularities to make predictions

Machine learning and pattern recognition study the automatic detection of patterns in data


Pattern Defining

Generative models
Parametric models: maximum entropy model, GMM, …
Non-parametric models: histogram-based methods, KNN, Parzen estimate, …

Discriminative models
Linear and non-linear discriminant models: linear regression, neural networks, SVM, …


Historical perspective

Minsky and Papert highlighted the weakness of linear learning in their book Perceptrons

Neural networks overcame the problem by gluing together many linear units with non-linear activation functions
This solved the problem of capacity and led to a very impressive extension of the applicability of learning
But it ran into training problems of speed and multiple local minima


Kernel methods approach

• The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space
• The expectation is that the feature space has a much higher dimension than the input space


Form of the functions

• So kernel methods use linear functions in a feature space:

For regression, the prediction is the function value itself
For classification, the function value must be thresholded
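The equations on this slide did not survive extraction; a standard reconstruction of the linear form in feature space (the symbols f, w, b, Φ are assumed notation, not taken from the slide) is:

```latex
f(\mathbf{x}) \;=\; \langle \mathbf{w}, \Phi(\mathbf{x}) \rangle + b
          \;=\; \sum_{i=1}^{N} w_i\,\phi_i(\mathbf{x}) + b,
\qquad
h(\mathbf{x}) \;=\; \operatorname{sgn}\bigl(f(\mathbf{x})\bigr)
```

Regression uses f directly; classification uses the thresholded h.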


Example

Consider the mapping

Φ : R² → R³
(x₁, x₂) ↦ (z₁, z₂, z₃) = (x₁², √2·x₁x₂, x₂²)

If we consider a linear equation in this feature space:

We actually have an ellipse – i.e. a non-linear shape in the input space.


Examples of Kernels (III)

Polynomial kernel (n=2)

RBF kernel
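The kernel formulas themselves were images in the original; the standard definitions of the two kernels the slide names (σ, the RBF bandwidth, is an assumed symbol) are:

```latex
k_{\mathrm{poly}}(\mathbf{x}, \mathbf{y}) \;=\; \langle \mathbf{x}, \mathbf{y} \rangle^{2},
\qquad
k_{\mathrm{RBF}}(\mathbf{x}, \mathbf{y}) \;=\; \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{y} \rVert^{2}}{2\sigma^{2}} \right)
```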


Capacity of feature spaces

The capacity is proportional to the dimension of the feature space, for example in the 2-dimensional case


Problems of high dimensions

• Computational costs involved in dealing with large vectors: addressed by the kernel function and the kernel trick
• Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means it is unlikely to generalise well; addressed by the large margin trick


Different Perspective: Learning & Similarity

• Input / output sets X, Y
• Training set (x₁, y₁), …, (xₘ, yₘ) ∈ X × Y
• Generalization: given a previously unseen x ∈ X, find a suitable y ∈ Y
• (x, y) should be "similar" to (x₁, y₁), …, (xₘ, yₘ)
• How to measure similarity?
For outputs: a loss function (e.g. for y ∈ {-1, +1}, the zero-one loss)
For inputs: a kernel function


Similarity of Inputs

A symmetric function
k : X × X → R, (x, x') ↦ k(x, x')

For example: if X = Rⁿ, the canonical dot product

• If X is not a dot product space: assume that k has a representation as a dot product in a linear space H, i.e. there is a mapping Φ : X → H such that k(x, x') = ⟨Φ(x), Φ(x')⟩
• In that case, we can think of the patterns as Φ(x), Φ(x'), and also carry out geometric algorithms in the dot product space (feature space) H


An Example of Kernel Method

Idea: classify points in feature space according to which of the two class means is closer:

Compute the sign of the dot product between w := c₊ − c₋ and x − c


An Example: Cont'd

• Provides a geometric interpretation of Parzen windows
• The decision function is a hyperplane
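The picture on this slide is lost; as a minimal numpy sketch of the class-means rule above (the toy data, blob locations, and function names are my own illustrative choices, not from the slides), written purely in terms of kernel evaluations:

```python
import numpy as np

# Toy data: two well-separated Gaussian blobs (illustrative only).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=0.5, size=(20, 2))   # class +1
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))   # class -1

def linear_kernel(a, b):
    """Canonical dot product k(x, x') = <x, x'>."""
    return a @ b

def class_means_classify(x, X_pos, X_neg, kernel=linear_kernel):
    """Assign x to the class whose mean is closer in feature space.
    Decision rule: sign(<w, x - c>) with w = c+ - c-, expressed
    entirely through kernel evaluations."""
    k_pos = np.mean([kernel(x, xp) for xp in X_pos])     # <x, c+>
    k_neg = np.mean([kernel(x, xn) for xn in X_neg])     # <x, c->
    # Offset b = (||c-||^2 - ||c+||^2) / 2, again via kernels only.
    b = 0.5 * (np.mean([kernel(a, c) for a in X_neg for c in X_neg])
               - np.mean([kernel(a, c) for a in X_pos for c in X_pos]))
    return np.sign(k_pos - k_neg + b)

pred_pos = class_means_classify(np.array([2.0, 2.0]), X_pos, X_neg)
pred_neg = class_means_classify(np.array([-2.0, -2.0]), X_pos, X_neg)
```

Replacing `linear_kernel` with any valid kernel turns this into the Parzen-window-style estimator the slide alludes to.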


Example: All Degree 2 Monomials

Φ : R² → R³
(x₁, x₂) ↦ (z₁, z₂, z₃) = (x₁², √2·x₁x₂, x₂²)


The Kernel Trick

• The dot product in H can be computed in R²: ⟨Φ(x), Φ(y)⟩ = ⟨x, y⟩²
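This identity for the degree-2 monomial map can be checked numerically; a small numpy sketch (the test points are arbitrary choices of mine):

```python
import numpy as np

# Degree-2 monomial map from the slides:
# (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = phi(x) @ phi(y)   # dot product in feature space H = R^3
rhs = (x @ y) ** 2      # kernel evaluated directly in input space R^2
```

The two quantities agree, so the feature-space dot product never has to be formed explicitly.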


Kernel Trick : Cnt’d

• More generally: k(x, y) = ⟨x, y⟩ᵈ
• where Φ maps into the space spanned by all ordered products of d input directions


Kernel : more formal definition

A kernel k(x, y) is a similarity measure defined by an implicit mapping f from the original space to a vector space (feature space) such that k(x, y) = f(x)·f(y)

This similarity measure and the mapping can include:
Invariance or other a priori knowledge
Simpler structure (a linear representation of the data)
Possibly infinite dimension (hypothesis space for learning), but still computational efficiency when computing k(x, y)
Different kinds of data, e.g. strings, sets, graphs, trees, text, …


Valid Kernels

The function k(x, y) is a valid kernel if there exists a mapping f into a vector space (with a dot product) such that k can be expressed as k(x, y) = f(x)·f(y)

Theorem: k(x, y) is a valid kernel if k is positive definite and symmetric (a Mercer kernel)

A function is P.D. if

∫∫ K(x, y) f(x) f(y) dx dy ≥ 0  for all f ∈ L₂

In other words, the Gram matrix K (whose elements are k(xᵢ, xⱼ)) must be positive definite for all choices of points xᵢ, xⱼ from the input space
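The Gram-matrix condition is easy to check numerically; a minimal numpy sketch (random data and the σ = 1 RBF kernel are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))          # 10 arbitrary input points in R^3

def rbf(a, b, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

# Gram matrix G_ij = k(x_i, x_j)
G = np.array([[rbf(xi, xj) for xj in X] for xi in X])

is_symmetric = np.allclose(G, G.T)
min_eig = np.linalg.eigvalsh(G).min()  # PSD <=> all eigenvalues >= 0
```

A symmetric Gram matrix with no negative eigenvalues (up to numerical noise) is exactly what Mercer's condition demands on any finite sample.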


How to build new kernels

Kernel combinations, preserving validity:

K(x, y) = λK₁(x, y) + (1 − λ)K₂(x, y),  0 ≤ λ ≤ 1
K(x, y) = a·K₁(x, y),  a > 0
K(x, y) = K₁(x, y) + K₂(x, y)
K(x, y) = K₁(x, y)·K₂(x, y)
K(x, y) = f(x)·f(y),  f a real-valued function
K(x, y) = K₃(φ(x), φ(y))
K(x, y) = x′Py,  P a positive definite symmetric matrix
K(x, y) = K₁(x, y) / √(K₁(x, x)·K₁(y, y))


Introduction to Primal and Dual Forms of the Learning Function


Linear Regression

Given training data (points and labels):
S = {(x₁, y₁), (x₂, y₂), …, (xₗ, yₗ)},  xᵢ ∈ Rⁿ,  yᵢ ∈ R

Construct a linear function that best interpolates the given training set:
g(x) = ⟨w, x⟩ = w′x = Σᵢ₌₁ⁿ wᵢ xᵢ


Linear Regression example

(Figure: training points with the fitted line ⟨w, x⟩ plotted against y)


Ridge Regression

The inverse typically does not exist. Use the least-norm solution for fixed λ > 0 (regularized problem):

min L(w, S) = λ‖w‖² + ‖y − Xw‖²

Optimality condition:
∂L(w, S)/∂w = 2λw − 2X′(y − Xw) = 0

(X′X + λIₙ) w = X′y,  which requires O(n³) operations
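The normal equation above can be solved directly; a minimal numpy sketch (the synthetic data, `w_true`, and λ = 0.1 are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, l = 3, 50                        # n features, l training points
X = rng.normal(size=(l, n))
w_true = np.array([1.0, -2.0, 0.5]) # ground-truth weights (illustrative)
y = X @ w_true + 0.01 * rng.normal(size=l)

lam = 0.1
# Primal ridge solution: solve (X'X + lam*I_n) w = X'y, an n x n system
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```

With mild regularization and low noise, `w` recovers `w_true` closely; the cost is dominated by the n × n solve, matching the O(n³) note on the slide.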


Ridge Regression (cont.)

The inverse always exists for any λ > 0:
w = (X′X + λIₙ)⁻¹ X′y

Alternative (dual) representation:
(X′X + λI) w = X′y  ⇒  λw = X′(y − Xw)  ⇒  w = X′α,  where α = λ⁻¹(y − Xw)

λα = y − Xw = y − XX′α  ⇒  (XX′ + λI) α = y

α = (G + λIₗ)⁻¹ y,  where G = XX′

Solving this equation is O(l³)


Dual Ridge Regression

To predict a new point x:
g(x) = ⟨w, x⟩ = Σᵢ αᵢ⟨xᵢ, x⟩ = y′(G + λI)⁻¹ z,  where zᵢ = ⟨xᵢ, x⟩

Note: we need only compute G, the Gram matrix:
G = XX′,  Gᵢⱼ = ⟨xᵢ, xⱼ⟩

Ridge regression requires only inner products between data points
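The primal and dual solutions above are algebraically identical, which a short numpy sketch can confirm (random data and λ = 0.5 are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, l = 4, 30
X = rng.normal(size=(l, n))
y = rng.normal(size=l)
lam = 0.5

# Primal: w = (X'X + lam*I_n)^{-1} X'y   (n x n system)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Dual: alpha = (G + lam*I_l)^{-1} y with Gram matrix G = XX'  (l x l system)
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(l), y)
w_dual = X.T @ alpha                 # recover w = X' alpha

# Prediction at a new point uses only inner products:
# g(x) = sum_i alpha_i <x_i, x>
x_new = rng.normal(size=n)
g_dual = alpha @ (X @ x_new)
g_primal = w_primal @ x_new
```

Both routes give the same weight vector and the same prediction; only the size of the linear system solved differs (n × n versus l × l).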


Efficiency

To compute w in primal ridge regression is O(n³); to compute α in dual ridge regression is O(l³)

To predict a new point x:
primal, O(n):  g(x) = ⟨w, x⟩ = Σᵢ₌₁ⁿ wᵢ xᵢ
dual, O(nl):   g(x) = Σᵢ₌₁ˡ αᵢ⟨xᵢ, x⟩ = Σᵢ₌₁ˡ αᵢ Σⱼ₌₁ⁿ (xᵢ)ⱼ xⱼ

The dual is better if n ≫ l


Key ingredient of the dual solution

• Step 1: α = (G + λI)⁻¹ y, where G = XX′, Gᵢⱼ = ⟨xᵢ, xⱼ⟩
• Step 2: g(x) = Σᵢ αᵢ⟨xᵢ, x⟩
• Important observation: both steps only involve inner products between input data points


Kernel Trick: Summary

• Any learning algorithm that only depends on dot products can benefit from the kernel trick
• This way, we can apply linear methods to vectorial as well as non-vectorial data
• Think of kernels as non-linear similarity measures
• Examples of common kernels: polynomial, RBF (Gaussian), …
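Putting the pieces together, dual ridge regression plus the kernel trick gives kernel ridge regression; a minimal numpy sketch (the sine target, σ = 1, and λ = 10⁻³ are my own illustrative choices):

```python
import numpy as np

# Kernel ridge regression: replace the Gram matrix G = XX' of dual ridge
# regression with a kernel Gram matrix K, K_ij = k(x_i, x_j).
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = np.sin(X).ravel()                # non-linear target function

def rbf_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian RBF kernel matrix between rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

lam = 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

def predict(X_new):
    # g(x) = sum_i alpha_i k(x_i, x): inner products only, via the kernel
    return rbf_kernel(X_new, X) @ alpha

pred = predict(np.array([[0.5]]))[0]   # should be close to sin(0.5)
```

The linear dual algorithm is untouched; swapping the inner product for a kernel is what lets it fit a non-linear function.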


Kernel Method Application I: Sparse Kernels (SVM)


Sparse Kernel Methods: Support Vector Classifier


Separating Hyperplane


Optimal Separating Hyperplane


Eliminating the scaling freedom


Canonical Optimal Hyperplane


Formulation as an Optimization Problem
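The slide body was an image in the original; the canonical hard-margin formulation this title refers to (standard notation, not taken from the slide) is:

```latex
\min_{\mathbf{w}, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \bigl( \langle \mathbf{w}, \mathbf{x}_i \rangle + b \bigr) \ge 1,
\qquad i = 1, \dots, m
```

Minimizing ‖w‖ under the canonical constraints maximizes the margin 1/‖w‖ on each side of the hyperplane.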


Lagrange Function


Derivation of Dual Problem


The Support Vector Expansion


Dual problem
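The dual itself was an image; the standard SVM dual obtained from the Lagrangian, already written with a kernel (standard notation, not from the slide), is:

```latex
\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0
```

Only the training points with αᵢ > 0 (the support vectors) appear in the resulting decision function, which is what makes the method sparse.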


Example: Gaussian kernel


Non-Separable data



Soft Margin SVMs


The ν-property


Duals Using Kernels


SVM Training


Kernel Method Application II

Anomaly Detection using Kernel Methods


One-Class SVM

• Finding the smallest hypersphere containing most of the training data (Y. Chen et al. 2001)
• Mapping the data into feature space and finding the maximum-margin hyperplane separating the origin from the mapped data (Schölkopf et al. 1999)
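The formulations were images in the original; the standard statement of Schölkopf's origin-separation program (standard notation, with ν the fraction parameter and ξᵢ slack variables, not taken from the slide) is:

```latex
\min_{\mathbf{w}, \boldsymbol{\xi}, \rho} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  + \frac{1}{\nu l} \sum_{i=1}^{l} \xi_i - \rho
\quad \text{subject to} \quad
\langle \mathbf{w}, \Phi(\mathbf{x}_i) \rangle \ge \rho - \xi_i,
\qquad \xi_i \ge 0
```

The decision function sgn(⟨w, Φ(x)⟩ − ρ) then flags points on the origin side of the hyperplane as anomalies.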


Chen's Method


Schölkopf's Method: Separating unlabeled data from the origin


ν-Soft Margin Separation


Dual problem


Remark

These two methods are equivalent if the kernel K(x, y) depends only on x − y

Schölkopf et al. 2001


Conclusion

Kernel methods provide a general-purpose toolkit for pattern analysis

Advantages:
Kernels define a flexible interface to the data, enabling the user to encode prior knowledge into a measure of similarity between two items, with the proviso that it must satisfy the PSD property
Algorithms well-founded in statistical learning theory enable efficient and effective exploitation of the high-dimensional representations, giving good off-training performance
Subspace methods can often be implemented in kernel-defined feature spaces using dual representations
Overall, this gives a generic plug-and-play framework for analysing data, combining different data types, models, tasks, and preprocessing
We can accommodate different kinds of data in learning problems using kernel methods
Convex optimization problem: a local minimum is the global optimum

Disadvantages:
The choice of kernel is somewhat heuristic and depends on the application at hand
Risk of encountering overfitting


Never shall he die whose heart has come alive with love; our permanence is recorded in the ledger of the world.

Thank you!
