Kernel Methods & SVM (Support Vector Machines)
Mahdi Pakdaman Naeini
PhD Candidate, University of Tehran
Senior Researcher, TOSAN Intelligent Data Miners
Outline
Motivation
Introduction to pattern recognition and Machine Learning
Introduction to Kernels
Sparse kernel methods (SVM)
Anomaly detection using kernel methods
Automated Fraud Detection: Kernel Methods & SVM
Motivation
Fraud detection's perspectives:
Fast recall time of the learner
Binary-class classification
One-class classification
Generalization performance of the kernel methods:
Different kinds of information can be used
Good performance in high-dimensional feature spaces
Using linear learning typically has nice properties:
Unique optimal solutions
Fast learning algorithms
Better statistical analysis
Introduction
Data can exhibit regularities that may or may not be immediately apparent:
exact patterns – e.g. motions of planets
complex patterns – e.g. genes in DNA
probabilistic patterns – e.g. market research
Detecting patterns makes it possible to understand and/or exploit the regularities to make predictions.
Machine Learning and Pattern Recognition is the study of automatic detection of patterns in data
Pattern Defining
Generative Models
Parametric Models: Maximum Entropy Model, GMM, …
Non-Parametric Models: Histogram-Based Methods, KNN, Parzen estimate, …
Discriminative Models
Linear and non-linear discriminant models: Linear Regression, Neural Networks, SVM, …
Historical perspective
Minsky and Papert highlighted the weakness in their book Perceptrons.
Neural networks overcame the problem by gluing together many linear units with non-linear activation functions.
This solved the problem of capacity and led to a very impressive extension of the applicability of learning.
But they ran into training problems of speed and multiple local minima.
Kernel methods approach
• The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space.
• The expectation is that the feature space has a much higher dimension than the input space.
Form of the functions
• So kernel methods use linear functions in a feature space: f(x) = ⟨w, Φ(x)⟩ + b
For regression this could be the function itself.
For classification we require thresholding: sign(f(x)).
Example
Consider the mapping
Φ : R² → R³
(x₁, x₂) ↦ (z₁, z₂, z₃) = (x₁², √2 x₁x₂, x₂²)
If we consider a linear equation in this feature space, we actually have an ellipse – i.e. a non-linear shape in the input space.
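The point above can be checked numerically: a linear function in the degree-2 feature space is constant exactly on an ellipse in the input space. A minimal sketch (the weight vector w is a hypothetical choice, not from the slides):

```python
import numpy as np

def phi(x1, x2):
    """Degree-2 monomial feature map Phi: R^2 -> R^3."""
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# A linear function <w, Phi(x)> in feature space; this particular w
# corresponds to x1^2 + 4*x2^2 in the input space.
w = np.array([1.0, 0.0, 4.0])

# Points on the ellipse x1^2 + 4*x2^2 = 4 all lie on the same linear level set:
for (x1, x2) in [(2.0, 0.0), (0.0, 1.0), (np.sqrt(2.0), np.sqrt(0.5))]:
    print(w @ phi(x1, x2))   # each value is ≈ 4.0
```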
Examples of Kernels (III)
Polynomial kernel (n = 2): k(x, y) = ⟨x, y⟩²
RBF kernel: k(x, y) = exp(−‖x − y‖² / (2σ²))
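Both kernels are one-liners; a minimal sketch (the test vectors are made up for illustration):

```python
import numpy as np

def poly_kernel(x, y, n=2):
    """Homogeneous polynomial kernel k(x, y) = <x, y>^n."""
    return np.dot(x, y) ** n

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
print(poly_kernel(x, y))   # <x, y> = 3, so the kernel value is 9.0
print(rbf_kernel(x, x))    # a point is maximally similar to itself: 1.0
```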
Capacity of feature spaces
The capacity is proportional to the dimension. For example, in 2 dimensions a linear classifier can realise any labelling of three points in general position, but not of four.
Problems of high dimensions
• Computational costs involved in dealing with large vectors → Kernel Function & Kernel Trick
• Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means it is unlikely to generalise well → Large Margin Trick
Different Perspective: Learning & Similarity
• Input / output sets X, Y
• Training set (x₁, y₁), …, (xₘ, yₘ) ∈ X × Y
• Generalization: given a previously unseen x ∈ X, find a suitable y ∈ Y
• (x, y) should be "similar" to (x₁, y₁), …, (xₘ, yₘ)
• How to measure similarity?
For outputs: a loss function (e.g. for y ∈ {−1, +1}, the zero-one loss)
For inputs: a kernel function
Similarity of Inputs
A symmetric function
k : X × X → R, (x, x′) ↦ k(x, x′)
For example: if X = Rⁿ, the canonical dot product k(x, x′) = ⟨x, x′⟩
• If X is not a dot product space: assume that k has a representation as a dot product in a linear space H, i.e. there is a mapping Φ : X → H such that k(x, x′) = ⟨Φ(x), Φ(x′)⟩
• In that case, we can think of the patterns as Φ(x), Φ(x′), and carry out geometric algorithms in the dot product space (feature space) H
An Example of a Kernel Method
Idea: classify points in feature space according to which of the two class means is closer.
Compute the sign of the dot product between w := c₊ − c₋ and x − c, where c is the midpoint of the two class means.
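This mean-based classifier can be sketched in a few lines; the two Gaussian blobs below are made-up toy data, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy classes in R^2 (hypothetical data):
X_pos = rng.normal(loc=+2.0, size=(20, 2))
X_neg = rng.normal(loc=-2.0, size=(20, 2))

c_pos, c_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
w = c_pos - c_neg                 # w := c+ - c-
c = 0.5 * (c_pos + c_neg)         # midpoint of the two class means

def classify(x):
    """Sign of <w, x - c>: +1 if x is closer to c+, -1 if closer to c-."""
    return np.sign(w @ (x - c))

print(classify(np.array([2.0, 2.0])))    # +1: closer to the positive mean
print(classify(np.array([-3.0, -1.0])))  # -1: closer to the negative mean
```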
An Example: Cont'd
• Provides a geometric interpretation of Parzen windows.
• The decision function is a hyperplane.
Example: All Degree-2 Monomials
Φ : R² → R³
(x₁, x₂) ↦ (z₁, z₂, z₃) = (x₁², √2 x₁x₂, x₂²)
The Kernel Trick
• The dot product in H can be computed in R²:
⟨Φ(x), Φ(y)⟩ = x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂² = (x₁y₁ + x₂y₂)² = ⟨x, y⟩²
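The identity ⟨Φ(x), Φ(y)⟩ = ⟨x, y⟩² is easy to verify numerically; a minimal sketch with made-up test vectors:

```python
import numpy as np

def phi(x):
    """All degree-2 monomials of x in R^2 (with the sqrt(2) cross term)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 3.0])
y = np.array([2.0, -1.0])

lhs = phi(x) @ phi(y)   # dot product computed in the feature space H
rhs = (x @ y) ** 2      # the same quantity computed directly in R^2
print(lhs, rhs)         # both ≈ <x, y>^2 = (2 - 3)^2 = 1.0
```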
Kernel Trick: Cont'd
• More generally: ⟨Φ(x), Φ(y)⟩ = ⟨x, y⟩ᵈ
• where Φ maps into the space spanned by all ordered products of d input directions.
Kernel: a more formal definition
A kernel k(x, y) is a similarity measure defined by an implicit mapping f from the original space to a vector space (the feature space) such that k(x, y) = f(x)·f(y).
This similarity measure and the mapping can include:
Invariance or other a priori knowledge
Simpler structure (linear representation of the data)
Possibly infinite dimension (hypothesis space for learning), but still computational efficiency when computing k(x, y)
Different kinds of data, e.g. strings, sets, graphs, trees, text, …
Valid Kernels
The function k(x, y) is a valid kernel if there exists a mapping f into a vector space (with a dot product) such that k can be expressed as k(x, y) = f(x)·f(y).
Theorem: k(x, y) is a valid kernel if k is positive definite and symmetric (a Mercer kernel).
A function K is positive definite if
∫∫ K(x, y) f(x) f(y) dx dy ≥ 0 for all f ∈ L₂
In other words, the Gram matrix K (whose elements are k(xᵢ, xⱼ)) must be positive semi-definite, i.e. Σᵢⱼ aᵢ aⱼ k(xᵢ, xⱼ) ≥ 0 for all aᵢ, aⱼ.
How to build new kernels
Kernel combinations preserving validity:
K(x, y) = λ K₁(x, y) + (1 − λ) K₂(x, y), 0 ≤ λ ≤ 1
K(x, y) = a K₁(x, y), a ≥ 0
K(x, y) = K₁(x, y) · K₂(x, y)
K(x, y) = f(x) f(y), f a real-valued function
K(x, y) = K₃(φ(x), φ(y))
K(x, y) = x′ P y, P positive definite symmetric
K(x, y) = K₁(x, y) / √(K₁(x, x) K₁(y, y))
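Two of these closure rules can be checked empirically by confirming that the combined Gram matrix stays positive semi-definite; a sketch with hypothetical base kernels and random data:

```python
import numpy as np

def k1(x, y):   # base kernel: linear
    return x @ y

def k2(x, y):   # base kernel: Gaussian RBF
    return np.exp(-np.sum((x - y) ** 2))

def convex_combination(ka, kb, lam):
    """K = lam*Ka + (1 - lam)*Kb, 0 <= lam <= 1."""
    return lambda x, y: lam * ka(x, y) + (1 - lam) * kb(x, y)

def product(ka, kb):
    """K = Ka * Kb (pointwise product of kernels)."""
    return lambda x, y: ka(x, y) * kb(x, y)

def gram(k, X):
    return np.array([[k(a, b) for b in X] for a in X])

X = np.random.default_rng(1).normal(size=(10, 3))
for k in (convex_combination(k1, k2, 0.3), product(k1, k2)):
    eigs = np.linalg.eigvalsh(gram(k, X))
    print(eigs.min() >= -1e-10)   # Gram matrix stays PSD: True
```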
Introduction to the primal and dual forms of the learning function
Linear Regression
Given training data (points and labels):
S = {(x₁, y₁), (x₂, y₂), …, (x_l, y_l)}, xᵢ ∈ Rⁿ, yᵢ ∈ R
Construct a linear function that best interpolates the given training set:
g(x) = ⟨w, x⟩ = w'x = Σᵢ₌₁ⁿ wᵢ xᵢ
Ridge Regression
The inverse (of X'X) typically does not exist. Use the least-norm solution of the regularized problem, for fixed λ ≥ 0:
min_w L(w, S) = λ‖w‖² + ‖y − Xw‖²
Optimality condition:
∂L(w, S)/∂w = −2X'y + 2X'Xw + 2λw = 0
(X'X + λIₙ) w = X'y
Requires O(n³) operations.
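The primal solution is one linear solve of the n × n system above; a minimal sketch on synthetic data (the data-generating weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, l = 5, 50                       # n features, l training points
X = rng.normal(size=(l, n))
w_true = rng.normal(size=n)        # hypothetical generating weights
y = X @ w_true + 0.01 * rng.normal(size=l)

lam = 0.1
# Primal ridge regression: solve (X'X + lam*I_n) w = X'y
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

print(np.allclose(w, w_true, atol=0.1))   # True: close to the generating weights
```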
Ridge Regression (cont.)
The inverse always exists for any λ > 0:
w = (X'X + λI)⁻¹ X'y
Alternative (dual) representation:
(X'X + λI) w = X'y
w = λ⁻¹ (X'y − X'Xw) = X'α, where α := λ⁻¹ (y − Xw)
λα = y − Xw = y − XX'α
(XX' + λI) α = y
α = (G + λI)⁻¹ y, where G = XX'
Solving this l × l equation is O(l³).
Dual Ridge Regression
To predict a new point x:
g(x) = ⟨w, x⟩ = Σᵢ αᵢ ⟨xᵢ, x⟩ = y' (G + λI)⁻¹ z, where zᵢ = ⟨xᵢ, x⟩
Note: we need only compute G, the Gram matrix: G = XX', Gᵢⱼ = ⟨xᵢ, xⱼ⟩
Ridge regression requires only inner products between data points.
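The primal and dual forms give identical predictions, which is easy to check numerically; a sketch on random toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))            # l = 30 points, n = 4 features
y = rng.normal(size=30)
lam = 0.5

# Primal: (X'X + lam*I_n) w = X'y
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Dual: alpha = (G + lam*I_l)^{-1} y, with Gram matrix G = XX'
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(30), y)

x_new = rng.normal(size=4)
z = X @ x_new                           # z_i = <x_i, x_new>
print(np.isclose(w @ x_new, alpha @ z)) # True: both forms agree
```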
Efficiency
Computing w in primal ridge regression is O(n³); computing α in dual ridge regression is O(l³).
To predict a new point x:
primal, O(n): g(x) = ⟨w, x⟩ = Σᵢ₌₁ⁿ wᵢ xᵢ
dual, O(nl): g(x) = Σᵢ₌₁ˡ αᵢ ⟨xᵢ, x⟩
The dual is better if n ≫ l.
Key ingredient of the dual solution
• Step 1: α = (G + λI)⁻¹ y, where G = XX', Gᵢⱼ = ⟨xᵢ, xⱼ⟩
• Step 2: g(x) = Σᵢ αᵢ ⟨xᵢ, x⟩
• Important observation: both steps only involve inner products between input data points.
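Since both steps use only inner products, replacing each ⟨·, ·⟩ by a kernel k(·, ·) yields kernel ridge regression. A sketch fitting a nonlinear target with an RBF kernel (the data, σ, and λ values are made up for illustration):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel standing in for the inner product <Phi(a), Phi(b)>."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0])                     # a nonlinear toy target
lam = 1e-3

# Step 1: alpha = (K + lam*I)^{-1} y, with K_ij = k(x_i, x_j)
K = np.array([[rbf(a, b) for b in X] for a in X])
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Step 2: g(x) = sum_i alpha_i k(x_i, x)
def g(x):
    return alpha @ np.array([rbf(xi, x) for xi in X])

print(g(np.array([1.0])), np.sin(1.0))  # the fit tracks sin on this toy data
```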
Kernel Trick: Summary
• Any learning algorithm that only depends on dot products can benefit from the kernel trick.
• This way, we can apply linear methods to vectorial as well as non-vectorial data.
• Think of kernels as non-linear similarity measures.
• Examples of common kernels: polynomial, RBF, …
Kernel Method Application II
Anomaly Detection using Kernel Methods
One-Class SVM
• Finding the smallest hypersphere containing most of the training data (Y. Chen et al. 2001)
• Mapping data into feature space and finding the maximum-margin hyperplane separating the origin from the mapped data (Schölkopf et al. 1999)
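As a sketch, the Schölkopf formulation is available as `OneClassSVM` in scikit-learn; the training data and parameter values below are hypothetical, chosen only to illustrate the inlier/outlier decision:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X_train = rng.normal(size=(200, 2))      # "normal" behaviour (toy data)

# nu upper-bounds the fraction of training points treated as outliers
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)

print(clf.predict([[0.0, 0.0]])[0])      # +1: inlier, near the training mass
print(clf.predict([[8.0, 8.0]])[0])      # -1: anomaly, far from the data
```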
Schölkopf Method: Separating unlabeled data from the origin
Remark
These two methods are the same if the kernel K(x, y) depends only on x − y (Schölkopf et al. 2001).
Conclusion
Kernel methods provide a general-purpose toolkit for pattern analysis.
Advantages:
Kernels define a flexible interface to the data, enabling the user to encode prior knowledge into a measure of similarity between two items – with the proviso that it must satisfy the PSD property.
Algorithms well-founded in statistical learning theory enable efficient and effective exploitation of high-dimensional representations, giving good off-training performance.
Subspace methods can often be implemented in kernel-defined feature spaces using dual representations.
Overall this gives a generic plug-and-play framework for analysing data, combining different data types, models, tasks, and preprocessing.
We can accommodate different kinds of data in learning problems using kernel methods.
Convex optimization problem: a local minimum is a global optimum.
Disadvantages:
The choice of kernel is somewhat heuristic and depends on the application at hand.
Risk of overfitting.