Kernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines
John Shawe-Taylor
Department of Computer Science
University College London
[email protected]
June, 2009
Chicago/TTI Summer School, June 2009
Aim:
The tutorial is intended to give a broad introduction to the kernel approach to pattern analysis. This will cover:
• Why linear pattern functions?
• Why kernel approach?
• How to plug and play with the different components of a kernel-based pattern analysis system?
What won’t be included:
• Other approaches to Pattern Analysis
• Complete History
• Bayesian view of kernel methods
• More recent developments
OVERALL STRUCTURE
Part 1: Introduction to the Kernel methods approach.
Part 2: Projections and subspaces in the feature space.
Part 3: Other learning algorithms with the example of Support Vector Machines.
Part 4: Kernel design strategies.
PART 1 STRUCTURE
• Introduction to pattern analysis and brief history
• Kernel methods approach
• Worked example of kernel Ridge Regression
• Properties of kernels.
Pattern Analysis
• Data can exhibit regularities that may or may not be immediately apparent

– exact patterns – eg motions of planets
– complex patterns – eg genes in DNA
– probabilistic patterns – eg market research

• Detecting patterns makes it possible to understand and/or exploit the regularities to make predictions

• Pattern analysis is the study of the automatic detection of patterns in data
Defining patterns
• Exact patterns: non-trivial function f such that
f(x) = 0
• Approximate patterns: f such that
f(x) ≈ 0
• Statistical patterns: f such that
E_x[f(x)] ≈ 0
Pattern analysis algorithms

We would like algorithms to be:

• Computationally efficient – running time polynomial in the size of the data – often needs to be of a low degree

• Robust – able to handle noisy data, eg examples misclassified, noisy sensors, or outputs only given to a certain accuracy

• Statistically stable – able to distinguish between chance patterns and those characteristic of the underlying source of the data
Brief Historical Perspective
• Machine learning using neural-like structures was first considered seriously in the 1960s, with such systems as the Perceptron

– linear patterns
– simple learning algorithm
– shown to be limited in complexity

• Resurrection of these ideas in the more powerful multi-layer perceptrons of the 1980s

– networks of perceptrons with continuous activation functions
– very slow learning
– limited statistical analysis
Kernel methods

Kernel methods were (re)introduced in the 1990s with Support Vector Machines

• Linear functions, but in high-dimensional spaces – equivalent to non-linear functions in the input space

• Statistical analysis showing that a large margin can overcome the curse of dimensionality

• Extensions rapidly introduced for many tasks other than classification
Kernel methods approach
• Data embedded into a Euclidean feature space
• Linear relations are sought among the images of the data

• Algorithms implemented so that they only require inner products between vectors

• Embedding designed so that the inner products of the images of two points can be computed directly by an efficient ‘short-cut’ known as the kernel.
Kernel methods embedding
• The function φ embeds the data into a feature space where the non-linear pattern now appears linear. The kernel computes inner products in the feature space directly from the inputs:
κ(x, z) = 〈φ(x), φ(z)〉
Worked example: Ridge Regression
Consider the problem of finding a homogeneous real-valued linear function

g(x) = 〈w, x〉 = x′w = ∑_{i=1}^n w_i x_i,

that best interpolates a given training set

S = {(x_1, y_1), . . . , (x_m, y_m)}

of points x_i from X ⊆ R^n with corresponding labels y_i in Y ⊆ R.
Possible pattern function
• Measures the discrepancy between the function output and the correct output – squared to ensure it is always positive:

f_g((x, y)) = (g(x) − y)²

Note that the pattern function f_g is not itself a linear function, but a simple functional of the linear function g.

• We introduce notation: the matrix X has as its rows the m examples of S. Hence we can write

ξ = y − Xw

for the vector of differences between the g(x_i) and the y_i.
Optimising the choice of g
Need to ensure the flexibility of g is controlled – controlling the norm of w proves effective:

min_w L_λ(w, S) = min_w λ‖w‖² + ‖ξ‖²,

where we can compute

‖ξ‖² = 〈y − Xw, y − Xw〉 = y′y − 2w′X′y + w′X′Xw

Setting the derivative of L_λ(w, S) equal to 0 gives

X′Xw + λw = (X′X + λI_n)w = X′y
Primal solution
The term primal is used for the explicit representation in the feature space:

• We get the primal solution weight vector:

w = (X′X + λI_n)⁻¹ X′y

• and regression function

g(x) = x′w = x′(X′X + λI_n)⁻¹ X′y
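The primal solution above can be sketched in a few lines of NumPy; the data X, y and the regularisation parameter lam below are illustrative, not from the slides.

```python
import numpy as np

# Illustrative data: m = 20 examples as rows, n = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])    # noiseless linear labels
lam = 0.1
n = X.shape[1]

# w = (X'X + lambda I_n)^{-1} X'y  (solve is preferred over an explicit inverse)
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def g(x):
    # regression function g(x) = x'w
    return x @ w
```

Solving the linear system rather than forming the inverse is the standard numerically stable choice.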
Dual solution
A dual solution expresses the weight vector as a linear combination of the training examples:

X′Xw + λw = X′y implies

w = (1/λ)(X′y − X′Xw) = X′ (1/λ)(y − Xw) = X′α,

where

α = (1/λ)(y − Xw)    (1)

or equivalently

w = ∑_{i=1}^m α_i x_i

The vector α is the dual solution.
Dual solution
Substituting w = X′α into equation (1) we obtain:

λα = y − XX′α

implying

(XX′ + λI_m)α = y

This means the dual solution can be computed as:

α = (XX′ + λI_m)⁻¹ y

with the regression function

g(x) = x′w = x′X′α = 〈x, ∑_{i=1}^m α_i x_i〉 = ∑_{i=1}^m α_i 〈x, x_i〉
Key ingredients of dual solution
Step 1: Compute
α = (K + λI_m)⁻¹ y

where K = XX′, that is, K_ij = 〈x_i, x_j〉

Step 2: Evaluate on a new point x by

g(x) = ∑_{i=1}^m α_i 〈x, x_i〉

Important observation: Both steps only involve inner products between input data points
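The two steps can be sketched as follows with a linear kernel on illustrative data; the dual prediction is checked against the primal weight vector from the previous slide.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))
y = rng.normal(size=15)
lam = 0.5
m = X.shape[0]

# Step 1: alpha = (K + lambda I_m)^{-1} y with K_ij = <x_i, x_j>
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(m), y)

# Step 2: g(x) = sum_i alpha_i <x, x_i>
def g(x):
    return alpha @ (X @ x)

# primal solution for comparison: both give the same predictions
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```

Note the contrast in cost: Step 1 inverts an m × m matrix, the primal an n × n one, yet both describe the same regression function.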
Applying the ‘kernel trick’
Since the computation only involves inner products, we can substitute for all occurrences of 〈·, ·〉 a kernel function κ that computes:
κ(x, z) = 〈φ(x), φ(z)〉
and we obtain an algorithm for ridge regression in the feature space F defined by the mapping

φ : x ↦ φ(x) ∈ F
Note if φ is the identity this has no effect.
A simple kernel example

The simplest non-trivial kernel function is the quadratic kernel:

κ(x, z) = 〈x, z〉²

involving just one extra operation. But surprisingly this kernel function now corresponds to a complex feature mapping:

κ(x, z) = (x′z)² = z′(xx′)z = 〈vec(zz′), vec(xx′)〉

where vec(A) stacks the columns of the matrix A on top of each other. Hence, κ corresponds to the feature mapping

φ : x ↦ vec(xx′)
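The identity above is easy to verify numerically on illustrative vectors: the quadratic kernel equals the inner product of the vec(xx′) feature images.

```python
import numpy as np

def phi(x):
    # feature mapping phi: x -> vec(xx')
    return np.outer(x, x).ravel()

def kappa(x, z):
    # quadratic kernel: square the input inner product
    return (x @ z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
```

For n-dimensional inputs the explicit feature space has n² coordinates, while the kernel shortcut costs one inner product plus one squaring.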
Implications of the kernel trick
• Consider for example computing a regression function over 1000 images represented by pixel vectors – say 32 × 32 = 1024 pixels.

• By using the quadratic kernel we implement the regression function in a 1,000,000-dimensional space

• but actually using less computation for the learning phase than we did in the original space – inverting a 1000 × 1000 matrix instead of a 1024 × 1024 matrix.
Implications of kernel algorithms
• Can perform linear regression in very high-dimensional (even infinite-dimensional) spaces efficiently.

• This is equivalent to performing non-linear regression in the original input space: for example, the quadratic kernel leads to a solution of the form

g(x) = ∑_{i=1}^m α_i 〈x_i, x〉²

that is, a quadratic polynomial function of the components of the input vector x.

• Using these high-dimensional spaces must surely come with a health warning: what about the curse of dimensionality?
Defining kernels
• It is natural to consider defining kernels for your data. Clearly, a kernel must be symmetric and satisfy
κ(x,x) > 0
• BUT not every function satisfying these conditionsis a kernel.
• A commonly used kernel is the Gaussian kernel:

κ(x, z) = exp(−‖x − z‖² / (2σ²))

which corresponds to an infinite-dimensional feature space.
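The Gaussian kernel matrix over a dataset can be sketched as below; σ and the data are illustrative.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # kappa(x, z) = exp(-||x - z||^2 / (2 sigma^2)),
    # using ||x - z||^2 = ||x||^2 + ||z||^2 - 2 <x, z>
    sq = (
        (X**2).sum(axis=1)[:, None]
        + (Z**2).sum(axis=1)[None, :]
        - 2 * X @ Z.T
    )
    return np.exp(-sq / (2 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(X, X)
```

Note that every point has unit norm in the Gaussian feature space, since κ(x, x) = 1.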
Means and distances

Suppose we are given a kernel function:
κ(x, z) = 〈φ(x), φ(z)〉
and a training set S, what can we estimate?
• Consider some vector

w = ∑_{i=1}^m α_i φ(x_i)

we have

‖w‖² = 〈∑_{i=1}^m α_i φ(x_i), ∑_{j=1}^m α_j φ(x_j)〉 = ∑_{i,j=1}^m α_i α_j κ(x_i, x_j)
Means and distances
• Hence, we can normalise data in the feature space:

φ(x) ↦ φ̂(x) = φ(x)/‖φ(x)‖

since we can compute the corresponding kernel κ̂ by

κ̂(x, z) = 〈φ(x)/‖φ(x)‖, φ(z)/‖φ(z)‖〉 = κ(x, z)/√(κ(x, x) κ(z, z))
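Normalisation acts directly on the kernel matrix, as in this sketch with a linear kernel on illustrative data.

```python
import numpy as np

def normalise_kernel(K):
    # khat(x, z) = k(x, z) / sqrt(k(x, x) k(z, z))
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
Kn = normalise_kernel(X @ X.T)
```

After normalisation every image has unit norm, so all entries lie in [−1, 1] by the Cauchy–Schwarz inequality.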
Means and distances
• Given two vectors:

w_a = ∑_{i=1}^m α_i φ(x_i) and w_b = ∑_{i=1}^m β_i φ(x_i)

we have

w_a − w_b = ∑_{i=1}^m (α_i − β_i) φ(x_i)

so we can compute the distance between w_a and w_b as

d(w_a, w_b) = ‖w_a − w_b‖
Means and distances
• For example, the norm of the mean of a sample is given by

‖φ_S‖ = ‖(1/m) ∑_{i=1}^m φ(x_i)‖ = (1/m)√(j′Kj)

where j is the all-ones vector.

• Hence, the expected squared distance to the mean of a sample is:

E[‖φ(x) − φ_S‖²] = (1/m) ∑_{i=1}^m κ(x_i, x_i) − 〈φ_S, φ_S〉 = (1/m) tr(K) − (1/m²) j′Kj
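The identity above is easy to check with a linear kernel on illustrative data: the average squared distance to the sample mean is recovered from tr(K) and j′Kj alone.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
m = X.shape[0]
K = X @ X.T
j = np.ones(m)

# explicit computation in input space
avg_sq_dist = np.mean(((X - X.mean(axis=0)) ** 2).sum(axis=1))

# kernel-only expression: tr(K)/m - j'Kj/m^2
kernel_expr = np.trace(K) / m - (j @ K @ j) / m**2
```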
Means and distances
• Consider centering the sample, i.e. moving the origin to the sample mean: this will result in

‖φ_S‖² = (1/m²) j′Kj = 0

in the new coordinate system, while the lhs of the previous equation is unchanged by centering. Hence, centering minimises the trace.

• Centering is achieved by the transformation:

φ(x) ↦ φ̂(x) = φ(x) − φ_S
Means and distances
• What is the effect on the kernel and the kernel matrix?

κ̂(x, z) = 〈φ̂(x), φ̂(z)〉 = κ(x, z) − (1/m) ∑_{i=1}^m (κ(x, x_i) + κ(z, x_i)) + (1/m²) j′Kj

• Hence we can implement the centering of a kernel matrix by

K̂ = K − (1/m)(jj′K + Kjj′) + (1/m²)(j′Kj) jj′
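The matrix formula above is implemented below and checked against explicitly centering the data under a linear kernel (illustrative data).

```python
import numpy as np

def centre_kernel(K):
    m = K.shape[0]
    J = np.ones((m, m))                       # J = jj'
    # Khat = K - (1/m)(JK + KJ) + (1/m^2) JKJ
    return K - (J @ K + K @ J) / m + (J @ K @ J) / m**2

rng = np.random.default_rng(4)
X = rng.normal(size=(9, 2))
Kc = centre_kernel(X @ X.T)
Xc = X - X.mean(axis=0)
```

After centering, j′K̂j = 0, confirming that the mean of the images is at the origin.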
Simple novelty detection
• Consider putting a ball round the centre of mass φ_S of radius sufficient to contain all the data; a point x falls outside the ball when

‖φ(x) − φ_S‖ > max_{1≤i≤m} ‖φ(x_i) − φ_S‖
• Give a kernel expression for this quantity.
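One possible answer to the exercise, sketched with a linear kernel on illustrative data, uses the expansion ‖φ(x) − φ_S‖² = κ(x, x) − (2/m) ∑_i κ(x, x_i) + j′Kj/m².

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
m = X.shape[0]
K = X @ X.T
jKj = K.sum()                       # j'Kj

def dist2_to_mean(x):
    # kappa(x, x) - (2/m) sum_i kappa(x, x_i) + j'Kj / m^2
    return x @ x - 2 * (X @ x).mean() + jKj / m**2

# ball radius: the largest training distance to the centre of mass
radius2 = max(dist2_to_mean(xi) for xi in X)

def is_novel(x):
    return dist2_to_mean(x) > radius2
```

Only kernel evaluations appear, so the same code works for any kernel once `x @ x` and `X @ x` are replaced by the corresponding κ values.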
OVERALL STRUCTURE
Part 1: Introduction to the Kernel methods approach.
Part 2: Projections and subspaces in the feature space.
Part 3: Other learning algorithms with the example of Support Vector Machines.
Part 4: Kernel design strategies.
Part 2 structure
• Simple classification algorithm
• Fisher discriminant analysis.
• Principal components analysis.
Simple classification algorithm
• Consider finding the centres of mass of the positive and negative examples and classifying a test point by measuring which is closest

h(x) = sgn(‖φ(x) − φ_{S−}‖² − ‖φ(x) − φ_{S+}‖²)

• we can express this as a function of kernel evaluations

h(x) = sgn((1/m_+) ∑_{i=1}^{m_+} κ(x, x_i) − (1/m_-) ∑_{i=m_++1}^m κ(x, x_i) − b),

where

b = (1/(2m_+²)) ∑_{i,j=1}^{m_+} κ(x_i, x_j) − (1/(2m_-²)) ∑_{i,j=m_++1}^m κ(x_i, x_j)
Simple classification algorithm
• equivalent to dividing the space with a hyperplane perpendicular to the line half way between the two centres, with weight vector given by

w = (1/m_+) ∑_{i=1}^{m_+} φ(x_i) − (1/m_-) ∑_{i=m_++1}^m φ(x_i)

• The function is the difference in likelihood of the Parzen window density estimators for the positive and negative examples

• We will see some examples of the performance of this algorithm in Part 3.
Variance of projections
• Consider projections of the datapoints φ(x_i) onto a unit vector direction v in the feature space: the average is given by

μ_v = E[‖P_v(φ(x))‖] = E[v′φ(x)] = v′φ_S

of course this is 0 if the data has been centred.

• the average square is given by

E[‖P_v(φ(x))‖²] = E[v′φ(x)φ(x)′v] = (1/m) v′X′Xv
Variance of projections
• Now suppose v has the dual representation v = X′α. The average is given by

μ_v = (1/m) α′XX′j = (1/m) α′Kj

• the average square is given by

(1/m) v′X′Xv = (1/m) α′XX′XX′α = (1/m) α′K²α

• Hence, the variance in direction v is given by

σ_v² = (1/m) α′K²α − (1/m²)(α′Kj)²
Fisher discriminant
• The Fisher discriminant is a thresholded linear classifier:

f(x) = sgn(〈w, φ(x)〉 + b)

where w is chosen to maximise the quotient:

J(w) = (μ_w⁺ − μ_w⁻)² / ((σ_w⁺)² + (σ_w⁻)²)

• As with ridge regression, it makes sense to regularise if we are working in high-dimensional kernel spaces, so maximise

J(w) = (μ_w⁺ − μ_w⁻)² / ((σ_w⁺)² + (σ_w⁻)² + λ‖w‖²)
Fisher discriminant
• Using the results we now have, we can substitute dual expressions for all of these quantities and solve using Lagrange multipliers.

• The resulting classifier has dual variables

α = (BK + λI)⁻¹ y

where B = D − C with

C_ij = 2m_-/(m m_+) if y_i = 1 = y_j,  2m_+/(m m_-) if y_i = −1 = y_j,  0 otherwise
and

D_ij = 2m_-/m if i = j and y_i = 1,  2m_+/m if i = j and y_i = −1,  0 otherwise

and b = 0.5 α′Kt with

t_i = 1/m_+ if y_i = 1,  1/m_- if y_i = −1,  0 otherwise

giving a decision function

f(x) = sgn(∑_{i=1}^m α_i κ(x_i, x) − b)
Overview of remainder of tutorial
• Plug and play aspects of kernel methods:
Data → kernel → preprocessing → pattern analysis
• Part 2: preprocessing: for example normalisation, projection into subspaces, kernel PCA, kernel CCA, etc.

• Part 3: pattern analysis: support vector machines, novelty detection, support vector regression.

• Part 4: kernel design: properties of kernels, kernels for text, string kernels.
Preprocessing
• Corresponds to feature selection, or learning the feature space

• Note that in kernel methods the feature space is only determined up to orthogonal transformations (change of basis):

φ̂(x) = Uφ(x)

for some orthogonal transformation U (U′U = I = UU′), then

κ̂(x, z) = 〈Uφ(x), Uφ(z)〉 = φ(x)′U′Uφ(z) = φ(x)′φ(z) = κ(x, z)

• so feature selection is equivalent to subspace projection in kernel-defined feature spaces
Subspace methods
• Principal components analysis: choose directions to maximise the variance in the training data

• Canonical correlation analysis: choose directions to maximise the correlations between two different views of the same objects

• Gram-Schmidt: greedily choose directions according to largest residual norms

• Partial least squares: greedily choose directions with maximal covariance with the target (will not cover this)

In all cases we need kernel versions in order to apply these methods in high-dimensional kernel-defined feature spaces
Principal Components Analysis
• PCA is a subspace method – that is, it involves projecting the data into a lower-dimensional space.

• The subspace is chosen to ensure maximal variance of the projections:

w = argmax_{w: ‖w‖=1} w′X′Xw

• This is equivalent to maximising the Rayleigh quotient:

(w′X′Xw) / (w′w)
Principal Components Analysis
• We can optimise using Lagrange multipliers in order to remove the constraints:

L(w, λ) = w′X′Xw − λw′w

taking derivatives wrt w and setting them equal to 0 gives:

X′Xw = λw

implying that w is an eigenvector of X′X.

• Note that

λ = w′X′Xw = ∑_{i=1}^m 〈w, x_i〉²
Principal Components Analysis
• So principal components analysis performs an eigenvalue decomposition of X′X and projects into the space spanned by the first k eigenvectors

• This captures a total of

∑_{i=1}^k λ_i

of the overall variance:

∑_{i=1}^m ‖x_i‖² = ∑_{i=1}^n λ_i = tr(K)
Kernel PCA
• We would like to find a dual representation of the principal eigenvectors and hence of the projection function.

• Suppose that w, λ ≠ 0 is an eigenvector/eigenvalue pair for X′X; then Xw, λ is such a pair for XX′:

(XX′)Xw = X(X′X)w = λXw

• and vice versa: α, λ → X′α, λ

(X′X)X′α = X′(XX′)α = λX′α

• Note that we get back to where we started if we do it twice.
Kernel PCA
• Hence, there is a 1-1 correspondence between eigenvectors corresponding to non-zero eigenvalues, but note that if ‖α‖ = 1

‖X′α‖² = α′XX′α = α′Kα = λ

so if α_i, λ_i, i = 1, . . . , k are the first k eigenvectors/eigenvalues of K, then

(1/√λ_i) α_i

are dual representations of the first k eigenvectors w_1, . . . , w_k of X′X with the same eigenvalues.

• Computing projections:

〈w_i, φ(x)〉 = (1/√λ_i) 〈X′α_i, φ(x)〉 = (1/√λ_i) ∑_{j=1}^m α_{ij} κ(x_j, x)
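The dual projection can be sketched as below with a linear kernel on illustrative data; since projections are only determined up to sign, the check against the primal eigenvectors of X′X compares absolute values.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(12, 4))
K = X @ X.T
k = 2                                # number of components kept

# eigen-decomposition of K, reordered so the largest eigenvalues come first
evals, evecs = np.linalg.eigh(K)
evals, evecs = evals[::-1], evecs[:, ::-1]

def project(x):
    # <w_i, phi(x)> = (1/sqrt(lambda_i)) sum_j alpha_ij kappa(x_j, x)
    kx = X @ x
    return np.array([evecs[:, i] @ kx / np.sqrt(evals[i]) for i in range(k)])
```

Replacing `X @ x` with a column of kernel evaluations gives kernel PCA for an arbitrary kernel.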
OVERALL STRUCTURE
Part 1: Introduction to the Kernel methods approach.
Part 2: Projections and subspaces in the feature space.
Part 3: Other learning algorithms with the example of Support Vector Machines.
Part 4: Kernel design strategies.
Part 3 structure
• Perceptron algorithm
• Generalisation of SVMs
• Support Vector Machine Optimisation
• Novelty detection
Kernel algorithms
• Have already seen three kernel based algorithms:
– Ridge regression
– Fisher discriminant
– Simple novelty detector

• Key properties that enable an algorithm to be kernelised:

– Must reduce to estimating a linear function in the feature space
– Weight vector must be in the span of the training examples
– Algorithm to find the dual coefficients only involves inner products of training data
• Very simple example: perceptron algorithm
Perceptron algorithm
• initialise w ← 0
• repeat: if for some example i, yi〈w, φ(xi)〉 ≤ 0, update
w ←− w + yiφ(xi)
• Clearly a dual version:

– initialise αi ← 0 for all i
– update: αi ← αi + yi
– Note can evaluate as for ridge regression:

〈w, φ(x)〉 = Σ_i αi κ(xi, x)
• Dual version by Aizerman et al. (1964), but tends to overfit
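The dual update above is a few lines of code. A minimal sketch (function name and toy data are our own), using the slide's update αi ← αi + yi and evaluation Σ_i αi κ(xi, x):

```python
import numpy as np

def kernel_perceptron(K, y, epochs=100):
    """Dual (kernel) perceptron: K is the training kernel matrix, y the
    +/-1 labels.  The learned function is g(x) = sum_i alpha_i kappa(x_i, x)."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (K[i] @ alpha) <= 0:   # the mistake test of the slide
                alpha[i] += y[i]             # dual update
                mistakes += 1
        if mistakes == 0:                    # converged: no mistakes in a pass
            break
    return alpha

# linearly separable toy problem with a linear kernel
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
K = X @ X.T
alpha = kernel_perceptron(K, y)
print(np.sign(K @ alpha))   # matches y on the training set
```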
Margin Perceptron algorithm
• Margin version obtained if we replace the test by
yi〈w, φ(xi)〉 ≤ τ
• Set τ = 1 – in 1964 one parameter away from a kernel classification algorithm able to generalise in high dimensions
Support Vector Machines (SVM)
• SVM seeks a linear function in a feature space defined implicitly via a kernel κ:

κ(x, z) = 〈φ(x), φ(z)〉

that optimises a bound on the generalisation.

• Several bounds on the performance of SVMs exist; all use the margin to give an empirically defined complexity: data-dependent structural risk minimisation
• Tightest is the PAC-Bayes bound
Margins in SVMs
• Critical to the bound will be the margin of the classifier

γ(x, y) = yg(x) = y(〈w, φ(x)〉 + b):

positive if correctly classified, and measures the distance from the separating hyperplane when the weight vector is normalised.

• The margin of a linear function g is

γ(g) = min_i γ(xi, yi)

though this is frequently increased to allow some ‘margin errors’.
Form of the SVM bound
• If we define the inverse of the KL by
KL−1(q, A) = max{p : KL(q‖p) ≤ A}
then we have, with probability at least 1 − δ, for all µ:

Pr(〈w, φ(x)〉 ≠ y) ≤ 2 min_µ KL⁻¹( Em[F(µγ(x, y))], (µ²/2 + ln((m + 1)/δ)) / m )

where F(t) = 1 − (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx
Slack variable conversion
Bound and slack variable used in optimisation
Gives SVM Optimisation
• Primal form:

min_{w,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξi
s.t. yi〈w, φ(xi)〉 ≥ 1 − ξi, i = 1, . . . , m
     ξi ≥ 0, i = 1, . . . , m

• Dual form:

max_α Σ_{i=1}^m αi − (1/2) Σ_{i,j=1}^m αi αj yi yj κ(xi, xj)
s.t. 0 ≤ αi ≤ C, i = 1, . . . , m

where κ(xi, xj) = 〈φ(xi), φ(xj)〉 and 〈w, φ(x)〉 = Σ_{i=1}^m αi yi κ(xi, x).
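The box-constrained dual above (the bias-free form shown here) can be illustrated with a simple projected-gradient ascent. This is a minimal sketch under our own names and toy data, not a production solver such as SMO:

```python
import numpy as np

def svm_dual_pg(K, y, C=1.0, eta=0.01, iters=2000):
    """Projected gradient ascent on the (bias-free) SVM dual:
    maximise sum_i a_i - 0.5 sum_ij a_i a_j y_i y_j K_ij over [0, C]^m."""
    m = len(y)
    Q = (y[:, None] * y[None, :]) * K            # Q_ij = y_i y_j kappa(x_i, x_j)
    alpha = np.zeros(m)
    for _ in range(iters):
        grad = 1.0 - Q @ alpha                   # gradient of the dual objective
        alpha = np.clip(alpha + eta * grad, 0.0, C)  # project back onto the box
    return alpha

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
alpha = svm_dual_pg(K, y)
g = K @ (alpha * y)        # <w, phi(x_i)> = sum_j alpha_j y_j kappa(x_j, x_i)
print(np.sign(g))          # matches y; each y_i g(x_i) converges to 1 here
```

The fixed step size works only because eta is below 2 divided by the largest eigenvalue of Q on this toy problem; a real solver would use SMO or a QP package.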
Dual form of the SVM problem
Decision boundary and γ margin for a 1-norm SVM with a Gaussian kernel:
Novelty detection

We can also motivate novelty detection by an analysis similar to that for the SVM: consider a hypersphere centred at c of radius r and the function g:

g(x) = 0, if ‖c − φ(x)‖ ≤ r;
       (‖c − φ(x)‖² − r²)/γ, if r² ≤ ‖c − φ(x)‖² ≤ r² + γ;
       1, otherwise.

With probability at least 1 − δ

E[g(x)] ≤ Ê[g(x)] + 6R²/(γ√m) + 3√(ln(2/δ)/(2m))

Note that the tension is between creating a tight bound and defining a small sphere.
Novelty detection

Let
ξi = (‖c − φ(xi)‖² − r²)₊

so that
Ê[g(x)] ≤ (1/(γm)) ‖ξ‖₁

Treating γ as fixed, we minimise the bound by minimising ‖ξ‖₁ and r:

min_{c,r,ξ} r² + C‖ξ‖₁
subject to ‖φ(xi) − c‖² ≤ r² + ξi
           ξi ≥ 0, i = 1, . . . , m
Novelty detection

Dual optimisation: maximise

W(α) = Σ_{i=1}^m αi κ(xi, xi) − Σ_{i,j=1}^m αi αj κ(xi, xj)

subject to Σ_{i=1}^m αi = 1 and 0 ≤ αi ≤ C, i = 1, . . . , m, with the final novelty test being:

f(·) = H[ κ(·, ·) − 2 Σ_{i=1}^m αi* κ(xi, ·) + D ]

where

D = Σ_{i,j=1}^m αi* αj* κ(xi, xj) − (r*)² − γ
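For the special uniform choice αi = 1/m the centre c becomes the centre of mass of the data in feature space, recovering the simple novelty detector mentioned at the start of Part 3. A minimal numpy sketch (names, data and the Gaussian kernel, for which κ(x, x) = 1, are our own choices):

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    """Gaussian kernel matrix between the rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def novelty_detector(Xtrain, sigma=1.0):
    """Distance in feature space to the centre of mass c = (1/m) sum_i phi(x_i),
    i.e. the dual solution with the uniform weights alpha_i = 1/m."""
    K = rbf(Xtrain, Xtrain, sigma)
    const = K.mean()                             # (1/m^2) sum_ij kappa(x_i, x_j)

    def dist2(x):
        kx = rbf(Xtrain, np.atleast_2d(x), sigma)
        return 1.0 - 2.0 * kx.mean(axis=0) + const   # kappa(x, x) = 1 for RBF

    r2 = dist2(Xtrain).max()                     # radius: furthest training point
    return lambda x: dist2(x) > r2 + 1e-12       # True = flagged as novel

rng = np.random.default_rng(0)
Xtr = rng.normal(0, 0.3, size=(50, 2))
is_novel = novelty_detector(Xtr)
print(is_novel(np.array([[5.0, 5.0]])))   # a far-away point is flagged
```

Solving the full dual above would instead optimise α over the simplex intersected with the box, shrinking the sphere below this centroid-based one.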
OVERALL STRUCTURE
Part 1: Introduction to the Kernel methods approach.

Part 2: Projections and subspaces in the feature space.

Part 3: Other learning algorithms with the example of Support Vector Machines.

Part 4: Kernel design strategies.
Part 4 structure
• Kernel design strategies.
• Kernels for text and string kernels.
• Kernels for other structures.
• Kernels from generative models.
Kernel functions

• Already seen some properties of kernels:
– symmetric:
κ(x, z) = 〈φ(x), φ(z)〉 = 〈φ(z), φ(x)〉 = κ(z,x)
– kernel matrices psd:

u′Ku = Σ_{i,j=1}^m ui uj 〈φ(xi), φ(xj)〉
     = 〈 Σ_{i=1}^m ui φ(xi), Σ_{j=1}^m uj φ(xj) 〉
     = ‖ Σ_{i=1}^m ui φ(xi) ‖² ≥ 0
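Both finitely-checkable conditions, symmetry and positive semi-definiteness of the kernel matrix, can be verified numerically on any sample; a small sketch (function name and examples are our own):

```python
import numpy as np

def is_valid_gram(K, tol=1e-10):
    """Check the two kernel-matrix conditions: symmetry and psd
    (all eigenvalues of the symmetrised matrix >= 0 up to tolerance)."""
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol)
    return bool(symmetric and psd)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
print(is_valid_gram(X @ X.T))                              # a genuine Gram matrix
print(is_valid_gram(np.array([[0.0, 1.0], [1.0, 0.0]])))   # eigenvalues +1 and -1
```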
Kernel functions
• These two properties are all that is required for a kernel function to be valid: symmetric and every kernel matrix is psd.

• Note that this is equivalent to all eigenvalues being non-negative – recall that the eigenvalues of the kernel matrix measure the sum of the squares of the projections onto the eigenvector.

• If we have uncountable domains we should also have continuity, though there are exceptions to this as well.
Kernel functions
Proof outline:
• Define the feature space as a class of functions:

F = { Σ_{i=1}^m αi κ(xi, ·) : m ∈ N, xi ∈ X, αi ∈ R, i = 1, . . . , m }
• Linear space
• embedding given by

x ↦ κ(x, ·)
Kernel functions
• inner product between

f(x) = Σ_{i=1}^m αi κ(xi, x) and g(x) = Σ_{i=1}^n βi κ(zi, x)

defined as

〈f, g〉 = Σ_{i=1}^m Σ_{j=1}^n αi βj κ(xi, zj) = Σ_{i=1}^m αi g(xi) = Σ_{j=1}^n βj f(zj),
• well-defined
• 〈f, f〉 ≥ 0 by psd property.
Kernel functions
• so-called reproducing property:
〈f, φ(x)〉 = 〈f, κ(x, ·)〉 = f(x)
• implies that the inner product corresponds to function evaluation – learning a function corresponds to learning a point, namely the weight vector corresponding to that function:
〈wf , φ(x)〉 = f(x)
Kernel constructions
For κ1, κ2 valid kernels, φ any feature map, B a psd matrix, a ≥ 0 and f any real-valued function, the following are valid kernels:
• κ(x, z) = κ1(x, z) + κ2(x, z),
• κ(x, z) = aκ1(x, z),
• κ(x, z) = κ1(x, z)κ2(x, z),
• κ(x, z) = f(x)f(z),
• κ(x, z) = κ1(φ(x),φ(z)),
• κ(x, z) = x′Bz.
Kernel constructions

The following are also valid kernels:

• κ(x, z) = p(κ1(x, z)), for p any polynomial with positive coefficients,

• κ(x, z) = exp(κ1(x, z)),

• κ(x, z) = exp(−‖x − z‖²/(2σ²)).
Proof of the third: normalise the second kernel:

exp(〈x, z〉/σ²) / √( exp(‖x‖²/σ²) exp(‖z‖²/σ²) )
  = exp( 〈x, z〉/σ² − 〈x, x〉/(2σ²) − 〈z, z〉/(2σ²) )
  = exp( −‖x − z‖²/(2σ²) ).
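The normalisation identity above can be checked numerically; a short sketch with arbitrary example vectors of our own choosing:

```python
import numpy as np

# Check: normalising exp(<x,z>/sigma^2) yields the Gaussian kernel
# exp(-||x - z||^2 / (2 sigma^2)).
sigma = 1.3
x = np.array([0.5, -1.2, 2.0])
z = np.array([1.0, 0.3, -0.7])

k_exp = lambda a, b: np.exp(a @ b / sigma ** 2)          # the exponential kernel
normalised = k_exp(x, z) / np.sqrt(k_exp(x, x) * k_exp(z, z))
gaussian = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
print(np.isclose(normalised, gaussian))   # True
```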
Subcomponents kernel

For the kernel 〈x, z〉^s the features can be indexed by sequences

i = (i1, . . . , in), Σ_{j=1}^n ij = s

where
φ_i(x) = x1^{i1} x2^{i2} · · · xn^{in}

A similar kernel can be defined in which all subsets of features occur:

φ : x ↦ (φA(x))_{A⊆{1,...,n}}

where
φA(x) = Π_{i∈A} xi
Subcomponents kernel
So we have

κ⊆(x, y) = 〈φ(x), φ(y)〉 = Σ_{A⊆{1,...,n}} φA(x) φA(y)
         = Σ_{A⊆{1,...,n}} Π_{i∈A} xi yi = Π_{i=1}^n (1 + xi yi)

Can represent the computation with a graph whose i-th stage has one edge labelled 1 and one labelled xi yi. Each path in the graph corresponds to a feature.
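The closed-form product can be checked against the explicit sum over subsets; a small sketch with illustrative vectors:

```python
import numpy as np
from itertools import combinations

def all_subsets_kernel_bruteforce(x, y):
    """Sum over every subset A of feature indices of prod_{i in A} x_i y_i
    (the empty set contributes 1)."""
    n = len(x)
    total = 0.0
    for r in range(n + 1):
        for A in combinations(range(n), r):
            total += np.prod([x[i] * y[i] for i in A])  # empty product is 1
    return total

x = np.array([1.0, 2.0, -0.5])
y = np.array([0.5, 1.0, 3.0])
closed_form = np.prod(1 + x * y)          # the product formula of the slide
print(np.isclose(all_subsets_kernel_bruteforce(x, y), closed_form))  # True
```

The brute force is O(2^n) while the product is O(n), which is the point of the graph representation.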
Graph kernels
Can also represent the polynomial kernel

κ(x, y) = (〈x, y〉 + R)^d = (x1y1 + x2y2 + · · · + xnyn + R)^d

with a graph of d identical stages, each with edges labelled x1y1, . . . , xnyn and R. (figure)
Graph kernels
The ANOVA kernel is represented by a graph with nodes indexed by the pairs (s, m), s = 0, . . . , d, m = 0, . . . , n, edges labelled 1 and xm zm, and the result read off at node (d, n). (figure)
ANOVA kernels
Features are all the combinations of exactly d distinct features, while the computation is given by the recursion:

κ_0^m(x, z) = 1, if m ≥ 0,
κ_s^m(x, z) = 0, if m < s,
κ_s^m(x, z) = (xm zm) κ_{s−1}^{m−1}(x, z) + κ_s^{m−1}(x, z)

while the resulting kernel is given by κ_d^n(x, z) in the bottom right corner of the graph.
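The recursion translates directly into a dynamic-programming table, which can be checked against the brute-force sum over all sets of d distinct features; names and test vectors are our own:

```python
import numpy as np
from itertools import combinations

def anova_kernel(x, z, d):
    """ANOVA kernel of degree d via the recursion
    kappa_s^m = (x_m z_m) kappa_{s-1}^{m-1} + kappa_s^{m-1}."""
    n = len(x)
    DP = np.zeros((d + 1, n + 1))     # DP[s, m] = kappa_s^m(x, z)
    DP[0, :] = 1.0                    # kappa_0^m = 1 for all m >= 0
    for s in range(1, d + 1):
        for m in range(s, n + 1):     # kappa_s^m = 0 when m < s (stays zero)
            DP[s, m] = x[m - 1] * z[m - 1] * DP[s - 1, m - 1] + DP[s, m - 1]
    return DP[d, n]

def anova_bruteforce(x, z, d):
    # sum over all sets of exactly d distinct feature indices
    return sum(np.prod([x[i] * z[i] for i in A])
               for A in combinations(range(len(x)), d))

x = np.array([1.0, 2.0, 0.5, -1.0])
z = np.array([0.3, 1.0, 2.0, 0.7])
print(np.isclose(anova_kernel(x, z, 2), anova_bruteforce(x, z, 2)))  # True
```

The table costs O(nd) kernel evaluations versus the O(n choose d) terms of the explicit sum.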
General Graph kernels
• Defined over directed acyclic graphs (DAGs)
• Number the vertices 1, . . . , s compatibly with the edge directions, i.e. ui → uj =⇒ i < j.
• Compute using dynamic programming table DP
• Initialise DP(1) = 1;
• for i = 2, . . . , s compute
DP(i) = Σ_{j→i} κ_{(uj→ui)}(x, z) DP(j)
• result given at output node s: κ(x, z) = DP(s).
Kernels for text
• The simplest representation for text is the kernel given by the feature map known as the vector space model

φ : d ↦ φ(d) = (tf(t1, d), tf(t2, d), . . . , tf(tN, d))′

where t1, t2, . . . , tN are the terms occurring in the corpus and tf(t, d) measures the frequency of term t in document d.
• Usually use the notation D for the document termmatrix (cf. X from previous notation).
Kernels for text
• Kernel matrix is given by

K = DD′

wrt the kernel

κ(d1, d2) = Σ_{j=1}^N tf(tj, d1) tf(tj, d2)

• despite the high dimensionality the kernel function can be computed efficiently by using a linked-list representation.
Semantics for text
• The standard representation does not take into account the importance of or relationship between words.

• Main methods do this by introducing a ‘semantic’ mapping S:

κ(d1, d2) = φ(d1)′SS′φ(d2)
Semantics for text
• Simplest is a diagonal matrix giving term weightings (known as inverse document frequency – tfidf):

w(t) = ln( m / df(t) )

• Hence the kernel becomes:

κ(d1, d2) = Σ_{j=1}^N w(tj)² tf(tj, d1) tf(tj, d2)
Semantics for text
• In general we would also like to include semantic links between terms with off-diagonal elements, e.g. stemming, query expansion, wordnet.

• More generally can use co-occurrence of words in documents:

S = D′

so
(SS′)ij = Σ_d tf(i, d) tf(j, d)
Semantics for text
• The information retrieval technique known as latent semantic indexing uses the SVD decomposition:

D′ = UΣV′

so that
d ↦ U′k φ(d)

which is equivalent to performing kernel PCA, giving the latent semantic kernels:

κ(d1, d2) = φ(d1)′ Uk U′k φ(d2)
String kernels
• Consider the feature map given by

φ_u^p(s) = |{(v1, v2) : s = v1 u v2}|

for u ∈ Σ^p, with the associated kernel

κ_p(s, t) = Σ_{u∈Σ^p} φ_u^p(s) φ_u^p(t)
String kernels
• Consider the following two sequences:

s = "statistics"
t = "computation"

The two strings contain the following substrings of length 3:

"sta", "tat", "ati", "tis", "ist", "sti", "tic", "ics"
"com", "omp", "mpu", "put", "uta", "tat", "ati", "tio", "ion"

and they have in common the substrings "tat" and "ati", so their inner product would be κ3(s, t) = 2.
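The example can be reproduced directly with a naive implementation of the p-spectrum kernel (the trie-based method of the next slide is how one would do this at scale):

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    """Naive p-spectrum kernel: inner product of the vectors counting
    every contiguous substring of length p."""
    phi_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    phi_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(phi_s[u] * phi_t[u] for u in phi_s)

print(p_spectrum_kernel("statistics", "computation", 3))  # 2 ("tat" and "ati")
```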
Trie based p-spectrum kernels
• Computation organised into a trie with nodes indexed by substrings – the root node by the empty string ε.

• Create lists of substrings at the root node:

Ls(ε) = {(s(i : i + p − 1), 0) : i = 1, . . . , |s| − p + 1}

Similarly for t.

• Recursively through the tree: if Ls(v) and Lt(v) are both non-empty, for each (u, i) ∈ L∗(v) add (u, i + 1) to the list L∗(vu_{i+1}).

• At depth p increment a global variable kern, initialised to 0, by |Ls(v)||Lt(v)|.
Gap weighted string kernels
• Can create kernels whose features are all substrings of length p, with each feature weighted according to all occurrences of the substring as a subsequence:

φ     ca   ct   at   ba   bt   cr   ar   br
cat   λ²   λ³   λ²   0    0    0    0    0
car   λ²   0    0    0    0    λ³   λ²   0
bat   0    0    λ²   λ²   λ³   0    0    0
bar   0    0    0    λ²   0    0    λ²   λ³

• This can be evaluated using a dynamic programming computation over arrays indexed by the two strings.
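For length-2 subsequences the table can be reproduced by naive enumeration, weighting each occurrence at positions i < j by λ^(j−i+1), the length of the substring it spans (the dynamic-programming algorithm is what one would use for longer subsequences; names here are our own):

```python
from itertools import combinations

def gap_weighted_k2(s, t, lam):
    """Naive gap-weighted kernel for subsequences of length 2."""
    def phi(u):
        feats = {}
        for i, j in combinations(range(len(u)), 2):
            f = u[i] + u[j]
            feats[f] = feats.get(f, 0.0) + lam ** (j - i + 1)
        return feats
    ps, pt = phi(s), phi(t)
    return sum(ps[f] * pt[f] for f in ps if f in pt)

lam = 0.5
# reproduces the table: phi_ca(cat) = lam^2, phi_ct(cat) = lam^3, ...
print(gap_weighted_k2("cat", "car", lam))  # lam^4, from the shared feature "ca"
```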
Tree kernels
• We can consider a feature mapping for trees defined by

φ : T ↦ (φS(T))_{S∈I}

where I is a set of all subtrees and φS(T) counts the number of co-rooted subtrees isomorphic to the tree S.

• The computation can again be performed efficiently by working up from the leaves of the tree, integrating the results from the children at each internal node.

• Similarly we can compute the inner product in the feature space given by all subtrees of the given tree, not necessarily co-rooted.
Probabilistic model kernels
• There are two types of kernels that can be defined based on probabilistic models of the data.

• The most natural is to consider a class of models indexed by a model class M: we can then define the similarity as

κ(x, z) = Σ_{m∈M} P(x|m) P(z|m) PM(m)

also known as the marginalisation kernel.

• For the case of Hidden Markov Models this can again be computed by a dynamic programming technique.
Probabilistic model kernels
• Pair HMMs generate pairs of symbols and under mild assumptions can also be shown to give rise to kernels that can be efficiently evaluated.

• Similarly for hidden tree generating models of data, again using a recursion that works upwards from the leaves.
Fisher kernels

Fisher kernels are an alternative way of defining kernels based on probabilistic models.

• We assume the model is parametrised according to some parameters: consider the simple example of a 1-dim Gaussian distribution parametrised by µ and σ:

M = { P(x|θ) = (1/√(2πσ)) exp( −(x − µ)²/(2σ²) ) : θ = (µ, σ) ∈ R² }.

• The Fisher score vector is the derivative of the log likelihood of an input x wrt the parameters:

log L_(µ,σ)(x) = −(x − µ)²/(2σ²) − (1/2) log(2πσ).
Fisher kernels
• Hence the score vector is given by:

g(θ0, x) = ( (x − µ0)/σ0², (x − µ0)²/σ0³ − 1/(2σ0) ).
• Taking µ0 = 0 and σ0 = 1 the feature embedding is given by x ↦ (x, x² − 1/2):
Fisher kernels
(plots of the two components of the embedding over x ∈ [−2, 2])
Fisher kernels
Can compute Fisher kernels for various models, including
• ones closely related to string kernels
• mixtures of Gaussians
• Hidden Markov Models
Conclusions

Kernel methods provide a general purpose toolkit for pattern analysis:

• kernels define a flexible interface to the data, enabling the user to encode prior knowledge into a measure of similarity between two items – with the proviso that it must satisfy the psd property.

• composition and subspace methods provide tools to enhance the representation: normalisation, centering, kernel PCA, kernel Gram-Schmidt, kernel CCA, etc.

• algorithms well-founded in statistical learning theory enable efficient and effective exploitation of the high-dimensional representations, giving good off-training-set performance.
Where to find out more
Web Sites: www.support-vector.net (SV Machines)
www.kernel-methods.net (kernel methods)
www.kernel-machines.net (kernel Machines)
www.pascal-network.org
References
[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
[3] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[4] M. Anthony and N. Biggs. Computational Learning Theory, volume 30 of Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 1992.

[5] M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied Mathematics, 47:207–217, 1993.

[6] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Math. J., 19:357–367, 1967.

[7] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 43–54, Cambridge, MA, 1999. MIT Press.

[8] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network.
IEEE Transactions on Information Theory, 44(2):525–536, 1998.
[9] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[10] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16:277–292, 2000.
[11] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[12] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.
[13] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: EuroCOLT '95, pages 23–37. Springer-Verlag, 1995.
[14] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Stat. Assoc., 58:13–30, 1963.
[15] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[16] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. High Dimensional Probability II, pages 443–459, 2000.
[17] J. Langford and J. Shawe-Taylor. PAC-Bayes and margins. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.
[18] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
[19] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, volume 141 of London Mathematical Society Lecture Note Series, pages 148–188. Cambridge University Press, Cambridge, 1989.
[20] R. Schapire, Y. Freund, P. Bartlett, and W. Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 1998. (To appear; an earlier version appeared in D. H. Fisher, Jr., editor, Proceedings ICML97, Morgan Kaufmann.)
[21] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
[22] J. Shawe-Taylor and N. Cristianini. On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory, 48(10):2721–2735, 2002.
[23] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[24] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. S. Kandola. On the eigenspectrum of the Gram matrix and its relationship to the operator eigenspectrum. In Proceedings of the 13th International Conference on Algorithmic Learning Theory (ALT 2002), volume 2533, pages 23–40, 2002.
[25] M. Talagrand. New concentration inequalities in product spaces. Invent. Math., 126:505–563, 1996.
[26] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[27] V. Vapnik and A. Chervonenkis. Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR, 181:915–918, 1968.
[28] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
[29] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.