-
Teacher: Gianni A. Di Caro
Lecture 17: Kernel methods 2
Introduction to Machine Learning, 10-315 Fall ‘19
Disclaimer: These slides can include material from different sources. I’ll be happy to explicitly acknowledge a source if required. Contact me for requests.
-
2
Recap: Transform the data to use linear models
-
3
Recap: Similarity measures and kernels
o Problem: Representing data in a high-dimensional space is computationally difficult
o Alternative solution to the original problem: Calculate a similarity measure in the feature space instead of the coordinates of the vectors there, then apply algorithms that only need the value of this measure
o Use inner (dot) product as similarity measure
o Kernel: a function that takes as inputs vectors in the original space 𝑋 and returns the dot product of the vectors in the (possibly high-dimensional) feature space 𝐹.
o Formally: A kernel is a function 𝑘 that for all 𝒙, 𝒛 ∈ 𝑋 satisfies
𝑘(𝒙, 𝒛) = ⟨𝜙(𝒙), 𝜙(𝒛)⟩,
where 𝜙 is a mapping from 𝑋 to an (inner product) feature space 𝐹
v 𝑘(⋅,⋅) is a kernel if it can be viewed as a legal definition of an inner product in a Hilbert space
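As a concrete worked instance of this definition (a standard example, not taken from the slides), consider 2-D inputs and the degree-2 polynomial kernel:

\[
k(\boldsymbol{x}, \boldsymbol{z}) = (\boldsymbol{x}^\top \boldsymbol{z})^2
= (x_1 z_1 + x_2 z_2)^2
= x_1^2 z_1^2 + 2\, x_1 x_2\, z_1 z_2 + x_2^2 z_2^2
= \langle \phi(\boldsymbol{x}), \phi(\boldsymbol{z}) \rangle,
\qquad
\phi(\boldsymbol{x}) = \big(x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2\big)
\]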
-
4
Kernel trick
v Using kernels, we do not need to embed the data into the space 𝐹, because a number of algorithms only require the inner products between input vectors!
v We never need the coordinates of the data in the feature space → we never need 𝜙 in an explicit form!
v If the kernel function meets Mercer’s conditions, it will represent 𝜙 implicitly and provide the value of the inner product ⟨𝜙(𝒙), 𝜙(𝒛)⟩, the similarity measure, that can be plugged directly into the algorithms
o Kernel trick: To avoid working explicitly in the high-dimensional feature space, choose a feature space in which the dot product can be evaluated directly using a nonlinear function 𝑘 in the input space (i.e., a kernel function!)
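A minimal numerical sketch of the trick (the example vectors are chosen here for illustration, not taken from the slides): for the degree-2 polynomial kernel on 2-D inputs, the kernel value computed in the input space equals the dot product of the explicit feature vectors, so 𝜙 is never needed.

import numpy as np

def phi(x):
    # Explicit feature map into the 3-D feature space (shown only for comparison)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    # Kernel: evaluated entirely in the original 2-D input space
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))  # 1.0, inner product computed in the feature space
print(k(x, z))                 # 1.0, same value without ever forming phi(x)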
-
5
Hilbert spaces (just to know)
-
6
Hilbert spaces (just to know)
-
7
Hilbert spaces (just to know)
-
8
Hilbert spaces (just to know)
-
9
Hilbert spaces (just to know)
-
10
Using kernels
-
11
The dual formulation only depends on dot products between inputs, never on an 𝒙 alone!
𝜙(𝒙) lives in a high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product (fast) using some kernel 𝐾(𝒙, 𝒛)
This is implicitly defined if we have a kernel function 𝐾 and its associated kernel matrix
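A minimal sketch of this point (toy data and an RBF kernel assumed here): the n×n kernel matrix with entries 𝐾[i, j] = 𝐾(𝒙ᵢ, 𝒙ⱼ) is all a dual solver ever needs from the data.

import numpy as np

# Toy data: 3 points in 2-D (assumed for illustration)
X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [2.0, 0.5]])

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

n = X.shape[0]
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
print(K)  # 3x3 symmetric kernel matrix; a dual solver works on K, never on phi(x)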
-
12
Using kernels: modularity
o Kernelization offers great modularity
o No need to change the underlying learning algorithm to accommodate a particular choice of the kernel function
o Also, we can substitute a different algorithm while maintaining the same kernel
ü Plug-and-Play!
-
13
Kernel examples
-
14
Kernel examples: feature space might not be unique
-
15
General Polynomial Kernels
𝑑 = 1: 𝑘(𝒙, 𝒛) = 𝒙ᵀ𝒛
𝑑 = 2: 𝑘(𝒙, 𝒛) = (𝒙ᵀ𝒛)²
general 𝑑: 𝑘(𝒙, 𝒛) = (𝒙ᵀ𝒛)ᵈ (only includes terms of degree 𝑑)
-
16
Feature spaces can grow very large and very quickly!
𝑛 – number of input features, 𝑑 – degree of the polynomial
For a polynomial kernel of degree 𝑑:
# terms of degree 𝑑 = C(𝑑 + 𝑛 − 1, 𝑑) = (𝑑 + 𝑛 − 1)! / (𝑑! (𝑛 − 1)!)
E.g., 𝑑 = 6, 𝑛 = 100 → ≈1.6 billion terms → dimensions in the feature space!!
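A one-line check of the count above (using Python's math.comb; the specific numbers are those from the slide):

from math import comb

n, d = 100, 6
print(comb(d + n - 1, d))  # 1609344100, i.e. about 1.6 billion terms / dimensions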
-
17
Kernel examples
o Sigmoid kernel: 𝑘(𝒙, 𝒛) = tanh(𝑎 𝒙ᵀ𝒛 + 𝑏)
o Polynomial kernel: 𝑘(𝒙, 𝒛) = (𝒙ᵀ𝒛)ᵈ, or (𝑐 + 𝒙ᵀ𝒛)ᵈ where 𝑐 ≥ 0 is a parameter trading off the influence of high-order vs. low-order terms
-
18
Kernel examples: RBF
o Radial Basis Function (RBF) / Gaussian kernel
𝑘(𝒙, 𝒛) = exp(−𝛾 ‖𝒙 − 𝒛‖²)
o Measure of similarity based on distance scaled by the hyperparameter 𝛾, termed kernel bandwidth
o The RBF kernel corresponds to an infinite-dimensional feature space: we can’t actually write down or store the map 𝜙(𝒙) explicitly
o It’s a stationary kernel: it only depends on the distance between 𝒙 and 𝒛, so translating both by the same amount won’t change the value of 𝑘(𝒙, 𝒛) [this isn’t true in general; see the small check below]
o Cross-validation to choose the kernel hyperparameters
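A quick numerical illustration of the stationarity property (the example vectors are arbitrary, not from the slides): translating both inputs by the same vector leaves the RBF kernel value unchanged.

import numpy as np

def rbf(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2): depends only on x - z
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
t = np.array([10.0, -3.0])  # an arbitrary translation

print(rbf(x, z))          # some value v
print(rbf(x + t, z + t))  # exactly the same value v: the kernel is stationary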
-
19
RBF Kernel: infinite dimensional mapping
o RBF kernel corresponds to mapping input data to an infinite dimensional space
o Let’s prove it by considering one-dimensional inputs and 𝛾 = 1
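A sketch of the argument (reconstructed here, since the derivation itself did not survive in the transcript): with one-dimensional inputs and 𝛾 = 1, expand the cross term with the exponential series:

\[
k(x, z) = e^{-(x - z)^2}
= e^{-x^2}\, e^{-z^2}\, e^{2xz}
= e^{-x^2}\, e^{-z^2} \sum_{j=0}^{\infty} \frac{(2xz)^j}{j!}
= \sum_{j=0}^{\infty}
\left( \sqrt{\tfrac{2^j}{j!}}\, e^{-x^2} x^j \right)
\left( \sqrt{\tfrac{2^j}{j!}}\, e^{-z^2} z^j \right)
\]

so 𝜙(x) has one coordinate for every j = 0, 1, 2, …: the feature space is infinite-dimensional.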
-
20
Mercer Kernels
What functions are valid kernels that correspond to feature vectors 𝜙(𝒙)?
Answer: Mercer kernels 𝑘(𝒙, 𝒛)
o 𝑘 is continuous
o 𝑘 is symmetric
o The kernel matrix (Gram matrix, see the next slide) is positive semi-definite: 𝒛ᵀ𝐾𝒛 ≥ 0 for all 𝒛 (𝐾 has non-negative eigenvalues)
ü All the previous kernels are Mercer kernels
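A small numerical sanity check of the positive semi-definiteness condition (toy random data assumed): build a Gram matrix for one of the kernels above and inspect its eigenvalues.

import numpy as np

def poly_kernel(x, z, d=2):
    # Degree-2 polynomial kernel, one of the Mercer kernels above
    return (x @ z) ** d

X = np.random.default_rng(0).normal(size=(5, 3))  # 5 toy points in 3-D
K = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)    # K is symmetric, so eigvalsh applies
print(np.all(eigvals >= -1e-10))   # True: non-negative up to round-off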
-
21
Kernel matrix
-
22
Properties for constructing kernels
o These derive from the fact that a kernel implicitly defines a feature map in a Hilbert space (a complete vector space endowed with an inner product) and from Mercer’s properties (see the sketch below)
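As a small illustration of the kind of property meant here (the slide's own list did not survive in the transcript): the sum of two kernels is again a kernel, since the sum of positive semi-definite Gram matrices stays positive semi-definite.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))  # toy data, assumed for illustration

K_lin = X @ X.T  # Gram matrix of the linear kernel
K_rbf = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # RBF Gram matrix
K_sum = K_lin + K_rbf  # candidate Gram matrix of the sum kernel

print(np.all(np.linalg.eigvalsh(K_sum) >= -1e-10))  # True: still PSD, hence still a kernel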
-
23
Kernels: closure properties
-
24
Kernels: closure properties
-
25
Finally: SVMs and the Kernel Trick!
Ø Never represent features explicitly: compute dot products in closed form using the kernel!
Ø Constant-time high-dimensional dot-products for many classes of features
Training time
-
26
What about classification time?
o For a new input 𝒙, if we need to represent 𝜙(𝒙), we are in trouble!
o Recall the classifier: sign(𝒘ᵀ𝜙(𝒙) + 𝑏)
o Using kernels, we are cool! (see why below)
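The reason we are cool (standard SVM algebra, spelled out here for completeness): the optimal weight vector is a combination of the mapped training points, so the decision function needs only kernel evaluations.

\[
\boldsymbol{w} = \sum_i \alpha_i y_i\, \phi(\boldsymbol{x}_i)
\quad\Longrightarrow\quad
\operatorname{sign}\!\big(\boldsymbol{w}^\top \phi(\boldsymbol{x}) + b\big)
= \operatorname{sign}\!\Big(\sum_i \alpha_i y_i\, k(\boldsymbol{x}_i, \boldsymbol{x}) + b\Big)
\]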
-
27
SVMs with Kernels: training and classification
ü Choose a set of features and a kernel function
ü Solve the dual problem to obtain the support vectors and their coefficients 𝛼ᵢ
ü At classification time, for a new input 𝒙, compute ∑ᵢ 𝛼ᵢ 𝑦ᵢ 𝑘(𝒙ᵢ, 𝒙) + 𝑏 (sum over the support vectors)
ü Classify as the sign of this quantity: this we know how to compute!
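A minimal end-to-end sketch of this training/classification pipeline using scikit-learn (the library choice and the toy data are assumptions, not part of the slides):

import numpy as np
from sklearn.svm import SVC

# Toy non-linearly separable data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Choose a kernel (RBF here) and solve the dual problem;
# in practice the hyperparameters would be set by cross-validation
clf = SVC(kernel='rbf', gamma=0.5, C=1.0)
clf.fit(X, y)

# Classification uses only k(x_i, x_new) for the retained support vectors x_i
X_new = np.array([[0.1, 0.2], [2.0, 2.0]])
print(clf.predict(X_new))
print(len(clf.support_))  # number of support vectors kept by the dual solution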
-
28
Overfitting?
v Huge feature space with kernels, what about overfitting???
ü Maximizing margin leads to sparse set of support vectors
ü Some interesting theory says that SVMs search for a simple hypothesis with a large margin
ü Often robust to overfitting
-
29
SVMs with Kernels
• Iris dataset, class 2 vs classes 1 & 3, linear kernel
-
30
SVMs with Kernels
• Iris dataset, class 1 vs classes 2 & 3, polynomial kernel of degree 2
-
31
SVMs with Kernels
• Iris dataset, class 1 vs classes 2 & 3, Gaussian RBF kernel
-
32
SVMs with Kernels
• Iris dataset, class 1 vs classes 2 & 3, Gaussian RBF kernel
-
33
SVMs with Kernels
• Chessboard dataset, Gaussian RBF kernel
-
34
SVMs with Kernels
• Chessboard dataset, Polynomial kernel
-
35
Corel Dataset
Olivier Chapelle 1998
-
36
USPS Handwritten digits