
  • Teacher: Gianni A. Di Caro

    Lecture 17: Kernel methods 2

    Introduction to Machine Learning 10-315, Fall ‘19

    Disclaimer: These slides can include material from different sources. I’ll be happy to explicitly acknowledge a source if required. Contact me for requests.

  • 2

    Recap: Transform the data to use linear models

  • 3

    Recap: Similarity measures and kernels

    o Problem: Representing data in a high-dimensional space is computationally difficult

    o Alternative solution to the original problem: Calculate a similarity measure in the feature space instead of the coordinates of the vectors there, then apply algorithms that only need the value of this measure

    o Use inner (dot) product as similarity measure

    o Kernel: a function that takes as inputs vectors in the original space 𝑋 and returns the dot product of the (mapped) vectors in the (possibly high-dimensional) feature space 𝐹.

    o Formally: A kernel is a function 𝑘 that for all 𝒙, 𝒛 ∈ 𝑋 satisfies

    𝑘(𝒙, 𝒛) = ⟨𝜙(𝒙), 𝜙(𝒛)⟩,

    where 𝜙 is a mapping from 𝑋 to an (inner product) feature space 𝐹

    ❖ 𝑘(⋅,⋅) is a kernel if it can be viewed as a legal definition of an inner product in a Hilbert space
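    As a minimal numerical check (not from the slides, assuming numpy; function names are illustrative), the homogeneous quadratic kernel on ℝ² is exactly the dot product under the explicit feature map 𝜙(𝒙) = (x₁², √2·x₁x₂, x₂²):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k_quad(x, z):
    """Quadratic kernel k(x, z) = (x . z)^2, computed in the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both routes give the same number: k(x, z) = <phi(x), phi(z)>
print(k_quad(x, z))              # 1.0
print(np.dot(phi(x), phi(z)))    # 1.0 (same value, up to round-off)
```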

  • 4

    Kernel trick

    ❖ Using kernels, we do not need to embed the data into the space 𝐹, because a number of algorithms only require the inner products between input vectors!

    ❖ We never need the coordinates of the data in the feature space → we never need 𝜙 in an explicit form!

    ❖ If the kernel function meets Mercer’s conditions, it represents 𝜙 implicitly and provides the value of the inner product ⟨𝜙(𝒙), 𝜙(𝒛)⟩, the similarity measure, which can be plugged directly into the algorithms

    o Kernel trick: to avoid working in the non-linear, high-dimensional feature space, choose a feature space in which the dot product can be evaluated directly using a nonlinear function 𝑘 in the input space (i.e., a kernel function!), as sketched below
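    One concrete illustration of the trick (not from the slides; names are illustrative, assuming numpy): the dual form of the perceptron touches the data only through 𝑘(⋅,⋅), so the weight vector 𝒘 = Σᵢ 𝛼ᵢ𝑦ᵢ𝜙(𝒙ᵢ) is never formed:

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Dual perceptron: keeps one mistake count alpha_i per training point
    and never builds phi(x) explicitly."""
    y = np.asarray(y, dtype=float)          # labels in {-1, +1}
    n = len(X)
    alpha = np.zeros(n)
    # The only thing ever computed from the data: pairwise kernel values
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            pred = np.sign(np.sum(alpha * y * K[:, i]))
            if pred != y[i]:
                alpha[i] += 1
    return alpha

def predict(x_new, X, y, alpha, kernel):
    """Classify a new point using only kernel evaluations against training points."""
    return np.sign(sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X)))
```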

  • 5

    Hilbert spaces (just to know)

  • 6

    Hilbert spaces (just to know)

  • 7

    Hilbert spaces (just to know)

  • 8

    Hilbert spaces (just to know)

  • 9

    Hilbert spaces (just to know)

  • 10

    Using kernels

  • 11

    Dual formulation only depends on dot products between the data points, never on the 𝒙’s themselves!

    𝜙(𝒙) – high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product (fast) using some kernel 𝐾(𝒙, 𝒛)

    This is implicitly defined if we have a kernel function 𝐾 and its associated kernel matrix
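    As a small illustration (not from the slides; the function name is mine), this is the object the dual formulation actually consumes: the kernel (Gram) matrix with entries K[i, j] = 𝑘(𝒙ᵢ, 𝒙ⱼ), assuming numpy:

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = k(x_i, x_j): all the dual formulation ever needs from the data."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
K = gram_matrix(X, lambda x, z: (x @ z) ** 2)   # quadratic kernel as an example
print(K)
```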

  • 12

    Using kernels: modularity

    o Kernelization offers great modularity

    o No need to change the underlying learning algorithm to accommodate a particular choice of the kernel function

    o Also, we can substitute a different algorithm while maintaining the same kernel

    ✓ Plug-and-play!
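    A hedged illustration of the plug-and-play point, assuming scikit-learn is available: the same learning algorithm (SVC) is reused unchanged while only the kernel is swapped:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Same algorithm, different kernels: only the similarity measure changes
for kern in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kern, degree=2, gamma="scale").fit(X, y)
    print(kern, clf.score(X, y))
```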

  • 13

    Kernel examples

  • 14

    Kernel examples: feature space might not be unique

  • 15

    General Polynomial Kernels

    o Homogeneous polynomial kernel of degree 𝑑: 𝑘(𝒙, 𝒛) = (𝒙ᵀ𝒛)^𝑑

    𝑑 = 1 (linear), 𝑑 = 2 (quadratic), general 𝑑

    (only includes terms of degree 𝑑)

  • 16

    Feature spaces can grow very large and very quickly!

    𝑛 – number of input features, 𝑑 – degree of the polynomial

    For a polynomial kernel of degree 𝑑:

    # terms of degree 𝑑 = (𝑛 + 𝑑 − 1 choose 𝑑) = (𝑛 + 𝑑 − 1)! / (𝑑! (𝑛 − 1)!)

    E.g., 𝑑 = 6, 𝑛 = 100 → ≈1.6 billion terms → dimensions in the feature space!!
  • 17

    Kernel examples

    o Sigmoid kernel: 𝑘(𝒙, 𝒛) = tanh(𝛾 𝒙ᵀ𝒛 + 𝑟)

    o Polynomial kernel, inhomogeneous form: (𝑐 + 𝒙ᵀ𝒛)^𝑑, where 𝑐 ≥ 0 is a parameter trading off the influence of higher-order versus lower-order terms
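    To see concretely how 𝑐 trades off low- versus high-order terms, one can expand the 𝑑 = 2 case:

```latex
(c + \mathbf{x}^\top \mathbf{z})^2
  = c^2 \;+\; 2c\,(\mathbf{x}^\top \mathbf{z}) \;+\; (\mathbf{x}^\top \mathbf{z})^2
```

    so the constant, degree-1, and degree-2 feature interactions are weighted by 𝑐², 2𝑐, and 1 respectively; a larger 𝑐 emphasizes the lower-order terms.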

  • 18

    Kernel examples: RBF

    o Radial Basis Function (RBF) / Gaussian kernel

    𝑘(𝒙, 𝒛) = exp(−𝛾 ‖𝒙 − 𝒛‖²)

    o Measure of similarity based on distance, scaled by the hyperparameter 𝛾, termed the kernel bandwidth

    o The RBF kernel corresponds to an infinite-dimensional feature space: we can’t actually write down or store the map 𝜙(𝒙) explicitly

    o It’s a stationary kernel: it depends only on the distance between 𝒙 and 𝒛, so translating both by the same amount won’t change the value of 𝑘(𝒙, 𝒛) [this isn’t true for kernels in general]

    o Cross-validation to choose the kernel hyperparameters
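    A sketch of that last point, assuming scikit-learn: choosing the RBF bandwidth 𝛾 (together with the SVM's C) by cross-validated grid search; the grid values are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# k(x, z) = exp(-gamma * ||x - z||^2); gamma is the kernel bandwidth hyperparameter
param_grid = {"gamma": np.logspace(-3, 2, 6), "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_)
```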

  • 19

    RBF Kernel: infinite dimensional mapping

    o The RBF kernel corresponds to mapping the input data to an infinite-dimensional space

    o Let’s prove it by considering one-dimensional inputs and 𝛾 = 1 (see the sketch below)
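    Sketch of that argument, written out with the standard Taylor expansion of the exponential:

```latex
k(x, z) = e^{-(x - z)^2}
        = e^{-x^2}\, e^{-z^2}\, e^{2xz}
        = e^{-x^2}\, e^{-z^2} \sum_{k=0}^{\infty} \frac{(2xz)^k}{k!}
        = \sum_{k=0}^{\infty}
          \left( e^{-x^2} \sqrt{\tfrac{2^k}{k!}}\, x^k \right)
          \left( e^{-z^2} \sqrt{\tfrac{2^k}{k!}}\, z^k \right)
        = \langle \phi(x), \phi(z) \rangle,
\qquad
\phi_k(x) = e^{-x^2} \sqrt{\tfrac{2^k}{k!}}\, x^k , \quad k = 0, 1, 2, \dots
```

    so 𝜙 has infinitely many coordinates, one per power of 𝑥.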

  • 20

    Mercer Kernels

    What functions are valid kernels that correspond to feature vectors 𝜙(x)?

    Answer: Mercer kernels 𝑘(𝒙, 𝒛)

    o 𝑘 is continuous

    o 𝑘 is symmetric

    o The kernel matrix (Gram matrix, see next slide) is positive semi-definite: 𝒛ᵀ𝐾𝒛 ≥ 0 for all 𝒛 (i.e., 𝐾 has non-negative eigenvalues)

    ✓ All the previous kernels are Mercer kernels
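    A small numerical sanity check (illustrative, not a proof), assuming numpy: the Gram matrix of the RBF kernel on arbitrary points should show no negative eigenvalues beyond round-off:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
gamma = 0.5

# Gram matrix of the RBF kernel, K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

eigvals = np.linalg.eigvalsh(K)        # symmetric matrix => real eigenvalues
print(eigvals.min() >= -1e-10)         # True: positive semi-definite up to round-off
```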

  • 21

    Kernel matrix

  • 22

    Properties for constructing kernels

    o These derive from the fact that a kernel implicitly defines a feature map into a Hilbert space (a complete vector space endowed with an inner product) and from Mercer’s properties
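    For example, two standard closure properties (the sum and the elementwise product of two kernels are again kernels) can be checked numerically on their Gram matrices; a sketch assuming numpy, with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))

K_lin = X @ X.T                                    # linear kernel Gram matrix
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K_rbf = np.exp(-0.5 * sq)                          # RBF kernel Gram matrix

for name, K in [("sum", K_lin + K_rbf), ("product", K_lin * K_rbf)]:
    lam_min = np.linalg.eigvalsh((K + K.T) / 2).min()
    print(name, lam_min >= -1e-8)                  # both remain positive semi-definite
```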

  • 23

    Kernels: closure properties

  • 24

    Kernels: closure properties

  • 25

    Finally: SVMs and the Kernel Trick!

    ➢ Never represent features explicitly – compute dot products in closed form using the kernel!

    ➢ Constant-time high-dimensional dot products for many classes of features

    Training time

  • 26

    What about classification time?

    o For a new input 𝒙, if we need to represent 𝜙(𝒙), we are in trouble!

    o Recall the classifier: sign(𝒘 ⋅ 𝜙(𝒙) + 𝑏)

    o Using kernels we are fine: since 𝒘 = Σᵢ 𝛼ᵢ𝑦ᵢ𝜙(𝒙ᵢ), the classifier becomes sign(Σᵢ 𝛼ᵢ𝑦ᵢ𝑘(𝒙ᵢ, 𝒙) + 𝑏), which never needs 𝜙 explicitly

  • 27

    SVMs with Kernels: training and classification

    ✓ Choose a set of features and a kernel function
    ✓ Solve the dual problem to obtain the support vectors and their coefficients 𝛼ᵢ
    ✓ At classification time, for a new input 𝒙, compute Σᵢ 𝛼ᵢ𝑦ᵢ𝑘(𝒙ᵢ, 𝒙) + 𝑏 over the support vectors

    Classify as the sign of this value

    This we know how to compute!
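    A hedged sketch of exactly this computation, assuming scikit-learn's SVC (which exposes support_vectors_, dual_coef_ = 𝛼ᵢ𝑦ᵢ, and intercept_): the manual sum over support vectors reproduces decision_function:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
mask = y < 2                              # make it a binary problem
X, y = X[mask], y[mask]

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

x_new = X[:1]
# Manual decision value: sum_i (alpha_i * y_i) * k(sv_i, x) + b
k_vals = np.exp(-gamma * ((clf.support_vectors_ - x_new) ** 2).sum(axis=1))
manual = clf.dual_coef_ @ k_vals + clf.intercept_
print(manual, clf.decision_function(x_new))   # same value
```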

  • 28

    Overfitting?

    ❖ Huge feature space with kernels: what about overfitting?

    ✓ Maximizing the margin leads to a sparse set of support vectors

    ✓ Some interesting theory says that SVMs search for simple hypotheses with large margin

    ✓ Often robust to overfitting

  • 29

    SVMs with Kernels

    • Iris dataset, class 2 vs classes 1 and 3, linear kernel

  • 30

    SVMs with Kernels

    • Iris dataset, class 1 vs classes 2 and 3, polynomial kernel of degree 2

  • 31

    SVMs with Kernels

    • Iris dataset, class 1 vs classes 2 and 3, Gaussian RBF kernel
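    A sketch of the kind of experiment shown in these Iris slides, assuming scikit-learn and matplotlib; the two chosen features and the 𝛾 value are illustrative, not the ones used in the original figures:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = X[:, :2]                       # two features so the boundary can be plotted
y = (y == 0).astype(int)           # "class 1 vs classes 2 and 3" in the slides' numbering

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Decision regions on a grid
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.show()
```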

  • 32

    SVMs with Kernels

    • Iris dataset, class 1 vs classes 2 and 3, Gaussian RBF kernel

  • 33

    SVMs with Kernels

    • Chessboard dataset, Gaussian RBF kernel

  • 34

    SVMs with Kernels

    • Chessboard dataset, Polynomial kernel

  • 35

    Corel Dataset

    Olivier Chapelle 1998

  • 36

    USPS Handwritten digits