
  • Teacher: Gianni A. Di Caro

    Lecture 17: Kernel methods 2

    Introduction to Machine Learning 10-315, Fall ‘19

    Disclaimer: These slides can include material from different sources. I’ll be happy to explicitly acknowledge a source if required. Contact me for requests.

  • 2

    Recap: Transform the data to use linear models

  • 3

    Recap: Similarity measures and kernels

    o Problem: Representing data in a high-dimensional space is computationally difficult

    o Alternative solution to the original problem: Calculate a similarity measure in the feature space instead of the coordinates of the vectors there, then apply algorithms that only need the value of this measure

    o Use inner (dot) product as similarity measure

    o Kernel: a function that takes as inputs vectors in the original space 𝑋 and returns the dot product of the (mapped) vectors in the (possibly high-dimensional) feature space 𝐹.

    o Formally: A kernel is a function 𝑘 that for all 𝒙, 𝒛 ∈ 𝑋 satisfies

    𝑘(𝒙, 𝒛) = ⟨𝜙(𝒙), 𝜙(𝒛)⟩,

    where 𝜙 is a mapping from 𝑋 to an (inner product) feature space 𝐹

    ❖ 𝑘(⋅,⋅) is a kernel if it can be viewed as a legal definition of an inner product in a Hilbert space
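    As a minimal numerical check (not from the slides, assuming numpy; function names are illustrative), the homogeneous quadratic kernel on ℝ² is exactly the dot product under the explicit feature map 𝜙(𝒙) = (x₁², √2·x₁x₂, x₂²):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k_quad(x, z):
    """Quadratic kernel k(x, z) = (x . z)^2, computed in the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both routes give the same number: k(x, z) = <phi(x), phi(z)>
print(k_quad(x, z))              # 1.0
print(np.dot(phi(x), phi(z)))    # 1.0 (same value, up to round-off)
```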

  • 4

    Kernel trick

    ❖ Using kernels, we do not need to embed the data into the space 𝐹, because a number of algorithms only require the inner products between input vectors!

    ❖ We never need the coordinates of the data in the feature space → we never need 𝜙 in an explicit form!

    ❖ If the kernel function meets Mercer’s conditions, it represents 𝜙 implicitly and provides the value of the inner product ⟨𝜙(𝒙), 𝜙(𝒛)⟩, the similarity measure, which can be plugged directly into the algorithms

    o Kernel trick: to avoid working in the non-linear, high-dimensional feature space, choose a feature space in which the dot product can be evaluated directly using a nonlinear function 𝑘 in the input space (i.e., a kernel function!), as sketched below
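    One concrete illustration of the trick (not from the slides; names are illustrative, assuming numpy): the dual form of the perceptron touches the data only through 𝑘(⋅,⋅), so the weight vector 𝒘 = Σᵢ 𝛼ᵢ𝑦ᵢ𝜙(𝒙ᵢ) is never formed:

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Dual perceptron: keeps one mistake count alpha_i per training point
    and never builds phi(x) explicitly."""
    y = np.asarray(y, dtype=float)          # labels in {-1, +1}
    n = len(X)
    alpha = np.zeros(n)
    # The only thing ever computed from the data: pairwise kernel values
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            pred = np.sign(np.sum(alpha * y * K[:, i]))
            if pred != y[i]:
                alpha[i] += 1
    return alpha

def predict(x_new, X, y, alpha, kernel):
    """Classify a new point using only kernel evaluations against training points."""
    return np.sign(sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X)))
```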

  • 5

    Hilbert spaces (just to know)

  • 6

    Hilbert spaces (just to know)

  • 7

    Hilbert spaces (just to know)

  • 8

    Hilbert spaces (just to know)

  • 9

    Hilbert spaces (just to know)

  • 10

    Using kernels

  • 11

    Dual formulation only depends on dot products between the data points, never on the 𝒙’s themselves!

    𝜙(𝒙) – high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product (fast) using some kernel 𝐾(𝒙, 𝒛)

    This is implicitly defined if we have a kernel function 𝐾 and its associated kernel matrix
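    As a small illustration (not from the slides; the function name is mine), this is the object the dual formulation actually consumes: the kernel (Gram) matrix with entries K[i, j] = 𝑘(𝒙ᵢ, 𝒙ⱼ), assuming numpy:

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = k(x_i, x_j): all the dual formulation ever needs from the data."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
K = gram_matrix(X, lambda x, z: (x @ z) ** 2)   # quadratic kernel as an example
print(K)
```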

  • 12

    Using kernels: modularity

    o Kernelization offers great modularity

    o No need to change the underlying learning algorithm to accommodate a particular choice of the kernel function

    o Also, we can substitute a different algorithm while maintaining the same kernel

    ✓ Plug-and-play!
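    A hedged illustration of the plug-and-play point, assuming scikit-learn is available: the same learning algorithm (SVC) is reused unchanged while only the kernel is swapped:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Same algorithm, different kernels: only the similarity measure changes
for kern in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kern, degree=2, gamma="scale").fit(X, y)
    print(kern, clf.score(X, y))
```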

  • 13

    Kernel examples

  • 14

    Kernel examples: feature space might not be unique

  • 15

    General Polynomial Kernels

    o Homogeneous polynomial kernel of degree 𝑑: 𝑘(𝒙, 𝒛) = (𝒙ᵀ𝒛)^𝑑

    𝑑 = 1 (linear), 𝑑 = 2 (quadratic), general 𝑑

    (only includes terms of degree 𝑑)

  • 16

    Feature spaces can grow very large and very quickly!

    𝑛 – number of input features, 𝑑 – degree of the polynomial

    For a polynomial kernel of degree 𝑑:

    # terms of degree 𝑑 = (𝑛 + 𝑑 − 1 choose 𝑑) = (𝑛 + 𝑑 − 1)! / (𝑑! (𝑛 − 1)!)

    E.g., 𝑑 = 6, 𝑛 = 100 → ≈1.6 billion terms → dimensions in the feature space!!
  • 17

    Kernel examples

    o Sigmoid kernel: 𝑘(𝒙, 𝒛) = tanh(𝛾 𝒙ᵀ𝒛 + 𝑟)

    o Polynomial kernel, inhomogeneous form: (𝑐 + 𝒙ᵀ𝒛)^𝑑, where 𝑐 ≥ 0 is a parameter trading off the influence of higher-order versus lower-order terms
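    To see concretely how 𝑐 trades off low- versus high-order terms, one can expand the 𝑑 = 2 case:

```latex
(c + \mathbf{x}^\top \mathbf{z})^2
  = c^2 \;+\; 2c\,(\mathbf{x}^\top \mathbf{z}) \;+\; (\mathbf{x}^\top \mathbf{z})^2
```

    so the constant, degree-1, and degree-2 feature interactions are weighted by 𝑐², 2𝑐, and 1 respectively; a larger 𝑐 emphasizes the lower-order terms.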

  • 18

    Kernel examples: RBF

    o Radial Basis Function (RBF) / Gaussian kernel

    𝑘(𝒙, 𝒛) = exp(−𝛾 ‖𝒙 − 𝒛‖²)

    o Measure of similarity based on distance, scaled by the hyperparameter 𝛾, termed the kernel bandwidth

    o The RBF kernel corresponds to an infinite-dimensional feature space: we can’t actually write down or store the map 𝜙(𝒙) explicitly

    o It’s a stationary kernel: it depends only on the distance between 𝒙 and 𝒛, so translating both by the same amount won’t change the value of 𝑘(𝒙, 𝒛) [this isn’t true for kernels in general]

    o Cross-validation to choose the kernel hyperparameters
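    A sketch of that last point, assuming scikit-learn: choosing the RBF bandwidth 𝛾 (together with the SVM's C) by cross-validated grid search; the grid values are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# k(x, z) = exp(-gamma * ||x - z||^2); gamma is the kernel bandwidth hyperparameter
param_grid = {"gamma": np.logspace(-3, 2, 6), "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_)
```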

  • 19

    RBF Kernel: infinite dimensional mapping

    o The RBF kernel corresponds to mapping the input data to an infinite-dimensional space

    o Let’s prove it by considering one-dimensional inputs and 𝛾 = 1 (see the sketch below)
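    Sketch of that argument, written out with the standard Taylor expansion of the exponential:

```latex
k(x, z) = e^{-(x - z)^2}
        = e^{-x^2}\, e^{-z^2}\, e^{2xz}
        = e^{-x^2}\, e^{-z^2} \sum_{k=0}^{\infty} \frac{(2xz)^k}{k!}
        = \sum_{k=0}^{\infty}
          \left( e^{-x^2} \sqrt{\tfrac{2^k}{k!}}\, x^k \right)
          \left( e^{-z^2} \sqrt{\tfrac{2^k}{k!}}\, z^k \right)
        = \langle \phi(x), \phi(z) \rangle,
\qquad
\phi_k(x) = e^{-x^2} \sqrt{\tfrac{2^k}{k!}}\, x^k , \quad k = 0, 1, 2, \dots
```

    so 𝜙 has infinitely many coordinates, one per power of 𝑥.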

  • 20

    Mercer Kernels

    What functions are valid kernels that correspond to feature vectors 𝜙(x)?

    Answer: Mercer kernels 𝑘(𝒙, 𝒛)

    o 𝑘 is continuous

    o 𝑘 is symmetric

    o The kernel matrix (Gram matrix, see next slide) is positive semi-definite: 𝒛ᵀ𝐾𝒛 ≥ 0 for all 𝒛 (i.e., 𝐾 has non-negative eigenvalues)

    ✓ All the previous kernels are Mercer kernels
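    A small numerical sanity check (illustrative, not a proof), assuming numpy: the Gram matrix of the RBF kernel on arbitrary points should show no negative eigenvalues beyond round-off:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
gamma = 0.5

# Gram matrix of the RBF kernel, K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

eigvals = np.linalg.eigvalsh(K)        # symmetric matrix => real eigenvalues
print(eigvals.min() >= -1e-10)         # True: positive semi-definite up to round-off
```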

  • 21

    Kernel matrix

  • 22

    Properties for constructing kernels

    o These derive from the fact that a kernel implicitly defines a feature map into a Hilbert space (a complete vector space endowed with an inner product) and from Mercer’s properties
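    For example, two standard closure properties (the sum and the elementwise product of two kernels are again kernels) can be checked numerically on their Gram matrices; a sketch assuming numpy, with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))

K_lin = X @ X.T                                    # linear kernel Gram matrix
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K_rbf = np.exp(-0.5 * sq)                          # RBF kernel Gram matrix

for name, K in [("sum", K_lin + K_rbf), ("product", K_lin * K_rbf)]:
    lam_min = np.linalg.eigvalsh((K + K.T) / 2).min()
    print(name, lam_min >= -1e-8)                  # both remain positive semi-definite
```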

  • 23

    Kernels: closure properties

  • 24

    Kernels: closure properties

  • 25

    Finally: SVMs and the Kernel Trick!

    ➢ Never represent features explicitly – compute dot products in closed form using the kernel!

    ➢ Constant-time high-dimensional dot products for many classes of features

    Training time

  • 26

    What about classification time?

    o For a new input 𝒙, if we need to represent 𝜙(𝒙), we are in trouble!

    o Recall the classifier: sign(𝒘 ⋅ 𝜙(𝒙) + 𝑏)

    o Using kernels we are fine: since 𝒘 = Σᵢ 𝛼ᵢ𝑦ᵢ𝜙(𝒙ᵢ), the classifier becomes sign(Σᵢ 𝛼ᵢ𝑦ᵢ𝑘(𝒙ᵢ, 𝒙) + 𝑏), which never needs 𝜙 explicitly

  • 27

    SVMs with Kernels: training and classification

    ✓ Choose a set of features and a kernel function
    ✓ Solve the dual problem to obtain the support vectors and their coefficients 𝛼ᵢ
    ✓ At classification time, for a new input 𝒙, compute Σᵢ 𝛼ᵢ𝑦ᵢ𝑘(𝒙ᵢ, 𝒙) + 𝑏 over the support vectors

    Classify as the sign of this value

    This we know how to compute!
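    A hedged sketch of exactly this computation, assuming scikit-learn's SVC (which exposes support_vectors_, dual_coef_ = 𝛼ᵢ𝑦ᵢ, and intercept_): the manual sum over support vectors reproduces decision_function:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
mask = y < 2                              # make it a binary problem
X, y = X[mask], y[mask]

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

x_new = X[:1]
# Manual decision value: sum_i (alpha_i * y_i) * k(sv_i, x) + b
k_vals = np.exp(-gamma * ((clf.support_vectors_ - x_new) ** 2).sum(axis=1))
manual = clf.dual_coef_ @ k_vals + clf.intercept_
print(manual, clf.decision_function(x_new))   # same value
```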

  • 28

    Overfitting?

    ❖ Huge feature space with kernels: what about overfitting?

    ✓ Maximizing the margin leads to a sparse set of support vectors

    ✓ Some interesting theory says that SVMs search for simple hypotheses with large margin

    ✓ Often robust to overfitting

  • 29

    SVMs with Kernels

    • Iris dataset, class 2 vs classes 1 and 3, linear kernel

  • 30

    SVMs with Kernels

    • Iris dataset, class 1 vs classes 2 and 3, polynomial kernel of degree 2

  • 31

    SVMs with Kernels

    • Iris dataset, class 1 vs classes 2 and 3, Gaussian RBF kernel
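    A sketch of the kind of experiment shown in these Iris slides, assuming scikit-learn and matplotlib; the two chosen features and the 𝛾 value are illustrative, not the ones used in the original figures:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = X[:, :2]                       # two features so the boundary can be plotted
y = (y == 0).astype(int)           # "class 1 vs classes 2 and 3" in the slides' numbering

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Decision regions on a grid
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.show()
```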

  • 32

    SVMs with Kernels

    • Iris dataset, class 1 vs classes 2 and 3, Gaussian RBF kernel

  • 33

    SVMs with Kernels

    • Chessboard dataset, Gaussian RBF kernel

  • 34

    SVMs with Kernels

    • Chessboard dataset, Polynomial kernel

  • 35

    Corel Dataset

    Olivier Chapelle 1998

  • 36

    USPS Handwritten digits