CS 337: Artificial Intelligence & Machine Learning
Instructor: Prof. Ganesh Ramakrishnan
Lecture 12: Kernels: Perceptron, Logistic Regression and Ridge Regression
August 2019
Kernel Perceptron
https://youtu.be/ql_GMONpmHM?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa
Recap: Perceptron Update Rule

The perceptron works for two classes (y = ±1).
A point is misclassified if y wᵀφ(x) < 0.

Perceptron Algorithm:
INITIALIZE: w = ones()
REPEAT: for each ⟨x, y⟩:
    if y wᵀφ(x) < 0 then
        w = w + η y φ(x)
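The update rule above can be sketched in NumPy as follows (a minimal sketch; the function name and the convergence check are ours, not from the slides):

```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=100):
    """Primal perceptron. X is the (m, p) matrix of features phi(x); y in {-1, +1}."""
    w = np.ones(X.shape[1])            # INITIALIZE: w = ones()
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:     # misclassified (slide uses strict <; <= also updates on ties)
                w += eta * yi * xi     # w = w + eta * y * phi(x)
                updated = True
        if not updated:                # no mistakes in a full pass: converged
            break
    return w
```

On linearly separable data this loop provably terminates; the epoch cap is only a safeguard.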
Kernel (Non-linear) Perceptron

Kernelized perceptron¹:  f(x) = sign( Σᵢ αᵢ yᵢ K(x, xᵢ) + b )

INITIALIZE: α = zeros()
REPEAT: for each ⟨xᵢ, yᵢ⟩:
    if sign( Σⱼ αⱼ yⱼ K(xᵢ, xⱼ) + b ) ≠ yᵢ then
        αᵢ = αᵢ + 1

Convergence is a matter for Tutorials 4 & 5.

¹The first kernel classification learner, invented in 1964.

This could help you avoid separately storing all the explicit feature vectors in your solution to Problem 2 of Lab 3.
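The dual algorithm above can be sketched as follows (a minimal sketch: function names are ours, and the bias b from the slide is dropped for simplicity):

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=100):
    """Kernelized perceptron: learns one dual coefficient alpha_i per training example.
    K(a, b) is a kernel function on two vectors; y in {-1, +1}."""
    m = len(X)
    G = np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])  # Gram matrix
    alpha = np.zeros(m)                           # INITIALIZE: alpha = zeros()
    for _ in range(epochs):
        mistakes = 0
        for i in range(m):
            # sign of sum_j alpha_j y_j K(x_i, x_j); sign(0) != y_i also triggers an update
            if np.sign(np.sum(alpha * y * G[i])) != y[i]:
                alpha[i] += 1                     # alpha_i = alpha_i + 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(alpha, X, y, K, x):
    """Decision function f(x) = sign(sum_i alpha_i y_i K(x_i, x))."""
    return np.sign(sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X)))
```

Note that only α and the training points are stored; w = Σᵢ αᵢ yᵢ φ(xᵢ) never has to be formed explicitly.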
An Example Kernel

We illustrate going from φ(x) to K(x, y) and back...

For example, for a 2-dimensional xᵢ:

φ(xᵢ) = ( 1, √2 xᵢ₁, √2 xᵢ₂, √2 xᵢ₁xᵢ₂, xᵢ₁², xᵢ₂² )ᵀ

φ(xᵢ) exists in a 6-dimensional space.
But to compute K(x₁, x₂) = (1 + x₁ᵀx₂)², all we need is x₁ᵀx₂, without having to enumerate φ(xᵢ).
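We can check numerically that the explicit 6-dimensional map and the 2-dimensional kernel evaluation agree (a small sketch; the example vectors are our own):

```python
import numpy as np

def phi(x):
    """Explicit 6-dimensional feature map for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2,
                     x2 ** 2])

def K(x, y):
    """Kernel trick: the same inner product from one dot product in 2-D."""
    return (1.0 + x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y), K(x, y))   # both equal (1 + 1*3 + 2*(-1))^2 = 4, up to float rounding
```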
Example Kernels

Linear kernel: K(x, y) = xᵀy
Polynomial kernel: K(x, y) = (1 + xᵀy)ᵈ
RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) )
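The three kernels can be written directly (a sketch; function names and default parameters d = 3, σ = 1 are our choices):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, d=3):
    return (1.0 + x @ y) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(x, y))       # 0.0
print(polynomial_kernel(x, y))   # (1 + 0)^3 = 1.0
print(rbf_kernel(x, y))          # exp(-2/2) = exp(-1) ~ 0.3679
```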
Example: Classification using Kernels

Source: https://www.slideshare.net/ANITALOKITA/winnow-vs-perceptron
More on the Kernel Trick

Kernels operate in a high-dimensional, implicit feature space without necessarily computing the coordinates of the data in that space; instead, one simply evaluates the kernel function.
This is often computationally cheaper than explicitly computing the coordinates.
The Gram (Kernel) Matrix

For any dataset {x₁, x₂, . . . , xm} and for any m, the Gram matrix K is defined as

K = [ K(x₁, x₁)  ...   K(x₁, xm)
       ...       K(xᵢ, xⱼ)  ...
      K(xm, x₁)  ...   K(xm, xm) ]

Claim: If Kᵢⱼ = K(xᵢ, xⱼ) = ⟨φ(xᵢ), φ(xⱼ)⟩ are the entries of the m × m Gram matrix K, then K must be symmetric and positive semi-definite.

Proof: For any b ∈ ℝᵐ,
bᵀKb = Σᵢ,ⱼ bᵢ Kᵢⱼ bⱼ = Σᵢ,ⱼ bᵢ bⱼ ⟨φ(xᵢ), φ(xⱼ)⟩ = ⟨Σᵢ bᵢ φ(xᵢ), Σⱼ bⱼ φ(xⱼ)⟩ = ‖Σᵢ bᵢ φ(xᵢ)‖₂² ≥ 0
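Both properties are easy to verify numerically on a random dataset (a sketch; the RBF kernel and the sample size are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))              # 20 random points in R^3

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Gram matrix K_ij = K(x_i, x_j)
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

assert np.allclose(K, K.T)                    # symmetric
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() >= -1e-10                # all eigenvalues >= 0: positive semi-definite
```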
Basis expansion φ for symmetric K?

Positive-definite kernel: For any dataset {x₁, x₂, . . . , xm} and for any m, the Gram matrix K must be positive semi-definite, so that
K = UΣUᵀ = (UΣ^{1/2})(UΣ^{1/2})ᵀ = RRᵀ,
where the rows of U are linearly independent and Σ is a diagonal matrix with non-negative entries.

We will illustrate through an example...
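The factorization K = RRᵀ can be computed from the eigendecomposition, giving a finite-dimensional feature vector (a row of R) for each data point (a sketch; kernel and data are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 2))
K = np.array([[(1 + xi @ xj) ** 2 for xj in X] for xi in X])  # degree-2 polynomial Gram matrix

# Eigendecomposition K = U Sigma U^T, then R = U Sigma^{1/2}
sigma, U = np.linalg.eigh(K)
sigma = np.clip(sigma, 0.0, None)       # clip tiny negative values from rounding
R = U @ np.diag(np.sqrt(sigma))

assert np.allclose(R @ R.T, K)          # rows of R act as (implicit) feature vectors
```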
From matrix decomposition to function decomposition: Mercer Kernel K

Positive-definite kernel: For any dataset {x₁, x₂, . . . , xm} and for any m, the Gram matrix K must be positive semi-definite, so that K = UΣUᵀ = (UΣ^{1/2})(UΣ^{1/2})ᵀ = RRᵀ, where the rows of U are linearly independent and Σ is a positive diagonal matrix.

Mercer kernel: Extending to eigenfunction decomposition²:

K(x₁, x₂) = Σⱼ₌₁^∞ αⱼ φⱼ(x₁) φⱼ(x₂), where αⱼ ≥ 0 and Σⱼ₌₁^∞ αⱼ² < ∞

Mercer kernels and positive-definite kernels turn out to be equivalent if the input space {x} is compact³.

²Eigen-decomposition with respect to linear operators.
³The equivalent of closed and bounded.
Mercer and Positive Definite Kernels

Mercer's Theorem is NOT INCLUDED IN THE MIDSEMESTER EXAM SYLLABUS.
Mercer's theorem (Not for Midsem)

Mercer kernel: K(x₁, x₂) is a Mercer kernel if
∫ₓ₁ ∫ₓ₂ K(x₁, x₂) g(x₁) g(x₂) dx₁ dx₂ ≥ 0 for all square integrable functions g(x)
(g(x) is square integrable iff ∫ (g(x))² dx is finite)

Mercer's theorem: For any Mercer kernel K(x₁, x₂), ∃ φ(x): ℝⁿ ↦ H such that K(x₁, x₂) = φᵀ(x₁)φ(x₂),
where H is a Hilbert space, the infinite-dimensional version of the Euclidean space (ℝⁿ, ⟨., .⟩), where ⟨., .⟩ is the standard dot product in ℝⁿ.

Advanced: Formally, a Hilbert space is an inner product space with the associated norm, in which every Cauchy sequence converges.
Prove that (x₁ᵀx₂)ᵈ is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)

We want to prove that for all square integrable functions g(x),
∫ₓ₁ ∫ₓ₂ (x₁ᵀx₂)ᵈ g(x₁) g(x₂) dx₁ dx₂ ≥ 0

Here, x₁ and x₂ are vectors s.t. x₁, x₂ ∈ ℝᵗ.

Thus, by the multinomial expansion,
∫ₓ₁ ∫ₓ₂ (x₁ᵀx₂)ᵈ g(x₁) g(x₂) dx₁ dx₂
= ∫ₓ₁₁ .. ∫ₓ₁ₜ ∫ₓ₂₁ .. ∫ₓ₂ₜ [ Σ_{n₁..nₜ} (d! / (n₁!..nₜ!)) Πⱼ₌₁ᵗ (x₁ⱼx₂ⱼ)^{nⱼ} ] g(x₁) g(x₂) dx₁₁..dx₁ₜ dx₂₁..dx₂ₜ
s.t. Σᵢ₌₁ᵗ nᵢ = d

By Fubini's theorem, under finiteness conditions, the integral of a sum is the sum of the integrals.
Prove that (x₁ᵀx₂)ᵈ is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)

= Σ_{n₁...nₜ} (d! / (n₁! . . . nₜ!)) ∫ₓ₁ ∫ₓ₂ Πⱼ₌₁ᵗ (x₁ⱼx₂ⱼ)^{nⱼ} g(x₁) g(x₂) dx₁ dx₂

= Σ_{n₁...nₜ} (d! / (n₁! . . . nₜ!)) ∫ₓ₁ ∫ₓ₂ (x₁₁^{n₁} x₁₂^{n₂} . . . x₁ₜ^{nₜ}) g(x₁) (x₂₁^{n₁} x₂₂^{n₂} . . . x₂ₜ^{nₜ}) g(x₂) dx₁ dx₂

= Σ_{n₁...nₜ} (d! / (n₁! . . . nₜ!)) [ ∫ₓ₁ (x₁₁^{n₁} . . . x₁ₜ^{nₜ}) g(x₁) dx₁ ] [ ∫ₓ₂ (x₂₁^{n₁} . . . x₂ₜ^{nₜ}) g(x₂) dx₂ ]

(the integral of a decomposable product is the product of the integrals)
s.t. Σᵢᵗ nᵢ = d
Prove that (x₁ᵀx₂)ᵈ is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)

Realize that both integrals are basically the same, with different variable names.
Thus, the quadratic (positive-definiteness) expression becomes:

Σ_{n₁...nₜ} (d! / (n₁! . . . nₜ!)) ( ∫ₓ₁ (x₁₁^{n₁} . . . x₁ₜ^{nₜ}) g(x₁) dx₁ )² ≥ 0

(the square is non-negative for reals)
Thus, we have shown that (x₁ᵀx₂)ᵈ is a Mercer kernel.
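A discrete analogue of this proof can be checked numerically: the Gram matrix of (x₁ᵀx₂)ᵈ should satisfy bᵀKb ≥ 0 for every b, with b playing the role of g (a sketch; data and d are our choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4))
d = 3
K = (X @ X.T) ** d                   # K_ij = (x_i^T x_j)^d, elementwise power of the Gram matrix

for _ in range(100):
    b = rng.standard_normal(30)
    assert b @ K @ b >= -1e-8        # the quadratic form is non-negative (up to rounding)
```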
What about Σ_{d=1}^r α_d (x₁ᵀx₂)ᵈ s.t. α_d ≥ 0?

K(x₁, x₂) = Σ_{d=1}^r α_d (x₁ᵀx₂)ᵈ

Is ∫ₓ₁ ∫ₓ₂ [ Σ_{d=1}^r α_d (x₁ᵀx₂)ᵈ ] g(x₁) g(x₂) dx₁ dx₂ ≥ 0?

We have
∫ₓ₁ ∫ₓ₂ [ Σ_{d=1}^r α_d (x₁ᵀx₂)ᵈ ] g(x₁) g(x₂) dx₁ dx₂ = Σ_{d=1}^r α_d ∫ₓ₁ ∫ₓ₂ (x₁ᵀx₂)ᵈ g(x₁) g(x₂) dx₁ dx₂
What about Σ_{d=1}^r α_d (x₁ᵀx₂)ᵈ s.t. α_d ≥ 0?

Since α_d ≥ 0 ∀d, and since we have already proved that ∫ₓ₁ ∫ₓ₂ (x₁ᵀx₂)ᵈ g(x₁) g(x₂) dx₁ dx₂ ≥ 0, we must have

Σ_{d=1}^r α_d ∫ₓ₁ ∫ₓ₂ (x₁ᵀx₂)ᵈ g(x₁) g(x₂) dx₁ dx₂ ≥ 0

By which, K(x₁, x₂) = Σ_{d=1}^r α_d (x₁ᵀx₂)ᵈ is a Mercer kernel.

Examples of Mercer Kernels: Linear Kernel, Polynomial Kernel, Radial Basis Function Kernel
Closure properties of Kernels (Part of Midsems)

Let K₁(x₁, x₂) and K₂(x₁, x₂) be positive definite (Mercer) kernels. Then the following are also kernels:

α₁K₁(x₁, x₂) + α₂K₂(x₁, x₂) for α₁, α₂ ≥ 0. Proof:

K₁(x₁, x₂)K₂(x₁, x₂). Proof:
For simplicity, we assume that the feature space is finite dimensional.
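Both closure properties can be sanity-checked at the Gram-matrix level: a non-negative combination of Gram matrices stays PSD, and so does their entrywise (Schur) product, which is the Gram matrix of the product kernel (a sketch; kernels and weights are our choices):

```python
import numpy as np

def min_eig(M):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(M).min()

rng = np.random.default_rng(3)
X = rng.standard_normal((25, 3))
K1 = X @ X.T                               # linear-kernel Gram matrix
K2 = (1 + X @ X.T) ** 2                    # degree-2 polynomial Gram matrix

assert min_eig(2.0 * K1 + 0.5 * K2) >= -1e-8   # non-negative combination stays PSD
assert min_eig(K1 * K2) >= -1e-8               # entrywise (Schur) product stays PSD
```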
Are the following Mercer Kernels? (Part of Midsems)

Linear kernel: K(x, y) = xᵀy
Polynomial kernel: K(x, y) = (1 + xᵀy)ᵈ
Exponential kernel: K(x, y) = exp(⟨x, y⟩)
RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) ) = k(x − y) = k_y(x)

The function k_y(x) is also called a smoothing kernel (as we'll see soon).
Some more Tutorial 4 + 5 Questions

Informally show that the kernelized logistic regression form is equivalent to the original logistic regression when the regularized cross-entropy is minimized.
Show that ridge regression has an equivalent kernelized form.
Kernelized Logistic Regression
(Part of Lab 3, Tutorial 4 + 5 and Midsems)
https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=316
Logistic Regression Kernelized

1. We have already seen (a) the cross-entropy loss and (b) the Bayesian interpretation of regularization.
2. The regularized (logistic) cross-entropy loss function (minimized wrt w ∈ ℝᵖ):

E(w) = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ log σ_w(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − σ_w(x⁽ⁱ⁾)) ] + (λ/2m)‖w‖₂²    (1)

3. Equivalent dual kernelized objective?
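The regularized loss can be written directly in NumPy (a minimal sketch; names are ours, and Phi denotes the design matrix of φ(x⁽ⁱ⁾)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cross_entropy(w, Phi, y, lam):
    """Loss (1): Phi is the (m, p) design matrix, y in {0, 1}, lam >= 0."""
    m = len(y)
    s = sigmoid(Phi @ w)                                  # sigma_w(x^(i)) for each example
    ce = -np.mean(y * np.log(s) + (1 - y) * np.log(1 - s))
    return ce + lam / (2 * m) * np.sum(w ** 2)            # (lambda / 2m) ||w||^2
```

With w = 0 the model predicts σ = 1/2 everywhere, so the loss is log 2 regardless of the labels, which is a handy sanity check.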
Logistic Regression Kernelized

3. By substituting σ_w(x) = 1/(1 + e^{−wᵀφ(x)}) = e^{wᵀφ(x)}/(1 + e^{wᵀφ(x)}), we simplify (1) to

E(w) = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ wᵀφ(x⁽ⁱ⁾) − log( 1 + exp(wᵀφ(x⁽ⁱ⁾)) ) ] + (λ/2m)‖w‖₂²    (3)
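The algebraic step behind this simplification, y log σ + (1 − y) log(1 − σ) = y z − log(1 + eᶻ) with z = wᵀφ(x), can be verified numerically (a sketch, with z sampled at random):

```python
import numpy as np

# Check: y*log(sigma) + (1 - y)*log(1 - sigma) == y*z - log(1 + e^z), z = w^T phi(x)
rng = np.random.default_rng(4)
z = rng.standard_normal(1000)
for y in (0.0, 1.0):
    s = 1.0 / (1.0 + np.exp(-z))
    lhs = y * np.log(s) + (1 - y) * np.log(1 - s)
    rhs = y * z - np.log(1.0 + np.exp(z))
    assert np.allclose(lhs, rhs)
```

This works because log σ = z − log(1 + eᶻ) and log(1 − σ) = −log(1 + eᶻ).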
Logistic Regression Kernelized

3. Equivalent dual kernelized objective⁴ (minimized wrt α ∈ ℝᵐ):

E_D(α) = −Σᵢ₌₁ᵐ Σⱼ₌₁ᵐ [ y⁽ⁱ⁾ K(x⁽ⁱ⁾, x⁽ʲ⁾) αⱼ − (λ/2) αᵢ K(x⁽ⁱ⁾, x⁽ʲ⁾) αⱼ ] − Σᵢ₌₁ᵐ log( 1 + exp( −Σⱼ₌₁ᵐ αⱼ K(x⁽ⁱ⁾, x⁽ʲ⁾) ) )    (5)

Decision function: σ_w(x) = 1 / ( 1 + exp( −Σⱼ₌₁ᵐ αⱼ K(x, x⁽ʲ⁾) ) )

⁴Representer Theorem and http://perso.telecom-paristech.fr/~clemenco/Projets_ENPC_files/kernel-log-regression-svm-boosting.pdf
CS 337: Artificial Intelligence & Machine Learning
The Kernel Trick: Illustrations on Ridge Regression
Recall: Ridge Regression
Video link: https://youtu.be/UVopa_V7rgE?t=598
Recall: Penalized Regularized Least Squares Regression

Φ = [ φ₁(x₁) φ₂(x₁) ...... φn(x₁)
      ..
      φ₁(xm) φ₂(xm) ...... φn(xm) ]

The Bayes and MAP estimates for linear regression y = wᵀφ(x) + ε (ε ∼ N(0, σ²)) using a Gaussian prior w ∼ N(0, (1/λ)I) coincide with regularized ridge regression:

w_Ridge = argmin_w ‖Φw − y‖₂² + λσ²‖w‖₂²

Penalty: to account for noise and stop the coefficients of w from becoming too large in magnitude.
We replace λσ² by a single λ for convenience.
Recall: Closed-form solutions

Linear regression and ridge regression both admit closed-form solutions.
For linear regression, w* = (ΦᵀΦ)⁻¹Φᵀy
For ridge regression, w* = (ΦᵀΦ + λI)⁻¹Φᵀy
(linear regression is the case λ = 0)
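The closed form translates directly to NumPy; solving the linear system is preferable to forming the inverse explicitly (a sketch; the noiseless sanity check is our own):

```python
import numpy as np

def ridge_closed_form(Phi, y, lam=0.0):
    """w* = (Phi^T Phi + lam I)^{-1} Phi^T y; lam = 0 recovers plain linear regression."""
    n = Phi.shape[1]
    # Solve (Phi^T Phi + lam I) w = Phi^T y instead of inverting the matrix
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

# Sanity check on noiseless data: with lam = 0 we recover the true weights
rng = np.random.default_rng(5)
Phi = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = Phi @ w_true
assert np.allclose(ridge_closed_form(Phi, y, lam=0.0), w_true)
```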
Recall: Polynomial regression

Consider a degree-3 polynomial regression model as shown in the figure.
Each bend in the curve corresponds to an increase in ‖w‖.
The eigenvalues of (ΦᵀΦ + λI) are indicative of curvature.
Increasing λ reduces the curvature (Tutorial 4 + 5).
Ridge Regression in another form

If y = wᵀφ(x), its ridge regression estimate is w = (ΦᵀΦ + λI)⁻¹Φᵀy, where

Φ = [ φ₁(x₁) ... φp(x₁)
      ...   ...  ...
      φ₁(xm) ... φp(xm) ]

Please note the difference between the matrix Φ and the vector φ(x):

φ(xⱼ) = ( φ₁(xⱼ), ..., φp(xⱼ) )ᵀ
Ridge Regression in another form

The regression function will be
f(x) = wᵀφ(x) = φᵀ(x)w = φᵀ(x)(ΦᵀΦ + λI)⁻¹Φᵀy

Recall that φᵀ(xᵢ)φ(xⱼ) = K(xᵢ, xⱼ). Note the following differences between ΦᵀΦ and ΦΦᵀ:
(ΦᵀΦ)ᵢⱼ = Σₖ₌₁ᵐ φᵢ(xₖ)φⱼ(xₖ)
(ΦΦᵀ)ᵢⱼ = Σₖ₌₁ᵖ φₖ(xᵢ)φₖ(xⱼ) = φᵀ(xᵢ)φ(xⱼ) = K(xᵢ, xⱼ)
Ridge Regression in another form

The regression function will be
f(x) = wᵀφ(x) = φᵀ(x)w = φᵀ(x)(ΦᵀΦ + λI)⁻¹Φᵀy

Consider the following matrix identity (verify for scalars):
(P⁻¹ + BᵀR⁻¹B)⁻¹BᵀR⁻¹ = PBᵀ(BPBᵀ + R)⁻¹

⇒ by setting R = I, P = (1/λ)I and B = Φ,
⇒ w = Φᵀ(ΦΦᵀ + λI)⁻¹y = Σᵢ₌₁ᵐ αᵢφ(xᵢ), where αᵢ = [(ΦΦᵀ + λI)⁻¹y]ᵢ
⇒ the final decision function f(x) = φᵀ(x)w = Σᵢ₌₁ᵐ αᵢ φᵀ(x)φ(xᵢ)
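The primal and dual forms can be checked against each other numerically; the identity (ΦᵀΦ + λI)⁻¹Φᵀ = Φᵀ(ΦΦᵀ + λI)⁻¹ means both give the same w (a sketch with random data; names are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
Phi = rng.standard_normal((40, 5))        # rows are phi(x_i)
y = rng.standard_normal(40)
lam = 0.7

# Primal: w = (Phi^T Phi + lam I)^{-1} Phi^T y  -- a 5x5 solve
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ y)

# Dual: w = Phi^T (Phi Phi^T + lam I)^{-1} y = sum_i alpha_i phi(x_i)  -- a 40x40 solve
alpha = np.linalg.solve(Phi @ Phi.T + lam * np.eye(40), y)
w_dual = Phi.T @ alpha

assert np.allclose(w_primal, w_dual)      # the matrix identity in action

# Prediction needs only kernel values K(x, x_i) = phi^T(x_i) phi(x)
x_new = rng.standard_normal(5)
f = alpha @ (Phi @ x_new)                 # sum_i alpha_i phi^T(x_i) phi(x)
assert np.isclose(f, w_primal @ x_new)
```

Note the trade-off: the primal solve scales with the feature dimension p, the dual solve with the number of examples m, and only the dual form survives when φ is implicit.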
Ridge Regression in another form

Thus, given w = (ΦᵀΦ + λI)⁻¹Φᵀy and the matrix identity (P⁻¹ + BᵀR⁻¹B)⁻¹BᵀR⁻¹ = PBᵀ(BPBᵀ + R)⁻¹,
⇒ w = Φᵀ(ΦΦᵀ + λI)⁻¹y = Σᵢ₌₁ᵐ αᵢφ(xᵢ), where αᵢ = [(ΦΦᵀ + λI)⁻¹y]ᵢ
⇒ the final decision function f(x) = φᵀ(x)w = Σᵢ₌₁ᵐ αᵢ φᵀ(x)φ(xᵢ)

We notice that the only way the decision function f(x) involves φ is through φᵀ(xᵢ)φ(xⱼ), for some i, j.
Recap: Example Kernel

For a 2-dimensional xᵢ:

φ(xᵢ) = ( 1, √2 xᵢ₁, √2 xᵢ₂, √2 xᵢ₁xᵢ₂, xᵢ₁², xᵢ₂² )ᵀ

φ(xᵢ) exists in a 6-dimensional space.
But to compute K(x₁, x₂), all we need is x₁ᵀx₂, without having to enumerate φ(xᵢ).
Example Kernels

Linear kernel: K(x, y) = xᵀy
Polynomial kernel: K(x, y) = (1 + xᵀy)ᵈ
RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) )
Example: Regression using Kernels