Machine Learning - web.engr.oregonstate.edu
![Page 1: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/1.jpg)
Machine Learning, Fall 2017
Professor Liang Huang
Kernels
(Kernels, Kernelized Perceptron and SVM)
(Chap. 12 of CIML)
![Page 2: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/2.jpg)
Nonlinear Features
• Concatenated (combined) features
  • XOR: $x = (x_1, x_2, x_1 x_2)$
  • income: add “degree + major”
• Perceptron
  • Map data into feature space: $x \to \phi(x)$
  • Solution in span of $\phi(x_i)$
[Figure: XOR data with x1: +1, x2: -1, x3: +1, x4: -1]
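The XOR bullet can be checked with a few lines of Python. This is a minimal sketch: the point coordinates and the weight vector `w` are assumptions chosen for illustration, since the slide shows only the labels.

```python
# Assumed XOR layout; the slide gives labels only, not coordinates.
points = [((1, 1), +1), ((-1, 1), -1), ((-1, -1), +1), ((1, -1), -1)]

def phi(x):
    """Concatenated features (x1, x2, x1*x2) from the slide."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical weight vector that only looks at the product feature.
w = (0, 0, 1)
margins = [y * dot(w, phi(x)) for x, y in points]
print(margins)  # all margins positive means a linear separator exists
```

In the original two coordinates no single line separates these four points, but after adding the product feature a linear classifier does.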
![Page 3: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/3.jpg)
Quadratic Features
• Separating surfaces are circles, hyperbolae, parabolae
![Page 4: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/4.jpg)
Kernels as dot products
Kernels (Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 40)
Problem: Extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions. This leads to about $5 \cdot 10^5$ numbers. For higher-order polynomial features it is much worse.
Solution: Don’t compute the features; try to compute dot products implicitly. For some features this works . . .
Definition: A kernel function $k : X \times X \to \mathbb{R}$ is a symmetric function in its arguments for which the following property holds:
$$k(x, x') = \langle \phi(x), \phi(x') \rangle \quad \text{for some feature map } \phi.$$
If $k(x, x')$ is much cheaper to compute than $\phi(x)$ . . .
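The feature count in the example can be reproduced by counting monomials. The sketch below assumes "second-order features" means all distinct degree-2 monomials (squares and cross terms); the helper name `num_monomials` is made up for illustration.

```python
from math import comb

def num_monomials(n, d):
    """Number of distinct degree-d monomials in n variables."""
    return comb(n + d - 1, d)

print(num_monomials(1000, 2))  # 500500, i.e. about 5 * 10**5
print(num_monomials(1000, 3))  # grows quickly with the degree
```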
![Page 5: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/5.jpg)
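Quadratic Kernel
Polynomial Features (Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 39)
Quadratic Features in $\mathbb{R}^2$:
$$\phi(x) := \left(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right)$$
Dot Product:
$$\langle \phi(x), \phi(x') \rangle = \left\langle \left(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right), \left({x_1'}^2, \sqrt{2}\, x_1' x_2', {x_2'}^2\right) \right\rangle = \langle x, x' \rangle^2 = k(x, x')$$
Insight: Trick works for any polynomials of order $d$ via $\langle x, x' \rangle^d$.
For $x$ in $\mathbb{R}^n$, quadratic $\phi$:
• naive: $\phi(x)$: $O(n^2)$; $\phi(x) \cdot \phi(x')$: $O(n^2)$
• kernel $k(x, x')$: $O(n)$
[Figure: XOR data with x1: +1, x2: -1, x3: +1, x4: -1]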
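The identity between the explicit dot product and the kernel can be checked numerically. This is a small sketch; the two test vectors are arbitrary values chosen for illustration.

```python
import math

def phi(x):
    """Explicit quadratic feature map in R^2 from the slide."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, xp):
    """Quadratic kernel: the squared dot product in the original space."""
    return dot(x, xp) ** 2

x, xp = (3.0, -1.0), (0.5, 2.0)   # arbitrary example vectors
explicit = dot(phi(x), phi(xp))   # build the features, then take the dot product
implicit = k(x, xp)               # never builds the features
print(explicit, implicit)
```

Both values agree up to floating-point rounding, while the kernel only needs the dot product in the original space.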
![Page 6: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/6.jpg)
The Perceptron on features
• Nothing happens if classified correctly
• Weight vector is a linear combination: $w = \sum_{i \in I} \alpha_i \phi(x_i)$
• Classifier is (implicitly) a linear combination of inner products: $f(x) = \sum_{i \in I} \alpha_i \langle \phi(x_i), \phi(x) \rangle$

Perceptron on Features (Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 37)
argument: $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$ (data), $Y := \{y_1, \ldots, y_m\} \subset \{\pm 1\}$ (labels)
function $(w, b) = \text{Perceptron}(X, Y, \eta)$
    initialize $w, b = 0$
    repeat
        Pick $(x_i, y_i)$ from data
        if $y_i (w \cdot \phi(x_i) + b) \le 0$ then
            $w' = w + y_i \phi(x_i)$
            $b' = b + y_i$
    until $y_i (w \cdot \phi(x_i) + b) > 0$ for all $i$
end
Important detail: $w = \sum_j y_j \phi(x_j)$ and hence $f(x) = \sum_j y_j (\phi(x_j) \cdot \phi(x)) + b$
![Page 7: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/7.jpg)
Kernelized Perceptron
• Instead of updating $w$, now update $\alpha_i$: $\alpha_i \leftarrow \alpha_i + y_i$ (increase its vote by 1)
• Weight vector is a linear combination: $w = \sum_{i \in I} \alpha_i \phi(x_i)$
• Classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i \in I} \alpha_i k(x_i, x)$

Kernel Perceptron (Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 42)
Functional form:
argument: $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$ (data), $Y := \{y_1, \ldots, y_m\} \subset \{\pm 1\}$ (labels)
function $f = \text{Perceptron}(X, Y, \eta)$
    initialize $f = 0$
    repeat
        Pick $(x_i, y_i)$ from data
        if $y_i f(x_i) \le 0$ then
            $f(\cdot) \leftarrow f(\cdot) + y_i k(x_i, \cdot) + y_i$
    until $y_i f(x_i) > 0$ for all $i$
end
Important detail: $w = \sum_j y_j \phi(x_j)$ and hence $f(x) = \sum_j y_j k(x_j, x) + b$.
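The dual update can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation: it omits the bias term, visits examples in a fixed order, and the `max_epochs` cap and the toy data are assumptions added for the example.

```python
def kernel_perceptron(X, Y, k, max_epochs=100):
    """Dual perceptron: alpha[i] accumulates y_i each time x_i is misclassified."""
    alpha = [0] * len(X)

    def f(x):
        # Classifier is a linear combination of kernel evaluations.
        return sum(a * k(xi, x) for a, xi in zip(alpha, X) if a != 0)

    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(X, Y)):
            if y * f(x) <= 0:        # misclassified (or on the boundary)
                alpha[i] += y        # update the coefficient, not w
                mistakes += 1
        if mistakes == 0:            # y_i f(x_i) > 0 for all i
            break
    return alpha, f

def linear(x, xp):
    return sum(a * b for a, b in zip(x, xp))

# Toy separable data (made up for illustration).
X = [(2, 1), (1, 2), (-1, -2), (-2, -1)]
Y = [1, 1, -1, -1]
alpha, f = kernel_perceptron(X, Y, linear)
print(alpha)
```

Only examples that were misclassified at some point end up with a nonzero coefficient, so prediction only sums over those.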
![Page 8: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/8.jpg)
Kernelized Perceptron
• Nothing happens if classified correctly
• Weight vector is a linear combination
• Classifier is a linear combination of inner products

Primal Form
• update weights: $w \leftarrow w + y_i \phi(x_i)$
• classify: $f(x) = w \cdot \phi(x)$

Dual Form
• update linear coefficients: $\alpha_i \leftarrow \alpha_i + y_i$
• implicitly equivalent to: $w = \sum_{i \in I} \alpha_i \phi(x_i)$
• classify: $f(x) = \sum_{i \in I} \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i \in I} \alpha_i k(x_i, x)$
![Page 9: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/9.jpg)
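Kernelized Perceptron

Primal Form
• update weights: $w \leftarrow w + y_i \phi(x_i)$
• classify: $f(x) = w \cdot \phi(x)$

Dual Form
• update linear coefficients: $\alpha_i \leftarrow \alpha_i + y_i$
• implicitly equivalent to: $w = \sum_{i \in I} \alpha_i \phi(x_i)$
• classify:
$$f(x) = w \cdot \phi(x) = \Big[\sum_{i \in I} \alpha_i \phi(x_i)\Big] \cdot \phi(x) = \sum_{i \in I} \alpha_i \langle \phi(x_i), \phi(x) \rangle \quad \text{(slow: } O(d^2)\text{)}$$
$$= \sum_{i \in I} \alpha_i k(x_i, x) \quad \text{(fast: } O(d)\text{)}$$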
![Page 10: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/10.jpg)
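Kernelized Perceptron
initialize $\alpha_i = 0$ for all $i$
repeat
    Pick $(x_i, y_i)$ from data
    if $y_i f(x_i) \le 0$ then
        $\alpha_i \leftarrow \alpha_i + y_i$
until $y_i f(x_i) > 0$ for all $i$

Dual Form
• update linear coefficients: $\alpha_i \leftarrow \alpha_i + y_i$
• implicitly: $w = \sum_{i \in I} \alpha_i \phi(x_i)$
• classify:
$$f(x) = w \cdot \phi(x) = \Big[\sum_{i \in I} \alpha_i \phi(x_i)\Big] \cdot \phi(x) = \sum_{i \in I} \alpha_i \langle \phi(x_i), \phi(x) \rangle \quad \text{(slow: } O(d^2)\text{)}$$
$$= \sum_{i \in I} \alpha_i k(x_i, x) \quad \text{(fast: } O(d)\text{)}$$

If #features >> #examples, dual is easier; otherwise primal is easier.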
![Page 11: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/11.jpg)
Kernelized Perceptron

Primal Perceptron
• update weights: $w \leftarrow w + y_i \phi(x_i)$
• classify: $f(x) = w \cdot \phi(x)$

Dual Perceptron
• update linear coefficients: $\alpha_i \leftarrow \alpha_i + y_i$
• implicitly: $w = \sum_{i \in I} \alpha_i \phi(x_i)$

If #features >> #examples, dual is easier; otherwise primal is easier.
Q: When is #features >> #examples?
A: Higher-order polynomial kernels or exponential kernels (infinite-dimensional).
![Page 12: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/12.jpg)
Kernelized Perceptron

Dual Perceptron
• update linear coefficients: $\alpha_i \leftarrow \alpha_i + y_i$
• implicitly: $w = \sum_{i \in I} \alpha_i \phi(x_i)$
• classify:
$$f(x) = w \cdot \phi(x) = \Big[\sum_{i \in I} \alpha_i \phi(x_i)\Big] \cdot \phi(x) = \sum_{i \in I} \alpha_i \langle \phi(x_i), \phi(x) \rangle \quad \text{(slow: } O(d^2)\text{)}$$
$$= \sum_{i \in I} \alpha_i k(x_i, x) \quad \text{(fast: } O(d)\text{)}$$

Pros/Cons of Kernel in Dual
• pros:
  • no need to compute ɸ(x) (time)
  • no need to store ɸ(x) and w (memory)
• cons:
  • at test time, must sum over all misclassified training examples
  • need to store all misclassified training examples (memory)
    • called “support vector set”
    • SVM will minimize this set!
![Page 13: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/13.jpg)
Kernelized Perceptron

Training data: $x_1 = (0, 1)$: -1, $x_2 = (2, 1)$: +1, $x_3 = (0, -1)$: +1

Primal Perceptron (update on → new param.)
• x1: -1 → w = (0, -1)
• x2: +1 → w = (2, 0)
• x3: +1 → w = (2, -1)

Dual Perceptron, linear kernel (identity map) (update on → new param. → implicit w)
• x1: -1 → α = (-1, 0, 0) → -x1
• x2: +1 → α = (-1, 1, 0) → -x1 + x2
• x3: +1 → α = (-1, 1, 1) → -x1 + x2 + x3
• final implicit w = (2, -1)

Geometric interpretation of dual classification: the sum of dot-products with x2 & x3 is bigger than the dot-product with x1 (agreement w/ positive > agreement w/ negative).
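The trace above can be replayed in Python. This sketch runs a single in-order pass of the primal perceptron without a bias term, then recovers the implicit weight vector from the dual coefficients listed on the slide.

```python
X = [(0, 1), (2, 1), (0, -1)]   # x1, x2, x3 from the slide
Y = [-1, +1, +1]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Primal perceptron: one pass in order, no bias term.
w = (0, 0)
trace = []
for x, y in zip(X, Y):
    if y * dot(w, x) <= 0:
        w = tuple(wi + y * xi for wi, xi in zip(w, x))
    trace.append(w)
print(trace)       # [(0, -1), (2, 0), (2, -1)]

# Dual coefficients from the slide; implicit w = sum_i alpha_i * x_i.
alpha = (-1, 1, 1)
w_implicit = tuple(sum(a * x[j] for a, x in zip(alpha, X)) for j in range(2))
print(w_implicit)  # (2, -1)
```

Both routes end at the same weight vector, which is what "implicitly equivalent" means for the linear kernel.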
![Page 14: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/14.jpg)
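XOR Example

[Figure: XOR data with x1: +1, x2: -1, x3: +1, x4: -1]

$k(x, x') = (x \cdot x')^2$, $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$

Dual Perceptron (update on → new param. → implicit w)
• x1: +1 → α = (+1, 0, 0, 0) → φ(x1)
• x2: -1 → α = (+1, -1, 0, 0) → φ(x1) - φ(x2)

Classification rule in dual (geometry), where x1 and x2 are the two training points:
$$(x \cdot x_1)^2 > (x \cdot x_2)^2 \Rightarrow \cos^2\theta_1 > \cos^2\theta_2 \Rightarrow |\cos\theta_1| > |\cos\theta_2|$$

In dual (algebra), where after the first step $x_1, x_2$ denote the coordinates of $x$ (this presumes the training points are x1 = (1, 1) and x2 = (-1, 1) or (1, -1)):
$$(x \cdot x_1)^2 > (x \cdot x_2)^2 \Rightarrow (x_1 + x_2)^2 > (x_1 - x_2)^2 \Rightarrow x_1 x_2 > 0$$

Also verify in primal: $w = (0, 0, 2\sqrt{2})$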
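The XOR result can be verified in both the dual and the primal. The sketch below uses the dual coefficients from the slide; the point coordinates are an assumption, since the slide shows only the labels.

```python
import math

# Assumed coordinates for the four XOR points (labels are from the slide).
X = [(1, 1), (-1, 1), (-1, -1), (1, -1)]   # x1, x2, x3, x4
Y = [+1, -1, +1, -1]
alpha = (1, -1, 0, 0)                      # dual coefficients from the slide

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, xp):
    return dot(x, xp) ** 2

def phi(x):
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def f_dual(x):
    return sum(a * k(xi, x) for a, xi in zip(alpha, X))

# Implicit primal weights: w = phi(x1) - phi(x2).
w = tuple(sum(a * phi(xi)[j] for a, xi in zip(alpha, X)) for j in range(3))

for x, y in zip(X, Y):
    print(x, y, f_dual(x), dot(w, phi(x)))
```

Under these assumed coordinates, two updates are enough to classify all four points, and the primal weights put all their mass on the $x_1 x_2$ feature.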
![Page 15: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/15.jpg)
Circle Example??

$k(x, x') = (x \cdot x')^2$, $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$

Dual Perceptron (update on → new param. → implicit w)
• x1: +1 → α = (+1, 0, 0, 0) → φ(x1)
• x2: -1 → α = (+1, -1, 0, 0) → φ(x1) - φ(x2)
![Page 16: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/16.jpg)
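Polynomial Kernels
Polynomial Kernels in $\mathbb{R}^n$ (Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 41)
Idea: We want to extend $k(x, x') = \langle x, x' \rangle^2$ to
$$k(x, x') = (\langle x, x' \rangle + c)^d \quad \text{where } c > 0 \text{ and } d \in \mathbb{N}.$$
Prove that such a kernel corresponds to a dot product.
Proof strategy: Simple and straightforward: compute the explicit sum given by the kernel, i.e.
$$k(x, x') = (\langle x, x' \rangle + c)^d = \sum_{i=0}^{d} \binom{d}{i} \langle x, x' \rangle^i c^{d-i}$$
Individual terms $\langle x, x' \rangle^i$ are dot products for some $\phi_i(x)$.
$+c$ is just augmenting the space. Simpler proof: set $x_0 = \sqrt{c}$, i.e. add a constant coordinate so that the dot product of the augmented vectors equals $\langle x, x' \rangle + c$.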
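Both proof routes can be checked numerically. This is a sketch with arbitrary example values for `x`, `xp`, `c`, and `d`; the helper names are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, xp, c, d):
    return (dot(x, xp) + c) ** d

def augment(x, c):
    """Prepend the constant coordinate x0 = sqrt(c)."""
    return (math.sqrt(c),) + tuple(x)

x, xp, c, d = (1.0, 2.0), (3.0, -1.0), 2.0, 3   # arbitrary example values

lhs = poly_kernel(x, xp, c, d)
# Simpler proof: homogeneous kernel on the augmented vectors.
rhs = dot(augment(x, c), augment(xp, c)) ** d
# Proof strategy from the slide: binomial expansion.
expansion = sum(math.comb(d, i) * dot(x, xp) ** i * c ** (d - i)
                for i in range(d + 1))
print(lhs, rhs, expansion)
```

All three values agree up to rounding, so the inhomogeneous kernel is the homogeneous kernel $\langle \cdot, \cdot \rangle^d$ applied in a space with one extra constant coordinate.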
![Page 17: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/17.jpg)
Circle Example

[Figure: five data points x1, x2, x3, x4, x5]

Data listed on the slide, x (augmented) with label y:
• (2, 0, 1): +1
• (-1, 2, 1): +1
• (0, -1.5, 1): -1

Dual Perceptron (update on → new param. → implicit w)
• x1: +1 → α = (+1, 0, 0, 0, 0) → φ(x1)
• x2: -1 → α = (+1, -1, 0, 0, 0) → φ(x1) - φ(x2)
• x3: -1 → α = (+1, -1, -1, 0, 0)

$k(x, x') = (x \cdot x')^2$, $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$
$k(x, x') = (x \cdot x' + 1)^2$, $\phi(x) = ?$
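One possible answer to the $\phi(x) = ?$ question, for two-dimensional inputs, is the six-dimensional map below. It comes from expanding $(x \cdot x' + 1)^2$ and is offered as a sketch; the ordering of the coordinates is an arbitrary choice.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Candidate feature map for k(x, x') = (x . x' + 1)^2 in R^2."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)

def k(x, xp):
    return (dot(x, xp) + 1) ** 2

# First two coordinates of two of the augmented points listed on the slide.
a, b = (2.0, 0.0), (-1.0, 2.0)
print(k(a, b), dot(phi(a), phi(b)))
```

Because this map contains $x_1^2$, $x_2^2$, the linear terms, and a constant, a linear classifier in feature space can represent a circle such as $x_1^2 + x_2^2 - r^2 = 0$.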
![Page 18: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/18.jpg)
Examples
Some Good Kernels (Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 48)
Examples of kernels $k(x, x')$:
• Linear: $\langle x, x' \rangle$
• Laplacian RBF: $\exp(-\lambda \|x - x'\|)$
• Gaussian RBF: $\exp(-\lambda \|x - x'\|^2)$
• Polynomial: $(\langle x, x' \rangle + c)^d$, $c \ge 0$, $d \in \mathbb{N}$
• B-Spline: $B_{2n+1}(x - x')$
• Cond. Expectation: $E_c[p(x|c)\, p(x'|c)]$
Simple trick for checking Mercer’s condition: compute the Fourier transform of the kernel and check that it is nonnegative.
You only need to know polynomial and Gaussian.
Slide annotations: “distorts distance”, “distorts angle”.
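The first four kernels in the list translate directly into code. This is a sketch; the default values for `c`, `d`, and `lam` are arbitrary choices, not recommendations from the slides.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def linear(x, xp):
    return dot(x, xp)

def polynomial(x, xp, c=1.0, d=2):
    return (dot(x, xp) + c) ** d

def gaussian_rbf(x, xp, lam=1.0):
    return math.exp(-lam * sq_dist(x, xp))

def laplacian_rbf(x, xp, lam=1.0):
    return math.exp(-lam * math.sqrt(sq_dist(x, xp)))

x, xp = (1.0, 0.0), (0.0, 1.0)
for kern in (linear, polynomial, gaussian_rbf, laplacian_rbf):
    print(kern.__name__, kern(x, xp))
```

The linear and polynomial kernels depend on the inputs only through their dot product, while the two RBF kernels depend only on the distance between them.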
![Page 19: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/19.jpg)
Kernel Summary
• For a feature map ɸ, find a magic function k, s.t.:
  • the dot-product ɸ(x)∙ɸ(x’) = k(x, x’)
  • this k(x, x’) should be much faster than ɸ(x)
  • k(x, x’) should be computable in O(n) if x in ℝ^n
  • ɸ(x) is much slower: O(n^d) for poly d, more for Gaussian
• But for any k function, is there a ɸ s.t. ɸ(x)∙ɸ(x’) = k(x, x’)?
![Page 20: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/20.jpg)
Mercer’s Theorem (Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 44)
The Theorem: For any symmetric function $k : X \times X \to \mathbb{R}$ which is square integrable in $X \times X$ and which satisfies
$$\int_{X \times X} k(x, x') f(x) f(x') \, dx \, dx' \ge 0 \quad \text{for all } f \in L_2(X)$$
there exist $\phi_i : X \to \mathbb{R}$ and numbers $\lambda_i \ge 0$ where
$$k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x') \quad \text{for all } x, x' \in X.$$
Interpretation: The double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have
$$\sum_i \sum_j k(x_i, x_j) \alpha_i \alpha_j \ge 0$$
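The finite (matrix) version of the condition can be probed numerically. This is only a sanity check under assumed random data, not a proof: it evaluates the quadratic form for a Gaussian kernel, and for a symmetric function (negative squared distance, chosen here as a counterexample) that fails the condition.

```python
import math
import random

def sq_dist(x, xp):
    return sum((a - b) ** 2 for a, b in zip(x, xp))

def gaussian(x, xp, lam=0.5):
    return math.exp(-lam * sq_dist(x, xp))

def neg_sq_dist(x, xp):
    """Symmetric, but not a valid kernel."""
    return -sq_dist(x, xp)

def quad_form(K, alpha):
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))

random.seed(0)
pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(6)]
K = [[gaussian(p, q) for q in pts] for p in pts]
values = [quad_form(K, [random.gauss(0, 1) for _ in pts]) for _ in range(100)]
print(min(values) >= -1e-9)   # nonnegative up to rounding

# Counterexample: two distinct points with alpha = (1, 1).
two = [(0.0, 0.0), (1.0, 0.0)]
K_bad = [[neg_sq_dist(p, q) for q in two] for p in two]
print(quad_form(K_bad, [1.0, 1.0]))   # negative, so the condition fails
```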
![Page 21: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/21.jpg)
Properties
Properties of the Kernel (Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 45)
Distance in Feature Space: Distance between points in feature space via
$$d(x, x')^2 := \|\phi(x) - \phi(x')\|^2 = \langle \phi(x), \phi(x) \rangle - 2 \langle \phi(x), \phi(x') \rangle + \langle \phi(x'), \phi(x') \rangle = k(x, x) + k(x', x') - 2 k(x, x')$$
Kernel Matrix: To compare observations we compute dot products, so we study the matrix $K$ given by
$$K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j)$$
where $x_i$ are the training patterns.
Similarity Measure: The entries $K_{ij}$ tell us the overlap between $\phi(x_i)$ and $\phi(x_j)$, so $k(x_i, x_j)$ is a similarity measure.
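Both properties can be computed from kernel evaluations alone. The sketch below uses the quadratic kernel from the earlier slides and three made-up points.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, xp):
    return dot(x, xp) ** 2   # quadratic kernel from the earlier slides

def feature_distance(x, xp):
    """Distance in feature space, computed without building phi."""
    d2 = k(x, x) + k(xp, xp) - 2 * k(x, xp)
    return math.sqrt(max(d2, 0.0))   # clamp tiny negative rounding errors

def gram(points):
    """Kernel matrix K[i][j] = k(x_i, x_j)."""
    return [[k(p, q) for q in points] for p in points]

pts = [(1, 1), (-1, 1), (2, 0)]   # made-up training patterns
K = gram(pts)
print(K)
print(feature_distance(pts[0], pts[1]))
```

The kernel matrix is symmetric, and its off-diagonal entries act as the similarity scores between training patterns.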
![Page 22: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/22.jpg)
Kernelized Pegasos for SVM
For HW2, you don’t need to randomly choose training examples. Just go over all training examples in the original order, and call that an epoch (same as HW1).
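The slide's algorithm box did not survive extraction, so the following is only a rough sketch in the spirit of the kernelized Pegasos variant from the Pegasos paper, with in-order epochs as the note above suggests. The $1/(\lambda t)$ scaling, the parameter names `lam` and `epochs`, the absence of a bias term, and the XOR toy data are all assumptions and may differ from the course's version.

```python
def kernel_pegasos(X, Y, k, lam=0.1, epochs=5):
    n = len(X)
    alpha = [0] * n   # alpha[j]: how often x_j violated the margin
    t = 0
    for _ in range(epochs):
        for i in range(n):           # fixed order; one full pass is one epoch
            t += 1
            score = sum(alpha[j] * Y[j] * k(X[j], X[i])
                        for j in range(n) if alpha[j])
            if Y[i] * score / (lam * t) < 1:   # hinge loss is active
                alpha[i] += 1

    def f(x):
        # Assumes epochs >= 1, so that t > 0.
        return sum(alpha[j] * Y[j] * k(X[j], x)
                   for j in range(n) if alpha[j]) / (lam * t)

    return alpha, f

def quad(x, xp):
    return sum(a * b for a, b in zip(x, xp)) ** 2

# Assumed XOR coordinates, as a toy check.
X = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
Y = [+1, -1, +1, -1]
alpha, f = kernel_pegasos(X, Y, quad)
print(alpha, [f(x) for x in X])
```

Unlike the perceptron, the update fires whenever the margin is below 1, not only on outright mistakes.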
![Page 23: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/23.jpg)
σ = 1.0, C = ∞
[Figure: SVM decision boundary with contours f(x) = 1, f(x) = 0, f(x) = -1]
$$f(x) = \sum_{i}^{N} \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$
Gaussian RBF kernel (default in sklearn)
![Page 24: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/24.jpg)
σ = 1.0 C = 100
Decrease C, gives wider (soft) margin
![Page 25: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/25.jpg)
σ = 1.0 C = 10
$$f(x) = \sum_{i}^{N} \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$
![Page 26: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/26.jpg)
σ = 1.0 C =∞
$$f(x) = \sum_{i}^{N} \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$
![Page 27: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/27.jpg)
σ = 0.25 C =∞
Decrease sigma, moves towards nearest neighbour classifier
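The effect of σ can be illustrated with the decision function $f(x) = \sum_i \alpha_i y_i \exp(-\|x - x_i\|^2 / 2\sigma^2) + b$ on a toy example. This is a sketch: the one-dimensional data and the equal `alpha` weights are made up and do not come from a trained SVM.

```python
import math

def rbf_decision(x, X, Y, alpha, sigma, b=0.0):
    """f(x) = sum_i alpha_i * y_i * exp(-||x - x_i||^2 / (2 sigma^2)) + b"""
    total = b
    for xi, yi, ai in zip(X, Y, alpha):
        d2 = sum((a - c) ** 2 for a, c in zip(x, xi))
        total += ai * yi * math.exp(-d2 / (2 * sigma ** 2))
    return total

X = [(0.0,), (1.0,), (1.2,)]
Y = [+1, -1, -1]
alpha = [1.0, 1.0, 1.0]   # hypothetical equal weights, not a trained SVM
x = (0.45,)               # nearest training point is (0.0,), labeled +1

wide = rbf_decision(x, X, Y, alpha, sigma=1.0)
narrow = rbf_decision(x, X, Y, alpha, sigma=0.1)
print(wide, narrow)
```

With the wide kernel the two negative points outvote the closer positive one; with the narrow kernel the nearest point dominates the sum, which is the nearest-neighbour behaviour the slide describes.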
![Page 28: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/28.jpg)
σ = 0.1 C =∞
$$f(x) = \sum_{i}^{N} \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$
![Page 29: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/29.jpg)
Polynomial Kernels
This is in contrast with C:
• smaller C => wide margin (underfitting)
• larger C => narrow margin (overfitting)
![Page 30: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/30.jpg)
Overfitting vs. Underfitting
![Page 31: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/31.jpg)
From SVM to Nearest Neighbor
• for each test example x, decide its label by the training example closest to x
• decision boundary highly non-linear (Voronoi)
• k-nearest neighbor (k-NN): smoother boundaries
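A k-NN classifier along these lines can be sketched in a few lines. The toy data is made up for illustration, and ties in the vote are not handled specially.

```python
from collections import Counter

def knn_predict(x, X, Y, k=1):
    """Label x by majority vote among its k closest training examples."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x, X[i])))
    votes = Counter(Y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up training set: three negatives near the origin, two positives far away.
X = [(0, 0), (0, 1), (1, 0), (3, 3), (3, 4)]
Y = [-1, -1, -1, +1, +1]
x = (2.4, 2.4)

print(knn_predict(x, X, Y, k=1))        # closest point decides
print(knn_predict(x, X, Y, k=len(X)))   # k = N: majority label of the training set
```

With k = 1 the prediction follows the single closest point; with k = N every test point gets the majority label of the training set.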
![Page 32: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/32.jpg)
[Figure: k-NN decision boundaries on training data (left) and testing data (right)]
• K = 1: training error = 0.0, testing error = 0.15
• K = 3: training error = 0.0760, testing error = 0.1340
• K = 7: training error = 0.1320, testing error = 0.1110
• K = 21: training error = 0.1120, testing error = 0.0920
![Page 33: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/33.jpg)
small k: overfitting
large k: underfitting
what about k=N?
![Page 34: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/34.jpg)
SVM vs. Nearest Neighbor
• support vectors: SVM: few; nearest neighbor: all
![Page 35: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/35.jpg)
![Page 36: Machine Learning - web.engr.oregonstate.edu](https://reader033.fdocuments.net/reader033/viewer/2022050512/6271d363e27cc23147384a70/html5/thumbnails/36.jpg)