Transcript of Chapter 4: Linear Models for Classification (shirani/vision/prml-ch4.pdf)
Chapter 4: Linear Models for Classification
Pattern Recognition and Machine Learning
Representing the target values for classification
• If there are only two classes, we typically use a single real-valued output that has target values of 1 for the "positive" class and 0 (or sometimes −1) for the other class.
• For probabilistic class labels, the output of the model can represent the probability the model gives to the positive class.
• If there are N classes, we often use a vector of N target values containing a single 1 for the correct class and zeros elsewhere.
• For probabilistic labels, we can then use a vector of class probabilities as the target vector.
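The 1-of-N target coding described above can be sketched as follows (the helper name and the example labels are illustrative, not from the slides):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as 1-of-N target vectors:
    a single 1 in the position of the correct class, zeros elsewhere."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# three training cases with correct classes 0, 2, 1 out of N = 3 classes
T = one_hot([0, 2, 1], 3)
```

Each row then sums to 1, so a row can equally be read as a degenerate vector of class probabilities.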
Three approaches to classification
• Use discriminant functions directly without probabilities:
  – Convert the input vector into one or more real values so that a simple operation (like thresholding) can be applied to get the class.
• Infer conditional class probabilities p(class = C_k | x):
  – Then make a decision that minimizes some loss function.
• Compare the probability of the input under separate, class-specific, generative models:
  – E.g. fit a multivariate Gaussian to the input vectors of each class and see which Gaussian makes a test data vector most probable.
Discriminant functions

The planar decision surface in data-space for the simple linear discriminant function:

y = w^T x + w_0 ≥ 0
Discriminant functions for N > 2 classes
• One possibility is to use N two-way discriminant functions.
  – Each function discriminates one class from the rest.
• Another possibility is to use N(N−1)/2 two-way discriminant functions.
  – Each function discriminates between two particular classes.
• Both of these methods have problems.
Problems with multi-class discriminant functions
A simple solution

• Use K discriminant functions y_1(x), y_2(x), ..., y_K(x) and pick the max.
  – This is guaranteed to give consistent and convex decision regions if each y_k is linear:

y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B)

implies (for 0 ≤ α ≤ 1) that

y_k(αx_A + (1−α)x_B) > y_j(αx_A + (1−α)x_B)
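A minimal sketch of the pick-the-max rule, with hand-chosen (not fitted) weight vectors for K = 3 linear discriminants over 2-D inputs:

```python
import numpy as np

# K = 3 linear discriminants y_k(x) = w_k^T x + w_k0 over 2-D inputs.
# The weights below are hand-picked for illustration, not learned.
W = np.array([[ 1.0,  0.0],    # w_1
              [-1.0,  0.0],    # w_2
              [ 0.0,  1.0]])   # w_3
w0 = np.array([0.0, 0.0, -0.5])

def classify(x):
    y = W @ x + w0              # evaluate all K discriminants at once
    return int(np.argmax(y))    # pick the class whose y_k is largest

# because each y_k is linear in x, the resulting decision regions are convex
```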
Using "least squares" for classification

This is not the right thing to do, and it doesn't work as well as other methods (as we will see later), but it is easy: it reduces classification to least squares regression, which we already know how to do. We can just solve for the optimal weights with some matrix algebra.
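A sketch of that reduction on synthetic data (the two-cluster data set is an assumption for illustration): stack a bias feature onto X, use 1-of-N targets, and solve for the weights in one least squares call.

```python
import numpy as np

rng = np.random.default_rng(0)
# illustrative 2-class data: 20 points near (0, 0) and 20 near (3, 3)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(3.0, 0.5, (20, 2))])
T = np.zeros((40, 2))
T[:20, 0] = 1.0                 # 1-of-N targets: first cluster is class 0,
T[20:, 1] = 1.0                 # second cluster is class 1

Xb = np.hstack([np.ones((40, 1)), X])        # prepend a bias feature
W, *_ = np.linalg.lstsq(Xb, T, rcond=None)   # minimize ||Xb W - T||^2
pred = np.argmax(Xb @ W, axis=1)             # classify by the largest output
```

On well-separated clusters like these the least squares classifier works; the following slides show where it breaks down.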
Least squares for classification

For training data {x_n, t_n}, the sum-of-squares error function (with the bias absorbed into X and W) is

E_D(W) = ½ Tr{ (XW − T)^T (XW − T) }

where the nth row of the matrix T is t_n^T.
Problems with using least squares for classification

If the right answer is 1 and the model says 1.5, it loses, so it changes the boundary to avoid being "too correct".

(Figure: decision boundaries compared for logistic regression and least squares regression.)
Another example where least squares regression gives poor decision surfaces
Fisher’s linear discriminant
• A simple linear discriminant function is a projection of the data down to 1-D.
  – So choose the projection that gives the best separation of the classes. What do we mean by "best separation"?
• An obvious direction to choose is the direction of the line joining the class means.
  – But if the main direction of variance in each class is not orthogonal to this line, this will not give good separation.
• Fisher's method chooses the direction that maximizes the ratio of between-class variance to within-class variance.
  – This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions).
Fisher's linear discriminant

The two-class linear discriminant acts as a projection

y = w^T x

followed by a threshold: classify as C_1 if y ≥ −w_0.

In which direction w should we project? One which separates the classes "well". Intuition: we want the (projected) centres of the classes to be far apart, and each class (projection) to be clustered around its centre.
Fisher’s linear discriminant
A natural idea would be to project in the direction of the line connecting the class means. However, this is problematic if the classes have variance in this direction (i.e. are not clustered around their means). Fisher criterion: maximize the ratio of inter-class separation (between) to intra-class variance (within).
A picture showing the advantage of Fisher’s linear discriminant.
When projected onto the line joining the class means, the classes are not well separated.
Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.
Math of Fisher's linear discriminants

What linear transformation is best for discrimination? The projection onto the vector separating the class means seems sensible:

y = w^T x

But we also want small variance within each class:

s_1^2 = Σ_{n ∈ C_1} (y_n − m_1)^2
s_2^2 = Σ_{n ∈ C_2} (y_n − m_2)^2

Fisher's objective function is (between-class separation over within-class variance):

J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2)
More math of Fisher's linear discriminants

J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2) = (w^T S_B w) / (w^T S_W w)

S_B = (m_2 − m_1)(m_2 − m_1)^T

S_W = Σ_{n ∈ C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n ∈ C_2} (x_n − m_2)(x_n − m_2)^T

Optimal solution: w ∝ S_W^{−1} (m_2 − m_1)
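A sketch of computing Fisher's optimal direction w ∝ S_W^{-1}(m_2 − m_1) directly (the two-Gaussian data set and its shared covariance are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# two illustrative classes sharing an elongated covariance
C = np.array([[2.0, 1.5],
              [1.5, 2.0]])
X1 = rng.multivariate_normal([0.0, 0.0], C, 100)
X2 = rng.multivariate_normal([3.0, 3.0], C, 100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)     # class mean vectors
# within-class scatter S_W: sum of the two per-class scatter matrices
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(S_W, m2 - m1)             # w proportional to S_W^-1 (m2 - m1)
w /= np.linalg.norm(w)                        # scale is irrelevant, so normalize
```

By construction this direction scores at least as well under the Fisher criterion J as the naive mean-difference direction.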
Probabilistic Generative Models

Model the class-conditional densities p(x|C_k) and the class priors p(C_k), then use Bayes' theorem to compute the posterior probabilities p(C_k|x).
Logistic Sigmoid Function
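The slide's plot is not reproduced here; as a sketch, the logistic sigmoid σ(a) = 1/(1 + exp(−a)) and its symmetry property σ(−a) = 1 − σ(a) can be written as:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: squashes any real a into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# sigmoid(0) = 0.5, and sigmoid(-a) = 1 - sigmoid(a) (symmetry about 0.5)
```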
Probabilistic Generative Models (K classes)

• Bayes' theorem gives the normalized exponential (softmax):

p(C_k|x) = exp(a_k) / Σ_j exp(a_j), where a_k = ln( p(x|C_k) p(C_k) )

• If a_k ≫ a_j for all j ≠ k, then p(C_k|x) ≈ 1 and p(C_j|x) ≈ 0.
Continuous inputs
Model the class-conditional densities as Gaussians; here, all classes share the same covariance matrix. Two-class problem: the posterior becomes p(C_1|x) = σ(w^T x + w_0).
Two-Class Problem
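A sketch of the resulting posterior for shared-covariance Gaussians, p(C_1|x) = σ(w^T x + w_0) with w = Σ^{-1}(μ_1 − μ_2) (the means, covariance, and priors below are assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# illustrative class-conditional Gaussians with a shared covariance
mu1, mu2 = np.array([2.0, 2.0]), np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
pi1 = 0.5                                   # prior p(C1); p(C2) = 1 - pi1

Si = np.linalg.inv(Sigma)
w = Si @ (mu1 - mu2)                        # linear weights
w0 = (-0.5 * mu1 @ Si @ mu1
      + 0.5 * mu2 @ Si @ mu2
      + np.log(pi1 / (1.0 - pi1)))          # bias term

def posterior_c1(x):
    """p(C1 | x) = sigmoid(w^T x + w0): a planar decision surface."""
    return sigmoid(w @ x + w0)
```

With equal priors the posterior is exactly 0.5 at the midpoint between the two means, as the planar decision surface predicts.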
The posterior when the covariance matrices are different for different classes.
The decision surface is planar when the covariance matrices are the same and quadratic when they are not.
Maximum likelihood solution

The likelihood function (for data {x_n, t_n} with prior p(C_1) = π):

p(t, X | π, μ_1, μ_2, Σ) = Π_{n=1}^N [ π N(x_n|μ_1, Σ) ]^{t_n} [ (1 − π) N(x_n|μ_2, Σ) ]^{1−t_n}
Maximum likelihood solu6on
Note that the approach of fitting Gaussian distributions to the classes is not robust to outliers, because the maximum likelihood estimation of a Gaussian is not robust.
Discrete features
For binary features x_i ∈ {0, 1} with a naive Bayes assumption, the class-conditional distributions factorize as p(x|C_k) = Π_i μ_ki^{x_i} (1 − μ_ki)^{1−x_i}, and a_k(x) = ln p(x|C_k)p(C_k) is again a linear function of the input values.
Probabilistic Discriminative Models
What is it? Use the functional form of the generalized linear model explicitly, and determine its parameters directly.

This differs from discriminant functions, which output no probabilities, and from probabilistic generative models, which model the class-conditional densities first.
Fixed Basis Functions
Transform the input space with fixed nonlinear basis functions φ(x); decision boundaries that are linear in the feature space correspond to nonlinear boundaries in the original input space.
Logistic Regression
• To reduce the complexity of probabilistic models, the logistic sigmoid is introduced:
  – For Gaussian class-conditional densities: M(M+5)/2 + 1 parameters (growing quadratically with M)
  – Logistic regression: M parameters (growing linearly with M)
The Likelihood
p(t|w) = Π_{n=1}^N y_n^{t_n} (1 − y_n)^{1−t_n}, where y_n = p(C_1|φ_n)
The Gradient of the Log Likelihood
• Error function: the negative log likelihood (cross-entropy)

E(w) = −ln p(t|w) = −Σ_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) }

• Gradient of the error function:

∇E(w) = Σ_{n=1}^N (y_n − t_n) φ_n
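A sketch of plain gradient descent on this error, using the compact form of the gradient ∇E(w) = Φ^T(y − t) (the 1-D data set, learning rate, and iteration count are assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
# illustrative 1-D data: class 0 centred at -1, class 1 centred at +2
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones(100), x])    # bias feature plus the raw input

w = np.zeros(2)
for _ in range(2000):
    y = sigmoid(Phi @ w)                    # y_n = p(C1 | phi_n)
    grad = Phi.T @ (y - t)                  # gradient of E(w)
    w -= 0.01 * grad                        # small fixed step
```

In practice the Newton-Raphson / IRLS update on the next slides converges in far fewer iterations than this fixed-step loop.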
Iterative Reweighted Least Squares
• Newton-Raphson update: w^(new) = w^(old) − H^{−1} ∇E(w)
• H = Φ^T R Φ is the Hessian matrix
• R is an N×N diagonal matrix with elements R_nn = y_n(1 − y_n)
Iterative Reweighted Least Squares
• The Newton-Raphson update takes the form of a weighted least squares problem:

w^(new) = (Φ^T R Φ)^{−1} Φ^T R z, where z = Φw^(old) − R^{−1}(y − t)
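A sketch of the full IRLS loop under these equations (the toy data set, iteration count, and the underflow guard on R are assumptions for illustration); each iteration solves a weighted least squares problem with weights R_nn = y_n(1 − y_n):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, iters=25):
    """Newton-Raphson / iterative reweighted least squares for
    two-class logistic regression."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        r = np.clip(y * (1.0 - y), 1e-10, None)  # diagonal of R (guarded)
        z = Phi @ w - (y - t) / r                # working response z
        A = Phi.T @ (Phi * r[:, None])           # Phi^T R Phi (the Hessian)
        w = np.linalg.solve(A, Phi.T @ (r * z))  # weighted least squares
    return w

rng = np.random.default_rng(3)
# overlapping 1-D classes, so the maximum likelihood solution is finite
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones(100), x])
w_hat = irls(Phi, t)
```

Because Newton's method converges quadratically on this convex problem, the gradient at w_hat is essentially zero after a handful of iterations.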
Multiclass Logistic Regression
• Generative models (discussed in Section 4.2) give the softmax posterior

p(C_k|φ) = exp(a_k) / Σ_j exp(a_j), where a_k = w_k^T φ

• Here instead we use maximum likelihood to determine the parameters {w_k} directly.
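The normalized exponential above can be sketched as follows (the max-shift is a standard numerical-stability trick, not from the slide):

```python
import numpy as np

def softmax(a):
    """p(C_k | phi) = exp(a_k) / sum_j exp(a_j)."""
    e = np.exp(a - a.max())     # shifting by max(a) avoids overflow
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p sums to 1; if one a_k dominates all others, its probability approaches 1
```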
Multiclass Logistic Regression
• Maximum likelihood: minimize the cross-entropy error E(w_1, ..., w_K) = −Σ_n Σ_k t_nk ln y_nk
• Use the IRLS algorithm to update the w_k