
Page 1:

Lecture 4 – Discriminant Analysis, k-Nearest Neighbors

Johan Wågberg, Division of Systems and Control, Department of Information Technology, Uppsala University.

Email: [email protected]

[email protected]

Page 2:

The mini project has now started!

Task: Classify songs according to whether Andreas likes them or not (based on features from the Spotify Web API).

Training data: 750 songs labeled as like or dislike.

Test data: 200 unlabeled songs. Upload your predictions to the leaderboard!

Examination: A written report (max 6 pages).

Instructions and data are available at the course homepage: http://www.it.uu.se/edu/course/homepage/sml/project/

You will receive your group password by e-mail tomorrow. Deadline for submitting your final solution: February 21.

Page 3:

Summary of Lecture 3 (I/VI)

The classification problem amounts to modeling the relationship between the input x and a categorical output y, i.e., the output belongs to one out of M distinct classes.

A classifier is a prediction model y(x) that maps any input x into a predicted class y ∈ {1, . . . , M}.

A common classifier predicts each input as belonging to the most likely class according to the conditional probabilities

p(y = m |x) for m ∈ {1, . . . ,M}.

Page 4:

Summary of Lecture 3 (II/VI)

For binary (two-class) classification, y ∈ {−1, 1}, the logistic regression model is

p(y = 1 | x) ≈ g(x) = e^(θᵀx) / (1 + e^(θᵀx)).

The model parameters θ are found by maximum likelihood, by (numerically) maximizing the log-likelihood function,

log ℓ(θ) = − ∑_{i=1}^{n} log(1 + e^(−y_i θᵀx_i)).
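As an illustration (not part of the original slides), here is a minimal Python sketch of this numerical maximization using scipy.optimize.minimize; the simulated data are an assumption made only for the example.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, X, y):
    # -log l(theta) = sum_i log(1 + exp(-y_i * theta^T x_i)), labels y_i in {-1, +1}
    return np.sum(np.logaddexp(0.0, -y * (X @ theta)))

# Simulated toy data (an assumption for the example): n = 200, p = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=200) > 0, 1, -1)

# Numerically maximize the log-likelihood by minimizing its negative
theta_hat = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), args=(X, y)).x
g = 1.0 / (1.0 + np.exp(-(X @ theta_hat)))   # model of p(y = 1 | x)
```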

Page 5:

Summary of Lecture 3 (III/VI)

Using the common classifier, we get the prediction model

y(x) = {  1   if g(x) > 0.5 ⇔ θᵀx > 0,
         −1   otherwise.

This attempts to minimize the total misclassification error.

More generally, we can use

y(x) = {  1   if g(x) > r,
         −1   otherwise,

where 0 ≤ r ≤ 1 is a user-chosen threshold.

Page 6:

Summary of Lecture 3 (IV/VI)

Confusion matrix:

                   Predicted y = 0   Predicted y = 1   Total
True y = 0         TN                FP                N
True y = 1         FN                TP                P
Total              N*                P*

For the classifier

y(x) = {  1   if g(x) > r,
         −1   otherwise,

the numbers in the confusion matrix will depend on the threshold r.
• Decreasing r ⇒ TN, FN decrease and FP, TP increase.
• Increasing r ⇒ TN, FN increase and FP, TP decrease.

Page 7:

Summary of Lecture 3 (V/VI)

Some commonly used performance measures are:

• True positive rate: TPR = TP/P = TP/(FN + TP) ∈ [0, 1]
• False positive rate: FPR = FP/N = FP/(FP + TN) ∈ [0, 1]
• Precision: Prec = TP/P* = TP/(FP + TP) ∈ [0, 1]
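These three measures follow directly from the four counts; a small sketch continuing the confusion-matrix example above (an illustration, not lecture code):

```python
def classification_metrics(TN, FP, FN, TP):
    """True positive rate, false positive rate and precision from a binary confusion matrix."""
    tpr = TP / (FN + TP)    # TPR = TP / P
    fpr = FP / (FP + TN)    # FPR = FP / N
    prec = TP / (FP + TP)   # Prec = TP / P*
    return tpr, fpr, prec
```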

Page 8:

Summary of Lecture 3 (VI/VI)

[Figure: ROC curve for logistic regression – true positive rate vs. false positive rate.]

• ROC: plot of TPR vs. FPR as r ranges from 0 to 1.¹
• Area Under Curve (AUC): condensed performance measure for the classifier, taking all possible thresholds into account.

¹ It is also possible to draw ROC curves based on other tuning parameters.
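A minimal sketch of how such a curve could be produced, here with simulated scores and scikit-learn's roc_curve and roc_auc_score (the toy data and the use of scikit-learn are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Simulated labels and scores (an assumption for the example)
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)                                  # true labels in {0, 1}
g = np.clip(rng.normal(loc=0.35 + 0.3 * y, scale=0.2), 0.0, 1.0)  # scores g(x), higher for y = 1

fpr, tpr, _ = roc_curve(y, g)   # TPR and FPR for all thresholds r
auc = roc_auc_score(y, g)       # area under the ROC curve

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title(f"ROC curve (AUC = {auc:.2f})")
plt.show()
```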

Page 9:

Contents – Lecture 4

1. Summary of lecture 3
2. Generative models
3. Linear Discriminant Analysis (LDA)
4. Quadratic Discriminant Analysis (QDA)
5. A nonparametric classifier – k-Nearest Neighbors (kNN)

Page 10:

Generative models

A discriminative model describes how the output y is generated given the input x, i.e., p(y | x).

A generative model describes how both the output y and the input x are generated, via the joint distribution p(y, x).

Predictions can be derived using the laws of probability:

p(y | x) = p(y, x) / p(x) = p(y) p(x | y) / ∫ p(y, x) dy.

Page 11:

Generative models for classification

We have built classifiers based on models g_m(x⋆) of the conditional probabilities p(y = m | x).

A generative model for classification can be expressed as

p(y = m | x) = p(y = m) p(x | y = m) / ∑_{j=1}^{M} p(y = j) p(x | y = j)

• p(y = m) denotes the marginal (prior) probability of class m ∈ {1, . . . , M}.
• p(x | y = m) denotes the conditional probability density of x for an observation from class m.

Page 12:

Multivariate Gaussian density

The p-dimensional Gaussian (normal) probability density function with mean vector µ and covariance matrix Σ is

N(x | µ, Σ) = 1 / ((2π)^(p/2) |Σ|^(1/2)) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)),

where µ is a p × 1 vector and Σ is a p × p positive definite matrix.

Let x = (x_1, . . . , x_p)ᵀ ∼ N(µ, Σ). Then
• µ_j is the mean of x_j,
• Σ_jj is the variance of x_j,
• Σ_ij (i ≠ j) is the covariance between x_i and x_j.
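For concreteness, a small sketch of evaluating this density in Python; the zero mean and the use of scipy.stats.multivariate_normal are assumptions for the example, and the two covariance matrices are those shown in the figure on the next slide.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Zero mean is an assumption; the covariances are the two used in the figure.
mu = np.zeros(2)
Sigma_a = np.array([[1.0, 0.0], [0.0, 1.0]])
Sigma_b = np.array([[1.0, 0.4], [0.4, 0.5]])

x = np.array([0.5, -0.2])   # a point at which to evaluate the density
pdf_a = multivariate_normal(mean=mu, cov=Sigma_a).pdf(x)
pdf_b = multivariate_normal(mean=mu, cov=Sigma_b).pdf(x)

def gaussian_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) directly from the formula above."""
    p = len(mu)
    d = x - mu
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm_const

assert np.isclose(pdf_a, gaussian_pdf(x, mu, Sigma_a))
```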

Page 13:

Multivariate Gaussian density

[Figure: value of the PDF over (x_1, x_2) for two bivariate Gaussians. Left: Σ = (1, 0; 0, 1). Right: Σ = (1, 0.4; 0.4, 0.5).]

Page 14:

Linear discriminant analysis

We need:

1. The prior class probabilities π_m ≜ p(y = m), m ∈ {1, . . . , M}.
2. The conditional probability densities of the input x, f_m(x) ≜ p(x | y = m), for each class m.

This gives us the model

g_m(x) = π_m f_m(x) / ∑_{j=1}^{M} π_j f_j(x)

Page 15:

Linear discriminant analysis

For the first task, a natural estimator is the proportion of training samples in the mth class:

π_m = (1/n) ∑_{i=1}^{n} I{y_i = m} = n_m / n,

where n is the size of the training set and n_m the number of training samples of class m.

Page 16:

Linear discriminant analysis

For the second task, a simple model is to assume that f_m(x) is a multivariate normal density with mean vector µ_m and covariance matrix Σ_m,

f_m(x) = 1 / ((2π)^(p/2) |Σ_m|^(1/2)) · exp(−½ (x − µ_m)ᵀ Σ_m⁻¹ (x − µ_m)).

If we further assume that all classes share the same covariance matrix,

Σ ≜ Σ_1 = · · · = Σ_M,

the remaining parameters of the model are

µ_1, µ_2, . . . , µ_M, Σ.

Page 17:

Linear discriminant analysis

These parameters are naturally estimated as the (within-class) sample means and sample covariance, respectively:

µ_m = (1/n_m) ∑_{i : y_i = m} x_i,   m = 1, . . . , M,

Σ = (1/(n − M)) ∑_{m=1}^{M} ∑_{i : y_i = m} (x_i − µ_m)(x_i − µ_m)ᵀ.

Modeling the class probabilities using the normal assumptions and these parameter estimates is referred to as Linear Discriminant Analysis (LDA).

Page 18:

Linear discriminant analysis, summary

The LDA classifier assigns a test input x to the class m for which

δ_m(x) = xᵀ Σ⁻¹ µ_m − ½ µ_mᵀ Σ⁻¹ µ_m + log π_m

is largest, where

π_m = n_m / n,   m = 1, . . . , M,

µ_m = (1/n_m) ∑_{i : y_i = m} x_i,   m = 1, . . . , M,

Σ = (1/(n − M)) ∑_{m=1}^{M} ∑_{i : y_i = m} (x_i − µ_m)(x_i − µ_m)ᵀ.

LDA is a linear classifier (its decision boundaries are linear).
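A from-scratch sketch of these estimates and of the discriminant δ_m(x), assuming training data in a NumPy array X with labels y (an illustration under those assumptions, not the official course code):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate the LDA parameters: class priors, class means and the shared covariance."""
    classes = np.unique(y)
    n, p = X.shape
    M = len(classes)
    pi = np.array([np.mean(y == m) for m in classes])           # n_m / n
    mu = np.array([X[y == m].mean(axis=0) for m in classes])    # within-class sample means
    Sigma = np.zeros((p, p))
    for m, mu_m in zip(classes, mu):
        D = X[y == m] - mu_m
        Sigma += D.T @ D
    Sigma /= n - M                                              # pooled sample covariance
    return classes, pi, mu, Sigma

def lda_predict(X, classes, pi, mu, Sigma):
    """Assign each row of X to the class with the largest discriminant delta_m(x)."""
    Sinv = np.linalg.inv(Sigma)
    # delta_m(x) = x^T Sinv mu_m - 0.5 * mu_m^T Sinv mu_m + log(pi_m), for all m at once
    deltas = X @ Sinv @ mu.T - 0.5 * np.sum((mu @ Sinv) * mu, axis=1) + np.log(pi)
    return classes[np.argmax(deltas, axis=1)]
```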

Page 19:

ex) LDA decision boundary

Illustration of the LDA decision boundary – the level curves of two Gaussian PDFs with the same covariance matrix intersect along a straight line, δ_1(x) = δ_2(x).


Page 20:

ex) Simple spam filter

We will use LDA to construct a simple spam filter:
• Output: y ∈ {spam, ham}
• Input: x = 57-dimensional vector of “features” extracted from the e-mail:
  – Frequencies of 48 predefined words (make, address, all, . . . )
  – Frequencies of 6 predefined characters (;, (, [, !, $, #)
  – Average length of uninterrupted sequences of capital letters
  – Length of longest uninterrupted sequence of capital letters
  – Total number of capital letters in the e-mail

• Dataset consists of 4,601 emails classified as either spam or ham (split into 75% training and 25% testing)

UCI Machine Learning Repository — Spambase Dataset (https://archive.ics.uci.edu/ml/datasets/Spambase)
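One possible way to reproduce such a filter with scikit-learn is sketched below; the filename spambase.data and the exact train/test split are assumptions for the example, so the resulting numbers will differ somewhat from the tables on the next slide.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# "spambase.data" (assumed filename) is the comma-separated UCI file:
# 57 feature columns followed by the label (1 = spam, 0 = ham).
data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Confusion matrix for a user-chosen threshold r on p(y = 1 | x)
r = 0.5
y_hat = (lda.predict_proba(X_test)[:, 1] > r).astype(int)
print(confusion_matrix(y_test, y_hat))
```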

Page 21:

ex) Simple spam filter

LDA confusion matrix (test data):

Threshold r = 0.5:
               Predicted y = 0   Predicted y = 1
True y = 0     667               23
True y = 1     102               358

Threshold r = 0.9:
               Predicted y = 0   Predicted y = 1
True y = 0     687               3
True y = 1     237               223

Page 22:

ex) Simple spam filter

A linear classifier uses a linear decision boundary for separating the two classes as much as possible.

In R^p this is a (p − 1)-dimensional hyperplane. How can we visualize the classifier?

One way: project the (labeled) test inputs onto the normal of the decision boundary.
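Continuing the spam-filter sketch above (so lda, X_test and y_test are assumed to be defined as in that example), a histogram like the one below could be produced as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

# lda, X_test and y_test are assumed to come from the earlier spam-filter sketch.
w = lda.coef_.ravel()                       # normal vector of the LDA decision boundary
proj = (X_test @ w) / np.linalg.norm(w)     # scalar projection of each test input onto w

plt.hist(proj[y_test == 0], bins=50, alpha=0.6, label="Y = 0")
plt.hist(proj[y_test == 1], bins=50, alpha=0.6, label="Y = 1")
plt.xlabel("Projection on normal")
plt.legend()
plt.show()
```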

[Figure: histograms of the projections of the test inputs onto the normal of the decision boundary, shown separately for Y = 0 and Y = 1.]

Page 23:

Gmail’s spam filter

Official Gmail blog July 9, 2015:

“. . . the spam filter now uses an artificial neural network to detect and block the especially sneaky spam—the kind that could actually pass for wanted mail.” — Sri Harsha Somanchi, Product Manager

Official Gmail blog February 6, 2019:

“With TensorFlow, we are now blocking around 100 million additional spam messages every day” — Neil Kumaran, Product Manager, Gmail Security

Page 24:

Quadratic discriminant analysis

Do we have to assume a common covariance matrix?

No! Estimating a separate covariance matrix for each class leads to an alternative method, Quadratic Discriminant Analysis (QDA).

Whether we should choose LDA or QDA has to do with the bias-variance trade-off, i.e., the risk of over- or underfitting.

Compared to LDA, QDA . . .
• has more parameters,
• is more flexible (lower bias),
• has a higher risk of overfitting (larger variance).
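A small sketch contrasting the two on simulated data, using scikit-learn's LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis (the toy data are an assumption for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# Two simulated classes with different covariance matrices (an assumption for the example)
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=200)
X1 = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.8], [0.8, 2.0]], size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)     # one shared covariance -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # one covariance per class -> quadratic boundary
print("LDA accuracy:", lda.score(X, y))
print("QDA accuracy:", qda.score(X, y))
```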

Page 25:

ex) QDA decision boundary

Illustration of the QDA decision boundary – the level curves of two Gaussian PDFs with different covariance matrices intersect along a curved (quadratic) line.

Thus, QDA is not a linear classifier.

Page 26:

Parametric and nonparametric models

So far we have looked at a few parametric models,
• linear regression,
• logistic regression,
• LDA and QDA,

all of which are parametrized by a parameter vector of fixed dimension.

Non-parametric models instead allow the flexibility of the model to grow with the amount of available data.

+ can be very flexible (= low bias)
− can suffer from overfitting (high variance)
− can be compute and memory intensive

As always, the bias-variance trade-off is key also when working with non-parametric models.

Page 27:

k-NN

The k-nearest neighbors (k-NN) classifier is a simple non-parametric method.

Given training data T = {(x_i, y_i)}_{i=1}^{n} and a test input x⋆:

1. Identify the set

   R⋆ = {i : x_i is one of the k training data points closest to x⋆}.

2. Classify x⋆ according to a majority vote within the neighborhood R⋆, i.e., assign x⋆ to the class m for which

   ∑_{i ∈ R⋆} I{y_i = m}

   is largest.
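A minimal sketch of this procedure for a single test input, assuming Euclidean distance as the notion of “closest” (the slide does not fix the metric):

```python
import numpy as np

def knn_predict(X_train, y_train, x_star, k):
    """Classify x_star by majority vote among its k nearest training points (Euclidean distance)."""
    dist = np.linalg.norm(X_train - x_star, axis=1)   # distances to all n training inputs
    R_star = np.argsort(dist)[:k]                     # indices of the k closest training points
    labels, counts = np.unique(y_train[R_star], return_counts=True)
    return labels[np.argmax(counts)]                  # class with the most votes in R_star
```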

Page 28:

ex) k-NN on a toy model

We illustrate the k-NN classifier on a synthetic example where the optimal classifier is known (colored regions).

[Figure: synthetic data in the (X1, X2) plane with three classes, Y = 1, Y = 2 and Y = 3; the colored regions show the optimal classifier.]

Page 29:

ex) k-NN on a toy model

The decision boundaries for the k-NN classifier are shown as black lines.

[Figure: three panels showing the k-NN decision boundaries (black lines) over the same data, for k = 1, k = 10 and k = 100.]

Page 30:

ex) k-NN on a toy model

[Figure: the same three panels as on the previous slide, with k-NN decision boundaries for k = 1, k = 10 and k = 100.]

The choice of the tuning parameter k controls the model flexibility:
• Small k ⇒ small bias, large variance.
• Large k ⇒ large bias, small variance.
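One way to see this trade-off in practice is to compare training and hold-out accuracy over a range of k values; the synthetic data below are an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Simulated two-dimensional data with a nonlinear class boundary (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = (X[:, 0] ** 2 + X[:, 1] + 0.5 * rng.normal(size=600) > 1.0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k = {k:3d}: train acc = {knn.score(X_train, y_train):.2f}, "
          f"hold-out acc = {knn.score(X_val, y_val):.2f}")
```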

Page 31:

A few concepts to summarize lecture 4

Generative model: A model that describes both the output y and the input x.

Linear discriminant analysis (LDA): A classifier based on the assumption that the distribution of the input x is multivariate Gaussian for each class, with different means but the same covariance. LDA is a linear classifier.

Quadratic discriminant analysis (QDA): Same as LDA, but where a different covariance matrix is used for each class. QDA is not a linear classifier.

Non-parametric model: A model where the flexibility is allowed to grow with the amount of available training data.

k-NN classifier: A simple non-parametric classifier based on classifying a test input x according to the class labels of the k training samples nearest to x.
