Pattern Recognition 2020: Introduction
Ad Feelders
Universiteit Utrecht
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 53
About the Course
Lecturers: Zerrin Yumak and Ad Feelders
Teaching Assistants: Ali Katsheh and Jiayuan Hu
Course info: http://www.cs.uu.nl/docs/vakken/mpr/
About the Course
Part I (Ad Feelders): Introduction to statistical machine learning.
General principles of data analysis: overfitting, the bias-variance trade-off, model selection, regularization, the curse of dimensionality.
Linear statistical models for regression and classification.
Clustering and unsupervised learning.
Support vector machines.
Required literature:
About the Course
Part II (Zerrin Yumak): Neural networks and deep learning.
Feed-forward neural networks.
Convolutional neural networks.
Recurrent neural networks.
Recommended reading:
About the Course
Practical assignment: analysis of handwritten digit data in R or Python (teams of 2 students).
Deep learning project: subject of own choice (teams of 5 students).
Online lectures in MS Teams (Wednesday and Friday).
Online support for the practical assignment and the deep learning project in MS Teams (Friday after the lecture, starting next week).
Grading:
Practical assignment (20%)
Deep learning project (40%)
Written exam (40%)
What is statistical pattern recognition?
The field of pattern recognition/machine learning is concerned with the automatic discovery of regularities in data through the use of computer algorithms, and with the use of these regularities to take actions such as classifying the data into different categories.
(Bishop, page 1)
[Figure: handwritten digits represented as 28 × 28 pixel images.]
Machine Learning Approach
Use training data

D = {(x1, t1), . . . , (xN, tN)}

of N labeled examples, and fit a model to the training data.
This model can subsequently be used to predict the class (digit) for new input vectors x.

The ability to correctly categorize new examples is called generalization.
Types of Learning Problems
Supervised Learning
Numeric target: regression.
Discrete unordered target: classification.
Discrete ordered target: ordinal classification; ranking.
. . .
Unsupervised Learning
Clustering.
Density estimation.
. . .
Example: Polynomial Curve Fitting
[Figure: data points sampled from the green curve t = sin(2πx) with Gaussian noise; x ∈ [0, 1], t ∈ [−1, 1].]

t = sin(2πx) + ε, with ε ∼ N(µ = 0, σ = 0.3).
Polynomial Curve Fitting
Fit a model:
y(x, w) = w0 + w1 x + w2 x^2 + . . . + wM x^M = ∑_{j=0}^{M} wj x^j    (1.1)

Linear function of the coefficients w = (w0, w1, . . . , wM).

The coefficients (or weights) w are estimated (or learned) from the data.

PS: equation numbers refer to the book of Bishop.
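As a minimal sketch of this fitting procedure (the sample size, random seed, and use of np.polyfit are our illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data in the spirit of the running example: t = sin(2*pi*x) + noise.
N = 10
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

# np.polyfit estimates the coefficients w by least squares.
M = 3
w = np.polyfit(x, t, deg=M)    # coefficients, highest power first

# The fitted polynomial y(x, w) can then be evaluated at new inputs.
y_new = np.polyval(w, 0.5)
```

A degree-M fit has M + 1 coefficients, so `w` has length 4 here.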
Error Function
We choose those values for w that minimize the sum of squared errors
E(w) = (1/2) ∑_{n=1}^{N} {y(xn, w) − tn}^2    (1.2)

Why square the difference between predicted and true value?
Error Function
[Figure: the error contribution at xn is the vertical distance between the observed target tn and the fitted value y(xn, w).]
Curves Fitted with Least Squares (in red)
[Figure: four panels showing the data points, the true curve (green), and the least-squares polynomial fits (red) for M = 0, 1, 3, and 9; x ∈ [0, 1], t ∈ [−1, 1].]
Magnitude of Coefficients
        M = 0    M = 1    M = 3          M = 9
w⋆0      0.19     0.82     0.31           0.35
w⋆1              −1.27     7.99         232.37
w⋆2                      −25.43       −5321.83
w⋆3                       17.37       48568.31
w⋆4                                 −231639.30
w⋆5                                  640042.26
w⋆6                                −1061800.52
w⋆7                                 1042400.18
w⋆8                                 −557682.99
w⋆9                                  125201.43
Training and Test Error
[Figure: E_RMS on the training set and on a test set, plotted against the polynomial order M (ticks at 0, 3, 6, 9); the training error keeps decreasing with M while the test error eventually rises.]

E_RMS = √( 2 E(w⋆) / N )    (1.3)
Overfitting and Sample Size
[Figure: two panels with the M = 9 least-squares fit (red), the true curve (green), and the data, for N = 15 and for N = 100 points.]

Red curve (M = 9) is much smoother for N = 100 than for N = 15. Also, it is closer to the true (green) curve.
Regularization
Adjusted error function
E(w) = (1/2) ∑_{n=1}^{N} {y(xn, w) − tn}^2 + (λ/2) ‖w‖^2    (1.4)

with ‖w‖^2 = wᵀw = w0^2 + w1^2 + . . . + wM^2.
Shrink coefficients towards zero.
Ridge regression
Neural networks: weight decay
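A minimal closed-form sketch of ridge regression (the data, seed, and λ values are our illustrative choices; minimizing (1.4) gives w = (XᵀX + λI)⁻¹ Xᵀ t for the design matrix X of powers of x):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 15, 9
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

# Design matrix with columns 1, x, x^2, ..., x^M.
X = np.vander(x, M + 1, increasing=True)

def ridge_fit(X, t, lam):
    """Closed-form minimizer of (1/2)||Xw - t||^2 + (lam/2)||w||^2."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)

w_unreg = ridge_fit(X, t, lam=1e-12)   # (near-)unregularized fit
w_reg = ridge_fit(X, t, lam=1.0)       # heavily regularized fit
# The regularized weight vector has a much smaller norm: shrinkage.
```

The norm of the ridge solution is non-increasing in λ, so the coefficients of the λ = 1 fit are shrunk towards zero, as in the table on the next slide.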
Magnitude of Coefficients (M = 9)
        ln λ = −∞    ln λ = −18    ln λ = 0
w⋆0          0.35          0.35        0.13
w⋆1        232.37          4.74       −0.05
w⋆2      −5321.83         −0.77       −0.06
w⋆3      48568.31        −31.97       −0.05
w⋆4    −231639.30         −3.89       −0.03
w⋆5     640042.26         55.28       −0.02
w⋆6   −1061800.52         41.32       −0.01
w⋆7    1042400.18        −45.95       −0.00
w⋆8    −557682.99        −91.53        0.00
w⋆9     125201.43         72.68        0.01
Fitted Curves for M = 9, λ ≈ 10^−8, λ = 1

[Figure: two panels with the regularized M = 9 fits (red) and the true curve (green); the left panel (λ ≈ 10^−8, i.e. ln λ = −18) fits the true curve well, the right panel (λ = 1) is oversmoothed.]
RMSE versus lnλ for M = 9
[Figure: E_RMS on training and test sets plotted against ln λ (roughly −35 to −20) for M = 9; the test error is smallest at an intermediate value of λ.]
Probability distribution and likelihood function
Binomial distribution with parameters N and π:
p(t) = C(N, t) π^t (1 − π)^(N−t)

where C(N, t) = N!/(t!(N − t)!) is the binomial coefficient.

Binomial distribution with N = 10 and π = 0.7:

p(t) = C(10, t) 0.7^t 0.3^(10−t)

Probability of observing t = 8:

C(10, 8) 0.7^8 0.3^2 ≈ 0.234

Likelihood function if we observe 7 heads in 10 trials:

L(π | t = 7) = C(10, 7) π^7 (1 − π)^3
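These quantities are easy to reproduce (a sketch using Python's math.comb; the helper name binom_pmf is ours, not from the slides):

```python
from math import comb

def binom_pmf(t, N, p):
    """Probability of t successes in N independent trials with success probability p."""
    return comb(N, t) * p**t * (1 - p)**(N - t)

# Probability of observing t = 8 heads in N = 10 flips when pi = 0.7.
prob = binom_pmf(8, 10, 0.7)          # about 0.234

# The likelihood of pi after observing t = 7 heads in 10 flips is the same
# formula read as a function of pi, with the data held fixed.
likelihood = {p: binom_pmf(7, 10, p) for p in (0.3, 0.5, 0.7)}
```

Of the three candidate values, π = 0.7 has the highest likelihood, as the tables on the next slides also show.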
Probability and Likelihood
t     π = 0.1   π = 0.3   π = 0.5   π = 0.7   π = 0.9
0       .349      .028      .001
1       .387      .121      .01
2       .194      .234      .044      .002
3       .057      .267      .117      .009
4       .011      .2        .205      .036
5       .002      .103      .246      .103      .002
6                 .036      .205      .2        .011
7                 .009      .117      .267      .057
8                 .002      .044      .234      .194
9                           .01       .121      .387
10                          .001      .028      .349
sum        1         1         1         1         1

Probability distribution for π = 0.7 and N = 10 (the column π = 0.7).
Probability and Likelihood
t     π = 0.1   π = 0.3   π = 0.5   π = 0.7   π = 0.9
0       .349      .028      .001
1       .387      .121      .01
2       .194      .234      .044      .002
3       .057      .267      .117      .009
4       .011      .2        .205      .036
5       .002      .103      .246      .103      .002
6                 .036      .205      .2        .011
7                 .009      .117      .267      .057
8                 .002      .044      .234      .194
9                           .01       .121      .387
10                          .001      .028      .349
sum        1         1         1         1         1

Likelihood function for observing t = 7 in 10 trials (the row t = 7).
Likelihood function
Let

t = (t1, . . . , tN)

be N independent observations, all from the same probability distribution

p(t | θ),

where θ is the parameter vector of p (e.g. θ = (µ, σ) for the normal distribution). Then

L(θ | t) ∝ ∏_{n=1}^{N} p(tn | θ)

is the likelihood function for t.

Maximum likelihood estimation: find that particular value θML which maximizes L, i.e. that θML such that the observed t are more likely to have come from p(t | θML) than from p(t | θ) for any other value of θ.
Maximum Likelihood Estimation
Take the derivatives of L with respect to the components of θ and equate them to zero (normal equations):

∂L/∂θj = 0

Solve for the θj (and check the second-order condition).

Maximizing the loglikelihood function is often easier:

L(θ | t) = ln{L(θ | t)} = ln{ ∏_{n=1}^{N} p(tn | θ) } = ∑_{n=1}^{N} ln p(tn | θ)

since ln ab = ln a + ln b.

This is allowed because the ln function is strictly increasing on (0, ∞).
Likelihood function
Likelihood function for 7 heads out of 10 coin flips:
[Figure: the likelihood L(π) ∝ π^7 (1 − π)^3 plotted for π from 0 to 1; it peaks at π = 0.7, with a maximum value of about 0.0022.]
Example: coin flipping
Random variable t with t = 1 if heads comes up, and t = 0 if tails comes up. π = P(t = 1). Probability distribution for one coin flip:

p(t) = π^t (1 − π)^(1−t)

Sequence of N coin flips:

p(t) = p(t1, t2, . . . , tN) = ∏_{n=1}^{N} π^tn (1 − π)^(1−tn)

which defines the likelihood when viewed as a function of π. The loglikelihood function consequently becomes

L(π | t) = ∑_{n=1}^{N} { tn ln(π) + (1 − tn) ln(1 − π) }

since ln a^b = b ln a.
Example: coin flipping (continued)
In a sequence of 10 coin flips with seven times heads coming up, we obtain

L(π) = ln( π^7 (1 − π)^3 ) = 7 ln π + 3 ln(1 − π)

To determine the maximum we take the derivative with respect to π, equate it to zero, and solve for π:

dL/dπ = 7/π − 3/(1 − π) = 0

which yields the maximum likelihood estimate πML = 0.7.

Note: d ln x / dx = 1/x.
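The analytical result can also be checked numerically (a sketch; the grid resolution is an arbitrary choice of ours):

```python
import numpy as np

# Loglikelihood for 7 heads in 10 flips: L(pi) = 7 ln(pi) + 3 ln(1 - pi).
def loglik(pi):
    return 7 * np.log(pi) + 3 * np.log(1 - pi)

# Evaluate on a fine grid over (0, 1) and locate the maximum numerically.
grid = np.linspace(0.001, 0.999, 9981)
pi_ml = grid[np.argmax(loglik(grid))]
# pi_ml is (up to grid resolution) 0.7, matching the analytical solution.
```

The endpoints 0 and 1 are excluded because the loglikelihood diverges to −∞ there.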
Model Selection
Cross-Validation
[Figure: 4-fold cross-validation; in each of four runs a different fold is held out for validation while the remaining folds are used for training.]

Score = Quality of Fit − Complexity Penalty

For example: AIC = ln p(D | wML) − M

where ln p(D | wML) is the maximized loglikelihood and M is the number of parameters of the fitted model.
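As a hedged sketch of such a score (the data, seed, Gaussian-noise likelihood, and the choice to count only the polynomial coefficients as parameters are our illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

def aic(deg):
    """AIC = maximized loglikelihood minus number of parameters,
    for a least-squares polynomial fit with Gaussian noise."""
    w = np.polyfit(x, t, deg)
    resid = t - np.polyval(w, x)
    sigma2 = np.mean(resid ** 2)          # ML estimate of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - (deg + 1)             # deg + 1 polynomial coefficients

scores = {deg: aic(deg) for deg in range(10)}
```

The degree with the highest score balances quality of fit against complexity; one could also count the noise variance as an extra parameter.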
The Curse of Dimensionality
[Figure: scatterplot of labeled training points in a two-dimensional input space, with a new point marked ×.]

Predict class of ×.
The Curse of Dimensionality
[Figure: the same scatterplot, now with the input space partitioned into regular cells; the class of × can be predicted from the training points in its cell.]

Predict class of ×.
The Curse of Dimensionality
[Figure: the unit interval (D = 1), the unit square (D = 2), and the unit cube (D = 3), with each axis x1, x2, x3 divided into equal-sized bins.]

Number of rectangles grows exponentially with D. If D is large, most rectangles will be empty (contain no data).
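A quick illustration of this exponential growth (the number of bins per axis and the number of data points are illustrative choices of ours):

```python
# Divide each axis of the unit hypercube into B equal bins; there are B**D cells.
B = 4       # bins per dimension (an illustrative choice)
N = 1000    # a hypothetical number of data points

cells = {D: B ** D for D in (1, 2, 3, 5, 10)}
points_per_cell = {D: N / c for D, c in cells.items()}
# Already at D = 10 there are B**10 = 1,048,576 cells, so with 1000 points
# almost every cell must be empty.
```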
Decision Theory
Suppose we have to make a decision in a situation involving uncertainty. Two steps:

1. Inference: Learn p(x, t) from data. This problem is the main subject of this course.
2. Decision: Given this estimate of p(x, t), determine the optimal decision. Relatively straightforward.
Decision Theory: Example
Predict whether a patient has cancer from an X-ray image.
Let t = 1 denote that cancer is present. Then knowledge of

p(t = 1 | x) = p(x | t = 1) p(t = 1) / p(x)

would allow us to make optimal predictions of t from x (given an appropriate loss function).
Loss Functions for Classification
Suppose we know p(x, t).
Task: given a value for x, predict the class label t.
Lkj : loss of predicting class j when the true class is k.
K : number of classes.

To minimize expected loss, predict the class j that minimizes

∑_{k=1}^{K} Lkj p(t = k | x),    (1.81)

where j ∈ {1, . . . , K}.
Minimizing the Misclassification Rate
To minimize the probability of misclassification, we take (0/1 loss)
Lkj = 0 if j = k, and Lkj = 1 otherwise.

The minimum of

∑_{k=1}^{K} Lkj p(t = k | x) = ∑_{k ≠ j} p(t = k | x) = 1 − p(t = j | x)

is now achieved if we assign x to the class j for which p(t = j | x) is maximum.
Minimizing Expected Loss
Example loss matrix for prediction of cancer:
(rows: true class k; columns: predicted class j)

         j = 0   j = 1
k = 0      0       1
k = 1     10       0

Suppose p(t = 0) = 0.8 and p(t = 1) = 0.2.

The expected loss of predicting “no cancer present” is:

L00 × p(t = 0) + L10 × p(t = 1) = 0 × 0.8 + 10 × 0.2 = 2

The expected loss of predicting “cancer present” is:

L01 × p(t = 0) + L11 × p(t = 1) = 1 × 0.8 + 0 × 0.2 = 0.8

Even though the probability of cancer is “only” 0.2, loss is minimized if we predict (act as if) cancer is present.
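The calculation can be reproduced directly (a minimal sketch using the loss matrix and probabilities from this example; the helper name expected_loss is ours):

```python
# Loss matrix from the example: L[k][j] = loss of predicting j when the truth is k.
# Classes: 0 = "no cancer", 1 = "cancer".
L = [[0, 1],
     [10, 0]]
p = [0.8, 0.2]   # p(t = 0 | x) and p(t = 1 | x)

def expected_loss(j):
    """Expected loss of predicting class j, as in (1.81)."""
    return sum(L[k][j] * p[k] for k in range(len(p)))

losses = [expected_loss(j) for j in (0, 1)]   # [2.0, 0.8]
decision = min((0, 1), key=expected_loss)     # class 1: predict "cancer"
```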
Properties of Expectation and Variance
Some useful properties:
1. E[c] = c for constant c.
2. E[cx] = c E[x].
3. E[x ± y] = E[x] ± E[y].
4. var[c] = 0 for constant c.
5. var[cx] = c^2 var[x].
6. var[x ± y] = var[x] + var[y] if x and y are independent.
Loss function for regression
Let t0 be a random draw from p(t | x0), and suppose we predict the value of t0 by some number y0 = y(x0). The expected squared prediction error is:

E[(y0 − t0)^2] = E[y0^2 − 2 y0 t0 + t0^2] = y0^2 − 2 y0 E[t0] + E[t0^2],

where the expectation is taken with respect to p(t | x0).

To minimize this expression we solve

d(y0^2 − 2 y0 E[t0] + E[t0^2]) / dy0 = 2 y0 − 2 E[t0] = 0

which gives y0 = E[t0]. Conclusion: predict the expected value (mean)!
Minimizing expected squared prediction error
Since this reasoning applies to any value of x we might pick, we have that
y(x) = E_t[t | x]    (1.89)

minimizes the expected squared prediction error.

The function E_t[t | x] is called the (population) regression function.
Population Regression Function
[Figure: the regression function y(x); at x = x0 the prediction y(x0) is the mean of the conditional distribution p(t | x0).]
Question
We have derived the result that
y(x) = E_t[t | x]    (1.89)

minimizes the expected squared prediction error.

How could we use this result to construct a prediction rule y(x) from a finite data sample

D = {(x1, t1), . . . , (xN, tN)}?
Simple approach to regression?
[Figure: scatterplot of (x, t) training data; the prediction y(x0) at input x0 is marked.]

Predict the mean of the target values of all training observations with x = x0.
Simple approach to regression?
[Figure: scatterplot of (x, t) training data; the prediction y(x0) at input x0 is marked.]

Predict the mean of the target values of training observations with x-value closest to x0.
Nearest-neighbor functions
Consider a regression problem with input variable x and output variable t:

for each input value x, we define a neighborhood Nk(x) containing the indices n of the k points (xn, tn) from the training data that are closest to x;

from the neighborhood function Nk(x), we construct the function

yk(x) = (1/k) ∑_{n ∈ Nk(x)} tn

The function yk(x) is called the k-nearest neighbor function.
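The construction above can be sketched in a few lines of Python (a hedged illustration for one-dimensional inputs; the function name knn_predict is ours, not from the slides):

```python
def knn_predict(x, train, k):
    """k-nearest-neighbor regression for scalar inputs: average the
    targets t_n of the k training pairs (x_n, t_n) closest to x."""
    neighbours = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(t for _, t in neighbours) / k
```

For k equal to the full training-set size this returns the overall mean of the targets; for k = 1 it returns the target of the single nearest point.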
An example learning problem
In a clinical study of risk factors for cardiovascular disease,
the independent variable x is a patient’s waist circumference;
the dependent variable t is a patient’s deep abdominal adipose tissue.
The researchers want to predict the amount of deep abdominal adipose tissue from a simple measurement of waist circumference.
Scatterplot of the data
For learning the relationship between x and t, measurements (xn, tn) on 109 men between 18 and 42 years of age are available:

[Figure: scatterplot of deep abdominal AT (Y, roughly 0-250) against waist circumference (X, roughly 60-120).]
An example
We consider eight (consecutive) points (xn, tn) from the clinical study of risk factors for cardiovascular disease:

1. (68.85, 55.78)    5. (73.10, 38.21)
2. (71.85, 21.68)    6. (73.20, 32.22)
3. (71.90, 28.32)    7. (73.80, 43.35)
4. (72.60, 25.89)    8. (74.15, 33.41)

[Figure: scatterplot of the eight points, deep abdominal AT (Y) against waist circumference (X).]
The example (continued)
With k = 2, the neighborhood of x = 73.00 equals

N2(x = 73.00) = {5, 6}

and we find

y2(x = 73.00) = (38.21 + 32.22) / 2 = 35.215

With k = 5, we find y5(x = 73.00) = 33.598.
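Both numbers can be checked in Python (a sketch; knn_predict is our illustrative helper, and the data are the eight points listed above):

```python
points = [(68.85, 55.78), (71.85, 21.68), (71.90, 28.32), (72.60, 25.89),
          (73.10, 38.21), (73.20, 32.22), (73.80, 43.35), (74.15, 33.41)]

def knn_predict(x, train, k):
    """Average the targets of the k training points closest to x."""
    neighbours = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(t for _, t in neighbours) / k

y2 = knn_predict(73.00, points, k=2)   # (38.21 + 32.22) / 2 = 35.215
y5 = knn_predict(73.00, points, k=5)   # 33.598
```

For k = 5 the neighborhood additionally contains points 4, 7, and 3, whose targets are averaged in.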
The example continued
With k = 2 and Euclidean distance, the following k-nearest neighbor function is constructed from the training data:

[Figure: the kNN regression function with k = 2, Adipose Tissue (roughly 50-250) against Waist Circumference (roughly 70-120); a jagged function that closely follows individual training points.]
The example continued
With k = 20 and Euclidean distance, the following k-nearest neighbor function is constructed from the training data:

[Figure: the kNN regression function with k = 20, Adipose Tissue against Waist Circumference; a much smoother function.]
kNN: going to the extremes
[Figure: two panels, kNN with k = 1 and kNN with k = 109, Adipose Tissue against Waist Circumference; k = 1 reproduces every training point exactly, while k = 109 (all observations) predicts the overall mean everywhere.]
The idea of k-nearest neighbor
We recall that, for a regression problem, the best prediction (under squared error loss) for the output variable t at the input value x is the mean E[t | x]:

the nearest-neighbor function approximates the mean by averaging over the training data;

the nearest-neighbor function relaxes conditioning on a specific input value to conditioning on a neighborhood of that value.

The nearest-neighbor function thus implements the idea of selecting the means for prediction directly.