Transcript of: P.S. Sastry, 'Pattern Recognition' – Lecture 1, Machine Recognition of Patterns (NPTEL)
nptel.ac.in/courses/117108048/module1/Lecture1.pdf

Page 1:

Pattern Recognition

P.S.Sastry

[email protected]

Dept. Electrical Engineering

Indian Institute of Science, Bangalore

Page 2: Reference Books

• R.O. Duda, P.E. Hart and D.G. Stork, 'Pattern Classification', John Wiley, 2002

• C.M. Bishop, 'Pattern Recognition and Machine Learning', Springer, 2006

• C.M. Bishop, 'Neural Networks for Pattern Recognition', Oxford University Press, Indian Edition, 2003

• R.O. Duda and P.E. Hart, 'Pattern Classification and Scene Analysis', Wiley, 1973

Page 3: Pattern Recognition

A basic attribute of people – categorisation of sensory input

Pattern → PR System → Class label

Examples of Pattern Recognition tasks:
• 'Reading' facial expressions
• Recognising speech
• Reading a document
• Identifying a person by fingerprints
• Diagnosis from medical images
• Wine tasting

Page 4: Machine Recognition of Patterns

pattern → feature extractor → X → classifier → class label

• Feature extractor makes some measurements on the input pattern.

• X is called the feature vector. Often, X ∈ ℜⁿ.

• Classifier maps each feature vector to a class label.

• Features to be used are problem-specific.
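This pipeline is easy to mirror in code. The Python fragment below is a minimal sketch, not something from the slides: the projection-based extractor and the linear threshold rule are illustrative assumptions standing in for whatever a real design would use.

import numpy as np

def extract_features(pattern):
    # Feature extractor: make some measurements on the input pattern.
    # Assumption: the pattern is a 2-D binary image; the measurements
    # are row/column projection sums.
    row_sums = pattern.sum(axis=1)
    col_sums = pattern.sum(axis=0)
    return np.array([row_sums.mean(), col_sums.mean()])  # X, here in R^2

def classify(X, w, c):
    # Classifier: maps each feature vector to a class label (0 or 1).
    return int(X @ w > c)

pattern = np.random.rand(8, 8) > 0.5   # stand-in for an input pattern
X = extract_features(pattern)          # the feature vector
label = classify(X, w=np.array([1.0, -1.0]), c=0.0)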

Page 5: Some Examples of PR Tasks

• Character Recognition
  • Pattern – image
  • Class – identity of the character
  • Features – binary image, projections (e.g., row and column sums), moments etc.

Page 6: Some Examples of PR Tasks

• Speech Recognition
  • Pattern – 1-D signal (or its sampled version)
  • Class – identity of speech units
  • Features – LPC model of chunks of speech, spectral info, cepstrum etc.
  • The pattern can become a sequence of feature vectors.

Page 7: Examples contd...

• Fingerprint-based identity verification
  • Pattern – image plus an identity claim
  • Class – Yes / No
  • Features – positions of minutiae, orientation field of the ridge lines etc.

Page 8: Examples contd...

• Video-based Surveillance
  • Pattern – video sequence
  • Class – e.g., level of alertness
  • Features – motion trajectories, parameters of a prefixed model etc.

Page 9: Examples contd...

• Credit Screening
  • Pattern – details of an applicant (e.g., for a credit card)
  • Class – Yes / No
  • Features – income, job history, level of credit, credit history etc.

Page 10: Examples contd...

• Imposter detection (e.g., of a credit card)
  • Pattern – a sequence of transactions
  • Class – Yes / No
  • Features – amounts of money, locations of transactions, times between transactions etc.

Page 11: Examples contd...

• Document Classification
  • Pattern – a document and a query
  • Class – relevant or not (in general, a rank)
  • Features – word occurrence counts, word context etc.
  • Spam filtering, diagnostics of machinery etc.

Pages 12–15: Design of Pattern Recognition Systems


• Features depend on the problem. Measure 'relevant' quantities.

• Some techniques are available to extract 'more relevant' quantities from the initial measurements (e.g., PCA).

• After feature extraction each pattern is a vector.

• A classifier is a function to map such vectors into class labels.

• Many general techniques of classifier design are available.

• Need to test and validate the final system.

Pages 16–18: Some notation


• Feature space, X – the set of all possible feature vectors.

• Classifier: a decision rule or a function, h : X → {1, · · · , M}.

• Often, X = ℜⁿ. It is convenient to take M = 2; then we take the labels as {0, 1} or {−1, 1}.

• Then any binary-valued function on X is a classifier.

• What h to choose? We want a correct or optimal classifier.
• How to design classifiers?
• How to judge performance?
• How to provide performance guarantees?

Pages 19–24:

• We first consider the 2-class problem.

• We can handle the M > 2 case if we know how to handle the 2-class problem.

• Simplest alternative: design M 2-class classifiers – 'One vs Rest' (see the sketch after this list).

• There are other possibilities, e.g., tree-structured classifiers.

• The 2-class problem is the basic problem.

• We will also look at M-class classifiers.
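A rough sketch of the 'One vs Rest' construction. The toy 2-class learner below – a nearest-class-mean score – is an assumption purely for illustration; any real 2-class method could be plugged in.

import numpy as np

def train_binary(X, is_class_i):
    # Toy 2-class learner: score a point by closeness to the mean of
    # the 'one' class. A stand-in for any real 2-class classifier.
    mu = X[is_class_i].mean(axis=0)
    return lambda x: -np.linalg.norm(x - mu)

def train_one_vs_rest(X, y, M):
    # Train M 2-class scorers (class i vs the rest); predict the class
    # whose scorer responds most strongly.
    scorers = [train_binary(X, y == i) for i in range(M)]
    return lambda x: int(np.argmax([s(x) for s in scorers]))

# Example: 3 classes as shifted Gaussian blobs.
X = np.vstack([np.random.randn(20, 2) + 4 * k for k in range(3)])
y = np.repeat(np.arange(3), 20)
h = train_one_vs_rest(X, y, M=3)
print(h(np.array([8.0, 8.0])))   # most likely class 2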

Pages 25–28: A simple PR problem

• Problem: 'Spot the Right Candidate'

• Features:
  • x1: marks based on academic record
  • x2: marks in the interview

• A classifier: ax1 + bx2 > c ⇒ 'Good'
  Another classifier: x1x2 > c ⇒ 'Good' (or (x1 + a)(x2 + b) > c)
  (both forms are sketched in code below)

• Design of classifier: we have to choose a specific form for the classifier. What values to use for parameters such as a, b, c?
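Both candidate forms are one-line rules once the parameters are fixed. In this sketch the values of a, b, c are arbitrary placeholders – choosing them well is exactly the design problem the slides go on to discuss.

def linear_rule(x1, x2, a=1.0, b=1.0, c=100.0):
    # 'Good' if a weighted sum of the two marks crosses a threshold.
    return 'Good' if a * x1 + b * x2 > c else 'Not good'

def product_rule(x1, x2, c=2500.0):
    # 'Good' if the product of the marks crosses a threshold; a very low
    # mark on either feature is now hard to compensate for.
    return 'Good' if x1 * x2 > c else 'Not good'

print(linear_rule(90, 20), product_rule(90, 20))   # 'Good' vs 'Not good'

The same candidate (90, 20) passes the linear rule but fails the product rule, which is the practical difference between the two forms.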

Pages 29–31: Designing Classifiers

• Need to decide how feature vector values determine the class. (How do different marks reflect the goodness of a candidate?)

• In most applications, it is not possible to design the classifier from the 'physics of the problem'.

• The difficulties are:
  • Lots of variability in patterns of a single class
  • Variability in feature vector values
  • Feature vectors of patterns from different classes can be arbitrarily close
  • Noise in measurements

Pages 32–33: Designing Classifiers contd...

• Often the only information available for the design is a training set of example patterns.

• Training set: {(Xi, yi), i = 1, . . . , ℓ}. Here Xi is an example feature vector of class yi.

• Generation of the training set – take representative patterns of known category (data collection) and obtain the feature vectors (choice of feature measurements).

• Now learn an appropriate function h as the classifier (model choice).

• Test and validate the classifier on more data.

Page 34: A simple PR problem

• Problem: 'Spot the Right Candidate'

• Features:
  • x1: marks based on academic record
  • x2: marks in the interview

• A classifier: ax1 + bx2 > c ⇒ 'Good'
  We have chosen a specific form for the classifier.

• Design of classifier: what values to use for a, b, c?

• Information available: 'experience' – the history of past candidates

Page 35: Training Set

(This slide shows only a figure of the training set; the figure is not reproduced in the transcript.)

Page 36: Another example problem

• Problem: recognize persons of 'medium build'

• Features: height and weight

(Figure not reproduced.) The classifier is 'nonlinear' here.

Page 37: Learning from Examples

• Designing a classifier is a typical problem of learning from examples (also called learning with a teacher).

• The nature of the feedback from the teacher determines the difficulty of the learning problem.

Page 38: In the context of Pattern Recognition

feature vector → Classifier → class
     Classifier ← 'feedback' ← Teacher

• Supervised learning – teacher gives the true class label for each feature vector

• Reinforcement learning – noisy assessment of performance, e.g., correct/incorrect

• Unsupervised learning – no teacher input (clustering problems)

Page 39:

• When the class labels of training patterns as given by the 'teacher' are noisy, we consider it supervised learning with label noise or classification noise.

• Many classifier design algorithms do supervised learning.

Pages 40–41: Function Learning

• A closely related problem. The output is continuous-valued rather than discrete as in classifiers.

• Here the training set examples could be {(Xi, yi), i = 1, . . . , ℓ}, Xi ∈ X, yi ∈ ℜ.

• The prediction variable, y, is continuous rather than taking finitely many values. (There can be noise in the examples.)

• Similar learning techniques are needed to infer the underlying functional relationship between X and y (the regression function of y on X).

Pages 42–44: Examples of Function Learning

• Time series prediction: given a series x1, x2, · · ·, find a function to predict xn.

• Based on past values: find a 'best' function x̂n = h(xn−1, xn−2, · · · , xn−p) (a least-squares sketch follows this list)
  • Predict stock prices, exchange rates etc.
  • The linear prediction model used in speech analysis

• More general predictors can use other variables also.
  • Predict rainfall based on measurements and (possibly) previous years' data.
  • In general, system identification. (An application: smart sensors)
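As a sketch, a linear predictor x̂n = h(xn−1, · · · , xn−p) can be fit by least squares on past values of the series. The order p = 5 and the toy sine series here are assumptions for illustration only.

import numpy as np

def fit_linear_predictor(x, p):
    # Fit coefficients a so that x[n] ~ a[0]*x[n-1] + ... + a[p-1]*x[n-p].
    rows = [x[n - p:n][::-1] for n in range(p, len(x))]  # past p values, newest first
    A, b = np.array(rows), x[p:]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

def predict_next(x, a):
    # Apply the learned h to the last p observed values.
    p = len(a)
    return x[-1:-p - 1:-1] @ a

t = np.arange(200)
x = np.sin(0.1 * t) + 0.05 * np.random.randn(200)  # toy series
a = fit_linear_predictor(x, p=5)
print(predict_next(x, a))                          # prediction of x[200]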

Pages 45–46: Examples contd... : Equaliser

Tx → x(k) → channel → Z(k) → filter → y(k) → Rx

We want y(k) = x(k). Design (or adapt) the filter to achieve this. We can choose a filter as

y(k) = Σ_{i=1}^{T} a_i Z(k − i)

Find the 'best' a_i – a function learning problem.
Training set: {(x(k), Z(k)), k = 1, 2, · · · , N}
How do we know x(k) at the receiver end? Prior agreements (protocols).
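One way to 'adapt the filter' from such a training set is the LMS rule (LMS appears later in the course outline). This is only a sketch: the step size mu, the filter order T and the toy FIR channel are all assumptions.

import numpy as np

def lms_equaliser(x, Z, T, mu=0.01):
    # Adapt taps a_1..a_T so that y(k) = sum_i a_i Z(k-i) tracks the
    # known transmitted symbols x(k) from the training protocol.
    a = np.zeros(T)
    for k in range(T, len(Z)):
        z_past = Z[k - 1::-1][:T]     # Z(k-1), Z(k-2), ..., Z(k-T)
        err = x[k] - a @ z_past       # desired x(k) minus filter output y(k)
        a += mu * err * z_past        # LMS update: nudge taps to cut the error
    return a

# Toy training run: known symbols through a simple FIR channel plus noise.
x = np.sign(np.random.randn(500))
Z = 0.9 * x + 0.3 * np.roll(x, 1) + 0.01 * np.random.randn(500)
a = lms_equaliser(x, Z, T=4)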

Pages 47–49: Learning from examples

• Learning from examples is basic to both classification and regression.

• Training set: {(X1, y1), · · · , (Xℓ, yℓ)}.

• We essentially want to fit a 'best' function: y = f(X).

• Suppose X ∈ ℜ. Then this is a familiar 'curve-fitting' problem.

• Model choice (the choice of form for f) is very important here.

• If we choose f to be a polynomial, the higher the degree, the 'better' the performance on the training data.

• But that is not the real performance – a polynomial of degree ℓ would give zero error!! (A numerical sketch follows.)
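This is easy to check numerically. A sketch, where the noisy-sine data and the degrees tried are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 10)                      # l = 10 training points
y = np.sin(2 * np.pi * X) + 0.2 * rng.standard_normal(10)

for degree in (1, 3, 9):
    coeffs = np.polyfit(X, y, degree)          # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, X) - y) ** 2)
    print(degree, train_err)  # training error shrinks towards zero with degree

The degree-9 polynomial interpolates all 10 points, so its training error is numerically zero – yet it is typically the worst predictor on new inputs.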

Pages 50–52: Learning from examples – Generalization

• To obtain a classifier (or a regression function) we use the training set.

• We 'know' the class labels of the patterns (or the values of the prediction variable) in the training set.

• Errors on the training set do not necessarily tell how good the classifier is.

• Any classifier that amounts to only storing the training set is useless.

• We are interested in the 'generalization abilities' – how does our classifier perform on unseen or new patterns?
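A standard sketch of how generalization is probed in practice: hold out part of the labelled data, train on the rest, and report the error on the held-out part. The 70/30 split and the trivial nearest-class-mean learner are assumptions for illustration.

import numpy as np

def holdout_error(X, y, train, frac=0.7, seed=0):
    # 'train' fits a classifier h on the training part; the error rate on
    # the unseen remainder estimates performance on new patterns.
    idx = np.random.default_rng(seed).permutation(len(y))
    cut = int(frac * len(y))
    tr, te = idx[:cut], idx[cut:]
    h = train(X[tr], y[tr])
    preds = np.array([h(x) for x in X[te]])
    return np.mean(preds != y[te])

# Example with a toy 'nearest class mean' learner:
def train(Xtr, ytr):
    mus = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
    return lambda x: min(mus, key=lambda c: np.linalg.norm(x - mus[c]))

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 3])
y = np.repeat([0, 1], 30)
print(holdout_error(X, y, train))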

Pages 53–56: Design of Classifiers

• The classifier should perform well in spite of the inherent variability of patterns and noise in feature extraction and/or in the class labels as given in the training set.

• Statistical Pattern Recognition – an approach where the variabilities are captured through probabilistic models.

• There are other approaches, e.g., syntactic pattern recognition, fuzzy-set based methods etc.

• In this course we consider classification and regression (function learning) problems in the statistical framework.

Pages 57–59: Statistical Pattern Recognition

• X is the feature space (we take X = ℜⁿ). We consider a 2-class problem for simplicity of notation.

• A given feature vector can come from different classes with different probabilities.

• Let fi be the probability density function of the feature vectors from class-i, i = 0, 1.

• The fi are called class conditional densities.

• Let X = (X1, · · · , Xn) ∈ ℜⁿ represent the feature vector.

• Then fi is the (conditional) joint density of the random variables X1, · · · , Xn given that X is from class-i.

Page 60:

• Class conditional densities model the variability in the feature values.

• For example, the two classes can be uniformly distributed in two regions as shown. (The two classes are separable here.)

[Figure: two disjoint regions, labelled Class 1 and Class 2]

Page 61:

• When class regions are separable, an important special case is linear separability.

[Figure: two panels, each showing regions labelled Class 1 and Class 2]

• The classes in the left panel above are linearly separable (can be separated by a line) while those in the right panel are not linearly separable (though separable).

Page 62:

• In general, the two class conditional densities can overlap. (The same value of the feature vector can come from different classes with different probabilities.)

[Figure: overlapping regions labelled Class 1 and Class 2]

Page 63:

• The statistical viewpoint gives us one way of looking for an 'optimal' classifier.

• We can say we want a classifier that has the least probability of misclassifying a random pattern (drawn from the underlying distributions).

Pages 64–67:

• Let qi(X) = Prob[class = i | X], i = 0, 1.
  qi is called the posterior probability (function) for class-i.

• Consider the classifier

  h(X) = 0 if q0(X) > q1(X)
       = 1 otherwise

• q0(X) > q1(X) would imply that the feature vector X is more likely to have come from class-0 than from class-1.

• Hence, intuitively, such a classifier should minimize the probability of error in classification.

Pages 68–71: Statistical Pattern Recognition

• X (= ℜⁿ) is the feature space. Y = {0, 1} is the set of class labels.

• H = {h | h : X → Y} is the set of classifiers of interest.

• For a feature vector X, let y(X) denote the class label of X. In general, y(X) would be random.

• Now we want to assign a figure of merit to each possible classifier in H.

Pages 72–76:

• For example, we can rate different classifiers by

  F(h) = Prob[h(X) ≠ y(X)]

• F(h) is the probability that h misclassifies a random X.

• The optimal classifier would be one with the lowest value of F.

• Given an h, we can calculate F(h) only if we know the probability distributions of the classes. (When we do, F(h) can also be estimated by simulation – see the sketch below.)

• Minimizing F is not a straightforward optimization problem.

• Here we are treating all errors as the same. We will generalize the framework later.
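Even when F(h) has no closed form, it can be estimated by simulation once we posit the class distributions. A sketch with assumed unit-variance Gaussian classes (the densities, priors and candidate h are all assumptions):

import numpy as np

def estimate_F(h, sample_class, p0=0.5, n=50_000, seed=0):
    # Monte Carlo estimate of F(h) = Prob[h(X) != y(X)]: draw a label
    # from the priors, draw X from that class, and check h's answer.
    rng = np.random.default_rng(seed)
    ys = (rng.random(n) >= p0).astype(int)            # y = 0 w.p. p0, else 1
    Xs = np.array([sample_class(y, rng) for y in ys])
    return np.mean(np.array([h(x) for x in Xs]) != ys)

# Assumed class conditional densities: N(0,1) for class 0, N(2,1) for class 1.
sample_class = lambda y, rng: rng.normal(2.0 * y, 1.0)
h = lambda x: int(x > 1.0)                            # a candidate classifier
print(estimate_F(h, sample_class))                    # about 0.16 for this setup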

Pages 77–80: Statistical PR contd.

• Recall that fi(X) denotes the probability density function of feature vectors of class-i (the class conditional densities).

• Let pi = Prob[y(X) = i]. These are called prior probabilities.

• Recall the posterior probabilities, qi(X) = Prob[y(X) = i | X]. Now, by Bayes theorem,

  qi(X) = fi(X) pi / Z

  where Z = f0(X) p0 + f1(X) p1 is the normalising constant.
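A sketch of this computation, again for assumed unit-variance Gaussian class conditional densities (f0 = N(0,1) and f1 = N(2,1) are assumptions, as are the equal priors):

import math

def posterior_q0(x, p0=0.5, p1=0.5):
    # qi(X) = fi(X) pi / Z, with Z = f0(X) p0 + f1(X) p1.
    f0 = math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)
    f1 = math.exp(-(x - 2) ** 2 / 2) / math.sqrt(2 * math.pi)
    return f0 * p0 / (f0 * p0 + f1 * p1)

print(posterior_q0(0.0), posterior_q0(2.0))  # high near the class-0 mean, low near the class-1 mean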

Pages 81–84: Bayes Classifier

• Consider the classifier given by

  h(X) = 0 if q0(X)/q1(X) > 1
       = 1 otherwise

• This is called the Bayes classifier. Given our statistical framework, this is the optimal classifier.

• q0(X) > q1(X) is the same as p0 f0(X) > p1 f1(X).

• We will prove the optimality of the Bayes classifier later.
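Continuing the Gaussian sketch above, the decision only needs the comparison p0 f0(X) vs p1 f1(X), since the normalising constant Z cancels in the ratio. The densities and priors remain assumptions for illustration.

import math

def normal_pdf(x, mean, var=1.0):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_classifier(x, p0=0.5, p1=0.5):
    # h(X) = 0 iff q0(X)/q1(X) > 1, i.e. iff p0 f0(X) > p1 f1(X).
    f0, f1 = normal_pdf(x, 0.0), normal_pdf(x, 2.0)
    return 0 if p0 * f0 > p1 * f1 else 1

print(bayes_classifier(0.5), bayes_classifier(1.5))  # 0 then 1; boundary at x = 1 for equal priors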

Page 85: story so far

• We consider PR as a two-step process – feature measurement/extraction and classification.

• A classifier is to map feature vectors to class labels.

• The main difficulty in designing classifiers is due to the large variability in feature vector values.

• The main information we have for the design is a training set of examples.

• Function learning is a closely related problem.

• In both cases we need to learn from (training) examples.

Page 86: story so far

• The statistical view is: model the variation in feature vectors from a given class through probability models.

• One objective can be: find a classifier that has the least probability of error.

• For example, the Bayes classifier is the optimal one if we know the class conditional densities.

Page 87: Organization of the course

• Overview of classifier learning strategies
• Nearest Neighbour classification rule
• Bayes classifier for minimizing risk
• Some variations on this theme (e.g., the Neyman–Pearson classifier)
• Techniques for estimating class conditional densities to implement the Bayes classifier

Page 88: Organization of the course

• Linear discriminant functions
• Learning linear discriminant functions (Perceptron, LMS, logistic regression)
• Linear least-squares regression
• Simple overview of statistical learning theory
• Empirical risk minimization and VC theory

Page 89: Organization of the course

• Learning nonlinear classifiers
• Neural networks (backpropagation, RBF networks)
• Support Vector Machines
• Kernel-based methods
• Feature extraction and dimensionality reduction (PCA)
• Boosting and ensemble classifiers (AdaBoost)

Page 90:

Thank You!
