Transcript of Pattern Recognition: Bayesian Decision Theory, Charles Tappert, Seidenberg School of CSIS, Pace University

Page 1:

Pattern Recognition: Bayesian Decision Theory

Charles Tappert
Seidenberg School of CSIS, Pace University

Page 2:

Pattern Classification

Most of the material in these slides was taken from the figures in Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2001

Page 3:

Bayesian Decision Theory

Fundamental, purely statistical approach
Assumes the relevant probabilities are known perfectly
Makes theoretically optimal decisions

Page 4:

Bayesian Decision Theory

Based on Bayes formula

P(ωj | x) = p(x | ωj) P(ωj) / p(x)

which is easily derived from writing the joint probability density in two ways

P(ωj, x) = P(ωj | x) p(x)
P(ωj, x) = p(x | ωj) P(ωj)

Note: uppercase P(.) denotes a probability mass function and lowercase p(.) a density function

Page 5:

Bayes Formula

Bayes formula

P(ωj | x) = p(x | ωj) P(ωj) / p(x)

can be expressed informally in English as

posterior = likelihood x prior / evidence

and the Bayes decision is to choose the class ωj with the greatest posterior probability
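
As a minimal numeric sketch of the formula (the likelihood and prior values below are invented for illustration, not taken from the slides):

```python
# Hypothetical two-class example: likelihoods p(x | w_j) and priors P(w_j)
likelihood = {"sea_bass": 0.30, "salmon": 0.10}   # p(x | w_j) at some observed x
prior      = {"sea_bass": 0.40, "salmon": 0.60}   # P(w_j)

# evidence p(x) = sum over j of p(x | w_j) P(w_j)
evidence = sum(likelihood[c] * prior[c] for c in likelihood)

# posterior P(w_j | x) = p(x | w_j) P(w_j) / p(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in likelihood}
print(posterior)   # {'sea_bass': 0.666..., 'salmon': 0.333...}
```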

Page 6:

Bayes Formula

Bayes formula: P(ωj | x) = p(x | ωj) P(ωj) / p(x)
The Bayes decision chooses the class ωj with the greatest P(ωj | x)
Since p(x) is the same for all classes, greatest P(ωj | x) means greatest p(x | ωj) P(ωj)
Special case: if all classes are equally likely, i.e. have the same P(ωj), we get a further simplification – the greatest P(ωj | x) is the greatest likelihood p(x | ωj)
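
A small sketch of the decision rule itself, again with hypothetical numbers: because p(x) is common to all classes it can be dropped, so the argmax over posteriors equals the argmax over p(x | ωj) P(ωj).

```python
def bayes_decision(likelihood, prior):
    """Choose the class w_j maximizing p(x | w_j) P(w_j); p(x) cancels out."""
    return max(likelihood, key=lambda c: likelihood[c] * prior[c])

# Same hypothetical numbers as in the previous sketch
print(bayes_decision({"sea_bass": 0.30, "salmon": 0.10},
                     {"sea_bass": 0.40, "salmon": 0.60}))   # -> 'sea_bass'
```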

Page 7:

Bayesian Decision Theory

Now, let’s look at the fish example of two classes – sea bass and salmon – and one feature – lightness

Let p(x | ω1) and p(x | ω2) describe the difference in lightness between the populations of sea bass and salmon (see next slide)

Page 8: (figure slide: class-conditional lightness densities p(x | ω1) and p(x | ω2))
Page 9:

Bayesian Decision Theory

In the previous slide, if the two classes are equally likely, we get the simplification – greatest posterior means greatest likelihood – and the Bayes decision is to choose class ω1 when p(x | ω1) > p(x | ω2), i.e. when lightness is greater than approximately 12.4

However, if the two classes are not equally likely, we get a case like the next slide
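
A rough sketch of how the priors move the decision boundary, assuming Gaussian lightness densities with invented means, variances, and priors (the 12.4 threshold above comes from the textbook figure, not from these numbers):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional lightness densities (means/variances invented)
p1 = norm(loc=14.0, scale=1.5)   # w1 = sea bass
p2 = norm(loc=11.0, scale=1.5)   # w2 = salmon

def decision_boundary(prior1, prior2):
    """x where p(x | w1) P(w1) = p(x | w2) P(w2), found by a coarse grid search."""
    xs = np.linspace(8.0, 17.0, 10_000)
    diff = p1.pdf(xs) * prior1 - p2.pdf(xs) * prior2
    return xs[np.argmin(np.abs(diff))]

print(decision_boundary(0.5, 0.5))   # equal priors: midpoint of the means (12.5 here)
print(decision_boundary(0.2, 0.8))   # sea bass rarer: boundary shifts toward its mean
```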

Page 10: (figure slide: the unequal-priors case)
Page 11:

Bayesian Parameter Estimation

Because the actual probabilities are rarely known, they are usually estimated after assuming the form of the distributions

The form usually assumed for the distributions is the multivariate normal

Page 12:

Bayesian Parameter Estimation

Assuming multivariate normal probability density functions, it is necessary to estimate, for each pattern class:
Feature means
Feature covariance matrices
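
A minimal numpy sketch of that estimation step; the array shapes and the function name estimate_gaussian_params are assumptions for illustration:

```python
import numpy as np

def estimate_gaussian_params(X, y):
    """Estimate a mean vector and covariance matrix per class from labeled samples.

    X: (n_samples, n_features) feature matrix; y: (n_samples,) class labels.
    """
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                              # samples belonging to class c
        params[c] = {
            "mean": Xc.mean(axis=0),                # per-feature mean vector
            "cov":  np.cov(Xc, rowvar=False),       # unbiased sample covariance matrix
        }
    return params
```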

Page 13:

Multivariate Normal Densities

Simplifying assumptions can be made for multivariate normal density functions
Statistically independent features with equal variances yield hyperplane decision surfaces
Equal covariance matrices for each class also yield hyperplane decision surfaces
Arbitrary normal distributions yield hyperquadric decision surfaces
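
For the general case, the decision surfaces come from comparing per-class discriminants gi(x) = ln p(x | ωi) + ln P(ωi); a sketch, assuming the means and covariances were estimated as in the previous example:

```python
import numpy as np

def log_discriminant(x, mean, cov, prior):
    """g_i(x) = ln p(x | w_i) + ln P(w_i) for a multivariate normal class model."""
    d = len(mean)
    diff = x - mean
    inv = np.linalg.inv(cov)
    return (-0.5 * diff @ inv @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

# Decision: choose the class with the largest g_i(x).  With arbitrary covariances
# the surfaces g_i(x) = g_j(x) are hyperquadrics; with equal covariances, hyperplanes.
```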

Page 14:

Nonparametric Techniques

Probabilities are not known
Two approaches:
Estimate the density functions from sample patterns
Bypass probability estimation entirely and use a non-parametric method, such as k-Nearest-Neighbor

Page 15:

k-Nearest-Neighbor

Page 16:

k-Nearest-Neighbor (k-NN) Method

Used where probabilities are not known
Bypasses probability estimation entirely
Easy to implement
Asymptotic error never worse than twice the Bayes error
Computationally intense, therefore slow
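
A minimal brute-force k-NN sketch (the Euclidean distance to every training sample is computed at classification time, which is exactly why the method is computationally intense):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training samples (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]
```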

Page 17:

Simple PR System with k-NN

Good for feasibility studies – easy to implement

Typical procedural steps:
Extract feature measurements
Normalize features to the 0-1 range
Classify by k-nearest-neighbor using Euclidean distance

Normalization: x' = (x - xmin) / (xmax - xmin)
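
A short numpy sketch of that per-feature min-max normalization, assuming X is an (n_samples, n_features) array:

```python
import numpy as np

def minmax_normalize(X):
    """Scale each feature column to the 0-1 range: x' = (x - xmin) / (xmax - xmin)."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)
```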

Page 18:

Simple PR System with k-NN (cont):

Two Modes of Operation

Leave-one-out procedure (sketched in the code after this list)
One input file of training/test patterns
Repeatedly train on all samples except one, which is left out for testing
Good for a feasibility study with little data

Train and test on separate files
One input file for training and one for testing
Good for measuring performance change when varying an independent variable (e.g., different keyboards for keystroke biometrics)
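
A self-contained sketch of the leave-one-out mode with a brute-force k-NN vote (data shapes, labels, and the default k are assumptions):

```python
import numpy as np
from collections import Counter

def leave_one_out_accuracy(X, y, k=3):
    """Hold out each sample in turn, classify it from the rest, and count correct votes."""
    correct = 0
    for i in range(len(X)):
        X_train = np.delete(X, i, axis=0)           # all samples except the held-out one
        y_train = np.delete(y, i, axis=0)
        dists = np.linalg.norm(X_train - X[i], axis=1)
        vote = Counter(y_train[np.argsort(dists)[:k]]).most_common(1)[0][0]
        correct += (vote == y[i])
    return correct / len(X)
```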

Page 19:

Simple PR System with k-NN (cont)

Used in keystroke biometric studies
Feasibility study – Dr. Mary Curtin
Different keyboards/modes – Dr. Mary Villani

Used in other studies of keystroke data
Procedures for handling incomplete and missing data, e.g., fallback procedures in the keystroke biometric system – Dr. Mark Ritzmann
New kNN-ROC procedures – Dr. Robert Zack

Used in other biometric studies
Mouse movement – Larry Immohr
Stylometry + keystroke study – John Stewart

Page 20:

Conclusions

Bayes decision method best if probabilities known

Bayes method okay if you are good with statistics and the form of the probability distributions can be assumed, especially if there is justification for simplifying assumptions like independent features

Otherwise, stay with easier-to-implement methods that provide reasonable results, like k-NN