CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU

L5: Quadratic classifiers

• Bayes classifiers for Normally distributed classes
– Case 1: $\Sigma_i = \sigma^2 I$
– Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ diagonal)
– Case 3: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal)
– Case 4: $\Sigma_i = \sigma_i^2 I$
– Case 5: $\Sigma_i \neq \Sigma_j$ (general case)

• Numerical example

• Linear and quadratic classifiers: conclusions

Bayes classifiers for Gaussian classes

• Recap

– In L4 we showed that the decision rule that minimizes $P[\mathrm{error}]$ can be formulated in terms of a family of discriminant functions (DFs)

• For Gaussian classes, these DFs reduce to simple expressions

– The multivariate Normal pdf is

𝑓𝑋 π‘₯ = 2πœ‹ βˆ’π‘/2 Ξ£ βˆ’1/2π‘’βˆ’12π‘₯βˆ’πœ‡ π‘‡Ξ£βˆ’1 π‘₯βˆ’πœ‡

– Using Bayes rule, the DFs become

$$g_i(x) = P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)} = \frac{(2\pi)^{-N/2}\,|\Sigma_i|^{-1/2}\, e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)}\,P(\omega_i)}{p(x)}$$

– Eliminating constant terms

$$g_i(x) = |\Sigma_i|^{-1/2}\, e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)}\,P(\omega_i)$$

– And taking natural logs

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

– This expression is called a quadratic discriminant function
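A minimal sketch of this discriminant in NumPy (the function and variable names are illustrative, not from the lecture):

```python
# Sketch: evaluate the quadratic discriminant
#   g_i(x) = -1/2 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) - 1/2 log|Sigma_i| + log P(omega_i)
import numpy as np

def quadratic_discriminant(x, mu_i, sigma_i, prior_i):
    """Return g_i(x) for one class given its mean, covariance, and prior."""
    diff = x - mu_i
    maha = diff @ np.linalg.solve(sigma_i, diff)   # (x - mu_i)^T Sigma_i^{-1} (x - mu_i)
    _, logdet = np.linalg.slogdet(sigma_i)         # log|Sigma_i|, computed stably
    return -0.5 * maha - 0.5 * logdet + np.log(prior_i)
```

At classification time one would evaluate $g_i(x)$ for every class and pick the maximum, as in the block diagram below.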

[Figure: Bayes classifier block diagram. Features $x_1, \dots, x_d$ feed the discriminant functions $g_1(x), \dots, g_C(x)$; a "select max" stage (optionally weighted by costs) produces the class assignment]


Case 1: πšΊπ’Š = πˆπŸπ‘° β€’ This situation occurs when features are statistically independent with

equal variance for all classes

– In this case, the quadratic DFs become

𝑔𝑖 π‘₯ = βˆ’1

2π‘₯ βˆ’ πœ‡π‘–

𝑇 𝜎2𝐼 βˆ’1 π‘₯ βˆ’ πœ‡π‘– βˆ’1

2log 𝜎2𝐼 + π‘™π‘œπ‘”π‘ƒπ‘– ≑ βˆ’

1

2𝜎2π‘₯ βˆ’ πœ‡π‘–

𝑇 π‘₯ βˆ’ πœ‡π‘– + π‘™π‘œπ‘”π‘ƒπ‘–

– Expanding this expression

𝑔𝑖 π‘₯ = βˆ’1

2𝜎2π‘₯𝑇π‘₯ βˆ’ 2πœ‡π‘–

𝑇π‘₯ + πœ‡π‘–π‘‡πœ‡π‘– + π‘™π‘œπ‘”π‘ƒπ‘–

– Eliminating the term π‘₯𝑇π‘₯, which is constant for all classes

𝑔𝑖 π‘₯ = βˆ’1

2𝜎2βˆ’2πœ‡π‘–

𝑇π‘₯ + πœ‡π‘–π‘‡πœ‡π‘– + π‘™π‘œπ‘”π‘ƒπ‘– = 𝑀𝑖

𝑇π‘₯ + 𝑀0

– So the DFs are linear, and the boundaries $g_i(x) = g_j(x)$ are hyper-planes

– If we assume equal priors

𝑔𝑖 π‘₯ = βˆ’1

2𝜎2π‘₯ βˆ’ πœ‡π‘–

𝑇 π‘₯ βˆ’ πœ‡π‘–

• This is called a minimum-distance or nearest-mean classifier

• The equiprobable contours are hyper-spheres

• For unit variance ($\sigma^2 = 1$), $g_i(x)$ reduces to (minus one half of) the squared Euclidean distance to $\mu_i$
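A minimal sketch of this nearest-mean rule (assuming NumPy and equal priors; names are illustrative):

```python
# Sketch: minimum-distance (nearest-mean) classifier of Case 1, equal priors assumed.
import numpy as np

def nearest_mean_classify(x, means):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    dists = [np.linalg.norm(x - mu) for mu in means]   # Euclidean distance to each class mean
    return int(np.argmin(dists))                        # minimizing distance maximizes g_i(x)
```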

[Figure: minimum-distance classifier. The distances from $x$ to each class mean (Distance 1, Distance 2, ..., Distance C) feed a minimum selector that outputs the class; from Schalkoff, 1992]


• Example

– Three-class 2D problem with equal priors

$$\mu_1 = [3\ 2]^T \quad \mu_2 = [7\ 4]^T \quad \mu_3 = [2\ 5]^T$$

$$\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$$
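A short sketch applying the nearest-mean rule to this example's parameters (the test point below is hypothetical, chosen only for illustration):

```python
# Sketch: equal priors and Sigma = 2*I, so the Euclidean nearest-mean rule is Bayes-optimal here.
import numpy as np

means = [np.array([3.0, 2.0]), np.array([7.0, 4.0]), np.array([2.0, 5.0])]
x_test = np.array([4.0, 3.0])                               # hypothetical query point
label = min(range(len(means)), key=lambda i: np.linalg.norm(x_test - means[i]))
print(f"x_test assigned to class {label + 1}")              # classes numbered 1..3
```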


Case 2: πšΊπ’Š = 𝚺 (diagonal)

β€’ Classes still have the same covariance, but features are allowed to have different variances – In this case, the quadratic DFs becomes

𝑔𝑖 π‘₯ = βˆ’1

2π‘₯ βˆ’ πœ‡π‘–

π‘‡Ξ£π‘–βˆ’1 π‘₯ βˆ’ πœ‡π‘– βˆ’

1

2log Σ𝑖 + π‘™π‘œπ‘”π‘ƒπ‘– =

βˆ’1

2βˆ‘π‘˜=1𝑁

π‘₯π‘˜ βˆ’ πœ‡π‘–,π‘˜2

πœŽπ‘˜2 βˆ’

1

2π‘™π‘œπ‘”βˆπ‘˜=1

𝑁 πœŽπ‘˜2 + π‘™π‘œπ‘”π‘ƒπ‘–

– Eliminating the term π‘₯π‘˜2, which is constant for all classes

𝑔𝑖 π‘₯ = βˆ’1

2βˆ‘π‘˜=1𝑁

βˆ’2π‘₯π‘˜πœ‡π‘–,π‘˜ + πœ‡π‘–,π‘˜2

πœŽπ‘˜2 βˆ’

1

2π‘™π‘œπ‘”Ξ π‘˜=1

𝑁 πœŽπ‘˜2 + π‘™π‘œπ‘”π‘ƒπ‘–

– This discriminant is also linear, so the decision boundaries $g_i(x) = g_j(x)$ will again be hyper-planes

– The equiprobable contours are hyper-ellipses aligned with the reference frame

– Note that the only difference from the previous classifier is that the distance along each axis is normalized by its variance (a sketch follows)
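A minimal sketch of this variance-normalized distance (assuming NumPy, equal priors, and a shared diagonal covariance; names are illustrative):

```python
# Sketch: Case 2 discriminant up to class-independent terms; each axis is
# normalized by its own variance before measuring distance to the class mean.
import numpy as np

def diag_discriminant(x, mu_i, variances):
    """g_i(x) with equal priors: -1/2 * sum_k (x_k - mu_{i,k})^2 / sigma_k^2."""
    return -0.5 * np.sum((x - mu_i) ** 2 / variances)
```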


• Example

– Three-class 2D problem with equal priors

$$\mu_1 = [3\ 2]^T \quad \mu_2 = [5\ 4]^T \quad \mu_3 = [2\ 5]^T$$

$$\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$$


Case 3: πšΊπ’Š = 𝚺 (non-diagonal) β€’ Classes have equal covariance matrix, but no longer diagonal

– The quadratic discriminant becomes

𝑔𝑖 π‘₯ = βˆ’1

2π‘₯ βˆ’ πœ‡π‘–

π‘‡Ξ£βˆ’1 π‘₯ βˆ’ πœ‡π‘– βˆ’1

2log Ξ£ + π‘™π‘œπ‘”π‘ƒπ‘–

– Eliminating the term $\log|\Sigma|$, which is constant for all classes, and assuming equal priors

𝑔𝑖 π‘₯ = βˆ’1

2π‘₯ βˆ’ πœ‡π‘–

π‘‡Ξ£βˆ’1 π‘₯ βˆ’ πœ‡π‘–

– The quadratic term is called the Mahalanobis distance, a very important concept in statistical pattern recognition

– The Mahalanobis distance is a vector distance that uses the $\Sigma^{-1}$ norm

– $\Sigma^{-1}$ acts as a stretching factor on the space

– Note that when $\Sigma = I$, the Mahalanobis distance reduces to the familiar Euclidean distance
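A minimal sketch of the squared Mahalanobis distance (assuming NumPy; names are illustrative):

```python
# Sketch: squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
# With Sigma = I it reduces to the squared Euclidean distance.
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    diff = x - mu
    return float(diff @ np.linalg.solve(sigma, diff))   # solve avoids forming Sigma^{-1} explicitly
```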

[Figure: equidistance contours in the $(x_1, x_2)$ plane; $\|x-\mu_i\|^2 = K$ gives circles (Euclidean), while $(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) = K$ gives ellipses (Mahalanobis)]


– Expanding the quadratic term

𝑔𝑖 π‘₯ = βˆ’1

2π‘₯π‘‡Ξ£βˆ’1π‘₯ βˆ’ 2πœ‡π‘–

π‘‡Ξ£βˆ’1π‘₯ + πœ‡π‘–π‘‡Ξ£βˆ’1πœ‡π‘–

– Removing the term $x^T \Sigma^{-1} x$, which is constant for all classes

𝑔𝑖 π‘₯ = βˆ’1

2βˆ’2πœ‡π‘–

π‘‡Ξ£βˆ’1π‘₯ + πœ‡π‘–π‘‡Ξ£βˆ’1πœ‡π‘– = 𝑀1

𝑇π‘₯ + 𝑀0

– So the DFs are still linear, and the decision boundaries will also be hyper-planes

– The equiprobable contours are hyper-ellipses aligned with the eigenvectors of $\Sigma$

– This is known as a minimum (Mahalanobis) distance classifier
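A minimal sketch of this equivalent linear form (assuming NumPy and equal priors; names are illustrative):

```python
# Sketch: Case 3 linear weights, w_i = Sigma^{-1} mu_i and w_0 = -1/2 mu_i^T Sigma^{-1} mu_i,
# which rank classes exactly like the negative Mahalanobis distance (the x^T Sigma^{-1} x
# term dropped above is the same for every class).
import numpy as np

def case3_linear_weights(mu_i, sigma):
    w_i = np.linalg.solve(sigma, mu_i)   # Sigma^{-1} mu_i
    w_0 = -0.5 * mu_i @ w_i              # -1/2 mu_i^T Sigma^{-1} mu_i
    return w_i, w_0                      # g_i(x) = w_i^T x + w_0
```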

[Figure: minimum Mahalanobis distance classifier. The $\Sigma^{-1}$-weighted distances from $x$ to each class mean (Distance 1, Distance 2, ..., Distance C) feed a minimum selector that outputs the class]


• Example

– Three-class 2D problem with equal priors

$$\mu_1 = [3\ 2]^T \quad \mu_2 = [5\ 4]^T \quad \mu_3 = [2\ 5]^T$$

$$\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix} 1 & 0.7 \\ 0.7 & 2 \end{bmatrix}$$


Case 4: πšΊπ’Š = πˆπ’ŠπŸπ‘°

β€’ In this case, each class has a different covariance matrix, which is proportional to the identity matrix – The quadratic discriminant becomes

𝑔𝑖 π‘₯ = βˆ’1

2π‘₯ βˆ’ πœ‡π‘–

π‘‡πœŽπ‘–βˆ’2 π‘₯ βˆ’ πœ‡π‘– βˆ’

1

2𝑁log πœŽπ‘–

2 + π‘™π‘œπ‘”π‘ƒπ‘–

– This expression cannot be reduced further

– The decision boundaries are quadratic: hyper-ellipses

– The equiprobable contours are hyper-spheres aligned with the feature axes
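A minimal sketch of this Case 4 discriminant (assuming NumPy; names are illustrative):

```python
# Sketch: Case 4 discriminant with Sigma_i = sigma_i^2 * I; the per-class variance
# and log-term cannot be dropped, so g_i(x) stays quadratic in x.
import numpy as np

def spherical_discriminant(x, mu_i, var_i, prior_i):
    """g_i(x) = -||x - mu_i||^2 / (2 var_i) - N/2 * log(var_i) + log P_i."""
    n = x.shape[0]
    return (-0.5 * np.sum((x - mu_i) ** 2) / var_i
            - 0.5 * n * np.log(var_i)
            + np.log(prior_i))
```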


• Example

– Three-class 2D problem with equal priors

$$\mu_1 = [3\ 2]^T \quad \mu_2 = [5\ 4]^T \quad \mu_3 = [2\ 5]^T$$

$$\Sigma_1 = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \quad \Sigma_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \Sigma_3 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$$


Case 5: πšΊπ’Š β‰  πšΊπ’‹ (general case)

β€’ We already derived the expression for the general case

𝑔𝑖 π‘₯ = βˆ’1

2π‘₯ βˆ’ πœ‡π‘–

π‘‡Ξ£π‘–βˆ’1 π‘₯ βˆ’ πœ‡π‘– βˆ’

1

2log Σ𝑖 + π‘™π‘œπ‘”π‘ƒπ‘–

– Reorganizing terms in a quadratic form yields

𝑔𝑖 π‘₯ = π‘₯π‘‡π‘Š2,𝑖π‘₯ + 𝑀1,𝑖𝑇 π‘₯ + 𝑀0,𝑖

where

π‘Š2,𝑖 = βˆ’1

2Ξ£π‘–βˆ’1

𝑀1,𝑖 = Ξ£π‘–βˆ’1πœ‡π‘–

π‘€π‘œ,𝑖 = βˆ’1

2πœ‡π‘–π‘‡Ξ£π‘–

βˆ’1πœ‡π‘– βˆ’1

2log Σ𝑖 + π‘™π‘œπ‘”π‘ƒπ‘–

– The equiprobable contours are hyper-ellipses, oriented along the eigenvectors of $\Sigma_i$ for that class

– The decision boundaries are again quadratic: hyper-ellipses or hyper-paraboloids

– Notice that the quadratic expression in the discriminant is proportional to the Mahalanobis distance for covariance $\Sigma_i$ (a sketch follows)
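A minimal sketch of the general-case quadratic form with the weights defined above (assuming NumPy; names are illustrative):

```python
# Sketch: Case 5 (general) discriminant g_i(x) = x^T W2_i x + w1_i^T x + w0_i.
import numpy as np

def qda_weights(mu_i, sigma_i, prior_i):
    sigma_inv = np.linalg.inv(sigma_i)
    W2 = -0.5 * sigma_inv                               # quadratic term
    w1 = sigma_inv @ mu_i                               # linear term
    _, logdet = np.linalg.slogdet(sigma_i)
    w0 = -0.5 * mu_i @ sigma_inv @ mu_i - 0.5 * logdet + np.log(prior_i)
    return W2, w1, w0

def qda_discriminant(x, W2, w1, w0):
    return float(x @ W2 @ x + w1 @ x + w0)
```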


• Example

– Three-class 2D problem with equal priors

$$\mu_1 = [3\ 2]^T \quad \mu_2 = [5\ 4]^T \quad \mu_3 = [3\ 4]^T$$

$$\Sigma_1 = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \quad \Sigma_2 = \begin{bmatrix} 1 & -1 \\ -1 & 7 \end{bmatrix} \quad \Sigma_3 = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 3 \end{bmatrix}$$


Numerical example

• Derive a linear DF for the following 2-class 3D problem

$$\mu_1 = [0\ 0\ 0]^T; \quad \mu_2 = [1\ 1\ 1]^T; \quad \Sigma_1 = \Sigma_2 = \begin{bmatrix} 0.25 & 0 & 0 \\ 0 & 0.25 & 0 \\ 0 & 0 & 0.25 \end{bmatrix}; \quad P_2 = 2P_1$$

– Solution

𝑔1 π‘₯ = βˆ’1

2𝜎2π‘₯ βˆ’ πœ‡1

𝑇 π‘₯ βˆ’ πœ‡1 + π‘™π‘œπ‘”π‘ƒ1 = βˆ’1

2

π‘₯ βˆ’ 0𝑦 βˆ’ 0𝑧 βˆ’ 0

𝑇4

44

π‘₯ βˆ’ 0𝑦 βˆ’ 0𝑧 βˆ’ 0

+ π‘™π‘œπ‘”1

3

β€’ 𝑔2 π‘₯ = βˆ’1

2

π‘₯ βˆ’ 1𝑦 βˆ’ 1𝑧 βˆ’ 1

𝑇4

44

π‘₯ βˆ’ 1𝑦 βˆ’ 1𝑧 βˆ’ 1

+ π‘™π‘œπ‘”2

3

β€’ 𝑔1 π‘₯πœ”1><πœ”2

𝑔2 π‘₯ β‡’ βˆ’2 π‘₯2 + 𝑦2 + 𝑧2 + 𝑙𝑔1

3 πœ”1><πœ”2

βˆ’ 2 π‘₯ βˆ’ 1 2 + 𝑦 βˆ’ 1 2 + 𝑧 βˆ’ 1 2 + lg2

3

β€’ π‘₯ + 𝑦 + 𝑧 πœ”2><πœ”1

6βˆ’π‘™π‘œπ‘”2

4= 1.32

– Classify the test example $x_u = [0.1\ 0.7\ 0.8]^T$

$$0.1 + 0.7 + 0.8 = 1.6 \;\underset{\omega_1}{\overset{\omega_2}{\gtrless}}\; 1.32 \;\Rightarrow\; x_u \in \omega_2$$
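A short sketch checking this decision rule numerically (assuming NumPy):

```python
# Sketch: with Sigma = 0.25*I and P2 = 2*P1, the rule above reduces to comparing
# x + y + z against (6 - log 2) / 4, which is roughly 1.32.
import numpy as np

threshold = (6 - np.log(2)) / 4
x_u = np.array([0.1, 0.7, 0.8])
decision = "omega_2" if x_u.sum() > threshold else "omega_1"
print(f"x+y+z = {x_u.sum():.2f} vs threshold {threshold:.2f} -> {decision}")
```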


Conclusions

• The examples in this lecture illustrate the following points
– The Bayes classifier for Gaussian classes (general case) is quadratic
– The Bayes classifier for Gaussian classes with equal covariance is linear
– The Mahalanobis distance classifier is Bayes-optimal for
  • normally distributed classes and
  • equal covariance matrices and
  • equal priors
– The Euclidean distance classifier is Bayes-optimal for
  • normally distributed classes and
  • equal covariance matrices proportional to the identity matrix and
  • equal priors
– Both the Euclidean and the Mahalanobis distance classifiers are linear classifiers

• Thus, some of the simplest and most popular classifiers can be derived from decision-theoretic principles
– Using a specific (Euclidean or Mahalanobis) minimum-distance classifier implicitly corresponds to certain statistical assumptions
– The question of whether these assumptions hold can rarely be answered in practice; in most cases we can only determine whether the classifier solves our problem