
Bayesian Decision Theory

Chapter 2 (Duda, Hart & Stork)

CS7616 - Pattern Recognition

Henrik I Christensen, Georgia Tech

Bayesian Decision Theory

• Design classifiers to recommend decisions that minimize some total expected "risk".
– The simplest risk is the classification error (i.e., costs are equal).

– Typically, the risk includes the cost associated with different decisions.

Terminology

• State of nature ω (random variable):
– e.g., ω1 for sea bass, ω2 for salmon

• Probabilities P(ω1) and P(ω2) (priors):
– e.g., prior knowledge of how likely it is to get a sea bass or a salmon

• Probability density function p(x) (evidence):
– e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)

Terminology (cont'd)

• Conditional probability density p(x/ωj) (likelihood):
– e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj

e.g., lightness distributions between salmon/sea-bass populations

Terminology (cont'd)

• Conditional probability P(ωj/x) (posterior):
– e.g., the probability that the fish belongs to class ωj given measurement x.

Decision Rule Using Prior Probabilities

Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

or P(error) = min[P(ω1), P(ω2)]

• Favours the most likely class.
• This rule will make the same decision at all times.

– i.e., optimum if no other information is available

$$P(error) = \begin{cases} P(\omega_1) & \text{if we decide } \omega_2 \\ P(\omega_2) & \text{if we decide } \omega_1 \end{cases}$$

Decision Rule Using Conditional Probabilities

• Using Bayes' rule, the posterior probability of category ωj given measurement x is given by:

$$P(\omega_j/x) = \frac{p(x/\omega_j)\,P(\omega_j)}{p(x)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

where $p(x) = \sum_{j=1}^{2} p(x/\omega_j)\,P(\omega_j)$ (i.e., a scale factor so that the posteriors sum to 1).

Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2

or

Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2
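A minimal numeric sketch of this rule in Python (the likelihood values and priors below are illustrative, not taken from the lecture; numpy assumed):

```python
import numpy as np

# Hypothetical class-conditional likelihoods p(x/w1), p(x/w2) at a measured x
likelihoods = np.array([0.6, 0.2])
priors = np.array([2/3, 1/3])          # P(w1), P(w2)

joint = likelihoods * priors           # p(x/wj) * P(wj)
evidence = joint.sum()                 # p(x) = sum_j p(x/wj) P(wj)
posteriors = joint / evidence          # P(wj/x)

decision = np.argmax(posteriors) + 1   # decide w1 if P(w1/x) > P(w2/x)
print(posteriors, "-> decide w", decision)
```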

Decision Rule Using Conditional pdf (cont'd)

(figures: class-conditional densities p(x/ωj) and posteriors P(ωj/x), for priors P(ω1) = 2/3 and P(ω2) = 1/3)

Probability of Error

• The probability of error is defined as:

$$P(error/x) = \begin{cases} P(\omega_1/x) & \text{if we decide } \omega_2 \\ P(\omega_2/x) & \text{if we decide } \omega_1 \end{cases}$$

or

P(error/x) = min[P(ω1/x), P(ω2/x)]

• What is the average probability of error?

$$P(error) = \int_{-\infty}^{\infty} P(error, x)\,dx = \int_{-\infty}^{\infty} P(error/x)\,p(x)\,dx$$

• The Bayes rule is optimum, that is, it minimizes the average probability of error!

Where do Probabilities Come From?

• There are two competing answers to this question:

(1) Relative frequency (objective) approach.
– Probabilities can only come from experiments.

(2) Bayesian (subjective) approach.
– Probabilities may reflect degrees of belief and can be based on opinion.

Example (objective approach)

• Classify cars according to whether they cost more or less than $50K:
– Classes: C1 if price > $50K, C2 if price <= $50K
– Features: x, the height of a car

• Use Bayes' rule to compute the posterior probabilities:

$$P(C_i/x) = \frac{p(x/C_i)\,P(C_i)}{p(x)}$$

• We need to estimate p(x/C1), p(x/C2), P(C1), P(C2)

Example (cont'd)

• Collect data
– Ask drivers how much their car cost and measure its height.

• Determine prior probabilities P(C1), P(C2)
– e.g., 1209 samples: #C1 = 221, #C2 = 988

$$P(C_1) = \frac{221}{1209} = 0.183 \qquad P(C_2) = \frac{988}{1209} = 0.817$$

Example (cont'd)

• Determine the class-conditional probabilities (likelihood)
– Discretize car height into bins and use the normalized histogram

(figure: normalized histograms p(x/Ci))

Example (cont'd)

• Calculate the posterior probability for each bin, e.g., for x = 1.0:

$$P(C_1/x=1.0) = \frac{p(x=1.0/C_1)P(C_1)}{p(x=1.0/C_1)P(C_1) + p(x=1.0/C_2)P(C_2)} = \frac{0.2081 \times 0.183}{0.2081 \times 0.183 + 0.0597 \times 0.817} = 0.438$$

(figure: posteriors P(Ci/x) per bin)
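The same computation, reproduced as a short Python snippet using the numbers on this slide (numpy not needed here):

```python
# Posterior for the histogram bin containing x = 1.0
p_x_C1, p_x_C2 = 0.2081, 0.0597      # normalized-histogram likelihoods
P_C1, P_C2 = 221/1209, 988/1209      # priors estimated from the counts above

post_C1 = p_x_C1 * P_C1 / (p_x_C1 * P_C1 + p_x_C2 * P_C2)
print(round(post_C1, 3))             # ~0.438 -> decide C1 for this bin
```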

A More General Theory

• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input to one of the possible categories (e.g., rejection).

• Employ a more general error function (i.e., a "risk" function) by associating a "cost" ("loss" function) with each error (i.e., wrong action).

Terminology

• Features form a vector $\mathbf{x} \in R^d$
• A finite set of c categories ω1, ω2, …, ωc

• Bayes rule (i.e., using vector notation):

$$P(\omega_j/\mathbf{x}) = \frac{p(\mathbf{x}/\omega_j)\,P(\omega_j)}{p(\mathbf{x})}, \quad \text{where } p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x}/\omega_j)\,P(\omega_j)$$

• A finite set of l actions α1, α2, …, αl

• A loss function λ(αi/ωj)
– the cost associated with taking action αi when the correct classification category is ωj

Conditional Risk (or Expected Loss)

• Suppose we observe x and take action αi

• Suppose that the cost associated with taking action αi with ωj being the correct category is λ(αi/ωj)

• The conditional risk (or expected loss) of taking action αi is:

$$R(\alpha_i/\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i/\omega_j)\,P(\omega_j/\mathbf{x})$$

Overall Risk

• Suppose α(x) is a general decision rule that determines which of the actions α1, α2, …, αl to take for every x; then the overall risk is defined as:

$$R = \int R(\alpha(\mathbf{x})/\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$

• The optimum decision rule is the Bayes rule

Overall Risk (cont'd)

• The Bayes decision rule minimizes R by:
(i) Computing R(αi/x) for every αi given an x
(ii) Choosing the action αi with the minimum R(αi/x)

• The resulting minimum overall risk is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:

$$R^* = \min R$$
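A minimal sketch of this rule: compute R(αi/x) for each action and take the one with the smallest conditional risk (the loss matrix and posteriors below are hypothetical; numpy assumed):

```python
import numpy as np

# Hypothetical loss matrix lam[i, j] = lambda(a_i / w_j)
lam = np.array([[0.0, 2.0],    # a1: decide w1
                [1.0, 0.0]])   # a2: decide w2
posteriors = np.array([0.3, 0.7])          # P(w1/x), P(w2/x)

cond_risk = lam @ posteriors               # R(a_i/x) = sum_j lam(a_i/w_j) P(w_j/x)
best_action = np.argmin(cond_risk)         # Bayes rule: pick the minimum-risk action
print(cond_risk, "-> take action a", best_action + 1)
```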

Example: Two-category classification

• Define
– α1: decide ω1
– α2: decide ω2
– λij = λ(αi/ωj)

• The conditional risks are:

$$R(\alpha_i/\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i/\omega_j)\,P(\omega_j/\mathbf{x}) \quad (c = 2)$$

$$R(\alpha_1/\mathbf{x}) = \lambda_{11}P(\omega_1/\mathbf{x}) + \lambda_{12}P(\omega_2/\mathbf{x})$$
$$R(\alpha_2/\mathbf{x}) = \lambda_{21}P(\omega_1/\mathbf{x}) + \lambda_{22}P(\omega_2/\mathbf{x})$$

Example: Two-category classification (cont'd)

• Minimum risk decision rule:

Decide ω1 if $(\lambda_{21} - \lambda_{11})\,P(\omega_1/\mathbf{x}) > (\lambda_{12} - \lambda_{22})\,P(\omega_2/\mathbf{x})$; otherwise decide ω2

or (i.e., using the likelihood ratio):

Decide ω1 if $\frac{p(\mathbf{x}/\omega_1)}{p(\mathbf{x}/\omega_2)} > \frac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}$; otherwise decide ω2

(likelihood ratio > threshold)

Special Case: Zero-One Loss Function

• Assign the same loss to all errors:

$$\lambda(\alpha_i/\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}$$

• The conditional risk corresponding to this loss function:

$$R(\alpha_i/\mathbf{x}) = \sum_{j \neq i} P(\omega_j/\mathbf{x}) = 1 - P(\omega_i/\mathbf{x})$$

Special Case: Zero-One Loss Function (cont'd)

• The decision rule becomes:

Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2

or

Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2

or

Decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1); otherwise decide ω2

• In this case, the overall risk is the average probability of error!

Example

• Assuming zero-one loss:

Decide ω1 if p(x/ω1)/p(x/ω2) > θa; otherwise decide ω2, where $\theta_a = P(\omega_2)/P(\omega_1)$

• Assuming general loss (assume λ12 > λ21):

Decide ω1 if p(x/ω1)/p(x/ω2) > θb; otherwise decide ω2, where $\theta_b = \frac{P(\omega_2)(\lambda_{12}-\lambda_{22})}{P(\omega_1)(\lambda_{21}-\lambda_{11})}$

(figure: the likelihood ratio and the decision regions induced by the two thresholds)
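A quick check of the two thresholds, with made-up priors and losses chosen to satisfy λ12 > λ21 (not the lecture's numbers):

```python
# Likelihood-ratio thresholds from the definitions above (hypothetical values)
P1, P2 = 2/3, 1/3
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0   # assumes lam12 > lam21

theta_a = P2 / P1                                           # zero-one loss threshold
theta_b = (P2 * (lam12 - lam22)) / (P1 * (lam21 - lam11))   # general-loss threshold
# decide w1 wherever p(x/w1)/p(x/w2) exceeds the applicable threshold
print(theta_a, theta_b)   # with lam12 > lam21, theta_b > theta_a
```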

Discriminant Functions

• A useful way to represent classifiers is through discriminant functions gi(x), i = 1, ..., c, where a feature vector x is assigned to class ωi if:

gi(x) > gj(x) for all j ≠ i

Discriminants for Bayes Classifier

• Assuming a general loss function:

gi(x) = -R(αi/x)

• Assuming the zero-one loss function:

gi(x) = P(ωi/x)

Discriminants for Bayes Classifier (cont'd)

• Is the choice of gi unique?
– Replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results.

$$g_i(\mathbf{x}) = P(\omega_i/\mathbf{x}) = \frac{p(\mathbf{x}/\omega_i)\,P(\omega_i)}{p(\mathbf{x})}$$

$$g_i(\mathbf{x}) = p(\mathbf{x}/\omega_i)\,P(\omega_i)$$

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}/\omega_i) + \ln P(\omega_i)$$

we'll use this form extensively!
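A small sketch of evaluating the log-form discriminant gi(x) = ln p(x/ωi) + ln P(ωi) when the class models are Gaussian (all means, covariances and priors below are made up; numpy assumed):

```python
import numpy as np

# Log of a multivariate Gaussian density, used as ln p(x/w_i)
def log_gaussian(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    return -0.5 * diff @ np.linalg.solve(Sigma, diff) \
           - 0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))

means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # hypothetical class models
covs   = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.6, 0.4]

x = np.array([1.0, 2.0])
g = [log_gaussian(x, means[i], covs[i]) + np.log(priors[i]) for i in range(2)]
print("decide w", int(np.argmax(g)) + 1)   # largest discriminant wins
```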

Case of two categories

• More common to use a single discriminant function (dichotomizer) instead of two:

• Examples:

$$g(\mathbf{x}) = P(\omega_1/\mathbf{x}) - P(\omega_2/\mathbf{x})$$

$$g(\mathbf{x}) = \ln \frac{p(\mathbf{x}/\omega_1)}{p(\mathbf{x}/\omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$

Decision Regions and Boundaries

• Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.

A decision boundary is defined by:

g1(x)=g2(x)

Discriminant Function for Multivariate Gaussian Density

• Consider the following discriminant function:

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}/\omega_i) + \ln P(\omega_i)$$

where $p(\mathbf{x}/\omega_i) \sim N(\mu_i, \Sigma_i)$ (multivariate Gaussian density)

Multivariate Gaussian Density: Case I

• Σi = σ²I (diagonal)
– Features are statistically independent
– Each feature has the same variance

(the ln P(ωi) term favours the a priori more likely category)

Multivariate Gaussian Density: Case I (cont'd)

• The discriminant is linear:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \quad \mathbf{w}_i = \frac{\mu_i}{\sigma^2}, \quad w_{i0} = -\frac{\mu_i^T\mu_i}{2\sigma^2} + \ln P(\omega_i)$$

• The decision boundary gi(x) = gj(x) is the hyperplane

$$\mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0, \quad \mathbf{w} = \mu_i - \mu_j, \quad \mathbf{x}_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$$

Multivariate Gaussian Density: Case I (cont'd)

• Properties of the decision boundary:
– It passes through x0
– It is orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)?
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
– If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj)

Multivariate Gaussian Density: Case I (cont'd)

(figures) If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.

Multivariate Gaussian Density: Case I (cont'd)

• Minimum distance classifier
– When the P(ωi) are equal, then:

$$g_i(\mathbf{x}) = -\|\mathbf{x} - \mu_i\|^2$$

(assign x to the class with the maximum gi(x), i.e., the nearest mean)
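A minimal minimum-distance classifier sketch (the class means are made up; numpy assumed):

```python
import numpy as np

# Case I with equal priors: g_i(x) = -||x - mu_i||^2  (nearest-mean rule)
means = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])   # hypothetical class means
x = np.array([1.0, 2.5])

g = -np.sum((means - x) ** 2, axis=1)    # one discriminant value per class
print("decide w", int(np.argmax(g)) + 1) # the nearest mean wins
```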

Multivariate Gaussian Density: Case II

• Σi = Σ (common covariance matrix)

Multivariate Gaussian Density: Case II (cont'd)

• The discriminant is again linear:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \quad \mathbf{w}_i = \Sigma^{-1}\mu_i, \quad w_{i0} = -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \ln P(\omega_i)$$

• The decision boundary is the hyperplane

$$\mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0, \quad \mathbf{w} = \Sigma^{-1}(\mu_i - \mu_j), \quad \mathbf{x}_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln[P(\omega_i)/P(\omega_j)]}{(\mu_i - \mu_j)^T\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)$$

Multivariate Gaussian Density: Case II (cont'd)

• Properties of the hyperplane (decision boundary):
– It passes through x0
– It is not orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)?
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.

Multivariate Gaussian Density: Case II (cont'd)

(figures) If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.

Multivariate Gaussian Density: Case II (cont'd)

• Mahalanobis distance classifier
– When the P(ωi) are equal, then:

$$g_i(\mathbf{x}) = -(\mathbf{x} - \mu_i)^T\Sigma^{-1}(\mathbf{x} - \mu_i)$$

(assign x to the class with the maximum gi(x))
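A corresponding Mahalanobis-distance sketch with a shared, hypothetical covariance matrix (numpy assumed):

```python
import numpy as np

# Case II with equal priors: pick the class with the smallest Mahalanobis distance
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                      # shared (made-up) covariance
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
x = np.array([1.0, 2.0])

d2 = [(x - m) @ np.linalg.solve(Sigma, x - m) for m in means]   # (x-mu)^T Sigma^{-1} (x-mu)
print("decide w", int(np.argmin(d2)) + 1)
```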

Multivariate Gaussian Density: Case III

• Σi = arbitrary
– The decision boundaries are hyperquadrics; e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.
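A sketch of the general (quadratic) discriminant for Case III, with the class-independent constant dropped; the class parameters below are made up (numpy assumed):

```python
import numpy as np

# Case III (arbitrary Sigma_i):
# g_i(x) = -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln P(w_i)
def g_quadratic(x, mu, Sigma, prior):
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Hypothetical two-class model with different covariances
params = [(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 0.5]]), 0.5),
          (np.array([2.0, 2.0]), np.array([[2.0, 0.3], [0.3, 1.5]]), 0.5)]
x = np.array([1.0, 1.0])
scores = [g_quadratic(m, S, P) if False else g_quadratic(x, m, S, P) for m, S, P in params]
print("decide w", int(np.argmax(scores)) + 1)
```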

Example - Case III

P(ω1) = P(ω2)

decision boundary:

The boundary does not pass through the midpoint of μ1, μ2

Multivariate Gaussian Density: Case III (cont'd)

(figure: non-linear decision boundaries)

Multivariate Gaussian Density: Case III (cont'd)

• More examples

Error Bounds

• Exact error calculations could be difficult – easier to estimate error bounds!

(figure: P(error) / min[P(ω1/x), P(ω2/x)])

Error Bounds (cont'd)

• If the class-conditional distributions are Gaussian, then

$$P(error) \le P(\omega_1)^{\beta}\,P(\omega_2)^{1-\beta}\,e^{-k(\beta)}$$

where:

$$k(\beta) = \frac{\beta(1-\beta)}{2}(\mu_2 - \mu_1)^T\left[\beta\Sigma_1 + (1-\beta)\Sigma_2\right]^{-1}(\mu_2 - \mu_1) + \frac{1}{2}\ln\frac{|\beta\Sigma_1 + (1-\beta)\Sigma_2|}{|\Sigma_1|^{\beta}\,|\Sigma_2|^{1-\beta}}$$

Error Bounds (cont'd)

• The Chernoff bound corresponds to the β that minimizes e⁻ᵏ⁽ᵝ⁾
– This is a 1-D optimization problem, regardless of the dimensionality of the class-conditional densities.

(figure: e⁻ᵏ⁽ᵝ⁾ vs β – the minimizing β gives the tight bound; other values give loose bounds)

Error Bounds (cont'd)

• Bhattacharyya bound
– Approximate the error bound using β = 0.5
– Easier to compute than the Chernoff bound but looser.

• The Chernoff and Bhattacharyya bounds will not be good bounds if the distributions are not Gaussian.

Example

Bhattacharyya error: k(0.5) = 4.06

P(error) ≤ 0.0087
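A sketch of computing the Bhattacharyya bound for two Gaussian classes, using the k(1/2) expression above; the means, covariances, and priors here are made up, so the numbers will not match the slide's example (numpy assumed):

```python
import numpy as np

# Bhattacharyya bound (beta = 0.5):
# k(1/2) = 1/8 (mu2-mu1)^T [(S1+S2)/2]^{-1} (mu2-mu1)
#          + 1/2 ln( |(S1+S2)/2| / sqrt(|S1||S2|) )
# P(error) <= sqrt(P(w1) P(w2)) * exp(-k(1/2))
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])   # hypothetical parameters
S1, S2 = np.eye(2), 2.0 * np.eye(2)
P1, P2 = 0.5, 0.5

S = 0.5 * (S1 + S2)
diff = mu2 - mu1
k_half = 0.125 * diff @ np.linalg.solve(S, diff) \
         + 0.5 * np.log(np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
bound = np.sqrt(P1 * P2) * np.exp(-k_half)
print("P(error) <=", bound)
```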

Receiver Operating Characteristic (ROC) Curve

• Every classifier employs some kind of threshold.

• Changing the threshold affects the performance of the system.

• ROC curves can help us evaluate system performance for different thresholds.

$$\theta_a = \frac{P(\omega_2)}{P(\omega_1)} \qquad \theta_b = \frac{P(\omega_2)(\lambda_{12}-\lambda_{22})}{P(\omega_1)(\lambda_{21}-\lambda_{11})}$$

Example: Person Authentication

• Authenticate a person using biometrics (e.g., fingerprints).

• There are two possible distributions (i.e., classes):
– Authentic (A) and Impostor (I)

(figure: score distributions for I and A)

Example: Person Authentication (cont'd)

• Possible decisions:
– (1) correct acceptance (true positive):
• X belongs to A, and we decide A
– (2) incorrect acceptance (false positive):
• X belongs to I, and we decide A
– (3) correct rejection (true negative):
• X belongs to I, and we decide I
– (4) incorrect rejection (false negative):
• X belongs to A, and we decide I

(figure: the I and A score densities, with the threshold splitting them into regions labeled false positive, correct acceptance, correct rejection, and false negative)
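A small empirical sketch of how sweeping the decision threshold trades false positives against false negatives; the two Gaussian score models are hypothetical (numpy assumed):

```python
import numpy as np

# Simulated 1-D scores from hypothetical Impostor (I) and Authentic (A) models
rng = np.random.default_rng(0)
scores_I = rng.normal(0.0, 1.0, 5000)    # impostor scores
scores_A = rng.normal(4.0, 1.0, 5000)    # authentic scores

for t in (1.0, 2.0, 3.0):                # a few candidate thresholds
    fpr = np.mean(scores_I > t)          # false positives: impostors accepted
    fnr = np.mean(scores_A <= t)         # false negatives: authentics rejected
    print(f"threshold {t}: FPR={fpr:.3f}  FNR={fnr:.3f}")
```

Plotting the (FPR, 1 - FNR) pairs over a fine sweep of thresholds traces the ROC curve.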

Error vs Threshold

ROC

False Negatives vs Positives

Next Lecture

• Linear Classification Methods
– Hastie et al, Chapter 4

• Paper list will be available by the weekend
– Bidding to start on Monday

Bayes Decision Theory: Case of Discrete Features

• Replace $\int p(\mathbf{x}/\omega_j)\,d\mathbf{x}$ with $\sum_{\mathbf{x}} P(\mathbf{x}/\omega_j)$

• See section 2.9

Missing Features

• Consider a Bayes classifier using uncorrupted data.
• Suppose x = (x1, x2) is a test vector where x1 is missing and the value of x2 is x̂2 – how can we classify it?
– If we set x1 equal to the average value, we will classify x as ω3
– But p(x̂2/ω2) is larger; maybe we should classify x as ω2?

Missing Features (cont'd)

• Suppose x = [xg, xb] (xg: good features, xb: bad features)
• Derive the Bayes rule using the good features:

$$P(\omega_i/\mathbf{x}_g) = \frac{\int P(\omega_i/\mathbf{x}_g, \mathbf{x}_b)\,p(\mathbf{x}_g, \mathbf{x}_b)\,d\mathbf{x}_b}{\int p(\mathbf{x}_g, \mathbf{x}_b)\,d\mathbf{x}_b}$$

– Marginalize the posterior probability over the bad features.
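A discretized sketch of this marginalization, summing the joint likelihood over a grid of values for the missing feature; all numbers below are hypothetical (numpy assumed):

```python
import numpy as np

# P(w_i / x_g) ∝ sum_b p(x_g, x_b / w_i) P(w_i), summed over a grid of x_b values
# joint_like[i, b] holds hypothetical values of p(x_g, x_b / w_i) on that grid
joint_like = np.array([[0.02, 0.05, 0.01],
                       [0.01, 0.03, 0.08]])
priors = np.array([0.5, 0.5])

numer = (joint_like * priors[:, None]).sum(axis=1)   # marginalize over x_b
posteriors = numer / numer.sum()                     # normalize by p(x_g)
print(posteriors)
```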

Compound Bayesian Decision Theory

• Sequential decision
(1) Decide as each fish emerges.

• Compound decision
(1) Wait for n fish to emerge.
(2) Make all n decisions jointly.

– Could improve performance when consecutive states of nature are not statistically independent.

Compound Bayesian Decision Theory (cont'd)

• Suppose Ω = (ω(1), ω(2), …, ω(n)) denotes the n states of nature, where ω(i) can take one of c values ω1, ω2, …, ωc (i.e., c categories)

• Suppose P(Ω) is the prior probability of the n states of nature.

• Suppose X = (x1, x2, …, xn) are the n observed vectors.

Compound Bayesian Decision Theory (cont'd)

• i.e., consecutive states of nature may not be statistically independent!

• Assuming that the observations are conditionally independent given the states, $p(X/\Omega) = \prod_{i=1}^{n} p(\mathbf{x}_i/\omega(i))$, is acceptable!