Naïve Bayes Classifier


1

Naïve Bayes Classifier

Pradeep Ravikumar

Co-instructor: Ziv Bar-Joseph

Machine Learning 10-701

2

Goal: Classification

[Figure: example documents labeled Sports, Science, News]

Features, X; Labels, Y

[Figure: probability of error, ranging from 0 to 1]

Optimal Classification

Optimal predictor (Bayes classifier): f*(x) = argmax_y P(Y = y | X = x)

3

• Even the optimal classifier makes mistakes: R(f*) > 0 (the Bayes risk)
• The optimal classifier depends on the unknown distribution of the data

Optimal Classifier

4

Bayes Rule:

P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)

where P(X = x | Y = y) is the class conditional density and P(Y = y) is the class prior.

Optimal classifier: f*(x) = argmax_y P(X = x | Y = y) P(Y = y)
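To make this rule concrete, here is a minimal Python sketch (not from the slides) of a Bayes classifier for a hypothetical setup with known priors and one-dimensional Gaussian class conditional densities; all numbers are made up for illustration.

```python
import numpy as np

# Hypothetical model (assumed for illustration): two classes with
# known priors and 1-d Gaussian class conditional densities.
priors = {0: 0.5, 1: 0.5}                  # P(Y = y)
params = {0: (-1.0, 1.0), 1: (1.0, 1.0)}   # (mean, std) of P(X | Y = y)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_classifier(x):
    # f*(x) = argmax_y P(X = x | Y = y) P(Y = y)
    scores = {y: gaussian_pdf(x, *params[y]) * priors[y] for y in priors}
    return max(scores, key=scores.get)

print(bayes_classifier(-0.3))  # 0: closer to the mean of class 0
print(bayes_classifier(0.7))   # 1
```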

5

We can now consider appropriate models for the two terms:

• Class probability P(Y = y)
• Class conditional distribution of features P(X = x | Y = y)

Model-based Approach

Modeling the class probability: P(Y = y) = Bernoulli(θ), i.e., P(Y = 1) = θ and P(Y = 0) = 1 − θ.

Like a coin flip.

Modeling Class Conditional Distribution of Features

• Gaussian class conditional densities (1 dimension/feature)

6

Decision Boundary

• Gaussian class conditional densities (2 dimensions/features)

7

Modeling Class Conditional Distribution of Features

Decision Boundary

[Figure: two Gaussian class conditional densities with means µ1 and µ2]

Handwritten digit recognition

8

Note: 8 digits shown out of 10 (0, 1, …, 9); the axes φ1(X) and φ2(X) are obtained by nonlinear dimensionality reduction (later in course).

Multi-class classification

Handwritten digit recognition

9

Training Data:

Gaussian Bayes model:

P(Y = y) = p_y for all y in 0, 1, 2, …, 9 (p_0, p_1, …, p_9 sum to 1)

P(X = x | Y = y) ~ N(μ_y, Σ_y) for each y, where μ_y is a d-dim vector and Σ_y is a d×d matrix

1, 2, …, n greyscale images (input X) with n labels (Y). Each image is represented as a vector X = [X_1, X_2, …, X_d]ᵀ of intensity values at the d pixels (features).

Gaussian Bayes classifier

10

P(Y = y) = p_y for all y in 0, 1, 2, …, 9 (p_0, p_1, …, p_9 sum to 1)

P(X = x | Y = y) ~ N(μ_y, Σ_y) for each y, where μ_y is a d-dim vector and Σ_y is a d×d matrix

11

Decision Boundary of Gaussian Bayes

• Binary classification with continuous features: the decision boundary is the set of points x where P(Y = 1 | X = x) = P(Y = 0 | X = x)

If the class conditional feature distribution P(X = x | Y = y) is a 2-dim Gaussian N(μ_y, Σ_y), with density (1 / √((2π)^d |Σ_y|)) exp(−(x − μ_y) Σ_y⁻¹ (x − μ_y)ᵀ / 2), then

P(Y = 1 | X = x) / P(Y = 0 | X = x)
  = [P(X = x | Y = 1) P(Y = 1)] / [P(X = x | Y = 0) P(Y = 0)]
  = √(|Σ_0| / |Σ_1|) · exp(−(x − μ_1) Σ_1⁻¹ (x − μ_1)ᵀ / 2 + (x − μ_0) Σ_0⁻¹ (x − μ_0)ᵀ / 2) · θ / (1 − θ)

Note: In general, this implies a quadratic equation in x. But if Σ_1 = Σ_0, the quadratic part cancels out and the equation is linear.
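A quick numeric check of this note (parameters invented for illustration): when Σ_1 = Σ_0 the |Σ| and quadratic terms cancel, so the log-odds is affine in x, and the log-odds at a midpoint equals the average of the endpoint log-odds.

```python
import numpy as np

# Made-up shared covariance, means, and prior for a 2-d example.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
theta = 0.5                      # P(Y = 1)
Sinv = np.linalg.inv(Sigma)

def log_odds(x):
    # log of the ratio on this slide; the |Sigma| terms cancel when equal
    q1 = -0.5 * (x - mu1) @ Sinv @ (x - mu1)
    q0 = -0.5 * (x - mu0) @ Sinv @ (x - mu0)
    return q1 - q0 + np.log(theta / (1 - theta))

# Linearity check: the log-odds at a midpoint equals the average log-odds.
a, b = np.array([0.0, 1.0]), np.array([3.0, -2.0])
print(np.isclose(log_odds((a + b) / 2), (log_odds(a) + log_odds(b)) / 2))  # True
```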

Gaussian Bayes classifier

12

P(Y = y) = p_y for all y in 0, 1, 2, …, 9 (p_0, p_1, …, p_9 sum to 1)

P(X = x | Y = y) ~ N(μ_y, Σ_y) for each y, where μ_y is a d-dim vector and Σ_y is a d×d matrix

How to learn the parameters p_y, μ_y, Σ_y from data?

How many parameters do we need to learn?

13

Class probability: P(Y = y) = p_y for all y in 0, 1, 2, …, 9 (p_0, p_1, …, p_9 sum to 1) → K − 1 parameters if K labels

Class conditional distribution of features: P(X = x | Y = y) ~ N(μ_y, Σ_y) for each y (μ_y a d-dim vector, Σ_y a d×d matrix) → Kd + Kd(d+1)/2 = O(Kd²) parameters if d features

Quadratic in dimension d! If d = 256×256 pixels, ~21.5 billion parameters!
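A quick sanity check of this count in Python, using the slide's K = 10 classes and d = 256×256 pixels:

```python
K, d = 10, 256 * 256
# priors (K-1) + means (K*d) + symmetric covariances (K*d*(d+1)/2)
n_params = (K - 1) + K * d + K * d * (d + 1) // 2
print(f"{n_params:,}")   # 21,475,819,529  (~21.5 billion)
```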

What about discrete features?

14

Training Data:

Discrete Bayes model:

P(Y = y) = p_y for all y in 0, 1, 2, …, 9 (p_0, p_1, …, p_9 sum to 1)

P(X = x | Y = y): for each label y, maintain a probability table with 2^d − 1 entries

1, 2, …, n black-and-white images (input X) with n labels (Y). Each image is represented as a vector X = [X_1, X_2, …, X_d]ᵀ of d binary features (black = 1, white = 0).

How many parameters do we need to learn?

15

Class probability: P(Y = y) = p_y for all y in 0, 1, 2, …, 9 (p_0, p_1, …, p_9 sum to 1) → K − 1 parameters if K labels

Class conditional distribution of features: P(X = x | Y = y), a probability table with 2^d − 1 entries per label → K(2^d − 1) parameters if d binary features

Exponential in dimension d!

What's wrong with too many parameters?

• How many training data points are needed to learn one parameter (the bias of a coin)?

• We need lots of training data to learn the parameters: training data > number of parameters.

16

Naïve Bayes Classifier

17

• Bayes Classifier with the additional "naïve" assumption:
  – Features are independent given the class:

    P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)

  – More generally, for X = [X_1, X_2, …, X_d]ᵀ:

    P(X_1, …, X_d | Y) = ∏_{i=1}^d P(X_i | Y)

• If the conditional independence assumption holds, NB is the optimal classifier! But it is worse otherwise.

Conditional Independence

18

• X is conditionally independent of Y given Z: the probability distribution governing X is independent of the value of Y, given the value of Z

• Equivalent to: P(X | Y, Z) = P(X | Z)

• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning). Note: this does NOT mean Thunder is independent of Rain.

Conditional vs. Marginal Independence

19

Conditional vs. Marginal Independence

20

Wearing coats is independent of accidents conditioned on the fact that it rained.
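The following toy sketch (all probabilities invented for illustration) makes the distinction concrete: Coat and Accident are independent given Rain by construction, yet dependent marginally, since rain makes both more likely.

```python
import numpy as np

# Toy joint distribution over (Coat, Accident, Rain), built so that Coat
# and Accident are independent GIVEN Rain, but dependent marginally.
p_rain = {1: 0.3, 0: 0.7}
p_coat = {1: 0.9, 0: 0.1}    # P(Coat = 1 | Rain = r)
p_acc  = {1: 0.4, 0: 0.05}   # P(Accident = 1 | Rain = r)

def joint(c, a, r):
    pc = p_coat[r] if c else 1 - p_coat[r]
    pa = p_acc[r] if a else 1 - p_acc[r]
    return p_rain[r] * pc * pa

# Marginally: P(C=1, A=1) != P(C=1) P(A=1)
p_ca = sum(joint(1, 1, r) for r in (0, 1))
p_c  = sum(joint(1, a, r) for a in (0, 1) for r in (0, 1))
p_a  = sum(joint(c, 1, r) for c in (0, 1) for r in (0, 1))
print(np.isclose(p_ca, p_c * p_a))   # False: dependent marginally

# Conditionally on Rain = 1: P(C=1, A=1 | R=1) = P(C=1|R=1) P(A=1|R=1)
print(np.isclose(joint(1, 1, 1) / p_rain[1], p_coat[1] * p_acc[1]))  # True
```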

Naïve Bayes Classifier

21

• Bayes Classifier with the additional "naïve" assumption: features are independent given the class

• How many parameters now?

Handwritten digit recognition (continuous features)

22

Training Data:

How many parameters?

Class probability: P(Y = y) = p_y for all y → K − 1 parameters if K labels

Class conditional distribution of features (using the Naïve Bayes assumption):

P(X_i = x_i | Y = y) ~ N(μ_i(y), σ_i²(y)) for each y and each pixel i → 2Kd parameters

1, 2, …, n greyscale images with d pixels (input X) and n labels (Y); each image is the vector X = [X_1, X_2, …, X_d]ᵀ.

(The conditional independence assumption may not hold.)

Linear instead of quadratic in d!

Handwritten digit recognition (discrete features)

23

Training Data:

How many parameters?

Class probability: P(Y = y) = p_y for all y → K − 1 parameters if K labels

Class conditional distribution of features (using the Naïve Bayes assumption):

P(X_i = x_i | Y = y): one probability value for each y and each pixel i → Kd parameters

1, 2, …, n black-and-white (1/0) images with d pixels (input X) and n labels (Y); each image is the vector X = [X_1, X_2, …, X_d]ᵀ.

(The conditional independence assumption may not hold.)

Linear instead of exponential in d!
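For scale, a quick comparison of the discrete-feature parameter counts with and without the naïve assumption, using K = 10 and d = 256×256 as before:

```python
K, d = 10, 256 * 256
print(f"{K * d:,}")       # 655,360 parameters with the NB assumption
full = K * (2 ** d - 1)   # full joint probability tables
print(len(str(full)))     # ~19,730 digits: astronomically many parameters
```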

Naïve Bayes Classifier

24

• Bayes Classifier with the additional "naïve" assumption: features are independent given the class

• Has fewer parameters, and hence requires less training data, even though the assumption may be violated in practice

Naïve Bayes Algorithm – Discrete features

25

• Training Data: n labeled examples (x^(j), y^(j))

• Maximum Likelihood Estimates
  – For the class probability: P̂(Y = b) = #D{Y = b} / n
  – For the class conditional distribution: P̂(X_i = a | Y = b) = #D{X_i = a, Y = b} / #D{Y = b}

• NB prediction for test data x: ŷ = argmax_y P̂(Y = y) ∏_{i=1}^d P̂(X_i = x_i | Y = y)
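A minimal Python sketch of these estimates and the prediction rule, on a made-up toy dataset of binary features. Note the weakness previewed here: a zero count gives log 0 = −∞, which is exactly the problem addressed on the next slides.

```python
import numpy as np

# Minimal sketch of discrete Naive Bayes trained by maximum likelihood
# (pure counting) on a made-up toy dataset of binary feature vectors.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0])
labels = np.unique(y)

prior = {c: np.mean(y == c) for c in labels}        # MLE of P(Y = c)
cond = {c: X[y == c].mean(axis=0) for c in labels}  # MLE of P(X_i = 1 | Y = c)

def predict(x):
    # argmax_c  log P(Y = c) + sum_i log P(X_i = x_i | Y = c)
    def log_score(c):
        p = np.where(x == 1, cond[c], 1 - cond[c])
        # A zero count gives log(0) = -inf here, knocking the class out
        # entirely -- the MLE problem discussed on the following slides.
        return np.log(prior[c]) + np.sum(np.log(p))
    return max(labels, key=log_score)

print(predict(np.array([1, 0, 1])))   # 1
```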

Issues with Naïve Bayes

26

• Issue 1: Usually, features are not conditionally independent:

  P(X_1, …, X_d | Y) ≠ ∏_i P(X_i | Y)

  Nonetheless, NB is the single most used classifier; particularly when data is limited, it works well.

• Issue 2: Typically we use MAP estimates instead of MLE, since insufficient data may cause the MLE to be zero.

Insufficient data for MLE

27

• What if you never see a training instance where X_1 = a when Y = b?
  – e.g., b = {Spam Email}, a = {'Earn'}
  – Then P̂(X_1 = a | Y = b) = 0

• Thus, no matter what values X_2, …, X_d take:

  P̂(Y = b) P̂(X_1 = a | Y = b) ∏_{i=2}^d P̂(X_i = x_i | Y = b) = 0

• What now???

Naïve Bayes Algorithm – Discrete features

28

• Training Data

• Maximum A Posteriori (MAP) Estimates – add m "virtual" data points, where m is the number of virtual examples with Y = b

Assume some given prior distribution q over feature values (typically uniform):

MAP Estimate: P̂(X_i = a | Y = b) = (#D{X_i = a, Y = b} + m·q) / (#D{Y = b} + m)

Now, even if you never observe a class/feature combination, the estimated probability is never zero.
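A small sketch of this smoothed estimate (assuming, for illustration, a binary feature with a uniform prior q = 1/2 and m = 2 virtual examples):

```python
def map_estimate(count_a_and_b, count_b, m=2, q=0.5):
    # P_hat(X_i = a | Y = b) = (#{X_i = a, Y = b} + m*q) / (#{Y = b} + m)
    return (count_a_and_b + m * q) / (count_b + m)

# Even if feature value a never occurs with class b, the estimate stays > 0:
print(map_estimate(0, 10))   # 0.0833... rather than the MLE's 0.0
```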

Case Study: Text Classification

29

• Classify e-mails: Y = {Spam, Not Spam}

• Classify news articles: Y = {what is the topic of the article?}

• Classify web pages: Y = {Student, Professor, Project, …}

• What about the features X? The text!

Bag of words approach

30

aardvark 0

about 2

all 2

Africa 1

apple 0

anxious 0

...

gas 1

...

oil 1

Zaire 0

NB for Text Classification

31

• Features X are the counts of how many times each word in the vocabulary appears in the document

• The probability table for P(X | Y) is huge!!!

• The NB assumption helps a lot!!!

• Bag of words + the Naïve Bayes assumption imply that P(X | Y = y) is just the product of the probabilities of each word, raised to its count, in a document on topic y

Bag of words model

32

• Typical additional assumption: position in the document doesn't matter
  – "Bag of words" model: the order of words on the page is ignored
  – Sounds really silly, but often works very well!

in is lecture lecture next over person remember room sitting the the the to to up wake when you

Bag of words model

33

• Typical additional assumption: position in the document doesn't matter
  – "Bag of words" model: the order of words on the page is ignored
  – Sounds really silly, but often works very well!

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.

NB with Bag of Words for text classification

34

• Learning phase:
  – Class prior P(Y): fraction of times topic Y appears in the collection of documents
  – P(w | Y): fraction of times word w appears in documents with topic Y

• Test phase:
  – For each document, use the bag of words + naïve Bayes decision rule
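A minimal end-to-end sketch of both phases on a tiny made-up corpus (documents, words, and labels are all invented; the +1 "virtual count" smoothing follows the MAP estimate shown earlier):

```python
import numpy as np
from collections import Counter

docs = [("win money now", "spam"), ("meeting notes attached", "ham"),
        ("win a free meeting", "spam"), ("notes on money policy", "ham")]

vocab = sorted({w for text, _ in docs for w in text.split()})
classes = sorted({y for _, y in docs})

# Learning phase: class prior and per-topic word probabilities.
prior = {y: sum(1 for _, c in docs if c == y) / len(docs) for y in classes}
theta = {}
for y in classes:
    counts = Counter(w for text, c in docs if c == y for w in text.split())
    total = sum(counts.values()) + len(vocab)   # +1 virtual count per word
    theta[y] = {w: (counts[w] + 1) / total for w in vocab}

# Test phase: bag of words + naive Bayes decision rule.
def classify(text):
    words = [w for w in text.split() if w in vocab]
    return max(classes, key=lambda y: np.log(prior[y])
               + sum(np.log(theta[y][w]) for w in words))

print(classify("free money"))   # spam
```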

Twenty newsgroups results

35

What if features are continuous?

36

E.g., character recognition: X_i is the intensity at the i-th pixel.

Gaussian Naïve Bayes (GNB): P(X_i = x | Y = k) ~ N(μ_ik, σ_ik²)

Different mean and variance for each class k and each pixel i.

Sometimes we assume the variance
• is independent of Y (i.e., σ_i),
• or independent of X_i (i.e., σ_k),
• or both (i.e., σ).

Estimating parameters: Y discrete, X_i continuous

37

Maximum likelihood estimates, where X_i^(j) is the i-th pixel in the j-th training image and k indexes the class:

μ̂_ik = (1 / #{j : y^(j) = k}) Σ_{j : y^(j) = k} X_i^(j)

σ̂_ik² = (1 / #{j : y^(j) = k}) Σ_{j : y^(j) = k} (X_i^(j) − μ̂_ik)²
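A NumPy sketch of these estimates on randomly generated stand-in "images" (sizes, seed, and data are assumptions for illustration), followed by the resulting GNB class score:

```python
import numpy as np

# Per-class, per-pixel ML estimates on made-up data.
rng = np.random.default_rng(0)
X = rng.random((100, 64))           # n = 100 images, d = 64 pixels
y = rng.integers(0, 10, size=100)   # labels 0..9
K = 10

mu = np.stack([X[y == k].mean(axis=0) for k in range(K)])   # mu_hat[k, i]
var = np.stack([X[y == k].var(axis=0) for k in range(K)])   # sigma_hat^2[k, i]
prior = np.array([np.mean(y == k) for k in range(K)])       # p_hat[k]

def log_posterior(x, k):
    # log p_k + sum_i log N(x_i; mu_ik, sigma_ik^2), up to a constant in k
    return np.log(prior[k]) - 0.5 * np.sum(
        np.log(2 * np.pi * var[k]) + (x - mu[k]) ** 2 / var[k])

print(max(range(K), key=lambda k: log_posterior(X[0], k)))
```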

Example: GNB for classifying mental states

38

• ~1 mm resolution
• ~2 images per sec.
• 15,000 voxels/image
• non-invasive, safe
• measures the Blood Oxygen Level Dependent (BOLD) response

[Mitchell et al.]

Gaussian Naïve Bayes: Learned µ_{voxel,word}

39

[Mitchell et al.]

15,000 voxels (features); 10 training examples (subjects) per class (12 word categories)

Learned Naïve Bayes Models – Means for P(Brain Activity | Word Category)

40

[Figure: learned means for Animal words vs. People words.] Pairwise classification accuracy: 85% [Mitchell et al.]

What you should know…

41

• Optimal decision using the Bayes Classifier

• Naïve Bayes classifier
  – What's the assumption
  – Why we use it
  – How we learn it
  – Why MAP estimation is important

• Text classification
  – Bag of words model

• Gaussian NB
  – Features are still conditionally independent
  – Each feature has a Gaussian distribution given the class

Gaussian Naïve Bayes vs. Logistic Regression

42

• Representation equivalence (both yield linear decision boundaries)
  – But only in a special case!!! (GNB with class-independent variances)
  – LR makes no assumptions about P(X | Y) in learning!!!
  – They optimize different functions (MLE/MCLE or MAP/MCAP), and obtain different solutions

[Figure: the set of Gaussian Naïve Bayes parameters (with feature variance independent of the class label) is contained in the set of Logistic Regression parameters]

Discriminative vs. Generative Classifiers

43

Generative (model-based) approach, e.g., Naïve Bayes:
• Assume some probability model for P(Y) and P(X | Y)
• Estimate the parameters of the probability models from training data

Discriminative (model-free) approach, e.g., Logistic Regression: Why not learn P(Y | X) directly? Or better yet, why not learn the decision boundary directly?
• Assume some functional form for P(Y | X) or for the decision boundary
• Estimate the parameters of the functional form directly from training data

Optimal Classifier: