Lecture 10 - Naïve Bayes

Introduction to Machine Learning

Lecture 10: Bayesian decision theory – Naïve Bayes

Albert Orriols i Puig

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Recap of Lecture 9

The Bayesian approach outputs the most probable hypothesis h∈H, given the data D plus knowledge about the prior probabilities of the hypotheses in H

Terminology:

P(h|D): probability that h holds given data D. Posterior probability of h; confidence that h holds given D.

P(h): prior probability of h (the background knowledge we have that h is a correct hypothesis)

P(D): prior probability that training data D will be observed

P(D|h): probability of observing D given h holds

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
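As a minimal sketch of the formula above, the following Python snippet computes a posterior from a prior and two likelihoods. The numbers are hypothetical and chosen only for illustration, and P(D) is obtained with the law of total probability over h and its complement, which is an assumption added here.

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Return P(h|D) = P(D|h) * P(h) / P(D)."""
    if evidence_d <= 0:
        raise ValueError("P(D) must be positive")
    return likelihood_d_given_h * prior_h / evidence_d


# Hypothetical values, not taken from the lecture.
p_h = 0.3               # P(h)
p_d_given_h = 0.8       # P(D|h)
p_d_given_not_h = 0.1   # P(D|not h)

# Law of total probability: P(D) = P(D|h)P(h) + P(D|not h)P(not h)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_h, p_d_given_h, p_d)
print(round(p_h_given_d, 4))
```

Observing D raises the confidence in h from the prior 0.3 to roughly 0.77 in this made-up setting.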


Today’s Agenda

Bayesian Decision Theory

Nominal Variables

Continuous Variables

A Medical Example

Naïve Bayes


Bayesian Decision Theory

Statistical approach to pattern classification

Forget about rule-based and tree-based models

We will express the problem in probabilistic terms

Goal

Classify a pattern x into one of the two classes w1 or w2 so as to minimize the probability of misclassification P(error)

Prior probability

P(wk) = fraction of times that a pattern belongs to class wk

Without more information, we have to classify a new example x’. What should we do?

$$\text{class of } x' = \begin{cases} w_1 & \text{if } P(w_1) > P(w_2) \\ w_2 & \text{otherwise} \end{cases}$$

The best option if we know nothing else about the domain!
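The prior-only rule can be sketched in a couple of lines of Python. The function name and the prior values are hypothetical; the rule itself is the one stated above.

```python
def classify_by_prior(p_w1, p_w2):
    """Assign w1 if P(w1) > P(w2), otherwise w2 (the features of x' are ignored)."""
    return "w1" if p_w1 > p_w2 else "w2"


# Hypothetical priors, for illustration only.
p_w1, p_w2 = 0.7, 0.3
label = classify_by_prior(p_w1, p_w2)

# Every new pattern receives the same label, so the rule is wrong
# exactly when the pattern belongs to the less frequent class.
p_error = min(p_w1, p_w2)
print(label, p_error)
```

With only priors available, P(error) equals the smaller of the two priors, which is why measuring features is the natural next step.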

Bayesian Decision Theory

Now, we measure a feature x1 of each example

[Figure: threshold θ along feature x1]

How should we classify these data?

As the classes overlap, x1 cannot perfectly discriminate

In the end, I want my algorithm to place a threshold that defines the class boundary

Bayesian Decision Theory

Let’s add a second feature

How should we classify these data?

An oblique line will be a good discriminant

So the problem turns out to be: how can we build or simulate this oblique line?


Bayesian Decision Theory

Assume that xi are nominal variables with possible values {xi1, xi2, …, xin}

Let’s build a table of the number of occurrences

        xi1   xi2   xin   Total
w1      1     3     0     4
w2      0     2     2     4

P(w1, xi1) = 1/8

P(w1) = 4/8

P(xi1 | w1) = 1/4

Joint probability P(wk, xij): probability of a pattern having value xij for variable xi and belonging to class wk. That is, the value of each cell divided by the total number of examples.

Priors P(wk): marginal sum of each row divided by the total number of examples.

Conditional P(xij | wk): probability that a pattern has value xij given that it belongs to class wk. That is, each cell divided by the sum of its row.
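The three definitions can be checked against the counts in the table with a short Python sketch. The dictionary layout and function names are illustrative choices; the counts are the ones from the table.

```python
from fractions import Fraction

# Occurrence counts: one row per class, one column per value of x_i.
counts = {
    "w1": {"xi1": 1, "xi2": 3, "xin": 0},
    "w2": {"xi1": 0, "xi2": 2, "xin": 2},
}
total = sum(sum(row.values()) for row in counts.values())  # 8 examples


def joint(w, x):
    """P(w, x): cell divided by the total number of examples."""
    return Fraction(counts[w][x], total)


def prior(w):
    """P(w): row sum divided by the total number of examples."""
    return Fraction(sum(counts[w].values()), total)


def conditional(x, w):
    """P(x | w): cell divided by the sum of its row."""
    return Fraction(counts[w][x], sum(counts[w].values()))


print(joint("w1", "xi1"), prior("w1"), conditional("xi1", "w1"))
```

The three quantities are tied together by P(wk, xij) = P(xij | wk) · P(wk), which the counts satisfy exactly.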


Bayesian Decision Theory

From nominal to continuous attributes

From probability mass functions to probability density functions (PDFs)

$$P(x \in [a, b]) = \int_a^b p(x)\,dx, \quad \text{where } \int_X p(x)\,dx = 1$$

As well, we have class-conditional PDFs p(x | wk)

If we have d random variables x = (x1, …, xd)

$$P(\vec{x} \in R) = \int_R p(\vec{x})\,d\vec{x}$$
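To make the one-dimensional integral concrete, here is a small numerical sketch. The choice of a Gaussian density and of the midpoint rule are assumptions made for illustration; the lecture does not fix a particular PDF at this point.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution (an assumed example PDF)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def prob_interval(pdf, a, b, steps=10_000):
    """Approximate P(x in [a, b]) = integral of pdf over [a, b] (midpoint rule)."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width


p_within_one_sigma = prob_interval(gaussian_pdf, -1.0, 1.0)
p_almost_everything = prob_interval(gaussian_pdf, -10.0, 10.0)
print(round(p_within_one_sigma, 4), round(p_almost_everything, 4))
```

A density value p(x) is not itself a probability; only its integral over a region is, and the integral over the whole domain is 1.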


Naïve Bayes

But step back… I still need to learn the probabilities from data described by

Nominal attributes

Continuous attributes

That is, given a new instance with attributes (a1,a2,…,an), the Bayesian approach classifies it to the most probable value vMAP

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \dots, a_n)$$

Using Bayes’ theorem:

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \dots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)$$

How to compute P(vj) and P(a1,a2,…,an|vj)?


Naïve Bayes

How to compute P(vj)?

P(vj): count the frequency with which each target value vj occurs in the training data.

How to compute P(a1,a2,…,an|vj)?

P(a1,a2,…,an|vj): we would need a very large dataset. The number of these terms equals the number of possible instances times the number of possible target values (infeasible).

Simplifying assumption: the attribute values are conditionally independent given the target value. I.e., the probability of observing (a1,a2,…,an) given vj is the product of the probabilities for the individual attributes:

$$P(a_1, a_2, \dots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$
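A quick count shows how much the independence assumption saves. The sketch below compares the number of probability terms with and without it; the attribute cardinalities (3, 3, 2, 2) and the two target values are an illustrative assumption.

```python
from math import prod


def count_terms(values_per_attribute, n_classes):
    """Return (terms for the full P(a1..an|vj), terms for the P(ai|vj) factors)."""
    full_joint = prod(values_per_attribute) * n_classes  # possible instances x classes
    naive = sum(values_per_attribute) * n_classes        # attribute values x classes
    return full_joint, naive


# Assumed setting: four nominal attributes with 3, 3, 2 and 2 values, two classes.
full_joint, naive = count_terms([3, 3, 2, 2], 2)
print(full_joint, naive)
```

The gap grows exponentially with the number of attributes: the first count multiplies the cardinalities, the second only adds them.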


Naïve Bayes

Prediction of the Naïve Bayes classifier:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$

The learning algorithm:

Training: estimate the probabilities P(vj) and P(ai|vj) based on their frequencies over the training data

Output after training: the learned hypothesis consists of the set of estimates

Test: use the formula above to classify new instances

Observations:

Number of distinct P(ai|vj) terms = number of distinct attribute values times the number of distinct target values

The algorithm does not perform an explicit search through the space of possible hypotheses (the space of possible hypotheses is the space of possible values that can be assigned to the various probabilities).
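The training and test steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the data layout (tuples of nominal values) and the four-example toy dataset are all hypothetical, and unseen attribute values simply get frequency 0.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples, labels):
    """Estimate P(v_j) and P(a_i|v_j) as relative frequencies."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {v: c / n for v, c in class_counts.items()}
    # cond_counts[(i, v)][a] = number of class-v examples whose attribute i equals a
    cond_counts = defaultdict(Counter)
    for attrs, v in zip(examples, labels):
        for i, a in enumerate(attrs):
            cond_counts[(i, v)][a] += 1
    likelihoods = {
        (i, v): {a: c / class_counts[v] for a, c in counter.items()}
        for (i, v), counter in cond_counts.items()
    }
    return priors, likelihoods


def predict_naive_bayes(priors, likelihoods, attrs):
    """Return v_NB = argmax_v P(v) * prod_i P(a_i|v), plus the per-class scores."""
    scores = {}
    for v, p_v in priors.items():
        score = p_v
        for i, a in enumerate(attrs):
            score *= likelihoods[(i, v)].get(a, 0.0)  # value never seen with v
        scores[v] = score
    return max(scores, key=scores.get), scores


# Tiny hypothetical dataset: (colour, size) -> label
X = [("red", "big"), ("red", "small"), ("blue", "big"), ("blue", "small")]
y = ["pos", "pos", "pos", "neg"]
priors, likelihoods = train_naive_bayes(X, y)
label, scores = predict_naive_bayes(priors, likelihoods, ("red", "big"))
print(label, scores)
```

Training is a single counting pass over the data, which reflects the observation that no explicit search over hypotheses takes place.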


Example

Given the training examples:

Day   Outlook    Temperature   Humidity   Wind     PlayTennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          High       Strong   No

Classify the new instance:


(Outlook=sunny, Temp=cool, Humidity=high, Wind=strong)

Example

Naïve Bayes training

Outlook       P(· | Yes)   P(· | No)
Sunny         2/9          3/5
Overcast      4/9          0/5
Rain          3/9          2/5

Temperature   P(· | Yes)   P(· | No)
Hot           2/9          2/5
Mild          4/9          2/5
Cool          3/9          1/5

Humidity      P(· | Yes)   P(· | No)
High          3/9          4/5
Normal        6/9          1/5

Wind          P(· | Yes)   P(· | No)
Weak          6/9          2/5
Strong        3/9          3/5

P(Yes) = 9/14

P(No) = 5/14

Test: classify (Outlook=sunny, Temp=cool, Humidity=high, Wind=strong)

Yes: 9/14·2/9·3/9·3/9·3/9 = 0.0053

No: 5/14·3/5·1/5·4/5·3/5 = 0.0206

max {0.0053, 0.0206} = 0.0206, attained by No


Do not play tennis!
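The arithmetic of this test can be reproduced exactly with Python fractions. The probabilities are the ones estimated during training; the dictionary layout is an illustrative choice, and the final normalization step is an addition that the slide does not perform.

```python
from fractions import Fraction as F

prior = {"Yes": F(9, 14), "No": F(5, 14)}

# P(a_i | v_j) for the instance (sunny, cool, high, strong)
cond = {
    "Yes": {"Outlook=sunny": F(2, 9), "Temp=cool": F(3, 9),
            "Humidity=high": F(3, 9), "Wind=strong": F(3, 9)},
    "No": {"Outlook=sunny": F(3, 5), "Temp=cool": F(1, 5),
           "Humidity=high": F(4, 5), "Wind=strong": F(3, 5)},
}

scores = {}
for v in prior:
    score = prior[v]
    for p in cond[v].values():
        score *= p  # P(v_j) * prod_i P(a_i | v_j)
    scores[v] = score

v_nb = max(scores, key=scores.get)

# Added step: normalize the two scores so that they sum to 1.
posterior_no = scores["No"] / sum(scores.values())

print(v_nb, {v: round(float(s), 4) for v, s in scores.items()})
print(round(float(posterior_no), 3))
```

The raw scores are not probabilities because the common denominator P(a1,…,an) was dropped; dividing by their sum recovers the posterior, about 0.795 for No.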

Estimation of Probabilities

The frequency-based process explained above can lead to poor estimates if the number of observations is small

E.g.: the true P(Outlook=overcast | No) might be 0.008, but we only have 5 examples of class No, so the frequency estimate is 0/5 = 0 and the whole product for No becomes 0

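Use the following estimate

$$\frac{n_c + m\,p}{n + m}$$

where n is the number of training examples of class vj, n_c is the number of those that have attribute value ai, p is the prior estimate of the probability we wish to determine, and m is a constant, the equivalent sample size, which determines how heavily the prior p is weighted relative to the observed data

Assuming a uniform distribution, p = 1/k, where k is the number of values of the attribute.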

E.g., P(Outlook=overcast | No):

$$\frac{n_c + m\,p}{n + m} = \frac{0 + 2 \cdot 3/10}{5 + 2}$$
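The smoothed estimate, often called the m-estimate, is a one-line function. In the sketch below the function name and the value m = 3 are assumptions chosen for illustration, with the uniform prior p = 1/k for the k = 3 values of Outlook.

```python
from fractions import Fraction


def m_estimate(n_c, n, p, m):
    """Smoothed estimate (n_c + m*p) / (n + m) of a conditional probability."""
    return (n_c + m * p) / (n + m)


k = 3                # Outlook takes three values: sunny, overcast, rain
p = Fraction(1, k)   # uniform prior estimate
m = 3                # hypothetical equivalent sample size

# Overcast never occurs among the 5 "No" examples: n_c = 0, n = 5.
smoothed = m_estimate(0, 5, p, m)
unsmoothed = m_estimate(0, 5, p, 0)  # m = 0 falls back to n_c / n
print(smoothed, unsmoothed)
```

With p = 1/k and m = k the formula becomes (n_c + 1)/(n + k), i.e. add-one smoothing, so no conditional probability is ever exactly 0 and a single unseen value cannot zero the whole product.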


Next Class

Neural Networks and Support Vector Machines
