Lecture10 - Naïve Bayes


Transcript of Lecture10 - Naïve Bayes

Page 1: Lecture10 - Naïve Bayes

Introduction to Machine Learning

Lecture 10: Bayesian decision theory – Naïve Bayes

Albert Orriols i Puig – aorriols@salle.url.edu

Artificial Intelligence – Machine Learning, Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Page 2: Lecture10 - Naïve Bayes

Recap of Lecture 9

Bayesian learning outputs the most probable hypothesis h ∈ H, given the data D plus knowledge about prior probabilities of the hypotheses in H.

Terminology:

P(h|D): probability that h holds given data D. Posterior probability of h; confidence that h holds given D.

P(h): prior probability of h (background knowledge we have about h being a correct hypothesis)

P(D): prior probability that training data D will be observed

P(D|h): probability of observing D given h holds

P(h|D) = P(D|h)·P(h) / P(D)


Page 3: Lecture10 - Naïve Bayes

Bayes’ Theorem

Given H, the space of possible hypotheses:

The most probable hypothesis is the one that maximizes P(h|D):

h_MAP ≡ argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h)·P(h) / P(D) = argmax_{h ∈ H} P(D|h)·P(h)


Page 4: Lecture10 - Naïve Bayes

Today’s Agenda

Bayesian Decision Theory
    Nominal Variables
    Continuous Variables
A Medical Example
Naïve Bayes


Page 5: Lecture10 - Naïve Bayes

Bayesian Decision Theory

Statistical approach to pattern classification

Forget about rule-based and tree-based models

We will express the problem in probabilistic terms

Goal: classify a pattern x into one of the two classes, w1 or w2, so as to minimize the probability of misclassification P(error)

Prior probability

P(wk) = fraction of times that a pattern belongs to class wk

Without more information, we have to classify a new example x’. What should we do?

class of x = w1 if P(w1) > P(w2), w2 otherwise

The best option if we know nothing else about the domain!
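As a minimal illustration of this prior-only rule, the sketch below estimates the priors from class labels and always predicts the a priori most probable class; the function and variable names are mine, not from the slides.

```python
from collections import Counter

def prior_only_classifier(train_labels):
    """Estimate class priors P(w_k) from labels and return a rule that
    always predicts the most probable class a priori."""
    counts = Counter(train_labels)                         # occurrences of each class
    total = sum(counts.values())
    priors = {c: n / total for c, n in counts.items()}     # P(w_k)
    best_class = max(priors, key=priors.get)               # argmax_k P(w_k)
    return (lambda x: best_class), priors

# Example: 5 patterns of class w1 and 3 of class w2
classify, priors = prior_only_classifier(["w1"] * 5 + ["w2"] * 3)
print(priors)             # {'w1': 0.625, 'w2': 0.375}
print(classify("x_new"))  # 'w1', regardless of the new pattern
```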



Page 6: Lecture10 - Naïve Bayes

Bayesian Decision Theory

Now, we measure a feature x1 of each example

[Figure: distributions of feature x1 for the two classes, separated by a decision threshold θ]

How should we classify these data? As the classes overlap, x1 cannot perfectly discriminate between them.

In the end, we want the algorithm to place a threshold that defines the class boundary.


Page 7: Lecture10 - Naïve Bayes

Bayesian Decision Theory

L t’ dd d f tLet’s add a second feature

How should we classify these data? An oblique line will be a good discriminant.

So the problem turns out to be: How can we build or simulate this oblique line?


Page 8: Lecture10 - Naïve Bayes

Bayesian Decision Theory

Assume that xi are nominal variables with possible values {xi1, xi2, …, xin}

Let’s build a table of the number of occurrences:

         xi1   xi2   xin   Total
    w1    1     3     0      4
    w2    0     2     2      4

From this table: P(w1, xi1) = 1/8, P(w1) = 4/8, P(xi1|w1) = 1/4

Joint probability P(wk, xij): probability of a pattern having value xij for variable xi and belonging to class wk. That is, the value of each cell divided by the total number of examples.

Priors P(wk): marginal sum of each row, divided by the total number of examples.

Conditional P(xij|wk): probability that a pattern has value xij given that it belongs to class wk. That is, each cell divided by the sum of its row.
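A small sketch of these three estimates, assuming the eight-example table above; the dictionary layout and names are illustrative, not from the slides.

```python
# Counts from the table above: rows are classes, columns are values of x_i
counts = {
    "w1": {"xi1": 1, "xi2": 3, "xin": 0},
    "w2": {"xi1": 0, "xi2": 2, "xin": 2},
}
total = sum(sum(row.values()) for row in counts.values())   # 8 examples

# Joint P(w_k, x_ij): cell / total number of examples
joint = {(c, v): n / total for c, row in counts.items() for v, n in row.items()}

# Prior P(w_k): row sum / total number of examples
prior = {c: sum(row.values()) / total for c, row in counts.items()}

# Conditional P(x_ij | w_k): cell / sum of its row
cond = {(c, v): n / sum(row.values()) for c, row in counts.items() for v, n in row.items()}

print(joint[("w1", "xi1")], prior["w1"], cond[("w1", "xi1")])   # 0.125 0.5 0.25
```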


Page 9: Lecture10 - Naïve Bayes

Bayesian Decision Theory

Recall that P(A, B) = P(B|A)·P(A) = P(A|B)·P(B)

P(wk, xij) = P(xij|wk)·P(wk)

P(wk, xij) = P(wk|xij)·P(xij)

Therefore

P(wk|xij) = P(xij|wk)·P(wk) / P(xij)

We have all these values from the table of occurrences.

And the class:

class of x = argmax_{k=1,2} P(wk|xij)
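A short continuation of the previous sketch, with the probability estimates hard-coded from the table two slides back (illustrative names; the evidence P(xij) is obtained by summing the numerator over classes):

```python
# Estimates from the occurrence table (8 examples)
prior = {"w1": 0.5, "w2": 0.5}                                       # P(w_k)
cond = {("w1", "xi1"): 1/4, ("w1", "xi2"): 3/4, ("w1", "xin"): 0/4,
        ("w2", "xi1"): 0/4, ("w2", "xi2"): 2/4, ("w2", "xin"): 2/4}  # P(x_ij | w_k)

def classify(value):
    """Pick argmax_k P(w_k | x_ij) via Bayes' theorem."""
    evidence = sum(cond[(c, value)] * prior[c] for c in prior)       # P(x_ij)
    posterior = {c: cond[(c, value)] * prior[c] / evidence for c in prior}
    return max(posterior, key=posterior.get), posterior

print(classify("xi2"))   # ('w1', {'w1': 0.6, 'w2': 0.4})
```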

Page 10: Lecture10 - Naïve Bayes

Bayesian Decision Theory

From nominal to continuous attributes

From probability mass functions to probability density functions (PDFs)

P(x ∈ [a, b]) = ∫_a^b p(x) dx,   where ∫_X p(x) dx = 1

As well, we have class-conditional PDFs p(x|wk)

If we have d random variables, x = (x1, …, xd):

P(x ∈ R) = ∫_R p(x) dx
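The lecture does not fix a particular density model here; a common choice (the one used by Gaussian naïve Bayes) is to fit a normal PDF per class. A minimal sketch, assuming one continuous feature and made-up sample values:

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density with the given mean and variance, evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian(samples):
    """Estimate the mean and variance of a class-conditional PDF from data."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var

# Hypothetical measurements of feature x1 for each class
mean1, var1 = fit_gaussian([4.8, 5.1, 5.6, 4.9])   # class w1
mean2, var2 = fit_gaussian([6.9, 7.4, 7.1, 6.6])   # class w2

x_new = 6.0
print(gaussian_pdf(x_new, mean1, var1))   # p(x_new | w1)
print(gaussian_pdf(x_new, mean2, var2))   # p(x_new | w2)
```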


Page 11: Lecture10 - Naïve Bayes

Naïve Bayes

But, stepping back… I still need to learn the probabilities from data described by:

Nominal attributes

Continuous attributes

That is, given a new instance with attributes (a1, a2, …, an), the Bayesian approach classifies it to the most probable value vMAP:

vMAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)

Using Bayes’ theorem:

vMAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj)·P(vj) / P(a1, a2, …, an) = argmax_{vj ∈ V} P(a1, a2, …, an | vj)·P(vj)

How to compute P(vj) and P(a1,a2,…,an|vj) ?


Page 12: Lecture10 - Naïve Bayes

Naïve Bayes

How to compute P(vj)?

P(vj): counting the frequency with which each target value vj occurs in the training data.

How to compute P(a1,a2,…,an|vj)?

P(a1,a2,…,an|vj): we would need a very large dataset. The number of these terms equals the number of possible instances times the number of possible target values (infeasible).

Simplifying assumption: the attribute values are conditionally independent given the target value, i.e., the probability of observing (a1,a2,…,an) is the product of the probabilities for the individual attributes: P(a1,a2,…,an|vj) = ∏i P(ai|vj).


Page 13: Lecture10 - Naïve Bayes

Naïve Bayes

Prediction of the Naïve Bayes classifier:

    vNB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)

The learning algorithm:

Training: estimate the probabilities P(vj) and P(ai|vj) based on their frequencies over the training data.

Output after training: the learned hypothesis consists of the set of estimates P(vj) and P(ai|vj).

Test: use the formula above to classify new instances (a code sketch follows below).

Observations:

Number of distinct P(ai|vj) terms = number of distinct attribute values times the number of distinct target values.

The algorithm does not perform an explicit search through the space of possible hypotheses (the space of possible hypotheses is the space of possible values that can be assigned to the various probabilities).
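A compact sketch of this training/test procedure for nominal attributes, using the plain frequency estimates described above (no smoothing yet); the class and variable names are mine, not from the slides.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, X, y):
        """Estimate P(v_j) and P(a_i | v_j) by counting frequencies in the training data."""
        n = len(y)
        class_counts = Counter(y)
        self.priors = {v: c / n for v, c in class_counts.items()}        # P(v_j)
        counts = defaultdict(int)                                         # (attribute index, value, class) counts
        for xs, v in zip(X, y):
            for i, a in enumerate(xs):
                counts[(i, a, v)] += 1
        # P(a_i = value | v_j) = count(value, v_j) / count(v_j)
        self.cond = {k: c / class_counts[k[2]] for k, c in counts.items()}
        return self

    def predict(self, xs):
        """Return argmax_vj P(v_j) * prod_i P(a_i | v_j); unseen value/class pairs get probability 0."""
        scores = {}
        for v, p in self.priors.items():
            for i, a in enumerate(xs):
                p *= self.cond.get((i, a, v), 0.0)
            scores[v] = p
        return max(scores, key=scores.get), scores

# Usage (hypothetical PlayTennis-style rows of nominal attribute values):
# model = NaiveBayes().fit(X_train, y_train)
# label, scores = model.predict(("Sunny", "Cool", "High", "Strong"))
```

Because unseen attribute-value/class pairs receive probability 0 here, the m-estimate discussed on slide 16 is the usual refinement.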


Page 14: Lecture10 - Naïve Bayes

Example

Given the training examples:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Classify the new instance:


(Outlook=sunny, Temp=cool, Humidity=high, Wind=strong)

Page 15: Lecture10 - Naïve Bayes

Example

Naïve Bayes training:

Outlook|Yes:      Sunny 2/9,  Overcast 4/9,  Rain 3/9
Outlook|No:       Sunny 3/5,  Overcast 0,    Rain 2/5
Temperature|Yes:  Hot 2/9,    Mild 4/9,      Cool 3/9
Temperature|No:   Hot 2/5,    Mild 2/5,      Cool 1/5
Humidity|Yes:     High 3/9,   Normal 6/9
Humidity|No:      High 4/5,   Normal 1/5
Wind|Yes:         Weak 6/9,   Strong 3/9
Wind|No:          Weak 2/5,   Strong 3/5

P(Yes)=9/14

P(No)=5/14


Test: classify (Outlook=sunny, Temp=cool, Humidity=high, Wind=strong)

max {9/14·2/9·3/9·3/9·3/9, 5/14·3/5·1/5·4/5·3/5} = max {0.0053, 0.0206} = 0.0206


Do not play tennis!
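For reference, a few lines reproducing the two scores above from the estimated probabilities (the fractions are copied from the tables on this slide):

```python
# P(Yes) * P(sunny|Yes) * P(cool|Yes) * P(high|Yes) * P(strong|Yes)
score_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9
# P(No) * P(sunny|No) * P(cool|No) * P(high|No) * P(strong|No)
score_no = 5/14 * 3/5 * 1/5 * 4/5 * 3/5
print(round(score_yes, 4), round(score_no, 4))   # 0.0053 0.0206 -> predict "No"
```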

Page 16: Lecture10 - Naïve Bayes

Estimation of Probabilities

The process explained so far to estimate probabilities can lead to poor estimates when the number of observations is small.

E.g.: P(Outlook=overcast | No) = 0, but we only have 5 examples of class No.

Use the following estimate:

    (nc + m·p) / (n + m)

where nc is the number of training examples of the class that have the attribute value, n is the number of training examples of the class, p is the prior estimate of the probability we wish to determine, and m is a constant called the equivalent sample size, which determines the weight assigned to p relative to the observed data.

Assuming a uniform distribution, p = 1/k, k being the number of values of the attribute.

E.g., P(Outlook=overcast | No):

(nc + m·p) / (n + m) = (0 + 2·(1/3)) / (5 + 2) ≈ 0.095
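A one-function sketch of this m-estimate, checked against the Outlook=overcast example above (the argument names are mine):

```python
def m_estimate(n_c, n, p, m):
    """Smoothed probability estimate (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# P(Outlook=overcast | No): 0 of the 5 "No" examples are overcast,
# uniform prior p = 1/3 (Outlook has 3 values), equivalent sample size m = 2
print(m_estimate(0, 5, 1/3, 2))   # ~0.095 instead of 0
```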


Page 17: Lecture10 - Naïve Bayes

Next Class

Neural Networks and Support Vector Machines


Page 18: Lecture10 - Naïve Bayes

Introduction to Machine Learning

Lecture 10: Bayesian decision theory – Naïve Bayes

Albert Orriols i Puig – aorriols@salle.url.edu

Artificial Intelligence – Machine Learning, Enginyeria i Arquitectura La Salle

Universitat Ramon Llull