Lecture10 - Naïve Bayes
Introduction to Machine Learning
Lecture 10: Bayesian decision theory – Naïve Bayes
Albert Orriols i Puig
[email protected]
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
Recap of Lecture 9
Goal: output the most probable hypothesis h∈H, given the data D plus knowledge about the prior probabilities of the hypotheses in H
Terminology:
P(h|D): posterior probability of h, i.e., the probability (our confidence) that h holds given the data D.
P(h): prior probability of h (background knowledge about how likely h is to be a correct hypothesis)
P(D): prior probability that training data D will be observed
P(D|h): probability of observing D given that h holds
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
Bayes’ Theorem
Given H, the space of possible hypotheses, the most probable hypothesis is the one that maximizes P(h|D):
$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
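The MAP rule can be sketched in a few lines of Python for a finite hypothesis space. This is only an illustration: the function name `map_hypothesis`, the hypothesis labels, and all the numbers are hypothetical, and the hypotheses are assumed to be mutually exclusive and exhaustive so that P(D) can be obtained by summing P(D|h)P(h).

```python
def map_hypothesis(priors, likelihoods):
    """Return (h_MAP, posteriors) for a finite hypothesis space.

    priors:      dict mapping h -> P(h)
    likelihoods: dict mapping h -> P(D|h)
    """
    # Unnormalized scores P(D|h) P(h); P(D) is the same for every h,
    # so it does not change which hypothesis attains the maximum.
    scores = {h: likelihoods[h] * priors[h] for h in priors}
    # P(D) by total probability (assumes exclusive, exhaustive hypotheses).
    evidence = sum(scores.values())
    posteriors = {h: s / evidence for h, s in scores.items()}
    h_map = max(scores, key=scores.get)
    return h_map, posteriors


# Hypothetical numbers, for illustration only.
priors = {"h1": 0.7, "h2": 0.3}
likelihoods = {"h1": 0.1, "h2": 0.5}
h_map, posteriors = map_hypothesis(priors, likelihoods)
print(h_map, posteriors)
```

Note that h2 wins even though its prior is smaller, because the data are much more likely under h2.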
Today’s Agenda
Bayesian Decision Theory
Nominal Variables
Continuous Variables
A Medical Example
Naïve Bayes
Bayesian Decision Theory
Statistical approach to pattern classification
Forget about rule-based and tree-based models
We will express the problem in probabilistic terms
Goal
Classify a pattern x into one of the two classes w1 or w2 so as to minimize the probability of misclassification P(error)
Prior probability
P(wk) = fraction of times that a pattern belongs to class wk
Without more information, we have to classify a new example x’. What should we do?
$$\text{class of } x = \begin{cases} w_1 & \text{if } P(w_1) > P(w_2) \\ w_2 & \text{otherwise} \end{cases}$$
The best option if we know nothing else about the domain!
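As a short worked step (not on the original slide): with this rule we err exactly when the pattern belongs to the less probable class, so

$$P(\text{error}) = \min\{P(w_1),\, P(w_2)\}$$

For instance, with hypothetical priors P(w1) = 0.6 and P(w2) = 0.4, always answering w1 gives P(error) = 0.4.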
Bayesian Decision Theory
Now we measure a feature x1 of each example
[Figure: threshold θ along feature x1]
How should we classify these data?
As the classes overlap, x1 cannot discriminate them perfectly
In the end, I want my algorithm to place a threshold that defines the class boundary
Bayesian Decision Theory
Let’s add a second feature
How should we classify these data?
An oblique line will be a good discriminant
So the problem turns out to be: how can we build or simulate this oblique line?
Bayesian Decision Theory
Assume that xi are nominal variables with possible values {xi1, xi2, …, xin}
Let’s build a table of the number of occurrences:

        xi1   xi2   xin   Total
w1       1     3     0      4
w2       0     2     2      4

P(w1, xi1) = 1/8
P(w1) = 4/8
P(xi1 | w1) = 1/4

Joint probability P(wk,xij): probability of a pattern having value xij for variable xi and belonging to class wk. That is, the value of each cell divided by the total number of examples.
Priors P(wk): marginal sum of each row divided by the total number of examples.
Conditional P(xij|wk): probability that a pattern has value xij given that it belongs to class wk. That is, each cell divided by the sum of its row.
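The three definitions above can be checked directly against the occurrence table. The following is a minimal sketch; the dictionary layout and variable names are choices made for this illustration, and only the counts come from the slide.

```python
from fractions import Fraction

# Occurrence counts from the slide: rows are classes, columns are values of x_i.
counts = {
    "w1": {"xi1": 1, "xi2": 3, "xin": 0},
    "w2": {"xi1": 0, "xi2": 2, "xin": 2},
}

# Total number of examples (8 in this table).
total = sum(sum(row.values()) for row in counts.values())

# Joint P(wk, xij): each cell divided by the total number of examples.
joint = {
    (w, x): Fraction(c, total)
    for w, row in counts.items()
    for x, c in row.items()
}

# Prior P(wk): row sum divided by the total number of examples.
prior = {w: Fraction(sum(row.values()), total) for w, row in counts.items()}

# Conditional P(xij | wk): each cell divided by the sum of its row.
conditional = {
    (x, w): Fraction(c, sum(row.values()))
    for w, row in counts.items()
    for x, c in row.items()
}

print(joint[("w1", "xi1")], prior["w1"], conditional[("xi1", "w1")])
```

Using `Fraction` keeps the values exact, so they can be compared with the fractions on the slide (note that 4/8 is printed in lowest terms as 1/2).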
Bayesian Decision Theory
Recall that P(A,B) = P(B|A)P(A) = P(A|B)P(B), so

$$P(w_k, x_{ij}) = P(x_{ij} \mid w_k)\,P(w_k)$$
$$P(w_k, x_{ij}) = P(w_k \mid x_{ij})\,P(x_{ij})$$

Therefore

$$P(w_k \mid x_{ij}) = \frac{P(x_{ij} \mid w_k)\,P(w_k)}{P(x_{ij})}$$

We have all these values
And the class:

$$\text{class of } x = \arg\max_{k=1,2} P(w_k \mid x_{ij})$$
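To make the rule concrete, here is a small sketch that applies it to the occurrence counts from the table on the previous slide. The helper names `posterior` and `classify` are hypothetical; only the counts come from the lecture.

```python
from fractions import Fraction

# Occurrence counts from the previous slide (rows: classes, columns: values of x_i).
counts = {
    "w1": {"xi1": 1, "xi2": 3, "xin": 0},
    "w2": {"xi1": 0, "xi2": 2, "xin": 2},
}
total = sum(sum(row.values()) for row in counts.values())


def posterior(value):
    """Return {wk: P(wk | x_i = value)} using Bayes' rule."""
    # Evidence P(xij): column sum divided by the total number of examples.
    p_x = Fraction(sum(row[value] for row in counts.values()), total)
    post = {}
    for w, row in counts.items():
        n_w = sum(row.values())
        p_w = Fraction(n_w, total)               # prior P(wk)
        p_x_given_w = Fraction(row[value], n_w)  # conditional P(xij | wk)
        post[w] = p_x_given_w * p_w / p_x
    return post


def classify(value):
    """Return the class with the highest posterior for this value."""
    post = posterior(value)
    return max(post, key=post.get)


print(posterior("xi2"), classify("xi2"))
```

For the value xi2, the posteriors are 3/5 for w1 and 2/5 for w2, so the pattern is assigned to w1.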
Bayesian Decision Theory
From nominal to continuous attributes
From probability mass functions to probability density functions (PDFs):

$$P(x \in [a,b]) = \int_a^b p(x)\,dx, \quad \text{where } \int_X p(x)\,dx = 1$$

As well, we have class-conditional PDFs p(x | wk)
If we have d random variables x = (x1, …, xd):

$$P(\vec{x} \in R) = \int_R p(\vec{x})\,d\vec{x}$$
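As an illustration of the one-dimensional formula, the sketch below integrates a density numerically. The Gaussian density is an assumption made only for this example (the slide does not fix a particular PDF), and the function names are hypothetical.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density p(x); the Gaussian choice is an assumption of this sketch."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def interval_probability(pdf, a, b, steps=10_000):
    """Approximate P(x in [a, b]) as the integral of pdf over [a, b] (midpoint rule)."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width


# Probability mass of a standard Gaussian within one standard deviation of the mean.
p_inside = interval_probability(normal_pdf, -1.0, 1.0)
# Integrating over (almost) the whole domain should give approximately 1.
p_total = interval_probability(normal_pdf, -10.0, 10.0)
print(round(p_inside, 4), round(p_total, 4))
```

Unlike a probability mass function, p(x) itself is not a probability; only its integral over an interval is.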
Naïve Bayes
But hold on… I still need to learn the probabilities from data described by
Nominal attributes
Continuous attributes
That is, given a new instance with attributes (a1,a2,…,an), the Bayesian approach classifies it to the most probable value vMAP:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \dots, a_n)$$

Using Bayes’ theorem:

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \dots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)$$

How to compute P(vj) and P(a1,a2,…,an|vj)?
Naïve Bayes
How to compute P(vj)?
P(vj): count the frequency with which each target value vj occurs in the training data.
How to compute P(a1,a2,…,an|vj)?
P(a1,a2,…,an|vj): we would need a very large dataset. The number of these terms = number of possible instances times the number of possible target values (infeasible).
Simplifying assumption: the attribute values are conditionally independent given the target value. I.e., given the target value, the probability of observing (a1,a2,…,an) is the product of the probabilities for the individual attributes.
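Written out, the simplifying assumption is:

$$P(a_1, a_2, \dots, a_n \mid v_j) = \prod_{i=1}^{n} P(a_i \mid v_j)$$

To see how much this saves, take the PlayTennis example used later in this lecture, where Outlook and Temperature have 3 values each, Humidity and Wind have 2 values each, and there are 2 target values. Without the assumption we would need one term per possible instance and target value; with it, one term per attribute value and target value:

$$\underbrace{(3 \cdot 3 \cdot 2 \cdot 2) \cdot 2}_{\text{full joint}} = 72 \qquad \text{versus} \qquad \underbrace{(3 + 3 + 2 + 2) \cdot 2}_{\text{naïve Bayes}} = 20$$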
Naïve Bayes
Prediction of the Naïve Bayes classifier:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$

The learning algorithm:
Training: Estimate the probabilities P(vj) and P(ai|vj) based on their frequencies over the training data
Output after training: The learned hypothesis consists of the set of estimates
Test: Use the formula above to classify new instances
Observations:
Number of distinct P(ai|vj) terms = number of distinct attribute values times the number of distinct target values
The algorithm does not perform an explicit search through the space of possible hypotheses (the space of possible hypotheses is the space of possible values that can be assigned to the various probabilities).
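The training and test steps can be sketched as follows for nominal attributes. This is a minimal illustration rather than a reference implementation: the function names, the data layout (a list of `(attributes, target)` pairs), and the tiny dataset are all hypothetical, and probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter


def train_naive_bayes(examples):
    """Estimate P(vj) and P(ai|vj) by relative frequency.

    examples: list of (attribute_tuple, target_value) pairs.
    Returns (priors, conditionals), where conditionals is keyed by
    (attribute_index, attribute_value, target_value).
    """
    class_counts = Counter(v for _, v in examples)
    n = len(examples)
    priors = {v: c / n for v, c in class_counts.items()}

    # pair_counts[(i, a, v)] = number of examples with a_i == a and target v.
    pair_counts = Counter()
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            pair_counts[(i, a, v)] += 1

    conditionals = {
        key: c / class_counts[key[2]] for key, c in pair_counts.items()
    }
    return priors, conditionals


def predict(priors, conditionals, attrs):
    """Return the target value maximizing P(vj) * prod_i P(ai|vj)."""
    best, best_score = None, -1.0
    for v, p_v in priors.items():
        score = p_v
        for i, a in enumerate(attrs):
            # A pair never seen in training has frequency estimate 0.
            score *= conditionals.get((i, a, v), 0.0)
        if score > best_score:
            best, best_score = v, score
    return best


# Tiny hypothetical dataset, for illustration only.
examples = [
    (("a", "x"), "pos"),
    (("a", "y"), "pos"),
    (("b", "y"), "neg"),
    (("b", "x"), "neg"),
    (("a", "y"), "neg"),
]
priors, conditionals = train_naive_bayes(examples)
print(predict(priors, conditionals, ("a", "x")))
```

The learned hypothesis is just the pair of dictionaries, which matches the observation that no explicit search over hypotheses takes place.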
Example
Given the training examples:

Day   Outlook    Temperature   Humidity   Wind     PlayTennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          High       Strong   No

Classify the new instance:
(Outlook=sunny, Temp=cool, Humidity=high, Wind=strong)
Example
Naïve Bayes training

Attribute     Value      P(value|Yes)   P(value|No)
Outlook       Sunny      2/9            3/5
Outlook       Overcast   4/9            0
Outlook       Rain       3/9            2/5
Temperature   Hot        2/9            2/5
Temperature   Mild       4/9            2/5
Temperature   Cool       3/9            1/5
Humidity      High       3/9            4/5
Humidity      Normal     6/9            1/5
Wind          Weak       6/9            2/5
Wind          Strong     3/9            3/5

P(Yes) = 9/14
P(No) = 5/14

Test: Classify (Outlook=sunny, Temp=cool, Humidity=high, Wind=strong)
max { 9/14·2/9·3/9·3/9·3/9, 5/14·3/5·1/5·4/5·3/5 } = max { .0053, .0206 } = .0206
Do not play tennis!
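The arithmetic of the test step can be reproduced exactly with fractions. In this sketch the probabilities are copied from the training table above; the dictionary layout and variable names are choices made for the illustration. It also normalizes the two scores, a step not shown on the slide, to obtain the posterior probability of each answer.

```python
from fractions import Fraction as F

# Estimates taken from the training table for the attribute values of the new instance.
prior = {"Yes": F(9, 14), "No": F(5, 14)}
cond = {
    "Yes": {"Outlook=sunny": F(2, 9), "Temp=cool": F(3, 9),
            "Humidity=high": F(3, 9), "Wind=strong": F(3, 9)},
    "No": {"Outlook=sunny": F(3, 5), "Temp=cool": F(1, 5),
           "Humidity=high": F(4, 5), "Wind=strong": F(3, 5)},
}
instance = ["Outlook=sunny", "Temp=cool", "Humidity=high", "Wind=strong"]

# Score each target value: P(vj) times the product of P(ai|vj).
scores = {}
for v in prior:
    s = prior[v]
    for a in instance:
        s *= cond[v][a]
    scores[v] = s

v_nb = max(scores, key=scores.get)

# Normalizing the scores gives the posterior probability of each target value.
normalized = {v: s / sum(scores.values()) for v, s in scores.items()}

print({v: round(float(s), 4) for v, s in scores.items()}, v_nb)
```

The scores round to .0053 for Yes and .0206 for No, and after normalization the probability of "No" is about 0.795.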
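Estimation of Probabilities
The process explained above can lead to poor estimates if the number of observations is small
E.g.: P(Outlook=overcast | No) might truly be 0.008, but we only have 5 examples of class No, so the frequency estimate is 0/5 = 0, which forces the whole product for No to zero
Use the following estimate (the m-estimate):

$$\frac{n_c + m\,p}{n + m}$$

where n is the number of training examples with target value vj, n_c is the number of those that also have attribute value ai, p is the prior estimate of the probability we wish to determine, and m is a constant, the equivalent sample size, which determines the weight assigned to the observed data
Assuming a uniform distribution, p = 1/k, where k is the number of values of the attribute.
E.g., P(Outlook=overcast | No):

$$\frac{n_c + m\,p}{n + m} = \frac{0 + 2 \cdot 1/3}{5 + 2} \approx 0.095$$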
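A minimal sketch of this estimate follows. The function name `m_estimate` is hypothetical, and the value m = 2 is an assumption read from the example on this slide; n_c = 0 and n = 5 come from the training table, and p = 1/3 from the uniform prior over the three Outlook values.

```python
from fractions import Fraction


def m_estimate(n_c, n, p, m):
    """Return (n_c + m*p) / (n + m).

    n_c: number of examples of the class that have the attribute value
    n:   number of examples of the class
    p:   prior estimate of the probability
    m:   equivalent sample size
    """
    return (n_c + m * p) / (n + m)


# P(Outlook=overcast | No), assuming m = 2 and a uniform prior p = 1/3.
smoothed = m_estimate(0, 5, Fraction(1, 3), 2)
# With m = 0 the estimate falls back to the plain frequency n_c / n.
unsmoothed = m_estimate(0, 5, Fraction(1, 3), 0)
print(smoothed, unsmoothed)
```

With these values the smoothed estimate is 2/21, so a single unseen attribute value no longer zeroes out the product for the class.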
Next Class
Neural Networks and Support Vector Machines