Transcript of Classification Algorithms, Lecture 17 (ramanlab.wustl.edu/Lectures/Lecture15_Classifiers.pdf)
Slide 1: Classification Algorithms
Lecture 17
Slide 2: Probability Theory (Apples and Oranges)
Pick the red box with probability 40%; pick the blue box with probability 60%.
Red box: 2 apples, 6 oranges
Blue box: 3 apples, 1 orange
Once a box is chosen, any piece of fruit in that box is equally likely to be picked.
From Bishop, PRML
Slide 3: Probability Theory (Apples and Oranges)
Pick the red box with probability 40%; pick the blue box with probability 60%.
Red box: 2 apples, 6 oranges
Blue box: 3 apples, 1 orange
Question 1: What is the overall probability that the selection will pick an apple?
From Bishop, PRML
Slide 4: Probability Theory (Apples and Oranges)
Pick the red box with probability 40%; pick the blue box with probability 60%.
Red box: 2 apples, 6 oranges
Blue box: 3 apples, 1 orange
Question 2: Given that we have chosen an orange, what is the probability that the box we chose was the blue one?
From Bishop, PRML
Slide 5: Probability Theory
Consider $N$ trials in which two variables are recorded: $X$ (taking values $x_i$) and $Y$ (taking values $y_j$).
Total number of trials: $N$
Number of instances where $X = x_i$: $c_i$
Number of instances where $Y = y_j$: $r_j$
Number of instances where $X = x_i$ and $Y = y_j$: $n_{ij}$
From Bishop, PRML
Slide 6: Probability Theory
Marginal probability: $p(X = x_i) = c_i / N$
From Bishop, PRML
Slide 7: Probability Theory
Marginal probability: $p(X = x_i) = c_i / N$
Conditional probability: $p(Y = y_j \mid X = x_i) = n_{ij} / c_i$
From Bishop, PRML
Slide 8: Probability Theory
Marginal probability: $p(X = x_i) = c_i / N$
Conditional probability: $p(Y = y_j \mid X = x_i) = n_{ij} / c_i$
Joint probability: $p(X = x_i, Y = y_j) = n_{ij} / N$
From Bishop, PRML
Slide 9: Probability Theory
Sum rule: $p(X = x_i) = \sum_j p(X = x_i, Y = y_j) = c_i / N$
From Bishop, PRML
Slide 10: Probability Theory
Sum rule: $p(X = x_i) = \sum_j p(X = x_i, Y = y_j)$
Product rule: $p(X = x_i, Y = y_j) = p(X = x_i \mid Y = y_j)\, p(Y = y_j)$
From Bishop, PRML
Slide 11: The Rules of Probability
Sum rule: $p(X) = \sum_Y p(X, Y)$
Product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$
From Bishop, PRML
Slide 12: Bayes' Theorem
$p(Y \mid X)\, p(X) = p(X \mid Y)\, p(Y)$, so $p(Y \mid X) = \dfrac{p(X \mid Y)\, p(Y)}{p(X)}$
posterior ∝ likelihood × prior
From Bishop, PRML
Slide 13: Probability Theory (Apples and Oranges)
Setup as before: red box chosen 40% of the time (2 apples, 6 oranges); blue box chosen 60% of the time (3 apples, 1 orange).
Question 1: What is the overall probability that the selection will pick an apple?
From Bishop, PRML
Slide 14: Sum Rule and Product Rule at Work
$P(B=r) = 4/10$, $P(B=b) = 6/10$
$P(F=a \mid B=r) = 1/4$, $P(F=o \mid B=r) = 3/4$
$P(F=a \mid B=b) = 3/4$, $P(F=o \mid B=b) = 1/4$
By the product rule, $p(X, Y) = p(X \mid Y)\, p(Y)$, and by the sum rule:
$P(F=a) = P(F=a \mid B=r)\, P(B=r) + P(F=a \mid B=b)\, P(B=b) = \tfrac{1}{4} \cdot \tfrac{4}{10} + \tfrac{3}{4} \cdot \tfrac{6}{10} = \tfrac{11}{20}$
$P(F=o) = 1 - P(F=a) = \tfrac{9}{20}$
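To make the arithmetic concrete, here is a minimal Python sketch of the same sum-rule calculation. The box priors and per-box fruit counts are taken from the slides above; the function and variable names are only illustrative.

```python
# Total probability of picking an apple, via the product and sum rules.
priors = {"red": 0.4, "blue": 0.6}                      # P(B)
fruit_counts = {"red":  {"apple": 2, "orange": 6},
                "blue": {"apple": 3, "orange": 1}}

def p_fruit_given_box(fruit, box):
    counts = fruit_counts[box]
    return counts[fruit] / sum(counts.values())          # fruit equally likely within a box

# Sum rule: P(F=a) = sum over boxes of P(F=a | B) P(B)
p_apple = sum(p_fruit_given_box("apple", b) * priors[b] for b in priors)
print(round(p_apple, 4))        # 0.55 = 11/20
print(round(1 - p_apple, 4))    # 0.45 = 9/20 = P(F=o)
```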
Slide 15: Probability Theory (Apples and Oranges)
Setup as before: red box chosen 40% of the time (2 apples, 6 oranges); blue box chosen 60% of the time (3 apples, 1 orange).
Question 2: Given that we have chosen an orange, what is the probability that the box we chose was the blue one?
Slide 16: Sum Rule and Product Rule at Work
$P(B=r) = 4/10$, $P(B=b) = 6/10$
$P(F=a \mid B=r) = 1/4$, $P(F=o \mid B=r) = 3/4$
$P(F=a \mid B=b) = 3/4$, $P(F=o \mid B=b) = 1/4$
By Bayes' theorem:
$P(B=b \mid F=o) = \dfrac{P(F=o \mid B=b)\, P(B=b)}{P(F=o)} = \tfrac{1}{4} \cdot \tfrac{6}{10} \cdot \tfrac{20}{9} = \tfrac{1}{3}$
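Continuing the same sketch (reusing the hypothetical `priors` and `p_fruit_given_box` defined above), the posterior over boxes given an orange follows directly from Bayes' theorem:

```python
# Posterior P(B = blue | F = orange) via Bayes' theorem.
p_orange = sum(p_fruit_given_box("orange", b) * priors[b] for b in priors)   # evidence, 9/20
posterior_blue = p_fruit_given_box("orange", "blue") * priors["blue"] / p_orange
print(round(posterior_blue, 4))   # 0.3333 = 1/3
```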
Slide 17: Classification (Likelihood Ratio Test)
Assume we are to classify an object based on the evidence provided by a measurement (or feature vector) $x$.
A reasonable decision rule: choose the class that is most "probable" given the observed feature vector $x$.
More formally: evaluate the posterior probability of each class, $P(C_i \mid x)$, and choose the class with the largest $P(C_i \mid x)$.
Slide 18: Classification (Likelihood Ratio Test)
Consider the implications of this decision rule for a 2-class problem. The decision rule becomes:
if $P(C_1 \mid x) > P(C_2 \mid x)$, then $x \in C_1$; otherwise $x \in C_2$.
Slide 19: Classification (Likelihood Ratio Test)
For the 2-class problem, the rule "if $P(C_1 \mid x) > P(C_2 \mid x)$ then $x \in C_1$, else $x \in C_2$" can be written more compactly as
$P(C_1 \mid x) \underset{C_2}{\overset{C_1}{\gtrless}} P(C_2 \mid x)$
Slide 20: Classification (Likelihood Ratio Test)
Applying Bayes' rule to the compact form $P(C_1 \mid x) \underset{C_2}{\overset{C_1}{\gtrless}} P(C_2 \mid x)$ gives
$\dfrac{P(x \mid C_1)\, P(C_1)}{P(x)} \underset{C_2}{\overset{C_1}{\gtrless}} \dfrac{P(x \mid C_2)\, P(C_2)}{P(x)}$
Slide 21: Classification (Likelihood Ratio Test)
Since $P(x) > 0$ does not depend on the class, it cancels:
$P(x \mid C_1)\, P(C_1) \underset{C_2}{\overset{C_1}{\gtrless}} P(x \mid C_2)\, P(C_2)$
Rearranging gives the likelihood ratio test:
$\Lambda(x) = \dfrac{P(x \mid C_1)}{P(x \mid C_2)} \underset{C_2}{\overset{C_1}{\gtrless}} \dfrac{P(C_2)}{P(C_1)}$
Slide 22: Classification (Likelihood Ratio Test)
$\Lambda(x) = \dfrac{P(x \mid C_1)}{P(x \mid C_2)} \underset{C_2}{\overset{C_1}{\gtrless}} \dfrac{P(C_2)}{P(C_1)}$
The term $\Lambda(x)$ is called the likelihood ratio, and this decision rule is the likelihood ratio test.
Slide 23: An Example (Likelihood Ratio Test)
$P(x \mid C_1) = \dfrac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-4)^2}$
$P(x \mid C_2) = \dfrac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-10)^2}$
Assume equal priors: $P(C_1) = P(C_2)$.
From Gutierrez-Osuna
Slide 24: An Example (Likelihood Ratio Test)
With equal priors the test reduces to comparing the likelihood ratio against 1:
$\Lambda(x) = \dfrac{P(x \mid C_1)}{P(x \mid C_2)} = \dfrac{\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-4)^2}}{\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-10)^2}} = \dfrac{e^{-\frac{1}{2}(x-4)^2}}{e^{-\frac{1}{2}(x-10)^2}} \underset{C_2}{\overset{C_1}{\gtrless}} 1$
Taking logs (the logarithm is monotonically increasing, so the decision is unchanged):
$\log \Lambda(x) = -\tfrac{1}{2}(x-4)^2 + \tfrac{1}{2}(x-10)^2 \underset{C_2}{\overset{C_1}{\gtrless}} 0$
Slide 25: An Example (Likelihood Ratio Test)
$\log \Lambda(x) = -\tfrac{1}{2}(x-4)^2 + \tfrac{1}{2}(x-10)^2 \underset{C_2}{\overset{C_1}{\gtrless}} 0$
Expanding the squares, $-(x-4)^2 + (x-10)^2 = 12(7-x)$, so the test reduces to
$7 \underset{C_2}{\overset{C_1}{\gtrless}} x$
i.e. choose $C_1$ when $x < 7$ and $C_2$ when $x > 7$: the decision threshold sits at the midpoint of the two class means.
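As a numeric check of this example, the short Python sketch below evaluates the log-likelihood ratio for the two unit-variance Gaussians above and confirms that the decision flips at x = 7. The function names are illustrative, not from the source.

```python
def log_likelihood_ratio(x, mu1=4.0, mu2=10.0):
    """log Lambda(x) for two unit-variance Gaussians; the 1/sqrt(2*pi) factors cancel."""
    return -0.5 * (x - mu1) ** 2 + 0.5 * (x - mu2) ** 2

def classify(x):
    # Equal priors: choose C1 if log Lambda(x) > 0, otherwise C2.
    return "C1" if log_likelihood_ratio(x) > 0 else "C2"

for x in (3.0, 6.9, 7.1, 12.0):
    print(x, classify(x))    # C1, C1, C2, C2: the boundary is at x = 7
```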
Slide 26: Variants
Maximum A Posteriori (MAP) criterion: compare the full posteriors,
$\dfrac{P(x \mid C_1)\, P(C_1)}{P(x)} \underset{C_2}{\overset{C_1}{\gtrless}} \dfrac{P(x \mid C_2)\, P(C_2)}{P(x)} \;\Longleftrightarrow\; \dfrac{P(C_1 \mid x)}{P(C_2 \mid x)} \underset{C_2}{\overset{C_1}{\gtrless}} 1$
Maximum Likelihood (ML) criterion: compare likelihoods only, which is equivalent to MAP when the priors are equal, $P(C_1) = P(C_2)$:
$\Lambda(x) = \dfrac{P(x \mid C_1)}{P(x \mid C_2)} \underset{C_2}{\overset{C_1}{\gtrless}} 1$
Slide 27: Discriminant Functions
All the decision rules presented in this lecture share the same structure: at each point $x$ in feature space, choose the class $C_i$ that maximizes (or minimizes) some measure $g_i(x)$.
This structure can be formalized with a set of discriminant functions $g_i(x)$, $i = 1, \ldots, C$, and the decision rule: "assign $x$ to class $C_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$".
Criterion and corresponding discriminant function:
MAP: $g_i(x) = P(C_i \mid x)$
ML: $g_i(x) = P(x \mid C_i)$
We can therefore visualize the decision rule as a network or machine that computes $C$ discriminant functions and selects the category corresponding to the largest discriminant, as depicted in the following figure.
[Figure: feature inputs $x_1, x_2, \ldots, x_d$ feed discriminant functions $g_1(x), g_2(x), \ldots, g_C(x)$; the class assignment selects the maximum $g_i(x)$.]
From Gutierrez-Osuna
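The network in the figure is simply an argmax over per-class scores. Below is a minimal, generic sketch of that structure; the discriminant functions themselves would be supplied by whichever criterion is chosen (MAP, ML, or the Gaussian cases that follow), and all names here are illustrative.

```python
# Generic discriminant-function machine: evaluate g_i(x) for every class and
# assign x to the class whose discriminant is largest.
def classify(x, discriminants):
    """discriminants: dict mapping class label -> callable g_i(x)."""
    return max(discriminants, key=lambda c: discriminants[c](x))

# Illustrative use with two made-up score functions:
g = {"C1": lambda x: -(x - 4) ** 2,
     "C2": lambda x: -(x - 10) ** 2}
print(classify(5.0, g))    # 'C1'
```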
Slide 28: Quadratic Classifiers
Bayes classifiers for normally distributed classes. Five cases, according to the form of the class covariance matrices:
Case 1: $\Sigma_i = \sigma^2 I$
Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ diagonal)
Case 3: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal)
Case 4: $\Sigma_i = \sigma_i^2 I$
Case 5: $\Sigma_i \neq \Sigma_j$ (general case)
From Gutierrez-Osuna; from Duda, Hart and Stork
Slide 29: Quadratic Classifiers
Bayes classifiers for normally distributed classes use the MAP discriminant:
choose $C_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$, where $g_i(x) = P(C_i \mid x)$ (MAP).
As we will show, for classes that are normally distributed this family of discriminant functions can be reduced to very simple expressions.
[Figure: feature inputs $x_1, \ldots, x_d$ feed discriminant functions $g_1(x), \ldots, g_C(x)$; the class assignment selects the maximum $g_i(x)$.]
From Gutierrez-Osuna
Slide 30: Quadratic Classifiers
Bayes classifiers for normally distributed classes:
choose $C_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$, where $g_i(x) = P(C_i \mid x)$ (MAP).
Gaussian distribution ($\mu$: $D$-dimensional mean, $\Sigma$: $D \times D$ covariance matrix, $|\Sigma|$: determinant of the covariance matrix):
$P(x) = \dfrac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
Bayes rule:
$g_i(x) = P(C_i \mid x) = \dfrac{P(x \mid C_i)\, P(C_i)}{P(x)} = \dfrac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right) P(C_i)\, \dfrac{1}{P(x)}$
Slide 31: Quadratic Classifiers
Bayes rule for the Gaussian distribution, after eliminating constants that do not depend on the class (the $(2\pi)^{D/2}$ factor and $1/P(x)$):
$g_i(x) = |\Sigma_i|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right) P(C_i)$
Taking natural logs, since the logarithm is a monotonically increasing function and leaves the decision unchanged:
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$
This is called the quadratic discriminant function.
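A direct numeric sketch of this quadratic discriminant function. It assumes the per-class means, covariances, and priors have already been estimated; all parameter values and names below are made up for illustration.

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 log|Sigma| + log P(C_i)."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

def classify(x, params):
    """params: dict of class label -> (mu, Sigma, prior)."""
    return max(params, key=lambda c: quadratic_discriminant(x, *params[c]))

# Illustrative 2-D example with made-up parameters:
params = {
    "C1": (np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 1.0]]), 0.5),
    "C2": (np.array([3.0, 3.0]), np.array([[2.0, 0.0], [0.0, 0.5]]), 0.5),
}
print(classify(np.array([0.5, 0.2]), params))   # 'C1'
```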
Slide 32: Case 1 ($\Sigma_i = \sigma^2 I$)
This situation occurs when the features are statistically independent and have the same variance for all classes. In this case, the quadratic discriminant function becomes
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T (\sigma^2 I)^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\sigma^2 I| + \log P(C_i)$
$g_i(x) = -\dfrac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i) - \tfrac{1}{2} D \log(\sigma^2) + \log P(C_i)$
Dropping the second term, which is the same for all classes:
$g_i(x) = -\dfrac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i) + \log P(C_i)$
Slide 33: Case 1 ($\Sigma_i = \sigma^2 I$)
Starting from
$g_i(x) = -\dfrac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i) + \log P(C_i)$
and expanding:
$g_i(x) = -\dfrac{1}{2\sigma^2}\left(x^T x - x^T \mu_i - \mu_i^T x + \mu_i^T \mu_i\right) + \log P(C_i) = -\dfrac{1}{2\sigma^2}\left(x^T x - 2\mu_i^T x + \mu_i^T \mu_i\right) + \log P(C_i)$
Ignoring $x^T x$, which does not depend on the class:
$g_i(x) = -\dfrac{1}{2\sigma^2}\left(-2\mu_i^T x + \mu_i^T \mu_i\right) + \log P(C_i)$
Slide 34: Case 1 ($\Sigma_i = \sigma^2 I$)
Collecting the previous steps, the discriminant
$g_i(x) = -\dfrac{1}{2\sigma^2}\left(-2\mu_i^T x + \mu_i^T \mu_i\right) + \log P(C_i)$
can be written in linear discriminant function form:
$g_i(x) = w_i^T x + w_{i0}$, where $w_i = \dfrac{\mu_i}{\sigma^2}$ and $w_{i0} = -\dfrac{1}{2\sigma^2}\mu_i^T \mu_i + \log P(C_i)$
Slide 35: Case 1 ($\Sigma_i = \sigma^2 I$)
Discriminant function form:
$g_i(x) = w_i^T x + w_{i0}$, where $w_i = \dfrac{\mu_i}{\sigma^2}$ and $w_{i0} = -\dfrac{1}{2\sigma^2}\mu_i^T \mu_i + \log P(C_i)$
Since the discriminant is linear, the decision boundaries $g_i(x) = g_j(x)$ are hyperplanes.
If we assume equal priors, the $\log P(C_i)$ term can be dropped and
$g_i(x) = -\dfrac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i)$
This is the nearest-mean classifier: assign $x$ to the class with the closest mean. If the variance is $\sigma^2 = 1$, the distance becomes the Euclidean distance.
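A minimal sketch of this Case 1 rule with equal priors, where classification reduces to picking the nearest class mean. The means below are illustrative.

```python
import numpy as np

def nearest_mean_classify(x, means):
    """Case 1 with equal priors: assign x to the class whose mean is closest in Euclidean distance."""
    return min(means, key=lambda c: np.sum((x - means[c]) ** 2))

means = {"C1": np.array([4.0, 0.0]), "C2": np.array([10.0, 0.0])}
print(nearest_mean_classify(np.array([5.0, 1.0]), means))    # 'C1'
```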
Slide 36: Case 1 ($\Sigma_i = \sigma^2 I$)
[Figure]
From Gutierrez-Osuna
Slide 37: Case 2 ($\Sigma_i = \Sigma$, $\Sigma$ diagonal)
The classes still share the same covariance matrix, but the features are allowed to have different variances. In this case, the quadratic discriminant function becomes
$g_i(x) = -\tfrac{1}{2}\sum_{k=1}^{D} \dfrac{(x[k]-\mu_i[k])^2}{\sigma_k^2} - \tfrac{1}{2}\log|\Sigma| + \log P(C_i)$
Eliminating the $-\tfrac{1}{2}\log|\Sigma|$ term and the $x[k]^2$ terms, which are the same for all classes:
$g_i(x) = -\tfrac{1}{2}\sum_{k=1}^{D} \dfrac{-2\,x[k]\,\mu_i[k] + \mu_i[k]^2}{\sigma_k^2} + \log P(C_i)$
This discriminant is linear, so the decision boundaries $g_i(x) = g_j(x)$ are again hyperplanes.
The loci of constant probability are hyper-ellipses aligned with the feature axes.
The only difference from the previous classifier is that the distance along each axis is normalized by the variance of that axis.
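A sketch of the Case 2 rule, assuming the shared per-feature variances are known (the values below are illustrative). Each axis is normalized by its own variance before the comparison, which is exactly the difference from the Case 1 classifier noted above.

```python
import numpy as np

def diagonal_gaussian_classify(x, means, variances, priors):
    """Case 2 (shared diagonal covariance): g_i(x) = -1/2 sum_k (x_k - mu_ik)^2 / sigma_k^2 + log P(C_i)."""
    def g(c):
        return -0.5 * np.sum((x - means[c]) ** 2 / variances) + np.log(priors[c])
    return max(means, key=g)

means = {"C1": np.array([0.0, 0.0]), "C2": np.array([4.0, 1.0])}
variances = np.array([4.0, 0.01])    # shared per-feature variances sigma_k^2
priors = {"C1": 0.5, "C2": 0.5}
print(diagonal_gaussian_classify(np.array([3.0, 0.05]), means, variances, priors))
# 'C1': plain Euclidean distance would pick C2, but normalizing each axis by its variance flips the decision.
```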
Slide 38: Case 2 ($\Sigma_i = \Sigma$, $\Sigma$ diagonal)
[Figure]
From Gutierrez-Osuna
Slide 39: Case 3 ($\Sigma_i = \Sigma$, $\Sigma$ non-diagonal)
In this case all the classes have the same covariance matrix, but it is no longer diagonal.
The quadratic discriminant
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$
becomes
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma| + \log P(C_i)$
Eliminating the constant $-\tfrac{1}{2}\log|\Sigma|$ term:
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) + \log P(C_i)$
The quadratic term is called the Mahalanobis distance, a vector distance that uses a $\Sigma^{-1}$ norm:
$\|x-y\|^2_{\Sigma^{-1}} = (x-y)^T \Sigma^{-1} (x-y)$
$\Sigma^{-1}$ can be thought of as a stretching factor on the space. For an identity covariance matrix ($\Sigma = I$), the Mahalanobis distance becomes the familiar Euclidean distance.
Slide 40: Case 3 ($\Sigma_i = \Sigma$, $\Sigma$ non-diagonal)
Expansion of the quadratic term in the discriminant yields
$g_i(x) = -\tfrac{1}{2}\left(x^T \Sigma^{-1} x - 2\mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i\right) + \log P(C_i)$
Removing the term $x^T \Sigma^{-1} x$, which is the same for all classes, and reorganizing terms:
$g_i(x) = \mu_i^T \Sigma^{-1} x - \tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)$
In discriminant function form:
$g_i(x) = w_i^T x + w_{i0}$, where $w_i = \Sigma^{-1}\mu_i$ and $w_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)$
This discriminant is linear, so the decision boundaries are again hyperplanes.
The constant probability loci are hyper-ellipses aligned with the eigenvectors of $\Sigma$.
If we can assume equal priors, the classifier becomes a minimum (Mahalanobis) distance classifier:
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)$
Slide 41: Case 3 ($\Sigma_i = \Sigma$, $\Sigma$ non-diagonal)
Reorganizing terms, we obtain
$g_i(x) = \mu_i^T \Sigma^{-1} x - \tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)$
i.e. $g_i(x) = w_i^T x + w_{i0}$, where $w_i = \Sigma^{-1}\mu_i$ and $w_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)$
This discriminant is linear, so the decision boundaries are hyperplanes; the constant probability loci are hyper-ellipses aligned with the eigenvectors of $\Sigma$.
If we can assume equal priors, the classifier becomes a minimum (Mahalanobis) distance classifier:
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)$
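A sketch of Case 3 with equal priors: a minimum Mahalanobis distance classifier under a shared, non-diagonal covariance matrix. All parameter values below are illustrative.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma_inv):
    """(x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ Sigma_inv @ diff

def shared_cov_classify(x, means, Sigma):
    """Case 3 with equal priors: choose the class with the smallest Mahalanobis distance."""
    Sigma_inv = np.linalg.inv(Sigma)
    return min(means, key=lambda c: mahalanobis_sq(x, means[c], Sigma_inv))

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])    # shared, non-diagonal covariance
means = {"C1": np.array([0.0, 0.0]), "C2": np.array([3.0, 2.0])}
print(shared_cov_classify(np.array([1.0, 1.2]), means, Sigma))    # 'C1'
```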
Slide 42: Case 3 ($\Sigma_i = \Sigma$, $\Sigma$ non-diagonal)
[Figure]
From Gutierrez-Osuna
Slide 43: Case 4 ($\Sigma_i = \sigma_i^2 I$)
In this case each class has a different covariance matrix, but each is proportional to the identity matrix. The quadratic discriminant
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$
becomes
$g_i(x) = -\dfrac{1}{2\sigma_i^2}(x-\mu_i)^T (x-\mu_i) - \tfrac{1}{2} D \log(\sigma_i^2) + \log P(C_i)$
This expression cannot be reduced further, so:
The decision boundaries are quadratic (hyper-ellipses).
The loci of constant probability are hyper-spheres aligned with the feature axes.
Slide 44: Case 4 ($\Sigma_i = \sigma_i^2 I$)
[Figure]
From Gutierrez-Osuna
Slide 45: Case 5 ($\Sigma_i \neq \Sigma_j$, general case)
We already derived the expression for the general case at the beginning of this discussion:
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$
Reorganizing terms in quadratic form yields
$g_i(x) = x^T W_i x + w_i^T x + w_{i0}$
where $W_i = -\tfrac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$, and $w_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma_i^{-1} \mu_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$.
The loci of constant probability for each class are hyper-ellipses oriented with the eigenvectors of $\Sigma_i$ for that class.
The decision boundaries are again quadratic: hyper-ellipses or hyper-paraboloids.
Notice that the quadratic expression in the discriminant is proportional to the Mahalanobis distance using the class-conditional covariance $\Sigma_i$.
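The expanded form can be checked against the direct quadratic discriminant of Slide 31. The sketch below builds $W_i$, $w_i$, $w_{i0}$ from made-up parameters and verifies that $x^T W_i x + w_i^T x + w_{i0}$ agrees with $-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$:

```python
import numpy as np

def expanded_discriminant(x, mu, Sigma, prior):
    """Case 5: g_i(x) = x^T W_i x + w_i^T x + w_i0."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

def direct_discriminant(x, mu, Sigma, prior):
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior))

x = np.array([1.0, -0.5])
mu = np.array([2.0, 1.0])
Sigma = np.array([[1.5, 0.3], [0.3, 0.8]])
print(np.isclose(expanded_discriminant(x, mu, Sigma, 0.4),
                 direct_discriminant(x, mu, Sigma, 0.4)))   # True
```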
Slide 46: Case 5 ($\Sigma_i \neq \Sigma_j$)
[Figure]
From Gutierrez-Osuna
Slide 47: Naïve Bayes Classifier: An Example
Day  Outlook   Temperature  Humidity  Wind    Play
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
From Machine Learning, Mitchell
Slide 48: Naïve Bayes Classifier
New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
Predict the target value: Play = yes or Play = no?
Slide 49: Naïve Bayes Classifier
New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
Predict the target value: Play = yes or Play = no?
Using Bayes' rule we can write:
$P(\text{yes} \mid \text{sunny, cool, high, strong}) = \dfrac{P(\text{sunny, cool, high, strong} \mid \text{yes})\, P(\text{yes})}{\sum_{v \in \{\text{yes}, \text{no}\}} P(\text{sunny, cool, high, strong} \mid v)\, P(v)}$
and similarly for $P(\text{no} \mid \text{sunny, cool, high, strong})$, with the same denominator.
Slide 50: Naïve Bayes Classifier
More generally, the most probable target value is
$v_{MAP} = \underset{v_j \in V}{\operatorname{argmax}}\; \dfrac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \underset{v_j \in V}{\operatorname{argmax}}\; P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$
since the denominator does not depend on $v_j$.
Slide 51: Naïve Bayes Classifier
$v_{MAP} = \underset{v_j \in V}{\operatorname{argmax}}\; P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$
The Naïve Bayes classifier is based on the simplifying assumption that the attributes/features are conditionally independent given the target value:
$P(a_1, a_2, \ldots, a_n \mid v_j) = P(a_1 \mid v_j)\, P(a_2 \mid v_j) \cdots P(a_n \mid v_j) = \prod_i P(a_i \mid v_j)$
Slide 52: Naïve Bayes Classifier
Combining the two, the predicted target value becomes
$v_{MAP} = \underset{v_j \in V}{\operatorname{argmax}}\; P(v_j) \prod_i P(a_i \mid v_j)$
Slide 53: Naïve Bayes Classifier: An Example
Training data: the table from Slide 47 (from Machine Learning, Mitchell).
New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
Slide 54: Naïve Bayes Classifier
New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
From the table: P(Play = yes) = 9/14, P(Play = no) = 5/14
P(Wind = strong | Play = yes) = 3/9, P(Wind = strong | Play = no) = 3/5
P(yes) · P(sunny | yes) · P(cool | yes) · P(high | yes) · P(strong | yes) = ?
P(no) · P(sunny | no) · P(cool | no) · P(high | no) · P(strong | no) = ?
Slide 55: Naïve Bayes Classifier
New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
P(Play = yes) = 9/14, P(Play = no) = 5/14
P(yes) · P(sunny | yes) · P(cool | yes) · P(high | yes) · P(strong | yes) = (9/14)(2/9)(3/9)(3/9)(3/9) ≈ 0.0053
P(no) · P(sunny | no) · P(cool | no) · P(high | no) · P(strong | no) = (5/14)(3/5)(1/5)(4/5)(3/5) ≈ 0.0206
Naïve Bayes classifier prediction: Play = no
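These numbers can be reproduced directly from the training table. The sketch below transcribes the 14 rows from the slide and computes the unnormalized Naïve Bayes scores using plain maximum-likelihood estimates (no smoothing), matching the values above.

```python
# (Outlook, Temperature, Humidity, Wind, Play) rows from the slide's table.
data = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def naive_bayes_score(instance, label):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                       # prior P(label)
    for k, value in enumerate(instance):                # product of P(a_k = value | label)
        score *= sum(1 for r in rows if r[k] == value) / len(rows)
    return score

query = ("Sunny", "Cool", "High", "Strong")
for label in ("Yes", "No"):
    print(label, round(naive_bayes_score(query, label), 4))       # Yes 0.0053, No 0.0206
print(max(("Yes", "No"), key=lambda l: naive_bayes_score(query, l)))   # 'No'
```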
Slide 56: Non-parametric Density Estimation
[Figure]
From Gutierrez-Osuna
Slides 57-59: Nearest Neighbor Classifier
[Figures]
From Gutierrez-Osuna