Classification I
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn
Overview
K-Nearest Neighbor Algorithm
Naïve Bayes Classifier
[Figure: portrait of Thomas Bayes]
Classification
Definition
Classification is one of the fundamental skills for survival.
Food vs. Predator
A kind of supervised learning
Techniques for deducing a function from data
<Input, Output>
Input: a vector of features
Output: a Boolean value (binary classification) or integer (multiclass)
“Supervised” means:
A teacher or oracle is needed to label each data sample.
We will talk about unsupervised learning later.
Classifiers
[Figure: students (Mary, Lisa, Jane, Jack, Peter, Tom, Sam, Helen) plotted by Height and Weight; a classifier Z = f(Height, Weight) assigns each point a label in {boy, girl}]
Training a Classifier
[Figure: learning a classifier from labeled training data]
Lazy Learners
[Figure: example images of a car and a truck]
K-Nearest Neighbor Algorithm
The algorithm procedure:
Given a set of n training data in the form of <x, y>.
Given an unknown sample x′.
Calculate the distance d(x′, xi) for i=1 … n.
Select the K samples with the shortest distances.
Assign x′ the label that dominates the K samples.
It is the simplest classifier you will ever meet (I mean it!).
No Training (literally)
A memory of the training data is maintained.
All computation is deferred until classification.
Produces satisfactory results in many cases.
Should give it a go whenever possible.
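As a concrete illustration, here is a minimal Python sketch of the procedure above (the function and variable names are ours, purely illustrative):

import numpy as np

def knn_classify(X_train, y_train, x_query, k=3):
    # Distance from the query to every training sample (Euclidean)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the K nearest samples
    nearest = np.argsort(dists)[:k]
    # Label that dominates the K neighbors (majority vote)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: <height, weight> -> 0 (girl) or 1 (boy)
X = np.array([[160, 50], [165, 55], [175, 70], [180, 80]])
y = np.array([0, 0, 1, 1])
print(knn_classify(X, y, np.array([170, 68]), k=3))   # -> 1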
Properties of KNN
Instance-Based Learning
No explicit description of the target function
Can handle complicated situations.
Properties of KNN
Dependent on the data distribution.
Can make mistakes at boundaries.
[Figure: K=1 vs. K=7 neighborhoods around the same query point]
Challenges of KNN
The Value of K
Non-monotonous impact on accuracy
Too Big vs. Too Small
Rules of thumb
Weights
Different features may have different impact …
Distance
There are many different ways to measure the distance.
Euclidean, Manhattan …
Complexity
Need to calculate the distance between x′ and all training data.
In proportion to the size of the training data.
[Figure: classification accuracy as a function of K]
Distance Metrics
Minkowski distance: $L_k(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^k\right)^{1/k}$

Euclidean distance: $L_2(x, y) = \left(\sum_{i=1}^{d} (x_i - y_i)^2\right)^{1/2}$

Manhattan distance: $L_1(x, y) = \sum_{i=1}^{d} |x_i - y_i|$
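A few lines of Python make the family concrete (a sketch; minkowski is our own helper, not a library call):

import numpy as np

def minkowski(x, y, k):
    # L_k distance: k-th root of the sum of |x_i - y_i|^k
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))   # Manhattan (L1): 7.0
print(minkowski(x, y, 2))   # Euclidean (L2): 5.0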
Distance Metrics
The shortest path between two points …
Mahalanobis Distance
Distance from a point to a point set
Mahalanobis Distance
$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$

For identity matrix $S$: $D_M(x) = \sqrt{(x - \mu)^T (x - \mu)}$ (Euclidean distance)

For diagonal matrix $S$: $D_M(x) = \sqrt{\sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2}}$ (normalized Euclidean distance)
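A minimal numpy sketch of the general case (our own helper; the sample mean and covariance of the point set stand in for μ and S):

import numpy as np

def mahalanobis(x, points):
    # Distance from x to the point set: sqrt((x - mu)^T S^-1 (x - mu))
    mu = points.mean(axis=0)
    S = np.cov(points, rowvar=False)          # estimated covariance matrix
    diff = x - mu
    return np.sqrt(diff @ np.linalg.inv(S) @ diff)

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 2)) * np.array([3.0, 0.5])   # stretched cloud
print(mahalanobis(np.array([3.0, 0.5]), pts))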
Voronoi Diagram
[Figure: Voronoi diagram; each cell boundary is a perpendicular bisector between two neighboring points]
Structured Data
KD-Tree
Point Set: {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}
KD-Tree
function kdtree (list of points pointList, int depth)
{
    if pointList is empty
        return nil;
    else
    {
        // Select axis based on depth so that axis cycles through all valid values
        var int axis := depth mod k;
        // Sort point list and choose median as pivot element
        select median by axis from pointList;
        // Create node and construct subtrees
        var tree_node node;
        node.location := median;
        node.leftChild := kdtree(points in pointList before median, depth+1);
        node.rightChild := kdtree(points in pointList after median, depth+1);
        return node;
    }
}
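For reference, a runnable Python transcription of the pseudocode, applied to the point set above (a sketch; Node and kdtree are our own names, not from a library):

from collections import namedtuple

Node = namedtuple('Node', 'location left right')

def kdtree(points, depth=0, k=2):
    if not points:
        return None
    axis = depth % k                              # axis cycles through 0..k-1
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2                     # median along this axis
    return Node(points[median],
                kdtree(points[:median], depth + 1, k),
                kdtree(points[median + 1:], depth + 1, k))

tree = kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.location)   # (7, 2): the root splits on the x axis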
KD-Tree
[Figure: the resulting kd-tree and its spatial partition of the plane]
Evaluation
Accuracy
Recall what we have learned in the first lecture:
Confusion Matrix
ROC Curve
Training Set vs. Test Set
N-fold Cross Validation
[Figure: N-fold cross validation; each of the N folds takes one turn as the test set]
LOOCV
Leave One Out Cross Validation
An extreme case of N-fold cross validation
N=number of available samples
Usually very time consuming but okay for KNN
Now, let’s try KNN+LOOCV …
All students in this class are given two kinds of labels.
Gender: Male vs. Female
Major: CS vs. EE vs. Automation
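A sketch of KNN+LOOCV in Python for an exercise like this (numpy only; knn_predict and loocv_accuracy are our own illustrative names):

import numpy as np

def knn_predict(X_train, y_train, x, k):
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def loocv_accuracy(X, y, k=3):
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i    # train on everything but sample i
        hits += knn_predict(X[mask], y[mask], X[i], k) == y[i]
    return hits / len(X)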
10 Minutes …
Bayes Theorem
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$

$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

$\text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{evidence}}$
Fish Example
Salmon vs. Tuna
P(ω1)=P(ω2)
P(ω1)>P(ω2)
Additional information
$P(\omega_i \mid x) = \dfrac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)}$
Shooting Example
Probability of Kill
P(A): 0.6
P(B): 0.5
The target is fired at with:
One shot from A
One shot from B
What is the probability that it is shot down by A?
C: The target is killed.
$P(A \mid C) = \dfrac{P(C \mid A)\,P(A)}{P(C)} = \dfrac{1 \times 0.6}{0.6 \times 0.5 + 0.6 \times 0.5 + 0.4 \times 0.5} = \dfrac{0.6}{0.8} = \dfrac{3}{4}$
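The same arithmetic as a Python sketch, computing P(C) as the complement of both shots missing rather than summing the three kill events (variable names illustrative):

p_a, p_b = 0.6, 0.5                    # kill probabilities of A and B
p_killed = 1 - (1 - p_a) * (1 - p_b)   # P(C) = 0.8: at least one shot hits
print(p_a / p_killed)                  # P(A|C) = 0.75 = 3/4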
Cancer Example
ω1: Cancer; ω2: Normal
P(ω1)=0.008; P(ω2)=0.992
Lab Test Outcomes: + vs. –
P(+|ω1)=0.98; P(-|ω1)=0.02
P(+|ω2)=0.03; P(-|ω2)=0.97
Now someone has a positive test result…
Is he/she doomed?
Cancer Example
$P(+ \mid \omega_1)\,P(\omega_1) = 0.98 \times 0.008 = 0.0078$

$P(+ \mid \omega_2)\,P(\omega_2) = 0.03 \times 0.992 = 0.0298$

$P(\omega_1 \mid +) = \dfrac{0.0078}{0.0078 + 0.0298} = 0.21$

$P(\omega_1 \mid +) < P(\omega_2 \mid +)$
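The same computation in Python, with the numbers from the slide:

p_cancer, p_normal = 0.008, 0.992
p_pos_given_cancer, p_pos_given_normal = 0.98, 0.03

joint_cancer = p_pos_given_cancer * p_cancer   # 0.0078
joint_normal = p_pos_given_normal * p_normal   # 0.0298
print(joint_cancer / (joint_cancer + joint_normal))   # ~0.21: probably not doomed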
Headache & Flu Example
H=“Having a headache”
F=“Coming down with flu”
P(H)=1/10; P(F)=1/40; P(H|F)=1/2
What does this mean?
One day you wake up with a headache …
Since 50% of flu cases are associated with headaches …
I must have a 50-50 chance of coming down with flu!
Headache & Flu Example
$P(F \mid H) = \dfrac{P(H \mid F)\,P(F)}{P(H)} = \dfrac{1/2 \times 1/40}{1/10} = \dfrac{1}{8}$
The truth is …
[Figure: Venn diagram; flu cases are only a small subset of headache cases]
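In code, the correction is one line:

p_h, p_f, p_h_given_f = 1/10, 1/40, 1/2
print(p_h_given_f * p_f / p_h)   # 0.125 = 1/8, far from 50-50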
Naïve Bayes Classifier
$\omega_{MAP} = \arg\max_{\omega_i} P(\omega_i \mid a_1, a_2, \ldots, a_n)$

MAP: Maximum A Posteriori

$\omega_{MAP} = \arg\max_{\omega_i} \dfrac{P(a_1, \ldots, a_n \mid \omega_i)\,P(\omega_i)}{P(a_1, \ldots, a_n)} = \arg\max_{\omega_i} P(a_1, \ldots, a_n \mid \omega_i)\,P(\omega_i)$

Assuming the attributes are conditionally independent given the class:

$\omega_{NB} = \arg\max_{\omega_i} P(\omega_i) \prod_j P(a_j \mid \omega_i)$
Independence
$P(A \cap B) = P(A)\,P(B)$

Since $P(A \cap B) = P(A)\,P(B \mid A)$, this is equivalent to $P(B \mid A) = P(B)$.

Conditionally independent given $G$:

$P(A, B \mid G) = P(A \mid G)\,P(B \mid G)$, or equivalently $P(A \mid B, G) = P(A \mid G)$, since

$P(A, B \mid G) = \dfrac{P(A, B, G)}{P(G)} = \dfrac{P(A \mid B, G)\,P(B, G)}{P(G)} = P(A \mid B, G)\,P(B \mid G) = P(A \mid G)\,P(B \mid G)$
Conditional Independence
$P(R, B \mid Y) = P(R \mid Y)\,P(B \mid Y)$
Independent ≠ Uncorrelated
$Y = X^2, \quad X \in [-1, 1]$
Cov(X, Y) = 0 ⇒ X and Y are uncorrelated.
However, Y is completely determined by X.
X | Y
1 | 1
0.5 | 0.25
0.2 | 0.04
0 | 0
−0.2 | 0.04
−0.5 | 0.25
−1 | 1

[Figure: plot of Y = X² for X ∈ [−1, 1]]
$\rho_{X,Y} = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \dfrac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$
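A quick numerical check of this example (an illustrative sketch):

import numpy as np

x = np.linspace(-1, 1, 10001)   # X roughly uniform on [-1, 1]
y = x ** 2                      # Y is completely determined by X
print(np.cov(x, y)[0, 1])       # ~0: uncorrelated despite full dependence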
Estimating P(αj|ωi)
α1 | α2 | α3 | ω
… | + | … | ω1
… | … | … | ω2
… | − | … | ω1
… | + | … | ω1
… | … | … | ω2

$P(a_2 = \text{'+'} \mid \omega_1) = 2/3$
$P(a_2 = \text{'-'} \mid \omega_1) = 1/3$
$P(\omega_1) = 3/5; \quad P(\omega_2) = 2/5$
Laplace Smoothing: $P(a_j \mid \omega_i) = \dfrac{n_{ij} + 1}{n_i + |a_j|}$, where $n_i$ is the number of training samples of class $\omega_i$, $n_{ij}$ is the number of those with attribute value $a_j$, and $|a_j|$ is the number of distinct values the attribute can take.
How about continuous variables?
Tennis Example
Day | Outlook | Temperature | Humidity | Wind | PlayTennis
Day1 Sunny Hot High Weak No
Day2 Sunny Hot High Strong No
Day3 Overcast Hot High Weak Yes
Day4 Rain Mild High Weak Yes
Day5 Rain Cool Normal Weak Yes
Day6 Rain Cool Normal Strong No
Day7 Overcast Cool Normal Strong Yes
Day8 Sunny Mild High Weak No
Day9 Sunny Cool Normal Weak Yes
Day10 Rain Mild Normal Weak Yes
Day11 Sunny Mild Normal Strong Yes
Day12 Overcast Mild High Strong Yes
Day13 Overcast Hot Normal Weak Yes
Day14 Rain Mild High Strong No
Tennis Example
Given: <Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong>
Predict: PlayTennis = yes or no?

Bayes Solution:
$P(yes) = 9/14$
$P(no) = 5/14$
$P(Wind = strong \mid PlayTennis = yes) = 3/9$
$P(Wind = strong \mid PlayTennis = no) = 3/5$
…
$P(yes)\,P(sunny \mid yes)\,P(cool \mid yes)\,P(high \mid yes)\,P(strong \mid yes) = 0.0053$
$P(no)\,P(sunny \mid no)\,P(cool \mid no)\,P(high \mid no)\,P(strong \mid no) = 0.0206$

The conclusion is not to play tennis, with probability $0.0206 / (0.0206 + 0.0053) = 0.795$.
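The same calculation as a short Python sketch, with the counts read off the 14-day table:

p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # P(yes)*P(sunny|yes)*P(cool|yes)*P(high|yes)*P(strong|yes)
p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no)
print(round(p_yes, 4), round(p_no, 4))   # 0.0053 0.0206
print(round(p_no / (p_yes + p_no), 3))   # 0.795: predict "no"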
Text Classification Example
Interesting? Boring?
Politics? Entertainment? Sports?
Text Representation
α1 | α2 | α3 | α4 | … | αn | ω
Long | long | ago | there | … | king | 1
New | sanctions | will | be | … | Iran | 0
Hidden | Markov | models | are | … | method | 0
The | Federal | Court | today | … | investigate | 0
However, there are 2×n×|Vocabulary| terms in total. For n = 100 and a vocabulary of 50,000 distinct words, it adds up to 10 million terms! We need to estimate probabilities such as $P(a_j = w_k \mid \omega_i)$.
Text Representation
By only considering the probability of encountering a specific word instead of the specific word position, we can reduce the number of probabilities to be estimated.
We only count the frequency of each word.
Now, 2×50,000=100,000 terms need to be estimated.
n: the total number of word positions in all training samples whose target value is ωi.
nk: the number of times word Vk is found among these n positions.
$P(V_k \mid \omega_i) = \dfrac{n_k + 1}{n + |Vocabulary|}$
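A sketch of this estimate for one class, on hypothetical toy documents (everything here is illustrative, not from a library):

from collections import Counter

# Hypothetical training documents of one class
docs = [["long", "long", "ago", "king"], ["the", "king", "said"]]
counts = Counter(w for doc in docs for w in doc)
n = sum(counts.values())                  # total word positions in the class
vocabulary = {"long", "ago", "king", "the", "said", "queen"}

def p_word(w):
    # Laplace-smoothed estimate: (n_k + 1) / (n + |Vocabulary|)
    return (counts[w] + 1) / (n + len(vocabulary))

print(p_word("king"))    # seen twice: (2 + 1) / (7 + 6)
print(p_word("queen"))   # unseen, but still gets nonzero probability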
Case Study: Newsgroups
Classification
Joachims, 1996
20 newsgroups
20,000 documents
Random Guess: 5%
NB: 89%
Recommendation
Lang, 1995
NewsWeeder
User rated articles
Interesting vs. Uninteresting
Top 10% selected articles
16% vs. 59%
Reading Materials
C. C. Aggarwal, A. Hinneburg and D. A. Keim, “On the Surprising Behavior of Distance
Metrics in High Dimensional Space," Proc. of the 8th International Conference on Database
Theory, LNCS 1973, pp. 420-434, London, UK, 2001.
J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An Algorithm for Finding Best Matches in
Logarithmic Expected Time,” ACM Transactions on Mathematical Software, 3(3):209–226,
1977.
S. M. Omohundro, “Bumptrees for Efficient Function, Constraint, and Classification
Learning,” Advances in Neural Information Processing Systems 3, pp. 693-699, Morgan
Kaufmann, 1991.
Tom Mitchell, Machine Learning (Chapter 6), McGraw-Hill.
Additional reading about Naïve Bayes Classifier http://www-2.cs.cmu.edu/~tom/NewChapters.html
Software for text classification using Naïve Bayes Classifier http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
Review
What is classification?
What is supervised learning?
What does KNN stand for?
What are the major challenges of KNN?
How to accelerate KNN?
What is N-fold cross validation?
What does LOOCV stand for?
What is Bayes Theorem?
What is the key assumption in Naïve Bayes Classifiers?
Next Week’s Class Talk
Volunteers are required for next week’s class talk.
Topic 1: Efficient KNN Implementations
Hints:
Ball Trees
Metric Trees
R Trees
Topic 2: Bayesian Belief Networks
Length: 20 minutes plus question time