Classification – Evaluation
Witten and Frank
Han and Kamber
Adapted from Han and Kamber
Classifier Accuracy Measures
Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 – acc(M)
Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j
Alternative accuracy measures (e.g., for cancer diagnosis):
sensitivity = t-pos/pos /* true positive recognition rate */
specificity = t-neg/neg /* true negative recognition rate */
precision = t-pos/(t-pos + f-pos)
accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
This model can also be used for cost-benefit analysis
Example (buy_computer):

classes              buy_computer = yes   buy_computer = no   total   recognition (%)
buy_computer = yes   6954                 46                  7000    99.34
buy_computer = no    412                  2588                3000    86.27
total                7366                 2634                10000   95.52

Generic confusion matrix for two classes:

            predicted C1     predicted C2
actual C1   true positive    false negative
actual C2   false positive   true negative
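As a concrete check of these definitions, here is a minimal Python sketch that recomputes the measures from the buy_computer confusion matrix above (positive class: buy_computer = yes); the variable names mirror the slide's notation.

```python
# Counts from the buy_computer confusion matrix above
# (positive class: buy_computer = yes)
t_pos, f_neg = 6954, 46    # actual yes: predicted yes / predicted no
f_pos, t_neg = 412, 2588   # actual no:  predicted yes / predicted no

pos = t_pos + f_neg        # all actual positives
neg = f_pos + t_neg        # all actual negatives

sensitivity = t_pos / pos              # true positive recognition rate
specificity = t_neg / neg              # true negative recognition rate
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
print(f"precision={precision:.4f} accuracy={accuracy:.4f}")
```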
Adapted from Han and Kamber
Predictor Error Measures
Measure predictor accuracy: measure how far off the predicted value
is from the actual known value
Loss function: measures the error between $y_i$ and the predicted value $y_i'$
Absolute error: $|y_i - y_i'|$
Squared error: $(y_i - y_i')^2$
Test error (generalization error): the average loss over the test set
Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d}|y_i - y_i'|$
Mean squared error: $\frac{1}{d}\sum_{i=1}^{d}(y_i - y_i')^2$
Relative absolute error: $\frac{\sum_{i=1}^{d}|y_i - y_i'|}{\sum_{i=1}^{d}|y_i - \bar{y}|}$
Relative squared error: $\frac{\sum_{i=1}^{d}(y_i - y_i')^2}{\sum_{i=1}^{d}(y_i - \bar{y})^2}$
where $\bar{y}$ is the mean of the actual values over the test set
The mean squared error exaggerates the presence of outliers
The (square) root mean squared error and, similarly, the root relative squared error are popular in practice
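A minimal NumPy sketch of these error measures; the sample arrays y (actual values) and y_pred (predictions) are made up for illustration:

```python
import numpy as np

# Hypothetical actual values y and predictions y_pred over a test set of size d
y = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = np.mean(np.abs(y - y_pred))         # mean absolute error
mse = np.mean((y - y_pred) ** 2)          # mean squared error

# Relative errors: normalize by the error of always predicting the mean
rae = np.sum(np.abs(y - y_pred)) / np.sum(np.abs(y - y.mean()))
rse = np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

rmse = np.sqrt(mse)   # root mean squared error
rrse = np.sqrt(rse)   # root relative squared error
print(mae, mse, rae, rse, rmse, rrse)
```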
Adapted from Han and Kamber
Evaluating the Accuracy of a Classifier or Predictor (I)
Holdout method
Given data is randomly partitioned into two independent sets:
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random sampling: a variation of holdout
Repeat holdout k times; accuracy = average of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
At the i-th iteration, use D_i as the test set and the others as the training set
Leave-one-out: k folds where k = number of tuples; for small-sized data
Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
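A short scikit-learn sketch of both methods, assuming a decision tree on the iris data stands in for any classifier: holdout with a 2/3 vs. 1/3 split, then stratified 10-fold cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: 2/3 training, 1/3 test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# Stratified 10-fold cross-validation: average accuracy over the folds
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_acc = cross_val_score(clf, X, y, cv=cv).mean()
print(holdout_acc, cv_acc)
```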
Adapted from Han and Kamber
Evaluating the Accuracy of a Classifier or Predictor (II)
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
There are several bootstrap methods; a common one is the .632 bootstrap
Suppose we are given a data set of d tuples. The data set is sampled d times, with
replacement, resulting in a training set of d samples. The data tuples that did not
make it into the training set end up forming the test set. About 63.2% of the original
data will end up in the bootstrap sample, and the remaining 36.8% will form the test set
(since $(1 - 1/d)^d \approx e^{-1} \approx 0.368$)
Repeat the sampling procedure k times; the overall accuracy of the model is:
$$\text{acc}(M) = \sum_{i=1}^{k} \left( 0.632 \times \text{acc}(M_i)_{\text{test set}} + 0.368 \times \text{acc}(M_i)_{\text{train set}} \right)$$
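A minimal sketch of the .632 bootstrap as described above, assuming a decision tree on the iris data as the model; here the k per-repetition estimates are averaged to give one overall figure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
d, k = len(X), 10
rng = np.random.default_rng(0)
accs = []

for _ in range(k):
    # Sample d tuples with replacement; out-of-bag tuples form the test set
    train_idx = rng.integers(0, d, size=d)
    test_mask = np.ones(d, dtype=bool)
    test_mask[train_idx] = False

    m = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    acc_test = m.score(X[test_mask], y[test_mask])
    acc_train = m.score(X[train_idx], y[train_idx])
    accs.append(0.632 * acc_test + 0.368 * acc_train)

print(np.mean(accs))  # overall .632 bootstrap accuracy estimate
```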
Adapted from Witten and Frank
Comparing data mining schemes
Frequent question: which of two learning
schemes performs better?
Note: this is domain dependent!
Obvious way: compare 10-fold CV estimates
Problem: variance in estimate
Variance can be reduced using repeated CV
However, we still don't know whether the results are reliable
Adapted from Han and Kamber
Model Selection: ROC Curves
ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
Originated from signal detection
theory
Shows the trade-off between the
true positive rate and the false
positive rate
The area under the ROC curve is a
measure of the accuracy of the
model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
The vertical axis represents the true positive rate
The horizontal axis represents the false positive rate
The plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
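A short sketch of computing an ROC curve and its area with scikit-learn, assuming a logistic regression on synthetic data; the positive-class probabilities provide the ranking of test tuples described above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Rank test tuples by estimated probability of the positive class
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)  # false/true positive rates
auc = roc_auc_score(y_te, probs)               # area under the ROC curve
print(f"AUC = {auc:.3f}")  # 1.0 = perfect; 0.5 = the diagonal (random guessing)
```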
Adapted from Witten and Frank
Paired t-test
Student's t-test tells whether the means of two samples are significantly different
Take individual samples using cross-validation
Use a paired t-test because the individual samples are paired
The same CV is applied twice
William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Obtained a post as a chemist in the Guinness brewery in Dublin in
1899. Invented the t-test to handle small samples for quality
control in brewing. Wrote under the name "Student".
Adapted from Witten and Frank
Significance tests
Significance tests tell us how confident we can be that there really is a difference
Null hypothesis: there is no "real" difference
Alternative hypothesis: there is a difference
A significance test measures how much evidence there is in favor of rejecting the null hypothesis
Let's say we are using 10-fold CV
Question: do the two means of the 10 CV estimates differ significantly?
Adapted from Witten and Frank
Distribution of the means

$x_1, x_2, \ldots, x_k$ and $y_1, y_2, \ldots, y_k$ are the 2k samples for a k-fold CV
$m_x$ and $m_y$ are the means
With enough samples, the mean of a set of independent samples is normally distributed
Estimated variances of the means are $\sigma_x^2/k$ and $\sigma_y^2/k$
If $\mu_x$ and $\mu_y$ are the true means, then
$$\frac{m_x - \mu_x}{\sqrt{\sigma_x^2 / k}} \qquad \text{and} \qquad \frac{m_y - \mu_y}{\sqrt{\sigma_y^2 / k}}$$
are approximately normally distributed with mean 0, variance 1
Adapted from Witten and Frank
Student’s distribution
With small samples (k < 100) the mean
follows Student’s distribution with k–1
degrees of freedom
Confidence limits:
Student's distribution, 9 degrees of freedom:

Pr[X ≥ z]   z
0.1%        4.30
0.5%        3.25
1%          2.82
5%          1.83
10%         1.38
20%         0.88

Normal distribution:

Pr[X ≥ z]   z
0.1%        3.09
0.5%        2.58
1%          2.33
5%          1.65
10%         1.28
20%         0.84
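These confidence limits can be reproduced with SciPy's inverse CDF (ppf); a minimal sketch:

```python
from scipy import stats

# Pr[X >= z] = p corresponds to the (1 - p) quantile of the distribution
for p in [0.001, 0.005, 0.01, 0.05, 0.10, 0.20]:
    z_t = stats.t.ppf(1 - p, df=9)   # Student's t, 9 degrees of freedom
    z_n = stats.norm.ppf(1 - p)      # standard normal
    print(f"Pr[X >= z] = {p:.1%}:  t(9) z = {z_t:.2f}, normal z = {z_n:.2f}")
```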
Adapted from Witten and Frank
Distribution of the differences
Let $m_d = m_x - m_y$
The difference of the means ($m_d$) also has a Student's distribution with k–1 degrees of freedom
Let $\sigma_d^2$ be the variance of the difference
The standardized version of $m_d$ is called the t-statistic:
$$t = \frac{m_d}{\sqrt{\sigma_d^2 / k}}$$
We use t to perform the t-test
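A minimal sketch of the paired t-test: t is computed by hand from the formula above and cross-checked against scipy.stats.ttest_rel; the per-fold accuracies are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical 10-fold CV accuracies for two schemes on the same folds
acc_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.81, 0.79])
acc_b = np.array([0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.78, 0.77, 0.78])

# By hand, following the slide: t = m_d / sqrt(sigma_d^2 / k)
d = acc_a - acc_b
k = len(d)
t = d.mean() / np.sqrt(d.var(ddof=1) / k)

# The same test via scipy (k - 1 = 9 degrees of freedom)
t_scipy, p_value = stats.ttest_rel(acc_a, acc_b)
print(t, t_scipy, p_value)   # the two t values agree
```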
Adapted from Witten and Frank
Performing the test
• Fix a significance level α
• If a difference is significant at the α% level, there is a (100–α)% chance that there really is a difference
• Divide the significance level by two because the test is two-tailed
• I.e., the true difference can be positive or negative
• Look up the value for z that corresponds to α/2
• If t ≤ –z or t ≥ z, then the difference is significant
• I.e., the null hypothesis can be rejected
Adapted from Witten and Frank
Unpaired observations

If the CV estimates are from different randomizations, they are no longer paired (or maybe we used k-fold CV for one scheme and j-fold CV for the other one)
Then we have to use an unpaired t-test with min(k, j) – 1 degrees of freedom
The t-statistic becomes:
$$t = \frac{m_x - m_y}{\sqrt{\frac{\sigma_x^2}{k} + \frac{\sigma_y^2}{j}}}$$
instead of
$$t = \frac{m_d}{\sqrt{\sigma_d^2 / k}}$$
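A short sketch of an unpaired test with SciPy, on hypothetical accuracies from a 10-fold and a 5-fold CV; note that SciPy's Welch variant estimates the degrees of freedom from the data rather than using the conservative min(k, j) – 1 above.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies: 10-fold CV for scheme x, 5-fold CV for scheme y
acc_x = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.81, 0.79])
acc_y = np.array([0.78, 0.80, 0.77, 0.79, 0.76])

# Unpaired (Welch) t-test: no pairing and no equal-variance assumption
t, p = stats.ttest_ind(acc_x, acc_y, equal_var=False)
print(t, p)
```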
Adapted from Witten and Frank
Interpreting the result
All our cross-validation estimates are based on the same dataset
Samples are not independent
Should really use a different dataset sample for each of the k estimates used in the test, to judge performance across different training sets
Or, use a heuristic test, e.g., the corrected resampled t-test
Combining Classifiers
Han and Kamber
Russell and Norvig
Adapted from Han and Kamber
Bagging: Bootstrap Aggregation
Analogy: diagnosis based on multiple doctors' majority vote
Training: given a set D of d tuples, at each iteration i, a training set D_i of d tuples is sampled with replacement from D (i.e., bootstrap)
A classifier model M_i is learned for each training set D_i
Classification: to classify an unknown sample X, each classifier M_i returns its class prediction
The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
Accuracy: often significantly better than a single classifier derived from D
For noisy data: not considerably worse, more robust
Proven to improve accuracy in prediction
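A minimal bagging sketch with scikit-learn's BaggingClassifier (the estimator parameter is named base_estimator in versions before 1.2), using a decision tree on the iris data as the base model:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each base model M_i is trained on a bootstrap sample D_i of D;
# the bagged classifier M* predicts by majority vote
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # classifier learned per sample
    n_estimators=25,
    bootstrap=True,  # sample with replacement
    random_state=0,
)
print(cross_val_score(bagged, X, y, cv=10).mean())
```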
Adapted from Han and Kamber
Boosting
Analogy: Consult several doctors, based on a combination of weighted
diagnoses—weight assigned based on the previous diagnosis accuracy
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated to allow the subsequent
classifier, Mi+1, to pay more attention to the training tuples that were misclassified by
Mi
The final M* combines the votes of each individual classifier, where the weight of
each classifier's vote is a function of its accuracy
The boosting algorithm can be extended for the prediction of continuous values
Comparing with bagging: boosting tends to achieve greater accuracy, but it
also risks overfitting the model to misclassified data
Can be shown to maximize the margin of the classifier
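A short AdaBoost sketch with scikit-learn, assuming the iris data; the library handles the tuple reweighting and accuracy-weighted voting described above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each round reweights the training tuples so the next classifier pays
# more attention to those misclassified so far; the final prediction is
# an accuracy-weighted vote of the k classifiers
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boosted, X, y, cv=10).mean())
```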
Three-part boosting

Train classifier A on all data
Train classifier B on data that A makes an error on
Train classifier C on data that A and B don't agree on
Break ties using C
Problem: a strong classifier => fewer training points for B and C
Unreliable
Decision Fusion
Train heterogeneous classifiers
Use a voting mechanism for deciding the final classification
Can learn the relative weights of the classifiers' votes
Can fix weights according to classification accuracy
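A minimal decision-fusion sketch using scikit-learn's VotingClassifier with heterogeneous base classifiers; the vote weights here are fixed by hand, standing in for weights set according to classification accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Heterogeneous classifiers combined by a weighted majority vote;
# the weights are fixed by hand here, e.g., according to known accuracy
fusion = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
    weights=[1, 1, 2],
)
print(cross_val_score(fusion, X, y, cv=10).mean())
```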
Summary
Several methods for evaluating classifier accuracy:
Holdout methods
Bootstrap
Comparing classifiers:
Confidence intervals
ROC curves
Combining classifiers:
Bagging
Boosting
Fusion