On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach (paper by Steven L. Salzberg)
On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach
Published by Steven L. Salzberg
Presented by Prakash Tilwani
MACS 598, April 25th 2001
Agenda
• Introduction
• Classification basics
• Definitions
• Statistical validity
• Bonferroni adjustment
• Statistical accidents
• Repeated tuning
• A recommended approach
• Conclusion
Introduction
• Comparative studies: do they follow proper methodology?
• Public databases: have studies relied too heavily on them?
• Comparison results: are they really correct, or just statistical accidents?
Definitions
• t-test
• F-test
• p-value
• Null hypothesis
T-test
• The t-test assesses whether the means of two groups are statistically different from each other.
• It is the ratio of the difference in means to the variability of the groups.
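The "ratio of difference in means to variability" can be sketched directly. The function below computes a Welch-style t statistic (which does not assume equal variances); the accuracy numbers are made up for illustration.

```python
import math

def welch_t(a, b):
    """t statistic: difference in sample means divided by the
    combined standard error (a measure of group variability)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance of b
    se = math.sqrt(va / na + vb / nb)              # variability of the groups
    return (ma - mb) / se

# Hypothetical accuracy scores of two classifiers over repeated runs
acc_a = [0.81, 0.79, 0.84, 0.80, 0.82]
acc_b = [0.76, 0.78, 0.75, 0.77, 0.79]
print(welch_t(acc_a, acc_b))
```

A large absolute t value suggests the mean difference is big relative to the noise; the corresponding p-value would come from the t distribution.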
F-test
• The F-test determines whether the variances of two samples are significantly different.
• It is the ratio of the variances of the two datasets.
• It is the basis for "Analysis of Variance" (ANOVA).
p-value
• The p-value represents the probability of concluding (incorrectly) that there is a difference between samples when no true difference exists.
• It depends on the statistical test being performed.
• p = 0.05 means there is a 5% chance of being wrong if you conclude the populations are different.
Null Hypothesis
• The assumption that there is no difference between two or more populations.
• Any observed difference between samples is due to chance or sampling error.
Statistical Validity Tests
• Statistics offers many tests designed to measure the significance of any difference.
• Adapting them to classifier comparison should be done carefully.
Bonferroni Adjustment: an example
• A comparison of classifier algorithms across 154 datasets.
• The null hypothesis was rejected whenever the p-value was < 0.05 (not a very stringent threshold).
• Differences were reported as significant if a t-test produced a p-value < 0.05.
Example (cont.)
• This is not a correct use of the p-value significance test.
• There were 154 experiments, and therefore 154 chances to appear significant.
• The effective significance level becomes 154 × 0.05 = 7.7, which is meaningless as a probability.
Example (cont.)
• Let the significance level for each experiment be α.
• The chance of making the right conclusion in one experiment is 1 − α.
• Assuming the experiments are independent of one another, the chance of getting all n experiments correct is (1 − α)^n.
• The chance of at least one incorrect conclusion is 1 − (1 − α)^n.
Example (cont.)
• Substituting α = 0.05, the chance of at least one incorrect conclusion is 1 − (1 − 0.05)^154 ≈ 0.9996.
• To obtain results significant at the 0.05 level across 154 tests, solve 1 − (1 − α)^154 < 0.05, which gives α < 0.0003.
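The arithmetic above is easy to reproduce; a quick sketch of both quantities for the 154-dataset example:

```python
n_tests = 154
alpha_family = 0.05

# Chance of at least one false "significant" result if each of the
# 154 independent tests uses alpha = 0.05
p_any_error = 1 - (1 - 0.05) ** n_tests        # close to 0.9996

# Per-test threshold needed so the chance of any error stays below 0.05
# (Sidak form of the correction; the simpler Bonferroni approximation
# is alpha_family / n_tests, which gives nearly the same value)
alpha_per_test = 1 - (1 - alpha_family) ** (1 / n_tests)
print(p_any_error, alpha_per_test)
```

Either form lands near α < 0.0003, orders of magnitude stricter than the 0.05 threshold the studies actually used.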
Example: conclusion
• These are rough calculations, but they provide insight into the problem.
• Using the wrong p-value threshold leads to incorrect conclusions.
• The t-test is the wrong test overall, since the training and test sets are not independent.
Simple Recommended Statistical Test
• When a common test set is used to compare two algorithms A and B, the comparison must consider four numbers:
• examples where A is right and B is wrong (A > B)
• examples where A is wrong and B is right (A < B)
• examples where both are right (A = B)
• examples where both are wrong (~A = ~B)
Simple Recommended Statistical Test (cont.)
• If only two algorithms are compared: throw out the ties and compare the A > B count against the A < B count.
• If more than two algorithms are compared: use "Analysis of Variance" (ANOVA), and apply the Bonferroni adjustment for multiple tests.
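The two-algorithm procedure above (discard ties, compare wins) is a sign test, and its p-value follows from the binomial distribution. A minimal sketch, with made-up win counts:

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided sign test: probability of a win/loss split at least
    this lopsided if A and B were equally likely to win each case.
    Ties are assumed to have been discarded already."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Upper binomial tail at p = 1/2, doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts on a shared test set: A right where B is wrong
# on 30 examples, B right where A is wrong on 12; ties thrown out.
print(sign_test_p(30, 12))
```

A small p-value here says the 30-to-12 split would be unlikely if the two classifiers were really equivalent on this test set.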
Statistical Accidents
• Suppose 100 people are studying the effect of algorithms A and B.
• On average, 5 of them will get results statistically significant at p ≤ 0.05 (assuming independent experiments).
• These results are due to nothing but chance.
Repeated Tuning
• Algorithms are "tuned" repeatedly on the same datasets.
• Every tuning attempt should be counted as a separate experiment.
• For example, if 10 tuning experiments were attempted, the p-value threshold should be 0.005 instead of 0.05.
Repeated Tuning (cont.)
• The experiments on a shared dataset are not independent, so even the Bonferroni adjustment is not very accurate.
• A greater problem occurs when using an algorithm that has been used before: you may not know how it was tuned (one disadvantage of relying on public databases).
Repeated Tuning – Recommended approach
• Break the dataset into k disjoint subsets of approximately equal size.
• Perform k experiments; in each one, a different subset is held out.
• The trained system is tested on the held-out subset.
Repeated Tuning – Recommended approach (cont.)
• At the end of the k-fold experiment, every sample has been used in a test set exactly once.
• Advantage: the test sets are independent.
• Disadvantage: the training sets are clearly not independent.
A Recommended Approach
• Choose other algorithms to include in the comparison; try to include those most similar to the new algorithm.
• Choose datasets.
• Divide each dataset into k subsets for cross-validation; typically k = 10.
• For a small dataset, choose a larger k, since this leaves more examples in the training set.
A Recommended Approach (cont.)
• Run a cross-validation:
• For each of the k subsets D_k of the dataset D, create a training set T = D − D_k.
• Divide T into T1 (training) and T2 (tuning) subsets.
• Once tuning is done, rerun training on all of T.
• Finally, measure accuracy on the held-out subset D_k.
• Overall accuracy is averaged across all k partitions.
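The steps above can be sketched as a short driver. The function and parameter names (`train_fn`, `tune_fn`, `score_fn`) are illustrative placeholders, not from the paper; the toy threshold classifier at the end only exists to make the sketch runnable.

```python
import random

def cross_validate(data, train_fn, tune_fn, score_fn, k=10, seed=0):
    """Sketch of the recommended procedure: k disjoint folds, tuning
    confined to a split of the training data, and the held-out fold
    touched only for the final accuracy measurement."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]            # k disjoint subsets
    scores = []
    for i in range(k):
        held_out = folds[i]                           # D_i, the test fold
        t = [x for j, f in enumerate(folds) if j != i for x in f]
        cut = int(0.8 * len(t))
        t1, t2 = t[:cut], t[cut:]                     # T1 trains, T2 tunes
        params = tune_fn(train_fn, t1, t2)            # pick settings on T2
        model = train_fn(t, params)                   # rerun training on all of T
        scores.append(score_fn(model, held_out))
    return sum(scores) / k                            # average over all k folds

# Toy instantiation: "classify" numbers as large/small with a threshold.
examples = [(x, x > 50) for x in range(100)]
train = lambda data, thr: thr                         # "model" is the threshold
tune = lambda tr, t1, t2: 50                          # pretend tuning chose 50
score = lambda thr, test: sum((x > thr) == y for x, y in test) / len(test)
print(cross_validate(examples, train, tune, score))
```

The key property is that the tuning data T2 comes out of T, never out of the held-out fold, so the final accuracy is measured on data the tuned system has never seen.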
A Recommended Approach (cont.)
• Finally, compare the algorithms.
• When multiple datasets are used, the Bonferroni adjustment should be applied.
Conclusion
• The aim is not to discourage empirical comparisons, but to offer suggestions for avoiding pitfalls.
• Statistical tools should be used carefully.
• Every detail of the experiment should be reported.