Quantification and Semi-Supervised Classification Methods for Handling Changes in Class Distribution

Jack Chongjie Xue†, Gary M. Weiss
KDD-09, Paris, France
Department of Computer and Information Science, Fordham University, USA
†Also with the Office of Institutional Research, Fordham University
Important Research Problem
• Distributions may change after model is induced
• Our research problem/scenario:
  – Class distribution changes but “concept” does not
  – Let x represent an example and y its label. We assume:
    • P(y|x) is constant (i.e., concept does not change)
    • P(y) changes (which means that P(x) must change)
  – Assume unlabeled data available from new class distribution (training and separate test)
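The setup above can be illustrated with a small sampling sketch (an illustrative construction, not from the slides): fix the concept P(y=1|x) = x and change only the marginal P(x); the class prior P(y) then shifts while the concept stays constant.

```python
import random

def sample(draw_x, n, seed=0):
    """Draw (x, y) pairs under a FIXED concept P(y=1|x) = x, so that only
    the marginal P(x) (and hence the class prior P(y)) can differ."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = draw_x(rng)             # marginal P(x) chosen by the caller
        y = int(rng.random() < x)   # fixed concept: P(y=1|x) = x
        data.append((x, y))
    return data

# Original distribution: x ~ Uniform(0, 1)        =>  P(y=1) = E[x] = 0.5
orig = sample(lambda r: r.random(), 20000)

# New distribution: P(x) skewed toward 1 (x = u^0.25) => P(y=1) = E[x] = 0.8
new = sample(lambda r: r.random() ** 0.25, 20000, seed=1)
```

Empirically the positive rate moves from about 0.5 to about 0.8 even though P(y|x) never changed, which is exactly the scenario the talk targets.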
Research Questions and Goals

• Two research questions:
  – How can we maximize classifier performance when class distribution changes but is unknown?
  – How can we utilize unlabeled data from the changed class distribution to accomplish this?
• Our goals:
  – Outperform naïve methods that ignore these changes
  – Approach performance of “oracle” method which trains on labeled data from new distribution
When Class Distribution Changes
Technical Approaches
• Quantification [Forman KDD 06 & DMKD 08]
  – Task of estimating a class distribution (CD)
    • Much easier than classification
  – Adjust model to compensate for CD change [Elkan 01, Weiss & Provost 03]
  – New examples not used directly in training
  – We call these class distribution estimation (CDE) methods
• Semi-Supervised Learning (SSL)
  – Exploits unlabeled data, which are used for training
• Other approaches discussed later
CDE Methods
• CDE-Oracle (upper bound)
  – Determines new CD by peeking at class labels, then adjusts model; CDE upper bound
• CDE-Iterate-n
  – Iterative algorithm, because changes to class distribution will be underestimated:
    1. Build model M on orig. training data (using last NEWCD)
    2. Label new distribution to estimate NEWCD
    3. Adjust M using NEWCD estimate; output M
    4. Increment n; loop to step 1
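The loop above can be sketched as follows (a minimal sketch, assuming a binary classifier that outputs P(y=1|x) under the original training prior; using the standard posterior re-weighting of [Elkan 01] as the "adjust M" step, and a hypothetical `predict_proba`):

```python
def adjust_posterior(p_pos, old_prior, new_prior):
    """Re-weight a posterior estimate for a shifted class prior [Elkan 01]."""
    num = p_pos * new_prior / old_prior
    den = num + (1 - p_pos) * (1 - new_prior) / (1 - old_prior)
    return num / den

def cde_iterate(predict_proba, unlabeled, old_prior, n_iters=2):
    """CDE-Iterate-n (sketch): repeatedly label the new-distribution data
    with the current adjusted model and re-estimate NEWCD, since a single
    pass underestimates the change in class distribution."""
    prior = old_prior
    for _ in range(n_iters):
        # label the new-distribution data with the (adjusted) model
        labels = [adjust_posterior(predict_proba(x), old_prior, prior) >= 0.5
                  for x in unlabeled]
        # estimate NEWCD from the predicted labels
        prior = sum(labels) / len(labels)
    return prior
```

For instance, a toy model `lambda x: x` (the example's score is its posterior) applied to data that is mostly high-scoring will drive the NEWCD estimate well above the original 50%.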
CDE Methods
• CDE-AC
  – Based on Adjusted Count quantification
  – See [Forman KDD 06 and DMKD 08] for details
  – Adjusted positive rate: pr* = (pr – fpr) / (tpr – fpr)
    • pr is calculated from the predicted class labels
    • fpr and tpr obtained via cross-validation on the labeled training set
    • Essentially compensates for the fact that pr will underestimate changes to the class distribution
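As a sketch, the Adjusted Count correction is just this arithmetic (the clipping to [0, 1] is a common safeguard I have added, not something stated on the slide):

```python
def adjusted_positive_rate(pr, tpr, fpr):
    """AC quantification: correct the raw predicted positive rate pr using
    the classifier's true/false positive rates (from cross-validation)."""
    pr_star = (pr - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, pr_star))  # clip to a valid proportion
```

For example, a classifier with tpr = 0.8 and fpr = 0.1 applied to data that is truly 70% positive predicts pr = 0.7 * 0.8 + 0.3 * 0.1 = 0.59, underestimating the change; the correction recovers (0.59 – 0.1) / 0.7 = 0.7.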
SSL Methods
• SSL-Naïve
  1. Build model from labeled training data
  2. Label unlabeled data from new distribution
  3. Build new model from predicted labels of new distr.
  – Note: does not directly use original training data
• SSL-Self-Train
  – Similar to SSL-Naïve, but uses the original training data plus the examples from the new distribution with the most confident predictions (above the median)
  – Iterates until all examples are merged or max iterations (4) reached
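A minimal sketch of SSL-Naïve's three steps, using a hypothetical one-dimensional threshold learner in place of the decision trees used in the talk (the learner and data are illustrative assumptions):

```python
def fit_threshold(X, y):
    """Toy 1-D learner: threshold halfway between the two class means.
    Stands in for the decision-tree learner used in the experiments."""
    pos = [x for x, label in zip(X, y) if label]
    neg = [x for x, label in zip(X, y) if not label]
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x >= t)

def ssl_naive(X, y, unlabeled):
    """SSL-Naive: (1) fit on labeled data, (2) pseudo-label the
    new-distribution data, (3) refit on the pseudo-labels alone
    (the original training data is not used in step 3)."""
    first = fit_threshold(X, y)
    pseudo = [first(x) for x in unlabeled]
    return fit_threshold(unlabeled, pseudo)
```

The refit model's threshold reflects the new data's distribution rather than the original 50/50 training set.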
Hybrid Method
• Combination of SSL-Self-Train and CDE-Iterate
  – Can view as SSL-Self-Train, but at each iteration the model is adjusted to compensate for the difference between the CD of the merged training data and that of the new data the model is applied to
Experiment Methodology
• Use 5 relatively large UCI data sets
• Partition data to form “original” and “new” distributions
  – Original distribution made to be 50% positive
  – New distribution varied from 1% to 99% positive
  – Results averaged over 10 random runs
• Use WEKA’s J48 (similar to C4.5) for experiments
• Track accuracy and F-measure
  – F-measure places more emphasis on the minority class
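The two tracked metrics differ in how they treat the minority class; taking F-measure as the usual F1 (harmonic mean of precision and recall for the positive class, the standard reading):

```python
def f_measure(tp, fp, fn):
    """F1 for the positive class: harmonic mean of precision and recall.
    Unlike accuracy, it ignores true negatives, so minority-class errors
    weigh heavily when positives are rare."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so F1 = 0.8, no matter how many easy negatives the classifier also got right.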
Results: Accuracy (Adult Data Set)
Results: Accuracy (SSL-Naive)
Results: Accuracy (SSL-Self-Train)
Results: Accuracy (CDE-Iterate-1)
Results: Accuracy (CDE-Iterate-2)
Results: Accuracy (Hybrid)
Results: Accuracy (CDE-AC)
Results: Average Accuracy (99 pos rates)
Results: F-Measure (Adult Data Set)
Results: F-Measure (99 pos rates)
Why do Oracle Methods Perform Poorly?

• Oracle method:
  – Oracle trains only on new distribution
  – New distribution often very unbalanced
  – F-measure should do best with balanced data
    • Weiss and Provost (2003) show balanced is best for AUC
• CDE-Oracle method:
  – CDE-Iterate underestimates change in class distr.
  – May be helpful for F-measure, since it will better balance the importance of the minority class
Conclusion
• Can substantially improve performance by not ignoring changes to class distribution
  – Can exploit unlabeled data from new distribution, even if only to estimate NEWCD
  – Quantification methods can be very helpful and much better than semi-supervised learning alone
Future Work
• Problem reduced with well-calibrated probability models (Zadrozny & Elkan ’01)
  – Decision trees do not produce these
  – Evaluate methods that produce good estimates
• In our problem setting P(x) changes
  – Try methods that measure this change and compensate for it (e.g., via weighting the x’s)
• Experiment with initial distribution not 1:1
  – Especially highly skewed distributions (e.g., diseases)
• Other issues: data streams / real-time updates
References

[Forman 06] G. Forman, Quantifying trends accurately despite classifier error and class imbalance, KDD-06, 157-166.
[Forman 08] G. Forman, Quantifying counts and costs via classification, Data Mining and Knowledge Discovery, 17(2), 164-206.
[Weiss & Provost 03] G. Weiss & F. Provost, Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction, Journal of Artificial Intelligence Research, 19:315-354.
[Zadrozny & Elkan 01] B. Zadrozny & C. Elkan, Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, ICML-01, 609-616.