
Page 1: Title Slide

Quantification and Semi-Supervised Classification Methods for Handling Changes in Class Distribution

Jack Chongjie Xue† and Gary M. Weiss
KDD-09, Paris, France
Department of Computer and Information Science, Fordham University, USA
† Also with the Office of Institutional Research, Fordham University

Page 2: Important Research Problem

• Distributions may change after a model is induced
• Our research problem/scenario:
  – The class distribution changes but the "concept" does not
  – Let x represent an example and y its label. We assume:
    • P(y|x) is constant (i.e., the concept does not change)
    • P(y) changes, which means that P(x) must also change (see the identity below)
  – Unlabeled data from the new class distribution are assumed to be available (a training portion and a separate test portion)
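A quick check of the last two assumptions, using only the symbols defined on this slide:

\[
P(y) \;=\; \sum_{x} P(y \mid x)\, P(x)
\]

so if P(y|x) is held fixed while P(y) changes, the change can only have come from P(x).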

Page 3: Research Questions and Goals

• Two research questions:
  – How can we maximize classifier performance when the class distribution changes but the new distribution is unknown?
  – How can we utilize unlabeled data from the changed class distribution to accomplish this?
• Our goals:
  – Outperform naïve methods that ignore these changes
  – Approach the performance of an "oracle" method that trains on labeled data from the new distribution

Page 4: When Class Distribution Changes

Page 5: Technical Approaches

• Quantification [Forman KDD 06 & DMKD 08]
  – The task of estimating a class distribution (CD)
    • Much easier than classification
  – Adjust the model to compensate for the CD change [Elkan 01, Weiss & Provost 03] (a prior-correction sketch follows this slide)
  – New examples are not used directly in training
  – We call these class distribution estimation (CDE) methods
• Semi-Supervised Learning (SSL)
  – Exploits unlabeled data, which are used for training
• Other approaches are discussed later
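A minimal sketch of the kind of adjustment a CDE method can apply, assuming the classifier outputs posterior probabilities and that "adjusting the model" means the standard prior-correction rule (cf. [Elkan 01]); the function name and signature below are illustrative, not from the paper.

```python
def adjust_posterior(p_old, old_pos_rate, new_pos_rate):
    """Re-weight a classifier's posterior P(y=1|x) for a changed class prior.

    Standard prior-correction: scale each class's posterior by the ratio of
    new to old priors, then renormalize.
    """
    r_pos = new_pos_rate / old_pos_rate                   # prior ratio, positive class
    r_neg = (1.0 - new_pos_rate) / (1.0 - old_pos_rate)   # prior ratio, negative class
    num = p_old * r_pos
    return num / (num + (1.0 - p_old) * r_neg)

# Example: a posterior of 0.60 learned under a 50% positive training set drops
# sharply once the positive rate is believed to be only 10%.
print(adjust_posterior(0.60, 0.50, 0.10))  # ~0.14
```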

Page 6: CDE Methods

• CDE-Oracle (upper bound)
  – Determines the new CD by peeking at the class labels, then adjusts the model; serves as the CDE upper bound
• CDE-Iterate-n (iterative, because changes to the class distribution would otherwise be underestimated; sketched after this slide)
  1. Build model M on the original training data (using the latest NEWCD estimate)
  2. Label the new-distribution data with M to estimate NEWCD
  3. Adjust M using the NEWCD estimate; output M
  4. Increment the iteration counter and loop to step 1 (up to n iterations)
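A compact sketch of the CDE-Iterate-n loop under two stated assumptions: scikit-learn's DecisionTreeClassifier stands in for the paper's WEKA J48, and "adjusting M" is read as prior-correcting its predicted probabilities with the current NEWCD estimate. The helper names and the 0/1 label coding are illustrative, not the authors' exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cde_iterate(X_train, y_train, X_new, n=2, train_pos_rate=0.5):
    """Iteratively re-estimate the new positive rate (NEWCD) from unlabeled data.

    Assumes binary labels coded 0/1, so predict_proba column 1 is P(y=1|x).
    """
    model = DecisionTreeClassifier().fit(X_train, y_train)   # build M on original data
    new_cd = train_pos_rate                                   # initial NEWCD = training rate
    for _ in range(n):
        p = model.predict_proba(X_new)[:, 1]                  # score the new-distribution data
        r_pos = new_cd / train_pos_rate                       # adjust with current NEWCD estimate
        r_neg = (1 - new_cd) / (1 - train_pos_rate)
        p_adj = p * r_pos / (p * r_pos + (1 - p) * r_neg)
        new_cd = float(np.mean(p_adj >= 0.5))                 # re-estimate NEWCD, then loop
    return model, new_cd                                      # model plus its final CD correction
```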

Page 7: CDE Methods

• CDE-AC
  – Based on Adjusted Count quantification; see [Forman KDD 06 and DMKD 08] for details
  – Adjusted positive rate: pr* = (pr − fpr) / (tpr − fpr)
    • pr is computed from the predicted class labels on the new data
    • fpr and tpr are obtained via cross-validation on the labeled training set
    • Essentially compensates for the fact that pr by itself will underestimate changes to the class distribution
  – A worked example follows this slide
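A worked instance of the Adjusted Count formula above; the tpr, fpr, and pr values are made up for illustration, not results from the paper.

```python
def adjusted_positive_rate(pr, tpr, fpr):
    """Adjusted Count estimate pr* = (pr - fpr) / (tpr - fpr)."""
    return (pr - fpr) / (tpr - fpr)

# Suppose cross-validation on the training set gives tpr = 0.80 and fpr = 0.10,
# and the classifier labels 24% of the new, unlabeled data as positive.
print(adjusted_positive_rate(pr=0.24, tpr=0.80, fpr=0.10))  # ~0.20, i.e. an estimated 20% positives
```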

Page 8: SSL Methods

• SSL-Naïve
  1. Build a model from the labeled training data
  2. Label the unlabeled data from the new distribution
  3. Build a new model from the predicted labels of the new-distribution data
  – Note: the final model does not directly use the original training data
• SSL-Self-Train (sketched after this slide)
  – Similar to SSL-Naïve, but the original training data are retained and merged with the new-distribution examples whose predictions are most confident (above the median confidence)
  – Iterates until all examples are merged or the maximum number of iterations (4) is reached
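A minimal self-training sketch in the spirit of SSL-Self-Train, again using scikit-learn's DecisionTreeClassifier as a stand-in for WEKA J48 and assuming binary labels coded 0/1. The above-median confidence rule and the iteration cap follow the slide; the rest is one plausible reading, not the authors' exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ssl_self_train(X_train, y_train, X_new, max_iter=4):
    """Grow the labeled set with confidently pseudo-labeled new-distribution examples."""
    X_lab, y_lab = X_train.copy(), y_train.copy()
    remaining = X_new.copy()
    for _ in range(max_iter):
        if len(remaining) == 0:
            break
        model = DecisionTreeClassifier().fit(X_lab, y_lab)
        proba = model.predict_proba(remaining)
        conf = proba.max(axis=1)                      # confidence of each predicted label
        keep = conf >= np.median(conf)                # most confident examples (above the median)
        X_lab = np.vstack([X_lab, remaining[keep]])   # merge them into the training data
        y_lab = np.concatenate([y_lab, proba.argmax(axis=1)[keep]])  # 0/1 labels assumed
        remaining = remaining[~keep]
    return DecisionTreeClassifier().fit(X_lab, y_lab)
```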

Page 9: Hybrid Method

• A combination of SSL-Self-Train and CDE-Iterate
  – Can be viewed as SSL-Self-Train in which, at each iteration, the model is adjusted to compensate for the difference between the class distribution of the merged training data and the class distribution estimated when the model is applied to the new data (i.e., the two sketches above composed)

Page 10: Experiment Methodology

• Use 5 relatively large UCI data sets
• Partition the data to form "original" and "new" distributions (see the sampling sketch after this slide)
  – The original distribution is made 50% positive
  – The new distribution is varied from 1% to 99% positive
  – Results are averaged over 10 random runs
• Use WEKA's J48 (a C4.5-like decision tree learner) for the experiments
• Track accuracy and F-measure
  – F-measure places more emphasis on the minority class
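One way to form a "new" distribution with a chosen positive rate is to sample each class separately, which changes P(y) while leaving P(x|y), and hence P(y|x), untouched. The function below is an illustrative sketch, not the paper's exact partitioning code.

```python
import numpy as np

def sample_with_pos_rate(X, y, n, pos_rate, seed=0):
    """Draw n examples with the requested positive rate (labels coded 0/1).

    Sampling within each class preserves P(x|y); only the class prior changes.
    Assumes each class has enough examples for sampling without replacement.
    """
    rng = np.random.default_rng(seed)
    n_pos = int(round(n * pos_rate))
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == 1), n_pos, replace=False),
        rng.choice(np.flatnonzero(y == 0), n - n_pos, replace=False),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]
```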

Page 11: Results: Accuracy (Adult Data Set)

Page 12: Results: Accuracy (SSL-Naïve)

Page 13: Results: Accuracy (SSL-Self-Train)

Page 14: Results: Accuracy (CDE-Iterate-1)

Page 15: Results: Accuracy (CDE-Iterate-2)

Page 16: Results: Accuracy (Hybrid)

Page 17: Results: Accuracy (CDE-AC)

Page 18: Results: Average Accuracy (99 positive rates)

Page 19: Results: F-Measure (Adult Data Set)

Page 20: Results: F-Measure (99 positive rates)

Page 21: Why Do the Oracle Methods Perform Poorly?

• Oracle method:
  – The oracle trains only on the new distribution
  – The new distribution is often very unbalanced
  – F-measure should do best with balanced training data
    • Weiss and Provost (2003) show that a balanced distribution is best for AUC
• CDE-Oracle method:
  – CDE-Iterate underestimates the change in the class distribution
  – This may be helpful for F-measure, since it better balances the importance of the minority class

Page 22: Conclusion

• Performance can be substantially improved by not ignoring changes to the class distribution
  – Unlabeled data from the new distribution can be exploited, even if only to estimate NEWCD
  – Quantification methods can be very helpful, and much better than semi-supervised learning alone

Page 23: Future Work

• The problem is reduced with well-calibrated probability models (Zadrozny & Elkan '01)
  – Decision trees do not produce these
  – Evaluate methods that produce good probability estimates
• In our problem setting P(x) changes
  – Try methods that measure this change and compensate for it (e.g., by weighting the x's)
• Experiment with initial distributions other than 1:1
  – Especially highly skewed distributions (e.g., diseases)
• Other issues: data streams / real-time updating

Page 24: References

[Forman 06] G. Forman. Quantifying trends accurately despite classifier error and class imbalance. KDD-06, 157-166.

[Forman 08] G. Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164-206.

[Weiss & Provost 03] G. Weiss & F. Provost. Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315-354.

[Zadrozny & Elkan 01] B. Zadrozny & C. Elkan. Obtaining calibrated probability estimates from decision trees and naïve Bayesian classifiers. ICML-01, 609-616.