
Page 1: Title Slide

Quantification and Semi-Supervised Classification Methods for Handling Changes in Class Distribution

Jack Chongjie Xue† and Gary M. Weiss
KDD-09, Paris, France
Department of Computer and Information Science, Fordham University, USA
† Also with the Office of Institutional Research, Fordham University

Page 2: Important Research Problem

• Distributions may change after a model is induced
• Our research problem/scenario:
  – The class distribution changes but the "concept" does not
  – Let x represent an example and y its label. We assume:
    • P(y|x) is constant (i.e., the concept does not change)
    • P(y) changes, which means that P(x) must also change (see the identity below)
  – Unlabeled data from the new class distribution are assumed to be available (a training portion and a separate test portion)
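A quick check of the last two assumptions, using only the symbols defined on this slide:

\[
P(y) \;=\; \sum_{x} P(y \mid x)\, P(x)
\]

so if P(y|x) is held fixed while P(y) changes, the change can only have come from P(x).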

Page 3: Research Questions and Goals

• Two research questions:
  – How can we maximize classifier performance when the class distribution changes but the new distribution is unknown?
  – How can we utilize unlabeled data from the changed class distribution to accomplish this?
• Our goals:
  – Outperform naïve methods that ignore these changes
  – Approach the performance of an "oracle" method that trains on labeled data from the new distribution

Page 4: When Class Distribution Changes

Page 5: Technical Approaches

• Quantification [Forman KDD 06 & DMKD 08]
  – The task of estimating a class distribution (CD)
    • Much easier than classification
  – Adjust the model to compensate for the CD change [Elkan 01, Weiss & Provost 03] (a prior-correction sketch follows this slide)
  – New examples are not used directly in training
  – We call these class distribution estimation (CDE) methods
• Semi-Supervised Learning (SSL)
  – Exploits unlabeled data, which are used for training
• Other approaches are discussed later
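A minimal sketch of the kind of adjustment a CDE method can apply, assuming the classifier outputs posterior probabilities and that "adjusting the model" means the standard prior-correction rule (cf. [Elkan 01]); the function name and signature below are illustrative, not from the paper.

```python
def adjust_posterior(p_old, old_pos_rate, new_pos_rate):
    """Re-weight a classifier's posterior P(y=1|x) for a changed class prior.

    Standard prior-correction: scale each class's posterior by the ratio of
    new to old priors, then renormalize.
    """
    r_pos = new_pos_rate / old_pos_rate                   # prior ratio, positive class
    r_neg = (1.0 - new_pos_rate) / (1.0 - old_pos_rate)   # prior ratio, negative class
    num = p_old * r_pos
    return num / (num + (1.0 - p_old) * r_neg)

# Example: a posterior of 0.60 learned under a 50% positive training set drops
# sharply once the positive rate is believed to be only 10%.
print(adjust_posterior(0.60, 0.50, 0.10))  # ~0.14
```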

Page 6: CDE Methods

• CDE-Oracle (upper bound)
  – Determines the new CD by peeking at the class labels, then adjusts the model; serves as the CDE upper bound
• CDE-Iterate-n (iterative, because changes to the class distribution would otherwise be underestimated; sketched after this slide)
  1. Build model M on the original training data (using the latest NEWCD estimate)
  2. Label the new-distribution data with M to estimate NEWCD
  3. Adjust M using the NEWCD estimate; output M
  4. Increment the iteration counter and loop to step 1 (up to n iterations)
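A compact sketch of the CDE-Iterate-n loop under two stated assumptions: scikit-learn's DecisionTreeClassifier stands in for the paper's WEKA J48, and "adjusting M" is read as prior-correcting its predicted probabilities with the current NEWCD estimate. The helper names and the 0/1 label coding are illustrative, not the authors' exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cde_iterate(X_train, y_train, X_new, n=2, train_pos_rate=0.5):
    """Iteratively re-estimate the new positive rate (NEWCD) from unlabeled data.

    Assumes binary labels coded 0/1, so predict_proba column 1 is P(y=1|x).
    """
    model = DecisionTreeClassifier().fit(X_train, y_train)   # build M on original data
    new_cd = train_pos_rate                                   # initial NEWCD = training rate
    for _ in range(n):
        p = model.predict_proba(X_new)[:, 1]                  # score the new-distribution data
        r_pos = new_cd / train_pos_rate                       # adjust with current NEWCD estimate
        r_neg = (1 - new_cd) / (1 - train_pos_rate)
        p_adj = p * r_pos / (p * r_pos + (1 - p) * r_neg)
        new_cd = float(np.mean(p_adj >= 0.5))                 # re-estimate NEWCD, then loop
    return model, new_cd                                      # model plus its final CD correction
```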

Page 7: CDE Methods

• CDE-AC
  – Based on Adjusted Count quantification; see [Forman KDD 06 and DMKD 08] for details
  – Adjusted positive rate: pr* = (pr − fpr) / (tpr − fpr)
    • pr is computed from the predicted class labels on the new data
    • fpr and tpr are obtained via cross-validation on the labeled training set
    • Essentially compensates for the fact that pr by itself will underestimate changes to the class distribution
  – A worked example follows this slide
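A worked instance of the Adjusted Count formula above; the tpr, fpr, and pr values are made up for illustration, not results from the paper.

```python
def adjusted_positive_rate(pr, tpr, fpr):
    """Adjusted Count estimate pr* = (pr - fpr) / (tpr - fpr)."""
    return (pr - fpr) / (tpr - fpr)

# Suppose cross-validation on the training set gives tpr = 0.80 and fpr = 0.10,
# and the classifier labels 24% of the new, unlabeled data as positive.
print(adjusted_positive_rate(pr=0.24, tpr=0.80, fpr=0.10))  # ~0.20, i.e. an estimated 20% positives
```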

Page 8: SSL Methods

• SSL-Naïve
  1. Build a model from the labeled training data
  2. Label the unlabeled data from the new distribution
  3. Build a new model from the predicted labels of the new-distribution data
  – Note: the final model does not directly use the original training data
• SSL-Self-Train (sketched after this slide)
  – Similar to SSL-Naïve, but the original training data are retained and merged with the new-distribution examples whose predictions are most confident (above the median confidence)
  – Iterates until all examples are merged or the maximum number of iterations (4) is reached
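A minimal self-training sketch in the spirit of SSL-Self-Train, again using scikit-learn's DecisionTreeClassifier as a stand-in for WEKA J48 and assuming binary labels coded 0/1. The above-median confidence rule and the iteration cap follow the slide; the rest is one plausible reading, not the authors' exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ssl_self_train(X_train, y_train, X_new, max_iter=4):
    """Grow the labeled set with confidently pseudo-labeled new-distribution examples."""
    X_lab, y_lab = X_train.copy(), y_train.copy()
    remaining = X_new.copy()
    for _ in range(max_iter):
        if len(remaining) == 0:
            break
        model = DecisionTreeClassifier().fit(X_lab, y_lab)
        proba = model.predict_proba(remaining)
        conf = proba.max(axis=1)                      # confidence of each predicted label
        keep = conf >= np.median(conf)                # most confident examples (above the median)
        X_lab = np.vstack([X_lab, remaining[keep]])   # merge them into the training data
        y_lab = np.concatenate([y_lab, proba.argmax(axis=1)[keep]])  # 0/1 labels assumed
        remaining = remaining[~keep]
    return DecisionTreeClassifier().fit(X_lab, y_lab)
```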

Page 9: Hybrid Method

• A combination of SSL-Self-Train and CDE-Iterate
  – Can be viewed as SSL-Self-Train in which, at each iteration, the model is adjusted to compensate for the difference between the class distribution of the merged training data and the class distribution estimated when the model is applied to the new data (i.e., the two sketches above composed)

Page 10: Experiment Methodology

• Use 5 relatively large UCI data sets
• Partition the data to form "original" and "new" distributions (see the sampling sketch after this slide)
  – The original distribution is made 50% positive
  – The new distribution is varied from 1% to 99% positive
  – Results are averaged over 10 random runs
• Use WEKA's J48 (a C4.5-like decision tree learner) for the experiments
• Track accuracy and F-measure
  – F-measure places more emphasis on the minority class
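One way to form a "new" distribution with a chosen positive rate is to sample each class separately, which changes P(y) while leaving P(x|y), and hence P(y|x), untouched. The function below is an illustrative sketch, not the paper's exact partitioning code.

```python
import numpy as np

def sample_with_pos_rate(X, y, n, pos_rate, seed=0):
    """Draw n examples with the requested positive rate (labels coded 0/1).

    Sampling within each class preserves P(x|y); only the class prior changes.
    Assumes each class has enough examples for sampling without replacement.
    """
    rng = np.random.default_rng(seed)
    n_pos = int(round(n * pos_rate))
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == 1), n_pos, replace=False),
        rng.choice(np.flatnonzero(y == 0), n - n_pos, replace=False),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]
```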

Page 11: Results: Accuracy (Adult Data Set)

Page 12: Results: Accuracy (SSL-Naïve)

Page 13: Results: Accuracy (SSL-Self-Train)

Page 14: Results: Accuracy (CDE-Iterate-1)

Page 15: Results: Accuracy (CDE-Iterate-2)

Page 16: Results: Accuracy (Hybrid)

Page 17: Results: Accuracy (CDE-AC)

Page 18: Results: Average Accuracy (99 positive rates)

Page 19: Results: F-Measure (Adult Data Set)

Page 20: Results: F-Measure (99 positive rates)

Page 21: Why Do the Oracle Methods Perform Poorly?

• Oracle method:
  – The oracle trains only on the new distribution
  – The new distribution is often very unbalanced
  – F-measure should do best with balanced training data
    • Weiss and Provost (2003) show that a balanced distribution is best for AUC
• CDE-Oracle method:
  – CDE-Iterate underestimates the change in the class distribution
  – This may be helpful for F-measure, since it better balances the importance of the minority class

Page 22: Conclusion

• Performance can be substantially improved by not ignoring changes to the class distribution
  – Unlabeled data from the new distribution can be exploited, even if only to estimate NEWCD
  – Quantification methods can be very helpful, and much better than semi-supervised learning alone

Page 23: Future Work

• The problem is reduced with well-calibrated probability models (Zadrozny & Elkan '01)
  – Decision trees do not produce these
  – Evaluate methods that produce good probability estimates
• In our problem setting P(x) changes
  – Try methods that measure this change and compensate for it (e.g., by weighting the x's)
• Experiment with initial distributions other than 1:1
  – Especially highly skewed distributions (e.g., diseases)
• Other issues: data streams / real-time updating

Page 24: References

[Forman 06] G. Forman. Quantifying trends accurately despite classifier error and class imbalance. KDD-06, 157-166.

[Forman 08] G. Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164-206.

[Weiss & Provost 03] G. Weiss & F. Provost. Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315-354.

[Zadrozny & Elkan 01] B. Zadrozny & C. Elkan. Obtaining calibrated probability estimates from decision trees and naïve Bayesian classifiers. ICML-01, 609-616.