Learning from Imbalanced, Only Positive and Unlabeled Data Yetian Chen 04-29-2009.


Learning from Imbalanced, Only Positive and Unlabeled Data

Yetian Chen

04-29-2009


Outline

Introduction and Problem statement

2008 UC San Diego Data Mining Competition

Task 1: Supervised Learning from Imbalanced Data Sets

Over-sampling and Under-sampling

Task 2: Semi-Supervised Learning from Only Positive and Unlabeled Data

Two-step Strategy


Statement of Problems

2008 UC San Diego Data Mining Competition

Task 1: Standard Binary Classification

A binary classification task involving 20 real-valued features from an experiment in the physical sciences. The training data consist of 40,000 examples, with roughly ten times as many negative examples as positive. The test set, however, is evenly split between positive and negative examples.

Task 2: Positive-Only Semi-Supervised Learning

Also a binary classification task, but most of the training examples are unlabeled; only a few of the positive examples have labels. The unlabeled set contains both positive and negative examples, with several times as many negatives as positives. This class distribution is reflected in the test sets.


Task 1: Learning from Imbalanced Data

Class imbalance is prevalent in many applications: fraud/intrusion detection, risk management, text classification, medical diagnosis/monitoring, etc.

Standard classifiers tend to be overwhelmed by the large class and to ignore the small one, i.e., they produce high predictive accuracy on the majority class but poor predictive accuracy on the minority class.


Solutions to Class Imbalance Problem

At the data level (re-sampling):
– Over-sampling: increase the number of minority instances by sampling them repeatedly
– Under-sampling: extract a smaller set of majority instances while preserving all the minority instances

At the algorithmic level:
– Cost-sensitive methods: adjust the costs of the various classes so as to counter the class imbalance
– …


Over-sampling

SMOTE: Synthetic Minority Over-sampling Technique

The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining it to any/all of its k nearest minority class neighbors.

Over-sampling by duplication: simply replicate the minority examples.
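A minimal pure-Python sketch of the SMOTE interpolation step (the brute-force neighbor search and all names here are my own illustration, not the reference implementation):

```python
import random

def smote(minority, k=5, n_synthetic=None):
    """Generate synthetic minority examples along the line segments
    joining each minority sample to one of its k nearest minority
    neighbors (after Chawla et al., 2002)."""
    if n_synthetic is None:
        n_synthetic = len(minority)          # 100% over-sampling
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    synthetic = []
    for _ in range(n_synthetic):
        x = random.choice(minority)
        # k nearest minority neighbors of x (excluding x itself)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: dist2(p, x))[:k]
        nn = random.choice(neighbors)
        gap = random.random()                # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nn)))
    return synthetic
```

Each synthetic point lies on a segment between two real minority examples, so the new samples stay inside the minority region rather than exactly duplicating existing points.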


Under-sampling

Randomly select a subset from the majority class, with size roughly equal to the size of the minority class.
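This random under-sampling step can be sketched in a few lines (names are my own):

```python
import random

def random_undersample(majority, minority, seed=None):
    """Randomly extract a majority subset roughly the size of the
    minority class, keeping every minority example."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, k=len(minority))
    return kept_majority + list(minority)
```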

After re-sampling, apply standard classifiers to the rebalanced data sets and compare the accuracies:

Decision Tree, Naïve Bayes, Neural Network (one hidden layer)


Results for Task 1: Effect of re-sampling techniques on imbalanced data

[Figure: test accuracy (0.5 to 1.0) of Decision Tree, Naive Bayes and Neural Network (11 hidden neurons) under each scheme: regular (no re-sampling), US (under-sampling), OSbD (over-sampling by duplication), SMOTE]

       regular   US      OSbD    SMOTE
DT     0.791     0.828   0.788   0.875
NB     0.834     0.827   0.827   0.838
NN     0.835     0.909   0.904   0.910

For the Neural Network classifier, I experimented with different numbers of hidden units (5, 11, 15, 20); 11 gives the best accuracy.


My Ranking: 52nd/199


Conclusion for Task 1

For Naïve Bayes classifiers, re-sampling does not improve the accuracy significantly.

For Decision Tree Classifiers, random under-sampling and over-sampling with SMOTE significantly improve the accuracy.

For Neural Networks, all three re-sampling techniques significantly improve the accuracy.

The Neural Network classifier with SMOTE over-sampling gives the best accuracy of all classifier/re-sampling combinations.


Task 2: Learning from Only Positive and Unlabeled Data

Positive set P: a set of examples known to belong to the class of interest.

Unlabeled set U: a set of unlabeled (mixed) examples containing instances both from P's class and not from it (negative examples).

Goal: build a classifier to classify the examples in U and/or future (test) data.

Key feature of the problem: no labeled negative training data.

We call this problem PU-learning.


Examples in Real Life

Specialized molecular biology databases: the database defines a set of positive examples (genes/proteins related to a certain disease or function); there is no information about examples that should not be included, and it is unnatural to build such a set.

Learning a user's preference for web pages: the user's bookmarks can be considered positive examples; all other web pages are unlabeled examples.

Direct marketing: a company's current list of customers serves as positive examples.

Text classification: labeling is labor-intensive.


Are Unlabeled Examples Helpful?

The target function is known to be either x1 < 0 or x2 > 0. Which one is it?

[Figure: positive (+) and unlabeled (u) examples plotted against the two candidate decision boundaries x1 < 0 and x2 > 0]

The function is "not learnable" with only positive examples. However, the addition of unlabeled examples makes it learnable.


Two-step strategy

Step 1: Identify a set of reliable negative examples from the unlabeled set.
– S-EM [Liu et al., 2002] uses a Spy technique
– PEBL [Yu et al., 2002] uses a 1-DNF technique
– Roc-SVM [Li & Liu, 2003] uses the Rocchio algorithm
– …

Step 2: Build a sequence of classifiers by iteratively applying a classification algorithm and then selecting a good classifier.
– S-EM uses the Expectation-Maximization (EM) algorithm with an error-based classifier selection mechanism
– PEBL uses SVM and returns the classifier at convergence, i.e., no classifier selection
– Roc-SVM uses SVM with a heuristic method for selecting the final classifier


[Diagram: Step 1 splits the unlabeled set U into a set of reliable negatives RN and the remainder Q = U - RN; Step 2 then either uses P, RN and Q to build the final classifier iteratively, or uses only P and RN to build a classifier.]


Step 1: The Spy technique

Sample a certain percentage of positive examples and put them into the unlabeled set to act as "spies".

Run a classification algorithm assuming all unlabeled examples are negative:
– Through the "spies" we learn how the actual positive examples in the unlabeled set behave.
– Use the Expectation-Maximization (EM) algorithm to assign each unlabeled example a probabilistic class label.

We can then extract reliable negative examples from the unlabeled set more accurately.
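The thresholding at the heart of the Spy technique can be sketched as follows. Here the trained probability-of-positive model is abstracted as a `score` function, assumed to come from a classifier already fitted with the spies mixed into the unlabeled set; all names are my own illustration:

```python
import random

def spy_reliable_negatives(P, U, score, spy_frac=0.15, noise=0.05, seed=0):
    """Spy-technique thresholding (after S-EM, Liu et al., 2002).
    `score` maps an example to an estimated probability of being
    positive.  Returns the unlabeled examples judged reliably negative."""
    rng = random.Random(seed)
    spies = rng.sample(P, max(1, int(spy_frac * len(P))))
    # Pick threshold t as the `noise` quantile of the spy scores, so
    # roughly (1 - noise) of the spies still score above t.
    spy_scores = sorted(score(s) for s in spies)
    t = spy_scores[int(noise * len(spy_scores))]
    # Reliable negatives: unlabeled examples scoring below the threshold.
    return [x for x in U if score(x) < t]
```

The `noise` parameter tolerates a small fraction of mislabeled or atypical spies so that a single outlier spy does not drag the threshold down to zero.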



Step 2: Building the final classifier

Use a Naïve Bayes classifier to build the final classifier,

with P as the positive class and RN (the reliable negative examples) as the negative class.
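A self-contained sketch of this final step using a Gaussian naive Bayes (the slides specify Naïve Bayes; the Gaussian variant for real-valued features and all names here are my illustration):

```python
import math

def fit_gaussian_nb(P, RN):
    """Fit a Gaussian naive Bayes with P as the positive class (+1)
    and the reliable negatives RN as the negative class (-1);
    returns a predict function."""
    def stats(data):
        n, d = len(data), len(data[0])
        mean = [sum(row[j] for row in data) / n for j in range(d)]
        var = [sum((row[j] - mean[j]) ** 2 for row in data) / n + 1e-9
               for j in range(d)]          # small floor avoids zero variance
        return mean, var
    params = {+1: stats(P), -1: stats(RN)}
    priors = {+1: len(P) / (len(P) + len(RN)),
              -1: len(RN) / (len(P) + len(RN))}
    def log_gauss(x, m, v):
        return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    def predict(x):
        # Class with the highest log posterior (up to a constant).
        score = {}
        for c, (mean, var) in params.items():
            score[c] = math.log(priors[c]) + sum(
                log_gauss(xj, mj, vj) for xj, mj, vj in zip(x, mean, var))
        return max(score, key=score.get)
    return predict
```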


Results and Conclusion for Task 2

Using P as the positive class and U as the negative class, with SMOTE over-sampling P to roughly the size of U: F1 score = 0.545.

The two-step algorithm gives F1 score = 0.651.

The highest score in the competition was F1 = 0.721.

The problem is learnable from only positive and unlabeled data with the two-step strategy.


Future Work

For Task 1, we can try cost-sensitive methods.

For Task 2, other instantiations of the two-step strategy:
– Step 1: 1-DNF, Rocchio algorithm
– Step 2: SVM


References

B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pages 179–188, 2003.

B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), July 8–12, 2002, Sydney, Australia.

W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), August 21–24, 2003, Washington, DC, USA.

G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. A supervised learning approach for imbalanced data sets. In Proceedings of the 19th International Conference on Pattern Recognition (ICPR 2008), pages 1–4, 2008.

N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations, 6(1):1–6, 2004.

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
