Learning to Classify Documents with Only a Small Positive Training Set


Transcript of Learning to Classify Documents with Only a Small Positive Training Set

Page 1: Learning to Classify Documents with Only a Small Positive Training Set

Learning to Classify Documents with Only a Small Positive Training Set

Xiao-Li Li Institute for Infocomm Research, Singapore

Nanyang Technological University, Singapore

Joint work with Bing Liu (University of Illinois at Chicago) See-Kiong Ng (Institute for Infocomm Research)

Page 2: Learning to Classify Documents with Only a Small Positive Training Set

Outline

• 1. Introduction to the problem

• 2. The Proposed Technique LPLP

• 3. Empirical Evaluation

• 4. Conclusions

Page 3: Learning to Classify Documents with Only a Small Positive Training Set

1. Introduction

• Traditional Supervised Learning

– Given a set of labeled training documents of n classes, the system uses this set to build a classifier.

– The classifier is then used to classify new documents into the n classes.

• It typically requires a large number of labeled examples, and labeling them can be an expensive and tedious process.

Page 4: Learning to Classify Documents with Only a Small Positive Training Set

Positive-Unlabeled (PU) Learning

• One way to reduce the amount of labeled training data is to develop classification algorithms that can learn from a set P of labeled positive examples augmented with a set U of unlabeled examples.

• Then, build a classifier using P and U to classify the data in U as well as future test data. We call this the PU learning problem.

Page 5: Learning to Classify Documents with Only a Small Positive Training Set

PU Learning

• Positive documents: one has a set P of documents of a particular class, and

• Unlabeled (or mixed) set: one also has a set U of unlabeled documents containing documents from the positive class as well as documents not from it (negative documents).

• Build a classifier: Build a classifier to classify the documents in U and future (test) data.

Page 6: Learning to Classify Documents with Only a Small Positive Training Set

An illustration of the typical PU Learning

[Figure: the positive set P contains machine learning papers from ECML; the unlabeled set U contains papers from AAAI. The classifier aims to automatically find the hidden positives in U, i.e., the machine learning papers in AAAI.]

Page 7: Learning to Classify Documents with Only a Small Positive Training Set

Applications of the problem

• Given the ECML proceedings, find all machine learning papers from AAAI, IJCAI, and KDD.

• Given a person's bookmarks, identify the documents from Web sources that are of interest to him/her.

• A company has a database with details of its customers; it wants to find potential customers from a database containing details of the general population.

Page 8: Learning to Classify Documents with Only a Small Positive Training Set

Related works

• Theoretical study: Denis (1998), Muggleton (2001) and Liu et al. (2002) show that this problem is learnable.

• Schölkopf et al. (1999) and others proposed the one-class SVM.

• S-EM: Liu, Lee, Yu, and Li (ICML, 2002) proposed a method (called S-EM) to solve the problem based on a spy technique, naïve Bayesian (NB) classification, and the EM algorithm.

Page 9: Learning to Classify Documents with Only a Small Positive Training Set

Related works

• PEBL: Yu et al. (KDD, 2002) proposed an SVM-based technique to classify Web pages given positive and unlabeled pages.

• NBP: Denis’s group also built an NBP system.

• Roc-SVM: Li and Liu (IJCAI, 2003) give a Rocchio- and SVM-based method.

Page 10: Learning to Classify Documents with Only a Small Positive Training Set

Can we use the current techniques in some real applications?

Page 11: Learning to Classify Documents with Only a Small Positive Training Set

A real-life business intelligence application: searching for information on related products

[Figure: printer pages from Amazon and CNET]

A company that sells computer printers may want to do a product comparison among the various printers currently available in the market.

Page 12: Learning to Classify Documents with Only a Small Positive Training Set

Current techniques cannot work well! Why?

Page 13: Learning to Classify Documents with Only a Small Positive Training Set

The Assumption (1) of current techniques

• There is a sufficiently large set of positive training examples.

• However, in practice, obtaining a large number of positive examples can be rather difficult in many real applications.

Page 14: Learning to Classify Documents with Only a Small Positive Training Set

Current Assumption (1)

The small positive set may not even adequately represent the whole positive class.

Page 15: Learning to Classify Documents with Only a Small Positive Training Set

PU learning with a small positive training set

[Figure: illustration of two possible decision hyperplanes, H1 and H2, learned with a small positive training set.]

Page 16: Learning to Classify Documents with Only a Small Positive Training Set

The Assumption (2) of current techniques

• The positive set (P) and the hidden positive examples in the unlabeled set (U) are generated from the same distribution.

• In practice this may not hold. Reason: different Web sites present similar products in different styles and have different focuses.

[Figure: printer pages from Amazon and CNET, with their differences highlighted in red.]

Page 17: Learning to Classify Documents with Only a Small Positive Training Set

2. The proposed techniques: Ideas

[Figure: printer pages from Amazon and CNET]

• Both should still be similar in some underlying feature dimensions (or subspaces), as they belong to the same class.

• E.g., they share representative word features such as “printer”, “inkjet”, “laser”, “ppm”, etc.

Page 18: Learning to Classify Documents with Only a Small Positive Training Set

The proposed techniques: Ideas (Cont.)

• If we can find such a set of representative word features (RW) from the positive set P and U, then we can use them to extract the other hidden positive documents in U that share these features.

• Method: LPLP (Learning from Probabilistically Labeled Positive examples).

Page 19: Learning to Classify Documents with Only a Small Positive Training Set

The proposed techniques: LPLP

• 1. Select the set of representative word features RW from the given positive set P.

• 2. Extract the likely positive documents (LP) from U and probabilistically label them based on the set RW.

• 3. Employ the EM algorithm to build an accurate classifier that identifies the hidden positive examples from U.

A pipeline sketch of these three steps is given below.
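To make the overall flow concrete, here is a minimal Python sketch composing the three steps. The helper function names are hypothetical; each helper is sketched under stated assumptions on the slides that follow.

```python
# High-level composition of LPLP's three steps; the helpers
# (hypothetical names) are sketched on the following slides.

def lplp(P, U, k=10):
    """P: positive documents, U: unlabeled documents (as token lists)."""
    RW = select_representative_words(P, U, k)      # Step 1
    LP, RU, prob = probabilistically_label(U, RW)  # Step 2
    return run_em(LP, RU, prob)                    # Step 3
```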

Page 20: Learning to Classify Documents with Only a Small Positive Training Set

Step 1: Selecting a set of representative word features from P

• The scoring function s() is based on the TF-IDF method (a sketch is given after this list).

• It gives high scores to those words that occur frequently in the positive set P but not in the whole corpus, since U contains many other unrelated documents.
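The slide does not give the exact form of s(), so the following is a minimal sketch of one plausible TF-IDF-style scoring function; the function name and the particular weighting are assumptions, not the paper's exact formula.

```python
import math
from collections import Counter

def select_representative_words(P, U, k=10):
    """Score words so that words frequent in P but rare in the whole
    corpus P + U score high (one plausible TF-IDF-style form of s());
    documents are token lists. Returns the top-k words as RW."""
    corpus = P + U
    N = len(corpus)
    df = Counter()                  # document frequency over the corpus
    for doc in corpus:
        df.update(set(doc))
    tf = Counter()                  # term frequency within P only
    for doc in P:
        tf.update(doc)
    score = {w: tf[w] * math.log(N / df[w]) for w in tf}
    return sorted(score, key=score.get, reverse=True)[:k]
```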

Page 21: Learning to Classify Documents with Only a Small Positive Training Set

Select representative word features from P

[Figure: representative word features selected from printer pages on Amazon and CNET: “printer”, “inkjet”, “laser”, “ppm”.]

Page 22: Learning to Classify Documents with Only a Small Positive Training Set

Step 2: Identifying LP from U and probabilistically labeling the documents in LP

• rd: a representative document that consists of all the representative word features.

• Compare each document di in U with rd using the cosine similarity; this produces a set LP of probabilistically labeled documents with Pr(di|+) > 0 (see the sketch after this list).

• The hidden positive examples in LP will be assigned high probabilities, while the negative examples in LP will be assigned very low probabilities.
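A minimal sketch of this step, assuming raw term-frequency vectors and using the cosine similarity itself as the soft label; how the paper converts similarities into probabilities is not shown on the slide, so that part is an assumption.

```python
import math
from collections import Counter

def probabilistically_label(U, RW):
    """Build the representative document rd from the representative
    word features RW, compare every document in U with rd using the
    cosine similarity, and keep documents with similarity > 0 as the
    likely positive set LP; the rest form RU. The similarity is used
    directly as the soft label Pr(+|d), which is an assumption."""
    rd = Counter(RW)                # rd: unit weight per representative word
    rd_norm = math.sqrt(sum(v * v for v in rd.values()))
    LP, RU, prob = [], [], []
    for doc in U:
        vec = Counter(doc)          # raw term-frequency vector
        dot = sum(vec[w] * rd[w] for w in rd)
        norm = math.sqrt(sum(v * v for v in vec.values()))
        sim = dot / (norm * rd_norm) if norm else 0.0
        if sim > 0:
            LP.append(doc)
            prob.append(sim)
        else:
            RU.append(doc)
    return LP, RU, prob
```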

Page 23: Learning to Classify Documents with Only a Small Positive Training Set

Identifying likely positives

[Figure: comparing documents in U against the representative document rd (built from “printer”, “inkjet”, “laser”, “ppm”) splits U into the likely positive set LP and the remaining unlabeled set RU; hidden positives in LP receive high probability, negatives low probability.]

Page 24: Learning to Classify Documents with Only a Small Positive Training Set

The Naïve Bayesian method

Given training documents D, vocabulary V, and classes C, the classifier parameters are estimated as

$$\Pr(c_j) = \frac{\sum_{i=1}^{|D|} \Pr(c_j \mid d_i)}{|D|} \qquad (1)$$

$$\Pr(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\,\Pr(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\,\Pr(c_j \mid d_i)} \qquad (2)$$

where N(w_t, d_i) is the number of times word w_t occurs in document d_i. The classifier then computes the posterior probability of each class for a document d_i:

$$\Pr(c_j \mid d_i) = \frac{\Pr(c_j) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_j)}{\sum_{r=1}^{|C|} \Pr(c_r) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_r)} \qquad (3)$$

Equations (1) and (2) estimate the classifier parameters; Equation (3) is the classifier.
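As a concrete illustration, here is a minimal Python sketch of Equations (1)–(3) for the two classes + and −, with soft labels Pr(+|d); the function and variable names are mine, not from the paper.

```python
import math
from collections import Counter

def train_nb(docs, probs, vocab):
    """Estimate the NB parameters from probabilistically labeled
    documents: probs[i] = Pr(+|d_i), so Pr(-|d_i) = 1 - probs[i].
    Implements Eqs. (1) and (2) with Laplacian smoothing."""
    prior = {'+': 0.0, '-': 0.0}
    counts = {'+': Counter(), '-': Counter()}   # sum_i N(w, d_i) Pr(c|d_i)
    totals = {'+': 0.0, '-': 0.0}
    for doc, p in zip(docs, probs):
        for c, pc in (('+', p), ('-', 1.0 - p)):
            prior[c] += pc
            for w, n in Counter(doc).items():
                counts[c][w] += n * pc
                totals[c] += n * pc
    for c in prior:
        prior[c] /= len(docs)                   # Eq. (1)

    def cond(w, c):                             # Pr(w|c), Eq. (2)
        return (1.0 + counts[c][w]) / (len(vocab) + totals[c])
    return prior, cond

def classify(doc, prior, cond):
    """Posterior Pr(c|d) via Eq. (3), computed in log space to avoid
    underflow; assumes both class priors are nonzero."""
    logp = {c: math.log(prior[c]) + sum(math.log(cond(w, c)) for w in doc)
            for c in ('+', '-')}
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return {c: math.exp(v - m) / z for c, v in logp.items()}
```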

Page 25: Learning to Classify Documents with Only a Small Positive Training Set

Step 3: EM algorithm

• Re-initialize the EM algorithm by treating the probabilistically labeled LP (with or without P) as positive documents (a sketch of the loop is given after this list).

• LP has a similar distribution to the other hidden positive documents in U.

• The remaining unlabeled set RU is also much purer as a negative set than U.
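Continuing the sketch, a minimal EM loop under these assumptions; it relies on the hypothetical train_nb and classify helpers from the previous block, and the fixed iteration count stands in for a proper convergence test.

```python
def run_em(LP, RU, prob, iters=10):
    """EM re-initialized with the probabilistically labeled LP as
    positive data and RU as (initially fully) negative data. Each
    iteration retrains NB on the current soft labels (M-step) and
    re-labels every document with the posterior Pr(+|d) (E-step)."""
    docs = LP + RU
    labels = list(prob) + [0.0] * len(RU)       # soft labels Pr(+|d)
    vocab = {w for d in docs for w in d}
    for _ in range(iters):
        prior, cond = train_nb(docs, labels, vocab)              # M-step
        labels = [classify(d, prior, cond)['+'] for d in docs]   # E-step
    return prior, cond
```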

Page 26: Learning to Classify Documents with Only a Small Positive Training Set

Build a final classifier

[Figure: the final classifier is built with RU as the negative set and a positive set chosen in one of two ways: Option 1 uses LP only; Option 2 combines P and LP.]

Page 27: Learning to Classify Documents with Only a Small Positive Training Set

3. EMPIRICAL EVALUATION

Datasets

Number of Web pages and their classes

Class       Amazon   CNet   J&R   PCMag   ZDnet
Notebook       434    480    51     144     143
Camera         402    219    80     137     151
Mobile          45    109     9      43      97
Printer        767    500   104     107      80
TV             719    449   199       0       0

Page 28: Learning to Classify Documents with Only a Small Positive Training Set

Experiment setting

• We experimented with different numbers of (randomly selected) positive documents in P, i.e., |P| = 5, 15, 25, or all positive documents (allpos).

• We conducted a comprehensive set of experiments using all the possible P and U combinations. That is, we selected every entry in Table 1 as the positive set P and used each of the other 4 Web sites as the unlabeled set U.

Page 29: Learning to Classify Documents with Only a Small Positive Training Set

Performance of LPLP with different numbers of positive documents

Page 30: Learning to Classify Documents with Only a Small Positive Training Set

LP + P or LP only?

• If only a small number of positive documents (|P| = 5, 15, or 25) is available, we found that combining LP and P to construct the positive set for the classifier is better than using LP only.

• If there is a large number of positive documents, then using LP only is better.

Page 31: Learning to Classify Documents with Only a Small Positive Training Set

The number of representative word features

• In general, 5-25 representative words would suffice.

• Including less representative word features beyond the top 25 would introduce unnecessary noise into the identification of the likely positive documents in U.

Page 32: Learning to Classify Documents with Only a Small Positive Training Set

Performance of LPLP, Roc-SVM and PEBL (using either P or LP) when using all positive documents

[Figure: F values of LPLP, Roc-SVM, and PEBL, each trained using either P or LP as the positive set.]

Here, PEBL and Roc-SVM use the likely positive documents LP, which requires each document d from U to contain at least 5 (out of 10) selected representative words.

Page 33: Learning to Classify Documents with Only a Small Positive Training Set

Comparative results when the number of positive documents is small

[Figure: F values of LPLP, Roc-SVM, and PEBL when the number of positive documents |P| is 5, 15, and 25.]

Page 34: Learning to Classify Documents with Only a Small Positive Training Set

Conclusions

• In many real-world classification applications, it is often the case that the number of positive examples available for learning is fairly limited.

• We proposed LPLP, a technique that can learn effectively from positive and unlabeled examples with only a small positive set for document classification.

Page 35: Learning to Classify Documents with Only a Small Positive Training Set

Conclusions (cont.)

• The likely positive documents LP can be used to help boost the performance of classification techniques for PU learning problems.

• The LPLP algorithm benefited the most because of its ability to handle probabilistic labels; it is thus better equipped to take advantage of the probabilistic LP set than the SVM-based approaches.

Page 36: Learning to Classify Documents with Only a Small Positive Training Set

Thank you for your attention!