Learning to Classify Documents with Only a Small Positive Training Set


Xiao-Li Li, Institute for Infocomm Research, Singapore

Nanyang Technological University, Singapore

Joint work with Bing Liu (University of Illinois at Chicago) See-Kiong Ng (Institute for Infocomm Research)

Outline

• 1. Introduction to the problem

• 2. The proposed technique: LPLP

• 3. Empirical evaluation

• 4. Conclusions

1. Introduction

• Traditional supervised learning

– Given a set of labeled training documents of n classes, the system uses this set to build a classifier.

– The classifier is then used to classify new documents into the n classes.

• It typically requires a large number of labeled examples, which can be expensive and tedious to obtain.

Positive-Unlabeled (PU) Learning

• One way to reduce the amount of labeled training data is to develop classification algorithms that can learn from a set P of labeled positive examples augmented with a set U of unlabeled examples.

• Then, build a classifier using P and U to classify the data in U as well as future test data. We call this the PU learning problem.

PU Learning

• Positive set: a set P of documents that all belong to a particular class.

• Unlabeled (or mixed) set: a set U of unlabeled documents containing both documents of the positive class and documents not of the positive class (negative documents).

• Goal: build a classifier to classify the documents in U and future (test) data.

An illustration of typical PU learning

[Figure: the positive set P (ECML papers) and the unlabeled set U (AAAI papers); the classifier aims to automatically find the hidden positives in U, i.e., the machine learning papers in AAAI.]

Applications of the problem

• Given the ECML proceedings, find all machine learning papers from AAAI, IJCAI, and KDD.

• Given one's bookmarks, identify the documents from Web sources that are of interest to him/her.

• A company has a database with details of its customers; find potential new customers from a database containing details of people in general.

Related works

• Theoretical study: Denis (1998), Muggleton (2001) and Liu et al. (2002) show that this problem is learnable.

• Schölkopf et al. (1999) and others proposed one-class SVM.

• S-EM: In [ICML, Liu, Lee, Yu, Li, 2002], Liu et al. proposed a method (called S-EM) to solve the problem based on a spy technique, naïve Bayesian classification (NB) and the EM algorithm.

Related works

• PEBL: Yu et al. (KDD, 2002) proposed an SVM-based technique to classify Web pages given positive and unlabeled pages.

• NBP: Denis's group also built an NBP system.

• Roc-SVM: Li and Liu (IJCAI, 2003) give a Rocchio- and SVM-based method.

Can we use the current techniques in some real applications?

A real-life business intelligence application: searching for information on related products

[Figure: printer pages from Amazon and CNET]

A company that sells computer printers may want to do a product comparison among the various printers currently available in the market.

Current techniques cannot work well! Why?

Assumption (1) of current techniques

• There is a sufficiently large set of positive training examples.

• However, in practice, obtaining a large number of positive examples can be rather difficult in many real applications.

Current assumption (1)

The small positive set may not even adequately represent the whole positive class.

PU learning with a small positive training set

[Figure: illustration with two decision boundaries, H1 and H2, when learning from a small positive set.]

The Assumption (2) of current techniques

• Positive set (P) and hidden positive examples in the unlabeled set (U) are generated from the same distribution.

• In practice this may not hold: different Web sites present similar products in different styles and have different focuses.

[Figure: printer pages from Amazon and CNET, with different words highlighted in red on each site.]

2. The proposed technique: Ideas

[Figure: printer pages from Amazon and CNET]

Pages from the two sources should still be similar in some underlying feature dimensions (or subspaces) since they belong to the same class, e.g., they share representative word features such as "printer", "inkjet", "laser", "ppm", etc.

The proposed technique: Ideas (cont.)

• If we can find such a set of representative word features (RW) from the positive set P and U, then we can use them to extract other hidden positive documents from U (Share).

• Method: LPLP (Learning from Probabilistically Labeled Positive examples)

The proposed technique: LPLP

• 1. Select the set of representative word features RW from the given positive set P.

• 2. We extract the likely positive documents from U and probabilistically label them based on the set RW.

• 3. We employ the EM algorithm to build an accurate classifier to identify the hidden positive examples from U.

Step1: Selecting a set of representative word features from P

• The scoring function s() is based on the TF-IDF method.

• It gives high scores to those words that occur frequently in the positive set P and not in the whole corpus since U contains many other unrelated documents.
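The slides do not show the exact form of s(); a minimal sketch of a TF-IDF-style scoring function for selecting RW, with hypothetical function and variable names, could look like this:

```python
import math
from collections import Counter

def select_representative_words(P, U, k=10):
    """Score each word by a TF-IDF-style criterion: high frequency in the
    positive set P, low document frequency over the whole corpus P + U.
    P and U are lists of tokenized documents (lists of words)."""
    corpus = P + U
    df = Counter()                      # document frequency over P + U
    for doc in corpus:
        df.update(set(doc))
    tf_pos = Counter()                  # term frequency inside P
    for doc in P:
        tf_pos.update(doc)
    n_docs = len(corpus)
    scores = {w: tf_pos[w] * math.log(n_docs / df[w]) for w in tf_pos}
    # Keep the k highest-scoring words as the representative set RW
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

The exact weighting used in the paper may differ; the point is that words like "printer" or "ppm" score highly because they are frequent in P but comparatively rare in the rest of the corpus.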

Select representative word features from P

[Figure: representative word features such as "printer", "inkjet", "laser" and "ppm" selected from printer pages on Amazon and CNET.]

Step 2: identifying LP from U and probabilistically labeling the documents in LP

• rd: the representative document, consisting of all representative word features.

• Compare each document di in U with rd using the cosine similarity; this produces a set LP of probabilistically labeled documents with Pr(di|+) > 0.

• The hidden positive examples in LP will be assigned high probabilities while the negative examples in LP will be assigned very low probabilities.
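A sketch of this step, assuming bag-of-words documents and using the cosine similarity over the representative feature subspace as the probabilistic label (the slides only state that Pr(di|+) > 0 for documents kept in LP); the function name is illustrative:

```python
import math
from collections import Counter

def identify_likely_positives(U, rw):
    """Compare each document in U with the representative document rd
    (built from the representative words rw) using cosine similarity.
    Documents with non-zero similarity go into LP with a probabilistic
    label; the rest form the remaining unlabeled set RU."""
    rd = Counter(rw)                                   # representative document
    rd_norm = math.sqrt(sum(v * v for v in rd.values()))
    LP, RU = [], []
    for doc in U:
        vec = Counter(w for w in doc if w in rd)       # restrict to the RW subspace
        dot = sum(vec[w] * rd[w] for w in rd)
        norm = math.sqrt(sum(v * v for v in vec.values())) * rd_norm
        sim = dot / norm if norm > 0 else 0.0
        if sim > 0:
            LP.append((doc, sim))                      # Pr(+|d) approximated by the similarity
        else:
            RU.append(doc)
    return LP, RU
```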

Identifying likely positives

[Figure: identifying likely positives. The representative document rd (containing "printer", "inkjet", "laser", "ppm") is compared against U, producing the likely positive set LP (hidden positives receive high probability, negatives receive low probability) and the remaining unlabeled set RU.]

The Naïve Bayesian method

Given the set D of training documents, the vocabulary V, and the set C of classes, the naïve Bayesian classifier is built from probabilistically labeled documents as follows:

\Pr(c_j) = \frac{\sum_{i=1}^{|D|} \Pr(c_j \mid d_i)}{|D|}    (1)

\Pr(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\,\Pr(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\,\Pr(c_j \mid d_i)}    (2)

\Pr(c_j \mid d_i) = \frac{\Pr(c_j) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_j)}{\sum_{r=1}^{|C|} \Pr(c_r) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_r)}    (3)

where N(w_t, d_i) is the number of times word w_t occurs in document d_i and w_{d_i,k} is the k-th word of document d_i. Equations (1) and (2) estimate the classifier parameters; equation (3) is the classifier itself.
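Equations (1)-(3) translate directly into code; a minimal sketch in Python, assuming each training document carries a probabilistic class label (a dict mapping class to Pr(c|d)); all names are illustrative:

```python
import math
from collections import defaultdict

def train_nb(docs, labels, vocab):
    """Estimate the NB parameters of equations (1) and (2) from
    probabilistically labeled documents (Laplace-smoothed)."""
    vocab = set(vocab)
    classes = {c for lab in labels for c in lab}
    prior = {c: sum(lab.get(c, 0.0) for lab in labels) / len(docs)    # eq. (1)
             for c in classes}
    cond = {}
    for c in classes:
        counts = defaultdict(float)
        for doc, lab in zip(docs, labels):
            for w in doc:
                if w in vocab:
                    counts[w] += lab.get(c, 0.0)                      # N(w, d) * Pr(c|d)
        total = sum(counts.values())
        cond[c] = {w: (1.0 + counts[w]) / (len(vocab) + total)        # eq. (2)
                   for w in vocab}
    return prior, cond

def classify_nb(doc, prior, cond):
    """Equation (3): posterior Pr(c|d), computed in log space."""
    logp = {c: math.log(max(prior[c], 1e-300)) +
               sum(math.log(cond[c][w]) for w in doc if w in cond[c])
            for c in prior}
    m = max(logp.values())
    unnorm = {c: math.exp(v - m) for c, v in logp.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}
```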

Step 3: EM algorithm

• Re-initialize the EM algorithm by treating the probabilistically labeled documents in LP (with or without P) as positive documents.

• LP has a distribution similar to that of the other hidden positive documents in U.

• The remaining unlabeled set RU is also much purer than U as a negative set.
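Putting step 3 together with the NB routines sketched above, the re-initialized EM loop might look like the following; whether the labels of P stay fixed across iterations is not specified on the slides, so this sketch simply re-estimates everything each round:

```python
def lplp_em(LP, RU, vocab, P=None, iterations=10):
    """Step 3 sketch: EM re-initialized with the probabilistically labeled
    LP as positives and RU as negatives.  Pass P to combine P with LP
    (option 2 on the next slide); leave it as None to use LP only (option 1)."""
    docs = [d for d, _ in LP] + list(RU)
    labels = ([{'+': p, '-': 1.0 - p} for _, p in LP] +        # labels from step 2
              [{'+': 0.0, '-': 1.0} for _ in RU])               # RU treated as negative
    if P:                                                       # option 2: P + LP as positives
        docs += list(P)
        labels += [{'+': 1.0, '-': 0.0} for _ in P]
    for _ in range(iterations):
        prior, cond = train_nb(docs, labels, vocab)             # M-step: eqs. (1)-(2)
        labels = [classify_nb(d, prior, cond) for d in docs]    # E-step: eq. (3)
    return prior, cond                                          # final classifier parameters
```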

Build a final classifier

[Figure: building the final classifier. The negative set is RU; for the positive set there are two options: Option 1 uses LP only, Option 2 combines P and LP.]

3. EMPIRICAL EVALUATION

Datasets

Number of Web pages and their classes

Class       Amazon   CNet   J&R   PCMag   ZDnet
Notebook       434    480    51     144     143
Camera         402    219    80     137     151
Mobile          45    109     9      43      97
Printer        767    500   104     107      80
TV             719    449   199       0       0

Experiment setting

• We experimented with different numbers of (randomly selected) positive documents in P, i.e. |P| = 5, 15, or 25, as well as allpos (all available positives).

• We conducted a comprehensive set of experiments using all possible P and U combinations. That is, we selected every entry in Table 1 as the positive set P and used each of the other 4 Web sites as the unlabeled set U.

Performance of LPLP with different numbers of positive documents


LP + P or LP only?

• If only a small number of positive documents (|P| = 5, 15, or 25) is available, we found that combining LP and P to construct the positive set for the classifier is better than using LP only.

• If there is a large number of positive documents, then using LP only is better.

The number of representative features

• In general, 5-25 representative words would suffice.

• Including less representative word features beyond the top 25 would introduce unnecessary noise in identifying the likely positive documents in U.

Performance of LPLP, Roc-SVM and PEBL (using either P or LP as the positive set) when using all positive documents

[Chart: F values of LPLP, Roc-SVM and PEBL, each run with P and with LP as the positive set.]

For the LP runs, PEBL and Roc-SVM use the likely positive documents LP, where each document d from U is required to contain at least 5 (out of the 10) selected representative words.
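This selection rule (used here only to give the SVM-based baselines a likely positive set) is simple to state in code; a sketch with illustrative names:

```python
def lp_for_svm_baselines(U, rw, min_hits=5):
    """Keep a document from U as a likely positive for PEBL / Roc-SVM only
    if it contains at least min_hits of the selected representative words."""
    rw = set(rw)
    return [doc for doc in U if len(rw & set(doc)) >= min_hits]
```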

Comparative results when the number of positive documents is small

[Chart: F values of LPLP, Roc-SVM and PEBL for |P| = 5, 15 and 25 positive documents.]

Conclusions

• In many real-world classification applications, it is often the case that the number of positive examples available for learning is fairly limited.

• We proposed LPLP, a technique that can learn effectively from positive and unlabeled examples with only a small positive set for document classification.

Conclusions (cont.)

• The likely positive documents LP can be used to help boost the performance of classification techniques for PU learning problems.

• The LPLP algorithm benefited the most: its ability to handle probabilistic labels makes it better equipped than the SVM-based approaches to take advantage of the probabilistic LP set.

Thank you for your attention!