Combining labeled and unlabeled data for text categorization with a large number of categories

Transcript of Combining labeled and unlabeled data for text categorization with a large number of categories

Page 1: Combining labeled and unlabeled data for text categorization with a large number of categories

Combining labeled and unlabeled data for text categorization with a large number of

categories

Rayid Ghani

KDD Lab Project

Page 2: Combining labeled and unlabeled data for text categorization with a large number of categories

Supervised Learning with Labeled Data

Labeled data is required in large quantities and can be very expensive to collect.

Page 3: Combining labeled and unlabeled data for text categorization with a large number of categories

Why use Unlabeled data?

Very cheap in the case of text: web pages, newsgroups, email messages

May not be as useful as labeled data, but it is available in enormous quantities

Page 4: Combining labeled and unlabeled data for text categorization with a large number of categories

Goal

Make learning more efficient and easier by reducing the amount of labeled data required for text classification with a large number of categories

Page 5: Combining labeled and unlabeled data for text categorization with a large number of categories

• ECOC – very accurate and efficient for text categorization with a large number of classes

• Co-Training – useful for combining labeled and unlabeled data with a small number of classes

Page 6: Combining labeled and unlabeled data for text categorization with a large number of categories

Related research with unlabeled data

Using EM in a generative model (Nigam et al. 1999)

Transductive SVMs (Joachims 1999)

Co-Training type algorithms (Blum & Mitchell 1998, Collins & Singer 1999, Nigam & Ghani 2000)

Page 7: Combining labeled and unlabeled data for text categorization with a large number of categories

What is ECOC?

Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)

Use a learner to learn the binary problems
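
A minimal sketch of ECOC training and decoding under these assumptions; the code matrix, the make_learner factory, and the fit()/predict() interface are placeholders rather than the exact setup used here:

```python
# Sketch of ECOC for multiclass text categorization (Dietterich & Bakiri 1995).
# Assumes a binary learner with fit(docs, labels) and predict(doc) -> 0 or 1;
# the code matrix and learner factory are illustrative placeholders.

def train_ecoc(docs, labels, code, make_learner):
    """code: dict mapping class -> tuple of bits, e.g. {'A': (0, 0, 1, 1, 0)}."""
    n_bits = len(next(iter(code.values())))
    learners = []
    for bit in range(n_bits):
        # Relabel the multiclass data as a binary problem for this bit.
        binary_labels = [code[y][bit] for y in labels]
        learner = make_learner()
        learner.fit(docs, binary_labels)
        learners.append(learner)
    return learners

def classify_ecoc(doc, code, learners):
    # Predict one bit per learner, then choose the class whose codeword
    # is closest to the predicted bit vector in Hamming distance.
    predicted = [learner.predict(doc) for learner in learners]

    def hamming(codeword):
        return sum(p != c for p, c in zip(predicted, codeword))

    return min(code, key=lambda cls: hamming(code[cls]))
```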

Page 8: Combining labeled and unlabeled data for text categorization with a large number of categories

Training ECOC

Codeword matrix (one 5-bit codeword per class; binary functions f1–f5):

        f1  f2  f3  f4  f5
   A     0   0   1   1   0
   B     1   0   1   0   0
   C     0   1   1   1   0
   D     0   1   0   0   1

Testing ECOC

   X     1   1   1   1   0
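
With this code, decoding the test example X = (1, 1, 1, 1, 0) by Hamming distance gives distances 2, 2, 1, and 4 to the codewords of A, B, C, and D, so X is assigned to class C, the nearest codeword.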

Page 9: Combining labeled and unlabeled data for text categorization with a large number of categories

The Co-training algorithm

Loop (while unlabeled documents remain):
  Build classifiers A and B (naïve Bayes)
  Classify the unlabeled documents with A & B (naïve Bayes)
  Add the most confident A predictions and the most confident B predictions as labeled training examples

[Blum & Mitchell 1998]
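
A minimal sketch of this loop, assuming two views of each document and a NaiveBayes class with fit(), predict(), and predict_proba() methods; the per_round parameter and the confidence measure (maximum posterior) are illustrative choices, not the authors' exact settings:

```python
# Sketch of co-training (Blum & Mitchell 1998): one classifier per feature
# view, each promoting its most confident predictions on the unlabeled pool
# into the labeled training set. Both views are assumed to be parallel lists
# describing the same documents.

def co_train(labeled_a, labeled_b, labels, unlabeled_a, unlabeled_b,
             NaiveBayes, per_round=5):
    while unlabeled_a:
        # Build classifiers A and B from the current labeled pool.
        clf_a = NaiveBayes().fit(labeled_a, labels)
        clf_b = NaiveBayes().fit(labeled_b, labels)

        # Classify unlabeled documents with A & B; confidence = max posterior.
        conf_a = [max(clf_a.predict_proba(x)) for x in unlabeled_a]
        conf_b = [max(clf_b.predict_proba(x)) for x in unlabeled_b]

        # Add the most confident A predictions and the most confident
        # B predictions to the labeled training data.
        top_a = sorted(range(len(unlabeled_a)), key=lambda i: conf_a[i],
                       reverse=True)[:per_round]
        top_b = sorted(range(len(unlabeled_b)), key=lambda i: conf_b[i],
                       reverse=True)[:per_round]
        for i in sorted(set(top_a) | set(top_b), reverse=True):
            clf = clf_a if i in top_a else clf_b
            view = unlabeled_a[i] if i in top_a else unlabeled_b[i]
            labels.append(clf.predict(view))
            labeled_a.append(unlabeled_a.pop(i))
            labeled_b.append(unlabeled_b.pop(i))

    # Final classifiers trained on the full (now fully labeled) pool.
    return NaiveBayes().fit(labeled_a, labels), NaiveBayes().fit(labeled_b, labels)
```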

Page 10: Combining labeled and unlabeled data for text categorization with a large number of categories

The Co-training Algorithm

[Diagram: naïve Bayes on feature set A and naïve Bayes on feature set B each learn from the labeled data, estimate labels for the unlabeled documents, select their most confident predictions, and add them to the labeled data.]

[Blum & Mitchell, 1998]

Page 11: Combining labeled and unlabeled data for text categorization with a large number of categories

One Intuition behind co-training

A and B are redundant
A features are independent of B features
Co-training is like learning with random classification noise:
  The most confident A prediction gives a random B example
  Small misclassification error for A

Page 12: Combining labeled and unlabeled data for text categorization with a large number of categories

ECOC + CoTraining = ECoTrain

ECOC decomposes multiclass problems into binary problems

Co-Training works great with binary problems

ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
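
As a rough illustration of how the two pieces fit together (reusing the hypothetical co_train sketch above, and not the exact ECoTrain implementation):

```python
# Sketch of ECoTrain: learn each binary ECOC problem with co-training.
# The code matrix and the two-view document representation are assumptions.

def train_ecotrain(labeled_a, labeled_b, labels, unlabeled_a, unlabeled_b,
                   code, NaiveBayes):
    n_bits = len(next(iter(code.values())))
    bit_classifiers = []
    for bit in range(n_bits):
        # Relabel the multiclass labels as 0/1 for this ECOC bit ...
        binary = [code[y][bit] for y in labels]
        # ... and solve the binary problem with co-training, so the
        # unlabeled documents help learn every bit of the code.
        clf_a, clf_b = co_train(list(labeled_a), list(labeled_b), list(binary),
                                list(unlabeled_a), list(unlabeled_b), NaiveBayes)
        bit_classifiers.append((clf_a, clf_b))
    return bit_classifiers
```

A test document would then be decoded as in classify_ecoc above, using the bit predicted for each binary problem.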

Page 13: Combining labeled and unlabeled data for text categorization with a large number of categories

[Diagram: example categories – SPORTS, SCIENCE, ARTS, HEALTH, POLITICS, LAW]

Page 14: Combining labeled and unlabeled data for text categorization with a large number of categories

What happens with sparse data?

[Chart: percent decrease in error vs. training size and code length. X-axis: training size (0–100); y-axis: % decrease in error (30–70); one curve each for 15-bit, 31-bit, and 63-bit codes.]

Page 15: Combining labeled and unlabeled data for text categorization with a large number of categories

Datasets

Hoovers-255
  Collection of 4285 corporate websites
  Each company is classified into one of 255 categories
  Baseline: 2%

Jobs-65 (from WhizBang)
  Job postings (two feature sets: Title and Description)
  65 categories
  Baseline: 11%

Page 16: Combining labeled and unlabeled data for text categorization with a large number of categories

[Chart: class distribution of the dataset – x-axis: class, y-axis: percentage of documents (0–12%).]

Page 17: Combining labeled and unlabeled data for text categorization with a large number of categories

Results

            Naïve Bayes            ECOC
            (no unlabeled data)    (no unlabeled data)    EM        Co-Training  ECOC + Co-Training
Dataset     10% lab.  100% lab.    10% lab.  100% lab.    10% lab.  10% lab.     10% lab.
Jobs-65     50.1      68.2         59.3      71.2         58.2      54.1         64.5
Hoovers-255 15.2      32.0         24.8      36.5          9.1      10.2         27.6

Page 18: Combining labeled and unlabeled data for text categorization with a large number of categories

Results

[Chart: precision vs. recall (both 0–100) for ECOC + Co-Train, Naive Bayes, and EM.]

Page 19: Combining labeled and unlabeled data for text categorization with a large number of categories

What Next?

Use an improved version of co-training (gradient descent):
  Less prone to random fluctuations
  Uses all unlabeled data at every iteration

Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training

Page 20: Combining labeled and unlabeled data for text categorization with a large number of categories

Summary

Page 21: Combining labeled and unlabeled data for text categorization with a large number of categories

The Co-training setting

[Diagram: two views of the same examples – hyperlink anchor text ("…My advisor…", "…Professor Blum…", "…My grandson…") and the text of the pages they point to (Tom Mitchell: "Fredkin Professor of AI…"; Avrim Blum: "My research interests are…"; Johnny: "I like horsies!"). Classifier A is trained on one view, Classifier B on the other.]

Page 22: Combining labeled and unlabeled data for text categorization with a large number of categories

Learning from Labeled and Unlabeled Data:

Using Feature Splits

Co-training [Blum & Mitchell 98]
Meta-bootstrapping [Riloff & Jones 99]
coBoost [Collins & Singer 99]
Unsupervised WSD [Yarowsky 95]

Consider this the co-training setting

Page 23: Combining labeled and unlabeled data for text categorization with a large number of categories

Learning from Labeled and Unlabeled Data:

Extending supervised learning

MaxEnt Discrimination [Jaakkola et al. 99]

Expectation Maximization [Nigam et al. 98]

Transductive SVMs [Joachims 99]

Page 24: Combining labeled and unlabeled data for text categorization with a large number of categories

Using Unlabeled Data with EM

Estimate labels of the unlabeled documents

Use all documents to build a new naïve Bayes classifier
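
A minimal sketch of this EM procedure, under a hypothetical NaiveBayes interface whose fit() accepts per-class probability vectors and whose predict_proba() returns them; this is not the exact implementation from Nigam et al.:

```python
# Sketch of EM with unlabeled data: initialize naive Bayes on the labeled
# documents, then alternate between estimating labels for the unlabeled
# documents (E-step) and rebuilding the classifier from all documents (M-step).

def em_naive_bayes(labeled_docs, labels, unlabeled_docs, NaiveBayes,
                   classes, n_iterations=10):
    # Hard (one-hot) label distributions for the labeled documents.
    hard = [[1.0 if c == y else 0.0 for c in classes] for y in labels]

    # Initially learn from the labeled data only.
    clf = NaiveBayes().fit(labeled_docs, hard)

    for _ in range(n_iterations):
        # E-step: estimate probabilistic labels of the unlabeled documents.
        soft = [clf.predict_proba(d) for d in unlabeled_docs]
        # M-step: rebuild naive Bayes from all documents, labeled and unlabeled.
        clf = NaiveBayes().fit(labeled_docs + unlabeled_docs, hard + soft)
    return clf
```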

Page 25: Combining labeled and unlabeled data for text categorization with a large number of categories

Co-training vs. EM

Co-training: uses the feature split; incremental labeling; hard labels

EM: ignores the feature split; iterative labeling; probabilistic labels

Which differences matter?

Page 26: Combining labeled and unlabeled data for text categorization with a large number of categories

Hybrids of Co-training and EM

                  Uses feature split?
Labeling          Yes             No
Incremental       co-training     self-training
Iterative         co-EM           EM

[Diagram: the hybrids differ in whether naïve Bayes is built on view A, view B, or A & B together, and in whether all unlabeled documents are labeled and learned from or only the best predictions are added each round.]

Page 27: Combining labeled and unlabeled data for text categorization with a large number of categories

Learning from Unlabeled Data using Feature Splits

coBoost [Collins & Singer 99]
Meta-bootstrapping [Riloff & Jones 99]
Unsupervised WSD [Yarowsky 95]
Co-training [Blum & Mitchell 98]

Page 28: Combining labeled and unlabeled data for text categorization with a large number of categories

Intuition behind Co-training

A and B are redundant
A features are independent of B features
Co-training is like learning with random classification noise:
  The most confident A prediction gives a random B example
  Small misclassification error for A

Page 29: Combining labeled and unlabeled data for text categorization with a large number of categories

Using Unlabeled Data with EM

Estimate labels of the unlabeled documents

Use all documents to rebuild the naïve Bayes classifier

Initially, learn from the labeled data only

[Nigam, McCallum, Thrun & Mitchell, 1998]

Page 30: Combining labeled and unlabeled data for text categorization with a large number of categories

Co-EM

[Diagram: initialize with the labeled data; naïve Bayes on view A estimates labels for the unlabeled documents and a naïve Bayes classifier is built on view B with all of the data, then the roles alternate.]

                  Uses feature split?
                  No              Yes
Label all         EM              co-EM
Label few                         co-training
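
A rough sketch of co-EM under the same hypothetical NaiveBayes interface as the EM sketch above: like EM it probabilistically labels all unlabeled documents at every iteration, but like co-training it keeps the two feature views separate, each view's classifier labeling the data the other view learns from.

```python
# Sketch of co-EM (Nigam & Ghani 2000): iterative, probabilistic labeling as
# in EM, but alternating between the two feature views as in co-training.

def co_em(labeled_a, labeled_b, labels, unlabeled_a, unlabeled_b,
          NaiveBayes, classes, n_iterations=10):
    # Hard (one-hot) label distributions for the labeled documents.
    hard = [[1.0 if c == y else 0.0 for c in classes] for y in labels]

    # Initialize with the labeled data (view A).
    clf_a = NaiveBayes().fit(labeled_a, hard)

    for _ in range(n_iterations):
        # A estimates labels for all unlabeled documents; B is built from
        # all data (labeled + A-labeled) using the B feature view.
        soft = [clf_a.predict_proba(d) for d in unlabeled_a]
        clf_b = NaiveBayes().fit(labeled_b + unlabeled_b, hard + soft)

        # B estimates labels for all unlabeled documents; A is rebuilt
        # from all data using the A feature view.
        soft = [clf_b.predict_proba(d) for d in unlabeled_b]
        clf_a = NaiveBayes().fit(labeled_a + unlabeled_a, hard + soft)
    return clf_a, clf_b
```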