Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories
Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of
Categories
Rayid Ghani
Center for Automated Learning & Discovery
Carnegie Mellon University
Some Recent Work
Learning from Sequences of fMRI Brain Images (with Tom Mitchell)
Learning to automatically build language-specific corpora from the web (with Rosie Jones & Dunja Mladenic)
Effect of Smoothing on Naive Bayes for Text Classification (with Tong Zhang @ IBM Research)
Hypertext Categorization using links and extracted information (with Sean Slattery & Yiming Yang)
Hybrids of EM & Co-Training for semi-supervised learning (with Kamal Nigam)
Error-Correcting Output Codes for Text Classification
Text Categorization
Numerous applications: search engines/portals, customer service, email routing, ...
Domains: topics, genres, languages
$$$ Making
Problems
Practical applications such as web portals deal with a large number of categories
A lot of labeled examples are needed to train the system
How do people deal with a large number of classes?
Use fast multiclass algorithms (e.g. Naïve Bayes) that build one model per class
Use binary classification algorithms (e.g. SVMs) and break an n-class problem into n binary problems
What happens with a 1000 class problem? Can we do better?
ECOC to the Rescue!
An n-class problem can be solved by solving log2(n) binary problems
More efficient than one-per-class. But does it actually perform better?
What is ECOC?
Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
Use a learner to learn the binary problems
Training ECOC
Assign each class a binary codeword (one row per class) and train one binary classifier per column:

     f1 f2 f3 f4
A     0  0  1  1
B     1  0  1  0
C     0  1  1  1
D     0  1  0  0

Testing ECOC
Evaluate every binary classifier on the test example X, e.g.:

X     1  1  1  1

Assign the class whose codeword is nearest in Hamming distance (here C, at distance 1).
ECOC - Picture
[Diagram, shown over four animation frames: classes A, B, C, D placed in the space of the binary functions f1-f4 according to the codeword matrix above; the final frame adds the test example X = 1 1 1 1.]
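The training/testing picture can be made concrete with a small sketch. The codewords are the four from the slides; the binary classifiers themselves are elided, and only the nearest-codeword (Hamming-distance) decoding step is shown.

```python
# Sketch of ECOC decoding on the 4-class example from the slides.
CODEWORDS = {            # class -> 4-bit codeword (columns f1..f4)
    "A": (0, 0, 1, 1),
    "B": (1, 0, 1, 0),
    "C": (0, 1, 1, 1),
    "D": (0, 1, 0, 0),
}

def hamming(a, b):
    """Number of bit positions where codewords a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(bits):
    """Map a vector of binary-classifier outputs to the nearest codeword."""
    return min(CODEWORDS, key=lambda c: hamming(CODEWORDS[c], bits))

# The classifiers f1..f4 output (1, 1, 1, 1) for example X:
# the nearest codeword is C's (0, 1, 1, 1), at Hamming distance 1.
print(decode((1, 1, 1, 1)))  # -> C
```

One classifier (here f1) can be wrong and the example is still decoded correctly, which is the error-correcting property the rest of the talk builds on.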
ECOC works but…
Increased code length = increased accuracy, but also increased computational cost
[Diagram: classification performance vs. efficiency. Naïve Bayes is efficient; ECOC (as used in Berger 99) performs better; the GOAL is both.]
Choosing the codewords
Random? [Berger 1999, James 1999] Asymptotically good (the longer the better), but the computational cost is very high
Use coding theory for good error-correcting codes? [Dietterich & Bakiri 1995] Guaranteed properties for a fixed-length code
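A rough sketch of why the choice matters: a code whose minimum pairwise Hamming distance is d can absorb up to floor((d-1)/2) binary-classifier errors during decoding. A random code's distance can only be measured after the fact, while BCH-style constructions guarantee it; the helper names below are my own, for illustration.

```python
import random

def random_code(n_classes, n_bits, seed=0):
    """Assign each class a random n_bits codeword."""
    rng = random.Random(seed)
    return [tuple(rng.randint(0, 1) for _ in range(n_bits))
            for _ in range(n_classes)]

def min_distance(code):
    """Minimum pairwise Hamming distance of the code; a code with
    distance d corrects up to floor((d-1)/2) bit errors."""
    return min(sum(x != y for x, y in zip(a, b))
               for i, a in enumerate(code) for b in code[i + 1:])

# Measure a random 63-bit code for a 105-class problem (as in the talk).
code = random_code(n_classes=105, n_bits=63)
print(min_distance(code))
```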
Experimental Setup
Generate the code: BCH codes
Choose a base learner: the Naive Bayes classifier as used in text classification tasks (McCallum & Nigam 1998)
Text Classification with Naïve Bayes
"Bag of words" document representation.
Estimate the parameters of the generative model (with Laplace smoothing):

  P(word | class) = (1 + N(word, class)) / (|V| + N(class))

Naïve Bayes classification:

  P(class | doc) = P(class) * [prod over word in doc of P(word | class)] / P(doc)
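The two estimates above fit in a minimal multinomial Naïve Bayes sketch. The tiny corpus and all names here are invented for illustration; log-probabilities are used, as is standard, to avoid underflow on real documents.

```python
from collections import Counter, defaultdict
import math

# Minimal multinomial Naive Bayes with Laplace ("add-one") smoothing,
# matching P(word|class) = (1 + N(word, class)) / (|V| + N(class)).
docs = [("sports", "game score team"), ("sports", "team win game"),
        ("finance", "stock price market"), ("finance", "market stock gain")]

class_counts = Counter(c for c, _ in docs)
word_counts = defaultdict(Counter)           # class -> word -> N(word, class)
vocab = set()
for c, text in docs:
    for w in text.split():
        word_counts[c][w] += 1
        vocab.add(w)

def log_posterior(c, text):
    """log P(class) + sum over words in the document of log P(word | class)."""
    n_class = sum(word_counts[c].values())   # N(class): total words in class c
    lp = math.log(class_counts[c] / len(docs))
    for w in text.split():
        lp += math.log((1 + word_counts[c][w]) / (len(vocab) + n_class))
    return lp

def classify(text):
    return max(class_counts, key=lambda c: log_posterior(c, text))

print(classify("team game"))   # -> sports
```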
Industry Sector Dataset
Consists of company web pages classified into 105 economic sectors
[McCallum et al. 1998, Ghani 2000]
Results: Industry Sector Data Set

  Naïve Bayes           66.1%
  Shrinkage (1)         76%
  MaxEnt (2)            79%
  MaxEnt w/ Prior (3)   81.1%
  ECOC 63-bit           88.5%

ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost.

1. (McCallum et al. 1998)  2, 3. (Nigam et al. 1999)
Min HD for correctly classified examples
[Histogram: frequency (0-1000) vs. minimum Hamming distance (0-10)]

Min HD for wrongly classified examples
[Histogram: frequency (0-100) vs. minimum Hamming distance (0-10)]
ECOC for better Precision
[Precision-recall plot comparing ECOC and NB: precision 0.6-1.0 over recall 0.49-1.0]
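One way to read the Hamming-distance histograms and the precision-recall plot together: correct predictions tend to land near a codeword, so rejecting test examples whose output vector is far from every codeword trades recall for precision. A sketch of that rejection rule, under my reading of the slides; the codewords and sample outputs are invented.

```python
# Trade recall for precision with ECOC: refuse to predict when the
# classifier output vector is far (in Hamming distance) from every codeword.
CODEWORDS = {"A": (0, 0, 1, 1), "B": (1, 0, 1, 0),
             "C": (0, 1, 1, 1), "D": (0, 1, 0, 0)}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def decode_with_reject(bits, max_hd):
    """Return the nearest class, or None if even the nearest codeword
    is more than max_hd bits away (the prediction is 'rejected')."""
    best = min(CODEWORDS, key=lambda c: hamming(CODEWORDS[c], bits))
    return best if hamming(CODEWORDS[best], bits) <= max_hd else None

print(decode_with_reject((0, 1, 1, 1), max_hd=1))  # exact match -> C
print(decode_with_reject((1, 0, 0, 1), max_hd=1))  # nearest is 2 away -> None
```

Sweeping max_hd from 0 upward traces out a precision-recall curve like the one in the plot.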
ECOC for better Precision (cont.)
[Precision-recall plot comparing NB and 15-bit ECOC: precision 0.4-0.75 over recall 0-1]
New Goal
[Diagram: classification performance vs. efficiency, showing NB, ECOC (as used in Berger 99), and the GOAL moved to a new goal.]
Solutions
Design codewords that minimize cost and maximize “performance”
Investigate the assignment of codewords to classes
Learn the decoding function
Incorporate unlabeled data into ECOC
What happens with sparse data?
Percent Decrease in Error with Training Size and Length of Code
[Plot: % decrease in error (30-70) vs. training size (0-100) for 15-bit, 31-bit, and 63-bit codes]
Use unlabeled data with a large number of classes
How? Use EM: mixed results
Think again! Use Co-Training: disastrous results
Think one more time...
How to use unlabeled data?
Current learning algorithms using unlabeled data (EM, Co-Training) don’t work well with a large number of categories
ECOC works great with a large number of classes but there is no framework for using unlabeled data
ECOC + CoTraining = ECoTrain
ECOC decomposes multiclass problems into binary problems
Co-Training works great with binary problems
ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
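The composition can be sketched structurally: each ECOC bit defines a binary problem, and each binary problem is handed to co-training over two feature views. All names below are illustrative, and co_train() is a stand-in, not the authors' implementation.

```python
# Structural sketch of ECoTrain (ECOC + Co-Training).
def relabel(labels, codewords, bit):
    """Project multiclass labels onto the binary labels for one ECOC bit."""
    return [codewords[y][bit] for y in labels]

def co_train(view1, view2, binary_labels, unlabeled):
    """Placeholder for a real co-training loop (Blum & Mitchell 1998):
    train one classifier per view and let each label its most confident
    unlabeled examples for the other. Here it just returns a constant
    majority-label classifier so the sketch runs end to end."""
    majority = max(set(binary_labels), key=binary_labels.count)
    return lambda example: majority

def ecotrain(views, labels, unlabeled, codewords, n_bits):
    """Train one co-trained binary classifier per ECOC bit."""
    view1, view2 = views
    return [co_train(view1, view2, relabel(labels, codewords, bit), unlabeled)
            for bit in range(n_bits)]
```

At test time, the per-bit classifiers produce an output vector that is decoded to the nearest codeword exactly as in plain ECOC.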
ECOC+CoTrain - Results

  Algorithm                            300L + 0U    50L + 250U   5L + 295U
                                       per class    per class    per class
  Naïve Bayes (no unlabeled data)      76           67           40.3
  ECOC 15-bit                          76.5         68.5         49.2
  EM (unlabeled data, 105-class)                    68.2         51.4
  Co-Training (unlabeled data)                      67.6         50.1
  ECoTrain (ECOC + Co-Training,                     72.0         56.1
  uses unlabeled data)
What Next?
Use an improved version of co-training (gradient descent): less prone to random fluctuations, uses all unlabeled data at every iteration
Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training
Potential Drawbacks
Random codes throw away the real-world structure of the data by picking random partitions to create artificial binary problems
Summary
Use ECOC for efficient text classification with a large number of categories
Increase accuracy & efficiency
Use unlabeled data by combining ECOC and Co-Training
Generalize to domain-independent classification tasks involving a large number of categories