Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories
Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of
Categories
Rayid Ghani
Center for Automated Learning & Discovery
Carnegie Mellon University
Some Recent Work
Learning from Sequences of fMRI Brain Images (with Tom Mitchell)
Learning to automatically build language-specific corpora from the web (with Rosie Jones & Dunja Mladenic)
Effect of Smoothing on Naive Bayes for Text Classification (with Tong Zhang @ IBM Research)
Hypertext Categorization using links and extracted information (with Sean Slattery & Yiming Yang)
Hybrids of EM & Co-Training for semi-supervised learning (with Kamal Nigam)
Error-Correcting Output Codes for Text Classification
Text Categorization
Numerous applications: search engines/portals, customer service, email routing, ...
Domains: topics, genres, languages
$$$ Making
Problems
Practical applications such as web portals deal with a large number of categories
A lot of labeled examples are needed to train the system
How do people deal with a large number of classes?
Use fast multiclass algorithms (e.g. Naïve Bayes) that build one model per class
Use binary classification algorithms (e.g. SVMs) and break an n-class problem into n binary problems
What happens with a 1000 class problem? Can we do better?
ECOC to the Rescue!
An n-class problem can be solved by solving log2(n) binary problems
More efficient than one-per-class. But does it actually perform better?
What is ECOC?
Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
Use a learner to learn the binary problems
Training ECOC
Assign each class a binary codeword (one row per class) and train one binary classifier per column:

     f1 f2 f3 f4
A     0  0  1  1
B     1  0  1  0
C     0  1  1  1
D     0  1  0  0

Testing ECOC
Evaluate every binary classifier on the test example X, e.g.:

X     1  1  1  1

Assign the class whose codeword is nearest in Hamming distance (here C, at distance 1).
ECOC - Picture
[Diagram, shown over four animation frames: classes A, B, C, D placed in the space of the binary functions f1-f4 according to the codeword matrix above; the final frame adds the test example X = 1 1 1 1.]
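The training/testing picture can be made concrete with a small sketch. The codewords are the four from the slides; the binary classifiers themselves are elided, and only the nearest-codeword (Hamming-distance) decoding step is shown.

```python
# Sketch of ECOC decoding on the 4-class example from the slides.
CODEWORDS = {            # class -> 4-bit codeword (columns f1..f4)
    "A": (0, 0, 1, 1),
    "B": (1, 0, 1, 0),
    "C": (0, 1, 1, 1),
    "D": (0, 1, 0, 0),
}

def hamming(a, b):
    """Number of bit positions where codewords a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(bits):
    """Map a vector of binary-classifier outputs to the nearest codeword."""
    return min(CODEWORDS, key=lambda c: hamming(CODEWORDS[c], bits))

# The classifiers f1..f4 output (1, 1, 1, 1) for example X:
# the nearest codeword is C's (0, 1, 1, 1), at Hamming distance 1.
print(decode((1, 1, 1, 1)))  # -> C
```

One classifier (here f1) can be wrong and the example is still decoded correctly, which is the error-correcting property the rest of the talk builds on.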
ECOC works but…
Increased code length = increased accuracy, but also increased computational cost
[Diagram: classification performance vs. efficiency. Naïve Bayes is efficient; ECOC (as used in Berger 99) performs better; the GOAL is both.]
Choosing the codewords
Random? [Berger 1999, James 1999] Asymptotically good (the longer the better), but the computational cost is very high
Use coding theory for good error-correcting codes? [Dietterich & Bakiri 1995] Guaranteed properties for a fixed-length code
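A rough sketch of why the choice matters: a code whose minimum pairwise Hamming distance is d can absorb up to floor((d-1)/2) binary-classifier errors during decoding. A random code's distance can only be measured after the fact, while BCH-style constructions guarantee it; the helper names below are my own, for illustration.

```python
import random

def random_code(n_classes, n_bits, seed=0):
    """Assign each class a random n_bits codeword."""
    rng = random.Random(seed)
    return [tuple(rng.randint(0, 1) for _ in range(n_bits))
            for _ in range(n_classes)]

def min_distance(code):
    """Minimum pairwise Hamming distance of the code; a code with
    distance d corrects up to floor((d-1)/2) bit errors."""
    return min(sum(x != y for x, y in zip(a, b))
               for i, a in enumerate(code) for b in code[i + 1:])

# Measure a random 63-bit code for a 105-class problem (as in the talk).
code = random_code(n_classes=105, n_bits=63)
print(min_distance(code))
```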
Experimental Setup
Generate the code: BCH codes
Choose a base learner: the Naive Bayes classifier as used in text classification tasks (McCallum & Nigam 1998)
Text Classification with Naïve Bayes
"Bag of words" document representation.
Estimate the parameters of the generative model (with Laplace smoothing):

  P(word | class) = (1 + N(word, class)) / (|V| + N(class))

Naïve Bayes classification:

  P(class | doc) = P(class) * [prod over word in doc of P(word | class)] / P(doc)
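The two estimates above fit in a minimal multinomial Naïve Bayes sketch. The tiny corpus and all names here are invented for illustration; log-probabilities are used, as is standard, to avoid underflow on real documents.

```python
from collections import Counter, defaultdict
import math

# Minimal multinomial Naive Bayes with Laplace ("add-one") smoothing,
# matching P(word|class) = (1 + N(word, class)) / (|V| + N(class)).
docs = [("sports", "game score team"), ("sports", "team win game"),
        ("finance", "stock price market"), ("finance", "market stock gain")]

class_counts = Counter(c for c, _ in docs)
word_counts = defaultdict(Counter)           # class -> word -> N(word, class)
vocab = set()
for c, text in docs:
    for w in text.split():
        word_counts[c][w] += 1
        vocab.add(w)

def log_posterior(c, text):
    """log P(class) + sum over words in the document of log P(word | class)."""
    n_class = sum(word_counts[c].values())   # N(class): total words in class c
    lp = math.log(class_counts[c] / len(docs))
    for w in text.split():
        lp += math.log((1 + word_counts[c][w]) / (len(vocab) + n_class))
    return lp

def classify(text):
    return max(class_counts, key=lambda c: log_posterior(c, text))

print(classify("team game"))   # -> sports
```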
Industry Sector Dataset
Consists of company web pages classified into 105 economic sectors
[McCallum et al. 1998, Ghani 2000]
Results: Industry Sector Data Set

  Naïve Bayes           66.1%
  Shrinkage (1)         76%
  MaxEnt (2)            79%
  MaxEnt w/ Prior (3)   81.1%
  ECOC 63-bit           88.5%

ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost.

1. (McCallum et al. 1998)  2, 3. (Nigam et al. 1999)
Min HD for correctly classified examples
[Histogram: frequency (0-1000) vs. minimum Hamming distance (0-10)]

Min HD for wrongly classified examples
[Histogram: frequency (0-100) vs. minimum Hamming distance (0-10)]
ECOC for better Precision
[Precision-recall plot comparing ECOC and NB: precision 0.6-1.0 over recall 0.49-1.0]
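One way to read the Hamming-distance histograms and the precision-recall plot together: correct predictions tend to land near a codeword, so rejecting test examples whose output vector is far from every codeword trades recall for precision. A sketch of that rejection rule, under my reading of the slides; the codewords and sample outputs are invented.

```python
# Trade recall for precision with ECOC: refuse to predict when the
# classifier output vector is far (in Hamming distance) from every codeword.
CODEWORDS = {"A": (0, 0, 1, 1), "B": (1, 0, 1, 0),
             "C": (0, 1, 1, 1), "D": (0, 1, 0, 0)}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def decode_with_reject(bits, max_hd):
    """Return the nearest class, or None if even the nearest codeword
    is more than max_hd bits away (the prediction is 'rejected')."""
    best = min(CODEWORDS, key=lambda c: hamming(CODEWORDS[c], bits))
    return best if hamming(CODEWORDS[best], bits) <= max_hd else None

print(decode_with_reject((0, 1, 1, 1), max_hd=1))  # exact match -> C
print(decode_with_reject((1, 0, 0, 1), max_hd=1))  # nearest is 2 away -> None
```

Sweeping max_hd from 0 upward traces out a precision-recall curve like the one in the plot.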
ECOC for better Precision (cont.)
[Precision-recall plot comparing NB and 15-bit ECOC: precision 0.4-0.75 over recall 0-1]
New Goal
[Diagram: classification performance vs. efficiency, showing NB, ECOC (as used in Berger 99), and the GOAL moved to a new goal.]
Solutions
Design codewords that minimize cost and maximize “performance”
Investigate the assignment of codewords to classes
Learn the decoding function
Incorporate unlabeled data into ECOC
What happens with sparse data?
Percent Decrease in Error with Training Size and Length of Code
[Plot: % decrease in error (30-70) vs. training size (0-100) for 15-bit, 31-bit, and 63-bit codes]
Use unlabeled data with a large number of classes
How? Use EM: mixed results
Think again! Use Co-Training: disastrous results
Think one more time...
How to use unlabeled data?
Current learning algorithms using unlabeled data (EM, Co-Training) don’t work well with a large number of categories
ECOC works great with a large number of classes but there is no framework for using unlabeled data
ECOC + CoTraining = ECoTrain
ECOC decomposes multiclass problems into binary problems
Co-Training works great with binary problems
ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
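The composition can be sketched structurally: each ECOC bit defines a binary problem, and each binary problem is handed to co-training over two feature views. All names below are illustrative, and co_train() is a stand-in, not the authors' implementation.

```python
# Structural sketch of ECoTrain (ECOC + Co-Training).
def relabel(labels, codewords, bit):
    """Project multiclass labels onto the binary labels for one ECOC bit."""
    return [codewords[y][bit] for y in labels]

def co_train(view1, view2, binary_labels, unlabeled):
    """Placeholder for a real co-training loop (Blum & Mitchell 1998):
    train one classifier per view and let each label its most confident
    unlabeled examples for the other. Here it just returns a constant
    majority-label classifier so the sketch runs end to end."""
    majority = max(set(binary_labels), key=binary_labels.count)
    return lambda example: majority

def ecotrain(views, labels, unlabeled, codewords, n_bits):
    """Train one co-trained binary classifier per ECOC bit."""
    view1, view2 = views
    return [co_train(view1, view2, relabel(labels, codewords, bit), unlabeled)
            for bit in range(n_bits)]
```

At test time, the per-bit classifiers produce an output vector that is decoded to the nearest codeword exactly as in plain ECOC.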
ECOC+CoTrain - Results

  Algorithm                            300L + 0U    50L + 250U   5L + 295U
                                       per class    per class    per class
  Naïve Bayes (no unlabeled data)      76           67           40.3
  ECOC 15-bit                          76.5         68.5         49.2
  EM (unlabeled data, 105-class)                    68.2         51.4
  Co-Training (unlabeled data)                      67.6         50.1
  ECoTrain (ECOC + Co-Training,                     72.0         56.1
  uses unlabeled data)
What Next?
Use an improved version of co-training (gradient descent): less prone to random fluctuations, uses all unlabeled data at every iteration
Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training
Potential Drawbacks
Random codes throw away the real-world structure of the data by picking random partitions to create artificial binary problems
Summary
Use ECOC for efficient text classification with a large number of categories
Increase accuracy & efficiency
Use unlabeled data by combining ECOC and Co-Training
Generalize to domain-independent classification tasks involving a large number of categories