Two Methods for Improving Text Classification when Training Data is Sparse
Andrew McCallum ([email protected])
Just Research (formerly JPRC)
Carnegie Mellon University
For more detail see http://www.cs.cmu.edu/~mccallum
Improving Text Classification by Shrinkage in a Hierarchy of Classes (Sub. to ICML-98)
McCallum, Rosenfeld, Mitchell, Ng
Learning to Classify Text from Labeled and Unlabeled Documents (AAAI-98)
Nigam, McCallum, Thrun, Mitchell
The Task: Document Classification (AKA "Document Categorization", "Routing" or "Tagging")
Automatically placing documents in their correct categories.
Categories: Magnetism, Relativity, Evolution, Botany, Irrigation, Crops
Training data (example word lists per category):
  Crops: wheat, corn, silo, grow, ...
  Botany: wheat, tulips, splicing, grow, ...
  Irrigation: water, grating, ditch, tractor, ...
  Evolution: selection, mutation, Darwin, ...
Testing data: "wheat grow tractor..."
A Probabilistic Approach to Document Classification
Pick the most probable class, given the evidence:

  \hat{c} = \arg\max_{c_j} \Pr(c_j \mid d)

where c_j is a class (like "Crops") and d is a document (like "wheat grow tractor...").

Bayes rule:

  \Pr(c_j \mid d) = \frac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}

Independence assumption ("Naive Bayes"):

  \Pr(c_j \mid d) = \frac{\Pr(c_j) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)}{\Pr(d)}

where w_{d_i} is the i-th word in d (like "grow").
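Under these assumptions, classification reduces to picking the class with the highest log-prior plus summed log word probabilities. A minimal sketch in Python; the class names, probabilities, and the unknown-word floor are illustrative assumptions, not values from the slides:

```python
import math

# Toy parameters (illustrative only): class priors and per-class word probabilities.
priors = {"Crops": 0.5, "Botany": 0.5}
word_probs = {
    "Crops":  {"wheat": 0.3, "grow": 0.2, "tractor": 0.1, "tulips": 0.01},
    "Botany": {"wheat": 0.1, "grow": 0.2, "tractor": 0.01, "tulips": 0.3},
}

def classify(doc_words, priors, word_probs, floor=1e-6):
    """Return argmax_cj Pr(cj) * prod_i Pr(w_i | cj), computed in log space."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in doc_words:
            # Unknown words get a small floor probability (an implementation choice).
            score += math.log(word_probs[c].get(w, floor))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["wheat", "grow", "tractor"], priors, word_probs))  # -> Crops
```

Working in log space avoids underflow when multiplying many small word probabilities.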
Comparison with TFIDF
TFIDF/Rocchio:

  \mathrm{Score}(d, c) = \sum_{i=1}^{|V|} \mathrm{TFIDF}(w_i, d)\,\mathrm{TFIDF}(w_i, c) \,/\, Z

Naive Bayes:

  \mathrm{Score}(d, c) = \Pr(c) \sum_{i=1}^{|V|} \mathrm{TF}(w_i, d) \log \Pr(w_i \mid c) \,/\, Z

where Z is some normalization constant.
Parameter Estimation in Naïve Bayes
Bayes-optimal estimate of Pr(w|c) (via Laplace smoothing):

  \Pr(w_t \mid c_j) = \frac{1 + \sum_{d_k \in c_j} \mathrm{TF}(w_t, d_k)}{|V| + \sum_{s=1}^{|V|} \sum_{d_k \in c_j} \mathrm{TF}(w_s, d_k)}

Naive Bayes:

  \Pr(c_j \mid d) \propto \Pr(c_j) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)

A key problem: getting better estimates of Pr(w|c).
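The Laplace-smoothed estimate can be computed directly from term frequencies. A short sketch; the data layout (a dict mapping each class to its tokenized documents) and the toy corpus are assumptions of this example:

```python
from collections import Counter

def estimate_word_probs(docs_by_class, vocab):
    """Laplace-smoothed estimate:
    Pr(w_t | c_j) = (1 + sum_k TF(w_t, d_k)) / (|V| + sum_s sum_k TF(w_s, d_k))."""
    probs = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)  # TF summed over docs in c
        total = sum(counts[w] for w in vocab)             # all class-c tokens in vocab
        probs[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return probs

# Toy corpus (illustrative): two "Crops" documents over a three-word vocabulary.
vocab = ["wheat", "grow", "ice"]
probs = estimate_word_probs({"Crops": [["wheat", "grow"], ["wheat"]]}, vocab)
```

The add-one numerator and the |V| term in the denominator guarantee every vocabulary word gets nonzero probability, so an unseen word cannot zero out a whole class.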
Document Classification in a Hierarchy of Classes
Andrew McCallum
Roni Rosenfeld
Tom Mitchell
Andrew Ng
The Idea: “Deleted Interpolation” or “Shrinkage”
We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. This represents a tradeoff between reliability and specificity.
Categories (hierarchy): Science -> Physics (Magnetism, Relativity), Biology (Evolution, Botany), Agriculture (Irrigation, Crops)
Training data (example word lists per leaf category):
  Crops: wheat, corn, silo, grow, ...
  Botany: wheat, tulips, splicing, grow, ...
  Irrigation: water, grating, ditch, tractor, ...
  Evolution: selection, mutation, Darwin, ...
Testing data: "wheat grow tractor..."
"Deleted Interpolation" or "Shrinkage"

"Deleted Interpolation" in N-gram space:

  \hat{\Pr}(w_{d_i} \mid w_{d_{i-1}}, w_{d_{i-2}}) = \lambda \Pr(w_{d_i} \mid w_{d_{i-1}}, w_{d_{i-2}}) + (1 - \lambda) \Pr(w_{d_i} \mid w_{d_{i-1}})

"Deleted Interpolation" in class hierarchy space:

  \hat{\Pr}(w_i \mid c_j) = \lambda_j \Pr(w_i \mid c_j) + (1 - \lambda_j) \Pr(w_i \mid c_j^{\mathrm{parent}})

Learn the λ's via EM, performing the E-step with leave-one-out cross-validation.

[Jelinek and Mercer, 1980], [James and Stein, 1961]
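The hierarchy-space interpolation generalizes to mixing a leaf's estimate with every ancestor on its path to the root. A sketch with the mixture weights given directly; in the paper they are learned by EM, and the toy distributions below are illustrative assumptions:

```python
def shrinkage_estimate(estimates, weights):
    """P_hat(w) = sum_k weights[k] * estimates[k][w].

    estimates: per-node word distributions, ordered leaf -> root;
    weights: the lambdas (assumed already learned; must sum to 1)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    vocab = set().union(*estimates)
    return {w: sum(lam * est.get(w, 0.0) for lam, est in zip(weights, estimates))
            for w in vocab}

# Toy distributions for a leaf ("Crops") and its parent ("Agriculture").
leaf = {"wheat": 0.6, "corn": 0.4}
parent = {"wheat": 0.3, "corn": 0.3, "water": 0.4}
mixed = shrinkage_estimate([leaf, parent], [0.7, 0.3])
```

Note the tradeoff the slide describes: the leaf estimate is specific but noisy, the ancestor estimate is reliable but generic, and the λ's balance the two.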
Experimental Results
• Industry Sector Dataset: 71 classes, 6.5k documents, 1.2 million words, 30k vocabulary
• 20 Newsgroups Dataset: 15 classes, 15k documents, 1.7 million words, 52k vocabulary
• Yahoo Science Dataset: 95 classes, 13k documents, 0.6 million words, 44k vocabulary
Learning to Classify Text from Labeled and Unlabeled Documents
Kamal Nigam
Andrew McCallum
Sebastian Thrun
Tom Mitchell
The Scenario
Training data with class labels:
  - Web pages the user says are interesting
  - Web pages the user says are uninteresting
Data available at training time, but without class labels:
  - Web pages the user hasn't seen or said anything about
Can we use the unlabeled documents to increase accuracy?
Using the Unlabeled Data
1. Build a classification model using limited labeled data.
2. Use the model to guess the labels of the unlabeled documents.
3. Use all documents to build a new classification model, which is more accurate because it is trained using more data.
Expectation Maximization [Dempster, Laird, Rubin 1977]
Applies when there are two inter-dependent unknowns:
(1) the word probabilities for each class;
(2) the class labels of the unlabeled documents.
• E-step: use the current "guess" of (1) to estimate the value of (2). That is, use the classification model built from the limited training data to assign probabilistic labels to the unlabeled documents.
• M-step: use the probabilistic estimates of (2) to update (1). That is, use the probabilistic class labels on the unlabeled documents to build a more accurate classification model.
• Repeat E- and M-steps until convergence.
Why it Works -- An Example
Labeled Data:
  Baseball: "The new hitter struck out...", "Struck out in last inning...", "Homerun in the first inning...", "Pete Rose is not as good an athlete as Tara Lipinski..."
  Ice Skating: "Fell on the ice...", "Perfect triple jump...", "Katarina Witt's gold medal performance...", "New ice skates...", "Practice at the ice rink every day..."
Unlabeled Data:
  "Tara Lipinski new ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal..."
  "Tara Lipinski bought a new house for her parents."
Estimated Pr(Lipinski) per class, before and after using the unlabeled data: 0.01, 0.001 -> 0.02, 0.003
EM for Text Classification
Expectation-step (guess the class labels):

  \Pr(c_j \mid d) \propto \Pr(c_j) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)

Maximization-step (set parameters using the guesses):

  \Pr(w_t \mid c_j) = \frac{1 + \sum_{d_k} \Pr(c_j \mid d_k)\,\mathrm{TF}(w_t, d_k)}{|V| + \sum_{s=1}^{|V|} \sum_{d_k} \Pr(c_j \mid d_k)\,\mathrm{TF}(w_s, d_k)}
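Putting the two steps together: fix the labels of the labeled documents, start the unlabeled ones at uniform soft labels, and alternate the E- and M-steps until convergence. A self-contained sketch of this setup; the toy data, the +1 smoothing on class counts, and the 1e-9 out-of-vocabulary floor are implementation assumptions, not details from the slides:

```python
import math
from collections import Counter

def nb_em(labeled, unlabeled, classes, vocab, iters=5):
    """EM for Naive Bayes with unlabeled documents (a sketch).

    labeled:   list of (word_list, class_name) pairs with fixed labels
    unlabeled: list of word_lists; their soft labels start uniform
    Returns (class priors, per-class word probabilities, soft labels)."""
    resp = [{c: 1.0 / len(classes) for c in classes} for _ in unlabeled]
    prior, word_p = {}, {}
    for _ in range(iters):
        # M-step: Laplace-smoothed parameters from labeled + soft-labeled docs.
        prior = {c: 1.0 for c in classes}          # +1 smoothing on class counts
        counts = {c: Counter() for c in classes}
        for words, c in labeled:
            prior[c] += 1
            counts[c].update(words)
        for words, r in zip(unlabeled, resp):
            for c in classes:
                prior[c] += r[c]
                for w in words:
                    counts[c][w] += r[c]    # fractional counts from soft labels
        z = sum(prior.values())
        prior = {c: p / z for c, p in prior.items()}
        for c in classes:
            total = sum(counts[c][w] for w in vocab)
            word_p[c] = {w: (1 + counts[c][w]) / (len(vocab) + total) for w in vocab}
        # E-step: reassign probabilistic labels to the unlabeled documents.
        for i, words in enumerate(unlabeled):
            log_s = {c: math.log(prior[c])
                        + sum(math.log(word_p[c].get(w, 1e-9)) for w in words)
                     for c in classes}
            m = max(log_s.values())
            exps = {c: math.exp(s - m) for c, s in log_s.items()}
            zz = sum(exps.values())
            resp[i] = {c: e / zz for c, e in exps.items()}
    return prior, word_p, resp

# Toy run in the spirit of the baseball / ice-skating example above.
labeled = [(["hitter", "inning"], "baseball"), (["ice", "skates"], "skating")]
unlabeled = [["ice", "jumps"], ["inning", "homerun"]]
vocab = {"hitter", "inning", "ice", "skates", "jumps", "homerun"}
prior, word_p, resp = nb_em(labeled, unlabeled, ["baseball", "skating"], vocab)
```

Even in this tiny example, co-occurrence does the work: "jumps" appears only alongside "ice", so the first unlabeled document drifts toward the skating class, and "homerun" alongside "inning" pulls the second toward baseball.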
Experimental Results -- The Data
• Four classes of Web pages (Student, Faculty, Course, Project): 4199 Web pages total
• Twenty newsgroups from UseNet (several of religion, politics, sports, comp.*): 1000 articles per class
• News articles from Reuters: 90 different categories, 12902 articles total
Word Vector Evolution with EM
Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog
Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec
Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript
(D is a digit)
Related Work
• Using EM to reduce the need for training examples: [Miller and Uyar 1997], [Shahshahani and Landgrebe 1994]
• AutoClass, unsupervised EM with Naïve Bayes: [Cheeseman 1988]
• Using EM to fill in missing values: [Ghahramani and Jordan 1995]