Topic Classification in R
description
Transcript of Topic Classification in R
![Page 1: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/1.jpg)
Topic Classification in R
A Tutorial on Using Text Mining and Machine Learning Technologies to
Classify Documents
[email protected] December 2009
![Page 2: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/2.jpg)
Thanks
• Marco Maier (PhD Student at WU Wien)– Answering all of my R-related questions– Endless Patience
• Ingo Feinerer (Lecturer at TU Wien)– Creator of the tm-package for R– Answering all of my tm-related questions– Answers were always quick and profound
although we never met in person
![Page 3: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/3.jpg)
About the Talk• I want you to learn something today• If you feel bored you may leave at any time and I will not be
mad at you• Please ask at any(!) time if something is unclear• Please tell me if I‘m too fast or speak too low or loud• Please contact me per email if you have any questions about the
talk, R, Text Mining or Support Vector Machines• This presentation is under a Creative Commons license• The source code is under an LGPL license• Both are non-restrictive! (use it, alter it – just refer to me)• Everything is online at www.contextualism.net/talks
![Page 4: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/4.jpg)
Critique
• Please tell me what you think about this presentation
• Just send me an email and tell me one thing you liked and one you didn‘t
![Page 5: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/5.jpg)
Outline
• Why?• Classification
– Introduction– Support Vector Machines– Topic Classification
• Reuters 21578 Dataset• SVM Training in R• Evaluation
– Measures– Results
![Page 6: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/6.jpg)
Why? (me)
• Wanted to do Text Classification• Wanted to train a Support Vector Machine• Wanted to use R• Wanted to reproduce the results of a scientific
article
![Page 7: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/7.jpg)
Why? (you)
• Suppert Vector Machines are very good classificators• They are used in a lot of recent publications – they‘re
kind of the „new black“• SVMs are a neat tool in your analytic armory• If you understand topic classification you know how
your SPAM is filtered• Term-Document-Matrices are a very easy but powerful
data-structure• Usage of Text Mining Techniques could be fruitful for
your research
![Page 8: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/8.jpg)
Classification
![Page 9: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/9.jpg)
Classification
![Page 10: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/10.jpg)
Classification
Objects / Instances
Feature Dimensions
![Page 11: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/11.jpg)
Classification
Age
Years of Education
People earning more than 50k a year
People earning less than 50k a year
![Page 12: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/12.jpg)
Classification
Classifying Function
![Page 13: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/13.jpg)
Classification
?
?
![Page 14: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/14.jpg)
Classification
![Page 15: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/15.jpg)
Classification
Training Data
Testing Data / Live Data
![Page 16: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/16.jpg)
Classification• Many different Classifiers available• Strengths and weaknesses• Decision Tree Classifiers
– Good in Explaining the Classification Result• Naive Bayes Classifiers
– Strong Theory• K-nn Classifiers
– Lazy Classifiers• Support Vector Machines
– Currently the state of the art– Perform very well across different domains in practice
![Page 17: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/17.jpg)
Support Vector Machines
• Solve a linear optimization problem• Try to find the hyperplane with the largest
margin (Large Margine Classifier)• Use the „Kernel Trick“ to separate instances
that are linearly unseparable
![Page 18: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/18.jpg)
Largest Margin Hyperplane
![Page 19: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/19.jpg)
Largest Margin Hyperplane
h1
![Page 20: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/20.jpg)
Largest Margin Hyperplane
h1
Margin
![Page 21: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/21.jpg)
Largest Margin Hyperplane
h1
?
?
![Page 22: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/22.jpg)
Largest Margin Hyperplane
h1
![Page 23: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/23.jpg)
Largest Margin Hyperplane
h2
![Page 24: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/24.jpg)
Largest Margin Hyperplane
h2
?
?
![Page 25: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/25.jpg)
Largest Margin Hyperplane
h2
![Page 26: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/26.jpg)
Kernel Trick
![Page 27: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/27.jpg)
Kernel Trick
![Page 28: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/28.jpg)
Kernel Trick
![Page 29: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/29.jpg)
Kernel Trick
no linear function that separates the data
![Page 30: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/30.jpg)
Kernel Trick
polynomial function separating the data
x2
x1
![Page 31: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/31.jpg)
Kernel Trickx2
x1
f(x1,x2)
![Page 32: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/32.jpg)
Kernel Trickx2
x1
f(x1,x2)
Separating Hyperplane
![Page 34: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/34.jpg)
Kernel Trick
• A lot of different Kernels available– Linear Kernel– Polynomial Kernel– RBF Kernel– Sigmoid Kernel
• You don‘t have to find your own one!• Actually it‘s more or less a trial and error approach• Best Case: Literature tells you the best Kernel for
your domain
![Page 35: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/35.jpg)
Topic Classification
• Decide if a given text-document belongs to a given topic or not (e.g. merger, oil, sports)
• Features for learning are term-counts (words)• Every term becomes a feature dimension• Feature-Space is high-dimensional• Good for learning because probability of linear
separability rises with the number of dimensions
![Page 36: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/36.jpg)
Reuters 21578 Dataset
• 21.578 text documents (newswire articles)• 135 topics• Every document has one or more topics• ModApte Split creates 3 sets
– Training set (9.603 docs, at least 1 topic per doc, earlier April 7th 1987)
– Test set (3.299 docs, at least 1 topic per doc, April 7th 1987 or later)
– Unused set (8676 docs)– Topic distribution uneven in sets
![Page 37: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/37.jpg)
Reuters 21578 DatasetTopics with most Documents assigned
Class Document CountEarn 3987Acq 2448Money-Fx 801Grain 628Crude 634Trade 551Interest 513Ship 305Wheat 306Corn 254
![Page 38: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/38.jpg)
A Reuters Document<?xml version="1.0"?><REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5552" NEWID="9"> <DATE>26-FEB-1987 15:17:11.20</DATE> <TOPICS> <D>earn</D> </TOPICS> <PLACES> <D>usa</D> </PLACES> <PEOPLE/> <ORGS/> <EXCHANGES/> <COMPANIES/> <UNKNOWN>Ff0762 reuter f BC-CHAMPION-PRODUCTS-<CH02-26 0067</UNKNOWN> <TEXT> <TITLE>CHAMPION PRODUCTS <CH> APPROVES STOCK SPLIT</TITLE> <DATELINE>ROCHESTER, N.Y., Feb 26 -</DATELINE> <BODY>Champion Products Inc said its board of directors approved a two-for-one stock split of its common shares for shareholders of record as of April 1, 1987. The company also said its board voted to recommend to shareholders at the annual meeting April 23 an increase in the authorized capital stock from five mln to 25 mln shares. Reuter </BODY> </TEXT></REUTERS>
![Page 39: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/39.jpg)
A Reuters Document<?xml version="1.0"?><REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5552" NEWID="9"> <DATE>26-FEB-1987 15:17:11.20</DATE> <TOPICS> <D>earn</D> </TOPICS> <PLACES> <D>usa</D> </PLACES> <PEOPLE/> <ORGS/> <EXCHANGES/> <COMPANIES/> <UNKNOWN>Ff0762 reuter f BC-CHAMPION-PRODUCTS-<CH02-26 0067</UNKNOWN> <TEXT> <TITLE>CHAMPION PRODUCTS <CH> APPROVES STOCK SPLIT</TITLE> <DATELINE>ROCHESTER, N.Y., Feb 26 -</DATELINE> <BODY>Champion Products Inc said its board of directors approved a two-for-one stock split of its common shares for shareholders of record as of April 1, 1987. The company also said its board voted to recommend to shareholders at the annual meeting April 23 an increase in the authorized capital stock from five mln to 25 mln shares. Reuter </BODY> </TEXT></REUTERS>
![Page 40: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/40.jpg)
SVM Training in R
![Page 41: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/41.jpg)
Our Goal
• Take a topic like „earn“• Take enough „earn“ and „not earn“ documents
from the training set and create a training corpus• Bring training corpus in adequate format (data-
structure) for SVM training• Train a SVM (model)• Predict test data• Compare Results with those of Joachims
![Page 42: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/42.jpg)
Our Goal
earn earn
earn
crude
acq
trade
4
3
2
1
44342414
43332313
42322212
41312111
yyyy
xxxxxxxxxxxxxxxx
SVM„earn“ Model
![Page 43: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/43.jpg)
Our Goal
earn earn
earn
crude
acq
trade
4
3
2
1
44342414
43332313
42322212
41312111
yyyy
xxxxxxxxxxxxxxxx
SVM„earn“ Model
earn
oil „earn“ Model
Classification Results
![Page 44: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/44.jpg)
Document Term Matrices
WarFightWar
WarPeace
Contract
PeaceLoveHope
Corpus
doc1 doc2 doc3
![Page 45: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/45.jpg)
Document Term Matrices
WarFightWar
WarPeace
Contract
PeaceLoveHope
war fight peace contract love hope
doc1 2 1 0 0 0 0
doc2 1 0 1 1 0 0
doc3 0 0 1 0 1 1
doc1 doc2 doc3
Term Weight
Corpus
Document Term Matrix
![Page 46: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/46.jpg)
Document Term Matrices
WarFightWar
WarPeace
Contract
PeaceLoveHope
war fight peace contract love hope
doc1 2 1 0 0 0 0
doc2 1 0 1 1 0 0
doc3 0 0 1 0 1 1
doc1 doc2 doc3
Corpus
Document Term Matrix
Document Vector
![Page 47: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/47.jpg)
Document Term Matrices
WarFightWar
WarPeace
Contract
PeaceLoveHope
war fight peace contract love hope
doc1 2 1 0 0 0 0
doc2 1 0 1 1 0 0
doc3 0 0 1 0 1 1
doc1 doc2 doc3
Corpus
Document Term Matrix
Term Vector
![Page 48: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/48.jpg)
Term Weighting• Term Frequency (TF)
– Number of times a term occurs in a document• Term Frequency – Inverse Document Frequency (TF-
IDF)–
• Term Frequency – Inverse Document Frequency – Cosine Normalization (TFC)
–
)/log())log(1( , tdt dfNtf
2
log
log
ii
tt
dfNtf
dfNtf
![Page 49: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/49.jpg)
Term Weighting
• Weighting strategy is crucial for the classification results
• IDF factor puts heavier weight on terms that occur seldomly (strong discriminative power)
• Normalization dempens the impact of outliers on the model to be built
![Page 50: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/50.jpg)
Questions?
![Page 51: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/51.jpg)
Assumption
• Training and test documents of Reuters ModApte split reside in two separate directories– /reuters-21578-xml-train– /reuters-21578-xml-test
• Documents are in XML format• There‘s a preprocessing script (in R) that
accomplishes this task• The script will be online too and you can ask me
anytime if you have problems with it
![Page 52: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/52.jpg)
DirSource
• Abstracts the input location• Other sources are available for different file
formats like CSVSource and GmaneSource• The source concept eases internal processes in
the TM package (standardized interfaces)• Our DirSources will just contain a list of XML
documents from a given directory
![Page 53: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/53.jpg)
Corpus
• Collection or DB for text documents• In our case it holds the documents from the training
set and test set• Does some transformation on the XML files• Strips off XML• Extracts and stores meta Information (topics)• XML -> Plain Text• Access to documents via Indices• Some statistics about contained documents
![Page 54: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/54.jpg)
Document Term Matrix & Dictionary
aaa abbett zones zuckerman
doc1 2 1 0 1
doc2 1 0 0 0
doc9602 0 0 1 1
Dictionary: List of all the terms in a Corpus/DTM
![Page 55: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/55.jpg)
tm_filter
• Returns a filtered corpus• Predefined filters available
– searchFullText– sFilter (Meta Data)
• Custom filters can be added (e.g. TopicFilter and maxDocNumPerTopicFilter)
• doclevel decides if filter is applied on corpus as a whole or each single document
• Theoretically easy – Practically kind of cumbersome (R-internals)
![Page 56: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/56.jpg)
Topic Filter FunctiontopicFilter <- function (object, s, topicOfDoc) { query.df <- prescindMeta(object, c("Topics")) attach(query.df) boolFilter <- c() i <- 1 while (i <= length(Topics)) { res <- c(s) %in% Topics[[i]] boolFilter <- c(boolFilter, res) i <- i + 1 } if (!topicOfDoc) boolFilter <- unlist(lapply(boolFilter,`!`)) try (result <- rownames(query.df) %in% row.names(query.df[boolFilter,])) detach(query.df) result}
![Page 57: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/57.jpg)
The Training Corpus
earn earn
earn
crude
acq
trade
earn
earn
earn
crude
acq
acq
max. 10
max. 10
max. 10
…
…
…
…
…
…
max. 300
svmPosNegTrainCorpus
![Page 58: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/58.jpg)
Document Term Matrix
• Actually we could construct the DTM now and fed it together with the label vector to the SVM
• But, we have to ensure, that all terms in the dictionary (from ModApte Training Split) are used!
• DocumentTermMatrix removes terms that don‘t occur in the corpus – which would fall back on us during evaluation
• We have to ensure that the matrices always have the same structure using the same terms on the same place (column)
![Page 59: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/59.jpg)
Auxiliary 1-Document Corpus
• Create corpus with just one document• This document contains all terms from the
dictionary• Merge Auxiliary Corpus with Training Corpus
![Page 60: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/60.jpg)
Sparse Matrices
• Most of the values in a DTM are zero ( sparse)• Storing the 0s would be waste of space• Sparse matrices only store non-zero values and
some structure• DTM in tm-package is backed by
simple_triplet_matrix• SVM is backed by CompressedSparseRowMatrix• Conversion needed
![Page 61: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/61.jpg)
Sparse Matricesbook car peace war
doc1 0 0 0 1
doc2 0 0 1 0
doc3 0 1 0 0
doc4 1 0 0 0
Plain Storage16 numbers
![Page 62: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/62.jpg)
Sparse Matricesbook car peace war
doc1 0 0 0 1
doc2 0 0 1 0
doc3 0 1 0 0
doc4 1 0 0 0
Plain Storage16 numbers
![Page 63: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/63.jpg)
Sparse Matricesbook car peace war
doc1 0 0 0 1
doc2 0 0 1 0
doc3 0 1 0 0
doc4 1 0 0 0
Plain Storage16 numbers
1111 4321 1234
1111 1234
54321
Simple Triplet Matrix12 numbers (TM)
Compressed Sparse Row Matrix13 numbers (SVM)
payload
Structure
![Page 64: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/64.jpg)
Label Vector
11
11
yPos
11
11
yNeg
11
11
y
300
422
300 422
![Page 65: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/65.jpg)
Evaluation Measures
• Joachiems uses Precision-Recall Breakeven points
• To do so you‘d have to alter a variable in the setting over a range of values
• But he never states which variable he alters• So I decided to use the Precision and Recall
results without their Breakeven point
![Page 66: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/66.jpg)
Precision Recall Breakeven Point
some variable X
Pre
cisi
on -
Rec
all
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
PrecisionRecall
![Page 67: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/67.jpg)
Confusion Matrix
a b
c d
predicted predicted
true
true
![Page 68: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/68.jpg)
Confusion Matrix
true positive
false negative
false positive
true negative
predicted predicted
true
true
![Page 69: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/69.jpg)
Confusion Matrix
TP FN
FP TN
predicted predicted
true
true
![Page 70: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/70.jpg)
Precision
TP FN
FP TN
predicted predicted
true
true
![Page 71: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/71.jpg)
Precision
TP FN
FP TN
predicted predicted
true
true
![Page 72: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/72.jpg)
Precision
TP FN
FP TN
predicted predicted
true
true
FPTPTPmeasure of exactness
![Page 73: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/73.jpg)
Recall
TP FN
FP TN
predicted predicted
true
true
![Page 74: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/74.jpg)
Recall
TP FN
FP TN
predicted predicted
true
true
![Page 75: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/75.jpg)
Recall
TP FN
FP TN
predicted predicted
true
true
FNTPTP
measure of completeness
![Page 76: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/76.jpg)
Evaluation ResultsTopic Classification Results for Different Feature-Weighting Strategies
kernel = polyTopics
Pre
cisi
on -
Rec
all
earn acq money-fx grain crude trade interest ship wheat corn
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
tftf-idftfcprecisionrecall
![Page 77: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/77.jpg)
Evaluation ResultsTopic Classification Results for Different Feature-Weighting Strategies
kernel = radialTopics
Pre
cisi
on -
Rec
all
earn acq money-fx grain crude trade interest ship wheat corn
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
tftf-idftfcprecisionrecall
![Page 78: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/78.jpg)
Evaluation ResultsTopic Classification Results for Different Feature-Weighting Strategies
kernel = linearTopics
Pre
cisi
on -
Rec
all
earn acq money-fx grain crude trade interest ship wheat corn
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
tftf-idftfcprecisionrecall
![Page 79: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/79.jpg)
Comparison to JoachimsKernel=poly, degree=2
Joachims Me
Topic Prec.-Rec. Breakeven Precision Recall
earn 98.4 0.99 0.65
acq 94.6 0.99 0.77
money-fx 72.5 0.87 0.52
grain 93.1 0.91 0.63
crude 87.3 0.93 0.59
trade 75.5 0.83 0.7
interest 63.3 0.91 0.6
ship 85.4 0.93 0.34
wheat 84.5 0.87 0.66
corn 86.5 0.85 0.51
![Page 80: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/80.jpg)
Findings
• Even though the paper was very detailed and well written – some information was missing to fully reproduce the results
• tm-package is very comfortable to use but adding custom functionality is cumbersome because of R-internals
• svm-package is surprisingly easy to use but memory limits are reached very soon
• Classification results are quiet good
![Page 81: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/81.jpg)
Literature
• Feinerer I., Hornik K., Meyer D., Text Mining Infrastructure in R, Journal of Statistical Software, 2008
• Joachims T., Text categorization with Support Vector Machines: Learning with many relevant features, ECML, 1998
• Salton G., Buckley C., Term-weighting approaches in automatic text retrieval, Journal of Inf. Proc. Man., 1988
• Classification Performance Measures, 2006• XML Encoded Reuters 21578 Dataset
![Page 82: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/82.jpg)
Thank you!For Attention & Patience
![Page 83: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/83.jpg)
Apendix
![Page 84: Topic Classification in R](https://reader036.fdocuments.net/reader036/viewer/2022062305/568166e9550346895ddb2d8d/html5/thumbnails/84.jpg)
Versions of R and R-Packages used
• R 2.9.2• tm package 0.5 (text mining) • e1071 1.5-19 (svm)• slam package 0.1-6 (sparse matrices)• SparseM package 0.83 (sparse matrices)