A New Term Weighting Method for Text Categorization

81
A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007

Transcript of A New Term Weighting Method for Text Categorization

SoC Presentation Title 2004

A New Term Weighting Method for Text Categorization

LAN ManSchool of Computing

National University of Singapore16 Apr, 2007

SoC Presentation Title 2004

Outline• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term

Weighting Method• Experimental Research Work• Contributions and Future Work

SoC Presentation Title 2004

Outline• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term

Weighting Method• Experimental Research Work• Contributions and Future Work

SoC Presentation Title 2004

IntroductionBackground

• Explosive increase of textual information• Organizing and accessing these information

in flexible ways• Text Categorization (TC) is the task of

classifying natural language documents into a predefined set of semantic categories

SoC Presentation Title 2004

IntroductionApplications of TC

• Categorize web pages by topic (the directories like Yahoo!);

• Customize online newspapers to different labels according to a particular user’s reading preferences;

• Filter spam emails and forward incoming emails to the target expert by content;

• Word sense disambiguation is also taken as a text categorization task once we view word occurrence contexts as documents and word sense as categories.

SoC Presentation Title 2004

IntroductionTwo sub-issues of TC

• TC is a discipline at the crossroads ofmachine learning (ML) and information retrieval (IR)

• Two key issues in TC:– Text Representation– Classifier Construction

• This thesis focuses on the first issue.

SoC Presentation Title 2004

IntroductionConstruction of Text Classifier

• Approaches to build a classifier:– No more than 20 algorithms– Borrowed from Information Retrieval: Rocchio– Machine Learning: SVM, kNN, decision Tree,

Naïve Bayesian, Neural Network, Linear Regression, Decision Rule, Boosting, etc.

• SVM performs better.

SoC Presentation Title 2004

IntroductionConstruction of Text Classifier

Little room to improve the performance from the algorithm aspect:

– Excellent algorithms are few. – The rationale is inherent to each algorithm and

the method is usually fixed for one given algorithm– Tuning the parameter has limited improvement

SoC Presentation Title 2004

IntroductionText Representation

• Various text format, such as DOC, PDF, PostScript, HTML, etc.

• Can Computer read them like us? • Convert them into a compact format in order to

be recognized and categorized for classifiers or a classifier-building algorithm in a computer.

• This indexing procedure is also calledtext representation.

SoC Presentation Title 2004IntroductionVector Space Model

Texts are vectors in the term space.Assumption: Documents that are “close together”in space are similar in meaning.

SoC Presentation Title 2004

IntroductionText Representation

Two main issues in Text Representation:1. What should a term be – Term Type2. How to weight a term – Term Weighting

SoC Presentation Title 2004

Introduction1. What Should a Term be?

• Sub-word level – syllables• Word level – single token• Multi-word level – phrases, sentences, etc• Syntactic and semantic – sense (meaning)

SoC Presentation Title 2004

Introduction1. What Should a Term be?

1. Word senses (meanings) [Kehagias 2001] same word assumes different meanings in a different contexts

2. Term clustering [Lewis 1992]group words with high degree of pairwise semantic relatedness

3. Semantic and syntactic representation [Scott & matwin 1999]

Relationship between words, i.e. phrases, synonyms and hypernyms

SoC Presentation Title 2004

Introduction1. What Should a Term be?

4. Latent Semantic Indexing [Deerwester 1990]A feature reconstruction technique

5. Combination Approach [Peng 2003]Combine two types of indexing terms, i.e. words

and 3-grams

6. Theme Topic Mixture Model – [Keller 2004] Graphical Model

7. Using keywords from summarization [Li 2003]

SoC Presentation Title 2004

Introduction1. What Should a Term be?

• In general, high level representation did not show good performance in most cases

• Word level is better (bag-of-words)

SoC Presentation Title 2004

Introduction2. How to Weight a Term?

• [Salton 1988] elaborated three considerations:–1. term occurrences closely represent the content of document–2. other factors with the discriminating power pick up the relevant documents from other irrelevant documents–3. consider the effect of length of documents

SoC Presentation Title 2004

Introduction2. How to Weight a Term?

• Simplest method – binary• Most popular method – tf.idf• Combination with information-theory metrics

or statistics method – tf.chi2, tf.ig, tf.gr• Combination with the linear classifier

SoC Presentation Title 2004

Outline

• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term

Weighting Method• Experimental Research Work• Contributions and Future Work

SoC Presentation Title 2004

Motivation (1)

• SVM is better.• Leopold(2002) stated that text representation

dominates the performance of text categorization rather than the kernel functions of SVMs.

Which is the best method for SVM-based text classifier among these widely-used ones?

SoC Presentation Title 2004

Motivation (2)

• Text categorization is a form of supervised learning• The prior information on the membership of

training documents in predefined categories is useful in– feature selection– supervised learning for classifier

SoC Presentation Title 2004

Motivation (2)

• The supervised term weighting methods adopt this known information and consider the document distribution.

• They are naturally expected to be superior to the unsupervised (traditional) term weighting methods.

SoC Presentation Title 2004

MotivationThree questions to be addressed

Q1. How can we propose a new effective term weighting method by using the important prior information given by the training data set?

Q2. Which is the best term weighting method for SVM-based text classifier?

SoC Presentation Title 2004

MotivationThree questions to be addressed

Q3. Are supervised term weighting methods able to lead to better performance than unsupervised ones for TC?

What kinds of relationship can we find between term weighting methods and the widely-used learning algorithms, i.e. kNN and SVM, given different benchmark data collections?

SoC Presentation Title 2004

MotivationSub-tasks of This Thesis

First, we will analyze term’s discriminating power for TC and propose a new term weighting method using the prior information of the training data set to improve the performance of TC.

SoC Presentation Title 2004

MotivationSub-tasks of This Thesis

Second, we will explore term weighting methods for SVM-based text categorization and investigate the best method for SVM-based text classifier.

SoC Presentation Title 2004

MotivationSub-tasks of This Thesis

Third, we will extend research work on more general benchmark data sets and learning algorithms to examine the superiority of supervised term weighting methods and investigate the relationship between term weighting methods and different learning algorithms. Moreover, this study will be extended to a new application domain, i.e. biomedical literature classification.

SoC Presentation Title 2004

Outline

• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term

Weighting Method• Experimental Research Work• Contributions and Future Work

SoC Presentation Title 2004

Methodology of Research

• Machine Learning Algorithms– SVM, kNN

• Benchmark Data Corpora– Reuters News Corpus– 20Newsgroups Corpus– Ohsumed Corpus– 18Journal Corpus

SoC Presentation Title 2004

Methodology of Research

• Evaluation Measures– F1– Breakeven point

• Significance Tests– McNemar’s significance Test

SoC Presentation Title 2004

Outline

• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term

Weighting Method• Experimental Research Work• Contributions and Future Work

SoC Presentation Title 2004Analysis and Proposal of a New Term Weighting Method

Three considerations:1. Term occurrence: binary, tf, ITF, log(tf)

2. Term’s discriminative power: idfNote: chi^2, ig (information gain), gr(gain ratio), mi (mutual information), or (Odds Ratio), etc.

3. Document length: cosine normalization, linear normalization

SoC Presentation Title 2004Analysis and Proposal of a New Term Weighting MethodAnalysis of Term’s Discriminating Power

SoC Presentation Title 2004Analysis and Proposal of a New Term Weighting MethodAnalysis of Term’s Discriminating Power

• Assume they have same tfvalue. t1, t2, t3 share the same idf1 value; t4, t5, t6 share the same idf2 value.

• Clearly, the six terms contribute differently to the semantic of documents.

SoC Presentation Title 2004Analysis and Proposal of a New Term Weighting MethodTerm’s Discriminating Power -- idf

• Case 1. t1 contributes more than t2 and t3; t4 contributes more than t5 and t6.

• Case 2. t4 contributes more than t1 although idf(t4) < idf(t1).

SoC Presentation Title 2004Analysis and Proposal of a New Term Weighting MethodTerm’s Discriminating Power -- idf, chi^2, or

•Case 3. for t1, t2, t3,

in idf value, t1= t2= t3

in chi^2 value, t1=t3 >t2

in or value, t1 > t2 > t3

• Case 4. for t1 and t4,

in idf value, t1 > t4

in chi^2 and or value, t1 < t4

SoC Presentation Title 2004

Analysis and ProposalA New Term Weighting Method -- rf

• Intuitive Considerationthe more concentrated a high frequency term is in the positive category than in the negative category, the more contributions it makes in selecting the positive samples from among the negative samples.

SoC Presentation Title 2004

Analysis and ProposalA New Term Weighting Method -- rf

• rf – relevance frequency

• The rf is only related to the ratio of b and c, not involve d

• The base of log is 2.

• in case of c=0, c=1

• The final weight of term is: tf*rf

)),1max(_

2log(c

brf +=

SoC Presentation Title 2004

Analysis and ProposalEmpirical observation

Comparison of idf, rf, chi2, or, ig and gr value of four features in category 00_acq of Reuters Corpus

Category: 00_acqfeature idf rf chi2 or

3.553 30.668

24.427

0.014

0.142

4.201

4.999

3.567

ig grAcquir 4.368 850.66 0.125 0.161

Stake 2.975 303.94 0.074 0.096

Payout 1 10.87 0.011 0.014

dividend 1.033 46.63 0.017 0.022

SoC Presentation Title 2004

Analysis and ProposalEmpirical observation

Comparison of idf, rf, chi2, or, ig and gr value of four features in category 03_earn of Reuters Corpus

Category: 03_earnfeature idf rf chi2 or

3.553 0.139

0.164

364.327

35.841

4.201

4.999

3.567

ig grAcquir 1.074 81.50 0.031 0.032

Stake 1.082 31.26 0.018 0.018

Payout 7.820 44.68 0.041 0.042

dividend 4.408 295.46 0.092 0.095

SoC Presentation Title 2004

Outline

• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term

Weighting Method• Experimental Research Work• Contributions and Future Work

SoC Presentation Title 2004

Experiment Set 1Exploring the best term weighting method for SVM-based text categorization

Purposes of Experiment Set 1:1. Explore the best term weighting method for SVM-based text categorization – (Q2)

2. Compare tf.rf with various traditional term weighting methods on SVM – (Q1)

SoC Presentation Title 2004Experiment Set 1Experimental Methodology: 10 Methods

Methods Denotation Descriptionbinary 0: absence, 1: presence

TraditionalWeightingMethods

tf.idfAnd variants

tf.chi^2 tf * chi squareSupervisedWeightingMethods

tf Term frequency alonelogtf log(1+tf)ITF 1-1/(1+tf)idf idf alonetf.idf Classical tf * idflogtf.idf log(1+tf) * idftf. idf_prob tf * Probabilistic idf

tf.rf Our new methodOur new tf.rf

Term frequency

SoC Presentation Title 2004Experiment Set 1Results on Reuters Corpus

SoC Presentation Title 2004Experiment Set 1Results on subset of 20Newsgroups Corpus

SoC Presentation Title 2004

Experiment Set 1Conclusions (1)

• tf.rf performs better consistently

• No significant difference among the three term frequency variants, i.e. tf, logtf and ITF

• No significant difference between tf.idf, logtf.idf and tf.idf-prob

SoC Presentation Title 2004

Experiment Set 1Conclusions (2)

• idf and chi^2 factor, even considering the distribution of documents, do not improve or even decrease the performance

• binary and tf.chi^2 significantly underperform the other methods

SoC Presentation Title 2004Experiment Set 2Investigating supervised term weighting method and their relationship with learning algorithms

Purposes of Experiment Set 2:1. Investigate supervised term weighting methods and their relationship with learning algorithms – (Q3)

2. Compare the effectiveness of tf.rf under more general experimental circumstances –(Q1)

SoC Presentation Title 2004

Experiment Set 2Review

• Supervised term weighting methods– Use the prior information on the membership

of training documents in predefined categories• Unsupervised term weighting methods

– Does not use.– binary, tf, log(1+tf), ITF– Most popular is tf.idf and its variants: logtf.idf,

tf.idf-prob

SoC Presentation Title 2004

Experiment Set 2Review: Supervised Term Weighting Methods

1. Combined with information-theory functions or statistic metrics

– such as chi2, information gain, gain ratio, Odds ratio, etc.– Used in the feature selection step– Select the most relevant and discriminating features for the classification task, that is, the terms with higher feature selection scores

The results are inconsistent and/or incomplete.

SoC Presentation Title 2004

Experiment Set 2Review: Supervised Term Weighting Methods

2. Interaction with supervised text classifier–Linear SVM, Perceptron, kNN–Text classifier selects the positive test

documents from negative test documents by assigning different scores to the test samples, these scores are believed to be effective in assigning more appropriate weights to terms

SoC Presentation Title 2004

Experiment Set 2Review: Supervised Term Weighting Methods

3. Based on Statistical Confidence Intervals --ConfWeight

– Linear SVM– Compared with tf.idf and tf.gr (gain ratio)– The results failed to show that supervised

methods are generally higher than unsupervised methods.

SoC Presentation Title 2004

Experiment Set 2Hypothesis:

Since supervised schemes consider the document distribution, they should perform better than unsupervised ones

Motivation:1. Are supervised term weighting method is

superior to the unsupervised (traditional) methods?2. What kinds of relationship between term weighting methods and the learning algorithms, i.e. kNN and SVM, given different data collections?

SoC Presentation Title 2004

Experiment Set 2Methodology

binary 0 or 1

tf Term frequency

tf.idf Classic scheme

tf.rf Own scheme

tf.chi^2 Chi^2

tf.ig Information gain

tf.or Odds ratio

SupervisedTerm Weighting

UnsupervisedTerm Weighting

SoC Presentation Title 2004Experiment Set 2Results on Reuters Corpus using SVM

SoC Presentation Title 2004Experiment Set 2Results on Reuters Corpus using kNN

SoC Presentation Title 2004Experiment Set 2Results on 20Newsgroups Corpus using SVM

SoC Presentation Title 2004Experiment Set 2Results on 20Newsgroups Corpus using kNN

SoC Presentation Title 2004Experiment Set 2McNemar’s Significance Tests

Algorithm Corpus #_fea Significance Test Results

SVM R 15937 (tf.rf, tf, rf) > tf.idf > (tf.ig, tf.chi2, binary) >> tf.or

SVM 20 13456 (rf, tf.rf, tf.idf) > tf >> binary >> tf.or>> (tf.ig, tf.chi2)

kNN R 405 (binary, tf.rf) > tf >> (tf.idf, rf, tf.ig) > tf.chi2 >> tf.or

kNN 20 494 (tf.rf, binary, tf.idf,tf) >> rf >> (tf.or, tf.ig, tf.chi2)

SoC Presentation Title 2004

Experiment Set 2Effects of feature set size on algorithms

• For SVM, almost all methods achieved the best performance when in putting the full vocabulary (13000-16000 features)

• For kNN, the best performance achieved at a smaller feature set size (400-500 features)

• Possible reason: different noise resistance

SoC Presentation Title 2004

Experiment Set 2Conclusions (1)

Q1: Are supervised term weighting methods superior to the unsupervised term weighting methods?

-- No always.

SoC Presentation Title 2004

Experiment Set 2Conclusions (1)

• Specifically, three supervised methods based on information theory, i.e. tf.chi2, tf.ig and tf.or, perform rather poorly in all experiments.

• On the other hand, newly proposed supervised method, tf.rf achieved the best performance consistently and outperforms other methods substantially and significantly.

SoC Presentation Title 2004

Experiment Set 2Conclusions (2)

Q2: What kinds of relationship between term weighting methods and learning algorithms, given different benchmark data collections?

A: The performance of the term weighting methods, especially, the three unsupervised methods, has close relationships with the learning algorithms and data corpora.

SoC Presentation Title 2004

Experiment Set 2Conclusions (4)

Summary of supervised methods:– tf.rf performs consistently better in all

experiments;– tf.or, tf.chi^2 and tf.ig perform consistently

the worst in all experiments;– rf alone, shows a comparable performance

to tf.rf except for on Reuters using kNN

SoC Presentation Title 2004

Experiment Set 2Conclusions (5)

Summary of unsupervised methods:– tf.idf performs comparable to tf.rf on the uniform

category corpus either using SVM or kNN;– binary performs comparable to tf.rf on both

corpora using kNN while rather bad using SVM;– although tf does not perform as well as tf.rf, it

performs consistently well in all experiments

SoC Presentation Title 2004

Experiment Set 3Applications on Biomedical Domain

Purpose:Application on biomedical data collections

Motivation:Explosive growth of biomedical research

SoC Presentation Title 2004

Experiment Set 3Data Corpora

• The Ohsumed CorpusUsed by Joachims [1998]

• 18 Journals Corpus– Biochemistry and Molecular Biology– Top impact factor– 7903 documents from two-year span (2004 - 2005)

SoC Presentation Title 2004

Experiment Set 3Methodology

Four term weighting methodsbianry, tf, tf.idf, tf.rf

Evaluation Measures– Micro-averaged breakeven point– F1

SoC Presentation Title 2004Experiment Set 3Results on the Ohsumed Corpus

SoC Presentation Title 2004

Experiment Set 3The best performance of SVM with four term weighting schemes on the Ohsumed Corpus

scheme Micro-R Micro-P Micro-F1 Macro-F1

binary 0.6091 0.6097 0.6094 0.5757

tf 0.6578 0.6566 0.6572 0.6335

tf.idf 0.6567 0.6588 0.6578 0.6407

tf.rf 0.6810 0.6800 0.6805 0.6604

SoC Presentation Title 2004Experiment Set 3Comparison between our study and Joachims’ study

• Our linear SVM is more accurate– 68.05% (linear) vs 60.7%(linear) and 66.1%(rbf)

• The performances of tf.idf are identical– 65.78% vs 66%

• Our proposed tf.rf performs significantly better than other methods.

SoC Presentation Title 2004Experiment Set 3Results on 18 Journals Corpus

SoC Presentation Title 2004

Experiment Set 3Results on 3 subsets of 18 Journals Corpus

SoC Presentation Title 2004

Experiment Set 3Conclusions

• The comparison between our results and Joachims’ results shows that tf.rf can improve on the classification performance than tf.idf.

• tf.rf outperforms the other term weighting methods on the both data corpora.

• tf.rf can improve the classification performance of biomedical text classification.

SoC Presentation Title 2004

Outline• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term

Weighting Method• Experimental Research Work• Contributions and Future Work

SoC Presentation Title 2004

Contributions of Thesis

1. To propose an effective supervised term weighting method tf.rf to improve the performance of text categorization.

SoC Presentation Title 2004

Contributions of Thesis

2. To make an extensive comparative study of different term weighting methods under controlled conditions.

SoC Presentation Title 2004

Contributions of Thesis

3. To give a deep analysis of the terms’discriminating power for text categorization from the qualitative and quantitative aspects.

SoC Presentation Title 2004

Contributions of Thesis

4. To investigate the relationships between term weighting methods and learning algorithms given different corpora.

SoC Presentation Title 2004

Future Work (1)

• Extending term weighting methods on feature types other than word

SoC Presentation Title 2004

Future Work (2)

2. Applying term weighting methods to other text-related applications

SoC Presentation Title 2004

Thanks for your time and suggestion.