Text categorization using Rough Set

Text Categorization Using Rough Set. Sreekumar Biswas, Roll No.: 10273. Ph.D., Computer Application. Chairman: Dr. Rajni Jain

Transcript of Text categorization using Rough Set

Page 1: Text categorization using Rough Set

Text Categorization Using Rough Set

Sreekumar Biswas, Roll No.: 10273

Ph.D., Computer Application. Chairman: Dr. Rajni Jain

Page 2: Text categorization using Rough Set

Contents

▪ Introduction

▪ Text Categorization

▪ Text mining

▪ Rough set theory

▪ Rough set based hybrid System

▪ References

23-Feb-15 Text Categorization Using Rough Set 2

Page 3: Text categorization using Rough Set

Introduction

▪ A lot of information is available as text but cannot be used directly by computers for further processing tasks

▪ Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns from these texts

▪ These issues lead us towards a technique called text mining

Page 4: Text categorization using Rough Set

▪ Text mining studies are gaining importance day by day because of the abundant and increasing number of electronic documents, both structured and unstructured, from a variety of sources

Page 5: Text categorization using Rough Set

▪ Sources of unstructured and semi-structured information include the World Wide Web, electronic repositories, news articles, biological databases, digital libraries, online forums, electronic mail and blog repositories

▪ The main goal of text mining is to enable users to extract information from textual resources and to support operations such as retrieval, categorization (supervised, unsupervised and semi-supervised) and summarization

▪ The most important part of text mining is text categorization

Page 6: Text categorization using Rough Set

▪ Some problems:
– proper annotation of the documents
– appropriate document representation
– dimensionality reduction to handle algorithmic issues
– an appropriate classifier function to obtain good generalization and avoid over-fitting

Page 7: Text categorization using Rough Set

▪ These days, approximately 80% of an organization's information is stored in unstructured textual format, in the form of reports, email, views, news, etc.

▪ Approximately 90% of the world's data are stored in unstructured format

▪ For these huge amounts of text, state-of-the-art approaches to text categorization address three main problems:
– Document representation
– Classifier construction
– Classifier evaluation

Page 8: Text categorization using Rough Set

Text Categorization

▪ Automatic classification of text documents under some pre-defined categories

▪ Let D = {d1, d2, …, dm} be the documents and C = {c1, c2, …, cn} the categories. Each assignment is a triple ⟨di, cj, v⟩, where di ∈ D, cj ∈ C and v ∈ {1, 0} (true/false):
– ⟨di, cj, 1⟩ means di belongs to cj
– ⟨di, cj, 0⟩ means di does not belong to cj

Page 9: Text categorization using Rough Set

Prevalent Techniques:

▪ Clustering
▪ Concept Linkage
▪ Information Visualization

Page 10: Text categorization using Rough Set

Clustering


[Figure: a mixed set of points labeled x and o, separated by clustering into a group of x's and a group of o's]

Page 11: Text categorization using Rough Set

▪ A basic clustering algorithm generates a vector of topics for each document and determines the weights of how well the document fits into each cluster

▪ Clustering technology can be useful in the organization of management information systems, which may contain thousands of documents

Page 12: Text categorization using Rough Set

Concept Linkage


Page 13: Text categorization using Rough Set

▪ Concept linkage is a valuable idea in text mining, especially in the biomedical field, where so much has been published that it is impossible for researchers to read all the material and relate it to other research

▪ Categorization software can easily identify a link between topics X and Y, and between Y and Z; but a concept-linkage tool can also detect a potential link between X and Z, which cannot easily be done by a human

Page 14: Text categorization using Rough Set

Information Visualization

▪ Information visualization is useful when a user needs to narrow down a broad range of documents and explore related topics

▪ The user can interact with the document map by zooming, scaling, and creating sub-maps

▪ Governments can use information visualization to identify terrorist networks or to find information about crimes that may previously have been thought unconnected

Page 15: Text categorization using Rough Set

Text mining algorithms

▪ k-nearest neighbour
▪ Support vector machine
▪ Bayesian classifier
▪ K-means clustering
▪ Rough sets

Page 16: Text categorization using Rough Set

Rough Set theory

▪ Rough set theory was developed by Z. Pawlak in the early 1980s

▪ It deals with the classificatory analysis of data tables

▪ The main goal of rough set analysis is to synthesize approximations of concepts from the acquired data

Page 17: Text categorization using Rough Set

▪ The starting point of rough set theory is a data set represented as a table, known as an information system

▪ Let S = (U, A), where
– U = a nonempty finite set of objects
– A = a nonempty finite set of attributes such that a: U → Va, for all a ∈ A

Page 18: Text categorization using Rough Set

▪ The set Va is called the value set of a

▪ If B ⊆ A, then
– INDA(B) = {(x, y) ∈ U² | ∀a ∈ B, a(x) = a(y)}

▪ If (x, y) ∈ INDA(B), then objects x and y are indiscernible from each other by the attributes from B

▪ The equivalence classes of the B-indiscernibility relation are denoted [x]B

Page 19: Text categorization using Rough Set

▪ Indiscernibility is a relation by means of which we can conclude that two objects are not different from each other

▪ Rough set theory also uses a second kind of table

▪ Like the information system, it is represented as a table; it is called the decision system

Page 20: Text categorization using Rough Set

Information System

     Age    LEMS
x1   16-30  50
x2   16-30  0
x3   31-45  1-25
x4   31-45  1-25
x5   46-60  26-49
x6   16-30  26-49
x7   46-60  26-49

Decision System (Walk is the decision attribute)

     Age    LEMS   Walk
x1   16-30  50     1
x2   16-30  0      0
x3   31-45  1-25   0
x4   31-45  1-25   1
x5   46-60  26-49  0
x6   16-30  26-49  1
x7   46-60  26-49  0

Page 21: Text categorization using Rough Set

▪ From the given information system, we can find the three equivalence relations:
1. IND({Age}) = {{x1, x2, x6}, {x3, x4}, {x5, x7}}
2. IND({LEMS}) = {{x1}, {x2}, {x3, x4}, {x5, x6, x7}}
3. IND({Age, LEMS}) = {{x1}, {x2}, {x3, x4}, {x5, x7}, {x6}}

– The objects grouped together in each class are indiscernible from each other
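The indiscernibility relation can be sketched in a few lines of Python; the table values are taken from the example information system above, and the helper name `ind` is chosen here for illustration:

```python
from collections import defaultdict

# Example information system from the slides: condition attributes Age and LEMS.
objects = {
    "x1": {"Age": "16-30", "LEMS": "50"},
    "x2": {"Age": "16-30", "LEMS": "0"},
    "x3": {"Age": "31-45", "LEMS": "1-25"},
    "x4": {"Age": "31-45", "LEMS": "1-25"},
    "x5": {"Age": "46-60", "LEMS": "26-49"},
    "x6": {"Age": "16-30", "LEMS": "26-49"},
    "x7": {"Age": "46-60", "LEMS": "26-49"},
}

def ind(attrs):
    """Partition the universe into the equivalence classes of IND(B)."""
    classes = defaultdict(set)
    for x, row in objects.items():
        # Objects that agree on every attribute in B share one key.
        classes[tuple(row[a] for a in attrs)].add(x)
    return {frozenset(c) for c in classes.values()}
```

Calling `ind(["Age"])`, `ind(["LEMS"])` and `ind(["Age", "LEMS"])` reproduces the three partitions listed above.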

Page 22: Text categorization using Rough Set

▪ Let S = (U, A ∪ {D}), where D ∉ A
▪ i.e., D is the decision attribute

     Age    LEMS   Walk
x1   16-30  50     1
x2   16-30  0      0
x3   31-45  1-25   0
x4   31-45  1-25   1
x5   46-60  26-49  0
x6   16-30  26-49  1
x7   46-60  26-49  0

Page 23: Text categorization using Rough Set

▪ Let S = (U, A) be an information system and let B ⊆ A and X ⊆ U

▪ Two most important aspects of rough set theory:
– Lower approximation
– Upper approximation

▪ By notation, they are written BX and B̄X, and are called the B-lower approximation and the B-upper approximation, respectively

Page 24: Text categorization using Rough Set


The mathematical notation of the two approximations is given by:

BX = {x | [x]B ⊆ X}

B̄X = {x | [x]B ∩ X ≠ ∅}
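These two definitions translate directly into code. A minimal sketch, using the equivalence classes of IND({Age, LEMS}) from the earlier example and taking X to be the set of objects with Walk = 1:

```python
# Equivalence classes of IND({Age, LEMS}) from the example tables,
# and the target set X of objects that can walk (Walk = 1).
classes = [{"x1"}, {"x2"}, {"x3", "x4"}, {"x5", "x7"}, {"x6"}]
X = {"x1", "x4", "x6"}

def approximations(classes, X):
    """Return the (B-lower, B-upper) approximations of X."""
    lower, upper = set(), set()
    for c in classes:
        if c <= X:   # the whole class lies inside X: certainly in X
            lower |= c
        if c & X:    # the class overlaps X: possibly in X
            upper |= c
    return lower, upper

lower, upper = approximations(classes, X)
```

Here `lower` comes out as {x1, x6} and `upper` as {x1, x3, x4, x6}, so the boundary is {x3, x4}.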

Page 25: Text categorization using Rough Set

▪ The difference between the upper and lower approximations is called the boundary region of X and is denoted BNB(X)

▪ Therefore, BNB(X) = B̄X − BX

▪ Some conclusions:
– If the boundary region of X is empty, i.e., BNB(X) = ∅, then the set X is crisp (exact) with respect to B
– If BNB(X) ≠ ∅, the set X is referred to as rough (inexact) with respect to B

Page 26: Text categorization using Rough Set


Approximations, for X = {x | Walk(x) = 1} and B = {Age, LEMS}:

– B-lower approximation (certainly yes): {x1, x6}
– Boundary region (yes/no): {x3, x4}
– Outside the upper approximation (no): {x2, x5, x7}

Page 27: Text categorization using Rough Set

▪ Accuracy of approximation in rough set theory:
– Denoted by α
– αB(X) = |BX| / |B̄X|

where |X| denotes the cardinality of X ≠ ∅; the value of α ranges between 0 and 1
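For the walking example, the ratio of the cardinalities gives the accuracy directly. A self-contained sketch using the sets derived earlier:

```python
# Accuracy of approximation for the walking example: the lower
# approximation is {x1, x6} and the upper is {x1, x3, x4, x6}.
lower = {"x1", "x6"}
upper = {"x1", "x3", "x4", "x6"}
alpha = len(lower) / len(upper)   # |BX| / |B-bar X| = 2/4
```

An accuracy of 0.5 means the concept "can walk" is only roughly definable from Age and LEMS.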

Page 28: Text categorization using Rough Set

Rough Set based Hybrid System

▪ Text document representation
▪ Classifier construction
▪ Performance evaluation

These are the three steps required to perform any text categorization task; the presented system is no exception

Page 29: Text categorization using Rough Set

Text document representation

▪ The first step of any text mining process. It includes:
– Tokenization
– Storage of tokens
– Feature set construction
– Stemming
– Dimensionality reduction

Page 30: Text categorization using Rough Set

Classifier construction

▪ The second step. Here the system performs the learning and testing processes:
– Learning: the classifier is built by observing the features of each category in the training set
– Testing: the classifier applies a pair of precise concepts from rough set theory, the lower and upper approximations, to classify each input text document from the test set

Page 31: Text categorization using Rough Set

▪ The final step. Here the performance of the hybrid system is measured by computing its efficiency and its effectiveness. Its most common measures are:
– Accuracy
– Error rate
– Precision
– Recall

Performance evaluation

Page 32: Text categorization using Rough Set

The Algorithm


Page 33: Text categorization using Rough Set

Input: D1, D2, …, Dm (text documents); C1, C2, …, Cn (specific categories)
Output: classified text document
Begin
  For each category Ci Do
    For each text document Dj for Ci Do
      Split Dj into features ⇒ Fj
      Remove stop words, numbers and special characters from Fj ⇒ Tj
      Give each feature in Tj a frequency of 1 ⇒ Ftrj, Ftr_freqj
      Apply stemming and morphological processing to Ftrj, increasing the frequencies in Ftr_freqj ⇒ Short_Ftrj, Short_Ftr_freqj
      Apply dimensionality reduction to Short_Ftrj ⇒ DRj
      Add DRj to the database (DB) for Ci
    End For

Page 34: Text categorization using Rough Set

    Compute the upper approximation for Ci: B̄X = {x | [x]B ∩ X ≠ ∅}
    Compute the lower approximation for Ci: BX = {x | [x]B ⊆ X}
    Compute the percentage of overlap between the upper approximation for Ci and DRj for Dj; the highest percentage represents the correct category for Dj
    Compute the accuracy for Ci: αB(X) = |BX| / |B̄X|
  End For
End

Page 35: Text categorization using Rough Set

▪ The algorithm can be best understood by the following diagram:

Page 36: Text categorization using Rough Set

Figure 1: The Hybrid Text Categorization System

STEP I: Text Document → Tokenization → Vector Space Model → Stop Word Removal → Stemming → Dimensionality Reduction → DB Features
STEP II: Rough Set Theory → Classified Text Document
STEP III: Performance Evaluation

Page 37: Text categorization using Rough Set

Text Document Representation


The text documents are divided into:

– Training set: a pre-classified set of text documents used for training the classifier

– Test set: used to test the accuracy of the classifier, based on the count of correct and incorrect classifications for each text document in that set

Page 38: Text categorization using Rough Set

Tokenization

▪ Each input text document is partitioned into a list of features, which are called tokens

▪ The tokens are words, terms or attributes

Page 39: Text categorization using Rough Set

Vector Space Model

▪ Each input text document is represented as a vector in a vector space

▪ Each dimension of this space represents a single feature of that vector; its weight is computed from the frequency of occurrence of that feature in the text document

▪ The assigned weight may increase based on the frequency of each feature in the input text document
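A frequency-weighted document vector can be sketched with a plain dictionary of counts; the tokenizer below is a deliberately crude lowercase word split, not the system's actual tokenizer:

```python
import re
from collections import Counter

def tf_vector(text):
    """Map a document to a feature -> frequency dictionary (its vector).
    Crude tokenization: lowercase runs of letters only."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

v = tf_vector("Rough set theory: rough sets approximate other sets.")
```

Each distinct token becomes one dimension, and its count is the weight that grows with repeated occurrences.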

Page 40: Text categorization using Rough Set

Stop Word Removal

▪ Stop words are commonly repeated words. They include:
– Pronouns
– Conjunctions
– Special characters
– Numbers

▪ They carry no useful information for categorization

Page 41: Text categorization using Rough Set

Stemming

▪ Stemming is the process of removing affixes (prefixes and suffixes) from the set of features

▪ This process is used in order to reduce the number of features in the feature space

▪ It improves the performance of the classifier, since the different forms of a feature are stemmed into a single feature

Page 42: Text categorization using Rough Set

Example of stemming

▪ S = {convert/converts/converted/converting}

▪ After Stemming:– S = {convert}

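A minimal suffix-stripping sketch of this idea follows; it is a toy stand-in, since the presented system also uses a lexicon for irregular forms, and `crude_stem` is a name chosen here for illustration:

```python
def crude_stem(feature):
    """Strip a few common English suffixes; a toy stand-in for the
    lexicon-based stemmer described in the slides."""
    for suffix in ("ing", "ed", "es", "s"):
        stem = feature[: -len(suffix)]
        if feature.endswith(suffix) and len(stem) >= 3:
            return stem
    return feature

stems = {crude_stem(w) for w in ("convert", "converts", "converted", "converting")}
```

All four inflected forms collapse into the single feature "convert", shrinking the feature space as described.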

Page 43: Text categorization using Rough Set

▪ The system uses these principles:
– All prefixes are removed from features, if the prefix exists in the feature
– The stemming process uses a lexicon to find the root of each irregular feature
– When the only difference among similar features is in the first characters

Page 44: Text categorization using Rough Set

Dimensionality Reduction

▪ If, after non-informative feature removal and stemming, the number of features in the feature space is still too large, this procedure is applied

▪ Among the selected features, some may not be useful for the categorization task and may even decrease accuracy, so such features can be removed without affecting classifier performance

Page 45: Text categorization using Rough Set

▪ Dimensionality reduction of the feature space can be done by feature selection or feature extraction

▪ This system uses neither; instead, a specific threshold method is used

Page 46: Text categorization using Rough Set

▪ Features are selected from the feature space whose frequencies are equal to or greater than 10%, 8%, 6% or 4% of the number of features derived from the stemming process
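One possible reading of this threshold rule, sketched with hypothetical frequency counts (the slide leaves the exact base of the percentage open, so this interpretation against the total count is an assumption):

```python
def select_features(freqs, threshold_pct):
    """Keep features whose frequency is at least threshold_pct percent
    of the total feature count after stemming (one reading of the rule)."""
    cutoff = threshold_pct / 100.0 * sum(freqs.values())
    return {f for f, n in freqs.items() if n >= cutoff}

freqs = {"rough": 12, "set": 9, "text": 5, "the": 1}  # hypothetical counts
kept = select_features(freqs, 10)  # keep features with >= 10% of 27 = 2.7
```

With the 10% threshold, the rare feature "the" is dropped while the frequent ones survive.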

Page 47: Text categorization using Rough Set

Categorization Technique for Text Categorization - Rough Set Theory

▪ Rough set theory has been successfully used as a supervised categorization technique

▪ Its two precise concepts have been successfully utilized to classify text documents into one or more of the main categories and sub-categories

▪ When a test text document is given to the trained classifier, it should predict the correct main category and sub-category for that text document

Page 48: Text categorization using Rough Set

▪ The test set of 100 text documents was categorized into 4 main categories and a number of sub-categories belonging to the first three categories

▪ Computer Science, Mathematics and Physics are three of the main categories

Page 49: Text categorization using Rough Set

Category hierarchy:

– Computer Science: Artificial Intelligence, Database, Image Processing, Information Security
– Mathematics: Algebra, Numerical Analysis, Statistics
– Physics: Laser, Material

Page 50: Text categorization using Rough Set

▪ Upper approximation: the intersection between the features representing the test text document and the features in any database table (representing a sub-category of a main category) whose frequency is ≥ 10%, 8%, 6% or 4% of that table; the resulting features form the set of upper-approximation features

Page 51: Text categorization using Rough Set

▪ Lower approximation: the intersection between the features representing the test text document and the features that appear in only one database table (representing a sub-category of a main category) with a frequency ≥ 10%, 8%, 6% or 4% of that table; the resulting features form the set of lower-approximation features

Page 52: Text categorization using Rough Set

▪ The accuracy of approximation is measured by computing the ratio between the lower and upper approximations of the set of features representing the test text document

Page 53: Text categorization using Rough Set

▪ After applying all the representation steps to the test text documents and applying the lower- and upper-approximation concepts from rough set theory to their representations, the trained classifier should predict the correct main categories and sub-categories for these text documents

Classified Text Document

Page 54: Text categorization using Rough Set

Performance Evaluation for a classifier

▪ The performance of the hybrid system can be measured by calculating its efficiency and its effectiveness

Page 55: Text categorization using Rough Set


Figure 2: The learning time for building the classifier

Page 56: Text categorization using Rough Set


Figure 3: The average of the testing time for classifying of the test text documents

Page 57: Text categorization using Rough Set

▪ There are many metrics to evaluate the effectiveness of the hybrid system. The most common are accuracy, error rate, precision and recall

▪ To compute these, we need the following notation:

Page 58: Text categorization using Rough Set

▪ TPi (true positives) = the number of text documents correctly classified into category ci

▪ TNi (true negatives) = the number of text documents correctly classified as not belonging to category ci

▪ FPi (false positives) = the number of text documents incorrectly classified into category ci

▪ FNi (false negatives) = the number of text documents incorrectly classified as not belonging to category ci

Page 59: Text categorization using Rough Set

▪ Accuracy (Ac): the ratio between the number of text documents which were correctly categorized and the total number of documents:

Aci = (TPi + TNi) / (TPi + TNi + FPi + FNi)

▪ Error rate (E): the ratio between the number of text documents which were not correctly categorized and the total number of text documents:

Ei = 1 − Aci = (FPi + FNi) / (TPi + TNi + FPi + FNi)

Page 60: Text categorization using Rough Set

▪ Precision (P): the percentage of correctly categorized text documents among all text documents that were assigned to the category by the classifier:

Pi = TPi / (TPi + FPi)

▪ Recall (R): the percentage of correctly categorized text documents among all text documents belonging to that category:

Ri = TPi / (TPi + FNi)
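All four measures can be computed from the per-category confusion counts in one small function; the counts used below are hypothetical, chosen only to exercise the formulas:

```python
def metrics(tp, tn, fp, fn):
    """Per-category accuracy, error rate, precision and recall."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    error = (fp + fn) / total        # equals 1 - accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, error, precision, recall

# Hypothetical confusion counts for one category out of 100 documents:
acc, err, p, r = metrics(tp=8, tn=85, fp=2, fn=5)
```

With these counts, accuracy is 0.93, error rate 0.07, precision 0.8 and recall 8/13.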

Page 61: Text categorization using Rough Set

Results


Table 3. The results of calculating precision and recall for the hybrid system

Main Category      Sub-Category                  Precision   Recall
Computer Science   Artificial Intelligence (AI)  100%        100%
                   Database                      95.65%      100%
                   Image Processing              100%        94.73%
                   Security                      83.33%      95.23%
Mathematics        Algebra                       100%        100%
                   Numerical Analysis            100%        100%
                   Statistics                    95.23%      90.90%
Physics            Laser                         100%        100%
                   Materials                     100%        100%
Unknown            -                             100%        100%

Page 62: Text categorization using Rough Set


Figure 4: The performance evaluation for computer science category

Page 63: Text categorization using Rough Set


Figure 5: The performance evaluation for mathematics category

Page 64: Text categorization using Rough Set


Figure 6: The performance evaluation for Physics category

Page 65: Text categorization using Rough Set

Conclusion

▪ Rough set theory is a supervised categorization technique; it is used for building the text categorization model by learning the properties of a set of pre-classified text documents for each sub-category of the main categories

▪ The presented model uses a pair of precise concepts from rough set theory, the lower and upper approximations, to classify any test text document into one or more of the main categories and sub-categories, because the system deals not only with the main categories but also with a number of sub-categories for each main category

Page 66: Text categorization using Rough Set

▪ When the rough set concepts are used in the hybrid system, its results reach 96% when applied to a number of test text documents for each sub-category of the main categories

▪ The average testing time, computed over all test text documents, ranges from 5 to 14 seconds

▪ For future work, more precise concepts of rough sets, such as reduct theory, can also be used in the system

Page 67: Text categorization using Rough Set

References

▪ Kiritchenko, S. (2005). Hierarchical Text Categorization and Its Application to Bioinformatics. PhD Thesis, School of Information Technology and Engineering, Faculty of Engineering, University of Ottawa, Ottawa, Canada.

▪ Komorowski, J., Pawlak, Z., Polkowski, L. & Skowron, A. (1999). Rough Sets: A Tutorial. In: Pal, S.K., Skowron, A. (Eds) Rough-Fuzzy Hybridization: A New Trend in Decision Making, pp. 3-98, Springer-Verlag, Singapore.

▪ Oracle Corporation, oracle.com, 2008.

▪ Pawlak, Z. (March 2002). Rough Set Theory and Its Applications. Journal of Telecommunications and Information Technology, pp. 7-10.

▪ Falinouss, P. (2007). Stock Trend Prediction Using News Articles: A Text Mining Approach. Master thesis.

Page 68: Text categorization using Rough Set

▪ Raghavan, P., S. Amer-Yahia and L. Gravano eds., “Structure in Text: Extraction and Exploitation.” In. Proceeding of the 7th international Workshop on the Web and Databases (WebDB), ACM SIGMOD/PODS 2004, ACM Press, Vol 67, 2004.

▪ Ruiz, M. (December 2001). Combining Machine Learning and Hierarchical Structures for Text Categorization. PhD Thesis, Computer Science Dept., University of Iowa, Iowa City, Iowa, USA.

▪ Sadiq A. T., Abdullah S. M. Hybrid Intelligent Techniques for Text Categorization. International Journal of Advanced Computer Science and Information Technology (IJACSIT) Vol. 2, No. 2, April 2013, Page: 23-40.

▪ Sebastiani, F., “Machine learning in automated text categorization” ACM Computing Surveys (CSUR) 34, pp.1 – 47, 2002.

