Text Categorization Using Rough Set
Sreekumar Biswas, Roll No.: 10273
Ph.D., Computer Application; Chairman: Dr. Rajni Jain
Contents
▪ Introduction
▪ Text Categorization
▪ Text mining
▪ Rough set theory
▪ Rough set based hybrid System
▪ References
23-Feb-15 | Text Categorization Using Rough Set
Introduction
▪ A great deal of information is available as text but cannot be used by computers for further processing tasks
▪ Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns from these texts
▪ Hence, the issues mentioned above lead us towards a technique called TEXT MINING
▪ Text mining studies are gaining more importance day by day because of the abundantly available and increasing number of electronic documents from a variety of sources
– Unstructured and structured
▪ The sources of unstructured and semi-structured information include the World Wide Web, electronic repositories, news articles, biological databases, digital libraries, online forums, electronic mail and blog repositories
▪ The main goal of text mining is to enable users to extract information from textual resources and to deal with operations like retrieval, categorization (supervised, unsupervised and semi-supervised) and summarization
▪ The most important part of text mining is text categorization
▪ Some problems:
– proper annotation of the documents
– appropriate document representation
– dimensionality reduction to handle algorithmic issues
– an appropriate classifier function to obtain good generalization and avoid over-fitting
▪ In recent times, approximately 80% of an organization's information is stored in unstructured textual format, in the form of reports, email, views, news, etc.
▪ Approximately 90% of the world's data is stored in unstructured format
▪ For these huge amounts of text, the state-of-the-art approaches to text categorization define three main problems:
– Document representation
– Classifier construction
– Classifier evaluation
Text Categorization
▪ Automatic classification of text documents under some pre-defined categories
▪ D = {d1, d2, …, dm} and C = {c1, c2, …, cn}
▪ <di, cj, 1>: di belongs to cj
▪ <di, cj, 0>: di does not belong to cj
<di, cj, v>, where di ∈ D, cj ∈ C, and v ∈ {true, false} or {1, 0}
Prevalent Techniques:
▪ Clustering
▪ Concept Linkage
▪ Information Visualization
Clustering
[Figure: a scatter of mixed 'x' and 'o' items separated into homogeneous clusters]
▪ A basic clustering algorithm generates a vector of topics for each document and determines the weights of how well the document fits into each cluster
▪ Clustering technology can be useful in the organization of management information systems, which may contain thousands of documents
Concept Linkage
▪ Concept linkage is a valuable idea in text mining, especially in the biomedical field, where so much study has been done that it is impossible for researchers to read all the material and make connections to other research
▪ Categorization software can easily identify a link between topics X and Y, and between Y and Z. But a tool using concept linkage can also detect a potential link between X and Z, which cannot easily be done by a human
Information Visualization
▪ Information visualization is useful when a user needs to narrow down a broad range of documents and explore related topics
▪ The user can interact with the document map by zooming, scaling, and creating sub-maps
▪ Governments can use information visualization to identify terrorist networks or to find information about crimes that may previously have been thought unconnected
Text mining algorithms
▪ k-nearest neighbor
▪ Support vector machine
▪ Bayesian classifier
▪ k-means clustering
▪ Rough sets
Rough Set Theory
▪ Rough set theory was developed by Z. Pawlak in the early 1980s
▪ It deals with the classificatory analysis of data tables
▪ The main goal of rough set analysis is to synthesize approximations of concepts from the acquired data
▪ The starting point of rough set theory, which is based on data analysis, is a data set represented as a table, known as an information system
▪ Let S = (U, A),
– where U = a nonempty finite set of objects
– A = a nonempty finite set of attributes such that a: U → Va for all a ∈ A
▪ The set Va is called the value set of a
▪ If B ⊆ A, then
– INDA(B) = {(x, y) ∈ U² | ∀ a ∈ B, a(x) = a(y)}
▪ If (x, y) ∈ INDA(B), then the objects x and y are indiscernible from each other by the attributes from B
▪ The equivalence classes of the B-indiscernibility relation are denoted [x]B
▪ Indiscernibility is a relation by means of which we can conclude that two objects are not different from each other
▪ Rough set theory also uses a second kind of table
▪ Like the information system it is a data table; it is called the decision system
Information System

     Age    LEMS
x1   16-30  50
x2   16-30  0
x3   31-45  1-25
x4   31-45  1-25
x5   46-60  26-49
x6   16-30  26-49
x7   46-60  26-49

Decision System

     Age    LEMS   Walk
x1   16-30  50     1
x2   16-30  0      0
x3   31-45  1-25   0
x4   31-45  1-25   1
x5   46-60  26-49  0
x6   16-30  26-49  1
x7   46-60  26-49  0
▪ So, from the given information system, we can find the following three equivalence relations:
1. IND({Age}) = {{x1, x2, x6}, {x3, x4}, {x5, x7}}
2. IND({LEMS}) = {{x1}, {x2}, {x3, x4}, {x5, x6, x7}}
3. IND({Age, LEMS}) = {{x1}, {x2}, {x3, x4}, {x5, x7}, {x6}}
– Objects that fall in the same class are indiscernible from each other
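As an illustrative sketch (not part of the presentation), these equivalence classes can be computed directly from the table above:

```python
from collections import defaultdict

# Example information system from the slides: object -> {attribute: value}
table = {
    "x1": {"Age": "16-30", "LEMS": "50"},
    "x2": {"Age": "16-30", "LEMS": "0"},
    "x3": {"Age": "31-45", "LEMS": "1-25"},
    "x4": {"Age": "31-45", "LEMS": "1-25"},
    "x5": {"Age": "46-60", "LEMS": "26-49"},
    "x6": {"Age": "16-30", "LEMS": "26-49"},
    "x7": {"Age": "46-60", "LEMS": "26-49"},
}

def ind(attributes):
    """Partition the objects into B-indiscernibility classes [x]_B."""
    classes = defaultdict(list)
    for obj, row in table.items():
        # Objects that agree on every attribute in B share the same key
        key = tuple(row[a] for a in attributes)
        classes[key].append(obj)
    return sorted(classes.values())

print(ind(["Age"]))           # [['x1', 'x2', 'x6'], ['x3', 'x4'], ['x5', 'x7']]
print(ind(["LEMS"]))
print(ind(["Age", "LEMS"]))
```

Running this reproduces the three partitions listed on the slide.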
▪ Let S = (U, A ∪ {D}), where D ∉ A
▪ i.e., D is the decision attribute
     Age    LEMS   Walk
x1   16-30  50     1
x2   16-30  0      0
x3   31-45  1-25   0
x4   31-45  1-25   1
x5   46-60  26-49  0
x6   16-30  26-49  1
x7   46-60  26-49  0
▪ Let S = (U, A) be an information system and let B ⊆ A and X ⊆ U
▪ The two most important concepts of rough set theory are:
– Lower approximation
– Upper approximation
▪ By notation they are written B̲X and B̄X, and are called the B-lower approximation and the B-upper approximation, respectively
The mathematical definitions of the two approximations are:
B̲X = {x | [x]B ⊆ X}
B̄X = {x | [x]B ∩ X ≠ ∅}
▪ The difference between the upper and the lower approximation is called the boundary region of X and is denoted by BNB(X)
▪ Therefore, BNB(X) = B̄X − B̲X
▪ Some conclusions:
– If the boundary region of X is the empty set, i.e., BNB(X) = ∅, then the set X is crisp (exact) with respect to B
– If BNB(X) ≠ ∅, the set X is referred to as rough (inexact) with respect to B
[Figure: approximations of X = {x | Walk(x) = 1}. B-lower approximation (Yes): {x1}, {x6}; boundary region (Yes/No): {x3}, {x4}; outside region (No): {x2}, {x5, x7}]
▪ Accuracy of approximation in rough set:
– Denoted by αB(X)
– αB(X) = |B̲X| / |B̄X|
– |X| denotes the cardinality of X ≠ ∅
– The value of αB(X) ranges between 0 and 1
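A minimal sketch (not from the slides) that computes the approximations, the boundary region and the accuracy for the Walk example above:

```python
# Decision system from the slides: object -> (Age, LEMS, Walk)
rows = {
    "x1": ("16-30", "50", 1),
    "x2": ("16-30", "0", 0),
    "x3": ("31-45", "1-25", 0),
    "x4": ("31-45", "1-25", 1),
    "x5": ("46-60", "26-49", 0),
    "x6": ("16-30", "26-49", 1),
    "x7": ("46-60", "26-49", 0),
}

# Equivalence classes [x]_B for B = {Age, LEMS}
classes = {}
for obj, (age, lems, walk) in rows.items():
    classes.setdefault((age, lems), set()).add(obj)

# X = the set of objects that can walk
X = {obj for obj, (_, _, walk) in rows.items() if walk == 1}

lower = set().union(*(c for c in classes.values() if c <= X))  # classes fully inside X
upper = set().union(*(c for c in classes.values() if c & X))   # classes intersecting X
boundary = upper - lower

print(sorted(lower))            # ['x1', 'x6']
print(sorted(upper))            # ['x1', 'x3', 'x4', 'x6']
print(sorted(boundary))         # ['x3', 'x4']
print(len(lower) / len(upper))  # accuracy alpha_B(X) = 0.5
```

The output matches the figure: {x1, x6} is certainly in X, {x3, x4} is the boundary, and the accuracy is 2/4 = 0.5.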
Rough Set based Hybrid System
▪ Text document representation
▪ Classifier construction
▪ Performance evaluation
These are the three steps required to perform any text categorization task; the presented system is no exception
Text document representation
▪ The first step of any text mining process. It includes the following:
– Tokenization
– Storage of tokens
– Feature set construction
– Stemming
– Dimensionality reduction
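As a sketch of how these steps might fit together (the function name, the stop-word list and the simple suffix-stripping rule are illustrative assumptions; the presented system uses a lexicon-based stemmer):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}  # assumed small stop-word list

def represent(document: str) -> Counter:
    """Tokenize, clean, stem and count the features of one text document."""
    # Tokenization: split into lowercase word features (drops numbers/special chars)
    tokens = re.findall(r"[a-z]+", document.lower())
    # Stop word removal
    features = [t for t in tokens if t not in STOP_WORDS]
    # Naive stemming: strip a few common suffixes (a real system uses a lexicon)
    stemmed = []
    for f in features:
        for suffix in ("ing", "ed", "es", "s"):
            if f.endswith(suffix) and len(f) > len(suffix) + 2:
                f = f[: -len(suffix)]
                break
        stemmed.append(f)
    # Feature set construction: feature -> frequency
    return Counter(stemmed)

print(represent("The classifier converts and converted the converting texts"))
```

Here the variants "converts", "converted" and "converting" all collapse into the single feature "convert" with frequency 3.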
Classifier construction
▪ The second step. Here the system performs the learning and testing processes
– Learning: the classifier is built by observing the features of each category from the training set
– Testing: the classifier applies a pair of precise concepts from rough set theory, called the lower and upper approximations, to classify each input text document from the test set
Performance evaluation
▪ The final step. Here the performance of the hybrid system can be measured by computing its efficiency and its effectiveness. Its most common measures are:
– Accuracy
– Error rate
– Precision
– Recall
The Algorithm
Input: D1, D2, …, Dm (different text documents), C1, C2, …, Cn (specific categories)
Output: classified text document
Begin
  For each category Ci Do
    For each text document Dj for Ci Do
      Split Dj into features ⇒ Fj
      Remove stop words, numbers and special characters from Fj ⇒ Tj
      Give each feature in Tj a frequency equal to 1 ⇒ Ftrj, Ftr_freqj
      Apply stemming and some morphology processing to Ftrj and increase the frequencies in Ftr_freqj ⇒ Short_Ftrj, Short_Ftr_freqj
      Apply dimensionality reduction to Short_Ftrj ⇒ DRj
      Add DRj to the DB (database) for Ci
    End For
    Compute the upper approximation for Ci using: B̄X = {x | [x]B ∩ X ≠ ∅}
    Compute the lower approximation for Ci using: B̲X = {x | [x]B ⊆ X}
    Compute the percentage overlap between the upper approximation for Ci and DRj for Dj; the highest percentage represents the correct category for Dj
    Compute the accuracy for Ci using: αB(X) = |B̲X| / |B̄X|
  End For
End
▪ The algorithm can be best understood from the following diagram:
[Figure 1: The Hybrid Text Categorization System. Step I (text document representation): Text Document → Tokenization → Vector Space Model → Stop Word Removal → Stemming → Dimensionality Reduction → DB Features. Step II (classifier construction): Rough Set Theory → Classified Text Document. Step III: Performance Evaluation]
Text Document Representation
▪ The text documents are divided into two sets:
– Training set: a pre-classified set of text documents which is used for training the classifier
– Test set: used to test the accuracy of the classifier, based on the count of correct and incorrect classifications for each text document in that set
Tokenization
▪ Each input text document is partitioned into a list of features, which are called tokens
▪ The tokens are words, terms or attributes
Vector Space Model
▪ Each input text document is represented as a vector in a vector space
▪ Each dimension of this space represents a single feature of that vector; its weight is computed from the frequency of occurrence of that feature in the text document
▪ The assigned weight may increase based on the frequency of the feature in the input text document
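A small sketch of this representation (the vocabulary and tokens are made up for illustration):

```python
# Shared vocabulary: each dimension of the vector space is one feature
vocabulary = ["rough", "set", "text", "mining", "cluster"]

def to_vector(tokens):
    """Weight each dimension by the frequency of its feature in the document."""
    return [tokens.count(term) for term in vocabulary]

doc = ["rough", "set", "text", "rough", "mining"]
print(to_vector(doc))  # [2, 1, 1, 1, 0]
```

The weight of "rough" is 2 because it occurs twice; features absent from the document get weight 0.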
Stop Word Removal
▪ Stop words are commonly repeated words. They include:
– Pronouns
– Conjunctions
– Special characters
– Numbers
▪ They are of no use for categorization
Stemming
▪ Stemming is the process of removing affixes (prefixes and suffixes) from the set of features
▪ This process is used in order to reduce the number of features in the feature space
▪ It improves the performance of the classifier when the different forms of a feature are stemmed into a single feature
Example of stemming
▪ S = {convert/converts/converted/converting}
▪ After Stemming:– S = {convert}
▪ The system uses these principles:
– All prefixes are removed from features, if the prefix exists in the feature
– The stemming process uses a lexicon to find the root of each irregular feature
– Similar features are merged when the only difference among them is in the first characters
Dimensionality Reduction
▪ After the removal of non-informative features and the stemming process, if the number of features in the feature space is still too large, this procedure is applied
▪ Among the selected features, some may not be useful to the categorization task and may sometimes decrease accuracy, so such features can be removed without affecting the classifier performance
▪ Dimensionality reduction of the feature space can be done by feature selection or feature extraction
▪ But this system uses neither of them
▪ Instead, a specific threshold method is used
▪ The features are selected from the feature space such that their frequencies are equal to or greater than 10%, 8%, 6% or 4% of the number of features derived from the stemming process
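As a sketch of this threshold method (the frequencies and the count of derived features below are illustrative assumptions, not values from the presentation):

```python
from collections import Counter

def threshold_select(freqs: Counter, ratio: float, n_derived: int) -> list:
    """Keep features whose frequency >= ratio * number of features derived from stemming."""
    cutoff = ratio * n_derived
    return sorted(f for f, n in freqs.items() if n >= cutoff)

# Hypothetical feature frequencies after stemming, with 50 derived features in total
freqs = Counter({"rough": 12, "set": 9, "text": 4, "mining": 2, "cluster": 1})
print(threshold_select(freqs, 0.10, 50))  # cutoff 5.0 -> ['rough', 'set']
print(threshold_select(freqs, 0.04, 50))  # cutoff 2.0 -> also keeps 'mining' and 'text'
```

Lowering the threshold from 10% to 4% keeps more (but noisier) features, which is the trade-off the system tunes.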
Categorization Technique: Rough Set Theory
▪ Rough set theory has been successfully used as a supervised categorization technique
▪ Its two precise concepts have been successfully utilized to classify the text documents into one or more of the main categories and sub-categories
▪ When a test text document is given to the trained classifier, it should predict the correct main category and sub-category for that text document
▪ The test set of 100 text documents was categorized into 4 main categories and a number of sub-categories belonging to the first three categories
▪ Computer Science, Mathematics and Physics are the three main categories
– Computer Science: Artificial Intelligence, Database, Image Processing, Information Security
– Mathematics: Algebra, Numerical Analysis, Statistics
– Physics: Laser, Materials
▪ Upper approximation: the intersection between the features which represent the test text document and the features in any database table that have a frequency ≥ (10%, 8%, 6% or 4%) of that database table, which represents a sub-category of the main categories; the resulting features form the set of upper-approximation features
▪ Lower approximation: the intersection between the features which represent the test text document and the features which appear in only one database table, with a frequency under any frequency field ≥ (10%, 8%, 6% or 4%) of that database table, which represents a sub-category of the main categories. The resulting features form the set of lower-approximation features
▪ The accuracy of approximation can be measured by computing the ratio between the lower and upper approximations of the set of features which represents the test text document
▪ After applying all the steps to represent the test text documents and implementing the lower- and upper-approximation concepts from rough set theory on their representation, the trained classifier should predict the correct main categories and sub-categories for these text documents
Classified Text Document
Performance Evaluation for a classifier
▪ The performance of the hybrid system can be measured by calculating its efficiency and its effectiveness
Figure 2: The learning time for building the classifier
Figure 3: The average of the testing time for classifying of the test text documents
▪ There are many metrics to evaluate the effectiveness of the hybrid system. The most common are accuracy, error rate, precision and recall
▪ To compute these, we have to remember the following notation:
▪ TPi (true positives) = the number of text documents correctly classified into category ci
▪ TNi (true negatives) = the number of text documents correctly classified as not belonging to category ci
▪ FPi (false positives) = the number of text documents incorrectly classified into category ci
▪ FNi (false negatives) = the number of text documents incorrectly classified as not belonging to category ci
▪ Accuracy (Ac): the ratio between the number of text documents which were correctly categorized and the total number of documents,
Aci = (TPi + TNi) / (TPi + TNi + FPi + FNi)
▪ Error rate (E): the ratio between the number of text documents which were not correctly categorized and the total number of text documents,
Ei = 1 − Aci = (FPi + FNi) / (TPi + TNi + FPi + FNi)
▪ Precision (P): the percentage of correctly categorized text documents among all text documents that were assigned to the category by the classifier,
Pi = TPi / (TPi + FPi)
▪ Recall (R): the percentage of correctly categorized text documents among all text documents belonging to that category,
Ri = TPi / (TPi + FNi)
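The four measures above can be sketched as follows (the counts in the usage example are illustrative, not taken from the presentation's experiments):

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Per-category accuracy, error rate, precision and recall."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,   # equivalently (fp + fn) / total
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts for one category
m = metrics(tp=18, tn=75, fp=4, fn=3)
print(m)  # accuracy 0.93, precision 18/22, recall 18/21
```

Note that precision and recall ignore the true negatives, which is why they complement plain accuracy for skewed category sizes.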
Results
Main Category      Sub-Category                  Precision   Recall
Computer Science   Artificial Intelligence (AI)  100%        100%
                   Database                      95.65%      100%
                   Image Processing              100%        94.73%
                   Security                      83.33%      95.23%
Mathematics        Algebra                       100%        100%
                   Numerical Analysis            100%        100%
                   Statistics                    95.23%      90.90%
Physics            Laser                         100%        100%
                   Materials                     100%        100%
Unknown                                          100%        100%

Table 3: The results of calculating precision and recall for the hybrid system
Figure 4: The performance evaluation for computer science category
Figure 5: The performance evaluation for mathematics category
Figure 6: The performance evaluation for Physics category
Conclusion
▪ Rough set theory is a supervised categorization technique; it is used for building the text categorization model by learning the properties of a set of pre-classified text documents for each sub-category of the main categories
▪ The presented model uses a pair of precise concepts from rough set theory, called the lower and upper approximations, to classify any test text document into one or more main categories and sub-categories, because the system deals not only with the main categories but also with a number of sub-categories for each main category
▪ When the rough set theory concepts are used in the hybrid system, the results of the system reach 96% when it is applied to a number of test text documents for each sub-category of the main categories
▪ The average testing time, computed over all test text documents, ranges from 5 to 14 seconds
▪ For future work, more precise concepts of rough sets, such as reducts, can also be used in the system
References
▪ Kiritchenko, S. (2005). Hierarchical Text Categorization and Its Application to Bioinformatics. PhD thesis, School of Information Technology and Engineering, Faculty of Engineering, University of Ottawa, Ottawa, Canada.
▪ Komorowski, J., Pawlak, Z., Polkowski, L. & Skowron, A. (1999). Rough Sets: A Tutorial. In: Pal, S.K. & Skowron, A. (eds.), Rough-Fuzzy Hybridization: A New Trend in Decision Making, pp. 3-98. Springer-Verlag, Singapore.
▪ Oracle Corporation (2008). oracle.com.
▪ Pawlak, Z. (2002). Rough Set Theory and Its Applications. Journal of Telecommunications and Information Technology, pp. 7-10.
▪ Falinouss, P. (2007). Stock Trend Prediction Using News Articles: A Text Mining Approach. Master's thesis.
▪ Raghavan, P., Amer-Yahia, S. & Gravano, L. (eds.) (2004). Structure in Text: Extraction and Exploitation. In: Proceedings of the 7th International Workshop on the Web and Databases (WebDB), ACM SIGMOD/PODS 2004. ACM Press, Vol. 67.
▪ Ruiz, M. (2001). Combining Machine Learning and Hierarchical Structures for Text Categorization. PhD thesis, Computer Science Dept., University of Iowa, Iowa City, Iowa, USA.
▪ Sadiq, A.T. & Abdullah, S.M. (2013). Hybrid Intelligent Techniques for Text Categorization. International Journal of Advanced Computer Science and Information Technology (IJACSIT), Vol. 2, No. 2, pp. 23-40.
▪ Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR), 34, pp. 1-47.