Text Categorization Using Rough Set
Sreekumar Biswas, Roll No.: 10273
Ph.D., Computer Application; Chairman: Dr. Rajni Jain
Contents
▪ Introduction
▪ Text Categorization
▪ Text mining
▪ Rough set theory
▪ Rough set based hybrid System
▪ References
23-Feb-15 | Text Categorization Using Rough Set
Introduction
▪ A great deal of information is available as text but cannot be used by computers for further processing tasks
▪ Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns from these texts
▪ Hence, the issues mentioned above lead us towards a technique called TEXT MINING
▪ Text mining studies are gaining more importance day by day because of the abundantly available and increasing number of electronic documents from a variety of sources
– Unstructured and structured
▪ The sources of unstructured and semi-structured information include the World Wide Web, electronic repositories, news articles, biological databases, digital libraries, online forums, electronic mail and blog repositories
▪ The main goal of text mining is to enable users to extract information from textual resources and to deal with operations like retrieval, categorization (supervised, unsupervised and semi-supervised) and summarization
▪ The most important part of text mining is text categorization
▪ Some problems:
– proper annotation of the documents
– appropriate document representation
– dimensionality reduction to handle algorithmic issues
– an appropriate classifier function to obtain good generalization and avoid over-fitting
▪ In recent times, approximately 80% of an organization's information is stored in unstructured textual format, in the form of reports, email, views, news, etc.
▪ Approximately 90% of the world's data is stored in unstructured format
▪ For these huge amounts of text, the state-of-the-art approaches to text categorization define three main problems:
– Document representation
– Classifier construction
– Classifier evaluation
Text Categorization
▪ Automatic classification of text documents under some pre-defined categories
▪ D = {d1, d2, …, dm} and C = {c1, c2, …, cn}
▪ <di, cj, 1>: di belongs to cj
▪ <di, cj, 0>: di does not belong to cj
<di, cj, v>, where di ∈ D, cj ∈ C, and v ∈ {true, false} or {1, 0}
Prevalent Techniques:
▪ Clustering
▪ Concept Linkage
▪ Information Visualization
Clustering
[Figure: a scatter of mixed 'x' and 'o' items separated into homogeneous clusters]
▪ A basic clustering algorithm generates a vector of topics for each document and determines the weights of how well the document fits into each cluster
▪ Clustering technology can be useful in the organization of management information systems, which may contain thousands of documents
Concept Linkage
▪ Concept linkage is a valuable idea in text mining, especially in the biomedical field, where so much study has been done that it is impossible for researchers to read all the material and make connections to other research
▪ Categorization software can easily identify a link between topics X and Y, and between Y and Z. But a tool using concept linkage can also detect a potential link between X and Z, which cannot easily be done by a human
Information Visualization
▪ Information visualization is useful when a user needs to narrow down a broad range of documents and explore related topics
▪ The user can interact with the document map by zooming, scaling, and creating sub-maps
▪ Governments can use information visualization to identify terrorist networks or to find information about crimes that may previously have been thought unconnected
Text mining algorithms
▪ k-nearest neighbor
▪ Support vector machine
▪ Bayesian classifier
▪ k-means clustering
▪ Rough sets
Rough Set Theory
▪ Rough set theory was developed by Z. Pawlak in the early 1980s
▪ It deals with the classificatory analysis of data tables
▪ The main goal of rough set analysis is to synthesize approximations of concepts from the acquired data
▪ The starting point of rough set theory, which is based on data analysis, is a data set represented as a table, known as an information system
▪ Let S = (U, A),
– where U = a nonempty finite set of objects
– A = a nonempty finite set of attributes such that a: U → Va for all a ∈ A
▪ The set Va is called the value set of a
▪ If B ⊆ A, then
– INDA(B) = {(x, y) ∈ U² | ∀ a ∈ B, a(x) = a(y)}
▪ If (x, y) ∈ INDA(B), then the objects x and y are indiscernible from each other by the attributes from B
▪ The equivalence classes of the B-indiscernibility relation are denoted [x]B
▪ Indiscernibility is a relation by means of which we can conclude that two objects are not different from each other
▪ Rough set theory also uses a second kind of table
▪ Like the information system it is a data table; it is called the decision system
Information System

     Age    LEMS
x1   16-30  50
x2   16-30  0
x3   31-45  1-25
x4   31-45  1-25
x5   46-60  26-49
x6   16-30  26-49
x7   46-60  26-49

Decision System

     Age    LEMS   Walk
x1   16-30  50     1
x2   16-30  0      0
x3   31-45  1-25   0
x4   31-45  1-25   1
x5   46-60  26-49  0
x6   16-30  26-49  1
x7   46-60  26-49  0
▪ So, from the given information system, we can find the following three equivalence relations:
1. IND({Age}) = {{x1, x2, x6}, {x3, x4}, {x5, x7}}
2. IND({LEMS}) = {{x1}, {x2}, {x3, x4}, {x5, x6, x7}}
3. IND({Age, LEMS}) = {{x1}, {x2}, {x3, x4}, {x5, x7}, {x6}}
– Objects that fall in the same class are indiscernible from each other
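As an illustrative sketch (not part of the presentation), these equivalence classes can be computed directly from the table above:

```python
from collections import defaultdict

# Example information system from the slides: object -> {attribute: value}
table = {
    "x1": {"Age": "16-30", "LEMS": "50"},
    "x2": {"Age": "16-30", "LEMS": "0"},
    "x3": {"Age": "31-45", "LEMS": "1-25"},
    "x4": {"Age": "31-45", "LEMS": "1-25"},
    "x5": {"Age": "46-60", "LEMS": "26-49"},
    "x6": {"Age": "16-30", "LEMS": "26-49"},
    "x7": {"Age": "46-60", "LEMS": "26-49"},
}

def ind(attributes):
    """Partition the objects into B-indiscernibility classes [x]_B."""
    classes = defaultdict(list)
    for obj, row in table.items():
        # Objects that agree on every attribute in B share the same key
        key = tuple(row[a] for a in attributes)
        classes[key].append(obj)
    return sorted(classes.values())

print(ind(["Age"]))           # [['x1', 'x2', 'x6'], ['x3', 'x4'], ['x5', 'x7']]
print(ind(["LEMS"]))
print(ind(["Age", "LEMS"]))
```

Running this reproduces the three partitions listed on the slide.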
▪ Let S = (U, A ∪ {D}), where D ∉ A
▪ i.e., D is the decision attribute
     Age    LEMS   Walk
x1   16-30  50     1
x2   16-30  0      0
x3   31-45  1-25   0
x4   31-45  1-25   1
x5   46-60  26-49  0
x6   16-30  26-49  1
x7   46-60  26-49  0
▪ Let S = (U, A) be an information system and let B ⊆ A and X ⊆ U
▪ The two most important concepts of rough set theory are:
– Lower approximation
– Upper approximation
▪ By notation they are written B̲X and B̄X, and are called the B-lower approximation and the B-upper approximation, respectively
The mathematical definitions of the two approximations are:
B̲X = {x | [x]B ⊆ X}
B̄X = {x | [x]B ∩ X ≠ ∅}
▪ The difference between the upper and the lower approximation is called the boundary region of X and is denoted by BNB(X)
▪ Therefore, BNB(X) = B̄X − B̲X
▪ Some conclusions:
– If the boundary region of X is the empty set, i.e., BNB(X) = ∅, then the set X is crisp (exact) with respect to B
– If BNB(X) ≠ ∅, the set X is referred to as rough (inexact) with respect to B
[Figure: approximations of X = {x | Walk(x) = 1}. B-lower approximation (Yes): {x1}, {x6}; boundary region (Yes/No): {x3}, {x4}; outside region (No): {x2}, {x5, x7}]
▪ Accuracy of approximation in rough set:
– Denoted by αB(X)
– αB(X) = |B̲X| / |B̄X|
– |X| denotes the cardinality of X ≠ ∅
– The value of αB(X) ranges between 0 and 1
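A minimal sketch (not from the slides) that computes the approximations, the boundary region and the accuracy for the Walk example above:

```python
# Decision system from the slides: object -> (Age, LEMS, Walk)
rows = {
    "x1": ("16-30", "50", 1),
    "x2": ("16-30", "0", 0),
    "x3": ("31-45", "1-25", 0),
    "x4": ("31-45", "1-25", 1),
    "x5": ("46-60", "26-49", 0),
    "x6": ("16-30", "26-49", 1),
    "x7": ("46-60", "26-49", 0),
}

# Equivalence classes [x]_B for B = {Age, LEMS}
classes = {}
for obj, (age, lems, walk) in rows.items():
    classes.setdefault((age, lems), set()).add(obj)

# X = the set of objects that can walk
X = {obj for obj, (_, _, walk) in rows.items() if walk == 1}

lower = set().union(*(c for c in classes.values() if c <= X))  # classes fully inside X
upper = set().union(*(c for c in classes.values() if c & X))   # classes intersecting X
boundary = upper - lower

print(sorted(lower))            # ['x1', 'x6']
print(sorted(upper))            # ['x1', 'x3', 'x4', 'x6']
print(sorted(boundary))         # ['x3', 'x4']
print(len(lower) / len(upper))  # accuracy alpha_B(X) = 0.5
```

The output matches the figure: {x1, x6} is certainly in X, {x3, x4} is the boundary, and the accuracy is 2/4 = 0.5.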
Rough Set based Hybrid System
▪ Text document representation
▪ Classifier construction
▪ Performance evaluation
These are the three steps required to perform any text categorization task; the presented system is no exception
Text document representation
▪ The first step of any text mining process. It includes the following:
– Tokenization
– Storage of tokens
– Feature set construction
– Stemming
– Dimensionality reduction
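As a sketch of how these steps might fit together (the function name, the stop-word list and the simple suffix-stripping rule are illustrative assumptions; the presented system uses a lexicon-based stemmer):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}  # assumed small stop-word list

def represent(document: str) -> Counter:
    """Tokenize, clean, stem and count the features of one text document."""
    # Tokenization: split into lowercase word features (drops numbers/special chars)
    tokens = re.findall(r"[a-z]+", document.lower())
    # Stop word removal
    features = [t for t in tokens if t not in STOP_WORDS]
    # Naive stemming: strip a few common suffixes (a real system uses a lexicon)
    stemmed = []
    for f in features:
        for suffix in ("ing", "ed", "es", "s"):
            if f.endswith(suffix) and len(f) > len(suffix) + 2:
                f = f[: -len(suffix)]
                break
        stemmed.append(f)
    # Feature set construction: feature -> frequency
    return Counter(stemmed)

print(represent("The classifier converts and converted the converting texts"))
```

Here the variants "converts", "converted" and "converting" all collapse into the single feature "convert" with frequency 3.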
Classifier construction
▪ The second step. Here the system performs the learning and testing processes
– Learning: the classifier is built by observing the features of each category from the training set
– Testing: the classifier applies a pair of precise concepts from rough set theory, called the lower and upper approximations, to classify each input text document from the test set
Performance evaluation
▪ The final step. Here the performance of the hybrid system can be measured by computing its efficiency and its effectiveness. Its most common measures are:
– Accuracy
– Error rate
– Precision
– Recall
The Algorithm
Input: D1, D2, …, Dm (different text documents), C1, C2, …, Cn (specific categories)
Output: classified text document
Begin
  For each category Ci Do
    For each text document Dj for Ci Do
      Split Dj into features ⇒ Fj
      Remove stop words, numbers and special characters from Fj ⇒ Tj
      Give each feature in Tj a frequency equal to 1 ⇒ Ftrj, Ftr_freqj
      Apply stemming and some morphology processing to Ftrj and increase the frequencies in Ftr_freqj ⇒ Short_Ftrj, Short_Ftr_freqj
      Apply dimensionality reduction to Short_Ftrj ⇒ DRj
      Add DRj to the DB (database) for Ci
    End For
    Compute the upper approximation for Ci using: B̄X = {x | [x]B ∩ X ≠ ∅}
    Compute the lower approximation for Ci using: B̲X = {x | [x]B ⊆ X}
    Compute the percentage overlap between the upper approximation for Ci and DRj for Dj; the highest percentage represents the correct category for Dj
    Compute the accuracy for Ci using: αB(X) = |B̲X| / |B̄X|
  End For
End
▪ The algorithm can be best understood from the following diagram:
[Figure 1: The Hybrid Text Categorization System. Step I (text document representation): Text Document → Tokenization → Vector Space Model → Stop Word Removal → Stemming → Dimensionality Reduction → DB Features. Step II (classifier construction): Rough Set Theory → Classified Text Document. Step III: Performance Evaluation]
Text Document Representation
▪ The text documents are divided into two sets:
– Training set: a pre-classified set of text documents which is used for training the classifier
– Test set: used to test the accuracy of the classifier, based on the count of correct and incorrect classifications for each text document in that set
Tokenization
▪ Each input text document is partitioned into a list of features, which are called tokens
▪ The tokens are words, terms or attributes
Vector Space Model
▪ Each input text document is represented as a vector in a vector space
▪ Each dimension of this space represents a single feature of that vector; its weight is computed from the frequency of occurrence of that feature in the text document
▪ The assigned weight may increase based on the frequency of the feature in the input text document
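A small sketch of this representation (the vocabulary and tokens are made up for illustration):

```python
# Shared vocabulary: each dimension of the vector space is one feature
vocabulary = ["rough", "set", "text", "mining", "cluster"]

def to_vector(tokens):
    """Weight each dimension by the frequency of its feature in the document."""
    return [tokens.count(term) for term in vocabulary]

doc = ["rough", "set", "text", "rough", "mining"]
print(to_vector(doc))  # [2, 1, 1, 1, 0]
```

The weight of "rough" is 2 because it occurs twice; features absent from the document get weight 0.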
Stop Word Removal
▪ Stop words are commonly repeated words. They include:
– Pronouns
– Conjunctions
– Special characters
– Numbers
▪ They are of no use for categorization
Stemming
▪ Stemming is the process of removing affixes (prefixes and suffixes) from the set of features
▪ This process is used in order to reduce the number of features in the feature space
▪ It improves the performance of the classifier when the different forms of a feature are stemmed into a single feature
Example of stemming
▪ S = {convert/converts/converted/converting}
▪ After Stemming:– S = {convert}
▪ The system uses these principles:
– All prefixes are removed from features, if the prefix exists in the feature
– The stemming process uses a lexicon to find the root of each irregular feature
– Similar features are merged when the only difference among them is in the first characters
Dimensionality Reduction
▪ After the removal of non-informative features and the stemming process, if the number of features in the feature space is still too large, this procedure is applied
▪ Among the selected features, some may not be useful to the categorization task and may sometimes decrease accuracy, so such features can be removed without affecting the classifier performance
▪ Dimensionality reduction of the feature space can be done by feature selection or feature extraction
▪ But this system uses neither of them
▪ Instead, a specific threshold method is used
▪ The features are selected from the feature space such that their frequencies are equal to or greater than 10%, 8%, 6% or 4% of the number of features derived from the stemming process
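As a sketch of this threshold method (the frequencies and the count of derived features below are illustrative assumptions, not values from the presentation):

```python
from collections import Counter

def threshold_select(freqs: Counter, ratio: float, n_derived: int) -> list:
    """Keep features whose frequency >= ratio * number of features derived from stemming."""
    cutoff = ratio * n_derived
    return sorted(f for f, n in freqs.items() if n >= cutoff)

# Hypothetical feature frequencies after stemming, with 50 derived features in total
freqs = Counter({"rough": 12, "set": 9, "text": 4, "mining": 2, "cluster": 1})
print(threshold_select(freqs, 0.10, 50))  # cutoff 5.0 -> ['rough', 'set']
print(threshold_select(freqs, 0.04, 50))  # cutoff 2.0 -> also keeps 'mining' and 'text'
```

Lowering the threshold from 10% to 4% keeps more (but noisier) features, which is the trade-off the system tunes.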
Categorization Technique: Rough Set Theory
▪ Rough set theory has been successfully used as a supervised categorization technique
▪ Its two precise concepts have been successfully utilized to classify the text documents into one or more of the main categories and sub-categories
▪ When a test text document is given to the trained classifier, it should predict the correct main category and sub-category for that text document
▪ The test set of 100 text documents was categorized into 4 main categories and a number of sub-categories belonging to the first three categories
▪ Computer Science, Mathematics and Physics are the three main categories
– Computer Science: Artificial Intelligence, Database, Image Processing, Information Security
– Mathematics: Algebra, Numerical Analysis, Statistics
– Physics: Laser, Materials
▪ Upper approximation: the intersection between the features which represent the test text document and the features in any database table that have a frequency ≥ (10%, 8%, 6% or 4%) of that database table, which represents a sub-category of the main categories; the resulting features form the set of upper-approximation features
▪ Lower approximation: the intersection between the features which represent the test text document and the features which appear in only one database table, with a frequency under any frequency field ≥ (10%, 8%, 6% or 4%) of that database table, which represents a sub-category of the main categories. The resulting features form the set of lower-approximation features
▪ The accuracy of approximation can be measured by computing the ratio between the lower and upper approximations of the set of features which represents the test text document
▪ After applying all the steps to represent the test text documents and implementing the lower- and upper-approximation concepts from rough set theory on their representation, the trained classifier should predict the correct main categories and sub-categories for these text documents
Classified Text Document
Performance Evaluation for a classifier
▪ The performance of the hybrid system can be measured by calculating its efficiency and its effectiveness
Figure 2: The learning time for building the classifier
Figure 3: The average of the testing time for classifying of the test text documents
▪ There are many metrics to evaluate the effectiveness of the hybrid system. The most common are accuracy, error rate, precision and recall
▪ To compute these, we have to remember the following notation:
▪ TPi (true positives) = the number of text documents correctly classified into category ci
▪ TNi (true negatives) = the number of text documents correctly classified as not belonging to category ci
▪ FPi (false positives) = the number of text documents incorrectly classified into category ci
▪ FNi (false negatives) = the number of text documents incorrectly classified as not belonging to category ci
▪ Accuracy (Ac): the ratio between the number of text documents which were correctly categorized and the total number of documents,
Aci = (TPi + TNi) / (TPi + TNi + FPi + FNi)
▪ Error rate (E): the ratio between the number of text documents which were not correctly categorized and the total number of text documents,
Ei = 1 − Aci = (FPi + FNi) / (TPi + TNi + FPi + FNi)
▪ Precision (P): the percentage of correctly categorized text documents among all text documents that were assigned to the category by the classifier,
Pi = TPi / (TPi + FPi)
▪ Recall (R): the percentage of correctly categorized text documents among all text documents belonging to that category,
Ri = TPi / (TPi + FNi)
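The four measures above can be sketched as follows (the counts in the usage example are illustrative, not taken from the presentation's experiments):

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Per-category accuracy, error rate, precision and recall."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,   # equivalently (fp + fn) / total
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts for one category
m = metrics(tp=18, tn=75, fp=4, fn=3)
print(m)  # accuracy 0.93, precision 18/22, recall 18/21
```

Note that precision and recall ignore the true negatives, which is why they complement plain accuracy for skewed category sizes.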
Results
Main Category      Sub-Category                  Precision   Recall
Computer Science   Artificial Intelligence (AI)  100%        100%
                   Database                      95.65%      100%
                   Image Processing              100%        94.73%
                   Security                      83.33%      95.23%
Mathematics        Algebra                       100%        100%
                   Numerical Analysis            100%        100%
                   Statistics                    95.23%      90.90%
Physics            Laser                         100%        100%
                   Materials                     100%        100%
Unknown                                          100%        100%

Table 3: The results of calculating precision and recall for the hybrid system
Figure 4: The performance evaluation for computer science category
Figure 5: The performance evaluation for mathematics category
Figure 6: The performance evaluation for Physics category
Conclusion
▪ Rough set theory is a supervised categorization technique; it is used for building the text categorization model by learning the properties of a set of pre-classified text documents for each sub-category of the main categories
▪ The presented model uses a pair of precise concepts from rough set theory, called the lower and upper approximations, to classify any test text document into one or more main categories and sub-categories, because the system deals not only with the main categories but also with a number of sub-categories for each main category
▪ When the rough set theory concepts are used in the hybrid system, the results of the system reach 96% when it is applied to a number of test text documents for each sub-category of the main categories
▪ The average testing time, computed over all test text documents, ranges from 5 to 14 seconds
▪ For future work, more precise concepts of rough sets, such as reducts, can also be used in the system
References
▪ Kiritchenko, S. (2005). Hierarchical Text Categorization and Its Application to Bioinformatics. PhD thesis, School of Information Technology and Engineering, Faculty of Engineering, University of Ottawa, Ottawa, Canada.
▪ Komorowski, J., Pawlak, Z., Polkowski, L. & Skowron, A. (1999). Rough Sets: A Tutorial. In: Pal, S.K. & Skowron, A. (eds.), Rough-Fuzzy Hybridization: A New Trend in Decision Making, pp. 3-98. Springer-Verlag, Singapore.
▪ Oracle Corporation (2008). oracle.com.
▪ Pawlak, Z. (2002). Rough Set Theory and Its Applications. Journal of Telecommunications and Information Technology, pp. 7-10.
▪ Falinouss, P. (2007). Stock Trend Prediction Using News Articles: A Text Mining Approach. Master's thesis.
▪ Raghavan, P., Amer-Yahia, S. & Gravano, L. (eds.) (2004). Structure in Text: Extraction and Exploitation. In: Proceedings of the 7th International Workshop on the Web and Databases (WebDB), ACM SIGMOD/PODS 2004. ACM Press, Vol. 67.
▪ Ruiz, M. (2001). Combining Machine Learning and Hierarchical Structures for Text Categorization. PhD thesis, Computer Science Dept., University of Iowa, Iowa City, Iowa, USA.
▪ Sadiq, A.T. & Abdullah, S.M. (2013). Hybrid Intelligent Techniques for Text Categorization. International Journal of Advanced Computer Science and Information Technology (IJACSIT), Vol. 2, No. 2, pp. 23-40.
▪ Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR), 34, pp. 1-47.