
AN ENHANCED TEXT DOCUMENT

CLASSIFICATION BASED ON TERMS AND

SYNONYMS RELATION

A THESIS REPORT

Submitted by

PRANEETHA K.

Under the guidance of

Dr. ANGELINA GEETHA

in partial fulfillment for the award of the degree of

MASTER OF PHILOSOPHY in

COMPUTER SCIENCE

B.S.ABDUR RAHMAN UNIVERSITY (B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)

(Estd. u/s 3 of the UGC Act. 1956) www.bsauniv.ac.in

December 2012


B.S.ABDUR RAHMAN UNIVERSITY (B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)

(Estd. u/s 3 of the UGC Act. 1956) www.bsauniv.ac.in

BONAFIDE CERTIFICATE

Certified that this thesis report AN ENHANCED TEXT DOCUMENT

CLASSIFICATION BASED ON TERMS AND SYNONYMS RELATION is the

bonafide work of PRANEETHA K. (RRN: 1145213) who carried out the thesis work

under my supervision. Certified further, that to the best of my knowledge the work

reported herein does not form part of any other thesis report or dissertation on the

basis of which a degree or award was conferred on an earlier occasion on this or any

other candidate.

SIGNATURE SIGNATURE

Dr. ANGELINA GEETHA Dr. P. SHEIK ABDUL KHADER

SUPERVISOR HEAD OF THE DEPARTMENT

Professor & Head Professor & Head

Dept. of Computer Science & Engg. Dept. of Computer Applications

B.S. Abdur Rahman University B.S. Abdur Rahman University

Vandalur, Chennai – 600 048 Vandalur, Chennai – 600 048


ACKNOWLEDGEMENT

I would like to start by thanking Dr. V. M. Periasamy, Registrar, B. S.

Abdur Rahman University for providing an excellent infrastructure and

facilities to carry out my course successfully.

I owe my deepest gratitude to Dr. Angelina Geetha, Professor & Head,

Department of Computer Science and Engineering, B. S. Abdur Rahman

University for her valuable advice and guidance in carrying out this research

work. I will always remain indebted to her for the moral support, encouragement and enthusiasm for work that she instilled in me.

With great pleasure and acknowledgement, I extend my profound

thanks to Dr. P. Sheik Abdul Khader, Head, Department of Computer

Applications, B. S. Abdur Rahman University who has been a constant

source of inspiration throughout this work.

I extend my sincere thanks to my class advisor Dr. A. Jaya,

Professor, Department of Computer Applications, B. S. Abdur Rahman

University for her constant support and encouragement.

I am also thankful to all the staff members of the department for their

full cooperation and help.


ABSTRACT

Data mining is the discovery of knowledge and useful information from

large amounts of data stored in databases. Since a large portion of the

available data is stored in text databases, the field of text mining is gaining

importance. The text databases are rapidly growing due to the increasing

amount of data available in electronic form such as digital libraries, World

Wide Web, electronic repositories etc. Due to this vast amount of digitized

texts, classification systems are used more often in text mining, to analyse

texts and to extract the knowledge they contain. Text classification (also called

text categorization) is a process that assigns a text document to one of a set

of predefined classes.

Most of the existing classification systems use the Bag-of-Words model

which classifies a text document based on the number of occurrences of its component words and omits the fact that various words might have been used

to express a similar concept. Hence this model suffers from the problem of

synonymy which arises due to different words with similar meanings. The

proposed approach classifies the text documents by enriching the Bag-of-

words data representation with synonyms. This approach uses WordNet – a

lexical database of English to extract the synonyms for all the key terms in

the text document, and then combines them with the key terms to form a new

representative vector. As a result, the system counts the occurrence of both

the key term and corresponding synonyms in the document for the

classification, resulting in the reduction of synonymy problem.

The performance of the proposed system, in comparison with two classification approaches, i.e. the synonym frequency approach and the term frequency approach, is evaluated using the 20Newsgroups data corpus. The

experimental results showed that the proposed approach, which uses the sum of term and synonym frequencies for classification, results in an increase in the performance of the classification system when compared to classification using the other two approaches.


TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

ABSTRACT iv

LIST OF TABLES viii

LIST OF FIGURES ix

LIST OF ABBREVIATIONS x

1. INTRODUCTION 1

1.1 TERMS AND TERMINOLOGIES 2

1.1.1 Data 2

1.1.2 Information 2

1.1.3 Knowledge 3

1.1.4 Data Mining 3

1.1.5 Text Mining 4

1.1.6 Information Extraction 5

1.1.7 Information Retrieval 6

1.1.8 Text Retrieval 7

1.1.9 Text Classification 8

1.1.10 Vector Space Model 9

1.1.11 Natural language Processing 10

1.1.11.1 Text Tokenization 10

1.1.11.2 Removal of Stop Words 11

1.1.11.3 Stemming 11

1.2 THESIS ORGANIZATION 11

2. LITERATURE REVIEW 12


3. DEVELOPMENT PROCESS 20

3.1 PROBLEM DEFINITION 20

3.2 SYSTEM DESIGN 21

3.2.1 Learning Phase 21

3.2.2 Classification Phase 21

3.2.3 Data Pre-processing 23

3.2.3.1 Tokenization 23

3.2.3.2 Noise Removal 24

3.2.3.3 Stop Word Removal 24

3.2.3.4 Stemming 24

3.2.4 Bag-of-Words 24

3.2.5 WordNet 25

3.2.5.1 WordNet & Synonyms 25

3.3 DETAILED DESIGN 26

3.3.1 Data Flow Diagram for Level 0 26

3.3.2 Data Flow Diagram for Level 1 26

3.3.3 Data Flow Diagram for Level 2 27

3.4 METHODOLOGY 29

3.4.1 Calculation of Sum of Frequencies of Term and its Synonyms 29

3.4.2 Vector Generation by Weighting the Key Terms 31

3.4.3 Calculation of Similarity between Document and Categories Profiles 33


3.5 IMPLEMENTATION 34

3.5.1 Java Programming Language 34

3.5.2 NetBeans IDE 36

3.5.3 Implementation Details 37

4. EXPERIMENTS AND EVALUATION 39

4.1 EVALUATION METRICS 39

4.1.1 Precision and Recall 39

4.1.2 F-measure 41

4.1.2.1 Macro Averaged F-measure 42

4.1.2.2 Micro Averaged F-measure 42

4.2 EXPERIMENTAL DATASET 44

4.3 PERFORMANCE ANALYSIS 45

4.4 INFERENCE FROM THE RESULT 58

5. CONCLUSION AND FUTURE ENHANCEMENT 59

REFERENCES 60

APPENDIX 1: SAMPLE SCREEN SHOTS 67

TECHNICAL BIOGRAPHY 72


LIST OF TABLES

Table No. Table Name Page No.

4.1 Possible Predictions of a Classifier 40

4.2 Details of the 20Newsgroups Categories used for Evaluation 45

4.3 Macro Averaged F-measure Results for 20Newsgroups Categories 47

4.4 Micro Averaged F-measure Results for 20Newsgroups Categories 50

4.5 Macro Averaged F-measure Results based on Size of Categories Profile 53

4.6 Micro Averaged F-measure Results based on Size of Categories Profile 56


LIST OF FIGURES

Figure No. Figure Name Page No.

1.1 Relation between Data, Information and Knowledge 3

1.2 Text Mining Process 4

1.3 Architecture of an Information Extraction System 5

1.4 Architecture of an Information Retrieval System 6

1.5 Text Classification Process 9

3.1 System Architecture Diagram 22

3.2 Data Flow Diagram - Level 0 26

3.3 Data Flow Diagram - Level 1 27

3.4 Data Flow Diagram - Level 2 28

4.1 Macro Averaged F1-score for 20Newsgroups Data Corpus 48

4.2 Micro Averaged F1-score for 20Newsgroups Data Corpus 51

4.3 Macro Averaged F-measure Results for Varying Size of Categories Profile 54

4.4 Micro Averaged F-measure Results for Varying Size of Categories Profile 57


LIST OF ABBREVIATIONS

S. No. ACRONYM EXPANSION

1. KDD Knowledge Discovery from Data

2. tf Term Frequency

3. sf Synonym Frequency

4. idf Inverse Document Frequency

5. tfidf Term Frequency Inverse Document Frequency

6. IE Information Extraction

7. IR Information Retrieval

8. IRS Information Retrieval System

9. NLP Natural Language Processing

10. rMFoM Regularized Maximum Figure-of-Merit

11. MFoM Maximum Figure-of-Merit

12. NL Negative Label

13. NLP Negative Label Propagation

14. SVM Support Vector Machine

15. TP True Positive

16. TN True Negative

17. FP False Positive

18. FN False Negative

19. CI Class Information

20. HMM Hidden Markov Model

21. SOM Self Organizing Map

22. FFCA Fuzzy Formal Concept Analysis


23. HONB Higher Order Naive Bayes

24. IID Independent and Identically Distributed

25. FRAM Frequency Ratio Accumulation Method

26. KNN K-Nearest Neighbors

27. DIFS Distributional Information based Feature Selection

28. BoW Bag of Words

29. DFD Data Flow Diagram

30. API Application Programming Interface

31. WWW World Wide Web

32. JVM Java Virtual Machine

33. CPU Central Processing Unit

34. OS Operating System

35. IDE Integrated Development Environment

36. Java SE Java Platform, Standard Edition

37. Java ME Java Platform, Micro Edition

38. EJB Enterprise JavaBeans

39. GUI Graphical User Interface

40. XML Extensible Markup Language

41. HTML Hyper Text Markup Language

42. CSS Cascading Style Sheets

43. JSP Java Server Pages

44. SQL Structured Query Language


1. INTRODUCTION

Computerization and automated data gathering have resulted in extremely large data repositories. Hence data mining has gained a great deal

of attention in recent years, due to the availability of huge amounts of data

and the need for converting such data into useful information and knowledge.

Data mining refers to the process of mining or extracting knowledge from

large amounts of data. It is an indispensable step in the Knowledge

Discovery from Data (KDD) process. Knowledge discovery is an iterative

process of data cleaning, data integration, data selection, data

transformation, data mining, pattern evaluation and knowledge

representation.

Text mining is the extraction of useful information from textual

resources using data mining techniques. The data sources used for text

mining may be semi structured or unstructured documents. Basic text mining

tasks include text classification (text categorization), information retrieval, text

clustering, information extraction etc. Nowadays, a large portion of the data is

stored in text databases, which consist of large collections of documents

from various sources such as World Wide Web, digital libraries, news

articles, electronic repositories etc. Since text databases are rapidly growing

due to the increasing amount of information available in electronic form, text

classification systems are used more often in text mining.

Text classification (text categorization) is the task of assigning a text

document to a relevant category. In order to be classified, each document

should be turned into a machine understandable format. The limitation of the

traditional Bag of Words document representation is the problem of

synonymy. This model counts the key word occurrences and omits the fact

that different words might have been used to express a similar concept within

the same document. This representation has to be enhanced so that both the key words and their corresponding synonyms in the document are considered for the classification process.
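The enhancement described above can be sketched as follows. This is a minimal illustration, assuming a small hand-built synonym map standing in for a WordNet lookup:

```java
import java.util.*;

// Sketch of a synonym-enriched Bag-of-Words count. The synonym map is a
// hypothetical stand-in for WordNet: each key term maps to the set of
// synonyms whose occurrences should be counted together with it.
public class EnrichedBagOfWords {

    // Count occurrences of each key term, adding occurrences of its synonyms.
    public static Map<String, Integer> count(List<String> tokens,
                                             Map<String, Set<String>> synonyms) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            for (Map.Entry<String, Set<String>> e : synonyms.entrySet()) {
                // A token contributes to a key term if it is the term itself
                // or one of that term's synonyms.
                if (e.getKey().equals(token) || e.getValue().contains(token)) {
                    freq.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> syn = Map.of("car", Set.of("automobile", "auto"));
        List<String> doc = List.of("car", "automobile", "road");
        // "car" counts its own occurrence plus that of "automobile".
        System.out.println(count(doc, syn)); // prints {car=2}
    }
}
```

With the plain Bag-of-Words model, "car" and "automobile" would be counted as two unrelated features; summing them into one frequency is the idea behind the proposed representation.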


1.1 TERMS AND TERMINOLOGIES

There are several basic terms in the data mining area. Some of them are as follows:

1.1.1 Data

Data is the raw material for data processing: unprocessed facts that, on their own, carry no meaning. Data consists of facts and statistics used for reference or analysis. It may be numbers, characters, symbols,

images etc., which can be processed by a computer. Data must be

interpreted by a human or machine, to derive meaning. In computational

terms, data refers to the computerized representations of models and

attributes of entities.

For example, researchers who conduct a market survey might ask the public to complete questionnaires about a product or a service. These completed

questionnaires are called data.

1.1.2 Information

Information is the result of interpretation of data. When data is

processed, manipulated, organized and presented in such a way as to be

meaningful to the person who receives it, it is called information. Data that

represents the results of a computational process to which meaning has

been assigned by human beings or computers is referred to as information in

computational terms.

For example, the data collected by the market survey in the form of

questionnaires is processed and analysed in order to prepare a report on the

survey. This resulting report is information.


1.1.3 Knowledge

Knowledge is the application of information. Knowledge is obtained when information is given meaning by interpretation related to some context. In computational terms, data that represents the results of a computer-related process such as learning or association is called knowledge.

For example, the application of information gained about the product from the market survey report to increase its sales is considered to be knowledge. The relation between data, information and knowledge is depicted in Figure 1.1.

Figure 1.1: Relation between Data, Information and Knowledge.

1.1.4 Data Mining

Data mining is the process of extracting previously unknown and potentially useful information from data stored in repositories. It is an

essential step in the process of knowledge discovery [51].

Knowledge discovery consists of an iterative sequence of the following steps:

Data cleaning – to eliminate noisy and inconsistent data.

Data integration – the process of combining multiple data sources.

Data selection – data relevant to the analysis task are retrieved.


Data transformation – transforming the data into the form appropriate

for mining.

Data mining – process of extracting interesting data patterns.

Pattern evaluation – identifying the patterns representing knowledge

based on some interestingness measures.

Knowledge presentation – visualization and knowledge representation

techniques are used to present the mined knowledge to the user.

Data mining can be performed on different kinds of data repositories such as

relational databases, data warehouses, text databases, multimedia

databases, data streams etc.

1.1.5 Text Mining

Text mining is the process of extracting useful information from large amounts of textual resources. Data stored in these textual resources may be a combination of structured and unstructured data such as natural language texts. The text mining process is depicted in Figure 1.2.

Figure 1.2: Text Mining Process

The first step in the text mining process is to pre-process the text to remove noise and inconsistent data. Thereafter, the text is transformed into a form suitable for mining. Then attributes are selected and appropriate patterns are discovered. Finally, the discovered patterns are evaluated based on


interestingness measures. Basic text mining tasks include information

extraction, information retrieval, summarization, visualization etc.

1.1.6 Information Extraction

Information Extraction (IE) is the process of analysing text to extract facts about pre-specified entities. These facts are then stored in a database and may be used for further analysis. Figure 1.3 shows the architecture of a simple information extraction system [51].

Figure 1.3: Architecture of an Information Extraction System.

The system takes raw text of documents as its input and generates a

list of tuples as its output. It begins by processing a document in a series of

procedures. In the first step, the raw text of the document is split into multiple

sentences using a sentence segmenter and subsequently, each sentence is


further divided into words using a tokenizer. Thereafter, the part-of-speech tags for each sentence are obtained. In the next step (named entity detection), each sentence is searched for entities of interest. Finally, the relation detection step searches for relations between the different entities in the text. As a result of these steps, the list of related tuples is extracted.

IE mainly deals with identifying key words or key terms within a textual

document. It produces structured data ready for post processing. While

Information Retrieval (IR) retrieves relevant documents from collections, IE

retrieves relevant information from documents.

1.1.7 Information Retrieval

Information Retrieval (IR) is a process that deals with representing,

storing, organizing and accessing the information. It is the technique used to

retrieve information from large collections of documents. There are multiple

ways to represent an IR system (IRS). In their paper, Isabel Volpe et al. [26]

have described a model of the architecture of the IR system depicted in

Figure 1.4.

Figure 1.4: Architecture of an Information Retrieval System.


The goal of Information Retrieval is to select documents that are most

relevant to a query. IR allows the user to retrieve relevant documents that

best match a query, but does not specify exactly where the required

information lies.

IR views the text in a document merely as a bag of unordered words.

It deals with different types of information such as text, images, audio, video

etc. The IRS pre-processes the data in the documents using tokenization,

stop word removal and stemming processes before indexing them. When the

user enters a query, the same pre-processing is applied to the query text and

then the query is matched to the documents. Matching is done by a similarity

measure which assigns a similarity score to a document in response to the

query. These scores are used to generate a ranked list of documents which

is returned to the user as the result of IR.

1.1.8 Text Retrieval

Text Retrieval (also called document retrieval) is a branch of

information retrieval where the information stored in the form of text is

retrieved. Text retrieval process retrieves the relevant documents by

matching the user query against a set of text documents. A user query can

be a single or multiple sentence description of information.

The purpose of a text retrieval system is to search the text database for

relevant documents. If the database is very large, its response time will be

slow. To overcome this latency, the text database is pre-processed and

stored in a system which helps in fast searching. This pre-processed form is

called text representation and is the form of text provided as input to the IR

system.

The major applications of text document retrieval systems are internet search engines. To retrieve documents according to the interest of the user, text classification systems are used. For example, computer science documents might be classified based on subject areas such as Operating Systems, Data Structures, Artificial Intelligence etc. People interested in a specific topic of computer science would find this classification useful.


1.1.9 Text Classification

Text Classification is also called Text Categorization. It is the task

of classifying a document under a predefined category or class. Text

classification tasks can be divided into different types. They are:

Supervised document classification – where some external

mechanism provides information on the correct classification for the

documents.

Unsupervised document classification – where classification must be

done entirely without reference to external information.

Semi-supervised document classification – where parts of the

documents are labeled by the external mechanism.

The proposed approach deals with the supervised document classification

technique as it makes use of prior information on the membership of training

documents in predefined categories.

Supervised classification is a two-phase process. The first phase is called the training phase or learning phase. The second phase in the classification process is the testing or classification phase. For developing any categorization model, a collection of input data is used. This data set is subdivided into a training data set and a test data set.

The training data set refers to the collection of records whose class labels are already known and is used to build the categorization model. Using these

documents, a set of predetermined classes are described. The classifier built

in the training phase is then applied to the test data set.

The test data set refers to the collection of records whose class labels are known; when given as input to the built categorization model, it should return the accurate class labels of the records. It determines the accuracy of the model based on the counts of correct and incorrect predictions on the test records.


The general approach to text classification is given in Figure 1.5.

Figure 1.5: Text Classification Process.

Figure 1.5 shows the steps followed in the text classification process. The first step is to transform the documents into a representation suitable for the classification task. Then word stemming is performed, followed by the deletion of stop words. Thereafter, the document is represented as the vector of the content words obtained in the previous step. Next, feature selection is done to avoid unnecessarily large feature vectors. Finally, a learning algorithm is applied to predict the class labels of previously unknown documents.

To represent the document as a vector of its key terms, the Vector Space Model is used.

1.1.10 Vector Space Model

The Vector Space Model or Term Vector Model is a model for representing text documents as vectors of key terms. Generally this model represents both a document and a query as vectors in a high dimensional

space corresponding to all the keywords and uses an appropriate similarity

measure to compute the similarity between the query vector and the

document vector.

Vector Space Model is divided into three stages. The first stage is

document indexing where the keywords representing the content are

extracted from the document text. The second stage is the weighting of the extracted key terms using an appropriate weighting scheme. The final stage is

to calculate the similarity between the document vector and the query vector

using the appropriate similarity measure.

When used in the text classification process, the Vector Space Model calculates the similarity between each category vector and the vector of the document to be classified. To represent the document as the vector of its key terms, the text in the document first has to be pre-processed using Natural Language Processing techniques to turn it into machine-readable form.
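A common choice of similarity measure in the Vector Space Model is the cosine of the angle between the two vectors; the sketch below assumes that choice (the measure actually used is not fixed at this point in the text):

```java
// Cosine similarity between two term-weight vectors of equal length.
// Both vectors are assumed to be indexed by the same ordered list of
// key terms.
public class CosineSimilarity {

    public static double cosine(double[] doc, double[] query) {
        double dot = 0.0, docNorm = 0.0, queryNorm = 0.0;
        for (int i = 0; i < doc.length; i++) {
            dot += doc[i] * query[i];         // accumulate the inner product
            docNorm += doc[i] * doc[i];       // squared length of the document vector
            queryNorm += query[i] * query[i]; // squared length of the query vector
        }
        if (docNorm == 0.0 || queryNorm == 0.0) return 0.0; // guard for empty vectors
        return dot / (Math.sqrt(docNorm) * Math.sqrt(queryNorm));
    }

    public static void main(String[] args) {
        // Vectors pointing the same way score close to 1; orthogonal vectors score 0.
        System.out.println(cosine(new double[]{1, 2, 0}, new double[]{2, 4, 0}));
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1}));
    }
}
```

Because the measure depends only on the angle between the vectors, it is insensitive to document length, which is why it is widely used for comparing a document vector against a category or query vector.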

1.1.11 Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field dealing with the

interaction between computers and human languages (natural languages). It

is a process in which the computer extracts meaningful information from

natural language input, processes it and gives output in natural language.

Tokenization, stop word removal and stemming are the commonly

used NLP tasks for pre-processing the document in classification process.

1.1.11.1 Text Tokenization

Tokenization is the process of breaking up the input text into

individual units called tokens. The tokens may be strings of alphabetic

characters or numbers. These tokens are separated by a space or line break

or any of the punctuation characters. These separators may or may not be

included in the list of resulting tokens.

For example,

Input: Tokenization splits given text into tokens, based on the specified

delimiters.

If the punctuation characters full stop and comma are used as the delimiters

for tokenization then,

Output: Tokenization splits given text into tokens

based on the specified delimiters
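The example above can be reproduced in a few lines of Java (the implementation language used in this thesis); the delimiter set here is an assumption matching the example:

```java
import java.util.*;

// Minimal tokenizer for the example above: splits the input on the given
// delimiter characters (expressed as a regular-expression character class)
// and drops empty tokens and surrounding whitespace.
public class Tokenizer {

    public static List<String> tokenize(String text, String delimiterClass) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split(delimiterClass)) {
            String token = part.trim();
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Tokenization splits given text into tokens, based on the specified delimiters.";
        // Full stop and comma as delimiters, matching the example above.
        for (String token : tokenize(input, "[.,]")) {
            System.out.println(token);
        }
    }
}
```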


1.1.11.2 Removal of Stop Words

Stop words are words that do not convey any meaning by themselves, such as auxiliary verbs, conjunctions and articles (e.g. “the”, “a”, “and” etc.). Stop word removal makes the document easier to process by removing all the stop words in the document.
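Stop word removal amounts to filtering the token list against a stop list. The list below is a tiny illustrative subset; a real system would load a much larger one:

```java
import java.util.*;

// Stop word removal sketch: filters tokens against a small stop list.
public class StopWordFilter {

    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "and", "is", "are", "of", "to");

    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Comparison is case-insensitive so "The" is removed as well.
            if (!STOP_WORDS.contains(token.toLowerCase())) kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "classifier", "reads", "a", "document");
        System.out.println(removeStopWords(tokens)); // prints [classifier, reads, document]
    }
}
```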

1.1.11.3 Stemming

The document may contain different grammatical forms of a word

such as realise, realises, realising or it may contain a group of derivationally

related words with similar meaning such as democracy, democratic,

democratization etc. In such situations, the technique of stemming would be

useful. The goal of stemming is to reduce inflected or derived words to their

stem or root form. A stemming algorithm processes words with the same

stem and retains only the stem as the key word.

For example, the words “stemming”, “stemmed”, “stemmer” are reduced to

their root word which is “stem”.
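A full stemmer such as Porter's algorithm involves many rules; the sketch below only strips a handful of suffixes, just enough to reproduce the example above, and is not a substitute for a real stemming algorithm:

```java
// Naive suffix-stripping sketch, illustrating only the idea of stemming.
// It handles the example above ("stemming", "stemmed", "stemmer" -> "stem")
// but is far simpler than a real algorithm such as the Porter stemmer.
public class SimpleStemmer {

    // Longest suffixes are listed first so that "ming" is tried before "ing".
    private static final String[] SUFFIXES = {"ming", "med", "mer", "ing", "ed", "er", "s"};

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Keep at least a three-letter stem to avoid over-stripping.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word; // no known suffix: the word is its own stem
    }

    public static void main(String[] args) {
        System.out.println(stem("stemming")); // prints stem
        System.out.println(stem("stemmed"));  // prints stem
        System.out.println(stem("stemmer"));  // prints stem
    }
}
```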

1.2 THESIS ORGANIZATION

The outline of the thesis can be summarized as follows:

Chapter 2: Literature review section gives a brief survey of research going on

in the area of text classification and the methods used for it.

Chapter 3: Defines the problem statement and the development process of the proposed approach to solve this problem.

Chapter 4: In this chapter, all experimental results obtained using the

proposed classification approach are tabulated, compared and evaluated.

Chapter 5: Presents the conclusion of the work and proposes ideas for further research.


2. LITERATURE REVIEW

Due to the increased availability of documents in digital form, the

automated categorization or classification of texts into predefined categories

has become an active research topic over the last few years. Traditionally, a

domain expert does text categorization manually. Documents are read by the

expert and then assigned to the appropriate category. To eliminate the large

amount of manual effort required, automatic text categorization is used. The

dominant approach to the text classification problem is based on Machine

Learning techniques such as supervised text categorization in which a

classifier is built by learning from a set of preclassified documents. In the late

80’s the most popular approach to text categorization was knowledge

engineering, where a set of rules were defined manually to encode the

expert knowledge on how to classify the documents under the given

categories. But in 90’s this approach is slowly being replaced by the machine

learning paradigm. Fabrizio Sebastiani [9], in his survey discusses the main

approaches to text categorization that fall within the machine learning

paradigm.

The commonly used text document representation is the bag-of-words, which simply uses a set of words and the number of occurrences of the words in a document to represent the document [3, 12]. Many efforts have been made to improve this simple and limited document representation. For

example, phrases or word sequences are used to replace single words.

Besides the single word, syntactic phrases have been explored by many

researchers [20], [11], [5], [19]. A syntactic phrase is extracted according to

language grammars. In general, experiments showed that syntactic phrases

were not able to improve the performance of standard “bag-of-word”

indexing. Statistical phrases have also attracted much attention from

researchers [15], [10], [4]. A statistical phrase is composed of a sequence of

words that occur contiguously in text in a statistically interesting way [15], usually called an n-gram. Here, n is the number of words in the

sequence. Researchers also indicated that the short statistical phrase was


more helpful than the long one [4]. In addition to phrases, other linguistic

features such as POS-tag were tried by researchers [1], [19].
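As a minimal illustration of the statistical phrases described above, the following sketch extracts word n-grams from a token list with a sliding window (the token list is invented for the example):

```python
def ngrams(tokens, n):
    # A statistical phrase is a contiguous sequence of n words,
    # extracted with a sliding window over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["text", "document", "classification", "based", "on", "synonyms"]
bigrams = ngrams(tokens, 2)
# bigrams -> [('text', 'document'), ('document', 'classification'),
#             ('classification', 'based'), ('based', 'on'), ('on', 'synonyms')]
```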

A word cluster, i.e., a word's distribution over different categories, was used to characterize a word [13], [16], [5]. The clustering methods used by the

researchers included the agglomerative approach [13] and the Information

Bottleneck [15]. Experiments showed that the word-cluster-based representation sometimes outperformed the single-word-based representation. Several variants of term frequency (tf) such as the logarithmic

frequency and the inverse frequency were used by researchers [14], [7]. The

logarithmic frequency reflected the intuition that the importance of a word

should increase logarithmically instead of linearly with the increase of its

frequency. The inverse frequency was derived in order to distribute term

frequencies evenly on the interval from 0 to 1 [7]. Ko et al. [22] used the

importance of each sentence to weight the term frequency. On the other hand, categorization at the document classification stage is a traditional

problem of machine learning and more sophisticated machine learning

methods and classification algorithms such as Neural Networks [8], Decision

Trees [2], [17], the Naive Bayes Method [6], K-Nearest Neighbor [23] and

Boosting Algorithms [18], as well as Support Vector Machines (SVM) [21] are

applied to induce representations for documents based on the categories.

The following section of literature review gives a brief survey of research

going on in the area of text classification and the methods used for it.

Atika Mustafa et al. [24] have discussed the implementation of

Information Extraction (IE) and categorization in text mining applications.

A modified version of Porter's algorithm for inflectional stemming is used to extract the key terms from a document. A domain dictionary, which defines the collective set of all key terms of a certain field (in this case Computer Science), is used for calculating term frequencies for categorization.


Aurangzeb Khan et al. [25] presented a study on E-Documents

classification. The growing volume of textual data requires text mining, machine learning and natural language processing techniques and methodologies to organize and extract patterns and knowledge from the

documents. This study focused on and explored the main techniques and

methods for automatic documents classification.

Chengyuan Ma et al. [26] proposed a new approach called

Regularized Maximum Figure-of-Merit (rMFoM) learning for supervised and semi-supervised learning. A regularized extension of supervised maximum figure-of-merit (MFoM) learning is proposed to improve its generalization capability and extend it to semi-supervised learning. The MFoM learning criterion is reformulated in the Tikhonov regularization framework to improve the generalization capability of any classifier based on discriminant

functions.

Chenping Hou et al. [27] described a novel type of semi-supervised learning using negative labels. A new type of supervision information called the Negative Label (NL) and an approach called Negative Label Propagation (NLP) are proposed to guide the process of semi-supervised learning. The NLP algorithm first constructs an initialization matrix and a parameter matrix which specify the type of an NL point. Then the data labels are propagated under the guidance of the NL information and the structure given by both labelled and unlabelled points.

Christian Wartena et al. [28] presented a study on keyword extraction

using word co-occurrence. The relevance measure of a word for the text is

computed by defining co-occurrence distributions for words and comparing

these distributions with the document and the corpus distribution. The

semantic similarity of two terms is calculated by computing similarity measure

for their co-occurrence distributions. The co-occurrence distribution of a word

can also be compared with the word distribution of a text. This gives a

measure to determine how typical a word is for a text.


Christoph Goller et al. [29] presented an evaluation of various

automatic document classification methods. Different feature construction

and selection methods and various classifiers are evaluated. By this

evaluation, it is shown that feature selection or dimensionality reduction is

necessary not only to reduce learning and classification time, but also to

avoid overfitting. It is also shown that the Support Vector Machine (SVM) is significantly better than all the other classification methods evaluated.

Deng Cai et al. [30] proposed a novel learning algorithm called

Manifold Adaptive Experimental Design for text categorization. Unlike most

previous learning approaches which explore either Euclidean or data-

independent nonlinear structure of the data space, the proposed approach

explicitly takes into account the intrinsic manifold structure. The local

geometry of the data is captured by a nearest neighbour graph. The graph

Laplacian is incorporated into the manifold adaptive kernel space in which

active learning is then performed.

Der Chiang Li et al. [31] described a new attribute construction

approach to solve the small dataset classification problem. The proposed

method computes classification-oriented membership degrees to construct new attributes, called class-possibility attributes, and also develops an attribute construction procedure to construct another set of new attributes, called synthetic attributes, to increase the amount of

information for small data set analysis.

Fang Lu et al. [32] proposed a refined weighted K-Nearest Neighbors

algorithm for text categorization. The traditional KNN algorithm assumes that the weight of each feature item is identical across categories. This is not reasonable, since each feature item may have a different importance and distribution in different categories. This disadvantage of the traditional KNN algorithm is significantly overcome by the refined KNN algorithm.


Jung-Yi Jiang et al. [33] presented a paper about Fuzzy Self-

Constructing Feature Clustering Algorithm for text classification. This paper

proposes a fuzzy similarity-based self-constructing algorithm for feature

clustering. The words in the feature vector of a document are grouped into

clusters based on a similarity test. With this algorithm, the membership functions properly describe the distribution of the training data.

Lifei Chen et al. [34] proposed a classifier for text categorization using

class-dependent projection based approach. By projecting onto a set of

individual subspaces, the samples belonging to different document classes

are separated such that they are easily classified. This is achieved by

developing a supervised feature weighting algorithm to learn the optimized

subspaces for each document class. This method learns class-specific

weighting values for each term from training data in the training phase, and classifies new documents based on a weighted distance measurement in the testing phase.

Ma Zhanguo et al. [35] presented an improved term weighting method

for text classification. The term weighting used by traditional text classification methods does not involve class information of the terms. The proposed tf-idf-CI model uses class information (CI) for weighting. The class information comprises intra-class information and inter-class information. Using this method it is shown that

the role of important and representative terms is raised and the effect of the

unimportant feature term to classification is decreased.

Makoto Suzuki et al. [36] presented a new mathematical model of

automatic text categorization and a classification method based on VSM.

They proposed a new classification technique called the Frequency Ratio

Accumulation Method (FRAM). This is a simple technique that adds up the

ratios of term frequencies among categories, and it is able to use index terms

without limit. Then, the Character N-gram is used to form index terms,

thereby improving FRAM.


Man Lan et al. [37] described several supervised and unsupervised

term weighting methods for automatic text categorization. Also a new

supervised term weighting method to improve the term’s discriminating power

for text categorization task is proposed. In the comparative study of

supervised and unsupervised term weighting methods, it is found that not all

supervised term weighting methods are superior to unsupervised methods

and the performance of the term weighting methods has close relationships

with the learning algorithms and data corpus.

Murat Can Ganiz et al. [38] introduced a novel Bayesian framework for

classification called Higher Order Naïve Bayes (HONB). A core assumption in traditional machine learning algorithms is that instances are Independent and Identically Distributed (IID). These critical independence assumptions made in

traditional machine learning algorithms prevent them from going beyond

instance boundaries to exploit latent relations between features. Unlike

approaches that assume data instances are independent, HONB leverages

higher order relations between features across different instances.

Nianyun Shi et al. [39] presented a feature selection method named

Distributional Information based Feature Selection (DIFS). In DIFS a new

estimation mechanism is proposed to measure the relevance between a feature's distribution characteristics and its contribution to categorization. The authors discussed three kinds of distributional characteristics of features and their direct or indirect relevance to a feature's contribution to text

categorization. In addition, two kinds of algorithms are designed to implement

DIFS.

Nikos Tsimboukakis et al. [40] presented a new approach called Word-Map Systems for the classification of documents in terms of their content.

This approach consists of two stages. The first stage uses a word map to

create a feature representation of the documents, while the second stage

comprises a supervised classifier that classifies the documents into

predefined categories. Two approaches to create word maps are presented


and compared based on Hidden Markov Models (HMM) and the self-

organizing map (SOM).

Ning Zhong et al. [41] in their paper, proposed an innovative and

effective pattern discovery technique. To overcome the problem of low

frequency and misinterpretation of derived patterns the proposed technique

uses two processes, pattern deploying and pattern evolving, to refine the

discovered patterns in text documents. The proposed approach improves the

accuracy of evaluating term weights because discovered patterns are more

specific than whole documents.

Sheng-Tun Li et al. [42] proposed a novel classification approach

based on Fuzzy Formal Concept Analysis (FFCA) to control the impact from

noise. Most of existing document classification algorithms are easily affected

by noise data. This research uses Fuzzy Formal Concept Analysis to

generalize documents to concepts in order to decrease the impact from noise

terms. Every formal concept is used to recommend the category for new

documents.

Tomoharu Iwata et al. [43] proposed a framework for improving

classifier performance by effectively using auxiliary samples. The auxiliary

samples are labelled not in terms of the target taxonomy but according to

classification schemes or taxonomies that are different from the target

taxonomy. This method finds a classifier by minimizing a weighted error over

the target and auxiliary samples. The weights are defined so that the

weighted error approximates the expected error when samples are classified

into the target taxonomy.

Weibin Deng [44] developed a hybrid algorithm for text classification

based on rough set for the problem of high dimensions of text feature words.

In the first stage, most documents are classified into certain classes with

high accuracy by rough set. In addition, based on the attributes’ importance

degree theory in the informational view of rough set, the documents of the


doubt set are classified further. In the second stage, weighted Naive Bayes relaxes the conditional independence assumption of Naive Bayes.

Xiao-Bing Xue et al. [45] explored the effect of novel values assigned to a word, called distributional features, which express the distribution of a word in a document. The widely used Bag-of-Words (BOW) representation may not fully

express the information contained in the document. The proposed

distributional features include the compactness of the appearances of the

word and the position of the first appearance of the word. The analysis shows

that the distributional features are useful for text categorization, especially

when they are combined with term frequency or combined together.

Xiaojun Quan et al. [46] investigated the suitability of the existing term-weighting methods for question categorization. The popular unsupervised and supervised term-weighting methods for question categorization are compared and three new supervised term-weighting methods are proposed. The evaluation of the newly proposed supervised term-weighting schemes exhibits stable and consistent improvement over most of the previous term-weighting methods.

Yaxin Bi et al. [47] introduced an approach to combining the decisions

of text classifiers. Each classifier output is modelled as a list of prioritized

decisions and then divided into the subsets of 2 and 3 decisions which are

subsequently represented by the evidential structures in terms of triplet and

quartet. Also a general formula is developed based on the Dempster-Shafer

theory of evidence for combining such decisions.

Ying Liu et al. [48] reported an approach of concepts handling in

document representation and its effect on the performance of text

categorization. A Frequent Word Sequence algorithm that generates concept-centred phrases to render domain knowledge concepts is introduced. It is

also observed that a universally suitable support threshold does not exist and

the removal of concept irrelevant sequences can possibly further improve the

performance at a lower support level.


3. DEVELOPMENT PROCESS

3.1 PROBLEM DEFINITION

Text classification is one of the basic text mining tasks which classifies

the documents with respect to one or more pre-existing categories. In order

to be classified, each document should be represented in a machine

understandable format. Most of the existing classification systems use the

traditional Bag-of-Words model, the common way of representing a text as a bag of its component words. A limitation of the Bag-of-Words document representation is the problem of synonymy, which arises because different words can have similar meanings. This model counts the number of occurrences of the key

words and omits the fact that different words might have been used to

express a similar concept within the same document. This representation has to be enhanced so that both the key words and their corresponding synonyms in the document are considered in the classification process.

The proposed system uses an approach to automatically classify the

documents by enriching the Bag-of-Words text representation with

synonyms. This approach uses WordNet, a lexical database of English, to help the process of document representation and classification. After pre-processing the data in the text document, the document is represented as a bag of key words. Then the system uses WordNet to extract the

synonyms for all the key words in the text document and combines them with

the key words to form a new representative vector of the document.

The proposed approach helps in improving the performance of the

classification system by providing a solution to the problem of synonymy. If

the document contains different words with same meaning, then the system

counts the number of occurrences of all the synonymic words and adds them

together. As a result, both the key words and their synonyms in the text

document are considered and their occurrences counted together for the

classification process.


3.2 SYSTEM DESIGN

Figure 3.1 displays the system architecture design. It consists of two phases: the learning phase and the classification phase.

3.2.1 Learning Phase

The learning phase, also called the training phase, deals with the preparation of the training data set. In this phase, a set of training documents is given as input to the classifier. The class labels of these training documents are known in advance. The data in each document is pre-processed to represent the text as a vector of key terms. Based on these training documents, a set of pre-defined classes is described and a training data set, called the Categorical Profile, is prepared.

3.2.2 Classification Phase

The document to be classified is given as input to the classifier in the classification phase. This phase is also called the testing phase, since the classifier is tested using new documents. A data set called the Document Profile is prepared from the input document. In the next step, the similarity between the profile of the document to be classified and the category profiles is calculated using a similarity measure, and the input document is assigned the category whose profile is most similar to that of the document. The category to which the document is assigned, i.e., the class of the input document, is provided as the output of the system.


Figure 3.1: System Architecture Diagram.



Both the learning phase and classification phase use the following

modules to perform the classification process.

3.2.3 Data Pre-processing

Data pre-processing deals with the representation of the data in a document as a vector of its key terms. The data in the text document is pre-processed to reduce the complexity of the document and make it easier to handle. Pre-processing is an important step in document classification and yields a compact form of the document's content. The purpose of this text representation is to reduce possible language-dependent factors.

The functions carried out in the pre-processing process are as follows:

3.2.3.1 Tokenization

Tokenization is the process of breaking a stream of text into individual words or phrases called tokens. Tokens may be strings of alphabetic characters or numbers. The delimiters used to separate tokens may be a space, a line break or any of the punctuation characters, and they may or may not be included in the list of resulting tokens.

For instance,

Input: Tokenization splits sentences into individual tokens. This is an

example of tokenization.

If the punctuation character full stop is used as the delimiter for the

tokenization process then,

Output: Tokenization splits sentences into individual tokens

This is an example of tokenization
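The tokenization step above can be sketched as follows; the delimiter set is configurable, and the full-stop-only call reproduces the example output:

```python
import re

def tokenize(text, delimiters=r"[ \n.,;:!?]"):
    # Break a stream of text into tokens at the delimiter characters;
    # the delimiters themselves are discarded from the result.
    return [t.strip() for t in re.split(delimiters + "+", text) if t.strip()]

# Splitting only on the full stop reproduces the example above:
sentences = tokenize("Tokenization splits sentences into individual tokens. "
                     "This is an example of tokenization.", delimiters=r"[.]")
# Splitting on spaces and punctuation yields word-level tokens:
words = tokenize("Tokenization splits sentences into individual tokens.")
# words -> ['Tokenization', 'splits', 'sentences', 'into', 'individual', 'tokens']
```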


3.2.3.2 Noise Removal

All the irrelevant data such as non-alphabetic characters like full

stops, commas, brackets, numerals and special characters are removed.

3.2.3.3 Stop Word Removal

Stop words such as auxiliary verbs, conjunctions and articles (e.g. “the”, “a”, “and”, etc.), which do not convey any topical meaning, are removed in this step.
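The noise removal and stop word removal steps can be sketched together as below; the stop word set shown is a small illustrative subset, not a complete list:

```python
STOP_WORDS = {"the", "a", "an", "and", "is", "this", "of", "in", "to"}

def remove_noise(tokens):
    # Drop tokens that are not purely alphabetic
    # (numerals, punctuation, special characters).
    return [t for t in tokens if t.isalpha()]

def remove_stop_words(tokens):
    # Drop common function words that carry no topical meaning.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

raw = ["This", "is", "an", "example", ",", "42", "of", "stop", "word", "removal"]
clean = remove_stop_words(remove_noise(raw))
# clean -> ['example', 'stop', 'word', 'removal']
```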

3.2.3.4 Stemming

A stemming algorithm that converts different words into their root forms is applied. Stemming refers to the process of reducing inflectional and derivational variants of words to their stem or root form.

There are basically two types of stemming techniques: inflectional and derivational. Derivational stemming derives a new word from an existing word by changing its grammatical category (for example, changing a noun to a verb). When the singular is changed to the plural or the past tense to the present, it is referred to as inflectional stemming.

A stemmer (an algorithm which performs stemming) conflates words with the same stem and keeps only the stem as the key word. For example, the words “train”, “training”, “trainer” and “trains” can all be replaced with “train”. To minimize the effects of inflection and morphological variations of words, the Porter stemming algorithm is used.
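The work uses the Porter stemming algorithm; the toy suffix stripper below only illustrates the idea of conflating inflected forms to a stem, and is far simpler than the real Porter algorithm's ordered, measure-conditioned rule phases:

```python
def simple_stem(word):
    # Toy stemmer: strip one common suffix if a reasonably long
    # stem remains. A stand-in for the full Porter algorithm.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["train", "training", "trainer", "trains"]
stems = [simple_stem(w) for w in words]
# stems -> ['train', 'train', 'train', 'train']
```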

The screen shots for the data pre-processing process carried out in

this research work such as tokenization, noise removal, stop word removal

and stemming are depicted in Appendix 1.

3.2.4 Bag-of-Words

The Bag-of-Words is the collection of the component words of an input text document. These component words, also called feature words or key words, are obtained after the data in the document is pre-processed. It is the simplest representation of texts in the vector space model: it transforms texts into vectors where each component represents a word. The purpose of this representation is to make the text understandable to the machine.
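A minimal sketch of the Bag-of-Words vectorization, assuming a fixed vocabulary (the vocabulary and tokens here are invented for the example):

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    # Represent a document as a vector of term counts over a
    # fixed vocabulary; each vector component is one word.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocabulary = ["car", "train", "game"]
vector = bag_of_words(["car", "game", "car"], vocabulary)
# vector -> [2, 0, 1]
```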

3.2.5 WordNet

To move from counting word occurrences to counting synonyms, a

thesaurus is required. The proposed approach uses WordNet to help the

process of document representation and classification.

WordNet is a large lexical database of English developed at Princeton University [50]. It is a combination of a dictionary and a thesaurus. The basic building block of WordNet is the synset, consisting of all the words that represent a given concept. Nouns, verbs, adjectives and adverbs are grouped into sets of synonyms, each expressing a distinct concept. Each synset represents the underlying lexical concept expressed by all the synonymic words and is identified by a unique synset number. In addition, each synset contains pointers to other semantically related synsets. A word may belong to more than one synset. Each synset is associated with a sense, i.e. a word meaning; for example, the words “car” and “automobile” are grouped in the synset {car, automobile}. A word form in WordNet can be a single word or two or more words connected by underscores. WordNet is capable of mapping a word form to a synset.

3.2.5.1 WordNet & Synonyms

The relation of synonymy is the basis of the structure of WordNet. Synonymy is the relation binding two equivalent or close concepts; it is a symmetrical relation. A synonym is a word which can be substituted for another without a major change of meaning. The lexemes are gathered into sets of synonyms called synsets. Thus a synset consists of all the terms used to indicate a concept.

The definition of synonymy used in WordNet is as follows: "Two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value of the sentence in which the substitution is made." An example of a synset is [feature, characteristic, lineament, have, sport, boast].
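The lookup of a word's synonyms can be sketched with a tiny stand-in table; the real system queries the WordNet database itself, and the two synsets below are only illustrative:

```python
# Miniature stand-in for WordNet's synset store. In the real system
# these sets come from the WordNet lexical database, not this table.
SYNSETS = [
    {"car", "automobile", "auto"},
    {"feature", "characteristic", "lineament"},
]

def synonyms(word):
    # A word may belong to more than one synset; its synonyms are
    # the union of all synsets containing it, minus the word itself.
    result = set()
    for synset in SYNSETS:
        if word in synset:
            result |= synset - {word}
    return result

syns = synonyms("car")
# syns -> {'automobile', 'auto'}
```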

3.3 DETAILED DESIGN

3.3.1 Data Flow Diagram for Level 0

An overview of the overall process is provided in the Level 0 DFD in figure 3.2.

Figure 3.2: Data Flow Diagram - Level 0.

In the overall view of the process given in the above diagram, the

input text document is processed by the classifier and assigned to a class

whose profile is most similar to the profile of the document.

3.3.2 Data Flow Diagram for Level 1

A more detailed view of the overall process is provided in the Level 1 DFD in figure 3.3.

The document given as input to the system is pre-processed to reduce its complexity and obtain a compact form of the input document. The bag of component words of the input document is obtained as the result of the data pre-processing process.



By using this bag of words of the text document, the categorical and document profiles are prepared. The similarity between the document profile and all category profiles is calculated using the similarity measure. The input document is assigned to the category or class whose profile is most similar to the document profile.

Figure 3.3: Data Flow Diagram - Level 1.

3.3.3 Data Flow Diagram for Level 2

The detailed flow of data within each module of the classification system is shown in the Level 2 DFD in figure 3.4.

The input document is passed to the pre-processing module where it

gets tokenized, noise and stop words are removed and finally the key words

in the document are stemmed.



Figure 3.4: Data Flow Diagram - Level 2.

Tokenization splits the stream of text in the document into individual

tokens based on the delimiters provided. Then the words that do not convey



any meaning, such as special characters and articles, are removed in the noise removal and stop word removal operations. At the end of the pre-processing stage, the document undergoes stemming, where all the key words obtained from the previous operations are reduced to their root form.

The bag of key words of the input text document is obtained as the

result of tokenization, noise removal, stop word removal and stemming

operations. Each one of these key words is passed through WordNet which

is a lexical database of English, and their synonyms are extracted. These synonyms are checked against the key words, and if any synonym of a key word appears in the document then the frequency of that synonym is added to the key term (key word) frequency.

After calculating the sum of frequencies of key term and its synonyms,

the categorical and document profiles are prepared by assigning weights to the key terms. Then, the similarity between the document profile and all category profiles is calculated using the similarity measure. Finally, the input

document is assigned to the category or class whose profile is most similar to

its profile.

3.4 METHODOLOGY

The classifier performs the classification process in two phases: the

learning phase and the classification phase. The learning or training phase

deals with constructing the classification model, while the classifier is

applied in the testing or classification phase. The classification

system works as follows.

3.4.1 Calculation of Sum of Frequencies of Term and its Synonyms

Term Frequency (tf) is the number of occurrences of a key word in the

bag of component words of the text document generated as a result of the

data pre-processing process. Synonym Frequency (sf) is the number of

occurrences of all the synonyms of a key word in the bag of component

words of the text document.


Most of the existing classification systems use the term frequency to

classify the document, whereas some classification systems classify the

documents based on the synonym frequency.

To enhance the performance of these classification systems, this

research proposes an approach to classify the text documents based on both

term and its synonym frequency. It classifies the documents by finding the

sum of term and its corresponding synonym frequencies in the document.

Calculating the sum of the term and its synonym frequencies is the

process of obtaining, for each key word, the sum of its number of occurrences and the

number of occurrences of all of its synonyms in the document.

This process interacts with WordNet, the lexical database of English, to obtain

the synonyms of the key words. Each key word in the bag of words is

passed through WordNet and its synonyms are extracted. Then these

synonyms are checked against the key words in the text document obtained

by the pre-processing process. If any of a key term's synonyms

obtained using WordNet appears within the document, then the frequency of

that synonym is added to the term (key word) frequency. This strategy

extends each term vector by entries for the WordNet synonyms S appearing in

the text document.

As a result, the frequency of each of the key terms in the document is given

by,

    f(t, d) = tf(t, d) + sf(t, d)                                (3.1)

where,

tf(t, d) is the number of occurrences of a key word t in document d.

sf(t, d) is the total number of occurrences of all the synonyms appearing

in document d for the key word t.
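As a minimal sketch of this step, the following Java fragment computes tf(t, d) + sf(t, d) for a key term, assuming a pre-built synonym map; that map is a tiny hand-made stand-in for the WordNet lookup used in the thesis, and the class and method names are illustrative, not taken from the original implementation.

```java
import java.util.*;

// Illustrative sketch of the tf + sf calculation of Section 3.4.1.
public class TfSfCalculator {

    // Counts the occurrences of each token in the bag of words.
    static Map<String, Integer> termFrequencies(List<String> bagOfWords) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : bagOfWords) tf.merge(w, 1, Integer::sum);
        return tf;
    }

    // tf(t, d) + sf(t, d): the term's own count plus the counts of all of
    // its synonyms that appear in the same document.
    static int termPlusSynonymFrequency(String term,
                                        Map<String, Integer> tf,
                                        Map<String, Set<String>> synonyms) {
        int sum = tf.getOrDefault(term, 0);
        for (String s : synonyms.getOrDefault(term, Collections.emptySet())) {
            sum += tf.getOrDefault(s, 0);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("car", "auto", "engine", "car");
        // Hand-made synonym map standing in for the WordNet lookup.
        Map<String, Set<String>> syn = new HashMap<>();
        syn.put("car", new HashSet<>(Arrays.asList("auto", "automobile")));
        Map<String, Integer> tf = termFrequencies(doc);
        // tf("car") = 2 and one synonym occurrence ("auto"), so tf + sf = 3.
        System.out.println(termPlusSynonymFrequency("car", tf, syn));
    }
}
```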


Thus each of the term vectors td will be replaced by the concatenation of

the term vector and the synonym vector,

    td → td ⊕ (sf(d, s1), ..., sf(d, s|S|))                      (3.2)

where sf(d, s) denotes the frequency with which a synonym

s ∈ S of the key terms appears in a document d.

3.4.2 Vector Generation by Weighting the Key Terms

Given the key word frequencies, the input test document and all the

categories can be represented as weighted vectors of their key terms.

Weighting the key terms is the process of assigning a weight to

each key word as an indication of the importance of the word. There are

various methods to calculate the weight of a key word. The proposed

approach uses the standard tf × idf (product of Term Frequency and Inverse

Document Frequency) measure, defined as

    w(t, d) = tf(t, d) × idf(t)                                  (3.3)

where,

tf(t, d) is the number of occurrences of the term t in document d.

idf(t) represents the importance of the term t, which is a measure of

whether the term is common or rare across all documents.

Given the key term weights in all categories, the weighted vector for

each category cj is given by,

    cj = (w1j, w2j, ..., wmj)                                    (3.4)

where,

w1j, w2j, ..., wmj are the weights of the m key terms in the

category cj.


The key words in the category are weighted by the proposed approach

as follows:

    wij = (tf + sf)ij × log( N / |{c : ti ∈ c}| )                (3.5)

where,

(tf + sf)ij is the sum of the frequencies of term ti and its synonyms

in category cj.

N is the total number of categories.

|{c : ti ∈ c}| is the number of categories that contain term ti.

The equation 3.5 measures the degree of association between a key

term and a category. Its application is based on the assumption that a term

whose frequency strongly depends on the category in which it occurs will be

useful for distinguishing among the categories.

Also, the weighted vector for the test document d is given by,

    d = (w1d, w2d, ..., wmd)                                     (3.6)

where,

w1d, w2d, ..., wmd are the weights of the key terms in the

document d.

In this approach, the key terms in the test document are weighted as,

    wid = (tf + sf)id × log( N / |{d' : ti ∈ d'}| )              (3.7)

where,

(tf + sf)id is the sum of the frequencies of term ti and its synonyms

in document d.

N is the total number of training documents.

|{d' : ti ∈ d'}| is the number of training documents containing term ti.


The equation 3.7 computes the document-dependent weights for the

key terms so as to generate a vectorial representation for each document, in

which each term is weighted by its contribution to the discriminative

semantics of the document.
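The weighting schemes of equations 3.5 and 3.7 share the same shape: a tf + sf count scaled by a logarithmic inverse-frequency factor. A minimal Java sketch under that reading, with an illustrative method name and example numbers that are assumptions, not values from the thesis:

```java
// Illustrative sketch of the tf + sf weighting of Section 3.4.2.
public class TermWeighting {

    // w = (tf + sf) * log(N / n), where N is the total number of
    // categories (or training documents, for equation 3.7) and n is the
    // number of categories (or documents) containing the term.
    static double weight(int tfPlusSf, int total, int containingTerm) {
        return tfPlusSf * Math.log((double) total / containingTerm);
    }

    public static void main(String[] args) {
        // A term occurring 4 times (synonyms included) and appearing in
        // 5 of 20 category profiles: weight = 4 * ln(20 / 5) = 4 * ln(4).
        System.out.println(weight(4, 20, 5));
    }
}
```

A term that appears in every category gets log(N / N) = 0, so terms spread across all categories contribute nothing to the profiles, which matches the discriminative intent stated above.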

3.4.3 Calculation of Similarity between Document and Categories

Profiles

The similarity measure is used to determine the degree of

resemblance between two vectors. The similarity between documents is

estimated by calculating the distance between the vectors of these

documents. The similarity measure value will be larger for documents that

belong to the same class and smaller otherwise. Many measures have been

proposed for measuring document similarity based on term occurrences or

document vectors.

The measure used to calculate the similarity in the proposed approach

is cosine similarity measure. The cosine similarity measure evaluates the

cosine of the angle between two document vectors and is given by [29],

    sim(d1, d2) = (d1 · d2) / ( ||d1|| × ||d2|| )                (3.8)

The equation 3.8 can be derived as,

    sim(d1, d2) = ( Σt w(t, d1) × w(t, d2) ) / ( √(Σt w(t, d1)²) × √(Σt w(t, d2)²) )

where:

t is a key word, and d1 and d2 are the two vectors (profiles) to be

compared.

w(t, d1) is the weight of the term t in d1.

w(t, d2) is the weight of the term t in d2.

In the proposed approach, this similarity measure is used to calculate

the distance between each category vector and the vector of the document to

be categorized. When this similarity measure is used, if there are more

common key terms and these key terms have strong weights, the

similarity will be closer to 1, and vice versa. As a result, the document is

assigned to the category whose vector is closest to the document vector.
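A minimal Java sketch of this similarity computation, with profiles represented as sparse maps from key word to weight; the representation and names are illustrative assumptions, not the thesis code:

```java
import java.util.*;

// Illustrative sketch of the cosine similarity of Section 3.4.3.
public class CosineSimilarity {

    // cos(a, b) = (a . b) / (|a| * |b|) over the union of key words;
    // terms missing from a vector contribute weight 0.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> doc = new HashMap<>();
        doc.put("car", 2.0); doc.put("engine", 1.0);
        Map<String, Double> category = new HashMap<>();
        category.put("car", 1.0); category.put("engine", 1.0);
        category.put("wheel", 1.0);
        // Shared weighted terms yield a similarity between 0 and 1.
        System.out.println(similarity(doc, category));
    }
}
```

The document would be compared against every category profile in this way and assigned to the category with the largest returned value.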

3.5 IMPLEMENTATION

To develop the classifier, this research takes advantage of the built-in

functions of the Java programming language and NetBeans, which is an

Integrated Development Environment (IDE) intended for Java development.

3.5.1. Java Programming Language

Java is an object-oriented programming language developed by Sun

Microsystems [52], [53]. It has a built-in Application Programming Interface

(API) that can handle graphics and user interfaces and can be used to create

applications and applets. Applications are standalone programs that perform the same

functions as programs written in other programming languages. Applets are

programs that can be embedded in a webpage and accessed over the

internet. When a program written in Java is compiled, bytecode is

produced that can be read and executed by any platform that can run Java.

Because of its rich set of APIs and its platform independence, Java can also

be thought of as a platform in itself.

Java has become a language of choice for providing worldwide

Internet solutions because of its security and portability features. The key that

allows Java to provide these features is bytecode: a highly

optimized set of instructions designed to be executed on the Java run-time

system called the Java Virtual Machine (JVM), which is an interpreter for

bytecode. Translating a Java program into bytecode makes it much easier to


run in a wide variety of environments since only the JVM needs to be

implemented for each platform. Although the details of the JVM will differ

from platform to platform, all interpret the same Java bytecode.

Java provides a number of advantages to developers, which are as

follows:

Simple

Java is designed to be easy for the programmer to learn and use

effectively. Java is built on and improves the ideas of C++. Since it

inherits most of the syntax and object-oriented features of C++,

people who already have some programming experience require very

little effort to learn and use Java.

Object-Oriented

Java is object oriented because programming in Java is mainly

focused on creating objects, manipulating objects and making objects

work together. This allows the developer to create modular programs

and reusable code.

Portable

A Java program can be compiled on one platform and run on

another, even if the Central Processing Units (CPU) and Operating

Systems (OS) of the two platforms are completely different. The ability

to execute the same program on different systems is important on the

World Wide Web, and Java succeeds in this by being platform

independent.

Secure

Java considers security as part of its design. Using a Java-

compatible web browser, the user can safely download Java applets

without fear of viral infection. Java achieves this protection by limiting

a Java program to the Java execution environment and not allowing it

access to other parts of the computer.


Robust

Since Java is a strictly typed language, it checks the code at

compile time and also at run time. Java emphasizes early

checking for possible errors, as Java compilers are able to detect

many problems that would show up only at execution time in other

programming languages.

Multithreaded

Java supports multithreaded programming, which allows the

programmer to write programs that do many things simultaneously.

In other programming languages, by contrast, operating-system-specific

functions have to be called in order to enable multithreading.

3.5.2 NetBeans IDE

NetBeans IDE is an open source integrated development environment

for developing Java applications. It is written in Java and runs on all major

platforms. NetBeans IDE supports development of all Java application types

including Java SE, Java ME, EJB and mobile applications. The functions of

the NetBeans IDE needed for Java development are provided by modules.

Each module provides a function such as support for the Java language, editing

etc. These modules also allow NetBeans to be extended: new features

can be added by installing additional modules. Modules such as the NetBeans

Profiler, the GUI design tool and the NetBeans JavaScript editor are part of the

NetBeans IDE [53].

The NetBeans IDE has many features and tools for each of the Java

platforms. Some of these features are listed below:

Syntax highlighting for Java, JavaScript, XML, HTML, CSS, JSP etc.

Customizable fonts, colours, and keyboard shortcuts.

Live parsing and error marking.

Pop-up Javadoc for quick access to documentation.

Advanced code completion.

Automatic indentation, which is customizable.


Word matching with the same initial prefixes.

Navigation of current class and commonly used features.

Macros and abbreviations.

Matching brace highlighting.

JumpList allows you to return the cursor to the previous modification.

Zoom view ability.

Database schema browsing to see the tables, views, and stored

procedures defined in a database.

Database schema editing using wizards.

Data view to see data stored in tables.

Works with databases, such as MySQL, Oracle, IBM DB2, Microsoft

SQL Server, Sybase, Informix, Cloudscape, Derby etc.

3.5.3 Implementation Details

The text document is loaded from the 20Newsgroups data corpus, which is

stored inside a directory. In the preprocessing stage, the document is first

tokenized using Java's StringTokenizer class. String functions are used

to compare the resulting tokens first with the array of noisy characters and

then with the array of stop words, removing any matches. As the last stage of

preprocessing, Porter's stemming algorithm is applied to the key words

obtained from the previous process to reduce them to their root form.
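The tokenization and stop word removal stages can be sketched as follows. The delimiter string and the stop word set are tiny illustrative stand-ins for the arrays used in the implementation, and the Porter stemming step and WordNet lookup are omitted here.

```java
import java.util.*;

// Illustrative sketch of the preprocessing pipeline of Section 3.5.3.
public class Preprocessor {

    // Tiny stand-in for the stop word array used in the thesis.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "and"));

    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenization: split on whitespace and common punctuation.
        StringTokenizer st = new StringTokenizer(text, " \t\n\r.,;:!?");
        while (st.hasMoreTokens()) {
            // Noise removal: strip any remaining non-letter characters.
            String token = st.nextToken().toLowerCase().replaceAll("[^a-z]", "");
            // Stop word removal: drop articles and other meaningless words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // → [engine, car, powerful]
        System.out.println(preprocess("The engine of a car is powerful!"));
    }
}
```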

Thereafter, the document is searched for the synonyms of each key

word generated as a result of the preprocessing process. If any of the

synonyms appears in the document, its frequency is added to the key

term frequency to obtain the sum of the key term and its synonym frequencies.

This module makes use of WordNet to extract the synonyms of the key

words. The MIT Java WordNet Interface (JWI) is used to interface with the

WordNet dictionary. First, a URL is constructed that points to the WordNet

dictionary directory. Then, using an instance of JWI's default Dictionary object,

the dictionary is searched for the senses or synonyms of the input

tokens.


Categorical and document profiles or vectors are generated in the

next module. In this process, the sums of key term and synonym

frequencies obtained in the previous module are passed to the function that

calculates the weights for these key terms. Then, using these weighted key

terms, the vectors are generated for the categories and the test document.

The last module of the system calculates the similarity between the

categorical and test document profiles. The module makes use of the cosine

similarity function. The similarity between the test document profile and each of

the category profiles is calculated by providing the category vector and the

document vector as input to the similarity function.

The GUI design tools of the NetBeans IDE are used to present

the details and results at the different stages.


4. EXPERIMENTS AND EVALUATION

In this research, the 20Newsgroups collection is used to evaluate the

proposed approach. All documents for training and testing undergo a pre-

processing step, which includes the tasks of tokenization, noise and stop word

removal, and stemming. Then common measures such as Precision, Recall

and F-measure are applied for performance evaluation.

4.1 EVALUATION METRICS

The performance of a text categorization system is evaluated using

performance measures from information retrieval. Common metrics for text

categorization evaluation include precision, recall and F1.

4.1.1 Precision and Recall

Precision is a measure of the ability of a system to present only

relevant items to the user.

Recall is a measure of the ability of a system to present all relevant

items to the user.

For classification tasks, the terms true positive, true negative, false

positive, and false negative compare the results of the classifier under test

with the previously known results. The terms positive and negative refer to

the classifier's prediction, and the terms true and false refer to whether that

prediction corresponds to the previously known result.


The terms true positive (TP), true negative (TN), false positive (FP)

and false negative (FN) are defined as follows.

TPi – the number of documents assigned correctly to class i.

TNi – the number of documents assigned correctly to the classes other than i.

FPi – the number of documents that do not belong to class i but are assigned to

class i incorrectly by the classifier.

FNi – the number of documents that are not assigned to class i by the classifier

but which actually belong to class i.

The definitions given above can be illustrated by table 4.1 below.

Table 4.1: Possible Predictions of a Classifier.

  Category i                     Actual category
                            TRUE            FALSE

  Classifier     TRUE       TPi             FPi
  Prediction     FALSE      FNi             TNi

In a classification task, the precision for a class is the number of true

positives (i.e. the number of documents correctly labeled as belonging to the

positive class) divided by the total number of documents labeled as

belonging to the positive class (i.e. the sum of true positives and false

positives, the latter being documents incorrectly labeled as belonging to the

class):

    Precisioni = TPi / (TPi + FPi)

where,

TPi is the number of documents assigned correctly to class i.

FPi is the number of documents that do not belong to class i but are

assigned to class i incorrectly by the classifier.


Recall is defined as the number of true positives divided by the total

number of documents that actually belong to the positive class (i.e. the sum

of true positives and false negatives, the latter being documents that were not

labeled as belonging to the positive class but actually belong to it):

    Recalli = TPi / (TPi + FNi)

where,

TPi is the number of documents assigned correctly to class i.

FNi is the number of documents that are not assigned to class i by the

classifier but which actually belong to class i.
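These two definitions translate directly from the TPi, FPi and FNi counts; the following sketch uses illustrative class and method names, with the empty-denominator case handled defensively as an assumption.

```java
// Illustrative sketch of per-class precision and recall (Section 4.1.1).
public class Metrics {

    // Precision_i = TP_i / (TP_i + FP_i); defined as 0 when no documents
    // were assigned to the class (an assumption, not from the thesis).
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Recall_i = TP_i / (TP_i + FN_i); defined as 0 when the class is empty.
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    public static void main(String[] args) {
        // 30 documents correctly assigned to class i, 10 wrongly assigned
        // to it, and 5 belonging to it but assigned elsewhere.
        System.out.println(precision(30, 10)); // 30 / 40 = 0.75
        System.out.println(recall(30, 5));     // 30 / 35
    }
}
```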

4.1.2 F-measure

The F-measure or F1 score is used to evaluate the performance of the

classification system based on the precision and recall values. It is the harmonic

mean of precision and recall. In some cases there is a need to trade off

precision for recall, or vice versa; hence the F-score is used, since it takes

both precision and recall into account. The F-measure value lies in the interval

[0, 1], and the larger the F-measure value, the higher the classification quality.

The F-measure is defined as,

    F1 = (2 × Precision × Recall) / (Precision + Recall)

The overall F-measure score of the entire classification system can be

computed by two different types of average: micro-average and macro-

average. Both the macro averaged and the micro averaged F-measure are used to

evaluate the performance of the proposed classification system.


4.1.2.1 Macro Averaged F-Measure

The macro averaged F-measure is the average of F1 scores of all

the categories. In macro averaging, F-measure is computed locally over each

category first. Then the average over F1 scores of all categories is taken.

Macro averaged F-measure gives equal weight to each category, without

taking into account its frequency. It is generally dominated by the classifier’s

performance on rare categories.

Given a training dataset with n categories, and the F1 value for the ith

category denoted F1i, the macro averaged F1 is defined as,

    F1macro = (1/n) × Σi F1i                                     (4.6)

The equation 4.6 can be derived as,

    F1macro = (1/n) × Σi ( (2 × Pi × Ri) / (Pi + Ri) )

where,

n is the number of categories.

F1i is the F-measure value of the ith category.

Pi is the precision value of the ith category.

Ri is the recall value of the ith category.

4.1.2.2 Micro Averaged F-measure

In micro averaging, first the micro averaged precision and recall are

computed globally by adding the individual true positive, false positive and

false negative decisions of the system. Then the micro averaged F-measure

is calculated by taking the harmonic mean of the micro averaged precision

and micro averaged recall. Micro averaged F-measure assigns equal weight


to every document. It is generally dominated by the classifier’s performance

over common categories.

Given a training dataset with n categories, the micro averaged F1 is

defined by,

    F1micro = (2 × Pmicro × Rmicro) / (Pmicro + Rmicro)          (4.8)

In equation 4.8, Pmicro is the micro averaged precision for the system, which

is given by,

    Pmicro = ( Σi TPi ) / ( Σi (TPi + FPi) )

and Rmicro is the micro averaged recall value for the system, given by,

    Rmicro = ( Σi TPi ) / ( Σi (TPi + FNi) )

where,

n is the number of categories, and the sums run over i = 1, ..., n.

TPi is the number of documents assigned correctly to class i.

FPi is the number of documents that do not belong to class i but are

assigned to class i incorrectly by the classifier.

FNi is the number of documents that are not assigned to class i by the

classifier but which actually belong to class i.
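Both averaging schemes can be sketched from the per-class TP, FP and FN counts as follows; the class name and the array-based interface are illustrative assumptions.

```java
// Illustrative sketch of macro and micro averaged F1 (Section 4.1.2).
// Arrays tp, fp, fn hold the counts for each of the n categories.
public class FMeasure {

    static double f1(double p, double r) {
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Macro: compute F1 locally per category, then average the scores.
    static double macroF1(int[] tp, int[] fp, int[] fn) {
        double sum = 0.0;
        for (int i = 0; i < tp.length; i++) {
            double p = tp[i] + fp[i] == 0 ? 0.0 : (double) tp[i] / (tp[i] + fp[i]);
            double r = tp[i] + fn[i] == 0 ? 0.0 : (double) tp[i] / (tp[i] + fn[i]);
            sum += f1(p, r);
        }
        return sum / tp.length;
    }

    // Micro: pool the counts globally, then take a single F1.
    static double microF1(int[] tp, int[] fp, int[] fn) {
        int tpSum = 0, fpSum = 0, fnSum = 0;
        for (int i = 0; i < tp.length; i++) {
            tpSum += tp[i]; fpSum += fp[i]; fnSum += fn[i];
        }
        double p = (double) tpSum / (tpSum + fpSum);
        double r = (double) tpSum / (tpSum + fnSum);
        return f1(p, r);
    }

    public static void main(String[] args) {
        // Two categories: macro weights them equally, micro weights the
        // pooled documents, so the two scores generally differ.
        int[] tp = {30, 10}, fp = {10, 0}, fn = {5, 20};
        System.out.println(macroF1(tp, fp, fn));
        System.out.println(microF1(tp, fp, fn));
    }
}
```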


4.2 EXPERIMENTAL DATASET

The 20 Newsgroups data set is a collection of approximately 20,000

newsgroup documents taken from the Usenet newsgroups collection and

partitioned (nearly) evenly across 20 different newsgroups [54]. The 20

newsgroups collection has become a popular data set for experiments in text

applications of machine learning techniques, such as text classification and

text clustering. The data is organized into 20 different newsgroups, each

corresponding to a different topic. Some of the newsgroups are very closely

related to each other (e.g. comp.sys.ibm.pc.hardware /

comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale /

soc.religion.christian). Each category contains 1,000 articles and 4% of the articles are

cross-posted. The categories in 20Newsgroups are as follows:

alt.atheism

comp.graphics

comp.os.ms-windows.misc

comp.sys.ibm.pc.hardware

comp.sys.mac.hardware

comp.windows.x

misc.forsale

rec.autos

rec.motorcycles

rec.sport.baseball

rec.sport.hockey

sci.crypt

sci.electronics

sci.med

sci.space

soc.religion.christian

talk.politics.guns

talk.politics.mideast

talk.politics.misc

talk.religion.misc


4.3 PERFORMANCE ANALYSIS

To evaluate the performance of the proposed classification approach,

experiments are conducted over 20Newsgroups corpus. The 20Newsgroups

corpus is a collection of approximately 20,000 newsgroup documents nearly

uniformly distributed among 20 groups. In this corpus some newsgroups are

very closely related to each other and some are highly unrelated. The

20Newsgroups corpus has become a popular dataset for experiments in text

classification systems. Compared with the asymmetrical category distribution in

other data corpora, the 20 categories in the 20Newsgroups corpus are nearly

uniformly distributed. In such a nearly uniform category distribution, the macro

averaged F-measure performance is almost similar to that of the micro

averaged F-measure [27], [30], [33], [37].

The table 4.2 shows the details of the 20Newsgroups categories used for

evaluation.

Table 4.2: Details of the 20Newsgroups Categories used for Evaluation.

  Category             Number of Training   Number of Test   Total Number of
                       Documents            Documents        Documents

  comp.graphics        189                  37               226
  misc.forsale         204                  42               246
  rec.autos            198                  40               238
  sci.space            193                  39               232
  talk.religion.misc   216                  42               258
  Total                1000                 200              1200


Table 4.3 summarizes the macro averaged F-measure results of the

proposed approach compared with term frequency alone strategy and

synonym frequency alone strategy for 20Newsgroups categories. The results

obtained in the experiment suggest that the integration of the term frequency

with its synonym frequency improved the text classification performance

significantly compared with classification using either synonym frequency

strategy or term frequency strategy.

Labels in table 4.3 are defined as follows:

Category stands for the name of the 20Newsgroups categories used

for evaluation.

Proposed Approach (tf+sf) is the classification strategy of calculating

sum of term and its synonym frequencies.

Synonym Frequency (sf) is the classification approach using only the

frequency of synonyms.

Term Frequency (tf) is the strategy used for the classification using

only term frequency.

F1 score is the F-measure value for individual categories under each

strategy.

Macro averaged F-measure is the category pivoted performance

measure of the system.


Table 4.3: Macro Averaged F-measure Results for 20Newsgroups Categories.

                                      F1 score
  Category             Proposed Approach   Synonym Frequency   Term Frequency
                       (tf + sf)           (sf)                (tf)

  comp.graphics        0.718               0.667               0.641
  misc.forsale         0.720               0.665               0.648
  rec.autos            0.719               0.668               0.650
  sci.space            0.719               0.666               0.646
  talk.religion.misc   0.721               0.668               0.644

  Macro Averaged
  F-measure            0.719               0.667               0.646

To calculate the macro averaged F1-score, the F1-score is computed

for each of the categories first and then the average of all the F1 scores is

taken. The values in table 4.3 show that the best overall macro

averaged F-measure value (0.719) is achieved by the proposed classification

approach, i.e. the sum of term and synonym frequencies, and is higher

than that of classification using the synonym frequency strategy alone (0.667)

or the term frequency strategy alone (0.646).


The figure 4.1 represents the performance of the proposed system in

comparison with the synonym frequency and traditional term frequency

approaches for 20Newsgroups dataset.

Figure 4.1: Macro Averaged F1-score for 20Newsgroups Data Corpus.

In figure 4.1 the x-axis represents the different classification strategies,

including the proposed approach, and the y-axis represents the macro

averaged F1-score (in percentage). The results in the figure show that the

macro averaged F1 value for the proposed system (tf + sf) reached 71.9%,

an improvement of 5.2 percentage points over the synonym frequency (sf)

strategy and 7.3 percentage points over the term frequency (tf) strategy.


Table 4.4 summarizes the micro averaged F-measure results of the

proposed approach compared with term frequency strategy and synonym

frequency strategy for 20Newsgroups categories. From the results obtained in

the experiment, it can be concluded that adding the term frequency to

its synonym frequency improves the text classification performance

compared with classification using either the synonym frequency approach or

the term frequency approach alone.

Labels in table 4.4 are defined as follows:

Category refers to the name of the 20Newsgroups categories used for

evaluation.

Proposed Approach (tf+sf) is the classification strategy of calculating

sum of term and its synonym frequencies.

Synonym Frequency (sf) is the strategy used for the classification

using only the synonym frequency.

Term Frequency (tf) is the classification approach using only the

frequency of terms in the document.

Precision and Recall are the evaluation measures used to compare

the system’s result with the previously known results.

Micro averaged F-measure is the document pivoted performance

measure of the classification system.


Table 4.4: Micro Averaged F-measure Results for 20Newsgroups Categories.

                       Proposed Approach     Synonym Frequency     Term Frequency
                       (tf + sf)             (sf)                  (tf)
  Category             Precision   Recall    Precision   Recall    Precision   Recall

  comp.graphics        0.720       0.726     0.670       0.674     0.651       0.656
  misc.forsale
  rec.autos
  sci.space
  talk.religion.misc

  Micro averaged
  F-measure            0.723                 0.672                 0.653

The micro averaged F1-score is calculated by computing the F1-score

globally, regardless of categories: the individual decisions of the

system are added together and the measure is then applied to the pooled

counts. The values in table 4.4 show that the proposed

approach of classification using the sum of term and synonym frequencies

achieves the best micro averaged performance value of 0.723, compared to

0.672 for the synonym frequency approach and 0.653 for the term frequency

approach.


The figure 4.2 represents the performance of the proposed system for

20Newsgroups data set plotted in comparison with the synonym frequency

and term frequency classification approaches.

Figure 4.2: Micro Averaged F1-score for 20Newsgroups Data Corpus.

In figure 4.2 the x-axis represents the different classification strategies,

including the proposed approach, and the y-axis represents the micro

averaged F1-score (in percentage). The results in the figure show that the

proposed system's (tf + sf) micro averaged performance score reaches 72.3%,

an increase of 5.1 percentage points over the synonym frequency (sf)

approach and 7.0 percentage points over the traditional term frequency (tf)

approach.


The table 4.5 displays the macro averaged F-measure results for the

proposed classification approach (the sum of term and synonym

frequencies) compared to classification using only the synonym frequency or

only the term frequency, for a number of keywords (the size of the categories

profile) ranging from 50 to 300.

Labels in table 4.5 are defined as follows:

- Size of Categories Profile: the size of the test data set, i.e. the number of keywords in the categorical profile.
- Proposed Approach (tf + sf): the classification strategy of calculating the sum of term and synonym frequencies.
- Synonym Frequency (sf): classification using only the synonym frequency.
- Term Frequency (tf): classification using only the frequency of terms in the document.
- Macro averaged F1-score: the category-pivoted performance measure of the system.

Page 66: AN ENHANCED TEXT DOCUMENT - B.S. Abdur Rahman Crescent ... AN ENHANCED TEXT DOCUMENT CLASSIFICATION BASED ON TERMS AND SYNONYMS RELATION A THESIS REPORT Submitted by PRANEETHA K. Under

53

Table 4.5: Macro Averaged F-measure Results based on Size of Categories Profile.

Size of Categories Profile   Proposed Approach (tf + sf)   Synonym Frequency (sf)   Term Frequency (tf)
50                           0.714                         0.664                    0.637
100                          0.717                         0.663                    0.646
150                          0.716                         0.663                    0.649
200                          0.717                         0.666                    0.643
250                          0.719                         0.667                    0.646
300                          0.719                         0.667                    0.646

Figure 4.3 shows the macro averaged performance of the three classification approaches (the proposed sum of term and synonym frequencies, synonym frequency only, and term frequency only) plotted against the size of the categories profile, i.e. the number of key terms, ranging from 50 to 300. The macro averaged F1-score tends to increase as the number of keywords grows, and remains constant once the number of key terms exceeds 250.


Figure 4.3: Macro Averaged F-measure Results for Varying Size of Categories Profile.

Figure 4.3 plots the macro averaged F-measure of the three classification approaches (term frequency only (tf), synonym frequency only (sf), and the sum of term and synonym frequencies (tf + sf)) on the 20Newsgroups data corpus against the categorical profile size, ranging from 50 to 300. The x-axis represents the number of keywords; the y-axis represents the macro averaged F-measure of the three classification approaches. The figure shows that, among the three approaches, the best F1-scores are achieved by the proposed approach of classifying using the sum of term and synonym frequencies.

[Line chart: size of categories profile (50 to 300) on the x-axis; macro averaged F1-score (0.63 to 0.73) on the y-axis; one series each for tf + sf, sf and tf.]


Table 4.6 presents the micro averaged F-measure results for the proposed approach of calculating the sum of term and synonym frequencies, compared with classification using only the frequency of synonyms or only the frequency of terms, for categories profile sizes (numbers of keywords) ranging from 50 to 300.

Labels in table 4.6 are defined as follows:

- Size of Categories Profile: the size of the test data set, i.e. the number of keywords in the categorical profile.
- Proposed Approach (tf + sf): the classification strategy of calculating the sum of term and synonym frequencies.
- Synonym Frequency (sf): classification using only the frequency of synonyms.
- Term Frequency (tf): classification using only term frequency.
- Micro averaged F-measure: the document-pivoted performance measure of the classification system.


Table 4.6: Micro Averaged F-measure Results based on Size of Categories Profile.

Size of Categories Profile   Proposed Approach (tf + sf)   Synonym Frequency (sf)   Term Frequency (tf)
50                           0.717                         0.667                    0.648
100                          0.720                         0.669                    0.649
150                          0.720                         0.668                    0.652
200                          0.721                         0.671                    0.651
250                          0.723                         0.672                    0.653
300                          0.723                         0.672                    0.653

Figure 4.4 displays the micro averaged performance of the three classification approaches (the proposed sum of term and synonym frequencies, synonym frequency alone, and term frequency alone) plotted against the size of the categories profile, i.e. the number of key terms, ranging from 50 to 300. The micro averaged F1-score tends to increase as the number of keywords grows, and remains constant once the number of key terms exceeds 250.


Figure 4.4: Micro Averaged F-measure Results for Varying Size of Categories Profile.

Figure 4.4 shows the micro averaged F-measure results of classification on the 20Newsgroups corpus using term frequency, using synonym frequency, and using the proposed approach, the sum of term and synonym frequencies, for key term set sizes ranging from 50 to 300. The size of the categorical profile is given on the x-axis and the micro averaged F1-scores of the classification approaches on the y-axis. The experimental results show that the proposed strategy of calculating the sum of term and synonym frequencies achieves the best F1 value of the three classification approaches.

[Line chart: size of categories profile (50 to 300) on the x-axis; micro averaged F1-score (0.64 to 0.74) on the y-axis; one series each for tf + sf, sf and tf.]


4.4 INFERENCE FROM THE RESULTS

The experimental results show that the proposed approach of classifying text documents using the sum of term and synonym frequencies is effective in improving the performance of the classification system. The results obtained on the 20Newsgroups data corpus show that the proposed approach achieves a higher F-measure than the other two approaches. The macro averaged F1 value of the proposed approach reaches 71.9%, a rise of 5.2 percentage points over the strategy of using only synonym frequency (sf) and 7.3 percentage points over the term frequency (tf) alone approach. The proposed system also achieves a micro averaged F-measure of 72.3%, an increase of 5.1 percentage points over the synonym frequency approach and 7 percentage points over classification using only term frequency.

The proposed approach also outperforms the other two when the experiments are carried out with varying sizes of the categories profile. With different categorical profile sizes, all three approaches reach their peak performance once the number of key terms exceeds 200, and among them the proposed approach achieves the best macro averaged and micro averaged F1 scores, showing the most consistent performance.

In general, macro averaging gives equal weight to each class, while micro averaging gives equal weight to each per-document classification decision. Because the F1 measure ignores true negatives and its magnitude is mostly determined by the number of true positives, large categories dominate small categories in micro averaging, whereas the macro averaged F-measure is dominated by the system's performance on small categories. Since the categories in the 20Newsgroups data corpus are nearly uniformly distributed, in contrast to the skewed category distributions of other corpora, the macro averaged F-measure (0.719) is very close to the micro averaged F-measure (0.723).
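The difference between the two averages can be made concrete. The sketch below uses illustrative per-category confusion counts (not taken from the thesis experiments) and computes both averages from true positives, false positives and false negatives; note how the large category pulls the micro average toward its own F1:

```python
# Macro vs. micro averaged F1 over per-category confusion counts.
# The counts below are illustrative only, not from the experiments.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# (tp, fp, fn) for one large and two small categories
counts = [(900, 100, 100),  # large category, F1 = 0.9
          (10, 10, 10),     # small category, F1 = 0.5
          (10, 10, 10)]     # small category, F1 = 0.5

# Macro: average the per-category F1 scores (equal weight per class).
macro_f1 = sum(f1(*c) for c in counts) / len(counts)

# Micro: pool all counts first (equal weight per decision).
tp, fp, fn = (sum(col) for col in zip(*counts))
micro_f1 = f1(tp, fp, fn)

print(round(macro_f1, 3), round(micro_f1, 3))  # 0.633 0.885
```

The micro average (0.885) sits close to the large category's F1 (0.9), while the macro average (0.633) is dragged down by the two small categories, which is exactly why the two values nearly coincide on a balanced corpus such as 20Newsgroups.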


5. CONCLUSION AND FUTURE ENHANCEMENT

This research proposes an approach to classifying text documents based on the integration of key term and synonym frequencies, in order to address the synonymy problem in text classification. To classify a document, the system extracts the synonyms of all the key terms in the document using WordNet, which arranges words into groups of synonyms, and combines them with the key terms to form a new document representation vector. The system thus counts the occurrences of both the key terms and their corresponding synonyms for classification, reducing the synonymy problem. The experimental results on the 20Newsgroups dataset show that incorporating the frequency of synonyms together with the frequency of key terms improves the performance of the classification system compared with using only synonym frequency or only term frequency.
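The counting scheme described above can be sketched as follows. The synonym map and key terms here are illustrative stand-ins for the WordNet lookups and category profiles used in the thesis, not the actual implementation:

```python
from collections import Counter

# Illustrative stand-in for WordNet: key term -> its synonyms.
SYNONYMS = {
    "car":   ["auto", "automobile", "machine"],
    "space": ["outer_space"],
}

def tf_sf_vector(tokens, key_terms):
    """Count each key term together with its synonyms (tf + sf)."""
    counts = Counter(tokens)
    vector = {}
    for term in key_terms:
        tf = counts[term]                                  # term frequency
        sf = sum(counts[s] for s in SYNONYMS.get(term, []))  # synonym frequency
        vector[term] = tf + sf
    return vector

tokens = ["car", "auto", "automobile", "engine", "space"]
print(tf_sf_vector(tokens, ["car", "space"]))
# {'car': 3, 'space': 1}  (car: tf 1 + sf 2; space: tf 1 + sf 0)
```

Under plain term frequency, "auto" and "automobile" would contribute nothing to the "car" feature; folding their counts in is what reduces the synonymy problem.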

The proposed system uses only the first-sense synonyms of each key term, since WordNet returns an ordered list of synsets for a term in which more commonly used senses are listed before less common ones. Moreover, a word usually has multiple synonyms with somewhat different meanings, and it is difficult to choose the correct one. Future work includes giving the user an option to specify the number of senses (synsets) to be used for classification, and applying a word sense disambiguation strategy to identify the proper synonyms for each key term.
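First-sense selection can be sketched with a hypothetical ordered synset table; each entry lists synsets in frequency order, mirroring WordNet's ordering, and only the lemmas of the first synset are kept. (With NLTK's WordNet interface, the corresponding lookup would be `wordnet.synsets(term)[0].lemma_names()`; a small dictionary is used here so the sketch stays self-contained.)

```python
# Hypothetical frequency-ordered synset lists (most common sense first),
# standing in for WordNet's ordered synsets.
SYNSETS = {
    "bank": [["bank", "depository_financial_institution"],  # sense 1
             ["bank", "riverbank", "riverside"]],           # sense 2
}

def first_sense_synonyms(term):
    """Return synonyms from the first (most common) sense only."""
    synsets = SYNSETS.get(term, [])
    if not synsets:
        return []
    return [lemma for lemma in synsets[0] if lemma != term]

print(first_sense_synonyms("bank"))
# ['depository_financial_institution'] -- the river sense is ignored
```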



APPENDIX 1: SAMPLE SCREENSHOTS

Loading the text document from 20Newsgroups data corpus

Displaying the content of the selected file


Tokenizing the text

Process of noise removal


Removing the stop words

Performing word stemming


Displaying sum of key term and its synonym frequencies

Displaying the weights of the key terms


Sample output of the system


TECHNICAL BIOGRAPHY

Mrs. Praneetha K. (RRN: 1145213) was born on 31st May 1982 in Karkala, Karnataka. She did her schooling at Christ King School. She received her B.Sc. degree in Computer Science from Sri Bhuvanenedra College, Mangalore University, in 2003, and her M.C.A. degree from N. M. A. M. Institute of Technology, Visveswariah Technological University, in 2006. She is currently pursuing her M.Phil. degree in Computer Science in the Department of Computer Applications of B.S. Abdur Rahman University, Chennai. She has participated in an international workshop on "Advances in Data Mining and Web Mining". Her areas of interest include Information Retrieval, Web Mining and Natural Language Processing. Her e-mail ID is [email protected] and her contact number is +91 8939955709.

Publications:

Praneetha K., “Classification of Text Documents using WordNet”,

International Conference on Recent Trends in Computer Science and

Engineering (ICRTCSE), May 2012.

Praneetha K., “Text Representation using WordNet for the Reduction

of Synonymy”, International Conference on Computational Intelligence

and Communication (ICCIC), pp. 144-148, July 2012.

Praneetha K. and Dr. Angelina Geetha, “An Enhanced Text Document

Classification based on Terms and Synonyms Relations”, IFRSA

International Journal of Data Warehousing and Mining (IIJDWM), Vol.

2, No. 3, pp. 175-181, August 2012.