CHAPTER 1
INTRODUCTION
1.1. CLUSTERING
Cluster analysis is a vibrant field of research in data mining. Its
fundamental idea is to use the salient characteristics of objects to measure
the degree of similarity among them and thereby achieve automatic
classification. In recent years, clustering of large data sets has become an
active and growing area of research. Document clustering (Jain and Dubes
1988) is important for efficient document organization, summarization, topic
extraction and information retrieval. Initially, clustering was applied to
enhance information retrieval techniques. Of late, clustering techniques have
been applied in areas such as browsing collections of gathered data or
categorizing the results returned by search engines in reply to user queries.
Document clustering is also applicable in producing hierarchical
groupings of documents (Ward 1963). In order to search and retrieve
information efficiently in Document Management Systems (DMS), a
metadata set with adequate details should be created for the documents. But a
single metadata set is not sufficient for an entire document management
system, because different document types need different attributes to be
distinguished appropriately.
Clustering of documents is thus the automatic grouping of text
documents into clusters such that documents within a cluster have high
resemblance to one another but differ from documents in other clusters.
Hierarchical document clustering (Murtagh 1984) organizes clusters into a
tree or hierarchy that facilitates browsing.
Information Retrieval (IR) (Baeza 1992) is the field of computer
science that focuses on the processing of documents in such a way that the
document can be quickly retrieved based on keywords specified in a user’s
query. IR technology is the foundation of web-based search engines and plays
a key role in biomedical research, as it is the basis of software that aids
literature search. Documents can be indexed by both words and concepts that
can be matched with domain-specific thesauri, known as concept matching.
For several decades, people have realized the importance of archiving and
finding information. With the arrival of computers, it has become necessary to
store a large amount of information and to allow users to gather and search
useful information from such collections. Several IR systems are used on an
everyday basis by a wide variety of users for various applications.
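To make the keyword-based retrieval described above concrete, the following minimal Python sketch builds an inverted index and answers a conjunctive keyword query. All function names and the sample documents are hypothetical illustrations, not part of any cited system.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "gene expression clustering",
    2: "document clustering for retrieval",
    3: "gene retrieval systems",
}
index = build_index(docs)
print(search(index, "gene retrieval"))  # → {3}
```

Real IR systems add tokenization, stemming and ranking on top of this basic structure, but the index-then-intersect pattern is the same.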
In recent years, digital libraries and web-accessible text
databases containing distributed high-dimensional data have created a need
for infrastructure that supports effective search in a Peer-to-Peer (P2P)
setting. P2P systems have become a promising solution for data management
in cases of a high degree of distribution. Unlike centralized systems or
conventional client-server architectures, nodes participating in a large P2P
network store and share data in an autonomous manner. Such nodes can be
information providers that neither reveal their full data to a server nor form
part of a centralized index.
Hence, the main challenge is to offer an effective and scalable
searching approach over highly distributed content without moving the actual
content away from the information providers. A further drawback is the lack
of global knowledge: each peer is aware of only a small portion of the
network topology, so content has to be handled carefully. The solutions
already proposed in the literature for such unstructured P2P environments
usually incur high costs, thus preventing the design of effective search over
P2P content.
Document clustering is becoming more and more important with the
abundance of text documents available through the World Wide Web and
corporate document management systems. However, there are still some major
drawbacks in the existing text clustering techniques that greatly affect their
practical applicability. These drawbacks are listed below:
Text clustering that yields a clear-cut output would be the
most favorable. However, people with different needs regard
documents differently with respect to the clustering of texts;
for example, a businessman does not look at business
documents in the same way as a technologist does (Macskassy
et al. 1998). Clustering tasks therefore depend on intrinsic
parameters that allow for a diversity of views.
Text clustering is a clustering task in a high-dimensional
space, where each word is seen as an important attribute of a
text. Empirical and mathematical analyses have revealed that
clustering in high-dimensional spaces is very complex, as
every data point tends to have nearly the same distance from
all the other data points (Beyer et al. 1999).
Text clustering is often of little use unless it is integrated with
a reason why particular texts are grouped into a particular
cluster. In practical settings, the explanation of why a cluster
was formed is often preferred over the cluster result itself.
One common technique for producing explanations is to learn
rules from the cluster results, but this technique suffers from
the high number of features chosen for computing clusters.
Even though several clustering approaches exist, most of
these feature-vector methods share the same principal
problems without really addressing the issues of subjectivity
and explanation. The researcher therefore intends to
investigate various aggregation levels; from that perspective,
the text documents will be used to obtain clustering results.
1.1.1 Feature selection
A common type of text processing in many information retrieval
systems depends on the study of word occurrences across a document
collection. The number of terms utilized by the system determines the
dimension of the vector space in which the analysis is carried out. Reducing
this dimension leads to major savings in computer resources and processing
time, but poor feature selection may radically degrade the performance of the
information retrieval system. Feature selection is a fundamental step in the
creation of a vector space or bag-of-words model (Berry and Browne 1999).
Especially when dividing a given document collection into clusters of similar
documents, the choice of significant features, together with good clustering
algorithms, is of supreme importance.
Calculating the similarity between documents is a vital step in
almost all document clustering algorithms. Usually, the computation cost is
proportional to the number of dimensions in the feature space, so
computational efficiency can be enhanced by reducing the feature space.
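As an illustration of this cost, a standard similarity measure over term-frequency vectors is the cosine similarity. The sketch below is a simplified illustration, not the method proposed in this thesis; note that its running time grows with the number of dimensions (distinct terms) the two documents use.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(round(cosine_similarity("fuzzy document clustering",
                              "document clustering methods"), 3))  # → 0.667
```

Shrinking the feature space shrinks the vectors, which is exactly why feature selection speeds up similarity computation.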
Recently, feature selection for document categorization has become an
attractive area of research. For document categorization, every word can be
weighted after training, so the less salient feature words can be filtered out.
Things are very different for document clustering, which is in essence
unsupervised: there are no training documents, so it is not easy to compute
the weight of every word. Although the feature words cannot be found
directly using class labels, the word distribution offers an alternative means
of finding them. Enlightened by this idea, a novel feature selection algorithm
is introduced in the present research study.
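One simple unsupervised heuristic in this spirit scores words purely by their document frequency: terms occurring in almost every document, or in almost none, discriminate poorly between clusters. The thresholds and sample data below are hypothetical illustrations; this is not the algorithm proposed in this thesis.

```python
from collections import Counter

def select_features(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms appearing in at least min_df documents but in no more
    than max_df_ratio of the collection: very rare terms are noise,
    near-ubiquitous terms do not separate clusters."""
    n = len(docs)
    df = Counter()
    for text in docs:
        for term in set(text.lower().split()):
            df[term] += 1
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [
    "fuzzy clustering of web documents",
    "fuzzy rules for web mining",
    "clustering web documents with ontologies",
    "genetic search over web data",
]
print(sorted(select_features(docs)))  # → ['clustering', 'documents', 'fuzzy']
```

Here "web" is dropped for appearing in every document and the singleton terms for appearing in only one, leaving the words that actually distinguish subsets of the collection.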
1.1.2 Ontology
The Internet has made it easy to access a huge collection of
information across the world. This facility has encouraged an increasing
demand for understanding how to combine multiple, heterogeneous
information sources. The research mainly focuses on the recognition and
integration of appropriate information to provide better knowledge of a
specific domain. The combination is predominantly helpful when it allows
communication among dissimilar sources without affecting their autonomy.
The difficulty of combining heterogeneous resources has been dealt with in
the literature, though so far only a few researchers have concentrated on
heterogeneity among resources other than databases. Dealing with different
information resources necessitates the formulation of suitable techniques.
Hence, ontologies have been developed to support the sharing and reuse of
information across all areas and tasks (Gomez Perez and Benjamins 1999).
The key task is to make the differences in semantics explicit.
An ontology is defined as a logical theory accounting for the
intended meaning of a formal vocabulary, specifically its ontological
commitment to a specific conceptualisation of the world (Guarino 1998).
Ontologies offer a generally agreed understanding of a domain that can be
reused and shared across several applications. Among the techniques for
combining heterogeneous sources presented in the literature, the main
attention here is directed to the architecture of multiple shared ontologies
suggested by Visser and Tamma (1999). This architecture aggregates
multiple shared ontologies into clusters, so as to obtain a structure that can
reconcile different types of heterogeneity. It is also intended to be more
convenient to implement and gives better prospects for maintenance and
scaling. Furthermore, such a structure is suitable for avoiding information
loss when performing translations between diverse resources.
Fuzzy ontology: Fuzzy ontologies are very useful on the Web.
Ontologies serve as a fundamental semantic infrastructure, offering a shared
understanding of particular domains across different applications so as to
assist machine understanding of Web resources. Moreover, an ontology can
also handle fuzzy and imprecise information, which is very important on the
Web (Shridharan et al. 2004). An ontology language, such as the Web
Ontology Language (OWL), is used to encode the ontology. An ontology is
far more complex than a taxonomy, which mostly focuses on the
classification of concepts in a domain.
Semantic web ontologies play a crucial role in making internet
resources easily accessible to users across different electronic formats.
OWL-based ontologies allow for searches based on semantic links rather
than string matching, which has made them widely popular. Consequently,
ontologies have become one of the most useful knowledge management tools
for efficient integration.
The Semantic Web has been criticized for not addressing uncertainty. In
order to address semantic meaning in an uncertain and inconsistent world,
fuzzy ontologies (Amy Trappey et al. 2009) have been proposed. In fuzzy
logic, reasoning is approximate rather than precise. The main purposes are to
avoid the theoretical pitfalls of monolithic ontologies, to assist
interoperability between different and independent ontologies (Cross 2004),
and to offer flexible information retrieval capabilities (Widyantoro and Yen
2001, Thomas and Sheth 2006). More recently, fuzzy ontology has become
popular and is being used in many applications, including classification and
clustering (Hotho et al. 2003). Fuzzy ontologies have the following
advantages:
Fuzzy ontology provides a better classification of a large,
vague database.
It can be initially applied to the database to reduce the
convergence time and number of iterations.
1.1.3 Genetic algorithm
Genetic Algorithms (GAs) are randomized search and optimization
approaches (Goldberg 1989), inspired by the principles of evolution and
natural genetics. They have a large amount of inherent parallelism. GAs
carry out search in complex, large search spaces and offer near-optimal
solutions for the objective or fitness function of an optimization problem.
A GA is usually implemented as a computer simulation in
which an optimization problem is specified. Members of the space of
candidate solutions, called individuals, are denoted through abstract
representations called chromosomes. The GA comprises an iterative process
that evolves a working set of individuals, called a population, toward an
objective function, or fitness function.
Conventionally, solutions are denoted using fixed-length
strings, in particular binary strings, though alternative encodings have been
developed. The evolution proceeds as follows: the GA begins from a
population of individuals created arbitrarily according to some probability
distribution, typically uniform, and renews this population in steps called
generations. In every generation, several individuals are randomly chosen
from the current population based on their fitness, bred using crossover, and
modified through mutation to create a new population.
• Crossover refers to the exchange of genetic material
(substrings) between chromosomes, which in machine
learning settings may encode rules or structural components.
• Selection uses the fitness function to choose the individuals
from a population that reproduce new offspring.
• Replication propagates individuals from one generation to the
next.
• Mutation is the random modification of individual
chromosomes.
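A minimal sketch of these operators on a toy bit-string objective (OneMax, which simply counts 1-bits and stands in for any fitness function) could look like this in Python. All parameter values are illustrative assumptions.

```python
import random

random.seed(42)

def fitness(chrom):
    """OneMax: count of 1-bits; a stand-in for any objective function."""
    return sum(chrom)

def select(pop):
    """Tournament selection: the fitter of two random individuals."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    """Single-point crossover exchanging tail substrings."""
    point = random.randrange(1, len(p1))
    return p1[:point] + p2[point:]

def mutate(chrom, rate=0.01):
    """Flip each bit with a small probability."""
    return [1 - g if random.random() < rate else g for g in chrom]

# Initial population: 30 random 20-bit chromosomes.
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for _ in range(60):  # generations
    pop = [mutate(crossover(select(pop), select(pop))) for _ in pop]

best = max(pop, key=fitness)
print(fitness(best))
```

After a few dozen generations the population converges toward the all-ones string; replacing `fitness` with any other objective turns this skeleton into a different optimizer.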
The existing literature (Goldberg 2002) includes several
common observations about the creation of solutions by means of GAs:
• GAs have difficulty adapting to dynamic concepts or
objective conditions. This phenomenon, called concept drift
in supervised learning and data mining, is problematic
because GAs are conventionally designed to evolve highly fit
solutions (populations comprising building blocks of high
relative and absolute fitness) for stationary concepts.
• GAs are not always efficient at discovering globally optimal
solutions, but they can quickly locate high-quality solutions,
even in complicated search spaces. This makes steady-state
GAs (and Bayesian optimization GAs, which assemble an
exact representation of building blocks after convergence)
helpful alternatives to generational GAs.
1.1.4 Ant colony optimization
In an ant colony, ants cooperate in the endeavour to find food.
Moving in groups, the ants ensure that the solution they find, the path to the
food, helps the whole colony (Dorigo et al. 1996). Ants leave a substance
known as pheromone to mark the path they have already traversed. Sensing
the pheromone, and preferring paths where there is more of it, the following
ants deposit further pheromone for others to follow. This positive-feedback
process, known as autocatalysis, maximizes the chances of finding food.
1.2 LITERATURE SURVEY
Document clustering is the process of categorizing text documents
into systematic clusters or groups, such that documents in the same cluster
are similar whereas documents in different clusters are dissimilar. It is one
of the vital processes in text mining. Liping (2005) emphasized that the
growth of the internet and of computational power has paved the way for
various clustering techniques. Text mining in particular has gained
importance; it demands various tasks, such as the production of granular
taxonomies and document summarization, for deriving higher-quality
information from text.
1.2.1 Clustering techniques
(i) Global K-means
Likas et al. (2003) proposed the global K-means clustering
technique, which creates initial centers by recursively dividing the data space
into disjoint subspaces using the K-dimensional tree approach. The cutting
hyperplane used in this approach is the plane perpendicular to the
maximum-variance axis derived by Principal Component Analysis (PCA).
Partitioning is carried out until each leaf node possesses fewer than a
predefined number of data instances or a predefined number of buckets has
been generated. The initial centers for K-means are the centroids of the data
present in the final buckets. Shehroz Khan and Amir Ahmad (2004)
stipulated iterative clustering techniques to calculate initial cluster centers for
K-means; this process is feasible for clustering techniques for continuous
data.
(ii) CLIQUE Algorithm
Agrawal et al. (2005) described data mining applications and the
requirements they place on clustering techniques. The main requirements
considered are the ability to identify clusters embedded in subspaces of
high-dimensional data, scalability, comprehensibility of the results by
end-users, and insensitivity to assumptions about the data distribution.
A clustering algorithm called CLIQUE fulfills all the above
requirements. CLIQUE finds dense clusters in subspaces of maximum
dimensionality. It produces cluster descriptions in the form of Disjunctive
Normal Form (DNF) expressions that are minimized for ease of
comprehension. The approach generates identical results regardless of the
order in which input records are presented, and it does not presume any
particular mathematical form for the data distribution. Experimental results
showed that the CLIQUE algorithm efficiently identified accurate clusters in
large high-dimensional datasets.
(iii) Enhanced K-means
The main limitation of the K-means approach is that it can generate
empty clusters, depending on the initial center vectors. This drawback does
not cause any significant problem for a single static execution of the
K-means algorithm, as it can be overcome by running the algorithm a
number of times. However, in some applications the empty-cluster issue
leads to erratic system behavior and affects overall performance. Malay
Pakhira et al. (2009) proposed a modified version of the K-means algorithm
that effectively eliminates this empty-cluster problem; in the experiments
done in this regard, the algorithm showed better performance than traditional
methods.
(iv) Heterogeneous Uncertainty Clustering Feature (H-UCF)
Uncertain heterogeneous data streams (Charu Aggarwal et al.
2003) are seen in many applications, but the clustering quality of the existing
approaches for clustering heterogeneous data streams under uncertainty is
not satisfactory. Guo-Yan Huang et al. (2010) posited an approach for
clustering heterogeneous data streams with uncertainty. A frequency
histogram based on H-UCF helps to trace the statistics of the categorical
attributes. Initially creating 'n' clusters by a K-prototype algorithm, the new
approach proves more useful than UMicro with regard to clustering quality.
(v) Hierarchical Particle Swarm Optimization (HPSO) clustering
Alam et al. (2010) designed a novel clustering algorithm, called
HPSO, by blending partitional and hierarchical clustering. It utilized swarm
intelligence in a decentralized environment and proved very effective, as it
performs clustering in a hierarchical manner.
(vi) Hybrid clustering approach (HCA)
Shin-Jye Lee et al. (2010) suggested a clustering-based method to
identify a fuzzy system. It first presents a modular approach based on a
hybrid clustering technique. Since finding the number and location of
clusters is the primary concern in evolving such a model, an HCA was
designed that takes input, output, generalization and specialization into
account. This three-part input-output clustering algorithm adopts several
clustering characteristics simultaneously to identify the problem.
(vii) Incremental clustering
Only a few researchers have focused on partitioning categorical
data in an incremental mode, and designing incremental clustering for
categorical data is a vital issue. Li Taoying et al. (2010) proposed an
incremental clustering for categorical data using a clustering ensemble. They
first reduced redundant attributes where required, then used the values of the
different attributes to form clustering memberships. A clustering ensemble
was then employed to merge or partition clusters to obtain an optimal
clustering. Finally, the proposed approach was applied to the yellow-small,
diagnosis and zoo data sets, and the results revealed its effectiveness.
(viii) Partitional clustering approach
Clustering approaches have been extensively used in the area of
pattern discovery (Nagy 1968) from Web Usage Data (WUD). In
e-commerce applications, clustering is used for purposes such as creating
marketing policies and product offerings. Raju et al. (2011) presented a novel
partition-based technique for dynamically grouping web users based on their
web access patterns, using the Adaptive Resonance Theory Neural Network
(ART1NN) clustering algorithm. The results show that the ART1NN
clustering technique outperforms the K-means and Self-Organizing Map
(SOM) clustering algorithms in terms of intra-cluster and inter-cluster
distances.
1.2.2 Text document clustering
(i) Data grabber
Crescenzi et al. (2004) presented an approach that automatically
extracts data from large data-intensive web sites. The "data grabber"
investigates a large web site and infers a scheme for it, describing the site as
a directed graph whose nodes represent classes of structurally similar pages
and whose arcs represent links between these pages. After locating the
classes of interest, a library of wrappers can be created, one per class, with
the help of an external wrapper generator, and in this way suitable data can
be extracted.
(ii) Link-Analysis and Text-mining toolbox (LATINO)
Miha Grcar et al. (2008) addressed the lack of software mining
techniques; software mining is the process of extracting knowledge from
source code. They presented a software mining approach that integrates text
mining and link analysis techniques, the latter being concerned with the links
between instances. Retrieval- and knowledge-based approaches are the two
main tasks used in constructing a tool for software components. An
ontology-learning framework named LATINO was developed by Grcar et al.
(2006); LATINO, an open-source general-purpose data mining platform,
offers text mining, link analysis, machine learning, etc.
(iii) Similarity and model based approaches
Similarity-based and model-based approaches (Meila and
Heckerman 2001) are the two major categories of clustering approaches, as
described by Pallav Roxy and Durga Toshniwal (2009). The former is a
pairwise-similarity clustering approach that maximizes average similarity
within clusters and minimizes it between clusters. The latter tries to learn a
model from the documents, each model representing one particular
document group.
Self-organizing maps (Kohonen et al. 2000), mixtures of Gaussians
(Tantrum et al. 2002), spherical K-means (Dhillon and Modha 2001),
bisecting K-means (Steinbach et al. 2000) and mixtures of multinomials
(Vaithyanathan and Dom 2000) are some of the newer techniques available
to improve clustering performance. K-means, an unsupervised learning
technique, is good at solving clustering problems by minimizing an objective
function. It defines K centroids, one for each cluster, positioned as far from
each other as possible. Each point is then assigned to its nearest centroid,
after which the K centroids are recomputed; this loop of assignment and
recomputation repeats until the centroids no longer change.
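The assignment and recomputation loop just described can be sketched as follows. The deterministic seeding and the toy 2-D data are illustrative choices, not part of any cited method.

```python
def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(cluster):
    """Centroid (component-wise mean) of a non-empty cluster."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans(points, k, iters=100):
    """Lloyd's K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = [points[0], points[-1]] if k == 2 else points[:k]  # simple seeding
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

In practice the initial centroids are chosen randomly or by a seeding scheme such as the global K-means idea above, which is precisely what the empty-cluster and initialization literature surveyed here tries to improve.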
(iv) Fuzzy clustering for text
Jiabin Deng et al. (2010) came up with an enhanced fuzzy text
clustering method relying on Fuzzy C-Means (FCM) and an edit-distance
approach. It performs feature estimation and reduces the dimensionality of
high-dimensional text vectors. In order to sustain the stability of the FCM
output, the researchers added aspects such as a high-power sample point set,
field of radius and weight. Since FCM has constraints such as
boundary-value attribution (Dave 1992), they recommended the edit-distance
approach. The results showed that the method was more precise and stable
than FCM with regard to text clustering.
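Standard FCM, on which the above method builds, replaces hard assignments with graded memberships that sum to one per point. The two-cluster, 1-D sketch below is a minimal illustration under assumed seeding and toy data, not the enhanced method described above.

```python
def fcm(points, m=2.0, iters=50):
    """Fuzzy C-Means for two clusters on 1-D data: each point receives a
    degree of membership in every cluster rather than a hard label."""
    centers = [min(points), max(points)]  # deterministic seeding for this sketch
    u = []
    for _ in range(iters):
        # Membership update: closer centers receive higher membership.
        u = []
        for x in points:
            d = [max(abs(x - v), 1e-9) for v in centers]
            u.append([1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(2))
                      for i in range(2)])
        # Center update: membership-weighted mean of the data.
        centers = [
            sum(u[j][i] ** m * points[j] for j in range(len(points)))
            / sum(u[j][i] ** m for j in range(len(points)))
            for i in range(2)
        ]
    return centers, u

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centers, memberships = fcm(points)
print([round(c, 1) for c in sorted(centers)])  # → [1.0, 8.0]
```

The fuzzifier `m` controls how soft the boundaries are; the boundary-value attribution problem cited above concerns points whose memberships stay nearly equal across clusters.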
Odukoya et al. (2010) developed an enhanced data clustering
approach for mining web documents, which formulates, simulates and
assesses the web documents with the intention of conserving their conceptual
relationships. The enhanced approach was formulated around the K-means
algorithm. Experimental results showed that this clustering approach attains
higher accuracy (89.3%) than the other existing approach (88.9%). The
entropy was constant for both approaches, with a value of 0.2485 at k = 3; it
decreases as the number of clusters increases until the number of clusters
reaches eight, where it increases to some extent. The adjusted Rand index
values range from 0 to 1 for both clustering approaches.
When the number of clusters was five, the existing approach
attained an adjusted Rand index value of 53%, compared with 63.7% for the
proposed approach. Additionally, the response time was reduced from
0.0451 seconds to 0.0439 seconds when the number of clusters was three,
confirming that the response time of the data clustering approach is about
2.7% lower than that of traditional K-means clustering. The study suggested
that the proposed approach could be used by developers of web search
engines for well-organized clustering of web search results.
(v) Semantic concepts
A majority of the available algorithms for Chinese text clustering
suffer from drawbacks in data scalability and in the interpretability of
results. Liu Jinling and Zhou Hong (2010) presented a well-organized
Chinese text clustering technique based on semantic concepts. The technique
significantly reduces the amount of data to be processed and enhances the
capability of the clustering approach. Experimental output showed that it
attains an acceptable clustering outcome and better implementation
efficiency.
(vi) Seeds Affinity Propagation
Renchu Guan et al. (2011) presented a novel semisupervised Seeds
Affinity Propagation (SAP) approach based on the Affinity Propagation
(AP) algorithm. The two most important contributions of this technique are:
Structural information of the texts is captured by a novel
similarity metric.
The semisupervised clustering process is enhanced by a seed
construction approach.
The experimental results reveal that the proposed similarity metric
is more effective for text clustering and that the proposed semisupervised
approach attains better clustering outcomes and faster convergence (using
only 76% of the iterations of the original AP). The entire SAP algorithm
attains a higher F-measure and lower entropy, considerably improves
clustering execution time (20 times faster) with respect to K-means, and
offers better robustness than other existing approaches.
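F-measure, entropy and purity, evaluation metrics of the kind cited above, can be computed from the true labels gathered per cluster. The sketch below shows purity and entropy on hypothetical label data; it is a generic illustration, not the evaluation code of any cited study.

```python
import math
from collections import Counter

def purity(clusters):
    """Fraction of documents carrying their cluster's majority label."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

def entropy(clusters):
    """Size-weighted average label entropy per cluster
    (lower is better: 0 means every cluster is pure)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        h = -sum((n / len(c)) * math.log2(n / len(c))
                 for n in Counter(c).values())
        result += (len(c) / total) * h
    return result

# Each inner list holds the true topic labels of the documents in one cluster.
clusters = [["sports", "sports", "news"], ["news", "news", "news"]]
print(round(purity(clusters), 3), round(entropy(clusters), 3))  # → 0.833 0.459
```

A higher purity and a lower entropy together indicate that each cluster is dominated by a single true topic, which is the sense in which SAP's results above are "better".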
1.2.3 Clustering in P2P environment
(i) K-means
Eisenhardt et al. (2003) used distributed clustering in a distributed
P2P network (Adjiman et al. 2006) involving the K-means approach.
Changing it to work in a distributed P2P fashion, Li and Morris (2002)
adopted the technique as a probe-and-echo system, which faces similar
problems from the information retrieval point of view. Starting from a
subset of a document group, this approach involved dividing document
groups centrally; after assigning a node to each cluster, documents were
grouped into their respective clusters.
To handle K-means clustering in a P2P network, Dhillon and Modha
(2000) presented novel techniques with a parallel implementation approach.
A K-means monitoring algorithm assists a centralized K-means process and
recalculates clusters by supervising the distribution of centroids across peers;
in case the data distribution changes over time, it starts reclustering. The
alternative, P2P K-means, works by updating the centroids at each peer
based on the information available from its immediate neighbors. The
algorithm terminates when this information fails to update the centroids of
the peers taken together.
The clustering approaches in use suffer from efficiency and
accuracy that diminish as the network size grows. However, an approach
developed by Qing He et al. (2010) used frequent term sets for P2P
networks. Owing to its lower communication volume, it achieved results
whose quality did not depend on the network size. It not only gave users a
better understanding of the clustering results but also enabled them to handle
local documents with respect to the entire network.
(ii) Self-organizing clustering
Zhao-Kui Li and Yan Wang (2010) resolved the topology mismatch
in a structured P2P network by using super nodes and self-organizing
clustering based on physical location data. Each node evaluated its own
ability to become a clustering header. This approach not only enhanced
routing performance but also reduced topology mismatch.
(iii) Hybrid P2P K-Medoids Method (P2PKMM)
A hybrid clustering approach, namely P2PKMM, evolved by
Zhongjun Deng et al. (2010), achieved remarkable results using a
decentralized approach in which each peer interacts only with peers that are
close in the network topology.
(iv) Correlation-based clustering
Yuan Li and Xia Shixiong (2010), dealing with the problems of
poor scalability and low search efficiency in unstructured P2P networks,
produced a correlation-based hierarchical P2P network approach. By
partitioning the network into clusters of nodes according to their correlation,
the approach attained considerable success in terms of query success rate and
query delay.
1.2.4 Analysis of fuzzy clustering
(i) Formal Concept Analysis (FCA)
Pollandt (1996) tried to unify fuzzy logic with FCA using the
L-Fuzzy context, which uses linguistic variables, usually terms connected to
fuzzy sets, to express the uncertainty in the context. However, these
variables need human interpretation, and the fuzzy concept lattice derived
from an L-Fuzzy context suffers from a combinatorial explosion of concepts.
FCA is considered an efficient technique for conceptual clustering and data
analysis; all the same, it is important to simplify the usually complex concept
lattices. Stumme et al. (2002) used association rules in relation to Iceberg
concept lattice clustering.
Formal Concept Analysis is also a proper technique for knowledge
representation. In the Iceberg concept lattice (Stumme et al. 2002),
association rules were used to cluster concepts on the lattice, and conceptual
scaling (Vogt and Wille 1994) was then used to generate the concept
hierarchy.
(ii) Fuzzy C-Medoids (FCMdd) clustering
Raghu Krishnapuram et al. (2001) presented new techniques,
namely Fuzzy C-Medoids (FCMdd) and Fuzzy C Trimmed Medoids, for
fuzzy clustering of relational data. The objective functions are based on
choosing 'c' representative objects (medoids) from the data set in such a way
that the total dissimilarity within each cluster is minimized. Experiments
comparing FCMdd with the Relational FCM (RFCM) point out that FCMdd
is much faster and more efficient.
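The medoid objective, picking 'c' data objects that minimize total within-cluster dissimilarity, can be illustrated by a tiny exhaustive search. Real FCMdd-style algorithms explore this space heuristically and fuzzily; the 1-D data and absolute-difference dissimilarity below are illustrative assumptions.

```python
from itertools import combinations

def total_cost(points, medoids):
    """Assign each object to its nearest medoid; cost is the sum of
    those dissimilarities (here: absolute differences)."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def k_medoids_exhaustive(points, c):
    """Toy exhaustive version of the medoid objective: pick the c data
    objects that minimize total within-cluster dissimilarity."""
    best = min(combinations(points, c), key=lambda m: total_cost(points, m))
    return list(best)

points = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(k_medoids_exhaustive(points, c=2))  # → [1.0, 5.0]
```

Unlike K-means centroids, medoids are always actual data objects, which is why the approach works directly on relational (dissimilarity-only) data.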
(iii) Inter-transaction fuzzy association rules
A technique to identify inter-transaction fuzzy association rules that
can capture the variations of events was proposed by Yo-Ping Huang and
Li-Jen Kao (2005). The proposed algorithm first maps a quantitative
attribute into several fuzzy attributes. A normalization process is carried out
to prevent the total contribution of the fuzzy attributes from being larger
than one. To mine inter-transaction fuzzy association rules, both the
dimensional attribute and sliding-window concepts were proposed in this
technique. A comprehensive comparative study of text document clustering
for the biomedical digital library MEDLINE was presented by Illhoi Yoo
and Xiaohua Hu (2006). These researchers conducted extensive experiments
based on various evaluation metrics, such as the misclassification index,
F-measure, cluster purity and entropy, on very large article sets from
MEDLINE. Document clustering can provide the advantages of better
document retrieval, document browsing and text mining in a digital library,
and ontologies can be used to address conventional information retrieval
problems such as synonymy, hypernymy and hyponymy.
Yeong-Chyi Lee et al. (2006) employed the concepts of a multiple-
level taxonomy and multiple minimum supports to identify fuzzy association
rules in a given quantitative transaction data set. Their approach judged the
significance of different items using different criteria, managed taxonomic
relationships among items, and handled quantitative data sets, three issues
that commonly arise in real mining applications. This fuzzy mining approach
generated large itemsets level by level and then derived fuzzy association
rules from the quantitative transaction data.
(iv) Cluster-based fuzzy-genetic
Chen et al. (2006) designed a novel approach, called the cluster-based
fuzzy-genetic mining technique, for extracting both fuzzy association rules
and membership functions from quantitative transactions. The technique
dynamically adjusted membership functions by GAs and used them to fuzzify
quantitative transactions. It also accelerated the evaluation procedure, while
keeping nearly the same solution quality, by clustering chromosomes; since
only the 1-itemsets need to be found during evaluation, the evaluation cost is
reduced, saving time.
(v) Fuzzy ontology
Jun Zhai et al. (2008) devised fuzzy ontology techniques based on
intuitionistic fuzzy sets, enabling knowledge sharing on the semantic web.
The ontology served as a standard for knowledge representation and
supported collaborative design on the semantic web. The researchers
proposed a series of fuzzy ontology techniques, including a fuzzy domain
ontology that used intuitionistic fuzzy sets and fuzzy linguistic terms, and
examined the relationships between fuzzy concepts, including order and
equivalence relations.
Hongwei Chen et al. (2009) employed concrete models: the fuzzy
comprehensive evaluation technique for a P2P-based trust system, the fuzzy
rank-ordering technique for the Blin algorithm, and a fuzzy inference system
for the Mamdani approach. This study showed that different fuzzy trust
techniques for P2P-based systems result in different fuzzy sets.
(vi) Fuzzy Transduction-based Clustering Approach (FTCA)
Matsumoto et al. (2010) built a prototype web search results
clustering engine that improved search results by performing fuzzy clustering
on web documents returned by traditional search engines, as well as ranking
the results and labeling the resulting clusters. This was done using the FTCA,
which employs a Transduction-based Relevance Model (TRM) to create
document relevance values. Experiments on five different data sets revealed
a significant clustering quality and performance advantage over the suffix
tree clustering and Lingo approaches.
(vii) Fuzzy similarity-based feature clustering
Feature clustering is a powerful technique for reducing the
dimensionality of feature vectors in text classification. Jung-Yi Jiang
et al. (2011) proposed a fuzzy similarity-based self-constructing approach
for feature clustering. With this approach, the derived membership functions
closely matched and appropriately described the real distribution of the
training data. Experimental observations made clear that the new approach
runs faster and obtains better extracted features than the other techniques.
1.2.5 K-means and GA for clustering
(i) Fast Genetic K-means cluster technique (FGKA)
Lu et al. (2004) laid the way for FGKA, a faster version of the
Genetic K-means Algorithm (GKA) of Krishna et al. (1999). FGKA offers
several improvements over GKA, including an efficient estimation of the
objective value, the Total Within-Cluster Variation (TWCV), a reduced
overhead for eliminating illegal strings, and a generalization of the mutation
operator. With these enhancements, FGKA runs about 20 times faster than
GKA. Even though FGKA outperforms GKA considerably, it has some
potential disadvantages: the cost of computing the centroids and the TWCV
from scratch is high. To overcome these difficulties of FGKA, Lu et al.
(2001) formulated the Incremental Genetic K-means Algorithm (IGKA).
IGKA inherits all the benefits of FGKA, including convergence to the global
optimum, and outperforms FGKA when the mutation probability is low.
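The TWCV objective that FGKA and IGKA optimize can be illustrated with a minimal sketch. This naive version recomputes everything from scratch, which is precisely the cost the incremental scheme avoids; the data layout here is assumed for illustration.

```python
def twcv(points, labels, k):
    """Total Within-Cluster Variation: the sum over clusters of squared
    Euclidean distances from each point to its cluster centroid."""
    dims = len(points[0])
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2
                     for p in members for d in range(dims))
    return total
```

A GA-based K-means uses such a value (or a transformation of it) as the fitness of each chromosome, where a chromosome encodes one labeling of the points.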
The main focus of IGKA is to compute the TWCV objective value
and the cluster centroids incrementally. When the mutation probability is
below a threshold, IGKA performs better than FGKA; when it is above the
threshold, FGKA performs better. Hence, the Hybrid Genetic K-means
Algorithm (HGKA) of Francisco (2005) combines the advantages of both
FGKA and IGKA and performs well at both higher and lower mutation
probabilities. These GA-based clustering methods, however, apply only to
numeric data sets because they depend exclusively on K-means.
Consequently, a Genetic Clustering Algorithm called GKMODE was
introduced, which combines the k-modes algorithm of Anil Chaturvedi et al.
(2001) with the genetic approach. GKMODE performs well on categorical
data, but it is difficult for it to handle mixed data containing both numeric
and categorical attributes.
(ii) Improved GA (IGA)
The genetic approach is a population-based search technique with
three main operators, namely crossover, mutation and selection. The
selection operator plays a vital role in discovering optimal solutions to
constrained optimization problems. Ke-Zong Tang et al. (2010) formulated
an IGA based on a novel selection approach designed to handle nonlinear
programming problems. Experimental observations show that the IGA offers
enhanced robustness, efficiency and stability.
(iii) Incremental Clustering using GA (ICGA)
Atul Kamble (2010) developed a new incremental clustering
technique based on GA. The approach can be applied to any database,
including data from metric spaces. Performance evaluation on a spatial
database demonstrated the effectiveness of the approach: the incremental
version was confirmed to provide better output than the other approaches,
and ICGA provides speed-up advantages that other clustering algorithms
lack.
(iv) Chaotic method GA (CGA)
Dharmendra Roy and Lokesh Sharma (2010) devised a clustering
approach based on a Genetic K-means model that performs well for data
containing both numeric and categorical attributes. Min-Yuan Cheng et al.
(2010) offered an n-dimensional convergence approach to advance the
conventional GA combined with the K-means clustering method. A chaotic
approach was exploited to keep the new method from premature
convergence. The approach not only maintains the fundamental search
capability, but also enhances the flexibility and effectiveness of the
parametric method, reaching the optimum with a probability of 84%. It has
been tested with many construction examples, and all the results confirm
that the method is very effective.
Ghaemi et al. (2010) introduced a new approach that used clustering
ensembles to enhance cluster partition fitness and to address the clustering
diversity problem. The average fitness of individuals generated by the
algorithm and by other clustering algorithms was calculated with three
different fitness functions and then compared. Experimental observations on
several benchmark data sets showed that the proposed algorithm improves
the cluster fitness of solutions.
Jia Zhen and Wang Yong Gui (2010) presented a genetic clustering
approach based on dynamic granularity. Exploiting the parallel random-search,
global-optimization and diversity features of GA, the method is
integrated with a dynamic granularity approach. In the process of changing
granularity, suitable granulation can be achieved by refining the granularity,
which improves the efficiency and accuracy of the clustering algorithm. The
experimental results indicated that the approach enhances the local search
ability and convergence speed of GA-based clustering.
(v) Genetic Fuzzy C Means
Chen Wei et al. (2010) proposed a technique that addresses the
drawbacks of the genetic Fuzzy C-Means (FCM), such as high classification
time and low clustering accuracy. The approach improves the crossover,
selection and mutation parts of the GA, strengthens its global searching
capability, and eliminates the difficulty of setting the genetic parameters.
The enhanced approach performs FCM optimization immediately after each
generation of genetic operations, increasing the convergence speed.
Consequently, the clustering accuracy, convergence speed and stability of the
approach are found to be better than those of conventional clustering
approaches.
Chen Rui et al. (2010) proposed an enhanced GA. The algorithm
maintains population diversity by performing similarity checks on the
population before selection. The approach resolves the premature
convergence problem of population evolution and presents a formula for the
mutation probability linked to the similarity rate and the number of
iterations. It not only retains a good diversity of the population, but also
ensures the convergence of the approach. Experiments were conducted on
the UCI WINE and IRIS data sets, and the results showed better performance
than the C-means clustering algorithm.
1.2.6 Ontology based clustering
Several approaches, such as Natural Language Processing (NLP)
combined with association rules (Maedche and Staab 2001), statistical
models (Faatz and Steinmetz 2002), and clustering (Bisson and Nedellec
2000), have been applied to create an ontology from a concept hierarchy
automatically or semi-automatically. Clustering is one of the most widely
used approaches for ontology learning. Furthermore, conceptual clustering
such as COBWEB can be used to create concept representations and
relationships for an ontology (Bisson and Nedellec 2001).
Andreas et al. (2002) adopted a clustering technique for text data
that uses background knowledge during preprocessing to improve the
clustering outcome and allows selection among alternative results. The input
data supplied to ontology-based heuristics was preprocessed for feature
selection and aggregation, so that various choices of text representation were
constructed. Based on these representations, multiple clustering outcomes
were calculated using K-means, and the results of the proposed approach
looked very promising. A Wordsets-based Document Clustering (WDC)
algorithm for large data sets was proposed by Sharma et al. (2009); it uses a
hierarchical technique to cluster text documents that share common words.
(i) Affinity-based similarity measure
An affinity-based similarity measure for web document clustering
was presented by Shyu et al. (2004). In this work, document clustering was
extended to web document clustering by establishing an affinity-based
similarity measure technique that uses user access patterns to find
similarities among web documents through a probabilistic model. An
evaluation on a real data set showed that the similarity measure
outperformed the cosine coefficient and the Euclidean distance technique.
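For reference, the two baseline measures that the affinity-based measure was compared against can be sketched as follows; this is a minimal illustration of the standard definitions, not of the affinity measure itself, which is based on user access patterns.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def euclidean_distance(a, b):
    """Straight-line distance between two term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note the difference: a document and a longer document with the same term proportions have cosine similarity 1 but a nonzero Euclidean distance, which is why cosine is usually preferred for documents of varying length.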
(ii) Document Index Graph (DIG)
Web document clustering using a document index graph was put forth
by Momin et al. (2006). Effective informative features such as phrases are
essential for more precise document clustering. The first part of the work
therefore provided a phrase-based model in which the DIG permits
incremental phrase-based encoding of documents, stressing the efficiency of
phrase-based similarity measures over conventional single-term similarities.
In the next step, a Document Index Graph Based Clustering (DIGBC)
technique was provided to extend the DIG model for incremental and soft
clustering. This technique incrementally clusters documents based on the
proposed cluster-document similarity measure and enables the assignment of
a document to more than one cluster.
(iii) Fuzzy named entity-based clustering
Cao et al. (2008) dealt with a fuzzy named-entity-based document
clustering technique, moving beyond the conventional keyword-based
method of document clustering, which is limited by its simplistic treatment
of words and its rigid partitional clusters. In this work, named entities, which
are important elements in defining document semantics, were introduced as
objects into fuzzy document clustering. Initially, the existing keyword-based
vector space representation was adapted with vectors defined over spaces of
entity names, types, name-type pairs and identifiers, as alternatives to
keywords. Hierarchical fuzzy document clustering can then be applied using
a similarity measure over the vectors representing documents.
Lena Tenenboim et al. (2008) presented a novel technique for
ontology-based classification. They discussed a prototype of a futuristic
personalized newspaper service on a mobile reading device, built around the
categorization of news items in e-Paper. The e-Paper system gathers news
items from different news suppliers and distributes to each subscribed user a
personalized electronic newspaper, making use of content-based and
collaborative filtering techniques. The e-Paper can also offer users a standard
version of chosen newspapers, besides allowing browsing in the warehouse
of news items. The work deliberates on the automatic categorization of
incoming news with the help of a hierarchical news ontology. Based on this
clustering technique and on the users' profiles, the personalization engine is
capable of delivering a personalized paper to every user on the mobile
reading device.
(iv) GeneticCA
Zhenya Zhang et al. (2008) presented a clustering aggregation
technique based on GA for document clustering. The technique, named
GeneticCA, addresses the clustering aggregation problem and estimates the
clustering performance of a clustering division. Clustering precision was
defined and its characteristics were analyzed. In the evaluation of GeneticCA,
a Hamming neural network was used to produce clustering divisions with
fluctuating and weak clustering performance.
(v) Model clustering
Haojun et al. (2008) endeavoured to develop a document clustering
method based on a hierarchical algorithm. The work analyzed and exploited
a cluster overlapping technique to design the cluster merging criterion.
Building on the hierarchical clustering technique, the expectation-
maximization (EM) method is used in a Gaussian mixture model to estimate
the parameters, and the two sub-clusters whose overlap is largest are merged.
(vi) Prototype-based genetic Approach
Gang Li et al. (2009) considered the difficulty that the classical
Euclidean distance metric cannot create an apt separation for data lying on a
manifold. They evolved a GA-based clustering method using a geodesic
distance measure. Compared with the generic K-means method, the
technique has the potential to distinguish complicated non-convex clusters,
and its clustering performance is clearly better than that of K-means for
complex manifold structures.
(vii) Latent Semantic Index (LSI)
Muflikhah et al. (2009) developed a document clustering technique
using concept space and cosine similarity measurement. They combined
information retrieval and document clustering techniques in a concept space
approach known as the Latent Semantic Index (LSI), which uses Singular
Value Decomposition (SVD) or PCA. The technique aims to reduce the
matrix dimension by identifying patterns (Bezdek 1981) in the document
collection with reference to terms. Each technique is applied to the term-
document weights of the vector space model (VSM) for document clustering
using the FCM technique. In addition to the reduction of the term-document
matrix, this research also utilizes cosine similarity measurement as an
alternative to Euclidean distance within FCM.
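The dimension-reduction step of LSI can be sketched with a truncated SVD on a toy term-document matrix. This is a generic illustration assuming NumPy is available; the matrix values and the choice k = 2 are hypothetical, not taken from the cited work.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

def lsi_document_vectors(A, k):
    """Project each document (column of A) into a k-dimensional latent
    concept space via truncated SVD: A ~ U_k S_k V_k^T, so the document
    coordinates are the columns of S_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = lsi_document_vectors(A, 2)
```

Documents sharing a latent topic end up close in the reduced space even when they share few exact terms, which is what makes the subsequent FCM clustering over concept space effective.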
ELdesoky et al. (2009) suggested a novel similarity measure for
document clustering based on topic phrases. In the conventional VSM,
researchers use the unique words contained in the document set as candidate
features. This work presented a new technique for evaluating the similarity
measure of the traditional VSM, taking the topic phrases of a document,
instead of conventional terms, as the terms of the VSM. The new technique
was applied to the Buckshot technique, a combination of Hierarchical
Agglomerative Clustering (HAC) and the K-means clustering method. Such
a method may increase the effectiveness of clustering, improving the
evaluation metric values.
Thaung et al. (2010) developed document clustering with the help of
the FCM algorithm. Most traditional clustering techniques allocate each data
point to exactly one cluster, thereby creating a crisp partition of the given
data, whereas fuzzy clustering permits degrees of membership by which a
data point can belong to several clusters. In this research, documents were
partitioned with the FCM clustering technique, a widely applied
unsupervised clustering method. However, FCM requires the user to specify
the number of clusters, and different numbers of clusters correspond to
different fuzzy partitions, so validation of the clustering result is required. In
this regard, the PBM index and F-measure are helpful for measuring cluster
quality.
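A minimal FCM sketch follows, showing the degrees of membership described above. The deterministic initialization is chosen here only for reproducibility; note that the number of clusters c must still be supplied by the user, which is exactly why cluster validation is needed.

```python
import math

def fcm(points, c, m=2.0, iters=50):
    """Minimal Fuzzy C-Means: alternate fuzzy membership updates and
    membership-weighted center updates (fuzzifier m > 1)."""
    n, dims = len(points), len(points[0])
    # Simple deterministic initialization: spread centers across the data.
    centers = [list(points[(i * n) // c]) for i in range(c)]
    U = [[0.0] * c for _ in range(n)]
    for _ in range(iters):
        # Membership update: the standard inverse-distance formula.
        for j, p in enumerate(points):
            d = [max(math.dist(p, centers[i]), 1e-12) for i in range(c)]
            for i in range(c):
                U[j][i] = 1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1))
                                    for k in range(c))
        # Center update: membership-weighted means.
        for i in range(c):
            w = [U[j][i] ** m for j in range(n)]
            s = sum(w)
            centers[i] = [sum(w[j] * points[j][t] for j in range(n)) / s
                          for t in range(dims)]
    return U, centers
```

Each row of U sums to 1, so a document near a cluster boundary receives comparable membership in several clusters instead of a hard assignment.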
Web Recommendation System (WRS) has several drawbacks, like
sparsity and the new user problem. Moreover, WRS does not make full use of
or harness the power of domain knowledge and semantic web ontologies.
Mabroukeh et al. (2011) discussed how an ontology-based WRS could use
relations and concepts in ontology, along with user-provided tags, to provide
top-n recommendations without the need for item clustering or user ratings.
For this reason, they suggested a dimensionality reduction technique based on
the domain ontology, to solve the sparsity problem.
Paliwal et al. (2011) addressed the problem of discovering web
services whose service descriptions lack explicit semantics yet match a
particular service request. Their semantics-based web service discovery
technique comprises semantics-based service categorization and semantic
enrichment of the service request. They presented a solution for obtaining
functional-level service categorization based on an ontology framework, and
used clustering to classify web services accurately according to service
functionality. The semantics-based classification is carried out off-line for
the Universal Description Discovery and Integration (UDDI) registry, while
the semantic enhancement of the service request achieves better matching
with relevant services.
1.2.7 Clustering in feature selection
Feature selection, the process of extracting the salient content of a
text document collection, requires content relationship factors, as does
document grouping. Marcelo N. Ribeiro et al. (2009) came up with a local
feature selection technique that enables partitional hierarchical text
clustering, in which each cluster is described by a distinct subset of features.
In a comparison between the local technique and a global feature selection
technique for bisecting K-means, the former showed significant precision
even with few selected terms.
Xu et al. (2007) described a new feature selection technique for text
clustering based on expectation maximization and cluster validity. By
applying a supervised feature selection technique to intermediate clustering
results, it achieved better feature selection for text clustering.
Meena Janaki et al. (2010) proposed a new feature selection technique based
on ant colony optimization, which originated from swarm intelligence. The
effectiveness of classifiers using the selected features was compared with
that obtained using the chi-square and CHIR techniques; the method
discovered better feature sets than the existing techniques.
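As a reference point for the chi-square baseline mentioned above, the per-term statistic can be computed from a 2x2 term/class contingency table. This is the standard formulation, not the CHIR variant, and the counts in the usage example are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 contingency table:
    n11 = docs in the class containing the term,
    n10 = docs outside the class containing the term,
    n01 = docs in the class lacking the term,
    n00 = docs outside the class lacking the term."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0
```

Terms are ranked by this score per class; a score of 0 means the term occurs independently of the class and carries no discriminative information.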
Shen Huang et al. (2006) suggested a novel feature co-selection
technique called Multitype Feature Co-selection for Clustering (MFCC) for
effective web document clustering. MFCC exploits intermediate clustering
results in one type of feature space to assist the selection in another type.
Carried out iteratively within an iterative clustering technique, this feature
co-selection yields better clusters in each space.
Sun Park et al. (2010) set forth document clustering approaches
using weighted semantic features and cluster similarity through Non-negative
Matrix Factorization (NMF). Exploiting the similarity between clusters and
documents brings several advantages: documents can effortlessly be
clustered around the major topics using clustering based on weighted
semantic features, and the quality of document clustering is improved. The
approach is more efficient than other document clustering techniques using
NMF. Therefore, by using these enhanced techniques and algorithms, the
efficiency of clustering approaches can be considerably improved.
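The NMF factorization underlying such approaches can be sketched with the classical Lee-Seung multiplicative updates. This is a generic illustration assuming NumPy, not Sun Park et al.'s weighted variant; the toy matrix and the choice k = 2 are hypothetical.

```python
import numpy as np

def nmf(V, k, iters=300, seed=0):
    """Lee-Seung multiplicative updates minimizing ||V - W H||_F with
    non-negative factors W (terms x k) and H (k x documents)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    eps = 1e-9  # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical term-document matrix with two topic blocks.
V = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0, 1.0]])
W, H = nmf(V, 2)
labels = H.argmax(axis=0)  # each document joins its strongest semantic feature
```

The columns of W act as non-negative semantic features (topics), and assigning each document to its largest coefficient in H yields the clusters.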
1.2.8 Ant colony optimization in clustering
Hui Fu (2008) hit upon a novel clustering technique with Ant
Colony Optimization (ACO) based on cluster center initialization. The
approach obtains initial cluster centers by different techniques and then
solves the clustering problem iteratively. Three cluster center initialization
techniques are used in the ACO-based clustering algorithm.
Xiaohua Wang et al. (2009) devised a novel hybrid ant colony and
agglomerative document clustering algorithm, based on the ant colony model
and agglomerative clustering algorithms. When the compacting algorithm
was applied, ants dropped their loads; then, by applying an evaluation-
function-based scheduling algorithm, ants picked up loads. Thus, the
agglomerative clustering algorithm was integrated into the iteration
procedure of the ACO clustering algorithm.
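The pick-up and drop behaviour these ant-based algorithms rely on can be illustrated with Lumer-Faieta-style probability functions. This is a generic sketch with hypothetical parameter values, not any cited author's exact implementation.

```python
def neighbourhood_fraction(item, neighbours, alpha=1.0):
    """Average similarity f of an item to the items in its grid
    neighbourhood; alpha scales the dissimilarity (Lumer-Faieta style)."""
    if not neighbours:
        return 0.0
    f = sum(1.0 - abs(item - n) / alpha for n in neighbours) / len(neighbours)
    return max(f, 0.0)

def pick_probability(f, k1=0.1):
    """An unladen ant picks up an item more readily when it is out of
    place (low neighbourhood similarity f)."""
    return (k1 / (k1 + f)) ** 2

def drop_probability(f, k2=0.15):
    """A laden ant drops its item more readily among similar items."""
    return (f / (k2 + f)) ** 2
```

Iterating these local decisions over a grid makes similar items accumulate into heaps, mirroring how real ants sort their cemeteries; the clustering then emerges without any global objective.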
Wu Bin et al. (2002) proposed CSIM, a document clustering
algorithm based on swarm intelligence and K-means. In the first phase, a
document clustering technique based on swarm intelligence, derived from a
basic model of how ant colonies organize their cemeteries, produces good
initial clusters. The classical K-means clustering technique is then applied
using these clusters as initial centers. CSIM thus incorporates significant
properties of both swarm intelligence and K-means.
Lijuan Jiao et al. (2010) derived a text classification algorithm based
on the ant colony algorithm, exploiting the effectiveness of ACO on discrete
problems and the discreteness of text document features. Texts are
categorized by the crawling of class-population ants, which carry class
information and identify the optimal matching path during the iterations.
From the literature review, it is observed that peer-to-peer document
clustering techniques are highly effective and offer good speed-up. Similarly,
optimization techniques, including ant colony optimization, have been found
to be effective for text document clustering. However, P2P document
clustering combined with improved optimization techniques has not been
investigated in the literature. Optimization techniques lead to faster
convergence and improved speed-up in distributed systems, which are
crucial for text mining.
1.3 SCOPE OF THE RESEARCH
As clustering plays a vital role in various applications, much
research is still being carried out. Upcoming innovations stem mainly from
the properties and characteristics of existing methods, which form the basis
for the various innovations in the field of clustering. From the existing
clustering techniques, it is clearly observed that clustering techniques based
on GA, fuzzy logic and ontology provide significant results and
performance. Hence, this research concentrates mainly on semantic
clustering based on GA, NDRGA, ACO and fuzzy ontology clustering for
better performance. The thesis aims to fulfill the objectives of various
cluster-related algorithms by developing an effective and efficient clustering
technique that is expected to give good accuracy and performance.
1.4 PROBLEM DEFINITION
Text document clustering approaches are beset with many problems,
namely accuracy, time duration in large databases, selection of initial
clusters (Shehroz Khan and Amir Ahmad 2004), high objective function
values, low silhouette coefficients, low F-measures, etc. Document clustering
in a large database (Sharma and Dhir 2009) becomes a complex issue for
users. Moreover, owing to this complexity, the required information is not
obtained from the clustering results, and the accuracy of the clustering
results becomes a problem in a large database. So an efficient method needs
to be found that improves the stability of P2P networks and provides
clustering accuracy as well.
1.5 OBJECTIVES
The research study aims to develop an improved document
clustering technique with high classification accuracy. Existing clustering
techniques require long convergence times and a very high number of
iterations, so new and efficient techniques are needed to relax these
constraints. Moreover, in distributed data mining, the adoption of a flat node
distribution technique adversely impacts scalability. Hence, the proposed
techniques are formulated to address these issues of modularity, flexibility
and scalability. The main objectives of this research include:
• developing a novel technique for distributed document
clustering using ontology which provides very high
classification accuracy.
• reducing the clustering time taken by clustering approaches in
the P2P environment.
• minimizing the convergence time and the number of
iterations.
The time required for clustering documents increases when large
databases are taken up for clustering. Likewise, depending on how the initial
clusters are determined, different clusters can result for the same data set.
The proposed clustering algorithms involve grouping electronic documents,
extracting important content from the document collection and supporting
effective management of digital library documents. The contents of digital
documents are analyzed and grouped into various categories.
1.6 METHODOLOGY
A novel approach providing higher efficiency and accuracy is
needed in the area of document clustering. The research study deals
primarily with five proposed approaches for document clustering. They
are:
1. Distributed clustering with feature selection for text
documents based on ontology.
2. Ontology based hierarchical distributed document clustering.
3. A Genetic Fuzzy Ontology Model (GFOM) for distributed
document clustering.
4. A novel clustering approach using Fuzzy Ontology with Non-
Dominated Ranked Genetic Algorithm (FONDRGA).
5. Enhanced distributed document clustering with the help of
Fuzzy Ontology and Ant Colony Optimization (FOACO)
Algorithm.
Each of the proposed document clustering methods is evaluated on
the following criteria:
• Classification accuracy
• Objective function
• Classification time
• Algorithm complexity
• Speed-up
• Convergence behavior
• Silhouette coefficient
• F-measure
• Entropy
• Separation index
The improvements achieved in those performance measures have
been tested for statistical significance using t-test.
1.7 OUTLINE OF THE THESIS
Each of the remaining chapters of the thesis deals with approaches
to semantic clustering, ontologies and optimization methods based on soft
computing techniques. Chapter 2 describes ontology-based distributed
clustering with feature selection for text documents for better cluster
performance. The Semantic Enhanced Hierarchically Distributed P2P
Clustering (SEHP2PC) is explained in Chapter 3, where term relationships
such as meaning (synonymy), part-of (meronymy) and kind-of (hypernymy)
are utilized to cluster the documents.
In Chapter 4, a genetic fuzzy ontology model is proposed to improve
the performance of distributed document clustering. In Chapter 5, fuzzy
ontology with the Non-Dominated Ranked Genetic Algorithm (NDRGA) is
delineated, and the contribution of the ACO algorithm, incorporated into
Ontology Generation using Fuzzy Logic (OGFL), is explained. Chapter 6
details fuzzy ontology based distributed document clustering with the ACO
algorithm for better results and performance. Chapter 7 draws concluding
remarks and outlines the future scope of the proposed approaches.
1.8 SUMMARY
This chapter dealt with some important features of clustering,
reviewed the research done in that area, and outlined the proposed
contributions.