CHAPTER 1
INTRODUCTION
1.1. CLUSTERING
Cluster analysis is a vibrant field of research in data mining. Its
fundamental idea is to use the salient characteristics of objects to measure
the degree of similarity among them and thereby achieve automatic
classification. In recent years, clustering of large data sets has become an
active and growing area of research. Document clustering (Jain and Dubes
1988) is important for efficient document organization, summarization, topic
extraction and information retrieval. Initially, clustering was applied to
enhance information retrieval techniques. Of late, clustering techniques have
been applied in areas such as browsing collections of gathered data or
categorizing the results returned by search engines in reply to user queries.
Document clustering is also applicable in producing hierarchical
groupings of documents (Ward 1963). In order to search and retrieve
information efficiently in Document Management Systems (DMS), a
metadata set with adequate details should be created for the documents. But a
single metadata set is not sufficient for an entire document management
system, because different document types need different attributes to be
distinguished appropriately.
Clustering of documents is thus the automatic grouping of text
documents into clusters such that documents within a cluster have high
resemblance to one another but differ from documents in other clusters.
Hierarchical document clustering (Murtagh 1984) organizes clusters into a
tree or hierarchy that facilitates browsing.
Information Retrieval (IR) (Baeza 1992) is the field of computer
science that focuses on the processing of documents in such a way that the
document can be quickly retrieved based on keywords specified in a user’s
query. IR technology is the foundation of web-based search engines and plays
a key role in biomedical research, as it is the basis of software that aids
literature search. Documents can be indexed by both words and concepts that
can be matched with domain-specific thesauri, known as concept matching.
For several decades, people have realized the importance of archiving and
finding information. With the arrival of computers, it has become necessary to
store a large amount of information and to allow users to gather and search
useful information from such collections. Several IR systems are used on an
everyday basis by a wide variety of users for various applications.
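To make the keyword-based retrieval described above concrete, the following minimal Python sketch builds an inverted index and answers a conjunctive keyword query. All function names and the sample documents are hypothetical illustrations, not part of any cited system.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "gene expression clustering",
    2: "document clustering for retrieval",
    3: "gene retrieval systems",
}
index = build_index(docs)
print(search(index, "gene retrieval"))  # → {3}
```

Real IR systems add tokenization, stemming and ranking on top of this basic structure, but the index-then-intersect pattern is the same.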
In recent years, digital libraries and web-accessible text
databases containing distributed high-dimensional data have created a need
for infrastructure that supports effective search in a Peer-to-Peer (P2P)
setting. P2P systems have become a promising solution for data management
in cases of a high degree of distribution. Unlike centralized systems or
conventional client-server architectures, nodes participating in a large P2P
network store and share data in an autonomous manner. Such nodes can be
information providers that neither reveal their full data to a server nor form
part of a centralized index.
Hence, the main challenge is to offer an effective and scalable
searching approach over highly distributed content without moving the actual
content away from the information providers. A further drawback is the lack
of global knowledge: each peer is aware of only a small portion of the
network topology, so content has to be handled carefully. The solutions
already proposed in the literature for such unstructured P2P environments
usually incur high costs, thus preventing the design of effective search over
P2P content.
Document clustering is becoming more and more important with the
abundance of text documents available through the World Wide Web and
corporate document management systems. However, there are still some major
drawbacks in the existing text clustering techniques that greatly affect their
practical applicability. These drawbacks are listed below:
Text clustering that yields a clear-cut output would be the
most favorable. However, people with different needs regard
documents differently with respect to the clustering of texts;
for example, a businessman does not look at business
documents in the same way as a technologist does (Macskassy
et al. 1998). Clustering tasks therefore depend on intrinsic
parameters that allow for a diversity of views.
Text clustering is a clustering task in a high-dimensional
space, where each word is seen as an important attribute of a
text. Empirical and mathematical analyses have revealed that
clustering in high-dimensional spaces is very complex, as
every data point tends to have nearly the same distance from
all the other data points (Beyer et al. 1999).
Text clustering is often of little use unless it is integrated with
a reason why particular texts are grouped into a particular
cluster. In practical settings, the explanation of why a cluster
was formed is often preferred over the cluster result itself.
One common technique for producing explanations is to learn
rules from the cluster results, but this technique suffers from
the high number of features chosen for computing clusters.
Even though several clustering approaches exist, most of
these feature-vector methods share the same principal
problems without really addressing the issues of subjectivity
and explanation. The researcher therefore intends to
investigate various aggregation levels; from that perspective,
the text documents will be used to obtain clustering results.
1.1.1 Feature selection
A common type of text processing in many information retrieval
systems depends on the study of word occurrences across a document
collection. The number of terms utilized by the system determines the
dimension of the vector space in which the analysis is carried out. Reducing
this dimension leads to major savings in computer resources and processing
time, but poor feature selection may radically degrade the performance of the
information retrieval system. Feature selection is a fundamental step in the
creation of a vector space or bag-of-words model (Berry and Browne 1999).
Especially when dividing a given document collection into clusters of similar
documents, the choice of significant features, together with good clustering
algorithms, is of supreme importance.
Calculating the similarity between documents is a vital step in
almost all document clustering algorithms. Usually, the computation cost is
proportional to the number of dimensions in the feature space, so
computational efficiency can be enhanced by reducing the feature space.
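As an illustration of this cost, a standard similarity measure over term-frequency vectors is the cosine similarity. The sketch below is a simplified illustration, not the method proposed in this thesis; note that its running time grows with the number of dimensions (distinct terms) the two documents use.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(round(cosine_similarity("fuzzy document clustering",
                              "document clustering methods"), 3))  # → 0.667
```

Shrinking the feature space shrinks the vectors, which is exactly why feature selection speeds up similarity computation.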
Recently, feature selection for document categorization has become an
attractive area of research. For document categorization, every word can be
weighted after training, so the less salient feature words can be filtered out.
Things are very different for document clustering, which is in essence
unsupervised: there are no training documents, so it is not easy to compute
the weight of every word. Although the feature words cannot be found
directly using class labels, the word distribution offers an alternative means
of finding them. Enlightened by this idea, a novel feature selection algorithm
is introduced in the present research study.
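One simple unsupervised heuristic in this spirit scores words purely by their document frequency: terms occurring in almost every document, or in almost none, discriminate poorly between clusters. The thresholds and sample data below are hypothetical illustrations; this is not the algorithm proposed in this thesis.

```python
from collections import Counter

def select_features(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms appearing in at least min_df documents but in no more
    than max_df_ratio of the collection: very rare terms are noise,
    near-ubiquitous terms do not separate clusters."""
    n = len(docs)
    df = Counter()
    for text in docs:
        for term in set(text.lower().split()):
            df[term] += 1
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [
    "fuzzy clustering of web documents",
    "fuzzy rules for web mining",
    "clustering web documents with ontologies",
    "genetic search over web data",
]
print(sorted(select_features(docs)))  # → ['clustering', 'documents', 'fuzzy']
```

Here "web" is dropped for appearing in every document and the singleton terms for appearing in only one, leaving the words that actually distinguish subsets of the collection.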
1.1.2 Ontology
The Internet has made it easy to access a huge collection of
information across the world. This facility has encouraged an increasing
demand for understanding how to combine multiple, heterogeneous
information sources. The research mainly focuses on the recognition and
integration of appropriate information to provide better knowledge of a
specific domain. The combination is predominantly helpful when it allows
communication among dissimilar sources without affecting their autonomy.
The difficulty of combining heterogeneous resources has been dealt with in
the literature, though so far only a few researchers have concentrated on
heterogeneity among resources other than databases. Dealing with different
information resources necessitates the formulation of suitable techniques.
Hence, ontologies have been developed to support the sharing and reuse of
information across all areas and tasks (Gomez Perez and Benjamins 1999).
The key task is to make the differences in semantics explicit.
An ontology is defined as a logical theory accounting for the
intended meaning of a formal vocabulary, specifically its ontological
commitment to a specific conceptualisation of the world (Guarino 1998).
Ontologies offer a generally agreed understanding of a domain that can be
reused and shared across several applications. Among the techniques for
combining heterogeneous sources presented in the literature, the main
attention here is directed to the architecture of multiple shared ontologies
suggested by Visser and Tamma (1999). This architecture aggregates
multiple shared ontologies into clusters, so as to obtain a structure that can
reconcile different types of heterogeneity. It is also intended to be more
convenient to implement and gives better prospects for maintenance and
scaling. Furthermore, such a structure is suitable for avoiding information
loss when performing translations between diverse resources.
Fuzzy ontology: Fuzzy ontologies are very useful on the Web.
Ontologies serve as a fundamental semantic infrastructure, offering a shared
understanding of particular domains across different applications so as to
assist machine understanding of Web resources. Moreover, an ontology can
also handle fuzzy and imprecise information, which is very important on the
Web (Shridharan et al. 2004). An ontology language, such as the Web
Ontology Language (OWL), is used to encode the ontology. An ontology is
far more complex than a taxonomy, which mostly focuses on the
classification of concepts in a domain.
Semantic web ontologies play a crucial role in making internet
resources easily accessible to users across different electronic formats.
OWL-based ontologies allow for searches based on semantic links rather
than string matching, which has made them widely popular. Consequently,
ontologies have become one of the most useful knowledge management tools
for efficient integration.
The Semantic Web has been criticized for not addressing uncertainty. In
order to address semantic meaning in an uncertain and inconsistent world,
fuzzy ontologies (Amy Trappey et al. 2009) have been proposed. In fuzzy
logic, reasoning is approximate rather than precise. The main purposes are to
avoid the theoretical pitfalls of monolithic ontologies, to assist
interoperability between different and independent ontologies (Cross 2004),
and to offer flexible information retrieval capabilities (Widyantoro and Yen
2001, Thomas and Sheth 2006). More recently, fuzzy ontology has become
popular and is being used in many applications, including classification and
clustering (Hotho et al. 2003). Fuzzy ontologies have the following
advantages:
Fuzzy ontology provides a better classification of a large,
vague database.
It can be initially applied to the database to reduce the
convergence time and number of iterations.
1.1.3 Genetic algorithm
Genetic Algorithms (GAs) are randomized search and optimization
approaches (Goldberg 1989), inspired by the principles of evolution and
natural genetics. They have a large amount of inherent parallelism. GAs
carry out search in complex, large search spaces and offer near-optimal
solutions for the objective or fitness function of an optimization problem.
A GA is usually implemented as a computer simulation in
which an optimization problem is specified. Members of the space of
candidate solutions, called individuals, are denoted through abstract
representations called chromosomes. The GA comprises an iterative process
that evolves a working set of individuals, called a population, toward an
objective function, or fitness function.
Conventionally, solutions are denoted using fixed-length
strings, in particular binary strings, though alternative encodings have been
developed. The evolution proceeds as follows: the GA begins from a
population of individuals created arbitrarily according to some probability
distribution, typically uniform, and renews this population in steps called
generations. In every generation, several individuals are randomly chosen
from the current population based on their fitness, bred using crossover, and
modified through mutation to create a new population.
• Crossover refers to the exchange of genetic material
(substrings) between chromosomes, which in machine
learning settings may encode rules or structural components.
• Selection uses the fitness function to choose the individuals
from a population that reproduce new offspring.
• Replication propagates individuals from one generation to the
next.
• Mutation is the random modification of individual
chromosomes.
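A minimal sketch of these operators on a toy bit-string objective (OneMax, which simply counts 1-bits and stands in for any fitness function) could look like this in Python. All parameter values are illustrative assumptions.

```python
import random

random.seed(42)

def fitness(chrom):
    """OneMax: count of 1-bits; a stand-in for any objective function."""
    return sum(chrom)

def select(pop):
    """Tournament selection: the fitter of two random individuals."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    """Single-point crossover exchanging tail substrings."""
    point = random.randrange(1, len(p1))
    return p1[:point] + p2[point:]

def mutate(chrom, rate=0.01):
    """Flip each bit with a small probability."""
    return [1 - g if random.random() < rate else g for g in chrom]

# Initial population: 30 random 20-bit chromosomes.
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for _ in range(60):  # generations
    pop = [mutate(crossover(select(pop), select(pop))) for _ in pop]

best = max(pop, key=fitness)
print(fitness(best))
```

After a few dozen generations the population converges toward the all-ones string; replacing `fitness` with any other objective turns this skeleton into a different optimizer.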
The existing literature (Goldberg 2002) includes several
common observations about the creation of solutions by means of GAs:
• GAs have difficulty adapting to dynamic concepts or
objective conditions. This phenomenon, called concept drift
in supervised learning and data mining, is problematic
because GAs are conventionally designed to evolve highly fit
solutions (populations comprising building blocks of high
relative and absolute fitness) for stationary concepts.
• GAs are not always efficient at discovering globally optimal
solutions, but they can quickly locate high-quality solutions,
even in complicated search spaces. This makes steady-state
GAs (and Bayesian optimization GAs, which assemble an
exact representation of building blocks after convergence)
helpful alternatives to generational GAs.
1.1.4 Ant colony optimization
In an ant colony, ants cooperate in the endeavour to find food.
Moving in groups, the ants ensure that the solution they find, the path to the
food, helps the whole colony (Dorigo et al. 1996). Ants leave a substance
known as pheromone to mark the path they have already traversed. Sensing
the pheromone, and preferring paths where there is more of it, the following
ants deposit further pheromone for others to follow. This positive-feedback
process, known as autocatalysis, maximizes the chances of finding food.
1.2 LITERATURE SURVEY
Document clustering is the process of categorizing text documents
into systematic clusters or groups, such that documents in the same cluster
are similar whereas documents in different clusters are dissimilar. It is one
of the vital processes in text mining. Liping (2005) emphasized that the
growth of the internet and of computational power has paved the way for
various clustering techniques. Text mining in particular has gained
importance; it demands various tasks, such as the production of granular
taxonomies and document summarization, for deriving higher-quality
information from text.
1.2.1 Clustering techniques
(i) Global K-means
Likas et al. (2003) proposed the global K-means clustering
technique, which creates initial centers by recursively dividing the data space
into disjoint subspaces using the K-dimensional tree approach. The cutting
hyperplane used in this approach is the plane perpendicular to the
maximum-variance axis derived by Principal Component Analysis (PCA).
Partitioning is carried out until each leaf node possesses fewer than a
predefined number of data instances or a predefined number of buckets has
been generated. The initial centers for K-means are the centroids of the data
present in the final buckets. Shehroz Khan and Amir Ahmad (2004)
stipulated iterative clustering techniques to calculate initial cluster centers for
K-means; this process is feasible for clustering techniques for continuous
data.
(ii) CLIQUE Algorithm
Agrawal et al. (2005) described data mining applications and the
requirements they place on clustering techniques. The main requirements
considered are the ability to identify clusters embedded in subspaces of
high-dimensional data, scalability, comprehensibility of the results by
end-users, and insensitivity to assumptions about the data distribution.
A clustering algorithm called CLIQUE fulfills all the above
requirements. CLIQUE finds dense clusters in subspaces of maximum
dimensionality. It produces cluster descriptions in the form of Disjunctive
Normal Form (DNF) expressions that are minimized for ease of
comprehension. The approach generates identical results regardless of the
order in which input records are presented, and it does not presume any
particular mathematical form for the data distribution. Experimental results
showed that the CLIQUE algorithm efficiently identified accurate clusters in
large high-dimensional datasets.
(iii) Enhanced K-means
The main limitation of the K-means approach is that it can generate
empty clusters, depending on the initial center vectors. This drawback does
not cause any significant problem for a single static execution of the
K-means algorithm, as it can be overcome by running the algorithm a
number of times. However, in some applications the empty-cluster issue
leads to erratic system behavior and affects overall performance. Malay
Pakhira et al. (2009) proposed a modified version of the K-means algorithm
that effectively eliminates this empty-cluster problem; in the experiments
done in this regard, the algorithm showed better performance than traditional
methods.
(iv) Heterogeneous Uncertainty Clustering Feature (H-UCF)
Uncertain heterogeneous data streams (Charu Aggarwal et al.
2003) are seen in many applications, but the clustering quality of the existing
approaches for clustering heterogeneous data streams under uncertainty is
not satisfactory. Guo-Yan Huang et al. (2010) posited an approach for
clustering heterogeneous data streams with uncertainty. A frequency
histogram based on H-UCF helps to trace the statistics of the categorical
attributes. Initially creating 'n' clusters by a K-prototype algorithm, the new
approach proves more useful than UMicro with regard to clustering quality.
(v) Hierarchical Particle Swarm Optimization (HPSO) clustering
Alam et al. (2010) designed a novel clustering algorithm, called
HPSO, by blending partitional and hierarchical clustering. It utilized swarm
intelligence in a decentralized environment and proved very effective, as it
performs clustering in a hierarchical manner.
(vi) Hybrid clustering approach (HCA)
Shin-Jye Lee et al. (2010) suggested a clustering-based method to
identify a fuzzy system. It first presents a modular approach based on a
hybrid clustering technique. Since finding the number and location of
clusters is the primary concern in evolving such a model, an HCA was
designed that takes input, output, generalization and specialization into
account. This three-part input-output clustering algorithm adopts several
clustering characteristics simultaneously to identify the problem.
(vii) Incremental clustering
Only a few researchers have focused on partitioning categorical
data in an incremental mode, and designing incremental clustering for
categorical data is a vital issue. Li Taoying et al. (2010) proposed an
incremental clustering for categorical data using a clustering ensemble. They
first reduced redundant attributes where required, then used the values of the
different attributes to form clustering memberships. A clustering ensemble
was then employed to merge or partition clusters to obtain an optimal
clustering. Finally, the proposed approach was applied to the yellow-small,
diagnosis and zoo data sets, and the results revealed its effectiveness.
(viii) Partitional clustering approach
Clustering approaches have been extensively used in the area of
pattern discovery (Nagy 1968) from Web Usage Data (WUD). In
e-commerce applications, clustering is used for purposes such as creating
marketing policies and product offerings. Raju et al. (2011) presented a novel
partition-based technique for dynamically grouping web users based on their
web access patterns, using the Adaptive Resonance Theory Neural Network
(ART1NN) clustering algorithm. The results show that the ART1NN
clustering technique outperforms the K-means and Self-Organizing Map
(SOM) clustering algorithms in terms of intra-cluster and inter-cluster
distances.
1.2.2 Text document clustering
(i) Data grabber
Crescenzi et al. (2004) presented an approach that automatically
extracts data from large data-intensive web sites. The "data grabber"
investigates a large web site and infers a scheme for it, describing the site as
a directed graph whose nodes represent classes of structurally similar pages
and whose arcs represent links between these pages. After locating the
classes of interest, a library of wrappers can be created, one per class, with
the help of an external wrapper generator, and in this way suitable data can
be extracted.
(ii) Link-Analysis and Text-mining toolbox (LATINO)
Miha Grcar et al. (2008) addressed the lack of software mining
techniques; software mining is the process of extracting knowledge from
source code. They presented a software mining approach that integrates text
mining and link analysis techniques, the latter being concerned with the links
between instances. Retrieval- and knowledge-based approaches are the two
main tasks used in constructing a tool for software components. An
ontology-learning framework named LATINO was developed by Grcar et al.
(2006); LATINO, an open-source general-purpose data mining platform,
offers text mining, link analysis, machine learning, etc.
(iii) Similarity and model based approaches
Similarity-based and model-based approaches (Meila and
Heckerman 2001) are the two major categories of clustering approaches, as
described by Pallav Roxy and Durga Toshniwal (2009). The former is a
pairwise-similarity clustering approach that maximizes average similarity
within clusters and minimizes it between clusters. The latter tries to learn a
model from the documents, each model representing one particular
document group.
Self-organizing maps (Kohonen et al. 2000), mixtures of Gaussians
(Tantrum et al. 2002), spherical K-means (Dhillon and Modha 2001),
bisecting K-means (Steinbach et al. 2000) and mixtures of multinomials
(Vaithyanathan and Dom 2000) are some of the newer techniques available
to improve clustering performance. K-means, an unsupervised learning
technique, is good at solving clustering problems by minimizing an objective
function. It defines K centroids, one for each cluster, positioned as far from
each other as possible. Each point is then assigned to its nearest centroid,
after which the K centroids are recomputed; this loop of assignment and
recomputation repeats until the centroids no longer change.
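The assignment and recomputation loop just described can be sketched as follows. The deterministic seeding and the toy 2-D data are illustrative choices, not part of any cited method.

```python
def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(cluster):
    """Centroid (component-wise mean) of a non-empty cluster."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans(points, k, iters=100):
    """Lloyd's K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = [points[0], points[-1]] if k == 2 else points[:k]  # simple seeding
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

In practice the initial centroids are chosen randomly or by a seeding scheme such as the global K-means idea above, which is precisely what the empty-cluster and initialization literature surveyed here tries to improve.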
(iv) Fuzzy clustering for text
Jiabin Deng et al. (2010) came up with an enhanced fuzzy text
clustering method relying on Fuzzy C-Means (FCM) and an edit-distance
approach. It performs feature estimation and reduces the dimensionality of
high-dimensional text vectors. In order to sustain the stability of the FCM
output, the researchers added aspects such as a high-power sample point set,
field of radius and weight. Since FCM has constraints such as
boundary-value attribution (Dave 1992), they recommended the edit-distance
approach. The results showed that the method was more precise and stable
than FCM with regard to text clustering.
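Standard FCM, on which the above method builds, replaces hard assignments with graded memberships that sum to one per point. The two-cluster, 1-D sketch below is a minimal illustration under assumed seeding and toy data, not the enhanced method described above.

```python
def fcm(points, m=2.0, iters=50):
    """Fuzzy C-Means for two clusters on 1-D data: each point receives a
    degree of membership in every cluster rather than a hard label."""
    centers = [min(points), max(points)]  # deterministic seeding for this sketch
    u = []
    for _ in range(iters):
        # Membership update: closer centers receive higher membership.
        u = []
        for x in points:
            d = [max(abs(x - v), 1e-9) for v in centers]
            u.append([1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(2))
                      for i in range(2)])
        # Center update: membership-weighted mean of the data.
        centers = [
            sum(u[j][i] ** m * points[j] for j in range(len(points)))
            / sum(u[j][i] ** m for j in range(len(points)))
            for i in range(2)
        ]
    return centers, u

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centers, memberships = fcm(points)
print([round(c, 1) for c in sorted(centers)])  # → [1.0, 8.0]
```

The fuzzifier `m` controls how soft the boundaries are; the boundary-value attribution problem cited above concerns points whose memberships stay nearly equal across clusters.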
Odukoya et al. (2010) developed an enhanced data clustering
approach for mining web documents, which formulates, simulates and
assesses the web documents with the intention of conserving their conceptual
relationships. The enhanced approach was formulated around the K-means
algorithm. Experimental results showed that this clustering approach attains
higher accuracy (89.3%) than the other existing approach (88.9%). The
entropy was constant for both approaches, with a value of 0.2485 at k = 3; it
decreases as the number of clusters increases until the number of clusters
reaches eight, where it increases to some extent. The adjusted Rand index
values range from 0 to 1 for both clustering approaches.
When the number of clusters was five, the existing approach
attained an adjusted Rand index value of 53%, compared with 63.7% for the
proposed approach. Additionally, the response time was reduced from
0.0451 seconds to 0.0439 seconds when the number of clusters was three,
confirming that the response time of the data clustering approach is about
2.7% lower than that of traditional K-means clustering. The study suggested
that the proposed approach could be used by developers of web search
engines for well-organized clustering of web search results.
(v) Semantic concepts
A majority of the available algorithms for Chinese text clustering
suffer from drawbacks in data scalability and in the interpretability of
results. Liu Jinling and Zhou Hong (2010) presented a well-organized
Chinese text clustering technique based on semantic concepts. The technique
significantly reduces the amount of data to be processed and enhances the
capability of the clustering approach. Experimental output showed that it
attains an acceptable clustering outcome and better implementation
efficiency.
(vi) Seeds Affinity Propagation
Renchu Guan et al. (2011) presented a novel semisupervised Seeds
Affinity Propagation (SAP) approach based on the Affinity Propagation
(AP) algorithm. The two most important contributions of this technique are:
Structural information of the texts is captured by a novel
similarity metric.
The semisupervised clustering process is enhanced by a seed
construction approach.
The experimental results reveal that the proposed similarity metric
is more effective for text clustering and that the proposed semisupervised
approach attains better clustering outcomes and faster convergence (using
only 76% of the iterations of the original AP). The entire SAP algorithm
attains a higher F-measure and lower entropy, considerably improves
clustering execution time (20 times faster) with respect to K-means, and
offers better robustness than other existing approaches.
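F-measure, entropy and purity, evaluation metrics of the kind cited above, can be computed from the true labels gathered per cluster. The sketch below shows purity and entropy on hypothetical label data; it is a generic illustration, not the evaluation code of any cited study.

```python
import math
from collections import Counter

def purity(clusters):
    """Fraction of documents carrying their cluster's majority label."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

def entropy(clusters):
    """Size-weighted average label entropy per cluster
    (lower is better: 0 means every cluster is pure)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        h = -sum((n / len(c)) * math.log2(n / len(c))
                 for n in Counter(c).values())
        result += (len(c) / total) * h
    return result

# Each inner list holds the true topic labels of the documents in one cluster.
clusters = [["sports", "sports", "news"], ["news", "news", "news"]]
print(round(purity(clusters), 3), round(entropy(clusters), 3))  # → 0.833 0.459
```

A higher purity and a lower entropy together indicate that each cluster is dominated by a single true topic, which is the sense in which SAP's results above are "better".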
1.2.3 Clustering in P2P environment
(i) K-means
Eisenhardt et al. (2003) used distributed clustering in a distributed
P2P network (Adjiman et al. 2006) involving the K-means approach.
Changing it to work in a distributed P2P fashion, Li and Morris (2002)
adopted the technique as a probe-and-echo system, which faces similar
problems from the information retrieval point of view. Starting from a
subset of a document group, this approach involved dividing document
groups centrally; after assigning a node to each cluster, documents were
grouped into their respective clusters.
To handle K-means clustering in a P2P network, Dhillon and Modha
(2000) presented novel techniques with a parallel implementation approach.
A K-means monitoring algorithm assists a centralized K-means process and
recalculates clusters by supervising the distribution of centroids across peers;
in case the data distribution changes over time, it starts reclustering. The
alternative, P2P K-means, works by updating the centroids at each peer
based on the information available from its immediate neighbors. The
algorithm terminates when this information fails to update the centroids of
the peers taken together.
The clustering approaches in use suffer from efficiency and
accuracy that diminish as the network size grows. However, an approach
developed by Qing He et al. (2010) used frequent term sets for P2P
networks. Owing to its lower communication volume, it achieved results
whose quality did not depend on the network size. It not only gave users a
better understanding of the clustering results but also enabled them to handle
local documents with respect to the entire network.
(ii) Self-organizing clustering
Zhao-Kui Li and Yan Wang (2010) resolved the topology mismatch
in a structured P2P network by using super nodes and self-organizing
clustering based on physical location data. Each node evaluated its own
ability to become a clustering header. This approach not only enhanced
routing performance but also reduced topology mismatch.
(iii) Hybrid P2P K-Medoids Method (P2PKMM)
A hybrid clustering approach, namely P2PKMM, evolved by
Zhongjun Deng et al. (2010), achieved remarkable results using a
decentralized approach in which each peer interacts only with peers that are
close in the network topology.
(iv) Correlation-based clustering
Yuan Li and Xia Shixiong (2010), dealing with the problems of
poor scalability and low search efficiency in unstructured P2P networks,
produced a correlation-based hierarchical P2P network approach. By
partitioning the network into clusters of nodes according to their correlation,
the approach attained considerable success in terms of query success rate and
query delay.
1.2.4 Analysis of fuzzy clustering
(i) Formal Concept Analysis (FCA)
Pollandt (1996) tried to unify fuzzy logic with FCA using the
L-Fuzzy context, which uses linguistic variables, usually terms connected to
fuzzy sets, to express the uncertainty in the context. However, these
variables need human interpretation, and the fuzzy concept lattice derived
from an L-Fuzzy context suffers from a combinatorial explosion of concepts.
FCA is considered an efficient technique for conceptual clustering and data
analysis; all the same, it is important to simplify the usually complex concept
lattices. Stumme et al. (2002) used association rules in relation to Iceberg
concept lattice clustering.
Formal Concept Analysis is also a proper technique for knowledge
representation. In the Iceberg concept lattice (Stumme et al. 2002),
association rules were used to cluster concepts on the lattice, and conceptual
scaling (Vogt and Wille 1994) was then used to generate the concept
hierarchy.
(ii) Fuzzy C-Medoids (FCMdd) clustering
Raghu Krishnapuram et al. (2001) presented new techniques,
namely Fuzzy C-Medoids (FCMdd) and Fuzzy C Trimmed Medoids, for
fuzzy clustering of relational data. The objective functions are based on
choosing 'c' representative objects (medoids) from the data set in such a way
that the total dissimilarity within each cluster is minimized. Experiments
comparing FCMdd with the Relational FCM (RFCM) point out that FCMdd
is much faster and more efficient.
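The medoid objective, picking 'c' data objects that minimize total within-cluster dissimilarity, can be illustrated by a tiny exhaustive search. Real FCMdd-style algorithms explore this space heuristically and fuzzily; the 1-D data and absolute-difference dissimilarity below are illustrative assumptions.

```python
from itertools import combinations

def total_cost(points, medoids):
    """Assign each object to its nearest medoid; cost is the sum of
    those dissimilarities (here: absolute differences)."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def k_medoids_exhaustive(points, c):
    """Toy exhaustive version of the medoid objective: pick the c data
    objects that minimize total within-cluster dissimilarity."""
    best = min(combinations(points, c), key=lambda m: total_cost(points, m))
    return list(best)

points = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(k_medoids_exhaustive(points, c=2))  # → [1.0, 5.0]
```

Unlike K-means centroids, medoids are always actual data objects, which is why the approach works directly on relational (dissimilarity-only) data.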
(iii) Inter-transaction fuzzy association rules
A technique to identify inter-transaction fuzzy association rules that
can capture the variations of events was proposed by Yo-Ping Huang and
Li-Jen Kao (2005). The proposed algorithm first maps a quantitative
attribute into several fuzzy attributes. A normalization process is carried out
to prevent the total contribution of the fuzzy attributes from being larger
than one. To mine inter-transaction fuzzy association rules, both the
dimensional attribute and sliding-window concepts were proposed in this
technique. A comprehensive comparative study of text document clustering
for the biomedical digital library MEDLINE was presented by Illhoi Yoo
and Xiaohua Hu (2006). These researchers conducted extensive experiments
based on various evaluation metrics, such as the misclassification index,
F-measure, cluster purity and entropy, on very large article sets from
MEDLINE. Document clustering can provide the advantages of better
document retrieval, document browsing and text mining in a digital library,
and ontologies can be used to address conventional information retrieval
problems such as synonymy, hypernymy and hyponymy.
Yeong-Chyi Lee et al. (2006) employed the concepts of a multiple-
level taxonomy and multiple minimum supports to identify fuzzy association
rules in a given quantitative transaction data set. Their approach judged the
significance of different items using different criteria, managed taxonomic
relationships among items, and handled quantitative data sets, three issues
that commonly arise in real mining applications. This fuzzy mining approach
generated large itemsets level by level and then derived fuzzy association
rules from the quantitative transaction data.
(iv) Cluster-based fuzzy-genetic
Chen et al. (2006) designed a novel approach, called the cluster-based
fuzzy-genetic mining technique, for extracting both fuzzy association rules
and membership functions from quantitative transactions. The technique
dynamically adjusted membership functions by GAs and used them to fuzzify
quantitative transactions. It also accelerated the evaluation procedure, while
keeping nearly the same solution quality, by clustering chromosomes; since
only the 1-itemsets need to be found during evaluation, the evaluation cost is
reduced, saving time.
(v) Fuzzy ontology
Jun Zhai et al. (2008) devised fuzzy ontology techniques based on
intuitionistic fuzzy sets, enabling knowledge sharing on the semantic web.
The ontology served as a standard for knowledge representation and
supported collaborative design on the semantic web. The researchers
proposed a series of fuzzy ontology techniques, including a fuzzy domain
ontology that used intuitionistic fuzzy sets and fuzzy linguistic terms, and
examined the relationships between fuzzy concepts, including order and
equivalence relations.
Hongwei Chen et al. (2009) employed concrete models: the fuzzy
comprehensive evaluation technique for a P2P-based trust system, the fuzzy
rank-ordering technique for the Blin algorithm, and a fuzzy inference system
for the Mamdani approach. This study showed that different fuzzy trust
techniques for P2P-based systems result in different fuzzy sets.
(vi) Fuzzy Transduction-based Clustering Approach (FTCA)
Matsumoto et al. (2010) built a prototype web search results
clustering engine that improved search results by performing fuzzy clustering
on web documents returned by traditional search engines, as well as ranking
the results and labeling the resulting clusters. This was done using the FTCA,
which employs a Transduction-based Relevance Model (TRM) to create
document relevance values. Experiments on five different data sets revealed
a significant clustering quality and performance advantage over the suffix
tree clustering and Lingo approaches.
(vii) Fuzzy similarity-based feature clustering
Feature clustering is a powerful technique for reducing the
dimensionality of feature vectors in text classification. Jung-Yi Jiang
et al. (2011) proposed a fuzzy similarity-based self-constructing approach
for feature clustering. With this approach, the derived membership functions
closely matched and appropriately described the real distribution of the
training data. Experimental observations made clear that the new approach
runs faster and obtains better extracted features than the other techniques.
1.2.5 K-means and GA for clustering
(i) Fast Genetic K-means cluster technique (FGKA)
Lu et al. (2004) laid the way for FGKA, a faster version of the
Genetic K-means Algorithm (GKA) of Krishna et al. (1999). FGKA offers
several improvements over GKA, including an efficient estimation of the
objective value, the Total Within-Cluster Variation (TWCV), a reduced
overhead for eliminating illegal strings, and a generalization of the mutation
operator. With these enhancements, FGKA runs about 20 times faster than
GKA. Even though FGKA outperforms GKA considerably, it has some
potential disadvantages: the cost of computing the centroids and the TWCV
from scratch is high. To overcome these difficulties of FGKA, Lu et al.
(2001) formulated the Incremental Genetic K-means Algorithm (IGKA).
IGKA inherits all the benefits of FGKA, including convergence to the global
optimum, and outperforms FGKA when the mutation probability is low.
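The TWCV objective that FGKA and IGKA optimize can be illustrated with a minimal sketch. This naive version recomputes everything from scratch, which is precisely the cost the incremental scheme avoids; the data layout here is assumed for illustration.

```python
def twcv(points, labels, k):
    """Total Within-Cluster Variation: the sum over clusters of squared
    Euclidean distances from each point to its cluster centroid."""
    dims = len(points[0])
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2
                     for p in members for d in range(dims))
    return total
```

A GA-based K-means uses such a value (or a transformation of it) as the fitness of each chromosome, where a chromosome encodes one labeling of the points.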
The main focus of IGKA is to compute the TWCV objective value
and the cluster centroids incrementally. When the mutation probability is
below a threshold, IGKA performs better than FGKA; when it is above the
threshold, FGKA performs better. Hence, the Hybrid Genetic K-means
Algorithm (HGKA) of Francisco (2005) combines the advantages of both
FGKA and IGKA and performs well at both higher and lower mutation
probabilities. These GA-based clustering methods, however, apply only to
numeric data sets because they depend exclusively on K-means.
Consequently, a Genetic Clustering Algorithm called GKMODE was
introduced, which combines the k-modes algorithm of Anil Chaturvedi et al.
(2001) with the genetic approach. GKMODE performs well on categorical
data, but it is difficult for it to handle mixed data containing both numeric
and categorical attributes.
(ii) Improved GA (IGA)
The genetic approach is a population-based search technique with
three main operators, namely crossover, mutation and selection. The
selection operator plays a vital role in discovering optimal solutions to
constrained optimization problems. Ke-Zong Tang et al. (2010) formulated
an IGA based on a novel selection approach designed to handle nonlinear
programming problems. Experimental observations show that the IGA offers
enhanced robustness, efficiency and stability.
(iii) Incremental Clustering using GA (ICGA)
Atul Kamble (2010) developed a new incremental clustering
technique based on GA. The approach can be applied to any database,
including data from metric spaces. Performance evaluation on a spatial
database demonstrated the effectiveness of the approach: the incremental
version was confirmed to provide better output than the other approaches,
and ICGA provides speed-up advantages that other clustering algorithms
lack.
(iv) Chaotic method GA (CGA)
Dharmendra Roy and Lokesh Sharma (2010) devised a clustering
approach based on a Genetic K-means model that performs well for data
containing both numeric and categorical attributes. Min-Yuan Cheng et al.
(2010) offered an n-dimensional convergence approach to advance the
conventional GA combined with the K-means clustering method. A chaotic
approach was exploited to keep the new method from premature
convergence. The approach not only maintains the fundamental search
capability, but also enhances the flexibility and effectiveness of the
parametric method, reaching the optimum with a probability of 84%. It has
been tested with many construction examples, and all the results confirm
that the method is very effective.
Ghaemi et al. (2010) introduced a new approach that used clustering
ensembles to enhance cluster partition fitness and to address the clustering
diversity problem. The average fitness of individuals generated by the
algorithm and by other clustering algorithms was calculated with three
different fitness functions and then compared. Experimental observations on
several benchmark data sets showed that the proposed algorithm improves
the cluster fitness of solutions.
Jia Zhen and Wang Yong Gui (2010) presented a genetic clustering
approach based on dynamic granularity. Exploiting the parallel random-search,
global-optimization and diversity features of GA, the method is
integrated with a dynamic granularity approach. In the process of changing
granularity, suitable granulation can be achieved by refining the granularity,
which improves the efficiency and accuracy of the clustering algorithm. The
experimental results indicated that the approach enhances the local search
ability and convergence speed of GA-based clustering.
(v) Genetic Fuzzy C Means
Chen Wei et al. (2010) proposed a technique that addresses the
drawbacks of the genetic Fuzzy C-Means (FCM), such as high classification
time and low clustering accuracy. The approach improves the crossover,
selection and mutation parts of the GA, strengthens its global searching
capability, and eliminates the difficulty of setting the genetic parameters.
The enhanced approach performs FCM optimization immediately after each
generation of genetic operations, increasing the convergence speed.
Consequently, the clustering accuracy, convergence speed and stability of the
approach are found to be better than those of conventional clustering
approaches.
Chen Rui et al. (2010) proposed an enhanced GA. The algorithm
maintains population diversity by performing similarity checks on the
population before selection. The approach resolves the premature
convergence problem of population evolution and presents a formula for the
mutation probability linked to the similarity rate and the number of
iterations. It not only retains a good diversity of the population, but also
ensures the convergence of the approach. Experiments were conducted on
the UCI WINE and IRIS data sets, and the results showed better performance
than the C-means clustering algorithm.
1.2.6 Ontology based clustering
Several approaches, such as Natural Language Processing (NLP)
combined with association rules (Maedche and Staab 2001), statistical
models (Faatz and Steinmetz 2002), and clustering (Bisson and Nedellec
2000), have been applied to create an ontology from a concept hierarchy
automatically or semi-automatically. Clustering is one of the most widely
used approaches for ontology learning. Furthermore, conceptual clustering
such as COBWEB can be used to create concept representations and
relationships for an ontology (Bisson and Nedellec 2001).
Andreas et al. (2002) adopted a clustering technique for text data
that uses background knowledge during preprocessing to improve the
clustering outcome and allows selection among alternative results. The input
data supplied to ontology-based heuristics was preprocessed for feature
selection and aggregation, so that various choices of text representation were
constructed. Based on these representations, multiple clustering outcomes
were calculated using K-means, and the results of the proposed approach
looked very promising. A Wordsets-based Document Clustering (WDC)
algorithm for large data sets was proposed by Sharma et al. (2009); it uses a
hierarchical technique to cluster text documents that share common words.
(i) Affinity-based similarity measure
An affinity-based similarity measure for web document clustering
was presented by Shyu et al. (2004). In this work, document clustering was
extended to web document clustering by establishing an affinity-based
similarity measure technique that uses user access patterns to find
similarities among web documents through a probabilistic model. An
evaluation on a real data set showed that the similarity measure
outperformed the cosine coefficient and the Euclidean distance technique.
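For reference, the two baseline measures that the affinity-based measure was compared against can be sketched as follows; this is a minimal illustration of the standard definitions, not of the affinity measure itself, which is based on user access patterns.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def euclidean_distance(a, b):
    """Straight-line distance between two term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note the difference: a document and a longer document with the same term proportions have cosine similarity 1 but a nonzero Euclidean distance, which is why cosine is usually preferred for documents of varying length.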
(ii) Document Index Graph (DIG)
Web document clustering using a document index graph was put forth
by Momin et al. (2006). Effective informative features such as phrases are
essential for more precise document clustering. The first part of the work
therefore provided a phrase-based model in which the DIG permits
incremental phrase-based encoding of documents, stressing the efficiency of
phrase-based similarity measures over conventional single-term similarities.
In the next step, a Document Index Graph Based Clustering (DIGBC)
technique was provided to extend the DIG model for incremental and soft
clustering. This technique incrementally clusters documents based on the
proposed cluster-document similarity measure and enables the assignment of
a document to more than one cluster.
(iii) Fuzzy named entity-based clustering
Cao et al. (2008) dealt with a fuzzy named-entity-based document
clustering technique, moving beyond the conventional keyword-based
method of document clustering, which is limited by its simplistic treatment
of words and its rigid partitional clusters. In this work, named entities, which
are important elements in defining document semantics, were introduced as
objects into fuzzy document clustering. Initially, the existing keyword-based
vector space representation was adapted with vectors defined over spaces of
entity names, types, name-type pairs and identifiers, as alternatives to
keywords. Hierarchical fuzzy document clustering can then be applied using
a similarity measure over the vectors representing documents.
Lena Tenenboim et al. (2008) presented a novel technique for
ontology-based classification. They discussed a prototype of a futuristic
personalized newspaper service on a mobile reading device, built around the
categorization of news items in e-Paper. The e-Paper system gathers news
items from different news suppliers and distributes to each subscribed user a
personalized electronic newspaper, making use of content-based and
collaborative filtering techniques. The e-Paper can also offer users a standard
version of chosen newspapers, besides allowing browsing in the warehouse
of news items. The work deliberates on the automatic categorization of
incoming news with the help of a hierarchical news ontology. Based on this
clustering technique and on the users' profiles, the personalization engine is
capable of delivering a personalized paper to every user on the mobile
reading device.
(iv) GeneticCA
Zhenya Zhang et al. (2008) presented a clustering aggregation
technique based on GA for document clustering. The technique, named
GeneticCA, addresses the clustering aggregation problem and estimates the
clustering performance of a clustering division. Clustering precision was
defined and its characteristics were analyzed. In the evaluation of GeneticCA,
a Hamming neural network was used to produce clustering divisions with
fluctuating and weak clustering performance.
(v) Model clustering
Haojun et al. (2008) endeavoured to develop a document clustering
method based on a hierarchical algorithm. The work analyzed and exploited
a cluster overlapping technique to design the cluster merging criterion.
Building on the hierarchical clustering technique, the expectation-
maximization (EM) method is used in a Gaussian mixture model to estimate
the parameters, and the two sub-clusters whose overlap is largest are merged.
(vi) Prototype-based genetic Approach
Gang Li et al. (2009) considered the difficulty that the classical
Euclidean distance metric cannot create an apt separation for data lying on a
manifold. They evolved a GA-based clustering method using a geodesic
distance measure. Compared with the generic K-means method, the
technique has the potential to distinguish complicated non-convex clusters,
and its clustering performance is clearly better than that of K-means for
complex manifold structures.
(vii) Latent Semantic Index (LSI)
Muflikhah et al. (2009) developed a document clustering technique
using concept space and cosine similarity measurement. They combined
information retrieval and document clustering techniques in a concept space
approach known as the Latent Semantic Index (LSI), which uses Singular
Value Decomposition (SVD) or PCA. The technique aims to reduce the
matrix dimension by identifying patterns (Bezdek 1981) in the document
collection with reference to terms. Each technique is applied to the term-
document weights of the vector space model (VSM) for document clustering
using the FCM technique. In addition to the reduction of the term-document
matrix, this research also utilizes cosine similarity measurement as an
alternative to Euclidean distance within FCM.
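The dimension-reduction step of LSI can be sketched with a truncated SVD on a toy term-document matrix. This is a generic illustration assuming NumPy is available; the matrix values and the choice k = 2 are hypothetical, not taken from the cited work.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

def lsi_document_vectors(A, k):
    """Project each document (column of A) into a k-dimensional latent
    concept space via truncated SVD: A ~ U_k S_k V_k^T, so the document
    coordinates are the columns of S_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = lsi_document_vectors(A, 2)
```

Documents sharing a latent topic end up close in the reduced space even when they share few exact terms, which is what makes the subsequent FCM clustering over concept space effective.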
ELdesoky et al. (2009) suggested a novel similarity measure for
document clustering based on topic phrases. In the conventional VSM,
researchers use the unique words contained in the document set as candidate
features. This work presented a new technique for evaluating the similarity
measure of the traditional VSM, taking the topic phrases of a document,
instead of conventional terms, as the terms of the VSM. The new technique
was applied to the Buckshot technique, a combination of Hierarchical
Agglomerative Clustering (HAC) and the K-means clustering method. Such
a method may increase the effectiveness of clustering, improving the
evaluation metric values.
Thaung et al. (2010) developed document clustering with the help of
the FCM algorithm. Most traditional clustering techniques allocate each data
point to exactly one cluster, thereby creating a crisp partition of the given
data, whereas fuzzy clustering permits degrees of membership by which a
data point can belong to several clusters. In this research, documents were
partitioned with the FCM clustering technique, a widely applied
unsupervised clustering method. However, FCM requires the user to specify
the number of clusters, and different numbers of clusters correspond to
different fuzzy partitions, so validation of the clustering result is required. In
this regard, the PBM index and F-measure are helpful for measuring cluster
quality.
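A minimal FCM sketch follows, showing the degrees of membership described above. The deterministic initialization is chosen here only for reproducibility; note that the number of clusters c must still be supplied by the user, which is exactly why cluster validation is needed.

```python
import math

def fcm(points, c, m=2.0, iters=50):
    """Minimal Fuzzy C-Means: alternate fuzzy membership updates and
    membership-weighted center updates (fuzzifier m > 1)."""
    n, dims = len(points), len(points[0])
    # Simple deterministic initialization: spread centers across the data.
    centers = [list(points[(i * n) // c]) for i in range(c)]
    U = [[0.0] * c for _ in range(n)]
    for _ in range(iters):
        # Membership update: the standard inverse-distance formula.
        for j, p in enumerate(points):
            d = [max(math.dist(p, centers[i]), 1e-12) for i in range(c)]
            for i in range(c):
                U[j][i] = 1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1))
                                    for k in range(c))
        # Center update: membership-weighted means.
        for i in range(c):
            w = [U[j][i] ** m for j in range(n)]
            s = sum(w)
            centers[i] = [sum(w[j] * points[j][t] for j in range(n)) / s
                          for t in range(dims)]
    return U, centers
```

Each row of U sums to 1, so a document near a cluster boundary receives comparable membership in several clusters instead of a hard assignment.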
Web Recommendation System (WRS) has several drawbacks, like
sparsity and the new user problem. Moreover, WRS does not make full use of
or harness the power of domain knowledge and semantic web ontologies.
Mabroukeh et al. (2011) discussed how an ontology-based WRS could use
relations and concepts in ontology, along with user-provided tags, to provide
top-n recommendations without the need for item clustering or user ratings.
For this reason, they suggested a dimensionality reduction technique based on
the domain ontology, to solve the sparsity problem.
Paliwal et al. (2011) addressed the problem of discovering web
services whose service descriptions lack explicit semantics yet match a
particular service request. Their semantics-based web service discovery
technique comprises semantics-based service categorization and semantic
enrichment of the service request. They presented a solution for obtaining
functional-level service categorization based on an ontology framework, and
used clustering to classify web services accurately according to service
functionality. The semantics-based classification is carried out off-line for
the Universal Description Discovery and Integration (UDDI) registry, while
the semantic enhancement of the service request achieves better matching
with relevant services.
1.2.7 Clustering in feature selection
Feature selection, the process of extracting the salient content of a
text document collection, requires content relationship factors, as does
document grouping. Marcelo N. Ribeiro et al. (2009) came up with a local
feature selection technique that enables partitional hierarchical text
clustering, in which each cluster is described by a distinct subset of features.
In a comparison between the local technique and a global feature selection
technique for bisecting K-means, the former showed significant precision
even with few selected terms.
Xu et al. (2007) described a new feature selection technique for text
clustering based on expectation maximization and cluster validity. By
applying a supervised feature selection technique to intermediate clustering
results, it achieved better feature selection for text clustering.
Meena Janaki et al. (2010) proposed a new feature selection technique based
on ant colony optimization, which originated from swarm intelligence. The
effectiveness of classifiers using the selected features was compared with
that obtained using the chi-square and CHIR techniques; the method
discovered better feature sets than the existing techniques.
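As a reference point for the chi-square baseline mentioned above, the per-term statistic can be computed from a 2x2 term/class contingency table. This is the standard formulation, not the CHIR variant, and the counts in the usage example are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 contingency table:
    n11 = docs in the class containing the term,
    n10 = docs outside the class containing the term,
    n01 = docs in the class lacking the term,
    n00 = docs outside the class lacking the term."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0
```

Terms are ranked by this score per class; a score of 0 means the term occurs independently of the class and carries no discriminative information.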
Shen Huang et al. (2006) suggested a novel feature co-selection
technique called Multitype Feature Co-selection for Clustering (MFCC) for
effective web document clustering. MFCC exploits intermediate clustering
results in one type of feature space to assist the selection in another type.
Carried out iteratively within an iterative clustering technique, this feature
co-selection yields better clusters in each space.
Sun Park et al. (2010) set forth document clustering approaches
using weighted semantic features and cluster similarity through Non-negative
Matrix Factorization (NMF). Exploiting the similarity between clusters and
documents brings several advantages: documents can effortlessly be
clustered around the major topics using clustering based on weighted
semantic features, and the quality of document clustering is improved. The
approach is more efficient than other document clustering techniques using
NMF. Therefore, by using these enhanced techniques and algorithms, the
efficiency of clustering approaches can be considerably improved.
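The NMF factorization underlying such approaches can be sketched with the classical Lee-Seung multiplicative updates. This is a generic illustration assuming NumPy, not Sun Park et al.'s weighted variant; the toy matrix and the choice k = 2 are hypothetical.

```python
import numpy as np

def nmf(V, k, iters=300, seed=0):
    """Lee-Seung multiplicative updates minimizing ||V - W H||_F with
    non-negative factors W (terms x k) and H (k x documents)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    eps = 1e-9  # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical term-document matrix with two topic blocks.
V = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0, 1.0]])
W, H = nmf(V, 2)
labels = H.argmax(axis=0)  # each document joins its strongest semantic feature
```

The columns of W act as non-negative semantic features (topics), and assigning each document to its largest coefficient in H yields the clusters.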
1.2.8 Ant colony optimization in clustering
Hui Fu (2008) hit upon a novel clustering technique with Ant
Colony Optimization (ACO) based on cluster center initialization. The
approach obtains initial cluster centers by different techniques and then
solves the clustering problem iteratively. Three cluster center initialization
techniques are used in the ACO-based clustering algorithm.
Xiaohua Wang et al. (2009) devised a novel hybrid ant colony and
agglomerative document clustering algorithm, based on the ant colony model
and agglomerative clustering algorithms. When the compacting algorithm
was applied, ants dropped their loads; then, by applying an evaluation-
function-based scheduling algorithm, ants picked up loads. Thus, the
agglomerative clustering algorithm was integrated into the iteration
procedure of the ACO clustering algorithm.
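The pick-up and drop behaviour these ant-based algorithms rely on can be illustrated with Lumer-Faieta-style probability functions. This is a generic sketch with hypothetical parameter values, not any cited author's exact implementation.

```python
def neighbourhood_fraction(item, neighbours, alpha=1.0):
    """Average similarity f of an item to the items in its grid
    neighbourhood; alpha scales the dissimilarity (Lumer-Faieta style)."""
    if not neighbours:
        return 0.0
    f = sum(1.0 - abs(item - n) / alpha for n in neighbours) / len(neighbours)
    return max(f, 0.0)

def pick_probability(f, k1=0.1):
    """An unladen ant picks up an item more readily when it is out of
    place (low neighbourhood similarity f)."""
    return (k1 / (k1 + f)) ** 2

def drop_probability(f, k2=0.15):
    """A laden ant drops its item more readily among similar items."""
    return (f / (k2 + f)) ** 2
```

Iterating these local decisions over a grid makes similar items accumulate into heaps, mirroring how real ants sort their cemeteries; the clustering then emerges without any global objective.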
Wu Bin et al. (2002) proposed CSIM, a document clustering
algorithm based on swarm intelligence and K-means. In the first phase, a
document clustering technique based on swarm intelligence, derived from a
basic model of how ant colonies organize their cemeteries, produces good
initial clusters. The classical K-means clustering technique is then applied
using these clusters as initial centers. CSIM thus incorporates significant
properties of both swarm intelligence and K-means.
Lijuan Jiao et al. (2010) derived a text classification algorithm based
on the ant colony algorithm, exploiting the effectiveness of ACO on discrete
problems and the discreteness of text document features. Texts are
categorized by the crawling of class-population ants, which carry class
information and identify the optimal matching path during the iterations.
From the literature review, it is observed that peer-to-peer document
clustering techniques are highly effective and offer good speed-up. Similarly,
optimization techniques, including ant colony optimization, have been found
to be effective for text document clustering. However, P2P document
clustering combined with improved optimization techniques has not been
investigated in the literature. Optimization techniques lead to faster
convergence and improved speed-up in distributed systems, which are
crucial for text mining.
1.3 SCOPE OF THE RESEARCH
As clustering plays a vital role in various applications, much
research is still being carried out. Upcoming innovations stem mainly from
the properties and characteristics of existing methods, which form the basis
for the various innovations in the field of clustering. From the existing
clustering techniques, it is clearly observed that clustering techniques based
on GA, fuzzy logic and ontology provide significant results and
performance. Hence, this research concentrates mainly on semantic
clustering based on GA, NDRGA, ACO and fuzzy ontology clustering for
better performance. The thesis aims to fulfill the objectives of various
cluster-related algorithms by developing an effective and efficient clustering
technique that is expected to give good accuracy and performance.
1.4 PROBLEM DEFINITION
Text document clustering approaches are beset with many problems,
namely accuracy, time duration in large databases, selection of initial
clusters (Shehroz Khan and Amir Ahmad 2004), high objective function
values, low silhouette coefficients, low F-measures, etc. Document clustering
in a large database (Sharma and Dhir 2009) becomes a complex issue for
users. Moreover, owing to this complexity, the required information is not
obtained from the clustering results, and the accuracy of the clustering
results becomes a problem in a large database. So an efficient method needs
to be found that improves the stability of P2P networks and provides
clustering accuracy as well.
1.5 OBJECTIVES
The research study aims to develop an improved document
clustering technique with high classification accuracy. Existing clustering
techniques require long convergence times and a very high number of
iterations, so new and efficient techniques are needed to relax these
constraints. Moreover, in distributed data mining, the adoption of a flat node
distribution technique adversely impacts scalability. Hence, the proposed
techniques are formulated to address these issues of modularity, flexibility
and scalability. The main objectives of this research include:
• developing a novel technique for distributed document
clustering using ontology which provides very high
classification accuracy.
• reducing the clustering time taken by clustering approaches in
the P2P environment.
• minimizing the convergence time and the number of
iterations.
The time required for clustering documents increases when large
databases are taken up for clustering. Likewise, depending on how the initial
clusters are determined, different clusters can result for the same data set.
The proposed clustering algorithms involve grouping electronic documents,
extracting important content from the document collection and supporting
effective management of digital library documents. The contents of digital
documents are analyzed and grouped into various categories.
1.6 METHODOLOGY
A novel approach providing higher efficiency and accuracy is
needed in the area of document clustering. The research study deals
primarily with five proposed approaches for document clustering. They
are:
1. Distributed clustering with feature selection for text
documents based on ontology.
2. Ontology based hierarchical distributed document clustering.
3. A Genetic Fuzzy Ontology Model (GFOM) for distributed
document clustering.
4. A novel clustering approach using Fuzzy Ontology with Non-
Dominated Ranked Genetic Algorithm (FONDRGA).
5. Enhanced distributed document clustering with the help of
Fuzzy Ontology and Ant Colony Optimization (FOACO)
Algorithm.
Each of the proposed document clustering methods is evaluated on
the following criteria:
• Classification accuracy
• Objective function
• Classification time
• Algorithm complexity
• Speed-up
• Convergence behavior
• Silhouette coefficient
• F-measure
• Entropy
• Separation index
The improvements achieved in those performance measures have
been tested for statistical significance using t-test.
1.7 OUTLINE OF THE THESIS
Each of the remaining chapters of the thesis deals with approaches
to semantic clustering, ontologies and optimization methods based on soft
computing techniques. Chapter 2 describes ontology-based distributed
clustering with feature selection for text documents for better cluster
performance. The Semantic Enhanced Hierarchically Distributed P2P
Clustering (SEHP2PC) is explained in Chapter 3, where term relationships
such as meaning (synonymy), part-of (meronymy) and kind-of (hypernymy)
are utilized to cluster the documents.
In Chapter 4, a genetic fuzzy ontology model is proposed to improve
the performance of distributed document clustering. In Chapter 5, fuzzy
ontology with the Non-Dominated Ranked Genetic Algorithm (NDRGA) is
delineated, and the contribution of the ACO algorithm, incorporated into
Ontology Generation using Fuzzy Logic (OGFL), is explained. Chapter 6
details fuzzy ontology based distributed document clustering with the ACO
algorithm for better results and performance. Chapter 7 draws concluding
remarks and outlines the future scope of the proposed approaches.
1.8 SUMMARY
This chapter dealt with some important features of clustering,
reviewed the research done in that area, and outlined the proposed
contributions.