
Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster Summarization

Khaled M. Hammouda and Mohamed S. Kamel, Fellow, IEEE

Abstract: In distributed data mining, adopting a flat node distribution model can affect scalability. To address the problems of modularity, flexibility, and scalability, we propose a Hierarchically distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm. The architecture is based on a multilayer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher level neighborhoods. Within a certain level of the hierarchy, peers cooperate within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighborhoods, solve each part individually using a distributed K-means variant, then successively combine clusterings up the hierarchy, where increasingly more global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing interpretation of the clusters. Results show decent speedup, reaching 165 times faster than centralized clustering for a 250-node simulated network, with comparable clustering quality to the centralized approach. We also provide a comparison to the P2P K-means algorithm and show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized counterparts with up to 88 percent accuracy.

Index Terms: Distributed data mining, distributed document clustering, hierarchical peer-to-peer networks.

    1 INTRODUCTION

A shift toward distributed data mining (DDM) has been taking place in the data mining community since the mid-1990s, when it was realized that analyzing massive data sets, which often span different sites, using traditional centralized approaches can be intractable. In addition, DDM is being fueled by recent advances in grid infrastructures and distributed computing platforms.

Huge data sets are being collected daily in different fields, e.g., retail chains, banking, biomedicine, astronomy, and so forth, but it is still extremely difficult to draw conclusions or make decisions based on the collective characteristics of such disparate data.

Four main approaches to performing DDM can be identified. A common approach is to bring the data to a central site, then apply centralized data mining to the collected data. Such an approach clearly suffers from a huge communication and computation cost to pool and mine the global data. In addition, we cannot preserve data privacy in such scenarios.

A smarter approach is to perform local mining at each site to produce a local model. All local models can then be transmitted to a central site that combines them into a global model [1], [2], [3]. Ensemble methods also fall into this category [4]. While this approach may not scale well with the number of sites, it is a better solution than pooling the data.

Another approach is for each site to carefully select a small set of representative data objects and transmit it to a central site, which combines the local representatives into one global representative data set. Data mining can then be carried out on the global representative data set [5], [6].

All three previous approaches involve a central site to facilitate the DDM process. A fourth, more radical approach does not involve centralized operation, and thus belongs to the peer-to-peer (P2P) class of algorithms. P2P networks can be unstructured or structured. Unstructured networks are formed arbitrarily by establishing and dropping links over time, and they usually suffer from flooding of traffic to resolve certain requests. Structured networks, on the other hand, make an assumption about the network topology and implement a certain protocol that exploits that topology.

In P2P DDM, sites communicate directly with each other to perform the data mining task [7], [8], [9], [10]. Communication in P2P DDM can be very costly if care is not taken to localize traffic instead of relying on flooding of control or data messages.

Regardless of any particular DDM approach, the distributed nature of the DDM concept itself usually entails a tradeoff between accuracy and scalability. If better accuracy is desired, the granularity of the information exchanged between distributed nodes should become finer and/or the connectedness of the nodes should be increased. On the other hand, if better scalability is desired, the granularity should be coarser and/or the connectedness should be reduced.


Bandyopadhyay et al. [11] derive, for their P2P K-means algorithm, an upper bound on the clustering error incurred while computing the distributed solution, which measures the degree to which accuracy has been sacrificed in exchange for lower communication cost. The same relation is evident in many other DDM methods, since the prevalent approaches employ approximate rather than exact algorithms.

In this paper, we introduce an approach for distributed data clustering based on a structured P2P network architecture. The goal is to achieve a flexible DDM model that can be tailored to various scenarios. The proposed model is called Hierarchically distributed P2P Clustering (HP2PC). It involves a hierarchy of P2P neighborhoods, in which the peers in each neighborhood are responsible for building a clustering solution, using P2P communication, based on the data they have access to. As we move up the hierarchy, clusters from lower levels are merged. At the root of the hierarchy, one global clustering can be derived.

The model deviates from the standard definition of P2P networks, which typically involve a loose structure (or no structure at all) based on peer connections that are created and dropped frequently. The HP2PC model, on the other hand, is based on a static hierarchical structure that is designed up front, upon which the peer network is formed. We plan to introduce a dynamic structure extension to this model in future work.

Using the HP2PC model, we can partition the problem in a modular way, solve each part individually, then successively combine solutions if a global solution is desired. This way, we avoid two problems in the current state of the art in DDM: 1) the high communication cost usually associated with a structured, fully connected network, and 2) the uncertainty in the network topology usually introduced by unstructured P2P networks. Experiments on document clustering show that we can achieve results comparable to centralized clustering with a high gain in speedup.

The model lends itself to real-world structures, such as hierarchically distributed organizations or government agencies. In such a scenario, different departments or branches can perform local clustering to draw conclusions from local data. Parent departments or organizations can combine the results from those at lower levels to draw conclusions from a more holistic view of the data.

In addition, when applied to document clustering, we provide interpretation of the distributed document clusters using a distributed keyphrase extraction algorithm, which is a distributed variant of the CorePhrase single-cluster summarization algorithm. The algorithm finds the core phrases within a distributed document cluster by iteratively intersecting relevant keyphrases between nodes in a neighborhood. Once it converges to a set of core phrases, they are attached to the cluster as an interpretation of its contents.

This paper is organized as follows: Section 2 provides some background on the topic and identifies related work. Section 3 introduces the HP2PC distributed architecture, and Section 4 discusses the foundation behind the HP2PC distributed clustering algorithm. Section 5 discusses the distributed cluster summarization algorithm. Section 6 presents experimental results and discussion. Finally, conclusions and future directions are presented in Section 7.

    2 BACKGROUND AND RELATED WORK

DDM started to gain attention during the late 1990s. Although it is still a young area of research, the body of literature on DDM constitutes a sizeable portion of the broader data mining literature.

Basic definitions. Data mining in distributed environments is known as DDM, and sometimes as Distributed Knowledge Discovery (DKD). The central assumption in DDM is that data are distributed over a number of sites and that it is desirable to derive, through data mining techniques, a global model that reflects the characteristics of the whole data set.

A number of (often conflicting) challenges arise when developing DDM methods:

- communication model and complexity,
- quality of the global model, and
- privacy of local data.

It is desirable to develop methods that have low communication complexity, especially in mobile applications such as sensor networks, where communication consumes battery power. The quality of the global model derived from the data should be equal or comparable to that of a model derived using a centralized method. Finally, in some situations where local data are sensitive and not easily shared, it is desirable to achieve a certain level of privacy for the local data while deriving the global model.

Although not yet formally proven, deriving high-quality models usually requires sharing as much data as possible, thus incurring higher communication cost and sacrificing privacy at the same time.

Homogeneous versus heterogeneous distributed data. We can differentiate between two types of data distribution. The first is homogeneous, where data are partitioned horizontally across the sites; i.e., each site holds a subset of the data. The second is heterogeneous, where data are partitioned vertically; i.e., each site holds a subset of the attribute space, and the data are linked among sites via a common key.

Exact versus approximate DDM algorithms. A DDM algorithm can be described as either exact or approximate. Exact algorithms produce a final model identical to a hypothetical model generated by a centralized process having access to the full data set.


    Fig. 1. Exact distributed clustering model.


Fig. 1 illustrates the hypothetical process that is modeled by an exact distributed clustering algorithm. The exact algorithm works as if the data subsets, D_i, from each node were first brought together into one data set, D, and a centralized clustering algorithm, A, then performed the clustering procedure on the whole data set. The clustering solutions are then distributed again by intersecting the data subsets with the global clustering solution.

Approximate algorithms, on the other hand, produce a model that closely approximates a centrally generated model. Most DDM research studies focus on approximate algorithms, as they tend to produce results comparable to exact algorithms with far less complexity [3].

Communication models. Communication between nodes in distributed clustering algorithms can be categorized into three classes (in increasing order of communication cost): 1) communicating models, 2) communicating representatives, and 3) communicating actual data. The first case involves calculating local models that are then sent to peers or to a central site. Models are often composed of cluster centroids, e.g., P2P K-means [9], cluster dendrograms, e.g., RACHET [1], or generative models, e.g., DMC [2]. In the second case, nodes select a number of representative samples of the local data to be sent to a central site for global model generation, as is the case in the KDEC distributed clustering algorithm [6] and the DBDC algorithm [5]. The last model of communication is for nodes to exchange actual data objects; i.e., data objects can change sites to facilitate the construction of clusters that exist only at certain sites, as is the case in the collaborative clustering scheme in [12].

Applications. Applications of DDM are numerous and are usually manifested as distributed computing projects, which often try to solve problems in mathematics and science. Specific areas and sample projects include astronomy (SETI@home), biology (Folding@home, Predictor@home), climate change (CPDN), physics (LHC@home), cryptography (distributed.net), and biomedicine (grid.org). These projects are usually built on top of a common platform providing low-level services for distributed or grid computing. Examples of such platforms include the Berkeley Open Infrastructure for Network Computing (BOINC), Grid.org, World Community Grid, and Data Mining Grid.

Text mining. Applications of DDM in the text mining area are rare but usually employ a form of distributed information retrieval. Distributed text classification and clustering have received little attention. PADMA is an early example of parallel text classification [13].

The work presented by Eisenhardt et al. [7] achieves document clustering using a distributed P2P network. They use the K-means clustering algorithm, modified to work in a distributed P2P fashion using a probe-and-echo mechanism, and report speedup over centralized clustering. Their algorithm is an exact algorithm, although it requires global synchronization at each iteration.

A similar system can be found in [14], but there the problem is posed from the information retrieval point of view. In that work, a subset of the document collection is centrally partitioned into clusters, for which cluster signatures are created. Each cluster is then assigned to a node, and later documents are classified to their respective clusters by comparing their signatures with all cluster signatures. Queries are handled in the same way: they are directed from a root node to the node handling the cluster most similar to the query.

State of the art. In the latest issue of IEEE Internet Computing [15] (at the time of writing this paper), a few algorithms were presented representing the state of the art in DDM. Datta et al. [10] described an exact local algorithm for monitoring a K-means clustering (originally proposed by Wolff et al. [16]), as well as an approximate local K-means clustering algorithm for P2P networks (originally proposed by Datta et al. [8], [9]).

Although the K-means monitoring algorithm does not produce a distributed clustering, it helps a centralized K-means process know when to recompute the clusters by monitoring the distribution of centroids across peers and triggering a reclustering if the data distribution significantly changes over time.

The P2P K-means algorithm in [8] and [9], on the other hand, works by updating the centroids at each peer based on information received from its immediate neighbors. The algorithm terminates when the information received no longer results in a significant update to the centroids at any peer.

The P2P K-means algorithm finds its roots in a parallel implementation of K-means proposed by Dhillon and Modha [17].

    3 THE HP2PC DISTRIBUTED ARCHITECTURE

HP2PC is a hierarchically distributed P2P architecture for scalable distributed clustering of horizontally partitioned data. We argue that a scalable distributed clustering system (or any data mining system, for that matter) should involve hierarchical distribution. A hierarchical processing strategy allows for delegation of responsibility and modularity.

Central to this hierarchical architecture design is the formation of neighborhoods. A neighborhood is a group of peers forming a logical unit of isolation in an otherwise unrestricted open P2P network. Peers in a neighborhood can communicate directly, but not with peers in other neighborhoods. Each neighborhood has a supernode. Communication between neighborhoods is achieved through their respective supernodes. This model reduces the flooding problems usually encountered in large P2P networks.

The notion of a neighborhood accompanied by a supernode can be applied recursively to construct a multilevel overlay hierarchy of peers; i.e., a group of supernodes can form a higher level neighborhood, which can communicate with other neighborhoods on the same level of the hierarchy through their respective (higher level) supernodes. This type of hierarchy is illustrated in Fig. 2.

    3.1 Notations

Symbol - Description

- P: a set of peers comprising a P2P network.
- p_i: peer i in P.
- Q: a set of neighborhoods.
- q_j: neighborhood j in Q.


- sp_j: the peer designated as the supernode of q_j.
- H: the hierarchy height of an HP2PC network.
- h: a specific level within an HP2PC network.
- P^h: the set of peers at level h.
- Q^h: the set of neighborhoods at level h.
- λ: network partitioning factor.
- D: data set.
- D_i: horizontally partitioned subset of D residing at peer i.
- d_i: data object i in D.
- C: a set of clusters.
- c_k: cluster k in C.
- m_k: mean (centroid) of cluster k.
- s_k: the set of pairwise similarities within cluster k.
- H_k: histogram of the similarity distribution within cluster k.
- δ_k: skew of the similarity histogram of cluster k.
- w_k: weight of cluster k.
- σ_k: summary of cluster k.
- K: number of clusters.
- I: number of algorithm iterations.
- n(·): cardinality of a set or vector.

    3.2 Hierarchical Overlays

A P2P network is comprised of a set of peers, or nodes, P = {p_i}, 1 ≤ i ≤ n(P), where n(P) is the number of nodes in the network. An overlay network is a logical network on top of P that connects a certain subset of the nodes in P. Most work in the P2P literature refers to a single level of overlay; i.e., there exists one overlay network on top of the original P2P network. In our work, this concept is extended further to allow multiple overlay levels on top of P.

To distinguish between the different overlays, we use the level of an overlay network. An overlay network at level h is denoted P^h. The lowest level network (the physical P2P network) is at level h = 0 and is thus denoted P^0, while the highest possible overlay is at level H, denoted P^H, and consists of a single node (the root of the overlay hierarchy). In subsequent formulations, we drop the superscript h when referring to an arbitrary level; otherwise, we use it to distinguish network levels.

The size of the overlay network at each level differs according to how many peers comprise the overlay. However, since a higher level overlay is always a subset of its immediate lower level overlay, we maintain the following inequality:

0 < n(P^h) < n(P^{h-1}), ∀ h > 0.    (1)

The choice of which subset of peers forms the next level overlay is closely related to the peer neighborhoods discussed in Section 3.3.

    3.3 Neighborhoods

We divide a network overlay into a set of neighborhoods, Q. A neighborhood, q_j ∈ Q, comprises a subset of the peers of an overlay P and has a designated peer known as a supernode, sp_j; thus

Neighborhood_j = (q_j, sp_j) : q_j ⊆ P, sp_j ∈ q_j.

The following neighborhood properties are enforced in the HP2PC architecture:

- The set of neighborhoods, Q = {q_j}, 1 ≤ j ≤ n(Q), covers the overlay network P:

  P = ∪_{j=1}^{n(Q)} q_j.

- Neighborhoods do not overlap:

  ∀ i, j, i ≠ j : q_i ∩ q_j = ∅.

- Every node must belong to some neighborhood:

  ∀ p ∈ P, ∃ q_j ∈ Q : p ∈ q_j.

A network partitioning factor, λ ∈ ℝ, 0 ≤ λ ≤ 1, partitions the P2P network into equally sized neighborhoods. The number of neighborhoods in an overlay network as a function of λ is then given by

n(Q) = ⌊1 + λ(n(P) − 1)⌋.    (2)

At one extreme, when λ = 0, there is only one neighborhood, which contains all the peers in P. At the other extreme, when λ = 1, there are n(P) neighborhoods, each containing only one peer. In between, we can set λ to a value that determines the number of neighborhoods and, consequently, the size of each neighborhood.

An initial attempt to determine the size of each neighborhood as a fraction of the size of the network was to use

n(q_j) = ⌊n(P)/n(Q)⌋,                   for 1 ≤ j ≤ n(Q) − 1,
n(q_j) = n(P) − Σ_{i=1}^{n(Q)−1} n(q_i), for j = n(Q).    (3)

However, this only works well when n(P) ≫ n(Q). When n(P) and n(Q) are of the same order of magnitude, n(P)/n(Q) tends to be a small real number. The problem is that such a small number causes all neighborhoods, except the n(Q)-th, to have a very small number of nodes, while the n(Q)-th absorbs the remaining (large number of) nodes. For example, if n(P) is 250 and n(Q) is 150, then we have 149 neighborhoods of size 1 and one neighborhood of size 101.

To solve this problem, we use a binary function to generate the neighborhood sizes based on a specified probability for each value. Let r represent this probability, given by

r = n(P)/n(Q) − ⌊n(P)/n(Q)⌋.


    Fig. 2. The HP2PC hierarchy architecture.


The following function then generates two neighborhood sizes based on r:

n(q_j) = ⌊n(P)/n(Q)⌋, with probability 1 − r;
n(q_j) = ⌈n(P)/n(Q)⌉, with probability r.    (4)

This method produces a more even distribution of neighborhood sizes, even if n(P) and n(Q) are of the same order of magnitude. In the above example, using (4) we have n(P)/n(Q) = 1.67 and r = 0.67, so 67 percent of the neighborhoods will be of size 2 and 33 percent will be of size 1, which is far more balanced than with (3).
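As a concrete illustration (our own sketch, not code from the paper), the following Python snippet draws neighborhood sizes according to (4): each neighborhood gets the floor or the ceiling of n(P)/n(Q) with probability 1 − r and r, respectively.

    import math
    import random

    def neighborhood_sizes(n_p, n_q, seed=None):
        # Probabilistic floor/ceiling rule of Eq. (4).
        rng = random.Random(seed)
        ratio = n_p / n_q
        r = ratio - math.floor(ratio)            # fractional part of n(P)/n(Q)
        return [math.ceil(ratio) if rng.random() < r else math.floor(ratio)
                for _ in range(n_q)]

    # Example from the text: n(P) = 250, n(Q) = 150, so r = 0.67 and roughly
    # 67 percent of the neighborhoods get size 2 while 33 percent get size 1.
    sizes = neighborhood_sizes(250, 150, seed=0)
    print(sizes.count(2), sizes.count(1))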

All peers at the same level, h, of the hierarchy are denoted by P^h. Let the function level(p) determine the level of a peer; i.e., level(p^h) = h. A peer p_i can communicate with a peer p_j if and only if level(p_i) = level(p_j) and both belong to the same neighborhood q_l.

Peer hierarchy formation is bottom-up, so the lowest level of the hierarchy is h = 0. The supernodes of level 0 neighborhoods form the overlay network at level h = 1. Recursively, at level h = 2 are the supernodes of level 1 neighborhoods (groups of level 1 supernodes). The root supernode is at level H, the height of the hierarchy; i.e., there exists exactly one node in P^H.

The network partitioning factor, λ, can be different for different levels of the hierarchy; i.e., the neighborhood count and size are not necessarily the same at each level. If we apply the same network partitioning factor to every level, we can deduce the height, H, of the hierarchy. We can approximate (2) as

n(Q) ≈ λ n(P).

Since the number of nodes at a certain level is equal to the number of supernodes (or neighborhoods) at the level below, we can say that

n(P^h) = λ n(P^{h-1}), ∀ h > 0.

At the top of the hierarchy (level H), we have one node, so

λ^H n(P) = 1,    (5)

from which we can deduce H:

H = ⌈ log n(P) / log(1/λ) ⌉, for 0 < λ < 1;
H = 1, for λ = 0;
H = ∞, for λ = 1.    (6)

If, however, the partitioning factor is chosen to be different for different levels of the hierarchy, then we cannot deduce the full height of the hierarchy up front. However, we can deduce the maximum height reachable from a certain level using (6), and hence we can iteratively calculate the full hierarchy height if the λ of every level is known a priori.

If, instead of specifying the λ's, a certain hierarchy height is desired, we can calculate the proper λ (the same for all levels) using the following equation, which is derived from (6):

λ = e^{−log(n(P))/H}, for H > 1;
λ = 0, for H = 1.    (7)
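The relation between λ and H can be sketched in a few lines of Python (our illustration; the function names are ours):

    import math

    def hierarchy_height(n_p, lam):
        # Eq. (6): height when the same partitioning factor is used at every level.
        if lam == 0:
            return 1
        if lam == 1:
            return math.inf
        return math.ceil(math.log(n_p) / math.log(1 / lam))

    def partitioning_factor(n_p, height):
        # Eq. (7): factor that yields a desired hierarchy height.
        return 0.0 if height == 1 else math.exp(-math.log(n_p) / height)

    # 250 peers with lam = 0.2 give a hierarchy of height 4;
    # conversely, a desired height of 4 calls for lam of about 0.25.
    print(hierarchy_height(250, 0.2), round(partitioning_factor(250, 4), 2))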

    3.4 Example

Fig. 3 illustrates the HP2PC architecture with an example. The network shown consists of 16 nodes and four hierarchy levels. The set of nodes at level 0, P^0, is divided into four neighborhoods, subject to the network partitioning factor λ^0 = 0.2. Each supernode of level 0 becomes a regular node at level 1, forming the set of four nodes P^1. Those in turn are grouped into two neighborhoods forming Q^1, satisfying λ^1 = 0.33. At level 2, only one neighborhood is formed out of the level 1 supernodes, satisfying λ^2 = 0. Finally, the root of the hierarchy is at level 3.

    3.5 HP2PC Network Construction

An HP2PC network is constructed recursively, starting from level 0 up to the height of the hierarchy, H. The number of neighborhoods and the size of each neighborhood are controlled through the partitioning factor λ, which is specified for each level of the hierarchy (except the root level).

The construction process is given in Algorithm 1.


    Fig. 3. Example of an HP2PC network.


Given the initial set of nodes, P^0, and the set of partitioning factors B = {λ^h}, 0 ≤ h ≤ H − 1, the algorithm recursively constructs the network. At each level, we partition the current P^h into the proper number of neighborhoods and assign a supernode to each one. The set of supernodes at a certain level forms the set of nodes for the next higher level, which is passed to the next recursive call. Construction stops when the root is reached.

Algorithm 1 HP2PC Construction
Input: P^0, B = {λ^h}, 0 ≤ h ≤ H − 1
Output: {P^h}, {Q^h}, 0 ≤ h ≤ H

1: for h = 0 to H − 1 do
2:   n(P^h) ← |P^h|
3:   Calculate n(Q^h) by substituting n(P^h) and λ^h into (2)
4:   Q^h ← ∅, P^{h+1} ← ∅
5:   a ← 1, b ← 1   {partition P^h into n(Q^h) neighborhoods}
6:   for j = 1 to n(Q^h) do
7:     Calculate n(q_j) using (4)
8:     b ← b + n(q_j)
9:     q_j ← {p^h_i}, a ≤ i ≤ b
10:    a ← b + 1
11:    Add q_j to Q^h
12:    sp_j ← first node in q_j   {supernode for q_j}
13:    Add sp_j to P^{h+1}
14:  end for
15: end for
16: Q^H ← P^H   {root node}
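A loose Python rendering of Algorithm 1 is given below (our sketch: peers are plain integers, the first member of each neighborhood acts as its supernode, and the probabilistic size rule of (4) is simplified into a deterministic split that yields the same two sizes).

    import math

    def build_hp2pc(peers, lams):
        # `peers` is the list of level-0 node ids; `lams[h]` is the
        # partitioning factor used at level h. Returns the neighborhoods
        # of every level and the root node.
        levels = []
        current = list(peers)
        for lam in lams:
            n_p = len(current)
            n_q = math.floor(1 + lam * (n_p - 1))        # Eq. (2)
            base, extra = divmod(n_p, n_q)               # sizes differ by at most one (cf. Eq. (4))
            neighborhoods, a = [], 0
            for j in range(n_q):
                size = base + 1 if j < extra else base
                members = current[a:a + size]
                a += size
                neighborhoods.append((members, members[0]))  # first member acts as supernode
            levels.append(neighborhoods)
            current = [sp for _, sp in neighborhoods]    # supernodes form the next level up
        return levels, current[0]

    levels, root = build_hp2pc(range(16), [0.2, 0.34, 0.0])
    print([len(level) for level in levels], root)        # -> [4, 2, 1] and the root node id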

4 THE HP2PC DISTRIBUTED CLUSTERING ALGORITHM

The HP2PC algorithm is a distributed iterative clustering process. It is a centroid-based clustering algorithm, in which a set of cluster centroids is generated to describe the clustering solution. In HP2PC, each neighborhood converges to a set of centroids that describe the data set in that neighborhood. The distributed clustering strategy within a single neighborhood is similar to the parallel K-means algorithm [17] in that the final set of centroids of a neighborhood will be identical to those produced by centralized K-means on the data within that neighborhood. Other neighborhoods, either on the same level or at higher levels of the hierarchy, may converge to a different set of centroids.

Once a neighborhood converges to a set of centroids, those centroids are acquired by the supernode of that neighborhood. The supernode, in turn, as part of its higher level neighborhood, collaborates with its peers to form a set of centroids for its neighborhood. This process continues hierarchically until a set of centroids is generated at the root of the hierarchy.

    4.1 Estimating Clustering Quality

The distributed search for cluster centroids is guided by a cluster quality measure that estimates intracluster cohesiveness and intercluster separation.

Cluster cohesiveness. The distribution of pairwise similarities within a cluster is represented using a cluster similarity histogram, which is a concise statistical representation of the cluster tightness [18].

Let sim(·,·) be a similarity measure between two objects, and s_k be the set of pairwise similarities between the objects of cluster c_k:

n(s_k) = n(c_k)(n(c_k) − 1) / 2,    (8a)
s_k = { s_l : 1 ≤ l ≤ n(s_k) },    (8b)
s_l = sim(d_i, d_j), d_i, d_j ∈ c_k.    (8c)

The histogram of the similarities in cluster c_k is represented as

H_k = { h_i : 1 ≤ i ≤ B },    (9a)
h_i = count(s_l),    (9b)
(i − 1)·b ≤ s_l < i·b, s_l ∈ s_k,    (9c)

where

- B is the number of histogram bins,
- h_i is the count of similarities falling in bin i, and
- b is the bin width of the histogram.

To estimate the cohesiveness of cluster c_k, we calculate the skew of its histogram. Skew is the (standardized) third central moment of a distribution; it tells us whether one tail of the distribution is longer than the other. A positive skew indicates a longer tail in the positive direction (the higher interval of the histogram), while a negative skew indicates a longer tail in the negative (lower interval) direction. A similarity histogram that is negatively skewed therefore indicates a tight cluster.

Skew is calculated as

skew = Σ_i (x_i − μ)³ / (N σ³),    (10)

so the tightness of a cluster c_k, calculated as the skew of its histogram H_k, is

δ(c_k) = skew(H_k) = Σ_l (s_l − μ_{s_k})³ / (n(s_k) σ³_{s_k}), s_l ∈ s_k.    (11)

A clustering quality measure based on the skewness of the similarity histograms of the individual clusters can be derived as a weighted average of the individual cluster skews:

Δ(C) = Σ_k n(c_k) δ(c_k) / Σ_k n(c_k), c_k ∈ C.    (12)
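A minimal sketch of the cohesiveness measure (our code, assuming the caller supplies a pairwise similarity function over plain Python objects):

    import math

    def skew(values):
        # Standardized third moment of a list of similarities (Eq. (10)).
        n = len(values)
        mu = sum(values) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
        if sigma == 0:
            return 0.0
        return sum((v - mu) ** 3 for v in values) / (n * sigma ** 3)

    def cluster_quality(clusters, similarity):
        # Weighted average of per-cluster histogram skews (Eqs. (11)-(12)).
        # `clusters` is a list of lists of objects; `similarity(a, b)` returns
        # the pairwise similarity of two objects.
        total, weight = 0.0, 0
        for c in clusters:
            sims = [similarity(c[i], c[j])
                    for i in range(len(c)) for j in range(i)]
            if sims:
                total += len(c) * skew(sims)
                weight += len(c)
        return total / weight if weight else 0.0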

4.2 Distributed Clustering (Level h = 0)

We define a general function for updating cluster models in a fully connected neighborhood:

c^{i,t} = f({c^{j,t−1}}), i, j ∈ q,    (13a)
c^{i,0} = c_0,    (13b)

where c^{i,t} is the clustering model (a set of clusters) calculated by peer i at iteration t, and f is an aggregating function. The equation is illustrated by Fig. 4, where the output of each peer at iteration t depends on the models calculated by all other peers at iteration t − 1.


In P2P K-means [8], f = avg and the neighborhood is defined by the ad hoc network topology.

Algorithm 2 shows the iterative HP2PC process for updating the cluster centroids at each node. Utility routines for the main algorithm are given in Algorithm 3. For each neighborhood, an initial set of centroids is calculated by the supernode and transmitted to all peers in the neighborhood. As in K-means, during each iteration, each peer assigns its local data to the nearest centroids and calculates its new centroids. In addition, it also calculates the cluster skews using (11).

Algorithm 2 Level 0 Clustering
Input: number of clusters K
Output: set of clusters c^r for each neighborhood q_r in Q^0

1: for all q_r ∈ Q^0 do
2:   for all p_i ∈ q_r do
3:     D_i ← data set at p_i
4:     s^i ← CalcPairwiseSimilarity(D_i)
5:     {m^r, w^r} ← ReceiveFrom(sp_r)   {sp_r: supernode of q_r}
6:     ∀ j ≠ r : m^j ← 0, w^j ← 0
7:     {m^i, w^i} ← UpdateClusters(D_i, {m^i})
8:     while change in {m^i} > ε do
9:       for all p_j ∈ q_r, j ≠ i do
10:        SendTo(p_j, {m^i, w^i})
11:        {m^j, w^j} ← ReceiveFrom(p_j)
12:      end for
13:      for k = 1 to K do
14:        m^i_k ← 0, w^i_k ← 0
15:        for j = 1 to n(q_r) do
16:          m^i_k ← m^i_k + m^j_k w^j_k
17:          w^i_k ← w^i_k + w^j_k
18:        end for
19:        m^i_k ← m^i_k / w^i_k
20:      end for
21:      {m^i, w^i} ← UpdateClusters(D_i, {m^i})
22:    end while
23:  end for
24:  if p_i = sp_r then
25:    c^r ← c^i   {set of clusters stored at supernode}
26:  end if
27: end for

Algorithm 3 Utility Routines for Level 0 Clustering

Function UpdateClusters(D_i, {m^i})
1: c^i ← ∅
2: for all d_j ∈ D_i do
3:   l ← argmin_k { |d_j − m_k| }
4:   Add d_j to c^i_l
5: end for
6: for all c^i_k do
7:   m^i_k ← avg(d_j), j ∈ c^i_k
8:   s^i_k ← subset of s^i indexed by the objects in c^i_k
9:   δ^i_k ← CalcSkew(s^i_k)
10:  w^i_k ← δ^i_k · n(c^i_k)
11: end for
12: return {m^i, w^i}

Function CalcSkew(s)
1: δ ← 0, μ ← avg(s), σ ← stddev(s)
2: for all s_l ∈ s do
3:   δ ← δ + (s_l − μ)³
4: end for
5: δ ← δ / (n(s) σ³)
6: return δ

Function CalcSimilarityMatrix(D)
1: s ← ∅
2: for i = 1 to n(D) do
3:   for j = 0 to i − 1 do
4:     Add sim(d_i, d_j) to s
5:   end for
6: end for
7: return s

The final set of centroids for each iteration is calculated from all peer centroids. Unlike K-means (or P2P K-means), these final centroids are weighted by the skew and size of the clusters at the individual peers. The weight of a cluster c_k at peer p_j is defined as

w^j_k = δ(c^j_k) · n(c^j_k).

At the end of each iteration, each node transmits the cluster centroids and their corresponding weights to all its peers:

c^{j,t} = {m^j_k, w^j_k}_t, k ∈ [1, K].

At peer p_i, the centroid of cluster k is updated according to the following equation, which favors tight and dense clusters:

m^{i,t}_k = Σ_j w^{j,t−1}_k m^{j,t−1}_k / Σ_j w^{j,t−1}_k, j ∈ q_r.    (14)

This is followed by assigning objects to their nearest centroids and calculating the new set of cluster skews, {δ^i_k}, and sizes, {n(c^i_k)}, which are used in the next iteration. The algorithm terminates when the object assignment does not change, or when ∀ i, k : ||m^{i,t}_k − m^{i,t−1}_k|| < ε, where ε is a sufficiently small parameter.
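The skew- and size-weighted centroid update of (14) can be sketched as follows (our illustration; a peer's model is a list of centroid vectors plus one weight per cluster):

    def merge_neighborhood_centroids(peer_models, num_clusters, dim):
        # Eq. (14): weighted average of the peers' centroids, where the weight
        # of cluster k at a peer is its skew times its size.
        merged = []
        for k in range(num_clusters):
            total_w = sum(w[k] for _, w in peer_models)
            if total_w == 0:
                merged.append([0.0] * dim)
                continue
            merged.append([
                sum(w[k] * m[k][d] for m, w in peer_models) / total_w
                for d in range(dim)
            ])
        return merged

    # Two peers, two 2-D clusters; the tighter/denser cluster (larger w)
    # pulls the merged centroid toward its own estimate.
    peers = [([[0.0, 0.0], [4.0, 4.0]], [3.0, 1.0]),
             ([[1.0, 1.0], [5.0, 5.0]], [1.0, 3.0])]
    print(merge_neighborhood_centroids(peers, num_clusters=2, dim=2))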

    4.3 Distributed Clustering (Level h > 0)

Once a neighborhood converges to a set of clusters, the centroids and weights of those clusters are acquired by the supernode as its initial set of clusters; i.e., for neighborhood q_r with supernode sp_r,

c^{r,0,h} = c^{r,T,h−1},

where T is the final iteration of the algorithm at level h − 1 for neighborhood q_r.


    Fig. 4. Iterative level 0 neighborhood clustering.


Since at level h of the hierarchy the actual data objects are not available, we rely on metaclustering: merging the clusters using centroid and weight information alone. At levels h > 0, clusters are merged in a bottom-up fashion up to the root of the hierarchy; i.e., c^h = f(c^{h−1}). This means that once a neighborhood at level h converges to a set of clusters, it is frozen, and the higher level clustering is invoked. (A more elaborate technique would involve bidirectional traffic, making c^{h−1} = f(c^h) as well, but the complexity of this approach could be prohibitive, so we leave it for future work.)

A neighborhood at level h consists of a set of peers, each having a set of K centroids. To merge those clusters, the centroids are collected and clustered at the supernode of this neighborhood using K-means clustering. This process repeats until one set of clusters is computed at the root of the hierarchy. The formal procedure representing this clustering process is presented in Algorithm 4.

Algorithm 4 HP2PC Clustering
1: for all q_i ∈ Q^0 do
2:   {m^i}^0 ← NeighborhoodCluster(q_i)
3: end for
4: for h = 1 to H do
5:   for all q_i ∈ Q^h do
6:     for all p_j ∈ q_i do
7:       {m^j} ← {m^j}^{h−1}
8:       SendTo(sp_i, {m^j})
9:     end for
10:    {m^i}^h ← K-means({m^j})   {only at peer sp_i}
11:  end for
12: end for
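The metaclustering step (line 10 of Algorithm 4) amounts to pooling the child centroids at the supernode and running ordinary K-means over them; a self-contained sketch (our code, with a deliberately simple Lloyd-style K-means):

    import random

    def kmeans(points, k, iters=50, seed=0):
        # Plain Lloyd-style K-means over a list of equal-length vectors.
        rng = random.Random(seed)
        centroids = rng.sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                    for a, b in zip(p, centroids[c])))
                groups[j].append(p)
            centroids = [
                [sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
                for j, g in enumerate(groups)
            ]
        return centroids

    def metacluster(peer_centroid_sets, k):
        # Merge the K centroids reported by each child peer by clustering
        # the pooled centroids at the supernode.
        pooled = [c for cset in peer_centroid_sets for c in cset]
        return kmeans(pooled, k)

    # Two child peers each report two 2-D centroids; the supernode merges
    # them back into two higher-level clusters.
    print(metacluster([[[0.1, 0.0], [5.0, 5.1]],
                       [[0.0, 0.2], [4.9, 5.0]]], k=2))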

For comparison against the baseline K-means algorithm, we calculate the centroids of each neighborhood using centralized K-means and then compare them against the HP2PC centroids with respect to the merged data set of the respective neighborhood (Algorithm 5). For neighborhoods at level 0, the merged data set is the union of the data from all nodes in the neighborhood. For those at higher levels, the merged data set is the union of all data reachable from every node in the neighborhood through its respective lower level nodes. The EvaluateAndCompare function evaluates both the centralized and distributed solutions and compares them against each other, as reported in the various experiments in Section 6.

Algorithm 5 Centralized K-means Clustering Comparison
1: for h = 0 to H do
2:   for all q_i ∈ Q^h do
3:     D_i ← ∪_{j ∈ q_i} D_j   {merged data set of neighborhood q_i}
4:     {m^i}^h_ctr ← kmeans(D_i)
5:     EvaluateAndCompare({m^i}^h_ctr, {m^i}^h, D_i)
6:   end for
7: end for

One of the major benefits of this algorithm is the ability to zoom in to more refined clusters by descending the hierarchy and to zoom out to more generalized clusters by ascending it. The other major benefit is the ability to merge a forest of independent hierarchies into one hierarchy by putting all the roots of the forest into one neighborhood and invoking the merge algorithm on that neighborhood.

    4.4 Complexity Analysis

We divide the complexity of HP2PC into computational complexity and communication complexity.

    4.4.1 Computation Complexity

Assume the size of the entire data set across all nodes is |D|. The data set is equally partitioned among the nodes, so each node holds D_P = |D| / n(P) data objects. For level 0, we have n(Q) neighborhoods, each of size S_Q = n(P)/n(Q).

Each node has to compute a pairwise similarity matrix before it begins the P2P clustering process, requiring D_P(D_P − 1)/2 similarity computations. During each iteration, each node computes a new set of K centroids by averaging all neighborhood centroids (K · S_Q operations), assigns the data objects to those centroids (K · D_P), recomputes the centroids based on the new data assignment (D_P), and calculates the skew of the clusters (D_P(D_P − 1)/2). These requirements are summarized as

T_sim = D_P(D_P − 1)/2,
T_update = K(S^0_Q + D_P) + D_P,
T_skew = D_P(D_P − 1)/2.

Let the number of iterations required to converge to a solution be I. Then, the total number of computations required by each node to converge is

T_P = T_sim + I(T_update + T_skew).    (15)

If we assume that D_P ≫ 1 and I ≫ 1, then we can rewrite (15) as

T_P ≈ I ( K(S^0_Q + D_P) + D_P² / 2 ).    (16)

For levels above 0, each neighborhood is responsible for metaclustering a set of K·S_Q centroids into K centroids using K-means. Then, for each neighborhood at level h, the required computation is

T^h = I · S^h_Q · K².    (17)

Since the neighborhood computations are done in parallel, we need only T^h computations per level. However, since computations at higher levels of the hierarchy have to wait for lower levels to complete, we must sum T^h over all levels. The total computation required for all levels above 0 is thus

T^H = Σ_{h=1}^{H−1} T^h.    (18)

Finally, we can combine (16) and (18) to find the total computation complexity of HP2PC:

T = T_P + T^H.    (19)

It can be seen that the computation complexity is largely affected by the data set size at each node (D_P). By increasing the total number of nodes, we can decrease D_P (since the data are equally partitioned among nodes), but at the expense of increasing the communication complexity, as well as decreasing clustering quality due to fragmentation of the data set.
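For a rough feel of (15)-(19), the following sketch (our own; it assumes equal data partitioning and the same neighborhood size at every level, which simplifies the paper's setup) counts the dominant operations per node:

    def hp2pc_node_computation(d_total, n_p, n_q, k, iters, height):
        # Back-of-the-envelope operation count per Eqs. (15)-(19).
        d_p = d_total / n_p                 # objects per node
        s_q = n_p / n_q                     # level-0 neighborhood size
        t_sim = d_p * (d_p - 1) / 2
        t_update = k * (s_q + d_p) + d_p
        t_skew = d_p * (d_p - 1) / 2
        t_level0 = t_sim + iters * (t_update + t_skew)               # Eq. (15)
        t_upper = sum(iters * s_q * k ** 2 for _ in range(1, height))  # Eq. (18)
        return t_level0 + t_upper                                    # Eq. (19)

    # e.g., 20,000 documents over 100 nodes, 20 neighborhoods, K = 20 clusters
    print(f"{hp2pc_node_computation(20000, 100, 20, 20, iters=10, height=3):,.0f}")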


    4.4.2 Communication Complexity

In neighborhoods at level 0, at each iteration every peer sends S_Q − 1 messages to its neighbors, each message of size K. For S_Q peers in one neighborhood, S_Q(S_Q − 1) messages are exchanged. The communication complexity for n(Q^0) neighborhoods at level 0 is then

M^0 ≈ n(Q^0) (S^0_Q)² I K.    (20)

Since n(Q) = n(P)/S_Q, we have

M^0 ≈ n(P^0) S^0_Q I K.    (21)

For levels above 0, each neighborhood requires S_Q − 1 messages to be sent to its supernode, each message of size K. The communication complexity for n(Q^h) neighborhoods at level h is then

M^h ≈ n(Q^h) S^h_Q K ≈ n(P^h) K.    (22)

The total communication requirement for HP2PC is then

M ≈ M^0 + Σ_{h=1}^{H−1} M^h.    (23)

We can see that the communication complexity is greatly influenced by the size of the neighborhoods at level 0. The worst case is when all nodes are put in one neighborhood, resulting in quadratic complexity (in terms of the number of nodes). As we adopt more fine-grained neighborhoods, we can reduce both the computation and communication complexities, but at the expense of clustering quality, as will be discussed in the results section.

5 DISTRIBUTED CLUSTER SUMMARIZATION USING KEYPHRASE EXTRACTION

Summarizing the clusters generated by HP2PC using CorePhrase poses two challenges. First, since CorePhrase works by intersecting the documents in a cluster with one another, generating a summary for a document cluster that is distributed across various nodes cannot be done directly; CorePhrase thus needs modification to work in this kind of environment. Second, merging cluster summaries up the hierarchy requires working with keyphrases extracted at level 0 only, without any access to the actual documents.

    5.1 Single Cluster Summarization

The summary of a document cluster is represented as a set of core keyphrases that describe the topic of the cluster. The CorePhrase [19] keyphrase extraction algorithm is used for this purpose. CorePhrase works by first constructing a list of candidate keyphrases for each cluster, scoring each candidate keyphrase according to its features, ranking the keyphrases by score, and finally selecting a number of the top-ranking keyphrases for output. We briefly review CorePhrase here for completeness.

Extraction of candidate keyphrases. Candidate keyphrases naturally lie at the intersection of the documents in the cluster. The CorePhrase algorithm compares every pair of documents to extract matching phrases. This process of matching every pair of documents is inherently O(n²). However, using a document phrase indexing graph structure, known as the Document Index Graph (DIG), the algorithm can achieve this goal in near-linear time [20].

In essence, what the DIG model does is to keep a cumulative graph representing the currently processed documents: G_i = G_{i−1} ∪ g_i, where g_i is the subgraph representation of a new document. Upon introducing a new document, its subgraph is matched with the existing cumulative graph to extract the matching phrases between the new document and all previous documents. That is, the list of matching phrases between document d_i and previous documents is given by M_i = g_i ∩ G_{i−1}. The graph maintains complete phrase structure identifying the containing document and phrase location, so cycles can be uniquely identified. This process produces complete phrase-matching output between every pair of documents in near-linear time, with arbitrary-length phrases. Fig. 5 illustrates the process of phrase matching between two documents: the subgraphs of the two documents are matched to obtain the list of phrases shared between them.

Fig. 5. Phrase matching using DIG.

Phrase features. Quantitative features are needed to judge the quality of the candidate keyphrases. Each candidate keyphrase p is assigned the following features:

- df: document frequency; the number of documents in which the phrase appears, normalized by the total number of documents:

  df = |documents containing p| / |all documents|.

- w: average weight; the average weight of the phrase over all documents. The weight of a phrase in a document is calculated using structural text cues. For example, title phrases have maximum weight, section headings are weighted less, and body text is weighted lowest.

- pf: average phrase frequency; the average number of times the phrase appears in one document, normalized by the length of the document in words:


  pf = avg( |occurrences of p| / |words in document| ).

- d: average phrase depth; the location of the first occurrence of the phrase in the document:

  d = avg( 1 − |words before first occurrence| / |words in document| ).

Phrase ranking. Phrase features are used to calculate a score for each phrase. Phrases are then ranked by score, and a number of the top phrases are selected as the ones describing the cluster topic. The score of each phrase p is

score(p) = √(w · pf) · d² · log(1 + df).    (24)

The equation is derived from the tf-idf term weighting measure; however, we reward phrases that appear in more documents (high df) rather than penalizing them. By examining the distribution of the values of each feature in a typical corpus, it was found that the weight and frequency features usually have low values compared to the depth feature. To take this fact into account, it was necessary to expand the weight and frequency features by taking their square root and to compact the depth feature by squaring it. This helps even out the feature distributions and prevents one feature from dominating the score equation.
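Read this way, (24) can be computed as below (our sketch of the scoring formula as we reconstruct it from the garbled original; feature values are assumed to be normalized to [0, 1]):

    import math

    def corephrase_score(w, pf, d, df):
        # Square root expands the (typically small) weight and frequency
        # features, squaring compacts the depth feature, and log(1 + df)
        # rewards phrases that appear in many documents.
        return math.sqrt(w * pf) * d ** 2 * math.log(1 + df)

    # A title phrase appearing early in 80% of the cluster's documents beats
    # a body phrase appearing late in 20% of them.
    print(corephrase_score(w=1.0, pf=0.02, d=0.9, df=0.8),
          corephrase_score(w=0.2, pf=0.02, d=0.3, df=0.2))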

Word weight-based score assignment. Another method for scoring phrases was also used, which is based on individual word weights. This method will be referred to as CorePhrase-M (a sketch follows the list below):

- First, assign initial scores to each phrase based on the phrase scoring formula given above.
- Construct a list of the unique individual words occurring in the candidate phrases.
- For each word, add up the scores of all the phrases in which the word appears to create a word weight.
- For each phrase, assign the final phrase score by adding up the word weights of its constituent words and averaging them.
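A compact sketch of the CorePhrase-M rescoring steps above (our code; phrases are whitespace-delimited strings mapped to their initial scores):

    from collections import defaultdict

    def corephrase_m_rescore(phrase_scores):
        # Accumulate word weights from the initial phrase scores, then
        # rescore each phrase as the average weight of its words.
        word_weight = defaultdict(float)
        for phrase, score in phrase_scores.items():
            for word in phrase.split():
                word_weight[word] += score
        return {
            phrase: sum(word_weight[w] for w in phrase.split()) / len(phrase.split())
            for phrase in phrase_scores
        }

    print(corephrase_m_rescore({"peer to peer clustering": 2.0,
                                "distributed clustering": 1.5,
                                "document clustering": 1.0}))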

    5.2 Summarizing Level 0 Clusters

Let every node p_i generate a cluster summary σ^i_k (a set of keyphrases) for each cluster c^i_k using CorePhrase. Nodes in the same neighborhood then enter multiple rounds of communication to agree on a common summary for each cluster. For cluster c^i_k, on each round, node p_i receives a cluster summary σ^j_k, j ∈ q, from every other node in its neighborhood. Node p_i then produces two sets of keyphrases based on the σ^j_k: one is called the core summary, σ^q_k, and the other the local cluster summary, σ^i_k. The core summary is generated by intersecting all the keyphrases in {σ^j_k, σ^i_k}:

σ^q_k = ∩_{j ∈ q} σ^j_k.    (25)

The local cluster summary is generated by intersecting the summaries from the other nodes with the local documents of cluster c^i_k:

σ^i_k = ( ∪_{j ∈ q, j ≠ i} σ^j_k ) ∩ D_{i,k}.    (26)

Note that, by definition, the core summary is the same at all nodes, since the operation is identical at every node. The local cluster summary, however, differs from node to node due to the intersection with local documents. On the next iteration, each node sends its local cluster summary to all other nodes. The local cluster summaries are intersected together again according to (25), and the result is appended to the core summary σ^q_k:

σ^{q,t+1}_k = σ^{q,t}_k ∪ [ ∩_{j ∈ q} σ^{j,t}_k ].    (27)

This process repeats until the desired number of keyphrases per cluster summary is acquired, or the intersection yields an empty set.
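The agreement rounds of (25)-(27) can be sketched as follows (our illustration; each node's keyphrases and local document phrases are plain Python sets, and the loop stops when the intersection adds nothing new):

    def level0_summary(local_keyphrases, local_phrase_sets, max_len=10, rounds=5):
        # local_keyphrases[i]: keyphrase set node i extracted for one cluster.
        # local_phrase_sets[i]: phrases occurring in node i's local documents
        # of that cluster (used for Eq. (26)).
        summaries = [set(s) for s in local_keyphrases]
        core = set.intersection(*summaries)                      # Eq. (25)
        for _ in range(rounds):
            if len(core) >= max_len:
                break
            new_locals = []
            for i, docs in enumerate(local_phrase_sets):
                others = set().union(*(s for j, s in enumerate(summaries) if j != i))
                new_locals.append(others & docs)                 # Eq. (26)
            agreed = set.intersection(*new_locals)               # Eq. (27)
            if not agreed or agreed <= core:
                break
            core |= agreed
            summaries = new_locals
        return sorted(core)[:max_len]

    print(level0_summary(
        [{"distributed clustering", "peer to peer", "keyphrase"},
         {"distributed clustering", "peer to peer", "overlay"}],
        [{"distributed clustering", "peer to peer", "overlay", "keyphrase"},
         {"distributed clustering", "peer to peer", "keyphrase"}]))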

    5.3 Summarizing Higher Level Clusters

At higher levels, (26) is not applicable, since no local data are available. Summarization of a cluster at a higher level is simply an intersection of the keyphrase summaries of the clusters chosen to be merged into that cluster by the higher level K-means algorithm. Let C_k = {c_i} be the set of clusters chosen to be merged into one cluster, c_k, and let {σ_i} be their corresponding summaries. The summary of cluster c_k is the intersection of the constituent cluster summaries, merged with an equal subset from every constituent summary, up to L keyphrases.

Let σ^∩_k represent the core intersection:

σ^∩_k = ∩_{i ∈ C_k} σ_i.

If |σ^∩_k| > L, then σ^∩_k is truncated to L keyphrases. Otherwise, σ^∩_k is merged with an equal subset from every constituent cluster summary σ_{ki}. Let M = (L − |σ^∩_k|)/K, where K is the number of clusters, and let σ'_{ki} = σ_{ki} \ σ^∩_k; then the final cluster summary is

σ_k = σ^∩_k ∪ ( ∪_i top_M(σ'_{ki}) ),    (28)

where top_M(σ'_{ki}) denotes the top M keyphrases in σ'_{ki}. Thus, the core cluster summary is augmented with an equal subset of the top keyphrases from each constituent cluster summary that are not already in the core summary. This makes sure that the core summary is representative of all constituent clusters.
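A sketch of the merge in (28) follows (our code; unlike the equation, the per-summary quota here is split over the constituent summaries rather than over K, which keeps the example self-contained):

    def merge_summaries(constituent_summaries, max_len):
        # Each constituent summary is an ordered list of keyphrases (best first).
        core = set(constituent_summaries[0]).intersection(
            *map(set, constituent_summaries[1:]))
        if len(core) >= max_len:
            return sorted(core)[:max_len]
        per_cluster = (max_len - len(core)) // len(constituent_summaries)
        merged = set(core)
        for summary in constituent_summaries:
            extras = [p for p in summary if p not in core][:per_cluster]
            merged.update(extras)
        return sorted(merged)

    print(merge_summaries(
        [["distributed clustering", "peer to peer", "overlay network"],
         ["distributed clustering", "keyphrase extraction", "peer to peer"]],
        max_len=4))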

    6 EXPERIMENTAL RESULTS

    6.1 Data Sets

Experiments were performed on four document data sets with various characteristics and sizes. Table 1 lists the data sets used for evaluation. YAHOO, 20NG, and RCV1 are standard text mining data sets, while SN was manually collected but already labeled. Below is a brief description of each data set.

YAHOO is a collection of 2,340 news articles from Yahoo! News. It contains 20 categories (such as health, entertainment, and so forth), which have a rather unbalanced distribution. It has been used in document clustering-related research, including [21], [22], [23], and [24].


The data set is available at ftp://ftp.cs.umn.edu/dept/users/boley/.

SN is a collection of 3,271 metadata records collected from the SchoolNet learning resources website (http://www.schoolnet.ca/). We extracted the fields containing text from the metadata records (title, description, and keywords) and combined them to form one document per metadata record. We used the 17 top-level categories of the SchoolNet data set.

20NG is the standard 20-newsgroups data set, which contains 18,828 documents from 20 Usenet newsgroups divided into 20 balanced categories. This data set is available at http://people.csail.mit.edu/jrennie/20Newsgroups/.

RCV1 is a subset of 23,149 documents selected from the standard Reuters RCV1 text categorization data set, converted from the original Reuters RCV1 data by Lewis et al. [25]. The documents in the RCV1 data set are assigned multiple labels. In order to properly evaluate the clustering algorithms using single-label validity measures, we restricted the labels of the documents to the first document label that appears in the data set.

    6.2 Text Preprocessing

All texts were preprocessed in the following way. First, words were tokenized using a specially built finite-state-machine tokenizer that can detect both alphanumeric and special entities (such as currencies, dates, and so forth). Then, tokens were lowercased, stop words were removed, and finally the remaining words were stemmed using the popular Porter stemmer algorithm [26].

In addition to text preprocessing, we applied simple feature selection to reduce the number of features for every data set. The method is based on the Document Frequency (DF) feature selection measure and on the argument that terms with very low DF tend to be noninformative in categorization-related tasks [27], [28], [29]. After ranking the set of terms in descending order with respect to their DF, we pruned the list by keeping only the top 20 percent of terms. Thresholds as aggressive as 10 percent have been used in the literature to select features while increasing categorization accuracy [28] (or at least not affecting it), but we opted for a more conservative threshold.

    6.3 Evaluation Measures

Three aspects of the algorithm were evaluated: clustering accuracy, speedup, and distributed summarization accuracy. For evaluating clustering accuracy, we used two evaluation measures: Entropy, which evaluates clusters with respect to an external predefined categorization, and the Separation Index, which does not rely on predefined categories.

    6.3.1 Entropy

Entropy reflects the homogeneity of a set of objects and thus can be used to indicate the homogeneity of a cluster. This is referred to as cluster entropy, introduced by Boley et al. [21]. Lower cluster entropy indicates more homogeneous clusters. On the other hand, we can also measure the entropy of a prelabeled class of objects, which indicates the homogeneity of a class with respect to the generated clusters. The less fragmented a class is across clusters, the lower its entropy, and vice versa. This is referred to as class entropy, due to He et al. [30] and Tan et al. [32].

Cluster entropy [21]. For every cluster c_j in the clustering result C, we compute n(l_i, c_j)/n(c_j), the probability that a member of cluster c_j belongs to class l_i. The entropy of each cluster c_j is calculated using the standard formula

E(c_j) = −Σ_i (n(l_i, c_j)/n(c_j)) log(n(l_i, c_j)/n(c_j)),

where the sum is taken over all classes. The total entropy for a set of clusters is calculated as the sum of the entropies of the individual clusters, weighted by the size of each cluster:

E(C) = Σ_{j=1}^{n(C)} (n(c_j)/n(D)) E(c_j).    (29)

Class entropy [30], [32]. A drawback of cluster entropy is that it rewards small clusters, which means that if a class is fragmented across many clusters it may still obtain a low entropy value. To counter this problem, we also calculate the class entropy.

The entropy of each class l_i is calculated using

E(l_i) = −Σ_j (n(l_i, c_j)/n(l_i)) log(n(l_i, c_j)/n(l_i)),

where the sum is taken over all clusters. The total entropy for a set of classes is calculated as the weighted average of the individual class entropies:

E(L) = Σ_{i=1}^{n(L)} (n(l_i)/n(D)) E(l_i).    (30)

As with cluster entropy, a drawback of class entropy is that if multiple small classes are lumped into one cluster, their class entropy would still be small.

Overall entropy [30], [32]. To avoid the drawbacks of either cluster or class entropy, their values can be combined into an overall entropy measure:


E = γ E(C) + (1 − γ) E(L).    (31)

In our experiments, we set γ to 0.5.
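For reference, (29)-(31) can be computed with a few lines of Python (our sketch; natural logarithms are used):

    import math
    from collections import Counter

    def overall_entropy(labels, assignments, gamma=0.5):
        # gamma * cluster entropy (Eq. (29)) + (1 - gamma) * class entropy
        # (Eq. (30)); labels[i] is the true class and assignments[i] the
        # cluster of object i.
        n = len(labels)

        def weighted_entropy(groups):
            total = 0.0
            for members in groups.values():
                counts = Counter(members)
                h = -sum((c / len(members)) * math.log(c / len(members))
                         for c in counts.values())
                total += (len(members) / n) * h
            return total

        by_cluster, by_class = {}, {}
        for lab, cl in zip(labels, assignments):
            by_cluster.setdefault(cl, []).append(lab)
            by_class.setdefault(lab, []).append(cl)
        return (gamma * weighted_entropy(by_cluster)
                + (1 - gamma) * weighted_entropy(by_class))

    print(overall_entropy(["a", "a", "b", "b"], [0, 0, 1, 1]),   # perfect -> 0.0
          overall_entropy(["a", "b", "a", "b"], [0, 0, 1, 1]))   # fully mixed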

We evaluated the quality of clustering at different levels of the hierarchy. At level h = 0, we evaluated the quality of clustering for each neighborhood with respect to the subset of the data in the neighborhood, i.e.,

E_r = E(c^r)|_{D_r},

where c^r is the set of clusters obtained for neighborhood r, and D_r is the union of the data sets of all nodes in that neighborhood, D_r = ∪_{i ∈ q_r} D_i.

At levels h > 0, we evaluated the clustering acquired by a supernode with respect to the data subsets of the level 0 nodes reachable from that supernode. Thus, evaluation of the clustering acquired at the root node reflects the quality with respect to the whole data set.

    6.3.2 Separation Index

SI is another cluster validity measure that utilizes cluster centroids to measure the distance between clusters, as well as between the points in a cluster and their respective cluster centroid. It is defined as the ratio of the average within-cluster variance (cluster scatter) to the square of the minimum pairwise distance between clusters:

SI = \frac{\sum_{i=1}^{N_C} \sum_{x_j \in c_i} \mathrm{dist}(x_j, m_i)^2}{N_D \cdot \min_{1 \le r,s \le N_C,\, r \ne s} \{\mathrm{dist}(m_r, m_s)\}^2} = \frac{\sum_{i=1}^{N_C} \sum_{x_j \in c_i} \mathrm{dist}(x_j, m_i)^2}{N_D \cdot \mathrm{dist}_{\min}^2},    (32)

where m_i is the centroid of cluster c_i, and dist_min is the minimum pairwise distance between cluster centroids.

Clustering solutions with more compact clusters and larger separation have a lower Separation Index; thus, lower values indicate better solutions. This index is more computationally efficient than other validity indices, such as Dunn's index [31], which is also used to validate clusters that are compact and well separated. In addition, it is less sensitive to noisy data.
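A direct NumPy rendering of Eq. (32) is sketched below; the function name and the choice of Euclidean distance are ours and only meant to illustrate the computation.

```python
import numpy as np

def separation_index(X, labels, centroids):
    """Separation Index, Eq. (32): within-cluster scatter divided by
    N_D times the squared minimum pairwise centroid distance.
    Euclidean distance is assumed here; lower values are better."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centroids = np.asarray(centroids, dtype=float)

    # Numerator: sum of squared distances from each point to its own centroid.
    scatter = sum(np.sum((X[labels == i] - c) ** 2)
                  for i, c in enumerate(centroids))

    # Denominator: N_D times the minimum squared distance between
    # any two distinct centroids.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    sq_dists[np.diag_indices_from(sq_dists)] = np.inf
    return scatter / (len(X) * sq_dists.min())
```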

Speedup is a measure of the relative increase in speed of one algorithm over another. For evaluating HP2PC, it is calculated as the ratio of the time taken in the centralized case, T_c, to the time taken in the distributed case, T_d, including communication time, i.e.,

S = \frac{T_c}{T_d}.    (33)

To take communication time into consideration in the simulations, we factored in the time taken to transmit a message from one node to another over a 100 Mbps link (a simplified assumption; real networks exhibit additional communication overhead due to network protocols and congestion). Thus, the time required to transmit a message of size |M| bytes is calculated as

T(M) = \frac{|M|}{100{,}000{,}000 / 8} \text{ seconds}.

During simulation, each time a message is sent from (or received by) one node to another, its time is calculated and added to the total time taken by that node. Since in a real environment all nodes on the same level of the hierarchy run in parallel, the total time taken by that level is calculated as the maximum time taken by any node on the same level. The times taken by the different levels are added to arrive at the global T_d.
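A sketch of this time accounting follows, assuming per-node times (computation plus accumulated message charges) have been collected per hierarchy level; the names and data structures are illustrative only.

```python
LINK_BYTES_PER_SEC = 100_000_000 / 8  # 100 Mbps link, expressed in bytes/s

def message_time(num_bytes):
    """Simulated time (seconds) to transmit a message of the given size."""
    return num_bytes / LINK_BYTES_PER_SEC

def distributed_time(node_times_per_level):
    """Total simulated distributed time T_d.

    node_times_per_level: list over hierarchy levels, each entry a list of
    per-node times. Nodes on a level run in parallel, so each level
    contributes its maximum node time; levels run in sequence, so the
    level maxima are summed.
    """
    return sum(max(times) for times in node_times_per_level)

def speedup(t_centralized, node_times_per_level):
    """Speedup S = T_c / T_d, Eq. (33)."""
    return t_centralized / distributed_time(node_times_per_level)
```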

For cluster summarization accuracy, evaluation of the produced cluster summaries was based on how much the extracted keyphrases agree with the centralized version of CorePhrase when run on the centralized cluster. Assume HP2PC produced a cluster c_k that spanned n_p nodes, each holding a subset of the documents, D_{ki}, from that cluster. If all documents were pooled into a centralized cluster, we would have D_k documents in that cluster. The percentage of correct keyphrases is calculated as

\text{percent correct keyphrases} = \frac{|\mathrm{CorePhrase}(D_k) \cap \mathrm{DistCorePhrase}(\{D_{ki}\})|}{L},

where L is the maximum number of top keyphrases extracted.

    6.4 Experimental Setup

A simulation environment was used for evaluating the HP2PC algorithm. During simulation, data were partitioned randomly over all nodes of the network. The number of clusters was specified to the algorithm such that it corresponds to the actual number of classes in each data set. A random set of centroids was chosen by each supernode, and the centroids were distributed to all nodes in its neighborhood at the beginning of the process. Clustering was invoked at level 0 neighborhoods and was propagated to the root of the hierarchy as described in Section 4.

In the next sections, we evaluate the effect of network size on clustering accuracy, the effect of scaling the hierarchy height, the quality of clustering at different levels within a single hierarchy, and the accuracy of distributed cluster summarization using the distributed CorePhrase algorithm.

    6.5 Network Size and Height

Experiments on different network sizes and heights were performed, and their effect on clustering accuracy (Entropy and SI) and speedup over centralized clustering was measured. Table 2 summarizes those results for the YAHOO data set, and Table 3 summarizes the same results for the SN data set. The same results are illustrated in Figs. 6 and 7, respectively.

The first observation here is that for networks of height H = 1 (partitioning factor of 0), the distributed clustering accuracy stays almost the same as the network size increases. This is evident through both the Entropy and SI.



TABLE 2: Accuracy and Performance of HP2PC [YAHOO]



Since for networks of height 1 all nodes at level 0 are in the same neighborhood, every node can update its centroids based on complete information received from all other nodes at the end of each iteration (at the cost of increased communication). This means that increasing the network size does not affect the accuracy of clustering, as long as the network is of height 1.

The second observation is that, for networks of the same size, larger network heights cause clustering accuracy to drop. It is not surprising that this is the case, since at higher levels metaclustering of lower level centroids is expected to produce some deviation from the true centroids. It is also noticeable that, unlike networks of height 1, networks with height H > 1 tend to have lower accuracy as the number of nodes increases. As we keep H constant and increase n_p, the network partitioning factor increases. This in turn means neighborhoods become smaller, thus causing the more accurate centroids at level 0 to become more fragmented.

An interesting observation is that there is a noticeable plateau region between the centralized case (n_p = 1) and a point where the data are finely partitioned (n_p greater than some value), after which quality degrades rapidly. This plateau provides a clue on the relation between the data set size and the number of nodes, beyond which the number of nodes should not be increased without increasing the data set size. An appropriate strategy for automatically detecting the higher boundary of this region (in scenarios where the network grows arbitrarily) is to compare the SI measure before and after adding nodes; if a sufficiently large difference in SI is noticed, then network growth should be suspended until more data are available (and equally partitioned).

Since the increase in hierarchy height has the biggest effect on the accuracy of the resulting clustering, a strategy based on the SI measure can be adopted to select the most appropriate hierarchy for a certain application.


TABLE 3: Accuracy and Performance of HP2PC [SN]

    Fig. 6. HP2PC accuracy and speedup [YAHOO].

    Fig. 7. HP2PC accuracy and speedup [SN].



Given a sufficiently small accuracy parameter \epsilon, we can use the following strategy (sketched in code below) to recommend a certain hierarchy height:

1. Initialize i = 1.
2. Set H = i and compute the corresponding SI_i measure for the resulting clustering solution.
3. Set i = i + 1.
4. Compute SI_i.
5. If \Delta SI = SI_i - SI_{i-1} < \epsilon, go to step 3.
6. Output H = i as the recommended height.

We investigate the effect of increasing hierarchy heights, as well as the accuracy at different levels within a single hierarchy, in more detail in the next sections.

In terms of speedup, the trends show that the HP2PC algorithm exhibits decent speedup over the centralized case. For H = 1, however, speedup does not scale well with the network size, largely due to the increased communication cost for networks of that height. For H > 1, speedup becomes more scalable, as we notice a bigger difference between H = 1 and H = 2 than between H = 2 and H = 3. This result carries an assertion that the hierarchical architecture of HP2PC is indeed scalable compared to flat P2P networks.

    6.5.1 Comparison with P2P K-Means

The accuracy of HP2PC is compared with P2P K-means [8], which is the current state of the art in P2P-based distributed clustering. Since the implementation of P2P K-means is nontrivial, we used their benchmark synthetic data set and results to compare against. The data set is a 2D mixture of 10 Gaussians, containing 78,200 points (referred to hereafter as 10G). The actual data were not available from the authors, but rather the parameters of the Gaussians, which we used to regenerate the data (the actual data points may therefore differ from theirs due to random number generation, but we assume the very large number of points offsets differences due to sampling). The 10G data set is illustrated in Fig. 8.

The measure of accuracy in [8] was based on the difference between the cluster membership produced by P2P K-means and that of the same data point as produced by centralized K-means. To ensure accurate comparison, the initial seeds for both the centralized and the P2P algorithms were the same. They report the total number of mislabeled data points as a percentage of the size of the data set. The percentage of mislabeled points (PMP) is

\mathrm{PMP} = 100 \cdot \frac{|\{d \in D : L_{\mathrm{cent}}(d) \ne L_{\mathrm{p2p}}(d)\}|}{|D|}.
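In code, the PMP computation reduces to a label comparison over the two clusterings; the sketch below is illustrative.

```python
def percentage_mislabeled(labels_centralized, labels_p2p):
    """PMP: percentage of points whose distributed cluster label differs
    from the centralized K-means label (same initial seeds assumed)."""
    mismatched = sum(1 for a, b in zip(labels_centralized, labels_p2p) if a != b)
    return 100.0 * mismatched / len(labels_centralized)
```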



Fig. 8. Two-dimensional mixture of 10 Gaussians data set [10G].
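Regenerating such a data set amounts to sampling each Gaussian component; the sketch below shows one way to do this. The means, covariances, and per-component counts are placeholders of our own, not the published parameters from [8].

```python
import numpy as np

def generate_mixture_2d(means, covariances, counts, seed=0):
    """Sample a 2-D Gaussian mixture, component by component.
    The parameter values must be supplied by the caller."""
    rng = np.random.default_rng(seed)
    parts = [rng.multivariate_normal(mean, cov, size=n)
             for mean, cov, n in zip(means, covariances, counts)]
    return np.vstack(parts)

# Placeholder parameters (NOT the published 10G parameters): 10 components,
# 7,820 points each, 78,200 points in total.
means = [(3.0 * k, 3.0 * (k % 5)) for k in range(10)]
covs = [np.eye(2) for _ in range(10)]
counts = [7820] * 10
data_10g = generate_mixture_2d(means, covs, counts)
```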

TABLE 4: PMP Comparison between HP2PC and P2P K-Means [10G]

    Fig. 9. PMP comparison between HP2PC and P2P K-means [10G].

Fig. 10. Clustering accuracy versus hierarchy level, H = 5 [20NG].



Table 4 reports the results for P2P K-means and HP2PC (with various hierarchy heights). Nodes vary between 50 and 500, as reported in [8]. Fig. 9 illustrates the trend in the results. HP2PC has zero error for networks of height 1, as expected. It is clear that for networks of low height, HP2PC is superior to P2P K-means. As the height increases, HP2PC starts to approach the error rate of P2P K-means (H = 4, H = 5), but interestingly, HP2PC does not suffer from the sharp increase in PMP at very large numbers of nodes (n_p > 300).

P2P K-means has the advantage of being a model for unstructured P2P networks. It assumes that each node has a finite number of reachable neighbors, which are randomly selected from the node population. HP2PC, on the other hand, has a fixed hierarchical structure that allows it to produce superior results by avoiding random peering and propagation delay and error, a common disadvantage in P2P networks.

    6.6 Clustering Quality at Different Hierarchy Levels

To test the effect of the hierarchical structure on clustering quality at different levels, we performed experiments on a network of 250 nodes with a fixed height of 5 (partitioning factor of 0.33), on 20NG and RCV1. The number of nodes at each level is 250, 83, 28, 9, 3, and 1, from level 0 to level 5, respectively. We compared the results to centralized K-means performed at each level of the hierarchy and took the average over all neighborhoods on that level.

Fig. 10 shows the accuracy achieved at each level of the hierarchy for 20NG and compares it to the average centralized K-means accuracy at the same level. Fig. 10a shows Entropy, while Fig. 10b shows the change in SI with hierarchy level. We notice that the clustering quality achieved by HP2PC is comparable to centralized K-means and that it slightly degrades as we go up the hierarchy. Fig. 11 shows the same results for RCV1 and verifies the same trend. Since at higher levels of the hierarchy we only rely on cluster centroid information, this result is justifiable. Nevertheless, it is clear that at level 0 we can achieve clustering quality close to the centralized K-means algorithm. In scenarios where tall hierarchies are necessary (e.g., deep hierarchical organization), we can still achieve results that do not deviate much from the centralized case.

    6.7 Hierarchy Height Scalability

Finally, we performed a set of experiments to test the effect of increasing hierarchy heights. Experimenting on tall hierarchies requires a large number of nodes so as to keep the neighborhood sizes reasonable. For this reason, only the larger data sets, 20NG and RCV1, were used in those experiments to avoid fine-grained partitioning of data across such a large number of nodes.

Table 5 reports the outcome of those experiments for a network of 250 nodes. Performance measures for both HP2PC and centralized K-means clustering are reported (the latter suffixed with "ctr"). Fig. 12 reports the same information, but only for RCV1 with respect to clustering quality (20NG exhibits similar trends), and compares it to the centralized clustering quality. Note that centralized clustering produces roughly the same quality regardless of the hierarchy height, since all data in the network are centrally clustered in this case. We can see that as we increase the hierarchy height, HP2PC clustering quality (which is measured at the root of the hierarchy) is affected (Figs. 12a and 12b). The sharpest decrease happens as soon as the height increases from 1 to 2, and then the degradation in Entropy and SI tends to stabilize. A similar trend can be seen in speedup (Fig. 12c). Both observations can be related to the size of the level-0 neighborhoods, which decreases significantly when the height is increased from 1 (250 nodes) to 2 (16.67 nodes).


Fig. 11. Clustering accuracy versus hierarchy level, H = 5 [RCV1].

TABLE 5: Performance of HP2PC versus Hierarchy Heights, np = 250 [20NG, RCV1]



In fact, to demonstrate that increasing the hierarchy height is not the primary cause of the drop in quality, we show in Fig. 13 the change in clustering quality as we keep the hierarchy height constant at 2 and increase the number of neighborhoods from 1 to 15. We notice that the quality is only slightly reduced as we partition the network to include one more neighborhood. Thus, although the drop in quality seems to be related to a slight increase in hierarchy height, the actual cause is the big jump in the number of neighborhoods. A well-designed network should take this observation into consideration, so as not to overpartition the network.

From those observations, we can conclude that neighborhood size plays a key role in determining both clustering quality and speedup. The more fine-grained the neighborhoods in the network, the less the final clustering solution depends on the actual data (only available to level 0 nodes), because we have to go up the hierarchy several levels before we can converge to one solution for the whole network. Conversely, the more coarse-grained the neighborhoods, the better the final clustering solution, due to creating more accurate clustering at lower levels before the less accurate merging of centroids takes place at higher levels of the hierarchy.

A similar argument applies to speedup. The fewer nodes in a neighborhood, the less communication is needed between peers. However, from Fig. 12c, we can see that we do not gain much speedup after a certain height (around H = 4 or 5). In fact, speedup tends to decrease slightly after that point. This can be explained by looking at the size of neighborhoods in Table 5. As soon as the average level-0 neighborhood size decreases from 250 to 16.67, we notice a big jump in speedup (from 94.60 to 135.07). The neighborhood size then decreases slowly as we increase the height, and after H = 4 it stays almost the same. So, in effect, no gain is achieved; on the other hand, due to the increased height, we have to go through several cluster merging layers before the final solution is achieved. Thus, our conclusion here is that the hierarchy height should not be increased unless there is a corresponding increase in the number of nodes at level 0.

    6.8 Distributed Cluster Summarization

Generation of cluster summaries using the distributed version of CorePhrase was evaluated using different network sizes n_p and heights H. Experiments were performed on the 20NG data set, where the summary of each distributed cluster is compared to that of its centralized counterpart, and an average is taken over all clusters.

Fig. 14 illustrates the accuracy of the distributed cluster summarization compared with the baseline centralized cluster summarization. The first observation is that distributed cluster summaries can agree with their centralized counterparts up to 88 percent (H = 1, n_p = 50), which shows the feasibility of the distributed summarization algorithm.

The second observation is that the number of top keyphrases, L, has a direct effect on accuracy. Lower values of L (usually below 100) tend to produce poor results, as do higher values (usually above 500). For networks of low height (here H = 1), 100 < L < 500 produces the best results, while for those of larger heights (here H = 3, 5), 400 < L < 700 produces the best results, albeit less accurate than those of lower height networks.


Fig. 12. HP2PC performance versus hierarchy height. (a) Entropy versus hierarchy height [RCV1]. (b) SI versus hierarchy height [RCV1]. (c) Speedup versus hierarchy height [20NG, RCV1].

Fig. 13. Entropy versus number of level 0 neighborhoods (H = 2).



An interpretation of this observation is that at level 0, distributed summarization is directly dependent on the actual data, while at higher levels only keyphrases from level 0 are merged together.

The third observation is that networks with a smaller number of nodes, n_p, produce more accurate results. Since the whole data set is partitioned among n_p nodes, a coarse-grained partitioning (smaller n_p) means that each node has access to a larger portion of the distributed cluster and is thus able to extract more accurate keyphrases.

To summarize those findings: 1) the results of distributed cluster summarization can agree with centralized summarization with up to 88 percent accuracy; 2) for networks of small height, 100 < L < 500 should be used, while for networks of large height, 400 < L < 700 should be used; and 3) the accuracy of distributed summarization increases as the network size and height are decreased.

    7 CONCLUSION AND FUTURE WORK

In this paper, we have introduced a novel architecture and algorithm for distributed clustering, the HP2PC model, which allows building hierarchical networks for clustering data. We demonstrated the flexibility of the model, showing that it achieves comparable quality to its centralized counterpart while providing significant speedup, and that it is possible to make it equivalent to traditional distributed clustering models (e.g., facilitator-worker models) by manipulating the neighborhood size and height parameters. The model shows good scalability with respect to network size and hierarchy height, without significantly degrading the distributed clustering quality.

The importance of this contribution stems from its flexibility to accommodate regular types of P2P networks as well as modularized networks through neighborhood and hierarchy formation. It also allows privacy within neighborhood boundaries (no data are shared between neighborhoods). In addition, we provide interpretation capability for document clustering through document cluster summarization using distributed keyphrase extraction.

For future work, we plan to extend this model to be dynamic, allowing nodes to join and leave the network, which requires maintaining a balanced network in terms of partitioning and height. This will also lead us to a way to find the optimal network height for certain applications. We also plan to extend the model to allow merging and splitting of complete hierarchies.

We are also investigating the possibility of making the clustering algorithm more global by allowing centroids to cross neighborhoods through higher levels; i.e., clusters at lower level neighborhoods would be a function of higher level centroids. We believe that this will create an opportunity for better global clustering solutions, but at the expense of computational complexity.

REFERENCES

[1] N.F. Samatova, G. Ostrouchov, A. Geist, and A.V. Melechko, "RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets," Distributed and Parallel Databases, vol. 11, no. 2, pp. 157-180, 2002.
[2] S. Merugu and J. Ghosh, "Privacy-Preserving Distributed Clustering Using Generative Models," Proc. Third IEEE Int'l Conf. Data Mining (ICDM '03), pp. 211-218, 2003.
[3] J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, and M. Klusch, "Distributed Data Mining and Agents," Eng. Applications of Artificial Intelligence, vol. 18, no. 7, pp. 791-807, 2005.
[4] A. Strehl and J. Ghosh, "Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions," J. Machine Learning Research, vol. 3, pp. 583-617, Dec. 2002.
[5] E. Januzaj, H.-P. Kriegel, and M. Pfeifle, "DBDC: Density Based Distributed Clustering," Proc. Ninth Int'l Conf. Extending Database Technology (EDBT '04), pp. 88-105, 2004.
[6] M. Klusch, S. Lodi, and G. Moro, "Agent-Based Distributed Data Mining: The KDEC Scheme," Proc. AgentLink, pp. 104-122, 2003.
[7] M. Eisenhardt, W. Muller, and A. Henrich, "Classifying Documents by Distributed P2P Clustering," Informatik 2003: Innovative Information Technology Uses, 2003.
[8] S. Datta, C. Giannella, and H. Kargupta, "K-Means Clustering over Peer-to-Peer Networks," Proc. Eighth Int'l Workshop High Performance and Distributed Mining (HPDM), SIAM Int'l Conf. Data Mining (SDM), 2005.
[9] S. Datta, C. Giannella, and H. Kargupta, "K-Means Clustering over a Large, Dynamic Network," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 153-164, 2006.


    Fig. 14. Distributed cluster summarization accuracy [20NG].



[10] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta, "Distributed Data Mining in Peer-to-Peer Networks," IEEE Internet Computing, vol. 10, no. 4, pp. 18-26, 2006.
[11] S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, and S. Datta, "Clustering Distributed Data Streams in Peer-to-Peer Environments," Information Sciences, vol. 176, pp. 1952-1985, 2006.
[12] K. Hammouda and M. Kamel, "Collaborative Document Clustering," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 453-463, Apr. 2006.
[13] H. Kargupta, I. Hamzaoglu, and B. Stafford, "Scalable, Distributed Data Mining Using an Agent-Based Architecture," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD '97), pp. 211-214, 1997.
[14] J. Li and R. Morris, "Document Clustering for Distributed Fulltext Search," Proc. Second MIT Student Oxygen Workshop, Aug. 2002.
[15] A. Kumar, M. Kantardzic, and S. Madden, "Guest Editors' Introduction: Distributed Data Mining - Framework and Implementations," IEEE Internet Computing, vol. 10, no. 4, pp. 15-17, 2006.
[16] R. Wolff, K. Bhaduri, and H. Kargupta, "Local L2-Thresholding Based Data Mining in Peer-to-Peer Systems," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM '06), pp. 430-441, 2006.
[17] I.S. Dhillon and D.S. Modha, "A Data-Clustering Algorithm on Distributed Memory Multiprocessors," Large-Scale Parallel Data Mining, pp. 245-260, Springer, 2000.
[18] K. Hammouda and M. Kamel, "Incremental Document Clustering Using Cluster Similarity Histograms," Proc. IEEE/WIC Int'l Conf. Web Intelligence (WI '03), pp. 597-601, Oct. 2003.
[19] K. Hammouda and M. Kamel, "CorePhrase: Keyphrase Extraction for Document Clustering," Proc. IAPR Int'l Conf. Machine Learning and Data Mining in Pattern Recognition (MLDM '05), P. Perner and A. Imiya, eds., pp. 265-274, July 2005.
[20] K. Hammouda and M. Kamel, "Document Similarity Using a Phrase Indexing Graph Model," Knowledge and Information Systems, vol. 6, no. 6, pp. 710-727, Nov. 2004.
[21] D. Boley, "Principal Direction Divisive Partitioning," Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 325-344, 1998.
[22] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "Partitioning-Based Clustering for Web Document Categorization," Decision Support Systems, vol. 27, pp. 329-341, 1999.
[23] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "Document Categorization and Query Generation on the World Wide Web Using WebACE," AI Rev., vol. 13, nos. 5/6, pp. 365-391, 1999.
[24] A. Strehl, "Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data Mining," PhD dissertation, Faculty of Graduate School, Univ. of Texas at Austin, 2002.
[25] D.D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," J. Machine Learning Research, vol. 5, pp. 361-397, 2004.
[26] M.F. Porter, "An Algorithm for Suffix Stripping," Program, vol. 14, no. 3, pp. 130-137, July 1980.
[27] G. Salton, A. Wong, and C. Yang, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, no. 11, pp. 613-620, Nov. 1975.
[28] W. Wong and A. Fu, "Incremental Document Clustering for Web Page Classification," Proc. Int'l Conf. Information Soc. in the 21st Century: Emerging Technologies and New Challenges (IS), 2000.
[29] Y. Yang and J.P. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. 14th Int'l Conf. Machine Learning (ICML '97), pp. 412-420, 1997.
[30] J. He, A.-H. Tan, C.-L. Tan, and S.-Y. Sung, "On Quantitative Evaluation of Clustering Systems," Clustering and Information Retrieval, pp. 105-133, Kluwer Academic, 2003.
[31] J.C. Dunn, "Well Separated Clusters and Optimal Fuzzy Partitions," J. Cybernetica, vol. 4, pp. 95-104, 1974.
[32] A.-H. Tan, H.-L. Ong, H. Pan, J. Ng, and Q.-X. Li, "Towards Personalized Web Intelligence," Knowledge and Information Systems, vol. 6, no. 5, pp. 595-616, May 2004.

Khaled M. Hammouda received the BSc (Hons) degree in computer engineering from Cairo University in 1997 and the MASc and PhD degrees in systems design engineering from the University of Waterloo in 2002 and 2007, respectively. He is currently a professional software engineer at Desire2Learn Inc., where he works on emerging learning object repository technology. He received numerous awards, including the NSERC Postgraduate Scholarship, the Ontario Graduate Scholarship in Science and Technology, and the University of Waterloo President's Graduate Scholarship and Faculty of Engineering Scholarship. He is a former member of the PAMI Research Group, University of Waterloo, where his research interests were in document clustering and distributed text mining, especially keyphrase extraction and summarization. He has authored several papers in this field.

Mohamed S. Kamel received the BSc (Hons) degree in electrical engineering from Alexandria University, the MASc degree from McMaster University, and the PhD degree from the University of Toronto. In 1985, he joined the University of Waterloo, Waterloo, Ontario, where he is currently a professor and the director of the Pattern Analysis and Machine Intelligence Laboratory, Department of Electrical and Computer Engineering, and holds a university research chair. He held a Canada research chair in cooperative intelligent systems from 2001 to 2008. His research interests are in computational intelligence, pattern recognition, machine learning, and cooperative intelligent systems. He has authored and coauthored more than 350 papers in journals and conference proceedings, 10 edited volumes, two patents, and numerous technical and industrial project reports. Under his supervision, 75 PhD and MASc students have completed their degrees. He is the editor in chief of the International Journal of Robotics and Automation and an associate editor of the IEEE Transactions on Systems, Man, and Cybernetics, Part A, Pattern Recognition Letters, Cognitive Neurodynamics Journal, and Pattern Recognition Journal. He is also a member of the editorial advisory board of the International Journal of Image and Graphics and the Intelligent Automation and Soft Computing Journal. He also served as an associate editor of Simulation, the journal of the Society for Computer Simulation. Based on his work at NCR, he received the NCR Inventor Award. He is also a recipient of the Systems Research Foundation Award for outstanding presentation in 1985 and the ISRAM Best Paper Award in 1992. In 1994, he was awarded the IEEE Computer Society Press Outstanding Referee Award. He was also a coauthor of the best paper in the 2000 IEEE Canadian Conference on Electrical and Computer Engineering. He is a two-time recipient of the University of Waterloo Outstanding Performance Award and a recipient of the Faculty of Engineering Distinguished Performance Award. He is a member of the ACM and the PEO, a fellow of the IEEE, the Engineering Institute of Canada (EIC), and the Canadian Academy of Engineering (CAE), and was selected to be a fellow of the International Association of Pattern Recognition (IAPR) in 2008. He served as a consultant for General Motors, NCR, IBM, Northern Telecom, and Spar Aerospace. He is a cofounder of Virtek Vision of Waterloo and the chair of its Technology Advisory Group. He served as a member of the board from 1992 to 2008 and as the vice president for research and development from 1987 to 1992.

