22 Swarming Agents for Decentralized Clustering in Spatial Data

Gianluigi Folino, Agostino Forestiero, Giandomenico Spezzano

22.1 Introduction
22.2 A Multiagent Spatial Clustering Algorithm
    The DBSCAN Algorithm • The Reynolds' Flock Model • SPARROW: A Flocking Algorithm for Spatial Clustering
22.3 The SPARROW-SNN Algorithm
    The SNN Clustering Algorithm • The SPARROW-SNN Implementation
22.4 Experimental Results
    SPARROW Results • SPARROW-SNN Results
22.5 Entropy Model
22.6 Conclusions
Acknowledgment
References

22.1 Introduction

In recent years several approaches to knowledge discovery and data mining, and in particular clustering, have been developed, but only a few of them are designed using a decentralized approach. Clustering data is the process of grouping similar objects according to their distance, connectivity, or relative density in space [1]. There are a large number of algorithms for discovering natural clusters, if they exist, in a dataset, but they are usually studied as a centralized problem. These algorithms can be classified into partitioning methods [2], hierarchical methods [3], density-based methods [4], and grid-based methods [5]. Han et al.'s paper [6] is a good introduction to this subject. Many of these algorithms work on data contained in a file or database. In general, clustering algorithms focus on creating good, compact representations of clusters and appropriate distance functions between data points. For this purpose, they generally need to be given one or two parameters by a user that indicate the types of clusters expected. Since they use a central representation where each point can be compared with every other point or cluster representation, points are never placed in a cluster with largely different members.


Centralized clustering is problematic if we have a large amount of data to explore or if the data are widely distributed. Parallel and distributed computing is expected to relieve current mining methods from the sequential bottleneck, providing the ability to scale to massive datasets and improving the response time. Achieving good performance on today's high-performance systems is a nontrivial task. The main challenges include minimizing synchronization and communication, balancing the workload, finding good data decompositions, etc. Some existing centralized clustering algorithms have been parallelized and the results have been encouraging. Centralized schemes require a high level of connectivity, impose a substantial computational burden, are typically more sensitive to failures than decentralized schemes, and are not scalable, which is a property that distributed computing systems are required to have.

Recently, other algorithms based on biological models [7–10] have been introduced to solve the clustering problem in a decentralized fashion. These algorithms are characterized by the interaction of a large number of simple agents that sense and change their environment locally. Furthermore, they exhibit complex, emergent behavior that is robust with respect to the failure of individual agents. Ant colonies, flocks of birds, termites, swarms of bees, etc., are agent-based models that exhibit a collective intelligent behavior (swarm intelligence, SI) [11] that may be used to define new clustering algorithms. The use of SI-based techniques for data clustering is an emerging area of research that has attracted many investigators in recent years. SI models have many features in common with evolutionary algorithms (EAs). Like EAs, SI models are population-based. The system is initialized with a population of individuals (i.e., potential solutions). These individuals are then manipulated over many iteration steps by mimicking the social behavior of insects or animals, in an effort to find the optima in the problem space. Unlike EAs, SI models do not explicitly use evolutionary operators, such as crossover and mutation. A potential solution simply flies through the search space by modifying itself according to its past experience and its relationship with other individuals in the population and the environment. In these models, the emergent collective behavior is the outcome of a process of self-organization, in which insects are engaged through their repeated actions and interactions with their evolving environment. Intelligent behavior frequently arises through indirect communication between the agents using the principle of stigmergy [12]. This mechanism is a powerful principle of cooperation in insect societies. According to this principle, an agent deposits something in the environment that makes no direct contribution to the task being undertaken but is used to influence the subsequent, task-related behavior. The advantages of SI are twofold. First, it offers intrinsically distributed algorithms that can use parallel computation quite easily. Second, these algorithms show a high level of robustness to change, by allowing the solution to adapt dynamically to global changes by letting the agents self-adapt to the associated local changes. Social insects thus provide a new paradigm for developing decentralized clustering algorithms.

In this chapter, we present a method for decentralized clustering based on an adaptive flocking algorithm proposed by Macgill [13]. We consider clustering as a search problem in a multiagent system in which individual agents have the goal of finding specific elements in the search space, represented by a large dataset of tuples, by walking efficiently through this space. The method takes advantage of the parallel search mechanism a flock implies: if a member of the flock finds an area of interest, the mechanics of the flock will draw other members to scan that area in more detail. The algorithm selects interesting subsets of tuples without inspecting the whole search space, guaranteeing that points are quickly placed in the correct clusters. We have applied this strategy as a data reduction technique to perform approximate clustering efficiently [14]. In the algorithm, each agent can use hierarchical, partitioning, density-based, or grid-based clustering methods to discover whether a tuple belongs to a cluster.

To illustrate this method we present two algorithms: SPARROW (SPAtial clusteRing algoRithm thrOugh sWarm intelligence) and SPARROW-SNN. SPARROW combines the flocking algorithm with the density-based DBSCAN algorithm [15]. SPARROW-SNN combines the flocking algorithm with a shared nearest-neighbor (SNN) clustering algorithm [16] to discover clusters with differing sizes, shapes, and densities in noisy, high-dimensional data. We have built a SWARM [17] simulation of both algorithms to investigate the interaction of the parameters that characterize them. First experiments showed encouraging results and a better performance of both algorithms in comparison with the linear randomized search; the behavior is also analyzed using an entropy-based model.


The rest of this chapter is organized as follows: Section 22.2 presents the SPARROW clustering algorithm, first introducing the heuristics of the DBSCAN algorithm, used by each agent to discover clusters, and then the flocking algorithm that governs the interaction among the agents. Section 22.3 describes the centralized SNN clustering algorithm and how it is combined with the flocking algorithm to produce the SPARROW-SNN algorithm. Section 22.4 discusses the results obtained. Section 22.5 presents an entropy-based model to theoretically explain the behavior of the algorithm, and Section 22.6 draws some conclusions.

    22.2 A Multiagent Spatial Clustering Algorithm

In this section, we present the SPARROW algorithm, which combines the stochastic search of adaptive flocking with the DBSCAN heuristics for discovering clusters in parallel. SPARROW replaces DBSCAN's serial procedure for cluster identification with a multiagent stochastic search that has the advantage of being easily implementable on parallel computers and of being robust with respect to the failure of individual agents. We first introduce the DBSCAN algorithm and then present the Reynolds' flock model that describes the standard movement rules of birds. Finally, we illustrate the details of the behavioral rules of SPARROW's swarming agents (SAs), which move through the spatial data looking for clusters and communicating their findings to each other.

    22.2.1 The DBSCAN Algorithm

One of the most popular spatial clustering algorithms is DBSCAN, a density-based spatial clustering algorithm. A complete description of the algorithm and its theoretical basis is presented in the paper by Ester et al. [15]. In the following, we briefly present the main principles of DBSCAN. The algorithm is based on the idea that all points of a dataset can be regrouped into two classes: clusters and noise. Clusters are defined as sets of dense, connected regions with a given radius (Eps) containing at least a minimum number (MinPts) of points. Data are regarded as noise if the number of points contained in a region falls below a specified threshold. The two parameters, Eps and MinPts, must be specified by the user and allow the user to control the density of the clusters to be retrieved. The algorithm defines two different kinds of points in a clustering: core points and non-core points. A core point is a point with at least MinPts points in its Eps-neighborhood. The non-core points in turn are either border points, if they are not core points but are density reachable from a core point, or noise points, if they are not core points and are not density reachable from other points. To find the clusters in a dataset, DBSCAN starts from an arbitrary point and retrieves all points density reachable from that point, using Eps and MinPts as controlling parameters. A point p is density reachable from a point q if the two points are connected by a chain of points such that each point has a minimal number of data points, including the next point in the chain, within a fixed radius. If the point is a core point, then the procedure yields a cluster. If the point is a border point, then DBSCAN goes on to the next point in the database and the point is assigned to the noise. DBSCAN builds clusters in sequence (i.e., one at a time), in the order in which they are encountered during the space traversal. The retrieval of the density of a cluster is performed by successive spatial queries. Such queries are supported efficiently by spatial access methods such as R-trees.
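As a concrete illustration of how Eps and MinPts interact, the following minimal Python sketch classifies the points of a small two-dimensional dataset as core, border, or noise points by brute force. It only illustrates the definitions above; it is not the R-tree-backed implementation used by DBSCAN, and the sample data and parameter values are arbitrary.

from math import dist  # Euclidean distance (Python >= 3.8)

def region_query(points, p, eps):
    """Return the indices of all points within distance eps of points[p]."""
    return [q for q in range(len(points)) if dist(points[p], points[q]) <= eps]

def classify(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (illustrative sketch)."""
    neighborhoods = [region_query(points, p, eps) for p in range(len(points))]
    core = {p for p, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    labels = {}
    for p in range(len(points)):
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighborhoods[p]):
            labels[p] = "border"   # not a core point, but density reachable from one
        else:
            labels[p] = "noise"
    return labels

# Example: a dense blob plus one isolated point
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(classify(pts, eps=1.5, min_pts=4))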

22.2.2 The Reynolds' Flock Model

The flocking algorithm was originally proposed by Reynolds [18] as a method for mimicking the flocking behavior of birds on a computer, both for animation and as a way to study emergent behavior. Flocking is an example of emergent collective behavior: there is no leader, that is, no global control. Flocking behavior emerges from the local interactions. In the flock algorithm, each agent has direct access to the geometric description of the whole scene, but reacts only to flockmates within a certain small radius. The basic flocking model consists of three simple steering behaviors:

Separation gives an agent the ability to maintain a certain distance from others. This prevents agents from crowding too closely together, allowing them to scan a wider area.


Cohesion gives an agent the ability to cohere (approach and form a group) with other nearby agents. Steering for cohesion can be computed by finding all agents in the local neighborhood and computing the average position of the nearby agents. The steering force is then applied in the direction of that average position.

Alignment gives an agent the ability to align with other nearby characters. Steering for alignment can be computed by finding all agents in the local neighborhood and averaging together the heading vectors of the nearby agents.
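A minimal sketch of the three steering rules is given below; the weights and the separation radius are illustrative assumptions, not values taken from the flocking model described in the chapter.

import numpy as np

def steering(position, velocity, neighbors, sep_dist=1.0,
             w_sep=1.5, w_coh=1.0, w_ali=1.0):
    """Weighted sum of separation, cohesion, and alignment for one agent.
    `neighbors` is a list of (position, velocity) pairs of nearby flockmates."""
    if not neighbors:
        return np.zeros(2)
    pos = np.array([p for p, _ in neighbors])
    vel = np.array([v for _, v in neighbors])
    # Separation: move away from flockmates that are too close.
    diffs = position - pos
    too_close = np.linalg.norm(diffs, axis=1) < sep_dist
    separation = diffs[too_close].sum(axis=0) if too_close.any() else np.zeros(2)
    # Cohesion: steer toward the average position of the neighbors.
    cohesion = pos.mean(axis=0) - position
    # Alignment: steer toward the average heading of the neighbors.
    alignment = vel.mean(axis=0) - velocity
    return w_sep * separation + w_coh * cohesion + w_ali * alignment

# Example: one agent with two flockmates in its neighborhood
print(steering(np.array([0.0, 0.0]), np.array([1.0, 0.0]),
               [(np.array([0.5, 0.0]), np.array([0.0, 1.0])),
                (np.array([2.0, 1.0]), np.array([1.0, 1.0]))]))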

    22.2.3 SPARROW: A Flocking Algorithm for Spatial Clustering

SPARROW is a multiagent adaptive algorithm able to discover clusters in parallel. It uses a modified version of the standard flocking algorithm that incorporates the capacity for learning found in many social insects. In the algorithm, the agents are transformed into hunters with a foraging behavior that allows them to explore the spatial data while searching for clusters.

SPARROW starts with a fixed number of agents that occupy randomly generated positions. Each agent moves around the spatial data, testing the neighborhood of each location in order to verify whether the point can be identified as a core point. In such a case, a temporary label is assigned to all the points of the neighborhood of the core point. The labels are updated concurrently as multiple clusters take shape. Contiguous points belonging to the same cluster take the smallest label in the group of contiguous points. Each agent follows the rules of movement described in Reynolds' model. In addition, our model considers four different kinds of agents, classified on the basis of the density of data in their neighborhood. These kinds are characterized by different colors: red, revealing a high density of interesting patterns in the data; green, a medium one; yellow, a low one; and white, indicating a total absence of patterns. The main idea behind our approach is to take advantage of the colored agents in order to explore more accurately the most interesting regions (signaled by the red agents) and to avoid those without clusters (signaled by the white agents). Red and white agents stop moving in order to signal these types of regions to the others, while green and yellow agents fly on to find denser clusters. Each flying agent computes its heading by taking the weighted average of alignment, separation, and cohesion (as illustrated in Figure 22.1). The following are the main features that make our model different from Reynolds' model:

Alignment and cohesion do not consider yellow agents, since these move in zones that are not very attractive.

Cohesion is the resultant of the heading toward the average position of the green flockmates (centroid), of the attraction toward red agents, and of the repulsion from white agents, as illustrated in Figure 22.1.

A separation distance is maintained from all the boids, regardless of their color.

FIGURE 22.1 Computing the direction of an agent in SPARROW.
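A rough sketch of how such a resultant direction could be computed is shown below; the equal weighting of the three contributions is an assumption made for illustration, since the chapter does not give the exact weights.

import numpy as np

def sparrow_direction(agent_pos, greens, reds, whites):
    """Direction of a flying SPARROW agent (illustrative sketch).

    The agent heads toward the centroid of its green flockmates, is
    attracted by red agents and repelled by white agents; yellow agents
    are ignored. All positions are 2-D NumPy arrays.
    """
    direction = np.zeros(2)
    if greens:
        direction += np.mean(greens, axis=0) - agent_pos   # toward the green centroid
    for r in reds:
        direction += r - agent_pos                          # attraction toward reds
    for w in whites:
        direction -= w - agent_pos                          # repulsion from whites
    return direction

# Example: one green mate ahead, one red to the right, one white behind
print(sparrow_direction(np.array([0.0, 0.0]),
                        greens=[np.array([0.0, 2.0])],
                        reds=[np.array([4.0, 0.0])],
                        whites=[np.array([-1.0, 0.0])]))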


for i = 1..MaxGenerations
  foreach agent (yellow, green)
    age = age + 1;
    if (age > Max_Life)
      generate_new_agent( );
      die( );
    endif
    if (not visited(current_point))
      density = compute_local_density( );
      mycolor = color_agent( );
    endif
  end foreach
  foreach agent (yellow, green)
    dir = compute_dir( );
  end foreach
  foreach agent (all)
    switch (mycolor) {
      case yellow, green: move(dir, speed(mycolor)); break;
      case white: stop( ); generate_new_agent( ); break;
      case red: stop( ); merge( ); generate_new_close_agent( ); break;
    }
  end foreach

    FIGURE 22.2 Pseudo code of the SPARROW algorithm.

The SPARROW algorithm consists of a setup phase and a running phase, shown in Figure 22.2. During the setup phase agents are created, data are loaded, and some general settings are made. In the running phase each agent repeats four distinct procedures for a fixed number of times (MaxGenerations). We use the foreach statement to indicate that the rules are executed in parallel by the agents whose color is specified in the argument. The color_agent procedure chooses the color and the speed of an agent on the basis of the local density of cluster points in the data. It is based on the same parameters used in the DBSCAN algorithm: MinPts, the minimum number of points to form a cluster, and Eps, the maximum distance that the agents can look at. In practice, the agent computes the local density (density) in a circular neighborhood (with a radius determined by its limited sight, i.e., Eps) and then chooses its color according to the following simple rules:

density > MinPts: mycolor = red (speed = 0)
MinPts/4 < density ≤ MinPts: mycolor = green
0 < density ≤ MinPts/4: mycolor = yellow
density = 0: mycolor = white (speed = 0)
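A minimal sketch of this decision is given below, assuming the thresholds listed above; the speed constants used for green and yellow agents are hypothetical placeholders, since the text only states that red and white agents stop (speed 0).

# Illustrative sketch of the color/speed choice; GREEN_SPEED and
# YELLOW_SPEED are assumed placeholder values, not values from the chapter.
GREEN_SPEED, YELLOW_SPEED = 1, 2

def color_agent(density, min_pts):
    """Map the local density seen by an agent to a (color, speed) pair."""
    if density > min_pts:
        return "red", 0                # high density: stop and signal the region
    if density > min_pts / 4:
        return "green", GREEN_SPEED    # medium density: keep flying
    if density > 0:
        return "yellow", YELLOW_SPEED  # low density: keep flying
    return "white", 0                  # no patterns: stop and mark a desert region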


    FIGURE 22.3 The cage effect.

The variable speed introduces an adaptive behavior into the algorithm. In fact, agents adapt their movement and change their behavior (speed) based on the previous experience represented by the red and white agents. Red and white agents stop, signaling to the others the interesting and the desert regions. Note that for any agent that has become red or white, a new agent is generated in order to maintain a constant number of agents exploring the data. In the first case, the new agent is generated at a close random point, since the zone is considered interesting, while in the latter case it is generated at a random point over the whole space. Next, red agents run the merge procedure, which merges the neighboring clusters. The merging phase considers two different cases: when the circular neighborhood contains never-visited points and when it contains points belonging to different clusters. In the first case, the points are labeled and constitute a new cluster; in the second case, all the points are merged into the same cluster, that is, they get the label of the cluster discovered first. The first part of the code executed by yellow and green agents was added to avoid a cage effect (see Figure 22.3), which occurred during the first simulations: some agents could remain trapped inside regions surrounded by red or white agents and would have no way to get out, wasting resources useful for the exploration. Therefore, a limit was imposed on their life: when their age exceeds a determined value (Max_Life) they die and are regenerated in a new, randomly chosen position of the space.
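The merging step can be pictured as a label union over the points of the core point's Eps-neighborhood. The sketch below is a simplified stand-in for the merge procedure, using a plain dictionary of point labels instead of the concurrent, agent-based bookkeeping of SPARROW.

def merge_neighborhood(labels, neighborhood, next_cluster_id):
    """Assign the points of a core point's neighborhood to one cluster.

    `labels` maps point -> cluster id (absent = never visited). Points
    already belonging to different clusters are merged under the smallest
    label, as described in the text. Returns the id used and the next
    free id. Illustrative sketch only.
    """
    existing = {labels[p] for p in neighborhood if p in labels}
    if not existing:                       # case 1: only never-visited points
        cluster_id = next_cluster_id
        next_cluster_id += 1
    else:                                  # case 2: merge into the oldest cluster
        cluster_id = min(existing)
        for p, c in list(labels.items()):  # relabel points of absorbed clusters
            if c in existing:
                labels[p] = cluster_id
    for p in neighborhood:
        labels[p] = cluster_id
    return cluster_id, next_cluster_id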

SPARROW suffers from the same limitation as DBSCAN, that is, it cannot cope with clusters of different densities. The new algorithm SPARROW-SNN, introduced in Section 22.3, is more general and overcomes this drawback. It can be used to discover clusters with differing sizes, shapes, and densities in noisy data.

    22.3 The SPARROW-SNN Algorithm

In this section, we present the SPARROW-SNN multiagent clustering algorithm, which combines the stochastic search of the SPARROW algorithm with the SNN heuristics for discovering clusters in spatial data. This approach has a number of nice properties. It can easily be implemented on parallel computers and is robust with respect to the failure of individual agents. It can also be applied to perform approximate clustering efficiently, since the points visited and analyzed by the agents represent a significant (in the ergodic sense) subset of the entire dataset. The subset reduces the size of the dataset while keeping the accuracy loss as small as possible. Moreover, SPARROW-SNN can discover clusters with differing sizes, shapes, and densities in noisy data in parallel. The behavior requires the individual members first to explore the environment, searching for a goal whose position is not known a priori, and then, after the goal is located, all the flock members move toward the goal. The same strategy is used by SPARROW to transform the agents into hunters with a foraging behavior in order to explore the spatial data while searching for clusters. Clusters are discovered using the heuristic principles of the SNN clustering algorithm.

    22.3.1 The SNN Clustering Algorithm

SNN is a clustering algorithm developed by Ertz et al. [16] to discover clusters with differing sizes, shapes, and densities in noisy, high-dimensional data. The algorithm extends the nearest-neighbor nonhierarchical clustering technique of Jarvis and Patrick [19] by redefining the similarity between pairs of points in terms of how many nearest neighbors the two points share.


Using this new definition of similarity, the algorithm eliminates noise and outliers, identifies representative points, and then builds clusters around the representative points. These clusters do not contain all the points; rather, they represent relatively uniform groups of points.

The SNN algorithm starts by performing the Jarvis–Patrick scheme. In the Jarvis–Patrick algorithm, a set of objects is partitioned into clusters on the basis of the number of shared nearest neighbors. The standard implementation consists of two phases. The first, a preprocessing stage, identifies the K nearest neighbors of each object in the dataset. In the subsequent clustering stage, two objects i and j join the same cluster if:

• i is one of the K nearest neighbors of j;
• j is one of the K nearest neighbors of i;
• i and j have at least Kmin of their K nearest neighbors in common;

where K and Kmin are user-defined parameters. For each such pair of points i and j, a link with an associated weight is defined. The strength of the link between i and j is defined as:

\[
\text{strength}(i, j) = \sum (k + 1 - m)(k + 1 - n), \quad \text{where } i_m = j_n .
\]

In this equation, k is the size of the nearest-neighbor list, and m and n are the positions of a shared nearest neighbor in i's and j's lists; the sum is taken over the shared neighbors. At this point, clusters can be obtained by removing all links with weights less than a user-specified threshold and taking all the connected components as clusters. A major drawback of the Jarvis–Patrick algorithm is that the threshold needs to be set high enough, since otherwise two distinct sets of points can be merged into the same cluster even if there is only a single link across them. On the other hand, if a high threshold is applied, then a natural cluster will be split into many small clusters due to the variations in similarity within the cluster.
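Reading the formula directly, the link strength can be computed from the two ordered nearest-neighbor lists as in the sketch below (positions m and n are 1-based, as in the equation); this is an illustrative implementation, not the code of Ertz et al.

def link_strength(nn_i, nn_j):
    """Strength of the link between points i and j.

    nn_i and nn_j are the ordered k-nearest-neighbor lists of i and j.
    For every shared neighbor appearing at (1-based) positions m and n,
    the contribution is (k + 1 - m) * (k + 1 - n): shared neighbors that
    are near the top of both lists count more than distant ones.
    """
    k = len(nn_i)
    pos_j = {point: n for n, point in enumerate(nn_j, start=1)}
    return sum((k + 1 - m) * (k + 1 - pos_j[point])
               for m, point in enumerate(nn_i, start=1)
               if point in pos_j)

# Example with k = 4: the two lists share neighbors 'a' and 'c'
print(link_strength(['a', 'b', 'c', 'd'], ['c', 'a', 'e', 'f']))
# 'a': (4+1-1)*(4+1-2) = 12 ; 'c': (4+1-3)*(4+1-1) = 8  ->  20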

The shared nearest-neighbor algorithm addresses these problems by introducing the following steps:

1. For every node (data point), calculate the total strength of the links coming out of the point.
2. Identify representative points by choosing the points that have high density (> core_threshold).
3. Identify noise points by choosing the points that have low density (< noise_threshold).


for i = 1..MaxGenerations
  foreach agent (yellow, green)
    if (not visited(current_point))
      conn = compute_conn( );
      if (conn > core_threshold) mycolor = red;
      noise_threshold < conn


FIGURE 22.5 The two datasets used in our experiments: (a) DS4 (8843 points) and (b) GEORGE (5463 points).

Also in this case, a cage effect was observed during the simulations. This part is not reported in the pseudo code, as it is the same as that used in the SPARROW algorithm.

    22.4 Experimental Results

In Sections 22.4.1 and 22.4.2, we present the experimental results obtained for the two algorithms. Both algorithms have been implemented using SWARM, a multiagent software platform for the simulation of complex adaptive systems. In the SWARM system, the basic unit of simulation is the swarm, a collection of agents executing a schedule of actions. SWARM provides object-oriented libraries of reusable components for building models and for analyzing, displaying, and controlling experiments on those models. More information about SWARM can be obtained from Reference 17.

    22.4.1 SPARROW Results

We evaluated the accuracy of the solution supplied by SPARROW in comparison with that of DBSCAN, and the performance of the search strategy of SPARROW in comparison with the standard flocking search strategy and with the linear randomized search. Furthermore, we evaluated the impact of the number of agents on the foraging-for-clusters performance. Results are compared with the ones obtained using a publicly available version of DBSCAN.

For the experiments, we used two synthetic datasets, shown in Figure 22.5(a) and (b). The first dataset, called GEORGE, consists of 5463 points. The second dataset, called DS4, contains 8843 points. Each point of the two datasets has two attributes that define the x and y coordinates. Furthermore, both datasets contain a considerable quantity of noise.

Although DBSCAN and SPARROW produce the same results if we examine all points of the dataset, our experiments show that SPARROW can obtain, with an average accuracy of about 93% on the GEORGE dataset and about 78% on DS4, the same number of clusters with a slightly smaller percentage of points for each cluster, using only 22% of the spatial queries used by DBSCAN. The same results cannot be obtained by DBSCAN because of its different strategy for attributing points to clusters. In fact, if we stopped DBSCAN before it had performed the spatial queries on all the points, we would obtain the correct number of points for the clusters already individuated and probably a smaller number of points for the cluster being built, but of course we would not discover all the clusters.

Table 22.1 and Table 22.2 show, for the two datasets, the number of clusters and the percentage of points for each cluster found by DBSCAN and SPARROW. To verify the effectiveness of the search strategy we have compared SPARROW with the random walk search (RWS) strategy of the Reynolds flock algorithm and with the linear randomized search (LRS) strategy. Figure 22.6 and Figure 22.7 give the number of clusters found through the three different strategies versus the number of visited points for the DS4 and GEORGE datasets.


TABLE 22.1 Number of Clusters and Number of Points for Clusters for the GEORGE Dataset (Percentage in Comparison to the Total Points per Cluster Found by DBSCAN) when SPARROW Analyzes 7, 12, and 22% of the Points

Clustering using      Percentage of data points per cluster found by SPARROW
the GEORGE dataset    7%      12%     22%
G                     57.6    83.2    92.4
E                     58.2    71.1    91.3
O                     61.3    85.8    94.2
R                     50.6    72.7    93.3
G                     48.8    77.2    89.7
E                     61.1    81.2    94.5

TABLE 22.2 Number of Clusters and Number of Points for Clusters for the DS4 Dataset (Percentage in Comparison to the Total Points per Cluster Found by DBSCAN) when SPARROW Analyzes 7, 12, and 22% of the Points

Clustering using      Percentage of data points per cluster found by SPARROW
the DS4 dataset       7%       12%      22%
1                     51.16    70.99    78.86
2                     45.91    64.74    74.40
3                     40.68    59.36    81.95
4                     44.21    60.66    81.67
5                     54.65    58.72    71.54
6                     48.77    59.91    78.10
7                     54.29    66.43    79.18
8                     51.16    70.99    78.86
9                     45.91    64.74    74.40

FIGURE 22.6 Number of core points found for the SPARROW, random, and flock strategies versus total number of visited points for the DS4 dataset.


FIGURE 22.7 Number of core points found for the SPARROW, random, and flock strategies versus total number of visited points for the GEORGE dataset.

FIGURE 22.8 The impact of the number of agents on the foraging-for-clusters strategy (DS4).

FIGURE 22.9 The impact of the number of agents on the foraging-for-clusters strategy (GEORGE).

At the beginning, the random strategy, and to a lesser extent the flock strategy, outperform SPARROW; however, after 200 to 250 visited points SPARROW shows a superior behavior with respect to both search strategies, because the adaptive behavior of the algorithm allows the agents to learn from their previous experience. Finally, we present the impact of the number of agents on the foraging-for-clusters performance. Figure 22.8 and Figure 22.9 give the number of clusters found in 100 time steps (generations) for 25, 50, and 100 agents. A comparative analysis reveals that a population of 100 agents discovers a larger number of clusters than the two smaller populations (the scalability is almost linear). This scalable behavior of the algorithm determines a faster completion time, because a smaller number of iterations is necessary to produce the solution.


FIGURE 22.10 The two datasets used in our experiments: (a) DS1 (8000 points) and (b) NorthEast (123,593 points); circles surround the three towns (Boston, New York, Philadelphia) of the NorthEast dataset.

    22.4.2 SPARROW-SNN Results

In this section, we present an experimental evaluation of the SPARROW-SNN algorithm. We intend to explore the following issues:

1. Determine the accuracy of the approximate solution that we obtain if we run our clustering algorithm on only a percentage of the points, as opposed to running the SNN clustering algorithm on the entire dataset.

2. Determine the effects of using the SPARROW-SNN searching strategy, as opposed to a random-walk searching strategy, in order to identify clusters.

3. Determine the impact of the number of agents on the foraging-for-clusters performance.

For the experiments, we used a synthetic dataset and a real-life dataset extracted from a spatial database. The structures of these datasets are shown in Figure 22.10(a) and (b).

The first dataset, called DS1, contains 8,000 points and presents different densities in the clusters. The second dataset, called NorthEast, contains 123,593 points representing postal addresses of three metropolitan areas (New York, Boston, and Philadelphia) in the North East States.

We first illustrate the accuracy loss of our SPARROW-SNN algorithm in comparison with the SNN algorithm when SPARROW-SNN is used as a technique for approximate clustering. For this purpose, we implemented a version of SNN and we computed the number of clusters and the number of points per cluster for the two datasets. Table 22.3 presents a comparison of these results with respect to the ones obtained from SPARROW-SNN when a population of 50 agents visited, respectively, 7, 12, and 22% of the entire dataset.

Note that with only 7% of the points we can have a clear vision of the found clusters, and with a few more points we can obtain nearly all of their points. This trend is well marked in the NorthEast dataset. For the DS1 dataset, the results are not so well defined, because the clusters numbered 3 and 4 have very few points, so they are very hard to discover. For the real dataset, we only report the results for the three main clusters, representing the towns of Boston, New York, and Philadelphia.

We can explain these good results through the adaptive search strategy of SPARROW-SNN, which requires the individual agents first to explore the data searching for representative points, whose positions are not known a priori; then, after the representative points are located, all the flock members are steered to move toward the representative points, which represent the interesting regions, while avoiding the uninteresting areas, which are instead marked as obstacles, and adaptively changing their speed.

To verify the effectiveness of the search strategy we have compared SPARROW-SNN with the RWS strategy and with the standard flocking search strategy. Figure 22.11 and Figure 22.12 show the number of representative points found with SPARROW-SNN and those found with the random search and with the standard flock versus the total number of visited points. Both figures reveal that the number of representative points discovered at the beginning (and until about 170 visited points for DS1 and 150 for the NorthEast dataset) by the random strategy, and to a lesser extent by the flock, is slightly greater than that of SPARROW-SNN.


TABLE 22.3 Number of Clusters and Number of Points for Clusters for DS1 and NorthEast (Percentage in Comparison to the Total Points per Cluster Found by SNN) when SPARROW-SNN Analyzes 7, 12, and 22% of the Points

Clustering using         Percentage of data points per cluster
the NorthEast dataset    7%      12%     22%
Philadelphia             42.5    65.2    79.4
New York                 38.7    52.3    67.6
Boston                   46.5    68.6    82.3

Clustering using         Percentage of data points per cluster found by SPARROW-SNN
the DS1 dataset          7%       12%      22%
1                        41.35    58.31    70.37
2                        30.08    53.58    60.72
3                        29.28    40.99    53.02
4                        20.9     30.5     51.41
5                        53.38    65.36    76.56
6                        56.89    69.87    73.63
7                        33.89    43.5     61.58

FIGURE 22.11 Number of core points found for SPARROW-SNN, random, and flock versus total number of visited points for the DS1 dataset.

Subsequently, our strategy shows a superior behavior with respect to both search strategies, because the adaptive behavior of the algorithm allows the agents to learn from their previous experience.

Finally, in order to study the scalability of our approach, we present the impact of the number of agents on the foraging-for-clusters performance. Figure 22.13 and Figure 22.14 give, respectively for the DS1 and the NorthEast dataset, the number of clusters found in 100 time steps (generations) for 25, 50, and 100 agents. A comparative analysis reveals that a population of 100 agents discovers a larger number of clusters (almost linearly more) than the two smaller populations. This scalable behavior of the algorithm should determine a faster completion time, because a smaller number of iterations should be necessary to produce the solution.

We try to give a partial theoretical explanation of the behavior observed in Figure 22.11 and Figure 22.12, that is, of the improvement in accuracy of the SPARROW-SNN strategy, by introducing an entropy-based model in Section 22.5.


FIGURE 22.12 Number of core points found for SPARROW-SNN, random, and flock versus total number of visited points for the NorthEast dataset.

FIGURE 22.13 The impact of the number of agents on the foraging strategy for the DS1 dataset.

FIGURE 22.14 The impact of the number of agents on the foraging strategy for the NorthEast dataset.

    22.5 Entropy Model

In order to verify and explain the behavior of our system, we used a model based on entropy, as introduced in Reference 20. The authors use a measure of entropy to analyze emergence in multiagent systems. The fundamental claim of that work is that the relation between self-organization based on emergence in multiagent systems and concepts such as entropy is not just a loose metaphor, but can provide quantitative, analytical guidelines for designing and operating agent systems.


These concepts can be applied to measure the behavior of multiagent systems. The main result suggested there concerns the principle that the key to reducing disorder in a multiagent system, in order to achieve a coherent global behavior, is coupling that system to another one in which disorder increases. This corresponds to a macrolevel where the order increases, that is, a coherent behavior arises, and a microlevel where an increase in disorder is the cause of this coherent behavior at the macrolevel.

A multiagent system should follow the second law of thermodynamics (energy spontaneously disperses from being localized to becoming spread out if it is not hindered) if agents move without any constraint. However, if we add information in an intelligent way, the agents' natural tendency toward maximum entropy will be counteracted and the system goes toward self-organization.

Here, we apply this model to the SPARROW-SNN algorithm. As stated in Reference 20, we can observe two levels of entropy: a macrolevel, at which organization takes place, balanced by a microlevel, at which we have an increase of entropy. For the sake of clarity, in SPARROW-SNN the microlevel is represented by the positions of the red and white agents, signaling, respectively, interesting and desert zones, and the macrolevel is computed considering the positions of all the agents. Therefore, we expect to observe an increase of micro entropy caused by the birth of new red and white agents and, on the contrary, a decrease in macro entropy, indicating organization in the coordination model of the agents.

In the following, we give a more exact description of the entropy-based model. In information theory, entropy can be defined as

\[
S = -\sum_i p_i \log p_i .
\]

Now, to adapt this formula to our needs, we use a location-based entropy. Consider an agent moving in the space of the data, divided into a grid of N × M = K cells, each of the same size. If N and M are quite large, a randomly moving agent has the same probability of being in any of the K cells of the grid. We measured the entropy by running the simulation 100 times, for 2000 time steps, using 50 agents, and counting how many times the agent falls in the same cell i at each time step. Dividing this number by T, we obtain the probability p_i that the agent is in that cell.

    Then, the locational entropy will be:

\[
S = -\frac{\sum_{i=1}^{a} p_i \log p_i}{\log a} \qquad (22.1)
\]

In the case of a random distribution, every cell has probability 1/a, so the overall entropy will be (log a)/(log a) = 1; this explains the normalization factor log a.

This equation can be generalized to P agents by summing over all the agents and dividing by P to obtain the average. Equation (22.1) represents the macro entropy; if we consider only the red and white agents, it represents the micro entropy.
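A possible way of computing the locational entropy of Equation (22.1) from simulation traces is sketched below; the cell indexing and the example visit sequences are illustrative. The micro entropy is obtained simply by restricting the set of agents to the red and white ones.

import math
from collections import Counter

def locational_entropy(visits, num_cells):
    """Normalized locational entropy of one agent.

    `visits` is the list of cell indices occupied by the agent at each
    observation; p_i is estimated as the fraction of observations spent
    in cell i. The result lies in [0, 1]; a uniform random walk gives 1.
    """
    total = len(visits)
    counts = Counter(visits)
    s = -sum((c / total) * math.log(c / total) for c in counts.values())
    return s / math.log(num_cells)

def average_entropy(visits_per_agent, num_cells):
    """Average locational entropy over a set of agents (macro or micro level)."""
    return sum(locational_entropy(v, num_cells)
               for v in visits_per_agent) / len(visits_per_agent)

# Example: an agent stuck in two cells vs. one spread over the whole grid
print(locational_entropy([3, 3, 3, 7, 7, 7], num_cells=100))   # low entropy
print(locational_entropy(list(range(100)), num_cells=100))     # = 1.0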

We computed the micro and macro locational entropy for the real dataset NorthEast with the SPARROW-SNN algorithm, using the parameters cited earlier. The results are shown in Figure 22.15 and Figure 22.16.

As expected, we can observe an increase in the micro entropy and a decrease in the macro entropy of SPARROW-SNN, due to the organization introduced in the coordination model of the agents by the attraction toward red agents and the repulsion from white agents. On the contrary, in the random and standard flock models the curve of macro entropy is almost constant, confirming the absence of organization.

These experiments partially explain the validity of our approach, but further studies are necessary, and will be conducted, to better demonstrate the positive effects of organization and the correlation between macro entropy and the validity of the searching strategy.


FIGURE 22.15 Micro entropy (red and white agents) for the NorthEast dataset using SPARROW-SNN.

FIGURE 22.16 Macro entropy (all the agents) for the NorthEast dataset, for the SPARROW-SNN, random, and flock strategies.

    22.6 Conclusions

In this chapter, we have described a fully decentralized approach to clustering spatial data with a multiagent system. The approach is based on the use of SI techniques. Two novel algorithms, which combine a density-based and a shared nearest-neighbor clustering strategy with a flocking algorithm, have been presented. The algorithms have been implemented in SWARM and evaluated using synthetic datasets and one real-world dataset. Measures of the accuracy of the results show that the flocking algorithm can be efficiently applied as a data reduction strategy to perform approximate clustering. Moreover, by means of an entropy-based model we have theoretically argued that the adaptive search strategy of SPARROW is more efficient than that of the RWS strategy. Finally, the algorithms show a good scalable behavior.


    Acknowledgment

This work was supported by the CNR/MIUR project legge 449/97-DM 30/10/2000 and by Project FIRB Grid.it (RBNE01KNFP).

    References

[1] Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Mateo, CA, 2000.

[2] Kaufman, L. and Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York, 1990.

[3] Karypis, G., Han, E., and Kumar, V. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32, 68–75, 1999.

[4] Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2, 169–194, 1998.

[5] Wang, W., Yang, J., and Muntz, R. STING: A statistical information grid approach to spatial data mining. In Proceedings of the International Conference on Very Large Data Bases (VLDB'97), 1997, pp. 186–195.

[6] Han, J., Kamber, M., and Tung, A.K.H. Spatial clustering methods in data mining: A survey. In Geographic Data Mining and Knowledge Discovery, H. Miller and J. Han (Eds.), Taylor & Francis, London, 2001.

[7] Deneubourg, J.L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C., and Chrétien, L. The dynamics of collective sorting: Robot-like ants and ant-like robots. In Proceedings of the First Conference on Simulation of Adaptive Behavior, J.A. Meyer and S.W. Wilson (Eds.), MIT Press/Bradford Books, Cambridge, MA, 1990, pp. 356–363.

[8] Lumer, E.D. and Faieta, B. Diversity and adaptation in populations of clustering ants. In Proceedings of the Third International Conference on Simulation of Adaptive Behavior: From Animals to Animats (SAB94), D. Cliff, P. Husbands, J.A. Meyer, and S.W. Wilson (Eds.), MIT Press, Cambridge, MA, 1994, pp. 501–508.

[9] Kuntz, P. and Snyers, D. Emergent colonization and graph partitioning. In Proceedings of the Third International Conference on Simulation of Adaptive Behavior: From Animals to Animats (SAB94), D. Cliff, P. Husbands, J.A. Meyer, and S.W. Wilson (Eds.), MIT Press, Cambridge, MA, 1994, pp. 494–500.

[10] Monmarché, N., Slimane, M., and Venturini, G. On improving clustering in numerical databases with artificial ants. In Advances in Artificial Life: 5th European Conference, ECAL 99, Vol. 1674 of Lecture Notes in Computer Science, Springer-Verlag, Berlin, 1999, pp. 626–635.

[11] Bonabeau, E., Dorigo, M., and Theraulaz, G. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, Oxford, 1999.

[12] Grassé, P.P. La reconstruction du nid et les coordinations inter-individuelles chez Bellicositermes natalensis et Cubitermes sp. La théorie de la stigmergie: Essai d'interprétation du comportement des termites constructeurs. Insectes Sociaux, 6, 41–80, 1959.

[13] Macgill, J. Using flocks to drive a geographical analysis engine. In Artificial Life VII: Proceedings of the Seventh International Conference on Artificial Life, MIT Press, Reed College, Portland, OR, 2000, pp. 1–6.

[14] Kollios, G., Gunopoulos, D., Koudas, N., and Berchtold, S. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE Transactions on Knowledge and Data Engineering, 15, 1170–1187, 2003.

[15] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, 1996, pp. 226–231.


[16] Ertz, L., Steinbach, M., and Kumar, V. A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at the 2nd SIAM International Conference on Data Mining, Washington, USA, April 2002, pp. 105–115.

[17] Minar, N., Burkhart, R., Langton, C., and Askenazi, M. The Swarm Development Group, 1996, http://www.santafe.edu/projects/swarm.

[18] Reynolds, C.W. Flocks, herds, and schools: A distributed behavioral model. Computer Graphics, 21, 25–34, 1987.

[19] Jarvis, R.A. and Patrick, E.A. Clustering using a similarity measure based on shared nearest neighbors. IEEE Transactions on Computers, C-22(11), 1025–1034, 1973.

[20] Van Dyke Parunak, H. and Brueckner, S. Entropy and self-organization in multi-agent systems. In Proceedings of the Fifth International Conference on Autonomous Agents, ACM Press, 2001, pp. 124–130.
