
    Trajectory Clustering: A Partition-and-Group Framework

Jae-Gil Lee, Jiawei Han
Department of Computer Science

    University of Illinois at Urbana-Champaign

    [email protected], [email protected]

Kyu-Young Whang
Department of Computer Science / AITrc

    KAIST

    [email protected]

    ABSTRACT

Existing trajectory clustering algorithms group similar trajectories as a whole, thus discovering common trajectories. Our key observation is that clustering trajectories as a whole could miss common sub-trajectories. Discovering common sub-trajectories is very useful in many applications, especially if we have regions of special interest for analysis. In this paper, we propose a new partition-and-group framework for clustering trajectories, which partitions a trajectory into a set of line segments, and then, groups similar line segments together into a cluster. The primary advantage of this framework is to discover common sub-trajectories from a trajectory database. Based on this partition-and-group framework, we develop a trajectory clustering algorithm TRACLUS. Our algorithm consists of two phases: partitioning and grouping. For the first phase, we present a formal trajectory partitioning algorithm using the minimum description length (MDL) principle. For the second phase, we present a density-based line-segment clustering algorithm. Experimental results demonstrate that TRACLUS correctly discovers common sub-trajectories from real trajectory data.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining

    General Terms: Algorithms

Keywords: Partition-and-group framework, trajectory clustering, MDL principle, density-based clustering

1. INTRODUCTION

Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects [11]. Clustering has been widely used in numerous applications such as

The work was supported in part by the Korea Research Foundation grant, funded by the Korean Government (MOEHRD), KRF-2006-214-D00129 and the U.S. National Science Foundation grants IIS-05-13678/06-42771 and BDI-05-15813. Any opinions, findings, and conclusions expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'07, June 12-14, 2007, Beijing, China.
Copyright 2007 ACM 978-1-59593-686-8/07/0006 ...$5.00.

market research, pattern recognition, data analysis, and image processing. A number of clustering algorithms have been reported in the literature. Representative algorithms include k-means [17], BIRCH [24], DBSCAN [6], OPTICS [2], and STING [22]. Previous research has mainly dealt with clustering of point data.

Recent improvements in satellites and tracking facilities have made it possible to collect a large amount of trajectory data of moving objects. Examples include vehicle position data, hurricane track data, and animal movement data. There is increasing interest in performing data analysis over these trajectory data. A typical data analysis task is to find objects that have moved in a similar way [21]. Thus, an efficient clustering algorithm for trajectories is essential for such data analysis tasks.

Gaffney et al. [7, 8] have proposed a model-based clustering algorithm for trajectories. In this algorithm, a set of trajectories is represented using a regression mixture model. Then, unsupervised learning is carried out using the maximum likelihood principle. Specifically, the EM algorithm is used to determine the cluster memberships. This algorithm clusters trajectories as a whole; i.e., the basic unit of clustering is the whole trajectory.

Our key observation is that clustering trajectories as a whole cannot detect similar portions of the trajectories. We note that a trajectory may have a long and complicated path. Hence, even though some portions of trajectories show a common behavior, the whole trajectories might not.

Example 1. Consider the five trajectories in Figure 1. We can clearly see that there is a common behavior, denoted by the thick arrow, in the dotted rectangle. However, if we cluster these trajectories as a whole, we cannot discover the common behavior since they move in totally different directions; thus, we miss this valuable information. □

[Figure: five trajectories TR1-TR5 with a common sub-trajectory marked by a thick arrow inside a dotted rectangle.]

Figure 1: An example of a common sub-trajectory.

Our solution is to partition a trajectory into a set of line segments and then group similar line segments. This framework is called a partition-and-group framework. The primary advantage of the partition-and-group framework is the discovery of common sub-trajectories from a trajectory


database. This is exactly the reason why we partition a trajectory into a set of line segments.

We contend that discovering the common sub-trajectories is very useful, especially if we have regions of special interest for analysis. In this case, we want to concentrate on the common behaviors within those regions. There are many examples in real applications. We present two application scenarios.

1. Hurricanes: Landfall Forecasts [18]
Meteorologists are trying to improve the ability to forecast the location and time of hurricane landfall. An accurate landfall location forecast is of prime importance since it is crucial for reducing hurricane damage. Meteorologists will be interested in the common behaviors of hurricanes near the coastline (i.e., at the time of landing) or at sea (i.e., before landing). Thus, discovering the common sub-trajectories helps improve the accuracy of hurricane landfall forecasts.

2. Animal Movements: Effects of Roads and Traffic [23]
Zoologists are investigating the impacts of the varying levels of vehicular traffic on the movement, distribution, and habitat use of animals. They mainly measure the mean distance between the road and animals. Zoologists will be interested in the common behaviors of animals near the road where the traffic rate has been varied. Hence, discovering the common sub-trajectories helps reveal the effects of roads and traffic.

Example 2. Consider an animal habitat and roads in Figure 2. Thick lines represent roads, and they have different traffic rates. Wisdom et al. [23] explore the spatial patterns of mule deer and elk in relation to roads of varying traffic rates. More specifically, one of the objectives is to assess the degree to which mule deer and elk may avoid areas near roads based on variation in rates of motorized traffic. These areas are represented as solid rectangles. Thus, our framework is indeed useful for this kind of research. □

[Figure: an animal habitat divided into 30 x 30 m cells, crossed by roads (thick lines) with zero, low, moderate, and high traffic rates.]

Figure 2: Monitoring animal movements [23].¹

One might argue that, if we prune the useless parts of trajectories and keep only the interesting ones, we can use traditional clustering algorithms that cluster trajectories as a whole. This alternative, however, has two major drawbacks compared with our partition-and-group framework. First, it is tricky to determine which parts of the trajectories are useless. Second, pruning useless parts of trajectories prevents us from discovering unexpected clustering results.

In this paper, we propose a partition-and-group framework for clustering trajectories. As indicated by its name,

¹The figure is borrowed from the authors' slide.

trajectory clustering based on this framework consists of the following two phases:

(1) The partitioning phase: Each trajectory is optimally partitioned into a set of line segments. These line segments are provided to the next phase.

(2) The grouping phase: Similar line segments are grouped into a cluster. Here, a density-based clustering method is exploited.

Density-based clustering methods [2, 6] are the most suitable for line segments because they can discover clusters of arbitrary shape and can filter out noise [11]. We can easily see that line segment clusters are usually of arbitrary shape and that a trajectory database typically contains a large amount of noise (i.e., outliers).
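To preview how the two phases fit together, here is a minimal Python driver sketch. The callables partition, group, and represent are placeholders for the phase implementations sketched later in this paper; their names are ours, not the authors'.

def traclus(trajectories, eps, min_lns, partition, group, represent):
    """Partition-and-group skeleton: partition each trajectory into line
    segments, cluster the segments, then summarize each cluster."""
    segments = []
    for tr in trajectories:
        segments.extend(partition(tr))        # phase 1: partitioning
    clusters = group(segments, eps, min_lns)  # phase 2: density-based grouping
    return [represent(c, min_lns) for c in clusters]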

    In summary, the contributions of this paper are as follows:

• We propose a partition-and-group framework for clustering trajectories. This framework enables us to discover common sub-trajectories, whereas previous frameworks do not.

• We present a formal trajectory partitioning algorithm using the minimum description length (MDL) [9] principle.

• We present an efficient density-based clustering algorithm for line segments. We carefully design a distance function to define the density of line segments. In addition, we provide a simple but effective heuristic for determining the parameter values of the algorithm.

• We demonstrate, by using various real data sets, that our clustering algorithm effectively discovers the representative trajectories from a trajectory database.

The rest of the paper is organized as follows. Section 2 presents an overview of our trajectory clustering algorithm. Section 3 proposes a trajectory partitioning algorithm for the first phase. Section 4 proposes a line segment clustering algorithm for the second phase. Section 5 presents the results of experimental evaluation. Section 6 discusses related work. Finally, Section 7 concludes the paper.

2. TRAJECTORY CLUSTERING

In this section, we present an overview of our design. Section 2.1 formally presents the problem statement. Section 2.2 presents the skeleton of our trajectory clustering algorithm, which we call TRACLUS. Section 2.3 defines our distance function for line segments.

2.1 Problem Statement

We develop an efficient clustering algorithm based on the partition-and-group framework. Given a set of trajectories I = {TR1, ..., TRnumtra}, our algorithm generates a set of clusters O = {C1, ..., Cnumclus} as well as a representative trajectory for each cluster Ci, where the trajectory, cluster, and representative trajectory are defined as follows.

A trajectory is a sequence of multi-dimensional points. It is denoted as TRi = p1 p2 p3 ... pj ... pleni (1 ≤ i ≤ numtra). Here, pj (1 ≤ j ≤ leni) is a d-dimensional point. The length leni of a trajectory can be different from those of other trajectories. A trajectory pc1 pc2 ... pck (1 ≤ c1 < c2 < ... < ck ≤ leni) is called a sub-trajectory of TRi.

[...]

07:   if (costpar > costnopar) then /* partition at the previous point */
08:     Add pcurrIndex−1 into the set CPi;
09:     startIndex := currIndex − 1, length := 1;
10:   else
11:     length := length + 1;
12: Add pleni into the set CPi; /* the ending point */

Figure 8: An approximate algorithm for partitioning a trajectory.
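For concreteness, the following Python sketch mirrors the scan in Figure 8. The MDL cost formulas themselves appear on pages missing from this transcript, so the two cost functions below are simplified stand-ins (a two-part code length with log2 terms; the angle-distance part of L(D|H) is omitted), not the authors' exact definitions.

import math

def perp_dist(p, a, b):
    """Perpendicular distance from point p to segment a-b (2-D, simplified)."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    L2 = dx * dx + dy * dy
    if L2 == 0.0:
        return math.dist(p, a)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / L2))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def log2c(x):
    return math.log2(max(x, 1.0))  # clamp to avoid log of tiny lengths

def mdl_par(tr, i, j):
    """Cost of replacing points i..j by the single segment (tr[i], tr[j])."""
    l_h = log2c(math.dist(tr[i], tr[j]))
    l_d = sum(log2c(1.0 + perp_dist(tr[k], tr[i], tr[j]))
              for k in range(i + 1, j))
    return l_h + l_d

def mdl_nopar(tr, i, j):
    """Cost of keeping every original segment between points i and j."""
    return sum(log2c(math.dist(tr[k], tr[k + 1])) for k in range(i, j))

def approximate_partition(tr):
    """Approximate partitioning (cf. Figure 8): O(n) MDL comparisons."""
    cp = [tr[0]]                       # the starting point is characteristic
    start, length = 0, 1
    while start + length < len(tr):
        curr = start + length
        if mdl_par(tr, start, curr) > mdl_nopar(tr, start, curr):
            cp.append(tr[curr - 1])    # partition at the previous point
            start, length = curr - 1, 1
        else:
            length += 1
    cp.append(tr[-1])                  # the ending point
    return cp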

We now present, in Lemma 1, the time complexity of the approximate algorithm.

Lemma 1. The time complexity of the algorithm in Figure 8 is O(n), where n is the length (i.e., the number of points) of a trajectory TRi.
Proof: The number of MDL computations required is equal to the number of line segments (i.e., n − 1) in the trajectory TRi. □

Of course, this algorithm may fail to find the optimal partitioning. Let us present a simple example in Figure 9. Suppose that the optimal partitioning with the minimal MDL cost is {p1 p5}. This algorithm cannot find the exact solution since it stops scanning at p4, where MDLpar is larger than MDLnopar. Nevertheless, the precision of this algorithm is quite high. Our experience indicates that the precision is about 80% on average, which means that 80% of the approximate solutions also appear in the exact solutions.

[Figure: points p1 through p5 on a trajectory. The approximate algorithm stops scanning at p4, where MDLpar(p1, p4) exceeds MDLnopar(p1, p4), and thus partitions earlier than the exact solution {p1 p5}; the figure contrasts the approximate and exact solutions.]

Figure 9: An example where the approximate algorithm fails to find the optimal partitioning.

4. LINE SEGMENT CLUSTERING

In this section, we propose a line segment clustering algorithm for the grouping phase. We first discuss the density of line segments in Section 4.1. We then present our density-based clustering algorithm, which is based on the algorithm DBSCAN [6], in Section 4.2. We discuss the reason for choosing DBSCAN in the extended version [16]. Next, we present a method of computing a representative trajectory in Section 4.3. Finally, since density-based clustering


algorithms are intrinsically sensitive to their parameter values [11], we present a heuristic for determining parameter values in Section 4.4.

    4.1 Density of Line Segments

    4.1.1 Review of Our Distance Function

Our distance function is the weighted sum of three kinds of distances. First, the perpendicular distance measures the positional difference mainly between line segments extracted from different trajectories. Second, the parallel distance measures the positional difference mainly between line segments extracted from the same trajectory. We note that the parallel distance between two adjacent line segments in a trajectory is always zero. Third, the angle distance measures the directional difference between line segments.

Before proceeding, we address the symmetry of our distance function in Lemma 2. Symmetry is important to avoid ambiguity of the clustering result. If a distance function is asymmetric, a different clustering result can be obtained depending on the order of processing.

Lemma 2. The distance function dist(Li, Lj) defined in Section 2.3 is symmetric, i.e., dist(Li, Lj) = dist(Lj, Li).

Proof: To make dist(Li, Lj) symmetric, we have assigned the longer line segment to Li and the shorter one to Lj. (A tie can be broken by comparing the internal identifiers of the line segments.) Hence, dist(Li, Lj) is symmetric. □
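As an illustration, here is a Python sketch of a three-component distance for 2-D line segments, including the longer-segment-first ordering used in the proof above. The component formulas (projection-based perpendicular and parallel distances, a length-scaled angle distance) and the equal default weights are assumptions, since Section 2.3 with the exact definitions is not reproduced in this transcript.

import math

def _project(p, a, b):
    """Return (t, foot) for the projection of p onto the line through a-b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    L2 = (dx * dx + dy * dy) or 1e-12
    t = ((px - ax) * dx + (py - ay) * dy) / L2
    return t, (ax + t * dx, ay + t * dy)

def dist(Li, Lj, w_perp=1.0, w_par=1.0, w_theta=1.0):
    """Weighted sum of perpendicular, parallel, and angle distances.
    Li and Lj are ((x1, y1), (x2, y2)) pairs."""
    def seg_len(L):
        return math.dist(L[0], L[1])
    if seg_len(Lj) > seg_len(Li):      # assign the longer segment to Li,
        Li, Lj = Lj, Li                # which makes the function symmetric
    (si, ei), (sj, ej) = Li, Lj
    t1, f1 = _project(sj, si, ei)
    t2, f2 = _project(ej, si, ei)
    lp1, lp2 = math.dist(sj, f1), math.dist(ej, f2)
    d_perp = 0.0 if lp1 + lp2 == 0 else (lp1 ** 2 + lp2 ** 2) / (lp1 + lp2)
    # Parallel distance: distance from each projection to the nearer endpoint
    # of Li; zero for adjacent segments of the same trajectory.
    li_len = seg_len(Li)
    d_par = min(min(abs(t1), abs(t1 - 1.0)),
                min(abs(t2), abs(t2 - 1.0))) * li_len
    # Angle distance: |Lj| scaled by sin(theta), or |Lj| itself when the
    # segments point in opposite directions (theta > 90 degrees).
    ang = (math.atan2(ei[1] - si[1], ei[0] - si[0])
           - math.atan2(ej[1] - sj[1], ej[0] - sj[0]))
    theta = abs(ang) % (2 * math.pi)
    theta = min(theta, 2 * math.pi - theta)
    d_theta = seg_len(Lj) if theta > math.pi / 2 else seg_len(Lj) * math.sin(theta)
    return w_perp * d_perp + w_par * d_par + w_theta * d_theta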

    4.1.2 Notions for Density-Based Clustering

We summarize the notions required for density-based clustering in Definitions 4-9. Let D denote the set of all line segments. We adapt the definitions for points, originally proposed for the algorithm DBSCAN [6], to line segments.

Definition 4. The ε-neighborhood Nε(Li) of a line segment Li ∈ D is defined by Nε(Li) = {Lj ∈ D | dist(Li, Lj) ≤ ε}.

Definition 5. A line segment Li ∈ D is called a core line segment w.r.t. ε and MinLns if |Nε(Li)| ≥ MinLns.

Definition 6. A line segment Li ∈ D is directly density-reachable from a line segment Lj ∈ D w.r.t. ε and MinLns if (1) Li ∈ Nε(Lj) and (2) |Nε(Lj)| ≥ MinLns.

Definition 7. A line segment Li ∈ D is density-reachable from a line segment Lj ∈ D w.r.t. ε and MinLns if there is a chain of line segments Lj, Lj−1, ..., Li+1, Li ∈ D such that Lk is directly density-reachable from Lk+1 w.r.t. ε and MinLns.

Definition 8. A line segment Li ∈ D is density-connected to a line segment Lj ∈ D w.r.t. ε and MinLns if there is a line segment Lk ∈ D such that both Li and Lj are density-reachable from Lk w.r.t. ε and MinLns.

Definition 9. A non-empty subset C ⊆ D is called a density-connected set w.r.t. ε and MinLns if C satisfies the following two conditions:
(1) Connectivity: ∀Li, Lj ∈ C, Li is density-connected to Lj w.r.t. ε and MinLns;
(2) Maximality: ∀Li, Lj ∈ D, if Li ∈ C and Lj is density-reachable from Li w.r.t. ε and MinLns, then Lj ∈ C.

We depict these notions in Figure 10. Density-reachability is the transitive closure of direct density-reachability, and this relation is asymmetric. Only core line segments are mutually density-reachable. Density-connectivity, however, is a symmetric relation.

Example 4. Consider the line segments in Figure 10. Let MinLns be 3. Thick line segments indicate core line segments. ε-neighborhoods are represented by irregular ellipses. Based on the above definitions,
• L1, L2, L3, L4, and L5 are core line segments;
• L2 (or L3) is directly density-reachable from L1;
• L6 is density-reachable from L1, but not vice versa;
• L1, L4, and L5 are all density-connected. □

[Figure: two panels of line segments L1-L6 with their ε-neighborhoods drawn as irregular ellipses.]

Figure 10: Density-reachability and density-connectivity.

    4.1.3 Observations

Since line segments have both direction and length, they show a few interesting features in relation to density-based clustering. Here, we mention two observations.

We observe that the shape of an ε-neighborhood of line segments is not a circle or a sphere. Instead, its shape is very dependent on the data and is likely to be an ellipse or an ellipsoid. Figure 10 shows various shapes of ε-neighborhoods. They are mainly affected by the direction and length of a core line segment and those of border line segments.

Another interesting observation is that a short line segment could drastically degrade the clustering quality. We note that the length of a line segment represents its directional strength; i.e., if a line segment is short, its directional strength is low. This means that the angle distance is small regardless of the actual angle if one of the line segments is very short. Let us compare the two sets of line segments in Figure 11. The only difference is the length of L2. L1 can be density-reachable from L3 (or vice versa) via L2 in Figure 11 (a), but cannot be in Figure 11 (b). The result in Figure 11 (a) is counter-intuitive since L1 and L3 are far apart. Thus, it turns out that a short line segment might induce over-clustering.

[Figure: two panels, each with three line segments L1, L2, L3. In (a), L2 is very short: the angle distances dθ(L1, L2) and dθ(L2, L3) are near zero, so dist(L1, L2) = dist(L2, L3) ≤ ε and L1 is density-reachable from L3 via L2. In (b), L2 is not short: dist(L1, L2) = dist(L2, L3) > ε, so L1 and L3 are not density-reachable.]

(a) L2 is very short. (b) L2 is not short.

Figure 11: Influence of a very short line segment on clustering.


A simple solution to this problem is to make trajectory partitions longer by adjusting the partitioning criteria in Figure 8. That is, we suppress partitioning at the cost of preciseness. More specifically, to suppress partitioning, we add a small constant to costnopar in line 6. Our experience indicates that increasing the length of trajectory partitions by 20-30% generally improves the clustering quality.

    4.2 Clustering Algorithm

We now present our density-based clustering algorithm for line segments. Given a set D of line segments, our algorithm generates a set O of clusters. It requires two parameters, ε and MinLns. We define a cluster as a density-connected set. Our algorithm shares many characteristics with the algorithm DBSCAN.

Unlike DBSCAN, however, not all density-connected sets can become clusters. We need to consider the number of trajectories from which line segments have been extracted. This number of trajectories is typically smaller than that of line segments. For example, in the extreme, all the line segments in a density-connected set could be those extracted from one trajectory. We prevent such clusters since they do not explain the behavior of a sufficient number of trajectories. Here, we check the trajectory cardinality defined in Definition 10.

Definition 10. The set of participating trajectories of a cluster Ci is defined by PTR(Ci) = {TR(Lj) | Lj ∈ Ci}. Here, TR(Lj) denotes the trajectory from which Lj has been extracted. Then, |PTR(Ci)| is called the trajectory cardinality of the cluster Ci.

Figure 12 shows the algorithm Line Segment Clustering. Initially, all the line segments are assumed to be unclassified. As the algorithm progresses, they are classified as belonging to a cluster or as noise. The algorithm consists of three steps. In the first step (lines 1-12), the algorithm computes the ε-neighborhood of each unclassified line segment L. If L is determined to be a core line segment (lines 7-10), the algorithm performs the second step to expand a cluster (line 9). The cluster currently contains only Nε(L). In the second step (lines 17-28), the algorithm computes the density-connected set of a core line segment. The procedure ExpandCluster() computes the directly density-reachable line segments (lines 19-21) and adds them to the current cluster (lines 22-24). If a newly added line segment is unclassified, it is added to the queue Q for more expansion since it might be a core line segment (lines 25-26); otherwise, it is not added to Q since we already know that it is not a core line segment. In the third step (lines 13-16), the algorithm checks the trajectory cardinality of each cluster. If its trajectory cardinality is below the threshold, the algorithm filters out the corresponding cluster.

In Lemma 3, we present the time complexity of this clustering algorithm.

Lemma 3. The time complexity of the algorithm in Figure 12 is O(n log n) if a spatial index is used, where n is the total number of line segments in the database. Otherwise, it is O(n²) for high (≥ 2) dimensionality.

Proof: From lines 5 and 20 in Figure 12, we know that an ε-neighborhood query must be executed for each line segment. Hence, the time complexity of the algorithm is n × (the time complexity of an ε-neighborhood query).

Algorithm Line Segment Clustering

Input: (1) A set of line segments D = {L1, ..., Lnumln}, (2) Two parameters ε and MinLns
Output: A set of clusters O = {C1, ..., Cnumclus}
Algorithm:
/* Step 1 */
01: Set clusterId to be 0; /* an initial id */
02: Mark all the line segments in D as unclassified;
03: for each (L ∈ D) do
04:   if (L is unclassified) then
05:     Compute Nε(L);
06:     if (|Nε(L)| ≥ MinLns) then
07:       Assign clusterId to ∀X ∈ Nε(L);
08:       Insert Nε(L) − {L} into the queue Q;
          /* Step 2 */
09:       ExpandCluster(Q, clusterId, ε, MinLns);
10:       Increase clusterId by 1; /* a new id */
11:     else
12:       Mark L as noise;
/* Step 3 */
13: Allocate ∀L ∈ D to its cluster CclusterId;
    /* check the trajectory cardinality */
14: for each (C ∈ O) do
      /* a threshold other than MinLns can be used */
15:   if (|PTR(C)| < MinLns) then
16:     Remove C from the set O of clusters;

/* Step 2: compute a density-connected set */
17: ExpandCluster(Q, clusterId, ε, MinLns) {
18:   while (Q ≠ ∅) do
19:     Let M be the first line segment in Q;
20:     Compute Nε(M);
21:     if (|Nε(M)| ≥ MinLns) then
22:       for each (X ∈ Nε(M)) do
23:         if (X is unclassified or noise) then
24:           Assign clusterId to X;
25:         if (X is unclassified) then
26:           Insert X into the queue Q;
27:     Remove M from the queue Q;
28: }

Figure 12: A density-based clustering algorithm for line segments.
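For illustration, here is a compact Python sketch of the same three steps, with a linear scan of D standing in for the spatial index discussed in Lemma 3; dist can be any symmetric line-segment distance such as the sketch in Section 4.1.1, and tr_of (our name) maps a segment index to its source trajectory for the Definition 10 filter.

from collections import deque

UNCLASSIFIED, NOISE = None, -1

def line_segment_clustering(D, eps, min_lns, dist, tr_of):
    """Density-based clustering of line segments (cf. Figure 12)."""
    def neighborhood(i):               # the eps-neighborhood of D[i]
        return [j for j in range(len(D)) if dist(D[i], D[j]) <= eps]

    label = [UNCLASSIFIED] * len(D)
    cluster_id = 0
    for i in range(len(D)):            # Step 1: scan for core line segments
        if label[i] is not UNCLASSIFIED:
            continue
        seeds = neighborhood(i)
        if len(seeds) < min_lns:
            label[i] = NOISE
            continue
        for j in seeds:
            label[j] = cluster_id
        queue = deque(j for j in seeds if j != i)
        while queue:                   # Step 2: expand the cluster
            m = queue.popleft()
            m_nbrs = neighborhood(m)
            if len(m_nbrs) >= min_lns:          # m is a core line segment
                for x in m_nbrs:
                    if label[x] is UNCLASSIFIED:
                        queue.append(x)         # may itself be core
                    if label[x] is UNCLASSIFIED or label[x] == NOISE:
                        label[x] = cluster_id
        cluster_id += 1

    clusters = {}
    for i, c in enumerate(label):      # Step 3: gather clusters ...
        if c is not UNCLASSIFIED and c != NOISE:
            clusters.setdefault(c, []).append(i)
    # ... and filter by trajectory cardinality |PTR(C)| (Definition 10);
    # a threshold other than min_lns could be used here.
    return [idx for idx in clusters.values()
            if len({tr_of(i) for i in idx}) >= min_lns]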

If we do not use any index, we have to scan all the line segments in the database. Thus, the time complexity would be O(n²). If we use an appropriate index such as the R-tree [10], we are able to efficiently find the line segments in an ε-neighborhood by traversing the index. Thus, the time complexity is reduced to O(n log n). □

This algorithm can be easily extended to support trajectories with weights. This extension will be very useful in many applications. For example, it is natural that a stronger hurricane should have a higher weight. To accomplish this, we need to modify the method of deciding the cardinality of an ε-neighborhood (i.e., |Nε(L)|). Instead of simply counting the number of line segments, we compute the weighted count by summing up the weights of the line segments.
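A minimal sketch of that modification, with weight_of as a hypothetical accessor for the weight a segment inherits from its trajectory:

def weighted_count(L, D, eps, dist, weight_of):
    """Weighted cardinality of the eps-neighborhood of L: instead of
    counting neighbors, sum the weights of the neighboring segments."""
    return sum(weight_of(M) for M in D if dist(L, M) <= eps)

The core-line-segment test of Definition 5 would then compare this weighted count, rather than |Nε(L)|, against MinLns.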


We note that our distance function is not a metric since it does not obey the triangle inequality. It is easy to construct an example of three line segments L1, L2, and L3 where dist(L1, L3) > dist(L1, L2) + dist(L2, L3). This makes direct application of traditional spatial indexes difficult. Several techniques can be used to address this issue. For example, we can adopt constant shift embedding [19] to convert a distance function that does not obey the triangle inequality into one that does. We do not consider it here, leaving it as the topic of a future paper.

4.3 Representative Trajectory of a Cluster

The representative trajectory of a cluster describes the overall movement of the trajectory partitions that belong to the cluster. It can be considered a model [1] for clusters. We need to extract quantitative information on the movement within a cluster so that domain experts are able to understand the movement in the trajectories. Thus, in order to gain the full practical potential of trajectory clustering, this representative trajectory is absolutely required.

Figure 13 illustrates our approach to generating a representative trajectory. A representative trajectory is a sequence of points RTRi = p1 p2 p3 ... pj ... pleni (1 ≤ i ≤ numclus). These points are determined using a sweep line approach. While sweeping a vertical line across the line segments in the direction of the major axis of a cluster, we count the number of line segments hitting the sweep line. This number changes only when the sweep line passes a starting point or an ending point. If this number is equal to or greater than MinLns, we compute the average coordinate of those line segments with respect to the major axis and insert the average into the representative trajectory; otherwise, we skip the current point (e.g., the 5th and 6th positions in Figure 13). Besides, if a previous point is located too close (e.g., the 3rd position in Figure 13), we skip the current point to smooth the representative trajectory.

[Figure: a cluster of line segments with MinLns = 3; a sweep line passes positions 1-8, and the representative trajectory connects the averaged points.]

Figure 13: An example of a cluster and its representative trajectory.

We explain the above approach in more detail. To represent the major axis of a cluster, we define the average direction vector in Definition 11. We add up the vectors instead of their direction vectors (i.e., unit vectors) and normalize the result. This is a nice heuristic giving the effect of a longer vector contributing more to the average direction vector. We compute the average direction vector over the set of vectors representing each line segment in the cluster.

Definition 11. Suppose a set of vectors V = {v1, v2, v3, ..., vn}. The average direction vector V̄ of V is defined by Formula (8). Here, |V| is the cardinality of V.

    V̄ = (v1 + v2 + v3 + ... + vn) / |V|    (8)

As stated above, we compute the average coordinate with respect to the average direction vector. To facilitate this computation, we rotate the axes so that the X axis is made parallel to the average direction vector. Here, the rotation in Formula (9) is used.³ The angle θ can be obtained using the inner product between the average direction vector and the unit vector x̂. After computing an average in the X′Y′ coordinate system as in Figure 14, it is translated back to a point in the XY coordinate system.

    x′ = x cos θ + y sin θ,  y′ = −x sin θ + y cos θ    (9)

[Figure: the X and Y axes are rotated to X′ and Y′ so that X′ is parallel to the average direction vector; three sweep-line intersection points (x′1, y′1), (x′1, y′2), (x′1, y′3) are averaged to the coordinate (x′1, (y′1 + y′2 + y′3)/3) in the X′Y′ coordinate system.]

Figure 14: Rotation of the X and Y axes.

Figure 15 shows the algorithm Representative Trajectory Generation. We first compute the average direction vector and rotate the axes temporarily (lines 1-2). We then sort the starting and ending points by the coordinate of the rotated axis (lines 3-4). While scanning the starting and ending points in the sorted order, we count the number of line segments and compute the average coordinate of those line segments (lines 5-12).

Algorithm Representative Trajectory Generation

Input: (1) A cluster Ci of line segments, (2) MinLns, (3) A smoothing parameter γ
Output: A representative trajectory RTRi for Ci
Algorithm:
01: Compute the average direction vector V̄;
02: Rotate the axes so that the X axis is parallel to V̄;
03: Let P be the set of the starting and ending points of the line segments in Ci;
    /* X′-value denotes the coordinate of the X′ axis */
04: Sort the points in the set P by their X′-values;
05: for each (p ∈ P) do
      /* count nump using a sweep line (or plane) */
06:   Let nump be the number of the line segments that contain the X′-value of the point p;
07:   if (nump ≥ MinLns) then
08:     diff := the difference in X′-values between p and its immediately previous point;
09:     if (diff ≥ γ) then
10:       Compute the average coordinate avgp′;
11:       Undo the rotation and get the point avgp;
12:       Append avgp to the end of RTRi;

Figure 15: An algorithm for generating the representative trajectory.
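The following Python sketch puts Definition 11, the rotation of Formula (9), and the sweep of Figure 15 together for 2-D segments. It is a sketch under stated assumptions (endpoint-only sweep positions, equal treatment of all segments), not the authors' exact implementation.

import math

def representative_trajectory(segments, min_lns, gamma):
    """Sweep-line generation of a representative trajectory (cf. Figure 15).
    segments is a list of ((x1, y1), (x2, y2)) pairs; gamma smooths."""
    # Definition 11: average direction vector = vector sum / cardinality.
    vx = sum(e[0] - s[0] for s, e in segments) / len(segments)
    vy = sum(e[1] - s[1] for s, e in segments) / len(segments)
    theta = math.atan2(vy, vx)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def rotate(p):    # Formula (9): align the X axis with the average vector
        return (p[0] * cos_t + p[1] * sin_t, -p[0] * sin_t + p[1] * cos_t)

    def unrotate(p):  # inverse rotation, back to the XY coordinate system
        return (p[0] * cos_t - p[1] * sin_t, p[0] * sin_t + p[1] * cos_t)

    rot = [tuple(sorted((rotate(a), rotate(b)))) for a, b in segments]
    xs = sorted({pt[0] for seg in rot for pt in seg})   # sweep positions
    rtr, prev_x = [], None
    for x in xs:
        hits = [seg for seg in rot if seg[0][0] <= x <= seg[1][0]]
        if len(hits) < min_lns:
            continue                  # too few segments hit the sweep line
        if prev_x is not None and x - prev_x < gamma:
            continue                  # too close to the previous point
        ys = []
        for (x1, y1), (x2, y2) in hits:
            t = 0.0 if x2 == x1 else (x - x1) / (x2 - x1)
            ys.append(y1 + t * (y2 - y1))   # sweep-line intersection
        rtr.append(unrotate((x, sum(ys) / len(ys))))
        prev_x = x
    return rtr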

³We assume two dimensions for ease of exposition. The same approach can also be applied to three dimensions.


4.4 Heuristic for Parameter Value Selection

We first present the heuristic for selecting the value of the parameter ε. We adopt entropy [20] from information theory. In information theory, entropy relates to the amount of uncertainty about an event associated with a given probability distribution. If all the outcomes are equally likely, the entropy is maximal.

Our heuristic is based on the following observation. In the worst clustering, |Nε(L)| tends to be uniform. That is, for too small an ε, |Nε(L)| becomes 1 for almost all line segments; for too large an ε, it becomes numln for almost all line segments, where numln is the total number of line segments. Thus, the entropy becomes maximal. In contrast, in a good clustering, |Nε(L)| tends to be skewed. Thus, the entropy becomes smaller.

We use the entropy definition in Formula (10). Then, we find the value of ε that minimizes H(X). This optimal ε can be efficiently obtained by a simulated annealing [14] technique.

    H(X) = Σ_{i=1}^{n} p(xi) log2 (1 / p(xi)) = − Σ_{i=1}^{n} p(xi) log2 p(xi),
    where p(xi) = |Nε(xi)| / Σ_{j=1}^{n} |Nε(xj)| and n = numln    (10)

We then present the heuristic for selecting the value of the parameter MinLns. We compute the average avg|Nε(L)| of |Nε(L)| at the optimal ε. This operation induces no additional cost since it can be done while computing H(X). Then, we determine the optimal MinLns to lie in the range avg|Nε(L)| + 1 to avg|Nε(L)| + 3. This is natural since MinLns should be greater than avg|Nε(L)| to discover meaningful clusters.

It is not obvious that the parameter values estimated by our heuristic are truly optimal. Nevertheless, we believe our heuristic provides a reasonable range where the optimal value is likely to reside. Domain experts are able to try a few values around the estimated value and choose the optimal one by visual inspection. We verify the effectiveness of this heuristic in Section 5.
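A small Python sketch of the two heuristics; a plain scan over candidate ε values stands in here for the simulated annealing mentioned above, and the returned MinLns is the midpoint of the avg|Nε(L)| + 1 to avg|Nε(L)| + 3 range.

import math

def entropy_for_eps(D, eps, dist):
    """H(X) of Formula (10), plus the neighborhood sizes, for one eps."""
    sizes = [sum(1 for M in D if dist(L, M) <= eps) for L in D]
    total = sum(sizes)  # every size is >= 1 (each L is its own neighbor)
    return -sum((s / total) * math.log2(s / total) for s in sizes), sizes

def select_parameters(D, dist, candidate_eps):
    """Pick the eps minimizing H(X); derive MinLns from avg |N_eps(L)|."""
    best_eps = min(candidate_eps,
                   key=lambda e: entropy_for_eps(D, e, dist)[0])
    _, sizes = entropy_for_eps(D, best_eps, dist)
    avg = sum(sizes) / len(sizes)
    return best_eps, int(round(avg + 2))   # midpoint of [avg + 1, avg + 3]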

5. EXPERIMENTAL EVALUATION

In this section, we evaluate the effectiveness of our trajectory clustering algorithm TRACLUS. We describe the experimental data and environment in Section 5.1. We present the results for two real data sets in Sections 5.2 and 5.3. We briefly discuss the effects of parameter values in Section 5.4.

5.1 Experimental Setting

We use two real trajectory data sets: the hurricane track data set⁴ and the animal movement data set⁵.

The hurricane track data set is called Best Track. Best Track contains the hurricanes' latitude, longitude, maximum sustained surface wind, and minimum sea-level pressure at 6-hourly intervals. We extract the latitude and longitude from Best Track for our experiments. We use the Atlantic hurricanes from the years 1950 through 2004. This data set has 570 trajectories and 17736 points.

The animal movement data set has been generated by the Starkey project. This data set contains the radio-telemetry locations (with other information) of elk, deer, and cattle

⁴http://weather.unisys.com/hurricane/atlantic/
⁵http://www.fs.fed.us/pnw/starkey/data/tables/

from the years 1993 through 1996. We extract the x and y coordinates from the telemetry data for our experiments. We use the elk's movements in 1993 and the deer's movements in 1995. We call these data sets Elk1993 and Deer1995, respectively. Elk1993 has 33 trajectories and 47204 points; Deer1995 has 32 trajectories and 20065 points. We note that trajectories in the animal movement data set are much longer than those in the hurricane track data set.

We attempt to measure the clustering quality while varying the values of ε and MinLns. Unfortunately, there is no well-defined measure for density-based clustering methods. Thus, we define a simple quality measure for a ballpark analysis. We use the Sum of Squared Error (SSE) [11]. In addition to the SSE, we consider a noise penalty to penalize incorrectly classified noise. The noise penalty becomes larger if we select too small an ε or too large a MinLns. Consequently, our quality measure is the sum of the total SSE and the noise penalty, as in Formula (11). Here, N denotes the set of all noise line segments. The purpose of this measure is to get a hint of the clustering quality.

    QMeasure = Total SSE + Noise Penalty    (11)
             = Σ_{i=1}^{numclus} (1 / (2|Ci|)) Σ_{x∈Ci} Σ_{y∈Ci} dist(x, y)²
               + (1 / (2|N|)) Σ_{w∈N} Σ_{z∈N} dist(w, z)²
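A direct Python transcription of Formula (11), assuming dist is the line-segment distance function, clusters is a list of clusters (each a list of segments), and noise is the list N:

def q_measure(clusters, noise, dist):
    """QMeasure = total SSE over clusters + noise penalty (Formula (11))."""
    def sse(group):
        if not group:
            return 0.0
        total = sum(dist(x, y) ** 2 for x in group for y in group)
        return total / (2 * len(group))
    return sum(sse(c) for c in clusters) + sse(noise)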

We conduct all the experiments on a Pentium-4 3.0 GHz PC with 1 GByte of main memory, running Windows XP. We implement our algorithm and a visual inspection tool in C++ using Microsoft Visual Studio 2005.

5.2 Results for Hurricane Track Data

Figure 16 shows the entropy as ε is varied. The minimum is achieved at ε = 31. Here, avg|Nε(L)| is 4.39. According to our heuristic, we try parameter values around ε = 31 and MinLns = 5-7. Using visual inspection and domain knowledge, we obtain the optimal parameter values: ε = 30 and MinLns = 6. We note that the optimal value ε = 30 is very close to the estimated value ε = 31.

Figure 17 shows the quality measure as ε and MinLns are varied. The smaller QMeasure is, the better the clustering quality is. There is a little discrepancy between the actual clustering quality and our measure. Visual inspection shows that the best clustering is achieved at MinLns = 6, not at MinLns = 5. Nevertheless, our measure is shown to be a good indicator of the actual clustering quality within the same MinLns value. That is, if we consider only the results for MinLns = 6, we can see that our measure becomes nearly minimal when the optimal value of ε is used.

Figure 18 shows the clustering result using the optimal parameter values. Thin green lines display trajectories, and thick red lines display representative trajectories. We point out that these representative trajectories are exactly common sub-trajectories. Here, the number of clusters is the number of red lines. We observe that seven clusters are identified.

The result in Figure 18 is quite reasonable. We know that some hurricanes move along a curve, changing their direction from east-to-west to south-to-north, and then to west-to-east. On the other hand, some hurricanes move along a straight east-to-west line or a straight west-to-east line.


[Plot: Entropy (about 10.06 to 10.20) versus Eps (1 to 58), with the optimum marked.]

Figure 16: Entropy for the hurricane data.

[Plot: QMeasure (about 130000 to 180000) versus Eps (27 to 33) for MinLns = 5, 6, and 7, with the optimum marked.]

Figure 17: Quality measure for the hurricane data.

    Figure 18: Clustering result for the hurricane data.

The lower horizontal cluster represents the east-to-west movements, the upper horizontal one the west-to-east movements, and the vertical ones the south-to-north movements.

    5.3 Results for Animal Movement Data

5.3.1 Elk's Movements in 1993

Figure 19 shows the entropy as ε is varied. The minimum is achieved at ε = 25. Here, avg|Nε(L)| is 7.63. Visually, we obtain the optimal parameter values: ε = 27 and MinLns = 9. Again, the optimal value ε = 27 is very close to the estimated value ε = 25.

Figure 20 shows the quality measure as ε and MinLns are varied. We observe that our measure becomes nearly minimal when the optimal parameter values are used. The correlation between the actual clustering quality and our measure, QMeasure, is shown to be stronger in Figure 20 than in Figure 17.

[Plot: Entropy (about 11.36 to 11.50) versus Eps (1 to 58), with the optimum marked.]

Figure 19: Entropy for the Elk1993 data.

[Plot: QMeasure (about 510000 to 630000) versus Eps (25 to 31) for MinLns = 8, 9, and 10, with the optimum marked.]

Figure 20: Quality measure for the Elk1993 data.

Figure 21 shows the clustering result using the optimal parameter values. We observe that thirteen clusters are discovered in most of the dense regions. Although the upper-right region looks dense, we find that the elk actually moved along different paths there. Hence, the result of having no cluster in that region is verified to be correct.

5.3.2 Deer's Movements in 1995

Figure 22 shows the clustering result using the optimal parameter values (ε = 29 and MinLns = 8). The result indicates that two clusters are discovered in the two densest regions. This result is exactly what we expect. The center region is not dense enough to be identified as a cluster. Due to space limits, we omit the detailed figures for the entropy and quality measure.

5.4 Effects of Parameter Values

We have tested the effects of parameter values on the clustering result. If we use a smaller ε or a larger MinLns compared with the optimal ones, our algorithm discovers a larger number of smaller (i.e., having fewer line segments) clusters. In contrast, if we use a larger ε or a smaller MinLns, our algorithm discovers a smaller number of larger clusters. For example, in the hurricane track data, when ε = 25, nine clusters are discovered, and each cluster contains 38 line segments on average; in contrast, when ε = 35, three clusters are discovered, and each cluster contains 174 line segments on average.

6. RELATED WORK

Clustering has been extensively studied in the data mining area. Clustering algorithms can be classified into four categories: partitioning methods (e.g., k-means [17]), hierarchical methods (e.g., BIRCH [24]), density-based methods (e.g., DBSCAN [6] and OPTICS [2]), and grid-based methods (e.g., STING [22]). Our algorithm TRACLUS falls into the category of density-based methods. In density-based


    Figure 21: Clustering result for the Elk1993 data.

    Figure 22: Clustering result for the Deer1995 data.

methods, clusters are regions of high density separated by regions of low density. DBSCAN has been regarded as the most representative density-based clustering algorithm, and OPTICS has been devised to reduce the burden of determining parameter values in DBSCAN. The majority of previous research has focused on the clustering of point data.

The work most similar to ours is the trajectory clustering algorithm proposed by Gaffney et al. [7, 8]. It is based on probabilistic modeling of a set of trajectories. Formally, the probability density function of observed trajectories is a mixture density: P(yj | xj, θ) = Σ_{k=1}^{K} fk(yj | xj, θk) wk, where fk(yj | xj, θk) is the density component, wk is the weight, and θk is the set of parameters for the k-th component. Here, θk and wk can be estimated from the trajectory data using the Expectation-Maximization (EM) algorithm. The estimated density components fk(yj | xj, θk) are then interpreted as clusters. The fundamental differences of this algorithm from TRACLUS are that it is based on probabilistic clustering and that it clusters trajectories as a whole.

Distance measures for searching similar trajectories have been proposed recently. Vlachos et al. [21] have proposed the distance measure LCSS, and Chen et al. [5] the distance measure EDR. Both LCSS and EDR are based on the edit distance and are extended to be robust to noise, shifts, and different lengths that occur due to sensor failures, errors in detection techniques, and different sampling rates. EDR can represent the gap between two similar subsequences more precisely than LCSS [5]. Besides, dynamic time warping has been widely adopted as a distance measure for time series [12]. These distance measures, however, are not adequate for our problem since they are originally designed to compare whole trajectories (especially, whole time-series sequences). In other words, the distance could be large even though some portions of the trajectories are very similar. Hence, it is hard to detect only the similar portions of trajectories.

The MDL principle has been successfully used in diverse applications, such as graph partitioning [3] and distance function design for strings [13]. For graph partitioning, a graph is represented as a binary matrix, and then the matrix is divided into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Here, the MDL principle is used to automatically select the number of row and column groups [3]. For distance function design, data compression is used to measure the similarity between two strings. This idea is tightly connected with the MDL principle [13].

    7. DISCUSSION AND CONCLUSIONS

7.1 Discussion

We discuss some possible extensions of our trajectory clustering algorithm.

1. Extensibility: We can support undirected or weighted trajectories. We handle undirected ones using the simplified angle distance and weighted ones using the extended cardinality of an ε-neighborhood.

2. Parameter Insensitivity: We can make our algorithm more insensitive to parameter values. A number of approaches, e.g., OPTICS [2], have been developed for this purpose in the context of point data. We are applying these approaches to trajectory data.

3. Efficiency: We can improve the clustering performance by using an index to execute an ε-neighborhood query. The major difficulty is that our distance function is not a metric. We will adopt an indexing technique for a non-metric distance function [19].

4. Movement Patterns: We will extend our algorithm to support various types of movement patterns, especially circular motion. Our algorithm primarily supports straight motion. We believe this extension can be done by enhancing the approach for generating a representative trajectory.

5. Temporal Information: We will extend our algorithm to take account of temporal information during clustering. One can expect that time is also recorded with location. This extension will significantly improve the usability of our algorithm.

7.2 Conclusions

In this paper, we have proposed a novel framework, the partition-and-group framework, for clustering trajectories. Based on this framework, we have developed the trajectory clustering algorithm TRACLUS. As the algorithm progresses, a trajectory is partitioned into a set of line segments


at characteristic points, and then similar line segments in a dense region are grouped into a cluster. The main advantage of TRACLUS is the discovery of common sub-trajectories from a trajectory database.

To show the effectiveness of TRACLUS, we have performed extensive experiments using two real data sets: hurricane track data and animal movement data. Our heuristic for parameter value selection has been shown to estimate the optimal parameter values quite accurately. We have implemented a visual inspection tool for cluster validation. The visual inspection results have demonstrated that TRACLUS effectively identifies common sub-trajectories as clusters.

Overall, we believe that we have provided a new paradigm in trajectory clustering. Data analysts are able to gain new insight into trajectory data by virtue of the common sub-trajectories. This work is just the first step, and there are many challenging issues, as discussed above. We are currently investigating the detailed issues as further study.

    8. REFERENCES

[1] Achtert, E., Böhm, C., Kriegel, H.-P., Kröger, P., and Zimek, A., "Deriving Quantitative Models for Correlation Clusters," In Proc. 12th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, Philadelphia, Pennsylvania, pp. 4-13, Aug. 2006.

[2] Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J., "OPTICS: Ordering Points to Identify the Clustering Structure," In Proc. 1999 ACM SIGMOD Int'l Conf. on Management of Data, Philadelphia, Pennsylvania, pp. 49-60, June 1999.

[3] Chakrabarti, D., Papadimitriou, S., Modha, D. S., and Faloutsos, C., "Fully Automatic Cross-Associations," In Proc. 10th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, Seattle, Washington, pp. 79-88, Aug. 2004.

[4] Chen, J., Leung, M. K. H., and Gao, Y., "Noisy Logo Recognition Using Line Segment Hausdorff Distance," Pattern Recognition, 36(4): 943-955, 2003.

[5] Chen, L., Özsu, M. T., and Oria, V., "Robust and Fast Similarity Search for Moving Object Trajectories," In Proc. 2005 ACM SIGMOD Int'l Conf. on Management of Data, Baltimore, Maryland, pp. 491-502, June 2005.

[6] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," In Proc. 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, pp. 226-231, Aug. 1996.

[7] Gaffney, S. and Smyth, P., "Trajectory Clustering with Mixtures of Regression Models," In Proc. 5th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, San Diego, California, pp. 63-72, Aug. 1999.

[8] Gaffney, S., Robertson, A., Smyth, P., Camargo, S., and Ghil, M., "Probabilistic Clustering of Extratropical Cyclones Using Regression Mixture Models," Technical Report UCI-ICS 06-02, University of California, Irvine, Jan. 2006.

[9] Grünwald, P., Myung, I. J., and Pitt, M., Advances in Minimum Description Length: Theory and Applications, MIT Press, 2005.

[10] Guttman, A., "R-Trees: A Dynamic Index Structure for Spatial Searching," In Proc. 1984 ACM SIGMOD Int'l Conf. on Management of Data, Boston, Massachusetts, pp. 47-57, June 1984.

[11] Han, J. and Kamber, M., Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006.

[12] Keogh, E. J., "Exact Indexing of Dynamic Time Warping," In Proc. 28th Int'l Conf. on Very Large Data Bases, Hong Kong, China, pp. 406-417, Aug. 2002.

[13] Keogh, E. J., Lonardi, S., and Ratanamahatana, C. A., "Towards Parameter-Free Data Mining," In Proc. 10th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, Seattle, Washington, pp. 206-215, Aug. 2004.

[14] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., "Optimization by Simulated Annealing," Science, 220(4598): 671-680, 1983.

[15] Lee, T. C. M., "An Introduction to Coding Theory and the Two-Part Minimum Description Length Principle," International Statistical Review, 69(2): 169-184, 2001.

[16] Lee, J.-G., Han, J., and Whang, K.-Y., "Trajectory Clustering: A Partition-and-Group Framework," Technical Report UIUCDCS-R-2007-2828, University of Illinois at Urbana-Champaign, Mar. 2007.

[17] Lloyd, S., "Least Squares Quantization in PCM," IEEE Trans. on Information Theory, 28(2): 129-137, 1982.

[18] Powell, M. D. and Aberson, S. D., "Accuracy of United States Tropical Cyclone Landfall Forecasts in the Atlantic Basin (1976-2000)," Bull. of the American Meteorological Society, 82(12): 2749-2767, 2001.

[19] Roth, V., Laub, J., Kawanabe, M., and Buhmann, J. M., "Optimal Cluster Preserving Embedding of Nonmetric Proximity Data," IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(12): 1540-1551, 2003.

[20] Shannon, C. E., "A Mathematical Theory of Communication," The Bell System Technical Journal, 27: 379-423 and 623-656, 1948.

[21] Vlachos, M., Gunopulos, D., and Kollios, G., "Discovering Similar Multidimensional Trajectories," In Proc. 18th Int'l Conf. on Data Engineering, San Jose, California, pp. 673-684, Feb./Mar. 2002.

[22] Wang, W., Yang, J., and Muntz, R. R., "STING: A Statistical Information Grid Approach to Spatial Data Mining," In Proc. 23rd Int'l Conf. on Very Large Data Bases, Athens, Greece, pp. 186-195, Aug. 1997.

[23] Wisdom, M. J., Cimon, N. J., Johnson, B. K., Garton, E. O., and Thomas, J. W., "Spatial Partitioning by Mule Deer and Elk in Relation to Traffic," Trans. of the North American Wildlife and Natural Resources Conf., Spokane, Washington, pp. 509-530, Mar. 2004.

[24] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very Large Databases," In Proc. 1996 ACM SIGMOD Int'l Conf. on Management of Data, Montreal, Canada, pp. 103-114, June 1996.
