
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 17, NO. 8, AUGUST 1995

Mean Shift, Mode Seeking, and Clustering

Yizong Cheng

Abstract - Mean shift, a simple iterative procedure that shifts each data point to the average of data points in its neighborhood, is generalized and analyzed in this paper. This generalization makes some k-means like clustering algorithms its special cases. It is shown that mean shift is a mode-seeking process on a surface constructed with a shadow kernel. For Gaussian kernels, mean shift is a gradient mapping. Convergence is studied for mean shift iterations. Cluster analysis is treated as a deterministic problem of finding a fixed point of mean shift that characterizes the data. Applications in clustering and Hough transform are demonstrated. Mean shift is also considered as an evolutionary strategy that performs multistart global optimization.

Index Terms - Mean shift, gradient descent, global optimization, Hough transform, cluster analysis, k-means clustering.

    I. INTRODUCTION

Let data be a finite set S embedded in the n-dimensional Euclidean space, X. Let K be a flat kernel that is the characteristic function of the λ-ball in X,

K(x) = \begin{cases} 1 & \text{if } \|x\| \le \lambda \\ 0 & \text{if } \|x\| > \lambda. \end{cases}   (1)

The sample mean at x ∈ X is

m(x) = \frac{\sum_{s \in S} K(s - x)\, s}{\sum_{s \in S} K(s - x)}.   (2)

The difference m(x) - x is called mean shift in Fukunaga and Hostetler [1]. The repeated movement of data points to the sample means is called the mean shift algorithm [1], [2]. In each iteration of the algorithm, s ← m(s) is performed for all s ∈ S simultaneously.
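As a concrete reading of (1) and (2), the following minimal Python sketch (not from the paper; NumPy, the function names, and the radius value are illustrative choices) performs one simultaneous update s ← m(s) with a flat kernel.

```python
import numpy as np

def flat_kernel(r2, lam):
    # Profile of the flat kernel: 1 inside the lambda-ball, 0 outside.
    # r2 is the squared distance ||s - x||^2.
    return (r2 <= lam * lam).astype(float)

def sample_mean(x, S, lam):
    # Equation (2): average of the data points s in S weighted by K(s - x).
    r2 = np.sum((S - x) ** 2, axis=1)
    w = flat_kernel(r2, lam)
    return (w[:, None] * S).sum(axis=0) / w.sum()

def blurring_step(S, lam):
    # One iteration of the blurring mean shift algorithm:
    # every s in S is replaced by its sample mean m(s) simultaneously.
    return np.array([sample_mean(s, S, lam) for s in S])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.random((100, 2))          # 100 points in the unit square
    for _ in range(10):
        S = blurring_step(S, lam=0.2)
    print(S[:5])
```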

The mean shift algorithm has been proposed as a method for cluster analysis [1], [2], [3]. However, the intuition that mean shift is gradient ascent needs justification, the convergence of the process needs verification, and its relation with similar algorithms needs clarification.

In this paper, the mean shift algorithm is generalized in three ways. First, nonflat kernels are allowed. Second, points in data can be weighted. Third, shift can be performed on any subset of X, while the data set S stays the same.

In Section II, kernels with five operations are defined. A specific weight function unifying certain fuzzy clustering algorithms, including the maximum-entropy clustering algorithm, will be discussed. It will be shown that the k-means clustering algorithm is a limit case of mean shift.

Manuscript received Aug. 27, 1993; revised Mar. 28, 1995. Recommended for acceptance by R. Duin.

Y. Cheng is with the Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio, 45221.

    IEEECS Log Number P95097.

A relation among kernels called shadow will be defined in Section III. It will be proved that mean shift on any kernel is equivalent to gradient ascent on the density estimated with a shadow of it. Convergence and its rate are the subject of Section IV. Section V shows some peculiar behaviors of mean shift in cluster analysis, with an application in Hough transform. Section VI shows how, with a twist in weight assignment, the deterministic mean shift is transformed into a probabilistic evolutionary strategy, and how it becomes a global optimization algorithm.

    II. GENERALIZING MEAN SHIFT

In Section II, we first define the kernel, its notation, and operations. Then we define the generalized sample mean and the generalized mean shift algorithm. We show how this algorithm encompasses some other familiar clustering algorithms and how k-means clustering becomes a limit instance of mean shift.

    A. Kernels

DEFINITION 1. Let X be the n-dimensional Euclidean space, R^n. Denote the ith component of x ∈ X by x_i. The norm of x ∈ X is a nonnegative number ‖x‖ such that \|x\|^2 = \sum_{i=1}^{n} |x_i|^2. The inner product of x and y in X is \langle x, y \rangle = \sum_{i=1}^{n} x_i y_i. A function K : X → R is said to be a kernel if there exists a profile, k : [0, ∞] → R, such that

K(x) = k(\|x\|^2)   (3)

and

1) k is nonnegative.

2) k is nonincreasing: k(a) ≥ k(b) if a < b.

3) k is piecewise continuous and \int_0^{\infty} k(r)\, dr < \infty.

Let a > 0. If K is a kernel, then aK, K_a, and K^a are kernels defined, respectively, as

(aK)(x) = aK(x), \qquad K_a(x) = K(x/a), \qquad (K^a)(x) = (K(x))^a.

If K and H are kernels, then K + H is a kernel defined by (K + H)(x) = K(x) + H(x), and KH is a kernel defined by (KH)(x) = K(x)H(x).
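A kernel in the sense of Definition 1 can be represented by its profile, with K(x) = k(‖x‖²), and the operations above become simple wrappers around that representation. The sketch below follows that reading; the helper names and parameter values are illustrative, not the paper's.

```python
import numpy as np

def flat_profile(lam):
    # Flat kernel: k(r) = 1 for r <= lambda^2, else 0, where r = ||x||^2.
    return lambda r: (r <= lam * lam).astype(float)

def gaussian_profile(beta):
    # Gaussian-type profile k(r) = exp(-beta * r).
    return lambda r: np.exp(-beta * r)

def kernel_from_profile(k):
    # Equation (3): K(x) = k(||x||^2).
    return lambda x: k(np.sum(np.asarray(x) ** 2, axis=-1))

def scale(K, a):
    # The operation K_a(x) = K(x / a).
    return lambda x: K(np.asarray(x) / a)

def power(K, a):
    # The operation (K^a)(x) = (K(x))^a.
    return lambda x: K(x) ** a

if __name__ == "__main__":
    x = np.array([0.3, 0.4])
    F = kernel_from_profile(flat_profile(1.0))
    G = kernel_from_profile(gaussian_profile(1.0))
    print(F(x), G(x), scale(G, 2.0)(x), power(G, 2.0)(x))
```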



3) Update cluster centers:

t \leftarrow \frac{\sum_{s \in S} v_{t,s}\, s}{\sum_{s \in S} v_{t,s}}, \quad t \in T.   (11)

Go to 2). Indeed, when the profile k is strictly decreasing,

\lim_{\beta \to \infty} \frac{K^{\beta}(s - t)}{\sum_{r \in T} K^{\beta}(s - r)} = v_{t,s}.   (12)

Thus, k-means clustering is the limit of the mean shift algorithm with a strictly decreasing kernel K^β when β → ∞. □
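A hedged illustration of the limit (12), assuming v_{t,s} is the usual hard nearest-center membership of k-means and taking K to be the unit Gaussian so that K^β(x) = e^{-β‖x‖²}: as β grows, the normalized kernel weights approach 0/1 indicators of the nearest center.

```python
import numpy as np

def soft_assignment(S, T, beta):
    # Weight of center t for point s under K^beta with a Gaussian K:
    # K^beta(s - t) = exp(-beta * ||s - t||^2), normalized over the centers.
    d2 = ((S[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)   # |S| x |T|
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S = rng.random((6, 2))
    T = rng.random((2, 2))
    for beta in (1.0, 10.0, 1000.0):
        print(beta, np.round(soft_assignment(S, T, beta), 3))
    # As beta grows, each row approaches the 0/1 indicator of the nearest
    # center, which is exactly the k-means assignment step.
```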

    III. MEAN SHIFT AS GRADIENT MAPPING

It has been pointed out in [1] that mean shift is a very intuitive estimate of the gradient of the data density. In this section, we give a more rigorous study of this intuition. Theorem 1 relates each kernel to a shadow kernel so that mean shift using a kernel will be in the gradient direction of the density estimate using the corresponding shadow kernel.

    A. Shadow of a Kernel

DEFINITION 3. Kernel H is said to be a shadow of kernel K if the mean shift using K,

m(x) - x = \frac{\sum_{s \in S} K(s - x) w(s)\, s}{\sum_{s \in S} K(s - x) w(s)} - x,   (13)

is in the gradient direction at x of the density estimate using H,

q(x) = \sum_{s \in S} H(s - x) w(s).   (14) □

THEOREM 1. Kernel H is a shadow of kernel K if and only if their profiles, h and k, satisfy the following equation,

h(r) = f(r) + c \int_r^{\infty} k(t)\, dt,   (15)

where c > 0 is a constant and f is a piecewise constant function.

PROOF. The mean shift using kernel K can be rewritten as

m(x) - x = \frac{1}{p(x)} \sum_{s \in S} k(\|s - x\|^2) w(s) (s - x),   (16)

with p(x) = \sum_{s \in S} k(\|s - x\|^2) w(s), the density estimate using K. The gradient of (14) at x is

\nabla q(x) = -2 \sum_{s \in S} h'(\|s - x\|^2)(s - x) w(s).   (17)

To have (16) and (17) point in the same direction, we need h'(r) = -c k(r) for all r and some c > 0. By the fundamental theorem of calculus and the requirement that \int_0^{\infty} k(r)\, dr < \infty, we have (15). In this notation, the mean shift becomes

m(x) - x = \frac{\nabla q(x)}{2 c\, p(x)}.   (18) □

CLAIM 2. Let H be a shadow of K and a > 0. The following are true.

1) aH is a shadow of K.
2) H_a is a shadow of K_a.
3) If L is a shadow of M, then H + L is a shadow of K + M.
4) A truncated kernel KF_a may not be continuous at ‖x‖ = a. If the shadow is also allowed to be discontinuous at the same points, then HF_a is a shadow of KF_a. □

EXAMPLE 3. Using (15) we find that the Epanechnikov kernel

K(x) = \begin{cases} 1 - \|x\|^2 & \text{if } \|x\| \le 1 \\ 0 & \text{if } \|x\| > 1 \end{cases}   (19)

is a shadow of the flat kernel, (5), and the biweight kernel

K(x) = \begin{cases} (1 - \|x\|^2)^2 & \text{if } \|x\| \le 1 \\ 0 & \text{if } \|x\| > 1 \end{cases}   (20)

is a shadow of the Epanechnikov kernel. These kernels are so named in [2] and they are shown in Fig. 3. □

Fig. 3. (a) The Epanechnikov kernel and (b) the biweight kernel.
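The shadow relation can be checked numerically. The sketch below (finite differences, unit weights, and the particular sample are illustrative choices) verifies that the mean shift computed with the flat kernel points in the gradient direction of the density estimate (14) built with its Epanechnikov shadow.

```python
import numpy as np

def flat_K(x):                      # flat kernel, radius 1
    return float(np.dot(x, x) <= 1.0)

def epan_H(x):                      # Epanechnikov kernel, its shadow (19)
    r2 = float(np.dot(x, x))
    return (1.0 - r2) if r2 <= 1.0 else 0.0

def mean_shift(x, S):
    w = np.array([flat_K(s - x) for s in S])
    return (w[:, None] * S).sum(axis=0) / w.sum() - x

def q(x, S):                        # density estimate (14) with H, unit weights
    return sum(epan_H(s - x) for s in S)

def grad_q(x, S, eps=1e-5):         # finite-difference gradient of q
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (q(x + e, S) - q(x - e, S)) / (2 * eps)
    return g

rng = np.random.default_rng(2)
S = rng.normal(size=(200, 2))
x = np.array([0.4, -0.2])
ms, g = mean_shift(x, S), grad_q(x, S)
print(ms / np.linalg.norm(ms), g / np.linalg.norm(g))   # same unit direction
```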

    B. Gaussian Kernels

THEOREM 2. The only kernels that are their own shadows are the Gaussian kernel G_β, with profile k(r) = e^{-βr}, and its truncated version G_β F_λ. In this case, the mean shift is equal to

m(x) - x = \frac{1}{2\beta} \nabla \log q(x),   (21)

where q is the data density estimate using the same kernel.

PROOF. From Theorem 1 we know that kernel K is its own shadow if and only if k'(r) = -c k(r). Using the method of separation of variables, we have


This gives us k(r) = k(0) e^{-cr}, which makes K the Gaussian kernel. If discontinuities are allowed in k, then we have the truncated Gaussian kernel. When K is its own shadow, p in (18) is equal to q, and ∇q(x)/q(x) = ∇ log q(x). □

REMARK 2. A mapping f : R^n → R^n is said to be a gradient mapping if there exists a function g : R^n → R such that f(x) = ∇g(x) for all x [6]. Theorem 2 is a corollary of a more general result from the symmetry principle: f is a gradient mapping if and only if the Jacobian matrix of f is symmetric. In our case, f(x) = m(x) - x, and by equating ∂f_i/∂x_j and ∂f_j/∂x_i for all i and j, one obtains the necessary and sufficient condition that k'(r) = -c k(r) for any mean shift to be a gradient mapping. □
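A quick numerical check of (21), assuming the Gaussian kernel with profile k(r) = e^{-βr} and unit weights; the finite-difference gradient and the sample are only illustrations.

```python
import numpy as np

beta = 4.0
rng = np.random.default_rng(3)
S = rng.normal(size=(300, 2))

def K(x):                                  # Gaussian kernel, profile exp(-beta*r)
    return np.exp(-beta * np.sum(x * x, axis=-1))

def m_shift(x):                            # mean shift step m(x) - x, unit weights
    w = K(S - x)
    return (w[:, None] * S).sum(axis=0) / w.sum() - x

def log_q(x):                              # log of the density estimate q(x)
    return np.log(K(S - x).sum())

def grad(f, x, eps=1e-6):                  # finite-difference gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([0.7, -0.3])
print(m_shift(x))
print(grad(log_q, x) / (2 * beta))         # matches, up to finite-difference error
```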

    C. Mode Seeking

Suppose an idealized mode in the density surface also has a Gaussian shape, which, without loss of generality, centers at the origin:

q(x) = q(0)\, e^{-\gamma \|x\|^2}.   (22)

The mean shift is now

m(x) - x = \frac{1}{2\beta} \nabla \log q(x) = -\frac{2\gamma x}{2\beta} = -\frac{\gamma}{\beta}\, x.   (23)

Because the density surface is estimated with the kernel G_β, any mode approximated by superimposing G_β will have a γ < β. The mean shift (23) will not cause overshoots in this case.

Mean shift is steepest ascent with a varying step size that is the magnitude of the gradient. A notorious problem associated with steepest ascent with fixed step size is the slow movement on plateaus of the surface. For a density surface, large plateaus happen only at low density regions, and after taking the logarithm, the inclination of a plateau is magnified. Combined with the preceding result about overshoot avoidance, mean shift is well-adjusted steepest ascent.

    IV. CONVERGENCE

Theorem 1 says that the mean shift algorithm is steepest ascent over the density of S. Each T point climbs the hill in the density surface independently. Therefore, if S or its density does not change during the execution of the algorithm, the convergence of the evolution of T is a consequence of the convergence of steepest ascent for individual T points.

However, in a blurring process, T is S, and S and its density change as the result of each iteration. In this case, convergence is not as obvious as steepest ascent. The main results of this section are two convergence theorems about the blurring process. The concepts of radius and diameter of data, defined below, will be used in the proofs.

    A. Radius and Diameter of Data

DEFINITION 4. A direction in X is a point on the unit sphere. That is, a ∈ X is a direction if and only if ‖a‖ = 1. We call the mapping π_a : X → R with π_a(x) = ⟨x, a⟩ the projection in the direction of a. Let π_a(S) = {π_a(s); s ∈ S}. The convex hull of a set Y ⊂ X is defined as

h(Y) = \bigcap_{\|a\| = 1} \{ x \in X;\ \min \pi_a(Y) \le \pi_a(x) \le \max \pi_a(Y) \}.   (24) □

CLAIM 3. The following are true.

1) min π_a(S) ≤ min π_a(m(T)) ≤ max π_a(m(T)) ≤ max π_a(S).
2) m(T) ⊂ h(m(T)) ⊆ h(S).
3) In a blurring process, we have h(S) ⊇ h(m(S)) ⊇ h(m(m(S))) ⊇ ⋯. There exists an x ∈ X that is in all the convex hulls of data. It is possible to make a translation so that the origin is in all the convex hulls of data. □

DEFINITION 5. Suppose after a translation, the origin is in all the convex hulls of data during a blurring process. Then, ρ(S) = max{‖s‖; s ∈ S} is said to be the radius of data. The diameter of data is defined as

d(S) = \sup_{\|a\| = 1} \left( \max \pi_a(S) - \min \pi_a(S) \right).   (25) □

It should be clear that ρ(S) ≤ d(S) ≤ 2ρ(S). Because the convex hulls of data form a shrinking inclusion sequence, the radius or diameter of data also forms a nonincreasing nonnegative sequence, which must approach a nonnegative limit. Theorem 3 says that this limit is zero when the kernel in the blurring process has a support wide enough to cover the data set.

    B. Blurring With Broad Kernels

THEOREM 3. Let k be the profile of the kernel used in a blurring process, and S_0 the initial data. If k(d(S_0)^2) ≥ κ for some κ > 0, then the diameter of data approaches zero. The convergence rate is at least as fast as

d(m(S)) \le \left( 1 - \frac{\kappa}{4 k(0)} \right) d(S).   (26)

PROOF. Let π be a projection, u = min π(S), v = max π(S), and z = (u + v)/2. Suppose, without loss of generality, that

\sum_{s \in S,\ \pi(s) \le z} w(s) \ \ge\ \frac{1}{2} \sum_{s \in S} w(s).   (27)

Then, for s ∈ S,


v - \pi(m(s)) = \frac{\sum_{s' \in S} K(s' - s) w(s') (v - \pi(s'))}{\sum_{s' \in S} K(s' - s) w(s')}
\ \ge\ \frac{\sum_{s' \in S,\ \pi(s') \le z} K(s' - s) w(s') (v - z)}{\sum_{s' \in S} k(0) w(s')}
\ \ge\ \frac{\kappa (v - u)}{4 k(0)}.   (28)

(The last step uses K(s' - s) ≥ κ, which holds because ‖s' - s‖ ≤ d(S) ≤ d(S_0) and k is nonincreasing, together with (27) and v - z = (v - u)/2.)

Clearly, we have

\max \pi(m(S)) - \min \pi(m(S)) \le \max \pi(m(S)) - u = (v - u) - (v - \max \pi(m(S))) \le \left( 1 - \frac{\kappa}{4 k(0)} \right)(v - u).

Because this holds for every direction, (26) follows, and the diameter of data approaches zero. □

EXAMPLE 4. Let X = R and S = {x, y} with k(|x - y|^2) > 0. If the weights on x and y are the same, a blurring process will make them converge to (x + y)/2 with the rate

m(y) - m(x) = \frac{\left( k(0) - k(|x - y|^2) \right)(y - x)}{k(0) + k(|x - y|^2)}.   (30)

The difference between the two points will not become zero at any iteration, unless the kernel is flat in the range containing both points. When the weights are different, x and y converge to a point that may not be the weighted average of the initial x and y with the same weights.
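The two-point rate (30) is easy to check directly. The sketch below uses a Gaussian profile on the squared distance; the profile and the starting points are illustrative.

```python
import numpy as np

def k(r, beta=1.0):
    # Gaussian profile k(r) = exp(-beta * r), with r a squared distance.
    return np.exp(-beta * r)

def blurring_step_two_points(x, y):
    # Equal weights; each point moves to its sample mean over {x, y}.
    kxy = k((x - y) ** 2)
    mx = (k(0.0) * x + kxy * y) / (k(0.0) + kxy)
    my = (kxy * x + k(0.0) * y) / (k(0.0) + kxy)
    return mx, my

x, y = 0.0, 1.0
for it in range(5):
    predicted = (k(0.0) - k((x - y) ** 2)) / (k(0.0) + k((x - y) ** 2))
    nx, ny = blurring_step_two_points(x, y)
    print(it, y - x, (ny - nx) / (y - x), predicted)
    x, y = nx, ny
# The measured contraction (ny - nx)/(y - x) matches the rate in (30);
# the gap shrinks every step but never reaches zero exactly.
```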

Now assume that S is fixed through the process while T is initialized to S. Because this is no longer a blurring process, the result in Theorem 3 does not apply. That is, T may converge to more than one point. The two T points converge to (x + y)/2 when the density of S is unimodal. Otherwise, they will converge to the two modes of the bimodal density. When G_λ is used as the kernel (the Gaussian scaled by λ, so that G_λ(x) = e^{-x^2/\lambda^2} on X = R), and x = 0, y = 1, the density is

q(z) = e^{-z^2/\lambda^2} + e^{-(z - 1)^2/\lambda^2}.   (31)

The first derivative of q(z) always has a zero at z = 1/2. But the second derivative of q(z) at z = 1/2 is

q''(1/2) = \frac{2}{\lambda^4} \left( 1 - 2\lambda^2 \right) e^{-1/(4\lambda^2)},   (32)

which is negative when λ > 1/\sqrt{2} and positive when λ < 1/\sqrt{2}. Thus 1/\sqrt{2} is a critical λ value. When λ is larger than this critical value, the density is unimodal and the two T points will converge to 1/2. When λ is smaller than 1/\sqrt{2}, they will approach two distinct limit points.
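The critical value 1/√2 ≈ 0.707 can be observed numerically: with S = {0, 1} fixed and T initialized to S, the two T points merge for λ above the critical value and separate below it. The iteration count and the two λ values below are illustrative.

```python
import numpy as np

def G(x, lam):
    # Gaussian kernel scaled by lambda: G_lambda(x) = exp(-x^2 / lambda^2)
    return np.exp(-(x ** 2) / lam ** 2)

def mean_shift_fixed_S(T, S, lam, iters=500):
    for _ in range(iters):
        T = np.array([(G(S - t, lam) * S).sum() / G(S - t, lam).sum() for t in T])
    return T

S = np.array([0.0, 1.0])
for lam in (0.6, 0.8):            # below and above the critical 1/sqrt(2) ~ 0.707
    T = mean_shift_fixed_S(S.copy(), S, lam)
    print(lam, T)
# For lam > 1/sqrt(2) both T points drift to 0.5 (unimodal q);
# for smaller lam they settle near two distinct modes.
```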

    C. Blurring with Truncated Kernels

When truncated kernels are used in the blurring process, S may converge to many points. In fact, if the kernel is truncated so that it will not cover more than one point in S, then S will not change in the process. The result of Theorem 3 applies to an isolated group of data points that can be covered by a truncated kernel. In this case, they eventually converge to one point, although the rest of the data set may converge to other cluster centers. Again, when the kernel is not flat, no merger will take place after any finite number of iterations. (Flat kernels generate a special case where merger is possible in a finite number of steps. See [8].)

When the blurring process is simulated on a digital computer, points do merge in a finite number of steps, due to the limits on floating-point precision. In fact, there is a minimum distance between data points that can be maintained. Theorem 4 below shows that under this condition, the blurring process terminates in a finite number of steps.

LEMMA 1. Suppose X = R^n and r(S) is the radius of data in a blurring process. If the minimum distance between data points is δ, then for any direction a, there cannot be more than one data point s ∈ S with π_a(s) > r(S) - h, where h is a function of r(S), δ, and n.

PROOF. Suppose there is a direction a such that there are s, t ∈ S with π_a(s) > r(S) - h and π_a(t) > r(S) - h. Let b be a direction perpendicular to a and d_b = |π_b(s) - π_b(t)|. Because both s and t are at least r(S) - h away from the origin in direction a and one of them must also be d_b/2 away from the origin in direction b, the square of the radius of data cannot be smaller than (r(S) - h)^2 + d_b^2/4. It follows that d_b^2 ≤ 8 r(S) h - 4h^2. The distance between s and t, ‖s - t‖, must satisfy δ^2 ≤ ‖s - t‖^2.


LEMMA 2. In a blurring process, suppose the minimum distance between data points is δ, κ > 0 is a constant such that when K(x) > 0, K(x) > κ k(0), and

W = \frac{\min_{s \in S} w(s)}{\sum_{s \in S} w(s)}.   (33)

The radius of data, r(S), reaches its final value in no more than r(S)/(κ W h_δ) steps, where h_δ is defined in Lemma 1.

PROOF. Let s ∈ S be a data point that moves during an iteration. Then there must have been another data point s' ∈ S such that K(s' - s) > κ k(0). Let a be a direction. Lemma 1 says that at least one of these two points, s or s', denoted with s'', must have π_a(s'') ≤ r(S) - h_δ. Hence,

r(S) - \pi_a(m(s)) = \frac{\sum_{t \in S} K(t - s) w(t) (r(S) - \pi_a(t))}{\sum_{t \in S} K(t - s) w(t)}
\ \ge\ \frac{K(s'' - s) w(s'') (r(S) - \pi_a(s''))}{\sum_{t \in S} k(0) w(t)}
\ \ge\ \kappa W h_\delta.   (34)

This shows that all moving points in S move in one iteration to a position at least κ W h_δ away from the current r(S). Therefore, if r(S) changes during an iteration, its change must be at least κ W h_δ. □

THEOREM 4. If data points cannot move arbitrarily close to each other, and K(x) is either zero or larger than a fixed positive constant, then the blurring process reaches a fixed point in finitely many iterations.

PROOF. Lemma 2 says that the radius of data reaches its final value in a finite number of steps. Lemma 2 also implies that those points at this final radius will not affect other data points or each other. Hence, they can be taken out from consideration for further processing of the algorithm. The same argument can be applied to the rest of the data. Since S is finite, by induction on the size of S, a fixed point must be reached in finitely many iterations. More precisely, the blurring process halts in no more than r(S)/(κ W h_δ) steps. □

REMARK 3. It is important that when close data points are merged, the radius of data does not increase. A simple mechanism is merging close points into one of them. This must be done before the exhaustion of floating-point precision, because truncation may indeed increase the radius of data and cause cycles. An extreme case is the blurring process applied to categorical data, when a flat kernel based on the Hamming distance and round-off to the nearest integer are used in each mean shift step. In this case, the mean shift step becomes

s \leftarrow \text{Majority}\{ t \in S;\ \|t - s\| \le \lambda \}.   (35)

Round-off may actually shift a data point out of the flat kernel centering at it. For example, 0111, 1011, 1101, and 1110 all have Hamming distance 3 from 0000, but the majority of the five will be 1111, which has a distance 4 from the center, 0000, and there is a projection on which max π(S) actually increases. □
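The rounding example above can be reproduced directly; the bit-list encoding is an illustrative choice.

```python
# Reproducing the example in Remark 3: the componentwise mean of the five
# strings, rounded to the nearest integer, is their bitwise majority.
points = ["0000", "0111", "1011", "1101", "1110"]
bits = [[int(c) for c in p] for p in points]

majority = [round(sum(col) / len(bits)) for col in zip(*bits)]
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))

print("".join(map(str, majority)))                 # 1111
print(hamming(bits[0], majority))                  # distance 4 from 0000
print([hamming(bits[0], b) for b in bits[1:]])     # each is 3 from 0000
```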

    D. Fixed Points as Local Maxima

In Section III, we showed that mean shift for individual points in X is hill climbing on the data density function. Because the data density function also evolves in the blurring process, it is difficult to see where the hill climbing leads. When the evolving set of points in X, either S (in blurring) or T (as cluster centers), is treated as a point in X^N, where N is the number of points involved, real functions can be constructed and the fixed points of mean shift can be identified as local maxima of these functions.

THEOREM 5. When S is fixed, the stable fixed points of the mean shift process

T \leftarrow m(T) = \{ m(t);\ t \in T \}, \quad m(t) = \frac{\sum_{s \in S} K(s - t) w(s)\, s}{\sum_{s \in S} K(s - t) w(s)},   (36)

are local maxima of

U(T) = \sum_{t \in T} \sum_{s \in S} H(s - t) w(s),   (37)

where H is a shadow of K. For the blurring process, S ← m(S), assuming weights of data points do not change, the fixed points are the local maxima of

V(S) = \sum_{s, t \in S} H(s - t) w(s) w(t).   (38)

PROOF. When S is fixed, each t ∈ T reaches its fixed point when ∇q(t) = 0, using the result and notation in (18). Because U(T) = Σ_{t∈T} q(t), a local maximum of U is reached when each t ∈ T attains a local maximum of q(t). Because local minima of q are not stable fixed points for t, a stable fixed point of U can only be its local maximum. For the blurring process, we have

\frac{\partial V}{\partial s} = 2 \sum_{t \in S} h'(\|s - t\|^2)(s - t) w(s) w(t).   (39)

Notice that w(s) will not change and thus is treated as a constant. ∂V/∂s = 0 is equivalent to

\sum_{t \in S} K(s - t) w(t) (t - s) = 0,   (40)

or

m(s) - s = \frac{\sum_{t \in S} K(s - t) w(t) (t - s)}{\sum_{t \in S} K(s - t) w(t)} = 0,   (41)

and thus, the local maxima of V(S) are fixed points of S ← m(S). By the same reason as before, they are the only stable fixed points. □
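A numerical illustration of (40) and (41), using a Gaussian kernel (its own shadow, so H = K) and uniform weights: the finite-difference gradient of V with respect to one data point is a positive multiple of its mean shift m(s) - s. The sample and step sizes are illustrative.

```python
import numpy as np

beta = 3.0
rng = np.random.default_rng(4)
S = rng.random((30, 2))
w = np.ones(len(S))

def K(x):
    return np.exp(-beta * np.sum(x * x, axis=-1))

def V(S):
    # V(S) = sum over pairs of H(s - t) w(s) w(t), with H = K (Theorem 2).
    diff = S[:, None, :] - S[None, :, :]
    return float((K(diff) * np.outer(w, w)).sum())

def mean_shift_dir(i, S):
    ki = K(S - S[i])
    return (ki[:, None] * S).sum(axis=0) / ki.sum() - S[i]

i, eps = 0, 1e-6
grad = np.zeros(2)
for d in range(2):
    Sp, Sm = S.copy(), S.copy()
    Sp[i, d] += eps
    Sm[i, d] -= eps
    grad[d] = (V(Sp) - V(Sm)) / (2 * eps)

ms = mean_shift_dir(i, S)
print(grad / np.linalg.norm(grad))
print(ms / np.linalg.norm(ms))     # same unit vector: dV/ds is a positive
                                   # multiple of m(s) - s, as in (40) and (41)
```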


    V. MEAN SHIFT CLUSTERING

Theorem 3 shows that in a blurring process with a broad kernel (one with a broad support), data points do converge to a single position. At the other extreme, when the kernel is truncated to the degree that it covers no more than one data point at any position, the initial data set is a fixed point of the blurring process and thus no merger takes place. When the kernel size is between these extremes, data points may have trajectories that merge into varying numbers of cluster centers.

Iteration of mean shift gives rise to natural clustering algorithms. The final T contains the final positions of the cluster centers. In k-means like clustering, T is not initialized to S; fuzzy membership or nearest center classification must be used to decide how data points are divided into clusters.

In this section, we will study the clustering results when T is initialized to S or when T is S (the blurring process). The data set S is partitioned into clusters based solely on the mean shift trajectories. When two data points or their T representatives converge to the same final position, they are considered to belong to the same cluster. Unlike k-means like clustering algorithms, which are probabilistic because of the randomness in the initialization of T, mean shift clustering with T initialized to S is deterministic.
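A sketch of this clustering procedure with T initialized to S and S fixed; the truncated Gaussian kernel, the iteration count, and the merging tolerance are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def mean_shift_cluster(S, lam, iters=100, merge_tol=1e-2):
    # Nonblurring mean shift: S stays fixed, T is initialized to S, and each
    # T point is repeatedly moved to its sample mean over S (truncated
    # Gaussian kernel of radius lam).
    def K(x):
        r2 = np.sum(x * x, axis=-1)
        return np.where(r2 <= lam * lam, np.exp(-r2 / lam ** 2), 0.0)

    T = S.copy()
    for _ in range(iters):
        T = np.array([(K(S - t)[:, None] * S).sum(axis=0) / K(S - t).sum()
                      for t in T])

    # Points whose T representatives end up together belong to one cluster.
    labels = -np.ones(len(S), dtype=int)
    centers = []
    for i, t in enumerate(T):
        for c, ctr in enumerate(centers):
            if np.linalg.norm(t - ctr) < merge_tol:
                labels[i] = c
                break
        else:
            centers.append(t)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

rng = np.random.default_rng(5)
S = rng.random((100, 2))
labels, centers = mean_shift_cluster(S, lam=0.2)
print(len(centers), "clusters")
```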

    A. Clustering as a Natural Process

Many clustering algorithms are treated as means for optimizing certain measures about the partitioning. For example, the k-means clustering algorithm aims at minimizing the within-group sum of squared errors [7], and the maximum entropy clustering algorithm is to maximize entropy while the within-group sum of squared errors is held constant. Sometimes, it is the algorithm itself that is emphasized, for instance in the case of k-means clustering. The initial cluster centers, T, are randomly or strategically chosen, and there is no guarantee that any execution of the algorithm will reach the global minimum. After the execution of the algorithm, all one can say is that a local minimum is reached, and the optimization goal becomes elusive.

At other times, reaching a global optimum is essential. Maximum entropy clustering is an example. The actual iteration that hopefully attains the goal is de-emphasized, for precisely the same reason: every run of the algorithm only reaches a local maximum [5]. It is known that optimization problems like these are NP-hard [9]. Hence, in general, clustering as optimization is computationally unattainable.

The philosophy of this paper is that clustering can also be viewed as the result of some natural process, like mean shift or blurring. As a deterministic process, the clustering result provides a characteristic of the data set. In the light of Theorem 5, when T is initialized to S, the final configuration of T is a local maximum of U(T); when T is S, it is a local maximum of V(S). The result or purpose of mean shift clustering is to use a local maximum of U or V as a characterization of S. The global maximum of U or V is not only unattainable, but also undesirable. For instance, V reaches its global maximum when S shrinks to one point, which is the result only when the kernel has a broad support. The number and distribution of local maxima of U and V depend only on the kernel, the dimensionality of the space, and the data size.

EXAMPLE 5. To visualize mean shift as clustering, we randomly chose an S of size 100 in the unit square (Fig. 4a) and applied five different variations of mean shift. Processes were terminated when no update larger than 10 took place. The truncated Gaussian kernel, (GF)_{β^{-1/2}}, was used in Figs. 4b, 4c, and 4d, while the nontruncated Gaussian kernel, G_{β^{-1/2}}, was used in Figs. 4e and 4f, all with β = 30. The blurring process was used in Fig. 4b and nonblurring, meaning

Fig. 4. Trajectories of different mean shift variations on the same data set. (a) The data set S, also the initial T set for nonblurring processes. (b) A blurring process with 10 iterations and nine clusters. (c) A nonblurring mean shift process with a truncated Gaussian kernel and uniform weights. It terminated in 20 iterations at 37 clusters. (d) Nonblurring mean shift with a truncated Gaussian kernel and an adaptive weight function. It terminated in 20 iterations at 64 clusters. (e) Nonblurring with a nontruncated Gaussian kernel and uniform weights. It terminated in 274 iterations at two clusters. (f) Nonblurring with a nontruncated Gaussian kernel and adaptive weights. It terminated in 127 iterations with 13 clusters.


T was only initialized to S while S stayed fixed, was used in the others. In (b), (c), and (e), data points were equally weighted, while in (d) and (f), a weight w that satisfies 1/w(s) = Σ_{t∈S} K(t - s) was used. □

    B. Validation of Clustering

When the truncated Gaussian kernel (GF)_{β^{-1/2}} with varying β is used, we can see that in general, the smaller β is, the fewer clusters will be generated. But this may not always be true. Furthermore, a smaller cluster associated with a larger β value may not be the result of a split of a larger cluster associated with a smaller β value. However, if the clustering outcomes are plotted against a range of β values, then it will be clear that some clustering outcomes are transient and unstable while others are more stable. The recurrence of a clustering outcome over varying β values can be used as an indication that the pattern may be more valid than the transient ones.
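The validation idea can be sketched as a sweep over the kernel size, recording the number of clusters produced at each value; the synthetic one-dimensional sample and the bandwidth grid are illustrative.

```python
import numpy as np

def cluster_count(S, lam, iters=100, tol=1e-2):
    # Nonblurring mean shift with a truncated Gaussian kernel of radius lam;
    # returns how many distinct positions the T points settle into.
    def K(x):
        r2 = np.sum(x * x, axis=-1)
        return np.where(r2 <= lam * lam, np.exp(-r2 / lam ** 2), 0.0)
    T = S.copy()
    for _ in range(iters):
        T = np.array([(K(S - t)[:, None] * S).sum(axis=0) / K(S - t).sum()
                      for t in T])
    centers = []
    for t in T:
        if not any(np.linalg.norm(t - c) < tol for c in centers):
            centers.append(t)
    return len(centers)

rng = np.random.default_rng(6)
S = np.concatenate([rng.normal(m, 0.04, (40, 1)) for m in (0.0, 0.5, 1.0)])
for lam in np.linspace(0.05, 0.6, 12):
    print(round(float(lam), 2), cluster_count(S, lam))
# A cluster count that recurs over a wide range of lam (three, for this
# synthetic sample) is a more stable, and arguably more valid, outcome than
# counts that appear only at isolated kernel sizes.
```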

EXAMPLE 6. Fig. 5 is a plot of the blurring process applied to the velocities of 82 galaxies [11]. The same data were fit with a bimodal normal mixture model by Roeder [10], and the conclusion was that the observable universe contains two superclusters of galaxies surrounded by large voids. However, by observing Fig. 5, one can see that the most stable outcome is three instead of two clusters. This shows that while bimodal mixture or k-means clustering requires some prior guessing about the number of clusters, the result from mean shift clustering is less arbitrary. □

Fig. 5. Clustering of 82 galaxies based on their velocities. The tree-shaped diagram shows the relative validity of clustering outcomes with the kernel (GF)_λ of different λ values. Larger λ values were used near the root of the tree, while smaller λ values were used near the leaves of the tree, where the dimension across the tree indicates the velocities.

EXAMPLE 7. Another example of blurring within a parameter continuum is demonstrated in Fig. 6, where 62 alphanumeric characters were treated as data points in a 144-dimensional (the number of pixels involved in the 8 x 18 font) Euclidean space. Blurring with kernel (GF)_λ with λ ranging from 1.6 to 3.8 was performed. The clustering results were discretized and displayed as two-tone pixels. □

Fig. 6. Blurring of 62 8 x 18 font alphanumeric characters. Each row is the outcome of blurring using the kernel (GF)_λ with a λ value between 1.6 and 3.8. The average number of iterations was 4.7.

    C. Application to Hough Transform

EXAMPLE 8. Fig. 7 shows an application of mean shift clustering in the generalized Hough transform. 300 edge pixels were randomly chosen from a 100 x 100 image and each pair of them generated a straight line passing through them. The intersections of these lines with the borders of the image are rounded off to the nearest integers (pixel positions) and they are registered with 400 x 400 accumulators in the parameter space, similar to the muff transform suggested by Wallace [12]. □

Fig. 7. Mean shift peak detection in the parameter space of the Hough (muff) transform. The image contains 300 edge pixels. Pairs of edge pixels make 44,850 suggestions of possible lines that are coded as points in the parameter space. Each line is coded with two integers between 0 and 400 that label the intersections of the line with the borders of the image. (a) shows the line detection based on the number of suggestions falling into each point in the parameter space. A threshold of 30 was used. (b) is the result after mean shift is applied to these lines in the parameter space. The kernel was (GF)_{10}, weights were the number of suggestions, and four clusters emerged as the four lines shown.
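A reduced sketch of the peak-detection step in this example: given accumulator cells and their vote counts (their construction from edge-pixel pairs is omitted here), mean shift with the counts as weights w(s) locates the peaks. The truncated Gaussian radius and the synthetic votes are illustrative.

```python
import numpy as np

def weighted_mean_shift_peaks(cells, counts, lam=10.0, iters=30, tol=1.0):
    # Mean shift over the accumulator positions `cells` (points in the
    # 400 x 400 parameter space), weighted by the vote counts, with a
    # truncated Gaussian kernel of radius lam.
    def K(x):
        r2 = np.sum(x * x, axis=-1)
        return np.where(r2 <= lam * lam, np.exp(-r2 / lam ** 2), 0.0)

    T = cells.astype(float).copy()
    for _ in range(iters):
        new_T = []
        for t in T:
            w = K(cells - t) * counts
            new_T.append((w[:, None] * cells).sum(axis=0) / w.sum())
        T = np.array(new_T)

    peaks = []
    for t in T:                       # merge positions that converged together
        if not any(np.linalg.norm(t - p) < tol for p in peaks):
            peaks.append(t)
    return np.array(peaks)

# Synthetic votes around four "line" codes, standing in for the suggestions
# generated from pairs of edge pixels.
rng = np.random.default_rng(7)
lines = np.array([[50, 320], [120, 260], [200, 80], [330, 150]], float)
votes = np.vstack([l + rng.normal(0, 3, (200, 2)) for l in lines])
cells, counts = np.unique(np.round(votes).astype(int), axis=0, return_counts=True)
print(weighted_mean_shift_peaks(cells, counts))   # four peaks near `lines`
```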

    VI. MEAN SHIFT OPTIMIZATION

The blurring process moves data points in the gradient direction of the function q on X,

q(x) = \sum_{s \in S} K(s - x) w(s).   (42)

In clustering with mean shift, this function q is considered as


an approximation to the data density, and mean shift finds the local maxima of the density function.

When data S is uniformly distributed in an area of X, and w(s) is the value of a target function f : X → R at the point s, q becomes an approximation of f with some scaling. Mean shift with this setting will find all the local maxima of f in a region, and this leads to another application of mean shift, namely global optimization.

Because now the initial data set S has to be randomly (and hopefully uniformly) generated, the algorithm becomes probabilistic, even when T is S or is initialized to S. To compensate for

the inevitable non-uniformity of any finite random set, the weight w(s) can be the f value at s augmented with a data density balancing factor, as

w(s) = \frac{f(s)}{\sum_{t \in S} K(t - s)}.   (43)

Fig. 8. Multistart global optimization using blurring. (a) shows the function f whose global maximum is to be found. The next four figures show the mean shift of S at the (b) initial, (c) first, (d) third, and (e) fifth iterations of a blurring process when f is used as the weight function. In each of these four figures, the vertical bars show the positions and f values of the S points, and the curve shows the q function, whose local maxima locations approximate those of f.

When the blurring process is used, the next generation of S will concentrate more at the modes of the approximated f function, but the weight w will contain a factor that offsets this effect. It is also possible to make the algorithm more deterministic by placing the initial S on regular grid points.
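A sketch of the whole strategy on the unit interval, with the weights of (43) recomputed at every blurring step; the target function f and the bandwidth are illustrative stand-ins for the unspecified f of Example 9.

```python
import numpy as np

def f(x):
    # Illustrative multimodal target (the f of Example 9 is not specified here).
    return np.exp(-80 * (x - 0.23) ** 2) + 0.6 * np.exp(-80 * (x - 0.71) ** 2)

lam = 0.18
def K(d):
    # Truncated Gaussian (GF)_lambda on scalar differences d.
    return np.where(np.abs(d) <= lam, np.exp(-(d / lam) ** 2), 0.0)

rng = np.random.default_rng(8)
S = rng.random(100)                        # random sample of the unit interval

for _ in range(5):                         # five blurring iterations, as in Fig. 8
    D = K(S[:, None] - S[None, :])         # pairwise kernel values K(t - s)
    w = f(S) / D.sum(axis=1)               # density-balanced weights, as in (43)
    S = (D * (w * S)[None, :]).sum(axis=1) / (D * w[None, :]).sum(axis=1)

# S has collapsed toward the maxima of f (near 0.23 and 0.71 for this f).
print(np.unique(np.round(S, 2)))
```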

EXAMPLE 9. Fig. 8 is a demonstration of this optimization strategy. (a) is the underlying function f, whose maxima are to be found. This function is unknown, but with a price, f(x) may be obtained for any x ∈ X. The upper half of (b) shows the initial randomly chosen 100 x ∈ X, as the set S, along with their f(x) values. The lower half is the estimated f function using

q(x) = \sum_{s \in S} (GF)_\lambda(s - x)\, w(s), \qquad w(s) = \frac{f(s)}{\sum_{t \in S} (GF)_\lambda(t - s)}.   (44)

In this demonstration, the range of X is the unit interval and λ = 0.18. (c) contains the S set along with f(s) values after a single mean shift step, with the estimated f using this data set S in the lower half. (d) and (e) are the snapshots after three and five mean shift steps, respectively. After five iterations, the global maximum and a local maximum were discovered and S reached its final configuration, a fixed point of the blurring process. □

Mean shift optimization is a parallel hill climbing method comparable to many genetic algorithms. The blurring process in effect is an evolutionary strategy, with f being the so-called fitness function.

Mean shift optimization also shows some similarity with some multistart global optimization methods [13]. Because it is necessary to have sample points on both sides of a local maximum in order for mean shift to work, the method may not be efficient in a space of high dimensionality or for an f function with too many local maxima. Nevertheless, it has the unique property that the same simple iteration works for both local and global optimization, compared to most multistart methods where separate local and global approaches are used.

    VII. CONCLUDING REMARKS

Suppose one has made a number of observations, performed a series of experiments, or collected a stack of cases. What is a natural way in which the memory of data is organized? It could be organized as a sorted list based on some key, or a decision tree, or a rule-based system, or a distributed associative memory. But what can be more basic than to associate each experience with similar ones, and to extract the common


features that make them different from others? Mean shift, or confusing a point with the average of similar points, must be a simple and natural process that plays a role in memory organization. A multiresolution representation of the data, containing both commonality and specificity, seems likely to be the foundation of law discovery and knowledge acquisition.

Since this process is so intuitive and basic, it should have been used as an ingredient in various modeling and algorithmic efforts, or at least it should have been studied in mathematics as a simple dynamic system. However, based on the author's search and communication, the existence of previous efforts is not apparent. To the best knowledge of the author, Fukunaga and Hostetler [1] is still the first work proposing mean shift explicitly as an iterative algorithm, and a rigorous and comprehensive treatment of the process has not been done.

This paper attempted to provide an appropriate generalization of the mean shift algorithm, so that many interesting and useful properties would be preserved. One property is that the process is either a gradient mapping, or a similar one that seeks modes of a real function. Compared to gradient descent or ascent methods, mean shift seems more effective in terms of adapting to the right step size.

The two major applications of mean shift discussed in this paper, namely cluster analysis and global optimization, have not been practiced widely. There may be computational obstacles, and they may not be suitable for problems with prohibitive sizes and dimensionalities. The computational cost of an iteration of mean shift is O(n^2), where n is the size of S, the data set. It is obviously possible to reduce this time complexity to O(n log n), by a better storage of the data, when only neighboring points are used in the computation of the mean.

    ACKNOWLEDGMENTS

The author thanks Keinosuke Fukunaga and John Schlipf for conversations and Zhangyong Wan for early exploration of the convergence of the process.

This work was supported by the National Science Foundation, under Grant no. IRI-9010555.

REFERENCES

[1] K. Fukunaga and L.D. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition," IEEE Trans. Information Theory, vol. 21, pp. 32-40, 1975.
[2] B.W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman and Hall, 1986.
[3] Y. Cheng and K.S. Fu, "Conceptual clustering in knowledge organization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 7, pp. 592-598, 1985.
[4] K. Rose, E. Gurewitz, and G.C. Fox, "Statistical mechanics and phase transitions in clustering," Physical Review Letters, vol. 65, pp. 945-948, 1990.
[5] K. Rose, E. Gurewitz, and G.C. Fox, "Constrained clustering as an optimization method," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, pp. 785-794, 1993.
[6] J.M. Ortega and W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables. San Diego: Academic Press, 1970.
[7] S.Z. Selim and M.A. Ismail, "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, pp. 81-86, 1984.
[8] Y. Cheng and Z. Wan, "Analysis of the blurring process," Computational Learning Theory and Natural Learning Systems, vol. 3, T. Petsche et al., eds., 1992.
[9] P. Brucker, "On the complexity of clustering problems," Optimization and Operations Research, Henn, Korte, and Oettli, eds. Berlin: Springer-Verlag, 1978.
[10] K. Roeder, "Density estimation with confidence sets exemplified by superclusters and voids in the galaxies," J. Amer. Statistical Assoc., vol. 85, pp. 617-624, 1990; also see [11].
[11] B.S. Everitt, Cluster Analysis, 3rd ed. London: Edward Arnold, 1993.
[12] R.S. Wallace, "A modified Hough transform for lines," IEEE CVPR Conf., pp. 665-667, San Francisco, 1985.
[13] A.H.G. Rinnooy Kan and G.T. Timmer, "Stochastic global optimization methods Part I: Clustering methods," Mathematical Programming, vol. 39, pp. 27-56, 1987.

Yizong Cheng received the BSEE and PhD degrees from Purdue University, West Lafayette, Ind., in 1981 and 1986, respectively. He is now on the faculty of the University of Cincinnati.

His current research interests are in data self-organization and knowledge discovery.