A Robust Information Clustering Algorithm


LETTER Communicated by Steven Nowlan


Qing Song
[email protected]
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798

Neural Computation 17, 2672-2698 (2005) © 2005 Massachusetts Institute of Technology

We focus on the scenario of robust information clustering (RIC) based on the minimax optimization of mutual information (MI). The minimization of MI leads to the standard mass-constrained deterministic annealing clustering, which is an empirical risk-minimization algorithm. The maximization of MI works out an upper bound of the empirical risk via the identification of outliers (noisy data points). Furthermore, we estimate the real risk VC-bound and determine an optimal cluster number of the RIC based on the structural risk-minimization principle. One of the main advantages of the minimax optimization of MI is that it is a nonparametric approach, which identifies the outliers through the robust density estimate and forms a simple data clustering algorithm based on the square error of the Euclidean distance.

1 Introduction

There are basically two unsupervised learning approaches in data clustering and classification problems: the parametric and nonparametric algorithms (Bishop, 1995). In the parametric clustering approach, predefined probability density functions are used to calculate the clustering objective function to achieve optimal performance. This has limited applications in cases of unknown distributions of the input pattern (Vapnik, 1998; Scholkopf, Smola, & Muller, 1998). In a related robust density estimation of the filtering problem, the minimax optimization was studied by Levy and Nikoukhah (2004). Alternatively, the nonparametric clustering algorithm normally divides the input patterns into groups and minimizes the dissimilarity or maximizes the similarity objective functions (Bajcsy & Ahuja, 1998; Gokcay & Principe, 2002; Shen & Wu, 2004). For specific data structures, the nonparametric density-based clustering algorithms were applied to identify high-density clusters separated from the low-density pattern by exploring either regions of the high-density pattern (Bajcsy & Ahuja, 1998) or regions with fewer patterns, such as in valley-seeking clustering (Gokcay & Principe, 2002).

Recently, issues of robust clustering have received extensive attention. This commonly refers to two aspects, outlier and cluster numbers, that are closely linked to each other (Dave & Krishnapuram, 1997). Not knowing the number of clusters in a data set complicates the task of separating the good data points from the noise points, and conversely, the presence of noise makes it harder to determine the number of clusters. A nonparametric density estimation was used in a similarity-based fuzzy clustering algorithm (Shen & Wu, 2004), which used an M-estimator to search for a robust solution. An interesting information-theoretic perspective approach sought a parametric study for an optimal cluster number based on the information bottleneck method (Tishby, Pereira, & Bialek, 1999) and mutual information correction (Still & Bialek, 2004).

We propose a robust information clustering (RIC) algorithm and claim that any data point could become an outlier in the RIC learning procedure. This is dependent on the given data structure and chosen cluster centers. Our primary target is to partition the given data set into effective clusters and determine an optimal cluster number based on the basic deterministic annealing (DA) clustering via the identification of outliers (the latter is only a by-product of RIC; see also the simulation results). The RIC is basically a two-step minimax mutual information (MI) approach. The minimization of the MI, or precisely the rate distortion function, leads to mass-constrained DA clustering, which is designed essentially to divide a given data set into effective clusters of the Euclidean distance (Rose, 1998). As the cluster number is increased from one to a predefined maximum number and the temperature is lowered in the annealing procedure, the DA tends to explore more and more details in the input data structure and may result in the overfitting problem with poor generalization ability.

The maximization of MI, or precisely the capacity maximization, leads to the minimax optimization of the RIC algorithm based on a common constraint: the dissimilarity or distortion measure. The minimax MI estimates an upper bound of the empirical risk and identifies the noisy input data points (outliers) by choosing different cluster numbers. Furthermore, we reinvestigate the character of the titled distribution, which can be interpreted as the cluster membership function that forms a set of indicator functions and is linear in parameters at zero temperature. This allows the RIC to calculate explicitly the Vapnik-Cervonenkis (VC)-dimension and determine an optimal cluster number based on the structural risk minimization (SRM) principle, with a compromise between the minimum empirical risk (signal-to-noise ratio) and VC-dimension (Vapnik, 1998).1

1. VC-dimension is also one of the capacity components in statistical learning theory (Vapnik, 1998). This is different from the concept of capacity maximization in classical information theory. See section 4 for more details.

Contrary to the parametric algorithms, the RIC is a simple yet effective nonparametric method to solve the intertwined robust clustering problems, that is, to estimate an optimal, or at least a suboptimal, cluster number via the identification of outliers (unreliable data points). The argument is that the ultimate target of the RIC is not to look for or approximate the true probability distribution of the input data set (similarly, we are also not looking for "true" clusters or cluster centers), which is proved to be an ill-posed problem (Vapnik, 1998), but to determine an optimal number of effective clusters by eliminating the unreliable data points (outliers) based on the inverse theorem and MI maximization. It also turns out that the outlier is in fact a data point that fails to achieve capacity. Therefore, any data point could become an outlier as the cluster number is increased in the annealing procedure. Furthermore, by replacing the Euclidean distance with other dissimilarity measures, it is possible to extend the new algorithm into the kernel and nonlinear clustering algorithms for linearly nonseparable patterns (Scholkopf et al., 1998; Gokcay & Principe, 2002; Song, Hu, & Xie, 2002).

The letter is organized as follows. Section 2 gives the motivation of the proposed research by reviewing a few related and well-established clustering algorithms. Section 3 discusses the rate distortion function, which forms the foundation of the DA clustering. The capacity maximization, SRM principle, and the RIC algorithm are studied in section 4. In section 5, instructive simulation results are presented to show the superiority of the RIC. The conclusion is presented in section 6.

2 Motivation

Suppose that there are two random n-dimensional data sets, xi ∈ X, i = 1, ..., l, and wk ∈ W, k = 1, ..., K, which represent input data points (information source in terms of communication theory) and cluster centers (prototypes), respectively. For the clustering problem, a hard dissimilarity measure can be presented in a norm space, for example, a square error in the Euclidean distance,

$$\min_{k,i}\; d(w_k, x_i) \qquad (2.1)$$

where d(wk, xi) = ||wk − xi||².

Note that the definition of equation 2.1 is used in DA and most model-free clustering algorithms for a general data clustering problem. This can be easily extended into kernel and other nonlinear-based measures to cover the linearly nonseparable data clustering problem (Scholkopf et al., 1998; Gokcay & Principe, 2002). Based on the hard distortion measure, equation 2.1, some popular clustering algorithms have been developed, including basic k-means, fuzzy c-means (FCM), and improved robust versions, namely the robust noise and possibilistic clustering algorithms (Krishnapuram & Keller, 1993). The optimization of possibilistic clustering, for example, can be reformulated as a minimization of the Lagrangian function

$$J = \sum_{i=1}^{l}\sum_{k=1}^{K}(u_{ik})^m\, d(w_k, x_i) + \sum_{i=1}^{l}\sum_{k=1}^{K}\eta_i\,(1 - u_{ik})^m \qquad (2.2)$$

where uik is the fuzzy membership with the degree parameter m, and ηi is a suitable positive number to control the robustness of each cluster.

The common theme among the robust clustering algorithms is to reject or ignore a subset of the input patterns by evaluating the membership function uij, which can also be viewed as a robust weight function to phase out outliers. The robust performance of the fuzzy algorithms in equation 2.2 is explained in a sense that it involves fuzzy membership of every pattern to all clusters instead of crisp membership. However, in addition to the sensitivity to the initialization of prototypes, the objective function of robust fuzzy clustering algorithms is treated as a monotonic decreasing function, which leads to difficulties finding an optimal cluster number K and a proper fuzziness parameter m.
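As a concrete illustration of how equation 2.2 is minimized for fixed prototypes, the sketch below implements the closed-form membership update u_ik = 1/(1 + (d(w_k, x_i)/η)^(1/(m−1))) that follows from setting the derivative of J with respect to u_ik to zero, which is the standard possibilistic result of Krishnapuram and Keller (1993). This is only a Python sketch with illustrative names, not code from the letter; η may be supplied per cluster or as a scalar.

```python
import numpy as np

def possibilistic_memberships(X, W, eta, m=2.0):
    """Closed-form minimizer of equation 2.2 for fixed prototypes W:
    u_ik = 1 / (1 + (d(w_k, x_i) / eta)^(1/(m-1)))."""
    # squared Euclidean distances d(w_k, x_i), shape (l, K)
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / (1.0 + (d / eta) ** (1.0 / (m - 1.0)))
```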

Recently, maximization of the relative entropy of the continuous time domain, which is similar to maximization of the MI of the discrete time domain in equation 3.1, has been used in robust signal filtering and density estimation (Levy & Nikoukhah, 2004). This uses the following objective function as a minimax optimization problem,

$$J = \min_{W}\max_{\bar f}\;\frac{1}{2}E_{\bar f}\|X - W\|^2 - \upsilon\left(\int_{R}\ln\!\left(\frac{\bar f_z(z)}{f_z(z)}\right)dz - c\right) \qquad (2.3)$$

where X and W represent the input and output data sets, fz(z) is defined as a nominal probability density function against the true density function f̄z(z), and υ is the Lagrange multiplier with the positive constraint c. It maximizes the relative entropy of the second term of the cost function 2.3 against uncertainty, or outlier, through the true density function for the least favorable density estimation. The cost function is minimized in a sense of the least mean square. The key point of this minimax approach is to search for a saddle point of the objective function for a global optimal solution.

A recent information-theoretic perspective approach sought a parametric study based on the information bottleneck method for an optimal cluster number via MI correction (Still & Bialek, 2004). The algorithm maximizes the following objective function,

$$\max_{p(W|X)}\; I(W, V) - T\,I(W, X) \qquad (2.4)$$

where V represents the relevant information data set against the input data set X, and it is assumed that the joint distribution p(V, W) is known approximately. I(W, V) and I(W, X) are the mutual information, and T is a temperature parameter (refer to the next section for details). A key point of this approach is that it presents implicitly a structural risk minimization problem (Vapnik, 1998) and uses the corrected mutual information to search for a risk bound at an optimal temperature (cluster number).

3 Rate Distortion Function and DA Clustering

The original DA clustering is an optimal algorithm in terms of insensitivity to the volume of input patterns in the respective cluster. This avoids local minima of hard clustering algorithms like k-means and splits the whole input data set into effective clusters in the annealing procedure (Rose, 1998). However, the DA algorithm is not robust against disturbance and outliers, because it tends to associate the membership of a particular pattern in all clusters with equal probability distribution (Dave & Krishnapuram, 1997). Furthermore, the DA is an inherent empirical risk-minimization algorithm. This explores details of the input data structure without a limit and needs a preselected maximum cluster number Kmax to stop the annealing procedure. To solve these problems, we first investigate the original rate distortion function, which lays the theoretical foundation of the DA algorithm (Blahut, 1988; Gray, 1990).

The rate distortion function is defined as (Blahut, 1988)

$$R(D(p^*(X))) = \min_{p(W|X)} I(p^*(X), p(W|X)) \qquad (3.1)$$

$$I(p^*(X), p(W|X)) = \sum_{i=1}^{l}\sum_{k=1}^{K} p^*(x_i)\,p(w_k|x_i)\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(w_k|x_i)\,p^*(x_i)} \qquad (3.2)$$

with the constraint

$$\sum_{i=1}^{l}\sum_{k=1}^{K} p^*(x_i)\,p(w_k|x_i)\,d(w_k, x_i) \le D(p^*(X)) = \sum_{i=1}^{l}\sum_{k=1}^{K} p^*(x_i)\,p(w_k|x_i)\,d(w_k, x_i) \qquad (3.3)$$


where I(p*(X), p(W|X)) is the mutual information,2 and p(wk|xi) ∈ p(W|X) is the titled distribution

$$p(w_k|x_i) = \frac{p(w_k)\exp(-d(w_k, x_i)/T)}{N_{x_i}} \qquad (3.4)$$

where the normalizing factor is

$$N_{x_i} = \sum_{k=1}^{K} p(w_k)\exp(-d(w_k, x_i)/T) \qquad (3.5)$$

with the induced unconditional pmf

$$p(w_k) = \sum_{i=1}^{l} p(w_k, x_i) = \sum_{i=1}^{l} p^*(x_i)\,p(w_k|x_i), \qquad k = 1, \ldots, K. \qquad (3.6)$$

p(wk|xi) ∈ p(W|X) achieves a minimum point of the lower curve R(D(p*(X))) in Figure 1 at a specific temperature T (Blahut, 1988). p*(xi) ∈ p*(X) is a fixed unconditional a priori pmf (normally an equal distribution in DA clustering; Rose, 1998).
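A minimal numerical sketch (not from the letter) of the titled distribution and the induced unconditional pmf, equations 3.4 to 3.6, is given below; p_w holds the current p(wk), p_x the fixed a priori pmf p*(xi), and the log-domain shift only stabilizes the exponentials.

```python
import numpy as np

def tilted_distribution(X, W, p_w, T):
    """Equations 3.4-3.5: p(w_k|x_i) = p(w_k) exp(-d(w_k, x_i)/T) / N_{x_i}."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # d(w_k, x_i), shape (l, K)
    logits = np.log(p_w)[None, :] - d / T
    logits -= logits.max(axis=1, keepdims=True)             # numerical stabilization
    p_wx = np.exp(logits)
    return p_wx / p_wx.sum(axis=1, keepdims=True)            # divide by N_{x_i}

def induced_cluster_pmf(p_wx, p_x):
    """Equation 3.6: p(w_k) = sum_i p*(x_i) p(w_k|x_i)."""
    return p_x @ p_wx
```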

The rate distortion function is usually investigated in terms of a parameter s = −1/T with T ∈ (0, ∞). This is introduced as a Lagrange multiplier and equals the slope of the rate distortion function curve, as shown in Figure 1, in classical information theory (Blahut, 1988). T is also referred to as the temperature parameter that controls the data clustering procedure as its value is lowered from infinity to zero (Rose, 1998). Therefore, the rate distortion function can be presented as a constrained optimization problem,

$$R(D(p^*(X))) = \min_{p(W|X)} I(p^*(X), p(W|X)) - s\left(\sum_{i=1}^{l}\sum_{k=1}^{K} p^*(x_i)\,p(w_k|x_i)\,d(w_k, x_i) - D(p^*(X))\right). \qquad (3.7)$$

One important property of R(D(p*(X))) is that it is a decreasing, convex, and continuous function defined in the interval 0 ≤ D(p*(X)) ≤ Dmax for any particular cluster number 0 < K ≤ Kmax, as shown in Figure 1 (Blahut, 1972).

2. The MI I(p*(X), p(W|X)) has another notation, I(X, W), similar to the one used in equation 2.4. However, as pointed out by Blahut (1988), the latter may not be the best notation for the optimization problem because it suggests that MI is merely a function of the variable vectors X and W. For the same reason, we use probability distribution notation for all related functions. For example, the rate distortion function is presented as R(D(p*(X))), which is a bit more complicated than the original paper (Blahut, 1972). This inconvenience turns out to be worth it as we study the related RIC capacity problem, which is coupled closely with the rate distortion function, as shown in the next section.

[Figure 1: Plots of the rate distortion function and capacity curves for any particular cluster number K ≤ Kmax. The plots are parameterized by the temperature T. The lower curve R(D(p*(X))) corresponds to empirical risk minimization, the upper curve C(D(p(X))) to capacity maximization, and the two meet at the optimal saddle point where D(p(X)) = D(p*(X)).]

Define the DA clustering objective function as (Rose, 1998)

$$F(p^*(X), p(W|X)) = I(p^*(X), p(W|X)) - s\sum_{i=1}^{l}\sum_{k=1}^{K} p^*(x_i)\,p(w_k|x_i)\,d(w_k, x_i). \qquad (3.8)$$

The rate distortion function

$$R(D(p^*(X))) = s\,D(p^*(X)) + \min_{p(W|X)} F(p^*(X), p(W|X)) \qquad (3.9)$$

is minimized by the titled distribution 3.4 (Blahut, 1972).

From the data clustering point of view, equations 2.2 and 3.8 are well known to be soft dissimilarity measures of different clusters (Dave & Krishnapuram, 1997). To accommodate the DA-based RIC algorithm in a single framework of classical information theory, we use a slightly different treatment from the original paper of Rose (1998) for the DA clustering algorithm, that is, to minimize equation 3.8 with respect to the free pmf p(wk|xi) rather than the direct minimization against the cluster center W. This recasts the clustering optimization problem as that of seeking the distribution pmf and minimizing equation 3.8 subject to a specified level of randomness. This can be measured by the minimization of the MI, equation 3.1.

The optimization is now to minimize the function F(p*(X), p(W|X)), which is a by-product of the MI minimization over the titled distribution p(wk|xi), to achieve a minimum distortion; this leads to the mass-constrained DA clustering algorithm.

Plugging equation 3.4 into 3.8, the optimal objective function, equation 3.8, becomes the entropy functional in a compact form:3

$$F(p^*(X), p(W|X)) = -\sum_{i=1}^{l} p^*(x_i)\ln\sum_{k=1}^{K} p(w_k)\exp(-d(w_k, x_i)/T). \qquad (3.10)$$

Minimizing equation 3.10 against the cluster center wk, we have

$$\frac{\partial F(p^*(X), p(W|X))}{\partial w_k} = \sum_{i=1}^{l} p^*(x_i)\,p(w_k|x_i)\,(w_k - x_i) = 0, \qquad (3.11)$$

which leads to the optimal clustering center

$$w_k = \sum_{i=1}^{l} \alpha_{ik}\, x_i \qquad (3.12)$$

where

$$\alpha_{ik} = \frac{p^*(x_i)\,p(w_k|x_i)}{p(w_k)}. \qquad (3.13)$$

For any cluster number K ≤ Kmax and a fixed arbitrary pmf set p*(xi) ∈ p*(X), minimization of the clustering objective function 3.8 against the pmf set p(W|X) is monotone nonincreasing and converges to a minimum point of the convex function curve at a particular temperature. The soft distortion measure D(p*(X)) in equation 3.3 and the MI, equation 3.1, are minimized simultaneously in a sense of empirical risk minimization.
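The resulting fixed-point iteration at a fixed temperature can be sketched as below; this is an illustrative reading of equations 3.4, 3.6, 3.12, and 3.13, reusing the hypothetical helpers tilted_distribution and induced_cluster_pmf sketched earlier, and is not the author's implementation.

```python
import numpy as np

def da_fixed_point(X, W, p_w, p_x, T, n_iter=200, tol=1e-6):
    """Alternate the titled distribution (3.4), the induced pmf (3.6), and the
    mass-center update (3.12)-(3.13) until the centers stop moving."""
    for _ in range(n_iter):
        p_wx = tilted_distribution(X, W, p_w, T)       # equation 3.4
        p_w = induced_cluster_pmf(p_wx, p_x)           # equation 3.6
        alpha = (p_x[:, None] * p_wx) / p_w[None, :]   # equation 3.13
        W_new = alpha.T @ X                            # equation 3.12
        done = np.linalg.norm(W_new - W) < tol
        W = W_new
        if done:
            break
    return W, p_w, p_wx
```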

3. $F(p^*(X), p(W|X)) = \sum_{i=1}^{l}\sum_{k=1}^{K} p^*(x_i)p(w_k|x_i)\ln\left(\frac{p(w_k)\exp(-d(w_k, x_i)/T)}{p(w_k)N_{x_i}}\right) - s\sum_{i=1}^{l}\sum_{k=1}^{K} p^*(x_i)p(w_k|x_i)d(w_k, x_i) = -\sum_{i=1}^{l} p^*(x_i)\sum_{k=1}^{K} p(w_k|x_i)\ln N_{x_i} = -\sum_{i=1}^{l} p^*(x_i)\ln\sum_{k=1}^{K} p(w_k)\exp(-d(w_k, x_i)/T)$ (according to equation 3.4 and $\sum_{k=1}^{K} p(w_k|x_i) = 1$).


4 Minimax Optimization and the Structural Risk Minimization

4.1 Capacity Maximization and Input Data Reliability. In the constrained minimization of MI of the last section, we obtain an optimal feedforward transition probability, the a priori pmf p(wk|xi) ∈ p(W|X). A backward transition probability, the a posteriori pmf p(xi|wk) ∈ p(X|W), can be obtained through the Bayes formula

$$p(x_i|w_k) = \frac{p(x_i)\,p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\,p(w_k|x_i)} = \frac{p(x_i)\,p(w_k|x_i)}{p(w_k)}. \qquad (4.1)$$

The backward transition probability is useful to assess the realizability of the input data set in classical information theory. Directly using the pmf, equation 4.1, yields an optimization problem by simply evaluating a single pmf p(xi|wk), and is not a good idea to reject outliers (Mackay, 1999). However, we can use the capacity function of classical information theory. This is defined by maximizing an alternative presentation of the MI against the input probability distribution,

$$C = \max_{p(X)} I(p(X), p(W|X)) \qquad (4.2)$$

with

$$I(p(X), p(W|X)) = I(p(X), p(X|W)) = \sum_{i=1}^{l}\sum_{k=1}^{K} p(x_i)\,p(w_k|x_i)\ln\frac{p(x_i|w_k)}{p(x_i)} \qquad (4.3)$$

where C is a constant representing the channel capacity.

Now we are in a position to introduce the channel reliability of classical information theory (Blahut, 1988). To deal with the input data uncertainty, the MI can be presented in a simple channel entropy form,

$$I(p(X), p(X|W)) = H(p(X)) - H(p(W), p(X|W)) \qquad (4.4)$$

where the first term represents uncertainty of the channel input variable X,4

$$H(p(X)) = -\sum_{i=1}^{l} p(x_i)\ln(p(x_i)) \qquad (4.5)$$

4. In nats (per symbol), since we use the natural logarithm rather than bits (per symbol) with the log2 function. Note that we use the special entropy notations H(p(X)) = H(X) and H(p(W), p(X|W)) = H(X|W) here.


and the second term is the conditional entropy

$$H(p(W), p(X|W)) = -\sum_{i=1}^{l}\sum_{k=1}^{K} p(w_k)\,p(x_i|w_k)\ln p(x_i|w_k). \qquad (4.6)$$

Lemma 1 (inverse theorem).5 The clustering data reliability is presented in a single symbol error pe of the input data set, with empirical error probability

$$p_e = \sum_{i=1}^{l}\sum_{k \ne i}^{K} p(x_i|w_k) \qquad (4.7)$$

such that if the input uncertainty H(p(X)) is greater than C, the error pe is bounded away from zero as

$$p_e \ge \frac{1}{\ln l}\,(H(p(X)) - C - 1). \qquad (4.8)$$

Proof. We first give an intuitive discussion here over Fano's inequality (see Blahut, 1988, for a formal proof).

Uncertainty in the estimated channel input can be broken into two parts: the uncertainty in the channel about whether an empirical error pe was made and, given that an error is made, the uncertainty in the true value. The error occurs with probability pe, such that the first uncertainty is the binary entropy H(pe) = −(1 − pe) ln(1 − pe) − pe ln pe, and the second can be no larger than ln(l); this occurs only when all alternative errors are equally likely. Therefore, if the equivocation can be interpreted as the information lost, we should have Fano's inequality,

$$H(p(W), p(X|W)) \le H(p_e) + p_e\ln(l). \qquad (4.9)$$

Now consider that the maximum of the MI is C in equation 4.2, so we can rewrite equation 4.4 as

$$H(p(W), p(X|W)) = H(p(X)) - I(p(X), p(X|W)) \ge H(p(X)) - C. \qquad (4.10)$$

Then Fano's inequality is applied to get

$$H(p(X)) - C \le H(p_e) + p_e\ln(l) \le 1 + p_e\ln l. \qquad (4.11)$$

5. There is a tighter bound on pe compared to the one of lemma 1, as in the work of Jelinet (1968). However, this may not be very helpful, since minimization of the empirical risk is not necessary to minimize the real structural risk, as shown in section 4.3.


Lemma 1 gives an important indication that any incoming information (input data) beyond the capacity C will generate unreliable data transmission. This is also called the inverse theorem in the sense that it uses the DA-generated optimal titled distribution to produce the backward transition probability, equation 4.1, and assess an upper bound of the empirical risk, equation 4.10.
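A small numerical illustration of the bound 4.8, with invented numbers purely for intuition: for l = 100 equiprobable input points and a hypothetical capacity of 2 nats, no clustering channel can push the single-symbol error below roughly 0.35.

```python
import numpy as np

l, C = 100, 2.0                          # hypothetical data size and capacity (nats)
H_x = np.log(l)                          # equiprobable input: H(p(X)) = ln l ~ 4.605
pe_lower = (H_x - C - 1.0) / np.log(l)   # equation 4.8
print(round(pe_lower, 3))                # ~0.348: p_e is bounded away from zero
```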

4.2 Capacity Maximization and the Optimal Solution. Equation 3.3 is well known to be a soft dissimilarity measure minimized by the DA clustering as the temperature T is lowered toward zero (Rose, 1998). However, there is no way for the DA to search for an optimal temperature value, and in turn an optimal cluster number, because the rate distortion function provides only limited information and aims at the empirical risk minimization, as shown in section 3. Therefore, we propose a capacity, or MI maximization, scheme. This is implicitly dependent on the distortion measure, similar to the rate distortion function.

We define a constrained maximization of MI as6

$$C(D(p(X))) = \max_{p(X)} I(p(X), p(W|X)) \qquad (4.12)$$

with a similar constraint as in equation 3.3,

$$D(p(X)) = \sum_{i=1}^{l}\sum_{k=1}^{K} p(x_i)\,p(w_k|x_i)\,d(w_k, x_i) \le D(p^*(X)). \qquad (4.13)$$

This is because minimization of the soft distortion measure D(p*(X)), equation 3.3, is the ultimate target of the DA clustering algorithm, as analyzed in section 3. We need to assess the maximum possibility to make an error (risk). According to lemma 1, reliability of the input data set depends on the capacity, that is, the maximum value of the MI against the input density estimate. To do this, we evaluate the optimal a priori pmf, the robust density distribution pmf p(xi) ∈ p(X), to replace the fixed arbitrary p*(xi) in the distortion measure, equation 3.3, and assess reliability of the input data for each particular cluster number K based on the a posteriori pmf in equation 4.1. If most of the data points (if not all) achieve the capacity (fewer outliers), then we can claim that the clustering result reaches an optimal, or at least a suboptimal, solution at this particular cluster number in a sense of empirical risk minimization.

6. Here we use a similar notation of the capacity function as for the rate distortion function R(D(p(X))) to indicate implicitly that the specific capacity function is in fact an implicit function of the distortion measure D(p(X)). For each particular temperature T, the capacity C(D(p(X))) achieves a point on the upper curve corresponding to the lower curve R(D(p*(X))), as shown in equation 4.17.


Similar to the minimization of the rate distortion function in section 3, the constrained capacity maximization can be rewritten as an optimization problem with a Lagrange multiplier λ ≥ 0,

$$C(D(p(X))) = \max_{p(X)}\left[ I(p(X), p(W|X)) + \lambda\,(D(p^*(X)) - D(p(X)))\right]. \qquad (4.14)$$

Theorem 1. The maximum of the constrained capacity C(D(p(X))) is achieved by the robust density estimate

$$p(x_i) = \frac{\exp\left(\sum_{k=1}^{K} p(w_k|x_i)\ln p(x_i|w_k) - \lambda\,p(w_k|x_i)\,d(w_k, x_i)\right)}{\sum_{i=1}^{l}\exp\left(\sum_{k=1}^{K} p(w_k|x_i)\ln p(x_i|w_k) - \lambda\,p(w_k|x_i)\,d(w_k, x_i)\right)} \qquad (4.15)$$

with the specific distortion measure D(p(X)) = D(p*(X)) for p(xi) ≥ 0, for all 1 ≤ i ≤ l.

Proof. Similar to Blahut (1972), we can temporarily ignore the condition p(xi) ≥ 0 and set the derivative of the optimal function 4.14 equal to zero against the independent variable, the a priori pmf p(xi). This results in

$$\frac{\partial}{\partial p(x_i)}\left(C(D(p(X))) + \lambda_1\left(\sum_{i=1}^{l} p(x_i) - 1\right)\right) = 0 = -\ln p(x_i) - 1 + \sum_{k=1}^{K} p(w_k|x_i)\left(\ln p(x_i|w_k) - \lambda\,p(w_k|x_i)\,d(w_k, x_i)\right) + \lambda_1. \qquad (4.16)$$

We also select a suitable λ1, which ensures that the probability constraint Σ_{i=1}^{l} p(xi) = 1 is guaranteed and leads to the robust density distribution estimate, equation 4.15.

According to the Kuhn-Tucker theorem (Blahut, 1988), if there exists an optimal robust distribution p(xi), which is derived from equation 4.15, then the inequality constraint, equation 4.13, of the distortion measure becomes an equality and achieves the optimal solution of equation 4.14 at an optimal saddle point between the curves C(D(p(X))) and R(D(p*(X))), with the corresponding average distortion measure

$$D(p(X)) = D(p^*(X)). \qquad (4.17)$$

By dividing the input data into effective clusters, the DA clustering minimizes the relative Shannon entropy without a priori knowledge of the data distribution (Gray, 1990). The prototype (cluster center), equation 3.12, is clearly presented as a mass center. This is insensitive to the initialization of cluster centers and volumes with a fixed probability distribution, for example, an equal value p*(xi) = 1/l for the entire input data points (Rose, 1998). Therefore, the prototype parameter αki depends on the titled distribution p(wk|xi), equation 3.4, which tends to associate the membership of any particular pattern in all clusters and is not robust against outliers or disturbance of the training data (Dave & Krishnapuram, 1997). This in turn generates difficulties in determining an optimal cluster number, as shown in Figure 2 (see also the simulation results). Any data point located around the middle position between two effective clusters could be considered an outlier.

[Figure 2: The titled distribution and robust density estimation based on the inverse theorem for a two-cluster data set, showing the two cluster centers w1 and w2, the titled distributions p(w1|xi) and p(w2|xi), the backward pmfs p(xi|w1) and p(xi|w2), and the region between the clusters where the capacity condition fails and p(xi) = 0.]

Corollary 1. The capacity curve C(D(p(X))) is continuous, nondecreasing, and concave on D(p(X)) for any particular cluster number K.

Proof. Let p′(xi) ∈ p′(X) and p″(xi) ∈ p″(X) achieve [D(p′(X)), C(D(p′(X)))] and [D(p″(X)), C(D(p″(X)))], respectively, and let p(xi) = λ′p′(xi) + λ″p″(xi) be an optimal density estimate in theorem 1, where λ″ = 1 − λ′ and 0 < λ′ < 1. Then

$$D(p(X)) = \sum_{i=1}^{l}\sum_{k=1}^{K}\left(\lambda' p'(x_i) + \lambda'' p''(x_i)\right)p(w_k|x_i)\,d(w_k, x_i) = \lambda' D(p'(X)) + \lambda'' D(p''(X)) \qquad (4.18)$$


and because p(X) is the optimal value, we have

$$C(D(p(X))) \ge I(p(X), p(W|X)). \qquad (4.19)$$

Now we use the fact that I(p(X), p(W|X)) is concave (upward convex) in p(X) (Jelinet, 1968; Blahut, 1988) and arrive at

$$C(D(p(X))) \ge \lambda' I(p'(X), p(W|X)) + \lambda'' I(p''(X), p(W|X)). \qquad (4.20)$$

Finally, we have

$$C(\lambda' D(p'(X)) + \lambda'' D(p''(X))) \ge \lambda' C(D(p'(X))) + \lambda'' C(D(p''(X))). \qquad (4.21)$$

Furthermore, because C(D(p(X))) is concave on [0, Dmax], it is continuous, nonnegative, and nondecreasing, achieving the maximum value at Dmax; it must also be strictly increasing for D(p(X)) smaller than Dmax.

Corollary 2. The robust distribution estimate p(X) achieves the capacity at

$$\sum_{k=1}^{K} p(w_k|x_i)\left(\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\,p(w_k|x_i)} - \lambda\,p(w_k|x_i)\,d(w_k, x_i)\right) = V \qquad \forall\, p(x_i) \ne 0 \qquad (4.22)$$

$$\sum_{k=1}^{K} p(w_k|x_i)\left(\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\,p(w_k|x_i)} - \lambda\,p(w_k|x_i)\,d(w_k, x_i)\right) < V \qquad \forall\, p(x_i) = 0. \qquad (4.23)$$

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik, 1998),

$$p(x_i)\left[V - \left(\sum_{k=1}^{K} p(w_k|x_i)\left(\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\,p(w_k|x_i)} - \lambda\,p(w_k|x_i)\,d(w_k, x_i)\right)\right)\right] = 0, \qquad \forall i. \qquad (4.24)$$

Proof. Similar to the proof of theorem 1, we use the concave property of C(D(p(X))),

$$\frac{\partial}{\partial p(x_i)}\left(C(D(p(X))) + \lambda_1\left(\sum_{i=1}^{l} p(x_i) - 1\right)\right) \ge 0, \qquad (4.25)$$


which can be rewritten as

$$\sum_{k=1}^{K} p(w_k|x_i)\left(\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\,p(w_k|x_i)} - \lambda\,p(w_k|x_i)\,d(w_k, x_i)\right) \le -\lambda_1 + 1, \qquad \forall i, \qquad (4.26)$$

with equality for all p(xi) ≠ 0. Setting −λ1 + 1 = V completes the proof.

Similarly, it is easy to show that if we choose λ = 0, the Kuhn-Tucker condition becomes

$$p(x_i)\left[C - \left(\sum_{k=1}^{K} p(w_k|x_i)\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\,p(w_k|x_i)}\right)\right] = 0, \qquad \forall i, \qquad (4.27)$$

where C is the maximum capacity value defined in equation 4.2.

Note that the MI is not negative. However, individual items in the sum of the capacity maximization, equation 4.2, can be negative. If the ith pattern xi is taken into account and p(wk|xi) < Σ_{i=1}^{l} p(xi)p(wk|xi), then the probability of the kth code vector (cluster center) is decreased by the observed pattern and gives negative information about pattern xi. This particular input pattern may be considered an unreliable pattern (outlier), and its negative effect must be offset by other input patterns. Therefore, the maximization of the MI, equation 4.2, provides a robust density estimation of the noisy pattern (outlier) in terms that the average information is over all clusters and input patterns. The robust density estimation and optimization is now to maximize the MI against the pmfs p(xi) and p(xi|wk): for any value of i, if p(xi|wk) = 0, then p(xi) should be set equal to zero in order to obtain the maximum, such that a corresponding training pattern (outlier) xi can be deleted and dropped from further consideration in the optimization procedure, as the outlier shown in Figure 2.

As a by-product, the robust density estimation leads to an improved criterion for the calculation of the critical temperature to split the input data set into more clusters of the RIC, compared to the DA, as the temperature is lowered (Rose, 1998). The critical temperature of the RIC can be determined by the maximum eigenvalue of the covariance (Rose, 1998)

$$V_{XW} = \sum_{i=1}^{l} p(x_i|w_k)\,(x_i - w_k)(x_i - w_k)^T \qquad (4.28)$$

where p(xi|wk) is optimized by equation 4.1. This has a bigger value representing the reliable data, since the channel communication error pe is relatively smaller compared to the one of an outlier (see lemma 1).
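A sketch (illustrative, not the author's code) of the resulting critical-temperature test: the cluster with center w_k splits when the temperature reaches twice the largest eigenvalue of the covariance V_XW of equation 4.28, computed with the robust backward pmf p(x_i|w_k).

```python
import numpy as np

def critical_temperature(X, w_k, p_x_given_wk):
    """T_crit = 2 * E_max(V_XW) with V_XW from equation 4.28."""
    diff = X - w_k                                    # rows are (x_i - w_k)
    V_xw = (p_x_given_wk[:, None, None]
            * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    return 2.0 * np.linalg.eigvalsh(V_xw).max()
```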


4.3 Structural Risk Minimization and Optimal Cluster Number. To solve the intertwined outlier and cluster number problems, some intuitive notions can be obtained based on classical information theory as presented in the previous sections. Increasing K and model complexity (as the temperature is lowered) may reduce the capacity C(D(p(X))), since it is a nondecreasing function of D(p(X)), as shown in corollary 1 (see also Figure 1). Therefore, in view of theorem 1, we should use the smallest cluster number as long as a relatively small number of outliers is achieved (if not zero outliers), say 1 percent of the entire input data points. However, how to make a trade-off between empirical risk minimization and capacity maximization is a difficult problem for classical information theory.

We can solve this difficulty by bridging the gap between classical information theory, on which the RIC algorithm is based, and the relatively new statistical learning theory with the so-called structural risk minimization (SRM) principle (Vapnik, 1998). Under the SRM, a set of admissible structures with nested subsets can be defined specifically for the RIC clustering problem as

$$S_1 \subset S_2 \subset \cdots \subset S_K \qquad (4.29)$$

where SK = {QK(xi, W), W ∈ K}, ∀i, with a set of indicator functions of the empirical risk7

$$Q_K(x_i, W) = \sum_{k=1}^{K}\lim_{T\to 0} p(w_k|x_i) = \sum_{k=1}^{K}\lim_{T\to 0}\frac{p(w_k)\exp(-d(x_i, w_k)/T)}{N_{x_i}}, \qquad \forall i. \qquad (4.30)$$

We shall show that the titled distribution p(wk|xi), equation 3.4, at zero temperature, as in equation 4.30, can be approximated by the complement of a step function. This is linear in parameters and assigns the cluster membership of each input data point based on the Euclidean distance between data point xi and cluster center wk for a final hard clustering partition (Rose, 1998; see also the algorithm in section 4.4).

The titled distribution at T → 0 can be presented as

$$\lim_{T\to 0}\frac{p(w_k)\exp(-d(x_i, w_k)/T)}{\sum_{k=1}^{K} p(w_k)\exp(-d(x_i, w_k)/T)} \approx
\begin{cases}
\dfrac{p(w_k)\exp(-d_0(x_i, w_k))}{p(w_k)\exp(-d_0(x_i, w_k))} = 1 & \text{if } d_0(x_i, w_k) \ne \infty \\[2ex]
\dfrac{p(w_k)\exp(-d_0(x_i, w_k))}{\sum_{k=1}^{K} p(w_k)\exp(-d_0(x_i, w_k))} = 0 & \text{if } d_0(x_i, w_k) \to \infty
\end{cases} \qquad (4.31)$$

7. According to the definition of the titled distribution, equation 3.4, it is easy to see that the defined indicator function is a constant number, that is, QK(xi, W) = 1. See also note 3.

Now consider the radius d0(xi, wk) between data point xi and cluster k at zero temperature. This can be rewritten as an inner product of two n-dimensional vectors of the input space as

$$d_0(x_i, w_k) = \lim_{T\to 0}\frac{d(x_i, w_k)}{T} = \lim_{T\to 0}\frac{\langle x_i - w_k,\; x_i - w_k\rangle}{T} = \sum_{o=1}^{n} r_{ko}\,\varphi_{ko}(X) \qquad (4.32)$$

where rko represents the radius parameter component in the n-dimensional space, and φko(X) is a linearly independent function, similar to the hyperplane case (Vapnik, 1998).

Using equations 4.32 and 4.31, we can rewrite 4.30 as

$$Q_K(x_i, W) = \sum_{k=1}^{K}\bar\theta\left(\sum_{o=1}^{n} r_{ko}\,\varphi_{ko}(X) - d_0(x_i, w_k)\right), \qquad \forall i, \qquad (4.33)$$

where θ̄(·) = 1 − θ(·) is the complement of the step function θ(·). Note that there is one and only one d0(xi, wk) ≠ ∞, ∀(1 ≤ k ≤ K), in each conditional equality of equation 4.31, since it gives a unique cluster membership of any data point xi in a nested structure SK. Therefore, the indicator QK(xi, W) is linear in parameters. According to Vapnik (1998), the VC-dimension of the complexity control parameter is equal to the number of parameters, hK = (n + 1) ∗ K, for each nested subset SK. By design of the DA clustering, the nested structure in equation 4.29 provides ordering of the VC-dimensions h1 ≤ h2 ≤ ... ≤ hK, such that the increase of cluster number is proportional to the increase of the estimated VC-dimension from a neural network point of view (Vapnik, 1998).
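At T → 0 the titled distribution therefore collapses to the indicator form of equations 4.31 and 4.33, that is, a hard nearest-prototype assignment; a minimal sketch (illustrative names, not the author's code) is:

```python
import numpy as np

def hard_membership(X, W):
    """Zero-temperature limit of equation 3.4: x_i joins the single cluster whose
    d_0(x_i, w_k) stays finite, i.e., the nearest center in the Euclidean sense."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    Q = np.zeros_like(d)
    Q[np.arange(len(X)), labels] = 1.0    # complement-of-step-function indicator
    return labels, Q
```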

To obtain good generalization performance, one has to use the admissible structure, equation 4.29, based on the set of indicator functions to search for an optimal cluster number K. This minimizes a VC-bound ps similar to that of the support vector machine, except that we are looking for the strongest data point of the input space instead of seeking the weakest data point of the feature (kernel) space (Vapnik, 1998). So we have

$$p_s \le \eta + \frac{\varepsilon}{2}\left(1 + \sqrt{1 + \frac{4\eta}{\varepsilon}}\right) \qquad (4.34)$$


with

$$\eta = \frac{m}{l} \qquad (4.35)$$

$$\varepsilon = 4\,\frac{h_K\left(\ln\frac{2l}{h_K} + 1\right) - \ln\frac{\zeta}{4}}{l} \qquad (4.36)$$

where m is the number of outliers identified in the capacity maximization, as in the previous section, and ζ < 1 is a constant.

The signal-to-noise ratio η in equation 4.35 appears as the first term of the right-hand side of the VC-bound, equation 4.34. This represents the empirical risk, and the second term is the confidence interval of the SRM-based estimate.
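A short sketch of the bound 4.34 to 4.36 as reconstructed above; Vapnik's standard form is assumed for the confidence term, hK = (n + 1)K is taken from the previous subsection, and the function name is illustrative.

```python
import numpy as np

def vc_bound(m_outliers, l, n, K, zeta=0.1):
    """p_s <= eta + (eps/2)(1 + sqrt(1 + 4*eta/eps)), equations 4.34-4.36."""
    eta = m_outliers / l                                   # equation 4.35
    h_K = (n + 1) * K                                      # VC-dimension of S_K
    eps = 4.0 * (h_K * (np.log(2.0 * l / h_K) + 1.0)
                 - np.log(zeta / 4.0)) / l                 # equation 4.36
    return eta + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * eta / eps))  # equation 4.34
```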

Discussion

Stop criterion and optimal cluster number. At the initial DA clustering stage, with a small cluster number K and a relatively large ratio between the number of input data points and the VC-dimension, say l/hK > 20 (Vapnik, 1998), the real risk VC-bound, equation 4.34, is mainly determined by the first term of the right-hand side of the inequality, that is, the empirical risk (signal-to-noise) ratio η in equation 4.35. As the temperature is lowered and the cluster number is increased, a relatively small ratio l/hK may require both terms in the right-hand side of equation 4.34 to be small simultaneously. Therefore, we can assess first the ratio l/hK, which is near the upper bound of the critical number 20, for a maximum cluster number K = Kmax, beyond which the second term of the VC-bound, equation 4.34, may become dominant even for a small empirical risk ratio η, especially in a high-dimensional data space. Therefore, we can follow the minimax MI optimization, as in sections 3 and 4, to increase the cluster number from one until Kmax for a minimum value of the VC-bound, that is, take a trade-off between minimization of the empirical risk and VC-dimension.

Selection of λ. The degree of robustness of the RIC algorithm is controlled by the parameter λ. The Kuhn-Tucker condition in corollary 2 tells us that a relatively larger value of λ yields more outliers (noisy patterns). If one chooses λ = 0, the RIC allows the maximum empirical risk with a possible overcapacity distortion beyond the optimal saddle point and a minimum number of the estimated outliers (see Figure 1). In a general clustering problem using the L2 distortion measure, equation 2.2, selection of λ is insensitive to determination of an optimal cluster number, because the VC-bound depends on only the relative values of η and hK over different cluster numbers (see also example 2). As a general rule of thumb, if eliminating more outliers is of interest, we can gradually increase λ and redo the capacity maximization to reject outliers located between intercluster boundaries at an optimal cluster number determined by an arbitrary value of λ.

4.4 Implementation of the RIC Algorithm

Phase I (Minimization)

1. Determine the ratio l/(n ∗ K), which is near the critical number 20, for a maximum cluster number K = Kmax, and set p(xi) = 1/l for i = 1 to l.

2. Initialize T > 2Emax(Vx), where Emax(Vx) is the largest eigenvalue of the variance matrix Vx of the input pattern set X; set K = 1 and p(w1) = 1.

3. For i = 1, ..., K, run the fixed-point iteration of the DA clustering according to equations 3.4, 4.15, and 3.12.

4. Convergence test: if not satisfied, go to step 3.

5. If T ≤ Tmin, perform the last iteration and stop.

6. Cooling step: T ← αT (α < 1).

7. If K < Kmax, check the condition for phase transition for i = 1, ..., K. If a critical temperature T = 2Emax(VXW), where Emax(VXW) is the largest eigenvalue of the covariance matrix VXW in equation 4.28 between the input pattern and code vector (Rose, 1998), is reached for the clustering, add a new center wK+1 = wK + δ with p(wK+1) = p(wK)/2, p(wK) ← p(wK)/2, and update K + 1 ← K.

Phase II (Maximization)

8. If it is the first time for the calculation of the robust density estimation, select p(xi) = 1/l, ∞ > λ ≥ 0, and ε > 0, and start the fixed-point iteration of the robust density estimation in the following steps 9 to 10.

9. Compute

$$c_i = \exp\left[\sum_{k=1}^{K}\left(p(w_k|x_i)\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\,p(w_k|x_i)} - \lambda\,p(w_k|x_i)\,d(w_k, x_i)\right)\right]. \qquad (4.37)$$

10. If

$$\ln\sum_{i=1}^{l} p(x_i)\,c_i - \ln\max_{i=1,\ldots,l} c_i < \varepsilon \qquad (4.38)$$

then go to step 9 (where ε > 0); otherwise, update the density estimation

$$p(x_i) = \frac{p(x_i)\,c_i}{\sum_{i=1}^{l} p(x_i)\,c_i}. \qquad (4.39)$$


11. Verify the robust solutions of the RIC algorithm around the optimal saddle point for a minimum value of the VC-bound, equation 4.34, within the range of the maximum cluster number Kmax. If the minimum is found, then delete the outliers and set T → 0 for the titled distribution to obtain the cluster membership of all input data points for a hard clustering solution. Recalculate the cluster centers using equation 3.12 without outliers; then stop. Otherwise, go to step 3.
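A minimal sketch of the Phase II fixed-point loop (steps 8 to 10, equations 4.37 to 4.39) follows. The stopping rule below uses the gap between ln max_i c_i and ln Σ_i p(x_i)c_i, which is one reading of the garbled equation 4.38 (the usual Blahut-Arimoto criterion); points whose p(x_i) is driven to numerical zero are the outliers. Names and defaults are illustrative, not the author's code.

```python
import numpy as np

def robust_density_estimate(p_wx, d, lam=0.0, eps=1e-6, max_iter=500):
    """Fixed-point iteration of equations 4.37-4.39. p_wx holds p(w_k|x_i) from
    Phase I and d the distortions d(w_k, x_i); both have shape (l, K)."""
    l = p_wx.shape[0]
    p_x = np.full(l, 1.0 / l)                              # step 8
    tiny = 1e-300
    for _ in range(max_iter):
        p_w = p_x @ p_wx                                   # sum_i p(x_i) p(w_k|x_i)
        log_c = (p_wx * (np.log(p_wx + tiny) - np.log(p_w + tiny))
                 - lam * p_wx * d).sum(axis=1)
        c = np.exp(log_c)                                  # equation 4.37 (step 9)
        Z = p_x @ c
        if np.log(c.max()) - np.log(Z) < eps:              # convergence gap (step 10)
            break
        p_x = p_x * c / Z                                  # equation 4.39
    return p_x
```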

5 Simulation Results

This section presents a few simulation examples to show the superiority of the RIC over the standard DA clustering algorithm. This is in fact a self-comparison, since the RIC is just an extension of the DA by identifying outliers for an optimal cluster number. A comparison can also be made with the popular fuzzy c-means (FCM) and the robust version of the FCM clustering algorithms (see section 2). However, this may not make much sense, since the FCM needs a predetermined cluster number in addition to the initialization problem (Krishnapuram & Keller, 1993).

Example 1, which follows, presents a numerical analysis to reveal details of the weakness of the titled distribution. This also explains how the robust density estimate of the RIC algorithm finds an optimal cluster number via the identification of outliers. Example 2 illustrates that one can always choose a relatively larger control parameter λ to eliminate more outliers between the intercluster area without affecting the estimated optimal cluster number. Example 3 shows an interesting partition of a specific data set without clear cluster boundaries. In particular, we show that any data point could become an outlier, depending on the given data structure and chosen cluster centers, in the annealing procedure based on the limited number of input data for a minimum VC-bound. Similarly, we are not looking for "true" clusters or cluster centers, but effective clusters in a sense of the SRM based on the simple Euclidean distance.8

Example 1. Figure 3 is an extended example used in the robust FCM clustering algorithm (Krishnapuram & Keller, 1993), which has two well-separated clusters with seven data points each and four outliers sitting around the middle position between the two given clusters. The data set has 18 data points, such that the ratio l/h1 = 18/(3 ∗ 1) is already smaller than the critical number 20. An optimal cluster number should be the minimum, two (note that DA does not work for one cluster). However, we would like to use this example to reveal the weakness of the titled distribution and how the robust density estimate helps. Figure 3 also shows that the RIC algorithm with

8. We set ζ = 0.1 of the VC-bound for all the simulation results. The Matlab program can be downloaded from the author's Internet address, http://www.ntu.edu.sg/home/eqsong.


[Figure 3: The clustering results of RIC (λ = 0) in example 1. (a) The original data set. (b) K = 2, ps = 4.9766. (c) K = 3, ps = 5.7029. (d) K = 4, ps = 6.4161. The bigger ∗ represents the estimated cluster center of the RIC after eliminating the estimated outliers. The black dot points are the identified outliers by the RIC in b, c, and d.]


Table 1: Optimal Titled Distribution p(wk|xi) and Robust Density Estimate p(xi) in Example 1 with K = 2.

 i    p(xi)    p(w1|xi)   p(w2|xi)
 1    0.3134   0.9994     0.0006
 2    0.0638   0.9991     0.0009
 3    0.0354   0.9987     0.0013
 4    0.0329   0.9987     0.0013
 5    0.0309   0.9987     0.0013
 6    0.0176   0.9981     0.0019
 7    0.0083   0.9972     0.0028
 8    0.0030   0.0028     0.9972
 9    0.0133   0.0019     0.9981
10    0.0401   0.0013     0.9987
11    0.0484   0.0013     0.9987
12    0.0567   0.0013     0.9987
13    0.1244   0.0009     0.9991
14    0.2133   0.0006     0.9994
15    0.0000   0.9994     0.0006
16    0.0000   0.9994     0.0006
17    0.0000   0.9994     0.0006
18    0.0000   0.9994     0.0006

K = 2 identifies the four data points around the middle position between the two clusters as outliers and eliminates them with p(xi) = 0. Further details on the values of the titled distribution p(wk|xi) and the robust estimate p(xi) are listed in Table 1 for the case of K = 2. The first 14 rows correspond to the data in the two clusters, and the last 4 rows represent the four identified outliers. Despite the balanced geometric positions of the outliers, the membership of the four outliers is assigned to cluster 1 by the DA because of p(w1|xi) ≈ 1 for the four outliers. The minor difference in the numerical error may be the only cause for the DA to assign the membership of the four data points to the first cluster. This explains why minimization of the titled distribution is not robust (Dave & Krishnapuram, 1997).

More important, the RIC estimates the real risk-bound ps as the cluster number is increased from one. This also eliminates the effect of outliers. The ratio between the number of total data points and the VC-dimension h2 is small, at 18/6 = 3, so the second term of the VC-bound becomes dominant as K increases, as shown in Figure 3. The optimal cluster number is determined as "two," with a minimum ps = 4.9766, despite the fact that the minimum number of outliers of the empirical risk is achieved at the cluster number K = 4. Note also that the original outliers become valid data points as the cluster numbers are increased to K = 3 and K = 4, respectively.

Example 2. The two-dimensional data set has 292 data points, so the ratio l/h7 = 292/(3 ∗ 7) is well below the critical number 20. We should search for an optimal cluster number from two to seven clusters. Figures 4 and 5 show that a "native" noise-free three-cluster data set is clustered by the RIC algorithm with different cluster numbers. The RIC gives the correct optimal cluster number "three" because there is a minimum value of the VC-bound ps. This also coincides with the empirical risk of the minimum number of outliers at K = 3 for both cases λ = 0 and λ = 18. Note that we can always use a relatively larger λ value to eliminate more outliers between the intercluster area without affecting the optimal cluster number in a general clustering problem. The black dot points are identified outliers by the RIC in all pictures.

[Figure 4: The two-dimensional data set with 292 data points in example 2 is clustered by the RIC algorithm with λ = 0. (a) ps = 1.5635, K = 2. (b) ps = 0.6883, K = 3. (c) ps = 1.1888, K = 4. (d) ps = 1.4246, K = 5. (e) ps = 1.3208, K = 6. (f) ps = 2.4590, K = 7. The black dot points are identified outliers by the RIC in all pictures.]

[Figure 5: The two-dimensional data set with 292 data points in example 2 is clustered by the RIC algorithm with λ = 18. (a) ps = 1.8924, K = 2. (b) ps = 0.9303, K = 3. (c) ps = 1.2826, K = 4. (d) ps = 1.5124, K = 5. (e) ps = 1.3718, K = 6. (f) ps = 2.46244, K = 7. The black dot points are identified outliers by the RIC in all pictures.]

[Figure 6: The two-dimensional data set with 300 data points in example 3 clustered by the RIC algorithm with λ = 0. (a) K = 2, ps = 1.8177, η = 0.8667. (b) K = 3, ps = 1.3396, η = 0.3900. (c) K = 4, ps = 0.8486, η = 0. (d) K = 5, ps = 0.9870, η = 0. (e) K = 6, ps = 1.1374, η = 0.0033. (f) K = 7, ps = 2.169, η = 0.4467. The black dot points are identified outliers by the RIC in all pictures.]

Example 3. This is an instructive example to show the application of the RIC algorithm with λ = 0 for a data set without clear cluster boundaries in a two-dimensional space. The data set has 300 data points, such that the ratio l/h7 = 300/(3 ∗ 7) is well below the critical number 20. We shall search for an optimal cluster number from two to seven clusters. In particular, to show the difference between the empirical risk η and the VC-bound ps, we indicate both values for each case. Figure 6 illustrates that the optimal cluster number is four based on the SRM principle. It is interesting to note that the five-cluster case also achieves the minimum number of outliers in a sense of the empirical risk minimization, but its VC-bound ps is bigger than the one of the four-cluster case because of the increase in the VC-dimension.

6 Conclusion

A robust information clustering algorithm is developed based on the minimax optimization of MI. In addition to the algorithm, the theoretical contributions of this letter are twofold: (1) the capacity maximization is implicitly linked to the distortion measure against the input pattern pmf and provides an upper bound of the empirical risk to phase out outliers; (2) the optimal cluster number is estimated based on the SRM principle of statistical learning theory. The RIC can also be extended to the c-shells or kernel-based algorithms to deal with linearly nonseparable data. This is an interesting topic for further research.

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlier version of this letter.

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011-1015.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460-473.

Blahut, R. E. (1988). Principle and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270-293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158-171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinet, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98-110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89-104.

Mackay, D. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035-1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210-2239.

Scholkopf, B., Smola, A., & Muller, K. M. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434-448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 32, 440-448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483-2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek and R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.

Page 2: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2673

closely linked to each other (Dave amp Krishnapuram 1997) Not knowingthe number of clusters in a data set complicates the task of separating thegood data points from the noise points and conversely the presence of noisemakes it harder to determine the number of clusters A nonparametric den-sity estimation was used in a similarity-based fuzzy clustering algorithm(Shen amp Wu 2004) which used an M-estimator to search for a robust so-lution An interesting information-theoretic perspective approach sought aparametric study for an optimal cluster number based on the informationbottleneck method (Tishby Pereira amp Bialek 1999) and mutual informationcorrection (Still amp Bialek 2004)

We propose a robust information clustering (RIC) algorithm and claimthat any data point could become an outlier in the RIC learning procedureThis is dependent on the given data structure and chosen cluster centersOur primary target is to partition the given data set into effective clustersand determine an optimal cluster number based on the basic deterministicannealing (DA) clustering via the identification of outliers (the latter is onlya by-product of RIC see also the simulation results) The RIC is basically atwo-step minimax mutual information (MI) approach The minimization ofthe MI or precisely the rate distortion function leads to mass-constrainedDA clustering which is designed essentially to divide a given data set intoeffective clusters of the Euclidean distance (Rose 1998) As the cluster num-ber is increased from one to a predefined maximum number and the tem-perature is lowered in the annealing procedure the DA tends to exploremore and more details in the input data structure and may result in theoverfitting problem with poor generalization ability

The maximization of MI or precisely the capacity maximization leadsto the minimax optimization of the RIC algorithm based on a common con-straint the dissimilarity or distortion measure The minimax MI estimates anupper bound of the empirical risk and identifies the noisy input data points(outliers) by choosing different cluster numbers Furthermore we reinves-tigate the character of the titled distribution which can be interpreted asthe cluster membership function that forms a set of indicator functions andis linear in parameters at zero temperature This allows the RIC to calcu-late explicitly the Vapnik-Cervonenkis (VC)-dimension and determines anoptimal cluster number based on the structural risk minimization (SRM)principle with compromise between the minimum empirical risk (signal tonoise ratio) and VC-dimension (Vapnik 1998)1

Contrary to the parametric algorithms the RIC is a simple yet effectivenonparametric method to solve the intertwined robust clustering problemsthat is to estimate an optimal or at least a suboptimal cluster number

1 VC-dimension is also one of the capacity components in statistical learning theory(Vapnik 1998) This is different from the concept of capacity maximization in classicalinformation theory See section 4 for more details

2674 Q Song

via the identification of outliers (unreliable data points) The argument isthat the ultimate target of the RIC is not to look for or approximate thetrue probability distribution of the input data set (similarly we are alsonot looking for ldquotruerdquo clusters or cluster centers) which is proved to be anill-posed problem (Vapnik 1998) but to determine an optimal number ofeffective clusters by eliminating the unreliable data points (outliers) basedon the inverse theorem and MI maximization It also turns out that theoutlier is in fact a data point that fails to achieve capacity Therefore anydata point could become an outlier as the cluster number is increased in theannealing procedure Furthermore by replacing the Euclidean distance withother dissimilarity measures it is possible to extend the new algorithm intothe kernel and nonlinear clustering algorithms for linearly nonseparablepatterns (Scholkopf et al 1998 Gokcay amp Principe 2002 Song Hu amp Xie2002)

The letter is organized as follows Section 2 gives the motivation of theproposed research by reviewing a few related and well-established cluster-ing algorithms Section 3 discusses the rate distortion function which formsthe foundation of the DA clustering The capacity maximization SRM prin-ciple and the RIC algorithm are studied in section 4 In section 5 instructivesimulation results are presented to show the superiority of the RIC The con-clusion is presented in section 6

2 Motivation

Suppose that there are two random n-dimensional data sets, x_i \in X, i = 1, \ldots, l, and w_k \in W, k = 1, \ldots, K, which represent input data points (the information source in terms of communication theory) and cluster centers (prototypes), respectively. For the clustering problem, a hard dissimilarity measure can be presented in a norm space, for example, a square error in the Euclidean distance,

\min_{k,i} d(w_k, x_i),   (2.1)

where d(w_k, x_i) = \|w_k - x_i\|^2.

Note that the definition of equation 2.1 is used in DA and most model-free clustering algorithms for a general data clustering problem. It can be easily extended to kernel and other nonlinear measures to cover the linearly nonseparable data clustering problem (Scholkopf et al., 1998; Gokcay & Principe, 2002). Based on the hard distortion measure, equation 2.1, some popular clustering algorithms have been developed, including basic k-means, fuzzy c-means (FCM), and improved robust versions, the robust noise and possibilistic clustering algorithms (Krishnapuram & Keller, 1993). The optimization of possibilistic clustering, for example, can be reformulated as a minimization of the Lagrangian function

J = \sum_{i=1}^{l} \sum_{k=1}^{K} (u_{ik})^m d(w_k, x_i) + \sum_{i=1}^{l} \sum_{k=1}^{K} \eta_i (1 - u_{ik})^m,   (2.2)

where u_{ik} is the fuzzy membership with the degree parameter m, and \eta_i is a suitable positive number that controls the robustness of each cluster.

The common theme among the robust clustering algorithms is to reject or ignore a subset of the input patterns by evaluating the membership function u_{ik}, which can also be viewed as a robust weight function to phase out outliers. The robust performance of the fuzzy algorithms in equation 2.2 is explained in the sense that every pattern has a fuzzy membership in all clusters instead of a crisp membership. However, in addition to the sensitivity to the initialization of prototypes, the objective function of robust fuzzy clustering algorithms decreases monotonically as the cluster number grows, which leads to difficulties in finding an optimal cluster number K and a proper fuzziness parameter m.
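To make the role of the membership weights concrete, the following minimal sketch (NumPy; the function and variable names are mine, not from the letter) evaluates the possibilistic objective of equation 2.2 for a given set of prototypes; it is an illustration of the cost being discussed, not the author's implementation.

```python
import numpy as np

def possibilistic_objective(X, W, U, eta, m=2.0):
    """Evaluate equation 2.2: sum_ik u_ik^m ||w_k - x_i||^2 + sum_ik eta_i (1 - u_ik)^m.

    X   : (l, n) input patterns
    W   : (K, n) prototypes
    U   : (l, K) memberships u_ik in [0, 1]
    eta : (l,)   per-point robustness weights eta_i > 0
    m   : fuzziness degree parameter
    """
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances d(w_k, x_i)
    return float((U ** m * d).sum() + (eta[:, None] * (1.0 - U) ** m).sum())

# small usage example with arbitrary numbers
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
W = rng.normal(size=(2, 2))
U = rng.uniform(size=(6, 2))
print(possibilistic_objective(X, W, U, eta=np.ones(6)))
```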

Recently, maximization of the relative entropy in the continuous time domain, which is similar to maximization of the MI in the discrete time domain in equation 3.1, has been used in robust signal filtering and density estimation (Levy & Nikoukhah, 2004). This uses the following objective function as a minimax optimization problem:

J = \min_{W} \max_{\bar{f}} \; \frac{1}{2} E_{\bar{f}}\{\|X - W\|^2\} - \upsilon \left( \int_{R} \bar{f}_z(z) \ln \left( \frac{\bar{f}_z(z)}{f_z(z)} \right) dz - c \right),   (2.3)

where X and W represent the input and output data sets, f_z(z) is defined as a nominal probability density function against the true density function \bar{f}_z(z), and \upsilon is the Lagrange multiplier with the positive constraint c. The approach maximizes the relative entropy in the second term of the cost function 2.3 against uncertainty or outliers through the true density function, yielding the least favorable density estimate, while the cost function is minimized in the least-mean-square sense. The key point of this minimax approach is to search for a saddle point of the objective function as the global optimal solution.

A recent information-theoretic approach sought a parametric solution based on the information bottleneck method (Tishby, Pereira, & Bialek, 1999) for an optimal cluster number via an MI correction (Still & Bialek, 2004). The algorithm maximizes the following objective function:

\max_{p(W|X)} \; I(W; V) - T\, I(W; X),   (2.4)

where V represents the relevant information data set against the input data set X, and the joint distribution p(V, W) is assumed to be known approximately. I(W; V) and I(W; X) are mutual informations, and T is a temperature parameter (refer to the next section for details). A key point of this approach is that it implicitly presents a structural risk minimization problem (Vapnik, 1998) and uses the corrected mutual information to search for a risk bound at an optimal temperature (cluster number).

3 Rate Distortion Function and DA Clustering

The original DA clustering is an optimal algorithm in terms of insensitivity to the volume of input patterns in the respective clusters. It avoids local minima of hard clustering algorithms like k-means and splits the whole input data set into effective clusters in the annealing procedure (Rose, 1998). However, the DA algorithm is not robust against disturbances and outliers, because it tends to associate the membership of a particular pattern with all clusters under an equal probability distribution (Dave & Krishnapuram, 1997). Furthermore, the DA is an inherently empirical risk-minimization algorithm: it explores details of the input data structure without limit and needs a preselected maximum cluster number Kmax to stop the annealing procedure. To address these problems, we first investigate the original rate distortion function, which lays the theoretical foundation of the DA algorithm (Blahut, 1988; Gray, 1990).

The rate distortion function is defined as (Blahut, 1988)

R(D(p^*(X))) = \min_{p(W|X)} I(p^*(X), p(W|X)),   (3.1)

I(p^*(X), p(W|X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(w_k|x_i) p^*(x_i)},   (3.2)

with the constraint

\sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k|x_i) d(w_k, x_i) \le D(p^*(X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k|x_i) d(w_k, x_i),   (3.3)

where I(p^*(X), p(W|X)) is the mutual information,2 and p(w_k|x_i) \in p(W|X) is the tilted distribution

p(w_k|x_i) = \frac{p(w_k) \exp(-d(w_k, x_i)/T)}{N_{x_i}},   (3.4)

where the normalizing factor is

N_{x_i} = \sum_{k=1}^{K} p(w_k) \exp(-d(w_k, x_i)/T),   (3.5)

with the induced unconditional pmf

p(w_k) = \sum_{i=1}^{l} p(w_k, x_i) = \sum_{i=1}^{l} p^*(x_i) p(w_k|x_i), \quad k = 1, \ldots, K.   (3.6)

The distribution p(w_k|x_i) \in p(W|X) achieves a minimum point of the lower curve R(D(p^*(X))) in Figure 1 at a specific temperature T (Blahut, 1988), and p^*(x_i) \in p^*(X) is a fixed unconditional a priori pmf (normally an equal distribution in DA clustering; Rose, 1998).
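As a concrete illustration of equations 3.4 to 3.6, the following sketch (NumPy; variable names are mine, not from the letter) computes the tilted distribution p(w_k|x_i) and the induced cluster priors p(w_k) at a given temperature T, assuming the square-error distortion of equation 2.1.

```python
import numpy as np

def tilted_distribution(X, W, p_w, T):
    """Equations 3.4-3.5: p(w_k|x_i) = p(w_k) exp(-d(w_k, x_i)/T) / N_{x_i}."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # d(w_k, x_i), shape (l, K)
    g = p_w[None, :] * np.exp(-d / T)                        # unnormalized weights
    return g / g.sum(axis=1, keepdims=True)                  # normalize over k (the factor N_{x_i})

def induced_priors(p_x, p_w_given_x):
    """Equation 3.6: p(w_k) = sum_i p*(x_i) p(w_k|x_i)."""
    return p_x @ p_w_given_x

# usage: l = 5 points, K = 2 centers, equal a priori pmf p*(x_i) = 1/l
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0], [2.1, 2.0]])
W = np.array([[0.0, 0.0], [2.0, 2.0]])
p_x = np.full(len(X), 1.0 / len(X))
p_w = np.array([0.5, 0.5])
P = tilted_distribution(X, W, p_w, T=0.5)
print(P)
print(induced_priors(p_x, P))
```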

The rate distortion function is usually investigated in terms of a parameter s = -1/T with T \in (0, \infty). It is introduced as a Lagrange multiplier and equals the slope of the rate distortion function curve, as shown in Figure 1, in classical information theory (Blahut, 1988). T is also referred to as the temperature parameter that controls the data clustering procedure as its value is lowered from infinity to zero (Rose, 1998). Therefore, the rate distortion function can be presented as a constrained optimization problem:

R(D(p^*(X))) = \min_{p(W|X)} \left\{ I(p^*(X), p(W|X)) - s \left( \sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k|x_i) d(w_k, x_i) - D(p^*(X)) \right) \right\}.   (3.7)

2 The MI I(p^*(X), p(W|X)) has another notation, I(X; W), similar to the one used in equation 2.4. However, as pointed out by Blahut (1988), the latter may not be the best notation for the optimization problem because it suggests that MI is merely a function of the variable vectors X and W. For the same reason, we use probability distribution notation for all related functions. For example, the rate distortion function is presented as R(D(p^*(X))), which is a bit more complicated than in the original paper (Blahut, 1972). This inconvenience turns out to be worthwhile as we study the related RIC capacity problem, which is coupled closely with the rate distortion function, as shown in the next section.

Figure 1. Plots of the rate distortion function R(D(p^*(X))) (lower curve, empirical risk minimization) and the capacity C(D(p(X))) (upper curve, capacity maximization) against the average distortion, for any particular cluster number K \le K_{max}. The plots are parameterized by the temperature T; the optimal saddle point and D_{max} are marked.

One important property of R(D(p^*(X))) is that it is a decreasing, convex, and continuous function defined on the interval 0 \le D(p^*(X)) \le D_{max} for any particular cluster number 0 < K \le K_{max}, as shown in Figure 1 (Blahut, 1972).

Define the DA clustering objective function as (Rose, 1998)

F(p^*(X), p(W|X)) = I(p^*(X), p(W|X)) - s \sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k|x_i) d(w_k, x_i).   (3.8)

The rate distortion function

R(D(p^*(X))) = s D(p^*(X)) + \min_{p(W|X)} F(p^*(X), p(W|X))   (3.9)

is minimized by the tilted distribution 3.4 (Blahut, 1972).

From the data clustering point of view, equations 2.2 and 3.8 are well known to be soft dissimilarity measures of different clusters (Dave & Krishnapuram, 1997). To accommodate the DA-based RIC algorithm in a single framework of classical information theory, we use a slightly different treatment from the original paper of Rose (1998) for the DA clustering algorithm; that is, we minimize equation 3.8 with respect to the free pmf p(w_k|x_i) rather than minimizing directly against the cluster centers W. This recasts

the clustering optimization problem as one of seeking the distribution pmf and minimizing equation 3.8 subject to a specified level of randomness, which is measured by the minimization of the MI, equation 3.1.

The optimization is now to minimize the function F(p^*(X), p(W|X)), which, as a by-product of the MI minimization over the tilted distribution p(w_k|x_i), achieves a minimum distortion and leads to the mass-constrained DA clustering algorithm.

Plugging equation 3.4 into 3.8, the optimal objective function becomes the entropy functional in a compact form:3

F(p^*(X), p(W|X)) = -\sum_{i=1}^{l} p^*(x_i) \ln \sum_{k=1}^{K} p(w_k) \exp(-d(w_k, x_i)/T).   (3.10)

Minimizing equation 3.10 against the cluster center w_k, we have

\frac{\partial F(p^*(X), p(W|X))}{\partial w_k} = \sum_{i=1}^{l} p^*(x_i) p(w_k|x_i)(w_k - x_i) = 0,   (3.11)

which leads to the optimal clustering center

w_k = \sum_{i=1}^{l} \alpha_{ik} x_i,   (3.12)

where

\alpha_{ik} = \frac{p^*(x_i) p(w_k|x_i)}{p(w_k)}.   (3.13)

For any cluster number K \le K_{max} and a fixed arbitrary pmf set p^*(x_i) \in p^*(X), minimization of the clustering objective function 3.8 against the pmf set p(W|X) is monotone nonincreasing and converges to a minimum point of the convex function curve at each particular temperature. The soft distortion measure D(p^*(X)) in equation 3.3 and the MI, equation 3.1, are minimized simultaneously in the sense of empirical risk minimization.

3 F(p^*(X), p(W|X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k|x_i) \ln \left( \frac{p(w_k) \exp(-d(w_k, x_i)/T)}{p(w_k) N_{x_i}} \right) - s \sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k|x_i) d(w_k, x_i) = -\sum_{i=1}^{l} p^*(x_i) \sum_{k=1}^{K} p(w_k|x_i) \ln N_{x_i} = -\sum_{i=1}^{l} p^*(x_i) \ln \sum_{k=1}^{K} p(w_k) \exp(-d(w_k, x_i)/T) (according to equation 3.4 and \sum_{k=1}^{K} p(w_k|x_i) = 1).
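A minimal sketch of the resulting mass-constrained DA fixed-point iteration at a fixed temperature is given below (NumPy; a simplified illustration under the equal a priori pmf p*(x_i) = 1/l, not the full RIC of section 4.4): it alternates the tilted distribution of equation 3.4 with the prior update 3.6 and the center update of equations 3.12 and 3.13.

```python
import numpy as np

def da_fixed_point(X, W, p_w, T, n_iter=50):
    """One temperature step of mass-constrained DA clustering (equations 3.4, 3.6, 3.12, 3.13)."""
    l = len(X)
    p_x = np.full(l, 1.0 / l)                                    # fixed a priori pmf p*(x_i)
    for _ in range(n_iter):
        d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # square-error distortion
        g = p_w[None, :] * np.exp(-d / T)
        p_wx = g / g.sum(axis=1, keepdims=True)                  # tilted distribution, eq. 3.4
        p_w = p_x @ p_wx                                         # induced priors, eq. 3.6
        alpha = (p_x[:, None] * p_wx) / p_w[None, :]             # eq. 3.13
        W = alpha.T @ X                                          # mass centers, eq. 3.12
    return W, p_w, p_wx

# usage on a toy two-cluster set
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 5.0])
W0 = X[rng.choice(len(X), 2, replace=False)] + 1e-3
W, p_w, p_wx = da_fixed_point(X, W0, np.array([0.5, 0.5]), T=1.0)
print(W)
```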


4 Minimax Optimization and the Structural Risk Minimization

4.1 Capacity Maximization and Input Data Reliability. In the constrained minimization of MI of the last section, we obtain an optimal feedforward transition probability, the a priori pmf p(w_k|x_i) \in p(W|X). A backward transition probability, the a posteriori pmf p(x_i|w_k) \in p(X|W), can be obtained through the Bayes formula

p(x_i|w_k) = \frac{p(x_i) p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} = \frac{p(x_i) p(w_k|x_i)}{p(w_k)}.   (4.1)

The backward transition probability is useful for assessing the reliability of the input data set in classical information theory. Directly using the pmf of equation 4.1 yields an optimization problem that simply evaluates a single pmf p(x_i|w_k), which is not a good way to reject outliers (MacKay, 1999). However, we can use the capacity function of classical information theory, which is defined by maximizing an alternative presentation of the MI against the input probability distribution,

C = \max_{p(X)} I(p(X), p(W|X)),   (4.2)

with

I(p(X), p(W|X)) = I(p(X), p(X|W)) = \sum_{i=1}^{l} \sum_{k=1}^{K} p(x_i) p(w_k|x_i) \ln \frac{p(x_i|w_k)}{p(x_i)},   (4.3)

where C is a constant representing the channel capacity.

Now we are in a position to introduce the channel reliability of classical information theory (Blahut, 1988). To deal with the input data uncertainty, the MI can be presented in a simple channel entropy form,

I(p(X), p(X|W)) = H(p(X)) - H(p(W), p(X|W)),   (4.4)

where the first term represents the uncertainty of the channel input variable X,4

H(p(X)) = -\sum_{i=1}^{l} p(x_i) \ln(p(x_i)),   (4.5)

4 In nats (per symbol), since we use the natural logarithm rather than bits (per symbol) with the log2 function. Note that we use the special entropy notations H(p(X)) = H(X) and H(p(W), p(X|W)) = H(X|W) here.

and the second term is the conditional entropy

H(p(W), p(X|W)) = -\sum_{i=1}^{l} \sum_{k=1}^{K} p(w_k) p(x_i|w_k) \ln p(x_i|w_k).   (4.6)
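The following sketch (NumPy; the helper names are mine) computes the backward pmf of equation 4.1 and the entropy decomposition of equations 4.4 to 4.6 for a given input pmf p(x_i) and tilted distribution p(w_k|x_i); it is only meant to make the quantities used in lemma 1 concrete.

```python
import numpy as np

def entropy_decomposition(p_x, p_wx):
    """Return H(X), H(X|W), and the MI I = H(X) - H(X|W) (equations 4.1, 4.4-4.6), in nats."""
    p_w = p_x @ p_wx                                        # p(w_k) = sum_i p(x_i) p(w_k|x_i)
    p_xw = (p_x[:, None] * p_wx) / p_w[None, :]             # backward pmf p(x_i|w_k), eq. 4.1
    H_x = -np.sum(p_x * np.log(p_x + 1e-300))               # eq. 4.5
    H_x_given_w = -np.sum(p_w[None, :] * p_xw * np.log(p_xw + 1e-300))   # eq. 4.6
    return H_x, H_x_given_w, H_x - H_x_given_w              # eq. 4.4

# usage with a small hand-made example (l = 4 points, K = 2 clusters)
p_x = np.array([0.25, 0.25, 0.25, 0.25])
p_wx = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
print(entropy_decomposition(p_x, p_wx))
```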

Lemma 1 (inverse theorem).5 The clustering data reliability is presented in a single-symbol error p_e of the input data set, with empirical error probability

p_e = \sum_{i=1}^{l} \sum_{k \ne i}^{K} p(x_i|w_k),   (4.7)

such that if the input uncertainty H(p(X)) is greater than C, the error p_e is bounded away from zero as

p_e \ge \frac{1}{\ln l} (H(p(X)) - C - 1).   (4.8)

Proof. We first give an intuitive discussion based on Fano's inequality (see Blahut, 1988, for a formal proof).

Uncertainty about the estimated channel input can be broken into two parts: the uncertainty about whether an empirical error p_e was made in the channel and, given that an error is made, the uncertainty about the true value. The error occurs with probability p_e, so the first uncertainty is the binary entropy H(p_e) = -p_e \ln p_e - (1 - p_e) \ln(1 - p_e), and the second part can be no larger than \ln(l), which occurs only when all alternative errors are equally likely. Therefore, if the equivocation is interpreted as the information lost, we have Fano's inequality,

H(p(W), p(X|W)) \le H(p_e) + p_e \ln(l).   (4.9)

Now consider that the maximum of the MI is C in equation 4.2, so we can rewrite equation 4.4 as

H(p(W), p(X|W)) = H(p(X)) - I(p(X), p(X|W)) \ge H(p(X)) - C.   (4.10)

Then Fano's inequality is applied to get

H(p(X)) - C \le H(p_e) + p_e \ln(l) \le 1 + p_e \ln l.   (4.11)

5 There is a tighter bound on p_e compared to the one of lemma 1, as in the work of Jelinek (1968). However, this may not be very helpful, since minimizing the empirical risk does not necessarily minimize the real structural risk, as shown in section 4.3.

Lemma 1 gives an important indication that any incoming information (input data) beyond the capacity C will generate unreliable data transmission. It is also called the inverse theorem in the sense that it uses the DA-generated optimal tilted distribution to produce the backward transition probability, equation 4.1, and to assess an upper bound of the empirical risk, equation 4.10.
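The bound of equation 4.8 can be evaluated directly; the sketch below (NumPy; the function name and the example capacity value are mine) reports the lower bound on the single-symbol error p_e once an estimate of the capacity C is available.

```python
import numpy as np

def fano_error_lower_bound(p_x, capacity_C):
    """Equation 4.8: p_e >= (H(p(X)) - C - 1) / ln(l), clipped at zero."""
    l = len(p_x)
    H_x = -np.sum(p_x * np.log(p_x + 1e-300))      # input uncertainty in nats, eq. 4.5
    return max(0.0, (H_x - capacity_C - 1.0) / np.log(l))

# usage: a uniform input pmf over l = 100 points and an assumed capacity of 1.2 nats
p_x = np.full(100, 0.01)
print(fano_error_lower_bound(p_x, capacity_C=1.2))   # > 0, so the data cannot all be reliable
```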

4.2 Capacity Maximization and the Optimal Solution. Equation 3.3 is well known to be a soft dissimilarity measure minimized by the DA clustering as the temperature T is lowered toward zero (Rose, 1998). However, there is no way for the DA to search for an optimal temperature value, and in turn an optimal cluster number, because the rate distortion function provides only limited information and aims at empirical risk minimization, as shown in section 3. Therefore, we propose a capacity, or MI maximization, scheme that is implicitly dependent on the distortion measure, similar to the rate distortion function.

We define a constrained maximization of MI as6

C(D(p(X))) = \max_{p(X)} I(p(X), p(W|X)),   (4.12)

with a constraint similar to equation 3.3,

D(p(X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} p(x_i) p(w_k|x_i) d(w_k, x_i) \le D(p^*(X)).   (4.13)

This is because minimization of the soft distortion measure D(p^*(X)), equation 3.3, is the ultimate target of the DA clustering algorithm, as analyzed in section 3. We need to assess the maximum possibility of making an error (risk). According to lemma 1, the reliability of the input data set depends on the capacity, that is, the maximum value of the MI against the input density estimate. To do this, we evaluate the optimal a priori pmf, the robust density distribution pmf p(x_i) \in p(X), to replace the fixed arbitrary p^*(x_i) in the distortion measure, equation 3.3, and assess the reliability of the input data for each particular cluster number K based on the a posteriori pmf in equation 4.1. If most of the data points (if not all) achieve the capacity (fewer outliers), then we can claim that the clustering result reaches an optimal, or at least a suboptimal, solution at this particular cluster number in the sense of empirical risk minimization.

6 Here we use a notation for the capacity function similar to that of the rate distortion function R(D(p(X))) to indicate implicitly that the specific capacity function is in fact an implicit function of the distortion measure D(p(X)). For each particular temperature T, the capacity C(D(p(X))) achieves a point on the upper curve corresponding to the lower curve R(D(p^*(X))), as shown in equation 4.17.

Similar to the minimization of the rate distortion function in section 3, the constrained capacity maximization can be rewritten as an optimization problem with a Lagrange multiplier \lambda \ge 0:

C(D(p(X))) = \max_{p(X)} \left[ I(p(X), p(W|X)) + \lambda (D(p^*(X)) - D(p(X))) \right].   (4.14)

Theorem 1. The maximum of the constrained capacity C(D(p(X))) is achieved by the robust density estimate

p(x_i) = \frac{\exp\left( \sum_{k=1}^{K} p(w_k|x_i) \ln p(x_i|w_k) - \lambda p(w_k|x_i) d(w_k, x_i) \right)}{\sum_{i=1}^{l} \exp\left( \sum_{k=1}^{K} p(w_k|x_i) \ln p(x_i|w_k) - \lambda p(w_k|x_i) d(w_k, x_i) \right)},   (4.15)

with the specific distortion measure D(p(X)) = D(p^*(X)), provided p(x_i) \ge 0 for all 1 \le i \le l.

Proof. Similar to Blahut (1972), we can temporarily ignore the condition p(x_i) \ge 0 and set the derivative of the objective function 4.14 with respect to the independent variable, the a priori pmf p(x_i), equal to zero. This results in

\frac{\partial}{\partial p(x_i)} \left( C(D(p(X))) + \lambda_1 \left( \sum_{i=1}^{l} p(x_i) - 1 \right) \right) = 0 = -\ln p(x_i) - 1 + \sum_{k=1}^{K} p(w_k|x_i) \left( \ln p(x_i|w_k) - \lambda p(w_k|x_i) d(w_k, x_i) \right) + \lambda_1.   (4.16)

We then select a suitable \lambda_1 that ensures the probability constraint \sum_{i=1}^{l} p(x_i) = 1, which leads to the robust density distribution estimate, equation 4.15.

According to the Kuhn-Tucker theorem (Blahut, 1988), if there exists an optimal robust distribution p(x_i) derived from equation 4.15, then the inequality constraint, equation 4.13, on the distortion measure becomes an equality, and the optimal solution of equation 4.14 is achieved at an optimal saddle point between the curves C(D(p(X))) and R(D(p^*(X))), with the corresponding average distortion measure

D(p(X)) = D(p^*(X)).   (4.17)
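A single evaluation of the robust density estimate of equation 4.15 can be written directly from the quantities already computed in the DA step; the sketch below (NumPy; the helper name is mine) is a simplified illustration, not the full fixed-point iteration of section 4.4.

```python
import numpy as np

def robust_density_estimate(p_x, p_wx, d, lam):
    """Equation 4.15: p(x_i) proportional to exp(sum_k p(w_k|x_i) ln p(x_i|w_k) - lam p(w_k|x_i) d(w_k, x_i))."""
    p_w = p_x @ p_wx                                        # p(w_k)
    p_xw = (p_x[:, None] * p_wx) / p_w[None, :]             # backward pmf, eq. 4.1
    expo = np.sum(p_wx * np.log(p_xw + 1e-300) - lam * p_wx * d, axis=1)
    expo -= expo.max()                                      # numerical stabilization before exp
    p_new = np.exp(expo)
    return p_new / p_new.sum()

# usage: a point far from both centers (large total distortion) receives a much smaller p(x_i)
X = np.array([[0.0, 0.0], [0.1, 0.1], [2.0, 2.0], [2.1, 1.9], [1.0, 1.0]])
W = np.array([[0.0, 0.0], [2.0, 2.0]])
d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
p_x = np.full(len(X), 1.0 / len(X))
g = 0.5 * np.exp(-d / 0.5)
p_wx = g / g.sum(axis=1, keepdims=True)
print(robust_density_estimate(p_x, p_wx, d, lam=1.0))
```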

By dividing the input data into effective clusters, the DA clustering minimizes the relative Shannon entropy without a priori knowledge of the data distribution (Gray, 1990).

Figure 2. The tilted distribution and robust density estimation based on the inverse theorem for a two-cluster data set. The sketch marks the two centers w_1 and w_2, the tilted distributions p(w_1|x_i) and p(w_2|x_i), the backward pmfs p(x_i|w_1) and p(x_i|w_2), and the condition under which a point is rejected with p(x_i) = 0.

The prototype (cluster center), equation 3.12, is clearly presented as a mass center. This is insensitive to the initialization of cluster centers and volumes, with a fixed probability distribution, for example, an equal value p^*(x_i) = 1/l for all input data points (Rose, 1998). Therefore, the prototype parameter \alpha_{ik} depends on the tilted distribution p(w_k|x_i), equation 3.4, which tends to associate the membership of any particular pattern with all clusters and is not robust against outliers or disturbances of the training data (Dave & Krishnapuram, 1997). This in turn generates difficulties in determining an optimal cluster number, as shown in Figure 2 (see also the simulation results): any data point located around the middle position between two effective clusters could be considered an outlier.

Corollary 1. The capacity curve C(D(p(X))) is continuous, nondecreasing, and concave in D(p(X)) for any particular cluster number K.

Proof. Let p'(x_i) \in p'(X) and p''(x_i) \in p''(X) achieve [D(p'(X)), C(D(p'(X)))] and [D(p''(X)), C(D(p''(X)))], respectively, and let p(x_i) = \lambda' p'(x_i) + \lambda'' p''(x_i) be an optimal density estimate as in theorem 1, where \lambda'' = 1 - \lambda' and 0 < \lambda' < 1. Then

D(p(X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} (\lambda' p'(x_i) + \lambda'' p''(x_i)) p(w_k|x_i) d(w_k, x_i) = \lambda' D(p'(X)) + \lambda'' D(p''(X)),   (4.18)

and because p(X) is the optimal value, we have

C(D(p(X))) \ge I(p(X), p(W|X)).   (4.19)

Now we use the fact that I(p(X), p(W|X)) is concave (upward convex) in p(X) (Jelinek, 1968; Blahut, 1988) and arrive at

C(D(p(X))) \ge \lambda' I(p'(X), p(W|X)) + \lambda'' I(p''(X), p(W|X)).   (4.20)

We finally have

C(\lambda' D(p'(X)) + \lambda'' D(p''(X))) \ge \lambda' C(D(p'(X))) + \lambda'' C(D(p''(X))).   (4.21)

Furthermore, because C(D(p(X))) is concave on [0, D_{max}], it is continuous, nonnegative, and nondecreasing, achieving its maximum value at D_{max}; it must also be strictly increasing for D(p(X)) smaller than D_{max}.

Corollary 2. The robust distribution estimate p(X) achieves the capacity at

\sum_{k=1}^{K} p(w_k|x_i) \left( \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) = V, \quad \forall p(x_i) \ne 0,   (4.22)

\sum_{k=1}^{K} p(w_k|x_i) \left( \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) < V, \quad \forall p(x_i) = 0.   (4.23)

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik, 1998),

p(x_i) \left[ V - \left( \sum_{k=1}^{K} p(w_k|x_i) \left( \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) \right) \right] = 0, \quad \forall i.   (4.24)

Proof. Similar to the proof of theorem 1, we use the concavity of C(D(p(X))):

\frac{\partial}{\partial p(x_i)} \left( C(D(p(X))) + \lambda_1 \left( \sum_{i=1}^{l} p(x_i) - 1 \right) \right) \ge 0,   (4.25)

which can be rewritten as

\sum_{k=1}^{K} p(w_k|x_i) \left( \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) \le -\lambda_1 + 1, \quad \forall i,   (4.26)

with equality for all p(x_i) \ne 0. Setting -\lambda_1 + 1 = V completes the proof.

Similarly, it is easy to show that if we choose \lambda = 0, the Kuhn-Tucker condition becomes

p(x_i) \left[ C - \left( \sum_{k=1}^{K} p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} \right) \right] = 0, \quad \forall i,   (4.27)

where C is the maximum capacity value defined in equation 4.2.

Note that the MI is nonnegative; however, individual terms in the sum of the capacity maximization, equation 4.2, can be negative. If the ith pattern x_i is taken into account and p(w_k|x_i) < \sum_{i=1}^{l} p(x_i) p(w_k|x_i), then the probability of the kth code vector (cluster center) is decreased by the observed pattern, which gives negative information about pattern x_i. This particular input pattern may be considered an unreliable pattern (outlier), and its negative effect must be offset by other input patterns. Therefore, the maximization of the MI, equation 4.2, provides a robust density estimate for noisy patterns (outliers) in the sense that the average information is taken over all clusters and input patterns. The robust density estimation and optimization is now to maximize the MI against the pmfs p(x_i) and p(x_i|w_k); for any value of i, if p(x_i|w_k) = 0, then p(x_i) should be set equal to zero in order to attain the maximum, so that the corresponding training pattern (outlier) x_i can be deleted and dropped from further consideration in the optimization procedure, as shown in Figure 2.

As a by-product, the robust density estimation leads to an improved criterion for calculating the critical temperature at which the RIC splits the input data set into more clusters, compared to the DA, as the temperature is lowered (Rose, 1998). The critical temperature of the RIC can be determined by the maximum eigenvalue of the covariance matrix (Rose, 1998)

V_{XW} = \sum_{i=1}^{l} p(x_i|w_k)(x_i - w_k)(x_i - w_k)^T,   (4.28)

where p(x_i|w_k) is optimized via equation 4.1. This pmf takes larger values for reliable data points, since the channel communication error p_e is relatively smaller than for outliers (see lemma 1).
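As a sketch of how the phase-transition check based on equation 4.28 can be carried out for one cluster k (NumPy; the function and variable names are mine), the critical temperature is read off the largest eigenvalue of the posterior-weighted covariance:

```python
import numpy as np

def critical_temperature(X, w_k, p_x_given_wk):
    """Critical temperature 2 * E_max(V_XW), with V_XW = sum_i p(x_i|w_k)(x_i - w_k)(x_i - w_k)^T (eq. 4.28)."""
    diff = X - w_k                                              # (l, n)
    V = (p_x_given_wk[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    return 2.0 * np.linalg.eigvalsh(V).max()

# usage: an elongated cluster has a larger critical temperature along its long axis
rng = np.random.default_rng(2)
X = np.column_stack([np.linspace(-3, 3, 50), 0.1 * rng.normal(size=50)])
w_k = X.mean(axis=0)
p = np.full(len(X), 1.0 / len(X))        # stand-in for the normalized posterior p(x_i|w_k)
print(critical_temperature(X, w_k, p))
```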

4.3 Structural Risk Minimization and Optimal Cluster Number. To solve the intertwined outlier and cluster number problem, some intuitive notions can be obtained from classical information theory as presented in the previous sections. Increasing K and the model complexity (as the temperature is lowered) may reduce the capacity C(D(p(X))), since it is a nondecreasing function of D(p(X)), as shown in corollary 1 (see also Figure 1). Therefore, in view of theorem 1, we should use the smallest cluster number for which a relatively small number of outliers (if not zero), say 1 percent of the input data points, is achieved. However, how to trade off empirical risk minimization against capacity maximization is a difficult problem for classical information theory.

We can solve this difficulty by bridging the gap between classical information theory, on which the RIC algorithm is based, and the relatively new statistical learning theory with the so-called structural risk minimization (SRM) principle (Vapnik, 1998). Under the SRM, a set of admissible structures with nested subsets can be defined specifically for the RIC clustering problem as

S_1 \subset S_2 \subset \cdots \subset S_K,   (4.29)

where S_K = \{Q_K(x_i, W)\}, \forall i, with the corresponding set of indicator functions of the empirical risk7

Q_K(x_i, W) = \sum_{k=1}^{K} \lim_{T \to 0} p(w_k|x_i) = \sum_{k=1}^{K} \lim_{T \to 0} \frac{p(w_k) \exp(-d(x_i, w_k)/T)}{N_{x_i}}, \quad \forall i.   (4.30)

We shall show that the tilted distribution p(w_k|x_i), equation 3.4, at zero temperature, as in equation 4.30, can be approximated by the complement of a step function. This is linear in parameters and assigns the cluster membership of each input data point based on the Euclidean distance between data point x_i and cluster center w_k for a final hard clustering partition (Rose, 1998; see also the algorithm in section 4.4).

The tilted distribution at T \to 0 can be presented as

\lim_{T \to 0} \frac{p(w_k) \exp(-d(x_i, w_k)/T)}{\sum_{k=1}^{K} p(w_k) \exp(-d(x_i, w_k)/T)} \approx \begin{cases} \dfrac{p(w_k) \exp(-d_0(x_i, w_k))}{p(w_k) \exp(-d_0(x_i, w_k))} = 1, & \text{if } d_0(x_i, w_k) \ne \infty, \\[2mm] \dfrac{p(w_k) \exp(-d_0(x_i, w_k))}{\sum_{k=1}^{K} p(w_k) \exp(-d_0(x_i, w_k))} = 0, & \text{if } d_0(x_i, w_k) \to \infty. \end{cases}   (4.31)

7 According to the definition of the tilted distribution, equation 3.4, it is easy to see that the defined indicator function is a constant number, that is, Q_K(x_i, W) = 1. See also note 3.

Now consider the radius d_0(x_i, w_k) between data point x_i and cluster k at zero temperature. This can be rewritten as an inner product of two n-dimensional vectors of the input space:

d_0(x_i, w_k) = \lim_{T \to 0} \frac{d(x_i, w_k)}{T} = \lim_{T \to 0} \frac{\langle x_i - w_k, x_i - w_k \rangle}{T} = \sum_{o=1}^{n} r_{ko} \phi_{ko}(X),   (4.32)

where r_{ko} represents the radius parameter component in the n-dimensional space and \phi_{ko}(X) is a linearly independent function, similar to the hyperplane case (Vapnik, 1998).

Using equations 4.32 and 4.31, we can rewrite 4.30 as

Q_K(x_i, W) = \sum_{k=1}^{K} \bar{\theta}\left( \sum_{o=1}^{n} r_{ko} \phi_{ko}(X) - d_0(x_i, w_k) \right), \quad \forall i,   (4.33)

where \bar{\theta}(\cdot) = 1 - \theta(\cdot) is the complement of the step function \theta(\cdot).

Note that there is one and only one d_0(x_i, w_k) \ne \infty, \forall (1 \le k \le K), in each conditional equality of equation 4.31, since it gives a unique cluster membership to any data point x_i in a nested structure S_K. Therefore, the indicator Q_K(x_i, W) is linear in parameters. According to Vapnik (1998), the VC-dimension of the complexity control parameter is equal to the number of parameters, h_K = (n + 1) K, for each nested subset S_K. By design of the DA clustering, the nested structure in equation 4.29 provides an ordering of the VC-dimension, h_1 \le h_2 \le \cdots \le h_K, such that the increase of the cluster number is proportional to the increase of the estimated VC-dimension from a neural network point of view (Vapnik, 1998).

To obtain good generalization performance, one has to use the admissible structure, equation 4.29, based on the set of indicator functions to search for an optimal cluster number K. This minimizes a VC-bound p_s similar to that of the support vector machine, except that we are looking for the strongest data points of the input space instead of seeking the weakest data points of the feature (kernel) space (Vapnik, 1998). So we have

p_s \le \eta + \frac{\varepsilon}{2} \left( 1 + \left( 1 + \frac{4\eta}{\varepsilon} \right)^{1/2} \right),   (4.34)

with

\eta = \frac{m}{l},   (4.35)

\varepsilon = 4 \, \frac{h_K \left( \ln \frac{2l}{h_K} + 1 \right) - \ln \frac{\zeta}{4}}{l},   (4.36)

where m is the number of outliers identified in the capacity maximization as in the previous section, and \zeta < 1 is a constant.

The signal-to-noise ratio \eta in equation 4.35 appears as the first term on the right-hand side of the VC-bound, equation 4.34; it represents the empirical risk, and the second term is the confidence interval of the SRM-based estimate.
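The bound of equations 4.34 to 4.36 is easy to evaluate once the number of identified outliers m is known for each candidate K; the sketch below (plain Python/NumPy; the function name and the example outlier counts are hypothetical) computes p_s and picks the cluster number with the smallest bound, which is how the stop criterion in the following discussion is applied.

```python
import numpy as np

def vc_bound(m, l, n, K, zeta=0.1):
    """Equations 4.34-4.36 with h_K = (n + 1) * K; eta = m / l is the outlier (signal-to-noise) ratio."""
    h_K = (n + 1) * K
    eta = m / l
    eps = 4.0 * (h_K * (np.log(2.0 * l / h_K) + 1.0) - np.log(zeta / 4.0)) / l
    return eta + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * eta / eps))

# usage: hypothetical outlier counts m_K for K = 2..7 on a data set with l = 292, n = 2
outliers = {2: 40, 3: 2, 4: 5, 5: 6, 6: 4, 7: 30}
bounds = {K: vc_bound(m, l=292, n=2, K=K) for K, m in outliers.items()}
best_K = min(bounds, key=bounds.get)
print(bounds, "-> optimal K =", best_K)
```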

Discussion

Stop criterion and optimal cluster number. At the initial DA clustering stage, with a small cluster number K and a relatively large ratio between the number of input data points and the VC-dimension, say l/h_K > 20 (Vapnik, 1998), the real risk VC-bound, equation 4.34, is mainly determined by the first term on the right-hand side of the inequality, that is, the empirical risk (signal-to-noise) ratio \eta in equation 4.35. As the temperature is lowered and the cluster number is increased, a relatively small ratio l/h_K may require both terms on the right-hand side of equation 4.34 to be small simultaneously. Therefore, we can first assess the ratio l/h_K, which should be near the upper bound of the critical number 20, for a maximum cluster number K = K_{max}, beyond which the second term of the VC-bound, equation 4.34, may become dominant even for a small empirical risk ratio \eta, especially in a high-dimensional data space. We can then follow the minimax MI optimization, as in sections 3 and 4, to increase the cluster number from one up to K_{max} and pick the minimum value of the VC-bound, that is, take a trade-off between minimization of the empirical risk and the VC-dimension.

Selection of λ. The degree of robustness of the RIC algorithm is controlled by the parameter λ. The Kuhn-Tucker condition in corollary 2 tells us that a relatively larger value of λ yields more outliers (noisy patterns). If one chooses λ = 0, the RIC allows the maximum empirical risk, with a possible overcapacity distortion beyond the optimal saddle point and a minimum number of estimated outliers (see Figure 1). In a general clustering problem using the L2 distortion measure, equation 2.2, the selection of λ is insensitive to the determination of an optimal cluster number, because the VC-bound depends only on the relative values of η and h_K over different cluster numbers (see also example 2).

As a general rule of thumb, if eliminating more outliers is of interest, we can gradually increase λ and redo the capacity maximization to reject outliers located between intercluster boundaries, at the optimal cluster number determined with an arbitrary value of λ.
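A sketch of the rule of thumb above is given below (Python/NumPy; all names, the single-update simplification of equation 4.15, and the crude outlier test are my own assumptions, not the author's procedure): at a fixed cluster configuration, λ is increased gradually and the capacity maximization step is redone, and the number of rejected border points is reported for each λ.

```python
import numpy as np

def reject_outliers(X, W, T, lam, tol=1e-12):
    """One normalized update of eq. 4.15 at a given lambda; flag points whose p(x_i) collapses toward zero."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    g = np.exp(-d / T)
    p_wx = g / g.sum(axis=1, keepdims=True)                              # tilted distribution, eq. 3.4
    p_x = np.full(len(X), 1.0 / len(X))
    p_w = p_x @ p_wx
    p_xw = (p_x[:, None] * p_wx) / p_w[None, :]                          # backward pmf, eq. 4.1
    expo = np.sum(p_wx * np.log(p_xw + 1e-300) - lam * p_wx * d, axis=1)
    p_x = np.exp(expo - expo.max()); p_x /= p_x.sum()                    # eq. 4.15
    return p_x < tol * p_x.max()                                         # crude outlier flag (assumption)

# gradually increase lambda and report how many border points are rejected
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(30, 2)), rng.normal(size=(30, 2)) + 6.0, [[3.0, 3.0], [3.2, 2.8]]])
W = np.array([[0.0, 0.0], [6.0, 6.0]])
for lam in [0.0, 0.5, 1.0, 2.0]:
    print("lambda =", lam, "rejected points =", int(reject_outliers(X, W, T=1.0, lam=lam).sum()))
```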

4.4 Implementation of the RIC Algorithm

Phase I (Minimization)

1. Determine the ratio l/(n * K) that is near the critical number 20 for a maximum cluster number K = K_{max}, and set p(x_i) = 1/l for i = 1 to l.

2. Initialize T > 2 E_{max}(V_x), where E_{max} is the largest eigenvalue of the variance matrix V_x of the input pattern set X; set K = 1 and p(w_1) = 1.

3. For k = 1, \ldots, K, perform the fixed-point iteration of the DA clustering according to equations 3.4, 4.15, and 3.12.

4. Convergence test: if not satisfied, go to 3.

5. If T \le T_{min}, perform the last iteration and stop.

6. Cooling step: T \leftarrow \alpha T (\alpha < 1).

7. If K < K_{max}, check the condition for a phase transition for k = 1, \ldots, K. If a critical temperature T = 2 E_{max}(V_{XW}), where E_{max}(V_{XW}) is the largest eigenvalue of the covariance matrix V_{XW} in equation 4.28 between the input patterns and the code vector (Rose, 1998), is reached for the clustering, add a new center w_{K+1} = w_K + \delta with p(w_{K+1}) = p(w_K)/2, p(w_K) \leftarrow p(w_K)/2, and update K \leftarrow K + 1.

Phase II (Maximization)

8. If this is the first time the robust density estimation is calculated, select p(x_i) = 1/l, \infty > \lambda \ge 0, and \epsilon > 0, and start the fixed-point iteration of the robust density estimation in steps 9 to 10.

9. Compute

c_i = \exp\left[ \sum_{k=1}^{K} \left( p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) \right].   (4.37)

10. If

\left| \ln \sum_{i=1}^{l} p(x_i) c_i - \ln \max_{i=1,\ldots,l} c_i \right| < \epsilon   (4.38)

is satisfied, go to step 11; otherwise, update the density estimate

p(x_i) = \frac{p(x_i) c_i}{\sum_{i=1}^{l} p(x_i) c_i}   (4.39)

and return to step 9.

11. Verify the robust solutions of the RIC algorithm around the optimal saddle point for a minimum value of the VC-bound, equation 4.34, within the range of the maximum cluster number K_{max}. If the minimum is found, delete the outliers and set T \to 0 in the tilted distribution to obtain the cluster membership of all input data points for a hard clustering solution; recalculate the cluster centers using equation 3.12 without the outliers, then stop. Otherwise, go to 3.
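The Phase II fixed point (steps 8 to 10) can be written compactly; the following sketch (NumPy; a simplified stand-alone illustration of equations 4.37 to 4.39 under my own reading of the convergence test, not the author's Matlab program) iterates the density update until the capacity-style stopping test is met.

```python
import numpy as np

def phase_two(p_wx, d, lam=0.0, eps=1e-6, max_iter=500):
    """Fixed-point iteration of the robust density estimation (equations 4.37-4.39)."""
    l = p_wx.shape[0]
    p_x = np.full(l, 1.0 / l)                                   # step 8: start from the uniform pmf
    for _ in range(max_iter):
        p_w = p_x @ p_wx
        c = np.exp(np.sum(p_wx * np.log(p_wx / p_w[None, :] + 1e-300) - lam * p_wx * d, axis=1))  # eq. 4.37
        if np.log(c.max()) - np.log(p_x @ c) < eps:             # stopping test in the spirit of eq. 4.38
            break
        p_x = p_x * c / (p_x @ c)                               # eq. 4.39
    return p_x

# usage with the tilted distribution of a toy two-cluster configuration
X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9], [2.0, 2.0]])
W = np.array([[0.0, 0.0], [4.0, 4.0]])
d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
g = 0.5 * np.exp(-d / 1.0)
p_wx = g / g.sum(axis=1, keepdims=True)
print(phase_two(p_wx, d, lam=0.5))
```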

5 Simulation Results

This section presents a few simulation examples to show the superiority of the RIC over the standard DA clustering algorithm. This is in fact a self-comparison, since the RIC is just an extension of the DA that identifies outliers for an optimal cluster number. A comparison could also be made with the popular fuzzy c-means (FCM) and the robust versions of the FCM clustering algorithms (see section 2). However, this may not make much sense, since the FCM needs a predetermined cluster number in addition to having the initialization problem (Krishnapuram & Keller, 1993).

Example 1, which follows, presents a numerical analysis to reveal details of the weakness of the tilted distribution. It also explains how the robust density estimate of the RIC algorithm finds an optimal cluster number via the identification of outliers. Example 2 illustrates that one can always choose a relatively larger control parameter λ to eliminate more outliers in the intercluster area without affecting the estimated optimal cluster number. Example 3 shows an interesting partition of a specific data set without clear cluster boundaries. In particular, we show that any data point could become an outlier, depending on the given data structure and the chosen cluster centers in the annealing procedure, based on the limited number of input data points for a minimum VC-bound. Similarly, we are not looking for "true" clusters or cluster centers but for effective clusters, in the sense of the SRM, based on the simple Euclidean distance.8

Example 1. Figure 3 is an extended example used for the robust FCM clustering algorithm (Krishnapuram & Keller, 1993), which has two well-separated clusters with seven data points each and four outliers sitting around the middle position between the two given clusters. The data set has 18 data points, so the ratio l/h_1 = 18/(3 * 1) is already smaller than the critical number 20, and an optimal cluster number should be the minimum, two (note that DA does not work for one cluster). However, we would like to use this example to reveal the weakness of the tilted distribution and how the robust density estimate helps.

8 We set \zeta = 0.1 in the VC-bound for all the simulation results. The Matlab program can be downloaded from the author's Internet address: http://www.ntu.edu.sg/home/eqsong.

Figure 3. The clustering results of RIC (λ = 0) in example 1: (a) the original data set; (b) K = 2, p_s = 4.9766; (c) K = 3, p_s = 5.7029; (d) K = 4, p_s = 6.4161. The larger * represents the estimated cluster center of the RIC after eliminating the estimated outliers. The black dots are the outliers identified by the RIC in b, c, and d.

Table 1: Optimal Tilted Distribution p(w_k|x_i) and Robust Density Estimate p(x_i) in Example 1 with K = 2.

i    p(x_i)   p(w_1|x_i)   p(w_2|x_i)
1    0.3134   0.9994       0.0006
2    0.0638   0.9991       0.0009
3    0.0354   0.9987       0.0013
4    0.0329   0.9987       0.0013
5    0.0309   0.9987       0.0013
6    0.0176   0.9981       0.0019
7    0.0083   0.9972       0.0028
8    0.0030   0.0028       0.9972
9    0.0133   0.0019       0.9981
10   0.0401   0.0013       0.9987
11   0.0484   0.0013       0.9987
12   0.0567   0.0013       0.9987
13   0.1244   0.0009       0.9991
14   0.2133   0.0006       0.9994
15   0.0000   0.9994       0.0006
16   0.0000   0.9994       0.0006
17   0.0000   0.9994       0.0006
18   0.0000   0.9994       0.0006

Figure 3 also shows that the RIC algorithm with K = 2 identifies the four data points around the middle position between the two clusters as outliers and eliminates them with p(x_i) = 0. Further details of the values of the tilted distribution p(w_k|x_i) and the robust estimate p(x_i) are listed in Table 1 for the case K = 2. The first 14 rows correspond to the data in the two clusters, and the last 4 rows represent the four identified outliers. Despite the balanced geometric positions of the outliers, the membership of the four outliers is assigned to cluster 1 by the DA, because p(w_1|x_i) \approx 1 for the four outliers; a minor numerical difference may be the only cause for the DA to assign the membership of these four data points to the first cluster. This explains why minimization with the tilted distribution is not robust (Dave & Krishnapuram, 1997).

More important, the RIC estimates the real risk bound p_s as the cluster number is increased from one, which also eliminates the effect of the outliers. The ratio between the number of total data points and the VC-dimension h_2 is small, 18/6 = 3, so the second term of the VC-bound becomes dominant as K increases, as shown in Figure 3. The optimal cluster number is determined as two, with a minimum p_s = 4.9766, despite the fact that the minimum number of outliers (the empirical risk) is achieved at the cluster number K = 4. Note also that the original outliers become valid data points as the cluster number is increased to K = 3 and K = 4, respectively.

Example 2. The two-dimensional data set has 292 data points, so the ratio l/h_7 = 292/(3 * 7) is well below the critical number 20. We therefore search for an optimal cluster number from two to seven clusters.

Figure 4. The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 0: (a) p_s = 1.5635, K = 2; (b) p_s = 0.6883, K = 3; (c) p_s = 1.1888, K = 4; (d) p_s = 1.4246, K = 5; (e) p_s = 1.3208, K = 6; (f) p_s = 2.4590, K = 7. The black dots are the outliers identified by the RIC in all panels.

Figure 5. The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 18: (a) p_s = 1.8924, K = 2; (b) p_s = 0.9303, K = 3; (c) p_s = 1.2826, K = 4; (d) p_s = 1.5124, K = 5; (e) p_s = 1.3718, K = 6; (f) p_s = 2.46244, K = 7. The black dots are the outliers identified by the RIC in all panels.

Figure 6. The two-dimensional data set with 300 data points in example 3, clustered by the RIC algorithm with λ = 0: (a) K = 2, p_s = 1.8177, η = 0.8667; (b) K = 3, p_s = 1.3396, η = 0.3900; (c) K = 4, p_s = 0.8486, η = 0; (d) K = 5, p_s = 0.9870, η = 0; (e) K = 6, p_s = 1.1374, η = 0.0033; (f) K = 7, p_s = 2.169, η = 0.4467. The black dots are the outliers identified by the RIC in all panels.

Figures 4 and 5 show that a "native," noise-free, three-cluster data set is clustered by the RIC algorithm with different cluster numbers. The RIC gives the correct optimal cluster number, three, because there is a minimum value of the VC-bound p_s there; this also coincides with the empirical risk (the minimum number of outliers) at K = 3 for both cases, λ = 0 and λ = 18. Note that we can always use a relatively larger λ value to eliminate more outliers in the intercluster area without affecting the optimal cluster number in a general clustering problem. The black dots are the outliers identified by the RIC in all panels.

Example 3. This is an instructive example showing the application of the RIC algorithm with λ = 0 to a data set without clear cluster boundaries in a two-dimensional space. The data set has 300 data points, so the ratio l/h_7 = 300/(3 * 7) is well below the critical number 20, and we search for an optimal cluster number from two to seven clusters. In particular, to show the difference between the empirical risk η and the VC-bound p_s, we indicate both values for each case. Figure 6 illustrates that the optimal cluster number is four, based on the SRM principle. It is interesting to note that the five-cluster case also achieves the minimum number of outliers in the sense of empirical risk minimization, but its VC-bound p_s is bigger than that of the four-cluster case because of the increase in the VC-dimension.

6 Conclusion

A robust information clustering algorithm has been developed based on the minimax optimization of MI. In addition to the algorithm, the theoretical contributions of this letter are twofold: (1) the capacity maximization is implicitly linked to the distortion measure against the input pattern pmf and provides an upper bound of the empirical risk to phase out outliers; (2) the optimal cluster number is estimated based on the SRM principle of statistical learning theory. The RIC can also be extended to c-shells or kernel-based algorithms to deal with linearly nonseparable data, which is an interesting topic for further research.

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlier version of this letter.

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011-1015.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460-473.

Blahut, R. E. (1988). Principles and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270-293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158-171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinek, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98-110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89-104.

MacKay, D. J. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035-1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210-2239.

Scholkopf, B., Smola, A., & Muller, K. M. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434-448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 32, 440-448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483-2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.

Page 3: A Robust Information Clustering Algorithm

2674 Q Song

via the identification of outliers (unreliable data points) The argument isthat the ultimate target of the RIC is not to look for or approximate thetrue probability distribution of the input data set (similarly we are alsonot looking for ldquotruerdquo clusters or cluster centers) which is proved to be anill-posed problem (Vapnik 1998) but to determine an optimal number ofeffective clusters by eliminating the unreliable data points (outliers) basedon the inverse theorem and MI maximization It also turns out that theoutlier is in fact a data point that fails to achieve capacity Therefore anydata point could become an outlier as the cluster number is increased in theannealing procedure Furthermore by replacing the Euclidean distance withother dissimilarity measures it is possible to extend the new algorithm intothe kernel and nonlinear clustering algorithms for linearly nonseparablepatterns (Scholkopf et al 1998 Gokcay amp Principe 2002 Song Hu amp Xie2002)

The letter is organized as follows Section 2 gives the motivation of theproposed research by reviewing a few related and well-established cluster-ing algorithms Section 3 discusses the rate distortion function which formsthe foundation of the DA clustering The capacity maximization SRM prin-ciple and the RIC algorithm are studied in section 4 In section 5 instructivesimulation results are presented to show the superiority of the RIC The con-clusion is presented in section 6

2 Motivation

Suppose that there are two random n-dimensional data sets xi isin X i =1 l and wk isin W k = 1 K which represent input data points (in-formation source in term of communication theory) and cluster centers (pro-totypes) respectively For the clustering problem a hard dissimilarity mea-sure can be presented in a norm space for example a square error in theEuclidean distance

minki

d(wk xi ) (21)

where d(wk xi ) = wk minus xi2Note that the definition of equation 21 is used in DA and most model-

free clustering algorithms for a general data clustering problem Thiscan be easily extended into kernel and other nonlinear-based measuresto cover the linearly nonseparable data clustering problem (Scholkopfet al 1998 Gokcay amp Principe 2002) Based on the hard distortion mea-sure equation 21 some popular clustering algorithms have been de-veloped including basic k-means fuzzy c-means (FCM) and improvedrobust versionsmdashthe robust noise and possibilistic clustering algorithms(Krishnapuram amp Keller 1993) The optimization of possibilistic clustering

A Robust Information Clustering Algorithm 2675

for example can be reformulated as a minimization of the Lagrangianfunction

J =lsum

i=1

Ksumk=1

(uik)md(wk xi ) +lsum

i=1

Ksumk=1

ηi (1 minus uik)m (22)

where uik is the fuzzy membership with the degree parameter m and ηi isa suitable positive number to control the robustness of each cluster

The common theme among the robust clustering algorithms is to reject orignore a subset of the input patterns by evaluating the membership functionui j which can also be viewed as a robust weight function to phase outoutliers The robust performance of the fuzzy algorithms in equation 22is explained in a sense that it involves fuzzy membership of every patternto all clusters instead of crisp membership However in addition to thesensitivity to the initialization of prototypes the objective function of robustfuzzy clustering algorithms is treated as a monotonic decreasing functionwhich leads to difficulties finding an optimal cluster number K and a properfuzziness parameter m

Recently maximization of the relative entropy of the continuous timedomain which is similar to maximization of the MI of the discrete timedomain in equation 31 has been used in robust signal filtering and densityestimation (Levy amp Nikoukhah 2004) This uses the following objectivefunction as a minimax optimization problem

J = minW

maxf

12

E f X minus W 2 minusυ

(intR

ln

(f z(z)fz(z)

)dz minus c

) (23)

where X and W represent the input and output data sets fz(z) is definedas a nominal probability density function against the true density functionf z(z) and υ is the Lagrange multiplier with the positive constraint c Itmaximizes the relative entropy of the second term of the cost function 23against uncertainty or outlier through the true density function for the leastfavorable density estimation The cost function is minimized in a sense ofthe least mean square The key point of this minimax approach is to searchfor a saddle point of the objective function for a global optimal solution

A recent information-theoretic perspective approach sought a parametricstudy based on the information bottleneck method for an optimal clusternumber via MI correction (Still amp Bialek 2004) The algorithm maximizesthe following objective function

maxp(W|X)

I (W V) minus T I (W X) (24)

2676 Q Song

where V represents the relevant information data set against the input dataset X and assumes that the joint distribution p(V W) is known approxi-mately I (W V) and I (W X) are the mutual information T is a temperatureparameter (refer to the next section for details) A key point of this approachis that it presents implicitly a structural risk minimization problem (Vapnik1998) and uses the corrected mutual information to search for a risk boundat an optimal temperature (cluster number)

3 Rate Distortion Function and DA Clustering

The original DA clustering is an optimal algorithm in term of insensitivityto the volume of input patterns in the respective cluster This avoids localminima of the hard clustering algorithm like k-means and splits the wholeinput data set into effective clusters in the annealing procedure (Rose 1998)However the DA algorithm is not robust against disturbance and outliersbecause it tends to associate the membership of a particular pattern in allclusters with equal probability distribution (Dave amp Krishnapuram 1997)Furthermore the DA is an inherent empirical risk-minimization algorithmThis explores details of the input data structure without a limit and needs apreselected maximum cluster number Kmax to stop the annealing procedureTo solve these problems we first investigate the original rate distortionfunction which lays the theoretical foundation of the DA algorithm (Blahut1988 Gray 1990)

The definition of the rate distortion function is defined as (Blahut 1988)

R(D(plowast(X))) = minp(W|X)

I (plowast(X) p(W|X)) (31)

I (plowast(X) p(W|X))

= p(W|X)

[lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi ) lnp(wk |xi )suml

i=1 p(wk |xi )plowast(xi )

] (32)

with the constraint

lsumi=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) le D(plowast(X))

=lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) (33)

A Robust Information Clustering Algorithm 2677

where I (plowast(X) p(W|X)) is the mutual information2 p(wk |xi ) isin p(W|X) is thetitled distribution

p(wk |xi ) = p(wk) exp(minusd(wk xi )T)Nxi

(34)

where the normalized factor is

Nxi =Ksum

k=1

p(wk) exp(minusd(wk xi )T) (35)

with the induced unconditional pmf

p(wk) =lsum

i=1

p(wk xi ) =lsum

i=1

plowast(xi )p(wk |xi ) k = 1 K (36)

p(wk |xi ) isin p(W|X) achieves a minimum point of the lower curveR(D(plowast(X))) in Figure 1 at a specific temperature T (Blahut 1988) plowast(xi ) isinplowast(X) is a fixed unconditional a priori pmf (normally as an equal distributionin DA clustering Rose 1998)

The rate distortion function is usually investigated in term of a parameters = minus1T with T isin (0 infin) This is introduced as a Lagrange multiplier andequals the slope of the rate distortion function curve as shown in Figure 1in classical information theory (Blahut 1988) T is also referred as the tem-perature parameter to control the data clustering procedure as its value islowered from infinity to zero (Rose 1998) Therefore the rate distortionfunction can be presented as a constraint optimization problem

R(D(plowast(X))) = minp(W|X)

I (plowast(X) p(W|X))

minus s

(lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) minus D(plowast(X))

) (37)

One important property of R(D (plowast(X))) is that it is a decreasing convexand continuous function defined in the interval 0 le D (plowast(X)) le Dmax for

2 The MI I (plowast(X) p(W|X)) has another notation I (X W) similar to the one used inequation 24 However as pointed out by Blahut (1988) the latter may not be the bestnotation for the optimization problem because it suggests that MI is merely a functionof the variable vectors X and W For the same reason we use probability distributionnotation for all related functions For example the rate distortion function is presentedas R(D(plowast(X))) which is a bit more complicated than the original paper (Blahut 1972)This inconvenience turns out to be worth it as we study the related RIC capacity problemwhich is coupled closely with the rate distortion function as shown in the next section

2678 Q Song

( ( (X)))R D p

I

1T

( W |X )min ( (W | X) (X))p

F p p0 T

Empirical Risk Minimization Optimal Saddle Point

( (X))

( (X))

D p

D p

( (X)) ( (X))D p D p

Capacity Maximization

( (X)) ( (X))D p D p

maxD

( ( (X )))C D p

Figure 1 Plots of the rate distortion function and capacity curves for anyparticular cluster number K le Kmax The plots are parameterized by the tem-perature T

any particular cluster number 0 lt K le Kmax as shown in Figure 1 (Blahut1972)

Define the DA clustering objective function as (Rose 1998)

F (plowast(X) p(W|X)) = I (plowast(X) p(W|X))

minus slsum

i=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) (38)

The rate distortion function

R(D(plowast(X))) = s D(plowast(X)) + minp(W|X)

F (plowast(X) p(W|X)) (39)

is minimized by the titled distribution 34 (Blahut 1972)From the data clustering point of view equations 22 and 38 are well

known to be soft dissimilarity measures of different clusters (Dave ampKrishnapuram 1997) To accommodate the DA-based RIC algorithm in asingle framework of classical information theory we use a slightly differenttreatment from the original paper of Rose (1998) for the DA clustering algo-rithm that is to minimize equation 38 with respect to the free pmf p(wk |xi )rather than the direct minimization against the cluster center W This recasts

A Robust Information Clustering Algorithm 2679

the clustering optimization problem as that of seeking the distribution pmfand minimizing equation 38 subject to a specified level of randomness Thiscan be measured by the minimization of the MI equation 31

The optimization is now to minimize the function F (plowast(X) p(W|X))which is a by-product of the MI minimization over the titled distributionp(wk |xi ) to achieve a minimum distortion and leads to the mass-constrainedDA clustering algorithm

Plugging equation 34 into 38 the optimal objective function equation38 becomes the entropy functional in a compact form3

F (plowast(X) p(W|X)) = minuslsum

i=1

plowast(xi ) lnKsum

k=1

p(wk) exp (minusd(wk xi )T) (310)

Minimizing equation 310 against the cluster center wk we have

part F (plowast(X) p(W|X))partwk

=lsum

i=1

plowast(xi )p(wk |xi )(wk minus xi ) = 0 (311)

which leads to the optimal clustering center

wk =lsum

i=1

αikxi (312)

where

αik = plowast(xi )p(wk |xi )p(wk)

(313)

For any cluster number K le Kmax and a fixed arbitrary pmf set plowast(xi ) isinplowast(X) minimization of the clustering objective function 38 against the pmfset p(W|X) is monotone nonincrease and converges to a minimum point ofthe convex function curve at a particular temperature The soft distortionmeasure D(plowast(X)) in equation 33 and the MI equation 31 are minimizedsimultaneously in a sense of empirical risk minimization

3 F (plowast(X) p(W|X)) =lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi ) ln(

p(wk ) exp(minusd(wk xi )T)p(wk )Nxi

)minuss

lsumi=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) = minuslsum

i=1plowast(xi )

Ksumk=1

p(wk |xi ) ln Nxi

= minuslsum

i=1plowast(xi ) ln

Ksumk=1

p(wk ) exp(minusd(wk xi )T) (according to equation 34Ksum

k=1p(wk |xi ) = 1)

2680 Q Song

4 Minimax Optimization and the Structural Risk Minimization

41 Capacity Maximization and Input Data Reliability In the con-strained minimization of MI of the last section we obtain an optimal feed-forward transition probability a priori pmf p(wk |xi ) isin p(W|X) A backwardtransition probability a posteriori pmf p(xi |wk) isin p(X|W) can be obtainedthrough the Bayes formula

p(xi |wk) = p(xi )p(wk |xi )sumli=1 p(xi )p(wk |xi )

= p(xi )p(wk |xi )p(wk)

(41)

The backward transition probability is useful to assess the realizabilityof the input data set in classical information theory Directly using the pmfequation 41 yields an optimization problem by simply evaluating a singlepmf p(xi |wk) and is not a good idea to reject outlier (Mackay 1999) How-ever we can use the capacity function of classical information theory This isdefined by maximizing an alternative presentation of the MI against inputprobability distribution

C = maxp(X)

I (p(X) p(W|X)) (42)

with

I (p(X) p(W|X)) = I (p(X) p(X|W))

=lsum

i=1

Ksumk=1

p(xi )p(wk |xi ) lnp(xi |wk)

p(xi ) (43)

where C is a constant represented the channel capacityNow we are in a position to introduce the channel reliability of classical

information theory (Bluhat 1988) To deal with the input data uncertaintythe MI can be presented in a simple channel entropy form

I (p(X) p(X|W)) = H(p(X)) minus H(p(W) p(X|W)) (44)

where the first term represents uncertainty of the channel input variable X4

H(p(X)) = minuslsum

i=1

p(xi ) ln(p(xi )) (45)

4 In nats (per symbol) since we use the natural logarithm basis rather bits (per symbol)in log2 function Note that we use the special entropy notations H(p(X)) = H(X) andH(p(W) p(X|W)) = H(X|W) here

A Robust Information Clustering Algorithm 2681

and the second term is the conditional entropy

$$H(p(W), p(X|W)) = -\sum_{i=1}^{l}\sum_{k=1}^{K} p(w_k)\, p(x_i|w_k) \ln p(x_i|w_k). \tag{4.6}$$

Lemma 1 (inverse theorem).⁵ The clustering data reliability is presented in a single-symbol error $p_e$ of the input data set, with empirical error probability

$$p_e = \sum_{i=1}^{l}\sum_{k \ne i}^{K} p(x_i|w_k), \tag{4.7}$$

such that if the input uncertainty $H(p(X))$ is greater than C, the error $p_e$ is bounded away from zero as

$$p_e \ge \frac{1}{\ln l}\left(H(p(X)) - C - 1\right). \tag{4.8}$$

Proof. We first give an intuitive discussion here of Fano's inequality (see Blahut, 1988, for a formal proof).

Uncertainty in the estimated channel input can be broken into two parts: the uncertainty in the channel of whether an empirical error $p_e$ was made, and, given that an error is made, the uncertainty in the true value. The error occurs with probability $p_e$, such that the first uncertainty is $H(p_e) = -p_e \ln p_e - (1 - p_e)\ln(1 - p_e)$, and the second can be no larger than $\ln(l)$, which occurs only when all alternative errors are equally likely. Therefore, if the equivocation can be interpreted as the information lost, we should have Fano's inequality,

$$H(p(W), p(X|W)) \le H(p_e) + p_e \ln(l). \tag{4.9}$$

Now consider that the maximum of the MI is C in equation 4.2, so we can rewrite equation 4.4 as

$$H(p(W), p(X|W)) = H(p(X)) - I(p(X), p(X|W)) \ge H(p(X)) - C. \tag{4.10}$$

Then Fano's inequality is applied to get

$$H(p(X)) - C \le H(p_e) + p_e \ln(l) \le 1 + p_e \ln l. \tag{4.11}$$

⁵ There is a tighter bound on $p_e$ compared to the one of lemma 1, as in the work of Jelinek (1968). However, this may not be very helpful, since minimizing the empirical risk does not necessarily minimize the real structural risk, as shown in section 4.3.


Lemma 1 gives an important indication that any incoming information (input data) beyond the capacity C will generate unreliable data transmission. This is also called the inverse theorem in the sense that it uses the DA-generated optimal tilted distribution to produce the backward transition probability, equation 4.1, and assess an upper bound of the empirical risk, equation 4.10.
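As a toy numerical illustration of the bound in equation 4.8 (the values below are assumptions for illustration, not figures from the paper), the lower bound on the error can be evaluated directly:

```python
# Minimal sketch: lower bound on the single-symbol error p_e from equation 4.8.
import numpy as np

def fano_lower_bound(H_input, C, l):
    """Lower bound on p_e; H_input and C are in nats, l is the number of patterns."""
    return max(0.0, (H_input - C - 1.0) / np.log(l))

# Assumed toy setting: 300 patterns with a uniform input pmf, capacity C = 3 nats.
print(fano_lower_bound(np.log(300), 3.0, 300))   # approx. 0.30
```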

4.2 Capacity Maximization and the Optimal Solution. Equation 3.3 is well known to be a soft dissimilarity measure minimized by the DA clustering as the temperature T is lowered toward zero (Rose, 1998). However, there is no way for the DA to search for an optimal temperature value and, in turn, an optimal cluster number, because the rate distortion function provides only limited information and aims at empirical risk minimization, as shown in section 3. Therefore, we propose a capacity, or MI, maximization scheme. This is implicitly dependent on the distortion measure, similar to the rate distortion function.

We define a constrained maximization of MI as⁶

$$C(D(p(X))) = \max_{p(X)} I(p(X), p(W|X)), \tag{4.12}$$

with a similar constraint as in equation 3.3,

$$D(p(X)) = \sum_{i=1}^{l}\sum_{k=1}^{K} p(x_i)\, p(w_k|x_i)\, d(w_k, x_i) \le D(p^*(X)). \tag{4.13}$$

This is because minimization of the soft distortion measure $D(p^*(X))$, equation 3.3, is the ultimate target of the DA clustering algorithm, as analyzed in section 3. We need to assess the maximum possibility of making an error (risk). According to lemma 1, reliability of the input data set depends on the capacity, that is, the maximum value of the MI against the input density estimate. To do this, we evaluate the optimal a priori pmf, the robust density distribution pmf $p(x_i) \in p(X)$, to replace the fixed arbitrary $p^*(x_i)$ in the distortion measure, equation 3.3, and assess reliability of the input data at each particular cluster number K based on the a posteriori pmf in equation 4.1. If most of the data points (if not all) achieve the capacity (fewer outliers), then we can claim that the clustering result reaches an optimal, or at least a suboptimal, solution at this particular cluster number in a sense of empirical risk minimization.

⁶ Here we use a similar notation for the capacity function as for the rate distortion function R(D(p(X))) to indicate implicitly that the specific capacity function is in fact an implicit function of the distortion measure D(p(X)). For each particular temperature T, the capacity C(D(p(X))) achieves a point on the upper curve corresponding to the lower curve R(D(p*(X))), as shown in equation 4.17.


Similar to the minimization of the rate distortion function in section 3, constrained capacity maximization can be rewritten as an optimization problem with a Lagrange multiplier $\lambda \ge 0$:

$$C(D(p(X))) = \max_{p(X)}\left[ I(p(X), p(W|X)) + \lambda\left(D(p^*(X)) - D(p(X))\right) \right]. \tag{4.14}$$

Theorem 1. The maximum of the constrained capacity C(D(p(X))) is achieved by the robust density estimate

$$p(x_i) = \frac{\exp\left(\sum_{k=1}^{K} p(w_k|x_i) \ln p(x_i|w_k) - \lambda\, p(w_k|x_i)\, d(w_k, x_i)\right)}{\sum_{i=1}^{l} \exp\left(\sum_{k=1}^{K} p(w_k|x_i) \ln p(x_i|w_k) - \lambda\, p(w_k|x_i)\, d(w_k, x_i)\right)}, \tag{4.15}$$

with the specific distortion measure $D(p(X)) = D(p^*(X))$, for $p(x_i) \ge 0$ and all $1 \le i \le l$.

Proof. Similar to Blahut (1972), we can temporarily ignore the condition $p(x_i) \ge 0$ and set the derivative of the optimal function 4.14 equal to zero against the independent variable, the a priori pmf $p(x_i)$. This results in

$$\frac{\partial}{\partial p(x_i)}\left(C(D(p(X))) + \lambda_1\left(\sum_{i=1}^{l} p(x_i) - 1\right)\right) = -\ln p(x_i) - 1 + \sum_{k=1}^{K} p(w_k|x_i)\left(\ln p(x_i|w_k) - \lambda\, d(w_k, x_i)\right) + \lambda_1 = 0. \tag{4.16}$$

We also select a suitable $\lambda_1$, which ensures that the probability constraint $\sum_{i=1}^{l} p(x_i) = 1$ is guaranteed and leads to the robust density distribution estimate, equation 4.15.

According to the Kuhn-Tucker theorem (Blahut, 1988), if there exists an optimal robust distribution $p(x_i)$, which is derived from equation 4.15, then the inequality constraint, equation 4.13, of the distortion measure becomes an equality and achieves the optimal solution of equation 4.14 at an optimal saddle point between the curves C(D(p(X))) and R(D(p*(X))), with the corresponding average distortion measure

$$D(p(X)) = D(p^*(X)). \tag{4.17}$$

By dividing the input data into effective clusters, the DA clustering minimizes the relative Shannon entropy without a priori knowledge of the data distribution (Gray, 1990).


Figure 2: The tilted distribution and robust density estimation based on the inverse theorem for a two-cluster data set.

The prototype (cluster center), equation 3.12, is clearly presented as a mass center. This is insensitive to the initialization of cluster centers and volumes, with a fixed probability distribution, for example, an equal value $p^*(x_i) = 1/l$ for the entire set of input data points (Rose, 1998). Therefore, the prototype parameter $\alpha_{ik}$ depends on the tilted distribution $p(w_k|x_i)$, equation 3.4, which tends to associate the membership of any particular pattern with all clusters and is not robust against outliers or disturbance of the training data (Dave & Krishnapuram, 1997). This in turn generates difficulties in determining an optimal cluster number, as shown in Figure 2 (see also the simulation results). Any data point located around the middle position between two effective clusters could be considered an outlier.

Corollary 1. The capacity curve C(D(p(X))) is continuous, nondecreasing, and concave on D(p(X)) for any particular cluster number K.

Proof. Let $p'(x_i) \in p'(X)$ and $p''(x_i) \in p''(X)$ achieve $[D(p'(X)), C(D(p'(X)))]$ and $[D(p''(X)), C(D(p''(X)))]$, respectively, and let $p(x_i) = \lambda' p'(x_i) + \lambda'' p''(x_i)$ be an optimal density estimate in theorem 1, where $\lambda'' = 1 - \lambda'$ and $0 < \lambda' < 1$. Then

$$D(p(X)) = \sum_{i=1}^{l}\sum_{k=1}^{K} \left(\lambda' p'(x_i) + \lambda'' p''(x_i)\right) p(w_k|x_i)\, d(w_k, x_i) = \lambda' D(p'(X)) + \lambda'' D(p''(X)), \tag{4.18}$$


and because p(X) is the optimal value, we have

$$C(D(p(X))) \ge I(p(X), p(W|X)). \tag{4.19}$$

Now we use the fact that $I(p(X), p(W|X))$ is concave (upward convex) in $p(X)$ (Jelinek, 1968; Blahut, 1988) and arrive at

$$C(D(p(X))) \ge \lambda' I(p'(X), p(W|X)) + \lambda'' I(p''(X), p(W|X)). \tag{4.20}$$

We finally have

$$C\left(\lambda' D(p'(X)) + \lambda'' D(p''(X))\right) \ge \lambda' C(D(p'(X))) + \lambda'' C(D(p''(X))). \tag{4.21}$$

Furthermore, because C(D(p(X))) is concave on $[0, D_{max}]$, it is continuous, nonnegative, and nondecreasing, achieving the maximum value at $D_{max}$; it must also be strictly increasing for $D(p(X))$ smaller than $D_{max}$.

Corollary 2. The robust distribution estimate p(X) achieves the capacity at

$$\sum_{k=1}^{K} p(w_k|x_i)\left(\ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\, p(w_k|x_i)} - \lambda\, d(w_k, x_i)\right) = V, \quad \forall\, p(x_i) \ne 0, \tag{4.22}$$

$$\sum_{k=1}^{K} p(w_k|x_i)\left(\ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\, p(w_k|x_i)} - \lambda\, d(w_k, x_i)\right) < V, \quad \forall\, p(x_i) = 0. \tag{4.23}$$

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik, 1998),

$$p(x_i)\left[V - \left(\sum_{k=1}^{K} p(w_k|x_i)\left(\ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\, p(w_k|x_i)} - \lambda\, d(w_k, x_i)\right)\right)\right] = 0, \quad \forall i. \tag{4.24}$$

Proof. Similar to the proof of theorem 1, we use the concave property of C(D(p(X))):

$$\frac{\partial}{\partial p(x_i)}\left(C(D(p(X))) + \lambda_1\left(\sum_{i=1}^{l} p(x_i) - 1\right)\right) \le 0, \tag{4.25}$$


which can be rewritten as

$$\sum_{k=1}^{K} p(w_k|x_i)\left(\ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\, p(w_k|x_i)} - \lambda\, d(w_k, x_i)\right) \le -\lambda_1 + 1, \quad \forall i, \tag{4.26}$$

with equality for all $p(x_i) \ne 0$. Setting $-\lambda_1 + 1 = V$ completes the proof.

Similarly, it is easy to show that if we choose λ = 0, the Kuhn-Tucker condition becomes

$$p(x_i)\left[C - \left(\sum_{k=1}^{K} p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\, p(w_k|x_i)}\right)\right] = 0, \quad \forall i, \tag{4.27}$$

where C is the maximum capacity value defined in equation 4.2.

Note that the MI is not negative. However, individual items in the sum of the capacity maximization, equation 4.2, can be negative. If the ith pattern $x_i$ is taken into account and $p(w_k|x_i) < \sum_{i=1}^{l} p(x_i)\, p(w_k|x_i)$, then the probability of the kth code vector (cluster center) is decreased by the observed pattern, which gives negative information about pattern $x_i$. This particular input pattern may be considered an unreliable pattern (outlier), and its negative effect must be offset by other input patterns. Therefore, the maximization of the MI, equation 4.2, provides a robust density estimation of the noisy pattern (outlier), in the sense that the average information is taken over all clusters and input patterns. The robust density estimation and optimization now maximizes the MI against the pmfs $p(x_i)$ and $p(x_i|w_k)$; for any value of i, if $p(x_i|w_k) = 0$, then $p(x_i)$ should be set equal to zero in order to obtain the maximum, such that the corresponding training pattern (outlier) $x_i$ can be deleted and dropped from further consideration in the optimization procedure, as the outliers shown in Figure 2.

As a by-product, the robust density estimation leads to an improved criterion for calculating the critical temperature that splits the input data set into more clusters in the RIC, compared to the DA, as the temperature is lowered (Rose, 1998). The critical temperature of the RIC can be determined by the maximum eigenvalue of the covariance (Rose, 1998)

$$V_{XW} = \sum_{i=1}^{l} p(x_i|w_k)\,(x_i - w_k)(x_i - w_k)^T, \tag{4.28}$$

where $p(x_i|w_k)$ is optimized by equation 4.1. This has a bigger value for the reliable data, since the channel communication error $p_e$ is relatively smaller compared to that of an outlier (see lemma 1).
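A minimal sketch of this critical-temperature test, under the same assumptions as the earlier sketch (squared Euclidean distortion, X of shape (l, n), W of shape (K, n)), might look as follows; p_x stands for the robust estimate $p(x_i)$ and p_w_given_x for the tilted distribution.

```python
# Sketch of the RIC critical-temperature criterion, equations 4.1 and 4.28.
import numpy as np

def critical_temperature(X, W, p_x, p_w_given_x, k):
    """Return T_c = 2 * E_max(V_XW) for cluster k, with p(x_i|w_k) from equation 4.1."""
    p_w_k = p_x @ p_w_given_x[:, k]                  # p(w_k) = sum_i p(x_i) p(w_k|x_i)
    p_x_given_w = p_x * p_w_given_x[:, k] / p_w_k    # backward pmf, equation 4.1
    diff = X - W[k]                                  # (x_i - w_k), shape (l, n)
    V = (p_x_given_w[:, None] * diff).T @ diff       # covariance of equation 4.28
    return 2.0 * np.linalg.eigvalsh(V).max()
```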


4.3 Structural Risk Minimization and Optimal Cluster Number. To solve the intertwined outlier and cluster number problem, some intuitive notions can be obtained based on classical information theory, as presented in the previous sections. Increasing K and the model complexity (as the temperature is lowered) may reduce the capacity C(D(p(X))), since it is a nondecreasing function of D(p(X)), as shown in corollary 1 (see also Figure 1). Therefore, in view of theorem 1, we should use the smallest cluster number as long as a relatively small number of outliers is achieved (if not zero outliers), say, 1 percent of the entire set of input data points. However, how to make a trade-off between empirical risk minimization and capacity maximization is a difficult problem for classical information theory.

We can solve this difficulty by bridging the gap between classical information theory, on which the RIC algorithm is based, and the relatively new statistical learning theory with the so-called structural risk minimization (SRM) principle (Vapnik, 1998). Under the SRM, a set of admissible structures with nested subsets can be defined specifically for the RIC clustering problem as

$$S_1 \subset S_2 \subset \cdots \subset S_K, \tag{4.29}$$

where $S_K = (Q_K(x_i, W),\ W \in K)$, $\forall i$, with a set of indicator functions of the empirical risk,⁷

$$Q_K(x_i, W) = \sum_{k=1}^{K} \lim_{T\to 0} p(w_k|x_i) = \sum_{k=1}^{K} \lim_{T\to 0} \frac{p(w_k)\exp(-d(x_i, w_k)/T)}{N_{x_i}}, \quad \forall i. \tag{4.30}$$

We shall show that the tilted distribution $p(w_k|x_i)$, equation 3.4, at zero temperature, as in equation 4.30, can be approximated by the complement of a step function. This is linear in parameters and assigns the cluster membership of each input data point based on the Euclidean distance between data point $x_i$ and cluster center $w_k$ for a final hard clustering partition (Rose, 1998; see also the algorithm in section 4.4).

The tilted distribution at $T \to 0$ can be presented as

$$\lim_{T\to 0} \frac{p(w_k)\exp(-d(x_i, w_k)/T)}{\sum_{k=1}^{K} p(w_k)\exp(-d(x_i, w_k)/T)} \approx
\begin{cases}
\dfrac{p(w_k)\exp(-d_0(x_i, w_k))}{p(w_k)\exp(-d_0(x_i, w_k))} = 1 & \text{if } d_0(x_i, w_k) \ne \infty, \\[2mm]
\dfrac{p(w_k)\exp(-d_0(x_i, w_k))}{\sum_{k=1}^{K} p(w_k)\exp(-d_0(x_i, w_k))} = 0 & \text{if } d_0(x_i, w_k) \to \infty.
\end{cases} \tag{4.31}$$

⁷ According to the definition of the tilted distribution, equation 3.4, it is easy to see that the defined indicator function is a constant number, that is, $Q_K(x_i, W) = 1$. See also note 3.

Now consider the radius $d_0(x_i, w_k)$ between data point $x_i$ and cluster k at zero temperature. This can be rewritten as an inner product of two n-dimensional vectors of the input space as

$$d_0(x_i, w_k) = \lim_{T\to 0} \frac{d(x_i, w_k)}{T} = \lim_{T\to 0} \frac{\langle x_i - w_k,\ x_i - w_k\rangle}{T} = \sum_{o=1}^{n} r_{ko}\,\varphi_{ko}(X), \tag{4.32}$$

where $r_{ko}$ represents the radius parameter component in the n-dimensional space and $\varphi_{ko}(X)$ is a linearly independent function, similar to the hyperplane case (Vapnik, 1998).

Using equations 4.32 and 4.31, we can rewrite 4.30 as

$$Q_K(x_i, W) = \sum_{k=1}^{K} \bar\theta\left(\sum_{o=1}^{n} r_{ko}\,\varphi_{ko}(X) - d_0(x_i, w_k)\right), \quad \forall i, \tag{4.33}$$

where $\bar\theta(\cdot) = 1 - \theta(\cdot)$ is the complement of the step function $\theta(\cdot)$. Note that there is one and only one $d_0(x_i, w_k) \ne \infty$, $\forall (1 \le k \le K)$, in each conditional equality of equation 4.31, since it gives a unique cluster membership of any data point $x_i$ in a nested structure $S_K$. Therefore, the indicator $Q_K(x_i, W)$ is linear in parameters. According to Vapnik (1998), the VC-dimension of the complexity control parameter is equal to the number of parameters, $h_K = (n + 1) * K$, for each nested subset $S_K$. By design of the DA clustering, the nested structure in equation 4.29 provides an ordering of the VC-dimensions, $h_1 \le h_2 \le \cdots \le h_K$, such that the increase of the cluster number is proportional to the increase of the estimated VC-dimension, from a neural network point of view (Vapnik, 1998).
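For concreteness, the zero-temperature assignment implied by equation 4.31 and the VC-dimension count $h_K = (n + 1)K$ can be sketched as follows; the helper names are hypothetical, and the hard assignment is simply the nearest-center rule under the squared Euclidean distance.

```python
# Sketch of the T -> 0 limit (hard partition) and the VC-dimension of a nested subset S_K.
import numpy as np

def hard_assignment(X, W):
    """Cluster membership of each x_i at T -> 0: index of the nearest center."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def vc_dimension(n, K):
    """h_K = (n + 1) * K for a K-cluster partition of n-dimensional data."""
    return (n + 1) * K
```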

To obtain good generalization performance, one has to use the admissible structure, equation 4.29, based on the set of indicator functions to search for an optimal cluster number K. This minimizes a VC-bound $p_s$ similar to that of the support vector machine, except that we are looking for the strongest data points of the input space instead of seeking the weakest data points of the feature (kernel) space (Vapnik, 1998). So we have

$$p_s \le \eta + \frac{\varepsilon}{2}\left(1 + \left(1 + \frac{4\eta}{\varepsilon}\right)^{1/2}\right), \tag{4.34}$$


with

$$\eta = \frac{m}{l}, \tag{4.35}$$

$$\varepsilon = 4\,\frac{h_K\left(\ln \frac{2l}{h_K} + 1\right) - \ln \frac{\zeta}{4}}{l}, \tag{4.36}$$

where m is the number of outliers identified in the capacity maximization, as in the previous section, and $\zeta < 1$ is a constant.

The signal-to-noise ratio η in equation 4.35 appears as the first term on the right-hand side of the VC-bound, equation 4.34. This represents the empirical risk, and the second term is the confidence interval of the SRM-based estimate.
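A small sketch of the bound computation is given below (the helper name vc_bound is hypothetical; ζ = 0.1 is the value used later in the simulations). With the reconstruction of equations 4.34 through 4.36 above, it reproduces, to within rounding, the $p_s \approx 4.98$ reported for example 1 at K = 2 with four outliers.

```python
# Sketch of the VC-bound p_s of equations 4.34-4.36 used to select the cluster number.
import numpy as np

def vc_bound(l, n, K, m, zeta=0.1):
    """Real-risk bound p_s for K clusters on l points of dimension n, with m outliers."""
    h = (n + 1) * K                                                      # VC-dimension h_K
    eta = m / l                                                          # empirical risk ratio, eq. 4.35
    eps = 4.0 * (h * (np.log(2 * l / h) + 1) - np.log(zeta / 4)) / l     # eq. 4.36
    return eta + 0.5 * eps * (1 + np.sqrt(1 + 4 * eta / eps))            # eq. 4.34

# Example 1 of the paper: l = 18, n = 2, K = 2, m = 4 outliers -> p_s about 4.98
print(vc_bound(18, 2, 2, 4))
```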

Discussion

Stop criterion and optimal cluster number. At the initial DA clustering stage, with a small cluster number K and a relatively large ratio between the number of input data points and the VC-dimension, say $l/h_K > 20$ (Vapnik, 1998), the real risk VC-bound, equation 4.34, is mainly determined by the first term on the right-hand side of the inequality, that is, the empirical risk (signal-to-noise) ratio η in equation 4.35. As the temperature is lowered and the cluster number is increased, a relatively small ratio $l/h_K$ may require both terms on the right-hand side of equation 4.34 to be small simultaneously. Therefore, we can first assess the ratio $l/h_K$ that is near the upper bound of the critical number 20 for a maximum cluster number $K = K_{max}$, beyond which the second term of the VC-bound, equation 4.34, may become dominant even for a small empirical risk ratio η, especially in a high-dimensional data space. We can then follow the minimax MI optimization, as in sections 3 and 4, to increase the cluster number from one until $K_{max}$ for a minimum value of the VC-bound, that is, take a trade-off between minimization of the empirical risk and the VC-dimension.

Selection of λ. The degree of robustness of the RIC algorithm is controlled by the parameter λ. The Kuhn-Tucker condition in corollary 2 tells us that a relatively larger value of λ yields more outliers (noisy patterns). If one chooses λ = 0, the RIC allows the maximum empirical risk, with a possible overcapacity distortion beyond the optimal saddle point and a minimum number of estimated outliers (see Figure 1). In a general clustering problem using the L2 distortion measure, equation 2.2, the selection of λ is insensitive to the determination of an optimal cluster number, because the VC-bound depends on only the relative values of η and $h_K$ over different cluster numbers (see also example 2). As a general rule of thumb, if eliminating more outliers is of interest, we can gradually increase λ and redo the capacity maximization to reject outliers located between intercluster boundaries at an optimal cluster number determined by an arbitrary value of λ.

4.4 Implementation of the RIC Algorithm

Phase I (Minimization)

1. Determine the ratio $l/(n * K)$, which is near the critical number 20, for a maximum cluster number $K = K_{max}$, and set $p(x_i) = 1/l$ for i = 1 to l.

2. Initialize $T > 2E_{max}(V_x)$, where $E_{max}$ is the largest eigenvalue of the variance matrix $V_x$ of the input pattern set X, K = 1, and $p(w_1) = 1$.

3. For i = 1, . . . , K, perform the fixed-point iteration of the DA clustering according to equations 3.4, 4.15, and 3.12.

4. Convergence test. If not satisfied, go to step 3.

5. If $T \le T_{min}$, perform the last iteration and stop.

6. Cooling step: $T \leftarrow \alpha T$ ($\alpha < 1$).

7. If $K < K_{max}$, check the condition for phase transition for i = 1, . . . , K. If a critical temperature $T = 2E_{max}(V_{XW})$, where $E_{max}(V_{XW})$ is the largest eigenvalue of the covariance matrix $V_{XW}$ in equation 4.28 between the input pattern and code vector (Rose, 1998), is reached for the clustering, add a new center $w_{K+1} = w_K + \delta$ with $p(w_{K+1}) = p(w_K)/2$, set $p(w_K) \leftarrow p(w_K)/2$, and update $K \leftarrow K + 1$.

Phase II (Maximization)

8. If it is the first time for the calculation of the robust density estimation, select $p(x_i) = 1/l$, $\infty > \lambda \ge 0$, and $\varepsilon > 0$, and start the fixed-point iteration of the robust density estimation in the following steps 9 and 10.

9. Compute

$$c_i = \exp\left[\sum_{k=1}^{K}\left(p(w_k|x_i)\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\, p(w_k|x_i)} - \lambda\, p(w_k|x_i)\, d(w_k, x_i)\right)\right]. \tag{4.37}$$

10. If the capacity iteration has not converged, that is, if

$$\ln \max_{i=1,\dots,l} c_i - \ln \sum_{i=1}^{l} p(x_i)\, c_i \ge \varepsilon, \tag{4.38}$$

update the density estimate

$$p(x_i) \leftarrow \frac{p(x_i)\, c_i}{\sum_{i=1}^{l} p(x_i)\, c_i} \tag{4.39}$$

and go to step 9; otherwise, continue to step 11.


11. Verify the robust solutions of the RIC algorithm around the optimal saddle point for a minimum value of the VC-bound, equation 4.34, within the range of the maximum cluster number $K_{max}$. If the minimum is found, then delete the outliers and set $T \to 0$ for the tilted distribution to obtain the cluster membership of all input data points for a hard clustering solution. Recalculate the cluster centers using equation 3.12 without the outliers; then stop. Otherwise, go to step 3.
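The Phase II loop (steps 8 through 10) can be sketched as a Blahut-style fixed-point iteration. The code below is an illustrative reading of equations 4.37 through 4.39, not the author's released program; the convergence threshold eps and the iteration cap are assumptions, and the shapes follow the earlier sketches.

```python
# Sketch of the Phase II robust density estimation (capacity maximization), eqs. 4.37-4.39.
import numpy as np

def robust_density_estimate(X, W, p_w_given_x, lam=0.0, eps=1e-6, max_iter=500):
    """Return the robust input pmf p(x_i); entries driven to (numerically) zero flag outliers."""
    l, K = p_w_given_x.shape
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # d(w_k, x_i)
    p_x = np.full(l, 1.0 / l)
    for _ in range(max_iter):
        p_w = p_x @ p_w_given_x                              # sum_i p(x_i) p(w_k|x_i)
        log_c = (p_w_given_x * (np.log(p_w_given_x / p_w[None, :] + 1e-300)
                                - lam * d)).sum(axis=1)      # log of c_i, equation 4.37
        c = np.exp(log_c - log_c.max())                      # common rescaling for stability
        lower, upper = np.log(p_x @ c), np.log(c.max())
        if upper - lower < eps:                              # convergence test, equation 4.38
            break
        p_x = p_x * c / (p_x @ c)                            # update, equation 4.39
    return p_x
```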

5 Simulation Results

This section presents a few simulation examples to show the superiority of the RIC over the standard DA clustering algorithm. This is in fact a self-comparison, since the RIC is just an extension of the DA that identifies outliers for an optimal cluster number. A comparison can also be made with the popular fuzzy c-means (FCM) and the robust versions of the FCM clustering algorithms (see section 2). However, this may not make much sense, since the FCM needs a predetermined cluster number, in addition to the initialization problem (Krishnapuram & Keller, 1993).

Example 1, which follows, presents a numerical analysis to reveal details of the weakness of the tilted distribution. It also explains how the robust density estimate of the RIC algorithm finds an optimal cluster number via the identification of outliers. Example 2 illustrates that one can always choose a relatively larger control parameter λ to eliminate more outliers in the intercluster area without affecting the estimated optimal cluster number. Example 3 shows an interesting partition of a specific data set without clear cluster boundaries. In particular, we show that any data point could become an outlier, depending on the given data structure and the chosen cluster centers in the annealing procedure, based on the limited number of input data for a minimum VC-bound. Similarly, we are not looking for "true" clusters or cluster centers but effective clusters in the sense of the SRM, based on the simple Euclidean distance.⁸

Example 1. Figure 3 is an extended example used in the robust FCM clustering algorithm (Krishnapuram & Keller, 1993), which has two well-separated clusters with seven data points each and four outliers sitting around the middle position between the two given clusters. The data set has 18 data points, such that the ratio $l/h_1 = 18/(3 * 1)$ is already smaller than the critical number 20. An optimal cluster number should be the minimum, two (note that the DA does not work for one cluster). However, we would like to use this example to reveal the weakness of the tilted distribution and how the robust density estimate helps.

⁸ We set ζ = 0.1 in the VC-bound for all the simulation results. The Matlab program can be downloaded from the author's Internet address: http://www.ntu.edu.sg/home/eqsong.


Figure 3: The clustering results of the RIC (λ = 0) in example 1: (a) the original data set; (b) K = 2, $p_s$ = 4.9766; (c) K = 3, $p_s$ = 5.7029; (d) K = 4, $p_s$ = 6.4161. The bigger ∗ represents the estimated cluster center of the RIC after eliminating the estimated outliers. The black dot points are the outliers identified by the RIC in b, c, and d.


Table 1: Optimal Tilted Distribution p(w_k|x_i) and Robust Density Estimate p(x_i) in Example 1 with K = 2.

 i   p(x_i)   p(w_1|x_i)   p(w_2|x_i)
 1   0.3134   0.9994       0.0006
 2   0.0638   0.9991       0.0009
 3   0.0354   0.9987       0.0013
 4   0.0329   0.9987       0.0013
 5   0.0309   0.9987       0.0013
 6   0.0176   0.9981       0.0019
 7   0.0083   0.9972       0.0028
 8   0.0030   0.0028       0.9972
 9   0.0133   0.0019       0.9981
10   0.0401   0.0013       0.9987
11   0.0484   0.0013       0.9987
12   0.0567   0.0013       0.9987
13   0.1244   0.0009       0.9991
14   0.2133   0.0006       0.9994
15   0.0000   0.9994       0.0006
16   0.0000   0.9994       0.0006
17   0.0000   0.9994       0.0006
18   0.0000   0.9994       0.0006

Figure 3 also shows that the RIC algorithm with K = 2 identifies the four data points around the middle position between the two clusters as outliers and eliminates them with $p(x_i) = 0$. Further details on the values of the tilted distribution $p(w_k|x_i)$ and the robust estimate $p(x_i)$ are listed in Table 1 for the case of K = 2. The first 14 rows correspond to the data in the two clusters, and the last 4 rows represent the four identified outliers. Despite the balanced geometric positions of the outliers, the membership of the four outliers is assigned to cluster 1 by the DA, because $p(w_1|x_i) \approx 1$ for the four outliers. A minor difference in numerical error may be the only cause for the DA to assign the membership of the four data points to the first cluster. This explains why minimization of the tilted distribution is not robust (Dave & Krishnapuram, 1997).

More important, the RIC estimates the real risk bound $p_s$ as the cluster number is increased from one. This also eliminates the effect of outliers. The ratio between the number of total data points and the VC-dimension $h_2$ is small, at 18/6 = 3, so the second term of the VC-bound becomes dominant as K increases, as shown in Figure 3. The optimal cluster number is determined as "two", with a minimum $p_s$ = 4.9766, despite the fact that the minimum number of outliers of the empirical risk is achieved at the cluster number K = 4. Note also that the original outliers become valid data points as the cluster number is increased to K = 3 and K = 4, respectively.

Example 2. The two-dimensional data set has 292 data points, so the ratio $l/h_7 = 292/(3 * 7)$ is well below the critical number 20. We should search for an optimal cluster number from two to seven clusters.


Figure 4: The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 0: (a) $p_s$ = 1.5635, K = 2; (b) $p_s$ = 0.6883, K = 3; (c) $p_s$ = 1.1888, K = 4; (d) $p_s$ = 1.4246, K = 5; (e) $p_s$ = 1.3208, K = 6; (f) $p_s$ = 2.4590, K = 7. The black dot points are the outliers identified by the RIC in all pictures.


Figure 5: The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 1.8: (a) $p_s$ = 1.8924, K = 2; (b) $p_s$ = 0.9303, K = 3; (c) $p_s$ = 1.2826, K = 4; (d) $p_s$ = 1.5124, K = 5; (e) $p_s$ = 1.3718, K = 6; (f) $p_s$ = 2.46244, K = 7. The black dot points are the outliers identified by the RIC in all pictures.


Figure 6: The two-dimensional data set with 300 data points in example 3, clustered by the RIC algorithm with λ = 0. The black dot points are the outliers identified by the RIC in all pictures: (a) K = 2, $p_s$ = 1.8177, η = 0.8667; (b) K = 3, $p_s$ = 1.3396, η = 0.3900; (c) K = 4, $p_s$ = 0.8486, η = 0; (d) K = 5, $p_s$ = 0.9870, η = 0; (e) K = 6, $p_s$ = 1.1374, η = 0.0033; (f) K = 7, $p_s$ = 2.169, η = 0.4467.


Figures 4 and 5 show that a "native" noise-free three-cluster data set is clustered by the RIC algorithm with different cluster numbers. The RIC gives the correct optimal cluster number, "three", because there is a minimum value of the VC-bound $p_s$. This also coincides with the empirical risk of the minimum number of outliers at K = 3 for both cases, λ = 0 and λ = 1.8. Note that we can always use a relatively larger λ value to eliminate more outliers in the intercluster area without affecting the optimal cluster number in a general clustering problem. The black dot points are the outliers identified by the RIC in all pictures.

Example 3. This is an instructive example to show the application of the RIC algorithm with λ = 0 for a data set without clear cluster boundaries in a two-dimensional space. The data set has 300 data points, such that the ratio $l/h_7 = 300/(3 * 7)$ is well below the critical number 20. We shall search for an optimal cluster number from two to seven clusters. In particular, to show the difference between the empirical risk η and the VC-bound $p_s$, we indicate both values for each case. Figure 6 illustrates that the optimal cluster number is four, based on the SRM principle. It is interesting to note that the five-cluster case also achieves the minimum number of outliers in the sense of empirical risk minimization, but its VC-bound $p_s$ is bigger than that of the four-cluster case because of the increase in the VC-dimension.

6 Conclusion

A robust information clustering algorithm is developed based on the minimax optimization of MI. In addition to the algorithm, the theoretical contributions of this letter are twofold: (1) the capacity maximization is implicitly linked to the distortion measure against the input pattern pmf and provides an upper bound of the empirical risk to phase out outliers, and (2) the optimal cluster number is estimated based on the SRM principle of statistical learning theory. The RIC can also be extended to the c-shells or kernel-based algorithms to deal with linearly nonseparable data. This is an interesting topic for further research.

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlier version of this letter.

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011–1015.


Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460–473.

Blahut, R. E. (1988). Principles and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270–293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158–171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinek, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98–110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89–104.

MacKay, D. J. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035–1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.

Scholkopf, B., Smola, A., & Muller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434–448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 32, 440–448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483–2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.



6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 5: A Robust Information Clustering Algorithm

2676 Q Song

where V represents the relevant information data set against the input dataset X and assumes that the joint distribution p(V W) is known approxi-mately I (W V) and I (W X) are the mutual information T is a temperatureparameter (refer to the next section for details) A key point of this approachis that it presents implicitly a structural risk minimization problem (Vapnik1998) and uses the corrected mutual information to search for a risk boundat an optimal temperature (cluster number)

3 Rate Distortion Function and DA Clustering

The original DA clustering is an optimal algorithm in term of insensitivityto the volume of input patterns in the respective cluster This avoids localminima of the hard clustering algorithm like k-means and splits the wholeinput data set into effective clusters in the annealing procedure (Rose 1998)However the DA algorithm is not robust against disturbance and outliersbecause it tends to associate the membership of a particular pattern in allclusters with equal probability distribution (Dave amp Krishnapuram 1997)Furthermore the DA is an inherent empirical risk-minimization algorithmThis explores details of the input data structure without a limit and needs apreselected maximum cluster number Kmax to stop the annealing procedureTo solve these problems we first investigate the original rate distortionfunction which lays the theoretical foundation of the DA algorithm (Blahut1988 Gray 1990)

The definition of the rate distortion function is defined as (Blahut 1988)

R(D(plowast(X))) = minp(W|X)

I (plowast(X) p(W|X)) (31)

I (plowast(X) p(W|X))

= p(W|X)

[lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi ) lnp(wk |xi )suml

i=1 p(wk |xi )plowast(xi )

] (32)

with the constraint

lsumi=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) le D(plowast(X))

=lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) (33)

A Robust Information Clustering Algorithm 2677

where I (plowast(X) p(W|X)) is the mutual information2 p(wk |xi ) isin p(W|X) is thetitled distribution

p(wk |xi ) = p(wk) exp(minusd(wk xi )T)Nxi

(34)

where the normalized factor is

Nxi =Ksum

k=1

p(wk) exp(minusd(wk xi )T) (35)

with the induced unconditional pmf

p(wk) =lsum

i=1

p(wk xi ) =lsum

i=1

plowast(xi )p(wk |xi ) k = 1 K (36)

p(wk |xi ) isin p(W|X) achieves a minimum point of the lower curveR(D(plowast(X))) in Figure 1 at a specific temperature T (Blahut 1988) plowast(xi ) isinplowast(X) is a fixed unconditional a priori pmf (normally as an equal distributionin DA clustering Rose 1998)

The rate distortion function is usually investigated in term of a parameters = minus1T with T isin (0 infin) This is introduced as a Lagrange multiplier andequals the slope of the rate distortion function curve as shown in Figure 1in classical information theory (Blahut 1988) T is also referred as the tem-perature parameter to control the data clustering procedure as its value islowered from infinity to zero (Rose 1998) Therefore the rate distortionfunction can be presented as a constraint optimization problem

R(D(plowast(X))) = minp(W|X)

I (plowast(X) p(W|X))

minus s

(lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) minus D(plowast(X))

) (37)

One important property of R(D (plowast(X))) is that it is a decreasing convexand continuous function defined in the interval 0 le D (plowast(X)) le Dmax for

2 The MI I (plowast(X) p(W|X)) has another notation I (X W) similar to the one used inequation 24 However as pointed out by Blahut (1988) the latter may not be the bestnotation for the optimization problem because it suggests that MI is merely a functionof the variable vectors X and W For the same reason we use probability distributionnotation for all related functions For example the rate distortion function is presentedas R(D(plowast(X))) which is a bit more complicated than the original paper (Blahut 1972)This inconvenience turns out to be worth it as we study the related RIC capacity problemwhich is coupled closely with the rate distortion function as shown in the next section

2678 Q Song

( ( (X)))R D p

I

1T

( W |X )min ( (W | X) (X))p

F p p0 T

Empirical Risk Minimization Optimal Saddle Point

( (X))

( (X))

D p

D p

( (X)) ( (X))D p D p

Capacity Maximization

( (X)) ( (X))D p D p

maxD

( ( (X )))C D p

Figure 1 Plots of the rate distortion function and capacity curves for anyparticular cluster number K le Kmax The plots are parameterized by the tem-perature T

any particular cluster number 0 lt K le Kmax as shown in Figure 1 (Blahut1972)

Define the DA clustering objective function as (Rose 1998)

F (plowast(X) p(W|X)) = I (plowast(X) p(W|X))

minus slsum

i=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) (38)

The rate distortion function

R(D(plowast(X))) = s D(plowast(X)) + minp(W|X)

F (plowast(X) p(W|X)) (39)

is minimized by the titled distribution 34 (Blahut 1972)From the data clustering point of view equations 22 and 38 are well

known to be soft dissimilarity measures of different clusters (Dave ampKrishnapuram 1997) To accommodate the DA-based RIC algorithm in asingle framework of classical information theory we use a slightly differenttreatment from the original paper of Rose (1998) for the DA clustering algo-rithm that is to minimize equation 38 with respect to the free pmf p(wk |xi )rather than the direct minimization against the cluster center W This recasts

A Robust Information Clustering Algorithm 2679

the clustering optimization problem as that of seeking the distribution pmfand minimizing equation 38 subject to a specified level of randomness Thiscan be measured by the minimization of the MI equation 31

The optimization is now to minimize the function F (plowast(X) p(W|X))which is a by-product of the MI minimization over the titled distributionp(wk |xi ) to achieve a minimum distortion and leads to the mass-constrainedDA clustering algorithm

Plugging equation 34 into 38 the optimal objective function equation38 becomes the entropy functional in a compact form3

F (plowast(X) p(W|X)) = minuslsum

i=1

plowast(xi ) lnKsum

k=1

p(wk) exp (minusd(wk xi )T) (310)

Minimizing equation 310 against the cluster center wk we have

part F (plowast(X) p(W|X))partwk

=lsum

i=1

plowast(xi )p(wk |xi )(wk minus xi ) = 0 (311)

which leads to the optimal clustering center

wk =lsum

i=1

αikxi (312)

where

αik = plowast(xi )p(wk |xi )p(wk)

(313)

For any cluster number K le Kmax and a fixed arbitrary pmf set plowast(xi ) isinplowast(X) minimization of the clustering objective function 38 against the pmfset p(W|X) is monotone nonincrease and converges to a minimum point ofthe convex function curve at a particular temperature The soft distortionmeasure D(plowast(X)) in equation 33 and the MI equation 31 are minimizedsimultaneously in a sense of empirical risk minimization

3 F (plowast(X) p(W|X)) =lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi ) ln(

p(wk ) exp(minusd(wk xi )T)p(wk )Nxi

)minuss

lsumi=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) = minuslsum

i=1plowast(xi )

Ksumk=1

p(wk |xi ) ln Nxi

= minuslsum

i=1plowast(xi ) ln

Ksumk=1

p(wk ) exp(minusd(wk xi )T) (according to equation 34Ksum

k=1p(wk |xi ) = 1)

2680 Q Song

4 Minimax Optimization and the Structural Risk Minimization

41 Capacity Maximization and Input Data Reliability In the con-strained minimization of MI of the last section we obtain an optimal feed-forward transition probability a priori pmf p(wk |xi ) isin p(W|X) A backwardtransition probability a posteriori pmf p(xi |wk) isin p(X|W) can be obtainedthrough the Bayes formula

p(xi |wk) = p(xi )p(wk |xi )sumli=1 p(xi )p(wk |xi )

= p(xi )p(wk |xi )p(wk)

(41)

The backward transition probability is useful to assess the realizabilityof the input data set in classical information theory Directly using the pmfequation 41 yields an optimization problem by simply evaluating a singlepmf p(xi |wk) and is not a good idea to reject outlier (Mackay 1999) How-ever we can use the capacity function of classical information theory This isdefined by maximizing an alternative presentation of the MI against inputprobability distribution

C = maxp(X)

I (p(X) p(W|X)) (42)

with

I (p(X) p(W|X)) = I (p(X) p(X|W))

=lsum

i=1

Ksumk=1

p(xi )p(wk |xi ) lnp(xi |wk)

p(xi ) (43)

where C is a constant represented the channel capacityNow we are in a position to introduce the channel reliability of classical

information theory (Bluhat 1988) To deal with the input data uncertaintythe MI can be presented in a simple channel entropy form

I (p(X) p(X|W)) = H(p(X)) minus H(p(W) p(X|W)) (44)

where the first term represents uncertainty of the channel input variable X4

H(p(X)) = minuslsum

i=1

p(xi ) ln(p(xi )) (45)

4 In nats (per symbol) since we use the natural logarithm basis rather bits (per symbol)in log2 function Note that we use the special entropy notations H(p(X)) = H(X) andH(p(W) p(X|W)) = H(X|W) here

A Robust Information Clustering Algorithm 2681

and the second term is conditional entropy

H(p(W) p(X|W)) = minuslsum

i=1

Ksumk=1

p(wk)p(xi |wk) ln p(xi |wk) (46)

Lemma 1 (inverse theorem) 5 The clustering data reliability is presented in asingle symbol error pe of the input data set with empirical error probability

pe =lsum

i=1

Ksumk =i

p(xi |wk) (47)

such that if the input uncertainty H(p(X)) is greater than C the error pe is boundedaway from zero as

pe ge 1ln l

(H(p(X)) minus C minus 1) (48)

Proof We first give an intuitive discussion here over Fanorsquos inequality (SeeBlahut 1988 for a formal proof)

Uncertainty in the estimated channel input can be broken into two partsthe uncertainty in the channel whether an empirical error pe was made andgiven that an error is made the uncertainty in the true value However theerror occurs with probability pe such that the first uncertainty is H(pe ) =minus(1 minus pe ) ln(1 minus pe ) and can be no larger than ln(l) This occurs only whenall alternative errors are equally likely Therefore if the equivocation can beinterpreted as the information lost we should have Fanorsquos inequality

H(p((W)) p(X|W)) le H(pe ) + pe ln (l) (49)

Now consider that the maximum of the MI is C in equation 42 so we canrewrite equation 44 as

H(p(W) p(X|W)) = H(p(X)) minus I (p(X) p(X|W)) ge H(p(X)) minus C (410)

Then Fanorsquos inequality is applied to get

H(p(X)) minus C le H(pe ) + pe ln(l) le 1 + pe ln l (411)

5 There is a tighter bound pe compared to the one of lemma 1 as in the work of Jelinet(1968) However this may not be very helpful since minimization of the empirical risk isnot necessary to minimize the real structural risk as shown in section 43

2682 Q Song

Lemma 1 gives an important indication that any income information (inputdata) beyond the capacity C will generate unreliable data transmission Thisis also called the inverse theorem in a sense that it uses the DA-generatedoptimal titled distribution to produce the backward transition probabilityequation 41 and assess an upper bound of the empirical risk equation 410

42 Capacity Maximization and the Optimal Solution Equation 33 iswell known to be a soft dissimilarity measure minimized by the DA clus-tering as the temperature T is lowered toward zero (Rose 1998) Howeverthere is no way for the DA to search for an optimal temperature value andin turn an optimal cluster number because the rate distortion function pro-vides only limited information and aims at the empirical risk minimizationas shown in section 3 Therefore we propose a capacity or MI maximizationscheme This is implicitly dependent on the distortion measure similar tothe rate distortion function

We define a constrained maximization of MI as6

C(D(p(X))) = maxp(X)

C(D(p(X))) = maxp(X)

I (p(X) p(W|X)) (412)

with a similar constraint as in equation 33

D(p(X)) =lsum

i=1

Ksumk=1

p(xi )p(wk |xi )d(wk xi ) le D(plowast(X)) (413)

This is because minimization of the soft distortion measure D(plowast(X)) equa-tion 33 is the ultimate target of the DA clustering algorithm as analyzed insection 3 We need to assess maximum possibility to make an error (risk)According to lemma 1 reliability of the input data set depends on the capac-ity that is the maximum value of the MI against the input density estimateTo do this we evaluate the optimal a priori pmf robust density distributionpmf p(xi ) isin (p(X)) to replace the fixed arbitrary plowast(xi ) in the distortion mea-sure equation 33 and assess reliability of the input data of each particularcluster number K based on a posteriori pmf in equation 41 If most of thedata points (if not all) achieve the capacity (fewer outliers) then we canclaim that the clustering result reaches an optimal or at least a subopti-mal solution at this particular cluster number in a sense of empirical riskminimization

6 Here we use a similar notation of the capacity function as for the rate distortionfunction R(D(p(X))) to indicate implicitly that the specific capacity function is in fact animplicit function of the distortion measure D(p(X)) For each particular temperature T the capacity C(D(p(X))) achieves a point at the upper curve corresponding to the lowercarve R(D(plowast(X))) as shown in equation 417

A Robust Information Clustering Algorithm 2683

Similar to the minimization of the rate distortion function in section 3constrained capacity maximization can be rewritten as an optimizationproblem with a Lagrange multiplier λ ge 0

C(D(p(X))) = maxp(X)

[I (p(X) p(W|X)) + λ(D(plowast(X)) minus D(p(X)))] (414)

Theorem 1 Maximum of the constrained capacity C(D(p(X))) is achieved by therobust density estimate

p(xi ) =exp

(sumKk=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )

)suml

i=1 exp(sumK

k=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )) (415)

with the specific distortion measure D(p(X)) = D(plowast(X)) for p(xi ) ge 0 of all 0 lei le l

Proof Similar to Blahut (1972) we can temporarily ignore the conditionp(xi ) ge 0 and set the derivative of the optimal function 414 equal to zeroagainst the independent variable a priori pmf p(xi ) This results in

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)= 0

= minus ln p(xi ) minus 1 +Ksum

k=1

p(wk |xi )(ln p(xi |wk)

minusλp(wk |xi )d(wk xi )) + λ1 p(xi ) (416)

We also select a suitable λ1 which ensure that the probability constraintsumli=1 p(xi ) = 1 is guaranteed and leads to the robust density distribution

estimate equation 415According to the Kuhn-Tucker theorem (Blahut 1988) if there exists an

optimal robust distribution p(xi ) which is derived from equation 415 thenthe inequality constraint equation 413 of the distortion measure becomesequality and achieves the optimal solution of equation 414 at an optimalsaddle point between the curve C(D(p(X))) and R(D(plowast(X))) with the cor-responding average distortion measure

D(p(X)) = D(plowast(X))) (417)

By dividing the input data into effective clusters the DA clustering min-imizes the relative Shannon entropy without a priori knowledge of the datadistribution (Gray 1990) The prototype (cluster center) equation 312 is

2684 Q Song

1w 2w

1

1

(w | x )( (w | x )ln )

(w | x ) (x )

(x ) 0

Kk i

k i lk

k i ii

i

pp C

p p

p

2(w | x )ip1(w | x )ip

2(x | w )ip1(x | w )ip

Figure 2 The titled distribution and robust density estimation based on theinverse theorem for a two-cluster data set

clearly presented as a mass center This is insensitive to the initialization ofcluster centers and volumes with a fixed probability distribution for exam-ple an equal value plowast(xi ) = 1 l for the entire input data points (Rose 1998)Therefore the prototype parameter αki depends on the titled distributionp(wk |xi ) equation 34 which tends to associate the membership of any par-ticular pattern in all clusters and is not robust against outlier or disturbanceof the training data (Dave amp Krishnapuram 1997) This in turn generatesdifficulties in determining an optimal cluster number as shown in Figure 2(see also the simulation results) Any data point located around the middleposition between two effective clusters could be considered an outlier

Corollary 1 The capacity curve C(D(p(X))) is continuous nondecreasing andconcave on D(p(X)) for any particular cluster number K

Proof Let pprime(xi ) isin pprime(X) and pprimeprime(xi) isin pprimeprime(X) achieve [D(pprime(X)) C(D(pprime(X)))]and [D(pprimeprime(X)) C(D(pprimeprime(X)))] respectively and p(xi ) = λprime pprime(xi ) + λprimeprime pprimeprime(xi ) isan optimal density estimate in theorem 1 where λprimeprime = 1 minus λprime and 0 lt λprime lt 1Then

D(p(X)) =lsum

i=1

Ksumk=1

(λprime pprime(xi ) + λprimeprime pprimeprime(xi ))p(wk |xi )d(wk xi )

= λprime D(pprime(X)) + λprimeprime D(pprimeprime(X)) (418)

A Robust Information Clustering Algorithm 2685

and because p(X) is the optimal value we have

C(D(p(X))) ge I (p(X) p(W|X)) (419)

Now we use the fact that I (p(X) p(W|X)) is concave (upward convex) inp(X) (Jelinet 1968 Blahut 1988) and arrive at

C(D(p(X))) ge λprime I (pprime(X) p(W|X)) + λprimeprime I (pprimeprime(X) p(W|X)) (420)

We have finally

C(λprime D(pprime(X)) + λprimeprime D(pprimeprime(X))) ge λprimeC(D(pprime(X))) + λprimeprimeC(D(pprimeprime(X))) (421)

Furthermore because C(D(p(X))) is concave on [0 Dmax] it is continuousnonnegative and nondecreasing to achieve the maximum value at Dmaxwhich must also be strictly increased for D(p(X)) smaller than Dmax

Corollary 2 The robust distribution estimate p(X) achieves the capacity at

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)= V forallp(xi ) = 0

(422)

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)ltV forallp(xi ) = 0

(423)

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik1998)

p(xi )

[V minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

))]= 0 foralli (424)

Proof Similar to the proof of theorem 1 we use the concave property ofC(D(p(X)))

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)ge 0 (425)

2686 Q Song

which can be rewritten as

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)le minusλ1 + 1 foralli

(426)

with equality for all p(xi ) = 0 Setting minusλ1 + 1 = V completes the proof

Similarly it is easy to show that if we choose λ = 0 the Kuhn-Tucker con-dition becomes

p(xi )

[C minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

))]= 0 foralli (427)

where C is the maximum capacity value defined in equation 42Note that the MI is not negative However individual items in the sum of

the capacity maximization equation 42 can be negative If the i th patternxi is taken into account and p(wk |xi ) lt

sumli=1 p(xi )p(wk |xi ) then the prob-

ability of the kth code vector (cluster center) is decreased by the observedpattern and gives negative information about pattern xi This particularinput pattern may be considered an unreliable pattern (outlier) and itsnegative effect must be offset by other input patterns Therefore the max-imization of the MI equation 42 provides a robust density estimation ofthe noisy pattern (outlier) in terms that the average information is over allclusters and input patterns The robust density estimation and optimiza-tion is now to maximize the MI against the pmf p(xi ) and p(xi |wk) for anyvalue of i if p(xi |wk) = 0 then p(xi ) should be set equal to zero in orderto obtain the maximum such that a corresponding training pattern (outlier)xi can be deleted and dropped from further consideration in the optimiza-tion procedure as outlier shown in Figure 2

As a by-product the robust density estimation leads to an improvedcriterion at calculation of the critical temperature to split the input data setinto more clusters of the RIC compared to the DA as the temperature islowered (Rose 1998) The critical temperature of the RIC can be determinedby the maximum eigenvalue of the covariance (Rose 1998)

VXW =lsum

i=1

p(xi |wk)(xi minus wk)(xi minus wk)T (428)

where p(xi |wk) is optimized by equation 41 This has a bigger value repre-senting the reliable data since the channel communication error pe is rela-tively smaller compared to the one of outlier (see lemma 1)

A Robust Information Clustering Algorithm 2687

43 Structural Risk Minimization and Optimal Cluster Number Tosolve the intertwined outlier and cluster number problem some intuitivenotations can be obtained based on classical information theory as presentedthe previous sections Increasing K and model complexity (as the tempera-ture is lowered) may reduce capacity C(D(p(X))) since it is a nondecreasingfunction of D(p(X)) as shown in corollary 1 (see also Figure 1) Thereforein view of theorem 1 we should use the smallest cluster number as longas a relatively small number of outliers is achieved (if not zero outlier) say1 percent of the entire input data points However how to make a trade-off between empirical risk minimization and capacity maximization is adifficult problem for classical information theory

We can solve this difficulty by bridging the gap between classical infor-mation theory on which the RIC algorithm is based and the relatively newstatistical learning theory with the so-called structural risk minimization(SRM) principle (Vapnik 1998) Under the SRM a set of admissible struc-tures with nested subsets can be defined specifically for the RIC clusteringproblem as

S1 sub S2 sub sub SK (429)

where SK = (QK (xi W) W isin K ) foralli with a set of indicator functions ofthe empirical risk7

QK (xi W) =Ksum

k=1

limTrarr0

p(wk |xi ) =Ksum

k=1

limTrarr0

p(wk) exp(minusd(xi wk)T)Nxi

foralli

(430)

We shall show that the titled distribution p(wk |xi ) equation 34 at zero tem-perature as in equation 430 can be approximated by the complement ofa step function This is linear in parameters and assigns the cluster mem-bership of each input data point based on the Euclidean distance betweendata point xi and cluster center wk for a final hard clustering partition (Rose1998 see also the algorithm in section 44)

The titled distribution at T rarr 0 can be presented as

limTrarr0

p(wk) exp(minusd(xi wk)T)sumKk=1 p(wk) exp(minusd(xi wk)T)

7 According to definition of the titled distribution equation 34 it is easy to see thatthe defined indictor function is a constant number that is QK (xi W) = 1 See also note 3

2688 Q Song

asymp

p(wk) exp(minusd0(xi wk))p(wk) exp(minusd0(xi wk))

= 1 if d0(xi wk) = infin

p(wk) exp(minusd0(xi wk))sumKk=1 p(wk) exp(minusd0(xi wk))

= 0 if d0(xi wk) rarr infin

(431)

Now consider the radius d0(xi wk) between data point xi and clusterk at zero temperature This can be rewritten as an inner product of twon-dimensional vectors of the input space as

d0(xi wk) = limTrarr0

d(xi wk)T

= limTrarr0

lt xi minus wk gt lt xi minus wk gt

T

=nsum

o=1

rkoφko(X) (432)

where rko represents the radius parameter component in the n-dimensionalspace and φko(X) is a linearly independent function similar to the hyper-plane case (Vapnik 1998)

Using equations 432 and 431 we can rewrite 430 as

QK (xi W) =Ksum

k=1

θ

(nsum

o=1

rkoφko(X) minus d0(xi wk)

) foralli (433)

where θ () = 1 minus θ () is the complement of the step function θ ()Note that there is one and only one d0(xi wk) = infin forall(1 le k le K ) in each

conditional equality of equation 431 since it gives a unique cluster mem-bership of any data point xi in a nested structure SK Therefore the indi-cator QK (xi W) is linear in parameters According to Vapnik (1998) theVC-dimension of the complexity control parameter is equal to the numberof parameters hK = (n + 1) lowast K for each nested subset SK By design of theDA clustering the nested structure in equation 429 provides ordering ofthe VC-dimension h1 le h2 le le hK such that the increase of clusternumber is proportional to the increase of the estimated VC-dimension froma neural network point of view (Vapnik 1998)

To obtain good generalization performance one has to use the admissiblestructure equation 429 based on the set of indicator functions to search foran optimal cluster number K This minimizes a VC-bound ps similar to thatof the support vector machine except that we are looking for the strongestdata point of the input space instead of seeking the weakest data point ofthe feature (kernel) space (Vapnik 1998) So we have

ps le η + ε

2

(1 +

(1 + η

)12)

(434)

A Robust Information Clustering Algorithm 2689

with

η = ml

(435)

ε = 4hK

(ln 2l

hK+ 1

)minus ln ζ

4

l (436)

where m is the number of outliers identified in the capacity maximizationas in the previous section ζ lt 1 is a constant

The signal-to-noise ratio η in equation 435 appears as the first term ofthe right-hand side of the VC-bound equation 434 This represents theempirical risk and the second term is the confidence interval of the SRM-based estimate

Discussion

Stop criterion and optimal cluster number At the initial DA clusteringstage with a small cluster number K and relatively large ratio betweenthe number of input data points and the VC-dimension say l

hKgt 20

(Vapnik 1998) the real risk VC-bound equation 434 is mainly deter-mined by the first term of the right-hand side of the inequality thatis the empirical risk (signal-to-noise) ratio η in equation 435 As thetemperature is lowered and the cluster number is increased a rela-tively small ratio l

hKmay require both terms in the right-hand side of

equation 434 to be small simultaneously Therefore we can assess firstthe ratio l(hK ) which is near the upper bound of the critical number20 for a maximum cluster number K = Kmax beyond which the sec-ond term of the VC-bound equation 434 may become dominant evenfor a small empirical risk ratio η especially in a high-dimensional dataspace Therefore we can follow the minimax MI optimization as insections 3 and 4 to increase the cluster number from one until Kmax fora minimum value of the VC-bound that is take a trade-off betweenminimization of the empirical risk and VC-dimension

Selection of λ The degree of robustness of the RIC algorithm is con-trolled by the parameter λ The Kuhn-Tucker condition in corollary 2tells that a relatively larger value of λ yields more outliers (noisy pat-terns) If one chooses λ = 0 the RIC allows the maximum empiricalrisk with a possible overcapacity distortion beyond the optimal saddlepoint and a minimum number of the estimated outliers (see Figure 1)In a general clustering problem using the L2 distortion measure equa-tion 22 selection of the λ is insensitive to determination of an optimalcluster number because the VC-bound depends on only the relativevalues of η and hK over different cluster numbers (see also example 2)

2690 Q Song

As a general rule of thumb if eliminating more outliers is an interestwe can gradually increase λ and redo the capacity maximization toreject outliers located between intercluster boundaries at an optimalcluster number determined by an arbitrary value of λ

44 Implementation of the RIC Algorithm

Phase I (Minimization)

1 Determine the ratio l(n lowast K ) which is near the critical number 20 fora maximum cluster number K = Kmax and p(xi ) = 1 l for i = 1 to l

2 Initialize T gt 2Emax(Vx) where Emax is the largest eigenvalue of thevariance matrix Vx of the input pattern set X K = 1 and p(w1) = 1

3 For i = 1 K of the fixed-point iteration of the DA clustering ac-cording to equations 34 415 and 312

4 Convergence test If not satisfied go to 3

5 If T le Tmin perform the last iteration and stop

6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 6: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2677

where I (plowast(X) p(W|X)) is the mutual information2 p(wk |xi ) isin p(W|X) is thetitled distribution

p(wk |xi ) = p(wk) exp(minusd(wk xi )T)Nxi

(34)

where the normalized factor is

Nxi =Ksum

k=1

p(wk) exp(minusd(wk xi )T) (35)

with the induced unconditional pmf

p(wk) =lsum

i=1

p(wk xi ) =lsum

i=1

plowast(xi )p(wk |xi ) k = 1 K (36)

p(wk |xi ) isin p(W|X) achieves a minimum point of the lower curveR(D(plowast(X))) in Figure 1 at a specific temperature T (Blahut 1988) plowast(xi ) isinplowast(X) is a fixed unconditional a priori pmf (normally as an equal distributionin DA clustering Rose 1998)

The rate distortion function is usually investigated in term of a parameters = minus1T with T isin (0 infin) This is introduced as a Lagrange multiplier andequals the slope of the rate distortion function curve as shown in Figure 1in classical information theory (Blahut 1988) T is also referred as the tem-perature parameter to control the data clustering procedure as its value islowered from infinity to zero (Rose 1998) Therefore the rate distortionfunction can be presented as a constraint optimization problem

R(D(plowast(X))) = minp(W|X)

I (plowast(X) p(W|X))

minus s

(lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) minus D(plowast(X))

) (37)

One important property of R(D(p*(X))) is that it is a decreasing, convex, and continuous function defined in the interval 0 ≤ D(p*(X)) ≤ Dmax for any particular cluster number 0 < K ≤ Kmax, as shown in Figure 1 (Blahut, 1972).

[Figure 1: Plots of the rate distortion function R(D(p*(X))) (lower curve, empirical risk minimization) and the capacity curve C(D(p(X))) (upper curve, capacity maximization), with the optimal saddle point between them, for any particular cluster number K ≤ Kmax. The plots are parameterized by the temperature T.]

2 The MI I(p*(X), p(W|X)) has another notation, I(X, W), similar to the one used in equation 2.4. However, as pointed out by Blahut (1988), the latter may not be the best notation for the optimization problem, because it suggests that MI is merely a function of the variable vectors X and W. For the same reason, we use probability distribution notation for all related functions. For example, the rate distortion function is presented as R(D(p*(X))), which is a bit more complicated than in the original paper (Blahut, 1972). This inconvenience turns out to be worth it as we study the related RIC capacity problem, which is coupled closely with the rate distortion function, as shown in the next section.

Define the DA clustering objective function as (Rose, 1998)

F(p^*(X), p(W|X)) = I(p^*(X), p(W|X)) - s \sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k | x_i) d(w_k, x_i).   (3.8)

The rate distortion function

R(D(p^*(X))) = s D(p^*(X)) + \min_{p(W|X)} F(p^*(X), p(W|X))   (3.9)

is minimized by the tilted distribution 3.4 (Blahut, 1972).

From the data clustering point of view, equations 2.2 and 3.8 are well known to be soft dissimilarity measures of different clusters (Dave & Krishnapuram, 1997). To accommodate the DA-based RIC algorithm in a single framework of classical information theory, we use a slightly different treatment from the original paper of Rose (1998) for the DA clustering algorithm; that is, we minimize equation 3.8 with respect to the free pmf p(w_k|x_i) rather than directly minimizing against the cluster center W. This recasts


the clustering optimization problem as that of seeking the distribution pmf and minimizing equation 3.8 subject to a specified level of randomness, which can be measured by the minimization of the MI, equation 3.1.

The optimization is now to minimize the function F(p*(X), p(W|X)), which is a by-product of the MI minimization over the tilted distribution p(w_k|x_i) to achieve a minimum distortion and leads to the mass-constrained DA clustering algorithm.

Plugging equation 3.4 into 3.8, the optimal objective function, equation 3.8, becomes the entropy functional in a compact form:3

F(p^*(X), p(W|X)) = - \sum_{i=1}^{l} p^*(x_i) \ln \sum_{k=1}^{K} p(w_k) \exp(-d(w_k, x_i)/T).   (3.10)

Minimizing equation 3.10 against the cluster center w_k, we have

\frac{\partial F(p^*(X), p(W|X))}{\partial w_k} = \sum_{i=1}^{l} p^*(x_i) p(w_k | x_i) (w_k - x_i) = 0,   (3.11)

which leads to the optimal clustering center

w_k = \sum_{i=1}^{l} \alpha_{ik} x_i,   (3.12)

where

\alpha_{ik} = \frac{p^*(x_i) p(w_k | x_i)}{p(w_k)}.   (3.13)

For any cluster number K ≤ Kmax and a fixed arbitrary pmf set p*(x_i) ∈ p*(X), minimization of the clustering objective function 3.8 against the pmf set p(W|X) is monotone nonincreasing and converges to a minimum point of the convex function curve at a particular temperature. The soft distortion measure D(p*(X)) in equation 3.3 and the MI, equation 3.1, are minimized simultaneously in a sense of empirical risk minimization.
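A minimal sketch of this fixed-point minimization at a fixed temperature T is given below; it alternates the tilted distribution of equations 3.4 to 3.6 with the centroid update of equations 3.12 and 3.13, again assuming the squared Euclidean distortion and a uniform p*(x_i). The names are illustrative, not the author's code.

import numpy as np

def da_fixed_point(X, W, p_w, T, n_sweeps=100, tol=1e-6):
    # One empirical-risk-minimization stage: returns updated centers W, the cluster pmf
    # p(w_k), and the tilted distribution p(w_k|x_i).
    l = len(X)
    p_star_x = np.full(l, 1.0 / l)
    for _ in range(n_sweeps):
        d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        logits = np.log(p_w)[None, :] - d / T
        logits -= logits.max(axis=1, keepdims=True)
        p_w_given_x = np.exp(logits)
        p_w_given_x /= p_w_given_x.sum(axis=1, keepdims=True)    # equation 3.4
        p_w = p_star_x @ p_w_given_x                             # equation 3.6
        alpha = p_star_x[:, None] * p_w_given_x / p_w[None, :]   # equation 3.13
        W_new = alpha.T @ X                                      # equation 3.12
        if np.linalg.norm(W_new - W) < tol:                      # centers have converged
            return W_new, p_w, p_w_given_x
        W = W_new
    return W, p_w, p_w_given_x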

3 F(p^*(X), p(W|X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k|x_i) \ln\!\Big(\frac{p(w_k)\exp(-d(w_k, x_i)/T)}{p(w_k) N_{x_i}}\Big) - s \sum_{i=1}^{l} \sum_{k=1}^{K} p^*(x_i) p(w_k|x_i) d(w_k, x_i) = -\sum_{i=1}^{l} p^*(x_i) \sum_{k=1}^{K} p(w_k|x_i) \ln N_{x_i} = -\sum_{i=1}^{l} p^*(x_i) \ln \sum_{k=1}^{K} p(w_k) \exp(-d(w_k, x_i)/T) (according to equation 3.4 and \sum_{k=1}^{K} p(w_k|x_i) = 1).


4 Minimax Optimization and the Structural Risk Minimization

4.1 Capacity Maximization and Input Data Reliability. In the constrained minimization of MI of the last section, we obtain an optimal feedforward transition probability, the a priori pmf p(w_k|x_i) ∈ p(W|X). A backward transition probability, the a posteriori pmf p(x_i|w_k) ∈ p(X|W), can be obtained through the Bayes formula

p(x_i | w_k) = \frac{p(x_i) p(w_k | x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k | x_i)} = \frac{p(x_i) p(w_k | x_i)}{p(w_k)}.   (4.1)

The backward transition probability is useful to assess the realizability of the input data set in classical information theory. Directly using the pmf of equation 4.1 yields an optimization problem that simply evaluates a single pmf p(x_i|w_k) and is not a good way to reject outliers (MacKay, 1999). However, we can use the capacity function of classical information theory, which is defined by maximizing an alternative presentation of the MI against the input probability distribution:

C = \max_{p(X)} I(p(X), p(W|X)),   (4.2)

with

I(p(X), p(W|X)) = I(p(X), p(X|W)) = \sum_{i=1}^{l} \sum_{k=1}^{K} p(x_i) p(w_k | x_i) \ln \frac{p(x_i | w_k)}{p(x_i)},   (4.3)

where C is a constant representing the channel capacity.

Now we are in a position to introduce the channel reliability of classical information theory (Blahut, 1988). To deal with the input data uncertainty, the MI can be presented in a simple channel entropy form,

I(p(X), p(X|W)) = H(p(X)) - H(p(W), p(X|W)),   (4.4)

where the first term represents the uncertainty of the channel input variable X,4

H(p(X)) = - \sum_{i=1}^{l} p(x_i) \ln p(x_i),   (4.5)

4 In nats (per symbol), since we use the natural logarithm rather than bits (per symbol) with the log2 function. Note that we use the special entropy notations H(p(X)) = H(X) and H(p(W), p(X|W)) = H(X|W) here.


and the second term is the conditional entropy

H(p(W), p(X|W)) = - \sum_{i=1}^{l} \sum_{k=1}^{K} p(w_k) p(x_i | w_k) \ln p(x_i | w_k).   (4.6)
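As an illustration, the next sketch computes the backward pmf of equation 4.1 and the entropy decomposition of equations 4.4 to 4.6 from a given input pmf and forward transition pmf; the small constants only guard the logarithm at zero and are an implementation detail, not part of the original formulation.

import numpy as np

def channel_quantities(p_x, p_w_given_x):
    # p_x: (l,) input pmf p(x_i); p_w_given_x: (l, K) forward transition pmf p(w_k|x_i).
    p_w = p_x @ p_w_given_x                                    # induced p(w_k)
    p_x_given_w = (p_x[:, None] * p_w_given_x) / p_w[None, :]  # Bayes formula, equation 4.1
    H_x = -np.sum(p_x * np.log(p_x + 1e-300))                  # H(p(X)), equation 4.5
    H_x_given_w = -np.sum(p_w[None, :] * p_x_given_w
                          * np.log(p_x_given_w + 1e-300))      # conditional entropy, equation 4.6
    mi = H_x - H_x_given_w                                     # mutual information, equation 4.4
    return p_x_given_w, H_x, H_x_given_w, mi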

Lemma 1 (inverse theorem).5 The clustering data reliability is presented in a single-symbol error p_e of the input data set, with empirical error probability

p_e = \sum_{i=1}^{l} \sum_{k \neq i}^{K} p(x_i | w_k),   (4.7)

such that if the input uncertainty H(p(X)) is greater than C, the error p_e is bounded away from zero as

p_e \geq \frac{1}{\ln l} \big( H(p(X)) - C - 1 \big).   (4.8)

Proof. We first give an intuitive discussion here based on Fano's inequality (see Blahut, 1988, for a formal proof).

Uncertainty in the estimated channel input can be broken into two parts: the uncertainty about whether an empirical error p_e was made and, given that an error is made, the uncertainty in the true value. The error occurs with probability p_e, so the first uncertainty is H(p_e) = -p_e \ln p_e - (1 - p_e) \ln(1 - p_e), and the second can be no larger than \ln(l), which occurs only when all alternative errors are equally likely. Therefore, if the equivocation can be interpreted as the information lost, we should have Fano's inequality,

H(p(W), p(X|W)) \leq H(p_e) + p_e \ln(l).   (4.9)

Now consider that the maximum of the MI is C in equation 4.2, so we can rewrite equation 4.4 as

H(p(W), p(X|W)) = H(p(X)) - I(p(X), p(X|W)) \geq H(p(X)) - C.   (4.10)

Then Fano's inequality is applied to get

H(p(X)) - C \leq H(p_e) + p_e \ln(l) \leq 1 + p_e \ln l.   (4.11)

5 There is a tighter bound on p_e than the one of lemma 1, as in the work of Jelinek (1968). However, this may not be very helpful, since minimizing the empirical risk does not necessarily minimize the real structural risk, as shown in section 4.3.


Lemma 1 gives an important indication that any incoming information (input data) beyond the capacity C will generate unreliable data transmission. This is also called the inverse theorem in the sense that it uses the DA-generated optimal tilted distribution to produce the backward transition probability, equation 4.1, and to assess an upper bound of the empirical risk, equation 4.10.

4.2 Capacity Maximization and the Optimal Solution. Equation 3.3 is well known to be a soft dissimilarity measure minimized by the DA clustering as the temperature T is lowered toward zero (Rose, 1998). However, there is no way for the DA to search for an optimal temperature value and, in turn, an optimal cluster number, because the rate distortion function provides only limited information and aims at empirical risk minimization, as shown in section 3. Therefore, we propose a capacity, or MI, maximization scheme that is implicitly dependent on the distortion measure, similar to the rate distortion function.

We define a constrained maximization of MI as6

C(D(p(X))) = \max_{p(X)} I(p(X), p(W|X)),   (4.12)

with a similar constraint as in equation 3.3,

D(p(X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} p(x_i) p(w_k | x_i) d(w_k, x_i) \leq D(p^*(X)).   (4.13)

This is because minimization of the soft distortion measure D(p*(X)), equation 3.3, is the ultimate target of the DA clustering algorithm, as analyzed in section 3. We need to assess the maximum possibility of making an error (risk). According to lemma 1, reliability of the input data set depends on the capacity, that is, the maximum value of the MI against the input density estimate. To do this, we evaluate the optimal a priori pmf, the robust density distribution pmf p(x_i) ∈ p(X), to replace the fixed arbitrary p*(x_i) in the distortion measure, equation 3.3, and assess the reliability of the input data at each particular cluster number K based on the a posteriori pmf in equation 4.1. If most of the data points (if not all) achieve the capacity (fewer outliers), then we can claim that the clustering result reaches an optimal, or at least a suboptimal, solution at this particular cluster number in a sense of empirical risk minimization.

6 Here we use a similar notation for the capacity function as for the rate distortion function R(D(p(X))) to indicate implicitly that the specific capacity function is in fact an implicit function of the distortion measure D(p(X)). For each particular temperature T, the capacity C(D(p(X))) achieves a point on the upper curve corresponding to the lower curve R(D(p*(X))), as shown in equation 4.17.


Similar to the minimization of the rate distortion function in section 3, the constrained capacity maximization can be rewritten as an optimization problem with a Lagrange multiplier λ ≥ 0:

C(D(p(X))) = \max_{p(X)} \big[ I(p(X), p(W|X)) + \lambda (D(p^*(X)) - D(p(X))) \big].   (4.14)

Theorem 1. The maximum of the constrained capacity C(D(p(X))) is achieved by the robust density estimate

p(x_i) = \frac{\exp\!\big(\sum_{k=1}^{K} p(w_k|x_i) \ln p(x_i|w_k) - \lambda p(w_k|x_i) d(w_k, x_i)\big)}{\sum_{i=1}^{l} \exp\!\big(\sum_{k=1}^{K} p(w_k|x_i) \ln p(x_i|w_k) - \lambda p(w_k|x_i) d(w_k, x_i)\big)},   (4.15)

with the specific distortion measure D(p(X)) = D(p*(X)), for p(x_i) ≥ 0 and all 1 ≤ i ≤ l.

Proof. Similar to Blahut (1972), we can temporarily ignore the condition p(x_i) ≥ 0 and set the derivative of the objective function 4.14 with respect to the independent variable, the a priori pmf p(x_i), equal to zero. This results in

\frac{\partial}{\partial p(x_i)} \Big( C(D(p(X))) + \lambda_1 \Big( \sum_{i=1}^{l} p(x_i) - 1 \Big) \Big) = -\ln p(x_i) - 1 + \sum_{k=1}^{K} \big( p(w_k|x_i) \ln p(x_i|w_k) - \lambda p(w_k|x_i) d(w_k, x_i) \big) + \lambda_1 = 0.   (4.16)

We also select a suitable λ_1 that ensures the probability constraint \sum_{i=1}^{l} p(x_i) = 1 is satisfied, which leads to the robust density distribution estimate, equation 4.15.

According to the Kuhn-Tucker theorem (Blahut, 1988), if there exists an optimal robust distribution p(x_i) derived from equation 4.15, then the inequality constraint, equation 4.13, of the distortion measure becomes an equality and achieves the optimal solution of equation 4.14 at an optimal saddle point between the curves C(D(p(X))) and R(D(p*(X))), with the corresponding average distortion measure

D(p(X)) = D(p^*(X)).   (4.17)
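The following sketch evaluates the robust density estimate of equation 4.15 directly from the current tilted distribution, backward pmf, and distortion matrix; lam stands for the multiplier λ, and the names are illustrative assumptions rather than the author's implementation.

import numpy as np

def robust_density(p_w_given_x, p_x_given_w, d, lam):
    # p_w_given_x, p_x_given_w, d: (l, K) arrays; lam: the Lagrange multiplier lambda >= 0.
    expo = np.sum(p_w_given_x * np.log(p_x_given_w + 1e-300)
                  - lam * p_w_given_x * d, axis=1)            # exponent of equation 4.15
    expo -= expo.max()                                        # numerical stabilization only
    p_x = np.exp(expo)
    return p_x / p_x.sum()                                    # normalize over i; p(x_i) near 0 flags outliers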

By dividing the input data into effective clusters, the DA clustering minimizes the relative Shannon entropy without a priori knowledge of the data distribution (Gray, 1990). The prototype (cluster center) of equation 3.12 is


[Figure 2: The tilted distribution and robust density estimation based on the inverse theorem for a two-cluster data set with centers w_1 and w_2. The figure annotates the tilted distributions p(w_1|x_i), p(w_2|x_i), the backward pmfs p(x_i|w_1), p(x_i|w_2), and the condition \sum_{k=1}^{K} p(w_k|x_i) \ln\big(p(w_k|x_i)/\sum_{i=1}^{l} p(x_i)p(w_k|x_i)\big) < C under which p(x_i) = 0.]

clearly presented as a mass center. This is insensitive to the initialization of cluster centers and volumes with a fixed probability distribution, for example, an equal value p*(x_i) = 1/l for all input data points (Rose, 1998). Therefore, the prototype parameter α_ik depends on the tilted distribution p(w_k|x_i), equation 3.4, which tends to associate the membership of any particular pattern with all clusters and is not robust against outliers or disturbances of the training data (Dave & Krishnapuram, 1997). This in turn generates difficulties in determining an optimal cluster number, as shown in Figure 2 (see also the simulation results): any data point located around the middle position between two effective clusters could be considered an outlier.

Corollary 1. The capacity curve C(D(p(X))) is continuous, nondecreasing, and concave in D(p(X)) for any particular cluster number K.

Proof. Let p′(x_i) ∈ p′(X) and p″(x_i) ∈ p″(X) achieve [D(p′(X)), C(D(p′(X)))] and [D(p″(X)), C(D(p″(X)))], respectively, and let p(x_i) = λ′p′(x_i) + λ″p″(x_i) be an optimal density estimate in theorem 1, where λ″ = 1 − λ′ and 0 < λ′ < 1. Then

D(p(X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} (\lambda' p'(x_i) + \lambda'' p''(x_i)) p(w_k | x_i) d(w_k, x_i) = \lambda' D(p'(X)) + \lambda'' D(p''(X)),   (4.18)


and because p(X) is the optimal value, we have

C(D(p(X))) \geq I(p(X), p(W|X)).   (4.19)

Now we use the fact that I(p(X), p(W|X)) is concave (upward convex) in p(X) (Jelinek, 1968; Blahut, 1988) and arrive at

C(D(p(X))) \geq \lambda' I(p'(X), p(W|X)) + \lambda'' I(p''(X), p(W|X)).   (4.20)

Finally, we have

C(\lambda' D(p'(X)) + \lambda'' D(p''(X))) \geq \lambda' C(D(p'(X))) + \lambda'' C(D(p''(X))).   (4.21)

Furthermore, because C(D(p(X))) is concave on [0, Dmax], it is continuous, nonnegative, and nondecreasing, achieving its maximum value at Dmax, and it must be strictly increasing for D(p(X)) smaller than Dmax.

Corollary 2. The robust distribution estimate p(X) achieves the capacity at

\sum_{k=1}^{K} \Big( p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \Big) = V, \quad \forall p(x_i) \neq 0,   (4.22)

\sum_{k=1}^{K} \Big( p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \Big) < V, \quad \forall p(x_i) = 0.   (4.23)

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik, 1998),

p(x_i) \Bigg[ V - \sum_{k=1}^{K} \Big( p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \Big) \Bigg] = 0, \quad \forall i.   (4.24)

Proof. Similar to the proof of theorem 1, we use the concavity of C(D(p(X))),

\frac{\partial}{\partial p(x_i)} \Big( C(D(p(X))) + \lambda_1 \Big( \sum_{i=1}^{l} p(x_i) - 1 \Big) \Big) \leq 0,   (4.25)


which can be rewritten as

\sum_{k=1}^{K} \Big( p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \Big) \leq -\lambda_1 + 1, \quad \forall i,   (4.26)

with equality for all p(x_i) ≠ 0. Setting −λ_1 + 1 = V completes the proof.

Similarly, it is easy to show that if we choose λ = 0, the Kuhn-Tucker condition becomes

p(x_i) \Bigg[ C - \sum_{k=1}^{K} p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} \Bigg] = 0, \quad \forall i,   (4.27)

where C is the maximum capacity value defined in equation 4.2.

Note that the MI is not negative. However, individual items in the sum of the capacity maximization, equation 4.2, can be negative. If the ith pattern x_i is taken into account and p(w_k|x_i) < \sum_{i=1}^{l} p(x_i)p(w_k|x_i), then the probability of the kth code vector (cluster center) is decreased by the observed pattern, which gives negative information about pattern x_i. This particular input pattern may be considered an unreliable pattern (outlier), and its negative effect must be offset by other input patterns. Therefore, the maximization of the MI, equation 4.2, provides a robust density estimation of the noisy pattern (outlier) in the sense that the average information is taken over all clusters and input patterns. The robust density estimation and optimization is now to maximize the MI against the pmfs p(x_i) and p(x_i|w_k): for any value of i, if p(x_i|w_k) = 0, then p(x_i) should be set equal to zero in order to obtain the maximum, such that the corresponding training pattern x_i can be deleted and dropped from further consideration in the optimization procedure as an outlier, as shown in Figure 2.

As a by-product, the robust density estimation leads to an improved criterion for calculating the critical temperature at which the RIC splits the input data set into more clusters, compared to the DA, as the temperature is lowered (Rose, 1998). The critical temperature of the RIC can be determined by the maximum eigenvalue of the covariance (Rose, 1998)

V_{XW} = \sum_{i=1}^{l} p(x_i | w_k)(x_i - w_k)(x_i - w_k)^T,   (4.28)

where p(x_i|w_k) is optimized by equation 4.1. This takes a bigger value for the reliable data, since their channel communication error p_e is relatively smaller than that of the outliers (see lemma 1).
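A short sketch of this criterion, under the assumption that the largest eigenvalue of V_XW in equation 4.28 sets the critical temperature T_c = 2E_max(V_XW) as in the DA of Rose (1998), is given below (illustrative names only).

import numpy as np

def critical_temperature(X, w_k, p_x_given_wk):
    # X: (l, n) data; w_k: (n,) center of cluster k; p_x_given_wk: (l,) backward pmf p(x_i|w_k).
    diffs = X - w_k                                                    # (x_i - w_k)
    V_xw = (p_x_given_wk[:, None, None]
            * diffs[:, :, None] * diffs[:, None, :]).sum(axis=0)       # equation 4.28
    return 2.0 * np.linalg.eigvalsh(V_xw).max()                        # T_c = 2 E_max(V_XW)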


4.3 Structural Risk Minimization and Optimal Cluster Number. To solve the intertwined outlier and cluster number problem, some intuitive notions can be obtained based on classical information theory as presented in the previous sections. Increasing K and the model complexity (as the temperature is lowered) may reduce the capacity C(D(p(X))), since it is a nondecreasing function of D(p(X)), as shown in corollary 1 (see also Figure 1). Therefore, in view of theorem 1, we should use the smallest cluster number as long as a relatively small number of outliers (if not zero outliers) is achieved, say 1 percent of the entire input data set. However, how to make a trade-off between empirical risk minimization and capacity maximization is a difficult problem for classical information theory.

We can solve this difficulty by bridging the gap between classical information theory, on which the RIC algorithm is based, and the relatively new statistical learning theory with the so-called structural risk minimization (SRM) principle (Vapnik, 1998). Under the SRM, a set of admissible structures with nested subsets can be defined specifically for the RIC clustering problem as

S_1 \subset S_2 \subset \cdots \subset S_K,   (4.29)

where S_K = \{Q_K(x_i, W), W \in \Lambda_K\}, \forall i, with a set of indicator functions of the empirical risk7

Q_K(x_i, W) = \sum_{k=1}^{K} \lim_{T \to 0} p(w_k | x_i) = \sum_{k=1}^{K} \lim_{T \to 0} \frac{p(w_k) \exp(-d(x_i, w_k)/T)}{N_{x_i}}, \quad \forall i.   (4.30)

We shall show that the tilted distribution p(w_k|x_i), equation 3.4, at zero temperature, as in equation 4.30, can be approximated by the complement of a step function. This is linear in parameters and assigns the cluster membership of each input data point based on the Euclidean distance between data point x_i and cluster center w_k for a final hard clustering partition (Rose, 1998; see also the algorithm in section 4.4).

The tilted distribution at T → 0 can be presented as

\lim_{T \to 0} \frac{p(w_k) \exp(-d(x_i, w_k)/T)}{\sum_{k=1}^{K} p(w_k) \exp(-d(x_i, w_k)/T)} \approx
\begin{cases}
\dfrac{p(w_k) \exp(-d_0(x_i, w_k))}{p(w_k) \exp(-d_0(x_i, w_k))} = 1 & \text{if } d_0(x_i, w_k) \neq \infty, \\[4pt]
\dfrac{p(w_k) \exp(-d_0(x_i, w_k))}{\sum_{k=1}^{K} p(w_k) \exp(-d_0(x_i, w_k))} = 0 & \text{if } d_0(x_i, w_k) \to \infty.
\end{cases}   (4.31)

7 According to the definition of the tilted distribution, equation 3.4, it is easy to see that the defined indicator function is a constant number, that is, Q_K(x_i, W) = 1. See also note 3.

Now consider the radius d_0(x_i, w_k) between data point x_i and cluster k at zero temperature. This can be rewritten as an inner product of two n-dimensional vectors of the input space as

d_0(x_i, w_k) = \lim_{T \to 0} \frac{d(x_i, w_k)}{T} = \lim_{T \to 0} \frac{\langle x_i - w_k, \, x_i - w_k \rangle}{T} = \sum_{o=1}^{n} r_{ko} \phi_{ko}(X),   (4.32)

where r_{ko} represents the radius parameter component in the n-dimensional space and φ_{ko}(X) is a linearly independent function, similar to the hyperplane case (Vapnik, 1998).

Using equations 4.32 and 4.31, we can rewrite 4.30 as

Q_K(x_i, W) = \sum_{k=1}^{K} \bar{\theta}\Big( \sum_{o=1}^{n} r_{ko} \phi_{ko}(X) - d_0(x_i, w_k) \Big), \quad \forall i,   (4.33)

where \bar{\theta}(\cdot) = 1 - \theta(\cdot) is the complement of the step function \theta(\cdot).

Note that there is one and only one d_0(x_i, w_k) ≠ ∞, ∀(1 ≤ k ≤ K), in each conditional equality of equation 4.31, since it gives a unique cluster membership of any data point x_i in a nested structure S_K. Therefore, the indicator Q_K(x_i, W) is linear in parameters. According to Vapnik (1998), the VC-dimension of the complexity control parameter is equal to the number of parameters, h_K = (n + 1) * K, for each nested subset S_K. By design of the DA clustering, the nested structure in equation 4.29 provides an ordering of the VC-dimension, h_1 ≤ h_2 ≤ ... ≤ h_K, such that the increase of the cluster number is proportional to the increase of the estimated VC-dimension from a neural network point of view (Vapnik, 1998).
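In practice, the zero-temperature limit behind the indicator functions 4.30 to 4.33 amounts to a hard nearest-center assignment in the squared Euclidean distance, as in the small sketch below (illustrative names only).

import numpy as np

def hard_membership(X, W):
    # X: (l, n) data; W: (K, n) centers. Returns the index of the nearest center for each point.
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)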

To obtain good generalization performance, one has to use the admissible structure, equation 4.29, based on the set of indicator functions to search for an optimal cluster number K. This minimizes a VC-bound p_s similar to that of the support vector machine, except that we are looking for the strongest data point of the input space instead of seeking the weakest data point of the feature (kernel) space (Vapnik, 1998). So we have

p_s \leq \eta + \frac{\varepsilon}{2} \Bigg( 1 + \sqrt{1 + \frac{4\eta}{\varepsilon}} \Bigg),   (4.34)


with

\eta = \frac{m}{l},   (4.35)

\varepsilon = 4 \, \frac{h_K \big( \ln \frac{2l}{h_K} + 1 \big) - \ln \frac{\zeta}{4}}{l},   (4.36)

where m is the number of outliers identified in the capacity maximization as in the previous section, and ζ < 1 is a constant.

The signal-to-noise ratio η in equation 4.35 appears as the first term on the right-hand side of the VC-bound, equation 4.34. This represents the empirical risk, and the second term is the confidence interval of the SRM-based estimate.
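The VC-bound is cheap to evaluate for each candidate K. The sketch below computes the right-hand side of equation 4.34 with equations 4.35 and 4.36 as reconstructed above, using h_K = (n + 1) * K; the argument names and the default ζ = 0.1 (the value used in the simulations of section 5) are assumptions for illustration.

import numpy as np

def vc_bound(m, l, n, K, zeta=0.1):
    # m: number of outliers; l: number of data points; n: input dimension; K: cluster number.
    h_K = (n + 1) * K                                     # VC-dimension of the nested subset S_K
    eta = m / l                                           # empirical risk ratio, equation 4.35
    eps = 4.0 * (h_K * (np.log(2.0 * l / h_K) + 1.0)
                 - np.log(zeta / 4.0)) / l                # confidence term, equation 4.36
    return eta + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * eta / eps))   # bound of equation 4.34

The optimal cluster number is then the K ≤ Kmax that minimizes this value.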

Discussion

Stop criterion and optimal cluster number. At the initial DA clustering stage, with a small cluster number K and a relatively large ratio between the number of input data points and the VC-dimension, say l/h_K > 20 (Vapnik, 1998), the real risk VC-bound, equation 4.34, is mainly determined by the first term on the right-hand side of the inequality, that is, the empirical risk (signal-to-noise) ratio η in equation 4.35. As the temperature is lowered and the cluster number is increased, a relatively small ratio l/h_K may require both terms on the right-hand side of equation 4.34 to be small simultaneously. Therefore, we can first assess the ratio l/h_K, which is near the upper bound of the critical number 20, for a maximum cluster number K = Kmax, beyond which the second term of the VC-bound, equation 4.34, may become dominant even for a small empirical risk ratio η, especially in a high-dimensional data space. Therefore, we can follow the minimax MI optimization, as in sections 3 and 4, to increase the cluster number from one until Kmax for a minimum value of the VC-bound, that is, take a trade-off between minimization of the empirical risk and the VC-dimension.

Selection of λ. The degree of robustness of the RIC algorithm is controlled by the parameter λ. The Kuhn-Tucker condition in corollary 2 tells us that a relatively larger value of λ yields more outliers (noisy patterns). If one chooses λ = 0, the RIC allows the maximum empirical risk, with a possible overcapacity distortion beyond the optimal saddle point, and a minimum number of estimated outliers (see Figure 1). In a general clustering problem using the L2 distortion measure, equation 2.2, the selection of λ is insensitive to the determination of an optimal cluster number, because the VC-bound depends on only the relative values of η and h_K over different cluster numbers (see also example 2).


As a general rule of thumb, if eliminating more outliers is of interest, we can gradually increase λ and redo the capacity maximization to reject outliers located between intercluster boundaries at an optimal cluster number determined by an arbitrary value of λ.

4.4 Implementation of the RIC Algorithm

Phase I (Minimization)

1. Determine the ratio l/(n * K), which is near the critical number 20, for a maximum cluster number K = Kmax, and set p(x_i) = 1/l for i = 1 to l.

2. Initialize T > 2E_max(V_x), where E_max is the largest eigenvalue of the variance matrix V_x of the input pattern set X, K = 1, and p(w_1) = 1.

3. For i = 1, ..., K, run the fixed-point iteration of the DA clustering according to equations 3.4, 4.15, and 3.12.

4. Convergence test: if not satisfied, go to 3.

5. If T ≤ Tmin, perform the last iteration and stop.

6. Cooling step: T ← αT (α < 1).

7. If K < Kmax, check the condition for phase transition for i = 1, ..., K. If a critical temperature T = 2E_max(V_XW), where E_max(V_XW) is the largest eigenvalue of the covariance matrix V_XW in equation 4.28 between the input pattern and the code vector (Rose, 1998), is reached for the clustering, add a new center w_{K+1} = w_K + δ with p(w_{K+1}) = p(w_K)/2, p(w_K) ← p(w_K)/2, and update K ← K + 1.

Phase II (Maximization)

8. If it is the first time the robust density estimation is calculated, select p(x_i) = 1/l, ∞ > λ ≥ 0, and ε > 0, and start the fixed-point iteration of the robust density estimation in the following steps 9 to 10 (a sketch of this iteration is given after step 11).

9. Compute

c_i = \exp\Bigg[ \sum_{k=1}^{K} \Big( p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \Big) \Bigg].   (4.37)

10. If

\ln \max_{i=1,\ldots,l} c_i - \ln \sum_{i=1}^{l} p(x_i) c_i \geq \varepsilon,   (4.38)

update the density estimate

p(x_i) \leftarrow \frac{p(x_i) c_i}{\sum_{i=1}^{l} p(x_i) c_i}   (4.39)

and go to step 9; otherwise continue to step 11.


11. Verify the robust solutions of the RIC algorithm around the optimal saddle point for a minimum value of the VC-bound, equation 4.34, within the range of the maximum cluster number Kmax. If the minimum is found, then delete the outliers and set T → 0 for the tilted distribution to obtain the cluster membership of all input data points for a hard clustering solution. Recalculate the cluster centers using equation 3.12 without the outliers, then stop. Otherwise, go to 3.
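As referenced in step 8, the following sketch shows the Phase II fixed-point iteration of steps 9 and 10 (equations 4.37 to 4.39) in the style of the Blahut capacity algorithm; the convergence test follows the reconstruction of equation 4.38 above, and the names are illustrative rather than the author's Matlab code.

import numpy as np

def robust_density_iteration(p_w_given_x, d, lam=0.0, eps=1e-6, max_iter=500):
    # p_w_given_x: (l, K) tilted distribution from Phase I; d: (l, K) distortions d(w_k, x_i).
    l = p_w_given_x.shape[0]
    p_x = np.full(l, 1.0 / l)                                 # step 8: start from p(x_i) = 1/l
    for _ in range(max_iter):
        p_w = p_x @ p_w_given_x                               # denominator inside equation 4.37
        log_c = np.sum(p_w_given_x * np.log(p_w_given_x / (p_w[None, :] + 1e-300) + 1e-300)
                       - lam * p_w_given_x * d, axis=1)       # log c_i, equation 4.37
        log_c -= log_c.max()                                  # common shift; cancels in test and update
        c = np.exp(log_c)
        Z = p_x @ c
        if np.log(c.max()) - np.log(Z) < eps:                 # convergence test, equation 4.38
            break
        p_x = p_x * c / Z                                     # update, equation 4.39
    return p_x                                                # entries driven to 0 mark outliers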

5 Simulation Results

This section presents a few simulation examples to show the superiority of the RIC over the standard DA clustering algorithm. This is in fact a self-comparison, since the RIC is just an extension of the DA that identifies outliers for an optimal cluster number. A comparison can also be made with the popular fuzzy c-means (FCM) and the robust version of the FCM clustering algorithms (see section 2). However, this may not make much sense, since the FCM needs a predetermined cluster number in addition to having the initialization problem (Krishnapuram & Keller, 1993).

Example 1, which follows, presents a numerical analysis to reveal details of the weakness of the tilted distribution. This also explains how the robust density estimate of the RIC algorithm finds an optimal cluster number via the identification of outliers. Example 2 illustrates that one can always choose a relatively larger control parameter λ to eliminate more outliers in the intercluster area without affecting the estimated optimal cluster number. Example 3 shows an interesting partition of a specific data set without clear cluster boundaries. In particular, we show that any data point could become an outlier, depending on the given data structure and the chosen cluster centers in the annealing procedure, based on the limited number of input data for a minimum VC-bound. Similarly, we are not looking for "true" clusters or cluster centers but effective clusters in a sense of the SRM, based on the simple Euclidean distance.8

Example 1. Figure 3 is an extended example used in the robust FCM clustering algorithm (Krishnapuram & Keller, 1993), which has two well-separated clusters with seven data points each and four outliers sitting around the middle position between the two given clusters. The data set has 18 data points, such that the ratio l/h_1 = 18/(3 * 1) is already smaller than the critical number 20. An optimal cluster number should be the minimum, two (note that the DA does not work for one cluster). However, we would like to use this example to reveal the weakness of the tilted distribution and how the robust density estimate helps. Figure 3 also shows that the RIC algorithm with

8 We set ζ = 0.1 in the VC-bound for all the simulation results. The Matlab program can be downloaded from the author's Internet address: http://www.ntu.edu.sg/home/eqsong.


[Figure 3: The clustering results of the RIC (λ = 0) in example 1: (a) the original data set; (b) K = 2, p_s = 4.9766; (c) K = 3, p_s = 5.7029; (d) K = 4, p_s = 6.4161. The bigger * represents the estimated cluster center of the RIC after eliminating the estimated outliers. The black dot points are the outliers identified by the RIC in b, c, and d.]


Table 1: Optimal Tilted Distribution p(w_k|x_i) and Robust Density Estimate p(x_i) in Example 1 with K = 2.

i     p(x_i)    p(w_1|x_i)   p(w_2|x_i)
1     0.3134    0.9994       0.0006
2     0.0638    0.9991       0.0009
3     0.0354    0.9987       0.0013
4     0.0329    0.9987       0.0013
5     0.0309    0.9987       0.0013
6     0.0176    0.9981       0.0019
7     0.0083    0.9972       0.0028
8     0.0030    0.0028       0.9972
9     0.0133    0.0019       0.9981
10    0.0401    0.0013       0.9987
11    0.0484    0.0013       0.9987
12    0.0567    0.0013       0.9987
13    0.1244    0.0009       0.9991
14    0.2133    0.0006       0.9994
15    0.0000    0.9994       0.0006
16    0.0000    0.9994       0.0006
17    0.0000    0.9994       0.0006
18    0.0000    0.9994       0.0006

K = 2 identifies the four data points around the middle position between the two clusters as outliers and eliminates them with p(x_i) = 0. Further details on the values of the tilted distribution p(w_k|x_i) and the robust estimate p(x_i) are listed in Table 1 for the case of K = 2. The first 14 rows correspond to the data in the two clusters, and the last 4 rows represent the four identified outliers. Despite the balanced geometric positions of the outliers, the membership of the four outliers is assigned to cluster 1 by the DA because p(w_1|x_i) ≈ 1 for the four outliers. A minor difference in the numerical error may be the only cause for the DA to assign the membership of the four data points to the first cluster. This explains why minimization of the tilted distribution is not robust (Dave & Krishnapuram, 1997).

More importantly, the RIC estimates the real risk bound p_s as the cluster number is increased from one. This also eliminates the effect of outliers. The ratio between the total number of data points and the VC-dimension h_2 is small, at 18/6 = 3, so the second term of the VC-bound becomes dominant as K increases, as shown in Figure 3. The optimal cluster number is determined as "two," with a minimum p_s = 4.9766, despite the fact that the minimum number of outliers of the empirical risk is achieved at the cluster number K = 4. Note also that the original outliers become valid data points as the cluster number is increased to K = 3 and K = 4, respectively.

Example 2. The two-dimensional data set has 292 data points, so the ratio l/h_7 = 292/(3 * 7) is well below the critical number 20. We should search for an optimal cluster number from two to seven clusters. Figures 4 and 5


[Figure 4: The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 0: (a) p_s = 1.5635, K = 2; (b) p_s = 0.6883, K = 3; (c) p_s = 1.1888, K = 4; (d) p_s = 1.4246, K = 5; (e) p_s = 1.3208, K = 6; (f) p_s = 2.4590, K = 7. The black dot points are outliers identified by the RIC in all pictures.]


[Figure 5: The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 18: (a) p_s = 1.8924, K = 2; (b) p_s = 0.9303, K = 3; (c) p_s = 1.2826, K = 4; (d) p_s = 1.5124, K = 5; (e) p_s = 1.3718, K = 6; (f) p_s = 2.46244, K = 7. The black dot points are outliers identified by the RIC in all pictures.]


[Figure 6: The two-dimensional data set with 300 data points in example 3, clustered by the RIC algorithm with λ = 0: (a) K = 2, p_s = 1.8177, η = 0.8667; (b) K = 3, p_s = 1.3396, η = 0.3900; (c) K = 4, p_s = 0.8486, η = 0; (d) K = 5, p_s = 0.9870, η = 0; (e) K = 6, p_s = 1.1374, η = 0.0033; (f) K = 7, p_s = 2.169, η = 0.4467. The black dot points are outliers identified by the RIC in all pictures.]


show that a "native" noise-free three-cluster data set is clustered by the RIC algorithm with different cluster numbers. The RIC gives the correct optimal cluster number, "three," because there is a minimum value of the VC-bound p_s. This also coincides with the empirical risk of the minimum number of outliers at K = 3 for both cases, λ = 0 and λ = 18. Note that we can always use a relatively larger λ value to eliminate more outliers in the intercluster area without affecting the optimal cluster number in a general clustering problem. The black dot points are the outliers identified by the RIC in all pictures.

Example 3. This is an instructive example to show the application of the RIC algorithm with λ = 0 to a data set without clear cluster boundaries in a two-dimensional space. The data set has 300 data points, such that the ratio l/h_7 = 300/(3 * 7) is well below the critical number 20. We shall search for an optimal cluster number from two to seven clusters. In particular, to show the difference between the empirical risk η and the VC-bound p_s, we indicate both values for each case. Figure 6 illustrates that the optimal cluster number is four, based on the SRM principle. It is interesting to note that the five-cluster case also achieves the minimum number of outliers in a sense of empirical risk minimization, but its VC-bound p_s is bigger than that of the four-cluster case because of the increase in the VC-dimension.

6 Conclusion

A robust information clustering algorithm is developed based on the minimax optimization of MI. In addition to the algorithm, the theoretical contributions of this letter are twofold: (1) the capacity maximization is implicitly linked to the distortion measure against the input pattern pmf and provides an upper bound of the empirical risk to phase out outliers, and (2) the optimal cluster number is estimated based on the SRM principle of statistical learning theory. The RIC can also be extended to c-shells or kernel-based algorithms to deal with linearly nonseparable data. This is an interesting topic for further research.

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlier version of this letter.

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011–1015.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460–473.

Blahut, R. E. (1988). Principles and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270–293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158–171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinek, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98–110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89–104.

MacKay, D. J. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035–1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.

Scholkopf, B., Smola, A., & Muller, K. M. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434–448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics—Part C, 32, 440–448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483–2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.

Page 7: A Robust Information Clustering Algorithm

2678 Q Song

( ( (X)))R D p

I

1T

( W |X )min ( (W | X) (X))p

F p p0 T

Empirical Risk Minimization Optimal Saddle Point

( (X))

( (X))

D p

D p

( (X)) ( (X))D p D p

Capacity Maximization

( (X)) ( (X))D p D p

maxD

( ( (X )))C D p

Figure 1 Plots of the rate distortion function and capacity curves for anyparticular cluster number K le Kmax The plots are parameterized by the tem-perature T

any particular cluster number 0 lt K le Kmax as shown in Figure 1 (Blahut1972)

Define the DA clustering objective function as (Rose 1998)

F (plowast(X) p(W|X)) = I (plowast(X) p(W|X))

minus slsum

i=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) (38)

The rate distortion function

R(D(plowast(X))) = s D(plowast(X)) + minp(W|X)

F (plowast(X) p(W|X)) (39)

is minimized by the titled distribution 34 (Blahut 1972)From the data clustering point of view equations 22 and 38 are well

known to be soft dissimilarity measures of different clusters (Dave ampKrishnapuram 1997) To accommodate the DA-based RIC algorithm in asingle framework of classical information theory we use a slightly differenttreatment from the original paper of Rose (1998) for the DA clustering algo-rithm that is to minimize equation 38 with respect to the free pmf p(wk |xi )rather than the direct minimization against the cluster center W This recasts

A Robust Information Clustering Algorithm 2679

the clustering optimization problem as that of seeking the distribution pmfand minimizing equation 38 subject to a specified level of randomness Thiscan be measured by the minimization of the MI equation 31

The optimization is now to minimize the function F (plowast(X) p(W|X))which is a by-product of the MI minimization over the titled distributionp(wk |xi ) to achieve a minimum distortion and leads to the mass-constrainedDA clustering algorithm

Plugging equation 34 into 38 the optimal objective function equation38 becomes the entropy functional in a compact form3

F (plowast(X) p(W|X)) = minuslsum

i=1

plowast(xi ) lnKsum

k=1

p(wk) exp (minusd(wk xi )T) (310)

Minimizing equation 310 against the cluster center wk we have

part F (plowast(X) p(W|X))partwk

=lsum

i=1

plowast(xi )p(wk |xi )(wk minus xi ) = 0 (311)

which leads to the optimal clustering center

wk =lsum

i=1

αikxi (312)

where

αik = plowast(xi )p(wk |xi )p(wk)

(313)

For any cluster number K le Kmax and a fixed arbitrary pmf set plowast(xi ) isinplowast(X) minimization of the clustering objective function 38 against the pmfset p(W|X) is monotone nonincrease and converges to a minimum point ofthe convex function curve at a particular temperature The soft distortionmeasure D(plowast(X)) in equation 33 and the MI equation 31 are minimizedsimultaneously in a sense of empirical risk minimization

3 F (plowast(X) p(W|X)) =lsum

i=1

Ksumk=1

plowast(xi )p(wk |xi ) ln(

p(wk ) exp(minusd(wk xi )T)p(wk )Nxi

)minuss

lsumi=1

Ksumk=1

plowast(xi )p(wk |xi )d(wk xi ) = minuslsum

i=1plowast(xi )

Ksumk=1

p(wk |xi ) ln Nxi

= minuslsum

i=1plowast(xi ) ln

Ksumk=1

p(wk ) exp(minusd(wk xi )T) (according to equation 34Ksum

k=1p(wk |xi ) = 1)

2680 Q Song

4 Minimax Optimization and the Structural Risk Minimization

41 Capacity Maximization and Input Data Reliability In the con-strained minimization of MI of the last section we obtain an optimal feed-forward transition probability a priori pmf p(wk |xi ) isin p(W|X) A backwardtransition probability a posteriori pmf p(xi |wk) isin p(X|W) can be obtainedthrough the Bayes formula

p(xi |wk) = p(xi )p(wk |xi )sumli=1 p(xi )p(wk |xi )

= p(xi )p(wk |xi )p(wk)

(41)

The backward transition probability is useful to assess the realizabilityof the input data set in classical information theory Directly using the pmfequation 41 yields an optimization problem by simply evaluating a singlepmf p(xi |wk) and is not a good idea to reject outlier (Mackay 1999) How-ever we can use the capacity function of classical information theory This isdefined by maximizing an alternative presentation of the MI against inputprobability distribution

C = maxp(X)

I (p(X) p(W|X)) (42)

with

I (p(X) p(W|X)) = I (p(X) p(X|W))

=lsum

i=1

Ksumk=1

p(xi )p(wk |xi ) lnp(xi |wk)

p(xi ) (43)

where C is a constant represented the channel capacityNow we are in a position to introduce the channel reliability of classical

information theory (Bluhat 1988) To deal with the input data uncertaintythe MI can be presented in a simple channel entropy form

I (p(X) p(X|W)) = H(p(X)) minus H(p(W) p(X|W)) (44)

where the first term represents uncertainty of the channel input variable X4

H(p(X)) = minuslsum

i=1

p(xi ) ln(p(xi )) (45)

4 In nats (per symbol) since we use the natural logarithm basis rather bits (per symbol)in log2 function Note that we use the special entropy notations H(p(X)) = H(X) andH(p(W) p(X|W)) = H(X|W) here

A Robust Information Clustering Algorithm 2681

and the second term is conditional entropy

H(p(W) p(X|W)) = minuslsum

i=1

Ksumk=1

p(wk)p(xi |wk) ln p(xi |wk) (46)

Lemma 1 (inverse theorem) 5 The clustering data reliability is presented in asingle symbol error pe of the input data set with empirical error probability

pe =lsum

i=1

Ksumk =i

p(xi |wk) (47)

such that if the input uncertainty H(p(X)) is greater than C the error pe is boundedaway from zero as

pe ge 1ln l

(H(p(X)) minus C minus 1) (48)

Proof We first give an intuitive discussion here over Fanorsquos inequality (SeeBlahut 1988 for a formal proof)

Uncertainty in the estimated channel input can be broken into two partsthe uncertainty in the channel whether an empirical error pe was made andgiven that an error is made the uncertainty in the true value However theerror occurs with probability pe such that the first uncertainty is H(pe ) =minus(1 minus pe ) ln(1 minus pe ) and can be no larger than ln(l) This occurs only whenall alternative errors are equally likely Therefore if the equivocation can beinterpreted as the information lost we should have Fanorsquos inequality

H(p((W)) p(X|W)) le H(pe ) + pe ln (l) (49)

Now consider that the maximum of the MI is C in equation 42 so we canrewrite equation 44 as

H(p(W) p(X|W)) = H(p(X)) minus I (p(X) p(X|W)) ge H(p(X)) minus C (410)

Then Fanorsquos inequality is applied to get

H(p(X)) minus C le H(pe ) + pe ln(l) le 1 + pe ln l (411)

5 There is a tighter bound pe compared to the one of lemma 1 as in the work of Jelinet(1968) However this may not be very helpful since minimization of the empirical risk isnot necessary to minimize the real structural risk as shown in section 43

2682 Q Song

Lemma 1 gives an important indication that any income information (inputdata) beyond the capacity C will generate unreliable data transmission Thisis also called the inverse theorem in a sense that it uses the DA-generatedoptimal titled distribution to produce the backward transition probabilityequation 41 and assess an upper bound of the empirical risk equation 410

42 Capacity Maximization and the Optimal Solution Equation 33 iswell known to be a soft dissimilarity measure minimized by the DA clus-tering as the temperature T is lowered toward zero (Rose 1998) Howeverthere is no way for the DA to search for an optimal temperature value andin turn an optimal cluster number because the rate distortion function pro-vides only limited information and aims at the empirical risk minimizationas shown in section 3 Therefore we propose a capacity or MI maximizationscheme This is implicitly dependent on the distortion measure similar tothe rate distortion function

We define a constrained maximization of MI as6

C(D(p(X))) = maxp(X)

C(D(p(X))) = maxp(X)

I (p(X) p(W|X)) (412)

with a similar constraint as in equation 33

D(p(X)) =lsum

i=1

Ksumk=1

p(xi )p(wk |xi )d(wk xi ) le D(plowast(X)) (413)

This is because minimization of the soft distortion measure D(plowast(X)) equa-tion 33 is the ultimate target of the DA clustering algorithm as analyzed insection 3 We need to assess maximum possibility to make an error (risk)According to lemma 1 reliability of the input data set depends on the capac-ity that is the maximum value of the MI against the input density estimateTo do this we evaluate the optimal a priori pmf robust density distributionpmf p(xi ) isin (p(X)) to replace the fixed arbitrary plowast(xi ) in the distortion mea-sure equation 33 and assess reliability of the input data of each particularcluster number K based on a posteriori pmf in equation 41 If most of thedata points (if not all) achieve the capacity (fewer outliers) then we canclaim that the clustering result reaches an optimal or at least a subopti-mal solution at this particular cluster number in a sense of empirical riskminimization

6 Here we use a similar notation of the capacity function as for the rate distortionfunction R(D(p(X))) to indicate implicitly that the specific capacity function is in fact animplicit function of the distortion measure D(p(X)) For each particular temperature T the capacity C(D(p(X))) achieves a point at the upper curve corresponding to the lowercarve R(D(plowast(X))) as shown in equation 417

A Robust Information Clustering Algorithm 2683

Similar to the minimization of the rate distortion function in section 3constrained capacity maximization can be rewritten as an optimizationproblem with a Lagrange multiplier λ ge 0

C(D(p(X))) = maxp(X)

[I (p(X) p(W|X)) + λ(D(plowast(X)) minus D(p(X)))] (414)

Theorem 1 Maximum of the constrained capacity C(D(p(X))) is achieved by therobust density estimate

p(xi ) =exp

(sumKk=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )

)suml

i=1 exp(sumK

k=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )) (415)

with the specific distortion measure D(p(X)) = D(plowast(X)) for p(xi ) ge 0 of all 0 lei le l

Proof Similar to Blahut (1972) we can temporarily ignore the conditionp(xi ) ge 0 and set the derivative of the optimal function 414 equal to zeroagainst the independent variable a priori pmf p(xi ) This results in

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)= 0

= minus ln p(xi ) minus 1 +Ksum

k=1

p(wk |xi )(ln p(xi |wk)

minusλp(wk |xi )d(wk xi )) + λ1 p(xi ) (416)

We also select a suitable λ1 which ensure that the probability constraintsumli=1 p(xi ) = 1 is guaranteed and leads to the robust density distribution

estimate equation 415According to the Kuhn-Tucker theorem (Blahut 1988) if there exists an

optimal robust distribution p(xi ) which is derived from equation 415 thenthe inequality constraint equation 413 of the distortion measure becomesequality and achieves the optimal solution of equation 414 at an optimalsaddle point between the curve C(D(p(X))) and R(D(plowast(X))) with the cor-responding average distortion measure

D(p(X)) = D(plowast(X))) (417)

By dividing the input data into effective clusters the DA clustering min-imizes the relative Shannon entropy without a priori knowledge of the datadistribution (Gray 1990) The prototype (cluster center) equation 312 is

2684 Q Song

1w 2w

1

1

(w | x )( (w | x )ln )

(w | x ) (x )

(x ) 0

Kk i

k i lk

k i ii

i

pp C

p p

p

2(w | x )ip1(w | x )ip

2(x | w )ip1(x | w )ip

Figure 2 The titled distribution and robust density estimation based on theinverse theorem for a two-cluster data set

clearly presented as a mass center This is insensitive to the initialization ofcluster centers and volumes with a fixed probability distribution for exam-ple an equal value plowast(xi ) = 1 l for the entire input data points (Rose 1998)Therefore the prototype parameter αki depends on the titled distributionp(wk |xi ) equation 34 which tends to associate the membership of any par-ticular pattern in all clusters and is not robust against outlier or disturbanceof the training data (Dave amp Krishnapuram 1997) This in turn generatesdifficulties in determining an optimal cluster number as shown in Figure 2(see also the simulation results) Any data point located around the middleposition between two effective clusters could be considered an outlier

Corollary 1 The capacity curve C(D(p(X))) is continuous nondecreasing andconcave on D(p(X)) for any particular cluster number K

Proof Let pprime(xi ) isin pprime(X) and pprimeprime(xi) isin pprimeprime(X) achieve [D(pprime(X)) C(D(pprime(X)))]and [D(pprimeprime(X)) C(D(pprimeprime(X)))] respectively and p(xi ) = λprime pprime(xi ) + λprimeprime pprimeprime(xi ) isan optimal density estimate in theorem 1 where λprimeprime = 1 minus λprime and 0 lt λprime lt 1Then

D(p(X)) =lsum

i=1

Ksumk=1

(λprime pprime(xi ) + λprimeprime pprimeprime(xi ))p(wk |xi )d(wk xi )

= λprime D(pprime(X)) + λprimeprime D(pprimeprime(X)) (418)

A Robust Information Clustering Algorithm 2685

and because p(X) is the optimal value we have

C(D(p(X))) ge I (p(X) p(W|X)) (419)

Now we use the fact that I (p(X) p(W|X)) is concave (upward convex) inp(X) (Jelinet 1968 Blahut 1988) and arrive at

C(D(p(X))) ge λprime I (pprime(X) p(W|X)) + λprimeprime I (pprimeprime(X) p(W|X)) (420)

We have finally

C(λprime D(pprime(X)) + λprimeprime D(pprimeprime(X))) ge λprimeC(D(pprime(X))) + λprimeprimeC(D(pprimeprime(X))) (421)

Furthermore because C(D(p(X))) is concave on [0 Dmax] it is continuousnonnegative and nondecreasing to achieve the maximum value at Dmaxwhich must also be strictly increased for D(p(X)) smaller than Dmax

Corollary 2 The robust distribution estimate p(X) achieves the capacity at

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)= V forallp(xi ) = 0

(422)

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)ltV forallp(xi ) = 0

(423)

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik1998)

p(xi )

[V minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

))]= 0 foralli (424)

Proof Similar to the proof of theorem 1 we use the concave property ofC(D(p(X)))

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)ge 0 (425)

2686 Q Song

which can be rewritten as

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)le minusλ1 + 1 foralli

(426)

with equality for all p(xi ) = 0 Setting minusλ1 + 1 = V completes the proof

Similarly it is easy to show that if we choose λ = 0 the Kuhn-Tucker con-dition becomes

p(xi )

[C minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

))]= 0 foralli (427)

where C is the maximum capacity value defined in equation 42Note that the MI is not negative However individual items in the sum of

the capacity maximization equation 42 can be negative If the i th patternxi is taken into account and p(wk |xi ) lt

sumli=1 p(xi )p(wk |xi ) then the prob-

ability of the kth code vector (cluster center) is decreased by the observedpattern and gives negative information about pattern xi This particularinput pattern may be considered an unreliable pattern (outlier) and itsnegative effect must be offset by other input patterns Therefore the max-imization of the MI equation 42 provides a robust density estimation ofthe noisy pattern (outlier) in terms that the average information is over allclusters and input patterns The robust density estimation and optimiza-tion is now to maximize the MI against the pmf p(xi ) and p(xi |wk) for anyvalue of i if p(xi |wk) = 0 then p(xi ) should be set equal to zero in orderto obtain the maximum such that a corresponding training pattern (outlier)xi can be deleted and dropped from further consideration in the optimiza-tion procedure as outlier shown in Figure 2

As a by-product the robust density estimation leads to an improvedcriterion at calculation of the critical temperature to split the input data setinto more clusters of the RIC compared to the DA as the temperature islowered (Rose 1998) The critical temperature of the RIC can be determinedby the maximum eigenvalue of the covariance (Rose 1998)

VXW =lsum

i=1

p(xi |wk)(xi minus wk)(xi minus wk)T (428)

where p(xi |wk) is optimized by equation 41 This has a bigger value repre-senting the reliable data since the channel communication error pe is rela-tively smaller compared to the one of outlier (see lemma 1)

A Robust Information Clustering Algorithm 2687

43 Structural Risk Minimization and Optimal Cluster Number Tosolve the intertwined outlier and cluster number problem some intuitivenotations can be obtained based on classical information theory as presentedthe previous sections Increasing K and model complexity (as the tempera-ture is lowered) may reduce capacity C(D(p(X))) since it is a nondecreasingfunction of D(p(X)) as shown in corollary 1 (see also Figure 1) Thereforein view of theorem 1 we should use the smallest cluster number as longas a relatively small number of outliers is achieved (if not zero outlier) say1 percent of the entire input data points However how to make a trade-off between empirical risk minimization and capacity maximization is adifficult problem for classical information theory

We can solve this difficulty by bridging the gap between classical infor-mation theory on which the RIC algorithm is based and the relatively newstatistical learning theory with the so-called structural risk minimization(SRM) principle (Vapnik 1998) Under the SRM a set of admissible struc-tures with nested subsets can be defined specifically for the RIC clusteringproblem as

S1 sub S2 sub sub SK (429)

where SK = (QK (xi W) W isin K ) foralli with a set of indicator functions ofthe empirical risk7

QK (xi W) =Ksum

k=1

limTrarr0

p(wk |xi ) =Ksum

k=1

limTrarr0

p(wk) exp(minusd(xi wk)T)Nxi

foralli

(430)

We shall show that the titled distribution p(wk |xi ) equation 34 at zero tem-perature as in equation 430 can be approximated by the complement ofa step function This is linear in parameters and assigns the cluster mem-bership of each input data point based on the Euclidean distance betweendata point xi and cluster center wk for a final hard clustering partition (Rose1998 see also the algorithm in section 44)

The titled distribution at T rarr 0 can be presented as

limTrarr0

p(wk) exp(minusd(xi wk)T)sumKk=1 p(wk) exp(minusd(xi wk)T)

7 According to definition of the titled distribution equation 34 it is easy to see thatthe defined indictor function is a constant number that is QK (xi W) = 1 See also note 3

2688 Q Song

asymp

p(wk) exp(minusd0(xi wk))p(wk) exp(minusd0(xi wk))

= 1 if d0(xi wk) = infin

p(wk) exp(minusd0(xi wk))sumKk=1 p(wk) exp(minusd0(xi wk))

= 0 if d0(xi wk) rarr infin

(431)

Now consider the radius d0(xi wk) between data point xi and clusterk at zero temperature This can be rewritten as an inner product of twon-dimensional vectors of the input space as

d0(xi wk) = limTrarr0

d(xi wk)T

= limTrarr0

lt xi minus wk gt lt xi minus wk gt

T

=nsum

o=1

rkoφko(X) (432)

where rko represents the radius parameter component in the n-dimensionalspace and φko(X) is a linearly independent function similar to the hyper-plane case (Vapnik 1998)

Using equations 432 and 431 we can rewrite 430 as

QK (xi W) =Ksum

k=1

θ

(nsum

o=1

rkoφko(X) minus d0(xi wk)

) foralli (433)

where θ () = 1 minus θ () is the complement of the step function θ ()Note that there is one and only one d0(xi wk) = infin forall(1 le k le K ) in each

conditional equality of equation 431 since it gives a unique cluster mem-bership of any data point xi in a nested structure SK Therefore the indi-cator QK (xi W) is linear in parameters According to Vapnik (1998) theVC-dimension of the complexity control parameter is equal to the numberof parameters hK = (n + 1) lowast K for each nested subset SK By design of theDA clustering the nested structure in equation 429 provides ordering ofthe VC-dimension h1 le h2 le le hK such that the increase of clusternumber is proportional to the increase of the estimated VC-dimension froma neural network point of view (Vapnik 1998)

To obtain good generalization performance one has to use the admissiblestructure equation 429 based on the set of indicator functions to search foran optimal cluster number K This minimizes a VC-bound ps similar to thatof the support vector machine except that we are looking for the strongestdata point of the input space instead of seeking the weakest data point ofthe feature (kernel) space (Vapnik 1998) So we have

ps le η + ε

2

(1 +

(1 + η

)12)

(434)

A Robust Information Clustering Algorithm 2689

with

η = ml

(435)

ε = 4hK

(ln 2l

hK+ 1

)minus ln ζ

4

l (436)

where m is the number of outliers identified in the capacity maximizationas in the previous section ζ lt 1 is a constant

The signal-to-noise ratio η in equation 435 appears as the first term ofthe right-hand side of the VC-bound equation 434 This represents theempirical risk and the second term is the confidence interval of the SRM-based estimate

Discussion

Stop criterion and optimal cluster number. At the initial DA clustering stage, with a small cluster number K and a relatively large ratio between the number of input data points and the VC-dimension, say l/hK > 20 (Vapnik, 1998), the real risk VC-bound, equation 4.34, is mainly determined by the first term on the right-hand side of the inequality, that is, the empirical risk (signal-to-noise) ratio η in equation 4.35. As the temperature is lowered and the cluster number is increased, a relatively small ratio l/hK may require both terms on the right-hand side of equation 4.34 to be small simultaneously. Therefore, we can first assess the ratio l/hK, which is near the upper bound of the critical number 20, for a maximum cluster number K = Kmax, beyond which the second term of the VC-bound, equation 4.34, may become dominant even for a small empirical risk ratio η, especially in a high-dimensional data space. We can then follow the minimax MI optimization as in sections 3 and 4 to increase the cluster number from one until Kmax for a minimum value of the VC-bound, that is, take a trade-off between minimization of the empirical risk and the VC-dimension.

Selection of λ. The degree of robustness of the RIC algorithm is controlled by the parameter λ. The Kuhn-Tucker condition in corollary 2 tells us that a relatively larger value of λ yields more outliers (noisy patterns). If one chooses λ = 0, the RIC allows the maximum empirical risk, with a possible overcapacity distortion beyond the optimal saddle point and a minimum number of estimated outliers (see Figure 1). In a general clustering problem using the L2 distortion measure, equation 2.2, the selection of λ is insensitive to the determination of an optimal cluster number, because the VC-bound depends on only the relative values of η and hK over different cluster numbers (see also example 2).


As a general rule of thumb, if eliminating more outliers is of interest, we can gradually increase λ and redo the capacity maximization to reject outliers located between intercluster boundaries at an optimal cluster number determined by an arbitrary value of λ.

4.4 Implementation of the RIC Algorithm

Phase I (Minimization)

1. Determine the ratio l/(n ∗ K), which is near the critical number 20, for a maximum cluster number K = Kmax, and set p(xi) = 1/l for i = 1 to l.

2. Initialize T > 2Emax(Vx), where Emax(Vx) is the largest eigenvalue of the variance matrix Vx of the input pattern set X, K = 1, and p(w1) = 1.

3. For i = 1, . . . , K, perform the fixed-point iteration of the DA clustering according to equations 3.4, 4.15, and 3.12.

4. Convergence test: if not satisfied, go to step 3.

5. If T ≤ Tmin, perform the last iteration and stop.

6. Cooling step: T ← αT (α < 1).

7. If K < Kmax, check the condition for phase transition for i = 1, . . . , K. If a critical temperature T = 2Emax(VXW), where Emax(VXW) is the largest eigenvalue of the covariance matrix VXW in equation 4.28 between the input pattern and code vector (Rose, 1998), is reached for the clustering, add a new center wK+1 = wK + δ with p(wK+1) = p(wK)/2, p(wK) ← p(wK)/2, and update K ← K + 1.

Phase II (Maximization)

8. If it is the first time the robust density estimation is calculated, select p(xi) = 1/l, ∞ > λ ≥ 0, and ε > 0, and start the fixed-point iteration of the robust density estimation in steps 9 to 10.

9.

\[
c_i = \exp\left[\sum_{k=1}^{K}\left(p(w_k|x_i)\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\,p(w_k|x_i)} - \lambda\, p(w_k|x_i)\, d(w_k, x_i)\right)\right]. \qquad (4.37)
\]

10. If

\[
\ln\sum_{i=1}^{l} p(x_i)\,c_i - \ln\max_{i=1,\ldots,l} c_i < \varepsilon, \qquad (4.38)
\]

then go to step 9, where ε > 0; otherwise, update the density estimation

\[
p(x_i) = \frac{p(x_i)\,c_i}{\sum_{i=1}^{l} p(x_i)\,c_i}. \qquad (4.39)
\]


11. Verify the robust solutions of the RIC algorithm around the optimal saddle point for a minimum value of the VC-bound, equation 4.34, within the range of the maximum cluster number Kmax. If the minimum is found, then delete the outliers and set T → 0 for the titled distribution to obtain the cluster membership of all input data points for a hard clustering solution. Recalculate the cluster centers using equation 3.12 without the outliers; then stop. Otherwise, go to step 3.
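The maximization phase (steps 8 to 10) is a Blahut-style fixed-point iteration on the input pmf. The sketch below is one way to code it in Python under stated assumptions: p(wk|xi) from Phase I is strictly positive, d(wk, xi) is the squared Euclidean distortion, and the stopping test uses the absolute gap between the two capacity bounds that appear in equation 4.38; the helper name and tolerances are mine, not the letter's.

```python
import numpy as np

def robust_density_estimate(P_wx, d, lam=0.0, eps=1e-6, max_iter=1000):
    """Phase II (steps 8-10): fixed-point iteration for the robust density
    estimate p(x_i) of equations 4.37-4.39.
    P_wx[i, k] = p(w_k | x_i) from Phase I (assumed > 0), d[i, k] = d(w_k, x_i)."""
    l, K = P_wx.shape
    p = np.full(l, 1.0 / l)                      # step 8: start from p(x_i) = 1/l
    for _ in range(max_iter):
        p_w = P_wx.T @ p                         # p(w_k) = sum_i p(x_i) p(w_k | x_i)
        # Step 9, equation 4.37.
        c = np.exp(np.sum(P_wx * np.log(P_wx / p_w) - lam * P_wx * d, axis=1))
        # Step 10: gap between the two capacity bounds of equation 4.38.
        if abs(np.log(p @ c) - np.log(c.max())) < eps:
            break
        p = p * c / (p @ c)                      # equation 4.39
    return p                                     # outliers end up with p(x_i) close to 0
```

Under this sketch, a larger lam pushes more boundary patterns toward p(xi) = 0, which is the behavior described in the "Selection of λ" discussion above.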

5 Simulation Results

This section presents a few simulation examples to show the superiority of the RIC over the standard DA clustering algorithm. This is in fact a self-comparison, since the RIC is just an extension of the DA by identifying outliers for an optimal cluster number. A comparison can also be made with the popular fuzzy c-means (FCM) and the robust version of the FCM clustering algorithms (see section 2). However, this may not make much sense, since the FCM needs a predetermined cluster number in addition to the initialization problem (Krishnapuram & Keller, 1993).

Example 1, which follows, presents a numerical analysis to reveal details of the weakness of the titled distribution. This also explains how the robust density estimate of the RIC algorithm finds an optimal cluster number via the identification of outliers. Example 2 illustrates that one can always choose a relatively larger control parameter λ to eliminate more outliers in the intercluster area without affecting the estimated optimal cluster number. Example 3 shows an interesting partition of a specific data set without clear cluster boundaries. In particular, we show that any data point could become an outlier depending on the given data structure and the chosen cluster centers in the annealing procedure, based on the limited number of input data, for a minimum VC-bound. Similarly, we are not looking for "true" clusters or cluster centers but effective clusters in a sense of the SRM based on the simple Euclidean distance.8

Example 1. Figure 3 is an extended example used in the robust FCM clustering algorithm (Krishnapuram & Keller, 1993), which has two well-separated clusters with seven data points each and four outliers sitting around the middle position between the two given clusters. The data set has 18 data points, such that the ratio l/h1 = 18/(3 ∗ 1) is already smaller than the critical number 20. An optimal cluster number should be the minimum, two (note that DA does not work for one cluster). However, we would like to use this example to reveal the weakness of the titled distribution and how the robust density estimate helps.

8 We set ζ = 0.1 of the VC-bound for all the simulation results. The Matlab program can be downloaded from the author's Internet address: http://www.ntu.edu.sg/home/eqsong.


Figure 3: The clustering results of RIC (λ = 0) in example 1: (a) the original data set; (b) K = 2, ps = 4.9766; (c) K = 3, ps = 5.7029; (d) K = 4, ps = 6.4161. The bigger ∗ represents the estimated cluster center of the RIC after eliminating the estimated outliers. The black dot points are the identified outliers by the RIC in b, c, and d.


Table 1: Optimal Titled Distribution p(wk|xi) and Robust Density Estimate p(xi) in Example 1 with K = 2.

 i    p(xi)    p(w1|xi)   p(w2|xi)
 1    0.3134   0.9994     0.0006
 2    0.0638   0.9991     0.0009
 3    0.0354   0.9987     0.0013
 4    0.0329   0.9987     0.0013
 5    0.0309   0.9987     0.0013
 6    0.0176   0.9981     0.0019
 7    0.0083   0.9972     0.0028
 8    0.0030   0.0028     0.9972
 9    0.0133   0.0019     0.9981
10    0.0401   0.0013     0.9987
11    0.0484   0.0013     0.9987
12    0.0567   0.0013     0.9987
13    0.1244   0.0009     0.9991
14    0.2133   0.0006     0.9994
15    0.0000   0.9994     0.0006
16    0.0000   0.9994     0.0006
17    0.0000   0.9994     0.0006
18    0.0000   0.9994     0.0006

Figure 3 also shows that the RIC algorithm with K = 2 identifies the four data points around the middle position between the two clusters as outliers and eliminates them with p(xi) = 0. Further details on the values of the titled distribution p(wk|xi) and the robust estimate p(xi) are listed in Table 1 for the case of K = 2. The first 14 rows correspond to the data in the two clusters, and the last 4 rows represent the four identified outliers. Despite the balanced geometric positions of the outliers, the membership of the four outliers is assigned to cluster 1 by the DA, because p(w1|xi) ≈ 1 for the four outliers. The minor difference in the numerical error may be the only cause for the DA to assign the membership of the four data points to the first cluster. This explains why minimization of the titled distribution is not robust (Dave & Krishnapuram, 1997).

More important, the RIC estimates the real risk bound ps as the cluster number is increased from one. This also eliminates the effect of outliers. The ratio between the number of total data points and the VC-dimension h2 is small, at 18/6 = 3, so the second term of the VC-bound becomes dominant as K increases, as shown in Figure 3. The optimal cluster number is determined as "two" with a minimum ps = 4.9766, despite the fact that the minimum number of outliers of the empirical risk is achieved at the cluster number K = 4. Note also that the original outliers become valid data points as the cluster numbers are increased to K = 3 and K = 4, respectively.

Example 2. The two-dimensional data set has 292 data points, so the ratio l/h7 = 292/(3 ∗ 7) is well below the critical number 20. We should search for an optimal cluster number from two to seven clusters.


Figure 4: The two-dimensional data set with 292 data points in example 2 is clustered by the RIC algorithm with λ = 0: (a) K = 2, ps = 1.5635; (b) K = 3, ps = 0.6883; (c) K = 4, ps = 1.1888; (d) K = 5, ps = 1.4246; (e) K = 6, ps = 1.3208; (f) K = 7, ps = 2.4590. The black dot points are identified outliers by the RIC in all pictures.


Figure 5: The two-dimensional data set with 292 data points in example 2 is clustered by the RIC algorithm with λ = 18: (a) K = 2, ps = 1.8924; (b) K = 3, ps = 0.9303; (c) K = 4, ps = 1.2826; (d) K = 5, ps = 1.5124; (e) K = 6, ps = 1.3718; (f) K = 7, ps = 2.46244. The black dot points are identified outliers by the RIC in all pictures.


Figure 6: The two-dimensional data set with 300 data points in example 3 is clustered by the RIC algorithm with λ = 0: (a) K = 2, ps = 1.8177, η = 0.8667; (b) K = 3, ps = 1.3396, η = 0.3900; (c) K = 4, ps = 0.8486, η = 0; (d) K = 5, ps = 0.9870, η = 0; (e) K = 6, ps = 1.1374, η = 0.0033; (f) K = 7, ps = 2.169, η = 0.4467. The black dot points are identified outliers by the RIC in all pictures.


Figures 4 and 5 show that a "native" noise-free three-cluster data set is clustered by the RIC algorithm with different cluster numbers. The RIC gives the correct optimal cluster number, "three," because there is a minimum value of the VC-bound ps. This also coincides with the empirical risk of the minimum number of outliers at K = 3 for both cases, λ = 0 and λ = 18. Note that we can always use a relatively larger λ value to eliminate more outliers in the intercluster area without affecting the optimal cluster number in a general clustering problem. The black dot points are identified outliers by the RIC in all pictures.

Example 3. This is an instructive example to show the application of the RIC algorithm with λ = 0 for a data set without clear cluster boundaries in a two-dimensional space. The data set has 300 data points, such that the ratio l/h7 = 300/(3 ∗ 7) is well below the critical number 20. We shall search for an optimal cluster number from two to seven clusters. In particular, to show the difference between the empirical risk η and the VC-bound ps, we indicate both values for each case. Figure 6 illustrates that the optimal cluster number is four, based on the SRM principle. It is interesting to note that the five-cluster case also achieves the minimum number of outliers in a sense of the empirical risk minimization, but its VC-bound ps is bigger than that of the four-cluster case because of the increase in the VC-dimension.

6 Conclusion

A robust information clustering algorithm is developed based on the minimax optimization of MI. In addition to the algorithm, the theoretical contributions of this letter are twofold: (1) the capacity maximization is implicitly linked to the distortion measure against the input pattern pmf and provides an upper bound of the empirical risk to phase out outliers; (2) the optimal cluster number is estimated based on the SRM principle of statistical learning theory. The RIC can also be extended to the c-shells or kernel-based algorithms to deal with linearly nonseparable data. This is an interesting topic for further research.

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlier version of this letter.

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011-1015.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460-473.

Blahut, R. E. (1988). Principle and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270-293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158-171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinet, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98-110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89-104.

Mackay, D. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035-1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210-2239.

Scholkopf, B., Smola, A., & Muller, K. M. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434-448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 32, 440-448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483-2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek and R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.



Page 9: A Robust Information Clustering Algorithm

2680 Q Song

4 Minimax Optimization and the Structural Risk Minimization

41 Capacity Maximization and Input Data Reliability In the con-strained minimization of MI of the last section we obtain an optimal feed-forward transition probability a priori pmf p(wk |xi ) isin p(W|X) A backwardtransition probability a posteriori pmf p(xi |wk) isin p(X|W) can be obtainedthrough the Bayes formula

p(xi |wk) = p(xi )p(wk |xi )sumli=1 p(xi )p(wk |xi )

= p(xi )p(wk |xi )p(wk)

(41)

The backward transition probability is useful to assess the realizabilityof the input data set in classical information theory Directly using the pmfequation 41 yields an optimization problem by simply evaluating a singlepmf p(xi |wk) and is not a good idea to reject outlier (Mackay 1999) How-ever we can use the capacity function of classical information theory This isdefined by maximizing an alternative presentation of the MI against inputprobability distribution

C = maxp(X)

I (p(X) p(W|X)) (42)

with

I (p(X) p(W|X)) = I (p(X) p(X|W))

=lsum

i=1

Ksumk=1

p(xi )p(wk |xi ) lnp(xi |wk)

p(xi ) (43)

where C is a constant represented the channel capacityNow we are in a position to introduce the channel reliability of classical

information theory (Bluhat 1988) To deal with the input data uncertaintythe MI can be presented in a simple channel entropy form

I (p(X) p(X|W)) = H(p(X)) minus H(p(W) p(X|W)) (44)

where the first term represents uncertainty of the channel input variable X4

H(p(X)) = minuslsum

i=1

p(xi ) ln(p(xi )) (45)

4 In nats (per symbol) since we use the natural logarithm basis rather bits (per symbol)in log2 function Note that we use the special entropy notations H(p(X)) = H(X) andH(p(W) p(X|W)) = H(X|W) here

A Robust Information Clustering Algorithm 2681

and the second term is conditional entropy

H(p(W) p(X|W)) = minuslsum

i=1

Ksumk=1

p(wk)p(xi |wk) ln p(xi |wk) (46)

Lemma 1 (inverse theorem) 5 The clustering data reliability is presented in asingle symbol error pe of the input data set with empirical error probability

pe =lsum

i=1

Ksumk =i

p(xi |wk) (47)

such that if the input uncertainty H(p(X)) is greater than C the error pe is boundedaway from zero as

pe ge 1ln l

(H(p(X)) minus C minus 1) (48)

Proof We first give an intuitive discussion here over Fanorsquos inequality (SeeBlahut 1988 for a formal proof)

Uncertainty in the estimated channel input can be broken into two partsthe uncertainty in the channel whether an empirical error pe was made andgiven that an error is made the uncertainty in the true value However theerror occurs with probability pe such that the first uncertainty is H(pe ) =minus(1 minus pe ) ln(1 minus pe ) and can be no larger than ln(l) This occurs only whenall alternative errors are equally likely Therefore if the equivocation can beinterpreted as the information lost we should have Fanorsquos inequality

H(p((W)) p(X|W)) le H(pe ) + pe ln (l) (49)

Now consider that the maximum of the MI is C in equation 42 so we canrewrite equation 44 as

H(p(W) p(X|W)) = H(p(X)) minus I (p(X) p(X|W)) ge H(p(X)) minus C (410)

Then Fanorsquos inequality is applied to get

H(p(X)) minus C le H(pe ) + pe ln(l) le 1 + pe ln l (411)

5 There is a tighter bound pe compared to the one of lemma 1 as in the work of Jelinet(1968) However this may not be very helpful since minimization of the empirical risk isnot necessary to minimize the real structural risk as shown in section 43

2682 Q Song

Lemma 1 gives an important indication that any income information (inputdata) beyond the capacity C will generate unreliable data transmission Thisis also called the inverse theorem in a sense that it uses the DA-generatedoptimal titled distribution to produce the backward transition probabilityequation 41 and assess an upper bound of the empirical risk equation 410

42 Capacity Maximization and the Optimal Solution Equation 33 iswell known to be a soft dissimilarity measure minimized by the DA clus-tering as the temperature T is lowered toward zero (Rose 1998) Howeverthere is no way for the DA to search for an optimal temperature value andin turn an optimal cluster number because the rate distortion function pro-vides only limited information and aims at the empirical risk minimizationas shown in section 3 Therefore we propose a capacity or MI maximizationscheme This is implicitly dependent on the distortion measure similar tothe rate distortion function

We define a constrained maximization of MI as6

C(D(p(X))) = maxp(X)

C(D(p(X))) = maxp(X)

I (p(X) p(W|X)) (412)

with a similar constraint as in equation 33

D(p(X)) =lsum

i=1

Ksumk=1

p(xi )p(wk |xi )d(wk xi ) le D(plowast(X)) (413)

This is because minimization of the soft distortion measure D(plowast(X)) equa-tion 33 is the ultimate target of the DA clustering algorithm as analyzed insection 3 We need to assess maximum possibility to make an error (risk)According to lemma 1 reliability of the input data set depends on the capac-ity that is the maximum value of the MI against the input density estimateTo do this we evaluate the optimal a priori pmf robust density distributionpmf p(xi ) isin (p(X)) to replace the fixed arbitrary plowast(xi ) in the distortion mea-sure equation 33 and assess reliability of the input data of each particularcluster number K based on a posteriori pmf in equation 41 If most of thedata points (if not all) achieve the capacity (fewer outliers) then we canclaim that the clustering result reaches an optimal or at least a subopti-mal solution at this particular cluster number in a sense of empirical riskminimization

6 Here we use a similar notation of the capacity function as for the rate distortionfunction R(D(p(X))) to indicate implicitly that the specific capacity function is in fact animplicit function of the distortion measure D(p(X)) For each particular temperature T the capacity C(D(p(X))) achieves a point at the upper curve corresponding to the lowercarve R(D(plowast(X))) as shown in equation 417

A Robust Information Clustering Algorithm 2683

Similar to the minimization of the rate distortion function in section 3constrained capacity maximization can be rewritten as an optimizationproblem with a Lagrange multiplier λ ge 0

C(D(p(X))) = maxp(X)

[I (p(X) p(W|X)) + λ(D(plowast(X)) minus D(p(X)))] (414)

Theorem 1 Maximum of the constrained capacity C(D(p(X))) is achieved by therobust density estimate

p(xi ) =exp

(sumKk=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )

)suml

i=1 exp(sumK

k=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )) (415)

with the specific distortion measure D(p(X)) = D(plowast(X)) for p(xi ) ge 0 of all 0 lei le l

Proof Similar to Blahut (1972) we can temporarily ignore the conditionp(xi ) ge 0 and set the derivative of the optimal function 414 equal to zeroagainst the independent variable a priori pmf p(xi ) This results in

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)= 0

= minus ln p(xi ) minus 1 +Ksum

k=1

p(wk |xi )(ln p(xi |wk)

minusλp(wk |xi )d(wk xi )) + λ1 p(xi ) (416)

We also select a suitable λ1 which ensure that the probability constraintsumli=1 p(xi ) = 1 is guaranteed and leads to the robust density distribution

estimate equation 415According to the Kuhn-Tucker theorem (Blahut 1988) if there exists an

optimal robust distribution p(xi ) which is derived from equation 415 thenthe inequality constraint equation 413 of the distortion measure becomesequality and achieves the optimal solution of equation 414 at an optimalsaddle point between the curve C(D(p(X))) and R(D(plowast(X))) with the cor-responding average distortion measure

D(p(X)) = D(plowast(X))) (417)

By dividing the input data into effective clusters the DA clustering min-imizes the relative Shannon entropy without a priori knowledge of the datadistribution (Gray 1990) The prototype (cluster center) equation 312 is

2684 Q Song

1w 2w

1

1

(w | x )( (w | x )ln )

(w | x ) (x )

(x ) 0

Kk i

k i lk

k i ii

i

pp C

p p

p

2(w | x )ip1(w | x )ip

2(x | w )ip1(x | w )ip

Figure 2 The titled distribution and robust density estimation based on theinverse theorem for a two-cluster data set

clearly presented as a mass center This is insensitive to the initialization ofcluster centers and volumes with a fixed probability distribution for exam-ple an equal value plowast(xi ) = 1 l for the entire input data points (Rose 1998)Therefore the prototype parameter αki depends on the titled distributionp(wk |xi ) equation 34 which tends to associate the membership of any par-ticular pattern in all clusters and is not robust against outlier or disturbanceof the training data (Dave amp Krishnapuram 1997) This in turn generatesdifficulties in determining an optimal cluster number as shown in Figure 2(see also the simulation results) Any data point located around the middleposition between two effective clusters could be considered an outlier

Corollary 1. The capacity curve C(D(p(X))) is continuous, nondecreasing, and concave in D(p(X)) for any particular cluster number K.

Proof. Let p′(x_i) ∈ p′(X) and p″(x_i) ∈ p″(X) achieve [D(p′(X)), C(D(p′(X)))] and [D(p″(X)), C(D(p″(X)))], respectively, and let p(x_i) = λ′p′(x_i) + λ″p″(x_i) be an optimal density estimate in theorem 1, where λ″ = 1 − λ′ and 0 < λ′ < 1. Then

$$D(p(X)) = \sum_{i=1}^{l}\sum_{k=1}^{K}\big(\lambda' p'(x_i) + \lambda'' p''(x_i)\big)\, p(w_k|x_i)\, d(w_k, x_i) = \lambda' D(p'(X)) + \lambda'' D(p''(X)), \quad (4.18)$$


and because p(X) is the optimal value, we have

$$C(D(p(X))) \geq I(p(X), p(W|X)). \quad (4.19)$$

Now we use the fact that I(p(X), p(W|X)) is concave (upward convex) in p(X) (Jelinet, 1968; Blahut, 1988) and arrive at

$$C(D(p(X))) \geq \lambda' I(p'(X), p(W|X)) + \lambda'' I(p''(X), p(W|X)). \quad (4.20)$$

Finally, we have

$$C\big(\lambda' D(p'(X)) + \lambda'' D(p''(X))\big) \geq \lambda' C(D(p'(X))) + \lambda'' C(D(p''(X))). \quad (4.21)$$

Furthermore, because C(D(p(X))) is concave on [0, D_max], it is continuous, nonnegative, and nondecreasing, achieving its maximum value at D_max; it must also be strictly increasing for D(p(X)) smaller than D_max.

Corollary 2. The robust distribution estimate p(X) achieves the capacity at

$$\sum_{k=1}^{K} p(w_k|x_i)\left(\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)p(w_k|x_i)} - \lambda p(w_k|x_i)\, d(w_k, x_i)\right) = V, \quad \forall\, p(x_i) \neq 0, \quad (4.22)$$

$$\sum_{k=1}^{K} p(w_k|x_i)\left(\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)p(w_k|x_i)} - \lambda p(w_k|x_i)\, d(w_k, x_i)\right) < V, \quad \forall\, p(x_i) = 0. \quad (4.23)$$

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik, 1998):

$$p(x_i)\left[V - \left(\sum_{k=1}^{K} p(w_k|x_i)\left(\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)p(w_k|x_i)} - \lambda p(w_k|x_i)\, d(w_k, x_i)\right)\right)\right] = 0, \quad \forall i. \quad (4.24)$$

Proof. Similar to the proof of theorem 1, we use the concave property of C(D(p(X))):

$$\frac{\partial}{\partial p(x_i)}\left(C(D(p(X))) + \lambda_1\left(\sum_{i=1}^{l} p(x_i) - 1\right)\right) \geq 0, \quad (4.25)$$


which can be rewritten as

$$\sum_{k=1}^{K} p(w_k|x_i)\left(\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)p(w_k|x_i)} - \lambda p(w_k|x_i)\, d(w_k, x_i)\right) \leq -\lambda_1 + 1, \quad \forall i, \quad (4.26)$$

with equality for all p(x_i) ≠ 0. Setting −λ_1 + 1 = V completes the proof.

Similarly, it is easy to show that if we choose λ = 0, the Kuhn-Tucker condition becomes

$$p(x_i)\left[C - \left(\sum_{k=1}^{K} p(w_k|x_i)\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)p(w_k|x_i)}\right)\right] = 0, \quad \forall i, \quad (4.27)$$

where C is the maximum capacity value defined in equation 4.2.

Note that the MI is not negative. However, individual terms in the sum of the capacity maximization, equation 4.2, can be negative. If the ith pattern x_i is taken into account and p(w_k|x_i) < Σ_{i=1}^{l} p(x_i)p(w_k|x_i), then the probability of the kth code vector (cluster center) is decreased by the observed pattern, which gives negative information about pattern x_i. This particular input pattern may be considered an unreliable pattern (outlier), and its negative effect must be offset by other input patterns. Therefore, the maximization of the MI, equation 4.2, provides a robust density estimate of the noisy patterns (outliers), in the sense that the average information is taken over all clusters and input patterns. The robust density estimation and optimization now maximize the MI against the pmfs p(x_i) and p(x_i|w_k): for any value of i, if p(x_i|w_k) = 0, then p(x_i) should be set equal to zero in order to obtain the maximum, so that the corresponding training pattern (outlier) x_i can be deleted and dropped from further consideration in the optimization procedure, as shown for the outlier in Figure 2.

As a by-product, the robust density estimation leads to an improved criterion for calculating the critical temperature at which the RIC splits the input data set into more clusters as the temperature is lowered, compared to the DA (Rose, 1998). The critical temperature of the RIC can be determined by the maximum eigenvalue of the covariance (Rose, 1998)

$$V_{XW} = \sum_{i=1}^{l} p(x_i|w_k)(x_i - w_k)(x_i - w_k)^T, \quad (4.28)$$

where p(x_i|w_k) is optimized by equation 4.1. This covariance takes a bigger value for the reliable data, since their channel communication error p_e is relatively smaller than that of the outliers (see lemma 1).
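A small sketch of this splitting criterion may help: it forms the covariance of equation 4.28 and returns the critical temperature T_c = 2E_max(V_XW) used in phase I of the algorithm in section 4.4. The function and argument names are illustrative only.

```python
import numpy as np

def critical_temperature(X, w_k, p_x_given_wk):
    """Critical temperature for splitting a cluster, T_c = 2 * E_max(V_XW),
    where V_XW is the covariance of equation 4.28 weighted by the backward
    transition probabilities p(x_i | w_k).

    X            : (l, n) input patterns
    w_k          : (n,)   cluster center
    p_x_given_wk : (l,)   backward transition probabilities p(x_i | w_k)
    """
    diff = X - w_k                                   # deviations from the center, (l, n)
    V_xw = (p_x_given_wk[:, None] * diff).T @ diff   # equation 4.28, (n, n)
    return 2.0 * np.linalg.eigvalsh(V_xw).max()
```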


4.3 Structural Risk Minimization and Optimal Cluster Number. To solve the intertwined outlier and cluster number problem, some intuitive notions can be obtained from classical information theory, as presented in the previous sections. Increasing K and the model complexity (as the temperature is lowered) may reduce the capacity C(D(p(X))), since it is a nondecreasing function of D(p(X)), as shown in corollary 1 (see also Figure 1). Therefore, in view of theorem 1, we should use the smallest cluster number for which a relatively small number of outliers (if not zero outliers) is achieved, say 1 percent of the entire input data set. However, how to make a trade-off between empirical risk minimization and capacity maximization is a difficult problem for classical information theory.

We can resolve this difficulty by bridging the gap between classical information theory, on which the RIC algorithm is based, and the relatively new statistical learning theory with the so-called structural risk minimization (SRM) principle (Vapnik, 1998). Under the SRM, a set of admissible structures with nested subsets can be defined specifically for the RIC clustering problem as

$$S_1 \subset S_2 \subset \cdots \subset S_K, \quad (4.29)$$

where S_K = (Q_K(x_i, W), W ∈ K), ∀i, with a set of indicator functions of the empirical risk:7

$$Q_K(x_i, W) = \sum_{k=1}^{K}\lim_{T\to 0} p(w_k|x_i) = \sum_{k=1}^{K}\lim_{T\to 0}\frac{p(w_k)\exp(-d(x_i, w_k)/T)}{N_{x_i}}, \quad \forall i, \quad (4.30)$$

where N_{x_i} = Σ_{k=1}^{K} p(w_k)exp(−d(x_i, w_k)/T) is the normalizing factor of the tilted distribution.

We shall show that the tilted distribution p(w_k|x_i), equation 3.4, at zero temperature, as in equation 4.30, can be approximated by the complement of a step function. This approximation is linear in its parameters and assigns the cluster membership of each input data point based on the Euclidean distance between data point x_i and cluster center w_k for a final hard clustering partition (Rose, 1998; see also the algorithm in section 4.4).

The tilted distribution at T → 0 can be presented as

$$\lim_{T\to 0}\frac{p(w_k)\exp(-d(x_i, w_k)/T)}{\sum_{k=1}^{K} p(w_k)\exp(-d(x_i, w_k)/T)} \approx
\begin{cases}
\dfrac{p(w_k)\exp(-d_0(x_i, w_k))}{p(w_k)\exp(-d_0(x_i, w_k))} = 1, & \text{if } d_0(x_i, w_k) \neq \infty, \\[2ex]
\dfrac{p(w_k)\exp(-d_0(x_i, w_k))}{\sum_{k=1}^{K} p(w_k)\exp(-d_0(x_i, w_k))} = 0, & \text{if } d_0(x_i, w_k) \to \infty.
\end{cases} \quad (4.31)$$

7 According to the definition of the tilted distribution, equation 3.4, it is easy to see that the defined indicator function is a constant number, that is, Q_K(x_i, W) = 1. See also note 3.

Now consider the radius d_0(x_i, w_k) between data point x_i and cluster k at zero temperature. It can be rewritten as an inner product of two n-dimensional vectors of the input space:

$$d_0(x_i, w_k) = \lim_{T\to 0}\frac{d(x_i, w_k)}{T} = \lim_{T\to 0}\frac{\langle x_i - w_k,\ x_i - w_k\rangle}{T} = \sum_{o=1}^{n} r_{ko}\,\varphi_{ko}(X), \quad (4.32)$$

where r_{ko} represents the radius parameter component in the n-dimensional space and φ_{ko}(X) is a linearly independent function, similar to the hyperplane case (Vapnik, 1998).

Using equations 4.32 and 4.31, we can rewrite equation 4.30 as

$$Q_K(x_i, W) = \sum_{k=1}^{K}\bar{\theta}\left(\sum_{o=1}^{n} r_{ko}\,\varphi_{ko}(X) - d_0(x_i, w_k)\right), \quad \forall i, \quad (4.33)$$

where θ̄(·) = 1 − θ(·) is the complement of the step function θ(·). Note that there is one and only one d_0(x_i, w_k) ≠ ∞, ∀(1 ≤ k ≤ K), in each conditional equality of equation 4.31, since this gives a unique cluster membership for any data point x_i in a nested structure S_K. Therefore, the indicator Q_K(x_i, W) is linear in its parameters. According to Vapnik (1998), the VC-dimension of the complexity control parameter is equal to the number of parameters, h_K = (n + 1) * K, for each nested subset S_K. By design of the DA clustering, the nested structure in equation 4.29 provides an ordering of the VC-dimensions, h_1 ≤ h_2 ≤ ··· ≤ h_K, such that the increase of the cluster number is proportional to the increase of the estimated VC-dimension from a neural network point of view (Vapnik, 1998).
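As a sketch of the zero-temperature limit in equations 4.30 to 4.33, the following routine assigns each point to its nearest center, so every row of the indicator matrix sums to Q_K(x_i, W) = 1. The names X and W are hypothetical stand-ins for the pattern set and the centers.

```python
import numpy as np

def hard_indicator(X, W):
    """Zero-temperature tilted distribution (cf. equations 4.30-4.33):
    membership 1 for the nearest center, 0 elsewhere, for each data point."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances, (l, K)
    Q = np.zeros_like(d)
    Q[np.arange(len(X)), d.argmin(1)] = 1.0              # one winning cluster per point
    return Q
```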

To obtain good generalization performance, one has to use the admissible structure, equation 4.29, based on the set of indicator functions to search for an optimal cluster number K. This minimizes a VC-bound p_s similar to that of the support vector machine, except that we are looking for the strongest data points of the input space instead of seeking the weakest data points of the feature (kernel) space (Vapnik, 1998). So we have

$$p_s \leq \eta + \frac{\varepsilon}{2}\left(1 + \sqrt{1 + \frac{4\eta}{\varepsilon}}\right), \quad (4.34)$$


with

$$\eta = \frac{m}{l}, \quad (4.35)$$

$$\varepsilon = 4\,\frac{h_K\left(\ln\frac{2l}{h_K} + 1\right) - \ln\frac{\zeta}{4}}{l}, \quad (4.36)$$

where m is the number of outliers identified in the capacity maximization, as in the previous section, and ζ < 1 is a constant.

The signal-to-noise ratio η in equation 4.35 appears as the first term on the right-hand side of the VC-bound, equation 4.34; it represents the empirical risk, and the second term is the confidence interval of the SRM-based estimate.
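The bound of equations 4.34 to 4.36 is easy to evaluate numerically. The sketch below assumes the square-root term takes the standard Vapnik form with argument 1 + 4η/ε, which is one reading of equation 4.34; the function name and signature are mine.

```python
import numpy as np

def vc_bound(m, l, n, K, zeta=0.1):
    """Real-risk VC-bound p_s of equations 4.34-4.36 (a sketch).

    m    : number of outliers identified by the capacity maximization
    l    : number of input data points
    n    : input dimension, so that h_K = (n + 1) * K
    K    : cluster number
    zeta : constant < 1 (the simulations in this letter use zeta = 0.1)
    """
    h_K = (n + 1) * K
    eta = m / l                                                                    # (4.35)
    eps = 4.0 * (h_K * (np.log(2.0 * l / h_K) + 1.0) - np.log(zeta / 4.0)) / l     # (4.36)
    return eta + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * eta / eps))                # (4.34)
```

With the numbers of example 1 at K = 2 (l = 18, n = 2, m = 4, ζ = 0.1), this returns p_s ≈ 4.98, consistent with the value reported in Figure 3.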

Discussion

Stop criterion and optimal cluster number. At the initial DA clustering stage, with a small cluster number K and a relatively large ratio between the number of input data points and the VC-dimension, say l/h_K > 20 (Vapnik, 1998), the real-risk VC-bound, equation 4.34, is mainly determined by the first term on the right-hand side of the inequality, that is, the empirical risk (signal-to-noise) ratio η in equation 4.35. As the temperature is lowered and the cluster number is increased, a relatively small ratio l/h_K may require both terms on the right-hand side of equation 4.34 to be small simultaneously. Therefore, we can first assess the ratio l/h_K near the upper bound of the critical number 20 to fix a maximum cluster number K = K_max, beyond which the second term of the VC-bound, equation 4.34, may become dominant even for a small empirical risk ratio η, especially in a high-dimensional data space (a small sketch of this rule of thumb follows this paragraph). We can then follow the minimax MI optimization, as in sections 3 and 4, to increase the cluster number from one up to K_max and select the minimum value of the VC-bound, that is, take a trade-off between minimization of the empirical risk and the VC-dimension.
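A tiny helper can encode this rule of thumb. Note that phase I, step 1, states the ratio as l/(n * K), while the paragraph above uses l/h_K; the sketch follows step 1, and the threshold of 20 and the function name are assumptions of this sketch.

```python
def max_cluster_number(l, n, ratio=20):
    """Largest K keeping l / (n * K) near the critical ratio (~20, per Vapnik, 1998)."""
    return max(1, l // (ratio * n))
```

For the data set of example 2 (l = 292, n = 2) it returns K_max = 7, matching the search range used there.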

Selection of λ. The degree of robustness of the RIC algorithm is controlled by the parameter λ. The Kuhn-Tucker condition in corollary 2 tells us that a relatively larger value of λ yields more outliers (noisy patterns). If one chooses λ = 0, the RIC allows the maximum empirical risk, with a possible overcapacity distortion beyond the optimal saddle point and a minimum number of estimated outliers (see Figure 1). In a general clustering problem using the L2 distortion measure, equation 2.2, the selection of λ is insensitive to the determination of an optimal cluster number, because the VC-bound depends only on the relative values of η and h_K over different cluster numbers (see also example 2). As a general rule of thumb, if eliminating more outliers is of interest, we can gradually increase λ and redo the capacity maximization to reject outliers located between intercluster boundaries at an optimal cluster number determined by an arbitrary value of λ.

4.4 Implementation of the RIC Algorithm

Phase I (Minimization)

1. Determine the ratio l/(n * K), which is near the critical number 20, for a maximum cluster number K = K_max, and set p(x_i) = 1/l for i = 1 to l.

2. Initialize T > 2E_max(V_x), where E_max is the largest eigenvalue of the variance matrix V_x of the input pattern set X, K = 1, and p(w_1) = 1.

3. For i = 1, ..., K, perform the fixed-point iteration of the DA clustering according to equations 3.4, 4.15, and 3.12 (see the sketch after this list).

4. Convergence test. If not satisfied, go to 3.

5. If T ≤ T_min, perform the last iteration and stop.

6. Cooling step: T ← αT (α < 1).

7. If K < K_max, check the condition for phase transition for i = 1, ..., K. If a critical temperature T = 2E_max(V_xw) is reached for the clustering, where E_max(V_xw) is the largest eigenvalue of the covariance matrix V_XW in equation 4.28 between the input pattern and code vector (Rose, 1998), add a new center w_{K+1} = w_K + δ with p(w_{K+1}) = p(w_K)/2 and p(w_K) ← p(w_K)/2, and update K + 1 ← K.
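Equations 3.4 and 3.12 are not reproduced in this part of the letter, so the sketch below of phase I, step 3, assumes the standard mass-constrained DA updates of Rose (1998): the Gibbs (tilted) distribution and the weighted mass centers. All names are illustrative.

```python
import numpy as np

def da_fixed_point_step(X, W, p_w, p_x, T):
    """One DA fixed-point iteration (phase I, step 3): recompute the tilted
    distribution and the mass-constrained cluster centers.

    X   : (l, n) input patterns,  W   : (K, n) cluster centers
    p_w : (K,)  cluster masses,   p_x : (l,) a priori pmf,  T : temperature
    """
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)         # squared distances, (l, K)
    g = p_w[None, :] * np.exp(-(d - d.min(1, keepdims=True)) / T)
    p_w_given_x = g / g.sum(1, keepdims=True)                  # tilted distribution (eq. 3.4)

    weights = p_x[:, None] * p_w_given_x                       # joint weights, (l, K)
    p_w_new = weights.sum(0)                                   # updated cluster masses
    W_new = (weights.T @ X) / p_w_new[:, None]                 # mass centers (eq. 3.12)
    return p_w_given_x, p_w_new, W_new
```

Subtracting the per-row minimum distance before exponentiating only rescales numerator and denominator of the tilted distribution equally, so the result is unchanged while overflow is avoided.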

Phase II (Maximization)

8. If this is the first time the robust density estimation is calculated, select p(x_i) = 1/l, ∞ > λ ≥ 0, and ε > 0, and start the fixed-point iteration of the robust density estimation in the following steps 9 to 10.

9. Compute

$$c_i = \exp\left[\sum_{k=1}^{K}\left(p(w_k|x_i)\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)p(w_k|x_i)} - \lambda p(w_k|x_i)\, d(w_k, x_i)\right)\right]. \quad (4.37)$$

10. If

$$\ln\sum_{i=1}^{l} p(x_i)c_i - \ln\max_{i=1,\ldots,l} c_i < \epsilon, \quad (4.38)$$

then go to step 9, where ε > 0; otherwise, update the density estimate

$$p(x_i) = \frac{p(x_i)c_i}{\sum_{i=1}^{l} p(x_i)c_i}. \quad (4.39)$$


11. Verify the robust solutions of the RIC algorithm around the optimal saddle point for a minimum value of the VC-bound, equation 4.34, within the range of the maximum cluster number K_max. If the minimum is found, then delete the outliers and set T → 0 for the tilted distribution to obtain the cluster membership of all input data points for a hard clustering solution; recalculate the cluster centers using equation 3.12 without the outliers, then stop. Otherwise, go to 3.
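Phase II can be read as an Arimoto-Blahut style fixed-point loop. The sketch below implements equations 4.37 and 4.39; the convergence test uses the standard Arimoto-Blahut gap between ln max_i c_i and ln Σ_i p(x_i)c_i, which is one reading of the stop rule in equation 4.38 and step 10, and the loop bounds and names are assumptions.

```python
import numpy as np

def robust_density_phase(p_x, p_w_given_x, dist, lam, eps=1e-6, max_iter=500):
    """Phase II (steps 8-10): fixed-point iteration of the robust density estimate."""
    for _ in range(max_iter):
        p_wk = np.maximum(p_x @ p_w_given_x, 1e-300)           # sum_i p(x_i) p(w_k|x_i)
        ratio = np.where(p_w_given_x > 0, p_w_given_x / p_wk, 1.0)
        log_c = np.sum(p_w_given_x * np.log(ratio)
                       - lam * p_w_given_x * dist, axis=1)      # exponent of eq. 4.37
        log_c -= log_c.max()                                    # common rescaling; the test is scale-invariant
        c = np.exp(log_c)
        if np.log(c.max()) - np.log(p_x @ c) < eps:             # Arimoto-Blahut style stop test (cf. eq. 4.38)
            break
        p_x = p_x * c / (p_x @ c)                               # equation 4.39
    return p_x
```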

5 Simulation Results

This section presents a few simulation examples to show the superiority of the RIC over the standard DA clustering algorithm. This is in fact a self-comparison, since the RIC is just an extension of the DA that identifies outliers for an optimal cluster number. A comparison could also be made with the popular fuzzy c-means (FCM) and the robust version of the FCM clustering algorithms (see section 2). However, this may not make much sense, since the FCM needs a predetermined cluster number in addition to having the initialization problem (Krishnapuram & Keller, 1993).

Example 1, which follows, presents a numerical analysis to reveal details of the weakness of the tilted distribution. This also explains how the robust density estimate of the RIC algorithm finds an optimal cluster number via the identification of outliers. Example 2 illustrates that one can always choose a relatively larger control parameter λ to eliminate more outliers in the intercluster area without affecting the estimated optimal cluster number. Example 3 shows an interesting partition of a specific data set without clear cluster boundaries. In particular, we show that any data point could become an outlier, depending on the given data structure and the chosen cluster centers in the annealing procedure, based on the limited number of input data for a minimum VC-bound. Similarly, we are not looking for "true" clusters or cluster centers but for effective clusters in a sense of the SRM, based on the simple Euclidean distance.8

Example 1. Figure 3 is an extended example used in the robust FCM clustering algorithm (Krishnapuram & Keller, 1993), which has two well-separated clusters with seven data points each and four outliers sitting around the middle position between the two given clusters. The data set has 18 data points, such that the ratio l/h_1 = 18/(3 * 1) is already smaller than the critical number 20. An optimal cluster number should be the minimum of two (note that the DA does not work for one cluster). However, we would like to use this example to reveal the weakness of the tilted distribution and show how the robust density estimate helps. Figure 3 also shows that the RIC algorithm with

8 We set ζ = 0.1 in the VC-bound for all the simulation results. The Matlab program can be downloaded from the author's Internet address, http://www.ntu.edu.sg/home/eqsong.


Figure 3: The clustering results of RIC (λ = 0) in example 1: (a) the original data set; (b) K = 2, p_s = 4.9766; (c) K = 3, p_s = 5.7029; (d) K = 4, p_s = 6.4161. The bigger * represents the estimated cluster center of the RIC after eliminating the estimated outliers. The black dot points are the outliers identified by the RIC in b, c, and d.


Table 1: Optimal Tilted Distribution p(w_k|x_i) and Robust Density Estimate p(x_i) in Example 1 with K = 2.

 i    p(x_i)   p(w_1|x_i)   p(w_2|x_i)
 1    0.3134   0.9994       0.0006
 2    0.0638   0.9991       0.0009
 3    0.0354   0.9987       0.0013
 4    0.0329   0.9987       0.0013
 5    0.0309   0.9987       0.0013
 6    0.0176   0.9981       0.0019
 7    0.0083   0.9972       0.0028
 8    0.0030   0.0028       0.9972
 9    0.0133   0.0019       0.9981
10    0.0401   0.0013       0.9987
11    0.0484   0.0013       0.9987
12    0.0567   0.0013       0.9987
13    0.1244   0.0009       0.9991
14    0.2133   0.0006       0.9994
15    0.0000   0.9994       0.0006
16    0.0000   0.9994       0.0006
17    0.0000   0.9994       0.0006
18    0.0000   0.9994       0.0006

K = 2 identifies the four data points around the middle position between the two clusters as outliers and eliminates them with p(x_i) = 0. Further details on the values of the tilted distribution p(w_k|x_i) and the robust estimate p(x_i) are listed in Table 1 for the case of K = 2. The first 14 rows correspond to the data in the two clusters, and the last 4 rows represent the four identified outliers. Despite the balanced geometric positions of the outliers, the membership of the four outliers is assigned to cluster 1 by the DA because p(w_1|x_i) ≈ 1 for the four outliers. A minor difference due to numerical error may be the only cause for the DA to assign the membership of the four data points to the first cluster. This explains why minimization based on the tilted distribution is not robust (Dave & Krishnapuram, 1997).

More important, the RIC estimates the real risk bound p_s as the cluster number is increased from one, which also eliminates the effect of the outliers. The ratio between the number of total data points and the VC-dimension h_2 is small, at 18/6 = 3, so the second term of the VC-bound becomes dominant as K increases, as shown in Figure 3. The optimal cluster number is determined as "two," with a minimum p_s = 4.9766, despite the fact that the minimum number of outliers of the empirical risk is achieved at the cluster number K = 4. Note also that the original outliers become valid data points as the cluster numbers are increased to K = 3 and K = 4, respectively.

Example 2. The two-dimensional data set has 292 data points, so the ratio l/h_7 = 292/(3 * 7) is well below the critical number 20. We should search for an optimal cluster number from two to seven clusters. Figures 4 and 5


Figure 4: The two-dimensional data set with 292 data points in example 2 is clustered by the RIC algorithm with λ = 0: (a) p_s = 1.5635, K = 2; (b) p_s = 0.6883, K = 3; (c) p_s = 1.1888, K = 4; (d) p_s = 1.4246, K = 5; (e) p_s = 1.3208, K = 6; (f) p_s = 2.4590, K = 7. The black dot points are the outliers identified by the RIC in all pictures.


Figure 5: The two-dimensional data set with 292 data points in example 2 is clustered by the RIC algorithm with λ = 18: (a) p_s = 1.8924, K = 2; (b) p_s = 0.9303, K = 3; (c) p_s = 1.2826, K = 4; (d) p_s = 1.5124, K = 5; (e) p_s = 1.3718, K = 6; (f) p_s = 2.46244, K = 7. The black dot points are the outliers identified by the RIC in all pictures.


Figure 6: The two-dimensional data set with 300 data points in example 3 is clustered by the RIC algorithm with λ = 0: (a) K = 2, p_s = 1.8177, η = 0.8667; (b) K = 3, p_s = 1.3396, η = 0.3900; (c) K = 4, p_s = 0.8486, η = 0; (d) K = 5, p_s = 0.9870, η = 0; (e) K = 6, p_s = 1.1374, η = 0.0033; (f) K = 7, p_s = 2.169, η = 0.4467. The black dot points are the outliers identified by the RIC in all pictures.


show that a "native" noise-free three-cluster data set is clustered by the RIC algorithm with different cluster numbers. The RIC gives the correct optimal cluster number, "three," because the VC-bound p_s attains its minimum value there. This also coincides with the empirical risk of the minimum number of outliers at K = 3 for both cases, λ = 0 and λ = 18. Note that we can always use a relatively larger λ value to eliminate more outliers in the intercluster area without affecting the optimal cluster number in a general clustering problem. The black dot points are the outliers identified by the RIC in all pictures.

Example 3. This is an instructive example showing the application of the RIC algorithm with λ = 0 to a data set without clear cluster boundaries in a two-dimensional space. The data set has 300 data points, such that the ratio l/h_7 = 300/(3 * 7) is well below the critical number 20. We shall search for an optimal cluster number from two to seven clusters. In particular, to show the difference between the empirical risk η and the VC-bound p_s, we indicate both values for each case. Figure 6 illustrates that the optimal cluster number is four, based on the SRM principle. It is interesting to note that the five-cluster case also achieves the minimum number of outliers in a sense of the empirical risk minimization, but its VC-bound p_s is bigger than that of the four-cluster case because of the increase in the VC-dimension.

6 Conclusion

A robust information clustering algorithm has been developed based on the minimax optimization of MI. In addition to the algorithm, the theoretical contributions of this letter are twofold: (1) the capacity maximization is implicitly linked to the distortion measure against the input pattern pmf and provides an upper bound of the empirical risk to phase out outliers; (2) the optimal cluster number is estimated based on the SRM principle of statistical learning theory. The RIC can also be extended to c-shells or kernel-based algorithms to deal with linearly nonseparable data. This is an interesting topic for further research.

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlier version of this letter.

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011–1015.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460–473.

Blahut, R. E. (1988). Principle and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270–293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158–171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinet, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98–110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89–104.

Mackay, D. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035–1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.

Scholkopf, B., Smola, A., & Muller, K. M. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434–448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 32, 440–448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483–2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek and R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.

Page 10: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2681

and the second term is conditional entropy

H(p(W) p(X|W)) = minuslsum

i=1

Ksumk=1

p(wk)p(xi |wk) ln p(xi |wk) (46)

Lemma 1 (inverse theorem) 5 The clustering data reliability is presented in asingle symbol error pe of the input data set with empirical error probability

pe =lsum

i=1

Ksumk =i

p(xi |wk) (47)

such that if the input uncertainty H(p(X)) is greater than C the error pe is boundedaway from zero as

pe ge 1ln l

(H(p(X)) minus C minus 1) (48)

Proof We first give an intuitive discussion here over Fanorsquos inequality (SeeBlahut 1988 for a formal proof)

Uncertainty in the estimated channel input can be broken into two partsthe uncertainty in the channel whether an empirical error pe was made andgiven that an error is made the uncertainty in the true value However theerror occurs with probability pe such that the first uncertainty is H(pe ) =minus(1 minus pe ) ln(1 minus pe ) and can be no larger than ln(l) This occurs only whenall alternative errors are equally likely Therefore if the equivocation can beinterpreted as the information lost we should have Fanorsquos inequality

H(p((W)) p(X|W)) le H(pe ) + pe ln (l) (49)

Now consider that the maximum of the MI is C in equation 42 so we canrewrite equation 44 as

H(p(W) p(X|W)) = H(p(X)) minus I (p(X) p(X|W)) ge H(p(X)) minus C (410)

Then Fanorsquos inequality is applied to get

H(p(X)) minus C le H(pe ) + pe ln(l) le 1 + pe ln l (411)

5 There is a tighter bound pe compared to the one of lemma 1 as in the work of Jelinet(1968) However this may not be very helpful since minimization of the empirical risk isnot necessary to minimize the real structural risk as shown in section 43

2682 Q Song

Lemma 1 gives an important indication that any income information (inputdata) beyond the capacity C will generate unreliable data transmission Thisis also called the inverse theorem in a sense that it uses the DA-generatedoptimal titled distribution to produce the backward transition probabilityequation 41 and assess an upper bound of the empirical risk equation 410

42 Capacity Maximization and the Optimal Solution Equation 33 iswell known to be a soft dissimilarity measure minimized by the DA clus-tering as the temperature T is lowered toward zero (Rose 1998) Howeverthere is no way for the DA to search for an optimal temperature value andin turn an optimal cluster number because the rate distortion function pro-vides only limited information and aims at the empirical risk minimizationas shown in section 3 Therefore we propose a capacity or MI maximizationscheme This is implicitly dependent on the distortion measure similar tothe rate distortion function

We define a constrained maximization of MI as6

C(D(p(X))) = maxp(X)

C(D(p(X))) = maxp(X)

I (p(X) p(W|X)) (412)

with a similar constraint as in equation 33

D(p(X)) =lsum

i=1

Ksumk=1

p(xi )p(wk |xi )d(wk xi ) le D(plowast(X)) (413)

This is because minimization of the soft distortion measure D(plowast(X)) equa-tion 33 is the ultimate target of the DA clustering algorithm as analyzed insection 3 We need to assess maximum possibility to make an error (risk)According to lemma 1 reliability of the input data set depends on the capac-ity that is the maximum value of the MI against the input density estimateTo do this we evaluate the optimal a priori pmf robust density distributionpmf p(xi ) isin (p(X)) to replace the fixed arbitrary plowast(xi ) in the distortion mea-sure equation 33 and assess reliability of the input data of each particularcluster number K based on a posteriori pmf in equation 41 If most of thedata points (if not all) achieve the capacity (fewer outliers) then we canclaim that the clustering result reaches an optimal or at least a subopti-mal solution at this particular cluster number in a sense of empirical riskminimization

6 Here we use a similar notation of the capacity function as for the rate distortionfunction R(D(p(X))) to indicate implicitly that the specific capacity function is in fact animplicit function of the distortion measure D(p(X)) For each particular temperature T the capacity C(D(p(X))) achieves a point at the upper curve corresponding to the lowercarve R(D(plowast(X))) as shown in equation 417

A Robust Information Clustering Algorithm 2683

Similar to the minimization of the rate distortion function in section 3constrained capacity maximization can be rewritten as an optimizationproblem with a Lagrange multiplier λ ge 0

C(D(p(X))) = maxp(X)

[I (p(X) p(W|X)) + λ(D(plowast(X)) minus D(p(X)))] (414)

Theorem 1 Maximum of the constrained capacity C(D(p(X))) is achieved by therobust density estimate

p(xi ) =exp

(sumKk=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )

)suml

i=1 exp(sumK

k=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )) (415)

with the specific distortion measure D(p(X)) = D(plowast(X)) for p(xi ) ge 0 of all 0 lei le l

Proof Similar to Blahut (1972) we can temporarily ignore the conditionp(xi ) ge 0 and set the derivative of the optimal function 414 equal to zeroagainst the independent variable a priori pmf p(xi ) This results in

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)= 0

= minus ln p(xi ) minus 1 +Ksum

k=1

p(wk |xi )(ln p(xi |wk)

minusλp(wk |xi )d(wk xi )) + λ1 p(xi ) (416)

We also select a suitable λ1 which ensure that the probability constraintsumli=1 p(xi ) = 1 is guaranteed and leads to the robust density distribution

estimate equation 415According to the Kuhn-Tucker theorem (Blahut 1988) if there exists an

optimal robust distribution p(xi ) which is derived from equation 415 thenthe inequality constraint equation 413 of the distortion measure becomesequality and achieves the optimal solution of equation 414 at an optimalsaddle point between the curve C(D(p(X))) and R(D(plowast(X))) with the cor-responding average distortion measure

D(p(X)) = D(plowast(X))) (417)

By dividing the input data into effective clusters the DA clustering min-imizes the relative Shannon entropy without a priori knowledge of the datadistribution (Gray 1990) The prototype (cluster center) equation 312 is

2684 Q Song

1w 2w

1

1

(w | x )( (w | x )ln )

(w | x ) (x )

(x ) 0

Kk i

k i lk

k i ii

i

pp C

p p

p

2(w | x )ip1(w | x )ip

2(x | w )ip1(x | w )ip

Figure 2 The titled distribution and robust density estimation based on theinverse theorem for a two-cluster data set

clearly presented as a mass center This is insensitive to the initialization ofcluster centers and volumes with a fixed probability distribution for exam-ple an equal value plowast(xi ) = 1 l for the entire input data points (Rose 1998)Therefore the prototype parameter αki depends on the titled distributionp(wk |xi ) equation 34 which tends to associate the membership of any par-ticular pattern in all clusters and is not robust against outlier or disturbanceof the training data (Dave amp Krishnapuram 1997) This in turn generatesdifficulties in determining an optimal cluster number as shown in Figure 2(see also the simulation results) Any data point located around the middleposition between two effective clusters could be considered an outlier

Corollary 1 The capacity curve C(D(p(X))) is continuous nondecreasing andconcave on D(p(X)) for any particular cluster number K

Proof Let pprime(xi ) isin pprime(X) and pprimeprime(xi) isin pprimeprime(X) achieve [D(pprime(X)) C(D(pprime(X)))]and [D(pprimeprime(X)) C(D(pprimeprime(X)))] respectively and p(xi ) = λprime pprime(xi ) + λprimeprime pprimeprime(xi ) isan optimal density estimate in theorem 1 where λprimeprime = 1 minus λprime and 0 lt λprime lt 1Then

D(p(X)) =lsum

i=1

Ksumk=1

(λprime pprime(xi ) + λprimeprime pprimeprime(xi ))p(wk |xi )d(wk xi )

= λprime D(pprime(X)) + λprimeprime D(pprimeprime(X)) (418)

A Robust Information Clustering Algorithm 2685

and because p(X) is the optimal value we have

C(D(p(X))) ge I (p(X) p(W|X)) (419)

Now we use the fact that I (p(X) p(W|X)) is concave (upward convex) inp(X) (Jelinet 1968 Blahut 1988) and arrive at

C(D(p(X))) ge λprime I (pprime(X) p(W|X)) + λprimeprime I (pprimeprime(X) p(W|X)) (420)

We have finally

C(λprime D(pprime(X)) + λprimeprime D(pprimeprime(X))) ge λprimeC(D(pprime(X))) + λprimeprimeC(D(pprimeprime(X))) (421)

Furthermore because C(D(p(X))) is concave on [0 Dmax] it is continuousnonnegative and nondecreasing to achieve the maximum value at Dmaxwhich must also be strictly increased for D(p(X)) smaller than Dmax

Corollary 2 The robust distribution estimate p(X) achieves the capacity at

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)= V forallp(xi ) = 0

(422)

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)ltV forallp(xi ) = 0

(423)

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik1998)

p(xi )

[V minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

))]= 0 foralli (424)

Proof Similar to the proof of theorem 1 we use the concave property ofC(D(p(X)))

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)ge 0 (425)

2686 Q Song

which can be rewritten as

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)le minusλ1 + 1 foralli

(426)

with equality for all p(xi ) = 0 Setting minusλ1 + 1 = V completes the proof

Similarly it is easy to show that if we choose λ = 0 the Kuhn-Tucker con-dition becomes

p(xi )

[C minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

))]= 0 foralli (427)

where C is the maximum capacity value defined in equation 42Note that the MI is not negative However individual items in the sum of

the capacity maximization equation 42 can be negative If the i th patternxi is taken into account and p(wk |xi ) lt

sumli=1 p(xi )p(wk |xi ) then the prob-

ability of the kth code vector (cluster center) is decreased by the observedpattern and gives negative information about pattern xi This particularinput pattern may be considered an unreliable pattern (outlier) and itsnegative effect must be offset by other input patterns Therefore the max-imization of the MI equation 42 provides a robust density estimation ofthe noisy pattern (outlier) in terms that the average information is over allclusters and input patterns The robust density estimation and optimiza-tion is now to maximize the MI against the pmf p(xi ) and p(xi |wk) for anyvalue of i if p(xi |wk) = 0 then p(xi ) should be set equal to zero in orderto obtain the maximum such that a corresponding training pattern (outlier)xi can be deleted and dropped from further consideration in the optimiza-tion procedure as outlier shown in Figure 2

As a by-product the robust density estimation leads to an improvedcriterion at calculation of the critical temperature to split the input data setinto more clusters of the RIC compared to the DA as the temperature islowered (Rose 1998) The critical temperature of the RIC can be determinedby the maximum eigenvalue of the covariance (Rose 1998)

VXW =lsum

i=1

p(xi |wk)(xi minus wk)(xi minus wk)T (428)

where p(xi |wk) is optimized by equation 41 This has a bigger value repre-senting the reliable data since the channel communication error pe is rela-tively smaller compared to the one of outlier (see lemma 1)

A Robust Information Clustering Algorithm 2687

43 Structural Risk Minimization and Optimal Cluster Number Tosolve the intertwined outlier and cluster number problem some intuitivenotations can be obtained based on classical information theory as presentedthe previous sections Increasing K and model complexity (as the tempera-ture is lowered) may reduce capacity C(D(p(X))) since it is a nondecreasingfunction of D(p(X)) as shown in corollary 1 (see also Figure 1) Thereforein view of theorem 1 we should use the smallest cluster number as longas a relatively small number of outliers is achieved (if not zero outlier) say1 percent of the entire input data points However how to make a trade-off between empirical risk minimization and capacity maximization is adifficult problem for classical information theory

We can solve this difficulty by bridging the gap between classical infor-mation theory on which the RIC algorithm is based and the relatively newstatistical learning theory with the so-called structural risk minimization(SRM) principle (Vapnik 1998) Under the SRM a set of admissible struc-tures with nested subsets can be defined specifically for the RIC clusteringproblem as

S1 sub S2 sub sub SK (429)

where SK = (QK (xi W) W isin K ) foralli with a set of indicator functions ofthe empirical risk7

QK (xi W) =Ksum

k=1

limTrarr0

p(wk |xi ) =Ksum

k=1

limTrarr0

p(wk) exp(minusd(xi wk)T)Nxi

foralli

(430)

We shall show that the titled distribution p(wk |xi ) equation 34 at zero tem-perature as in equation 430 can be approximated by the complement ofa step function This is linear in parameters and assigns the cluster mem-bership of each input data point based on the Euclidean distance betweendata point xi and cluster center wk for a final hard clustering partition (Rose1998 see also the algorithm in section 44)

The titled distribution at T rarr 0 can be presented as

limTrarr0

p(wk) exp(minusd(xi wk)T)sumKk=1 p(wk) exp(minusd(xi wk)T)

7 According to definition of the titled distribution equation 34 it is easy to see thatthe defined indictor function is a constant number that is QK (xi W) = 1 See also note 3

2688 Q Song

asymp

p(wk) exp(minusd0(xi wk))p(wk) exp(minusd0(xi wk))

= 1 if d0(xi wk) = infin

p(wk) exp(minusd0(xi wk))sumKk=1 p(wk) exp(minusd0(xi wk))

= 0 if d0(xi wk) rarr infin

(431)

Now consider the radius d0(xi wk) between data point xi and clusterk at zero temperature This can be rewritten as an inner product of twon-dimensional vectors of the input space as

d0(xi wk) = limTrarr0

d(xi wk)T

= limTrarr0

lt xi minus wk gt lt xi minus wk gt

T

=nsum

o=1

rkoφko(X) (432)

where rko represents the radius parameter component in the n-dimensionalspace and φko(X) is a linearly independent function similar to the hyper-plane case (Vapnik 1998)

Using equations 432 and 431 we can rewrite 430 as

QK (xi W) =Ksum

k=1

θ

(nsum

o=1

rkoφko(X) minus d0(xi wk)

) foralli (433)

where θ () = 1 minus θ () is the complement of the step function θ ()Note that there is one and only one d0(xi wk) = infin forall(1 le k le K ) in each

conditional equality of equation 431 since it gives a unique cluster mem-bership of any data point xi in a nested structure SK Therefore the indi-cator QK (xi W) is linear in parameters According to Vapnik (1998) theVC-dimension of the complexity control parameter is equal to the numberof parameters hK = (n + 1) lowast K for each nested subset SK By design of theDA clustering the nested structure in equation 429 provides ordering ofthe VC-dimension h1 le h2 le le hK such that the increase of clusternumber is proportional to the increase of the estimated VC-dimension froma neural network point of view (Vapnik 1998)

To obtain good generalization performance one has to use the admissiblestructure equation 429 based on the set of indicator functions to search foran optimal cluster number K This minimizes a VC-bound ps similar to thatof the support vector machine except that we are looking for the strongestdata point of the input space instead of seeking the weakest data point ofthe feature (kernel) space (Vapnik 1998) So we have

ps le η + ε

2

(1 +

(1 + η

)12)

(434)

A Robust Information Clustering Algorithm 2689

with

η = ml

(435)

ε = 4hK

(ln 2l

hK+ 1

)minus ln ζ

4

l (436)

where m is the number of outliers identified in the capacity maximizationas in the previous section ζ lt 1 is a constant

The signal-to-noise ratio η in equation 435 appears as the first term ofthe right-hand side of the VC-bound equation 434 This represents theempirical risk and the second term is the confidence interval of the SRM-based estimate

Discussion

Stop criterion and optimal cluster number At the initial DA clusteringstage with a small cluster number K and relatively large ratio betweenthe number of input data points and the VC-dimension say l

hKgt 20

(Vapnik 1998) the real risk VC-bound equation 434 is mainly deter-mined by the first term of the right-hand side of the inequality thatis the empirical risk (signal-to-noise) ratio η in equation 435 As thetemperature is lowered and the cluster number is increased a rela-tively small ratio l

hKmay require both terms in the right-hand side of

equation 434 to be small simultaneously Therefore we can assess firstthe ratio l(hK ) which is near the upper bound of the critical number20 for a maximum cluster number K = Kmax beyond which the sec-ond term of the VC-bound equation 434 may become dominant evenfor a small empirical risk ratio η especially in a high-dimensional dataspace Therefore we can follow the minimax MI optimization as insections 3 and 4 to increase the cluster number from one until Kmax fora minimum value of the VC-bound that is take a trade-off betweenminimization of the empirical risk and VC-dimension

Selection of λ The degree of robustness of the RIC algorithm is con-trolled by the parameter λ The Kuhn-Tucker condition in corollary 2tells that a relatively larger value of λ yields more outliers (noisy pat-terns) If one chooses λ = 0 the RIC allows the maximum empiricalrisk with a possible overcapacity distortion beyond the optimal saddlepoint and a minimum number of the estimated outliers (see Figure 1)In a general clustering problem using the L2 distortion measure equa-tion 22 selection of the λ is insensitive to determination of an optimalcluster number because the VC-bound depends on only the relativevalues of η and hK over different cluster numbers (see also example 2)

2690 Q Song

As a general rule of thumb if eliminating more outliers is an interestwe can gradually increase λ and redo the capacity maximization toreject outliers located between intercluster boundaries at an optimalcluster number determined by an arbitrary value of λ

44 Implementation of the RIC Algorithm

Phase I (Minimization)

1 Determine the ratio l(n lowast K ) which is near the critical number 20 fora maximum cluster number K = Kmax and p(xi ) = 1 l for i = 1 to l

2 Initialize T gt 2Emax(Vx) where Emax is the largest eigenvalue of thevariance matrix Vx of the input pattern set X K = 1 and p(w1) = 1

3 For i = 1 K of the fixed-point iteration of the DA clustering ac-cording to equations 34 415 and 312

4 Convergence test If not satisfied go to 3

5 If T le Tmin perform the last iteration and stop

6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 11: A Robust Information Clustering Algorithm

2682 Q Song

Lemma 1 gives an important indication that any income information (inputdata) beyond the capacity C will generate unreliable data transmission Thisis also called the inverse theorem in a sense that it uses the DA-generatedoptimal titled distribution to produce the backward transition probabilityequation 41 and assess an upper bound of the empirical risk equation 410

42 Capacity Maximization and the Optimal Solution Equation 33 iswell known to be a soft dissimilarity measure minimized by the DA clus-tering as the temperature T is lowered toward zero (Rose 1998) Howeverthere is no way for the DA to search for an optimal temperature value andin turn an optimal cluster number because the rate distortion function pro-vides only limited information and aims at the empirical risk minimizationas shown in section 3 Therefore we propose a capacity or MI maximizationscheme This is implicitly dependent on the distortion measure similar tothe rate distortion function

We define a constrained maximization of MI as6

C(D(p(X))) = maxp(X)

C(D(p(X))) = maxp(X)

I (p(X) p(W|X)) (412)

with a similar constraint as in equation 33

D(p(X)) =lsum

i=1

Ksumk=1

p(xi )p(wk |xi )d(wk xi ) le D(plowast(X)) (413)

This is because minimization of the soft distortion measure D(plowast(X)) equa-tion 33 is the ultimate target of the DA clustering algorithm as analyzed insection 3 We need to assess maximum possibility to make an error (risk)According to lemma 1 reliability of the input data set depends on the capac-ity that is the maximum value of the MI against the input density estimateTo do this we evaluate the optimal a priori pmf robust density distributionpmf p(xi ) isin (p(X)) to replace the fixed arbitrary plowast(xi ) in the distortion mea-sure equation 33 and assess reliability of the input data of each particularcluster number K based on a posteriori pmf in equation 41 If most of thedata points (if not all) achieve the capacity (fewer outliers) then we canclaim that the clustering result reaches an optimal or at least a subopti-mal solution at this particular cluster number in a sense of empirical riskminimization

6 Here we use a similar notation of the capacity function as for the rate distortionfunction R(D(p(X))) to indicate implicitly that the specific capacity function is in fact animplicit function of the distortion measure D(p(X)) For each particular temperature T the capacity C(D(p(X))) achieves a point at the upper curve corresponding to the lowercarve R(D(plowast(X))) as shown in equation 417

A Robust Information Clustering Algorithm 2683

Similar to the minimization of the rate distortion function in section 3, the constrained capacity maximization can be rewritten as an optimization problem with a Lagrange multiplier λ ≥ 0:

C(D(p(X))) = \max_{p(X)} \left[ I(p(X), p(W|X)) + \lambda \left( D(p^*(X)) - D(p(X)) \right) \right]. (4.14)

Theorem 1. The maximum of the constrained capacity C(D(p(X))) is achieved by the robust density estimate

p(x_i) = \frac{\exp\left(\sum_{k=1}^{K} p(w_k|x_i) \ln p(x_i|w_k) - \lambda p(w_k|x_i) d(w_k, x_i)\right)}{\sum_{i=1}^{l} \exp\left(\sum_{k=1}^{K} p(w_k|x_i) \ln p(x_i|w_k) - \lambda p(w_k|x_i) d(w_k, x_i)\right)}, (4.15)

with the specific distortion measure D(p(X)) = D(p^*(X)) for p(x_i) ≥ 0, for all 1 ≤ i ≤ l.

Proof. Similar to Blahut (1972), we can temporarily ignore the condition p(x_i) ≥ 0 and set the derivative of the objective function 4.14 with respect to the independent variable, the a priori pmf p(x_i), equal to zero. This results in

\frac{\partial}{\partial p(x_i)} \left( C(D(p(X))) + \lambda_1 \left( \sum_{i=1}^{l} p(x_i) - 1 \right) \right) = -\ln p(x_i) - 1 + \sum_{k=1}^{K} p(w_k|x_i) \left( \ln p(x_i|w_k) - \lambda p(w_k|x_i) d(w_k, x_i) \right) + \lambda_1 = 0. (4.16)

We also select a suitable λ_1, which ensures that the probability constraint \sum_{i=1}^{l} p(x_i) = 1 is satisfied; this leads to the robust density distribution estimate, equation 4.15.

According to the Kuhn-Tucker theorem (Blahut, 1988), if there exists an optimal robust distribution p(x_i) derived from equation 4.15, then the inequality constraint 4.13 on the distortion measure becomes an equality and achieves the optimal solution of equation 4.14 at an optimal saddle point between the curves C(D(p(X))) and R(D(p^*(X))), with the corresponding average distortion measure

D(p(X)) = D(p^*(X)). (4.17)
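
For illustration only (this is not from the original letter), equation 4.15 can be evaluated as a normalized exponential over the data points, assuming the tilted distribution p(w_k|x_i), the backward transition probabilities p(x_i|w_k) of equation 4.1, the distances d(w_k, x_i), and λ are already available as arrays:

```python
import numpy as np

def robust_density_estimate(p_w_given_x, p_x_given_w, d, lam):
    """Robust density estimate p(x_i), equation 4.15.

    p_w_given_x : (l, K) tilted distribution p(w_k|x_i), equation 3.4
    p_x_given_w : (l, K) backward transition probability p(x_i|w_k), equation 4.1
    d           : (l, K) squared Euclidean distances d(w_k, x_i)
    lam         : Lagrange multiplier lambda >= 0
    """
    log_terms = p_w_given_x * np.log(p_x_given_w + 1e-300) - lam * p_w_given_x * d
    exponent = log_terms.sum(axis=1)          # sum over the K clusters
    exponent -= exponent.max()                # numerical stabilization; cancels in the ratio
    p_x = np.exp(exponent)
    return p_x / p_x.sum()                    # normalize over the l data points
```

Data points whose exponent lies far below that of the reliable data receive a vanishing p(x_i) and are thereby phased out as outliers.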

By dividing the input data into effective clusters, the DA clustering minimizes the relative Shannon entropy without a priori knowledge of the data distribution (Gray, 1990). The prototype (cluster center), equation 3.12, is clearly presented as a mass center. This is insensitive to the initialization of cluster centers and volumes with a fixed probability distribution, for example, an equal value p^*(x_i) = 1/l for the entire set of input data points (Rose, 1998). Therefore, the prototype parameter α_{ki} depends on the tilted distribution p(w_k|x_i), equation 3.4, which tends to spread the membership of any particular pattern over all clusters and is not robust against outliers or disturbances of the training data (Dave & Krishnapuram, 1997). This in turn creates difficulties in determining an optimal cluster number, as shown in Figure 2 (see also the simulation results): any data point located around the middle position between two effective clusters could be considered an outlier.

Figure 2: The tilted distribution and robust density estimation based on the inverse theorem for a two-cluster data set.

Corollary 1. The capacity curve C(D(p(X))) is continuous, nondecreasing, and concave in D(p(X)) for any particular cluster number K.

Proof. Let p'(x_i) ∈ p'(X) and p''(x_i) ∈ p''(X) achieve [D(p'(X)), C(D(p'(X)))] and [D(p''(X)), C(D(p''(X)))], respectively, and let p(x_i) = λ'p'(x_i) + λ''p''(x_i) be an optimal density estimate in theorem 1, where λ'' = 1 − λ' and 0 < λ' < 1. Then

D(p(X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} (\lambda' p'(x_i) + \lambda'' p''(x_i)) p(w_k|x_i) d(w_k, x_i) = \lambda' D(p'(X)) + \lambda'' D(p''(X)), (4.18)


and because p(X) is the optimal value, we have

C(D(p(X))) \ge I(p(X), p(W|X)). (4.19)

Now we use the fact that I(p(X), p(W|X)) is concave (upward convex) in p(X) (Jelinek, 1968; Blahut, 1988) and arrive at

C(D(p(X))) \ge \lambda' I(p'(X), p(W|X)) + \lambda'' I(p''(X), p(W|X)). (4.20)

Finally, we have

C(\lambda' D(p'(X)) + \lambda'' D(p''(X))) \ge \lambda' C(D(p'(X))) + \lambda'' C(D(p''(X))). (4.21)

Furthermore, because C(D(p(X))) is concave on [0, D_max], it is continuous, nonnegative, and nondecreasing, achieving its maximum value at D_max; it must also be strictly increasing for D(p(X)) smaller than D_max.

Corollary 2. The robust distribution estimate p(X) achieves the capacity at

\sum_{k=1}^{K} p(w_k|x_i) \left( \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) = V, \quad \forall p(x_i) \ne 0, (4.22)

\sum_{k=1}^{K} p(w_k|x_i) \left( \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) < V, \quad \forall p(x_i) = 0. (4.23)

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik, 1998)

p(x_i) \left[ V - \left( \sum_{k=1}^{K} p(w_k|x_i) \left( \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) \right) \right] = 0, \quad \forall i. (4.24)

Proof. Similar to the proof of theorem 1, we use the concavity of C(D(p(X))):

\frac{\partial}{\partial p(x_i)} \left( C(D(p(X))) + \lambda_1 \left( \sum_{i=1}^{l} p(x_i) - 1 \right) \right) \ge 0, (4.25)


which can be rewritten as

\sum_{k=1}^{K} p(w_k|x_i) \left( \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) \le -\lambda_1 + 1, \quad \forall i, (4.26)

with equality for all p(x_i) \ne 0. Setting -\lambda_1 + 1 = V completes the proof.

Similarly, it is easy to show that if we choose λ = 0, the Kuhn-Tucker condition becomes

p(x_i) \left[ C - \left( \sum_{k=1}^{K} p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} \right) \right] = 0, \quad \forall i, (4.27)

where C is the maximum capacity value defined in equation 4.2. Note that the MI is not negative. However, individual items in the sum of the capacity maximization, equation 4.2, can be negative. If the ith pattern x_i is taken into account and p(w_k|x_i) < \sum_{i=1}^{l} p(x_i) p(w_k|x_i), then the probability of the kth code vector (cluster center) is decreased by the observed pattern, which gives negative information about pattern x_i. This particular input pattern may be considered an unreliable pattern (outlier), and its negative effect must be offset by other input patterns. Therefore, the maximization of the MI, equation 4.2, provides a robust density estimate of the noisy patterns (outliers), in the sense that the average information is taken over all clusters and input patterns. The robust density estimation and optimization now maximize the MI with respect to the pmfs p(x_i) and p(x_i|w_k): for any value of i, if p(x_i|w_k) = 0, then p(x_i) should be set equal to zero in order to attain the maximum, so that the corresponding training pattern (outlier) x_i can be deleted and dropped from further consideration in the optimization procedure, as for the outlier shown in Figure 2.
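
As an illustrative sketch (not part of the original letter), the per-pattern term appearing in equations 4.22 to 4.24 can be computed to flag candidate outliers: under the Kuhn-Tucker condition, patterns whose contribution falls strictly below the common value V (taken here as the maximum contribution) are exactly those assigned p(x_i) = 0. The array names and tolerance are assumptions:

```python
import numpy as np

def pattern_contributions(p_x, p_w_given_x, d, lam):
    """sum_k p(w_k|x_i) ( ln[p(w_k|x_i) / sum_i p(x_i) p(w_k|x_i)] - lam * p(w_k|x_i) * d(w_k, x_i) )."""
    p_w = p_x @ p_w_given_x                               # sum_i p(x_i) p(w_k|x_i), shape (K,)
    log_ratio = np.log(p_w_given_x + 1e-300) - np.log(p_w + 1e-300)
    return (p_w_given_x * (log_ratio - lam * p_w_given_x * d)).sum(axis=1)

def flag_outliers(p_x, p_w_given_x, d, lam, tol=1e-6):
    """Indices whose contribution is strictly below V (cf. equations 4.22 and 4.23)."""
    contrib = pattern_contributions(p_x, p_w_given_x, d, lam)
    V = contrib.max()
    return np.where(contrib < V - tol)[0]
```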

As a by-product, the robust density estimation leads to an improved criterion for calculating the critical temperature at which the RIC splits the input data set into more clusters, compared to the DA, as the temperature is lowered (Rose, 1998). The critical temperature of the RIC can be determined by the maximum eigenvalue of the covariance matrix (Rose, 1998)

V_{XW} = \sum_{i=1}^{l} p(x_i|w_k) (x_i - w_k)(x_i - w_k)^T, (4.28)

where p(x_i|w_k) is optimized by equation 4.1. This takes a larger value for the reliable data, since the channel communication error p_e is relatively smaller than that of an outlier (see lemma 1).
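
A minimal sketch (not from the original letter) of the critical-temperature test based on equation 4.28: for each cluster center w_k, the covariance V_XW is formed with the optimized p(x_i|w_k), and a phase transition is detected when the temperature reaches twice its largest eigenvalue (Rose, 1998). The function and variable names are illustrative:

```python
import numpy as np

def critical_temperature(X, w_k, p_x_given_wk):
    """Return 2 * E_max(V_XW) for one cluster center w_k, equation 4.28.

    X            : (l, n) input patterns
    w_k          : (n,) cluster center
    p_x_given_wk : (l,) backward transition probabilities p(x_i|w_k)
    """
    diff = X - w_k                                    # (l, n)
    V_xw = (p_x_given_wk[:, None] * diff).T @ diff    # sum_i p(x_i|w_k)(x_i - w_k)(x_i - w_k)^T
    e_max = np.linalg.eigvalsh(V_xw).max()            # largest eigenvalue of the symmetric matrix
    return 2.0 * e_max

# a phase transition (cluster split) is signaled when T <= critical_temperature(X, w_k, p_x_given_wk)
```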


4.3 Structural Risk Minimization and Optimal Cluster Number. To solve the intertwined outlier and cluster number problem, some intuition can be obtained from classical information theory, as presented in the previous sections. Increasing K and the model complexity (as the temperature is lowered) reduces the distortion D(p(X)) and therefore may reduce the capacity C(D(p(X))), since the capacity is a nondecreasing function of D(p(X)), as shown in corollary 1 (see also Figure 1). Therefore, in view of theorem 1, we should use the smallest cluster number as long as a relatively small number of outliers is achieved (if not zero outliers), say 1 percent of the entire set of input data points. However, how to make a trade-off between empirical risk minimization and capacity maximization is a difficult problem for classical information theory.

We can solve this difficulty by bridging the gap between classical information theory, on which the RIC algorithm is based, and the relatively new statistical learning theory with the so-called structural risk minimization (SRM) principle (Vapnik, 1998). Under the SRM, a set of admissible structures with nested subsets can be defined specifically for the RIC clustering problem as

S_1 \subset S_2 \subset \cdots \subset S_K, (4.29)

where S_K = (Q_K(x_i, W), W \in K), \forall i, with a set of indicator functions of the empirical risk^7

Q_K(x_i, W) = \sum_{k=1}^{K} \lim_{T \to 0} p(w_k|x_i) = \sum_{k=1}^{K} \lim_{T \to 0} \frac{p(w_k) \exp(-d(x_i, w_k)/T)}{N_{x_i}}, \quad \forall i. (4.30)

We shall show that the tilted distribution p(w_k|x_i), equation 3.4, at zero temperature, as in equation 4.30, can be approximated by the complement of a step function. This is linear in its parameters and assigns the cluster membership of each input data point based on the Euclidean distance between data point x_i and cluster center w_k for a final hard clustering partition (Rose, 1998; see also the algorithm in section 4.4).

The tilted distribution at T → 0 can be presented as

\lim_{T \to 0} \frac{p(w_k) \exp(-d(x_i, w_k)/T)}{\sum_{k=1}^{K} p(w_k) \exp(-d(x_i, w_k)/T)} \approx \begin{cases} \dfrac{p(w_k) \exp(-d_0(x_i, w_k))}{p(w_k) \exp(-d_0(x_i, w_k))} = 1, & \text{if } d_0(x_i, w_k) \ne \infty, \\ \dfrac{p(w_k) \exp(-d_0(x_i, w_k))}{\sum_{k=1}^{K} p(w_k) \exp(-d_0(x_i, w_k))} = 0, & \text{if } d_0(x_i, w_k) \to \infty. \end{cases} (4.31)

7 According to the definition of the tilted distribution, equation 3.4, it is easy to see that the defined indicator function is a constant number, that is, Q_K(x_i, W) = 1. See also note 3.

Now consider the radius d_0(x_i, w_k) between data point x_i and cluster k at zero temperature. This can be rewritten as an inner product of two n-dimensional vectors of the input space as

d_0(x_i, w_k) = \lim_{T \to 0} \frac{d(x_i, w_k)}{T} = \lim_{T \to 0} \frac{\langle x_i - w_k, \, x_i - w_k \rangle}{T} = \sum_{o=1}^{n} r_{ko} \varphi_{ko}(X), (4.32)

where r_{ko} represents the radius parameter component in the n-dimensional space and \varphi_{ko}(X) is a linearly independent function, similar to the hyperplane case (Vapnik, 1998).

Using equations 4.32 and 4.31, we can rewrite 4.30 as

Q_K(x_i, W) = \sum_{k=1}^{K} \bar{\theta} \left( \sum_{o=1}^{n} r_{ko} \varphi_{ko}(X) - d_0(x_i, w_k) \right), \quad \forall i, (4.33)

where \bar{\theta}(\cdot) = 1 - \theta(\cdot) is the complement of the step function \theta(\cdot). Note that there is one and only one d_0(x_i, w_k) \ne \infty, \forall (1 \le k \le K), in each conditional equality of equation 4.31, since this gives a unique cluster membership of any data point x_i in a nested structure S_K. Therefore the indicator Q_K(x_i, W) is linear in its parameters. According to Vapnik (1998), the VC-dimension of the complexity control parameter is equal to the number of parameters, h_K = (n + 1) * K, for each nested subset S_K. By design of the DA clustering, the nested structure in equation 4.29 provides an ordering of the VC-dimension, h_1 \le h_2 \le \cdots \le h_K, such that an increase in the cluster number is proportional to an increase in the estimated VC-dimension, from a neural network point of view (Vapnik, 1998).
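
As an illustration of the zero-temperature limit in equations 4.30 to 4.33 (this sketch is not part of the original letter), the tilted distribution collapses to a hard indicator that assigns each data point to its nearest cluster center under the Euclidean distance, and the VC-dimension of the nested subset S_K is taken as h_K = (n + 1) * K:

```python
import numpy as np

def hard_assignment(X, W):
    """Zero-temperature cluster membership: each x_i is assigned to its nearest center w_k."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances, (l, K)
    return d.argmin(axis=1)                                   # membership index for each data point

def vc_dimension(n, K):
    """h_K = (n + 1) * K for the nested subset S_K."""
    return (n + 1) * K
```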

To obtain good generalization performance, one has to use the admissible structure, equation 4.29, based on the set of indicator functions, to search for an optimal cluster number K that minimizes a VC-bound p_s, similar to that of the support vector machine, except that we are looking for the strongest data points of the input space instead of seeking the weakest data points of the feature (kernel) space (Vapnik, 1998). So we have

p_s \le \eta + \frac{\varepsilon}{2} \left( 1 + \left( 1 + \frac{4\eta}{\varepsilon} \right)^{1/2} \right), (4.34)

with

\eta = \frac{m}{l}, (4.35)

\varepsilon = \frac{4 \left( h_K \left( \ln \frac{2l}{h_K} + 1 \right) - \ln \frac{\zeta}{4} \right)}{l}, (4.36)

where m is the number of outliers identified in the capacity maximization, as in the previous section, and ζ < 1 is a constant.

The signal-to-noise ratio η in equation 4.35 appears as the first term on the right-hand side of the VC-bound, equation 4.34; this represents the empirical risk, and the second term is the confidence interval of the SRM-based estimate.
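
For illustration (not part of the original letter), the VC-bound of equations 4.34 to 4.36 can be computed directly from the number of outliers m, the data size l, the VC-dimension h_K = (n + 1)K, and the constant ζ. With m = 4, l = 18, h_K = 6, and ζ = 0.1, this sketch gives p_s ≈ 4.98, which is consistent with example 1 below:

```python
import numpy as np

def vc_bound(m, l, h_K, zeta=0.1):
    """Real-risk VC-bound p_s of equations 4.34-4.36."""
    eta = m / l                                                                    # empirical risk, (4.35)
    eps = 4.0 * (h_K * (np.log(2.0 * l / h_K) + 1.0) - np.log(zeta / 4.0)) / l     # confidence term, (4.36)
    return eta + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * eta / eps))                # bound, (4.34)

# example 1 with K = 2: l = 18, h_K = 3 * 2 = 6, m = 4 outliers
print(round(vc_bound(4, 18, 6), 4))   # approximately 4.9766
```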

Discussion

Stop criterion and optimal cluster number. At the initial DA clustering stage, with a small cluster number K and a relatively large ratio between the number of input data points and the VC-dimension, say l/h_K > 20 (Vapnik, 1998), the real-risk VC-bound, equation 4.34, is mainly determined by the first term on the right-hand side of the inequality, that is, the empirical risk (signal-to-noise) ratio η in equation 4.35. As the temperature is lowered and the cluster number is increased, a relatively small ratio l/h_K may require both terms on the right-hand side of equation 4.34 to be small simultaneously. Therefore, we can first assess the ratio l/h_K, which should be near the upper bound of the critical number 20, for a maximum cluster number K = K_max, beyond which the second term of the VC-bound, equation 4.34, may become dominant even for a small empirical risk ratio η, especially in a high-dimensional data space. We can then follow the minimax MI optimization, as in sections 3 and 4, increasing the cluster number from one up to K_max and selecting the minimum value of the VC-bound, that is, taking a trade-off between minimization of the empirical risk and the VC-dimension.

Selection of λ. The degree of robustness of the RIC algorithm is controlled by the parameter λ. The Kuhn-Tucker condition in corollary 2 tells us that a relatively larger value of λ yields more outliers (noisy patterns). If one chooses λ = 0, the RIC allows the maximum empirical risk, with a possible overcapacity distortion beyond the optimal saddle point and a minimum number of estimated outliers (see Figure 1). In a general clustering problem using the L2 distortion measure, equation 2.2, the selection of λ is insensitive to the determination of an optimal cluster number, because the VC-bound depends only on the relative values of η and h_K over different cluster numbers (see also example 2). As a general rule of thumb, if eliminating more outliers is of interest, we can gradually increase λ and redo the capacity maximization to reject outliers located between intercluster boundaries at an optimal cluster number determined by an arbitrary value of λ.

4.4 Implementation of the RIC Algorithm

Phase I (Minimization)

1. Determine the ratio l/((n + 1) * K), which should be near the critical number 20, for a maximum cluster number K = K_max, and set p(x_i) = 1/l for i = 1 to l.

2. Initialize T > 2E_max(V_x), where E_max(V_x) is the largest eigenvalue of the variance matrix V_x of the input pattern set X; set K = 1 and p(w_1) = 1.

3. For i = 1, . . . , K, run the fixed-point iteration of the DA clustering according to equations 3.4, 4.15, and 3.12.

4. Convergence test: if not satisfied, go to step 3.

5. If T \le T_min, perform the last iteration and stop.

6. Cooling step: T \leftarrow \alpha T (\alpha < 1).

7. If K < K_max, check the condition for phase transition for i = 1, . . . , K. If a critical temperature T = 2E_max(V_XW), where E_max(V_XW) is the largest eigenvalue of the covariance matrix V_XW in equation 4.28 between the input patterns and the code vector (Rose, 1998), is reached for the clustering, add a new center w_{K+1} = w_K + \delta with p(w_{K+1}) = p(w_K)/2, set p(w_K) \leftarrow p(w_K)/2, and update K \leftarrow K + 1.

Phase II (Maximization)

8. If this is the first time the robust density estimation is calculated, select p(x_i) = 1/l, \infty > \lambda \ge 0, and \varepsilon > 0, and start the fixed-point iteration of the robust density estimation in steps 9 and 10.

9. Compute

c_i = \exp \left[ \sum_{k=1}^{K} \left( p(w_k|x_i) \ln \frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i) p(w_k|x_i)} - \lambda p(w_k|x_i) d(w_k, x_i) \right) \right]. (4.37)

10. If

\ln \max_{i=1,\ldots,l} c_i - \ln \sum_{i=1}^{l} p(x_i) c_i < \varepsilon, (4.38)

the fixed-point iteration has converged and we proceed to step 11; otherwise, update the density estimate

p(x_i) = \frac{p(x_i) c_i}{\sum_{i=1}^{l} p(x_i) c_i} (4.39)

and go to step 9, following the capacity algorithm of Blahut (1972) (see also the sketch after step 11).


11. Verify the robust solutions of the RIC algorithm around the optimal saddle point for a minimum value of the VC-bound, equation 4.34, within the range of the maximum cluster number K_max. If the minimum is found, delete the outliers and set T → 0 for the tilted distribution to obtain the cluster membership of all input data points for a hard clustering solution; recalculate the cluster centers using equation 3.12 without the outliers, and then stop. Otherwise, go to step 3.
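
The Phase II fixed-point iteration (steps 8 to 10) can be sketched as follows. This code is an illustration only, not the author's Matlab program; the convergence threshold, iteration cap, and array names are assumptions:

```python
import numpy as np

def phase_two_density(p_w_given_x, d, lam, eps=1e-6, max_iter=1000):
    """Fixed-point iteration for the robust density estimate, equations 4.37-4.39.

    p_w_given_x : (l, K) tilted distribution from Phase I
    d           : (l, K) squared Euclidean distances d(w_k, x_i)
    lam         : Lagrange multiplier lambda >= 0
    """
    l = p_w_given_x.shape[0]
    p_x = np.full(l, 1.0 / l)                        # step 8: start from the uniform pmf
    for _ in range(max_iter):
        p_w = p_x @ p_w_given_x                      # sum_i p(x_i) p(w_k|x_i)
        log_c = (p_w_given_x * (np.log(p_w_given_x + 1e-300) - np.log(p_w + 1e-300))
                 - lam * p_w_given_x * d).sum(axis=1)
        c = np.exp(log_c)                            # step 9, equation 4.37
        if np.log(c.max()) - np.log(p_x @ c) < eps:  # step 10, convergence test 4.38
            break
        p_x = p_x * c / (p_x @ c)                    # step 10, update 4.39
    return p_x
```

Data points whose estimated p(x_i) falls to (numerically) zero are the outliers removed before the final hard partition in step 11.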

5 Simulation Results

This section presents a few simulation examples to show the superiority of the RIC over the standard DA clustering algorithm. This is in fact a self-comparison, since the RIC is an extension of the DA that identifies outliers for an optimal cluster number. A comparison could also be made with the popular fuzzy c-means (FCM) and the robust versions of the FCM clustering algorithms (see section 2). However, this may not make much sense, since the FCM needs a predetermined cluster number in addition to having an initialization problem (Krishnapuram & Keller, 1993).

Example 1, which follows, presents a numerical analysis to reveal details of the weakness of the tilted distribution. This also explains how the robust density estimate of the RIC algorithm finds an optimal cluster number via the identification of outliers. Example 2 illustrates that one can always choose a relatively larger control parameter λ to eliminate more outliers in the intercluster area without affecting the estimated optimal cluster number. Example 3 shows an interesting partition of a specific data set without clear cluster boundaries. In particular, we show that any data point could become an outlier, depending on the given data structure and the chosen cluster centers in the annealing procedure, based on the limited number of input data for a minimum VC-bound. Similarly, we are not looking for "true" clusters or cluster centers but for effective clusters in the sense of the SRM, based on the simple Euclidean distance.^8

Example 1. Figure 3 is an extended version of an example used for the robust FCM clustering algorithm (Krishnapuram & Keller, 1993), which has two well-separated clusters with seven data points each and four outliers sitting around the middle position between the two given clusters. The data set has 18 data points, such that the ratio l/h_1 = 18/(3 * 1) is already smaller than the critical number 20. An optimal cluster number should be the minimum of two (note that the DA does not work for one cluster). However, we would like to use this example to reveal the weakness of the tilted distribution and how the robust density estimate helps. Figure 3 also shows that the RIC algorithm with K = 2 identifies the four data points around the middle position between the two clusters as outliers and eliminates them with p(x_i) = 0. Further details on the values of the tilted distribution p(w_k|x_i) and the robust estimate p(x_i) are listed in Table 1 for the case of K = 2. The first 14 rows correspond to the data in the two clusters, and the last 4 rows represent the four identified outliers. Despite the balanced geometric positions of the outliers, the membership of the four outliers is assigned to cluster 1 by the DA, because p(w_1|x_i) ≈ 1 for the four outliers. A minor numerical difference may be the only cause for the DA to assign the membership of the four data points to the first cluster. This explains why minimization of the tilted distribution is not robust (Dave & Krishnapuram, 1997).

8 We set ζ = 0.1 in the VC-bound for all the simulation results. The Matlab program can be downloaded from the author's Internet address: http://www.ntu.edu.sg/home/eqsong.

Figure 3: The clustering results of the RIC (λ = 0) in example 1. (a) The original data set. (b) K = 2, p_s = 4.9766. (c) K = 3, p_s = 5.7029. (d) K = 4, p_s = 6.4161. The larger * represents the estimated cluster center of the RIC after eliminating the estimated outliers. The black dots are the outliers identified by the RIC in (b), (c), and (d).

Table 1: Optimal Tilted Distribution p(w_k|x_i) and Robust Density Estimate p(x_i) in Example 1 with K = 2.

 i    p(x_i)   p(w_1|x_i)   p(w_2|x_i)
 1    0.3134   0.9994       0.0006
 2    0.0638   0.9991       0.0009
 3    0.0354   0.9987       0.0013
 4    0.0329   0.9987       0.0013
 5    0.0309   0.9987       0.0013
 6    0.0176   0.9981       0.0019
 7    0.0083   0.9972       0.0028
 8    0.0030   0.0028       0.9972
 9    0.0133   0.0019       0.9981
10    0.0401   0.0013       0.9987
11    0.0484   0.0013       0.9987
12    0.0567   0.0013       0.9987
13    0.1244   0.0009       0.9991
14    0.2133   0.0006       0.9994
15    0.0000   0.9994       0.0006
16    0.0000   0.9994       0.0006
17    0.0000   0.9994       0.0006
18    0.0000   0.9994       0.0006

More important, the RIC estimates the real risk bound p_s as the cluster number is increased from one, which also eliminates the effect of the outliers. The ratio between the total number of data points and the VC-dimension h_2 is small, at 18/6 = 3, so the second term of the VC-bound becomes dominant as K increases, as shown in Figure 3. The optimal cluster number is determined as "two," with a minimum p_s = 4.9766, despite the fact that the minimum number of outliers of the empirical risk is achieved at the cluster number K = 4. Note also that the original outliers become valid data points as the cluster number is increased to K = 3 and K = 4, respectively.

Example 2. The two-dimensional data set has 292 data points, so the ratio l/h_7 = 292/(3 * 7) is well below the critical number 20. We should search for an optimal cluster number from two to seven clusters. Figures 4 and 5 show that a "native" noise-free three-cluster data set is clustered by the RIC algorithm with different cluster numbers. The RIC gives the correct optimal cluster number, "three," because there is a minimum value of the VC-bound p_s. This also coincides with the empirical risk of the minimum number of outliers at K = 3 for both cases, λ = 0 and λ = 18. Note that we can always use a relatively larger λ value to eliminate more outliers in the intercluster area without affecting the optimal cluster number in a general clustering problem. The black dots are the outliers identified by the RIC in all panels.

Figure 4: The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 0. The black dots are the outliers identified by the RIC in all panels. (a) K = 2, p_s = 1.5635. (b) K = 3, p_s = 0.6883. (c) K = 4, p_s = 1.1888. (d) K = 5, p_s = 1.4246. (e) K = 6, p_s = 1.3208. (f) K = 7, p_s = 2.4590.

Figure 5: The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 18. The black dots are the outliers identified by the RIC in all panels. (a) K = 2, p_s = 1.8924. (b) K = 3, p_s = 0.9303. (c) K = 4, p_s = 1.2826. (d) K = 5, p_s = 1.5124. (e) K = 6, p_s = 1.3718. (f) K = 7, p_s = 2.46244.

Figure 6: The two-dimensional data set with 300 data points in example 3, clustered by the RIC algorithm with λ = 0. The black dots are the outliers identified by the RIC in all panels. (a) K = 2, p_s = 1.8177, η = 0.8667. (b) K = 3, p_s = 1.3396, η = 0.3900. (c) K = 4, p_s = 0.8486, η = 0. (d) K = 5, p_s = 0.9870, η = 0. (e) K = 6, p_s = 1.1374, η = 0.0033. (f) K = 7, p_s = 2.169, η = 0.4467.

Example 3. This is an instructive example showing the application of the RIC algorithm with λ = 0 to a data set without clear cluster boundaries in a two-dimensional space. The data set has 300 data points, such that the ratio l/h_7 = 300/(3 * 7) is well below the critical number 20. We shall search for an optimal cluster number from two to seven clusters. In particular, to show the difference between the empirical risk η and the VC-bound p_s, we indicate both values for each case. Figure 6 illustrates that the optimal cluster number is four, based on the SRM principle. It is interesting to note that the five-cluster case also achieves the minimum number of outliers in the sense of empirical risk minimization, but its VC-bound p_s is bigger than that of the four-cluster case because of the increase in the VC-dimension.

6 Conclusion

A robust information clustering algorithm is developed based on the minimax optimization of MI. In addition to the algorithm, the theoretical contributions of this letter are twofold: (1) the capacity maximization is implicitly linked to the distortion measure against the input pattern pmf and provides an upper bound of the empirical risk to phase out outliers; and (2) the optimal cluster number is estimated based on the SRM principle of statistical learning theory. The RIC can also be extended to c-shells or kernel-based algorithms to deal with linearly nonseparable data. This is an interesting topic for further research.

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlier version of this letter.

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011–1015.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460–473.

Blahut, R. E. (1988). Principles and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270–293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158–171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinek, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98–110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89–104.

Mackay, D. J. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035–1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.

Scholkopf, B., Smola, A., & Muller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434–448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics—Part C, 32, 440–448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483–2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.

Page 12: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2683

Similar to the minimization of the rate distortion function in section 3constrained capacity maximization can be rewritten as an optimizationproblem with a Lagrange multiplier λ ge 0

C(D(p(X))) = maxp(X)

[I (p(X) p(W|X)) + λ(D(plowast(X)) minus D(p(X)))] (414)

Theorem 1 Maximum of the constrained capacity C(D(p(X))) is achieved by therobust density estimate

p(xi ) =exp

(sumKk=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )

)suml

i=1 exp(sumK

k=1 p(wk |xi ) ln p(xi |wk) minus λp(wk |xi )d(wk |xi )) (415)

with the specific distortion measure D(p(X)) = D(plowast(X)) for p(xi ) ge 0 of all 0 lei le l

Proof Similar to Blahut (1972) we can temporarily ignore the conditionp(xi ) ge 0 and set the derivative of the optimal function 414 equal to zeroagainst the independent variable a priori pmf p(xi ) This results in

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)= 0

= minus ln p(xi ) minus 1 +Ksum

k=1

p(wk |xi )(ln p(xi |wk)

minusλp(wk |xi )d(wk xi )) + λ1 p(xi ) (416)

We also select a suitable λ1 which ensure that the probability constraintsumli=1 p(xi ) = 1 is guaranteed and leads to the robust density distribution

estimate equation 415According to the Kuhn-Tucker theorem (Blahut 1988) if there exists an

optimal robust distribution p(xi ) which is derived from equation 415 thenthe inequality constraint equation 413 of the distortion measure becomesequality and achieves the optimal solution of equation 414 at an optimalsaddle point between the curve C(D(p(X))) and R(D(plowast(X))) with the cor-responding average distortion measure

D(p(X)) = D(plowast(X))) (417)

By dividing the input data into effective clusters the DA clustering min-imizes the relative Shannon entropy without a priori knowledge of the datadistribution (Gray 1990) The prototype (cluster center) equation 312 is

2684 Q Song

1w 2w

1

1

(w | x )( (w | x )ln )

(w | x ) (x )

(x ) 0

Kk i

k i lk

k i ii

i

pp C

p p

p

2(w | x )ip1(w | x )ip

2(x | w )ip1(x | w )ip

Figure 2 The titled distribution and robust density estimation based on theinverse theorem for a two-cluster data set

clearly presented as a mass center This is insensitive to the initialization ofcluster centers and volumes with a fixed probability distribution for exam-ple an equal value plowast(xi ) = 1 l for the entire input data points (Rose 1998)Therefore the prototype parameter αki depends on the titled distributionp(wk |xi ) equation 34 which tends to associate the membership of any par-ticular pattern in all clusters and is not robust against outlier or disturbanceof the training data (Dave amp Krishnapuram 1997) This in turn generatesdifficulties in determining an optimal cluster number as shown in Figure 2(see also the simulation results) Any data point located around the middleposition between two effective clusters could be considered an outlier

Corollary 1 The capacity curve C(D(p(X))) is continuous nondecreasing andconcave on D(p(X)) for any particular cluster number K

Proof Let pprime(xi ) isin pprime(X) and pprimeprime(xi) isin pprimeprime(X) achieve [D(pprime(X)) C(D(pprime(X)))]and [D(pprimeprime(X)) C(D(pprimeprime(X)))] respectively and p(xi ) = λprime pprime(xi ) + λprimeprime pprimeprime(xi ) isan optimal density estimate in theorem 1 where λprimeprime = 1 minus λprime and 0 lt λprime lt 1Then

D(p(X)) =lsum

i=1

Ksumk=1

(λprime pprime(xi ) + λprimeprime pprimeprime(xi ))p(wk |xi )d(wk xi )

= λprime D(pprime(X)) + λprimeprime D(pprimeprime(X)) (418)

A Robust Information Clustering Algorithm 2685

and because p(X) is the optimal value we have

C(D(p(X))) ge I (p(X) p(W|X)) (419)

Now we use the fact that I (p(X) p(W|X)) is concave (upward convex) inp(X) (Jelinet 1968 Blahut 1988) and arrive at

C(D(p(X))) ge λprime I (pprime(X) p(W|X)) + λprimeprime I (pprimeprime(X) p(W|X)) (420)

We have finally

C(λprime D(pprime(X)) + λprimeprime D(pprimeprime(X))) ge λprimeC(D(pprime(X))) + λprimeprimeC(D(pprimeprime(X))) (421)

Furthermore because C(D(p(X))) is concave on [0 Dmax] it is continuousnonnegative and nondecreasing to achieve the maximum value at Dmaxwhich must also be strictly increased for D(p(X)) smaller than Dmax

Corollary 2 The robust distribution estimate p(X) achieves the capacity at

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)= V forallp(xi ) = 0

(422)

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)ltV forallp(xi ) = 0

(423)

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik1998)

p(xi )

[V minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

))]= 0 foralli (424)

Proof Similar to the proof of theorem 1 we use the concave property ofC(D(p(X)))

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)ge 0 (425)

2686 Q Song

which can be rewritten as

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)le minusλ1 + 1 foralli

(426)

with equality for all p(xi ) = 0 Setting minusλ1 + 1 = V completes the proof

Similarly it is easy to show that if we choose λ = 0 the Kuhn-Tucker con-dition becomes

p(xi )

[C minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

))]= 0 foralli (427)

where C is the maximum capacity value defined in equation 42Note that the MI is not negative However individual items in the sum of

the capacity maximization equation 42 can be negative If the i th patternxi is taken into account and p(wk |xi ) lt

sumli=1 p(xi )p(wk |xi ) then the prob-

ability of the kth code vector (cluster center) is decreased by the observedpattern and gives negative information about pattern xi This particularinput pattern may be considered an unreliable pattern (outlier) and itsnegative effect must be offset by other input patterns Therefore the max-imization of the MI equation 42 provides a robust density estimation ofthe noisy pattern (outlier) in terms that the average information is over allclusters and input patterns The robust density estimation and optimiza-tion is now to maximize the MI against the pmf p(xi ) and p(xi |wk) for anyvalue of i if p(xi |wk) = 0 then p(xi ) should be set equal to zero in orderto obtain the maximum such that a corresponding training pattern (outlier)xi can be deleted and dropped from further consideration in the optimiza-tion procedure as outlier shown in Figure 2

As a by-product the robust density estimation leads to an improvedcriterion at calculation of the critical temperature to split the input data setinto more clusters of the RIC compared to the DA as the temperature islowered (Rose 1998) The critical temperature of the RIC can be determinedby the maximum eigenvalue of the covariance (Rose 1998)

VXW =lsum

i=1

p(xi |wk)(xi minus wk)(xi minus wk)T (428)

where p(xi |wk) is optimized by equation 41 This has a bigger value repre-senting the reliable data since the channel communication error pe is rela-tively smaller compared to the one of outlier (see lemma 1)

A Robust Information Clustering Algorithm 2687

43 Structural Risk Minimization and Optimal Cluster Number Tosolve the intertwined outlier and cluster number problem some intuitivenotations can be obtained based on classical information theory as presentedthe previous sections Increasing K and model complexity (as the tempera-ture is lowered) may reduce capacity C(D(p(X))) since it is a nondecreasingfunction of D(p(X)) as shown in corollary 1 (see also Figure 1) Thereforein view of theorem 1 we should use the smallest cluster number as longas a relatively small number of outliers is achieved (if not zero outlier) say1 percent of the entire input data points However how to make a trade-off between empirical risk minimization and capacity maximization is adifficult problem for classical information theory

We can solve this difficulty by bridging the gap between classical infor-mation theory on which the RIC algorithm is based and the relatively newstatistical learning theory with the so-called structural risk minimization(SRM) principle (Vapnik 1998) Under the SRM a set of admissible struc-tures with nested subsets can be defined specifically for the RIC clusteringproblem as

S1 sub S2 sub sub SK (429)

where SK = (QK (xi W) W isin K ) foralli with a set of indicator functions ofthe empirical risk7

QK (xi W) =Ksum

k=1

limTrarr0

p(wk |xi ) =Ksum

k=1

limTrarr0

p(wk) exp(minusd(xi wk)T)Nxi

foralli

(430)

We shall show that the titled distribution p(wk |xi ) equation 34 at zero tem-perature as in equation 430 can be approximated by the complement ofa step function This is linear in parameters and assigns the cluster mem-bership of each input data point based on the Euclidean distance betweendata point xi and cluster center wk for a final hard clustering partition (Rose1998 see also the algorithm in section 44)

The titled distribution at T rarr 0 can be presented as

limTrarr0

p(wk) exp(minusd(xi wk)T)sumKk=1 p(wk) exp(minusd(xi wk)T)

7 According to definition of the titled distribution equation 34 it is easy to see thatthe defined indictor function is a constant number that is QK (xi W) = 1 See also note 3

2688 Q Song

asymp

p(wk) exp(minusd0(xi wk))p(wk) exp(minusd0(xi wk))

= 1 if d0(xi wk) = infin

p(wk) exp(minusd0(xi wk))sumKk=1 p(wk) exp(minusd0(xi wk))

= 0 if d0(xi wk) rarr infin

(431)

Now consider the radius d0(xi wk) between data point xi and clusterk at zero temperature This can be rewritten as an inner product of twon-dimensional vectors of the input space as

d0(xi wk) = limTrarr0

d(xi wk)T

= limTrarr0

lt xi minus wk gt lt xi minus wk gt

T

=nsum

o=1

rkoφko(X) (432)

where rko represents the radius parameter component in the n-dimensionalspace and φko(X) is a linearly independent function similar to the hyper-plane case (Vapnik 1998)

Using equations 432 and 431 we can rewrite 430 as

QK (xi W) =Ksum

k=1

θ

(nsum

o=1

rkoφko(X) minus d0(xi wk)

) foralli (433)

where θ () = 1 minus θ () is the complement of the step function θ ()Note that there is one and only one d0(xi wk) = infin forall(1 le k le K ) in each

conditional equality of equation 431 since it gives a unique cluster mem-bership of any data point xi in a nested structure SK Therefore the indi-cator QK (xi W) is linear in parameters According to Vapnik (1998) theVC-dimension of the complexity control parameter is equal to the numberof parameters hK = (n + 1) lowast K for each nested subset SK By design of theDA clustering the nested structure in equation 429 provides ordering ofthe VC-dimension h1 le h2 le le hK such that the increase of clusternumber is proportional to the increase of the estimated VC-dimension froma neural network point of view (Vapnik 1998)

To obtain good generalization performance one has to use the admissiblestructure equation 429 based on the set of indicator functions to search foran optimal cluster number K This minimizes a VC-bound ps similar to thatof the support vector machine except that we are looking for the strongestdata point of the input space instead of seeking the weakest data point ofthe feature (kernel) space (Vapnik 1998) So we have

ps le η + ε

2

(1 +

(1 + η

)12)

(434)

A Robust Information Clustering Algorithm 2689

with

η = ml

(435)

ε = 4hK

(ln 2l

hK+ 1

)minus ln ζ

4

l (436)

where m is the number of outliers identified in the capacity maximizationas in the previous section ζ lt 1 is a constant

The signal-to-noise ratio η in equation 435 appears as the first term ofthe right-hand side of the VC-bound equation 434 This represents theempirical risk and the second term is the confidence interval of the SRM-based estimate

Discussion

Stop criterion and optimal cluster number At the initial DA clusteringstage with a small cluster number K and relatively large ratio betweenthe number of input data points and the VC-dimension say l

hKgt 20

(Vapnik 1998) the real risk VC-bound equation 434 is mainly deter-mined by the first term of the right-hand side of the inequality thatis the empirical risk (signal-to-noise) ratio η in equation 435 As thetemperature is lowered and the cluster number is increased a rela-tively small ratio l

hKmay require both terms in the right-hand side of

equation 434 to be small simultaneously Therefore we can assess firstthe ratio l(hK ) which is near the upper bound of the critical number20 for a maximum cluster number K = Kmax beyond which the sec-ond term of the VC-bound equation 434 may become dominant evenfor a small empirical risk ratio η especially in a high-dimensional dataspace Therefore we can follow the minimax MI optimization as insections 3 and 4 to increase the cluster number from one until Kmax fora minimum value of the VC-bound that is take a trade-off betweenminimization of the empirical risk and VC-dimension

Selection of λ The degree of robustness of the RIC algorithm is con-trolled by the parameter λ The Kuhn-Tucker condition in corollary 2tells that a relatively larger value of λ yields more outliers (noisy pat-terns) If one chooses λ = 0 the RIC allows the maximum empiricalrisk with a possible overcapacity distortion beyond the optimal saddlepoint and a minimum number of the estimated outliers (see Figure 1)In a general clustering problem using the L2 distortion measure equa-tion 22 selection of the λ is insensitive to determination of an optimalcluster number because the VC-bound depends on only the relativevalues of η and hK over different cluster numbers (see also example 2)

2690 Q Song

As a general rule of thumb if eliminating more outliers is an interestwe can gradually increase λ and redo the capacity maximization toreject outliers located between intercluster boundaries at an optimalcluster number determined by an arbitrary value of λ

44 Implementation of the RIC Algorithm

Phase I (Minimization)

1 Determine the ratio l(n lowast K ) which is near the critical number 20 fora maximum cluster number K = Kmax and p(xi ) = 1 l for i = 1 to l

2 Initialize T gt 2Emax(Vx) where Emax is the largest eigenvalue of thevariance matrix Vx of the input pattern set X K = 1 and p(w1) = 1

3 For i = 1 K of the fixed-point iteration of the DA clustering ac-cording to equations 34 415 and 312

4 Convergence test If not satisfied go to 3

5 If T le Tmin perform the last iteration and stop

6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 13: A Robust Information Clustering Algorithm

2684 Q Song

1w 2w

1

1

(w | x )( (w | x )ln )

(w | x ) (x )

(x ) 0

Kk i

k i lk

k i ii

i

pp C

p p

p

2(w | x )ip1(w | x )ip

2(x | w )ip1(x | w )ip

Figure 2 The titled distribution and robust density estimation based on theinverse theorem for a two-cluster data set

clearly presented as a mass center This is insensitive to the initialization ofcluster centers and volumes with a fixed probability distribution for exam-ple an equal value plowast(xi ) = 1 l for the entire input data points (Rose 1998)Therefore the prototype parameter αki depends on the titled distributionp(wk |xi ) equation 34 which tends to associate the membership of any par-ticular pattern in all clusters and is not robust against outlier or disturbanceof the training data (Dave amp Krishnapuram 1997) This in turn generatesdifficulties in determining an optimal cluster number as shown in Figure 2(see also the simulation results) Any data point located around the middleposition between two effective clusters could be considered an outlier

Corollary 1 The capacity curve C(D(p(X))) is continuous nondecreasing andconcave on D(p(X)) for any particular cluster number K

Proof Let pprime(xi ) isin pprime(X) and pprimeprime(xi) isin pprimeprime(X) achieve [D(pprime(X)) C(D(pprime(X)))]and [D(pprimeprime(X)) C(D(pprimeprime(X)))] respectively and p(xi ) = λprime pprime(xi ) + λprimeprime pprimeprime(xi ) isan optimal density estimate in theorem 1 where λprimeprime = 1 minus λprime and 0 lt λprime lt 1Then

D(p(X)) =lsum

i=1

Ksumk=1

(λprime pprime(xi ) + λprimeprime pprimeprime(xi ))p(wk |xi )d(wk xi )

= λprime D(pprime(X)) + λprimeprime D(pprimeprime(X)) (418)

A Robust Information Clustering Algorithm 2685

and because p(X) is the optimal value we have

C(D(p(X))) ge I (p(X) p(W|X)) (419)

Now we use the fact that I (p(X) p(W|X)) is concave (upward convex) inp(X) (Jelinet 1968 Blahut 1988) and arrive at

C(D(p(X))) ge λprime I (pprime(X) p(W|X)) + λprimeprime I (pprimeprime(X) p(W|X)) (420)

We have finally

C(λprime D(pprime(X)) + λprimeprime D(pprimeprime(X))) ge λprimeC(D(pprime(X))) + λprimeprimeC(D(pprimeprime(X))) (421)

Furthermore because C(D(p(X))) is concave on [0 Dmax] it is continuousnonnegative and nondecreasing to achieve the maximum value at Dmaxwhich must also be strictly increased for D(p(X)) smaller than Dmax

Corollary 2 The robust distribution estimate p(X) achieves the capacity at

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)= V forallp(xi ) = 0

(422)

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)ltV forallp(xi ) = 0

(423)

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik1998)

p(xi )

[V minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

))]= 0 foralli (424)

Proof Similar to the proof of theorem 1 we use the concave property ofC(D(p(X)))

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)ge 0 (425)

2686 Q Song

which can be rewritten as

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)le minusλ1 + 1 foralli

(426)

with equality for all p(xi ) = 0 Setting minusλ1 + 1 = V completes the proof

Similarly it is easy to show that if we choose λ = 0 the Kuhn-Tucker con-dition becomes

p(xi )

[C minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

))]= 0 foralli (427)

where C is the maximum capacity value defined in equation 42Note that the MI is not negative However individual items in the sum of

the capacity maximization equation 42 can be negative If the i th patternxi is taken into account and p(wk |xi ) lt

sumli=1 p(xi )p(wk |xi ) then the prob-

ability of the kth code vector (cluster center) is decreased by the observedpattern and gives negative information about pattern xi This particularinput pattern may be considered an unreliable pattern (outlier) and itsnegative effect must be offset by other input patterns Therefore the max-imization of the MI equation 42 provides a robust density estimation ofthe noisy pattern (outlier) in terms that the average information is over allclusters and input patterns The robust density estimation and optimiza-tion is now to maximize the MI against the pmf p(xi ) and p(xi |wk) for anyvalue of i if p(xi |wk) = 0 then p(xi ) should be set equal to zero in orderto obtain the maximum such that a corresponding training pattern (outlier)xi can be deleted and dropped from further consideration in the optimiza-tion procedure as outlier shown in Figure 2

As a by-product, the robust density estimation leads to an improved criterion for calculating the critical temperature at which the RIC splits the input data set into more clusters, compared to the DA, as the temperature is lowered (Rose, 1998). The critical temperature of the RIC can be determined by the maximum eigenvalue of the covariance (Rose, 1998)

$$V_{XW} = \sum_{i=1}^{l} p(x_i|w_k)\,(x_i - w_k)(x_i - w_k)^{T}, \qquad (4.28)$$

where p(x_i|w_k) is optimized by equation 4.1. This takes a larger value for the reliable data, since the channel communication error p_e is relatively smaller than that of an outlier (see lemma 1).
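As a concrete illustration of equation 4.28 and of the 2E_max splitting threshold of Rose (1998), the test can be sketched as follows in Python; the robustly reweighted posterior p(x_i|w_k) is assumed to be supplied, and the function name is hypothetical.

```python
import numpy as np

def critical_temperature(X, w_k, p_x_given_wk):
    """Splitting test: cluster k splits once T <= 2 E_max(V_XW) (eq. 4.28, Rose 1998).

    X            : (l, n) input patterns
    w_k          : (n,)  code vector (cluster center)
    p_x_given_wk : (l,)  robustly reweighted posterior p(x_i | w_k)
    """
    diff = X - w_k                                   # residuals x_i - w_k
    V_xw = (p_x_given_wk[:, None] * diff).T @ diff   # eq. 4.28
    return 2.0 * np.linalg.eigvalsh(V_xw).max()      # 2 E_max(V_XW)
```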


4.3 Structural Risk Minimization and Optimal Cluster Number. To solve the intertwined outlier and cluster-number problem, some intuitive notions can be obtained from classical information theory, as presented in the previous sections. Increasing K and the model complexity (as the temperature is lowered) may reduce the capacity C(D(p(X))), since it is a nondecreasing function of D(p(X)), as shown in corollary 1 (see also Figure 1). Therefore, in view of theorem 1, we should use the smallest cluster number for which a relatively small number of outliers (if not zero outliers) is achieved, say 1 percent of the entire set of input data points. However, how to trade off empirical risk minimization against capacity maximization is a difficult problem for classical information theory.

We can resolve this difficulty by bridging the gap between classical information theory, on which the RIC algorithm is based, and the relatively new statistical learning theory with its so-called structural risk minimization (SRM) principle (Vapnik, 1998). Under the SRM, a set of admissible structures with nested subsets can be defined specifically for the RIC clustering problem as

$$S_1 \subset S_2 \subset \cdots \subset S_K, \qquad (4.29)$$

where S_K = (Q_K(x_i, W), W ∈ K), ∀i, with a set of indicator functions of the empirical risk:^7

$$Q_K(x_i, W) = \sum_{k=1}^{K} \lim_{T\to 0} p(w_k|x_i) = \sum_{k=1}^{K} \lim_{T\to 0} \frac{p(w_k)\exp(-d(x_i, w_k)/T)}{N_{x_i}} \quad \forall i. \qquad (4.30)$$

We shall show that the tilted distribution p(w_k|x_i) of equation 3.4, at zero temperature as in equation 4.30, can be approximated by the complement of a step function. This is linear in parameters and assigns the cluster membership of each input data point based on the Euclidean distance between data point x_i and cluster center w_k for a final hard clustering partition (Rose, 1998; see also the algorithm in section 4.4).

The tilted distribution at T → 0 can be presented as

$$\lim_{T\to 0} \frac{p(w_k)\exp(-d(x_i, w_k)/T)}{\sum_{k=1}^{K} p(w_k)\exp(-d(x_i, w_k)/T)}$$

^7 According to the definition of the tilted distribution, equation 3.4, it is easy to see that the defined indicator function is a constant number, that is, Q_K(x_i, W) = 1. See also note 3.


$$\approx \begin{cases} \dfrac{p(w_k)\exp(-d_0(x_i, w_k))}{p(w_k)\exp(-d_0(x_i, w_k))} = 1, & \text{if } d_0(x_i, w_k) \neq \infty \\[6pt] \dfrac{p(w_k)\exp(-d_0(x_i, w_k))}{\sum_{k=1}^{K} p(w_k)\exp(-d_0(x_i, w_k))} = 0, & \text{if } d_0(x_i, w_k) \to \infty. \end{cases} \qquad (4.31)$$

Now consider the radius d_0(x_i, w_k) between data point x_i and cluster k at zero temperature. This can be rewritten as an inner product of two n-dimensional vectors of the input space as

$$d_0(x_i, w_k) = \lim_{T\to 0}\frac{d(x_i, w_k)}{T} = \lim_{T\to 0}\frac{\langle x_i - w_k,\; x_i - w_k\rangle}{T} = \sum_{o=1}^{n} r_{ko}\,\varphi_{ko}(X), \qquad (4.32)$$

where r_{ko} represents the radius parameter component in the n-dimensional space and φ_{ko}(X) is a linearly independent function, similar to the hyperplane case (Vapnik, 1998).

Using equations 4.32 and 4.31, we can rewrite equation 4.30 as

$$Q_K(x_i, W) = \sum_{k=1}^{K} \bar{\theta}\left(\sum_{o=1}^{n} r_{ko}\,\varphi_{ko}(X) - d_0(x_i, w_k)\right) \quad \forall i, \qquad (4.33)$$

where $\bar\theta(\cdot) = 1 - \theta(\cdot)$ is the complement of the step function $\theta(\cdot)$. Note that there is one and only one d_0(x_i, w_k) ≠ ∞, ∀(1 ≤ k ≤ K), in each conditional equality of equation 4.31, since this gives a unique cluster membership for any data point x_i in a nested structure S_K. Therefore, the indicator Q_K(x_i, W) is linear in parameters. According to Vapnik (1998), the VC-dimension of the complexity control parameter is equal to the number of parameters, h_K = (n + 1) * K, for each nested subset S_K. By design of the DA clustering, the nested structure in equation 4.29 provides an ordering of the VC-dimensions, h_1 ≤ h_2 ≤ · · · ≤ h_K, such that the increase of the cluster number is proportional to the increase of the estimated VC-dimension from a neural network point of view (Vapnik, 1998).
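The hard partition implied by equations 4.30 to 4.33 and the VC-dimension count h_K = (n + 1)K can be illustrated by a small Python sketch; the squared Euclidean distance is assumed, and the function names are hypothetical.

```python
import numpy as np

def hard_membership(X, W):
    """Zero-temperature membership: the tilted distribution collapses to 1 for the
    nearest center and to 0 for all others (the complement-of-step behavior)."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances
    return d.argmin(axis=1)                                   # index of the winning center

def vc_dimension(n, K):
    """VC-dimension of the nested subset S_K: h_K = (n + 1) * K."""
    return (n + 1) * K
```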

To obtain good generalization performance, one has to use the admissible structure, equation 4.29, based on the set of indicator functions to search for an optimal cluster number K. This minimizes a VC-bound p_s similar to that of the support vector machine, except that we are looking for the strongest data point of the input space instead of seeking the weakest data point of the feature (kernel) space (Vapnik, 1998). So we have

$$p_s \le \eta + \frac{\varepsilon}{2}\left(1 + \sqrt{1 + \frac{4\eta}{\varepsilon}}\right), \qquad (4.34)$$


with

$$\eta = \frac{m}{l}, \qquad (4.35)$$

$$\varepsilon = 4\,\frac{h_K\left(\ln\frac{2l}{h_K} + 1\right) - \ln\frac{\zeta}{4}}{l}, \qquad (4.36)$$

where m is the number of outliers identified in the capacity maximization as in the previous section, and ζ < 1 is a constant.

The signal-to-noise ratio η in equation 4.35 appears as the first term on the right-hand side of the VC-bound, equation 4.34. This represents the empirical risk, and the second term is the confidence interval of the SRM-based estimate.
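The bound of equations 4.34 to 4.36 is straightforward to evaluate. The following Python sketch assumes the standard Vapnik form of the confidence term, with 4η/ε under the square root and ζ = 0.1 as in the simulations of section 5; with m = 4, l = 18, n = 2, and K = 2 it reproduces the p_s = 4.9766 reported for example 1.

```python
import numpy as np

def vc_bound(m, l, n, K, zeta=0.1):
    """VC-bound p_s of eqs. 4.34-4.36 with h_K = (n + 1) * K.

    m : number of outliers found by the capacity maximization
    l : number of input data points
    n : input dimension
    """
    h_K = (n + 1) * K
    eta = m / l                                                   # eq. 4.35
    eps = 4.0 * (h_K * (np.log(2.0 * l / h_K) + 1.0)
                 - np.log(zeta / 4.0)) / l                        # eq. 4.36
    return eta + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * eta / eps))   # eq. 4.34

print(round(vc_bound(m=4, l=18, n=2, K=2), 4))   # 4.9766, as in example 1
```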

Discussion

Stop criterion and optimal cluster number. At the initial DA clustering stage, with a small cluster number K and a relatively large ratio between the number of input data points and the VC-dimension, say l/h_K > 20 (Vapnik, 1998), the real-risk VC-bound, equation 4.34, is mainly determined by the first term on the right-hand side of the inequality, that is, the empirical risk (signal-to-noise) ratio η in equation 4.35. As the temperature is lowered and the cluster number is increased, a relatively small ratio l/h_K may require both terms on the right-hand side of equation 4.34 to be small simultaneously. Therefore, we can first assess the ratio l/h_K, which should be near the upper bound of the critical number 20, for a maximum cluster number K = K_max, beyond which the second term of the VC-bound, equation 4.34, may become dominant even for a small empirical risk ratio η, especially in a high-dimensional data space. We can then follow the minimax MI optimization as in sections 3 and 4 to increase the cluster number from one up to K_max for a minimum value of the VC-bound, that is, take a trade-off between minimization of the empirical risk and of the VC-dimension.

Selection of λ. The degree of robustness of the RIC algorithm is controlled by the parameter λ. The Kuhn-Tucker condition in corollary 2 tells us that a relatively larger value of λ yields more outliers (noisy patterns). If one chooses λ = 0, the RIC allows the maximum empirical risk, with a possible overcapacity distortion beyond the optimal saddle point, and a minimum number of estimated outliers (see Figure 1). In a general clustering problem using the L2 distortion measure, equation 2.2, the selection of λ is insensitive to the determination of an optimal cluster number, because the VC-bound depends on only the relative values of η and h_K over different cluster numbers (see also example 2).


As a general rule of thumb, if eliminating more outliers is of interest, we can gradually increase λ and redo the capacity maximization to reject outliers located between intercluster boundaries, at an optimal cluster number determined by an arbitrary value of λ.

4.4 Implementation of the RIC Algorithm

Phase I (Minimization)

1. Determine the ratio l/(n * K), which should be near the critical number 20, for a maximum cluster number K = K_max, and set p(x_i) = 1/l for i = 1 to l.

2. Initialize T > 2E_max(V_x), where E_max(V_x) is the largest eigenvalue of the variance matrix V_x of the input pattern set X, K = 1, and p(w_1) = 1.

3. For i = 1, . . . , K, run the fixed-point iteration of the DA clustering according to equations 3.4, 4.15, and 3.12.

4. Convergence test: if not satisfied, go to step 3.

5. If T ≤ T_min, perform the last iteration and stop.

6. Cooling step: T ← αT (α < 1).

7. If K < K_max, check the condition for a phase transition for i = 1, . . . , K. If a critical temperature T = 2E_max(V_XW), where E_max(V_XW) is the largest eigenvalue of the covariance matrix V_XW in equation 4.28 between the input pattern and the code vector (Rose, 1998), is reached for the clustering, add a new center w_{K+1} = w_K + δ with p(w_{K+1}) = p(w_K)/2, p(w_K) ← p(w_K)/2, and update K ← K + 1.

Phase II (Maximization)

8. If this is the first time the robust density estimation is calculated, select p(x_i) = 1/l, ∞ > λ ≥ 0, and ε > 0, and start the fixed-point iteration of the robust density estimation in the following steps 9 and 10.

9. Compute

$$c_i = \exp\left[\sum_{k=1}^{K}\left(p(w_k|x_i)\,\ln\frac{p(w_k|x_i)}{\sum_{i=1}^{l} p(x_i)\,p(w_k|x_i)} - \lambda\, p(w_k|x_i)\, d(w_k, x_i)\right)\right]. \qquad (4.37)$$

10. If

$$\ln\max_{i=1,\ldots,l} c_i - \ln\sum_{i=1}^{l} p(x_i)\,c_i < \varepsilon, \qquad (4.38)$$

the fixed-point iteration has converged; go to step 11. Otherwise, update the density estimate

$$p(x_i) \leftarrow \frac{p(x_i)\,c_i}{\sum_{i=1}^{l} p(x_i)\,c_i} \qquad (4.39)$$

and go to step 9.


11. Verify the robust solutions of the RIC algorithm around the optimal saddle point for a minimum value of the VC-bound, equation 4.34, within the range of the maximum cluster number K_max. If the minimum is found, then delete the outliers and set T → 0 for the tilted distribution to obtain the cluster membership of all input data points for a hard clustering solution. Recalculate the cluster centers using equation 3.12 without the outliers, then stop. Otherwise, go to step 3.
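The two phases can be put together schematically as follows. This Python sketch is not the author's Matlab program: the squared Euclidean distortion, the cooling factor, the perturbation of a newly added center, the numerical guards, and the simplification of testing only the heaviest cluster for splitting are assumptions made for illustration; the VC-bound scan of step 11 would wrap this routine over K = 1, . . . , K_max using the vc_bound sketch given earlier.

```python
import numpy as np

def ric(X, K_max, lam=0.0, alpha=0.9, T_min=1e-3, eps=1e-6, max_inner=200):
    """Schematic RIC loop: DA minimization (phase I) interleaved with the
    robust density estimation (phase II).  All tolerances are assumed values."""
    l, n = X.shape
    p_x = np.full(l, 1.0 / l)                       # step 1: uniform pattern pmf
    W = X.mean(axis=0, keepdims=True)               # K = 1, center at the data mean
    p_w = np.array([1.0])
    T = 2.2 * np.linalg.eigvalsh(np.atleast_2d(np.cov(X.T))).max()   # step 2: T > 2 E_max(V_x)

    def tilted(W, p_w, T):
        d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)       # d(x_i, w_k)
        g = p_w[None, :] * np.exp(-(d - d.min(axis=1, keepdims=True)) / T)
        return g / g.sum(axis=1, keepdims=True), d                   # p(w_k|x_i), distances

    while T > T_min:
        for _ in range(max_inner):                  # phase I, steps 3-4
            P, d = tilted(W, p_w, T)
            p_w = P.T @ p_x                         # cluster priors p(w_k)
            W_new = (P * p_x[:, None]).T @ X / p_w[:, None]           # centroid update
            done = np.abs(W_new - W).max() < 1e-8
            W = W_new
            if done:
                break

        for _ in range(max_inner):                  # phase II, steps 8-10
            P, d = tilted(W, p_w, T)
            marg = P.T @ p_x                        # sum_i p(x_i) p(w_k|x_i)
            logP = np.log(np.clip(P, 1e-300, None))
            expo = (P * (logP - np.log(marg)[None, :] - lam * P * d)).sum(axis=1)
            c = np.exp(expo)                        # eq. 4.37
            if np.log(c.max()) - np.log(p_x @ c) < eps:               # eq. 4.38
                break
            p_x = p_x * c / (p_x @ c)               # eq. 4.39

        T *= alpha                                  # step 6: cooling
        if W.shape[0] < K_max:                      # step 7: phase transition check
            k = int(p_w.argmax())                   # simplification: test heaviest cluster
            post = P[:, k] * p_x
            post = post / max(post.sum(), 1e-12)    # p(x_i | w_k) by Bayes' rule
            diff = X - W[k]
            V_xw = (post[:, None] * diff).T @ diff                    # eq. 4.28
            if T <= 2.0 * np.linalg.eigvalsh(V_xw).max():
                W = np.vstack([W, W[k] + 1e-3 * np.random.randn(n)])  # perturbed copy
                p_w = np.concatenate([p_w, [p_w[k] / 2.0]])
                p_w[k] /= 2.0

    outliers = np.where(p_x < 1e-8 / l)[0]          # assumed cutoff for p(x_i) = 0
    labels = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)   # T -> 0
    return W, labels, outliers
```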

5 Simulation Results

This section presents a few simulation examples to show the superiority of the RIC over the standard DA clustering algorithm. This is in fact a self-comparison, since the RIC is just an extension of the DA that identifies outliers for an optimal cluster number. A comparison could also be made with the popular fuzzy c-means (FCM) and the robust versions of the FCM clustering algorithms (see section 2). However, this may not make much sense, since the FCM needs a predetermined cluster number in addition to having the initialization problem (Krishnapuram & Keller, 1993).

Example 1, which follows, presents a numerical analysis to reveal details of the weakness of the tilted distribution. This also explains how the robust density estimate of the RIC algorithm finds an optimal cluster number via the identification of outliers. Example 2 illustrates that one can always choose a relatively larger control parameter λ to eliminate more outliers in the intercluster area without affecting the estimated optimal cluster number. Example 3 shows an interesting partition of a specific data set without clear cluster boundaries. In particular, we show that any data point could become an outlier, depending on the given data structure and the chosen cluster centers in the annealing procedure, based on the limited number of input data for a minimum VC-bound. Similarly, we are not looking for "true" clusters or cluster centers but for effective clusters in the sense of the SRM, based on the simple Euclidean distance.^8

Example 1. Figure 3 is an extended example used for the robust FCM clustering algorithm (Krishnapuram & Keller, 1993), which has two well-separated clusters with seven data points each and four outliers sitting around the middle position between the two given clusters. The data set has 18 data points, so the ratio l/h_1 = 18/(3 * 1) is already smaller than the critical number 20. An optimal cluster number should be the minimum, two (note that the DA does not work for one cluster). However, we would like to use this example to reveal the weakness of the tilted distribution and how the robust density estimate helps.

^8 We set ζ = 0.1 in the VC-bound for all the simulation results. The Matlab program can be downloaded from the author's Internet address: http://www.ntu.edu.sg/home/eqsong.


Figure 3: The clustering results of RIC (λ = 0) in example 1. (a) The original data set. (b) K = 2, p_s = 4.9766. (c) K = 3, p_s = 5.7029. (d) K = 4, p_s = 6.4161. The bigger * represents the estimated cluster center of the RIC after eliminating the estimated outliers. The black dots are the outliers identified by the RIC in (b), (c), and (d).


Table 1: Optimal Tilted Distribution p(w_k|x_i) and Robust Density Estimate p(x_i) in Example 1 with K = 2.

i    p(x_i)   p(w_1|x_i)   p(w_2|x_i)
1    0.3134   0.9994       0.0006
2    0.0638   0.9991       0.0009
3    0.0354   0.9987       0.0013
4    0.0329   0.9987       0.0013
5    0.0309   0.9987       0.0013
6    0.0176   0.9981       0.0019
7    0.0083   0.9972       0.0028
8    0.0030   0.0028       0.9972
9    0.0133   0.0019       0.9981
10   0.0401   0.0013       0.9987
11   0.0484   0.0013       0.9987
12   0.0567   0.0013       0.9987
13   0.1244   0.0009       0.9991
14   0.2133   0.0006       0.9994
15   0.0000   0.9994       0.0006
16   0.0000   0.9994       0.0006
17   0.0000   0.9994       0.0006
18   0.0000   0.9994       0.0006

Figure 3 also shows that the RIC algorithm with K = 2 identifies the four data points around the middle position between the two clusters as outliers and eliminates them with p(x_i) = 0. Further details on the values of the tilted distribution p(w_k|x_i) and the robust estimate p(x_i) are listed in Table 1 for the case of K = 2. The first 14 rows correspond to the data in the two clusters, and the last 4 rows represent the four identified outliers. Despite the balanced geometric positions of the outliers, the membership of the four outliers is assigned to cluster 1 by the DA, because p(w_1|x_i) ≈ 1 for the four outliers. A minor numerical difference may be the only cause for the DA to assign the membership of the four data points to the first cluster. This explains why minimization of the tilted distribution is not robust (Dave & Krishnapuram, 1997).

More important, the RIC estimates the real risk bound p_s as the cluster number is increased from one. This also eliminates the effect of the outliers. The ratio between the number of total data points and the VC-dimension h_2 is small, at 18/6 = 3, so the second term of the VC-bound becomes dominant as K increases, as shown in Figure 3. The optimal cluster number is determined as "two," with a minimum p_s = 4.9766, despite the fact that the minimum number of outliers of the empirical risk is achieved at the cluster number K = 4. Note also that the original outliers become valid data points as the cluster number is increased to K = 3 and K = 4, respectively.

Example 2. The two-dimensional data set has 292 data points, so the ratio l/h_7 = 292/(3 * 7) is well below the critical number 20. We should search for an optimal cluster number from two to seven clusters.


Figure 4: The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 0. (a) p_s = 1.5635, K = 2. (b) p_s = 0.6883, K = 3. (c) p_s = 1.1888, K = 4. (d) p_s = 1.4246, K = 5. (e) p_s = 1.3208, K = 6. (f) p_s = 2.4590, K = 7. The black dots are the outliers identified by the RIC in all panels.


Figure 5: The two-dimensional data set with 292 data points in example 2, clustered by the RIC algorithm with λ = 18. (a) p_s = 1.8924, K = 2. (b) p_s = 0.9303, K = 3. (c) p_s = 1.2826, K = 4. (d) p_s = 1.5124, K = 5. (e) p_s = 1.3718, K = 6. (f) p_s = 2.46244, K = 7. The black dots are the outliers identified by the RIC in all panels.


Figure 6: The two-dimensional data set with 300 data points in example 3, clustered by the RIC algorithm with λ = 0. (a) K = 2, p_s = 1.8177, η = 0.8667. (b) K = 3, p_s = 1.3396, η = 0.3900. (c) K = 4, p_s = 0.8486, η = 0. (d) K = 5, p_s = 0.9870, η = 0. (e) K = 6, p_s = 1.1374, η = 0.0033. (f) K = 7, p_s = 2.169, η = 0.4467. The black dots are the outliers identified by the RIC in all panels.


Figures 4 and 5 show that a "native," noise-free, three-cluster data set is clustered by the RIC algorithm with different cluster numbers. The RIC gives the correct optimal cluster number, "three," because there is a minimum value of the VC-bound p_s. This also coincides with the empirical risk of the minimum number of outliers at K = 3 for both cases, λ = 0 and λ = 18. Note that we can always use a relatively larger λ value to eliminate more outliers in the intercluster area without affecting the optimal cluster number in a general clustering problem. The black dots are the outliers identified by the RIC in all pictures.

Example 3. This is an instructive example to show the application of the RIC algorithm with λ = 0 to a data set without clear cluster boundaries in a two-dimensional space. The data set has 300 data points, so the ratio l/h_7 = 300/(3 * 7) is well below the critical number 20. We shall search for an optimal cluster number from two to seven clusters. In particular, to show the difference between the empirical risk η and the VC-bound p_s, we indicate both values for each case. Figure 6 illustrates that the optimal cluster number is four, based on the SRM principle. It is interesting to note that the five-cluster case also achieves the minimum number of outliers in the sense of empirical risk minimization, but its VC-bound p_s is bigger than that of the four-cluster case because of the increase in the VC-dimension.

6 Conclusion

A robust information clustering algorithm is developed based on the minimax optimization of MI. In addition to the algorithm, the theoretical contributions of this letter are twofold: (1) the capacity maximization is implicitly linked to the distortion measure against the input pattern pmf and provides an upper bound of the empirical risk to phase out outliers; (2) the optimal cluster number is estimated based on the SRM principle of statistical learning theory. The RIC can also be extended to c-shells or kernel-based algorithms to deal with linearly nonseparable data. This is an interesting topic for further research.

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlier version of this letter.

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011-1015.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460-473.

Blahut, R. E. (1988). Principle and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270-293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158-171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinek, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98-110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89-104.

Mackay, D. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035-1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210-2239.

Scholkopf, B., Smola, A., & Muller, K. M. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434-448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 32, 440-448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483-2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek and R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.

Page 14: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2685

and because p(X) is the optimal value we have

C(D(p(X))) ge I (p(X) p(W|X)) (419)

Now we use the fact that I (p(X) p(W|X)) is concave (upward convex) inp(X) (Jelinet 1968 Blahut 1988) and arrive at

C(D(p(X))) ge λprime I (pprime(X) p(W|X)) + λprimeprime I (pprimeprime(X) p(W|X)) (420)

We have finally

C(λprime D(pprime(X)) + λprimeprime D(pprimeprime(X))) ge λprimeC(D(pprime(X))) + λprimeprimeC(D(pprimeprime(X))) (421)

Furthermore because C(D(p(X))) is concave on [0 Dmax] it is continuousnonnegative and nondecreasing to achieve the maximum value at Dmaxwhich must also be strictly increased for D(p(X)) smaller than Dmax

Corollary 2 The robust distribution estimate p(X) achieves the capacity at

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)= V forallp(xi ) = 0

(422)

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)ltV forallp(xi ) = 0

(423)

The above two equations can be presented as the Kuhn-Tucker condition (Vapnik1998)

p(xi )

[V minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

))]= 0 foralli (424)

Proof Similar to the proof of theorem 1 we use the concave property ofC(D(p(X)))

part

part p(xi )(C(D(p(X))) + λ1

(lsum

i=1

p(xi ) minus 1)

)ge 0 (425)

2686 Q Song

which can be rewritten as

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)le minusλ1 + 1 foralli

(426)

with equality for all p(xi ) = 0 Setting minusλ1 + 1 = V completes the proof

Similarly it is easy to show that if we choose λ = 0 the Kuhn-Tucker con-dition becomes

p(xi )

[C minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

))]= 0 foralli (427)

where C is the maximum capacity value defined in equation 42Note that the MI is not negative However individual items in the sum of

the capacity maximization equation 42 can be negative If the i th patternxi is taken into account and p(wk |xi ) lt

sumli=1 p(xi )p(wk |xi ) then the prob-

ability of the kth code vector (cluster center) is decreased by the observedpattern and gives negative information about pattern xi This particularinput pattern may be considered an unreliable pattern (outlier) and itsnegative effect must be offset by other input patterns Therefore the max-imization of the MI equation 42 provides a robust density estimation ofthe noisy pattern (outlier) in terms that the average information is over allclusters and input patterns The robust density estimation and optimiza-tion is now to maximize the MI against the pmf p(xi ) and p(xi |wk) for anyvalue of i if p(xi |wk) = 0 then p(xi ) should be set equal to zero in orderto obtain the maximum such that a corresponding training pattern (outlier)xi can be deleted and dropped from further consideration in the optimiza-tion procedure as outlier shown in Figure 2

As a by-product the robust density estimation leads to an improvedcriterion at calculation of the critical temperature to split the input data setinto more clusters of the RIC compared to the DA as the temperature islowered (Rose 1998) The critical temperature of the RIC can be determinedby the maximum eigenvalue of the covariance (Rose 1998)

VXW =lsum

i=1

p(xi |wk)(xi minus wk)(xi minus wk)T (428)

where p(xi |wk) is optimized by equation 41 This has a bigger value repre-senting the reliable data since the channel communication error pe is rela-tively smaller compared to the one of outlier (see lemma 1)

A Robust Information Clustering Algorithm 2687

43 Structural Risk Minimization and Optimal Cluster Number Tosolve the intertwined outlier and cluster number problem some intuitivenotations can be obtained based on classical information theory as presentedthe previous sections Increasing K and model complexity (as the tempera-ture is lowered) may reduce capacity C(D(p(X))) since it is a nondecreasingfunction of D(p(X)) as shown in corollary 1 (see also Figure 1) Thereforein view of theorem 1 we should use the smallest cluster number as longas a relatively small number of outliers is achieved (if not zero outlier) say1 percent of the entire input data points However how to make a trade-off between empirical risk minimization and capacity maximization is adifficult problem for classical information theory

We can solve this difficulty by bridging the gap between classical infor-mation theory on which the RIC algorithm is based and the relatively newstatistical learning theory with the so-called structural risk minimization(SRM) principle (Vapnik 1998) Under the SRM a set of admissible struc-tures with nested subsets can be defined specifically for the RIC clusteringproblem as

S1 sub S2 sub sub SK (429)

where SK = (QK (xi W) W isin K ) foralli with a set of indicator functions ofthe empirical risk7

QK (xi W) =Ksum

k=1

limTrarr0

p(wk |xi ) =Ksum

k=1

limTrarr0

p(wk) exp(minusd(xi wk)T)Nxi

foralli

(430)

We shall show that the titled distribution p(wk |xi ) equation 34 at zero tem-perature as in equation 430 can be approximated by the complement ofa step function This is linear in parameters and assigns the cluster mem-bership of each input data point based on the Euclidean distance betweendata point xi and cluster center wk for a final hard clustering partition (Rose1998 see also the algorithm in section 44)

The titled distribution at T rarr 0 can be presented as

limTrarr0

p(wk) exp(minusd(xi wk)T)sumKk=1 p(wk) exp(minusd(xi wk)T)

7 According to definition of the titled distribution equation 34 it is easy to see thatthe defined indictor function is a constant number that is QK (xi W) = 1 See also note 3

2688 Q Song

asymp

p(wk) exp(minusd0(xi wk))p(wk) exp(minusd0(xi wk))

= 1 if d0(xi wk) = infin

p(wk) exp(minusd0(xi wk))sumKk=1 p(wk) exp(minusd0(xi wk))

= 0 if d0(xi wk) rarr infin

(431)

Now consider the radius d0(xi wk) between data point xi and clusterk at zero temperature This can be rewritten as an inner product of twon-dimensional vectors of the input space as

d0(xi wk) = limTrarr0

d(xi wk)T

= limTrarr0

lt xi minus wk gt lt xi minus wk gt

T

=nsum

o=1

rkoφko(X) (432)

where rko represents the radius parameter component in the n-dimensionalspace and φko(X) is a linearly independent function similar to the hyper-plane case (Vapnik 1998)

Using equations 432 and 431 we can rewrite 430 as

QK (xi W) =Ksum

k=1

θ

(nsum

o=1

rkoφko(X) minus d0(xi wk)

) foralli (433)

where θ () = 1 minus θ () is the complement of the step function θ ()Note that there is one and only one d0(xi wk) = infin forall(1 le k le K ) in each

conditional equality of equation 431 since it gives a unique cluster mem-bership of any data point xi in a nested structure SK Therefore the indi-cator QK (xi W) is linear in parameters According to Vapnik (1998) theVC-dimension of the complexity control parameter is equal to the numberof parameters hK = (n + 1) lowast K for each nested subset SK By design of theDA clustering the nested structure in equation 429 provides ordering ofthe VC-dimension h1 le h2 le le hK such that the increase of clusternumber is proportional to the increase of the estimated VC-dimension froma neural network point of view (Vapnik 1998)

To obtain good generalization performance one has to use the admissiblestructure equation 429 based on the set of indicator functions to search foran optimal cluster number K This minimizes a VC-bound ps similar to thatof the support vector machine except that we are looking for the strongestdata point of the input space instead of seeking the weakest data point ofthe feature (kernel) space (Vapnik 1998) So we have

ps le η + ε

2

(1 +

(1 + η

)12)

(434)

A Robust Information Clustering Algorithm 2689

with

η = ml

(435)

ε = 4hK

(ln 2l

hK+ 1

)minus ln ζ

4

l (436)

where m is the number of outliers identified in the capacity maximizationas in the previous section ζ lt 1 is a constant

The signal-to-noise ratio η in equation 435 appears as the first term ofthe right-hand side of the VC-bound equation 434 This represents theempirical risk and the second term is the confidence interval of the SRM-based estimate

Discussion

Stop criterion and optimal cluster number At the initial DA clusteringstage with a small cluster number K and relatively large ratio betweenthe number of input data points and the VC-dimension say l

hKgt 20

(Vapnik 1998) the real risk VC-bound equation 434 is mainly deter-mined by the first term of the right-hand side of the inequality thatis the empirical risk (signal-to-noise) ratio η in equation 435 As thetemperature is lowered and the cluster number is increased a rela-tively small ratio l

hKmay require both terms in the right-hand side of

equation 434 to be small simultaneously Therefore we can assess firstthe ratio l(hK ) which is near the upper bound of the critical number20 for a maximum cluster number K = Kmax beyond which the sec-ond term of the VC-bound equation 434 may become dominant evenfor a small empirical risk ratio η especially in a high-dimensional dataspace Therefore we can follow the minimax MI optimization as insections 3 and 4 to increase the cluster number from one until Kmax fora minimum value of the VC-bound that is take a trade-off betweenminimization of the empirical risk and VC-dimension

Selection of λ The degree of robustness of the RIC algorithm is con-trolled by the parameter λ The Kuhn-Tucker condition in corollary 2tells that a relatively larger value of λ yields more outliers (noisy pat-terns) If one chooses λ = 0 the RIC allows the maximum empiricalrisk with a possible overcapacity distortion beyond the optimal saddlepoint and a minimum number of the estimated outliers (see Figure 1)In a general clustering problem using the L2 distortion measure equa-tion 22 selection of the λ is insensitive to determination of an optimalcluster number because the VC-bound depends on only the relativevalues of η and hK over different cluster numbers (see also example 2)

2690 Q Song

As a general rule of thumb if eliminating more outliers is an interestwe can gradually increase λ and redo the capacity maximization toreject outliers located between intercluster boundaries at an optimalcluster number determined by an arbitrary value of λ

44 Implementation of the RIC Algorithm

Phase I (Minimization)

1 Determine the ratio l(n lowast K ) which is near the critical number 20 fora maximum cluster number K = Kmax and p(xi ) = 1 l for i = 1 to l

2 Initialize T gt 2Emax(Vx) where Emax is the largest eigenvalue of thevariance matrix Vx of the input pattern set X K = 1 and p(w1) = 1

3 For i = 1 K of the fixed-point iteration of the DA clustering ac-cording to equations 34 415 and 312

4 Convergence test If not satisfied go to 3

5 If T le Tmin perform the last iteration and stop

6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 15: A Robust Information Clustering Algorithm

2686 Q Song

which can be rewritten as

Ksumk=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

minus λp(wk |xi )d(wk |xi )

)le minusλ1 + 1 foralli

(426)

with equality for all p(xi ) = 0 Setting minusλ1 + 1 = V completes the proof

Similarly it is easy to show that if we choose λ = 0 the Kuhn-Tucker con-dition becomes

p(xi )

[C minus

(Ksum

k=1

p(wk |xi )

(ln

p(wk |xi )sumli=1 p(xi )p(wk |xi )

))]= 0 foralli (427)

where C is the maximum capacity value defined in equation 42Note that the MI is not negative However individual items in the sum of

the capacity maximization equation 42 can be negative If the i th patternxi is taken into account and p(wk |xi ) lt

sumli=1 p(xi )p(wk |xi ) then the prob-

ability of the kth code vector (cluster center) is decreased by the observedpattern and gives negative information about pattern xi This particularinput pattern may be considered an unreliable pattern (outlier) and itsnegative effect must be offset by other input patterns Therefore the max-imization of the MI equation 42 provides a robust density estimation ofthe noisy pattern (outlier) in terms that the average information is over allclusters and input patterns The robust density estimation and optimiza-tion is now to maximize the MI against the pmf p(xi ) and p(xi |wk) for anyvalue of i if p(xi |wk) = 0 then p(xi ) should be set equal to zero in orderto obtain the maximum such that a corresponding training pattern (outlier)xi can be deleted and dropped from further consideration in the optimiza-tion procedure as outlier shown in Figure 2

As a by-product the robust density estimation leads to an improvedcriterion at calculation of the critical temperature to split the input data setinto more clusters of the RIC compared to the DA as the temperature islowered (Rose 1998) The critical temperature of the RIC can be determinedby the maximum eigenvalue of the covariance (Rose 1998)

VXW =lsum

i=1

p(xi |wk)(xi minus wk)(xi minus wk)T (428)

where p(xi |wk) is optimized by equation 41 This has a bigger value repre-senting the reliable data since the channel communication error pe is rela-tively smaller compared to the one of outlier (see lemma 1)

A Robust Information Clustering Algorithm 2687

43 Structural Risk Minimization and Optimal Cluster Number Tosolve the intertwined outlier and cluster number problem some intuitivenotations can be obtained based on classical information theory as presentedthe previous sections Increasing K and model complexity (as the tempera-ture is lowered) may reduce capacity C(D(p(X))) since it is a nondecreasingfunction of D(p(X)) as shown in corollary 1 (see also Figure 1) Thereforein view of theorem 1 we should use the smallest cluster number as longas a relatively small number of outliers is achieved (if not zero outlier) say1 percent of the entire input data points However how to make a trade-off between empirical risk minimization and capacity maximization is adifficult problem for classical information theory

We can solve this difficulty by bridging the gap between classical infor-mation theory on which the RIC algorithm is based and the relatively newstatistical learning theory with the so-called structural risk minimization(SRM) principle (Vapnik 1998) Under the SRM a set of admissible struc-tures with nested subsets can be defined specifically for the RIC clusteringproblem as

S1 sub S2 sub sub SK (429)

where SK = (QK (xi W) W isin K ) foralli with a set of indicator functions ofthe empirical risk7

QK (xi W) =Ksum

k=1

limTrarr0

p(wk |xi ) =Ksum

k=1

limTrarr0

p(wk) exp(minusd(xi wk)T)Nxi

foralli

(430)

We shall show that the titled distribution p(wk |xi ) equation 34 at zero tem-perature as in equation 430 can be approximated by the complement ofa step function This is linear in parameters and assigns the cluster mem-bership of each input data point based on the Euclidean distance betweendata point xi and cluster center wk for a final hard clustering partition (Rose1998 see also the algorithm in section 44)

The titled distribution at T rarr 0 can be presented as

limTrarr0

p(wk) exp(minusd(xi wk)T)sumKk=1 p(wk) exp(minusd(xi wk)T)

7 According to definition of the titled distribution equation 34 it is easy to see thatthe defined indictor function is a constant number that is QK (xi W) = 1 See also note 3

2688 Q Song

asymp

p(wk) exp(minusd0(xi wk))p(wk) exp(minusd0(xi wk))

= 1 if d0(xi wk) = infin

p(wk) exp(minusd0(xi wk))sumKk=1 p(wk) exp(minusd0(xi wk))

= 0 if d0(xi wk) rarr infin

(431)

Now consider the radius d0(xi wk) between data point xi and clusterk at zero temperature This can be rewritten as an inner product of twon-dimensional vectors of the input space as

d0(xi wk) = limTrarr0

d(xi wk)T

= limTrarr0

lt xi minus wk gt lt xi minus wk gt

T

=nsum

o=1

rkoφko(X) (432)

where rko represents the radius parameter component in the n-dimensionalspace and φko(X) is a linearly independent function similar to the hyper-plane case (Vapnik 1998)

Using equations 432 and 431 we can rewrite 430 as

QK (xi W) =Ksum

k=1

θ

(nsum

o=1

rkoφko(X) minus d0(xi wk)

) foralli (433)

where θ () = 1 minus θ () is the complement of the step function θ ()Note that there is one and only one d0(xi wk) = infin forall(1 le k le K ) in each

conditional equality of equation 431 since it gives a unique cluster mem-bership of any data point xi in a nested structure SK Therefore the indi-cator QK (xi W) is linear in parameters According to Vapnik (1998) theVC-dimension of the complexity control parameter is equal to the numberof parameters hK = (n + 1) lowast K for each nested subset SK By design of theDA clustering the nested structure in equation 429 provides ordering ofthe VC-dimension h1 le h2 le le hK such that the increase of clusternumber is proportional to the increase of the estimated VC-dimension froma neural network point of view (Vapnik 1998)

To obtain good generalization performance one has to use the admissiblestructure equation 429 based on the set of indicator functions to search foran optimal cluster number K This minimizes a VC-bound ps similar to thatof the support vector machine except that we are looking for the strongestdata point of the input space instead of seeking the weakest data point ofthe feature (kernel) space (Vapnik 1998) So we have

ps le η + ε

2

(1 +

(1 + η

)12)

(434)

A Robust Information Clustering Algorithm 2689

with

η = ml

(435)

ε = 4hK

(ln 2l

hK+ 1

)minus ln ζ

4

l (436)

where m is the number of outliers identified in the capacity maximizationas in the previous section ζ lt 1 is a constant

The signal-to-noise ratio η in equation 435 appears as the first term ofthe right-hand side of the VC-bound equation 434 This represents theempirical risk and the second term is the confidence interval of the SRM-based estimate

Discussion

Stop criterion and optimal cluster number At the initial DA clusteringstage with a small cluster number K and relatively large ratio betweenthe number of input data points and the VC-dimension say l

hKgt 20

(Vapnik 1998) the real risk VC-bound equation 434 is mainly deter-mined by the first term of the right-hand side of the inequality thatis the empirical risk (signal-to-noise) ratio η in equation 435 As thetemperature is lowered and the cluster number is increased a rela-tively small ratio l

hKmay require both terms in the right-hand side of

equation 434 to be small simultaneously Therefore we can assess firstthe ratio l(hK ) which is near the upper bound of the critical number20 for a maximum cluster number K = Kmax beyond which the sec-ond term of the VC-bound equation 434 may become dominant evenfor a small empirical risk ratio η especially in a high-dimensional dataspace Therefore we can follow the minimax MI optimization as insections 3 and 4 to increase the cluster number from one until Kmax fora minimum value of the VC-bound that is take a trade-off betweenminimization of the empirical risk and VC-dimension

Selection of λ The degree of robustness of the RIC algorithm is con-trolled by the parameter λ The Kuhn-Tucker condition in corollary 2tells that a relatively larger value of λ yields more outliers (noisy pat-terns) If one chooses λ = 0 the RIC allows the maximum empiricalrisk with a possible overcapacity distortion beyond the optimal saddlepoint and a minimum number of the estimated outliers (see Figure 1)In a general clustering problem using the L2 distortion measure equa-tion 22 selection of the λ is insensitive to determination of an optimalcluster number because the VC-bound depends on only the relativevalues of η and hK over different cluster numbers (see also example 2)

2690 Q Song

As a general rule of thumb if eliminating more outliers is an interestwe can gradually increase λ and redo the capacity maximization toreject outliers located between intercluster boundaries at an optimalcluster number determined by an arbitrary value of λ

44 Implementation of the RIC Algorithm

Phase I (Minimization)

1 Determine the ratio l(n lowast K ) which is near the critical number 20 fora maximum cluster number K = Kmax and p(xi ) = 1 l for i = 1 to l

2 Initialize T gt 2Emax(Vx) where Emax is the largest eigenvalue of thevariance matrix Vx of the input pattern set X K = 1 and p(w1) = 1

3 For i = 1 K of the fixed-point iteration of the DA clustering ac-cording to equations 34 415 and 312

4 Convergence test If not satisfied go to 3

5 If T le Tmin perform the last iteration and stop

6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position between the two clusters as outliers and eliminates them with p(xi) = 0. Further details on the values of the tilted distribution p(wk|xi) and the robust estimate p(xi) are listed in Table 1 for the case of K = 2. The first 14 rows correspond to the data in the two clusters, and the last 4 rows represent the four identified outliers. Despite the balanced geometric positions of the outliers, the membership of the four outliers is assigned to cluster 1 by the DA because p(w1|xi) ≈ 1 for the four outliers. A minor difference due to numerical error may be the only cause for the DA to assign the membership of the four data points to the first cluster. This explains why minimization of the tilted distribution is not robust (Dave & Krishnapuram, 1997).
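To make the capacity-maximization phase concrete, the following sketch implements the fixed-point update of the robust density estimate in the spirit of equations 4.37 and 4.39, with a Blahut-style stopping rule assumed in place of equation 4.38. It is an illustrative reconstruction rather than the author's code; entries of p(xi) that are driven to zero flag the outliers, as in Table 1.

import numpy as np

def robust_density_estimate(P_w_given_x, D, lam=0.0, eps=1e-6, max_iter=500):
    # Fixed-point iteration for p(x_i) in the spirit of equations 4.37 and 4.39.
    # P_w_given_x: (l, K) tilted distribution p(w_k | x_i)
    # D: (l, K) squared Euclidean distances d(w_k, x_i)
    # lam: control parameter lambda; eps: convergence tolerance.
    l, K = P_w_given_x.shape
    p_x = np.full(l, 1.0 / l)                        # start from the uniform pmf p(x_i) = 1/l
    for _ in range(max_iter):
        p_w = p_x @ P_w_given_x                      # p(w_k) = sum_i p(x_i) p(w_k | x_i)
        log_c = (P_w_given_x * (np.log(P_w_given_x + 1e-300) - np.log(p_w + 1e-300))
                 - lam * P_w_given_x * D).sum(axis=1)
        c = np.exp(log_c)                            # c_i of equation 4.37
        if np.log(c.max()) - np.log(p_x @ c) < eps:  # assumed Blahut-style stopping rule
            break
        p_x = p_x * c / (p_x @ c)                    # update of equation 4.39
    return p_x                                       # entries near zero flag outliers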

More important, the RIC estimates the real risk-bound ps as the cluster number is increased from one, which also eliminates the effect of outliers. The ratio between the number of total data points and the VC-dimension h2 is small, at 18/6 = 3, so the second term of the VC-bound becomes dominant as K increases, as shown in Figure 3. The optimal cluster number is determined as "two," with a minimum ps = 4.9766, despite the fact that the minimum number of outliers of the empirical risk is achieved at the cluster number K = 4. Note also that the original outliers become valid data points as the cluster number is increased to K = 3 and K = 4, respectively.
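For concreteness, the VC-bound ps can be evaluated from equations 4.34 to 4.36 as reconstructed here, that is, the standard Vapnik form with hK = (n + 1) ∗ K, η = m/l, and ζ = 0.1. The sketch below is illustrative rather than the author's implementation; with the assumed values m = 4 outliers, l = 18, n = 2, and K = 2, it gives ps ≈ 4.977, which agrees with the value reported for Figure 3b.

import math

def vc_bound(m, l, n, K, zeta=0.1):
    # VC-bound p_s of equation 4.34 (reconstructed standard Vapnik form).
    # m: number of identified outliers; l: number of data points;
    # n: input dimension; K: cluster number; zeta < 1 is the confidence constant.
    h = (n + 1) * K                                  # VC-dimension of the nested subset S_K
    eta = m / l                                      # empirical risk ratio, equation 4.35
    eps = 4.0 * (h * (math.log(2.0 * l / h) + 1.0) - math.log(zeta / 4.0)) / l  # equation 4.36
    return eta + (eps / 2.0) * (1.0 + math.sqrt(1.0 + 4.0 * eta / eps))         # equation 4.34

print(vc_bound(m=4, l=18, n=2, K=2))   # about 4.977 for example 1 with K = 2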

Example 2. The two-dimensional data set has 292 data points, so the ratio l/h7 = 292/(3 ∗ 7) is well below the critical number 20. We should search for an optimal cluster number from two to seven clusters. Figures 4 and 5


Figure 4: The two-dimensional data set with 292 data points in example 2 is clustered by the RIC algorithm with λ = 0: (a) K = 2, ps = 1.5635; (b) K = 3, ps = 0.6883; (c) K = 4, ps = 1.1888; (d) K = 5, ps = 1.4246; (e) K = 6, ps = 1.3208; (f) K = 7, ps = 2.4590. The black dot points are the outliers identified by the RIC in all pictures.


Figure 5: The two-dimensional data set with 292 data points in example 2 is clustered by the RIC algorithm with λ = 18: (a) K = 2, ps = 1.8924; (b) K = 3, ps = 0.9303; (c) K = 4, ps = 1.2826; (d) K = 5, ps = 1.5124; (e) K = 6, ps = 1.3718; (f) K = 7, ps = 2.46244. The black dot points are the outliers identified by the RIC in all pictures.


Figure 6: The two-dimensional data set with 300 data points in example 3 is clustered by the RIC algorithm with λ = 0: (a) K = 2, ps = 1.8177, η = 0.8667; (b) K = 3, ps = 1.3396, η = 0.3900; (c) K = 4, ps = 0.8486, η = 0; (d) K = 5, ps = 0.9870, η = 0; (e) K = 6, ps = 1.1374, η = 0.0033; (f) K = 7, ps = 2.169, η = 0.4467. The black dot points are the outliers identified by the RIC in all pictures.


show that a "native" noise-free three-cluster data set is clustered by the RIC algorithm with different cluster numbers. The RIC gives the correct optimal cluster number, "three," because the VC-bound ps attains its minimum value there. This also coincides with the empirical risk of the minimum number of outliers at K = 3 for both cases, λ = 0 and λ = 18. Note that we can always use a relatively larger λ value to eliminate more outliers in the intercluster area without affecting the optimal cluster number in a general clustering problem. The black dot points are the outliers identified by the RIC in all pictures.

Example 3. This is an instructive example to show the application of the RIC algorithm with λ = 0 for a data set without clear cluster boundaries in a two-dimensional space. The data set has 300 data points, such that the ratio l/h7 = 300/(3 ∗ 7) is well below the critical number 20. We shall search for an optimal cluster number from two to seven clusters. In particular, to show the difference between the empirical risk η and the VC-bound ps, we indicate both values for each case. Figure 6 illustrates that the optimal cluster number is four based on the SRM principle. It is interesting to note that the five-cluster case also achieves the minimum number of outliers in the sense of empirical risk minimization, but its VC-bound ps is bigger than that of the four-cluster case because of the increase in the VC-dimension.
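The SRM-based selection of the cluster number used in examples 2 and 3 amounts to evaluating the VC-bound for each candidate K with the outlier count m(K) returned by the capacity maximization and keeping the minimizer. In the illustrative sketch below, the outlier counts are inferred from the η values reported in Figure 6 (m = η · l) and may differ slightly from the exact counts; the function names are assumptions, not the author's code.

import math

def vc_bound(m, l, n, K, zeta=0.1):
    # Reconstructed VC-bound of equations 4.34 to 4.36 (see the sketch in example 1).
    h = (n + 1) * K
    eta = m / l
    eps = 4.0 * (h * (math.log(2.0 * l / h) + 1.0) - math.log(zeta / 4.0)) / l
    return eta + (eps / 2.0) * (1.0 + math.sqrt(1.0 + 4.0 * eta / eps))

def select_cluster_number(outliers_per_K, l, n, zeta=0.1):
    # outliers_per_K: dict {K: m} with the outlier count from the capacity maximization.
    bounds = {K: vc_bound(m, l, n, K, zeta) for K, m in outliers_per_K.items()}
    return min(bounds, key=bounds.get), bounds

# Outlier counts inferred from the eta values of Figure 6 (m = eta * l, l = 300, n = 2):
K_opt, bounds = select_cluster_number({2: 260, 3: 117, 4: 0, 5: 0, 6: 1, 7: 134}, l=300, n=2)
print(K_opt)   # selects K = 4, consistent with Figure 6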

6 Conclusion

A robust information clustering algorithm has been developed based on the minimax optimization of MI. In addition to the algorithm, the theoretical contributions of this letter are twofold: (1) the capacity maximization is implicitly linked to the distortion measure against the input pattern pmf and provides an upper bound of the empirical risk to phase out outliers; (2) the optimal cluster number is estimated based on the SRM principle of statistical learning theory. The RIC can also be extended to c-shells or kernel-based algorithms to deal with linearly nonseparable data. This is an interesting topic for further research.

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlier version of this letter.

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011–1015.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460–473.

Blahut, R. E. (1988). Principle and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270–293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158–171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinek, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98–110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89–104.

MacKay, D. J. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035–1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.

Scholkopf, B., Smola, A., & Muller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434–448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 32, 440–448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483–2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.

Page 16: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2687

43 Structural Risk Minimization and Optimal Cluster Number Tosolve the intertwined outlier and cluster number problem some intuitivenotations can be obtained based on classical information theory as presentedthe previous sections Increasing K and model complexity (as the tempera-ture is lowered) may reduce capacity C(D(p(X))) since it is a nondecreasingfunction of D(p(X)) as shown in corollary 1 (see also Figure 1) Thereforein view of theorem 1 we should use the smallest cluster number as longas a relatively small number of outliers is achieved (if not zero outlier) say1 percent of the entire input data points However how to make a trade-off between empirical risk minimization and capacity maximization is adifficult problem for classical information theory

We can solve this difficulty by bridging the gap between classical infor-mation theory on which the RIC algorithm is based and the relatively newstatistical learning theory with the so-called structural risk minimization(SRM) principle (Vapnik 1998) Under the SRM a set of admissible struc-tures with nested subsets can be defined specifically for the RIC clusteringproblem as

S1 sub S2 sub sub SK (429)

where SK = (QK (xi W) W isin K ) foralli with a set of indicator functions ofthe empirical risk7

QK (xi W) =Ksum

k=1

limTrarr0

p(wk |xi ) =Ksum

k=1

limTrarr0

p(wk) exp(minusd(xi wk)T)Nxi

foralli

(430)

We shall show that the titled distribution p(wk |xi ) equation 34 at zero tem-perature as in equation 430 can be approximated by the complement ofa step function This is linear in parameters and assigns the cluster mem-bership of each input data point based on the Euclidean distance betweendata point xi and cluster center wk for a final hard clustering partition (Rose1998 see also the algorithm in section 44)

The titled distribution at T rarr 0 can be presented as

limTrarr0

p(wk) exp(minusd(xi wk)T)sumKk=1 p(wk) exp(minusd(xi wk)T)

7 According to definition of the titled distribution equation 34 it is easy to see thatthe defined indictor function is a constant number that is QK (xi W) = 1 See also note 3

2688 Q Song

asymp

p(wk) exp(minusd0(xi wk))p(wk) exp(minusd0(xi wk))

= 1 if d0(xi wk) = infin

p(wk) exp(minusd0(xi wk))sumKk=1 p(wk) exp(minusd0(xi wk))

= 0 if d0(xi wk) rarr infin

(431)

Now consider the radius d0(xi wk) between data point xi and clusterk at zero temperature This can be rewritten as an inner product of twon-dimensional vectors of the input space as

d0(xi wk) = limTrarr0

d(xi wk)T

= limTrarr0

lt xi minus wk gt lt xi minus wk gt

T

=nsum

o=1

rkoφko(X) (432)

where rko represents the radius parameter component in the n-dimensionalspace and φko(X) is a linearly independent function similar to the hyper-plane case (Vapnik 1998)

Using equations 432 and 431 we can rewrite 430 as

QK (xi W) =Ksum

k=1

θ

(nsum

o=1

rkoφko(X) minus d0(xi wk)

) foralli (433)

where θ () = 1 minus θ () is the complement of the step function θ ()Note that there is one and only one d0(xi wk) = infin forall(1 le k le K ) in each

conditional equality of equation 431 since it gives a unique cluster mem-bership of any data point xi in a nested structure SK Therefore the indi-cator QK (xi W) is linear in parameters According to Vapnik (1998) theVC-dimension of the complexity control parameter is equal to the numberof parameters hK = (n + 1) lowast K for each nested subset SK By design of theDA clustering the nested structure in equation 429 provides ordering ofthe VC-dimension h1 le h2 le le hK such that the increase of clusternumber is proportional to the increase of the estimated VC-dimension froma neural network point of view (Vapnik 1998)

To obtain good generalization performance one has to use the admissiblestructure equation 429 based on the set of indicator functions to search foran optimal cluster number K This minimizes a VC-bound ps similar to thatof the support vector machine except that we are looking for the strongestdata point of the input space instead of seeking the weakest data point ofthe feature (kernel) space (Vapnik 1998) So we have

ps le η + ε

2

(1 +

(1 + η

)12)

(434)

A Robust Information Clustering Algorithm 2689

with

η = ml

(435)

ε = 4hK

(ln 2l

hK+ 1

)minus ln ζ

4

l (436)

where m is the number of outliers identified in the capacity maximizationas in the previous section ζ lt 1 is a constant

The signal-to-noise ratio η in equation 435 appears as the first term ofthe right-hand side of the VC-bound equation 434 This represents theempirical risk and the second term is the confidence interval of the SRM-based estimate

Discussion

Stop criterion and optimal cluster number At the initial DA clusteringstage with a small cluster number K and relatively large ratio betweenthe number of input data points and the VC-dimension say l

hKgt 20

(Vapnik 1998) the real risk VC-bound equation 434 is mainly deter-mined by the first term of the right-hand side of the inequality thatis the empirical risk (signal-to-noise) ratio η in equation 435 As thetemperature is lowered and the cluster number is increased a rela-tively small ratio l

hKmay require both terms in the right-hand side of

equation 434 to be small simultaneously Therefore we can assess firstthe ratio l(hK ) which is near the upper bound of the critical number20 for a maximum cluster number K = Kmax beyond which the sec-ond term of the VC-bound equation 434 may become dominant evenfor a small empirical risk ratio η especially in a high-dimensional dataspace Therefore we can follow the minimax MI optimization as insections 3 and 4 to increase the cluster number from one until Kmax fora minimum value of the VC-bound that is take a trade-off betweenminimization of the empirical risk and VC-dimension

Selection of λ The degree of robustness of the RIC algorithm is con-trolled by the parameter λ The Kuhn-Tucker condition in corollary 2tells that a relatively larger value of λ yields more outliers (noisy pat-terns) If one chooses λ = 0 the RIC allows the maximum empiricalrisk with a possible overcapacity distortion beyond the optimal saddlepoint and a minimum number of the estimated outliers (see Figure 1)In a general clustering problem using the L2 distortion measure equa-tion 22 selection of the λ is insensitive to determination of an optimalcluster number because the VC-bound depends on only the relativevalues of η and hK over different cluster numbers (see also example 2)

2690 Q Song

As a general rule of thumb if eliminating more outliers is an interestwe can gradually increase λ and redo the capacity maximization toreject outliers located between intercluster boundaries at an optimalcluster number determined by an arbitrary value of λ

44 Implementation of the RIC Algorithm

Phase I (Minimization)

1 Determine the ratio l(n lowast K ) which is near the critical number 20 fora maximum cluster number K = Kmax and p(xi ) = 1 l for i = 1 to l

2 Initialize T gt 2Emax(Vx) where Emax is the largest eigenvalue of thevariance matrix Vx of the input pattern set X K = 1 and p(w1) = 1

3 For i = 1 K of the fixed-point iteration of the DA clustering ac-cording to equations 34 415 and 312

4 Convergence test If not satisfied go to 3

5 If T le Tmin perform the last iteration and stop

6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 17: A Robust Information Clustering Algorithm

2688 Q Song

asymp

p(wk) exp(minusd0(xi wk))p(wk) exp(minusd0(xi wk))

= 1 if d0(xi wk) = infin

p(wk) exp(minusd0(xi wk))sumKk=1 p(wk) exp(minusd0(xi wk))

= 0 if d0(xi wk) rarr infin

(431)

Now consider the radius d0(xi wk) between data point xi and clusterk at zero temperature This can be rewritten as an inner product of twon-dimensional vectors of the input space as

d0(xi wk) = limTrarr0

d(xi wk)T

= limTrarr0

lt xi minus wk gt lt xi minus wk gt

T

=nsum

o=1

rkoφko(X) (432)

where rko represents the radius parameter component in the n-dimensionalspace and φko(X) is a linearly independent function similar to the hyper-plane case (Vapnik 1998)

Using equations 432 and 431 we can rewrite 430 as

QK (xi W) =Ksum

k=1

θ

(nsum

o=1

rkoφko(X) minus d0(xi wk)

) foralli (433)

where θ () = 1 minus θ () is the complement of the step function θ ()Note that there is one and only one d0(xi wk) = infin forall(1 le k le K ) in each

conditional equality of equation 431 since it gives a unique cluster mem-bership of any data point xi in a nested structure SK Therefore the indi-cator QK (xi W) is linear in parameters According to Vapnik (1998) theVC-dimension of the complexity control parameter is equal to the numberof parameters hK = (n + 1) lowast K for each nested subset SK By design of theDA clustering the nested structure in equation 429 provides ordering ofthe VC-dimension h1 le h2 le le hK such that the increase of clusternumber is proportional to the increase of the estimated VC-dimension froma neural network point of view (Vapnik 1998)

To obtain good generalization performance one has to use the admissiblestructure equation 429 based on the set of indicator functions to search foran optimal cluster number K This minimizes a VC-bound ps similar to thatof the support vector machine except that we are looking for the strongestdata point of the input space instead of seeking the weakest data point ofthe feature (kernel) space (Vapnik 1998) So we have

ps le η + ε

2

(1 +

(1 + η

)12)

(434)

A Robust Information Clustering Algorithm 2689

with

η = ml

(435)

ε = 4hK

(ln 2l

hK+ 1

)minus ln ζ

4

l (436)

where m is the number of outliers identified in the capacity maximizationas in the previous section ζ lt 1 is a constant

The signal-to-noise ratio η in equation 435 appears as the first term ofthe right-hand side of the VC-bound equation 434 This represents theempirical risk and the second term is the confidence interval of the SRM-based estimate

Discussion

Stop criterion and optimal cluster number At the initial DA clusteringstage with a small cluster number K and relatively large ratio betweenthe number of input data points and the VC-dimension say l

hKgt 20

(Vapnik 1998) the real risk VC-bound equation 434 is mainly deter-mined by the first term of the right-hand side of the inequality thatis the empirical risk (signal-to-noise) ratio η in equation 435 As thetemperature is lowered and the cluster number is increased a rela-tively small ratio l

hKmay require both terms in the right-hand side of

equation 434 to be small simultaneously Therefore we can assess firstthe ratio l(hK ) which is near the upper bound of the critical number20 for a maximum cluster number K = Kmax beyond which the sec-ond term of the VC-bound equation 434 may become dominant evenfor a small empirical risk ratio η especially in a high-dimensional dataspace Therefore we can follow the minimax MI optimization as insections 3 and 4 to increase the cluster number from one until Kmax fora minimum value of the VC-bound that is take a trade-off betweenminimization of the empirical risk and VC-dimension

Selection of λ The degree of robustness of the RIC algorithm is con-trolled by the parameter λ The Kuhn-Tucker condition in corollary 2tells that a relatively larger value of λ yields more outliers (noisy pat-terns) If one chooses λ = 0 the RIC allows the maximum empiricalrisk with a possible overcapacity distortion beyond the optimal saddlepoint and a minimum number of the estimated outliers (see Figure 1)In a general clustering problem using the L2 distortion measure equa-tion 22 selection of the λ is insensitive to determination of an optimalcluster number because the VC-bound depends on only the relativevalues of η and hK over different cluster numbers (see also example 2)

2690 Q Song

As a general rule of thumb if eliminating more outliers is an interestwe can gradually increase λ and redo the capacity maximization toreject outliers located between intercluster boundaries at an optimalcluster number determined by an arbitrary value of λ

44 Implementation of the RIC Algorithm

Phase I (Minimization)

1 Determine the ratio l(n lowast K ) which is near the critical number 20 fora maximum cluster number K = Kmax and p(xi ) = 1 l for i = 1 to l

2 Initialize T gt 2Emax(Vx) where Emax is the largest eigenvalue of thevariance matrix Vx of the input pattern set X K = 1 and p(w1) = 1

3 For i = 1 K of the fixed-point iteration of the DA clustering ac-cording to equations 34 415 and 312

4 Convergence test If not satisfied go to 3

5 If T le Tmin perform the last iteration and stop

6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 18: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2689

with

η = ml

(435)

ε = 4hK

(ln 2l

hK+ 1

)minus ln ζ

4

l (436)

where m is the number of outliers identified in the capacity maximizationas in the previous section ζ lt 1 is a constant

The signal-to-noise ratio η in equation 435 appears as the first term ofthe right-hand side of the VC-bound equation 434 This represents theempirical risk and the second term is the confidence interval of the SRM-based estimate

Discussion

Stop criterion and optimal cluster number At the initial DA clusteringstage with a small cluster number K and relatively large ratio betweenthe number of input data points and the VC-dimension say l

hKgt 20

(Vapnik 1998) the real risk VC-bound equation 434 is mainly deter-mined by the first term of the right-hand side of the inequality thatis the empirical risk (signal-to-noise) ratio η in equation 435 As thetemperature is lowered and the cluster number is increased a rela-tively small ratio l

hKmay require both terms in the right-hand side of

equation 434 to be small simultaneously Therefore we can assess firstthe ratio l(hK ) which is near the upper bound of the critical number20 for a maximum cluster number K = Kmax beyond which the sec-ond term of the VC-bound equation 434 may become dominant evenfor a small empirical risk ratio η especially in a high-dimensional dataspace Therefore we can follow the minimax MI optimization as insections 3 and 4 to increase the cluster number from one until Kmax fora minimum value of the VC-bound that is take a trade-off betweenminimization of the empirical risk and VC-dimension

Selection of λ The degree of robustness of the RIC algorithm is con-trolled by the parameter λ The Kuhn-Tucker condition in corollary 2tells that a relatively larger value of λ yields more outliers (noisy pat-terns) If one chooses λ = 0 the RIC allows the maximum empiricalrisk with a possible overcapacity distortion beyond the optimal saddlepoint and a minimum number of the estimated outliers (see Figure 1)In a general clustering problem using the L2 distortion measure equa-tion 22 selection of the λ is insensitive to determination of an optimalcluster number because the VC-bound depends on only the relativevalues of η and hK over different cluster numbers (see also example 2)

2690 Q Song

As a general rule of thumb if eliminating more outliers is an interestwe can gradually increase λ and redo the capacity maximization toreject outliers located between intercluster boundaries at an optimalcluster number determined by an arbitrary value of λ

44 Implementation of the RIC Algorithm

Phase I (Minimization)

1 Determine the ratio l(n lowast K ) which is near the critical number 20 fora maximum cluster number K = Kmax and p(xi ) = 1 l for i = 1 to l

2 Initialize T gt 2Emax(Vx) where Emax is the largest eigenvalue of thevariance matrix Vx of the input pattern set X K = 1 and p(w1) = 1

3 For i = 1 K of the fixed-point iteration of the DA clustering ac-cording to equations 34 415 and 312

4 Convergence test If not satisfied go to 3

5 If T le Tmin perform the last iteration and stop

6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy, P., & Ahuja, N. (1998). Location- and density-based hierarchical clustering using similarity analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1011–1015.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460–473.

Blahut, R. E. (1988). Principles and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270–293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158–171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinek, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98–110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89–104.

MacKay, D. J. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035–1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.

Scholkopf, B., Smola, A., & Muller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434–448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 32, 440–448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483–2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.

Page 19: A Robust Information Clustering Algorithm

2690 Q Song

As a general rule of thumb if eliminating more outliers is an interestwe can gradually increase λ and redo the capacity maximization toreject outliers located between intercluster boundaries at an optimalcluster number determined by an arbitrary value of λ

44 Implementation of the RIC Algorithm

Phase I (Minimization)

1 Determine the ratio l(n lowast K ) which is near the critical number 20 fora maximum cluster number K = Kmax and p(xi ) = 1 l for i = 1 to l

2 Initialize T gt 2Emax(Vx) where Emax is the largest eigenvalue of thevariance matrix Vx of the input pattern set X K = 1 and p(w1) = 1

3 For i = 1 K of the fixed-point iteration of the DA clustering ac-cording to equations 34 415 and 312

4 Convergence test If not satisfied go to 3

5 If T le Tmin perform the last iteration and stop

6 Cooling step T larr αT (α lt 1)

7 If K lt Kmax check condition for phase transition for i = 1 K Ifa critical temperature T = 2Emax(Vxw) where Emax(Vxw) is the largesteigenvalue of the covariance VXW matrix in equation 428 between theinput pattern and code vector (Rose 1998) is reached for the clus-tering add a new center wK+1 = wK + δ with p(wK+1) = p(wK )2p(wK ) larr p(wK )2 and update K + 1 larr K

Phase II (Maximization)

8 If it is the first time for the calculation of the robust density estima-tion select p(xi ) = 1 l infin gt λ ge 0 and ε gt 0 and start the fixed-pointiteration of the robust density estimation in the following step 9 to 10

9

ci = exp

[Ksum

k=1

(p(wk |xi ) lnp(wk |xi )suml

i=1 p(xi )p(wk |xi )minus λp(wk |xi )d(wk xi ))

]

(437)

10 If

lnlsum

i=1

p(xi )ci minus ln maxi=1l

ci lt ε (438)

then go to 9 where ε gt 0 otherwise update the density estimation

p(xi ) = p(xi )cisuml

i=1 p(xi )ci (439)

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 20: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2691

11 Verify the robust solutions of the RIC algorithm around the optimalsaddle point for a minimum value of the VC-bound equation 434within the range of maximum cluster number Kmax If the minimum isfound then delete outliers and set T rarr 0 for the titled distribution toobtain cluster membership of all input data points for a hard clusteringsolution Recalculate the cluster center using equation 312 withoutoutliers then stop Otherwise go to 3

5 Simulation Results

This section presents a few simulation examples to show the superiority ofthe RIC over the standard DA clustering algorithm This is in fact a self-comparison since the RIC is just an extension of the DA by identifyingoutliers for an optimal cluster number A comparison can also be madewith the popular fuzzy c-means (FCM) and the robust version of the FCMclustering algorithms (see section 2) However this may not make muchsense since the FCM needs a predetermined cluster number in addition tothe initialization problem (Krishnapuram amp Keller 1993)

Example 1 which follows presents a numerical analysis to reveal detailsof the weakness of the titled distribution This also explains how the ro-bust density estimate of the RIC algorithm finds an optimal cluster numbervia the identification of outliers Example 2 illustrates that one can alwayschoose a relatively larger control parameter λ to eliminate more outliersbetween the intercluster area without affecting the estimated optimal clus-ter number Example 3 shows an interesting partition of a specific data setwithout clear cluster boundaries In particular we show that any data pointcould become outlier dependent on the given data structure and chosencluster centers in the annealing procedure based on the limited numberof input data for a minimum VC-bound Similarly we are not looking forldquotruerdquo clusters or cluster centers but effective clusters in a sense of the SRMbased on the simple Euclidean distance8

Example 1 Figure 3 is an extended example used in the robust FCM cluster-ing algorithm (Krishnapuram amp Keller 1993) which has two well-separatedclusters with seven data points each and four outliers sitting around themiddle position between the two given clusters The data set has 18 datapoints such that the ratio lh1 = 18(3 lowast 1) is already smaller than the criticalnumber 20 An optimal cluster number should be the minimum two (notethat DA does not work for one cluster) However we would like to use thisexample to reveal the weakness of the titled distribution and how the ro-bust density estimate helps Figure 3 also shows that the RIC algorithm with

8 We set ζ = 01 of the VC-bound for all the simulation results The Matlab pro-gram can be downloaded from the authorrsquos Internet address httpwwwntuedusghomeeqsong

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 21: A Robust Information Clustering Algorithm

2692 Q Song

(a) The original data set (b) K = 2 ps = 49766

(c) K = 3 ps = 57029 (d) K = 4 ps = 64161

Figure 3 The clustering results of RIC (λ = 0) in example 1 The bigger lowast repre-sents the estimated cluster center of the RIC after eliminating the estimated out-liers The black dot points are the identified outliers by the RIC in b c and d

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 22: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2693

Table 1 Optimal Titled Distribution p(wk |xi ) and Robust Density Estimate p(xi )in Example 1 with K = 2

i p(xi ) p(w1|xi ) p(w2|xi )

1 03134 09994 000062 00638 09991 000093 00354 09987 000134 00329 09987 000135 00309 09987 000136 00176 09981 000197 00083 09972 000288 00030 00028 099729 00133 00019 09981

10 00401 00013 0998711 00484 00013 0998712 00567 00013 0998713 01244 00009 0999114 02133 00006 0999415 00000 09994 0000616 00000 09994 0000617 00000 09994 0000618 00000 09994 00006

K = 2 identifies the four data points around the middle position betweenthe two clusters as outliers and eliminates them with p(xi ) = 0 Further de-tails on the values of the titled distribution p(wk |xi) and the robust estimatep(xi) are listed in Table 1 for the case of K = 2 The first 14 rows correspondto the data in the two clusters and the last 4 rows represent the four iden-tified outliers Despite the balanced geometric positions of the outliers themembership of the four outliers is assigned to cluster 1 by the DA becauseof p(w1|xi) asymp 1 for the four outliers The minor difference in the numericalerror may be the only cause for the DA to assign the membership of the fourdata points to the first cluster This explains why minimization of the titleddistribution is not robust (Dave amp Krishnapuram 1997)

More important the RIC estimates the real risk-bound ps as the clusternumber is increased from one This also eliminates the effect of outliers Theratio between the number of total data points and VC-dimension h2 is smallat 186 = 3 so the second term of the VC-bound becomes dominant as Kincreases as shown in Figure 3 The optimal cluster number is determinedas ldquotwordquo with a minimum ps = 49766 despite the fact that the minimumnumber of outliers of the empirical risk is achieved at the cluster numberK = 4 Note also that the original outliers become valid data points as thecluster numbers are increased to K = 3 and K = 4 respectively

Example 2 The two-dimensional data set has 292 data points so the ratiolh7 = 292(3 lowast 7) is well below the critical number 20 We should searchfor an optimal cluster number from two to seven clusters Figures 4 and 5

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 23: A Robust Information Clustering Algorithm

2694 Q Song

(a) ps = 15635 K = 2 (b) ps = 06883 K = 3

(c) ps = 11888 K = 4 (d) ps = 14246 K = 5

(e) ps = 13208 K = 6 (f) ps = 24590 K = 7

Figure 4 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 24: A Robust Information Clustering Algorithm

A Robust Information Clustering Algorithm 2695

(a) ps = 18924 K = 2 (b) ps = 09303 K = 3

(c) ps = 12826 K = 4 (d) ps = 15124 K = 5

(e) ps = 13718 K = 6 (f) ps = 246244 K = 7

Figure 5 The two-dimensional data set with 292 data points in example 2 isclustered by the RIC algorithm with λ = 18 The black dot points are identifiedoutliers by the RIC in all pictures

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop C M (1995) Neural networks for pattern recognition New York Oxford Uni-versity Press

Blahut R E (1972) Computation of channel capacity and rate-distortion functionsIEEE Trans on Information Theory 18 460ndash473

Blahut R E (1988) Principle and practice of information theory Reading MA Addison-Wesley

Dave R N amp Krishnapuram R (1997) Robust clustering methods A unified viewIEEE Trans on Fuzzy Systems 5 270ndash293

Gokcay E amp Principe J C (2002) Information theoretic clustering IEEE Trans onPattern Analysis and Machine Intelligence 24 158ndash171

Gray R M (1990) Source coding theory Norwood MA KluwerJelinet F (1968) Probabilistic information theory New York McGraw-HillKrishnapuram R amp Keller J M (1993) A possibilistic approach to clustering IEEE

Trans on Fuzzy Systems 1 98ndash110Levy B C amp Nikoukhah R (2004) Robust least-squares estimation with a relative

entropy constraint IEEE Trans on Information Theory 50 89ndash104Mackay D C (1999) Comparision of approximate methods for handling hyperpa-

rameters Neural Computation 11 1035ndash1068Rose K (1998) Deterministic annealing for clustering compression classification

regression and related optimization problem Proceedings of the IEEE 86 2210ndash2239

Scholkopf B Smola A amp Muller K M (1998) Nonlinear component analysis as akernel eigenvalue problem Neural Computation 10 1299ndash1319

Shen M amp Wu K L (2004) A similarity-based robust clustering method IEEETrans on Pattern Analysis and Machine Intelligence 26 434ndash448

Song Q Hu W J amp Xie W F (2002) Robust support vector machine for bullet holeimage classification IEEE Transactions on Systems Man and CyberneticsmdashPart C32 440ndash448

Still S amp Bialek W (2004) How many clusters An information-theoretic perspec-tive Neural Computation 16 2483ndash2506

Tishby N Pereira F amp Bialek W (1999) The information bottleneck method InB Hajek and R S Sreenivas (Eds) Proc 37th Annual Allerton Conf Urbana Uni-versity of Illinois

Vapnik V N (1998) Statistical learning theory New York Wiley

Received July 28 2004 accepted April 20 2005

Page 25: A Robust Information Clustering Algorithm

2696 Q Song

(a) ps = 18177 η = 08667 (b) ps = 13396 η = 03900

(c) ps = 08486 η = 0 (d) ps = 09870 ηη

= 0

(e) ps = 11374 ηη

= 00033 (f) ps = 2169 η = 04467

Figure 6 The two-dimensional data set with 300 data points in example 3clustered by the RIC algorithm with λ = 0 The black dot points are identifiedoutliers by the RIC in all pictures with (a) K = 2 (b) K = 3 (c) K = 4 (d) K = 5(e) K = 6 and (f) K = 7

A Robust Information Clustering Algorithm 2697

show that a ldquonativerdquo noise-free three-cluster data set is clustered by theRIC algorithm with different cluster numbers The RIC gives the correctoptimal cluster number ldquothreerdquo because there is a minimum value of theVC-bound ps This also coincides with the empirical risk of the minimumnumber of outliers at K = 3 for both cases λ = 0 and λ = 18 Note thatwe can always use a relatively larger λ value to eliminate more outliersbetween the intercluster area without affecting the optimal cluster numberin a general clustering problem The black dot points are identified outliersby the RIC in all pictures

Example 3 This is an instructive example to show the application of theRIC algorithm with λ = 0 for a data set without clear cluster boundariesin a two-dimensional space The data set has 300 data points such that theratio lh7 = 300(3 lowast 7) is well below the critical number 20 We shall searchfor an optimal cluster number from two to seven clusters In particular toshow the difference between the empirical risk η and the VC-bound ps we indicate both values for each case Figure 6 illustrates that the optimalcluster number is four based on the SRM principle It is interesting to notethat the five-cluster case also achieves the minimum number of outliers in asense of the empirical risk minimization but its VC-bound ps is bigger thanthe one of the four-cluster because of the increase in the VC-dimension

6 Conclusion

A robust information clustering algorithm is developed based on the mini-max optimization of MI In addition to the algorithm the theoretical contri-butions of this letter are twofold (1) the capacity maximization is implicitlylinked to the distortion measure against the input pattern pmf and providesan upper bound of the empirical risk to phase out outliers (2) the opti-mal cluster number is estimated based on the SRM principle of statisticallearning theory The RIC can also be extended to the c-shells or kernel-basedalgorithms to deal with the linearly nonseparable data This is an interestingtopic for further research

Acknowledgments

I thank the anonymous reviewers for constructive comments on an earlierversion of this letter

References

Bajcsy P amp Ahuja N (1998) Location- and density-based hierarchical clusteringusing similarity analysis IEEE Trans on Pattern Analysis and Machine Intelligence20 1011ndash1015

2698 Q Song

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Trans. on Information Theory, 18, 460–473.

Blahut, R. E. (1988). Principles and practice of information theory. Reading, MA: Addison-Wesley.

Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Trans. on Fuzzy Systems, 5, 270–293.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 158–171.

Gray, R. M. (1990). Source coding theory. Norwood, MA: Kluwer.

Jelinek, F. (1968). Probabilistic information theory. New York: McGraw-Hill.

Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Trans. on Fuzzy Systems, 1, 98–110.

Levy, B. C., & Nikoukhah, R. (2004). Robust least-squares estimation with a relative entropy constraint. IEEE Trans. on Information Theory, 50, 89–104.

MacKay, D. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035–1068.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.

Scholkopf, B., Smola, A., & Muller, K. M. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Shen, M., & Wu, K. L. (2004). A similarity-based robust clustering method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26, 434–448.

Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine for bullet hole image classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 32, 440–448.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16, 2483–2506.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received July 28, 2004; accepted April 20, 2005.
