
Methodology Review: Clustering Methods
Glenn W. Milligan and Martha C. Cooper
Ohio State University

A review of clustering methodology is presented, with emphasis on algorithm performance and the resulting implications for applied research. After an overview of the clustering literature, the clustering process is discussed within a seven-step framework. The four major types of clustering methods can be characterized as hierarchical, partitioning, overlapping, and ordination algorithms. The validation of such algorithms refers to the problem of determining the ability of the methods to recover cluster configurations which are known to exist in the data. Validation approaches include mathematical derivations, analyses of empirical datasets, and Monte Carlo simulation methods. Next, interpretation and inference procedures in cluster analysis are discussed. Inference procedures involve testing for significant cluster structure and the problem of determining the number of clusters in the data. The paper concludes with two sets of recommendations. One set deals with topics in clustering that would benefit from continued research into the methodology. The other set offers recommendations for applied analyses within the framework of the clustering process.

Classification is a basic mental process. Relevant groupings can provide economy of memory, predictive power, or possible theoretical development. Much classificatory activity is carried out at a subjective level. However, with the advent of high-speed computing machinery, many disciplines have become involved in the development of automatic or objective algorithms for the generation of classifications. Before reviewing the literature, four issues peculiar to the area deserve discussion:

1. Many researchers are not aware of the immense amount of literature in the field of classification. Blashfield and Aldenderfer (1978) noted that the number of articles using clustering methodology grew from 25 in 1964 to 501 in 1976. Thousands of articles on classification have been published since 1958. A separate literature search indicated that in 1985 alone, 1,658 references were found on the topic.

Despite the wide range of interest in the development and use of clustering methodology, the literature on the topic is remarkably segmented, with authors in one academic discipline often isolated from research in other fields. Blashfield (1976) and Blashfield and Aldenderfer (1978) have documented the lack of cross-referencing between disciplines. It is not unusual to find identical or highly similar techniques being discovered in different fields and given different names. The richness of this literature base also presents problems for the interested reader. Most discussions of clustering procedures are embedded in content material specific to a given discipline. A reader may have to deal with articles in soil science, agriculture, biogeography,

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/


cladistics, inorganic molecular structures,

segmentation, or consistency theorems of mathematical statistics. The interdisciplinary nature of the sources makes it difficult for an applied researcher to become fully informed about developments in this methodological area. Nevertheless, it is not reasonable to ignore the contributions in other fields.

2. Defining the term "cluster" presents a problem. Numerous definitions for the concept exist, and each can be valid within a particular application or framework. Some researchers have viewed clusters as mixtures of multivariate normal populations (Blashfield, 1976; Fleiss & Zubin, 1969; Wolfe, 1970). Each cluster can be viewed as a multivariate normal population. Because each cluster forms a different

population, it is possible to conceptualize a "mixture" of populations present in the database available for sampling. This represents a logical extension of the use of the normal distribution in other cases. Such mixtures allow for clusters to overlap in the variable space.

Milligan (1980, 1985) used truncated multivariate normal mixtures to ensure that clusters did not overlap. A truncated multivariate normal population would involve constraining all data points to fall within a specified interval about the mean for a given variable. For example, each point could be required to fall within ±2.0 standard deviations of the mean. This has the effect of eliminating the tails of the distribution. Assuming that the centroids for the populations are separated by a sufficient minimum distance, the populations will not overlap in the variable space. (The centroid is the location that corresponds to the means of the variables in the multivariate space.) The concept of distinct groups was incorporated by Cormack (1971) into a definition of natural clusters. The definition specified that clusters should exhibit internal cohesion and external isolation. Several other authors have expressed this concept in similar terms. Sneath (1969) stated that "in a broad sense clusters are thought of as collections of points which are relatively close, but which are separated by empty regions of space from other clusters" (p. 260).

However, the concept of compact clusters may be overly restrictive for some applications and inappropriate for others. If all elements within a given cluster are required to be highly similar to each other, then more elongated clusters are ruled out. When all data points are required to be highly similar, the result is a compact cluster approximating a sphere in the variable space. Thus, it may be desired to modify the concept of a compact cluster to allow for clusters that are elongated or continuously connected. Such elongated clusters could occur, for example, if the cluster follows a regression effect between variables. Similarly, requiring clusters to be disjoint may be an inappropriate model of a human population. A definition which allows for overlapping clusters is needed in such cases. Thus, when attempting a cluster analysis of an empirical dataset, researchers must address the issue of the definition of the concept of a cluster. This is essential because different clustering algorithms attempt to find different kinds of clusters.

3. Further difficulties are presented by the heuristic nature of the clustering methods themselves. Unlike regression, ANOVA, or even factor analysis, there is no clustering technique based on some widely accepted statistical principle. The problem is compounded by the computational complexity involved. For the fairly small problem of dividing 25 elements into 5 nonoverlapping clusters, there are over 2.4 × 10^15 different possible solutions (Anderberg, 1973). As such, each method must attempt to find the optimal clustering, using its own definition of cluster structure and optimality, without testing all possible partitions. There is no guarantee that a given algorithm will find the optimal partition in the data. To compound the problem, there are literally hundreds of clustering algorithms in existence. Because no single method is known to be optimal and so many methods are available for

use, a literature has developed on the problem



of validating the accuracy of clustering algorithms.

4. Inference procedures have been developed for cluster analysis. For example, procedures exist to test whether significant cluster structure has been found in the data, or whether there is merely an arbitrary partition of random data containing no structure. This is an important problem because virtually all clustering algorithms give solution partitions regardless of the presence or the absence of structure in the data. A different but related problem deals with the task of determining the number of clusters. This issue has been approached from many perspectives.
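The combinatorial claim in point 3 can be checked directly. A short Python sketch (ours, not from the paper) computes S(25, 5), the Stirling number of the second kind, which counts the ways to split 25 distinct elements into 5 nonempty, nonoverlapping clusters:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n elements into k nonempty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # Element n either forms its own cluster (S(n-1, k-1)) or joins one of
    # the k clusters of some partition of the remaining n-1 elements.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

print(stirling2(25, 5))  # 2436684974110751, i.e., about 2.4 x 10^15
```

The printed count agrees with the figure Anderberg (1973) is cited for above, and makes plain why exhaustive search over partitions is infeasible even for small problems.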

In large part, the remainder of this review addresses the issues of algorithm validation and the development of inference procedures. Before turning to the literature, a discussion of the clustering process and a general survey of the types of clustering algorithms are presented.

Steps in the Clustering Process

A seven-step structure is used to organize the clustering process. This structure is consistent with the discussions found in Anderberg (1973), Cormack (1971), Everitt (1980), and Lorr (1983). An applied article using clustering should contain information on the actions taken by the researcher for each of these steps. For the purposes of this paper, a clustering method will refer to the specific means by which entities are grouped together. Ward's minimum variance or K-means procedures are examples of clustering methods. The clustering process refers to the steps outlined in this section, which represent the sequence necessary for a complete analysis. Implications of the decisions involved in each of these steps are discussed in later sections.

1. The entities to be clustered must be selected.

The sample of elements should be chosen to be representative of the cluster structure in the population.

2. The variables to be used in the cluster analysis are selected. Again, the variables must contain sufficient information to permit the clustering of the objects.

3. The researcher must decide whether or not to standardize the data. If standardization is to be performed, then the researcher must select a procedure from several different approaches.

4. A similarity or dissimilarity measure must be selected. These measures reflect the degree of closeness or separation between objects. A dissimilarity measure, such as distance, assumes larger values as two objects become less similar. A similarity measure, such as correlation, assumes larger values as two objects become more similar.

5. A clustering method must be selected. The researcher's concept of what constitutes a cluster is important because different methods have been designed to find different types of cluster structures.

6. The number of clusters must be determined. This problem has received increased attention in the clustering literature during the last several years.

7. The last step in the clustering process is to interpret, test, and replicate the resulting cluster analysis. Interpretation of the clusters within the applied context requires the knowledge and expertise of the researcher's particular discipline. Testing involves the problem of determining whether there is a significant clustering or merely an arbitrary partition of random noise data. Finally, replication determines whether the resulting cluster structure can be replicated in other samples.
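The contrast drawn in step 4 can be illustrated concretely. A brief Python sketch (the object profiles are invented for illustration) computes one dissimilarity measure, Euclidean distance, and one similarity measure, the Pearson correlation, for a pair of objects measured on the same variables:

```python
import math

def euclidean(a, b):
    """Dissimilarity: grows larger as the two objects become less similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation(a, b):
    """Similarity: closer to +1.0 as the two profiles become more similar."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

obj1, obj2 = [1.0, 2.0, 3.0], [1.1, 2.1, 2.9]
print(euclidean(obj1, obj2))    # small distance (about 0.173): objects are close
print(correlation(obj1, obj2))  # near 1.0 (about 0.998): profiles move together
```

Note that the two families of measures can disagree: correlation responds to the shape of a profile, while distance also responds to its elevation, which is one reason the choice in step 4 matters.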

Although variations on this seven-step process may be necessary to fit a particular application, this sequence represents the critical steps in a cluster

analysis. The next section describes the various clustering methods which can be selected for use in step 5.

Types of Clustering Methods

With several hundred clustering methods in existence, some means of classification is needed to describe the techniques. Four major categories can



be identified: hierarchical methods, partitioning (nonhierarchical) algorithms, overlapping clustering procedures, and ordination techniques.

Hierarchical Methods

Perhaps the most popular clustering algorithms have been the sequential agglomerative hierarchical methods. Such hierarchical methods begin with each entity considered as a separate cluster. At each successive level in the clustering, two of the clusters are merged. The process continues until only one cluster, containing the entire dataset, remains. The routine will generate a strictly nested "hierarchy" of n partitions, where n is the number of entities in the dataset. The partitions represent nonoverlapping clusters and have the property that once two elements become members of the same cluster, they are never again separated. The researcher has the option of using the entire hierarchy as the solution, or selecting a level representing the specific number of clusters of interest.

Different hierarchical methods are distinguished by the criterion for determining which two clusters to merge at each level. Lance and Williams (1967) demonstrated that many agglomerative hierarchical methods are variations of a common recurrence formula. Details concerning the recurrence formula can be found in Cormack (1971), Everitt (1980), Lorr (1983), and Milligan (1979). A more extensive and recent reference is Gordon (1987). The use of agglomerative clustering accelerated

in the 1960s as computers became available. Of particular interest to the psychological community is a series of articles published by McQuitty in Educational and Psychological Measurement (see McQuitty, 1987). Other important articles from this period would include the introduction of Ward's (1963) minimum variance method and Johnson's (1967) discussion of the complete- and single-link methods. Finally, D'Andrade (1978) introduced a routine based on the U statistic.

At least one other type of hierarchical clustering strategy exists. Divisive clustering methods follow a pattern that is the reverse of the agglomerative techniques. Divisive methods begin with all entities in one cluster and partition the data into

two or more clusters. This process can be allowed

to continue until n clusters are formed, each containing an individual entity from the dataset. The Edwards and Cavalli-Sforza (1965) method attempts to find the division which minimizes the within-cluster error sum of squares. Unfortunately, divisive methods face problems of computational complexity which are not easily overcome (see Anderberg, 1973).
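The Lance and Williams (1967) recurrence mentioned above can be stated compactly: when clusters i and j merge, the distance from the new cluster to any other cluster k is d(k, ij) = a_i·d(k,i) + a_j·d(k,j) + b·d(i,j) + g·|d(k,i) − d(k,j)|, with the coefficients determining the method. A short Python sketch (our illustration; the coefficient sets for single and complete linkage are standard, the function name is ours):

```python
def lance_williams(d_ki, d_kj, d_ij, a_i, a_j, b, g):
    """Distance from cluster k to the merged cluster (i union j)."""
    return a_i * d_ki + a_j * d_kj + b * d_ij + g * abs(d_ki - d_kj)

# Single link (nearest neighbour): reduces to min(d_ki, d_kj).
single = lance_williams(3.0, 5.0, 2.0, a_i=0.5, a_j=0.5, b=0.0, g=-0.5)
# Complete link (furthest neighbour): reduces to max(d_ki, d_kj).
complete = lance_williams(3.0, 5.0, 2.0, a_i=0.5, a_j=0.5, b=0.0, g=0.5)
print(single, complete)  # 3.0 5.0
```

Other coefficient choices yield group average, median, centroid, and Ward's methods, which is why the recurrence unifies so much of the agglomerative literature.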

Partitioning Methods

Partitioning methods produce distinct nonoverlapping clusters. Often, the methods are known as nonhierarchical clustering procedures because only a single partition of the data is produced (Anderberg, 1973; Sneath & Sokal, 1973; Everitt, 1980). The techniques range in complexity from Hartigan's (1975) very simple algorithms to rather intricate iterative reallocation methods, such as ISODATA (Ball & Hall, 1965), Friedman and Rubin's (1967) method, and Wolfe's (1970) NORMIX procedure.

Partitioning methods can be distinguished by five characteristics (Blashfield, 1977a). The first characteristic involves the selection of the initial starting partition or "seed points." Some partitioning methods use randomly selected data points as starting partitions, while others allow the user to specify starting points (Anderberg, 1973). At least one routine uses Ward's hierarchical method to provide the starting partition for the algorithm. The second and third characteristics deal with

the type of cluster assignment pass made through the data and the statistical criterion used to assign the points to the clusters. Some K-means algorithms make a single pass, assigning each point in turn to the nearest cluster centroid, while others make multiple passes and update the centroids after each point assignment. Similarly, the statistical criteria range from a simple distance between a point and a cluster centroid, to an attempt to optimize complex matrix criteria borrowed from multivariate normal distribution theory (see Marriott, 1971; Scott & Symons, 1971). The final two features involve whether a fixed

or variable number of clusters will be formed, and the eventual treatment of outliers in the solution.



Most methods require the user to specify the number of clusters, and outliers are forced to join one of the clusters in the solution. Only a few methods, such as ISODATA, allow for a variable number of clusters in the solution, or for a residual pool of points.
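A minimal Python sketch (ours, not a published routine) of the single-pass assignment strategy described above, using user-specified seed points as the starting partition:

```python
import math

def assign_single_pass(points, centroids):
    """One pass through the data: label each point with its nearest centroid."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]  # Euclidean distances
        labels.append(dists.index(min(dists)))
    return labels

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8)]
seeds = [(0.0, 0.0), (5.0, 5.0)]  # user-specified seed points
print(assign_single_pass(points, seeds))  # [0, 0, 1, 1]
```

A multiple-pass variant would recompute each centroid as points join it and then sweep the data again until no point changes cluster; that choice is exactly the second characteristic distinguished above.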

Overlapping Methods

Compared with the first two categories of clustering methods, there is a much smaller number of algorithms which allow for overlapping clusters. At times, these are called clumping or clique formation methods. An early algorithm which produced overlapping structures was published by Needham (1967). Many authors working in this area have used graph theoretic concepts to help develop overlapping methods. Examples include Ling (1973), Ozawa (1985), and the more axiomatic approach of Jardine and Sibson (1971). Methods first introduced in the psychological literature include the ADCLUS procedure developed by Shepard and Arabie (1979), as well as the Hubert (1974) and Peay (1975) methods. Finally, Corter and Tversky (1986) have introduced a technique for identifying overlapping clusters by a graphical representation of extended trees.

Ordination Methods

Although most social scientists are unfamiliar with the term ordination, psychologists have been primarily responsible for developments in this area. The term is more commonly used in the biological sciences and statistics (see Cormack, 1971). Ordination techniques attempt to provide some type of dimensional representation, usually based on fewer variables than in the original dataset. Thus, techniques such as factor analysis (R- or Q-mode) and metric and nonmetric multidimensional scaling fall into this category. Extensive discussions of factor-analytic procedures as applied to classification can be found in Cattell (1952) and Tryon and Bailey (1970). One feature of ordination methods is that only a spatial representation of the entities in the dataset is produced. The actual determination of cluster membership is left to the researcher's subjective judgment.

Validation Techniques

Because all clustering methods are heuristic in nature, the critical issue of recovery performance must be addressed. That is, it is necessary to verify that a given method can recover the true cluster structure in a dataset. Unless a method can be shown

to reliably recover known configurations in error-free data, it will not be very useful in applied analyses. Furthermore, it is desirable to determine the sensitivity of such methods to different forms of error in the data or to errors in judgment made during the clustering process. A literature addressing these concerns exists. Generally, three strategies have been used (Dubes & Jain, 1979, 1980): (1) mathematical or theoretical derivations, (2) analysis of empirical datasets, or (3) Monte Carlo simulation. A discussion of the advantages and limitations of each approach follows.

Mathematical Derivation

An analytical or theoretical derivation would be the method of choice. Unfortunately, the overwhelming complexity of the process has limited advances in this area. A few articles have appeared which use graph theory (see Matula, 1977). Others have taken a statistical approach (e.g., Bock, 1985; Hartigan, 1985). Still others have addressed the

problem from a geometric perspective (Gower, 1967; Milligan, 1979). However, progress has been slow, and derivational results have often had limited value

for applied analyses; thus, work in this area has had little impact on the "end user" applications of cluster analysis. A different problem with the derivational approach occurs when theoretical recommendations conflict or run counter to applied experience with the methods. For example, Jardine and Sibson (1971) developed an elegant axiomatic system for defining an acceptable clustering method. As it turned out, only the single-link hierarchical method could satisfy the requirements of the system. This result generated a heated controversy because experience



with the single-link method indicated that it was one of the poorest performing algorithms (Williams, Lance, Dale, & Clifford, 1971).

Similarly, Fisher and Van Ness (1971) proposed a set of nine admissibility criteria to evaluate clustering methods. Again, the single-link method appeared to be quite satisfactory, whereas nonhierarchical K-means methods rated rather poorly. However, Wong (1982) has made successful use of the K-means approach in a hybrid clustering scheme. Milligan (1980) found the K-means methods to give exceptionally good recovery when well-chosen starting centroids were used. Hence, theoretical findings are available which conflict with results obtained from other approaches. Discrepancies of this sort led Dubes and Jain (1980) to conclude that the derivational approach was less useful as a validation strategy in the clustering context.

Analysis of Empirical Datasets

By far the most common validation strategy has been the application of a clustering method to an empirical dataset. Typical examples of such analyses are found in Goldstein and Linden (1969) and Harrigan (1985). Often, only one clustering method is tested on the data; hence, no comparative information is offered to the reader. In fact, the most common way to introduce a new clustering method into the literature is to test it on an applied dataset (see, e.g., Johnson, 1967). However, this approach has a serious and potentially fatal weakness. If the clustering algorithm finds the subjective clustering that the researcher suspects exists in the data, then some evidence indicates that the algorithm may be able to find the correct structure; however, the evidence is based on a sample of size one. More seriously, if the

clustering results deviate from prior expectations, then the discrepancy is difficult to explain. The researcher's subjective clustering may have been incorrect, and the structure found by the algorithm may be the correct one. On the other hand, the algorithm may have missed the correct cluster

structure, or there may be no structure in the data

at all. Given the methodological weakness of this approach, results from such studies should be treated with some skepticism.

Simulation Analysis

A simulation or Monte Carlo study typically requires three major phases. First, artificial datasets with known cluster structure are selected or generated. Second, these constructed datasets are analyzed by the various clustering methods or procedures of interest in the study. Finally, the

level of agreement between the known cluster

structure and the structure found by the clustering procedures is determined through the use of one or more recovery indices. The results reported from such simulation studies are usually based on summary statistics or inference procedures computed from the recovery indices.

In the first phase of the analysis, artificial data are generated which contain a known cluster structure. Several methods exist for the construction

process, including use of a pencil and graph paper. However, the most commonly employed method is to write a computer program which generates the datasets. Before such a program is written, the researcher must decide on some definition or conceptualization of the clusters. For example, the

clusters might be viewed as samples from a mixture of multivariate normal populations in a specified variable space. During the preparation of the data generation program, routines are designed for specifying various characteristics of the clusters. These include decisions concerning the number of clusters, the number of dimensions, the number of elements per cluster, the centroids for the clusters, and the variance-covariance matrices for the populations from which the clusters are sampled.

These characteristics have a direct impact on the nature of the resulting clusters. For example, if

cluster centroids are required to be widely spaced in the multivariate space, then generally nonoverlapping clusters will be obtained. Otherwise, an overlapping structure can be generated. Once the features of the clusters have been determined, the simulation program uses a random number generator to sample a set of points that will comprise each cluster. Finally, the researcher may introduce various forms of error or noise into the data. For

example, outliers and random noise dimensions could be added to study their effect on various clustering procedures. Examples of cluster generation routines that have seen repeated use include Blashfield (1976) and Milligan (1985). The second phase is to analyze the constructed

datasets using the clustering methods or procedures of interest. Often, the computer code for the clustering procedures is modified to eliminate extraneous output and to store the information about the resulting cluster solution. Despite the program modification, this phase is not overly complicated. The analysis of the constructed data is comparable to the analysis of data provided by the same procedures in applied research.
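The data-generation phase described above can be sketched in Python. The centroids, standard deviation, and ±2.0 SD truncation below are illustrative choices in the spirit of the truncated-normal approach attributed to Milligan (1980, 1985), not the original design:

```python
import random

def truncated_normal(mean, sd, limit=2.0):
    """Rejection sampling: redraw until the value lies within limit SDs of the mean."""
    while True:
        x = random.gauss(mean, sd)
        if abs(x - mean) <= limit * sd:
            return x

def generate_cluster(centroid, sd, n_points):
    """Sample n_points around a centroid, one truncated draw per dimension."""
    return [[truncated_normal(m, sd) for m in centroid] for _ in range(n_points)]

random.seed(1)
# Widely spaced centroids guarantee nonoverlapping clusters here, since every
# coordinate is forced to stay within +/-2 SDs of its population mean.
data = generate_cluster((0.0, 0.0), 1.0, 50) + generate_cluster((10.0, 10.0), 1.0, 50)
truth = [0] * 50 + [1] * 50  # the known partition, kept for later recovery scoring
print(len(data), len(truth))  # 100 100
```

In a full study, `data` would then be fed to the clustering procedures under test and `truth` retained for the recovery comparison in the final phase; outliers or noise dimensions could be appended at this point to study their effects.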

In the final phase, two partition sets are obtained for each constructed dataset. The first partition is that which was used to define the clusters in the data generation program. This partition is the true cluster structure of the data. The second partition set is the one obtained from the clustering procedure, and corresponds to the partition which would have been used if this had been an analysis of an applied (empirical) dataset. The most convenient way to present the results of the simulation analysis has been to compute some measure of agreement between the true partition structure and the obtained clusters. These measures have been called recovery or consensus indices. One measure which has seen active use is the

Rand index (see Hubert & Arabie, 1985). Both the numerator and denominator of the index reflect frequency counts. The numerator involves taking each pair of elements and determining whether the classification of the pair is consistent between the known and obtained clusterings. That is, the points must be treated in the same manner in both partition sets. If the pair of points is in the same cluster in both the known and obtained clusterings, then the frequency count for the numerator is increased by 1.0. Similarly, if the points are in different clusters in both partitions, the numerator count is increased by 1.0. The denominator is the total number of

possible pairwise comparisons and equals n × (n − 1)/2, where n is the number of elements. If the obtained clustering exactly matches that of the known partitioning, then the numerator and denominator are equal and the index value is 1.0. If any discrepancies are found, then the numerator is less than the denominator and the index value is less

than 1.0. The smaller the index value, the greaterthe inconsistency between the known and obtainedclusterings.The Rand index is but one of several recovery

measures that have been proposed. The process of selecting a recovery measure can be complex (for example, the original version of the Rand index is no longer recommended for use in simulation studies), and the literature has begun to address the topic (e.g., Day, 1986; Hubert & Arabie, 1985; Milligan & Cooper, 1986).

A distinct advantage of the simulation approach

is that there is no doubt as to the true cluster structure. It is possible to examine recovery across hundreds of datasets, thus avoiding conclusions based on a single dataset. Furthermore, the process circumvents the mathematical complexities found in attempting a derivational study. The main disadvantage is the limited generalizability to data distributions and structures which were not considered in the simulation experiments. This last problem can be troublesome for an applied researcher attempting to select a method for an empirical analysis. However, the simulation literature has contributed some of the clearest evidence about method performance and will have a dominant impact on the review of validation results in the next section.

Validation Results

The following review of the validation results is organized around the four categories of clustering methods. The majority of the findings relate directly to clustering method performance. In general, results will be based on the simulation literature. As discussed previously, the selection of the clustering method for an applied analysis represents the fifth step in the clustering process. Decision

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/


errors made in the other steps also can cause reduced recovery performance. Thus, an additional section presents the results obtained when specification errors are made in the other steps of the clustering process. These include the impact of outliers when selecting data elements (step 1), selection of variables (step 2), standardization (step 3), and dissimilarity measures (step 4).

Hierarchical Methods

Agglomerative hierarchical clustering procedures have been the most frequently examined. At least 11 systematic studies have been conducted; results from these experiments are summarized in Table 1. The most commonly tested algorithms have been the single link, complete link, group average, and Ward's method. In addition, recovery information concerning Lance and Williams' (1967) beta-flexible method has been included in the table. Other hierarchical routines have been examined, such as the centroid and median methods; however, the performance of these methods has not been particularly noteworthy and they have not been listed. Additional smaller-scale or more specialized studies (Blashfield & Morey, 1980; Cunningham & Ogilvie, 1972; D'Andrade, 1978; Gross, 1972; Morey, Blashfield, & Skinner, 1983; Rand, 1971) were also omitted.

When examining Table 1 and, subsequently, Table 2, it is important to note that the criterion index used in the experiments may differ from study to study. Thus, direct numerical comparisons of criterion values can be made only within a study and not across reports. However, all criteria have the property that perfect recovery of the true cluster structure would generate values of 1.0. As such, the closer the average or median criterion value is to 1.0, the better the recovery performance of the algorithm. The values presented in the tables have been selected as typical representations of the respective experiments, or reflect summary statistics for an entire study.

One of the earliest systematic studies was conducted by Baker (1974), who examined three hierarchical structures consisting of chained, binary, and random-link trees. A total of 100 datasets were generated for each type of tree and error condition. Findings in Table 1 are for the binary tree and are typical of the overall results. Only the single- and complete-link methods were considered. Baker found that the performance of the single-link method was severely impaired by the presence of error in the interpoint distances. The complete-link method exhibited less sensitivity to this factor. Even in error-free data, the complete-link method usually gave better recovery performance than the single-link procedure.

Kuiper and Fisher (1975) conducted a more extensive study which examined the impact of a variety of design factors on cluster recovery. In general, bivariate normal mixtures were used to generate the artificial datasets. A total of 30 datasets were created for each experimental condition. Selected results from the study appear in Table 1. The first two lines for the Kuiper and Fisher study in the table correspond to the case when the clusters were of equal size. In this case, Ward's method produced the best recovery of the underlying structure. However, when unequal size clusters were present in the data, the complete-link and group-average methods gave superior recovery. Except for this last condition, Ward's method appeared to be the best clustering procedure for recovering clusters from bivariate normal mixtures. Finally, the single-link method produced recovery which was significantly inferior to any other method. The next study listed in Table 1 was performed

by Blashfield (1976). Unlike many researchers, Blashfield used nonzero covariances and more realistic principal component structures to construct 50 datasets. Multivariate normal mixtures with complex covariance patterns were used to generate 2 to 6 clusters embedded in a space of 3 to 22 dimensions. Cluster sizes varied from 5 to 40. These characteristics were selected randomly for each dataset. In retrospect, it is unfortunate that these features were not systematically controlled in an experimental context. Recovery was averaged over datasets with differing numbers of clusters and with clusters of unequal sizes. Nevertheless, Blashfield found that Ward's method gave significantly better


Table 1

Validation Results for Hierarchical Clustering Methods

recovery performance than any other procedure. The complete-link method was the next best, with the group-average method a distant third. Finally, as found with the previous studies, the single-link method gave the poorest recovery performance. (It should be noted that the data possessed overlapping clusters; Milligan, 1981b, found that cluster overlap favored Ward's technique over other methods.)

Edelbrock (1979) conducted a simulation study using 10 of the 50 datasets generated by Blashfield (1976). Rather than focusing on recovery of the exact number of clusters, Edelbrock argued in favor of reduced coverage. Solutions in the hierarchy that contained more clusters than actually existed in the data were examined. The results presented in Table 1 were obtained at the 90% coverage level. Recovery did improve when lower coverage was allowed. Edelbrock found a significant advantage for the use of the correlation similarity measure. The best recovery was obtained from the group-average method using correlation. Ward's method was not tested with the correlation

measure, but its performance using Euclidean distance was quite good.

Mojena (1977) generated 12 datasets consisting of clusters based on mixtures from multivariate gamma populations. The univariate gamma probability function represents a fairly rich family of distributions that includes, as a special case, the chi-square family. As such, gamma populations are not generally symmetric; as the mean increases, the variance increases. For the multivariate case, Mojena set all of the covariances to 0. Clusters were of equal size, and consisted of 30 points each. Mojena systematically varied the degree of cluster overlap in the experiment. The results in Table 1 represent mean recovery values for each method. The overall ranking and performance of the methods is quite consistent with that of Blashfield (1976). Ward's method gave significantly better recovery than any other procedure. The single-link method performed significantly worse. Mojena did find that as the level of overlap increased, the impact on Ward's method was less severe than the impact on the group-average algorithm.

Mezzich's (1978) study has received fairly extensive coverage. However, only two artificial datasets were examined. The results in Table 1 show that the complete-link procedure was much more effective at recovering the cluster structure than the single-link method. Unlike Edelbrock (1979), the study found little impact from the use of differing dissimilarity measures, and the effect was not consistent from method to method.

Milligan and Isaac (1980) generated clusters that corresponded to ultrametric tree structures. Ultrametric distances are more restrictive than Euclidean distances. In an ultrametric space, the distances between any three points must form an equilateral or isosceles triangle where the base is shorter than the two equal sides. No such restriction applies to Euclidean space (see Milligan, 1979). Various nonoverlapping configurations and error levels were included in the experiment. Typical results from the study are presented in Table 1. The authors found that the group-average method gave the best overall recovery. The complete-link procedure was second best, with Ward's method a distant third in recovery performance. The performance of the single-link method was found to be seriously impaired by the presence of error in the data. This characteristic caused the method to give the poorest overall recovery. Finally, as would be expected, recovery declined as the separation between clusters decreased for all methods.
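The triangle condition described above is equivalent to the strong triangle inequality, d(i, j) <= max[d(i, k), d(k, j)] for all triples of points. A dissimilarity matrix can be checked for this property with a short sketch (illustrative code, not from the original study):

```python
def is_ultrametric(d):
    """Test whether a symmetric distance matrix satisfies the strong
    triangle inequality d[i][j] <= max(d[i][k], d[k][j]) for all triples,
    which forces every triangle to be isosceles with a base no longer
    than the two equal sides."""
    n = len(d)
    return all(d[i][j] <= max(d[i][k], d[k][j])
               for i in range(n) for j in range(n) for k in range(n))
```

For example, distances (1, 2, 2) among three points satisfy the condition, whereas (1, 2, 3) do not.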

Bayne, Beauchamp, Begovich, and Kane (1980) used a variety of parameterizations of two bivariate normal populations. Each comparison was based on 200 constructed datasets. The results presented in Table 1 represent typical realizations from their experiment after converting their criterion to a percentage-correct value. Overall, the results showed that Ward's method gave the best performance, with the group-average and complete-link algorithms placing a close second. The single-link method performed much worse than all other methods. The authors found that the performance of the group-average method declined rapidly as the separation between populations decreased. Ward's method was less affected by this factor.

Edelbrock and McLaughlin (1980) reported a

study based on the same principles and logic as the Edelbrock (1979) experiment. For validation purposes, a total of 20 datasets from Blashfield (1976) and 12 datasets from Mojena (1977) were used. As can be seen in Table 1, the results were basically the same as those found by Edelbrock (1979). Edelbrock and McLaughlin also examined additional similarity measures which involved one- and two-way intraclass correlation coefficients. The one-way intraclass correlation was found to provide enhanced recovery performance for the group-average method.

In an extensive validation experiment, Milligan (1980) examined 11 hierarchical clustering procedures, including the five listed in Table 1. A total of 108 error-free datasets were created which consisted of truncated multivariate normal mixtures.


The generation process ensured that the clusters were nonoverlapping in the variable space. Two additional error conditions were created which involved perturbing the interpoint distances at low and high levels. The results in Table 1 indicate that the group-average and beta-flexible methods gave the best recovery, with Ward's method a close second. None of the other hierarchical algorithms examined in the experiment produced superior recovery rates. The error perturbation process showed a moderate impact on the complete-link procedure, and a very marked impact on the single-link method. The last and most recent study was conducted

by Scheibler and Schneider (1985). The authors

generated 200 constrained normal mixtures. Again, all five methods listed in Table 1 were tested. Ward's

procedure and the beta-flexible methods performed well using either correlation or Euclidean distance measures. However, the group-average method gave excellent recovery when using the correlation index, but performed badly with Euclidean distance. This result is similar to that found in the Edelbrock

(1979) study, but is less similar to the results of Edelbrock and McLaughlin (1980). It is not consistent with the results of Milligan (1980, 1981b). Finally, the complete-link and single-link methods (especially the latter) gave poor recovery of the underlying cluster structure.

Several conclusions can be drawn from the series of experiments summarized in Table 1. Ward's method tended to perform well in the cases where it was tested. Often, it gave the best cluster recovery. The performance of the group-average method was more erratic. At times, it produced the best recovery of cluster structure. However, its performance was not good on other occasions. The reasons for the discrepant recovery behavior have not yet been identified. On the other hand, the beta-flexible method performed well in the few studies where it has been included. The enhanced performance pattern of the beta-flexible method was confirmed in a recent study which systematically varied the beta parameter across a wide range of values (Milligan, 1987a). The complete-link algorithm has occasionally performed better than the group-average method, but it is usually inferior to Ward's procedure. The single-link method, while theoretically attractive, has been repeatedly shown to give poor cluster recovery and to be seriously affected by the presence of even small levels of error in the data.

Partitioning Methods

A fairly substantial validation literature exists for nonhierarchical procedures. Five major systematic studies have been conducted; their results are compared in Table 2. The first study was conducted by Blashfield (1977a) and was based on 20 multivariate normal mixture datasets generated in an earlier study (Blashfield, 1976). The best recovery was obtained from three methods which gave similar median recovery values. These were the CLUSTAN K-means procedure with random starting seeds, and the two methods that used the |W| criterion. (For purposes of this paper, W, B, and T represent the within-cluster, between-cluster, and total sum of squares and cross-products matrices, respectively.) It is difficult to explain the reduced performance level (.643) of the CLUSTAN K-means procedure when the starting centroids were obtained from Ward's method. Given the results for Ward's method from Blashfield (1976), these centroids should have been fairly accurate.
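To make the W notation concrete, the following sketch computes the pooled within-cluster sum of squares and cross-products matrix and the Trace W criterion for a given partition. The function names are illustrative only; this is not code from the programs reviewed:

```python
def within_scatter(points, labels, k):
    """Pooled within-cluster sum-of-squares-and-cross-products matrix W."""
    dim = len(points[0])
    W = [[0.0] * dim for _ in range(dim)]
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        centroid = [sum(col) / len(members) for col in zip(*members)]
        for p in members:
            dev = [a - b for a, b in zip(p, centroid)]  # deviation from centroid
            for i in range(dim):
                for j in range(dim):
                    W[i][j] += dev[i] * dev[j]
    return W

def trace_w(points, labels, k):
    """Trace W criterion: total within-cluster sum of squared deviations."""
    W = within_scatter(points, labels, k)
    return sum(W[i][i] for i in range(len(W)))
```

Partitioning methods based on Trace W seek the assignment of points that minimizes this quantity, while |W|-based methods minimize the determinant of the same matrix.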

In Mezzich's (1978) study, all but one partitioning procedure gave recovery values which did not differ much from method to method, as seen in Table 2. These recovery values were equivalent to the results for the complete-link hierarchical method in Table 1 for Mezzich. The best partitioning technique was a K-means procedure using Euclidean distances. The only procedure which did not produce equivalent recovery was Wolfe's (1970) NORMIX method. It performed rather poorly despite the fact that centroids from Ward's method were used as the starting seeds for the algorithm.

Bayne et al. (1980) also considered four partitioning methods, as listed in Table 2, in addition to the hierarchical methods listed in Table 1. The authors found that the convergent K-means method and the two versions of the Friedman and Rubin algorithm gave the best recovery of all methods


Table 2
Validation Results for Nonhierarchical Clustering Methods

Note. Parenthetical entries indicate recovery performance for methods with starting centroids obtained from Ward's or group-average hierarchical clustering procedures. Otherwise, randomly selected data elements were used as seed points.

tested. The performance of the K-means and |W| criterion methods is consistent with the studies by Blashfield (1977a) and Mezzich (1978). However, the enhanced performance of the Trace W criterion is not in accord with the Blashfield study. Finally, Wolfe's (1970) NORMIX method gave rather poor recovery, which is similar to Mezzich's result.

Milligan (1980) conducted a study which included four partitioning K-means methods. The results indicated that the methods were sensitive to the nature of the starting seeds, but in a manner different from that found by Blashfield (1977a). When randomly selected data elements were used as seeds, the four K-means procedures gave reduced recovery values. The results for the random seed condition are given as the first column of values for Milligan (1980) in Table 2. When centroids obtained from the group-average hierarchical method were used as starting seeds, the procedures gave improved recovery. The parenthetical recovery entries in Table 2 for Milligan represent the recovery values when the starting centroids were based on


the group-average method. The four K-means methods seemed to be equivalent except for MacQueen's procedure, which tended to give lower recovery values than the other three methods.

The recent study by Scheibler and Schneider (1985) included two different versions of the K-means algorithms. The results indicated that both clustering methods gave better recovery when centroids from Ward's procedure were used to specify initial seed points. This result is consistent with Milligan (1980) but not with Blashfield (1977a). Finally, the CLUSTAN K-means method gave better recovery than Späth's (1980) technique. (It should be noted that Späth's procedure uses medians rather than centroids to locate the clusters.)

The success of the K-means procedures with improved starting seeds found in some studies has led several authors to propose hybrid algorithms for applied clustering (Milligan & Sokol, 1980; Punj & Stewart, 1983). These procedures use the centroids obtained from a hierarchical method to start a K-means algorithm. A different hybrid model was suggested by Wong and Lane (1983). The first step involves a kth nearest-neighbor density estimation process followed by a hierarchical clustering of the nearest-neighbor distance matrix. Example analyses in Wong and Lane suggest good recovery performance.
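The seeded K-means idea behind the hybrid proposals can be sketched as follows. This is a minimal illustration, not any of the published algorithms: the hierarchical step is omitted, and the seed centroids are simply assumed to come from a preliminary solution:

```python
def kmeans(points, seed_centroids, iters=20):
    """Basic K-means started from supplied centroids; in a hybrid strategy,
    seed_centroids would be the cluster centroids of a preliminary
    (e.g., hierarchical) solution rather than random data elements."""
    centroids = [list(c) for c in seed_centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # by squared Euclidean distance.
        labels = [min(range(len(centroids)),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
                  for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels
```

Because K-means only refines the partition locally, good seeds keep it from converging to a poor local optimum, which is the rationale for the hybrid designs.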

In summary, the convergent K-means method tended to give the best recovery of cluster structure. This result was obtained despite the theoretical findings of Fisher and Van Ness (1971), which indicated that the K-means procedures were fairly unattractive methods. Furthermore, method sophistication or complexity may have little impact on the quality of the obtained solution. The K-means procedures are computationally direct, whereas the methods using the |W| criterion are rather complex. Yet the methods tended to give equivalent recovery. The results for the Trace W criterion were inconsistent. Bayne et al. found it to be among the best methods tested, whereas Blashfield found the criterion to give the lowest recovery values. The reduced performance level is consistent with less systematic studies of clustering criteria (Friedman & Rubin, 1967; Scott & Symons, 1971). Finally, Wolfe's NORMIX method performed poorly in all cases where it was tested.

Overlapping Methods

It appears that no systematic validation study of overlapping clustering methods has been conducted. Jardine and Sibson (1971) presented a single clustering of 23 Indian caste groups to demonstrate their Bk overlapping clustering method. Jardine and Sibson apparently felt that their axiomatic derivation provided sufficient justification for the algorithm. Hubert (1974) used a dataset based on 13 diagnostic psychiatric categories to study his two-diameter clustering procedure. Peay (1975) used two datasets to demonstrate his method, one relating to the study of marriages in eight ethnic groups, and the other an artificially constructed set of six elements. Shepard and Arabie (1979) studied ADCLUS with five datasets, three of which dealt with similarities or confusions in letters or numbers based on set sizes of 10, 16, and 26. The remaining two datasets studied a communication network of 14 industrial workers and results based on sorting names of 20 anatomical terms. Finally, Ozawa (1985) used a constructed dataset of eight points. Included for comparison against his own algorithm was Jardine and Sibson's Bk method. Both methods found the same correct structure.

Several conclusions can be drawn from the literature. First, the information that is available on the performance of overlapping methods is based on the analysis of empirical datasets. As such, recovery performance is difficult to judge. Second, the example datasets possessed small sample sizes; the average sample size in the studies was only 14.4. This may not be representative of the more typical applications of clustering. Finally, without comparative information, it is uncertain which method may have the more desirable recovery characteristics.

Ordination Methods

Social scientists have been using or exploring the properties of ordination methods for several decades (Blashfield, 1980). Among the early advocates of the use of inverted or Q-factor analysis for clustering applications were Cattell (1952) and Tryon (see Tryon & Bailey, 1970). Certainly, if the clusters exist in the reduced factor space, and if these clusters correspond to the type of simple structure that most modern orthogonal factor rotation programs seek, then a Q-analysis should allow the user to identify the correct cluster structure. However, if any of the prerequisite conditions cannot be satisfied, then recovery may be marginal at best. For example, if the clusters are found in or defined by an oblique factor space, then an orthogonal varimax rotation may make it difficult for the user to determine correct cluster membership. A discussion of other methodological difficulties follows.

First, assume that clusters exist in the space defined by the original dataset. Sneath (1980) has shown that there is a high probability that a researcher will conclude that a subset of points comprises one cluster, when in fact the points comprise two or more clusters. The reduction in dimensionality produced by the Q-analysis impairs the user's ability to detect clusters that existed in the space defined by the original variables.

Fleiss, Lawlor, Platman, and Fieve (1971) reached a parallel conclusion in their study of inverted factor analysis. These authors felt that some indication of distinct grouping should be present in the original data before a Q-analysis is attempted. If evidence for clustering existed, such as multimodal or nonsymmetric variables, then Fleiss et al. further concluded that methods other than inverted factor

analysis might do a better job at finding the clusters. Blashfield and Morey (1980) confirmed this expectation in a study of simulated MMPI profiles. They found that both the group average and Ward's hierarchical methods gave better recovery and were less problematic in application than Q-factor analysis. Mezzich (1978) reached similar conclusions when using simulated psychiatric profiles.

A different issue in the use of ordination methods

is the subjective nature of the cluster identification task. Different researchers may interpret the output from the same ordination analysis as forming different groups. Mezzich (1978) studied the interrater reliability for determining cluster membership with nonmetric multidimensional scaling. Reliability was found to average about .77 using Cramer's statistic. Mezzich did not measure interrater reliability when clusters were derived from an inverted factor analysis. It would seem reasonable to assume that moderate to low reliability values also would be found for this technique.

Given these and other logical difficulties, many researchers have expressed reservations about the use of ordination methods as clustering procedures. Lorr (1983) noted that the appropriateness of Q-analysis has long been in dispute. Even one of the early advocates of the method, Cattell (1978), has stressed the fact that Q-analysis is not a procedure for finding types (clusters), but a technique for finding dimensions. Similar comments would hold for other ordination methods, such as multidimensional scaling.

Ordination methods have been proposed as a strategy for data preparation prior to the application of a clustering procedure. This factoring or scaling would then be part of the third step in the clustering process. Some authors would suggest a dimensional analysis if a large number of variables has been collected in a study. Rather than conducting a Q-analysis on the entities, a regular factor analysis would be performed on the variables. However, if the clusters were defined (or existed) in the original variable space, then an ordination method would serve to distort or hide the true structure as shown in Sneath's (1980) study. On the other hand, if the clusters existed in the reduced variable space, then an ordination method should be useful for

detecting the true clustering.

Kaufman (1985) reported the results of a simulation study where the clusters were first defined and generated in the reduced variable space. Five strategies for preprocessing the data were studied. Four strategies were based on principal components analysis, taking into account whether all components were used or just those with eigenvalues above 1.0, and whether the scores were weighted according to the corresponding eigenvalues. A fifth condition involved using the original standardized variables directly with the clustering method. All


processed datasets were clustered with Ward's hierarchical method. The results indicated that weighted principal components analysis gave the best recovery of the correct cluster structure.

This result is logical, given the way in which the data were constructed. However, it is interesting to note that simple standardization performed nearly as well as the more complex analysis. Thus, the assumption that principal components processing is a necessity was not confirmed, even when the data were constructed to be most favorable for such an analysis. Kaufman failed to include a condition where the unstandardized data were analyzed directly. Such a condition would have addressed the question of whether standardization itself was necessary.

Other Validation Results

Validation results exist for a variety of other steps in the clustering process. These include the effect of outliers and the corresponding issue of coverage, improper selection of a dissimilarity measure, inclusion of random noise dimensions in the variable set, standardization of variables, and the impact of overlapping data structures on nonoverlapping clustering methods.

Outliers. Elements which are not members of any cluster in the dataset can be considered outliers to all clusters. Of course, such elements could be intermediates between clusters and not outliers to the entire data mass. Milligan (1980) examined the effect of adding 20% or 40% additional data points as outliers. As would be expected, the presence of outliers served to confuse or mask the assignment of data elements which did belong to a cluster. Edelbrock (1979) argued that, from a psychological perspective, it is not necessary for all persons in a dataset to be classified in order to obtain useful

partitions. Edelbrock found that reduced coverage resulted in improved recovery performance with a set of hierarchical methods. Similar results were obtained in the study by Scheibler and Schneider (1985), which included nonhierarchical routines. However, both studies were based on the use of the kappa statistic, and there is a possibility that the results on reduced coverage were confounded with characteristics of this recovery measure (Milligan, 1987a).

Dissimilarity measures. In terms of the clustering process, it is important to note that the use of the wrong dissimilarity measure for a dataset might lead to reduced recovery of the cluster structure present in the data. Hence, some care should be taken when selecting a dissimilarity measure. For example, if the clusters are embedded in a Euclidean space, then a Euclidean distance dissimilarity measure would be appropriate. A three-dimensional Euclidean space corresponds to that normally encountered by people in their day-to-day activities. Blashfield (1977b) warned that several formulas for Euclidean distance exist in the clustering literature. For purposes of the present discussion, the following formulation will be used:

In Equation 1, dj is the 6 sta°~~~h~,-lb~e’ distancebetween objects and j. The summation is com-puted over all variables in the d~t~set9 k is the indexvalue for the summation operator. The values xi,,and xj, the variable values for objects i

and j, on variable k. The values w,are weights applied to the squared difference oneach variable and are usually set to 1.0. A prioriweights determined by the researcher can also beused. Other weighting schemes exist; a particularlyeffective procedure will be discussed below.

Euclidean distance is a special case of the more general Minkowski family of metric distances. Another member of the family is the so-called city-block or Manhattan distance. Rather than measuring straight-line distance, the city-block measure would determine the distance by finding the shortest path in a grid system, similar to that taken by an automobile through city streets.

A different measure of similarity which has been popular in the social sciences is the Pearson correlation computed between two objects. The Pearson correlation is the cosine of the angle between two vectors representing the standardized scores of

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/


the objects. The correlation measure has the advantage (or disadvantage, depending on the context) of eliminating the effects of differing means and variances between variables when computing the similarity measure (see Cronbach &amp; Gleser, 1953; Skinner, 1978). Neither Euclidean distance nor the city-block measure eliminates these differences. Many other measures have been proposed; see the discussions in Anderberg (1973), Cormack (1971), Everitt (1980), and Lorr (1983) for a more complete introduction to this topic.
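The contrast among these measures can be made concrete with a short, self-contained sketch (present-day Python, not part of the original article; the profile values are invented for illustration):

```python
import math

def euclidean(x, y, w=None):
    """Weighted Euclidean distance in the form of Equation 1; weights default to 1.0."""
    w = w or [1.0] * len(x)
    return math.sqrt(sum(wk * (xi - yi) ** 2 for wk, xi, yi in zip(w, x, y)))

def city_block(x, y):
    """Manhattan distance: the shortest grid path between the two profiles."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def pearson(x, y):
    """Pearson correlation between two object profiles, taken across variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return sxy / (sx * sy)

# Two profiles with identical shape but different elevation:
a = [1.0, 2.0, 3.0]
b = [11.0, 12.0, 13.0]
print(euclidean(a, b))   # about 17.32: the mean difference counts
print(city_block(a, b))  # 30.0
print(pearson(a, b))     # 1.0: correlation ignores the elevation difference
```

The two distance measures register the elevation difference between the profiles, while the correlation does not, which is precisely the property discussed by Cronbach and Gleser (1953).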

In the validation literature, a number of researchers have considered the consequences of the use of alternative or incorrect dissimilarity measures. The research indicates that differing dissimilarity measures can change the extent of cluster recovery. However, Punj and Stewart (1983) have argued that errors in the choice of a dissimilarity measure do not seem to be as serious as other decision errors in the clustering process.

Irrelevant variables. A few experiments have been conducted to determine the impact of adding variables to the analysis which are irrelevant to the cluster structure. These variables can be viewed as random noise dimensions. Milligan (1980) investigated the impact of adding one or two irrelevant dimensions to the set of variables which define the true clustering. The results indicated that the addition of even one irrelevant variable seriously reduced the extent of cluster recovery. Thus, it seems ill-advised to indiscriminately include variables in a dataset for an applied cluster analysis. The use of all available data will likely obscure any clustering present in a subset of the variables. Similar masking effects were obtained by Fowlkes in two unpublished studies (see De Soete, DeSarbo, &amp; Carroll, 1985). These results led De Soete et al. to develop an optimal weighting scheme for variables in a hierarchical cluster analysis. That is, values other than 1.0 are assigned to the weights w_k in Equation 1 when computing the Euclidean distance between points. The weights provide an optimal fit between the resulting distances and an ultrametric tree structure. The logic behind the optimal fit is that most hierarchical clustering methods force the ultrametric structure on the clustering solution. Results obtained by Milligan (1987b) indicated that the De Soete et al. procedure is quite effective in assigning near-zero weights to irrelevant variables. Recovery of true cluster structure was enhanced in all cases examined, and the impact of the masking effect was greatly reduced.

Standardization. A researcher must decide whether to standardize the variables in a dataset. Edelbrock (1979) found a slight advantage for standardized data when all elements were required to be clustered. When coverage was reduced, standardization produced no improvement in cluster recovery. However, Milligan (1980) found that standardization can lead to a limited reduction in recovery performance when the clusters exist in the unstandardized space. In a more recent study, Milligan and Cooper (in press) conducted a large-scale simulation study of eight standardization procedures. Results of the experiment indicated that standardization procedures based on division by the range of the variable were more effective than any other approach, including the traditional z-score procedure. The Milligan and Cooper paper includes a fairly complete review of the issues on the topic and references to the literature.
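The two families of procedures can be contrasted in a few lines (a present-day Python sketch with invented values, not code from the studies cited):

```python
def z_score(values):
    """Traditional z-score standardization: zero mean, unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

def range_standardize(values):
    """Division by the range, the family Milligan and Cooper found most effective."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

income = [20.0, 35.0, 50.0, 90.0]   # wide-range variable
age = [25.0, 30.0, 35.0, 40.0]      # narrow-range variable
print(range_standardize(income))    # both variables now span [0, 1]
print(range_standardize(age))
print(z_score(income))
```

After range standardization both variables contribute on the same [0, 1] scale, so neither dominates the distance computation by virtue of its units.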

Overlapping structures. Some researchers have considered the problem of cluster recovery by hierarchical and partitioning methods when the data consist of overlapping clusters. Effectively, there is a mismatch between the clustering algorithm selected and the structure of the data. Of course, in an applied analysis, a researcher may not be aware that an overlapping structure is in the data. Mojena (1977) conducted a small study where the overlap between clusters was systematically varied. As the extent of overlap increased, the proportion of correct cluster assignments decreased. Similarly, half of the datasets generated by Blashfield (1976) possessed overlapping characteristics (see Milligan &amp; Isaac, 1980). Unfortunately, the results were not broken down according to this characteristic.

This led Milligan (1981b) to conduct an experiment which directly compared performance on


overlapping and disjoint structures. Ward's hierarchical method was found to give the best performance when overlap was allowed to exist in the data. The group-average method tended to perform better with disjoint structures. These results are consistent with those of Bayne et al. (1980). Scheibler and Schneider (1985) reported that they could not confirm the finding. However, Scheibler and Schneider did not examine overlapping structures in their experiment.

It appears that no studies have considered the reverse situation. A study of the performance of overlapping methods on data consisting of disjoint clusters would be of interest. An "optimal" clustering algorithm would produce an overlapping or a disjoint clustering solution, as appropriate for the structure in the data.

Inference, Replication, and Interpretation Procedures

Several areas of research can be identified which have addressed issues relating to the last two steps in the clustering process. The following sections deal with the problem of determining the number of clusters (step 6), hypothesis testing in a clustering context, replication analysis, and aids to interpretation (the last three comprise step 7).

Determining the Number of Clusters

Most clustering procedures require the user to specify or to determine the number of clusters in the final solution. A substantial segment of the literature which addresses this decision task is referenced in Milligan and Cooper (1985). These authors conducted a simulation study of the performance of 30 decision rules. Although the experiment used only hierarchical clustering methods, all decision rules are applicable to nonhierarchical algorithms. Only well-separated, distinct clusters were present in the structure of the 108 test datasets.

Given such well-defined clustering, it would be expected that the performance of any rule for determining the number of clusters would be fairly good. Milligan and Cooper did find a set of reasonably effective rules. This set included the techniques developed by Baker and Hubert (1975), Calinski and Harabasz (1974), Duda and Hart (1973), Beale (1969), and the cubic clustering criterion used in SAS (Sarle, 1983). The set of techniques is fairly diverse and includes procedures adapted from classical parametric and nonparametric procedures. The Calinski and Harabasz, Duda and Hart, and Beale indices use various forms of sum of squares within or between clusters, or both.

The Milligan and Cooper study also revealed that a number of methods did not work well. Among the methods which performed poorly were Trace W, criteria based on |W|, and a generalized distance measure. Unfortunately, the Trace W (error sum of squares) criterion has been the most frequently used method. Further, most criteria borrowed from traditional multivariate normality approaches, such as Wolfe's (1970) likelihood ratio test, gave at best only mediocre correct detection rates. Everitt (1981) found that Wolfe's (1970) likelihood ratio test performed well only for large sample sizes.
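As an illustration of the sum-of-squares family of rules, the Calinski and Harabasz index is the ratio of between- to within-cluster sums of squares, each scaled by its degrees of freedom. The sketch below (present-day Python, not from the cited studies) evaluates a single candidate partition; in practice the index would be computed over a range of cluster numbers and the maximum taken:

```python
def calinski_harabasz(data, labels):
    """Ratio of between- to within-cluster sums of squares, each scaled by
    its degrees of freedom; larger values favor the candidate partition."""
    n = len(data)
    clusters = sorted(set(labels))
    k = len(clusters)
    grand = [sum(col) / n for col in zip(*data)]
    between = within = 0.0
    for c in clusters:
        members = [x for x, g in zip(data, labels) if g == c]
        cen = [sum(col) / len(members) for col in zip(*members)]
        between += len(members) * sum((ci - gi) ** 2 for ci, gi in zip(cen, grand))
        within += sum(sum((xi - ci) ** 2 for xi, ci in zip(x, cen)) for x in members)
    return (between / (k - 1)) / (within / (n - k))

# Two tight, well-separated clusters yield a large index value:
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(calinski_harabasz(data, [0, 0, 0, 1, 1, 1]))  # close to 450: strong structure
```

The same data assigned to an arbitrary mixed partition would produce a far smaller value, which is the basis for using the index as a stopping rule.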

In a more limited study, Begovich and Kane (1982) confirmed that a criterion based on |W| performed rather poorly. In contrast to Milligan and Cooper, the authors found that the Calinski and Harabasz procedure did not work well when the clusters possessed unequal covariance matrices. A different strategy, called simulation cluster analysis (SCA), gave the best results in their study. SCA is similar to parallel analysis in factor analysis, and involves creating a series of datasets by adding random error to the original data. Each error-perturbed dataset is subjected to a cluster analysis and the results from the simulations are combined to provide a likelihood estimate of the number of groups in the data.

Milligan and Cooper did not consider techniques for determining the number of clusters which required the user to make subjective judgments (see, e.g., Gower, 1975). Similarly, techniques which are dependent on the use of a specific clustering method were not studied. Hence, potentially useful


methods such as that of Wong and Schaak (1982) have not been independently validated.

Hypothesis Testing

Despite Romesburg's (1984) assertion to the contrary, hypothesis testing in a cluster-analytic situation can certainly be performed. Studies by Bock (1985), Hartigan (1977, 1978, 1985), Lee (1979), and Sneath (1977) indicate that many researchers have been addressing the distributional problems in cluster analysis. Most testing procedures have been developed to determine whether significant cluster structure exists in the partitions found by a clustering method. The null hypothesis typically specifies that the data consist of a random pattern of points with no distinct clustering present. For example, the hypothesis may specify that the data were sampled from a single multivariate normal population, or that the data correspond to what would be expected from a uniform distribution contained in a hypercube.

In general, the testing procedures can be divided into two approaches (Sneath, 1969). In the first approach, it is assumed that an external criterion variable is available to validate the clustering results. The second approach, using an internal criterion, employs information which is contained within the cluster analysis. Before considering the two approaches, comments are in order regarding the naive application of standard hypothesis testing procedures in a cluster-analytic framework.

After clustering results have been obtained, it is tempting to conduct a discriminant analysis, or some other standard testing procedure such as ANOVA or MANOVA. Examples of such applications can be found as early as Turner (1969) and as recently as Andes (1986). The logic is that if significant differences exist between the groups found in the cluster analysis, then significant clustering must be present in the data. The variables used in the cluster analysis are employed as the dependent variables in the tests, and the cluster partitions define the groups for the testing procedure.

It is widely recognized by clustering methodologists that this strategy is invalid, but this has not been stressed in the applied literature (see Dubes &amp; Jain, 1979; Milligan &amp; Mahajan, 1980; Morey et al., 1983). Such procedures will almost always return significant test results, even for random data that contain no cluster structure. The reason for the bias when using discriminant analysis is that the groups were not defined a priori. More seriously, the variables tested were the same ones used to define the groups in the first place. Similarly, in the case of MANOVA, there is systematic violation of the assumption of random assignment of observations to the different groups.

The cluster analysis process does not haphazardly assign the observations to groups, but assigns them to maximize the similarity between observations within the groups. Most clustering algorithms will find homogeneous groups of points in random noise data, even though such partitions do not represent any significant clustering property. Fortunately, the testing procedures which follow are valid and avoid the problems in the use of traditional methods.

External criterion analysis. An external criterion represents information which is external to the cluster analysis. That is, the information in the external criterion was not used at any other point in the clustering process. This information could be in the form of one or more variables which can help validate the grouping, or in terms of a partition of the elements into groups. The partition must be specified a priori or obtained from a clustering of a different dataset. When the criterion is in the form of one or more variables, it is permissible to perform a standard parametric analysis such as ANOVA or MANOVA to validate the clustering results. Unfortunately, many applied researchers find it difficult to omit such variables from the cluster analysis if they believe that the variables contain information concerning the true cluster structure. Hence, few studies have used the external criterion approach.

If the researcher has a criterion which represents an independent partitioning of the data, then a procedure developed by Hubert and Baker (1977) can be used to test the validity of the clustering results. It is important to note that the Hubert and Baker


procedure cannot be used to test the similarity of two clusterings of the same dataset.

Internal criterion analysis. An internal criterion measure uses information obtained from within the clustering process. Internal criterion measures typically reflect the goodness of fit between the input dissimilarity matrix and the resulting clustering. Hence, the quality of the clustering with respect to the input data is measured. Two specific examples of such measures are the Baker and Hubert (1975) gamma index and the point-biserial correlation measure (Milligan, 1980).

The gamma index is the ratio of two frequency counts. The numerator is the difference between the counts for consistent and inconsistent pairings of distances. The denominator is simply the sum of the number of pairings of both kinds. A consistent pairing occurs when a within-cluster distance is found to be smaller than a between-cluster distance. An inconsistent pairing is recorded when a within-cluster distance is larger than a distance between two points not in the same cluster. Thus, the gamma index has a value of 1.0 when the clustering solution is perfectly consistent with respect to between- and within-cluster distances.

The point-biserial measure is a Pearson correlation between a variable coded 0 or 1 and a variable corresponding to the input distances. For a given distance between two points, the binary variable is assigned a value of 0 if the resulting clustering solution placed the two points in the same cluster. Otherwise, a 1 is recorded, denoting a between-cluster distance. Larger values of the point-biserial measure would indicate greater agreement between the clustering of the data and the input distances.
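Both indices are simple to compute from the input dissimilarities and a candidate partition. The following self-contained sketch (present-day Python, not from the original article) returns both; for the perfectly separated toy matrix shown, each index reaches its maximum:

```python
from itertools import combinations

def gamma_and_point_biserial(dist, labels):
    """dist[i][j] holds the input dissimilarities; labels gives each object's
    cluster. Returns the Baker-Hubert gamma index and the point-biserial r."""
    pairs = list(combinations(range(len(labels)), 2))
    d = [dist[i][j] for i, j in pairs]
    within = [labels[i] == labels[j] for i, j in pairs]
    w_d = [di for di, w in zip(d, within) if w]       # within-cluster distances
    b_d = [di for di, w in zip(d, within) if not w]   # between-cluster distances
    # Gamma: compare every within-cluster distance with every between-cluster one.
    s_plus = sum(1 for wd in w_d for bd in b_d if wd < bd)   # consistent pairings
    s_minus = sum(1 for wd in w_d for bd in b_d if wd > bd)  # inconsistent pairings
    gamma = (s_plus - s_minus) / (s_plus + s_minus)
    # Point-biserial: Pearson r between the 0/1 coding (1 = between) and distance.
    y = [0.0 if w else 1.0 for w in within]
    n = len(d)
    md, my = sum(d) / n, sum(y) / n
    num = sum((di - md) * (yi - my) for di, yi in zip(d, y))
    den = (sum((di - md) ** 2 for di in d) *
           sum((yi - my) ** 2 for yi in y)) ** 0.5
    return gamma, num / den

dist = [[0, 1, 5, 5], [1, 0, 5, 5], [5, 5, 0, 1], [5, 5, 1, 0]]
print(gamma_and_point_biserial(dist, [0, 0, 1, 1]))  # gamma and r both ~1.0
```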

Reviews of internal criteria can be found in Cormack (1971), Jardine and Sibson (1971), and Rohlf (1974). A comparative study of 30 internal criteria was conducted by Milligan (1981a). Milligan found a set of highly effective internal criteria which included the gamma index and the point-biserial measure. It was found that Trace W and criteria adapted from the multivariate normality literature (such as those based on functions of W and |W|) performed poorly. These results do not give much support to the testing procedures proposed by Arnold (1979) using log(|T|/|W|) as the test statistic.

Once an effective internal criterion has been selected, it can be used as a test statistic. The statistic is used to test an alternative hypothesis that specifies that some form of significant cluster structure exists in the obtained data partitions. The main difficulty in conducting the test is to determine an appropriate sampling distribution for the statistic. One approach has been to use permutation or monte carlo procedures to generate an approximate sampling distribution. Milligan and Sokol (1980) developed such a test based on the point-biserial criterion. Other authors have adopted this approach, including Begovich and Kane (1982) and Good (1982).
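The permutation strategy can be sketched as follows (a present-day Python illustration, not the Milligan and Sokol implementation): the point-biserial statistic is computed for the obtained partition, the cluster labels are repeatedly shuffled to build an approximate null distribution, and the proportion of shuffled statistics at least as large as the observed one estimates the p value:

```python
import random
from itertools import combinations

def point_biserial(dist, labels):
    """Pearson r between the 0/1 within/between coding and the input distances."""
    pairs = list(combinations(range(len(labels)), 2))
    d = [dist[i][j] for i, j in pairs]
    y = [0.0 if labels[i] == labels[j] else 1.0 for i, j in pairs]
    n = len(d)
    md, my = sum(d) / n, sum(y) / n
    num = sum((di - md) * (yi - my) for di, yi in zip(d, y))
    den = (sum((di - md) ** 2 for di in d) *
           sum((yi - my) ** 2 for yi in y)) ** 0.5
    return num / den

def permutation_p_value(dist, labels, n_perm=999, seed=1):
    """Shuffle the labels to build an approximate null sampling distribution."""
    rng = random.Random(seed)
    observed = point_biserial(dist, labels)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if point_biserial(dist, perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction for the observed value

dist = [[0, 1, 5, 5], [1, 0, 5, 5], [5, 5, 0, 1], [5, 5, 1, 0]]
print(permutation_p_value(dist, [0, 0, 1, 1]))
```

With only four objects the permutation distribution is far too coarse to yield a small p value; the sketch shows the mechanics of the test rather than a realistic application.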

Replication Analysis

A different approach, based on the logic of cross-validation as used in regression analysis, was proposed by McIntyre and Blashfield (1980):

1. Two samples are obtained for clustering purposes. This can be accomplished by randomly dividing one larger dataset into two samples.

2. The first sample is cluster-analyzed, and the centroids for the clusters are computed. (This step assumes that a determination of the number of clusters was made.)

3. The distance (or other dissimilarity measure) is computed between each element in the second sample and each of the centroids determined from the clustering of the first sample.

4. Each element in the second sample is assigned to the nearest cluster centroid from the first sample. This last assignment activity produces a clustering of the second sample based on the characteristics of the first sample.

5. The second sample is directly cluster-analyzed using its own data. There are now two clusterings of the second sample for comparison purposes.

6. Some measure of agreement is computed between the two partitions of the same data. The consistency between the original cluster solution and the cross-validated cluster assignments indicates the stability of the solution.
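The six steps can be sketched end to end (a present-day Python illustration using plain k-means as a stand-in clustering method; the data, seeds, and helper names are invented for the example):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, cents):
    """Index of the nearest centroid for every point."""
    return [min(range(len(cents)), key=lambda c: dist2(p, cents[c]))
            for p in points]

def kmeans(points, k, seed=0, iters=50):
    """Plain k-means, standing in for whatever clustering method is used."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p, g in zip(points, assign(points, cents)):
            groups[g].append(p)
        cents = [tuple(sum(col) / len(grp) for col in zip(*grp)) if grp
                 else cents[i] for i, grp in enumerate(groups)]
    return cents

# Steps 1-6: split the data, cluster sample 1, project its centroids onto
# sample 2, cluster sample 2 directly, and compare the two partitions.
rng = random.Random(7)
data = [(rng.gauss(m, 0.5), rng.gauss(m, 0.5)) for m in (0, 5) for _ in range(20)]
rng.shuffle(data)
s1, s2 = data[::2], data[1::2]                 # step 1: two random samples
cross = assign(s2, kmeans(s1, 2, seed=1))      # steps 2-4: nearest-centroid rule
direct = assign(s2, kmeans(s2, 2, seed=2))     # step 5: cluster sample 2 itself
agree = sum(a == b for a, b in zip(cross, direct)) / len(s2)   # step 6 (k = 2)
print(max(agree, 1 - agree))                   # agreement after label alignment
```

For k = 2 the agreement rate can be corrected for arbitrary label swapping by taking the larger of agree and 1 − agree; for more clusters a statistic such as kappa or the corrected Rand index is used instead.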

McIntyre and Blashfield suggested the use of the kappa statistic as a measure of consistency. The authors provided some initial information on the numerical values of kappa that would occur in situations where the agreement between the partition sets was high. However, Hubert and Arabie's (1985) corrected Rand index appears to exhibit superior properties for the comparison of partitions (Milligan, 1987a). Unfortunately, no typical values for the Hubert and Arabie index have been published for the cross-validation application.

A different strategy involves comparisons of

clusterings from fairly different sources. However, such comparisons usually are based on a single dataset. For example, it is possible to compare partitionings from different clustering methods. A partition level obtained from a hierarchical clustering method could be selected, and the clusters could be compared to those found by a nonhierarchical procedure. If the cluster structure appears fairly consistent across different clustering methods, it would seem reasonable to conclude that the structure is strong and not an artifact of any given method. In fact, comparisons are not restricted to those involving the same number of clusters in the two partition sets, although in most situations the equal-number case would be the most interesting to an applied researcher. Finally, comparisons based on the use of different similarity or dissimilarity measures are also possible.

These more general comparisons introduce a level of confounding in the results if a successful replication does not occur. A failure to replicate may be due to a lack of structure in the data, or to differences in the types of structures that differing clustering methods impose on the resulting solutions. More general comparisons could provide useful information if the context of the clustering problem suggested that the selected comparison was logical and meaningful.

The use of a single sample in the comparisons comes at a very heavy cost to the researcher, in that the ability to generalize the clustering results to other datasets is lost. The subtle shift from two- to one-sample comparisons rather dramatically changes the analysis from one involving external validity to one giving results only on internal consistency.
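For reference, the Hubert and Arabie corrected Rand index itself is straightforward to compute from the contingency table of the two partitions (a present-day Python sketch, not code from the cited papers):

```python
from math import comb
from collections import Counter

def adjusted_rand(u, v):
    """Hubert-Arabie corrected Rand index between two label vectors over the
    same objects: 1.0 for identical partitions, about 0 for chance agreement."""
    n = len(u)
    nij = Counter(zip(u, v))            # contingency-table cell counts
    a, b = Counter(u), Counter(v)       # marginal cluster sizes
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, relabeled
```

Because the index is corrected for chance, it does not depend on how the cluster labels happen to be numbered in the two partitions, which is exactly what the replication comparison requires.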

Aids for Interpretation

A very important characteristic of any applied cluster analysis is the interpretability of the resulting partitions. Because interpretation is dependent on the underlying context of the problem, the responsibility for this part of the analysis falls upon the user. However, clustering methodologists have developed a variety of techniques to help with the interpretation task.

Logically, descriptive statistics for each cluster should be computed. Any understanding of the nature of the variables employed in the analysis can be used for interpretive purposes. Note that any preprocessing of the input variables, such as standardization or principal components, would make this activity more difficult. A different approach suggested by Anderberg (1973) involves permuting the dissimilarity matrix according to the groups specified by a cluster analysis. That is, the rows and columns are reordered in such a way as to place points within the same cluster in consecutive order. This should form a block diagonal matrix. The within-block values would correspond to the within-cluster distances, and the remaining entries would be between-cluster distances. If distinct clusters exist in the data and the clustering method detected them, the within-block values should be small compared to those outside the block.
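Anderberg's reordering is easy to implement (a present-day Python sketch; the matrix values are invented for illustration):

```python
def permute_matrix(dist, labels):
    """Reorder rows and columns so that members of each cluster are consecutive;
    distinct clusters then appear as small-valued blocks on the diagonal."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [[dist[i][j] for j in order] for i in order]

dist = [[0, 5, 1, 5],
        [5, 0, 5, 1],
        [1, 5, 0, 5],
        [5, 1, 5, 0]]
labels = [0, 1, 0, 1]   # objects 0 and 2 form one cluster; 1 and 3 the other
for row in permute_matrix(dist, labels):
    print(row)
# The 1s (within-cluster distances) form diagonal blocks; the 5s lie off them.
```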

In fact, most interpretive methods are graphical in nature. For example, Kruskal and Landwehr (1983) developed a method for presenting the results of a hierarchical cluster analysis. Their technique, called "icicle" plots, is an improvement on the more standard dendrogram or similar plots produced by clustering packages. More elaborate approaches were proposed by Kleiner and Hartigan (1981) based on natural-appearing "trees" and "castles." Besides showing the clustering of the data points, the trees presented by these authors are able to encode additional information. This is


accomplished by varying the thickness of the branches, the angle between branches, the length of the branch or stem, and the order of the elements as determined by the branch (left vs. right). However, as the comments and rejoinder which immediately follow the article indicate, there is no consistent agreement as to the most effective way of displaying clustering results. Finally, a different graphical approach, based on hierarchical trees but allowing for overlapping structures, was introduced by Corter and Tversky (1986).

There have been several attempts to develop

mathematical approaches which measure the stability or replicability of the generated cluster solution. One such approach is the cluster validity profiles introduced by Bailey and Dubes (1982). Validity profiles are obtained for each cluster in the solution and indicate the relative compactness or isolation of the cluster. The profiles are scaled in probability units and are derived from the mathematical field of graph theory (see Matula, 1977). The profiles are compared with the best compactness and isolation values that would be expected in a randomly chosen graph. The profiles help reject spurious clusters and appear to identify valid ones.

Discussion

Recommendations for Research in Methodology

Clearly, much remains to be done to more completely understand the characteristics and limitations of existing clustering methods, in addition to dealing with the introduction of new procedures. Research based on the following recommendations would provide results of direct benefit to researchers undertaking applied analyses.

1. More information is needed concerning the sensitivity of clustering procedures to differing data characteristics. Such an understanding could eventually result in guidelines which indicate the most appropriate clustering procedure to use, given the properties of the input dataset. This would include all components in the clustering process, such as the selection of the data elements, the similarity measure, and the clustering method. For example, the presence of outliers may affect the decision made at each step in the clustering process. Remarkably little research has been conducted on this issue, but it is a rather complex problem. A good review of the difficulties involved can be found in Soon (in press).

2. Little is known about the performance of the clustering routines that yield overlapping clusters. A careful examination of this issue is needed. It would be useful to determine whether any of the existing methods can reliably discriminate between disjoint structures and those that require an overlapping representation. Similarly, the degree of overlap that can be tolerated before algorithm performance degrades would be helpful information. Finally, the impact of various sources of error on cluster recovery for these methods is unknown.

3. More information is needed on the reliability of subjective decisions that must be made in conjunction with some clustering procedures. In particular, the determination of cluster membership is often left up to the researcher's judgment when ordination methods are used. If the assignment process is reliable, is it valid? Similarly, the impact of the various graphical procedures used in clustering is not well understood. Issues concerning which graph is best for displaying cluster membership, inter-cluster relationships, and other features have not been addressed in a thorough manner.

4. When new derivational or simulation experiments are conducted, one or more of the methods which have been widely used in past research should be included. This provides a basis for comparison and linkage with the previous literature. In particular, when a new clustering method is introduced into the literature, comparative performance information against existing procedures is essential for evaluating the usefulness of the contribution.

Recommendations for Applied Analyses

The clustering process is a complex sequence of tasks that must be carefully executed to obtain a


proper clustering of the user's data. Furthermore, the literature in the area of classification is quite diverse, with contributions coming from all fields of science. Certainly, much progress has been made since Everitt (1979) published a paper on unresolved problems in cluster analysis. Among the concerns listed by Everitt were the problem of determining the number of clusters, selection of clustering method, graphical techniques, development of computer packages, more efficient algorithms, and the application of the jackknife (a permutation-like estimation procedure). Work in all of these areas is continuing. Using the results that are currently available, some guidelines can be offered to the researcher interested in performing a cluster analysis. These recommendations will be presented within the context of the seven-step clustering process. Although the overview is necessarily somewhat cursory, the critical decisions have been included.

1. The sample of the entities to be clustered should be representative of the population. The sample could be systematic in nature and not random. If it is suspected that a particular cluster consists of a fairly small proportion of the population, then this cluster might be over-represented in the sample in order to facilitate cluster detection and to obtain more stable estimates of cluster centroids. The user might consider the inclusion of persons who serve as markers or "ideal types," that is, persons known or believed to be members of different clusters. Finally, it seems reasonable to delete outlier observations because these may degrade algorithm performance.

2. Care must be exercised in the selection of the

variables used in the cluster analysis. Some applied researchers seem to believe that they should use as many variables as are available. However, selecting the correct variables is a critical part of the analysis. The addition of only one or two irrelevant variables can have a serious effect on cluster recovery. Thus, the inclusion of irrelevant variables should be avoided. The researcher might consider providing a justification for the inclusion of each variable in terms of how it should discriminate among clusters. The optimal variable-weighting scheme of De Soete et al. (1985) can be used to improve cluster recovery in those cases where it is uncertain which variables contribute to the clustering in the data. Finally, the practice of using principal components to preprocess the data may result in undetected clusters and interpretation problems with the reduced variable space.

3. The issue of variable standardization must be addressed. The routine application of standardization in all situations is not necessarily appropriate, especially when the variables have similar means and variances. If sizable differences in means or variances do exist and these differences are not related to the clustering in the data, then standardization would seem to be necessary. Alternative standardization procedures which are not based on the well-known z score appear to be more effective (Milligan & Cooper, in press). In particular, standardization based on division by the range of the variable may improve recovery when standardization is needed.
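As an illustration of the two standardization families just mentioned, the sketch below contrasts the familiar z score with division by the range. Both functions are simple stdlib versions written for this review; they are not code from Milligan and Cooper (in press).

```python
# Two standardization forms for a single variable. Range-based
# standardization preserves relative spacing while bounding the spread
# of each variable to 1, which is the family found effective when
# standardization is needed.
from statistics import mean, pstdev

def z_standardize(values):
    """Classical z score: subtract the mean, divide by the SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def range_standardize(values):
    """Divide by the range (max - min) of the variable."""
    r = max(values) - min(values)
    return [v / r for v in values]

income = [21000.0, 25000.0, 23000.0, 90000.0, 95000.0]
print(range_standardize(income))  # spread of every variable becomes 1
```

After range standardization, every variable has a spread of exactly one unit, so no single variable dominates the dissimilarity computation by scale alone.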

4. The researcher will need to select a similarity or dissimilarity measure. The measure should reflect those characteristics that are suspected to define the clusters believed to be present in the data. At times, a tailor-made index may be needed, and this is permissible. In order for an applied user to understand the properties of various dissimilarity measures, a few typical profiles can be constructed. The dissimilarity measures can be computed between profiles, and the resulting values can be studied to determine whether the relevant characteristics are being reflected in the values. The present authors have found this to be an effective exercise.
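The profile exercise can be sketched concretely. The example below, an illustration of ours rather than code from the article, builds two typical profiles and computes two common dissimilarity measures between them: Euclidean distance responds to elevation differences, while one minus the Pearson correlation responds only to differences in shape.

```python
# Profile exercise: compute candidate dissimilarity measures between
# constructed "typical" profiles and check which characteristics
# (elevation vs. shape) each measure reflects. Toy profiles below.
from math import sqrt
from statistics import mean, pstdev

def euclidean(p, q):
    """Euclidean distance: sensitive to elevation and shape."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def corr_dissimilarity(p, q):
    """1 - Pearson r: sensitive to shape only, ignores elevation."""
    mp, mq, sp, sq = mean(p), mean(q), pstdev(p), pstdev(q)
    r = mean([(a - mp) * (b - mq) for a, b in zip(p, q)]) / (sp * sq)
    return 1 - r

base     = [1.0, 3.0, 2.0, 4.0]
elevated = [6.0, 8.0, 7.0, 9.0]   # same shape, higher elevation
print(euclidean(base, elevated))           # large: elevation matters
print(corr_dissimilarity(base, elevated))  # near 0: shape is identical
```

If elevation is believed to define the clusters, the first measure is the better candidate; if only shape matters, the second is.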

5. A clustering method must be selected. Three issues must be considered here. First, the clustering method should be one designed to recover the cluster types believed to be present in the data. A mismatch between the cluster type being sought by the clustering algorithm and the cluster types that actually exist in the data may result in reduced or distorted recovery. Second, the method should be effective at recovering the specified structure, even when error is present in the data. For hierarchical methods, this would tend to rule out the use of the single and complete methods in favor of Ward's method or the group-average technique. In the case of partitioning methods such as K-means, the use of fairly accurate starting seeds seems to be important.

Finally, access to computer software to run the selected method is crucial. More often than not, the actual clustering technique selected by an applied user is determined by this last factor, and not by the first two criteria. For example, D'Andrade's (1978) hierarchical method would be an attractive alternative to the single- and complete-link methods when the user possesses ordinal data. However, software to run D'Andrade's routine has not been made available. Fortunately, the major statistical computer packages have been improving their clustering software over time, and this is becoming less of a problem. The Statistical Analysis System (SAS Institute, 1985) is one such package that has greatly enhanced its clustering options over the past decade. Continued improvements are expected and desirable.
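The role of starting seeds in K-means can be seen even in a toy implementation. The sketch below is a minimal K-means written for illustration (it is not the software discussed above); the caller supplies the starting seeds, and seeds placed near the true centroids let the algorithm converge to the intended partition.

```python
# Minimal 1-D K-means where the caller supplies starting seeds.
# Accurate seeds (near the true centroids) yield the intended partition.
def kmeans(points, seeds, iters=20):
    centroids = list(seeds)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: (p - centroids[c]) ** 2)
                  for p in points]
        # Update step: recompute each centroid as its cluster mean.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids

data = [1.0, 1.1, 0.9, 8.0, 8.2, 7.9]     # two well-separated groups
labels, cents = kmeans(data, seeds=[1.0, 8.0])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With poorly chosen seeds (e.g., both inside one group), an algorithm of this kind can settle on a partition that splits one true cluster and merges the others, which is why seed quality matters in practice.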

6. The number of clusters will need to be determined or specified for most clustering methods. Milligan and Cooper (1985) provided a useful summary and bibliography of most of the techniques that have been proposed to address this decision task. Again, SAS has been implementing many of the rules which have been found to be reasonably effective.
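Many of the stopping rules surveyed by Milligan and Cooper (1985) build on quantities such as the within-group sum of squares. The sketch below is an illustrative toy with hand-made partitions, not one of the rules from that survey: it shows the basic logic that improvement in fit levels off once the true number of clusters is reached.

```python
# Within-group sum of squares (WGSS) for candidate partitions of a 1-D
# toy sample with three clear groups. The partitions are supplied by
# hand for illustration; a real analysis would produce them with the
# chosen clustering algorithm.
def wgss(points, labels):
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = sum(members) / len(members)
        total += sum((p - centroid) ** 2 for p in members)
    return total

data = [1.0, 1.2, 5.0, 5.1, 9.0, 9.2]
partitions = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}
for k, labels in sorted(partitions.items()):
    print(k, round(wgss(data, labels), 3))
# The drop from k=2 to k=3 is large; from k=3 to k=4 it is tiny,
# suggesting three clusters.
```

Formal rules differ in how they penalize the extra clusters, but most exploit a pattern of this kind in some criterion value plotted against k.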

7. The last step involves the interpretation, testing, and replication of the clustering results. The introduction of improved graphical techniques for interpretation purposes has certainly been a welcome development in the clustering literature. Access to these procedures is still somewhat limited, as is access to the testing routines that will allow a user to test the hypothesis of significant clustering in the data.

Unfortunately, limited access exacerbates a longstanding problem in the practice of applied clustering. Often, an applied researcher brings to a cluster analysis the assumption that clusters actually do exist in the data. The idea of testing this assumption escapes some researchers. These same researchers would not assume, say in the context of a laboratory experiment, that the groups were significantly different as a result of the experimental manipulations. Rather, the researcher would test for this result. The same logic is valid in the clustering context. Finally, the two-sample cross-validation process proposed by McIntyre and Blashfield (1980) is an excellent strategy for establishing the generalizability of a cluster analysis.
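The two-sample logic can be sketched as follows. This is our illustrative reading of the nearest-centroid strategy of McIntyre and Blashfield (1980), with hand-supplied clusterings standing in for the output of an actual algorithm: cluster sample A, assign sample B to the nearest A-centroids, cluster B independently, and measure the agreement between the two partitions of B.

```python
# Two-sample cross-validation sketch (1-D toy data). High agreement
# between the cross-assigned and independently clustered partitions of
# sample B supports the generalizability of the cluster solution.
def centroids(points, labels):
    """Centroid of each cluster in a labeled sample."""
    out = {}
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        out[c] = sum(members) / len(members)
    return out

def nearest_centroid_assign(points, cents):
    """Assign each point to the nearest supplied centroid."""
    return [min(cents, key=lambda c: (p - cents[c]) ** 2) for p in points]

def agreement(a, b):
    """Proportion of object pairs on which two partitions agree
    (a simple Rand-type measure, insensitive to label names)."""
    n, same = len(a), 0
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for i, j in pairs:
        if (a[i] == a[j]) == (b[i] == b[j]):
            same += 1
    return same / len(pairs)

sample_a = [1.0, 1.2, 8.0, 8.1]
labels_a = [0, 0, 1, 1]              # clustering of sample A
sample_b = [0.9, 1.1, 7.9, 8.2]
labels_b = [0, 0, 1, 1]              # B clustered independently
cross = nearest_centroid_assign(sample_b, centroids(sample_a, labels_a))
print(agreement(cross, labels_b))    # 1.0: the solution replicates
```

Agreement well below 1.0 on real data would suggest that the solution found in sample A does not generalize, which is exactly the warning such a cross-validation is designed to give.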

References

Anderberg, M. R. (1973). Cluster analysis for applications. New York: Academic Press.

Andes, N. (1986, June). Validation of cluster solutions using discriminant analysis and bootstrap techniques. Paper presented at the meeting of the Classification Society of North America, Columbus OH.

Arnold, S. J. (1979). A test for clusters. Journal of Marketing Research, 19, 545-551.

Bailey, T. A., & Dubes, R. (1982). Clustering validity profiles. Pattern Recognition, 15, 61-83.

Baker, F. B. (1974). Stability of two hierarchical grouping techniques. Case I: Sensitivity to data errors. Journal of the American Statistical Association, 69, 440-445.

Baker, F. B., & Hubert, L. J. (1975). Measuring the power of hierarchical cluster analysis. Journal of the American Statistical Association, 70, 31-38.

Ball, G. H., & Hall, D. J. (1965). ISODATA, a novel method of data analysis and pattern classification. Menlo Park CA: Stanford Research Institute. (NTIS No. AD 699616)

Bayne, C. K., Beauchamp, J. J., Begovich, C. L., & Kane, V. E. (1980). Monte carlo comparisons of selected clustering procedures. Pattern Recognition, 12, 51-62.

Beale, E. M. L. (1969). Cluster analysis. London: Scientific Control Systems.

Begovich, C. L., & Kane, V. E. (1982). Estimating the number of groups and group membership using simulation cluster analysis. Pattern Recognition, 15, 335-342.


Blashfield, R. K. (1976). Mixture model tests of cluster analysis: Accuracy of four agglomerative hierarchical methods. Psychological Bulletin, 83, 377-388.

Blashfield, R. K. (1977a). A consumer report on cluster analysis software: (3) Iterative partitioning methods (NSF grant DCR 74-20007). State College PA: Pennsylvania State University, Department of Psychology.

Blashfield, R. K. (1977b). The equivalence of three statistical packages for performing hierarchical cluster analysis. Psychometrika, 42, 429-431.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15, 439-458.

Blashfield, R. K., & Aldenderfer, M. S. (1978). The literature of cluster analysis. Multivariate Behavioral Research, 13, 271-295.

Blashfield, R. K., & Morey, L. C. (1980). A comparison of four clustering methods using MMPI monte carlo data. Applied Psychological Measurement, 4, 57-64.

Bock, H. H. (1985). On some significance tests in cluster analysis. Journal of Classification, 2, 77-108.

Calinski, R. B., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3, 1-27.

Cattell, R. B. (1952). The three basic factor-analytic research designs: Their interrelations and derivatives. Psychological Bulletin, 49, 499-520.

Cattell, R. B. (1978). The scientific use of factor analysis. New York: Plenum Press.

Cormack, R. M. (1971). A review of classification. Journal of the Royal Statistical Society, Series A, 134, 321-367.

Corter, J. E., & Tversky, A. (1986). Extended similarity trees. Psychometrika, 51, 429-451.

Cronbach, L. J., & Gleser, G. C. (1953). Assessing the similarity between profiles. Psychological Bulletin, 50, 456-473.

Cunningham, K. M., & Ogilvie, J. C. (1972). Evaluation of hierarchical grouping techniques: A preliminary study. Computer Journal, 15, 209-213.

D'Andrade, R. G. (1978). U-statistic hierarchical clustering. Psychometrika, 43, 59-67.

Day, W. H. E. (Ed.). (1986). Consensus classifications [Special issue]. Journal of Classification, 3(2).

De Soete, G., DeSarbo, W. S., & Carroll, J. D. (1985). Optimal variable weighting for hierarchical clustering: An alternating least squares approach. Journal of Classification, 2, 173-192.

Dubes, R., & Jain, A. K. (1979). Validity studies in clustering methodologies. Pattern Recognition, 11, 235-254.

Dubes, R., & Jain, A. K. (1980). Clustering methodologies in exploratory data analysis. Advances in Computers, 19, 113-228.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.

Edelbrock, C. (1979). Comparing the accuracy of hierarchical clustering algorithms: The problem of classifying everybody. Multivariate Behavioral Research, 14, 367-384.

Edelbrock, C., & McLaughlin, B. (1980). Hierarchical cluster analysis using intraclass correlations: A mixture model study. Multivariate Behavioral Research, 15, 299-318.

Edwards, A. W. F., & Cavalli-Sforza, L. (1965). A method for cluster analysis. Biometrics, 21, 362-375.

Everitt, B. S. (1979). Unresolved problems in cluster analysis. Biometrics, 35, 169-181.

Everitt, B. S. (1980). Cluster analysis (2nd ed.). London: Heinemann.

Everitt, B. S. (1981). A monte carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions. Multivariate Behavioral Research, 16, 171-180.

Fisher, L., & Van Ness, J. W. (1971). Admissible clustering procedures. Biometrika, 58, 91-104.

Fleiss, J. L., Lawlor, W., Platman, S. R., & Fieve, R. R. (1971). On the use of inverted factor analysis for generating typologies. Journal of Abnormal Psychology, 77, 127-132.

Fleiss, J. L., & Zubin, J. (1969). On the methods and theory of clustering. Multivariate Behavioral Research, 4, 235-250.

Friedman, H. P., & Rubin, J. (1967). On some invariant criteria for grouping data. Journal of the American Statistical Association, 62, 1159-1178.

Goldstein, S. G., & Linden, J. D. (1969). A comparison of multivariate grouping techniques commonly used with profile data. Multivariate Behavioral Research, 4, 103-114.

Good, I. J. (1982). An index of separateness of clusters and a permutation test for its statistical significance. Journal of Statistical Computing and Simulation, 15, 81-84.

Gordon, A. D. (1987). A review of hierarchical classification. Journal of the Royal Statistical Society, Series A, 150, 119-137.

Gower, J. C. (1967). A comparison of some methods of cluster analysis. Biometrics, 23, 623-628.

Gower, J. C. (1975). Goodness-of-fit criteria for classification and other patterned structures. In G. Estabrook (Ed.), Proceedings of the 8th International Conference on Numerical Taxonomy. San Francisco: Freeman.

Gross, A. L. (1972). A monte carlo study of the accuracy of a hierarchical grouping procedure. Multivariate Behavioral Research, 7, 379-389.

Harrigan, K. R. (1985). An application of clustering for strategic group analysis. Strategic Management Journal, 6, 55-73.

Hartigan, J. A. (1975). Clustering algorithms. New York: Wiley.


Hartigan, J. A. (1977). Distribution problems in clustering. In J. Van Ryzin (Ed.), Classification and clustering (pp. 45-71). New York: Academic Press.

Hartigan, J. A. (1978). Asymptotic distributions for clustering criteria. Annals of Statistics, 6, 117-131.

Hartigan, J. A. (1985). Statistical theory in clustering. Journal of Classification, 2, 63-76.

Hubert, L. J. (1974). Some applications of graph theory to clustering. Psychometrika, 39, 283-309.

Hubert, L. J., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193-218.

Hubert, L. J., & Baker, F. B. (1977). The comparison and fitting of given classification schemes. Journal of Mathematical Psychology, 16, 233-253.

Jancey, R. C. (1966). Multidimensional group analysis. Australian Journal of Botany, 14, 127-130.

Jardine, N., & Sibson, R. (1971). Mathematical taxonomy. New York: Wiley.

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.

Kaufman, R. L. (1985). Issues in multivariate cluster analysis: Some simulation results. Sociological Methods and Research, 13, 467-486.

Kleiner, B., & Hartigan, J. A. (1981). Representing points in many dimensions by trees and castles (with comments and rejoinder). Journal of the American Statistical Association, 76, 260-276.

Kruskal, J. B., & Landwehr, J. M. (1983). Icicle plots: Better displays for hierarchical clustering. The American Statistician, 37, 162-168.

Kuiper, F. K., & Fisher, L. (1975). A monte carlo comparison of six clustering procedures. Biometrics, 31, 777-783.

Lance, G. N., & Williams, W. T. (1967). A general theory of classificatory sorting strategies: I. Hierarchical systems. Computer Journal, 9, 373-380.

Lee, K. L. (1979). Multivariate tests for clusters. Journal of the American Statistical Association, 74, 708-714.

Ling, R. F. (1973). A probability theory of cluster analysis. Journal of the American Statistical Association, 68, 159-164.

Lorr, M. (1983). Cluster analysis for the social sciences. San Francisco: Jossey-Bass.

Marriott, F. H. C. (1971). Practical problems in a method of cluster analysis. Biometrics, 27, 501-514.

Matula, D. W. (1977). Graph theoretic techniques for cluster analysis. In J. Van Ryzin (Ed.), Classification and clustering (pp. 95-129). New York: Academic Press.

McIntyre, R. M., & Blashfield, R. K. (1980). A nearest-centroid technique for evaluating the minimum-variance clustering procedure. Multivariate Behavioral Research, 15, 225-238.

McQuitty, L. L. (1987). Pattern-analytic clustering. New York: University Press of America.

Mezzich, J. (1978). Evaluating clustering methods for psychiatric diagnosis. Biological Psychiatry, 13, 265-346.

Milligan, G. W. (1979). Ultrametric hierarchical clustering algorithms. Psychometrika, 44, 343-346.

Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325-342.

Milligan, G. W. (1981a). A monte carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46, 187-199.

Milligan, G. W. (1981b). A review of monte carlo tests of cluster analysis. Multivariate Behavioral Research, 16, 379-407.

Milligan, G. W. (1985). An algorithm for generating artificial test clusters. Psychometrika, 50, 123-127.

Milligan, G. W. (1987a). A study of the beta-flexible clustering method (WPS 87-61). Columbus OH: Ohio State University, Faculty of Management Sciences.

Milligan, G. W. (1987b). A validation study of a variable weighting algorithm (WPS 87-111). Columbus OH: Ohio State University, Faculty of Management Sciences.

Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179.

Milligan, G. W., & Cooper, M. C. (1986). A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21, 441-458.

Milligan, G. W., & Cooper, M. C. (in press). A study of standardization of variables in cluster analysis. Journal of Classification.

Milligan, G. W., & Isaac, P. (1980). The validation of four ultrametric clustering algorithms. Pattern Recognition, 12, 41-50.

Milligan, G. W., & Mahajan, V. (1980). A note on procedures for testing the quality of a clustering of a set of objects. Decision Sciences, 11, 669-677.

Milligan, G. W., & Sokol, L. M. (1980). A two-stage clustering algorithm with robust recovery characteristics. Educational and Psychological Measurement, 40, 755-759.

Mojena, R. (1977). Hierarchical grouping methods and stopping rules: An evaluation. Computer Journal, 20, 359-363.

Morey, L. C., Blashfield, R. K., & Skinner, H. A. (1983). A comparison of cluster analysis techniques within a sequential validation framework. Multivariate Behavioral Research, 18, 309-329.

Needham, R. M. (1967). Automatic classification in linguistics. The Statistician, 17, 45-54.

Ozawa, K. (1985). A stratificational overlapping cluster scheme. Pattern Recognition, 18, 279-286.

Peay, E. R. (1975). Nonmetric grouping: Clusters and cliques. Psychometrika, 40, 297-313.


Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research, 20, 134-148.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846-850.

Rohlf, F. J. (1974). Methods of comparing classifications. Annual Review of Ecology and Systematics, 5, 101-113.

Romesburg, H. C. (1984). Cluster analysis for researchers. Belmont CA: Lifetime Learning Publications.

Sarle, W. S. (1983). Cubic clustering criterion (Tech. Rep. A-108). Cary NC: SAS Institute.

SAS Institute (1985). SAS user's guide: Statistics, version 5 edition. Cary NC: Author.

Scheibler, D., & Schneider, W. (1985). Monte carlo tests of the accuracy of cluster analysis algorithms: A comparison of hierarchical and nonhierarchical methods. Multivariate Behavioral Research, 20, 283-304.

Scott, A. J., & Symons, M. J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387-397.

Shepard, R. N., & Arabie, P. (1979). Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review, 86, 87-123.

Skinner, H. A. (1978). Differentiating the contribution of elevation, scatter, and shape in profile similarity. Educational and Psychological Measurement, 38, 297-308.

Sneath, P. H. A. (1969). Evaluation of clustering methods. In A. J. Cole (Ed.), Numerical taxonomy (pp. 257-271). New York: Academic Press.

Sneath, P. H. A. (1977). A method for testing the distinctness of clusters: A test of the disjunction of two clusters in Euclidean space as measured by their overlap. Mathematical Geology, 9, 123-143.

Sneath, P. H. A. (1980). The risk of not recognizing from ordinations that clusters are distinct. Classification Society Bulletin, 4, 22-43.

Sneath, P. H. A., & Sokal, R. R. (1973). Numerical taxonomy. San Francisco: Freeman.

Soon, S. C. (in press). On detection of extreme data points in cluster analysis (Doctoral dissertation, Ohio State University, 1988). Dissertation Abstracts International.

Späth, H. (1980). Cluster analysis algorithms. New York: Wiley.

Tryon, R. C., & Bailey, D. C. (1970). Cluster analysis. New York: McGraw-Hill.

Turner, M. E. (1969). Credibility and cluster. Annals of the New York Academy of Sciences, 161, 680-688.

Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236-244.

Williams, W. T., Lance, G. N., Dale, M. B., & Clifford, H. T. (1971). Controversy concerning the criteria for taxonometric strategies. Computer Journal, 14, 162-165.

Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, 5, 329-350.

Wong, M. A. (1982). A hybrid clustering method for identifying high-density clusters. Journal of the American Statistical Association, 77, 841-847.

Wong, M. A., & Lane, T. (1983). A kth nearest neighbor clustering procedure. Journal of the Royal Statistical Society, Series B, 45, 362-368.

Wong, M. A., & Schaak, C. (1982). Using the kth nearest neighbor clustering procedure to determine the number of subpopulations. Proceedings of the Statistical Computing Section, American Statistical Association, 40-48.

Author’s Address

Send requests for further information to Glenn W. Milligan, Faculty of Management Sciences, 301 Hagerty Hall, Ohio State University, Columbus OH 43210, U.S.A.

Reprints

Reprints of this article may be obtained prepaid for $2.50 (U.S. delivery) or $3.00 (outside U.S.; payment in U.S. funds drawn on a U.S. bank) from Applied Psychological Measurement, N657 Elliott Hall, University of Minnesota, Minneapolis MN 55455, U.S.A.

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/