
VOL. 56, No. 2

MARCH, 1959

Psychological Bulletin

CONVERGENT AND DISCRIMINANT VALIDATION BY THE MULTITRAIT-MULTIMETHOD MATRIX¹

DONALD T. CAMPBELL
Northwestern University

AND DONALD W. FISKE
University of Chicago

In the cumulative experience with measures of individual differences over the past 50 years, tests have been accepted as valid or discarded as invalid by research experiences of many sorts. The criteria suggested in this paper are all to be found in such cumulative evaluations, as well as in the recent discussions of validity. These criteria are clarified and implemented when considered jointly in the context of a multitrait-multimethod matrix. Aspects of the validational process receiving particular emphasis are these:

1. Validation is typically convergent, a confirmation by independent measurement procedures. Independence of methods is a common denominator among the major types of validity (excepting content validity)

1 The new data analyses reported in this paper were supported by funds from the Graduate School of Northwestern University and by the Department of Psychology of the University of Chicago. We are also indebted to numerous colleagues for their thoughtful criticisms and encouragement of an earlier draft of this paper, especially Benjamin S. Bloom, R. Darrell Bock, Desmond S. Cartwright, Loren J. Chapman, Lee J. Cronbach, Carl P. Duncan, Lyle V. Jones, Joe Kamiya, Wilbur L. Layton, Jane Loevinger, Paul E. Meehl, Marshall H. Segall, Thornton B. Roby, Robert C. Tryon, Michael Wertheimer, and Robert F. Winch.

insofar as they are to be distinguished from reliability.

2. For the justification of novel trait measures, for the validation of test interpretation, or for the establishment of construct validity, discriminant validation as well as convergent validation is required. Tests can be invalidated by too high correlations with other tests from which they were intended to differ.

3. Each test or task employed for measurement purposes is a trait-method unit, a union of a particular trait content with measurement procedures not specific to that content. The systematic variance among test scores can be due to responses to the measurement features as well as responses to the trait content.

4. In order to examine discriminant validity, and in order to estimate the relative contributions of trait and method variance, more than one trait as well as more than one method must be employed in the validation process. In many instances it will be convenient to achieve this through a multitrait-multimethod matrix. Such a matrix presents all of the intercorrelations resulting when each of several traits is measured by each of several methods.

To illustrate the suggested validational process, a synthetic example is



TABLE 1
A SYNTHETIC MULTITRAIT-MULTIMETHOD MATRIX

                          Method 1            Method 2            Method 3
Traits                  A1   B1   C1        A2   B2   C2        A3   B3   C3

Method 1   A1         (.89)
           B1          .51 (.89)
           C1          .38  .37 (.76)

Method 2   A2          .57  .22  .09      (.93)
           B2          .22  .57  .10       .68 (.94)
           C2          .11  .11  .46       .59  .58 (.84)

Method 3   A3          .56  .22  .11       .67  .42  .33      (.94)
           B3          .23  .58  .12       .43  .66  .34       .67 (.92)
           C3          .11  .11  .45       .34  .32  .58       .58  .60 (.85)

Note.-The validity diagonals are the three sets of italicized values. The reliability diagonals are the three sets of values in parentheses. Each heterotrait-monomethod triangle is enclosed by a solid line. Each heterotrait-heteromethod triangle is enclosed by a broken line.

presented in Table 1. This illustration involves three different traits, each measured by three methods, generating nine separate variables. It will be convenient to have labels for various regions of the matrix, and such have been provided in Table 1. The reliabilities will be spoken of in terms of three reliability diagonals, one for each method. The reliabilities could also be designated as the monotrait-monomethod values. Adjacent to each reliability diagonal is the heterotrait-monomethod triangle. The reliability diagonal and the adjacent heterotrait-monomethod triangle make up a monomethod block. A heteromethod block is made up of a validity diagonal (which could also be designated as monotrait-heteromethod values) and the two heterotrait-heteromethod triangles lying on each side of it. Note that these two heterotrait-heteromethod triangles are not identical.

In terms of this diagram, four aspects bear upon the question of validity. In the first place, the entries in the validity diagonal should be significantly different from zero and sufficiently large to encourage further examination of validity. This requirement is evidence of convergent validity. Second, a validity diagonal value should be higher than the values lying in its column and row in the heterotrait-heteromethod triangles. That is, a validity value for a variable should be higher than the correlations obtained between that variable and any other variable having neither trait nor method in common. This requirement may seem so minimal and so obvious as to not need stating, yet an inspection of the literature shows that it is frequently not met,


and may not be met even when the validity coefficients are of substantial size. In Table 1, all of the validity values meet this requirement. A third common-sense desideratum is that a variable correlate higher with an independent effort to measure the same trait than with measures designed to get at different traits which happen to employ the same method. For a given variable, this involves comparing its values in the validity diagonals with its values in the heterotrait-monomethod triangles. For variables A1, B1, and C1, this requirement is met to some degree. For the other variables, A2, A3, etc., it is not met and this is probably typical of the usual case in individual differences research, as will be discussed in what follows. A fourth desideratum is that the same pattern of trait interrelationship be shown in all of the heterotrait triangles of both the monomethod and heteromethod blocks. The hypothetical data in Table 1 meet this requirement to a very marked degree, in spite of the different general levels of correlation involved in the several heterotrait triangles. The last three criteria provide evidence for discriminant validity.
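The comparisons above are mechanical enough to express as a short procedure. The sketch below is ours, not the paper's: it applies the first and second desiderata to one heteromethod block, with correlations patterned on Table 1 (Methods 1 and 2); all function and variable names are our own assumptions.

```python
# Sketch (ours) of Campbell & Fiske's first two desiderata, applied to
# a single heteromethod block. Rows are Method 2 variables (A2, B2, C2),
# columns are Method 1 variables (A1, B1, C1); the diagonal is the
# validity diagonal (monotrait-heteromethod values).
hetero_block = [
    [.57, .22, .09],
    [.22, .57, .10],
    [.11, .11, .46],
]


def convergent(block, floor=0.0):
    """First desideratum, roughly: every validity-diagonal entry is
    clearly above zero (significance testing omitted in this sketch)."""
    return all(block[i][i] > floor for i in range(len(block)))


def beats_row_and_column(block):
    """Second desideratum: each validity value exceeds every
    heterotrait-heteromethod value lying in its own row and column."""
    n = len(block)
    results = []
    for i in range(n):
        rivals = [block[i][j] for j in range(n) if j != i]
        rivals += [block[j][i] for j in range(n) if j != i]
        results.append(all(block[i][i] > r for r in rivals))
    return results
```

For these values both checks pass for every trait; the third desideratum would then compare each validity value against the corresponding heterotrait-monomethod triangle (e.g., .57 for Trait A against the .51 and .38 of Method 1).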

Before examining the multitrait-multimethod matrices available in the literature, some explication and justification of this complex of requirements seems in order.

Convergence of independent methods: the distinction between reliability and validity. Both reliability and validity concepts require that agreement between measures be demonstrated. A common denominator which most validity concepts share in contradistinction to reliability is that this agreement represent the convergence of independent approaches. The concept of independence is indicated by such phrases as "external variable," "criterion performance," "behavioral criterion" (American Psychological Association, 1954, pp. 13-15) used in connection with concurrent and predictive validity. For construct validity it has been stated thus: "Numerous successful predictions dealing with phenotypically diverse 'criteria' give greater weight to the claim of construct validity than do ... predictions involving very similar behavior" (Cronbach & Meehl, 1955, p. 295). The importance of independence recurs in most discussions of proof. For example, Ayer, discussing a historian's belief about a past event, says "if these sources are numerous and independent, and if they agree with one another, he will be reasonably confident that their account of the matter is correct" (Ayer, 1954, p. 39). In discussing the manner in which abstract scientific concepts are tied to operations, Feigl speaks of their being "fixed" by "triangulation in logical space" (Feigl, 1958, p. 401).

Independence is, of course, a matter of degree, and in this sense, reliability and validity can be seen as regions on a continuum. (Cf. Thurstone, 1937, pp. 102-103.) Reliability is the agreement between two efforts to measure the same trait through maximally similar methods. Validity is represented in the agreement between two attempts to measure the same trait through maximally different methods. A split-half reliability is a little more like a validity coefficient than is an immediate test-retest reliability, for the items are not quite identical. A correlation between dissimilar subtests is probably a reliability measure, but is still closer to the region called validity.

Some evaluation of validity can take place even if the two methods


are not entirely independent. In Table 1, for example, it is possible that Methods 1 and 2 are not entirely independent. If underlying Traits A and B are entirely independent, then the .10 minimum correlation in the heterotrait-heteromethod triangles may reflect method covariance. What if the overlap of method variance were higher? All correlations in the heteromethod block would then be elevated, including the validity diagonal. The heteromethod block involving Methods 2 and 3 in Table 1 illustrates this. The degree of elevation of the validity diagonal above the heterotrait-heteromethod triangles remains comparable and relative validity can still be evaluated. The interpretation of the validity diagonal in an absolute fashion requires the fortunate coincidence of both an independence of traits and an independence of methods, represented by zero values in the heterotrait-heteromethod triangles. But zero values could also occur through a combination of negative correlation between traits and positive correlation between methods, or the reverse. In practice, perhaps all that can be hoped for is evidence for relative validity, that is, for common variance specific to a trait, above and beyond shared method variance.

Discriminant validation. While the usual reason for the judgment of invalidity is low correlations in the validity diagonal (e.g., the Downey Will-Temperament Test [Symonds, 1931, p. 337ff]) tests have also been invalidated because of too high correlations with other tests purporting to measure different things. The classic case of the social intelligence tests is a case in point. (See below and also [Strang, 1930; R. Thorndike, 1936].) Such invalidation occurs when values in the heterotrait-heteromethod triangles are as high as those in the validity diagonal, or even where within a monomethod block, the heterotrait values are as high as the reliabilities. Loevinger, Gleser, and DuBois (1953) have emphasized this requirement in the development of maximally discriminating subtests.

When a dimension of personality is hypothesized, when a construct is proposed, the proponent invariably has in mind distinctions between the new dimension and other constructs already in use. One cannot define without implying distinctions, and the verification of these distinctions is an important part of the validational process. In discussions of construct validity, it has been expressed in such terms as "from this point of view, a low correlation with athletic ability may be just as important and encouraging as a high correlation with reading comprehension" (APA, 1954, p. 17).

The test as a trait-method unit. In any given psychological measuring device, there are certain features or stimuli introduced specifically to represent the trait that it is intended to measure. There are other features which are characteristic of the method being employed, features which could also be present in efforts to measure other quite different traits. The test, or rating scale, or other device, almost inevitably elicits systematic variance in response due to both groups of features. To the extent that irrelevant method variance contributes to the scores obtained, these scores are invalid.

This source of invalidity was first noted in the "halo effects" found in ratings (Thorndike, 1920). Studies of individual differences among laboratory animals resulted in the recognition of "apparatus factors," usually more dominant than psychological process factors (Tryon, 1942). For paper-and-pencil tests, methods variance has been noted under such terms as "test-form factors" (Vernon: 1957, 1958) and "response sets" (Cronbach: 1946, 1950; Lorge, 1937). Cronbach has stated the point particularly clearly: "The assumption is generally made ... that what the test measures is determined by the content of the items. Yet the final score ... is a composite of effects resulting from the content of the item and effects resulting from the form of the item used" (Cronbach, 1946, p. 475). "Response sets always lower the logical validity of a test.... Response sets interfere with inferences from test data" (p. 484).

While E. L. Thorndike (1920) was willing to allege the presence of halo effects by comparing the high obtained correlations with common sense notions of what they ought to be (e.g., it was unreasonable that a teacher's intelligence and voice quality should correlate .63) and while much of the evidence of response set variance is of the same order, the clear-cut demonstration of the presence of method variance requires both several traits and several methods. Otherwise, high correlations between tests might be explained as due either to basic trait similarity or to shared method variance. In the multitrait-multimethod matrix, the presence of method variance is indicated by the difference in level of correlation between the parallel values of the monomethod block and the heteromethod blocks, assuming comparable reliabilities among all tests. Thus the contribution of method variance in Test A1 of Table 1 is indicated by the elevation of rA1B1 above rA1B2, i.e., the difference between .51 and .22, etc.
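As a small worked example (ours, not the paper's), the comparison just described reduces to a difference between parallel correlations; the variable names below are invented for the sketch.

```python
# Rough index (ours) of method variance for Test A1: the elevation of the
# monomethod heterotrait correlation over its heteromethod parallel.
# Correlation values are those cited from Table 1.
r_A1B1 = 0.51  # Traits A and B both measured by Method 1
r_A1B2 = 0.22  # Trait A by Method 1, Trait B by Method 2

method_variance_index = round(r_A1B1 - r_A1B2, 2)
print(method_variance_index)  # 0.29
```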

The distinction between trait and

method is of course relative to the test constructor's intent. What is an unwanted response set for one tester may be a trait for another who wishes to measure acquiescence, willingness to take an extreme stand, or tendency to attribute socially desirable attributes to oneself (Cronbach: 1946, 1950; Edwards, 1957; Lorge, 1937).

MULTITRAIT-MULTIMETHOD MATRICES IN THE LITERATURE

Multitrait-multimethod matrices are rare in the test and measurement literature. Most frequent are two types of fragment: two methods and one trait (single isolated values from the validity diagonal, perhaps accompanied by a reliability or two), and heterotrait-monomethod triangles. Either type of fragment is apt to disguise the inadequacy of our present measurement efforts, particularly in failing to call attention to the preponderant strength of methods variance. The evidence of test validity to be presented here is probably poorer than most psychologists would have expected.

One of the earliest matrices of this kind was provided by Kelley and Krey in 1934. Peer judgments by students provided one method, scores on a word-association test the other. Table 2 presents the data for the four most valid traits of the eight he employed. The picture is one of strong method factors, particularly among the peer ratings, and almost total invalidity. For only one of the eight measures, School Drive, is the value in the validity diagonal (.16) higher than all of the heterotrait-heteromethod values. The absence of discriminant validity is further indicated by the tendency of the values in the monomethod triangles to approximate the reliabilities.

An early illustration from the ani-


TABLE 2
PERSONALITY TRAITS OF SCHOOL CHILDREN FROM KELLEY'S STUDY
(N = 311)

                              Peer Ratings              Association Test
                           A1   B1   C1   D1         A2   B2   C2   D2

Peer Ratings
  Courtesy      A1       (.82)
  Honesty       B1        .74 (.80)
  Poise         C1        .63  .65 (.74)
  School Drive  D1        .76  .78  .65 (.89)

Association Test
  Courtesy      A2        .13  .14  .10  .14       (.28)
  Honesty       B2        .06  .12  .16  .08        .27 (.38)
  Poise         C2        .01  .08  .10  .02        .19  .37 (.42)
  School Drive  D2        .12  .15  .14  .16        .27  .32  .18 (.36)

mal literature comes from Anderson's (1937) study of drives. Table 3 presents a sample of his data. Once again, the highest correlations are found among different constructs from the same method, showing the dominance of apparatus or method factors so typical of the whole field of individual differences. The validity diagonal for hunger is higher than the heteroconstruct-heteromethod values. The diagonal value for sex has not been italicized as a validity coefficient since the obstruction box

measure was pre-sex-opportunity, the activity wheel post-opportunity. Note that the high general level of heterotrait-heteromethod values could be due either to correlation of methods variance between the two methods, or to correlated trait variance. On a priori grounds, however, the methods would seem about as independent as one would be likely to achieve. The predominance of an apparatus factor for the activity wheel is evident from the fact that the correlation between hunger and thirst

TABLE 3
MEASURES OF DRIVES FROM ANDERSON'S DATA
(N = 50)

                          Obstruction Box        Activity Wheel
                          A1   B1   C1           A2   B2   C2

Obstruction Box
  Hunger    A1          (.58)
  Thirst    B1           .54  (  )
  Sex       C1           .46  .70

Activity Wheel
  Hunger    A2           .48  .31  .37         (.83)
  Thirst    B2           .35  .33  .43          .87 (.92)
  Post Sex  C2           .31  .37  .44          .69  .78

Note.-Empty parentheses appear in this and subsequent tables where no appropriate reliability estimates are reported in the original paper.


TABLE 4
SOCIAL INTELLIGENCE AND MENTAL ALERTNESS SUBTEST INTERCORRELATIONS FROM THORNDIKE'S DATA
(N = 750)

                                                        Memory     Comprehension  Vocabulary
                                                       A1   B1       A2   B2       A3   B3

Memory
  Social Intelligence (Memory for Names & Faces)  A1  (  )
  Mental Alertness (Learning Ability)             B1   .31  (  )

Comprehension
  Social Intelligence (Sense of Humor)            A2   .30  .31     (  )
  Mental Alertness (Comprehension)                B2   .29  .38      .48  (  )

Vocabulary
  Social Intelligence (Recog. of Mental State)    A3   .23  .35      .31  .35     (  )
  Mental Alertness (Vocabulary)                   B3   .30  .58      .40  .48      .47  (  )

(.87) is of the same magnitude as their test-retest reliabilities (.83 and .92 respectively).

R. L. Thorndike's study (1936) of the validity of the George Washington Social Intelligence Test is the classic instance of invalidation by high correlation between traits. It involved computing all of the intercorrelations among five subscales of the Social Intelligence Test and five subscales of the George Washington Mental Alertness Test. The model of the present paper would demand that each of the traits, social intelligence and mental alertness, be measured by at least two methods. While this full symmetry was not intended in the study, it can be so interpreted without too much distortion. For both traits, there were subtests employing acquisition of knowledge during the testing period (i.e., learning or memory), tests involving comprehension of prose passages, and tests that involved a definitional activity. Table 4 shows six of Thorndike's 10 variables arranged as a multitrait-multimethod matrix. If the three subtests of the Social Intelligence Test are viewed

as three methods of measuring social intelligence, then their intercorrelations (.30, .23, and .31) represent validities that are not only lower than their corresponding monomethod values, but also lower than the heterotrait-heteromethod correlations, providing a picture which totally fails to establish social intelligence as a separate dimension. The Mental Alertness validity diagonals (.38, .58, and .48) equal or exceed the monomethod values in two out of three cases, and exceed all heterotrait-heteromethod control values. These results illustrate the general conclusions reached by Thorndike in his factor analysis of the whole 10 X 10 matrix.

The data of Table 4 could be used to validate specific forms of cognitive functioning, as measured by the different "methods" represented by usual intelligence test content on the one hand and social content on the other. Table 5 rearranges the 15 values for this purpose. The monomethod values and the validity diagonals exchange places, while the heterotrait-heteromethod control coefficients are the same in both tables.


TABLE 5
MEMORY, COMPREHENSION, AND VOCABULARY MEASURED WITH SOCIAL AND ABSTRACT CONTENT

                                                    Social Content     Abstract Content
                                                    A1   B1   C1       A2   B2   C2

Social Content
  Memory (Memory for Names and Faces)        A1    (  )
  Comprehension (Sense of Humor)             B1     .30  (  )
  Vocabulary (Recognition of Mental State)   C1     .23  .31

Abstract Content
  Memory (Learning Ability)                  A2     .31  .31  .35     (  )
  Comprehension                              B2     .29  .48  .35      .38  (  )
  Vocabulary                                 C2     .30  .40  .47      .58  .48

As judged against these latter values, comprehension (.48) and vocabulary (.47), but not memory (.31), show some specific validity. This transmutability of the validation matrix argues for the comparisons within the heteromethod block as the most generally relevant validation data, and illustrates the potential interchangeability of trait and method components.

Some of the correlations in Chi's (1937) prodigious study of halo effect in ratings are appropriate to a multitrait-multimethod matrix in which each rater might be regarded as representing a different method. While the published report does not make these available in detail because it employs averaged values, it is apparent from a comparison of his Tables IV and VIII that the ratings generally failed to meet the requirement that ratings of the same trait by different raters should correlate higher than ratings of different traits by the same rater. Validity is shown to the extent that of the correlations in the heteromethod block, those in the validity diagonal are higher than the average heteromethod-heterotrait values.

A conspicuously unsuccessful multitrait-multimethod matrix is provided by Campbell (1953, 1956) for rating of the leadership behavior of officers by themselves and by their subordinates. Only one of 11 variables (Recognition Behavior) met the requirement of providing a validity diagonal value higher than any of the heterotrait-heteromethod values, that validity being .29. For none of the variables were the validities higher than heterotrait-monomethod values.

A study of attitudes toward authority and nonauthority figures by Burwen and Campbell (1957) contains a complex multitrait-multimethod matrix, one symmetrical excerpt from which is shown in Table 6. Method variance was strong for most of the procedures in this study. Where validity was found, it was primarily at the level of validity diagonal values higher than heterotrait-heteromethod values. As illustrated in Table 6, attitude toward father showed this kind of validity, as did attitude toward peers to a lesser degree. Attitude toward boss showed no validity. There was no evidence of a generalized attitude toward authority which would include father and boss, although such values as the


TABLE 6
ATTITUDES TOWARD FATHER, BOSS, AND PEER, AS MEASURED BY INTERVIEW AND CHECK-LIST OF DESCRIPTIVE TRAITS

                          Interview             Trait Check-List
                          A1   B1   C1          A2   B2   C2

Interview (N = 57)
  Father  A1            (  )
  Boss    B1             .64  (  )
  Peer    C1             .65  .76  (  )

Trait Check-List (N = 155)
  Father  A2             .40  .08  .09        (.24)
  Boss    B2             .19 -.10 -.03         .23 (.34)
  Peer    C2             .27  .11  .23         .21  .45 (.55)

.64 correlation between father and boss as measured by interview might have seemed to confirm the hypothesis had they been encountered in isolation.

Borgatta (1954) has provided a complex multimethod study from which can be extracted Table 7, illustrating the assessment of two traits by four different methods. For all measures but one, the highest correlation is the apparatus one, i.e., with the other trait measured by the same method rather than with the same trait measured by a different method. Neither of the traits finds

TABLE 7
MULTIPLE MEASUREMENT OF TWO SOCIOMETRIC TRAITS
(N = 125)

                                   Sociometric                   Observation
                              by Others   by Self     Group Interaction  Role Playing
                              A1   B1     A2   B2         A3   B3          A4   B4

Sociometric by Others
  Popularity      A1        (  )
  Expansiveness   B1         .47  (  )

Sociometric by Self
  Popularity      A2         .19  .18   (  )
  Expansiveness   B2         .07  .08    .32  (  )

Observation of Group Interaction
  Popularity      A3         .25  .18    .26  .11       (  )
  Expansiveness   B3         .21  .12    .28  .15        .84  (  )

Observation of Role Playing
  Popularity      A4         .24  .14    .18  .01        .66  .58        (  )
  Expansiveness   B4         .25  .12    .26  .05        .66  .76         .73  (  )


any consistent validation by the requirement that the validity diagonals exceed the heterotrait-heteromethod control values. As a most minimal requirement, it might be asked if the sum of the two values in the validity diagonal exceeds the sum of the two control values, providing a comparison in which differences in reliability or communality are roughly partialled out. This condition is achieved at the purely chance level of three times in the six tetrads. This matrix provides an interesting range of methodological independence. The two "Sociometric by Others" measures, while representing the judgments of the same set of fellow participants, come from distinct tasks: Popularity is based upon each participant's expression of his own friendship preferences, while Expansiveness is based upon each participant's guesses as to the other participants' choices, from which has been computed each participant's reputation for liking lots of other persons, i.e., being "expansive." In line with this considerable independence, the evidence for a method factor is relatively low in comparison with the observational procedures. Similarly, the two "Sociometric by Self" measures represent quite separate tasks, Popularity coming from his estimates of the choices he will receive from others, Expansiveness from the number of expressions of attraction to others which he makes on the sociometric task. In contrast, the measures of Popularity and Expansiveness from the observations of group interaction and the role playing not only involve the same specific observers, but in addition the observers rated the pair of variables as a part of the same rating task in each situation. The apparent degree of method variance within each of the two observational situations, and the apparent sharing of method variance between them, is correspondingly high.
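The tetrad comparison suggested above can be sketched as a one-line check (our code, not Borgatta's or the paper's); the two examples use Table 7 blocks pairing the sociometric measures with Observation of Group Interaction.

```python
def tetrad_passes(block):
    """Minimal tetrad check (our sketch) on a 2x2 heteromethod block
    [[r_AA', r_AB'], [r_BA', r_BB']]: does the sum of the two
    validity-diagonal values exceed the sum of the two heterotrait
    control values?"""
    (r_aa, r_ab), (r_ba, r_bb) = block
    return (r_aa + r_bb) > (r_ab + r_ba)

# Sociometric by Others vs. Observation of Group Interaction:
print(tetrad_passes([[.25, .18], [.21, .12]]))  # False: .37 vs. .39
# Sociometric by Self vs. Observation of Group Interaction:
print(tetrad_passes([[.26, .11], [.28, .15]]))  # True: .41 vs. .39
```

Rough partialling of reliability differences comes from the sums: each trait's unreliability depresses one validity value and one control value alike, so it largely cancels in the comparison.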

In another paper by Borgatta (1955), 12 interaction process variables were measured by quantitative observation under two conditions, and by a projective test. In this test, the stimuli were pictures of groups, for which the S generated a series of verbal interchanges; these were then scored in Interaction Process Analysis categories. For illustrative purposes, Table 8 presents the five traits which had the highest mean communalities in the over-all factor analysis. Between the two highly similar observational methods, validation is excellent: trait variance runs higher than method variance; validity diagonals are in general higher than heterotrait values of both the heteromethod and monomethod blocks, most unexceptionably so for Gives Opinion and Gives Orientation. The pattern of correlation among the traits is also in general confirmed.

Of greater interest because of the greater independence of methods are the blocks involving the projective test. Here the validity picture is much poorer. Gives Orientation comes off best, its projective test validity values of .35 and .33 being bested by only three monomethod values and by no heterotrait-heteromethod values within the projective blocks. All of the other validities are exceeded by some heterotrait-heteromethod value.

The projective test specialist may object to the implicit expectations of a one-to-one correspondence between projected action and overt action. Such expectations should not be attributed to Borgatta, and are not necessary to the method here proposed. For the simple symmetrical model of this paper, it has been as-

TABLE 8
INTERACTION PROCESS VARIABLES IN OBSERVED FREE BEHAVIOR, OBSERVED ROLE PLAYING, AND A PROJECTIVE TEST
(N = 125)

                               Free Behavior             Role Playing              Projective Test
                          A1   B1   C1   D1   E1    A2   B2   C2   D2   E2    A3   B3   C3   D3   E3

Free Behavior
  Shows solidarity    A1 (  )
  Gives suggestion    B1  .25 (  )
  Gives opinion       C1  .13  .24 (  )
  Gives orientation   D1 -.14  .26  .52 (  )
  Shows disagreement  E1  .34  .41  .27  .02 (  )

Role Playing
  Shows solidarity    A2  .43  .43  .08  .10  .29  (  )
  Gives suggestion    B2  .16  .32  .00  .24  .07   .37 (  )
  Gives opinion       C2  .15  .27  .60  .38  .12   .01  .10 (  )
  Gives orientation   D2 -.12  .24  .44  .74  .08   .04  .18  .40 (  )
  Shows disagreement  E2  .51  .36  .14 -.12  .50   .39  .27  .23 -.11 (  )

Projective Test
  Shows solidarity    A3  .20  .17  .16  .12  .08   .17  .12  .30  .17  .22  (  )
  Gives suggestion    B3  .05  .21  .05  .08  .13   .10  .19 -.02  .06  .30   .32 (  )
  Gives opinion       C3  .31  .30  .13 -.02  .26   .25  .19  .15 -.04  .53   .31  .63 (  )
  Gives orientation   D3 -.01  .09  .30  .35 -.05   .03  .00  .19  .33  .00   .37  .29  .32 (  )
  Shows disagreement  E3  .13  .18  .10  .14  .19   .22  .28  .02  .04  .23   .27  .51  .47  .30 (  )


TABLE 9
MAYO'S INTERCORRELATIONS BETWEEN OBJECTIVE AND RATING MEASURES OF INTELLIGENCE AND EFFORT
(N = 166)

                            Peer Ratings      Objective
                            A1    B1          A2    B2

Peer Rating
  Intelligence  A1        (.85)
  Effort        B1         .66  (.84)

Objective Measures
  Intelligence  A2         .46   .29        (  )
  Effort        B2         .46   .40         .10  (  )

sumed that the measures are labeledin correspondence with the correla­tions expected, i.e., in correspondencewith the traits that the tests arealleged to diagnose. Note that inTable 8, Gives Opinion is the bestprojective test predictor of both freebehavior and role playing Shows Dis­agreement. Were a proper theoreticalrationale available, these valuesmight be regarded as validities.

Mayo (1956) has n1ade an analysisof test scores and ratings of effort andintelligence, to estimate the contribu­tion of halo (a kind of methods vari­ance) to ratings. As Table 9 shows,the validity picture is ambiguous.The method factor or halo effect forratings is considerable although thecorrelation between the two ratings(.66) is well below their reliabili ties

(.84 and .85). The objective meas­ures share no appreciable apparatusoverlap because they were independ­ent operations. In spite of Mayo'sargument that the ratings have somevalid trait variance, the .46 hetero­trait-heteromethod value seriouslyde­preciates the otherwise impressive .46and .40 validity values.
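Comparisons of this kind are easy to mechanize. The following fragment is an illustrative sketch (ours, not part of the original paper; the function name is hypothetical): given one heteromethod block of a multitrait-multimethod matrix, it separates the validity diagonal from the heterotrait-heteromethod values, using Mayo's Objective Measures × Peer Ratings block from Table 9 as input.

```python
import numpy as np

def heteromethod_summary(block):
    """Summarize one heteromethod block of an MTMM matrix.

    `block` is a square (traits x traits) array of correlations between
    the same traits measured by two different methods.  The diagonal
    holds the validity coefficients; the off-diagonal cells are the
    heterotrait-heteromethod values.
    """
    block = np.asarray(block, dtype=float)
    validities = np.diag(block)
    off_diagonal = block[~np.eye(block.shape[0], dtype=bool)]
    return validities, off_diagonal

# Mayo's heteromethod block (Table 9): rows are the objective measures
# (Intelligence, Effort), columns the peer ratings of the same traits.
mayo_block = [[.46, .29],
              [.46, .40]]
val, het = heteromethod_summary(mayo_block)
print(val)        # validity diagonal: .46 and .40
print(het.max())  # largest heterotrait-heteromethod value: .46
```

The .46 heterotrait-heteromethod value equalling the larger validity is exactly the ambiguity discussed in the text.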

Cronbach (1949, p. 277) and Vernon (1957, 1958) have both discussed the multitrait-multimethod matrix shown in Table 10, based upon data originally presented by H. S. Conrad. Using an approximative technique, Vernon estimates that 61% of the systematic variance is due to a general factor, that 21½% is due to the test-form factors specific to verbal or to pictorial forms of items, and that but 11½% is due to the content factors specific to electrical or to mechanical contents.

TABLE 10
MECHANICAL AND ELECTRICAL FACTS MEASURED BY VERBAL AND PICTORIAL ITEMS

                             Verbal Items      Pictorial Items
                             A1      B1        A2      B2
Verbal Items
  Mechanical Facts    A1    (.89)
  Electrical Facts    B1     .63   (.71)

Pictorial Items
  Mechanical Facts    A2     .61    .45       (.82)
  Electrical Facts    B2     .49    .51        .64   (.67)

VALIDATION BY THE MULTITRAIT-MULTIMETHOD MATRIX 93

Note that for the purposes of estimating validity, the interpretation of the general factor, which he estimates from the .49 and .45 heterotrait-heteromethod values, is equivocal. It could represent desired competence variance, representing components common to both electrical and mechanical skills, perhaps resulting from general industrial shop experience, common ability components, overlapping learning situations, and the like. On the other hand, this general factor could represent overlapping method factors, and be due to the presence in both tests of multiple-choice item format, IBM answer sheets, or the heterogeneity of the Ss in conscientiousness, test-taking motivation, and test-taking sophistication. Until methods that are still more different and traits that are still more independent are introduced into the validation matrix, this general factor remains uninterpretable. From this standpoint it can be seen that 21½% is a very minimal estimate of the total test-form variance in the tests, as it represents only test-form components specific to the verbal or the pictorial items, i.e., test-form components which the two forms do not share. Similarly, and more hopefully, the 11½% content variance is a very minimal estimate of the total true trait variance of the tests, representing only the true trait variance which electrical and mechanical knowledge do not share.

Carroll (1952) has provided data on the Guilford-Martin Inventory of Factors STDCR and related ratings which can be rearranged into the matrix of Table 11. (Variable R has been inverted to reduce the number of negative correlations.) Two of the methods, Self Ratings and Inventory scores, can be seen as sharing method variance, and thus as having an inflated validity diagonal. The more independent heteromethod blocks involving Peer Ratings show some evidence of discriminant and convergent validity, with validity diagonals averaging .33 (Inventory × Peer Ratings) and .39 (Self Ratings × Peer Ratings) against heterotrait-heteromethod control values averaging .14 and .16. While not intrinsically impressive, this picture is nonetheless better than most of the validity matrices here assembled. Note that the Self Ratings show slightly higher validity diagonal elevations than do the Inventory scores, in spite of the much greater length and undoubtedly higher reliability of the latter. In addition, a method factor seems almost totally lacking for the Self Ratings, while strongly present for the Inventory, so that the Self Ratings come off much the best if true trait variance is expressed as a proportion of total reliable variance (as Vernon [1958] suggests). The method factor in the STDCR Inventory is undoubtedly enhanced by scoring the same item in several scales, thus contributing correlated error variance, which could be reduced without loss of reliability by the simple expedient of adding more equivalent items and scoring each item in only one scale. It should be noted that Carroll makes explicit use of the comparison of the validity diagonal with the heterotrait-heteromethod values as a validity indicator.

TABLE 11
GUILFORD-MARTIN FACTORS STDCR AND RELATED RATINGS
(N = 110)

                 Inventory                      Self Ratings                   Peer Ratings
              S    T    D    C   -R         S    T    D    C   -R          S    T    D    C   -R
Inventory
  S         (.92)
  T          .27 (.89)
  D          .62  .57 (.91)
  C          .36  .47  .90 (.91)
  -R         .69  .32  .28 -.06 (.89)

Self Ratings
  S          .57  .11  .19 -.01  .53      ( )
  T          .28  .65  .42  .26  .37      .26  ( )
  D          .44  .25  .53  .45  .29      .31  .32  ( )
  C          .31  .20  .54  .52  .13      .11  .21  .47  ( )
  -R         .15  .30  .12  .04  .34      .10  .12  .04  .06  ( )

Peer Ratings
  S          .37  .08  .10 -.01  .38      .42  .02  .08  .08  .31      (.81)
  T          .23  .32  .15  .04  .40      .20  .39  .40  .21  .31       .37 (.66)
  D          .31  .11  .27  .24  .25      .17  .09  .29  .27  .30       .49  .38 (.73)
  C          .08  .15  .20  .26 -.05      .01  .06  .14  .30  .07       .19  .16  .40 (.75)
  -R         .21  .20 -.03 -.16  .45      .28  .17  .08  .01  .56       .55  .56  .34 -.07 (.76)

RATINGS IN THE ASSESSMENT STUDY OF CLINICAL PSYCHOLOGISTS

The illustrations of multitrait-multimethod matrices presented so far give a rather sorry picture of the validity of the measures of individual differences involved. The typical case shows an excessive amount of method variance, which usually exceeds the amount of trait variance. This picture is certainly not a result of a deliberate effort to select shockingly bad examples: these are ones we have encountered without attempting an exhaustive coverage of the literature. The several unpublished studies of which we are aware show the same picture. If they seem more disappointing than the general run of validity data reported in the journals, this impression may very well be because the portrait of validity provided by isolated values plucked from the validity diagonal is deceptive, and uninterpretable in isolation from the total matrix. Yet it is clear that few of the classic examples of successful measurement of individual differences are involved, and that in many of the instances, the quality of the data might have been such as to magnify apparatus factors, etc. A more nearly ideal set of personality data upon which to illustrate the method was therefore sought in the multiple application of a set of rating scales in the assessment study of clinical psychologists (Kelly & Fiske, 1951).

In that study, "Rating Scale A" contained 22 traits referring to "behavior which can be directly observed on the surface." In using this scale the raters were instructed to "disregard any inferences about underlying dynamics or causes" (p. 207). The Ss, first-year clinical psychology students, rated themselves and also their three teammates with whom they had participated in the various assessment procedures and with whom they had lived for six days. The median of the three teammates' ratings was used for the Teammate score. The Ss were also rated on these 22 traits by the assessment staff. Our analysis uses the Final Pooled ratings, which were agreed upon by three staff members after discussion and review of the enormous amount of data and the many other ratings on each S. Unfortunately for our purposes, the staff members saw the ratings by Self and Teammates before making theirs, although presumably they were little influenced by these data because they had so much other evidence available to them. (See Kelly & Fiske, 1951, especially p. 64.) The Self and Teammate ratings represent entirely separate "methods" and can be given the major emphasis in evaluating the data to be presented.

In a previous analysis of these data (Fiske, 1949), each of the three heterotrait-monomethod triangles was computed and factored. To provide a multitrait-multimethod matrix, the 1452 heteromethod correlations have been computed especially for this report.2 The full 66 × 66 matrix with its 2145 coefficients is obviously too large for presentation here, but will be used in analyses that follow. To provide an illustrative sample, Table 12 presents the interrelationships among five variables, selecting the one best representing each of the five recurrent factors discovered in Fiske's (1949) previous analysis of the monomethod matrices. (These were chosen without regard to their validity as indicated in the heteromethod blocks. Assertive, No. 3 reflected, was selected to represent Recurrent Factor 5 because Talkative had also a high loading on the first recurrent factor.)

2 We are indebted to E. Lowell Kelly for furnishing the V.A. assessment data to us, and to Hugh Lane for producing the matrix of intercorrelations. In the original report the correlations were based upon 128 men. The present analyses were based on only 124 of these cases because of clerical errors. This reduction in N leads to some very minor discrepancies between these values and those previously reported.

TABLE 12
RATINGS FROM ASSESSMENT STUDY OF CLINICAL PSYCHOLOGISTS
(N = 124)

                              A1   B1   C1   D1   E1     A2   B2   C2   D2   E2     A3   B3   C3   D3   E3
Staff Ratings
  Assertive             A1  (.89)
  Cheerful              B1   .37 (.85)
  Serious               C1  -.24 -.14 (.81)
  Unshakable Poise      D1   .25  .46  .08 (.84)
  Broad Interests       E1   .35  .19  .09  .31 (.92)

Teammate Ratings
  Assertive             A2   .71  .35 -.18  .26  .41   (.82)
  Cheerful              B2   .39  .53 -.15  .38  .29    .37 (.76)
  Serious               C2  -.27 -.31  .43 -.06  .03   -.15 -.19 (.70)
  Unshakable Poise      D2   .03 -.05  .03  .20  .07    .11  .23  .19 (.74)
  Broad Interests       E2   .19  .05  .04  .29  .47    .33  .22  .19  .29 (.76)

Self Ratings
  Assertive             A3   .48  .31 -.22  .19  .12    .46  .36 -.15  .12  .23    ( )
  Cheerful              B3   .17  .42 -.10  .10 -.03    .09  .24 -.25 -.11 -.03    .23  ( )
  Serious               C3  -.04 -.13  .22 -.13 -.05   -.04 -.11  .31  .06  .06   -.05 -.12  ( )
  Unshakable Poise      D3   .13  .27 -.03  .22 -.04    .10  .15  .00  .14 -.03    .16  .26  .11  ( )
  Broad Interests       E3   .37  .15 -.22  .09  .26    .27  .12 -.07  .05  .35    .21  .15  .17  .31  ( )

The picture presented in Table 12 is, we believe, typical of the best validity in personality trait ratings that psychology has to offer at the present time. It is comforting to note that the picture is better than most of those previously examined. Note that the validities for Assertive exceed heterotrait values of both the monomethod and heteromethod triangles. Cheerful, Broad Interests, and Serious have validities exceeding the heterotrait-heteromethod values with two exceptions. Only for Unshakable Poise does the evidence of validity seem trivial. The elevation of the reliabilities above the heterotrait-monomethod triangles is further evidence for discriminant validity.

A comparison of Table 12 with the full matrix shows that the procedure of having but one variable to represent each factor has enhanced the appearance of validity, although not necessarily in a misleading fashion. Where several variables are all highly loaded on the same factor, their "true" level of intercorrelation is high. Under these conditions, sampling errors can depress validity diagonal values and enhance others to produce occasional exceptions to the validity picture, both in the heterotrait-monomethod matrix and in the heteromethod-heterotrait triangles. In this instance, with an N of 124, the sampling error is appreciable, and may thus be expected to exaggerate the degree of invalidity.

Within the monomethod sections, errors of measurement will be correlated, raising the general level of values found, while within the heteromethod blocks, measurement errors are independent, and tend to lower the values both along the validity diagonal and in the heterotrait triangles. These effects, which may also be stated in terms of method factors or shared confounded irrelevancies, operate strongly in these data, as probably in all data involving ratings. In such cases, where several variables represent each factor, none of the variables consistently meets the criterion that validity values exceed the corresponding values in the monomethod triangles, when the full matrix is examined.

To summarize the validation picture with respect to comparisons of validity values with other heteromethod values in each block, Table 13 has been prepared. For each trait and for each of the three heteromethod blocks, it presents the value of the validity diagonal, the highest heterotrait value involving that trait, and the number out of the 42 such heterotrait values which exceed the validity diagonal in magnitude. (The number 42 comes from the grouping of the 21 other column values and the 21 other row values for the column and row intersecting at the given diagonal value.)
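The summary just described can be computed mechanically. The sketch below is ours, not the authors' (the function name and the three-trait demonstration values are hypothetical); for one trait in a heteromethod block it returns the three quantities tabulated per block in Table 13.

```python
import numpy as np

def validity_summary(block, trait):
    """For one trait in a (traits x traits) heteromethod block, return:
    the validity value (the diagonal entry), the highest heterotrait
    value involving that trait, and the number of the 2*(t-1) values in
    its row and column that exceed the validity value.  With t = 22
    traits, as in Table 13, there are 42 comparison values per trait."""
    block = np.asarray(block, dtype=float)
    validity = block[trait, trait]
    others = np.concatenate([np.delete(block[trait, :], trait),
                             np.delete(block[:, trait], trait)])
    return validity, others.max(), int((others > validity).sum())

# Illustrative 3-trait heteromethod block (hypothetical values):
demo = [[.50, .20, .10],
        [.25, .15, .30],
        [.05, .40, .45]]
for t in range(3):
    print(validity_summary(demo, t))
```

Applied to each trait and each of the three heteromethod blocks of the full 66 × 66 matrix, this reproduces the layout of Table 13.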

TABLE 13
VALIDITIES OF TRAITS IN THE ASSESSMENT STUDY OF CLINICAL PSYCHOLOGISTS, AS JUDGED BY THE HETEROMETHOD COMPARISONS

                               Staff-Teammate          Staff-Self             Teammate-Self
                              Val.  Highest  No.      Val.  Highest  No.     Val.  Highest  No.
                                    Het.   Higher           Het.   Higher          Het.   Higher
 1. Obstructiveness*          .30   .34      2        .16   .27      9      .19   .24      1
 2. Unpredictable             .34   .26      0        .18   .24      3      .05   .19     29
 3. Assertive*                .71   .65      0        .48   .45      0      .46   .48      1
 4. Cheerful*                 .53   .60      2        .42   .40      0      .24   .38      5
 5. Serious*                  .43   .35      0        .22   .27      2      .31   .24      0
 6. Cool, Aloof               .49   .48      0        .20   .46     10      .02   .34     36
 7. Unshakable Poise          .20   .40     16        .22   .27      4      .14   .19     10
 8. Broad Interests*          .47   .46      0        .26   .37      6      .35   .32      0
 9. Trustful                  .26   .34      5        .08   .25     19      .11   .17      9
10. Self-centered             .30   .34      2        .17   .27      6     -.07   .19     36
11. Talkative*                .82   .65      0        .47   .45      0      .43   .48      1
12. Adventurous               .45   .60      6        .28   .30      2      .16   .36     14
13. Socially Awkward          .45   .37      0        .06   .21     28      .04   .16     30
14. Adaptable*                .44   .40      0        .18   .23     10      .17   .29      8
15. Self-sufficient*          .32   .33      1        .13   .18      5      .18   .15      0
16. Worrying, Anxious*        .41   .37      0        .23   .33      5      .15   .16      1
17. Conscientious             .26   .33      4        .11   .32     19      .21   .23      2
18. Imaginative*              .43   .46      1        .32   .31      0      .36   .32      0
19. Interest in Women*        .42   .43      2        .55   .38      0      .37   .40      1
20. Secretive, Reserved*      .40   .58      5        .38   .40      2      .32   .35      3
21. Independent Minded        .39   .42      2        .08   .25     19      .21   .30      3
22. Emotional Expression*     .62   .63      1        .31   .46      5      .19   .34     10

Note: Val. = value in validity diagonal; Highest Het. = highest heterotrait value; No. Higher = number of heterotrait values exceeding the validity diagonal.
* Trait names which have validities in all three heteromethod blocks significantly greater than the heterotrait-heteromethod values at the .001 level.

On the requirement that the validity diagonal exceed all others in its heteromethod block, none of the traits has a completely perfect record, although some come close. Assertive has only one trivial exception, in the Teammate-Self block. Talkative has almost as good a record, as does Imaginative. Serious has but two inconsequential exceptions and Interest in Women three. These traits stand out as highly valid in both self-description and reputation. Note that the actual validity coefficients of these four traits range from but .22 to .82, or, if we concentrate on the Teammate-Self block as most certainly representing independent methods, from but .31 to .46. While these are the best traits, it seems that most of the traits have far above chance validity. All those having 10 or fewer exceptions have a degree of validity significant at the .001 level as crudely estimated by a one-tailed sign test.3 All but one of the variables meet this level for the Staff-Teammate block, all but four for the Staff-Self block, all but five for the most independent block, Teammate-Self. The exceptions to significant validity are not parallel from column to column, however, and only 13 of 22 variables have .001 significant validity in all three blocks. These are indicated by an asterisk in Table 13.

3 If we take the validity value as fixed (ignoring its sampling fluctuations), then we can determine whether the number of values larger than it in its row and column is less than expected on the null hypothesis that half the values would be above it. This procedure requires the assumption that the position (above or below the validity value) of any one of these comparison values is independent of the position of each of the others, a dubious assumption when common methods and trait variance are present.
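The crude one-tailed sign test described in the footnote can be reconstructed as follows: under the null hypothesis each of the 42 comparison values is equally likely to fall above or below the fixed validity value, so the count of exceedances is referred to a binomial distribution with n = 42 and p = ½. This fragment is an illustrative reconstruction, not the authors' own computation.

```python
from math import comb

def sign_test_p(n_higher, n_comparisons=42):
    """One-tailed sign test: probability, under the null hypothesis that
    each comparison value falls above the validity value with p = 1/2,
    of observing `n_higher` or fewer exceedances in `n_comparisons`."""
    return sum(comb(n_comparisons, k)
               for k in range(n_higher + 1)) / 2 ** n_comparisons

# With 42 comparisons, 10 or fewer exceedances is significant at .001:
print(sign_test_p(10))  # about 0.00047
print(sign_test_p(11))  # about 0.0014, no longer below .001
```

This matches the cutoff used in the text: traits with 10 or fewer exceptions are counted as significant at the .001 level.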

This highly significant general level of validity must not obscure the meaningful problem created by the occasional exceptions, even for the best variables. The excellent traits of Assertive and Talkative provide a case in point. In terms of Fiske's original analysis, both have high loadings on the recurrent factor "Confident self-expression" (represented by Assertive in Table 12). Talkative also had high loadings on the recurrent factor of Social Adaptability (represented by Cheerful in Table 12). We would expect, therefore, both high correlation between them and significant discrimination as well. And even at the common-sense level, most psychologists would expect fellow psychologists to discriminate validly between assertiveness (nonsubmissiveness) and talkativeness. Yet in the Teammate-Self block, Assertive rated by self correlates .48 with Talkative by teammates, higher than either of their validities in this block, .43 and .46.

In terms of the average values of the validities and the frequency of exceptions, there is a distinct trend for the Staff-Teammate block to show the greatest agreement. This can be attributed to several factors. Both represent ratings from the external point of view. Both are averaged over three judges, minimizing individual biases and undoubtedly increasing reliabilities. Moreover, the Teammate ratings were available to the Staff in making their ratings. Another effect contributing to the less adequate convergence and discrimination of Self ratings was a response set toward the favorable pole which greatly reduced the range of these measures (Fiske, 1949, p. 342). Inspection of the details of the instances of invalidity summarized in Table 13 shows that in most instances the effect is attributable to the high specificity and low communality of the self-rating trait. In these instances, the column and row intersecting at the low validity diagonal are asymmetrical as far as general level of correlation is concerned, a fact covered over by the condensation provided in Table 13.

The personality psychologist is initially predisposed to reinterpret self-ratings, to treat them as symptoms rather than to interpret them literally. Thus, we were alert to instances in which the self ratings were not literally interpretable, yet nonetheless had a diagnostic significance when properly "translated." By and large, the instances of invalidity of self-descriptions found in this assessment study are not of this type, but rather are to be explained in terms of an absence of communality for one of the variables involved. In general, where these self descriptions are interpretable at all, they are as literally interpretable as are teammate descriptions. Such a finding may, of course, reflect a substantial degree of insight on the part of these Ss.

The general success in discriminant validation, coupled with the parallel factor patterns found in Fiske's earlier analysis of the three intramethod matrices, seemed to justify an inspection of the factor pattern validity in this instance. One possible procedure would be to do a single analysis of the whole 66 × 66 matrix. Other approaches focused upon separate factoring of heteromethod blocks, matrix by matrix, could also be suggested. Not only would such methods be extremely tedious, but in addition they would leave undetermined the precise comparison of factor-pattern similarity. Correlating factor loadings over the population of variables was employed for this purpose by Fiske (1949), but while this provided for the identification of recurrent factors, no single over-all index of factor pattern similarity was generated. Since our immediate interest was in confirming a pattern of interrelationships, rather than in describing it, an efficient short cut was available: namely, to test the similarity of the sets of heterotrait values by correlation coefficients in which each entry represented the size values of the given heterotrait coefficients in two different matrices. For the full matrix, such correlations would be based upon the N of the 22 × 21/2 or 231 specific heterotrait combinations. Correlations were computed between the Teammate and Self monomethod matrices, selected as maximally independent. (The values to follow were computed from the original correlation matrix and are somewhat higher than those which would be obtained from a reflected matrix.) The similarity between the two monomethod matrices was .84, corroborating the factor-pattern similarity between these matrices described more fully by Fiske in his parallel factor analyses of them. To carry this mode of analysis into the heteromethod block, this block was treated as though divided into two by the validity diagonal, the above-diagonal and the below-diagonal values representing the maximally independent validation of the heterotrait correlation pattern. These two correlated .63, a value which, while lower, shows an impressive degree of confirmation. There remains the question as to whether this pattern upon which the two heteromethod-heterotrait triangles agree is the same one found in common between the two monomethod triangles. The intra-Teammate matrix correlated with the two heteromethod triangles .71 and .71. The intra-Self matrix correlated with the two .57 and .63. In general, then, there is evidence for validity of the intertrait relationship pattern.
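The short-cut index just described, correlating corresponding heterotrait values from two matrices, can be sketched as follows. This is our illustrative reconstruction, not the authors' code; the function name and the three-trait toy matrices are hypothetical.

```python
import numpy as np

def pattern_similarity(m1, m2):
    """Correlate the corresponding off-diagonal (heterotrait) entries of
    two symmetric trait x trait correlation matrices.  For 22 traits
    this uses the N = 22*21/2 = 231 specific heterotrait combinations."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    iu = np.triu_indices(m1.shape[0], k=1)  # upper triangle, no diagonal
    return np.corrcoef(m1[iu], m2[iu])[0, 1]

# Toy 3x3 monomethod matrices (hypothetical values):
teammate = [[1.00, .37, -.15],
            [.37, 1.00, -.19],
            [-.15, -.19, 1.00]]
self_r = [[1.00, .23, -.05],
          [.23, 1.00, -.12],
          [-.05, -.12, 1.00]]
print(pattern_similarity(teammate, self_r))
```

The same index, applied to the two halves of a heteromethod block split along the validity diagonal, gives the .63 confirmation value reported in the text.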

DISCUSSION

Relation to construct validity. While the validational criteria presented are explicit or implicit in the discussions of construct validity (Cronbach & Meehl, 1955; APA, 1954), this paper is primarily concerned with the adequacy of tests as measures of a construct rather than with the adequacy of a construct as determined by the confirmation of theoretically predicted associations with measures of other constructs. We believe that before one can test the relationships between a specific trait and other traits, one must have some confidence in one's measures of that trait. Such confidence can be supported by evidence of convergent and discriminant validation. Stated in different words, any conceptual formulation of a trait will usually include implicitly the proposition that this trait is a response tendency which can be observed under more than one experimental condition and that this trait can be meaningfully differentiated from other traits. The testing of these two propositions must be prior to the testing of other propositions to prevent the acceptance of erroneous conclusions. For example, a conceptual framework might postulate a large correlation between Traits A and B and no correlation between Traits A and C. If the experimenter then measures A and B by one method (e.g., questionnaire) and C by another method (such as the measurement of overt behavior in a situation test), his findings may be consistent with his hypotheses solely as a function of method variance common to his measures of A and B but not to C.
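The confound in this example can be made concrete with a small simulation. In the sketch below (ours; the variance components are arbitrary), traits A and B are measured by a questionnaire contributing a shared method factor while C is measured by a situation test; the three true traits are mutually uncorrelated, yet the A and B scores correlate substantially.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Three independent true traits and a questionnaire method factor:
trait_a, trait_b, trait_c = rng.standard_normal((3, n))
questionnaire_factor = rng.standard_normal(n)

# A and B share the questionnaire's method variance; C does not.
score_a = trait_a + questionnaire_factor + 0.5 * rng.standard_normal(n)
score_b = trait_b + questionnaire_factor + 0.5 * rng.standard_normal(n)
score_c = trait_c + 0.5 * rng.standard_normal(n)

print(np.corrcoef(score_a, score_b)[0, 1])  # substantial, despite zero true correlation
print(np.corrcoef(score_a, score_c)[0, 1])  # near zero
```

With these (arbitrary) components, the expected observed correlation between A and B is 1/2.25, roughly .44, produced entirely by method variance; the postulated pattern "A correlates with B but not with C" is thus "confirmed" spuriously.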

The requirements of this paper are intended to be as appropriate to the relatively atheoretical efforts typical of the tests and measurements field as to more theoretical efforts. This emphasis on validational criteria appropriate to our present atheoretical level of test construction is not at all incompatible with a recognition of the desirability of increasing the extent to which all aspects of a test and the testing situation are determined by explicit theoretical considerations, as Jessor and Hammond have advocated (Jessor & Hammond, 1957).

Relation to operationalism. Underwood (1957, p. 54), in his effective presentation of the operationalist point of view, shows a realistic awareness of the amorphous type of theory with which most psychologists work. He contrasts a psychologist's "literary" conception with the latter's operational definition as represented by his test or other measuring instrument. He recognizes the importance of the literary definition in communicating and generating science. He cautions that the operational definition "may not at all measure the process he wishes to measure; it may measure something quite different" (1957, p. 55). He does not, however, indicate how one would know when one was thus mistaken.

The requirements of the present paper may be seen as an extension of the kind of operationalism Underwood has expressed. The test constructor is asked to generate from his literary conception or private construct not one operational embodiment, but two or more, each as different in research vehicle as possible. Furthermore, he is asked to make explicit the distinction between his new variable and other variables, distinctions which are almost certainly implied in his literary definition. In his very first validational efforts, before he ever rushes into print, he is asked to apply the several methods and several traits jointly. His literary definition, his conception, is now best represented in what his independent measures of the trait hold distinctively in common. The multitrait-multimethod matrix is, we believe, an important practical first step in avoiding "the danger ... that the investigator will fall into the trap of thinking that because he went from an artistic or literary conception ... to the construction of items for a scale to measure it, he has validated his artistic conception" (Underwood, 1957, p. 55). In contrast with the single operationalism now dominant in psychology, we are advocating a multiple operationalism, a convergent operationalism (Garner, 1954; Garner, Hake, & Eriksen, 1956), a methodological triangulation (Campbell, 1953, 1956), an operational delineation (Campbell, 1954), a convergent validation.

Underwood's presentation and that of this paper as a whole imply moving from concept to operation, a sequence that is frequent in science, and perhaps typical. The same point can be made, however, in inspecting a transition from operation to construct. For any body of data taken from a single operation, there is a subinfinity of interpretations possible; a subinfinity of concepts, or combinations of concepts, that it could represent. Any single operation, as representative of concepts, is equivocal. In an analogous fashion, when we view the Ames distorted room from a fixed point and through a single eye, the data of the retinal pattern are equivocal, in that a subinfinity of hexahedrons could generate the same pattern. The addition of a second viewpoint, as through binocular parallax, greatly reduces this equivocality, greatly limits the constructs that could jointly account for both sets of data. In Garner's (1954) study, the fractionation measures from a single method were equivocal: they could have been a function of the stimulus distance being fractionated, or they could have been a function of the comparison stimuli used in the judgment process. A multiple, convergent operationalism reduced this equivocality, showing the latter conceptualization to be the appropriate one, and revealing a preponderance of methods variance. Similarly for learning studies: in identifying constructs with the response data from animals in a specific operational setup there is equivocality which can operationally be reduced by introducing transposition tests, different operations so designed as to put to comparison the rival conceptualizations (Campbell, 1954).

Garner's convergent operationalism and our insistence on more than one method for measuring each concept depart from Bridgman's early position that "if we have more than one set of operations, we have more than one concept, and strictly there should be a separate name to correspond to each different set of operations" (Bridgman, 1927, p. 10). At the current stage of psychological progress, the crucial requirement is the demonstration of some convergence, not complete congruence, between two distinct sets of operations. With only one method, one has no way of distinguishing trait variance from unwanted method variance. When psychological measurement and conceptualization become better developed, it may well be appropriate to differentiate conceptually between Trait-Method Unit A1 and Trait-Method Unit A2, in which Trait A is measured by different methods. More likely, what we have called method variance will be specified theoretically in terms of a set of constructs. (This has in effect been illustrated in the discussion above in which it was noted that the response set variance might be viewed as trait variance, and in the rearrangement of the social intelligence matrices of Tables 4 and 5.) It will then be recognized that measurement procedures usually involve several theoretical constructs in joint application. Using obtained measurements to estimate values for a single construct under this condition still requires comparison of complex measures varying in their trait composition, in something like a multitrait-multimethod matrix. Mill's joint method of similarities and differences still epitomizes much about the effective experimental clarification of concepts.

The evaluation of a multitrait-multimethod matrix. The evaluation of the correlation matrix formed by intercorrelating several trait-method units must take into consideration the many factors which are known to affect the magnitude of correlations. A value in the validity diagonal must be assessed in the light of the reliabilities of the two measures involved: e.g., a low reliability for Test A2 might exaggerate the apparent method variance in Test A1. Again, the whole approach assumes adequate sampling of individuals: the curtailment of the sample with respect to one or more traits will depress the reliability coefficients and intercorrelations involving these traits. While restriction of range over all traits produces serious difficulties in the interpretation of a multitrait-multimethod matrix and should be avoided whenever possible, the presence of different degrees of restriction on different traits is the more serious hazard to meaningful interpretation.
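One standard way to take the two reliabilities into account, consistent with the caution above though not prescribed by the authors, is Spearman's correction for attenuation, which estimates what a validity coefficient would be if both measures were perfectly reliable. In the sketch below the observed validity (.46) and the peer-rating reliability (.85) come from Table 9; the objective measure's reliability is not reported there, so the .90 is assumed purely for illustration.

```python
def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's correction for attenuation: the estimated correlation
    between true scores, given the observed correlation between two
    measures and their respective reliabilities."""
    return r_xy / (rel_x * rel_y) ** 0.5

# Mayo's intelligence validity (Table 9), peer-rating reliability .85,
# and an assumed (hypothetical) objective-test reliability of .90:
print(disattenuate(.46, .85, .90))  # roughly .53
```

Such a corrected value indicates how much of the shortfall from unity is attributable to unreliability rather than to lack of trait overlap or to method variance.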

Various statistical treatments for multitrait-multimethod matrices might be developed. We have considered rough tests for the elevation of a value in the validity diagonal above the comparison values in its row and column. Correlations between the columns for variables measuring the same trait, variance analyses, and factor analyses have been proposed to us. However, the development of such statistical methods is beyond the scope of this paper. We believe that such summary statistics are neither necessary nor appropriate at this time. Psychologists today should be concerned not with evaluating tests as if the tests were fixed and definitive, but rather with developing better tests. We believe that a careful examination of a multitrait-multimethod matrix will indicate to the experimenter what his next steps should be: it will indicate which methods should be discarded or replaced, which concepts need sharper delineation, and which concepts are poorly measured because of excessive or confounding method variance. Validity judgments based on such a matrix must take into account the stage of development of the constructs, the postulated relationships among them, the level of technical refinement of the methods, the relative independence of the methods, and any pertinent characteristics of the sample of Ss. We are proposing that the validational process be viewed as an aspect of an ongoing program for improving measuring procedures and that the "validity coefficients" obtained at any one stage in the process be interpreted in terms of gains over preceding stages and as indicators of where further effort is needed.

The design of a multitrait-multimethod matrix. The several methods and traits included in a validational matrix should be selected with care. The several methods used to measure each trait should be appropriate to the trait as conceptualized. Although this view will reduce the range of suitable methods, it will rarely restrict the measurement to one operational procedure.

Wherever possible, the several methods in one matrix should be completely independent of each other: there should be no prior reason for believing that they share method variance. This requirement is necessary to permit the values in the heteromethod-heterotrait triangles to approach zero. If the nature of the traits rules out such independence of methods, efforts should be made to obtain as much diversity as possible in terms of data sources and classification processes. Thus, the classes of stimuli or the background situations, the experimental contexts, should be different. Again, the persons providing the observations should have different roles, or the procedures for scoring should be varied.

Plans for a validational matrix should take into account the difference between the interpretations regarding convergence and discrimination. It is sufficient to demonstrate convergence between two clearly distinct methods which show little overlap in the heterotrait-heteromethod triangles. While agreement between several methods is desirable, convergence between two is a satisfactory minimal requirement. Discriminative validation is not so easily achieved. Just as it is impossible to prove the null hypothesis, or that some object does not exist, so one can never establish that a trait, as measured, is differentiated from all other traits. One can only show that this measure of Trait A has little overlap with those measures of B and C, and no dependable generalization beyond B and C can be made. For example, social poise could probably be readily discriminated from aesthetic interests, but it should also be differentiated from leadership.

Insofar as the traits are related and are expected to correlate with each other, the monomethod correlations will be substantial and heteromethod correlations between traits will also be positive. For ease of interpretation, it may be best to include in the matrix at least two traits, and preferably two sets of traits, which are postulated to be independent of each other.
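The relations among these classes of values can be made concrete with a numerical sketch (ours, not the authors'; all correlations and labels are invented): validity values should exceed the heterotrait-heteromethod values, and ideally the heterotrait-monomethod values as well.

```python
# Hypothetical correlations among two traits (A, B), each measured by two
# methods (1, 2). All values are invented for illustration.
r = {
    ("A1", "A2"): 0.55,  # validity value for Trait A (monotrait-heteromethod)
    ("B1", "B2"): 0.48,  # validity value for Trait B
    ("A1", "B2"): 0.18,  # heterotrait-heteromethod
    ("A2", "B1"): 0.21,  # heterotrait-heteromethod
    ("A1", "B1"): 0.35,  # heterotrait-monomethod (shared Method 1)
    ("A2", "B2"): 0.30,  # heterotrait-monomethod (shared Method 2)
}

validities = [r[("A1", "A2")], r[("B1", "B2")]]

# Minimal requirement: each validity value exceeds every
# heterotrait-heteromethod value.
convergent_ok = all(
    v > max(r[("A1", "B2")], r[("A2", "B1")]) for v in validities
)

# More demanding requirement: each validity value also exceeds every
# heterotrait-monomethod value.
discriminant_ok = all(
    v > max(r[("A1", "B1")], r[("A2", "B2")]) for v in validities
)

print(convergent_ok, discriminant_ok)  # prints: True True
```

With these invented values both requirements happen to be met; in practice, as the illustrations in this paper show, the monomethod comparisons frequently fail.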

In closing, a word of caution is needed. Many multitrait-multimethod matrices will show no convergent validation: no relationship may be found between two methods of measuring a trait. In this common situation, the experimenter should examine the evidence in favor of several alternative propositions: (a) Neither method is adequate for measuring the trait; (b) One of the two methods does not really measure the trait. (When the evidence indicates that a method does not measure the postulated trait, it may prove to measure some other trait. High correlations in the heterotrait-heteromethod triangles may provide hints to such possibilities.) (c) The trait is not a functional unity, the response tendencies involved being specific to the nontrait attributes of each test. The failure to demonstrate convergence may lead to conceptual developments rather than to the abandonment of a test.

SUMMARY

This paper advocates a validational process utilizing a matrix of intercorrelations among tests representing at least two traits, each measured by at least two methods. Measures of the same trait should correlate higher with each other than they do with measures of different traits involving separate methods. Ideally, these validity values should also be higher than the correlations among different traits measured by the same method.

Illustrations from the literature show that these desirable conditions, as a set, are rarely met. Method or apparatus factors make very large contributions to psychological measurements.

The notions of convergence between independent measures of the same trait and discrimination between measures of different traits are compared with previously published formulations, such as construct validity and convergent operationalism. Problems in the application of this validational process are considered.

REFERENCES

AMERICAN PSYCHOLOGICAL ASSOCIATION. Technical recommendations for psychological tests and diagnostic techniques. Psychol. Bull., Suppl., 1954, 51, Part 2, 1-38.

ANDERSON, E. E. Interrelationship of drives in the male albino rat. I. Intercorrelations of measures of drives. J. comp. Psychol., 1937, 24, 73-118.

AYER, A. J. The problem of knowledge. New York: St. Martin's Press, 1956.

BORGATTA, E. F. Analysis of social interaction and sociometric perception. Sociometry, 1954, 17, 7-32.

BORGATTA, E. F. Analysis of social interaction: Actual, role-playing, and projective. J. abnorm. soc. Psychol., 1955, 51, 394-405.

BRIDGMAN, P. W. The logic of modern physics. New York: Macmillan, 1927.

BURWEN, L. S., & CAMPBELL, D. T. The generality of attitudes toward authority and nonauthority figures. J. abnorm. soc. Psychol., 1957, 54, 24-31.

CAMPBELL, D. T. A study of leadership among submarine officers. Columbus: Ohio State Univer. Res. Found., 1953.

CAMPBELL, D. T. Operational delineation of "what is learned" via the transposition experiment. Psychol. Rev., 1954, 61, 167-174.


CAMPBELL, D. T. Leadership and its effects upon the group. Monogr. No. 83. Columbus: Ohio State Univer. Bur. Business Res., 1956.

CARROLL, J. B. Ratings on traits measured by a factored personality inventory. J. abnorm. soc. Psychol., 1952, 47, 626-632.

CHI, P.-L. Statistical analysis of personality rating. J. exp. Educ., 1937, 5, 229-245.

CRONBACH, L. J. Response sets and test validity. Educ. psychol. Measmt, 1946, 6, 475-494.

CRONBACH, L. J. Essentials of psychological testing. New York: Harper, 1949.

CRONBACH, L. J. Further evidence on response sets and test design. Educ. psychol. Measmt, 1950, 10, 3-31.

CRONBACH, L. J., & MEEHL, P. E. Construct validity in psychological tests. Psychol. Bull., 1955, 52, 281-302.

EDWARDS, A. L. The social desirability variable in personality assessment and research. New York: Dryden, 1957.

FEIGL, H. The mental and the physical. In H. Feigl, M. Scriven, & G. Maxwell (Eds.), Minnesota studies in the philosophy of science. Vol. II. Concepts, theories and the mind-body problem. Minneapolis: Univer. Minnesota Press, 1958.

FISKE, D. W. Consistency of the factorial structures of personality ratings from different sources. J. abnorm. soc. Psychol., 1949, 44, 329-344.

GARNER, W. R. Context effects and the validity of loudness scales. J. exp. Psychol., 1954, 48, 218-224.

GARNER, W. R., HAKE, H. W., & ERIKSEN, C. W. Operationism and the concept of perception. Psychol. Rev., 1956, 63, 149-159.

JESSOR, R., & HAMMOND, K. R. Construct validity and the Taylor Anxiety Scale. Psychol. Bull., 1957, 54, 161-170.

KELLEY, T. L., & KREY, A. C. Tests and measurements in the social sciences. New York: Scribner, 1934.

KELLY, E. L., & FISKE, D. W. The prediction of performance in clinical psychology. Ann Arbor: Univer. Michigan Press, 1951.

LOEVINGER, J., GLESER, G. C., & DuBOIS, P. H. Maximizing the discriminating power of a multiple-score test. Psychometrika, 1953, 18, 309-317.

LORGE, I. Gen-like: Halo or reality? Psychol. Bull., 1937, 34, 545-546.

MAYO, G. D. Peer ratings and halo. Educ. psychol. Measmt, 1956, 16, 317-323.

STRANG, R. Relation of social intelligence to certain other factors. Sch. & Soc., 1930, 32, 268-272.

SYMONDS, P. M. Diagnosing personality and conduct. New York: Appleton-Century, 1931.

THORNDIKE, E. L. A constant error in psychological ratings. J. appl. Psychol., 1920, 4, 25-29.

THORNDIKE, R. L. Factor analysis of social and abstract intelligence. J. educ. Psychol., 1936, 27, 231-233.

THURSTONE, L. L. The reliability and validity of tests. Ann Arbor: Edwards, 1937.

TRYON, R. C. Individual differences. In F. A. Moss (Ed.), Comparative psychology. (2nd ed.) New York: Prentice-Hall, 1942. Pp. 330-365.

UNDERWOOD, B. J. Psychological research. New York: Appleton-Century-Crofts, 1957.

VERNON, P. E. Educational ability and psychological factors. Address given to the Joint Education-Psychology Colloquium, Univer. of Illinois, March 29, 1957.

VERNON, P. E. Educational testing and test-form factors. Princeton: Educational Testing Service, 1958. (Res. Bull. RB-58-3.)

Received June 19, 1958.