PSYCHOLOGICAL BULLETIN
VOL. 56, No. 2                                                  MARCH, 1959

CONVERGENT AND DISCRIMINANT VALIDATION BY THE
MULTITRAIT-MULTIMETHOD MATRIX¹

DONALD T. CAMPBELL, Northwestern University
AND
DONALD W. FISKE, University of Chicago
In the cumulative experience with measures of individual differences over the past 50 years, tests have been accepted as valid or discarded as invalid by research experiences of many sorts. The criteria suggested in this paper are all to be found in such cumulative evaluations, as well as in the recent discussions of validity. These criteria are clarified and implemented when considered jointly in the context of a multitrait-multimethod matrix. Aspects of the validational process receiving particular emphasis are these:
1. Validation is typically convergent, a confirmation by independent measurement procedures. Independence of methods is a common denominator among the major types of validity (excepting content validity) insofar as they are to be distinguished from reliability.

¹ The new data analyses reported in this paper were supported by funds from the Graduate School of Northwestern University and by the Department of Psychology of the University of Chicago. We are also indebted to numerous colleagues for their thoughtful criticisms and encouragement of an earlier draft of this paper, especially Benjamin S. Bloom, R. Darrell Bock, Desmond S. Cartwright, Loren J. Chapman, Lee J. Cronbach, Carl P. Duncan, Lyle V. Jones, Joe Kamiya, Wilbur L. Layton, Jane Loevinger, Paul E. Meehl, Marshall H. Segall, Thornton B. Roby, Robert C. Tryon, Michael Wertheimer, and Robert F. Winch.
2. For the justification of novel trait measures, for the validation of test interpretation, or for the establishment of construct validity, discriminant validation as well as convergent validation is required. Tests can be invalidated by too high correlations with other tests from which they were intended to differ.
3. Each test or task employed for measurement purposes is a trait-method unit, a union of a particular trait content with measurement procedures not specific to that content. The systematic variance among test scores can be due to responses to the measurement features as well as responses to the trait content.
4. In order to examine discriminant validity, and in order to estimate the relative contributions of trait and method variance, more than one trait as well as more than one method must be employed in the validation process. In many instances it will be convenient to achieve this through a multitrait-multimethod matrix. Such a matrix presents all of the intercorrelations resulting when each of several traits is measured by each of several methods.
To illustrate the suggested validational process, a synthetic example is
TABLE 1
A SYNTHETIC MULTITRAIT-MULTIMETHOD MATRIX

                              Method 1             Method 2             Method 3
Traits                    A1    B1    C1       A2    B2    C2       A3    B3    C3
Method 1  A1            (.89)
          B1             .51  (.89)
          C1             .38   .37  (.76)
Method 2  A2             .57   .22   .09     (.93)
          B2             .22   .57   .10      .68  (.94)
          C2             .11   .11   .46      .59   .58  (.84)
Method 3  A3             .56   .22   .11      .67   .42   .33     (.94)
          B3             .23   .58   .12      .43   .66   .34      .67  (.92)
          C3             .11   .11   .45      .34   .32   .58      .58   .60  (.85)

Note.-The validity diagonals are the three sets of italicized values. The reliability diagonals are the three sets of values in parentheses. Each heterotrait-monomethod triangle is enclosed by a solid line. Each heterotrait-heteromethod triangle is enclosed by a broken line.
presented in Table 1. This illustration involves three different traits, each measured by three methods, generating nine separate variables. It will be convenient to have labels for various regions of the matrix, and such have been provided in Table 1. The reliabilities will be spoken of in terms of three reliability diagonals, one for each method. The reliabilities could also be designated as the monotrait-monomethod values. Adjacent to each reliability diagonal is the heterotrait-monomethod triangle. The reliability diagonal and the adjacent heterotrait-monomethod triangle make up a monomethod block. A heteromethod block is made up of a validity diagonal (which could also be designated as monotrait-heteromethod values) and the two heterotrait-heteromethod triangles lying on each side of it. Note that these two heterotrait-heteromethod triangles are not identical.
In terms of this diagram, four aspects bear upon the question of validity. In the first place, the entries in the validity diagonal should be significantly different from zero and sufficiently large to encourage further examination of validity. This requirement is evidence of convergent validity. Second, a validity diagonal value should be higher than the values lying in its column and row in the heterotrait-heteromethod triangles. That is, a validity value for a variable should be higher than the correlations obtained between that variable and any other variable having neither trait nor method in common. This requirement may seem so minimal and so obvious as to not need stating, yet an inspection of the literature shows that it is frequently not met,
and may not be met even when the validity coefficients are of substantial size. In Table 1, all of the validity values meet this requirement. A third common-sense desideratum is that a variable correlate higher with an independent effort to measure the same trait than with measures designed to get at different traits which happen to employ the same method. For a given variable, this involves comparing its values in the validity diagonals with its values in the heterotrait-monomethod triangles. For variables A1, B1, and C1, this requirement is met to some degree. For the other variables, A2, A3, etc., it is not met, and this is probably typical of the usual case in individual differences research, as will be discussed in what follows. A fourth desideratum is that the same pattern of trait interrelationship be shown in all of the heterotrait triangles of both the monomethod and heteromethod blocks. The hypothetical data in Table 1 meet this requirement to a very marked degree, in spite of the different general levels of correlation involved in the several heterotrait triangles. The last three criteria provide evidence for discriminant validity.
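The first three of these comparisons are mechanical enough to be expressed in code. The following sketch (not part of the original paper; the variable layout and helper functions are our own illustration) checks them against the synthetic correlations of Table 1:

```python
# Sketch: checking the convergent/discriminant comparisons on the
# synthetic matrix of Table 1. All identifiers here are illustrative.
import itertools

traits, methods = ["A", "B", "C"], [1, 2, 3]
vars_ = [(t, m) for m in methods for t in traits]  # A1, B1, C1, A2, ...

# Lower-triangle entries of Table 1, keyed by unordered variable pair.
r = {}
def set_row(var, values):
    for other, v in zip(vars_[: vars_.index(var)], values):
        r[frozenset([var, other])] = v

set_row(("B", 1), [.51])
set_row(("C", 1), [.38, .37])
set_row(("A", 2), [.57, .22, .09])
set_row(("B", 2), [.22, .57, .10, .68])
set_row(("C", 2), [.11, .11, .46, .59, .58])
set_row(("A", 3), [.56, .22, .11, .67, .42, .33])
set_row(("B", 3), [.23, .58, .12, .43, .66, .34, .67])
set_row(("C", 3), [.11, .11, .45, .34, .32, .58, .58, .60])

def corr(v1, v2):
    return r[frozenset([v1, v2])]

# First criterion (convergent validity): the monotrait-heteromethod
# (validity diagonal) values should be substantial.
validity = {(t, m1, m2): corr((t, m1), (t, m2))
            for t in traits for m1, m2 in itertools.combinations(methods, 2)}

# Second criterion: a validity value should exceed the heterotrait-
# heteromethod values in its row and column of the same block.
def exceeds_heteromethod(t, m1, m2):
    controls = [corr((t, m1), (t2, m2)) for t2 in traits if t2 != t]
    controls += [corr((t2, m1), (t, m2)) for t2 in traits if t2 != t]
    return all(validity[(t, m1, m2)] > c for c in controls)

# Third criterion (simplified here to a per-pair check): a validity
# value should exceed the heterotrait-monomethod values for the
# variables involved.
def exceeds_monomethod(t, m1, m2):
    controls = [corr((t, m), (t2, m)) for m in (m1, m2)
                for t2 in traits if t2 != t]
    return all(validity[(t, m1, m2)] > c for c in controls)

for key in sorted(validity):
    print(key, validity[key],
          exceeds_heteromethod(*key), exceeds_monomethod(*key))
```

Run on Table 1, every validity value passes the heteromethod comparison (as the text states), while the monomethod comparison fails wherever the high Method 2 and Method 3 heterotrait values intrude.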
Before examining the multitrait-multimethod matrices available in the literature, some explication and justification of this complex of requirements seems in order.
Convergence of independent methods: the distinction between reliability and validity. Both reliability and validity concepts require that agreement between measures be demonstrated. A common denominator which most validity concepts share in contradistinction to reliability is that this agreement represent the convergence of independent approaches. The concept of independence is indicated by
such phrases as "external variable," "criterion performance," "behavioral criterion" (American Psychological Association, 1954, pp. 13-15) used in connection with concurrent and predictive validity. For construct validity it has been stated thus: "Numerous successful predictions dealing with phenotypically diverse 'criteria' give greater weight to the claim of construct validity than do ... predictions involving very similar behavior" (Cronbach & Meehl, 1955, p. 295). The importance of independence recurs in most discussions of proof. For example, Ayer, discussing a historian's belief about a past event, says "if these sources are numerous and independent, and if they agree with one another, he will be reasonably confident that their account of the matter is correct" (Ayer, 1954, p. 39). In discussing the manner in which abstract scientific concepts are tied to operations, Feigl speaks of their being "fixed" by "triangulation in logical space" (Feigl, 1958, p. 401).
Independence is, of course, a matter of degree, and in this sense, reliability and validity can be seen as regions on a continuum. (Cf. Thurstone, 1937, pp. 102-103.) Reliability is the agreement between two efforts to measure the same trait through maximally similar methods. Validity is represented in the agreement between two attempts to measure the same trait through maximally different methods. A split-half reliability is a little more like a validity coefficient than is an immediate test-retest reliability, for the items are not quite identical. A correlation between dissimilar subtests is probably a reliability measure, but is still closer to the region called validity.
Some evaluation of validity can take place even if the two methods are not entirely independent. In Table 1, for example, it is possible that Methods 1 and 2 are not entirely independent. If underlying Traits A and B are entirely independent, then the .10 minimum correlation in the heterotrait-heteromethod triangles may reflect method covariance. What if the overlap of method variance were higher? All correlations in the heteromethod block would then be elevated, including the validity diagonal. The heteromethod block involving Methods 2 and 3 in Table 1 illustrates this. The degree of elevation of the validity diagonal above the heterotrait-heteromethod triangles remains comparable, and relative validity can still be evaluated. The interpretation of the validity diagonal in an absolute fashion requires the fortunate coincidence of both an independence of traits and an independence of methods, represented by zero values in the heterotrait-heteromethod triangles. But zero values could also occur through a combination of negative correlation between traits and positive correlation between methods, or the reverse. In practice, perhaps all that can be hoped for is evidence for relative validity, that is, for common variance specific to a trait, above and beyond shared method variance.
Discriminant validation. While the usual reason for the judgment of invalidity is low correlations in the validity diagonal (e.g., the Downey Will-Temperament Test [Symonds, 1931, p. 337 ff.]), tests have also been invalidated because of too high correlations with other tests purporting to measure different things. The classic case of the social intelligence tests is a case in point. (See below and also [Strang, 1930; R. Thorndike, 1936].) Such invalidation occurs when values in the heterotrait-heteromethod triangles are as high as those in the validity diagonal, or even where, within a monomethod block, the heterotrait values are as high as the reliabilities. Loevinger, Gleser, and DuBois (1953) have emphasized this requirement in the development of maximally discriminating subtests.
When a dimension of personality is hypothesized, when a construct is proposed, the proponent invariably has in mind distinctions between the new dimension and other constructs already in use. One cannot define without implying distinctions, and the verification of these distinctions is an important part of the validational process. In discussions of construct validity, it has been expressed in such terms as "from this point of view, a low correlation with athletic ability may be just as important and encouraging as a high correlation with reading comprehension" (APA, 1954, p. 17).
The test as a trait-method unit. In any given psychological measuring device, there are certain features or stimuli introduced specifically to represent the trait that it is intended to measure. There are other features which are characteristic of the method being employed, features which could also be present in efforts to measure other quite different traits. The test, or rating scale, or other device, almost inevitably elicits systematic variance in response due to both groups of features. To the extent that irrelevant method variance contributes to the scores obtained, these scores are invalid.
This source of invalidity was first noted in the "halo effects" found in ratings (Thorndike, 1920). Studies of individual differences among laboratory animals resulted in the recognition of "apparatus factors," usually more dominant than psychological process factors (Tryon, 1942). For paper-and-pencil tests, methods variance has been noted under such terms as "test-form factors" (Vernon, 1957, 1958) and "response sets" (Cronbach, 1946, 1950; Lorge, 1937). Cronbach has stated the point particularly clearly: "The assumption is generally made ... that what the test measures is determined by the content of the items. Yet the final score ... is a composite of effects resulting from the content of the item and effects resulting from the form of the item used" (Cronbach, 1946, p. 475). "Response sets always lower the logical validity of a test.... Response sets interfere with inferences from test data" (p. 484).
While E. L. Thorndike (1920) was willing to allege the presence of halo effects by comparing the high obtained correlations with common-sense notions of what they ought to be (e.g., it was unreasonable that a teacher's intelligence and voice quality should correlate .63), and while much of the evidence of response set variance is of the same order, the clear-cut demonstration of the presence of method variance requires both several traits and several methods. Otherwise, high correlations between tests might be explained as due either to basic trait similarity or to shared method variance. In the multitrait-multimethod matrix, the presence of method variance is indicated by the difference in level of correlation between the parallel values of the monomethod block and the heteromethod blocks, assuming comparable reliabilities among all tests. Thus the contribution of method variance in Test A1 of Table 1 is indicated by the elevation of r(A1, B1) above r(A1, B2), i.e., the difference between .51 and .22, etc.
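The comparison just described is simple arithmetic; a minimal sketch (the variable names are ours, with the two correlations read from Table 1):

```python
# Sketch: the elevation of a heterotrait-monomethod value over its
# parallel heterotrait-heteromethod value as a rough index of method
# variance. Identifiers are illustrative, not from the paper.
r_A1B1 = .51  # Traits A and B, both measured by Method 1 (monomethod)
r_A1B2 = .22  # Same traits, different methods (heteromethod)

method_variance_elevation = r_A1B1 - r_A1B2
print(f"Indicated method-variance contribution: {method_variance_elevation:.2f}")
# prints: Indicated method-variance contribution: 0.29
```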
The distinction between trait and method is of course relative to the test constructor's intent. What is an unwanted response set for one tester may be a trait for another who wishes to measure acquiescence, willingness to take an extreme stand, or tendency to attribute socially desirable attributes to oneself (Cronbach, 1946, 1950; Edwards, 1957; Lorge, 1937).
MULTITRAIT-MULTIMETHOD MATRICES IN THE LITERATURE
Multitrait-multimethod matrices are rare in the test and measurement literature. Most frequent are two types of fragment: two methods and one trait (single isolated values from the validity diagonal, perhaps accompanied by a reliability or two), and heterotrait-monomethod triangles. Either type of fragment is apt to disguise the inadequacy of our present measurement efforts, particularly in failing to call attention to the preponderant strength of methods variance. The evidence of test validity to be presented here is probably poorer than most psychologists would have expected.
One of the earliest matrices of this kind was provided by Kelley and Krey in 1934. Peer judgments by students provided one method, scores on a word-association test the other. Table 2 presents the data for the four most valid traits of the eight employed. The picture is one of strong method factors, particularly among the peer ratings, and almost total invalidity. For only one of the eight measures, School Drive, is the value in the validity diagonal (.16) higher than all of the heterotrait-heteromethod values. The absence of discriminant validity is further indicated by the tendency of the values in the monomethod triangles to approximate the reliabilities.
TABLE 2
PERSONALITY TRAITS OF SCHOOL CHILDREN FROM KELLEY'S STUDY
(N = 311)

                                  Peer Ratings                Association Test
                            A1    B1    C1    D1         A2    B2    C2    D2
Peer Ratings
  Courtesy       A1       (.82)
  Honesty        B1        .74  (.80)
  Poise          C1        .63   .65  (.74)
  School Drive   D1        .76   .78   .65  (.89)
Association Test
  Courtesy       A2        .13   .14   .10   .14       (.28)
  Honesty        B2        .06   .12   .16   .08        .27  (.38)
  Poise          C2        .01   .08   .10   .02        .19   .37  (.42)
  School Drive   D2        .12   .15   .14   .16        .27   .32   .18  (.36)

An early illustration from the animal literature comes from Anderson's (1937) study of drives. Table 3 presents a sample of his data. Once again, the highest correlations are found among different constructs from the same method, showing the dominance of apparatus or method factors so typical of the whole field of individual differences. The validity diagonal for hunger is higher than the heteroconstruct-heteromethod values. The diagonal value for sex has not been italicized as a validity coefficient since the obstruction box
measure was pre-sex-opportunity, the activity wheel post-opportunity. Note that the high general level of heterotrait-heteromethod values could be due either to correlation of methods variance between the two methods, or to correlated trait variance. On a priori grounds, however, the methods would seem about as independent as one would be likely to achieve. The predominance of an apparatus factor for the activity wheel is evident from the fact that the correlation between hunger and thirst (.87) is of the same magnitude as their test-retest reliabilities (.83 and .92 respectively).

TABLE 3
MEASURES OF DRIVES FROM ANDERSON'S DATA
(N = 50)

                          Obstruction Box          Activity Wheel
                          A1    B1    C1          A2    B2    C2
Obstruction Box
  Hunger     A1         (.58)
  Thirst     B1          .54  (   )
  Sex        C1          .46   .70  (   )
Activity Wheel
  Hunger     A2          .48   .31   .37        (.83)
  Thirst     B2          .35   .33   .43         .87  (.92)
  Post Sex   C2          .31   .37   .44         .69   .78  (   )

Note.-Empty parentheses appear in this and subsequent tables where no appropriate reliability estimates are reported in the original paper.

TABLE 4
SOCIAL INTELLIGENCE AND MENTAL ALERTNESS SUBTEST INTERCORRELATIONS
FROM THORNDIKE'S DATA
(N = 750)

                                                         Memory      Comprehension   Vocabulary
                                                        A1    B1       A2    B2       A3    B3
Memory
  Social Intelligence (Memory for Names & Faces)  A1  (   )
  Mental Alertness (Learning Ability)             B1   .31  (   )
Comprehension
  Social Intelligence (Sense of Humor)            A2   .30   .31     (   )
  Mental Alertness (Comprehension)                B2   .29   .38      .48  (   )
Vocabulary
  Social Intelligence (Recog. of Mental State)    A3   .23   .35      .31   .35     (   )
  Mental Alertness (Vocabulary)                   B3   .30   .58      .40   .48      .47  (   )
R. L. Thorndike's study (1936) of the validity of the George Washington Social Intelligence Test is the classic instance of invalidation by high correlation between traits. It involved computing all of the intercorrelations among five subscales of the Social Intelligence Test and five subscales of the George Washington Mental Alertness Test. The model of the present paper would demand that each of the traits, social intelligence and mental alertness, be measured by at least two methods. While this full symmetry was not intended in the study, it can be so interpreted without too much distortion. For both traits, there were subtests employing acquisition of knowledge during the testing period (i.e., learning or memory), tests involving comprehension of prose passages, and tests that involved a definitional activity. Table 4 shows six of Thorndike's 10 variables arranged as a multitrait-multimethod matrix. If the three subtests of the Social Intelligence Test are viewed as three methods of measuring social intelligence, then their intercorrelations (.30, .23, and .31) represent validities that are not only lower than their corresponding monomethod values, but also lower than the heterotrait-heteromethod correlations, providing a picture which totally fails to establish social intelligence as a separate dimension. The Mental Alertness validity diagonals (.38, .58, and .48) equal or exceed the monomethod values in two out of three cases, and exceed all heterotrait-heteromethod control values. These results illustrate the general conclusions reached by Thorndike in his factor analysis of the whole 10 × 10 matrix.
The data of Table 4 could be used to validate specific forms of cognitive functioning, as measured by the different "methods" represented by usual intelligence test content on the one hand and social content on the other. Table 5 rearranges the 15 values for this purpose. The monomethod values and the validity diagonals exchange places, while the heterotrait-heteromethod control coefficients are the same in both tables.
TABLE 5
MEMORY, COMPREHENSION, AND VOCABULARY MEASURED WITH
SOCIAL AND ABSTRACT CONTENT

                                                    Social Content        Abstract Content
                                                   A1    B1    C1        A2    B2    C2
Social Content
  Memory (Memory for Names and Faces)       A1   (   )
  Comprehension (Sense of Humor)            B1    .30  (   )
  Vocabulary (Recognition of Mental State)  C1    .23   .31  (   )
Abstract Content
  Memory (Learning Ability)                 A2    .31   .31   .35      (   )
  Comprehension                             B2    .29   .48   .35       .38  (   )
  Vocabulary                                C2    .30   .40   .47       .58   .48  (   )
As judged against these latter values, comprehension (.48) and vocabulary (.47), but not memory (.31), show some specific validity. This transmutability of the validation matrix argues for the comparisons within the heteromethod block as the most generally relevant validation data, and illustrates the potential interchangeability of trait and method components.
Some of the correlations in Chi's (1937) prodigious study of halo effect in ratings are appropriate to a multitrait-multimethod matrix in which each rater might be regarded as representing a different method. While the published report does not make these available in detail because it employs averaged values, it is apparent from a comparison of his Tables IV and VIII that the ratings generally failed to meet the requirement that ratings of the same trait by different raters should correlate higher than ratings of different traits by the same rater. Validity is shown to the extent that, of the correlations in the heteromethod block, those in the validity diagonal are higher than the average heterotrait-heteromethod values.
A conspicuously unsuccessful multitrait-multimethod matrix is provided by Campbell (1953, 1956) for rating of the leadership behavior of officers by themselves and by their subordinates. Only one of 11 variables (Recognition Behavior) met the requirement of providing a validity diagonal value higher than any of the heterotrait-heteromethod values, that validity being .29. For none of the variables were the validities higher than heterotrait-monomethod values.
A study of attitudes toward authority and nonauthority figures by Burwen and Campbell (1957) contains a complex multitrait-multimethod matrix, one symmetrical excerpt from which is shown in Table 6. Method variance was strong for most of the procedures in this study. Where validity was found, it was primarily at the level of validity diagonal values higher than heterotrait-heteromethod values. As illustrated in Table 6, attitude toward father showed this kind of validity, as did attitude toward peers to a lesser degree. Attitude toward boss showed no validity. There was no evidence of a generalized attitude toward authority which would include father and boss, although such values as the .64 correlation between father and boss as measured by interview might have seemed to confirm the hypothesis had they been encountered in isolation.

TABLE 6
ATTITUDES TOWARD FATHER, BOSS, AND PEER, AS MEASURED BY
INTERVIEW AND CHECK-LIST OF DESCRIPTIVE TRAITS

                               Interview             Trait Check-List
                          A1     B1     C1         A2     B2     C2
Interview (N = 57)
  Father   A1           (   )
  Boss     B1            .64   (   )
  Peer     C1            .65    .76   (   )
Trait Check-List (N = 155)
  Father   A2            .40    .08    .09       (.24)
  Boss     B2            .19   -.10   -.03        .23   (.34)
  Peer     C2            .27    .11    .23        .21    .45   (.55)
Borgatta (1954) has provided a complex multimethod study from which can be extracted Table 7, illustrating the assessment of two traits by four different methods. For all measures but one, the highest correlation is the apparatus one, i.e., with the other trait measured by the same method rather than with the same trait measured by a different method. Neither of the traits finds any consistent validation by the requirement that the validity diagonals exceed the heterotrait-heteromethod control values. As a most minimal requirement, it might be asked if the sum of the two values in the validity diagonal exceeds the sum of the two control values, providing a comparison in which differences in reliability or communality are roughly partialled out. This condition is achieved at the purely chance level of three times in the six tetrads. This matrix provides an interesting range of methodological independence. The two "Sociometric by Others" measures, while representing the judgments of the same set of fellow participants, come from distinct tasks: Popularity is based upon each participant's expression of his own friendship preferences, while Expansiveness is based upon each participant's guesses as to the other participants' choices, from which has been computed each participant's reputation for liking lots of other persons, i.e., being "expansive." In line with this considerable independence, the evidence for a method factor is relatively low in comparison with the observational procedures. Similarly, the two "Sociometric by Self" measures represent quite separate tasks, Popularity coming from his estimates of the choices he will receive from others, Expansiveness from the number of expressions of attraction to others which he makes on the sociometric task. In contrast, the measures of Popularity and Expansiveness from the observations of group interaction and the role playing not only involve the same specific observers, but in addition the observers rated the pair of variables as a part of the same rating task in each situation. The apparent degree of method variance within each of the two observational situations, and the apparent sharing of method variance between them, is correspondingly high.

TABLE 7
MULTIPLE MEASUREMENT OF TWO SOCIOMETRIC TRAITS
(N = 125)

                                    Sociometric   Sociometric    Group         Role
                                    by Others     by Self        Interaction   Playing
                                    A1    B1      A2    B2       A3    B3      A4    B4
Sociometric by Others
  Popularity      A1              (   )
  Expansiveness   B1               .47  (   )
Sociometric by Self
  Popularity      A2               .19   .18    (   )
  Expansiveness   B2               .07   .08     .32  (   )
Observation of Group Interaction
  Popularity      A3               .25   .18     .26   .11     (   )
  Expansiveness   B3               .21   .12     .28   .15      .84  (   )
Observation of Role Playing
  Popularity      A4               .24   .14     .18   .01      .66   .58    (   )
  Expansiveness   B4               .25   .12     .26   .05      .66   .76     .73  (   )
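The tetrad comparison described above can be sketched in code; the correlations are those of Table 7, and all identifiers are our own illustration:

```python
# Sketch: Borgatta's Table 7 tetrad comparison. For each pair of the
# four methods, the sum of the two validity-diagonal values is compared
# with the sum of the two heterotrait-heteromethod control values.
# P = Popularity, E = Expansiveness; methods numbered 1-4.
from itertools import combinations

pairs = {
    ("P", 1, "E", 1): .47,
    ("P", 1, "P", 2): .19, ("E", 1, "P", 2): .18,
    ("P", 1, "E", 2): .07, ("E", 1, "E", 2): .08, ("P", 2, "E", 2): .32,
    ("P", 1, "P", 3): .25, ("E", 1, "P", 3): .18, ("P", 2, "P", 3): .26,
    ("E", 2, "P", 3): .11,
    ("P", 1, "E", 3): .21, ("E", 1, "E", 3): .12, ("P", 2, "E", 3): .28,
    ("E", 2, "E", 3): .15, ("P", 3, "E", 3): .84,
    ("P", 1, "P", 4): .24, ("E", 1, "P", 4): .14, ("P", 2, "P", 4): .18,
    ("E", 2, "P", 4): .01, ("P", 3, "P", 4): .66, ("E", 3, "P", 4): .58,
    ("P", 1, "E", 4): .25, ("E", 1, "E", 4): .12, ("P", 2, "E", 4): .26,
    ("E", 2, "E", 4): .05, ("P", 3, "E", 4): .66, ("E", 3, "E", 4): .76,
    ("P", 4, "E", 4): .73,
}

def r(t1, m1, t2, m2):
    # Correlations are symmetric; look the pair up in either order.
    return pairs.get((t1, m1, t2, m2), pairs.get((t2, m2, t1, m1)))

passes = 0
for m1, m2 in combinations([1, 2, 3, 4], 2):
    validity_sum = r("P", m1, "P", m2) + r("E", m1, "E", m2)
    control_sum = r("P", m1, "E", m2) + r("E", m1, "P", m2)
    if validity_sum > control_sum:
        passes += 1
print(passes, "of 6 tetrads favor the validity diagonal")
# prints: 3 of 6 tetrads favor the validity diagonal
```

The count of three out of six matches the "purely chance level" noted in the text.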
In another paper by Borgatta (1955), 12 interaction process variables were measured by quantitative observation under two conditions, and by a projective test. In this test, the stimuli were pictures of groups, for which the S generated a series of verbal interchanges; these were then scored in Interaction Process Analysis categories. For illustrative purposes, Table 8 presents the five traits which had the highest mean communalities in the over-all factor analysis. Between the two highly similar observational methods, validation is excellent: trait variance runs higher than method variance; validity diagonals are in general higher than heterotrait values of both the heteromethod and monomethod blocks, most unexceptionably so for Gives Opinion and Gives Orientation. The pattern of correlation among the traits is also in general confirmed.
Of greater interest because of the greater independence of methods are the blocks involving the projective test. Here the validity picture is much poorer. Gives Orientation comes off best, its projective test validity values of .35 and .33 being bested by only three monomethod values and by no heterotrait-heteromethod values within the projective blocks. All of the other validities are exceeded by some heterotrait-heteromethod value.
The projective test specialist may object to the implicit expectations of a one-to-one correspondence between projected action and overt action. Such expectations should not be attributed to Borgatta, and are not necessary to the method here proposed. For the simple symmetrical model of this paper, it has been assumed that the measures are labeled in correspondence with the correlations expected, i.e., in correspondence with the traits that the tests are alleged to diagnose. Note that in Table 8, Gives Opinion is the best projective test predictor of both free behavior and role playing Shows Disagreement. Were a proper theoretical rationale available, these values might be regarded as validities.

TABLE 8
INTERACTION PROCESS VARIABLES IN OBSERVED FREE BEHAVIOR, OBSERVED
ROLE PLAYING, AND A PROJECTIVE TEST
(N = 125)

                            Free Behavior              Role Playing               Projective Test
                        A1   B1   C1   D1   E1     A2   B2   C2   D2   E2     A3   B3   C3   D3   E3
Free Behavior
 Shows solidarity   A1 (  )
 Gives suggestion   B1  .25 (  )
 Gives opinion      C1  .13  .24 (  )
 Gives orientation  D1 -.14  .26  .52 (  )
 Shows disagreement E1  .34  .41  .27  .02 (  )
Role Playing
 Shows solidarity   A2  .43  .43  .08  .10  .29   (  )
 Gives suggestion   B2  .16  .32  .00  .24  .07    .37 (  )
 Gives opinion      C2  .15  .27  .60  .38  .12    .01  .10 (  )
 Gives orientation  D2 -.12  .24  .44  .74  .08    .04  .18  .40 (  )
 Shows disagreement E2  .51  .36  .14 -.12  .50    .39  .27  .23 -.11 (  )
Projective Test
 Shows solidarity   A3  .20  .17  .16  .12  .08    .17  .12  .30  .17  .22   (  )
 Gives suggestion   B3  .05  .21  .05  .08  .13    .10  .19 -.02  .06  .30    .32 (  )
 Gives opinion      C3  .31  .30  .13 -.02  .26    .25  .19  .15 -.04  .53    .31  .63 (  )
 Gives orientation  D3 -.01  .09  .30  .35 -.05    .03  .00  .19  .33  .00    .37  .29  .32 (  )
 Shows disagreement E3  .13  .18  .10  .14  .19    .22  .28  .02  .04  .23    .27  .51  .47  .30 (  )

TABLE 9
MAYO'S INTERCORRELATIONS BETWEEN OBJECTIVE AND RATING
MEASURES OF INTELLIGENCE AND EFFORT
(N = 166)

                           Peer Ratings         Objective
                           A1      B1           A2      B2
Peer Ratings
  Intelligence   A1      (.85)
  Effort         B1       .66   (.84)
Objective Measures
  Intelligence   A2       .46    .29          (   )
  Effort         B2       .46    .40           .10   (   )
Mayo (1956) has made an analysis of test scores and ratings of effort and intelligence, to estimate the contribution of halo (a kind of methods variance) to ratings. As Table 9 shows, the validity picture is ambiguous. The method factor or halo effect for ratings is considerable, although the correlation between the two ratings (.66) is well below their reliabilities (.84 and .85). The objective measures share no appreciable apparatus overlap because they were independent operations. In spite of Mayo's argument that the ratings have some valid trait variance, the .46 heterotrait-heteromethod value seriously depreciates the otherwise impressive .46 and .40 validity values.
Cronbach (1949, p. 277) and Vernon (1957, 1958) have both discussed the multitrait-multimethod matrix shown in Table 10, based upon data originally presented by H. S. Conrad. Using an approximative technique, Vernon estimates that 61% of the systematic variance is due to a general factor, that 21½% is due to the test-form factors specific to verbal or to pictorial forms of items, and that but 11½% is due to the content factors specific to electrical or to mechanical contents. Note that for the purposes of estimating validity, the interpretation of the general factor, which he estimates from the .49 and .45 heterotrait-heteromethod values, is equivocal. It could represent desired competence variance, representing components common to both electrical and mechanical skills, perhaps resulting from general industrial shop experience, common ability components, overlapping learning situations, and the like. On the other hand, this general factor could represent overlapping method factors, and be due to the presence in both tests of multiple choice item format, IBM answer sheets, or the heterogeneity of the Ss in conscientiousness, test-taking motivation, and test-taking sophistication. Until methods that are still more different and traits that are still more independent are introduced into the validation matrix, this general factor remains uninterpretable. From this standpoint it can be seen that 21½% is a very minimal estimate of the total test-form variance in the tests, as it represents only test-form components specific to the verbal or the pictorial items, i.e., test-form components which the two forms do not share. Similarly, and more hopefully, the 11½% content variance is a very minimal estimate of the total true trait variance of the tests, representing only the true trait variance which electrical and mechanical knowledge do not share.

TABLE 10
MECHANICAL AND ELECTRICAL FACTS MEASURED BY VERBAL AND PICTORIAL ITEMS

                             Verbal Items          Pictorial Items
                             A1      B1            A2      B2
Verbal Items
  Mechanical Facts   A1    (.89)
  Electrical Facts   B1     .63   (.71)
Pictorial Items
  Mechanical Facts   A2     .61    .45           (.82)
  Electrical Facts   B2     .49    .51            .64   (.67)
Carroll (1952) has provided data on the Guilford-Martin Inventory of Factors STDCR and related ratings which can be rearranged into the matrix of Table 11. (Variable R has been inverted to reduce the number of negative correlations.) Two of the methods, Self Ratings and Inventory scores, can be seen as sharing method variance, and thus as having an inflated validity diagonal. The more independent heteromethod blocks involving Peer Ratings show some evidence of discriminant and convergent validity, with validity diagonals averaging .33 (Inventory × Peer Ratings) and .39 (Self Ratings × Peer Ratings) against heterotrait-heteromethod control values averaging .14 and .16. While not intrinsically impressive, this picture is nonetheless better than most of the validity matrices here assembled. Note that the Self Ratings show slightly higher validity diagonal elevations than do the Inventory scores, in spite of the much greater length and undoubtedly higher reliability of the latter. In addition, a method factor seems almost totally lacking for the Self Ratings, while strongly present for the Inventory, so that the Self Ratings come off much the best if true trait variance is expressed as a proportion of total reliable variance (as Vernon [1958] suggests). The method factor in the STDCR Inventory is undoubtedly enhanced by scoring the same item in several scales, thus contributing correlated error variance, which could be reduced without loss of reliability by the simple expedient of adding more equivalent items and scoring each item in only one scale. It should be noted that Carroll makes explicit use of the comparison of the validity diagonal with the heterotrait-heteromethod values as a validity indicator.
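The averages quoted above can be reproduced directly from the heteromethod blocks of Table 11. The sketch below is ours, not Carroll's or the authors'; rows are the Peer Ratings, columns the other method, traits ordered S, T, D, C, -R.

```python
# Heteromethod blocks of Table 11 (Peer Ratings as rows; traits S, T, D, C, -R).
inventory_x_peer = [
    [.37, .08, .10, -.01, .38],
    [.23, .32, .15, .04, .40],
    [.31, .11, .27, .24, .25],
    [.08, .15, .20, .26, -.05],
    [.21, .20, -.03, -.16, .45],
]
self_x_peer = [
    [.42, .02, .08, .08, .31],
    [.20, .39, .40, .21, .31],
    [.17, .09, .29, .27, .30],
    [.01, .06, .14, .30, .07],
    [.28, .17, .08, .01, .56],
]

def diagonal_and_offdiagonal_means(block):
    """Mean of the validity diagonal vs. mean heterotrait-heteromethod value."""
    n = len(block)
    diag = [block[i][i] for i in range(n)]
    off = [block[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(diag) / len(diag), sum(off) / len(off)

for name, block in [("Inventory x Peer", inventory_x_peer),
                    ("Self x Peer", self_x_peer)]:
    v, h = diagonal_and_offdiagonal_means(block)
    print(f"{name}: validity mean {v:.2f}, heterotrait mean {h:.2f}")
# Inventory x Peer: validity mean 0.33, heterotrait mean 0.14
# Self x Peer: validity mean 0.39, heterotrait mean 0.16
```

The output matches the .33/.39 validity averages and the .14/.16 control averages cited in the text.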
RATINGS IN THE ASSESSMENT STUDY OF CLINICAL PSYCHOLOGISTS
The illustrations of multitrait-multimethod matrices presented so far give a rather sorry picture of the validity of the measures of individual differences involved. The typical case shows an excessive amount of
TABLE 11
GUILFORD-MARTIN FACTORS STDCR AND RELATED RATINGS (N = 110)

                  Inventory                     Self Ratings                  Peer Ratings
                  S    T    D    C    -R        S    T    D    C    -R        S    T    D    C    -R
Inventory
  S   (.92)
  T    .27 (.89)
  D    .62  .57 (.91)
  C    .36  .47  .90 (.91)
  -R   .69  .32  .28 -.06 (.89)
Self Ratings
  S    .57  .11  .19 -.01  .53   ( )
  T    .28  .65  .42  .26  .37   .26  ( )
  D    .44  .25  .53  .45  .29   .31  .32  ( )
  C    .31  .20  .54  .52  .13   .11  .21  .47  ( )
  -R   .15  .30  .12  .04  .34   .10  .12  .04  .06  ( )
Peer Ratings
  S    .37  .08  .10 -.01  .38   .42  .02  .08  .08  .31   (.81)
  T    .23  .32  .15  .04  .40   .20  .39  .40  .21  .31    .37 (.66)
  D    .31  .11  .27  .24  .25   .17  .09  .29  .27  .30    .49  .38 (.73)
  C    .08  .15  .20  .26 -.05   .01  .06  .14  .30  .07    .19  .16  .40 (.75)
  -R   .21  .20 -.03 -.16  .45   .28  .17  .08  .01  .56    .55  .56  .34 -.07 (.76)
² We are indebted to E. Lowell Kelly for furnishing the V.A. assessment data to us, and to Hugh Lane for producing the matrix of intercorrelations. In the original report the correlations were based upon 128 men. The present analyses were based on only 124 of these cases because of clerical errors. This reduction in N leads to some very minor discrepancies between these values and those previously reported.
method variance, which usually exceeds the amount of trait variance. This picture is certainly not a result of a deliberate effort to select shockingly bad examples: these are ones we have encountered without attempting an exhaustive coverage of the literature. The several unpublished studies of which we are aware show the same picture. If they seem more disappointing than the general run of validity data reported in the journals, this impression may very well be because the portrait of validity provided by isolated values plucked from the validity diagonal is deceptive, and uninterpretable in isolation from the total matrix. Yet it is clear that few of the classic examples of successful measurement of individual differences are involved, and that in many of the instances, the quality of the data might have been such as to magnify apparatus factors, etc. A more nearly ideal set of personality data upon which to illustrate the method was therefore sought in the multiple application of a set of rating scales in the assessment study of clinical psychologists (Kelly & Fiske, 1951).

In that study, "Rating Scale A" contained 22 traits referring to "behavior which can be directly observed on the surface." In using this scale the raters were instructed to "disregard any inferences about underlying dynamics or causes" (p. 207). The Ss, first-year clinical psychology students, rated themselves and also their three teammates with whom they had participated in the various assessment procedures and with whom they had lived for six days. The median of the three teammates' ratings was used for the Teammate score. The Ss were also rated on these 22 traits by the assessment staff. Our analysis uses the Final Pooled ratings, which were agreed upon by three staff members after discussion and review of the enormous amount of data and the many other ratings on each S. Unfortunately for our purposes, the staff members saw the ratings by Self and Teammates before making theirs, although presumably they were little influenced by these data because they had so much other evidence available to them. (See Kelly & Fiske, 1951, especially p. 64.) The Self and Teammate ratings represent entirely separate "methods" and can be given the major emphasis in evaluating the data to be presented.

In a previous analysis of these data (Fiske, 1949), each of the three heterotrait-monomethod triangles was computed and factored. To provide a multitrait-multimethod matrix, the 1452 heteromethod correlations have been computed especially for this report.² The full 66 × 66 matrix with its 2145 coefficients is obviously too large for presentation here, but will be used in analyses that follow. To provide an illustrative sample, Table 12 presents the interrelationships among five variables, selecting the one best representing each of the five recurrent factors discovered in Fiske's (1949) previous analysis of the monomethod matrices. (These were chosen without regard to their validity as indicated in the heteromethod blocks. Assertive (No. 3 reflected) was selected to represent Recurrent Factor 5 because Talkative had also a high
TABLE 12
RATINGS FROM ASSESSMENT STUDY OF CLINICAL PSYCHOLOGISTS (N = 124)

                          Staff Ratings                 Teammate Ratings              Self Ratings
                          A1   B1   C1   D1   E1        A2   B2   C2   D2   E2        A3   B3   C3   D3   E3
Staff Ratings
  Assertive         A1  (.89)
  Cheerful          B1   .37 (.85)
  Serious           C1  -.24 -.14 (.81)
  Unshakable Poise  D1   .25  .46  .08 (.84)
  Broad Interests   E1   .35  .19  .09  .31 (.92)
Teammate Ratings
  Assertive         A2   .71  .35 -.18  .26  .41      (.82)
  Cheerful          B2   .39  .53 -.15  .38  .29       .37 (.76)
  Serious           C2  -.27 -.31  .43 -.06  .03      -.15 -.19 (.70)
  Unshakable Poise  D2   .03 -.05  .03  .20  .07       .11  .23  .19 (.74)
  Broad Interests   E2   .19  .05  .04  .29  .47       .33  .22  .19  .29 (.76)
Self Ratings
  Assertive         A3   .48  .31 -.22  .19  .12       .46  .36 -.15  .12  .23       ( )
  Cheerful          B3   .17  .42 -.10  .10 -.03       .09  .24 -.25 -.11 -.03       .23  ( )
  Serious           C3  -.04 -.13  .22 -.13 -.05      -.04 -.11  .31  .06  .06      -.05 -.12  ( )
  Unshakable Poise  D3   .13  .27 -.03  .22 -.04       .10  .15  .00  .14 -.03       .16  .26  .11  ( )
  Broad Interests   E3   .37  .15 -.22  .09  .26       .27  .12 -.07  .05  .35       .21  .15  .17  .31  ( )
loading on the first recurrent factor.)

The picture presented in Table 12 is, we believe, typical of the best validity in personality trait ratings that psychology has to offer at the present time. It is comforting to note that the picture is better than most of those previously examined. Note that the validities for Assertive exceed heterotrait values of both the monomethod and heteromethod triangles. Cheerful, Broad Interests, and Serious have validities exceeding the heterotrait-heteromethod values with two exceptions. Only for Unshakable Poise does the evidence of validity seem trivial. The elevation of the reliabilities above the heterotrait-monomethod triangles is further evidence for discriminant validity.
A comparison of Table 12 with the full matrix shows that the procedure of having but one variable to represent each factor has enhanced the appearance of validity, although not necessarily in a misleading fashion. Where several variables are all highly loaded on the same factor, their "true" level of intercorrelation is high. Under these conditions, sampling errors can depress validity diagonal values and enhance others to produce occasional exceptions to the validity picture, both in the heterotrait-monomethod matrix and in the heteromethod-heterotrait triangles. In this instance, with an N of 124, the sampling error is appreciable, and may thus be expected to exaggerate the degree of invalidity.
Within the monomethod sections, errors of measurement will be correlated, raising the general level of values found, while within the heteromethod blocks, measurement errors are independent, and tend to lower the values both along the validity diagonal and in the heterotrait triangles. These effects, which may also be stated in terms of method factors or shared confounded irrelevancies, operate strongly in these data, as probably in all data involving ratings. In such cases, where several variables represent each factor, none of the variables consistently meets the criterion that validity values exceed the corresponding values in the monomethod triangles, when the full matrix is examined.
To summarize the validation picture with respect to comparisons of validity values with other heteromethod values in each block, Table 13 has been prepared. For each trait and for each of the three heteromethod blocks, it presents the value of the validity diagonal, the highest heterotrait value involving that trait, and the number out of the 42 such heterotrait values which exceed the validity diagonal in magnitude. (The number 42 comes from the grouping of the 21 other column values and the 21 other row values for the column and row intersecting at the given diagonal value.)
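The counting procedure behind Table 13 can be sketched as follows. This is our illustration, not the authors' code, and it is applied to the five-trait Teammate-Self block of Table 12 rather than to the full 22-trait matrix, so each trait has 2 × (5 − 1) = 8 comparisons instead of 42.

```python
def exceedance_counts(block):
    """For each trait, count the heterotrait values in its row and column of a
    heteromethod block that exceed the validity (diagonal) value in magnitude.
    With t traits this is 2 * (t - 1) comparisons per trait (42 when t = 22)."""
    n = len(block)
    counts = []
    for i in range(n):
        v = abs(block[i][i])
        comparisons = ([abs(block[i][j]) for j in range(n) if j != i]
                       + [abs(block[j][i]) for j in range(n) if j != i])
        counts.append(sum(c > v for c in comparisons))
    return counts

# Teammate-Self heteromethod block for the five sample traits of Table 12
# (rows: Self ratings; columns: Teammate ratings; order: Assertive,
# Cheerful, Serious, Unshakable Poise, Broad Interests).
teammate_self = [
    [.46, .36, -.15, .12, .23],
    [.09, .24, -.25, -.11, -.03],
    [-.04, -.11, .31, .06, .06],
    [.10, .15, .00, .14, -.03],
    [.27, .12, -.07, .05, .35],
]
print(exceedance_counts(teammate_self))  # [0, 2, 0, 1, 0]
```

In this five-trait sample only Cheerful and Unshakable Poise have heterotrait values exceeding their validity diagonals; the Table 13 counts differ because they draw on all 22 traits.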
On the requirement that the validity diagonal exceed all others in its heteromethod block, none of the traits has a completely perfect record, although some come close. Assertive has only one trivial exception in the Teammate-Self block. Talkative has almost as good a record, as does Imaginative. Serious has but two inconsequential exceptions and Interest in Women three. These traits stand out as highly valid in both self-description and reputation. Note that the actual validity coefficients of these four traits range from but .22 to .82, or, if we concentrate on the Teammate-Self block as most certainly representing independent methods, from but .31 to .46. While these are the best traits, it seems that most of the traits have far above
TABLE 13
VALIDITIES OF TRAITS IN THE ASSESSMENT STUDY OF CLINICAL PSYCHOLOGISTS, AS JUDGED BY THE HETEROMETHOD COMPARISONS

                               Staff-Teammate          Staff-Self              Teammate-Self
                               Val.  Highest  No.      Val.  Highest  No.      Val.  Highest  No.
                                     Het.     Higher         Het.     Higher         Het.     Higher
 1. Obstructiveness*           .30    .34       2      .16    .27       9      .19    .24       1
 2. Unpredictable              .34    .26       0      .18    .24       3      .05    .19      29
 3. Assertive*                 .71    .65       0      .48    .45       0      .46    .48       1
 4. Cheerful*                  .53    .60       2      .42    .40       0      .24    .38       5
 5. Serious*                   .43    .35       0      .22    .27       2      .31    .24       0
 6. Cool, Aloof                .49    .48       0      .20    .46      10      .02    .34      36
 7. Unshakable Poise           .20    .40      16      .22    .27       4      .14    .19      10
 8. Broad Interests*           .47    .46       0      .26    .37       6      .35    .32       0
 9. Trustful                   .26    .34       5      .08    .25      19      .11    .17       9
10. Self-centered              .30    .34       2      .17    .27       6     -.07    .19      36
11. Talkative*                 .82    .65       0      .47    .45       0      .43    .48       1
12. Adventurous                .45    .60       6      .28    .30       2      .16    .36      14
13. Socially Awkward           .45    .37       0      .06    .21      28      .04    .16      30
14. Adaptable*                 .44    .40       0      .18    .23      10      .17    .29       8
15. Self-sufficient*           .32    .33       1      .13    .18       5      .18    .15       0
16. Worrying, Anxious*         .41    .37       0      .23    .33       5      .15    .16       1
17. Conscientious              .26    .33       4      .11    .32      19      .21    .23       2
18. Imaginative*               .43    .46       1      .32    .31       0      .36    .32       0
19. Interest in Women*         .42    .43       2      .55    .38       0      .37    .40       1
20. Secretive, Reserved*       .40    .58       5      .38    .40       2      .32    .35       3
21. Independent Minded         .39    .42       2      .08    .25      19      .21    .30       3
22. Emotional Expression*      .62    .63       1      .31    .46       5      .19    .34      10

Note: Val. = value in validity diagonal; Highest Het. = highest heterotrait value; No. Higher = number of heterotrait values exceeding the validity diagonal.
* Trait names which have validities in all three heteromethod blocks significantly greater than the heterotrait-heteromethod values at the .001 level.
chance validity. All those having 10 or fewer exceptions have a degree of validity significant at the .001 level as crudely estimated by a one-tailed sign test.³ All but one of the variables meet this level for the Staff-Teammate block, all but four for the Staff-Self block, all but five for the most independent block, Teammate-Self. The exceptions to significant validity are not parallel from column to column, however, and only 13 of 22 variables have .001 significant validity in all three blocks. These are indicated by an asterisk in Table 13.

³ If we take the validity value as fixed (ignoring its sampling fluctuations), then we can determine whether the number of values larger than it in its row and column is less than expected on the null hypothesis that half the values would be above it. This procedure requires the assumption that the position (above or below the validity value) of any one of these comparison values is independent of the position of each of the others, a dubious assumption when common methods and trait variance are present.
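The sign test described in the footnote can be sketched in a few lines. This is our illustration, assuming, as the footnote does (and flags as dubious), that under the null hypothesis each of the 42 comparisons falls above or below the validity value independently with probability one half.

```python
from math import comb

def sign_test_p(n_higher, n_comparisons=42):
    """One-tailed sign test: probability, under the null that each comparison
    is equally likely to fall above or below the validity value, of observing
    this few (or fewer) comparison values above it."""
    return sum(comb(n_comparisons, k)
               for k in range(n_higher + 1)) / 2 ** n_comparisons

# With 42 comparisons, 10 or fewer exceptions is significant beyond .001:
print(sign_test_p(10) < .001)  # True
```

This confirms the text's crude criterion: observing at most 10 of 42 comparisons above the validity value has a null probability below .001.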
This highly significant general level of validity must not obscure the meaningful problem created by the occasional exceptions, even for the best variables. The excellent traits of Assertive and Talkative provide a case in point. In terms of Fiske's original analysis, both have high loadings on the recurrent factor "Confident self-expression" (represented by Assertive in Table 12). Talkative also had high loadings on the recurrent factor of Social Adaptability (represented by Cheerful in Table 12). We would expect, therefore, both high correlation between them and significant discrimination as well. And even at the common-sense level, most psychologists would expect fellow psychologists to discriminate validly between assertiveness (nonsubmissiveness) and talkativeness. Yet in the Teammate-Self block, Assertive rated by self correlates .48 with Talkative by teammates, higher than either of their validities in this block, .43 and .46.
In terms of the average values of the validities and the frequency of exceptions, there is a distinct trend for the Staff-Teammate block to show the greatest agreement. This can be attributed to several factors. Both represent ratings from the external point of view. Both are averaged over three judges, minimizing individual biases and undoubtedly increasing reliabilities. Moreover, the Teammate ratings were available to the Staff in making their ratings. Another effect contributing to the less adequate convergence and discrimination of Self ratings was a response set toward the favorable pole which greatly reduced the range of these measures (Fiske, 1949, p. 342). Inspection of the details of the instances of invalidity summarized in Table 13 shows that in most instances the effect is attributable to the high specificity and low communality for the self-rating trait. In these instances, the column and row intersecting at the low validity diagonal are asymmetrical as far as general level of correlation is concerned, a fact covered over by the condensation provided in Table 13.
The personality psychologist is initially predisposed to reinterpret self-ratings, to treat them as symptoms rather than to interpret them literally. Thus, we were alert to instances in which the self ratings were not literally interpretable, yet nonetheless had a diagnostic significance when properly "translated." By and large, the instances of invalidity of self-descriptions found in this assessment study are not of this type, but rather are to be explained in terms of an absence of communality for one of the variables involved. In general, where these self descriptions are interpretable at all, they are as literally interpretable as are teammate descriptions. Such a finding may, of course, reflect a substantial degree of insight on the part of these Ss.
The general success in discriminant validation coupled with the parallel factor patterns found in Fiske's earlier analysis of the three intramethod matrices seemed to justify an inspection of the factor pattern validity in this instance. One possible procedure would be to do a single analysis of the whole 66 × 66 matrix. Other approaches focused upon separate factoring of heteromethod blocks, matrix by matrix, could also be suggested. Not only would such methods be extremely tedious, but in addition they would leave undetermined the precise comparison of factor-pattern similarity. Correlating factor loadings over the population of variables was employed for this purpose by Fiske (1949), but while this provided for the identification of recurrent factors, no single over-all index of factor-pattern similarity was generated. Since our immediate interest was in confirming a pattern of interrelationships, rather than in describing it, an efficient short cut was available: namely, to test the similarity of the sets of heterotrait values by correlation coefficients in which each entry represented the size values of the given heterotrait coefficients in two different matrices. For the full matrix, such correlations would be based upon the N of the (22 × 21)/2 or 231 specific heterotrait combinations. Correlations were computed between the Teammate and Self monomethod matrices, selected as maximally independent. (The values to follow were computed from the original correlation matrix and are somewhat higher than that which would be obtained from a reflected matrix.) The similarity between the two monomethod matrices was .84, corroborating the factor-pattern similarity between these matrices described more fully by Fiske in his parallel factor analyses of them. To carry this mode of analysis into the heteromethod block, this block was treated as though divided into two by the validity diagonal, the above-diagonal values and the below-diagonal values representing the maximally independent validation of the heterotrait correlation pattern. These two correlated .63, a value which, while lower, shows an impressive degree of confirmation. There remains the question as to whether this pattern upon which the two heteromethod-heterotrait triangles agree is the same one found in common between the two monomethod triangles. The intra-Teammate matrix correlated with the two heteromethod triangles .71 and .71. The intra-Self matrix correlated with the two .57 and .63. In general, then, there is evidence for validity of the intertrait relationship pattern.
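The shortcut just described can be sketched as a correlation over trait pairs. This is our illustration, not the authors' computation: it is applied to the five-trait Staff and Teammate monomethod triangles of Table 12 (10 pairs), whereas the .84 and .63 values reported above come from the full 22-trait matrices (231 pairs).

```python
def pattern_similarity(m1, m2):
    """Pearson r between corresponding below-diagonal entries of two
    equally ordered trait intercorrelation matrices."""
    n = len(m1)
    x = [m1[i][j] for i in range(n) for j in range(i)]
    y = [m2[i][j] for i in range(n) for j in range(i)]
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Lower triangles of the Staff and Teammate monomethod sections of Table 12
# (order: Assertive, Cheerful, Serious, Unshakable Poise, Broad Interests;
# diagonal entries are placeholders and are never read).
staff = [
    [1.00],
    [.37, 1.00],
    [-.24, -.14, 1.00],
    [.25, .46, .08, 1.00],
    [.35, .19, .09, .31, 1.00],
]
teammate = [
    [1.00],
    [.37, 1.00],
    [-.15, -.19, 1.00],
    [.11, .23, .19, 1.00],
    [.33, .22, .19, .29, 1.00],
]
print(f"{pattern_similarity(staff, teammate):.2f}")  # 0.88
```

The high similarity in this small sample is consistent with the corroboration of factor-pattern similarity reported above, though the sample value should not be read as reproducing the full-matrix figures.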
DISCUSSION
Relation to construct validity. While the validational criteria presented are explicit or implicit in the discussions of construct validity (Cronbach & Meehl, 1955; APA, 1954), this paper is primarily concerned with the adequacy of tests as measures of a construct rather than with the adequacy of a construct as determined by the confirmation of theoretically predicted associations with measures of other constructs. We believe that before one can test the relationships between a specific trait and other traits, one must have some confidence in one's measures of that trait. Such confidence can be supported by evidence of convergent and discriminant validation. Stated in different words, any conceptual formulation of trait will usually include implicitly the proposition that this trait is a response tendency which can be observed under more than one experimental condition and that this trait can be meaningfully differentiated from other traits. The testing of these two propositions must be prior to the testing of other propositions to prevent the acceptance of erroneous conclusions. For example, a conceptual framework might postulate a large correlation between Traits A and B and no correlation between Traits A and C. If the experimenter then measures A and B by one method (e.g., questionnaire) and C by another method (such as the measurement of overt behavior in a situation test), his findings may be consistent with his hypotheses solely as a function of method variance common to his measures of A and B but not to C.
The requirements of this paper are intended to be as appropriate to the relatively atheoretical efforts typical of the tests and measurements field as to more theoretical efforts. This emphasis on validational criteria appropriate to our present atheoretical level of test construction is not at all incompatible with a recognition of the desirability of increasing the extent to which all aspects of a test and the testing situation are determined by explicit theoretical considerations, as Jessor and Hammond have advocated (Jessor & Hammond, 1957).
Relation to operationalism. Underwood (1957, p. 54) in his effective presentation of the operationalist point of view shows a realistic awareness of the amorphous type of theory with which most psychologists work. He contrasts a psychologist's "literary" conception with the latter's operational definition as represented by his test or other measuring instrument. He recognizes the importance of the literary definition in communicating and generating science. He cautions that the operational definition "may not at all measure the process he wishes to measure; it may measure something quite different" (1957, p. 55). He does not, however, indicate how one would know when one was thus mistaken.
The requirements of the present paper may be seen as an extension of the kind of operationalism Underwood has expressed. The test constructor is asked to generate from his literary conception or private construct not one operational embodiment, but two or more, each as different in research vehicle as possible. Furthermore, he is asked to make explicit the distinction between his new variable and other variables, distinctions which are almost certainly implied in his literary definition. In his very first validational efforts, before he ever rushes into print, he is asked to apply the several methods and several traits jointly. His literary definition, his conception, is now best represented in what his independent measures of the trait hold distinctively in common. The multitrait-multimethod matrix is, we believe, an important practical first step in avoiding "the danger ... that the investigator will fall into the trap of thinking that because he went from an artistic or literary conception ... to the construction of items for a scale to measure it, he has validated his artistic conception" (Underwood, 1957, p. 55). In contrast with the single operationalism now dominant in psychology, we are advocating a multiple operationalism, a convergent operationalism (Garner, 1954; Garner, Hake, & Eriksen, 1956), a methodological triangulation (Campbell, 1953, 1956), an operational delineation (Campbell, 1954), a convergent validation.
Underwood's presentation and that of this paper as a whole imply moving from concept to operation, a sequence that is frequent in science, and perhaps typical. The same point can be made, however, in inspecting a transition from operation to construct. For any body of data taken from a single operation, there is a subinfinity of interpretations possible; a subinfinity of concepts, or combinations of concepts, that it could represent. Any single operation, as representative of concepts, is equivocal. In an analogous fashion, when we view the Ames distorted room from a fixed point and through a single eye, the data of the retinal pattern are equivocal, in that a subinfinity of hexahedrons could generate the same pattern. The addition of a second viewpoint, as through binocular parallax, greatly reduces this equivocality, greatly limits the constructs that could jointly account for both sets of data. In Garner's (1954) study, the fractionation measures from a single method were equivocal: they could have been a function of the stimulus distance being fractionated, or they could have been a function of the comparison stimuli used in the judgment process. A multiple, convergent operationalism reduced this equivocality, showing the latter conceptualization to be the appropriate one, and revealing a preponderance of methods variance. Similarly for learning studies: in identifying constructs with the response data from animals in a specific operational setup there is equivocality which can operationally be reduced by introducing transposition tests, different operations so designed as to put to comparison the rival conceptualizations (Campbell, 1954).
Garner's convergent operationalism and our insistence on more than one method for measuring each concept depart from Bridgman's early position that "if we have more than one set of operations, we have more than one concept, and strictly there should be a separate name to correspond to each different set of operations" (Bridgman, 1927, p. 10). At the current stage of psychological progress, the crucial requirement is the demonstration of some convergence, not complete congruence, between two distinct sets of operations. With only one method, one has no way of distinguishing trait variance from unwanted method variance. When psychological measurement and conceptualization become better developed, it may well be appropriate to differentiate conceptually between Trait-Method Unit A1 and Trait-Method Unit A2, in which Trait A is measured by different methods. More likely, what we have called method variance will be specified theoretically in terms of a set of constructs. (This has in effect been illustrated in the discussion above in which it was noted that the response set variance might be viewed as trait variance, and in the rearrangement of the social intelligence matrices of Tables 4 and 5.) It will then be recognized that measurement procedures usually involve several theoretical constructs in joint application. Using obtained measurements to estimate values for a single construct under this condition still requires comparison of complex measures varying in their trait composition, in something like a multitrait-multimethod matrix. Mill's joint method of similarities and differences still epitomizes much about the effective experimental clarification of concepts.
The evaluation of a multitrait-multimethod matrix. The evaluation of the correlation matrix formed by intercorrelating several trait-method units must take into consideration the many factors which are known to affect the magnitude of correlations. A value in the validity diagonal must be assessed in the light of the reliabilities of the two measures involved: e.g., a low reliability for Test A2 might exaggerate the apparent method variance in Test A1. Again, the whole approach assumes adequate sampling of individuals: the curtailment of the sample with respect to one or more traits will depress the reliability coefficients and intercorrelations involving these traits. While restrictions of range over all traits produce serious difficulties in the interpretation of a multitrait-multimethod matrix and should be avoided whenever possible, the presence of different degrees of restriction on different traits is the more serious hazard to meaningful interpretation.
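One conventional way to assess a validity value in the light of the reliabilities involved is the classical correction for attenuation. This is a standard psychometric formula, not a procedure the paper itself applies; the sketch below uses Table 10's values for illustration.

```python
def disattenuate(r_xy, r_xx, r_yy):
    """Classical correction for attenuation: estimate the correlation between
    two measures if both were perfectly reliable, r / sqrt(r_xx * r_yy)."""
    return r_xy / (r_xx * r_yy) ** 0.5

# Table 10's verbal-pictorial validity for mechanical facts (.61),
# given reliabilities of .89 (verbal) and .82 (pictorial):
print(round(disattenuate(.61, .89, .82), 2))  # 0.71
```

Such a correction shows how much of an unimpressive validity diagonal value may simply reflect unreliability rather than absent trait variance, though it cannot by itself separate trait variance from shared method variance.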
Various statistical treatments for multitrait-multimethod matrices might be developed. We have considered rough tests for the elevation of a value in the validity diagonal above the comparison values in its row and column. Correlations between the columns for variables measuring the same trait, variance analyses, and factor analyses have been proposed to us. However, the development of such statistical methods is beyond the scope of this paper. We believe that such summary statistics are neither necessary nor appropriate at this time. Psychologists today should be concerned not with evaluating tests as if the tests were fixed and definitive, but rather with developing better tests. We believe that a careful examination of a multitrait-multimethod matrix will indicate to the experimenter what his next steps should be: it will indicate which methods should be discarded or replaced, which concepts need sharper delineation, and which concepts are poorly measured because of excessive or confounding method variance. Validity judgments based on such a matrix must take into account the stage of development of the constructs, the postulated relationships among them, the level of technical refinement of the methods, the relative independence of the methods, and any pertinent characteristics of the sample of Ss. We are proposing that the validational process be viewed as an aspect of an ongoing program for improving measuring procedures and that the "validity coefficients" obtained at any one stage in the process be interpreted in terms of gains over preceding stages and as indicators of where further effort is needed.
The design of a multitrait-multimethod matrix. The several methods and traits included in a validational matrix should be selected with care. The several methods used to measure each trait should be appropriate to the trait as conceptualized. Although this view will reduce the range of suitable methods, it will rarely restrict the measurement to one operational procedure.
Wherever possible, the several methods in one matrix should be completely independent of each other: there should be no prior reason for believing that they share method variance. This requirement is necessary to permit the values in the heteromethod-heterotrait triangles to approach zero. If the nature of the traits rules out such independence of methods, efforts should be made to obtain as much diversity as possible in terms of data-sources and classification processes. Thus, the classes of stimuli or the background situations, the experimental contexts, should be different. Again, the persons providing the observations should have different roles or the procedures for scoring should be varied.
Plans for a validational matrix should take into account the difference between the interpretations regarding convergence and discrimination. It is sufficient to demonstrate convergence between two clearly distinct methods which show little overlap in the heterotrait-heteromethod triangles. While agreement between several methods is desirable, convergence between two is a satisfactory minimal requirement. Discriminative validation is not so easily achieved. Just as it is impossible to prove the null hypothesis, or that some object does not exist, so one can never establish that a trait, as measured, is differentiated from all other traits. One can only show that this measure of Trait A has little overlap with those measures of B and C, and no dependable generalization beyond B and C can be made. For example, social poise could probably be readily discriminated from aesthetic interests, but it should also be differentiated from leadership.
Insofar as the traits are related and are expected to correlate with each other, the monomethod correlations will be substantial and heteromethod correlations between traits will also be positive. For ease of interpretation, it may be best to include in the matrix at least two traits, and preferably two sets of traits, which are postulated to be independent of each other.
In closing, a word of caution is needed. Many multitrait-multimethod matrices will show no convergent validation: no relationship may be found between two methods of measuring a trait. In this common situation, the experimenter should examine the evidence in favor of several alternative propositions: (a) Neither method is adequate for measuring the trait; (b) One of the two methods does not really measure the trait. (When the evidence indicates that a method does not measure the postulated trait, it may prove to measure some other trait. High correlations in the heterotrait-heteromethod triangles may provide hints to such possibilities.) (c) The trait is not a functional unity, the response tendencies involved being specific to the nontrait attributes of each test. The failure to demonstrate convergence may lead to conceptual developments rather than to the abandonment of a test.
SUMMARY
This paper advocates a validational process utilizing a matrix of intercorrelations among tests representing at least two traits, each measured by at least two methods. Measures of the same trait should correlate higher with each other than they do with measures of different traits involving separate methods. Ideally, these validity values should also be higher than the correlations among different traits measured by the same method.
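Stated as a procedure, the two requirements in this paragraph can be checked mechanically. A minimal sketch, assuming an illustrative two-trait, two-method correlation matrix (variable order A1, B1, A2, B2; the numbers are invented for the example):

```python
import numpy as np

# Invented correlations: traits A and B, each measured by methods 1 and 2.
R = np.array([
    [1.00, 0.35, 0.57, 0.22],   # A1
    [0.35, 1.00, 0.24, 0.61],   # B1
    [0.57, 0.24, 1.00, 0.38],   # A2
    [0.22, 0.61, 0.38, 1.00],   # B2
])

validity = [R[0, 2], R[1, 3]]   # same trait, different methods
het_het  = [R[0, 3], R[1, 2]]   # different traits, different methods
het_mono = [R[0, 1], R[2, 3]]   # different traits, same method

# First requirement: validity values exceed the heterotrait-heteromethod values.
convergent_and_discriminant = all(v > h for v in validity for h in het_het)

# Ideal condition: validity values also exceed the heterotrait-monomethod values.
ideal = all(v > h for v in validity for h in het_mono)
```

In this contrived matrix both conditions hold; the paper's point is precisely that published matrices rarely satisfy the second.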
Illustrations from the literature show that these desirable conditions, as a set, are rarely met. Method or apparatus factors make very large contributions to psychological measurements.
The notions of convergence between independent measures of the same trait and discrimination between measures of different traits are compared with previously published formulations, such as construct validity and convergent operationalism. Problems in the application of this validational process are considered.
REFERENCES
AMERICAN PSYCHOLOGICAL ASSOCIATION. Technical recommendations for psychological tests and diagnostic techniques. Psychol. Bull., Suppl., 1954, 51, Part 2, 1-38.
ANDERSON, E. E. Interrelationship of drives in the male albino rat. I. Intercorrelations of measures of drives. J. comp. Psychol., 1937, 24, 73-118.
AYER, A. J. The problem of knowledge. New York: St. Martin's Press, 1956.
BORGATTA, E. F. Analysis of social interaction and sociometric perception. Sociometry, 1954, 17, 7-32.
BORGATTA, E. F. Analysis of social interaction: Actual, role-playing, and projective. J. abnorm. soc. Psychol., 1955, 51, 394-405.
BRIDGMAN, P. W. The logic of modern physics. New York: Macmillan, 1927.
BURWEN, L. S., & CAMPBELL, D. T. The generality of attitudes toward authority and nonauthority figures. J. abnorm. soc. Psychol., 1957, 54, 24-31.
CAMPBELL, D. T. A study of leadership among submarine officers. Columbus: Ohio State Univer. Res. Found., 1953.
CAMPBELL, D. T. Operational delineation of "what is learned" via the transposition experiment. Psychol. Rev., 1954, 61, 167-174.
CAMPBELL, D. T. Leadership and its effects upon the group. Monogr. No. 83. Columbus: Ohio State Univer. Bur. Business Res., 1956.
CARROLL, J. B. Ratings on traits measured by a factored personality inventory. J. abnorm. soc. Psychol., 1952, 47, 626-632.
CHI, P.-L. Statistical analysis of personality rating. J. exp. Educ., 1937, 5, 229-245.
CRONBACH, L. J. Response sets and test validity. Educ. psychol. Measmt, 1946, 6, 475-494.
CRONBACH, L. J. Essentials of psychological testing. New York: Harper, 1949.
CRONBACH, L. J. Further evidence on response sets and test design. Educ. psychol. Measmt, 1950, 10, 3-31.
CRONBACH, L. J., & MEEHL, P. E. Construct validity in psychological tests. Psychol. Bull., 1955, 52, 281-302.
EDWARDS, A. L. The social desirability variable in personality assessment and research. New York: Dryden, 1957.
FEIGL, H. The mental and the physical. In H. Feigl, M. Scriven, & G. Maxwell (Eds.), Minnesota studies in the philosophy of science. Vol. II. Concepts, theories and the mind-body problem. Minneapolis: Univer. Minnesota Press, 1958.
FISKE, D. W. Consistency of the factorial structures of personality ratings from different sources. J. abnorm. soc. Psychol., 1949, 44, 329-344.
GARNER, W. R. Context effects and the validity of loudness scales. J. exp. Psychol., 1954, 48, 218-224.
GARNER, W. R., HAKE, H. W., & ERIKSEN, C. W. Operationism and the concept of perception. Psychol. Rev., 1956, 63, 149-159.
JESSOR, R., & HAMMOND, K. R. Construct validity and the Taylor Anxiety Scale. Psychol. Bull., 1957, 54, 161-170.
KELLEY, T. L., & KREY, A. C. Tests and measurements in the social sciences. New York: Scribner, 1934.
KELLY, E. L., & FISKE, D. W. The prediction of performance in clinical psychology. Ann Arbor: Univer. Michigan Press, 1951.
LOEVINGER, J., GLESER, G. C., & DUBOIS, P. H. Maximizing the discriminating power of a multiple-score test. Psychometrika, 1953, 18, 309-317.
LORGE, I. Gen-like: Halo or reality? Psychol. Bull., 1937, 34, 545-546.
MAYO, G. D. Peer ratings and halo. Educ. psychol. Measmt, 1956, 16, 317-323.
STRANG, R. Relation of social intelligence to certain other factors. Sch. & Soc., 1930, 32, 268-272.
SYMONDS, P. M. Diagnosing personality and conduct. New York: Appleton-Century, 1931.
THORNDIKE, E. L. A constant error in psychological ratings. J. appl. Psychol., 1920, 4, 25-29.
THORNDIKE, R. L. Factor analysis of social and abstract intelligence. J. educ. Psychol., 1936, 27, 231-233.
THURSTONE, L. L. The reliability and validity of tests. Ann Arbor: Edwards, 1937.
TRYON, R. C. Individual differences. In F. A. Moss (Ed.), Comparative psychology. (2nd ed.) New York: Prentice-Hall, 1942. Pp. 330-365.
UNDERWOOD, B. J. Psychological research. New York: Appleton-Century-Crofts, 1957.
VERNON, P. E. Educational ability and psychological factors. Address given to the Joint Education-Psychology Colloquium, Univer. of Illinois, March 29, 1957.
VERNON, P. E. Educational testing and test form factors. Princeton: Educational Testing Service, 1958. (Res. Bull. RB-58-3.)
Received June 19, 1958.