Psychological Bulletin, 1979, Vol. 86, No. 2, 376-390

Interobserver Agreement, Reliability, and Generalizability of Data Collected in Observational Studies

Sandra K. Mitchell
Department of Maternal and Child Nursing

University of Washington

Research in developmental and educational psychology has come to rely less on conventional psychometric tests and more on records of behavior made by human observers in natural and quasi-natural settings. Three coefficients that purport to reflect the quality of data collected in these observational studies are discussed: the interobserver agreement percentage, the reliability coefficient, and the generalizability coefficient. It is concluded that although high interobserver agreement is desirable in observational studies, high agreement alone is not sufficient to insure the quality of the data that are collected. Evidence of the reliability or generalizability of the data should also be reported. Further advantages of generalizability designs are discussed.

Almost everyone engaged in research recognizes the need for reliable measuring instruments. Reliability is a central topic, particularly for courses and textbooks concerned with the behavioral sciences. In spite of varying theoretical derivations, its definition is remarkably uniform: A reliable instrument is one with small errors of measurement, one that shows stability, consistency, and dependability of scores for individuals on the trait, characteristic, or behavior being assessed.

The preparation of this article was supported in part by Contract N01-NU-14174 with the Division of Nursing, Bureau of Health Resources and Development, U.S. Public Health Service, Health Resources Administration, Department of Health, Education, and Welfare. The article is based in part on a dissertation submitted in partial fulfillment of the requirements for the PhD degree at the University of Washington.

I would like to thank Halbert B. Robinson and Kathryn E. Barnard for their support of the dissertation research, and Terence R. Mitchell and Nancy E. Jackson for their comments on an earlier draft of the article.

Requests for reprints should be sent to Sandra K. Mitchell, Department of Maternal and Child Nursing, Child Development and Mental Retardation Center, University of Washington, Seattle, Washington 98195.

Historically, the study of reliability has been linked to the study of individual differences and has been largely restricted to standardized tests of intelligence, achievement, and personality. These tests, however, are increasingly being replaced in developmental and educational psychology research by observations of subjects made in natural and quasi-natural settings. Although these observational studies vary widely in content and method, they all use human observers to record (and in some cases to summarize and abstract) the behavior of the subjects. Surprisingly, the reliability of these observational methods has not received the same attention as has the reliability of the more traditional methods (Johnson & Bolstad, 1973).

There are at least three different ways to think about the reliability of observational data. First, the researcher could focus on the extent to which two observers, working independently, agree on what behaviors are occurring. A coefficient that reflects the extent of this agreement has often been used to report reliability in observational studies. Second, the observational measure could be considered a special case of a standardized psychological test, and the definitions of reliability that come from classic psychometric theory (e.g., test-retest and alternate forms) could be used. Finally, an observational measure could be thought to provide data that are under the influence of a number of different aspects of the observation situation (e.g., different observers or different occasions), including individual differences among subjects. This third viewpoint was developed in Cronbach's (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) theory of generalizability. The purpose of this article is to examine the appropriateness and correct interpretation of these three coefficients when they are used to reflect the quality and dependability of data gathered in observational studies.

Observer Agreement

To insure that data collected by human observers are objective, researchers typically obtain and report coefficients that demonstrate that two or more observers watching the same behavior at the same time will record the same data. These coefficients are offered as the reliability of the instrument being used.

Interobserver Agreement Percentage

The most common index of the quality of the data collected in observational studies is the interobserver agreement percentage. In its simplest form, this coefficient is just what its name implies: the percentage of time units during which the records of two observers are in agreement about the record of behavior.
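
As a concrete illustration of the computation just described, the sketch below assumes that each observer's record is a sequence of category codes, one per time unit; the data and coding scheme are hypothetical, since the article prescribes no particular format.

```python
def agreement_percentage(record_a, record_b):
    """Percentage of time units on which two observers' records agree."""
    if len(record_a) != len(record_b):
        raise ValueError("records must cover the same time units")
    matches = sum(a == b for a, b in zip(record_a, record_b))
    return 100.0 * matches / len(record_a)

# Hypothetical records: two observers coding ten 10-sec units of a
# child's proximity to the teacher as "near" or "far"
obs_1 = ["near", "near", "far", "near", "far", "far", "near", "near", "far", "near"]
obs_2 = ["near", "far", "far", "near", "far", "far", "near", "near", "near", "near"]
print(agreement_percentage(obs_1, obs_2))  # 80.0
```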

Two comments can be made about the use of observer agreement percentages. First, the majority of studies report data only in this fashion. Consider, for example, the field of developmental psychology, in which about one third of recently published research articles use observational techniques. In Volume 47 (1976) of Child Development, 33 full-length articles reported observational measures. Of these studies, just under half (49%) reported only observer agreement figures as indicants of the quality of their data. Similarly, of 21 observational studies reported in Volume 12 (1976) of Developmental Psychology, 51% reported observer agreement percentages only.

Second, the amount of variability among the subjects in a study has very little impact, at least in theory, on the size of an interobserver agreement percentage. In actual practice, however, the degree of variability can make quite a difference: In very homogeneous groups, observer agreement percentages are necessarily quite high because all scores given to all subjects are very close together. Thus, a measure that shows high agreement may, in some populations at least, have done a poor job of differentiating among subjects.

Other Problems With Observer Agreement

The interobserver agreement percentage has several important shortcomings. It is, first of all, insensitive to degrees of agreement; that is, it treats agreement as an all-or-none phenomenon, with no room for partial or incomplete agreement. In this sense, the percentage underestimates the actual extent of agreement between two observers. Second, some agreement between independent observers can be expected on the basis of chance alone. For many observational studies, especially those that use frequency counts of individual behaviors, the extent of this chance agreement is dependent on the rates at which the target behaviors occur. Behaviors with very high and very low frequencies can have extremely high chance levels of agreement (Johnson & Bolstad, 1973). In this sense, the percentage appears to overestimate the real agreement.
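
The dependence of chance agreement on behavior rates can be made explicit with a little arithmetic. The sketch below, with hypothetical rates, treats a single binary behavior code and assumes the two observers' records are independent.

```python
def chance_agreement(rate_a, rate_b):
    """Expected proportion of time units on which two independent observers
    agree by chance, given each observer's rate of coding the behavior as
    present: both code 'present' or both code 'absent'."""
    return rate_a * rate_b + (1 - rate_a) * (1 - rate_b)

# A rare behavior coded in about 5% of time units by each observer:
print(chance_agreement(0.05, 0.05))  # 0.905 -- about 90% agreement from chance alone
```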

These difficulties in the use of observer agreement percentages are not unknown, and considerable effort has been expended to develop indices that overcome them. The alternative coefficients have been reviewed in detail elsewhere (Tinsley & Weiss, 1975). In general, they have been designed to overcome the mathematical shortcomings of the agreement percentage, such as chance levels of agreement.
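
One well-known example of such a chance-corrected index is Cohen's kappa; the text does not single it out, but it belongs to the class of alternatives reviewed. A minimal sketch, again assuming categorical records of equal length:

```python
from collections import Counter

def cohens_kappa(record_a, record_b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each observer's marginal category rates."""
    n = len(record_a)
    p_obs = sum(a == b for a, b in zip(record_a, record_b)) / n
    counts_a, counts_b = Counter(record_a), Counter(record_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

# With the hypothetical proximity records above, kappa (about .58) falls
# below the raw 80% agreement because "near" predominates in both records.
```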

A serious question remains, however, about the utility of these alternatives. In spite of their mathematical superiority, they deal with only one source of error (observer disagreement), and they deal with it without regard to the magnitude of the individual differences. These alternatives may give a more accurate picture of the level of observer agreement in a study than does the simple percentage, but they do not otherwise describe the stability, consistency, and dependability of the data that have been collected.

Influences on Observer Agreement

Interobserver agreement has been experimentally studied as a phenomenon in its own right. In these experiments, observer accuracy, that is, agreement with a predetermined correct behavioral record, is the dependent variable.

Reid (1970) compared the accuracy of observers during overt and covert assessment of their reliability and found that they were significantly more accurate when they were aware that they were being checked. Romanczyk, Kent, Diament, and O'Leary (1973) found a similar drop from the overt to the covert assessment situation. They also found that observers recorded behavior differently depending on which of several researchers' records they thought their own records would be compared with. Taplin and Reid (1973) found that observer accuracy decreased between the end of a training period and the beginning of data collection and that it increased on days when "spot checks" were expected. Mash and McElwee (1974) reported that observers who had been trained to code "predictable" behavior (i.e., conversations with redundant information) showed a decline in accuracy when they later coded "unpredictable" behavior, whereas observers trained with unpredictable sequences showed no such decline in accuracy. Taken together, these studies imply that differences in experience, mental set, and training of observers can influence the accuracy with which behavioral records are made and scored.

Interobserver agreement has also been used as the dependent variable in studies that compare different methods of observation. McDowell (1973) found comparable observer agreement in time sampling and continuous recording of infant caretaking activities in an institution. Lytton (1973) found the interobserver agreement of ratings, home observations, and laboratory experiments to vary only slightly, but the amounts of time and effort necessary to achieve these levels were quite different. Mash and McElwee's (1974) rating system with only four categories was used more accurately than was an eight-category form. It appears, then, that there are differences in the interobserver agreement that can be expected from different methods of behavioral observation. These differences are reflected both in different levels of agreement for equal amounts of training and in equal levels of agreement for different amounts of training.

It is difficult, however, to put these agreement differences into perspective without knowledge of the overall variability among subjects in the studies. If between-subjects variability is low, then the reported differences in agreement due to observer instructions or observational methods may considerably influence the outcome of the study. If between-subjects variability is high, on the other hand, then these differences will probably have little influence. The studies that used agreement as a dependent variable have identified some problem areas in data collection, but they have not evaluated the relative importance of these problems.

Psychometric Theory of Reliability

The classical method of determining the reliability of a test is for the researcher to obtain two scores for a group of subjects on the test. These two scores may come from two separate scorings of the instrument, from administration of two parts or forms of the instrument to the subjects, or from two administrations of the same instrument to the subjects. The correlation between the two scores is the reliability of the instrument.
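
In code, the classical estimate is nothing more than a Pearson correlation between the two sets of scores; the numbers below are hypothetical.

```python
import numpy as np

# Hypothetical scores for 8 subjects on two administrations of the same test
first = np.array([12, 15, 9, 20, 17, 11, 14, 18])
second = np.array([13, 14, 10, 19, 18, 10, 15, 17])
reliability = np.corrcoef(first, second)[0, 1]  # Pearson r, about .96 here
```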

The central theoretical concept that underlies this psychometric view of reliability is that every test score is composed of two parts: a true score, which reflects the presence or extent of some trait, characteristic, or behavior, plus an error score, which is random and independent of the true score (Nunnally, 1967). The proportion of variance accounted for by each of these parts is estimated from the correlation between the two scores obtained on the instrument.

The variance attributable to individual differences is usually given the same interpretation, regardless of how the two scores used to compute it were obtained. It reflects stable differences among individuals—the true score part of the data. The variance that is attributable to measurement error, however, is subject to varying interpretations, depending on how the two scores were obtained.

The error always includes, of course, the real error—those random fluctuations of the myriad factors that may affect the behavior being measured. These include such variables as the health or mental state of the subjects, the lighting or temperature in the testing room, and so forth. But the error also includes other sources of variation, depending on the method used to obtain the two scores. These other sources include differences within and between scorers, differences between different sections or forms of a test, and changes in subjects' behavior between two administrations of a test.

Thus, the three most common procedures for determining reliability involve (a) obtaining two separate scorings of the same instrument (intrascorer or interscorer reliability), (b) obtaining scores on two parts of the same instrument or on two very similar instruments (split-half or alternate-forms reliability), and (c) obtaining two scores from two separate administrations of the same instrument (test-retest reliability).

The researcher who wishes to use one of these classical methods of determining reliability in an observational study must somehow make his or her observations fit into the same general pattern as psychological tests. However, instead of one test with many items, all intended to measure the same trait, characteristic, or behavior, the observational researcher has a tool with a relatively small number of categories, with each category intended to measure a different trait, characteristic, or behavior. For each of these categories, which are generally mutually exclusive, data are usually collected during many distinct time units.

The most satisfactory way of making such data fit into the classic pattern seems to be to consider each mutually exclusive category (or type of behavior) a separate test with its own reliability. Each time unit is considered to be an item, since all time units are intended to measure the same trait, behavior, or characteristic. For example, a behavioral code might record a child's proximity to the teacher during each 10-sec unit of an observation. Each 10-sec unit would be an item in a test that measured proximity. If the measure consisted of a single summary score (such as in a rating), then there would be, in effect, no individual items at all, just one score.

This analogy between test reliability and the reliability of observational data can be extended to apply to each of the traditional ways of obtaining scores: intrascorer or interscorer reliability, split-half or alternate-forms reliability, and test-retest reliability.

Intrascorer or Interscorer Reliability

A clinical psychologist interested in self-directed aggression might listen twice to tape recordings of patients' responses to a projective test, each time counting the number of self-destructive statements. The correlation between the two counts for the group of patients would be the intrascorer reliability of the self-destructiveness score. The true score implied by this correlation would reflect real differences in self-destructiveness among the patients. The error would include not only the random error but also any inconsistencies in the psychologist's use of the self-destructiveness scale.

In actual practice, it is more likely that two psychologists would listen to the tape recordings. The correlation between their separate counts would be an interscorer reliability coefficient. The true score would again reflect real differences, but the error would reflect differences between the psychologists in their use of the scale, along with random error.

A similar situation exists, of course, when two or more observers record the behavior of subjects in other natural and quasi-natural settings. The correlation between the scores of two observers who kept track of how much individual attention each child received from the teacher would be an interobserver reliability coefficient. This coefficient, once again, should not be confused with the observer agreement percentage.

Split-Half or Alternate-Forms Reliability

One way of determining the reliability of a standardized test is to compare scores on two subdivisions of the test (odd- and even-numbered items, frequently) or scores on two very similar versions of the test. In an observational study, the corresponding comparisons would be between subdivisions of one observation (e.g., odd- and even-numbered minutes during a tennis lesson), or between two very similar observations (first and second halves of a lesson, perhaps). This is an example of how time units can be considered analogous to test items.

Just as with interobserver reliability, the true score component of the variance in split-half or alternate-forms reliability reflects consistent individual differences among subjects. The error component, however, has a different interpretation. Along with random fluctuations in the behavior of the subjects, real differences in subject behavior between the two observed subdivisions are included as part of the error.

Test-Retest Reliability

Perhaps the most straightforward way to obtain two scores in a reliability study is to administer the same instrument at two different times. An observer might, for example, visit classrooms on different days to record the teacher's use of a particular instructional technique. As before, the true score is assumed to reflect some stable trait, characteristic, or behavior. In this case, the error includes not only random fluctuations of subject behavior but also whatever real changes in subject behavior have occurred between the two administrations of the test.

It is interesting to note that there is little difference between alternate-forms and test-retest reliability for observational measures. Since time units serve as items, observations made on different days can be considered either as alternate-forms or as test-retest conditions, depending on the situation.

Use of Reliability Coefficients

Three comments apply to all of these versions of the reliability coefficient. First, although the examples given are from hypothetical observational studies, real observational studies do not make use of all of the possible coefficients. Interobserver reliability or agreement is reported to the virtual exclusion of split-half and test-retest coefficients. Once again, developmental psychology can serve as an example. In Volume 47 (1976) of Child Development, 49% of the full-length research articles that used observational methods reported only observer agreement, and 39% more reported interobserver correlation coefficients. Only three of the studies (12%) reported a reliability coefficient that reflected the stability of subject behavior over time, that is, a split-half or test-retest reliability. Similarly, in Volume 12 (1976) of Developmental Psychology, 57% of the studies reported agreement only, 38% reported observer reliability coefficients, and only one study used a measure based on more than one sample of behavior per subject.

Second, although this discussion has emphasized the sources of error in these coefficients, the variance of true scores is as important in determining the size of the reliability coefficient as the variance of error scores is. Recall that the reliability of a test score can be expressed as

$$\text{reliability} = \frac{\text{true score variance}}{\text{true score variance} + \text{error variance}}.$$

For a given level of error variance, then, an instrument will have a lower reliability when it is used on a homogeneous group of subjects (low true score variance) than it will when it is used on a more heterogeneous group (high true score variance). For instance, if the error variance is 10 and the true score variance is also 10, the reliability of the instrument is 10/20 or .50. But if the error variance is 10 and the true score variance is 40, the reliability of the instrument is 40/50 or .80. This is in contrast to the observer agreement percentage, which is highest for homogeneous groups of subjects.
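
The two numerical cases above can be reproduced directly from the variance ratio:

```python
def reliability_from_variances(true_var, error_var):
    """Classical reliability: the proportion of observed-score variance
    attributable to true scores."""
    return true_var / (true_var + error_var)

print(reliability_from_variances(10, 10))  # 0.5 -- homogeneous group
print(reliability_from_variances(40, 10))  # 0.8 -- more heterogeneous group
```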

Third, it should be repeated that reliability and observer agreement are not the same. It is possible, as illustrated by Tinsley and Weiss (1975), to have high interobserver agreement and a low reliability (correlation) coefficient, and vice versa. For instance, two observers might have perfect agreement about the color of shoes worn by children in a classroom, but if all the children wore red shoes, shoe color would not differentiate among the children. On the other hand, there might be a high correlation between two observers' records of the duration of a teacher's attention to a particular youngster, but if one observer's watch ran slower than the other's watch, they would probably never agree on the actual duration of the attention.

The differences between agreement and reliability are based on the way the two indices are defined. Reliability coefficients partition the variance of a set of scores into a true score (individual differences) and an error component. The error component may include random fluctuations in the behavior of subjects, inconsistencies in the use of the scale, differences among observers, and so forth. Interobserver agreement percentages, on the other hand, carry no information at all about individual differences among subjects and contain information about only one of the possible sources of error—differences among observers. In other words, a reliability coefficient reflects the relative magnitude of all error with respect to true score variability, whereas an agreement percentage reflects the absolute magnitude of just one kind of error.

All in all, there is no perfect reliability coefficient, nor is there one that is even generally best. Coefficients that use two scorings of the same instrument (interobserver and intraobserver reliability) confound random subject error with differences within and between scorers. Coefficients that use scores from subdivisions or alternative forms of the instruments (split-half and alternate-forms reliability) confound random subject error with differences between the subdivisions or forms. Finally, coefficients that use scores from the same instrument administered on two occasions (test-retest reliability) confound measurement errors with real changes in subject behavior that occur between the two administrations. The methods described cannot, then, separately estimate variance in test scores attributable to scorers, subtests (or forms), or occasions, nor can they consider these sources of error simultaneously. A more inclusive, multivariate theory is needed.

Generalizability Theory

Cronbach and his associates (Cronbach et al., 1972) have developed a theory that they call the theory of generalizability. (For a brief introduction to the theory, see J. P. Campbell, 1976, pp. 185-222.) Instead of assuming, as does classical test theory, that individual differences constitute the only lawful source of variation in test scores, generalizability theory assumes that there may be a number of sources of variation. These sources of variation other than individual differences are called facets. Different scorers, alternate test forms, or administration on different occasions are examples of facets that might be studied. A particular combination of facets makes up the universe to which test scores may be generalized.

A generalizability study (G study) is more reminiscent of a factorial study in experimental psychology than of a reliability study. In a G study, the researcher must collect data by systematically sampling conditions from each facet in the universe. For instance, two scorers might each score two alternate forms of a test given on different days to a group of subjects. Using an analysis of variance, it is then possible to independently estimate the contributions of each of the facets—scorers, forms, occasions, as well as subjects—to the overall variation in the set of test scores. Besides looking at the conventional F statistic to establish whether each facet makes a significant contribution to the scores, it is possible to compute what Cronbach calls variance components. These variance components reflect the size rather than the statistical significance of the contribution of each facet to the observed scores.
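
For the simplest crossed design (subjects by scorers, one score per cell), the variance components can be estimated from the ANOVA mean squares by the usual expected-mean-square algebra. The sketch below is a minimal illustration of that computation, not the full multifacet machinery of Cronbach et al. (1972); in this design the subject-by-scorer interaction is confounded with residual error.

```python
import numpy as np

def variance_components(scores):
    """Variance components for a fully crossed subjects x scorers design
    with one score per cell. scores: 2-D array, rows = subjects,
    columns = scorers."""
    n_s, n_o = scores.shape
    grand = scores.mean()
    ss_subj = n_o * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_scor = n_s * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_resid = ((scores - grand) ** 2).sum() - ss_subj - ss_scor
    ms_subj = ss_subj / (n_s - 1)
    ms_scor = ss_scor / (n_o - 1)
    ms_resid = ss_resid / ((n_s - 1) * (n_o - 1))
    # Negative estimates are conventionally truncated at zero.
    return {
        "subjects": max((ms_subj - ms_resid) / n_o, 0.0),
        "scorers": max((ms_scor - ms_resid) / n_s, 0.0),
        "residual": ms_resid,
    }
```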

In examining the quality of observational data, though, not only are the absolute sizes of the variance components of interest but the relative sizes of the components are also important. The relative sizes, therefore, are the focus of this discussion.

Recall that a reliability coefficient reflects the partitioning of variance into true and error components and that the coefficient is the ratio of true score variance to obtained score variance. It represents, in other words, the proportion of the total variance that is accounted for by individual differences. In the same way, a generalizability coefficient reflects the partitioning of variance into components that correspond to the facets sampled in the G study. The coefficient itself (an intraclass correlation) combines these components in a ratio that also represents the proportion of variance attributable to individual differences for a particular universe (set of conditions).

It should not be assumed, however, that a G study generates one coefficient that is appropriate for all applications of the instrument. On the contrary, one G study can generate several coefficients, each corresponding to a different universe of conditions. This fact points to an important distinction between the psychometric theory of reliability discussed earlier and the theory of generalizability. In psychometric theory, conditions of testing (or otherwise obtaining data) are assumed to influence only measurement error, not the true score on the instrument. In generalizability theory, on the other hand, the conditions of testing are assumed to influence the score itself. What Cronbach and his colleagues have shown is that while true scores all contain a common component, they also contain additional different components depending on the design; that is, it is not just the error variance that differs among the several reliability coefficients. This relationship can be illustrated by returning to the earlier example of interscorer reliability—two psychologists counting self-destructive statements from tape recordings. In generalizability terms, this is a one-facet study, that is, it samples observations from one facet (in this case, scorers) in addition to observations of different subjects. Analyzed as a G study, one can estimate variance components for subjects, for scorers, and for the interaction of subjects and scorers (which in this case is confounded with the residual error). The generalizability coefficient from this G study would reflect the dependability of a score for a subject generalized over scorers. In other words, it would indicate the proportion of variance accounted for by individual differences in subjects, above and beyond any effects accounted for by differences between scorers.
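
Continuing the sketch above, one common form of the coefficient for this one-facet design divides the subjects component by itself plus the error of the mean over scorers. Cronbach et al. define several variants, and the appropriate one depends on the intended universe; the data below are hypothetical.

```python
import numpy as np

def generalizability_coefficient(components, n_scorers):
    """Generalizability of a subject's mean score over n_scorers scorers
    (relative-decision form for the one-facet crossed design)."""
    sigma_subj = components["subjects"]
    sigma_err = components["residual"]  # interaction confounded with error
    return sigma_subj / (sigma_subj + sigma_err / n_scorers)

# Hypothetical counts of self-destructive statements: 4 patients x 2 scorers
scores = np.array([[3.0, 4.0],
                   [5.0, 5.0],
                   [2.0, 1.0],
                   [4.0, 4.0]])
vc = variance_components(scores)  # from the sketch above
print(round(generalizability_coefficient(vc, n_scorers=2), 2))  # 0.92
```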

Suppose that a second facet—occasions—were added to this study, so that each scorer would count self-destructive statements for each tape two times. This study combines aspects of the interscorer and the intrascorer reliability studies. The analysis of this two-facet G study would yield variance components for subjects, for scorers, and for occasions (which in this case would be interpreted as intrascorer change). Further, variance components could be computed for each of the possible interactions of these facets: Subjects X Scorers, Subjects X Occasions, Scorers X Occasions, and Subjects X Scorers X Occasions (which in this case is confounded with residual error). The generalizability coefficient from this study would reflect the proportion of variance accounted for by individual differences in subjects apart from the effects between and within scorers.

Suppose further that this study were extended to three facets by having each psychologist use two different scoring methods for each tape recording he or she listened to (perhaps a count of self-destructive statements and a global rating of self-destructiveness). Observational methods would then be a facet in the universe of generalization.

Clearly each facet that is added to the study makes the information available from the analysis more complete. But there is a significant cost for the extra information provided by each facet: The number of observations required of each subject is multiplied by the number of conditions sampled in the facet. In the present example, the one-facet study would have two scores, the two-facet study would have four scores, and the three-facet study would have eight scores for each subject.

Described below are some three-facet G studies that parallel the intraobserver-interobserver, split-half, and test-retest reliability studies that were discussed earlier. All of these G studies use the same three facets (observers, observational methods, and occasions) sampled for all subjects. The studies differ in their definitions of occasions of measurement and thus in their interpretations of the resulting coefficients. The relationships among the reliability studies and the proposed G studies are summarized in Table 1.

Table 1
Correspondence Between Reliability Studies and Generalizability Studies

Measurement occasions                                  Reliability study              Generalizability study
-----------------------------------------------------  -----------------------------  ----------------------
Separate scorings of the same behavior or instrument    Intrascorer or interscorer     Duplicate
Scores on two subdivisions of an instrument or          Split-half or alternate forms  Session
  behavior sample, or two very similar instruments
  or behavior samples
Scores from separate administrations of the same        Test-retest                    Developmental
  instrument or separate samples of behavior

Duplicate Generalizability

One study with this basic design might use audiotaped or videotaped recordings of behavior, which would be scored on more than one occasion by more than one observer, using more than one form of an observational instrument. In this G study (which I call duplicate generalizability), the occasions of observation actually consist of exactly the same behavior by the subjects. It is, then, an extension of the traditional intrascorer or interscorer reliability study. A reliability study uses two scores for each subject (usually from two different scorers) and confounds measurement error with differences within and between scorers. A duplicate G study, on the other hand, has many scores for each subject and separately estimates the contribution of differences within and among observers. Although occasions is a facet in this G study, variance attributable to occasions cannot be interpreted as within-subject change, since the same behavior occurs on each occasion of observation. In this study, occasions variance should be interpreted as a measure of within-observer stability. A duplicate G study would be appropriate for demonstrating the dependability of an instrument used in a study in which the stability of the behavior over time was not an issue.

Session Generalizability

A second G study with the basic design might be called session generalizability. It would use as measurement occasions two subdivisions of some behavioral sequence (i.e., first and second halves or odd and even minutes) and would be an extension of the traditional split-half reliability study. In a split-half reliability study, recall that errors of measurement are confounded with differences between the two halves of the test or observation. In a session G study, differences between subdivisions of the behavioral sequence are estimated by the variance component for occasions. A session G study would be used to estimate the dependability of scores reflecting traits and behaviors expected to be stable during the course of the behavioral sequence being observed, although perhaps no longer than that.

Developmental Generalizability

A third G study with the same basic design might be called developmental generalizability. It would use as measurement occasions two or more administrations of the same instrument, perhaps at different ages or developmental stages. It is, then, an extension of the traditional test-retest reliability study, which confounds measurement error with true changes in behavior that have occurred between the two administrations of a test. In this developmental G study, these changes in behavior over time would be estimated by the occasions facet of the design. The developmental G study is best suited to the measurement of traits or characteristics believed to be relatively enduring.

Use of Generalizability Coefficients

The comments made about the reliability studies discussed earlier can be repeated for these G studies. First, observational studies very rarely report data in G-study terms. The method does appear occasionally in dissertations (see, e.g., Leler, 1971; Mitchell, 1977), and it surfaces now and then in educational psychology research (Medley & Mitzel, 1963; McGaw, Wardrop, & Bunda, 1972). But to return again to developmental psychology as an example, there were no studies published in 1976 in either Child Development or Developmental Psychology that reported generalizability coefficients. Second, regardless of the sizes of the variance components for the facets, it is necessary to have a relatively large variance component for subjects to obtain a large generalizability coefficient. All other things being equal, a sample of subjects with greater variability on the trait being measured will yield a higher generalizability coefficient than will a sample of subjects with lesser variability on the trait.

The three G-study designs outlined here (duplicate, session, and developmental) all sample three important facets that may influence scores in observational studies: observers, observational methods, and occasions of observation. They differ in the nature of the occasions that are sampled, and these occasions tell us something about the nature of the universe to which scores can be generalized.

One way to contrast these universes is to imagine that the three types of studies were conducted so that exactly the same subjects, observers, and observational methods were used for all three. When only the nature of the occasions facet is different, one can hypothesize certain relationships among the sizes of the coefficients derived from the three studies. When duplicate G studies are conducted, it is expected that variance due to occasions will be the smallest and hence that the generalizability coefficient will be the largest. Further, when developmental G studies are conducted, it is expected that variance due to occasions will be the greatest and that the generalizability coefficients will be the smallest. Finally, when session G studies are conducted, it is expected that the variance due to occasions and the resulting generalizability coefficients will be intermediate.

Other Uses for Generalizability Theory

Although measures of interobserver agreement and reliability have important uses in observational research, it should be clear from the earlier discussion that it is the generalizability coefficient that is potentially the most useful source of information about the quality of such data. Generalizability theory, however, has many applications of interest for the developmental psychologist other than the computation of a coefficient. Some of these applications are discussed below.

Multitrait-Multimethod Matrix

D. T. Campbell and Fiske (1959) introduced the notion of determining the validity of psychological measurement instruments by using a multitrait-multimethod matrix. As the name suggests, this matrix consists of scores for an individual on several traits, each trait assessed by two or more different methods. Such a design is clearly an instance of a G study in which traits and methods are the two facets. The data analysis proposed by Campbell and Fiske uses a matrix of correlations, but other authors (e.g., Kavanagh, MacKinney, & Wolins, 1971) have used analyses of variance with the multitrait-multimethod design. In this form, such a matrix closely resembles a G study.

The multitrait-multimethod matrix was used by Wicker (1975) to examine the reliability of observational records generated from transcriptions of conversations. In this study, observers were treated as "methods," and behavior samples were treated as "traits." By applying Campbell and Fiske's criteria to the correlational matrix, Wicker concluded that his data showed both convergent and discriminant validity.

Attribution of Variance

Traditionally, psychological studies have sought either to demonstrate mean differences between groups of subjects or to show consistent individual differences among subjects. Another kind of study, far less common, tries to systematically apportion the variance in a set of research data among several independent variables. This approach has been popular in efforts to resolve the issue of whether individual differences (personality) or situational differences (environment) are the most important determinants of human behavior. In such studies, data are gathered on a group of individuals in several situations. The relative importance of individual differences and of situational differences is estimated by the use of the statistic known as omega-square. As Golding (1975) has ably demonstrated, this experimental design is more profitably viewed as a G study; generalizability coefficients answer questions about the relative importance of different facets (here, individuals and situations) more suitably than does omega-square.

The logic involved in this kind of study is not limited to the person-situation controversy, of course. A similar question might be asked about a study in which several raters rate the behavior of a number of subjects. As Norman and Goldberg (1966) pointed out, these data can be interpreted as reflecting the behavior of the ratees (subjects) or the behavior of the raters. Once again, this study appears to be a straightforward, one-facet generalizability study in which raters is the facet. Although Norman and Goldberg did not use analysis of variance to analyze their data either, it is clear that the conceptualization is similar to that presented earlier: There are several sources of meaningful variation in a set of data, and only a multifacet study can illuminate the relative contributions of different facets.

Observer Generalizability

The use of generalizability coefficients to estimate the contributions of facets other than individual differences to a set of test scores has a particular application to observational studies. Specifically, it allows a researcher to look at the proportion of variance in scores that is attributable to the consistent behavior of the observers.

On the surface, the function of the generalizability coefficient sounds much like the function of the interobserver agreement percentage, but in fact it is not. Recall that the agreement percentage did not take into account the extent of overall variability in a set of data, whereas a generalizability coefficient does. It should be noted that the G study necessary to compute this coefficient is exactly the same study as that used to compute the more conventional coefficient based on the behavior of the subjects. Nothing has changed except one's point of reference.
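
By symmetry, the same variance components can be recombined with observers rather than subjects as the objects of measurement. The sketch below illustrates the idea; the article does not give an explicit formula, so this particular form is an assumption.

```python
def observer_generalizability(components, n_subjects):
    """Observer generalizability: proportion of variance attributable to
    consistent differences among observers, generalizing over subjects
    (one plausible form; a sketch of the idea, not Mitchell's exact formula)."""
    sigma_obs = components["scorers"]
    sigma_err = components["residual"]
    return sigma_obs / (sigma_obs + sigma_err / n_subjects)
```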

Mitchell (1977) computed both subject and observer generalizability coefficients in a study in which 67 observers made repeated observations of 10 mother-infant pairs during the first year of life. She found that the coefficients reflecting the variability accounted for by observers were, in this study at least, greater than the coefficients reflecting subject differences. Although this result is specific to this particular set of data, the study is an example of the usefulness of observer generalizability coefficients.

Single-Subject Studies

Studies of individual subjects have had few ways of reporting reliability in the traditional sense. However, it is possible to conduct G studies using only a single subject. For example, several observers, several occasions, or several methods of observation might be used with a single subject. In this case, the generalizability coefficient would reflect the generalizability of a score recorded by a single observer under the circumstances sampled in the study; that is, it would be an observer generalizability coefficient.

Such single-subject studies may be appropriate even when many subjects are part of a research project. It is commonly assumed that the behavior of all subjects is recorded with equal accuracy, that is, measurement errors are approximately equal for all subjects. If this assumption is true, then the data for all subjects are presumably equally "good." On the other hand, if this assumption is incorrect, so that subject behavior is recorded with variable accuracy, then data for different subjects may have different meanings.

The possibility of systematic differences in measurement error among subjects has been explored for traditional psychological tests (Ghiselli, 1963). Berdie (1969), for example, found that intraindividual variability (i.e., measurement error) was a stable trait for some pencil-and-paper performance tests. Ghiselli (1960) examined the prediction of "predictability." He was able to predict the errors of measurement for two groups of subjects on a reaction time test. Reliability for the high group was .97, compared with .82 for the low group.

A similar result was found in a quite different study by Gorsuch, Henighan, and Barnard (1972). Their interest was the internal consistency of a children's pencil-and-paper test of locus of control. They found that the reliability of the scale differed significantly according to the reading ability of the children. The errors of measurement were quite small for the good readers, but were large for the poor ones.

Observational studies, however, rarely have enough subjects to permit analysis of this kind. An alternative is to make use of observer generalizability coefficients. Suppose that the data collected for each subject were considered to be a mini-G-study. If the basic G-study design outlined earlier were used, each mini-G-study would have two facets (methods and occasions) that would be sampled for each observer. From each of these mini-studies it would be possible to compute an observer generalizability coefficient. Using this design, Mitchell (1977) found that although there were differences in observer agreement for different subjects, subjects did not differ in their observer generalizability coefficients.

Summary

Three different coefficients that purport to reflect the quality of data gathered in observational studies have been discussed. The first and most commonly used of these was observer agreement. Coefficients of observer agreement are a source of important information about the quality of observational data: the objectivity of different observers using the same method to record the same behavior. Determination of interobserver agreement is a necessary part of the development and use of observational measures. Interobserver agreement is not, however, sufficient by itself.

The second coefficient discussed was the reliability coefficient, obtained by fitting observational data into the pattern used by developers of standardized psychological tests. There are really many different reliability coefficients, each defined by the way the scores are obtained for its computation. Reliability coefficients provide useful information about the stability and consistency of individual differences among subjects, but confound measurement error with other sources of variability.

The third coefficient was the generalizability coefficient, as defined by Cronbach et al.'s (1972) multivariate theory. In one sense, a generalizability coefficient supersedes a reliability coefficient, because it too provides information about the stability and consistency of individual differences among subjects. Its superiority to the reliability coefficient lies in its ability to account for variance from sources other than individual differences and measurement error. Besides giving information that can be reported as the generalizability coefficient, a G study also permits innovative ways of looking at results from observational studies. These ways include variations on the multitrait-multimethod matrix, the attribution of variance to independent variables, observer generalizability, and studies of single subjects. It is therefore especially unfortunate that G-study designs are not used more frequently in observational research.

Recommendations

Researchers doing observational studies are obliged to show that their measuring instruments are reliable—that they have small errors of measurement and that the scores of individuals show stability, consistency, and dependability for the trait, characteristic, or behavior being studied. The reason for this obligation is practical: If the measure is not reliable, it cannot be expected to show lawful relationships with other variables being studied. It is well known that the reliability of a standardized test sets the limits of its validity (Nunnally, 1967). Similarly, the predictive usefulness of observational measures is limited by the stability and consistency of the scores obtained from the observational instruments.

Observer agreement coefficients alone, regardless of their mathematical sophistication, are inadequate to demonstrate this stability and consistency. What alternative or additional information ought to be reported in observational studies, and how should it be collected?

First, and most basically, the coefficients computed should be based on the same scores that are used in the substantive analysis of the study. If a composite score (such as a total of several categories or time units) is to be used for analysis, it is this composite—and not its component individual categories or time units—that should be examined for agreement, reliability, or generalizability. Of course, during the training of observers it is extremely helpful to compare the records of different observers on a trial-by-trial (or time unit by time unit) basis, but such a comparison does not suffice for reporting in a published research article. It is possible, albeit rare, to have acceptable agreement on a time unit basis and yet unacceptable levels of agreement on a total score. It is also possible, and much more common, for observers to be in only moderate agreement for small time units, but to show good agreement for a total score. In this case, an analysis of the trial-by-trial agreement would underestimate the agreement for the measures actually employed in the study.
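
The point about composites can be demonstrated with a small simulation (entirely hypothetical data): observers who disagree on 15% of individual time units can still correlate very highly on subjects' total scores, so the unit-level figure would understate the dependability of the totals actually analyzed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_units = 20, 40

# Subjects differ widely in their true rates of the target behavior
rates = np.linspace(0.1, 0.9, n_subjects)[:, None]
obs_1 = rng.random((n_subjects, n_units)) < rates
# The second observer disagrees on ~15% of individual time units
flips = rng.random((n_subjects, n_units)) < 0.15
obs_2 = np.where(flips, ~obs_1, obs_1)

unit_agreement = (obs_1 == obs_2).mean()                 # about .85
total_r = np.corrcoef(obs_1.sum(1), obs_2.sum(1))[0, 1]  # far higher
print(unit_agreement, total_r)
```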

Similarly, if several different scores are to be analyzed, coefficients should be computed for each of them. Good agreement or reliability on the frequency of particular behaviors, for example, does not insure good agreement or reliability on their duration. The data that are to be analyzed are the data that should be scrutinized for their stability and consistency.

Second, the coefficients should be computed from data that are part of the actual study being reported. The studies of observer agreement cited earlier (i.e., Reid, 1970; Romanczyk et al., 1973; Taplin & Reid, 1973) show clearly that the quality of data collected during a study may not be the same as the quality collected during reliability assessment or training. This difference can also be expected for data collected during a pilot study or during a previous study that used the same instrument.

This means that the researcher must plan to collect data that can be used to compute coefficients of agreement, reliability, or generalizability at the same time the rest of the data are gathered. Although this does entail some additional data collection, the addition need not generally be enormous (see, e.g., Rowley, 1976).

Third, interobserver agreement should always be obtained and reported. In most observational studies, observer disagreement is an important source of error, and it should be carefully and systematically monitored. This monitoring needs to be regular and unobtrusive for the most accurate results. Researchers also need to be alert to the possibility that interobserver agreement may differ for different subjects; therefore, observations from many—preferably all—subjects should be used when determining the level of agreement.

Although the observer agreement percentage is the most widely used and most easily computed index of agreement, it may be desirable in some cases to substitute other indices, such as that suggested by Lawlis and Lu (1972). Whatever the exact form of the coefficient, however, both the researcher and the reader should remember that it reflects only one source of error and that it reports this error in absolute rather than in relative terms.

Fourth, a reliability or generalizability coefficient that uses two or more measurement occasions should be presented for any score that is used to predict other behavior. The researcher is obliged to demonstrate that the individual differences among subjects are stable over different occasions as well as over different observers. This stability can be reported as a split-half, alternate-forms, or test-retest reliability coefficient, or as a session or developmental generalizability coefficient. The single exception to this rule is a study that focuses on some behavior not expected to show stability over time (e.g., first response to a new stimulus). In this case only, an interscorer reliability coefficient or a duplicate generalizability coefficient is appropriate.

It is impossible to overemphasize the importance of using two or more measurement occasions to compute coefficients of reliability or generalizability. The purpose of reporting such a coefficient is to demonstrate that the data being analyzed reflect stable, consistent, and dependable individual differences among subjects. If, however, a single measurement occasion has been used, then the coefficient can demonstrate only the competence and consistency of the observers. Since many, if not most, observational studies obtain repeated measures as part of the experimental design, it is seldom necessary to collect additional data. What is necessary is to analyze and report these observational measures in terms of their stability over time.

Fifth, a generalizability study is usually preferable to the computation of a reliability coefficient. First, a G study provides more useful information about sources of variability in a set of data than does a reliability coefficient. Leler (1971), for instance, used the variance components from a G-study analysis midway through her research to help refine the observation instrument and to retrain observers on some items. Second, the G-study design makes other kinds of analysis possible for most observational studies. These include, particularly, observer generalizability coefficients and coefficients for individual subjects.

Finally, the design of the generalizability study should correspond to the overall design of the research. The G study for most applications does not need to be complex. Two facets—observers and occasions—are usually sufficient. There are some studies, however, that require three-facet designs.
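
A minimal sketch of such a two-facet G study appears below, assuming a fully crossed subjects x observers x occasions design with one score per cell; the variance components are estimated by standard random-effects expected mean squares, and the generalizability coefficient for relative decisions is the person variance divided by the person variance plus the person-related error components. The data and function names are fabricated.

import numpy as np

def g_study_two_facet(X):
    # Person-related variance components for a fully crossed
    # persons x observers x occasions design with one score per cell
    # (the components needed for a relative G coefficient).
    n_p, n_o, n_s = X.shape
    grand = X.mean()
    m_p = X.mean(axis=(1, 2)); m_o = X.mean(axis=(0, 2)); m_s = X.mean(axis=(0, 1))
    m_po = X.mean(axis=2); m_ps = X.mean(axis=1); m_os = X.mean(axis=0)

    # Mean squares for persons, the person interactions, and the residual.
    msq_p = n_o * n_s * ((m_p - grand) ** 2).sum() / (n_p - 1)
    msq_po = (n_s * ((m_po - m_p[:, None] - m_o[None, :] + grand) ** 2).sum()
              / ((n_p - 1) * (n_o - 1)))
    msq_ps = (n_o * ((m_ps - m_p[:, None] - m_s[None, :] + grand) ** 2).sum()
              / ((n_p - 1) * (n_s - 1)))
    resid = (X - m_po[:, :, None] - m_ps[:, None, :] - m_os[None, :, :]
             + m_p[:, None, None] + m_o[None, :, None] + m_s[None, None, :] - grand)
    msq_e = (resid ** 2).sum() / ((n_p - 1) * (n_o - 1) * (n_s - 1))

    # Solve the expected-mean-square equations; negative estimates set to 0.
    return {
        "p": max((msq_p - msq_po - msq_ps + msq_e) / (n_o * n_s), 0.0),
        "po": max((msq_po - msq_e) / n_s, 0.0),
        "ps": max((msq_ps - msq_e) / n_o, 0.0),
        "e": msq_e,
    }

def g_coefficient(v, n_obs, n_occ):
    # Generalizability coefficient for relative decisions, with scores
    # averaged over n_obs observers and n_occ occasions.
    rel_error = v["po"] / n_obs + v["ps"] / n_occ + v["e"] / (n_obs * n_occ)
    return v["p"] / (v["p"] + rel_error)

# Fabricated scores: 8 subjects x 2 observers x 3 occasions.
rng = np.random.default_rng(0)
X = 10 + rng.normal(0, 2.0, (8, 1, 1)) + rng.normal(0, 1.0, (8, 2, 3))

v = g_study_two_facet(X)
print({k: round(val, 2) for k, val in v.items()})
print("G coefficient:", round(g_coefficient(v, n_obs=2, n_occ=3), 2))

Because n_obs and n_occ enter only the error term, varying them in g_coefficient shows directly how adding observers or occasions would improve the dependability of the scores, which is the D-study question.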

Studies that measure behavior before and after some intervention (experimental treatment) should include a third facet in the G study—before versus after the intervention. This is especially important if the intervention is likely to reduce the variability of the observed behavior. For example, suppose a study were undertaken to reduce the amount of aggressive behavior exhibited by school children on the playground. At the start of the study, the children would be quite variable in their playground aggression. If the intervention were successful, however, the variability after intervention would be quite low (all kids showing low aggression). A measure that would be quite reliable (in the classical sense) in differentiating among the children before the intervention might be inadequate afterwards. The G-study design allows the researcher to evaluate this possibility.
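
A small simulation (entirely fabricated numbers) illustrates the problem: with the same observation error in both phases, shrinking the true between-child variability after treatment sharply lowers a classical stability coefficient.

import numpy as np

rng = np.random.default_rng(1)
n_children = 30
noise = 2.0  # same observation error before and after treatment

# Before intervention: children differ widely in aggression (true SD = 6).
true_pre = rng.normal(20.0, 6.0, n_children)
pre_1 = true_pre + rng.normal(0.0, noise, n_children)   # occasion 1
pre_2 = true_pre + rng.normal(0.0, noise, n_children)   # occasion 2

# After a successful intervention: everyone near the floor (true SD = 1).
true_post = rng.normal(5.0, 1.0, n_children)
post_1 = true_post + rng.normal(0.0, noise, n_children)
post_2 = true_post + rng.normal(0.0, noise, n_children)

print("pre-intervention r: ", round(np.corrcoef(pre_1, pre_2)[0, 1], 2))
print("post-intervention r:", round(np.corrcoef(post_1, post_2)[0, 1], 2))

The expected correlations are roughly 36/(36 + 4) = .90 before and 1/(1 + 4) = .20 after, even though the observers are no less accurate in the second phase; a three-facet G study would expose exactly this difference.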

In the same way, studies that compare two groups of subjects should include group membership as a third facet in their G study; that is, it is necessary to demonstrate that the scores for both groups show approximately equal stability and consistency over different occasions and different observers.

Conclusions

These recommendations have an important empirical implication: Studies that follow them will report coefficients that are lower, perhaps substantially lower, than the coefficients reported by studies that do not follow them. The procedures suggested here are stringently conservative, and the coefficients that they yield should be considered lower limits of the true dependability of the observational data that are collected. Reviewers and readers, who are used to seeing reports of observer agreement in the .80s and .90s, will have to change their expectations for reliability and generalizability coefficients, which will often be in the .50s and .60s. In fact, these new coefficients are not low; rather the old ones were inappropriately high. Observer agreement percentages, interobserver reliability, and reliability or agreement determined during pretests or previous studies are all spuriously high estimates of the quality of the data that are collected. Although we may have to revise our standards downward concerning the size of reported coefficients, we will be revising our standards upward concerning the ways in which the data for the coefficients are gathered and analyzed.

There is a methodological implication of these recommendations as well. Most observational studies currently being published are pretty straightforward in experimental design. Their sophistication is usually based on the nature of the observations themselves: the complexity of the behavioral record, the length of time included in the record, or the specific nature of the situation or setting in which the data are gathered. Future studies that follow the recommendations given here will be far more complex in design than is now typically the case.

One final question concerns whether it is really worth the additional time and money necessary to determine and report on the quality of the data in the ways suggested here. If including a G-study design in an observational study has no benefits except the computation of a generalizability coefficient, the answer is probably no. In fact, though, including a G study usually provides a great deal of substantive information to the researcher. Are there differences in the behavior reported by different observers? Do subjects act differently on different occasions? Is the variance of the data different before and after an experimental treatment? Are all groups in this study represented by data of equal quality? All these important questions can be addressed by including a G study in the overall research plan.

Even more importantly, though, the use of generalizability designs focuses the attention of the researcher (and the reader) on both the individual differences among subjects and on the influence of other (usually environmental) factors on behavior. As Cronbach (1957) eloquently pointed out, psychologists have historically tended to focus on one or the other of these two aspects: individual differences (using correlational methods) or group differences (using experimental methods). Fittingly, it is now Cronbach's theory of generalizability that makes it possible to combine these two viewpoints. And observational research is particularly well suited to the task of looking at individual differences in behavior in the context of systematic environmental variation.

References

Berdie, R. F. Consistency and generalizability of intraindividual variability. Journal of Applied Psychology, 1969, 53, 35-41.

Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.

Campbell, J. P. Psychometric theory. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.

Cronbach, L. J. The two disciplines of scientific psychology. American Psychologist, 1957, 12, 671-684.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.

Ghiselli, E. E. The prediction of predictability. Educational and Psychological Measurement, 1960, 20, 3-8.

Ghiselli, E. E. Moderating effects and differential reliability and validity. Journal of Applied Psychology, 1963, 47, 81-86.

Golding, S. L. Flies in the ointment: Methodological problems in the analysis of the percentage of variance due to persons and situations. Psychological Bulletin, 1975, 82, 278-281.

Gorsuch, R. L., Henighan, R. P., & Barnard, C. Locus of control: An example of dangers in using children's scales with children. Child Development, 1972, 43, 579-590.

Johnson, S. M., & Bolstad, O. D. Methodological issues in naturalistic observations: Some problems and solutions for field research. In L. A. Hamerlynck, L. C. Handy, & E. J. Mash (Eds.), Behavior change: Methodology, concepts and practice. Champaign, Ill.: Research Press, 1973.

Kavanagh, M. J., MacKinney, A. C., & Wolins, L. Issues in managerial performance: Multitrait-multimethod analyses of ratings. Psychological Bulletin, 1971, 75, 34-39.

Lawlis, G. F., & Lu, E. Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 1972, 78, 17-20.

Leler, H. O. Mother-child interaction and language performance in young disadvantaged Negro children (Doctoral dissertation, Stanford University, 1970). Dissertation Abstracts International, 1971, 31, 4971B. (University Microfilms No. 71-2793)

Lytton, H. Three approaches to the study of parent-child interaction: Ethological, interview, and experimental. Journal of Child Psychology and Psychiatry, 1973, 14, 1-17.

Mash, E. J., & McElwee, J. D. Situational effects on observer accuracy: Behavioral predictability, prior experience, and complexity of coding categories. Child Development, 1974, 45, 367-377.

McDowell, E. E. Comparison of time-sampling and continuous recording techniques for observing developmental changes in caretaker and infant behaviors. Journal of Genetic Psychology, 1973, 123, 99-105.

McGaw, B., Wardrop, J. L., & Bunda, M. A. Classroom observational schemes: Where are the errors? American Educational Research Journal, 1972, 9, 13-27.

Medley, D. M., & Mitzel, H. E. Measuring classroom behavior by systematic observation. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally, 1963.

Mitchell, S. K. The reliability, generalizability and interobserver agreement of data collected in observational studies (Doctoral dissertation, University of Washington, 1976). Dissertation Abstracts International, 1977, 37, 3583B. (University Microfilms No. 77-611)

Norman, W. T., & Goldberg, L. R. Raters, ratees, and randomness in personality structure. Journal of Personality and Social Psychology, 1966, 4, 681-691.

Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967.

Reid, J. B. Reliability assessment of observation data: A possible methodological problem. Child Development, 1970, 41, 1143-1150.

Romanczyk, R. G., Kent, R. N., Diament, C., & O'Leary, K. D. Measuring the reliability of observational data: A reactive process. Journal of Applied Behavior Analysis, 1973, 6, 175-184.

Rowley, G. L. The reliability of observational measures. American Educational Research Journal, 1976, 13, 51-59.

Taplin, P. S., & Reid, J. B. Effects of instructional set and experimenter influence on observer reliability. Child Development, 1973, 44, 547-554.

Tinsley, H. E. A., & Weiss, D. J. Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 1975, 22, 358-376.

Wicker, A. W. An application of the multitrait-multimethod logic to the reliability of observational records. Personality and Social Psychology Bulletin, 1975, 1, 575-579.

Received December 21, 1977