Journal of Counseling Psychology, 1975, Vol. 22, No. 4

RESEARCH METHODOLOGY

This section reviews methodological problems in counseling psychology research and interprets current research trends and techniques to help counseling psychologists utilize developments in psychometrics, statistics, research design, and related areas.

Interrater Reliability and Agreement of Subjective Judgments

Howard E. A. Tinsley
Southern Illinois University at Carbondale

David J. Weiss
University of Minnesota

Indexes of interrater reliability and agreement are reviewed and suggestions are made regarding their use in counseling psychology research. The distinction between agreement and reliability is clarified and the relationships between these indexes and the level of measurement and type of replication are discussed. Indexes of interrater reliability appropriate for use with ordinal and interval scales are considered. The intraclass correlation as a measure of interrater reliability is discussed in terms of the treatment of between-raters variance and the appropriateness of reliability estimates based on composite or individual ratings. The advisability of optimal weighting schemes for calculating composite ratings is also considered. Measures of interrater agreement for ordinal and interval scales are described, as are measures of interrater agreement for data at the nominal level of measurement.

The rating scale is one of the most frequently used measuring instruments in counseling research. Although rating scales can assume a number of specific forms, they generally require the rater to make a judgment about some characteristic of an object by assigning it to some point on a scale defined in terms of that characteristic. In counseling psychology research, the object to be rated can be a person (e.g., a client or counselor) or a process (e.g., types of counseling or therapy). Among the characteristics of counselors that have been measured by rating scales in counseling research are accurate empathy, nonpossessive warmth, unconditional positive regard, genuineness, concreteness, and intensity of interpersonal contact; clients have been rated on degree of pathology, type of pathology, contribution to society, degree of disability, type of disability, quality of work adjustment, and type of verbal response; and counseling has been rated on its success and type of outcome.

While the authors assume complete responsibility for the contents of this paper, thanks are due to Joseph L. Fleiss, William G. Miller, John R. Moreland, Gordon F. Pitz, and Diane J. Tinsley for their careful reviews and many helpful comments.

Requests for reprints should be sent to Howard E. A. Tinsley, Department of Psychology, Southern Illinois University, Carbondale, Illinois 62901.

Rating scales have been used recently in studying the effectiveness of counselor-training techniques (Carkhuff & Griffin, 1970; Martin & Gazda, 1970; Myrick & Pare, 1971; Payne, Winter, & Bell, 1972; Pierce & Schauble, 1971; Truax & Lister, 1971), counseling outcomes (Garfield, Prager, & Bergin, 1971; Schuldt & Truax, 1970), the relationship of counselor verbal behaviors (McMullin, 1972) and counselor attire (Stillman & Resnick, 1972) to counseling outcomes, the content of interviews conducted by counselors of different theoretical persuasions (Wittmer, 1971), the relationship of client attributes to client employability (Tseng, 1972), and the effects of tape recording an interview on clients' self-reports (Tanney & Gelso, 1972).

Because the datum recorded on a rating scale is the subjective judgment of the rater, the generality of a set of ratings is always of concern. Generality is important in demonstrating that the obtained ratings are not the idiosyncratic results of one rater's subjective judgment. Knowledge of the interrater reliability and interrater agreement is crucial in evaluating the generality of a set of ratings. In almost all of the research published to date in which rating scales have been used, however, the interrater agreement of the ratings has not been reported. Failure to report the interrater reliability of ratings is also not uncommon (e.g., Carkhuff, 1971). Moreover, interrater reliability frequently has been reported as having been determined by other investigators in other research (e.g., Cannon & Carkhuff, 1969). Such practices are unacceptable.

This article differentiates between interrater reliability and interrater agreement, emphasizes the need for both types of evidence regarding a set of ratings, summarizes the various procedures suggested for determining interrater reliability and interrater agreement, presents methods for combining ratings to maximize interrater reliability, and, finally, recommends procedures for use in counseling research.

Agreement Versus Reliability

Interrater agreement represents the extent to which the different judges tend to make exactly the same judgments about the rated subject. When judgments are made on a numerical scale, interrater agreement means that the judges assigned exactly the same values when rating the same person. Interrater reliability, on the other hand, represents the degree to which the ratings of different judges are proportional when expressed as deviations from their means. In practice, this means that the relationship of one rated individual to other rated individuals is the same although the absolute numbers used to express this relationship may differ from judge to judge. Interrater reliability usually is reported in terms of correlational or analysis of variance indexes.

Table 1 shows hypothetical data (assuming interval-level measurement) in which three judges have rated 10 counselors on the Truax and Carkhuff (1967) 9-point Accurate Empathy Scale. Case 1 in Table 1 represents a set of ratings in which all three judges assign exactly the same ratings to each of the 10 counselors. These ratings have both high interrater agreement and high interrater reliability. Case 2 in Table 1 shows ratings with low interrater agreement but high interrater reliability (as indicated by the intraclass correlation). These ratings have low interrater agreement, since no two raters gave the same rating to any counselor.

TABLE 1
Hypothetical Ratings of Accurate Empathy Illustrating Different Levels of Interrater Agreement and Interrater Reliability for Interval-Scaled Data

                         Counselor
Rater    A   B   C   D   E   F   G   H   I   J     M     SD

Case 1: High interrater agreement and high interrater reliability
  1      1   2   3   3   4   5   6   7   8   9    4.8    2.7
  2      1   2   3   3   4   5   6   7   8   9    4.8    2.7
  3      1   2   3   3   4   5   6   7   8   9    4.8    2.7

Case 2: Low interrater agreement and high interrater reliability
  1      1   1   2   2   3   3   4   4   5   5    3.0    1.5
  2      3   3   4   4   5   5   6   6   7   7    5.0    1.5
  3      5   5   6   6   7   7   8   8   9   9    7.0    1.5

Case 3: High interrater agreement and low interrater reliability
  1      5   5   5   4   5   5   4   5   4   5    4.7     .5
  2      4   4   4   4   4   5   4   5   5   5    4.4     .5
  3      4   3   5   5   3   4   5   4   3   5    4.1     .9

Low agreement among the raters is reflected in their mean ratings of 3, 5, and 7, respectively, which translate into quite different statements about the counselors in terms of level of accurate empathy. However, the ratings assigned to the counselors by the three judges are proportional since the counselors are ordered similarly by the three judges. If the investigator is interested only in the relative ordering of the counselors by the judges, the interrater reliability is a satisfactory index of the quality of the ratings. The interrater reliability, however, fails to make evident the fact that the raters differed in their ratings of the counselors. Whenever the counselor is interested in the absolute value of the ratings, or the meaning of the ratings as defined by the points on the scale, the interrater agreement of the ratings must also be reported.

Finally, the ratings of Case 3 in Table 1 have high interrater agreement but low interrater reliability. The high interrater agreement results from the fact that the ratings assigned to the counselors by the three raters are quite similar for 7 of the 10 counselors. The interrater reliability is low primarily because of the restricted range of ratings given by the three raters. Such ratings may have occurred because the counselors were all highly similar in accurate empathy or because the judges used the rating scale improperly. Thus, when the variability of ratings is small, as is frequently the case in counseling research, interrater reliability may be low. If, however, interrater agreement is high, the possibility exists that the subjects are homogeneous on the trait of interest. This possibility can be investigated by further study of the same subjects or by having the judges rate a sample of subjects known to be heterogeneous on the trait of interest. When both interrater reliability and agreement are low, the ratings are of no value and should not be used for research or applied purposes.

In summary, high reliability is no indication that the raters agree in an absolute sense on the degree to which the ratees possess the characteristic being judged (Case 2). On the other hand, low reliability does not necessarily indicate that the raters are in disagreement (Case 3). Clearly, both types of information are important in evaluating subjective ratings.

Level of Measurement

Measurements obtained from the rating instruments used in counseling research sometimes possess the properties of nominal-level measurement. They more frequently have the properties of ordinal-level measurement but rarely can be considered to be at the interval level of measurement. Each level of measurement requires different approaches to the estimation of interrater reliability and agreement.

Nominal scales involve simple classification without order. At the nominal level of measurement the rater generally is concerned with assigning the person to be rated to one of a number of qualitatively different categories. Examples of nominal rating include clinical diagnosis (e.g., schizophrenic, psychopathic), diagnosis of physical disability (e.g., paraplegic, quadriplegic), and classification of presenting complaint (e.g., vocational concern, social-emotional concern).

Ordinal measurement implies the ability to rank individuals on some continuum, but without implying equal intervals between the ranks. Ordinal scales, in the form of the simple graphic rating scale, are probably the most frequently used scales in counseling research. Such scales may vary considerably in their appearance. The two dimensions along which these scales most commonly vary are the number of scale levels (e.g., 3-, 6-, or 11-point scales) and the manner in which the scale levels are defined. Scale levels can be defined by numerical values, adjective descriptions, or graphic intervals. Regardless of how scale points are defined, all such scales represent an ordinal level of measurement.

At the interval level of measurement, true measures of distance are possible between individuals on a continuum. While technically most counseling research instruments do not meet the formal characteristics of interval-level measurement, Baker, Hardyck, and Petrinovich (1966) have shown that statistics which assume interval-level measurement can be applied to ordinal data without distorting the sampling distribution of the statistic. Thus, the counseling psychologist may appropriately use interval-level statistics on ratings that result from only ordinal-level measurement. The researcher should be aware, however, that such applications might result in measures of interrater reliability and agreement that distort the true relationship among raters if the assumption of equal intervals is grossly inappropriate.

The distinction between interrater reliability and interrater agreement blurs when one deals with nominal scales. Since the rating categories do not differ quantitatively, the disagreements in categorization generally do not differ in their severity.1 Therefore, at the nominal-scale level the concept of "proportionality of ratings," which is central to interrater reliability, ceases to make sense. Agreement becomes an absolute; ratings are either in agreement or disagreement. As a result, no distinction exists between interrater reliability and interrater agreement for nominal scales. The term interrater agreement is more appropriate, since it is more consistent with the terminology employed at the ordinal and interval levels of measurement.

Type of Replication

Investigation of the generality of a set of ratings involves replication or repetition of the obtained ratings. Most frequently, replication occurs between judges when several judges rate the same objects on the same variable, as illustrated in Table 1. Another method of replicating a set of ratings is the rate-rerate method, the analogue of the test-retest procedure for determining test reliability. In this method the judges rate a group of subjects on some characteristic, then rerate the same subjects on the same characteristic at a later date. This replication has been used recently by Carkhuff (1969), Carkhuff and Banks (1970), and Carkhuff and Griffin (1970).

The rate-rerate method is inappropriate for demonstrating interrater agreement or interrater reliability for several reasons. First, this measure of reliability does not indicate the degree to which the ratings of the different judges are in agreement. Rather, it provides a measure of the consistency or stability of the judges' ratings over time. It would be possible, for example, to achieve a high rate-rerate reliability for three raters who were in complete disagreement, as long as each rater was consistent across time. A second problem with the rate-rerate method is that over a period of time the subjects may change in terms of the characteristics being rated, thus causing the reliability or within-rater agreement of the ratings to be low. The more accurately the judges reflect this change by amending their previous ratings, the lower the reliability will appear. Rate-rerate methods may also give spuriously high reliabilities when a nonchanging data base is used (e.g., audio or video tapes of interviews) if the between-ratings time interval is short. In this situation the reratings will likely be contaminated by the rater's recall of his previous judgments. Finally, because the rate-rerate method requires that the judges rate each subject on two different occasions, this method is seldom feasible in the field situations in which much counseling research is conducted. Thus replication of ratings between judges, rather than across time, is the more appropriate method for counseling psychology research.

1 An exception is discussed later in the section on "weighted kappa."

INTERRATER RELIABILITY

Ordinal Scales

Finn (1970) has recommended the use of the following index as a measure of interrater reliability for ordinal ratings:

r = 1.0 - \frac{\text{observed variance}}{\text{chance variance}}.   (1)

The expected or chance variance (S_c^2) of the ratings is the variance expected if the ratings were assigned at random and may be calculated as2

S_c^2 = \frac{k^2 - 1}{12},   (2)

where k = the number of scale categories. In the unusual case in which only one subject has been rated, the observed variance is simply the variance of the assigned ratings. When ratings are available for more than one subject, the observed variance is the within-subjects mean square obtained from a one-way analysis of variance (ANOVA).

2 The authors are indebted to Joseph L. Fleiss for suggesting the use of this equation in place of a more cumbersome equation used in an earlier draft of this article.

The degree to which the observed variance is less than the expected variance is an indication of the amount of nonchance variance in the ratings. The ratio of the two variances gives the proportion of the chance (or random) variance present in the observed ratings. Subtracting the ratio from 1.0 results in the proportion of the total variance in the ratings that is due to nonrandom factors. An r of 1.00 indicates perfect reliability; an r of 0 indicates that the observed ratings were completely unreliable, that is, they varied as much as chance ratings. For the data in Table 1, r = 1.00, .40, and .93 for Cases 1, 2, and 3, respectively.
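As an illustration, the short Python sketch below (ours, not part of the original article; the function and variable names are ours) computes Finn's r from Equations 1 and 2 for data arranged as in Table 1.

```python
# Finn's (1970) r: 1 minus the ratio of the observed within-subjects variance
# (the within-subjects mean square from a one-way ANOVA) to the chance variance
# (k**2 - 1) / 12 for a k-point scale.

def finn_r(ratings, k):
    """ratings: one list per subject, holding that subject's ratings from each judge."""
    chance_variance = (k ** 2 - 1) / 12.0
    ss_within, df_within = 0.0, 0
    for subject in ratings:
        mean = sum(subject) / len(subject)
        ss_within += sum((x - mean) ** 2 for x in subject)
        df_within += len(subject) - 1
    observed_variance = ss_within / df_within      # within-subjects mean square
    return 1.0 - observed_variance / chance_variance

# Table 1, Case 2: rows are counselors A-J, entries are the three raters' ratings.
case2 = [[1, 3, 5], [1, 3, 5], [2, 4, 6], [2, 4, 6], [3, 5, 7],
         [3, 5, 7], [4, 6, 8], [4, 6, 8], [5, 7, 9], [5, 7, 9]]
print(round(finn_r(case2, k=9), 2))   # 0.4, as reported for Case 2 above
```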

As we previously noted, indexes of interrater reliability are usually affected by reduced variance in the ratings. One advantage of Finn's (1970) measure of reliability is that it avoids this problem, as illustrated by the results of Case 3. Finn's r can vary from 0 to 1.0 and is not reduced by low within-judge variance.

The chance variance would be expected if an infinite population of judges assigned the ratings at random. In actual practice the counseling psychologist will be dealing with a finite and often small sample of judges. Under such circumstances the random assignment of ratings could result in an observed variance that, quite by chance, is smaller than the expected variance. Accordingly, the investigator should determine whether the observed variance is significantly less than the chance variance prior to calculating r. The hypothesis that the observed variance is equal to the chance variance can be tested with the following chi-square test:3

\chi^2 = \frac{N(K - 1) S_o^2}{S_c^2},   (3)

with N(K - 1) degrees of freedom, where N = number of subjects, K = number of raters, S_o^2 = the observed within-subjects variance, and S_c^2 = the expected chance variance. Using the one-tailed chi-square test (Hays, 1963, pp. 344-345), a chi-square value lower than the lower critical value (i.e., the .99 level for a test at p < .01) would indicate that S_o^2 is significantly less than S_c^2. Finn's r should be calculated only in those instances in which the null hypothesis is rejected.

3 Finn (1970, p. 73) has suggested an F ratio for this purpose. This statistic appears inappropriate since it is intended for use with two sample variances (see Hays, 1963, pp. 348-355), whereas Finn's chance variance is a theoretically determined estimate of the population variance.
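A minimal sketch of this test follows (our illustration; scipy is assumed to be available for the chi-square quantile).

```python
# Equation 3: chi-square = N(K - 1) * S_o^2 / S_c^2 with N(K - 1) degrees of
# freedom, compared one-tailed against the LOWER critical value, since the
# question is whether the observed variance is significantly below chance.
from scipy.stats import chi2

def observed_vs_chance_variance(observed_var, chance_var, n_subjects, n_raters, alpha=0.01):
    df = n_subjects * (n_raters - 1)
    statistic = df * observed_var / chance_var
    lower_critical = chi2.ppf(alpha, df)          # lower-tail quantile, e.g. for p < .01
    return statistic, lower_critical, statistic < lower_critical

# Table 1, Case 3: S_o^2 = 0.5 (within-subjects mean square), S_c^2 = (9**2 - 1)/12
stat, crit, reject = observed_vs_chance_variance(0.5, 80 / 12, n_subjects=10, n_raters=3)
print(round(stat, 1), round(crit, 1), reject)     # 1.5 8.3 True, so Finn's r may be computed
```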

Two considerations must be kept in mind in following this procedure. First, this use of chi-square assumes that the ratings will be normally distributed and that both the subjects and raters will be randomly selected. In counseling research these assumptions will frequently be violated. Violation of the normality assumption can be quite serious when inferences are made about variances (Hays, 1963, p. 347). Accordingly, a stringent critical value (e.g., p < .01) is advised. Second, a significant chi-square tells the investigator only that his results are not consistent with the hypothesis of completely random responding.

Two problems are apparent with the use of Finn's (1970) r as an index of interrater reliability. First, computation of the chance variance requires the assumption that every rating has the same probability, because the ratings of the judges are purely random. This assumption is violated whenever the judges have a response set to avoid the extreme categories of the rating scale. Whenever this occurs, the chance variance, as calculated using Equation 2, would be greater than the "true" chance variance, thus causing r to be spuriously high.

A second cause for concern is the lack of systematic evidence available for the evaluation of this measure of reliability. This is due, in part, to its relative recency. In the comparative studies published to date (Finn, 1970, 1972), however, r and the intraclass correlation (discussed later) gave highly similar results in most instances. When the variance in the ratings is severely restricted (as in Case 3), however, r and the intraclass correlation may differ substantially.


Interval Scales

The measure of interrater reliability most commonly used with interval data (and with ordinal scales which assume interval properties) is the intraclass correlation (R). R can be interpreted as the proportion of the total variance in the ratings due to variance in the persons being rated. Values approaching the upper limit of R (1.00) indicate a high degree of reliability, whereas an R of 0 indicates a complete lack of reliability. (Negative values of R are mathematically possible but are rarely observed in actual practice; when observed, they imply Rater X Ratee interactions.) The more R departs from 1.00, the less reliable are the ratings.

Ebel (1951) compared the product-moment correlation (applicable when only two raters are involved), the intraclass correlation, and two other indexes of interrater reliability. He concluded that the intraclass correlation was preferable, because it permits the inclusion or exclusion of the between-rater variance as part of the error variance, because it allows an estimation of the precision of the reliability coefficient, and because it uses the familiar statistics and computational procedures of analysis of variance. At the present time, the intraclass correlation is the most appropriate measure of interrater reliability for interval scale data (Ebel, 1951; Englehart, 1959; Guilford, 1954).

More than one formula is available for the intraclass correlation. In order to make proper use of this correlation, the investigator must decide (a) whether mean differences in the ratings of the judges should be regarded as rater error, and (b) whether he or she is concerned with the average reliability of the individual judge or the reliability of the average rating of all the judges. The appropriate form of the intraclass correlation can then be chosen based on these decisions.

Between-Raters Variance

The data of Case 2 in Table 1 are an example in which the judges differ in their average ratings. The means of the ratings given by the three judges are 3.0, 5.0, and 7.0, respectively. These "level" differences constitute the only differences in the ratings given by the three judges. If the between-judges level differences are ignored, the interrater reliability as measured by the intraclass correlation will be 1.00. If, on the other hand, the between-judges variance is considered as error, the intraclass correlation is .18. In any case in which the mean or variance of the ratings assigned by the various judges differs, the decision regarding the between-raters variance will influence the interrater reliability of the data. If the differences in mean and/or variance are sizable, the exclusion of the between-raters variance from the error term will cause the reliability coefficient to be substantially higher than if it were included.

The desirability of removing the interjudge variance in estimating the interrater reliability depends on the way in which the ratings are to be used (Ebel, 1951). If between-judges differences in the general level of the ratings do not lead to corresponding differences in the ultimate classification of the subjects, the between-raters variance should not be included in the error term. Thus, when decisions are based on the mean of the ratings obtained from a set of observers, or on ratings which have been adjusted for rater differences (such as ranks or Z scores), the interjudge variance should not be regarded as error. On the other hand, if decisions are made by comparing subjects rated by different judges or sets of judges, or if the investigator wishes his results to be generalizable to other samples of judges using the same scale with a similar sample of subjects, the between-raters variance should be included as part of the error term.

Bartko (1966), Ebel (1951), Englehart (1959), Guilford (1954), and Silverstein (1966) have discussed the computation of the intraclass correlation. When the between-judges variance is not to be included in the error term, a two-way ANOVA procedure is employed, in which the mean square for persons (MS_p), mean square for judges, mean square for error (MS_e), and total mean square are obtained following standard ANOVA computations. With the assumption of no interaction between persons and judges, the standard equation for the intraclass correlation is

R = \frac{MS_p - MS_e}{MS_p + (K - 1) MS_e},   (4)

where K = the number of judges rating each person. Application of this formula to the data in Table 1 gives R = 1.00, 1.00, and -.08 for the three cases, respectively.

Use of Equation 4, based on the two-way ANOVA procedure, is justified only if the raters are regarded as "fixed" (i.e., the judges represent the population of judges; Bartko, 1966; Burdock, Fleiss, & Hardesty, 1963). This means the judges cannot be assumed to be a sample of judges from any population of judges. Accordingly, the resulting intraclass correlation represents the interrater reliability for only that set of judges and cannot be generalized to any other set of judges employing the same rating scale on the same sample of persons. In order to generalize the interrater reliability to other judges rating the same or other subjects, the investigator must satisfy two requirements not implied by the previously mentioned use of the intraclass correlation. First, the investigator must use a random sample of judges so that the judges may be considered representative of the population of judges about which generalizations are to be made. In few counseling research studies has this requirement been met. Second, the equation must incorporate some estimation of the between-judges variance in order to estimate the degree to which R will vary across samples of judges.

The standard one-way ANOVA, in which only the mean square for persons (MS_p) and mean square for error (MS_e) are calculated, is the simplest method of computation when the between-judges variance is to be included in the error term (Ebel, 1951; Bartko, 1966). If one uses the one-way ANOVA (vs. the two-way ANOVA), the mean square for judges is included in the mean square for error. The intraclass correlation can then be calculated using Equation 4. Under these circumstances, the intraclass correlations for the data in Table 1 are R = 1.00, .18, and -.10, respectively. These results take into account the differences in mean level of the ratings by the raters in each of the three cases. The differences between the two estimates of reliability are largest for Case 2, where the largest mean differences among raters occurred.
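The following sketch (ours; the names are illustrative) computes Equation 4 from the ANOVA mean squares, with the between-judges variance either removed from or folded into the error term.

```python
# Intraclass correlation R = (MS_p - MS_e) / (MS_p + (K - 1) * MS_e), Equation 4.
# With include_judge_variance=False the error term comes from a two-way ANOVA
# (judges removed); with True it comes from a one-way ANOVA (judges folded in).

def intraclass_r(ratings, include_judge_variance):
    """ratings[i][j] = rating of subject i by judge j (complete data)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    subj_means = [sum(row) / k for row in ratings]
    judge_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_judge = n * sum((m - grand) ** 2 for m in judge_means)
    ms_p = ss_subj / (n - 1)
    if include_judge_variance:                # one-way ANOVA: judges are part of error
        ms_e = (ss_total - ss_subj) / (n * (k - 1))
    else:                                     # two-way ANOVA: judges removed from error
        ms_e = (ss_total - ss_subj - ss_judge) / ((n - 1) * (k - 1))
    return (ms_p - ms_e) / (ms_p + (k - 1) * ms_e)

case2 = [[1, 3, 5], [1, 3, 5], [2, 4, 6], [2, 4, 6], [3, 5, 7],
         [3, 5, 7], [4, 6, 8], [4, 6, 8], [5, 7, 9], [5, 7, 9]]
print(round(intraclass_r(case2, False), 2))   # 1.0  (between-judges variance excluded)
print(round(intraclass_r(case2, True), 2))    # 0.18 (between-judges variance included)
```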

In counseling research the investigator is frequently confronted with incomplete data. A problem arises in determining the appropriate value of K in Equation 4 when subjects have been rated by varying numbers of raters. When the investigator plans to use the one-way ANOVA procedure, an average value of K can be obtained using the following equation from Snedecor (1946):

\bar{K} = \frac{1}{N - 1}\left(\sum K - \frac{\sum K^2}{\sum K}\right),   (5)

where \bar{K} = the average value of K to be inserted in Formula 4 for K, N = the number of subjects, and K = the number of judges rating each subject (this value will vary from subject to subject).
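A brief sketch of Equation 5 follows (our illustration; the vector of raters per subject is hypothetical).

```python
# Average number of raters per subject (Snedecor, 1946), for use as K in Equation 4:
# K-bar = (sum(K_i) - sum(K_i**2) / sum(K_i)) / (N - 1).

def average_k(raters_per_subject):
    n = len(raters_per_subject)
    total = sum(raters_per_subject)
    return (total - sum(k ** 2 for k in raters_per_subject) / total) / (n - 1)

# e.g., 10 subjects rated by 3, 3, 2, 4, 3, 3, 2, 3, 4, and 3 judges, respectively
print(round(average_k([3, 3, 2, 4, 3, 3, 2, 3, 4, 3]), 2))   # 2.99
```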

One advantage of the intraclass correlation is the possibility of determining confidence intervals around R using equations given by Ebel (1951). In most counseling research, however, the small number of raters employed will result in the lower limit of the confidence interval being below zero. Thus, calculation of confidence intervals for R will emphasize the danger of attempting to generalize interrater reliability coefficients.

Finally, the comparison of Finn's r with the two methods of computing R is informative. Apparently r is, indeed, relatively independent of the observed variance in the ratings (see Cases 1 and 3), whereas R is substantially lowered when the variance in the ratings is restricted. Moreover, Case 2 illustrates that r does include between-judges variance in the error term, as is obvious from the manner in which it is calculated.

Reliability of Composite Ratings

Average reliability or reliability of a composite. The intraclass correlation and Finn's r give the average of the reliabilities of the individual judges. As such, this measure of reliability underestimates the interrater reliability of the composite rating (e.g., a mean rating or the sum of the ratings) of the group of judges. The counseling psychologist must select the appropriate estimate of interrater reliability on the basis of the intended use of the ratings. If conclusions are typically drawn from the ratings of a single judge, the average reliability of the individual ratings (intraclass correlation, or Finn's r) is appropriate. This is true even though the ratings of several judges may be available in the experimental situation. If, on the other hand, the composite rating of a group of raters is the variate of interest, the reliability of the composite rating is more appropriate. This reliability will be higher than the reliability indicated by the intraclass correlation and can be estimated by the Spearman-Brown formula, or by the following equation (Ebel, 1951):

R = \frac{MS_p - MS_e}{MS_p}.   (6)

The two equations yield identical results; studies by Clark (1935), Rosander (1936), and Smith (1935) show that the Spearman-Brown formula accurately predicts the change in the reliability of the ratings. Guilford (1954) has pointed out that this finding implies that gains in interrater reliability come from multiplying the number of raters when the initial reliability is low. Reliability increases rapidly as a result of adding the first several raters, but smaller increments in reliability occur with additional raters.
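For example, the following sketch (ours) steps up a single-rater reliability with the Spearman-Brown formula; by Ebel (1951), the result matches Equation 6 up to rounding.

```python
# Reliability of the composite (mean) rating of K judges via Spearman-Brown:
# R_composite = K * R_single / (1 + (K - 1) * R_single).

def spearman_brown(single_rater_r, n_raters):
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)

# Table 1, Case 2, one-way analysis: single-rater R = .18 with 3 judges
print(round(spearman_brown(0.18, 3), 2))   # 0.4, the reliability of the composite rating
```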

Differential weighting. The question has been raised as to whether some method of differentially weighting the ratings given a subject might increase reliability more than the simple unit weighting procedure implied in Spearman-Brown types of estimates. Kelley (1947), for example, suggested weights for combining ratings by different observers that were based on the reliability of the judges, as estimated from the intercorrelation matrix of the judges' ratings. Lawshe and Nagle (1952) reported, however, that the use of Kelley's weights offered no improvement over unit weighting and, in some instances, resulted in a composite score that was less reliable than the composite score based on unit weights.

Overall (1965) re-examined the question of differential weights and concluded that the reliabilities and variances of the individual judges were the major determinants of the efficacy of differential weights. Whenever the judges can be assumed equal in these respects, unit weights should be used, and the increase in reliability due to the use of a composite score can be estimated by the Spearman-Brown formula or Equation 6. Differential weights will yield a more reliable composite score than unit weights only when the judges' ratings differ substantially in reliability and/or variance. Overall recommended a factor-analytic procedure4 for estimating the individual rater reliabilities and provided equations for calculating optimal weights for each judge and for estimating the reliability of the weighted composite of the ratings.5

The counseling psychologist must consider several factors in determining the desirability of using optimal weights as opposed to unit weights. If the judges' ratings do not differ substantially in reliability and/or variance, unit weights are recommended. Moreover, if the nature of the decision to be made does not require the greatest possible reliability, unit weights will probably suffice. Another factor to be considered is the stability of the rating situation. Whenever the rating scale is modified, the membership of the group of judges changes, or the reliability or variance of the various judges' ratings changes, a new set of optimal weights must be calculated. If raters receive consistent feedback regarding the reliability of their ratings, for example, the low-reliability raters probably will be able to improve their reliability. Thus, optimal weights cannot be assumed to be stable across rating occasions, even though the same group of judges uses the same rating scale in rating the same or a similar group of subjects.

4 The reader not familiar with factor analysis will find helpful reviews in Weiss (1970, 1971).

5 Krippendorff (1970) has suggested an analysis of variance procedure for estimating the reliability of the individual judges. Sufficient data for the comparison of Overall's (1965) and Krippendorff's strategies has not yet been published. Overall's procedure is conceptually simpler, however, and computer programs for factor analysis are widely available. Krippendorff's procedures require a form of computer analysis for which computer programs are not readily available.


INTERRATER AGREEMENT

Only recently have statistical indexes been designed specifically to indicate interrater agreement. However, several statistical indexes designed for other purposes have been employed as measures of interrater agreement. These measures include the proportion or percentage of agreement (P), the pairwise correlation between judges' ratings, and various chi-square indexes.

Guttman, Spector, Sigal, Rakoff, and Epstein (1971), Kaspar, Throne, and Schulman (1968), and Mickelson and Stevic (1971) have employed the percentage or proportion of judgments in which the judges are in agreement (P) as a measure of interrater agreement. For the three cases in Table 1, P is 1.00, .00, and .10, respectively. The advantages of P include ease of calculation and the fact that its meaning is easily understood.
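A minimal sketch (ours) of this all-or-none definition of P, counting the subjects on whom every judge assigned exactly the same rating, is given below.

```python
# Proportion of subjects for whom all judges gave identical ratings; this is the
# definition consistent with the values of 1.00, .00, and .10 reported for Table 1.

def proportion_agreement(ratings):
    """ratings: one list per subject, containing the judges' ratings of that subject."""
    return sum(1 for subject in ratings if len(set(subject)) == 1) / len(ratings)

case3 = [[5, 4, 4], [5, 4, 3], [5, 4, 5], [4, 4, 5], [5, 4, 3],
         [5, 5, 4], [4, 4, 5], [5, 5, 4], [4, 5, 3], [5, 5, 5]]
print(proportion_agreement(case3))   # 0.1 (only Counselor J receives identical ratings)
```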

Cohen (1960) and Robinson (1957) have criticized the percentage or proportion as an index of agreement. One problem is that P treats interrater agreement in an absolute, all-or-none fashion. For example, the data in Case 2 of Table 1 represent perfect agreement among the judges on the relative level of each of the counselors, but the proportion of agreement index indicates that the absolute agreement is zero. Another problem is that agreement can be expected on the basis of chance alone; P overestimates the true absolute agreement by an amount related to the number of raters and the number of points on the scale. Some adjustment in P that would show the proportion of nonchance agreement therefore is desirable. These difficulties have been partially avoided in more recent attempts to use P as a measure of interrater agreement, and, as will be shown below, P serves as the basis for measures of agreement in nominal ratings.

The pairwise correlation of the judges' ratings has also been used as a measure of interrater agreement (e.g., Kaspar, Throne, & Schulman, 1968). The shortcomings of pairwise correlations as a measure of interrater agreement should be obvious; correlations show proportional agreement or agreement of standardized ratings. If pairwise correlation has any merit in evaluating ratings, it is as a measure of interrater reliability, not interrater agreement. As Ebel (1951) reported, the intraclass correlation is superior to pairwise correlations as a measure of interrater reliability, since the former permits the investigator to specify the variance components included.

Finally, various indexes based on chi-square have been used as measures of interrater agreement (e.g., Taylor, 1968). Such indexes have been criticized by Cohen (1960), Lu (1971), and Robinson (1957) on the grounds that chi-square does not reflect the degree of agreement among a group of judges. Chi-square tests the hypothesis that the proportions of subjects assigned to the various rating categories by the different judges do not differ significantly. A nonsignificant chi-square indicates that the observed disagreement is not greater than the disagreement that could be expected on the basis of chance. No inferences can be made from a nonsignificant chi-square regarding the degree of agreement. A significant chi-square can occur because of a departure from chance association in the direction of greater agreement or greater disagreement; it may therefore indicate that the observed disagreement is greater than the disagreement expected on the basis of chance.

Problems have been noted in the use of Kendall's coefficient of concordance as a measure of interrater agreement. This measure gives the average Spearman rank-order correlation between each pair of judges. In essence, the coefficient of concordance ignores differences in the absolute level and the dispersion of the rankings assigned by the various judges by forcing the data to take the form of ranks; it indicates the agreement among the serial orders assigned to the subjects. Lu (1971) pointed out that the frequent occurrence of tied ranks, which commonly results when the coefficient of concordance is used, makes the coefficient difficult to calculate and somewhat powerless.

Ordinal and Interval Scales

Two measures of interrater agreement have recently been formulated which merit the attention of counseling psychologists using ordinal or interval scales (Lawlis & Lu, 1972; Lu, 1971). The measures differ markedly in their concept of interrater agreement.6 When viewed from the perspective of decision theory (Cronbach & Gleser, 1965), the two indexes offer alternative methods of quantifying the seriousness or the cost of various disagreements.

6 The authors are indebted to Gordon F. Pitz for pointing out this distinction.

The Lawlis and Lu (1972) measure of interrater agreement allows the investigator some flexibility in selecting a criterion for agreement, thereby avoiding the necessity of treating agreement in an absolute, all-or-none fashion. For example, three raters who gave counselor ratings of 7, 8, and 9 on Truax and Carkhuff's (1967) 9-point Accurate Empathy Scale disagreed in an absolute sense, but all essentially rated the counselor as high in accurate empathy. Lawlis and Lu's (1972) index allows the option of defining agreement as identical ratings, as ratings that differ by no more than 1 point, or as ratings that differ by no more than 2 points. Agreement is tallied one subject at a time, by determining whether the total set of ratings given that subject satisfies the criterion. If agreement is defined as ratings not more than two scale categories apart, the ratings in the above example would be interpreted as in agreement.

The Lawlis and Lu (1972) approach, therefore, uses a flexible model of the seriousness of disagreements among the raters, and the investigator can distinguish between serious and unimportant disagreements. When ratings that differ by 1 scale point or less are defined as "in agreement," the intent is that a disagreement of 1 scale point is unimportant. Conversely, all disagreements that exceed the criterion (e.g., disagreements of 2 or more) are of equal seriousness.

Lawlis and Lu (1972) suggest the following nonparametric chi-square as a test of the significance of interrater agreement:

\chi^2 = \frac{(|N_1 - NP| - .5)^2}{NP} + \frac{(|N_2 - N(1 - P)| - .5)^2}{N(1 - P)},   (7)

where N_1 = the number of agreements, N = the number of individuals rated, P = the probability of chance agreement on an individual,7 .5 = a correction for continuity, and N_2 = the number of disagreements. The statistic is distributed as chi-square with 1 degree of freedom. This test is appropriate only when the interrater agreement (N_1) is greater than the agreement expected on the basis of chance (NP).

7 Tables prepared by the authors are available for values of P most likely to arise in counseling psychology research. These tables, available from the first author, include P for from 2 to 10 judges on scales using 2-20 rating categories, and defining agreement as 0, 1, or 2 points discrepancy. While Lawlis and Lu (1972, p. 19) present equations for the calculation of P, their equations for 1 and 2 points discrepancy contain typographical errors. In addition, their P value for a 10-point scale with four judges (K = 4) and agreement defined as a discrepancy of 1 point or less (r = 1) should be .0136, instead of .00136, as shown in their Table 4.

Failure to obtain a significant chi-square means that the hypothesis of "randomly assigned ratings" cannot be rejected. Lack of significant agreement may result from some characteristics of the scale (e.g., vaguely defined or overlapping categories), the judges (e.g., carelessness, lack of proper training, inability to make the required discrimination), or the subjects (e.g., failure to emit the behavior to be rated). When a nonsignificant chi-square is obtained, the use of the scale in the manner in which the data were obtained is questionable.

A significant chi-square, on the other hand, indicates that the observed agreement is greater than the agreement that could be expected on the basis of chance. The investigator, however, also should be concerned with whether interrater agreement is high, moderate, or low, not only with whether it is better than chance. We propose the following as a measure of agreement:

T = \frac{N_1 - NP}{N - NP},   (8)

where N_1, N, and P are defined as in Equation 7. The value T, which is patterned after Cohen's (1960) kappa (to be discussed later), should be calculated only when the hypothesis of chance agreement has been rejected. When the observed agreement is equal to the expected chance agreement, T is 0, and T is 1.0 when perfect interrater agreement is observed. Positive values of T indicate that the observed agreement is greater than chance agreement, while negative values indicate that observed agreement is less than chance agreement. The number of agreements for Case 3 of Table 1 (where agreement is defined as 0, 1, and 2 points discrepancy) is 1 (Counselor J), 7 (Counselors A, C, D, F, G, H, and J), and 10 (all counselors), respectively. The corresponding T values are .09, .68, and 1.00.

The results from Lawlis and Lu's (1972) chi-square and the associated T index are contingent upon the definition of agreement. If the definition is changed, the results will change. The counselor must study his rating scale carefully and determine the implications of the alternative definitions of agreement in the context of the research question. Only after careful study should a definition of interrater agreement be selected. That definition must, of course, be determined prior to the collection of the data. Moreover, the definition of agreement and the rationale for adopting the definition must be specified in any report on the interrater agreement of the ratings. Whenever the counseling investigator defines agreement to include a discrepancy of 1 or 2 points, he or she should also report the chi-square and the T value for agreement defined as identical ratings. This will allow the reader to evaluate the extent to which the conclusions drawn are contingent upon the definition of agreement.
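The sketch below (ours, and an assumption-laden reconstruction rather than Lawlis and Lu's own computing formulas) obtains P by enumerating all equally likely combinations of independent, uniformly distributed ratings and then reports T from Equation 8 under each definition of agreement, which makes the dependence on that definition concrete.

```python
# T = (N1 - N*P) / (N - N*P), Equation 8, where P is the chance probability that
# one subject's set of ratings satisfies the agreement criterion. P is computed
# here by brute-force enumeration under the purely-random-ratings assumption.
from itertools import product

def chance_p(n_judges, n_categories, max_discrepancy):
    hits = sum(1 for combo in product(range(n_categories), repeat=n_judges)
               if max(combo) - min(combo) <= max_discrepancy)
    return hits / n_categories ** n_judges

def t_index(ratings, n_categories, max_discrepancy):
    n = len(ratings)
    n1 = sum(1 for subject in ratings
             if max(subject) - min(subject) <= max_discrepancy)
    p = chance_p(len(ratings[0]), n_categories, max_discrepancy)
    return (n1 - n * p) / (n - n * p)

case3 = [[5, 4, 4], [5, 4, 3], [5, 4, 5], [4, 4, 5], [5, 4, 3],
         [5, 5, 4], [4, 4, 5], [5, 5, 4], [4, 5, 3], [5, 5, 5]]
for d in (0, 1, 2):
    print(d, round(t_index(case3, n_categories=9, max_discrepancy=d), 2))
# prints 0 0.09, 1 0.67, 2 1.0 (compare the .09, .68, and 1.00 reported above)
```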

One further consideration must be kept in mind when using Lawlis and Lu's (1972) chi-square. This statistic requires the assumption that every judgment has the same probability, under the hypothesis that the ratings of the judges are purely random. As was pointed out earlier, however, this assumption may not be completely applicable in some circumstances, if raters do not use the end categories of a rating scale. When this happens, the range of choice is narrowed, and the true probability of chance agreement is greater than P. P, then, is the lower limit for the unknown probability of chance agreement. This means that the probability of chance agreement is often underestimated and the significance of the observed agreement is overstated.

The careful investigator can meet this problem in two ways. He or she can adjust the required level of significance from the traditional .05 to a more stringent level or calculate the probability of chance agreement on fewer scale categories than are actually available. Thus, the investigator using a 7-point scale could employ the probability of chance agreement for a 5- or 6-point scale.

Lu (1971) also noted that agreement can vary along a continuum from absolute agreement to no agreement. He developed a coefficient of agreement which incorporates definitions of minimum agreement derived from analysis of variance and information theory. In analysis of variance terms, minimum agreement occurs when the within-subjects variance is at a maximum (i.e., when the ratings are divided equally between the two extreme categories). According to information theory, minimum agreement is achieved when the judges assign any one of the ratings to a given subject with equal likelihood.

In contrast to Lawlis and Lu (1972), Lu's (1971) approach defines all rating disagreements as important, but the seriousness of the disagreement increases as a function of the squared magnitude of the disagreement. Thus, a disagreement of 5 points is more serious than a disagreement of 4 points; a disagreement of 1 point is more serious than a disagreement of 0 points.

Lu's measure of interrater agreement is defined as8

A = \frac{S_c^2 - S_o^2}{S_c^2},   (9)

where S_c^2 = the expected within-subjects variance when all the ratings are equally likely, and S_o^2 = the observed within-subjects variance. A approaches 0 (reflecting random interrater agreement) as S_o^2 approaches S_c^2; A approaches 1.00 (reflecting perfect interrater agreement) as S_o^2 approaches 0. A can be negative, in which case it indicates that disagreement is greater than would be observed under purely chance responding. In Table 1, A equals 1.0, .34, and .60 for Cases 1, 2, and 3, respectively.

8 While this equation is algebraically equivalent to Equation 1, Lu (1971) and Finn (1970) define both S_o^2 and S_c^2 differently.

Lu's A assumes the attribute under consideration is measurable conceptually on a continuous scale. Due to practical limitations, however, attributes almost always are rated in terms of an ordered set of nonoverlapping categories. This necessitates the calculation of a set of weights to be used in determining S_o^2 and S_c^2. The weights are calculated as follows:

Y_m = \frac{N_r + N_m/2}{KN},   (10)

where Y_m = the weight assigned to Category m, N_r = the number of ratings assigned to Categories 1 through m - 1, N_m = the number of ratings assigned to Category m, KN = the number of judges times the number of subjects, and m = 1, 2, . . . , k categories.

Once calculated, the weights are placed in a Subjects X Judges matrix in place of the actual ratings. The matrix is analyzed following a two-way factorial ANOVA procedure, which yields sums of squares for between subjects, within subjects, and total. Mean square within subjects represents S_o^2. Within-subjects variance under chance responding (S_c^2) is obtained as follows:

S_c^2 = \frac{\sum_{m=1}^{k} Y_m^2}{k} - \left(\frac{\sum_{m=1}^{k} Y_m}{k}\right)^2,   (11)

where Y_m = the weight assigned the mth category, and k = the number of categories. As with Finn's (1970) r and the Lawlis and Lu (1972) chi-square, this definition of chance variance requires the assumption that every rating has the same probability, because the ratings of the judges are purely random.

Given the hypothesis that the ratings were assigned randomly, the expected value of S_o^2 would be S_c^2. Thus, the statistical significance of A can be determined indirectly using Equation 3. A chi-square value lower than the lower critical value (i.e., the .99 level for a test at p < .01) would indicate that S_o^2 is significantly less than S_c^2, thereby implying that A is significantly greater than 0. Again, a stringent critical value is advised because of the violation of the normality assumption.
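Using the weight and chance-variance formulas given in Equations 10 and 11, a sketch of Lu's A follows (our illustration, not Lu's published procedure; the function and variable names are ours).

```python
# Lu's (1971) A = (S_c^2 - S_o^2) / S_c^2 (Equation 9), with ratings replaced by the
# category weights of Equation 10 and the chance variance taken from Equation 11.

def lu_a(ratings, n_categories):
    """ratings: one list per subject of integer ratings on a 1..k scale."""
    k = n_categories
    total = sum(len(subject) for subject in ratings)           # KN ratings in all
    counts = [sum(r == c for subject in ratings for r in subject)
              for c in range(1, k + 1)]
    weights, cum = {}, 0
    for c, n_c in enumerate(counts, start=1):                  # Equation 10
        weights[c] = (cum + n_c / 2) / total
        cum += n_c
    w = list(weights.values())
    chance_var = sum(y ** 2 for y in w) / k - (sum(w) / k) ** 2   # Equation 11
    ss_within, df = 0.0, 0
    for subject in ratings:
        scores = [weights[r] for r in subject]
        mean = sum(scores) / len(scores)
        ss_within += sum((s - mean) ** 2 for s in scores)
        df += len(scores) - 1
    observed_var = ss_within / df                              # within-subjects mean square
    return (chance_var - observed_var) / chance_var            # Equation 9

case2 = [[1, 3, 5], [1, 3, 5], [2, 4, 6], [2, 4, 6], [3, 5, 7],
         [3, 5, 7], [4, 6, 8], [4, 6, 8], [5, 7, 9], [5, 7, 9]]
case3 = [[5, 4, 4], [5, 4, 3], [5, 4, 5], [4, 4, 5], [5, 4, 3],
         [5, 5, 4], [4, 4, 5], [5, 5, 4], [4, 5, 3], [5, 5, 5]]
print(round(lu_a(case2, 9), 2), round(lu_a(case3, 9), 2))      # 0.34 0.6, as reported above
```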

Because of their recency, neither Lawlis and Lu's (1972) nor Lu's (1971) index has undergone rigorous, systematic investigation. Consequently, both indexes should be used with caution. The use of Lawlis and Lu's (1972) chi-square and the T index is recommended for two reasons. First, their treatment of interrater disagreements allows the counselor to more adequately distinguish rating disagreements that have no serious consequences from those that are of practical significance. An unfortunate corollary, however, is that this same flexibility may tempt the researcher to manipulate his or her definition of agreement in order to make the results reach desired levels of agreement. Second, Lu's (1971) index is more appropriately classified as a measure of interrater reliability, despite the fact that he refers to it as a measure of agreement. Lu's (1971) strategy for measuring agreement is quite similar to Finn's (1970) concept of reliability, however, and the pattern of results obtained for Cases 1, 2, and 3 in Table 1 (r = 1.00, .40, and .93; A = 1.00, .34, and .60) is quite similar. More research is needed to clarify the relationships between r and A.

Nominal Scales

Numerous writers have discussed the problem of the interrater agreement of nominal scales (Cohen, 1960, 1968; Everitt, 1968; Fleiss, 1971; Fleiss, Cohen, & Everitt, 1969; Goodman & Kruskal, 1954; Scott, 1955). All of the measures suggested by these authors are based on the percentage or proportion of agreements among judges (P). Two objections have previously been raised regarding the use of P. The fact that P treats agreement as an absolute is an advantage rather than a disadvantage when dealing with nominal ratings, since all disagreements usually are regarded as equally serious. The problem of chance agreement remains, however. Some method of representing P as an improvement over chance agreement must be found if P is to be useful as an index of interrater agreement. Guttman et al. (1971) concluded, after a review of the literature, that there was a "tacit" consensus that 65% represented the minimum acceptable agreement. Such a standard, however, would allow the judges to be in disagreement more than one third of the time. We do not recommend acceptance of such a crude rule of thumb. The following measures of interrater agreement represent observed agreement as a function of chance agreement and provide for statistical tests of the significance of the results.

Two Judges

Cohen (1960) has suggested the coefficient kappa as an indicator of the proportion of agreements between two raters after chance agreement has been removed from consideration. Cohen's kappa can be calculated as follows:

\kappa = \frac{P_o - P_c}{1 - P_c},   (12)

where P_o = the proportion of ratings in which the two judges agree, and P_c = the proportion of ratings for which agreement is expected by chance.

For the data in Table 2, which shows hypothetical data for two judges who each categorized 100 interview statements, the total proportion of agreement (P_o) is .70 (i.e., .18 + .18 + .24 + .10). The expected chance agreement can be obtained as follows:

P_c = \sum_{m=1}^{k} P_{jm} P_{gm},   (13)

where P_{jm} P_{gm} = the diagonal value for category m, obtained as the joint probability of the two judges' marginal proportions (j and g are the two raters and m is a category of the rating scale), and k = the number of categories.

TABLE 2
Hypothetical Proportions for Categorizations of Client Interview Statements by Two Judges

                                           Judge B
                           Negative    Positive    Request for   Goal      Row
Judge A                    self-ref.   self-ref.   information   setting   total
Negative self-reference      .18         .00          .06          .06      .30
Positive self-reference      .00         .18          .00          .02      .20
Request for information      .02         .12          .24          .02      .40
Goal setting                 .00         .00          .00          .10      .10
Column total                 .20         .30          .30          .20

Note. N = 100 statements. The diagonal entries give the observed proportion of agreement for each rating category.

The expected chance agreement (P_c) for the data in Table 2 is .26 [i.e., (.20 X .30) + (.30 X .20) + (.30 X .40) + (.20 X .10)], and kappa is .59. Computational examples and formulas for computing kappa from frequency data are provided by Cohen (1960).
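A minimal sketch (ours) reproducing these figures from the Table 2 proportions is given below.

```python
# Cohen's (1960) kappa = (P_o - P_c) / (1 - P_c), Equation 12, with P_o the summed
# diagonal proportions and P_c the chance agreement from the marginals (Equation 13).

table2 = [[.18, .00, .06, .06],    # rows: Judge A; columns: Judge B
          [.00, .18, .00, .02],
          [.02, .12, .24, .02],
          [.00, .00, .00, .10]]

k = len(table2)
p_o = sum(table2[m][m] for m in range(k))
row_margins = [sum(row) for row in table2]
col_margins = [sum(table2[i][j] for i in range(k)) for j in range(k)]
p_c = sum(row_margins[m] * col_margins[m] for m in range(k))
kappa = (p_o - p_c) / (1 - p_c)
print(round(p_o, 2), round(p_c, 2), round(kappa, 2))   # 0.7 0.26 0.59
```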

When the marginal frequencies are identical, kappa can vary from 1.00 to -1.00. A kappa of 0 indicates that the observed agreement is exactly equal to the agreement that could be expected by chance. A negative value of kappa indicates the observed agreement is less than the expected chance agreement, while a kappa of 1.00 indicates perfect agreement between the raters.

Others have suggested measures of interrater agreement that are identical in form to kappa but differ in their definition of chance agreement. Goodman and Kruskal (1954) define P_c as the mean of the proportions in the modal (most frequently used) categories of the two judges. In Scott's (1955) coefficient of intercoder agreement, P_c is based on the assumption that the proportion of ratings assigned to each category is equal for the judges. In contrast, Cohen's (1960) kappa recognizes that judges distribute their judgments differently over categories (e.g., see Table 2) and does not require the assumption of equal distribution of ratings. The only assumptions required by kappa are that the subjects to be rated are independent, the judges assign their judgments independently, and the categories of the nominal scale are independent, mutually exclusive, and exhaustive. Cohen (1960) and Fleiss et al. (1969) provide formulas for the standard error of kappa, for testing the significance of the difference between two kappas, and for testing the hypothesis that kappa = 0.

Weighted kappa. Kappa is based on the assumption that all disagreements in classification are equally serious. In some instances, nominal scaling notwithstanding, the counseling researcher or practitioner may consider some disagreements among judges to be more serious than others. If, for example, two counselors classified clients seen in a college counseling center as normal, neurotic, schizoid, or psychopathic, disagreement as to whether a client was normal or schizoid might be regarded as more serious than disagreement over whether the person was normal or neurotic. Cohen (1968) has developed weighted kappa (kappa_w) as an index of interrater agreement for use when the investigator wishes to differentially weight disagreements among nominal ratings.

The coefficient κ may be regarded as a special case of κ_w, in which weights of 1.0 are assigned to agreements (the diagonal values in Table 2), while weights of 0 are assigned all disagreements (the off-diagonal values in Table 2). In κ_w the various types of disagreements are assigned differential weights. If, for example, the counselor has assigned weights on the basis of degree of agreement, a weight of 50 represents twice as much agreement as a weight of 25 and five times as much agreement as a weight of 10. Weights may be assigned to indicate the amount of agreement or disagreement. The wisdom of this procedure depends, of course, upon the degree to which the transformed data represent psychological reality. Computational formulas and examples have been provided by Cohen (1968), Everitt (1968), and Fleiss et al. (1969).
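A weighted kappa can be sketched the same way. The Python fragment below is ours and uses one standard formulation in which each cell receives a disagreement weight (0 on the diagonal); when every disagreement is weighted 1 and every agreement 0, it reduces to the unweighted κ of .59 for the Table 2 proportions, in line with the special-case relation just described.

    # One common formulation of weighted kappa:
    # kappa_w = 1 - sum(w * p_observed) / sum(w * p_chance),
    # where w[j][g] is the disagreement weight for cell (j, g) and the chance
    # proportions are products of the two judges' marginal proportions.
    def weighted_kappa(p, w):
        k = len(p)
        row = [sum(p[j]) for j in range(k)]                       # Judge A marginals
        col = [sum(p[j][g] for j in range(k)) for g in range(k)]  # Judge B marginals
        observed = sum(w[j][g] * p[j][g] for j in range(k) for g in range(k))
        chance = sum(w[j][g] * row[j] * col[g] for j in range(k) for g in range(k))
        return 1 - observed / chance

    # Table 2 proportions (rows = Judge A, columns = Judge B).
    p = [[.18, .00, .06, .06],
         [.00, .18, .00, .02],
         [.02, .12, .24, .02],
         [.00, .00, .00, .10]]

    # Weighting every disagreement 1 and every agreement 0 recovers the
    # unweighted kappa of .59.
    equal_weights = [[0 if j == g else 1 for g in range(4)] for j in range(4)]
    print(round(weighted_kappa(p, equal_weights), 2))   # 0.59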

When the investigator wishes to order his rating categories along a unidimensional continuum, disagreement weights can be calculated using the following equation:

V_{jg} = (k - c)^2,    (14)

where k = the number of rating categories, and c = the number of cells in the diagonal containing Cell jg. (For Table 2, Equation 14 would yield disagreement weights of 0, 1, 4, 9; 1, 0, 1, 4; 4, 1, 0, 1; and 9, 4, 1, 0 for Rows 1 through 4, respectively.) This rating procedure transforms nominal data to data possessing the properties of ratio measurement. Accordingly, the proportionality of the ratings is again of interest. The index κ_w deals with the proportionality of the ratings and is identical to the intraclass correlation when disagreement weights are assigned as above (Fleiss & Cohen, 1973). For the reasons discussed previously, the intraclass correlation and Lawlis and Lu's (1972) chi-square and T index are more appropriate whenever the investigator transforms his data in this manner. The rating categories may be numbered 1 through k and the subjects given the number of the category to which they are assigned.
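To make Equation 14 concrete, the sketch below (ours; it assumes the four Table 2 categories are ordered as listed in the table) builds the quadratic disagreement weights, which for the cell in row j and column g reduce to the squared distance from the diagonal, (j - g)^2.

    # Equation 14: with the k categories ordered along a single continuum, the
    # disagreement weight for the cell in row j and column g is (k - c)^2, where
    # c is the number of cells in the diagonal containing that cell; this equals
    # the squared distance from the main diagonal, (j - g)^2.
    k = 4
    weights = [[(j - g) ** 2 for g in range(k)] for j in range(k)]
    for row_of_weights in weights:
        print(row_of_weights)
    # [0, 1, 4, 9]
    # [1, 0, 1, 4]
    # [4, 1, 0, 1]
    # [9, 4, 1, 0]

Feeding these weights into a weighted kappa routine such as the one sketched above yields a quadratically weighted value of roughly .49 for the Table 2 proportions (our own arithmetic; the article reports no weighted value for these data), the case in which, as noted, Fleiss and Cohen (1973) show κ_w to be interpretable as an intraclass correlation.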

The index κ_w will find little valid use in counseling research. The use of κ_w is appropriate only in the special case in which the counseling psychologist wishes to assign weights that are inconsistent with a unidimensional ordering of the rating categories. This would most likely occur when he or she is operating on the basis of a theory that postulates a multidimensional relationship among the rating categories. In such instances the investigator must justify the weights assigned the rating categories by explaining in detail the theory upon which the weights are predicated. The theory, of course, becomes an integral part of the hypothesis being tested. Obviously, the weights should be specified in the research report.

Variable Set of Raters

The use of κ is limited to the situation in which the same two judges rate each subject. Fleiss (1971) has formulated an extension of κ for measuring interrater agreement when subjects are rated by different sets of judges, but the number of judges per subject is constant. Using this method of measuring interrater agreement, all judges need not rate each subject. This measure of interrater agreement would be appropriate, for example, where different groups of three counselors categorized clients according to presenting complaint, such as is illustrated in Table 3.

Like Cohen (1960), Fleiss (1971) uses κ to indicate the degree to which the observed agreement exceeds the agreement expected on the basis of chance. The subscript v in the formulas below is added to indicate that the judges may vary from subject to subject. Fleiss' formula for κ_v is

\kappa_v = \frac{P_{0v} - P_{cv}}{1 - P_{cv}},    (15)

where P_{0v} = the proportion of ratings in which the judges agree, and P_{cv} = the proportion of ratings for which agreement is expected by chance. Observed agreement is calculated as follows:

P_{0v} = \frac{\sum_{i=1}^{N} \sum_{m=1}^{k} n_{im}^2 - Nn}{Nn(n - 1)},    (16)

where:
N = the number of subjects rated,
n = the number of ratings per subject,
k = the number of categories in the rating scale,
n_{im} = the number of judges who assign Subject i to Category m,
i = 1, 2, . . . , N subjects,
m = 1, 2, . . . , k categories;
and the proportion of agreements expected on the basis of chance is

P_{cv} = \sum_{m=1}^{k} P_m^2,    (17)

where

P_m = \frac{\sum_{i=1}^{N} n_{im}}{Nn},    (18)

and N, n, k, n_{im}, i, and m are defined as in Equation 16. Fleiss (1971) provides a computational example. The index κ_v for the data in Table 3 is .20. Fleiss (1971) also provides formulas for the standard error of κ_v and for testing the hypothesis that the observed agreement equals chance agreement.
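A brief Python sketch (ours; the client-by-category counts are those shown in Table 3 below) reproduces κ_v = .20 by applying Equations 16 through 18 and then Equation 15.

    # Fleiss' kappa_v (Equations 15-18) for the Table 3 data: 10 clients, each
    # categorized by 3 judges as vocational, social-emotional, or educational.
    # counts[i][m] = number of judges assigning Client i to Category m.
    counts = [
        [2, 0, 1], [0, 1, 2], [0, 3, 0], [2, 0, 1], [0, 0, 3],
        [1, 1, 1], [2, 0, 1], [3, 0, 0], [1, 0, 2], [2, 0, 1],
    ]
    N, n, k = len(counts), 3, 3

    # Equation 16: observed agreement.
    p_0v = (sum(n_im ** 2 for client in counts for n_im in client) - N * n) / (N * n * (n - 1))

    # Equations 18 and 17: overall category proportions, then chance agreement.
    p_m = [sum(client[m] for client in counts) / (N * n) for m in range(k)]
    p_cv = sum(x ** 2 for x in p_m)

    kappa_v = (p_0v - p_cv) / (1 - p_cv)                      # Equation 15
    print(round(p_0v, 2), round(kappa_v, 2))                  # 0.5 0.2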

Interrater agreement for one subject. In special cases, the counseling researcher may wish to study the degree of agreement concerning a specific subject. Such a need may arise, for example, when the subject's diagnosis will determine the treatment or when the investigator wishes to identify subjects for whom the rating scale is inappropriate. It might be important for the counselor to know that the presenting complaint of Client 6 in Table 3, for example, is poorly categorized, while the judges agree perfectly in their categorization of Clients 3, 5, and 8. With a larger data base, the counseling researcher might be able to identify the "Client 6" type of client, for whom the scale is inappropriate.

The degree of agreement for one subject is obtained as follows:

P_i = \frac{\sum_{m=1}^{k} n_{im}(n_{im} - 1)}{n(n - 1)},    (19)

where n, i, m, k, and n_{im} are defined as in Equation 16. Table 3 shows the agreement of the judges for each of the 10 subjects. The value P_{0v} is the mean of the agreements for the individual subjects (i.e., the mean of the P_i's). When the P_i's are of no importance, however, the investigator will find Equation 16 to be computationally simpler.

TABLE 3
Number of Agreements Among Three Judges on Categorization of Clients by Presenting Complaint

                             Type of concern
Client         Vocational   Social-emotional   Educational   Agreement on client
 1                 2                0               1                .33
 2                 0                1               2                .33
 3                 0                3               0               1.00
 4                 2                0               1                .33
 5                 0                0               3               1.00
 6                 1                1               1                .00
 7                 2                0               1                .33
 8                 3                0               0               1.00
 9                 1                0               2                .33
10                 2                0               1                .33
Agreement on
  category        .19              .52             .03

Note. Cell entries are the number of judges (of three) assigning the client to each category; the last column gives the agreement for each client (Equation 19) and the bottom row the agreement for each category (Equation 20).


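Equation 19 is easy to verify against the last column of Table 3; the sketch below (ours, with illustrative names) shows the three patterns of agreement that occur among the ten clients.

    # Per-subject agreement (Equation 19) with n = 3 judges per client; the
    # argument is the client's row of category counts from Table 3.
    n = 3

    def subject_agreement(category_counts):
        return sum(c * (c - 1) for c in category_counts) / (n * (n - 1))

    print(subject_agreement([0, 3, 0]))              # Clients 3, 5, 8: 1.0
    print(subject_agreement([1, 1, 1]))              # Client 6: 0.0
    print(round(subject_agreement([2, 0, 1]), 2))    # Client 1 and similar rows: 0.33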

Interrater category agreement. Finally, the investigator may wish to determine the extent of agreement in assigning subjects to Category m. This will be important in studying the rating scale. A low degree of overall agreement may occur, for example, because of low agreement regarding only one or two categories. This situation may occur when some of the scale categories are poorly defined or are overlapping. Moreover, the omission of an important category from the scale may result in the inappropriate assignment of subjects to other categories by default. Whatever the reason for the lack of agreement, this information will direct attention to the categories where the greatest disagreement occurs.

Used as an index of the extent to which the observed agreement for Category m exceeds the expected agreement for Category m, κ_m is calculated as

\kappa_m = \frac{\sum_{i=1}^{N} n_{im}^2 - NnP_m[1 + (n - 1)P_m]}{Nn(n - 1)P_mQ_m},    (20)

where n_{im}, N, n, and P_m are as defined in Equations 16 and 18, and Q_m = 1 - P_m. For the three categories in Table 3, κ_m is .19, .52, and .03, respectively, and is interpreted in the same manner as κ and κ_v. Fleiss (1971) provides additional computational examples as well as techniques for testing the hypothesis that the agreement in the assignment of subjects to Category m is no better than chance agreement.
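The category values in the bottom row of Table 3 can likewise be checked with a short sketch (ours; the counts are again taken from Table 3) that applies Equation 20 to each category in turn.

    # Category-by-category agreement (Equation 20) for the Table 3 counts.
    counts = [
        [2, 0, 1], [0, 1, 2], [0, 3, 0], [2, 0, 1], [0, 0, 3],
        [1, 1, 1], [2, 0, 1], [3, 0, 0], [1, 0, 2], [2, 0, 1],
    ]
    N, n = len(counts), 3

    for m, label in enumerate(["vocational", "social-emotional", "educational"]):
        p_m = sum(client[m] for client in counts) / (N * n)   # Equation 18
        q_m = 1 - p_m
        sum_sq = sum(client[m] ** 2 for client in counts)
        kappa_m = (sum_sq - N * n * p_m * (1 + (n - 1) * p_m)) / (N * n * (n - 1) * p_m * q_m)
        print(label, round(kappa_m, 2))                       # .19, .52, .03 in turn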

SUMMARY OF RECOMMENDATIONS

General Considerations

Whenever rating scales are employed by psychologists, special attention should be paid to the interrater reliability and interrater agreement of the ratings. Evidence regarding both the reliability and agreement of the ratings is mandatory before the ratings can be accepted.

In reporting the interrater reliability and agreement of ratings, the investigator should describe the manner in which the index was calculated (e.g., "The intraclass correlation was used with between-raters variance excluded from the error term to calculate the average reliability of the individual rater"; "Lawlis and Lu's chi-square was calculated for a 5-point scale with three raters and with agreement defined as a discrepancy of 1 point or less"), and the assumptions required by the indexes employed.

Interrater reliability and agreement are functions of the subjects rated, the rating scale used, and the judges making the ratings. Therefore, the use of estimates of interrater reliability and agreement as determined from a training tape or generalized from other research or other groups of raters is undesirable. Since all measures of interrater reliability and agreement recommended for use in this article are one-trial estimates, the interrater reliability and agreement should be determined appropriately for every set of rating data.

Finally, the psychologist should keep in mind that statistical tests of the hypothesis that the observed interrater reliability or agreement is equal to 0 are only preliminary to determining the degree of reliability or agreement. Of special interest to counselors would be research designed to establish minimum standards for interrater reliability and agreement and to develop statistical tests of the hypothesis that the observed reliability (or agreement) is less than or equal to the minimum standard.

Interrater Reliability

While the rate-rerate method is frequently employed as a measure of interrater reliability, ratings evaluated in this manner must be regarded as questionable, due to bias inherent in the repeated ratings procedure. The intraclass correlation (R, Equation 4) is recommended as the best measure of interrater reliability available for ordinal and interval level measurement. In reporting R for a set of ratings, the investigator must be explicit in describing the procedure employed. In order to properly interpret the results, the reader must know whether between-raters variance was included or excluded from the error term. If the between-judges variance is excluded from the error term, the investigator should point out that this precludes the possibility of generalizing his results to other situations. Generalizations are permissible only when between-raters variance is regarded as error.

In addition, the investigator should make clear whether the reported R represents the average reliability of the individual rater or the reliability of the composite rating and should describe how the ratings are to be used. Only a clear exposition of all of these factors will allow the reader to evaluate the results intelligently.

Finn's r (Equation 1) is recommended only when the within-subjects variance in the ratings is so severely restricted that the intraclass correlation is inappropriate (e.g., Case 3 in Table 1). The hypothesis that the observed within-subjects variance is equal to the chance within-subjects variance should, of course, be tested by the chi-square test (Equation 3) prior to the calculation of r. Finn's r should be interpreted cautiously, however, since chance variance may be overestimated, thereby spuriously inflating r.

Composite scores. For most purposes, psychologists will find the reliability of composite scores based on unit weights satisfactory. If the decision to be made is of such importance that the greatest possible precision of measurement is required, and if the ratings of the several judges differ in variance and/or reliability, optimal weights should be used instead of unit weights. In this case the procedure developed by Overall (1965) is appropriate. In reporting the results of research employing a composite rating based on optimal weights, the counseling psychologist should indicate the variance and the estimated reliability of each judge, the weights assigned the ratings of each judge, and the estimated reliability of the composite, based on unit weights and optimal weights. Such complete disclosure will allow the reader to evaluate thoroughly the results and to determine for himself the advantage of optimal weights in the specific instance. Finally, the counselor must keep in mind the fact that optimal weights for ratings are specific to the rating situation and cannot be assumed to generalize across rating occasions, even though the same judges may have employed the same rating scale with a similar sample of subjects on two similar occasions.

Interrater Agreement

The proportion or percentage of agreements (P), chi-square indexes, the pairwise correlation between raters, and Kendall's coefficient of concordance are inappropriate as measures of interrater agreement. The latter two indexes do not indicate interrater agreement, while P treats agreement as an absolute and is inflated by chance or random agreements.

We recommend the use of the Lawlis and Lu (1972) index of interrater agreement (Equation 7). The definition of agreement adopted and an indication of the extent of agreement (T index, Equation 8) should accompany the Lawlis and Lu indication of the significance of the agreement. In addition, the use of a stringent level of significance is recommended in evaluating agreement. Alternatively, the investigator should use a chance P value based on fewer scale categories than were actually available. This will partially correct for the possibility that the raters were predisposed to avoid the extreme categories of the scale, thereby causing the probability of chance agreement to be greater than the appropriate P value.

Nominal Scales

Nominally scaled data permit an analysis only of interrater agreement. The use of Cohen's κ (Equation 12) is recommended when the same two judges rate each subject. Fleiss' κ_v (Equation 15) is recommended when subjects are rated by different judges but the number of judges rating each subject is held constant. The general use of Cohen's weighted kappa (κ_w) is not recommended.

At the present time, no measure of interrater agreement for nominal scales is available for situations in which a varying number of judges rate subjects, or in which the same group of more than two judges rates each subject. The latter situation is the case most likely to occur in counseling research. Clearly, more research is needed.

REFERENCES

Baker, B. O., Hardyck, C. D., & Petrinovich, L. F. Weak measurement vs. strong statistics: An empirical critique of S. S. Stevens' proscriptions on statistics. Educational and Psychological Measurement, 1966, 26, 291-309.

Bartko, J. J. The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 1966, 19, 3-11.

Burdock, E. I., Fleiss, J. L., & Hardesty, A. S. A new view of interobserver agreement. Personnel Psychology, 1963, 16, 373-384.

Cannon, J., & Carkhuff, R. R. Effects of rater level of functioning and experience upon the discrimination of facilitative conditions. Journal of Consulting and Clinical Psychology, 1969, 33, 189-194.

Carkhuff, R. R. Helper communication as a function of helpee affect and content. Journal of Counseling Psychology, 1969, 16, 126-131.

Carkhuff, R. R. Principles of social action in training for new careers in human services. Journal of Counseling Psychology, 1971, 18, 147-151.

Carkhuff, R. R., & Banks, G. Training as a preferred mode of facilitating relationship between races and generations. Journal of Counseling Psychology, 1970, 17, 413-418.

Carkhuff, R. R., & Griffin, A. H. The selection and training of human relations specialists. Journal of Counseling Psychology, 1970, 17, 443-450.

Clark, E. L. Spearman-Brown formula applied to ratings of personality traits. Journal of Educational Psychology, 1935, 26, 552-555.

Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960, 20, 37-46.

Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 1968, 70, 213-220.

Cronbach, L. J., & Gleser, G. C. Psychological tests and personnel decisions. Urbana: University of Illinois Press, 1965.

Ebel, R. L. Estimation of the reliability of ratings. Psychometrika, 1951, 16, 407-424.

Englehart, M. D. A method of estimating the reliability of ratings compared with certain methods of estimating the reliability of tests. Educational and Psychological Measurement, 1959, 19, 579-588.

Everitt, B. S. Moments of the statistics kappa and weighted kappa. British Journal of Mathematical and Statistical Psychology, 1968, 21, 97-103.

Finn, R. H. A note on estimating the reliability of categorical data. Educational and Psychological Measurement, 1970, 30, 71-76.

Finn, R. H. Effects of some variations in rating scale characteristics on the means and reliabilities of ratings. Educational and Psychological Measurement, 1972, 32, 255-265.

Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971, 76, 378-382.

Fleiss, J. L., & Cohen, J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 1973, 33, 613-619.

Fleiss, J. L., Cohen, J., & Everitt, B. S. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 1969, 72, 323-327.

Garfield, S. L., Prager, R. A., & Bergin, A. E. Evaluation of outcome in psychotherapy. Journal of Consulting and Clinical Psychology, 1971, 37, 307-313.

Goodman, L. A., & Kruskal, W. H. Measures of association for cross classifications. Journal of the American Statistical Association, 1954, 49, 732-764.

Guilford, J. P. Psychometric methods (2nd ed.). New York: McGraw-Hill, 1954.

Guttman, H. A., Spector, R. M., Sigal, J. J., Rakoff, V., & Epstein, N. B. Reliability of coding affective communication in family therapy sessions: Problems of measurement and interpretation. Journal of Consulting and Clinical Psychology, 1971, 37, 397-402.

Hays, W. L. Statistics for psychologists. New York: Holt, Rinehart & Winston, 1963.

Kaspar, J. C., Throne, F. M., & Schulman, J. L. A study of the interjudge reliability in scoring the responses of a group of mentally retarded boys to three WISC subscales. Educational and Psychological Measurement, 1968, 28, 469-477.

Kelley, T. L. Fundamentals of statistics. Cambridge, Mass.: Harvard University Press, 1947.

Krippendorff, K. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 1970, 30, 61-70.

Lawlis, G. F., & Lu, E. Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 1972, 78, 17-20.

Lawshe, C. H., & Nagle, B. F. A note on the combination of ratings on the basis of reliability. Psychological Bulletin, 1952, 49, 270-273.

Lu, K. H. A measure of agreement among subjective judgments. Educational and Psychological Measurement, 1971, 31, 75-84.

Martin, D. G., & Gazda, G. M. A method of self-evaluation for counselor education utilizing the measurement of facilitative condition. Counselor Education and Supervision, 1970, 9, 87-92.

McMullin, R. E. Effects of counselor focusing on client self-experiencing under low attitudinal conditions. Journal of Counseling Psychology, 1972, 19, 282-285.

Mickelson, D. T., & Stevic, R. R. Differential effects of facilitative and nonfacilitative behavioral counselors. Journal of Counseling Psychology, 1971, 18, 314-319.

Myrick, R. D., & Pare, D. D. A study of the effects of group sensitivity training with student counselor-consultants. Counselor Education and Supervision, 1971, 11, 90-96.

Overall, J. E. Reliability of composite ratings. Educational and Psychological Measurement, 1965, 25, 1011-1022.

Payne, P. A., Winter, D. E., & Bell, G. E. Effects of supervisor style on the learning of empathy in a supervision analogue. Counselor Education and Supervision, 1972, 11, 262-269.

Pierce, R. M., & Schauble, P. G. Toward the development of facilitative counselors: The effects of practicum instruction and individual supervision. Counselor Education and Supervision, 1971, 11, 83-89.

Robinson, W. S. The statistical measurement of agreement. American Sociological Review, 1957, 22, 17-25.

Rosander, A. C. The Spearman-Brown formula in attitude scale construction. Journal of Experimental Psychology, 1936, 19, 486-495.

Schuldt, W. J., & Truax, C. B. Variability of outcome in psychotherapeutic research. Journal of Counseling Psychology, 1970, 17, 405-408.

Scott, W. A. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 1955, 19, 321-325.

Silverstein, A. B. A note on interjudge reliability. Psychological Reports, 1966, 19, 1170.

Smith, F. F. Objectivity as a criterion for estimating the validity of questionnaire data. Journal of Educational Psychology, 1935, 26, 481-496.

Snedecor, G. W. Statistical methods (4th ed.). Ames: Iowa State College Press, 1946.

Stillman, S., & Resnick, H. Does counselor attire matter? Journal of Counseling Psychology, 1972, 19, 347-348.

Tanney, M. F., & Gelso, C. J. Effect of recording on clients. Journal of Counseling Psychology, 1972, 19, 349-350.

Taylor, J. B. Rating scales as measures of clinical judgment: A method for increasing scale reliability and sensitivity. Educational and Psychological Measurement, 1968, 28, 747-766.

Truax, C. B., & Carkhuff, R. R. Toward effective counseling and psychotherapy. Chicago: Aldine, 1967.

Truax, C. B., & Lister, J. L. Effects of short-term training upon accurate empathy and non-possessive warmth. Counselor Education and Supervision, 1971, 10, 120-125.

Tseng, M. S. Self-perception and employability: A vocational rehabilitation program. Journal of Counseling Psychology, 1972, 19, 314-317.

Weiss, D. J. Factor analysis and counseling research. Journal of Counseling Psychology, 1970, 17, 477-485.

Weiss, D. J. Further considerations in applications of factor analysis. Journal of Counseling Psychology, 1971, 18, 85-92.

Wittmer, J. An objective scale for content analysis of the counselor's interview behavior. Counselor Education and Supervision, 1971, 10, 283-290.

(Received August 19, 1974)