Comparison of therapist to patient judgment bias in low vision


Therapist Judgment Bias and Reliability Relative to that of Patients in

the Estimation of Functional Ability from Ordinal Ratings

Robert W. Massof,1 Theresa M. Smith,2 Lisa S. Foret,3 Guy Davis,3 and Kyoko Fujiwara1

1Lions Vision Research and Rehabilitation Center, Wilmer Eye Institute, Johns Hopkins

University School of Medicine

2Department of Occupational Therapy and Rehabilitation Sciences, University of Texas

Medical Branch Galveston

3Evangeline Home Health, Lake Charles, LA

Supported by grant EY022322 from the National Eye Institute, National Institutes of Health,

Bethesda, MD.


Abstract

Objective: To present and evaluate a measurement model for estimating the judgment bias of

therapists and patients when rating functional ability. Design: Observational study of the

agreement between therapist ratings and patient self-ratings of functional ability. Setting:

Measures made by telephone interview and in the patient’s home. Participants: Forty-five home

health care patients who have a secondary diagnosis of low vision. Main Outcome Measures:

Functional ability estimated from Rasch analysis of patient difficulty ratings of calibrated items

(activity goals) in the Activity Inventory (AI) and therapist ratings using a FIM scale of the same

activity goals, both at initial evaluation and again after discharge. Results: A linear relationship

was observed between functional ability measures estimated from therapist ratings and measures

estimated from patient self-ratings with the same slope, but different intercepts, for measures

obtained at baseline and at post-rehabilitation follow-up. Conclusions: The observed linear

relationship between measures estimated from therapist ratings and measures estimated from

patient ratings confirms the model prediction. The intercept corresponds to the difference

between the therapist’s judgment bias and the average judgment bias of all patients. Relative to

patient judgments, the therapist’s estimate of functional ability at baseline was less than the

patients’ estimates; it was greater than the patients’ estimates at follow-up. The slope of the line

corresponds to the square root of the ratio of the between-patient plus within-patient variance in

judgment bias to the within-therapist variance in judgment bias. The results indicate that

between-patient variance is almost 3 times the within-therapist variance.


Introduction

Rehabilitation medicine employs three different approaches to estimate the functional ability of

patients: 1) measures of task performance time and/or accuracy;1 2) patient ratings of their own

ability and/or frequency of performing activities;2 and 3) ratings by a therapist or proxy of a

patient’s ability and/or frequency of performing activities.3 Functional ability is a trait of the

patient. Task performance time and accuracy, patient ratings, and therapist ratings are only

indicators of functional ability. Measurements of functional ability per se must be inferred from

the observed indicators. Because functional ability is a property of the patient, valid and unbiased

measures of functional ability estimated from the three different approaches should agree.

Measurement validity refers to the accuracy of the assumption that the estimated measure is

linear with the magnitude of the variable of interest. Measurement bias refers to the agreement

(or disagreement) between different measures of the same variable when the variable magnitude

has not changed between measures. In the case of functional ability, measurement validity and

bias can be influenced by the sample of activities selected for observation and, in the case of

ratings, by properties of the judge.

This paper is concerned with comparing functional ability measures estimated from ratings by

patients to functional ability measures estimated from ratings by a therapist. More specifically,

this paper focuses on the estimation of relative biases and measurement uncertainties of judges

when comparing functional ability measures estimated from a therapist’s judgments to functional

ability measures estimated from patient judgments of themselves. We first present a model of

patient self-ratings and a parallel model of therapist ratings of the patient, explicitly identifying

respective biases and sources of variance in the observations, and show how the two sets of


ratings are related. We then test the model with a substantive example using low vision

rehabilitation of visually impaired home health care patients.

Model of Patient Self-Ratings and Therapist Ratings

Using ordered rating scale categories (e.g., level of “difficulty” or level of “independence”), both the patient and the therapist are asked to judge the patient’s ability to perform specific activities, referred to as “items”. The true ability of patient $n$, which we are attempting to estimate from the patient’s and therapist’s ratings, is $\alpha_n$. The ability required to perform each of the items, $\rho_j$ for item $j$, is a property of the item that is independent of the judge (whether patient or therapist). The model assumes that both the patient and therapist are judging the magnitude of the patient’s functional reserve for the activity described by the item, which is the difference between the ability of patient $n$ and the ability required by item $j$, i.e., $\alpha_n - \rho_j$. Both the patient and therapist are instructed in the use of the ratings, but they develop their own criteria for each rating category that they will assign to a patient/item pair. These criteria, or “thresholds”, can be thought of as boundaries between neighboring categories on a continuous functional reserve scale. The thresholds are denoted as $\tau_{kx}$ for the boundary set by judge $k$ between rating category $x-1$ and rating category $x$ ($k = n$ in the case of patient self-judgment).

Although the value of $\rho_j$ is independent of the judge, judges’ estimates of $\rho_j$ are likely to be biased. If $\hat{\rho}_{kj}$ is the estimate of $\rho_j$ by judge $k$, then $\hat{\rho}_{kj} = \rho_j + \epsilon_{kj}$, where $\epsilon_{kj}$ is the bias of judge $k$ in estimating the ability required by item $j$. Similarly, the average threshold for rating category $x$ across a population of judges is $\tau_x$; therefore, $\tau_{kx} = \tau_x + \eta_{kx}$, where $\eta_{kx}$ is the bias of judge $k$, relative to the average judge, in the choice of threshold for rating category $x$. In the case of therapists or proxies, the population of judges would refer to all therapists or to all proxies, respectively. If we


define $\epsilon_k$ to be the average bias of judge $k$ across items and $\eta_k$ to be the average bias of judge $k$ across rating category thresholds, then we can re-express the bias terms as the sum of a fixed variable (average) and a random variable ($\delta$), i.e., $\epsilon_{kj} = \epsilon_k + \delta\epsilon_{kj}$ and $\eta_{kx} = \eta_k + \delta\eta_{kx}$ (if there is only a single judge contributing to the estimate of $\tau_x$, then $\eta_k = 0$). In each case, the random variable has an expected value of zero and incorporates variance associated with real differences in bias between items and/or categories, estimation uncertainty, and parameter instability.

The judge assigns rating category $x$ to item $j$ if the estimated functional reserve exceeds the judge’s criterion for category $x$ (and all lower categories) and is less than the criterion for category $x+1$ (and all higher categories), i.e.,

$$\tau_{k1}, \cdots, \tau_{kx} < \alpha_n - \hat{\rho}_{kj} < \tau_{k,x+1}, \cdots, \tau_{km}. \tag{1a}$$

Substituting the definitions presented in the preceding paragraph and, for judge $k$, combining the random variables into a single random term and combining the fixed bias variables into a single fixed term, expression (1a) can be expanded to make the fixed and random variables explicit, i.e.,

$$\tau_1 + \delta_{kj1}, \cdots, \tau_x + \delta_{kjx} < \alpha_n - \rho_j - \beta_k < \tau_{x+1} + \delta_{kj,x+1}, \cdots, \tau_m + \delta_{kjm} \tag{1b}$$

where $\delta_{kjx} = \delta\eta_{kx} + \delta\epsilon_{kj}$ and $\beta_k = \epsilon_k + \eta_k$. The judgment bias of judge $k$ is summarized with the bias term $\beta_k$ and the reliability of judge $k$ is summarized by the variance of $\delta_{kjx}$, which we designate as $\sigma_{kjx}^2$.
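The rating rule in expression (1b) can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: a single random term is applied to every threshold for a given judgment, whereas in the model the random term can differ by threshold. The function name `assign_category` and all numerical values are hypothetical.

```python
from bisect import bisect_left


def assign_category(alpha_n, rho_j, beta_k, thresholds, delta=0.0):
    """Return the number of thresholds exceeded by the judged functional reserve.

    thresholds holds the average thresholds tau_1..tau_m, so the result runs
    from 0 (lowest category) to m (highest category).
    delta is one random term shared by every threshold (an assumption).
    """
    # Expression (1b): tau_x + delta < alpha_n - rho_j - beta_k
    judged_reserve = alpha_n - rho_j - beta_k - delta
    return bisect_left(sorted(thresholds), judged_reserve)


# Hypothetical values in logits.
thresholds = [-1.5, -0.5, 0.5, 1.5]
unbiased = assign_category(alpha_n=1.0, rho_j=0.2, beta_k=0.0, thresholds=thresholds)
biased = assign_category(alpha_n=1.0, rho_j=0.2, beta_k=1.0, thresholds=thresholds)
print(unbiased, biased)  # 3 2
```

With these numbers, a fixed bias of one logit lowers the assigned category by one step even though the true functional reserve is unchanged.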

Rasch analysis is used routinely to estimate the average expected rating category thresholds ($\tau_x$ for rating category $x$), the true person measures ($\alpha_n$ for person $n$), and the true item measures ($\rho_j$ for item $j$) from distributions of observed ratings across persons and items.4 Judgment bias, $\beta_k$, affects the accuracy of the estimates and the variance of the random terms, $\sigma_{kjx}^2$, affects estimation precision (i.e., reliability). Rasch models assume homogeneity of variance, i.e., $\sigma_{kjx}^2$ is


the same for all persons, items, and rating category thresholds (a requirement of unidimensional measures). Homogeneity of variance means that $\sigma_{kjx}^2 = \sigma_k^2$. Rasch models also assume that the random terms are statistically independent of one another.4 Various statistical tests are used to evaluate how well the set of observed ratings conforms to these assumptions of the Rasch model.4

In the case of patient self-judgment, when there are $N$ patients there also are $N$ judges. However, Rasch models typically (but not necessarily) assume that there is just a single judge, which in effect is the average of the judges. In this case, when $\sigma_k^2$ is referring to the average of $N$ judges, it must include variance between judges, $\sigma_{bn}^2$, as well as variance within judges, $\sigma_n^2$. We therefore define the variance of the average patient judge to be

$$\sigma_P^2 = \sigma_{bn}^2 + \frac{1}{N}\sum_{n=1}^{N}\sigma_n^2, \tag{2}$$

the sum of between patient variance and average within patient variance. When a single therapist is the judge, the variance of the therapist can be attributed entirely to the variance within the judge, $\sigma_T^2 = \sigma_k^2$. To complete the definition of terms for our model, the fixed judgment bias of each patient is $\beta_n$ and the fixed judgment bias of the therapist is $\beta_T$.
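As a small numerical illustration of eq. (2), the following sketch adds a between-patient variance to the mean of the within-patient variances. The function name and the variance values are hypothetical and are not taken from the study data.

```python
def average_patient_variance(between_patient_var, within_patient_vars):
    """Eq. (2): between-patient variance plus the mean within-patient variance."""
    mean_within = sum(within_patient_vars) / len(within_patient_vars)
    return between_patient_var + mean_within


# Hypothetical variances for three patients.
sigma_p_squared = average_patient_variance(2.0, [1.0, 0.5, 1.5])
print(sigma_p_squared)  # 3.0
```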

In practice, Rasch models normalize the estimated person and item measures to the square root of the judge’s variance and ignore the judge’s bias (unless made explicit in a facet model5). Thus, person measures estimated from patient self-judgments are expressed as

$$\hat{\alpha}_{nP} = (\alpha_n + \beta_P)/\sigma_P \tag{3}$$

for person $n$, where $\beta_P = \frac{1}{N}\sum_{n=1}^{N}\beta_n$, the average bias across patients. The person measures estimated from a therapist’s judgments are expressed as

$$\hat{\alpha}_{nT} = (\alpha_n + \beta_T)/\sigma_T \tag{4}$$


for the same person $n$. Because both eqs. (3) and (4) are linear functions of the true person measure, $\alpha_n$, we expect the relationship between person measures estimated from a therapist’s ratings and corresponding person measures estimated from patients rating themselves to be

$$\hat{\alpha}_{nT} = \frac{\sigma_P}{\sigma_T}\,\hat{\alpha}_{nP} + \frac{\beta_T - \beta_P}{\sigma_T}, \tag{5}$$

a linear relationship for which the slope is the ratio of the standard deviation for the average patient to the standard deviation for the therapist and the intercept is the weighted difference between therapist and average patient judgment biases.
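The intermediate algebra behind eq. (5) is not spelled out in the text; one way to write it, using only the symbols defined in eqs. (3) and (4), is:

```latex
\begin{align*}
\alpha_n &= \sigma_P\,\hat{\alpha}_{nP} - \beta_P
  && \text{(eq. 3 solved for } \alpha_n\text{)} \\
\hat{\alpha}_{nT} &= \frac{\sigma_P\,\hat{\alpha}_{nP} - \beta_P + \beta_T}{\sigma_T}
  && \text{(substituted into eq. 4)} \\
&= \frac{\sigma_P}{\sigma_T}\,\hat{\alpha}_{nP} + \frac{\beta_T - \beta_P}{\sigma_T}
  && \text{(terms regrouped)}
\end{align*}
```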

Methods

Research Design

The present study is part of a larger observational study still in progress. Data reported here were

collected pre and post usual occupational therapy intervention provided in the participant’s home

by one occupational therapist who has specialty training in low vision rehabilitation and 12 years

of experience providing rehabilitation services to home health care patients with low vision.

Participants

Eligibility criteria for the study were: 1) patients were new to the occupational therapist; 2)

patients were adults admitted to home health care; 3) patients met the visual impairment

diagnostic criteria for Medicare or other third party coverage of low vision rehabilitation

services;6 and 4) patients understood English and had good enough hearing to be able to

participate in telephone interviews. Forty-five low vision patients participated in this study.

Procedures

The study conformed to the tenets of the Declaration of Helsinki and was approved by the Johns

Hopkins Institutional Review Board. After the patient consented to participate, one of the


investigators administered the Activity Inventory (AI),7-9 an adaptive rating scale instrument, by

telephone interview. Participants rated the importance of the 50 activity goals in the AI, and rated

the difficulty of those goals that were rated to be at least “slightly important”. In the instructions

to the participant, both importance and difficulty ratings were qualified as referring to the ability to perform

the activity “without depending on another person”. Goals included in this study were those that

the participant also rated to be at least “slightly difficult”. In addition, participants rated the

difficulties of tasks in the AI that are nested under goals that were rated to be at least slightly

important and slightly difficult.

At the time of the initial patient evaluation, the occupational therapist was provided with a list of

the AI goals and subsidiary tasks that were rated by the participant to be at least slightly difficult;

however, the actual ratings assigned by the participant to each goal and task were not revealed.

After completing the initial patient evaluation, the occupational therapist assigned a FIM scale

score3,10 to each of the participant-identified AI goals. Table 1 lists the FIM rating scale

categories. The occupational therapist then developed the patient’s plan of care and provided

rehabilitation services following usual procedures. At discharge the occupational therapist again

used the FIM scale to rate the participant’s functional independence level for the same AI goals

that were rated at the initial evaluation. The AI was re-administered to the participant by

telephone interview one to two months after discharge from occupational therapy.

Data Analysis

Rasch analysis, using the Andrich rating scale model11 (Winsteps 3.6512), was employed to

estimate the visual ability of each participant before and after rehabilitation on a continuous

interval scale from the participants’ difficulty ratings of the AI goals. The item measures for the

50 goals in the AI item bank and the response category thresholds for levels of difficulty were


anchored to values estimated from the difficulty ratings of 3200 low vision patients.13 Rasch

analysis also was performed on the FIM scale ratings of each patient’s AI goals by the

occupational therapist using the same anchored item measures for the goals. In the case of

analysis of FIM ratings, participant’s ratings obtained prior to the initial patient evaluation and

ratings obtained post-discharge were stacked and analyzed together to estimate response

category thresholds for the 7 FIM scale categories. An information-weighted mean square fit

statistic (infit) and the standard error were estimated for each response category threshold and for

each person measure.

FIM score  Description
1          Totally dependent – patient able to perform less than 25% of the task
2          Maximal assistance required – patient able to perform 25% of the task
3          Moderate assistance required – patient able to perform 50% of the task
4          Minimal assistance required – patient able to perform 75% of the task
5          Supervision or set-up required – patient performs task without direct assistance
6          Modified independence – patient requires assistive equipment, more time, or safety concern
7          Independent – no assistance required, patient able to perform 100% of the task

Table 1

Functional Independence Measure (FIM) Scale Categories

Results

Participants


Complete data were obtained from 41 of the 45 enrolled participants. All participants resided in

Louisiana. Participants consisted of 15 males (33%) and 30 females (67%) between the ages of 30 and 98 years old (median = 80, SD = 17). Measured binocular visual acuity with habitual correction ranged from 20/20 to 20/900 (median = 20/65, SD = 0.52 logMAR); 3 participants had no light perception in either eye and 2 participants had only light perception in the better eye. Among participants with measurable visual acuity, binocular log contrast sensitivity ranged from 0.07 to 1.67 (normal > 1.6; median = 1.02, SD = 0.44). For binocular central visual field measures (12.5°), 35% of participants had central scotomas (blind spots), 20% had hemi- or quad-field defects, 27% had contracted visual fields, and visual fields could not be performed on 18%.

FIM Rating Scale Evaluation

The therapist used all 7 of the FIM scale response categories to rate AI goals selected by

participants at baseline and/or at follow-up. As shown in the Table 2 columns labeled Baseline

Count and Follow-up Count, FIM scale scores of 4 or less were used most frequently at baseline

and FIM scale scores of 5 or 6 were used most frequently at follow-up. The category threshold

corresponds to the value of functional reserve (difference between the estimated person measure

and estimated item measure) at which the probability of using FIM score x is equal to the

probability of using FIM score x-1, for x = 2 to 7. The ordering of thresholds should agree with

the ordering of the FIM scale scores. The thresholds are ordered for response categories 2

through 6. The threshold for response category 7 is disordered. However, the assignment of FIM

scale score 7 occurred rarely – it represents only 1.3% of the total number of FIM scale scores

assigned.
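The definition of a category threshold given above can be checked numerically. The sketch below computes category probabilities under the standard form of the Andrich rating scale model and uses the threshold values reported in Table 2; the function name is hypothetical, the item measure is set to zero for convenience, and this is not the Winsteps implementation.

```python
import math


def rsm_category_probabilities(alpha_n, rho_j, thresholds):
    """Andrich rating scale model probabilities for each ordered category.

    thresholds[i] is the boundary between category i and category i + 1
    (0-based), so the returned list has len(thresholds) + 1 entries.
    """
    reserve = alpha_n - rho_j
    # Cumulative sums of (reserve - tau) give the log numerators; category 0 is 0.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + reserve - tau)
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


# Category thresholds for FIM scores 2 through 7, as reported in Table 2.
fim_thresholds = [-2.88, -2.03, -1.11, 1.55, 2.83, 1.63]
probs = rsm_category_probabilities(alpha_n=-2.88, rho_j=0.0, thresholds=fim_thresholds)
# At a functional reserve equal to the first threshold, FIM scores 1 and 2
# are equally probable.
equal_at_threshold = abs(probs[0] - probs[1]) < 1e-12
```

The returned list is indexed from 0, so `probs[0]` corresponds to FIM score 1 and `probs[6]` to FIM score 7.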

The Rasch model predicts the response category assigned to every combination of person and

item measures. The residual is defined to be the difference between the FIM scale score observed


for each person/item combination and the FIM scale score predicted for the corresponding

person and item measure estimates. The infit mean square is the ratio of the observed sums of squared residuals for FIM ratings, which are expected to be distributed as $\chi^2$, to the sums of squared residuals expected by the Rasch model, which corresponds to the expected value of $\chi^2$. The expected value of $\chi^2$ is equal to the degrees of freedom; thus, the infit mean square is expected to be distributed as $\chi^2$/df, which in turn has an expected value of 1.0.4 The infit mean

square is interpreted as the ratio of the observed variance in the residuals to the expected

variance. Infit mean square values greater than 1.0 indicate that the observed variance is greater

than expected. As can be seen in the last two columns of Table 2, the observed variance in

residuals for response category 6 is more than twice the expected variance both at baseline and at

follow-up. As a rule of thumb, infit mean squares greater than 1.3 are considered to be indicative

of excessive observed variance.14 With that criterion, only FIM response categories 1 through 3

at baseline and 4 and 5 at follow-up behave as expected by the Rasch model, which suggests

inconsistency in the use of the other FIM response categories across patients and/or across items.
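The infit calculation and the 1.3 rule of thumb can be sketched as follows. The function names and the single-person example values are hypothetical; the per-category infit values are those reported in Table 2. The sketch assumes the model-expected squared residual for each rating is supplied as a model variance.

```python
def infit_mean_square(observed, expected, model_variances):
    """Summed squared residuals divided by summed model-expected variance."""
    squared_residuals = sum((o - e) ** 2 for o, e in zip(observed, expected))
    return squared_residuals / sum(model_variances)


def categories_within_criterion(infit_by_category, criterion=1.3):
    """Return the categories whose infit mean square does not exceed the criterion."""
    return [cat for cat, infit in sorted(infit_by_category.items()) if infit <= criterion]


# Hypothetical ratings, model expectations, and model variances for one person.
example = infit_mean_square([3, 5, 4, 2], [3.4, 4.2, 4.1, 2.9], [0.8, 0.9, 0.7, 0.6])

# Infit values by FIM category as reported in Table 2.
baseline = {1: 1.27, 2: 1.20, 3: 1.29, 4: 1.61, 5: 1.71, 6: 2.25, 7: 1.75}
follow_up = {1: 3.47, 2: 3.17, 3: 2.06, 4: 1.01, 5: 0.91, 6: 2.14, 7: 2.83}
print(categories_within_criterion(baseline))   # [1, 2, 3]
print(categories_within_criterion(follow_up))  # [4, 5]
```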

Table 2

Functional Independence Measure (FIM) response counts, estimated category thresholds in the Andrich

model, and information-weighted mean square residuals (Infit) at baseline and follow-up by rating scale

response category.


FIM score  Baseline count  Follow-up count  Category threshold  Baseline infit  Follow-up infit
1          103             28               NA                  1.27            3.47
2          107             25               -2.88               1.20            3.17
3          124             16               -2.03               1.29            2.06
4          145             41               -1.11               1.61            1.01
5          31              123              1.55                1.71            0.91
6          4               212              2.83                2.25            2.14
7          3               10               1.63                1.75            2.83

Infit mean squares also were estimated for each participant at baseline by summing observed

squared residuals and expected squared residuals across goals. For degrees of freedom of 25 or

greater, the cube-root of the $\chi^2$ distribution is well approximated by a normal distribution.15

Therefore, the infit mean square for each participant was transformed to a standard normal

deviate and expressed as a z-score.4 Figure 1 illustrates the distribution of infit z-scores on the

abscissa and the distribution of person measures, i.e., estimated functional ability, on the ordinate

for all 41 participants. The solid vertical line indicates the expected value of the infit z-score and

the dashed vertical lines define the range of plus-and-minus two standard deviations from the

expected value. The majority of participants’ infit mean square z-scores are symmetrically

distributed about the expected value of 0 and fall in the expected range of ±2 SD. These results

are consistent with the expectations of a valid measure. However, there are seven clear outliers

where the observed variance in the residuals is more than two standard deviations greater than

the expected variance. The functional abilities of these outliers fall in the middle of the

participants’ distribution of functional ability (on the vertical axis).
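A cube-root transformation of a mean square to an approximate z-score can be sketched with the Wilson-Hilferty approximation, which treats the cube root of $\chi^2$/df as normal with mean $1 - 2/(9\,\mathrm{df})$ and variance $2/(9\,\mathrm{df})$. This is an illustrative assumption; the exact standardization applied by the analysis software may differ, and the function name is hypothetical.

```python
import math


def mean_square_to_z(mean_square, df):
    """Wilson-Hilferty z-score for a chi-square/df mean square with df degrees of freedom."""
    variance = 2.0 / (9.0 * df)
    expected_cube_root = 1.0 - variance
    return (mean_square ** (1.0 / 3.0) - expected_cube_root) / math.sqrt(variance)


# Hypothetical example: mean squares of 1.0 and 2.0 with 25 degrees of freedom.
z_at_expectation = mean_square_to_z(1.0, 25)
z_for_misfit = mean_square_to_z(2.0, 25)
```

Under this approximation a mean square equal to its expected value of 1.0 maps to a z-score slightly above zero, and a mean square of 2.0 with 25 degrees of freedom falls beyond two standard deviations.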


[Figure 1 plot. Abscissa: INFIT MNSQ (zstd). Ordinate: FIM-estimated person measure (anchored AI goals).]

Figure 1. Distribution of infit z-scores across items for each participant on the abscissa and the

distribution of person measures on the ordinate.

Comparison of Functional Ability Estimates from AI and FIM Ratings

Because all AI item measures were anchored to calibrated values, i.e., $\rho_j$ in eq. (1b), person measure estimates from patients’ difficulty ratings and person measure estimates for the same patients from the therapist’s FIM ratings are expected to be in the same units of functional ability. However, the Andrich rating scale model assumes that the variance in judgment bias is constant, thereby normalizing the true values of functional ability, i.e., $\alpha_n$ in eq. (3) and eq. (4), to the standard deviation of judgment bias, i.e., $\sigma_P$ in eq. (3) and $\sigma_T$ in eq. (4). Thus, we expect

the standard errors of the two sets of estimated person measures to agree. There is no significant

difference (paired t-test, p=0.93) between the standard error of the person measure estimated

from patient difficulty ratings (mean = 0.414) and the standard error of the person measure

estimated from therapist FIM ratings (mean = 0.415).


It is possible that FIM ratings could be different enough from difficulty ratings that using item

measures anchored with values estimated from difficulty ratings is not appropriate for the FIM

scale. If so, variance in residuals should be greater for FIM ratings than for difficulty ratings.

With the exception of the FIM outliers noted above, Figure 2 illustrates that the z-scores for

transformed infit mean squares for the two sets of estimates of person measures at baseline are

within the range of values expected by the $\chi^2$ distribution (±2 SD box).

[Figure 2 plot. Abscissa: INFIT MNSQ ZSTD (AI). Ordinate: INFIT MNSQ ZSTD (FIM).]

Figure 2. Z-scores for transformed infit mean squares for person measures estimated from therapist FIM

ratings (ordinate) vs. transformed infit mean squares for person measures estimated from patients’

difficulty ratings (abscissa).

Measures of functional ability, both at baseline and post-discharge, were estimated from patients’

difficulty ratings of those AI goals that were rated at baseline to be at least slightly important.

Measures of functional ability also were estimated for the same patients at baseline and at

discharge from the therapist’s ratings of the same set of AI goals for each patient using FIM scale

scores. For measures based on patients’ difficulty ratings and measures based on the therapist’s

FIM scale scores, the mean functional ability at baseline was subtracted from each corresponding

baseline measure and the mean functional ability at post-discharge was subtracted from each


corresponding post-discharge measure. Figure 3 is a scatter plot comparing measures based on

patients’ difficulty ratings of the important AI goals (abscissa) to the occupational therapist FIM

scale ratings of the same AI goals (ordinate) for baseline (filled circles) and post-discharge (open

circles) measures relative to their respective means. Bivariate linear regression, minimizing

orthogonal distance of data points from the regression line (i.e., principal component), was

performed on the combined baseline and post-discharge data. The slope of the regression line is

1.96 and the intercept is -0.04. The Pearson correlation is 0.52.
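Orthogonal regression of the kind described above can be sketched with the closed-form slope of the first principal component of two variables. The function name and the data below are hypothetical; this is not the authors' analysis code, and the closed form assumes a nonzero covariance between the two sets of measures.

```python
import math


def orthogonal_regression(x, y):
    """Fit y = slope * x + intercept by minimizing perpendicular distances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if s_xy == 0:
        raise ValueError("the closed form requires a nonzero covariance")
    # Slope of the first principal component of the centered data.
    slope = (s_yy - s_xx + math.sqrt((s_yy - s_xx) ** 2 + 4 * s_xy ** 2)) / (2 * s_xy)
    # The fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = orthogonal_regression([0, 1, 2, 3], [1, 3, 5, 7])
```

Unlike ordinary least squares, this fit treats both variables as measured with error, which suits a comparison of two sets of estimated person measures.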

[Figure 3 plot. Abscissa: Functional ability (patient difficulty ratings) - Mean. Ordinate: Functional ability (OT FIM scale) - Mean. Legend: PRE, POST.]

Figure 3. Comparing person measures based on patients’ difficulty ratings of important AI goals

(abscissa) to occupational therapist FIM scale ratings of the same AI goals (ordinate) for baseline and

post-discharge measures relative to their respective means.

Figure 4 illustrates scatter plots of the unadjusted functional ability measures estimated from the

occupational therapist FIM scale ratings of AI goals (ordinate) versus the unadjusted functional

ability measures estimated from the patient’s difficulty ratings of the same AI goals (abscissa) at

baseline (filled circles) and at post-discharge follow-up (open circles). The lines fit to the data by

orthogonal regression have the same slope (1.96), which was estimated from the regression line

fit to the combined data in Figure 3. The intercepts are -1.02 for the baseline measures and 1.63

for the post-discharge measures. The dashed lines illustrate the respective mean functional ability


measures. The difference between the vertical dashed lines is the intervention effect (difference

between the means) estimated from patient difficulty ratings (translates to Cohen’s effect size =

0.49) and the difference between the horizontal dashed lines is the intervention effect estimated

from the therapist’s FIM scale ratings (Cohen’s effect size = 3.28).

[Figure 4 plot. Abscissa: Functional ability (patient difficulty rating). Ordinate: Functional ability (OT FIM scale rating). Legend: PRE, POST.]

Figure 4. Unadjusted functional ability measures estimated from the occupational therapist FIM scale

ratings of AI goals (ordinate) versus unadjusted functional ability measures estimated from patient’s

difficulty ratings of same AI goals (abscissa) at baseline (filled circles) and at post-discharge follow-up

(open circles).

Discussion and Conclusions

The linear relationship between functional ability estimated from patient difficulty ratings and

functional ability estimated from the therapist’s FIM scale ratings confirms the expectations of

the model expressed by eqs. (3) and (4), which lead to the specific prediction of a linear function

expressed by eq. (5). If we interpret the results in Figure 4 in terms of eq. (5), then we must

conclude from the slope of the regression lines that $\sigma_P = 1.96\,\sigma_T$, both at baseline and at post-discharge follow-up. This result means that the variance in bias for the average of the patients is nearly 4 times that of the within person variance in bias for our single therapist. If we can assume that the average variance in bias within patients is approximately the same as the within person


variance in bias of our sole therapist, then in eq. (2), $\frac{1}{N}\sum_{n=1}^{N}\sigma_n^2 \cong \sigma_T^2$, and substituting $1.96^2\,\sigma_T^2$ for $\sigma_P^2$ in eq. (2), we obtain an estimate for the standard deviation of bias between patients of $\sigma_{bn} = 1.69\,\sigma_T$.
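The arithmetic behind this estimate follows from eq. (2) and the regression slope of 1.96, under the stated assumption that the average within-patient variance equals the within-therapist variance:

```latex
\begin{align*}
\sigma_P^2 &= \sigma_{bn}^2 + \frac{1}{N}\sum_{n=1}^{N}\sigma_n^2
  \approx \sigma_{bn}^2 + \sigma_T^2 \\
\sigma_{bn}^2 &\approx \sigma_P^2 - \sigma_T^2
  = \left(1.96^2 - 1\right)\sigma_T^2 \approx 2.84\,\sigma_T^2 \\
\sigma_{bn} &\approx \sqrt{2.84}\,\sigma_T \approx 1.69\,\sigma_T
\end{align*}
```

The middle line is the source of the statement that the between-patient variance is almost 3 times the within-therapist variance.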

From eq. (5), the intercepts of the regression lines in Figure 4 correspond to the difference between the fixed bias of the average patient and the therapist’s fixed bias, in within-therapist standard deviation units. The intercept for baseline measures indicates that the fixed bias for the average patient, $\beta_P$, is 1.02 logits greater than the therapist’s fixed bias, $\beta_T$. However, post-discharge the therapist’s fixed bias is 1.63 logits greater than the fixed bias of the average

patient. From the patients’ perspective, the therapist is underestimating patients’ functional

abilities at baseline and overestimating patients’ functional abilities at post-discharge follow-up.

From the therapist’s perspective, the patients are overestimating their functional abilities at

baseline and underestimating their functional abilities at post-discharge follow-up.

We cannot draw any conclusions from this study about why the difference between therapist and

average patient bias is negative at baseline and positive at post-discharge follow-up. One could

speculate that patients tend to be stoic and/or stubborn – underestimating the magnitude of their

problems at baseline and underestimating improvements in their function at follow-up.

Anecdotally, during evaluation therapists often see evidence of problems that patients deny or do

not recognize (e.g., seeing pills on the floor, stained clothing, signs of poor hygiene). Therapists

also report that patients may be able to perform a task after therapy, but refuse to accept the

required adaptation as an improvement over dependency. From another viewpoint, a cynic might

claim that the therapist is exaggerating the patient’s problems at baseline and exaggerating the

success of therapy at follow-up, making the intervention look more effective than it actually is.


However, in the final analysis we can estimate only differences between people in judgment

biases; we cannot know their values relative to a ground truth.

The purpose of this study has been to present and test a model of judgment bias and show how

judgment bias can influence measures estimated by psychometric models from observer

magnitude estimates. The observation of a linear relationship between continuous interval-scale

measures estimated from ordinal patient ratings and equivalent measures estimated from ordinal

therapist ratings confirms the linear prediction of the model. Grounded in a simple axiomatic

scaling theory, the model provides plausible interpretations of the slopes and intercepts of the

linear relationships in terms of fixed and random bias parameters. This model can be used as a

tool to study the effects of independent variables on judgment bias or compare differences

between judges.


References

1. Owsley C, Sloane M, McGwin G Jr, Ball K. Timed instrumental activities of daily living

tasks: relationship to cognitive function and everyday performance assessments in older

adults. Gerontology 2002;48:254-265.

2. McHorney CA, Haley SM, Ware JE Jr. Evaluation of the MOS SF-36 Physical Functioning

Scale (PF-10): II, Comparison of relative precision using Likert and Rasch scoring methods.

J Clin Epidemiol. 1997;50:451-461.

3. Granger CV, Deutsch A, Linn RT. Rasch analysis of the Functional Independence Measure

(FIM) Mastery Test. Arch Phys Med Rehabil. 1998;79:52-57.

4. Massof RW. Understanding Rasch and Item Response Theory models: Applications to the

estimation and validation of interval latent trait measures from responses to rating scale

questionnaires. Ophthal Epidemiol. 2011;18:1-19.

5. Fisher AG. The assessment of IADL motor skills: An application of many-faceted Rasch

analysis. Am J Occup Ther. 1993;47:319-329.

6. U.S. Department of Health & Human Services, Centers for Medicare and Medicaid Services.

Program memorandum intermediaries/carriers: Transmittal AB-02-078, May 29, 2002.

Baltimore, MD: Government Printing Office; 2002.

7. Massof RW, Hsu CT, Baker FH, Barnett GD, Park WL, Deremeik JT, Rainey C, Epstein C.

Visual disability variables. I: The importance and difficulty of activity goals for a sample of

low vision patients. Arch Phys Med Rehabil. 2005;86:946-953.

8. Massof RW, Hsu CT, Baker FH, Barnett GD, Park WL, Deremeik JT, Rainey C, Epstein C.

Visual disability variables. II: The difficulty of tasks for a sample of low vision patients.

Arch Phys Med Rehabil. 2005;86:954-967.


9. Massof RW, Ahmadian L, Grover LL, Deremeik JT, Goldstein JE, Rainey C, Epstein C,

Barnett GD. The Activity Inventory: an adaptive visual function questionnaire. Optom Vis

Sci. 2007;84:763-774.

10. Centers for Medicare/Medicaid Services. The Inpatient Rehabilitation Facility-Patient

Assessment Instrument Training Manual. 2004. Available from

https://www.cms.gov/medicare/medicare-fee-for-service-payment/inpatientrehabfacpps/irfpai.html

11. Andrich D. A rating formulation for ordered response categories. Psychometrika

1978;43:561-573.

12. Linacre JM, Wright BD. A user's guide to Winsteps: Rasch model computer program.

Chicago, IL: MESA Press; 2001.

13. Goldstein JE, Chun MW, Fletcher DC, Deremeik JT, Massof RW. Visual ability of patients

seeking outpatient low vision services in the United States. JAMA Ophthalmol

2014;132:1169-1177.

14. Bond T, Fox CM. Applying the Rasch model: Fundamental measurement in the human

sciences. 2nd ed. New York, NY: Routledge; 2007.

15. Wilson EB, Hilferty MM. The distribution of chi-square. Proc Natl Acad Sci USA

1931;17:684-688.
