Comparison of therapist to patient judgment bias in low vision
Therapist Judgment Bias and Reliability Relative to that of Patients in
the Estimation of Functional Ability from Ordinal Ratings
Robert W. Massof,1 Theresa M. Smith,2 Lisa S. Foret,3 Guy Davis,3 and Kyoko Fujiwara1
1Lions Vision Research and Rehabilitation Center, Wilmer Eye Institute, Johns Hopkins
University School of Medicine
2Department of Occupational Therapy and Rehabilitation Sciences, University of Texas
Medical Branch Galveston
3Evangeline Home Health, Lake Charles, LA
Supported by grant EY022322 from the National Eye Institute, National Institutes of Health,
Bethesda, MD.
Abstract
Objective: To present and evaluate a measurement model for estimating the judgment bias of
therapists and patients when rating functional ability. Design: Observational study of the
agreement between therapist ratings and patient self-ratings of functional ability. Setting:
Measures made by telephone interview and in the patient’s home. Participants: Forty-five home
health care patients who have a secondary diagnosis of low vision. Main Outcome Measures:
Functional ability estimated from Rasch analysis of patient difficulty ratings of calibrated items
(activity goals) in the Activity Inventory (AI) and therapist ratings using a FIM scale of the same
activity goals, both at initial evaluation and again after discharge. Results: A linear relationship
was observed between functional ability measures estimated from therapist ratings and measures
estimated from patient self-ratings with the same slope, but different intercepts, for measures
obtained at baseline and at post-rehabilitation follow-up. Conclusions: The observed linear
relationship between measures estimated from therapist ratings and measures estimated from
patient ratings confirms the model prediction. The intercept corresponds to the difference
between the therapist’s judgment bias and the average judgment bias of all patients. Relative to
patient judgments, the therapist’s estimate of functional ability at baseline was less than the
patients’ estimates; it was greater than the patients’ estimates at follow-up. The slope of the line
corresponds to the square root of the ratio of the between-patient plus within-patient variance in
judgment bias to the within-therapist variance in judgment bias. The results indicate that
between-patient variance is almost 3 times the within-therapist variance.
Introduction
Rehabilitation medicine employs three different approaches to estimate the functional ability of
patients: 1) measures of task performance time and/or accuracy;1 2) patient ratings of their own
ability and/or frequency of performing activities;2 and 3) ratings by a therapist or proxy of a
patient’s ability and/or frequency of performing activities.3 Functional ability is a trait of the
patient. Task performance time and accuracy, patient ratings, and therapist ratings only are
indicators of functional ability. Measurements of functional ability per se must be inferred from
the observed indicators. Because functional ability is a property of the patient, valid and unbiased
measures of functional ability estimated from the three different approaches should agree.
Measurement validity refers to the accuracy of the assumption that the estimated measure is
linear with the magnitude of the variable of interest. Measurement bias refers to the agreement
(or disagreement) between different measures of the same variable when the variable magnitude
has not changed between measures. In the case of functional ability, measurement validity and
bias can be influenced by the sample of activities selected for observation and, in the case of
ratings, by properties of the judge.
This paper is concerned with comparing functional ability measures estimated from ratings by
patients to functional ability measures estimated from ratings by a therapist. More specifically,
this paper focuses on the estimation of relative biases and measurement uncertainties of judges
when comparing functional ability measures estimated from a therapist’s judgments to functional
ability measures estimated from patient judgments of themselves. We first present a model of
patient self-ratings and a parallel model of therapist ratings of the patient, explicitly identifying
respective biases and sources of variance in the observations, and show how the two sets of
ratings are related. We then test the model with a substantive example using low vision
rehabilitation of visually impaired home health care patients.
Model of Patient Self-Ratings and Therapist Ratings
Using ordered rating scale categories (e.g., level of “difficulty” or level of “independence”), both
the patient and the therapist are asked to judge the patient’s ability to perform specific activities,
referred to as “items”. The true ability of patient $n$, which we are attempting to estimate from the
patient’s and therapist’s ratings, is $\alpha_n$. The ability required to perform each of the items, $\rho_j$ for
item $j$, is a property of the item that is independent of the judge (whether patient or therapist).
The model assumes that both the patient and therapist are judging the magnitude of the patient’s
functional reserve for the activity described by the item, which is the difference between the
ability of patient $n$ and the ability required by item $j$, i.e., $\alpha_n - \rho_j$. Both the patient and therapist
are instructed in the use of the ratings, but they develop their own criteria for each rating
category that they will assign to a patient/item pair. These criteria, or “thresholds”, can be
thought of as boundaries between neighboring categories on a continuous functional reserve
scale. The thresholds are denoted as $\tau_{kx}$ for the boundary set by judge $k$ between rating category
$x-1$ and rating category $x$ ($k = n$ in the case of patient self-judgment).
Although the value of $\rho_j$ is independent of the judge, judges’ estimates of $\rho_j$ are likely to be
biased. If $\hat{\rho}_{kj}$ is the estimate of $\rho_j$ by judge $k$, then $\hat{\rho}_{kj} = \rho_j + \epsilon_{kj}$, where $\epsilon_{kj}$ is the bias of judge $k$ in
estimating the ability required by item $j$. Similarly, the average threshold for rating category $x$
across a population of judges is $\tau_x$; therefore, $\tau_{kx} = \tau_x + \eta_{kx}$, where $\eta_{kx}$ is the bias of judge $k$, relative
to the average judge, in the choice of threshold for rating category $x$. In the case of therapists or
proxies, the population of judges would refer to all therapists or to all proxies, respectively. If we
define $\epsilon_k$ to be the average bias of judge $k$ across items and $\eta_k$ to be the average bias of judge $k$
across rating category thresholds, then we can re-express the bias terms as the sum of a fixed
variable (the average) and a random variable ($\delta$), i.e., $\epsilon_{kj} = \epsilon_k + \delta\epsilon_{kj}$ and $\eta_{kx} = \eta_k + \delta\eta_{kx}$ (if there is only a
single judge contributing to the estimate of $\tau_x$, then $\eta_k = 0$). In each case, the random variable has
an expected value of zero and incorporates variance associated with real differences in bias
between items and/or categories, estimation uncertainty, and parameter instability.
The judge assigns rating category $x$ to item $j$ if the estimated functional reserve exceeds the
judge’s criterion for category $x$ (and all lower categories) and is less than the criterion for
category $x+1$ (and all higher categories), i.e.,
$$\tau_{k1}, \cdots, \tau_{kx} < \alpha_n - \hat{\rho}_{kj} < \tau_{k,x+1}, \cdots, \tau_{km}. \tag{1a}$$
Substituting the definitions presented in the preceding paragraph and, for judge $k$, combining the
random variables into a single random term and combining the fixed bias variables into a single
fixed term, expression (1a) can be expanded to make the fixed and random variables explicit, i.e.,
$$\tau_1 + \delta_{kj1}, \cdots, \tau_x + \delta_{kjx} < \alpha_n - \rho_j - \beta_k < \tau_{x+1} + \delta_{kj,x+1}, \cdots, \tau_m + \delta_{kjm} \tag{1b}$$
where $\delta_{kjx} = \delta\eta_{kx} + \delta\epsilon_{kj}$ and $\beta_k = \epsilon_k + \eta_k$. The judgment bias of judge $k$ is summarized with the bias
term $\beta_k$ and the reliability of judge $k$ is summarized by the variance of $\delta_{kjx}$, which we designate
as $\sigma_{kjx}^2$.
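The decision rule in expression (1b) can be illustrated with a minimal sketch. The function name, the threshold values, and the bias values below are hypothetical and chosen only for illustration; the thresholds passed in are assumed to already include the judge's random terms, and categories are indexed from 0 to m.

```python
from bisect import bisect_right

def assign_rating(alpha_n, rho_j, beta_k, thresholds):
    """Return the rating category (0..m) for one patient/item pair.

    thresholds: the judge's ordered category boundaries (tau_x + delta_kjx
    for x = 1..m) on the functional reserve scale.
    """
    # Judged functional reserve: true reserve shifted by the judge's fixed bias.
    reserve = alpha_n - rho_j - beta_k
    # The category index is the number of thresholds at or below the judged reserve.
    return bisect_right(thresholds, reserve)

thresholds = [-1.5, -0.5, 0.5, 1.5]  # hypothetical boundaries, in logits
unbiased = assign_rating(alpha_n=1.0, rho_j=0.2, beta_k=0.0, thresholds=thresholds)
biased = assign_rating(alpha_n=1.0, rho_j=0.2, beta_k=1.0, thresholds=thresholds)
```

With these illustrative numbers, a fixed bias of one logit moves the same patient/item pair down by one rating category, which is how a fixed bias propagates into the estimated person measure.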
Rasch analysis is used routinely to estimate the average expected rating category thresholds ($\tau_x$
for rating category $x$), the true person measures ($\alpha_n$ for person $n$), and the true item measures ($\rho_j$
for item $j$) from distributions of observed ratings across persons and items.4 Judgment bias, $\beta_k$,
affects the accuracy of the estimates and the variance of the random terms, $\sigma_{kjx}^2$, affects
estimation precision (i.e., reliability). Rasch models assume homogeneity of variance, i.e., $\sigma_{kjx}^2$ is
the same for all persons, items, and rating category thresholds (a requirement of unidimensional
measures). Homogeneity of variance means that $\sigma_{kjx}^2 = \sigma_k^2$. Rasch models also assume that the
random terms are statistically independent of one another.4 Various statistical tests are used to
evaluate how well the set of observed ratings conforms to these assumptions of the Rasch model.4
In the case of patient self-judgment, when there are $N$ patients there also are $N$ judges. However,
Rasch models typically (but not necessarily) assume that there is just a single judge, which in
effect is the average of the judges. In this case, when $\sigma_k^2$ is referring to the average of $N$ judges, it
must include variance between judges, $\sigma_{bn}^2$, as well as variance within judges, $\sigma_n^2$. We therefore
define the variance of the average patient judge to be
$$\sigma_P^2 = \sigma_{bn}^2 + \sum_{n=1}^{N} \sigma_n^2 / N, \tag{2}$$
the sum of between patient variance and average within patient variance. When a single therapist
is the judge, the variance of the therapist can be attributed entirely to the variance within the
judge, $\sigma_T^2 = \sigma_k^2$. To complete the definition of terms for our model, the fixed judgment bias of each
patient is $\beta_n$ and the fixed judgment bias of the therapist is $\beta_T$.
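The variance decomposition in eq. (2) can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' procedure: the number of judges, the number of judgments per judge, and the two standard deviations are hypothetical, and the between-judge variance is taken to be the variance of each simulated judge's mean.

```python
import random
from statistics import fmean, pvariance

random.seed(0)
n_patients, n_ratings = 50, 40  # hypothetical sample sizes

# Each simulated patient-judge has a personal offset (between-judge spread)
# and adds independent noise to every judgment (within-judge spread).
judgment_noise = []
for _ in range(n_patients):
    offset = random.gauss(0.0, 1.5)
    judgment_noise.append([offset + random.gauss(0.0, 1.0) for _ in range(n_ratings)])

between_var = pvariance([fmean(r) for r in judgment_noise])        # sigma_bn^2
mean_within_var = fmean([pvariance(r) for r in judgment_noise])    # sum(sigma_n^2) / N
sigma_P_sq = between_var + mean_within_var                         # eq. (2)

# Variance obtained when all judgments are pooled as if from one "average" judge.
pooled_var = pvariance([v for r in judgment_noise for v in r])
```

When every judge contributes the same number of judgments, the pooled variance equals the sum computed from eq. (2), which is why treating N patients as a single average judge folds the between-patient variance into the variance of that judge.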
In practice, Rasch models normalize the estimated person and item measures to the square root
of the judge’s variance and ignore the judge’s bias (unless made explicit in a facet model5). Thus,
person measures estimated from patient self-judgments are expressed as
$$\hat{\alpha}_{nP} = (\alpha_n + \beta_P)/\sigma_P \tag{3}$$
for person $n$, where $\beta_P = \sum_{n=1}^{N} \beta_n / N$ is the average bias across patients. The person measures estimated
from a therapist’s judgments are expressed as
$$\hat{\alpha}_{nT} = (\alpha_n + \beta_T)/\sigma_T \tag{4}$$
for the same person $n$. Because both eqs. (3) and (4) are linear functions of the true person
measure, $\alpha_n$, we expect the relationship between person measures estimated from a therapist’s
ratings and corresponding person measures estimated from patients rating themselves to be
$$\hat{\alpha}_{nT} = \frac{\sigma_P}{\sigma_T}\hat{\alpha}_{nP} + \frac{\beta_T - \beta_P}{\sigma_T}, \tag{5}$$
a linear relationship for which the slope is the ratio of the standard deviation for the average
patient to the standard deviation for the therapist and the intercept is the weighted difference
between therapist and average patient judgment biases.
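The step from eqs. (3) and (4) to eq. (5) can be written out explicitly. The sketch below only rearranges those two equations and introduces no new symbols.

```latex
\begin{align*}
\alpha_n &= \sigma_P \hat{\alpha}_{nP} - \beta_P
  && \text{(eq. 3 solved for } \alpha_n\text{)} \\
\hat{\alpha}_{nT} &= \frac{\alpha_n + \beta_T}{\sigma_T}
  = \frac{\sigma_P \hat{\alpha}_{nP} - \beta_P + \beta_T}{\sigma_T}
  && \text{(substituted into eq. 4)} \\
&= \frac{\sigma_P}{\sigma_T}\hat{\alpha}_{nP} + \frac{\beta_T - \beta_P}{\sigma_T}
  && \text{(eq. 5)}
\end{align*}
```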
Methods
Research Design
The present study is part of a larger observational study still in progress. Data reported here were
collected pre and post usual occupational therapy intervention provided in the participant’s home
by one occupational therapist who has specialty training in low vision rehabilitation and 12 years
of experience providing rehabilitation services to home health care patients with low vision.
Participants
Eligibility criteria for the study were: 1) patients were new to the occupational therapist; 2)
patients were adults admitted to home health care; 3) patients met the visual impairment
diagnostic criteria for Medicare or other third party coverage of low vision rehabilitation
services;6 and 4) patients understood English and had good enough hearing to be able to
participate in telephone interviews. Forty-five low vision patients participated in this study.
Procedures
The study conformed to the tenets of the Declaration of Helsinki and was approved by the Johns
Hopkins Institutional Review Board. After the patient consented to participate, one of the
investigators administered the Activity Inventory (AI),7-9 an adaptive rating scale instrument, by
telephone interview. Participants rated the importance of the 50 activity goals in the AI, and rated
the difficulty of those goals that were rated to be at least “slightly important”. In the instructions
to the participant, both importance and difficulty ratings were qualified as referring to the ability to
perform the activity “without depending on another person”. Goals included in this study were those that
the participant also rated to be at least “slightly difficult”. In addition, participants rated the
difficulties of tasks in the AI that are nested under goals that were rated to be at least slightly
important and slightly difficult.
At the time of the initial patient evaluation, the occupational therapist was provided with a list of
the AI goals and subsidiary tasks that were rated by the participant to be at least slightly difficult;
however, the actual ratings assigned by the participant to each goal and task were not revealed.
After completing the initial patient evaluation, the occupational therapist assigned a FIM scale
score3,10 to each of the participant-identified AI goals. Table 1 lists the FIM rating scale
categories. The occupational therapist then developed the patient’s plan of care and provided
rehabilitation services following usual procedures. At discharge the occupational therapist again
used the FIM scale to rate the participant’s functional independence level for the same AI goals
that were rated at the initial evaluation. The AI was re-administered to the participant by
telephone interview one to two months after discharge from occupational therapy.
Data Analysis
Rasch analysis, using the Andrich rating scale model11 (Winsteps 3.6512), was employed to
estimate the visual ability of each participant before and after rehabilitation on a continuous
interval scale from the participants’ difficulty ratings of the AI goals. The item measures for the
50 goals in the AI item bank and the response category thresholds for levels of difficulty were
anchored to values estimated from the difficulty ratings of 3200 low vision patients.13 Rasch
analysis also was performed on the FIM scale ratings of each patient’s AI goals by the
occupational therapist using the same anchored item measures for the goals. In the case of
analysis of FIM ratings, participant’s ratings obtained prior to the initial patient evaluation and
ratings obtained post-discharge were stacked and analyzed together to estimate response
category thresholds for the 7 FIM scale categories. An information-weighted mean square fit
statistic (infit) and the standard error were estimated for each response category threshold and for
each person measure.
FIM score   Description
1           Totally dependent – patient able to perform less than 25% of the task
2           Maximal assistance required – patient able to perform 25% of the task
3           Moderate assistance required – patient able to perform 50% of the task
4           Minimal assistance required – patient able to perform 75% of the task
5           Supervision or set-up required – patient performs task without direct assistance
6           Modified independence – patient requires assistive equipment, more time, or safety concern
7           Independent – no assistance required, patient able to perform 100% of the task
Table 1
Functional Independence Measure (FIM) Scale Categories
Results
Participants
Complete data were obtained from 41 of the 45 enrolled participants. All participants resided in
Louisiana. Participants consisted of 15 males (33%) and 30 females (67%) between 30 and 98 years
of age (median = 80, SD = 17). Measured binocular visual acuity with habitual
correction ranged from 20/20 to 20/900 (median = 20/65, SD = 0.52 logMAR); 3 participants
had no light perception in either eye and 2 participants had only light perception in the better eye.
Among participants with measurable visual acuity, binocular log contrast sensitivity ranged from
0.07 to 1.67 (normal > 1.6; median = 1.02, SD = 0.44). For binocular central visual field measures
(12.5°), 35% of participants had central scotomas (blind spots), 20% had hemi- or quad-field
defects, 27% had contracted visual fields, and visual fields could not be performed on 18%.
FIM Rating Scale Evaluation
The therapist used all 7 of the FIM scale response categories to rate AI goals selected by
participants at baseline and/or at follow-up. As shown in the Table 2 columns labeled Baseline
Count and Follow-up Count, FIM scale scores of 4 or less were used most frequently at baseline
and FIM scale scores of 5 or 6 were used most frequently at follow-up. The category threshold
corresponds to the value of functional reserve (difference between the estimated person measure
and estimated item measure) at which the probability of using FIM score x is equal to the
probability of using FIM score x-1, for x = 2 to 7. The ordering of thresholds should agree with
the ordering of the FIM scale scores. The thresholds are ordered for response categories 2
through 6. The threshold for response category 7 is disordered. However, the assignment of FIM
scale score 7 occurred rarely – it represents only 1.3% of the total number of FIM scale scores
assigned.
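The definition of a category threshold given above can be illustrated with the category probabilities of the Andrich rating scale model. This is a minimal sketch, not the Winsteps implementation: the function name is hypothetical, the threshold values are those reported in Table 2, and list index 0 corresponds to FIM score 1.

```python
import math

def category_probabilities(reserve, thresholds):
    """Andrich rating scale model: probability of each category 0..m."""
    # Log-numerators are cumulative sums of (reserve - tau_x); the lowest category is 0.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (reserve - tau))
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

# Thresholds reported in Table 2 for FIM scores 2 through 7 (logits).
fim_thresholds = [-2.88, -2.03, -1.11, 1.55, 2.83, 1.63]

# Functional reserve set equal to the threshold for FIM score 4.
probs = category_probabilities(-1.11, fim_thresholds)
```

At a functional reserve equal to the threshold for FIM score 4, `probs[3]` (FIM score 4) and `probs[2]` (FIM score 3) are equal, which is the equal-probability property that defines the threshold.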
The Rasch model predicts the response category assigned to every combination of person and
item measures. The residual is defined to be the difference between the FIM scale score observed
for each person/item combination and the FIM scale score predicted for the corresponding
person and item measure estimates. The infit mean square is the ratio of the observed sums of
squared residuals for FIM ratings, which are expected to be distributed as $\chi^2$, to the sums of
squared residuals expected by the Rasch model, which corresponds to the expected value of $\chi^2$.
The expected value of $\chi^2$ is equal to the degrees of freedom; thus, the infit mean square is
expected to be distributed as $\chi^2/df$, which in turn has an expected value of 1.0.4 The infit mean
square is interpreted as the ratio of the observed variance in the residuals to the expected
variance. Infit mean square values greater than 1.0 indicate that the observed variance is greater
than expected. As can be seen in the last two columns of Table 2, the observed variance in
residuals for response category 6 is more than twice the expected variance both at baseline and at
follow-up. As a rule of thumb, infit mean squares greater than 1.3 are considered to be indicative
of excessive observed variance.14 With that criterion, only FIM response categories 1 through 3
at baseline and 4 and 5 at follow-up behave as expected by the Rasch model, which suggests
inconsistency in the use of the other FIM response categories across patients and/or across items.
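The ratio described above can be sketched in a few lines. This is an illustration assuming the conventional definition of the information-weighted mean square (sum of squared residuals divided by the sum of model variances); the function name and the three-category probabilities are hypothetical and are not data from this study.

```python
def infit_mean_square(observed, category_probs):
    """Information-weighted mean square for a set of ratings.

    observed: observed category index for each person/item pair.
    category_probs: for each pair, the model probabilities of categories 0..m.
    """
    squared_residuals = 0.0
    model_variance = 0.0
    for x, probs in zip(observed, category_probs):
        # Model-expected score and model variance for this person/item pair.
        expected = sum(k * p for k, p in enumerate(probs))
        squared_residuals += (x - expected) ** 2
        model_variance += sum((k - expected) ** 2 * p for k, p in enumerate(probs))
    return squared_residuals / model_variance

probs = [[0.25, 0.5, 0.25]] * 4  # hypothetical model probabilities
as_expected = infit_mean_square([0, 1, 1, 2], probs)
noisier = infit_mean_square([0, 0, 2, 2], probs)
```

In this toy case the first set of ratings has exactly the residual variance the model expects, whereas the second set, which avoids the most probable category, has twice the expected variance.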
Table 2
Functional Independence Measure (FIM) response counts, estimated category thresholds in the Andrich
model, and information-weighted mean square residuals (Infit) at baseline and follow-up by rating scale
response category.
FIM score   Baseline count   Follow-up count   Category threshold   Baseline infit   Follow-up infit
1           103              28                NA                   1.27             3.47
2           107              25                -2.88                1.2              3.17
3           124              16                -2.03                1.29             2.06
4           145              41                -1.11                1.61             1.01
5           31               123               1.55                 1.71             0.91
6           4                212               2.83                 2.25             2.14
7           3                10                1.63                 1.75             2.83
Infit mean squares also were estimated for each participant at baseline by summing observed
squared residuals and expected squared residuals across goals. For degrees of freedom of 25 or
greater, the cube-root of the $\chi^2$ distribution is well approximated by a normal distribution.15
Therefore, the infit mean square for each participant was transformed to a standard normal
deviate and expressed as a z-score.4 Figure 1 illustrates the distribution of infit z-scores on the
abscissa and the distribution of person measures, i.e., estimated functional ability, on the ordinate
for all 41 participants. The solid vertical line indicates the expected value of the infit z-score and
the dashed vertical lines define the range of plus-and-minus two standard deviations from the
expected value. The majority of participants’ infit mean square z-scores are symmetrically
distributed about the expected value of 0 and fall in the expected range of ±2 SD. These results
are consistent with the expectations of a valid measure. However, there are seven clear outliers
where the observed variance in the residuals is more than two standard deviations greater than
the expected variance. The functional abilities of these outliers fall in the middle of the
participants’ distribution of functional ability (on the vertical axis).
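The cube-root transformation referred to above can be sketched as follows. This assumes the standard Wilson-Hilferty form for a chi-square statistic divided by its degrees of freedom; the function name and the degrees of freedom used in the example are hypothetical, since the degrees of freedom for each participant are not stated here.

```python
import math

def mean_square_to_z(mean_square, df):
    """Wilson-Hilferty cube-root transformation of a chi-square/df statistic."""
    spread = 2.0 / (9.0 * df)  # approximate variance of the cube root
    return (mean_square ** (1.0 / 3.0) - (1.0 - spread)) / math.sqrt(spread)

z_at_expectation = mean_square_to_z(1.0, 25)  # mean square at its expected value
z_double = mean_square_to_z(2.0, 25)          # observed variance twice the expected
```

With 25 degrees of freedom, a mean square at its expected value of 1.0 maps to a z-score close to zero, while a mean square of 2.0 maps to a z-score above 2, i.e., outside the ±2 SD range used to flag outliers.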
[Figure 1 image. Abscissa: INFIT MNSQ (zstd). Ordinate: FIM-estimated person measure (anchored AI goals).]
Figure 1. Distribution of infit z-scores across items for each participant on the abscissa and the
distribution of person measures on the ordinate.
Comparison of Functional Ability Estimates from AI and FIM Ratings
Because all AI item measures were anchored to calibrated values, i.e., $\rho_j$ in eq. (1b), person
measure estimates from patients’ difficulty ratings and person measure estimates for the same
patients from the therapist’s FIM ratings are expected to be in the same units of functional
ability. However, the Andrich rating scale model assumes that the variance in judgment bias is
constant, thereby normalizing the true values of functional ability, i.e., $\alpha_n$ in eq. (3) and eq. (4),
to the standard deviation of judgment bias, i.e., $\sigma_P$ in eq. (3) and $\sigma_T$ in eq. (4). Thus, we expect
the standard errors of the two sets of estimated person measures to agree. There is no significant
difference (paired t-test, p = 0.93) between the standard error of the person measure estimated
from patient difficulty ratings (mean = 0.414) and the standard error of the person measure
estimated from therapist FIM ratings (mean = 0.415).
It is possible that FIM ratings could be different enough from difficulty ratings that using item
measures anchored with values estimated from difficulty ratings is not appropriate for the FIM
scale. If so, variance in residuals should be greater for FIM ratings than for difficulty ratings.
With the exception of the FIM outliers noted above, Figure 2 illustrates that the z-scores for
transformed infit mean squares for the two sets of estimates of person measures at baseline are
within the range of values expected by the $\chi^2$ distribution (±2 SD box).
[Figure 2 image. Abscissa: INFIT MNSQ ZSTD (AI). Ordinate: INFIT MNSQ ZSTD (FIM).]
Figure 2. Z-scores for transformed infit mean squares for person measures estimated from therapist FIM
ratings (ordinate) vs. transformed infit mean squares for person measures estimated from patients’
difficulty ratings (abscissa).
Measures of functional ability, both at baseline and post-discharge, were estimated from patients’
difficulty ratings of those AI goals that were rated at baseline to be at least slightly important.
Measures of functional ability also were estimated for the same patients at baseline and at
discharge from the therapist’s ratings of the same set of AI goals for each patient using FIM scale
scores. For measures based on patients’ difficulty ratings and measures based on the therapist’s
FIM scale scores, the mean functional ability at baseline was subtracted from each corresponding
baseline measure and the mean functional ability at post-discharge was subtracted from each
corresponding post-discharge measure. Figure 3 is a scatter plot comparing measures based on
patients’ difficulty ratings of the important AI goals (abscissa) to the occupational therapist FIM
scale ratings of the same AI goals (ordinate) for baseline (filled circles) and post-discharge (open
circles) measures relative to their respective means. Bivariate linear regression, minimizing
orthogonal distance of data points from the regression line (i.e., principal component), was
performed on the combined baseline and post-discharge data. The slope of the regression line is
1.96 and the intercept is -0.04. The Pearson correlation is 0.52.
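The orthogonal (principal component) regression described above can be sketched with a closed-form expression for the slope of the first principal axis. This is an illustrative implementation, not the software used in the study; the function name and the data points are hypothetical.

```python
import math
from statistics import fmean

def orthogonal_regression(x, y):
    """Fit y = slope * x + intercept by minimizing perpendicular distances."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # Closed-form slope of the first principal axis (requires sxy != 0).
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    # The fitted line passes through the centroid of the data.
    return slope, my - slope * mx

x = [-1.0, 0.0, 1.0, 2.0]        # hypothetical patient-based measures
y = [2.0 * a - 0.5 for a in x]   # points lying exactly on y = 2x - 0.5
slope, intercept = orthogonal_regression(x, y)
```

Unlike ordinary least squares, this fit treats both variables as measured with error, which suits a comparison of two sets of estimated person measures. With a correlation well below 1, as observed here, the orthogonal slope differs from the least-squares slope.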
[Figure 3 image. Abscissa: Functional ability (patient difficulty ratings) - Mean. Ordinate: Functional ability (OT FIM scale) - Mean. Legend: PRE, POST.]
Figure 3. Comparing person measures based on patients’ difficulty ratings of important AI goals
(abscissa) to occupational therapist FIM scale ratings of the same AI goals (ordinate) for baseline and
post-discharge measures relative to their respective means.
Figure 4 illustrates scatter plots of the unadjusted functional ability measures estimated from the
occupational therapist FIM scale ratings of AI goals (ordinate) versus the unadjusted functional
ability measures estimated from the patient’s difficulty ratings of the same AI goals (abscissa) at
baseline (filled circles) and at post-discharge follow-up (open circles). The lines fit to the data by
orthogonal regression have the same slope (1.96), which was estimated from the regression line
fit to the combined data in Figure 3. The intercepts are -1.02 for the baseline measures and 1.63
for the post-discharge measures. The dashed lines illustrate the respective mean functional ability
measures. The difference between the vertical dashed lines is the intervention effect (difference
between the means) estimated from patient difficulty ratings (translates to Cohen’s effect size =
0.49) and the difference between the horizontal dashed lines is the intervention effect estimated
from the therapist’s FIM scale ratings (Cohen’s effect size = 3.28).
[Figure 4 image. Abscissa: Functional ability (patient difficulty rating). Ordinate: Functional ability (OT FIM scale rating). Legend: PRE, POST.]
Figure 4. Unadjusted functional ability measures estimated from the occupational therapist FIM scale
ratings of AI goals (ordinate) versus unadjusted functional ability measures estimated from patient’s
difficulty ratings of same AI goals (abscissa) at baseline (filled circles) and at post-discharge follow-up
(open circles).
Discussion and Conclusions
The linear relationship between functional ability estimated from patient difficulty ratings and
functional ability estimated from the therapist’s FIM scale ratings confirms the expectations of
the model expressed by eqs. (3) and (4), which lead to the specific prediction of a linear function
expressed by eq. (5). If we interpret the results in Figure 4 in terms of eq. (5), then we must
conclude from the slope of the regression lines that $\sigma_P = 1.96\,\sigma_T$, both at baseline and at
post-discharge follow-up. This result means that the variance in bias for the average of the patients is
nearly 4 times that of the within person variance in bias for our single therapist. If we can assume
that the average variance in bias within patients is approximately the same as the within person
variance in bias of our sole therapist, then in eq. (2), $\sum_{n=1}^{N} \sigma_n^2 / N \cong \sigma_T^2$, and substituting $1.96^2 \sigma_T^2$ for
$\sigma_P^2$ in eq. (2), we obtain an estimate for the standard deviation of bias between patients of
$\sigma_{bn} = 1.69\,\sigma_T$.
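The arithmetic behind this estimate can be written out step by step. The sketch below uses only eq. (2), the regression slope of 1.96, and the stated assumption that the average within-patient variance equals the within-therapist variance.

```latex
\begin{align*}
\sigma_P^2 &= \sigma_{bn}^2 + \sigma_T^2
  && \text{(eq. 2 with } \textstyle\sum_{n=1}^{N}\sigma_n^2/N \cong \sigma_T^2\text{)} \\
\sigma_{bn}^2 &= 1.96^2\,\sigma_T^2 - \sigma_T^2 = (3.8416 - 1)\,\sigma_T^2 = 2.8416\,\sigma_T^2 \\
\sigma_{bn} &= \sqrt{2.8416}\,\sigma_T \approx 1.69\,\sigma_T
\end{align*}
```

The between-patient variance is therefore about 2.84 times the within-therapist variance, which is the "almost 3 times" figure reported in the Abstract.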
From eq. (5), the intercepts of the regression lines in Figure 4 correspond to the difference
between the therapist’s fixed bias and the fixed bias of the average patient, in within-therapist
standard deviation units. The intercept for baseline measures indicates that the fixed bias for the
average patient, $\beta_P$, is 1.02 logits greater than the therapist’s fixed bias, $\beta_T$. However,
post-discharge the therapist’s fixed bias is 1.63 logits greater than the fixed bias of the average
patient. From the patients’ perspective, the therapist is underestimating patients’ functional
abilities at baseline and overestimating patients’ functional abilities at post-discharge follow-up.
From the therapist’s perspective, the patients are overestimating their functional abilities at
baseline and underestimating their functional abilities at post-discharge follow-up.
We cannot draw any conclusions from this study about why the difference between therapist and
average patient bias is negative at baseline and positive at post-discharge follow-up. One could
speculate that patients tend to be stoic and/or stubborn – underestimating the magnitude of their
problems at baseline and underestimating improvements in their function at follow-up.
Anecdotally, during evaluation therapists often see evidence of problems that patients deny or do
not recognize (e.g., seeing pills on the floor, stained clothing, signs of poor hygiene). Therapists
also report that patients may be able to perform a task after therapy, but refuse to accept the
required adaptation as an improvement over dependency. From another viewpoint, a cynic might
claim that the therapist is exaggerating the patient’s problems at baseline and exaggerating the
success of therapy at follow-up, making the intervention look more effective than it actually is.
However, in the final analysis we only can estimate differences between people in judgment
biases – we cannot know their values relative to a ground truth.
The purpose of this study has been to present and test a model of judgment bias and show how
judgment bias can influence measures estimated by psychometric models from observer
magnitude estimates. The observation of a linear relationship between continuous interval-scale
measures estimated from ordinal patient ratings and equivalent measures estimated from ordinal
therapist ratings confirms the linear prediction of the model. Grounded in a simple axiomatic
scaling theory, the model provides plausible interpretations of the slopes and intercepts of the
linear relationships in terms of fixed and random bias parameters. This model can be used as a
tool to study the effects of independent variables on judgment bias or compare differences
between judges.
References
1. Owsley C, Sloane M, McGwin G Jr, Ball K. Timed instrumental activities of daily living
tasks: relationship to cognitive function and everyday performance assessments in older
adults. Gerontology 2002;48:254-265.
2. McHorney CA, Haley SM, Ware JE Jr. Evaluation of the MOS SF-36 Physical Functioning
Scale (PF-10): II, Comparison of relative precision using Likert and Rasch scoring methods.
J Clin Epidemiol. 1997;50:451-461.
3. Granger CV, Deutsch A, Linn RT. Rasch analysis of the Functional Independence Measure
(FIM) Mastery Test. Arch Phys Med Rehabil. 1998;79:52-57.
4. Massof RW. Understanding Rasch and Item Response Theory models: Applications to the
estimation and validation of interval latent trait measures from responses to rating scale
questionnaires. Ophthal Epidemiol. 2011;18:1-19.
5. Fisher AG. The assessment of IADL motor skills: An application of many-faceted Rasch
analysis. Am J Occup Ther. 1993;47:319-329.
6. U.S. Department of Health & Human Services, Centers for Medicare and Medicaid Services.
(2002). Program memorandum intermediaries/carriers: Transmittal AB-02-078, May 29,
2002. Baltimore, MD: Government Printing Office.
7. Massof RW, Hsu CT, Baker FH, Barnett GD, Park WL, Deremeik JT, Rainey C, Epstein C.
Visual disability variables. I: The importance and difficulty of activity goals for a sample of
low vision patients. Arch Phys Med Rehabil. 2005;86:946-953.
8. Massof RW, Hsu CT, Baker FH, Barnett GD, Park WL, Deremeik JT, Rainey C, Epstein C.
Visual disability variables. II: The difficulty of tasks for a sample of low vision patients.
Arch Phys Med Rehabil. 2005;86:954-967.
9. Massof RW, Ahmadian L, Grover LL, Deremeik JT, Goldstein JE, Rainey C, Epstein C,
Barnett GD. The Activity Inventory: an adaptive visual function questionnaire. Optom Vis
Sci. 2007;84:763-774.
10. Centers for Medicare/Medicaid Services. (2004). The Inpatient Rehabilitation Facility-
Patient Assessment Instrument Training Manual. Available from
https://www.cms.gov/medicare/medicare-fee-for-service-payment/inpatientrehabfacpps/
irfpai.html
11. Andrich D. A rating formulation for ordered response categories. Psychometrika. 1978;43:561-
573.
12. Linacre JM, Wright BD. A user's guide to Winsteps: Rasch model computer program.
Chicago, IL: MESA Press; 2001.
13. Goldstein JE, Chun MW, Fletcher DC, Deremeik JT, Massof RW. Visual ability of patients
seeking outpatient low vision services in the United States. JAMA Ophthalmol.
2014;132:1169-1177.
14. Bond T, Fox CM. Applying the Rasch model: Fundamental measurement in the human
sciences. 2nd ed. New York, NY: Routledge; 2007.
15. Wilson EB, Hilferty MM. The distribution of chi-square. Proc Natl Acad Sci USA
1931;17:684-688.