
Putting the Patient in Patient Reported Outcomes: A Robust Methodology for Health Outcomes Assessment

May 2014

Abstract

When analyzing many health-related quality-of-life (HRQoL) outcomes, statis-

tical inference is often based on the summary score formed by combining the

individual domains of the HRQoL profile into a single measure. Through a se-

ries of Monte Carlo simulations, this paper illustrates that reliance solely on the

summary score may lead to biased estimates of incremental effects, and I propose

a novel two-stage approach that allows for unbiased estimation of incremental

effects. The proposed methodology essentially reverses the order of the analy-

sis, from one of “aggregate, then estimate” to one of “estimate, then aggregate.”

Compared to relying solely on the summary score, the approach also offers a more

patient-centered interpretation of results by estimating regression coefficients and

incremental effects in each of the HRQoL domains, while still providing estimated

effects in terms of the overall summary score. I provide an application to the es-

timation of incremental effects of demographic and clinical variables on HRQoL

following surgical treatment for adult scoliosis and spinal deformity.

Word, Table, and Figure Count: Approximately 4950 words of body text (excluding

footnotes), 6 tables, 2 figures

Running Head: Putting the Patient in PROMs


JEL Classification: I10, C24, C25, C34, C35, C51

Keywords: patient-reported outcome measures, quality-adjusted life-years, cost-effectiveness,

comparative-effectiveness

Funding: This project was supported by grant number XX from the Agency for Health-

care Research and Quality. The content is solely the responsibility of the author and

does not necessarily represent the official views of the Agency for Healthcare Research

and Quality.


1 Introduction

Improving the efficiency of health care delivery hinges on accurate methodologies for

economic evaluation and comparative effectiveness. Accompanying results must also be

sufficiently parsimonious so as to ensure the appropriate interpretation and dissemina-

tion of findings. To this end, substantial research has been devoted to the appropriate

analysis of health-related quality-of-life (HRQoL) and, more generally, patient-reported

outcome measures (PROMs). The U.K.’s National Health Service (NHS) explicitly

mandates the use of such data in health care decision making, and the U.S. appears

to be following suit with substantial investment in the Patient-Centered Outcomes Re-

search Institute (PCORI) created under the Patient Protection and Affordable Care

Act (PCORI, 2012; Selby et al., 2012; Devlin et al., 2010; Department of Health, 2008).

PCORI was specifically created to promote and ultimately fund the development of

comparative effectiveness research in health care, although it is statutorily prohibited from funding cost-effectiveness research aimed at estimating costs per quality-adjusted life year (QALY).

For the purposes of economic evaluation and comparative effectiveness, PROMs are

of interest for several reasons. First, they are outcome measures rather than process

measures, the latter of which dominate the quality measures reported by the Centers for

Medicare and Medicaid Services (CMS) and the National Committee for Quality Assur-

ance (NCQA, 2008). Only recently has CMS started closely tracking outcome measures

such as 30-day readmissions and mortality. Second, PROMs can be consistently studied

across a range of conditions and treatment options, offering a more appropriate com-

parison of treatments than is typically available with purely clinical outcome measures.

Third, a patient’s self-reported HRQoL is generally considered to be a valuable health

outcome measure and one which providers should routinely seek to improve (Porter,

2010; Ahmed et al., 2012). A recent article in the Wall Street Journal described HRQoL

data as “[helping] medical providers see the big picture...and makes for happier, health-

ier patients,” stating that increased reliance on HRQoL measures was “transforming


health care” (Landro, 2012). Finally, and perhaps most importantly, PROMs offer the

potential for truly patient-centered care, allowing providers to administer and evaluate

health care based on outcomes elicited directly from patients themselves (Porter, 2010).

Despite the growing awareness and use of PROMs, I argue in this paper that existing

methodologies for analyzing HRQoL data are deficient because they rely solely on the

HRQoL summary score in estimating incremental effects. Specifically, the most common

approach to analyzing HRQoL data is to combine individual HRQoL domains into a

single summary score using some existing scoring algorithm. These summary or index

scores are often then used as weights over time in order to estimate QALYs (Powell,

1984; Austin, 2002; Manca et al., 2005; Drummond et al., 2005; Brazier & Ratcliffe,

2007; Gray et al., 2011; Basu & Manca, 2012). Aside from normative concerns regarding

which weights to use, an analysis based solely on the summary scores is flawed for at

least three reasons.

First, relying on the summary scores comes with an inherent loss of information and

may ultimately bias incremental effects estimates (Mortimer & Segal, 2008; Gutacker

et al., 2012; Parkin et al., 2010). For example, in many HRQoL outcome measures, there

exists variation in the underlying domain scores that is not reflected in the summary

score (Brazier & Ratcliffe, 2007; Gray et al., 2011). This loss of variation is inherent

to the scoring process and not due to any specific algorithm. Second, the empirical

distribution of summary scores is often subject to significant floor or ceiling effects

and may also be multi-modal, necessitating empirical methodologies more complicated

than a simple linear regression (Austin, 2002; Manca et al., 2005; Basu & Manca,

2012; Hernandez Alava et al., 2012). The extent to which alternative distributional

assumptions regarding the summary score approximate the true distribution will vary

by application. Third, and perhaps more importantly, the reliance on summary scores

reflects a fundamental divide between the outcomes actually affected and the outcomes

being analyzed. For researchers interested in the effect of some covariate on HRQoL,

these effects occur by definition at the individual domain level since this is the level at

which respondents are asked about their quality of life (e.g., the physical functioning


or mental health domains of a larger HRQoL profile). Effects on the summary score

are somewhat artificial as they exist only by combining the individual domains and

associated effects. It is unclear a priori whether the effects estimated at each domain

and then combined to form an effect on the summary score would yield the same result

as an analysis based solely on the summary score. In fact, as the findings in Section 3

indicate, the order of estimation and aggregation to the summary score is an important

(but unappreciated) aspect of statistical inference.

As a result, there is growing concern in the literature regarding the appropriateness

of HRQoL summary scores as the outcome of interest (Sculpher & Gafni, 2001; Brazier

et al., 2009). For example, Gutacker et al. (2012) considers an ordered probit model

in analyzing EQ-5D scores, accounting for baseline quality-of-life through the panel

structure and exploiting the ordered probit construct to explicitly model individual

domain scores. The authors avoid an analysis based solely on the summary scores.

Devlin et al. (2010) considers an alternative classification system and a health profile

grid, each of which exploit rankings of EQ-5D health states and attempt to summarize

patient outcomes based on those reporting an unequivocal improvement, worsening, or

no change in health. The studies of Gutacker et al. (2012), Devlin et al. (2010), and

others illustrate concern surrounding the appropriateness of relying solely on summary

scores in estimating the effects of an intervention and other covariates on a patient’s

well-being. However, in avoiding the summary scores entirely, these approaches are

silent as to the incremental effects on the summary score and offer little in terms

of comparing results across other studies (where summary scores remain the primary

outcome of interest).

This paper proposes a novel two-stage estimator (2SE) that first estimates regres-

sion coefficients and incremental effects based on the full HRQoL profile and then

re-interprets these effects in terms of the summary score. Through a series of Monte

Carlo simulations, the paper illustrates how a reliance solely on the summary score

may lead to biased incremental effects estimates, while the 2SE is shown to restore

the unbiased estimation of incremental effects. The proposed methodology essentially


reverses the order of the analysis, from one of “aggregate, then estimate” to one of

“estimate, then aggregate.” The 2SE also allows for a more patient-centered discussion

wherein the incremental effects of treatment or other covariates are domain-specific and

more applicable to areas of health deemed most important to a given patient. Impor-

tantly, by re-interpreting the incremental effects in terms of summary scores, the 2SE

maintains the parsimonious interpretation that has proven so valuable in the applied

cost- and comparative-effectiveness literature. I then apply the 2SE along with other

common estimators in the literature to a prospective, multi-center dataset on HRQoL

outcomes for adult scoliosis and spinal deformity patients.

The current paper therefore contributes to the growing empirical literature on the

appropriate analysis of HRQoL outcomes. This analysis is also broadly related to

theoretical econometric research surrounding the differences between marginal effects

calculated from multivariate estimation versus marginal effects calculated from uni-

variate outcomes formed by collapsing the underlying multivariate outcomes (Mullahy,

2011). I discuss the empirical framework and 2SE in Section 2. Details of the Monte

Carlo exercise are presented in Section 3, with an application presented in Section 4.

Section 5 concludes.

2 Methodology

The primary goal of the current analysis is to accurately estimate the effect of a co-

variate, x, on a patient’s HRQoL summary score. For consistency with the empirical

application in Section 4, I adopt the SF-6D as the measure of HRQoL; however, the

intuition and methodological contribution of the paper extends to similar metrics such

as the EQ-5D.


2.1 Summary of the SF-6D

The SF-6D is a six-dimensional health profile derived from a subset of responses from the

SF-36 or SF-12 (Brazier et al., 2002; Brazier & Ratcliffe, 2007). The six dimensions of

health classified by the SF-6D are: 1) physical functioning; 2) role limitations; 3) social

functioning; 4) pain; 5) mental health; and 6) vitality. Each domain is characterized

numerically with a range of integers, where a 1 indicates the best value in each domain.

The worst value in each domain varies, with values up to 6 in the physical functioning

and pain domains, values up to 5 in the social functioning, mental health, and vitality

domains, and values up to 4 in the role limitations domain. The patient’s full SF-6D

profile is therefore characterized by a series of six integers, with the best health state

represented by {1, 1, 1, 1, 1, 1} and the worst health state represented by {6, 4, 5, 6, 5, 5}.
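As a quick arithmetic check on the profile structure just described, multiplying the number of levels across the six domains gives the number of distinct profiles. The sketch below is illustrative only; the names `SF6D_LEVELS` and `count_health_states` are hypothetical and do not come from the paper.

```python
from math import prod

# Number of response levels per SF-6D domain, as described in the text.
SF6D_LEVELS = {
    "physical_functioning": 6,
    "role_limitations": 4,
    "social_functioning": 5,
    "pain": 6,
    "mental_health": 5,
    "vitality": 5,
}

def count_health_states(levels):
    """Return the number of distinct profiles (one level chosen per domain)."""
    return prod(levels.values())

n_states = count_health_states(SF6D_LEVELS)  # 6 * 4 * 5 * 6 * 5 * 5
```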

Taking all possible combinations of responses, the SF-6D defines 18,000 unique

health states. Each health state can then be converted into a single index score using

available scoring algorithms that essentially assign weights to each domain and inter-

actions between domains. Following the algorithm in Brazier & Ratcliffe (2007), the

resulting SF-6D index score ranges from 0.30 to 1.0, with 0.30 representing the poorest

health state, {6, 4, 5, 6, 5, 5}, and 1 representing the best health state, {1, 1, 1, 1, 1, 1}.

The scoring algorithm from Brazier & Ratcliffe (2007) is reproduced in Table 1.

Table 1

The appropriate algorithm to calculate a summary score remains an area of debate

in the literature (Parkin et al., 2010). Importantly, the proposed methodology relies

on the scoring algorithm only to reinterpret the estimated incremental effects in terms

of the summary score. Although the estimated incremental effects will certainly differ

depending on the scoring algorithm adopted, the focus of this paper is on highlighting

the bias introduced when relying solely on the summary score. To this end, the intuition

underlying this analysis extends broadly to other scoring algorithms, including some


of the more recent literature on HRQoL crosswalks intended to convert responses from

one HRQoL instrument into those of another instrument (Dakin, 2013).

2.2 The Two-Stage Estimator

The proposed 2SE applies when one is interested in estimating the incremental effect

of some covariate on a summary score, which is itself derived from a combination of

individual responses. Several alternative models have also been proposed to estimate

such effects, including ordinary least squares (OLS), variations of the classic Tobit

model, censored least-absolute deviations models, Beta MLE, and Beta QMLE models

(Powell, 1984; Austin, 2002; Basu & Manca, 2012).1 Rather than rely on the univariate

outcome, the 2SE first estimates the coefficients of interest based on the underlying

SF-6D responses and then re-interprets the coefficients in terms of the summary score.2

The 2SE first models each individual health domain using an ordered probit model

(Gutacker et al., 2012), where the response in each domain intuitively follows from a

latent index variable,

$$y^*_{id} = x_i \beta_d + \varepsilon_{id}. \tag{1}$$

Here, xi denotes a set of independent variables possibly including a constant term,

d denotes the relevant health domain, d = 1, ..., 6, and εid is assumed to follow a

normal distribution with µ = 0 and σ = 1. In general, εid could be correlated across

domains. Such correlation could be accounted for in the proposed methodology (e.g.,

by adopting a composite marginal likelihood estimation for multivariate ordered probit

or logit models as in Bhat et al. (2010)); however, such an approach would only impact

the efficiency of the estimated coefficients and would not impact the point estimates.

As such, I simplify the analysis by assuming zero cross-equation correlation.

1 I ignore issues of selection or the role of baseline HRQoL in order to focus solely on the estimation of incremental effects in settings where standard regression models are considered appropriate.

2 Since QALYs generally reflect health states as well as the time spent in each health state, I do not treat QALYs as synonymous with the HRQoL summary scores; however, as Basu & Manca (2012) indicates, it is relatively common in practice that researchers estimate QALYs based on a single follow-up survey administered at one year after treatment, in which case the summary score is equivalent to a QALY.


Denote by yid the observed response for patient i in domain d. For example, in the

physical functioning domain (d = 1), yi1 ∈ {1, ..., 6}. As y∗i1 crosses several unknown

thresholds (denoted by αj), the observed response moves up the health status ranking

such that yi1 = 1 for α0 < y∗i1 < α1 and yi1 = 6 for y∗i1 ≥ α5. Note that the ordering

from best to worst or worst to best is irrelevant provided the appropriate adjustments

are made when estimating summary scores. Since most statistical software programs

estimate ordered discrete choice models such that a higher value is better, I adopt a

worst to best ordering in the analysis, which I then convert to a best to worst ordering

to apply the scoring algorithm. More compactly, the observed dependent variable, yid,

takes the form

$$y_{id} = j \quad \text{if} \quad \alpha_{d,j-1} \le y^*_{id} \le \alpha_{d,j}, \quad j = \{1, \ldots, J_d\}, \tag{2}$$

where Jd differs across domains as discussed previously. Importantly, even with a

well-behaved distribution of latent variables, y∗id, the ordered discrete choice framework

can generate distributions with strong floor and ceiling-effects via different threshold

values, αj. As a result, the estimation of ordered, discrete dependent variable models

can avoid the distributional and statistical difficulties present in models based solely on

the summary scores.
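To make the threshold mechanics in equation (2) concrete, the following Python sketch computes category probabilities for a single domain under the standard normal error assumed in equation (1). The function names and the cutpoint values are illustrative assumptions rather than part of the paper. The outer thresholds are taken to be minus and plus infinity, and in practice the interior cutpoints and the linear index would come from a fitted ordered probit model.

```python
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF (mean 0, sd 1), as in equation (1)

def ordered_probit_probs(xb, cutpoints):
    """Return [P(y = 1), ..., P(y = J)] for linear index xb.

    `cutpoints` holds the J - 1 interior thresholds in increasing order;
    the outer thresholds are treated as -inf and +inf.
    """
    cdf_at_cuts = [0.0] + [_PHI(a - xb) for a in cutpoints] + [1.0]
    return [hi - lo for lo, hi in zip(cdf_at_cuts[:-1], cdf_at_cuts[1:])]

def observed_category(y_star, cutpoints):
    """Map a latent value to its observed 1-based category."""
    return 1 + sum(y_star > a for a in cutpoints)

# Hypothetical cutpoints for a six-level domain such as physical functioning.
cuts = [-1.0, -0.3, 0.4, 1.1, 1.9]
probs = ordered_probit_probs(xb=0.5, cutpoints=cuts)
```

Moving the cutpoints closer together or further into one tail changes how much probability mass piles up in the extreme categories, which is how the framework produces floor and ceiling effects from a well-behaved latent variable.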

I estimate separate ordered probit models for each HRQoL domain, and the results

of each model are used to form predicted probabilities of responses, denoted $P^{d}_{ij}$, for person $i$, response $j$, and domain $d$. In the physical functioning domain, the regression results therefore provide six predicted probabilities for each person, $\left(P^{PF}_{i1}, P^{PF}_{i2}, \ldots, P^{PF}_{i6}\right)$.

Continuing this process across all six domains yields a total of 31 predicted probabilities

- one for each possible response in each domain - for each person. Applied to HRQoL

measures like the SF-6D, one difficulty surrounds the “most severe” category, where

Brazier et al. (2002) defines “most severe” as any one of the following responses: a level

of 4 or more in the physical functioning, social functioning, mental health, or vitality

domains; a level of 3 or more in the role limitation domain; or a level of 5 or more in the

pain domain. The probability of a “most severe” health status can then be calculated


following the principle of inclusion and exclusion for probability.3

With a slight abuse of notation, the inclusion-exclusion principle states that the

probability of the union of N non-mutually exclusive events is given as:

$$P(A_1 \cup A_2 \cup \ldots \cup A_N) = P(A_1) + \ldots + P(A_N) + \sum_{n=2}^{N} (-1)^{n+1} P(\cap\; n \text{ events}). \tag{3}$$

Applied to the SF-6D, I denote by APF the outcomes of the physical functioning domain

that enter into the “most severe” indicator, and similarly by ARL for the role limitations

domain, ASF for the social functioning domain, AP for the pain domain, AMH for the

mental health domain, and AV for the vitality domain. Since only one value can be

reported in each domain, these terms enter directly into equation 3, where

P (APF ) = Pr(PF = 4) + Pr(PF = 5) + Pr(PF = 6),

P (ARL) = Pr(RL = 3) + Pr(RL = 4),

P (ASF ) = Pr(SF = 4) + Pr(SF = 5),

P (AP ) = Pr(Pain = 5) + Pr(Pain = 6),

P (AMH) = Pr(MH = 4) + Pr(MH = 5), and

P (AV ) = Pr(V = 4) + Pr(V = 5).
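A minimal sketch of the inclusion-exclusion calculation follows. It assumes the six domain-level events are independent, in line with the zero cross-equation correlation assumed earlier, so that each intersection probability is the product of the marginal probabilities. The function name and the numeric inputs are hypothetical.

```python
from itertools import combinations
from math import prod

def union_probability(event_probs):
    """P(A_1 or ... or A_N) by inclusion-exclusion, assuming independent events."""
    total = 0.0
    n_events = len(event_probs)
    for n in range(1, n_events + 1):
        sign = (-1) ** (n + 1)
        # Sum P(intersection) over every subset of n events; under independence
        # each intersection probability is the product of the marginals.
        total += sign * sum(prod(c) for c in combinations(event_probs, n))
    return total

# Hypothetical per-domain probabilities of a response that counts as "most severe".
p_domains = [0.20, 0.10, 0.05, 0.15, 0.08, 0.12]
p_most_severe = union_probability(p_domains)
```

Under the same independence assumption, the result coincides with one minus the product of the complements, which offers a convenient cross-check.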

An estimate of P (A1 ∪ A2 ∪ ... ∪ A6), denoted P (Most Severe), can therefore be

obtained by applying the inclusion-exclusion principle to the individual estimates of

the probabilities of each outcome in each domain, P dij. Based on the scoring algorithm

in Table 1, the probability estimates from the ordered probit estimation can then be

3 A similar term which combines the scores across several individual domains also appears in the EQ-5D scoring algorithm (Shaw et al., 2005; Agency for Healthcare Research and Quality, 2005).


converted to a predicted SF-6D summary score, Si:

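$$
\begin{aligned}
S_i = 1 &- 0.035 \times \left(P^{PF}_{i2} + P^{PF}_{i3}\right) - 0.044 \times P^{PF}_{i4} - 0.056 \times P^{PF}_{i5} - 0.117 \times P^{PF}_{i6} \\
&- 0.053 \times \left(P^{RL}_{i2} + P^{RL}_{i3} + P^{RL}_{i4}\right) - 0.057 \times P^{SF}_{i2} - 0.059 \times P^{SF}_{i3} - 0.072 \times P^{SF}_{i4} - 0.087 \times P^{SF}_{i5} \\
&- 0.042 \times \left(P^{Pain}_{i2} + P^{Pain}_{i3}\right) - 0.065 \times P^{Pain}_{i4} - 0.102 \times P^{Pain}_{i5} - 0.171 \times P^{Pain}_{i6} \\
&- 0.042 \times \left(P^{MH}_{i2} + P^{MH}_{i3}\right) - 0.100 \times P^{MH}_{i4} - 0.118 \times P^{MH}_{i5} \\
&- 0.071 \times \left(P^{V}_{i2} + P^{V}_{i3} + P^{V}_{i4}\right) - 0.092 \times P^{V}_{i5} \\
&- 0.061 \times P(\text{Most Severe}).
\end{aligned}
\tag{4}
$$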
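The conversion in equation (4) can be sketched in Python as a weighted sum of predicted probabilities. The decrements below are transcribed from equation (4); the dictionary layout, the function name, and the degenerate test inputs are assumptions made for illustration only.

```python
# Decrements transcribed from equation (4); level 1 in every domain has no decrement.
DECREMENTS = {
    "PF":   {2: 0.035, 3: 0.035, 4: 0.044, 5: 0.056, 6: 0.117},
    "RL":   {2: 0.053, 3: 0.053, 4: 0.053},
    "SF":   {2: 0.057, 3: 0.059, 4: 0.072, 5: 0.087},
    "Pain": {2: 0.042, 3: 0.042, 4: 0.065, 5: 0.102, 6: 0.171},
    "MH":   {2: 0.042, 3: 0.042, 4: 0.100, 5: 0.118},
    "V":    {2: 0.071, 3: 0.071, 4: 0.071, 5: 0.092},
}
MOST_SEVERE_DECREMENT = 0.061

def predicted_summary_score(probs, p_most_severe):
    """Return S_i given probs[domain][level] and P(Most Severe)."""
    score = 1.0
    for domain, weights in DECREMENTS.items():
        for level, weight in weights.items():
            # Levels missing from probs[domain] are treated as probability zero.
            score -= weight * probs[domain].get(level, 0.0)
    return score - MOST_SEVERE_DECREMENT * p_most_severe

# Degenerate profiles: all probability mass on the best or on the worst level.
best = {d: {1: 1.0} for d in DECREMENTS}
worst = {d: {max(w): 1.0} for d, w in DECREMENTS.items()}
```

With all mass on the best levels the sketch returns 1, and with all mass on the worst levels (and a "most severe" probability of one) it returns roughly 0.30, matching the range of the index score reported in Section 2.1.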

In and of itself, the predicted summary score is of little value. If researchers were

interested only in the value of a respondent’s summary score, then clearly the observed

summary score formed from the observed responses would be most relevant. The pre-

dicted summary score is instead critical to the estimation of incremental effects via

the method of recycled predictions (Oaxaca, 1973; Graubard & Korn, 1999; Basu &

Rathouz, 2005; Basu, 2005; Glick, 2007; Kleinman & Norton, 2009). For example, if

we are interested in the average effect of a one standard deviation increase in x on

respondents’ summary scores, the 2SE would proceed as follows. First, estimate or-

dered probit models in each domain and form the predicted summary score based on

the observed independent variables, Si|xi. Second, replace xi with the hypothetical

values of interest, x′i = xi + σx, and based on the same coefficients estimated from

the ordered probit models, form the predicted summary scores for these hypothetical

values, Si|x′i. Taking the difference in each predicted summary score, Si|x′i − Si|xi,

and averaging across all individuals provides an estimate of the average effect of a one

standard deviation change in x. This recycled predictions method (also referred to as

predictive margins) also avoids the difficulty of computing and interpreting marginal

effects in nonlinear models (Norton et al., 2004) and can be particularly valuable when

the variable of interest is interacted with other covariates.
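The recycled predictions steps described above can be sketched as follows. This is a simplified illustration with a single covariate; `predict_score` stands in for the full first-stage prediction (ordered probit probabilities converted to a summary score), and the toy linear predictor is a hypothetical placeholder. Using the population standard deviation as the default shift is also an assumption.

```python
from statistics import mean, pstdev

def recycled_prediction_effect(x_values, predict_score, shift=None):
    """Average change in predicted score when each x_i is raised by `shift`.

    `predict_score` is any callable mapping a covariate value to a predicted
    summary score using already-estimated first-stage coefficients.
    """
    if shift is None:
        shift = pstdev(x_values)  # default: one standard deviation of x
    baseline = [predict_score(x) for x in x_values]
    counterfactual = [predict_score(x + shift) for x in x_values]
    return mean(c - b for c, b in zip(counterfactual, baseline))

# Toy stand-in for the fitted model (hypothetical): a linear predictor.
toy_predict = lambda x: 0.5 + 0.1 * x
xs = [0.1, 0.4, 0.6, 0.9]
effect = recycled_prediction_effect(xs, toy_predict, shift=1.0)
```

The same coefficients are used for both sets of predictions; only the covariate values change between the baseline and the hypothetical scenario.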


By definition, the predicted probabilities from the first stage regressions are es-

timates of the true probabilities and are therefore uncertain. To accommodate this

variation, standard errors and confidence intervals around the incremental effects are

estimated via bootstrap, where each iteration of the bootstrap includes both stages

of the 2SE. Uncertainty surrounding the parameters in the ordered probit model is

therefore incorporated into the final estimated effects.
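A minimal sketch of the bootstrap wrapper is given below. The key point is that the `estimator` callable should rerun both stages on every resample. The resampling scheme (simple nonparametric resampling of patients) and the percentile interval are assumptions, since the text does not specify these details, and all names are hypothetical.

```python
import random
from statistics import mean, stdev

def bootstrap_se(data, estimator, n_boot=200, seed=0):
    """Nonparametric bootstrap standard error of estimator(data).

    `estimator` should rerun the full procedure (both stages) on each resample.
    """
    rng = random.Random(seed)
    n = len(data)
    draws = []
    for _ in range(n_boot):
        # Resample patients with replacement, then re-estimate from scratch.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        draws.append(estimator(resample))
    return stdev(draws), draws

def percentile_interval(draws, level=0.95):
    """Simple percentile confidence interval from bootstrap draws."""
    ordered = sorted(draws)
    lo = int(((1 - level) / 2) * (len(ordered) - 1))
    hi = int((1 - (1 - level) / 2) * (len(ordered) - 1))
    return ordered[lo], ordered[hi]

# Toy usage: the sample mean stands in for the two-stage incremental effect.
toy_data = [float(i) for i in range(50)]
se, boot_draws = bootstrap_se(toy_data, mean, n_boot=100, seed=1)
ci_low, ci_high = percentile_interval(boot_draws)
```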

3 Simulation

I simulate data consistent with the latent index model discussed above. Alternatively,

authors sometimes simulate summary scores directly under a series of different distri-

butional assumptions (e.g., Basu & Manca (2012)); however, in application, the level of

measurement is always at the individual HRQoL domain, and summary scores are only

generated after converting the individual domain scores. Simulation based on the underlying HRQoL domains is therefore more consistent with the likely data-generating processes (DGPs) encountered in practice.

3.1 Data

In practice, the distribution of summary scores is often highly skewed, censored, and

multi-modal. For example, in a large study of laparoscopic-assisted versus abdominal

hysterectomy (the EVALUATE trial), the observed distributions in both treatment

arms were highly left-skewed with strong ceiling-effects at 1 (Basu & Manca, 2012;

Garry et al., 2004; Sculpher et al., 2004). Basu & Manca (2012) reproduces graphs from

several additional applications in which the summary score distributions are similarly

skewed, censored, or bi-modal.

To reflect the breadth of distributions encountered in practice, I simulate data under

several alternative DGPs. The DGPs are intentionally over-simplified in order to gen-

erate distributional properties of interest and to focus specifically on the estimation of


incremental effects. In all cases, I simulate a latent continuous variable for each HRQoL

domain (d = 1, ..., 6), denoted y∗id, as a function of a single independent variable, xi,

and a normal i.i.d. error term, εid.

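Denote by γ the intercept coefficient and by β the coefficient on x. Then the D × 1 vector of latent HRQoL values, y∗i, is as follows:

$$
\begin{aligned}
y^*_i &= \gamma + \beta x_i + \varepsilon_i, \quad \text{where} \\
\varepsilon &\sim N\left(0_{D \times 1}, I_{D \times D}\right), \\
x &\sim U[0, 1], \\
\gamma &= I_{D \times 1}, \quad \text{and} \\
\beta &= 1.5 \times I_{D \times 1}.
\end{aligned}
$$

Here $I_{D \times D}$ is the identity matrix, and $I_{D \times 1}$ is read as a $D \times 1$ vector of ones.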

Observed HRQoL values, yid for d ∈ {1, ..., 6}, are then generated based on the latent value, y∗id, relative to the Jd × 1 vector of threshold values in each

domain, αd, where Jd = 6 in the physical functioning and pain domains, Jd = 4 in

the role limitations domain, and Jd = 5 in the social functioning, mental health, and

vitality domains.

Alternative specifications of α are used to generate different distributional prop-

erties of the summary scores. Specifically, I consider five different threshold values

corresponding to each of five distributions of interest. In each domain, threshold values

are set to specific quantiles of the empirical distribution of the latent variable, F (y∗d).

Denoting the τjth quantile by qy∗d(τj) for all j ∈ {1, ..., Jd}, data are simulated under

the following alternative specifications of τj:

1. $\tau = [.1, .3, .5, .7, .9, 1]'$ in the physical functioning and pain domains, $\tau = [.1, .3, .6, .8, 1]'$ in the social functioning, mental health, and vitality domains, and $\tau = [.1, .4, .8, 1]'$ in the role limitations domain. These values for $\tau$ generate a bell-shaped distribution between 0.3 and 1, illustrated in panel (a) of Figure 1.

2. $\tau_j = 0.5 \times \frac{j}{J_d}$, which generates a right-censored distribution, illustrated in panel (b) of Figure 1.

3. $\tau_j = 0.25 \times \frac{j}{J_d}$, which generates a heavily right-censored distribution, illustrated in panel (c) of Figure 1.

4. $\tau_j = 0.25 \times \left(1 - \frac{j}{J_d}\right) + \frac{j}{J_d}$, which generates a left-censored distribution, illustrated in panel (d) of Figure 1.

5. $\tau_j = 0.5 \times \left(1 - \frac{j}{J_d}\right) + \frac{j}{J_d}$, which generates a heavily left-censored distribution, illustrated in panel (e) of Figure 1.
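The simulation design above can be sketched in Python as follows, using the second specification as the example. Several details are assumptions rather than statements from the paper: the order-statistic rule for empirical quantiles, the treatment of latent values above the highest threshold (assigned to the top category), and all function and variable names.

```python
import random

LEVELS = [6, 4, 5, 6, 5, 5]  # J_d for PF, RL, SF, pain, MH, vitality

def quantile_thresholds(values, taus):
    """Empirical quantiles of `values` at each tau (simple order-statistic rule)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, max(0, round(t * n) - 1))] for t in taus]

def simulate_domain(x, taus, gamma=1.0, beta=1.5, rng=random):
    """Draw y* = gamma + beta * x + eps and discretize at the tau-quantiles."""
    y_star = [gamma + beta * xi + rng.gauss(0.0, 1.0) for xi in x]
    cuts = quantile_thresholds(y_star, taus)
    top = len(taus)
    # Category = 1 + number of thresholds strictly below y*, capped at J_d
    # (assumption: values above the highest threshold fall in the top category).
    return [min(top, 1 + sum(v > c for c in cuts)) for v in y_star]

rng = random.Random(42)
x = [rng.random() for _ in range(500)]  # x ~ U[0, 1], N = 500
# Specification 2: tau_j = 0.5 * j / J_d for j = 1, ..., J_d.
taus_dgp2 = {J: [0.5 * j / J for j in range(1, J + 1)] for J in set(LEVELS)}
profiles = [simulate_domain(x, taus_dgp2[J], rng=rng) for J in LEVELS]
```

Under this reading, specification 2 places at least half of the observations in the highest category of each domain, which corresponds to a ceiling effect when higher values denote better health.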

Figure 1

3.2 Monte Carlo Results

The focus of the Monte Carlo study is to compare incremental effects in the summary

score domain calculated with existing regression methods to the incremental effects cal-

culated using the 2SE. The primary hypothesis is that an ordered discrete choice model

(e.g., an ordered probit or logit) can better accommodate the idiosyncratic properties of

distributions encountered in practice. By modeling HRQoL domains directly and then

re-interpreting in terms of the summary score, the results are therefore (arguably) more

robust to a wide range of distributions relative to models based solely on the summary

score.

For each of the five DGPs discussed above, I simulate 1,000 datasets consisting of

N = 500 observations (patients). I estimate coefficients with four alternative estima-

tors: 1) 2SE; 2) standard OLS; 3) the Beta MLE model proposed in Basu & Manca

(2012); and 4) the Beta QMLE also proposed in Basu & Manca (2012). In all cases,

incremental effects are calculated using the method of recycled predictions as discussed

previously, interpreted as the average change in summary scores following a one stan-

dard deviation change in x. The results are summarized in Table 2.
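For reference, the summary statistics typically reported from such a Monte Carlo exercise can be computed as in the sketch below. The function name, the choice of statistics, and the numeric inputs are illustrative assumptions; the paper does not spell out these formulas, and how the true effect is obtained is left outside the sketch.

```python
from math import sqrt
from statistics import mean

def monte_carlo_summary(estimates, true_effect):
    """Mean estimate, bias, percent bias, and RMSE across replications."""
    errors = [e - true_effect for e in estimates]
    return {
        "mean": mean(estimates),
        "bias": mean(errors),
        "pct_bias": 100.0 * mean(errors) / true_effect,
        "rmse": sqrt(mean(err ** 2 for err in errors)),
    }

# Hypothetical incremental-effect estimates from four replications.
summary = monte_carlo_summary([0.09, 0.11, 0.10, 0.12], true_effect=0.10)
```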


Table 2

The 2SE consistently provides accurate estimates of the true incremental effect

across a range of alternative distributions. By comparison, incremental effects estimated

with OLS are downward (upward) biased in the presence of sufficient ceiling (floor)

effects. The Beta MLE and QMLE estimators perform better than OLS; however,

the Beta MLE estimator still provides biased estimates in the presence of uniformly

distributed summary scores with mild ceiling effects (DGP 2). In addition, Beta MLE

and Beta QMLE estimators are both less accurate relative to the 2SE, where estimates

from the latter are generally centered around the true effects while estimates from the

Beta MLE and Beta QMLE models differ from the true effect by 10% or more on

average. The 2SE also provides the lowest RMSE in all cases, although the differences

in RMSE across estimators are minimal and statistically insignificant.4

As discussed in Basu & Manca (2012), if the true marginal effect is relatively small

and the data are subject to strong ceiling or floor effects, biases in marginal effects

may be relatively minor. I therefore simulated additional datasets with β = 5 × ID×1

rather than β = 1.5 × ID×1. I focus on DGPs 3 and 5 above (strong ceiling and floor

effects, respectively), where any bias would be most apparent. Results are summarized

in Table 3. Here, the 2SE provides accurate estimates of the true incremental effect,

while all other estimators yield biased estimates. Differences in RMSE are also larger

relative to those in Table 2, with the 2SE again providing the minimum RMSE in all

cases.

Table 3

4 Although the efficiency of these estimates will clearly depend on the overall model fit, the results are qualitatively unchanged when considering alternative simulations in which the model fit is intentionally reduced (via a larger variance in the distribution of the error term, ε). Moreover, there would be no reason in practice to propose a different set of independent variables for the 2SE compared to another estimator such as standard OLS or Beta MLE. Concerns regarding the choice of covariates therefore apply equally to all estimators considered in the analysis. Results are similarly unchanged when allowing for non-zero cross-equation correlation across HRQoL domains. Results from these sensitivity analyses are excluded for brevity but available upon request.


4 Application to Scoliosis Surgery

I apply the proposed 2SE to the estimation of the effect of observed pre-operative

variables on post-operative HRQoL and summary scores following surgical treatment

for adult spinal deformity (ASD). Surgical treatment of ASD is one of the lesser studied but fastest growing and most expensive areas of spine surgery, and the condition affects as much as 32% of the adult population and up to 60% of the elderly (Robin et al., 1982; Schwab et al.,

2003, 2005, 2008).

4.1 Data

The data for this study were collected from a multi-center, prospective database main-

tained by the International Spine Study Group (ISSG). The dataset consists of 209

adult scoliosis and spinal deformity patients undergoing surgery at any participating

ISSG member site, with institutional review board approval obtained at all centers. For

purposes of this application, I limit the analysis to the following covariates: 1) age; 2)

gender; 3) baseline SF-6D scores; 4) total number of vertebrae fused at surgery (i.e.,

the number of “levels” fused); and 5) surgical approach. The outcome of interest is

patients’ HRQoL one year after surgery. Summary statistics are provided in Table 4.

Table 4

4.2 Results

Coefficient estimates are provided in Table 5. Although the coefficients in the ordered

probit regressions do not easily compare to those from the OLS, Beta MLE, and Beta

QMLE regressions, the ordered probit analysis immediately allows a more patient-

centered interpretation than is provided by the other estimators. To the extent that

a given patient’s preferences are such that certain health domains are more important


than others, the results may support a more meaningful discussion for shared decision-

making purposes. The ordered probit analysis also reveals important differences across

health domains that are not identified in the other estimators. Namely, the role of age,

gender, levels fused, and baseline HRQoL clearly differs across health domains, with

age having a significant positive impact in some domains, a significant negative impact

in others, and no significant impact on overall HRQoL. Similarly, gender and surgical

approach are estimated to have no significant impact on overall HRQoL despite having

a significant effect on the role limitations domain.

Table 5

The impact of baseline HRQoL is also more clearly represented with the ordered

probit results. For example, post-operative mental health scores are influenced heavily

by a patient’s baseline mental health score, much more so than in the other health

domains. This is consistent with the underlying nature of the disease, which can have

major negative effects on a patient’s daily activities and body image, but may not

generally impact a patient’s overall mental health. As such, for two patients with

an identical SF-6D index score, a patient with lower baseline mental health will have

relatively less opportunity for HRQoL improvement following surgery. This interpreta-

tion would not be available with the standard empirical framework based solely on the

summary scores (Manca et al., 2005).

Incremental effects estimated from the method of recycled predictions are summa-

rized in Table 6. For binary variables such as “Female” and “Posterior Approach”,

the incremental effect represents the predicted change in summary scores for women

relative to men and for patients with a posterior approach relative to a combined

anterior/posterior approach, respectively. For age, the incremental effect represents the

predicted change in the summary score following a one-year increase in age at surgery;

and for levels fused and each HRQoL domain, the incremental effects represent the

predicted change in summary scores following a one-unit change from the median

(e.g., an increase from 9 to 10 levels fused, or an improvement in the baseline physical

functioning domain score from 4 to 3). As should be the case given the well-behaved distribution

of summary scores, the incremental effects for age, gender, levels fused, and surgical

approach are similar for all estimators considered.
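The logic of recycled predictions can be illustrated with a short sketch. The code below is an illustration under stated assumptions rather than the estimation code used in this paper: the data are simulated, the function name `recycled_predictions` is hypothetical, and `predict` stands in for the prediction function of any fitted model (here a logistic mean with made-up coefficients). Each observation keeps its own values for all other covariates, the covariate of interest is overwritten with each of two values in turn, and the incremental effect is the average difference in predictions.

```python
import numpy as np


def recycled_predictions(predict, X, col, low, high):
    """Mean change in predictions when column `col` moves from `low` to `high`.

    Every observation keeps its own values for the other covariates; only
    the column of interest is overwritten, once with each value.
    """
    X_low, X_high = X.copy(), X.copy()
    X_low[:, col] = low
    X_high[:, col] = high
    return float(np.mean(predict(X_high) - predict(X_low)))


# Simulated covariates (hypothetical, not the ISSG data).
rng = np.random.default_rng(0)
n = 500
female = rng.integers(0, 2, size=n).astype(float)
age = rng.normal(58.0, 13.0, size=n)
X = np.column_stack([np.ones(n), female, age])

# Stand-in for a fitted nonlinear model: a logistic mean with made-up coefficients.
coef = np.array([1.0, -0.3, 0.005])


def predict(X):
    return 1.0 / (1.0 + np.exp(-(X @ coef)))


# Binary covariate: everyone set to 1 versus everyone set to 0.
effect_female = recycled_predictions(predict, X, col=1, low=0.0, high=1.0)

# Continuous covariate: a one-unit increase from the sample median.
age_median = float(np.median(age))
effect_age = recycled_predictions(predict, X, col=2, low=age_median, high=age_median + 1.0)
print(effect_female, effect_age)
```

For a linear model the result collapses to the regression coefficient, which is one reason the estimators can agree closely when the summary score is well behaved.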

Table 6

The results from Table 6 also illustrate the loss of variation when estimating effects

based solely on the summary score. For example, an improvement from 4 to 3 or from 3

to 2 in a patient’s baseline “role limitations” domain will have no impact on the patient’s

summary score because the scoring algorithm is such that the score does not vary along

these values of the role limitations domain. A similar scenario unfolds for certain

values of the physical functioning, pain, mental health, and vitality domains. Because

of this loss of variation due to the scoring algorithm, incremental effects estimates for

the role limitations or mental health domains are not available when relying solely on

the summary score in the current application. By modeling each domain separately,

the 2SE avoids this problem and allows for a more complete estimation of incremental

effects at all values of each baseline HRQoL domain.^5

^5 Such differences could be avoided somewhat by including each baseline HRQoL domain score as a covariate in the OLS, Beta MLE, and Beta QMLE regressions; however, this is not the standard approach adopted in the literature. Moreover, this approach would not fully resolve the differences, as incremental effects under the 2SE remain higher in the mental health and vitality domains, and lower in the pain domain. Results of this analysis are not included but are available upon request.

5 Discussion

This paper develops a new two-stage estimator (2SE) for analyzing HRQoL outcomes

that offers important benefits relative to existing methodologies. Primarily, the paper

illustrates how a reliance solely on the summary score may lead to biased incremental

effects estimates, while the 2SE is shown to restore the unbiased estimation of

incremental effects. The proposed methodology essentially reverses the order of the

analysis, from one of “aggregate, then estimate” to one of “estimate, then aggregate.”

The 2SE also allows for a more patient-centered discussion wherein the incremental ef-

fects of treatment or other covariates are domain-specific and more applicable to areas

of health deemed most important to a given patient. Importantly, the 2SE offers a

unified framework by which to estimate incremental effects at the individual domain

level while still interpreting these same effects in terms of the overall summary score.
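The "estimate, then aggregate" idea can be pictured for a single domain with a small sketch. This is a simplified illustration, not the full second stage of the 2SE: the cutpoints and latent index values are hypothetical, only the role limitations decrements from Table 1 are used, and the "Most Severe" combination term is ignored because it depends on the joint distribution across domains. An ordered probit gives a probability for each response level, and weighting the level-specific decrements by those probabilities expresses the domain-level result in summary-score units.

```python
from math import erf, sqrt


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def ordered_probit_probs(xb, cutpoints):
    """P(level = 1..K) for latent index xb and increasing cutpoints (length K-1)."""
    cdf = [norm_cdf(c - xb) for c in cutpoints] + [1.0]
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]


def expected_decrement(probs, decrements):
    """Probability-weighted score decrement; decrements[k] belongs to level k+1."""
    return sum(p * d for p, d in zip(probs, decrements))


# Role limitations decrements from Table 1 (level 1 carries no decrement).
RL_DECREMENTS = [0.0, 0.053, 0.053, 0.053]

# Hypothetical cutpoints and latent index values, for illustration only.
cutpoints = [-1.0, 0.0, 1.0]
xb_base, xb_alt = 0.2, 0.6

probs_base = ordered_probit_probs(xb_base, cutpoints)
probs_alt = ordered_probit_probs(xb_alt, cutpoints)

# Change in the role limitations contribution to the summary score (alt minus base).
score_change = expected_decrement(probs_base, RL_DECREMENTS) - expected_decrement(probs_alt, RL_DECREMENTS)
print(score_change)
```

Because the aggregation step works on predicted probabilities for every level, a covariate can move the expected score even where the scoring algorithm assigns the same decrement to adjacent levels.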

The improvements offered by the 2SE come at some cost. Namely, the 2SE is analyt-

ically more difficult to implement than a standard OLS and perhaps more complicated

than the Beta MLE, Beta QMLE, and other estimators relying solely on the summary

score. The 2SE also requires sufficient sample size (larger than standard OLS) in or-

der to estimate the ordered dependent variable models. However, as shown through

the Monte Carlo exercise, the standard estimators are less robust to the idiosyncratic

distributional properties of summary scores than is the 2SE. Moreover, the 2SE allows

for an interpretation in terms of summary scores just as the OLS, Beta MLE, and

Beta QMLE models do. The added computational burden therefore falls solely on the

analyst rather than the end-user of the results. As such, the proposed 2SE offers an im-

provement over existing estimators with no additional complexity for the end-user. In

light of the growing use of patient-reported outcome measures for purposes of provider

comparison and quality reporting (Nuttall et al., 2013), the proposed 2SE should be

considered as an alternative estimator for analysis of HRQoL outcomes in practice.

References

Agency for Healthcare Research and Quality. 2005. Calculating the U.S. Population-

based EQ-5D Index Score.

Ahmed, Sara, Berzon, Richard A, Revicki, Dennis A, Lenderking, William R, Moin-

pour, Carol M, Basch, Ethan, Reeve, Bryce B, Wu, Albert W, et al. 2012. The use of

patient-reported outcomes (PRO) within comparative effectiveness research: impli-

cations for clinical practice and health care policy. Medical Care, 50(12), 1060–1070.

AHRQ. 2012. Healthcare Cost and Utilization Project (HCUP), National Inpatient

Sample.

Austin, P.C. 2002. A comparison of methods for analyzing health-related quality-of-life

measures. Value in Health, 5(4), 329–337.

Basu, A., & Manca, A. 2012. Regression Estimators for Generic Health-Related Quality

of Life and Quality-Adjusted Life Years. Medical Decision Making, 32(1), 56–69.

Basu, Anirban. 2005. Extended generalized linear models: simultaneous estimation of

flexible link and variance functions. Stata Journal, 5(4), 501–516.

Basu, Anirban, & Rathouz, Paul J. 2005. Estimating marginal and incremental effects

on health outcomes using flexible link and variance function models. Biostatistics,

6(1), 93–109.

Bhat, C.R., Varin, C., & Ferdous, N. 2010. A comparison of the maximum simulated

likelihood and composite marginal likelihood estimation approaches in the context of

the multivariate ordered-response model. Advances in Econometrics, 26, 65–106.

Brazier, J., & Ratcliffe, J. 2007. Measuring and valuing health benefits for economic

evaluation. Oxford University Press, USA.

Brazier, J., Roberts, J., & Deverill, M. 2002. The estimation of a preference-based

measure of health from the SF-36. Journal of health economics, 21(2), 271–292.

Brazier, John E, Dixon, Simon, & Ratcliffe, Julie. 2009. The role of patient preferences

in cost-effectiveness analysis. Pharmacoeconomics, 27(9), 705–712.

Dakin, Helen. 2013. Review of studies mapping from quality of life or clinical measures

to EQ-5D: an online database. Health and quality of life outcomes, 11(1), 151.

Department of Health. 2008. Guidance on the Routine Collection of Patient Reported

Outcome Measures (PROMs).

Devlin, N.J., Parkin, D., & Browne, J. 2010. Patient-reported outcome measures in

the NHS: new methods for analysing and reporting EQ-5D data. Health economics,

19(8), 886–905.

Drummond, M.F., Sculpher, M.J., & Torrance, G.W. 2005. Methods for the economic

evaluation of health care programmes. Oxford University Press, USA.

Garry, Ray, Fountain, Jayne, Mason, Su, Hawe, Jeremy, Napp, Vicky, Abbott, Ja-

son, Clayton, Richard, Phillips, Graham, Whittaker, Mark, Lilford, Richard, et al.

2004. The eVALuate study: two parallel randomised trials, one comparing laparo-

scopic with abdominal hysterectomy, the other comparing laparoscopic with vaginal

hysterectomy. British Medical Journal, 328(7432), 129–133.

Glick, H. 2007. Economic evaluation in clinical trials. Oxford University Press, USA.

Graubard, Barry I, & Korn, Edward L. 1999. Predictive margins with survey data.

Biometrics, 55(2), 652–659.

Gray, A.M., Clarke, P.M., Wolstenholme, J., & Wordsworth, S. 2011. Applied Methods

of Cost-effectiveness Analysis in Healthcare. Oxford Univ Pr.

Gutacker, N., Bojke, C., Daidone, S., Devlin, N., & Street, A. 2012. Analysing Hospital

Variation in Health Outcome at the Level of EQ-5D Dimensions.

Hernandez Alava, Monica, Wailoo, Allan J, & Ara, Roberta. 2012. Tails from the peak

district: adjusted limited dependent variable mixture models of EQ-5D questionnaire

health state utility values. Value in Health, 15(3), 550–561.

Kleinman, Lawrence C, & Norton, Edward C. 2009. What’s the risk? A simple ap-

proach for estimating adjusted risk measures from nonlinear models including logistic

regression. Health services research, 44(1), 288–302.

Landro, L. 2012. The Simple Idea That Is Transforming Health Care. The Wall Street

Journal.

Manca, A., Hawkins, N., & Sculpher, M.J. 2005. Estimating mean QALYs in trial-based

cost-effectiveness analysis: the importance of controlling for baseline utility. Health

economics, 14(5), 487–496.

Mortimer, D., & Segal, L. 2008. Comparing the incomparable? A systematic review

of competing techniques for converting descriptive measures of health status into

QALY-weights. Medical decision making, 28(1), 66.

Mullahy, J. 2011. Marginal Effects in Multivariate Probit and Kindred Discrete and

Count Outcome Models, with Applications in Health Economics. Tech. rept. National

Bureau of Economic Research.

NCQA. 2008. National Committee for Quality Assurance (NCQA). HEDIS and quality

measurement: technical resources.

Norton, Edward C, Wang, Hua, & Ai, Chunrong. 2004. Computing interaction effects

and standard errors in logit and probit models. Stata Journal, 4, 154–167.

Nuttall, David, Parkin, David, & Devlin, Nancy. 2013. Inter-provider Comparison of

Patient-reported Outcomes: Developing an Adjustment to Account for Differences

in Patient Case Mix. Health Economics.

Oaxaca, Ronald. 1973. Male-female wage differentials in urban labor markets. Inter-

national economic review, 14(3), 693–709.

Parkin, D., Rice, N., & Devlin, N. 2010. Statistical analysis of EQ-5D profiles: does

the use of value sets bias inference? Medical Decision Making, 30(5), 556–565.

PCORI. 2012. Draft National Priorities for Research and Research Agenda: version 1.

Porter, Michael E. 2010. What Is Value in Health Care? New England Journal of

Medicine, 363(26), 2477–2481. PMID: 21142528.

Powell, J.L. 1984. Least absolute deviations estimation for the censored regression

model. Journal of Econometrics, 25(3), 303–325.

Robin, G., Span, Y., Steinberg, R., Makin, M., & Menczel, J. 1982. Scoliosis in the

elderly: a follow-up study. Spine, 7(4), 355–359.

Schwab, Frank, Dubey, Ashok, Pagala, Murali, Gamez, Lorenzo, & Farcy, Jean P. 2003.

Adult scoliosis: a health assessment analysis by SF-36. Spine, 28(6), 602–606.

Schwab, Frank, Dubey, Ashok, Gamez, Lorenzo, El Fegoun, Abdelkrim Benchikh,

Hwang, Ki, Pagala, Murali, & Farcy, J-P. 2005. Adult scoliosis: prevalence, SF-

36, and nutritional parameters in an elderly volunteer population. Spine, 30(9),

1082–1085.

Schwab, Frank J, Lafage, Virginie, Farcy, Jean-Pierre, Bridwell, Keith H, Glassman,

Stephen, & Shainline, Michael R. 2008. Predicting outcome and complications in the

surgical treatment of adult scoliosis. Spine, 33(20), 2243–2247.

Sculpher, Mark, & Gafni, Amiram. 2001. Recognizing diversity in public preferences:

The use of preference sub-groups in cost-effectiveness analysis. Health economics,

10(4), 317–324.

Sculpher, Mark, Manca, Andrea, Abbott, Jason, Fountain, Jayne, Mason, Su, & Garry,

Ray. 2004. Cost effectiveness analysis of laparoscopic hysterectomy compared with

standard hysterectomy: results from a randomised trial. British Medical Journal,

328(7432), 134–139.

Selby, J.V., Beal, A.C., & Frank, L. 2012. The Patient-Centered Outcomes Research

Institute (PCORI) national priorities for research and initial research agenda. JAMA:

The Journal of the American Medical Association, 307(15), 1583–1584.

Shaw, J.W., Johnson, J.A., & Coons, S.J. 2005. US valuation of the EQ-5D health

states: development and testing of the D1 valuation model. Medical care, 43(3), 203.

6 Tables and Figures

Table 1: Scoring Algorithm for SF-6D^a

Starting value = 1.0 (perfect health)

Domain                       Response level            Decrement

Physical Functioning (PF)    PF=2 or PF=3              -0.035
                             PF=4                      -0.044
                             PF=5                      -0.056
                             PF=6                      -0.117

Role Limitations (RL)        RL=2 or RL=3 or RL=4      -0.053

Social Functioning (SF)      SF=2                      -0.057
                             SF=3                      -0.059
                             SF=4                      -0.072
                             SF=5                      -0.087

Pain (P)                     P=2 or P=3                -0.042
                             P=4                       -0.065
                             P=5                       -0.102
                             P=6                       -0.171

Mental Health (MH)           MH=2 or MH=3              -0.042
                             MH=4                      -0.100
                             MH=5                      -0.118

Vitality (V)                 V=2 or V=3 or V=4         -0.071
                             V=5                       -0.092

Combination of Domains       “Most Severe”             -0.061

^a Algorithm based on Brazier & Ratcliffe (2007). “Most Severe” denotes any one of the following responses: a level of 4 or more in the physical functioning, social functioning, mental health, or vitality domains; a level of 3 or more in the role limitation domain; or a level of 5 or more in the pain domain.
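The scoring rule in Table 1 can be written out as a short function. The sketch below transcribes the decrements and the “Most Severe” rule from the table and its note; the function name `sf6d_score`, the dictionary layout, and the example patient profile are hypothetical choices made for illustration.

```python
# Decrements from Table 1, keyed by domain and level; level 1 carries no decrement.
DECREMENTS = {
    "PF": {2: 0.035, 3: 0.035, 4: 0.044, 5: 0.056, 6: 0.117},
    "RL": {2: 0.053, 3: 0.053, 4: 0.053},
    "SF": {2: 0.057, 3: 0.059, 4: 0.072, 5: 0.087},
    "P": {2: 0.042, 3: 0.042, 4: 0.065, 5: 0.102, 6: 0.171},
    "MH": {2: 0.042, 3: 0.042, 4: 0.100, 5: 0.118},
    "V": {2: 0.071, 3: 0.071, 4: 0.071, 5: 0.092},
}
# Lowest level in each domain that counts as "Most Severe" (note to Table 1).
MOST_SEVERE_FROM = {"PF": 4, "RL": 3, "SF": 4, "P": 5, "MH": 4, "V": 4}
MOST_SEVERE_DECREMENT = 0.061


def sf6d_score(profile):
    """Return the SF-6D index for a dict such as {"PF": 4, "RL": 2, ...}."""
    score = 1.0  # starting value: perfect health
    for domain, levels in DECREMENTS.items():
        level = profile[domain]
        if level != 1 and level not in levels:
            raise ValueError(f"invalid level {level} for domain {domain}")
        score -= levels.get(level, 0.0)
    # One extra decrement if any domain is at a "Most Severe" level.
    if any(profile[d] >= cut for d, cut in MOST_SEVERE_FROM.items()):
        score -= MOST_SEVERE_DECREMENT
    return round(score, 3)


# Hypothetical patient profile, chosen only for illustration.
patient = {"PF": 4, "RL": 4, "SF": 3, "P": 4, "MH": 2, "V": 3}
print(sf6d_score(patient))
print(sf6d_score({**patient, "RL": 3}))  # same score: RL=3 and RL=4 share one decrement
```

For this profile, moving role limitations from 4 to 3 or from 3 to 2 leaves the index unchanged, which is the loss of variation discussed in the application.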

Figure 1: Empirical QALY Distributions in Monte Carlo Study

[Five histograms, panels (a) through (e), each plotting frequency against the SF-6D index score.]

(a) $\tau_{PF,Pain} = [.1, .3, .5, .7, .9, 1]'$, $\tau_{SF,MH,V} = [.1, .3, .6, .8, 1]'$, $\tau_{RL} = [.1, .4, .8, 1]'$

(b) $\tau_j = 0.5 \times \frac{j}{J_d}$

(c) $\tau_j = 0.25 \times \frac{j}{J_d}$

(d) $\tau_j = 0.25 \times \left(1 - \frac{j}{J_d}\right) + \frac{j}{J_d}$

(e) $\tau_j = 0.5 \times \left(1 - \frac{j}{J_d}\right) + \frac{j}{J_d}$
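The threshold formulas in panels (b) through (e) can be evaluated directly. The sketch below assumes that $j$ runs from 1 to $J_d$ and that $J_d$ is the number of response levels in domain $d$; that reading is suggested by the panel (a) vectors, which contain one threshold per level, but it is an assumption here. The function name `thresholds` and its `shifted` flag are hypothetical.

```python
def thresholds(n_levels, scale, shifted=False):
    """Return tau_j for j = 1..J_d.

    shifted=False: tau_j = scale * j / J_d                  (panels b and c)
    shifted=True:  tau_j = scale * (1 - j / J_d) + j / J_d  (panels d and e)
    """
    taus = []
    for j in range(1, n_levels + 1):
        frac = j / n_levels
        taus.append(scale * (1 - frac) + frac if shifted else scale * frac)
    return taus


# Role limitations has four levels, so J_d = 4 under the assumption above.
print(thresholds(4, 0.5))                 # panel (b) form
print(thresholds(4, 0.5, shifted=True))   # panel (e) form
```

Under this reading the unshifted forms cap the thresholds at the scale value, while the shifted forms start above the scale value and end at 1.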

Table 2: Incremental Effects Estimates under Alternative DGPs^a

Model                 Incremental   St. Dev.   Mean        Lower       Upper       RMSE
                      Effect                   % Bias      % Bias      % Bias

DGP 1: $\tau_{PF,Pain} = [.1, .3, .5, .7, .9, 1]'$, $\tau_{SF,MH,V} = [.1, .3, .6, .8, 1]'$, $\tau_{RL} = [.1, .4, .8, 1]'$

True Effect           0.070         0.002
Two-stage Approach    0.070         0.003      -0.73%      -11.85%     11.64%      0.0827
OLS                   0.073         0.004      3.79%       -8.89%      17.18%      0.0828
Beta MLE              0.077         0.004      9.49%       -4.84%      25.44%      0.0830
Beta QMLE             0.075         0.004      6.27%       -6.66%      19.96%      0.0829

DGP 2: $\tau_j = 0.5 \times \frac{j}{J_d}$

True Effect           0.093         0.003
Two-stage Approach    0.092         0.005      -0.64%      -12.62%     11.48%      0.1041
OLS                   0.089         0.005      -3.84%      -15.36%     8.39%       0.1043
Beta MLE              0.142         0.010      52.57%      28.34%      76.59%      0.1115
Beta QMLE             0.102         0.006      10.14%      -4.26%      25.24%      0.1043

DGP 3: $\tau_j = 0.25 \times \frac{j}{J_d}$

True Effect           0.076         0.003
Two-stage Approach    0.075         0.005      -1.34%      -15.60%     15.21%      0.0916
OLS                   0.065         0.004      -15.02%     -29.91%     -1.40%      0.0923
Beta MLE              0.075         0.008      -1.01%      -23.44%     23.44%      0.0935
Beta QMLE             0.086         0.006      12.71%      -5.97%      32.68%      0.0917

DGP 4: $\tau_j = 0.25 \times \left(1 - \frac{j}{J_d}\right) + \frac{j}{J_d}$

True Effect           0.075         0.002
Two-stage Approach    0.075         0.003      -0.22%      -10.58%     11.14%      0.0966
OLS                   0.083         0.004      10.32%      -2.40%      24.52%      0.0968
Beta MLE              0.083         0.005      10.71%      -2.67%      25.71%      0.0969
Beta QMLE             0.082         0.004      9.20%       -3.23%      22.88%      0.0968

DGP 5: $\tau_j = 0.5 \times \left(1 - \frac{j}{J_d}\right) + \frac{j}{J_d}$

True Effect           0.062         0.002
Two-stage Approach    0.061         0.003      -0.28%      -11.20%     11.19%      0.0916
OLS                   0.072         0.004      16.70%      2.21%       32.65%      0.0920
Beta MLE              0.070         0.004      13.03%      -1.05%      28.53%      0.0919
Beta QMLE             0.070         0.004      13.46%      -0.26%      28.56%      0.0919

^a Results based on 1,000 bootstrap iterations for N = 500 observations in each DGP. Upper % bias and lower % bias denote the upper and lower 95% confidence intervals of the percent difference between the estimated incremental effect and the true incremental effect. RMSE = root mean squared error.
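The percent-bias columns described in the note to Table 2 can be computed along the following lines. This is a sketch under assumptions rather than the code behind the table: the draws are simulated, the function name `percent_bias_summary` is hypothetical, and the 95% interval is taken as the 2.5th and 97.5th percentiles of the iteration-level percent differences, which is one common choice that the note does not confirm.

```python
import numpy as np


def percent_bias_summary(estimates, true_effects):
    """Mean and 2.5th/97.5th percentiles of the percent difference from the truth."""
    estimates = np.asarray(estimates, dtype=float)
    true_effects = np.asarray(true_effects, dtype=float)
    pct = 100.0 * (estimates - true_effects) / true_effects
    lower, upper = np.percentile(pct, [2.5, 97.5])
    return float(pct.mean()), float(lower), float(upper)


# Simulated draws standing in for 1,000 iterations (hypothetical numbers).
rng = np.random.default_rng(1)
true_effects = np.full(1000, 0.070)
estimates = true_effects + rng.normal(0.0, 0.004, size=1000)

mean_bias, lower, upper = percent_bias_summary(estimates, true_effects)
print(f"{mean_bias:.2f}% [{lower:.2f}%, {upper:.2f}%]")
```

An estimator whose interval excludes zero, as in several OLS and Beta rows of Tables 2 and 3, would be flagged as biased under this summary.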

Table 3: Incremental Effects Estimates with Larger True Effect^a

Model                 Incremental   St. Dev.   Mean        Lower       Upper       RMSE
                      Effect                   % Bias      % Bias      % Bias

DGP 3: $\tau_j = 0.25 \times \frac{j}{J_d}$

True Effect           0.168         0.006
Two-stage Approach    0.167         0.005      -0.48%      -7.35%      6.17%       0.0676
OLS                   0.120         0.004      -28.43%     -35.49%     -20.96%     0.0945
Beta MLE              0.137         0.009      -18.27%     -29.84%     -6.09%      0.0940
Beta QMLE             0.216         0.008      29.11%      19.12%      39.04%      0.0698

DGP 5: $\tau_j = 0.5 \times \left(1 - \frac{j}{J_d}\right) + \frac{j}{J_d}$

True Effect           0.111         0.002
Two-stage Approach    0.112         0.002      0.28%       -3.65%      4.52%       0.0688
OLS                   0.155         0.004      39.58%      29.10%      50.09%      0.0887
Beta MLE              0.147         0.006      32.44%      21.11%      45.85%      0.0875
Beta QMLE             0.142         0.003      27.71%      18.95%      36.38%      0.0864

^a Results based on 1,000 bootstrap iterations for N = 500 observations in each DGP, with data simulated using $\beta = 5 \times I_{D \times 1}$ rather than $\beta = 1.5 \times I_{D \times 1}$. Upper % bias and lower % bias denote the upper and lower 95% confidence intervals of the percent difference between the estimated incremental effect and the true incremental effect. RMSE = root mean squared error.

Table 4: Summary Statistics for ISSG Data (N=209)

Variable                       Mean      Standard Deviation
Age                            58.65     13.56
Levels Fused                   10.36     4.34

                               Count     Percent
Female                         175       84%
Posterior Approach             71        34%

                               Baseline              Post-operative
                               Count     Percent     Count     Percent

Physical Functioning Domain
  PF=1                         0         0%          0         0%
  PF=2                         10        5%          27        13%
  PF=3                         43        21%         61        29%
  PF=4                         65        31%         35        17%
  PF=5                         77        37%         74        35%
  PF=6                         14        7%          12        6%

Role Limitations Domain
  RL=1                         8         4%          23        11%
  RL=2                         68        33%         80        38%
  RL=3                         4         2%          7         3%
  RL=4                         129       62%         99        47%

Social Functioning Domain
  SF=1                         38        18%         79        38%
  SF=2                         38        18%         44        21%
  SF=3                         63        30%         50        24%
  SF=4                         47        22%         26        12%
  SF=5                         12        11%         10        5%

Pain Domain
  P=1                          1         0%          15        7%
  P=2                          12        6%          27        13%
  P=3                          25        12%         70        33%
  P=4                          50        24%         47        22%
  P=5                          75        36%         35        17%
  P=6                          46        22%         15        7%

Mental Health Domain
  MH=1                         37        18%         83        40%
  MH=2                         65        31%         64        31%
  MH=3                         56        27%         34        16%
  MH=4                         37        18%         23        11%
  MH=5                         14        7%          5         2%

Vitality Domain
  V=1                          4         2%          6         3%
  V=2                          29        14%         69        33%
  V=3                          53        25%         67        32%
  V=4                          61        29%         37        18%
  V=5                          62        30%         30        14%

Table 5: Regression Results^a

                      OLS        Beta       Beta       Ordered Probit
                                 MLE        QMLE
Outcome:              QALY       QALY       QALY       PF         RL         SF         P          MH         V

Age                   0.00*      0.00       0.00*      0.01       -0.01**    0.01       0.01*      0.01*      0.00
                      (0.00)     (0.00)     (0.00)     (0.01)     (0.01)     (0.01)     (0.01)     (0.01)     (0.01)

Female                -0.02      -0.10      -0.07      0.01       -0.59***   0.21       -0.15      -0.40*     -0.22
                      (0.02)     (0.11)     (0.10)     (0.21)     (0.22)     (0.21)     (0.20)     (0.23)     (0.21)

Levels Fused          -0.00      -0.01      -0.01      -0.03*     -0.03*     -0.02      0.02       -0.01      0.01
                      (0.00)     (0.01)     (0.01)     (0.02)     (0.02)     (0.02)     (0.02)     (0.02)     (0.02)

Posterior Approach    0.01       0.07       0.07       0.23       0.40**     0.24       0.00       0.00       0.15
                      (0.02)     (0.09)     (0.09)     (0.18)     (0.19)     (0.19)     (0.18)     (0.19)     (0.18)

Baseline HRQoL
  SF-6D Index         0.57***    2.49***    2.54***
                      (0.07)     (0.40)     (0.37)
  PF                                                   0.49***
                                                       (0.09)
  RL                                                              0.36***
                                                                  (0.08)
  SF                                                                         0.39***
                                                                             (0.07)
  P                                                                                     0.43***
                                                                                        (0.07)
  MH                                                                                               0.59***
                                                                                                   (0.08)
  V                                                                                                           0.42***
                                                                                                              (0.07)

RMSE                  0.1103     0.1218     0.1100     0.1099

^a Results based on OLS, Beta MLE, Beta QMLE, and Ordered Probit regressions. Beta MLE and QMLE estimation follows the procedure and code available from Basu & Manca (2012). Standard errors in parentheses. * p<0.1, ** p<0.05, *** p<0.01. RMSE: root mean squared error.

Table 6: Incremental Effects^a

                      OLS        Beta MLE   Beta QMLE  2SE

Age                   0.001      0.001      0.001      0.001
                      (0.001)    (0.001)    (0.001)    (0.001)

Female                -0.016     -0.021     -0.016     -0.021
                      (0.022)    (0.024)    (0.022)    (0.021)

Levels Fused          -0.002     -0.003     -0.002     -0.001
                      (0.002)    (0.002)    (0.002)    (0.002)

Posterior Approach    0.015      0.016      0.015      0.018
                      (0.019)    (0.021)    (0.019)    (0.018)

Baseline HRQoL
  PF                  0.011      0.010      0.011      0.008
                      (0.002)    (0.002)    (0.001)    (0.002)
  RL                  0.000      0.000      0.000      0.005
                      –          –          –          (0.001)
  SF                  0.001      0.001      0.001      0.011
                      (0.000)    (0.000)    (0.000)    (0.002)
  P                   0.025      0.024      0.024      0.015
                      (0.004)    (0.004)    (0.003)    (0.003)
  MH                  0.000      0.000      0.000      0.015
                      –          –          –          (0.002)
  V                   0.004      0.003      0.003      0.005
                      (0.001)    (0.001)    (0.001)    (0.001)

^a Incremental effects on QALYs estimated via the method of recycled predictions following OLS, Beta MLE, Beta QMLE, and 2SE (Oaxaca, 1973; Graubard & Korn, 1999; Basu & Rathouz, 2005; Basu, 2005; Glick, 2007; Kleinman & Norton, 2009). Beta MLE and QMLE estimation follows the procedure and code available from Basu & Manca (2012). Bootstrapped standard errors in parentheses based on 1,000 iterations.
