Propensity Scores in Medical Device Trials
Transcript of Propensity Scores in Medical Device Trials
Design of Non-Randomized Medical Device Trials Based on Subclassification Using
Propensity Score Quintiles
Greg Maislin, Principal Biostatistician
Biomedical Statistical Consulting, Wynnewood PA
Adjunct Associate Professor of Biostatistics in Medicine
Director, Biostatistics, Division of Sleep Medicine
University of Pennsylvania School of Medicine
Co-Author
Donald B. Rubin
John L. Loeb Professor of Statistics
Department of Statistics, Harvard University
Supported by
Smith & Nephew
Acknowledgments
Introduction
Motivation for comparing non-randomized cohorts is increasing in orthopedic device studies designed to provide evidence of safety and effectiveness in support of PMA and 510(k) studies.
Patients are savvy and knowledgeable about the orthopedic product pipeline. Some are willing to travel abroad to avoid randomization.
A properly designed, a priori specified non-randomized study has the potential to substantially reduce Sponsor burden while still providing valid scientific evidence for review.
Many Sponsors possess cohorts of patients receiving treatments for the same indication from prior well-controlled trials designed to support PMA or 510(k) regulatory approval.
PS methods can effectively reduce selection bias in non-randomized group comparisons.
Brief History of PS I
Cochran, Biometrics 1968
In the beginning…. Initial bias due to x defined as a response curve difference, e.g.,
R1(x) – R2(x)
If the response curves are parallel (//), then the treatment effect can be estimated at any x. Observational studies are problematic when the curves are not parallel (the usual case).
Used direct adjustment and showed that 5 subclasses are sufficient to remove over 90% of the bias due to covariates.
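The "5 subclasses remove about 90% of the bias" result can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Cochran's analytic derivation: x is taken to be normal with unit variance in both populations, the populations differ by a hypothetical shift of 0.5 SD, and subclasses are formed at the quintiles of the pooled sample. The function name and all parameter values are illustrative.

```python
import random
from statistics import mean

def subclass_bias_reduction(n=20000, shift=0.5, k=5, seed=1):
    """Fraction of the initial bias in x removed by k pooled-quantile subclasses."""
    rng = random.Random(seed)
    x1 = [rng.gauss(shift, 1.0) for _ in range(n)]  # population 1 (assumed normal)
    x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]    # population 2 (assumed normal)
    raw_bias = mean(x1) - mean(x2)

    pooled = sorted(x1 + x2)
    # Interior cut points at the 1/k, 2/k, ... quantiles of the pooled sample.
    cuts = [pooled[(len(pooled) * j) // k] for j in range(1, k)]

    def subclass(v):
        return sum(v >= c for c in cuts)  # index 0 .. k-1

    groups1 = [[] for _ in range(k)]
    groups2 = [[] for _ in range(k)]
    for v in x1:
        groups1[subclass(v)].append(v)
    for v in x2:
        groups2[subclass(v)].append(v)

    # Weight each within-subclass difference by the subclass share of all units.
    adjusted_bias = sum(
        (len(g1) + len(g2)) / (2 * n) * (mean(g1) - mean(g2))
        for g1, g2 in zip(groups1, groups2)
    )
    return 1.0 - adjusted_bias / raw_bias

reduction = subclass_bias_reduction()
print(f"bias removed by 5 subclasses: {reduction:.1%}")
```

Under these assumptions the remaining bias is concentrated in the two extreme subclasses, where the within-subclass distributions of x differ most between populations.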
Cochran and Rubin, Indian J. of Statistics 1973
Standardized initial bias defined. Don’s tradition of providing usable tools is clearly apparent in the introduction of a standardized measure for initial bias based on the mean difference as a % of the average SD computed in the usual way.
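A minimal sketch of the standardized bias measure follows. It assumes that "the average SD computed in the usual way" means the square root of the average of the two group variances; the function name and the data values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def standardized_bias_pct(x1, x2):
    """Mean difference as a percentage of the average SD (assumed: sqrt of mean variance)."""
    avg_sd = sqrt((variance(x1) + variance(x2)) / 2.0)
    return 100.0 * (mean(x1) - mean(x2)) / avg_sd

# Hypothetical ages in an investigational-device group and a control group.
device = [62, 70, 58, 66, 74]
control = [55, 63, 51, 59, 67]
bias = standardized_bias_pct(device, control)
print(f"standardized bias: {bias:.1f}%")
```

Because the measure is scale free, values for covariates measured in different units can be compared on one chart.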
Brief History of PS II
Rubin, Biometrics 1973a
Treatment effect defined: the average difference between non-parallel response surfaces over the P1 (e.g., investigational device) population. If constant, this simplifies to a fixed difference between parallel response surfaces.
Defined to be the (average) effect of the treatment variable or, more simply, “the treatment effect”, τ.
τ = E1 {R1(x) – R2(x)}
Rubin, Biometrics 1973b
Focus on relative covariate variance. Foreshadows Mahalanobis distance matching within calipers determined by propensity scores.
Brief History of PS III
Rosenbaum and Rubin, Biometrika 1983
Theorem 1 implies that if a subclass of units or a matched treated-control pair is homogeneous in e(x), then the treated and control units in that subclass or matched pair will have the same distribution of x.
Theorem 2 implies that if subclasses or matched treated-control pairs are homogeneous in both e(x) and certain chosen components of x, it is still reasonable to expect balance on the other components of x within these refined subclasses or matched pairs.
Theorem 3 states that under strongly ignorable treatment assignment, units with the same value of the balancing score b(x) but different treatments can act as controls for each other, in the sense that the expected difference in their responses equals the average treatment effect.
Theorem 4 states that if treatment assignment is strongly ignorable and b(x) is a balancing score, then the expected difference in observed responses to treatments at b(x) is equal to the average treatment effect at b(x).
Brief History of PS IV
Rosenbaum and Rubin, JASA 1984
Focus on the estimated PS rather than the true PS.
Appendix B is about missing covariates: e(x) = Pr(z = 1 | X) is extended to e*(x) = Pr(z = 1 | X*). “In practice we may estimate e* in several ways….”
Rosenbaum and Rubin, American Statistician 1985
Suggest the use of logit(PS) to avoid ‘compression of scale’ near 0 and 1.
Equal-percent bias reducing (EPBR) occurs if the expected value of X is a monotone transformation of e(x). References Rubin (1976b, Theorem 2) for the result that Mahalanobis metric matching is EPBR.
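The 'compression of scale' point can be seen numerically. The sketch below uses a hypothetical helper name, linear_ps, and compares a 0.01 difference in PS near 0.5 with the same difference near 1, before and after the logit transformation.

```python
from math import log

def linear_ps(e):
    """Logit of a propensity score e, defined only for 0 < e < 1."""
    if not 0.0 < e < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    return log(e / (1.0 - e))

# The same 0.01 gap on the probability scale, in the middle and near 1.
gap_middle = linear_ps(0.51) - linear_ps(0.50)
gap_edge = linear_ps(0.99) - linear_ps(0.98)
print(f"logit gap near 0.5: {gap_middle:.3f}")
print(f"logit gap near 1.0: {gap_edge:.3f}")
```

On the logit scale the gap near 1 is many times larger than the gap near 0.5, which is why matching or subclassifying on the linear PS treats extreme scores less coarsely.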
Brief History of PS V
Rubin, Health Services & Outcomes Research Methodology 2001
Using Propensity Scores to Help Design Observational Studies: Application to the Tobacco Litigation
Describes a number of diagnostic tools that were used in our study:
Standardized bias
Ratio of investigational to control PS variance
Table summarizing how many of the covariates have specified variance ratios orthogonal to the propensity score within and outside of an ideal range (>4/5 and <=5/4).
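The variance ratio diagnostic can be sketched as follows. The assumptions here are that "orthogonal to the propensity score" is implemented by regressing each covariate on the linear PS in the pooled sample and comparing the group variances of the residuals, and that the ratio is classified into the five ranges used in the variance ratio table of this presentation (with the ideal range >4/5 and <=5/4). Function names are hypothetical.

```python
from statistics import mean, variance

def residuals_orthogonal_to_ps(x, ps):
    """Residuals of covariate x after a pooled simple linear regression on the PS."""
    mx, mp = mean(x), mean(ps)
    sxx = sum((p - mp) ** 2 for p in ps)
    slope = sum((p - mp) * (v - mx) for p, v in zip(ps, x)) / sxx
    return [v - (mx + slope * (p - mp)) for p, v in zip(ps, x)]

def residual_variance_ratio(x, ps, is_control):
    """Ratio of residual variances, control divided by active."""
    res = residuals_orthogonal_to_ps(x, ps)
    control = [r for r, c in zip(res, is_control) if c]
    active = [r for r, c in zip(res, is_control) if not c]
    return variance(control) / variance(active)

def ratio_bin(r):
    """Classify a variance ratio into the ranges used in the summary table."""
    if r <= 0.5:
        return "<=1/2"
    if r <= 0.8:
        return ">1/2 & <=4/5"
    if r <= 1.25:
        return ">4/5 & <=5/4"  # ideal range
    if r <= 2.0:
        return ">5/4 & <=2"
    return ">2"

print(ratio_bin(1.11), "|", ratio_bin(1.29))
```

Counting covariates per range gives a one-line summary of how many fall inside the ideal range.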
Lessons Learned from Brief History
A non-randomized group comparison should be 'designed' prior to analysis, just as randomized group comparisons are.
'Design' may be interpreted as "contemplating, collecting, organizing, and analyzing of data that takes place prior to seeing any outcome data (Rubin 2008)".
A principled approach to performing treatment group comparisons in observational studies includes a priori specification of the details of the treatment group comparison, without access to outcome data.
"The propensity score is the observational study analogue of complete randomization in randomized experiments in the sense that its use is not intended to increase precision but only to eliminate systematic biases in treatment-control comparisons (Rubin 2008)".
Very Recent References
Rubin DB. The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine 2007, 26:20-36.
Rubin DB. For objective causal inference, design trumps analysis. The Annals of Applied Statistics 2008, 2(3):808-840.
Yue LQ. Statistical and regulatory issues with the application of propensity score analysis to nonrandomized medical device clinical studies. Journal of Biopharmaceutical Statistics 2007, 17:1-13. (Includes 6 invited comments and a rejoinder.)
Methods I
PS modeling takes more effort than many naively think! Multiple linear regression with no higher order terms or interactions is likely to be wholly inadequate.
We present a new heuristic for PS model building in line with methods developed and described by Rosenbaum and Rubin (1984), Rubin (2001), and Imbens and Rubin (2009), but one that iteratively uses standardized effect sizes of higher order terms to obtain more precise estimates of the PS.
There is no concern for Type I error inflation because no outcome data are used in this 'design of the observational study'.
Graphical techniques are emphasized that are effective in showing a non-statistical audience that balance among observed variables is at least as good as expected through randomization.
Methods II
Our heuristic includes a trio of "stages" that may be repeated as many times as necessary:
(1) estimating a main effects PS model;
(2) identifying and adding required higher order terms through evaluation of within-subclass bias effect sizes and other relevant PS diagnostic information;
(3) excluding subjects in one treatment group with insufficient 'covariate overlap' based on the higher order model.
At each stage, additional PS diagnostics are employed in order to identify key terms and then, finally, to validate the final model.
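Stage (2) can be sketched as a screening step over candidate squares and cross-products. The sketch below is one possible reading, not the authors' SAS implementation: it assumes subclass labels are already available from the current PS model, computes a standardized effect size for each candidate term within each subclass, and ranks terms by the mean absolute effect size (the ranking rule is an assumption). Term names imitate the style of the PS model table (for example GLOB2SQ, WP2_BMI); the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import sqrt
from statistics import mean, variance

def effect_size(a, b):
    """Difference in means divided by the pooled SD (sqrt of the mean variance)."""
    pooled_sd = sqrt((variance(a) + variance(b)) / 2.0)
    return 0.0 if pooled_sd == 0 else (mean(a) - mean(b)) / pooled_sd

def rank_candidate_terms(rows, treated, subclass):
    """Rank squares and cross-products by mean |effect size| within subclasses.

    rows: list of dicts mapping covariate name to value
    treated: list of booleans; subclass: list of subclass labels
    """
    names = sorted(rows[0])
    scores = {}
    for u, v in combinations_with_replacement(names, 2):
        term = f"{u}SQ" if u == v else f"{u}_{v}"
        sizes = []
        for s in sorted(set(subclass)):
            a = [r[u] * r[v] for r, t, c in zip(rows, treated, subclass) if c == s and t]
            b = [r[u] * r[v] for r, t, c in zip(rows, treated, subclass) if c == s and not t]
            if len(a) > 1 and len(b) > 1:  # need two units per group for an SD
                sizes.append(abs(effect_size(a, b)))
        if sizes:
            scores[term] = mean(sizes)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Tiny hypothetical example: the groups differ only in the A*B cross-product.
rows = [{"A": 1, "B": 1}, {"A": -1, "B": -1}, {"A": 2, "B": 2},
        {"A": 1, "B": -1}, {"A": -1, "B": 1}, {"A": 2, "B": -2}]
treated = [True, True, True, False, False, False]
subclass = [1, 1, 1, 1, 1, 1]
ranked = rank_candidate_terms(rows, treated, subclass)
print(ranked[0])
```

The top-ranked terms would then be added to the PS model, the model refitted, and the subclasses and diagnostics recomputed, all without reference to outcome data.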
Methods III
Magnitudes of standardized effect sizes for group differences in
linear, squares, and cross-product terms within subclasses provide the
key insight into what additional terms the PS model must have in order
to achieve adequate within-subclass balance between treatment
groups.
During the iterative process it may be found that there is a lack of
sufficient overlap in some part of the PS distributions to permit valid
statistical inference. If so, subjects in one group with values in that part
of the PS distribution lacking subjects in the other must be excluded in
order to permit valid inference that is free from extrapolation.
Additional distributional diagnostics are performed to guide selection of
higher order terms.
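The exclusion of subjects lacking overlap can be sketched with a simple range rule. This is an assumption for illustration: a subject in the group being trimmed is kept only if its linear PS lies between the minimum and maximum linear PS observed in the other group. The presentation does not state the exact rule used, and the function name and values below are hypothetical.

```python
def drop_non_overlappers(ps_reference_group, ps_trim_group):
    """Split the trim group into subjects inside and outside the reference PS range."""
    lo, hi = min(ps_reference_group), max(ps_reference_group)
    kept = [p for p in ps_trim_group if lo <= p <= hi]
    dropped = [p for p in ps_trim_group if not lo <= p <= hi]
    return kept, dropped

# Hypothetical linear propensity scores.
methylp_ps = [-0.9, -0.2, 0.3, 0.8, 1.4, 2.1]
saline_ps = [-3.5, -1.0, -0.4, 0.1, 0.6, 1.2]
kept, dropped = drop_non_overlappers(methylp_ps, saline_ps)
print("kept:", kept)
print("dropped:", dropped)
```

Because the rule uses only covariate information, applying it remains part of the design phase.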
Methods III(b) (Nursing example)
University of Pennsylvania School of Nursing, Naylor ECC: Design of Observational Study using PS Methods
PS_ASC_RNC Model 8: Details of propensity distribution by device group
[Figure: SAS UNIVARIATE procedure schematic (box) plots of LOGIT_PS (propensity score) by GRP_ASC (groups 1 and 2), vertical axis from -15 to 5. Two outlying observations are flagged at LOGIT_PS = -10.47466 (labeled 8885843) and -13.13123 (labeled 8881011).]
Methods IV
Specific choices made during this model building process pose no
concern for Type I error inflation because these analyses do not
involve any outcome data. In this way the sequential model-building
exercise should be viewed as part of the 'design of the
observational study'.
Graphical techniques are used to demonstrate in a simple way to
a non-statistical audience that the proposed approach does result
in good balance of observed factors between treatment groups,
highlighting that balance among observed covariates can often be
greater than expected had treatments been randomized.
Motivating Example I
Data are from an ancillary analysis conducted to support the
findings from a prospective, parallel groups, randomized,
non-inferiority trial of an investigational treatment relative to a
putative standard-of-care injectable treatment
(methylprednisolone) for reducing knee pain at Week 12 due
to osteoarthritis.
The primary clinical endpoint was successful relief of pain as
determined from changes from baseline to Week 12 in
WOMAC Pain (a reduction of ≥ 5 points in WOMAC pain score,
with the Week 12 score ≥ 40% smaller than the pretreatment value).
Motivating Example II
The Sponsor had conducted two prior superiority studies against
saline control and used these two studies to narrow the indicated
population. Post hoc pooled analyses of these studies identified
subpopulations likely to benefit from the investigational treatment.
Therefore, the Sponsor conducted a non-inferiority study relative to
a putative standard of care (methylprednisolone), but chose not to
include a saline control arm in the study.
The goal of this PS analysis was to provide external confirmation
that the control treatment was superior to saline for reducing knee
pain due to osteoarthritis at Week 12, to mitigate the lack of an
internal control.
Motivating Example III
The first two studies provide a nearly ideal reservoir of
potential saline controls for use in the principled design
of a non-randomized study comparing relative efficacy
between methylprednisolone and saline in reducing pain
at Week 12 due to osteoarthritis.
All patients randomized to receive saline in Studies 1
and 2 who met the inclusion and exclusion criteria for
Study 3 were evaluated as candidates for comparison
with methylprednisolone using PS subclassification.
Design of Non Randomized Study I
Apply the Study 3 exclusion criteria to saline
subjects from Studies 1 and 2.
Without access to any outcome data, use propensity score
(PS) methods in the design phase to create subclasses of
saline and methylp subjects who have the same distribution
of observed background covariates.
Design of Non Randomized Study II
• Start with all saline subjects from Studies 1 and 2
(N=174+110 = 284).
• Exclude saline subjects that do not meet Study 3
exclusion criteria.
• Identify CR (clinically relevant) Saline Cohort of subjects
(N=108) for input in PS analyses
• (Excluded 4 additional control subjects from CR cohort as
‘non-overlappers’ (final N=104))
• (Started with N=215 in investigational group and ended
up with N=169.)
Design of Non Randomized Study III
Variable       Explanation                          Type of Variable
GLOB2          Global assessment (baseline)         ordered categorical variable (1-5) analyzed as continuous variable
WS2            WOMAC stiffness (baseline)           continuous
WPF2           WOMAC physical function (baseline)   continuous
WP2            WOMAC pain (baseline)                continuous
OA_DURATION    Duration of OA (years)               continuous
Age            Age at treatment (years)             continuous
BMI            Body Mass Index (kg/m2)              continuous
MALE           Sex of patient                       dichotomous (male vs female)
K_L_34         Kellgren-Lawrence grade              dichotomous (grade 3 vs grade 2)
IA_STER10      Previous steroid injection           dichotomous (yes vs no)
PREV_HA10      Previous HA injection                dichotomous (yes vs no)
PREV_SURG10    Previous surgery                     dichotomous (yes vs no)
Design of Non Randomized Study IV
Table 4.2(1) Analysis of Mean Bias Reduction for Subclassification using PS Model 6 (FINAL MODEL)

                                 Sample Sizes        Standardized Effect Size for Bias    Test for Bias       Variance
                                 Methylp   Control   B(1)     95% CI LB   95% CI UB       t-stat   p-value    R(2)
Unadjusted B                     169       104       0.67     0.42        0.93            5.42     0.00       1.30
Q1                               22        33        0.42     -0.13       0.96            1.51     0.14       0.51
Q2                               27        28        -0.16    -0.69       0.37            -0.59    0.56       0.86
Q3                               34        19        0.69     0.11        1.27            2.41     0.02       1.04
Q4                               43        13        -0.37    -0.99       0.26            -1.16    0.25       1.14
Q5                               43        11        -0.16    -0.82       0.50            -0.47    0.64       2.16
Mean B or R over Subclasses(3)                       0.08                                 1.14     0.25       1.14
Mean % Bias reduction                                88%

Group by subclass interaction: F(4,263)=1.91, p=0.11
Notes:
(1) B - Standardized bias (difference in means / SDpooled)
(2) R - Ratio of variances (saline divided by methylp)
(3) For Mean B over Subclasses, t-stat (p-value) is for the treatment group contrast controlling for PS subclass.
Source [PSMI1_M_vs_S12 Model 6.sas]
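The 88% mean bias reduction in Table 4.2(1) follows from the reported unadjusted B of 0.67 and mean B over subclasses of 0.08. The sketch below redoes that arithmetic and, as a cross-check, averages the five subclass B values from the table. How the reported mean B was actually computed is not stated in the slides; the simple and sample-size-weighted averages shown here are assumptions, and both happen to round to 0.08.

```python
unadjusted_b = 0.67
reported_mean_b = 0.08

subclass_b = {"Q1": 0.42, "Q2": -0.16, "Q3": 0.69, "Q4": -0.37, "Q5": -0.16}
n_methylp = {"Q1": 22, "Q2": 27, "Q3": 34, "Q4": 43, "Q5": 43}
n_control = {"Q1": 33, "Q2": 28, "Q3": 19, "Q4": 13, "Q5": 11}

# Two candidate averages of the within-subclass standardized biases.
simple_mean_b = sum(subclass_b.values()) / len(subclass_b)
total_n = sum(n_methylp.values()) + sum(n_control.values())
weighted_mean_b = sum(
    subclass_b[q] * (n_methylp[q] + n_control[q]) for q in subclass_b
) / total_n

# Percent reduction relative to the unadjusted standardized bias.
pct_reduction = 100.0 * (1.0 - reported_mean_b / unadjusted_b)

print(f"simple mean B   = {simple_mean_b:.3f}")
print(f"weighted mean B = {weighted_mean_b:.3f}")
print(f"bias reduction  = {pct_reduction:.0f}%")
```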
Design of Non Randomized Study IV
Table 4.5(1) Maximum Likelihood Estimates of PS Model 6

                                           Standard   Wald
Parameter              DF   Estimate       Error      Chi-Square   Pr > ChiSq
Intercept              1    -7.7229        6.6920     1.3318       0.2485
GLOB2                  1    1.4416         1.3942     1.0691       0.3011
WS2                    1    3.4653         1.4064     6.0713       0.0137
WPF2                   1    -0.1931        0.2159     0.8004       0.3710
WP2                    1    0.1848         0.8518     0.0471       0.8283
OA_DURATION            1    -0.0358        0.2154     0.0276       0.8681
AGE                    1    0.0440         0.0640     0.4739       0.4912
BMI                    1    0.00987        0.1683     0.0034       0.9532
MALE                   1    0.4444         0.2922     2.3133       0.1283
K_L_34                 1    0.1058         0.2889     0.1342       0.7141
IA_STER10              1    0.7486         1.1697     0.4096       0.5222
PREV_HA10              1    -1.8260        4.2554     0.1841       0.6678
PREV_SURG10            1    -0.4358        0.3257     1.7908       0.1808
WP2_BMI                1    0.00492        0.0187     0.0690       0.7928
WP2_AGE                1    -0.00019       0.00832    0.0005       0.9818
WPF2_WP2               1    -0.0112        0.00768    2.1317       0.1443
GLOB2_WPF2             1    -0.00577       0.0170     0.1158       0.7337
GLOB2SQ                1    -0.2320        0.1688     1.8881       0.1694
WPF2_AGE               1    0.00189        0.00216    0.7694       0.3804
WPF2_IA_STER10         1    0.0370         0.0374     0.9778       0.3227
WS2_IA_STER10          1    -0.3608        0.3112     1.3447       0.2462
WS2_AGE                1    -0.0252        0.0138     3.3369       0.0677
WS2_WPF2               1    -0.00082       0.0161     0.0026       0.9595
WPF2_BMI               1    0.00652        0.00490    1.7723       0.1831
WS2SQ                  1    0.0319         0.0857     0.1384       0.7099
AGE_PREV_HA10          1    -0.00203       0.0476     0.0018       0.9660
WPF2_OA_DURATION       1    0.00228        0.00271    0.7067       0.4006
OA_DURATION_BMI        1    -0.00304       0.00718    0.1795       0.6718
BMI_PREV_HA10          1    0.0944         0.1004     0.8842       0.3471
IA_STER10_PREV_HA10    1    -0.6172        0.7997     0.5955       0.4403
WP2_PREV_HA10          1    0.1519         0.2597     0.3420       0.5587
WS2_PREV_HA10          1    -0.1847        0.3925     0.2213       0.6380
WS2_BMI                1    -0.0635        0.0321     3.9177       0.0478
WPF2_PREV_HA10         1    -0.0457        0.0629     0.5264       0.4681
Design of Non Randomized Study VI
Table 4.2(2) Individual Residual Variance Ratios for PS Model 6, Methyl-p vs Saline

                     Residual   Residual
                     Variance   Variance   Var. ratio                 >1/2 &   >4/5 &   >5/4 &
Obs   VARIABLE       Control    Active     control/active   <=1/2     <=4/5    <=5/4    <=2      >=2
1     glob2          0.58       0.52       1.11             0         0        1        0        0
2     ws2            2.06       1.69       1.22             0         0        1        0        0
3     wpf2           127.17     98.39      1.29             0         0        0        1        0
4     wp2            4.11       4.55       0.90             0         0        1        0        0
5     oa_duration    21.06      17.69      1.19             0         0        1        0        0
6     age            107.60     96.25      1.12             0         0        1        0        0
7     bmi            18.20      16.72      1.09             0         0        1        0        0
8     male           0.23       0.24       0.98             0         0        1        0        0
9     k_l_34         0.25       0.24       1.03             0         0        1        0        0
10    ia_ster10      0.20       0.20       1.01             0         0        1        0        0
11    prev_ha10      0.15       0.12       1.22             0         0        1        0        0
12    prev_surg10    0.21       0.20       1.05             0         0        1        0        0
                                                            =====     ======   ======   =====    ===
Total                                                       0         0        11       1        0
Design of Non Randomized Study VII
Table 4.3(1) Assessment of Bias Reduction in Individual Covariates Due to Stratification on PS Subclasses

                                                             With PS Subclassification      Group by Subclass
                             Unadjusted(1)                   Adjustment(2)                  Interaction
                                               F or                            F or         F or              Statistical
Covariate                    Diff.    SE       Chi-sq.       Diff.    SE       Chi-sq.      Chi-sq.           Model
Global assessment            -0.146   0.095    2.35          0.004    0.096    0.00         0.31              ANOVA
WOMAC stiffness              0.118    0.170    0.48          0.018    0.179    0.02         0.36              ANOVA
WOMAC physical function      0.678    1.308    0.27          0.082    1.377    0.00         1.02              ANOVA
WOMAC pain                   0.155    0.263    0.35          0.031    0.290    0.28         0.49              ANOVA
Duration of OA               -1.080   0.577    3.51          -0.020   0.578    0.00         0.66              ANOVA
Age                          -0.144   1.250    0.01          0.199    1.320    0.02         0.10              ANOVA
BMI                          0.396    0.523    0.57          -0.038   0.549    0.00         0.10              ANOVA
Male gender                  0.315    0.250    1.58          0.004    0.270    0.00         0.01              Logistic Regr.
Kellgren-Lawrence 3 vs 2     0.071    0.254    0.08          -0.011   0.268    0.00         0.68              Logistic Regr.
Previous steroid injection   -0.015   0.281    0.00          -0.038   0.296    0.02         1.59              Logistic Regr.
Previous HA injection        -0.316   0.330    0.92          -0.019   0.351    0.00         0.07              Logistic Regr.
Previous surgery             -0.381   0.261    2.14          0.002    0.284    0.00         1.62              Logistic Regr.

Notes:
(1) One-way ANOVA comparing the methylprednisolone and saline cohorts for continuous patient factors and simple logistic regression for dichotomous factors. For ANOVA, the column labeled "Diff." is the methylprednisolone minus saline control unadjusted difference in means.
(2) Two-way ANOVA comparing methylprednisolone to saline for continuous patient factors controlling for PS subclassification, or multiple logistic regression for dichotomous factors containing treatment group and PS subclass (df=4). For logistic regression models, the values in the columns labeled "Diff." and "SE" refer to the estimated log odds ratio and its standard error.
Source [PSMI1_M_vs_S12 Model 6.sas]
Figure 4.3 Bias Reduction for Each Covariate in the Final PS Model (Model 6)
[Figure: for each of the 12 covariates (Global assessment, WOMAC stiffness, WOMAC physical function, WOMAC pain, Duration of OA, Age, BMI, Male gender, Kellgren-Lawrence 3 vs 2, Previous steroid injection, Previous HA injection, Previous surgery), the t-statistic or signed square root of chi-square is plotted on a horizontal axis from -2.5 to 2.5, prior to PS subclassification and after PS subclassification. The region |statistic value| < 0.5 is marked.]
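The plotted statistic can be recovered from the tabulated values. For a 1-df test, the signed square root of the F or chi-square statistic carries the sign of the estimated difference; the sketch below assumes that convention and uses the Duration of OA row (unadjusted Diff. = -1.080, F = 3.51; adjusted Diff. = -0.020, F = 0.00) as the example. The function name is hypothetical.

```python
from math import copysign, sqrt

def signed_root(diff, stat):
    """Square root of a 1-df F or chi-square statistic, signed like the difference."""
    return copysign(sqrt(stat), diff)

# Duration of OA, before and after PS subclassification adjustment.
before = signed_root(-1.080, 3.51)
after = signed_root(-0.020, 0.00)
print(f"before: {before:.2f}, after: {after:.2f}")
```

Up to rounding in the table, the same value is obtained as Diff. divided by SE (here -1.080 / 0.577).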
Percentages of Methylp and Saline Subjects Within Each of the PS Subclasses
[Figure: bar chart of the percentage of each cohort (vertical axis, 0 to 50) falling in Subclass 1 through Subclass 5.]
Methylp (N=169): mean (SD) linear PS = 0.70 (0.63)
Saline (N=104): mean (SD) linear PS = 0.25 (0.71)
Linear PS percentiles in the pooled distributions:
20th=-0.040, 40th=0.414, 60th=0.772, 80th=1.134
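Given the pooled linear PS percentiles above, assigning a subject to a subclass is a lookup against four cut points. The sketch below assumes that a value exactly equal to a cut point goes to the higher subclass; the slides do not state the boundary convention, and the function name is hypothetical.

```python
from bisect import bisect_right

# Pooled 20th, 40th, 60th, and 80th percentiles of the linear PS (from the slide).
CUTS = [-0.040, 0.414, 0.772, 1.134]

def ps_subclass(linear_ps, cuts=CUTS):
    """Return the PS quintile subclass (1 to 5) for a linear propensity score."""
    return bisect_right(cuts, linear_ps) + 1

# The group mean linear PS values fall in different subclasses.
print("methylp mean (0.70) -> subclass", ps_subclass(0.70))
print("saline mean (0.25)  -> subclass", ps_subclass(0.25))
```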
Figures 4.4(a) to 4.4(e) Balance Within Subclasses: dichotomous covariates
[Figures: percentage of cohort with the characteristic (vertical axis, 0 to 100) by PS quintile (Q1 to Q5), shown for PS Methylprednisolone and PS Saline Control.]
Figure 4.4(a) Male Gender
Figure 4.4(b) Kellgren-Lawrence Grade 3 vs 2
Figure 4.4(c) Previous Steroid Injection
Figure 4.4(d) Previous HA Injection
Figure 4.4(e) Previous Surgery
Figures 4.4(f) to 4.4(l) Balance Within Subclasses: continuous covariates
[Figures: covariate values by Treatment Group (M vs C) and PS Quintile (1, 2, 3, 4, 5), ordered Methylp Q1, Control Q1, through Methylp Q5, Control Q5.]
Figure 4.4(f) Age (years), axis 20 to 90
Figure 4.4(g) BMI (kg/m2), axis 15 to 45
Figure 4.4(h) OA Duration (years), axis 0 to 30
Figure 4.4(i) WOMAC Physical Function Score, axis 0 to 70
Figure 4.4(j) WOMAC Pain Score, axis 6 to 18
Figure 4.4(k) WOMAC Stiffness Score, axis 0 to 10
Figure 4.4(l) Global Score, axis 0 to 6
This concludes the observational study design phase for methylp versus saline.
To summarize, the observational study design phase included:
Study 3 exclusions applied to Study 1 and 2 saline subjects.
PS analyses used iteratively to identify balanced cohorts among Study 3 methylp subjects and Study 1 and 2 saline subjects for use in the subsequent outcome comparison.
PS diagnostics used to confirm covariate balance between groups within PS strata in a straightforward and easy-to-communicate fashion.
Now that the design is fixed (including defining the relevant
study cohorts to include N=169 methylp and N=104 saline
subjects with subclass balance), it is permissible to examine
outcome data.
Summary and Conclusion
Propensity score subclasses based on the final PS
models produced at least as much balance in observed
covariate distributions as there would have been had treatment
assignments been randomized. That is, stratification by
PS subclasses in this study was shown to effectively
minimize bias in between-group efficacy comparisons.