Propensity Scores in Medical Device Trials

Design of Non-Randomized Medical Device Trials Based on Subclassification Using Propensity Score Quintiles

Greg Maislin Principal Biostatistician

Biomedical Statistical Consulting, Wynnewood PA

Adjunct Associate Professor of Biostatistics in Medicine

Director, Biostatistics, Division of Sleep Medicine

University of Pennsylvania School of Medicine

Co-Author

Donald B. Rubin

John L. Loeb Professor of Statistics

Department of Statistics, Harvard University

Supported by

Smith & Nephew

Acknowledgments

Introduction

Motivation for comparing non-randomized cohorts is increasing in orthopedic device studies designed to provide evidence of safety and effectiveness in support of PMA and 510(k) studies.

Patients are savvy and knowledgeable about the orthopedic product pipeline. Some are willing to travel abroad to avoid randomization.

A properly designed, a priori specified non-randomized study has the potential to substantially reduce Sponsor burden while still providing valid scientific evidence for review.

Many Sponsors possess cohorts of patients receiving treatments for the same indication from prior well-controlled trials designed to support PMA or 510(k) regulatory approval.

PS methods can effectively reduce selection bias in non-randomized group comparisons.

Brief History of PS I

Cochran, Biometrics 1968

In the beginning… Initial bias due to x is defined as a response curve difference, e.g.,

R1(x) – R2(x)

If the response curves are parallel, the treatment effect can be estimated at any x. Observational studies are problematic when the curves are not parallel (the usual case).

Used direct adjustment and showed that 5 subclasses are sufficient to remove over 90% of the bias due to covariates.

Cochran and Rubin, Indian J. of Statistics 1973

Standardized initial bias defined. Don's tradition of providing usable tools is clearly apparent in the introduction of a standardized measure for initial bias based on the mean difference as a % of the average SD computed in the usual way.

Brief History of PS II

Rubin, Biometrics 1973a

Treatment effect defined: the average difference between non-parallel response surfaces over the P1 (e.g., investigational device) population. If the difference is constant, this simplifies to a fixed difference between parallel response surfaces.

Defined to be the (average) effect of the treatment variable or, more simply, "the treatment effect", τ:

τ = E1{R1(x) – R2(x)}

Rubin, Biometrics 1973b

Focus on relative covariate variance. Foreshadows Mahalanobis distance within calipers determined by propensity scores.

Brief History of PS III

Rosenbaum and Rubin, Biometrika 1983

Theorem 1 implies that if a subclass of units or a matched treatment-control pair is homogeneous in e(x), then the treated and control units in that subclass or matched pair will have the same distribution of x.

Theorem 2 implies that if subclasses or matched treated-control pairs are homogeneous in both e(x) and certain chosen components of x, it is still reasonable to expect balance on the other components of x within these refined subclasses or matched pairs.

Theorem 3 states that under strongly ignorable treatment assignment, units with the same value of the balancing score b(x) but different treatments can act as controls for each other, in the sense that the expected difference in their responses equals the average treatment effect.

Theorem 4 states that if treatment assignment is strongly ignorable and b(x) is a balancing score, then the expected difference in observed responses to treatments at b(x) is equal to the average treatment effect at b(x).

Brief History of PS IV

Rosenbaum and Rubin, JASA 1984

Focus on the estimated PS rather than the true PS.

Appendix B is about missing covariates: e(x) = Pr(z = 1 | X) is extended to e*(x) = Pr(z = 1 | X*).

"In practice we may estimate e* in several ways…"

Rosenbaum and Rubin, American Statistician 1985

Suggest the use of logit(PS) to avoid 'compression of scale' near 0 and 1.

Equal-percent bias reducing (EPBR) matching occurs if the expected value of X is a monotone transformation of e(x). Cites Rubin (1976b, Theorem 2) for the result that Mahalanobis metric matching is EPBR.
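The 'compression of scale' point can be illustrated with a minimal sketch. The function name `logit` and the example probabilities are chosen here for illustration only and do not come from the cited paper.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

# The same 0.01 step on the PS scale, taken in the middle and near the boundary.
step_mid = logit(0.51) - logit(0.50)
step_edge = logit(0.99) - logit(0.98)

# On the logit scale the step near 1 is much larger than the step near 0.5,
# so differences that look negligible on the raw PS scale are no longer hidden.
print(f"step near 0.50: {step_mid:.3f}; step near 0.99: {step_edge:.3f}")
```

Working with the linear (logit) PS, as the later slides do, keeps subjects in the tails of the PS distribution distinguishable from one another.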

Brief History of PS V

Rubin, Health Services & Outcomes Research Methodology 2001

Using Propensity Scores to Help Design Observational Studies: Application to the Tobacco Litigation

Describes a number of diagnostic tools that were used in our study:

Standardized bias

Ratio of investigational to control PS variance

Table summarizing how many of the covariates have specified variance ratios orthogonal to the propensity score within and outside of an ideal range (>4/5 and <=5/4).
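The first two diagnostics can be sketched as follows. This is a minimal illustration, not the authors' SAS code: it assumes the standardized bias is the difference in means divided by the pooled SD (the definition given in the notes to Table 4.2(1) later in this deck), and the function names and toy numbers are hypothetical.

```python
import numpy as np

def standardized_bias(x_treated, x_control):
    """Difference in means (treated minus control) divided by the pooled SD."""
    x_t = np.asarray(x_treated, dtype=float)
    x_c = np.asarray(x_control, dtype=float)
    n_t, n_c = len(x_t), len(x_c)
    pooled_var = ((n_t - 1) * x_t.var(ddof=1) + (n_c - 1) * x_c.var(ddof=1)) / (n_t + n_c - 2)
    return float((x_t.mean() - x_c.mean()) / np.sqrt(pooled_var))

def variance_ratio(x_numerator, x_denominator):
    """Ratio of sample variances; which group goes on top is the analyst's convention."""
    return float(np.var(x_numerator, ddof=1) / np.var(x_denominator, ddof=1))

# Toy linear PS values, invented purely to exercise the functions.
toy_treated = [1.0, 2.0, 3.0]
toy_control = [0.0, 1.0, 2.0]
toy_B = standardized_bias(toy_treated, toy_control)
toy_R = variance_ratio([0.0, 2.0, 4.0], [1.0, 2.0, 3.0])
print(toy_B, toy_R)
```

The same two functions can be applied to the linear PS itself or to any individual covariate, overall or within a PS subclass.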

Lessons Learned from Brief History

A non-randomized group comparison should be 'designed' prior to analysis, just as randomized group comparisons are.

'Design' may be interpreted as "contemplating, collecting, organizing, and analyzing of data that takes place prior to seeing any outcome data" (Rubin 2008).

A principled approach to performing treatment group comparisons in observational studies includes a priori specification of the details of the treatment group comparison without access to outcome data.

"The propensity score is the observational study analogue of complete randomization in randomized experiments in the sense that its use is not intended to increase precision but only to eliminate systematic biases in treatment-control comparisons" (Rubin 2008).

Very Recent References

Rubin DB. The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine 2007, 26:20-36.

Rubin DB. For objective causal inference, design trumps analysis. The Annals of Applied Statistics 2008, 2(3):808-840.

Yue LQ. Statistical and regulatory issues with the application of propensity score analysis to nonrandomized medical device clinical studies. Journal of Biopharmaceutical Statistics 2007, 17:1-13. (Includes 6 invited comments and a rejoinder.)

Methods I

PS modeling effort is harder than many naively think! Multiple linear regression with no higher order terms or interactions is likely to be wholly inadequate.

We present a new heuristic for PS model building, in line with methods developed and described by Rosenbaum and Rubin (1984), Rubin (2001), and Imbens and Rubin (2009), that iteratively uses standardized effect sizes of higher order terms to obtain more precise estimates of the PS.

There is no concern for Type I error inflation because no outcome data are used in this 'design of the observational study'.

Graphical techniques are emphasized that are effective in showing a non-statistical audience that balance among observed variables is at least as good as expected through randomization.

Methods II

Our heuristic includes a trio of "stages" that may be repeated as many times as necessary:

(1) estimating a main effects PS model;

(2) identifying and adding required higher order terms through evaluation of within-subclass bias effect sizes and other relevant PS diagnostic information;

(3) excluding subjects in one treatment group with insufficient 'covariate overlap' based on the higher order model.

At each stage, additional PS diagnostics are employed in order to identify key terms and then, finally, to validate the final model.
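Stages (1) and (2) can be sketched on synthetic data as follows. This is a hedged illustration rather than the authors' SAS implementation: the logistic fit is a plain Newton-Raphson routine, the data are simulated, and every name (`fit_logistic`, `quintile_subclass`, `within_subclass_effect_sizes`, `x1`, `x2`) is hypothetical.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25, ridge=1e-8):
    """Newton-Raphson fit of Pr(z = 1 | X); X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        # A tiny ridge on the Hessian guards against numerical singularity only.
        hess = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (z - p))
    return beta

def quintile_subclass(linear_ps):
    """Subclass labels 1..5 from quintiles of the pooled linear PS."""
    cuts = np.percentile(linear_ps, [20, 40, 60, 80])
    return np.searchsorted(cuts, linear_ps, side="left") + 1

def within_subclass_effect_sizes(term, z, subclass):
    """Standardized group difference (treated minus control) in `term` per subclass."""
    sizes = {}
    for q in np.unique(subclass):
        t = term[(subclass == q) & (z == 1)]
        c = term[(subclass == q) & (z == 0)]
        if len(t) < 2 or len(c) < 2:
            sizes[int(q)] = float("nan")  # too few subjects to estimate
            continue
        pooled_var = ((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1)) / (len(t) + len(c) - 2)
        sizes[int(q)] = float((t.mean() - c.mean()) / np.sqrt(pooled_var))
    return sizes

# Simulated covariates and treatment assignment; no outcome data are involved.
rng = np.random.default_rng(0)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
true_logit = -0.3 + 0.8 * x1 + 0.5 * (x1 ** 2 - 1.0)
z = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

# Stage 1: main effects PS model and quintile subclasses.
X_main = np.column_stack([np.ones(n), x1, x2])
beta_main = fit_logistic(X_main, z)
subclass = quintile_subclass(X_main @ beta_main)

# Stage 2: inspect a candidate higher order term, then add it and refit.
es_x1sq = within_subclass_effect_sizes(x1 ** 2, z, subclass)
X_full = np.column_stack([X_main, x1 ** 2])
beta_full = fit_logistic(X_full, z)
subclass_full = quintile_subclass(X_full @ beta_full)
print(es_x1sq)
```

In practice the effect sizes would be tabulated for all squares and cross-products, and the loop repeated until no within-subclass effect size stands out. Stage (3), the overlap check, is a separate step.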

Methods III

Magnitudes of standardized effect sizes for group differences in linear, squared, and cross-product terms within subclasses provide the key insight into what additional terms the PS model must have in order to achieve adequate within-subclass balance between treatment groups.

During the iterative process it may be found that there is a lack of sufficient overlap in some part of the PS distributions to permit valid statistical inference. If so, subjects in one group with values in that part of the PS distribution lacking subjects in the other must be excluded in order to permit valid inference that is free from extrapolation.

Additional distributional diagnostics are performed to guide selection of higher order terms.
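One simple way to flag 'non-overlappers' is sketched below. The deck does not state the exact exclusion rule used, so the min/max common-support rule here is an assumption, and the function name and toy values are hypothetical.

```python
import numpy as np

def overlap_mask(linear_ps, z):
    """True for subjects whose linear PS lies inside the range shared by both groups.

    Assumed rule: keep values between the larger of the two group minima
    and the smaller of the two group maxima.
    """
    lp = np.asarray(linear_ps, dtype=float)
    treated = np.asarray(z).astype(bool)
    lower = max(lp[treated].min(), lp[~treated].min())
    upper = min(lp[treated].max(), lp[~treated].max())
    return (lp >= lower) & (lp <= upper)

# Toy example: five controls (z = 0) followed by four treated subjects (z = 1).
toy_lp = [-13.1, -10.5, -1.0, 0.2, 0.9, -0.8, 0.1, 1.2, 2.5]
toy_z = [0, 0, 0, 0, 0, 1, 1, 1, 1]
keep = overlap_mask(toy_lp, toy_z)
print(keep.tolist())
```

This rule trims both groups. Trimming only one group, as described in the slide, would apply the same bounds to that group alone. Either way the decision uses covariates only, so it remains part of the design phase.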

Methods III(b) (Nursing example)

University of Pennsylvania School of Nursing, Naylor ECC: Design of Observational Study using PS Methods, PS_ASC_RNC Model 8

Details of propensity distribution by device group. The UNIVARIATE Procedure, Variable: LOGIT_PS (Propensity Score)

[Schematic box plots of LOGIT_PS by GRP_ASC (groups 1 and 2), vertical axis from -15 to 5. Two low outlying observations are flagged: -10.47466 (labelled 8885843) and -13.13123 (labelled 8881011).]

Methods IV

Specific choices made during this model building process pose no concern for Type I error inflation because these analyses do not involve any outcome data. In this way the sequential model-building exercise should be viewed as part of the 'design of the observational study'.

Graphical techniques are used to demonstrate in a simple way to a non-statistical audience that the proposed approach does result in good balance of observed factors between treatment groups, highlighting that balance among observed covariates can often be greater than expected had treatments been randomized.

Motivating Example I

Data are from an ancillary analysis conducted to support the findings from a prospective, parallel groups, randomized, non-inferiority trial of an investigational treatment relative to a putative standard-of-care injectable treatment (methylprednisolone) for reducing knee pain due to osteoarthritis at Week 12.

The primary clinical endpoint was successful relief of pain as determined from changes from baseline to Week 12 in the WOMAC Pain score (≥ 5 points reduction in WOMAC pain score that is ≥ 40% smaller than pretreatment value).
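The responder rule can be written as a small function. The slide wording is compact, so this sketch assumes one reading of it: a subject is a responder when the reduction from baseline is at least 5 points and also at least 40% of the baseline value. The function name and example scores are hypothetical.

```python
def is_pain_responder(baseline, week12, min_points=5.0, min_fraction=0.40):
    """Assumed reading of the endpoint: reduction >= 5 points AND >= 40% of baseline."""
    if baseline <= 0:
        return False  # no pain at baseline, so no reduction is possible
    reduction = baseline - week12
    return reduction >= min_points and reduction >= min_fraction * baseline

# Illustrative scores only; these are not trial data.
print(is_pain_responder(12, 7))   # 5-point drop, about 42% of baseline
print(is_pain_responder(15, 10))  # 5-point drop, but only about 33% of baseline
print(is_pain_responder(10, 6))   # 40% drop, but only 4 points
```

Under this reading both conditions must hold, so a large relative improvement from a low baseline does not count unless it also reaches 5 points.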

Motivating Example II

The Sponsor had conducted two prior superiority studies against saline control and used these two studies to narrow the indicated population. Post hoc pooled analyses of these studies identified subpopulations likely to benefit from the investigational treatment.

Therefore, the Sponsor conducted a non-inferiority study relative to a putative standard of care (methylprednisolone), but chose not to include a saline control arm in the study.

The goal of these PS analyses was to provide external confirmation that the control treatment (methylprednisolone) was superior to saline for reducing knee pain due to osteoarthritis at Week 12, to mitigate the lack of an internal control.

Motivating Example III

The first two studies provide a nearly ideal reservoir of potential saline controls for use in the principled design of a non-randomized study comparing the relative efficacy of methylprednisolone and saline in reducing pain due to osteoarthritis at Week 12.

All patients randomized to receive saline in Studies 1 and 2 who met the inclusion and exclusion criteria for Study 3 were evaluated as candidates for comparison with methylprednisolone using PS subclassification.

Design of Non Randomized Study I

Apply the Study 3 exclusion criteria to saline subjects from Studies 1 and 2.

Without access to any outcome data, use propensity score (PS) methods in the design phase to create subclasses of saline and methylp subjects who have the same distribution of observed background covariates.

Design of Non Randomized Study II

• Start with all saline subjects from Studies 1 and 2 (N=174+110 = 284).

• Exclude saline subjects who do not satisfy the Study 3 inclusion and exclusion criteria.

• Identify CR (clinically relevant) Saline Cohort of subjects (N=108) for input in PS analyses.

• Exclude 4 additional control subjects from CR cohort as 'non-overlappers' (final N=104).

• (Started with N=215 in investigational group and ended up with N=169.)

Design of Non Randomized Study III

Variable       Explanation                          Type of Variable
GLOB2          Global assessment (baseline)         ordered categorical variable (1-5) analyzed as continuous variable
WS2            WOMAC stiffness (baseline)           continuous
WPF2           WOMAC physical function (baseline)   continuous
WP2            WOMAC pain (baseline)                continuous
OA_DURATION    Duration of OA (years)               continuous
AGE            Age at treatment (years)             continuous
BMI            Body Mass Index (kg/m2)              continuous
MALE           Sex of patient                       dichotomous (male vs female)
K_L_34         Kellgren-Lawrence grade              dichotomous (grade 3 vs grade 2)
IA_STER10      Previous steroid injection           dichotomous (yes vs no)
PREV_HA10      Previous HA injection                dichotomous (yes vs no)
PREV_SURG10    Previous surgery                     dichotomous (yes vs no)

Design of Non Randomized Study IV

Table 4.2(1) Analysis of Mean Bias Reduction for Subclassification using PS Model 6 (FINAL MODEL)

                                  Sample Sizes       Standardized Effect Size for Bias    Test for Bias       Variance
                                  Methylp  Control   B(1)     95% CI LB   95% CI UB       t-stat   p-value    R(2)
Unadjusted B                      169      104        0.67     0.42        0.93            5.42     0.00       1.30
Q1                                22       33         0.42    -0.13        0.96            1.51     0.14       0.51
Q2                                27       28        -0.16    -0.69        0.37           -0.59     0.56       0.86
Q3                                34       19         0.69     0.11        1.27            2.41     0.02       1.04
Q4                                43       13        -0.37    -0.99        0.26           -1.16     0.25       1.14
Q5                                43       11        -0.16    -0.82        0.50           -0.47     0.64       2.16
Mean B or R over Subclasses(3)                        0.08                                 1.14     0.25       1.14

Mean % Bias reduction: 88%

Group by subclass interaction: F(4,263)=1.91, p=0.11

Notes:
(1) B - Standardized bias (difference in means / SD pooled)
(2) R - Ratio of variances (saline divided by methylp)
(3) For Mean B over Subclasses, t-stat (p-value) is for the treatment group contrast controlling for PS subclass.

Source [PSMI1_M_vs_S12 Model 6.sas]
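The summary figures in Table 4.2(1) can be cross-checked with simple arithmetic. The sketch below assumes the percent bias reduction is 100 × (1 − |mean B| / |unadjusted B|), which reproduces the reported 88%. The authors' exact estimator for the mean over subclasses is not stated, so the unweighted averages here are only a consistency check.

```python
# Values copied from Table 4.2(1).
subclass_B = [0.42, -0.16, 0.69, -0.37, -0.16]   # standardized bias, Q1..Q5
subclass_R = [0.51, 0.86, 1.04, 1.14, 2.16]      # variance ratio, Q1..Q5
unadjusted_B = 0.67
reported_mean_B = 0.08

# Unweighted averages over the five subclasses.
mean_B = sum(subclass_B) / len(subclass_B)
mean_R = sum(subclass_R) / len(subclass_R)

# Assumed formula for percent bias reduction, using the reported mean B.
pct_bias_reduction = 100.0 * (1.0 - abs(reported_mean_B) / abs(unadjusted_B))

print(f"mean B = {mean_B:.3f}, mean R = {mean_R:.3f}, "
      f"bias reduction = {pct_bias_reduction:.1f}%")
```

Both unweighted averages round to the tabulated values (0.08 and 1.14), and the assumed formula rounds to 88%.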

Design of Non Randomized Study IV

Table 4.5(1) Maximum Likelihood Estimates of PS Model 6

                                        Standard    Wald
Parameter              DF   Estimate    Error       Chi-Square   Pr > ChiSq
Intercept              1    -7.7229     6.6920      1.3318       0.2485
GLOB2                  1     1.4416     1.3942      1.0691       0.3011
WS2                    1     3.4653     1.4064      6.0713       0.0137
WPF2                   1    -0.1931     0.2159      0.8004       0.3710
WP2                    1     0.1848     0.8518      0.0471       0.8283
OA_DURATION            1    -0.0358     0.2154      0.0276       0.8681
AGE                    1     0.0440     0.0640      0.4739       0.4912
BMI                    1     0.00987    0.1683      0.0034       0.9532
MALE                   1     0.4444     0.2922      2.3133       0.1283
K_L_34                 1     0.1058     0.2889      0.1342       0.7141
IA_STER10              1     0.7486     1.1697      0.4096       0.5222
PREV_HA10              1    -1.8260     4.2554      0.1841       0.6678
PREV_SURG10            1    -0.4358     0.3257      1.7908       0.1808
WP2_BMI                1     0.00492    0.0187      0.0690       0.7928
WP2_AGE                1    -0.00019    0.00832     0.0005       0.9818
WPF2_WP2               1    -0.0112     0.00768     2.1317       0.1443
GLOB2_WPF2             1    -0.00577    0.0170      0.1158       0.7337
GLOB2SQ                1    -0.2320     0.1688      1.8881       0.1694
WPF2_AGE               1     0.00189    0.00216     0.7694       0.3804
WPF2_IA_STER10         1     0.0370     0.0374      0.9778       0.3227
WS2_IA_STER10          1    -0.3608     0.3112      1.3447       0.2462
WS2_AGE                1    -0.0252     0.0138      3.3369       0.0677
WS2_WPF2               1    -0.00082    0.0161      0.0026       0.9595
WPF2_BMI               1     0.00652    0.00490     1.7723       0.1831
WS2SQ                  1     0.0319     0.0857      0.1384       0.7099
AGE_PREV_HA10          1    -0.00203    0.0476      0.0018       0.9660
WPF2_OA_DURATION       1     0.00228    0.00271     0.7067       0.4006
OA_DURATION_BMI        1    -0.00304    0.00718     0.1795       0.6718
BMI_PREV_HA10          1     0.0944     0.1004      0.8842       0.3471
IA_STER10_PREV_HA10    1    -0.6172     0.7997      0.5955       0.4403
WP2_PREV_HA10          1     0.1519     0.2597      0.3420       0.5587
WS2_PREV_HA10          1    -0.1847     0.3925      0.2213       0.6380
WS2_BMI                1    -0.0635     0.0321      3.9177       0.0478
WPF2_PREV_HA10         1    -0.0457     0.0629      0.5264       0.4681
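For readers less familiar with this kind of output, the Wald chi-square column is the squared ratio of the estimate to its standard error, referred to a chi-square distribution with 1 degree of freedom. The sketch below checks this against the WS2 row. The helper names are hypothetical, and small discrepancies are expected because the printed estimates are rounded.

```python
import math

def wald_chi_square(estimate, std_error):
    """Wald statistic for a single coefficient: (estimate / SE) squared."""
    return (estimate / std_error) ** 2

def chi2_1df_pvalue(statistic):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

# WS2 row of Table 4.5(1): estimate 3.4653, standard error 1.4064.
ws2_wald = wald_chi_square(3.4653, 1.4064)
ws2_p = chi2_1df_pvalue(ws2_wald)
print(f"Wald chi-square = {ws2_wald:.4f}, p = {ws2_p:.4f}")
```

These p-values describe the treatment assignment model only. Because no outcome data enter the PS model, they carry no Type I error implications for the efficacy comparison.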

Design of Non Randomized Study VI

Table 4.2(2) Individual Residual Variance Ratios for PS Model 6, Methyl-p vs Saline

                      Residual Variance    Var. ratio                >1/2 &    >4/5 &    >5/4 &
Obs  VARIABLE         Control    Active    control/active   <=1/2    <=4/5     <=5/4     <=2       >=2
1    glob2            0.58       0.52      1.11             0        0         1         0         0
2    ws2              2.06       1.69      1.22             0        0         1         0         0
3    wpf2             127.17     98.39     1.29             0        0         0         1         0
4    wp2              4.11       4.55      0.90             0        0         1         0         0
5    oa_duration      21.06      17.69     1.19             0        0         1         0         0
6    age              107.60     96.25     1.12             0        0         1         0         0
7    bmi              18.20      16.72     1.09             0        0         1         0         0
8    male             0.23       0.24      0.98             0        0         1         0         0
9    k_l_34           0.25       0.24      1.03             0        0         1         0         0
10   ia_ster10        0.20       0.20      1.01             0        0         1         0         0
11   prev_ha10        0.15       0.12      1.22             0        0         1         0         0
12   prev_surg10      0.21       0.20      1.05             0        0         1         0         0
                                                            =====    ======    ======    =====     ===
                                                            0        0         11        1         0
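The tally at the bottom of Table 4.2(2) can be reproduced by binning the reported control/active variance ratios. This sketch takes the bin boundaries from the column headers and treats the last column as strictly greater than 2, which is an assumption made here to avoid overlap with the fourth column. The function name is hypothetical.

```python
from collections import Counter

def variance_ratio_bin(ratio):
    """Assign a variance ratio to one of the five ranges used in Table 4.2(2)."""
    if ratio <= 0.5:
        return "<=1/2"
    if ratio <= 0.8:
        return ">1/2 & <=4/5"
    if ratio <= 1.25:
        return ">4/5 & <=5/4"
    if ratio <= 2.0:
        return ">5/4 & <=2"
    return ">2"

# Reported control/active residual variance ratios, in table order.
reported_ratios = [1.11, 1.22, 1.29, 0.90, 1.19, 1.12,
                   1.09, 0.98, 1.03, 1.01, 1.22, 1.05]

tally = Counter(variance_ratio_bin(r) for r in reported_ratios)
print(tally)
```

Eleven of the twelve covariates fall in the ideal range (>4/5 and <=5/4). The one exception, wpf2 at 1.29, lies just outside it.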

Design of Non Randomized Study VII

Table 4.3(1) Assessment of Bias Reduction in Individual Covariates Due to Stratification on PS Subclasses

                                                               With PS Subclassification      Group by Subclass
                              Unadjusted(1)                    Adjustment(2)                  Interaction
                              Diff.    SE      F or Chi-sq.    Diff.    SE      F or Chi-sq.  F or Chi-sq.        Statistical Model
Global assessment             -0.146   0.095   2.35             0.004   0.096   0.00          0.31                ANOVA
WOMAC stiffness                0.118   0.170   0.48             0.018   0.179   0.02          0.36                ANOVA
WOMAC physical function        0.678   1.308   0.27             0.082   1.377   0.00          1.02                ANOVA
WOMAC pain                     0.155   0.263   0.35             0.031   0.290   0.28          0.49                ANOVA
Duration of OA                -1.080   0.577   3.51            -0.020   0.578   0.00          0.66                ANOVA
Age                           -0.144   1.250   0.01             0.199   1.320   0.02          0.10                ANOVA
BMI                            0.396   0.523   0.57            -0.038   0.549   0.00          0.10                ANOVA
Male gender                    0.315   0.250   1.58             0.004   0.270   0.00          0.01                Logistic Regr.
Kellgren-Lawrence 3 vs 2       0.071   0.254   0.08            -0.011   0.268   0.00          0.68                Logistic Regr.
Previous steroid injection    -0.015   0.281   0.00            -0.038   0.296   0.02          1.59                Logistic Regr.
Previous HA injection         -0.316   0.330   0.92            -0.019   0.351   0.00          0.07                Logistic Regr.
Previous surgery              -0.381   0.261   2.14             0.002   0.284   0.00          1.62                Logistic Regr.

Notes:
(1) One-way ANOVA comparing methylprednisolone to the saline cohort for continuous patient factors and simple logistic regression for dichotomous factors. For ANOVA the column labeled difference is the methylprednisolone minus saline control unadjusted difference in means.
(2) Two-way ANOVA comparing methylprednisolone to saline for continuous patient factors controlling for PS subclassification, or multiple logistic regression for dichotomous factors containing treatment group and PS subclass (df=4). For logistic regression models, the values in the columns labeled "Diff." and "SE" refer to the estimated log odds ratio and its standard error.

Source [PSMI1_M_vs_S12 Model 6.sas]

Figure 4.3 Bias Reduction for Each Covariate in the Final PS Model (Model 6)

[Figure: for each covariate (Global assessment, WOMAC stiffness, WOMAC physical function, WOMAC pain, Duration of OA, Age, BMI, Male gender, Kellgren-Lawrence 3 vs 2, Previous steroid injection, Previous HA injection, Previous surgery), the t-statistic or signed square root of chi-square is plotted on an axis from -2.5 to 2.5, prior to and after PS subclassification, with the region |statistic value| < 0.5 marked.]

Percentages of Methylp and Saline Subjects Within Each of the PS Subclasses

[Figure: percentage of cohort (axis 0 to 50) in Subclass 1 through Subclass 5, by treatment group.]

Methylp (N=169): mean (SD) linear PS = 0.70 (0.63)

Saline (N=104): mean (SD) linear PS = 0.25 (0.71)

Linear PS percentiles in the pooled distributions: 20th=-0.040, 40th=0.414, 60th=0.772, 80th=1.134
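The percentages plotted in this figure can be recovered from the subclass sample sizes reported in Table 4.2(1). The sketch below is only a restatement of those counts, and the variable names are hypothetical.

```python
# Subclass sample sizes (Q1..Q5) copied from Table 4.2(1).
methylp_counts = [22, 27, 34, 43, 43]   # total 169
saline_counts = [33, 28, 19, 13, 11]    # total 104

# Percentage of each cohort falling in each subclass.
methylp_pct = [100.0 * c / sum(methylp_counts) for c in methylp_counts]
saline_pct = [100.0 * c / sum(saline_counts) for c in saline_counts]

# Pooled counts per subclass; cutpoints were pooled quintiles, so these are near-equal.
pooled_counts = [m + s for m, s in zip(methylp_counts, saline_counts)]

for q, (m, s, n) in enumerate(zip(methylp_pct, saline_pct, pooled_counts), start=1):
    print(f"Q{q}: methylp {m:.1f}%, saline {s:.1f}%, pooled n = {n}")
```

Saline subjects are concentrated in the lower subclasses and methylp subjects in the upper ones, which is consistent with the lower mean linear PS in the saline cohort (0.25 versus 0.70).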

Figure 4.4(a) Balance Within Subclasses: Male Gender

[Figure: percentage of cohort with characteristic (axis 0 to 100) by PS subclass (Q1 to Q5), PS Methylprednisolone vs PS Saline Control.]

Figure 4.4(b) Balance Within Subclasses: Kellgren-Lawrence Grade 3 vs 2

[Figure: percentage of cohort with characteristic (axis 0 to 100) by PS subclass (Q1 to Q5).]

Figure 4.4(c) Balance Within Subclasses: Previous Steroid Injection

[Figure: percentage of cohort with characteristic (axis 0 to 100) by PS subclass (Q1 to Q5).]

Figure 4.4(d) Balance Within Subclasses: Previous HA Injection

[Figure: percentage of cohort with characteristic (axis 0 to 100) by PS subclass (Q1 to Q5).]

Figure 4.4(e) Balance Within Subclasses: Previous Surgery

[Figure: percentage of cohort with characteristic (axis 0 to 100) by PS subclass (Q1 to Q5).]

Figure 4.4(f) Balance Within Subclasses: Age (years)

[Figure: age in years (axis 20 to 90) by treatment group (Methylp vs Control) and PS quintile (1, 2, 3, 4, 5).]


Figure 4.4(g) Balance Within Subclasses: BMI (kg/m2)

[Figure: BMI (axis 15 to 45) by treatment group (Methylp vs Control) and PS quintile (Q1 to Q5).]


Figure 4.4(h) Balance Within Subclasses: OA Duration (years)

[Figure: OA duration in years (axis 0 to 30) by treatment group (Methylp vs Control) and PS quintile (Q1 to Q5).]


Figure 4.4(i) Balance Within Subclasses: WOMAC Physical Function Score

[Figure: WOMAC physical function score (axis 0 to 70) by treatment group (Methylp vs Control) and PS quintile (Q1 to Q5).]


Figure 4.4(j) Balance Within Subclasses: WOMAC Pain Score

[Figure: WOMAC pain score (axis 6 to 18) by treatment group (Methylp vs Control) and PS quintile (Q1 to Q5).]


Figure 4.4(k) Balance Within Subclasses: WOMAC Stiffness Score

[Figure: WOMAC stiffness score (axis 0 to 10) by treatment group (Methylp vs Control) and PS quintile (Q1 to Q5).]


Figure 4.4(l) Balance Within Subclasses: Global Score

[Figure: global score (axis 0 to 6) by treatment group (Methylp vs Control) and PS quintile (Q1 to Q5).]

This concluded the observational study design phase for methylp versus saline

To summarize, the observational study design phase included:

Study 3 exclusions applied to Study 1 and 2 saline subjects.

PS analyses used iteratively to identify balanced cohorts among Study 3 methylp subjects and Study 1 and 2 saline subjects for use in subsequent outcome comparison.

PS diagnostics used to confirm covariate balance between groups within PS strata in a straightforward and easy to communicate fashion.

Now that the design is fixed (including defining the relevant study cohorts to include N=169 methylp and N=104 saline subjects with subclass balance), it is permissible to examine outcome data.

Summary and Conclusion

Subclassification on the propensity score from the final PS model produced at least as much balance in observed covariate distributions as would be expected had treatment assignments been randomized. That is, stratification by PS subclasses in this study was shown to effectively minimize bias due to observed covariates in between-group efficacy comparisons.
