
Stephanie, living with rheumatoid arthritis

Bootstrap Analysis Double-Independent Programming:

Issues and Solutions. Presented by Nils Pénard

UCB

Outline

!   Work Productivity Survey (WPS)

!   Presentation of the simple case: no variance stabilization

•  Description of the method

•  Double-programming issues and solutions

•  Limitations: bias in bootstrap

!   Presentation of the variance stabilization case

•  Description of the method

•  Double-programming issues and solutions


Simple Case

!   Description of the method:

•  Step 1: Re-sampling

•  Step 2: Bootstrap distribution

•  Step 3: Approximate a p-value and a C.I.

!   Issues encountered (linked to the bootstrap itself or to validation)


Simple Case Bootstrap: The Idea

!   Approximate the distribution of a statistical estimator using the observed sample

!   Step 1: Random re-sampling of observed sample, B times

!   Step 2: Compute the statistic on each of these B samples: this gives the distribution of the statistic under the null hypothesis

!   Step 3: Use this distribution to approximate a p-value and a C.I.


Step 1 - Re-sampling: Implementation

!   Gleason (1988): algorithm for re-sampling

•  Stratified

•  With replacement

•  Balanced

[Figure: re-sampling of Group y. Subjects S1 to S4 are replicated and drawn into Run 1, Run 2, Run 3, ..., Run 10000.]
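The three properties listed for Step 1 (stratified, with replacement, balanced) can be sketched as follows. This is the textbook concatenate-and-shuffle construction of a balanced bootstrap, written in Python for illustration. It is not necessarily the exact algorithm of Gleason (1988), nor the SAS code used by the authors. The function name `balanced_resample`, the subject labels and the seed value are hypothetical, and 1000 runs are used instead of 10000 to keep the example quick.

```python
import random

def balanced_resample(strata, n_runs, seed):
    """Balanced, stratified re-sampling with replacement.

    strata: dict mapping a group label to the list of its observations.
    Returns a list of n_runs dicts with the same labels and group sizes.
    """
    rng = random.Random(seed)
    runs = [dict() for _ in range(n_runs)]
    for label, values in strata.items():
        n = len(values)
        # Balanced: pool n_runs copies of the group, so that every
        # observation is used exactly n_runs times over all runs.
        pool = list(values) * n_runs
        rng.shuffle(pool)
        # Stratified: each run receives n observations from its own group.
        for b in range(n_runs):
            runs[b][label] = pool[b * n:(b + 1) * n]
    return runs

group_y = ["S1", "S2", "S3", "S4"]
group_z = ["S5", "S6", "S7", "S8"]
runs = balanced_resample({"y": group_y, "z": group_z}, n_runs=1000, seed=2024)
```

Within a single run a subject can appear several times or not at all, which is the "with replacement" part. The balance only holds over the whole set of runs.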

Step 2 – Distribution of Bootstrap Statistics

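[Figure: for each run b (up to Run 10000), the re-sampled subjects of Group y and Group z yield one bootstrap statistic.]

$t_b^* = (\hat{\theta}_b^* - \hat{\theta}) / SE_b^*$

$SE_b^* = \sqrt{SD_{y,b}^{*2}/n_y + SD_{z,b}^{*2}/n_z}$

where $n_y$ and $n_z$ are the numbers of subjects in Group y and Group z.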
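As a minimal sketch of Step 2, the statistic of each run can be computed as below. It assumes that the estimator is the difference between the two group means, which the slides do not state explicitly. The data values, the seed and the function name `boot_t` are hypothetical, and plain stratified re-sampling with replacement stands in for the balanced scheme.

```python
import math
import random
import statistics

def boot_t(sample_y, sample_z, theta_hat):
    """Studentized statistic t*_b for one pair of re-sampled groups."""
    # Assumption: the estimator is the difference in arithmetic means.
    theta_star = statistics.mean(sample_y) - statistics.mean(sample_z)
    se_star = math.sqrt(
        statistics.variance(sample_y) / len(sample_y)
        + statistics.variance(sample_z) / len(sample_z)
    )
    return (theta_star - theta_hat) / se_star

# Hypothetical data, for illustration only.
group_y = [12.0, 15.5, 9.0, 20.0, 14.0, 11.5, 18.0, 16.5]
group_z = [10.0, 8.5, 13.0, 7.0, 11.0, 9.5, 12.5, 6.0]
theta_hat = statistics.mean(group_y) - statistics.mean(group_z)

rng = random.Random(1)
t_star = []
for b in range(2000):
    # Each group is re-sampled separately (stratified, with replacement).
    res_y = rng.choices(group_y, k=len(group_y))
    res_z = rng.choices(group_z, k=len(group_z))
    t_star.append(boot_t(res_y, res_z, theta_hat))
```

Subtracting the observed estimate centres the t* values around zero, which is why their distribution can stand in for the distribution of the statistic under the null hypothesis.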

Step 3 - Confidence Interval

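$\left(\hat{\theta} - t^*_{(1-\alpha/2)}\,\mathrm{SE}(\hat{\theta}),\; \hat{\theta} - t^*_{(\alpha/2)}\,\mathrm{SE}(\hat{\theta})\right)$

[Figure: bootstrap distribution of t*, with $t^*_{(\alpha/2)}$ at the 250th ordered value and $t^*_{(1-\alpha/2)}$ at the 9750th.]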
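The interval above can be sketched in a few lines. This illustration assumes that the positions (α/2)(B+1) and (1−α/2)(B+1) are whole numbers, as with B = 9999, so that no interpolation is needed. The function name `bootstrap_t_ci` and the input values are hypothetical.

```python
import random

def bootstrap_t_ci(theta_hat, se_hat, t_star, alpha=0.05):
    """Bootstrap-t confidence interval from a list of t* values."""
    ordered = sorted(t_star)
    n_boot = len(ordered)
    # 1-based positions of the two percentiles; assumed to be whole numbers.
    low_pos = round((alpha / 2) * (n_boot + 1))
    high_pos = round((1 - alpha / 2) * (n_boot + 1))
    t_low = ordered[low_pos - 1]
    t_high = ordered[high_pos - 1]
    # The upper percentile gives the lower bound, and vice versa.
    return theta_hat - t_high * se_hat, theta_hat - t_low * se_hat

# Hypothetical t* values: B = 9999 equally spaced numbers, in random order.
t_star = [(k - 5000) / 1000 for k in range(1, 10000)]
random.Random(0).shuffle(t_star)
lower, upper = bootstrap_t_ci(theta_hat=10.0, se_hat=2.0, t_star=t_star)
```

With B = 9999 and α = 0.05 the two positions are the 250th and the 9750th ordered values, as on the slide.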

Step 3: P-value

!   Formula (#{x} is the number of times x occurs):

$p_{boot} = \#\{|t_b^*| \ge |t_{obs}|\} / B$

!   Two-sided test (absolute value)

!   From the t* distribution:

•  How many random t* are more extreme than t_obs?

•  Large |t*| should be rare

•  If there is an effect, |t_obs| should be larger than most |t*|
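The p-value formula translates almost directly into code. The sketch below is only an illustration: the function name `bootstrap_p_value` and the t* values are hypothetical.

```python
def bootstrap_p_value(t_obs, t_star):
    """Two-sided bootstrap p-value: share of |t*| at least as large as |t_obs|."""
    n_extreme = sum(1 for t in t_star if abs(t) >= abs(t_obs))
    return n_extreme / len(t_star)

# Hypothetical t* values, for illustration only.
t_star = [-3.0, -1.0, 0.0, 1.0, 2.0, 4.0]
p_two_sided = bootstrap_p_value(2.0, t_star)
```

Because absolute values are compared, the sign of t_obs does not change the result.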

Issues Encountered

Step 1 - Re-sampling: Issues

!   Three issues linked to validation:

!   Computationally intensive:

•  Example: 80 patients in Group y and 66 patients in Group z, B = 10,000 replications

•  (80 + 66) × 10,000 = 1,460,000 observations

•  Solution: use built-in SAS procedures as much as possible

!   Random sampling must be the same on both sides:

•  Solution: use the same SEED

!   Sampling validation: disk space and I/O access issues
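The SEED point can be illustrated outside SAS. In the sketch below, two independent calls stand in for the production and validation programs. They only draw identical samples because they share the same generator, the same seed and the same order of draws. The function name `draw_indices` and the seed values are hypothetical.

```python
import random

def draw_indices(n_obs, n_runs, seed):
    """Draw n_runs samples of n_obs row indices, with replacement."""
    rng = random.Random(seed)
    return [[rng.randrange(n_obs) for _ in range(n_obs)] for _ in range(n_runs)]

# Both sides agree on the seed, so the samples are identical.
production = draw_indices(n_obs=80, n_runs=100, seed=12345)
validation = draw_indices(n_obs=80, n_runs=100, seed=12345)
same_sampling = production == validation

# A different seed gives a different sampling.
other = draw_indices(n_obs=80, n_runs=100, seed=54321)
```

Agreeing on the seed alone is not enough if the two programs consume random numbers in a different order, for example by looping over groups differently.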



!   Computationally intensive and random sampling



!   Large amount of data

•  (100 + 230) × 10,000 = 3,300,000 observations

!   WPS: 8 questions × 7 visits × no. of comparisons -> lots!

!   Runtime (I/O access), disk space (cache)

!   Intermediary dataset:

•  Store the sampling for only one visit, one question, one comparison

•  The default visit, question and comparison are chosen arbitrarily

!   Can be used to check whether the sampling is identical on both sides

!   Once the sampling is checked, it is no longer saved


Step 2 and Step 3: Issues and Solutions

!   T* distribution:

•  No particular issues

•  Make sure formulas are computed in the same order on both sides: floating-point representation error

!   Confidence interval (finding the high t* percentile):

•  Simple trick: B = 9999

•  Otherwise, linear interpolation must be implemented: programmers' agreement on the method, then macrotize and validate
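The floating-point remark can be seen with three numbers. The sketch below is a generic illustration in Python, not taken from the validated programs: the same sum computed in two different orders gives two different floating-point results.

```python
import math

# The same three terms, added in a different order.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two results differ in the last bits.
identical = left_to_right == right_to_left

# They still agree within a small relative tolerance.
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
```

When two independent programs must match exactly, evaluating the formulas in the same order avoids this kind of discrepancy. Comparing within a tolerance is the usual alternative when that is not possible.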

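Percentile position: $(\alpha/2) \times (B+1)$ for the lower percentile and $(1-\alpha/2) \times (B+1)$ for the upper one, e.g. $(1 - 0.05/2) \times (10000+1) = 0.975 \times 10001 = 9750.975$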
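When the position is not a whole number, as with 9750.975, one possible convention is to interpolate linearly between the two neighbouring ordered values. The sketch below is only one such convention. The slides do not say which rule the programmers agreed on, and the function name `interpolated_percentile` is hypothetical.

```python
import math

def interpolated_percentile(t_star, prob):
    """Percentile at 1-based position prob * (B + 1), with linear interpolation."""
    ordered = sorted(t_star)
    n_boot = len(ordered)
    pos = prob * (n_boot + 1)
    k = math.floor(pos)      # ordered value just below the position
    frac = pos - k           # distance towards the next ordered value
    if k < 1:
        return ordered[0]
    if k >= n_boot:
        return ordered[-1]
    return ordered[k - 1] + frac * (ordered[k] - ordered[k - 1])

# Hypothetical t* values: with 1.0, 2.0, ..., 10000.0 the k-th ordered value is k.
t_star = [float(k) for k in range(1, 10001)]
upper = interpolated_percentile(t_star, 0.975)
lower = interpolated_percentile(t_star, 0.025)
```

With B = 9999 the positions become 250 and 9750, so the interpolation weight is zero up to rounding and the percentile is simply an ordered value.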

Bootstrap Simple Case:

Limitations of the Method

Bootstrap Bias

!   The simple case bootstrap works well if the t* distribution is symmetrical

•  In some cases, it is not: bootstrap bias

!   Solution: variance stabilization


Variance Stabilization

!  Not stabilized:


!  Stabilized:

Step 1: Smooth Function

!   Non-linear fit

•  PROC LOESS

!   Numeric integration

•  Macrotize and validate

•  Simpson rule, other possibilities

$g(x) = \int^{x} \frac{1}{s(u)}\,du$
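The numeric integration step can be sketched with the composite Simpson rule. In this illustration the smooth function s(u) is a hypothetical closed form, s(u) = u, standing in for the LOESS fit. The lower integration limit `x0` is also an assumption, since the slide only shows the upper limit. With these choices g(x) = ln(x / x0), which makes the result easy to check.

```python
import math

def simpson(f, a, b, n_intervals=100):
    """Composite Simpson rule for the integral of f over [a, b]."""
    if n_intervals % 2:
        n_intervals += 1          # Simpson needs an even number of intervals
    h = (b - a) / n_intervals
    total = f(a) + f(b)
    for i in range(1, n_intervals):
        weight = 4 if i % 2 else 2
        total += weight * f(a + i * h)
    return total * h / 3

def g_transform(x, s, x0):
    """g(x): integral of 1 / s(u) from x0 to x."""
    return simpson(lambda u: 1.0 / s(u), x0, x)

def s_hypothetical(u):
    # Stand-in for the fitted smooth function; not the authors' fit.
    return u

g_at_e = g_transform(math.e, s_hypothetical, x0=1.0)
```

Changing `x0` only shifts g by a constant, which cancels in differences such as g(a) − g(b).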

G-transformation: Look-up Table

!   Formula changes:

$t_b^* = g(\hat{\theta}_b^*) - g(\hat{\theta})$

$t_{obs} = g(\hat{\theta}) - g(\theta_0)$

$\left(\hat{\theta} - t^*_{(1-\alpha/2)},\; \hat{\theta} - t^*_{(\alpha/2)}\right)$

Inverse G-transformation: Look-up Table

!   Confidence intervals were computed in G space:

•  Back-transform to the bootstrap space
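A look-up table for g and its inverse can be sketched as below. The table is built on a regular grid and read by linear interpolation in both directions, which works because g is increasing whenever s(u) is positive. The grid bounds, the stand-in s(u) = u, the t* percentiles and all function names are hypothetical. The interval is assumed to be formed around g(θ̂) in G space before back-transforming, as in the usual variance-stabilized bootstrap-t.

```python
import bisect
import math

def build_g_table(s, x_min, x_max, n_points):
    """Tabulate g on a regular grid, integrating 1 / s(u) cell by cell."""
    h = (x_max - x_min) / (n_points - 1)
    xs = [x_min + i * h for i in range(n_points)]
    gs = [0.0]
    for a, b in zip(xs, xs[1:]):
        mid = 0.5 * (a + b)
        # One Simpson panel per grid cell.
        cell = (b - a) / 6.0 * (1.0 / s(a) + 4.0 / s(mid) + 1.0 / s(b))
        gs.append(gs[-1] + cell)
    return xs, gs

def lookup(value, keys, targets):
    """Linear interpolation in a table whose keys are increasing."""
    if value <= keys[0]:
        return targets[0]
    if value >= keys[-1]:
        return targets[-1]
    i = bisect.bisect_right(keys, value)   # keys[i - 1] <= value < keys[i]
    w = (value - keys[i - 1]) / (keys[i] - keys[i - 1])
    return targets[i - 1] + w * (targets[i] - targets[i - 1])

xs, gs = build_g_table(lambda u: u, x_min=1.0, x_max=10.0, n_points=901)

def g_forward(x):
    return lookup(x, xs, gs)

def g_inverse(y):
    # Same table, read in the other direction.
    return lookup(y, gs, xs)

# Hypothetical estimate and t* percentiles, for illustration only.
theta_hat, t_low, t_high = 4.0, -0.3, 0.4
lower = g_inverse(g_forward(theta_hat) - t_high)
upper = g_inverse(g_forward(theta_hat) - t_low)
```

Values outside the tabulated range are clamped to the table ends in this sketch, so the grid has to cover every value that is transformed.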


Simple Case vs. Stabilized Variance

!   Comparison of final outputs


Conclusion

!   Big data, small output

!   Double programming advantages

•  Difficult to write macro specifications directly

•  Macrotize the program afterward

!   Keep bootstrap modular

•  i.e., sampling, statistics, p-value

•  But also functions: interpolation, g-transform, inverse g-transform

!   Most points are applicable to other simulation methods


Questions?

Introduction

!   Two-group comparison of Health Economics and Patient-Reported Outcomes variables; short WPS presentation

!   Parametric tests: parametric assumptions not respected

•  Data transformation (not an arithmetic mean comparison)

!   Classic non-parametric tests (Mann-Whitney, Wilcoxon):

•  More concerned with distribution shape (goodness of fit)

!   Simulation-based techniques:

•  Permutation test

•  Bootstrap-t
