Stephanie, living with rheumatoid arthritis
Bootstrap Analysis Double-Independent Programming:
Issues and Solutions. Presented by Nils Pénard
UCB
Outline
! Work Productivity Survey (WPS)
! Presentation of the simple case: no variance stabilization
• Description of the method
• Double-programming issues and solutions
• Limitations: bias in bootstrap
! Presentation of variance stabilization case
• Description of the method
• Double-programming issues and solutions
Simple Case
! Description of the method:
• Step 1: re-sampling
• Step 2: Bootstrap distribution
• Step 3: Approximate a p-value and a C.I.
! Issues encountered (linked to boot itself/linked to validation)
Simple Case Bootstrap: The Idea
! Approximate the distribution of a statistical estimator using the observed sample
! Step 1: Random re-sampling of observed sample, B times
! Step 2: Compute the stats from these B samples: distribution of the statistic under the null hypothesis
! Step 3: Use this distribution to approximate a p-value and a C.I.
Step 1 - Re-sampling: Implementation
! Gleason (1988): algorithm for re-sampling
• Stratified
• With replacement
• Balanced
[Figure: the observed subjects S1 to S4 of Group y are re-sampled in each run: Run 1, Run 2, Run 3, ..., Run 10000.]
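The three properties listed for Step 1 (stratified, with replacement, balanced) can be illustrated with a small sketch. It uses a simple "pool the copies, then permute" approach to obtain balanced re-samples; this is an assumption chosen for illustration and is not claimed to be Gleason's algorithm or the SAS implementation used in the presentation. The function names, subject labels, and seed value are hypothetical.

```python
import random

def balanced_resample(ids, n_runs, rng):
    """Balanced bootstrap: each id appears exactly n_runs times over all runs."""
    pool = list(ids) * n_runs   # n_runs copies of the observed sample
    rng.shuffle(pool)           # random permutation of the pooled copies
    n = len(ids)
    # Cut the permuted pool into n_runs re-samples of the original size.
    return [pool[i * n:(i + 1) * n] for i in range(n_runs)]

def stratified_balanced_resample(groups, n_runs, seed):
    """Re-sample each group (stratum) separately with one seeded generator."""
    rng = random.Random(seed)
    return {name: balanced_resample(ids, n_runs, rng)
            for name, ids in groups.items()}

# Hypothetical data: two treatment groups of four subjects each.
groups = {"y": ["S1", "S2", "S3", "S4"], "z": ["S5", "S6", "S7", "S8"]}
runs = stratified_balanced_resample(groups, n_runs=3, seed=2024)
```

Within a run a subject can appear several times (sampling with replacement), while over all runs every subject is drawn the same number of times (balance).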
Step 2 – Distribution of Bootstrap Statistics
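[Figure: in each run b (Run 2, ..., Run b, ..., Run 10000), subjects are re-sampled with replacement within Group y (S1 to S4) and within Group z (S5 to S8), for example S2, S3, S2, S4 and S5, S7, S8, S8.]
$$t_b^* = \frac{\hat{\theta}_b^* - \hat{\theta}}{SE_b^*}$$
$$SE_b^* = \sqrt{\frac{SD_{y,b}^{*2}}{n_y} + \frac{SD_{z,b}^{*2}}{n_z}}$$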
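The studentized statistic of Step 2 can be sketched as follows for one re-sample. The sketch assumes that the estimator is the difference in group means; the slides do not state which estimator is used, so this is an illustrative choice, and the function name and data are hypothetical.

```python
import math
import statistics

def t_star(boot_y, boot_z, theta_hat):
    """Studentized bootstrap statistic for one run b.

    Assumption: the estimator is the difference in group means.
    theta_hat is the estimate computed on the observed sample.
    """
    theta_b = statistics.mean(boot_y) - statistics.mean(boot_z)
    # Standard error of the re-sample: sqrt(SD_y^2 / n_y + SD_z^2 / n_z)
    se_b = math.sqrt(statistics.variance(boot_y) / len(boot_y)
                     + statistics.variance(boot_z) / len(boot_z))
    if se_b == 0:
        raise ValueError("degenerate re-sample: zero standard error")
    return (theta_b - theta_hat) / se_b

# Hypothetical re-samples for a single run.
t_centered = t_star([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], theta_hat=-1.0)
t_shifted = t_star([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], theta_hat=0.0)
```

Repeating this for every run gives the B values of t* that Step 3 uses.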
Step 3 - Confidence Interval
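$$\left(\hat{\theta} - t^*_{(1-\alpha/2)}\,SE(\hat{\theta}),\ \hat{\theta} - t^*_{(\alpha/2)}\,SE(\hat{\theta})\right)$$
[Figure: bootstrap t* distribution, with t*_(α/2) marked at the 250th ordered value and t*_(1−α/2) at the 9750th ordered value.]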
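A minimal sketch of the confidence interval computation, assuming the percentile ranks are taken as (α/2)×(B+1) and (1−α/2)×(B+1) and that B is chosen so both ranks are integers (B=9999 gives the 250th and 9750th ordered values). The function names and the example data are hypothetical.

```python
def ordered_value(sorted_values, rank_float):
    """Return the value at a 1-based rank that must be (numerically) an integer."""
    rank = round(rank_float)
    if abs(rank_float - rank) > 1e-9:
        raise ValueError("non-integer rank: interpolation would be needed")
    return sorted_values[rank - 1]

def bootstrap_t_ci(theta_hat, se_hat, t_stars, alpha=0.05):
    """Bootstrap-t interval: (theta - t_high * SE, theta - t_low * SE)."""
    ordered = sorted(t_stars)
    b = len(ordered)
    t_low = ordered_value(ordered, (alpha / 2) * (b + 1))
    t_high = ordered_value(ordered, (1 - alpha / 2) * (b + 1))
    return theta_hat - t_high * se_hat, theta_hat - t_low * se_hat

# Hypothetical t* values for B = 9999 runs: evenly spaced, for illustration only.
t_stars = [(k - 5000) / 1000 for k in range(1, 10000)]
ci = bootstrap_t_ci(theta_hat=10.0, se_hat=2.0, t_stars=t_stars)
```

Note that the upper t* percentile produces the lower confidence limit, and the lower percentile the upper limit.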
Step 3: P-value
! Formula: (#{x} -> no. of times x occurs)
! Two-sided test (absolute value)
! From the t* distribution:
• How many random t* are at least as extreme as tobs?
• Large |t*| should be rare
• If there is an effect, |tobs| should be larger than most |t*|
$$p_{boot} = \frac{\#\{|t_b^*| \ge |t_{obs}|\}}{B}$$
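The two-sided p-value formula counts the bootstrap statistics that are at least as extreme as the observed one and divides by B. A minimal sketch, with a hypothetical function name and made-up values:

```python
def bootstrap_p_value(t_obs, t_stars):
    """Two-sided bootstrap p-value: #{|t*_b| >= |t_obs|} / B."""
    count = sum(1 for t in t_stars if abs(t) >= abs(t_obs))
    return count / len(t_stars)

# Hypothetical values: three of the six |t*| are at least |t_obs| = 2.
p_boot = bootstrap_p_value(2.0, [-3.0, -1.0, 0.0, 1.0, 2.0, 3.5])
```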
Issues Encountered
Step 1 - Re-sampling: Issues
! Three issues linked to validation aspect:
! Computationally intensive:
• Example: 80 patients in Group y and 66 patients in Group z, B=10000 replications
• (80+66)*10000=1,460,000 obs.
• Solution: use built-in SAS procedures as much as possible
! Random sampling must be the same on both sides:
• Solution: use of SEED
! Sampling validation: space issue, I/O access issue
Step 1 - Re-sampling: Issues
! Computation intensive and random sampling
Step 1 - Re-sampling: Issues
! Large amount of data:
• (100+230)*10000=3,300,000 obs.
• WPS: 8 questions * 7 visits * no. of comparisons -> lots!
• Runtime (I/O access), disk space (cache)
! Intermediary dataset:
• Store the sampling for only one visit, one question, one comparison
• The default visit, question, and comparison are chosen arbitrarily
• Can be used to check that the sampling is identical on both sides
• Once the sampling is checked, it is no longer saved
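The idea of checking that both sides draw the same sampling from a shared seed can be sketched as below. The presentation does this in SAS with a stored intermediary dataset; the Python generator, the function names, and the use of a hash as a compact fingerprint are assumptions made for illustration only.

```python
import hashlib
import random

def draw_indices(n_subjects, n_runs, seed):
    """Draw n_runs re-samples (with replacement) of subject indices from a seed."""
    rng = random.Random(seed)
    return [[rng.randrange(n_subjects) for _ in range(n_subjects)]
            for _ in range(n_runs)]

def fingerprint(samples):
    """Compact digest of a sampling, so two sides can compare without big files."""
    digest = hashlib.sha256()
    for run in samples:
        digest.update(",".join(map(str, run)).encode("ascii"))
        digest.update(b";")
    return digest.hexdigest()

# Hypothetical check for one visit, one question, one comparison.
production = fingerprint(draw_indices(n_subjects=80, n_runs=100, seed=12345))
validation = fingerprint(draw_indices(n_subjects=80, n_runs=100, seed=12345))
same_sampling = production == validation
```

A shared seed only guarantees identical sampling if both programs call the random generator in the same order, which is why the check on one default case is still worthwhile.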
Step 2 and Step 3: Issues and Solutions
! T* distribution:
• No particular issues
• Make sure formulas are evaluated in the same order on both sides (floating-point representation error)
! Confidence interval (finding the high t* percentile):
• Simple trick: B=9999, so that the percentile ranks (α/2)×(B+1) and (1−α/2)×(B+1) are integers
• Otherwise, linear interpolation must be implemented: the programmers must agree on the method
• Macrotize and validate
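Why the order of operations matters for double programming can be seen with a three-term sum: floating-point addition is not associative, so two mathematically identical formulas can differ in the last digits. The sketch below is in Python rather than SAS, but the same effect applies to any IEEE 754 double-precision arithmetic.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # evaluates to 0.6000000000000001
right_to_left = a + (b + c)   # evaluates to 0.6

exactly_equal = left_to_right == right_to_left                # False
equal_within_tolerance = math.isclose(left_to_right, right_to_left)  # True
```

Either both sides agree on the order of evaluation, or the comparison of results is done with a tolerance.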
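Percentile ranks: $(\alpha/2)\times(B+1)$ and $(1-\alpha/2)\times(B+1)$
Example with $\alpha = 0.05$ and $B = 10000$: $(1-0.05/2)\times(10000+1) = 0.975\times 10001 = 9750.975$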
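When the rank is not an integer, as with B=10000, the percentile has to be interpolated between two neighbouring ordered values. The sketch below uses plain linear interpolation on the rank p×(B+1); the exact interpolation rule agreed between the programmers is not given in the slides, so this is one possible convention, with a hypothetical function name.

```python
import math

def interpolated_percentile(values, p):
    """Percentile at 1-based fractional rank p * (B + 1), linearly interpolated."""
    ordered = sorted(values)
    b = len(ordered)
    rank = p * (b + 1)
    lower = math.floor(rank)
    if lower < 1 or rank > b:
        raise ValueError("rank outside the range of ordered values")
    if lower == b:
        return ordered[-1]
    frac = rank - lower
    # Move a fraction of the way from the lower to the upper neighbour.
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

# Hypothetical ordered statistics 1, 2, ..., 10000 (B = 10000).
values = list(range(1, 10001))
upper = interpolated_percentile(values, 0.975)   # rank 9750.975
lower_p = interpolated_percentile(values, 0.025) # rank 250.025
```

With these evenly spaced values the interpolated percentile equals the rank itself, which makes the convention easy to check by hand.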
Bootstrap Simple Case:
Limitations of the Method
Bootstrap Bias
! The simple case bootstrap works well if:
• The t* distribution is symmetrical
• In some cases, it is not: bootstrap bias
! Solution: variance stabilization
Variance Stabilization
! Not stabilized:
! Stabilized:
Step 1 : Smooth Function
! Non-linear fit
• PROC LOESS
! Numeric integration
• Macrotize and validate
• Simpson rule, other possibilities
$$g(x) = \int^{x} \frac{1}{s(u)}\,du$$
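The numeric integration step can be sketched with the composite Simpson rule. In the presentation the smooth function s comes from a non-linear fit (PROC LOESS); here s is replaced by a known function, the square root, purely so the result can be checked against a closed form. The function names and the lower integration limit are assumptions.

```python
import math

def simpson(f, a, b, n=200):
    """Composite Simpson rule on [a, b] with an even number n of sub-intervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def g_transform(x, s, lower):
    """g(x) = integral from lower to x of 1 / s(u) du."""
    return simpson(lambda u: 1.0 / s(u), lower, x)

# Illustration only: with s(u) = sqrt(u), g(x) = 2 * (sqrt(x) - sqrt(lower)).
g_at_4 = g_transform(4.0, math.sqrt, lower=1.0)
```

Evaluating g on a grid of x values gives the look-up table used in the following steps.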
G-transformation: Look-up Table
! Formula changes:
$$t_b^* = g(\hat{\theta}_b^*) - g(\hat{\theta})$$
$$t_{obs} = g(\hat{\theta}) - g(\theta_0)$$
$$\left(\hat{\theta} - t^*_{(1-\alpha/2)},\ \hat{\theta} - t^*_{(\alpha/2)}\right)$$
Inverse G-transformation: Look-up Table
! Confidence intervals were computed in G space:
• Back-transform to the bootstrap space
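Back-transforming a limit computed in G space can be done by reading the look-up table in reverse with linear interpolation. The sketch below assumes g is increasing (which holds when s is positive) and, for checkability, uses the closed form g(x) = 2·sqrt(x) in place of a table built by numeric integration. All names and grid settings are hypothetical.

```python
import bisect
import math

def build_table(g, lo, hi, n_points):
    """Tabulate g on an evenly spaced grid of x values."""
    xs = [lo + i * (hi - lo) / (n_points - 1) for i in range(n_points)]
    return xs, [g(x) for x in xs]

def inverse_lookup(value, xs, gs):
    """Find x with g(x) = value by linear interpolation in an increasing table."""
    if not gs[0] <= value <= gs[-1]:
        raise ValueError("value outside the range of the look-up table")
    j = bisect.bisect_left(gs, value)
    if j == 0:
        return xs[0]
    x0, x1, g0, g1 = xs[j - 1], xs[j], gs[j - 1], gs[j]
    return x0 + (value - g0) * (x1 - x0) / (g1 - g0)

# Illustration only: g(x) = 2 * sqrt(x), so g(4) = 4 and the inverse of 4 is 4.
xs, gs = build_table(lambda x: 2.0 * math.sqrt(x), lo=1.0, hi=9.0, n_points=801)
x_back = inverse_lookup(4.0, xs, gs)
```

The accuracy of the back-transformation depends on the grid spacing, so the grid resolution is something both programmers would need to agree on.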
Simple Case vs. Stabilized Variance
! Comparison of final outputs
Conclusion
! Big data, small output
! Double programming advantages
• Difficult to write macro specifications directly
• Macrotize the program afterward
! Keep bootstrap modular
• i.e., sampling, statistics, p-value
• But also functions: interpolation, g-transform, inverse g-transform
! Most points are applicable to other simulation methods
Questions?
Introduction
! Two-group comparison of Health Economics and Patient-Reported Outcomes variables; short WPS presentation
! Parametric tests: parametric assumptions not respected
• Data transformation (not an arithmetic mean comparison)
! Classic non-parametric tests (Mann-Whitney, Wilcoxon):
• More concerned with distribution shape (goodness of fit)
! Simulation based techniques:
• Permutation test
• Bootstrap-t