
Department of Statistics

9 February 2007, SSP Core Facility

Power (and Precision): Effective Research Design Planning for Grant Proposals & More

Walt Stroup, Ph.D.
Professor & Chair, Department of Statistics
University of Nebraska, Lincoln



Outline for Talk

I. What is “Power Analysis”? Why should I do it?
II. Essential Background
III. A Word about Software
IV. Decisions that Affect Power – several examples
V. Latest Thinking
VI. Final Thoughts



Power and Precision Defined

▪ Precision, a.k.a. “Margin of Error”
− In most cases, the standard error of the relevant estimate
▪ Power
− Prob { reject H0 given H0 false }
− Prob { research hypothesis statistically significant }
▪ Power analysis
− essentially, “If I do the study this way, power = ?”
▪ Sample size estimation
− How many observations are required to achieve a given power?



What’s involved in Power Analysis

▪ WHAT IT’S NOT:
− “Painting by numbers...”
▪ IF IT’S DONE RIGHT, power analysis should be
− a comprehensive conversation to plan the study
− a “dress rehearsal” for the statistical analysis once the data are collected



Why do a Power Analysis?

▪ For NIH Grant Proposal
− because it’s required
▪ For many other grant proposals
− because it gives you a competitive edge
▪ Other reasons
− practical: increases chance of success; reduces “we don’t have time to do it right, but lots of time to do it over” syndrome
− ethical



Ethical???

▪ Last Ph.D. in U.S. Senate
▪ Irritant to doctrinaire left and right
▪ Keynote address to 1997 American Stat. Assoc.

“... we can continue to make policy based on ‘data-free ideology’ or we can inform policy where possible by competent inquiry...”

late U.S. Senator Daniel Patrick Moynihan



Ethical

▪ Results of your study may affect policy
▪ Well-conceived research means
− better information
− greater chance of sound decisions
▪ Poorly-conceived research
− lost opportunity
− deprives policy-makers of information that might have been useful
− or worse: bad information misinforms or misleads public



What affects Power & Precision?

▪ A short statistics lesson
1. What goes into computing test statistics
2. What test statistics are supposed to tell us
3. A bit about the distribution of test statistics
4. Central and non-central t, F, and chi-square (mostly F)



What goes into a test statistic?

Research hypothesis – motivation for study
Assumed not true unless data show compelling evidence otherwise

Research hypothesis: HA ; opposite: H0

                     H0 true         HA true
Fail to reject H0    ☺               Type II error
Reject H0            Type I error    Power



What goes into a test statistic?

▪ Visualize using F
▪ But same basic principles for t, chi-square, etc.
▪ F is ratio of variation attributable to factor under study vs. variation attributable to noise
▪ Ingredients of F: N of obs, effect size, variance of noise (i.e. among obs)
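As a hedged illustration of how the three ingredients combine, take a balanced one-way layout (an assumption; the slide does not fix a design) with t treatments, n observations per treatment, treatment means \mu_i, and noise variance \sigma^2. These symbols are introduced here for illustration only. When H0 is false, F follows a non-central F distribution with non-centrality parameter

```latex
\phi \;=\; \frac{n \sum_{i=1}^{t} (\mu_i - \bar{\mu})^2}{\sigma^2},
\qquad
\bar{\mu} \;=\; \frac{1}{t} \sum_{i=1}^{t} \mu_i
```

More observations (n) or a larger effect (spread of the \mu_i) raise \phi, and more noise (\sigma^2) lowers it. Power increases with \phi.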



When H0 True – i.e. no trt effect



When H0 false (i.e. Research HA true)



What affects Power?




What should be in a conversation about Power?


▪ Effect size: what is the minimum that matters?
▪ Variance: how much “noise” in the response variable (range? distribution? count? pct?)
▪ Practical Constraints
▪ Design: same N can produce varying Power



About Software (part I)

▪ Canned Software
− lots of it
− Xiang and Zhou working on report
− “painting by numbers”
▪ Simulation
− most accurate; not constrained by canned scenarios
− you can see what will happen if you actually do this...
▪ “Exemplary data set” + modeling software
− nearly as accurate as simulation
− “dress rehearsal” for actual analysis
− MIXED, GLIMMIX, NLMIXED: if you can model it you can do power analysis



Design Decisions – Some Examples

▪ Main Idea: For the same amount of effort, or $$$, or # observations, power and precision can be quite different
▪ Power analysis objective: Work smarter, not harder
▪ Simple example – design of regression study
− From STAT 412 exercise



Treatment Design Exercise

▪ Class was asked to predict Bounce Height of basketball from Drop Height and to see if relationship changes depending on floor surface
▪ Decision: What drop heights to use???



Objectives and Operating Definitions

▪ Recall objective: does drop: bounce height relationship change with floor surface?

operating definition



Consequences of Drop Height Decisions

▪ Should we use fewer drop heights & more obs per drop height or vice versa?

table from Stat 412 Avery archive



Simulation

▪ CRD example: 3 treatments, 5 reps / treatment
▪ Suspected effect size: 6-10% relative to control, whose mean is known to be ~ 100
▪ Standard deviation: 10 considered “reasonable”
▪ Simulate 1000 experiments
▪ Reject H0: equal trt means 228 times
− power = 0.228 at alpha=0.05
▪ Ctl mean ranked correctly 820 times
▪ (intermediate mean ranked correctly 589 times)
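A simulation along these lines might be sketched as follows. This is not the code behind the slide. The treatment means 100, 94, and 90 are an assumption (chosen to sit within the 6-10% effect range), and the seed and the data set names sim, simtests, and reject are arbitrary, hypothetical choices.

```sas
/* Sketch only: simulate 1000 CRD experiments, 3 trts x 5 reps, SD = 10 */
/* Assumed means: control = 100, treatments = 94 and 90                 */
data sim;
 call streaminit(20070209);           /* arbitrary seed */
 do rep=1 to 1000;                    /* one simulated experiment per rep */
  do trt=1 to 3;
   mu=choosen(trt,100,94,90);         /* assumed true mean for this trt */
   do eu=1 to 5;
    y=rand('normal',mu,10);           /* observation = mean + noise */
    output;
   end;
  end;
 end;
run;

ods exclude all;                      /* suppress 1000 sets of printed output */
proc glimmix data=sim;
 by rep;
 class trt;
 model y=trt;
 ods output tests3=simtests;          /* one F test of trt per experiment */
run;
ods exclude none;

data reject;
 set simtests;
 reject=(probf<0.05);                 /* 1 if H0 rejected at alpha=0.05 */
run;

proc means data=reject mean;          /* mean of reject = estimated power */
 var reject;
run;
```

The estimated power changes from seed to seed, so an individual run should not be expected to reproduce 0.228 exactly.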



“Exemplary Data”

▪ Many software packages for power & sample size
− e.g. SAS PROC POWER
− for FIXED effect models only
▪ “Exemplary Data” more general
▪ Especially (but not only) when “Mixed Model Issues”
− random effects
− split-plot structure
− errors potentially correlated: longitudinal or spatial data
− any other non-standard model structure
▪ Methods use PROC MIXED or GLIMMIX
− adapted from Stroup (2002, JABES)
▪ Chapter 12, SAS for Mixed Models
− (Littell, et al., 2006)



“Exemplary Data” - Computing Power using SAS

➢ create data set like proposed design

➢ run PROC GLIMMIX (or MIXED) with variance fixed

➢ use GLIMMIX to compute the non-centrality parameter φ

➢ φ = (F computed by GLIMMIX) × rank(K) [or chi-sq with GLM]

➢ critical F (Fcrit) is value s.t. P{F[rank(K), υ, 0] > Fcrit} = α [or chi-square]

➢ Power = P{F[rank(K), υ, φ] > Fcrit}

➢ SAS functions can compute Fcrit & Power



Compute Power with GLIMMIX – CRD example

/* step 1 - create data set with same structure as proposed design
   use MU (expected mean) instead of observed Y_ij values */
/* this example shows power for 5, 10, and 15 e.u. per trt */
data crdpwrx1;
 input trt mu;
 do n=5 to 15 by 5;
  do eu=1 to n;
   output;
  end;
 end;
cards;
1 100
2 94
3 90
;



Compute Power with GLIMMIX – CRD example

/* step 2 - use PROC GLIMMIX to compute non-centrality parameters for ANOVA tests & contrasts
   ODS statements output them to new data sets */
proc sort data=crdpwrx1;
 by n;
run;

proc glimmix data=crdpwrx1;
 by n;
 class trt;
 model mu=trt;
 parms (100)/hold=1;
 contrast 'et1 v et2' trt 0 1 -1;
 contrast 'c vs et' trt 2 -1 -1;
 ods output tests3=b;
 ods output contrasts=c;
run;



/* step 3: combine ANOVA & contrast n-c parameter data sets
   use SAS functions PROBF and FINV to compute power */
data power;
 set b c;
 alpha=0.05;
 ncparm=numdf*fvalue;
 fcrit=finv(1-alpha,numdf,dendf,0);
 power=1-probf(fcrit,numdf,dendf,ncparm);
proc print;
run;

Obs  Effect  Label      DF  DenDF  alpha  nc       fcrit    power
1    trt                2   12     0.05   2.53333  3.88529  0.22361
2            et1 v et2  1   12     0.05   0.40000  4.74723  0.08980
3            c vs et    1   12     0.05   2.13333  4.74723  0.26978

Type III Tests of Fixed Effects

Effect  Num DF  Den DF  F Value  Pr > F
trt     2       12      1.27     0.3169

Contrasts

Label      Num DF  Den DF  F Value  Pr > F
et1 v et2  1       12      0.40     0.5390
c vs et    1       12      2.13     0.1698

Note close agreement of simulated power (0.228) and “exemplary data” power (0.224)
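As a cross-check, the non-centrality values in the printout for 5 e.u. per treatment can be reproduced by hand. This is a worked sketch using the standard balanced one-way formulas, with the exemplary means 100, 94, 90 and the variance of 100 held in the PARMS statement; the subscripted \phi labels are introduced here for illustration.

```latex
\begin{aligned}
\bar{\mu} &= \tfrac{1}{3}(100 + 94 + 90) = 94.667 \\
\phi_{\text{trt}} &= \frac{5\left[(100-94.667)^2 + (94-94.667)^2 + (90-94.667)^2\right]}{100}
  = \frac{5\,(50.667)}{100} = 2.5333 \\
\phi_{\text{et1 v et2}} &= \frac{(94-90)^2}{100\,(1^2+1^2)/5} = \frac{16}{40} = 0.40 \\
\phi_{\text{c vs et}} &= \frac{(2\cdot 100 - 94 - 90)^2}{100\,(2^2+1^2+1^2)/5} = \frac{256}{120} = 2.1333
\end{aligned}
```

Each value equals numdf × F Value from the GLIMMIX output (for example 2 × 1.27 ≈ 2.53 for trt), which is what the ncparm line in step 3 computes.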



More Advanced Example

▪ Plots in 8 x 3 grid
▪ Main variation along 8 “rows”
▪ 3 x 2 treatment design
▪ Alternative designs
− randomized complete block (4 blocks, size 6)
− incomplete block (8 blocks, size 3)
− split plot

▪ RCBD “easy” but ignores natural variation



Picture the 8 x 3 Grid

Gradient

e.g. 8 schools, gradient is “SES”, 3 classrooms each



SAS Programs to Compare 8 x 3 Design

Split-Plot

data a;
 input bloc trtmnt @@;
 do s_plot=1 to 3;
  input dose @@;
  mu=trtmnt*(0*(dose=1)+4*(dose=2)+8*(dose=3));
  output;
 end;
cards;
1 1 1 2 3
1 2 1 2 3
2 1 1 2 3
2 2 1 2 3
3 1 1 2 3
3 2 1 2 3
4 1 1 2 3
4 2 1 2 3
;

proc glimmix data=a noprofile;
 class bloc trtmnt dose;
 model mu=bloc trtmnt|dose;
 random trtmnt/subject=bloc;
 parms (4) (6) / hold=1,2;
 lsmeans trtmnt*dose / diff;
 contrast 'trt x lin' trtmnt*dose 1 0 -1 -1 0 1;
 ods output diffs=b;
 ods output contrasts=c;
run;



8 x 3 – Incomplete Block

data a;
 input bloc @@;
 do eu=1 to 3;
  input trtmnt dose @@;
  mu=trtmnt*(0*(dose=1)+4*(dose=2)+8*(dose=3));
  output;
 end;
cards;
1 1 1 1 2 1 3
2 1 1 1 2 2 2
3 1 1 1 3 2 3
4 1 1 2 1 2 2
5 1 2 1 3 2 2
6 1 2 2 1 2 3
7 1 3 2 1 2 3
8 2 1 2 2 2 3
;

proc glimmix data=a noprofile;
 class bloc trtmnt dose;
 model mu=trtmnt|dose;
 random intercept / subject=bloc;
 parms (4) (6) / hold=1,2;
 lsmeans trtmnt*dose / diff;
 contrast 'trt x lin'




8 x 3 Example - RCBD

data a;
 input trtmnt dose @@;
 do bloc=1 to 4;
  mu=trtmnt*(0*(dose=1)+4*(dose=2)+8*(dose=3));
  output;
 end;
cards;
1 1 1 2 1 3 2 1 2 2 2 3
;

proc glimmix data=a noprofile;
 class bloc trtmnt dose;
 model mu=bloc trtmnt|dose;
 parms (10) / hold=1;
 lsmeans trtmnt*dose / diff;
 contrast 'trt x lin'




How did designs compare?

▪ Suppose main objective is to compare regression over 3 dose levels: do they differ by treatment? (similar to basketball experiment)
▪ Operating definition is thus H0: dose regression coefficients equal
▪ Power for Randomized Block: 0.66
▪ Power for Incomplete Block: 0.85
▪ Power for Split-Plot: 0.85
▪ Same # observations – you can work smarter



But what if I don’t know Trt Effect Size or Variance?

▪ “How can I do a power analysis? If I knew the effect size and the variance I wouldn’t have to do the study.”
▪ What trt effect size is NOT: it is NOT the effect size you are going to observe
▪ It is somewhere between
− what current knowledge suggests is a reasonable expectation
− minimum difference that would be considered “important” or “meaningful”



And Variance??

▪ Know thy relevant background / Do thy homework
▪ Literature search: what have others working with similar subjects reported as variance?
▪ Pilot study
▪ Educated guess
− range you’d expect 95% of likely obs? divide it by 4
− most extreme values you can plausibly imagine? divide range by 6
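The two rules of thumb give a rough standard deviation (square it for a variance). They rest on the fact that, for roughly normal data, about 95% of observations fall within ±2 SD of the mean and nearly all fall within ±3 SD. A worked sketch with hypothetical numbers, where the ranges 80 to 120 and 70 to 130 are invented for illustration:

```latex
\begin{aligned}
\hat{\sigma} &\approx \frac{\text{95\% range}}{4} = \frac{120 - 80}{4} = 10,
  &\hat{\sigma}^2 &\approx 100 \\
\hat{\sigma} &\approx \frac{\text{extreme range}}{6} = \frac{130 - 70}{6} = 10,
  &\hat{\sigma}^2 &\approx 100
\end{aligned}
```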



Hierarchical Linear Models

▪ From Bovaird (10-27-2006) seminar
▪ 2 treatments
▪ 20 classrooms / trt
▪ 25 students / classroom
▪ 4 years
▪ reasonable ideas of classroom(trt), student(classroom*trt), within student variances as well as effect size
▪ Implement via exemplary data + GLIMMIX



Categorical Data?

▪ Example: Binary data
▪ “Standard” has success probability of 0.25
▪ “New & Improved” hope to increase to 0.30
▪ Have N subjects at each of L locations
▪ For sake of argument, suppose we have
− 900 subjects / location
− 10 locations



Power for GLMs

▪ 2 treatments
▪ P{favorable outcome}
▪ for trt 1 p=0.30; for trt 2 p=0.25
▪ power if n1=300; n2=600

/* exemplary data */
data a;
 input trt y n;
datalines;
1 90 300
2 150 600
;

proc glimmix;
 class trt;
 model y/n=trt / chisq;
 ods output tests3=pwr;
run;

data power;
 set pwr;
 alpha=0.05;
 ncparm=numdf*chisq;
 crit=cinv(1-alpha,numdf,0);
 power=1-probchi(crit,numdf,ncparm);
proc print;
run;



Power for GLMM

▪ Same trt and sample size per location as before
▪ 10 locations
▪ Var(Location)=0.25; Var(Trt*Loc)=0.125
▪ Variance Components: variation in log(OddsRatio)
▪ Power?

data a;
 input trt y n;
 do loc=1 to 10;
  output;
 end;
datalines;
1 90 300
2 150 600
;

proc glimmix data=a initglm;
 class trt loc;
 model y/n = trt / oddsratio;
 random intercept trt / subject=loc;
 random _residual_;
 parms (0.25) (0.125) (1) / hold=1,2,3;
 ods output tests3=pwr;
run;



GLMM Power Analysis Results

Obs  Effect  NumDF  DenDF  alpha  ncparm   fcrit    power
1    trt     1      9      0.05   2.29868  5.11736  0.27370

Gives you the power of the test of TRT effect on prob(favorable)

Odds Ratio Estimates

trt  _trt  Estimate  DF  95% Confidence Limits
1    2     1.286     9   0.884  1.871

Gives you expected Conf Limits for # Locations & N / Loc contemplated
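The two pieces of output are tied together, which a short hand calculation can confirm. This is a sketch: it backs an approximate standard error out of the printed confidence limits, using 2.262 as the 0.975 quantile of t with 9 df, so the last digits are subject to rounding.

```latex
\begin{aligned}
\text{OR} &= \frac{0.30/0.70}{0.25/0.75} = \frac{0.4286}{0.3333} = 1.286 \\
\widehat{\text{SE}}(\log \text{OR}) &\approx \frac{\log 1.871 - \log 0.884}{2 \times 2.262} \approx 0.166 \\
\left(\frac{\log 1.286}{0.166}\right)^{2} &\approx 2.30 \approx \texttt{ncparm}
\end{aligned}
```

So the expected odds ratio, its confidence limits, and the non-centrality parameter behind the power of 0.274 all come from the same exemplary-data fit.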



GLMM Power: Impact of Sample Size?

▪ N of subjects per trt per location?
▪ N of Locations?

Three cases
1. n=300/600, 10 loc
2. n=600/1200, 10 loc
3. n=300/600, 20 loc

data a;
 input trt y n;
 do loc=1 to 10;
  output;
 end;
datalines;
1 90 300
2 150 600
;





GLMM Power: Impact of Sample Size?

Recall, for 10 locations, N=300/600, CI for OddsRatio was (0.884, 1.871); Power was 0.274

For 10 locations, N=600 / 1200

Odds Ratio Estimates

trt  _trt  Estimate  DF  95% Confidence Limits
1    2     1.286     9   0.891  1.855

For 20 locations, N=300 / 600

Odds Ratio Estimates

trt  _trt  Estimate  DF  95% Confidence Limits
1    2     1.286     19  1.006  1.643

N alone has almost no impact



Recent developments

▪ Continue binary example
▪ Power analysis shows:

α-level    0.10  0.05  0.05  0.01  0.05  0.01
Power      0.80  0.80  0.90  0.80  0.95  0.90
Locations  27    38    46    53    57    68

what do you do?



More Information

▪ Consider studies directed toward improving success rate similar to that proposed in study
▪ Lit search yields 95 such studies
▪ 29 have reported statistically significant gains of p1-p2>0.05 (or, alternatively, significant odds ratios of [(30/70)/(25/75)]=1.28 or greater)
▪ If this holds, “prior” prob (desired effect size) is approx 0.3



An Intro Stat Result

real Pr{type I error} is more like 0.23 than 0.10!!!
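The 0.23 figure is consistent with a direct application of Bayes' rule. The sketch below assumes the scenario with α = 0.10 and power 0.80, a prior probability π = 0.3 that the desired effect size holds, and that H0 is true whenever it does not. The symbol π is introduced here for illustration.

```latex
\Pr\{H_0 \text{ true} \mid \text{reject } H_0\}
  = \frac{\alpha\,(1-\pi)}{\alpha\,(1-\pi) + \text{power}\cdot\pi}
  = \frac{0.10 \times 0.7}{0.10 \times 0.7 + 0.80 \times 0.3}
  = \frac{0.07}{0.31} \approx 0.23
```

Under these assumptions, roughly 23% of rejections would be false positives, even though the nominal type I error rate is 10%.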



Returning to All Scenarios

α-level              0.10  0.05  0.05  0.01  0.05  0.01
Power                0.80  0.80  0.90  0.80  0.95  0.90
Locations            27    38    46    53    57    68
Pr{DES | reject H0}  0.77  0.87  0.89  0.97  0.89  0.97

(DES = desired effect size)

NOTE dramatic impact of alpha-level when “prior” Pr { DES } is relatively low
POWER role increases as Pr { DES } increases
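The Pr{DES | reject H0} row can be reproduced with a short data step. This is a sketch under two stated assumptions: the prior Pr{DES} is taken as exactly 0.3, and H0 is treated as true whenever the desired effect size is absent. The data set name ppv and the variable names are hypothetical.

```sas
/* Sketch: Pr{DES | reject H0} by Bayes' rule for each scenario          */
/* Assumptions: prior Pr{DES} = 0.3; H0 true whenever DES does not hold  */
data ppv;
 prior=0.3;
 input alpha power locations;
 /* true rejections divided by all rejections */
 pr_des_given_reject=power*prior/(power*prior+alpha*(1-prior));
 format pr_des_given_reject 4.2;
datalines;
0.10 0.80 27
0.05 0.80 38
0.05 0.90 46
0.01 0.80 53
0.05 0.95 57
0.01 0.90 68
;
run;

proc print data=ppv noobs;
run;
```

Changing the value of prior shows how strongly the result depends on how plausible the desired effect size was before the study.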



Closing Comments

▪ In case it’s not obvious
− I’m not a fan of “painting by numbers”
− Role of power analysis misunderstood & underappreciated
▪ MOST of ALL it is an opportunity to explore and rehearse study design & planned analysis
▪ Engage statistician as a participating member of research team
▪ Give it the TIME it REQUIRES


Thanks

... for coming
